In [2]:
"""

Pre-Processing --->

This code is performing text preprocessing, a common step in Natural Language Processing (NLP) to prepare text data for further analysis.
Let's break down the code in detail and introduce some questions you may be asked in an oral examination, along with answers.

Code Explanation

Installation and Importing NLTK Library

pip install nltk
import nltk
nltk.download('punkt')

pip install nltk: Installs the Natural Language Toolkit (NLTK), a library for working with human language data in Python.
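nltk.download('punkt'): Downloads the Punkt tokenizer models, required for tokenizing text into sentences and words. Newer NLTK releases may look for the 'punkt_tab' resource instead, so if word_tokenize raises a LookupError, download that resource as well.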

Importing Modules and Tools for Text Processing

from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize, sent_tokenize

stopwords: Provides lists of very common words (here the English list), such as "is", "and", and "the", that usually carry little meaning on their own.
PorterStemmer: A rule-based stemming algorithm that strips suffixes to reduce words to a stem (e.g., "running" to "run"); the stem is not always a dictionary word.
word_tokenize and sent_tokenize: Tokenizers for breaking text into words and sentences, respectively.

Sample Text Initialization

random_text = ''' '''

This variable is expected to contain text data. You would replace the placeholder with actual text data for analysis.

Tokenizing Text into Words

words = word_tokenize(random_text)

word_tokenize converts random_text into a list of individual words. Tokenization is essential for breaking down text data into manageable units for further processing.

Downloading and Initializing Stop Words

nltk.download('stopwords')
stop_words = set(stopwords.words("english"))

stopwords.words("english"): Fetches English stop words to filter out non-informative words from the text.

Preprocessing Each Word

stemmer = PorterStemmer()
preprocessed_words = []

for word in words:
    word = word.lower()
    word = word.strip('.,?!-()[]{}"\'')
    # Skip tokens that were pure punctuation (now empty) and stop words
    if word and word not in stop_words:
        word = stemmer.stem(word)
        preprocessed_words.append(word)

stemmer = PorterStemmer(): Creates the stemmer instance used inside the loop.
word.lower(): Converts each word to lowercase, ensuring uniformity.
word.strip('.,?!-()[]{}"\''): Removes leading and trailing punctuation from the word; punctuation inside a word (e.g., the hyphen in "well-known") is left untouched.
if word and word not in stop_words: Skips empty strings left over from punctuation-only tokens and filters out stop words.
stemmer.stem(word): Applies the Porter stemming algorithm to reduce words to their stems.
preprocessed_words.append(word): Adds the processed word to the preprocessed_words list.

Combining Preprocessed Words into a Text

preprocessed_text = " ".join(preprocessed_words)

Joins all the preprocessed words back into a single string, representing the cleaned text.

Printing the Results

print("Original Text:")
print(random_text)
print("\nPreprocessed Text:")
print(preprocessed_text)

Prints both the original and preprocessed text for comparison.
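
The whole pipeline can be tried without NLTK using the following minimal sketch. It relies only on the standard library, so the stop-word set (STOP_WORDS) and the suffix stripper (toy_stem) are small hypothetical stand-ins, not NLTK's stop-word list or the Porter algorithm.

```python
import re

# Hypothetical, deliberately tiny stop-word set; NLTK's English list is much longer.
STOP_WORDS = {"is", "the", "and", "a", "an", "of", "to", "in", "it", "are"}


def toy_stem(word):
    # Naive suffix stripping for illustration only; this is NOT the Porter algorithm.
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    # Lowercase, then keep runs of letters and digits as tokens (punctuation is dropped).
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    # Remove stop words, then stem what is left.
    return [toy_stem(token) for token in tokens if token not in STOP_WORDS]


sample = "The cats are jumping over the fences, and it rained."
print(preprocess(sample))  # ['cat', 'jump', 'over', 'fence', 'rain']
```

The order of steps mirrors the code above: normalize case, tokenize, drop stop words, stem.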


Key Concepts Behind the Code
Tokenization: Dividing text into smaller units (tokens) like words or sentences. This is crucial for further analysis and helps algorithms focus on individual words.

Stop Words: Commonly used words that add little meaning in text analysis. Filtering these out improves the efficiency of NLP algorithms.

Stemming: Reduces words to their root forms (e.g., "working" becomes "work") to standardize words and reduce redundancy in analysis.


Potential Oral Examination Questions and Answers
Q: What is tokenization, and why is it important in NLP?
A: Tokenization is the process of splitting text into smaller components like words or sentences. It is essential because it allows NLP algorithms to focus on individual units (tokens), making it easier to analyze and process language data accurately.

Q: Why do we remove stop words from the text?
A: Stop words are common words that typically don’t add significant meaning to text data, like "is", "the", and "and". Removing them reduces noise in the data and helps NLP models focus on more informative words.

Q: What is the purpose of stemming, and how does the Porter Stemmer work?
A: Stemming reduces words to a common base form, which standardizes variations of a word (e.g., "jumps" and "jumping" both become "jump"). The Porter Stemmer applies a fixed sequence of suffix-stripping rules, in several phases, to reduce a word to its stem.

Q: Why do we lowercase and remove punctuation in the text preprocessing step?
A: Lowercasing ensures that different cases of a word (e.g., "Apple" vs. "apple") are treated the same, and removing punctuation helps in cleaning up the text, making it easier to process without irrelevant characters.

Q: Can you explain the significance of the nltk.download('punkt') line?
A: This line downloads the Punkt tokenizer models, which are required for breaking down text into words and sentences. Without this, the word and sentence tokenizers in NLTK wouldn’t function.

Q: What is the difference between stemming and lemmatization?
A: Stemming chops off suffixes with fixed rules, so the result is often not a dictionary word (e.g., "studies" becomes "studi"). Lemmatization uses a vocabulary and the word's part of speech to return its dictionary form (lemma), producing valid words (e.g., "better" becomes "good" when treated as an adjective).

Q: How would you modify this code to perform lemmatization instead of stemming?
A: To perform lemmatization, we would import WordNetLemmatizer from nltk.stem, initialize it, and replace stemmer.stem(word) with lemmatizer.lemmatize(word). We would also need to download the WordNet data using nltk.download('wordnet'). Note that lemmatize treats every word as a noun unless a part-of-speech tag is passed (e.g., pos='v' for verbs).
"""

In [None]:
"""

Email Spam Detection --->

This code is an implementation of a spam detection classifier using machine learning. It leverages the Support Vector Machine (SVM) algorithm to classify emails as either "Spam" or "Not Spam."
Here’s a breakdown of the code, along with potential questions and answers that could be asked during an oral examination.

Code Explanation
Importing Libraries

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, classification_report
import re

pandas: Used for data manipulation and reading the dataset.
TfidfVectorizer: Converts text data into numerical features using Term Frequency-Inverse Document Frequency (TF-IDF).
train_test_split: Splits the dataset into training and testing sets.
SVC: The Support Vector Classifier from the sklearn library used for training the spam detection model.
accuracy_score and classification_report: Metrics to evaluate the performance of the classifier.
re: Provides support for regular expressions, used for cleaning email subjects.

Loading the Dataset

data = pd.read_csv('emails.csv')
This loads the dataset from a CSV file, where each row represents an email with text content and a label indicating whether it is spam (1) or not spam (0).

Splitting Data into Features and Labels

X = data['text']
y = data['spam']
X contains the email text content, while y contains the target labels (spam or not spam).

Converting Text Data to Numerical Features with TF-IDF

vectorizer = TfidfVectorizer()
X_vectorized = vectorizer.fit_transform(X)
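TfidfVectorizer converts the email text into numerical form by calculating the importance of each word in each email relative to the whole dataset.
This transformation allows us to use text data as input for the machine learning model. Note that the vectorizer is fitted on the full dataset before the train/test split performed in the next step, so document frequencies from the test emails influence the features; fitting it on the training split only (and calling transform on the test split) gives a stricter evaluation.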
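
To make the weighting concrete, here is a minimal pure-Python sketch of TF-IDF on three made-up messages. It assumes the smoothed IDF formula ln((1 + n) / (1 + df)) + 1 followed by L2 normalization, which is how scikit-learn documents TfidfVectorizer's default settings; the helper name tfidf and the sample texts are hypothetical.

```python
import math
from collections import Counter


def tfidf(docs):
    # docs: list of token lists. Returns one {term: weight} dict per document.
    n = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    # Smoothed IDF (assumed to match scikit-learn's default): ln((1 + n) / (1 + df)) + 1
    idf = {term: math.log((1 + n) / (1 + count)) + 1 for term, count in df.items()}
    rows = []
    for doc in docs:
        tf = Counter(doc)
        raw = {term: count * idf[term] for term, count in tf.items()}
        # L2-normalize each document vector; fall back to 1.0 for an empty document.
        norm = math.sqrt(sum(w * w for w in raw.values())) or 1.0
        rows.append({term: w / norm for term, w in raw.items()})
    return rows


docs = [
    "win free prize now".split(),
    "meeting agenda for now".split(),
    "free lunch at the meeting now".split(),
]
weights = tfidf(docs)
for term, weight in sorted(weights[0].items(), key=lambda item: -item[1]):
    print(term, round(weight, 3))
```

In the first message, "now" occurs in every document and gets the lowest weight, while "win" and "prize" occur only there and get the highest.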

Splitting the Data into Training and Testing Sets

X_train, X_test, y_train, y_test = train_test_split(X_vectorized, y, test_size=0.2, random_state=42)
This splits the vectorized data into training and testing sets, using 80% for training and 20% for testing. random_state=42 ensures reproducibility of the split.

Creating and Training the SVM Classifier

svm_classifier = SVC(kernel='linear')
svm_classifier.fit(X_train, y_train)
SVC(kernel='linear'): Creates a Support Vector Machine (SVM) classifier with a linear kernel, a common choice for high-dimensional, sparse TF-IDF features.
svm_classifier.fit(X_train, y_train): Trains the classifier on the training data.

Making Predictions and Evaluating the Model

y_pred = svm_classifier.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
y_pred: Contains the predictions made by the classifier on the test set.
accuracy_score(y_test, y_pred): Computes the accuracy of the classifier.

Displaying the Classification Report

class_report = classification_report(y_test, y_pred, target_names=['Not Spam', 'Spam'])
print("Classification Report:\n", class_report)
The classification report shows the precision, recall, and F1-score for each class (Not Spam and Spam), providing a detailed evaluation of the model’s performance.
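
The three metrics in the report can be computed by hand from the predictions. The sketch below does this for the "Spam" class on made-up labels; the function name precision_recall_f1 and the sample data are hypothetical, and the result should agree with the "Spam" row of classification_report for the same inputs.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)  # spam caught
    fp = sum(1 for t, p in pairs if t != positive and p == positive)  # ham flagged as spam
    fn = sum(1 for t, p in pairs if t == positive and p != positive)  # spam missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
print(precision_recall_f1(y_true, y_pred))  # (0.75, 0.75, 0.75)
```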

Defining a Function to Classify Email Subjects

def classify_email(subject):
    cleaned_subject = re.sub(r'^Subject:\s*', '', subject)
    vectorized_subject = vectorizer.transform([cleaned_subject])
    prediction = svm_classifier.predict(vectorized_subject)
    return "Spam" if prediction[0] == 1 else "Not Spam"
This function takes an email subject as input, removes any "Subject:" prefix using regular expressions, transforms it using the TF-IDF vectorizer, and then classifies it using the SVM model.
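
The cleaning step can be checked in isolation. This small sketch wraps the same regular expression in a hypothetical helper, clean_subject, and shows two cases it does not touch: a lowercase prefix (the pattern is case-sensitive) and a prefix that is not at the start of the string.

```python
import re


def clean_subject(subject):
    # Same pattern as in classify_email: strip a leading "Subject:" and any spaces after it.
    return re.sub(r'^Subject:\s*', '', subject)


print(clean_subject("Subject: You won a prize"))   # You won a prize
print(clean_subject("subject: lowercase prefix"))  # unchanged, the match is case-sensitive
print(clean_subject("Re: Subject: nested"))        # unchanged, the prefix is not at the start
```

Passing flags=re.IGNORECASE to re.sub would also remove the lowercase variant, if that behavior is wanted.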

Getting User Input and Classifying the Email Subject

user_input = input("Enter an email subject: ")
classification_result = classify_email(user_input)
print("Classification:", classification_result)
The user is prompted to enter an email subject, which is then classified as "Spam" or "Not Spam."

Key Concepts Behind the Code
TF-IDF Vectorization: A technique that transforms text data into numerical form by weighting each word by how often it occurs in a document and how rare it is across the dataset. Words that appear in few emails receive higher weights, and words that appear in most emails receive lower weights.

Support Vector Machine (SVM): A supervised learning algorithm used for classification. It finds the hyperplane that separates the two classes with the largest margin; with a linear kernel, that boundary is linear in the TF-IDF feature space.

Classification Metrics: Accuracy, precision, recall, and F1-score are used to evaluate the model's performance on unseen data.


Potential Oral Examination Questions and Answers
Q: What is TF-IDF, and why is it used in this code?
A: TF-IDF (Term Frequency-Inverse Document Frequency) is a technique that transforms text into numerical data by giving higher weights to words that are frequent in a document but rare across the dataset, and lower weights to words that are common everywhere. In this code, it is used to convert email text into a format suitable for input into the SVM model.

Q: Why do we use train-test split in machine learning?
A: Train-test split divides the dataset into training and testing subsets. The training set is used to teach the model, while the test set evaluates its performance on unseen data, helping to detect overfitting and to estimate how well the model generalizes.

Q: What is the purpose of using an SVM with a linear kernel in this code?
A: Spam vs. not spam is a binary classification problem, and TF-IDF features are high-dimensional and sparse, a setting where a linear decision boundary often separates the classes well. A linear kernel is also cheaper to train than non-linear kernels.

Q: Explain precision, recall, and F1-score. Why are these metrics important?
A: Precision is the percentage of correctly identified spam emails out of all predicted spam. Recall is the percentage of actual spam emails correctly identified. The F1-score is the harmonic mean of precision and recall, balancing both. These metrics help us understand the model’s effectiveness beyond simple accuracy.

Q: Why is it important to clean the email subject before classification?
A: Cleaning the email subject removes unnecessary text, like the "Subject:" prefix, which could introduce irrelevant data. This helps the model focus on the meaningful content of the subject.

Q: What would you modify to use this code with a different dataset?
A: To use a different dataset, replace emails.csv with the new dataset file path, ensuring that the new file has similar column names ("text" for email content and "spam" for labels) or adjust the code to match the new column names.

Q: How would you handle a case where you want to classify an email body instead of the subject?
A: Replace the subject input in the classify_email function with the entire email body. Additionally, you may need to apply further preprocessing steps to clean and standardize the email body text.

Q: Can you explain the role of the regular expression used in the classify_email function?
A: The regular expression re.sub(r'^Subject:\s*', '', subject) removes any "Subject:" prefix from the input, ensuring that the classification focuses solely on the content of the email subject.

"""

In [None]:
"""
Agglomerative Clustering --->

This code applies hierarchical clustering to a credit card dataset to identify distinct customer segments based on selected features.
It uses Agglomerative Clustering and visualizes the clustering results with both a dendrogram and a scatter plot.

Code Explanation
Importing Libraries

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import AgglomerativeClustering
from scipy.cluster.hierarchy import dendrogram, linkage
numpy: Useful for numerical operations, though not used directly in this code.
pandas: Loads and handles the dataset.
matplotlib.pyplot: Used to create plots for data visualization.
StandardScaler: Scales features to have zero mean and unit variance, which matters for distance-based clustering.
AgglomerativeClustering: Performs hierarchical clustering on standardized data.
dendrogram and linkage: Create and display the hierarchical structure of clusters.

Loading the Dataset

data = pd.read_csv('BankChurners.csv')
This loads a dataset of credit card customers, stored as BankChurners.csv.

Selecting Relevant Features

X = data[['Customer_Age', 'Dependent_count', 'Months_on_book', 'Total_Relationship_Count', 'Months_Inactive_12_mon']]
Five features are selected to represent each customer, which are likely related to their banking behavior. These features will be used to form clusters.

Standardizing the Features

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
StandardScaler standardizes the features to ensure that they contribute equally to the clustering process. This avoids bias toward features with larger ranges.
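
The effect of standardization is easy to see on two columns with different ranges. This sketch uses only the standard library and made-up values; the helper standardize is hypothetical, and it uses the population standard deviation, which is what StandardScaler uses as well.

```python
from statistics import fmean, pstdev


def standardize(column):
    mean = fmean(column)
    std = pstdev(column)  # population standard deviation (divides by n)
    if std == 0:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in column]
    return [(value - mean) / std for value in column]


ages = [25, 35, 45, 55, 65]        # made-up values on a scale of tens
months = [12, 24, 36, 48, 60]      # made-up values on a different scale

print([round(z, 3) for z in standardize(ages)])
print([round(z, 3) for z in standardize(months)])
```

Both columns end up on the same scale (mean 0, standard deviation 1), so neither dominates a Euclidean distance.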

Applying Agglomerative Clustering

agg_clustering = AgglomerativeClustering(n_clusters=3)
agg_labels = agg_clustering.fit_predict(X_scaled)
AgglomerativeClustering performs hierarchical clustering by iteratively merging clusters. Setting n_clusters=3 instructs it to create three clusters; the default linkage is Ward, the same criterion used for the dendrogram.
fit_predict returns cluster labels for each data point, stored in agg_labels.
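
The merging process itself can be sketched in a few lines of plain Python. This is a naive illustration of bottom-up clustering with Ward's criterion on made-up 2D points, not how scikit-learn implements it; the function names and data are hypothetical, and the quadratic pair search is only suitable for tiny inputs.

```python
def centroid(points):
    dim = len(points[0])
    return [sum(p[d] for p in points) / len(points) for d in range(dim)]


def ward_cost(a, b):
    # Increase in total within-cluster sum of squares if clusters a and b are merged.
    ca, cb = centroid(a), centroid(b)
    sq_dist = sum((x - y) ** 2 for x, y in zip(ca, cb))
    return len(a) * len(b) / (len(a) + len(b)) * sq_dist


def agglomerate(points, n_clusters):
    clusters = [[p] for p in points]  # start: every point is its own cluster
    while len(clusters) > n_clusters:
        # Pick the pair whose merge increases within-cluster variance the least.
        i, j = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda pair: ward_cost(clusters[pair[0]], clusters[pair[1]]),
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # j > i, so index i is unaffected
    return clusters


points = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.2, 4.9), (10.0, 0.0), (9.8, 0.3)]
for cluster in agglomerate(points, 3):
    print(cluster)
```

Stopping the loop at n_clusters plays the same role as n_clusters=3 in the code above.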

Creating and Displaying the Dendrogram

linked = linkage(X_scaled, 'ward')
dendrogram(linked, orientation='top', distance_sort='descending', show_leaf_counts=True)
plt.show()
linkage(X_scaled, 'ward') computes the hierarchical clustering using the Ward method, which minimizes the variance within clusters as they merge.
dendrogram generates a visual representation of the clustering process, showing the hierarchical structure and possible clusters at various levels.
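
Reading the number of clusters off a dendrogram usually means looking for the largest jump between successive merge heights. The sketch below applies that heuristic to a made-up list of heights; the function name and the numbers are hypothetical, and in practice the heights would come from the third column of the matrix returned by linkage.

```python
def clusters_from_largest_gap(merge_heights):
    # merge_heights: ascending distances of successive merges (n - 1 values for n points).
    gaps = [later - earlier for earlier, later in zip(merge_heights, merge_heights[1:])]
    if not gaps:
        return 1
    # Index of the largest jump; the tree is cut just after merge number `cut`.
    cut = max(range(len(gaps)), key=gaps.__getitem__)
    # n points minus the (cut + 1) merges already performed below the cut.
    return len(merge_heights) - cut


heights = [0.2, 0.3, 0.5, 4.0, 4.5]  # made-up merge heights for six points
print(clusters_from_largest_gap(heights))  # 3
```

This is only a rule of thumb; the final choice also depends on how the clusters will be used.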

Plotting the Clusters in 2D

plt.scatter(X_scaled[:, 0], X_scaled[:, 1], c=agg_labels, cmap='rainbow')
plt.xlabel('Standardized Customer Age')
plt.ylabel('Standardized Dependent Count')
plt.show()
This scatter plot visualizes clusters in two dimensions, using 'Customer Age' and 'Dependent Count' to represent each customer. Colors indicate the clusters assigned by the model.

Key Concepts Behind the Code
Hierarchical Clustering: Hierarchical clustering is a type of unsupervised learning that builds a hierarchy of clusters.
In this case, agglomerative (bottom-up) clustering is used, where individual points are progressively merged into clusters.

Ward’s Method: This is a variance-minimizing approach used in hierarchical clustering to merge clusters.
It attempts to minimize the total within-cluster variance.

Dendrogram: A dendrogram is a tree-like diagram representing hierarchical relationships. It shows how clusters are formed at each level,
allowing the user to decide on the optimal number of clusters visually.

Standardization: Standardizing the features is critical in clustering as it ensures that each feature contributes equally,
preventing any single feature from dominating due to its scale.


Potential Oral Examination Questions and Answers
Q: What is hierarchical clustering, and why do we use it here?
A: Hierarchical clustering is a method of creating a nested sequence of clusters by either merging smaller clusters or splitting larger ones. We use it here to explore different customer segments and understand relationships at multiple levels of hierarchy.

Q: Explain the role of the StandardScaler in this code.
A: StandardScaler standardizes each feature by subtracting its mean and dividing by its standard deviation, so every feature ends up with a mean of zero and a standard deviation of one. This step is essential in clustering to ensure that all features contribute equally to the distance metric used for clustering.

Q: What is the purpose of the dendrogram in this code?
A: The dendrogram visually represents the hierarchical clustering process, showing the order in which clusters are merged and allowing us to decide on the optimal number of clusters by examining cluster distances.

Q: Why do we use the Ward method in linkage for creating the dendrogram?
A: The Ward method merges, at each step, the pair of clusters whose union gives the smallest increase in total within-cluster variance, which tends to produce compact, similarly sized clusters.

Q: Why might we select a different number of clusters when using hierarchical clustering?
A: The dendrogram shows clustering at multiple levels. By observing it, we can decide on the number of clusters based on where the largest distance jumps occur, indicating a natural split.

Q: Can you explain the difference between agglomerative and divisive clustering?
A: Agglomerative clustering starts with each data point as its own cluster and merges them iteratively, while divisive clustering starts with all data points in a single cluster and splits them iteratively.

Q: What does fit_predict do in the AgglomerativeClustering model?
A: fit_predict fits the model to the data and returns an array of cluster labels for each data point, allowing us to visualize or analyze the assigned clusters.

Q: Why are only two features (Customer_Age and Dependent_count) used in the 2D scatter plot?
A: Visualizing all five features at once is challenging in 2D. The code plots only the first two columns of the scaled data to give a simplified view; since the clusters were formed using all five features, they may overlap in this projection.

Q: How would you determine the optimal number of clusters in hierarchical clustering?
A: By examining the dendrogram, we look for the largest vertical distances (gaps) between clusters, as these indicate natural points to cut the tree and select clusters.

Q: If you were to use this code on a different dataset, what modifications would you make?
A: I would change the data = pd.read_csv(...) line to load the new dataset and adjust the X variable to include relevant features from the new data. Additionally, the plot labels may need to be updated to match the new features used.

Q: Can you use other clustering methods with this dataset? If so, which ones?
A: Yes, other clustering methods like K-means or DBSCAN could be applied to this dataset, though they do not produce a dendrogram and may handle different data distributions or density patterns differently.
"""

In [None]:
"""

Inverted-files --->

This code builds an inverted index for a set of sample documents and performs simple document retrieval based on a query.

Code Explanation
Importing Libraries

import re
import collections
re: This library is used for regular expressions, which help in text processing.
collections: Provides a defaultdict, which is used here to create the inverted index.

Sample Documents

documents = {
    1: "This is the first document. It contains some words.",
    2: "This is the second document. It also contains words.",
    3: "The third document is different from the first two.",
    4: "Inverted index is essential for document retrieval.",
}
A dictionary of sample documents where keys are document IDs and values are document text.

Preprocessing and Tokenization

def preprocess(text):
    text = text.lower()
    tokens = re.findall(r'\w+', text)
    return tokens
This function converts the text to lowercase for case-insensitive matching and uses re.findall to extract tokens, where \w+ matches runs of letters, digits, and underscores.

Building the Inverted Index

def build_inverted_index(documents):
    inverted_index = collections.defaultdict(list)
    for doc_id, document in documents.items():
        tokens = preprocess(document)
        for token in tokens:
            inverted_index[token].append(doc_id)
    return inverted_index
This function creates an inverted index, which maps each unique word (token) to a list of document IDs where the word appears.
defaultdict(list) automatically initializes empty lists for new tokens, allowing easy insertion of document IDs.

Retrieving Documents

def retrieve_documents(query, inverted_index):
    query_tokens = preprocess(query)
    result = None  # None means "no token processed yet"

    # Intersect the document sets of every query token (AND search)
    for token in query_tokens:
        if token not in inverted_index:
            return set()  # a missing token means no document can match
        postings = set(inverted_index[token])
        result = postings if result is None else result & postings

    return result if result is not None else set()
This function takes a query, tokenizes it, and finds documents containing all the query tokens (AND operation).
It uses set operations to perform intersections, so the result contains only document IDs that have all the query tokens.

Building the Inverted Index and Testing with a Query

inverted_index = build_inverted_index(documents)
query1 = input("Enter query: ")
result1 = retrieve_documents(query1, inverted_index)
print("Query:", query1)
print("Matching Documents:", result1)
build_inverted_index generates the index for the sample documents.
retrieve_documents returns document IDs matching the input query. For example, a query like "first document" would return documents containing both "first" and "document."
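
For comparison, here is a compact, self-contained sketch that stores document IDs in sets and supports both AND and OR retrieval. The names tokenize, build_index, and search are hypothetical and differ from the functions above; the tokenizer keeps only lowercase letters and digits.

```python
import re
from collections import defaultdict


def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(documents):
    index = defaultdict(set)  # sets avoid duplicate IDs when a word repeats in a document
    for doc_id, text in documents.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index


def search(query, index, mode="and"):
    # .get does not create entries for unknown tokens in the defaultdict.
    postings = [index.get(token, set()) for token in tokenize(query)]
    if not postings:
        return set()
    if mode == "and":
        return set.intersection(*postings)  # documents containing every token
    return set.union(*postings)             # documents containing at least one token


documents = {
    1: "This is the first document. It contains some words.",
    2: "This is the second document. It also contains words.",
    3: "The third document is different from the first two.",
    4: "Inverted index is essential for document retrieval.",
}
index = build_index(documents)
print(search("first document", index))            # {1, 3}
print(search("inverted second", index, "or"))     # {2, 4}
print(search("first retrieval document", index))  # set()
```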

Key Concepts Behind the Code
Inverted Index: An inverted index maps each unique word in a collection to the list of documents where it appears. It is commonly used in search engines for efficient document retrieval.

Preprocessing: Lowercasing and tokenizing text simplifies matching by ensuring case-insensitive searches and removing non-alphanumeric characters.

Set Intersection for Query Matching: When a query has multiple words, this code finds documents that contain all query words by intersecting sets of document IDs for each word.


Potential Oral Examination Questions and Answers
Q: What is an inverted index, and why is it useful in document retrieval?
A: An inverted index maps each unique word in a document collection to a list of documents where it appears. This structure is useful for quick document retrieval, as it allows us to look up documents containing specific words or phrases.

Q: How does the preprocess function work? Why do we use regular expressions here?
A: preprocess converts the text to lowercase and uses re.findall(r'\w+', text) to extract words, ensuring consistent case-insensitive matching and removing punctuation. Regular expressions help capture only the alphanumeric parts of the text as tokens.

Q: Explain the logic behind the retrieve_documents function.
A: retrieve_documents tokenizes the query and retrieves documents for each token from the inverted index. It intersects document lists for each token to return documents containing all query words, implementing an "AND" search.

Q: Why is defaultdict(list) used in building the inverted index?
A: defaultdict(list) allows us to initialize a new list automatically for each token that doesn’t yet exist in the dictionary. This simplifies code by avoiding the need to check for the existence of a token before appending a document ID.

Q: How would the retrieval change if we wanted an "OR" search instead of an "AND" search?
A: For an "OR" search, we would use a union of document lists for each query token rather than an intersection, meaning we would return documents containing at least one of the query words.

Q: If a document contains a query word multiple times, how is this handled in this code?
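A: This code does not use word frequency for ranking; retrieval only checks whether a word is present in a document. Note that build_inverted_index appends a document ID for every occurrence, so a word repeated in a document (such as "the" in document 3) adds duplicate IDs to its list; converting the list to a set in retrieve_documents removes them.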

Q: What is the time complexity of the document retrieval function?
A: Retrieval time is roughly proportional to the total length of the document lists of the query tokens: each dictionary lookup is O(1) on average, and building and intersecting the sets is linear in their sizes.

Q: Why are sets used for result in retrieve_documents?
A: Sets allow efficient intersection operations, which are necessary for the "AND" search functionality in this function, and they ensure that each document ID appears only once in the result.

Q: How would you modify this code to handle phrase searches (e.g., "first document")?
A: For phrase searches, we would need to store and retrieve not just document IDs but also the positions of each token within documents, then check if the tokens appear sequentially.

Q: What modifications would you make to this code to work with a large dataset?
A: For larger datasets, we could optimize by using more memory-efficient data structures or indexing techniques like Apache Lucene. We might also consider storing the inverted index on disk with efficient I/O operations for retrieval.

Q: Can this code handle partial word searches (like searching for "doc" instead of "document")?
A: No, this code only matches complete tokens. To support partial matching, we would need to modify the build_inverted_index function to index word stems or prefixes.

Q: Why are regular expressions important in text preprocessing?
A: Regular expressions allow precise extraction and manipulation of text patterns, which is essential for breaking down sentences into words and removing non-alphanumeric characters effectively.
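
The phrase-search answer above can be sketched with a positional index, which records where each token occurs in each document. This is a minimal illustration on the same sample documents; build_positional_index and phrase_search are hypothetical names, not part of the original code.

```python
import re
from collections import defaultdict


def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())


def build_positional_index(documents):
    # token -> {doc_id: [positions of the token in that document]}
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in documents.items():
        for position, token in enumerate(tokenize(text)):
            index[token][doc_id].append(position)
    return index


def phrase_search(phrase, index):
    tokens = tokenize(phrase)
    if not tokens or any(token not in index for token in tokens):
        return set()
    matches = set()
    # Try every occurrence of the first token as a possible start of the phrase.
    for doc_id, starts in index[tokens[0]].items():
        for start in starts:
            if all(
                doc_id in index[token] and start + offset in index[token][doc_id]
                for offset, token in enumerate(tokens[1:], start=1)
            ):
                matches.add(doc_id)
                break
    return matches


documents = {
    1: "This is the first document. It contains some words.",
    2: "This is the second document. It also contains words.",
    3: "The third document is different from the first two.",
    4: "Inverted index is essential for document retrieval.",
}
positional_index = build_positional_index(documents)
print(phrase_search("first document", positional_index))  # {1}
```

Document 3 contains both words but not next to each other, so it matches an AND query and is excluded from the phrase query.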

"""

In [None]:
"""

PageRank Algorithm --->

This code calculates the PageRank of pages in a web graph using the iterative PageRank algorithm.

Code Explanation
Importing NumPy Library

import numpy as np
This library is used for numerical operations and handling arrays efficiently.

PageRank Function

def page_rank(graph, damping_factor=0.85, max_iterations=100, tol=1e-6):
    # Number of pages
    num_pages = len(graph)

    # Initialize the PageRank values
    pagerank = np.ones(num_pages) / num_pages
page_rank is a function that calculates PageRank values for pages in a web graph.
graph is the adjacency matrix representing links between pages.
damping_factor simulates the probability that a user continues clicking on links (default is 0.85).
max_iterations limits the number of times the algorithm runs if convergence is not reached.
tol is the tolerance level for convergence (when the change in values is smaller than this, the algorithm stops).

Main Iteration Loop

for _ in range(max_iterations):
    new_pagerank = np.zeros(num_pages)
    for i in range(num_pages):
        for j in range(num_pages):
            if graph[j][i]:
                new_pagerank[i] += pagerank[j] / sum(graph[j])
For each page, new_pagerank is calculated based on the sum of incoming PageRank values from pages linking to it.
If graph[j][i] is 1, page j links to page i, contributing its PageRank proportionally based on its outlinks.

Applying the Damping Factor

new_pagerank = (1 - damping_factor) / num_pages + damping_factor * new_pagerank
After calculating new_pagerank for all pages, it is adjusted by the damping factor, ensuring that some probability (e.g., 0.15) is assigned to "randomly" visiting any page.

Checking for Convergence

if np.linalg.norm(new_pagerank - pagerank) < tol:
    return new_pagerank
pagerank = new_pagerank
If the difference between new_pagerank and the previous pagerank is smaller than tol, the algorithm assumes convergence and returns new_pagerank as the final PageRank values.
If not, pagerank is updated, and the loop continues.

Example Graph and Running the Function

web_graph = [
    [0, 1, 1, 0],
    [0, 0, 1, 0],
    [1, 0, 0, 1],
    [0, 0, 1, 0]
]
pagerank_values = page_rank(web_graph)
print("PageRank values:", pagerank_values)
web_graph is the adjacency matrix of a sample set of four web pages: web_graph[j][i] == 1 means page j links to page i. This graph is passed to the page_rank function to calculate PageRank values, which are then printed.
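
The fragments above can be assembled into one runnable function. The following is a sketch rather than the original code: it uses plain lists instead of NumPy, precomputes out-degrees, spreads the rank of pages without outlinks evenly over all pages (an added assumption), and returns the latest values even if the tolerance is not reached within max_iterations.

```python
def page_rank(graph, damping_factor=0.85, max_iterations=100, tol=1e-6):
    n = len(graph)
    out_degree = [sum(row) for row in graph]  # computed once instead of inside the loops
    ranks = [1.0 / n] * n                     # uniform starting values
    for _ in range(max_iterations):
        # Rank held by pages without outlinks is shared by all pages (assumption of this sketch).
        dangling = sum(ranks[j] for j in range(n) if out_degree[j] == 0)
        new_ranks = []
        for i in range(n):
            # Each page j linking to i passes on an equal share of its rank.
            incoming = sum(ranks[j] / out_degree[j] for j in range(n) if graph[j][i])
            new_ranks.append(
                (1 - damping_factor) / n + damping_factor * (incoming + dangling / n)
            )
        # Euclidean norm of the change, as in the np.linalg.norm check.
        change = sum((a - b) ** 2 for a, b in zip(new_ranks, ranks)) ** 0.5
        ranks = new_ranks
        if change < tol:
            break
    return ranks


web_graph = [
    [0, 1, 1, 0],
    [0, 0, 1, 0],
    [1, 0, 0, 1],
    [0, 0, 1, 0],
]
pagerank_values = page_rank(web_graph)
print("PageRank values:", [round(value, 4) for value in pagerank_values])
print("Sum:", round(sum(pagerank_values), 6))
```

Page 2 receives links from all three other pages and ends up with the highest value; the values sum to 1 because no rank is lost.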

Key Concepts Behind the Code
PageRank: A link-based algorithm for ranking web pages. Pages with more and higher-quality links have higher PageRank.
Damping Factor: The probability (usually 0.85) that a user follows a link on the current page; otherwise the user jumps to a randomly chosen page.
Convergence: When the algorithm stabilizes, meaning further iterations cause negligible change in PageRank values.


Potential Oral Examination Questions and Answers
Q: What is the purpose of the PageRank algorithm?
A: PageRank assigns a numerical weight to each web page to measure its relative importance based on the structure of links across the web. Higher PageRank indicates a more influential or authoritative page.

Q: Explain the role of the damping factor in the PageRank algorithm.
A: The damping factor represents the probability that a user continues clicking links rather than randomly jumping to another page. A typical value of 0.85 assumes an 85% chance of following links and 15% chance of randomly visiting a page.

Q: Why do we check for convergence in the PageRank function?
A: We check for convergence to ensure that the algorithm stops once PageRank values stabilize, preventing unnecessary computations. Convergence occurs when the difference between current and previous PageRank values falls below the tolerance (tol).

Q: How would the algorithm change if we set the damping factor to 1?
A: With a damping factor of 1, the (1 - damping_factor) term vanishes, so users are assumed to follow links only and never jump randomly. PageRank then depends solely on the link structure, and the iteration may fail to converge or may lose rank in pages without outbound links.

Q: What is the time complexity of this PageRank algorithm?
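A: Each iteration loops over all N^2 page pairs, and sum(graph[j]) is recomputed in O(N) for every link found, so an iteration costs O(N^2 + E * N), where N is the number of pages and E is the number of links; that is up to O(N^3) for a dense graph. Precomputing each page's out-degree brings the total down to O(N^2 * max_iterations).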

Q: Why is the PageRank initialized equally across pages?
A: Initially, we assume all pages are equally important; this uniform start helps the algorithm adjust ranks solely based on the link structure over successive iterations.

Q: What would happen if there is a page with no outbound links? How is it handled?
A: In this implementation, a page with no outbound links (a dangling page) never passes its PageRank on, so that share is dropped in each iteration and the values no longer sum to 1. To handle this, we might modify the graph or redistribute the dangling page's PageRank equally to all pages.

Q: Can the PageRank algorithm be used outside of web search engines?
A: Yes, PageRank can be applied in social networks, citation networks, recommendation systems, and any domain where the importance of nodes depends on their connections.

Q: Explain why np.linalg.norm is used in the convergence check.
A: np.linalg.norm calculates the Euclidean distance between the current and previous PageRank vectors. A small norm value indicates minor changes between iterations, implying convergence.

Q: How would you modify the code to handle a larger graph efficiently?
A: For large graphs, we might store the adjacency matrix as a sparse matrix to save memory. Additionally, implementing parallel processing or using a distributed computing framework like Apache Spark could improve efficiency.

Q: Why is the new_pagerank vector initialized to zeros at each iteration?
A: new_pagerank starts at zero to accumulate contributions from each linked page in each iteration. Resetting it prevents carrying over previous values.

Q: How does the algorithm handle pages that have no incoming links?
A: Pages with no incoming links get a baseline PageRank due to the (1 - damping_factor) term, representing the chance of arriving there randomly.

"""