# Project2 Part1 - Text Analysis through TFIDF computation


In [1]:
from text_analyzer import read_sonnets, clean_corpus, tf, get_top_k, idf, tf_idf, cosine_sim

import pandas as pd
import plotly.express as px

%load_ext autoreload
%autoreload 2

In [2]:
# run text_analyzer.py with default arguments
!python text_analyzer.py


Sonnet 1 TF (Top 20):
[('the', 6), ('thy', 5), ('to', 4), ('and', 3), ('that', 2), ('might', 2), ('but', 2), ('by', 2), ('his', 2), ('tender', 2), ('thou', 2), ('thine', 2), ('own', 2), ('self', 2), ('worlds', 2), ('from', 1), ('fairest', 1), ('creatures', 1), ('we', 1), ('desire', 1)]
Corpus TF (Top 20):
[('and', 491), ('the', 430), ('to', 408), ('my', 397), ('of', 372), ('i', 343), ('in', 322), ('that', 320), ('thy', 287), ('thou', 235), ('with', 181), ('for', 171), ('is', 168), ('a', 166), ('not', 166), ('me', 164), ('but', 163), ('thee', 162), ('love', 162), ('so', 144)]
Corpus IDF (Top 20):
[('despising', 5.0369526024136295), ('arising', 5.0369526024136295), ('enjoy', 5.0369526024136295), ('mans', 5.0369526024136295), ('outcast', 5.0369526024136295), ('featured', 5.0369526024136295), ('beweep', 5.0369526024136295), ('deaf', 5.0369526024136295), ('gate', 5.0369526024136295), ('desiring', 5.0369526024136295), ('lark', 5.0369526024136295), ('trouble', 5.0369526024136295), ('fate', 5

## a. Read about argparse.
Look at its implementation in the Python script. Follow the instructions and answer the questions in the Argparse section.

In [3]:
!python text_analyzer.py --help

usage: text_analyzer.py [-h] [-i INPUT] [-c CORPUS] [--tfidf]

Text Analysis through TFIDF computation

options:
  -h, --help            show this help message and exit
  -i INPUT, --input INPUT
                        Input text file or files. (default:
                        ./data/shakespeare_sonnets/1.txt)
  -c CORPUS, --corpus CORPUS
                        Directory containing document collection (i.e.,
                        corpus) (default: ./data/shakespeare_sonnets/)
  --tfidf               Determine the TF IDF of a document w.r.t. a given
                        corpus (default: False)


#### TODO: answer here

**_Answer:_**

- a. The `argparse` module is used in the `text_analyzer.py` script to parse command-line arguments. It handles reading and validating command-line inputs and generates usage and help messages, which makes the script easier to use. The script defines three optional arguments: `-i`/`--input` (the input file), `-c`/`--corpus` (the corpus directory), and `--tfidf` (whether to compute TF-IDF). It starts by creating an `ArgumentParser` object with a description of the program's purpose, then registers each argument with its flags, type, default value, and help message using the `add_argument` method. When the script is executed, `parse_args()` processes the provided command-line arguments and returns an object with the parsed values as attributes, which the script reads to decide what to do.

- b. When we run `python text_analyzer.py --help`, it prints the help message for the script as shown in the above code block. The message includes a description of the program, optional arguments, their default values, and help messages. The help message is generated by the `argparse` module, and it is printed because of the following lines of code in `text_analyzer.py`:
  ```
  parser = argparse.ArgumentParser(
      description="Text Analysis through TFIDF computation",
      formatter_class=argparse.ArgumentDefaultsHelpFormatter,
  )
  ```
  The `-h` and `--help` options are added automatically by `argparse` when we create an `ArgumentParser` object. The `description` argument supplies the text shown at the top of the help message, and `formatter_class=argparse.ArgumentDefaultsHelpFormatter` appends each argument's default value to its help text. The script's other command-line arguments are defined using the `add_argument` method, for example `-i`/`--input`:
  ```
  parser.add_argument(
      "-i",
      "--input",
      type=str,
      default="./data/shakespeare_sonnets/1.txt",
      help="Input text file or files.",
  )
  ```
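
The pieces described above can be assembled into a minimal sketch of the whole parser. The option names, defaults, and help strings are copied from the `--help` output; `build_parser` is a hypothetical helper name, the example input path is made up, and `action="store_true"` for `--tfidf` is an assumption inferred from the `(default: False)` in the help text rather than taken from the script.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Rebuild the CLI shown in the --help output (hypothetical helper)."""
    parser = argparse.ArgumentParser(
        description="Text Analysis through TFIDF computation",
        formatter_class=argparse.ArgumentDefaultsHelpFormatter,
    )
    parser.add_argument(
        "-i", "--input", type=str,
        default="./data/shakespeare_sonnets/1.txt",
        help="Input text file or files.",
    )
    parser.add_argument(
        "-c", "--corpus", type=str,
        default="./data/shakespeare_sonnets/",
        help="Directory containing document collection (i.e., corpus)",
    )
    # Assumption: a boolean flag that stays False unless it is passed
    parser.add_argument(
        "--tfidf", action="store_true",
        help="Determine the TF IDF of a document w.r.t. a given corpus",
    )
    return parser


# Parse an explicit list instead of sys.argv so the sketch also runs in a notebook
args = build_parser().parse_args(["--tfidf", "-i", "./data/shakespeare_sonnets/2.txt"])
print(args.input, args.corpus, args.tfidf)
```

Passing an explicit list to `parse_args()` is a convenient way to try out a parser without leaving the notebook; omitted options fall back to their defaults.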

## b. Read and Clean the data

In [4]:
d_corpus='data/shakespeare_sonnets/'

# return dictionary with keys corresponding to file names and values being the respective contents
corpus = read_sonnets(d_corpus)

# return corpus (dict) with each sonnet cleaned and tokenized for further processing
corpus = clean_corpus(corpus)

In [5]:
corpus['1']

['from',
 'fairest',
 'creatures',
 'we',
 'desire',
 'increase',
 'that',
 'thereby',
 'beautys',
 'rose',
 'might',
 'never',
 'die',
 'but',
 'as',
 'the',
 'riper',
 'should',
 'by',
 'time',
 'decease',
 'his',
 'tender',
 'heir',
 'might',
 'bear',
 'his',
 'memory',
 'but',
 'thou',
 'contracted',
 'to',
 'thine',
 'own',
 'bright',
 'eyes',
 'feedst',
 'thy',
 'lights',
 'flame',
 'with',
 'selfsubstantial',
 'fuel',
 'making',
 'a',
 'famine',
 'where',
 'abundance',
 'lies',
 'thy',
 'self',
 'thy',
 'foe',
 'to',
 'thy',
 'sweet',
 'self',
 'too',
 'cruel',
 'thou',
 'that',
 'art',
 'now',
 'the',
 'worlds',
 'fresh',
 'ornament',
 'and',
 'only',
 'herald',
 'to',
 'the',
 'gaudy',
 'spring',
 'within',
 'thine',
 'own',
 'bud',
 'buriest',
 'thy',
 'content',
 'and',
 'tender',
 'churl',
 'makst',
 'waste',
 'in',
 'niggarding',
 'pity',
 'the',
 'world',
 'or',
 'else',
 'this',
 'glutton',
 'be',
 'to',
 'eat',
 'the',
 'worlds',
 'due',
 'by',
 'the',
 'grave',
 'and',

**_Answer:_**

In the `text_analyzer.py` script, the functions responsible for reading and cleaning the text are `read_sonnets()` and `clean_corpus()`.

- The `read_sonnets()` function reads the contents of the text files from a given directory (corpus) and returns a dictionary with the file names as keys and their respective contents (list of strings) as values. This is done using the `os` module to list all files in the directory and the `open()` function to read the content of each file.

- The `clean_corpus()` function takes the corpus (dictionary returned by `read_sonnets()`) as input and cleans the text in each document. It removes punctuation, converts all words to lowercase, and splits each sonnet into a list of word tokens, as the output of `corpus['1']` above shows.

In summary, the script reads text files from a given directory, processes their content by removing punctuation and converting words to lowercase, and then stores the cleaned text in a dictionary for further analysis.
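
The cleaning step can be sketched as follows. This is not the code from `text_analyzer.py`: `clean_text` and `clean_corpus_sketch` are hypothetical names, and the approach (delete ASCII punctuation, lowercase, split on whitespace) is an assumption chosen because it reproduces tokens such as `beautys` seen in the output above.

```python
import string
from typing import Dict, List


def clean_text(text: str) -> List[str]:
    """Lowercase, delete ASCII punctuation, and split on whitespace."""
    table = str.maketrans("", "", string.punctuation)
    return text.lower().translate(table).split()


def clean_corpus_sketch(corpus: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Clean every document; each value is assumed to be a list of raw lines."""
    return {name: clean_text(" ".join(lines)) for name, lines in corpus.items()}


raw = {
    "1": [
        "From fairest creatures we desire increase,",
        "That thereby beauty's rose might never die,",
    ]
}
print(clean_corpus_sketch(raw)["1"])
```

Deleting apostrophes rather than replacing them with spaces is what turns "beauty's" into a single token; replacing them would instead produce a stray `s` token.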

## c. TF

In [6]:
# assign sonnet 1 to a variable to compute its TF (note: corpus is a dict, while sonnet1 is a list of tokens)
sonnet1 = corpus['1']

# determine tf of sonnet
sonnet1_tf = tf(sonnet1)

# get sorted list and slice out top 20
sonnet1_top20 = get_top_k(sonnet1_tf)
# print
print("Sonnet 1 TF (Top 20):")
df = pd.DataFrame(sonnet1_top20, columns=["word", "count"])
df.head(20)

Sonnet 1 TF (Top 20):


| | word | count |
|---|---|---|
| 0 | the | 6 |
| 1 | thy | 5 |
| 2 | to | 4 |
| 3 | and | 3 |
| 4 | that | 2 |
| 5 | might | 2 |
| 6 | but | 2 |
| 7 | by | 2 |
| 8 | his | 2 |
| 9 | tender | 2 |


In [7]:
# TF of entire corpus
flattened_corpus = [word for sonnet in corpus.values() for word in sonnet] 
corpus_tf = tf(flattened_corpus)
corpus_top20 = get_top_k(corpus_tf)
# print
print("Corpus TF (Top 20):")
df = pd.DataFrame(corpus_top20, columns=["word", "count"])
df.head(20)

Corpus TF (Top 20):


| | word | count |
|---|---|---|
| 0 | and | 491 |
| 1 | the | 430 |
| 2 | to | 408 |
| 3 | my | 397 |
| 4 | of | 372 |
| 5 | i | 343 |
| 6 | in | 322 |
| 7 | that | 320 |
| 8 | thy | 287 |
| 9 | thou | 235 |


### Q: Discussion
Do you believe the most frequent words would discriminate between documents well? Why or why not? Any thoughts on how we can improve this representation? Does there appear to be any ‘noise’? If so, where? If not, it should be clear by the end of the assignment.

#### TODO: answer here

**_Answer:_**

- No, the most frequent words are unlikely to discriminate between documents well. The top of both lists is dominated by common words like "the", "and", "to", and "of", known as "stop words". They appear in almost every text, so their counts say little about what a particular document is about.

- To improve the representation, we can remove stop words and other very frequent, uninformative words during preprocessing, so that the remaining counts emphasize words that reflect a document's content. Additionally, stemming or lemmatization reduces words to their base or root form, so that variants of the same word are counted together.

- Yes, there is noise, and it sits at the top of the lists: most of the highest-frequency words are stop words. Filtering them out, or down-weighting them as IDF does later in this assignment, should make the analysis more effective.
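
The stop-word filtering suggested above can be sketched in a few lines. The stop list here is a tiny illustrative set, not a standard one, and `top_k_without_stop_words` is a hypothetical helper that is not part of `text_analyzer.py`.

```python
from collections import Counter
from typing import Iterable, List, Tuple

# Tiny illustrative stop list; a real one would be much longer
STOP_WORDS = {"the", "and", "to", "of", "a", "in", "that", "i", "my"}


def top_k_without_stop_words(tokens: Iterable[str], k: int = 3) -> List[Tuple[str, int]]:
    """Count tokens, skipping stop words, and return the k most frequent."""
    counts = Counter(t for t in tokens if t not in STOP_WORDS)
    return counts.most_common(k)


tokens = "the world and the grave and thy sweet self to thy foe".split()
print(top_k_without_stop_words(tokens))
```

In this toy input "thy" still comes out on top: modern stop lists generally do not include archaic forms such as "thy" and "thou", so for the sonnets they would have to be added by hand.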

## d. IDF

In [8]:
# IDF of corpus
corpus_idf = idf(corpus)
corpus_idf_ordered = get_top_k(corpus_idf)
# print top 20 to add to report
print("Corpus IDF (Top 20):")
df = pd.DataFrame(corpus_idf_ordered, columns=["word", "score"])
df.head(20)

Corpus IDF (Top 20):


| | word | score |
|---|---|---|
| 0 | enjoy | 5.036953 |
| 1 | outcast | 5.036953 |
| 2 | trouble | 5.036953 |
| 3 | lark | 5.036953 |
| 4 | beweep | 5.036953 |
| 5 | bootless | 5.036953 |
| 6 | gate | 5.036953 |
| 7 | featured | 5.036953 |
| 8 | despising | 5.036953 |
| 9 | desiring | 5.036953 |


### Q: observe and briefly comment on the difference in top 20 lists (comparing TF of corpus vs its IDF).

#### TODO: answer here

**_Answer:_**

When comparing the top 20 lists of TF and IDF of the corpus, there is a clear difference in the words that appear in each list.

The top 20 words in the TF list are mainly common function words such as "and", "the", "to", "my", "of", "i", etc. These words occur frequently in the corpus, but do not provide much information about the content or meaning of a specific document. They are often referred to as "stop words" and are considered to be noise when analyzing text. Including these words in the analysis may not be very helpful in discriminating between different documents, as they are common across all texts.

On the other hand, the top 20 words in the IDF list are rare across the corpus. Words such as "gate", "bootless", "wishing", "mans", and "arising" are more likely to be distinctive to specific documents. High IDF scores indicate that these words appear in few sonnets, making them more informative and potentially useful for discriminating between documents. Note that every word shown in the list has the same score (5.036953), so the order among them, and which of the tied words make it into the top 20, is arbitrary; this is why the script output and the table above list different words. These words can provide insight into the theme or subject matter of a sonnet and contribute to a better representation of the text.

In conclusion, the top 20 words in the TF list represent common words that may not be very helpful for discriminating between documents, while the top 20 words in the IDF list represent more unique and informative words that can help differentiate between different sonnets. Combining both TF and IDF measures (TF-IDF) can provide a more robust and meaningful representation of the documents.
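
To make the IDF scores concrete, here is a minimal sketch that assumes the common definition IDF(w) = ln(N / df(w)), where N is the number of documents and df(w) is the number of documents containing w. The `idf` function in `text_analyzer.py` is not shown in this notebook, so `idf_sketch` is a hypothetical stand-in. The assumption is at least consistent with the results: the top score 5.036953 equals ln(154), which is what a word occurring in exactly one document would receive if the corpus holds all 154 sonnets.

```python
import math
from typing import Dict, List


def idf_sketch(corpus: Dict[str, List[str]]) -> Dict[str, float]:
    """IDF(w) = ln(N / df(w)), where df(w) counts documents containing w."""
    n_docs = len(corpus)
    doc_freq: Dict[str, int] = {}
    for tokens in corpus.values():
        for word in set(tokens):  # count each word once per document
            doc_freq[word] = doc_freq.get(word, 0) + 1
    return {word: math.log(n_docs / df) for word, df in doc_freq.items()}


toy = {"a": ["the", "lark"], "b": ["the", "gate"], "c": ["the", "gate", "fate"]}
scores = idf_sketch(toy)
print(scores["the"], scores["gate"], scores["lark"])

# A word found in exactly one of 154 documents scores ln(154)
print(round(math.log(154), 6))
```

In the toy corpus, "the" occurs in every document and scores 0, while "lark" occurs in only one and gets the maximum score. This is the same pattern as in the two top-20 lists above.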

## e. TF-IDF

In [9]:
# TFIDF of Sonnet1 w.r.t. corpus
sonnet1_tfidf = tf_idf(corpus_idf, sonnet1_tf)
sonnet1_tfidf_ordered = get_top_k(sonnet1_tfidf)
# print
print("Sonnet 1 TFIDF (Top 20):")
df = pd.DataFrame(sonnet1_tfidf_ordered, columns=["word", "score"])
df.head(20)

Sonnet 1 TFIDF (Top 20):


| | word | score |
|---|---|---|
| 0 | worlds | 7.301316 |
| 1 | tender | 6.490386 |
| 2 | feedst | 5.036953 |
| 3 | lights | 5.036953 |
| 4 | selfsubstantial | 5.036953 |
| 5 | fuel | 5.036953 |
| 6 | famine | 5.036953 |
| 7 | foe | 5.036953 |
| 8 | herald | 5.036953 |
| 9 | gaudy | 5.036953 |


### Q. What is different with this list than just using TF?

#### TODO: answer here

**_Answer:_**

The TF-IDF list differs from the TF list because each word's score is the product of its TF and its IDF, which combines local information (how often the word occurs in this document) with global information (how rare the word is across the corpus). A word scores highly only if it is frequent in the document and rare in the corpus. In contrast, TF alone considers only the local counts, so common words dominate. The TF-IDF list is therefore more useful than the TF list for discriminating between documents.

The top 20 words in the TF-IDF list for Sonnet 1 show a different set of words compared to the top 20 words in the TF list. The words in this list have a higher significance in Sonnet 1, as they are more unique and specific to this particular sonnet.

Words like "worlds", "tender", "feedst", "lights", "selfsubstantial", "fuel", and "famine" have higher TF-IDF scores, indicating that they are important to the meaning of Sonnet 1 and are less frequent in the entire corpus. These words can help us understand the content and theme of the sonnet, making them more useful for discriminating between different documents.

In summary, the TF-IDF representation improves upon the simple TF representation by giving more weight to distinctive, informative words while downplaying common words. This reduces the noise caused by common words and provides a better representation of the documents for text analysis and comparison.
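
The scores in the table can be checked by hand with a small worked example. It assumes that TF is the raw count, that IDF is ln(N / df), and that the corpus has N = 154 documents. The document frequencies 4 and 6 used below are inferred because they reproduce the reported scores; they were not read from the corpus, and `tf_idf_score` is a hypothetical helper.

```python
import math

N_DOCS = 154  # assumed corpus size


def tf_idf_score(term_count: int, doc_freq: int, n_docs: int = N_DOCS) -> float:
    """Raw term count times ln(N / df)."""
    return term_count * math.log(n_docs / doc_freq)


# Arguments: (count in Sonnet 1, assumed number of sonnets containing the word)
worlds = tf_idf_score(2, 4)
tender = tf_idf_score(2, 6)
feedst = tf_idf_score(1, 1)
print(round(worlds, 6), round(tender, 6), round(feedst, 6))
```

Under these assumptions, "worlds" and "tender" outrank the words unique to Sonnet 1 only because they occur twice in it; each single occurrence is worth less than an occurrence of a word like "feedst".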

## f. Compare all documents

In [10]:
# Visualize pairwise cosine similarity as a heatmap
import numpy as np

# Compute each document's TF-IDF vector once instead of once per pair
tfidf_vectors = [tf_idf(corpus_idf, tf(doc)) for doc in corpus.values()]

# Calculate cosine similarity scores between all pairs of documents
num_docs = len(tfidf_vectors)
similarity_matrix = np.zeros((num_docs, num_docs))

for i, tfidf1 in enumerate(tfidf_vectors):
    for j, tfidf2 in enumerate(tfidf_vectors):
        similarity_matrix[i, j] = cosine_sim(tfidf1, tfidf2)

# Use plotly to visualize the similarity matrix as a heatmap
fig = px.imshow(similarity_matrix)
fig.update_xaxes(title_text="Document Index")
fig.update_yaxes(title_text="Document Index")
fig.update_layout(title="Cosine Similarity Heatmap", width=1000, height=1000)
fig.show()

In [11]:
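from scipy.cluster.hierarchy import linkage, leaves_list
from scipy.spatial.distance import squareform

# Perform hierarchical clustering on cosine distances (1 - similarity).
# linkage() expects a condensed distance matrix; a square 2D array would be
# interpreted as raw observation vectors instead.
distance_matrix = np.clip(1.0 - similarity_matrix, 0.0, None)
np.fill_diagonal(distance_matrix, 0.0)
linkage_matrix = linkage(squareform(distance_matrix, checks=False), method="average")
ordered_indices = leaves_list(linkage_matrix)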

# Sort the similarity matrix based on the clustering results
sorted_similarity_matrix = similarity_matrix[ordered_indices, :]
sorted_similarity_matrix = sorted_similarity_matrix[:, ordered_indices]

# Use plotly to visualize the sorted similarity matrix as a heatmap
fig = px.imshow(sorted_similarity_matrix)
fig.update_xaxes(title_text="Document Index")
fig.update_yaxes(title_text="Document Index")
fig.update_layout(title="Cosine Similarity Heatmap (Cluster-based Sorting)", width=1000, height=1000)
fig.show()

In [12]:
# Calculate the average similarity for each document
average_similarity = np.mean(similarity_matrix, axis=1) 

# Get the sorted indices based on the average similarity
sorted_indices = np.argsort(average_similarity)

# Sort the similarity matrix based on the average similarity
sorted_similarity_matrix = similarity_matrix[sorted_indices, :]
sorted_similarity_matrix = sorted_similarity_matrix[:, sorted_indices]

# Use plotly to visualize the sorted similarity matrix as a heatmap
fig = px.imshow(sorted_similarity_matrix)
fig.update_xaxes(title_text="Document Index")
fig.update_yaxes(title_text="Document Index")
fig.update_layout(title="Cosine Similarity Heatmap (Sorted by Average Similarity)", width=1000, height=1000)
fig.show()

### Q. Observe the heatmap. What insight do you get from it?

#### TODO: answer here

**_Answer:_**

- The heatmap shows the cosine similarity between all pairs of documents in the corpus. The warmer the color (the higher the value), the more similar the two documents are. Every value on the diagonal is 1, which makes sense because every document is identical to itself.

- Sorting the documents before generating the heatmap helps reveal more structure.
    - We can sort the documents by their average similarity to all others. Because the indices are sorted in ascending order, the documents run from the least similar on average (upper left) to the most similar on average (bottom right).
    - We can use a clustering algorithm to group similar documents together, which can help reveal patterns and clusters of similar documents in the heatmap.
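
For reference, cosine similarity over sparse word-to-weight dictionaries can be sketched as below. The `cosine_sim` function imported from `text_analyzer.py` is not shown in this notebook, so `cosine_sim_sketch` and `similarity_matrix_sketch` are hypothetical stand-ins, and returning 0 for an empty vector is a convention chosen here.

```python
import math
from typing import Dict, List


def cosine_sim_sketch(vec1: Dict[str, float], vec2: Dict[str, float]) -> float:
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(weight * vec2.get(word, 0.0) for word, weight in vec1.items())
    norm1 = math.sqrt(sum(w * w for w in vec1.values()))
    norm2 = math.sqrt(sum(w * w for w in vec2.values()))
    if norm1 == 0.0 or norm2 == 0.0:
        return 0.0  # convention chosen here for empty vectors
    return dot / (norm1 * norm2)


def similarity_matrix_sketch(vectors: List[Dict[str, float]]) -> List[List[float]]:
    """Pairwise cosine similarity for a list of sparse vectors."""
    return [[cosine_sim_sketch(u, v) for v in vectors] for u in vectors]


docs = [{"love": 1.0, "time": 2.0}, {"love": 2.0, "time": 4.0}, {"gate": 3.0}]
matrix = similarity_matrix_sketch(docs)
for row in matrix:
    print([round(value, 3) for value in row])
```

The second toy document is the first one scaled by two, and the two still have similarity 1, while the third shares no words with the others and scores 0 against both.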

## g. Distance Metrics
Add functions called `euclidean_distance` and `manhattan_distance`. Then, process the
documents using the new metric, analyze and discuss the differences between these and cosine
similarity.

In [13]:
from math import sqrt
from typing import Dict


def euclidean_distance(
        vec1: Dict[str, float], vec2: Dict[str, float]
) -> float:
    """
    Calculate the Euclidean distance between two tf-idf vectors.

    :param vec1: A dictionary where the keys are words and the values are their TF-IDF scores in the sonnet.
    :param vec2: A dictionary where the keys are words and the values are their TF-IDF scores in the sonnet.
    :return: The Euclidean distance between the vectors

    Example:
    # >>> vec1 = {'apple': 2.1972245773362196, 'banana': 0.4054651081081644, 'orange': 0.0}
    # >>> vec2 = {'apple': 2.1972245773362196, 'banana': 0.4054651081081644, 'peach': 2.0794415416798357}
    # >>> euclidean_distance(vec1, vec2)
    # >>> 2.0794415416798357
    """
    # Build the union of keys from both dictionaries and fill in 0 for any key
    # missing from one of them, so that both vectors share the same keys and
    # length before the distance is calculated.
    all_keys = set(vec1.keys()).union(set(vec2.keys()))
    
    vec1 = {key: vec1.get(key, 0) for key in all_keys}
    vec2 = {key: vec2.get(key, 0) for key in all_keys}

    return sqrt(sum([(vec1[key] - vec2[key])**2 for key in all_keys]))


def manhattan_distance(
        vec1: Dict[str, float], vec2: Dict[str, float]
) -> float:
    """
    Calculate the Manhattan distance between two tf-idf vectors.

    :param vec1: A dictionary where the keys are words and the values are their TF-IDF scores in the sonnet.
    :param vec2: A dictionary where the keys are words and the values are their TF-IDF scores in the sonnet.
    :return: The Manhattan distance between the vectors

    Example:
    # >>> vec1 = {'apple': 2.1972245773362196, 'banana': 0.4054651081081644, 'orange': 0.0}
    # >>> vec2 = {'apple': 2.1972245773362196, 'banana': 0.4054651081081644, 'peach': 2.0794415416798357}
    # >>> manhattan_distance(vec1, vec2)
    # >>> 2.0794415416798357
    """
    all_keys = set(vec1.keys()).union(set(vec2.keys()))

    vec1 = {key: vec1.get(key, 0) for key in all_keys}
    vec2 = {key: vec2.get(key, 0) for key in all_keys}

    return sum([abs(vec1[key] - vec2[key]) for key in all_keys])


def compare_documents(corpus, metric_func):
    """
    Compare all the documents in the corpus using the given metric function.

    :param corpus: A dictionary where the keys are sonnet IDs and the values are the tokenized sonnets.
    :param metric_func: A function that takes in two tf-idf vectors and returns a float.

    :return: A 2D list where the (i, j)-th entry is the metric value between the i-th and j-th documents.
    """
    # Compute each sonnet's TF-IDF vector once (uses the notebook-level corpus_idf)
    vectors = [tf_idf(corpus_idf, tf(sonnet)) for sonnet in corpus.values()]
    scores = []  # 2D list to store the pairwise scores
    for idx1, vec1 in enumerate(vectors):
        row = []  # scores of sonnet idx1 against every sonnet
        for idx2, vec2 in enumerate(vectors):
            # a document has distance 0 to itself, so skip the computation
            row.append(0 if idx1 == idx2 else metric_func(vec1, vec2))
        scores.append(row)
    return scores


def visualize_heatmap(scores, title, show=True):
    """
    Visualize the similarity/distance scores between documents as a heatmap.

    :param scores: A 2D array where the (i, j)-th entry is the similarity/distance score between the i-th and j-th documents.
    :param title: The title of the heatmap.
    :param show: If True, display the figure and return None; otherwise return the figure without displaying it.
    """
    # Use plotly to visualize the scores as a heatmap
    fig = px.imshow(scores)
    fig.update_xaxes(title_text="Document Index")
    fig.update_yaxes(title_text="Document Index")
    fig.update_layout(title=title, width=1000, height=1000)
    if show:
        fig.show()
    else:
        return fig

# process the documents using the new metrics (Euclidean and Manhattan distances) and visualize the results.
# Euclidean distance
euclidean_scores = compare_documents(corpus, euclidean_distance)
visualize_heatmap(euclidean_scores, "Euclidean Distance Heatmap")

# Manhattan distance
manhattan_scores = compare_documents(corpus, manhattan_distance)
visualize_heatmap(manhattan_scores, "Manhattan Distance Heatmap")

### Q. Analyze and discuss the differences between these metrics and cosine similarity.

**_Answer:_**

- Cosine Similarity
    - Measures the cosine of the angle between two vectors.
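    - In general, values range from -1 (vectors pointing in opposite directions) to 1 (vectors pointing in the same direction). Because TF-IDF weights are non-negative, the scores here fall between 0 (no overlap in weighted words) and 1.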
    - It is not affected by the magnitude of the vectors, only their direction. Thus, it focuses on the pattern of the features (words) rather than their absolute values (frequency).
    - Suitable for text data, where the pattern of the words is more important than their frequency.

- Euclidean Distance
    - Measures the straight-line distance between two points in Euclidean space.
    - Unlike cosine similarity, lower values indicate higher similarity, so every value on the diagonal is 0.
    - It is sensitive to the magnitude of the vectors: the larger the difference in magnitude, the greater the Euclidean distance.
    - Suitable for cases where the magnitude of the features is important and the data is continuous.

- Manhattan Distance
    - Measures the sum of the absolute differences between the coordinates of the two points.
    - Lower values indicate higher similarity, so every value on the diagonal is 0.
    - Also sensitive to the magnitude of the vectors, but because differences are not squared, a single large coordinate difference dominates the total less than it does under Euclidean distance.
    - Suitable for cases where the data is in a grid-like structure and movement is restricted to horizontal and vertical directions.

In the context of text analysis and the current sonnet dataset, cosine similarity might be a more suitable metric for comparing documents as it is less sensitive to the frequency of the words and more focused on their patterns. Euclidean and Manhattan distances might lead to different results, as they consider the word frequencies more heavily.
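
The relationship between Euclidean distance and cosine similarity can be made explicit with a short derivation. Here $u$ and $v$ stand for two TF-IDF vectors and $\theta$ for the angle between them; the symbols are introduced only for this illustration.

$$
\lVert u - v \rVert_2^2 = \lVert u \rVert_2^2 + \lVert v \rVert_2^2 - 2\,\lVert u \rVert_2\,\lVert v \rVert_2 \cos\theta
$$

If both vectors are first scaled to unit length, this reduces to

$$
\lVert u - v \rVert_2^2 = 2\,(1 - \cos\theta),
$$

so on length-normalized vectors Euclidean distance ranks document pairs exactly as cosine similarity does, only in reverse order. On the unnormalized vectors used in this section, the two norm terms remain, which is where the sensitivity to magnitude comes from.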

## h. BM25
Add a function called `bm25` and implement it as described in Section 3.3. Then, again, look at the top 20 words for document 1.txt, and then create a cosine similarity matrix and compare it to the one generated for Part 1.

In [14]:
from typing import List


def bm25(F: List[str], w: str, C: Dict[str, List[str]], q: float = 1.25, b: float = 0.75) -> float:
    """
    Calculate the BM25 score for a word in a given document.

    :param F: A sonnet as a list of words.
    :param w: The word to calculate the score for.
    :param C: A dictionary where the keys are sonnet IDs and the values are the sonnets.
    :param q: A parameter that controls the scaling of the term frequency.
    :param b: A parameter that controls the scaling of the document length.
    :return: The BM25 score for the word in the document.
    """
    # Frequency of word in document F
    TF_F_w = F.count(w)
    
    # Number of words in document F
    F_len = len(F)
    
    # Average size considering all documents in the corpus C
    F_avg = sum([len(doc) for doc in C.values()]) / len(C)
    
    # Number of documents in the corpus that contain w
    n_w = sum(1 for doc in C.values() if w in doc)

    # Calculate the BM25 score: saturating TF component times smoothed IDF
    numerator = TF_F_w * (q + 1)
    denominator = TF_F_w + q * (1 - b + b * (F_len / F_avg))
    idf_w = np.log((len(C) - n_w + 0.5) / (n_w + 0.5) + 1)
    return (numerator / denominator) * idf_w

# Calculate the BM25 scores for document 1 and find the top 20 words
sonnet1_bm25 = {word: bm25(sonnet1, word, corpus) for word in sonnet1}
sonnet1_bm25_top20 = get_top_k(sonnet1_bm25)
print("Sonnet 1 BM25 (Top 20):")
print(sonnet1_bm25_top20)

def bm25_document(corpus: Dict[str, List[str]], doc: List[str]) -> Dict[str, float]:
    """
    Calculate the BM25 scores for all words in a given document.

    :param corpus: A dictionary where the keys are sonnet IDs and the values are the sonnets.
    :param doc: A sonnet as a list of words.
    """
    return {word: bm25(doc, word, corpus) for word in doc}

# precompute the BM25 scores for all documents in the corpus
bm25_corpus = {key: bm25_document(corpus, doc) for key, doc in corpus.items()}

def bm25_similarity_matrix(corpus: Dict[str, List[str]], bm25_corpus: Dict[str, Dict[str, float]]) -> List[List[float]]:
    """
    Create a cosine similarity matrix using the BM25 scores of all documents in the corpus.

    :param corpus: A dictionary where the keys are sonnet IDs and the values are the sonnets.
    :param bm25_corpus: A dictionary where the keys are sonnet IDs and the values are dictionaries of BM25 scores.
    :return: A 2D array where the (i, j)-th entry is the cosine similarity score between the i-th and j-th documents.
    """
    scores = []
    # Iterate in the corpus's own order so that row/column indices line up
    # with the TF-IDF similarity matrix built from corpus.values()
    sonnet_ids = list(corpus.keys())
    for sonnet1 in sonnet_ids:
        row = []
        for sonnet2 in sonnet_ids:
            row.append(cosine_sim(bm25_corpus[sonnet1], bm25_corpus[sonnet2]))
        scores.append(row)
    return scores

bm25_cosine_sim_matrix = bm25_similarity_matrix(corpus, bm25_corpus)
visualize_heatmap(bm25_cosine_sim_matrix, "BM25 Cosine Similarity Heatmap")

Sonnet 1 BM25 (Top 20):
[('worlds', 5.004069494824573), ('feedst', 4.780695749747), ('lights', 4.780695749747), ('selfsubstantial', 4.780695749747), ('fuel', 4.780695749747), ('famine', 4.780695749747), ('foe', 4.780695749747), ('herald', 4.780695749747), ('gaudy', 4.780695749747), ('buriest', 4.780695749747), ('niggarding', 4.780695749747), ('glutton', 4.780695749747), ('tender', 4.484165687621973), ('creatures', 4.25414918967307), ('thereby', 4.25414918967307), ('riper', 4.25414918967307), ('contracted', 4.25414918967307), ('bud', 4.25414918967307), ('content', 4.25414918967307), ('churl', 4.25414918967307)]


### Q. Compare it to the one generated for Part 1.

**_Answer:_**

- TF-IDF
    - TF measures the frequency of a term in a document, while IDF measures the rarity of a term across the entire corpus.
    - One shortcoming of TF-IDF is that it scales term frequency linearly, which may lead to an exaggerated importance of highly frequent terms in a document.

- BM25
    - BM25 is an improvement over TF-IDF and aims to overcome its limitations by introducing non-linear term frequency scaling.
    - In BM25, term frequency scaling is achieved using a formula that incorporates document length normalization and a saturation function. This helps reduce the impact of highly frequent terms in a document while still considering their importance.
    - BM25 uses constants (q and b) to control term frequency scaling and document length normalization, which allows for parameter tuning of the algorithm based on specific needs.

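In summary, both TF-IDF and BM25 measure the importance of terms in a document relative to a corpus. BM25 can be an improvement over TF-IDF due to its non-linear term frequency scaling and document length normalization. The effect is visible in the two top-20 lists for Sonnet 1: "worlds" ranks first under both, but "tender", which ranks second under TF-IDF (6.490386), drops to 13th under BM25 (4.484), below the words that occur only once in the sonnet and nowhere else in the corpus (4.781).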
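
The saturation effect can be isolated by looking only at the term-frequency part of the score. The sketch below uses the same form and default constants as `bm25` above (q = 1.25, b = 0.75); `bm25_tf_component` is a hypothetical helper, and the default `length_ratio` of 1.0 assumes a document of exactly average length.

```python
def bm25_tf_component(tf_count: int, q: float = 1.25, b: float = 0.75,
                      length_ratio: float = 1.0) -> float:
    """TF part of BM25; length_ratio is document length / average length."""
    return tf_count * (q + 1) / (tf_count + q * (1 - b + b * length_ratio))


for count in (1, 2, 4, 8):
    print(count, round(bm25_tf_component(count), 3))
```

For an average-length document, going from one occurrence to two raises the weight by roughly 38% rather than doubling it, and the weight can never exceed q + 1 = 2.25 however often the word is repeated. Under TF-IDF with raw counts, by contrast, the weight grows linearly with the count.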

## i. SBERT
Add a function called `sbert` and implement it as described in Section 3.4. Then, again,
look at the top 20 words for document 1.txt, and then create a cosine similarity matrix and
compare it to the one generated for Part 1 and using `bm25`.

### Q. Compare it to the one generated for Part 1 and using `bm25`.

**_Answer:_**

