## Assignment-2: n-gram TF-IDF and document similarity

**Objective**
- Learn Term Frequency
- Compute document similarity

Tasks
* 1) Task 1: Dataset Preparation: Prepare the Nepali news dataset (hint: you can obtain text from news websites, at least 20 different news of 2/3 different categories).
  Host the dataset in the public git repository.
  In your notebook data should be downloaded from git or some other public places.
  No additional step should be done to get the notebook working.
  You can reuse the dataset of Assignment 1 as well.
* 2) Task 2: Prepare one-gram, bi-gram, tri-gram vocabulary
* 3) Task 3: Compute TF-IDF vectors for each vocabulary
* 4) Task 4: Compute document similarity matrix (if your document size = N, this will result in the N×N matrix) for each vocab list.
* 5) Task 5: Write your interpretation on the result of Task 4.

In [1]:
import datetime

student_rollno = 26
student_name = 'Ram Krishna Pudasaini'
assignment_tag = 'MDS555-2023-Assignment-2'

In [2]:
# from checker_utils import done
def done(task):
    _date = datetime.datetime.now()
    task = task + ": " + str(_date)
    print('='*len(task), '\n', task , '\n', '='*len(task), sep='')
    pass

## Literature Review
In this NLP assignment we reuse the previously created Nepali news dataset to build one-gram, bi-gram and tri-gram vocabularies. Before that, we apply the mandatory preprocessing steps: cleaning the text, tokenizing it and removing stopwords. After preprocessing we build the vocabularies and compute TF-IDF vectors for each of them. Finally, we build a document similarity matrix from those vectors to measure how similar the documents are to one another.

We use several libraries for this. As before, NLTK handles text preprocessing (tokenization and stopword removal), and its utility function `ngrams` converts the tokens into n-gram vocabularies. We then use the `TfidfVectorizer` class from `sklearn.feature_extraction.text` to compute the TF-IDF vectors for each vocabulary, and finally we compute the document similarity matrix to measure the similarity between documents.

The similarity matrices are displayed as pandas DataFrames. Cosine similarity is used to build the document similarity matrix.

## Task 1 Dataset Preparation:
* Prepare the Nepali news dataset (hint: you can
obtain text from news websites, at least 20 different news of 2/3 different
categories).
* Host the dataset in the public git repository.
* In your notebook data should be downloaded from git or some other public
places.
* No additional step should be done to get the notebook working.
* You can reuse the dataset of Assignment 1 as well.

In [3]:
import pandas as pd

#Getting the data from github
github_csv_url = 'https://raw.githubusercontent.com/Rk-Pudasaini/NLP/main/Assignments_NLP/nepali_news_dataset.csv'

# Read the CSV file from GitHub into a DataFrame
df = pd.read_csv(github_csv_url, encoding='utf-8')

df.head()

Unnamed: 0,Input,Category
0,"नेप्से १९६० मा ओर्लियो, सबै सूचक घटे",business
1,"कच्चा तेलको भाउ २०२३ कै उच्च विन्दुमा, नेपालमा...",business
2,साधना र सबैको लघुवित्त मर्जमा जाने,business
3,भारत निर्यात हुने बिजुलीले बढी मूल्य पाउन थाल्यो,business
4,निजी बैंकका कर्मचारीमा पनि श्रम ऐन लागू हुने,business


### Preprocessing
Preprocess the content: remove punctuation and special characters, tokenize the words, remove Nepali stopwords, and store the result of each step in the DataFrame for later use.

In [4]:
import re
import string
import nltk
nltk.download('punkt')
from nltk.tokenize import word_tokenize

# Function to clean text
def clean_text(text):
    # Remove the Devanagari danda and typographic quotes/dashes,
    # which are not covered by string.punctuation
    text = text.replace('।', '')
    text = text.replace('‘', '')
    text = text.replace('’', '')
    text = text.replace('–', '')
    # Lowercase any Latin characters (Devanagari has no letter case)
    text = text.lower()

    # Remove ASCII punctuation and ASCII digits
    text = re.sub(f"[{re.escape(string.punctuation)}0-9]", '', text)

    # Drop any remaining digit characters; str.isdigit() is also True
    # for Devanagari digits such as '१', which the regex above does not match
    cleaned_text = ''.join([char for char in text if char not in string.punctuation and not char.isdigit()])

    return cleaned_text

# Apply the function to the 'Input' column in the DataFrame
df['Cleaned_Text'] = df['Input'].apply(clean_text)
df['Tokenized_Text'] = df['Cleaned_Text'].apply(word_tokenize)

df.head()

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


Unnamed: 0,Input,Category,Cleaned_Text,Tokenized_Text
0,"नेप्से १९६० मा ओर्लियो, सबै सूचक घटे",business,नेप्से मा ओर्लियो सबै सूचक घटे,"[नेप्से, मा, ओर्लियो, सबै, सूचक, घटे]"
1,"कच्चा तेलको भाउ २०२३ कै उच्च विन्दुमा, नेपालमा...",business,कच्चा तेलको भाउ कै उच्च विन्दुमा नेपालमा पनि ...,"[कच्चा, तेलको, भाउ, कै, उच्च, विन्दुमा, नेपालम..."
2,साधना र सबैको लघुवित्त मर्जमा जाने,business,साधना र सबैको लघुवित्त मर्जमा जाने,"[साधना, र, सबैको, लघुवित्त, मर्जमा, जाने]"
3,भारत निर्यात हुने बिजुलीले बढी मूल्य पाउन थाल्यो,business,भारत निर्यात हुने बिजुलीले बढी मूल्य पाउन थाल्यो,"[भारत, निर्यात, हुने, बिजुलीले, बढी, मूल्य, पा..."
4,निजी बैंकका कर्मचारीमा पनि श्रम ऐन लागू हुने,business,निजी बैंकका कर्मचारीमा पनि श्रम ऐन लागू हुने,"[निजी, बैंकका, कर्मचारीमा, पनि, श्रम, ऐन, लागू..."


In [5]:
#import libraries
nltk.download('stopwords')
from nltk.corpus import stopwords

# Build the Nepali stop word set once instead of on every call
stop_words = set(stopwords.words('nepali'))

# Remove stop words
def remove_stopwords(tokens):
    return [token for token in tokens if token.lower() not in stop_words]

df['No_Stopwords'] = df['Tokenized_Text'].apply(remove_stopwords)
# Join the content without stop words and store it in a new column 'Content_No_Stopwords'
df['Content_No_Stopwords'] = df['No_Stopwords'].apply(lambda x: ' '.join(x))

df.head()

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


Unnamed: 0,Input,Category,Cleaned_Text,Tokenized_Text,No_Stopwords,Content_No_Stopwords
0,"नेप्से १९६० मा ओर्लियो, सबै सूचक घटे",business,नेप्से मा ओर्लियो सबै सूचक घटे,"[नेप्से, मा, ओर्लियो, सबै, सूचक, घटे]","[नेप्से, ओर्लियो, सूचक, घटे]",नेप्से ओर्लियो सूचक घटे
1,"कच्चा तेलको भाउ २०२३ कै उच्च विन्दुमा, नेपालमा...",business,कच्चा तेलको भाउ कै उच्च विन्दुमा नेपालमा पनि ...,"[कच्चा, तेलको, भाउ, कै, उच्च, विन्दुमा, नेपालम...","[कच्चा, तेलको, भाउ, कै, उच्च, विन्दुमा, नेपालम...",कच्चा तेलको भाउ कै उच्च विन्दुमा नेपालमा मूल्य...
2,साधना र सबैको लघुवित्त मर्जमा जाने,business,साधना र सबैको लघुवित्त मर्जमा जाने,"[साधना, र, सबैको, लघुवित्त, मर्जमा, जाने]","[साधना, सबैको, लघुवित्त, मर्जमा, जाने]",साधना सबैको लघुवित्त मर्जमा जाने
3,भारत निर्यात हुने बिजुलीले बढी मूल्य पाउन थाल्यो,business,भारत निर्यात हुने बिजुलीले बढी मूल्य पाउन थाल्यो,"[भारत, निर्यात, हुने, बिजुलीले, बढी, मूल्य, पा...","[भारत, निर्यात, बिजुलीले, बढी, मूल्य, पाउन, था...",भारत निर्यात बिजुलीले बढी मूल्य पाउन थाल्यो
4,निजी बैंकका कर्मचारीमा पनि श्रम ऐन लागू हुने,business,निजी बैंकका कर्मचारीमा पनि श्रम ऐन लागू हुने,"[निजी, बैंकका, कर्मचारीमा, पनि, श्रम, ऐन, लागू...","[निजी, बैंकका, कर्मचारीमा, श्रम, ऐन, लागू]",निजी बैंकका कर्मचारीमा श्रम ऐन लागू


In [6]:
done('Task 1')

Task 1: 2023-09-25 09:23:56.198572


##  Task 2: Prepare one-gram, bi-gram, tri-gram vocabulary

A **one-gram vocabulary** consists of the individual words (tokens) in the text data. To build it, we first preprocess the text to remove punctuation and unnecessary characters, then tokenize it and remove the stop words.

A **bi-gram vocabulary** consists of pairs of consecutive words from the text data, and a **tri-gram vocabulary** consists of sequences of three consecutive words.
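
The definitions above can be illustrated without any library. The sketch below is only an illustration, not the code used in this notebook: `sliding_ngrams` is a hypothetical helper name, and the token list is the stopword-free first headline from the dataset.

```python
def sliding_ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    # zip over n shifted copies of the list: tokens[0:], tokens[1:], ...
    return list(zip(*(tokens[i:] for i in range(n))))

# Stopword-free tokens of the first headline
tokens = ["नेप्से", "ओर्लियो", "सूचक", "घटे"]

unigrams = sliding_ngrams(tokens, 1)
bigrams = sliding_ngrams(tokens, 2)
trigrams = sliding_ngrams(tokens, 3)

print(len(unigrams), len(bigrams), len(trigrams))
print(bigrams)
```

A list of k tokens yields k - n + 1 n-grams, so short headlines contribute very few tri-grams.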

In [7]:
#GENERATE N GRAMS FROM THE CLEANED TEXT

from nltk.util import ngrams

# Function to generate n-grams from a list of tokens
def generate_ngrams(tokens, n):
    return list(ngrams(tokens, n))

# Create one-gram, bi-gram, and tri-gram columns in the data frame
df['one_gram'] = df['No_Stopwords'].apply(lambda x: generate_ngrams(x, 1))
df['bi_gram'] = df['No_Stopwords'].apply(lambda x: generate_ngrams(x, 2))
df['tri_gram'] = df['No_Stopwords'].apply(lambda x: generate_ngrams(x, 3))

# Create one-gram, bi-gram, and tri-gram vocabularies
one_gram_vocab = set([word for sublist in df['one_gram'].tolist() for word in sublist])
bi_gram_vocab = set([word for sublist in df['bi_gram'].tolist() for word in sublist])
tri_gram_vocab = set([word for sublist in df['tri_gram'].tolist() for word in sublist])

# Print the vocabularies
print("One-gram Vocabulary:")
print(one_gram_vocab)
print("\nBi-gram Vocabulary:")
print(bi_gram_vocab)
print("\nTri-gram Vocabulary:")
print(tri_gram_vocab)


One-gram Vocabulary:
{('एसिया',), ('बास्केटबल',), ('स्पेसएक्सबीच',), ('मोन्टीले',), ('मिनेटमा',), ('सिनेमा',), ('वाणिज्य',), ('समेटिएका',), ('सेमिफाइनलमाकान्तिपुर',), ('हुनेछन्',), ('सबैको',), ('निक्षेप',), ('साताको',), ('विसं',), ('सक्ने',), ('कामादाले',), ('रैथाने',), ('भूमि',), ('संघर्षपूर्ण',), ('कोइरालाकान्तिपुर',), ('तामाङपाककलासम्बन्धी',), ('भारतपाक',), ('बताएकी',), ('मनोभावना',), ('एकाएक',), ('रातोपहेंलो',), ('वाटिका',), ('तिथि',), ('सोमबार',), ('इन्ज्युरी',), ('पाककलासम्बन्धी',), ('विभेदविरुद्धकी',), ('ग्लोबलस्टार',), ('अवसरमा',), ('उद्घाटन',), ('बेहोरेको',), ('बुढासुब्बा',), ('पहिरोको',), ('युरिया',), ('कलिङ',), ('दशकसम्म',), ('थुप्रै',), ('भएपछि',), ('विभेदविरुद्ध',), ('खेलाडीले',), ('बढ्यो',), ('महासंघको',), ('हिले',), ('शो',), ('सञ्चार',), ('सिरी',), ('थारुगाईजात्राको',), ('नेपाल',), ('इनोभेटिभले',), ('बिमाकान्तिपुर',), ('स्ट्राइकरले',), ('प्रशिक्षक',), ('मोन्टी',), ('गोल',), ('पाण्डेनेपाल',), ('रियालिटी',), ('मायाका',), ('ब्राइटनले',), ('रचनाको',), ('खजना',), ('हाले',), (

In [8]:
df.head()

Unnamed: 0,Input,Category,Cleaned_Text,Tokenized_Text,No_Stopwords,Content_No_Stopwords,one_gram,bi_gram,tri_gram
0,"नेप्से १९६० मा ओर्लियो, सबै सूचक घटे",business,नेप्से मा ओर्लियो सबै सूचक घटे,"[नेप्से, मा, ओर्लियो, सबै, सूचक, घटे]","[नेप्से, ओर्लियो, सूचक, घटे]",नेप्से ओर्लियो सूचक घटे,"[(नेप्से,), (ओर्लियो,), (सूचक,), (घटे,)]","[(नेप्से, ओर्लियो), (ओर्लियो, सूचक), (सूचक, घटे)]","[(नेप्से, ओर्लियो, सूचक), (ओर्लियो, सूचक, घटे)]"
1,"कच्चा तेलको भाउ २०२३ कै उच्च विन्दुमा, नेपालमा...",business,कच्चा तेलको भाउ कै उच्च विन्दुमा नेपालमा पनि ...,"[कच्चा, तेलको, भाउ, कै, उच्च, विन्दुमा, नेपालम...","[कच्चा, तेलको, भाउ, कै, उच्च, विन्दुमा, नेपालम...",कच्चा तेलको भाउ कै उच्च विन्दुमा नेपालमा मूल्य...,"[(कच्चा,), (तेलको,), (भाउ,), (कै,), (उच्च,), (...","[(कच्चा, तेलको), (तेलको, भाउ), (भाउ, कै), (कै,...","[(कच्चा, तेलको, भाउ), (तेलको, भाउ, कै), (भाउ, ..."
2,साधना र सबैको लघुवित्त मर्जमा जाने,business,साधना र सबैको लघुवित्त मर्जमा जाने,"[साधना, र, सबैको, लघुवित्त, मर्जमा, जाने]","[साधना, सबैको, लघुवित्त, मर्जमा, जाने]",साधना सबैको लघुवित्त मर्जमा जाने,"[(साधना,), (सबैको,), (लघुवित्त,), (मर्जमा,), (...","[(साधना, सबैको), (सबैको, लघुवित्त), (लघुवित्त,...","[(साधना, सबैको, लघुवित्त), (सबैको, लघुवित्त, म..."
3,भारत निर्यात हुने बिजुलीले बढी मूल्य पाउन थाल्यो,business,भारत निर्यात हुने बिजुलीले बढी मूल्य पाउन थाल्यो,"[भारत, निर्यात, हुने, बिजुलीले, बढी, मूल्य, पा...","[भारत, निर्यात, बिजुलीले, बढी, मूल्य, पाउन, था...",भारत निर्यात बिजुलीले बढी मूल्य पाउन थाल्यो,"[(भारत,), (निर्यात,), (बिजुलीले,), (बढी,), (मू...","[(भारत, निर्यात), (निर्यात, बिजुलीले), (बिजुली...","[(भारत, निर्यात, बिजुलीले), (निर्यात, बिजुलीले..."
4,निजी बैंकका कर्मचारीमा पनि श्रम ऐन लागू हुने,business,निजी बैंकका कर्मचारीमा पनि श्रम ऐन लागू हुने,"[निजी, बैंकका, कर्मचारीमा, पनि, श्रम, ऐन, लागू...","[निजी, बैंकका, कर्मचारीमा, श्रम, ऐन, लागू]",निजी बैंकका कर्मचारीमा श्रम ऐन लागू,"[(निजी,), (बैंकका,), (कर्मचारीमा,), (श्रम,), (...","[(निजी, बैंकका), (बैंकका, कर्मचारीमा), (कर्मचा...","[(निजी, बैंकका, कर्मचारीमा), (बैंकका, कर्मचारी..."


In [9]:
done('Task 2')

Task 2: 2023-09-25 09:23:56.317568


## Task 3: Compute TF-IDF vectors for each vocabulary

TF-IDF stands for "Term Frequency-Inverse Document Frequency." TF-IDF vectors are numerical representations of text documents that capture how important each term is to a document relative to the whole collection of documents (the corpus): a term scores high when it is frequent in the document but rare across the corpus.
Here we calculate TF-IDF vectors for each vocabulary using the `TfidfVectorizer` class from Python's scikit-learn library.
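
To make the weighting concrete, here is a minimal pure-Python sketch of the formula that scikit-learn documents for its default settings: raw term counts, a smoothed IDF of `ln((1 + n) / (1 + df)) + 1`, and L2 normalisation of each document vector. The function name `tfidf_vectors` and the two-document toy corpus are assumptions made for illustration; they are not part of the assignment data.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """docs: list of token lists. Returns (vocab, L2-normalised TF-IDF vectors)."""
    vocab = sorted({term for doc in docs for term in doc})
    n_docs = len(docs)
    # Document frequency: number of documents containing each term
    df = Counter(term for doc in docs for term in set(doc))
    # Smoothed inverse document frequency
    idf = {term: math.log((1 + n_docs) / (1 + df[term])) + 1 for term in vocab}
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        raw = [counts[term] * idf[term] for term in vocab]
        norm = math.sqrt(sum(x * x for x in raw)) or 1.0
        vectors.append([x / norm for x in raw])
    return vocab, vectors

# Toy corpus (hypothetical): "market" occurs in both documents
toy_docs = [["market", "index", "falls"], ["market", "price", "rises"]]
vocab, vectors = tfidf_vectors(toy_docs)

for term, weight in zip(vocab, vectors[0]):
    print(term, round(weight, 3))
```

In the first toy document, the shared term "market" receives a lower weight than "index" or "falls", which occur in only one document.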

In [10]:
#COMPUTE TF-IDF VECTORS FOR ONE-GRAM, BI-GRAM AND TRI-GRAM VOCABULARIES

#Compute TF-IDF vectors for each vocabulary
from sklearn.feature_extraction.text import TfidfVectorizer

# Create a TF-IDF vectorizer for one-gram, bi-gram, tri-gram vocabulary
tfidf_vectorizer_onegram = TfidfVectorizer(ngram_range=(1, 1))
tfidf_vectorizer_bigram = TfidfVectorizer(ngram_range=(2, 2))
tfidf_vectorizer_trigram = TfidfVectorizer(ngram_range=(3, 3))

# Fit and transform the text data for one-gram, bi-gram, tri-gram vocabulary
tfidf_vectors_onegram = tfidf_vectorizer_onegram.fit_transform(df['Content_No_Stopwords'])
tfidf_vectors_bigram = tfidf_vectorizer_bigram.fit_transform(df['Content_No_Stopwords'])
tfidf_vectors_trigram = tfidf_vectorizer_trigram.fit_transform(df['Content_No_Stopwords'])

# Convert TF-IDF vectors to a DataFrame for one-gram, bi-gram, tri-gram vocabulary
tfidf_df_onegram = pd.DataFrame(tfidf_vectors_onegram.toarray(), columns=tfidf_vectorizer_onegram.get_feature_names_out())
tfidf_df_bigram = pd.DataFrame(tfidf_vectors_bigram.toarray(), columns=tfidf_vectorizer_bigram.get_feature_names_out())
tfidf_df_trigram = pd.DataFrame(tfidf_vectors_trigram.toarray(), columns=tfidf_vectorizer_trigram.get_feature_names_out())

# Print the TF-IDF DataFrames for each vocabulary
print("TF-IDF DataFrame for One-gram Vocabulary:")
print(tfidf_df_onegram.head())

print("\nTF-IDF DataFrame for Bi-gram Vocabulary:")
print(tfidf_df_bigram.head())

print("\nTF-IDF DataFrame for Tri-gram Vocabulary:")
print(tfidf_df_trigram.head())



TF-IDF DataFrame for One-gram Vocabulary:
    अध   अन  अनप   अप  अपर   अफ   अभ   अम   अर   अल  ...  सलल   सव  सवक   सस  \
0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  ...  0.0  0.0  0.0  0.0   
1  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  ...  0.0  0.0  0.0  0.0   
2  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  ...  0.0  0.0  0.0  0.0   
3  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  ...  0.0  0.0  0.0  0.0   
4  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  ...  0.0  0.0  0.0  0.0   

    सह  सहय   हक   हर   हल  हलम  
0  0.0  0.0  0.0  0.0  0.0  0.0  
1  0.0  0.0  0.0  0.0  0.0  0.0  
2  0.0  0.0  0.0  0.0  0.0  0.0  
3  0.0  0.0  0.0  0.0  0.0  0.0  
4  0.0  0.0  0.0  0.0  0.0  0.0  

[5 rows x 347 columns]

TF-IDF DataFrame for Bi-gram Vocabulary:
   अध धनक  अध यक  अन इर  अन खजन  अन तर  अन तरव  अन पछ  अन बन  अन भव  अनप पम  \
0     0.0    0.0    0.0     0.0    0.0     0.0    0.0    0.0    0.0     0.0   
1     0.0    0.0    0.0     0.0    0.0     0.0   
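
The column names in the output above (for example `अध` and `यक`) look like fragments rather than whole Nepali words. A likely cause is the default `token_pattern` of `TfidfVectorizer`, `(?u)\b\w\w+\b`: in Python's `re` module, `\w` does not match Devanagari combining marks (vowel signs and the virama), so words are split at those characters. The sketch below reproduces the effect with `re` alone. It is a diagnostic under that assumption, not a change to the notebook's results.

```python
import re

# Default token_pattern documented for scikit-learn's text vectorizers
DEFAULT_TOKEN_PATTERN = r"(?u)\b\w\w+\b"

headline = "नेप्से ओर्लियो सूचक घटे"

# What the default pattern extracts from the cleaned headline
default_tokens = re.findall(DEFAULT_TOKEN_PATTERN, headline)

# What a plain whitespace split gives (the text is already tokenized and cleaned)
whitespace_tokens = headline.split()

print("default pattern :", default_tokens)
print("whitespace split:", whitespace_tokens)
```

One possible remedy, not applied in this notebook, would be to pass a whitespace-based pattern such as `token_pattern=r"\S+"` to `TfidfVectorizer`, so that the vectorizer's terms correspond to the NLTK tokens used for the vocabularies in Task 2. The TF-IDF and similarity values shown in this notebook would change if that were done.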

In [11]:
df.head()

Unnamed: 0,Input,Category,Cleaned_Text,Tokenized_Text,No_Stopwords,Content_No_Stopwords,one_gram,bi_gram,tri_gram
0,"नेप्से १९६० मा ओर्लियो, सबै सूचक घटे",business,नेप्से मा ओर्लियो सबै सूचक घटे,"[नेप्से, मा, ओर्लियो, सबै, सूचक, घटे]","[नेप्से, ओर्लियो, सूचक, घटे]",नेप्से ओर्लियो सूचक घटे,"[(नेप्से,), (ओर्लियो,), (सूचक,), (घटे,)]","[(नेप्से, ओर्लियो), (ओर्लियो, सूचक), (सूचक, घटे)]","[(नेप्से, ओर्लियो, सूचक), (ओर्लियो, सूचक, घटे)]"
1,"कच्चा तेलको भाउ २०२३ कै उच्च विन्दुमा, नेपालमा...",business,कच्चा तेलको भाउ कै उच्च विन्दुमा नेपालमा पनि ...,"[कच्चा, तेलको, भाउ, कै, उच्च, विन्दुमा, नेपालम...","[कच्चा, तेलको, भाउ, कै, उच्च, विन्दुमा, नेपालम...",कच्चा तेलको भाउ कै उच्च विन्दुमा नेपालमा मूल्य...,"[(कच्चा,), (तेलको,), (भाउ,), (कै,), (उच्च,), (...","[(कच्चा, तेलको), (तेलको, भाउ), (भाउ, कै), (कै,...","[(कच्चा, तेलको, भाउ), (तेलको, भाउ, कै), (भाउ, ..."
2,साधना र सबैको लघुवित्त मर्जमा जाने,business,साधना र सबैको लघुवित्त मर्जमा जाने,"[साधना, र, सबैको, लघुवित्त, मर्जमा, जाने]","[साधना, सबैको, लघुवित्त, मर्जमा, जाने]",साधना सबैको लघुवित्त मर्जमा जाने,"[(साधना,), (सबैको,), (लघुवित्त,), (मर्जमा,), (...","[(साधना, सबैको), (सबैको, लघुवित्त), (लघुवित्त,...","[(साधना, सबैको, लघुवित्त), (सबैको, लघुवित्त, म..."
3,भारत निर्यात हुने बिजुलीले बढी मूल्य पाउन थाल्यो,business,भारत निर्यात हुने बिजुलीले बढी मूल्य पाउन थाल्यो,"[भारत, निर्यात, हुने, बिजुलीले, बढी, मूल्य, पा...","[भारत, निर्यात, बिजुलीले, बढी, मूल्य, पाउन, था...",भारत निर्यात बिजुलीले बढी मूल्य पाउन थाल्यो,"[(भारत,), (निर्यात,), (बिजुलीले,), (बढी,), (मू...","[(भारत, निर्यात), (निर्यात, बिजुलीले), (बिजुली...","[(भारत, निर्यात, बिजुलीले), (निर्यात, बिजुलीले..."
4,निजी बैंकका कर्मचारीमा पनि श्रम ऐन लागू हुने,business,निजी बैंकका कर्मचारीमा पनि श्रम ऐन लागू हुने,"[निजी, बैंकका, कर्मचारीमा, पनि, श्रम, ऐन, लागू...","[निजी, बैंकका, कर्मचारीमा, श्रम, ऐन, लागू]",निजी बैंकका कर्मचारीमा श्रम ऐन लागू,"[(निजी,), (बैंकका,), (कर्मचारीमा,), (श्रम,), (...","[(निजी, बैंकका), (बैंकका, कर्मचारीमा), (कर्मचा...","[(निजी, बैंकका, कर्मचारीमा), (बैंकका, कर्मचारी..."


In [12]:
done('Task 3')

Task 3: 2023-09-25 09:23:56.484523


## Task 4: Compute document similarity matrix
If the corpus contains N documents, this results in an N×N matrix for each vocabulary list.
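
As a reference for what the matrix contains, the following is a minimal sketch of cosine similarity in plain Python: the dot product of two vectors divided by the product of their lengths. The helper names `cosine` and `similarity_matrix` and the three toy vectors are assumptions for illustration only; the notebook itself uses scikit-learn's `cosine_similarity`.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # treat an empty document as similar to nothing
    return dot / (norm_u * norm_v)

def similarity_matrix(vectors):
    """N vectors in, N x N matrix of pairwise cosine similarities out."""
    return [[cosine(u, v) for v in vectors] for u in vectors]

# Toy vectors (hypothetical): documents 0 and 1 share no terms
toy_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
matrix = similarity_matrix(toy_vectors)

for row in matrix:
    print([round(x, 3) for x in row])
```

The diagonal is always 1 for non-empty documents, and the matrix is symmetric, so only the entries above the diagonal carry new information.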

In [13]:
from sklearn.metrics.pairwise import cosine_similarity

# Calculate cosine similarity matrices for each vocabulary
cosine_sim_onegram = cosine_similarity(tfidf_vectors_onegram)
cosine_sim_bigram = cosine_similarity(tfidf_vectors_bigram)
cosine_sim_trigram = cosine_similarity(tfidf_vectors_trigram)

# Convert the similarity matrices to DataFrames
cosine_sim_onegram_df = pd.DataFrame(cosine_sim_onegram, columns=df.index, index=df.index)
cosine_sim_bigram_df = pd.DataFrame(cosine_sim_bigram, columns=df.index, index=df.index)
cosine_sim_trigram_df = pd.DataFrame(cosine_sim_trigram, columns=df.index, index=df.index)

cosine_sim_onegram_df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,24,25,26,27,28,29,30,31,32,33
0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.067731,0.08996,0.0
1,0.0,1.0,0.0,0.266849,0.0,0.0,0.068051,0.0,0.0,0.024657,...,0.0,0.0,0.048293,0.026443,0.085572,0.0392,0.04112,0.05847,0.0,0.0
2,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.177355,0.053678,0.0,...,0.0,0.0,0.135516,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.266849,0.0,1.0,0.0,0.072424,0.198414,0.0,0.0,0.0,...,0.0,0.0,0.045644,0.0,0.167006,0.0,0.0,0.098459,0.0,0.158216
4,0.0,0.0,0.0,0.0,1.0,0.235786,0.0,0.0,0.0,0.0,...,0.0,0.163317,0.0,0.0,0.0,0.205981,0.0,0.0,0.0,0.0


In [14]:
cosine_sim_bigram_df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,24,25,26,27,28,29,30,31,32,33
0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.06251,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [15]:
cosine_sim_trigram_df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,24,25,26,27,28,29,30,31,32,33
0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [16]:
done('Task 4')

Task 4: 2023-09-25 09:23:56.626383


## Task 5: Write your interpretation on the result of Task 4.

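The resulting cosine similarity matrices (`cosine_sim_onegram`, `cosine_sim_bigram`, and `cosine_sim_trigram`) give the pairwise similarity between documents based on their one-gram, bi-gram, and tri-gram TF-IDF representations, respectively. Each matrix has 1.0 on the diagonal, because every document is identical to itself, and is symmetric.

In the rows displayed above, the one-gram matrix has several non-zero off-diagonal entries (for example, 0.266849 between documents 1 and 3). The bi-gram matrix is almost entirely zero off the diagonal: 0.06251 between documents 4 and 25 is the only non-zero value shown. The tri-gram matrix shows no non-zero off-diagonal value at all. This pattern is expected for short news texts: the longer the n-gram, the less likely it is that two different documents contain exactly the same sequence of terms, so the measured similarity drops towards zero.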
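
When interpreting such a matrix, it can help to list the most similar document pairs rather than scan the full table. The sketch below shows one way to do that. The helper name `top_pairs` is hypothetical, and the 3×3 matrix holds made-up values for illustration, not results from this notebook.

```python
def top_pairs(sim, k=3):
    """Return the k most similar (i, j, score) pairs with i < j."""
    n = len(sim)
    # Only look above the diagonal: the matrix is symmetric and sim[i][i] is 1
    pairs = [(sim[i][j], i, j) for i in range(n) for j in range(i + 1, n)]
    pairs.sort(reverse=True)
    return [(i, j, score) for score, i, j in pairs[:k]]

# Made-up similarity matrix for three documents
toy_sim = [
    [1.0, 0.2, 0.0],
    [0.2, 1.0, 0.5],
    [0.0, 0.5, 1.0],
]

best = top_pairs(toy_sim, k=2)
print(best)
```

With the notebook's DataFrames, a nested list of this shape could be obtained with, for example, `cosine_sim_onegram_df.values.tolist()`; the resulting pairs can then be compared against the `Category` column to check whether similar documents belong to the same category.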

In [17]:
done('Task 5')

Task 5: 2023-09-25 09:23:56.637979
