# Introduction
Document similarity measures how alike two or more texts are in content. It underpins applications such as information retrieval, clustering, and recommendation systems. In this notebook, we use **Cosine Similarity** over TF-IDF vectors as our primary method for measuring the similarity between documents in the 20 Newsgroups dataset.
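
As a quick illustration of the measure itself, cosine similarity is the dot product of two vectors divided by the product of their lengths, so it compares direction rather than magnitude. The sketch below is a minimal NumPy version on made-up term-count vectors; the helper name `cosine_sim`, the three-word vocabulary, and the zero-vector convention are assumptions for illustration and are not part of the notebook's pipeline.

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine of the angle between two 1-D vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    # Assumed convention: an all-zero vector has similarity 0 with everything.
    return float(a @ b / denom) if denom else 0.0

# Hypothetical term counts over the vocabulary ["hockey", "puck", "disk"]
doc_a = [2, 1, 0]
doc_b = [4, 2, 0]   # same direction as doc_a, twice as long
doc_c = [0, 0, 3]   # shares no terms with doc_a

sim_ab = cosine_sim(doc_a, doc_b)   # close to 1: same term proportions
sim_ac = cosine_sim(doc_a, doc_c)   # 0: no overlap
print(sim_ab, sim_ac)
```

Because only the direction matters, a long post and a short post with the same mix of terms score as highly similar.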

# Steps

## 1. Importing necessary libraries

In [1]:
import re

import nltk
import pandas as pd
from gensim.parsing.preprocessing import (
    preprocess_string,
    strip_multiple_whitespaces,
    strip_numeric,
    strip_punctuation,
    strip_tags,
)
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Download the NLTK resources used by the tokenizer, stop-word list, and lemmatizer
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

## 2. Loading the dataset

In [2]:
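# Fetch all posts; strip headers, footers, and quoted replies so that
# similarity reflects the message body rather than metadata
newsgroups_data = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))
print(newsgroups_data.target_names)  # List of 20 categories
# print(newsgroups_data.data[0])     # Uncomment to inspect the first post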

['alt.atheism', 'comp.graphics', 'comp.os.ms-windows.misc', 'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware', 'comp.windows.x', 'misc.forsale', 'rec.autos', 'rec.motorcycles', 'rec.sport.baseball', 'rec.sport.hockey', 'sci.crypt', 'sci.electronics', 'sci.med', 'sci.space', 'soc.religion.christian', 'talk.politics.guns', 'talk.politics.mideast', 'talk.politics.misc', 'talk.religion.misc']


## 3. Creating a DataFrame

In [3]:
# Create DataFrame
data_df = pd.DataFrame({
    'text': newsgroups_data.data,       # Raw post text
    'category': newsgroups_data.target  # Integer category label
})
# Map category integers to their respective names
data_df['category_name'] = data_df['category'].apply(lambda x: newsgroups_data.target_names[x])

# Show the first few rows
data_df.head()

| | text | category | category_name |
|---|---|---|---|
| 0 | \n\nI am sure some bashers of Pens fans are pr... | 10 | rec.sport.hockey |
| 1 | My brother is in the market for a high-perform... | 3 | comp.sys.ibm.pc.hardware |
| 2 | \n\n\n\n\tFinally you said what you dream abou... | 17 | talk.politics.mideast |
| 3 | \nThink!\n\nIt's the SCSI card doing the DMA t... | 3 | comp.sys.ibm.pc.hardware |
| 4 | 1) I have an old Jasmine drive which I cann... | 4 | comp.sys.mac.hardware |


In [4]:
data_df.text[0]

"\n\nI am sure some bashers of Pens fans are pretty confused about the lack\nof any kind of posts about the recent Pens massacre of the Devils. Actually,\nI am  bit puzzled too and a bit relieved. However, I am going to put an end\nto non-PIttsburghers' relief with a bit of praise for the Pens. Man, they\nare killing those Devils worse than I thought. Jagr just showed you why\nhe is much better than his regular season stats. He is also a lot\nfo fun to watch in the playoffs. Bowman should let JAgr have a lot of\nfun in the next couple of games since the Pens are going to beat the pulp out of Jersey anyway. I was very disappointed not to see the Islanders lose the final\nregular season game.          PENS RULE!!!\n\n"

## 4. Data Cleaning & Text Preprocessing

In [5]:
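# Build these once; recreating them for every document is wasteful
lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))

def preprocess_text(text):
    # Gensim filters, applied in order
    filters = [
        lambda x: x.lower(),           # Lowercase
        strip_tags,                    # Remove HTML tags
        strip_numeric,                 # Remove digits
        strip_punctuation,             # Replace punctuation with spaces
        strip_multiple_whitespaces     # Collapse repeated whitespace
    ]
    text = ' '.join(preprocess_string(text, filters=filters))

    # NLTK preprocessing: tokenize, lemmatize, then drop stop words
    words = word_tokenize(text)
    lemmatized_words = [lemmatizer.lemmatize(word) for word in words]

    # Note: lemmatizing before filtering lets some stop words slip through
    # (e.g. "was" becomes "wa", which is not in the stop-word list)
    filtered_words = [word for word in lemmatized_words if word not in stop_words]

    return ' '.join(filtered_words)

# Apply preprocessing steps to the 'text' column
data_df['text_Preprocessed'] = data_df['text'].apply(preprocess_text)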
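
To make the effect of the cleaning filters easier to see, here is a rough standard-library approximation of the same steps applied to one short string. It is only a sketch: `toy_clean` and `TOY_STOP_WORDS` are hypothetical names, the regular expressions merely imitate the gensim filters, the stop-word set is a tiny stand-in for NLTK's English list, and lemmatization is left out entirely.

```python
import re
import string

# Hypothetical, tiny stop-word set for illustration only
TOY_STOP_WORDS = {"i", "am", "the", "of", "a", "is", "it"}

def toy_clean(text):
    text = text.lower()                               # lowercase
    text = re.sub(r"<[^>]+>", " ", text)              # drop HTML-like tags
    text = re.sub(r"\d+", " ", text)                  # drop digits
    punct = f"[{re.escape(string.punctuation)}]"
    text = re.sub(punct, " ", text)                   # punctuation -> space
    tokens = text.split()                             # also collapses whitespace
    return " ".join(t for t in tokens if t not in TOY_STOP_WORDS)

cleaned = toy_clean("<b>PENS RULE!!!</b>\n\nI am sure the 2 goals were fun.")
print(cleaned)
```

Running it prints `pens rule sure goals were fun`: markup, digits, punctuation, and the listed stop words are gone, and only lowercase content words remain.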


In [6]:
# Drop original text column
data_df = data_df.drop(columns='text')

In [7]:
data_df.head()

| | category | category_name | text_Preprocessed |
|---|---|---|---|
| 0 | 10 | rec.sport.hockey | sure bashers pen fan pretty confused lack kind... |
| 1 | 3 | comp.sys.ibm.pc.hardware | brother market high performance video card sup... |
| 2 | 17 | talk.politics.mideast | finally said dream mediterranean wa new area g... |
| 3 | 3 | comp.sys.ibm.pc.hardware | think scsi card dma transfer disk scsi card dm... |
| 4 | 4 | comp.sys.mac.hardware | old jasmine drive use new system understanding... |


In [8]:
data_df.columns

Index(['category', 'category_name', 'text_Preprocessed'], dtype='object')

## 5. Feature Extraction using TF-IDF

In [9]:
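# Keep only the 5,000 most frequent unigrams across the corpus
vectorizer = TfidfVectorizer(max_features=5000, ngram_range=(1, 1))
tfidf_matrix = vectorizer.fit_transform(data_df['text_Preprocessed'])  # Sparse matrix
# Dense copy for display only; 18,846 x 5,000 float64 values need roughly 750 MB
tfidf_df = pd.DataFrame(tfidf_matrix.toarray(), columns=vectorizer.get_feature_names_out())
print(tfidf_df.head())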

    aa  aaron   ab  abc  abiding   ability  able  abortion  absence  absolute  \
0  0.0    0.0  0.0  0.0      0.0  0.000000   0.0       0.0      0.0       0.0   
1  0.0    0.0  0.0  0.0      0.0  0.000000   0.0       0.0      0.0       0.0   
2  0.0    0.0  0.0  0.0      0.0  0.000000   0.0       0.0      0.0       0.0   
3  0.0    0.0  0.0  0.0      0.0  0.067664   0.0       0.0      0.0       0.0   
4  0.0    0.0  0.0  0.0      0.0  0.000000   0.0       0.0      0.0       0.0   

   ...   ze  zealand  zei  zero  zionism  zionist  zip  zone  zoom   zx  
0  ...  0.0      0.0  0.0   0.0      0.0      0.0  0.0   0.0   0.0  0.0  
1  ...  0.0      0.0  0.0   0.0      0.0      0.0  0.0   0.0   0.0  0.0  
2  ...  0.0      0.0  0.0   0.0      0.0      0.0  0.0   0.0   0.0  0.0  
3  ...  0.0      0.0  0.0   0.0      0.0      0.0  0.0   0.0   0.0  0.0  
4  ...  0.0      0.0  0.0   0.0      0.0      0.0  0.0   0.0   0.0  0.0  

[5 rows x 5000 columns]


## 6. Calculating Cosine Similarity

In [10]:
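# Pairwise similarity between all documents: a dense 18,846 x 18,846 matrix
# (roughly 2.8 GB as float64)
cosine_sim = cosine_similarity(tfidf_matrix)
cosine_sim_df = pd.DataFrame(cosine_sim, index=data_df.index, columns=data_df.index)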

In [11]:
cosine_sim_df.head()

| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 18836 | 18837 | 18838 | 18839 | 18840 | 18841 | 18842 | 18843 | 18844 | 18845 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.0 | 0.007907 | 0.052631 | 0.0 | 0.0 | 0.010961 | 0.0 | 0.06308 | 0.007643 | 0.00237 | ... | 0.053269 | 0.0 | 0.021328 | 0.004695 | 0.034655 | 0.017693 | 0.0 | 0.011421 | 0.006884 | 0.019153 |
| 1 | 0.007907 | 1.0 | 0.0 | 0.150919 | 0.034603 | 0.0317 | 0.031316 | 0.0 | 0.008909 | 0.0 | ... | 0.019442 | 0.0 | 0.008431 | 0.013821 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.050957 |
| 2 | 0.052631 | 0.0 | 1.0 | 0.003837 | 0.014205 | 0.002583 | 0.049734 | 0.007678 | 0.041863 | 0.068258 | ... | 0.084711 | 0.019511 | 0.065632 | 0.041922 | 0.036146 | 0.029691 | 0.01731 | 0.002729 | 0.0 | 0.023256 |
| 3 | 0.0 | 0.150919 | 0.003837 | 1.0 | 0.05595 | 0.004544 | 0.0 | 0.004021 | 0.007265 | 0.00501 | ... | 0.002342 | 0.0 | 0.008647 | 0.013568 | 0.005027 | 0.003677 | 0.006469 | 0.0 | 0.0 | 0.021086 |
| 4 | 0.0 | 0.034603 | 0.014205 | 0.05595 | 1.0 | 0.020244 | 0.043863 | 0.007341 | 0.018427 | 0.0 | ... | 0.0 | 0.019069 | 0.014163 | 0.02009 | 0.030402 | 0.008214 | 0.013722 | 0.010422 | 0.0 | 0.073181 |
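
The full matrix holds 18,846 x 18,846 scores, which is more than many tasks need. When the goal is to find the nearest neighbours of one document, it is enough to compare a single row against the rest. The following is a minimal sketch of that idea on a toy corpus; `top_k_similar` and the example documents are hypothetical and not part of the original notebook.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def top_k_similar(matrix, query_idx, k=3):
    """Return (index, score) pairs for the k rows most similar to row query_idx."""
    # One row against all rows: a 1 x n_docs result instead of n_docs x n_docs
    scores = cosine_similarity(matrix[query_idx], matrix).ravel()
    scores[query_idx] = -1.0                  # exclude the query document itself
    best = np.argsort(scores)[::-1][:k]       # indices of the k highest scores
    return [(int(i), float(scores[i])) for i in best]

# Hypothetical, already-preprocessed toy documents
toy_docs = [
    "scsi card dma transfer disk",
    "scsi disk controller card",
    "hockey playoff pen fan",
    "hockey fan watch playoff game",
]
toy_tfidf = TfidfVectorizer().fit_transform(toy_docs)
neighbours = top_k_similar(toy_tfidf, query_idx=0, k=2)
print(neighbours)
```

In the notebook, the equivalent call would be something like `top_k_similar(tfidf_matrix, 0, k=5)`, which returns the five posts closest to the first one without allocating the full similarity matrix.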


# Conclusion
In this notebook, we used Cosine Similarity over TF-IDF vectors to measure how closely documents in the 20 Newsgroups dataset relate to one another. Cosine similarity in general ranges from -1 to +1, but because TF-IDF weights are non-negative, the scores here fall between 0 and 1. A score of 1 means two documents have the same term-weight profile, and 0 means they share no vocabulary terms. The resulting similarity matrix can serve as a basis for identifying clusters of similar texts and potential duplicates, and for information retrieval tasks.
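
As one possible follow-up, potential duplicates can be flagged by thresholding the pairwise scores. The sketch below does this on a three-document toy corpus; the documents and the `THRESHOLD` value of 0.8 are assumptions for illustration, and a suitable cut-off for the real dataset would need to be chosen by inspection. Because TF-IDF weights are non-negative, every score it produces lies between 0 and 1.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical toy documents
toy_docs = [
    "pens beat devils in playoff game",
    "pens beat devils in the playoff game",   # near-duplicate of the first
    "old jasmine drive new system",
]
sim = cosine_similarity(TfidfVectorizer().fit_transform(toy_docs))

THRESHOLD = 0.8   # assumed cut-off; tune for the corpus at hand
# Upper triangle only, skipping the diagonal, so each pair appears once
rows, cols = np.triu_indices_from(sim, k=1)
pairs = [(int(i), int(j), float(sim[i, j]))
         for i, j in zip(rows, cols) if sim[i, j] >= THRESHOLD]
print(pairs)
```

Only the pair (0, 1) passes the threshold here; the third document shares no terms with the others and scores 0 against both.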

