You are a data scientist working for a political consulting firm. You are given a dataset in Twitter_Data.csv. This dataset has the following two columns:
+ clean_text: Tweets extracted from Twitter, mainly about Modi (a 2019 Indian prime ministerial candidate) and other prime ministerial candidates.
+ category: The sentiment of the respective tweet, taking one of three values: -1, 0, or 1.

Data source: https://www.kaggle.com/cosmos98/twitter-and-reddit-sentimental-analysis-dataset?select=Twitter_Data.csv

In [1]:
import pandas as pd
import spacy
import numpy as np
from numpy import dot

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

from scipy.sparse.linalg import norm 
from scipy import spatial

### Q1. Load the dataset of Twitter_Data.csv into memory.

In [2]:
Twitter_Data = pd.read_csv('Twitter_Data.csv')
Twitter_Data.head(10)

Unnamed: 0,clean_text,category
0,when modi promised “minimum government maximum...,-1.0
1,talk all the nonsense and continue all the dra...,0.0
2,what did just say vote for modi welcome bjp t...,1.0
3,asking his supporters prefix chowkidar their n...,1.0
4,answer who among these the most powerful world...,1.0
5,kiya tho refresh maarkefir comment karo,0.0
6,surat women perform yagna seeks divine grace f...,0.0
7,this comes from cabinet which has scholars lik...,0.0
8,with upcoming election india saga going import...,1.0
9,gandhi was gay does modi,1.0


### Q2. Find the cosine similarity in clean_text between the 100th and 10,000th tweets using dot and norm functions.

In [3]:
tfidf_vectorizer = TfidfVectorizer(use_idf=True, smooth_idf=True, sublinear_tf=False)

two_tweets = Twitter_Data.loc[[100, 10000],:]
tf_idf_matrix = tfidf_vectorizer.fit_transform(two_tweets['clean_text'])
print(f'The size of the tf_idf matrix for the texts = {tf_idf_matrix.get_shape()}.')

The size of the tf_idf matrix for the texts = (2, 30).


In [4]:
# dot() on the two sparse rows yields a 1 x 1 sparse matrix, hence the .todense() when printing.
cos_sim = dot(tf_idf_matrix[0, :], tf_idf_matrix[1, :].T) / (norm(tf_idf_matrix[0, :]) * norm(tf_idf_matrix[1, :]))
print(f'The cosine similarity between "{two_tweets.loc[two_tweets.index[0], "clean_text"]}" and "{two_tweets.loc[two_tweets.index[1], "clean_text"]}" = {cos_sim.todense()}.')

The cosine similarity between "why limited here are other prefixes for twitter that perhaps more accurately capture the state the citizens " and "dought all constitution post now modi bhaktaur bjp band hai kya galat kah diya" = [[0.]].
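A similarity of exactly 0 is what the (2, 30) matrix shape suggests: the first tweet contributes 16 distinct terms and the second 14, so the two tweets share no vocabulary and their vectors are orthogonal. The following is a minimal sketch that checks this with plain Python. Splitting on whitespace and using raw counts instead of TF-IDF weights are simplifying assumptions; any positive reweighting of the terms would leave a zero dot product at zero.

```python
import math
from collections import Counter

tweet_a = ("why limited here are other prefixes for twitter that perhaps "
           "more accurately capture the state the citizens")
tweet_b = ("dought all constitution post now modi bhaktaur bjp band hai "
           "kya galat kah diya")

# Assumption: whitespace tokenization stands in for TfidfVectorizer's tokenizer.
counts_a = Counter(tweet_a.split())
counts_b = Counter(tweet_b.split())

shared_terms = set(counts_a) & set(counts_b)
vocabulary = set(counts_a) | set(counts_b)


def cosine(u, v):
    """Cosine similarity of two term -> weight mappings."""
    dot_product = sum(weight * v.get(term, 0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot_product / (norm_u * norm_v)


similarity = cosine(counts_a, counts_b)

print(len(vocabulary))  # number of distinct terms across both tweets
print(shared_terms)     # terms that occur in both tweets
print(similarity)
```

Because the numerator only sums over terms present in both tweets, an empty intersection forces the result to 0 regardless of how the terms are weighted.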


### Q3. Find the cosine similarity in clean_text between the 100th and 10,000th tweets using the cosine function.

In [5]:
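# scipy's cosine() expects 1-D arrays, so flatten each sparse row before passing it in.
cos_sim = 1 - spatial.distance.cosine(tf_idf_matrix[0, :].toarray().ravel(), tf_idf_matrix[1, :].toarray().ravel())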
print(f'The cosine similarity between "{two_tweets.loc[two_tweets.index[0], "clean_text"]}" and "{two_tweets.loc[two_tweets.index[1], "clean_text"]}" = {cos_sim}.')

The cosine similarity between "why limited here are other prefixes for twitter that perhaps more accurately capture the state the citizens " and "dought all constitution post now modi bhaktaur bjp band hai kya galat kah diya" = 0.0.


### Q4. Find the cosine similarity in clean_text between the 100th and 10,000th tweets using cosine_similarity function.

In [6]:
# cosine_similarity() returns the full 2 x 2 pairwise matrix; entry [0, 1] compares the two tweets.
cos_sim = cosine_similarity(tf_idf_matrix, dense_output=True)
print(f'The cosine similarity between "{two_tweets.loc[two_tweets.index[0], "clean_text"]}" and "{two_tweets.loc[two_tweets.index[1], "clean_text"]}" = {cos_sim[0,1]}.')

The cosine similarity between "why limited here are other prefixes for twitter that perhaps more accurately capture the state the citizens " and "dought all constitution post now modi bhaktaur bjp band hai kya galat kah diya" = 0.0.


### Q5. Find the cosine similarity in clean_text between the 100th and 10,000th tweets using spaCy's similarity function.

In [7]:
nlp = spacy.load("en_core_web_lg")

review1 = Twitter_Data['clean_text'][100]
review2 = Twitter_Data['clean_text'][10000]

print(review1)
print('')
print(review2)
print('')

doc1 = nlp(review1)
doc2 = nlp(review2)

print(f"The similarity between them = {doc1.similarity(doc2):.2f}.")

why limited here are other prefixes for twitter that perhaps more accurately capture the state the citizens 

dought all constitution post now modi bhaktaur bjp band hai kya galat kah diya

The similarity between them = 0.21.
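The spaCy score is nonzero even though the TF-IDF similarity is 0 because, by default, `Doc.similarity` compares averaged word vectors rather than term overlap. The sketch below illustrates the idea with NumPy. The 3-dimensional `toy_vectors` are invented for illustration and are not taken from `en_core_web_lg`; `doc_vector` and `cosine` are hypothetical helper names.

```python
import numpy as np

# Hypothetical word vectors, invented for illustration only.
toy_vectors = {
    "vote":     np.array([0.9, 0.1, 0.0]),
    "election": np.array([0.8, 0.2, 0.1]),
    "citizens": np.array([0.6, 0.4, 0.0]),
    "state":    np.array([0.5, 0.5, 0.1]),
}


def doc_vector(tokens):
    """Average the word vectors of a document's tokens."""
    return np.mean([toy_vectors[t] for t in tokens], axis=0)


def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


doc_a = ["vote", "election"]
doc_b = ["citizens", "state"]

shared = set(doc_a) & set(doc_b)
similarity = cosine(doc_vector(doc_a), doc_vector(doc_b))

print(shared)                # no tokens in common
print(round(similarity, 2))  # still greater than zero
```

Two documents with no words in common can therefore still score well above 0, which is why the vector-based and TF-IDF-based results are not directly comparable.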


### Q6. Find the tweets with the cosine similarity > 60% with the 100th tweet in this dataset.

In [8]:
similarity_df = pd.DataFrame(columns=['Index', 'Review', 'Similarity to Tweet 100'])

# Looping over the whole dataset would take a long time, so a set number of rows to examine can be placed here.
# Any value from 1 - 162980 is valid.
value = 10

# Parse the reference tweet once instead of on every iteration.
doc1 = nlp(str(Twitter_Data['clean_text'][100]))

for i in range(0, value):
    if i == 100:
        continue

    review2 = str(Twitter_Data['clean_text'][i])
    # Compute the similarity once and reuse it for both the test and the output row.
    similarity = doc1.similarity(nlp(review2))

    if similarity > 0.6:
        # pd.concat() is used here because DataFrame.append() raised a deprecation warning.
        info = {'Index': [i], 'Review': [review2], 'Similarity to Tweet 100': [str(similarity)]}
        new_row = pd.DataFrame(data=info)
        similarity_df = pd.concat([similarity_df, new_row])

similarity_df.reset_index(drop=True, inplace=True)
similarity_df

Unnamed: 0,Index,Review,Similarity to Tweet 100
0,0,when modi promised “minimum government maximum...,0.9009649756099746
1,1,talk all the nonsense and continue all the dra...,0.8674500325008898
2,2,what did just say vote for modi welcome bjp t...,0.7512252259472123
3,3,asking his supporters prefix chowkidar their n...,0.91567076410869
4,4,answer who among these the most powerful world...,0.8474105556247228
5,7,this comes from cabinet which has scholars lik...,0.7505986418265169
6,8,with upcoming election india saga going import...,0.8377886532052728


### Q7. Find the corpus vector equal to the average of all the document vectors, where each document corresponds to a tweet or a row in this dataset.

In [9]:
# Looping over the whole dataset would take a long time, so a set number of rows to examine can be placed here.
# Any value from 1 - 162980 is valid.
value2 = 100

Twitter_Data2 = Twitter_Data.head(value2)

corpus_vector = np.array([nlp(str(doc)).vector for doc in Twitter_Data2['clean_text']]).mean(axis=0)
print(corpus_vector)

[-1.01065248e-01  1.13562442e-01 -1.33278277e-02 -4.37233634e-02
  2.38086246e-02  1.81605257e-02  6.12723157e-02 -7.96386302e-02
  1.68087631e-02  1.71301281e+00 -1.58889353e-01 -7.09819868e-02
  1.80515531e-03 -3.41923870e-02 -7.32177794e-02  1.32191768e-02
 -6.23754598e-02  5.59716403e-01 -8.89636800e-02  5.47447689e-02
  1.38740633e-02 -3.49752456e-02  5.93754351e-02 -7.52546191e-02
  8.37379172e-02  3.84355895e-02 -8.71685892e-02 -4.35740985e-02
 -3.56519930e-02  1.10440873e-01 -2.27905586e-02  4.90446715e-03
 -8.56709946e-03  1.12290122e-02  8.47478770e-03 -8.92388728e-03
 -7.15297237e-02 -3.00285369e-02 -1.18787177e-01 -2.17433162e-02
  2.82667973e-03 -8.77292547e-03 -2.61600558e-02 -8.01814497e-02
  1.01425347e-03  8.22294801e-02 -1.10451330e-03  7.96260759e-02
 -5.27382419e-02  6.74669445e-03 -7.30925202e-02  6.89959824e-02
  4.35188552e-03 -2.82859989e-02 -2.34459899e-02  5.00258021e-02
 -7.17234910e-02 -3.33820134e-02 -3.22896950e-02 -6.71841279e-02
 -8.43210667e-02 -5.35150