# Unsupervised Sentiment Analysis with word2vec + K-means Clustering (3)

#### By Joyce Jiang | Code by Joyce

### This unsupervised NLP analysis has three steps:
(1) word2vec model training

(2) K-means clustering to group words into positive and negative clusters

(3) Apply the unsupervised model to predict the sentiment of each tweet in the data sample

### Citation & Source of my code
Declaration: Although I had prior knowledge of text categorization with word2vec and K-means, my script (parts 1 to 3) is adapted almost entirely from rafaljanwojcik's tutorial on GitHub, in his repo Unsupervised-Sentiment-Analysis. You can check it out at https://github.com/rafaljanwojcik/Unsupervised-Sentiment-Analysis.

Thanks to rafaljanwojcik for explaining, in a digestible way, how to use word2vec and K-means for unsupervised NLP. This script is published for study and research purposes only and will not be used for any commercial purpose.

In [2]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score, recall_score, f1_score
from IPython.display import display

In [13]:
final_file = pd.read_csv('test_dataset.csv')
final_file.rename(columns = {'SENTIMENT':'Sentiment','Tweet':'old_Tweet','old_Tweet':'Tweet'}, inplace = True) 
final_file

Unnamed: 0,old_Tweet,Tweet,Sentiment
0,"['his', 'excellency', 'president', 'uganda', '...",his excellency president uganda kagutamuseveni...,1.0
1,"['katabasasa', 'nbstv', 'nbsfrontline', 'nbsup...",katabasasa nbstv nbsfrontline nbsupdates alrig...,
2,"['mr', 'erias', 'lukwago', 'should', 'know', '...",mr erias lukwago should know that sops are ins...,
3,"['energy', 'minister', 'drkitutu', 'hands', 'o...",energy minister drkitutu hands over the ugx 20...,
4,"['15', 'new', 'covid', '19', 'cases', 'have', ...",15 new covid 19 cases have been confirmed by t...,
...,...,...,...
5027,"['oworisylvia', 'we', 'have', 'been', 'engagin...",oworisylvia we have been engaging directly the...,
5028,"['major', 'rubalamira', 'owc', 'ug', 'has', 'b...",major rubalamira owc ug has been supplied by t...,
5029,"['it', 'is', 'going', 'to', 'be', 'very', 'dif...",it is going to be very difficult for governmen...,
5030,"['for', 'making', 'the', 'covid19ug', 'lockdow...",for making the covid19ug lockdown bearable tha...,


#### Build a word-to-score lookup from the sentiment dictionary

In [14]:
sentiment_map = pd.read_csv('sentiment_dictionary.csv')
sentiment_dict = dict(zip(sentiment_map.words.values, sentiment_map.sentiment_coeff.values))

In [15]:
file_weighting = final_file.copy()

##### Apply TF-IDF function to words in each tweet 

#### *TF-IDF: a model to measure the importance of each word in a document

William Scott gives a clear explanation of TF-IDF. In his blog post on Medium, he explains how the model calculates the importance of a word from its frequency and the document length, using a logarithm. Check it out at https://towardsdatascience.com/tf-idf-for-document-ranking-from-scratch-in-python-on-real-world-dataset-796d339a4089
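
As a rough check of that description, the sketch below recomputes two scores by hand for a toy corpus (the three sentences and the helper `manual_tfidf` are made up for illustration). It assumes scikit-learn's default smoothed IDF, idf(t) = ln((1 + n) / (1 + df(t))) + 1, and, because the vectorizer in this notebook uses `norm=None`, takes the score to be the raw term count multiplied by that IDF.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus, made up for illustration (not taken from the tweet dataset).
docs = ["covid cases rise", "covid cases fall", "lockdown eases"]

# Same settings as the notebook: whitespace tokenizer, no L2 normalisation.
# token_pattern=None only silences a warning about the unused default pattern.
vec = TfidfVectorizer(tokenizer=lambda y: y.split(), token_pattern=None, norm=None)
matrix = vec.fit_transform(docs)

def manual_tfidf(term, doc_index):
    """Raw count of `term` in one document times the smoothed IDF (assumed formula)."""
    n_docs = len(docs)
    doc_freq = sum(term in d.split() for d in docs)
    idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1
    return docs[doc_index].split().count(term) * idf

# Compare scikit-learn's value with the hand computation for two words.
sk_covid = matrix[0, vec.vocabulary_["covid"]]
sk_rise = matrix[0, vec.vocabulary_["rise"]]
print(sk_covid, manual_tfidf("covid", 0))
print(sk_rise, manual_tfidf("rise", 0))
```

In this toy corpus "rise" appears in one document and "covid" in two, so "rise" should receive the higher weight.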

In [16]:
tfidf = TfidfVectorizer(tokenizer=lambda y: y.split(), norm=None)
tfidf.fit(file_weighting.Tweet)
features = pd.Series(tfidf.get_feature_names_out())  # get_feature_names() was removed in scikit-learn 1.2
transformed = tfidf.transform(file_weighting.Tweet)



In [17]:
def create_tfidf_dictionary(x, transformed_file, features):
    '''
    Create a dictionary for the input sentence x in which each word is mapped to its TF-IDF score.
    
    Inspired by a function from this wonderful article:
    https://medium.com/analytics-vidhya/automated-keyword-extraction-from-articles-using-nlp-bfd864f41b34
    
    x - row of the dataframe, containing a sentence and its index
    transformed_file - all sentences transformed with TfidfVectorizer
    features - names of all words in the corpus used by TfidfVectorizer
    '''
    vector_coo = transformed_file[x.name].tocoo()
    vector_coo.col = features.iloc[vector_coo.col].values
    dict_from_coo = dict(zip(vector_coo.col, vector_coo.data))
    return dict_from_coo

def replace_tfidf_words(x, transformed_file, features):
    '''
    Replace each word in the sentence with its TF-IDF score, looked up in the sentence's TF-IDF dictionary.
    x - row of the dataframe, containing a sentence and its index
    transformed_file - all sentences transformed with TfidfVectorizer
    features - names of all words in the corpus used by TfidfVectorizer
    '''
    dictionary = create_tfidf_dictionary(x, transformed_file, features)   
    return list(map(lambda y:dictionary[f'{y}'], x.Tweet.split()))

In [18]:
%%time
replaced_tfidf_scores = file_weighting.apply(lambda x: replace_tfidf_words(x, transformed, features), axis=1)  # this step may take a few minutes on larger datasets

Wall time: 1.41 s
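
To see what this step produces, here is a minimal sketch on a two-row toy frame (the sentences and the helper name `tfidf_scores_per_token` are hypothetical). It looks each token up through the vectorizer's `vocabulary_` mapping rather than through the COO conversion used above, and it assumes the frame has a default 0..n-1 index so that `row.name` matches the row position in the sparse matrix.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy tweets, made up for illustration.
toy = pd.DataFrame({"Tweet": ["stay home stay safe", "stay strong uganda"]})

vec = TfidfVectorizer(tokenizer=lambda y: y.split(), token_pattern=None, norm=None)
scores = vec.fit_transform(toy.Tweet)

def tfidf_scores_per_token(row):
    """Return one TF-IDF score per token, in the order the tokens appear."""
    return [scores[row.name, vec.vocabulary_[tok]] for tok in row.Tweet.split()]

toy["tfidf_scores"] = toy.apply(tfidf_scores_per_token, axis=1)
print(toy)
```

A repeated word such as "stay" gets the same score at each of its positions, so every list has exactly as many entries as the tweet has tokens.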


In [19]:
def replace_sentiment_words(word, sentiment_dict):
    '''
    replacing each word with its associated sentiment score from sentiment dict
    '''
    try:
        out = sentiment_dict[word]
    except KeyError:
        out = 0
    return out

#### Map sentiment scores from sentiment dictionary

In [20]:
replaced_closeness_scores = file_weighting.Tweet.apply(lambda x: list(map(lambda y: replace_sentiment_words(y, sentiment_dict), x.split())))
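
The lookup can be illustrated with a tiny, hypothetical dictionary (the words, coefficients, and the helper name `coeffs_for` are made up; the real values come from sentiment_dictionary.csv). The sketch uses `dict.get` with a default of 0, which should behave like the try/except in `replace_sentiment_words`.

```python
import pandas as pd

# Hypothetical two-word dictionary, standing in for sentiment_dictionary.csv.
toy_map = pd.DataFrame({"words": ["good", "bad"], "sentiment_coeff": [1.5, -2.0]})
toy_dict = dict(zip(toy_map.words.values, toy_map.sentiment_coeff.values))

def coeffs_for(sentence, sentiment_dict):
    """Look up every token; tokens missing from the dictionary score 0."""
    return [sentiment_dict.get(tok, 0) for tok in sentence.split()]

result = coeffs_for("good day bad news", toy_dict)
print(result)
```

Words that never appeared in the dictionary ("day" and "news" here) contribute nothing to the final score.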

In [21]:
file_weighting.Sentiment.value_counts()

 1.0    36
-1.0    35
Name: Sentiment, dtype: int64

#### Use the dot product of the sentiment and TF-IDF vectors to calculate the sentiment_rate of each tweet

In [58]:
replacement_df = pd.DataFrame(data=[replaced_closeness_scores, replaced_tfidf_scores, file_weighting.Tweet, file_weighting.Sentiment]).T
replacement_df.columns = ['sentiment_coeff', 'tfidf_scores', 'sentence', 'Label_sentiment']
replacement_df['sentiment_rate'] = replacement_df.apply(lambda x: np.array(x.loc['sentiment_coeff']) @ np.array(x.loc['tfidf_scores']), axis=1)
replacement_df

Unnamed: 0,sentiment_coeff,tfidf_scores,sentence,Label_sentiment,sentiment_rate
0,"[-1.541800827499423, -2.3033457333645297, -1.6...","[4.744648013664831, 8.42515921810825, 3.680227...",his excellency president uganda kagutamuseveni...,1,-356.488838
1,"[0, -1.5565653208367431, -3.2015054954053355, ...","[8.830624326216416, 4.618496728337931, 6.34571...",katabasasa nbstv nbsfrontline nbsupdates alrig...,,-522.359925
2,"[-1.2957153680580322, -3.188078768336516, -1.7...","[6.227934640772031, 8.137477145656469, 7.73201...",mr erias lukwago should know that sops are ins...,,-369.949592
3,"[-3.0098813755615854, -1.7890353035388755, 0, ...","[7.91433359434226, 5.069424210522852, 8.830624...",energy minister drkitutu hands over the ugx 20...,,-714.016576
4,"[-1.1767939877229152, -1.3548661756600078, -1....","[6.089784302291213, 7.918502198907333, 6.69165...",15 new covid 19 cases have been confirmed by t...,,-343.314726
...,...,...,...,...,...
5027,"[-2.835482437494105, -1.7283063267743914, -1.7...","[7.577861357721047, 5.326215670656147, 2.83044...",oworisylvia we have been engaging directly the...,,-519.271275
5028,"[-1.1904924649395259, 0, -3.8522682417561738, ...","[7.444329965096524, 8.830624326216416, 7.22118...",major rubalamira owc ug has been supplied by t...,,-504.140840
5029,"[-1.6771065597055412, -1.6522155987037777, -1....","[3.2120366976234456, 2.4155273670448194, 4.744...",it is going to be very difficult for governmen...,,-425.720320
5030,"[-1.9038026048395025, -2.2224452127640792, -1....","[2.4788663896469894, 6.02726394530988, 1.49531...",for making the covid19ug lockdown bearable tha...,,-47.974229
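
As a small worked example of the sentiment_rate calculation, the sketch below takes one made-up three-token tweet and computes the dot product both with `@` and by hand. The numbers are illustrative only and do not come from the dataset.

```python
import numpy as np

# Toy values for a three-token tweet (made up, not taken from the dataset).
sentiment_coeff = [1.5, 0.0, -2.0]   # one sentiment coefficient per token
tfidf_scores = [2.0, 4.0, 3.0]       # one TF-IDF weight per token

# The @ operator on 1-D arrays is the dot product: sum of element-wise products.
sentiment_rate = np.array(sentiment_coeff) @ np.array(tfidf_scores)
by_hand = sum(c * w for c, w in zip(sentiment_coeff, tfidf_scores))
print(sentiment_rate, by_hand)
```

Because this is a sum rather than an average, longer tweets can accumulate larger magnitudes. The coefficients visible in the table above are all negative or zero, which is consistent with the strongly negative rates.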


#### *Checking the average sentiment_rate across all tweets: it is about -248, which means my model is heavily skewed toward negative scores

In [59]:
replacement_df.mean()

Label_sentiment      0.014085
sentiment_rate    -248.239124
dtype: float64

In [60]:
# Convert sentiment_rate to a label for comparison: negative (-1) below -100, positive (1) otherwise

prediction = [-1 if v < -100 else 1 for v in replacement_df['sentiment_rate']]
replacement_df['prediction'] = prediction
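
The same rule can be written as a small function with the cut-off as a parameter, which makes it easier to try other thresholds. This is only a sketch: the function name `predict` is hypothetical, and the input values are the sentiment_rate figures of the rows displayed in the table above.

```python
import numpy as np

# sentiment_rate values of the rows displayed in the table above.
rates = np.array([-356.488838, -522.359925, -369.949592, -714.016576,
                  -343.314726, -519.271275, -504.140840, -425.720320,
                  -47.974229])

def predict(rates, threshold=-100):
    """Label a tweet -1 below the threshold and 1 otherwise."""
    return np.where(rates < threshold, -1, 1)

preds = predict(rates)
print(preds)
```

With the -100 cut-off, only the last of these rows is labelled positive.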

In [61]:
replacement_df=replacement_df.drop_duplicates(subset ="sentence").reset_index(drop=True)
replacement_df.prediction.value_counts()

-1    4106
 1     850
Name: prediction, dtype: int64

In [62]:
df_drop=replacement_df.dropna()
filter_df=df_drop[df_drop['Label_sentiment']!=0]
filter_df=filter_df.reset_index(drop=True)

#### The commented-out cell below was meant to compute a confusion matrix. Since I didn't label [Sentiment] in Boolean (0/1) form, I was not able to produce one.
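
For reference, scikit-learn's metrics are not limited to 0/1 labels and should also accept -1/1 labels, provided both columns have a numeric dtype. The sketch below uses made-up labels and casts them to int first, on the assumption (not verified against the original data) that an object-dtype label column was part of the problem.

```python
import pandas as pd
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             f1_score, precision_score, recall_score)

# Made-up labels for illustration. dtype=object is an assumption that mimics
# a column built through pd.DataFrame(...).T, as replacement_df is.
toy = pd.DataFrame({"Label_sentiment": [1.0, -1.0, -1.0, 1.0, -1.0],
                    "prediction": [-1, -1, -1, 1, 1]}, dtype=object)

y_true = toy.Label_sentiment.astype(int)
y_pred = toy.prediction.astype(int)

# Rows are true labels, columns are predictions, both ordered as [-1, 1].
conf_matrix = pd.DataFrame(confusion_matrix(y_true, y_pred, labels=[-1, 1]),
                           index=["true -1", "true 1"],
                           columns=["pred -1", "pred 1"])
print(conf_matrix)

# pos_label=1 treats the positive class as the class of interest.
scores = pd.Series({
    "accuracy": accuracy_score(y_true, y_pred),
    "precision": precision_score(y_true, y_pred, pos_label=1),
    "recall": recall_score(y_true, y_pred, pos_label=1),
    "f1": f1_score(y_true, y_pred, pos_label=1),
})
print(scores)
```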

In [63]:
'''predicted_classes = df_drop.prediction
y_test = df_drop.Label_sentiment

conf_matrix = pd.DataFrame(confusion_matrix(df_drop.Label_sentiment, df_drop.prediction))
print('Confusion Matrix')
display(conf_matrix)

test_scores = accuracy_score(y_test,predicted_classes), precision_score(y_test, predicted_classes), recall_score(y_test, predicted_classes), f1_score(y_test, predicted_classes)

print('\n \n Scores')
scores = pd.DataFrame(data=[test_scores])
scores.columns = ['accuracy', 'precision', 'recall', 'f1']
scores = scores.T
scores.columns = ['scores']
display(scores)'''

### Test Accuracy
Notice: The accuracy test is not valid for my sentiment analysis because I wasn't able to label positive and negative words. 

Interestingly, the clustering seems to have separated English words from non-English words instead. A tweet predicted as positive is dominated by non-English content, and vice versa. That is also why we get far more negative labels than positive ones.
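
One way to probe this observation is to measure, for each tweet, the share of tokens that are missing from the sentiment dictionary, since those tokens contribute 0 to sentiment_rate. The helper name `oov_share` and the toy dictionary below are hypothetical.

```python
def oov_share(sentence, sentiment_dict):
    """Fraction of tokens in `sentence` that have no entry in `sentiment_dict`."""
    tokens = sentence.split()
    if not tokens:
        return 0.0
    missing = sum(tok not in sentiment_dict for tok in tokens)
    return missing / len(tokens)

# Hypothetical dictionary; in the notebook this would be sentiment_dict.
toy_dict = {"the": -1.2, "lockdown": -2.5, "is": -1.1}

share_mixed = oov_share("the lockdown is bearable", toy_dict)
share_unknown = oov_share("katabasasa nbstv nbsfrontline", toy_dict)
print(share_mixed, share_unknown)
```

Applied to the real data, for example with `replacement_df['sentence'].apply(lambda s: oov_share(s, sentiment_dict))`, the shares could then be compared between tweets predicted as 1 and as -1.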

In [64]:
n=0

for i in range(len(filter_df['prediction'])):
    if filter_df['prediction'][i]==filter_df['Label_sentiment'][i]:
        n+=1
OverallAccuracy = n/len(filter_df['prediction'])*100

In [65]:
positive = filter_df[filter_df['Label_sentiment']==1]
negative = filter_df[filter_df['Label_sentiment']==-1]

positive=positive.reset_index(drop=True)
negative=negative.reset_index(drop=True)

ng=0

for i in range(len(negative['Label_sentiment'])):
    if negative['Label_sentiment'][i]==negative['prediction'][i]:
        ng+=1
NegativeAccuracy = ng/len(negative['Label_sentiment'])*100

p=0

for i in range(len(positive['Label_sentiment'])):
    if positive['Label_sentiment'][i]==positive['prediction'][i]:
        p+=1
PositiveAccuracy = p/len(positive['Label_sentiment'])*100


In [66]:
print("Overall Accuracy: "+ str(OverallAccuracy))
print("Negative Accuracy: "+ str(NegativeAccuracy))
print("Positive Accuracy: "+ str(PositiveAccuracy))

Overall Accuracy: 47.88732394366197
Negative Accuracy: 88.57142857142857
Positive Accuracy: 8.333333333333332
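
The three loops above can also be expressed with vectorized pandas operations. The sketch below uses made-up labels and predictions, so its numbers are unrelated to the results printed above.

```python
import pandas as pd

# Made-up labels and predictions for illustration.
toy = pd.DataFrame({"Label_sentiment": [1, 1, -1, -1, -1],
                    "prediction": [-1, 1, -1, -1, 1]})

# Boolean Series: True where the prediction matches the label.
correct = toy.prediction == toy.Label_sentiment

overall_accuracy = correct.mean() * 100
# Share of correct predictions within each true class (per-class recall).
per_class_accuracy = correct.groupby(toy.Label_sentiment).mean() * 100
print(overall_accuracy)
print(per_class_accuracy)
```

In this sense, the "Negative Accuracy" and "Positive Accuracy" figures are the recall of the negative and positive classes.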


In [67]:
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 1000)
pd.set_option('display.width', 1000)
pd.set_option('display.max_colwidth', None)

filter_df[['sentence','Label_sentiment','sentiment_rate','prediction']]

Unnamed: 0,sentence,Label_sentiment,sentiment_rate,prediction
0,his excellency president uganda kagutamuseveni recognising efforts of ahmadiyya muslim community towards covid19ug so grateful to our partners humanityfirstuk ughumanityfirst hfi1995 4 support accorded to the ppl of uganda ntvuganda eidmubarak eidulfitr nbsupdates,1,-356.488838,-1
1,yes he is one of the most celebrated journalists of our time yes he is the reigning bbc komla dumor award winner but he is still humble enough to carry video equipment with the rest of us to the frontlines of the covid19 field the lesson is humility covid19ug,1,-591.000997,-1
2,aajtak social distancing is a myth anybody can see this now a days on road in streets at local market and at shops it s very dangerous at this time social distancing covid19ug aamaadmiparty delhicm ndtv indiatv hindustantimmes timesofindia bbcnews zeenews dainikbhasker,-1,-321.754354,-1
3,covid19ug cases have been smalolized from 264 to 145 where is the luwero hajji we celebrate ? !,1,-79.117716,1
4,football256 update prolinefc are uncontended though they accept to the fact that they ve been relegated due to the coronavirus pandemic covid19ug,-1,-196.798208,-1
5,covid19ug messed me up i would be here talking to my dog,-1,-99.618699,1
6,i don t think that covid19ug should stop elections i don t think that the millions who vote attend political rallies katikkiroonthespot ntvonthespot,-1,-362.548436,-1
7,nbsfrontline should change the opposition frontliners none of them is discussing substance all of them think president m7 has failed to manage covid19ug are blaming his tea taking skills if indeed he has failed why do we have the least cases in the region amp no deaths ?,-1,-536.405473,-1
8,for accurate info about covid19ug modoanita,1,-44.509596,1
9,godbertumushabe janeruth aceng nurses in soroti also protesting no clarity on what the task force funds are really doing covid19ug,-1,-197.909449,-1


### Notice: My cluster analysis failed. My prediction results are NOT reliable, and neither is my accuracy test :(