# Build Sentiment Dictionaries from VSMs

This notebook lets you build your own sentiment dictionary from a Vector Space Model (VSM) by scoring every word in a pretrained embedding against a few seed words.

## 1. Preparation

Download the model.  
You can select any model from here: https://fasttext.cc/docs/en/crawl-vectors.html

In [None]:
import os
import gensim
import urllib.request
import os.path
import pandas
import numpy as np
import scipy.stats as stats

In [None]:
# here we download the model
# remember to change URL and filename according to the model you want 
# here we do a test with the Italian model, named "cc.it.300.vec.gz"

!wget "https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.it.300.vec.gz"
!gunzip cc.it.300.vec.gz

--2022-06-07 10:02:10--  https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.it.300.vec.gz
Resolving dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)... 172.67.9.4, 104.22.74.142, 104.22.75.142, ...
Connecting to dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)|172.67.9.4|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1272825284 (1.2G) [binary/octet-stream]
Saving to: ‘cc.it.300.vec.gz’


2022-06-07 10:02:42 (38.5 MB/s) - ‘cc.it.300.vec.gz’ saved [1272825284/1272825284]
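
If `wget` and `gunzip` are not available (for example outside Colab or on Windows), the same two steps can be sketched in plain Python with the standard library. The helper names `download_file` and `gunzip_file` are made up for this illustration and are not part of any library. Unlike `gunzip`, this sketch keeps the `.gz` archive on disk.

```python
import gzip
import os.path
import shutil
import urllib.request


def download_file(url, gz_path):
    """Download url to gz_path, skipping the download if the file already exists."""
    if not os.path.exists(gz_path):
        urllib.request.urlretrieve(url, gz_path)
    return gz_path


def gunzip_file(gz_path, out_path):
    """Decompress gz_path into out_path in chunks, so the file is never fully in memory."""
    with gzip.open(gz_path, "rb") as src, open(out_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return out_path


# Usage (commented out because it fetches about 1.2 GB):
# download_file("https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.it.300.vec.gz",
#               "cc.it.300.vec.gz")
# gunzip_file("cc.it.300.vec.gz", "cc.it.300.vec")
```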



In [None]:
# remember to change the filename according to the model you downloaded 
# here we do a test with the Italian model, named "cc.it.300.vec" (note that ".gz" is not in the name anymore, as we unzipped the file)

filename = 'cc.it.300.vec'

my_model = gensim.models.KeyedVectors.load_word2vec_format(filename, binary=False)
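
To see what `load_word2vec_format` is reading, it can help to look at the file layout. A `.vec` file is plain text: a header line with the number of words and the vector size, then one line per word with the word followed by its values. The sketch below parses that layout by hand on a toy file with made-up numbers; `read_vec` is a hypothetical helper for illustration, not how gensim implements loading.

```python
import io


def read_vec(handle, limit=None):
    """Parse word2vec text format: a 'count dim' header, then 'word v1 ... vdim' lines."""
    n_words, dim = (int(x) for x in handle.readline().split())
    vectors = {}
    for line in handle:
        if limit is not None and len(vectors) >= limit:
            break  # stop early, keeping only the first `limit` words
        parts = line.rstrip().split(" ")
        word, values = parts[0], [float(x) for x in parts[1:]]
        if len(values) != dim:
            raise ValueError(f"expected {dim} values for {word!r}, got {len(values)}")
        vectors[word] = values
    return vectors


# Toy file with made-up numbers, standing in for cc.it.300.vec
toy = io.StringIO("3 2\nfelice 0.9 0.1\ntriste -0.8 0.2\ncasa 0.0 1.0\n")
vectors = read_vec(toy)
print(vectors["felice"])  # [0.9, 0.1]
```

If the full model does not fit in memory, `load_word2vec_format` also accepts a `limit` argument that reads only the first N vectors from the file.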

## 2. Prepare SA lexicon

Define the "seed words" for your lexicon.  
This example uses two dimensions, "happy" and "sad", but you can add as many dimensions as you like.

In [None]:
happy_labels = ['felice', 'contento'] # note that you can add as many words as you like!
sad_labels = ['triste', 'dispiaciuto'] # note that you can add as many words as you like!
# you can add more labels for more categories, if you like...

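# gensim >= 4.0 stores the vocabulary in `key_to_index`; older releases used `vocab`
vocab = my_model.key_to_index if hasattr(my_model, 'key_to_index') else my_model.vocab
all_words = list(vocab)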

In [None]:
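# rank the whole vocabulary by similarity to each set of seed words
# note: most_similar leaves the seed words themselves out of its results
happy_ordered_words = my_model.most_similar(positive = happy_labels, topn = len(all_words))
sad_ordered_words = my_model.most_similar(positive = sad_labels, topn = len(all_words))
# you can add more categories, if you like...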
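
As a rough picture of what the ranking step computes, the sketch below scores toy 2-d vectors by cosine similarity to the mean of the unit-length seed vectors, skipping the seeds themselves. The vectors are made up and `rank_by_seed_similarity` is a hypothetical helper; it approximates the idea behind `most_similar` and is not gensim's actual code.

```python
import numpy as np

# Toy 2-d embeddings with made-up values (not taken from the fastText model)
toy_vectors = {
    "felice": np.array([0.9, 0.1]),
    "contento": np.array([0.8, 0.3]),
    "triste": np.array([-0.9, 0.2]),
    "allegro": np.array([0.7, 0.2]),
    "casa": np.array([0.0, 1.0]),
}


def rank_by_seed_similarity(vectors, seeds):
    """Rank non-seed words by cosine similarity to the mean unit-length seed vector."""
    unit = {w: v / np.linalg.norm(v) for w, v in vectors.items()}
    centre = np.mean([unit[s] for s in seeds], axis=0)
    centre = centre / np.linalg.norm(centre)
    scores = [(w, float(unit[w] @ centre)) for w in unit if w not in seeds]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


ranking = rank_by_seed_similarity(toy_vectors, ["felice", "contento"])
print([word for word, _ in ranking])  # ['allegro', 'casa', 'triste']
```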

In [None]:
# happy
happy_words = []
happy_value = []

for my_tuple in happy_ordered_words:
  happy_words.append(my_tuple[0])
  happy_value.append(my_tuple[1])

# sad
sad_words = []
sad_value = []

for my_tuple in sad_ordered_words:
  sad_words.append(my_tuple[0])
  sad_value.append(my_tuple[1])

# you can add more categories, if you like...

In [None]:
# happy
happy_value = np.array(happy_value)
happy_value = stats.zscore(happy_value)

happy_df = pandas.DataFrame(list(zip(happy_words, happy_value)), 
               columns =['word', 'happy'])

happy_df = happy_df.sort_values('word', ascending=True)


# sad
sad_value = np.array(sad_value)
sad_value = stats.zscore(sad_value)

sad_df = pandas.DataFrame(list(zip(sad_words, sad_value)), 
               columns =['word', 'sad'])

sad_df = sad_df.sort_values('word', ascending=True)

# you can add more categories, if you like...
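
The standardisation step puts each category on a common scale before the scores are compared. With its default settings, `stats.zscore` subtracts the mean and divides by the population standard deviation (`ddof=0`). A minimal pure-Python sketch of that formula, on made-up scores, with a local `zscore` function written only for illustration:

```python
import math


def zscore(values):
    """Standardise values to mean 0 and standard deviation 1 (population std, ddof=0)."""
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return [(v - mean) / std for v in values]


# Made-up similarity scores for illustration
scores = [0.9, 0.5, 0.1]
standardised = zscore(scores)
print(standardised)
```

After this step a value of 0 means "average similarity to the seeds for this vocabulary", and positive or negative values are counted in standard deviations from that average.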

In [None]:
# combine all categories into a single dataframe
sa_df = happy_df.merge(sad_df, how = 'inner', on = ['word'])
sa_df["valence"] = sa_df["happy"] - sa_df["sad"]
# merge() joins two dataframes at a time, so if you add more categories, chain the calls like this:
# sa_df = happy_df.merge(sad_df, how = 'inner', on = ['word']).merge(fear_df, how = 'inner', on = ['word'])
# (and so on for surprise_df, ...)

sa_df.to_csv('my_SA_dictionary.csv', index=False)
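
Because `DataFrame.merge` joins two frames at a time, combining many categories means repeating the merge. One way to do that, sketched here on toy tables with made-up scores, is to fold the merge over a list with `functools.reduce`. The `fear` category is hypothetical and only shows how a third dimension would slot in.

```python
from functools import reduce

import pandas as pd

# Toy per-category tables with made-up scores; "fear" is a hypothetical third category
happy_df = pd.DataFrame({"word": ["casa", "sole"], "happy": [0.1, 1.2]})
sad_df = pd.DataFrame({"word": ["casa", "sole"], "sad": [0.3, -0.9]})
fear_df = pd.DataFrame({"word": ["casa", "sole"], "fear": [-0.2, -1.1]})

# Merge the frames pairwise, left to right, keeping only words present in all of them
frames = [happy_df, sad_df, fear_df]
sa_df = reduce(lambda left, right: left.merge(right, how="inner", on="word"), frames)

# Valence stays the difference between the happy and sad scores
sa_df["valence"] = sa_df["happy"] - sa_df["sad"]
print(sa_df)
```

Adding another category then only requires appending its dataframe to `frames`.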