# Analyzing word2vec embeddings by demonstrating clustering

The entire assignment can be split up into subsections:
- Installing dependencies
- Preprocessing the subtitle file to generate tokens
- Training the model over these tokens
- Running the clustering algorithm and assessing the clusters formed

# Installing dependencies

In [None]:
%pip install scipy==1.10.1
%pip install scikit-learn gensim nltk pysrt bs4 contractions

In [None]:
# For preprocessing we need the following libraries
import string
import pysrt
from bs4 import BeautifulSoup
import contractions
import nltk

from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

nltk.download('wordnet')
nltk.download("stopwords")
nltk.download("punkt")

# For training the model we need the following library
from gensim.models import Word2Vec

# For clustering we need the following library
from sklearn.cluster import KMeans

# Preprocessing Text
We first need to preprocess the text, since it will be taken from a subtitle file. The goal here is to convert text from a subtitle `.srt` file to an array of sentences, where each sentence is an array of words. This will be the input to our model.

We use several packages to achieve this preprocessing:
- The `pysrt` package extracts the text from the subtitle file.
- HTML tags (such as `<i>...</i>`) are stripped from the text with `BeautifulSoup`.
- Contractions are expanded (`i'll` becomes `i will`) with the `contractions` package.
- The text is tokenized with the `nltk` library into an array of sentences, where each sentence is an array of words.
- Hyphenated words are split (`25-to-1` becomes `25`, `to`, `1`).
- Stopwords (words such as i, he, she, am, etc. that aren't useful to our analysis) are removed.
- The words are _lemmatized_, so that plurals are reduced to their singular form.
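
The hyphen-splitting and stopword-removal steps can be illustrated on a single tokenized sentence. This is a minimal sketch in plain Python, independent of `nltk`. The tiny `STOP_WORDS` set, the helper names, and the example sentence are hypothetical stand-ins for the fuller resources used in the notebook.

```python
# Hypothetical, hand-picked stopword set; the notebook uses nltk's English list instead.
STOP_WORDS = {"i", "he", "she", "am", "the", "a", "to", "are"}


def split_hyphenated(words):
    """Split tokens such as '25-to-1' into their non-empty parts."""
    result = []
    for word in words:
        # str.split returns [word] unchanged when there is no hyphen.
        result.extend(part for part in word.split("-") if part)
    return result


def remove_stopwords(words):
    """Drop tokens that appear in STOP_WORDS."""
    return [word for word in words if word not in STOP_WORDS]


sentence = ["the", "odds", "are", "25-to-1"]
cleaned = remove_stopwords(split_hyphenated(sentence))
print(cleaned)  # ['odds', '25', '1']
```

Splitting before filtering matters here: the `to` inside `25-to-1` only becomes visible as a stopword once the hyphenated token has been broken apart.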

In [100]:
class PreprocssText:
    def __init__(self):
        self.text = ''
        self.stop_words = set(stopwords.words('english') + list(string.punctuation) + ['...', '``', '\'\'', '\'s', 'us'])
        self.wnl = WordNetLemmatizer()

    def separate_hypenated_word(self, sentence):
        ret_sentence = []
        for word in sentence:
            if '-' in word:
                parts = word.split('-')
                for part in parts:
                    if part:
                        ret_sentence.append(part)
            else:
                ret_sentence.append(word)
        return ret_sentence

    def lemmatize(self, sentence):
        ret_sentence = []
        for word in sentence:
            lem_word = self.wnl.lemmatize(word)
            if lem_word == 'cdos':
                lem_word = 'cdo'
            ret_sentence.append(lem_word)
        return ret_sentence

    def tokenize(self, remove_stopwords=True, lemmatize=True):
        # Tokenize into sentences
        sentences = sent_tokenize(self.text)

        # Tokenize each sentence into words and remove stopwords
        tokenized_sentences = []
        for sentence in sentences:
            words = word_tokenize(sentence)
            words = [word.lower() for word in words if word and not word.isdigit()]
            words = self.separate_hypenated_word(words)
            if remove_stopwords:
                words = [word for word in words if word not in self.stop_words]
            if lemmatize:
                words = self.lemmatize(words)
            if len(words)>1:
                tokenized_sentences.append(words)
        return tokenized_sentences

    def read_file(self, file_path):
        subs = pysrt.open(file_path, encoding='utf-8')

        # Extract text from subtitle objects and remove HTML tags
        text = ' '.join([sub.text for sub in subs])
        soup = BeautifulSoup(text, 'html.parser')
        for font_tag in soup.find_all('font'):
            font_tag.decompose()
        clean_text = soup.get_text()

        clean_text = contractions.fix(clean_text)
        self.text = clean_text
        return self

# Path to the subtitle file
subtitle_path = "big_short.srt"
tokens = PreprocssText().read_file(subtitle_path).tokenize()
print(tokens[:5], len(tokens), sep='\n')

[['hiya', 'frank'], ['wife', 'kid'], ['know', 'considering', 'treasury', 'bond', 'utility', 'stock'], ['late', "'70s", 'banking', 'job', 'went', 'make', 'large', 'sum', 'money'], ['fucking', 'snooze']]
1654


# What is an embedding?
To be able to run computations over language, it is often convenient to create an embedding - a mapping - from words to a mathematical entity, here an n-dimensional vector.

<img src="images/word_to_vector.png" width="500">

We want this vector representation to encode meaning. Words that are used in similar contexts lie closer together in the vector space, so for example `king` is closer to `man` than to `cat`. The direction and magnitude of the vectors also encode information: adding the vector `queen - king` to the vector for `man` gives a vector close to `woman`.

<img src="images/king_example.png" width="500">
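
The analogy arithmetic can be tried out on a handful of toy vectors. The sketch below uses hand-picked 2-D values (hypothetical, not taken from any trained model) where one axis loosely stands for "royalty" and the other for "gender", and measures closeness with cosine similarity.

```python
import math

# Toy 2-D embeddings chosen by hand for illustration only (hypothetical values).
# Axis 0 loosely means "royalty", axis 1 loosely means "gender".
vectors = {
    "king": [1.0, 1.0],
    "queen": [1.0, -1.0],
    "man": [0.0, 1.0],
    "woman": [0.0, -1.0],
    "cat": [-1.0, 0.0],
}


def cosine(u, v):
    """Cosine similarity between two 2-D vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))


# man + (queen - king), computed component by component
target = [
    m + q - k
    for m, q, k in zip(vectors["man"], vectors["queen"], vectors["king"])
]

# Look for the nearest word, excluding the three words used in the query.
candidates = {w: v for w, v in vectors.items() if w not in {"man", "queen", "king"}}
best = max(candidates, key=lambda w: cosine(target, candidates[w]))
print(best)  # woman
```

With these values `king` is also more similar to `man` than to `cat`, matching the first example in the text. Real embeddings are far noisier, so the result of such arithmetic is only approximately the expected word.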

# Generating word embeddings
To generate word embeddings, we use the `gensim` library's `Word2Vec` class.

It trains a simple neural network to predict a word from the words that surround it. Words that end up with similar weights in the model are the words that are used in similar contexts (the words following or preceding them are similar). These weights give us the required embeddings.

<img src="images/gnn.png" width="500">

<img src="images/ww.png" width="500">

Of the two training methods available in this class, continuous bag-of-words (CBOW) and skip-gram, we use the first one, which is also the default.

<img src="images/cbow.png" width="500">
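
To make the CBOW idea concrete, the sketch below builds the (context, target) pairs that such a model would learn from for one tokenized sentence. The function name `cbow_pairs` is hypothetical, and the sketch only illustrates the windowing idea; it leaves out the neural network itself and any further details of gensim's implementation.

```python
def cbow_pairs(sentence, window):
    """Return (context_words, target_word) pairs for one tokenized sentence."""
    pairs = []
    for i, target in enumerate(sentence):
        left = sentence[max(0, i - window):i]    # up to `window` words before
        right = sentence[i + 1:i + 1 + window]   # up to `window` words after
        context = left + right
        if context:
            pairs.append((context, target))
    return pairs


sentence = ["know", "considering", "treasury", "bond"]
for context, target in cbow_pairs(sentence, window=1):
    print(context, "->", target)
```

With `window=1` the word `considering` is predicted from `['know', 'treasury']`. A window as large as the sentence, as used in this notebook, makes every other word of the sentence part of the context.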

In [72]:
# Train the model
#   vector_size: dimensionality of the word vectors. Larger vectors can capture
#       more detail, but are also more computationally expensive.
#   window: maximum distance between the current and predicted word within a sentence.
#       Set to 100, so entire sentences are considered.
#   min_count: ignore all words with total frequency lower than this.
#   workers: number of worker threads used to train the model.
#       Set to 1 here, to keep the output replicable.
model = Word2Vec(tokens, vector_size=2000, window=100,  min_count=5, workers=1)

# Extract word embeddings
word_vectors = model.wv

# Get word and corresponding vector
word_vector_list = [word_vectors[word] for word in word_vectors.key_to_index.keys()]
words_list = list(word_vectors.key_to_index.keys())

# Qualitatively assess the word embedding generated
print(word_vectors.most_similar_cosmul('cdo')) 
print(word_vectors.most_similar_cosmul('mortgage')) 


[('bond', 0.6647899746894836), ('wall', 0.6568737626075745), ('mortgage', 0.6547439098358154), ('housing', 0.6545563340187073), ('morgan', 0.6457417011260986), ('know', 0.6447020173072815), ('get', 0.6384490132331848), ('year', 0.6329621076583862), ('bank', 0.6312693357467651), ('right', 0.6299655437469482)]
[('bond', 0.691154420375824), ('going', 0.6875411868095398), ('wall', 0.6870430111885071), ('housing', 0.6826854348182678), ('subprime', 0.6760833859443665), ('people', 0.6741327047348022), ('short', 0.6736552119255066), ('know', 0.6720733642578125), ('morgan', 0.6694979071617126), ('get', 0.6644611954689026)]


### Interpretation of the results
Since the movie `The Big Short` is about the 2008 housing crisis, we expect a lot of talk regarding CDOs and mortgages, and the output reflects this. CDOs are bond-like securities backed by pools of debt such as mortgages, and the main characters shorted them by purchasing credit default swaps. `wall` in the output mostly comes from Wall Street, the New York street whose name is shorthand for the US financial industry.


# Clustering
We use the K-means algorithm to cluster the embeddings. We create 20 clusters, and then run some qualitative tests to see how words are clustered together. Words that are functionally similar, like a set of verbs, or a set of nouns that could be used interchangeably, should be clustered together.
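
As a rough picture of what K-means does, the sketch below runs the basic assign-then-average loop on four toy 2-D points with hand-picked starting centroids. The function and data are hypothetical illustrations; scikit-learn's `KMeans` chooses its starting centroids more carefully and is what the notebook actually uses.

```python
def kmeans(points, centroids, iterations=10):
    """Basic K-means loop: assign points to the nearest centroid, then average."""
    labels = []
    for _ in range(iterations):
        # Assignment step: index of the closest centroid (squared Euclidean distance).
        labels = [
            min(
                range(len(centroids)),
                key=lambda c: sum((p - q) ** 2 for p, q in zip(point, centroids[c])),
            )
            for point in points
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for c in range(len(centroids)):
            members = [pt for pt, lab in zip(points, labels) if lab == c]
            if members:
                new_centroids.append([sum(dim) / len(members) for dim in zip(*members)])
            else:
                new_centroids.append(centroids[c])  # leave an empty cluster where it is
        if new_centroids == centroids:
            break  # converged: nothing moved
        centroids = new_centroids
    return labels, centroids


points = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]]
labels, centroids = kmeans(points, centroids=[[0.0, 0.0], [5.0, 5.0]])
print(labels)     # [0, 0, 1, 1]
print(centroids)  # [[0.0, 0.5], [5.0, 5.5]]
```

The same loop applies to word vectors: each word is a point, and words whose vectors lie close together end up with the same label.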

In [73]:
# Apply K-means clustering
# n_init is set explicitly to its current default to avoid the FutureWarning;
# the random state is fixed so the results are replicable.
kmeans = KMeans(n_clusters=20, n_init=10, random_state=13)
kmeans.fit(word_vector_list)
clusters = kmeans.labels_.tolist()

# Associate words with their clusters
clustered_words = {cluster: [] for cluster in set(clusters)}
for i, word in enumerate(words_list):
    clustered_words[clusters[i]].append(word)

# Print words in each cluster
for cluster, words in clustered_words.items():
    print(f"Cluster {cluster}: {words}")


Cluster 0: ['bet', 'rate', 'face', 'wanted']
Cluster 1: ['price', 'find', 'wait', 'heard', 'hell', 'banker', 'another', 'rickert']
Cluster 2: ['want', 'make', 'time', 'big', 'thing', 'jamie', 'loss']
Cluster 3: ['would', 'got', 'way', 'maybe', 'vinnie', 'business', 'position', 'mike', 'saying', 'kathy', 'help', 'backed', 'real', 'sold', 'merrill', 'taking', 'looking', 'stop', 'many', 'started', 'four', 'fact', 'return', 'leave', 'thousand', 'old', 'lynch', 'boy', 'deutsche', 'filled', 'dollar', 'worth', 'goldman', 'close']
Cluster 4: ['talk', 'aaa', 'new', 'hear', 'getting', 'friend', 'insurance', 'minute', 'caught']
Cluster 5: ['see', 'take', 'always', 'yes', 'back', 'made', 'bad', 'bubble', 'lawrence', 'thank', 'question', 'interest', 'hate', 'nice', 'name', 'bbb', 'part', 'stearns', 'collapse', 'sorry', 'anyone', 'everybody', 'nobody', 'lewis', 'paying', 'kind', 'greenspan', 'capital', 'frontpoint', 'american', 'lost', 'everyone', 'gentleman', 'today', 'together', 'sir', 'jesus', 'n

### To qualitatively analyze the clusters formed, we check which cluster each of the following words is assigned to.

In [116]:
words_to_check = ["cdo", "bond", "subprime", "mortgage", "default", "short", "swap", "money", "loan", "stock", "market", "wall", "street", "bank"]
for cluster, words in clustered_words.items():
    for word in words_to_check:
        if word in words:
            print(f"{word} is in Cluster {cluster}")


default is in Cluster 7
stock is in Cluster 8
market is in Cluster 8
bond is in Cluster 9
subprime is in Cluster 9
mortgage is in Cluster 9
short is in Cluster 9
swap is in Cluster 9
loan is in Cluster 9
cdo is in Cluster 15
money is in Cluster 15
wall is in Cluster 15
street is in Cluster 15
bank is in Cluster 15


### Interpretation of the results
The results are illustrative of the financial instruments named throughout the movie. Various words that are used together appear in the same cluster, for example wall street, stock market, subprime mortgages, mortgage bonds, bank and money, short and swaps.

Throughout the movie the protagonists were buying up credit default swaps, but in the above output `default` and `swap` are in separate clusters. This is probably because `default` is also used in another context, i.e., when someone becomes delinquent on a loan.

The clustering clearly isn't perfect though: we would expect stock market and wall street to be clustered together, and many of the other words that share a cluster with them aren't closely related.
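
The pairwise observations above can be checked mechanically. The sketch below copies the cluster assignments from the printed output into a dictionary and tests whether a few word pairs share a cluster; the helper name `same_cluster` is hypothetical.

```python
# Cluster assignments copied from the output printed earlier in this section.
word_to_cluster = {
    "default": 7, "stock": 8, "market": 8,
    "bond": 9, "subprime": 9, "mortgage": 9, "short": 9, "swap": 9, "loan": 9,
    "cdo": 15, "money": 15, "wall": 15, "street": 15, "bank": 15,
}


def same_cluster(first, second):
    """True if both words were assigned to the same cluster."""
    return word_to_cluster[first] == word_to_cluster[second]


pairs = [("wall", "street"), ("stock", "market"), ("default", "swap"), ("stock", "wall")]
for first, second in pairs:
    print(first, second, same_cluster(first, second))
```

The first two pairs share a cluster, while `default`/`swap` and `stock`/`wall` do not, which matches the interpretation given above.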