# Exercise 02 - Sandro Suter

In the Computational Language Technologies module, the performance record consists of four tasks. The first three are handed in at regular intervals during the semester as Google Colab notebooks, and the fourth is taken as an online exam.

This notebook contains the second performance record. First, the task is explained and the data is described; then the task is carried out. Finally, the findings are explained and put into context.


## Task

Exercise two contains three main parts:

1. Find a dataset - The goal is to find and define a dataset on which I can train my own word embeddings. There are no restrictions on the source of the data.
2. Train word2vec or GloVe embeddings and visualize the semantic space - The goal is to compare the self-trained embeddings to a pre-trained space.
3. Summarize - Draw conclusions about what was learned and about the pros and cons of pre-trained embeddings.

The deadline for this exercise is 22 May 2022, and it has to be submitted by adding the link to [this google-doc](https://docs.google.com/spreadsheets/d/1btNmOWkqxykIh1xu5or1ufPgvdr-472sL9g9JkWU_Kg/edit#gid=0).


### Data selection

For this exercise I decided to have a look at George Orwell's timeless classic 'Animal Farm'. It is a fable about the Russian Revolution, and I got the text from this source:

- ['Animal Farm'](https://gutenberg.net.au/ebooks01/0100011h.html)

This book seems particularly suitable to me because it is a fable and therefore a fantasy world. Thus, the semantic spaces of this book and the pre-trained ones should be very different. In addition, I would like to find out whether it is possible to classify the individual animals in the semantic space according to their roles in the book. I will give more details on this point in the notebook. 

The following section should give a short overview of the content if you are interested.

### The Animal Farm

'Animal Farm' projects the real events during and after the Russian Revolution onto a fictional farm. The focus is on a critical view of Stalinism and the destruction of socialism it caused. To do this, Orwell uses the form of the fable and assigns the various actors of the Russian Revolution fitting roles as animals. For example, the working people become horses, the party leadership becomes pigs and the old people become donkeys. This list is not exhaustive and is only intended to give an impression. A more detailed account and a possible interpretation of the book can be found at [this link](https://www.getabstract.com/de/zusammenfassung/farm-der-tiere/5576#:~:text=Farm%20der%20Tiere%20ist%20eine,Repression%20und%20ein%20F%C3%BChrerkult%20entwickelten.) (in German).


## Setup

As a first step I have to set up the required libraries. For this notebook I will use:

- Tensorflow
- io
- pandas
- nltk
- contractions
- re
- gensim
- sklearn
- matplotlib
- numpy
- plotly

The exact use of each library is noted as a comment in the code.

In [1]:
# To download data from gutenberg.net.au
from tensorflow.keras.utils import get_file
import io

#standard
import pandas as pd

# Tokenize, stopwords
import nltk
from nltk.tokenize import sent_tokenize
from nltk.corpus import stopwords
nltk.download('stopwords')
from nltk.tokenize import word_tokenize
nltk.download('punkt')

# To remove contractions
!pip install contractions
import contractions

# To use regex
import re

# Word2Vec
from gensim.models import Word2Vec

# To download pre-trained embeddings
import gensim.downloader as api

#Visualization Tools
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt
import numpy as np
import plotly.express as px 

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
Collecting contractions
  Downloading contractions-0.1.72-py2.py3-none-any.whl (8.3 kB)
Collecting textsearch>=0.0.21
  Downloading textsearch-0.0.21-py2.py3-none-any.whl (7.5 kB)
Collecting pyahocorasick
  Downloading pyahocorasick-1.4.4-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (106 kB)
Collecting anyascii
  Downloading anyascii-0.3.1-py3-none-any.whl (287 kB)
Installing collected packages: pyahocorasick, anyascii, textsearch, contractions
Successfully installed anyascii-0.3.1 contractions-0.1.72 pyahocorasick-1.4.4 textsearch-0.0.21


### Load Data

Before I can start creating my own word embeddings I need to download the data. For this purpose I fetch the HTML version of the book from gutenberg.net.au, a project that provides free ebooks.

In [2]:
# Download the HTML version of the book from gutenberg.net.au

path = get_file('Animal_Farm.txt', origin='https://gutenberg.net.au/ebooks01/0100011h.html')

# Read the file; characters that cannot be decoded are silently dropped
with io.open(path, errors='ignore') as f:
  text_af = f.read()

# Number of characters in the raw HTML
print('Corpus: ' + str(len(text_af)))

Downloading data from https://gutenberg.net.au/ebooks01/0100011h.html
Corpus: 174841


### Clean Data

In order to train my own word space, some preprocessing is still needed. Since the whole text is HTML, I first trim the data so that only the main body is included. Then I remove the digits and special characters, the contractions, redundant whitespace and the stopwords. After these steps, the individual sentences can be tokenized and the word embedding can be created.

#### Cut HTML

In [None]:
#Have a look at the raw-data
text_af[0:4000]

In [3]:
# The data is HTML text.
# Cut the file so that only the main body remains
# (hard-coded character offsets that only fit this particular file)

text_af = text_af[3360:-433]
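
As an alternative to hard-coded character offsets, the visible text could be pulled out of the HTML with a parser, which would also resolve entities such as `&mdash;` instead of leaving tag and entity names behind. The following is only a minimal sketch using Python's standard `html.parser`. The class `TextExtractor`, the function `html_to_text` and the sample string are hypothetical, they are not part of the notebook, and the sketch has not been run against the actual page.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text and skips <head>, <script> and <style> content."""

    SKIPPED = {'head', 'script', 'style'}

    def __init__(self):
        # convert_charrefs=True turns entities like &mdash; into real characters
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only while we are outside of the skipped elements
        if self._skip_depth == 0:
            self.parts.append(data)


def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Join the text pieces and collapse all whitespace runs
    return ' '.join(' '.join(parser.parts).split())


# Hypothetical sample, not the real page
sample = ('<html><head><title>x</title></head><body>'
          '<p>Mr. Jones&mdash;drunk</p><p>Old Major</p></body></html>')
print(html_to_text(sample))
```

With such a helper, `text_af = html_to_text(text_af)` could replace the slicing step, although front matter and licence text on the page would still need to be trimmed separately.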

In [4]:
# Split the text into sentences
l_af = sent_tokenize(text_af)

# Add sentences to df as rows
df_af = pd.DataFrame(l_af, columns=['text'])
print(df_af.head(3))
print(len(df_af))

                                                text
0  s for the\nnight, but was too drunk to remembe...
1  With\nthe ring of light from his lantern danci...
2  Word had\ngone round during the day that old M...
1434


In [5]:
#Add category
df_af['Book_Title'] = 'Animal Farm'

In [6]:
#Check which special characters are in the data
import string

allowed_chars = set(string.ascii_letters + ' ')
sp_char = []
for x in range(len(df_af)):
  #Collect every character that is neither a letter nor a space
  for i in set(df_af['text'][x]):
    if i not in allowed_chars:
      sp_char.append(i)

In [7]:
l_sp_char = list(set(sp_char))
l_sp_char

['6',
 '9',
 '\n',
 '>',
 '7',
 ')',
 '/',
 '<',
 '4',
 '8',
 '=',
 '0',
 '-',
 '2',
 '1',
 '5',
 "'",
 '&',
 ':',
 '"',
 '3',
 ';',
 '(',
 ',',
 '.',
 '?',
 '!']
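
The same check can be written more compactly with a set difference. This is a minimal sketch, assuming the sentences are available as a plain list of strings; the function name `find_special_chars` and the sample sentences are hypothetical.

```python
import string


def find_special_chars(texts):
    """Return all characters that are neither ASCII letters nor spaces."""
    allowed = set(string.ascii_letters + ' ')
    seen = {ch for text in texts for ch in text}
    # Sorting makes the result reproducible, unlike the order of a plain set
    return sorted(seen - allowed)


# Hypothetical sample sentences
print(find_special_chars(['Hi, you!', 'a<p>b 42']))
```

Applied to the notebook's data, the call would be `find_special_chars(df_af['text'])`.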

#### Remove Special Characters, Digits & Punctuation

In [8]:
#Function to remove special characters, digits and punctuation
def text_cleaner(df, column):
  df_transform = df.copy(deep=True)
  #Loop over the special characters found above
  for i in l_sp_char:
    #Replace each special character, digit and punctuation mark by ' '
    #(regex=False treats characters such as '.' or '?' literally)
    df_transform[column] = df_transform[column].str.replace(i, ' ', regex=False)
  #Return the cleaned column that was passed in
  return df_transform[column]

In [9]:
#Add column without special characters to dataframe
df_af['clean_text'] = text_cleaner(df_af,'text')

In [10]:
df_af

Unnamed: 0,text,Book_Title,clean_text
0,"s for the\nnight, but was too drunk to remembe...",Animal Farm,s for the night but was too drunk to remember...
1,With\nthe ring of light from his lantern danci...,Animal Farm,With the ring of light from his lantern dancin...
2,Word had\ngone round during the day that old M...,Animal Farm,Word had gone round during the day that old Ma...
3,It had been agreed that they\nshould all meet ...,Animal Farm,It had been agreed that they should all meet i...
4,"Old Major (so he was always called, though the...",Animal Farm,Old Major so he was always called though the...
...,...,...,...
1429,"Yes, a violent quarrel was in\nprogress.",Animal Farm,Yes a violent quarrel was in progress
1430,"There were shoutings, bangings on the table, s...",Animal Farm,There were shoutings bangings on the table s...
1431,The source of the trouble\nappeared to be that...,Animal Farm,The source of the trouble appeared to be that ...
1432,"No question, now, what had happened to the fac...",Animal Farm,No question now what had happened to the fac...


#### Remove contractions


In [11]:

#Function to remove contractions
def remove_contractions(df, column):
  df_transform = df.copy(deep=True)
  l = []
  for i in df_transform[column]:
    text = i
    list_cleaned = []
    for word in text.split():
      list_cleaned.append(contractions.fix(word))
    expanded_text = ' '.join(list_cleaned)
    l.append(expanded_text)
  df_transform['without_contractions'] = l
  return df_transform['without_contractions']
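
One thing to keep in mind with this step is the order of operations: the apostrophe is among the special characters replaced by a space in the previous step, so by the time the contractions are expanded there may be nothing left to expand. The sketch below illustrates the effect with a tiny, hypothetical contraction map (`CONTRACTIONS`) standing in for the `contractions` library; the helper names and the sample sentence are assumptions for illustration only.

```python
import re

# Hypothetical mini-map, standing in for the contractions library
CONTRACTIONS = {"don't": "do not", "it's": "it is", "wouldn't": "would not"}


def expand(text):
    """Expand the contractions known to the mini-map, word by word."""
    return ' '.join(CONTRACTIONS.get(w.lower(), w) for w in text.split())


def strip_punct(text):
    """Replace everything except letters and spaces by a space."""
    return re.sub(r"[^A-Za-z ]", " ", text)


sentence = "The animals don't know it's wrong"

# Order used in the notebook: clean first, then expand
clean_first = expand(strip_punct(sentence))
# Alternative order: expand first, then clean
expand_first = strip_punct(expand(sentence))

print(clean_first)
print(expand_first)
```

In the first variant the fragments `don t` and `it s` survive as separate tokens; in the second the contractions are expanded before the apostrophe disappears.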




In [12]:
#Add column without contractions to df
df_af['without_contractions'] = remove_contractions(df_af, 'clean_text')

In [13]:
df_af

Unnamed: 0,text,Book_Title,clean_text,without_contractions
0,"s for the\nnight, but was too drunk to remembe...",Animal Farm,s for the night but was too drunk to remember...,s for the night but was too drunk to remember ...
1,With\nthe ring of light from his lantern danci...,Animal Farm,With the ring of light from his lantern dancin...,With the ring of light from his lantern dancin...
2,Word had\ngone round during the day that old M...,Animal Farm,Word had gone round during the day that old Ma...,Word had gone round during the day that old Ma...
3,It had been agreed that they\nshould all meet ...,Animal Farm,It had been agreed that they should all meet i...,It had been agreed that they should all meet i...
4,"Old Major (so he was always called, though the...",Animal Farm,Old Major so he was always called though the...,Old Major so he was always called though the n...
...,...,...,...,...
1429,"Yes, a violent quarrel was in\nprogress.",Animal Farm,Yes a violent quarrel was in progress,Yes a violent quarrel was in progress
1430,"There were shoutings, bangings on the table, s...",Animal Farm,There were shoutings bangings on the table s...,There were shoutings bangings on the table sha...
1431,The source of the trouble\nappeared to be that...,Animal Farm,The source of the trouble appeared to be that ...,The source of the trouble appeared to be that ...
1432,"No question, now, what had happened to the fac...",Animal Farm,No question now what had happened to the fac...,No question now what had happened to the faces...


#### Remove redundant whitespace

In [14]:
#Function to remove double whitespaces
def remove_whitespace(df, column):
  df_transform = df.copy(deep=True)
  l = []
  for i in df_transform[column]:
    text = i
    #Collapse runs of two or more spaces into one (Python patterns take no /.../g delimiters)
    text = re.sub(r' {2,}', ' ', text)
    l.append(text)
  df_transform['without_whitespace'] = l
  return df_transform['without_whitespace']
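
When collapsing whitespace with `re.sub`, it is worth remembering that Python patterns are written without the `/.../g` delimiters known from JavaScript; in Python the slashes and the `g` would be matched literally. A small sketch with a hypothetical sample string shows the difference:

```python
import re

# Hypothetical sample with redundant spaces
text = 'Old   Major  so he was   called'

# JavaScript-style pattern: searched for literally, so nothing is replaced
js_style = re.sub('/  +/g', ' ', text)

# Python pattern: two or more spaces are collapsed into one
py_style = re.sub(r' {2,}', ' ', text)

print(js_style == text)
print(py_style)
```

`re.sub` replaces all matches by default, so no global flag is needed.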

In [15]:
#Add column without double whitespaces to df
df_af['without_whitespace'] = remove_whitespace(df_af, 'without_contractions')

In [16]:
df_af

Unnamed: 0,text,Book_Title,clean_text,without_contractions,without_whitespace
0,"s for the\nnight, but was too drunk to remembe...",Animal Farm,s for the night but was too drunk to remember...,s for the night but was too drunk to remember ...,s for the night but was too drunk to remember ...
1,With\nthe ring of light from his lantern danci...,Animal Farm,With the ring of light from his lantern dancin...,With the ring of light from his lantern dancin...,With the ring of light from his lantern dancin...
2,Word had\ngone round during the day that old M...,Animal Farm,Word had gone round during the day that old Ma...,Word had gone round during the day that old Ma...,Word had gone round during the day that old Ma...
3,It had been agreed that they\nshould all meet ...,Animal Farm,It had been agreed that they should all meet i...,It had been agreed that they should all meet i...,It had been agreed that they should all meet i...
4,"Old Major (so he was always called, though the...",Animal Farm,Old Major so he was always called though the...,Old Major so he was always called though the n...,Old Major so he was always called though the n...
...,...,...,...,...,...
1429,"Yes, a violent quarrel was in\nprogress.",Animal Farm,Yes a violent quarrel was in progress,Yes a violent quarrel was in progress,Yes a violent quarrel was in progress
1430,"There were shoutings, bangings on the table, s...",Animal Farm,There were shoutings bangings on the table s...,There were shoutings bangings on the table sha...,There were shoutings bangings on the table sha...
1431,The source of the trouble\nappeared to be that...,Animal Farm,The source of the trouble appeared to be that ...,The source of the trouble appeared to be that ...,The source of the trouble appeared to be that ...
1432,"No question, now, what had happened to the fac...",Animal Farm,No question now what had happened to the fac...,No question now what had happened to the faces...,No question now what had happened to the faces...


#### Remove Stopwords


In [17]:
#Define english stopwords
stop_words = set(stopwords.words('english'))

In [18]:
# Function to capitalize stopwords
# (the parameter is called `words` so that the built-in `list` is not shadowed)
def capitalize_string(words):
  list_cap = []
  for i in words:
    word_cap = i.capitalize()
    list_cap.append(word_cap)
  return list_cap

In [19]:
#Create set of stopwords capitalized
stopword_cap = capitalize_string(stop_words)
stopword_cap = set(stopword_cap)

In [20]:
#Check
print(len(stop_words))
print(len(stopword_cap))

179
179


In [21]:
#Function to remove stopwords
def remove_stopwords(df, column):
  df_transform = df.copy(deep=True)
  l = []
  for i in df_transform[column]:
    text = i
    list_without_stopword = []
    word_tokens = word_tokenize(text)
    for x in word_tokens:
      if x not in stop_words and x not in stopword_cap:
        list_without_stopword.append(x)
    text = ' '.join([token for token in list_without_stopword])
    l.append(text)
  df_transform['without_stopwords'] = l
  return df_transform['without_stopwords']
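
Instead of maintaining a second, capitalized stopword set, the tokens can also be lower-cased for the comparison only. The following sketch assumes a tiny, hypothetical stopword set (`STOP`) in place of the NLTK list; the function name `drop_stopwords` is likewise an assumption.

```python
# Hypothetical mini stopword set, standing in for the NLTK list
STOP = {'the', 'was', 'to', 'but', 'for', 'too'}


def drop_stopwords(tokens, stop=STOP):
    """Keep tokens whose lower-cased form is not a stopword."""
    # The original casing of the kept tokens is preserved
    return [t for t in tokens if t.lower() not in stop]


print(drop_stopwords(['The', 'night', 'was', 'TOO', 'dark', 'for', 'Jones']))
```

Compared with the two-set approach, this also catches fully upper-cased stopwords such as `TOO`.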

In [22]:
#Add column without stopwords
df_af['without_stopwords'] = remove_stopwords(df_af, 'without_whitespace')

In [23]:
df_af

Unnamed: 0,text,Book_Title,clean_text,without_contractions,without_whitespace,without_stopwords
0,"s for the\nnight, but was too drunk to remembe...",Animal Farm,s for the night but was too drunk to remember...,s for the night but was too drunk to remember ...,s for the night but was too drunk to remember ...,night drunk remember shut pop holes
1,With\nthe ring of light from his lantern danci...,Animal Farm,With the ring of light from his lantern dancin...,With the ring of light from his lantern dancin...,With the ring of light from his lantern dancin...,ring light lantern dancing side side lurched a...
2,Word had\ngone round during the day that old M...,Animal Farm,Word had gone round during the day that old Ma...,Word had gone round during the day that old Ma...,Word had gone round during the day that old Ma...,Word gone round day old Major prize Middle Whi...
3,It had been agreed that they\nshould all meet ...,Animal Farm,It had been agreed that they should all meet i...,It had been agreed that they should all meet i...,It had been agreed that they should all meet i...,agreed meet big barn soon Mr Jones safely way
4,"Old Major (so he was always called, though the...",Animal Farm,Old Major so he was always called though the...,Old Major so he was always called though the n...,Old Major so he was always called though the n...,Old Major always called though name exhibited ...
...,...,...,...,...,...,...
1429,"Yes, a violent quarrel was in\nprogress.",Animal Farm,Yes a violent quarrel was in progress,Yes a violent quarrel was in progress,Yes a violent quarrel was in progress,Yes violent quarrel progress
1430,"There were shoutings, bangings on the table, s...",Animal Farm,There were shoutings bangings on the table s...,There were shoutings bangings on the table sha...,There were shoutings bangings on the table sha...,shoutings bangings table sharp suspicious glan...
1431,The source of the trouble\nappeared to be that...,Animal Farm,The source of the trouble appeared to be that ...,The source of the trouble appeared to be that ...,The source of the trouble appeared to be that ...,source trouble appeared Napoleon Mr Pilkington...
1432,"No question, now, what had happened to the fac...",Animal Farm,No question now what had happened to the fac...,No question now what had happened to the faces...,No question now what had happened to the faces...,question happened faces pigs


#### Tokenization

Tokenization is about preparing the data so that it can be turned into embeddings using the Word2Vec algorithm. Here I will tokenize the columns 'without_stopwords' and 'clean_text'. I do this to see what effect including or excluding the stopwords has on the word space.

In [24]:
#Tokenize Column 'without_stopwords' and 'clean_text' (with stopwords, and double whitespaces)

el  = [] #empty list
for row in df_af['without_stopwords']:
  tokens = word_tokenize(row)
  el.append(tokens)

df_af['Tokenized'] = el

el  = [] #empty list
for row in df_af['clean_text']:
  tokens = word_tokenize(row)
  el.append(tokens)

df_af['Tokenized_clean_text'] = el

df_af

Unnamed: 0,text,Book_Title,clean_text,without_contractions,without_whitespace,without_stopwords,Tokenized,Tokenized_clean_text
0,"s for the\nnight, but was too drunk to remembe...",Animal Farm,s for the night but was too drunk to remember...,s for the night but was too drunk to remember ...,s for the night but was too drunk to remember ...,night drunk remember shut pop holes,"[night, drunk, remember, shut, pop, holes]","[s, for, the, night, but, was, too, drunk, to,..."
1,With\nthe ring of light from his lantern danci...,Animal Farm,With the ring of light from his lantern dancin...,With the ring of light from his lantern dancin...,With the ring of light from his lantern dancin...,ring light lantern dancing side side lurched a...,"[ring, light, lantern, dancing, side, side, lu...","[With, the, ring, of, light, from, his, lanter..."
2,Word had\ngone round during the day that old M...,Animal Farm,Word had gone round during the day that old Ma...,Word had gone round during the day that old Ma...,Word had gone round during the day that old Ma...,Word gone round day old Major prize Middle Whi...,"[Word, gone, round, day, old, Major, prize, Mi...","[Word, had, gone, round, during, the, day, tha..."
3,It had been agreed that they\nshould all meet ...,Animal Farm,It had been agreed that they should all meet i...,It had been agreed that they should all meet i...,It had been agreed that they should all meet i...,agreed meet big barn soon Mr Jones safely way,"[agreed, meet, big, barn, soon, Mr, Jones, saf...","[It, had, been, agreed, that, they, should, al..."
4,"Old Major (so he was always called, though the...",Animal Farm,Old Major so he was always called though the...,Old Major so he was always called though the n...,Old Major so he was always called though the n...,Old Major always called though name exhibited ...,"[Old, Major, always, called, though, name, exh...","[Old, Major, so, he, was, always, called, thou..."
...,...,...,...,...,...,...,...,...
1429,"Yes, a violent quarrel was in\nprogress.",Animal Farm,Yes a violent quarrel was in progress,Yes a violent quarrel was in progress,Yes a violent quarrel was in progress,Yes violent quarrel progress,"[Yes, violent, quarrel, progress]","[Yes, a, violent, quarrel, was, in, progress]"
1430,"There were shoutings, bangings on the table, s...",Animal Farm,There were shoutings bangings on the table s...,There were shoutings bangings on the table sha...,There were shoutings bangings on the table sha...,shoutings bangings table sharp suspicious glan...,"[shoutings, bangings, table, sharp, suspicious...","[There, were, shoutings, bangings, on, the, ta..."
1431,The source of the trouble\nappeared to be that...,Animal Farm,The source of the trouble appeared to be that ...,The source of the trouble appeared to be that ...,The source of the trouble appeared to be that ...,source trouble appeared Napoleon Mr Pilkington...,"[source, trouble, appeared, Napoleon, Mr, Pilk...","[The, source, of, the, trouble, appeared, to, ..."
1432,"No question, now, what had happened to the fac...",Animal Farm,No question now what had happened to the fac...,No question now what had happened to the faces...,No question now what had happened to the faces...,question happened faces pigs,"[question, happened, faces, pigs]","[No, question, now, what, had, happened, to, t..."


## Vectorization

After the preprocessing is completed, the Word2Vec model can be created. For this purpose, the parameters are set first and then the model is created by calling `Word2Vec()`.

#### Word2Vec Model

In [25]:
# Parameter setting (written for gensim 3.x; gensim 4.x renames `size` to `vector_size`)
feature_size = 200  # Length of embedding vector.
window_context = 5  # Context window size. Maximum distance between the current and predicted word within a sentence.
min_word_count = 5  # Minimum word count. Ignores all words with total frequency lower than this.
mode = 0            # 0=Continuous-bag-of-words (CBOW), 1=Skip-gram. Details here: https://radimrehurek.com/gensim/auto_examples/tutorials/run_word2vec.html

#Model Space without stopwords
w2vmodel = Word2Vec(sentences=df_af['Tokenized'],
                    size=feature_size,
                    window=window_context,
                    min_count=min_word_count,
                    sg = mode)

#Model space with stopwords
w2vmodel_1 = Word2Vec(sentences=df_af['Tokenized_clean_text'],
                    size=feature_size,
                    window=window_context,
                    min_count=min_word_count,
                    sg = mode)
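
To make the `sg` switch and the `window` parameter more tangible, the sketch below generates the kind of training examples the two modes are built on: CBOW predicts the centre word from its context, while skip-gram predicts each context word from the centre word. It is a simplified illustration, not gensim's implementation; it ignores refinements such as randomly shrunk windows, subsampling and negative sampling, and the function names are hypothetical.

```python
def context_of(tokens, i, window):
    """Words at most `window` positions away from position i."""
    lo = max(0, i - window)
    hi = min(len(tokens), i + window + 1)
    return [tokens[j] for j in range(lo, hi) if j != i]


def skipgram_pairs(tokens, window):
    """(centre, context) pairs: one example per context word."""
    return [(centre, ctx)
            for i, centre in enumerate(tokens)
            for ctx in context_of(tokens, i, window)]


def cbow_examples(tokens, window):
    """(context words, centre) examples: one example per position."""
    return [(context_of(tokens, i, window), centre)
            for i, centre in enumerate(tokens)]


# Hypothetical tokenized sentence
sentence = ['old', 'major', 'prize', 'boar']
print(skipgram_pairs(sentence, window=1))
print(cbow_examples(sentence, window=1))
```

For the same sentence, skip-gram yields more and smaller examples than CBOW, which is one reason it is often described as working better for rare words at the cost of slower training.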


## Visualization and Comparison of Vectorspaces

Now the created word spaces can be looked at more closely and compared with pre-trained spaces, in this case with GloVe and Fasttext.

In [26]:
#Get self trained embeddings
words = w2vmodel.wv.index2word
words_1 = w2vmodel_1.wv.index2word

#Download pretrained embeddings
# Glove embeddings
glove_vectors = api.load('glove-wiki-gigaword-100')

# Fasttext embeddings
ft_vectors = api.load('fasttext-wiki-news-subwords-300')



In [None]:
#Check length of vectors per word

wvs = w2vmodel.wv[words[10]]
wvs_1 = w2vmodel_1.wv[words_1[10]]
glo = glove_vectors[glove_vectors.index2word[10]]   # KeyedVectors are indexed directly, `.wv` is deprecated here
ft = ft_vectors[ft_vectors.index2word[10]]

print('****Word2Vec Tokenized - Self trained****')
print('length: ', len(wvs), '\n\n')

print('****Word2Vec Tokenized clean text - Self trained****')
print('length: ', len(wvs_1), '\n\n')

print('****GloVe****')
print('length: ', len(glo), '\n\n')

print('****Fasttext****')
print('length: ', len(ft))

First, let's look at the most frequent words and the vocabulary size of each embedding.

In [35]:
# Most frequent words in Embeddings

print('****Word2Vec Tokenized - Self trained****')
print('10 most frequent words: ', w2vmodel.wv.index2word[0:10])
print('Total no. of words:    ', len(w2vmodel.wv.index2word), '\n\n')

print('****Word2Vec Tokenized clean text - Self trained****')
print('10 most frequent words: ', w2vmodel_1.wv.index2word[0:10])
print('Total no. of words:    ', len(w2vmodel_1.wv.index2word), '\n\n')

print('****GloVE****')
print('10 most frequent words: ', glove_vectors.index2word[0:10])
print('Total no. of words:    ', len(glove_vectors.index2word),'\n\n')

print('****Fasttext****')
print('10 most frequent words: ', ft_vectors.index2word[0:10])
print('Total no. of words:    ', len(ft_vectors.index2word))

****Word2Vec Tokenized - Self trained****
10 most frequent words:  ['p', 'animals', 'Napoleon', 'would', 'Snowball', 'farm', 'Boxer', 'pigs', 'said', 'Jones']
Total no. of words:     680 


****Word2Vec Tokenized clean text - Self trained****
10 most frequent words:  ['the', 'and', 'of', 'to', 'was', 'a', 'p', 'had', 'in', 'that']
Total no. of words:     837 


****GloVE****
10 most frequent words:  ['the', ',', '.', 'of', 'to', 'and', 'in', 'a', '"', "'s"]
Total no. of words:     400000 


****Fasttext****
10 most frequent words:  [',', 'the', '.', 'and', 'of', 'to', 'in', 'a', '"', ':']
Total no. of words:     999999


Some interesting findings can already be made here. The first thing that stands out is that the pre-trained embeddings, with 400,000 and 999,999 words, have vocabularies that are orders of magnitude larger than the ones I created (680 and 837 words). It is also noticeable that the pre-trained ones contain stop words and punctuation.

With regard to the self-trained embeddings, it is interesting to see that in the variant with stop words, these obviously dominate, while in the variant without stop words, the focus is on the main actors of the book.

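In addition, the token 'p' is included in the vocabulary. It is an artifact that was not removed during cleaning: most likely it is what remains of the HTML paragraph tag `<p>` once the angle brackets have been replaced by spaces (the token 'mdash' in the outputs further below would be the remains of the entity `&mdash;` in the same way). It is a nice example of how laborious the cleaning step is.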
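
How the vocabulary sizes and the frequency ordering seen above come about can be sketched with a simple counter: words below the minimum count are dropped, and the remaining ones are sorted by descending frequency, which is also the order in which gensim lists its vocabulary by default. The function `build_vocab`, the alphabetical tie-breaking and the toy sentences are assumptions made for this illustration.

```python
from collections import Counter


def build_vocab(sentences, min_count):
    """Words occurring at least min_count times, most frequent first."""
    counts = Counter(tok for sent in sentences for tok in sent)
    kept = [(w, c) for w, c in counts.items() if c >= min_count]
    # Sort by descending count; ties are broken alphabetically (an assumption)
    kept.sort(key=lambda wc: (-wc[1], wc[0]))
    return [w for w, _ in kept]


# Hypothetical tokenized sentences, including the artifact 'p'
toy_sentences = [['p', 'animals', 'farm'],
                 ['p', 'animals'],
                 ['p', 'Napoleon']]
print(build_vocab(toy_sentences, min_count=2))
```

An artifact token that appears in almost every sentence ends up at the very top of such a list, just like 'p' in the self-trained embedding.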

Next, I want to focus on the self-trained embeddings and see which words are most similar to the most frequent words, among them the main characters of the book:

In [36]:
# Check similar words in self trained embedding without stopwords

similar_words = {
    search_term:
    [item[0] for item in w2vmodel.wv.most_similar([search_term], topn=5)]
    for search_term in ['animals', 'Napoleon', 'would', 'Snowball', 'farm', 'Boxer', 'pigs', 'said', 'Jones']
}

print(similar_words)

# Check similar words in self trained embedding with stopwords
similar_words_1 = {
    search_term:
    [item[0] for item in w2vmodel_1.wv.most_similar([search_term], topn=5)]
    for search_term in ['animals', 'Napoleon', 'would', 'Snowball', 'farm', 'Boxer', 'pigs', 'said', 'Jones']
}

print(similar_words_1)

{'animals': ['p', 'Snowball', 'would', 'Napoleon', 'dogs'], 'Napoleon': ['p', 'would', 'animals', 'Snowball', 'two'], 'would': ['p', 'Snowball', 'Napoleon', 'dogs', 'mdash'], 'Snowball': ['p', 'would', 'animals', 'dogs', 'could'], 'farm': ['would', 'p', 'Napoleon', 'Snowball', 'Jones'], 'Boxer': ['Napoleon', 'p', 'would', 'animals', 'mdash'], 'pigs': ['p', 'Snowball', 'would', 'animals', 'dogs'], 'said': ['would', 'Snowball', 'p', 'dogs', 'animals'], 'Jones': ['p', 'Snowball', 'would', 'animals', 'dogs']}
{'animals': ['of', 'to', 'and', 'it', 'on'], 'Napoleon': ['to', 'his', 'a', 'had', 'for'], 'would': ['to', 'in', 'no', 'from', 'him'], 'Snowball': ['and', 'in', 'at', 'of', 'to'], 'farm': ['in', 'and', 'this', 'of', 'it'], 'Boxer': ['to', 'up', 'Napoleon', 'and', 'by'], 'pigs': ['in', 'of', 'up', 'to', 'and'], 'said': ['to', 'his', 'would', 'it', 'up'], 'Jones': ['with', 'a', 'in', 'to', 'it']}


The insight gained above, that the stop words dominate the word space, is confirmed once again here. Therefore, the following analysis focuses on the embedding without stop words. Apart from the artifact 'p', 'animals' is the most frequent word in this embedding. Therefore, its most similar words are looked up again, this time in all four embeddings:

In [30]:
# Check the words most similar to 'animals' per space

word = 'animals'

print('****Word2Vec Tokenized - Self trained')
print(w2vmodel.wv.most_similar(word, topn=10), '\n\n')

print('****Word2Vec Tokenized clean Text - Self trained')
print(w2vmodel_1.wv.most_similar(word, topn=10), '\n\n')

print('****GloVe****')
print(glove_vectors.most_similar(word), '\n\n')

print('****Fasttext****')
print(ft_vectors.most_similar(word))

****Word2Vec Tokenized - Self trained
[('p', 0.9985370635986328), ('Snowball', 0.9981609582901001), ('would', 0.9978682994842529), ('Napoleon', 0.9977008104324341), ('dogs', 0.9976803660392761), ('Jones', 0.9973463416099548), ('could', 0.9973201751708984), ('even', 0.9971562027931213), ('two', 0.9971539974212646), ('mdash', 0.9971175789833069)] 


****Word2Vec Tokenized clean Text - Self trained
[('of', 0.9999536275863647), ('to', 0.9999517202377319), ('and', 0.9999513626098633), ('it', 0.9999508261680603), ('on', 0.999950647354126), ('with', 0.9999502897262573), ('a', 0.9999499917030334), ('he', 0.9999488592147827), ('that', 0.9999485015869141), ('in', 0.999948263168335)] 


****GloVe****
[('animal', 0.855276346206665), ('humans', 0.8193878531455994), ('birds', 0.8137691020965576), ('mammals', 0.7603077292442322), ('dogs', 0.7566978931427002), ('cats', 0.7538944482803345), ('creatures', 0.732198178768158), ('pigs', 0.7274112105369568), ('pets', 0.7217082977294922), ('cows', 0.71314162

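Again, some interesting findings can be made. That the self-trained embedding with stop words is dominated by them has already been established. It is also noticeable that all similarities in the two self-trained spaces lie above 0.997, which suggests that the vectors have hardly separated from each other, probably because of the small amount of training data. The pre-trained embeddings behave differently. Although they also contain stop words and special characters, the nearest neighbours of 'animals' are living beings (birds, mammals, dogs, etc.).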
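
The scores returned by `most_similar` are cosine similarities, i.e. they only measure the angle between two vectors and not their length. A minimal pure-Python sketch with hypothetical two-dimensional toy vectors shows how such a ranking comes about; the function names and the vectors are assumptions for illustration.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(word, vectors, topn=3):
    """Rank all other words by cosine similarity to `word`."""
    scores = [(other, cosine(vectors[word], vec))
              for other, vec in vectors.items() if other != word]
    return sorted(scores, key=lambda s: s[1], reverse=True)[:topn]


# Hypothetical toy vectors, not taken from any trained model
toy = {'animals': [1.0, 0.2], 'pigs': [0.9, 0.3], 'windmill': [-0.5, 1.0]}
print(most_similar('animals', toy))
```

If all vectors point in almost the same direction, every pair gets a score close to 1 and the ranking carries little information, which matches the picture in the self-trained spaces.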

Finally, the word spaces are presented in interactive scatterplots. Note that, because of the running time and the readability of the display, only the first 680 words of each pre-trained embedding are displayed. I would like to encourage you to take a closer look at the plots and play around with them a little:

### Interactive Plot Word2Vec without Stopwords

In [31]:
#Visualize word space in an interactive way

words = w2vmodel.wv.index2word     # Get the word forms of the vocabulary sample
wvs = w2vmodel.wv[words]                     # Get embeddings of word forms

tsne = TSNE(n_components=2, random_state=0, n_iter=5000, perplexity=5)
np.set_printoptions(suppress=True)
T = tsne.fit_transform(wvs)
labels = words


fig = px.scatter(
    zip(labels, T[:, 0], T[:, 1]), x=1, y=2,
    text = labels
)

fig.update_traces(textposition='top center')

fig.show()



In the plot without stop words, it is noticeable that the individual clusters are not so clearly divided.

However, it is exciting to see that the story of the book is reflected in the space. For example, 'Napoleon', 'Snowball' and 'Jones', the two leading pigs and the farmer in the book, who stand for Stalin, Trotsky and the Tsar in real life, are very close to each other.

Based on this, I would say that an embedding trained without stopwords is well suited for analysing the actors in a text and the strength of their connections.

### Interactive Plot Word2Vec with Stopwords

In [32]:
#Visualize word space in an interactive way

words = w2vmodel_1.wv.index2word     # Get the word forms of the vocabulary sample
wvs = w2vmodel_1.wv[words]                     # Get embeddings of word forms

tsne = TSNE(n_components=2, random_state=0, n_iter=5000, perplexity=5)
np.set_printoptions(suppress=True)
T = tsne.fit_transform(wvs)
labels = words

fig = px.scatter(
    zip(labels, T[:, 0], T[:, 1]), x=1, y=2,
    text = labels
)

fig.update_traces(textposition='top center')

fig.show()


The default initialization in TSNE will change from 'random' to 'pca' in 1.2.


The default learning rate in TSNE will change from 200.0 to 'auto' in 1.2.



In this plot (self-trained with stopwords) we already see very clearly separated clusters. Within these clusters there is also already a certain structure in terms of word groups. However, these are not yet developed in such a way that, for example, the animals or people are clearly structured in a cluster.

This can have several reasons. On the one hand, there is very little data (a vocabulary of only a little more than 800 words). On the other hand, the word space was built on the basis of a fable and thus a fantasy world.

### Interactive Plot GloVe & Fasttext

In [33]:
#Visualize word space in an interactive way

words = glove_vectors.index2word[0:680]     # Get the word forms of the vocabulary sample
wvs = glove_vectors[words]                     # Get embeddings of word forms (KeyedVectors are indexed directly)

tsne = TSNE(n_components=2, random_state=0, n_iter=5000, perplexity=5)
np.set_printoptions(suppress=True)
T = tsne.fit_transform(wvs)
labels = words

fig = px.scatter(
    zip(labels, T[:, 0], T[:, 1]), x=1, y=2,
    text = labels
)

fig.update_traces(textposition='top center')

fig.show()


The default initialization in TSNE will change from 'random' to 'pca' in 1.2.


The default learning rate in TSNE will change from 200.0 to 'auto' in 1.2.



In [34]:
#Visualize word space in an interactive way

words = ft_vectors.index2word[0:680]     # Get the word forms of the vocabulary sample
wvs = ft_vectors[words]                     # Get embeddings of word forms (KeyedVectors are indexed directly)

tsne = TSNE(n_components=2, random_state=0, n_iter=5000, perplexity=5)
np.set_printoptions(suppress=True)
T = tsne.fit_transform(wvs)
labels = words

fig = px.scatter(
    zip(labels, T[:, 0], T[:, 1]), x=1, y=2,
    text = labels
)

fig.update_traces(textposition='top center')

fig.show()


The default initialization in TSNE will change from 'random' to 'pca' in 1.2.


The default learning rate in TSNE will change from 200.0 to 'auto' in 1.2.



Very clear clusters can also be recognised in these plots. In contrast to the self-trained embedding with stop words, the clusters also have clearly identifiable contents. For example, there is a cluster with digits, one with punctuation and special characters and one with numbers written out as words.

## Learnings

But what are the lessons learned from this task? From my point of view, the main one is that the method has to be adapted to the goal.

While it can make sense to train your own embedding for a specific analysis of a specific text from a specific genre, the pre-trained embeddings are better suited for more general tasks. This is certainly also related to the training data used. Furthermore, comparing the embeddings with and without stop words was very informative. Here, too, the right choice depends very much on the underlying task. If, for example, a predictive model is to be built, it may be indispensable to keep the stop words.

In general, I find the Word2Vec method very well documented and intuitive, and I was able to develop an initial understanding of the process during the task.

Finally, I would like to mention again that data cleaning is an extremely important and time-consuming step. Although I invested a lot of time in it, I did not succeed in eliminating all the impurities.