# Introduction to Word Vectors in Python

This Jupyter Notebook is designed to walk you through the basics of creating a word embedding model using two of the most popular natural language processing libraries, Gensim and Spacy. This notebook shows you how to use both Gensim and Spacy because, like most libraries, each comes with its own pros and cons.

### What is word embedding useful for?

In addition to allowing you to ask really interesting questions of your textual data (for instance, what word is most similar to "king"), word embeddings have other uses in natural language processing. For instance, a word embedding model can be used for other natural language processing tasks such as text classification, and it often increases the accuracy of these tasks. Because word embeddings capture the semantic use of a word, many natural language processing tasks become much easier with a model trained on word vectors. This is because a trained embedding lets a machine learning algorithm work with words that never appeared in its own training examples, as long as those words are in the embedding model's vocabulary. Additionally, while Word2Vec is the most popular algorithm for constructing word embeddings, the algorithm Doc2Vec extends Word2Vec to also learn a vector for each document, and thus allows you to compare the semantics of entire documents. This algorithm can be useful if you want to find the semantic similarities between two documents, and it lets you break down a corpus at both the word level, using Word2Vec, and the document level, using Doc2Vec.

### What algorithm do I use for word embeddings?

**Gensim** is a popular natural language processing library that is usually used for topic modelling. Gensim comes with the popular Word2Vec algorithm.

**Spacy** is also a popular natural language processing library that is designed to be very fast. Spacy also uses Word2Vec style word embeddings, but tends to be slightly faster than Gensim. Spacy also comes with pre-trained models built in which is incredibly useful if you are wanting to get familiar with querying a model before building your own. 

**GloVe** is an unsupervised learning algorithm developed by Stanford University. GloVe comes with some nice pre-trained models if you are wanting to play around with word embeddings without having to train your own model.

### Why are we using Gensim?

Gensim is a very memory-efficient way to work with word embedding models. Not only does Gensim come with some cool algorithms that you can apply to a downstream task such as topic modelling, but Gensim also allows you to process large amounts of text without loading all of it into memory at once. Developed by Radim Řehůřek, Gensim is one of the most popular libraries for training word embedding models in Python. Its popularity is an important feature because it means there is a vast amount of community support for the library, which makes troubleshooting much easier.

### Downloading Anaconda

https://docs.anaconda.com/anaconda/install/mac-os/

### How do I navigate this Jupyter Notebook?

This notebook is designed to be read from top to bottom. We consider this particular notebook to contain the core concepts that you need to get started with Word2Vec. The notebook uses a combination of text and code cells. The code cells contain real code that can be run in the notebook itself or brought over into your IDE of choice. In order to run a code cell, click the cell and then click the "run" button in the toolbar at the top. As a warning, some of the code blocks may not produce very useful results if they have been taken out of a larger block of code. Typically, the code will be explained line by line, and then the code in its entirety will be located in a single block at the end of each section.


# Word Embeddings Using Gensim

One of the first things that we need to do is make sure that all of the libraries that we need are installed. For this tutorial, we will be using the following libraries:

- gensim
- pandas
- NumPy
- scikit-learn
- matplotlib
- plotly

In order to install these libraries, you should refer back to the "Libraries" portion of this tutorial. It is a good coding practice to have all of your imports at the top of your code, so we are going to go ahead and load everything that we need for the entire tutorial here. There are comments next to each library explaining what each library is for. 

In [3]:
# A good practice in programming is to place your import statements at the top of your code, and to keep them together

import re  # for regular expressions
import os  # to look up operating system-based info
import string  # to do fancy things with strings
import matplotlib.pyplot as plt  # we'll use this for visualization
from mpl_toolkits.mplot3d import Axes3D  # enables 3D plots in matplotlib
import glob  # lets you locate a specific file type
from pathlib import Path  # lets us access files in other directories
import gensim  # we need to import gensim to access its instance of Word2Vec
from gensim.models import Word2Vec  # this is how we actually get access to Word2Vec
import pandas as pd  # pandas allows us to work with dataframes, which makes sorting data much easier
from gensim.models import KeyedVectors  # KeyedVectors allows us to create an instance of just the word vectors instead of loading the full model into memory every time
from sklearn.decomposition import PCA  # for principal component analysis (PCA)
from matplotlib import pyplot  # the same plotting module as plt above, available under its full name
from sklearn import cluster  # for k-means clustering
from sklearn import metrics  # for k-means clustering
from sklearn.manifold import TSNE  # for tsne plot
import numpy as np  # for PCA
import plotly.graph_objs as go  # for PCA

## Loading Your Data ##


### Loading Texts from a Folder ###

Next, we need to actually load our data into Python. It is a good idea to place your dataset somewhere that is easy to navigate to. For instance, you might place your data in a folder on your Desktop or in the same repository as your code file. In either case, you will need to know the **file path** for the folder that is currently holding your data. Then, we are going to tell the computer to iterate through that folder, pull the text from each file, and store it in a list. The code is written to process a folder with plain text files (.txt). These files can be anywhere within this folder, including in sub-folders.

A few important things to note:

1. When you are inputting your filepath, you should use the **entire** file path. For example, on a Windows computer, that filepath might look something like: C:/users/Avery/Desktop/MY_FOLDER

2. If you are having trouble getting your filepath to load successfully, try using either double backslashes in the filepath or forward slashes instead (Windows uses backslashes in its filepaths, while Macs use forward slashes)

3. Remember, you can use a file path to a folder full of different types of files, but this code is only going to look for **.txt** files. If you want to work with different file types, you'll have to change the endswith(".txt") call. However, keep in mind that these files should always contain some form of plain text. For example, a Word document or a PDF won't work with this code.

In [None]:
dirpath = r'INSERT FILE PATH HERE'  # get file path (you can change this)

filenames = []
data = []

# this for loop will run through folders and subfolders looking for a specific file type
for root, dirs, files in os.walk(dirpath, topdown=False):
    for name in files:
        # if you are wanting a different file type, change this to a different ending
        if name.endswith(".txt"):
            filenames.append(os.path.join(root, name))

# this for loop then goes through the list of files, reads them, and then adds the text to a list
for filename in filenames:
    with open(filename) as afile:  # the with statement closes the file once we are done with it
        data.append(afile.read())  # read the file and then add it to the list



Okay, let's walk through what this code is doing, exactly. As the comments indicate, the code begins by storing the file path that you provided. That little "r" in front of the file path marks it as a raw string, which tells Python to treat backslashes as ordinary characters rather than as the start of an escape sequence such as "\n" (this matters most for Windows file paths). Then, we have two empty lists that have been initialized, one called "filenames" and one called "data." Filenames is going to be used to store the name of each file as the code is **traversing** (or walking through) the folder. Data is going to actually hold all of the textual data from each .txt file.
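The effect of the "r" prefix is easy to check for yourself. The short sketch below uses a made-up Windows-style path (it is only an illustration, not a path this tutorial expects to exist) and compares the same characters written with and without the prefix.

```python
# A made-up Windows-style path, used only for illustration
plain = "C:\new_folder\texts"   # \n and \t are read as a newline and a tab
raw = r"C:\new_folder\texts"    # backslashes are kept exactly as typed

print(len(plain))    # 17, because \n and \t each collapse into one character
print(len(raw))      # 19
print("\n" in plain)  # True: the plain string contains a newline
print("\n" in raw)    # False: the raw string does not
```

Without the prefix, the path silently changes, which is one common reason a Windows file path fails to load.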

The first set of for loops tells the computer "hey, find all of the files that end with .txt in this folder and save their filenames to our 'filenames' list." The loops are nested because os.walk visits the main folder and every subfolder inside it, one at a time, and for each folder it visits, an inner loop checks the files it contains. So, you could provide a file path which points to a folder with tons of other folders nested at varying levels within that main folder, and the code will peek into each of these folders and pull out any file that ends with .txt.

The second code chunk takes that list of relevant filenames and tells the computer "open each file in this filename list, and dump whatever is in that file into our 'data'." As the computer is working through the files, it will open a file, read it, and then close it; the "with" statement takes care of closing each file as soon as its block finishes. Closing the file once it has been read is an important step for freeing up system resources. Otherwise, you could very well have over a hundred text files open. Remember, computers are actually pretty simple: they only do what you tell them to and nothing else.
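The same folder traversal can also be sketched with pathlib, which was imported at the top of this notebook. This is an alternative to the os.walk version, not a replacement for it. The helper name `load_texts` and the UTF-8 encoding are assumptions made for this example, and the demonstration runs on a temporary folder so that it does not depend on your own files.

```python
from pathlib import Path
import tempfile

def load_texts(folder):
    """Return (filenames, texts) for every .txt file under folder (hypothetical helper)."""
    paths = sorted(Path(folder).rglob("*.txt"))  # rglob searches subfolders too
    # Assumes the files are UTF-8 encoded; change this if yours are not
    texts = [p.read_text(encoding="utf-8") for p in paths]
    return [str(p) for p in paths], texts

# Small demonstration on a temporary folder with one nested subfolder
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "sub").mkdir()
    (root / "a.txt").write_text("first text", encoding="utf-8")
    (root / "sub" / "b.txt").write_text("second text", encoding="utf-8")
    (root / "notes.csv").write_text("ignored", encoding="utf-8")
    demo_names, demo_texts = load_texts(root)

print(demo_texts)  # ['first text', 'second text']
```

The .csv file is skipped because only names matching "*.txt" are collected, which mirrors the endswith(".txt") check in the original loop.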

### Loading Text from a Spreadsheet ###

Gensim is pretty versatile in that it doesn't particularly care **where** your text data comes from, as long as it is machine readable. Let's take, for example, a researcher who, instead of individual text files, has a spreadsheet where one column records where the text is sourced from (an online database, for example) and one column contains the actual text that the researcher is interested in. Converting a spreadsheet like this to plain text and feeding it into Gensim is actually really simple.

Begin by saving your spreadsheet in a CSV format. CSV (comma-separated values) is a plain-text format that our code can read directly, unlike a binary spreadsheet file such as an .xls file. Once you have your CSV file, you are going to run the following code:

In [None]:
col_list = ["cluster", "text"]  # columns you want to use; change these to match your spreadsheet

df = pd.read_csv(r'FILEPATH TO CSV FILE/file.csv', usecols=col_list)


Pretty easy, right? Let's walk through what is happening in this code. First, we initialize a list called "col_list." This list will serve to tell the computer which columns from the spreadsheet to pull out. You can change the "cluster" and "text" values to whatever would be relevant to your particular spreadsheet. Next, we create a pandas DataFrame which we are just going to call "df" for simplicity's sake. This dataframe is built by the pandas function "read_csv", which reads the CSV file located at the file path you provided (which should, again, be the entire file path). The "usecols" parameter tells the computer which columns to pull from the CSV file (that is, if we don't want to include **all** the columns in the file). Here, usecols is set to our col_list list.
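If it helps to see the column selection without pandas, the same idea can be sketched with Python's built-in csv module. The column names match col_list above, but the CSV content here is made up and held in memory purely for the demonstration.

```python
import csv
import io

col_list = ["cluster", "text"]  # the columns we want to keep

# Made-up CSV content standing in for a real file on disk
csv_source = io.StringIO(
    "cluster,source,text\n"
    "1,database A,Whisk the eggs and cream\n"
    "2,database B,Knead the dough\n"
)

# Keep only the requested columns from each row
rows = [{col: row[col] for col in col_list} for row in csv.DictReader(csv_source)]
texts = [row["text"] for row in rows]

print(texts)  # ['Whisk the eggs and cream', 'Knead the dough']
```

The "source" column is dropped, just as usecols drops any column that is not listed.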

## Cleaning the Data ##

Now that we have our data (either in the "data" variable or the "df" variable, depending on whether you decided to use a folder of plain text files or a single CSV file with text data), it's time to actually do something with it. This next step is extremely important as it can have an enormous impact on your results, so make sure to take careful notes as you are proceeding with this code.

An important next step after collecting your data is cleaning it. When we say "clean," what we mean is to remove some of the noise and inconsistencies in our data which may impact how accurately our model understands our data. For example, if you are working with text data that was created through OCR (optical character recognition), the computer-generated transcription may contain errors and inconsistencies in spelling. These errors and inconsistencies can actually make our word embedding models inaccurate. Particularly if you are using a word embedding model for downstream tasks such as classification, this inaccuracy can make the task much more difficult. If you are interested in reading more about how OCR errors can impact word embedding models, check out Dutta and Gupta (2022), Strange and McNamara (2014), and Mutuvi et al. (2018). These authors generally conclude that, depending on the text mining task, OCR noise can have varying levels of impact on the model's understanding of the data. Although word embedding models **are** negatively impacted by OCR noise, thankfully, Gensim and Word2Vec tend to handle it a little better than other language modeling approaches such as BERT or GloVe.

However, it is important to understand that not all noise is bad noise. Some researchers, for example Cordell (2017) and Rawson and Muñoz (2019), advocate embracing noise, emphasizing that textual noise can also be useful for some research. For this reason, "the cleaner the better" isn't necessarily the best approach, depending on the types of questions you are asking of your data.

OCR errors aren't the only kind of "noise." Even inconsistencies in capitalization, punctuation, and what we call "stop words" (words such as in, and, but, over, etc.) can impact how well your model understands your data. Computers don't actually understand human language, so they won't understand that "Apple" and "apple" are the same word unless you make it extremely obvious (by making both words lowercase, for example). No matter what, you will have to clean your data in some way, but you should make informed decisions about how and why you are cleaning in a particular way before proceeding.

It is important to take careful notes as you are cleaning your data on what, exactly, you decided to do with some of this noise. Transparency is always key when working with natural language processing tasks since so many of the even small decisions you make as a researcher can impact your results. 

For this walkthrough, we are going to do some basic cleaning. The changes we are going to make to the data are:

1. Making all of the words lowercase. We do this so that "Apple" and "apple" are not misinterpreted as two distinct words
2. Removing punctuation. We are removing punctuation because, again, we don't want something like "'Apple'" and "Apple." to be treated as distinct words
3. Remove any numbers from the data since we're only interested in words
4. Tokenize the data. Tokenizing means we are separating the individual words in the data so that they get fed to the model individually rather than in sentence or paragraph format.

We are going to start by writing a function that will perform our cleaning tasks. This way, if we want to clean other data later on, it is easy to simply pass that data into this function.

In [None]:
def clean_text(text):
    '''
    Cleans the given text by splitting it on whitespace, lower-casing
    each piece, and using a regular expression to strip punctuation,
    producing a list of tokens.
    Parameters:
        text: str
    Return: list of tokens (str) for this text
    '''
    re_punc = re.compile('[%s]' % re.escape(string.punctuation))

    # split on whitespace
    tokens = text.split()
    # lower case
    tokens = [t.lower() for t in tokens]
    # remove punctuation
    tokens = [re_punc.sub('', token) for token in tokens]
    # only include tokens that are made up entirely of letters (drops numbers)
    tokens = [token for token in tokens if token.isalpha()]
    return tokens


Next, we are going to actually call the function and give it our data. 

In [None]:
#clean text from folder of text files
data_clean = []
for x in data:
    data_clean.append(clean_text(x))

This code begins by initializing an empty list called "data_clean" which will hold the cleaned text. Then, using a for loop, the code walks through our data list from earlier and applies the clean_text function to each item in that list, and then stores the cleaned text in our data_clean list.
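To see what the cleaning does to a single piece of text, it can help to run it on one short, made-up sentence. The sketch below repeats a condensed version of clean_text so that it runs on its own; the sample sentence is invented for illustration.

```python
import re
import string

def clean_text(text):
    """Condensed copy of the cleaning function: split, lowercase, strip punctuation, drop non-words."""
    re_punc = re.compile('[%s]' % re.escape(string.punctuation))
    tokens = [t.lower() for t in text.split()]
    tokens = [re_punc.sub('', t) for t in tokens]
    return [t for t in tokens if t.isalpha()]

sample = 'Take 2 Apples, and "apple" sauce; don\'t peel.'
tokens = clean_text(sample)

print(tokens)
# ['take', 'apples', 'and', 'apple', 'sauce', 'dont', 'peel']
```

Notice two side effects worth recording in your notes: the number "2" disappears entirely, and "don't" becomes "dont" because the apostrophe counts as punctuation.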

In [None]:
# these print statements make sure that the original data and the clean data are the same length
# (nothing was left out or skipped)
print(len(data))
print(len(data_clean))

# compare the first and last entries before and after cleaning
print(data[0])
print(data_clean[0])
print(data[-1])
print(data_clean[-1])

It can be useful to just check that data_clean didn't miss any entries from data. You can do this by running print statements like these, which display the length of both lists and then print an entry from near the start of each list and the last entry in each list. The two lengths should match, and each cleaned entry should look like a lowercased, punctuation-free list of the words in its original. If that is what you see, nothing went wrong with our earlier function call.

In order to apply clean_text to a dataframe, such as the dataframe that we stored our CSV data in earlier, all you have to do is run the following code:

In [None]:
#clean text from dataframe
df['text'] = df['text'].apply(clean_text)

This code tells the computer to go to the column titled "text" and apply the clean_text function to each entry in that column. What is useful about working with text in a dataframe such as this, is that the dataframe will maintain columns and rows even when you are manipulating much of the data within. This structure can be useful for keeping your data formatted in a particular way or even for remembering which text your data was pulled from.

## Training the Model ##

Now we are going to actually move on to training our model. Word2Vec allows you to control a lot of how the training process works through parameters. Some of the parameters that may be of particular interest are:

1. **Sentences**. The sentences parameter is where you tell Word2Vec what data to train the model with. In our case, we are going to set this parameter to our cleaned textual data.

2. **Min_count**. Min_count is how many times a word has to appear in the corpus in order for it to 'count' as a word in the model. The default value for min_count is 5. You will likely want to change this value depending on the size of your corpus.

3. **Window**. The window parameter lets you set the size of the "window" that slides along the text. The default is 5, which means that for each word, the model considers up to five words on either side of it. The window parameter is important because word embedding models take the approach that you can tell the context of a word based on the company it keeps. The larger the window, the more words you are including in that calculation of context. Essentially, the window size controls how broad a notion of context the model uses: smaller windows tend to group words that play similar grammatical roles, while larger windows tend to group words that share a topic.

4. **Workers**. "Workers" are just what they sound like: how many "worker" threads do you want processing your text at a time? The default setting for this parameter is 3. This parameter is also optional.

5. **Epochs**. Like workers, the epochs parameter is optional. The number of epochs is how many times the model passes over the full text during training. The default is 5.

6. **Sg**. The sg parameter tells the computer what training algorithm to use. The options are CBOW (continuous bag of words) or skip-gram. In order to select CBOW, you set sg to the value 0 (the default), and in order to select skip-gram, you set the sg value to 1. Neither is better across the board. The best choice of training algorithm really depends on what your data looks like.
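To make the effect of min_count concrete, here is a small pure-Python sketch of the filtering idea. It is not Gensim's implementation; the toy corpus and the threshold of 2 are made up so that the result is easy to check by hand.

```python
from collections import Counter

# Made-up tokenized texts standing in for data_clean
toy_corpus = [
    ["whisk", "the", "eggs", "and", "the", "cream"],
    ["fold", "the", "cream", "into", "the", "eggs"],
    ["bake", "the", "bread"],
]

min_count = 2  # hypothetical threshold chosen for this tiny example

# Count every token across all texts, then keep only the frequent ones
counts = Counter(token for text in toy_corpus for token in text)
vocabulary = sorted(word for word, n in counts.items() if n >= min_count)

print(vocabulary)  # ['cream', 'eggs', 'the']
```

Words such as "bread" that appear only once are dropped, which is why a high min_count on a small corpus can leave you with a very small vocabulary.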


There are several other settings that you can adjust, but the ones above are the most crucial to understand. You can read about the additional parameters and their default settings on the website of Gensim's creator, Radim Řehůřek: https://radimrehurek.com/gensim/models/word2vec.html
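The idea of a sliding window can be illustrated in a few lines of plain Python. The function below, `context_pairs`, is a hypothetical helper written for this explanation. It only shows which words count as "context" for each word at a given window size, and Gensim's real training loop differs in its details.

```python
def context_pairs(tokens, window):
    """Yield (center, context) pairs for words at most `window` positions apart."""
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        stop = min(len(tokens), i + window + 1)
        for j in range(start, stop):
            if j != i:  # a word is not its own context
                yield center, tokens[j]

sentence = ["whisk", "the", "eggs", "and", "cream"]
narrow = list(context_pairs(sentence, window=1))
wide = list(context_pairs(sentence, window=2))

print(len(narrow), len(wide))  # 8 14
print([ctx for center, ctx in wide if center == "eggs"])
# ['whisk', 'the', 'and', 'cream']
```

With a window of 1, "eggs" is only paired with "the" and "and"; widening the window to 2 also pairs it with "whisk" and "cream."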



In [None]:
model = Word2Vec(sentences=data_clean, window=5, min_count=3, workers=4)
model.save("word2vec.model")

In the code above, we start by initializing our model and saving it under the variable "model." As you can see, we are using some of the parameters from above: sentences, window, min_count, and workers. The values of each of these parameters, save for "sentences," will likely have to be adjusted several times. There is no single setting for each of these parameters that works well for every corpus; it really depends on what your text looks like. We recommend running this training call several times with varying settings in order to figure out what works best. It is also important to keep notes of the settings for each iteration. The model will be different every time you train it, so keeping track of the changes you make each time will be very useful.

In the second line, we save our model as "word2vec.model." As you'll note, the file type that the model gets saved as in Python is a ".model" file as opposed to the ".bin" file you might be familiar with if you work in R. It is important to save the model each time you run the code because otherwise the trained model is lost when your session ends. It can be useful to give your model a better name than what we have above. For example, you might name the model after the date you trained it or give it some other descriptive and distinctive name that will make it easier to recall which model it is.
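One way to follow the naming advice above is to build the file name from the training date and the settings you used. The helper below, `model_filename`, and its naming pattern are hypothetical; they are just one possible convention, not something Gensim requires.

```python
from datetime import date

def model_filename(label, window, min_count, trained_on=None):
    """Build a descriptive file name from a label, the training date, and key settings."""
    trained_on = trained_on or date.today()
    return f"{label}_{trained_on.isoformat()}_w{window}_mc{min_count}.model"

# A fixed date is passed here so that the example output is predictable
example_name = model_filename("recipes", window=5, min_count=3,
                              trained_on=date(2024, 1, 31))
print(example_name)  # recipes_2024-01-31_w5_mc3.model
```

You could then pass the resulting string to model.save() in place of "word2vec.model," so that each training run produces its own clearly labelled file.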

To access the model once it has been saved, you can simply run the following code:

In [None]:
model = Word2Vec.load("word2vec.model")

Now, our model is stored under the variable "model." From now on, if we need to access our model, we will do it through the model variable. 

Another way to save and load the model is the following: 

In [None]:
word_vectors = model.wv
word_vectors.save("word2vec.wordvectors")
wv = KeyedVectors.load("word2vec.wordvectors", mmap='r') #by doing this, you don't have to load the full model every time


In this code, we create a new variable called word_vectors which will hold "model.wv." Model.wv represents the word vectors within the model itself. Then, we save the word vectors that we have pulled out of the model as a file called "word2vec.wordvectors." Finally, we create a variable called wv (short for 'word vectors') and use KeyedVectors.load to load the .wordvectors file. The mmap='r' argument memory-maps the file in read-only mode, so the vectors are read from disk as they are needed.

A reason you may want to save and load your model as a .wordvectors file (in **addition** to saving your model as a .model file) is that a .wordvectors file is much easier on your computer's memory. By loading just the word vectors rather than the full model every time, you are able to perform almost all of the same tasks in a much more memory-efficient way. You'll need the full .model file if you want to continue training, but if you are just querying the model, using just the KeyedVectors is fine.

## Word2Vec Functions ##

Word2Vec has a number of built in functions that are quite powerful. These built in functions make performing querying tasks really simple and can be extremely useful for research. 

Start by loading either your model or the .wordvectors file we created above. If you are using the full model, you're going to use the "model" variable to call each of these functions. If you are using the .wordvectors file, you'll use the "wv" variable to call the functions.

In [None]:

model = Word2Vec.load("word2vec.model")

# in a notebook, only the last expression in a cell is displayed, so we print each result
print(model.wv.most_similar('recipe', topn=10))
print(model.wv.similarity("milk", "cream"))  # how similar two words are
print(model.predict_output_word(["flour", "eggs", "cream"]))  # predict the other words in the sentence given these words

print(len(model.wv))  # number of words in the vocabulary

print(model.wv.most_similar(positive=["recipe"], negative=["cream"], topn=10))
print(model.wv.most_similar(positive=["recipe", "milk"], topn=10))


Okay, let's walk through each of these function calls in order to understand what is happening in the code above. First, we begin by loading our model. Since we'll need the full model for later tasks in this walkthrough, I am loading the full model and storing it in the "model" variable. 

The way that you call most functions with Word2Vec is to preface each function call with "model.wv." This will likely be familiar to you. Remember how we saved our KeyedVectors file under the variable "wv"? As you've probably guessed, "wv" in this case _also_ stands for "word vectors." Essentially, by calling model.wv, what we are really doing is telling the computer "hey, crack open this model and apply this function only to the word vectors inside." Because these functions are only dealing with the word vectors themselves, it can be useful to just load the KeyedVectors file, which eliminates the step of having to look inside of the model for those same vectors. If you are loading the .wordvectors file, you'll make function calls by instead writing "wv.SOME_FUNCTION" and not including "model."

Now, let's walk through each of these function calls. For this example, I am using a model that was trained on recipes, so each of the words used are words that are likely to appear in that corpus. For your own model, you'll want to change each of these function calls to better reflect the vocabulary that your model would have been exposed to. Finally, keep in mind that word embedding models capture the **semantics** of words. What this means, is that the results you get from each of these function calls do not reflect words that are, say, **definitionally** similar, but rather words that are used in the same **context**. This is an extremely important distinction to keep in mind because while some of the words you'll get in your results are likely to be synonyms or to have similar definitions, you may have a few words in there that seem confusing. Remember, word embeddings guess the context of a word based on the words that often appear around it. Having a weird word appear in your results does not indicate necessarily that something is wrong with your model or corpus, but rather may reflect that those words are used in the same way in your corpus. Like all Digital Humanities work, you should be careful to be as precise as possible when interpreting your results so that they aren't misunderstood.

1. **Most_similar** -- this function allows you to retrieve words that are similar to a chosen word. In this case, I am asking for the top ten words in my corpus that are contextually similar to the word 'recipe.' If you are wanting a longer list, simply change the number assigned to 'topn'

2. **Similarity** -- this function will return a cosine similarity score for the two words you provide it. We'll get into cosine similarity below, but for now just know that the higher the cosine similarity, the more similar those words are

3. **Predict_output_word** -- this function takes a list of context words and returns the words that the model considers most likely to appear alongside them, each with a probability. Unlike the other functions here, it is called on the full model (model.predict_output_word) rather than on model.wv, so it is not available if you have only loaded the .wordvectors file

4. **Most_similar** -- this function will return a list of words that are most similar to the words that are provided. You'll notice that one word is in the "positive" parameter and the other is associated with "negative." We'll explore what this means, exactly, below but in short, because vectors are numerical representations of words, you are able to perform mathematical operations with them such as adding words together or subtracting them.

The last call in the code above that is useful to know is len(model.wv). The model.wv object holds the vectors for every word in your model's vocabulary, so applying the len() function to it tells you how many words your vocabulary contains. (In Gensim 4, the list of words itself is available as model.wv.index_to_key.) This is pretty important information as it can lead you to decide that you should actually train your model on more data in order to expand this vocabulary and thus receive more nuanced results.

### Vector Math ###

As I mentioned above, because word vectors represent natural language numerically, this means that it is possible to perform mathematical equations with them. For example, say you wanted to know what words in your corpus reflect this equation:

    king - man = ?

As humans, we might predict that the top word which would result from this equation would be "queen" or even "dowager." However, because computers don't understand natural language, the computer will simply perform the equation by subtracting the vector for "man" from the vector for "king." What may result is a list of words that you may not expect, but which reveals interesting patterns in how those words are used in your corpus.

Vector math also allows you to make your function queries much more precise. Let's say for example that you wanted to ask your corpus the following question: "how do people in nineteenth-century novels use the word 'bread' when they aren't referring to food?" 
 
The equation that you might use to ask your corpus of nineteenth-century novels that exact question might be:

    bread - food = ?

Or to be even more precise, what if you wanted to ask "how do people talk about bread in kitchens when they aren't referring to food?" That equation may look like:

    bread + kitchen - food = ?

In Python, the syntax for making these sorts of calls is to use the "positive" parameter in place of the plus sign and the "negative" parameter in place of the minus sign. So, the above equation would look like this in Python:


In [None]:
model.wv.most_similar(positive = ["bread", "kitchen"], negative = ["food"], topn=10)

Running this call will return a list of 10 words that are most similar in context to bread + kitchen, but with the concept of "food" removed.
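To show what is happening behind a call like this, the sketch below performs the same kind of arithmetic by hand on a handful of made-up three-dimensional vectors. The vectors, the `combine` helper, and the candidate words are all invented for illustration. A trained model supplies real vectors with many more dimensions, and Gensim's own implementation differs in details such as how vectors are normalized before they are combined.

```python
import math

# Made-up 3-dimensional vectors; a trained model would supply real ones
toy_vectors = {
    "bread":   [0.9, 0.8, 0.1],
    "kitchen": [0.1, 0.9, 0.2],
    "food":    [0.9, 0.1, 0.1],
    "oven":    [0.1, 0.9, 0.1],
    "soup":    [0.8, 0.2, 0.0],
}

def cosine(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def combine(positive, negative):
    """Add the positive vectors together and subtract the negative ones."""
    total = [0.0, 0.0, 0.0]
    for word in positive:
        total = [t + v for t, v in zip(total, toy_vectors[word])]
    for word in negative:
        total = [t - v for t, v in zip(total, toy_vectors[word])]
    return total

query = combine(positive=["bread", "kitchen"], negative=["food"])

# Rank the remaining words by how closely they point in the query's direction
candidates = [w for w in toy_vectors if w not in {"bread", "kitchen", "food"}]
ranked = sorted(candidates, key=lambda w: cosine(query, toy_vectors[w]), reverse=True)

print(ranked)  # ['oven', 'soup']
```

In this toy example, removing "food" pushes the query away from "soup" and toward "oven," which is the kind of shift the positive and negative lists are meant to produce.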

### Cosine Similarity ###

As I mentioned above, the way that word embedding models understand words is through their numerical representation. A word "vector" is a list of numbers that represents the position of a word in some multi-dimensional space. Because word vectors are located in this multi-dimensional space, just like we could perform basic math on words in the corpus, we can perform slightly more complicated math.

A "vector" is not simply a point in space, but a point in space that has both **magnitude** and **direction**. This means that vectors are less isolated points and more lines that trace a path from some origin point to that vector's designated position in what is called "vector space."

Since a vector is really a line, when you are comparing two vectors from the same model, you are comparing two lines which share an origin point. Where those two lines meet at the origin, they form an angle, and that angle tells us how similar the two words are: the smaller the angle, the more closely the two vectors point in the same direction. Cosine similarity is simply the cosine of that angle.

If you'll remember from trigonometry, the cosine of an angle in a right triangle is the length of the adjacent side divided by the length of the hypotenuse. For two vectors, the same quantity is calculated by dividing their dot product by the product of their lengths:

    cos(θ) = (A · B) / (|A| × |B|) where A is vector 1 and B is vector 2

The score ranges from -1 (pointing in opposite directions) to 1 (pointing in the same direction). The larger this number is, the closer those two vectors are in vector space and thus, the more similar they are. As a loose rule of thumb, a cosine similarity score above 0.5 is often treated as a meaningful degree of similarity, but what counts as "high" depends on your corpus and your model, so it is safest to compare scores within the same model.
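Cosine similarity for two vectors is their dot product divided by the product of their lengths. The function below is a minimal pure-Python sketch of that calculation, and the short two-dimensional vectors are made up so that the results are easy to verify. Gensim's similarity function does this work for you on real word vectors.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])
right_angle = cosine_similarity([1.0, 0.0], [0.0, 3.0])
opposite = cosine_similarity([1.0, 2.0], [-1.0, -2.0])

print(round(same_direction, 6), round(right_angle, 6), round(opposite, 6))
# 1.0 0.0 -1.0
```

Note that the first pair scores 1 even though one vector is twice as long as the other: cosine similarity compares direction only, not magnitude.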

## Evaluating a Model ##

Now that we have a working model and have explored some of its functionality, it is important to evaluate the model. When I say "evaluate" what I mean is: Does the model respond well to the queries it should? Is the model making obvious mistakes?

In order to evaluate our model, we are going to present it with a series of words that are clearly similar and which should be present in most corpuses. Then, we will calculate the cosine similarity for each of these pairs of words, and save the results in a .csv file. This way, we will be able to review each of the cosine similarities and determine if the model is making obvious mistakes. 

In [None]:
dirpath = Path(r"FILEPATH").glob('*.model') #current directory plus only files that end in 'model' 
files = dirpath
model_list = [] # a list to hold the actual models
model_filenames = []  # the filepath for the models so we know where they came from

We're going to start by declaring a few variables. First, we declare the variable "dirpath," which points to the folder that contains your models and searches it for files that end with .model. This folder can be one where you are saving your .model files or even your current working directory. Because the search only pays attention to files that end with .model, your models don't necessarily need to be isolated in their own folder. 

Then, we set the variable "files" equal to our file path. Next, we declare two empty lists, model_list and model_filenames. Model_list will hold the actual models, themselves, and model_filenames will hold the filename of the model so that we know which model is producing which results. This way, you can run this code on a folder with many models and get evaluation information for each of them. 
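One detail worth knowing is that Path.glob hands back a generator rather than a list, so it can only be looped over once. The sketch below uses a throwaway temporary folder and placeholder file names (all hypothetical) to show the difference, and one possible way around it if you ever need to loop over the files a second time.

```python
import tempfile
from pathlib import Path

# Build a throwaway folder with placeholder files (hypothetical names, empty contents)
with tempfile.TemporaryDirectory() as tmp:
    for name in ["a.model", "b.model", "notes.txt"]:
        (Path(tmp) / name).touch()

    matches = Path(tmp).glob("*.model")             # a generator, not a list
    first_pass = sorted(p.name for p in matches)    # this consumes the generator
    second_pass = sorted(p.name for p in matches)   # nothing is left to yield

    # Wrapping the call in sorted() gives a reusable list in a predictable order
    reusable = sorted(Path(tmp).glob("*.model"))
    reusable_names = [p.name for p in reusable]

print(first_pass, second_pass, reusable_names)
```

For the evaluation code in this walkthrough a single pass is all we need, so the generator works fine as it is.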

In [None]:
#this for loop looks for files that end with ".model", loads them, and then adds those to a list
for filename in files:
    file_path = str(filename)
    model = Word2Vec.load(file_path)
    model_list.append(model)
    model_filenames.append(file_path)

This for loop traverses through the "files" variable, which holds all of the files from our file path that end with ".model." For each of these files, the file path is converted to a string and stored in the variable file_path. Then, the model itself is loaded using Word2Vec.load and added to our list of models, and the file path is added to model_filenames so that the two lists stay in the same order.

In [None]:
#test word pairs that we are going to use to evaluate the models
test_words = [("away", "off"),
            ("before", "after"),
            ("cause", "effects"),
            ("children", "parents"),
            ("come", "go"),
            ("day", "night"),
            ("first", "second"),
            ("good", "bad"),
            ("last", "first"),
            ("kind", "sort"),
            ("leave", "quit"),
            ("life", "death"),
            ("girl", "boy"),
            ("little", "small")]

We are going to be using this list of tuples, saved under the variable "test_words," to query our models. These are words which are obviously related (some are near-synonyms, others are opposites that tend to appear in very similar contexts) and fairly common, so they should be present in most model vocabularies and their cosine similarities should be relatively high if the model is working like it should. 
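As far as I know, Gensim's similarity function raises a KeyError when one of the words is missing from a model's vocabulary, which can happen with a small corpus. The sketch below shows one possible way to set aside pairs that a model cannot answer before querying it. The helper split_pairs_by_vocab is hypothetical, and a plain Python set stands in for the vocabulary; with Gensim 4 you could pass model.wv.key_to_index instead.

```python
def split_pairs_by_vocab(pairs, vocab):
    """Separate pairs whose words are all in vocab from pairs with a missing word."""
    usable, missing = [], []
    for pair in pairs:
        if all(word in vocab for word in pair):
            usable.append(pair)
        else:
            missing.append(pair)
    return usable, missing

# A made-up vocabulary standing in for model.wv.key_to_index
toy_vocab = {"day", "night", "good", "bad", "little"}
toy_pairs = [("day", "night"), ("good", "bad"), ("little", "small")]

usable, missing = split_pairs_by_vocab(toy_pairs, toy_vocab)
print(usable)   # pairs that are safe to query
print(missing)  # pairs with at least one word the vocabulary lacks
```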

In [None]:
#these for loops will go through each list, the test word list and the models list, 
#and will run all the words through each model
#then the results will be added to a dataframe
evaluation_results = pd.DataFrame(columns=['Model', 'Test Words', 'Cosine Similarity'])
for i in range(len(model_list)):
    for x in range(len(test_words)):
        similarity_score = model_list[i].wv.similarity(*test_words[x])
        row = [model_filenames[i], test_words[x], similarity_score]
        #use the current length of the dataframe as the row index so every result gets its own row
        #(indexing by i would overwrite the same row and keep only the last word pair for each model)
        evaluation_results.loc[len(evaluation_results)] = row

evaluation_results.to_csv('word2vec_model_evaluation.csv') #dump the results into a csv

Now, we're going to feed our list of tuples into a nested for loop which will visit each model one at a time and get the similarity score for each tuple in the list. We initialize a dataframe called "evaluation_results" which contains the columns "Model," "Test Words," and "Cosine Similarity." With these columns, we'll be able to keep track of which model is producing which cosine similarities and for which tuples. For each pair, the model's filename, the pair of words, and the similarity score are gathered into a temporary list and added to our evaluation_results dataframe as a new row. The nested for loop moves in this way until each model has calculated the cosine similarity score for each tuple. 

Using the pandas function to_csv, we save the evaluation_results dataframe as a .csv file titled "word2vec_model_evaluation." This .csv file will contain the results for each model. 

This evaluation method will allow you to determine which of your models is performing the best. The results of this evaluation may also indicate that your corpus should be varied slightly or should include more data. 
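If you are comparing several models, one possible way to read the results is to average the scores for each model. The sketch below assumes a dataframe with the same three columns as evaluation_results; the model names and scores are made-up placeholders, not output from a real model, and the word pairs are written as plain strings for simplicity.

```python
import pandas as pd

# Placeholder rows with made-up scores, using the same columns as evaluation_results
toy_results = pd.DataFrame(
    [
        ["model_a.model", "day / night", 0.80],
        ["model_a.model", "good / bad", 0.60],
        ["model_b.model", "day / night", 0.50],
        ["model_b.model", "good / bad", 0.30],
    ],
    columns=["Model", "Test Words", "Cosine Similarity"],
)

# Average the scores for each model and list the highest average first
summary = (
    toy_results.groupby("Model")["Cosine Similarity"]
    .mean()
    .sort_values(ascending=False)
)
print(summary)
```

A higher average is only a rough signal, so it is still worth reading through the individual pairs to see where each model struggles.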

# Model Analysis #



## K-means Clustering ##

Cosine similarity is not the only way to measure how close two vectors are. Another common measure is Euclidean distance, which is what k-means clustering uses in order to determine how close two vectors are in vector space. 

Calculating Euclidean distance means that, rather than calculating the cosine of the angle between the two vectors, the **length** of the line that connects the two vectors' positions (the third side of the triangle) is used to calculate distance. As a result, whereas vectors are more similar when their cosine similarity is larger, for Euclidean distance a smaller number indicates a shorter line connecting the two vectors, and thus that they are more similar.
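The two measures do not always agree, which the following sketch tries to show with three made-up two-dimensional vectors. The helper names euclidean and cosine are my own and are only meant for illustration.

```python
import numpy as np

a = np.array([1.0, 1.0])
b = np.array([3.0, 3.0])    # same direction as a, but three times longer
c = np.array([1.0, -1.0])   # same length as a, but at a right angle to it

def euclidean(u, v):
    """Length of the straight line connecting the two positions."""
    return float(np.linalg.norm(u - v))

def cosine(u, v):
    """Cosine of the angle between the two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

print(euclidean(a, b), cosine(a, b))
print(euclidean(a, c), cosine(a, c))
```

Measured by cosine similarity, b is the better match for a because the two point in exactly the same direction. Measured by Euclidean distance, c is the closer of the two because the line connecting it to a is shorter.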

With this in mind, k-means clustering begins by picking some number of starting points in vector space, called "centroids," and seeing which vectors tend to be clustered around those locations. By calculating the Euclidean distance, the algorithm assigns each vector to the centroid it is closest to. Each centroid is then moved to the center of the vectors assigned to it, and the process repeats until the assignments stop changing or the maximum number of iterations is reached. 

K-means is called "k-means" because the data is divided into some number of clusters (k), and each cluster is represented by the mean of all of the vectors within it. The algorithm tries to minimize the sum of the squared Euclidean distances between each vector and the centroid of its cluster.
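Here is a small sketch of those two ideas using scikit-learn's KMeans on six made-up points. It checks by hand that each centroid is the mean of its cluster and that the quantity scikit-learn reports as inertia_ is the sum of squared distances to the centroids. The n_init and random_state values are choices I made for the example.

```python
import numpy as np
from sklearn.cluster import KMeans

# Two made-up, well separated groups of 2D points
points = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                   [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])

toy_kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)

# Each centroid should be the mean of the points assigned to it
manual_centroids = np.array(
    [points[toy_kmeans.labels_ == k].mean(axis=0) for k in range(2)]
)

# Sum of squared Euclidean distances from each point to its own centroid
assigned = toy_kmeans.cluster_centers_[toy_kmeans.labels_]
manual_inertia = float(((points - assigned) ** 2).sum())

print(toy_kmeans.labels_)
print(manual_inertia, toy_kmeans.inertia_)
```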

While this is a somewhat complicated description of how k-means clustering works, it is essential to understand that k-means clustering measures closeness differently from built in functions such as the similarity function: k-means looks for small Euclidean distances, while the similarity function looks for large cosine values. 

When working with word embedding models, k-means clustering can be extremely useful for getting a sense of what words tend to occupy the same general space. The starting centroids are chosen with some randomness, so results can differ from run to run, but since the k-means algorithm assigns every vector to its single nearest centroid, the resulting clusters do not share members. 

In this walkthrough, we are going to use the k-means algorithm that comes with the popular scikit-learn library in Python.

This code was adapted from: https://dylancastillo.co/nlp-snippets-cluster-documents-using-word2vec/#cluster-documents-using-mini-batches-k-means


In [None]:
VOCAB = model.wv[model.wv.key_to_index]
NUM_CLUSTERS = 3
kmeans = cluster.KMeans(n_clusters=NUM_CLUSTERS, max_iter=40).fit(VOCAB) #default is 8 clusters, 300 iterations

We're going to start off by just declaring a few variables. The first variable, VOCAB, is going to hold the vectors for every word in our model's vocabulary. In Gensim 4.0, you retrieve the model's vocabulary by calling model.wv.key_to_index. In older versions of Gensim, you replace model.wv.key_to_index with model.wv.vocab. 

Next, we are going to declare NUM_CLUSTERS which is where we will determine how many random clusters we are wanting to retrieve using the k-means algorithm. 

Finally, we declare a variable 'kmeans' that will hold the call to scikit-learn's k-means algorithm. As you can see, this algorithm initializes with some number of clusters, some number of iterations, and then is fitted to the vocabulary of your model. Like the training model code above, these are settings that you may wish to play around with. You can visit scikit-learn's documentation here: https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html to read more about additional settings that may be of use to you.

In [None]:
centroids = kmeans.cluster_centers_
clusters_df = pd.DataFrame()

The next set of variables we are going to declare are related to the clusters themselves. The first, the centroids variable, will hold the center points around which the clusters are arranged. You can imagine these centroids as points on a map that started out as randomly thrown darts and were then nudged to the middle of the nearest group of words. 

Finally, we declare the clusters_df dataframe which will be used to store the words within our clusters. I am storing these clusters in a dataframe because it will allow me to preserve distinctions between clusters using columns and rows, and will make saving the results to a .csv file very easy.

In [None]:
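for i in range(NUM_CLUSTERS):
    #find the 15 words whose vectors are most similar to this cluster's centroid
    most_representative = model.wv.most_similar(positive=[centroids[i]], topn=15)
    temp_df={'Cluster Number': i, 'Words in Cluster': most_representative}
    #DataFrame.append was removed in pandas 2.0, so the new row is added with pd.concat instead
    clusters_df = pd.concat([clusters_df, pd.DataFrame([temp_df])], ignore_index = True)
    print(f"Cluster {i}: {most_representative}")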

Now, using a for loop, we are going to visit each of the random clusters and gather some of the words in them. This for loop starts at the first cluster and will iterate through each of the clusters, stopping once it has finished with the last one. 

As the for loop reaches a cluster, it calculates the most representative words within that cluster by using the most_similar function. The function is calculating the words that are most similar to the centroid of that cluster and returns the top 15 words. Those words are stored within a variable, "most_representative." 

Then, we declare a temporary dictionary, called "temp_df," that will store the ID of the current cluster and the words associated with that cluster. Saving both the cluster ID as well as the words allows us to remember which words came from which cluster and will make interpreting the results much easier later. 

Next, this temporary record is added to our clusters_df dataframe as a new row and the cluster ID and list of words are printed to the console. 
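Keep in mind that most_similar ranks words by their cosine similarity to the centroid, so the words it returns are the centroid's nearest neighbors and are not guaranteed to be the exact members that k-means assigned to that cluster. If you want the actual memberships, one possible approach is to read them from the fitted model's labels_ attribute. The sketch below uses made-up words and vectors as stand-ins for model.wv.index_to_key and the matching rows of vectors.

```python
import numpy as np
from sklearn.cluster import KMeans

# Stand-ins for model.wv.index_to_key and the matching vectors (made-up values)
toy_words = ["cat", "dog", "bird", "car", "bus", "train"]
toy_vectors = np.array([[0.9, 0.1], [1.0, 0.2], [0.8, 0.0],
                        [0.1, 0.9], [0.0, 1.0], [0.2, 0.8]])

toy_kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(toy_vectors)

# labels_[i] is the cluster assigned to toy_vectors[i], so zip keeps words and clusters aligned
members = {}
for word, label in zip(toy_words, toy_kmeans.labels_):
    members.setdefault(int(label), []).append(word)

for label, words in sorted(members.items()):
    print(f"Cluster {label}: {words}")
```

This only works if the list of words is in the same order as the rows of vectors that were passed to fit.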

In [None]:
clusters_df.to_csv("random_clusters.csv")  
#this will output the random sampling of clusters into a CSV located in your current directory. 
#If you want the file to save somewhere else, just include that filepath in the csv name 
#(so C:/Users/avery/Desktop/random_clusters.csv for example)


In order to make our results a little easier to read, we save the dataframe to a CSV by simply using the built in pandas function "to_csv." This function will preserve the columns and rows within our clusters_df dataframe and can be opened in any editor that can work with spreadsheets such as Excel or Google Sheets.

## Principal Component Analysis ##

Another useful form of model analysis is PCA (principal component analysis). For a much more detailed breakdown of PCA, check out The Datasitter's Club's write up on PCA here: https://datasittersclub.github.io/site/dsc10.html

In general, PCA is a dimensionality reduction algorithm. PCA is called PCA because it attempts to reduce a data set to its **principal components**. Just as k-means differed in its mathematical approach from cosine similarities, PCA also takes a different approach to dealing with vectors. Rather than calculate the length of a line or the cosine of an angle, PCA determines the principal components of a data set by using linear algebra. The algorithm uses eigenvectors and eigenvalues to mix together the original dimensions of a data set and produce new dimensions, the principal components, that contain most of the information from the old ones.

Whereas k-means and cosine similarity try to capture similarity or closeness, PCA is more concerned with capturing the largest amount of variance in a dataset. It does this by using eigenvectors to find the directions along which the items in a data set vary the most. The components are ordered by how much variance each one captures, and the components that capture the least are discarded. The components that we decide to keep are called "feature vectors." The data is then plotted along these feature vectors, which represent the essential features of the data set while reducing some of that dataset's bulk. Probably the best way to think of these components is as a sifter that you dig into sand in order to filter out shells and rocks. Since there is so much sand, we don't necessarily care about including the sand in a description of what you were able to find with the sifter. We do care, however, about the unique shells and rocks, and those items can tell us more about the features of the beach than any individual grain of sand. 
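To see the "sifting" in numbers, the sketch below runs scikit-learn's PCA on made-up data rather than on a word embedding model. The data has five dimensions, but most of its spread lives in the first two, so keeping three components should lose very little. The seed and the scaling values are arbitrary choices for the example.

```python
import numpy as np
from sklearn.decomposition import PCA

# Made-up data: 200 points in 5 dimensions whose spread is very uneven across dimensions
rng = np.random.default_rng(0)
toy_data = rng.normal(size=(200, 5)) * np.array([10.0, 5.0, 1.0, 0.5, 0.1])

toy_pca = PCA(n_components=3)
reduced = toy_pca.fit_transform(toy_data)

print(reduced.shape)                             # one row per point, one column per component
print(toy_pca.explained_variance_ratio_)         # share of the total variance kept by each component
print(toy_pca.explained_variance_ratio_.sum())   # share of the total variance kept overall
```

The explained_variance_ratio_ attribute is sorted from the most to the least informative component, which is a quick way to check how much of the "beach" your three components actually describe.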

For the code, we are going to use scikit-learn's built in PCA algorithm. 


In [None]:
labels = list(model.wv.key_to_index)

def pca(model):
    X = model.wv[model.wv.key_to_index]  #get all the vectors
    pca = PCA(n_components=3)
    result = pca.fit_transform(X)

We are going to keep the PCA calls within a function that we are calling "pca." We start by declaring the variable "labels" which will hold the vocabulary of our model, but formatted as a list. We'll use this variable later when we are labeling the points on our PCA plot. 

The next thing that we do is, within the function, declare a variable X which will hold all of the vectors in our model. Remember, PCA is going to try and sift through all of the sand in our vectors in order to pull out the seashells and rocks that we're actually interested in, so we have to feed it all of the vectors. Next, we declare a variable "pca" which will actually hold the PCA call from scikit-learn. We pass PCA one parameter, n_components, which is the number of components you want to keep and plot. I have set n_components to 3 since I have written the code to produce a 3D plot. 

Finally, we fit the PCA function to our particular model's vectors and let it start sifting. 

In [None]:
# create a scatter plot of the projection
    x_axis = result[:,0]
    y_axis = result[:,1]
    z_axis = result[:,2]
    fig = plt.figure(figsize=(8, 6))
    ax = fig.add_subplot(111, projection='3d')
    ax.scatter(x_axis, y_axis, z_axis)
    for i in range(len(labels)):
        ax.text(x_axis[i], y_axis[i], z_axis[i], labels[i], style ='italic',
        fontsize = 10, color ="blue")
    ax.set_xlabel('x axis')
    ax.set_ylabel('y axis')
    ax.set_zlabel('z axis')

    
    plt.show()

After calculating the principal components of our model's vectors, we want to actually do something with them. We're going to plot our three components on a graph which will allow us to see the shape of the model. Since this is a three dimensional graph, we need to define the x, y, and z axes. We do this by assigning each of these axes a component to plot, taken from our "result" variable from above. 

Now that we have our x, y, and z axes, we can plot them. We start by declaring a variable, "fig," which will represent our plot. We set the size of our particular figure to (8, 6), but you should feel free to play around with the sizing on your own. Then, we declare a variable, "ax," which is short for axes. The ax variable will allow us to access particular points on the plot which will be useful for labeling them. 

We use the ax variable to tell the computer that we want a 3D graph, and that we want to produce a scatter plot using the values associated with the x, y, and z axes, each of which contains one of our principal components. 

The for loop which follows is how we are going to label our points. We use the built in function, ax.text, to tell the computer to visit each point on the graph and assign the corresponding label from our "labels" list to that point. I've set the font to be italic, blue, and the font size to 10, but these values can be adjusted to your liking. You can also comment out the for loop if you want to look at the PCA graph without the labels. 

Finally, we label the x, y, and z axis. For this walkthrough, I have simply labeled them "x axis," "y axis," and "z axis," though these can of course be changed. 

We end the function by telling the computer to show us the resulting plot. 

The entire function code is below:

In [None]:
labels = list(model.wv.key_to_index)
def pca(model):
    X = model.wv[model.wv.key_to_index]  #get all the vectors
    pca = PCA(n_components=3)
    result = pca.fit_transform(X)
    # create a scatter plot of the projection
    x_axis = result[:,0]
    y_axis = result[:,1]
    z_axis = result[:,2]
    fig = plt.figure(figsize=(8, 6))
    ax = fig.add_subplot(111, projection='3d')
    ax.scatter(x_axis, y_axis, z_axis)
    for i in range(len(labels)):
        ax.text(x_axis[i], y_axis[i], z_axis[i], labels[i], style ='italic',
        fontsize = 10, color ="blue")
    ax.set_xlabel('x axis')
    ax.set_ylabel('y axis')
    ax.set_zlabel('z axis')

    
    plt.show()

If you want to perform PCA and produce a plot on your own model, all you need to do is call: 

    pca(model)
   
This call will tell the function to run on your particular model and will produce a 3D graph of your model. You should also be able to use your mouse to drag the graph around in order to view it from different angles. If you want to see a version of the graph without the labels, since labels can make viewing the points clearly a little difficult, simply run this version of the PCA function:

In [None]:
def pca(model):
    X = model.wv[model.wv.key_to_index]  #get all the vectors
    pca = PCA(n_components=3)
    result = pca.fit_transform(X)
    # create a scatter plot of the projection
    x_axis = result[:,0]
    y_axis = result[:,1]
    z_axis = result[:,2]
    fig = plt.figure(figsize=(8, 6))
    ax = fig.add_subplot(111, projection='3d')
    ax.scatter(x_axis, y_axis, z_axis)
    ax.set_xlabel('x axis')
    ax.set_ylabel('y axis')
    ax.set_zlabel('z axis')

    
    plt.show()

### Interpreting PCA Results ###

A couple of things to keep in mind while you explore your new visualization are:

1. PCA results are usually discussed in terms of how much of the variance in the data each component explains. A component that explains a large share of the variance is highly influential, while one that explains very little contributes little to the overall shape of the data.

2. It can be useful to actually look at the components themselves and explore what types of words tend to be gathered by that component.
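Both of these points can be explored in code. The sketch below is one possible way to do it, using made-up words and vectors as stand-ins for a real model: it prints how much variance each component explains and then lists which words sit at either end of the first component.

```python
import numpy as np
from sklearn.decomposition import PCA

# Stand-ins for a model's words and vectors (made-up values)
toy_words = ["king", "queen", "prince", "apple", "pear", "plum"]
toy_vectors = np.array([[5.0, 0.1, 0.0], [4.8, -0.1, 0.1], [5.2, 0.0, -0.1],
                        [-5.0, 0.1, 0.0], [-4.9, -0.1, 0.1], [-5.1, 0.0, -0.1]])

toy_pca = PCA(n_components=2)
scores = toy_pca.fit_transform(toy_vectors)   # position of each word along each component

# Rank the words by their position along the first component
order = np.argsort(scores[:, 0])
low_end = [toy_words[i] for i in order[:3]]
high_end = [toy_words[i] for i in order[-3:]]

print(toy_pca.explained_variance_ratio_)
print(low_end, high_end)
```

Which group lands on the positive end and which on the negative end is arbitrary, so it is the grouping of words along a component that is worth interpreting, not the sign.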

PCA can generally help you get a sense of how your data is shaped. However, PCA is not particularly useful for determining what individual clusters of words might be. In order to determine that, you should turn to tSNE analysis, which is particularly useful for getting a sense of how your data is grouped whereas PCA more so captures a sense of the data as a whole.

## tSNE Analysis ##

The final form of mathematical analysis that we will cover in this tutorial is tSNE analysis. t-distributed Stochastic Neighbor Embedding (tSNE) is a dimensionality reduction algorithm similar to PCA. However, while PCA is more concerned with preserving variance in a data set, tSNE cares more about keeping things that are close together close together. Another important difference between tSNE and PCA is that the results of tSNE analysis can vary with each run unless the random seed is fixed. This is because tSNE is a probabilistic technique. tSNE is also almost always used to project data into two (or occasionally three) dimensions, whereas PCA can keep as many components as you ask for. 

However, while tSNE does differ in important ways from PCA, a researcher might find tSNE's ability to represent local groupings in the data more appealing than PCA. Whereas PCA tends to mix data together and represent it as a singular grouping, tSNE often produces visualizations of clusters in a data set. This ability to represent these groupings can be helpful from an exploratory perspective as it allows researchers to see how their data is grouped. While it is generally recommended to start with PCA, following up a PCA graph with a tSNE analysis can help produce a fuller picture of a word embedding model. 

tSNE, like our other methods of analysis, uses a different type of math in order to calculate the distances between vectors. While PCA uses eigenvectors and linear algebra, tSNE uses probability distributions, a technique from statistics. The algorithm begins by calculating the Euclidean distances between vectors and converting them into probabilities that two vectors are neighbors. It then arranges the points in a low-dimensional space, where neighbor probabilities are modeled with a t-distribution, so that the two sets of probabilities match as closely as possible. The goal of the algorithm is to keep similar words close together in tSNE's low-dimensional space while keeping words that are not similar apart.

Like with PCA, we are going to store our tSNE in a function in order to make applying it to our model easier. The code for this tSNE function is adapted from: https://www.kaggle.com/jeffd23/visualizing-word-vectors-with-t-sne

In [None]:
import numpy as np

def tsne_plot(model, focus_word = None, n = 50):
    "Creates a TSNE model and plots it"
    labels = []
    tokens = []

    if focus_word is not None:
        tokens.append(model.wv[focus_word])
        labels.append(focus_word)
        neighbors = model.wv.most_similar(focus_word, topn = n)
        for neighbor in neighbors:
            tokens.append(model.wv[neighbor[0]])
            labels.append(neighbor[0])
    else:
        for word in model.wv.key_to_index:
            tokens.append(model.wv[word])
            labels.append(word)
    
    #note: newer scikit-learn releases (1.5 and later) rename n_iter to max_iter
    tsne_model = TSNE(perplexity=40, n_components=2, init='pca', n_iter=2500, random_state=23)
    #convert the list of vectors into a single array, which newer scikit-learn releases expect here
    new_values = tsne_model.fit_transform(np.array(tokens))

    x = [value[0] for value in new_values]
    y = [value[1] for value in new_values]
        
    plt.figure(figsize=(16, 16)) 
    for i in range(len(x)):
        plt.scatter(x[i],y[i])
        plt.annotate(labels[i],
                     xy=(x[i], y[i]),
                     xytext=(5, 2),
                     textcoords='offset points',
                     ha='right',
                     va='bottom')
    plt.show()

Let's walk through what is going on in this code. We begin by defining a function, tsne_plot, which accepts a model, a focus word, and a number of words as parameters. As you can see, focus word is set to "none" by default and the number of words is set to 50. These parameters can be adjusted to suit your needs.

We begin by declaring two lists, tokens and labels. We'll use these lists to keep track of the vectors and their labels for each item in the model. 

After those two lists are declared, the function proceeds to an if statement and a for loop. The if statement asks whether the focus_word parameter is equal to None or not. If the focus_word is _not_ set to None, the function proceeds to calculate the top "n" (remember n is set to 50 in the function definition) most similar words to that focus word. The function then proceeds to add these 50 neighbors to the labels and tokens lists. 

If there is no focus word, the function traverses through the model's vocabulary and adds the words in the model to the labels and tokens lists respectively. 

This initial if statement allows you to perform tSNE analysis around a particular word or to focus on a particular area of the vector space. As we have walked through, if there is a focus word, then what gets added to the labels and tokens lists are the 50 nearest neighbors to that particular word. This approach limits the tSNE analysis to that particular word. If there is no focus word, then the tSNE analysis is performed on the entire model.

Next, like in the PCA function, we declare a variable, tsne_model, to hold the function call to scikit-learn's tSNE algorithm. scikit-learn's tSNE function accepts a number of parameters which can impact how the algorithm traverses through your data: 

1. **n_components** -- This parameter corresponds to how many dimensions the analysis should work in. The default is 2. 

2. **perplexity** -- This parameter relates to the nearest neighbors in the analysis. It basically tries to guess how many neighbors a particular vector will have in order to balance the attention given to each vector. Scikit-learn suggests using numbers between 5 and 50.

3. **init** -- This parameter allows you to suggest how the starting positions of the points will be chosen, either 'random' or by 'pca'.

4. **n_iter** -- This parameter represents the maximum number of times the algorithm should traverse through the data before producing the plot.

5. **random_state** -- This parameter fixes the random seed, which helps to prevent different results being produced with different runs of the algorithm.

There are a number of additional parameters which you can view in scikit-learn's documentation here: https://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html
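If you would like to experiment with these parameters before running tSNE on a full model, a small made-up data set is a quick way to do so. The sketch below is only an illustration under my own assumptions: the data is random rather than drawn from a word embedding model, and I have left out the iteration setting so that the same code runs on both older and newer versions of scikit-learn.

```python
import numpy as np
from sklearn.manifold import TSNE

# Made-up data: two groups of 20 points each in 10 dimensions
rng = np.random.default_rng(0)
group_a = rng.normal(loc=0.0, scale=0.5, size=(20, 10))
group_b = rng.normal(loc=8.0, scale=0.5, size=(20, 10))
toy_data = np.vstack([group_a, group_b])

# perplexity should be smaller than the number of rows (40 here)
toy_tsne = TSNE(n_components=2, perplexity=10, init="pca", random_state=23)
embedded = toy_tsne.fit_transform(toy_data)

print(embedded.shape)   # one row per point, one column per tSNE dimension
```

Note the relationship between perplexity and the size of the data: as far as I know, recent scikit-learn releases raise an error when perplexity is not smaller than the number of rows, which matters if you call tsne_plot with a focus word and a small value for n.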

Following the establishment of the tSNE algorithm, the results are fed into a plot in a similar manner to the PCA graph. Unlike the PCA graph, this tSNE graph is 2D and thus flat. The x and y axes are assigned each a component from the results of the tSNE analysis. Then, a plot is declared, labels are added to that plot using the labels list, and then we use plt.show() to reveal the graph. 

You can run this function by calling tsne_plot(model), like the PCA function.

### Interpreting tSNE ###

There are a few important things to note about interpreting the results of a tSNE analysis. 

1. tSNE is not deterministic, so your results may vary across runs of the same algorithm with the same settings. 

2. tSNE tries to average cluster sizes, so clusters may appear to be the same size in a visualization of the analysis when in actuality, they can vary quite a bit in size. This means that you cannot determine the size of a cluster based on tSNE alone.

3. The distances between clusters may be meaningless. In addition, if you add new data to the corpus (for example if you retrain your word embedding model), you may also need to adjust the perplexity in the tSNE analysis.

The bottom line is that tSNE will try to "clean up" its visualizations, so something that appears significant in the visualization may actually just be a result of this cleaning up. In order to get the most out of tSNE analysis, try running it multiple times and changing the hyperparameters. This will likely give you a more accurate picture of your data. 

# Conclusion #

As you can see, word embedding models are fairly versatile and powerful. Not only do these models enable you to capture the semantic significance of words in any particular corpus, but when analysis techniques such as k-means clustering, PCA, or tSNE are applied, it becomes much more evident how useful word embedding models are for representing the complexities of natural language. 

While this walkthrough focused in particular on a localized implementation of word embedding models in Python, there are a number of tools online that are particularly useful for analyzing word embedding models. One tool in particular that is very useful is the Tensorflow Projector located here: https://projector.tensorflow.org/. 

The Tensorflow Projector allows you to upload your model and produce interactive PCA and tSNE plots for your model. If you are interested in digging more deeply into the individual words reflected in tSNE or PCA, then the Projector is a great place to explore as its interactive features operate relatively quickly, even with large amounts of data. 

We also encourage you to continue learning about word embedding models through some of the great communities located in places such as Stack Overflow. Mutual aid is an essential feature of the coding community, and you should feel comfortable participating in that community, even as a beginner programmer. As we hope this walkthrough has demonstrated, the best work in programming happens when programmers work together. 

And finally, while this walkthrough is focused on Word2Vec, we also want to point to the newer Doc2Vec, which can be implemented almost exactly how this walkthrough implements Word2Vec. Doc2Vec is an embedding algorithm that produces vectors for sentences or entire documents by using something called "paragraph embeddings." If you are interested in training a model based on documents rather than individual words, we encourage you to check out Doc2Vec, which is included with Gensim. 


_This walkthrough was written on June 25, 2022 using Python 3.8.3, Gensim 4.2.0, and Scikit-learn 0.23.1_