# Model Analysis #

This notebook picks up where the core notebook left off to discuss analytical methods for working with word embedding models. In this notebook, we will be covering k-means analysis, Principal Component Analysis (PCA), and t-Distributed Stochastic Neighbor Embedding (t-SNE). This notebook assumes that you have already trained a model using Word2Vec as described in the core notebook.

In the core notebook, we covered some of the ways you can query a model and how those queries can help you answer interesting research questions. Slightly more complicated analytical methods, however, give you a better sense of the model as a whole and can help you target your research questions.

## K-means Clustering ##

Cosine similarity is not the only way to calculate the distance between two vectors. Another measure is Euclidean distance, which is what k-means clustering uses in place of cosine similarity to determine how close two vectors are in vector space.

Calculating Euclidean distance means that, rather than calculating the cosine of the angle between the two vectors, the _length_ of the line that connects them to form a triangle is used to calculate distance. As a result, whereas two vectors are more similar when the cosine of the angle between them is larger, for Euclidean distance a smaller number indicates a shorter line connecting the two vectors, and therefore that they are more similar.
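To make the difference between the two measures concrete, here is a minimal sketch in plain Python. The three two-dimensional vectors are made up for illustration and do not come from any model; the function names are also just illustrative.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (larger means more similar)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def euclidean_distance(a, b):
    """Length of the straight line between two vectors (smaller means more similar)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# made-up 2-dimensional "word vectors", for illustration only
short_vec = [1.0, 1.0]
long_vec = [3.0, 3.0]    # points in the same direction as short_vec, but is longer
other_vec = [1.0, 0.0]   # points in a different direction, but sits near short_vec

# compare short_vec against each of the other two vectors with both measures
print(cosine_similarity(short_vec, long_vec), euclidean_distance(short_vec, long_vec))
print(cosine_similarity(short_vec, other_vec), euclidean_distance(short_vec, other_vec))
```

In this toy example the two measures disagree: cosine similarity ranks `long_vec` as the better match for `short_vec` because they point in the same direction, while Euclidean distance ranks `other_vec` as closer because the line connecting it to `short_vec` is shorter.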

With this in mind, k-means clustering begins by picking a number of random points in vector space, called **centroids**, and seeing which vectors are clustered around those locations. By calculating the Euclidean distance, the algorithm assigns each vector to the centroid it is closest to, that is, the one with the smallest Euclidean distance. Each centroid is then moved to the center of the vectors assigned to it, and the process repeats until the assignments stop changing or the maximum number of iterations is reached.
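The steps that k-means repeats can be sketched in a few lines of plain Python. This is a simplified illustration rather than scikit-learn's implementation: the function names are hypothetical, the data points are made up, and the starting centroids are fixed instead of random so that the result is repeatable.

```python
import math

def euclidean(a, b):
    """Straight-line distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def mean_vector(vectors):
    """Component-wise mean of a list of vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]

def kmeans_sketch(vectors, centroids, n_iter=10):
    """Repeat the assign and update steps for at most n_iter rounds."""
    for _ in range(n_iter):
        # assignment step: give each vector the index of its nearest centroid
        labels = [
            min(range(len(centroids)), key=lambda c: euclidean(v, centroids[c]))
            for v in vectors
        ]
        # update step: move each centroid to the mean of its assigned vectors
        new_centroids = []
        for c in range(len(centroids)):
            members = [v for v, label in zip(vectors, labels) if label == c]
            new_centroids.append(mean_vector(members) if members else centroids[c])
        # stop early once the centroids no longer move
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return labels, centroids

# made-up 2-dimensional points forming two obvious groups
points = [[1.0, 1.0], [1.5, 2.0], [1.0, 1.5], [8.0, 8.0], [9.0, 8.5], [8.5, 9.0]]

# fixed starting centroids (an assumption made here for repeatability)
start = [points[0], points[3]]

labels, centroids = kmeans_sketch(points, start)
print(labels)
print(centroids)
```

Because the starting centroids are usually chosen at random, a real run can produce different clusters each time, which is one reason the results are worth inspecting more than once.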

K-means is called "k-means" because the data is divided into some number of clusters (k), and the centroid of each cluster is the mean of all of the vectors within it. The quality of a clustering is measured by adding up the squared Euclidean distances between each vector and the centroid of its cluster; the algorithm tries to make this sum as small as possible.
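Written as a formula, the quantity involved looks roughly like the following. The symbols are assumptions introduced here for illustration: $C_j$ is the set of vectors in cluster $j$, $\mu_j$ is that cluster's centroid, and $x$ is an individual vector.

$$
J = \sum_{j=1}^{k} \sum_{x \in C_j} \lVert x - \mu_j \rVert^{2},
\qquad
\mu_j = \frac{1}{\lvert C_j \rvert} \sum_{x \in C_j} x
$$

The first expression adds up the squared Euclidean distance between every vector and its centroid, and the second says that each centroid is simply the mean of the vectors in its cluster.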

Essentially, k-means clustering measures closeness in a way that runs in the opposite direction from built-in functions such as the similarity function: with cosine similarity a larger value means two words are closer, whereas with Euclidean distance a smaller value does.

When working with word embedding models, k-means clustering can be useful for getting a sense of which words tend to occupy the same general space. The centroids are initially placed in vector space randomly, but since each vector is assigned only to its nearest centroid, there will likely be very minimal overlap between your sampling of random clusters.

In this walkthrough, we are going to use the k-means algorithm that comes with the popular scikit-learn library in Python.

This code was adapted from: https://dylancastillo.co/nlp-snippets-cluster-documents-using-word2vec/#cluster-documents-using-mini-batches-k-means



### The Code ###

We're going to start off by just declaring a few variables. The first variable, `VOCAB`, is going to hold the vectors for every word in our model's vocabulary. In Gensim 4.0, you retrieve the model's vocabulary by calling `model.wv.key_to_index`. In older versions of Gensim, you replace `key_to_index` with `vocab`.

Next, we are going to declare `NUM_CLUSTERS`, which determines how many clusters we want the k-means algorithm to produce.

Finally, we declare a variable `kmeans` that will hold the call to scikit-learn's k-means algorithm. As you can see, this algorithm initializes with some number of clusters, some number of iterations, and then is fitted to the vocabulary of your model. Like the training model code above, these are settings that you may wish to play around with. You can visit scikit-learn's [documentation](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html) to read more about additional settings that may be of use to you. 

In the actual call to the kmeans algorithm, there are a few parameters which you can change. 

__n_clusters__ -- this parameter represents the number of clusters (or centroids) you want to have. By default, this value is 8, though you can change the number to whatever you want. Keep in mind that for a smaller vocabulary, a smaller number of clusters might generate more useful data than a large number. I have chosen three in the code block below.

__max_iter__ -- this parameter represents the maximum number of iterations the algorithm will complete in a single run. By default, this parameter is set to 300, though you can change this number to whatever you like. I have chosen 40 iterations since the recipe model is relatively small.

There are a few additional optional parameters, though the two above are the most important. You can read about other parameters in the documentation.

In [None]:
# import scikit-learn's clustering module, which provides the k-means algorithm
from sklearn import cluster

# VOCAB will hold the vectors for every word in our model's vocabulary
VOCAB = model.wv[model.wv.key_to_index]

# we are setting the number of clusters to 3. You can change this if you want
NUM_CLUSTERS = 3

# declare kmeans to hold the call to scikit-learn's kmeans algorithm
# by default, the algorithm sets the number of clusters to 8 and the max iterations to 300
# you can change the number of clusters or the number of max iterations
kmeans = cluster.KMeans(n_clusters=NUM_CLUSTERS, max_iter=40).fit(VOCAB)

The next set of variables we are going to declare are related to the clusters themselves. The `centroids` variable will hold the center points around which the clusters are arranged. You can imagine these centroids as points on a map that we have thrown random darts at. The centroids are generated by the kmeans algorithm as it runs. The way that we get access to these centroids is by using the `cluster_centers_` attribute that the fitted algorithm provides. This is somewhat similar to the way you call `model.wv` to get access to a model's word vectors.

Finally, we declare the `clusters_df` dataframe, which will be used to store the words within our random clusters. I am storing these clusters in a dataframe because it will allow me to preserve distinctions between clusters using columns and rows and will make saving the results to a `.csv` file very easy.

In [None]:
# import pandas, which provides the dataframe
import pandas as pd

# set the centroids variable to the kmeans cluster centers
centroids = kmeans.cluster_centers_

# declare an empty dataframe
clusters_df = pd.DataFrame()

Now, using a `for` loop, we are going to visit each of the random clusters and gather some of the words within them. This `for` loop starts at the first cluster and will iterate through each of the clusters, stopping once it has finished with the last one.

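As the `for` loop reaches a cluster, it finds the most representative words for that cluster by using the `most_similar` function. The function finds the words whose vectors are most similar to the centroid of that cluster and returns the top 15 words. Note that `most_similar` ranks words by cosine similarity and searches the whole vocabulary, so these are the words nearest the centroid rather than a list restricted to the words that k-means assigned to the cluster. Those words are stored within a variable, `most_representative`.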

Then, we declare a temporary dictionary, called `temp_df`, that will store the ID of the current cluster and the words associated with that cluster. Saving both the cluster ID as well as the words allows us to remember which words came from which cluster and will make interpreting the results much easier later.

Next, the contents of `temp_df` are added as a new row to our `clusters_df` dataframe and the cluster ID and list of words are printed to the console.

In [None]:
# iterate through each of the clusters
for i in range(NUM_CLUSTERS):

    # find the fifteen words whose vectors are most similar to the current cluster's centroid
    most_representative = model.wv.most_similar(positive=[centroids[i]], topn=15)

    # store the cluster number and the most representative words in a temporary dictionary
    temp_df = {'Cluster Number': i, 'Words in Cluster': most_representative}

    # add the row to our bigger clusters dataframe which will hold all the clusters
    # (DataFrame.append was removed in pandas 2.0, so pd.concat is used instead)
    clusters_df = pd.concat([clusters_df, pd.DataFrame([temp_df])], ignore_index=True)

    # print the cluster id and the most representative words to the console
    print(f"Cluster {i}: {most_representative}")
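If you would rather see exactly which words k-means assigned to each cluster, one option is to read the `labels_` attribute of the fitted scikit-learn model. The sketch below uses a made-up word list and made-up two-dimensional vectors so that it runs on its own; with the notebook's model, the words would correspond to the keys of `model.wv.key_to_index` and the vectors to the rows of `VOCAB`. The `random_state` and `n_init` values are choices made here for repeatability, not settings from the walkthrough.

```python
import numpy as np
from sklearn import cluster

# hypothetical stand-ins for a model's words and vectors, for illustration only
words = ["flour", "sugar", "butter", "boil", "simmer", "fry"]
vectors = np.array([
    [0.9, 0.1], [1.0, 0.2], [0.8, 0.0],   # three vectors near one another
    [0.1, 0.9], [0.0, 1.0], [0.2, 0.8],   # three more vectors near one another
])

# random_state fixes the random starting centroids so the run is repeatable
kmeans = cluster.KMeans(n_clusters=2, max_iter=40, n_init=10, random_state=0).fit(vectors)

# labels_ holds one cluster id per row of `vectors`, in the same order
members = {}
for word, label in zip(words, kmeans.labels_):
    members.setdefault(int(label), []).append(word)

# print the words assigned to each cluster
for label, cluster_words in sorted(members.items()):
    print(f"Cluster {label}: {cluster_words}")
```

The cluster IDs themselves are arbitrary, so the same group of words may be labelled 0 in one run and 1 in another.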

In order to make our results a little easier to read and viewable later, we save the dataframe to a `.csv` by using the built-in pandas function `.to_csv()`. This function will preserve the columns and rows within our `clusters_df` dataframe, and the resulting file can be opened in any editor that can work with spreadsheets, such as Excel or Google Sheets. Currently, the results are saved as `random_clusters.csv`, though you can change this filename to something more meaningful if need be.

The code block below saves the results to your current working directory, but if you want your results saved somewhere more specific, include the filepath with the call to `.to_csv()`.

In [None]:
# this will output the random sampling of clusters into a CSV located in your current directory. 
# if you want the file to save somewhere else, just include that filepath in the csv name 
# (C:/Users/avery/Documents/random_clusters.csv for example)
clusters_df.to_csv("random_clusters.csv")  
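Because each cell in the "Words in Cluster" column holds a whole list of (word, score) pairs, that list is written to the CSV as a single piece of text. If you would prefer one row per word, a possible alternative layout is sketched below. The results dictionary, the column names `Word` and `Similarity`, and the `long_df` variable are all made up for illustration, and the CSV is written to an in-memory buffer so that the example does not create a file.

```python
import io
import pandas as pd

# made-up results in the same shape that most_similar returns:
# a list of (word, similarity score) pairs for each cluster
results = {
    0: [("flour", 0.91), ("sugar", 0.88)],
    1: [("boil", 0.93), ("simmer", 0.90)],
}

# build one row per word rather than one row per cluster
rows = [
    {"Cluster Number": cluster_id, "Word": word, "Similarity": score}
    for cluster_id, pairs in results.items()
    for word, score in pairs
]
long_df = pd.DataFrame(rows)

# write to an in-memory buffer here; pass a filename to to_csv to write a file instead
# index=False leaves the dataframe's row numbers out of the CSV
buffer = io.StringIO()
long_df.to_csv(buffer, index=False)
print(buffer.getvalue())
```

This layout tends to be easier to sort and filter in a spreadsheet, at the cost of a longer file.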

### Interpreting the Results ###

K-means clustering is a useful way to identify patterns in your model that you may not have been aware of. This makes it a good first step in your analysis: it helps you explore your data, and the algorithm scales well to large datasets. Since Word2Vec is an unsupervised algorithm, it can be difficult to determine how words have been grouped together without this exploratory phase.

However, when using k-means clustering, you should be aware that the results may be affected by a few factors. For example, rather than ignoring outliers, the k-means algorithm includes them with all of the rest of the data. For this reason, an outlier in your model may drag other words into a cluster in a way that is not necessarily significant. Additionally, since you must manually select the number of clusters as well as the number of iterations, it may take a few tries to generate useful data.
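One common heuristic for narrowing down the number of clusters, sometimes called the "elbow method," is to fit k-means with several values of k and compare the `inertia_` attribute, which scikit-learn defines as the sum of squared distances from each point to its assigned centroid. The sketch below is only an illustration on synthetic data (three artificial blobs of two-dimensional points), not on a word embedding model, and the range of k values and the `random_state` setting are arbitrary choices.

```python
import numpy as np
from sklearn import cluster

# synthetic data for illustration: three well-separated blobs of 2-D points
rng = np.random.default_rng(0)
blob_centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
data = np.vstack([
    center + rng.normal(scale=0.5, size=(30, 2)) for center in blob_centers
])

# fit k-means for several values of k and record the inertia of each fit
inertias = {}
for k in range(1, 7):
    kmeans = cluster.KMeans(n_clusters=k, max_iter=40, n_init=10, random_state=0).fit(data)
    inertias[k] = kmeans.inertia_

# print the inertia for each value of k
for k, inertia in inertias.items():
    print(f"k={k}: inertia={inertia:.1f}")
```

On data like this, the inertia drops sharply until k reaches the number of real groups and then levels off. Word vectors rarely produce such a clean bend, so treat the result as a rough guide rather than an answer.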

## Principal Component Analysis ##

Another useful form of model analysis is PCA (principal component analysis). For a much more detailed breakdown of PCA, check out The Datasitter's Club's [write up on PCA](https://datasittersclub.github.io/site/dsc10.html).

In general, PCA is a dimensionality reduction algorithm. PCA is called PCA because it attempts to reduce a data set to its **principal components**. Just as k-means differed in its mathematical approach from cosine similarity, PCA also takes a different approach to dealing with vectors. Rather than calculate the length of a line or the cosine of an angle, PCA determines the principal components of a dataset by using _linear algebra_. The algorithm uses eigenvectors and eigenvalues to combine the original dimensions of a data set into new dimensions that contain most of the information from the old ones. These new dimensions are the principal components.

Whereas k-means and cosine similarity try to capture similarity or closeness, PCA is more concerned with capturing the largest amount of variance in a dataset. It does this by using eigenvectors to find the directions along which the items in a data set vary the most. The PCA algorithm ranks these eigenvectors by how much variance each one captures, keeping the ones that capture the most and discarding those that are less significant. The eigenvectors that we decide to keep are called "feature vectors." The data is then plotted along these feature vectors, which represent the essential features of the data set while reducing some of its bulk. Probably the best way to think of these components is as a sifter that you dig into sand to filter out shells and rocks. Since there is so much sand, we don't necessarily care about including it in a description of what we were able to find with the sifter. We do care, however, about the unique shells and rocks, and those items can tell us more about the features of the beach than any individual grain of sand.
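The eigenvector steps described above can be sketched with NumPy. This is a simplified illustration on synthetic three-dimensional data, not scikit-learn's implementation, and every variable name in it is made up for the example.

```python
import numpy as np

# synthetic data for illustration: 200 points in 3 dimensions,
# where most of the spread lies along a single direction
rng = np.random.default_rng(0)
t = rng.normal(size=200)
data = np.column_stack([t, 2.0 * t, 0.5 * t]) + rng.normal(scale=0.1, size=(200, 3))

# 1. center the data so that each dimension has a mean of zero
centered = data - data.mean(axis=0)

# 2. compute the covariance matrix of the dimensions
covariance = np.cov(centered, rowvar=False)

# 3. eigenvectors give the directions of variance, eigenvalues the amount
eigenvalues, eigenvectors = np.linalg.eigh(covariance)

# 4. sort from largest to smallest eigenvalue (eigh returns ascending order)
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

# 5. keep the top two eigenvectors and project the data onto them
components = eigenvectors[:, :2]
reduced = centered @ components

# share of the total variance captured by each component
explained = eigenvalues / eigenvalues.sum()
print(reduced.shape)
print(explained)
```

In this toy data set, nearly all of the variance falls along the first component, which is the "shells and rocks" in the sifter analogy. The remaining components mostly capture noise.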

For the code, we are going to use scikit-learn's built-in PCA algorithm.