Word Vectors Intro (in python with `gensim`)
---------

Author: Felix Muzny  
Date: 6/30/2021

Getting Started & Using this notebook
------

This file is an introduction to training and querying a model using word2vec with `python` and `jupyter notebooks` on your own computer.

For more information about how to use Jupyter Notebooks:
- [An introduction to Jupyter Notebooks](https://realpython.com/jupyter-notebook-introduction/)
- [Felix's Notes from an intro computer science course](https://muzny.github.io/csci1200-notes/01/using-jupyter-notebooks.html)

A *very brief* intro to how to use this file/Jupyter Notebooks
-----
(See [section 1.1 of the Jupyter/Python notes](https://muzny.github.io/csci1200-notes/01/1/intro_jupyter_notebooks.html) for more detail)

Jupyter notebooks are composed of `cell`s. A cell either contains text (like this one) or runnable python code.

To run a cell with code in it, you'll either use one of the buttons at the top of the notebook ("▶️ Run") or use a keyboard shortcut to run a cell (or cells). To view the keyboard shortcut for running cells, go to the "Cell" menu at the top of the notebook, then look at what is listed next to "Run Cells". By default, this is command+enter on mac operating systems and control+enter on windows.

When you run a cell that contains python code, two things happen:
1. any output as a result of running the code is displayed beneath the cell
2. any variables or functions defined as a result of running the code are stored in the kernel ("python's working memory"--this memory is only reset if you restart your notebook or use the Kernel menu to restart the kernel)

In [None]:
# an example—run this cell!
favorite_animal = "whale"  # define a string variable
favorite_number = 27  # define an integer variable
print(favorite_animal)  # print out the value of one of the variables

In [None]:
# the variables that you defined in the previous cell are now defined 
# in our current memory so we can access them and do what we want with
# them
print(favorite_number)
print(favorite_number ** 2)  # what is the value of this number squared?

The last type of output that you might see underneath a cell is output that is preceded by the text `Out [n]:` (where `n` is some number). This output appears when the last line of code in the cell is an expression whose value isn't saved in a variable (and isn't `None`). It is shown in addition to anything that the cell displayed using `print`.

In [None]:
# an example of a cell with Out [n]: output
# let's just do math and both not print it and not save it in a variable
# notice that because we didn't save this value in a variable, we can't 
# access it later unless we re-do the calculation!
favorite_number / 3

If your notebook is ever running and you want to stop it, you can press the stop ("🔳") button on the top menu.

Installing libraries 
----------------
(in general and for this project)

By default, python comes with some libraries already installed. Some of the libraries that we'll use (e.g. `gensim`) do not come installed by default. Depending on how you installed Jupyter Notebook, you'll follow different installation instructions. 

You'll likely be installing libraries via either:
1. `conda`
2. `conda` via the Anaconda Navigator user interface
3. `pip`

For help with this, reach out to Felix (f.muzny@northeastern.edu) or the vast but sometimes confusing knowledge base of The Internet.
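
Before installing anything, it can help to check which libraries are already available. The cell below is a minimal sketch that uses only python's standard library. The helper name `missing_libraries` is made up for this example, and the list of names is an assumption based on the imports used later in this notebook (`sklearn` is the import name for scikit-learn).

```python
import importlib.util

def missing_libraries(names):
    """Return the names in the given list that python cannot find."""
    return [name for name in names if importlib.util.find_spec(name) is None]

# assumed list: the third-party libraries imported later in this notebook
needed = ["gensim", "sklearn", "matplotlib"]
print("Libraries that still need to be installed:", missing_libraries(needed))
```

Anything printed in the list still needs to be installed with `conda` or `pip` before the later cells will run.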

In [None]:
# to access tools in a certain library, you'll need to import that library
# in jupyter notebooks, we tend to like to put imports together without 
# other code in the cell to separate loading a library from code that "does things"
import re  # for regular expressions
import os  # to look up operating system-based info
import string  # to do fancy things with strings

In [None]:
# First, figure out what your working directory is
print(os.getcwd())

In [None]:
# we're using a relative path to take your working directory, go "back"/"up" one
# step to the main WordVectors folder, then go into the data folder within that
datapath = os.path.join(os.getcwd(), '..', 'data/WomensNovelsDemo/')
print("Using datapath:", datapath)
# if this worked successfully, you should see a list of the files that 
# you are trying to work with here:
print("We'll be working with the files:")
print(os.listdir(datapath))

In [None]:
def clean_text(list_of_texts):
    '''
    Cleans each text by splitting it on whitespace, lower-casing
    the tokens, removing punctuation, and keeping only tokens
    made up entirely of letters.
    Parameters:
        list_of_texts: list of str 
    Return: list of lists of tokens, one list per text
    '''
    re_punc = re.compile('[%s]' % re.escape(string.punctuation))
    all_tokens = []
    for text in list_of_texts:
        # split on whitespace, then lower case
        tokens = text.split()
        tokens = [t.lower() for t in tokens]
        # remove punctuation
        tokens = [re_punc.sub('', token) for token in tokens]
        # only include alphabetic tokens (drops numbers and empty strings)
        tokens = [token for token in tokens if token.isalpha()]
        all_tokens.append(tokens)
    return all_tokens

def read_files(datapath):
    '''
    Reads every file in the datapath folder.
    Return: list of str, one string (the full contents) per file
    '''
    data = []
    all_files = os.listdir(datapath)
    for file_name in all_files:
        print("Processing", file_name)
        with open(os.path.join(datapath, file_name)) as file:
            data.append(file.read())
    return data

# Now, we're going to actually read in the files
# and split them into tokens
raw_data = read_files(datapath)
cleaned_data = clean_text(raw_data)
# sanity check to check out the beginning and end of our data
print("Number of files read:", len(cleaned_data))
print(cleaned_data[0][:10])  # first 10 tokens of the first text
print(cleaned_data[0][-10:]) # last 10 tokens of the first text
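
To see what these cleaning steps do to a single piece of text, here is a small self-contained sketch that applies the same four steps (split, lower-case, remove punctuation, keep alphabetic tokens) to one made-up sentence. The function name `tokenize` and the sample sentence are invented for this illustration and are not part of the notebook's data.

```python
import re
import string

# matches any single punctuation character
re_punc = re.compile('[%s]' % re.escape(string.punctuation))

def tokenize(text):
    """Apply the same cleaning steps as above to a single string."""
    tokens = [t.lower() for t in text.split()]       # split on whitespace, lower case
    tokens = [re_punc.sub('', t) for t in tokens]    # remove punctuation
    return [t for t in tokens if t.isalpha()]        # keep alphabetic tokens only

sample = "It is a truth universally acknowledged, that 2 friends don't agree!"
print(tokenize(sample))
```

Notice that `2` is dropped entirely and that `don't` becomes `dont`: punctuation is deleted from inside words, so hyphenated words are glued together in the same way.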

Word Vectors and Word2Vec
------
We'll use the `gensim` library to train word vectors in python. Use the code below to import the library, then define the size (number of dimensions) that you'd like your vectors to have.

Other parameters that we've included here are:
- `sg`: stands for "skip-gram". This flag selects whether to use the skip-gram algorithm (`1`) or the CBOW algorithm (`0`). These are two different algorithms for creating word vectors. They both work well and produce very similar results!
- `window`: the maximum distance between a particular word and the words around it that the model "looks" at when creating word vectors.
- `size` (named `vector_size` in gensim 4.0 and later): dimensionality of the output vectors
- `min_count`: ignores all tokens with total frequency lower than this number

For more parameters, [see the documentation](https://radimrehurek.com/gensim/models/word2vec.html).
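
To make the `window` parameter more concrete, the sketch below lists the (word, neighbor) pairs that a window of a given size produces for one short, made-up sentence. This is a simplified illustration of the idea behind skip-gram and is not how `gensim` is implemented: the library's actual training does more than this. The function name `skipgram_pairs` is hypothetical.

```python
def skipgram_pairs(tokens, window):
    """List every (word, neighbor) pair where the neighbor is within `window` tokens."""
    pairs = []
    for i, target in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:  # a word is not its own neighbor
                pairs.append((target, tokens[j]))
    return pairs

tokens = ["she", "wrote", "a", "long", "letter"]
for pair in skipgram_pairs(tokens, window=1):
    print(pair)
```

With `window=1` each word is paired only with its immediate neighbors. Increasing `window` adds pairs between words that are farther apart, so each word is described by a broader context.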

In [None]:
# for the actual word embedding models
from gensim.models import Word2Vec

# The dimension of the word embeddings that we're producing. 
# This variable will be used throughout the program
EMBEDDINGS_SIZE = 100

# Training Word2Vec model from Gensim. 
model = Word2Vec(cleaned_data, 
                 sg=1,
                 window=5, 
                 vector_size=EMBEDDINGS_SIZE,  # named `size` in gensim versions before 4.0
                 min_count=2)

In [None]:
# find out how big our vocabulary is
# (gensim 4.0+; in older versions, use len(model.wv.vocab) instead)
print('Vocab size:', len(model.wv.key_to_index))
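
The vocabulary is usually smaller than the number of distinct tokens in the texts because of `min_count`. The sketch below shows that filtering step on two tiny, made-up token lists using only the standard library. The function name `vocab_after_min_count` is hypothetical, and this is an illustration of the idea rather than the library's own code.

```python
from collections import Counter

def vocab_after_min_count(texts, min_count):
    """Return the set of tokens that occur at least `min_count` times overall."""
    counts = Counter(token for tokens in texts for token in tokens)
    return {word for word, count in counts.items() if count >= min_count}

toy_texts = [
    ["the", "friend", "wrote", "the", "letter"],
    ["a", "friend", "read", "the", "letter"],
]
print(sorted(vocab_after_min_count(toy_texts, min_count=2)))
```

Here only `friend`, `letter`, and `the` survive, because every other token appears just once across the two lists.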

In [None]:
# Saving file in txt format, so you don't have to remake it each time
model.wv.save_word2vec_format('my_embeddings.txt', binary=False)
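
The saved text file has a simple layout: the first line holds the number of words and the number of dimensions, and every following line holds one word followed by its vector values, separated by spaces. The sketch below writes a tiny made-up file in that layout to a temporary folder and reads it back with plain python. The function name `read_word2vec_text` and the toy values are invented for this example.

```python
import os
import tempfile

def read_word2vec_text(path):
    """Parse a word2vec-format text file into (dimensions, {word: vector})."""
    vectors = {}
    with open(path, encoding="utf-8") as f:
        # header line: "<number of words> <number of dimensions>"
        vocab_size, dims = (int(x) for x in f.readline().split())
        for line in f:
            parts = line.rstrip().split(" ")
            vectors[parts[0]] = [float(v) for v in parts[1:]]
    if len(vectors) != vocab_size:
        raise ValueError("header does not match the number of words read")
    return dims, vectors

# a tiny made-up file in the same layout, written to a temporary folder
toy_path = os.path.join(tempfile.mkdtemp(), "toy_embeddings.txt")
with open(toy_path, "w", encoding="utf-8") as f:
    f.write("2 3\n")
    f.write("friend 0.1 0.2 0.3\n")
    f.write("letter -0.4 0.5 0.6\n")

dims, vectors = read_word2vec_text(toy_path)
print(dims, vectors["friend"])
```

In practice you would reload the real file with `gensim` itself, using `KeyedVectors.load_word2vec_format('my_embeddings.txt', binary=False)` from `gensim.models`, instead of parsing it by hand.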

Querying the model
--------

There are a few different ways that we can query the resulting model! Here are a few of them.

In [None]:
# visualizing by projecting the word vectors into two-dimensional space

# we'll need to import some graphing libraries
import matplotlib.pyplot as plt
# and the projection method that we'll be using
from sklearn.manifold import TSNE

# tells jupyter notebooks to display the graphs under the cells that produce them
%matplotlib inline

In [None]:
# This code is heavily based off of code from
# https://www.kaggle.com/jeffd23/visualizing-word-vectors-with-t-sne

def tsne_plot(model, focus_word = None, n = 50):
    "Creates a t-SNE model and plots it"
    labels = []
    tokens = []

    if focus_word is not None:
        tokens.append(model.wv[focus_word])
        labels.append(focus_word)
        neighbors = model.wv.most_similar(focus_word, topn = n)
        for neighbor in neighbors:
            tokens.append(model.wv[neighbor[0]])
            labels.append(neighbor[0])
    else:
        for word in model.wv.key_to_index:  # gensim 4.0+; use model.wv.vocab in older versions
            tokens.append(model.wv[word])
            labels.append(word)
    
    # note: scikit-learn 1.5 and later name the n_iter parameter max_iter
    tsne_model = TSNE(perplexity=40, n_components=2, init='pca', n_iter=2500, random_state=23)
    new_values = tsne_model.fit_transform(tokens)

    x = [value[0] for value in new_values]
    y = [value[1] for value in new_values]
        
    plt.figure(figsize=(16, 16)) 
    for i in range(len(x)):
        plt.scatter(x[i],y[i])
        plt.annotate(labels[i],
                     xy=(x[i], y[i]),
                     xytext=(5, 2),
                     textcoords='offset points',
                     ha='right',
                     va='bottom')
    plt.show()

In [None]:
tsne_plot(model, focus_word="friend")

We can also query the model directly for the words that are most similar to a specific word, or to a combination of several words.

See [this documentation](https://radimrehurek.com/gensim/models/keyedvectors.html#what-can-i-do-with-word-vectors) for more examples of interacting with these word vectors and similarity.

In [None]:
# Running a query to get the items most similar to a given term
# documentation:
# https://tedboy.github.io/nlps/generated/generated/gensim.models.Word2Vec.most_similar.html
model.wv.most_similar("girl", topn = 20)

In [None]:
# most similar to multiple words
model.wv.most_similar(positive = ["girl", "woman"], topn = 20)

In [None]:
# most similar to woman - man
model.wv.most_similar(positive = ["woman"], negative = ["man"], topn = 20)
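
To build some intuition for what a query with `positive` and `negative` words does, here is a rough sketch on made-up two-dimensional vectors: add the (length-normalized) positive vectors, subtract the negative ones, then rank every other word by cosine similarity to the result. The function name `toy_most_similar` and all of the vector values are invented for illustration. This is a simplified picture and not the library's own code.

```python
import math

def unit(v):
    """Scale a vector so that its length is 1."""
    length = math.sqrt(sum(x * x for x in v))
    return [x / length for x in v]

def toy_most_similar(vectors, positive, negative=(), topn=3):
    """Rank words by cosine similarity to (sum of positives - sum of negatives)."""
    dims = len(next(iter(vectors.values())))
    query = [0.0] * dims
    for word in positive:
        query = [q + x for q, x in zip(query, unit(vectors[word]))]
    for word in negative:
        query = [q - x for q, x in zip(query, unit(vectors[word]))]
    query = unit(query)
    scores = []
    for word, vec in vectors.items():
        if word in positive or word in negative:
            continue  # don't return the query words themselves
        similarity = sum(q * x for q, x in zip(query, unit(vec)))
        scores.append((word, similarity))
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

# made-up 2-d vectors, NOT taken from the trained model
toy_vectors = {
    "man": [1.0, 0.0],
    "woman": [1.0, 1.0],
    "gentleman": [1.0, 0.1],
    "lady": [0.2, 1.0],
}
print(toy_most_similar(toy_vectors, positive=["woman"], negative=["man"]))
```

In this toy example `lady` ranks above `gentleman`, because subtracting `man` from `woman` leaves a vector that points mostly along the second dimension.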

In [None]:
# get the cosine distance (1 - cosine similarity) between two word vectors
distance = model.wv.distance("woman", "man")
print(distance)
distance = model.wv.distance("horse", "man")
print(distance)
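
The distance reported above is a cosine distance: 1 minus the cosine of the angle between the two vectors. The sketch below computes it by hand for a few made-up two-dimensional vectors so that you can see the range of values to expect. The function name `cosine_distance` is defined here only for illustration.

```python
import math

def cosine_distance(u, v):
    """Return 1 - cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1 - dot / (norm_u * norm_v)

print(cosine_distance([1.0, 0.0], [2.0, 0.0]))   # vectors pointing the same way
print(cosine_distance([1.0, 0.0], [0.0, 3.0]))   # perpendicular vectors
print(cosine_distance([1.0, 0.0], [-1.0, 0.0]))  # vectors pointing opposite ways
```

Vectors that point in the same direction have a distance of 0, perpendicular vectors have a distance of 1, and opposite vectors have a distance of 2. The length of the vectors doesn't matter, only their direction, so a smaller distance means the two words appear in more similar contexts.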