# Advanced Word Vectors in Python

This notebook builds on the core notebook and the analysis notebook and assumes that you have completed both before proceeding to this one.

# What Algorithm Should I Use?

**Gensim** is a popular natural language processing library that is often used for topic modelling. Gensim also comes with an implementation of the popular Word2Vec algorithm.

**spaCy** is also a popular natural language processing library that is designed to be very fast. spaCy also uses Word2Vec-style word embeddings, but tends to be slightly faster than Gensim. spaCy also comes with pre-trained models, which is incredibly useful if you want to get familiar with querying a model before building your own.

**GloVe** is an unsupervised learning algorithm developed by Stanford University. GloVe comes with some nice pre-trained models if you want to play around with word embeddings without having to train your own model.

# Why Use Gensim?

Gensim is a very memory-efficient way to work with word embedding models. Not only does Gensim come with some cool algorithms that you can apply to a downstream task such as topic modelling, but it also allows you to process large amounts of text without loading all of it into memory at once. Developed by Radim Řehůřek, Gensim is one of the most popular libraries for training word embedding models in Python. That popularity matters because it means there is a large amount of community support for the library, which makes troubleshooting much easier.

However, if you want some quick, out-of-the-box models to test things out, then GloVe may be a better choice for you. Keep in mind that GloVe and Gensim's Word2Vec approach [training](https://machinelearninginterview.com/topics/natural-language-processing/what-is-the-difference-between-word2vec-and-glove/) in different ways, which matters if you switch back and forth between the two.
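To make the memory point above more concrete, here is a minimal sketch of a streaming corpus. It assumes a plain-text file with one sentence per line. The class name `StreamingCorpus` and the file name `recipes.txt` are hypothetical, and the whitespace tokenizer is only a stand-in for whatever cleaning you normally use. Because the object reads one line at a time whenever it is iterated, an iterable like this can be handed to Gensim in place of a list that holds every sentence in memory.

```python
import os
import tempfile

class StreamingCorpus:
    """Yield one tokenized sentence at a time from a text file (hypothetical helper)."""

    def __init__(self, path):
        self.path = path

    def __iter__(self):
        # re-open the file on every pass so the corpus can be iterated more than once
        with open(self.path, encoding="utf-8") as handle:
            for line in handle:
                tokens = line.lower().split()
                if tokens:  # skip blank lines
                    yield tokens

# small demonstration: a temporary file stands in for a large corpus
with tempfile.TemporaryDirectory() as folder:
    demo_path = os.path.join(folder, "recipes.txt")
    with open(demo_path, "w", encoding="utf-8") as handle:
        handle.write("Bake twenty minutes\n\nOne cup of butter\n")

    corpus = StreamingCorpus(demo_path)
    first_pass = list(corpus)   # materialized here only to inspect the result
    second_pass = list(corpus)  # a second pass yields the same sentences
```

The sketch uses a class with `__iter__` rather than a one-shot generator because Word2Vec passes over its input more than once (once to build the vocabulary and again for each training epoch).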

# How to Be More Memory Efficient

## Loading Model Vectors

Above, I noted that Gensim is a memory-efficient way to work with word embedding models. Even so, there are little things that you can do to lessen the burden of the model on your machine. One of them is to load only the vectors, rather than the entire model, when you just plan on querying the model. Depending on the size of your model, loading all of it into memory can be quite the task for your machine. The trade-off is that vectors loaded this way can be queried but not trained further. Below, I explain how to load the vectors from a Gensim model.

### The Code

The first thing that we want to do is pull the vectors out of the model. Then, we'll save those vectors as a new file, just as we did with the `.model` file, so that we can call those vectors later. In this code, we declare a variable called `word_vectors` to hold `model.wv`, which represents the word vectors within the model itself. Then, we save those word vectors as a file called `word2vec.wordvectors`. Finally, we initialize a variable called `wv` (short for 'word vectors') and use `KeyedVectors.load()` to read the saved file. The `mmap='r'` argument asks Gensim to memory-map the vectors from disk as read-only rather than copying them all into RAM.

In [None]:
# import the classes we need
from gensim.models import Word2Vec, KeyedVectors

# make sure that your model is loaded
model = Word2Vec.load("word2vec.model")

# declare a variable to hold the vectors
word_vectors = model.wv

# save those vectors to a new file so that we can use them later
word_vectors.save("word2vec.wordvectors")

# now load those vectors
# you can now query the model by using "wv"
wv = KeyedVectors.load("word2vec.wordvectors", mmap='r')

You can now query the model with `wv` rather than `model.wv`. `wv` supports all of the same querying functions as `model.wv`.

In [None]:
# gives the 10 words most similar to 'recipe'
wv.most_similar('recipe', topn=10)

# gives a cosine similarity for the words 'milk' and 'cream'
wv.similarity("milk", "cream")
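The `similarity()` call above returns a cosine similarity. As a rough illustration of what that number means, the sketch below computes cosine similarity by hand in plain Python. The three-dimensional vectors are made up for the example (real Word2Vec vectors typically have 100 or more dimensions), and the `cosine_similarity()` helper is my own, not a Gensim function.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# made-up toy vectors, NOT taken from a trained model
milk = [0.9, 0.1, 0.3]
cream = [0.8, 0.2, 0.4]
anvil = [-0.5, 0.9, -0.1]

# vectors pointing in similar directions score close to 1
print(cosine_similarity(milk, cream))

# vectors pointing in different directions score lower (here, below 0)
print(cosine_similarity(milk, anvil))
```

Scores range from -1 to 1, and words that appear in similar contexts should end up with vectors that score closer to 1.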

# Resuming Training

Working with word embedding models is typically an iterative process where you train a model, evaluate it, and then train the model again. Thankfully, Gensim provides the functionality for introducing new data to an existing model. 

The code below reads in a hard-coded list of sentences. If you have a folder of texts or a spreadsheet, you would use the methods outlined in the core notebook, which I won't rehash here. Essentially, the entire process is the same as before, except that instead of declaring a new model we call the `train()` method on the model we already have.

### The Code

After loading in the libraries we need, the code below begins by loading the current version of our model into memory by using `Word2Vec.load()`. Next, I have declared a list variable with a few sentences from a cupcake recipe in it as strings. If you are using more extensive data, then this is where you would repeat the step of loading in from a folder or spreadsheet from the core notebook. 

Then, I define the `clean_text()` function, which is exactly the same as the `clean_text()` function from the core notebook. I apply it to my list in the same way I did in the core notebook. Up to this point, all of these steps are borrowed from the core notebook.

Now, we get to the bit that differs from the core notebook. Now that we have a list of tokens, we build the vocabulary for our model by calling `model.build_vocab()`. We are building the vocabulary using the new data and using the `update=True` parameter to let our model know that we are updating the vocabulary in our existing model.

Finally, we call the built-in `train()` method to continue training the model. Whereas in the core notebook we used `model = Word2Vec()` to train a new model, calling `model.train()` tells the existing model to train on these additional items rather than replacing the vocabulary it has already built. If you were to call `model = Word2Vec()` instead, you would overwrite the existing model with one that contains only the vocabulary of the new data.

In [None]:
from gensim.models import Word2Vec # for Word2Vec
import re                          # for regular expressions
import string                      # for string.punctuation

# load our current model
model = Word2Vec.load(r"C:\Users\avery\.spyder-py3\models\wordvector models\word2vectest.model")

# declare a variable with our new sentences/words
# you can use the folder/spreadsheet method from the core notebook if you have more data
more_sentences = [
    "Cup cake is about as good as pound cake, and is cheaper.",
    "One cup of butter, two cups of sugar, three cups of flour,",
    "and four eggs, well beat together, and baked in pans or cups.",
    "Bake twenty minutes, and no more."
]

# we're going to use our clean_text() function from the core notebook
def clean_text(text):
    re_punc = re.compile('[%s]' % re.escape(string.punctuation))

    # lower case
    tokens = text.split()
    tokens = [t.lower() for t in tokens]
    # remove punctuation
    tokens = [re_punc.sub('', token) for token in tokens] 
    # only include tokens that aren't numbers
    tokens = [token for token in tokens if token.isalpha()]
    return tokens

# declare an empty list to hold our clean text
data_clean = []

# iterate through the new data and apply clean_text() to each item
for x in more_sentences:
    data_clean.append(clean_text(x))

# build our model vocab and tell the model that we are updating the vocab
model.build_vocab(data_clean, update=True)

# tell the model to re-train with the new data
model.train(data_clean, total_examples=model.corpus_count, epochs=model.epochs)
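One thing worth checking after an update like this: as far as I understand Gensim's behavior, `build_vocab()` with `update=True` still applies the model's `min_count` threshold to the new data, so a word only joins the vocabulary if it appears at least that many times in the new sentences. The sketch below is plain Python rather than a Gensim call. It counts tokens in some toy data and lists the ones that would clear a given threshold. The value `5` is an assumption (it is Gensim's default `min_count`), and the helper name `words_meeting_min_count()` is hypothetical.

```python
from collections import Counter

def words_meeting_min_count(tokenized_sentences, min_count):
    """Return the words that occur at least min_count times (hypothetical helper)."""
    counts = Counter(token for sentence in tokenized_sentences for token in sentence)
    return sorted(word for word, n in counts.items() if n >= min_count)

# toy data in the same shape as data_clean: a list of token lists
toy_data = [
    ["cup", "cake", "is", "about", "as", "good", "as", "pound", "cake"],
    ["one", "cup", "of", "butter", "two", "cups", "of", "sugar"],
    ["bake", "twenty", "minutes", "and", "no", "more"],
]

# with an assumed threshold of 5, nothing in this small sample qualifies
frequent_enough = words_meeting_min_count(toy_data, min_count=5)

# with a threshold of 2, only the repeated words qualify
appear_twice = words_meeting_min_count(toy_data, min_count=2)
```

With only a handful of sentences, nothing reaches a count of five, so a model using that threshold would likely not pick up any new words from them. You can compare against your loaded model's own setting by looking at `model.min_count`.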

You can add additional data and retrain a model as many times as you want. Something to keep in mind, however, is that you may want to save your model under a different name than the previous model. This way, you'll still have access to the old model in case something goes wrong in the retraining process. To do so, you would do the following:

In [None]:
# save the model as a retrained model with the date that it was retrained 
# you can save your newly trained model under whatever name makes the most sense to you
model.save("word2vec_retrained_08012022.model")
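If you retrain often, you may prefer to build the file name from the current date instead of typing it by hand. The sketch below is one way to do that with the standard library. The helper name `retrained_model_name()` is hypothetical, and it assumes that the date in the file name above is written as month, day, year.

```python
from datetime import date

def retrained_model_name(when, prefix="word2vec_retrained"):
    """Build a file name such as word2vec_retrained_MMDDYYYY.model (hypothetical helper)."""
    return f"{prefix}_{when:%m%d%Y}.model"

# a fixed date, to show the format
example_name = retrained_model_name(date(2022, 8, 1))

# today's date, which is what you would normally pass to model.save()
todays_name = retrained_model_name(date.today())
print(todays_name)
```

You could then call `model.save(todays_name)` in place of the hard-coded name.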

# Using Pre-Trained Models

Gensim can also load a model that someone else has already trained. Using pre-trained models can be incredibly useful if an existing model suits your needs. For example, the Google News Word2Vec model is freely available and contains 300-dimensional vectors for roughly three million words and phrases, trained on about 100 billion words of Google News text.

### The Code

In the code below, we begin by importing `gensim.downloader` as `download`. Then we declare a variable `wv` and use it to store the result of our call to download the Google News model. To query the model, you can use all of the query and analysis functions introduced in the core and analysis notebooks by calling them on `wv`. Keep in mind that the Google model is very large and may take a while to download.

In [None]:
import gensim.downloader as download
wv = download.load('word2vec-google-news-300')

# Supervised vs. Unsupervised vs. Semi-Supervised

# What to Do With Your Model

## What is a Downstream Task?