# Word2Vec: training your own model

In this notebook we will see how you can train your own Word2Vec model.

There are two main steps:
* Preprocessing the corpus you will train your model on.
* Training the model.

Let's go!

## Preprocessing

Gensim's Word2Vec expects the training data as an iterable of tokenized sentences, i.e. each item is itself a list of tokens (e.g. a list of lists of strings).
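To make the expected input shape concrete, here is a minimal sketch using a made-up toy corpus. The sentences and the helper name `is_valid_training_data` are invented for illustration and are not part of Gensim.

```python
# A made-up toy corpus: one inner list of string tokens per sentence.
toy_corpus = [
    ["ship", "arrive", "liverpool", "yesterday"],
    ["train", "leave", "manchester", "morning"],
]

def is_valid_training_data(corpus):
    """Hypothetical helper: True if `corpus` is a sequence of lists of strings."""
    return all(
        isinstance(sentence, list)
        and all(isinstance(token, str) for token in sentence)
        for sentence in corpus
    )

print(is_valid_training_data(toy_corpus))          # a list of token lists
print(is_valid_training_data(["ship", "arrive"]))  # a flat list of tokens
```

The first call should print `True` and the second `False`: a flat list of tokens is not the shape we need.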

### Define your preprocessing

Many different choices can be made when preprocessing a corpus to be used as Word2Vec training data. The choices you make will usually depend on:

1. The size of your corpus (it might be computationally too expensive to preprocess very large corpora: choose a quality-time trade-off wisely!)
2. What you are planning to use the model for (e.g. maybe you want to look at non-lemmatized tokens? Maybe you'd like to keep numbers?)

For this first complete example, our preprocessing will consist of:
- Lowercasing
- Punctuation removal
- Stopword removal
- Lemmatization

But preprocessing may also include tasks like:
- Removal of tokens with fewer than _n_ characters
- Removal of numbers
- PoS tagging
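
As an illustration of the first two optional steps above, here is a minimal sketch in plain Python. The function name `filter_tokens` and its parameters `min_chars` and `drop_numbers` are hypothetical, chosen just for this example.

```python
def filter_tokens(tokens, min_chars=3, drop_numbers=True):
    """Drop tokens shorter than `min_chars` and, optionally, purely numeric tokens."""
    kept = []
    for token in tokens:
        if len(token) < min_chars:
            continue  # token is too short
        if drop_numbers and token.isdigit():
            continue  # token consists only of digits
        kept.append(token)
    return kept

example_tokens = ["in", "1851", "the", "exhibition", "opened", "to", "6", "million"]
print(filter_tokens(example_tokens))
print(filter_tokens(example_tokens, drop_numbers=False))
```

With the default settings only `the`, `exhibition`, `opened` and `million` are kept. With `drop_numbers=False`, `1851` survives as well because it is long enough.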

Since we are experts in spaCy now, we will use it to preprocess our corpus.

In [None]:
import spacy

Load the English model `en_core_web_sm` from `spacy`:

**Note:** If the following cell does not work, you probably need to download the spaCy model again. You can do so by typing the following in a new cell:
```
!python3 -m spacy download en_core_web_sm
```

In [None]:
nlp = spacy.load('en_core_web_sm')

The following function takes in a sentence and returns a list of tokens (i.e. words):

In [None]:
# Create a function that tokenizes a sentence and applies whichever preprocessing you want:
def tokenize_a_sentence(sentence):
    sentence = sentence.lower() # Lowercase the sentence
    doc = nlp(sentence) # Turn the sentence into a spaCy Doc object
    list_tokens = [] # Instantiate an empty list, where we will store the tokens we want to keep.
    for token in doc: # Iterate over the tokens in the sentence.
        if not token.is_stop and token.text.isalpha(): # Keep tokens that are not stopwords and consist only of alphabetic characters.
            list_tokens.append(token.lemma_) # Add the token's lemma to the list of tokens.
    return list_tokens

See how the function works with an example:

In [None]:
sentence = 'These are test sentences, just to have a look at how a processed sentence looks like.'

# Apply the function to a new sentence:
tokenized_sentence = tokenize_a_sentence(sentence)

print(tokenized_sentence)

✏️ **Exercise 1:**

Can you rewrite the `tokenize_a_sentence()` function as a list comprehension?

In [None]:
# Write your code here:



### Load a dataset and preprocess it

In our toy example, we will use the `text` column from a CSV file of 19th-century newspaper articles.

We will use the `pandas` library because it makes it easier to extract a column from a CSV file.

We will then pre-process each text with our `tokenize_a_sentence` function.

In [None]:
import pandas as pd

In [None]:
# Read the data from the CSV file into a pandas dataframe:
incsv = pd.read_csv('data/newspapers-toy.csv')

In [None]:
# Get the size of the dataset:
incsv.shape

In [None]:
# Show the first rows of the dataframe:
incsv.head()

In [None]:
# Keep the values of the `text` column in a new variable:
training_set = incsv["text"]

In [None]:
# Show the content of training_set:
print(training_set)
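
One practical caveat: if a row has no text, pandas reads it as `NaN` (a float), and calling `.lower()` on it would fail. The sketch below shows one way to drop such rows before preprocessing. It uses a made-up CSV string in place of a real file, and the variable names are hypothetical.

```python
import io
import pandas as pd

# Made-up CSV content standing in for a real file; the second row has no text.
toy_csv = "id,text\n1,The ship arrived.\n2,\n3,The train left.\n"
toy_df = pd.read_csv(io.StringIO(toy_csv))

# Count the rows whose `text` value is missing.
print(toy_df["text"].isna().sum())

# Keep only the rows that actually contain text, and make sure they are strings.
clean_texts = toy_df["text"].dropna().astype(str)
print(len(clean_texts))
```

Whether dropping rows is the right choice depends on your data. You may prefer to inspect the missing rows first.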

We will now apply the preprocessing function to each sentence in the dataset.

The output will be a list of lists, where each of the inner lists contains the relevant tokens in the text.

We will use the `tqdm` library to track the progress of the preprocessing (i.e. to show a progress bar).

In [None]:
from tqdm import tqdm # To track the progress of the preprocessing

In [None]:
training_data = [] # Empty list: that's where we will store our training data.

# Iterate over all texts in our training set:
for text in tqdm(training_set):
    
    # Tokenise
    tokenized_text = tokenize_a_sentence(text)

    # Add the tokenized text (i.e. a list of tokens) to the list that will be
    # used to train the model.
    training_data.append(tokenized_text)

In [None]:
# Check the size of the `training_data` variable
len(training_data)

In [None]:
# Get the first element of `training_data` (i.e. the first text, tokenised):
training_data[0]

## Train a Word2Vec model

We will use the `Word2Vec` class from Gensim to train a model.

In [None]:
from gensim.models import Word2Vec

**Hyperparameters** are training parameters whose values can be decided by the user. Different choices of hyperparameters will impact the training.

[Here](https://radimrehurek.com/gensim/models/word2vec.html#gensim.models.word2vec.Word2Vec) is a breakdown of all possible hyperparameters which can be changed if we train a model using Gensim's `Word2Vec`. We'll mostly stick to the defaults (but spell out the most important ones for clarity).

Hyperparameters are specified by the user when initialising a new Word2Vec model, before providing the data to learn from and before starting the training.

It's okay to start with the default values (which you can find in the link just provided, e.g. `min_count=5`). They are widely used and tend to work reasonably well across many different tasks.
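
To get a feel for what `min_count` does, the sketch below counts word frequencies in a made-up toy corpus and keeps only the words that appear at least `min_count` times. It only illustrates the idea and is not Gensim's internal implementation. The function name `vocabulary_after_min_count` is hypothetical.

```python
from collections import Counter

# Made-up toy corpus: a list of tokenized sentences.
toy_corpus = [
    ["ship", "arrive", "liverpool"],
    ["ship", "leave", "liverpool"],
    ["ship", "sink"],
]

def vocabulary_after_min_count(corpus, min_count):
    """Return the sorted words whose total frequency is at least `min_count`."""
    counts = Counter(token for sentence in corpus for token in sentence)
    return sorted(word for word, freq in counts.items() if freq >= min_count)

print(vocabulary_after_min_count(toy_corpus, min_count=1))
print(vocabulary_after_min_count(toy_corpus, min_count=2))
```

With `min_count=2`, only `liverpool` and `ship` remain. On a small corpus, a high `min_count` can therefore shrink the vocabulary drastically.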

Normally, it is good practice to find the **optimal parameters** for your model, or better still, for the purpose for which you plan to use your model.

For example, a larger context window (the `window` parameter) makes the model consider more words before and after each token when learning the vector for each word in the corpus. Larger windows seem to lead the model to capture more semantic similarities/relationships, whereas smaller windows tend to reflect similarities which are more syntactic in nature.
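
The effect of the window size can be sketched in plain Python by listing the (target, context) pairs that fall within a given window. This is a simplified illustration with a hypothetical function `context_pairs`. Gensim's real training loop is more involved (for instance, it may use a smaller effective window for some target words), so treat it only as an intuition aid.

```python
def context_pairs(tokens, window):
    """List (target, context) pairs for every token, looking `window` words to each side."""
    pairs = []
    for i, target in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:  # a word is not its own context
                pairs.append((target, tokens[j]))
    return pairs

sentence = ["ship", "arrive", "liverpool", "dock"]
print(context_pairs(sentence, window=1))
print(len(context_pairs(sentence, window=2)))
```

With `window=1` each word is paired only with its immediate neighbours. With `window=2` the number of pairs grows, because words further apart are also treated as context.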

You can instantiate Word2Vec (using the default parameters) as follows:

In [None]:
# Instantiate Word2Vec with the default hyperparameters:
w2v_model = Word2Vec()

But you can also specify your choice of hyperparameters in the brackets:

In [None]:
# Instantiate Word2Vec with your own hyperparameters:
w2v_model = Word2Vec(min_count=5, # minimum number of occurrences for a word to be included in the vocabulary
                     window=5, # how many words before and after the target count as context
                     sg=1, # training algorithm: skip-gram (1) or CBOW (0)
                     vector_size=100, # dimensionality of the word vectors
                     )

Before training, you will need to build the vocabulary. You can do it as follows:

In [None]:
w2v_model.build_vocab(training_data) # Build vocabulary

We then train the model as follows:

In [None]:
w2v_model.train(training_data, # tokenised data
                total_examples=len(training_data), # number of tokenised texts in the training data
                epochs=5, # number of passes over the training data
                )

We will now save both the whole model and the vectors. This is because, while a file containing only the vectors is easier to manage, saving the full model as well allows you to update it in the future (e.g. to keep training it on more relevant sentences).

Note: `binary=False` saves the vectors in a non-binary (i.e. human-readable) format. This takes more disk space and is slower to load, but the file is easier to inspect.

In [None]:
w2v_model.save("models/test-model") # Save the full model (in case we'd like to update it in the future)
w2v_model.wv.save_word2vec_format("models/test-model-vectors.txt", binary=False) # Also save the vectors only (easier to work with)
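
To illustrate what "human-readable" means for the vectors file: the word2vec text format starts with a header line giving the vocabulary size and the vector dimensionality, followed by one line per word with its vector components. The sketch below parses a made-up two-word example held in a string. The numbers and the function name `parse_word2vec_text` are invented for illustration. In practice you would use Gensim's own loader.

```python
# Made-up content in word2vec text format: header line, then one word per line.
toy_vectors_text = """2 3
ship 0.1 0.2 0.3
dock 0.4 0.5 0.6
"""

def parse_word2vec_text(text):
    """Parse word2vec text format into a dict mapping each word to a list of floats."""
    lines = text.strip().splitlines()
    vocab_size, vector_size = (int(x) for x in lines[0].split())
    vectors = {}
    for line in lines[1:]:
        parts = line.split()
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    # Sanity checks against the header line.
    assert len(vectors) == vocab_size
    assert all(len(vec) == vector_size for vec in vectors.values())
    return vectors

toy_vectors = parse_word2vec_text(toy_vectors_text)
print(toy_vectors["ship"])
```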

## Loading a full model

We have already seen how to load the vectors-only file.

Now, to load a full model, we will use the `Word2Vec` class from `gensim.models`.

In [None]:
# To load the full model, we need to import Word2Vec from gensim:
from gensim.models import Word2Vec

In [None]:
# To read a word2vec model, use the .load() method, passing in the path to the model we just trained and saved:
our_test_model = Word2Vec.load("models/test-model")

In [None]:
# Check the data type:
type(our_test_model)

If we load a full model like this, we **can't** access the embeddings in the same way we did with `KeyedVectors`:

In [None]:
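# With recent versions of Gensim (4.x), this is expected to raise a TypeError,
# because the full model object is not subscriptable:
our_test_model["liverpool"]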

However, we can simply obtain the model's vectors as a `KeyedVectors` object by 'isolating' the vectors through the model's `wv` attribute, in the following way:

In [None]:
# Get the model's vectors as a KeyedVectors object and store it in `our_test_vectors`:
our_test_vectors = our_test_model.wv

In [None]:
# Check the data type of the pretrained vectors in `our_test_vectors`:
type(our_test_vectors)

In [None]:
# Get the embedding of token `liverpool`:
print(our_test_vectors["liverpool"])

In [None]:
# Get the words most similar to "liverpool":
our_test_vectors.most_similar("liverpool")
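
Under the hood, `most_similar` ranks words by the cosine similarity between their vectors. The sketch below reproduces that idea in plain Python on made-up two-dimensional vectors. The vectors and the function names `cosine_similarity` and `most_similar_toy` are hypothetical. This is an illustration of the measure, not Gensim's implementation.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up 2-dimensional "embeddings".
toy_vectors = {
    "liverpool": [1.0, 0.0],
    "manchester": [1.0, 1.0],
    "potato": [0.0, 1.0],
}

def most_similar_toy(word, vectors):
    """Rank every other word by cosine similarity to `word`, most similar first."""
    query = vectors[word]
    scores = [
        (other, cosine_similarity(query, vec))
        for other, vec in vectors.items()
        if other != word
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

print(most_similar_toy("liverpool", toy_vectors))
```

Here `manchester` ranks above `potato` because its vector points in a direction closer to that of `liverpool`.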

✏️ **Exercise 2:**

Obtain a corpus of text that is interesting for your research (medieval chronicles, Shakespeare's sonnets, Don Quijote, etc.).

Download it, preprocess it, train a Word2Vec model, and explore the model.

In [None]:
# Type your solution here:


# Solutions

✏️ **Exercise 1:**

Can you rewrite the `tokenize_a_sentence()` function as a list comprehension?

In [None]:
# Sentence provided:
sentence = 'These are test sentences, just to have a look at how a processed sentence looks like.'

In [None]:
# Function provided:
def tokenize_a_sentence(sentence):
    sentence = sentence.lower() # Lowercase the sentence
    doc = nlp(sentence) # Create spaCy Doc object
    list_tokens = [] # Instantiate an empty list, where we will store the tokens we want to keep.
    for token in doc:
        if not token.is_stop and token.text.isalpha(): # Keep tokens that are not stopwords and are all alphabetic.
            list_tokens.append(token.lemma_) # Add the lemma to the list of tokens.
    return list_tokens

# Calling the function:
tokenized_sentence = tokenize_a_sentence(sentence)
print(tokenized_sentence)

In [None]:
# Suggested answer, using a list comprehension:
tokenized_sentence = [token.lemma_ for token in nlp(sentence.lower()) if not token.is_stop and token.text.isalpha()]
print(tokenized_sentence)