Notebook prepared by Henrique Lopes Cardoso (hlc@fe.up.pt), based on [Word2Vec Tutorial Notebook](https://github.com/kavgan/nlp-in-practice/tree/master/word2vec) by Kavita Ganesan and on [Gensim's documentation on the Word2Vec Model](https://radimrehurek.com/gensim/auto_examples/tutorials/run_word2vec.html).

# WORD EMBEDDINGS

## Word2Vec in Gensim

[Word2Vec](https://code.google.com/archive/p/word2vec/) is a model for training word embeddings that revolutionized the way words are represented. [Gensim](https://radimrehurek.com/gensim_3.8.3/models/word2vec.html) provides an implementation of the algorithm, with which we can train our own word embeddings.

#### The data

Training embeddings requires a large corpus: the bigger, the better.

For illustration purposes, we'll make use of the (not very big) [OpinRank](http://kavita-ganesan.com/entity-ranking-data/) dataset, which includes full reviews of cars and hotels. More specifically, we'll use an 84MB compressed file with 255404 hotel reviews.

This is what each review looks like:

In [2]:
import gzip

data_file = "reviews_data.txt.gz"

# print the first review (as raw bytes) to see what the data looks like
with gzip.open(data_file, 'rb') as f:
    for line in f:
        print(line)
        break

b"Oct 12 2009 \tNice trendy hotel location not too bad.\tI stayed in this hotel for one night. As this is a fairly new place some of the taxi drivers did not know where it was and/or did not want to drive there. Once I have eventually arrived at the hotel, I was very pleasantly surprised with the decor of the lobby/ground floor area. It was very stylish and modern. I found the reception's staff geeting me with 'Aloha' a bit out of place, but I guess they are briefed to say that to keep up the coroporate image.As I have a Starwood Preferred Guest member, I was given a small gift upon-check in. It was only a couple of fridge magnets in a gift box, but nevertheless a nice gesture.My room was nice and roomy, there are tea and coffee facilities in each room and you get two complimentary bottles of water plus some toiletries by 'bliss'.The location is not great. It is at the last metro stop and you then need to take a taxi, but if you are not planning on going to see the historic sites in Be

Let's read the whole dataset into a list, while providing some logging information.

In the process of reading the data directly from the compressed file, we'll perform some pre-processing of the reviews using [gensim.utils.simple_preprocess](https://tedboy.github.io/nlps/generated/generated/gensim.utils.simple_preprocess.html). This does some basic pre-processing, such as tokenization and lowercasing, and returns a list of tokens (words).

In [4]:
import gensim
import logging

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

def read_input(input_file):
    """Read the gzip-compressed input file, yielding one list of tokens per review."""

    logging.info("reading file {0}...this may take a while".format(input_file))

    with gzip.open(input_file, 'rb') as f:
        for i, line in enumerate(f):
            if i % 10000 == 0:
                logging.info("read {0} reviews".format(i))
            # do some pre-processing and yield a list of words for each review text
            yield gensim.utils.simple_preprocess(line)
    logging.info("Done reading data file")
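To get a feel for what the pre-processing step does to a review, here is a rough standard-library approximation. It is only a sketch, not Gensim's actual implementation: the function name `approx_simple_preprocess` is made up for illustration, it takes a `str` rather than the raw bytes read from the file, and the length limits mirror the `min_len=2` and `max_len=15` defaults of `simple_preprocess`.

```python
import re


def approx_simple_preprocess(text, min_len=2, max_len=15):
    """Hypothetical helper: lowercase, keep alphabetic runs, drop very short/long tokens."""
    # runs of letters only: digits, punctuation and whitespace act as separators
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    # discard tokens outside the allowed length range (e.g. the pronoun "I")
    return [t for t in tokens if min_len <= len(t) <= max_len]


sample = "Oct 12 2009 \tNice trendy hotel location not too bad.\tI stayed in this hotel for one night."
tokens = approx_simple_preprocess(sample)
print(tokens)
```

This explains why numbers such as the review date and one-letter words such as *I* do not show up in the tokenized reviews.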

In [5]:
# compressed file with the data
data_file="reviews_data.txt.gz"

# read the tokenized reviews into a list
documents = list(read_input(data_file))

len(documents)

2023-03-10 12:11:33,341 : INFO : reading file reviews_data.txt.gz...this may take a while
2023-03-10 12:11:33,342 : INFO : read 0 reviews
2023-03-10 12:11:35,816 : INFO : read 10000 reviews
2023-03-10 12:11:38,332 : INFO : read 20000 reviews
2023-03-10 12:11:41,508 : INFO : read 30000 reviews
2023-03-10 12:11:44,362 : INFO : read 40000 reviews
2023-03-10 12:11:47,477 : INFO : read 50000 reviews
2023-03-10 12:11:50,166 : INFO : read 60000 reviews
2023-03-10 12:11:52,662 : INFO : read 70000 reviews
2023-03-10 12:11:54,667 : INFO : read 80000 reviews
2023-03-10 12:11:59,259 : INFO : read 90000 reviews
2023-03-10 12:12:02,165 : INFO : read 100000 reviews
2023-03-10 12:12:05,066 : INFO : read 110000 reviews
2023-03-10 12:12:07,722 : INFO : read 120000 reviews
2023-03-10 12:12:12,396 : INFO : read 130000 reviews
2023-03-10 12:12:17,664 : INFO : read 140000 reviews
2023-03-10 12:12:20,224 : INFO : read 150000 reviews
2023-03-10 12:12:22,910 : INFO : read 160000 reviews
2023-03-10 12:12:25,802

255404

Each review item becomes a list of words, so what we have is a list of lists.

In [6]:
print(documents[0])

['oct', 'nice', 'trendy', 'hotel', 'location', 'not', 'too', 'bad', 'stayed', 'in', 'this', 'hotel', 'for', 'one', 'night', 'as', 'this', 'is', 'fairly', 'new', 'place', 'some', 'of', 'the', 'taxi', 'drivers', 'did', 'not', 'know', 'where', 'it', 'was', 'and', 'or', 'did', 'not', 'want', 'to', 'drive', 'there', 'once', 'have', 'eventually', 'arrived', 'at', 'the', 'hotel', 'was', 'very', 'pleasantly', 'surprised', 'with', 'the', 'decor', 'of', 'the', 'lobby', 'ground', 'floor', 'area', 'it', 'was', 'very', 'stylish', 'and', 'modern', 'found', 'the', 'reception', 'staff', 'geeting', 'me', 'with', 'aloha', 'bit', 'out', 'of', 'place', 'but', 'guess', 'they', 'are', 'briefed', 'to', 'say', 'that', 'to', 'keep', 'up', 'the', 'coroporate', 'image', 'as', 'have', 'starwood', 'preferred', 'guest', 'member', 'was', 'given', 'small', 'gift', 'upon', 'check', 'in', 'it', 'was', 'only', 'couple', 'of', 'fridge', 'magnets', 'in', 'gift', 'box', 'but', 'nevertheless', 'nice', 'gesture', 'my', 'room

#### Training the Word2Vec model

To train a Word2Vec model, we instantiate `Word2Vec` and pass it the tokenized reviews we loaded above. You can check the available options for [instantiation](https://radimrehurek.com/gensim_3.8.3/models/word2vec.html#gensim.models.word2vec.Word2Vec) and for [training](https://radimrehurek.com/gensim_3.8.3/models/word2vec.html#gensim.models.word2vec.Word2Vec.train). Note that these pages document Gensim 3.8.3, while the code in this notebook uses Gensim 4 names (for instance, `vector_size` was called `size` in 3.x).

Training a Word2Vec model takes time, depending on your hardware. In this particular case, expect training to take between 5 and 10 minutes on an Intel Core i7 desktop with 16GB of RAM. I know, that's a very wide range, but of course it depends on which other processes are running on the same machine...
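The training call in the next cell sets `window`, `min_count` and `sg=1` (skip-gram instead of CBOW). As a simplified sketch of what these mean, the snippet below builds the (center, context) pairs that a skip-gram model would learn from. The helper `skipgram_pairs` and the toy sentences are made up for illustration; Gensim's real training loop does more than this (for example, it samples the effective window size), so take it as an intuition aid only.

```python
from collections import Counter


def skipgram_pairs(sentences, window=2, min_count=1):
    """Hypothetical helper: list (center, context) pairs for a tokenized corpus."""
    counts = Counter(word for sentence in sentences for word in sentence)
    pairs = []
    for sentence in sentences:
        # words rarer than min_count are dropped before windows are formed
        kept = [word for word in sentence if counts[word] >= min_count]
        for i, center in enumerate(kept):
            # context = up to `window` words on each side of the center word
            lo = max(0, i - window)
            hi = min(len(kept), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    pairs.append((center, kept[j]))
    return pairs


toy_sentences = [["nice", "clean", "room"], ["dirty", "room"]]
pairs = skipgram_pairs(toy_sentences, window=1, min_count=1)
print(pairs)
```

A larger `window` produces more (and more distant) context pairs per word, while a larger `min_count` shrinks the vocabulary by discarding rare words.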

In [7]:
from datetime import datetime

start_time = datetime.now()

model = gensim.models.Word2Vec(documents, vector_size=150, window=10, min_count=2, workers=10, sg=1)

print("Training time:", datetime.now() - start_time)

2023-03-10 12:15:31,876 : INFO : collecting all words and their counts
2023-03-10 12:15:31,876 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2023-03-10 12:15:32,270 : INFO : PROGRESS: at sentence #10000, processed 1655714 words, keeping 25777 word types
2023-03-10 12:15:32,652 : INFO : PROGRESS: at sentence #20000, processed 3317863 words, keeping 35016 word types
2023-03-10 12:15:33,267 : INFO : PROGRESS: at sentence #30000, processed 5264072 words, keeping 47518 word types
2023-03-10 12:15:34,071 : INFO : PROGRESS: at sentence #40000, processed 7081746 words, keeping 56675 word types
2023-03-10 12:15:34,785 : INFO : PROGRESS: at sentence #50000, processed 9089491 words, keeping 63744 word types
2023-03-10 12:15:35,377 : INFO : PROGRESS: at sentence #60000, processed 11013726 words, keeping 76786 word types
2023-03-10 12:15:35,927 : INFO : PROGRESS: at sentence #70000, processed 12637528 words, keeping 83199 word types
2023-03-10 12:15:36,319 : INFO : PROG

#### Exploiting the Word2Vec model

We can now inspect the word embeddings that we have trained. We can start by looking at the vector of a specific word:

In [None]:
# embeddings
model.wv.get_vector("dirty")

Which are the words most similar to this one?

In [None]:
# similarity
w1 = "dirty"
model.wv.most_similar(positive=w1)
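Under the hood, `most_similar` ranks words by the cosine similarity between their vectors, which is also the score that `similarity` returns for a word pair. The sketch below reproduces the idea with NumPy on a handful of hand-made 3-dimensional vectors; the vectors and the helper names `cosine` and `toy_most_similar` are invented for illustration and have nothing to do with the trained model.

```python
import numpy as np

# made-up vectors, only to illustrate the ranking
toy_vectors = {
    "dirty":  np.array([0.9, 0.1, 0.0]),
    "filthy": np.array([0.8, 0.2, 0.1]),
    "smelly": np.array([0.7, 0.0, 0.4]),
    "polite": np.array([0.0, 0.9, 0.3]),
}


def cosine(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def toy_most_similar(word, vectors, topn=2):
    """Rank every other word by cosine similarity to `word`."""
    scores = [(other, cosine(vectors[word], vec))
              for other, vec in vectors.items() if other != word]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


print(toy_most_similar("dirty", toy_vectors))
```

Because only the direction of the vectors matters, a word compared with itself always scores 1.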

You can also limit the number of hits with `topn`:

In [None]:
# look up top 6 words similar to 'polite'
w1 = ["polite"]
model.wv.most_similar(positive=w1, topn=6)

In [None]:
# look up top 6 words similar to 'france'
w1 = ["france"]
model.wv.most_similar(positive=w1, topn=6)

We can also give *most_similar* negative words in addition to positive ones. This allows us to do some arithmetic on the vector representations of certain sets of words!

The famous example **king - man + woman = queen** goes as follows:

In [None]:
# arithmetic: vec("king") - vec("man") + vec("woman") =~ vec("queen")
w1 = ["king", "woman"]
w2 = ["man"]
model.wv.most_similar(positive=w1, negative=w2, topn=1)
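To see why the arithmetic can work, here is a toy sketch with hand-picked 2-dimensional vectors, where the first dimension loosely stands for "royalty" and the second for "gender". It roughly follows what `most_similar` does: add the unit-length positive vectors, subtract the negative ones, and return the closest word that was not part of the query. The vectors and the helper `toy_analogy` are assumptions made for illustration only.

```python
import numpy as np

# made-up vectors: (royalty, gender)
toy = {
    "king":   np.array([1.0,  1.0]),
    "queen":  np.array([1.0, -1.0]),
    "man":    np.array([0.0,  1.0]),
    "woman":  np.array([0.0, -1.0]),
    "prince": np.array([0.8,  1.0]),
}


def unit(v):
    """Scale a vector to length 1."""
    return v / np.linalg.norm(v)


def toy_analogy(positive, negative, vectors):
    """Return the word closest to sum(positive) - sum(negative), excluding query words."""
    target = sum(unit(vectors[w]) for w in positive) - sum(unit(vectors[w]) for w in negative)
    target = unit(target)
    candidates = [w for w in vectors if w not in positive and w not in negative]
    return max(candidates, key=lambda w: float(np.dot(unit(vectors[w]), target)))


print(toy_analogy(["king", "woman"], ["man"], toy))
```

With real embeddings the dimensions are not interpretable like this, and on a small domain-specific corpus such as hotel reviews the analogy may well fail.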

We can get the similarity scores for specific word pairs.

In [None]:
# similarity between two different words
model.wv.similarity(w1="dirty", w2="smelly")

In [None]:
# similarity between two identical words
model.wv.similarity(w1="dirty", w2="dirty")

In [None]:
# similarity between two opposite words (antonyms tend to appear in similar contexts)
model.wv.similarity(w1="dirty", w2="clean")

And we can check which word in a list of words is an intruder:

In [None]:
model.wv.doesnt_match(["cat", "dog", "france"])

In [None]:
model.wv.doesnt_match(["bed", "pillow", "duvet", "shower"])

In [None]:
model.wv.doesnt_match(["car", "bicycle", "plane", "skate"])

In [None]:
model.wv.doesnt_match(["car", "bicycle", "bus", "trolley"])
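One way to think about `doesnt_match`: average the (unit-length) vectors of all the words in the list and pick the word that is least similar to that average. The snippet below is a minimal sketch of this idea with made-up 2-dimensional vectors; `toy_doesnt_match` is a hypothetical helper, not Gensim code.

```python
import numpy as np

# made-up vectors: the two animals point in a similar direction
toy = {
    "cat":    np.array([0.9, 0.1]),
    "dog":    np.array([0.8, 0.2]),
    "france": np.array([0.1, 0.9]),
}


def toy_doesnt_match(words, vectors):
    """Return the word whose vector is farthest (in cosine terms) from the group mean."""
    units = np.array([vectors[w] / np.linalg.norm(vectors[w]) for w in words])
    mean = units.mean(axis=0)
    mean = mean / np.linalg.norm(mean)
    similarities = units @ mean  # cosine similarity of each word to the mean
    return words[int(np.argmin(similarities))]


print(toy_doesnt_match(["cat", "dog", "france"], toy))
```

Note that the intruder itself pulls the average towards it, so with short lists or closely related words the answer can be less clear-cut than in this toy case.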

Make your own experiments! Try to find out:
- Which word is most similar to *lift*?
- What are the 3 words most similar to *crab*?
- How similar are the words *waitress* and *waiter*?
- If you take *portugal*, remove *lisbon*, and add *dublin*, what do you get?

In [None]:
# your code here


#### Saving and loading a Word2Vec model

You can save a trained model so that you are able to load it again in the future, and optionally continue training it.

In [None]:
# save full model (including trainable vectors to resume training)
model.save("reviews_model")

In [None]:
# load full model
model = gensim.models.Word2Vec.load("reviews_model")

#### Saving and loading the word embeddings

If you're sure you won't be training the model any longer, you can save its *KeyedVectors* (the word embeddings).

In [None]:
# save model word vectors
model.wv.save("reviews_wv")

After saving the embeddings, you can load them and use them.

In [None]:
# load model word vectors
wv = gensim.models.KeyedVectors.load("reviews_wv")

print(wv.most_similar(positive="lift", topn=1))

## Visualization

Word embeddings can be visualized by reducing the dimensionality of the word vectors to 2 dimensions using [t-SNE](https://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html).

Given enough training data, we can observe certain patterns in the vector space, including:
- Semantic relations: words like *cat*, *dog*, *cow*, etc. have a tendency to lie close by.
- Syntactic relations: words like *run*, *running* or *cut*, *cutting* lie close together.
- Arithmetic properties such as *King - Man = Queen - Woman*.

In [None]:
from sklearn.manifold import TSNE
import numpy as np
import matplotlib.pyplot as plt
import random


def reduce_dimensions(model, num_dimensions=2, words=None):

    # if no word list is given, assume we want to use the whole data in the model
    if not words:
        words = model.wv.index_to_key

    # positions in vector space, plus the words to label our data again later
    vectors = np.asarray([model.wv[word] for word in words])
    labels = np.asarray(words)

    # reduce using t-SNE; recent scikit-learn versions require the perplexity to be
    # smaller than the number of samples, so cap the default (30) for short word lists
    perplexity = float(min(30, len(vectors) - 1))
    tsne = TSNE(n_components=num_dimensions, perplexity=perplexity, random_state=0)
    vectors = tsne.fit_transform(vectors)

    return vectors, labels


# 2 dimension plotting
def plot_with_matplotlib(x_vals, y_vals, labels, words=None):

    x_vals = np.asarray(x_vals)
    y_vals = np.asarray(y_vals)
    labels = np.asarray(labels)

    # if a word list is given, keep only those points; otherwise plot the whole data
    if words:
        keep = np.isin(labels, words)
        x_vals, y_vals, labels = x_vals[keep], y_vals[keep], labels[keep]

    plt.figure(figsize=(12, 12))
    plt.scatter(x_vals, y_vals)

    # apply labels
    for label, x, y in zip(labels, x_vals, y_vals):
        plt.annotate(label, (x, y))

    plt.show()

In [None]:
words = []
words.extend(["king", "man", "queen", "woman"])

vectors, labels = reduce_dimensions(model, 2, words)
x_vals = [v[0] for v in vectors]
y_vals = [v[1] for v in vectors]
print(x_vals, y_vals, labels)

In [None]:
plot_with_matplotlib(x_vals, y_vals, labels, ["king", "man", "queen", "woman"])

## Portuguese embeddings

A number of embeddings for Portuguese are available at [NILC](http://nilc.icmc.usp.br/embeddings), as well as at the [NLX-group](https://github.com/nlx-group/LX-DSemVectors).

In [None]:
from gensim.models import KeyedVectors

# takes a while to load...
model_pt = KeyedVectors.load_word2vec_format('skip_s100.txt')
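Called like this (with the default `binary=False`), `load_word2vec_format` expects the plain-text word2vec format: a header line with the vocabulary size and the vector dimension, followed by one line per word with the word and its space-separated values. The sketch below writes a tiny made-up file in that layout and parses it by hand, just to show the structure; the file name, the numbers and the helper `read_word2vec_text` are all invented for illustration.

```python
import os
import tempfile

# a made-up file with 2 words and 3 dimensions
toy_text = "2 3\ncão 0.1 0.2 0.3\ngato 0.4 0.5 0.6\n"


def read_word2vec_text(path):
    """Hypothetical parser for the word2vec text format (header line + one word per line)."""
    with open(path, encoding="utf-8") as f:
        vocab_size, dim = map(int, f.readline().split())
        vectors = {}
        for line in f:
            parts = line.rstrip().split(" ")
            vectors[parts[0]] = [float(x) for x in parts[1:]]
    if len(vectors) != vocab_size:
        raise ValueError("header does not match the number of words")
    return vectors, dim


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy_vectors.txt")
    with open(path, "w", encoding="utf-8") as f:
        f.write(toy_text)
    vectors, dim = read_word2vec_text(path)

print(dim, vectors)
```

Parsing a large text file line by line is what makes the initial load slow; saving the vectors in Gensim's own format afterwards avoids repeating that work.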

In [None]:
# save model word vectors
model_pt.save("pt_wv")

In [None]:
# load model word vectors (much faster than the above)
model_pt = gensim.models.KeyedVectors.load("pt_wv")

In [None]:
model_pt.most_similar(positive=["cão"])

In [None]:
model_pt.most_similar(positive=["rei", "mulher"], negative=["homem"])

Make your own experiments!

In [None]:
# your code here
