# Credits



The seminar includes materials from the following sources:

* [Kaggle competition for "Women's E-Commerce Clothing Reviews" TF-IDF kernel](https://www.kaggle.com/shivam1600/simple-information-retrieval-using-tf-idf-and-lsa)
* [Information retrieval: TF-IDF Ranking](https://github.com/williamscott701/Information-Retrieval/blob/master/2.%20TF-IDF%20Ranking%20-%20Cosine%20Similarity,%20Matching%20Score/TF-IDF.ipynb)
* [YDS word vectors seminar](https://github.com/yandexdataschool/nlp_course/tree/2019/week01_embeddings)
* [Lena Voita's Word Embeddings Lecture](https://drive.google.com/file/d/1y2GKIKBzie7l8iycBO6gTKGiTTfJc4Dr/view)
* [Word2Vec Pytorch implementation](https://github.com/blackredscarf/pytorch-SkipGram)
* [Doc2Vec tutorial](https://github.com/RaRe-Technologies/gensim/blob/ca0dcaa1eca8b1764f6456adac5719309e0d8e6d/docs/notebooks/doc2vec-IMDB.ipynb)

# Downloads

Prerequisite download:

In [None]:
# For Count-based models section
!wget https://raw.githubusercontent.com/dardem/word2vec_seminar/master/Womens%20Clothing%20E-Commerce%20Reviews.csv

In [None]:
# For Word2Vec section
!wget http://mattmahoney.net/dc/text8.zip
!unzip text8.zip
!wget https://github.com/blackredscarf/pytorch-SkipGram/raw/master/data_utils.py
!wget https://github.com/blackredscarf/pytorch-SkipGram/raw/master/vector_handle.py
!wget https://github.com/dardem/word2vec_seminar/raw/master/eval.zip
!unzip eval.zip

In [None]:
# For Pretrained models examples

# English
!wget https://www.dropbox.com/s/obaitrix9jyu84r/quora.txt?dl=1 -O ./quora.txt
!wget nlp.stanford.edu/data/wordvecs/glove.6B.zip
!unzip glove.6B.zip

# Russian
!wget http://vectors.nlpl.eu/repository/20/214.zip
!unzip 214.zip -d ru_fasttext_model

# Vector text representation: Motivation



<img src="https://github.com/dardem/word2vec_seminar/raw/master/img/Vector-representation-motivation.png" style="width:100%">

Source: https://drive.google.com/file/d/1y2GKIKBzie7l8iycBO6gTKGiTTfJc4Dr/view

# Count-based models

<img src="https://github.com/dardem/word2vec_seminar/raw/master/img/td-idf-idea.png" style="width:100%">

We are going to solve the following **task**:
*   a dataset of product reviews is given;
*   find the review in this dataset that is most similar to a given query.

## Data preparation

In [None]:
# Data download
# !wget https://raw.githubusercontent.com/dardem/word2vec_seminar/master/Womens%20Clothing%20E-Commerce%20Reviews.csv

In [None]:
import pandas as pd
reviews = pd.read_csv('Womens Clothing E-Commerce Reviews.csv', index_col=0)

In [None]:
reviews.shape

(23486, 10)

In [None]:
reviews.head()

Unnamed: 0,Clothing ID,Age,Title,Review Text,Rating,Recommended IND,Positive Feedback Count,Division Name,Department Name,Class Name
0,767,33,,Absolutely wonderful - silky and sexy and comf...,4,1,0,Initmates,Intimate,Intimates
1,1080,34,,Love this dress! it's sooo pretty. i happene...,5,1,4,General,Dresses,Dresses
2,1077,60,Some major design flaws,I had such high hopes for this dress and reall...,3,0,0,General,Dresses,Dresses
3,1049,50,My favorite buy!,"I love, love, love this jumpsuit. it's fun, fl...",5,1,0,General Petite,Bottoms,Pants
4,847,47,Flattering shirt,This shirt is very flattering to all due to th...,5,1,6,General,Tops,Blouses


## Classic preprocessing

In [None]:
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords
import nltk
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')
nltk.download('stopwords')
nltk.download('omw-1.4')

wordnet_lemmatizer = WordNetLemmatizer()
tokenizer = RegexpTokenizer(r'[a-z]+')
stop_words = set(stopwords.words('english'))

[nltk_data] Downloading package wordnet to
[nltk_data]     /home/moskovskii/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /home/moskovskii/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     /home/moskovskii/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


In [None]:
stopwords.words('russian')[:10]

['и', 'в', 'во', 'не', 'что', 'он', 'на', 'я', 'с', 'со']

In [None]:
stop_words

{'a',
 'about',
 'above',
 'after',
 'again',
 'against',
 'ain',
 'all',
 'am',
 'an',
 'and',
 'any',
 'are',
 'aren',
 "aren't",
 'as',
 'at',
 'be',
 'because',
 'been',
 'before',
 'being',
 'below',
 'between',
 'both',
 'but',
 'by',
 'can',
 'couldn',
 "couldn't",
 'd',
 'did',
 'didn',
 "didn't",
 'do',
 'does',
 'doesn',
 "doesn't",
 'doing',
 'don',
 "don't",
 'down',
 'during',
 'each',
 'few',
 'for',
 'from',
 'further',
 'had',
 'hadn',
 "hadn't",
 'has',
 'hasn',
 "hasn't",
 'have',
 'haven',
 "haven't",
 'having',
 'he',
 'her',
 'here',
 'hers',
 'herself',
 'him',
 'himself',
 'his',
 'how',
 'i',
 'if',
 'in',
 'into',
 'is',
 'isn',
 "isn't",
 'it',
 "it's",
 'its',
 'itself',
 'just',
 'll',
 'm',
 'ma',
 'me',
 'mightn',
 "mightn't",
 'more',
 'most',
 'mustn',
 "mustn't",
 'my',
 'myself',
 'needn',
 "needn't",
 'no',
 'nor',
 'not',
 'now',
 'o',
 'of',
 'off',
 'on',
 'once',
 'only',
 'or',
 'other',
 'our',
 'ours',
 'ourselves',
 'out',
 'over',
 'own',
 'r

In [None]:
def preprocess(document):
    """
    TODO: write your preprocessing function, including following steps:
    - convert the whole text to the lowercase;
    - tokenize the text;
    - remove stopwords;
    - lemmatize the text.
    Return: string, resulted list of tokens joined with the space.
    """

    document = document.lower() # Convert to lowercase
    words = tokenizer.tokenize(document) # Tokenize
    words = [w for w in words if w not in stop_words] # Remove stopwords
    # Lemmatizing
    for pos in [wordnet.NOUN, wordnet.VERB, wordnet.ADJ, wordnet.ADV]:
        words = [wordnet_lemmatizer.lemmatize(x, pos) for x in words]
    return " ".join(words)

In [None]:
# We are reducing the size of our dataset to decrease the running time of code
reviews_pr = reviews.loc[reviews['Clothing ID'] == 1078 , :]
reviews_pr

# Delete missing observations for variables that we will be working with
for x in ["Recommended IND","Review Text"]:
    reviews_pr = reviews_pr[reviews_pr[x].notnull()]

# Keeping only those features that we will explore
reviews_pr = reviews_pr[["Recommended IND","Review Text"]]

# Resetting the index
reviews_pr.index = pd.Series(list(range(reviews_pr.shape[0])))

reviews_pr['Processed Review'] = reviews_pr['Review Text'].apply(preprocess)

reviews_pr.head()

Unnamed: 0,Recommended IND,Review Text,Processed Review
0,0,"I really wanted this to work. alas, it had a s...",really want work ala strange fit strap would s...
1,1,"I love cute summer dresses and this one, espec...",love cute summer dress one especially make lin...
2,1,This is the perfect summer dress. it can be dr...,perfect summer dress dress quality linen fabri...
3,1,"Nice fit and flare style, not clingy at all. i...",nice fit flare style clingy get grey color pet...
4,0,When i first opened this dress and tried it on...,first open dress try think adorable flat hourg...


Let's look at how our texts have changed.

In [None]:
from IPython.display import HTML, display

In [None]:
def texts_comparison(text1, text2):
    hdr = ''
    hdr += '<th style="width:50%">' + 'Original Text' + '</th>'
    hdr += '<th style="width:50%">' + 'Preprocessed Text' + '</th>'
    hdr = '<tr style="background-color:#cbcdd1">' + hdr + '</tr>'

    dt = ''
    dt = dt + '<tr>'
    dt += '<td style="vertical-align:top">' + text1 + '</td>'
    dt += '<td style="vertical-align:top">' + text2 + '</td>'
    dt = dt + '</tr>'

    display(HTML('<table style="width:80%">' + hdr + dt + '</table>'))

In [None]:
texts_comparison(reviews_pr.iloc[13]['Review Text'], reviews_pr.iloc[13]['Processed Review']) #TODO: play with the index, notice the difference

Original Text,Preprocessed Text
"This knit dress is very comfortable. i liked the various colors used in the stripes. my only issue with it, is that the skirt of the dress flares out oddly and is quite short. in my opinion, this dress, with its sturdy fabric and long sleeves, would appear more proportional with a longer skirt. skirt length, along with horizontal stripes just did not work for me. regrettably, i sent it back.",knit dress comfortable like various color use stripe issue skirt dress flare oddly quite short opinion dress sturdy fabric long sleeve would appear proportional long skirt skirt length along horizontal stripe work regrettably send back


## Explicit implementation

A small reminder:

Our goal is to calculate:

$tf$-$idf(t,d,D) = tf(t,d) \times idf(t,D)$

So, the terms from this formula are:

$tf(t,d) = \frac{n_t}{\sum_k n_k}$,

where $n_t$ is the number of occurrences of the word $t$ in the document $d$, and the denominator is the total number of words in the document $d$.

$idf(t,D) = \log \frac{|D|}{|\{d_i \in D| t \in d_i \}|}$,

where $|D|$ is the total number of documents in the collection and $|\{d_i \in D| t \in d_i \}|$ is the number of documents in the collection $D$ in which the word $t$ occurs.
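
To make the formulas concrete, here is a minimal sketch on a hypothetical three-document toy corpus. The documents and the helper names `term_freq`, `inv_doc_freq` and `toy_tf_idf` are made up for illustration. It uses the unsmoothed formulas exactly as written above, so `inv_doc_freq` fails for a token that occurs in no document; the implementation in the following steps adds 1 to both counts to avoid that.

```python
import math
from collections import Counter

# Hypothetical toy corpus: each document is already tokenized.
toy_docs = [
    ["nice", "dress", "nice", "fabric"],
    ["dress", "too", "short"],
    ["lovely", "fabric"],
]

def term_freq(token, doc):
    # n_t divided by the total number of words in this document
    return Counter(doc)[token] / len(doc)

def inv_doc_freq(token, docs):
    # log(|D| / number of documents that contain the token)
    n_containing = sum(1 for doc in docs if token in doc)
    return math.log(len(docs) / n_containing)

def toy_tf_idf(token, doc, docs):
    return term_freq(token, doc) * inv_doc_freq(token, docs)

print(term_freq("nice", toy_docs[0]))                      # 0.5
print(round(inv_doc_freq("nice", toy_docs), 4))            # log(3/1) = 1.0986
print(round(toy_tf_idf("dress", toy_docs[0], toy_docs), 4))  # 0.25 * log(3/2) = 0.1014
```

A word that appears in every document gets $idf = \log 1 = 0$, so it contributes nothing to the score regardless of how often it occurs.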

1. Calculate DF for all words

In [None]:
# We want to store here a structure
# where the keys are the tokens
# and the values are the numbers of documents
# in which each token occurs
DF = {}

for index, row in reviews_pr.iterrows():

    tokens = row['Processed Review'].split(' ')
    for token in tokens:
        # Create the set the first time a token is seen, then record the document
        DF.setdefault(token, set()).add(index)

for token in DF:
    DF[token] = len(DF[token])

In [None]:
len(DF.keys())

2529

In [None]:
total_vocab = list(DF.keys())  # just saving the whole set of tokens in the collection

In [None]:
def doc_freq(token):
    """
    Returns the number of documents where the given token occurs.
    """

    # Tokens that never occur in the collection get a document frequency of 0
    return DF.get(token, 0)

2. Calculate TF-IDF

In [None]:
from collections import Counter

In [None]:
text = "Some sample text for the Counter application illustration using the text"
tokens = text.split(' ')
Counter(tokens)

Counter({'text': 2,
         'the': 2,
         'Some': 1,
         'sample': 1,
         'for': 1,
         'Counter': 1,
         'application': 1,
         'illustration': 1,
         'using': 1})

In [None]:
import numpy as np
tf_idf = {} # We want to store here a structure where the keys are tuples (doc, token) and the values are tf-idf scores.
# Hint: you can use Counter to count the token occurrences.

for index, row in reviews_pr.iterrows():

    tokens = row['Processed Review'].split(' ')

    counter = Counter(tokens)
    tokens_count = len(tokens) # the total number of words in this document

    for token in np.unique(tokens):
        tf = counter[token] / tokens_count
        df = doc_freq(token)
        idf = np.log((reviews_pr.shape[0]+1) / (df+1)) # we add +1 to avoid possible division by zero

        tf_idf[index, token] = tf * idf

In [None]:
len(tf_idf.keys()) # check the obtained length of the structure

25688

In [None]:
tf_idf[(17, 'dress')]

0.010038179740060304

In [None]:
D = np.zeros((reviews_pr.shape[0], len(total_vocab))) # Matrix documents*words
# Here we want to transform our dict into the matrix to compare the results.

for key in tf_idf:
    try:
        token_idx = total_vocab.index(key[1])
        D[key[0]][token_idx] = tf_idf[key]
    except:
        pass
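
`total_vocab.index(...)` scans the list on every call, so each matrix entry costs time proportional to the vocabulary size. One possible alternative is to build a token-to-column dictionary once. The sketch below uses hypothetical toy data and plain nested lists instead of NumPy; all names in it are made up for illustration.

```python
# Hypothetical toy data mirroring the structures above.
toy_vocab = ["dress", "fabric", "nice"]
toy_tf_idf = {(0, "nice"): 0.55, (0, "dress"): 0.10, (1, "fabric"): 0.20}
n_docs = 2

# Build the lookup once: token -> column index.
col_of = {token: i for i, token in enumerate(toy_vocab)}

# Dense documents x words matrix as nested lists (np.zeros would work the same way).
toy_D = [[0.0] * len(toy_vocab) for _ in range(n_docs)]
for (doc_idx, token), score in toy_tf_idf.items():
    toy_D[doc_idx][col_of[token]] = score

print(toy_D)  # [[0.1, 0.0, 0.55], [0.0, 0.2, 0.0]]
```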

In [None]:
tf_idf

{(0, 'ala'): 0.250426742573999,
 (0, 'beautiful'): 0.09218855669510387,
 (0, 'breast'): 0.250426742573999,
 (0, 'fabric'): 0.06508413148737535,
 (0, 'fell'): 0.1689831303363601,
 (0, 'fit'): 0.0788549252288336,
 (0, 'love'): 0.046143168697598526,
 (0, 'minute'): 0.2189200525485469,
 (0, 'pocket'): 0.15315100786962302,
 (0, 'really'): 0.08753951809872124,
 (0, 'shoulder'): 0.1251158168798334,
 (0, 'sit'): 0.18465769789507505,
 (0, 'stand'): 0.19348297127875497,
 (0, 'stay'): 0.20444488295224986,
 (0, 'strange'): 0.1772704738269944,
 (0, 'strap'): 0.16712758513089396,
 (0, 'want'): 0.10741287293611873,
 (0, 'weird'): 0.19348297127875497,
 (0, 'work'): 0.17829170587428875,
 (0, 'would'): 0.07104381384011357,
 (1, 'although'): 0.11289267767375506,
 (1, 'book'): 0.17721530049108347,
 (1, 'bust'): 0.07864422775303813,
 (1, 'c'): 0.09543813246452724,
 (1, 'curvy'): 0.09162294981811232,
 (1, 'cute'): 0.05929717520408088,
 (1, 'design'): 0.07646214264490744,
 (1, 'difficult'): 0.131231360135823

In [None]:
D

array([[0.08753952, 0.10741287, 0.17829171, ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.09452333, 0.        , ..., 0.        , 0.        ,
        0.        ],
       ...,
       [0.        , 0.        , 0.        , ..., 0.29535883, 0.29535883,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.15128135],
       [0.        , 0.        , 0.25580984, ..., 0.        , 0.        ,
        0.        ]])

In [None]:
D.shape

(987, 2529)

In [None]:
D.shape[0] * D.shape[1]

2496123

Let's check memory usage:

In [None]:
from sys import getsizeof # Return the size of an object in bytes.

In [None]:
getsizeof(tf_idf)

1310808

In [None]:
getsizeof(D)

19969112
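
Note that `sys.getsizeof` is shallow: for a dict it counts the hash table itself but not the key tuples, strings and floats it refers to, whereas for a NumPy array that owns its data it includes the data buffer. The two numbers above are therefore not directly comparable. Below is a rough sketch of a deeper estimate for a dict; the helper `deep_dict_size` is hypothetical and ignores objects shared between entries, so it only approximates the true footprint.

```python
from sys import getsizeof

def deep_dict_size(d):
    """Shallow size of the dict plus the sizes of its keys, key elements and values."""
    total = getsizeof(d)
    for key, value in d.items():
        total += getsizeof(key) + getsizeof(value)
        if isinstance(key, tuple):
            total += sum(getsizeof(part) for part in key)
    return total

# Hypothetical dict shaped like tf_idf: (doc_idx, token) -> score
toy = {(i, "token%d" % i): float(i) for i in range(1000)}
print(getsizeof(toy), deep_dict_size(toy))  # the second number is noticeably larger
```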

3. TF-IDF Matching Score Ranking

In [None]:
def matching_similarity_search(collection, k, query):
    """
    Search for the top-k texts in the document collection most similar to the query, based on the matching score.
    collection: DataFrame with the reviews (unused here: the scores come from the global tf_idf dict);
    k: int, length of the result;
    query: str, your text.

    Return: list of the texts most similar to the query;
            each element of the list is represented as a tuple: (text_idx, sim_score)
    """

    # 1. Preprocess text and split into tokens.
    preprocessed_query = preprocess(query)
    tokens = preprocessed_query.split(' ')

    query_weights = {}

    # 2. The score of each text in the dataset with respect to the query is
    #    the sum of the tf-idf values of the words present in both the query and the text.

    for key in tf_idf:
        if key[1] in tokens:
            # key is (doc_idx, token): accumulate the score per document
            query_weights[key[0]] = query_weights.get(key[0], 0) + tf_idf[key]

    query_weights = sorted(query_weights.items(), key=lambda x: x[1], reverse=True)

    return query_weights[:k]

In [None]:
query = reviews_pr.iloc[0]['Review Text']
print("Query: ", query, "\n")

result = matching_similarity_search(reviews_pr, 10, query)

for (text_idx, sim_score) in result:
    print("Index: ", text_idx, " Similarity: ", sim_score, " Text: ", reviews_pr.iloc[text_idx]["Review Text"])

Query:  I really wanted this to work. alas, it had a strange fit for me. the straps would not stay up, and it had a weird fit under the breast. it worked standing up, but the minute i sat down it fell off my shoulders. the fabric was beautiful! and i loved that it had pockets. 

Index:  0  Similarity:  3.014048768003238  Text:  I really wanted this to work. alas, it had a strange fit for me. the straps would not stay up, and it had a weird fit under the breast. it worked standing up, but the minute i sat down it fell off my shoulders. the fabric was beautiful! and i loved that it had pockets.
Index:  860  Similarity:  0.8866572257470782  Text:  The print on this is gorgeous, and it's incredibly comfortable. ultimately, i didn't buy it because the length was weird for me (i'm 5'7"); it just looked strange, like it was cutting me off at a weird point.
Index:  177  Similarity:  0.7550558721114813  Text:  Too tight in strange ways. beautiful dress i was so excited to receive it, however it

4. TF-IDF Cosine Similarity Ranking

In [None]:
import math

In [None]:
def cosine_sim(a, b):
    cos_sim = np.dot(a, b)/(np.linalg.norm(a)*np.linalg.norm(b))
    return cos_sim
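
One edge case worth keeping in mind: if either vector is all zeros (for example, a review whose tokens were all removed during preprocessing), the product of the norms is zero and the NumPy division yields `nan`. Below is a minimal pure-Python sketch of a guarded variant. The name `safe_cosine_sim` and the choice of returning 0.0 are assumptions, not part of the seminar code.

```python
import math

def safe_cosine_sim(a, b):
    """Cosine similarity of two equal-length sequences; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # assumption: treat an empty vector as dissimilar to everything
    return dot / (norm_a * norm_b)

print(safe_cosine_sim([1.0, 0.0], [1.0, 0.0]))  # 1.0
print(safe_cosine_sim([1.0, 0.0], [0.0, 2.0]))  # 0.0
print(safe_cosine_sim([0.0, 0.0], [1.0, 2.0]))  # 0.0
```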

In [None]:
def gen_vector(tokens, total_vocab, collection_len):
    """
    Transform the tokens into the vector representation.
    The idea is based on tf-idf as well: we fill the vector with the tf-idfs of the tokens.
    """

    vector = np.zeros((len(total_vocab)))

    counter = Counter(tokens)
    tokens_count = len(tokens)

    for token in np.unique(tokens):

        tf = counter[token]/tokens_count
        df = doc_freq(token)
        idf = math.log((collection_len+1)/(df+1))

        try:
            ind = total_vocab.index(token)
            vector[ind] = tf*idf
        except:
            pass

    return vector

In [None]:
def cosine_similarity_search(df, k, query):
    """
    Search for the top-k texts in the document collection most similar to the query, based on cosine similarity.
    df: DataFrame with the reviews (its number of rows is used as the collection size);
    k: int, length of the result;
    query: str, your text.

    Return: list of the texts most similar to the query;
            each element of the list is represented as a tuple: (text_idx, sim_score)
    """

    # 1. Preprocess text and split into tokens.
    preprocessed_query = preprocess(query)
    tokens = preprocessed_query.split(' ')

    # 1.1 Transform obtained tokens into vector representation.
    query_vector = gen_vector(tokens, total_vocab, df.shape[0])

    # 2. The score of each text in the dataset with respect to the query is
    #    the cosine similarity between the tf-idf vectors of the query and the text.

    # Hint: remember structure D that we calculated before?

    d_cosines = []

    for doc_idx, d in enumerate(D):
        d_cosines.append((doc_idx, cosine_sim(query_vector, d)))

    d_cosines.sort(key=lambda tup: tup[1], reverse=True)
    result = d_cosines[:k]

    return result

In [None]:
query = reviews_pr.iloc[0]['Review Text']
print("Query: ", query, "\n")

result = cosine_similarity_search(reviews_pr, 10, query)

for (text_idx, sim_score) in result:
    print("Index: ", text_idx, " Similarity: ", round(sim_score, 6), " Text: ", reviews_pr.iloc[text_idx]["Review Text"], end='\n')

Query:  I really wanted this to work. alas, it had a strange fit for me. the straps would not stay up, and it had a weird fit under the breast. it worked standing up, but the minute i sat down it fell off my shoulders. the fabric was beautiful! and i loved that it had pockets. 

Index:  0  Similarity:  1.0  Text:  I really wanted this to work. alas, it had a strange fit for me. the straps would not stay up, and it had a weird fit under the breast. it worked standing up, but the minute i sat down it fell off my shoulders. the fabric was beautiful! and i loved that it had pockets.
Index:  860  Similarity:  0.233098  Text:  The print on this is gorgeous, and it's incredibly comfortable. ultimately, i didn't buy it because the length was weird for me (i'm 5'7"); it just looked strange, like it was cutting me off at a weird point.
Index:  824  Similarity:  0.177983  Text:  This dress is beautiful. very vibrant and rich looking...however, after wearing it to work the back of the dress was co

Compare the results! What about the running time and the memory usage?

## TF-IDF with LSA

Now that you have spent time coding your own tf-idf implementation, we will show you that you can use the ready-made ```TfidfVectorizer``` :)



In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sys import getsizeof

1. Create TF_IDF matrix

In [None]:
%%time

vectorizer = TfidfVectorizer()
TF_IDF_matrix = vectorizer.fit_transform(reviews_pr['Processed Review'])
TF_IDF_matrix = TF_IDF_matrix.T

print('Vocabulary Size : ', len(vectorizer.get_feature_names_out()))
print('Shape of Matrix : ', TF_IDF_matrix.shape)
print('Memory usage: ', getsizeof(TF_IDF_matrix))

Vocabulary Size :  2513
Shape of Matrix :  (2513, 987)
Memory usage:  48
CPU times: user 44.6 ms, sys: 839 µs, total: 45.5 ms
Wall time: 44.6 ms


Why so little memory?

In [None]:
TF_IDF_matrix

<2513x987 sparse matrix of type '<class 'numpy.float64'>'
	with 25475 stored elements in Compressed Sparse Column format>
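
The 48 bytes reported by `getsizeof` above evidently cover only the thin Python wrapper object, not the arrays that hold the matrix. Here is a back-of-the-envelope sketch of the actual storage, using the numbers printed above and assuming 8-byte floats and 4-byte integer indices (the index width is an assumption; SciPy may choose a wider type).

```python
n_terms, n_docs = 2513, 987      # shape of TF_IDF_matrix
n_stored = 25475                 # stored (non-zero) elements reported above

FLOAT_BYTES = 8                  # numpy.float64
INDEX_BYTES = 4                  # assumed 32-bit indices

# Dense storage: every cell holds a float.
dense_bytes = n_terms * n_docs * FLOAT_BYTES

# Compressed Sparse Column: values + row indices + one pointer per column (plus one).
sparse_bytes = (n_stored * FLOAT_BYTES
                + n_stored * INDEX_BYTES
                + (n_docs + 1) * INDEX_BYTES)

print(dense_bytes)    # 19842648
print(sparse_bytes)   # 309652
print(round(100 * n_stored / (n_terms * n_docs), 2))  # non-zero cells in percent: 1.03
```

On a real CSR or CSC matrix `m`, the same quantity can be read off as `m.data.nbytes + m.indices.nbytes + m.indptr.nbytes`.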

2. Get SVD decomposition

<img src="https://github.com/dardem/word2vec_seminar/raw/master/img/svd.png" style="width:100%">

where $C$ is the *Terms* x *Documents* matrix, $U$ is a semi-unitary *Terms* x *Dimension* matrix, $\sigma$ is the diagonal matrix of singular values, and $V$ is a semi-unitary *Documents* x *Dimension* matrix.
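
Keeping only the $K$ largest singular values gives the rank-$K$ approximation computed in the code of this step. In the notation above, with the subscript $K$ (introduced here for illustration) marking truncation to the first $K$ components:

$C \approx C_K = U_K \, \sigma_K \, V_K^T$

Each row of $U_K \sigma_K$ is then a $K$-dimensional representation of a term, and each row of $V_K \sigma_K$ is a $K$-dimensional representation of a document. Because both live in the same $K$-dimensional space, a query can be represented by averaging the rows of its terms and compared with the document rows directly.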

In [None]:
K = 10 # number of most important components

In [None]:
%%time

# Applying SVD
U, s, VT = np.linalg.svd(TF_IDF_matrix.toarray()) # .toarray() is used to convert sparse matrix to normal matrix

TF_IDF_matrix_reduced = np.dot(U[:,:K], np.dot(np.diag(s[:K]), VT[:K, :]))

# Getting document and term representation
terms_rep = np.dot(U[:,:K], np.diag(s[:K])) # TODO: M x K matrix, where M = vocabulary size and K = number of components
docs_rep = np.dot(np.diag(s[:K]), VT[:K, :]).T # TODO: N x K matrix, where N = number of documents

CPU times: user 8 s, sys: 13.2 s, total: 21.2 s
Wall time: 914 ms


In [None]:
getsizeof(U)

50521480

In [None]:
getsizeof(s)

8008

In [None]:
getsizeof(VT)

7793480

In [None]:
getsizeof(TF_IDF_matrix_reduced)

19842776

3. Find most similar to the query

In [None]:
from scipy.spatial.distance import cosine

In [None]:
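def lsa_query_rep(query):
    # Keep only the query tokens known to the vectorizer: unseen tokens have no row in terms_rep
    query_rep = [vectorizer.vocabulary_[x] for x in preprocess(query).split() if x in vectorizer.vocabulary_]
    query_rep = np.mean(terms_rep[query_rep],axis=0)
    return query_rep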

In [None]:
query = reviews_pr.iloc[0]['Review Text']
print("Query: ", query, "\n")

query_rep = lsa_query_rep(query)

query_doc_cos_dist = [cosine(query_rep, doc_rep) for doc_rep in docs_rep]
query_doc_sort_index = np.argsort(np.array(query_doc_cos_dist))

print_count = 0
for rank, sort_index in enumerate(query_doc_sort_index):
    print ('Rank : ', rank, ' Cosine sim: ', round(1 - query_doc_cos_dist[sort_index], 6),' Review : ', reviews_pr['Review Text'][sort_index])
    if print_count == 10 :
        break
    else:
        print_count += 1

Query:  I really wanted this to work. alas, it had a strange fit for me. the straps would not stay up, and it had a weird fit under the breast. it worked standing up, but the minute i sat down it fell off my shoulders. the fabric was beautiful! and i loved that it had pockets. 

Rank :  0  Cosine sim:  0.95473  Review :  I really wanted to love this dress, but it just didn't fit in anyway. i ordered an xs petite, and it was very large. the lacing in the back is completely useless, and that's what i liked about the dress. i was thinking it would be unlaced more at the top (because of chest) and then lace more tightly near the waist. it was completely closed off and the dress was still very loose and unflattering. beautiful fabric, great color combo, bad fit.
Rank :  1  Cosine sim:  0.94549  Review :  I wanted to love this dress as it seemed perfect for spring and summer, but it just didn't work for me. it made me feel a bit frumpy, honestly. i purchased the neutral color. maybe the "red

Compare the results!

# Word2Vec

`Word2Vec` is a more recent model that embeds words in a lower-dimensional vector space using a shallow neural network. The result is a set of word vectors in which words that occur in similar contexts lie close together and words that occur in different contexts lie far apart. For example, `strong` and `powerful` would be close together, while `strong` and `Paris` would be relatively far apart.

## Main idea

<img src="https://github.com/dardem/word2vec_seminar/raw/master/img/w2v-example1.png" style="width:100%">
<img src="https://github.com/dardem/word2vec_seminar/raw/master/img/w2v-example2.png" style="width:100%">

Source: https://drive.google.com/file/d/1y2GKIKBzie7l8iycBO6gTKGiTTfJc4Dr/view

## Implementation

### Some additional downloading

Dataset download:

(for more details about the data please see: http://mattmahoney.net/dc/textdata.html, section Relationship of Wikipedia Text to Clean Text)

In [None]:
# !wget http://mattmahoney.net/dc/text8.zip
# !unzip text8.zip

--2022-11-21 17:32:23--  http://mattmahoney.net/dc/text8.zip
Resolving mattmahoney.net (mattmahoney.net)... 67.195.197.24
Connecting to mattmahoney.net (mattmahoney.net)|67.195.197.24|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 31344016 (30M) [application/zip]
Saving to: ‘text8.zip’


2022-11-21 17:33:07 (710 KB/s) - ‘text8.zip’ saved [31344016/31344016]

Archive:  text8.zip
  inflating: text8                   


Supplementary functions for dataset preprocessing download:

In [None]:
# !wget https://github.com/blackredscarf/pytorch-SkipGram/raw/master/data_utils.py
# !wget https://github.com/blackredscarf/pytorch-SkipGram/raw/master/vector_handle.py

--2022-11-21 17:08:53--  https://github.com/blackredscarf/pytorch-SkipGram/raw/master/data_utils.py
Resolving github.com (github.com)... 140.82.121.4
Connecting to github.com (github.com)|140.82.121.4|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/blackredscarf/pytorch-SkipGram/master/data_utils.py [following]
--2022-11-21 17:08:53--  https://raw.githubusercontent.com/blackredscarf/pytorch-SkipGram/master/data_utils.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 5323 (5.2K) [text/plain]
Saving to: ‘data_utils.py’


2022-11-21 17:08:53 (72.1 MB/s) - ‘data_utils.py’ saved [5323/5323]

--2022-11-21 17:08:54--  https://github.com/blackredscarf/pytorch-SkipGram/raw/master/vector_handle.py
Resol

Supplementary functions for model evaluation download:

In [None]:
# !wget https://github.com/dardem/word2vec_seminar/raw/master/eval.zip
# !unzip eval.zip

--2022-11-21 17:08:54--  https://github.com/dardem/word2vec_seminar/raw/master/eval.zip
Resolving github.com (github.com)... 140.82.121.4
Connecting to github.com (github.com)|140.82.121.4|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/dardem/word2vec_seminar/master/eval.zip [following]
--2022-11-21 17:08:54--  https://raw.githubusercontent.com/dardem/word2vec_seminar/master/eval.zip
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.110.133, 185.199.109.133, 185.199.111.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.110.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 7054 (6.9K) [application/zip]
Saving to: ‘eval.zip’


2022-11-21 17:08:55 (75.3 MB/s) - ‘eval.zip’ saved [7054/7054]

Archive:  eval.zip
   creating: eval/
  inflating: eval/wordsim.py         
  inflating: eval/read_write.py      
  inflating: eval/ranking.py       

Check the installed files:

In [None]:
!ls

 214.zip		    glove.6B.zip
 214.zip.1		    out
 data_utils.py		    __pycache__
 embeddings		    quora.txt
 eval			    ru_fasttext_model
 eval.zip		    text8
 gensim_glove_vectors.txt   text8.zip
 glove.6B.100d.txt	    vector_handle.py
 glove.6B.200d.txt	   'Womens Clothing E-Commerce Reviews.csv'
 glove.6B.300d.txt	   'Womens Clothing E-Commerce Reviews.csv.1'
 glove.6B.50d.txt	    Word_embeddings.ipynb


### SkipGram model

<img src="https://raw.githubusercontent.com/dardem/word2vec_seminar/master/img/word2vec_diagram-1.jpg" style="width:100%">

<img src="https://raw.githubusercontent.com/dardem/word2vec_seminar/master/img/Skip-gram-architecture-2.jpg" style="width:30%">

<img src="https://raw.githubusercontent.com/dardem/word2vec_seminar/master/img/w2v-loss.png" style="width:100%">

Source: https://drive.google.com/file/d/1y2GKIKBzie7l8iycBO6gTKGiTTfJc4Dr/view

<img src="https://raw.githubusercontent.com/dardem/word2vec_seminar/master/img/w2v-objective.png" style="width:100%">

In [None]:
import torch
from torch import nn


class SkipGramNeg(nn.Module):
    def __init__(self, vocab_size, emb_dim):
        super(SkipGramNeg, self).__init__()
        self.input_emb = nn.Embedding(vocab_size, emb_dim)
        self.output_emb = nn.Embedding(vocab_size, emb_dim)
        self.log_sigmoid = nn.LogSigmoid()

        initrange = (2.0 / (vocab_size + emb_dim)) ** 0.5  # Xavier init
        self.input_emb.weight.data.uniform_(-initrange, initrange)
        self.output_emb.weight.data.uniform_(-0, 0)  # all zeros: output embeddings start at the origin


    def forward(self, target_input, context, neg):
        """
        :param target_input: [batch_size]
        :param context: [batch_size]
        :param neg: [batch_size, neg_size]
        :return: scalar loss, averaged over the batch
        """
        # u,v: [batch_size, emb_dim]
        v = self.input_emb(target_input)
        u = self.output_emb(context)
        # positive_val: [batch_size]
        positive_val = self.log_sigmoid(torch.sum(u * v, dim=1)).squeeze()

        # u_hat: [batch_size, neg_size, emb_dim]
        u_hat = self.output_emb(neg)
        # [batch_size, neg_size, emb_dim] x [batch_size, emb_dim, 1] = [batch_size, neg_size, 1]
        # neg_vals: [batch_size, neg_size]
        neg_vals = torch.bmm(u_hat, v.unsqueeze(2)).squeeze(2) # batch matrix-matrix product of matrices
        # neg_val: [batch_size]
        neg_val = self.log_sigmoid(-torch.sum(neg_vals, dim=1)).squeeze()

        loss = positive_val + neg_val
        return -loss.mean()

    def predict(self, inputs):
        return self.input_emb(inputs)
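
For comparison with the `forward` method above, here is a pure-Python sketch of the negative-sampling loss for a single (target, context) pair. All function names are hypothetical. `per_negative_loss` follows the objective as it is usually stated, with one log-sigmoid term per negative sample, while `summed_score_loss` mirrors the code above, which sums the negative scores before applying a single log-sigmoid. The two coincide for one negative sample and generally differ otherwise.

```python
import math

def log_sigmoid(x):
    # Numerically stable log(1 / (1 + exp(-x)))
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def per_negative_loss(v, u_pos, u_negs):
    """-(log s(u_pos . v) + sum over negatives of log s(-u_neg . v))"""
    pos = log_sigmoid(dot(u_pos, v))
    neg = sum(log_sigmoid(-dot(u_neg, v)) for u_neg in u_negs)
    return -(pos + neg)

def summed_score_loss(v, u_pos, u_negs):
    """-(log s(u_pos . v) + log s(-(sum over negatives of u_neg . v)))"""
    pos = log_sigmoid(dot(u_pos, v))
    neg = log_sigmoid(-sum(dot(u_neg, v) for u_neg in u_negs))
    return -(pos + neg)

v = [1.0, 2.0]
u_pos = [0.0, 0.0]          # zero output embeddings, as right after initialisation
u_negs = [[0.0, 0.0]] * 3   # three negatives

print(round(per_negative_loss(v, u_pos, u_negs), 4))  # 4 * log(2) = 2.7726
print(round(summed_score_loss(v, u_pos, u_negs), 4))  # 2 * log(2) = 1.3863
```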

In [None]:
import os
import random
import torch
from torch.optim import SGD
from data_utils import read_own_data, build_dataset, DataPipeline
from vector_handle import nearest


class Word2Vec:
    def __init__(self, data_path, vocabulary_size, embedding_size, learning_rate=1.0):
        self.corpus = read_own_data(data_path)

        self.data, self.word_count, self.word2index, self.index2word = build_dataset(
            self.corpus, vocabulary_size
        )
        self.vocabs = list(set(self.data))

        self.model: SkipGramNeg = SkipGramNeg(vocabulary_size, embedding_size).cuda()
        self.model_optim = SGD(self.model.parameters(), lr=learning_rate)

    def train(
        self,
        train_steps,
        skip_window=1,
        num_skips=2,
        num_neg=20,
        batch_size=128,
        data_offest=0,
        vali_size=3,
        output_dir="out",
    ):
        os.makedirs(output_dir, exist_ok=True)  # also creates missing parent folders
        self.outputdir = output_dir

        avg_loss = 0
        pipeline = DataPipeline(self.data, self.vocabs, self.word_count, data_offest)
        vali_examples = random.sample(self.vocabs, vali_size)

        for step in range(train_steps):
            batch_inputs, batch_labels = pipeline.generate_batch(
                batch_size, num_skips, skip_window
            )
            batch_neg = pipeline.get_neg_data(batch_size, num_neg, batch_inputs)

            batch_inputs = torch.tensor(batch_inputs, dtype=torch.long).cuda()
            batch_labels = torch.tensor(batch_labels, dtype=torch.long).cuda()
            batch_neg = torch.tensor(batch_neg, dtype=torch.long).cuda()

            loss = self.model(batch_inputs, batch_labels, batch_neg)
            self.model_optim.zero_grad()
            loss.backward()
            self.model_optim.step()

            avg_loss += loss.item()

            if step % 2000 == 0 and step > 0:
                avg_loss /= 2000
                print("Average loss at step ", step, ": ", avg_loss)
                avg_loss = 0
            if step % 10000 == 0 and vali_size > 0:
                nearest(self.model, vali_examples, vali_size, self.index2word, top_k=8)
            # checkpoint
            if step % 100000 == 0 and step > 0:
                torch.save(
                    self.model.state_dict(), self.outputdir + "/model_step%d.pt" % step
                )
        # save model at last
        torch.save(
            self.model.state_dict(), self.outputdir + "/model_step%d.pt" % train_steps
        )

    def save_model(self, out_path):
        torch.save(self.model.state_dict(), out_path + "/model.pt")

    def get_list_vector(self):
        sd = self.model.state_dict()
        return sd["input_emb.weight"].tolist()

    def save_vector_txt(self, path_dir):
        embeddings = self.get_list_vector()
        # one line per word: the word followed by its space-separated vector
        with open(path_dir + "/vector.txt", "w") as fo:
            for idx, embed in enumerate(embeddings):
                word = self.index2word[idx]
                line_str = " ".join(str(i) for i in embed)
                fo.write(word + " " + line_str + "\n")

    def load_model(self, model_path):
        self.model.load_state_dict(torch.load(model_path))

    def vector(self, index):
        # return the embedding instead of silently discarding it
        return self.model.predict(index)

    def most_similar(self, word, top_k=8):
        index = self.word2index[word]
        index = torch.tensor(index, dtype=torch.long).cuda().unsqueeze(0)
        emb = self.model.predict(index)
        sim = torch.mm(emb, self.model.input_emb.weight.transpose(0, 1))
        nearest = (-sim[0]).sort()[1][1 : top_k + 1]
        top_list = []
        for k in range(top_k):
            close_word = self.index2word[nearest[k].item()]
            top_list.append(close_word)
        return top_list


### Run the model

So, let's finally build your own Word2Vec model!

In [None]:
# init dataset and model
word2vec = Word2Vec(data_path='text8',
                    vocabulary_size=50000,
                    embedding_size=300)

reading data...
corpus size 17005207


In [None]:
# additional check for output folder
if not os.path.exists('out'):
  os.mkdir('out')

In [None]:
%%time

# train model
word2vec.train(train_steps=10000, #100000, 200000,
               skip_window=5,
               num_skips=2,
               num_neg=20,
               output_dir='out/run-1')


# save vector txt file
word2vec.save_vector_txt(path_dir='out/run-1')

vocabulary size 50000
unigram_table size: 2870
Nearest to contributing: ade, borne, gog, ounces, balder, neurotransmitter, zebra, coupling,
Nearest to polka: tulsa, undeniably, dg, primo, scattering, decoder, isothermal, schama,
Nearest to gao: archipelago, rejuvenation, riders, fostered, dickinson, menstrual, ganga, ideologue,
Average loss at step  2000 :  0.9115511239767075
Average loss at step  4000 :  0.8026582121625543
Average loss at step  6000 :  0.7826968179345131
Average loss at step  8000 :  0.767112163439393
Average loss at step  10000 :  0.7428255868703126
Nearest to contributing: eight, five, one, seven, archie, agave, zero, four,
Nearest to polka: zero, two, five, polka, four, to, and, gland,
Nearest to gao: a, agave, as, UNK, is, or, an, that,
Average loss at step  12000 :  0.7416164618730545
Average loss at step  14000 :  0.7296159547418356
Average loss at step  16000 :  0.6954028851240873
Average loss at step  18000 :  0.6853225033283233
Average loss at step  20000 :  

In [None]:
 !ls ./out/run-1

model_step100000.pt  model_step10000.pt  vector.txt


In [None]:
# Example of extracting word's representation
vector = word2vec.get_list_vector()
print(vector[123])
print(vector[word2vec.word2index['hello']])

[0.02993914671242237, -0.03296276554465294, 0.05721461772918701, 0.0032790484838187695, -0.045988406985998154, -0.02553842030465603, -0.005984775256365538, 0.02494504302740097, -0.048611342906951904, -0.01586763933300972, 0.0045379698276519775, 0.009204240515828133, 0.024623500183224678, -0.017930280417203903, 0.011793102137744427, 0.0014372544828802347, 0.022291338071227074, 0.0182212982326746, -0.03774702548980713, -0.037110645323991776, 0.010638721287250519, -0.03145379200577736, -0.024130264297127724, 0.054582662880420685, -0.003392234444618225, 0.03257051482796669, 0.05203283205628395, -0.035986557602882385, -0.022964324802160263, -0.04559066891670227, -0.10107764601707458, -0.022228572517633438, -0.06592140346765518, -0.08699414134025574, -0.00477756280452013, 0.08182952553033829, -0.0010690552880987525, -0.0001771818206179887, -0.006366613786667585, 0.02718970738351345, -0.010745959356427193, 0.009019610472023487, 0.08704546838998795, -0.07676339894533157, -0.02311559207737446, 

In [None]:
# get top k similar word
sim_list = word2vec.most_similar('one', top_k=8)
print(sim_list)

['nine', 'eight', 'six', 'seven', 'two', 'zero', 'four', 'three']


In [None]:
# try also for random validation samples and check if the model became better
# sim_list = word2vec.most_similar(<some random sample>, top_k=8)
# print(sim_list)

In [None]:
# load a previously trained model
# word2vec.load_model('out/run-1/model_step200000.pt')

In [None]:
# some magic for the famous trick
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
mystery_word = (np.array(vector[word2vec.word2index['king']]) - np.array(vector[word2vec.word2index['man']])).reshape(1, -1)

# try with other random words, e.g. kitty :)
cosine_similarity(mystery_word, np.array(vector[word2vec.word2index['queen']]).reshape(1, -1))

array([[0.03364092]])
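
The "famous trick" is usually stated as king - man + woman ≈ queen. Here is a minimal sketch of that arithmetic on hand-made 3-dimensional toy vectors. The numbers and the `toy_vectors` dictionary are illustrative assumptions, not output of the model trained above.

```python
import numpy as np

# Hypothetical toy embeddings: dimensions loosely mean (royalty, male, female).
toy_vectors = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.9, 0.1, 0.8]),
    "man":   np.array([0.1, 0.9, 0.1]),
    "woman": np.array([0.1, 0.1, 0.9]),
    "apple": np.array([0.0, 0.2, 0.2]),
}

def cos_sim(u, v):
    """Cosine similarity between two 1-D vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# king - man + woman should land near queen
target = toy_vectors["king"] - toy_vectors["man"] + toy_vectors["woman"]

# rank the remaining words, excluding the three query words
query_words = {"king", "man", "woman"}
candidates = {w: cos_sim(target, v) for w, v in toy_vectors.items() if w not in query_words}
best_word = max(candidates, key=candidates.get)
print(best_word, candidates)
```

Leaving the query words out of the ranking matters: without that step the nearest neighbour of the result is often one of the input words themselves.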

## Pretrained models

__Word vectors:__ as the saying goes, there's more than one way to train word embeddings. There's Word2Vec and GloVe with different objective functions. Then there's fasttext that uses character-level models to train word embeddings.

The choice is huge, so let's start someplace small: __gensim__ is another NLP library that features many vector-based models, including word2vec.

### Predefined architecture

Download the training data:

In [None]:
# download the data:
# !wget https://www.dropbox.com/s/obaitrix9jyu84r/quora.txt?dl=1 -O ./quora.txt
# alternative download link: https://yadi.sk/i/BPQrUu1NaTduEw

--2022-11-22 10:22:10--  https://www.dropbox.com/s/obaitrix9jyu84r/quora.txt?dl=1
Resolving www.dropbox.com (www.dropbox.com)... 162.125.5.18, 2620:100:601d:18::a27d:512
Connecting to www.dropbox.com (www.dropbox.com)|162.125.5.18|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: /s/dl/obaitrix9jyu84r/quora.txt [following]
--2022-11-22 10:22:10--  https://www.dropbox.com/s/dl/obaitrix9jyu84r/quora.txt
Reusing existing connection to www.dropbox.com:443.
HTTP request sent, awaiting response... 302 Found
Location: https://uca7e8d79db091c573fd54ba17fc.dl.dropboxusercontent.com/cd/0/get/BxODsyf4Pa0owtQUVHBNj54DHeVkk8ojdMuspoMXT7xF3AS6Ul39BMrQTETnXcY1VsnC643jC1ZvOqO38pyDo78fiLMTQu96d-pUbVJI1SSNSHrQHJItAilYZZ4jwTsYlRBEbRfR9Cq3cD5X9LlnNfrTFzehdw1NTlB8DkXRarEZFA/file?dl=1# [following]
--2022-11-22 10:22:11--  https://uca7e8d79db091c573fd54ba17fc.dl.dropboxusercontent.com/cd/0/get/BxODsyf4Pa0owtQUVHBNj54DHeVkk8ojdMuspoMXT7xF3AS6Ul39BMrQTETnXcY1VsnC643jC1ZvOqO38pyDo78

In [None]:
import numpy as np

data = list(open("./quora.txt", encoding="utf-8"))
data[42]
data[20:22]

['Who discovered plate tectonics and how?\n',
 'Is the Earth the only planet that has life on it?\n']

__Tokenization:__ a typical first step for an NLP task is to split raw data into words.
The text we're working with is in raw format, with punctuation and emoticons attached to some words, so a simple `str.split` won't do.

Let's use __`nltk`__ - a library that handles many NLP tasks like tokenization, stemming or part-of-speech tagging.

In [None]:
from nltk.tokenize import WordPunctTokenizer
# !pip install pymorphy2
import pymorphy2
morph = pymorphy2.MorphAnalyzer()
tokenizer = WordPunctTokenizer()
# data = ['я тебя люблю очень сильно', 'ты здание банка китая']
print(tokenizer.tokenize(data[42]))
morph.parse(data[0])

['How', 'does', 'the', 'finance', 'credit', 'score', 'work', '?']


[Parse(word="can i get back with my ex even though she is pregnant with another guy's baby?\n", tag=OpencorporaTag('LATN'), normal_form="can i get back with my ex even though she is pregnant with another guy's baby?\n", score=1.0, methods_stack=((LatinAnalyzer(score=0.9), "Can I get back with my ex even though she is pregnant with another guy's baby?\n"),))]

In [None]:
data_tok = [tokenizer.tokenize(sent.lower()) for sent in data]

In [None]:
data_tok[0]
morph.parse(data_tok[1][3])

[Parse(word='ways', tag=OpencorporaTag('LATN'), normal_form='ways', score=1.0, methods_stack=((LatinAnalyzer(score=0.9), 'ways'),))]
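
To see why a plain `str.split` falls short, here is a small standard-library sketch. It assumes that the regular expression below mirrors the pattern documented for nltk's `WordPunctTokenizer` (runs of word characters, or runs of punctuation); the example sentence is made up.

```python
import re

# Assumption: this pattern mirrors the one documented for nltk's WordPunctTokenizer
# (runs of word characters, or runs of punctuation characters).
WORD_PUNCT = re.compile(r"\w+|[^\w\s]+")

text = "Who discovered plate tectonics, and how?"

split_tokens = text.split()              # punctuation stays glued to the words
regex_tokens = WORD_PUNCT.findall(text)  # punctuation becomes separate tokens

print(split_tokens)
print(regex_tokens)
```

With `str.split`, "tectonics," and "tectonics" would become two different vocabulary entries, which fragments the statistics the embedding model learns from.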

In [None]:
print([' '.join(row) for row in data_tok[:2]])

["can i get back with my ex even though she is pregnant with another guy ' s baby ?", 'what are some ways to overcome a fast food addiction ?']


Load the model architecture:

In [None]:
from gensim.models import Word2Vec

In [None]:
%%time

en_w2v_model = Word2Vec(vector_size=32,  # embedding vector size
                        min_count=5,     # consider words that occurred at least 5 times
                        window=5)        # define context as a 5-word window around the target word

CPU times: user 1.25 ms, sys: 44 µs, total: 1.29 ms
Wall time: 1.04 ms
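
The `window` parameter decides which neighbours of a word count as its context. A rough sketch of the idea, using a hypothetical helper `skipgram_pairs` that lists every (center, context) pair; this is a simplification, since library implementations add details such as subsampling and negative sampling on top.

```python
def skipgram_pairs(tokens, window):
    """Return (center, context) pairs for every token and its neighbours within `window`."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs

tokens = ["what", "are", "some", "ways"]
pairs = skipgram_pairs(tokens, window=1)
print(pairs)
```

A larger window yields more pairs per word and tends to capture topical similarity, while a small window focuses on the immediate neighbours.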


In [None]:
en_w2v_model.build_vocab(data_tok)

In [None]:
# total_examples tells gensim how many sentences one epoch contains
en_w2v_model.train(data_tok, epochs=100, total_examples=en_w2v_model.corpus_count)

(478995204, 713134500)

In [None]:
# now you can get word vectors !
en_w2v_model.wv.get_vector('anything')

array([-4.336226  , -3.8685353 ,  3.3240254 , -0.11518155,  2.360612  ,
       -1.0111531 ,  3.9578047 , -2.6149962 , -1.1785423 ,  8.675043  ,
       -1.3252401 ,  0.11825431,  5.074105  , -6.042519  , -0.9434248 ,
        1.7335101 , -0.19674218,  0.3741732 ,  5.030587  , -2.0936818 ,
        3.6172826 , -1.9526972 ,  4.2750626 , -1.7522057 ,  2.4683518 ,
        1.8074629 , -2.3640206 ,  0.77915543,  7.5844936 , -1.872772  ,
       -1.9331567 ,  0.8436681 ], dtype=float32)

In [None]:
# or query similar words directly. Go play with it!
en_w2v_model.wv.most_similar('hi')

[('hey', 0.8107662796974182),
 ('however', 0.6886940002441406),
 ('knw', 0.648284912109375),
 ('d', 0.6446653604507446),
 ('here', 0.6443655490875244),
 ('josh', 0.6431625485420227),
 ('hello', 0.6381719708442688),
 ('rahman', 0.6322135925292969),
 ('sorry', 0.6321156024932861),
 ('guess', 0.6250742077827454)]

In [None]:
en_w2v_model.wv.most_similar(positive=['king', 'man'], negative=['woman'])

[('stephanie', 0.7667716145515442),
 ('impaler', 0.7640563249588013),
 ('vlad', 0.761570394039154),
 ('wick', 0.7601789832115173),
 ('emperor', 0.7565460205078125),
 ('percy', 0.7528669238090515),
 ('paul', 0.7515218257904053),
 ('prince', 0.7456627488136292),
 ('queen', 0.7442420721054077),
 ('patton', 0.7438921332359314)]

### Pretrained weights

Download model based on Wikipedia 2014 + Gigaword 5 (6B tokens, 400K vocab, uncased, 300d vectors, 822 MB download)

In [None]:
# ! wget nlp.stanford.edu/data/wordvecs/glove.6B.zip

--2022-11-21 19:30:09--  http://nlp.stanford.edu/data/wordvecs/glove.6B.zip
Resolving nlp.stanford.edu (nlp.stanford.edu)... 171.64.67.140
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://nlp.stanford.edu/data/wordvecs/glove.6B.zip [following]
--2022-11-21 19:30:09--  https://nlp.stanford.edu/data/wordvecs/glove.6B.zip
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://downloads.cs.stanford.edu/nlp/data/wordvecs/glove.6B.zip [following]
--2022-11-21 19:30:10--  https://downloads.cs.stanford.edu/nlp/data/wordvecs/glove.6B.zip
Resolving downloads.cs.stanford.edu (downloads.cs.stanford.edu)... 171.64.64.22
Connecting to downloads.cs.stanford.edu (downloads.cs.stanford.edu)|171.64.64.22|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 862182753 (822M) [app

In [None]:
# ! unzip glove.6B.zip

Archive:  glove.6B.zip
  inflating: glove.6B.100d.txt       
  inflating: glove.6B.200d.txt       
  inflating: glove.6B.300d.txt       
  inflating: glove.6B.50d.txt        


More about the different pretrained vector sets:
[GloVe: Global Vectors for Word Representation](https://nlp.stanford.edu/projects/glove/)

In [None]:
from gensim.scripts.glove2word2vec import glove2word2vec
glove2word2vec(glove_input_file="glove.6B.300d.txt", word2vec_output_file="gensim_glove_vectors.txt")


(400001, 300)
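
The conversion is needed because the two text formats differ in a single header line: the word2vec text format starts with `<number of vectors> <dimensionality>`, while GloVe files start directly with the first vector. Below is a rough sketch of that idea on a toy file. The helper name `add_word2vec_header` and the file contents are made up for illustration, and this is not gensim's actual implementation.

```python
import os
import tempfile

def add_word2vec_header(glove_path, out_path):
    """Copy a GloVe-style text file, prepending the '<n_vectors> <dim>' header line."""
    with open(glove_path, encoding="utf-8") as src:
        lines = src.readlines()
    n_vectors = len(lines)
    dim = len(lines[0].split()) - 1  # the first field is the word itself
    with open(out_path, "w", encoding="utf-8") as dst:
        dst.write(f"{n_vectors} {dim}\n")
        dst.writelines(lines)
    return n_vectors, dim

# toy file with two 3-dimensional vectors (made-up numbers)
with tempfile.TemporaryDirectory() as tmp_dir:
    glove_path = os.path.join(tmp_dir, "toy_glove.txt")
    out_path = os.path.join(tmp_dir, "toy_word2vec.txt")
    with open(glove_path, "w", encoding="utf-8") as f:
        f.write("cat 0.1 0.2 0.3\ndog 0.4 0.5 0.6\n")
    shape = add_word2vec_header(glove_path, out_path)
    with open(out_path, encoding="utf-8") as f:
        header_line = f.readline().strip()

print(shape, header_line)
```

In gensim 4.x, `KeyedVectors.load_word2vec_format(..., binary=False, no_header=True)` should be able to read the GloVe file directly, which would make the conversion step optional; check this against your installed version.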

In [None]:
%%time

from gensim.models.keyedvectors import KeyedVectors
en_w2v_model = KeyedVectors.load_word2vec_format("gensim_glove_vectors.txt", binary=False)

CPU times: user 48.7 s, sys: 758 ms, total: 49.4 s
Wall time: 49 s


In [None]:
# %%time
# Alternative method for glove model download

# import gensim.downloader as api

# model = api.load('glove-twitter-100')

In [None]:
en_w2v_model.get_vector("language")

array([-6.7832e-01, -2.8658e-01, -2.8904e-01,  1.5099e-01, -4.6720e-01,
       -1.7424e-01, -7.7790e-01,  3.5469e-01,  6.9431e-02, -1.7409e+00,
       -4.8699e-03,  3.2813e-01, -5.5443e-01,  5.1388e-01,  5.3065e-01,
        2.3718e-02,  2.2542e-01,  7.6866e-01,  1.8348e-01,  1.6765e-01,
       -1.5293e-01, -2.7201e-01, -5.3389e-02,  1.0727e+00, -4.6678e-01,
       -2.4596e-01,  1.9205e-01, -7.6138e-02,  3.9775e-02,  1.6546e-01,
        6.4188e-02,  4.1207e-01, -4.1290e-01,  8.8176e-01, -6.5510e-01,
       -1.9994e-01,  2.8036e-01, -8.3058e-01,  1.0374e-02,  2.5017e-01,
       -2.7072e-01, -5.8058e-02,  4.0706e-01, -2.3871e-01,  1.8965e-01,
       -4.7930e-02, -2.0027e-01,  8.7983e-01, -1.5852e-01, -2.8104e-01,
        1.5497e-01, -4.3207e-02,  4.2794e-01, -8.6033e-01, -2.6242e-01,
       -1.0455e-02,  2.3501e-01, -6.6707e-01,  9.1331e-01,  5.2429e-01,
        5.8939e-01,  5.7586e-01,  5.5180e-01,  7.6329e-03, -8.5204e-03,
        3.0554e-01,  7.6697e-01,  5.9108e-01,  7.0538e-01,  1.12

In [None]:
en_w2v_model.most_similar(positive=["queen", "man"], negative=["woman"])

[('king', 0.6552621126174927),
 ('ii', 0.5050469040870667),
 ('prince', 0.491478830575943),
 ('majesty', 0.48908838629722595),
 ('monarch', 0.47834306955337524),
 ('royal', 0.46305179595947266),
 ('elizabeth', 0.45092126727104187),
 ('vi', 0.44612547755241394),
 ('crown', 0.4368758201599121),
 ('brother', 0.43661490082740784)]

In [None]:
# try with your own example
en_w2v_model.most_similar(positive=["physicist", "brain"], negative=["money"])

[('neuroscientist', 0.524850070476532),
 ('mathematician', 0.4939815104007721),
 ('biologist', 0.4928779602050781),
 ('geneticist', 0.4879351854324341),
 ('biochemist', 0.47275030612945557),
 ('scientist', 0.4704717993736267),
 ('chemist', 0.46199890971183777),
 ('astrophysicist', 0.45520147681236267),
 ('physics', 0.45381951332092285),
 ('neurologist', 0.45040658116340637)]

In [None]:
en_w2v_model.most_similar(positive=["python"])

[('monty', 0.6837382316589355),
 ('perl', 0.519283652305603),
 ('cleese', 0.5092198252677917),
 ('pythons', 0.5007115006446838),
 ('php', 0.4942314326763153),
 ('grail', 0.4683017134666443),
 ('scripting', 0.46761268377304077),
 ('skit', 0.4474538266658783),
 ('javascript', 0.4312553107738495),
 ('spamalot', 0.43117913603782654)]

In [None]:
en_w2v_model.most_similar(positive=["phd"])

[('ph.d.', 0.8992077708244324),
 ('ph.d', 0.8668009638786316),
 ('doctoral', 0.8411757349967957),
 ('doctorate', 0.8270341157913208),
 ('dissertation', 0.7371240854263306),
 ('thesis', 0.7319737672805786),
 ('graduate', 0.6834654808044434),
 ('postgraduate', 0.6737526059150696),
 ('b.a.', 0.6614392399787903),
 ('post-graduate', 0.6560631990432739)]

In [None]:
en_w2v_model.most_similar(positive=["phd"], negative=["panini", "coffee", "code", "experiments", "paper", "conference"])

[('step-sister', 0.42128294706344604),
 ('edmore', 0.4122830927371979),
 ('kalwa', 0.4101993143558502),
 ('grammia', 0.4099090099334717),
 ('thất', 0.4093276560306549),
 ('cw96', 0.4002326428890228),
 ('rw96', 0.39980965852737427),
 ('pangle', 0.39888158440589905),
 ('iliyan', 0.39772501587867737),
 ('3.6730', 0.392292320728302)]

**Have fun**: how good are you at word vectors algebra now?


Let's check it: play [Semantic Space Surfer](https://lena-voita.github.io/nlp_course/word_embeddings.html#have_fun).

### Flair - yet another NLP framework that supports word embeddings

![image.png](attachment:c5531b8f-f8b5-4de9-a409-1528ce12870a.png)

Flair embeddings are powerful embeddings that capture latent syntactic-semantic information that goes beyond standard word embeddings. Key differences are:

- they are trained without any explicit notion of words and thus fundamentally model words as sequences of characters;
- they are contextualized by their surrounding text, meaning that the same word will have different embeddings depending on its contextual use.

Source: http://aclanthology.lst.uni-saarland.de/C18-1139.pdf

In [None]:
from flair.embeddings import WordEmbeddings, FlairEmbeddings, StackedEmbeddings

stacked_embeddings = StackedEmbeddings([
                                        WordEmbeddings('glove'),
                                        FlairEmbeddings('news-forward'),
                                        FlairEmbeddings('news-backward'),
                                       ])

2023-11-14 14:00:22,435 https://flair.informatik.hu-berlin.de/resources/embeddings/flair/news-backward-0.4.1.pt not found in cache, downloading to /tmp/tmpinwcvmzo


100%|██████████████████████████████████████████████████████████| 69.7M/69.7M [00:10<00:00, 7.28MB/s]

2023-11-14 14:00:32,655 copying /tmp/tmpinwcvmzo to cache at /home/moskovskii/.flair/embeddings/news-backward-0.4.1.pt





2023-11-14 14:00:32,730 removing temp file /tmp/tmpinwcvmzo


In [None]:
from flair.data import Sentence

sentence = Sentence('The grass is green .')

# just embed a sentence using the StackedEmbedding as you would with any single embedding.
stacked_embeddings.embed(sentence)

# now check out the embedded tokens.
for token in sentence:
    print(token)
    print(token.embedding)

Token[0]: "The"
tensor([-0.0382, -0.2449,  0.7281,  ..., -0.0065, -0.0053,  0.0091],
       device='cuda:0')
Token[1]: "grass"
tensor([-0.8135,  0.9404, -0.2405,  ...,  0.0354, -0.0255, -0.0143],
       device='cuda:0')
Token[2]: "is"
tensor([-5.4264e-01,  4.1476e-01,  1.0322e+00,  ..., -5.3687e-04,
        -9.6725e-03, -2.7530e-02], device='cuda:0')
Token[3]: "green"
tensor([-0.6791,  0.3491, -0.2398,  ..., -0.0007, -0.1333,  0.0161],
       device='cuda:0')
Token[4]: "."
tensor([-0.3398,  0.2094,  0.4635,  ...,  0.0005, -0.0177,  0.0032],
       device='cuda:0')


### Pre-trained weights for Russian

One of the main hubs of pretrained models for the Russian language is [**RusVectores**](https://rusvectores.org/ru/). The full list of models is available [here](https://rusvectores.org/ru/models/). Let's try a few usage examples.

In [None]:
import gensim

In [None]:
# model download. For this example we will use a pretrained fastText model.
# !wget http://vectors.nlpl.eu/repository/20/214.zip

--2022-11-21 19:42:00--  http://vectors.nlpl.eu/repository/20/214.zip
Resolving vectors.nlpl.eu (vectors.nlpl.eu)... 129.240.189.181
Connecting to vectors.nlpl.eu (vectors.nlpl.eu)|129.240.189.181|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1920218982 (1.8G) [application/zip]
Saving to: ‘214.zip’


2022-11-21 19:42:19 (99.8 MB/s) - ‘214.zip’ saved [1920218982/1920218982]



In [None]:
# !unzip 214.zip -d ru_fasttext_model

Archive:  214.zip
  inflating: ru_fasttext_model/meta.json  
  inflating: ru_fasttext_model/model.model  
  inflating: ru_fasttext_model/model.model.vectors_ngrams.npy  
  inflating: ru_fasttext_model/model.model.vectors.npy  
  inflating: ru_fasttext_model/model.model.vectors_vocab.npy  
  inflating: ru_fasttext_model/README  


In [None]:
ru_fasttext_model = gensim.models.KeyedVectors.load('ru_fasttext_model/model.model')

In [None]:
ru_fasttext_model.get_vector("естественный")

array([-3.32608342e-01, -1.26286536e-01, -1.79735586e-01,  2.79385895e-01,
       -3.69018912e-01,  3.44438046e-01, -2.34145205e-02,  5.60232043e-01,
       -3.31771702e-01,  7.50510246e-02, -3.97710502e-02,  9.58546773e-02,
        6.02775812e-01,  2.64807463e-01,  4.67248619e-01,  2.27449253e-01,
       -1.75586492e-02,  3.64083916e-01,  2.85187215e-01,  1.60460010e-01,
       -1.00663744e-01, -2.84378797e-01, -3.49444419e-01,  3.71782854e-02,
       -2.86672674e-02, -2.15512160e-02, -1.13702953e-01, -1.83207437e-01,
       -1.48359165e-01, -3.76394279e-02,  1.14443544e-02,  2.52620071e-01,
       -1.51189208e-01, -2.27908477e-01,  2.39898071e-01, -3.15357924e-01,
        2.69230425e-01, -3.75274599e-01, -1.18599355e-01, -1.66700840e-01,
        1.93800889e-02, -2.19127648e-02,  7.95794204e-02,  2.42556125e-01,
       -3.45300317e-01,  2.60304689e-01, -2.70207196e-01, -1.52698100e-01,
        3.82836431e-01,  2.37352714e-01, -4.83656645e-01, -3.28339636e-02,
        3.38784158e-02,  

In [None]:
ru_fasttext_model.most_similar("ягуар")

[('ягуара', 0.6922987103462219),
 ('ягуаре', 0.658053457736969),
 ('джип', 0.6504993438720703),
 ('ягуары', 0.6424834728240967),
 ('крайслер', 0.6181568503379822),
 ('митсубиши', 0.6164671182632446),
 ('бмв', 0.6137527823448181),
 ('тигуан', 0.6030640602111816),
 ('ситроен', 0.6001013517379761),
 ('хаммер', 0.5977892279624939)]
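
The inflected forms near the top of this list are what one would expect from a subword model: fastText builds a word's vector from its character n-grams, so words that share most of their spelling end up close together. Here is a minimal sketch of n-gram extraction. The helper `char_ngrams` is hypothetical, and the 3 to 6 range is fastText's usual default, assumed here rather than read from this particular model.

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Character n-grams of `word` wrapped in '<' and '>' boundary markers."""
    wrapped = f"<{word}>"
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(wrapped) - n + 1):
            grams.append(wrapped[i:i + n])
    return grams

print(char_ngrams("where", 3, 3))

# n-grams shared by a word and one of its inflected forms
shared = set(char_ngrams("ягуар")) & set(char_ngrams("ягуара"))
print(sorted(shared))
```

The same mechanism lets the model produce a vector for a word it never saw during training, as long as some of its n-grams are known.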

In [None]:
ru_fasttext_model.most_similar(positive=["учеба", "время"], negative=["экзамен"])

[('время-то', 0.4860388934612274),
 ('врем', 0.48514583706855774),
 ('время,а', 0.4616581201553345),
 ('десятилетие', 0.45002153515815735),
 ('еда', 0.44088080525398254),
 ('детство', 0.43911275267601013),
 ('времяпровождение', 0.43864935636520386),
 ('продолжительное', 0.4309692084789276),
 ('готовка', 0.4296607971191406),
 ('продолжительная', 0.42860114574432373)]

Again, have fun and check [vectors calculator](https://rusvectores.org/ru/calculator/)!

## Visualization

### Single words

One way to see if our vectors are any good is to plot them. Thing is, those vectors are in 30D+ space and we humans are more used to 2-3D.

Luckily, we machine learners know about __dimensionality reduction__ methods.

Let's use that to plot the 1000 most frequent words.

In [None]:
words = sorted(en_w2v_model.key_to_index.keys(),
               key=lambda word: en_w2v_model.get_vecattr(word, "count"),
               reverse=True)[:1000]

print(words[::100])


['the', 'so', 'according', 'man', 'troops', 'working', 'together', 'meet', '40', 'either']


In [None]:
# for each word, compute its vector with the model
word_vectors = np.array([en_w2v_model[word] for word in words])

Linear projection: PCA

The simplest linear dimensionality reduction method is __P__rincipal __C__omponent __A__nalysis.

In geometric terms, PCA tries to find axes along which most of the variance occurs. The "natural" axes, if you wish.

<img src="https://github.com/yandexdataschool/Practical_RL/raw/master/yet_another_week/_resource/pca_fish.png" style="width:30%">


Under the hood, it attempts to decompose object-feature matrix $X$ into two smaller matrices: $W$ and $\hat W$ minimizing _mean squared error_:

$$\|(X W) \hat{W} - X\|^2_2 \to_{W, \hat{W}} \min$$
- $X \in \mathbb{R}^{n \times m}$ - object matrix (**centered**);
- $W \in \mathbb{R}^{m \times d}$ - matrix of direct transformation;
- $\hat{W} \in \mathbb{R}^{d \times m}$ - matrix of reverse transformation;
- $n$ samples, $m$ original dimensions and $d$ target dimensions;
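
The objective above can be checked numerically. Below is a minimal numpy sketch on synthetic data (the data and variable names are made up for illustration) that takes $W$ and $\hat{W}$ from the SVD of the centered matrix and measures the squared reconstruction error summed over all entries.

```python
import numpy as np

rng = np.random.default_rng(0)

# synthetic data: n=200 samples, m=5 correlated features (for illustration only)
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))
X = X - X.mean(axis=0)  # PCA assumes a centered matrix

d = 2
# rows of Vt are the principal directions, ordered by decreasing singular value
U, S, Vt = np.linalg.svd(X, full_matrices=False)
W = Vt[:d].T     # direct transformation, shape (m, d)
W_hat = Vt[:d]   # reverse transformation, shape (d, m)

X_low = X @ W            # d-dimensional representation
X_rec = X_low @ W_hat    # back-projection into the original space

reconstruction_error = np.sum((X_rec - X) ** 2)
discarded_energy = np.sum(S[d:] ** 2)
print(reconstruction_error, discarded_energy)
```

The two printed numbers should agree up to floating-point error: the error of the best rank-$d$ linear reconstruction equals the sum of the squared singular values that were dropped.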



In [None]:
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

In [None]:
%%time

# map word vectors onto 2d plane with PCA. Use good old sklearn api (fit, transform)
# after that, normalize vectors to make sure they have zero mean and unit variance
word_vectors_pca = PCA(n_components=2).fit_transform(word_vectors)
word_vectors_pca = StandardScaler().fit_transform(word_vectors_pca)

CPU times: user 377 ms, sys: 1.23 s, total: 1.61 s
Wall time: 57.5 ms


Let's draw it!

In [None]:
import bokeh.models as bm, bokeh.plotting as pl
from bokeh.io import output_notebook
output_notebook()

def draw_vectors(x, y, radius=10, alpha=0.25, color='blue',
                 width=600, height=400, show=True, **kwargs):
    """ draws an interactive plot for data points with auxiliary info on hover """
    if isinstance(color, str): color = [color] * len(x)
    data_source = bm.ColumnDataSource({ 'x' : x, 'y' : y, 'color': color, **kwargs })

    fig = pl.figure(active_scroll='wheel_zoom', width=width, height=height)
    fig.scatter('x', 'y', size=radius, color='color', alpha=alpha, source=data_source)

    fig.add_tools(bm.HoverTool(tooltips=[(key, "@" + key) for key in kwargs.keys()]))
    if show: pl.show(fig)
    return fig

In [None]:
# draw_vectors renders the bokeh figure itself, so no matplotlib call is needed
draw_vectors(word_vectors_pca[:, 0], word_vectors_pca[:, 1], token=words)
# hover a mouse over there and see if you can identify the clusters

### Phrases

Word embeddings can also be used to represent short phrases. The simplest way is to take __an average__ of vectors for all tokens in the phrase with some weights.

This trick is useful for understanding what data you are working with: you can check whether there are any outliers, clusters, or other artefacts.

Let's try this new hammer on our data!

In [None]:
def get_phrase_embedding(model, phrase):
    """
    Convert phrase to a vector by aggregating its word embeddings. See description above.
    """
    # 1. lowercase phrase
    # 2. tokenize phrase
    # 3. average word vectors for all words in tokenized phrase
    # skip words that are not in model's vocabulary
    # if all words are missing from vocabulary, return zeros

    vector = np.zeros([model.vector_size], dtype='float32')

    phrase = phrase.lower()
    tokens = tokenizer.tokenize(phrase)
    used_words = 0

    for word in tokens:
        if word in model:
            vector += model[word]
            used_words += 1

    if used_words > 0:
        vector = vector / used_words

    return vector

In [None]:
vector = get_phrase_embedding(en_w2v_model, "I'm very sure. This never happened to me before...")

In [None]:
# let's only consider ~1k phrases for a first run.
chosen_phrases = data[::len(data) // 1000]

# compute vectors for chosen phrases
phrase_vectors = np.array([get_phrase_embedding(en_w2v_model, phrase) for phrase in chosen_phrases])

In [None]:
assert isinstance(phrase_vectors, np.ndarray) and np.isfinite(phrase_vectors).all()
assert phrase_vectors.shape == (len(chosen_phrases), en_w2v_model.vector_size)

In [None]:
# map vectors into 2d space with pca, tsne or your other method of choice
# don't forget to normalize

phrase_vectors_2d = PCA(n_components=2).fit_transform(phrase_vectors)

phrase_vectors_2d = (phrase_vectors_2d - phrase_vectors_2d.mean(axis=0)) / phrase_vectors_2d.std(axis=0)

In [None]:
draw_vectors(phrase_vectors_2d[:, 0], phrase_vectors_2d[:, 1],
             phrase=[phrase[:50] for phrase in chosen_phrases],
             radius=20,)

# Application Examples

Now we're going to play with word embeddings: train our own little embedding with the `gensim.models` package, load one from the gensim model zoo, and use it to visualize text corpora.

This whole thing is gonna happen on top of embedding dataset.

__Requirements:__  `pip install --upgrade nltk gensim bokeh` , but only if you're running locally.

**Recap** of code that may have been lost if the session restarted and that is needed for the next steps:

In [None]:
def get_phrase_embedding(model, phrase):
    """
    Convert phrase to a vector by aggregating its word embeddings. See description above.
    """
    # 1. lowercase phrase
    # 2. tokenize phrase
    # 3. average word vectors for all words in tokenized phrase
    # skip words that are not in model's vocabulary
    # if all words are missing from vocabulary, return zeros

    vector = np.zeros([model.vector_size], dtype='float32')

    phrase = phrase.lower()
    tokens = tokenizer.tokenize(phrase)
    used_words = 0

    for word in tokens:
        if word in model:
            vector += model[word]
            used_words += 1

    if used_words > 0:
        vector = vector / used_words

    return vector

In [None]:
%%time

from gensim.models.keyedvectors import KeyedVectors
en_w2v_model = KeyedVectors.load_word2vec_format("gensim_glove_vectors.txt", binary=False)

CPU times: user 48.6 s, sys: 565 ms, total: 49.2 s
Wall time: 49 s


### Search of Similar Reviews

If the kernel has reconnected, return to the **Count-based models: Data preparation** section to reload the data and rerun the preprocessing.

In [None]:
from scipy.spatial.distance import cosine
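
One thing to keep in mind: `scipy.spatial.distance.cosine` returns the cosine *distance*, that is, one minus the cosine similarity, so smaller values mean closer texts. Here is a small numpy-only sketch of the relation; the toy vectors and the helper name `cosine_similarity_1d` are made up for illustration.

```python
import numpy as np

def cosine_similarity_1d(u, v):
    """Cosine of the angle between two 1-D vectors (1 = same direction)."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

query = np.array([1.0, 0.0])
docs = {
    "same_direction": np.array([2.0, 0.0]),
    "diagonal": np.array([1.0, 1.0]),
    "orthogonal": np.array([0.0, 3.0]),
}

similarity = {name: cosine_similarity_1d(query, vec) for name, vec in docs.items()}
distance = {name: 1.0 - sim for name, sim in similarity.items()}

# most similar first: descending similarity is the same order as ascending distance
by_similarity = sorted(similarity, key=similarity.get, reverse=True)
by_distance = sorted(distance, key=distance.get)
print(by_similarity, by_distance)
```

Sorting distances in descending order would instead put the least similar texts first, which is an easy mistake to make when switching between the two conventions.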

In [None]:
def cosine_similarity_w2v(model, df, k, query):
    """
    Search for the top-k texts in the document collection that are most similar to the query,
    based on cosine similarity. The vectors are now computed from the w2v representation of tokens.

    model: word vectors used to embed the texts;
    df: DataFrame with the documents in the 'Processed Review' column;
    k: int, length of the result;
    query: str, your text.

    Return: list of the texts most similar to the query;
          each element of the list is represented as a tuple: (text_idx, sim_score)
    """

    query2vec = get_phrase_embedding(model, query)

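    similarities = []
    for index, row in df.iterrows():
        doc2vec = get_phrase_embedding(model, row['Processed Review'])
        # scipy's `cosine` is a distance (1 - similarity), so convert it back
        similarities.append((index, 1 - cosine(query2vec, doc2vec)))

    # highest similarity first
    similarities.sort(key=lambda tup: tup[1], reverse=True)
    result = similarities[:k]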

    return result

In [None]:
query = reviews_pr.iloc[0]['Review Text']
print("Query: ", query, "\n")

result = cosine_similarity_w2v(en_w2v_model, reviews_pr, 10, query)

for (text_idx, sim_score) in result:
    print("Index: ", text_idx, " Similarity: ", sim_score, " Text: ", reviews_pr.iloc[text_idx]["Review Text"])

Query:  I really wanted this to work. alas, it had a strange fit for me. the straps would not stay up, and it had a weird fit under the breast. it worked standing up, but the minute i sat down it fell off my shoulders. the fabric was beautiful! and i loved that it had pockets. 

Index:  317  Similarity:  0.672203928232193  Text:  This dress is adorable. dress it up or dress it down
Index:  440  Similarity:  0.584073930978775  Text:  Love the pattern of this tunic. it's comfy and unique.
Index:  495  Similarity:  0.5620391368865967  Text:  Horrible fit. i do not understand why they but a aline dress with a skin non aline camisole under the dress.
Index:  254  Similarity:  0.4566788077354431  Text:  Easy to take this dress from casual to dressy! super comfortable and cute!
Index:  487  Similarity:  0.4554866552352905  Text:  This dress is great. wear now and into fall with boots and leggings. it seems to have structure with the seams and is nicley flowy.
Index:  127  Similarity:  0.44731

Your conclusion: which method is better?

One more improvement that can be made: **w2v representation weighted with tf-idf score**.
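
A sketch of this weighting in formula form (the notation is an assumption of this note): for a phrase tokenized into $w_1, \dots, w_T$ with word vectors $e(w)$,

$$v_{\text{phrase}} = \frac{\sum_{t=1}^{T} \mathrm{tf}(w_t)\,\mathrm{idf}(w_t)\,e(w_t)}{\sum_{t=1}^{T} \mathrm{tf}(w_t)\,\mathrm{idf}(w_t)}, \qquad \mathrm{tf}(w) = \frac{\mathrm{count}(w)}{T}$$

Tokens missing from the embedding vocabulary or from the IDF dictionary are skipped in both sums. When all weights are equal, this reduces to the plain average used so far; rare words with a high IDF pull the phrase vector towards themselves.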

In [None]:
dict_idf = dict(zip(vectorizer.get_feature_names_out(), list(vectorizer.idf_)))

In [None]:
def get_phrase_embedding_w2v_tfidf(model, phrase, dict_idf):
    """
    Convert phrase to a vector by aggregating its word embeddings weighted with tf-idf score.
    """

    vector = np.zeros([model.vector_size], dtype='float32')

    phrase = phrase.lower()
    tokens = tokenizer.tokenize(phrase)

    weighted_sum = 0

    for word in tokens:
        if word in model and word in dict_idf.keys():
            tf_idf = (tokens.count(word)/len(tokens)) * dict_idf[word]
            vector += model[word] * tf_idf
            weighted_sum += tf_idf

    if weighted_sum > 0:
        vector = vector / weighted_sum

    return vector

In [None]:
def cosine_similarity_w2v_tfidf(model, df, k, query, dict_idf):
    """
    Search for the top-k texts in the document collection that are most similar to the query,
    based on cosine similarity. The vectors are now computed from the tf-idf WEIGHTED w2v representation of tokens.

    model: word vectors used to embed the texts;
    df: DataFrame with the documents in the 'Processed Review' column;
    k: int, length of the result;
    query: str, your text;
    dict_idf: IDF of tokens;

    Return: list of the texts most similar to the query;
          each element of the list is represented as a tuple: (text_idx, sim_score)
    """

    query2vec = get_phrase_embedding_w2v_tfidf(model, query, dict_idf)

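    similarities = []
    for index, row in df.iterrows():
        # embed documents with the same tf-idf weighting as the query
        doc2vec = get_phrase_embedding_w2v_tfidf(model, row['Processed Review'], dict_idf)
        # scipy's `cosine` is a distance (1 - similarity), so convert it back
        similarities.append((index, 1 - cosine(query2vec, doc2vec)))

    # highest similarity first
    similarities.sort(key=lambda tup: tup[1], reverse=True)
    result = similarities[:k]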

    return result

In [None]:
query = reviews_pr.iloc[0]['Review Text']
print("Query: ", query, "\n")

result = cosine_similarity_w2v_tfidf(en_w2v_model, reviews_pr, 10, query, dict_idf)

for (text_idx, sim_score) in result:
    print("Index: ", text_idx, " Similarity: ", sim_score, " Text: ", reviews_pr.iloc[text_idx]["Review Text"])

Query:  I really wanted this to work. alas, it had a strange fit for me. the straps would not stay up, and it had a weird fit under the breast. it worked standing up, but the minute i sat down it fell off my shoulders. the fabric was beautiful! and i loved that it had pockets. 

Index:  317  Similarity:  0.6033726334571838  Text:  This dress is adorable. dress it up or dress it down
Index:  440  Similarity:  0.5167878568172455  Text:  Love the pattern of this tunic. it's comfy and unique.
Index:  771  Similarity:  0.46498554944992065  Text:  Bought this dress for an indian wedding- it was perfect!
Index:  495  Similarity:  0.44923317432403564  Text:  Horrible fit. i do not understand why they but a aline dress with a skin non aline camisole under the dress.
Index:  488  Similarity:  0.4249442219734192  Text:  This dress can be casual, or you can dress it up for going out. i just love this dress!
Index:  673  Similarity:  0.41793370246887207  Text:  Love this dress and the color
Index: 

Finally, which model works the best?

### Simple Question Answering

We use phrase embeddings from the previous section.

In [None]:
# in case your session crashed

import numpy as np

from nltk.tokenize import WordPunctTokenizer
tokenizer = WordPunctTokenizer()

data = list(open("./quora.txt", encoding="utf-8"))

In [None]:
data[0]

"Can I get back with my ex even though she is pregnant with another guy's baby?\n"

In [None]:
%%time

# compute vector embedding for all lines in data
data_vectors = np.array([get_phrase_embedding(en_w2v_model, l) for l in data])

CPU times: user 16.4 s, sys: 1.2 s, total: 17.6 s
Wall time: 16.7 s


In [None]:
def find_nearest(model, query, k=10):
    """
    given text line (query), return k most similar lines from data, sorted from most to least similar
    similarity should be measured as cosine between query and line embedding vectors
    hint: it's okay to use global variables: data and data_vectors. see also: np.argpartition, np.argsort
    """
    from scipy.spatial import distance

    query_emb = get_phrase_embedding(model, query)
    # cosine distance = 1 - cosine similarity, so the smallest distances come first
    distances = distance.cdist([query_emb], data_vectors, "cosine")[0]
    ranked_indices = np.argsort(distances)

    # top-k lines, starting from the most similar
    return [data[index] for index in ranked_indices[:k]]

In [None]:
%%time

find_nearest(en_w2v_model, query="How do i enter the matrix?", k=10)

CPU times: user 415 ms, sys: 305 ms, total: 720 ms
Wall time: 719 ms


['How do I do a matrix transpose in Go?\n',
 'Do you live in the matrix? Why?\n',
 'How can I do the impossible?\n',
 'How do you define the one?\n',
 'How can I do the right thing?\n',
 'How can I become the best in everything I do?\n',
 'How do you know if you are in the friendzone?\n',
 'If I want to start writing, where do I begin? How can I learn the structure of what I want to write?\n',
 'What can I do in the future?\n',
 'How do I can get .in domain?\n']

In [None]:
find_nearest(en_w2v_model, query="How does Trump?", k=10)

['What does Tiffany Trump do?\n',
 'How would you describe Donald Trump?\n',
 'What is Trump?\n',
 'Do you like Trump?\n',
 'Why does everybody hate Trump?\n',
 'How does it feel to date Ivanka Trump?\n',
 'What do you think Melania Trump really thinks of Donald?\n',
 'What do you think about Donald trump?\n',
 'What do you think about Donald Trump?\n',
 'Why is Trump bad?\n']

In [None]:
find_nearest(en_w2v_model, query="Why don't i ask a question myself?", k=10)

["Why I don't believe in myself?\n",
 "Why I don't get any answer on my question?\n",
 "Why don't I get a job call?\n",
 "How do I ask a girl I don't know to fuck?\n",
 "Why don't I get a date?\n",
 "Can I ask a girl out that I don't know?\n",
 "Why do you always answer a question with a question? I don't, or do I?\n",
 "Why don't I feel like talking to anyone?\n",
 "Why do I love someone I don't know?\n",
 "Why don't I get a girlfriend?\n"]

More advanced vector indexing: [FAISS](https://github.com/facebookresearch/faiss)
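
Before reaching for a dedicated index, the `np.argpartition` hint from the exercise can be sketched in plain NumPy. The function name `top_k_cosine` and the toy matrix are assumptions for illustration; this is a brute-force search, not a replacement for FAISS.

```python
import numpy as np

def top_k_cosine(query_vec, matrix, k=3):
    """Return indices of the k rows of `matrix` most similar to `query_vec`."""
    # Normalise rows and the query so that a dot product equals cosine similarity.
    row_norms = np.linalg.norm(matrix, axis=1)
    row_norms[row_norms == 0] = 1.0  # keep all-zero rows at similarity 0
    query_norm = np.linalg.norm(query_vec) or 1.0
    sims = (matrix @ query_vec) / (row_norms * query_norm)
    k = min(k, len(sims))
    # argpartition selects the k largest without a full sort; only those k are sorted.
    candidates = np.argpartition(-sims, k - 1)[:k]
    return candidates[np.argsort(-sims[candidates])]

toy_vectors = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]])
print(top_k_cosine(np.array([1.0, 0.1]), toy_vectors, k=2))  # [0 2]
```

A full `np.argsort` over all similarities gives the same answer; partitioning first only pays off when the collection is large and `k` is small.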

### Simple Machine Translation

Adapted from the original material by YSDA.

(_синій кіт_ "blue cat" vs. _синій кит_ "blue whale")

![blue_cat_blue_whale.png](https://github.com/yandexdataschool/nlp_course/raw/master/resources/blue_cat_blue_whale.png)

**Fragment of the Swadesh list for some Slavic languages**

The Swadesh list is a lexicostatistical tool. It is named after the American linguist Morris Swadesh and contains basic vocabulary. Such lists are used to establish subgroupings of languages and how closely they are related.

The fragment below shows how similar many basic words are across Slavic languages.


| Russian         | Belarusian               | Ukrainian               | Polish             | Czech                         | Bulgarian             |
|-----------------|--------------------------|-------------------------|--------------------|-------------------------------|-----------------------|
| женщина         | жанчына, кабета, баба    | жінка                   | kobieta            | žena                          | жена                  |
| мужчина         | мужчына                  | чоловік, мужчина        | mężczyzna          | muž                           | мъж                   |
| человек         | чалавек                  | людина, чоловік         | człowiek           | člověk                        | човек                 |
| ребёнок, дитя   | дзіця, дзіцёнак, немаўля | дитина, дитя            | dziecko            | dítě                          | дете                  |
| жена            | жонка                    | дружина, жінка          | żona               | žena, manželka, choť          | съпруга, жена         |
| муж             | муж, гаспадар            | чоловік, муж            | mąż                | muž, manžel, choť             | съпруг, мъж           |
| мать, мама      | маці, матка              | мати, матір, неня, мама | matka              | matka, máma, mateř (arch.)    | майка                 |
| отец, тятя      | бацька, тата             | батько, тато, татусь    | ojciec             | otec                          | баща, татко           |
| много           | шмат, багата             | багато                  | wiele              | mnoho, hodně                  | много                 |
| несколько       | некалькі, колькі         | декілька, кілька        | kilka              | několik, pár, trocha          | няколко               |
| другой, иной    | іншы                     | інший                   | inny               | druhý, jiný                   | друг                  |
| зверь, животное | жывёла, звер, істота     | тварина, звір           | zwierzę            | zvíře                         | животно               |
| рыба            | рыба                     | риба                    | ryba               | ryba                          | риба                  |
| птица           | птушка                   | птах, птиця             | ptak               | pták                          | птица                 |
| собака, пёс     | сабака                   | собака, пес             | pies               | pes                           | куче, пес             |
| вошь            | вош                      | воша                    | wesz               | veš                           | въшка                 |
| змея, гад       | змяя                     | змія, гад               | wąż                | had                           | змия                  |
| червь, червяк   | чарвяк                   | хробак, черв'як         | robak              | červ                          | червей                |
| дерево          | дрэва                    | дерево                  | drzewo             | strom, dřevo                  | дърво                 |
| лес             | лес                      | ліс                     | las                | les                           | гора, лес             |
| палка           | кій, палка               | палиця                  | patyk, pręt, pałka | hůl, klacek, prut, kůl, pálka | палка, пръчка, бастун |

#### Data preparation

In [None]:
import gensim
import numpy as np
from gensim.models import KeyedVectors

import requests

Download all the files using the code below (if it doesn't work for some reason, copies can be found here):

Embeddings:
* [cc.uk.300.vec.zip](https://yadi.sk/d/9CAeNsJiInoyUA)
* [cc.ru.300.vec.zip](https://yadi.sk/d/3yG0-M4M8fypeQ)

Corpora of Ukrainian-Russian word pairs:
* [ukr_rus.test.txt](https://yadi.sk/i/uViaQLmdBy0Hag)
* [ukr_rus.train.txt](https://yadi.sk/i/PEgqqvG1po-L9Q)

Fairy tale for translation test:
* [fairy_tale.txt](https://yadi.sk/d/baqp13Nb2QEoEw)

In [None]:
%%time

!wget https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.ru.300.vec.gz
!gunzip cc.ru.300.vec.gz
!wget https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.uk.300.vec.gz
!gunzip cc.uk.300.vec.gz
!wget https://raw.githubusercontent.com/yandexdataschool/nlp_course/2022/week01_embeddings/ukr_rus.train.txt
!wget https://raw.githubusercontent.com/yandexdataschool/nlp_course/2022/week01_embeddings/ukr_rus.test.txt
!wget https://raw.githubusercontent.com/yandexdataschool/nlp_course/2022/week01_embeddings/fairy_tale.txt

In [None]:
%%time

uk_emb = KeyedVectors.load_word2vec_format("cc.uk.300.vec")
ru_emb = KeyedVectors.load_word2vec_format("cc.ru.300.vec")

FileNotFoundError: ignored

In [None]:
ru_emb.most_similar([ru_emb["август"]], topn=10)

In [None]:
uk_emb.most_similar([uk_emb["серпень"]])

In [None]:
ru_emb.most_similar([uk_emb["серпень"]])

Load small dictionaries of corresponding word pairs as the train and test sets.

In [None]:
def load_word_pairs(filename):
    uk_ru_pairs = []
    uk_vectors = []
    ru_vectors = []
    with open(filename, "r", encoding="utf-8") as inpf:
        for line in inpf:
            uk, ru = line.rstrip().split("\t")
            if uk not in uk_emb or ru not in ru_emb:
                continue
            uk_ru_pairs.append((uk, ru))
            uk_vectors.append(uk_emb[uk])
            ru_vectors.append(ru_emb[ru])
    return uk_ru_pairs, np.array(uk_vectors), np.array(ru_vectors)

In [None]:
uk_ru_train, X_train, Y_train = load_word_pairs("ukr_rus.train.txt")

In [None]:
uk_ru_test, X_test, Y_test = load_word_pairs("ukr_rus.test.txt")

#### Embedding space mapping

Let $x_i \in \mathbb{R}^d$ be the distributed representation of word $i$ in the source language, and let $y_i \in \mathbb{R}^d$ be the vector representation of its translation. Our goal is to learn a linear transform $W$ that minimizes the Euclidean distance between $Wx_i$ and $y_i$ over some subset of word embeddings. Thus we can formulate the so-called Procrustes problem:

$$W^*= \arg\min_W \sum_{i=1}^n||Wx_i - y_i||_2^2$$
or
$$W^*= \arg\min_W ||WX - Y||_F$$

where $||\cdot||_F$ is the Frobenius norm.

In Greek mythology, Procrustes, or "the stretcher", was a rogue smith and bandit from Attica who attacked people by stretching them or cutting off their legs so as to force them to fit the size of an iron bed. We do the same bad things to the source embedding space. Our Procrustean bed is the target embedding space.

![embedding_mapping.png](https://github.com/yandexdataschool/nlp_course/raw/master/resources/embedding_mapping.png)

![procrustes.png](https://github.com/yandexdataschool/nlp_course/raw/master/resources/procrustes.png)

But wait... $W^*= \arg\min_W \sum_{i=1}^n||Wx_i - y_i||_2^2$ looks like simple multiple linear regression (without an intercept). So let's code.

In [None]:
from sklearn.linear_model import LinearRegression

# fit_intercept=False: the problem above has no bias term
mapping = LinearRegression(fit_intercept=False).fit(X_train, Y_train)
mapping.score(X_train, Y_train)  # R^2 on the training pairs
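
The regression view can be checked on synthetic data without any embeddings. In this sketch the matrices `X_toy`, `W_true` and `Y_toy` are randomly generated stand-ins, and `np.linalg.lstsq` plays the role of an intercept-free linear regression; word vectors are stored as rows, so the map is applied as `X @ W`.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4
X_toy = rng.normal(size=(50, d))   # 50 hypothetical source-language vectors (rows)
W_true = rng.normal(size=(d, d))   # the linear map we pretend is unknown
Y_toy = X_toy @ W_true             # their "translations"

# Least squares without intercept: minimise ||X W - Y||_F over W.
W_hat, *_ = np.linalg.lstsq(X_toy, Y_toy, rcond=None)
print(np.allclose(W_hat, W_true))  # True
```

With noise-free data the map is recovered exactly; with real embeddings the fit is only approximate, which is why the nearest-neighbour evaluation matters.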

In [None]:
august = mapping.predict(uk_emb["серпень"].reshape(1, -1))
ru_emb.most_similar(august)

We can see that the neighbourhood of this embedding consists of different months, but the correct translation is only in ninth place.

As a quality measure we will use precision at top-1, top-5 and top-10 (for each transformed Ukrainian embedding, we check whether the correct translation is among the top N nearest neighbours in the Russian embedding space).

#### Score the solution

In [None]:
def precision(pairs, mapped_vectors, topn=1):
    """
    :args:
        pairs = list of correct word pairs [(uk_word_0, ru_word_0), ...]
        mapped_vectors = list of embeddings after mapping from source embedding space to destination embedding space
        topn = the number of nearest neighbours in destination embedding space to choose from
    :returns:
        precision_val, float number, the fraction of words whose correct translation is found among the top-n neighbours.
    """
    assert len(pairs) == len(mapped_vectors)
    num_matches = 0
    # request exactly topn neighbours (most_similar returns only 10 by default)
    similar_vectors = [ru_emb.most_similar([mapped_vector], topn=topn) for mapped_vector in mapped_vectors]

    for i, (_, ru) in enumerate(pairs):
        ### SOLUTION ###
        if ru in [el[0] for el in similar_vectors[i]][:topn]:
            num_matches += 1
        ### SOLUTION ###
    precision_val = num_matches / len(pairs)
    return precision_val
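
The metric itself can be illustrated on a toy vocabulary, independent of the real embeddings. Everything below (the three-word vocabulary, the 2-d vectors, the name `precision_at_k`) is hypothetical and only mirrors the idea of checking whether the gold translation is among the k nearest neighbours.

```python
import numpy as np

toy_words = ["кот", "собака", "рыба"]            # hypothetical target vocabulary
toy_target = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]])

def precision_at_k(gold_words, mapped, k=1):
    """Share of mapped vectors whose gold word is among the k nearest targets."""
    unit_target = toy_target / np.linalg.norm(toy_target, axis=1, keepdims=True)
    unit_mapped = mapped / np.linalg.norm(mapped, axis=1, keepdims=True)
    sims = unit_mapped @ unit_target.T            # cosine similarities
    top = np.argsort(-sims, axis=1)[:, :k]        # k nearest target indices
    hits = [gold in [toy_words[j] for j in row] for gold, row in zip(gold_words, top)]
    return sum(hits) / len(hits)

mapped = np.array([[0.9, 0.2], [0.6, 0.8]])
gold = ["кот", "собака"]
print(precision_at_k(gold, mapped, k=1))  # 0.5
print(precision_at_k(gold, mapped, k=2))  # 1.0
```

The second vector lands closest to "рыба", so its gold word only counts as found once two neighbours are allowed.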

In [None]:
assert precision([("серпень", "август")], august, topn=5) == 0.0
assert precision([("серпень", "август")], august, topn=9) == 1.0
assert precision([("серпень", "август")], august, topn=10) == 1.0

In [None]:
%%time

assert precision(uk_ru_test, X_test) == 0.0
assert precision(uk_ru_test, Y_test) == 1.0

In [None]:
precision_top1 = precision(uk_ru_test, mapping.predict(X_test), 1)
precision_top5 = precision(uk_ru_test, mapping.predict(X_test), 5)

assert precision_top1 >= 0.635
assert precision_top5 >= 0.810

#### Make it better (orthogonal Procrustean problem)

It can be shown (see the original paper) that a self-consistent linear mapping between semantic spaces should be orthogonal.
We can therefore restrict the transform $W$ to be orthogonal and solve the following problem:

$$W^*= \arg\min_W ||WX - Y||_F \text{, where: } W^TW = I$$

$$I \text{- identity matrix}$$

Instead of fitting yet another regression, we can find the optimal orthogonal transformation using singular value decomposition (SVD). It turns out that the optimal transformation $W^*$ can be expressed via the SVD components:
$$X^TY=U\Sigma V^T\text{, singular value decomposition}$$
$$W^*=UV^T$$

Here the rows of $X$ and $Y$ hold the word vectors, so in code the mapping is applied as $XW^*$ (the row-vector counterpart of $Wx_i$).

In [None]:
def learn_transform(X_train, Y_train):
    """
    :returns: W* : float matrix[emb_dim x emb_dim] as defined in formulae above
    """
    ### SOLUTION ###
    U, s, Vt = np.linalg.svd(np.matmul(X_train.transpose(), Y_train))

    return np.matmul(U, Vt)
    ### SOLUTION ###
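
As a sanity check of the SVD recipe, the sketch below applies it to synthetic data where the true rotation is known. The names `X_toy`, `Q_true` and `W_toy` are illustrative; word vectors are stored as rows, so the map is applied as `X @ W`.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5
X_toy = rng.normal(size=(100, d))                  # word vectors stored as rows
Q_true, _ = np.linalg.qr(rng.normal(size=(d, d)))  # a random orthogonal matrix
Y_toy = X_toy @ Q_true                             # rotated copies of the rows

# Same recipe as above: SVD of X^T Y, then W = U V^T.
U, _, Vt = np.linalg.svd(X_toy.T @ Y_toy)
W_toy = U @ Vt

print(np.allclose(W_toy.T @ W_toy, np.eye(d)))  # True: W is orthogonal
print(np.allclose(W_toy, Q_true))               # True: the rotation is recovered
```

Unlike the unconstrained regression, the result is orthogonal by construction, so it preserves vector lengths and the angles between vectors.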

In [None]:
%%time

W = learn_transform(X_train, Y_train)

In [None]:
ru_emb.most_similar([np.matmul(uk_emb["серпень"], W)])

In [None]:
assert precision(uk_ru_test, np.matmul(X_test, W)) >= 0.653
assert precision(uk_ru_test, np.matmul(X_test, W), 5) >= 0.824

#### Translation engine

Now we are ready to make a simple word-based translator: for each word in the source language, we find its nearest neighbour in the target language within the shared embedding space.

In [None]:
from nltk.tokenize import WordPunctTokenizer
tokenizer = WordPunctTokenizer()

In [None]:
with open("fairy_tale.txt", "r", encoding="utf-8") as inpf:
    uk_sentences = [line.rstrip().lower() for line in inpf]

In [None]:
def translate(sentence):
    """
    :args:
        sentence - sentence in Ukrainian (str)
    :returns:
        translation - sentence in Russian (str)

    * find ukrainian embedding for each word in sentence
    * transform ukrainian embedding vector
    * find nearest russian word and replace
    """
    ### SOLUTION ###
    result = []

    for word in tokenizer.tokenize(sentence):
        if word not in uk_emb:
            result.append(word)
            continue
        word_emb = uk_emb[word]
        word_translation = ru_emb.most_similar([np.matmul(word_emb, W)])[0][0]
        result.append(word_translation)

    return ' '.join(result)
    ### SOLUTION ###

In [None]:
assert translate(".") == "."
assert translate("1 , 3") == "1 , 3"
assert translate("кіт зловив мишу") == "кот поймал мышку"

In [None]:
for sentence in uk_sentences:
    print("src: {}\ndst: {}\n".format(sentence, translate(sentence)))

Not so bad, right? We could easily improve the translation by using a language model and by considering several nearest neighbours in the shared embedding space instead of just one. But that is for next time.

# Bonus: Doc2Vec

<img src="https://github.com/dardem/word2vec_seminar/raw/master/img/doc2vec.png" style="width:100%">

The straightforward approach of averaging each of a text's words' word-vectors creates a quick and crude document-vector that can often be useful. However, Le and Mikolov in 2014 introduced the <i>Paragraph Vector</i>, which usually outperforms such simple-averaging.

The basic idea is: act as if a document has another floating word-like vector, which contributes to all training predictions, and is updated like other word-vectors, but we will call it a doc-vector. Gensim's `Doc2Vec` class implements this algorithm.

**Paragraph Vector - Distributed Memory (PV-DM)**
This is the Paragraph Vector model analogous to Word2Vec CBOW. The doc-vectors are obtained by training a neural network on the synthetic task of predicting a center word based on an average of both context word-vectors and the full document's doc-vector.

**Paragraph Vector - Distributed Bag of Words (PV-DBOW)**
This is the Paragraph Vector model analogous to Word2Vec SG. The doc-vectors are obtained by training a neural network on the synthetic task of predicting a target word just from the full document's doc-vector. (It is also common to combine this with skip-gram testing, using both the doc-vector and nearby word-vectors to predict a single target word, but only one at a time.)
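
The difference between the two modes comes down to what is fed into the prediction layer. The toy forward pass below is only a sketch under simplifying assumptions: the vocabulary, the random vectors and the full softmax are illustrative, whereas the models trained later in this notebook use negative sampling instead of a full softmax.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["the", "movie", "was", "great"]          # hypothetical toy vocabulary
dim = 8
word_vecs = rng.normal(size=(len(vocab), dim))    # input word-vectors
doc_vec = rng.normal(size=dim)                    # the document's own doc-vector
out_weights = rng.normal(size=(len(vocab), dim))  # output layer (one row per word)

def softmax(scores):
    exp = np.exp(scores - scores.max())
    return exp / exp.sum()

# PV-DM: average the doc-vector with the context word-vectors,
# then predict the centre word ("was") from that average.
context_ids = [vocab.index(w) for w in ["movie", "great"]]
h_dm = np.mean(np.vstack([doc_vec, word_vecs[context_ids]]), axis=0)
p_dm = softmax(out_weights @ h_dm)

# PV-DBOW: predict a word of the document from the doc-vector alone.
p_dbow = softmax(out_weights @ doc_vec)

print(p_dm.shape, p_dbow.shape)  # (4,) (4,)
```

In both cases the training signal flows back into `doc_vec`, which is what makes it a usable document representation afterwards.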

### Requirements

The following python modules are dependencies for this tutorial:
* testfixtures ( `pip install testfixtures` )
* statsmodels ( `pip install statsmodels` )

Let's download the IMDB archive if it is not already downloaded (84 MB). This will be our text data for this tutorial.   
The data can be found here: http://ai.stanford.edu/~amaas/data/sentiment/

This cell will only reattempt steps (such as downloading the compressed data) if their output isn't already present, so it is safe to re-run until it completes successfully.

In [None]:
# !pip install testfixtures
# !pip install statsmodels

### Data preparation

In [None]:
%%time

import locale
import glob
import os.path
import requests
import tarfile
import sys
import codecs
import re

dirname = 'aclImdb'
filename = 'aclImdb_v1.tar.gz'
locale.setlocale(locale.LC_ALL, 'C')
all_lines = []

# U+0085 (NEL, "next line") is replaced with a space during clean-up
control_chars = [chr(0x85)]

# Convert text to lower-case and pad punctuation with spaces
def normalize_text(text):
    norm_text = text.lower()
    # Replace breaks with spaces
    norm_text = norm_text.replace('<br />', ' ')
    # Pad punctuation with spaces on both sides
    norm_text = re.sub(r"([\.\",\(\)!\?;:])", " \\1 ", norm_text)
    return norm_text


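# Download IMDB archive (skipped if the file is already present)
if not os.path.isfile(filename):
    print("Downloading IMDB archive...")
    url = 'http://ai.stanford.edu/~amaas/data/sentiment/' + filename
    r = requests.get(url, stream=True)
    r.raise_for_status()
    with open(filename, 'wb') as f:
        # write in 1 MB chunks instead of holding the whole archive in memory
        for chunk in r.iter_content(chunk_size=1 << 20):
            f.write(chunk)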

Downloading IMDB archive...
CPU times: user 136 ms, sys: 370 ms, total: 507 ms
Wall time: 7.8 s


In [None]:
!tar -xvf aclImdb_v1.tar.gz

In [None]:
from smart_open import smart_open

# Collect & normalize test/train data
print("Cleaning up dataset...")
folders = ['train/pos', 'train/neg', 'test/pos', 'test/neg', 'train/unsup']
for fol in folders:
    temp = u''
    newline = "\n".encode("utf-8")
    output = fol.replace('/', '-') + '.txt'
    # Is there a better pattern to use?
    txt_files = glob.glob(os.path.join(dirname, fol, '*.txt'))
    print(" %s: %i files" % (fol, len(txt_files)))
    with smart_open(os.path.join(dirname, output), "wb") as n:
        for i, txt in enumerate(txt_files):
            with smart_open(txt, "rb") as t:
                one_text = t.read().decode("utf-8")
                for c in control_chars:
                    one_text = one_text.replace(c, ' ')
                one_text = normalize_text(one_text)
                all_lines.append(one_text)
                n.write(one_text.encode("utf-8"))
                n.write(newline)

# Save to disk for instant re-use on any future runs
with smart_open('alldata-id.txt', 'wb') as f:
    for idx, line in enumerate(all_lines):
        num_line = u"_*{0} {1}\n".format(idx, line)
        f.write(num_line.encode("utf-8"))

assert os.path.isfile("alldata-id.txt"), "alldata-id.txt unavailable"
print("Success, alldata-id.txt is available for next steps.")

Cleaning up dataset...
 train/pos: 12500 files
 train/neg: 12500 files
 test/pos: 12500 files
 test/neg: 12500 files
 train/unsup: 50000 files
Success, alldata-id.txt is available for next steps.


The text data is small enough to be read into memory.

In [None]:
%%time

import gensim
from gensim.models.doc2vec import TaggedDocument
from collections import namedtuple

# this data object class suffices as a `TaggedDocument` (with `words` and `tags`)
# plus adds other state helpful for our later evaluation/reporting
SentimentDocument = namedtuple('SentimentDocument', 'words tags split sentiment')

alldocs = []
with smart_open('alldata-id.txt', 'rb', encoding='utf-8') as alldata:
    for line_no, line in enumerate(alldata):
        tokens = gensim.utils.to_unicode(line).split()
        words = tokens[1:]
        tags = [line_no] # 'tags = [tokens[0]]' would also work at extra memory cost
        split = ['train', 'test', 'extra', 'extra'][line_no//25000]  # 25k train, 25k test, 50k extra
        sentiment = [1.0, 0.0, 1.0, 0.0, None, None, None, None][line_no//12500] # [12.5K pos, 12.5K neg]*2 then unknown
        alldocs.append(SentimentDocument(words, tags, split, sentiment))

train_docs = [doc for doc in alldocs if doc.split == 'train']
test_docs = [doc for doc in alldocs if doc.split == 'test']

print('%d docs: %d train-sentiment, %d test-sentiment' % (len(alldocs), len(train_docs), len(test_docs)))

100000 docs: 25000 train-sentiment, 25000 test-sentiment
CPU times: user 3.11 s, sys: 879 ms, total: 3.99 s
Wall time: 4 s
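
The labels above come purely from each line's position in `alldata-id.txt`. A small sketch of that index arithmetic, using a hypothetical helper `split_and_sentiment` that is not part of the notebook:

```python
def split_and_sentiment(line_no):
    """Map a line number in alldata-id.txt to its (split, sentiment) label."""
    split = ['train', 'test', 'extra', 'extra'][line_no // 25000]
    sentiment = [1.0, 0.0, 1.0, 0.0, None, None, None, None][line_no // 12500]
    return split, sentiment

for n in (0, 12500, 25000, 37500, 50000, 99999):
    print(n, split_and_sentiment(n))
```

This only works because the files were written in the fixed folder order train/pos, train/neg, test/pos, test/neg, train/unsup.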


Because the native document order has similar-sentiment documents in large clumps – which is suboptimal for training – we work with a once-shuffled copy of the full document list.

In [None]:
from random import shuffle
doc_list = alldocs[:]
shuffle(doc_list)

### Model setup

We approximate the experiment of Le & Mikolov ["Distributed Representations of Sentences and Documents"](http://cs.stanford.edu/~quocle/paragraph_vector.pdf) with guidance from Mikolov's [example go.sh](https://groups.google.com/d/msg/word2vec-toolkit/Q49FIrNOQRo/J6KG8mUj45sJ):

`./word2vec -train ../alldata-id.txt -output vectors.txt -cbow 0 -size 100 -window 10 -negative 5 -hs 0 -sample 1e-4 -threads 40 -binary 0 -iter 20 -min-count 1 -sentence-vectors 1`

We vary the following parameter choices:
* 100-dimensional vectors, as the 400-d vectors of the paper take a lot of memory and, in our tests of this task, don't seem to offer much benefit
* Similarly, frequent word subsampling seems to decrease sentiment-prediction accuracy, so it's left out
* `cbow=0` means skip-gram which is equivalent to the paper's 'PV-DBOW' mode, matched in gensim with `dm=0`
* Added to that DBOW model are two DM models, one which averages context vectors (`dm_mean`) and one which concatenates them (`dm_concat`, resulting in a much larger, slower, more data-hungry model)
* A `min_count=2` saves quite a bit of model memory, discarding only words that appear in a single doc (and are thus no more expressive than the unique-to-each doc vectors themselves)

In [None]:
%%time
from gensim.models import Doc2Vec
import gensim.models.doc2vec
from collections import OrderedDict
import multiprocessing

cores = multiprocessing.cpu_count()
assert gensim.models.doc2vec.FAST_VERSION > -1, "This will be painfully slow otherwise"

simple_models = [
    # PV-DBOW plain
    Doc2Vec(dm=0, vector_size=100, negative=5, hs=0, min_count=2, sample=0,
            epochs=20, workers=cores),
    # PV-DM w/ default averaging; a higher starting alpha may improve CBOW/PV-DM modes
    Doc2Vec(dm=1, vector_size=100, window=10, negative=5, hs=0, min_count=2, sample=0,
            epochs=20, workers=cores, alpha=0.05, comment='alpha=0.05'),
    # PV-DM w/ concatenation - big, slow, experimental mode
    # window=5 (both sides) approximates paper's apparent 10-word total window size
    Doc2Vec(dm=1, dm_concat=1, vector_size=100, window=5, negative=5, hs=0, min_count=2, sample=0,
            epochs=20, workers=cores),
]

for model in simple_models:
    model.build_vocab(alldocs)
    print("%s vocabulary scanned & state initialized" % model)

models_by_name = OrderedDict((str(model), model) for model in simple_models)

Doc2Vec(dbow,d100,n5,mc2,t2) vocabulary scanned & state initialized
Doc2Vec("alpha=0.05",dm/m,d100,n5,w10,mc2,t2) vocabulary scanned & state initialized
Doc2Vec(dm/c,d100,n5,w5,mc2,t2) vocabulary scanned & state initialized
CPU times: user 2min 4s, sys: 2.5 s, total: 2min 7s
Wall time: 2min 5s


Le and Mikolov note that combining a paragraph vector from Distributed Bag of Words (DBOW) and Distributed Memory (DM) improves performance. We will follow, pairing the models together for evaluation. Here, we concatenate the paragraph vectors obtained from each model with the help of a thin wrapper class included in a gensim test module. (Note that this is a separate, later concatenation of output-vectors than the kind of input-window-concatenation enabled by the `dm_concat=1` mode above.)

In [None]:
!pip install testfixtures

In [None]:
# make sure: !pip install testfixtures

from gensim.test.test_doc2vec import ConcatenatedDoc2Vec
models_by_name['dbow+dmm'] = ConcatenatedDoc2Vec([simple_models[0], simple_models[1]])
models_by_name['dbow+dmc'] = ConcatenatedDoc2Vec([simple_models[0], simple_models[2]])
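
Conceptually, pairing two models means joining their doc-vectors for the same document end to end. The sketch below shows only that idea with made-up 3-d vectors; it does not use or reproduce the gensim wrapper class.

```python
import numpy as np

# Hypothetical doc-vectors for the same document from two trained models.
dbow_vec = np.array([0.1, 0.2, 0.3])
dm_vec = np.array([0.9, 0.8, 0.7])

combined = np.concatenate([dbow_vec, dm_vec])
print(combined.shape)  # (6,)
```

With two 100-dimensional models, the classifier therefore sees 200 features per document.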

### Model training

Let's define some helper methods for evaluating the performance of our Doc2vec using paragraph vectors. We will classify document sentiments using a logistic regression model based on our paragraph embeddings. We will compare the error rates based on word embeddings from our various Doc2vec models.

In [None]:
import numpy as np
import statsmodels.api as sm
from random import sample

def logistic_predictor_from_data(train_targets, train_regressors):
    """Fit a statsmodel logistic predictor on supplied data"""
    logit = sm.Logit(train_targets, train_regressors)
    predictor = logit.fit(disp=0)
    # print(predictor.summary())
    return predictor

def error_rate_for_model(test_model, train_set, test_set,
                         reinfer_train=False, reinfer_test=False,
                         infer_steps=None, infer_alpha=None, infer_subsample=0.2):
    """Report error rate on test_doc sentiments, using supplied model and train_docs"""

    train_targets = [doc.sentiment for doc in train_set]
    if reinfer_train:
        train_regressors = [test_model.infer_vector(doc.words, steps=infer_steps, alpha=infer_alpha) for doc in train_set]
    else:
        train_regressors = [test_model.docvecs[doc.tags[0]] for doc in train_set]
    train_regressors = sm.add_constant(train_regressors)
    predictor = logistic_predictor_from_data(train_targets, train_regressors)

    test_data = test_set
    if reinfer_test:
        if infer_subsample < 1.0:
            test_data = sample(test_data, int(infer_subsample * len(test_data)))
        test_regressors = [test_model.infer_vector(doc.words, steps=infer_steps, alpha=infer_alpha) for doc in test_data]
    else:
        test_regressors = [test_model.docvecs[doc.tags[0]] for doc in test_data]
    test_regressors = sm.add_constant(test_regressors)

    # Predict & evaluate
    test_predictions = predictor.predict(test_regressors)
    corrects = sum(np.rint(test_predictions) == [doc.sentiment for doc in test_data])
    errors = len(test_predictions) - corrects
    error_rate = float(errors) / len(test_predictions)
    return (error_rate, errors, len(test_predictions), predictor)
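
The final scoring step rounds each predicted probability to 0 or 1 and compares it with the true label. A worked example with made-up probabilities and labels:

```python
import numpy as np

# Hypothetical predicted probabilities of "positive" and the true labels.
predicted = np.array([0.91, 0.35, 0.62, 0.08, 0.49])
actual = np.array([1.0, 0.0, 0.0, 0.0, 1.0])

corrects = int(np.sum(np.rint(predicted) == actual))
errors = len(predicted) - corrects
error_rate = errors / len(predicted)
print(corrects, errors, error_rate)  # 3 2 0.4
```

Rounding with `np.rint` amounts to a fixed decision threshold of 0.5.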

Note that doc-vector training is occurring on all documents of the dataset, which includes all TRAIN/TEST/DEV docs.

We evaluate each model's sentiment predictive power based on error rate, and the evaluation is done for each model.

(On a 4-core 2.6 GHz Intel Core i7, these 20 passes of training and evaluating the 3 main models take about an hour.)

In [None]:
from collections import defaultdict
error_rates = defaultdict(lambda: 1.0)  # To selectively print only best errors achieved

In [None]:
for model in simple_models:
    print("Training %s" % model)
    %time model.train(doc_list, total_examples=len(doc_list), epochs=model.epochs)

    print("\nEvaluating %s" % model)
    %time err_rate, err_count, test_count, predictor = error_rate_for_model(model, train_docs, test_docs)
    error_rates[str(model)] = err_rate
    print("\n%f %s\n" % (err_rate, model))



Training Doc2Vec(dbow,d100,n5,mc2,t2)


KeyboardInterrupt: ignored


Evaluating Doc2Vec(dbow,d100,n5,mc2,t2)


KeyboardInterrupt: ignored




0.104480 Doc2Vec(dbow,d100,n5,mc2,t2)

Training Doc2Vec("alpha=0.05",dm/m,d100,n5,w10,mc2,t2)


KeyboardInterrupt: ignored


Evaluating Doc2Vec("alpha=0.05",dm/m,d100,n5,w10,mc2,t2)




CPU times: user 994 ms, sys: 283 ms, total: 1.28 s
Wall time: 691 ms

0.385720 Doc2Vec("alpha=0.05",dm/m,d100,n5,w10,mc2,t2)

Training Doc2Vec(dm/c,d100,n5,w5,mc2,t2)



Evaluating Doc2Vec(dm/c,d100,n5,w5,mc2,t2)
CPU times: user 1.23 s, sys: 321 ms, total: 1.56 s
Wall time: 805 ms

0.453640 Doc2Vec(dm/c,d100,n5,w5,mc2,t2)



In [None]:
for model in [models_by_name['dbow+dmm'], models_by_name['dbow+dmc']]:
    print("\nEvaluating %s" % model)
    %time err_rate, err_count, test_count, predictor = error_rate_for_model(model, train_docs, test_docs)
    error_rates[str(model)] = err_rate
    print("\n%f %s\n" % (err_rate, model))


Evaluating Doc2Vec(dbow,d100,n5,mc2,t2)+Doc2Vec("alpha=0.05",dm/m,d100,n5,w10,mc2,t2)



0.457480 Doc2Vec(dbow,d100,n5,mc2,t2)+Doc2Vec("alpha=0.05",dm/m,d100,n5,w10,mc2,t2)


Evaluating Doc2Vec(dbow,d100,n5,mc2,t2)+Doc2Vec(dm/c,d100,n5,w5,mc2,t2)
CPU times: user 2.19 s, sys: 549 ms, total: 2.74 s
Wall time: 1.55 s

0.104480 Doc2Vec(dbow,d100,n5,mc2,t2)+Doc2Vec(dm/c,d100,n5,w5,mc2,t2)



### Evaluation

In [None]:
# Compare error rates achieved, best-to-worst
print("Err_rate Model")
for rate, name in sorted((rate, name) for name, rate in error_rates.items()):
    print(rate, name)

In our testing, contrary to the results of the paper, PV-DBOW alone performs as well as anything else on this problem. Concatenating vectors from different models only sometimes offers a tiny predictive improvement, and generally stays close to the best-performing solo model included.
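To make "concatenating vectors" concrete, here is a minimal NumPy sketch: each document's vector from one model is joined end to end with its vector from another, giving a longer feature vector for the classifier. The arrays are random stand-ins for trained doc-vectors, and the variable names are assumptions for illustration; this section does not show how the combined models are built.

```python
import numpy as np

# Random stand-ins for doc-vectors from two trained models (4 docs, 100 dims each)
rng = np.random.default_rng(0)
dbow_vecs = rng.normal(size=(4, 100))
dm_vecs = rng.normal(size=(4, 100))

# Join the two representations per document: row i is [dbow_i, dm_i]
combined = np.hstack([dbow_vecs, dm_vecs])
print(combined.shape)
```

The combined matrix has one row per document and 200 columns, so a downstream classifier sees both representations at once.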

The best results achieved here are just around 10% error rate, still a long way from the paper's reported 7.42% error rate.

(Other trials not shown, with larger vectors and other changes, also don't come close to the paper's reported value. Others around the net have reported a similar inability to reproduce the paper's best numbers. The PV-DM/C mode improves a bit with many more training epochs – but doesn't reach parity with PV-DBOW.)

Try a pretrained Doc2Vec model yourself! Pretrained models can be found [here](https://github.com/jhlau/doc2vec).