<img src="Images/header.png" width="1200" >

**Welcome!**

In this workshop we start with an introduction to the numerical representation of text and to word embeddings. We then look at the Word2Vec and Doc2Vec algorithms and see how to use them in an ML task.

Please note that the main purpose of this workshop is to familiarize a beginner ML user with these concepts, not to show the most efficient (or most Pythonic) way to write the code. Much of the code you'll see here could be written in a better or smarter way, but the result might look more complex to a non-expert Python coder. That's why I intentionally avoided classes, decorators and similar tools.

To run the practical examples I use the Simpsons [dataset from Kaggle](https://www.kaggle.com/pierremegret/dialogue-lines-of-the-simpsons/download), which consists of more than 150k lines and covers more than 600 (old!) episodes.

## Requirements

This notebook is written in Python3 and needs the following libraries:

* **pandas** _(pip install pandas)_
* **numpy** _(python -m pip install --user numpy)_
* **gensim** _(pip install --upgrade gensim)_
* **tqdm** _(pip install tqdm)_
* **matplotlib** _(python -m pip install -U matplotlib)_
* **sklearn** _(pip install -U scikit-learn)_
* **SpaCy** _(python3 -m pip install spacy)_ --> python -m spacy download en
* **testfixtures** _(pip install testfixtures)_

To check which libraries you already have and which ones you still need to install, run the cell titled **Loading libraries** and install whatever is reported missing.

## What is word-embedding?

Sometime between 50,000 years and 2 million years ago, humans started to talk and what we know today as "languages" is the result of this long process. 

The reason for it is rather simple: humans needed to communicate about the world around them, which directly impacts their lives. So they created "representations" (which are not limited to words) to transfer an idea, concept or meaning to another human. In the beginning these words and sounds were just independent symbols, but as language evolved, words became not only symbols but also a way to link concepts together. This process took thousands of years, and during this period the human brain became able to build a model of reality: it does not just assign each word (symbol) to a single concept or idea, it also understands how symbols relate to one another. So when I ask you ``give me a pair of scissors to cut this rope`` and you can't find scissors but see a knife, you will probably bring me the knife, because in your brain's model of reality scissors are used to cut, just like a knife.

At first glance this may seem trivial, but replicating this phenomenon outside the human brain is not so simple! Before going deeper into the available options, let’s review how the human brain deals with words:

<img src="Images/human_process.png" height="300" width="700">

To better grasp the concept of *creating representations*, let's take a look at a simple example:

Imagine we have a bunch of apples and, for some reason, we want to find out which apples are similar to each other and which are not:

<img src="Images/mele.png" height="300" width="700">

How, in your opinion, can we represent an apple?

Let's say we decide to *represent* each one using these characteristics:
* Color
* Height
* Perimeter

Now it's possible to place them in a space, which means we have just created a mathematical representation of our apples!

<img src="Images/coor.jpeg" width="900">
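
To make the idea concrete, here is a minimal sketch in which each apple becomes a point in a 3-dimensional space. The measurements, the 0 to 1 colour scale and the function name are made up for illustration; the Euclidean distance is just one possible way to compare two points.

```python
import math

# Hypothetical measurements:
# (colour on a 0-1 green-to-red scale, height in cm, perimeter in cm)
apples = {
    "apple_1": (0.9, 7.0, 22.0),
    "apple_2": (0.8, 7.2, 23.0),
    "apple_3": (0.1, 5.0, 16.0),
}


def euclidean_distance(a, b):
    """Straight-line distance between two points of equal dimension."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


d_12 = euclidean_distance(apples["apple_1"], apples["apple_2"])
d_13 = euclidean_distance(apples["apple_1"], apples["apple_3"])

# a smaller distance means "more similar" under this representation
print(f"apple_1 vs apple_2: {d_12:.2f}")
print(f"apple_1 vs apple_3: {d_13:.2f}")
```

Note that the features live on different scales, so the perimeter dominates the distance here. In a real task you would usually rescale the features first.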

OK, now imagine that we have to do the same thing as a computer algorithm, starting from concepts and finishing with a model that represents the real world (a language model)... not so easy, right? ;)
Luckily we don't have to start from the very first step, mapping concepts and ideas to symbols. Human beings have done that for us over thousands of years!
Unfortunately, having words is not enough to create our model. We also need something called word embedding, one of the methods used to create a numerical representation of textual data. Why do we need such a representation, you ask? Well, to find out, just try to give the following paragraph to a classification algorithm as it is, as plain text!

``Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.``

I think you now agree that we need a numerical representation instead ;)
Is Word2Vec the only way to perform word embedding? Of course not!

Actually there are two main categories of methods:

* __Frequency Based Methods__
    * Count Vectors
    * TF-IDF
    * Co-Occurrence Vectors
    * ...
* __Prediction Based Methods__
    * Word2Vec
    * GloVe
    * Tyrion
    * ...

_Prediction Based Methods are both more sophisticated and more computationally expensive than Frequency Based Methods._
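
As a point of comparison, this is roughly what the simplest frequency-based method, count vectors, looks like. It is a minimal sketch on two invented sentences, using only the standard library; `count_vector` is a name chosen for this example.

```python
from collections import Counter

# two toy documents, invented for this example
docs = ["homer eats a donut", "bart eats a donut and a burger"]
tokenized = [doc.split() for doc in docs]

# the vocabulary fixes the meaning of each position in the vector
vocabulary = sorted({token for doc in tokenized for token in doc})


def count_vector(tokens, vocabulary):
    """Number of occurrences of every vocabulary word in one document."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]


vectors = [count_vector(doc, vocabulary) for doc in tokenized]

print(vocabulary)
for doc, vector in zip(docs, vectors):
    print(vector, "<-", doc)
```

Every document gets a vector as long as the vocabulary, and most entries are zero. That sparsity is one of the reasons dense embeddings such as Word2Vec are attractive.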

## What is the expected outcome of a Word2Vec model?

In a nutshell, a Vector!

To be more precise, a dense numerical vector with a pre-defined length.

<img src="Images/word2vec.png" height="300" width="700">

Simply put, Word2Vec is just a simple neural network that performs a binary classification and uses the learned classifier weights as the word embeddings. All the classifier does is return the probability of ___c___ being a context word for a target word ___t___. That's it!

A Word2Vec model, internally uses one of the following algorithms to create the embeddings:

* Continuous Bag Of Words (CBOW)
* Skip-Gram (SG)

## General mechanism of Word2Vec

As we said earlier, Word2Vec uses a neural network to generate a corresponding vector for each word in the corpus. To understand the details of such an algorithm you need a basic understanding of neural networks.
(In case you don't feel confident about NNs, take a look at [this](https://www.youtube.com/watch?v=aircAruvnKk&pbjreload=10) or [this](https://www.youtube.com/watch?v=BR9h47Jtqyw&t=1342s) video.)

<img src="Images/NN_1.png" height="300" width="800">

An extremely simplified version of the Word2Vec algorithm would be:
* set the first word of the corpus as the target
* based on the defined window size, set the neighbors as the context of the target word
* select random words from the neighbors based on their vicinity to the target
* for each neighbor and target word, create a fully connected NN with one hidden layer and no activation function. The task of the NN is to decide whether a given word of the corpus could be a context word for the target or not
* move on to the next word of the corpus and repeat
* the weight matrix of the NN's hidden layer holds the word vectors: each row is the vector of one word.
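
The "decide whether a word could be a context word" step can be sketched as follows. This assumes the negative-sampling view of Word2Vec, where the score of a (target, context) pair is the dot product of their vectors squashed by a sigmoid. The 3-dimensional vectors are invented, real ones are learned during training.

```python
import math


def sigmoid(x):
    """Maps any real number to the (0, 1) interval."""
    return 1.0 / (1.0 + math.exp(-x))


def context_probability(target_vec, context_vec):
    """Probability that context_vec belongs to a context word of target_vec."""
    score = sum(t * c for t, c in zip(target_vec, context_vec))
    return sigmoid(score)


# toy vectors, invented for this example
target = [0.5, 1.0, -0.5]
likely_context = [0.4, 0.9, -0.6]     # points in a similar direction
unlikely_context = [-0.5, -1.0, 0.5]  # points in the opposite direction

p_likely = context_probability(target, likely_context)
p_unlikely = context_probability(target, unlikely_context)
print(f"likely context:   {p_likely:.3f}")
print(f"unlikely context: {p_unlikely:.3f}")
```

Training nudges the vectors so that real (target, context) pairs get a probability close to 1 and randomly drawn pairs get a probability close to 0.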

The following example shows a corpus with a vocabulary of size V and a window size of 2:
<img src="Images/NN_2.png" height="700" width="700">

<img src="Images/diff.png" height="700" width="700">

### What is the difference between CBOW and SG?

__CBOW__: The input to the model is $w_{i-2}, w_{i-1}, w_{i+1}, w_{i+2}$, the preceding and following words of the target word. The output of the neural network is $w_i$, so the task is __"predicting the word given its context"__.


__Skip-gram__: The input to the model is $w_i$, and the output is $w_{i-2}, w_{i-1}, w_{i+1}, w_{i+2}$. So the task here is __"predicting the context given a word"__.

<img src="Images/cbow_skipgram.png" height="300" width="800">
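
The difference is easier to see on the training examples themselves. The sketch below builds both kinds of examples from one invented sentence with a fixed window of 2. `training_examples` is a helper written for this illustration, not part of gensim.

```python
def training_examples(tokens, window=2):
    """Builds CBOW and skip-gram training examples from one sentence.

    CBOW:      (list of context words, target word)
    Skip-gram: (target word, one context word), one pair per context word
    """
    cbow, skipgram = [], []
    for i, target in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        cbow.append((context, target))
        skipgram.extend((target, c) for c in context)
    return cbow, skipgram


sentence = ["homer", "eats", "a", "pink", "donut"]
cbow, skipgram = training_examples(sentence, window=2)

print("CBOW example:      ", cbow[2])
print("Skip-gram examples:", [pair for pair in skipgram if pair[0] == "a"])
```

CBOW produces one example per target word, while skip-gram produces one example per (target, context) pair. This fits the observation that skip-gram is slower to train.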

_Note: as a hyperparameter you define the maximum window size, but the algorithm won't necessarily use all the words in the window. Instead it randomly chooses some of them based on their distance from the target word, so that closer words have a higher chance of being picked._
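
One simple way to get this behaviour is to shrink the window to a random size between 1 and the maximum for every target word. The sketch below assumes that scheme (to my knowledge it is what the reference implementations do, but treat it as an assumption). Both function names are made up for this example.

```python
import random


def sample_context(tokens, i, max_window, rng):
    """Context of tokens[i] using a randomly shrunk window."""
    effective_window = rng.randint(1, max_window)  # uniform in [1, max_window]
    left = tokens[max(0, i - effective_window):i]
    right = tokens[i + 1:i + 1 + effective_window]
    return left + right


def inclusion_probability(distance, max_window):
    """Chance that a word `distance` positions away ends up in the context."""
    if distance < 1 or distance > max_window:
        return 0.0
    return (max_window - distance + 1) / max_window


rng = random.Random(0)
tokens = ["w1", "w2", "w3", "w4", "w5", "w6", "w7"]
print(sample_context(tokens, 3, max_window=3, rng=rng))

for d in range(1, 4):
    print(f"distance {d}: {inclusion_probability(d, 3):.2f}")
```

Direct neighbours are always included, while a word at the edge of the maximum window is included only when the largest window size happens to be drawn.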

### Which one should I use?

According to the author of Word2Vec:

* __Skip-gram__: works well with small amount of the training data, represents well even rare words or phrases. 
* __CBOW__: several times faster to train than the skip-gram, slightly better accuracy for the frequent words.

So it depends on your data and your computational power.

## Creating a Word2Vec model

The following diagram shows the general process we will follow in this workshop:
<img src="Images/process_2.png" height="300" width="1100">

### Loading libraries

In [None]:
import numpy as np
import spacy
import re
import os
import pandas as pd
import itertools
from tqdm import tqdm_notebook
import matplotlib.pyplot as plt
import warnings

import gensim.parsing as gm
from gensim.parsing.preprocessing import preprocess_string
from gensim.models.phrases import Phrases, Phraser
from gensim.models import Word2Vec, Doc2Vec
from gensim.models.doc2vec import TaggedDocument
from gensim.test.test_doc2vec import ConcatenatedDoc2Vec

from sklearn.manifold import TSNE
from sklearn.decomposition import PCA
from sklearn import utils
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score

### Defining some parameters

In [None]:
plt.rcParams['figure.figsize'] = (15, 10)
warnings.filterwarnings('ignore')
cwd = os.getcwd()

### Loading data from CSV file

In [None]:
df = pd.read_csv(f'{cwd}/Datasets/simpsons_dataset.csv',
                sep=',').dropna().reset_index(drop=True)
df.sample(10)

### Pre-Processing

Just like in any other ML task, we start by pre-processing the data. Let's use gensim's text processing tools to define a function for it. Keep in mind that this function is written for this specific dataset: for example, it doesn't remove URLs, since we know they won't be present in our data. As you can see, I commented out the stem_text function because we later use SpaCy to lemmatize the text. (More on lemmatization:
[Lemmatization Approaches with Examples in Python](https://www.machinelearningplus.com/nlp/lemmatization-examples-python/) )

In [None]:
def text_preproc(text: str):
    """Performs pre-processing steps on the given string.
    
    Pre-processing is done using methods from gensim module.
    Methods are hard-coded in the function.
    
    Args:
        text: A string
    
    Returns:
        A list with tokenized and Pre-processed tokens of the given string.
    """

    my_filter = [
        lambda x: x.lower(),
        gm.strip_punctuation,
        gm.strip_multiple_whitespaces,
        gm.strip_numeric,
        gm.remove_stopwords,
        gm.strip_short,
        gm.strip_tags,
        #gm.stem_text
    ]
    return preprocess_string(text, filters=my_filter)

In [None]:
# grabbing the column which contains the text
text_column = 'spoken_words'
data = df[text_column]

# passing texts to the function we defined above
training_set_raw = []
for d in tqdm_notebook(data, desc='Pre-Processing: Cleaning'):
    training_set_raw.append((text_preproc(d)))

# initialize spacy 'en' model
nlp = spacy.load('en', disable=['parser', 'ner'])

# creating sentences from tokens
training_sentence = [' '.join(d) for d in training_set_raw]

# lemmatizing sentences
training_set_0 = []
for ts in tqdm_notebook(training_sentence, desc='Pre-Processing: Lemmatizing'):
    doc = nlp(ts)
    lemm = [token.lemma_ for token in doc]
    # keep only non-empty results ('is not 0' compares identity, not value)
    if len(lemm) != 0:
        training_set_0.append(lemm)
        
# removing duplicated data
training_set = []
for l in tqdm_notebook(training_set_0, desc='Pre-Processing: Deduplicating'):
    if l not in training_set:
        training_set.append(l)

In [None]:
# let's see the result of our pre-processing for the first 5 sentences
for i,j in zip(data[:5], training_set[:5]):
    print(f"-{i}\n-{' '.join(j)}")
    print('-'*40)

### N-Grams

In this particular case we want to embed words using their neighbours, i.e. the words that frequently appear (or don't appear) close to them. Imagine that you want to calculate the occurrence probability of a sequence of three words: 
$$P(w_{1},w_{2},w_{3})$$
We can write it in this way:
$$P(w_{1},w_{2},w_{3}) = P(w_{1})P(w_{2}\mid w_{1})P(w_{3}\mid w_{1} w_{2})$$

The general form of this formula is named Chain Rule:

$$P(w_{1},w_{2},w_{3},...,w_{n}) = P(w_{1})P(w_{2}\mid w_{1})...P(w_{n}\mid w_{1}...w_{n-1})$$

or, more compactly:

$$P(w_{1},w_{2},...,w_{n})=\prod_{i=1}^{n} P(w_{i}\mid w_{1} w_{2}...w_{i-1})$$

_If you're interested in this topic, take a look at [this video](https://www.youtube.com/watch?v=dkUtavsPqNA)._
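
Here is a small worked example of the chain rule for three words, with every probability estimated from raw counts. The corpus is invented and tiny, and the helper names are chosen for this sketch. Note that it does no smoothing, so an unseen word sequence would cause a division by zero.

```python
from collections import Counter

# toy corpus, invented for this example
corpus = "homer eats donuts homer eats pizza bart eats pizza".split()


def ngram_counts(tokens, n):
    """Counts every run of n consecutive tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


unigrams = ngram_counts(corpus, 1)
bigrams = ngram_counts(corpus, 2)
trigrams = ngram_counts(corpus, 3)


def chain_rule_probability(w1, w2, w3):
    """P(w1, w2, w3) = P(w1) * P(w2 | w1) * P(w3 | w1 w2), from counts."""
    p_w1 = unigrams[(w1,)] / len(corpus)
    p_w2_given_w1 = bigrams[(w1, w2)] / unigrams[(w1,)]
    p_w3_given_w1_w2 = trigrams[(w1, w2, w3)] / bigrams[(w1, w2)]
    return p_w1 * p_w2_given_w1 * p_w3_given_w1_w2


print(chain_rule_probability("homer", "eats", "pizza"))
```

Here "homer" accounts for 2 of the 9 tokens, it is always followed by "eats", and "homer eats" is followed by "pizza" in 1 of 2 cases, so the product is 2/9 · 1 · 1/2 = 1/9.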

OK, let's calculate the bigrams for our sentences. To do so we use the Phrases class from the Gensim library. According to its documentation, Phrases automatically detects common phrases (multi-word expressions, word n-grams) from a stream of sentences.

In [None]:
def create_ngrams(data: list, log: bool = True, n=2):
    """Learns bigrams (or trigrams, if n=3) from tokenized and cleaned data.
    
    Args:
        data: a list of sentences, each one a list of tokens.
        log: If True, it prints the status.
        n: 2 to detect bigrams, 3 to also detect trigrams.
        
    Returns : 
        A Gensim Phrases object.
    """
    if log:
        print("Learning lexicon from files...")

    if log:
        print("Creating Bigrams...")
    ngrams = Phrases(data, min_count=30)
    
    if log:
        print("Bigrams are Ready.")
    if n ==3:
        if log:
            print("Creating Trigrams...")
        ngrams = Phrases(ngrams[data], min_count=10, threshold=2)
        if log:
            print("Trigrams are ready to use.")

    return ngrams

In [None]:
ngram = create_ngrams(training_set, n=2)
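
How does Phrases decide that two words belong together? The sketch below reproduces, in plain Python, the default scoring formula as I understand it from the gensim documentation. The formula, the default threshold of 10.0 and all the counts used here are assumptions for illustration, so check the documentation of your gensim version.

```python
def phrase_score(count_a, count_b, count_ab, vocab_size, min_count):
    """Score of the candidate bigram 'a b' (higher = more phrase-like).

    count_a, count_b: how often each word appears on its own
    count_ab:         how often the two words appear next to each other
    vocab_size:       number of distinct words in the corpus
    min_count:        bigrams seen fewer times than this are discounted
    """
    return (count_ab - min_count) / count_a / count_b * vocab_size


THRESHOLD = 10.0  # assumed default threshold

# hypothetical counts, chosen only to show both outcomes
frequent_pair = phrase_score(100, 80, 60, vocab_size=5000, min_count=30)
rare_pair = phrase_score(100, 80, 31, vocab_size=5000, min_count=30)

print(frequent_pair, frequent_pair > THRESHOLD)
print(rare_pair, rare_pair > THRESHOLD)
```

A pair is merged into a single token (for example `mr_burns`) only when its score exceeds the threshold, so raising `min_count` or the threshold produces fewer phrases.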

In [None]:
def get_or_build_model(ngram: Phrases,
                       skipgram: int=0,
                       size: int=300,
                       window: int=2,
                       min_count: int=20,
                       alpha=0.03, 
                       negative=20,
                       hs: int=1,
                       ns_exponent = 0.05,
                       iter=10,
                       load: str=None,
                       save: str=None,
                       log: bool=True):
    
    """Builds a Word2Vec model or, if `load` is given, loads a saved one.

    Note: the training sentences are read from the global `training_set`.

    Args: 
        ngram: ngram object created by create_ngrams() function.
        skipgram: internal algorithm used to create the model. 0=CBOW, 1=skipgram. [default = 0]
        size: Dimensionality of the word vectors.
        window: Maximum distance between the current and predicted word within a sentence.
        min_count: Ignores all words with total frequency lower than this.
        alpha: The initial learning rate
        negative: If > 0, negative sampling is used with this many "noise words".
        hs: If 1, hierarchical softmax is used. If 0 and negative is non-zero, negative sampling is used. [default = 1]
        ns_exponent: Exponent used to shape the negative sampling distribution.
        iter: Number of iterations (epochs) over the corpus.
        load: file path to the pre-built model to be loaded. If None, model will be created [default = None]
        save: file path to be used to save the created model. If None, model won't be saved locally [default = None]
        log: whether log messages should be printed for the user [default = True]
    
    Returns:
        A word2vec model
    """
    if log:
        print("Creating ngrams...")
    sentences = [ngram[pair] for pair in tqdm_notebook(training_set)]

    if load is not None:
        if log:
            print("Loading model...")
        model = Word2Vec.load(f"./{load}")
        if log:
            print("Model has been loaded successfully.")
    else:
        if log:
            print("Building Word2Vec...")
        model = Word2Vec(
            sentences,
            size=size,
            window=window,
            min_count=min_count,
            sg=skipgram,
            hs=hs,
            alpha=alpha, 
            negative=negative,
            ns_exponent = ns_exponent,
            iter=iter)
        if log:
            print("Model has been built.")
        if save is not None:
            if log:
                print("Saving the model...")
            model.save(f"./{save}")
            if log:
                print("Model has been saved successfully.")

    return model

In [None]:
# creating Word2Vec model
model = get_or_build_model(ngram)

In [None]:
def n_most_similar(model: Word2Vec,
                   text: str,
                   n: int = 10):
    """Prints the n most similar words of the given string.

    Args:
        model : An object of Word2Vec class, created by get_or_build_model().
        text : the string whose similar words we want to get.
        n : number of similar words we want to get [default = 10]        
    """

    most_similar_to = model.wv.most_similar(positive=[text], topn=n)
    for i, similar_item in enumerate(most_similar_to):
        similar_key = similar_item[0]
        score = str(round(float(similar_item[1]), 2))
        print(f"{i+1} - {similar_key} -- {score}")


In [None]:
n_most_similar(model, text='bart', n=10)

In [None]:
model.wv.doesnt_match(["nelson", "bart", "milhouse"])

In [None]:
model.wv.doesnt_match(["bart", "lisa", "milhouse"])

### Cosine Similarity

Cosine similarity is the cosine of the angle between two vectors A and B: if they are orthogonal to each other (no similarity) the cosine is 0, and if they point in the same direction the cosine is 1.

$$similarity\,score = cos(\theta ) = \frac{A \cdot B}{\left \| A  \right \| \left \| B  \right \|} =\frac{\sum_{i=1}^{n}A_{i}B_{i}}{\sqrt{\sum_{i=1}^{n}A_{i}^{2}} \sqrt{\sum_{i=1}^{n}B_{i}^{2}}}$$

In [None]:
model.wv.n_similarity(['bart'], ['lisa'])
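
The formula is short enough to check by hand. This is a minimal standard-library sketch on invented 2-dimensional vectors. It is not how gensim computes the value internally.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))   # orthogonal vectors
print(cosine_similarity([1.0, 2.0], [2.0, 4.0]))   # same direction
print(cosine_similarity([1.0, 0.0], [-1.0, 0.0]))  # opposite direction
```

Because only the angle matters, a vector and a scaled copy of it are considered identical, while opposite vectors get a similarity of -1.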

### Visualization

How do you think we can visualize the results of our word-embedding model? The short answer is: we can't! :D
Simply because we are human beings and our brains can't picture an object with more than 3 dimensions! _(reminder: in our example we've created 300-dimensional vectors!)_

So what is the solution? Fortunately, back in 2008 a smart guy named _Laurens van der Maaten_ came up with a method that can reduce a high-dimensional model to a 2 or 3 dimensional space.
This technique is called __t-SNE__, which stands for __t-Distributed Stochastic Neighbor Embedding__. An explanation of t-SNE is way beyond the scope of this class, but if you're interested in the topic, here is a nice video that explains how it works: [link](https://www.youtube.com/watch?v=NEaUSP4YerM) 

_Note: Remember that getting a good result from t-SNE is not so easy! There are LOTS of parameters you can/should tune in the t-SNE function. [Here](https://distill.pub/2016/misread-tsne/) is a super-useful article with an interactive tool that helps you play with the parameters and discover their impact on the final result._

In [None]:
def plot_tsne(model,
              n_components: int=2,
              perplexity:int =20,
              learning_rate: int=10,
              n_iter: int=1000,
              metric: str='euclidean'):
    
    """t-SNE plot for word-embedding model.
    
    Args:
        n_components: Dimension of the embedded space.
        perplexity: The perplexity is related to the number of nearest neighbors that
            is used in other manifold learning algorithms. Larger datasets
            usually require a larger perplexity. 
        learning_rate: The learning rate for t-SNE is usually in the range [10.0, 1000.0].
            If the learning rate is too high, the data may look like a 'ball' with any
            point approximately equidistant from its nearest neighbours.
        n_iter: Maximum number of iterations for the optimization. Should be at least 250.
        metric: The metric to use when calculating distance between instances in a 
            feature array.
    """

    X = model.wv[model.wv.vocab]
    # dimension reduction from 300 to 50 with PCA
    X = PCA(n_components=50).fit_transform(X)

    tsne = TSNE(
        n_components=n_components,
        perplexity=perplexity,
        learning_rate=learning_rate,
        n_iter=n_iter,
        metric=metric)

    X_tsne = tsne.fit_transform(X)
    plt.figure(figsize=(13, 8))
    plt.scatter(X_tsne[:, 0], X_tsne[:, 1], s=2, alpha=0.4)
    plt.title('t-SNE plot', fontsize=20)
    plt.xticks([])
    plt.yticks([])
    plt.show()

In [None]:
plot_tsne(model)

In [None]:
def plot_top_n(model, word, n:int = 10):
    
    """t-SNE plot for top n similar words for the given input
    
    Args:
        model : An object of Word2Vec class, created by get_or_build_model().
        word : the string we want to plot similar words for
        n : number of similar words we want to get [default = 10]   
    """
    
    X = model.wv[model.wv.vocab]
    tsne = TSNE(n_components=2,
                  perplexity=20,
                  learning_rate=10,
                  n_iter=1000,
                  metric='euclidean')

    X_tsne = tsne.fit_transform(X)

    tsne_df = pd.DataFrame([X_tsne[:, 0], X_tsne[:, 1]]).T
    tsne_df.columns = ['X', 'Y']

    tsne_df['words'] = list(model.wv.vocab.keys())

    top = [x[0] for x in model.wv.most_similar(positive=[word], topn=n)]


    df_point = tsne_df[tsne_df.words == word].reset_index(drop=True)
    df_1 = tsne_df[tsne_df.words.isin(top)].reset_index(drop=True)
    plt.scatter(df_1.X, df_1.Y)
    plt.scatter(df_point.X, df_point.Y, c='r')
    for i in range(len(df_1)):
        plt.text(df_1.X[i], df_1.Y[i], df_1.words[i])
    plt.text(df_point.X[0], df_point.Y[0], df_point.words[0], horizontalalignment='center')
    plt.title(f'Top {n} words for {word}', fontsize=20)
    plt.show()

In [None]:
plot_top_n(model, 'homer')

## Doc2Vec Algorithm

Hopefully, now that you know the concept of word embeddings and the Word2Vec algorithm, it's much easier to talk about Doc2Vec.

Doc2Vec is a generalization of the Word2Vec algorithm which takes into account not only the context of words but also the context of the document as a whole.

In the Doc2Vec architecture, the two Word2Vec algorithms **“continuous bag of words” (CBOW)** and **“skip-gram” (SG)** correspond to **“distributed memory” (DM)** and **“distributed bag of words” (DBOW)** respectively.

The following figure shows how Doc2Vec integrates the document ID into the context:


<img src="Images/doc2vec.jpg" height="300" width="800">

### Data preparation

While in the Word2Vec example we used the embeddings to create a visualization, in the case of Doc2Vec we try to create a simple classifier which, given a sentence, predicts whose sentence it is!

To do so we need one further step: **Balancing the data**

In [None]:
# normalizing the names
df['raw_character_text'] = [x.lower().strip() for x in df['raw_character_text']]

In [None]:
df.raw_character_text.value_counts()[:4]

As you can see, **Lisa** has the fewest lines: 10756. One way to balance the data is to downsample the others to Lisa's level:

In [None]:
data_ho = df[df.raw_character_text == 'homer simpson'].sample(10756)
data_ma = df[df.raw_character_text == 'marge simpson'].sample(10756)
data_ba = df[df.raw_character_text == 'bart simpson'].sample(10756)
data_li = df[df.raw_character_text == 'lisa simpson']
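
The same idea can be written without hard-coding the size of the smallest class. The sketch below uses plain Python lists instead of a DataFrame. The helper name and the toy lines are invented for illustration.

```python
import random


def downsample_to_smallest(groups, seed=0):
    """Randomly shrinks every group to the size of the smallest one."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    target_size = min(len(items) for items in groups.values())
    return {name: rng.sample(items, target_size)
            for name, items in groups.items()}


# hypothetical toy data: speaker -> list of spoken lines
lines = {
    "homer simpson": [f"homer line {i}" for i in range(8)],
    "bart simpson": [f"bart line {i}" for i in range(5)],
    "lisa simpson": [f"lisa line {i}" for i in range(3)],
}

balanced = downsample_to_smallest(lines)
print({name: len(items) for name, items in balanced.items()})
```

Downsampling throws data away, which is acceptable here because even the smallest class still has more than ten thousand lines.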

Now we split the data of each character into train and test sets individually (3226 lines, about 30%, go to the test set) and then combine them.

In [None]:
data_ho_test = data_ho.sample(3226)
data_ho_train = data_ho[~data_ho.index.isin(data_ho_test.index)]

In [None]:
data_ma_test = data_ma.sample(3226)
data_ma_train = data_ma[~data_ma.index.isin(data_ma_test.index)]

In [None]:
data_ba_test = data_ba.sample(3226)
data_ba_train = data_ba[~data_ba.index.isin(data_ba_test.index)]

In [None]:
data_li_test = data_li.sample(3226)
data_li_train = data_li[~data_li.index.isin(data_li_test.index)]

In [None]:
# Training data
data_train = pd.concat([data_ho_train, data_ma_train, data_ba_train, 
                      data_li_train]).reset_index(drop=True)

In [None]:
# Testing data
data_test = pd.concat([data_ho_test, data_ma_test, data_ba_test, 
                      data_li_test]).reset_index(drop=True)

In [None]:
# Tagging each document with its label
train_tagged = data_train.apply(lambda x: TaggedDocument(words=text_preproc(x['spoken_words']), tags=[x.raw_character_text]), axis=1)
test_tagged = data_test.apply(lambda x: TaggedDocument(words=text_preproc(x['spoken_words']), tags=[x.raw_character_text]), axis=1)

In [None]:
# an example
train_tagged.values[42]

As we mentioned before, there are two options for creating a Doc2Vec model. Let's start with the DBOW model:

In [None]:
# Initializing the Doc2Vec model and creating the vocabulary
model_dbow = Doc2Vec(dm=0, vector_size=300, negative=30, hs=0, min_count=6)
model_dbow.build_vocab([x for x in tqdm_notebook(train_tagged.values)])

In [None]:
# training the model for 30 epochs in one call: gensim lowers the learning rate
# from alpha to min_alpha on its own, and a manual step of 0.002 per epoch
# would push the default alpha (0.025) below zero
model_dbow.train(utils.shuffle([x for x in train_tagged.values]), total_examples=len(train_tagged.values), epochs=30)

In [None]:
def labeled_vectors(model:Doc2Vec,
                     tagged_docs:pd.core.series.Series):
    """Tagged vectors for the given model and documents
    
    Args:
        model: trained Doc2Vec model
        tagged_docs: A pandas.Series which contains tagged documents with their labels
        
    Returns:
        labels and vectors for the given documents
        
    """
    sents = tagged_docs.values
    targets, regressors = zip(*[(doc.tags[0], model.infer_vector(doc.words, steps=20)) for doc in sents])
    return targets, regressors

In [None]:
y_train, X_train = labeled_vectors(model_dbow, train_tagged)
y_test, X_test = labeled_vectors(model_dbow, test_tagged)
logreg = LogisticRegression()
logreg.fit(X_train, y_train)
y_pred = logreg.predict(X_test)
print(f'Accuracy: {round(accuracy_score(y_test, y_pred), 4)}')
print(f'F1 score: {round(f1_score(y_test, y_pred, average="weighted"),4)}')
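
To see what the two numbers measure, here is a hand-rolled version of accuracy and of the weighted F1 score on six invented predictions. It is a sketch for understanding only; in the notebook we keep using the scikit-learn functions.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Share of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def weighted_f1(y_true, y_pred):
    """Per-class F1 averaged with the number of true samples as weight."""
    pairs = list(zip(y_true, y_pred))
    total = 0.0
    for label, support in Counter(y_true).items():
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        total += f1 * support
    return total / len(y_true)


# hypothetical labels, invented for this example
y_true_toy = ["homer", "homer", "bart", "bart", "lisa", "lisa"]
y_pred_toy = ["homer", "bart", "bart", "bart", "lisa", "homer"]

print(round(accuracy(y_true_toy, y_pred_toy), 4))
print(round(weighted_f1(y_true_toy, y_pred_toy), 4))
```

With balanced classes, as in our dataset, the weighted F1 is simply the mean of the per-class F1 scores.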

In [None]:
model_dm = Doc2Vec(dm=1, window=3, vector_size=300, negative=30, min_count=6,alpha=0.1, min_alpha=0.1)
model_dm.build_vocab([x for x in train_tagged.values])

In [None]:
# training the model in 30 epochs
for epoch in tqdm_notebook(range(30), desc= 'Training model'):
    model_dm.train(utils.shuffle([x for x in train_tagged.values]), total_examples=len(train_tagged.values), epochs=1)
    model_dm.alpha -= 0.002
    model_dm.min_alpha = model_dm.alpha
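
When the learning rate is lowered by hand like this, it is worth checking where the schedule ends up. The sketch below only does the arithmetic and does not touch gensim. The value 0.025 is, to my knowledge, gensim's default `alpha`, so treat it as an assumption.

```python
def manual_alpha_schedule(start_alpha, step, epochs):
    """Learning rate used in each epoch when `step` is subtracted per epoch."""
    return [start_alpha - step * epoch for epoch in range(epochs)]


# the DM model above starts at alpha=0.1
dm_schedule = manual_alpha_schedule(0.1, 0.002, 30)
print(f"DM: first epoch {dm_schedule[0]:.3f}, last epoch {dm_schedule[-1]:.3f}")

# a model that starts from the (assumed) default alpha of 0.025
default_schedule = manual_alpha_schedule(0.025, 0.002, 30)
first_non_positive = next(
    epoch for epoch, alpha in enumerate(default_schedule) if alpha <= 0)
print(f"default alpha becomes non-positive at epoch index {first_non_positive}")
```

Starting from 0.1 the rate stays positive for all 30 epochs, but with the same step a start value of 0.025 drops below zero partway through. Always choose the step together with the starting value and the number of epochs.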

In [None]:
y_train, X_train = labeled_vectors(model_dm, train_tagged)
y_test, X_test = labeled_vectors(model_dm, test_tagged)
logreg.fit(X_train, y_train)
y_pred = logreg.predict(X_test)
print(f'Accuracy: {round(accuracy_score(y_test, y_pred), 4)}')
print(f'F1 score: {round(f1_score(y_test, y_pred, average="weighted"),4)}')

According to the creators of the gensim library, combining a paragraph vector from Distributed Bag of Words (DBOW) with one from Distributed Memory (DM) improves performance. So let's do it, first deleting the training data of **model_dbow** and **model_dm** to free up memory:

In [None]:
model_dbow.delete_temporary_training_data(keep_doctags_vectors=True, keep_inference=True)
model_dm.delete_temporary_training_data(keep_doctags_vectors=True, keep_inference=True)

# concatenating models
new_model = ConcatenatedDoc2Vec([model_dbow, model_dm])

In [None]:
y_train, X_train = labeled_vectors(new_model, train_tagged)
y_test, X_test = labeled_vectors(new_model, test_tagged)
logreg.fit(X_train, y_train)
y_pred = logreg.predict(X_test)
print(f'Accuracy: {round(accuracy_score(y_test, y_pred), 4)}')
print(f'F1 score: {round(f1_score(y_test, y_pred, average="weighted"),4)}')

## Sources

* [Distributed Representations of Words and Phrases and their Compositionality (Mikolov et al.)](https://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf)
* [Gensim documentation](https://radimrehurek.com/gensim/index.html)
* [RaRe-Technologies Github Page](https://github.com/RaRe-Technologies/gensim/blob/3c3506d51a2caf6b890de3b1b32a8b85f7566ca5/docs/notebooks/doc2vec-IMDB.ipynb)
* [Multi-Class Text Classification with Doc2Vec & Logistic Regression by Susan Li](https://towardsdatascience.com/multi-class-text-classification-with-doc2vec-logistic-regression-9da9947b43f4)
* [Gensim Word2Vec Tutorial by Pierre Megret](https://www.kaggle.com/pierremegret/gensim-word2vec-tutorial)