<div style="font-variant: small-caps; 
      font-weight: normal; 
      font-size: 35px; 
      text-align: center; 
      padding: 15px; 
      margin: 10px;">
      Token Embedding
  </div> 
  
<div style="
      font-weight: normal; 
      font-size: 25px; 
      text-align: center; 
      padding: 15px; 
      margin: 10px;">
      Word2Vec, FastText, Doc2Vec
  </div> 



  <div style="
      font-size: 15px; 
      line-height: 12px; 
      text-align: center; 
      padding: 15px; 
      margin: 10px;">
  Jean-baptiste AUJOGUE
  </div> 


  <div style=" float:right; 
      font-size: 12px; 
      line-height: 12px; 
  padding: 10px 15px 8px;">
  December 2022
  </div>

<a id="TOC"></a>

#### Table of Contents

1. [Corpus](#data) <br>
2. [Word2Vec](#w2v) <br>
3. [FastText](#ft) <br>
4. [Doc2Vec](#d2v) <br>

# Overview

This notebook presents pretraining methods for the _Token Embedding_ layer of an NLP model. The `transformers` library does not provide components for such pretraining, but the topic remains valuable and was the focus of many papers before transformer models and their contextual embeddings took over.



The overall purpose of Word Embedding is to represent a _token_, a raw string representing a unit of text, as a low-dimensional dense vector. How tokens are defined depends only on the method used to split a text into units: splitting on blank spaces, or using the classical NLTK or SpaCy segmentation models, yields _words_ as tokens, but splitting protocols that yield _subword units_, halfway between characters and full words, have also been investigated:

- [Neural Machine Translation of Rare Words with Subword Units (2015)](https://www.aclweb.org/anthology/P16-1162.pdf)
- [Google’s Neural Machine Translation System: Bridging the Gap between Human and Machine Translation (2016)](https://arxiv.org/pdf/1609.08144.pdf)
- [BPEmb: Tokenization-free Pre-trained Subword Embeddings in 275 Languages (2018)](https://www.aclweb.org/anthology/L18-1473.pdf)
- [SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing (2018)](https://arxiv.org/abs/1808.06226)
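
To make the difference between word-level and subword-level tokens concrete, here is a minimal sketch. The greedy longest-match splitter and the tiny vocabulary `toy_vocab` are hypothetical illustrations; they are not the algorithm of any paper listed above (BPE and SentencePiece learn their vocabulary from data).

```python
def whitespace_tokenize(text):
    """Word-level tokens: split on blank spaces."""
    return text.split()


def greedy_subword_split(word, vocab):
    """Split a word into the longest pieces found in `vocab`, left to right.

    Falls back to a single character when no longer piece matches.
    """
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        # shrink the candidate piece until it is in the vocabulary
        while end > start + 1 and word[start:end] not in vocab:
            end -= 1
        pieces.append(word[start:end])
        start = end
    return pieces


# hypothetical subword vocabulary, for illustration only
toy_vocab = {"nifedip", "in", "e", "release"}

words = whitespace_tokenize("extended release nifedipine")
subwords = [greedy_subword_split(w, toy_vocab) for w in words]
print(words)
print(subwords)
```

A word absent from the vocabulary, such as `extended` here, degrades to single characters instead of an unknown token, which is the main appeal of subword units for rare words.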


Here we broadly denote by _word_ any such token. Common approaches to embedding words (i.e. tokens) fall into three levels of granularity:

| Level |  | |
|------|------|------|
| **Word** | [I.1 Custom model](#word_level_custom) | [I.2 Gensim Model](#gensim) |
| **sub-word unit** | [II.1 FastText model](#fastText) |  |
| **Character** |  |  |


<br>
Visualization with TensorBoard : https://www.tensorflow.org/guide/embedding (TODO)

# Training objectives

#### CBOW training objective

This embedding method was introduced by Mikolov et al. \cite{mikolov2013distributed, mikolov2013efficient}. It builds, for a vocabulary of words, an embedding table $T$ containing one vector per word. Its distinguishing feature is that the embedding is learned so that each word can be predicted from its context. Building the table $T$ involves a neural network that models the probability of predicting a word $w_t$ from its context $c = w_{t-N}, \, ... \, , w_{t-1}$, $w_{t+1}, \, ... \, , w_{t+N}$. The table $T$, which is part of the model, is optimized by training the model so that the observed word $w_t$ maximizes the likelihood under the probability $P(. \, | \, c)$ given by the model.

The neural network is described as follows:

![cbow](figs/CBOW.png)

A context $c = w_{t-N}, \, ... \, , w_{t-1}$, $w_{t+1}, \, ... \, , w_{t+N}$ is embedded through a table $T$, yielding a set of dense vectors (typically of dimension between 50 and 300) $T(w_{t-N}), \, ... \, , T(w_{t-1})$, $T(w_{t+1}), \, ... \, , T(w_{t+N})$. Each vector then goes through an affine transformation, and the resulting vectors are summed into a single vector

\begin{align*}
v_c = \sum _{\substack{i = -N \\ i \neq 0}}^{N} \left( M_i T(w_{t+i}) + b_i \right)
\end{align*}

The vector $v_c$ typically has the same dimension as the word embeddings. A second table $T'$ provides another embedding of the vocabulary, so that the word $w_{t}$ is mapped to a vector $T'(w_{t})$ by this table, and is proposed at position $t$ with probability

\begin{align*}
P(w_{t} \, | \, c\,) = \frac{\exp\left( T'(w_{t}) \cdot v_c \right) }{\displaystyle \sum _{w \in \mathcal{V}} \exp\left( T'(w) \cdot v_c \right) }
\end{align*}

Here $\cdot$ denotes the dot product between vectors. Optimizing this model adjusts the table $T$ so that word vectors carry enough information to recover a word from its context.
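
The CBOW forward pass can be sketched in a few lines of NumPy. This is a toy illustration under stated assumptions: the sizes, the random initialization, the token ids and the names `cbow_probabilities`, `context` and `target` are all hypothetical, and the full softmax is computed explicitly, whereas Gensim (used later with `negative = 15`) relies on negative sampling instead.

```python
import numpy as np

rng = np.random.default_rng(0)

vocab_size, dim, N = 10, 4, 2                  # toy sizes, for illustration only
T = rng.normal(size=(vocab_size, dim))         # input table T
T_prime = rng.normal(size=(vocab_size, dim))   # output table T'
# one affine map (M_i, b_i) per context position i = -N..-1, 1..N
M = rng.normal(size=(2 * N, dim, dim))
b = rng.normal(size=(2 * N, dim))


def cbow_probabilities(context_ids):
    """Return P(. | c) over the vocabulary for a context of 2N token ids."""
    # v_c = sum_i (M_i T(w_{t+i}) + b_i)
    v_c = sum(M[k] @ T[w] + b[k] for k, w in enumerate(context_ids))
    scores = T_prime @ v_c                     # T'(w) . v_c for every word w
    scores = scores - scores.max()             # shift for numerical stability
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()       # softmax over the vocabulary


context = [3, 7, 1, 5]                         # ids of w_{t-2}, w_{t-1}, w_{t+1}, w_{t+2}
probs = cbow_probabilities(context)
target = 2                                     # hypothetical id of the center word w_t
loss = -np.log(probs[target])                  # negative log-likelihood to minimize
print(probs.sum(), loss)
```

Training would then update $T$, $T'$, $M_i$ and $b_i$ by gradient descent on this loss, averaged over all (context, word) pairs of the corpus.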


#### Skip-Gram training objective


This embedding method was introduced by Mikolov et al. \cite{mikolov2013distributed, mikolov2013efficient} as the mirror image of the Continuous Bag Of Words model, and again builds, for a vocabulary of words, an embedding table $T$ containing one vector per word. Its distinguishing feature is that the embedding is learned not to predict a center word $w$ from a context $c$, as in CBOW, but rather to predict the context $c$ from the center word $w$. Building the table $T$ involves a neural network that models the probability of predicting a context $c = w_{t-N}, \, ... \, , w_{t-1}$, $w_{t+1}, \, ... \, , w_{t+N}$ from a center word $w_t$. The table $T$, which is part of the model, is optimized by training the model so that the observed context $c$ maximizes the likelihood under the probability $P( . \, | \, w_t)$ given by the model.


One implementation of this model is the following:


![skipgram](figs/Skipgram.png)


A current word $w_t$ is embedded through a table $T$, yielding a dense vector $T(w_t)$ (typically of dimension between 50 and 300). This vector is then transformed into a set of $2N$ vectors

\begin{align*}
\sigma (M_{i} T(w_t) + b_{i}) \qquad \qquad i =-N,\, ...\, , -1, 1, \, ...\, , N
\end{align*}

where $N$ is the chosen window size, each vector typically has the same dimension as the word embeddings, and $\sigma$ is a nonlinear function (typically the _Rectified Linear Unit_ $\sigma (x) = \max (0, x)$). A second table $T'$ provides another embedding of the vocabulary, so that each word $w_{t+i}$, mapped to a vector $T'(w_{t+i})$ by this table, is proposed at position $t+i$ with probability

\begin{align*}
P( w_{t+i} \, | \, w_t) = \frac{\exp\left( T'(w_{t+i}) \cdot \sigma \left( M_i T(w_t) + b_{i}\right) \right) }{\displaystyle \sum _{w \in \mathcal{V}} \exp\left( T'(w) \cdot \sigma \left( M_i T(w_t) + b_i\right) \right) }
\end{align*}

The probability that a set of words $c = w_{t-N}, \, ... \, , w_{t-1}$, $w_{t+1}, \, ... \, , w_{t+N}$ is the context of a word $w_t$ is then modeled by the product

\begin{align*}
 P( c\, | \, w_t) = \prod _{\substack{i = -N \\ i \neq 0}}^{N} P( w_{t+i}\, | \, w_t)
\end{align*}

This probability model for the context of a word is naive, in the sense that context words are treated as independent of one another once the center word is known. This approximation nevertheless makes the optimization much cheaper to compute.



Optimizing this model adjusts the table $T$ so that word vectors carry enough information to recover the whole context from the center word alone. Skip-Gram embeddings typically perform better than CBOW embeddings, because the table $T$ is subject to more constraints during optimization, and because the vector of a word is learned so as to predict the actual usage of the word, here given by its context.
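
The factorized context probability can be sketched in NumPy as follows. As for any toy example, the sizes, random parameters, token ids and function names are hypothetical; the sum of log-probabilities is used in place of the product to avoid numerical underflow.

```python
import numpy as np

rng = np.random.default_rng(0)

vocab_size, dim, N = 10, 4, 2                  # toy sizes, for illustration only
T = rng.normal(size=(vocab_size, dim))         # input table T
T_prime = rng.normal(size=(vocab_size, dim))   # output table T'
M = rng.normal(size=(2 * N, dim, dim))         # one (M_i, b_i) per context position
b = rng.normal(size=(2 * N, dim))


def relu(x):
    return np.maximum(0.0, x)


def log_softmax(scores):
    shifted = scores - scores.max()            # shift for numerical stability
    return shifted - np.log(np.exp(shifted).sum())


def skipgram_log_prob(center_id, context_ids):
    """log P(c | w_t) = sum over context positions of log P(w_{t+i} | w_t)."""
    total = 0.0
    for k, w in enumerate(context_ids):
        hidden = relu(M[k] @ T[center_id] + b[k])      # sigma(M_i T(w_t) + b_i)
        total += log_softmax(T_prime @ hidden)[w]      # log P(w_{t+i} | w_t)
    return total


# hypothetical ids: center word 2, context w_{t-2}, w_{t-1}, w_{t+1}, w_{t+2}
log_p = skipgram_log_prob(2, [3, 7, 1, 5])
print(log_p)
```

Each center word contributes $2N$ prediction terms to the loss, against a single term per position for CBOW.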


A complete review of methods for learning Token Embeddings is provided in this [PhD thesis, 2018](https://www.skoltech.ru/app/data/uploads/2018/09/Thesis-Fonarev1.pdf).

# Packages

[Table of content](#TOC)

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import sys
import os
import time
import math
import re
import random
import pickle
import copy
from unidecode import unidecode
import multiprocessing

# data 
import numpy as np
import pandas as pd
from datasets import Dataset, load_from_disk

# models
from transformers import AutoTokenizer
from gensim.models import Word2Vec, FastText, Doc2Vec
from gensim.models.doc2vec import TaggedDocument
from mittens import GloVe

# viz
from tqdm import tqdm



#### Custom paths & imports

In [3]:
path_to_repo = os.path.dirname(os.getcwd())
path_to_data = os.path.join(path_to_repo, 'datasets', 'clinical trials CTTI')
path_to_save = os.path.join(path_to_repo, 'saves', 'MLM')
path_to_src  = os.path.join(path_to_repo, 'src')

#### Constants

In [4]:
dataset_name = 'clinical-trials-ctti'
final_dataset_name = 'clinical-trials-ctti-tokenized'
base_model_name = os.path.join('albert-small-clinical-trials', 'tokenizer')
final_model_name = os.path.join('albert-small-clinical-trials', 'w2v')

<a id="data"></a>

# 1. Corpus

[Table of content](#TOC)

## 1.1 Load Clinical Trials corpus

[Table of content](#TOC)

In [5]:
with open(os.path.join(path_to_data, '{}.txt'.format(dataset_name)), 'r', encoding = 'utf-8') as f:
    texts = [t.strip() for t in f.readlines()]

In [6]:
len(texts)

430108

## 1.2 Tokenize corpus

[Table of content](#TOC)

In [7]:
tokenizer = AutoTokenizer.from_pretrained(os.path.join(path_to_save, base_model_name))

Create the tokenized corpus. The export and reload steps in the cell after it are commented out (uncomment and run them only once):

In [8]:
tokenized_corpus = [tokenizer.tokenize(t) for t in tqdm(texts)]

100%|█████████████████████████████████████████████████████████████████████████| 430108/430108 [13:28<00:00, 532.30it/s]


In [9]:
# dataset = Dataset.from_dict({'text': tokenized_corpus})
# dataset.save_to_disk(os.path.join(path_to_data, final_dataset_name))

# # Import back tokenized corpus
# dataset = load_from_disk(os.path.join(path_to_data, final_dataset_name))
# tokenized_corpus = [dataset[i]['text'] for i in tqdm(range(len(dataset)))]

In [10]:
tokenized_corpus[0]

['▁this',
 '▁study',
 '▁will',
 '▁test',
 '▁the',
 '▁',
 'ability',
 '▁of',
 '▁extended',
 '▁release',
 '▁nifedip',
 'in',
 'e',
 '▁',
 '(',
 'pro',
 'cardia',
 '▁',
 'x',
 'l',
 ')',
 ',',
 '▁',
 'a',
 '▁blood',
 '▁pressure',
 '▁medication',
 ',',
 '▁to',
 '▁permit',
 '▁',
 'a',
 '▁decrease',
 '▁in',
 '▁the',
 '▁dose',
 '▁of',
 '▁glucocorticoid',
 '▁medication',
 '▁children',
 '▁take',
 '▁to',
 '▁treat',
 '▁congenital',
 '▁adrenal',
 '▁hyperplasia',
 '▁',
 '(',
 'c',
 'a',
 'h',
 ')',
 '.',
 '▁this',
 '▁protocol',
 '▁is',
 '▁',
 'designed',
 '▁to',
 '▁assess',
 '▁both',
 '▁acute',
 '▁and',
 '▁chronic',
 '▁effects',
 '▁of',
 '▁the',
 '▁calcium',
 '▁channel',
 '▁antagonist',
 ',',
 '▁nifedip',
 'in',
 'e',
 ',',
 '▁on',
 '▁the',
 '▁hypothalamic',
 '-',
 'pituitary',
 '-',
 'a',
 'd',
 'renal',
 '▁axis',
 '▁in',
 '▁patients',
 '▁with',
 '▁congenital',
 '▁adrenal',
 '▁hyperplasia',
 '.',
 '▁the',
 '▁multicenter',
 '▁trial',
 '▁is',
 '▁compose',
 'd',
 '▁of',
 '▁two',
 '▁phases',
 '▁and',


<a id="w2v"></a>


# 2. Word2Vec

[Table of content](#TOC)

## 2.1 CBOW training objective

[Table of content](#TOC)

Link : https://radimrehurek.com/gensim/models/word2vec.html<br>
Tutorials :

- https://cambridgespark.com/4046-2/
- https://rare-technologies.com/word2vec-tutorial/
- http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/


In [17]:
cbow = Word2Vec(
    vector_size = 128, 
    window = 7, 
    min_count = 0, 
    negative = 15, 
    sg = 0,
    workers = multiprocessing.cpu_count(),
    seed = 42,
)

The model needs to collect the vocabulary of tokens present in the (tokenized) corpus, prior to building and optimizing the token embedding matrix.

In [18]:
cbow.build_vocab(tokenized_corpus)

Since every token in the corpus was output by our tokenizer, the vocabulary collected by the model is a subset of the tokenizer's own vocabulary.<br> 
Here the two vocabularies coincide, up to integer indexing, except for the special tokens and the token `ed.`, which never appear in the tokenized corpus:<br>
This is because the tokenizer was fitted on this same corpus, so almost all tokens of its vocabulary occur at least once in the corpus.

In [19]:
base_vocab = tokenizer.get_vocab()
cbow_vocab = cbow.wv.key_to_index

len(base_vocab), len(cbow_vocab)

(5000, 4995)

In [20]:
[tok for tok in base_vocab.keys() if tok not in cbow_vocab.keys()]

['[SEP]', 'ed.', '<unk>', '<pad>', '[MASK]']

In [22]:
cbow.train(
    corpus_iterable = tokenized_corpus, 
    epochs = 1, 
    total_examples = len(tokenized_corpus),
    start_alpha = 2.5e-2,
    end_alpha = 1e-5,
)

(157350770, 213777853)
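
The pair returned by `train` is an effective word count followed by the raw word count. One plausible reason for the gap between the two numbers is the subsampling of frequent words, controlled by the `sample` parameter, which is left at its default above (assumed here to be `1e-3`). The sketch below reproduces the keep-probability formula of the original word2vec implementation; the function name and the token frequencies are hypothetical.

```python
import math


def keep_probability(word_frequency, sample=1e-3):
    """Probability of keeping one occurrence of a word during training.

    `word_frequency` is the word's share of all corpus tokens (between 0 and 1).
    """
    ratio = word_frequency / sample
    return min(1.0, (math.sqrt(ratio) + 1.0) / ratio)


# hypothetical token frequencies, for illustration only
for token, freq in [("▁the", 0.05), ("▁study", 0.005), ("▁nifedip", 0.00001)]:
    print(token, round(keep_probability(freq), 3))
```

Very frequent tokens are dropped most of the time, while rare tokens are always kept, so the effective count is lower than the raw count.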

In [33]:
cbow.save(os.path.join(path_to_save, final_model_name, 'cbow'))

Evaluate the model:

In [5]:
cbow = Word2Vec.load(os.path.join(path_to_save, final_model_name, 'cbow'))

In [6]:
np.linalg.norm(cbow.wv.vectors, axis = -1).tolist()

[18.52962303161621,
 20.736326217651367,
 17.766706466674805,
 22.1402530670166,
 18.04116439819336,
 16.62861442565918,
 20.921104431152344,
 17.259843826293945,
 30.158266067504883,
 23.113059997558594,
 22.103801727294922,
 22.06257438659668,
 24.066749572753906,
 42.8422737121582,
 25.536203384399414,
 27.679298400878906,
 24.401941299438477,
 23.187719345092773,
 44.36274719238281,
 37.11742401123047,
 23.714698791503906,
 23.547819137573242,
 21.603256225585938,
 35.93287658691406,
 38.611854553222656,
 34.73106002807617,
 23.82558822631836,
 23.648563385009766,
 31.85660171508789,
 33.13561248779297,
 23.52503776550293,
 33.60490798950195,
 24.43172836303711,
 25.4432315826416,
 30.88538360595703,
 35.02460479736328,
 35.9189338684082,
 27.407958984375,
 26.702219009399414,
 34.960548400878906,
 26.272668838500977,
 35.79170227050781,
 36.988033294677734,
 26.821701049804688,
 47.4622688293457,
 28.254806518554688,
 34.41810989379883,
 35.0435905456543,
 29.554332733154297,
 31.

In [37]:
cbow.wv.most_similar('▁evaluate')

[('▁assess', 0.9663772583007812),
 ('▁investigate', 0.9392503499984741),
 ('▁determine', 0.9024204611778259),
 ('▁examine', 0.8868530988693237),
 ('▁compare', 0.86357182264328),
 ('▁explore', 0.853326141834259),
 ('▁describe', 0.8049159049987793),
 ('▁characterize', 0.8014445900917053),
 ('▁verify', 0.7951874136924744),
 ('▁evaluating', 0.756216824054718)]
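
The scores returned by `most_similar` are cosine similarities, so the large differences in vector norms printed above do not affect the ranking. A minimal NumPy sketch of this ranking, on hypothetical 2-dimensional embeddings (the helper `rank_by_cosine` is not part of Gensim):

```python
import numpy as np


def rank_by_cosine(query, keys, vectors, topn=3):
    """Rank tokens by cosine similarity to `query`, excluding the query itself."""
    unit = vectors / np.linalg.norm(vectors, axis=-1, keepdims=True)
    sims = unit @ unit[keys.index(query)]      # cosine similarity with every token
    order = np.argsort(-sims)                  # most similar first
    return [(keys[i], float(sims[i])) for i in order if keys[i] != query][:topn]


# hypothetical embeddings, for illustration only
keys = ["▁evaluate", "▁assess", "▁blood", "▁pressure"]
vectors = np.array([
    [1.0, 0.1],
    [2.0, 0.3],    # nearly the same direction as '▁evaluate', but a larger norm
    [0.1, 1.0],
    [-1.0, 0.5],
])

neighbors = rank_by_cosine("▁evaluate", keys, vectors)
print(neighbors)
```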

In [25]:
cbow.wv.most_similar('▁glucocorticoid')

[('▁corticosteroid', 0.8697609901428223),
 ('steroid', 0.7753329277038574),
 ('▁beta-blocker', 0.7272238731384277),
 ('▁nsaid', 0.7263624668121338),
 ('▁anticoagulant', 0.7142004370689392),
 ('▁diuretic', 0.6762709021568298),
 ('▁anticonvulsant', 0.6405121088027954),
 ('▁hormonal', 0.6374244689941406),
 ('▁estrogen', 0.6114485859870911),
 ('benzodiazepine', 0.6072638034820557)]

In [30]:
cbow.wv.most_similar('▁paracetamol')

[('▁acetaminophen', 0.9059516787528992),
 ('▁ibuprofen', 0.8393154740333557),
 ('▁morphine', 0.8097772598266602),
 ('▁fentanyl', 0.7805302143096924),
 ('▁midazolam', 0.7642548084259033),
 ('▁gabapentin', 0.7307537794113159),
 ('▁ketamine', 0.7143698334693909),
 ('▁dexmedetomidine', 0.6839478611946106),
 ('▁analgesic', 0.6655339002609253),
 ('▁remifentanil', 0.6416815519332886)]

## 2.2 Skip-Gram training objective

[Table of content](#TOC)

In [11]:
sgram = Word2Vec(
    vector_size = 128, 
    window = 7, 
    min_count = 0, 
    negative = 15, 
    sg = 1,
    workers = multiprocessing.cpu_count(),
    seed = 42,
)

In [12]:
sgram.build_vocab([list(tokenizer.get_vocab())] + tokenized_corpus)

In [13]:
base_vocab  = tokenizer.get_vocab()
sgram_vocab = sgram.wv.key_to_index

len(base_vocab), len(sgram_vocab), (sgram_vocab == base_vocab)

(10000, 10000, False)
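
The `False` above comes from comparing two dictionaries, which checks the integer indices as well as the tokens. The tokenizer and the Gensim model are not expected to share the same indexing, so comparing key sets is the more informative check. A sketch with hypothetical toy vocabularies:

```python
# hypothetical toy vocabularies mapping token -> integer index
toy_base_vocab = {"<pad>": 0, "▁the": 1, "▁study": 2}
toy_w2v_vocab = {"▁the": 0, "▁study": 1, "<pad>": 2}

same_mapping = toy_w2v_vocab == toy_base_vocab            # compares tokens and indices
same_tokens = set(toy_w2v_vocab) == set(toy_base_vocab)   # compares tokens only
print(same_mapping, same_tokens)
```

Applied to the real objects, the equivalent check would be `set(sgram_vocab) == set(base_vocab)`.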

In [14]:
sgram.train(
    corpus_iterable = tokenized_corpus, 
    epochs = 3, 
    total_examples = len(tokenized_corpus),
    start_alpha = 2.5e-2,
    end_alpha = 1e-5,
)

(437135781, 658914720)

In [15]:
sgram.save(os.path.join(path_to_save, final_model_name, 'sgram'))

Evaluation

In [16]:
sgram = Word2Vec.load(os.path.join(path_to_save, final_model_name, 'sgram'))

In [17]:
np.linalg.norm(sgram.wv.vectors, axis = -1).tolist()

[1.4936128854751587,
 1.6605130434036255,
 1.7028515338897705,
 1.4862797260284424,
 1.7892918586730957,
 1.794647216796875,
 1.6289222240447998,
 1.631468415260315,
 2.0154943466186523,
 2.1520259380340576,
 1.8849847316741943,
 2.0168864727020264,
 1.7805209159851074,
 1.6925245523452759,
 2.0719501972198486,
 1.9361293315887451,
 1.8657211065292358,
 2.301326036453247,
 1.967976689338684,
 2.19937801361084,
 2.3560495376586914,
 1.908614993095398,
 2.1233270168304443,
 1.949370265007019,
 2.2105886936187744,
 2.317613124847412,
 1.9564261436462402,
 2.0055251121520996,
 2.2597129344940186,
 2.2539544105529785,
 2.1048340797424316,
 1.923978328704834,
 1.9290603399276733,
 1.9028393030166626,
 2.2513070106506348,
 2.147373914718628,
 2.165884494781494,
 2.213029623031616,
 2.0426077842712402,
 2.0926897525787354,
 2.1423630714416504,
 2.433199405670166,
 2.362342119216919,
 2.3098573684692383,
 2.2407915592193604,
 2.069993734359741,
 2.2968733310699463,
 2.07165265083313,
 2.0333116

In [18]:
sgram.wv.most_similar('▁evaluat')

[('▁explor', 0.8182735443115234),
 ('▁investigating', 0.7874332666397095),
 ('▁determin', 0.7509390115737915),
 ('▁examining', 0.7351354956626892),
 ('▁demonstrat', 0.7319681644439697),
 ('▁analyz', 0.7245328426361084),
 ('▁assess', 0.7081125974655151),
 ('▁describ', 0.6848630905151367),
 ('▁comparing', 0.6807157397270203),
 ('▁utiliz', 0.6705877780914307)]

In [19]:
sgram.wv.most_similar('▁glucocorticoid')

[('steroids', 0.8449226021766663),
 ('▁corticosteroids', 0.8431997895240784),
 ('▁steroid', 0.7978275418281555),
 ('▁corticosteroid', 0.7943652272224426),
 ('▁bisphosphonate', 0.7266825437545776),
 ('▁prednisolone', 0.6790269613265991),
 ('▁nsaids', 0.6750874519348145),
 ('mmunosuppressant', 0.6662101149559021),
 ('▁retinoid', 0.6542328000068665),
 ('▁statin', 0.6438979506492615)]

In [20]:
sgram.wv.most_similar('▁ibuprofen')

[('▁acetaminophen', 0.896304190158844),
 ('▁tramadol', 0.8056350946426392),
 ('▁gabapentin', 0.7794120907783508),
 ('▁diclofenac', 0.7573017477989197),
 ('▁pregabalin', 0.7537317276000977),
 ('▁oxycodon', 0.7453684210777283),
 ('acetaminophen', 0.7368971705436707),
 ('coxib', 0.718144953250885),
 ('▁indomethacin', 0.7162020802497864),
 ('▁ketorol', 0.7129977941513062)]

<a id="ft"></a>


# 3. FastText

[Table of content](#TOC)

## 3.1 CBOW training objective

[Table of content](#TOC)


FastText's Word Embedding via character n-grams


We consider the Gensim implementation of FastText, based on the CBOW training objective.<br>
Tutorial : [Gensim FastText](https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/FastText_Tutorial.ipynb)<br>
Link to the original paper : [Enriching Word Vectors with Subword Information](https://arxiv.org/pdf/1607.04606.pdf).
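
In the paper, a word is wrapped in the boundary markers `<` and `>` and represented through its character n-grams. The following sketch only extracts those n-grams; the function name is hypothetical, and the range of 3 to 6 characters is assumed to match the paper and the Gensim defaults (`min_n`, `max_n`).

```python
def char_ngrams(token, min_n=3, max_n=6):
    """Character n-grams of `token`, wrapped in the boundary markers < and >."""
    wrapped = f"<{token}>"
    return [
        wrapped[i:i + n]
        for n in range(min_n, max_n + 1)
        for i in range(len(wrapped) - n + 1)
    ]


# the example word used in the paper, restricted to 3-grams
print(char_ngrams("where", min_n=3, max_n=3))
# → ['<wh', 'whe', 'her', 'ere', 're>']
```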

In [48]:
cbow_ft = FastText(
    vector_size = 128, 
    window = 7, 
    min_count = 0, 
    negative = 15, 
    sg = 0,
    workers = multiprocessing.cpu_count(),
    seed = 42,
)

In [49]:
cbow_ft.build_vocab(tokenized_corpus)

In [50]:
cbow_ft.train(
    corpus_iterable = tokenized_corpus, 
    epochs = 1, 
    total_examples = len(tokenized_corpus),
    start_alpha = 2.5e-2,
    end_alpha = 1e-5,
)

(157352132, 213777853)

In [51]:
cbow_ft.save(os.path.join(path_to_save, final_model_name, 'cbow_ft'))

Evaluation

In [None]:
cbow_ft = FastText.load(os.path.join(path_to_save, final_model_name, 'cbow_ft'))

In [53]:
cbow_ft.wv.most_similar('▁evaluate')

[('▁assess', 0.9489772319793701),
 ('▁determine', 0.90901118516922),
 ('▁investigate', 0.8902094960212708),
 ('▁examine', 0.8681502938270569),
 ('▁explore', 0.8442973494529724),
 ('▁evaluating', 0.8324539661407471),
 ('▁compare', 0.8023945093154907),
 ('▁assessing', 0.7686477899551392),
 ('▁verify', 0.7602091431617737),
 ('▁determining', 0.7566972374916077)]

In [54]:
cbow_ft.wv.most_similar('▁glucocorticoid')

[('▁corticosteroid', 0.8412089347839355),
 ('steroid', 0.7385424375534058),
 ('▁melatonin', 0.6800440549850464),
 ('▁anticoagulant', 0.6730968952178955),
 ('▁anticonvulsant', 0.6728842258453369),
 ('▁hormone', 0.66767817735672),
 ('▁diuretic', 0.6548454165458679),
 ('▁insulin', 0.6490825414657593),
 ('▁metformin', 0.6321661472320557),
 ('▁anticoagulation', 0.6299314498901367)]

In [55]:
cbow_ft.wv.most_similar('▁paracetamol')

[('▁acetaminophen', 0.8600711822509766),
 ('▁ibuprofen', 0.7738017439842224),
 ('▁gabapentin', 0.767635703086853),
 ('▁morphine', 0.736064076423645),
 ('▁midazolam', 0.725612461566925),
 ('▁dexamethasone', 0.7200802564620972),
 ('▁dexmedetomidine', 0.6897672414779663),
 ('▁fentanyl', 0.6858271360397339),
 ('▁ketamine', 0.6840497255325317),
 ('▁prednisone', 0.6433501243591309)]

## 3.2 Skip-Gram training objective

[Table of content](#TOC)

In [56]:
sgram_ft = FastText(
    vector_size = 128, 
    window = 7, 
    min_count = 0, 
    negative = 15, 
    sg = 1,
    workers = multiprocessing.cpu_count(),
    seed = 42,
)

In [57]:
sgram_ft.build_vocab(tokenized_corpus)

Training

In [58]:
sgram_ft.train(
    corpus_iterable = tokenized_corpus, 
    epochs = 1, 
    total_examples = len(tokenized_corpus),
    start_alpha = 2.5e-2,
    end_alpha = 1e-5,
)

(157348744, 213777853)

In [59]:
sgram_ft.save(os.path.join(path_to_save, final_model_name, 'sgram_ft'))

Evaluation

In [60]:
sgram_ft = FastText.load(os.path.join(path_to_save, final_model_name, 'sgram_ft'))

In [61]:
sgram_ft.wv.most_similar('▁evaluate')

[('▁assess', 0.9387299418449402),
 ('▁investigate', 0.9106242060661316),
 ('▁compare', 0.8407846093177795),
 ('▁determine', 0.8407755494117737),
 ('▁examine', 0.8298873901367188),
 ('▁explore', 0.8108099102973938),
 ('▁characterize', 0.7549089789390564),
 ('▁describe', 0.7497537136077881),
 ('▁evaluating', 0.7173411250114441),
 ('▁verify', 0.7089744210243225)]

In [62]:
sgram_ft.wv.most_similar('▁glucocorticoid')

[('▁corticosteroid', 0.8841887712478638),
 ('steroid', 0.8095722198486328),
 ('▁prednisolone', 0.7155464291572571),
 ('▁nsaid', 0.7149364352226257),
 ('▁methotrexate', 0.6829048991203308),
 ('▁anticoagulant', 0.6805861592292786),
 ('mmunosuppressant', 0.6801298260688782),
 ('▁prednisone', 0.6631572246551514),
 ('▁inhaled', 0.6483165621757507),
 ('▁antagonist', 0.6387893557548523)]

In [63]:
sgram_ft.wv.most_similar('▁paracetamol')

[('▁acetaminophen', 0.9111426472663879),
 ('▁ibuprofen', 0.8305103182792664),
 ('▁gabapentin', 0.7761817574501038),
 ('▁nsaid', 0.7502484321594238),
 ('▁morphine', 0.7421273589134216),
 ('▁fentanyl', 0.7231634855270386),
 ('▁celecoxib', 0.7175020575523376),
 ('▁midazolam', 0.7022755742073059),
 ('▁analgesic', 0.6811712384223938),
 ('▁ketamine', 0.6758464574813843)]

In [67]:
sgram_ft.wv.most_similar('paaaaracetamol')

[('▁paracetamol', 0.9820312261581421),
 ('▁acetaminophen', 0.9063007235527039),
 ('▁ibuprofen', 0.8336237668991089),
 ('▁gabapentin', 0.7589367628097534),
 ('▁nsaid', 0.7414977550506592),
 ('▁morphine', 0.7340535521507263),
 ('▁fentanyl', 0.7043936848640442),
 ('▁celecoxib', 0.7024633884429932),
 ('▁midazolam', 0.6804414391517639),
 ('▁lidocaine', 0.664604127407074)]
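
The misspelled token `paaaaracetamol` is not in the vocabulary, yet FastText can still embed it because it shares many character n-grams with `▁paracetamol`. The sketch below only measures that overlap. It is an illustration with hypothetical helper names, not Gensim's internal procedure, which hashes n-grams into buckets and combines their vectors.

```python
def char_ngrams(token, min_n=3, max_n=6):
    """Set of character n-grams of `token`, wrapped in the markers < and >."""
    wrapped = f"<{token}>"
    return {
        wrapped[i:i + n]
        for n in range(min_n, max_n + 1)
        for i in range(len(wrapped) - n + 1)
    }


def ngram_overlap(a, b):
    """Jaccard overlap between the character n-gram sets of two tokens."""
    grams_a, grams_b = char_ngrams(a), char_ngrams(b)
    return len(grams_a & grams_b) / len(grams_a | grams_b)


overlap_paracetamol = ngram_overlap("paaaaracetamol", "▁paracetamol")
overlap_ibuprofen = ngram_overlap("paaaaracetamol", "▁ibuprofen")
print(overlap_paracetamol, overlap_ibuprofen)
```

The first overlap is much larger than the second, which is consistent with `▁paracetamol` being returned as the closest neighbor above.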

<a id="d2v"></a>


# 4. Doc2Vec

[Table of content](#TOC)

## 4.1 CBOW training objective

[Table of content](#TOC)

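This experiment is performed **on a much smaller training corpus** (the first 1,000 documents), because training is **far more computationally expensive** and the results are for demonstration purposes only. Here `dm = 1` selects the distributed memory variant of Doc2Vec (PV-DM), which is the analogue of the CBOW objective.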

In [8]:
tagged_corpus = [TaggedDocument(doc, [i]) for i, doc in enumerate(tokenized_corpus[:1000])]

In [9]:
cbow_d2v = Doc2Vec(
    vector_size = 128, 
    window = 7, 
    min_count = 0, 
    negative = 15, 
    dm = 1,
    workers = multiprocessing.cpu_count(),
    seed = 42,
)

In [10]:
cbow_d2v.build_vocab(tagged_corpus)

In [11]:
cbow_d2v.train(
    corpus_iterable = tagged_corpus, 
    epochs = 1, 
    total_examples = len(tagged_corpus),
    start_alpha = 2.5e-2,
    end_alpha = 1e-5,
)

In [12]:
cbow_d2v.save(os.path.join(path_to_save, final_model_name, 'cbow_d2v'))

In [13]:
cbow_d2v = Doc2Vec.load(os.path.join(path_to_save, final_model_name, 'cbow_d2v'))

In [14]:
cbow_d2v.wv.most_similar('▁evaluate')

[('▁is', 0.9999059438705444),
 ('▁of', 0.999886155128479),
 ('▁the', 0.9998841881752014),
 ('▁to', 0.9998838305473328),
 ('▁these', 0.9998620748519897),
 ('▁this', 0.9998597502708435),
 ('▁study', 0.9998595714569092),
 ('▁that', 0.9998573660850525),
 ('▁with', 0.9998547434806824),
 ('▁was', 0.999849796295166)]

In [15]:
cbow_d2v.wv.most_similar('▁glucocorticoid')

[('▁giving', 0.7857160568237305),
 ('▁pharmacokinetic', 0.78131103515625),
 ('▁telephone', 0.7809470891952515),
 ('▁atrophy', 0.7805244326591492),
 ('▁content', 0.7800859808921814),
 ('▁requiring', 0.7795609831809998),
 ('▁third', 0.7794311046600342),
 ('ard', 0.7790563702583313),
 ('▁bacterial', 0.7788977026939392),
 ('▁panel', 0.7788200378417969)]

## 4.2 Skip-Gram training objective

[Table of content](#TOC)


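This experiment is performed **on a much smaller training corpus** (the first 1,000 documents), because training is **far more computationally expensive** and the results are for demonstration purposes only. Here `dm = 0` selects the distributed bag of words variant of Doc2Vec (PV-DBOW), which is the analogue of the Skip-Gram objective. In Gensim this variant does not train word vectors unless `dbow_words = 1` is set, so the word similarities below are not expected to be meaningful.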

In [17]:
sgram_d2v = Doc2Vec(
    vector_size = 128, 
    window = 7, 
    min_count = 0, 
    negative = 15, 
    dm = 0,
    workers = multiprocessing.cpu_count(),
    seed = 42,
)

In [18]:
sgram_d2v.build_vocab(tagged_corpus)

In [19]:
sgram_d2v.train(
    corpus_iterable = tagged_corpus, 
    epochs = 1, 
    total_examples = len(tagged_corpus),
    start_alpha = 2.5e-2,
    end_alpha = 1e-5,
)

In [20]:
sgram_d2v.save(os.path.join(path_to_save, final_model_name, 'sgram_d2v'))

In [21]:
sgram_d2v = Doc2Vec.load(os.path.join(path_to_save, final_model_name, 'sgram_d2v'))

In [22]:
sgram_d2v.wv.most_similar('▁evaluate')

[('▁amplitude', 0.30480095744132996),
 ('▁30', 0.29561254382133484),
 ('tion', 0.26839596033096313),
 ('▁(n', 0.2623017728328705),
 ('▁po', 0.2511439323425293),
 ('20', 0.24565120041370392),
 ('▁expectancy', 0.23870612680912018),
 ('bra', 0.23677323758602142),
 ('/day', 0.23665915429592133),
 ('▁delirium', 0.23582018911838531)]

In [23]:
sgram_d2v.wv.most_similar('▁glucocorticoid')

[('▁giving', 0.3195812404155731),
 ('▁principal', 0.30242210626602173),
 ('▁chosen', 0.2952287197113037),
 ('▁following', 0.27384042739868164),
 ('ard', 0.2628988027572632),
 ('▁requiring', 0.26093459129333496),
 ('▁intake', 0.2521917521953583),
 ('s)', 0.25009459257125854),
 ('ally', 0.24567338824272156),
 ('▁mass', 0.24209511280059814)]

In [None]:
sgram_d2v.wv.most_similar('▁paracetamol')