# Word Embeddings

This Colab notebook introduces word embeddings and recent advances in language models.

## Word2Vec (CBOW, skip-gram)

The original Word2Vec paper: https://arxiv.org/pdf/1301.3781.pdf <br>
The follow-up paper with important improvements to Word2Vec: https://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf

### n-grams

A very simple introduction: https://blog.xrds.acm.org/2017/10/introduction-n-grams-need/ <br>
A little more advanced: https://web.stanford.edu/~jurafsky/slp3/3.pdf

In [0]:
import pandas as pd
from nltk.util import ngrams

In [2]:
data = pd.read_csv('imdb_sample.csv')


In [0]:
data.head()

In [0]:
tokens = data.loc[0,'review'].split()
tokens

In [0]:
output = ['_'.join(ngram) for ngram in ngrams(tokens, 5)]
output
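
To see what the cell above produces without the IMDB file, here is a minimal sketch on a made-up sentence. The helper `simple_ngrams` is hypothetical; it only mirrors the sliding-window idea and is not NLTK's implementation.

```python
def simple_ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Made-up sentence, standing in for one review
toy_tokens = "this movie was surprisingly good".split()

bigrams = simple_ngrams(toy_tokens, 2)
joined = ['_'.join(gram) for gram in bigrams]
print(joined)
# ['this_movie', 'movie_was', 'was_surprisingly', 'surprisingly_good']
```

A sentence with fewer than `n` tokens simply yields an empty list.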

### collocations

Detecting frequent phrases (collocations, i.e. multi-word n-grams that occur together unusually often) with gensim: https://radimrehurek.com/gensim/models/phrases.html

In [0]:
from gensim.models.phrases import Phrases, Phraser

In [0]:
sent = data['review'].apply(lambda x: x.split()).to_numpy()

In [0]:
sent

In [0]:
phrases = Phrases(sent, min_count=30)
bigram = Phraser(phrases)
sentences = list(bigram[sent])

In [0]:
sentences[0]
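
A rough sketch of the idea behind phrase detection: the second Word2Vec paper scores a bigram as `(count(a b) - delta) / (count(a) * count(b))`, so pairs that co-occur more often than their individual frequencies suggest get a high score. The function `phrase_scores` and the toy sentences below are assumptions for illustration; gensim's default scorer is based on this formula but may differ in detail (for example in scaling and thresholds).

```python
from collections import Counter


def phrase_scores(sentences, delta=1):
    """Score each adjacent word pair with (count(ab) - delta) / (count(a) * count(b))."""
    unigrams = Counter()
    bigrams = Counter()
    for sent in sentences:
        unigrams.update(sent)
        bigrams.update(zip(sent, sent[1:]))
    return {
        pair: (count - delta) / (unigrams[pair[0]] * unigrams[pair[1]])
        for pair, count in bigrams.items()
    }


# Made-up corpus: "new york" repeats, every other pair occurs once
toy_sentences = [
    ["new", "york", "is", "big"],
    ["i", "love", "new", "york"],
    ["a", "new", "car", "is", "nice"],
]

scores = phrase_scores(toy_sentences)
best_pair = max(scores, key=scores.get)
print(best_pair, scores[best_pair])
```

Here `delta` plays the role of a discount that keeps pairs seen only once from scoring above zero.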

### CBOW
Continuous Bag of Words: the model predicts a target word from the words surrounding it.

![alt text](https://cdn-images-1.medium.com/max/800/1*UVe8b6CWYykcxbBOR6uCfg.png)

Nice tutorial: https://www.kaggle.com/pierremegret/gensim-word2vec-tutorial
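
To make the CBOW setup concrete, here is a minimal sketch of how (context, target) training examples can be read off a sentence with a fixed window. `cbow_examples` is a hypothetical helper and the sentence is made up; real implementations also shrink the window randomly and apply subsampling, which this sketch leaves out.

```python
def cbow_examples(tokens, window=2):
    """Pair each word (the target) with up to `window` words on each side (the context)."""
    examples = []
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        context = left + right
        if context:  # skip one-word sentences, which have no context
            examples.append((context, target))
    return examples


toy = ["the", "movie", "was", "really", "good"]
examples = cbow_examples(toy, window=2)
for context, target in examples:
    print(context, "->", target)
```

The model is then trained to predict `target` from the averaged embeddings of `context`.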

In [0]:
from gensim.models import Word2Vec

In [0]:
# `size` is the gensim 3.x argument name; gensim 4+ renamed it to `vector_size`
w2v_model = Word2Vec(min_count=20,
                     window=5,
                     size=300,
                     sample=6e-5,
                     alpha=0.03,
                     min_alpha=0.0007,
                     negative=20,
                     workers=1)
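
The `sample=6e-5` argument controls subsampling of very frequent words. As a rough illustration, the second Word2Vec paper discards each occurrence of a word with probability `1 - sqrt(t / f)`, where `f` is the word's relative frequency and `t` the threshold. The sketch below implements that paper formula only; gensim's own expression may differ in detail, and `discard_probability` is a hypothetical name.

```python
import math


def discard_probability(word_freq, threshold=6e-5):
    """Paper formula: P(discard) = 1 - sqrt(threshold / word_freq), floored at 0.

    word_freq is the word's relative frequency in the corpus (between 0 and 1).
    """
    return max(0.0, 1.0 - math.sqrt(threshold / word_freq))


# Illustrative frequencies: a very common word down to a rare one
for freq in (0.05, 0.001, 6e-5, 1e-6):
    print(f"freq={freq:g} -> discard with p={discard_probability(freq):.3f}")
```

Words at or below the threshold frequency are always kept, while very common words are dropped most of the time.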

In [8]:
w2v_model.build_vocab(sentences)


In [0]:
w2v_model.corpus_count

In [5]:
w2v_model.wv.most_similar(positive=["awesome"])


In [0]:
w2v_model.train(sentences, total_examples=w2v_model.corpus_count, epochs=10, report_delay=1)

(4392641, 12922180)

In [0]:
w2v_model.wv.most_similar(negative=["awesome"])

[('him', 0.07712670415639877),
 ('finds', 0.039329614490270615),
 ('her', 0.037964582443237305),
 ('his_wife', 0.027132555842399597),
 ('she', 0.025708287954330444),
 ('decides_to', 0.02090851590037346),
 ('father', 0.01855352893471718),
 ('tells', 0.016433771699666977),
 ('son', 0.009563196450471878),
 ('kill', 0.006780259311199188)]
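
The second element of each tuple returned by gensim's `most_similar` is a cosine similarity between embedding vectors. A minimal sketch of that measure, using made-up 3-dimensional vectors (real embeddings here have 300 dimensions, and these numbers are not taken from the trained model):

```python
import math


def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical toy vectors, chosen only to illustrate the measure
awesome = [0.9, 0.1, 0.3]
amazing = [0.8, 0.2, 0.4]
boring = [-0.7, 0.5, -0.2]

print(cosine_similarity(awesome, amazing))  # close to 1: similar direction
print(cosine_similarity(awesome, boring))   # negative: roughly opposite direction
```

Values range from -1 to 1, which is why similarities near 0 (as in the output above) indicate vectors that are close to unrelated.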

### skip-gram

![alt text](https://miro.medium.com/max/568/1*3xy5IOpScN0aQwwfFbCmGQ.png)
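
Skip-gram turns the CBOW setup around: each word is used to predict the words around it. A minimal sketch of the (center, context) pairs this produces, assuming a fixed window and a made-up sentence; `skipgram_pairs` is a hypothetical helper, not part of gensim.

```python
def skipgram_pairs(tokens, window=2):
    """Emit one (center, context) pair for every neighbour within `window` positions."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs


toy = ["the", "movie", "was", "good"]
pairs = skipgram_pairs(toy, window=1)
print(pairs)
```

Each pair becomes one training example, so skip-gram sees more (and noisier) examples per sentence than CBOW.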

In [0]:
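# sg=1 selects skip-gram (the default, sg=0, is CBOW)
# `size` is the gensim 3.x argument name; gensim 4+ renamed it to `vector_size`
sg_model = Word2Vec(min_count=20,
                    window=5,
                    size=300,
                    sample=6e-5,
                    alpha=0.03,
                    min_alpha=0.0007,
                    negative=20,
                    workers=1,
                    sg=1)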

In [0]:
sg_model.build_vocab(sentences)

In [0]:
sg_model.corpus_count

5812

In [0]:
sg_model.wv.most_similar(positive=["awesome"])

[('nomination', 0.20625518262386322),
 ('Season', 0.20247775316238403),
 ('psychotic', 0.18889707326889038),
 ('alongside', 0.18707317113876343),
 ('change', 0.17955568432807922),
 ('late', 0.17776718735694885),
 ('wanna', 0.17692437767982483),
 ('Saturday', 0.17276087403297424),
 ('sheer', 0.17222486436367035),
 ('pack', 0.17181582748889923)]

In [0]:
sg_model.train(sentences, total_examples=sg_model.corpus_count, epochs=10, report_delay=1)

(4392641, 12922180)

In [0]:
sg_model.wv.most_similar(negative=["awesome"])

[('drug', -0.09210406243801117),
 ('finds', -0.12461350858211517),
 ('married', -0.13180327415466309),
 ('Satan', -0.13506510853767395),
 ('English', -0.13669031858444214),
 ('murdered', -0.13688446581363678),
 ('Custer', -0.141212597489357),
 ('Soon', -0.14555160701274872),
 ('boy', -0.14579124748706818),
 ('middle', -0.14619497954845428)]

## GloVe


All about GloVe https://nlp.stanford.edu/projects/glove/

In [0]:
!pip install glove_python


In [0]:
from glove import Corpus, Glove

In [0]:
# creating a corpus object
corpus = Corpus() 

In [0]:
# fit the corpus to build the co-occurrence matrix that GloVe is trained on
corpus.fit(sentences, window=10)

In [0]:
corpus.matrix

<102953x102953 sparse matrix of type '<class 'numpy.float64'>'
	with 5281503 stored elements in COOrdinate format>
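
To make the co-occurrence matrix less abstract, here is a minimal pure-Python sketch that counts how often two words appear within a window of each other, with pairs `d` positions apart contributing `1/d` as described in the GloVe paper. `cooccurrence_counts` and the toy sentences are assumptions for illustration; `Corpus.fit` from glove_python may store or weight things differently.

```python
from collections import defaultdict


def cooccurrence_counts(sentences, window=2):
    """Symmetric, distance-weighted co-occurrence counts keyed by (word, other_word)."""
    counts = defaultdict(float)
    for sent in sentences:
        for i, word in enumerate(sent):
            for d in range(1, window + 1):
                if i + d < len(sent):
                    other = sent[i + d]
                    # words d positions apart contribute 1/d, in both directions
                    counts[(word, other)] += 1.0 / d
                    counts[(other, word)] += 1.0 / d
    return counts


toy_sentences = [["good", "movie", "tonight"], ["good", "movie"]]
counts = cooccurrence_counts(toy_sentences, window=2)
print(counts[("good", "movie")], counts[("good", "tonight")])
# 2.0 0.5
```

GloVe then learns vectors whose dot products approximate the logarithm of these counts.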

In [0]:
# create a Glove object that learns embeddings from the co-occurrence matrix built above
# no_components is the embedding dimension; learning_rate is the step size of its gradient-descent training
glove = Glove(no_components=300, learning_rate=0.05)
 
glove.fit(corpus.matrix, epochs=30, no_threads=2, verbose=True)

Performing 30 training epochs with 2 threads
Epoch 0
Epoch 1
Epoch 2
Epoch 3
Epoch 4
Epoch 5
Epoch 6
Epoch 7
Epoch 8
Epoch 9
Epoch 10
Epoch 11
Epoch 12
Epoch 13
Epoch 14
Epoch 15
Epoch 16
Epoch 17
Epoch 18
Epoch 19
Epoch 20
Epoch 21
Epoch 22
Epoch 23
Epoch 24
Epoch 25
Epoch 26
Epoch 27
Epoch 28
Epoch 29


In [0]:
glove.add_dictionary(corpus.dictionary)

In [0]:
glove.most_similar('awesome')

[('incredible', 0.9191289430844729),
 ('amazing', 0.8805096697632062),
 ('extremely', 0.8775054411864747),
 ('amateurish', 0.8701313309573746)]

## Transformers (Attention)

### encoder decoder architecture with LSTM

![alt text](https://camo.githubusercontent.com/9e88497fcdec5a9c716e0de5bc4b6d1793c6e23f/687474703a2f2f73757269796164656570616e2e6769746875622e696f2f696d672f736571327365712f73657132736571322e706e67)

### incorporating attention

![alt text](https://cdn-images-1.medium.com/max/1000/1*9Lcq9ni9aujScFYyyHRhhA.png)

### what is a Transformer?

![alt text](https://logodix.com/logo/2009798.jpg)

The famous paper on Transformers (Attention Is All You Need): https://arxiv.org/pdf/1706.03762.pdf

![alt text](https://lilianweng.github.io/lil-log/assets/images/transformer.png)

A comprehensive walkthrough of how the Transformer architecture works: http://jalammar.github.io/illustrated-transformer/ <br>
For an implementation and much more, see: http://nlp.seas.harvard.edu/2018/04/03/attention.html#embeddings-and-softmax
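
The core operation in the paper is scaled dot-product attention, `softmax(Q K^T / sqrt(d_k)) V`. Below is a minimal pure-Python sketch of that formula for a single head, with no masking, batching, or learned projections. The function names and the toy queries, keys, and values are made up for illustration.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention; returns (outputs, attention_weights)."""
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # similarity of this query to every key, scaled by sqrt(d_k)
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        # output is the weighted average of the value vectors
        out = [sum(w * v[j] for w, v in zip(weights, values))
               for j in range(len(values[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights


# Toy example: one query that aligns with the first and third keys
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]
queries = [[4.0, 0.0]]

outputs, weights = attention(queries, keys, values)
print(weights[0])
print(outputs[0])
```

The query puts most of its weight on the keys it matches, so the output is pulled toward their values.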

## BERT (Transfer Learning)

![alt text](https://i2.wp.com/mlexplained.com/wp-content/uploads/2019/01/bert.png?fit=400%2C400)

BERT paper: https://arxiv.org/pdf/1810.04805.pdf

Understanding how the BERT architecture works: https://medium.com/dissecting-bert

Fine-tuning BERT with [huggingface](https://huggingface.co/)

![alt text](https://pbs.twimg.com/profile_images/1126261339029164037/KgmTJrZI_400x400.png)