![Practicum AI Logo image](https://github.com/PracticumAI/practicumai.github.io/blob/main/images/logo/PracticumAI_logo_250x50.png?raw=true)
***
# *Practicum AI:* NLP - Embeddings

These exercises are adapted from Baig et al. (2020) <i>The Deep Learning Workshop</i> from <a href="https://www.packtpub.com/product/the-deep-learning-workshop/9781839219856">Packt Publishers</a> (Exercises 4.01 - 4.06, page 159).

(15 Minutes: Exercises 4.05 - 4.06)

#### Introduction

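Word2Vec is a popular word embedding model, developed by Tomas Mikolov and colleagues at Google in 2013. It comes in two flavors, both shallow neural networks: Skip-gram and Continuous Bag of Words (CBOW). Skip-gram predicts the surrounding context words from a center word, while CBOW predicts the center word from its surrounding context.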
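
To make the difference between the two flavors concrete, here is a minimal sketch of the training examples each one would see for a short sentence. It is an illustration only, not gensim's implementation: the helper names `skipgram_pairs` and `cbow_examples` and the toy sentence are assumptions made up for this example.

```python
def skipgram_pairs(tokens, window=2):
    """Skip-gram: one (center, context) pair per context word."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


def cbow_examples(tokens, window=2):
    """CBOW: one (context_words, center) example per position."""
    examples = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        examples.append((context, center))
    return examples


# Hypothetical toy sentence, window of one word on each side.
tokens = ["the", "quick", "brown", "fox"]
sg = skipgram_pairs(tokens, window=1)
cb = cbow_examples(tokens, window=1)
print(sg)
print(cb)
```

Skip-gram produces one training pair per context word, so every word contributes several separate examples, whereas CBOW folds the whole context into a single example per position.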

In [None]:
import nltk
from gensim.models import word2vec

#### Training Our Own Embeddings - Page 197

<div style="padding: 10px;margin-bottom: 20px;border: thin solid #30335D;border-left-width: 10px;background-color: #fff"><strong>Note:</strong> The code block below is in the text but is commented out as the text8 dataset is in the data folder.</div>

In [3]:
# Another way of loading the data (requires `import gensim.downloader as api`).
# If this doesn't work, use the local text8 corpus file instead.
# dataset = api.load("text8")

In [2]:
# Load the dataset from the data folder.
dataset = word2vec.Text8Corpus("data/text8")

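To make results more reproducible, set the random seed to 1. Note that gensim's `Word2Vec` also has its own `seed` parameter, and multi-threaded training can still vary between runs, so your outputs may not match the ones shown here exactly.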

<div style="padding: 10px;margin-bottom: 20px;border: thin solid #30335D;border-left-width: 10px;background-color: #fff"><strong>Note:</strong> The text does not import the numpy library so that needs to be done prior to setting the random seed.</div>

In [3]:
import numpy as np

np.random.seed(1)

In [4]:
model = word2vec.Word2Vec(dataset)

In [5]:
print(model.wv["animal"])

[ 1.28147197e+00  4.93714213e-02 -1.77742636e+00  7.55727291e-01
 -4.48607475e-01 -1.61682606e+00  6.28471494e-01  1.29946005e+00
 -1.47697222e+00  7.33656824e-01 -2.50913620e-01  1.51851982e-01
 -8.39449286e-01  2.08915448e+00 -1.26336062e+00 -2.56653976e+00
  5.14868915e-01  7.32167482e-01 -2.00053260e-01  5.38162589e-01
  7.24335492e-01 -2.29516530e+00  1.24233079e+00  1.90678668e+00
  2.64481211e+00 -3.21866304e-01 -2.15857774e-01  8.05656314e-01
  7.09271729e-01 -3.06913555e-02  1.01351726e+00  1.23749733e+00
  1.07249832e+00 -3.15461636e+00 -6.08641505e-01 -2.17098817e-01
  1.12186813e+00  1.78734243e+00 -1.53762472e+00 -8.31309557e-02
  5.96177101e-01 -7.12612689e-01 -6.29826248e-01  2.86765814e+00
 -1.24110842e+00  3.18110204e+00 -1.22431970e+00  7.18615949e-01
  2.28058863e-02  1.10605106e-01 -3.84237826e-01 -1.09326792e+00
 -1.59575546e+00  1.34581447e+00  1.97974071e-01  1.38159752e+00
  1.04777730e+00  1.09187233e+00 -6.55379534e-01 -4.87283587e-01
 -3.02258093e-04 -3.09790

In [6]:
len(model.wv["animal"])

100

In [7]:
model.wv.most_similar("animal")

[('animals', 0.7239352464675903),
 ('insect', 0.7155537009239197),
 ('ants', 0.6823122501373291),
 ('insects', 0.679811954498291),
 ('aquatic', 0.660001814365387),
 ('organism', 0.656354546546936),
 ('eating', 0.6549410820007324),
 ('mammal', 0.6496347188949585),
 ('human', 0.6453425288200378),
 ('humans', 0.642937421798706)]

In [8]:
model.wv.most_similar("happiness")

[('humanity', 0.7880685329437256),
 ('goodness', 0.7656909227371216),
 ('compassion', 0.7424624562263489),
 ('dignity', 0.7383203506469727),
 ('fear', 0.7318388819694519),
 ('pleasure', 0.7316447496414185),
 ('perfection', 0.7248966693878174),
 ('salvation', 0.7177927494049072),
 ('mankind', 0.7145527601242065),
 ('immortality', 0.7055302262306213)]
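
The scores returned by `most_similar` are cosine similarities between word vectors. As a rough sketch of that ranking, the example below uses made-up three-dimensional vectors instead of the trained 100-dimensional ones. The names `toy_vectors` and `most_similar_toy` are hypothetical, and the numbers are invented for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Invented vectors, NOT taken from the trained model.
toy_vectors = {
    "animal": [0.9, 0.1, 0.0],
    "insect": [0.8, 0.3, 0.1],
    "happiness": [0.0, 0.2, 0.9],
    "pleasure": [0.1, 0.3, 0.8],
}


def most_similar_toy(word, vectors, topn=2):
    """Rank every other word by cosine similarity to `word`."""
    query = vectors[word]
    scores = [
        (other, cosine(query, vec))
        for other, vec in vectors.items()
        if other != word
    ]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:topn]


ranked = most_similar_toy("animal", toy_vectors)
print(ranked)
```

Because cosine similarity depends only on the angle between vectors, a word with a long vector is not automatically ranked higher than one with a short vector.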

##### Semantic Regularities in Word Embeddings

In [9]:
model.wv.most_similar(positive=['woman', 'king'], negative=['man'], topn = 5)

[('queen', 0.6914654970169067),
 ('princess', 0.6395758986473083),
 ('prince', 0.6220191717147827),
 ('daughter', 0.6123440265655518),
 ('empress', 0.6098198294639587)]

In [10]:
model.wv.most_similar(positive=['uncle', 'woman'], negative=['man'], topn = 5)

[('wife', 0.8296380043029785),
 ('aunt', 0.8148874640464783),
 ('niece', 0.8077722787857056),
 ('daughter', 0.7944497466087341),
 ('grandmother', 0.7883173227310181)]
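
The analogy queries above combine vectors: the `positive` words are added, the `negative` words are subtracted, and the words closest to the result are returned (excluding the query words themselves). The sketch below shows that idea with invented two-dimensional vectors, where the first component loosely stands for "royalty" and the second for "gender". It is a simplification; gensim normalises and averages the vectors internally, and the `analogy` helper and `toy` dictionary are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


# Invented vectors: [royalty, gender]. NOT taken from the trained model.
toy = {
    "king": [1.0, 1.0],
    "queen": [1.0, -1.0],
    "man": [0.0, 1.0],
    "woman": [0.0, -1.0],
    "prince": [0.8, 0.9],
}


def analogy(positive, negative, vectors):
    """Add the positive vectors, subtract the negative ones, return the best match."""
    dim = len(next(iter(vectors.values())))
    query = [0.0] * dim
    for word in positive:
        query = [q + x for q, x in zip(query, vectors[word])]
    for word in negative:
        query = [q - x for q, x in zip(query, vectors[word])]
    exclude = set(positive) | set(negative)  # query words are never returned
    candidates = [
        (word, cosine(query, vec))
        for word, vec in vectors.items()
        if word not in exclude
    ]
    return max(candidates, key=lambda pair: pair[1])


best_word, best_score = analogy(["woman", "king"], ["man"], toy)
print(best_word, best_score)
```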

#### Exercise 4.05 (Vectors for Phrases) - Page 201

***

#### 1. Extract the vector for 'get'

```python
v1 = model.wv['get']
```

In [None]:
# Code it!

#### 2. Extract the vector for 'happy'

```python
v2 = model.wv['happy']
```

In [None]:
# Code it!

#### 3. Create a new vector as the average of the two vectors

```python
res1 = (v1 + v2) / 2
```

In [None]:
# Code it!

#### 4. Extract vectors for 'make' and 'merry'

```python
v1 = model.wv['make']
v2 = model.wv['merry']
```

In [None]:
# Code it!

#### 5. Create a new vector as the average of the two vectors

```python
res2 = (v1+v2)/2
```

In [None]:
# Code it!

#### 6. Find cosine similarity between the two averaged vectors

```python
model.wv.cosine_similarities(res1, [res2])
```

In [7]:
# Code it!
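
As a sanity check on what the exercise computes, the sketch below repeats the same steps with invented three-dimensional vectors, so it runs without the trained model. The `toy` dictionary and the `average` helper are hypothetical. It also illustrates one limitation of averaging: word order is lost, so "get happy" and "happy get" produce the same phrase vector.

```python
import math


def average(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(component) / n for component in zip(*vectors)]


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


# Invented vectors, NOT taken from the trained model.
toy = {
    "get": [0.2, 0.7, 0.1],
    "happy": [0.9, 0.1, 0.3],
    "make": [0.3, 0.6, 0.2],
    "merry": [0.8, 0.2, 0.4],
}

res1 = average([toy["get"], toy["happy"]])    # phrase vector for "get happy"
res2 = average([toy["make"], toy["merry"]])   # phrase vector for "make merry"
similarity = cosine(res1, res2)
print(similarity)

# Averaging ignores word order: reversing the words gives the same vector.
reversed_res1 = average([toy["happy"], toy["get"]])
print(reversed_res1 == res1)
```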

***
#### Effect of Parameters - 'size' of the Vector

In [135]:
# In gensim 4.0 and later, this parameter is named `vector_size` instead of `size`.
model = word2vec.Word2Vec(dataset, size = 30)

In [136]:
model.wv.most_similar(positive=['woman', 'king'], negative=['man'], topn = 5)

[('son', 0.8133434653282166),
 ('empress', 0.8022229671478271),
 ('emperor', 0.7999867796897888),
 ('archbishop', 0.7950774431228638),
 ('constantine', 0.7858606576919556)]

#### Effect of Parameters - Skip-gram vs. CBOW

##### Rare terms - oeuvre

In [137]:
model = word2vec.Word2Vec(dataset)

In [138]:
model.wv.most_similar("oeuvre", topn = 5)

[('seminal', 0.7173739671707153),
 ('baglione', 0.6992780566215515),
 ('wace', 0.6952950954437256),
 ('mockery', 0.6938953399658203),
 ('foxe', 0.687375545501709)]

In [139]:
model_sg = word2vec.Word2Vec(dataset, sg = 1)

In [140]:
model_sg.wv.most_similar("oeuvre", topn = 5)

[('masterful', 0.8323545455932617),
 ('satiric', 0.8200669288635254),
 ('masterwork', 0.815832257270813),
 ('mussorgsky', 0.815514862537384),
 ('librettos', 0.8108195662498474)]

#### Exercise 4.06 (Training Word Vectors on Different Datasets) - Page 205

***

#### 1. Import the Brown and IMDb movie reviews corpora

```python
nltk.download('brown')
nltk.download('movie_reviews')

from nltk.corpus import brown, movie_reviews
```

In [13]:
# Code it!

#### 2. Extract the sentences using the .sents() method and train a skip-gram model on each corpus

```python
model_brown = word2vec.Word2Vec(brown.sents(), sg = 1)
model_movie = word2vec.Word2Vec(movie_reviews.sents(), sg = 1)
```

In [1]:
# Code it!

#### 3. Print the five terms most similar to 'money' in the Brown corpus

```python
model_brown.wv.most_similar('money', topn = 5)
```

In [11]:
# Code it!

#### 4. Print the five terms most similar to 'money' in the movie corpus

```python
model_movie.wv.most_similar('money', topn = 5)
```

In [12]:
# Code it!

***
##### Using Pre-Trained Word Vectors

In [1]:
from gensim.scripts.glove2word2vec import glove2word2vec

glove_input_file     = 'data/glove.6B.100d.txt'
word2vec_output_file = 'data/glove.6B.100d.w2vformat.txt'

glove2word2vec(glove_input_file, word2vec_output_file)

(400000, 100)
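
The `(400000, 100)` returned above is the vocabulary size and the vector dimension. The word2vec text format expects those two numbers as a header line, which raw GloVe files lack, so the conversion essentially prepends that header. The sketch below illustrates the idea on two invented lines held in memory. The `add_word2vec_header` helper is hypothetical and is not gensim's code.

```python
def add_word2vec_header(glove_lines):
    """Prepend a '<vocab_size> <dimensions>' header to GloVe-style lines."""
    rows = [line.rstrip("\n") for line in glove_lines if line.strip()]
    vocab_size = len(rows)
    # Each row is a word followed by its vector components.
    dimensions = len(rows[0].split()) - 1
    return [f"{vocab_size} {dimensions}"] + rows


# Invented GloVe-style lines: a word, then space-separated floats.
glove_lines = [
    "money 0.1 0.2 0.3\n",
    "cash 0.2 0.1 0.4\n",
]

converted = add_word2vec_header(glove_lines)
for line in converted:
    print(line)
```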

In [2]:
from gensim.models.keyedvectors import KeyedVectors

glove_model = KeyedVectors.load_word2vec_format("data/glove.6B.100d.w2vformat.txt", binary = False)

In [3]:
glove_model.most_similar("money", topn = 5)

[('funds', 0.8508071303367615),
 ('cash', 0.848483681678772),
 ('fund', 0.7594833374023438),
 ('paying', 0.7415367364883423),
 ('pay', 0.7407673001289368)]

In [4]:
glove_model.most_similar(positive=['woman', 'king'], negative=['man'], topn = 5)

[('queen', 0.7698541283607483),
 ('monarch', 0.6843380928039551),
 ('throne', 0.6755735874176025),
 ('daughter', 0.6594556570053101),
 ('princess', 0.6520534753799438)]

##### Bias in Embeddings – A Word of Caution

In [149]:
model.wv.most_similar(positive=['woman', 'doctor'], negative=['man'], topn = 5)

[('child', 0.6149958372116089),
 ('nurse', 0.6090491414070129),
 ('teacher', 0.5878923535346985),
 ('dominatrix', 0.5384681224822998),
 ('detective', 0.5246642231941223)]

In [150]:
model.wv.most_similar(positive=['woman', 'smart'], negative=['man'], topn = 5)

[('pet', 0.6097452640533447),
 ('odie', 0.567996621131897),
 ('lingerie', 0.5643869042396545),
 ('scam', 0.5464061498641968),
 ('thug', 0.5415985584259033)]