<h1 align=center>Word Embeddings Tutorial</h1>

In this notebook we will go through word embeddings using deep learning, we will not train a new model we will use pre-trained ones as training a new one will cost a lot.

We will be using `spacy` in this tutorial to demonstrate word embeddings

Update pip tools and install spacy

`pip install -U pip setuptools wheel`

`pip install -U spacy`

Download the English model

`python -m spacy download en_core_web_md`

In [11]:
! pip install -U pip setuptools wheel --user
! pip install -U spacy --user

Collecting setuptools
  Using cached setuptools-62.6.0-py3-none-any.whl (1.2 MB)
Installing collected packages: setuptools
Successfully installed setuptools-62.6.0


ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
conda-repo-cli 1.0.4 requires pathlib, which is not installed.
anaconda-project 0.10.2 requires ruamel-yaml, which is not installed.






In [12]:
! python -m spacy download en_core_web_md --user

Collecting en-core-web-md==3.3.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_md-3.3.0/en_core_web_md-3.3.0-py3-none-any.whl (33.5 MB)
     ---------------------------------------- 33.5/33.5 MB 1.8 MB/s eta 0:00:00
Installing collected packages: en-core-web-md
Successfully installed en-core-web-md-3.3.0
[+] Download and installation successful
You can now load the package via spacy.load('en_core_web_md')


2022-06-30 12:32:57.090627: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'cudart64_110.dll'; dlerror: cudart64_110.dll not found
2022-06-30 12:32:57.090658: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.


In [3]:
import spacy
import pandas as pd
import seaborn as sns
from sklearn.metrics.pairwise import cosine_similarity, cosine_distances

cm = sns.light_palette("blue", as_cmap=True)
nlp = spacy.load('en_core_web_md')

In [4]:
words = ['cat', 'dog', 'car', 'bird', 'eagle', 'milk', 'fly']
vectors = [nlp(word).vector for word in words]

In [5]:
similarities = cosine_similarity(vectors, vectors)
pd.DataFrame(similarities, columns=words, index=words).style.background_gradient(cmap=cm)

Unnamed: 0,cat,dog,car,bird,eagle,milk,fly
cat,1.0,1.0,0.19305,0.258738,0.307154,0.310448,0.198101
dog,1.0,1.0,0.19305,0.258738,0.307154,0.310448,0.198101
car,0.19305,0.19305,1.0,0.078293,0.364785,0.172172,0.170922
bird,0.258738,0.258738,0.078293,1.0,0.181234,0.271852,0.314563
eagle,0.307154,0.307154,0.364785,0.181234,1.0,0.247915,0.146649
milk,0.310448,0.310448,0.172172,0.271852,0.247915,1.0,0.246011
fly,0.198101,0.198101,0.170922,0.314563,0.146649,0.246011,1.0


# Vectors !

The vectors generated by `spacy` model is a 300 dimensional vector which is the output of a pre-trained GloVe model.

In [6]:
vector = nlp("Bank").vector
print(vector.shape)
print(vector[:5])

(300,)
[-0.60877  0.30253 -0.12351 -0.23647  0.2665 ]


## Embeddings as feature

We can use word embedding as features of the text and build a classifier using them

In [7]:
import numpy as np
from tqdm.auto import tqdm
from sklearn.datasets import fetch_20newsgroups
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.metrics import classification_report

categories = ['alt.atheism', 'soc.religion.christian',
              'comp.graphics', 'sci.med']
              
x_train, y_train = fetch_20newsgroups(categories=categories, 
                          remove=('headers', 'footers', 'quotes'), return_X_y=True)
x_test, y_test = fetch_20newsgroups(categories=categories, 
                          remove=('headers', 'footers', 'quotes'), return_X_y=True, subset='test')

In [8]:
x_train_v = np.zeros((len(x_train), 300))
x_test_v = np.zeros((len(x_test), 300))

for i, doc in tqdm(enumerate(nlp.pipe(x_train)), total=len(x_train)):
    x_train_v[i, :] = doc.vector

for i, doc in tqdm(enumerate(nlp.pipe(x_test)), total=len(x_test)):
    x_test_v[i, :] = doc.vector

  0%|          | 0/2257 [00:00<?, ?it/s]

  0%|          | 0/1502 [00:00<?, ?it/s]

# Train a classifier

In [9]:
clf = LinearSVC()
clf.fit(x_train_v, y_train)
print(classification_report(y_test, clf.predict(x_test_v), target_names=categories))

                        precision    recall  f1-score   support

           alt.atheism       0.73      0.57      0.64       319
soc.religion.christian       0.85      0.93      0.89       389
         comp.graphics       0.84      0.86      0.85       396
               sci.med       0.75      0.80      0.77       398

              accuracy                           0.80      1502
             macro avg       0.79      0.79      0.79      1502
          weighted avg       0.80      0.80      0.79      1502



# Get top similar

In [10]:
import random
from termcolor import colored

for i in random.choices(range(0, len(x_test_v)), k=5):
    print(f"ID: {i}")
    print("True label:", colored(categories[y_test[i]], 'green'))
    distances = cosine_similarity([x_test_v[i]], x_train_v).flatten()
    indices = np.argsort(distances)[::-1]
    for _, j in enumerate(indices[:3]):
        print(f"{_} nearest label is",
              f"{colored(categories[y_train[j]], 'green' if y_train[j]==y_test[i] else 'red')}",
              f"similarity score: {colored(round(distances[j], 3), 'yellow')}")

ID: 756
True label: [32msci.med[0m
0 nearest label is [31msoc.religion.christian[0m similarity score: [33m0.988[0m
1 nearest label is [31msoc.religion.christian[0m similarity score: [33m0.987[0m
2 nearest label is [31msoc.religion.christian[0m similarity score: [33m0.987[0m
ID: 108
True label: [32mcomp.graphics[0m
0 nearest label is [31msci.med[0m similarity score: [33m0.996[0m
1 nearest label is [31msoc.religion.christian[0m similarity score: [33m0.994[0m
2 nearest label is [31msoc.religion.christian[0m similarity score: [33m0.993[0m
ID: 852
True label: [32mcomp.graphics[0m
0 nearest label is [32mcomp.graphics[0m similarity score: [33m0.988[0m
1 nearest label is [32mcomp.graphics[0m similarity score: [33m0.988[0m
2 nearest label is [32mcomp.graphics[0m similarity score: [33m0.987[0m
ID: 1222
True label: [32msoc.religion.christian[0m
0 nearest label is [32msoc.religion.christian[0m similarity score: [33m0.983[0m
1 nearest label is [32mso

# Conclusion

- Word embedding is a very powerful feature specially if you have small data, as your model will make use of the learned features of the word2vec model and thus will be able to make better predictions.
- Word2vec and GloVe don't count for different context that the same word can have in different sentences