# Word Embeddings tutorial

In this notebook I will go through word embeddings using deep learning. I will not train a new model; instead I will use pre-trained embeddings, because training from scratch is expensive and pre-trained word embeddings are usually trained on large and diverse text corpora. This means they have already learned a lot about the semantic relationships between words. By using these embeddings, you can leverage that knowledge as a starting point for your specific NLP task.

I will be using `spacy` in this tutorial to demonstrate word embeddings.

Update the pip tooling and install spacy:

`pip install -U pip setuptools wheel`

`pip install -U spacy`

Download the English model:

`python -m spacy download en_core_web_md`

In [25]:
# Import the spaCy library for natural language processing
import spacy

# Import the pandas library for data manipulation and analysis
import pandas as pd

# Import the seaborn library for data visualization
import seaborn as sns

# Import cosine_similarity and cosine_distances functions for measuring similarity
from sklearn.metrics.pairwise import cosine_similarity, cosine_distances

# Create a color map using seaborn for data visualization
cm = sns.light_palette("blue", as_cmap=True)

# Load the English spaCy model for text processing
nlp = spacy.load('en_core_web_md')


In [26]:
# Define a list of words for which we want to obtain word vectors
words = ['cat', 'dog', 'car', 'bird', 'eagle','fly','drink','drive']

# Create a list of word vectors by processing each word with the loaded 'en_core_web_md' model
vectors = [nlp(word).vector for word in words]


In [27]:
import numpy as np

# Calculate the shape of the vectors list
shape_of_vectors = np.array(vectors).shape

# Print the shape
print("Shape of vectors list:", shape_of_vectors)

Shape of vectors list: (8, 300)


The output `(8, 300)` indicates that I have a list of 8 word vectors, each with 300 dimensions. So, each word in this list is represented by a 300-dimensional embedding. Individual dimensions are generally not interpretable on their own; the word's meaning and context are encoded in the vector as a whole, and the values are learned during the training of the word embedding model.

In [28]:
similarities = cosine_similarity(vectors, vectors)
pd.DataFrame(similarities, columns=words, index=words).style.background_gradient(cmap=cm)

word,cat,dog,car,bird,eagle,fly,drink,drive
cat,1.0,0.822082,0.196986,0.536937,0.330381,0.154706,0.222942,-0.043103
dog,0.822082,1.0,0.325003,0.45674,0.268695,0.117945,0.286763,0.078213
car,0.196986,0.325003,1.0,0.153305,0.069607,0.139716,0.191315,0.440595
bird,0.536937,0.45674,0.153305,1.0,0.623638,0.366934,0.179202,-0.004182
eagle,0.330381,0.268695,0.069607,0.623638,1.0,0.320517,0.055756,-0.012408
fly,0.154706,0.117945,0.139716,0.366934,0.320517,1.0,0.185683,0.379096
drink,0.222942,0.286763,0.191315,0.179202,0.055756,0.185683,1.0,0.239019
drive,-0.043103,0.078213,0.440595,-0.004182,-0.012408,0.379096,0.239019,1.0
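
To make the numbers above less of a black box, here is a minimal sketch of what cosine similarity computes: the dot product of two vectors divided by the product of their lengths. The helper `cosine_sim` and the toy 3-dimensional vectors are made up for illustration; they are not real embeddings, and this is not scikit-learn's implementation.

```python
import math


def cosine_sim(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy vectors, invented for this example (assumption: not taken from any model)
toy_cat = [1.0, 2.0, 0.0]
toy_dog = [2.0, 4.0, 0.0]  # same direction as toy_cat, twice as long
toy_car = [0.0, 0.0, 3.0]  # orthogonal to toy_cat

print("cat vs dog:", cosine_sim(toy_cat, toy_dog))  # close to 1
print("cat vs car:", cosine_sim(toy_cat, toy_car))  # 0.0
```

Vectors that point the same way score close to 1 regardless of their length, and orthogonal vectors score 0. This is why `cat` and `dog` sit much closer to each other in the table than either does to `drive`.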


# Vectors

Each vector produced by the `spacy` model has 300 dimensions. These are pre-trained static vectors that ship with the model; depending on the model version, they are GloVe vectors or vectors trained by the spaCy team.

In [29]:
# Obtain the word vector for the word "Bank" using the loaded 'en_core_web_md' model
vector = nlp("Bank").vector

# Print the shape of the word vector to show its dimensionality
print("Shape of the word vector:", vector.shape)

# Print the first 5 dimensions of the word vector
print("First 5 dimensions of the word vector:", vector[:5])


Shape of the word vector: (300,)
First 5 dimensions of the word vector: [-2.761   -7.7805   2.612    2.7064  -0.31857]


## Embeddings as features

We can use word embeddings as features for a text and build a classifier on top of them.
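
A text has many words but a classifier needs one fixed-size feature vector per text. A common way to get one is to average the word vectors, and as far as I know this is also what spaCy's `doc.vector` does by default. The sketch below shows the idea with a hypothetical helper `mean_vector` and made-up 3-dimensional token vectors; it is not spaCy's own code.

```python
def mean_vector(token_vectors):
    """Average a list of equal-length token vectors, dimension by dimension."""
    n = len(token_vectors)
    dim = len(token_vectors[0])
    return [sum(vec[d] for vec in token_vectors) / n for d in range(dim)]


# Made-up vectors for a three-token text (assumption: illustration only)
toy_tokens = [
    [1.0, 0.0, 2.0],
    [3.0, 2.0, 0.0],
    [2.0, 4.0, 1.0],
]

doc_vector = mean_vector(toy_tokens)
print(doc_vector)  # [2.0, 2.0, 1.0]
```

The result has the same number of dimensions as a single word vector no matter how long the text is, which is what lets every document fit into one row of a feature matrix.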

In [30]:
# Import necessary libraries
import numpy as np
from tqdm.auto import tqdm  # tqdm for progress bar
from sklearn.datasets import fetch_20newsgroups
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.metrics import classification_report

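# Define the categories (newsgroup topics) we want to classify
# Keep them in alphabetical order: fetch_20newsgroups numbers the labels by
# sorted category name, and `categories` is indexed by label further down
categories = ['alt.atheism', 'comp.graphics',
              'sci.med', 'soc.religion.christian']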

# Fetch the training data and labels from the 20 Newsgroups dataset
x_train, y_train = fetch_20newsgroups(categories=categories,
                          remove=('headers', 'footers', 'quotes'), return_X_y=True)

# Fetch the test data and labels from the 20 Newsgroups dataset
x_test, y_test = fetch_20newsgroups(categories=categories,
                          remove=('headers', 'footers', 'quotes'), return_X_y=True, subset='test')


In [31]:
# Create zero-initialized arrays to store word vectors for training and testing data
# Each word vector has a dimensionality of 300
x_train_v = np.zeros((len(x_train), 300))
x_test_v = np.zeros((len(x_test), 300))

# Loop through and process training data with spaCy, extracting word vectors
# tqdm is used to display a progress bar for the loop
for i, doc in tqdm(enumerate(nlp.pipe(x_train)), total=len(x_train)):
    # Store the word vector of the processed document in the training data array
    x_train_v[i, :] = doc.vector

# Loop through and process testing data with spaCy, extracting word vectors
# tqdm is used to display a progress bar for the loop
for i, doc in tqdm(enumerate(nlp.pipe(x_test)), total=len(x_test)):
    # Store the word vector of the processed document in the testing data array
    x_test_v[i, :] = doc.vector



# Train a classifier

In [32]:
# Initialize a Linear Support Vector Classifier (LinearSVC)
clf = LinearSVC()

# Fit the classifier on the training data with their corresponding word vectors
clf.fit(x_train_v, y_train)

# Generate and print a classification report on the test data
# This report includes precision, recall, F1-score, and other metrics for each category
# It helps evaluate the performance of the classifier
print(classification_report(y_test, clf.predict(x_test_v), target_names=categories))


                        precision    recall  f1-score   support

           alt.atheism       0.58      0.70      0.63       319
         comp.graphics       0.89      0.87      0.88       389
               sci.med       0.82      0.83      0.83       396
soc.religion.christian       0.78      0.66      0.72       398

              accuracy                           0.77      1502
             macro avg       0.77      0.77      0.76      1502
          weighted avg       0.78      0.77      0.77      1502





# Get the most similar documents

In [33]:
# Import necessary libraries
import random
from termcolor import colored

# Randomly choose 5 distinct samples from the test data
for i in random.sample(range(len(x_test_v)), k=5):
    print(f"ID: {i}")

    # Print the true label in green color
    print("True label:", colored(categories[y_test[i]], 'green'))

    # Calculate cosine similarity between the test sample and all training samples
    similarities = cosine_similarity([x_test_v[i]], x_train_v).flatten()

    # Sort the indices of training samples by similarity (descending order)
    indices = np.argsort(similarities)[::-1]

    # Print the labels and similarity scores of the 3 most similar training samples
    for rank, j in enumerate(indices[:3]):
        # Label color is green if it matches the true label, red otherwise
        label_color = 'green' if y_train[j] == y_test[i] else 'red'

        # Print the rank, the colored label, and the similarity score in yellow
        print(f"{rank} nearest label is",
              f"{colored(categories[y_train[j]], label_color)}",
              f"similarity score: {colored(round(similarities[j], 3), 'yellow')}")


ID: 82
True label: soc.religion.christian
0 nearest label is soc.religion.christian similarity score: 0.99
1 nearest label is soc.religion.christian similarity score: 0.988
2 nearest label is soc.religion.christian similarity score: 0.988
ID: 516
True label: soc.religion.christian
0 nearest label is soc.religion.christian similarity score: 0.977
1 nearest label is soc.religion.christian similarity score: 0.973
2 nearest label is soc.religion.christian similarity score: 0.972
ID: 1359
True label: sci.med
0 nearest label is sci.med similarity score: 0.952
1 nearest label is soc.religion.christian similarity score: 0.945
2 nearest label is sci.med similarity score: 0.937
ID: 35
True label: alt.atheism
0 nearest label is alt.atheism similarity score: 0.987
1 nearest label is soc.religion.christian similarity score: 0.983
2 nearest label is alt.atheism similarity score: 0.981
ID: 1225
True label: comp.graphics
0 nearest label is soc.religion.christian similarity score: 0.932
1 nearest label is alt.atheism similarity score: 0.927
2 nearest label is alt.atheism similarity score: 0.923


# Conclusion

- Word embeddings are a very powerful feature, especially if you have little data, because your model can make use of what the pre-trained embedding model has already learned and thus make better predictions.
- Static embeddings such as Word2vec and GloVe don't account for the different meanings the same word can have in different sentences.
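
The last point can be illustrated with a small sketch. It uses a hypothetical lookup table `static_vectors` with made-up 2-dimensional vectors and a hypothetical helper `embed`; real static embeddings behave the same way in that each word maps to exactly one vector.

```python
# Hypothetical static embedding table (assumption: values invented for illustration)
static_vectors = {
    "bank": (0.4, 0.9),
    "river": (0.1, 0.8),
    "money": (0.9, 0.2),
}


def embed(sentence):
    """Look up each known word; the surrounding words play no role."""
    words = sentence.lower().split()
    return {w: static_vectors[w] for w in words if w in static_vectors}


finance = embed("I deposited money at the bank")
nature = embed("We sat on the river bank")

# "bank" gets the same vector in both sentences, whatever the context
print(finance["bank"] == nature["bank"])  # True
```

Because the lookup ignores the rest of the sentence, the financial and the riverside sense of "bank" are indistinguishable at the word level.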