# MP5: Embeddings and Sentence Classification 

### Introduction

Embeddings are a way to represent words (or more generally, *tokens*) as vectors. These vectors are useful for many tasks in Natural Language Processing (NLP), including but not limited to Text Generation, Machine Translation, and Sentence Classification. In this notebook, I explore the concept of Embeddings and use them for Sentence Classification.

### Importing Libraries

In [83]:
import re  # For text preprocessing
import numpy as np  # For numerical operations and handling arrays
import pandas as pd  # For data manipulation and analysis
import nltk  # The Natural Language Toolkit (nltk) library, useful for text processing and NLP tasks
from nltk.corpus import stopwords  # A list of common words to filter out in text preprocessing
from sklearn.model_selection import train_test_split  # To split datasets into training and test sets
from sklearn.linear_model import LogisticRegression # The logistic regression model
from sklearn.metrics import accuracy_score, classification_report # Evaluation metrics for the trained model
import gensim.downloader as api  # To load pre-trained word embeddings
from gensim.models.word2vec import Word2Vec  # Word2Vec model from gensim for creating and training word embeddings
from gpt4all import Embed4All  # For generating embeddings for text
import torch  # Deep learning library with in-built mathematical operations
import torch.nn as nn  # For building and training models

nltk.download('stopwords')  # Downloading the 'stopwords' dataset from nltk, necessary for filtering out common words in text
nltk.download('wordnet')  # Downloading the 'wordnet' dataset, useful for lemmatization and other NLP tasks


[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\GNG\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\GNG\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

### Exploring Embeddings

Put simply, Embeddings are fixed-size **dense** vector representations of tokens in natural language. This means we can represent words as vectors, sentences as vectors, even other entities like entire graphs as vectors.

So what really makes them different from something like One-Hot vectors? What's special is that they have semantic meaning baked into them: words used in similar contexts end up with similar vectors. This means you can model relationships between entities in text, which itself leads to a lot of fun applications. Nearly all modern NLP architectures make use of Embeddings in some way.
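To make the contrast with One-Hot vectors concrete, here is a minimal sketch in plain Python. The three-word vocab and the dense values are made up by hand for illustration (they are not learned by any model). It only shows that distinct One-Hot vectors never overlap, whereas dense vectors can encode "how related" two words are.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

vocab = ["king", "queen", "earth"]  # hypothetical toy vocab

# One-hot: each word gets its own axis, so two different words never overlap.
one_hot = {
    word: [1.0 if i == j else 0.0 for j in range(len(vocab))]
    for i, word in enumerate(vocab)
}

# Dense: hand-picked toy values (NOT learned) where related words point the same way.
dense = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.8, 0.9, 0.1],
    "earth": [0.1, 0.0, 0.9],
}

one_hot_overlap = dot(one_hot["king"], one_hot["queen"])
dense_related = dot(dense["king"], dense["queen"])
dense_unrelated = dot(dense["king"], dense["earth"])

print("one-hot king.queen:", one_hot_overlap)
print("dense king.queen > dense king.earth:", dense_related > dense_unrelated)
```

With One-Hot vectors every pair of different words looks equally unrelated, so there is nothing for a downstream model to exploit. The dense toy vectors give "king" and "queen" a much larger overlap than "king" and "earth".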

More info about them can be found [here](https://aman.ai/primers/ai/word-vectors/). In this notebook, I lean on Embeddings that are *pretrained* on a large corpus of text wherever possible, primarily because training high-quality Embeddings from scratch is computationally expensive. The one exception is the word2vec model in the next cell, which is trained on the small text8 corpus and takes only a few minutes.

In [32]:
# Downloading the text8 corpus and training a word2vec model on it (this may take a few minutes)
corpus = api.load('text8') # text8 is a small corpus of cleaned Wikipedia text
w2vmodel = Word2Vec(corpus) # The w2vmodel learns vector representations of words from the downloaded text corpus

print("Done loading word2vec model!")

Done loading word2vec model!


Now that the Embeddings have been loaded, we can create an Embedding **layer** in PyTorch, `nn.Embedding`, that will perform the processing step for us.

Note the **vocab size** and **embedding dimension** printed in the following cell. The vocab size is the number of words the model has a vector for, and it varies between sets of Embeddings: some cover a large vocab, whereas older ones may cover a smaller one. The embedding dimension is the number of *features* learned for each word, which is what any further processing builds on.
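Conceptually, an Embedding layer is a lookup table with one row per vocab entry and one column per feature. The sketch below mimics that idea in plain Python with a hypothetical five-word vocab and random values. It is an illustration of the shape bookkeeping only, not how `nn.Embedding` is implemented internally.

```python
import random

random.seed(0)  # reproducible toy values

vocab = ["the", "of", "and", "king", "queen"]  # hypothetical 5-word vocab
key_to_index = {word: i for i, word in enumerate(vocab)}
embedding_dim = 4  # number of features stored per word

# The "layer": a table of shape (vocab_size, embedding_dim).
weights = [
    [random.uniform(-1, 1) for _ in range(embedding_dim)]
    for _ in vocab
]

def embed(word):
    """Encode the word as its vocab index, then return that row of the table."""
    return weights[key_to_index[word]]

vocab_size = len(weights)
print("Vocab size:", vocab_size)
print("Embedding dimension:", len(weights[0]))
print("Vector for 'king' has", len(embed("king")), "features")
```

A word outside the vocab raises a `KeyError` here, which mirrors what happens when a word is missing from `key_to_index` in the real model.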

In [33]:
# Building the embedding layer from the word vectors learned by gensim
embedding_layer = nn.Embedding.from_pretrained(torch.FloatTensor(w2vmodel.wv.vectors))

# Getting some information from the w2vmodel
print(f"Vocab size: {len(w2vmodel.wv.key_to_index)}")

print(f"Some of the words in the vocabulary:\n{list(w2vmodel.wv.key_to_index.keys())[:15]}")

print(f"Embedding dimension: {w2vmodel.wv.vectors.shape[1]}")

Vocab size: 71290
Some of the words in the vocabulary:
['the', 'of', 'and', 'one', 'in', 'a', 'to', 'zero', 'nine', 'two', 'is', 'as', 'eight', 'for', 's']
Embedding dimension: 100


Now, for a demonstration, we instantiate two words, turn them into numbers (encoding them via their index in the vocab), and pass them through the Embedding layer. 

Note how the resultant Embeddings both have the same shape: 1 word, and 100 elements in the vector representing that word.

In [34]:
# Taking two words and getting their embeddings
word1 = "king"
word2 = "queen"

def word2vec(word):
    return embedding_layer(torch.LongTensor([w2vmodel.wv.key_to_index[word]]))

king_embedding = word2vec(word1)
queen_embedding = word2vec(word2)

print(f"Embedding Shape for '{word1}': {king_embedding.shape}")
print(f"Embedding Shape for '{word2}': {queen_embedding.shape}")

Embedding Shape for 'king': torch.Size([1, 100])
Embedding Shape for 'queen': torch.Size([1, 100])


When we have vectors whose scale is arbitrary, one nice way to measure how *similar* they are is with the Cosine Similarity measure.


$$ \text{Cosine Similarity}(\mathbf{u},\mathbf{v}) = \frac{\mathbf{u} \cdot \mathbf{v}}{\|\mathbf{u}\| \|\mathbf{v}\|} $$


We can apply this idea to our Embeddings. To see how "similar" two words are to the model, we can generate their Embeddings and take the Cosine Similarity of those Embeddings. This will be a number between -1 and 1 (just like the range of the cosine function). A value close to 1 means the vectors point in nearly the same direction (similar words), a value close to 0 means the words are unrelated, and a value close to -1 means the vectors point in opposite directions.
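As a quick sanity check of the formula, here is a small worked example in plain Python on hand-picked 2D vectors (not real Embeddings). It covers the three reference cases: same direction, perpendicular, and opposite direction.

```python
import math

def cosine_sim(u, v):
    """Cosine similarity: dot product divided by the product of the norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Same direction, different length: scale does not matter.
same_direction = cosine_sim([1.0, 2.0], [2.0, 4.0])

# Perpendicular vectors: the dot product is zero.
orthogonal = cosine_sim([1.0, 0.0], [0.0, 3.0])

# Opposite direction.
opposite = cosine_sim([1.0, 2.0], [-1.0, -2.0])

print(round(same_direction, 6), round(orthogonal, 6), round(opposite, 6))
```

The first case is why Cosine Similarity suits vectors whose scale is arbitrary: doubling a vector changes its length but not the angle it makes with other vectors.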

In [68]:
def cosine_similarity(vec1, vec2):
    '''
    Computes the cosine similarity between two 1-D vectors using PyTorch
    '''
    
    cosine_simi = torch.dot(vec1, vec2) / (torch.norm(vec1) * torch.norm(vec2))
    return cosine_simi.item()

def compute_word_similarity(word1, word2):
    '''
    Takes in two words, computes their embeddings and returns the cosine similarity
    '''
    # word2vec() returns shape (1, 100), but torch.dot needs 1-D tensors, so .view(-1) flattens them
    return cosine_similarity(word2vec(word1).view(-1), word2vec(word2).view(-1))

# Defining three words (one pair similar and one pair dissimilar) and computing their similarity
word1 = "king"
word2 = "queen"
word3 = "earth"
print(f"Similarity between '{word1}' and '{word2}': {compute_word_similarity(word1, word2)}")
print(f"Similarity between '{word1}' and '{word3}': {compute_word_similarity(word1, word3)}")

Similarity between 'king' and 'queen': 0.7342362403869629
Similarity between 'king' and 'earth': 0.012497778050601482


In [30]:
# Run this cell if you're done with the above section
del embedding_layer

### Sentence Classification with Sentence Embeddings

Now let's move on to an actual application: classifying whether a tweet is about a real disaster or not. As you can imagine, this could be a valuable model when monitoring social media for disaster relief efforts.

Since we are using Sentence Embeddings, we want something that takes in a sequence of words and outputs a single fixed-size vector. For this task, we will use a pretrained embedding model via the `gpt4all` library.

This library will allow us to generate pretrained embeddings for sentences, which we can use as **features** to feed to any classifier of our choice.
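To illustrate what "a sequence of words in, one fixed-size vector out" means, here is a deliberately simple stand-in: averaging word vectors (mean pooling). This is an assumption-laden sketch with made-up 3-dimensional vectors, and it is not what `gpt4all` does internally. The point is only that sentences of any length map to vectors of the same size.

```python
# Toy 3-dimensional word vectors; the values are made up for illustration.
word_vectors = {
    "forest": [0.2, 0.9, 0.1],
    "fire":   [0.8, 0.7, 0.0],
    "near":   [0.1, 0.1, 0.4],
    "town":   [0.3, 0.2, 0.6],
}

def mean_pool(sentence, dim=3):
    """Average the vectors of known words; unknown words are skipped."""
    vectors = [word_vectors[w] for w in sentence.lower().split() if w in word_vectors]
    if not vectors:
        return [0.0] * dim  # nothing recognised: fall back to a zero vector
    # zip(*vectors) groups the values feature by feature.
    return [sum(column) / len(vectors) for column in zip(*vectors)]

short = mean_pool("Fire")
long = mean_pool("Forest fire near the town")

print(len(short), len(long))  # both sentences yield a vector of the same size
```

Because every sentence ends up with the same number of features, the resulting vectors can be stacked into a regular feature matrix and handed to a standard classifier.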

In [71]:
# Reading in the data over here
df = pd.read_csv("./disaster_tweets.csv")
df = df[["text", "target"]]
print(df.shape)
print(" ")

# Splitting the data into train and validation sets (stratified, so both keep the same class balance)
...
train, val = train_test_split(df, test_size = 0.1, random_state = 420, stratify = df['target'])
print("Printing a few rows of the training data:")
print(train.head())

print(" ")
print("Printing a few rows of the validation data:")
print(val.head())

print(" ")
print("Training data:", train.shape, "Validation data:", val.shape)

(7613, 2)
 
Printing a few rows of the training data:
                                                   text  target
4557  #golf McIlroy fuels PGA speculation after vide...       0
6443  @RayquazaErk There are Christian terrorists to...       1
1397  Warfighting Robots Could Reduce Civilian Casua...       1
592   FedEx no longer to transport bioterror germs i...       1
5871  You can only make yourself happy. Fuck those t...       0
 
Printing a few rows of the validation data:
                                                   text  target
1998  'Mages of Fairy Tail.. Specialize in property ...       0
7116  Storm blitzes Traverse City disrupts Managemen...       1
7060  Series finale of #TheGame :( It survived so mu...       0
4908  @nataliealund \nParents of Colorado theater sh...       1
5510  Reddit's new content policy goes into effect m...       0
 
Training data: (6851, 2) Validation data: (762, 2)


Before jumping straight to Embeddings, since our data is sourced from the cesspool that is Twitter (now X), we should probably do some cleaning. This can involve the removal of URLs, punctuation, numbers that don't provide any meaning, stopwords, and so on.
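To show what this kind of cleaning does to a single tweet, here is a self-contained sketch using only `re`. The tweet is invented, and `TOY_STOPWORDS` is a tiny hypothetical stand-in for nltk's English stopword list so that the example runs without downloads. The order of the steps is one reasonable choice, not the only one.

```python
import re

# Tiny stand-in for nltk's English stopword list (assumption, keeps this self-contained).
TOY_STOPWORDS = {"the", "is", "in", "a", "at", "of", "on"}

def clean(tweet):
    text = tweet.lower()
    text = re.sub(r"http\S+", "", text)   # drop URLs
    text = re.sub(r"[^\w\s]", "", text)   # drop punctuation (this also strips # and @)
    text = re.sub(r"\d", "", text)        # drop digits
    words = [w for w in text.split() if w not in TOY_STOPWORDS]
    return " ".join(words)                # split/join also collapses repeated spaces

raw = "BREAKING: Wildfire spreads in the hills, 200 homes evacuated! http://t.co/abc123 #wildfire"
cleaned = clean(raw)
print(cleaned)
```

Note that removing digits also discards "200", which shows the trade-off: cleaning removes noise, but it can remove signal too.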

In the following cell, I have written functions to clean the sentences. 

**Note:** After cleaning the sentences, we may end up with empty sentences (or some that are so short they have lost all meaning). Since the goal here is to demonstrate setting up a Sentence Classification task, and data cleaning is not the focus of this notebook, I simply removed them from the dataset.

In [74]:
# Functions for cleaning the data
def lowercase(txt):
    return txt.lower()

def remove_punctuation(txt):
    return re.sub(r'[^\w\s]', '', txt)

# Building the stopword set once instead of on every call
STOP_WORDS = set(stopwords.words('english'))

def remove_stopwords(txt):
    words = txt.split()
    filtered_words = [word for word in words if lowercase(word) not in STOP_WORDS]
    return ' '.join(filtered_words)

def remove_numbers(txt):
    return re.sub(r'\d', '', txt)

def remove_url(txt):
    return re.sub(r'http\S+', '', txt)

def normalize_sentence(txt):
    '''
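    Aggregates all the above functions to normalize/clean a sentence.
    URLs are removed last; this still works because stripping punctuation
    turns a URL into a single token that still starts with 'http'.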
    '''
    txt = lowercase(txt)
    txt = remove_punctuation(txt)
    txt = remove_stopwords(txt)
    txt = remove_numbers(txt)
    txt = remove_url(txt)
    return txt

# Cleaning the sentences
train.loc[:, 'cleaned_text'] = train['text'].apply(normalize_sentence)
val.loc[:, 'cleaned_text'] = val['text'].apply(normalize_sentence)

# Filtering out sentences that are too short (fewer than 20 characters)
min_characters = 20
train = train[train['cleaned_text'].str.len() >= min_characters]
val = val[val['cleaned_text'].str.len() >= min_characters]

# Printing the now clean training and validation data
print("Train Data")
print(train.head())
print(train.shape)
print(" ")
print("Validation Data")
print(val.head())
print(val.shape)

Train Data
                                                   text  target  \
4557  #golf McIlroy fuels PGA speculation after vide...       0   
6443  @RayquazaErk There are Christian terrorists to...       1   
1397  Warfighting Robots Could Reduce Civilian Casua...       1   
592   FedEx no longer to transport bioterror germs i...       1   
5871  You can only make yourself happy. Fuck those t...       0   

                                           cleaned_text  
4557  golf mcilroy fuels pga speculation video injur...  
6443  rayquazaerk christian terrorists sure dont sui...  
1397  warfighting robots could reduce civilian casua...  
592   fedex longer transport bioterror germs wake an...  
5871            make happy fuck tryna ruin keep smiling  
(6613, 3)
 
Validation Data
                                                   text  target  \
1998  'Mages of Fairy Tail.. Specialize in property ...       0   
7116  Storm blitzes Traverse City disrupts Managemen...       1   
7060  Ser

Now we create the Embeddings!

We will be using the `gpt4all.Embed4All` class for this purpose. Its documentation can be found [here](https://docs.gpt4all.io/gpt4all_python_embedding.html#gpt4all.gpt4all.Embed4All.embed).

This functionality makes use of a model called [Sentence-BERT](https://arxiv.org/abs/1908.10084). This is a Transformer-based model that has been trained on a large corpus of text and is able to generate high-quality Sentence Embeddings, which is exactly what we need.

In [75]:
# Generating embeddings for train and validation sentences
feature_extractor = Embed4All()

# Encoding the train samples
train_samples = train['cleaned_text'].tolist()
train_embeddings = [feature_extractor.embed(sentence) for sentence in train_samples]

# Encoding the validation sentences
validation_samples = val['cleaned_text'].tolist()
validation_embeddings = [feature_extractor.embed(sentence) for sentence in validation_samples]

# Preparing the labels
train_labels = train['target'].tolist()
val_labels = val['target'].tolist()

Downloading: 100%|██████████| 45.9M/45.9M [00:04<00:00, 9.43MiB/s]
Verifying: 100%|██████████| 45.9M/45.9M [00:00<00:00, 768MiB/s]


In [81]:
# Printing the lengths of the embeddings and their labels
print(len(train_labels), ",",len(train_embeddings))
print("")
print(len(val_labels), ",",len(validation_embeddings))

6613 , 6613

742 , 742


Now with our Embeddings ready, we can move on to the actual classification task.

You have the choice of using **any** classifier you wish. You can use a simple Logistic Regression model, get fancy with Support Vector Machines, or even use a Neural Network. The choice is yours.

We will be looking for a model with a **Validation Accuracy** of around $0.8$. You must also use this model to make predictions on your own provided inputs, after completing the `predict` function.
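For intuition on what a Logistic Regression classifier does with an embedding, here is a minimal sketch of its prediction step: a weighted sum of the features passed through a sigmoid, then thresholded at 0.5. The weights, bias, and 3-dimensional "embedding" are made up for illustration. A fitted model learns its weights from the training data and works on the full-size embeddings.

```python
import math

def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba_positive(features, weights, bias):
    """Probability of the positive class: sigmoid of the weighted sum."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return sigmoid(z)

# Made-up parameters and a made-up 3-dimensional "sentence embedding".
weights = [1.5, -2.0, 0.5]
bias = -0.25
embedding = [0.8, 0.1, 0.3]

probability = predict_proba_positive(embedding, weights, bias)
label = int(probability >= 0.5)  # 1 = disaster, 0 = not a disaster

print(f"Probability: {probability:.2f}, Prediction: {label}")
```

Here the weighted sum is 0.9, so the probability is about 0.71 and the predicted class is 1. A probability near 0.5 means the model is close to undecided, which is worth keeping in mind when reading predictions on custom sentences.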

In [84]:
# Converting lists to NumPy arrays
x_train = np.array(train_embeddings)
y_train = np.array(train_labels)
x_val = np.array(validation_embeddings)
y_val = np.array(val_labels)

model = LogisticRegression(random_state=420)
model.fit(x_train, y_train)

y_val_predicted = model.predict(x_val)

val_acc = accuracy_score(y_val, y_val_predicted)
print("Validation Accuracy [%]: ", val_acc * 100)

Validation Accuracy [%]:  82.0754716981132


In [85]:
# Creating a function to predict on a sentence
def predict(sentence, clf):
    '''
    Takes in a sentence and returns the predicted class along with the
    probability of the positive class (1 = real disaster)
    '''
    # Cleaning and encoding the sentence
    cleaned_sentence = normalize_sentence(sentence)
    sentence_embedding = feature_extractor.embed(cleaned_sentence)

    # Predicting the class and probability
    prediction = clf.predict([sentence_embedding])[0]
    probability = clf.predict_proba([sentence_embedding])[0][1]

    return prediction, probability

In [88]:
# Predict on a few custom sentences
sentences_to_predict = [
    "My life is nothing short of a disaster at the moment ",
    "This semester was not a disaster surprisingly",
    "Peaceful disaster",
    "Disaster strikes in the city.",
    "Disaster did not strike in the country",
]

# Predict on the sentences
for sentence in sentences_to_predict:
    prediction, probability = predict(sentence, model)
    print(f"Sentence: {sentence}")
    print(f"Prediction: {prediction}")
    print(f"Probability: {probability:.2f}\n")

Sentence: My life is nothing short of a disaster at the moment 
Prediction: 1
Probability: 0.62

Sentence: This semester was not a disaster surprisingly
Prediction: 1
Probability: 0.54

Sentence: Peaceful disaster
Prediction: 1
Probability: 0.90

Sentence: Disaster strikes in the city.
Prediction: 1
Probability: 0.95

Sentence: Disaster did not strike in the country
Prediction: 1
Probability: 0.94



Fin.