# Word Embeddings
In this notebook, we will:

1.   Train word embeddings using word2vec
2.   Import GloVe embeddings
3.   Use these embeddings as input for classification, again on the CoronaNLP
     dataset
4.   EXTRA: Implement the Skip-Gram model from scratch using PyTorch

# 1. Training Word Embeddings with Word2Vec


In [None]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from tqdm.notebook import tqdm
import pandas as pd
import re
import seaborn as sns
from nltk.corpus import stopwords
from nltk.stem import SnowballStemmer
from nltk.stem.wordnet import WordNetLemmatizer
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

In [None]:
# After installing, restart the kernel
#!pip install gensim

In [None]:
from gensim.models import Word2Vec

In [None]:
# Example corpus: list of tokenized sentences
corpus = [
    ["cat", "sat", "on", "the", "mat"],
    ["dog", "barked", "at", "the", "cat"],
    ["bird", "flew", "over", "the", "house"],
    ["cat", "and", "dog", "are", "friends"],
]

In [None]:
# Ensure corpus is a list of lists
#if isinstance(corpus[0], str):
   # corpus = [sentence.split() for sentence in corpus]

# Train Word2Vec model
model = Word2Vec(
    sentences=corpus,
    vector_size=5,    # size of the embedding vectors
    window=2,         # context window size
    min_count=1,      # minimum word frequency to include
    sg=1              # 1 for skip-gram; 0 for CBOW
)

In [None]:
# Save model
# model.save("word2vec.model")

# Load model (example)
# model = Word2Vec.load("word2vec.model")

In [None]:
# Example: access embedding vector
cat_vector = model.wv["cat"]
print("Embedding vector for 'cat':", cat_vector)

In [None]:
# Example: find most similar words
print("Most similar to 'cat':", model.wv.most_similar("cat"))

In [None]:
# Example: find odd word out
print("Odd one out among ['cat', 'dog', 'bird', 'house']:", model.wv.doesnt_match(["cat", "dog", "bird", "house"]))

In [None]:
# Example: similarity between two words
print("Similarity between 'cat' and 'dog':", model.wv.similarity("cat", "dog"))
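
`model.wv.similarity` returns the cosine similarity between the two word vectors. The computation can be sketched in NumPy as follows; the vectors `vec_cat`, `vec_dog` and `vec_mat` are made-up values for illustration, not output of the model trained above.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: dot(u, v) / (|u| * |v|)."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical 3-dimensional embeddings, chosen only for illustration
vec_cat = np.array([1.0, 2.0, 0.0])
vec_dog = np.array([2.0, 4.0, 0.0])   # same direction as vec_cat
vec_mat = np.array([0.0, 0.0, 3.0])   # orthogonal to vec_cat

print(cosine_similarity(vec_cat, vec_dog))  # parallel vectors
print(cosine_similarity(vec_cat, vec_mat))  # orthogonal vectors
```

Cosine similarity depends only on the direction of the vectors, not on their length, which is why `vec_cat` and `vec_dog` count as maximally similar here.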

### Word Embeddings Visualization

Go to https://projector.tensorflow.org/ and visualize Word2Vec embeddings.

Original Word2Vec repository: https://code.google.com/archive/p/word2vec/

# 2. Exploring Word Vectors with GloVe

As we have seen, Word2vec algorithms (such as Skip-Gram) predict words from their context (e.g. what is the most likely word to appear in "the cat ? the mouse"). GloVe vectors, by contrast, are based on global co-occurrence counts across the corpus.

The advantage of GloVe is that, unlike Word2vec, it does not rely only on local statistics (the local context of each word) but also incorporates global statistics (word co-occurrence counts) to obtain word vectors. See [How is GloVe different from word2vec?](https://www.quora.com/How-is-GloVe-different-from-word2vec) and [Intuitive Guide to Understanding GloVe Embeddings](https://towardsdatascience.com/light-on-math-ml-intuitive-guide-to-understanding-glove-embeddings-b13b4f19c010) for more detailed explanations.
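
To make "global co-occurrence counts" concrete, here is a minimal sketch of the counting step only. The corpus `toy_corpus` and the helper `cooccurrence_counts` are illustrative names, not part of GloVe or gensim, and the counts are unweighted; actual GloVe training goes further and fits word vectors to these statistics.

```python
from collections import Counter

# Toy corpus of tokenized sentences (illustrative only)
toy_corpus = [
    ["cat", "sat", "on", "the", "mat"],
    ["dog", "barked", "at", "the", "cat"],
]

def cooccurrence_counts(sentences, window=2):
    """Count how often each ordered (word, context) pair occurs within `window` tokens."""
    counts = Counter()
    for sentence in sentences:
        for i, word in enumerate(sentence):
            lo = max(0, i - window)
            hi = min(len(sentence), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[(word, sentence[j])] += 1
    return counts

counts = cooccurrence_counts(toy_corpus)
print(counts[("the", "cat")])   # "the" and "cat" are adjacent in the second sentence
print(counts[("cat", "dog")])   # never within two tokens of each other
```

The counts are accumulated over the whole corpus before any vector is learned, which is the sense in which the statistics are global.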

Multiple sets of pre-trained GloVe vectors are easily available for [download](https://nlp.stanford.edu/projects/glove/), so that's what we'll use here.

Part of this section is taken from the [practical-pytorch tutorials](https://github.com/spro/practical-pytorch/blob/master/glove-word-vectors/glove-word-vectors.ipynb).

### Loading word vectors

Gensim includes functions to download pre-trained embeddings.

In [None]:
import gensim.downloader

In [None]:
print(list(gensim.downloader.info()['models'].keys()))

In [None]:
dim=100

In [None]:
# glove-twitter-25 has embedding size 25, glove-twitter-100 has embedding size 100, etc.
glove_model = gensim.downloader.load(f'glove-twitter-{dim}')

In [None]:
glove_model.get_vector("lol")

In [None]:
# Find the most similar words to a given word
print("Example - Most similar to 'computer':", glove_model.most_similar('computer')[:3])

In [None]:
# Compute the similarity between two words
print("Example - Similarity between 'computer' and 'laptop':", glove_model.similarity('computer', 'laptop'))

In [None]:
# PCA for GloVe embeddings
glove_matrix = np.array([glove_model[word] for word in glove_model.index_to_key[:100]])
pca = PCA(n_components=2)
glove_embeddings_2d = pca.fit_transform(glove_matrix)

In [None]:
plt.figure(figsize=(14, 10))
plt.scatter(glove_embeddings_2d[:, 0], glove_embeddings_2d[:, 1], alpha=0.6)
selected_words = glove_model.index_to_key[:100]
for i, word in enumerate(selected_words):
    plt.annotate(word, (glove_embeddings_2d[i, 0], glove_embeddings_2d[i, 1]), fontsize=8, alpha=0.7)
plt.title('Word Embeddings (First 100 Words)')
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.grid(True)
plt.show()
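
The PCA step above projects the mean-centered embedding matrix onto its two directions of greatest variance. A minimal NumPy sketch of that projection follows, using random data as a stand-in for the embedding matrix. The names `X` and `X_2d` are illustrative, and the signs of the resulting axes may differ from scikit-learn's output.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))        # stand-in for a (words x dimensions) embedding matrix

X_centered = X - X.mean(axis=0)      # PCA operates on mean-centered data
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
X_2d = X_centered @ Vt[:2].T         # coordinates along the top two principal axes

print(X_2d.shape)
```

Keep in mind that a 2D projection discards most of the information in a 100-dimensional space, so distances in the plot are only a rough guide to distances between the original vectors.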

### Word analogies with vector arithmetic
The most interesting feature of a well-trained word vector space is that certain semantic relationships (beyond just closeness of words) can be captured with regular vector arithmetic.

![image-2.png](attachment:image-2.png)


(image borrowed from https://jalammar.github.io/illustrated-word2vec/)

Read [The Illustrated Word2vec](https://jalammar.github.io/illustrated-word2vec/) for more information.
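
The arithmetic behind an analogy can be sketched with hand-made vectors. In this illustration the embeddings in `toy_vectors` are hypothetical 2-dimensional values (one axis loosely standing for "royalty", the other for "gender"), and `nearest` is a simplified helper, not gensim's implementation.

```python
import numpy as np

# Hypothetical 2-d embeddings: axis 0 ~ "royalty", axis 1 ~ "gender" (illustrative values)
toy_vectors = {
    "king":  np.array([1.0, 1.0]),
    "queen": np.array([1.0, -1.0]),
    "man":   np.array([0.0, 1.0]),
    "woman": np.array([0.0, -1.0]),
    "apple": np.array([-1.0, 0.0]),
}

def nearest(query, vectors, exclude=()):
    """Return the word whose vector has the highest cosine similarity to `query`."""
    best_word, best_score = None, -np.inf
    for word, vec in vectors.items():
        if word in exclude:
            continue
        score = np.dot(query, vec) / (np.linalg.norm(query) * np.linalg.norm(vec))
        if score > best_score:
            best_word, best_score = word, score
    return best_word

# king - man + woman
query = toy_vectors["king"] - toy_vectors["man"] + toy_vectors["woman"]
print(nearest(query, toy_vectors, exclude={"king", "man", "woman"}))
```

The input words are excluded from the candidates because the query vector usually stays close to one of them, which would otherwise make the answer trivial.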

In [None]:
result = glove_model.most_similar(positive=['king', 'woman'], negative=['man'])
print("\nWord Analogy Example (king - man + woman):", result[:5])

In [None]:
def analogy(word1, word2, word3, model=glove_model, topn=3):
    try:
        result = model.most_similar(positive=[word2, word3], negative=[word1], topn=topn)
        print(f"\nAnalogy ({word1} -> {word2} like {word3} -> ?):")
        for word, similarity in result:
            print(f"- {word}: {similarity:.4f}")
    except KeyError as e:
        print(f"Word not in vocabulary: {e}")

In [None]:
analogy('king', 'man', 'queen')

Now let's explore the word space and see what stereotypes we can uncover:

In [None]:
analogy('man', 'actor', 'woman')
analogy('cat', 'kitten', 'dog')
analogy('dog', 'puppy', 'cat')
analogy('russia', 'moscow', 'france')
analogy('obama', 'president', 'trump')
analogy('rich', 'mansion', 'poor')
analogy('elvis', 'rock', 'eminem')
analogy('paper', 'newspaper', 'screen')
analogy('monet', 'paint', 'michelangelo')
analogy('beer', 'barley', 'wine')
analogy('earth', 'moon', 'sun')
analogy('house', 'roof', 'castle')
analogy('building', 'architect', 'software')
analogy('good', 'heaven', 'bad')
analogy('jordan', 'basketball', 'ronaldo')

# 3. Training and Classification with Word Embeddings
### Now we will see how to use word embeddings as features for our corpus and classify the sentiment of **tweets**

In [None]:
df = pd.read_csv("Corona_NLP.csv", encoding='latin-1')
pd.options.display.max_colwidth = 500
df.head(10)

In [None]:
df  = df[['OriginalTweet', 'Sentiment']].head(500)

In [None]:
set(df['Sentiment'].values)

In [None]:
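# .copy() avoids pandas' SettingWithCopyWarning when a new column is added to the filtered frame
df = df[df['Sentiment']!="Neutral"].copy()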

In [None]:
df['LabelSentiment'] = df['Sentiment'].apply(lambda x: 1 if x in ['Extremely Positive', 'Positive'] else 0)

In [None]:
df.head()

In [None]:
X_train, X_test, y_train, y_test = train_test_split(df['OriginalTweet'], df['LabelSentiment'], test_size=0.20, random_state=4)

In [None]:
len(X_train), len(X_test), len(y_train), len(y_test)

In [None]:
y_train.hist()

### Clean text

In [None]:
import nltk
nltk.download('stopwords')
nltk.download('wordnet')
stop = set(stopwords.words('english'))
stemmer = SnowballStemmer('english')
lemma = WordNetLemmatizer()

In [None]:
def clean(text_list):

    updates = []

    for j in tqdm(text_list):

        text = j

        #LOWERCASE TEXT
        text = text.lower()

        #REMOVE NUMERICAL DATA and PUNCTUATION
        text = re.sub("[^a-zA-Z]"," ", text )

        #REMOVE STOPWORDS
        text = " ".join([word for word in text.split() if word not in stop])

        #Lemmatize
        text = " ".join(lemma.lemmatize(word) for word in text.split())

        updates.append(text)

    return updates
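
To see what the first three steps of `clean` do to a single tweet, here is a small sketch without the lemmatization step. The stopword set `toy_stop` is a tiny hypothetical stand-in for NLTK's English list, and `clean_one` is an illustrative helper, not part of the notebook.

```python
import re

# Tiny hypothetical stopword set; the notebook uses NLTK's full English list
toy_stop = {"the", "is", "at", "in", "a", "of"}

def clean_one(text, stop_words):
    """Lowercase, replace every non-letter with a space, drop stopwords."""
    text = text.lower()
    text = re.sub("[^a-z]", " ", text)
    return " ".join(w for w in text.split() if w not in stop_words)

print(clean_one("Prices at the STORE rose 20% in March!", toy_stop))
print(clean_one("@user see https://t.co/x1", toy_stop))
```

The second call shows a side effect of this regular expression: URLs and @mentions are not removed but broken into letter fragments such as `https`, `t` and `co`, which then enter the averaged embedding like any other token.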

In [None]:
X_train_clean = clean(X_train)

In [None]:
X_test_clean = clean(X_test)

### Extracting embeddings

In [None]:
X_train_clean[:10]

In [None]:
# Extract sentence embeddings by averaging the word embeddings of each sentence
def average_embedding(text, model, dim):
    words = text.split()
    vectors = []
    for word in words:
        if word in model:
            vectors.append(model[word])
    if vectors:
        return np.mean(vectors, axis=0)
    else:
        return np.zeros(dim)
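
The behavior of the averaging, in particular for out-of-vocabulary words, can be checked on a toy lookup table. In this sketch `toy_model` is a hypothetical dictionary standing in for the GloVe model, and `average_embedding_sketch` is a compact re-statement of the function above so that the example runs on its own.

```python
import numpy as np

# Hypothetical 2-d lookup table standing in for the GloVe model
toy_model = {
    "good": np.array([1.0, 0.0]),
    "day":  np.array([0.0, 1.0]),
}

def average_embedding_sketch(text, model, dim):
    """Mean of the vectors of in-vocabulary words; a zero vector if none are known."""
    vectors = [model[w] for w in text.split() if w in model]
    return np.mean(vectors, axis=0) if vectors else np.zeros(dim)

both_known = average_embedding_sketch("good day", toy_model, dim=2)
one_unknown = average_embedding_sketch("good unknownword", toy_model, dim=2)
all_unknown = average_embedding_sketch("unknownword", toy_model, dim=2)
print(both_known, one_unknown, all_unknown)
```

Unknown words are skipped rather than counted as zeros, so they do not shrink the average. A sentence with no known words at all falls back to the zero vector.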

In [None]:
X_train_embeddings = np.array([average_embedding(text, glove_model, dim=dim) for text in X_train_clean])

In [None]:
X_train_embeddings

In [None]:
X_train_embeddings.shape

In [None]:
X_test_embeddings = np.array([average_embedding(text, glove_model, dim=dim) for text in X_test_clean])

In [None]:
# Initialize and train classifier
from sklearn.linear_model import LogisticRegression
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train_embeddings, y_train)

In [None]:
# Predict
y_pred = clf.predict(X_test_embeddings)

In [None]:
# Evaluate
print("\nClassification Report:")
print(classification_report(y_test, y_pred))

# 4. EXTRA: Implementing and training the Skip-Gram model from scratch


![skip-gram.png](attachment:skip-gram.png)

**NOTE:** This part of the notebook requires you to install PyTorch.

In [None]:
#!pip install torch

In [None]:
import torch
import torch.nn.functional as F

Let's start with a simple corpus:

In [None]:
corpus = [
    'he is a king',
    'she is a queen',
    'she is mad',
    'she is in love',
    'a mountain falls',
    'paris is france capital',
]

In [None]:
def tokenize_corpus(corpus):
    tokens = [x.split() for x in corpus]
    return tokens

In [None]:
tokenized_corpus = tokenize_corpus(corpus)

In [None]:
tokenized_corpus

In [None]:
vocabulary = {word for doc in tokenized_corpus for word in doc}

In [None]:
vocabulary

In [None]:
# Sort the vocabulary so that the word-to-index mapping is the same on every run
word2idx = {w:idx for (idx, w) in enumerate(sorted(vocabulary))}

In [None]:
word2idx

As you have seen in the theoretical lesson, we want to build pairs of words that appear within the same context.

![image.png](attachment:image.png)

In [None]:
def build_training(tokenized_corpus, word2idx, window_size=2):
    idx_pairs = []

    # for each sentence
    for sentence in tokenized_corpus:
        indices = [word2idx[word] for word in sentence]
        # for each word, treated as center word
        for center_word_pos in range(len(indices)):
            # for each window position
            for w in range(-window_size, window_size + 1):
                context_word_pos = center_word_pos + w
                # skip positions outside the sentence and the center word itself
                if  context_word_pos < 0 or \
                    context_word_pos >= len(indices) or \
                    center_word_pos == context_word_pos:
                    continue
                context_word_idx = indices[context_word_pos]
                idx_pairs.append((indices[center_word_pos], context_word_idx))
    return np.array(idx_pairs)

In [None]:
training_pairs = build_training(tokenized_corpus, word2idx)

In [None]:
training_pairs
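
Because `build_training` returns index pairs, the result is hard to read directly. The sketch below applies the same windowing logic to a single sentence but keeps the words themselves. `context_pairs` is an illustrative helper, not part of the notebook.

```python
def context_pairs(tokens, window_size=2):
    """Return (center, context) word pairs for every position in one sentence."""
    pairs = []
    for center_pos, center in enumerate(tokens):
        for offset in range(-window_size, window_size + 1):
            context_pos = center_pos + offset
            # Skip the center word itself and positions outside the sentence
            if offset == 0 or context_pos < 0 or context_pos >= len(tokens):
                continue
            pairs.append((center, tokens[context_pos]))
    return pairs

pairs = context_pairs(["he", "is", "a", "king"], window_size=2)
for pair in pairs:
    print(pair)
```

With a window of 2, "he" is paired with "is" and "a" but not with "king", which is three positions away.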

In [None]:
def get_onehot_vector(word_idx, vocabulary):
    x = torch.zeros(len(vocabulary)).float()
    x[word_idx] = 1.0
    return x

def Skip_Gram(training_pairs, vocabulary, embedding_dims=5, learning_rate=0.001, epochs=10):

    torch.manual_seed(3)

    W1 = torch.randn(embedding_dims, len(vocabulary), requires_grad=True).float()

    losses = []

    for epo in tqdm(range(epochs)):
        loss_val = 0

        for input_word, target in training_pairs:
            x = get_onehot_vector(input_word, vocabulary).float()
            y_true = torch.from_numpy(np.array([target])).long()

            # Matrix multiplication to obtain the input word embedding
            z1 = torch.matmul(W1, x)

            # Matrix multiplication to obtain the z score for each word
            z2 = torch.matmul(z1, W1)

            # Apply Log and softmax functions
            log_softmax = F.log_softmax(z2, dim=0)

            # Compute the negative-log-likelihood loss
            loss = F.nll_loss(log_softmax.view(1,-1), y_true)  # .view returns a tensor with the same data but a different shape
            loss_val += loss.item()  # .item returns the value of this tensor as a standard Python number

            # Compute the gradient of the loss with respect to W1
            loss.backward()

            # Update your embeddings
            W1.data -= learning_rate * W1.grad.data

            W1.grad.data.zero_()
            # .grad -> This attribute is None by default and becomes a Tensor the first time a call to backward()
            #computes gradients. The attribute will then contain the gradients computed and future
            #calls to backward() will accumulate (add) gradients into it.

        losses.append(loss_val/len(training_pairs))

    return W1, losses
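
The forward pass for a single training pair can be traced by hand in NumPy. This is a sketch under stated assumptions: the matrix `W` is random, the sizes and the indices `center_idx` and `target_idx` are arbitrary, and, as in `Skip_Gram` above, the same matrix is used both to look up the center word and to score the context words.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embedding_dims = 4, 3
W = rng.normal(size=(embedding_dims, vocab_size))  # same layout as W1 above

center_idx, target_idx = 2, 0  # arbitrary (center, context) pair

# One-hot input: multiplying by it selects a single column of W
x = np.zeros(vocab_size)
x[center_idx] = 1.0
z1 = W @ x
assert np.allclose(z1, W[:, center_idx])

# One score per vocabulary word, then a numerically stable log-softmax
z2 = z1 @ W
shifted = z2 - z2.max()
log_probs = shifted - np.log(np.sum(np.exp(shifted)))

# Negative log-likelihood of the true context word
loss = -log_probs[target_idx]
print(loss)
```

The one-hot multiplication is therefore just a column lookup, and the loss is small when the model assigns a high probability to the observed context word.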

In [None]:
W1, losses = Skip_Gram(training_pairs, word2idx, epochs=1000)

In [None]:
def plot_loss(loss):
    x_axis = [epoch+1 for epoch in range(len(loss))]
    plt.plot(x_axis, loss, '-g', linewidth=1, label='Train')
    plt.xlabel("Epochs")
    plt.ylabel("Loss")
    plt.legend()
    plt.gca().spines['top'].set_visible(False)
    plt.gca().spines['right'].set_visible(False)
    plt.show()

In [None]:
plot_loss(losses)

### Final Embedding Matrix

In [None]:
W = torch.t(W1).clone().detach()

In [None]:
W[word2idx["she"]], W[word2idx["mad"]]

In [None]:
from sklearn.metrics.pairwise import euclidean_distances

euclidean_distances([W[word2idx["she"]].numpy()], [W[word2idx["falls"]].numpy()])

In [None]:
euclidean_distances([W[word2idx["she"]].numpy()], [W[word2idx["mad"]].numpy()])

As you can see from the previous example, the vectors representing "she" and "mad" are closer than the vectors representing "she" and "falls". This happens because "she" and "mad" appear together inside the same context window, whereas "she" and "falls" never do.