[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JamesMTucker/DATA_340_NLP/blob/master/Fall_2023/notebooks/06_Vector_Semantics.ipynb)

# Lecture 15: 2023-28-03 Vector Semantics

## Overview of lecture

- Introduction to lexical semantics
- Introduction to Vector Semantics
  - Vector semantics: Osgood et al. (1957)
  - Vector semantics: Joos (1950), Harris (1954), Firth (1957)
- Embeddings
    - Word2Vec
    - GloVe
    - FastText
    - ELMo
    - BERT


## Introduction to Neural Networks

<center><img src="images/Neuron.drawio.png" width="800" height="400" /></center>

* Calculating Loss (measuring error - training a model adjusts weights and biases to minimize loss)
* Optimizing Loss (adjust weights and biases to minimize loss)
* Backpropagation (calculate the gradient of the loss function with respect to the weights and biases)
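
The three steps above can be sketched for the smallest possible case: a single linear neuron trained on one example. This is a minimal illustration, not the network in the figure; the function names, the squared-error loss, and all numeric values are assumptions chosen for readability.

```python
# A single linear neuron: prediction = w * x + b (all values are illustrative).
def predict(w, b, x):
    return w * x + b

def squared_error(w, b, x, y):
    # Loss: how far the prediction is from the target.
    return (predict(w, b, x) - y) ** 2

def gradients(w, b, x, y):
    # Backpropagation for this tiny model is just the chain rule:
    # dL/dw = 2 * (pred - y) * x and dL/db = 2 * (pred - y).
    error = predict(w, b, x) - y
    return 2 * error * x, 2 * error

w, b = 0.0, 0.0          # initial weight and bias
x, y = 2.0, 4.0          # one training example
learning_rate = 0.05

loss_before = squared_error(w, b, x, y)
for _ in range(20):
    dw, db = gradients(w, b, x, y)
    w -= learning_rate * dw   # optimization step: move against the gradient
    b -= learning_rate * db
loss_after = squared_error(w, b, x, y)

print(f"loss before: {loss_before:.4f}, loss after: {loss_after:.10f}")
```

The loss shrinks because every update moves the weight and bias a small step against the gradient of the loss.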

## Introduction to Lexical Semantics

Taken from Jurafsky and Martin (2023) chapter 23:

```

Lady Bracknell: Are your parents living?
Jack: I have lost both my parents.
Lady Bracknell: To lose one parent, Mr. Worthing, may be regarded as a misfortune; to lose both looks like carelessness.

```

* Words are relational units that are prone to messiness and ambiguity
* Ambiguity is a fact of life in language (`mouse` as in a rodent or a computer device)
* `Polysemy`: a word or lemma with multiple meanings (`bank` as in a river bank or a financial institution)
* `Antonymy`: words (or lemmas) with opposite meanings (`hot` and `cold`)
* `Synonymy`: words (or lemmas) that are similar in meaning (`couch` and `sofa`)
* Taxonomic relations
    * `hyponymy` (subordinate): words (or lemmas) that are more specific (`poodle` is a hyponym of `dog`) - subclasses or members
    * `hypernym` (superordinate): words (or lemmas) that are more general (`dog` is a hypernym of `poodle`) - classes
        * entailment: being a hyponym entails being its hypernym (being a `poodle` entails being a `dog`)
        * is-a hierarchy: a hierarchy of classes that is organized by the is-a relation or A IS-A B
    * `meronymy`: words (or lemmas) that are part of a larger entity (`leg` is a meronym of `human`) - part-whole relationships
    * `metonymy`: words (or lemmas) that are associated with a larger entity (`the crown` is a metonym of `the queen`) - association (prototype categories)
    * `holonymy`: words (or lemmas) that are a whole of a smaller entity (`face` is a holonym of `eye`) - whole-part relationships
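
The is-a hierarchy and the direction of entailment can be sketched with a toy lookup table. The `HYPERNYM` dictionary and both helper functions are hypothetical and hand-built for illustration; they are not part of WordNet or NLTK.

```python
# Toy is-a hierarchy: each word maps to its direct hypernym (illustrative data).
HYPERNYM = {
    "poodle": "dog",
    "dog": "canine",
    "canine": "mammal",
    "mammal": "animal",
}

def hypernym_chain(word):
    """Return all hypernyms of `word`, from most specific to most general."""
    chain = []
    while word in HYPERNYM:
        word = HYPERNYM[word]
        chain.append(word)
    return chain

def is_a(specific, general):
    """True if `specific` IS-A `general`, i.e. being `specific` entails being `general`."""
    return general in hypernym_chain(specific)

print(hypernym_chain("poodle"))
print(is_a("poodle", "animal"), is_a("animal", "poodle"))
```

The relation is transitive but not symmetric: every poodle is an animal, yet not every animal is a poodle.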


## Define our datasets dir

In [None]:
## set environment variables if in google colab
import os

IN_COLAB = False

try:
    import google.colab
    from google.colab import drive
    IN_COLAB = True
except ImportError:
    IN_COLAB = False

if IN_COLAB:
    # mount our google drive
    drive.mount('/content/drive', force_remount=True)
    data_dir = "/content/drive/MyDrive/DATA_340_NLP/Datasets"
else:
    data_dir = "../datasets"
    
os.listdir(data_dir)

In [None]:
import nltk
nltk.download('wordnet')
from nltk.corpus import wordnet as wn

def get_taxonomy(noun):
    synsets = wn.synsets(noun)
    if synsets:
        synset = synsets[0]  # take the first synset
        hypernyms = synset.hypernyms()
        hyponyms = synset.hyponyms()
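        # meronyms are the parts, substances, and members of this synset
        meronyms = synset.part_meronyms() + synset.substance_meronyms() + synset.member_meronyms()
        # holonyms are the wholes that this synset is a part, substance, or member of
        holonyms = synset.part_holonyms() + synset.substance_holonyms() + synset.member_holonyms()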
        return {
            "word": synset.name(),
            "definition": synset.definition(),
            "hypernyms": [h.name() for h in hypernyms],
            "hyponyms": [h.name() for h in hyponyms],
            "meronyms": [m.name() for m in meronyms],
            "holonyms": [h.name() for h in holonyms]
        }
    else:
        return None


In [None]:
result = get_taxonomy("dog")
for k,v in result.items():
    print(k, v, sep=":")

In [None]:
from nltk.corpus import wordnet as wn

def get_verb_relations(verb):
    synsets = wn.synsets(verb, pos=wn.VERB)
    if synsets:
        relations = {
            "antonyms": set(),
            "entailments": set(),
            "causes": set(),
            "also_sees": set(),
            "verb_groups": set(),
            "similar_tos": set()
        }
        for synset in synsets:
            for lemma in synset.lemmas():
                antonyms = lemma.antonyms()
                if antonyms:
                    relations["antonyms"].add(antonyms[0].name())
            for entailment in synset.entailments():
                relations["entailments"].add(entailment.name())
            for cause in synset.causes():
                relations["causes"].add(cause.name())
            for also_see in synset.also_sees():
                relations["also_sees"].add(also_see.name())
            for verb_group in synset.verb_groups():
                relations["verb_groups"].add(verb_group.name())
            for similar in synset.similar_tos():
                relations["similar_tos"].add(similar.name())
        return relations
    else:
        return None


In [None]:
result = get_verb_relations("catch")
for k,v in result.items():
    print(k, v, sep=":")

## Distributional Semantics

* Joos (1950), Harris (1954), and Firth (1957) each proposed that the meaning of a word can be characterized by its distribution, that is, by the linguistic contexts in which it occurs. Distributional semantics takes its name from this idea: the meaning of a word is discerned from the words that tend to occur in its company.

> You shall know a word by the company it keeps. (Firth, 1957)
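
One way to make "the company it keeps" concrete is to count which words appear near each other. The sketch below is a minimal illustration using only the standard library; the function name `cooccurrence_counts`, the window size, and the toy sentence are assumptions, not part of the lecture's datasets.

```python
from collections import Counter, defaultdict

def cooccurrence_counts(tokens, window=2):
    """Count how often each word appears within `window` positions of another."""
    counts = defaultdict(Counter)
    for i, word in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:
                counts[word][tokens[j]] += 1
    return counts

tokens = "i drink coffee from a cup and i drink tea from a cup".split()
counts = cooccurrence_counts(tokens, window=2)
print(counts["coffee"].most_common(3))
print(counts["tea"].most_common(3))
```

In this toy sentence `coffee` and `tea` share exactly the same neighbours, which is the distributional signal that they are related.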

## Word Similarity

* Word similarity is a measure of the degree of semantic similarity between two words. This measure takes into account the distributional properties of words in a corpus. Whereas words like `coffee` would rarely occur in a dictionary entry for the word `cup`, users of language expect that the words `coffee` and `cup` are similar in meaning. They are similar, in this case, because semantic frames are shared between the two words. The semantic frame of `coffee` is a hot beverage, and the semantic frame of `cup` is a container for a hot beverage. The semantic frames of `coffee` and `cup` overlap, and this overlap is the basis for the similarity between the two words. We can capture these similarities by computing the distributional properties of words in a corpus.
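
A common way to turn distributional properties into a similarity score is the cosine of the angle between two count vectors. The sketch below assumes made-up co-occurrence counts and a hypothetical set of four context words; only the cosine formula itself is standard.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical co-occurrence counts with the context words
# ["drink", "hot", "pour", "engine"]; the numbers are made up.
coffee = [8, 6, 5, 0]
cup    = [6, 4, 7, 0]
car    = [0, 1, 0, 9]

print(f"coffee vs cup: {cosine_similarity(coffee, cup):.3f}")
print(f"coffee vs car: {cosine_similarity(coffee, car):.3f}")
```

Words that share contexts point in similar directions and score close to 1; words with disjoint contexts score close to 0.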

## How can we represent words and their meanings in numerical format?

We vectorize it!

We can represent words in a vector space or embedding space.

### Word2Vec, Mikolov et al., 2013

Goal: to create “techniques for measuring the quality of the resulting vector representations, with the expectation that not only will similar words tend to be close to each other, but that words can have multiple degrees of similarity.” (Mikolov, et al., 2013a, 2013b)

Mikolov et al. propose two log-linear solutions

* Continuous Bag-of-Words Model
* Continuous Skip-gram Model 


<center><img src="images/mikolov.png" width="900" height="500" /></center>

Word2Vec embeddings are static: each word type receives a single vector regardless of the sentence it appears in, so the different senses of a polysemous word such as `bank` are collapsed into one representation. This is a problem for downstream tasks that require contextualized embeddings.
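
To illustrate what "static" means in practice, the sketch below uses a hypothetical lookup table with made-up three-dimensional vectors. It is not a trained model; it only shows that a table lookup cannot react to the surrounding words.

```python
# A static embedding table: one fixed vector per word type (values are made up).
static_embeddings = {
    "bank":  [0.21, -0.53, 0.77],
    "river": [0.10, -0.40, 0.90],
    "money": [0.65,  0.12, -0.30],
}

def embed(sentence):
    """Look up each known token; context plays no role in the lookup."""
    return {tok: static_embeddings[tok] for tok in sentence.split() if tok in static_embeddings}

financial = embed("she deposited money at the bank")
riverside = embed("they sat on the river bank")

# The vector for "bank" is identical in both sentences.
print(financial["bank"] == riverside["bank"])
```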

### GloVe, Pennington et al., 2014

“...the shallow window-based methods [e.g., log bi-linear models, CBOW, or Skipgram] suffer from the disadvantage that they do not operate directly on the co-occurrence statistics of the corpus. Instead, these models scan context windows across the entire corpus, which fails to take advantage of the vast amount of repetition in the data.” Pennington, et al., 2014.

<center><img src="images/glove.png" width="800" height="400" /></center>

### FastText, Bojanowski et al., 2017

<center><img src="images/fasttext.png" width="900" height="500" /></center>

### ELMo, Peters et al., 2018

<center><img src="images/elmo.png" width="900" height="400" /></center>

"They [embeddings] should ideally model both (1) complex characteristics of word use (e.g., syntax and semantics), and (2) how these uses vary across linguistic contexts (i.e., to model polysemy)." ([Peters et al., 2018, p. 1](https://arxiv.org/pdf/1802.05365.pdf))

# How to create static word embeddings

Let's code out the word2vec CBOW and Skipgram models and compare them. To do this, let's define our configuration parameters.

In [None]:
import os

# Number of dimensions
EMBEDDING_SIZE = 10

# Window size
WINDOW_SIZE = 5

ITERATIONS = 10000

# OUTPUT
OUTPUT_PATH = "outputs"

## Let's plot the loss for the skipgram model
SKIPGRAM_LOSS = os.path.join(OUTPUT_PATH, 'loss_skipgram')
SKIPGRAM_TSNE = os.path.join(OUTPUT_PATH, 'tsne_skipgram')

## let's plot the loss for the cbow model
CBOW_LOSS = os.path.join(OUTPUT_PATH, 'loss_cbow')
CBOW_TSNE = os.path.join(OUTPUT_PATH, 'tsne_cbow')

In [None]:
# We need to preprocess the textual data

# We can use tensorflow to preprocess the data
import tensorflow as tf

def tokenize_data(data):
    # https://www.tensorflow.org/api_docs/python/tf/keras/preprocessing/text/text_to_word_sequence
    tokenized_text = tf.keras.preprocessing.text.text_to_word_sequence(input_text=data)

    vocab = sorted(set(tokenized_text))
    tokenized_text_size = len(tokenized_text)

    return (vocab, tokenized_text_size, tokenized_text)

In [None]:
print(tf.config.list_physical_devices('GPU'))

## Implement the CBOW algorithm

In [None]:
# define our imports 
import tensorflow as tf

tf.random.set_seed(42)
from sklearn.manifold import TSNE
from joblib import Parallel, delayed
import matplotlib.pyplot as plt
from tqdm import tqdm
import pandas as pd
import numpy as np
import os

### Load our data - we use the Lord of the Rings trilogy

In [None]:
# use google to load the data from drive
#from google.colab import drive
#drive.mount('/content/drive', force_remount=True)
#datasets_dir = "/content/drive/My Drive/DATA_340_3_NLP/Datasets/LOTR/"

datasets_dir = "../datasets/LOTR/"

# get the txt files
filenames = [os.path.join(datasets_dir, f) for f in os.listdir(datasets_dir) if f.endswith(".txt") and 'LOTR' in f]

# read the files
corpus = []

# read 
for f in filenames:
    with open(f, 'r', encoding='UTF-8') as file:
        corpus.append(file.read())

In [None]:
# let's shorten the corpus
corpus = corpus[:1]

### Preprocess the data

In [None]:
# let's flatten the corpus to one string and remove unnecessary spaces
corpus = " ".join(corpus)
corpus = " ".join(corpus.split())

#### Standardize the case of the text

In [None]:
corpus = corpus.lower()

# let's take the first 1000 words
corpus = " ".join(corpus.split()[:1000])
corpus

#### Remove accents

In [None]:
try:
  import unidecode
except ModuleNotFoundError:
  !pip install unidecode

In [None]:
from unidecode import unidecode

corpus = unidecode(corpus)

#### Tokenize the text

In [None]:
# Preprocess the data
(vocab, tokenized_text_size, tokenized_text) = tokenize_data(corpus)

In [None]:
# lets look at our data
print("Vocab size: {}".format(len(vocab)))
print("Text size: {}".format(tokenized_text_size))
print("Text: {}".format(tokenized_text[:10]))

### Create our context and center vectors

In [None]:
# Map our words to indices
vocab_to_index = {
    uniqueWord:index for (index, uniqueWord) in enumerate(vocab)
}

In [None]:
# Create an array of our vocab
index_to_vocab = np.array(vocab)
index_to_vocab

In [None]:
# convert the text to integers
text_as_int = np.array([vocab_to_index[word] for word in tokenized_text])
text_as_int

We want to slide over our text and create our context and center vectors. Let's illustrate with an example:

In [None]:
def slide_window_over_tokens(tokens, window_size):
    """
    Slides a window over the given list of tokens.
    
    Parameters:
    - tokens: List of words/tokens in a sentence.
    - window_size: The total size of the window, including the target word and context words.
    
    Prints:
    - Each target word and the words within the window around it.
    """
    for index, word in enumerate(tokens):
        start = max(0, index - window_size // 2)
        end = min(len(tokens), index + window_size // 2 + 1)
        window = tokens[start:end]
        print(f"Target: {word}, Window: {window}")

# Example usage
tokens = ["the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"]
window_size = 5  # This means 2 words before and 2 words after the target
slide_window_over_tokens(tokens, window_size)

### Initialize our context and center vectors

In [None]:
# Create a matrix of random data for our context vectors (one row per vocabulary word)
context_vector_matrix = tf.Variable(
    np.random.rand(len(vocab), EMBEDDING_SIZE)
)
context_vector_matrix[0]

In [None]:
# Create a matrix of random data for our center vectors (one row per vocabulary word)
center_vector_matrix = tf.Variable(
    np.random.rand(len(vocab), EMBEDDING_SIZE)
)
center_vector_matrix[0]

### Define our optimizer

Word2Vec employs two architectures for producing a distributed representation of words: Continuous Bag of Words (CBOW) and Skip-Gram. Both architectures use a shallow neural network model for learning word embeddings, but they differ in the way they predict words.

CBOW predicts a target word based on context words surrounding it. The objective is to estimate the probability of a word given a context.
Skip-Gram, on the other hand, uses a target word to predict context words. This model aims to maximize the probability of context words given a target word.
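
The difference between the two architectures is easiest to see in the training examples each one consumes. The helper below is a hypothetical sketch (the name `training_pairs` and the window size are assumptions) that builds both kinds of examples from the same token list.

```python
def training_pairs(tokens, window=2):
    """Build (context, target) examples for CBOW and (target, context) pairs for skip-gram."""
    cbow, skipgram = [], []
    for i, target in enumerate(tokens):
        context = [tokens[j]
                   for j in range(max(0, i - window), min(len(tokens), i + window + 1))
                   if j != i]
        cbow.append((context, target))                   # many context words -> one target
        skipgram.extend((target, c) for c in context)    # one target -> each context word
    return cbow, skipgram

tokens = ["the", "quick", "brown", "fox", "jumps"]
cbow, skipgram = training_pairs(tokens, window=1)
print(cbow[1])
print(skipgram[:3])
```

CBOW produces one example per position, while skip-gram produces one example per (target, context word) pair.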

The optimization process in Word2Vec involves adjusting the weights of the neural network to minimize a loss function. This loss function measures the difference between the predicted probability distribution of context words and the actual distribution from the corpus. For CBOW, the loss function could be the negative log likelihood of the target word given the context. For Skip-Gram, it involves the sum of the negative log likelihoods for each context word given the target word.
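
The two loss functions described above can be sketched with plain Python. The scores are made up, and the same score list is reused for both losses purely for brevity; in a real model the CBOW scores come from the averaged context and the skip-gram scores from the center word.

```python
import math

def softmax(scores):
    """Turn raw scores into a probability distribution."""
    shifted = [s - max(scores) for s in scores]   # subtract the max for numerical stability
    exps = [math.exp(s) for s in shifted]
    total = sum(exps)
    return [e / total for e in exps]

vocab = ["the", "quick", "brown", "fox"]
scores = [0.5, 2.0, 1.0, -1.0]     # made-up model scores, one per vocabulary word
probs = softmax(scores)

# CBOW: negative log likelihood of the single target word.
target = vocab.index("quick")
cbow_loss = -math.log(probs[target])

# Skip-gram: sum the negative log likelihoods of every context word.
context = [vocab.index("the"), vocab.index("brown")]
skipgram_loss = sum(-math.log(probs[c]) for c in context)

print(f"CBOW loss: {cbow_loss:.3f}, skip-gram loss: {skipgram_loss:.3f}")
```

Both losses approach zero as the model assigns probability close to 1 to the words that actually occur.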

<center><img src="images/Neuron.drawio.png" width="800" height="400" /></center>

In [None]:
# Pseudo-code for updating weights in Word2Vec optimization
def update_weights(weights, learning_rate, gradient):
    # Update the weights by moving a small step against the gradient (downhill on the loss)
    new_weights = weights - learning_rate * gradient
    return new_weights

# Example values (in a real scenario, these would be computed based on your model and data)
weights = 0.5  # Initial weights
learning_rate = 0.01  # Learning rate
gradient = 0.2  # Example gradient

# Update weights based on gradient
new_weights = update_weights(weights, learning_rate, gradient)
print(f"Updated weights: {new_weights}")


In [None]:
# Define our optimizer

# https://www.tensorflow.org/api_docs/python/tf/keras/optimizers/Adam
optimizer = tf.optimizers.Adam(learning_rate=1e-3)
loss_list = []

In [None]:
def adam_update(weights, gradients, m, v, t, learning_rate=0.001, beta1=0.9, beta2=0.999, epsilon=1e-8):
    # Update biased first moment estimate
    m = [beta1 * m_i + (1 - beta1) * g for m_i, g in zip(m, gradients)]
    # Update biased second raw moment estimate
    v = [beta2 * v_i + (1 - beta2) * (g ** 2) for v_i, g in zip(v, gradients)]
    # Compute bias-corrected first moment estimate
    m_hat = [m_i / (1 - beta1 ** t) for m_i in m]
    # Compute bias-corrected second raw moment estimate
    v_hat = [v_i / (1 - beta2 ** t) for v_i in v]
    # Update weights
    weights = [w - learning_rate * m_i / (v_i ** 0.5 + epsilon) for w, m_i, v_i in zip(weights, m_hat, v_hat)]
    return weights, m, v

# Example usage
weights = [0.1, 0.2]  # Example weights
gradients = [0.01, -0.02]  # Example gradients
m = [0, 0]  # Initial first moment vector
v = [0, 0]  # Initial second moment vector
t = 1  # Time step

# Update weights using Adam
weights, m, v = adam_update(weights, gradients, m, v, t)
print("Updated weights:", weights)

Adam (Adaptive moment estimation) optimization is based on adaptive estimates of lower-order moments. The algorithm maintains two moving averages for each weight in the network: one for gradients ($m_t$) and one for the square of gradients ($v_t$). These moving averages are estimates of the first moment (the mean) and the second moment (the uncentered variance) of the gradients respectively.

The Adam update rule is given by:

- $m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t$
- $v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2$

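Typical default values:

- $\beta_1$ = 0.9 (decay rate for the moving average of the gradients)
- $\beta_2$ = 0.999 (decay rate for the moving average of the squared gradients)
- $\epsilon$ = $10^{-8}$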

where:
- $m_t$ is the biased first moment estimate,
- $v_t$ is the biased second moment estimate,
- $g_t$ is the gradient at time step $t$,
- $\beta_1$ and $\beta_2$ are exponential decay rates for the moment estimates, typically close to 1.

To correct for their initialization bias towards zero, Adam computes bias-corrected versions of these moving averages:

- $\hat{m}_t = \frac{m_t}{1 - \beta_1^t}$
- $\hat{v}_t = \frac{v_t}{1 - \beta_2^t}$

Finally, the weights are updated with:

- $w_{t+1} = w_t - \frac{\eta \cdot \hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}$

where:
- $w_{t+1}$ is the updated weight,
- $\eta$ is the learning rate,
- $\epsilon$ is a small number to prevent division by zero, often $10^{-8}$.
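
As a worked check of these formulas, consider the first step ($t = 1$) for a single weight. The values are assumptions chosen for illustration: $m_0 = v_0 = 0$, $w_1 = 0.1$, $g_1 = 0.01$, $\eta = 0.001$, with the typical $\beta_1 = 0.9$, $\beta_2 = 0.999$, and $\epsilon = 10^{-8}$.

$$
\begin{aligned}
m_1 &= 0.9 \cdot 0 + 0.1 \cdot 0.01 = 0.001 \\
v_1 &= 0.999 \cdot 0 + 0.001 \cdot 0.01^2 = 10^{-7} \\
\hat{m}_1 &= \frac{0.001}{1 - 0.9} = 0.01 \\
\hat{v}_1 &= \frac{10^{-7}}{1 - 0.999} = 10^{-4} \\
w_2 &= 0.1 - \frac{0.001 \cdot 0.01}{\sqrt{10^{-4}} + 10^{-8}} \approx 0.099
\end{aligned}
$$

On the first step the bias correction recovers $\hat{m}_1 = g_1$ and $\hat{v}_1 = g_1^2$, so the update has magnitude of roughly $\eta$ regardless of how large or small the gradient is.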

### Role in Optimization

The role of the Adam optimizer in neural network training is to adaptively adjust the learning rate for each weight. This means it scales the step size by an estimate of the first and second moments of the gradients. This adaptability helps in dealing with sparse gradients and different curvature across parameters, making Adam well-suited for a wide range of problems and data types.

For additional information: https://youtu.be/JXQT_vxqwIs?si=XhDqIou_jHLWlnHw

### Train our CBOW model

In [None]:
## Compute the vectors for the context and center words
for epoch in tqdm(range(ITERATIONS)):
    loss_per_epoch = 0 # initialize the loss per epoch to 0

    # slide the context window over the text
    for start in range(tokenized_text_size - WINDOW_SIZE):
        indices = text_as_int[start:start + WINDOW_SIZE]

        # record operations for automatic differentiation
        # https://www.tensorflow.org/api_docs/python/tf/GradientTape
        with tf.GradientTape() as tape:
            combined_context = 0 # initialize the combined context to 0

            # sum the context vectors, skipping the center word
            for count, index in enumerate(indices):
                if count != WINDOW_SIZE // 2:
                    combined_context += context_vector_matrix[index, :]

            # divide by the number of context words to create an average
            combined_context /= (WINDOW_SIZE - 1)

            # score every center vector against the averaged context
            # https://www.tensorflow.org/api_docs/python/tf/linalg/matmul
            output = tf.matmul(center_vector_matrix, tf.expand_dims(combined_context, 1))

            # apply softmax to turn the scores into probabilities
            # https://www.tensorflow.org/api_docs/python/tf/nn/softmax
            softout = tf.nn.softmax(output, axis=0)
            prob = softout[indices[WINDOW_SIZE // 2]] # probability of the true center word

            # compute the log loss (negative log likelihood)
            logloss = -tf.math.log(prob)

        # accumulate the loss per epoch : we want this number to decrease
        loss_per_epoch += logloss.numpy()

        # compute the gradient of the loss with respect to the context and center vectors
        grad = tape.gradient(
            logloss, [context_vector_matrix, center_vector_matrix]
        )

        # apply the gradient to the context and center vectors
        optimizer.apply_gradients(
            zip(grad, [context_vector_matrix, center_vector_matrix])
        )

    # append the loss per epoch to the loss list
    loss_list.append(loss_per_epoch)

### Plot the loss

In [None]:
# create the output directory if it doesn't exist
if not os.path.exists(OUTPUT_PATH):
    os.makedirs(OUTPUT_PATH)

print("[INFO] Plotting loss ...")
plt.plot(loss_list)
plt.xlabel("epoch")
plt.ylabel("loss")
plt.savefig(CBOW_LOSS)

### Reduce the dimensionality of the embeddings

In [None]:
# Convert the embeddings to 2D
# tsne_embed = (
#     TSNE(n_components=2)
#     .fit_transform(center_vector_matrix.numpy())
# )
# tsne_decode = (
#     TSNE(n_components=2)
#     .fit_transform(context_vector_matrix.numpy())
# )


# Assuming center_vector_matrix and context_vector_matrix are available
# center_vector_matrix = np.random.rand(100, 300)  # Example data
# context_vector_matrix = np.random.rand(100, 300)  # Example data

def compute_tsne(data):
    tsne = TSNE(n_components=2)
    return tsne.fit_transform(data)

# Using joblib to parallelize; pass plain NumPy arrays rather than tf.Variable objects
results = Parallel(n_jobs=-1)(delayed(compute_tsne)(data) for data in [center_vector_matrix.numpy(), context_vector_matrix.numpy()])

tsne_embed, tsne_decode = results[0], results[1]


In [None]:
# save the tsne embeddings
if not os.path.exists(CBOW_TSNE):
    os.makedirs(CBOW_TSNE)

# save both the center and context vectors
np.save(os.path.join(CBOW_TSNE, "center_vectors"), tsne_embed)
np.save(os.path.join(CBOW_TSNE, "context_vectors"), tsne_decode)

In [None]:
# load the tsne embeddings
tsne_embed = np.load(os.path.join(CBOW_TSNE, "center_vectors.npy"))
tsne_decode = np.load(os.path.join(CBOW_TSNE, "context_vectors.npy"))

In [None]:
# Plot the embeddings for the first 100 words
plt.figure(figsize=(25, 5))

print("[INFO] Plotting TSNE embeddings ...")

for index, (x, y) in enumerate(tsne_decode[:100]):
    # plot the point in 2d space
    plt.scatter(x, y)
    # annotate the point with the word
    plt.annotate(index_to_vocab[index], (x, y))
plt.savefig(CBOW_TSNE)

## Implement the SKIPGRAM algorithm

In [None]:
## same as above but for skipgram
(vocab, tokenize_text_size, tokenized_text) = tokenize_data(corpus)

# Map our words to indices
vocab_to_index = {
    unique_word:index for (index, unique_word) in enumerate(vocab)
}

# Create an array of our vocab
index_to_vocab = np.array(vocab)

# convert the text to integers
text_as_int = np.array([vocab_to_index[word] for word in tokenized_text])

# Create a matrix of random data for our context vectors (one row per vocabulary word)
context_vector_matrix = tf.Variable(
    np.random.rand(len(vocab), EMBEDDING_SIZE)
)

# Create a matrix of random data for our center vectors (one row per vocabulary word)
center_vector_matrix = tf.Variable(
    np.random.rand(len(vocab), EMBEDDING_SIZE)
)

# Define our optimizer
optimizer = tf.optimizers.Adam()
loss_list = []

### Train our SKIPGRAM model

In [None]:
for epoch in tqdm(range(ITERATIONS)):
    loss_per_epoch = 0

    # slide the context window over the text
    for start in range(tokenize_text_size - WINDOW_SIZE):
        indices = text_as_int[start:start + WINDOW_SIZE]

        # https://www.tensorflow.org/api_docs/python/tf/GradientTape
        with tf.GradientTape() as tape:
            logloss = 0

            # look up the vector of the center word
            center_vector = center_vector_matrix[indices[WINDOW_SIZE // 2], :]

            # score every context vector against the center vector
            output = tf.matmul(
                context_vector_matrix, tf.expand_dims(center_vector, 1)
            )

            # apply softmax to the output
            softmax_output = tf.nn.softmax(output, axis=0)

            # sum the negative log likelihood of each context word
            for (count, index) in enumerate(indices):
                if count != WINDOW_SIZE // 2: # skip the center word
                    logloss += -tf.math.log(softmax_output[index])

        # accumulate the loss per epoch : we want this number to decrease
        loss_per_epoch += logloss.numpy()

        # compute the gradient of the loss with respect to the context and center vectors
        grad = tape.gradient(
            logloss, [context_vector_matrix, center_vector_matrix]
        )

        # apply the gradient to the context and center vectors
        optimizer.apply_gradients(
            zip(grad, [context_vector_matrix, center_vector_matrix])
        )

    # append our loss per epoch to the loss list
    loss_list.append(loss_per_epoch)

### Plot the loss for SKIPGRAM

In [None]:
print("[INFO] plotting loss ...")
plt.plot(loss_list)
plt.xlabel("epoch")
plt.ylabel("loss")
plt.savefig(SKIPGRAM_LOSS)

### Reduce the dimensionality of the embeddings

In [None]:
# Convert the embeddings to 2D
tsneEmbed = (
    TSNE(n_components=2)
    .fit_transform(center_vector_matrix.numpy())
)
tsneDecode = (
    TSNE(n_components=2)
    .fit_transform(context_vector_matrix.numpy())
)

In [None]:
# save the tsne embeddings
if not os.path.exists(SKIPGRAM_TSNE):
    os.makedirs(SKIPGRAM_TSNE)

# save both the center and context vectors
np.save(os.path.join(SKIPGRAM_TSNE, "center_vectors"), tsneEmbed)
np.save(os.path.join(SKIPGRAM_TSNE, "context_vectors"), tsneDecode)

In [None]:
# load the tsne embeddings
tsneEmbed = np.load(os.path.join(SKIPGRAM_TSNE, "center_vectors.npy"))
tsneDecode = np.load(os.path.join(SKIPGRAM_TSNE, "context_vectors.npy"))

In [None]:
plt.figure(figsize=(25, 5))

print("[INFO] Plotting TSNE Embeddings...")
# plot words 100-199, keeping each label aligned with its row in the matrix
for indexCount, (x, y) in enumerate(tsneEmbed[100:200], start=100):
    plt.scatter(x, y)
    plt.annotate(index_to_vocab[indexCount], (x, y))
plt.savefig(SKIPGRAM_TSNE)

## Federalist Papers - Word2Vec with Gensim

In [None]:
## load the papers
import os
from pathlib import Path
import gensim

# load the papers
corpus_dir = '../datasets/Federalist_Papers/FedPapersCorpus/FedPapersCorpus'
corpus_file_names = [f for f in os.listdir(corpus_dir) if f.endswith('.txt')]
len(corpus_file_names)

In [None]:
# create our text corpus as a list of strings (one string per paper)
corpus = []
for file_name in corpus_file_names:
    with open(os.path.join(corpus_dir, file_name), 'r', encoding='utf-8') as file:
        corpus.append(file.read())
        
assert len(corpus) == len(corpus_file_names)

In [None]:
corpus

In [None]:
def clean_text(text):
    # strip the nbsp
    text = text.replace('&nbsp;||', ' ')
    # strip tabs
    text = text.replace('\t', ' ')
    # strip new lines
    text = " ".join(text.split())
    return text

corpus = [clean_text(text) for text in corpus]
corpus

In [None]:
# examine the metadata
import pandas as pd

fed_df = pd.read_csv(Path("..", "datasets", "Federalist_Papers", "fedPapers85.csv"))
fed_df.head()

In [None]:
# plot the authors
fed_df['author'].value_counts().plot(kind='bar')

## Train a Word2Vec model with Gensim

In order to train a Word2Vec model with Gensim, we need to install the Gensim library.

In [None]:
try:
    import gensim
except ModuleNotFoundError:
    !pip install gensim
    
from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence
import os

In [None]:
dir(Word2Vec)

In [None]:
Word2Vec?

### Convert our data to a list of sentences

In [None]:
# convert the corpus to lemmas
import spacy

nlp = spacy.load("en_core_web_sm")

# reduce text to lemmas and strip punctuation
def preprocessor(text):
    doc = nlp(text)
    sentences = [sentence for sentence in doc.sents]
    # replace the sentence with the lemmatized version
    # lemmas = " ".join([word.lemma_.lower() for sentence in sentences for word in sentence if not word.is_punct])
    # alternate version - return the lemma of only nouns
    lemmas = " ".join([word.lemma_.lower() for sentence in sentences for word in sentence if word.pos_ == "NOUN"])
    return lemmas

In [None]:
cleaned_corpus = [preprocessor(text) for text in corpus]
cleaned_corpus

In [None]:
# save the cleaned corpus to disk as one file per text in the corpus
corpus_dir = '../datasets/Federalist_Papers/FedPapersCorpus/FedPapersCorpus/processed'
if not os.path.exists(corpus_dir):
    os.makedirs(corpus_dir)

with open(os.path.join(corpus_dir, 'fed_papers_nouns.txt'), 'w', encoding='utf-8') as file:
    file.write("\n".join(cleaned_corpus))

In [None]:
# create a generator to read the file
corpus_file = os.path.join(corpus_dir, 'fed_papers_nouns.txt')

class MyCorpus:
    def __iter__(self):
        for line in open(corpus_file, 'r', encoding='utf-8'):
            yield line.split()
            
sentences = MyCorpus()
sentences

In [None]:
# examine the first 10 sentences
for i, sentence in enumerate(sentences):
    if i >= 10:
        break
    print(sentence)

In [None]:

# train a word2vec model
# workers is the number of training threads and must be a positive integer
model = Word2Vec(sentences=sentences,
                 vector_size=300,
                 sg=1,
                 window=5,
                 compute_loss=True,
                 min_count=5,
                 workers=4,
                 epochs=5000)

# save the model
model.save("fed_papers.model")

# # load the model
model = Word2Vec.load("fed_papers.model")

In [None]:
# get a list of the vocabulary words in a dataframe
vocab = list(model.wv.index_to_key)

vocab_df = pd.DataFrame(vocab, columns=["word"])
vocab_df[vocab_df.word.str.contains('^gover', regex=True)]

In [None]:
# get the most similar words
model.wv.most_similar("government")

In [None]:
# visualize the word embeddings with tensorboard
words = list(model.wv.index_to_key)
vectors = model.wv.vectors

# save the data to disk as embeddings and metadata
import os
import tensorflow as tf
from tensorboard.plugins import projector

LOG_DIR = "logs"
if not os.path.exists(LOG_DIR):
    os.makedirs(LOG_DIR)

# save the words to disk as metadata
with open(os.path.join(LOG_DIR, "metadata_noun.tsv"), "w", encoding="utf-8") as file:
    for word in words:
        file.write(f"{word}\n")

# save the vectors to disk as embeddings
with open(os.path.join(LOG_DIR, "embeddings_noun.tsv"), "w", encoding="utf-8") as file:
    for vector in vectors:
        file.write("\t".join([str(x) for x in vector]) + "\n")


In [None]:
# load a pretrained word2vec model
import gensim.downloader as api

# get the list of available models
api.info()

# list the available models
models = api.info()['models']
print("\n".join(models.keys()))

In [None]:
# load the word2vec model google news
word2vec_model = api.load("word2vec-google-news-300")

# get the most similar words for government
word2vec_model.most_similar("government", topn=20)

In [None]:
# We want to pose the question: "What is the capital of Italy?"
# We can do this by computing the vector for "Paris" - "France" + "Italy"
# We expect the most similar word to be "Rome"
word2vec_model.most_similar(positive=["Paris", "Italy"], negative=["France"], topn=10)

# we can do the same for "Berlin" - "Germany" + "France"
word2vec_model.most_similar(positive=["Berlin", "France"], negative=["Germany"], topn=10)

# and gendered words: "king" - "man" + "woman" should land near "queen"
word2vec_model.most_similar(positive=["king", "woman"], negative=["man"], topn=10)

# and topics
word2vec_model.most_similar(positive=["nuclear", "energy"], topn=10)

# and sports
word2vec_model.most_similar(positive=["tennis", "soccer"], topn=10)
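
The vector arithmetic behind these analogy queries can be sketched without a pretrained model. The vectors below are made up, and the `analogy` helper is a simplified, hypothetical stand-in; gensim's own `most_similar` differs in details such as how vectors are normalized.

```python
import math

# Made-up 3-dimensional vectors; dimensions loosely stand for (royalty, male, female).
vectors = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "apple": [0.0, 0.2, 0.2],
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def analogy(positive, negative):
    """Return the word closest to sum(positive) - sum(negative), excluding the inputs."""
    dims = len(next(iter(vectors.values())))
    query = [0.0] * dims
    for w in positive:
        query = [q + x for q, x in zip(query, vectors[w])]
    for w in negative:
        query = [q - x for q, x in zip(query, vectors[w])]
    candidates = [w for w in vectors if w not in positive and w not in negative]
    return max(candidates, key=lambda w: cosine(query, vectors[w]))

print(analogy(positive=["king", "woman"], negative=["man"]))
```

Subtracting `man` removes the "male" component from `king`, and adding `woman` replaces it with the "female" component, which leaves a query vector closest to `queen`.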

### Document to Vector

With our word embeddings, we can now convert our documents to vectors. This is a common technique in NLP and is used in many applications such as document classification, clustering, and information retrieval. But we have different methodologies we can use to convert our documents to vectors.

### Average Word Embeddings

One simple way to convert a document to a vector is to average the embeddings of the words in the document. We can then compare documents by computing the cosine similarity of their vectors. This method, however, is naive in several respects (for example, it ignores word order and weights every word equally), and there are more sophisticated methods for converting documents to vectors. You should be aware of how algorithms generate document vectors, as you may find the results unsatisfactory - yet have ideas on how to improve them.
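
A minimal sketch of the averaging approach, assuming a hypothetical table of made-up three-dimensional word vectors in place of a trained model:

```python
import math

# Made-up 3-dimensional word vectors for illustration only.
word_vectors = {
    "senate":   [0.9, 0.1, 0.0],
    "congress": [0.8, 0.2, 0.1],
    "votes":    [0.7, 0.3, 0.0],
    "coffee":   [0.0, 0.9, 0.4],
    "tea":      [0.1, 0.8, 0.5],
}

def document_vector(text):
    """Average the vectors of the in-vocabulary words; returns None if none are known."""
    known = [word_vectors[w] for w in text.lower().split() if w in word_vectors]
    if not known:
        return None
    return [sum(dim) / len(known) for dim in zip(*known)]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

doc_a = document_vector("Senate votes")
doc_b = document_vector("Congress votes")
doc_c = document_vector("coffee tea")

print(f"a vs b: {cosine(doc_a, doc_b):.3f}")
print(f"a vs c: {cosine(doc_a, doc_c):.3f}")
```

Out-of-vocabulary words are silently skipped here, which is one of the design decisions that can make averaged vectors unsatisfactory on real text.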

### Doc2Vec

In the `gensim` library, we can use the `Doc2Vec` model to convert documents to vectors. `Doc2Vec` is an extension of `Word2Vec`: an unsupervised algorithm that learns fixed-length feature representations from variable-length pieces of text, such as sentences, paragraphs, and documents.

## Load the Clinton Email Corpus

In [None]:
import pandas as pd
import numpy as np

clinton_emails = pd.read_csv("../datasets/Clinton_Emails/Emails.csv")

### Examine our data

In [None]:
clinton_emails.info()

### Preprocess our data

In [None]:
# let's concatenate the text fields
clinton_emails['text'] = clinton_emails['ExtractedSubject'].fillna('') + " " + clinton_emails['ExtractedBodyText'].fillna('')

clinton_emails.info()

### Simplified preprocessing

In [None]:
def preprocessor(text: str) -> str:
    return text.lower()

### Train a Doc2Vec model with Gensim

In [None]:
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Preprocess the concatenated subject + body text created above
preprocessed_texts = [preprocessor(text) for text in clinton_emails['text'].fillna('')]

# Create tagged documents
tagged_documents = [TaggedDocument(words=text.split(), tags=[i]) for i, text in enumerate(preprocessed_texts)]

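# Train the Doc2Vec model
# Note: gensim does not treat workers=-1 as "use all cores"; a non-positive
# value starts no worker threads, so no training takes place.
# epochs=20 is an assumed starting point (5000 passes is impractical once
# training actually runs); tune it for your corpus.
model = Doc2Vec(tagged_documents, vector_size=300, window=5, min_count=5, workers=4, epochs=20)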

# Get the document vectors
document_vectors = [model.infer_vector(tagged_document.words) for tagged_document in tagged_documents]


### Convert the documents to numpy arrays

In [None]:
document_vectors_numpy = np.array(document_vectors)

### Save the model

In [None]:
# save the model
model.save("clinton_emails.model")

In [None]:
# load the model
model = Doc2Vec.load("clinton_emails.model")

In [None]:
# save the document vectors from the model to visualize with TensorBoard
import os

LOG_DIR = "logs"
os.makedirs(LOG_DIR, exist_ok=True)

# save the document vectors to disk as embeddings
with open(os.path.join(LOG_DIR, "embeddings_clinton_emails.tsv"), "w", encoding="utf-8") as file:
    for vector in document_vectors:
        file.write("\t".join([str(x) for x in vector]) + "\n")

# save the document vectors to disk as metadata
with open(os.path.join(LOG_DIR, "metadata_clinton_emails.tsv"), "w", encoding="utf-8") as file:
    for i in range(len(preprocessed_texts)):
        file.write(f"{i}\n")

### Cluster the documents

In [None]:
from sklearn.cluster import KMeans

# Specify the number of clusters
num_clusters = 8

# Initialize the KMeans model
kmeans = KMeans(n_clusters=num_clusters, random_state=42)

# Fit the model to the document vectors
kmeans.fit(document_vectors)

# Get the cluster labels for each document
cluster_labels = kmeans.labels_


In [None]:
# merge the cluster labels onto the dataframe
clinton_emails['cluster'] = cluster_labels

# merge the document vectors onto the dataframe
clinton_emails['document_vector'] = document_vectors

# keep only the DocNumber, MetadataDateSent, MetadataFrom, document_vector, and cluster columns
clinton_emails_clustered = clinton_emails[['DocNumber', 'MetadataDateSent', 'MetadataFrom', 'document_vector', 'cluster']].copy()


In [None]:
clinton_emails_clustered.info()

### Visualize the clusters

In [None]:
from sklearn.manifold import TSNE

# Initialize the t-SNE model
tsne = TSNE(n_components=2, random_state=42)

# Fit the model to the document vectors
document_vectors_2d = tsne.fit_transform(document_vectors_numpy)

In [None]:
import matplotlib.pyplot as plt

# Get the unique cluster IDs
unique_clusters = clinton_emails_clustered['cluster'].unique()

# Plot the document vectors
plt.figure(figsize=(10, 6))
for cluster_id in unique_clusters:
    cluster_vectors = document_vectors_2d[clinton_emails_clustered['cluster'] == cluster_id]
    plt.scatter(cluster_vectors[:, 0], cluster_vectors[:, 1], label=f'Cluster {cluster_id}')

plt.title('2D Document Vectors')
plt.xlabel('Dimension 1')
plt.ylabel('Dimension 2')
plt.legend()
plt.show()


In [None]:
from sklearn.decomposition import PCA
from sklearn.cluster import AgglomerativeClustering

# Reduce the dimensionality of the document vectors using PCA
pca = PCA(n_components=25, random_state=42)
document_vectors_pca = pca.fit_transform(document_vectors_numpy)

# Perform agglomerative clustering on the PCA-reduced document vectors
agglomerative = AgglomerativeClustering(n_clusters=num_clusters)
cluster_labels_agglomerative = agglomerative.fit_predict(document_vectors_pca)


In [None]:
import matplotlib.pyplot as plt

# Plot the agglomerative clusters
plt.figure(figsize=(10, 6))
for cluster_id in range(num_clusters):
    cluster_vectors = document_vectors_2d[cluster_labels_agglomerative == cluster_id]
    plt.scatter(cluster_vectors[:, 0], cluster_vectors[:, 1], label=f'Cluster {cluster_id}')

plt.title('Agglomerative Clusters')
plt.xlabel('Dimension 1')
plt.ylabel('Dimension 2')
plt.legend()
plt.show()


In [None]:
# examine the documents assigned to one cluster (here, cluster 7)
clinton_emails_clustered[clinton_emails_clustered['cluster'] == 7].head()

In [None]:
clinton_emails_clustered.iloc[280]

## TensorBoard Embeddings Projector

https://www.tensorflow.org/tensorboard/tensorboard_projector_plugin
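
The metadata file we wrote earlier contains only a row index, so points in the projector cannot be colored by cluster. The sketch below writes a multi-column metadata file alongside the vectors. It rests on one assumption worth checking against the projector documentation: a metadata file with more than one column needs a header row, while a single-column file must not have one. The helper `write_projector_files`, the file names, and the toy data are hypothetical stand-ins for `document_vectors` and `cluster_labels`.

```python
import csv
import os
import tempfile


def write_projector_files(vectors, rows, columns, log_dir):
    """Write vectors and multi-column metadata as TSV; return both paths."""
    if len(vectors) != len(rows):
        raise ValueError("need exactly one metadata row per vector")

    os.makedirs(log_dir, exist_ok=True)
    vectors_path = os.path.join(log_dir, "embeddings.tsv")
    metadata_path = os.path.join(log_dir, "metadata.tsv")

    # One vector per line, values separated by tabs, no header
    with open(vectors_path, "w", encoding="utf-8", newline="") as f:
        writer = csv.writer(f, delimiter="\t", lineterminator="\n")
        writer.writerows(vectors)

    # Header row first (assumed to be required when there are 2+ columns),
    # then one metadata row per vector in the same order
    with open(metadata_path, "w", encoding="utf-8", newline="") as f:
        writer = csv.writer(f, delimiter="\t", lineterminator="\n")
        writer.writerow(columns)
        writer.writerows(rows)

    return vectors_path, metadata_path


# Toy data standing in for the document vectors and their cluster labels
toy_vectors = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
toy_rows = [(0, 3), (1, 7)]

paths = write_projector_files(
    toy_vectors, toy_rows, ["doc_id", "cluster"], tempfile.mkdtemp()
)
print(paths)
```

With the real data, `rows` could be built as `zip(range(len(cluster_labels)), cluster_labels)`, which keeps the metadata rows in the same order as the vectors.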