[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JamesMTucker/DATA_340_NLP/blob/master/Fall_2023/notebooks/06_Vector_Semantics.ipynb)

# Lecture 15: 2023-28-03 Vector Semantics

## Overview of lecture

- Introduction to lexical semantics
- Introduction to Vector Semantics
  - Vector semantics: Osgood et al. (1957)
  - Vector semantics: Joos (1950), Harris (1954), Firth (1957)
- Embeddings
    - Word2Vec
    - GloVe
    - FastText
    - ELMo
    - BERT


## Introduction to Neural Networks

<center><img src="images/Neuron.drawio.png" width="800" height="400" /></center>

* Calculating Loss (measuring error - training a model adjusts weights and biases to minimize loss)
* Optimizing Loss (adjust weights and biases to minimize loss)
* Backpropagation (calculate the gradient of the loss function with respect to the weights and biases)
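
The three steps above can be sketched for a single linear neuron. This is a minimal illustration, assuming a made-up toy dataset, a mean-squared-error loss, and a hypothetical `learning_rate`; it is not the network used later in this notebook.

```python
# Toy data (made up): the neuron should learn y = 2x + 1
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

w, b = 0.0, 0.0          # weight and bias, initialised to zero
learning_rate = 0.05     # hypothetical step size


def mse(w, b):
    """Calculating loss: mean squared error between predictions and targets."""
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


initial_loss = mse(w, b)

for step in range(2000):
    # Backpropagation for one linear neuron: gradients of the loss w.r.t. w and b
    grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / len(xs)
    grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / len(xs)
    # Optimizing loss: move the parameters against the gradient
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

final_loss = mse(w, b)
print(round(w, 3), round(b, 3), final_loss < initial_loss)
```

With more layers the gradients are obtained by the chain rule rather than written out by hand, which is what automatic differentiation does in the TensorFlow code later on.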

## Introduction to Lexical Semantics

Taken from Jurafsky and Martin (2023) chapter 23:

```

Lady Bracknell: Are your parents living?
Jack: I have lost both my parents.
Lady Bracknell: To lose one parent, Mr. Worthing, may be regarded as a misfortune; to lose both looks like carelessness.

```

* Words are relational units that are prone to messiness and ambiguity
* Ambiguity is a fact of life in language (`mouse` as in a rodent or a computer device)
* Polysemy: a word or lemma with multiple meanings (`bank` as in a river bank or a financial institution)
* `Antonymy`: words (or lemmas) with opposite meanings (`hot` and `cold`)
* `Synonymy`: words (or lemmas) that are similar in meaning (`couch` and `sofa`)
* Taxonomic relations
    * `hyponymy` (subordinate): words (or lemmas) that are more specific (`poodle` is a hyponym of `dog`) - subclasses or members
    * `hypernymy` (superordinate): words (or lemmas) that are more general (`dog` is a hypernym of `poodle`) - classes
        * entailment: being a hyponym A entails being its hypernym B (being a `poodle` entails being a `dog`, but not the reverse)
        * is-a hierarchy: a hierarchy of classes that is organized by the is-a relation or A IS-A B
    * `meronymy`: words (or lemmas) that are part of a larger entity (`leg` is a meronym of `human`) - part-whole relationships
    * `metonymy`: words (or lemmas) that are associated with a larger entity (`the crown` is a metonym of `the queen`) - association (prototype categories)
    * `holonymy`: words (or lemmas) that are a whole of a smaller entity (`face` is a holonym of `eye`) - whole-part relationships
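
The is-a hierarchy and the direction of entailment can be sketched with a plain dictionary. The hierarchy below is a hypothetical toy example, not WordNet data; `HYPERNYM_OF`, `hypernym_chain`, and `is_a` are names invented for this illustration.

```python
# Hypothetical toy is-a hierarchy: each word maps to its direct hypernym
HYPERNYM_OF = {
    "poodle": "dog",
    "dog": "canine",
    "canine": "mammal",
    "cat": "feline",
    "feline": "mammal",
    "mammal": "animal",
}


def hypernym_chain(word):
    """Return every ancestor of `word`, from most specific to most general."""
    chain = []
    while word in HYPERNYM_OF:
        word = HYPERNYM_OF[word]
        chain.append(word)
    return chain


def is_a(hyponym, hypernym):
    """True if being a `hyponym` entails being a `hypernym`."""
    return hypernym in hypernym_chain(hyponym)


print(hypernym_chain("poodle"))  # ['dog', 'canine', 'mammal', 'animal']
print(is_a("poodle", "dog"))     # True
print(is_a("dog", "poodle"))     # False
```

Entailment only runs upward in the hierarchy: every poodle is a dog, but not every dog is a poodle.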


In [None]:
import nltk
nltk.download('wordnet')
from nltk.corpus import wordnet as wn

def get_taxonomy(noun):
    synsets = wn.synsets(noun)
    if synsets:
        synset = synsets[0]  # take the first synset
        hypernyms = synset.hypernyms()
        hyponyms = synset.hyponyms()
        meronyms = synset.part_meronyms() + synset.substance_meronyms() + synset.member_meronyms()
        holonyms = synset.part_holonyms() + synset.substance_holonyms() + synset.member_holonyms()
        return {
            "word": synset.name(),
            "definition": synset.definition(),
            "hypernyms": [h.name() for h in hypernyms],
            "hyponyms": [h.name() for h in hyponyms],
            "meronyms": [m.name() for m in meronyms],
            "holonyms": [h.name() for h in holonyms]
        }
    else:
        return None


In [None]:
result = get_taxonomy("dog")
for k,v in result.items():
    print(k, v, sep=":")

In [None]:
from nltk.corpus import wordnet as wn

def get_verb_relations(verb):
    synsets = wn.synsets(verb, pos=wn.VERB)
    if synsets:
        relations = {
            "antonyms": set(),
            "entailments": set(),
            "causes": set(),
            "also_sees": set(),
            "verb_groups": set(),
            "similar_tos": set()
        }
        for synset in synsets:
            for lemma in synset.lemmas():
                antonyms = lemma.antonyms()
                if antonyms:
                    relations["antonyms"].add(antonyms[0].name())
            for entailment in synset.entailments():
                relations["entailments"].add(entailment.name())
            for cause in synset.causes():
                relations["causes"].add(cause.name())
            for also_see in synset.also_sees():
                relations["also_sees"].add(also_see.name())
            for verb_group in synset.verb_groups():
                relations["verb_groups"].add(verb_group.name())
            for similar in synset.similar_tos():
                relations["similar_tos"].add(similar.name())
        return relations
    else:
        return None


In [None]:
result = get_verb_relations("catch")
for k,v in result.items():
    print(k, v, sep=":")

## Distributional Semantics

* Joos (1950), Harris (1954), and Firth (1957) each proposed that the meaning of a word can be characterized by its distribution in language use, that is, by the contexts in which it occurs. Distributional semantics takes its name from this idea: the meaning of a word is discerned from the words that tend to occur in its company, so words that occur in similar contexts tend to have similar meanings.

> You shall know a word by the company it keeps. (Firth, 1957)
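
As a rough sketch of what "the company it keeps" means in practice, we can count the neighbours of each word within a small window. The corpus and the `WINDOW` value below are made up for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical toy corpus
sentences = [
    "i drink coffee from a cup",
    "i drink tea from a cup",
    "the dog chased the cat",
]

WINDOW = 2  # number of neighbours counted on each side of the target word

cooccurrence = defaultdict(Counter)
for sentence in sentences:
    tokens = sentence.split()
    for i, target in enumerate(tokens):
        lo = max(0, i - WINDOW)
        hi = min(len(tokens), i + WINDOW + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own neighbour
                cooccurrence[target][tokens[j]] += 1

print(cooccurrence["coffee"])
print(cooccurrence["tea"])
```

In this toy corpus `coffee` and `tea` keep exactly the same company, which is the distributional signal that they are similar.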

## Word Similarity

* Word similarity is a measure of the degree of semantic closeness between two words, estimated from the distributional properties of words in a corpus. Words can be closely related without being near-synonyms: the word `coffee` would rarely occur in a dictionary entry for the word `cup`, yet users of language expect `coffee` and `cup` to be closely related in meaning. They are related, in this case, because the two words share a semantic frame. The semantic frame of `coffee` is a hot beverage, and the semantic frame of `cup` is a container for a hot beverage. The frames overlap, and this overlap is the basis for the relationship between the two words. We can capture such relationships by computing the distributional properties of words in a corpus.
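
One common way to turn distributional counts into a score is cosine similarity. The sketch below assumes made-up co-occurrence counts over four hypothetical context words; only the `cosine` function is the point.

```python
import math

# Hypothetical co-occurrence counts over the context words
# ("drink", "hot", "pour", "bark")
vectors = {
    "coffee": [8, 6, 5, 0],
    "cup":    [6, 4, 7, 0],
    "dog":    [1, 0, 0, 9],
}


def cosine(u, v):
    """Cosine of the angle between two count vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


print(round(cosine(vectors["coffee"], vectors["cup"]), 3))
print(round(cosine(vectors["coffee"], vectors["dog"]), 3))
```

Because cosine normalizes by vector length, a frequent word and a rare word can still score as close when their contexts are distributed in the same proportions.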

## How can we represent words and their meanings in numerical format?

We vectorize it!

We can represent words in a vector space or embedding space.

### Word2Vec, Mikolov et al., 2013

Goal: to create “techniques for measuring the quality of the resulting vector representations, with the expectation that not only will similar words tend to be close to each other, but that words can have multiple degrees of similarity.” (Mikolov, et al., 2013a, 2013b)

Mikolov et al. propose two log-linear solutions

* Continuous Bag-of-Words Model
* Continuous Skip-gram Model 
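
The difference between the two models is easiest to see in the training examples they produce from the same window. This is a minimal sketch; the helper names `windows`, `cbow_examples`, and `skipgram_examples` are invented here and do not come from the paper.

```python
def windows(tokens, window_size):
    """Yield (context_words, center_word) for every full window."""
    half = window_size // 2
    for i in range(half, len(tokens) - half):
        context = tokens[i - half:i] + tokens[i + 1:i + half + 1]
        yield context, tokens[i]


def cbow_examples(tokens, window_size):
    """CBOW: predict the center word from all of its context words."""
    return [(context, center) for context, center in windows(tokens, window_size)]


def skipgram_examples(tokens, window_size):
    """Skip-gram: predict each context word from the center word."""
    return [(center, word)
            for context, center in windows(tokens, window_size)
            for word in context]


tokens = "the quick brown fox jumps".split()
print(cbow_examples(tokens, 3))
print(skipgram_examples(tokens, 3))
```

CBOW yields one example per window (many inputs, one target), while Skip-gram yields one example per context word (one input, many targets).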


<center><img src="images/mikolov.png" width="900" height="500" /></center>

Word2Vec embeddings are static embeddings: each word type receives a single vector regardless of the sentence it appears in, so the different senses of an ambiguous word (e.g., `bank`) share one representation. This is a problem for downstream tasks that require contextualized embeddings.
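
A static embedding behaves like a lookup table. The sketch below uses made-up three-dimensional vectors and a hypothetical `embed` helper to show that the surrounding words have no effect on the vector that is returned.

```python
# Hypothetical 3-dimensional static embeddings (the values are made up)
static_embeddings = {
    "bank":  [0.21, -0.47, 0.88],
    "river": [0.10, -0.52, 0.31],
    "money": [0.75, 0.14, -0.20],
}


def embed(sentence):
    """Look up one fixed vector per known word, ignoring the surrounding words."""
    return {w: static_embeddings[w] for w in sentence.split() if w in static_embeddings}


river_sense = embed("we sat on the river bank")
money_sense = embed("the bank lent us money")

# The vector for "bank" is identical in both sentences
print(river_sense["bank"] == money_sense["bank"])  # True
```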

### GloVe, Pennington et al., 2014

“...the shallow window-based methods [e.g., log bi-linear models, CBOW, or Skipgram] suffer from the disadvantage that they do not operate directly on the co-occurrence statistics of the corpus. Instead, these models scan context windows across the entire corpus, which fails to take advantage of the vast amount of repetition in the data.” Pennington, et al., 2014.
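
To make "operating directly on the co-occurrence statistics" concrete, here is a sketch of the weighted least-squares objective from Pennington et al. (2014), evaluated once on a global count matrix `X`. The matrix, the embedding dimension, and the random initial vectors are made up; `glove_weight` and `glove_loss` are names chosen for this illustration, and no training is performed.

```python
import numpy as np


def glove_weight(x, x_max=100.0, alpha=0.75):
    """Weighting function f(x): down-weights rare pairs and caps frequent ones at 1."""
    return np.where(x < x_max, (x / x_max) ** alpha, 1.0)


def glove_loss(X, W, W_tilde, b, b_tilde):
    """Weighted squared error between w_i . w~_j + b_i + b~_j and log X_ij,
    summed over the nonzero entries of the co-occurrence matrix X."""
    i, j = np.nonzero(X)
    pred = np.sum(W[i] * W_tilde[j], axis=1) + b[i] + b_tilde[j]
    return np.sum(glove_weight(X[i, j]) * (pred - np.log(X[i, j])) ** 2)


# Hypothetical global co-occurrence counts for a 3-word vocabulary
X = np.array([[0.0, 4.0, 1.0],
              [4.0, 0.0, 2.0],
              [1.0, 2.0, 0.0]])

rng = np.random.default_rng(0)
dim = 2  # hypothetical embedding size
W, W_tilde = rng.normal(size=(3, dim)), rng.normal(size=(3, dim))
b, b_tilde = np.zeros(3), np.zeros(3)

print(glove_loss(X, W, W_tilde, b, b_tilde))
```

Each nonzero cell of `X` contributes one term, so a word pair that co-occurs thousands of times is visited once per pass rather than once per occurrence.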

<center><img src="images/glove.png" width="800" height="400" /></center>

### FastText, Bojanowski et al., 2017

<center><img src="images/fasttext.png" width="900" height="500" /></center>
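
FastText represents a word through its character n-grams, with `<` and `>` marking the word boundaries. The helper below is a minimal sketch of the n-gram extraction only; the name `char_ngrams` is invented here, and the range of 3 to 6 characters follows the setting described in Bojanowski et al. (2017). The paper additionally keeps the whole word as its own unit, which this sketch leaves out.

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Character n-grams of `word` wrapped in the boundary symbols < and >."""
    wrapped = f"<{word}>"
    grams = []
    for n in range(n_min, n_max + 1):
        for start in range(len(wrapped) - n + 1):
            grams.append(wrapped[start:start + n])
    return grams


print(char_ngrams("where", n_min=3, n_max=3))
# ['<wh', 'whe', 'her', 'ere', 're>']

# Words that share subword units share parameters
shared = set(char_ngrams("where")) & set(char_ngrams("nowhere"))
print(sorted(shared))
```

Because vectors are built from subword units, a word that never appeared in training can still receive a representation from the n-grams it shares with known words.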

### ELMo, Peters et al., 2018

<center><img src="images/elmo.png" width="900" height="400" /></center>

"They [embeddings] should ideally model both (1) complex characteristics of word use (e.g., syntax and semantics), and (2) how these uses vary across linguistic contexts (i.e., to model polysemy)." ([Peters et al., 2018, p. 1](https://arxiv.org/pdf/1802.05365.pdf))

# How to create static word embeddings

Let's code out the word2vec CBOW and Skipgram models and compare them. To do this, let's define our configuration parameters.

In [None]:
import os

# Number of dimensions
EMBEDDING_SIZE = 10

# Window size
WINDOW_SIZE = 5

ITERATIONS = 10000

# OUTPUT
OUTPUT_PATH = "outputs"

## Let's plot the loss for the skipgram model
SKIPGRAM_LOSS = os.path.join(OUTPUT_PATH, 'loss_skipgram')
SKIPGRAM_TSNE = os.path.join(OUTPUT_PATH, 'tsne_skipgram')

## let's plot the loss for the cbow model
CBOW_LOSS = os.path.join(OUTPUT_PATH, 'loss_cbow')
CBOW_TSNE = os.path.join(OUTPUT_PATH, 'tsne_cbow')

In [None]:
# We need to preprocess the textual data

# We can use tensorflow to preprocess the data
import tensorflow as tf

def tokenize_data(data):
    # https://www.tensorflow.org/api_docs/python/tf/keras/preprocessing/text/text_to_word_sequence
    tokenized_text = tf.keras.preprocessing.text.text_to_word_sequence(input_text=data)

    vocab = sorted(set(tokenized_text))
    tokenized_text_size = len(tokenized_text)

    return (vocab, tokenized_text_size, tokenized_text)

## Implement the CBOW algorithm

In [None]:
# define our imports 
import tensorflow as tf

tf.random.set_seed(42)
from sklearn.manifold import TSNE
from joblib import Parallel, delayed
import matplotlib.pyplot as plt
from tqdm import tqdm
import pandas as pd
import numpy as np
import os

### Load our data - we use the Lord of the Rings trilogy

In [None]:
# use google to load the data from drive
#from google.colab import drive
#drive.mount('/content/drive', force_remount=True)
#datasets_dir = "/content/drive/My Drive/DATA_340_3_NLP/Datasets/LOTR/"

datasets_dir = "../datasets/LOTR/"

# get the txt files
filenames = [os.path.join(datasets_dir, f) for f in os.listdir(datasets_dir) if f.endswith(".txt") and 'LOTR' in f]

# read the files
corpus = []

# read 
for f in filenames:
    with open(f, 'r', encoding='UTF-8') as file:
        corpus.append(file.read())

In [None]:
# let's shorten the corpus
corpus = corpus[:1]

### Preprocess the data

In [None]:
# let's flatten the corpus to one string and remove unnecessary spaces

corpus = " ".join(corpus)
corpus = " ".join(corpus.split())

In [None]:
corpus = corpus.lower()

# let's take the first 1000 words
corpus = " ".join(corpus.split()[:1000])
corpus

In [None]:
try:
  import unidecode
except ModuleNotFoundError:
  !pip install unidecode

### Clean up the accents in the text

In [None]:
from unidecode import unidecode

corpus = unidecode(corpus)

In [None]:
# Preprocess the data
(vocab, tokenized_text_size, tokenized_text) = tokenize_data(corpus)

In [None]:
# lets look at our data
print("Vocab size: {}".format(len(vocab)))
print("Text size: {}".format(tokenized_text_size))
print("Text: {}".format(tokenized_text[:10]))

### Create our context and center vectors

In [None]:
# Map our words to indices
vocab_to_index = {
    uniqueWord:index for (index, uniqueWord) in enumerate(vocab)
}

In [None]:
# Create an array of our vocab
index_to_vocab = np.array(vocab)
index_to_vocab

In [None]:
# convert the text to integers
text_as_int = np.array([vocab_to_index[word] for word in tokenized_text])
text_as_int

### Initialize our context and center vectors

In [None]:
# Create a matrix of random data for our context vectors (one row per vocabulary word)
context_vector_matrix = tf.Variable(
    np.random.rand(len(vocab), EMBEDDING_SIZE)
)
context_vector_matrix[0]

In [None]:
# Create a matrix of random data for our center vectors (one row per vocabulary word)
center_vector_matrix = tf.Variable(
    np.random.rand(len(vocab), EMBEDDING_SIZE)
)
center_vector_matrix[0]

### Define our optimizer

<center><img src="images/Neuron.drawio.png" width="800" height="400" /></center>

In [None]:
# Define our optimizer

# https://www.tensorflow.org/api_docs/python/tf/keras/optimizers/Adam
optimizer = tf.optimizers.Adam()
loss_list = []
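
For intuition about what the optimizer does with each gradient, here is a sketch of the textbook Adam update rule on a one-parameter problem. The function name `adam_minimize` and the toy objective are made up; the default hyperparameters mirror those documented for `tf.keras.optimizers.Adam`, but TensorFlow's implementation may differ in details.

```python
import math


def adam_minimize(grad_fn, theta, steps, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-7):
    """Minimize a scalar function given its gradient, using the Adam update rule."""
    m, v = 0.0, 0.0  # running averages of the gradient and the squared gradient
    for t in range(1, steps + 1):
        g = grad_fn(theta)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** t)  # bias correction for the zero initialisation
        v_hat = v / (1 - beta2 ** t)
        theta -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta


# Toy objective f(theta) = (theta - 3)^2 with gradient 2 * (theta - 3)
theta = adam_minimize(lambda th: 2 * (th - 3.0), theta=0.0, steps=5000, lr=0.01)
print(round(theta, 2))
```

Dividing by the running root-mean-square of the gradient gives every parameter its own effective step size, which is convenient when some embedding rows are updated far more often than others.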

### Train our CBOW model
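
Before the full TensorFlow loop, it may help to see the forward pass for a single window in NumPy. This is a sketch with hypothetical toy sizes and random vectors; it computes the loss only and performs no parameter update.

```python
import numpy as np

rng = np.random.default_rng(42)
vocab_size, embedding_size = 6, 4            # hypothetical toy sizes
context_vectors = rng.random((vocab_size, embedding_size))
center_vectors = rng.random((vocab_size, embedding_size))

window = np.array([3, 1, 4, 0, 5])           # word indices of one window of size 5
center_position = len(window) // 2
center_word = window[center_position]
context_words = np.delete(window, center_position)

# 1. average the context vectors
combined_context = context_vectors[context_words].mean(axis=0)
# 2. score every vocabulary word as a candidate center word
scores = center_vectors @ combined_context
# 3. softmax turns the scores into a probability distribution
probs = np.exp(scores - scores.max())
probs /= probs.sum()
# 4. negative log-likelihood of the true center word
loss = -np.log(probs[center_word])

print(loss)
```

Training repeats these four steps for every window and nudges both matrices so that the true center word receives a higher probability.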

In [None]:
## Compute the vectors for the context and center words
# NOTE: every epoch is a full pass over all windows, so ITERATIONS may need to be
# lowered for this loop to finish in a reasonable time.
for epoch in tqdm(range(ITERATIONS)):
    loss_per_epoch = 0 # initialize the loss per epoch to 0

    # create our context slider
    for start in range(tokenized_text_size - WINDOW_SIZE):
        indices = text_as_int[start:start + WINDOW_SIZE]

        # initialize the gradient tape for automatic differentiation
        # https://www.tensorflow.org/api_docs/python/tf/GradientTape
        with tf.GradientTape() as tape:
            combined_context = 0 # initialize the combined context to 0

            # loop through the indices to create the combined context
            for count, index in enumerate(indices):
                if count != WINDOW_SIZE // 2: # skip the center word
                    combined_context += context_vector_matrix[index, :] # add the context vector to the combined context

            combined_context /= (WINDOW_SIZE - 1) # divide by the window size minus the center word to create an average

            # perform the matrix multiplication between the center vectors and the combined context
            # https://www.tensorflow.org/api_docs/python/tf/linalg/matmul
            output = tf.matmul(center_vector_matrix, tf.expand_dims(combined_context, 1))

            # apply softmax to the output
            # https://www.tensorflow.org/api_docs/python/tf/nn/softmax
            softout = tf.nn.softmax(output, axis=0)
            center_prob = softout[indices[WINDOW_SIZE // 2]] # predicted probability of the center word

            # compute the log loss (negative log likelihood)
            logloss = -tf.math.log(center_prob)

        # accumulate the loss per epoch : we want this number to decrease
        loss_per_epoch += logloss.numpy()

        # compute the gradient of the loss with respect to the context and center vectors
        grad = tape.gradient(
            logloss, [context_vector_matrix, center_vector_matrix]
        )

        # apply the gradient to the context and center vectors
        optimizer.apply_gradients(
            zip(grad, [context_vector_matrix, center_vector_matrix])
        )

    # append the loss per epoch to the loss list
    loss_list.append(loss_per_epoch)

### Plot the loss

In [None]:
# create the output directory if it doesn't exist
if not os.path.exists(OUTPUT_PATH):
    os.makedirs(OUTPUT_PATH)

print("[INFO] Plotting loss ...")
plt.plot(loss_list)
plt.xlabel("epoch")
plt.ylabel("loss")
plt.savefig(CBOW_LOSS)

### Reduce the dimensionality of the embeddings

In [None]:
# Convert the embeddings to 2D with t-SNE

def compute_tsne(data):
    tsne = TSNE(n_components=2)
    return tsne.fit_transform(data)

# Use joblib to run both projections in parallel.
# Pass NumPy arrays rather than tf.Variable objects, since TSNE expects array-like input.
results = Parallel(n_jobs=-1)(
    delayed(compute_tsne)(data)
    for data in [center_vector_matrix.numpy(), context_vector_matrix.numpy()]
)

tsne_embed, tsne_decode = results[0], results[1]


In [None]:
# save the tsne embeddings
if not os.path.exists(CBOW_TSNE):
    os.makedirs(CBOW_TSNE)

# save both the center and context vectors
np.save(os.path.join(CBOW_TSNE, "center_vectors"), tsne_embed)
np.save(os.path.join(CBOW_TSNE, "context_vectors"), tsne_decode)

In [None]:
# load the tsne embeddings
tsne_embed = np.load(os.path.join(CBOW_TSNE, "center_vectors.npy"))
tsne_decode = np.load(os.path.join(CBOW_TSNE, "context_vectors.npy"))

In [None]:
# Plot the context embeddings for the first 100 words
plt.figure(figsize=(25, 5))

print("[INFO] Plotting TSNE embeddings ...")

for (index, (x, y)) in enumerate(tsne_decode[:100]):
    # plot the point in 2d space
    plt.scatter(x, y)
    # annotate the point with the word
    plt.annotate(index_to_vocab[index], (x, y))
plt.savefig(CBOW_TSNE)

## Implement the SKIPGRAM algorithm

In [None]:
## same as above but for skipgram
(vocab, tokenize_text_size, tokenized_text) = tokenize_data(corpus)

# Map our words to indices
vocab_to_index = {
    unique_word:index for (index, unique_word) in enumerate(vocab)
}

# Create an array of our vocab
index_to_vocab = np.array(vocab)

# convert the text to integers
text_as_int = np.array([vocab_to_index[word] for word in tokenized_text])

# Create a matrix of random data for our context vectors (one row per vocabulary word)
context_vector_matrix = tf.Variable(
    np.random.rand(len(vocab), EMBEDDING_SIZE)
)

# Create a matrix of random data for our center vectors (one row per vocabulary word)
center_vector_matrix = tf.Variable(
    np.random.rand(len(vocab), EMBEDDING_SIZE)
)

# Define our optimizer
optimizer = tf.optimizers.Adam()
loss_list = []

### Train our SKIPGRAM model
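
Skip-gram reverses the direction of prediction: the center word scores every vocabulary word as a possible context word. The NumPy sketch below uses hypothetical toy sizes and random vectors, and it computes two quantities for one window: the standard objective (the sum of the negative log-probabilities of the context words) and the variant used in the training loop of this notebook (the negative log of the summed probabilities).

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embedding_size = 6, 4            # hypothetical toy sizes
context_vectors = rng.random((vocab_size, embedding_size))
center_vectors = rng.random((vocab_size, embedding_size))

window = np.array([3, 1, 4, 0, 5])           # word indices of one window of size 5
center_position = len(window) // 2
center_word = window[center_position]
context_words = np.delete(window, center_position)

# score every vocabulary word as a context word of the center word
scores = context_vectors @ center_vectors[center_word]
probs = np.exp(scores - scores.max())
probs /= probs.sum()

# standard skip-gram objective: sum of negative log-probabilities
nll_loss = -np.log(probs[context_words]).sum()
# variant used in this notebook: negative log of the summed probabilities
summed_prob_loss = -np.log(probs[context_words].sum())

print(nll_loss, summed_prob_loss)
```

Both quantities shrink as the context words become more probable, but the first penalizes every poorly predicted context word individually, while the second can be small as soon as one context word is predicted well.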

In [None]:
# NOTE: every epoch is a full pass over all windows, so ITERATIONS may need to be
# lowered for this loop to finish in a reasonable time.
for epoch in tqdm(range(ITERATIONS)):
    loss_per_epoch = 0

    for start in range(tokenize_text_size - WINDOW_SIZE):
        indices = text_as_int[start:start + WINDOW_SIZE]

        # https://www.tensorflow.org/api_docs/python/tf/GradientTape
        with tf.GradientTape() as tape:

            loss = 0

            # look up the vector of the center word
            center_vector = center_vector_matrix[indices[WINDOW_SIZE // 2], :]

            # multiply the context vector matrix by the center vector
            output = tf.matmul(
                context_vector_matrix, tf.expand_dims(center_vector, 1)
            )

            # apply softmax to the output
            softmax_output = tf.nn.softmax(output, axis=0)

            # sum the predicted probabilities of the context words
            for (count, index) in enumerate(indices):
                if count != WINDOW_SIZE // 2: # skip the center word
                    loss += softmax_output[index]

            # compute the log loss (negative log of the summed probabilities)
            logloss = -tf.math.log(loss)

        # accumulate the loss per epoch : we want this number to decrease
        loss_per_epoch += logloss.numpy()

        # compute the gradient of the loss with respect to the context and center vectors
        grad = tape.gradient(
            logloss, [context_vector_matrix, center_vector_matrix]
        )

        # apply the gradient to the context and center vectors
        optimizer.apply_gradients(
            zip(grad, [context_vector_matrix, center_vector_matrix])
        )

    # append our loss per epoch to the loss list
    loss_list.append(loss_per_epoch)

### Plot the loss for SKIPGRAM

In [None]:
print("[INFO] plotting loss ...")
plt.plot(loss_list)
plt.xlabel("epoch")
plt.ylabel("loss")
plt.savefig(SKIPGRAM_LOSS)

### Reduce the dimensionality of the embeddings

In [None]:
# Convert the embeddings to 2D
tsneEmbed = (
    TSNE(n_components=2)
    .fit_transform(center_vector_matrix.numpy())
)
tsneDecode = (
    TSNE(n_components=2)
    .fit_transform(context_vector_matrix.numpy())
)

In [None]:
# save the tsne embeddings
if not os.path.exists(SKIPGRAM_TSNE):
    os.makedirs(SKIPGRAM_TSNE)

# save both the center and context vectors
np.save(os.path.join(SKIPGRAM_TSNE, "center_vectors"), tsneEmbed)
np.save(os.path.join(SKIPGRAM_TSNE, "context_vectors"), tsneDecode)

In [None]:
# load the tsne embeddings
tsneEmbed = np.load(os.path.join(SKIPGRAM_TSNE, "center_vectors.npy"))
tsneDecode = np.load(os.path.join(SKIPGRAM_TSNE, "context_vectors.npy"))

In [None]:
plt.figure(figsize=(25, 5))

print("[INFO] Plotting TSNE Embeddings...")
# rows 100-199 of the embedding matrix correspond to vocabulary indices 100-199
for (index, (x, y)) in enumerate(tsneEmbed[100:200], start=100):
    plt.scatter(x, y)
    plt.annotate(index_to_vocab[index], (x, y))
plt.savefig(SKIPGRAM_TSNE)

## Federalist Papers - Word2Vec with Gensim

In [None]:
## load the papers
import os
from pathlib import Path
import gensim

# load the papers
corpus_dir = '../datasets/Federalist_Papers/FedPapersCorpus/FedPapersCorpus'
corpus_file_names = [f for f in os.listdir(corpus_dir) if f.endswith('.txt')]
len(corpus_file_names)

In [None]:
# create our text corpus as a list of strings (one string per paper)
corpus = []
for file_name in corpus_file_names:
    with open(os.path.join(corpus_dir, file_name), 'r', encoding='utf-8') as file:
        corpus.append(file.read())
        
assert len(corpus) == len(corpus_file_names)

In [None]:
corpus

In [None]:
def clean_text(text):
    # strip the nbsp
    text = text.replace('&nbsp;||', ' ')
    # strip tabs
    text = text.replace('\t', ' ')
    # strip new lines
    text = " ".join(text.split())
    return text

corpus = [clean_text(text) for text in corpus]
corpus

In [None]:
# examine the metadata
import pandas as pd

fed_df = pd.read_csv(Path("..", "datasets", "Federalist_Papers", "fedPapers85.csv"))
fed_df.head()

In [None]:
# plot the authors
fed_df['author'].value_counts().plot(kind='bar')

## Train a Word2Vec model with Gensim

In order to train a Word2Vec model with Gensim, we need to install the Gensim library.

In [None]:
try:
    import gensim
except ModuleNotFoundError:
    !pip install gensim
    
from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence
import os

In [None]:
dir(Word2Vec)

In [None]:
Word2Vec?

### Convert our data to a list of sentences

In [None]:
# convert the corpus to lemmas
import spacy

nlp = spacy.load("en_core_web_sm")

def preprocessor(text):
    doc = nlp(text)
    sentences = [sent.lemma_ for sent in doc.sents]
    return sentences

In [None]:
cleaned_corpus = [preprocessor(text) for text in corpus]
# flatten the corpus
cleaned_corpus = [item for sublist in cleaned_corpus for item in sublist]
cleaned_corpus

In [None]:
# save the cleaned (lemmatized) corpus to disk as a single file with one sentence per line
corpus_dir = '../datasets/Federalist_Papers/FedPapersCorpus/FedPapersCorpus/processed'
os.makedirs(corpus_dir, exist_ok=True)

with open(os.path.join(corpus_dir, 'fed_papers_cleaned.txt'), 'w', encoding='utf-8') as file:
    file.write("\n".join(cleaned_corpus))

In [None]:
# create a generator to read the file
corpus_file = os.path.join(corpus_dir, 'fed_papers_cleaned.txt')

# yield the lines of the file
def read_corpus(corpus_file):
    with open(corpus_file, 'r', encoding='utf-8') as file:
        for line in file:
            yield line.split()
            
corpus = read_corpus(corpus_file)

In [None]:
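# read the sentences into a list so that they can be iterated over more than once
# (a generator would be exhausted after the first pass)
sentences = list(read_corpus(corpus_file))

# train a word2vec model; passing `sentences` to the constructor builds the
# vocabulary and trains the model in one step
model = Word2Vec(sentences=sentences,
                 vector_size=300,
                 window=5,
                 min_count=5,
                 workers=4,  # use a positive number of worker threads
                 epochs=10)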

# save the model
# model.save("fed_papers.model")

# # load the model
# model = Word2Vec.load("fed_papers.model")

In [None]:
# get a list of the vocabulary words in a dataframe
vocab = list(model.wv.index_to_key)

vocab_df = pd.DataFrame(vocab, columns=["word"])
vocab_df.head()

In [None]:
# get the most similar words
model.wv.most_similar("government")
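
Conceptually, a nearest-neighbour query like the one above ranks every vocabulary word by its cosine similarity to the query word. The sketch below reproduces that idea in NumPy on made-up two-dimensional vectors; the function here is a hypothetical stand-in written for illustration, not gensim's implementation.

```python
import numpy as np


def most_similar(word, words, vectors, topn=3):
    """Rank all other words by cosine similarity to `word`."""
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sims = unit @ unit[words.index(word)]
    ranked = np.argsort(-sims)  # indices from most to least similar
    return [(words[i], float(sims[i])) for i in ranked if words[i] != word][:topn]


# Hypothetical 2-dimensional embeddings (the values are made up)
words = ["government", "state", "power", "war"]
vectors = np.array([[1.0, 0.2],
                    [0.9, 0.3],
                    [0.5, 0.5],
                    [-0.2, 1.0]])

for neighbour, score in most_similar("government", words, vectors):
    print(neighbour, round(score, 3))
```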

## Tensorboard Embeddings Projector

https://www.tensorflow.org/tensorboard/tensorboard_projector_plugin
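
One way to explore trained embeddings in the Embedding Projector is to export them as two tab-separated files, one with the vectors and one with the matching labels. The sketch below assumes that convention; the function name `write_projector_files`, the file names, and the example vectors are choices made for this illustration.

```python
import os
import tempfile


def write_projector_files(words, vectors, out_dir):
    """Write vectors.tsv (one vector per line) and metadata.tsv (one label per line)."""
    os.makedirs(out_dir, exist_ok=True)
    vectors_path = os.path.join(out_dir, "vectors.tsv")
    metadata_path = os.path.join(out_dir, "metadata.tsv")
    with open(vectors_path, "w", encoding="utf-8") as v_file, \
         open(metadata_path, "w", encoding="utf-8") as m_file:
        for word, vector in zip(words, vectors):
            v_file.write("\t".join(str(x) for x in vector) + "\n")
            m_file.write(word + "\n")
    return vectors_path, metadata_path


# Made-up example data written to a temporary directory
out_dir = tempfile.mkdtemp()
vectors_path, metadata_path = write_projector_files(
    ["government", "state"],
    [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]],
    out_dir,
)
print(vectors_path, metadata_path)
```

For the gensim model trained above, `model.wv.index_to_key` and `model.wv.vectors` should supply the `words` and `vectors` arguments.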