<a href="https://colab.research.google.com/github/SeanOnamade/BasicProjects/blob/main/02_2_Going_From_Word_Embeddings_to_Neural_Networks.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Going From Word Embeddings to Neural Networks
Great job making it this far! In the previous notebook, we covered different ways of representing concepts as discrete elements, from symbolic representation (a form of nondistributed representation) to distributed representation.

This course section will now narrow down to focus on distributed representation and the basics of Artificial Neural Networks, but keep in mind that there are many different learning algorithms that use each type of representation. For example, clustering methods and k-nearest neighbor algorithms are useful and common learning algorithms that use nondistributed representations, but they are out of scope for this notebook.

We are focusing on distributed representation and the basics of Artificial Neural Networks because these are basic building blocks for understanding the inner workings of Large Language Models. To build up to that, we first need to step back and understand how we might represent language as numerical vectors, and how a learning algorithm can be built to learn from this experience.

**Table of Contents:**
1. [Word2vec](#word2vec)
2. [Brief Intermission: Intro to Artificial Neural Networks](#ann)
3. [Classifying Movie Review Sentiment](#sentiment)
4. [Conclusion](#conclusion)

## <a name="word2vec"> 1. Word2vec
This section introduces a technique called Word2Vec to obtain vector representations of words. You will create your own word embeddings using Word2vec.


### Intro to Word2vec
Word2vec is a technique used in natural language processing to process a large corpus of text and represent words as vectors, also referred to as word embeddings. It was developed in 2013 at Google. Once the word embeddings are created, there are a variety of uses for them, such as analyzing the similarity between corpora, recommendation systems, and sentiment analysis of text.

Note that Word2vec itself does not refer to one model or algorithm. It is a family of model architectures and optimizations. Within Word2vec, [the original paper](https://arxiv.org/pdf/1301.3781.pdf) proposed two different methods for learning representations of words, continuous bag-of-words and continuous skip-gram. We describe continuous skip-gram below.

#### Continuous skip-gram

A continuous skip-gram model predicts words within a certain range before and after the current word in a sentence. Take a look at the demonstration image below ([credits to Tensorflow](https://www.tensorflow.org/text/tutorials/word2vec)) for the eight-word long sentence `The wide road shimmered in the hot sun`:

<div>
<img src="resources/2.2/word2vec_skipgram.png" width="500"/>
</div>

Notice how the skip-grams consist of the `target_word` (highlighted in green) paired with each word in its context (highlighted in yellow). The context window size determines the span of words on either side of a `target_word` that are considered to be a `context_word` when forming the list of skip-grams for each target word in the sentence. Note that the image above only shows three of the eight possible skip-gram lists formed for each window size.
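
To make the windowing concrete, here is a minimal sketch that enumerates `(target_word, context_word)` pairs for the example sentence. The helper name `generate_skipgrams` is hypothetical and only lists the pairs; it is not how gensim trains a model internally.

```python
def generate_skipgrams(tokens, window_size):
    """Return (target_word, context_word) pairs for every token in the sentence."""
    pairs = []
    for i, target_word in enumerate(tokens):
        # Clamp the window so it never runs past either end of the sentence
        start = max(0, i - window_size)
        end = min(len(tokens), i + window_size + 1)
        for j in range(start, end):
            if j != i:  # a word is not its own context
                pairs.append((target_word, tokens[j]))
    return pairs

sentence = "The wide road shimmered in the hot sun".split()
skipgrams = generate_skipgrams(sentence, window_size=2)

# Show only the pairs whose target word is "wide"
print([pair for pair in skipgrams if pair[0] == "wide"])
```

Increasing `window_size` adds more distant context words to each target word's list, which is the effect the image illustrates.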

### Let's create some skip-grams to be used for sentiment analysis!
We will be using a dataset of 25k IMDB movie reviews [(Huggingface link)](https://huggingface.co/datasets/imdb).

Execute the code cells below.

In [None]:
from gensim.models import Word2Vec
import pandas as pd
from datasets import load_dataset
import re
import nltk
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
nltk.download('punkt')
from tqdm.notebook import tqdm
tqdm.pandas()

[nltk_data] Downloading package punkt to /Users/cyang/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


#### Load the dataset and use pandas to manually examine a few examples of data

In [None]:
dataset = load_dataset("imdb")
df = pd.DataFrame(dataset['train'])
df.head()

Unnamed: 0,text,label
0,I rented I AM CURIOUS-YELLOW from my video sto...,0
1,"""I Am Curious: Yellow"" is a risible and preten...",0
2,If only to avoid making this type of film in t...,0
3,This film was probably inspired by Godard's Ma...,0
4,"Oh, brother...after hearing about this ridicul...",0


#### Preprocessing text

Text preprocessing is a standard step when training a model on text corpora. However, the exact steps involved in preprocessing differ, depending on the goal of the model.

In this case, we'd like to preprocess the text so that we remove anything that isn't pertinent to the positive/negative sentiment of the text. In this notebook, we will remove the HTML tags and lowercase all the words, since creating two embeddings for one word capitalized differently (e.g., `CURIOUS` and `Curious`) does not add any pertinent information for the model to classify the text's sentiment. We opt to not remove the punctuation, since our hypothesis is that the presence of punctuation such as `!` vs `...` can be useful in determining the sentiment of text.

As another example, if we were trying to analyze the similarity between two website copies, we would not want to remove relevant information like capitalization or HTML tags, since those can contribute greatly to similarities/differences in text.

In [None]:
# Code credit to https://www.kaggle.com/code/abdmental01/text-preprocessing-nlp-steps-to-process-text/notebook

# Lowercase all words
df['text'] = df['text'].str.lower()
# Remove any HTML tags
def remove_html_tags(text):
    pattern = re.compile('<.*?>')
    return pattern.sub(r'', text)
df['text'] = df['text'].apply(remove_html_tags)
df.head()

Unnamed: 0,text,label
0,i rented i am curious-yellow from my video sto...,0
1,"""i am curious: yellow"" is a risible and preten...",0
2,if only to avoid making this type of film in t...,0
3,this film was probably inspired by godard's ma...,0
4,"oh, brother...after hearing about this ridicul...",0
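
As a quick sanity check of the two preprocessing steps, the sketch below applies the same regular expression to a made-up review (the sample string and the `clean_review` helper are assumptions for illustration, not part of the dataset or the original code). It also shows one design choice worth being aware of: substituting an empty string fuses the words on either side of a tag, whereas substituting a space keeps them apart.

```python
import re

# Same non-greedy pattern as in the cell above, compiled once
HTML_TAG = re.compile(r"<.*?>")

def clean_review(text, replacement=""):
    """Lowercase the text and replace anything that looks like an HTML tag."""
    return HTML_TAG.sub(replacement, text.lower())

sample = "Great MOVIE!<br /><br />Loved it..."  # made-up review, not from the dataset

fused = clean_review(sample)                     # tags replaced with nothing
spaced = clean_review(sample, replacement=" ")   # tags replaced with a space

print(repr(fused))
print(repr(spaced))
```

Whether the fused form matters depends on the tokenizer used afterwards, so treat the space variant as an option to try rather than a required fix.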


#### Split the dataset into a train and test dataset

We'll use 70% of the data for training the Word2vec word embedding model (and the subsequent neural network for predicting the positive/negative sentiment of the review) and 30% of it for scoring the model's resulting accuracy.

When creating these train/test splits, we need to shuffle the order of the dataset (by setting `shuffle_state=True`, which the helper below passes to `train_test_split` as `shuffle`), since the first half of the dataset consists only of movie reviews with label = `0`. We want both labels represented in both the training and test datasets. This will become important later in the notebook when we train a neural network (don't worry about it for now!).

In [None]:
def split_train_test(df, test_size=0.3, shuffle_state=True):
    X_train, X_test, Y_train, Y_test = train_test_split(df[['text']],
                                                        df['label'],
                                                        shuffle=shuffle_state,
                                                        test_size=test_size,
                                                        random_state=42)
    print("Value counts for Train sentiments")
    print(Y_train.value_counts())
    print("________________________________")
    print("Value counts for Test sentiments")
    print(Y_test.value_counts())
    X_train = X_train.reset_index()
    X_test = X_test.reset_index()
    Y_train = Y_train.to_frame()
    Y_train = Y_train.reset_index()
    Y_test = Y_test.to_frame()
    Y_test = Y_test.reset_index()
    return X_train, X_test, Y_train, Y_test

# Call the train_test_split
X_train, X_test, Y_train, Y_test = split_train_test(df)

Value counts for Train sentiments
1    8752
0    8748
Name: label, dtype: int64
________________________________
Value counts for Test sentiments
0    3752
1    3748
Name: label, dtype: int64


#### Tokenize each movie review

The Word2vec implementation we are using expects its input as a list of tokenized sentences, so we tokenize each review first. Each review becomes a list of sentences, and each sentence becomes a list of words.

In [None]:
def tokenize_sentence(text: str) -> list:
  sentences = [nltk.word_tokenize(sent) for sent in nltk.sent_tokenize(text)]
  return sentences

X_train['tokenized_text'] = X_train['text'].progress_apply(tokenize_sentence)
X_test['tokenized_text'] = X_test['text'].progress_apply(tokenize_sentence)


In [None]:
def create_nested_list_of_tokenized_sentences(df: pd.DataFrame) -> list:
  df_exploded = df.explode('tokenized_text')
  df_exploded.reset_index(drop=True, inplace=True)
  sentences_text = list(df_exploded['tokenized_text'])
  return sentences_text

sentences_text = create_nested_list_of_tokenized_sentences(X_train)
print(f"There are {len(sentences_text)} sentences in the training dataset")

There are 188049 sentences in the training dataset
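
To picture the data structures involved, here is a small dependency-free sketch of the same two steps: tokenizing each review into a list of sentences (each a list of tokens), then flattening all reviews into one list of sentences. The `naive_tokenize` helper is a rough, hypothetical stand-in for `nltk.sent_tokenize` and `nltk.word_tokenize`, and the reviews are made up.

```python
import re

def naive_tokenize(text):
    """Rough stand-in for nltk: split into sentences, then into word/punctuation tokens."""
    # Split after ., ! or ? when followed by whitespace
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    # Words become tokens, and each punctuation mark becomes its own token
    return [re.findall(r"\w+|[^\w\s]", sentence) for sentence in sentences if sentence]

reviews = ["i loved it! great cast.", "dull. i left early."]  # made-up reviews

# One entry per review: a list of sentences, each a list of tokens
tokenized_reviews = [naive_tokenize(review) for review in reviews]

# Flatten review -> sentence nesting into a single list of tokenized sentences,
# which is the shape produced by the explode step above
flat_sentences = [sentence for review in tokenized_reviews for sentence in review]

print(tokenized_reviews[0])
print(len(flat_sentences))
```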


### Let's train a Word2vec model

Word2vec has many tunable parameters, but we won't focus on parameter tuning in this notebook.

To understand the parameters below though, you can glance at the definitions here:
- `vector_size` denotes the length of the resulting vector that represents each token
- `window` denotes the size of the window for continuous skip-gram
- `min_count` denotes the minimum number of times a word has to occur in order to be included in the Word2vec representation
- `sg` denotes whether we use skip-gram instead of bag of words (1 means true)
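- `seed` sets the seed for the random number generator. Note that a seed alone does not guarantee identical results across runs: per the gensim documentation, a fully reproducible run also requires a single worker thread (`workers=1`) and, across interpreter launches, a fixed `PYTHONHASHSEED`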
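
As an illustration of what `min_count` does, the sketch below counts tokens across a few made-up sentences and keeps only those that occur at least `min_count` times. The `build_vocab` helper is hypothetical and only mirrors the filtering idea; gensim builds its vocabulary internally.

```python
from collections import Counter

def build_vocab(sentences, min_count):
    """Keep only tokens that occur at least min_count times across all sentences."""
    counts = Counter(token for sentence in sentences for token in sentence)
    return {token for token, count in counts.items() if count >= min_count}

toy_sentences = [["good", "movie"], ["good", "plot"], ["bad", "movie"]]  # made-up data

vocab = build_vocab(toy_sentences, min_count=2)
print(sorted(vocab))
```

Words that fall below the threshold get no embedding at all, which is why later code has to check whether a word is in the model's vocabulary before looking it up.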

Execute the code cells below.

In [None]:
# Train the Word2vec model - this might take a bit to run
model = Word2Vec(sentences=sentences_text, vector_size=100, window=5, min_count=5, sg=1, seed=42)

### What properties do we observe about these word embeddings?

Let's find the top ten closest words in the embedding space to the words `enjoyable`, `boring`, and `loved` using cosine similarity (which ranges from -1 to 1, with 1 meaning the two vectors point in exactly the same direction).
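
For reference, cosine similarity is the dot product of two vectors divided by the product of their lengths. Here is a minimal sketch on made-up two-dimensional vectors: vectors pointing the same way score close to 1, orthogonal vectors score 0, and vectors pointing in opposite directions score close to -1.

```python
import math

def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their Euclidean norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 3.0])
opposite = cosine_similarity([1.0, 2.0], [-1.0, -2.0])

print(same_direction, orthogonal, opposite)
```

Because the vectors are divided by their lengths, only the direction matters: `[1.0, 2.0]` and `[2.0, 4.0]` count as an exact match even though one is twice as long.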

Execute the code cells below.


In [None]:
# Top ten closest words to "enjoyable" in the embedding space
model.wv.most_similar("enjoyable")

[('entertaining', 0.88079833984375),
 ('watchable', 0.8228954076766968),
 ('uplifting', 0.7767974734306335),
 ('accessible', 0.7718669772148132),
 ('undemanding', 0.7685814499855042),
 ('engrossing', 0.7664482593536377),
 ('heartwarming', 0.7610951662063599),
 ('addictive', 0.7598646283149719),
 ('rewarding', 0.756846010684967),
 ('satisfying', 0.7504492998123169)]

In [None]:
# Top ten closest words to "boring" in the embedding space
model.wv.most_similar("boring")

[('dull', 0.8673138618469238),
 ('tedious', 0.8511406183242798),
 ('pointless', 0.8353320956230164),
 ('unoriginal', 0.7825552821159363),
 ('uninteresting', 0.7795570492744446),
 ('preachy', 0.7795037031173706),
 ('confusing', 0.7696241140365601),
 ('predictable', 0.7688359022140503),
 ('unbelievable', 0.7660384178161621),
 ('illogical', 0.7642027735710144)]

In [None]:
# Top ten closest words to "loved" in the embedding space
model.wv.most_similar("loved")

[('hated', 0.7692369818687439),
 ('liked', 0.748568594455719),
 ('enjoyed', 0.7188347578048706),
 ('adored', 0.7025707960128784),
 ('disliked', 0.6714362502098083),
 ('hate', 0.669241726398468),
 ('adore', 0.6677272319793701),
 ('listened', 0.6588900685310364),
 ('preferred', 0.6512401700019836),
 ('downloaded', 0.650779664516449)]

Word embeddings encode the meaning of words such that words closer to each other in the vector space are expected to be used in similar contexts, and sometimes contain similar meanings.

However, as you may observe in the example of `loved`, there is no guarantee that just because words occur in similar contexts that they are actually synonyms. In the case of movie reviews, it actually makes a lot of sense that the words `loved` and `hated` are used in similar ways (`I hated that movie`, `I loved that movie`). If we trained Word2Vec on Shakespearean literature instead, we would most likely see different results, as the context is very different.

In [None]:
# Top ten closest words to "dumb" + "corny" in the embedding space
model.wv.most_similar(['dumb', 'corny'])

[('silly', 0.8656347393989563),
 ('stupid', 0.8524393439292908),
 ('goofy', 0.8368464708328247),
 ('lame', 0.8202411532402039),
 ('cheezy', 0.801953136920929),
 ('unoriginal', 0.8005430102348328),
 ('daft', 0.7996158599853516),
 ('unrealistic', 0.7986085414886475),
 ('unbelievable', 0.7957474589347839),
 ('distasteful', 0.79521644115448)]

In [None]:
# Top ten closest words to "corny" - "dumb" in the embedding space
model.wv.most_similar(positive=['corny'], negative=['dumb'])

[('narration', 0.4795984923839569),
 ('editing', 0.47862666845321655),
 ('lighting', 0.4759976267814636),
 ('camera-work', 0.4463352560997009),
 ('cinematography', 0.44403335452079773),
 ('pacing', 0.4364607632160187),
 ('images', 0.4343623220920563),
 ('soundtrack', 0.4319990873336792),
 ('storytelling', 0.42912930250167847),
 ('photography', 0.4263083040714264)]

You can also form loose analogies by adding and subtracting vectors in the embedding space. In the two examples above, summing the embeddings of `dumb` and `corny` returns other adjectives that seem related to both words. If we instead take the embedding of `corny` and subtract `dumb` from it, we get a completely different set of words.
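
The sketch below mimics this add-and-subtract lookup on made-up two-dimensional vectors: it combines the unit-length vectors of the positive and negative words into a query vector, then ranks the remaining words by cosine similarity to that query. The `toy_embeddings` values and the `most_similar` helper here are assumptions for illustration, not gensim's implementation or the trained model's vectors.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine(a, b):
    return sum(x * y for x, y in zip(normalize(a), normalize(b)))

def most_similar(embeddings, positive, negative=(), topn=3):
    """Rank words by cosine similarity to (sum of positives - sum of negatives)."""
    dim = len(next(iter(embeddings.values())))
    query = [0.0] * dim
    for word in positive:
        query = [q + x for q, x in zip(query, normalize(embeddings[word]))]
    for word in negative:
        query = [q - x for q, x in zip(query, normalize(embeddings[word]))]
    # The query words themselves are left out of the results
    exclude = set(positive) | set(negative)
    scored = [(w, cosine(query, v)) for w, v in embeddings.items() if w not in exclude]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]

# Made-up 2-D vectors, chosen by hand for illustration only
toy_embeddings = {
    "dumb": [1.0, 0.1],
    "corny": [0.8, 0.6],
    "silly": [0.9, 0.4],
    "editing": [-0.2, 0.9],
    "great": [-1.0, -0.2],
}

added = most_similar(toy_embeddings, positive=["dumb", "corny"])
subtracted = most_similar(toy_embeddings, positive=["corny"], negative=["dumb"])
print(added[0][0], subtracted[0][0])
```

In this toy space the sum lands between `dumb` and `corny`, next to `silly`, while the difference points somewhere else entirely, which mirrors the change in results seen with the trained model.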

#### Conclusion to Word2vec and discussion on how bias in text influences embeddings

Note that Word2vec is not the only way to generate word embeddings, and more recently, larger neural networks have been used to generate potentially richer and higher-dimensional representations of these words.

Additionally, this training dataset only contained ~188k sentences (specifically in the domain of movie reviews), but imagine how a large amount of training data across multiple domains can allow us to gain a more general representation. However, this is where issues of biased and contaminated data arise. If there are many examples of bias and/or bigoted speech in a text dataset, then the embedding space may shift to reflect that bias in its representation.

In [None]:
model.wv.most_similar(positive=['doctor','woman'], negative=['man'])

[('gina', 0.664168119430542),
 ('boyfriend', 0.6634090542793274),
 ('prostitute', 0.6619176864624023),
 ('husband', 0.6538667678833008),
 ('maria', 0.6498344540596008),
 ('hermann', 0.6467203497886658),
 ('client', 0.6465845704078674),
 ('shin-ae', 0.6462011337280273),
 ('loretta', 0.6457286477088928),
 ('priest', 0.6451437473297119)]

In [None]:
model.wv.most_similar(positive=['doctor','man'], negative=['woman'])

[('scientist', 0.6646406054496765),
 ('chief', 0.6150426268577576),
 ('mastermind', 0.6135583519935608),
 ('sheriff', 0.611149787902832),
 ('murderer', 0.6094347238540649),
 ('comrade', 0.6058278679847717),
 ('lawyer', 0.6004095673561096),
 ('goons', 0.5967718958854675),
 ('hit-man', 0.5965445041656494),
 ('cop', 0.5949344038963318)]

In the examples of `doctor + woman - man` shown above, you can see that `prostitute` is one of the closest words to the resulting embedding, compared to `doctor + man - woman`, where words like `scientist` and `sheriff` are some of the closest words to the resulting embedding. One can infer that at least a few examples of movies reviewed in this dataset may have mentions of female doctors in similar contexts as prostitutes, and of male doctors in similar contexts as authoritative figures like scientists and sheriffs. This is biased by the contents of the movies in the dataset in this case. However, you can imagine how different societal biases may manifest in large datasets scraped from the internet. There is a lot of research surrounding how/whether larger datasets reduce different biases or amplify them.


If you're interested, you can read more about these issues in word embeddings in Section 4.3 of [`On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? by Bender et al.`](https://s10251.pcdn.co/pdf/2021-bender-parrots.pdf), published in 2021.

## <a name="ann"> 2. Brief Intermission: Intro to Artificial Neural Networks

Before we continue onwards to train a model to predict the sentiment of a movie review on the word embeddings we just created, we first need to take a quick step back to understand artificial neural networks.

### What are neural networks?

A neural network is a network of small computing units, each of which takes in a vector of input values and produces a single output value. There are many types of neural networks, such as feedforward networks, recurrent networks, convolutional networks, graph networks, etc. We will only focus on feedforward networks in the scope of this notebook.

While a neural network can be trained for many different tasks, we will first focus on classification (predicting the correct label of a given input data). In this case, we are aiming to do binary classification: predicting whether a movie review is positive (1) or negative (0), given the movie review text itself.

First, let's gain an intuition for the basic building blocks of a neural network. Let's start with a simple linear regression.

#### Starting from Linear Regression

The simplest type of neural network we can build is a linear network, which is a network that predicts an output from an input based on a linear relationship.

You may remember learning an equation `y = mx + b` (aka `F(x) = mx + b`) in a math class at some point. `y` represents the output number, given some input `x` and two parameters, `m` and `b`, which control the slope of the line and how high/low the line intersects with the y-axis. This is the gist of linear regression, as pictured below. We'd have to choose the values of `m` and `b` in order to find a best fit with the training data.


<div>
<img src="resources/2.2/linear_regression_diagram.png" width="500"/>
</div>
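
To show what "choosing `m` and `b` to best fit the data" can mean in practice, here is a minimal sketch of the standard least-squares solution for a single input variable. The data points are made up and lie exactly on the line `y = 2x + 1`, so the fit should recover a slope near 2 and an intercept near 1.

```python
def fit_line(xs, ys):
    """Least-squares estimates of slope m and intercept b for y = m*x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: how x and y vary together, divided by how much x varies on its own
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    m = covariance / variance
    # Intercept: shift the line so it passes through the point of means
    b = mean_y - m * mean_x
    return m, b

xs = [0.0, 1.0, 2.0, 3.0]  # made-up inputs
ys = [1.0, 3.0, 5.0, 7.0]  # made-up outputs on the line y = 2x + 1

m, b = fit_line(xs, ys)
print(m, b)
```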

If we were to build a linear network that could learn to predict the sentiment of a movie review, based on its word embedding vector of length 100, it might look something like this. The `X` input is now a vector of length 100, so `m` becomes a set of 100 weights (one per input value), and the network would learn the best fit values of `m` and `b` without us having to manually choose them, in a process called supervised learning.


<div>
<img src="resources/2.2/linear_net_diagram.png" width="500"/>
</div>
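
In code, a linear network over a vector input is a weighted sum plus a bias. The sketch below uses a made-up input of length 4 instead of 100 to keep it readable, and the weights and bias are picked by hand rather than learned.

```python
def linear_predict(x, weights, bias):
    """Weighted sum of the inputs plus a bias: the vector form of y = m*x + b."""
    return sum(w * x_i for w, x_i in zip(weights, x)) + bias

x = [0.5, -1.0, 2.0, 0.0]          # made-up input vector
weights = [1.0, 0.5, -0.25, 3.0]   # one hand-picked weight per input value
bias = 0.1

score = linear_predict(x, weights, bias)
print(score)
```

Training would adjust `weights` and `bias` so that scores like this one line up with the labels in the training data.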


However, there may not be a linear relationship between a word embedding vector and its sentiment, which is why neural networks are much more powerful once we introduce non-linearity.

#### Non-Linearity and Neural Networks

Neural networks are so powerful because of their non-linear activation functions. Each node computes a weighted sum of its inputs and passes the result through an activation function to produce its output. These outputs can then be connected to other nodes in layers. When a network has more than two layers, that's when we start calling it a deep neural network - the word `deep` refers to the fact that there are multiple of these layers stacked together.

The reason non-linear activation functions are so powerful is that it has been shown in proofs (check out the [universal approximation theorem](https://en.wikipedia.org/wiki/Universal_approximation_theorem)) that it is possible for a neural network to represent a very wide variety of relationships between inputs and outputs, as long as the appropriate weights are chosen/learned.
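
A classic small illustration is the XOR function (output 1 when exactly one of two inputs is 1), which no single linear function of the two inputs reproduces. The sketch below uses two hidden nodes with the ReLU activation and hand-picked weights, so nothing here is learned; it only shows that adding a non-linearity makes the relationship representable.

```python
def relu(z):
    """ReLU activation: pass positive values through, clamp negatives to zero."""
    return max(0.0, z)

def tiny_network(x1, x2):
    # Hidden layer: two nodes, each a weighted sum followed by ReLU
    h1 = relu(1.0 * x1 + 1.0 * x2 + 0.0)
    h2 = relu(1.0 * x1 + 1.0 * x2 - 1.0)
    # Output node: a plain weighted sum of the hidden outputs
    return 1.0 * h1 - 2.0 * h2

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print((x1, x2), tiny_network(x1, x2))
```

Without the `relu` calls, the two layers would collapse into a single linear function, which is why the activation function is the ingredient that matters.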

In our case, because we have a fairly simple task (one feature vector per review and only two choices of output - positive or negative), we will stick with a 1-layer neural network, as illustrated below.


<div>
<img src="resources/2.2/simple_nn_diagram.png" width="800"/>
</div>

## <a name="sentiment"> 3. Classifying Movie Review Sentiment

Let's use our Word2vec embeddings to build a shallow 1-layer network to classify a movie review as positive or negative!

### Build a 1-layer neural network to classify whether an IMDB movie review is positive or negative

In [None]:
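# Let's average over the word embeddings in each review to get a single feature vector for each review

def calculate_avg_review_embedding(tokenized_sentences: list) -> np.ndarray:
  vectors = []
  # key_to_index is a dict, so each membership check is O(1) rather than a scan of a list
  word2vec_model_vocab = model.wv.key_to_index
  for word_sentence in tokenized_sentences:
    word_count = 0
    single_sentence_embedding = np.zeros(model.vector_size)
    for word in word_sentence:
      # Skip words that have no embedding (e.g., those seen fewer than min_count times)
      if word in word2vec_model_vocab:
        single_sentence_embedding = single_sentence_embedding + model.wv[word]
        word_count = word_count + 1
    # Average within the sentence; a sentence with no known words stays a zero vector
    if word_count != 0:
      single_sentence_embedding = single_sentence_embedding / word_count
    vectors.append(single_sentence_embedding)
  # Guard against a review with no sentences, which would otherwise divide by zero
  if not vectors:
    return np.zeros(model.vector_size)
  # Average the sentence embeddings into one vector per review
  return sum(vectors) / len(vectors)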

X_train['avg_word_embedding'] = X_train['tokenized_text'].progress_apply(calculate_avg_review_embedding)
X_test['avg_word_embedding'] = X_test['tokenized_text'].progress_apply(calculate_avg_review_embedding)
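
One property of the function above is worth spelling out: it averages within each sentence first and then across sentences, so a one-word sentence counts as much as a long one. The sketch below contrasts that "mean of sentence means" with a flat mean over all words, using made-up two-dimensional vectors (`toy_vectors` and `mean_vector` are hypothetical names for illustration).

```python
def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

# Made-up 2-D embeddings, for illustration only
toy_vectors = {
    "great": [1.0, 0.0],
    "fun": [0.0, 1.0],
    "plot": [1.0, 1.0],
    "wow": [3.0, 3.0],
}
review = [["great", "fun", "plot"], ["wow"]]  # two sentences of unequal length

# Approach used above: average each sentence, then average the sentence vectors
sentence_means = [mean_vector([toy_vectors[w] for w in sentence]) for sentence in review]
mean_of_means = mean_vector(sentence_means)

# Alternative: average every word in the review with equal weight
flat_mean = mean_vector([toy_vectors[w] for sentence in review for w in sentence])

print(mean_of_means, flat_mean)
```

Neither choice is inherently right; the point is only that the two can differ noticeably when sentence lengths vary.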


In [None]:
# Instantiate and train the neural network - this might take a minute to run
clf = MLPClassifier(random_state=42, max_iter=700)
clf.fit(np.stack(X_train['avg_word_embedding'].to_numpy()), Y_train['label'])

#### Evaluate the model

In [None]:
score = clf.score(np.stack(X_test['avg_word_embedding'].to_numpy()), Y_test['label'])
print("Accuracy:", score*100, "%")

Accuracy: 84.10666666666667 %


We care about accuracy, but it's not the whole picture.

Specifically, accuracy doesn't give us a sense of the rate of false positives (a movie review was negative, but the model classified it as positive) or false negatives (a movie review was positive, but the model classified it as negative). This is important to give us a better understanding of the potential pitfalls of a model during its evaluation.

If you imagine that this was a binary classifier trained to detect whether a tumor was cancerous, we'd care a LOT about the false negatives (a tumor is indeed cancerous, but the model classified it as benign), even if the accuracy overall was really high.

Let's inspect the number of false positives and false negatives for this model:

In [None]:
Y_pred = clf.predict(np.stack(X_test['avg_word_embedding'].to_numpy()))
tn, fp, fn, tp = confusion_matrix(Y_test['label'], Y_pred).ravel()

print("# of True negatives (model predicted negative, and review is actually negative): ", tn)
print("# of False positives (model predicted positive, but review is actually negative): ", fp)
print("# of False negatives (model predicted negative, but review is actually positive): " , fn)
print("# of True positives (model predicted positive, and review is actually positive): ", tp)

# of True negatives (model predicted negative, and review is actually negative):  3194
# of False positives (model predicted positive, but review is actually negative):  558
# of False negatives (model predicted negative, but review is actually positive):  634
# of True positives (model predicted positive, and review is actually positive):  3114
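
The four counts above can be turned into rates with a few lines of arithmetic. This sketch plugs in the numbers from the output above; the metric definitions are the standard ones, and the variable names are just illustrative.

```python
# Counts copied from the confusion matrix output above
tn, fp, fn, tp = 3194, 558, 634, 3114

# Share of all reviews that were classified correctly
accuracy = (tp + tn) / (tp + tn + fp + fn)
# Share of truly negative reviews that were wrongly predicted positive
false_positive_rate = fp / (fp + tn)
# Share of truly positive reviews that were wrongly predicted negative
false_negative_rate = fn / (fn + tp)
# Share of positive predictions that were actually positive
precision = tp / (tp + fp)
# Share of truly positive reviews that the model found
recall = tp / (tp + fn)

print(f"Accuracy:            {accuracy:.4f}")
print(f"False positive rate: {false_positive_rate:.4f}")
print(f"False negative rate: {false_negative_rate:.4f}")
print(f"Precision:           {precision:.4f}")
print(f"Recall:              {recall:.4f}")
```

The accuracy computed this way matches the score reported earlier, and the two error rates show that this model misses positive reviews slightly more often than it mislabels negative ones.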


#### How do word embeddings and neural networks relate to transformers (large language models)?

A transformer is a deep learning architecture developed at Google in 2017. This architecture _specifically_ is what underpins the large language models released in the past 5 years. The unique point of a transformer is its multi-head attention mechanism, which allows the network to take in huge amounts of data and attend to different parts of the word sequences, allowing it to form connections between not just adjacent windows of words, but longer range dependencies and patterns across a large sequence of words. We will cover this in more detail in later sections.

A big difference with transformers from what we've seen so far that we'd like to highlight is the difference in task objectives. For the simple neural network we just built, the task was to classify a movie review text as positive or negative (aka binary classification). In contrast, the typical task objective of a transformer consists of predicting the next token in a sequence of text, or "filling in the blank".
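
To make the next-token objective concrete, the sketch below turns one sentence into training examples of the form "context so far, then the token to predict". Splitting on whitespace is a simplifying assumption (real models typically use subword tokenizers), and `next_token_examples` is a hypothetical helper name.

```python
def next_token_examples(tokens):
    """Pair every prefix of the sequence with the token that follows it."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

tokens = "i loved that movie".split()  # whitespace split, for illustration only
examples = next_token_examples(tokens)

for context, target in examples:
    print(context, "->", target)
```

Unlike the sentiment task, no human-written labels are needed here: the text itself supplies the target for every position.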

A second big difference here is that with transformers, the creation of the word embedding happens within the neural network architecture in an "embedding layer". The overall intuition though is the same as what we did in this notebook – the neural networks first convert text into embeddings and then learn an objective over those embeddings.

That's all we'll say about this topic for now, but hopefully this helps you get a sense for how word embeddings can be used to build models, whether they are used to classify the sentiment of text, or for predicting the next words in a sentence.

## <a name="conclusion"> 4. Conclusion
Now you've learned about how to go from representation to a neural network, specifically for learning language! The main thing we hope you took away from this lesson is that word embeddings are used as a representation of words for a variety of tasks that take text as an input. Some examples of these tasks include classifying the sentiment of text (e.g., a movie review, a product review), or even predicting the next word in a sentence.

If you'd like to dive a bit deeper into the topics discussed in this notebook and solidify your understanding with great visualizations, we also highly recommend [this 30 minute video](https://www.youtube.com/watch?v=wjZofJX0v4M&list=PLZHQObOWTQDNU6R1_67000Dx_ZCJB-3pi&index=7) by Grant Sanderson, who runs the YouTube channel 3Blue1Brown.

Next, you will learn about how to train a neural network using PyTorch, along with more of the core concepts in neural networks.


#### Credits
Many of the explanations and inspiration for exercises were taken from these sources:
- [Tensorflow's tutorial on Word2vec](https://www.tensorflow.org/text/tutorials/word2vec)
- [Word2vec Wikipedia](https://en.wikipedia.org/wiki/Word2vec)
- [Speech and Language Processing Chapter 7](https://web.stanford.edu/~jurafsky/slp3/7.pdf)