## Word2Vec (w2v)

Word2Vec is a method for turning words into numerical vectors in a way that captures their meaning based on the contexts in which they appear. Unlike simpler methods such as Bag-of-Words (BoW) or TF-IDF, which only count word occurrences, Word2Vec learns each word's representation from the words that surround it in the training text. As a result, words with similar meanings end up with similar vectors in the Word2Vec model.

A famous example of Word2Vec's power is the analogy:
**king - man + woman ≈ queen**

This shows that the learned vectors encode relationships between words well enough that basic arithmetic on them can solve analogies: the vector closest to the result of the calculation is usually the one for "queen".
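
To make the arithmetic concrete, here is a minimal sketch using hand-made 2-dimensional toy vectors. The values, the `toy_vectors` dictionary and the `solve_analogy` helper are invented purely for illustration; real Word2Vec embeddings are learned from data and have hundreds of dimensions.

```python
import numpy as np

# Hand-made toy vectors (assumption: the two dimensions loosely mean
# "royalty" and "maleness"). These are NOT real Word2Vec outputs.
toy_vectors = {
    "king":  np.array([0.9, 0.9]),
    "queen": np.array([0.9, 0.1]),
    "man":   np.array([0.1, 0.9]),
    "woman": np.array([0.1, 0.1]),
    "apple": np.array([0.5, 0.5]),
}

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def solve_analogy(a, b, c, vectors):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding the inputs."""
    query = vectors[a] - vectors[b] + vectors[c]
    candidates = {w: v for w, v in vectors.items() if w not in (a, b, c)}
    return max(candidates, key=lambda w: cosine(query, candidates[w]))

result = solve_analogy("king", "man", "woman", toy_vectors)
print(result)
```

With a real model loaded as a Gensim `KeyedVectors` object, the same kind of query is usually written with the `positive` and `negative` arguments of `most_similar`, for example `most_similar(positive=['king', 'woman'], negative=['man'])`.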

### How Word2Vec Works

Word2Vec works by creating word embeddings, which are n-dimensional vectors representing words. So a single word will be represented as a list of n numbers. These vectors are trained using a neural network with two key architectures:

- CBOW (Continuous Bag of Words)
- Skip-Gram

In both cases, the network is simple (just two layers) and the goal isn't to classify data but to learn the word embeddings from the training data. Once trained, you throw away the neural network and use the input-to-hidden weight matrix as your word embeddings.


### Continuous Bag of Words (CBOW)

**How CBOW Works:**

In CBOW, the model takes words from the context (the words surrounding a target word) as input and tries to predict the target word.

**Example:** For the sentence:
“I enjoy reading books about history,”
if the context window size is 2, the model will take the words:
["I", "enjoy", "books", "about"]
and try to predict the middle word, which is "reading".

This is repeated for every word in the training data, using the neighboring words to predict the target word. You can set the context window size to decide how many words on each side of the target word to include.
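
The sliding-window idea above can be sketched in a few lines of plain Python. The function name `cbow_pairs` is hypothetical and only illustrates how (context, target) training pairs could be built. It is not how Gensim implements it internally.

```python
def cbow_pairs(tokens, window=2):
    """Return a list of (context_words, target_word) pairs for CBOW-style training."""
    pairs = []
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]      # up to `window` words before the target
        right = tokens[i + 1:i + 1 + window]     # up to `window` words after the target
        pairs.append((left + right, target))
    return pairs

sentence = "I enjoy reading books about history".split()
pairs = cbow_pairs(sentence, window=2)
for context, target in pairs:
    print(context, "->", target)
```

Words near the start or end of the sentence simply get a shorter context, since there are fewer neighbors on one side.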

**How the Neural Network is Built:**

1. **Vocabulary and One-Hot Encoding:** First, every word in your vocabulary (all unique words in the text) is assigned a one-hot encoded vector. This is a large vector with a length equal to the vocabulary size, with a 1 in the position of the word and 0s everywhere else.
    - Example: If the vocabulary size is 10,000, the word "cat" might be represented as [0, 0, 0, ..., 1, 0, 0].

2. **Word Embeddings Matrix:** The input layer of the network is the one-hot encoded vectors for the context words. Each of these vectors is multiplied by the input-to-hidden weight matrix, which gives you a much smaller vector called the word embedding (e.g., instead of a 10,000-length vector, you might get a 300-length vector). In CBOW, the embeddings of the context words are then averaged to form the hidden layer.

    After training, the weight matrix becomes your word embedding matrix, where each row corresponds to a word’s embedding.

3. **Training Objective:** The network learns by adjusting the weights to maximize the probability of correctly predicting the target word, given the context words. Once trained, you can discard the neural network and keep the learned embeddings.
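
The three steps above can be sketched as a single forward pass in NumPy. This is a simplified illustration under stated assumptions: the weights are random (untrained), the tiny vocabulary and the names `W_in`, `W_out` and `cbow_forward` are made up for the example, and a plain softmax over the whole vocabulary is used for clarity.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["i", "enjoy", "reading", "books", "about", "history"]
word_to_idx = {w: i for i, w in enumerate(vocab)}
V, N = len(vocab), 4  # vocabulary size and embedding size (tiny, for illustration)

W_in = rng.normal(scale=0.1, size=(V, N))   # input-to-hidden weights (the embeddings)
W_out = rng.normal(scale=0.1, size=(N, V))  # hidden-to-output weights

def cbow_forward(context_words):
    """Return a probability for every vocabulary word given the context words."""
    idx = [word_to_idx[w] for w in context_words]
    hidden = W_in[idx].mean(axis=0)        # average the context embeddings
    scores = hidden @ W_out                # one score per vocabulary word
    exp = np.exp(scores - scores.max())    # numerically stable softmax
    return exp / exp.sum()

probs = cbow_forward(["i", "enjoy", "books", "about"])
# Training would adjust W_in and W_out to make this loss smaller
loss = -np.log(probs[word_to_idx["reading"]])
print(probs.round(3), loss)
```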

**Embedding Matrix Example:**

- Suppose you have a vocabulary of 39,783 words.
- You define the hidden layer (embedding size) to be 300.
- After training, you get a 39,783 x 300 matrix, where each word in your vocabulary is represented by a 300-dimensional vector.

When you input a word (like "cat"), it is converted to a one-hot vector, which is then multiplied by the embedding matrix to get its word embedding.
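
Multiplying a one-hot vector by the embedding matrix simply selects one row of that matrix, which is why implementations can use a direct row lookup instead. A minimal sketch, with a random matrix standing in for trained weights and a made-up three-word vocabulary:

```python
import numpy as np

vocab = ["cat", "dog", "movie"]   # toy vocabulary (assumption)
embedding_size = 4
rng = np.random.default_rng(42)
# Random stand-in for a trained (vocabulary size x embedding size) matrix
embedding_matrix = rng.normal(size=(len(vocab), embedding_size))

cat_index = vocab.index("cat")
one_hot = np.zeros(len(vocab))
one_hot[cat_index] = 1.0

via_matmul = one_hot @ embedding_matrix      # one-hot vector times matrix
via_lookup = embedding_matrix[cat_index]     # direct row lookup

print(np.allclose(via_matmul, via_lookup))
```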

**CBOW Summary:**
- Input: Surrounding words (context).
- Goal: Predict the middle word.
- Training Output: A word embeddings matrix.


### Skip-Gram

Skip-Gram is the reverse of CBOW. Instead of using context words to predict a target word, Skip-Gram takes a target word and predicts the context words around it.

**Example:** For the sentence:
“The small kid ate a banana”, with a context window size of 2, you would use the word "ate" as the input, and the targets would be the words ["small", "kid", "a", "banana"].
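
As a rough sketch of how such training pairs could be generated, the hypothetical helper `skipgram_pairs` below emits one (input word, context word) pair for every neighbor inside the window. It illustrates the idea only and is not Gensim's internal implementation.

```python
def skipgram_pairs(tokens, window=2):
    """Return (center_word, context_word) pairs for Skip-Gram-style training."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        stop = min(len(tokens), i + window + 1)
        for j in range(start, stop):
            if j != i:                      # skip the center word itself
                pairs.append((center, tokens[j]))
    return pairs

sentence = "The small kid ate a banana".split()
ate_pairs = [p for p in skipgram_pairs(sentence, window=2) if p[0] == "ate"]
print(ate_pairs)
```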

**How the Neural Network is Built:**

The process is similar to CBOW, but in Skip-Gram:

- The input to the neural network is a single target word.
- The output is the prediction of the surrounding context words.

Again, the neural network is trained to maximize the probability of predicting the correct context words given the target word. Once trained, the word embeddings are stored in the input-to-hidden weight matrix, just like with CBOW.
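
Written out, the objective described above can be sketched as follows. The notation is introduced here for illustration and follows the usual Word2Vec formulation: $T$ is the number of words in the training text, $c$ the window size, $v_w$ the input vector of word $w$, $v'_w$ its output vector, and $V$ the vocabulary size.

```latex
\frac{1}{T}\sum_{t=1}^{T}\;\sum_{\substack{-c \le j \le c \\ j \ne 0}} \log p(w_{t+j} \mid w_t),
\qquad
p(w_O \mid w_I) = \frac{\exp\!\left({v'_{w_O}}^{\top} v_{w_I}\right)}{\sum_{w=1}^{V} \exp\!\left({v'_{w}}^{\top} v_{w_I}\right)}
```

Because the sum in the denominator runs over the whole vocabulary, practical implementations usually approximate it with hierarchical softmax or negative sampling.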

**Skip-Gram Summary:**

- Input: Target word.
- Goal: Predict surrounding words.
- Training Output: A word embeddings matrix.

### CBOW vs. Skip-Gram: Which One to Use?

- CBOW is faster and works better on large datasets. It’s good at predicting common words and is often the default choice when you have a lot of data.
- Skip-Gram is slower but better for smaller datasets and works well when you're focusing on rare words. It does a better job learning the embeddings for less frequent words.


### How to use it

Preprocess the Text:

- Tokenize the text (split it into words).
- Remove stopwords (optional).
- Convert words into embeddings using the trained Word2Vec model.

For example, the sentence "I loved the movie" (with the stopword "the" removed) could become:

- "I" → [0.01, 0.02, -0.01, ...]
- "loved" → [0.3, -0.1, 0.2, ...]
- "movie" → [0.25, -0.15, 0.05, ...]

Aggregate the Embeddings:

Since the sentence has multiple words, you need to combine their embeddings into a single vector.
Common ways to do this are:
- Average: Take the average of the embeddings.
- Sum: Sum the embeddings.
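
Both aggregation options can be sketched with NumPy. The 3-dimensional vectors below are made-up values that reuse the first numbers of the example above. In practice the vectors would come from a trained model.

```python
import numpy as np

# Made-up 3-dimensional embeddings (assumption: real ones come from a trained model)
embeddings = {
    "i":     np.array([0.01, 0.02, -0.01]),
    "loved": np.array([0.30, -0.10, 0.20]),
    "movie": np.array([0.25, -0.15, 0.05]),
}

tokens = ["i", "loved", "the", "movie"]
# Words without an embedding (here the stopword "the") are skipped
vectors = np.array([embeddings[t] for t in tokens if t in embeddings])

sentence_avg = vectors.mean(axis=0)  # average of the word vectors
sentence_sum = vectors.sum(axis=0)   # sum of the word vectors

print(sentence_avg)
print(sentence_sum)
```

Averaging keeps the sentence vector on the same scale regardless of sentence length, whereas the sum grows with the number of words.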

### Implementation in Python

Let's begin by importing the libraries we need. We will use a library called Gensim...

In [1]:
import pandas as pd
import numpy as np
from gensim import models


Let's create a sample dataset, just like we did previously.

In [13]:
# Sample DataFrame with Netflix app reviews
df = pd.DataFrame({
    'Content_cleaned': [
        'the app is great new features but crashes often',
        'love the app love the content but it crashes',
        'the app crashes too much it is frustrating',
        'the content is great it is easy to use it is great'
    ]
})

Let's visualize it.

In [14]:
print(df.to_string())

                                      Content_cleaned
0     the app is great new features but crashes often
1        love the app love the content but it crashes
2          the app crashes too much it is frustrating
3  the content is great it is easy to use it is great


We can either use pretrained word embeddings or train a new model ourselves. The most commonly used pretrained embeddings are the ones provided by Google. In this notebook we will try both options and see how they compare, both in vectorizing and later in our models.

Here is the code for applying the pretrained embeddings provided by Google, which have a vector size of 300. They were trained on roughly 100 billion words from the Google News dataset.

In [17]:
w2v = models.KeyedVectors.load_word2vec_format(
    '../../GoogleNews-vectors-negative300.bin', binary=True)

In [20]:
def get_average_word2vec(tokens_list, model, vector_size):
    """
    This function computes the average Word2Vec for a given list of tokens.
    """
    # Filter the tokens that are present in the Word2Vec model
    valid_tokens = [token for token in tokens_list if token in model]
    if not valid_tokens:
        return np.zeros(vector_size)
    
    # Compute the average Word2Vec
    word_vectors = [model[token] for token in valid_tokens]
    average_vector = np.mean(word_vectors, axis=0)
    return average_vector

In [None]:
# Tokenize the text data
df['tokens'] = df['Content_cleaned'].apply(lambda x: x.split())

# Compute the average Word2Vec for each row
vector_size = w2v.vector_size
df['word2vec_pretrained'] = df['tokens'].apply(lambda x: get_average_word2vec(x, w2v, vector_size))

df.head()

| | Content_cleaned | tokens | word2vec_pretrained |
|---|---|---|---|
| 0 | the app is great new features but crashes often | [the, app, is, great, new, features, but, cras... | [0.056230333, 0.04981825, -0.042060003, 0.0209... |
| 1 | love the app love the content but it crashes | [love, the, app, love, the, content, but, it, ... | [0.072252065, -0.026354684, -0.020263672, 0.04... |
| 2 | the app crashes too much it is frustrating | [the, app, crashes, too, much, it, is, frustra... | [0.1003685, 0.001052618, -0.05606079, 0.036048... |
| 3 | the content is great it is easy to use it is g... | [the, content, is, great, it, is, easy, to, us... | [0.07695146, 0.021551305, 0.054709695, 0.08551... |

When calculating the vector of a sequence, we compute the word embedding of every token separately and then average them to get the sequence vector. This way each word contributes to the final vector, which represents the combined meaning of the individual words and so captures the overall semantics of the text.

As we see in the example below, each of our sequences has been turned into a vector of size 1x300.

In [26]:
print(df['word2vec_pretrained'][1])

[ 0.07225206 -0.02635468 -0.02026367  0.04443359 -0.10232883  0.01063368
  0.11337619 -0.06597222  0.08262804  0.09733073 -0.08997938 -0.10812717
 -0.04382324 -0.08186849 -0.13113064  0.0128852   0.06037055  0.08719889
  0.04602729 -0.13327366  0.05490451  0.06732856  0.01048448 -0.06485494
  0.08699545  0.00037977 -0.03846571  0.1333279   0.01371596 -0.13808864
 -0.11214015  0.00744629 -0.00478787  0.02857802  0.05989583 -0.01656087
  0.0215861  -0.01957533  0.04159885  0.1653239   0.07190619 -0.09299045
  0.10758463  0.01479085 -0.05858358 -0.10636054 -0.01426358 -0.01078966
  0.10125054  0.02113173 -0.02650282  0.04202949 -0.02031793 -0.04790582
  0.056722    0.07293023 -0.03001404 -0.05926514 -0.05763075 -0.01670498
  0.09899902  0.07141113 -0.06174045 -0.04302301 -0.06892904  0.0373603
 -0.07122888 -0.01359389 -0.07541911  0.0830485   0.05344645 -0.03997125
 -0.02484364 -0.04001193 -0.12537977 -0.10517035  0.05368381  0.05932617
  0.03462728  0.13919067 -0.0236952  -0.01462131  0.

Below is the code for training our own CBOW and Skip-Gram models. As in the pretrained case, each of our sequences is turned into a 1x300 vector.

In [23]:
import multiprocessing

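def get_average_word2vec2(tokens_list, model, vector_size):
    """
    Same as get_average_word2vec, but for a trained gensim Word2Vec model,
    whose word vectors are accessed through model.wv.
    """
    # Keep only the tokens that are in the model's vocabulary
    valid_tokens = [token for token in tokens_list if token in model.wv]
    if not valid_tokens:
        return np.zeros(vector_size)
    # Average the word vectors of the remaining tokens
    word_vectors = [model.wv[token] for token in valid_tokens]
    average_vector = np.mean(word_vectors, axis=0)
    return average_vector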

# Define model parameters
vector_size = 300   # Dimensionality of the word vectors
window_size = 5     # Context window size
min_count = 1       # Minimum word frequency
workers = multiprocessing.cpu_count()  # Number of worker threads to use

# Train the Word2Vec model
cbow = models.Word2Vec(df['tokens'].tolist(), vector_size=vector_size, sg=0, window=window_size, min_count=min_count, workers=workers)

df['word2vec_cbow'] = df['tokens'].apply(lambda x: get_average_word2vec2(x, cbow, vector_size))

df.head()

| | Content_cleaned | tokens | word2vec_pretrained | word2vec_cbow |
|---|---|---|---|---|
| 0 | the app is great new features but crashes often | [the, app, is, great, new, features, but, cras... | [0.056230333, 0.04981825, -0.042060003, 0.0209... | [-3.9464252e-05, 8.735372e-05, 0.00023468967, ... |
| 1 | love the app love the content but it crashes | [love, the, app, love, the, content, but, it, ... | [0.072252065, -0.026354684, -0.020263672, 0.04... | [-0.00015348375, -0.00045190606, 0.00050380756... |
| 2 | the app crashes too much it is frustrating | [the, app, crashes, too, much, it, is, frustra... | [0.1003685, 0.001052618, -0.05606079, 0.036048... | [-6.593054e-05, 0.0005229764, -0.0010689881, -... |
| 3 | the content is great it is easy to use it is g... | [the, content, is, great, it, is, easy, to, us... | [0.07695146, 0.021551305, 0.054709695, 0.08551... | [-1.8200057e-05, 0.0007764301, 0.0007181768, -... |


In [24]:
# Train the Word2Vec model
skipgram = models.Word2Vec(df['tokens'].tolist(), vector_size=vector_size, sg=1, window=window_size, min_count=min_count, workers=workers)

df['word2vec_skipgram'] = df['tokens'].apply(lambda x: get_average_word2vec2(x, skipgram, vector_size))

df.head()

| | Content_cleaned | tokens | word2vec_pretrained | word2vec_cbow | word2vec_skipgram |
|---|---|---|---|---|---|
| 0 | the app is great new features but crashes often | [the, app, is, great, new, features, but, cras... | [0.056230333, 0.04981825, -0.042060003, 0.0209... | [-3.9464252e-05, 8.735372e-05, 0.00023468967, ... | [-3.8613955e-05, 8.666539e-05, 0.00023473676, ... |
| 1 | love the app love the content but it crashes | [love, the, app, love, the, content, but, it, ... | [0.072252065, -0.026354684, -0.020263672, 0.04... | [-0.00015348375, -0.00045190606, 0.00050380756... | [-0.00015304553, -0.00045252705, 0.00050321664... |
| 2 | the app crashes too much it is frustrating | [the, app, crashes, too, much, it, is, frustra... | [0.1003685, 0.001052618, -0.05606079, 0.036048... | [-6.593054e-05, 0.0005229764, -0.0010689881, -... | [-6.565961e-05, 0.0005227553, -0.0010691253, -... |
| 3 | the content is great it is easy to use it is g... | [the, content, is, great, it, is, easy, to, us... | [0.07695146, 0.021551305, 0.054709695, 0.08551... | [-1.8200057e-05, 0.0007764301, 0.0007181768, -... | [-1.7555154e-05, 0.00077540375, 0.00071704615,... |


Let's now apply what we have learned to the Netflix dataset.

In [36]:
# Read the dataset
df = pd.read_csv('../DATASETS/preprocessed_text.csv')

# Filling empty text that occurred after text preprocessing
df.fillna('', inplace=True)

# Tokenize the text data
df['tokens'] = df['content_cleaned'].apply(lambda x: x.split())

# Compute the average Word2Vec for each row
vector_size = w2v.vector_size
df['word2vec_pretrained'] = df['tokens'].apply(lambda x: get_average_word2vec(x, w2v, vector_size))


# Train the Word2Vec model
cbow = models.Word2Vec(df['tokens'].tolist(), vector_size=vector_size, sg=0, window=window_size, min_count=min_count, workers=workers)

df['word2vec_cbow'] = df['tokens'].apply(lambda x: get_average_word2vec2(x, cbow, vector_size))

# Train the Word2Vec model
skipgram = models.Word2Vec(df['tokens'].tolist(), vector_size=vector_size, sg=1, window=window_size, min_count=min_count, workers=workers)

df['word2vec_skipgram'] = df['tokens'].apply(lambda x: get_average_word2vec2(x, skipgram, vector_size))

df.head()

| | content | sentiment | content_cleaned | tokens | word2vec_pretrained | word2vec_cbow | word2vec_skipgram |
|---|---|---|---|---|---|---|---|
| 0 | Plsssss stoppppp giving screen limit like when... | negative | plss stopp giving screen limit like when you a... | [plss, stopp, giving, screen, limit, like, whe... | [0.060924955, 0.036983438, 0.052580304, 0.1163... | [-0.12855822, 0.192796, -0.13645074, 0.4576288... | [0.054039676, 0.1618012, -0.11154804, 0.030863... |
| 1 | Good | positive | good | [good] | [0.040527344, 0.0625, -0.017456055, 0.07861328... | [0.33246085, -0.19840702, -0.75562155, 0.06854... | [0.2617957, 0.37434918, -0.0420645, 0.16268118... |
| 2 | 👍👍 | positive | thumbs_up | [thumbs_up] | [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... | [0.23665495, -0.22608165, 0.38516867, 0.079759... | [0.3395284, 0.21369722, 0.19105081, -0.0170758... |
| 3 | Good | neutral | good | [good] | [0.040527344, 0.0625, -0.017456055, 0.07861328... | [0.33246085, -0.19840702, -0.75562155, 0.06854... | [0.2617957, 0.37434918, -0.0420645, 0.16268118... |
| 4 | App is useful to certain phone brand ,,,,it is... | negative | app is useful to certain phone brand it is not... | [app, is, useful, to, certain, phone, brand, i... | [0.016780308, -0.041579314, 0.043486457, 0.058... | [-0.5931904, 0.101889275, -0.09978452, 0.49786... | [0.048506577, 0.2286796, -0.0663114, 0.0801587... |


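The resulting matrix has one row per review, just like with BoW, but each row now has only 300 columns (the embedding size) instead of one column per vocabulary word.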

In [35]:
print(df['content_cleaned'][44])
print(df['word2vec_pretrained'][44])

i love to watch the movie s you have on here and the tv shows i love them growing_heart
[-2.08816528e-02  1.15051270e-02  2.57377625e-02  1.76330566e-01
 -6.28890991e-02 -5.34057617e-05  5.42964935e-02 -1.09580994e-01
  1.54647827e-02  7.56883621e-02 -3.56063843e-02 -1.48437500e-01
 -7.52716064e-02 -1.43718719e-02 -1.27414703e-01  8.87413025e-02
  7.78579712e-02  9.60235596e-02  8.70895386e-03 -3.39450836e-02
 -5.30548096e-02  6.39114380e-02  7.28273392e-02  1.10912323e-03
  4.01535034e-02  3.60412598e-02 -5.64794540e-02  4.14695740e-02
  1.10197067e-02 -4.98657227e-02 -4.34494019e-02  7.13858604e-02
 -8.34045410e-02  2.02445984e-02  1.19743347e-02  7.50732422e-03
 -9.05013084e-03  3.79447937e-02  2.40020752e-02  1.17538452e-01
  3.45458984e-02 -8.88824463e-02  1.42479420e-01  2.61783600e-02
 -1.01394653e-02  1.56650543e-02  3.45573425e-02 -4.45880890e-02
  4.49352264e-02 -2.30102539e-02 -2.92930603e-02  5.40466309e-02
  1.87149048e-02  8.07876587e-02  9.26971436e-04 -1.28021240e-02
 -

We cannot tell what a sequence represents just by looking at its embedding vector. However, machine learning algorithms that take these embeddings as input can typically capture the features of the sequences better than with the frequency-based vectorizing methods, since the embeddings are built in a way that takes context into account.

An interesting feature of Word2Vec models is that, for a given word, they can return the most similar words in their vocabulary. This is done by computing the cosine similarity between the word's vector and the vectors of the other words, and returning the words with the highest scores.
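
Cosine similarity itself is simple to compute. The sketch below uses three made-up 2-dimensional vectors and a hypothetical `most_similar` helper to show the idea. It is a simplified stand-in, not Gensim's implementation.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1 = same direction, 0 = orthogonal)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def most_similar(word, vectors, topn=3):
    """Rank all other words by cosine similarity to `word`, highest first."""
    query = vectors[word]
    scores = [
        (other, cosine_similarity(query, vec))
        for other, vec in vectors.items()
        if other != word
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

# Toy vectors with made-up values, for illustration only
toy = {
    "movie":   np.array([1.0, 0.0]),
    "film":    np.array([0.9, 0.1]),
    "weather": np.array([0.0, 1.0]),
}
neighbours = most_similar("movie", toy, topn=2)
print(neighbours)
```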

In [39]:
print(w2v.most_similar("movie"))

[('film', 0.8676770329475403), ('movies', 0.8013108372688293), ('films', 0.7363011837005615), ('moive', 0.6830360889434814), ('Movie', 0.6693680286407471), ('horror_flick', 0.6577848792076111), ('sequel', 0.6577793955802917), ('Guy_Ritchie_Revolver', 0.650975227355957), ('romantic_comedy', 0.6413198709487915), ('flick', 0.6321909427642822)]


In [37]:
print(cbow.wv.most_similar("movie"))

[('film', 0.8720470070838928), ('show', 0.6472330093383789), ('program', 0.6291745901107788), ('movies', 0.6163832545280457), ('films', 0.579606294631958), ('trailer', 0.5703409910202026), ('episode', 0.5682822465896606), ('programme', 0.5584979057312012), ('scene', 0.5447840690612793), ('title', 0.539584755897522)]


In [38]:
print(skipgram.wv.most_similar("movie"))

[('film', 0.7316080331802368), ('tittle', 0.6571516394615173), ('movies', 0.6406260132789612), ('continuation', 0.6287208199501038), ('moive', 0.626655638217926), ('flim', 0.6253799200057983), ('serie', 0.6253756880760193), ('weather', 0.6172146797180176), ('flick', 0.6149208545684814), ('unpopular', 0.6134818196296692)]


In the example above, we see which words the three models consider most similar to "movie". All three return related words, but CBOW seems to do better than Skip-Gram here: CBOW gives more frequent and correct words, while Skip-Gram returns rarer words, misspellings (e.g., "tittle", "flim") and unrelated words (e.g., "weather"). For this reason we will continue with the pretrained version and CBOW when we deploy our models.


## Pros and Cons of Word2Vec

Positive:
- **Scalability**: Works well with large datasets and can be easily scaled.
- **Low Dimensionality**: The dimensionality is greatly reduced compared to the frequency-based methods.
- **Context taken into account**: The vectors are built from the contexts in which words appear, so they encode semantic information.

Negative:
- **Needs a big dataset**: Requires a large amount of training data to produce high-quality embeddings.
- **Out-Of-Vocabulary Issue**: Once again we work with a finite vocabulary, based on the corpus our model was trained on. Words in a new sequence that are not in this vocabulary are ignored when the embedding is produced, possibly losing important information.
- **Polysemy issue**: Can struggle with words with multiple meanings since it produces one vector per word.
- **Fixed window size**: Context is limited to a fixed window around the target word, so long-range dependencies are lost.
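
To get a feel for the out-of-vocabulary issue, one option is to measure how many tokens of a sequence a model actually knows. The helper `oov_report` below is a hypothetical sketch, and a plain Python set stands in for the model's vocabulary.

```python
def oov_report(tokens, vocabulary):
    """Return (coverage, missing): the share of known tokens and the unknown ones."""
    missing = [t for t in tokens if t not in vocabulary]
    coverage = 1 - len(missing) / len(tokens) if tokens else 0.0
    return coverage, missing

# Stand-in for a model's vocabulary (assumption, for illustration only)
known_words = {"the", "app", "is", "great"}

coverage, missing = oov_report(["the", "app", "is", "thumbs_up"], known_words)
print(coverage, missing)
```

A sequence whose tokens are all missing, such as the single-token review `thumbs_up` with the pretrained embeddings above, ends up as an all-zero vector.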
