## Word2Vec (w2v)

Word2Vec is a method for turning words into numbers (vectors) in a way that captures their meaning based on the context in which they appear. Unlike simpler methods like Bag-of-Words (BoW) or TF-IDF, which just count word occurrences, Word2Vec learns a dense vector for each word from the words that tend to surround it. This means that words with similar meanings end up with similar vectors in the Word2Vec model.

A famous example of Word2Vec's power is the analogy:
**king - man + woman ≈ queen**

This shows that Word2Vec captures relationships between words well enough that you can perform basic arithmetic on the vectors to solve analogies: the vector closest to the result of king - man + woman is usually the one for queen.
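
To make the arithmetic concrete, here is a minimal sketch with made-up 2-D vectors. The values in `toy_vectors` are hand-picked assumptions for illustration, not learned embeddings; real Word2Vec vectors have hundreds of dimensions and the match is approximate rather than exact.

```python
import numpy as np

# Hypothetical 2-D embeddings, hand-picked for illustration (not learned).
# Axis 0 loosely encodes "royalty", axis 1 loosely encodes "gender".
toy_vectors = {
    "king":  np.array([0.9, 0.9]),
    "queen": np.array([0.9, -0.9]),
    "man":   np.array([0.1, 0.9]),
    "woman": np.array([0.1, -0.9]),
    "apple": np.array([-0.8, 0.0]),
}

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# king - man + woman
query = toy_vectors["king"] - toy_vectors["man"] + toy_vectors["woman"]

# Rank the remaining words by similarity to the query, excluding the inputs
candidates = {w: v for w, v in toy_vectors.items() if w not in {"king", "man", "woman"}}
best = max(candidates, key=lambda w: cosine(query, candidates[w]))
print(best)
```

With a trained Gensim model, the equivalent query is `most_similar(positive=['king', 'woman'], negative=['man'])`, which also excludes the input words from the results.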

### How Word2Vec Works

Word2Vec works by creating word embeddings, which are n-dimensional vectors representing words, so a single word is represented as a list of n numbers. These vectors are trained with a shallow neural network, using one of two architectures:

- CBOW (Continuous Bag of Words)
- Skip-Gram

In both cases, the network is simple (just two layers) and the goal isn't to classify data but to learn the word embeddings from the training data. Once trained, you throw away the neural network and use the input-to-hidden weight matrix as your word embeddings.


### Continuous Bag of Words (CBOW)

**How CBOW Works:**

In CBOW, the model takes words from the context (the words surrounding a target word) as input and tries to predict the target word.

**Example:** For the sentence:
“I enjoy reading books about history,”
if the context window size is 2, the model will take the words:
["I", "enjoy", "books", "about"]
and try to predict the middle word, which is "reading".

This is repeated for every word in the training data, using the neighboring words to predict the target word. You can set the context window size to decide how many words on each side of the target word to include.
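
The sliding-window step can be sketched in a few lines. The helper name `cbow_pairs` is an assumption made for this illustration; real implementations add refinements (for example, the original Word2Vec code randomly shrinks the window for each position).

```python
def cbow_pairs(tokens, window=2):
    """Return (context_words, target_word) pairs, one per position."""
    pairs = []
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]        # up to `window` words before
        right = tokens[i + 1:i + 1 + window]       # up to `window` words after
        pairs.append((left + right, target))
    return pairs

sentence = "I enjoy reading books about history".split()
pairs = cbow_pairs(sentence, window=2)
for context, target in pairs:
    print(context, "->", target)
```

Words near the start or end of the sentence simply get a shorter context, since there are fewer neighbors on one side.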

**How the Neural Network is Built:**

1. **Vocabulary and One-Hot Encoding:** First, every word in your vocabulary (all unique words in the text) is assigned a one-hot encoded vector. This is a large vector with a length equal to the vocabulary size, with a 1 in the position of the word and 0s everywhere else.
    - Example: If the vocabulary size is 10,000, the word "cat" might be represented as [0, 0, 0, ..., 1, 0, 0] (vector of length 10,000).

2. **Word Embeddings Matrix:** The input layer of the network is the one-hot encoded vectors for the context words. Each of these vectors is multiplied by the input-to-hidden weight matrix, which gives you a much smaller vector called the word embedding (e.g., instead of a 10,000-length vector, you might get a 300-length vector). In CBOW, the embeddings of the context words are then averaged (or summed) to form the hidden layer.

    After training, the weight matrix becomes your word embedding matrix, where each row corresponds to a word’s embedding.

3. **Training Objective:** The network learns by adjusting the weights to maximize the probability of correctly predicting the target word, given the context words. Once trained, you can discard the neural network and keep the learned embeddings.
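
The three steps above can be sketched as a single forward pass in NumPy. This is a simplified illustration under stated assumptions: the names `W_in`, `W_out`, and `cbow_forward` are hypothetical, the weights are random rather than trained, and a full softmax is used for clarity (the original Word2Vec uses hierarchical softmax or negative sampling to make this step cheaper).

```python
import numpy as np

rng = np.random.default_rng(0)

vocab = ["i", "enjoy", "reading", "books", "about", "history"]
word_to_id = {w: i for i, w in enumerate(vocab)}
V, N = len(vocab), 4          # vocabulary size, embedding size (toy values)

W_in = rng.normal(scale=0.1, size=(V, N))    # input-to-hidden weights (the embeddings)
W_out = rng.normal(scale=0.1, size=(N, V))   # hidden-to-output weights

def cbow_forward(context_words, target_word):
    """Return (probabilities over the vocabulary, cross-entropy loss)."""
    context_ids = [word_to_id[w] for w in context_words]
    hidden = W_in[context_ids].mean(axis=0)      # average the context embeddings
    scores = hidden @ W_out                      # one score per vocabulary word
    exp_scores = np.exp(scores - scores.max())   # numerically stable softmax
    probs = exp_scores / exp_scores.sum()
    loss = -np.log(probs[word_to_id[target_word]])
    return probs, loss

probs, loss = cbow_forward(["i", "enjoy", "books", "about"], "reading")
print(probs.round(3), float(loss))
```

Training repeatedly nudges `W_in` and `W_out` so that this loss goes down across all (context, target) pairs; afterwards only `W_in` is kept.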

**Embedding Matrix Example:**

- Suppose you have a vocabulary of 39,783 words.
- You define the hidden layer (embedding size) to be 300.
- After training, you get a 39,783 x 300 matrix, where each word in your vocabulary is represented by a 300-dimensional vector.

When you input a word (like "cat"), it is converted to a one-hot vector, which is then multiplied by the embedding matrix to get its word embedding.
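
Because a one-hot vector has a single 1, multiplying it by the embedding matrix just selects one row. The sketch below checks this on a toy matrix; the vocabulary, the size of 3, and the random values are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(42)

vocab = ["cat", "dog", "fish", "bird", "horse"]   # toy vocabulary
embedding_size = 3                                 # toy size; the text uses 300
embedding_matrix = rng.normal(size=(len(vocab), embedding_size))

word_id = vocab.index("cat")
one_hot = np.zeros(len(vocab))
one_hot[word_id] = 1.0

via_matmul = one_hot @ embedding_matrix   # multiply the one-hot vector by the matrix
via_lookup = embedding_matrix[word_id]    # read the row directly

print(np.allclose(via_matmul, via_lookup))
```

This is why libraries store embeddings as a lookup table and never build the one-hot vectors explicitly.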

**CBOW Summary:**
- Input: Surrounding words (context).
- Goal: Predict the middle word.
- Training Output: A word embeddings matrix.


### Skip-Gram

Skip-Gram is the reverse of CBOW. Instead of using context words to predict a target word, Skip-Gram takes a target word and predicts the context words around it.

**Example:** For the sentence:
“The small kid ate a banana”, with a context window size of 2, you would use the word "ate" as the input, and the targets would be the words ["small", "kid", "a", "banana"].
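
As a rough sketch, the training pairs for Skip-Gram can be generated like this. The helper name `skipgram_pairs` is an assumption for this illustration, and the sentence is kept case-sensitive for simplicity.

```python
def skipgram_pairs(tokens, window=2):
    """Return (input_word, context_word) training pairs."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:                      # skip the input word itself
                pairs.append((center, tokens[j]))
    return pairs

sentence = "The small kid ate a banana".split()
pairs = skipgram_pairs(sentence, window=2)
ate_pairs = [p for p in pairs if p[0] == "ate"]
print(ate_pairs)
```

Each input word produces one training pair per neighbor, so Skip-Gram sees more training examples than CBOW for the same text.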

**How the Neural Network is Built:**

The process is similar to CBOW, but in Skip-Gram:

- The input to the neural network is a single target word.
- The output is the prediction of the surrounding context words.

Again, the neural network is trained to maximize the probability of predicting the correct context words given the target word. Once trained, the word embeddings are stored in the input-to-hidden weight matrix, just like with CBOW.

**Skip-Gram Summary:**

- Input: Target word.
- Goal: Predict surrounding words.
- Training Output: A word embeddings matrix.

### CBOW vs. Skip-Gram: Which One to Use?

- CBOW trains faster and tends to represent frequent words slightly better. It is often the default choice when you have a lot of data.
- Skip-Gram is slower but tends to work better on smaller datasets and when you care about rare words, because it learns better embeddings for less frequent words.


### How to use it for Sentiment Analysis

**Preprocess the Text**:

- Tokenize the text (split it into words).
- Remove stopwords (optional).
- Convert words into embeddings using the trained Word2Vec model.

For example, the sentence "I loved the movie" (with the stopword "the" removed) could become:

- "I" → [0.01, 0.02, -0.01, ...]
- "loved" → [0.3, -0.1, 0.2, ...]
- "movie" → [0.25, -0.15, 0.05, ...]

**Aggregating Embeddings**:

After obtaining the embeddings, the next step is to combine them into a single vector representing the entire sentence. Some possible methods are:

- Averaging (used here): Take the mean of the embeddings.
- Summing: Add the embeddings together.
- Weighted Averaging: Weight words by importance.
- Concatenation: Combine embeddings for the first few words.
- Recurrent Models: Use LSTMs or GRUs to capture sequential information.
- Attention Mechanisms: Focus on important words.

In this notebook, we will keep it simple and use averaging to create a single vector per sentence.
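
Here is a small sketch of the first three aggregation methods. The 3-D vectors reuse the first three numbers of the example embeddings above, and the `weights` are hypothetical importance scores (in practice they might come from TF-IDF).

```python
import numpy as np

# Toy 3-D embeddings for "I loved the movie" (stopword "the" removed)
embeddings = np.array([
    [0.01, 0.02, -0.01],   # "I"
    [0.30, -0.10, 0.20],   # "loved"
    [0.25, -0.15, 0.05],   # "movie"
])

mean_vec = embeddings.mean(axis=0)   # averaging
sum_vec = embeddings.sum(axis=0)     # summing

# Weighted averaging with hypothetical importance weights
weights = np.array([0.1, 0.6, 0.3])
weighted_vec = np.average(embeddings, axis=0, weights=weights)

print(mean_vec, sum_vec, weighted_vec)
```

All three return a vector with the same length as a single word embedding, regardless of how many words the sentence contains.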

**Note**: word2vec is not inherently a method for modeling sentences, only words. So there is no single, official way to use word2vec to represent sentences.

### Implementation in Python

Let's begin by importing the libraries we need. We will use a library called Gensim - you will see why in a bit.

In [60]:
import os
import pandas as pd
import numpy as np
from gensim import models
import gensim.downloader as api

Let's create a sample dataset, just like we did in the other notebooks.

In [61]:
# Sample DataFrame with Netflix app reviews
df = pd.DataFrame({
    'Content_cleaned': [
        'the app is great new features but crashes often',
        'love the app love the content but it crashes',
        'the app crashes too much it is frustrating',
        'the content is great it is easy to use it is great'
    ]
})

Let's visualize it.

In [62]:
print(df.to_string())

                                      Content_cleaned
0     the app is great new features but crashes often
1        love the app love the content but it crashes
2          the app crashes too much it is frustrating
3  the content is great it is easy to use it is great


### Pre-trained vs. Self-trained Word Embeddings

In this notebook, we will explore two methods for generating word embeddings:

- Using a pre-trained model: We will use Google’s pre-trained Word2Vec model, which is widely used and trained on 100 billion words from the Google News dataset. The model generates word vectors of size 300.
- Training our own model: We'll also train our own Word2Vec model from scratch to see how it compares in both vectorizing text and in model performance.

**Pre-trained Word2Vec Model**

Google’s pre-trained Word2Vec model is available via the Gensim library. If the model is not already downloaded, the notebook will automatically download it, save it locally, and load it into memory.

Here’s how we download and load the pre-trained model:

In [63]:
# Define paths for the model
model_dir = 'Pre-trained-models'
model_path_gensim = os.path.join(model_dir, 'GoogleNews-vectors-negative300.kv')

# Ensure the model directory exists
os.makedirs(model_dir, exist_ok=True)

# Download and load the pre-trained model if not already present
if not os.path.isfile(model_path_gensim):
    print("Downloading Google News Word2Vec model...")
    w2v = api.load('word2vec-google-news-300')
    w2v.save(model_path_gensim)
    print("Model downloaded and saved.")
else:
    print("Loading existing model.")
    w2v = models.KeyedVectors.load(model_path_gensim)

Loading existing model.


**Averaging Word Vectors to Represent Text Sequences**

To represent a sequence of text (like a sentence or a document), we look up the word embedding for each individual token (word) and then average these vectors. Each word contributes equally to the final vector, so the result is a blend of all the individual word meanings. Word order is ignored, so this is a rough summary of the text rather than a full representation of its meaning.

The process is implemented in the following function:

In [64]:
def get_average_word2vec(tokens_list, model, vector_size):
    
    # Filter valid tokens and return zero vector if none are found
    valid_tokens = [token for token in tokens_list if token in model]
    if not valid_tokens:
        return np.zeros(vector_size)
    
    # Compute and return the average vector
    word_vectors = np.array([model[token] for token in valid_tokens])
    return np.mean(word_vectors, axis=0)
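
To see how the function treats unknown words, here is a small check that uses a plain dict as a stand-in for the model. The dict `toy_model` and its 2-D vectors are assumptions for illustration; the function is repeated so the snippet runs on its own.

```python
import numpy as np

def get_average_word2vec(tokens_list, model, vector_size):
    # Same logic as the function above, repeated so this snippet is self-contained
    valid_tokens = [token for token in tokens_list if token in model]
    if not valid_tokens:
        return np.zeros(vector_size)
    return np.mean(np.array([model[token] for token in valid_tokens]), axis=0)

# Hypothetical 2-D "model": a dict supports the same `in` and `[]` operations
toy_model = {
    "great": np.array([1.0, 0.0]),
    "app": np.array([0.0, 1.0]),
}

known = get_average_word2vec(["great", "app", "zzz"], toy_model, 2)   # "zzz" is skipped
unknown = get_average_word2vec(["zzz"], toy_model, 2)                # nothing recognised

print(known)
print(unknown)
```

Unknown tokens are silently dropped from the average, and a text with no known tokens becomes an all-zero vector.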

Next, we tokenize our text data and apply the pre-trained model:

In [65]:
# Tokenize the text data
df['tokens'] = df['Content_cleaned'].apply(lambda x: x.split())

# Compute the average Word2Vec for each row using the custom function
vector_size = w2v.vector_size
df['word2vec_pretrained'] = df['tokens'].apply(lambda x: get_average_word2vec(x, w2v, vector_size))

# Display the resulting DataFrame
df.head()

| | Content_cleaned | tokens | word2vec_pretrained |
|---|---|---|---|
| 0 | the app is great new features but crashes often | [the, app, is, great, new, features, but, cras... | [0.056230333, 0.04981825, -0.042060003, 0.0209... |
| 1 | love the app love the content but it crashes | [love, the, app, love, the, content, but, it, ... | [0.072252065, -0.026354684, -0.020263672, 0.04... |
| 2 | the app crashes too much it is frustrating | [the, app, crashes, too, much, it, is, frustra... | [0.1003685, 0.001052618, -0.05606079, 0.036048... |
| 3 | the content is great it is easy to use it is g... | [the, content, is, great, it, is, easy, to, us... | [0.07695146, 0.021551305, 0.054709695, 0.08551... |


As the example below shows, each sequence has been turned into a single vector of length 300.

In [66]:
print(df['word2vec_pretrained'][1].shape)

(300,)


### Training Our Own Word2Vec Models

In this section, we will train our own Continuous Bag of Words (CBOW) and Skip-Gram models using Gensim. For both models, we then average the word vectors to turn each text sequence into a single 300-dimensional vector.

**Code Implementation**

First, let's import the necessary library and define a function to compute the average word vectors:

In [67]:
from gensim.models import Word2Vec

def get_average_word2vec_custom(tokens_list, model, vector_size):
    # Filter tokens that are valid (present in the model)
    valid_tokens = [token for token in tokens_list if token in model.wv]
    if not valid_tokens:
        return np.zeros(vector_size)  # Return a zero vector if no valid tokens are found
    
    # Calculate the average vector of valid tokens
    word_vectors = [model.wv[token] for token in valid_tokens]
    return np.mean(word_vectors, axis=0)

**Model Parameters**

Next, we will define the parameters for our Word2Vec models:

In [68]:
# Set model parameters
vector_size = 300   # Dimension of word vectors
window_size = 5     # Size of the context window
min_count = 1       # Minimum frequency for words to be included in the model

**Training the Models**

Now, let's train the CBOW and Skip-Gram models. The sg parameter determines the model type: sg=0 for CBOW and sg=1 for Skip-Gram.

In [69]:
# Train the CBOW model (sg=0)
cbow_model = Word2Vec(df['tokens'].tolist(), vector_size=vector_size, sg=0, window=window_size, min_count=min_count, workers=1)

# Calculate average word vectors for the CBOW model
df['word2vec_cbow'] = df['tokens'].apply(lambda x: get_average_word2vec_custom(x, cbow_model, vector_size))

In [70]:
# Train the Skip-Gram model (sg=1)
skipgram_model = Word2Vec(df['tokens'].tolist(), vector_size=vector_size, sg=1, window=window_size, min_count=min_count, workers=1)

# Calculate average word vectors for the Skip-Gram model
df['word2vec_skipgram'] = df['tokens'].apply(lambda x: get_average_word2vec_custom(x, skipgram_model, vector_size))

**Summary of Word2Vec Parameters**

- Vector Size: The dimensionality of the word embeddings (300 dimensions).
- Window Size: The size of the context window used to predict words.
- Minimum Count: The minimum frequency a word must have to be considered for training.

**Model Types**

- CBOW: Uses the context (surrounding words) to predict the target word (sg=0).
- Skip-Gram: Uses the target word to predict its surrounding context (sg=1).

**Result**

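After training both models, we can display the first few entries of our DataFrame to see the average word vector representations. Note that the CBOW and Skip-Gram values are tiny and nearly identical here: with only four short sentences, training most likely barely moves the vectors away from their shared random initialization.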

In [71]:
df.head()

| | Content_cleaned | tokens | word2vec_pretrained | word2vec_cbow | word2vec_skipgram |
|---|---|---|---|---|---|
| 0 | the app is great new features but crashes often | [the, app, is, great, new, features, but, cras... | [0.056230333, 0.04981825, -0.042060003, 0.0209... | [-3.9464252e-05, 8.735372e-05, 0.00023468967, ... | [-3.8613955e-05, 8.666539e-05, 0.00023473676, ... |
| 1 | love the app love the content but it crashes | [love, the, app, love, the, content, but, it, ... | [0.072252065, -0.026354684, -0.020263672, 0.04... | [-0.00015348375, -0.00045190606, 0.00050380756... | [-0.00015304553, -0.00045252705, 0.00050321664... |
| 2 | the app crashes too much it is frustrating | [the, app, crashes, too, much, it, is, frustra... | [0.1003685, 0.001052618, -0.05606079, 0.036048... | [-6.593054e-05, 0.0005229764, -0.0010689881, -... | [-6.565961e-05, 0.0005227553, -0.0010691253, -... |
| 3 | the content is great it is easy to use it is g... | [the, content, is, great, it, is, easy, to, us... | [0.07695146, 0.021551305, 0.054709695, 0.08551... | [-1.8200057e-05, 0.0007764301, 0.0007181768, -... | [-1.7555154e-05, 0.00077540375, 0.00071704615,... |


### Applying Word2Vec to the Netflix Dataset

In this section, we apply the same steps to the full Netflix dataset: we vectorize the reviews with the pre-trained model and train new CBOW and Skip-Gram models on them. Let’s start by loading the dataset and processing the text data.

**Loading the dataset**

In [72]:
import pandas as pd

# Read the preprocessed dataset
df = pd.read_csv('../DATASETS/preprocessed_text.csv')

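# Fill any empty text entries that resulted from preprocessing
# (text column only, so other columns such as score are left unchanged)
df['content_cleaned'] = df['content_cleaned'].fillna('')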

**Tokenizing the Text Data**

Next, we will tokenize the cleaned content to prepare it for embedding calculations:

In [73]:
# Tokenize the cleaned text data
df['tokens'] = df['content_cleaned'].apply(lambda x: x.split())

**Calculating Average Word Vectors**

Now, we will compute the average Word2Vec vectors for each row in the dataset using the pre-trained model:

In [74]:
# Compute the average Word2Vec vectors using the pre-trained model
vector_size = w2v.vector_size  # Retrieve vector size from the pre-trained model
df['word2vec_pretrained'] = df['tokens'].apply(lambda x: get_average_word2vec(x, w2v, vector_size))

**Training Custom Word2Vec Models**

Let’s proceed to train our own CBOW and Skip-Gram models using the tokenized data:

In [75]:
# Train the CBOW model
cbow_model = Word2Vec(df['tokens'].tolist(), vector_size=vector_size, sg=0, window=window_size, min_count=min_count, workers=1)

# Calculate average word vectors for the CBOW model
df['word2vec_cbow'] = df['tokens'].apply(lambda x: get_average_word2vec_custom(x, cbow_model, vector_size))

# Train the Skip-Gram model
skipgram_model = Word2Vec(df['tokens'].tolist(), vector_size=vector_size, sg=1, window=window_size, min_count=min_count, workers=1)

# Calculate average word vectors for the Skip-Gram model
df['word2vec_skipgram'] = df['tokens'].apply(lambda x: get_average_word2vec_custom(x, skipgram_model, vector_size))

**Displaying the Results**

After processing, let’s display the first few entries of our DataFrame to observe the results:

In [78]:
# Display the results
df.head()

| | content | score | content_cleaned | tokens | word2vec_pretrained | word2vec_cbow | word2vec_skipgram |
|---|---|---|---|---|---|---|---|
| 0 | Plsssss stoppppp giving screen limit like when... | 2 | plss stopp giving screen limit like when you a... | [plss, stopp, giving, screen, limit, like, whe... | [0.060924955, 0.036983438, 0.052580304, 0.1163... | [0.007439245, -0.1203659, -0.104098774, -0.202... | [0.020204937, 0.15391015, -0.06393335, -0.0217... |
| 1 | Good | 5 | good | [good] | [0.040527344, 0.0625, -0.017456055, 0.07861328... | [0.6157658, -0.33845198, -0.5885259, -0.028577... | [0.070144735, 0.44430047, -0.008070361, 0.1648... |
| 2 | 👍👍 | 5 | thumbs_up | [thumbs_up] | [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... | [-0.12089115, -0.37054986, 0.04350766, -0.1041... | [0.05704466, 0.121295236, -0.053278703, -0.183... |
| 3 | Good | 3 | good | [good] | [0.040527344, 0.0625, -0.017456055, 0.07861328... | [0.6157658, -0.33845198, -0.5885259, -0.028577... | [0.070144735, 0.44430047, -0.008070361, 0.1648... |
| 4 | App is useful to certain phone brand ,,,,it is... | 1 | app is useful to certain phone brand it is not... | [app, is, useful, to, certain, phone, brand, i... | [0.016780308, -0.041579314, 0.043486457, 0.058... | [-0.14768976, -0.20620891, -0.036397066, -0.10... | [0.058957115, 0.17874919, -0.0928607, 0.039189... |


**Understanding the Embeddings**

For example, we can check a specific content entry and its corresponding Word2Vec vector:

In [79]:
# Check the content and corresponding Word2Vec vector
print(df['content_cleaned'][44])
print(df['word2vec_pretrained'][44])

i love to watch the movie s you have on here and the tv shows i love them growing_heart
[-2.08816528e-02  1.15051270e-02  2.57377625e-02  1.76330566e-01
 -6.28890991e-02 -5.34057617e-05  5.42964935e-02 -1.09580994e-01
  1.54647827e-02  7.56883621e-02 -3.56063843e-02 -1.48437500e-01
 -7.52716064e-02 -1.43718719e-02 -1.27414703e-01  8.87413025e-02
  7.78579712e-02  9.60235596e-02  8.70895386e-03 -3.39450836e-02
 -5.30548096e-02  6.39114380e-02  7.28273392e-02  1.10912323e-03
  4.01535034e-02  3.60412598e-02 -5.64794540e-02  4.14695740e-02
  1.10197067e-02 -4.98657227e-02 -4.34494019e-02  7.13858604e-02
 -8.34045410e-02  2.02445984e-02  1.19743347e-02  7.50732422e-03
 -9.05013084e-03  3.79447937e-02  2.40020752e-02  1.17538452e-01
  3.45458984e-02 -8.88824463e-02  1.42479420e-01  2.61783600e-02
 -1.01394653e-02  1.56650543e-02  3.45573425e-02 -4.45880890e-02
  4.49352264e-02 -2.30102539e-02 -2.92930603e-02  5.40466309e-02
  1.87149048e-02  8.07876587e-02  9.26971436e-04 -1.28021240e-02
 ... (output truncated)

While the embedding vectors are not directly interpretable, machine learning algorithms can use them to capture semantic relationships that traditional frequency-based methods miss. Word2Vec embeddings are learned from the contexts in which words appear, which helps to represent textual features better.

### Finding Similar Words

One of the powerful features of Word2Vec is its ability to find similar words based on cosine similarity in the vector space. We can explore this functionality as follows:

In [80]:
# Display similar words for the term "movie" from all models
print("Similar words for 'movie' using pre-trained model:", w2v.most_similar("movie"))
print("Similar words for 'movie' using CBOW model:", cbow_model.wv.most_similar("movie"))
print("Similar words for 'movie' using Skip-Gram model:", skipgram_model.wv.most_similar("movie"))

Similar words for 'movie' using pre-trained model: [('film', 0.8676770329475403), ('movies', 0.8013108372688293), ('films', 0.7363011837005615), ('moive', 0.6830360889434814), ('Movie', 0.6693680286407471), ('horror_flick', 0.6577848792076111), ('sequel', 0.6577793955802917), ('Guy_Ritchie_Revolver', 0.650975227355957), ('romantic_comedy', 0.6413198709487915), ('flick', 0.6321909427642822)]
Similar words for 'movie' using CBOW model: [('film', 0.8702491521835327), ('show', 0.6649054884910583), ('movies', 0.6187590956687927), ('program', 0.607060432434082), ('trailer', 0.594946026802063), ('episode', 0.5799102783203125), ('films', 0.5590505599975586), ('title', 0.5378875732421875), ('genre', 0.5195510983467102), ('category', 0.5140288472175598)]
Similar words for 'movie' using Skip-Gram model: [('film', 0.761932373046875), ('tittle', 0.6838789582252502), ('flim', 0.6558706760406494), ('serie', 0.6544528007507324), ('movies', 0.6517087817192078), ('moive', 0.645378828048706), ('flick', 0 ... (output truncated)

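In this example, we compare the words most similar to "movie" in each model. The pre-trained model, trained on news text, returns general terms and phrases such as "horror_flick" and "romantic_comedy". The models trained on the reviews reflect that corpus instead, with streaming-related words like "show" and "episode" and reviewer misspellings such as "tittle" and "flim".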
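
The ranking itself is just cosine similarity between vectors. The sketch below reproduces the idea on hypothetical 3-D vectors; the `vectors` dict and this `most_similar` helper are stand-ins written for illustration and are not Gensim's implementation.

```python
import numpy as np

# Hypothetical 3-D vectors; a trained model would supply real ones
vectors = {
    "movie": np.array([0.9, 0.1, 0.0]),
    "film": np.array([0.8, 0.2, 0.1]),
    "show": np.array([0.5, 0.5, 0.0]),
    "crash": np.array([0.0, 0.1, 0.9]),
}

def most_similar(word, vectors, topn=3):
    """Rank all other words by cosine similarity to `word`."""
    query = vectors[word] / np.linalg.norm(vectors[word])
    scores = []
    for other, vec in vectors.items():
        if other == word:
            continue                         # never return the query word itself
        scores.append((other, float(query @ (vec / np.linalg.norm(vec)))))
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

ranking = most_similar("movie", vectors)
print(ranking)
```

Cosine similarity compares the direction of two vectors and ignores their length, which is why it is the usual choice for comparing embeddings.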

## Pros and Cons of Word2Vec

Positive:
- **Scalability**: Works well with large datasets and can be easily scaled.
- **Low Dimensionality**: The vectors have far fewer dimensions than those of frequency-based methods (a few hundred instead of one per vocabulary word).
- **Context taken into account**: The vectors are learned from the contexts in which words appear, so they encode semantic similarity.

Negative:
- **Needs a large dataset**: Requires a large amount of training data to produce high-quality embeddings.
- **Out-Of-Vocabulary Issue**: Once again we work with a finite vocabulary, based on the corpus our model was trained on. Words in a new sequence that the model has never seen are ignored when the embedding is produced, possibly losing important information.
- **Polysemy issue**: Can struggle with words with multiple meanings since it produces one vector per word.
- **Fixed window size**: Context is limited to a certain window size around the target word, losing long-range dependencies.
