# Training a Word2Vec Model

In this notebook, we train the **Word2Vec** model using the processed dataset.  
Once trained, we will generate **tweet embeddings** and save them for training the LSTM model in the next step.


In [1]:
import pandas as pd
import numpy as np
from gensim.models import Word2Vec

## Loading Tokenized Dataset

We load the tokenized dataset that was saved in the preprocessing step.
Since lists were saved as strings, we convert them back to lists before training the Word2Vec model.


In [2]:
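import ast

data_path = "../data/processed/sentiment140_tokenized.csv"
df = pd.read_csv(data_path)

# Token lists were stored as strings; parse them back into Python lists.
# ast.literal_eval accepts only literals, whereas eval would execute arbitrary code.
df["tokens"] = df["tokens"].apply(ast.literal_eval)

sentences = df["tokens"].tolist()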
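
The string round trip described above can be reproduced on a toy frame. The following is a minimal sketch, assuming a made-up two-row frame and an in-memory buffer in place of the real CSV file; the names `toy`, `buffer`, and `restored` are illustrative only.

```python
import ast
import io

import pandas as pd

# Hypothetical two-row frame standing in for the tokenized dataset.
toy = pd.DataFrame({"tokens": [["good", "day"], ["so", "tired"]]})

# Writing to CSV stores each list as its string representation.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)

restored = pd.read_csv(buffer)
first_raw = restored["tokens"].iloc[0]  # still a plain string at this point

# literal_eval parses Python literals only, so it cannot run arbitrary code.
restored["tokens"] = restored["tokens"].apply(ast.literal_eval)
first_parsed = restored["tokens"].iloc[0]  # a real list again
```

After parsing, each entry is a list of tokens again, which is the input format `Word2Vec` expects for its `sentences` argument.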

## Training the Word2Vec Model

We train a **Word2Vec** model using the tokenized tweets.  
This model learns vector representations of words based on their context in tweets.

In [3]:
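word2vec_model = Word2Vec(
    sentences=sentences,
    vector_size=100,  # dimensionality of each word vector
    window=3,         # maximum distance between the target word and a context word
    min_count=3,      # ignore words that occur fewer than 3 times
    workers=4,        # number of worker threads
    sg=1,             # 1 = skip-gram, 0 = CBOW
    epochs=10         # number of passes over the corpus
)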

## Saving the Word2Vec Model

We save the trained model in the `models/` directory so it can be used later.

In [4]:
word2vec_model_path = "../models/word2vec_model"
word2vec_model.save(word2vec_model_path)


## Generating Tweet Embeddings

Each tweet is converted into a **numerical vector** by averaging the Word2Vec embeddings of its words.
If a tweet has no words in the vocabulary, it will be represented by a zero vector.

In [5]:
def tweet_vector(tokens, model):
    """
    Converts a tweet into a numerical vector by averaging its Word2Vec word embeddings.

    Args:
        tokens (list): List of tokenized words.
        model (Word2Vec): Trained Word2Vec model.

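    Returns:
        np.ndarray: Vector of length model.vector_size representing the tweet.
    """
    # Keep only tokens that made it into the Word2Vec vocabulary
    valid_tokens = [token for token in tokens if token in model.wv]
    if not valid_tokens:
        # Match the float32 dtype of the word vectors
        return np.zeros(model.vector_size, dtype=np.float32)
    return np.mean([model.wv[token] for token in valid_tokens], axis=0)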

# Apply function to generate embeddings
df["vector"] = df["tokens"].apply(lambda x: tweet_vector(x, word2vec_model))

df[["tokens", "vector"]].head()


|   | tokens | vector |
|---|---|---|
| 0 | [switchfoot, bummer, shoulda, david, carr, third] | [-0.048860773, -0.0153961675, 0.16348921, 0.03... |
| 1 | [upset, cant, update, facebook, texting, might... | [-0.18636408, 0.5754361, 0.17716423, 0.0499408... |
| 2 | [kenichan, dived, many, time, ball, managed, s... | [-0.20103584, 0.40490478, 0.022874601, -0.1132... |
| 3 | [whole, body, feel, itchy, fire] | [-0.46060118, 0.21935824, 0.012844193, 0.07566... |
| 4 | [nationwideclass, behaving, mad, cant] | [-0.49469107, 0.49707192, 0.1308029, 0.0195472... |
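
The averaging step can be checked by hand on a small example. The sketch below replaces the trained model with a hypothetical dictionary of 2-dimensional embeddings (`toy_embeddings`) so the arithmetic is easy to follow; `average_vector` is an illustrative stand-in with the same logic as `tweet_vector`, not a replacement for it.

```python
import numpy as np

# Hypothetical 2-dimensional embeddings standing in for model.wv.
toy_embeddings = {
    "good": np.array([1.0, 0.0], dtype=np.float32),
    "day": np.array([0.0, 1.0], dtype=np.float32),
}

def average_vector(tokens, embeddings, dim):
    """Average the vectors of known tokens; return zeros if none are known."""
    known = [embeddings[t] for t in tokens if t in embeddings]
    if not known:
        return np.zeros(dim, dtype=np.float32)
    return np.mean(known, axis=0)

# "xyzzy" is out of vocabulary, so only "good" and "day" are averaged.
in_vocab = average_vector(["good", "day", "xyzzy"], toy_embeddings, dim=2)

# No known tokens at all: the fallback zero vector is returned.
all_unknown = average_vector(["xyzzy"], toy_embeddings, dim=2)
```

Here `in_vocab` is the mean of `[1, 0]` and `[0, 1]`, that is `[0.5, 0.5]`, and `all_unknown` is `[0, 0]`. Note that averaging discards word order, so two tweets with the same words in a different order map to the same vector.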


## Saving the Preprocessed Dataset with Embeddings

We save the dataset with tweet embeddings so it can be used in the next step: training an **LSTM model** for sentiment classification.

In [6]:
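# Convert each vector (NumPy array) to a string of plain Python floats for CSV writing.
# tolist() avoids NumPy scalar reprs such as "np.float32(0.1)" that str(list(v)) emits on NumPy 2.x.
df["vector"] = df["vector"].apply(lambda v: str(v.tolist()))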

# Save the dataset
output_path = "../data/processed/sentiment140_vectors.csv"
df[["sentiment", "vector"]].to_csv(output_path, index=False)

print("Dataset saved")


Dataset saved
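
Because the vectors are stored as strings, the next step has to parse them back into arrays. The following is a minimal sketch of one way to do that, assuming the same `sentiment` and `vector` columns; it uses a made-up two-row frame and an in-memory buffer instead of the real file, and the names `X` and `y` are illustrative.

```python
import ast
import io

import numpy as np
import pandas as pd

# Hypothetical two-row stand-in for sentiment140_vectors.csv.
toy = pd.DataFrame({
    "sentiment": [0, 1],
    "vector": [str([0.1, 0.2, 0.3]), str([0.4, 0.5, 0.6])],
})
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)

loaded = pd.read_csv(buffer)

# Parse each string back into a list of floats, then stack into one matrix.
X = np.array(loaded["vector"].apply(ast.literal_eval).tolist(), dtype=np.float32)
y = loaded["sentiment"].to_numpy()
```

`X` has one row per tweet and one column per embedding dimension. For large datasets, a binary format such as `np.save` would avoid the parsing cost, at the price of no longer being a single CSV file.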
