<a href="https://colab.research.google.com/github/Neverlost0311/nlp-word-embeddings-lab/blob/main/04-sentiment-classification/lab4_sentiment_classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Lab 4: Sentiment Classification with Embeddings

## Objective
In this lab, we will build a sentiment analysis system using **text embeddings** as features for a machine learning model.

We will:
- Load the IMDB movie reviews dataset and draw a shuffled 1,000-review subset
- Generate embeddings for the reviews with Gemini's `text-embedding-004` model
- Convert sentiment labels (positive/negative) into numeric form (1/0)
- Split the data into training and test sets (80/20)
- Train an XGBoost classifier on top of the embeddings
- Evaluate the model using standard metrics
- Build a prediction function for new reviews

This lab demonstrates how **embeddings can be used as input features for classical machine learning models**.


In [None]:
# ================================
# Cell 2: Install & Import Libraries
# ================================

# Install required libraries (xgboost is needed for the classifier in Cell 10)
!pip install -q google-genai scikit-learn xgboost pandas numpy tqdm

# Core libraries
import numpy as np
import pandas as pd
from tqdm import tqdm

# Machine learning
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.linear_model import LogisticRegression

# Gemini client
from google import genai
import os

print("✅ All libraries installed and imported successfully.")


✅ All libraries installed and imported successfully.


In [None]:
# ================================
# Cell 3: Setup Gemini API Key
# ================================

# Enter your API key securely
from getpass import getpass

API_KEY = getpass("Enter your Gemini API Key: ")

# Set environment variable
os.environ["GEMINI_API_KEY"] = API_KEY

# Create Gemini client
client = genai.Client(api_key=API_KEY)

print("✅ API key loaded and Gemini client created successfully.")


Enter your Gemini API Key: ··········
✅ API key loaded and Gemini client created successfully.
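
Typing the key into `getpass` on every run gets tedious when it is already stored in the environment. The sketch below checks the `GEMINI_API_KEY` variable first and only prompts if it is missing. `load_api_key` and its `prompt` parameter are hypothetical helpers for this lab, not part of the `google-genai` library.

```python
import os
from getpass import getpass


def load_api_key(env_var="GEMINI_API_KEY", prompt=getpass):
    """Return the API key from the environment, prompting only if it is missing.

    Hypothetical helper for this lab; `prompt` can be swapped out for testing.
    """
    key = os.environ.get(env_var, "").strip()
    if not key:
        # Fall back to an interactive prompt that does not echo the input.
        key = prompt(f"Enter your Gemini API Key ({env_var} is not set): ").strip()
    if not key:
        raise ValueError("No API key provided.")
    return key
```

With this in place, the client could be created as `genai.Client(api_key=load_api_key())`.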


In [None]:
# ================================
# Cell 4: Embedding Helper Functions
# ================================

MODEL_NAME = "models/text-embedding-004"

def get_embeddings_batch(texts, batch_size=50):
    """
    Generate embeddings for a list of texts using Gemini in batches.
    Returns a numpy array of shape (len(texts), embedding_dim)
    """
    all_embeddings = []

    for i in range(0, len(texts), batch_size):
        batch = texts[i:i+batch_size]
        print(f"Embedding batch {i} to {i+len(batch)-1}...")

        result = client.models.embed_content(
            model=MODEL_NAME,
            contents=batch
        )

        batch_embeddings = [e.values for e in result.embeddings]
        all_embeddings.extend(batch_embeddings)

    return np.array(all_embeddings)

print("✅ Embedding functions ready.")


✅ Embedding functions ready.
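
Embedding 1,000 reviews in batches of 50 means 20 separate API calls, and any one of them can fail transiently (for example, on a rate limit). The sketch below is one possible retry wrapper with exponential backoff. `with_retries` is a hypothetical helper, and it catches a generic `Exception` because no assumption is made here about the specific error types the client raises.

```python
import time


def with_retries(fn, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call fn() and retry on failure with exponential backoff.

    Hypothetical helper: it catches any Exception because the specific
    error types raised by the embedding client are not assumed here.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception as exc:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            delay = base_delay * 2 ** (attempt - 1)  # 1s, 2s, 4s, ...
            print(f"Attempt {attempt} failed ({exc}); retrying in {delay:.0f}s...")
            sleep(delay)
```

Inside `get_embeddings_batch`, the API call could then be wrapped as `with_retries(lambda: client.models.embed_content(model=MODEL_NAME, contents=batch))`.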


In [11]:
# ================================
# Cell 5: Locate and Extract Dataset ZIP
# ================================

import zipfile
import os

# Use the first ZIP file found in the current directory
zip_path = None
for file in os.listdir("."):
    if file.endswith(".zip"):
        zip_path = file
        break

if zip_path is None:
    raise FileNotFoundError("❌ No ZIP file found in current directory.")
else:
    print("✅ Found ZIP file:", zip_path)

# Extract it
with zipfile.ZipFile(zip_path, 'r') as zip_ref:
    zip_ref.extractall("dataset")

print("✅ ZIP extracted successfully into ./dataset/")


✅ Found ZIP file: 25bf82dd-b16e-4f0f-9ea2-eba1c8eb9828_Code-text_embeddings (1).zip
✅ ZIP extracted successfully into ./dataset/
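
Taking the first `.zip` in the working directory works when only one archive is present. If several are lying around, it can help to look inside an archive before extracting it. The sketch below lists the members that look like the wanted CSV, using the standard library's `zipfile.ZipFile.namelist()`. `find_csv_in_zip` is a hypothetical helper, and the `"IMDB"` keyword mirrors the file-name heuristic used in the next cell.

```python
import zipfile


def find_csv_in_zip(zip_path, keyword="IMDB"):
    """Return archive member names that look like the wanted CSV.

    Hypothetical helper: matches on the file name only, the same
    heuristic the lab applies after extraction.
    """
    with zipfile.ZipFile(zip_path) as archive:
        return [
            name
            for name in archive.namelist()
            # Member names use "/" separators; compare against the base name.
            if name.endswith(".csv") and keyword in name.rsplit("/", 1)[-1]
        ]
```

An empty list means the archive does not contain the dataset, so it can be skipped without extracting anything.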


In [12]:
# ================================
# Cell 6: Find and Load IMDB Dataset
# ================================

import os
import pandas as pd

# Search for the IMDB Dataset CSV inside the dataset folder
imdb_path = None

for root, dirs, files in os.walk("dataset"):
    for file in files:
        if "IMDB" in file and file.endswith(".csv"):
            imdb_path = os.path.join(root, file)
            break
    if imdb_path is not None:
        break  # stop walking once the first match is found

if imdb_path is None:
    raise FileNotFoundError("❌ Could not find IMDB Dataset CSV inside dataset folder.")
else:
    print("✅ Found IMDB dataset at:")
    print(imdb_path)

# Load dataset
df = pd.read_csv(imdb_path)

print("\n✅ Dataset loaded successfully!")
print("Shape:", df.shape)

# Show first 5 rows
df.head()


✅ Found IMDB dataset at:
dataset/Code/code - openai version/data/IMDB Dataset.csv

✅ Dataset loaded successfully!
Shape: (50000, 2)


                                              review sentiment
0  One of the other reviewers has mentioned that ...  positive
1  A wonderful little production. <br /><br />The...  positive
2  I thought this was a wonderful way to spend ti...  positive
3  Basically there's a family where a little boy ...  negative
4  Petter Mattei's "Love in the Time of Money" is...  positive
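
The sample rows show raw `<br />` tags inside the review text, and the notebook embeds the reviews as they are. If you would rather strip that markup before embedding, the sketch below is one way to do it. `clean_review` is a hypothetical helper, and whether cleaning changes the classifier's accuracy is not measured in this lab.

```python
import html
import re

_BR_TAG = re.compile(r"<br\s*/?>", flags=re.IGNORECASE)
_WHITESPACE = re.compile(r"\s+")


def clean_review(text):
    """Replace <br> tags with spaces, unescape HTML entities, collapse whitespace."""
    text = _BR_TAG.sub(" ", text)
    text = html.unescape(text)
    return _WHITESPACE.sub(" ", text).strip()
```

It could be applied to the whole column with `df["review"] = df["review"].map(clean_review)`.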


In [13]:
# ================================
# Cell 7: Prepare Dataset for Training
# ================================

# Map sentiment to numeric labels
df["sentiment_label"] = df["sentiment"].map({"positive": 1, "negative": 0})

# Shuffle dataset
df = df.sample(frac=1, random_state=42).reset_index(drop=True)

# Take only 1000 samples to save API usage
df_small = df.iloc[:1000]

print("✅ Using subset shape:", df_small.shape)
print("Sentiment distribution:")
print(df_small["sentiment_label"].value_counts())

# Extract texts and labels
texts = df_small["review"].tolist()
labels = df_small["sentiment_label"].tolist()

# Show one example
print("\n📝 Sample review (first 500 chars):\n")
print(texts[0][:500], "...")
print("\nLabel:", labels[0])


✅ Using subset shape: (1000, 3)
Sentiment distribution:
sentiment_label
0    524
1    476
Name: count, dtype: int64

📝 Sample review (first 500 chars):

I really liked this Summerslam due to the look of the arena, the curtains and just the look overall was interesting to me for some reason. Anyways, this could have been one of the best Summerslam's ever if the WWF didn't have Lex Luger in the main event against Yokozuna, now for it's time it was ok to have a huge fat man vs a strong man but I'm glad times have changed. It was a terrible main event just like every match Luger is in is terrible. Other matches on the card were Razor Ramon vs Ted Di ...

Label: 1
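
Taking the first 1,000 shuffled rows gives a slightly uneven subset (524 negative, 476 positive). If an exactly balanced subset is preferred, one option is to sample the same number of reviews from each class. The sketch below does that with the standard library. `balanced_subset` is a hypothetical helper, not something the notebook uses.

```python
import random
from collections import defaultdict


def balanced_subset(texts, labels, per_class, seed=42):
    """Sample `per_class` examples from each label, then shuffle the result."""
    by_label = defaultdict(list)
    for text, label in zip(texts, labels):
        by_label[label].append(text)

    rng = random.Random(seed)
    pairs = []
    for label, items in by_label.items():
        if len(items) < per_class:
            raise ValueError(f"Label {label!r} has only {len(items)} examples.")
        pairs.extend((text, label) for text in rng.sample(items, per_class))

    # Shuffle so the two classes are interleaved rather than grouped.
    rng.shuffle(pairs)
    sub_texts, sub_labels = zip(*pairs)
    return list(sub_texts), list(sub_labels)
```

For example, `balanced_subset(df["review"].tolist(), df["sentiment_label"].tolist(), per_class=500)` would return 500 reviews of each class.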


In [14]:
# ================================
# Cell 8: Generate Embeddings for Reviews
# ================================

import numpy as np

print("🚀 Generating embeddings for", len(texts), "reviews...")

# Generate embeddings in batches
X_embeddings = get_embeddings_batch(texts, batch_size=50)

# Convert labels to numpy array
y = np.array(labels)

print("✅ Embeddings generated!")
print("Embedding matrix shape:", X_embeddings.shape)
print("Labels shape:", y.shape)


🚀 Generating embeddings for 1000 reviews...
Embedding batch 0 to 49...
Embedding batch 50 to 99...
Embedding batch 100 to 149...
Embedding batch 150 to 199...
Embedding batch 200 to 249...
Embedding batch 250 to 299...
Embedding batch 300 to 349...
Embedding batch 350 to 399...
Embedding batch 400 to 449...
Embedding batch 450 to 499...
Embedding batch 500 to 549...
Embedding batch 550 to 599...
Embedding batch 600 to 649...
Embedding batch 650 to 699...
Embedding batch 700 to 749...
Embedding batch 750 to 799...
Embedding batch 800 to 849...
Embedding batch 850 to 899...
Embedding batch 900 to 949...
Embedding batch 950 to 999...
✅ Embeddings generated!
Embedding matrix shape: (1000, 768)
Labels shape: (1000,)
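
Every re-run of the embedding cell repeats all 20 API calls. The sketch below is one way to avoid that: it stores each vector in a JSON file keyed by a hash of the review text and only embeds texts it has not seen before. `cached_embeddings` is a hypothetical helper and the cache file name is an arbitrary choice. `embed_fn` stands for any function that takes a list of strings and returns one vector per string, such as `get_embeddings_batch`.

```python
import hashlib
import json
import os


def cached_embeddings(texts, embed_fn, cache_path="embeddings_cache.json"):
    """Embed texts, reusing vectors already stored in a JSON cache file.

    Hypothetical helper; `embed_fn` takes a list of strings and returns
    one vector per string.
    """
    cache = {}
    if os.path.exists(cache_path):
        with open(cache_path, "r", encoding="utf-8") as f:
            cache = json.load(f)

    keys = [hashlib.sha256(t.encode("utf-8")).hexdigest() for t in texts]
    # Deduplicate while preserving order so each new text is embedded once.
    missing = list(dict.fromkeys(k for k in keys if k not in cache))
    if missing:
        text_by_key = dict(zip(keys, texts))
        vectors = embed_fn([text_by_key[k] for k in missing])
        for key, vector in zip(missing, vectors):
            cache[key] = [float(x) for x in vector]
        with open(cache_path, "w", encoding="utf-8") as f:
            json.dump(cache, f)

    return [cache[k] for k in keys]
```

In this notebook the call would look like `X_embeddings = np.array(cached_embeddings(texts, get_embeddings_batch))`.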


In [15]:
# ================================
# Cell 9: Train / Test Split
# ================================

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X_embeddings, y, test_size=0.2, random_state=42
)

print("✅ Train/Test split done!")
print("Training set shape:", X_train.shape)
print("Test set shape:", X_test.shape)
print("Training labels shape:", y_train.shape)
print("Test labels shape:", y_test.shape)


✅ Train/Test split done!
Training set shape: (800, 768)
Test set shape: (200, 768)
Training labels shape: (800,)
Test labels shape: (200,)
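
The split above is not stratified, so the class ratio can differ between the training and test sets. The worked arithmetic below uses only counts the notebook prints: the subset's `value_counts()` (524 negative, 476 positive) and the support column of the classification report in Cell 11 (114 negative, 86 positive). The training counts are derived by subtraction.

```python
def negative_share(n_negative, n_positive):
    """Fraction of examples labelled negative (0)."""
    return n_negative / (n_negative + n_positive)


# Counts printed by the notebook, as (negative, positive).
subset = (524, 476)  # value_counts() on the 1,000-review subset
test = (114, 86)     # support column of the classification report
train = (subset[0] - test[0], subset[1] - test[1])  # remainder: (410, 390)

for name, counts in [("subset", subset), ("train", train), ("test", test)]:
    print(f"{name:>6}: {negative_share(*counts):.1%} negative")
```

The test set is 57% negative while the subset as a whole is 52.4% negative. scikit-learn's `train_test_split` accepts `stratify=y` to keep the ratio the same in both splits. Using it would change which reviews land in the test set, and therefore the metrics reported in Cell 11.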


In [16]:
# ================================
# Cell 10: Train XGBoost Classifier
# ================================

from xgboost import XGBClassifier

model = XGBClassifier(
    n_estimators=200,
    max_depth=6,
    learning_rate=0.1,
    subsample=0.8,
    colsample_bytree=0.8,
    eval_metric="logloss",
    random_state=42
)

print("🚀 Training XGBoost classifier...")

model.fit(X_train, y_train)

print("✅ Model training complete!")


🚀 Training XGBoost classifier...
✅ Model training complete!


In [17]:
# ================================
# Cell 11: Evaluate the Model
# ================================

from sklearn.metrics import classification_report, accuracy_score

# Predict on test data
y_pred = model.predict(X_test)

# Print accuracy
acc = accuracy_score(y_test, y_pred)
print("✅ Test Accuracy:", acc)

# Detailed report
print("\n📊 Classification Report:\n")
print(classification_report(y_test, y_pred, target_names=["Negative", "Positive"]))


✅ Test Accuracy: 0.955

📊 Classification Report:

              precision    recall  f1-score   support

    Negative       0.96      0.96      0.96       114
    Positive       0.95      0.94      0.95        86

    accuracy                           0.95       200
   macro avg       0.95      0.95      0.95       200
weighted avg       0.95      0.95      0.95       200
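
Each row of the report comes from a handful of raw counts. The notebook does not print a confusion matrix, so the counts below are an assumption: they are inferred from the rounded report and the 0.955 accuracy. The sketch shows how precision, recall, and F1 follow from them.

```python
def prf(tp, fp, fn):
    """Precision, recall and F1 for one class from raw counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Assumed counts (inferred, not printed by the notebook):
# 114 negatives -> 110 predicted Negative, 4 predicted Positive
#  86 positives ->  81 predicted Positive, 5 predicted Negative
neg_correct, neg_wrong = 110, 4
pos_correct, pos_wrong = 81, 5

accuracy = (neg_correct + pos_correct) / 200
# A positive review predicted Negative is a false positive for the Negative class.
negative = prf(tp=neg_correct, fp=pos_wrong, fn=neg_wrong)
positive = prf(tp=pos_correct, fp=neg_wrong, fn=pos_wrong)

print(f"accuracy: {accuracy:.3f}")
print("Negative: " + "  ".join(f"{v:.2f}" for v in negative))
print("Positive: " + "  ".join(f"{v:.2f}" for v in positive))
```

Rounded to two decimals, these values match the Negative and Positive rows of the report, and 191 correct predictions out of 200 gives the 0.955 accuracy.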



In [18]:
# ================================
# Cell 12: Predict Sentiment for New Reviews
# ================================

def predict_sentiment(review_text):
    # Generate embedding for the input text
    embedding = get_embeddings_batch([review_text], batch_size=1)

    # Predict using trained model
    pred = model.predict(embedding)[0]
    # predict_proba returns [P(Negative), P(Positive)] for this binary model
    prob = model.predict_proba(embedding)[0]

    label = "Positive" if pred == 1 else "Negative"

    print("📝 Review:")
    print(review_text)
    print("\n🎯 Prediction:", label)
    print("📊 Confidence:", prob)

# Test examples
print("\n=== Test 1 ===")
predict_sentiment("This movie was absolutely fantastic. I loved every minute of it!")

print("\n=== Test 2 ===")
predict_sentiment("Worst movie ever. Total waste of time and money.")

print("\n=== Test 3 ===")
predict_sentiment("The film had good acting but the story was boring and predictable.")



=== Test 1 ===
Embedding batch 0 to 0...
📝 Review:
This movie was absolutely fantastic. I loved every minute of it!

🎯 Prediction: Positive
📊 Confidence: [3.8653612e-04 9.9961346e-01]

=== Test 2 ===
Embedding batch 0 to 0...
📝 Review:
Worst movie ever. Total waste of time and money.

🎯 Prediction: Negative
📊 Confidence: [9.9992335e-01 7.6668890e-05]

=== Test 3 ===
Embedding batch 0 to 0...
📝 Review:
The film had good acting but the story was boring and predictable.

🎯 Prediction: Negative
📊 Confidence: [0.9926853  0.00731467]

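
The "Confidence" line prints the raw probability pair, ordered as `[P(Negative), P(Positive)]`. The sketch below turns such a pair into a single label and confidence, and flags low-confidence cases instead of forcing a label. `summarize_prediction` is a hypothetical helper and the 0.75 cut-off is an arbitrary example value, not a tuned threshold.

```python
def summarize_prediction(proba, labels=("Negative", "Positive"), min_confidence=0.75):
    """Turn a [P(Negative), P(Positive)] pair into a label and a confidence.

    Hypothetical helper; the 0.75 cut-off is an arbitrary example value.
    """
    best = max(range(len(proba)), key=lambda i: proba[i])
    confidence = float(proba[best])
    label = labels[best] if confidence >= min_confidence else "Uncertain"
    return label, confidence


# Probability pair printed for Test 3 above.
label, confidence = summarize_prediction([0.9926853, 0.00731467])
print(f"{label} ({confidence:.1%})")  # prints: Negative (99.3%)
```

Inside `predict_sentiment`, this could replace the separate `model.predict` call by passing `model.predict_proba(embedding)[0]` directly.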