**🔒 Proprietary & All Rights Reserved**

**© 2025 Sweety Seelam.** This work is proprietary and protected by copyright. All content, models, code, and visuals are © 2025 Sweety Seelam.
No part of this project, app, code, or analysis may be copied, reproduced, distributed, or used for any purpose, commercial or otherwise, without explicit written permission from the author.

-------------

# StreamIntel360: A Multi-Agent RAG Platform for Streaming Content & Revenue Intelligence

-----------

# 03 – Sentiment Model on IMDB Movie Reviews

This notebook trains a simple sentiment classifier using movie reviews.

**Goals:**
- Load a labeled movie review dataset (e.g., IMDB 50k reviews).
- Train a baseline model using TF-IDF + Logistic Regression.
- Evaluate accuracy and basic metrics.
- (Later) Make this model available to StreamIntel360 agents as an additional signal.


In [1]:
# Install dependencies (pandas and joblib are imported in later cells)
%pip install scikit-learn pandas joblib



In [2]:
# Cell 2 – Code: Imports & Paths
import pandas as pd
from pathlib import Path

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report, confusion_matrix

DATA_DIR = Path("..") / "data" / "raw"
IMDB_PATH = DATA_DIR / "IMDB_Dataset.csv"

IMDB_PATH, IMDB_PATH.exists()

(WindowsPath('../data/raw/IMDB_Dataset.csv'), True)

In [3]:
# Cell 3 – Code: Load Dataset
df = pd.read_csv(IMDB_PATH, encoding="latin-1")
df.head()

# Assumes columns: review, sentiment (positive / negative).

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive


In [4]:
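# Cell 4 – Code: Check Balance & Clean
# Note: a notebook displays only the last expression in a cell, so these counts
# are not shown in the output below; wrap the call in print() to inspect them.
df["sentiment"].value_counts()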

df["review"] = df["review"].astype(str)
df["sentiment"] = df["sentiment"].astype(str).str.lower()

df.head()

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive


In [5]:
# Cell 5 – Code: Train/Test Split
X = df["review"]
y = df["sentiment"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

len(X_train), len(X_test)

(40000, 10000)

In [6]:
# Cell 6 – Code: Build Pipeline
pipeline = Pipeline(
    steps=[
        ("tfidf", TfidfVectorizer(
            max_features=50000,
            ngram_range=(1, 2),
            stop_words="english"
        )),
        ("clf", LogisticRegression(
            max_iter=1000,
            n_jobs=-1
        )),
    ]
)

pipeline

In [7]:
# Cell 7 – Code: Train Model
pipeline.fit(X_train, y_train)

In [8]:
# Cell 8 – Code: Evaluate
y_pred = pipeline.predict(X_test)

print(classification_report(y_test, y_pred))

conf_matrix = confusion_matrix(y_test, y_pred)
conf_matrix

              precision    recall  f1-score   support

    negative       0.91      0.89      0.90      5000
    positive       0.89      0.91      0.90      5000

    accuracy                           0.90     10000
   macro avg       0.90      0.90      0.90     10000
weighted avg       0.90      0.90      0.90     10000



array([[4458,  542],
       [ 434, 4566]], dtype=int64)
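
The figures in the classification report can be re-derived from the confusion matrix alone, which is a useful cross-check. The sketch below hard-codes the matrix printed above and relies on scikit-learn's convention that rows are true labels and columns are predicted labels, in sorted label order (negative, positive); the variable names are illustrative.

```python
import numpy as np

# Confusion matrix copied from the output above.
# Rows = true labels, columns = predicted labels, order = (negative, positive).
conf_matrix = np.array([[4458, 542],
                        [434, 4566]])

labels = ["negative", "positive"]
correct = np.diag(conf_matrix)

accuracy = correct.sum() / conf_matrix.sum()
recall = correct / conf_matrix.sum(axis=1)     # divide by true-class totals
precision = correct / conf_matrix.sum(axis=0)  # divide by predicted-class totals
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy: {accuracy:.4f}")
for name, p, r, f in zip(labels, precision, recall, f1):
    print(f"{name:>8}  precision={p:.2f}  recall={r:.2f}  f1={f:.2f}")
```

The off-diagonal entries are the errors: 542 negative reviews predicted as positive and 434 positive reviews predicted as negative.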

In [9]:
# Cell 9 – Code: Try Some Manual Inputs
def predict_sentiment(text: str):
    return pipeline.predict([text])[0]

samples = [
    "This movie was absolutely fantastic, the performances were incredible.",
    "I hated every minute of this show, terrible pacing and bad writing.",
    "It was okay, not great but watchable for a lazy weekend.",
]

for s in samples:
    print(f"Review: {s}")
    print("Predicted sentiment:", predict_sentiment(s))
    print("-" * 60)

Review: This movie was absolutely fantastic, the performances were incredible.
Predicted sentiment: positive
------------------------------------------------------------
Review: I hated every minute of this show, terrible pacing and bad writing.
Predicted sentiment: negative
------------------------------------------------------------
Review: It was okay, not great but watchable for a lazy weekend.
Predicted sentiment: negative
------------------------------------------------------------
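
The lukewarm third review is forced into one of the two classes because `predict` returns only a hard label. One possible refinement, sketched here as an assumption rather than something the notebook does, is to read the class probabilities from `predict_proba` and flag low-confidence predictions. The toy training data, the helper `predict_with_confidence`, and the 0.6 threshold are all hypothetical.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Toy stand-in for the trained IMDB pipeline (made-up data, illustration only).
toy_pipeline = Pipeline(
    steps=[
        ("tfidf", TfidfVectorizer()),
        ("clf", LogisticRegression(max_iter=1000)),
    ]
)
toy_pipeline.fit(
    [
        "wonderful acting and a great story",
        "great film with wonderful music",
        "terrible plot and awful acting",
        "awful pacing and a terrible script",
    ],
    ["positive", "positive", "negative", "negative"],
)


def predict_with_confidence(model, text, threshold=0.6):
    """Return (label, probability); the label is 'uncertain' below the threshold.

    The threshold value is an arbitrary illustrative choice, not a tuned one.
    """
    probs = model.predict_proba([text])[0]
    best = probs.argmax()
    label = str(model.classes_[best])
    confidence = float(probs[best])
    if confidence < threshold:
        label = "uncertain"
    return label, confidence


for sample in ["wonderful great film", "it was okay"]:
    print(sample, "->", predict_with_confidence(toy_pipeline, sample))
```

With the real pipeline, a sensible threshold would need to be chosen by inspecting probabilities on held-out reviews.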


## Saving the Sentiment Model

We can serialize the trained pipeline and load it inside the backend later if we want to integrate sentiment as an agent tool.

In [10]:
# Saving the Sentiment Model
import joblib
from pathlib import Path

MODELS_DIR = Path("..") / "models"
MODELS_DIR.mkdir(exist_ok=True)

joblib_path = MODELS_DIR / "imdb_sentiment_pipeline.joblib"
joblib.dump(pipeline, joblib_path)

joblib_path

# Explanation: 
# Saves the whole pipeline (TF-IDF + classifier). 
# Later, backend agents can call this model as a tool to estimate sentiment polarity on review snippets.

WindowsPath('../models/imdb_sentiment_pipeline.joblib')
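
Before relying on the saved file, it can help to confirm that it loads and predicts the same way as the in-memory pipeline. The sketch below shows that round trip with `joblib.dump` and `joblib.load`; it trains a tiny made-up pipeline and writes to a temporary directory so that it runs on its own, whereas the notebook's real file lives under `../models`. Names such as `toy_pipeline` and `restored` are illustrative.

```python
import tempfile
from pathlib import Path

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Toy stand-in for the trained IMDB pipeline (made-up data, illustration only).
toy_pipeline = Pipeline(
    steps=[
        ("tfidf", TfidfVectorizer()),
        ("clf", LogisticRegression(max_iter=1000)),
    ]
)
toy_pipeline.fit(
    [
        "wonderful acting and a great story",
        "great film with wonderful music",
        "terrible plot and awful acting",
        "awful pacing and a terrible script",
    ],
    ["positive", "positive", "negative", "negative"],
)

texts = ["a wonderful story", "an awful script"]

# Save to a temporary directory, then load the pipeline back from disk.
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = Path(tmp_dir) / "imdb_sentiment_pipeline.joblib"
    joblib.dump(toy_pipeline, model_path)
    restored = joblib.load(model_path)

# The restored pipeline includes the fitted vectorizer, so it accepts raw text.
original_preds = list(toy_pipeline.predict(texts))
restored_preds = list(restored.predict(texts))
print(restored_preds, restored_preds == original_preds)
```

Two caveats apply to any joblib file: it is pickle-based, so it should only be loaded from a trusted source, and it is safest to load it with the same scikit-learn version that was used to save it.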

-------------
## Summary

***“Can we add a sentiment brain for reviews?”***

**What I did and why**

**1. Loaded the IMDB 50k dataset (IMDB_Dataset.csv) with columns review and sentiment.**

- Reason: this is a labeled, sentiment-supervised dataset aligned with movies/TV content.

**2. Cleaned data & checked class balance.**

- Reason: ensure labels are consistent (positive/negative lowercase strings) and balanced for training.

**3. Split into train/test (80/20, stratified).**

- Reason: proper machine-learning protocol to evaluate generalization, not just training performance.

**4. Built a Pipeline = TF-IDF + LogisticRegression.**

- Reason:

    - TF-IDF gives a strong, interpretable baseline representation of text.

    - Logistic Regression is fast, robust, and good for linear separation.

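**5. Trained the model and evaluated classification metrics.**

- The model reaches ~90% accuracy on the 10,000-review test set, with F1 ≈ 0.90 for both the positive and negative classes.

- The confusion matrix shows 976 misclassifications out of 10,000, split fairly evenly: 542 negative reviews predicted as positive and 434 positive reviews predicted as negative.

**6. Tested a few custom reviews (“fantastic performances”, “hated every minute”).**

- Reason: sanity-check that the predicted sentiments match human intuition. The two clear-cut samples were labeled as expected; the lukewarm third sample (“It was okay, not great but watchable”) was labeled negative, a reminder that a binary classifier has no neutral class.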

**7. Saved the entire pipeline as models/imdb_sentiment_pipeline.joblib.**

- Reason: allow FastAPI agents to load this model as a tool later and score review snippets directly.

**What have I achieved?**

- We now have a ready-to-plug-in sentiment classifier that can run locally without any external services.

- This model can be used by an “Audience Sentiment Agent” to annotate or summarize review text in StreamIntel360.

---------

## Conclusion


- This notebook answers: “Can we quickly build a robust sentiment model for movie reviews and reuse it in our platform?”

- Yes: I trained, validated, sanity-checked, and serialized a production-ready baseline.

- We can integrate it later by loading imdb_sentiment_pipeline.joblib inside the backend and exposing it as an internal tool for agents.