# Rotten Tomatoes: Predict Review Sentiment from Movie Metadata
This notebook trains a baseline ML model to predict a critic review's **sentiment** (POSITIVE/NEGATIVE) from **movie-level variables** (e.g., scores, rating, runtime, genre, language).

It:
- Loads `rotten_tomatoes_movies.csv` and `rotten_tomatoes_movie_reviews.csv`
- Joins reviews to movies by `id`
- Builds a preprocessing + model pipeline
- Evaluates with accuracy/F1 and a confusion matrix
- Saves outputs into an `output/` folder


In [1]:

# If you're running this in Jupyter, run this once if needed:
# !pip install pandas numpy scikit-learn matplotlib joblib

import os
import re
from pathlib import Path
import pandas as pd
import numpy as np


In [2]:

# -----------------------------
# Paths + output folder
# -----------------------------
OUTPUT_DIR = Path("../output")
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)

def safe_filename(name: str) -> str:
    """Remove Windows-illegal characters and normalize spaces."""
    name = re.sub(r'[<>:"/\\|?*]', '', name)
    name = name.replace(" ", "_")
    return name

print("Working directory:", Path.cwd())
print("Output folder:", OUTPUT_DIR.resolve())


Working directory: c:\Users\chank\github\DataScienceProject\scripts
Output folder: C:\Users\chank\github\DataScienceProject\output


In [3]:

# -----------------------------
# Locate input files
# -----------------------------
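# Candidate locations, checked in order:
#   ../data/rotten_tomatoes_movies.csv      (preferred project layout)
#   ../data/rotten_tomatoes_movie_reviews.csv
#
#   data/rotten_tomatoes_movies.csv         (notebook run from the project root)
#   data/rotten_tomatoes_movie_reviews.csv
#
#   /mnt/data/rotten_tomatoes_movies.csv    (fallback for a hosted ChatGPT sandbox)
#   /mnt/data/rotten_tomatoes_movie_reviews.csv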

candidates = [
    (Path("../data/rotten_tomatoes_movies.csv"), Path("../data/rotten_tomatoes_movie_reviews.csv")),
    (Path("data/rotten_tomatoes_movies.csv"), Path("data/rotten_tomatoes_movie_reviews.csv")),
    (Path("/mnt/data/rotten_tomatoes_movies.csv"), Path("/mnt/data/rotten_tomatoes_movie_reviews.csv")),
]

movies_path = None
reviews_path = None
for m, r in candidates:
    if m.exists() and r.exists():
        movies_path, reviews_path = m, r
        break

if movies_path is None:
    raise FileNotFoundError(
        "Could not find the input CSVs. Put them in ../data/ or data/, or update the paths in this cell."
    )

print("Movies:", movies_path.resolve())
print("Reviews:", reviews_path.resolve())


Movies: C:\Users\chank\github\DataScienceProject\data\rotten_tomatoes_movies.csv
Reviews: C:\Users\chank\github\DataScienceProject\data\rotten_tomatoes_movie_reviews.csv


In [4]:

# -----------------------------
# Load data
# -----------------------------
movies = pd.read_csv(movies_path)
reviews = pd.read_csv(reviews_path)

print("movies shape:", movies.shape)
print("reviews shape:", reviews.shape)

display(movies.head(3))
display(reviews.head(3))


movies shape: (143258, 16)
reviews shape: (1444963, 11)


Unnamed: 0,id,title,audienceScore,tomatoMeter,rating,ratingContents,releaseDateTheaters,releaseDateStreaming,runtimeMinutes,genre,originalLanguage,director,writer,boxOffice,distributor,soundMix
0,space-zombie-bingo,Space Zombie Bingo!,50.0,,,,,2018-08-25,75.0,"Comedy, Horror, Sci-fi",English,George Ormrod,"George Ormrod,John Sabotta",,,
1,the_green_grass,The Green Grass,,,,,,2020-02-11,114.0,Drama,English,Tiffany Edwards,Tiffany Edwards,,,
2,love_lies,"Love, Lies",43.0,,,,,,120.0,Drama,Korean,"Park Heung-Sik,Heung-Sik Park","Ha Young-Joon,Jeon Yun-su,Song Hye-jin",,,


Unnamed: 0,id,reviewId,creationDate,criticName,isTopCritic,originalScore,reviewState,publicatioName,reviewText,scoreSentiment,reviewUrl
0,beavers,1145982,2003-05-23,Ivan M. Lincoln,False,3.5/4,fresh,Deseret News (Salt Lake City),Timed to be just long enough for most youngste...,POSITIVE,http://www.deseretnews.com/article/700003233/B...
1,blood_mask,1636744,2007-06-02,The Foywonder,False,1/5,rotten,Dread Central,It doesn't matter if a movie costs 300 million...,NEGATIVE,http://www.dreadcentral.com/index.php?name=Rev...
2,city_hunter_shinjuku_private_eyes,2590987,2019-05-28,Reuben Baron,False,,fresh,CBR,The choreography is so precise and lifelike at...,POSITIVE,https://www.cbr.com/city-hunter-shinjuku-priva...


## Join reviews to movies
Each review row inherits the movie-level metadata via `id`. We'll predict `scoreSentiment` (POSITIVE/NEGATIVE).

In [5]:

# Join
df = reviews.merge(movies, on="id", how="inner")

print("Joined shape:", df.shape)
print("Target distribution:\n", df["scoreSentiment"].value_counts(dropna=False))
display(df.head(3))


Joined shape: (1469543, 26)
Target distribution:
 scoreSentiment
POSITIVE    980223
NEGATIVE    489320
Name: count, dtype: int64


Unnamed: 0,id,reviewId,creationDate,criticName,isTopCritic,originalScore,reviewState,publicatioName,reviewText,scoreSentiment,...,releaseDateTheaters,releaseDateStreaming,runtimeMinutes,genre,originalLanguage,director,writer,boxOffice,distributor,soundMix
0,beavers,1145982,2003-05-23,Ivan M. Lincoln,False,3.5/4,fresh,Deseret News (Salt Lake City),Timed to be just long enough for most youngste...,POSITIVE,...,,2011-06-21,30.0,Documentary,English,Stephen Low,,,,
1,blood_mask,1636744,2007-06-02,The Foywonder,False,1/5,rotten,Dread Central,It doesn't matter if a movie costs 300 million...,NEGATIVE,...,,,,,,Unknown Director,,,,
2,city_hunter_shinjuku_private_eyes,2590987,2019-05-28,Reuben Baron,False,,fresh,CBR,The choreography is so precise and lifelike at...,POSITIVE,...,,,,,,Kenji Kodama,,,,
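
The joined frame has 1,469,543 rows, more than the 1,444,963 rows in `reviews`. That suggests some `id` values occur more than once in `movies`, so the inner join repeats the matching reviews. A minimal sketch of one way to guard against this, using toy frames (the values and the choice to keep the first duplicate are assumptions for illustration, not the notebook's method):

```python
import pandas as pd

# Toy stand-ins for the real tables (hypothetical values).
movies_demo = pd.DataFrame({
    "id": ["a", "a", "b"],            # "a" is duplicated on purpose
    "title": ["A", "A (dup)", "B"],
})
reviews_demo = pd.DataFrame({
    "id": ["a", "b", "b"],
    "scoreSentiment": ["POSITIVE", "NEGATIVE", "POSITIVE"],
})

# Count movie rows whose id already appeared earlier in the table.
n_dup = int(movies_demo["id"].duplicated().sum())

# Assumption: keeping the first row per id is acceptable for this demo.
movies_unique = movies_demo.drop_duplicates(subset="id", keep="first")

# validate="many_to_one" makes pandas raise a MergeError if the right-hand
# keys are not unique, so reviews cannot be silently duplicated.
joined = reviews_demo.merge(
    movies_unique, on="id", how="inner", validate="many_to_one"
)

print("duplicate movie ids:", n_dup)
print("reviews:", len(reviews_demo), "joined:", len(joined))
```

On the real data, the same check would be `movies["id"].duplicated().sum()` before the merge.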


## Basic cleaning
- Keep only rows with a known sentiment
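- Drop text, URL, and high-cardinality identifier columns. Columns that leak the target should go too: `reviewState` (fresh/rotten) mirrors sentiment almost exactly, yet the cell below leaves it, along with `originalScore`, in `X`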
- Convert dates (optional feature)


In [6]:

# Keep only POSITIVE / NEGATIVE (drop missing/other)
df = df[df["scoreSentiment"].isin(["POSITIVE", "NEGATIVE"])].copy()

# Optional: parse dates
df["creationDate"] = pd.to_datetime(df["creationDate"], errors="coerce")
df["review_year"] = df["creationDate"].dt.year
df["review_month"] = df["creationDate"].dt.month

# Choose features (movie metadata + a couple of review meta features that are not text)
# IMPORTANT: We are NOT using reviewText (that's a different NLP task).
drop_cols = [
    "reviewId", "reviewUrl", "reviewText",  # text/url
    "publicatioName", "criticName",         # high-cardinality identifiers (can overfit)
    "creationDate",                         # replaced by year/month
    "scoreSentiment"                        # target
]
X = df.drop(columns=[c for c in drop_cols if c in df.columns])
y = df["scoreSentiment"]

print("X shape:", X.shape)
print("y shape:", y.shape)
display(X.head(3))


X shape: (1469543, 21)
y shape: (1469543,)


Unnamed: 0,id,isTopCritic,originalScore,reviewState,title,audienceScore,tomatoMeter,rating,ratingContents,releaseDateTheaters,...,runtimeMinutes,genre,originalLanguage,director,writer,boxOffice,distributor,soundMix,review_year,review_month
0,beavers,False,3.5/4,fresh,Beavers,75.0,,,,,...,30.0,Documentary,English,Stephen Low,,,,,2003,5
1,blood_mask,False,1/5,rotten,Blood Mask,,,,,,...,,,,Unknown Director,,,,,2007,6
2,city_hunter_shinjuku_private_eyes,False,,fresh,City Hunter: Shinjuku Private Eyes,,,,,,...,,,,Kenji Kodama,,,,,2019,5
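
The feature preview above still contains `reviewState` and `originalScore`. One quick way to see whether a column gives away the target is to cross-tabulate it against `scoreSentiment`. The sketch below uses a small made-up frame (values are hypothetical, chosen to mirror the fresh/POSITIVE and rotten/NEGATIVE pairing in the sample rows); on the real data you would pass `df` instead.

```python
import pandas as pd

# Hypothetical rows; only the column names come from the dataset.
demo = pd.DataFrame({
    "reviewState":    ["fresh", "rotten", "fresh", "rotten", "fresh"],
    "runtimeMinutes": [30.0, 95.0, 120.0, 88.0, 101.0],
    "scoreSentiment": ["POSITIVE", "NEGATIVE", "POSITIVE", "NEGATIVE", "POSITIVE"],
})

# Rows: candidate feature values. Columns: target classes.
table = pd.crosstab(demo["reviewState"], demo["scoreSentiment"])
print(table)

# Share of rows that land in the dominant target class of their feature value.
# A value near 1.0 means the feature almost determines the target.
purity = table.max(axis=1).sum() / table.to_numpy().sum()
print(f"purity: {purity:.2f}")

# Assumption: these two columns are treated as leaky and removed.
leaky_cols = ["reviewState", "originalScore"]
features = demo.drop(columns=[c for c in leaky_cols if c in demo.columns])
```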


## Train/test split + pipeline
We use:
- `SimpleImputer` for missing values
- `OneHotEncoder(handle_unknown='ignore')` for categorical fields
- `LogisticRegression` as a strong baseline
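
To illustrate what `handle_unknown='ignore'` does: a category that the encoder never saw during `fit` is encoded as an all-zero row instead of raising an error. A small sketch with made-up language values:

```python
from sklearn.preprocessing import OneHotEncoder

# Fit on two categories only (hypothetical training values).
enc = OneHotEncoder(handle_unknown="ignore")
enc.fit([["English"], ["Korean"]])

# A known category gets a single 1 in its column.
seen = enc.transform([["Korean"]]).toarray()

# An unknown category gets zeros in every column.
unseen = enc.transform([["French"]]).toarray()

print(enc.categories_)
print("seen:  ", seen)
print("unseen:", unseen)
```

The trade-off is that every unseen value looks identical to the model, so rare directors or titles in the test split contribute no signal.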


In [7]:

from sklearn.model_selection import train_test_split
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, ConfusionMatrixDisplay, accuracy_score, f1_score
import matplotlib.pyplot as plt
import joblib


In [8]:

# Identify numeric vs categorical columns
numeric_cols = X.select_dtypes(include=["number", "bool"]).columns.tolist()
categorical_cols = [c for c in X.columns if c not in numeric_cols]

print("Numeric columns:", numeric_cols)
print("Categorical columns:", categorical_cols[:15], "..." if len(categorical_cols) > 15 else "")

numeric_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
])

categorical_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])

preprocess = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, numeric_cols),
        ("cat", categorical_transformer, categorical_cols),
    ],
    remainder="drop"
)

model = LogisticRegression(max_iter=1000)

clf = Pipeline(steps=[
    ("preprocess", preprocess),
    ("model", model)
])

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    random_state=42,
    stratify=y
)

print("Train:", X_train.shape, "Test:", X_test.shape)


Numeric columns: ['isTopCritic', 'audienceScore', 'tomatoMeter', 'runtimeMinutes', 'review_year', 'review_month']
Categorical columns: ['id', 'originalScore', 'reviewState', 'title', 'rating', 'ratingContents', 'releaseDateTheaters', 'releaseDateStreaming', 'genre', 'originalLanguage', 'director', 'writer', 'boxOffice', 'distributor', 'soundMix'] 
Train: (1175634, 21) Test: (293909, 21)
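
Every review of a movie shares the same metadata, and `id` and `title` are among the categorical features, so a row-level random split places the same movie on both sides. A hedged alternative, not what the notebook does, is to split by movie with `GroupShuffleSplit`. The toy frame below is hypothetical; note that this splitter does not stratify on the target.

```python
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# Five hypothetical movies with two reviews each.
demo = pd.DataFrame({
    "id": ["a", "a", "b", "b", "c", "c", "d", "d", "e", "e"],
    "tomatoMeter": [90, 90, 40, 40, 75, 75, 10, 10, 60, 60],
    "scoreSentiment": [
        "POSITIVE", "POSITIVE", "NEGATIVE", "POSITIVE", "POSITIVE",
        "NEGATIVE", "NEGATIVE", "NEGATIVE", "POSITIVE", "NEGATIVE",
    ],
})
X_demo = demo.drop(columns=["scoreSentiment"])
y_demo = demo["scoreSentiment"]

# test_size is the fraction of groups (movies), not rows.
gss = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(gss.split(X_demo, y_demo, groups=demo["id"]))

X_tr, X_te = X_demo.iloc[train_idx], X_demo.iloc[test_idx]
y_tr, y_te = y_demo.iloc[train_idx], y_demo.iloc[test_idx]

train_ids = set(X_tr["id"])
test_ids = set(X_te["id"])
print("movies in both splits:", train_ids & test_ids)
```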


In [9]:

# Train
clf.fit(X_train, y_train)

# Predict
y_pred = clf.predict(X_test)

acc = accuracy_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred, pos_label="POSITIVE")

print(f"Accuracy: {acc:.4f}")
print(f"F1 (POSITIVE): {f1:.4f}")


Accuracy: 1.0000
F1 (POSITIVE): 1.0000


In [10]:

# Classification report -> save to output
report = classification_report(y_test, y_pred)
print(report)

report_path = OUTPUT_DIR / "classification_report.txt"
report_path.write_text(report, encoding="utf-8")
print("Saved:", report_path.resolve())


              precision    recall  f1-score   support

    NEGATIVE       1.00      1.00      1.00     97864
    POSITIVE       1.00      1.00      1.00    196045

    accuracy                           1.00    293909
   macro avg       1.00      1.00      1.00    293909
weighted avg       1.00      1.00      1.00    293909

Saved: C:\Users\chank\github\DataScienceProject\output\classification_report.txt
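
For context on these scores, the support column gives the majority-class baseline: always predicting POSITIVE would already reach about 0.667 accuracy on this test split. A perfect 1.00 is far above that, which is worth double-checking for leaked features. The sketch below derives the baseline from the counts in the report and shows the equivalent `DummyClassifier` call on a small made-up sample (the sample is hypothetical).

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Support counts copied from the classification report above.
support = {"NEGATIVE": 97864, "POSITIVE": 196045}
baseline_acc = max(support.values()) / sum(support.values())
print(f"majority-class accuracy: {baseline_acc:.4f}")

# Same idea with scikit-learn on a tiny hypothetical sample.
y_demo = np.array(["POSITIVE"] * 4 + ["NEGATIVE"] * 2)
X_demo = np.zeros((len(y_demo), 1))  # features are ignored by this strategy
dummy = DummyClassifier(strategy="most_frequent").fit(X_demo, y_demo)
demo_acc = dummy.score(X_demo, y_demo)
print(f"demo accuracy: {demo_acc:.4f}")
```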


In [11]:

# Confusion matrix plot -> save to output
cm = confusion_matrix(y_test, y_pred, labels=["NEGATIVE", "POSITIVE"])
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=["NEGATIVE", "POSITIVE"])

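# Create the axes explicitly; calling plt.figure() first leaves a stray empty figure
# because disp.plot() opens its own figure when no axes are passed.
fig, ax = plt.subplots()
disp.plot(ax=ax, values_format="d")
ax.set_title("Confusion Matrix (Logistic Regression)")
plot_name = safe_filename("confusion_matrix_logistic_regression.png")
fig.savefig(OUTPUT_DIR / plot_name, bbox_inches="tight")
plt.close(fig)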

print("Saved:", (OUTPUT_DIR / plot_name).resolve())


Saved: C:\Users\chank\github\DataScienceProject\output\confusion_matrix_logistic_regression.png



## Save the trained pipeline
This saves *both* preprocessing and the model, so you can load it later and call `.predict()` on raw feature rows.


In [12]:

model_path = OUTPUT_DIR / "movie_sentiment_model.joblib"
joblib.dump(clf, model_path)
print("Saved model:", model_path.resolve())


Saved model: C:\Users\chank\github\DataScienceProject\output\movie_sentiment_model.joblib


## Quick example: Predict on a few rows

In [13]:

# Load model (optional demonstration)
loaded = joblib.load(model_path)

sample = X_test.head(5)
preds = loaded.predict(sample)

out = sample.copy()
out["predicted_sentiment"] = preds
out["true_sentiment"] = y_test.loc[sample.index].values

display(out)


Unnamed: 0,id,isTopCritic,originalScore,reviewState,title,audienceScore,tomatoMeter,rating,ratingContents,releaseDateTheaters,...,originalLanguage,director,writer,boxOffice,distributor,soundMix,review_year,review_month,predicted_sentiment,true_sentiment
765510,17_blocks,True,3/4,fresh,17 Blocks,,100.0,,,2019-10-30,...,English,Davy Rothbart,"Davy Rothbart,Jennifer Tiexiera",,,,2019,5,POSITIVE,POSITIVE
819913,journey_to_the_center_of_the_earth_2_3d,False,3/5,fresh,Journey 2: The Mysterious Island,59.0,44.0,PG,"['Some Adventure Action', 'Brief Mild Language']",2012-02-10,...,English,Brad Peyton,"Brian Gunn,Mark Gunn",$103.8M,Warner Bros. Pictures,"Dolby Digital, Datasat",2012,2,POSITIVE,POSITIVE
1196306,lets_make_love,False,3/5,fresh,Let's Make Love,60.0,69.0,,,1960-09-08,...,English,George Cukor,"Norman Krasna,Hal Kanter",,20th Century Fox,"Stereo, Mono",2005,10,POSITIVE,POSITIVE
1460741,deep_water_2022,True,2/5,rotten,Deep Water,24.0,36.0,R,"['Some Violence', 'Language', 'Sexual Content'...",,...,English,Adrian Lyne,"Zach Helm,Sam Levinson",,,Dolby Digital,2022,3,NEGATIVE,NEGATIVE
731396,favela_gay,True,,rotten,Favela Gay,,,,,,...,,Rodrigo Felha,,,,,2017,8,NEGATIVE,NEGATIVE


## Notes / Next steps
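- The feature set is meant to be **movie metadata** only (no review text). As built, `X` also contains `reviewState` and `originalScore`, which encode the critic's verdict, so the 1.00 accuracy and F1 above most likely reflect target leakage rather than predictive power. Drop both columns and retrain for an honest baseline.
- `id` and `title` are one-hot encoded and the split is random by row, so the same movie appears in train and test. Splitting by `id` would measure generalization to unseen movies.
- If you want to predict sentiment from the review text itself, that is an NLP workflow (TF-IDF + linear model, or transformers).
- Once leakage is removed, try `RandomForestClassifier`, XGBoost, or engineered features (e.g., split the genre list into multi-hot columns).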
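
As a sketch of the multi-hot genre idea: the sample rows show genres as one comma-separated string (e.g., `"Comedy, Horror, Sci-fi"`), so `Series.str.get_dummies` can expand them into one indicator column per genre. The `", "` separator and the `genre_` prefix are assumptions based on those sample rows, not something the notebook defines.

```python
import pandas as pd

# Values taken from the sample rows above, plus one missing entry.
genres = pd.Series(["Comedy, Horror, Sci-fi", "Drama", None], name="genre")

# One 0/1 column per distinct genre; missing values become all-zero rows.
multi_hot = genres.str.get_dummies(sep=", ").add_prefix("genre_")
print(multi_hot)
```

The resulting columns are numeric, so they could be joined back to `X` and handled by the numeric branch of the pipeline in place of the raw `genre` string.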
