# Box Office Prediction Case Study: Starter Colab Notebook
# DS 4002 - Spring 2025
# Matthew Haid

# In this notebook, you will:
# 1. Create a clean dataset by combining and processing three provided data files
# 2. Perform Exploratory Data Analysis (EDA)
# 3. Engineer features from the aggregated dataset
# 4. Train at least one machine learning model to predict opening weekend box office revenue
# 5. Evaluate each model using RMSE and R²

# Follow the instructions in each cell carefully.

Instruction:
You have been provided with three files:

hand_picked_movie_data.csv — a curated set of features for each movie.

imdb_reviews.csv — individual pre-release critic reviews for movies.

movie_ids.txt — a mapping of movie names to unique IMDb IDs.

Load all three files into pandas DataFrames to start your analysis.
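
One possible shape for the loading step is sketched below. The layouts of the three files are not described here, so every column name and the tab delimiter for movie_ids.txt are assumptions, and small in-memory stand-ins replace the real files so the snippet runs on its own. Swap each stand-in for the real file path and check the actual headers before reusing any of it.

```python
import io

import pandas as pd

# Hypothetical stand-ins for the three provided files (all column names are assumptions).
movies_file = io.StringIO(
    "movie_id,distributor,opening_earnings\n"
    "tt001,StudioA,1000000\n"
    "tt002,StudioB,2500000\n"
)
reviews_file = io.StringIO(
    "movie_id,compound\n"
    "tt001,0.8\n"
    "tt001,0.4\n"
    "tt002,-0.2\n"
)
ids_file = io.StringIO("Movie One\ttt001\nMovie Two\ttt002\n")

# With the real data, pass the file name instead, e.g. pd.read_csv("hand_picked_movie_data.csv").
movies = pd.read_csv(movies_file)
reviews = pd.read_csv(reviews_file)

# Assumes a headerless, tab-separated text file; adjust sep, header, and names to the real layout.
movie_ids = pd.read_csv(ids_file, sep="\t", header=None, names=["movie_name", "movie_id"])

# A quick shape check catches truncated or misparsed files early.
print(movies.shape, reviews.shape, movie_ids.shape)
```

If movie_ids.txt turns out to use a different delimiter or to include a header row, only the `sep`, `header`, and `names` arguments need to change.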

In [None]:
# Step 1: Import libraries and load the three datasets
import pandas as pd

# Load the hand-picked movie data

# Load the individual reviews

# Load the movie IDs


Instruction:
For each movie:

Aggregate the sentiment scores across all reviews (e.g., compute the average compound sentiment score).

Merge this aggregated sentiment information with the hand-picked movie features from hand_picked_movie_data.csv.

This will create the final dataset you'll use for modeling.
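
The aggregate-then-merge pattern described above can be sketched as follows. The toy DataFrames and the column names `movie_id` and `compound` are hypothetical stand-ins, not the real schema.

```python
import pandas as pd

# Toy stand-ins (hypothetical columns): one row per review, one row per movie.
reviews = pd.DataFrame({
    "movie_id": ["tt001", "tt001", "tt002"],
    "compound": [0.8, 0.4, -0.2],
})
movies = pd.DataFrame({
    "movie_id": ["tt001", "tt002", "tt003"],
    "opening_earnings": [1_000_000, 2_500_000, 400_000],
})

# Collapse reviews to one row per movie: mean sentiment plus the review count.
sentiment = (
    reviews.groupby("movie_id")["compound"]
    .agg(avg_compound="mean", n_reviews="count")
    .reset_index()
)

# A left merge keeps every movie, including ones with no reviews (they get NaN).
# validate raises an error if either side has duplicate movie_id values.
merged = movies.merge(sentiment, on="movie_id", how="left", validate="one_to_one")
print(merged)
```

Keeping the review count alongside the mean makes it easy to see which averages rest on only one or two reviews. Movies with no reviews appear with missing sentiment values, so they need to be dropped or imputed before modeling.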

In [None]:
# Step 2: Aggregate sentiment scores for each movie

# Merge the aggregated sentiment scores with the movie metadata

# Check the final merged dataset


Instruction:
Explore the combined dataset.
Look for trends, outliers, and interesting relationships between sentiment scores and opening weekend revenue.
Visualize key features where appropriate.
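
Before plotting, a quick numeric pass can flag outliers and hint at how sentiment relates to revenue. The sketch below uses made-up numbers and hypothetical column names. The 1.5 × IQR rule is one common outlier convention and is not a requirement of this assignment.

```python
import pandas as pd

# Made-up values for illustration only (column names are assumptions).
df = pd.DataFrame({
    "avg_compound": [0.1, 0.3, 0.5, 0.6, 0.8, 0.2],
    "opening_earnings": [1.0e6, 2.0e6, 3.5e6, 4.0e6, 40.0e6, 1.5e6],
})

# Summary statistics: a mean far above the median suggests a skewed distribution.
print(df.describe())

# Pearson measures linear association; Spearman (Pearson on ranks) measures monotonic association.
pearson = df["avg_compound"].corr(df["opening_earnings"])
spearman = df["avg_compound"].rank().corr(df["opening_earnings"].rank())
print(f"Pearson: {pearson:.2f}  Spearman: {spearman:.2f}")

# Flag high-revenue outliers using the 1.5 * IQR rule.
q1 = df["opening_earnings"].quantile(0.25)
q3 = df["opening_earnings"].quantile(0.75)
upper_fence = q3 + 1.5 * (q3 - q1)
outliers = df[df["opening_earnings"] > upper_fence]
print(outliers)
```

A large gap between the Pearson and Spearman values often means a few extreme revenues dominate the linear fit, which is one reason to consider a log transform of revenue later.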

In [None]:
# Step 3: EDA
import matplotlib.pyplot as plt
import seaborn as sns

# Example: Plot the distribution of average compound sentiment scores


# Plot opening weekend revenue distribution


Instruction:
Engineer features to prepare the data for machine learning:

Encode categorical variables (e.g., distributor, genre, release month).

Scale or transform numerical variables if needed.

Feel free to create new features if you have ideas that might improve the model.
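
One way to approach encoding and transforming is sketched below with `pandas.get_dummies` and a log transform. The `budget` column and all of the values are hypothetical. `drop_first=True` is a design choice that avoids perfectly collinear dummy columns in a linear model, not a requirement.

```python
import numpy as np
import pandas as pd

# Toy data; "budget" is a hypothetical numeric feature used only for illustration.
df = pd.DataFrame({
    "distributor": ["StudioA", "StudioB", "StudioA"],
    "release_month": [5, 12, 7],
    "budget": [50e6, 120e6, 8e6],
    "avg_compound": [0.6, -0.2, 0.3],
})

# Compress a heavily right-skewed numeric feature with log(1 + x).
df["log_budget"] = np.log1p(df["budget"])

# One-hot encode the categorical columns; release_month is treated as a category,
# not a number, so that December is not modeled as "12 times" January.
features = pd.get_dummies(
    df.drop(columns=["budget"]),
    columns=["distributor", "release_month"],
    drop_first=True,
    dtype=float,
)
print(features.columns.tolist())
```

If you scale features, fit the scaler on the training split only and reuse it on the test split, so that no information from the test set leaks into training.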

In [None]:
# Step 4: Feature Engineering

# One-hot encode categorical variables

# Optional: Scale numerical features if necessary

# Prepare the final dataset for modeling


Instruction:
Train at least one regression model to predict opening_earnings (opening weekend revenue) using the engineered features.
You are encouraged to try multiple models if time allows.
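
A minimal end-to-end sketch of the split, fit, and evaluate loop follows. It runs on synthetic data generated in place, not on the box office dataset, so the printed scores say nothing about how well real revenues can be predicted.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the engineered features and target (not real data).
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

# Hold out 20% of the rows for testing; random_state makes the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fit on the training split only, then predict on the held-out rows.
model = LinearRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)

# RMSE is the square root of the mean squared error, in the same units as the target.
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
r2 = r2_score(y_test, y_pred)
print(f"RMSE: {rmse:.3f}  R²: {r2:.3f}")
```

With the real dataset, `X` would be the engineered feature columns and `y` the opening_earnings column. With a small number of movies, a single split can give noisy scores, so cross-validation is worth considering.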

In [None]:
# Step 5: Train a Regression Model

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Separate features and target


# Split into train and test sets

# Train a Linear Regression model

# Predict on the test set and evaluate with RMSE (the square root of mean_squared_error) and R²


Instruction:
Visualize how well your model predicted the opening weekend revenues.
Discuss your evaluation metrics (RMSE, R²) and any modeling decisions you made.
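
One common way to build this visual is a scatter of predicted against actual values with a diagonal reference line. The arrays below are hypothetical placeholders for your own `y_test` and `y_pred`.

```python
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical actual and predicted revenues, in millions of dollars.
y_test = np.array([1.0, 2.5, 4.0, 6.0, 9.5])
y_pred = np.array([1.4, 2.1, 4.6, 5.2, 8.0])

fig, ax = plt.subplots(figsize=(5, 5))
ax.scatter(y_test, y_pred, alpha=0.7)

# Points on the dashed diagonal are perfect predictions;
# points below it are under-predictions, points above it over-predictions.
lims = [min(y_test.min(), y_pred.min()), max(y_test.max(), y_pred.max())]
ax.plot(lims, lims, linestyle="--", color="gray", label="perfect prediction")

ax.set_xlabel("Actual opening revenue (millions USD)")
ax.set_ylabel("Predicted opening revenue (millions USD)")
ax.legend()
plt.show()
```

If revenues span several orders of magnitude, log-scaled axes can keep a few blockbusters from squeezing every other point into one corner.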

In [None]:
# Step 6: Plot Predicted vs Actual Revenue

Instruction:
Summarize your results. Reflect on:

Which features were most important?

What limitations did your model have?

What would you try differently if you had more time?