# Movie Rating Prediction Project

This notebook demonstrates two regression models that predict movie ratings from features such as genre, director, actors, runtime, year, and budget.

Steps:
1. Create a sample IMDb-like dataset
2. One-hot encode the categorical features with OneHotEncoder
3. Train a Linear Regression model and a Random Forest Regressor
4. Evaluate with R-squared and RMSE
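
For reference, the two metrics in step 4 follow the standard definitions below. The symbols are notation introduced here rather than names from the notebook: $y_i$ is the true rating, $\hat{y}_i$ the predicted rating, $\bar{y}$ the mean of the true ratings in the evaluation set, and $n$ the number of evaluated movies.

$$
R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2},
\qquad
\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}
$$

An $R^2$ of 0 corresponds to always predicting $\bar{y}$, and RMSE is expressed in rating points.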

Note: This notebook uses a synthetic dataset. Replace the synthetic data cell with a real dataset when available.
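
When a real dataset becomes available, the swap could look roughly like the sketch below. It assumes the data is a CSV whose columns already use the names from this notebook; `load_movies`, `REQUIRED_COLUMNS`, and the in-memory sample rows are hypothetical and exist only for illustration.

```python
import io

import pandas as pd

# Assumption: the real file uses the same column names as the synthetic data.
REQUIRED_COLUMNS = ['Genre', 'Director', 'Actors', 'Runtime', 'Year', 'Budget', 'Rating']


def load_movies(source):
    """Read a CSV (path or file-like object) and keep only complete rows."""
    df = pd.read_csv(source)
    missing = [col for col in REQUIRED_COLUMNS if col not in df.columns]
    if missing:
        raise ValueError(f'Missing columns: {missing}')
    # Drop rows with any missing value in the columns the models use
    return df[REQUIRED_COLUMNS].dropna().reset_index(drop=True)


# Tiny in-memory stand-in for a real file; the last row has no rating.
sample_csv = io.StringIO(
    'Genre,Director,Actors,Runtime,Year,Budget,Rating\n'
    'Drama,Director_A,Actor_X,120,1999,5000000,7.2\n'
    'Action,Director_B,Actor_Y,101,2012,90000000,6.1\n'
    'Comedy,Director_C,Actor_Z,95,2010,20000000,\n'
)
demo_df = load_movies(sample_csv)
print(demo_df)
```

With a real file, passing its path to `load_movies` in place of `sample_csv` would produce a `df` that the later cells can use unchanged, provided the column names match.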

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score

# For reproducibility
np.random.seed(42)

In [2]:
# Create a mock IMDb-like dataset
n_samples = 500

genres = ['Action', 'Comedy', 'Drama', 'Thriller', 'Romance']
directors = ['Director_A', 'Director_B', 'Director_C', 'Director_D']
actors = ['Actor_X', 'Actor_Y', 'Actor_Z', 'Actor_W']

data = {
    'Genre': np.random.choice(genres, n_samples),
    'Director': np.random.choice(directors, n_samples),
    'Actors': np.random.choice(actors, n_samples),
    'Runtime': np.random.randint(80, 180, n_samples),
    'Year': np.random.randint(1980, 2022, n_samples),
    'Budget': np.random.randint(1_000_000, 200_000_000, n_samples),
    'Rating': np.round(np.random.normal(6.5, 1.0, n_samples), 1)
}

df = pd.DataFrame(data)
df['Rating'] = df['Rating'].clip(1, 10)  # ensure ratings between 1 and 10

df.head()

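|   | Genre | Director | Actors | Runtime | Year | Budget | Rating |
|---|-------|----------|--------|---------|------|--------|--------|
| 0 | Thriller | Director_B | Actor_Y | 98 | 1990 | 132103007 | 8.6 |
| 1 | Romance | Director_D | Actor_Z | 115 | 2006 | 104769254 | 7.7 |
| 2 | Drama | Director_D | Actor_W | 108 | 1982 | 98163884 | 7.5 |
| 3 | Romance | Director_A | Actor_W | 139 | 1996 | 73668141 | 7.1 |
| 4 | Romance | Director_B | Actor_X | 161 | 2007 | 64975459 | 7.3 |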


In [3]:
# Features and target
X = df.drop('Rating', axis=1)
y = df['Rating']

# Categorical and numerical features
categorical_features = ['Genre', 'Director', 'Actors']
numeric_features = ['Runtime', 'Year', 'Budget']

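# Preprocessing: one-hot encode the categorical columns, dropping the first
# level of each so the linear model does not see redundant columns;
# numeric columns pass through unscaled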
preprocessor = ColumnTransformer(
    transformers=[
        ('cat', OneHotEncoder(drop='first'), categorical_features),
        ('num', 'passthrough', numeric_features)
    ]
)

# Pipelines
lin_reg = Pipeline(steps=[('preprocessor', preprocessor),
                          ('regressor', LinearRegression())])

rf_reg = Pipeline(steps=[('preprocessor', preprocessor),
                         ('regressor', RandomForestRegressor(n_estimators=100, random_state=42))])

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit models
lin_reg.fit(X_train, y_train)
rf_reg.fit(X_train, y_train)

# Predictions
y_pred_lin = lin_reg.predict(X_test)
y_pred_rf = rf_reg.predict(X_test)

# Evaluation
print('Linear Regression R2:', r2_score(y_test, y_pred_lin))
print('Linear Regression RMSE:', np.sqrt(mean_squared_error(y_test, y_pred_lin)))

print('Random Forest R2:', r2_score(y_test, y_pred_rf))
print('Random Forest RMSE:', np.sqrt(mean_squared_error(y_test, y_pred_rf)))

Linear Regression R2: -0.08396282111678866
Linear Regression RMSE: 1.0621718785168222
Random Forest R2: -0.1595529001680409
Random Forest RMSE: 1.0985830373713221
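
To put the scores above in context, it can help to compare them with a baseline that always predicts the mean training rating. The sketch below is self-contained and makes some assumptions: it regenerates a stand-in target with the same distribution as the synthetic ratings (not the exact values from the cells above), and the names `baseline`, `baseline_r2`, and `baseline_rmse` are introduced here for illustration.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Assumption: a stand-in target drawn like the synthetic ratings
rng = np.random.default_rng(42)
y = np.clip(np.round(rng.normal(6.5, 1.0, 500), 1), 1, 10)
X = np.zeros((len(y), 1))  # placeholder features; DummyRegressor ignores them

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Baseline: always predict the mean of the training ratings
baseline = DummyRegressor(strategy='mean').fit(X_train, y_train)
y_pred_base = baseline.predict(X_test)

baseline_r2 = r2_score(y_test, y_pred_base)
baseline_rmse = np.sqrt(mean_squared_error(y_test, y_pred_base))
print('Baseline R2:', baseline_r2)
print('Baseline RMSE:', baseline_rmse)
```

Because the training mean rarely equals the test mean exactly, this baseline's test R-squared is at or slightly below zero. A model whose R-squared falls below the baseline's has learned nothing useful from the features.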


## Insights

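- The ratings in this synthetic dataset are drawn independently of every feature, so there is no signal to learn. Both models score a negative R-squared on the test set (about -0.08 and -0.16), which means they predict worse than a constant equal to the mean test rating. Their RMSE (about 1.06 and 1.10) is close to the standard deviation of 1.0 used to generate the ratings.
- Replace the synthetic data with a real IMDb dataset, where ratings can actually depend on the features, to get meaningful results.
- Possible enhancements: add NLP features from plot summaries, include revenue, try gradient-boosted models such as XGBoost or LightGBM, and tune hyperparameters.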
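
As one possible starting point for the hyperparameter tuning mentioned above, the sketch below runs a small grid search over a random forest pipeline with scikit-learn's `GridSearchCV`. It is self-contained and rests on assumptions: the reduced two-feature dataset, the grid values, and the 3-fold setting are illustrative choices, not recommendations from this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Assumption: a small synthetic stand-in with one categorical and one numeric feature
rng = np.random.default_rng(42)
n = 200
df = pd.DataFrame({
    'Genre': rng.choice(['Action', 'Comedy', 'Drama'], n),
    'Runtime': rng.integers(80, 180, n),
    'Rating': np.clip(np.round(rng.normal(6.5, 1.0, n), 1), 1, 10),
})
X, y = df.drop('Rating', axis=1), df['Rating']

preprocessor = ColumnTransformer(transformers=[
    ('cat', OneHotEncoder(handle_unknown='ignore'), ['Genre']),
    ('num', 'passthrough', ['Runtime']),
])
pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('regressor', RandomForestRegressor(random_state=42)),
])

# Pipeline parameters are addressed as <step name>__<parameter name>
param_grid = {
    'regressor__n_estimators': [25, 50],
    'regressor__max_depth': [3, None],
}
search = GridSearchCV(
    pipeline, param_grid, cv=3, scoring='neg_root_mean_squared_error'
)
search.fit(X, y)

print('Best parameters:', search.best_params_)
print('Best cross-validated RMSE:', -search.best_score_)
```

Tuning the whole pipeline rather than the bare regressor keeps the encoder fitted inside each cross-validation fold, so no information from the held-out fold reaches preprocessing. On random ratings like these, no parameter setting should be expected to help much.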