# Step 3

**Objectives**

1. **Regression** – predict the numeric score column (`score_netflix` in this dataset; IMDb or equivalent).
2. **Classification** – predict a binary flag `HighRated` (1 if `score_netflix > 7.5`, otherwise 0).


| Task | Algorithm |
|------|-----------|
| Regression | **Random Forest Regressor** |
| Classification | **k-Nearest Neighbors Classifier** |



In [1]:
import pandas as pd
from google.colab import drive
import matplotlib.pyplot as plt
import seaborn as sns
drive.mount('/content/drive')
file_path = '/content/drive/My Drive/merged.csv'
df = pd.read_csv(file_path)




Mounted at /content/drive


In [2]:
df.head()

Unnamed: 0,title,year,score_netflix,votes_netflix,runtime_netflix,genre_netflix,country_netflix,score_movies,votes_movies,runtime_movies,genre_movies,country_movies,budget,gross
0,inception,2010,8.8,2268288,148,scifi,GB,8.8,2100000.0,148.0,Action,United States,160000000.0,836836967.0
1,forrest gump,1994,8.8,1994599,142,drama,US,8.8,1900000.0,142.0,Drama,United States,55000000.0,678226133.0
2,saving private ryan,1998,8.6,1346020,169,drama,US,8.6,1300000.0,169.0,Drama,United States,70000000.0,482349603.0
3,django unchained,2012,8.4,1472668,165,western,US,8.4,1400000.0,165.0,Drama,United States,100000000.0,426074373.0
4,once upon a time in america,1984,8.3,342335,229,drama,US,8.4,321000.0,229.0,Crime,Italy,30000000.0,5473212.0


In [4]:
# Basic feature / target setup
target_reg = "score_netflix"                       # change if your column differs
df["HighRated"] = (df[target_reg] > 7.5).astype(int)

# Separate feature matrix and targets
X = df.drop(columns=[target_reg, "HighRated"])
y_reg = df[target_reg]
y_clf = df["HighRated"]

# Identify column types
num_cols = X.select_dtypes(include=["int64", "float64"]).columns.tolist()
cat_cols = X.select_dtypes(include=["object", "category", "bool"]).columns.tolist()

print("Numeric:", num_cols)
print("Categorical:", cat_cols)


Numeric: ['year', 'votes_netflix', 'runtime_netflix', 'score_movies', 'votes_movies', 'runtime_movies', 'budget', 'gross']
Categorical: ['title', 'genre_netflix', 'country_netflix', 'genre_movies', 'country_movies']


## Pre-processing

* **Numeric features** – median imputation + standard scaling (needed for KNN; harmless for Random Forest).  
* **Categorical features** – most-frequent imputation + one-hot encoding.  

The same `ColumnTransformer` feeds both models.
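
To make the two branches concrete, here is a minimal sketch on a hypothetical toy frame (the `runtime` and `genre` columns and their values are invented for illustration and are not the notebook's data). It shows median imputation followed by scaling on the numeric branch, and how `handle_unknown="ignore"` encodes a category that was never seen during fitting as an all-zero row.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data: one numeric and one categorical column, each with a gap.
toy_train = pd.DataFrame({
    "runtime": [100.0, 120.0, np.nan, 140.0],
    "genre": ["drama", "drama", "scifi", np.nan],
})
# A new row with a missing runtime and a genre unseen during fitting.
toy_new = pd.DataFrame({"runtime": [np.nan], "genre": ["western"]})

toy_prep = ColumnTransformer([
    ("num", Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
    ]), ["runtime"]),
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("ohe", OneHotEncoder(handle_unknown="ignore")),
    ]), ["genre"]),
])


def to_dense(matrix):
    """Return a dense array whether the transformer produced sparse or dense output."""
    return matrix.toarray() if hasattr(matrix, "toarray") else np.asarray(matrix)


train_dense = to_dense(toy_prep.fit_transform(toy_train))
new_dense = to_dense(toy_prep.transform(toy_new))

print("train shape:", train_dense.shape)  # 1 scaled numeric column + 2 one-hot columns
print("new row    :", new_dense)          # imputed median scales to 0; unseen genre is all zeros
```

Fitting the transformer inside the pipeline, rather than on the full dataset beforehand, means the medians, scaling statistics and category lists are learned only from the training folds.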


In [5]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline

preprocess = ColumnTransformer(
    transformers=[
        ("num", Pipeline([
            ("impute", SimpleImputer(strategy="median")),
            ("scale", StandardScaler())           # scaling useful for KNN; harmless for RF
        ]), num_cols),
        ("cat", Pipeline([
            ("impute", SimpleImputer(strategy="most_frequent")),
            ("ohe", OneHotEncoder(handle_unknown="ignore"))
        ]), cat_cols),
    ]
)


### Train/Test Split and Cross-Validation

* **Stratified 80 / 20 split** preserves the positive-class ratio of `HighRated` in both the training and test sets.  
* **5-fold `KFold`** (`shuffle=True, random_state=42`) is used inside `GridSearchCV`  
  to select hyper-parameters that generalise across folds. Unlike the initial split, these folds are not stratified.
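
The effect of stratification can be seen in a small sketch. The labels below are hypothetical (100 rows, 30% positives), not the notebook's data, and `StratifiedKFold` is shown only for comparison; the notebook itself uses plain `KFold`.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold, train_test_split

# Hypothetical labels: 100 rows, 30 positives and 70 negatives.
y_toy = np.array([1] * 30 + [0] * 70)
X_toy = np.arange(100).reshape(-1, 1)

# stratify=y_toy keeps the 30% positive rate in both partitions.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=42
)
print("train positive rate:", y_tr.mean())
print("test positive rate :", y_te.mean())

# Plain KFold ignores the labels, so per-fold positive rates can drift;
# StratifiedKFold keeps them close to the overall rate.
splitters = {
    "KFold": KFold(n_splits=5, shuffle=True, random_state=42),
    "StratifiedKFold": StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
}
fold_rates = {}
for name, splitter in splitters.items():
    fold_rates[name] = [
        y_tr[val_idx].mean() for _, val_idx in splitter.split(X_tr, y_tr)
    ]
    print(name, np.round(fold_rates[name], 2))
```

With only 96 training rows, as in this notebook, an unlucky unstratified fold can contain noticeably fewer positives than the rest, which makes per-fold accuracy noisier.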


In [6]:
# Train-test split & CV object
from sklearn.model_selection import train_test_split, KFold

X_train, X_test, y_train_reg, y_test_reg, y_train_clf, y_test_clf = train_test_split(
    X, y_reg, y_clf, test_size=0.2, stratify=y_clf, random_state=42
)

cv = KFold(n_splits=5, shuffle=True, random_state=42)

print("Train size:", X_train.shape[0], "  Test size:", X_test.shape[0])


Train size: 96   Test size: 25


## Model 1 – Random Forest Regressor

We tune two key hyper-parameters to balance bias and variance:

| Hyper-parameter | Effect |
|-----------------|--------|
| `n_estimators`  | Number of trees in the ensemble (more trees → lower variance, higher compute). |
| `max_depth`     | Maximum depth of each tree (limits overfitting when set). |

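Cross-validation scoring metric: **negative RMSE** (`neg_root_mean_squared_error`). scikit-learn always maximises the score, so RMSE is negated: scores closer to zero mean lower error and better generalisation. The code flips the sign back before printing.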
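
The sign convention is easy to verify on synthetic data. The sketch below is an illustration only: it uses `make_regression` and a `LinearRegression` stand-in rather than the notebook's Random Forest, and checks the first fold by hand.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression data, used only to illustrate the scorer's sign.
X_toy, y_toy = make_regression(n_samples=120, n_features=4, noise=5.0, random_state=0)
cv_toy = KFold(n_splits=5, shuffle=True, random_state=42)

# The scorer returns minus the RMSE for each fold, so every value is <= 0.
neg_rmse = cross_val_score(
    LinearRegression(), X_toy, y_toy, cv=cv_toy,
    scoring="neg_root_mean_squared_error",
)
rmse_per_fold = -neg_rmse  # flip the sign to recover ordinary RMSE

# Recompute the first fold manually to confirm the relationship.
train_idx, val_idx = next(cv_toy.split(X_toy))
fitted = LinearRegression().fit(X_toy[train_idx], y_toy[train_idx])
residuals = y_toy[val_idx] - fitted.predict(X_toy[val_idx])
manual_rmse = np.sqrt(np.mean(residuals ** 2))

print("scorer output :", np.round(neg_rmse, 3))
print("RMSE per fold :", np.round(rmse_per_fold, 3))
print("manual fold 0 :", round(float(manual_rmse), 3))
```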


In [9]:
# Random Forest Regressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
import numpy as np

rf_pipe = Pipeline([
    ("prep", preprocess),
    ("model", RandomForestRegressor(random_state=42))
])

rf_grid = {
    "model__n_estimators": [100, 200],
    "model__max_depth": [None, 10],
}

gs_rf = GridSearchCV(
    rf_pipe, rf_grid, cv=cv,
    scoring="neg_root_mean_squared_error",
    n_jobs=-1, refit=True
)
gs_rf.fit(X_train, y_train_reg)

best_rf = gs_rf.best_estimator_
print("RF best params:", gs_rf.best_params_)
print("CV RMSE:", -gs_rf.best_score_)


RF best params: {'model__max_depth': None, 'model__n_estimators': 200}
CV RMSE: 0.07089005858532502


In [10]:

# Test performance
pred_reg = best_rf.predict(X_test)
print("Test MSE :", mean_squared_error(y_test_reg, pred_reg))   # returns MSE; take np.sqrt(...) for RMSE
print("Test MAE :", mean_absolute_error(y_test_reg, pred_reg))
print("Test R²  :", r2_score(y_test_reg, pred_reg))


Test MSE : 0.006036949999999996
Test MAE : 0.04934000000000207
Test R²  : 0.9699259226048144
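
The first value printed by the test cell comes from `mean_squared_error`, which returns a mean squared error by default, so it is on a squared scale. A minimal sketch of the conversion to RMSE, using hypothetical toy scores plus the printed value itself:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical toy scores, used only to show the MSE -> RMSE relationship.
y_true_toy = np.array([8.0, 7.0, 6.0])
y_pred_toy = np.array([8.1, 6.9, 6.2])

toy_mse = mean_squared_error(y_true_toy, y_pred_toy)  # mean of squared errors
toy_rmse = np.sqrt(toy_mse)                           # back on the scale of the target
print("toy MSE :", toy_mse)
print("toy RMSE:", toy_rmse)

# Applying the same square root to the value printed by the test cell.
printed_mse = 0.006036949999999996
rmse_from_printed = np.sqrt(printed_mse)
print("RMSE implied by the printed value:", round(float(rmse_from_printed), 4))
```

An RMSE of roughly 0.078 on the test set is on the same scale as the cross-validated RMSE of about 0.071, which is the comparison that matters.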


## Model 2 – k-Nearest Neighbors Classifier

Key hyper-parameters:

* `n_neighbors` – size of the neighbourhood  
* `weights` – uniform vs. distance-weighted voting  
* `p` – distance metric (1 = Manhattan, 2 = Euclidean)

Scoring metric: **Accuracy** (higher is better).
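
The following sketch illustrates what `p` and `weights` change, on hypothetical one- and two-dimensional points chosen so that the two weighting schemes disagree. It is an illustration, not part of the notebook's pipeline.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Distance metric: the same pair of points under p=1 and p=2.
a = np.array([0.0, 0.0])
b = np.array([3.0, 4.0])
manhattan = np.sum(np.abs(a - b))          # p=1: |3| + |4|
euclidean = np.sqrt(np.sum((a - b) ** 2))  # p=2: sqrt(3^2 + 4^2)
print("Manhattan:", manhattan, " Euclidean:", euclidean)

# Voting scheme: one positive very close to the query, two negatives further away.
X_toy = np.array([[0.0], [1.0], [1.2]])
y_toy = np.array([1, 0, 0])
query = np.array([[0.1]])

knn_uniform = KNeighborsClassifier(n_neighbors=3, weights="uniform").fit(X_toy, y_toy)
knn_distance = KNeighborsClassifier(n_neighbors=3, weights="distance").fit(X_toy, y_toy)

pred_uniform = int(knn_uniform.predict(query)[0])    # majority vote: two negatives win
pred_distance = int(knn_distance.predict(query)[0])  # inverse-distance vote: the close positive wins
print("uniform :", pred_uniform)
print("distance:", pred_distance)
```

Because both metrics sum over all features, unscaled columns with large ranges (such as `budget` or `gross`) would dominate the distance, which is why the numeric branch is standardised.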


In [12]:
# K-Nearest Neighbors Classifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

knn_pipe = Pipeline([
    ("prep", preprocess),
    ("model", KNeighborsClassifier())
])

knn_grid = {
    "model__n_neighbors": [3, 5, 7],
    "model__weights": ["uniform", "distance"],
    "model__p": [1, 2],          # 1=Manhattan, 2=Euclidean
}

gs_knn = GridSearchCV(
    knn_pipe, knn_grid, cv=cv,
    scoring="accuracy",
    n_jobs=-1, refit=True
)
gs_knn.fit(X_train, y_train_clf)

best_knn = gs_knn.best_estimator_
print("KNN best params:", gs_knn.best_params_)
print("CV Accuracy:", gs_knn.best_score_)



KNN best params: {'model__n_neighbors': 5, 'model__p': 2, 'model__weights': 'uniform'}
CV Accuracy: 0.8342105263157894


In [13]:

# Test performance
pred_clf  = best_knn.predict(X_test)
proba_clf = best_knn.predict_proba(X_test)[:,1]

print("Test Accuracy:", accuracy_score(y_test_clf, pred_clf))
print("Test F1      :", f1_score(y_test_clf, pred_clf))
print("Test AUROC   :", roc_auc_score(y_test_clf, proba_clf))

Test Accuracy: 0.92
Test F1      : 0.9
Test AUROC   : 0.9766666666666667


## Final Results and Interpretation

| Task | Best Hyper-parameters | Test Metrics |
|------|----------------------|--------------|
| **Regression** (Random Forest) | `n_estimators=200`  `max_depth=None` | MSE ≈ **0.006** (RMSE ≈ **0.078**)    MAE ≈ **0.05**    R² ≈ **0.97** |
| **Classification** (KNN) | `n_neighbors=5`  `weights='uniform'`  `p=2` | Accuracy = **0.92**    F1 = **0.90**    AUROC ≈ **0.98** |

### What the numbers mean 🔍
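* **Regression** – The test-cell value of ≈0.006 is a mean squared error (`mean_squared_error` returns MSE by default); its square root gives an RMSE of ≈0.078, in line with the cross-validated RMSE of ≈0.071. The MAE of ≈0.05 means predictions miss the true score by about 0.05 points on average.  
  The R² of 0.97 means roughly **97%** of the variance in test-set scores is explained by the Random Forest. Because `score_movies` is among the features and nearly matches `score_netflix` in the rows shown by `df.head()`, much of this accuracy likely comes from that one column.
* **Classification** – The KNN assigns the correct `HighRated` label to **92%** of the test movies (23 of 25), with an F1 of 0.90 on the high-rated class.  
  An AUROC of ≈0.98 indicates strong ranking ability (the probability that the model scores a random high-rated movie above a random low-rated one).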
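
The ranking interpretation of AUROC can be checked by counting pairs directly. The sketch below uses hypothetical labels and scores (not the notebook's predictions) and compares the pairwise count with `roc_auc_score`.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical labels and predicted probabilities for the positive class.
y_true = np.array([1, 1, 1, 0, 0, 0, 0])
scores = np.array([0.9, 0.8, 0.4, 0.7, 0.3, 0.2, 0.1])

pos_scores = scores[y_true == 1]
neg_scores = scores[y_true == 0]

# Compare every positive with every negative; ties count as half a win.
diff = pos_scores[:, None] - neg_scores[None, :]
pairwise_auc = (np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / diff.size

print("pairwise estimate:", pairwise_auc)
print("roc_auc_score    :", roc_auc_score(y_true, scores))
```

Here 11 of the 12 positive-negative pairs are ordered correctly, so both computations give 11/12. Unlike accuracy, this figure does not depend on the 0.5 decision threshold.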
