# IMDB Rating Prediction - Random Forest

In this notebook we predict the **rating** of IMDB movies
with a Random Forest. Features:

- Numeric: `votes`, `runtime`
- Categorical: `genre` (multi-hot encoding), `director` (target encoding), `stars` (multi-value target encoding)
- Text: `description` (TF-IDF)

Model:
- `RandomForestRegressor`

Evaluation:
- Primary metric: **RMSE**
- Secondary metrics: **MSE**, **R²**
- Evaluation procedure:
  - **5-fold cross-validation** on the train set (via GridSearchCV)
  - Final performance on a held-out **test set** (RMSE, MSE, R²)
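
To make the relationship between the three metrics concrete, here is a minimal sketch on made-up numbers. The arrays `y_true` and `y_hat` are hypothetical and are not taken from the IMDB data. RMSE is the square root of MSE, and R² compares the model's squared error with that of always predicting the mean.

```python
import numpy as np

# Hypothetical ratings and predictions, for illustration only
y_true = np.array([6.0, 7.0, 8.0, 5.0])
y_hat = np.array([6.5, 6.5, 7.0, 5.0])

# MSE: mean of the squared residuals
mse_demo = np.mean((y_true - y_hat) ** 2)

# RMSE: square root of MSE, in the same units as the rating
rmse_demo = np.sqrt(mse_demo)

# R^2: 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_demo = 1.0 - ss_res / ss_tot

print(f"MSE={mse_demo:.4f}  RMSE={rmse_demo:.4f}  R2={r2_demo:.4f}")
```

Because RMSE is expressed in rating points, it is the easiest of the three to interpret, which is one reason to treat it as the primary metric.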


## 1. Clean Data Loading & Data Info

In [10]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer
from sklearn.metrics import mean_squared_error, r2_score

In [11]:
data_path = "/content/imdb_data/IMDB_cleaned.csv"   # output of data_clean.py
df = pd.read_csv(data_path)

print("[INFO] Data size:", df.shape)
df.head()


[INFO] Data size: (61227, 8)


Unnamed: 0,movie,genre,runtime,rating,stars,description,votes,director
0,Mission: Impossible - Dead Reckoning Part One,"Action, Adventure, Thriller",163,8.0,"Tom Cruise, Hayley Atwell, Ving Rhames, Simon ...",Ethan Hunt and his IMF team must track down a ...,106759,Christopher McQuarrie
1,Sound of Freedom,"Action, Biography, Drama",131,7.9,"Jim Caviezel, Mira Sorvino, Bill Camp, Cristal...",The incredible true story of a former governme...,41808,Alejandro Monteverde
2,They Cloned Tyrone,"Action, Comedy, Mystery",122,6.7,"John Boyega, Jamie Foxx, Teyonah Parris, Kiefe...",A series of eerie events thrusts an unlikely t...,14271,Juel Taylor
3,The Flash,"Action, Adventure, Fantasy",144,6.9,"Ezra Miller, Michael Keaton, Sasha Calle, Mich...",Barry Allen uses his super speed to change the...,126445,Andy Muschietti
4,Transformers: Rise of the Beasts,"Action, Adventure, Sci-Fi",127,6.1,"Anthony Ramos, Dominique Fishback, Luna Lauren...","During the 90s, a new faction of Transformers ...",62180,Steven Caple Jr.


In [12]:
df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 61227 entries, 0 to 61226
Data columns (total 8 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   movie        61227 non-null  object 
 1   genre        61227 non-null  object 
 2   runtime      61227 non-null  int64  
 3   rating       61227 non-null  float64
 4   stars        61227 non-null  object 
 5   description  61227 non-null  object 
 6   votes        61227 non-null  int64  
 7   director     61227 non-null  object 
dtypes: float64(1), int64(2), object(5)
memory usage: 3.7+ MB


## 2. Genre Encoding (Multi-hot)

In this step we:

- Normalize the strings in the `genre` column (remove the spaces).
- Use `str.get_dummies(sep=",")` to create one dummy column per genre (multi-hot encoding).
- Drop the original `genre` column and add the dummy columns in its place.

Because this encoding does not use the target variable (**rating**),
it is safe to apply **before the train/test split** (it does not cause data leakage).
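
A small sketch of what multi-hot encoding produces, using three hypothetical genre strings in the same `"A, B, C"` format as the dataset (the `toy*` names are illustrative only):

```python
import pandas as pd

# Hypothetical genre strings, not taken from the dataset
toy = pd.Series(["Action, Drama", "Comedy", "Drama, Comedy"])

# Remove spaces so that " Drama" and "Drama" map to the same column
toy_clean = toy.str.replace(" ", "", regex=False)

# One 0/1 column per distinct genre; a row can contain several 1s
toy_dummies = toy_clean.str.get_dummies(sep=",")
print(toy_dummies)
```

Each row keeps one indicator per genre it belongs to, which is what distinguishes multi-hot from ordinary one-hot encoding.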


In [13]:
df_model = df.copy()

# remove spaces inside the genre strings
df_model["genre"] = df_model["genre"].astype(str).str.replace(" ", "", regex=False)

# multi-hot encoding
genres_encoded = df_model["genre"].str.get_dummies(sep=",")

print("[INFO] Number of genres:", genres_encoded.shape[1])

# drop the genre column, append the encoded columns
df_model = pd.concat([df_model.drop(columns=["genre"]), genres_encoded], axis=1)

df_model.head()


[INFO] Number of genres: 28


Unnamed: 0,movie,runtime,rating,stars,description,votes,director,Action,Adult,Adventure,...,News,Reality-TV,Romance,Sci-Fi,Short,Sport,Talk-Show,Thriller,War,Western
0,Mission: Impossible - Dead Reckoning Part One,163,8.0,"Tom Cruise, Hayley Atwell, Ving Rhames, Simon ...",Ethan Hunt and his IMF team must track down a ...,106759,Christopher McQuarrie,1,0,1,...,0,0,0,0,0,0,0,1,0,0
1,Sound of Freedom,131,7.9,"Jim Caviezel, Mira Sorvino, Bill Camp, Cristal...",The incredible true story of a former governme...,41808,Alejandro Monteverde,1,0,0,...,0,0,0,0,0,0,0,0,0,0
2,They Cloned Tyrone,122,6.7,"John Boyega, Jamie Foxx, Teyonah Parris, Kiefe...",A series of eerie events thrusts an unlikely t...,14271,Juel Taylor,1,0,0,...,0,0,0,0,0,0,0,0,0,0
3,The Flash,144,6.9,"Ezra Miller, Michael Keaton, Sasha Calle, Mich...",Barry Allen uses his super speed to change the...,126445,Andy Muschietti,1,0,1,...,0,0,0,0,0,0,0,0,0,0
4,Transformers: Rise of the Beasts,127,6.1,"Anthony Ramos, Dominique Fishback, Luna Lauren...","During the 90s, a new faction of Transformers ...",62180,Steven Caple Jr.,1,0,1,...,0,0,0,1,0,0,0,0,0,0


## 3. Building X and y

- Target variable (y) = `rating`
- Features (X):
  - Numeric + genre columns: `votes`, `runtime` + all genre dummies
  - Categorical columns: `director`, `stars` (target-encoded later)
  - Text column: `description`


In [14]:
target_col = "rating"
y = df_model[target_col]

# names of the genre dummy columns
genre_cols = list(genres_encoded.columns)

# base numeric columns
numeric_base = ["votes", "runtime"]

# all numeric + genre columns
numeric_features_full = numeric_base + genre_cols

# categorical columns to be target-encoded
director_col = "director"
stars_col = "stars"

# text column
text_feature = "description"

# X: numeric + categorical + text columns
X = df_model[numeric_features_full + [director_col, stars_col, text_feature]]

print("[INFO] X shape:", X.shape)
print("[INFO] y shape:", y.shape)

X.head()


[INFO] X shape: (61227, 33)
[INFO] y shape: (61227,)


Unnamed: 0,votes,runtime,Action,Adult,Adventure,Animation,Biography,Comedy,Crime,Documentary,...,Sci-Fi,Short,Sport,Talk-Show,Thriller,War,Western,director,stars,description
0,106759,163,1,0,1,0,0,0,0,0,...,0,0,0,0,1,0,0,Christopher McQuarrie,"Tom Cruise, Hayley Atwell, Ving Rhames, Simon ...",Ethan Hunt and his IMF team must track down a ...
1,41808,131,1,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,Alejandro Monteverde,"Jim Caviezel, Mira Sorvino, Bill Camp, Cristal...",The incredible true story of a former governme...
2,14271,122,1,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,Juel Taylor,"John Boyega, Jamie Foxx, Teyonah Parris, Kiefe...",A series of eerie events thrusts an unlikely t...
3,126445,144,1,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,Andy Muschietti,"Ezra Miller, Michael Keaton, Sasha Calle, Mich...",Barry Allen uses his super speed to change the...
4,62180,127,1,0,1,0,0,0,0,0,...,1,0,0,0,0,0,0,Steven Caple Jr.,"Anthony Ramos, Dominique Fishback, Luna Lauren...","During the 90s, a new faction of Transformers ..."


## 4. Train / Test Split

We split the data into:

- 80% **train**
- 20% **test**

Important note:  
The target encoding for `director` and `stars` is learned from the **train set only**,
so the ratings in the test set do not leak into the encoding step. (Target encoding: a category is encoded as the mean of the target variable over the samples that belong to that category.)
This prevents **data leakage**.
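
The idea can be sketched on a toy frame. The data below is hypothetical; director `"D"` never appears in train, so it falls back to the global train mean.

```python
import pandas as pd

# Hypothetical train/test data; "D" is unseen in train
train = pd.DataFrame({"director": ["A", "A", "B"], "rating": [8.0, 6.0, 5.0]})
test = pd.DataFrame({"director": ["A", "D"]})

# Statistics learned from the train split only
global_mean_demo = train["rating"].mean()
means = train.groupby("director")["rating"].mean()   # A -> 7.0, B -> 5.0

# Apply the train mapping to test; unseen categories get the global mean
test["director_te"] = test["director"].map(means).fillna(global_mean_demo)
print(test)
```

Nothing from `test` is used when computing `means` or `global_mean_demo`, which is the property that keeps test ratings out of the encoding.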


In [8]:
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.2,
    random_state=42
)

print("[INFO] Train:", X_train.shape)
print("[INFO] Test :", X_test.shape)


[INFO] Train: (48981, 33)
[INFO] Test : (12246, 33)


## 5. Target Encoding for Director & Stars (Train Only)

In this section we apply two different encodings:

1. **Director (single value):**
   - On the train set, `director` → mean `rating`
   - Global mean for directors not seen in train

2. **Stars (multiple values):**
   - Each film's `stars` value has the form `"Actor1, Actor2, Actor3"`
   - First we convert this string to a list: `["Actor1", "Actor2", "Actor3"]`
   - On the train set we compute the mean rating for each actor.
   - A film's star value = the mean of the per-actor mean ratings of all actors in that film
   - For actors not seen in train we use the global mean.

As a result:

- The `director` and `stars` columns become numeric.
- The encoding only "sees" train data, so no test information leaks into it. Note, however, that each train row is encoded with means that include its own rating, and the encoding is computed once before cross-validation, so the CV scores in section 7 may be optimistic compared with the test score.
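
The multi-value case can be sketched with `DataFrame.explode`, which is an alternative formulation and not the loop-based implementation used in the notebook. All data and names below (`toy_train`, `encode`, the actor names) are hypothetical.

```python
import pandas as pd

# Hypothetical train data: two films sharing one actor
toy_train = pd.DataFrame({
    "stars": ["Ann, Bob", "Bob, Cid"],
    "rating": [8.0, 6.0],
})
toy_global = toy_train["rating"].mean()  # 7.0

# One row per (film, actor) pair
exploded = toy_train.assign(star=toy_train["stars"].str.split(",")).explode("star")
exploded["star"] = exploded["star"].str.strip()

# Per-actor mean rating: Ann -> 8.0, Bob -> 7.0, Cid -> 6.0
star_means = exploded.groupby("star")["rating"].mean()

def encode(stars: str) -> float:
    """Average the per-actor means; unseen actors fall back to the global mean."""
    names = [s.strip() for s in stars.split(",")]
    return float(sum(star_means.get(n, toy_global) for n in names) / len(names))

print(encode("Ann, Bob"))   # seen actors only
print(encode("Bob, Zed"))   # "Zed" is unseen in train
```

An actor who appears in many films contributes a more stable mean than one who appears once, so rare actors make this feature noisier.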


In [15]:
def target_encode_director(X_train, X_test, y_train, director_col):
    """
    Target encoding for director:
    - Computes director -> mean rating on the train set
    - Encodes train and test with this mapping
    - Uses the global mean for unseen directors
    """
    X_train = X_train.copy()
    X_test = X_test.copy()
    y_train = pd.Series(y_train)

    global_mean = y_train.mean()
    print(f"[INFO] Global mean rating (train): {global_mean:.4f}")

    tmp = pd.DataFrame({director_col: X_train[director_col], "rating": y_train})
    director_means = tmp.groupby(director_col)["rating"].mean()

    new_col = f"{director_col}_te"

    X_train[new_col] = X_train[director_col].map(director_means).fillna(global_mean)
    X_test[new_col]  = X_test[director_col].map(director_means).fillna(global_mean)

    # the original director column can now be dropped
    X_train = X_train.drop(columns=[director_col])
    X_test  = X_test.drop(columns=[director_col])

    print(f"[INFO] Target-encoded column created: {new_col}")

    return X_train, X_test


X_train_enc, X_test_enc = target_encode_director(
    X_train, X_test, y_train, director_col
)

X_train_enc.head()


[INFO] Global mean rating (train): 6.1892
[INFO] Target-encoded column created: director_te


Unnamed: 0,votes,runtime,Action,Adult,Adventure,Animation,Biography,Comedy,Crime,Documentary,...,Sci-Fi,Short,Sport,Talk-Show,Thriller,War,Western,stars,description,director_te
17348,134,85,0,0,0,0,0,0,1,0,...,0,0,0,0,1,0,0,"Alicia Leigh Willis, Kalen Bull, Matthew Pohlk...",When a high school student starts dating a reb...,5.75
5010,1547,97,1,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,"Peter Finch, James MacArthur, Bernard Lee, Joh...","In Scotland in 1751, young David Balfour is sh...",6.407143
59940,86,60,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,"Johnny Mack Brown, Louise Stanley, Ted Adams, ...",Jeff arrives in town to see the Sheriff only t...,5.55
32336,318,84,0,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,"Rachael Hevrin, Grace Powell, Ella Taylor, Kri...","After moving to a new town, a young college st...",3.3
39751,2902,78,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,"Igor Ilyinsky, Lyudmila Gurchenko, Yuri Belov,...",A new chief of a Culture House is planning to ...,6.22


In [16]:
def target_encode_stars_multi(X_train, X_test, y_train, stars_col):
    """
    Target encoding for multi-value stars:
    - Converts strings like "Actor1, Actor2" to lists
    - Computes the mean rating for each star on the train set
    - A film's stars value = the mean of the per-star mean ratings of all stars in that film
    - Uses the global mean for stars not seen in train
    """
    X_train = X_train.copy()
    X_test = X_test.copy()
    y_train = pd.Series(y_train)

    global_mean = y_train.mean()

    # 1) Convert the stars strings to lists
    X_train["stars_list"] = X_train[stars_col].apply(
        lambda x: [s.strip() for s in str(x).split(",")]
    )
    X_test["stars_list"] = X_test[stars_col].apply(
        lambda x: [s.strip() for s in str(x).split(",")]
    )

    # 2) Collect the list of ratings for each star
    star_ratings = {}

    for stars, rating in zip(X_train["stars_list"], y_train):
        for s in stars:
            if s not in star_ratings:
                star_ratings[s] = []
            star_ratings[s].append(rating)

    # 3) Mean rating for each star
    star_mean = {s: np.mean(vals) for s, vals in star_ratings.items()}

    # 4) Produce a film-level value by averaging over its star list
    def encode_star_list(stars):
        vals = [star_mean.get(s, global_mean) for s in stars]
        return np.mean(vals) if len(vals) > 0 else global_mean

    X_train["stars_te"] = X_train["stars_list"].apply(encode_star_list)
    X_test["stars_te"]  = X_test["stars_list"].apply(encode_star_list)

    # 5) Drop the temporary columns
    X_train = X_train.drop(columns=[stars_col, "stars_list"])
    X_test  = X_test.drop(columns=[stars_col, "stars_list"])

    print("[INFO] Target-encoded column created: stars_te")

    return X_train, X_test


X_train_enc, X_test_enc = target_encode_stars_multi(
    X_train_enc, X_test_enc, y_train, stars_col
)

print("[INFO] Encoded train shape:", X_train_enc.shape)
print("[INFO] Encoded test  shape:", X_test_enc.shape)
X_train_enc.head()


[INFO] Target-encoded column created: stars_te
[INFO] Encoded train shape: (48981, 33)
[INFO] Encoded test  shape: (12246, 33)


Unnamed: 0,votes,runtime,Action,Adult,Adventure,Animation,Biography,Comedy,Crime,Documentary,...,Sci-Fi,Short,Sport,Talk-Show,Thriller,War,Western,description,director_te,stars_te
17348,134,85,0,0,0,0,0,0,1,0,...,0,0,0,0,1,0,0,When a high school student starts dating a reb...,5.75,5.269847
5010,1547,97,1,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,"In Scotland in 1751, young David Balfour is sh...",6.407143,6.608708
59940,86,60,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,Jeff arrives in town to see the Sheriff only t...,5.55,5.705343
32336,318,84,0,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,"After moving to a new town, a young college st...",3.3,3.877847
39751,2902,78,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,A new chief of a Culture House is planning to ...,6.22,7.004847


## 6. Separating Numeric and Text Features

At this point:

- All features except `description` are numeric (votes, runtime, genre dummies, director_te, stars_te).
- `description` is the text column.

TF-IDF vectorization is applied only to `description`.
All other numeric columns go to the Random Forest unchanged.


In [17]:
# every column except the text column is treated as numeric
numeric_features = [col for col in X_train_enc.columns if col != text_feature]

print("[INFO] Number of numeric features:", len(numeric_features))
print("[INFO] Text feature:", text_feature)


[INFO] Number of numeric features: 32
[INFO] Text feature: description


## 7. TF-IDF + Random Forest Pipeline and 5-Fold Cross-Validation

In this section:

- `description` is vectorized with **TF-IDF** (`TfidfVectorizer`).
- Numeric features (votes, runtime, genre dummies, director_te, stars_te) are passed through unchanged.
- `ColumnTransformer` combines numeric + text into a single feature space.
- The sparse TF-IDF output is converted to dense with a `FunctionTransformer`. scikit-learn's `RandomForestRegressor` also accepts sparse input, so this step is optional rather than required.
- Model: `RandomForestRegressor`
- `GridSearchCV` settings:
  - 5-fold CV
  - Primary metric: **RMSE** (`neg_root_mean_squared_error`)
  - Hyperparameter search (n_estimators, max_depth, min_samples_leaf)

In [None]:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer

# 1) Text -> TF-IDF
text_transformer = TfidfVectorizer(
    max_features=1000,
    stop_words="english"
)

# 2) ColumnTransformer: numeric + text
preprocessor = ColumnTransformer(
    transformers=[
        ("num", "passthrough", numeric_features),
        ("text", text_transformer, text_feature)
    ],
    remainder="drop"
)

# 3) Sparse -> dense (for RF)
to_dense = FunctionTransformer(
    lambda x: x.toarray() if hasattr(x, "toarray") else np.asarray(x),
    accept_sparse=True
)

# 4) Random Forest model
rf = RandomForestRegressor(
    random_state=42,
    n_jobs=-1
)

# 5) Full pipeline
rf_pipeline = Pipeline(steps=[
    ("preprocess", preprocessor),
    ("to_dense", to_dense),
    ("model", rf)
])

# 6) Hyperparameter grid
param_grid = {
    "model__n_estimators": [100, 200, 400],
    "model__max_depth": [None, 20],
    "model__min_samples_leaf": [1, 2, 4],
}

# 7) 5-Fold CV, primary metric: RMSE
grid_search = GridSearchCV(
    estimator=rf_pipeline,
    param_grid=param_grid,
    cv=5,
    scoring="neg_root_mean_squared_error",
    n_jobs=-1,
    verbose=2
)

grid_search.fit(X_train_enc, y_train)


Fitting 5 folds for each of 18 candidates, totalling 90 fits


## 8. Cross-Validation Results

Here we look at:

- the best hyperparameters found by GridSearchCV,
- the **best mean RMSE** obtained from 5-fold CV.


In [23]:
print("========== GRID SEARCH RESULTS ==========")
print("Best hyperparameters:", grid_search.best_params_)

best_cv_rmse = -grid_search.best_score_
print(f"5-Fold CV best mean RMSE: {best_cv_rmse:.4f}")




AttributeError: 'GridSearchCV' object has no attribute 'best_params_'
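
This `AttributeError` is what scikit-learn raises when `best_params_` is read from a `GridSearchCV` object whose `fit` has not completed, which is consistent with the fit cell in section 7 showing `In [None]`. The sketch below illustrates the behaviour on hypothetical synthetic data with a deliberately tiny grid; the `*_demo` names are illustrative and not part of the notebook.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Hypothetical synthetic regression data, only to make the sketch runnable
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(60, 3))
y_demo = X_demo[:, 0] * 2.0 + rng.normal(scale=0.1, size=60)

gs_demo = GridSearchCV(
    estimator=RandomForestRegressor(n_estimators=10, random_state=0),
    param_grid={"max_depth": [2, None]},
    cv=3,
    scoring="neg_root_mean_squared_error",
)

# Before fit(): the fitted attributes do not exist yet
fitted_before = hasattr(gs_demo, "best_params_")

gs_demo.fit(X_demo, y_demo)

# After fit(): best_params_ and best_score_ are available
fitted_after = hasattr(gs_demo, "best_params_")
print(fitted_before, fitted_after)
print("Best params:", gs_demo.best_params_)
print(f"Best CV RMSE: {-gs_demo.best_score_:.4f}")
```

The score is negated because scikit-learn scorers follow a "higher is better" convention, so `neg_root_mean_squared_error` returns RMSE with a minus sign.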

## 9. Feature Importance
Now let's look at the feature importance ranking of our RF model:

In [None]:
best_model = grid_search.best_estimator_

# 1) Pull the RF model out of the pipeline
rf_model = best_model.named_steps["model"]

# 2) Pull out the TF-IDF transformer
tfidf = best_model.named_steps["preprocess"].named_transformers_["text"]

# 3) Get the TF-IDF feature names
tfidf_features = tfidf.get_feature_names_out()

# 4) Numeric feature list + TF-IDF (same order as the ColumnTransformer output)
all_features = numeric_features + list(tfidf_features)

print("[INFO] Total number of features:", len(all_features))
print("[INFO] Number of RF importances:", len(rf_model.feature_importances_))

# 5) Feature importance dataframe
fi = pd.DataFrame({
    "feature": all_features,
    "importance": rf_model.feature_importances_
}).sort_values(by="importance", ascending=False)

fi.head(20)   # show the top 20 features


## 10. Final Evaluation on the Test Set

At this point:

- `grid_search.best_estimator_` gives us the Random Forest pipeline with the best hyperparameters.
- We evaluate this model on the **test set** that was held out at the start.

Metrics computed:

- **RMSE (primary metric)**
- **MSE**
- **R²**


In [None]:
best_model = grid_search.best_estimator_

# Test set predictions
y_pred = best_model.predict(X_test_enc)

# Secondary: MSE
mse = mean_squared_error(y_test, y_pred)
# Primary: RMSE, computed as sqrt(MSE); the `squared` argument of
# mean_squared_error is deprecated in recent scikit-learn releases
rmse = np.sqrt(mse)
# Secondary: R²
r2 = r2_score(y_test, y_pred)

print("========== TEST SET PERFORMANCE ==========")
print(f"RMSE (Primary):   {rmse:.4f}")
print(f"MSE  (Secondary): {mse:.4f}")
print(f"R²   (Secondary): {r2:.4f}")