Let’s walk through a complete, modular structure for your Movie Rating Prediction with Python project, using an IMDb-style dataset.

The dataset has columns like:

- Name
- Year
- Duration (in "min" format)
- Genre (sometimes multiple, comma-separated)
- Rating (numeric, sometimes missing)
- Votes (numeric)
- Director
- Actor 1, Actor 2, Actor 3

### Loading required libraries

In [44]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from category_encoders import TargetEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestRegressor
from xgboost import XGBRegressor
from sklearn.linear_model import LinearRegression
import warnings
warnings.filterwarnings('ignore')

### Loading dataset


In [45]:
df = pd.read_csv(r"C:\Users\HP\OneDrive\Desktop\Indolike\IMDb Movies India.csv", encoding='ISO-8859-1')
df.head()

|   | Name | Year | Duration | Genre | Rating | Votes | Director | Actor 1 | Actor 2 | Actor 3 |
|---|------|------|----------|-------|--------|-------|----------|---------|---------|---------|
| 0 |  |  |  | Drama |  |  | J.S. Randhawa | Manmauji | Birbal | Rajendra Bhatia |
| 1 | #Gadhvi (He thought he was Gandhi) | (2019) | 109 min | Drama | 7.0 | 8.0 | Gaurav Bakshi | Rasika Dugal | Vivek Ghamande | Arvind Jangid |
| 2 | #Homecoming | (2021) | 90 min | Drama, Musical |  |  | Soumyajit Majumdar | Sayani Gupta | Plabita Borthakur | Roy Angana |
| 3 | #Yaaram | (2019) | 110 min | Comedy, Romance | 4.4 | 35.0 | Ovais Khan | Prateik | Ishita Raj | Siddhant Kapoor |
| 4 | ...And Once Again | (2010) | 105 min | Drama |  |  | Amol Palekar | Rajat Kapoor | Rituparna Sengupta | Antara Mali |


### Basic Information Checking

In [46]:
df.shape

(15509, 10)

In [47]:
df.columns

Index(['Name', 'Year', 'Duration', 'Genre', 'Rating', 'Votes', 'Director',
       'Actor 1', 'Actor 2', 'Actor 3'],
      dtype='object')

In [48]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15509 entries, 0 to 15508
Data columns (total 10 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Name      15509 non-null  object 
 1   Year      14981 non-null  object 
 2   Duration  7240 non-null   object 
 3   Genre     13632 non-null  object 
 4   Rating    7919 non-null   float64
 5   Votes     7920 non-null   object 
 6   Director  14984 non-null  object 
 7   Actor 1   13892 non-null  object 
 8   Actor 2   13125 non-null  object 
 9   Actor 3   12365 non-null  object 
dtypes: float64(1), object(9)
memory usage: 1.2+ MB


In [49]:
df.isnull().sum()

Name           0
Year         528
Duration    8269
Genre       1877
Rating      7590
Votes       7589
Director     525
Actor 1     1617
Actor 2     2384
Actor 3     3144
dtype: int64

### Cleaning the data

In [50]:
# Basic cleaning: drop exact duplicate rows
df.drop_duplicates(inplace=True)

# Drop rows where the target is missing; they cannot be used for supervised training
target_col = "Rating"  # Change this if your target column has a different name
df = df.dropna(subset=[target_col])
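
The `df.info()` output above lists `Year`, `Duration` and `Votes` as `object` columns, so a later `select_dtypes(include=['int64', 'float64'])` call will not treat them as numeric. One possible way to convert them is sketched below. It assumes the formats visible in `df.head()` (`"(2019)"`, `"109 min"`) and further assumes that `Votes` may contain commas as thousands separators; `parse_numeric_columns` is a hypothetical helper name, and anything that does not match these assumed formats simply becomes `NaN`.

```python
import pandas as pd

def parse_numeric_columns(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with Year, Duration and Votes converted to numbers.

    Hypothetical helper: values that do not match the assumed formats become NaN.
    """
    out = frame.copy()
    # "(2019)" -> 2019: keep the first run of four digits
    out["Year"] = pd.to_numeric(
        out["Year"].astype(str).str.extract(r"(\d{4})", expand=False),
        errors="coerce",
    )
    # "109 min" -> 109: keep the leading run of digits
    out["Duration"] = pd.to_numeric(
        out["Duration"].astype(str).str.extract(r"(\d+)", expand=False),
        errors="coerce",
    )
    # "1,086" -> 1086 (the comma as thousands separator is an assumption)
    out["Votes"] = pd.to_numeric(
        out["Votes"].astype(str).str.replace(",", "", regex=False),
        errors="coerce",
    )
    return out

# Tiny made-up sample in the same shape as the dataset columns
sample = pd.DataFrame({
    "Year": ["(2019)", None],
    "Duration": ["109 min", None],
    "Votes": ["1,086", "8"],
})
parsed = parse_numeric_columns(sample)
print(parsed.dtypes)
```

Calling such a helper on `df` before building `X` would let the numeric pipeline (median imputation and scaling) handle these three columns instead of the categorical encoders.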

### Target column and features

In [51]:
X = df.drop(columns=[target_col])
y = df[target_col]


### Preprocessing

In [52]:
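# Identify categorical & numeric features
# Note: per df.info() above, every column except Rating is object dtype,
# so num_features is empty unless Year, Duration and Votes are converted first
cat_features = X.select_dtypes(include=['object', 'category']).columns.tolist()
num_features = X.select_dtypes(include=['int64', 'float64']).columns.tolist()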

# Encoding strategy: OneHot for low-cardinality, Target Encoding for high-cardinality
low_card = [col for col in cat_features if X[col].nunique() <= 15]
high_card = [col for col in cat_features if X[col].nunique() > 15]
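
Because `Genre` can hold several comma-separated values, counting unique strings treats `"Drama, Musical"` as a category unrelated to `"Drama"`. A minimal sketch of an alternative, assuming the separator is `", "` as in the `df.head()` output, is to expand the column into one indicator per individual genre with `Series.str.get_dummies`. The sample values here are taken from the preview rows; the `Genre_` prefix is just an illustrative naming choice.

```python
import pandas as pd

genres = pd.Series(
    ["Drama", "Drama, Musical", "Comedy, Romance", None],
    name="Genre",
)

# Split each entry on ", " and build one 0/1 indicator column per genre.
# A missing Genre yields a row with no indicator set.
genre_flags = genres.str.get_dummies(sep=", ").add_prefix("Genre_")
print(genre_flags)
```

The resulting indicator columns could be joined to `X` in place of the raw `Genre` string, which keeps the shared "Drama" signal between single-genre and multi-genre films.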

In [53]:
# Preprocessing pipelines
numeric_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler())
])

low_card_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore"))
])

high_card_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("target_enc", TargetEncoder())
])
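
To make the target-encoding step less of a black box, here is a small sketch of the general idea in plain pandas: each category is replaced by a summary of the target values seen for that category in the training rows. The smoothing formula and the weight `m` below are illustrative assumptions and are not necessarily what `category_encoders.TargetEncoder` computes internally; the directors and ratings are made up.

```python
import pandas as pd

# Made-up training rows
train = pd.DataFrame({
    "Director": ["A", "A", "A", "B", "C"],
    "Rating":   [7.0, 8.0, 6.0, 4.0, 9.0],
})

global_mean = train["Rating"].mean()
m = 2.0  # hypothetical smoothing weight: larger values pull harder toward the global mean

stats = train.groupby("Director")["Rating"].agg(["mean", "count"])
# Shrink each category mean toward the global mean; rare categories shrink the most
encoding = (stats["count"] * stats["mean"] + m * global_mean) / (stats["count"] + m)

# Apply the mapping learned on the training rows only;
# directors never seen in training fall back to the global mean
new_rows = pd.Series(["A", "Z"], name="Director")
encoded = new_rows.map(encoding).fillna(global_mean)
print(encoded)
```

The important design point is that the mapping is learned from training data alone. Fitting the encoder inside a `Pipeline`, as this notebook does, keeps test-set ratings out of the encoding.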

# ColumnTransformer: route each group of columns to its own preprocessing pipeline
preprocessor = ColumnTransformer(transformers=[
    ("num", numeric_transformer, num_features),
    ("low_card_cat", low_card_transformer, low_card),
    ("high_card_cat", high_card_transformer, high_card)
])

# Candidate models to compare
models = {
    "LinearRegression": LinearRegression(),
    "RandomForest": RandomForestRegressor(n_estimators=200, random_state=42),
    "XGBoost": XGBRegressor(n_estimators=300, learning_rate=0.1, max_depth=6, random_state=42)
}

### Train/Test Split

In [54]:
# Train-test split: hold out 20% of the rows for evaluation
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

### Model building and evaluation

In [55]:
for name, model in models.items():
    pipe = Pipeline(steps=[("preprocessor", preprocessor),
                           ("model", model)])
    pipe.fit(X_train, y_train)
    y_pred = pipe.predict(X_test)

    mse = mean_squared_error(y_test, y_pred)
    r2 = r2_score(y_test, y_pred)

    print(f"\n{name} Performance:")
    print(f"MSE: {mse:.4f}")
    print(f"R² Score: {r2:.4f}")



LinearRegression Performance:
MSE: 1.8582
R² Score: 0.0005

RandomForest Performance:
MSE: 1.8570
R² Score: 0.0012

XGBoost Performance:
MSE: 1.8458
R² Score: 0.0072
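
An R² close to zero means a model explains almost none of the variance in the target, which is roughly what a constant prediction of the mean rating achieves. The sketch below illustrates that reference point with scikit-learn's `DummyRegressor` and also reports RMSE, which is in the same units as the rating. The data are synthetic and made up purely for illustration; they are not the movie dataset.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_squared_error, r2_score

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))                  # synthetic features
y_demo = rng.normal(loc=6.0, scale=1.4, size=200)   # synthetic "ratings", unrelated to X_demo

# Baseline that ignores the features and always predicts the mean of y
baseline = DummyRegressor(strategy="mean").fit(X_demo, y_demo)
pred = baseline.predict(X_demo)

mse = mean_squared_error(y_demo, pred)
rmse = np.sqrt(mse)            # root of MSE, expressed in rating points
r2 = r2_score(y_demo, pred)    # zero for the mean predictor on the data it was fit on

print(f"Baseline MSE:  {mse:.4f}")
print(f"Baseline RMSE: {rmse:.4f}")
print(f"Baseline R²:   {r2:.4f}")
```

Fitting the same kind of baseline on `X_train`, `y_train` and scoring it on `X_test`, `y_test` would show how much, if anything, each model in the loop above gains over simply predicting the average rating.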
