# Movie Rating Prediction


Building a regression model to predict movie ratings from features such as genre, director, and actors.


Dataset: [Movie Rating Prediction Dataset](https://www.kaggle.com/datasets/adrianmcmahon/imdb-india-movies)

---

## Introduction
This project focuses on building a regression model to predict movie ratings based on features like genre, director, and actors.

In [2]:

# Import necessary libraries (regression only; classification imports are not needed)
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import OneHotEncoder


### Step 1: Load the Dataset
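We first load the dataset, preview it, and clean the columns used later: rows without a rating are dropped, remaining gaps are forward-filled, and `Year` and `Duration` are converted to numbers.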

In [3]:
# Load the CSV (local path from the author's machine; adjust as needed)
data = pd.read_csv(r'C:\Users\gaura\Downloads\IMDb Movies India.csv', encoding="latin1")
data.head()

# Handle missing values
data.dropna(subset=["Rating"], inplace=True)  # Drop rows where Rating is missing
data.ffill(inplace=True)  # Forward-fill remaining NaNs from the previous row

# Clean Year column: keep only the digits
data["Year"] = data["Year"].str.extract(r"(\d+)", expand=False).astype(float)

# Convert Duration to numerical format by stripping the " min" suffix
data["Duration"] = data["Duration"].str.replace(" min", "", regex=False).astype(float)
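The cleaning steps above can be tried on a few toy rows. This is a minimal sketch, assuming the raw `Year` values look like `(2019)` and the raw `Duration` values look like `109 min`; the `toy` frame and its values are hypothetical and not taken from the dataset.

```python
import pandas as pd

# Hypothetical toy rows mimicking the assumed raw string formats.
toy = pd.DataFrame({
    "Year": ["(2019)", "(1997)", None],
    "Duration": ["109 min", None, "135 min"],
    "Rating": [7.0, None, 6.2],
})

# Drop rows with no target value, as in the cell above.
toy = toy.dropna(subset=["Rating"]).copy()

# Keep only the digits of Year; expand=False returns a Series.
toy["Year"] = toy["Year"].str.extract(r"(\d+)", expand=False).astype(float)

# Strip the unit from Duration; regex=False treats " min" as a literal string.
toy["Duration"] = toy["Duration"].str.replace(" min", "", regex=False).astype(float)

print(toy)
```

The middle row is dropped because it has no rating, and the missing `Year` in the last row stays `NaN` here because this sketch does not forward-fill.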


### Step 2: Data Overview
Let's inspect the dataset for missing values and other irregularities.

In [4]:
# Display dataset information
print(data.info())
print(data.describe())

<class 'pandas.core.frame.DataFrame'>
Index: 7919 entries, 1 to 15508
Data columns (total 10 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Name      7919 non-null   object 
 1   Year      7919 non-null   float64
 2   Duration  7919 non-null   float64
 3   Genre     7919 non-null   object 
 4   Rating    7919 non-null   float64
 5   Votes     7919 non-null   object 
 6   Director  7919 non-null   object 
 7   Actor 1   7919 non-null   object 
 8   Actor 2   7919 non-null   object 
 9   Actor 3   7919 non-null   object 
dtypes: float64(3), object(7)
memory usage: 680.5+ KB
None
              Year     Duration       Rating
count  7919.000000  7919.000000  7919.000000
mean   1993.321758   132.969062     5.841621
std      20.463770    26.228506     1.381777
min    1917.000000    21.000000     1.100000
25%    1979.500000   118.000000     4.900000
50%    1997.000000   135.000000     6.000000
75%    2011.000000   150.000000     6.800000
max    202

### Step 3: Preprocess Data
Handle missing values and encode categorical variables to prepare the dataset for modeling.

In [5]:
# Fill missing values in the 'Genre' column
data['Genre'] = data['Genre'].fillna('Unknown')

# Convert the 'Genre' column to categorical and encode it as codes
data['Genre'] = data['Genre'].astype('category').cat.codes
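To see what `astype('category').cat.codes` produces, here is a small sketch on a hypothetical series of genre labels. The labels are made up for illustration; the point is only how labels map to integer codes.

```python
import pandas as pd

# Hypothetical genre labels, not taken from the dataset.
genres = pd.Series(["Drama", "Action", "Drama", "Comedy"])

# Categories are sorted alphabetically, then numbered from 0.
cat = genres.astype("category")
codes = cat.cat.codes

# Keep the code-to-label mapping so the codes can be decoded later.
mapping = dict(enumerate(cat.cat.categories))

print(codes.tolist())
print(mapping)
```

The codes follow alphabetical order and carry no numeric meaning, so a model should not treat them as a quantity. In this notebook they are one-hot encoded again before training.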


### Step 4: Define Features and Target
Select the features (X) and the target variable (y) for prediction.

In [6]:
# Combine actor columns into a single 'Actors' column
data['Actors'] = data[['Actor 1', 'Actor 2', 'Actor 3']].apply(lambda x: ', '.join(x.dropna()), axis=1)

# Define features (X) and target (y)
X = data[['Genre', 'Director', 'Actors']]  # Categorical features used by the model
y = data['Rating']
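The actor-combining step can be checked on a couple of toy rows. This is a sketch with hypothetical names; it shows that missing actors are skipped and that each distinct trio becomes its own category.

```python
import pandas as pd

# Hypothetical cast lists; the second row has no second actor.
toy = pd.DataFrame({
    "Actor 1": ["A", "B"],
    "Actor 2": ["C", None],
    "Actor 3": ["D", "E"],
})

# Join the non-missing names in each row into one string.
actor_cols = ["Actor 1", "Actor 2", "Actor 3"]
toy["Actors"] = toy[actor_cols].apply(lambda row: ", ".join(row.dropna()), axis=1)

print(toy["Actors"].tolist())
```

Because the joined string is treated as a single category, two movies share an `Actors` value only when their full cast lists match exactly.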

### Step 5: Split the Data
Split the dataset into training and testing sets for model evaluation.

In [8]:
# Split the data: 80% for training, 20% held out for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
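To illustrate what the split produces, here is a small sketch that applies the same `test_size` and `random_state` to a placeholder array of 7919 entries, the row count reported by `data.info()` above. The array `idx` is an assumption standing in for the real feature table.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder: one entry per row of the cleaned dataset.
idx = np.arange(7919)

# Same settings as the notebook; run twice to check reproducibility.
train_a, test_a = train_test_split(idx, test_size=0.2, random_state=42)
train_b, test_b = train_test_split(idx, test_size=0.2, random_state=42)

print(len(train_a), len(test_a))
print(np.array_equal(test_a, test_b))
```

Fixing `random_state` makes the held-out rows identical across runs, which keeps the reported metrics comparable between experiments.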


### Step 6: Train the Model
One-hot encode the categorical features, then fit a linear regression model to the training data.

In [9]:
# One-hot encode the categorical columns (Genre codes, Director, Actors)
encoder = OneHotEncoder(handle_unknown='ignore')  # Unseen test categories encode as all zeros
X_train_encoded = encoder.fit_transform(X_train).toarray()

# Train the regression model
model = LinearRegression()
model.fit(X_train_encoded, y_train)
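The effect of `handle_unknown='ignore'` is easiest to see on a tiny example. The sketch below uses hypothetical director labels `A`, `B`, and `C`; it fits the encoder on training labels and then transforms a test set containing a label that was never seen.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical labels; "C" appears only in the test set.
train = pd.DataFrame({"Director": ["A", "B", "A"]})
test = pd.DataFrame({"Director": ["B", "C"]})

# Fit on training data only, then reuse the fitted encoder for the test data.
enc = OneHotEncoder(handle_unknown="ignore")
train_enc = enc.fit_transform(train).toarray()
test_enc = enc.transform(test).toarray()

print(train_enc)
print(test_enc)
```

The unseen label `C` becomes a row of zeros, so the model falls back on its intercept and any other matching features for that row. With high-cardinality columns such as directors or joined actor strings, many test rows may end up in this situation, which could limit how much the model can learn from them.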


### Step 7: Evaluate the Model
Check the model's performance on the held-out test set using Mean Squared Error (MSE) and the R² score.

In [10]:
# Ensure X_test is encoded using the same encoder as X_train
X_test_encoded = encoder.transform(X_test).toarray()

# Ensure the shape of X_test_encoded matches the expectations
if X_test_encoded.shape[1] != model.coef_.shape[0]:
    print("Mismatch in feature dimensions. Check for consistency between training and test data.")

# Evaluate the model
y_pred = model.predict(X_test_encoded)
print("Mean Squared Error:", mean_squared_error(y_test, y_pred))
print("R2 Score:", r2_score(y_test, y_pred))

Mean Squared Error: 1.8291673319053885
R2 Score: 0.016124512326556317
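To help read these numbers, the sketch below computes both metrics by hand on hypothetical values, using the standard definitions: MSE is the mean of squared errors, and R² is one minus the ratio of the residual sum of squares to the total sum of squares around the mean. The arrays `y_true` and `y_hat` are made up for illustration.

```python
import numpy as np

# Hypothetical true ratings and predictions.
y_true = np.array([5.0, 6.0, 7.0, 8.0])
y_hat = np.array([5.5, 6.0, 6.5, 7.0])

# Mean squared error: average squared difference.
mse = np.mean((y_true - y_hat) ** 2)

# R^2 = 1 - SS_res / SS_tot.
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

# Baseline: always predict the mean rating, which gives R^2 = 0.
baseline = np.full_like(y_true, y_true.mean())
r2_baseline = 1 - np.sum((y_true - baseline) ** 2) / ss_tot

print(mse, r2, r2_baseline)
```

A constant mean prediction scores an R² of 0, so the R² of about 0.016 reported above indicates that the model is only marginally better than that baseline.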


## Conclusion
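This notebook walks through loading and cleaning the IMDb India movies dataset, one-hot encoding genre, director, and actors, and fitting a linear regression model to predict ratings. On the held-out test set the model reaches an MSE of about 1.83 and an R² of about 0.016, so it explains less than 2% of the variance in ratings and performs little better than always predicting the mean rating. The chosen features and encoding therefore leave substantial room for improvement.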