
# **Movie Rating Prediction with Python**

## **Problem Statement:**  
The goal is to create a predictive model that estimates movie ratings based on key features such as genre, director, actors, duration, and vote count.


## **Introduction:**  

In the competitive movie industry, understanding the factors that influence a film's success is essential. A key indicator of a movie's success is its rating, reflecting how critics and audiences perceive it. This rating is influenced by elements like the movie's genre, the director's reputation, the actors involved, the year of release, and box office earnings.


The ability to predict movie ratings can provide valuable insights to producers, streaming platforms, and marketers. By leveraging historical data, we aim to create a model that can forecast the ratings of new movies, assisting stakeholders in decision-making, from movie releases to promotional strategies.

## **Dataset Overview:**  

- **Name**: The title of the film.
- **Year**: The release year, stored as text in parentheses (for example, `(2019)`).
- **Duration**: The running time, stored as text (for example, `109 min`).
- **Genre**: One or more comma-separated categories such as Action, Comedy, or Drama.
- **Rating**: The target variable, reflecting the movie's rating (observed values range from 1.1 to 10.0).
- **Votes**: The number of votes the movie received.
- **Director**: The director of the movie.
- **Actor 1, Actor 2, Actor 3**: The lead actors or actresses featured in the movie.


## **Project Purpose:**  

This project aims to:
1. **Predict Movie Ratings**: Estimate ratings for new movies based on historical features, guiding industry stakeholders.
2. **Uncover Influential Factors**: Identify the most significant contributors to a movie's rating, such as genre, cast, or release year.
3. **Enhance Decision-Making**: Provide insights that can help with movie production, marketing, and release planning.
4. **Refine Recommendations**: Integrate the model into recommendation systems to suggest high-rated films based on user preferences.


The overall objective is to use machine learning techniques to predict movie ratings, allowing for data-driven decisions that optimize movie success and improve movie recommendation systems.

In [None]:
# Import Libraries

import pandas as pd  # Importing the pandas library for data manipulation
import warnings  # Importing warnings to handle warning messages
warnings.filterwarnings('ignore')  # Disabling warnings to avoid clutter in the output

from sklearn.model_selection import train_test_split  # Importing function to split the dataset into training and testing sets
from sklearn.preprocessing import LabelEncoder  # Importing LabelEncoder to convert categorical labels into numerical values
from sklearn.linear_model import LinearRegression  # Importing LinearRegression model from sklearn
from sklearn.ensemble import RandomForestRegressor  # Importing RandomForestRegressor model from sklearn
from sklearn.metrics import mean_squared_error, r2_score  # Importing evaluation metrics for regression model
import numpy as np  # Importing numpy for numerical operations


In [None]:
df = pd.read_csv("/content/IMDb Movies India.csv", encoding='latin1')

# The `encoding` argument tells pandas how to decode the bytes in the CSV file.
# latin1 (also known as ISO-8859-1) is a single-byte encoding common for Western European text.
# It is used here in case the file contains special characters that are not valid UTF-8,
# which would make decoding with the default UTF-8 encoding fail.
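
To illustrate why the encoding choice matters, here is a minimal sketch using a made-up string (it is not taken from the dataset). Bytes written as latin1 can be invalid as UTF-8, in which case decoding them with the default encoding raises an error.

```python
# Hypothetical sample text; not taken from the IMDb dataset.
raw = "Café".encode("latin1")  # the bytes b'Caf\xe9'

try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    # The lone byte 0xE9 is not a complete UTF-8 sequence
    utf8_ok = False

decoded = raw.decode("latin1")  # latin1 maps every byte to a character
print(utf8_ok, decoded)
```

Because latin1 accepts any byte, it never fails to decode, but it can silently produce wrong characters if the file was actually written in a different encoding.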

In [None]:
df.head(10)

Unnamed: 0,Name,Year,Duration,Genre,Rating,Votes,Director,Actor 1,Actor 2,Actor 3
0,,,,Drama,,,J.S. Randhawa,Manmauji,Birbal,Rajendra Bhatia
1,#Gadhvi (He thought he was Gandhi),(2019),109 min,Drama,7.0,8.0,Gaurav Bakshi,Rasika Dugal,Vivek Ghamande,Arvind Jangid
2,#Homecoming,(2021),90 min,"Drama, Musical",,,Soumyajit Majumdar,Sayani Gupta,Plabita Borthakur,Roy Angana
3,#Yaaram,(2019),110 min,"Comedy, Romance",4.4,35.0,Ovais Khan,Prateik,Ishita Raj,Siddhant Kapoor
4,...And Once Again,(2010),105 min,Drama,,,Amol Palekar,Rajat Kapoor,Rituparna Sengupta,Antara Mali
5,...Aur Pyaar Ho Gaya,(1997),147 min,"Comedy, Drama, Musical",4.7,827.0,Rahul Rawail,Bobby Deol,Aishwarya Rai Bachchan,Shammi Kapoor
6,...Yahaan,(2005),142 min,"Drama, Romance, War",7.4,1086.0,Shoojit Sircar,Jimmy Sheirgill,Minissha Lamba,Yashpal Sharma
7,.in for Motion,(2008),59 min,Documentary,,,Anirban Datta,,,
8,?: A Question Mark,(2012),82 min,"Horror, Mystery, Thriller",5.6,326.0,Allyson Patel,Yash Dave,Muntazir Ahmad,Kiran Bhatia
9,@Andheri,(2014),116 min,"Action, Crime, Thriller",4.0,11.0,Biju Bhaskar Nair,Augustine,Fathima Babu,Byon


In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15509 entries, 0 to 15508
Data columns (total 10 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Name      15509 non-null  object 
 1   Year      14981 non-null  object 
 2   Duration  7240 non-null   object 
 3   Genre     13632 non-null  object 
 4   Rating    7919 non-null   float64
 5   Votes     7920 non-null   object 
 6   Director  14984 non-null  object 
 7   Actor 1   13892 non-null  object 
 8   Actor 2   13125 non-null  object 
 9   Actor 3   12365 non-null  object 
dtypes: float64(1), object(9)
memory usage: 1.2+ MB


In [None]:
df.describe()

Unnamed: 0,Rating
count,7919.0
mean,5.841621
std,1.381777
min,1.1
25%,4.9
50%,6.0
75%,6.8
max,10.0


## **Data Cleaning:**

In [None]:
df.isnull().sum()

Unnamed: 0,0
Name,0
Year,528
Duration,8269
Genre,1877
Rating,7590
Votes,7589
Director,525
Actor 1,1617
Actor 2,2384
Actor 3,3144


In [None]:
# Drop rows with a missing target (Rating) before building the prediction model
df.dropna(subset=['Rating'],inplace=True)
df.head()

Unnamed: 0,Name,Year,Duration,Genre,Rating,Votes,Director,Actor 1,Actor 2,Actor 3
1,#Gadhvi (He thought he was Gandhi),(2019),109 min,Drama,7.0,8,Gaurav Bakshi,Rasika Dugal,Vivek Ghamande,Arvind Jangid
3,#Yaaram,(2019),110 min,"Comedy, Romance",4.4,35,Ovais Khan,Prateik,Ishita Raj,Siddhant Kapoor
5,...Aur Pyaar Ho Gaya,(1997),147 min,"Comedy, Drama, Musical",4.7,827,Rahul Rawail,Bobby Deol,Aishwarya Rai Bachchan,Shammi Kapoor
6,...Yahaan,(2005),142 min,"Drama, Romance, War",7.4,1086,Shoojit Sircar,Jimmy Sheirgill,Minissha Lamba,Yashpal Sharma
8,?: A Question Mark,(2012),82 min,"Horror, Mystery, Thriller",5.6,326,Allyson Patel,Yash Dave,Muntazir Ahmad,Kiran Bhatia


In [None]:
df.isnull().sum()

Unnamed: 0,0
Name,0
Year,0
Duration,2068
Genre,102
Rating,0
Votes,0
Director,5
Actor 1,125
Actor 2,200
Actor 3,292


In [None]:
df.dropna(subset=['Genre','Director','Actor 1','Actor 2','Actor 3'],inplace=True)

In [None]:
df.head(10)

Unnamed: 0,Name,Year,Duration,Genre,Rating,Votes,Director,Actor 1,Actor 2,Actor 3
1,#Gadhvi (He thought he was Gandhi),(2019),109 min,Drama,7.0,8,Gaurav Bakshi,Rasika Dugal,Vivek Ghamande,Arvind Jangid
3,#Yaaram,(2019),110 min,"Comedy, Romance",4.4,35,Ovais Khan,Prateik,Ishita Raj,Siddhant Kapoor
5,...Aur Pyaar Ho Gaya,(1997),147 min,"Comedy, Drama, Musical",4.7,827,Rahul Rawail,Bobby Deol,Aishwarya Rai Bachchan,Shammi Kapoor
6,...Yahaan,(2005),142 min,"Drama, Romance, War",7.4,1086,Shoojit Sircar,Jimmy Sheirgill,Minissha Lamba,Yashpal Sharma
8,?: A Question Mark,(2012),82 min,"Horror, Mystery, Thriller",5.6,326,Allyson Patel,Yash Dave,Muntazir Ahmad,Kiran Bhatia
9,@Andheri,(2014),116 min,"Action, Crime, Thriller",4.0,11,Biju Bhaskar Nair,Augustine,Fathima Babu,Byon
10,1:1.6 An Ode to Lost Love,(2004),96 min,Drama,6.2,17,Madhu Ambat,Rati Agnihotri,Gulshan Grover,Atul Kulkarni
11,1:13:7 Ek Tera Saath,(2016),120 min,Horror,5.9,59,Arshad Siddiqui,Pankaj Berry,Anubhav Dhir,Hritu Dudani
12,100 Days,(1991),161 min,"Horror, Romance, Thriller",6.5,983,Partho Ghosh,Jackie Shroff,Madhuri Dixit,Javed Jaffrey
13,100% Love,(2012),166 min,"Comedy, Drama, Romance",5.7,512,Rabi Kinagi,Jeet,Koyel Mallick,Sujoy Ghosh


In [None]:
df.isnull().sum()

Unnamed: 0,0
Name,0
Year,0
Duration,1899
Genre,0
Rating,0
Votes,0
Director,0
Actor 1,0
Actor 2,0
Actor 3,0


## **Data Preparation:**

In [None]:
df["Year"]=df['Year'].str.strip('()').astype(int)
df.head()

Unnamed: 0,Name,Year,Duration,Genre,Rating,Votes,Director,Actor 1,Actor 2,Actor 3
1,#Gadhvi (He thought he was Gandhi),2019,109 min,Drama,7.0,8,Gaurav Bakshi,Rasika Dugal,Vivek Ghamande,Arvind Jangid
3,#Yaaram,2019,110 min,"Comedy, Romance",4.4,35,Ovais Khan,Prateik,Ishita Raj,Siddhant Kapoor
5,...Aur Pyaar Ho Gaya,1997,147 min,"Comedy, Drama, Musical",4.7,827,Rahul Rawail,Bobby Deol,Aishwarya Rai Bachchan,Shammi Kapoor
6,...Yahaan,2005,142 min,"Drama, Romance, War",7.4,1086,Shoojit Sircar,Jimmy Sheirgill,Minissha Lamba,Yashpal Sharma
8,?: A Question Mark,2012,82 min,"Horror, Mystery, Thriller",5.6,326,Allyson Patel,Yash Dave,Muntazir Ahmad,Kiran Bhatia


In [None]:
df['Votes']=df['Votes'].str.replace(',','').astype(int)
df.head()

Unnamed: 0,Name,Year,Duration,Genre,Rating,Votes,Director,Actor 1,Actor 2,Actor 3
1,#Gadhvi (He thought he was Gandhi),2019,109 min,Drama,7.0,8,Gaurav Bakshi,Rasika Dugal,Vivek Ghamande,Arvind Jangid
3,#Yaaram,2019,110 min,"Comedy, Romance",4.4,35,Ovais Khan,Prateik,Ishita Raj,Siddhant Kapoor
5,...Aur Pyaar Ho Gaya,1997,147 min,"Comedy, Drama, Musical",4.7,827,Rahul Rawail,Bobby Deol,Aishwarya Rai Bachchan,Shammi Kapoor
6,...Yahaan,2005,142 min,"Drama, Romance, War",7.4,1086,Shoojit Sircar,Jimmy Sheirgill,Minissha Lamba,Yashpal Sharma
8,?: A Question Mark,2012,82 min,"Horror, Mystery, Thriller",5.6,326,Allyson Patel,Yash Dave,Muntazir Ahmad,Kiran Bhatia


In [None]:
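# Remove the " min" suffix; str.strip('min') would strip the characters m, i, n instead of the suffix
df['Duration'] = df['Duration'].str.replace(' min', '', regex=False)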
df.head()

Unnamed: 0,Name,Year,Duration,Genre,Rating,Votes,Director,Actor 1,Actor 2,Actor 3
1,#Gadhvi (He thought he was Gandhi),2019,109,Drama,7.0,8,Gaurav Bakshi,Rasika Dugal,Vivek Ghamande,Arvind Jangid
3,#Yaaram,2019,110,"Comedy, Romance",4.4,35,Ovais Khan,Prateik,Ishita Raj,Siddhant Kapoor
5,...Aur Pyaar Ho Gaya,1997,147,"Comedy, Drama, Musical",4.7,827,Rahul Rawail,Bobby Deol,Aishwarya Rai Bachchan,Shammi Kapoor
6,...Yahaan,2005,142,"Drama, Romance, War",7.4,1086,Shoojit Sircar,Jimmy Sheirgill,Minissha Lamba,Yashpal Sharma
8,?: A Question Mark,2012,82,"Horror, Mystery, Thriller",5.6,326,Allyson Patel,Yash Dave,Muntazir Ahmad,Kiran Bhatia
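
When removing a unit suffix such as ` min`, it helps to know that `str.strip` takes a set of characters to trim from both ends, not a literal suffix. The sketch below uses a small made-up Series (an assumption for illustration) to compare `str.strip('min')` with a literal `str.replace`.

```python
import pandas as pd

# Hypothetical sample values in the same format as the Duration column
s = pd.Series(["109 min", "90 min", None])

# strip('min') trims any of the characters 'm', 'i', 'n' from both ends,
# so the space before "min" is left behind
stripped = s.str.strip("min")

# A literal replacement removes exactly the " min" suffix
cleaned = s.str.replace(" min", "", regex=False)

print(repr(stripped.iloc[0]))   # '109 '
print(repr(cleaned.iloc[0]))    # '109'
print("minimum".strip("min"))   # u
```

Both forms leave missing values as missing, so a later `pd.to_numeric(..., errors='coerce')` step still behaves the same way.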


In [None]:
# Convert the 'Duration' column to numeric, coercing any non-numeric values to NaN
df['Duration'] = pd.to_numeric(df['Duration'], errors='coerce')
# Fill NaN values with the mean of the numeric values
df['Duration'] = df['Duration'].fillna(df['Duration'].mean())
# Finally convert to int to avoid float values for duration
df['Duration'] = df['Duration'].astype(int)
df.head()

Unnamed: 0,Name,Year,Duration,Genre,Rating,Votes,Director,Actor 1,Actor 2,Actor 3
1,#Gadhvi (He thought he was Gandhi),2019,109,Drama,7.0,8,Gaurav Bakshi,Rasika Dugal,Vivek Ghamande,Arvind Jangid
3,#Yaaram,2019,110,"Comedy, Romance",4.4,35,Ovais Khan,Prateik,Ishita Raj,Siddhant Kapoor
5,...Aur Pyaar Ho Gaya,1997,147,"Comedy, Drama, Musical",4.7,827,Rahul Rawail,Bobby Deol,Aishwarya Rai Bachchan,Shammi Kapoor
6,...Yahaan,2005,142,"Drama, Romance, War",7.4,1086,Shoojit Sircar,Jimmy Sheirgill,Minissha Lamba,Yashpal Sharma
8,?: A Question Mark,2012,82,"Horror, Mystery, Thriller",5.6,326,Allyson Patel,Yash Dave,Muntazir Ahmad,Kiran Bhatia


In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 7558 entries, 1 to 15508
Data columns (total 10 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Name      7558 non-null   object 
 1   Year      7558 non-null   int64  
 2   Duration  7558 non-null   int64  
 3   Genre     7558 non-null   object 
 4   Rating    7558 non-null   float64
 5   Votes     7558 non-null   int64  
 6   Director  7558 non-null   object 
 7   Actor 1   7558 non-null   object 
 8   Actor 2   7558 non-null   object 
 9   Actor 3   7558 non-null   object 
dtypes: float64(1), int64(3), object(6)
memory usage: 649.5+ KB


In [None]:
df.describe()

Unnamed: 0,Year,Duration,Rating,Votes
count,7558.0,7558.0,7558.0,7558.0
mean,1993.421011,133.328791,5.811127,2029.123842
std,20.004711,21.909669,1.368255,11868.695754
min,1917.0,21.0,1.1,5.0
25%,1980.0,125.0,4.9,18.0
50%,1996.0,133.0,6.0,61.0
75%,2011.0,144.0,6.8,456.0
max,2021.0,321.0,10.0,591417.0


In [None]:
df.duplicated().sum()

0

# **Building Movie Rating Prediction Model :**

## **Feature/Data Engineering :**

In [None]:
categorical_columns = df.select_dtypes(include=['object']).columns

print(f"Categorical Columns:  {categorical_columns}")

Categorical Columns:  Index(['Name', 'Genre', 'Director', 'Actor 1', 'Actor 2', 'Actor 3'], dtype='object')


In [None]:
# Feature Engineering
# Encode each categorical column as integer codes using LabelEncoder
label_encoders = {}
for col in categorical_columns:
    le = LabelEncoder()
    df[col] = le.fit_transform(df[col])
    label_encoders[col] = le  # keep the fitted encoder so the codes can be mapped back later

In [None]:
df.head()

Unnamed: 0,Name,Year,Duration,Genre,Rating,Votes,Director,Actor 1,Actor 2,Actor 3
1,0,2019,109,252,7.0,8,755,1686,2725,373
3,1,2019,110,205,4.4,35,1637,1500,865,2543
5,4,1997,147,175,4.7,827,1881,481,84,2422
6,5,2005,142,315,7.4,1086,2486,878,1346,2996
8,89,2012,82,351,5.6,326,161,2385,1416,1189
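
As a small illustration of what label encoding does, the sketch below encodes a made-up `Genre` column (the toy values are assumptions, not rows from the dataset), keeps the fitted encoder, and maps the codes back to the original labels.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy data for illustration only
toy = pd.DataFrame({"Genre": ["Drama", "Comedy", "Drama", "Action"]})

encoders = {}
le = LabelEncoder()
toy["Genre_code"] = le.fit_transform(toy["Genre"])  # classes are sorted alphabetically
encoders["Genre"] = le                              # store the encoder for later decoding

codes = toy["Genre_code"].tolist()
decoded = encoders["Genre"].inverse_transform(toy["Genre_code"]).tolist()

print(codes)    # [2, 1, 2, 0]
print(decoded)  # ['Drama', 'Comedy', 'Drama', 'Action']
```

The integer codes follow alphabetical order of the labels, so they carry no meaningful ranking. Tree-based models can usually work with such codes, while linear and distance-based models may read the ordering as if it were real.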


## **Prediction Model Building :**

In [None]:
# Selecting features and target
features = ['Duration', 'Genre','Votes' ,'Director', 'Actor 1', 'Actor 2', 'Actor 3']
target = 'Rating'
X = df[features]
y = df[target]

In [None]:
# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print(f"X_train shape : {X_train.shape}")
print(f"X_test shape : {X_test.shape}")
print(f"y_train shape: {y_train.shape}")
print(f"y_test shape : {y_test.shape}")

X_train shape : (6046, 7)
X_test shape : (1512, 7)
y_train shape: (6046,)
y_test shape : (1512,)


In [None]:
linreg = LinearRegression()  # Instantiate the Linear Regression model
linreg.fit(X_train, y_train)  # Fit the model to the training data
Y_pred = linreg.predict(X_test)  # Predict using the test data
acc_linreg = round(linreg.score(X_train, y_train) * 100, 2)  # R^2 on the training data, reported here as "accuracy"
print(f"Accuracy of Linear Regression model: {acc_linreg}%.")

# Calculate Mean Squared Error (MSE)
# A lower MSE indicates a better model performance
mse_lr = round(mean_squared_error(y_test, Y_pred), 2)  # Compute MSE for the model
print(f"Mean Squared Error (MSE): {mse_lr}.")

Accuracy of Linear Regression model: 3.91%.
Mean Squared Error (MSE): 1.79.
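
For scikit-learn regressors, `score` returns the coefficient of determination (R²), and the value printed above is computed on the training data. A minimal sketch on synthetic data (generated with `make_regression`, an assumption for illustration and not the movie dataset) shows how the same metric could also be reported on the held-out test set, alongside RMSE.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; the real notebook would use X and y from the DataFrame
X, y = make_regression(n_samples=500, n_features=7, noise=10.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LinearRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)

train_r2 = model.score(X_train, y_train)            # R^2 on data the model has seen
test_r2 = r2_score(y_test, y_pred)                  # R^2 on unseen data
rmse = np.sqrt(mean_squared_error(y_test, y_pred))  # error in the target's own units

print(f"Train R^2: {train_r2:.3f}")
print(f"Test R^2:  {test_r2:.3f}")
print(f"Test RMSE: {rmse:.3f}")
```

Reporting the train and test scores side by side makes overfitting easier to spot than a training score alone.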


In [None]:
# K Nearest Neighbors Regression

from sklearn.neighbors import KNeighborsRegressor
knn=KNeighborsRegressor(n_neighbors=5)
knn.fit(X_train,y_train)
Y_pred=knn.predict(X_test)
acc_knn=round(knn.score(X_train,y_train)*100,2)
print(f"Accuracy of K Nearest Neighbors model : {acc_knn} %.")

Accuracy of K Nearest Neighbors model : 28.36 %.


In [None]:
# Calculate Mean Squared Error (MSE)
# A good model will have an MSE value closer to 0
# larger the number the larger the error
mse_knn = round(mean_squared_error(y_test, Y_pred),2)
print(f"Mean Squared Error: {mse_knn} .")


Mean Squared Error: 1.99 .


In [None]:
# Decision Tree
from sklearn.tree import DecisionTreeRegressor # Import DecisionTreeRegressor

decision_tree = DecisionTreeRegressor() # Initialize the DecisionTreeRegressor
decision_tree.fit(X_train, y_train)
Y_pred = decision_tree.predict(X_test)
acc_decision_tree = round(decision_tree.score(X_train, y_train) * 100, 2)
print(f"Accuracy of Decision Tree model : {acc_decision_tree} %. ")


# Calculate Mean Squared Error (MSE)
# A good model will have an MSE value closer to 0
# larger the number the larger the error
mse_dec = round(mean_squared_error(y_test, Y_pred), 2)
print(f"Mean Squared Error: {mse_dec} .")

Accuracy of Decision Tree model : 100.0 %. 
Mean Squared Error: 3.1 .
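
A training score of 100% alongside a high test error is a typical sign that an unconstrained tree has memorized the training data. The sketch below, on synthetic data (an assumption for illustration; `max_depth=4` is an arbitrary example value, not a tuned setting), compares a fully grown tree with a depth-limited one.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic noisy data standing in for the movie features
X, y = make_regression(n_samples=600, n_features=7, noise=25.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Fully grown tree: keeps splitting until every training sample is fitted exactly
full_tree = DecisionTreeRegressor(random_state=0).fit(X_train, y_train)

# Depth-limited tree: a simple form of regularization
shallow_tree = DecisionTreeRegressor(max_depth=4, random_state=0).fit(X_train, y_train)

for name, tree in [("full", full_tree), ("max_depth=4", shallow_tree)]:
    print(
        f"{name}: train R^2 = {tree.score(X_train, y_train):.3f}, "
        f"test R^2 = {tree.score(X_test, y_test):.3f}"
    )
```

Whether a shallower tree actually generalizes better depends on the data, so the depth would normally be chosen with cross-validation.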


In [None]:
# Random Forest
from sklearn.ensemble import RandomForestRegressor  # Import RandomForestRegressor
random_forest = RandomForestRegressor()  # Initialize the model (no random_state, so results vary slightly between runs)
random_forest.fit(X_train,y_train)
Y_pred=random_forest.predict(X_test)
acc_random_forest=round(random_forest.score(X_train,y_train)*100,2)
print(f"Accuracy of Random Forest model : {acc_random_forest} %.")

# Calculate Mean Squared Error (MSE)
# A good model will have an MSE value closer to 0
# larger the number the larger the error
mse_rf= round(mean_squared_error(y_test, Y_pred),2)
print(f"Mean Squared Error: {mse_rf} .")

Accuracy of Random Forest model : 88.6 %.
Mean Squared Error: 1.58 .


In [None]:
data = {'Model Name': ['Linear Regression', 'K Nearest Neighbors', 'Decision Tree', 'Random Forest'],
        'Accuracy Score': [acc_linreg, acc_knn, acc_decision_tree, acc_random_forest],  # training-set R^2 (%)
        'Mean Squared Error': [mse_lr, mse_knn, mse_dec, mse_rf]}  # test-set MSE

df_results = pd.DataFrame(data)
df_results = df_results.sort_values(by='Accuracy Score', ascending=False)
df_results

Unnamed: 0,Model Name,Accuracy Score,Mean Squared Error
2,Decision Tree,100.0,3.1
3,Random Forest,88.6,1.58
1,K Nearest Neighbors,28.36,1.99
0,Linear Regression,3.91,1.79
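
Because the "Accuracy Score" column is measured on the training data, sorting by it puts the overfitted Decision Tree first. As an alternative view, the sketch below rebuilds the table from the values printed above and ranks the models by test-set MSE instead; the column names `Train R2 (%)` and `Test MSE` are illustrative renamings.

```python
import pandas as pd

# Values copied from the results table above
results = pd.DataFrame({
    "Model Name": ["Linear Regression", "K Nearest Neighbors", "Decision Tree", "Random Forest"],
    "Train R2 (%)": [3.91, 28.36, 100.0, 88.6],
    "Test MSE": [1.79, 1.99, 3.1, 1.58],
})

# Lower test MSE is better, so sort ascending
ranked = results.sort_values(by="Test MSE", ascending=True).reset_index(drop=True)
print(ranked)
```

Ranked this way, Random Forest comes first and the Decision Tree last, which matches the recommendation that follows.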


**Recommendation:**

After comparing the models on the training-set R² score (reported above as "Accuracy") and the test-set Mean Squared Error (MSE), the **Random Forest** model emerges as the most reliable choice for predicting movie ratings. Here's why:

- **High Accuracy with Low Error:**  
  - Accuracy: **88.6%**. The Random Forest fits the training data closely without reaching the 100% score that signals pure memorization.
  - MSE: **1.58**. This is the lowest test error of the four models.
  
- **Generalization:**  
  Random Forests are less prone to overfitting than a single Decision Tree because they average the predictions of many trees. This makes them more reliable on unseen data.

### **Detailed Model Analysis:**

1. **Decision Tree:**
   - **Accuracy:** **100%**. A perfect score on the training data suggests overfitting: the model memorized the training data, which may not generalize well to new data.
   - **MSE:** **3.1**. Despite the perfect training score, this is the highest test error of the four models, which indicates poor predictions on unseen data.

2. **Random Forest:**
   - Offers a **balance** between accuracy (**88.6%**) and error (**MSE: 1.58**).
   - By averaging multiple decision trees, it provides a more reliable and robust prediction.

3. **K-Nearest Neighbors (KNN):**
   - **Accuracy:** **28.36%**. The model captures only a small part of the variation in ratings, even on the training data.
   - **MSE:** **1.99**. This is lower than the Decision Tree's error but higher than that of the other two models. KNN is sensitive to feature scaling, and the features were not scaled here (for example, Votes ranges from 5 to 591,417).

4. **Linear Regression:**
   - **Accuracy:** **3.91%**. The lowest score, suggesting that a simple linear model on label-encoded features explains very little of the variation in ratings.
   - **MSE:** **1.79**. Although its test MSE is lower than KNN's, the very low accuracy makes it a weak choice for this dataset.
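
Since KNN relies on distances between rows, features with large numeric ranges can dominate the result. A minimal sketch on synthetic data (an assumption for illustration, with one column deliberately inflated to mimic a wide-ranging feature such as Votes) shows how a `StandardScaler` could be placed in front of the regressor with a pipeline.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data
X, y = make_regression(n_samples=500, n_features=7, noise=10.0, random_state=1)
X[:, 0] *= 10_000  # inflate one column so it dominates raw distances

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

# KNN on raw features
knn_raw = KNeighborsRegressor(n_neighbors=5).fit(X_train, y_train)

# KNN with standardization; the scaler is fitted on the training data only
knn_scaled = make_pipeline(StandardScaler(), KNeighborsRegressor(n_neighbors=5))
knn_scaled.fit(X_train, y_train)

print(f"Raw features:    test R^2 = {knn_raw.score(X_test, y_test):.3f}")
print(f"Scaled features: test R^2 = {knn_scaled.score(X_test, y_test):.3f}")
```

Using a pipeline keeps the scaling step inside the model, so the test data is transformed with statistics learned from the training split only.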

### **Conclusion:**

The project demonstrates how machine learning models can predict movie ratings based on features such as genre, director, actors, vote count, and movie length. The comparison of multiple models suggests that datasets like the one used in this project benefit from algorithms capable of handling non-linear relationships and feature interactions.

- **Best Performing Model:**  
  The **Random Forest Regressor** emerged as the top performer with a training **accuracy of 88.6%** and the lowest test **MSE of 1.58**. Its ensemble approach offers the best balance between predictive power and generalization.

- **Challenges with Simpler Models:**  
  - **Decision Trees** suffered from overfitting, while **Linear Regression** and **KNN** failed to capture the data's complexity effectively, resulting in weaker predictions.
  
- **Feature Importance:**  
  Categorical features like **genre, director, and actors** make up most of the model inputs, although their individual importance was not measured in this notebook. Enhancing feature engineering and encoding methods could further improve model performance.

- **Insights into Data Complexity:**  
  The dataset’s complexity demonstrated the limitations of simpler algorithms, emphasizing the value of ensemble methods and advanced data preprocessing strategies.
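
One way to check which features a Random Forest relies on is its `feature_importances_` attribute. The sketch below is only an illustration: it reuses the notebook's feature names as column labels but fits on synthetic data from `make_regression`, so the resulting numbers say nothing about the real movie dataset.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Column labels borrowed from the notebook; the values themselves are synthetic
feature_names = ['Duration', 'Genre', 'Votes', 'Director', 'Actor 1', 'Actor 2', 'Actor 3']
X, y = make_regression(n_samples=400, n_features=7, n_informative=3, noise=5.0, random_state=42)
X = pd.DataFrame(X, columns=feature_names)

forest = RandomForestRegressor(n_estimators=100, random_state=42).fit(X, y)

# Impurity-based importances; they sum to 1 across all features
importances = pd.Series(forest.feature_importances_, index=feature_names)
importances = importances.sort_values(ascending=False)
print(importances)
```

In the actual notebook, the same two lines after fitting could be applied to the trained `random_forest` and the `features` list. Impurity-based importances can favor high-cardinality columns, so they are best treated as a rough guide.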