In [8]:
import pandas as pd

# Specify encoding to avoid UnicodeDecodeError
df = pd.read_csv("IMDb Movies India.csv", encoding='latin1')

# Show the first few rows
df.head()


Unnamed: 0,Name,Year,Duration,Genre,Rating,Votes,Director,Actor 1,Actor 2,Actor 3
0,,,,Drama,,,J.S. Randhawa,Manmauji,Birbal,Rajendra Bhatia
1,#Gadhvi (He thought he was Gandhi),(2019),109 min,Drama,7.0,8.0,Gaurav Bakshi,Rasika Dugal,Vivek Ghamande,Arvind Jangid
2,#Homecoming,(2021),90 min,"Drama, Musical",,,Soumyajit Majumdar,Sayani Gupta,Plabita Borthakur,Roy Angana
3,#Yaaram,(2019),110 min,"Comedy, Romance",4.4,35.0,Ovais Khan,Prateik,Ishita Raj,Siddhant Kapoor
4,...And Once Again,(2010),105 min,Drama,,,Amol Palekar,Rajat Kapoor,Rituparna Sengupta,Antara Mali
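
Why the encoding argument matters can be sketched on a tiny in-memory file. The sample text and variable names below are assumptions made for illustration and do not come from the dataset; the point is only that a byte such as 0xE9 is valid Latin-1 but not valid UTF-8, which is what triggers the UnicodeDecodeError mentioned in the comment above.

```python
import io
import pandas as pd

# Illustrative data only: "Amélie" written as Latin-1 stores é as the single
# byte 0xE9, which is not a valid UTF-8 sequence on its own.
raw = "Name,Year\nAmélie,2001\n".encode("latin1")

try:
    pd.read_csv(io.BytesIO(raw))  # pandas assumes UTF-8 by default
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# Latin-1 maps every possible byte to a character, so decoding cannot fail
df_latin1 = pd.read_csv(io.BytesIO(raw), encoding="latin1")
print(utf8_ok, df_latin1["Name"].iloc[0])
```

Because Latin-1 accepts any byte, it silences the error even when the file was actually written in a different encoding, so it is worth spot-checking a few names with accented characters after loading.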


In [9]:
df.columns


Index(['Name', 'Year', 'Duration', 'Genre', 'Rating', 'Votes', 'Director',
       'Actor 1', 'Actor 2', 'Actor 3'],
      dtype='object')

In [14]:
df['Votes'].head(10)


Unnamed: 0,Votes
0,
1,8.0
2,
3,35.0
4,
5,827.0
6,1086.0
7,
8,326.0
9,11.0


In [15]:
df.iloc[:, 5].head(10)  # Show the 6th column


Unnamed: 0,Votes
0,
1,8.0
2,
3,35.0
4,
5,827.0
6,1086.0
7,
8,326.0
9,11.0


In [16]:
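# Clean the Votes column
# astype(str) keeps this cell re-runnable even if Votes is already numeric
df['Votes'] = df['Votes'].astype(str).str.replace(',', '', regex=False)
df['Votes'] = pd.to_numeric(df['Votes'], errors='coerce')  # Convert to float; unparseable values become NaN
df['Votes'] = df['Votes'].fillna(0)  # Replace NaN with 0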

# Check if it's fixed
df['Votes'].head(10)


Unnamed: 0,Votes
0,0.0
1,8.0
2,0.0
3,35.0
4,0.0
5,827.0
6,1086.0
7,0.0
8,326.0
9,11.0
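
The cleaning steps in this cell can be tried on a small stand-in Series. This is a minimal sketch: the sample values and the helper name `clean_votes` are hypothetical, chosen only to show how thousands separators, missing entries, and unparseable strings are each handled.

```python
import pandas as pd

# Made-up sample values standing in for the raw Votes column
raw_votes = pd.Series(["8", "1,086", None, "35", "$5.16M"])

def clean_votes(votes: pd.Series) -> pd.Series:
    """Strip thousands separators, coerce to float, and fill gaps with 0."""
    as_text = votes.astype(str).str.replace(",", "", regex=False)
    # errors="coerce" turns anything that is not a number into NaN
    return pd.to_numeric(as_text, errors="coerce").fillna(0).astype(float)

cleaned = clean_votes(raw_votes)
print(cleaned.tolist())
```

Note that filling with 0 makes a movie with unknown votes indistinguishable from one with no votes at all, which is a modelling choice rather than a neutral default.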


In [17]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import mean_squared_error

# Select features and target
features = ['Genre', 'Director', 'Actor 1', 'Votes']
target = 'Rating'

# Drop rows where Rating or any feature is missing
df = df.dropna(subset=features + [target])

# Encode categorical columns
le = LabelEncoder()
for col in ['Genre', 'Director', 'Actor 1']:
    df[col] = le.fit_transform(df[col])

# Split data
X = df[features]
y = df[target]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train model
model = LinearRegression()
model.fit(X_train, y_train)

# Predict & evaluate
y_pred = model.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
print("✅ Model trained successfully!")
print("📉 Mean Squared Error:", mse)


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df[col] = le.fit_transform(df[col])


✅ Model trained successfully!
📉 Mean Squared Error: 1.8615142054657934
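
The SettingWithCopyWarning in the output above comes from assigning to columns of a frame that pandas still treats as derived from another one. One common way to avoid it, sketched here on toy rows (the values and the `encoders` dictionary are assumptions for illustration, not part of the notebook), is to call .copy() after dropna and to keep one LabelEncoder per column.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy rows; the values are illustrative, not taken from the dataset
toy = pd.DataFrame({
    "Genre": ["Drama", "Comedy", None, "Drama"],
    "Director": ["A", "B", "C", "A"],
    "Rating": [7.0, 4.4, 6.0, None],
})

# .copy() gives an independent frame, so later column assignments
# are unambiguous and pandas has nothing to warn about
clean = toy.dropna(subset=["Genre", "Director", "Rating"]).copy()

# One encoder per column, kept so the integer codes can be decoded later
encoders = {}
for col in ["Genre", "Director"]:
    encoders[col] = LabelEncoder()
    clean[col] = encoders[col].fit_transform(clean[col])

print(clean)
print(encoders["Genre"].inverse_transform(clean["Genre"]))
```

Reusing a single encoder across columns, as the cell above does, still produces valid codes, but only the mapping for the last column survives, so earlier columns can no longer be decoded.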


🔹 Project Title:
Movie Rating Prediction Using Python

🔹 Objective:
To build a machine learning model that predicts IMDb movie ratings using features such as genre, director, actor, and vote count.

🔹 Dataset:
File Name: IMDb Movies India.csv

Source: Provided by InternOrbit

Shape: Rows = df.shape[0] (not shown in the notebook output); Columns = 10 (per df.columns)

🔹 Features Used:
Genre – the genre of the movie

Director – the director of the movie

Actor 1 – lead actor/actress

Votes – total user votes

Target Variable:

Rating – IMDb rating to be predicted

🔹 Data Preprocessing:
Removed commas from Votes and converted it to numeric

Filled missing values in Votes with 0

Dropped rows with a missing value in Rating or in any feature column (Genre, Director, Actor 1, Votes)

Applied Label Encoding on categorical columns (Genre, Director, Actor 1)

🔹 Model Used:
Linear Regression from scikit-learn

Train-Test Split: 80% training, 20% testing
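
The 80/20 split and model fit can be reproduced on synthetic data to see what each step returns. This is a sketch under stated assumptions: the feature matrix, coefficients, and noise level below are invented and only mimic the shape of the problem (four numeric features, one continuous target).

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 4))  # 100 synthetic rows, 4 features
# Synthetic target: a linear signal around 6.0 plus small noise
y = X @ np.array([0.5, -0.2, 0.1, 0.3]) + 6.0 + rng.normal(scale=0.1, size=100)

# test_size=0.2 holds out 20% of the rows; random_state fixes the shuffle
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = LinearRegression().fit(X_train, y_train)
mse = mean_squared_error(y_test, model.predict(X_test))
print(X_train.shape, X_test.shape, mse)
```

The held-out rows are never seen during fitting, so the MSE computed on them estimates how the model behaves on new data.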

🔹 Model Evaluation:
Mean Squared Error (MSE): 1.8615 (printed by the training cell)

With an MSE of about 1.86 (RMSE ≈ 1.36), predictions are off by roughly 1.4 rating points on IMDb's 10-point scale. No baseline was computed for comparison.
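
To put the printed MSE in rating units, it helps to take its square root and to compare it against a trivial baseline. The sketch below uses the MSE value from the notebook output; the helper `baseline_mse` and the toy ratings are hypothetical and only show how such a comparison could be computed.

```python
import math
import numpy as np

mse = 1.8615142054657934  # value printed by the training cell
rmse = math.sqrt(mse)     # same units as the rating itself
print(round(rmse, 3))

def baseline_mse(y_train, y_test):
    """MSE of always predicting the mean rating of the training set."""
    constant_prediction = np.mean(y_train)
    return float(np.mean((np.asarray(y_test) - constant_prediction) ** 2))

# Toy ratings for illustration only: training mean is 6.0,
# test errors are (4 - 6)^2 = 4 and (8 - 6)^2 = 4
toy_baseline = baseline_mse([5.0, 6.0, 7.0], [4.0, 8.0])
print(toy_baseline)
```

Running `baseline_mse(y_train, y_test)` on the notebook's own split would show whether the regression beats a constant prediction; that comparison is not part of the original run.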

🔹 Conclusion:
This project demonstrates how a basic regression model can be used to predict movie ratings from IMDb data. Adding more features (such as the full cast, reviews, or duration) may improve accuracy further.