<h1><b>Movie Rating Prediction using Python by Richard Muchoki</b></h1>

<p style="font-size:16px;">In this Jupyter Notebook, I build a machine learning model that predicts movie ratings from features such as genre, director, and actors. The goal is to analyze historical movie data and develop a model that accurately estimates the rating a movie receives from users or critics.</p>

<p style="font-size:16px;"><b>Columns Description:</b></p>

<ol style="font-size:16px;">
    <li>Name: Name of the movie.</li><br>
    <li>Year: The year the movie was released.</li><br>
    <li>Duration: Runtime of the movie in minutes (stored as text, e.g. "109 min").</li><br>
    <li>Genre: Genre or genres of the movie (comma-separated when there are several).</li><br>
    <li>Rating: Rating given to the movie.</li><br>
    <li>Votes: Number of votes the movie received.</li><br>
    <li>Director: Director of the movie.</li><br>
    <li>Actor 1: Main actor of the movie.</li><br>
    <li>Actor 2: Second main actor of the movie.</li><br>
    <li>Actor 3: Third main actor of the movie.</li><br>
</ol>

<h1><b>1. Data Collection and Loading</b></h1>

In [6]:
# Import Necessary Libraries
import numpy as np
import pandas as pd

# Visualization Libraries
import matplotlib.pyplot as plt
import seaborn as sns

# scikit-learn: data splitting, model, scaling, and evaluation
from sklearn.model_selection import train_test_split, cross_val_predict, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn import metrics
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

<h3><b>Suppress Warnings</b></h3>

In [10]:
import warnings
warnings.filterwarnings("ignore")
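
<p style="font-size:16px;">The blanket filter above hides every warning, including deprecation notices that may matter later. As a hedged alternative, the sketch below silences one category inside a limited scope. The <code>noisy</code> function is a hypothetical stand-in for a library call, and <code>record=True</code> is used only to show which warnings still get through.</p>

```python
import warnings


def noisy():
    """Hypothetical stand-in for a library call that emits a FutureWarning."""
    warnings.warn("this API will change", FutureWarning)
    return 42


# Filters set inside the block are discarded when the block exits
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")                          # record everything by default
    warnings.simplefilter("ignore", category=FutureWarning)  # except this one category
    result = noisy()                                         # FutureWarning is dropped
    warnings.warn("still visible", UserWarning)              # other categories are kept

print(result, [w.category.__name__ for w in caught])
```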

In [22]:
# Load the Dataset
movie = pd.read_csv(r"C:\Users\Richard Muchoki\Documents\CodSoft Projects\CODSOFT\Movie Rating Prediction\Dataset\IMDb Movies India.csv", encoding="latin_1")

# Create a copy of the Dataset
movie_data = movie.copy()

movie_data

,Name,Year,Duration,Genre,Rating,Votes,Director,Actor 1,Actor 2,Actor 3
0,,,,Drama,,,J.S. Randhawa,Manmauji,Birbal,Rajendra Bhatia
1,#Gadhvi (He thought he was Gandhi),(2019),109 min,Drama,7.0,8,Gaurav Bakshi,Rasika Dugal,Vivek Ghamande,Arvind Jangid
2,#Homecoming,(2021),90 min,"Drama, Musical",,,Soumyajit Majumdar,Sayani Gupta,Plabita Borthakur,Roy Angana
3,#Yaaram,(2019),110 min,"Comedy, Romance",4.4,35,Ovais Khan,Prateik,Ishita Raj,Siddhant Kapoor
4,...And Once Again,(2010),105 min,Drama,,,Amol Palekar,Rajat Kapoor,Rituparna Sengupta,Antara Mali
...,...,...,...,...,...,...,...,...,...,...
15504,Zulm Ko Jala Doonga,(1988),,Action,4.6,11,Mahendra Shah,Naseeruddin Shah,Sumeet Saigal,Suparna Anand
15505,Zulmi,(1999),129 min,"Action, Drama",4.5,655,Kuku Kohli,Akshay Kumar,Twinkle Khanna,Aruna Irani
15506,Zulmi Raj,(2005),,Action,,,Kiran Thej,Sangeeta Tiwari,,
15507,Zulmi Shikari,(1988),,Action,,,,,,
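
<p style="font-size:16px;">The load cell relies on an absolute Windows path, so it only runs on one machine. A minimal sketch of a more portable loader follows, assuming the CSV location is passed in by the caller. The helper name <code>load_movies</code> and the throwaway demo file are hypothetical; the <code>latin_1</code> encoding is the one used above.</p>

```python
from pathlib import Path
import tempfile

import pandas as pd


def load_movies(csv_path) -> pd.DataFrame:
    """Read the movie CSV with latin_1 encoding, failing early if it is missing."""
    path = Path(csv_path)
    if not path.exists():
        raise FileNotFoundError(f"Dataset not found: {path}")
    return pd.read_csv(path, encoding="latin_1")


# Demo on a temporary file so the sketch runs anywhere
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "demo.csv"
    demo_path.write_text("Name,Year\n#Yaaram,(2019)\n", encoding="latin_1")
    demo = load_movies(demo_path)

print(demo.shape)
```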


In [24]:
# Describe the data
movie_data.describe()

,Rating
count,7919.0
mean,5.841621
std,1.381777
min,1.1
25%,4.9
50%,6.0
75%,6.8
max,10.0


In [44]:
# Summary statistics for every column, including the non-numeric ones
movie_data.describe(include="all")

,Name,Year,Duration,Genre,Rating,Votes,Director,Actor 1,Actor 2,Actor 3
count,15509,14981,7240,13632,7919.0,7920.0,14984,13892,13125,12365
unique,13838,102,182,485,,2034.0,5938,4718,4891,4820
top,Anjaam,(2019),120 min,Drama,,8.0,Jayant Desai,Ashok Kumar,Rekha,Pran
freq,7,410,240,2780,,227.0,58,158,83,91
mean,,,,,5.841621,,,,,
std,,,,,1.381777,,,,,
min,,,,,1.1,,,,,
25%,,,,,4.9,,,,,
50%,,,,,6.0,,,,,
75%,,,,,6.8,,,,,
max,,,,,10.0,,,,,


In [26]:
# Column dtypes and non-null counts
movie_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15509 entries, 0 to 15508
Data columns (total 10 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Name      15509 non-null  object 
 1   Year      14981 non-null  object 
 2   Duration  7240 non-null   object 
 3   Genre     13632 non-null  object 
 4   Rating    7919 non-null   float64
 5   Votes     7920 non-null   object 
 6   Director  14984 non-null  object 
 7   Actor 1   13892 non-null  object 
 8   Actor 2   13125 non-null  object 
 9   Actor 3   12365 non-null  object 
dtypes: float64(1), object(9)
memory usage: 1.2+ MB


In [32]:
movie_data.dtypes

Name         object
Year         object
Duration     object
Genre        object
Rating      float64
Votes        object
Director     object
Actor 1      object
Actor 2      object
Actor 3      object
dtype: object
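
<p style="font-size:16px;">Only <code>Rating</code> is numeric; <code>Year</code>, <code>Duration</code>, and <code>Votes</code> are stored as <code>object</code> because of values such as "(2019)" and "109 min". The sketch below shows one possible way to parse them, using a small made-up frame in place of <code>movie_data</code>. The helper name <code>to_numeric_columns</code> is hypothetical, and the comma thousands separator in <code>Votes</code> is an assumption not visible in the preview; anything that cannot be parsed becomes <code>NaN</code>.</p>

```python
import pandas as pd

# Made-up rows that mimic the raw string formats shown in the preview
sample = pd.DataFrame({
    "Year": ["(2019)", "(2021)", None],
    "Duration": ["109 min", None, "90 min"],
    "Votes": ["8", "1,086", None],  # the comma format is an assumption
})


def to_numeric_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with Year, Duration, and Votes parsed as numbers."""
    out = df.copy()
    # "(2019)" -> 2019.0
    out["Year"] = pd.to_numeric(
        out["Year"].str.extract(r"(\d{4})", expand=False), errors="coerce"
    )
    # "109 min" -> 109.0
    out["Duration"] = pd.to_numeric(
        out["Duration"].str.extract(r"(\d+)", expand=False), errors="coerce"
    )
    # "1,086" -> 1086.0; unparseable entries become NaN
    out["Votes"] = pd.to_numeric(
        out["Votes"].str.replace(",", "", regex=False), errors="coerce"
    )
    return out


cleaned = to_numeric_columns(sample)
print(cleaned.dtypes)
```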

In [34]:
# Memory used by each column, in bytes
movie_data.memory_usage()

Index          132
Name        124072
Year        124072
Duration    124072
Genre       124072
Rating      124072
Votes       124072
Director    124072
Actor 1     124072
Actor 2     124072
Actor 3     124072
dtype: int64
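
<p style="font-size:16px;">Every column reports 124072 bytes, which is 15509 rows times 8 bytes. For <code>object</code> columns this default figure counts only the pointers, not the strings they refer to. Passing <code>deep=True</code> includes the strings as well. The sketch below illustrates the difference on a small made-up frame.</p>

```python
import pandas as pd

# Made-up frame; dtype=object mirrors the object columns reported above
sample = pd.DataFrame({
    "Name": pd.Series(["#Yaaram", "Zulmi"], dtype=object),
    "Rating": [4.4, 4.5],
})

shallow = sample.memory_usage()        # 8 bytes per row for each column
deep = sample.memory_usage(deep=True)  # also counts the Python string objects

print(shallow)
print(deep)
```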

In [36]:
# Let's check for missing values in the DataFrame
movie_data.isnull().sum()

Name           0
Year         528
Duration    8269
Genre       1877
Rating      7590
Votes       7589
Director     525
Actor 1     1617
Actor 2     2384
Actor 3     3144
dtype: int64
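
<p style="font-size:16px;">Raw counts are easier to judge as a share of the 15509 rows: roughly 53% of <code>Duration</code> and 49% of <code>Rating</code> values are missing. The sketch below computes that share per column on a small made-up frame. Dropping rows without a <code>Rating</code> is shown as one possible choice for a supervised model, not necessarily the approach taken later in this notebook.</p>

```python
import numpy as np
import pandas as pd

# Made-up frame standing in for movie_data
sample = pd.DataFrame({
    "Name": ["A", "B", "C", "D"],
    "Duration": ["109 min", np.nan, np.nan, "90 min"],
    "Rating": [7.0, np.nan, 4.4, np.nan],
})

# Missing count and percentage per column, worst first
missing = pd.DataFrame({
    "missing": sample.isnull().sum(),
    "percent": (sample.isnull().mean() * 100).round(1),
}).sort_values("percent", ascending=False)
print(missing)

# One possible choice: keep only rows where the target is known
rated = sample.dropna(subset=["Rating"])
print(len(rated))
```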

In [42]:
# Count rows that are exact duplicates of an earlier row
movie_data.duplicated().sum()

6
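
<p style="font-size:16px;">Six rows are exact copies of an earlier row. A minimal sketch of inspecting and removing such rows follows, using a made-up frame in place of <code>movie_data</code>. Resetting the index afterwards is optional.</p>

```python
import pandas as pd

# Made-up frame with one repeated row
sample = pd.DataFrame({
    "Name": ["Movie A", "Movie A", "Movie B"],
    "Year": ["(1999)", "(1999)", "(2005)"],
})

n_duplicates = int(sample.duplicated().sum())      # rows repeating an earlier row
all_copies = sample[sample.duplicated(keep=False)] # every copy, including the first
deduped = sample.drop_duplicates().reset_index(drop=True)

print(n_duplicates)
print(all_copies)
print(deduped)
```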