<h1 style = "text-align: center; font: 48;">MOVIE RATING PREDICTION WITH PYTHON</h1>

<h2>Introduction</h2>

<p>
Movies are, without a doubt, an essential part of the modern age. Each movie has a story to tell, an idea to share, or a case to defend.
This dataset, pulled from IMDb.com, covers the Indian movies listed on the platform. <br>
Our task today is to clean this data using several methods and to provide some useful insight into how the dataset is structured, along with a study of the relationships between its features.
All of this will be assembled into what we call a data preprocessing pipeline. More on that later.
</p>

<h2>Page Index</h2>
<ol>
    <li>Data collection</li>
    <li>Data visualization</li>
    <li>Data processing</li>
    <li>Model initialisation</li>
    <li>Evaluation and optimization of the prediction model</li>
</ol>

<h2>Data Collection</h2>

<h3>Source</h3>

The data was collected from <i>Kaggle</i>, a well-known website that hosts many useful datasets.<br>
Link : <a href="https://www.kaggle.com/datasets/adrianmcmahon/imdb-india-movies">IMDb India Movies</a>

<h3>Libraries</h3>

In [1]:
import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
import sklearn as skl

In [48]:
print('Pandas version is : ', pd.__version__)
print('NumPy version is : ', np.__version__)
print('Seaborn version is : ', sns.__version__)
print('matplotlib version is : ', plt.matplotlib.__version__)
print('sklearn version is : ', skl.__version__)


Pandas version is :  2.0.3
NumPy version is :  1.24.3
Seaborn version is :  0.12.2
matplotlib version is :  3.7.2
sklearn version is :  1.3.0


<h3>Importing the dataset</h3>

In [49]:
df = pd.read_csv('IMDb Movies India.csv')

In [50]:
df.head()

Unnamed: 0,Name,Year,Duration,Genre,Rating,Votes,Director,Actor 1,Actor 2,Actor 3
0,,,,Drama,,,J.S. Randhawa,Manmauji,Birbal,Rajendra Bhatia
1,#Gadhvi (He thought he was Gandhi),(2019),109 min,Drama,7.0,8.0,Gaurav Bakshi,Rasika Dugal,Vivek Ghamande,Arvind Jangid
2,#Homecoming,(2021),90 min,"Drama, Musical",,,Soumyajit Majumdar,Sayani Gupta,Plabita Borthakur,Roy Angana
3,#Yaaram,(2019),110 min,"Comedy, Romance",4.4,35.0,Ovais Khan,Prateik,Ishita Raj,Siddhant Kapoor
4,...And Once Again,(2010),105 min,Drama,,,Amol Palekar,Rajat Kapoor,Rituparna Sengupta,Antara Mali


<h3>Understanding the dataset</h3>

In [51]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15509 entries, 0 to 15508
Data columns (total 10 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Name      15509 non-null  object 
 1   Year      14981 non-null  object 
 2   Duration  7240 non-null   object 
 3   Genre     13632 non-null  object 
 4   Rating    7919 non-null   float64
 5   Votes     7920 non-null   object 
 6   Director  14984 non-null  object 
 7   Actor 1   13892 non-null  object 
 8   Actor 2   13125 non-null  object 
 9   Actor 3   12365 non-null  object 
dtypes: float64(1), object(9)
memory usage: 1.2+ MB


<h3>Data description</h3>

<div style="width: 90%;margin:auto; background-color: #fff; padding: 20px; border-radius: 8px; box-shadow: 0 0 5px rgba(0, 0, 0, 0.1);">
    <h1 style="text-align: center; color: #333;">Movie Rating Dataset Description</h1>
    <p>In our dataset, we observe the following:</p>
    <ol>
        <li>There are a total of <span style="font-weight: bold; color: #555;">15,509</span> entries/rows.</li>
        <li>Each movie entry includes <span style="font-weight: bold; color: #555;">10</span> different features:</li>
    </ol>
    <ul>
        <li><span style="font-weight: bold; color: #555;">Name</span>: The title of the movie (string value).</li>
        <li><span style="font-weight: bold; color: #555;">Year</span>: The year in which the movie was released (stored as a string such as "(2019)"; an integer once parsed).</li>
        <li><span style="font-weight: bold; color: #555;">Duration</span>: The length of the movie in minutes (stored as a string such as "109 min"; a float once parsed).</li>
        <li><span style="font-weight: bold; color: #555;">Genre</span>: The genre of the movie (string).</li>
        <li><span style="font-weight: bold; color: #555;">Rating</span>: A rating on a scale from 0 to 10, indicating viewer satisfaction (float).</li>
        <li><span style="font-weight: bold; color: #555;">Votes</span>: The number of people who have rated the movie.</li>
        <li><span style="font-weight: bold; color: #555;">Director</span>: The director of the movie.</li>
        <li><span style="font-weight: bold; color: #555;">Actor 1</span>: The main actor of the movie.</li>
        <li><span style="font-weight: bold; color: #555;">Actor 2</span>: The second main actor of the movie.</li>
        <li><span style="font-weight: bold; color: #555;">Actor 3</span>: The third main actor of the movie.</li>
    </ul>
</div>
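<p>
The numeric types listed above are the intended ones; <code>df.info()</code> reports <code>Year</code> and <code>Duration</code> as <code>object</code>, i.e. strings. The following is a minimal sketch of how such strings could be parsed, assuming every value follows the "(2019)" and "109 min" patterns seen in <code>df.head()</code>. The <code>sample</code> frame is made up for illustration and is not part of the dataset.
</p>

```python
import pandas as pd

# Toy rows in the same raw format as df.head(); not the real dataset.
sample = pd.DataFrame({
    "Year": ["(2019)", "(2021)", None],
    "Duration": ["109 min", "90 min", None],
})

# "(2019)" -> 2019. The nullable Int64 dtype keeps missing years as <NA>.
year_digits = sample["Year"].str.extract(r"(\d{4})", expand=False)
sample["Year"] = pd.to_numeric(year_digits, errors="coerce").astype("Int64")

# "109 min" -> 109.0. Values that do not parse become NaN.
minutes = sample["Duration"].str.replace(" min", "", regex=False)
sample["Duration"] = pd.to_numeric(minutes, errors="coerce")

print(sample.dtypes)
```

Using <code>errors="coerce"</code> is a design choice here: unexpected values turn into missing values instead of raising, so they can be counted later with <code>isna()</code>.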


<h3>Data Preprocessing</h3>

<p>
One of the most common ways to prepare a dataset before feeding it to a prediction model, or any other kind of model, is to build what we call a 'data pipeline'.<br>
This gives us a cleaner, more presentable case study and helps keep our models as efficient and noise-free as possible. <br>
    First, let's analyse our dataset in depth.
</p>

In [43]:
df.isna().sum()

Name           0
Year         528
Duration    8269
Genre       1877
Rating      7590
Votes       7589
Director     525
Actor 1     1617
Actor 2     2384
Actor 3     3144
dtype: int64

Here we notice a few interesting things:<br>
    + 528 movies are missing their release year.<br>
    + The duration is missing for 8269 movies!<br>
    + 1877 movies are missing their genre.<br>
    + The remaining columns have gaps too (see the counts above).<br>
    <br>
Clearly, we need to handle these cases before we continue.<br>
We can either drop the rows with missing values, or fill the gaps with calculated values such as the mean or the median.<br>
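<p>
To make the two options concrete, here is a minimal sketch on a small made-up frame (<code>toy</code> is hypothetical, not the movie dataset). It assumes the columns are already numeric, which is not yet true for the raw <code>Duration</code> column.
</p>

```python
import numpy as np
import pandas as pd

# Hypothetical numeric data with a few gaps.
toy = pd.DataFrame({
    "Duration": [109.0, np.nan, 110.0, 90.0],
    "Rating": [7.0, 4.4, np.nan, 6.0],
})

# Option 1: drop every row that is missing Duration or Rating.
dropped = toy.dropna(subset=["Duration", "Rating"])

# Option 2: keep all rows and fill each gap with that column's median.
imputed = toy.fillna(toy.median(numeric_only=True))

print(len(dropped), "rows left after dropping")
print(imputed)
```

Dropping loses rows but keeps only observed values; filling keeps every row but adds values that were never measured. Filling a prediction target such as <code>Rating</code> is usually avoided, since it invents labels for the model to learn from.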

In [47]:
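# Work on a copy so the original DataFrame stays untouched
df1 = df.copy()

def missing_values():
    # Drop rows where Duration or Rating is missing (modifies df1 in place)
    df1.dropna(subset=['Duration', 'Rating'], inplace=True)

# The function has to be called for the rows to actually be dropped
missing_values()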


<h3>Data's Format</h3>

In [46]:
for a in df1.columns:
    column_format = df1[a].apply(lambda x: type(x)).unique()

    if len(column_format) == 1:
        print(f"All values in the '{a}' column have the same format: {column_format[0]}")
    else:
        print(f"Values in the '{a}' column have different formats: {column_format}")

All values in the 'Name' column have the same format: <class 'str'>
All values in the 'Year' column have the same format: <class 'str'>
All values in the 'Duration' column have the same format: <class 'str'>
Values in the 'Genre' column have different formats: [<class 'str'> <class 'float'>]
All values in the 'Rating' column have the same format: <class 'float'>
All values in the 'Votes' column have the same format: <class 'str'>
Values in the 'Director' column have different formats: [<class 'str'> <class 'float'>]
Values in the 'Actor 1' column have different formats: [<class 'str'> <class 'float'>]
Values in the 'Actor 2' column have different formats: [<class 'str'> <class 'float'>]
Values in the 'Actor 3' column have different formats: [<class 'str'> <class 'float'>]
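<p>
The <code>float</code> entries reported for the text columns are most likely missing values rather than real numbers: pandas stores a missing entry in an <code>object</code> column as <code>NaN</code>, which is a Python <code>float</code>. A minimal sketch of this behaviour, using a made-up <code>genre</code> series:
</p>

```python
import numpy as np
import pandas as pd

# A text column with one missing entry, as in 'Genre' or 'Director'.
genre = pd.Series(["Drama", np.nan, "Comedy, Romance"], name="Genre")

# NaN is a float, so the column appears to mix str and float.
types_seen = set(genre.map(type))

# Once missing entries are removed, only str remains.
non_missing_types = set(genre.dropna().map(type))

print(types_seen)
print(non_missing_types)
```

This would also explain why <code>Duration</code> and <code>Rating</code> show a single type: rows missing those values were dropped from <code>df1</code> beforehand.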
