### Project Title: Movie Rating Prediction With Python
#### Done By: Nozipho Sithembiso Ndebele
---

<div style="text-align: center;">
<img src="https://storage.googleapis.com/kaggle-datasets-images/1867204/3122809/6dd06ad75dfe450aeaf370a7348600f3/dataset-card.jpg?t=2022-02-01-05-41-22" alt="Movie Image" width="1000"/>
</div>

---

## Table of Contents

<a href=#BC> Background Context</a>

<a href=#one>1. Importing Packages</a>

<a href=#two>2. Data Collection and Description</a>

<a href=#three>3. Loading Data </a>

<a href=#four>4. Data Cleaning and Filtering</a>

<a href=#five>5. Exploratory Data Analysis (EDA)</a>

<a href=#six>6. Modeling </a>

<a href=#seven>7. Evaluation and Validation</a>

<a href=#eight>8. Final Model</a>

<a href=#nine>9. Conclusion and Future Work</a>

<a href=#ten>10. References</a>

---
 <a id="BC"></a>
## **Background Context**

The IMDb India Movies dataset is a curated collection of Indian movies listed on IMDb, capturing essential information such as titles, ratings, genres, and the individuals involved in each film (actors, directors). The dataset offers a rich opportunity for exploratory data analysis (EDA), trend detection, and insights into the Indian film industry over time.


### Purpose
The main objectives of this project are to:

- Clean and preprocess the dataset by handling missing values and standardizing data formats.

- Perform exploratory data analysis to understand trends in ratings, duration, votes, genres, and roles of actors/directors.

- Identify key patterns and trends such as top-rated movies, most active directors, and yearly performance metrics.

- Build predictive models to understand factors influencing movie ratings or popularity.


### Significance
This dataset provides a comprehensive view of:

- Evolution of the Indian film industry over the years.

- Audience preferences, gauged via votes and ratings.

- Influence of movie duration and genre on viewer perception.

- Trends related to popular actors and directors.

The dataset is also a good foundation for beginner-to-intermediate projects in data cleaning, EDA, visualization, and predictive modeling.

### Problem Domain
The dataset opens up multiple areas of inquiry:

- Time series trends: How have ratings or the number of movies changed over the years?

- Impact analysis: Does movie duration or genre affect its IMDb rating?

- Popularity vs. quality: Do high vote counts correlate with higher ratings?

- Industry contributions: Which directors and actors have contributed the most to Indian cinema?

These questions can be answered through:

  - Cleaning and preprocessing (e.g., handling nulls in actors and genres).

  - Aggregation and grouping operations.

  - Visualization (using Matplotlib, Seaborn, or Plotly).

  - Building regression or classification models (e.g., predicting high vs low-rated movies).
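As an illustration of the aggregation and grouping step, the sketch below computes the average rating per year and the number of movies per director. The sample rows and the helper names `yearly_average_rating` and `movies_per_director` are hypothetical. It also assumes `Year` and `Rating` are already numeric, which is not true of the raw file.

```python
import pandas as pd

# Hand-made sample rows (hypothetical, not taken from the dataset)
sample = pd.DataFrame({
    "Name": ["A", "B", "C", "D"],
    "Year": [2019, 2019, 2020, 2020],
    "Rating": [7.0, 5.0, 6.0, 8.0],
    "Director": ["Dir X", "Dir Y", "Dir X", "Dir X"],
})


def yearly_average_rating(df):
    """Mean rating per release year, sorted by year."""
    return df.groupby("Year")["Rating"].mean().sort_index()


def movies_per_director(df):
    """Number of movies per director, most active first."""
    return df["Director"].value_counts()


print(yearly_average_rating(sample))
print(movies_per_director(sample))
```

The same pattern extends to other grouping keys, such as genre or lead actor, once those columns have been cleaned.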


### Challenges
- Missing data: Many null values in actor, director, and genre columns.

- Data inconsistency: Genre, duration, and names might have inconsistent formats or typos.

- Multiple entries: A movie may be listed under multiple genres or actors, requiring careful parsing.

- Imbalanced popularity: Some movies may have a high number of votes but poor ratings (or vice versa).

- Bias in ratings: Ratings might be skewed by popularity or recency effects.
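The multi-genre challenge can be sketched as follows: split the comma-separated `Genre` string and give each genre its own row. This is a minimal sketch, not the notebook's own method. The helper name `explode_genres` is hypothetical, and it assumes genres are always separated by commas.

```python
import pandas as pd

# Two rows with multi-label genres, as they appear in the raw data
sample = pd.DataFrame({
    "Name": ["#Homecoming", "#Yaaram"],
    "Genre": ["Drama, Musical", "Comedy, Romance"],
})


def explode_genres(df, column="Genre"):
    """Return a copy with one row per (movie, genre) pair."""
    out = df.copy()
    # "Drama, Musical" -> ["Drama", " Musical"]
    out[column] = out[column].str.split(",")
    # One row per list element; missing genres stay as a single NaN row
    out = out.explode(column)
    # Remove the whitespace left over from the split
    out[column] = out[column].str.strip()
    return out.reset_index(drop=True)


genres_long = explode_genres(sample)
print(genres_long)
```

Because a movie then appears once per genre, per-genre statistics computed on the long table count multi-genre movies more than once.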

### Key Questions
- What is the trend of movie ratings across different years?

- Which year had the best average rating?

- Does movie duration impact the rating?

- What are the top 10 movies by rating per year and overall?

- Which directors have directed the most movies?

- Which actors frequently appear across top-rated movies?

- What genres are most commonly associated with high ratings?

- Are more movies being released over the years?

- Can we predict the success (high rating) of a movie based on features like duration, genre, and director?



---
<a id="one"></a>
## **Importing Packages**

### Purpose
To set up the Python environment with the necessary libraries for data manipulation, visualization, and machine learning. These libraries will facilitate data preprocessing, feature extraction, model training, and evaluation.

### Details
* Pandas: For handling and analyzing data.

* NumPy: For numerical operations.

* Matplotlib/Seaborn/Plotly: For static and interactive data visualization to understand trends and patterns.

* scikit-learn: For building and evaluating machine learning models.

* NLTK and `re`: For text preprocessing and natural language processing tasks.

---

In [18]:
# Import necessary packages  

# Data manipulation and analysis  
import pandas as pd  # Pandas for data handling  
import numpy as np  # NumPy for numerical operations  

# Data visualization  
import matplotlib.pyplot as plt  # Matplotlib for static plots  
import seaborn as sns  # Seaborn for statistical visualization  
import plotly.express as px  # Plotly for interactive plots  

# Natural Language Processing  
import nltk  # Natural Language Toolkit  
from nltk.corpus import stopwords  # Stopword removal  
from nltk.tokenize import word_tokenize  # Tokenization  
import re  # Regular expressions for text cleaning  

# Configure visualization settings
sns.set(style='whitegrid')  # Set the default style for Seaborn plots
plt.rcParams['figure.figsize'] = (10, 6)  # Set default figure size for Matplotlib

# Suppress warnings
import warnings  # Import the warnings module
warnings.filterwarnings('ignore')  # Ignore all warning messages


---
<a id="two"></a>
## **Data Collection and Description**
### Purpose
To understand the structure and content of the IMDb India Movies dataset for analysis and model-building.

### Details
- Source: IMDb (scraped and compiled)

- Format: CSV

- Rows: 15,509 (as reported by `df.shape` after loading)

- Columns: 10

### Types of Data
- `Name`: Title of the movie
- `Year`: Year of release, stored as text in parentheses, e.g. `(2019)`
- `Duration`: Movie duration in minutes, stored as text, e.g. `109 min`
- `Genre`: Primary and secondary genres (can be multi-label, comma-separated)
- `Rating`: IMDb user rating (`float64`)
- `Votes`: Number of user votes (loaded as `object`, not as a numeric type)
- `Director`: Name of the movie's director
- `Actor 1`: Lead actor
- `Actor 2`: Supporting actor
- `Actor 3`: Supporting actor


---
<a id="three"></a>
## **Loading Data**
### Purpose
The purpose of this section is to load the dataset into the notebook for further manipulation and analysis. This is the first step in working with the data, and it allows us to inspect the raw data and get a sense of its structure.

### Details
In this section, we will load the dataset into a Pandas DataFrame and display the first few rows to understand what the raw data looks like. This will help in planning the next steps of data cleaning and analysis.


---

In [19]:
# Load the dataset into a Pandas DataFrame

# The dataset is stored in a CSV file named 'IMDb Movies India.csv'
df = pd.read_csv('IMDb Movies India.csv', encoding='ISO-8859-1')

In [20]:
# df is the original dataset (DataFrame), this creates a copy of it
df_copy = df.copy()

# Now 'df_copy' is an independent copy of 'df'. Changes to 'df_copy' won't affect 'df'.


In [21]:
# Display the first few rows of the dataset to get a sense of what the raw data looks like
df_copy.head()

| | Name | Year | Duration | Genre | Rating | Votes | Director | Actor 1 | Actor 2 | Actor 3 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | | | | Drama | | | J.S. Randhawa | Manmauji | Birbal | Rajendra Bhatia |
| 1 | #Gadhvi (He thought he was Gandhi) | (2019) | 109 min | Drama | 7.0 | 8.0 | Gaurav Bakshi | Rasika Dugal | Vivek Ghamande | Arvind Jangid |
| 2 | #Homecoming | (2021) | 90 min | Drama, Musical | | | Soumyajit Majumdar | Sayani Gupta | Plabita Borthakur | Roy Angana |
| 3 | #Yaaram | (2019) | 110 min | Comedy, Romance | 4.4 | 35.0 | Ovais Khan | Prateik | Ishita Raj | Siddhant Kapoor |
| 4 | ...And Once Again | (2010) | 105 min | Drama | | | Amol Palekar | Rajat Kapoor | Rituparna Sengupta | Antara Mali |


In [22]:
# Display the number of rows and columns in the dataset to understand its size
df_copy.shape

(15509, 10)

In [23]:
# Check the structure of the dataset
df_copy.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15509 entries, 0 to 15508
Data columns (total 10 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Name      15509 non-null  object 
 1   Year      14981 non-null  object 
 2   Duration  7240 non-null   object 
 3   Genre     13632 non-null  object 
 4   Rating    7919 non-null   float64
 5   Votes     7920 non-null   object 
 6   Director  14984 non-null  object 
 7   Actor 1   13892 non-null  object 
 8   Actor 2   13125 non-null  object 
 9   Actor 3   12365 non-null  object 
dtypes: float64(1), object(9)
memory usage: 1.2+ MB
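The `info()` output shows `Year`, `Duration`, and `Votes` stored as `object`, so they need converting before any numeric analysis. Below is a minimal sketch of one way to do this, assuming the formats seen in the first rows (`(2019)`, `109 min`). The helper name `convert_text_columns` is hypothetical, and the thousands separator handled in `Votes` is an assumption about the data, not something shown above.

```python
import pandas as pd


def convert_text_columns(df):
    """Return a copy with Year, Duration, and Votes parsed as numbers."""
    out = df.copy()
    # "(2019)" -> 2019: keep the first four-digit run
    out["Year"] = pd.to_numeric(
        out["Year"].astype(str).str.extract(r"(\d{4})", expand=False),
        errors="coerce",
    )
    # "109 min" -> 109: keep the leading run of digits
    out["Duration"] = pd.to_numeric(
        out["Duration"].astype(str).str.extract(r"(\d+)", expand=False),
        errors="coerce",
    )
    # Assumption: vote counts may contain thousands separators such as "1,086"
    out["Votes"] = pd.to_numeric(
        out["Votes"].astype(str).str.replace(",", "", regex=False),
        errors="coerce",
    )
    return out


# Small sample; the "1,086" value is hypothetical
sample = pd.DataFrame({
    "Year": ["(2019)", None],
    "Duration": ["109 min", None],
    "Votes": ["8", "1,086"],
})
print(convert_text_columns(sample))
```

Values that cannot be parsed become `NaN` through `errors="coerce"`, so they show up in the missing-value check instead of raising an error.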


---
<a id="four"></a>
## **Data Cleaning and Filtering**
Before analyzing the data, it is crucial to clean and filter it. This process involves handling missing values, removing outliers, correcting errors, and possibly reducing the data by filtering out irrelevant features. These steps ensure that the analysis is based on accurate and reliable data.

### Details
In this section, we will:

* Check for Missing Values: Identify if there are any missing values in the dataset and handle them accordingly.
* Remove Duplicates: Ensure there are no duplicate rows that could skew the analysis.
* Correct Errors: Look for and correct any obvious data entry errors.
* Filter Data: Depending on the analysis requirements, filter the data to include only relevant records.

In [24]:
# 1. Check for missing values in the dataset

def check_missing_values(df):
    """
    Check for missing values in the dataset and display the number of missing values per column.

    Parameters:
    df (pandas.DataFrame): The dataset to check for missing values.

    Returns:
    pandas.Series: A series showing the number of missing values for each column.
    """
    # Check for missing values in the dataset and display them
    print("Missing values per column:")
    missing_values = df.isnull().sum()
    print(missing_values)
    return missing_values


In [25]:
# Check for missing values in the working copy of the dataset
missing_values = check_missing_values(df_copy)


Missing values per column:
Name           0
Year         528
Duration    8269
Genre       1877
Rating      7590
Votes       7589
Director     525
Actor 1     1617
Actor 2     2384
Actor 3     3144
dtype: int64


The check shows missing values in every column except `Name`. `Duration` (8,269), `Rating` (7,590), and `Votes` (7,589) are the most affected, each missing in roughly half of the 15,509 rows. Depending on the analysis goals, these gaps can be handled by dropping the affected rows, by imputation, or by other data-cleaning techniques.
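Raw counts are easier to compare as a share of all rows. The sketch below adds a percentage column to the missing-value report and is demonstrated on a small hand-made frame. The helper name `missing_summary` is hypothetical.

```python
import pandas as pd


def missing_summary(df):
    """Missing-value count and percentage per column, largest first."""
    counts = df.isnull().sum()
    percent = (counts / len(df) * 100).round(1)
    summary = pd.DataFrame({"missing": counts, "percent": percent})
    return summary.sort_values("missing", ascending=False)


# Hand-made sample (hypothetical values)
sample = pd.DataFrame({
    "Name": ["A", "B", "C", "D"],
    "Duration": ["109 min", None, None, "90 min"],
    "Rating": [7.0, None, 4.4, 6.1],
})
summary = missing_summary(sample)
print(summary)
```

Applied to the counts above, `Duration` is missing in about 53% of rows (8,269 of 15,509) and `Rating` in about 49% (7,590 of 15,509).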

In [26]:
def drop_missing_values(df):
    """
    Drops rows with missing values in specified important columns.

    Args:
    df (pandas.DataFrame): The dataframe to process.

    Returns:
    pandas.DataFrame: The dataframe with rows containing missing values dropped.
    """
    # Columns to check for missing values
    columns_to_check = ['Year', 'Duration', 'Genre', 'Rating', 'Votes', 
                        'Director', 'Actor 1', 'Actor 2', 'Actor 3']

    # Count missing values before
    missing_before = df[columns_to_check].isnull().sum().sum()
    print(f"\nMissing values before dropping: {missing_before}")

    # Drop rows with any missing values in the specified columns
    df = df.dropna(subset=columns_to_check)

    # Count missing values after
    missing_after = df[columns_to_check].isnull().sum().sum()
    print(f"Missing values after dropping: {missing_after}")
    print(f"Remaining rows: {df.shape[0]}")

    return df


In [27]:
# Call the function
df_copy = drop_missing_values(df_copy)


Missing values before dropping: 33523
Missing values after dropping: 0
Remaining rows: 5659
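Dropping every incomplete row keeps 5,659 of the 15,509 rows, roughly 36%. If that loss proves too costly, the imputation route mentioned earlier is one alternative. The sketch below is not what this notebook does. It assumes `Rating` is the prediction target, so rows without it are still dropped, and it fills missing text columns with a placeholder. The helper name `impute_text_columns` and the `'Unknown'` placeholder are hypothetical.

```python
import pandas as pd

# Text columns that are filled rather than dropped in this sketch
TEXT_COLUMNS = ["Genre", "Director", "Actor 1", "Actor 2", "Actor 3"]


def impute_text_columns(df, target="Rating", placeholder="Unknown"):
    """Drop rows without a target value, then fill missing text fields."""
    # Rows without the target cannot be used for supervised learning
    out = df.dropna(subset=[target]).copy()
    # Replace missing categorical values with an explicit placeholder
    out[TEXT_COLUMNS] = out[TEXT_COLUMNS].fillna(placeholder)
    return out


# Hand-made sample (hypothetical values)
sample = pd.DataFrame({
    "Rating": [7.0, None, 4.4],
    "Genre": ["Drama", "Drama", None],
    "Director": ["A", None, "B"],
    "Actor 1": [None, "x", "y"],
    "Actor 2": ["p", "q", None],
    "Actor 3": ["r", None, "s"],
})
result = impute_text_columns(sample)
print(result)
```

A placeholder keeps more rows but adds an artificial category, so its effect on any downstream model should be checked rather than assumed.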


In [28]:
def remove_duplicates(df):
    """
    Checks for duplicate rows in the dataset and removes them if any are found.

    Args:
    df (pandas.DataFrame): The dataframe to check for duplicate rows.

    Returns:
    pandas.DataFrame: The dataframe with duplicate rows removed, if any existed.
    """
    # Check for duplicate rows
    duplicate_rows = df.duplicated().sum()
    print(f"\nNumber of duplicate rows: {duplicate_rows}")
    
    # Remove duplicates if any exist
    if duplicate_rows > 0:
        # Reassign instead of modifying the caller's DataFrame in place
        df = df.drop_duplicates()
        print(f"Duplicate rows removed. Updated dataframe has {len(df)} rows.")
    else:
        print("No duplicate rows found.")
    
    return df

In [29]:
df_copy = remove_duplicates(df_copy)


Number of duplicate rows: 0
No duplicate rows found.


Upon reviewing the dataset, no duplicate rows were found. This ensures that all records are unique, and no further action is required for data deduplication.
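`DataFrame.duplicated()` with no arguments flags only rows that match in every column, so a film entered twice with slightly different metadata would pass the check above. The sketch below is an optional extra check, not part of this notebook's pipeline. Treating `Name` plus `Year` as the identity of a film is an assumption, and the helper name `find_key_duplicates` is hypothetical.

```python
import pandas as pd


def find_key_duplicates(df, keys=("Name", "Year")):
    """Return all rows that share the same values in the key columns."""
    # keep=False marks every member of a duplicate group, not just the repeats
    mask = df.duplicated(subset=list(keys), keep=False)
    return df[mask].sort_values(list(keys))


# Hand-made sample (hypothetical values): "A" (2019) appears twice
sample = pd.DataFrame({
    "Name": ["A", "A", "B"],
    "Year": ["(2019)", "(2019)", "(2020)"],
    "Rating": [7.0, 7.1, 5.0],
})
suspects = find_key_duplicates(sample)
print(suspects)
```

Rows returned by such a check are candidates for manual review, not automatic removal, because remakes or unrelated films can share a title and year.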


## **Saving the Cleaned Dataset**
### Purpose

This section outlines how to save the cleaned dataset for future use. Saving the dataset ensures that the data cleaning process does not need to be repeated and allows for consistent use in subsequent analyses.

### Details

We will save the cleaned dataset as a CSV file.

In [30]:
# Save the cleaned dataset to a new CSV file

def save_cleaned_dataset(df, filename='cleaned_IMDb_Movies_India.csv'):
    """
    Saves the cleaned dataframe to a CSV file.

    Args:
    df (pandas.DataFrame): The cleaned dataframe to save.
    filename (str): The name of the file to save the dataframe to (default is 'cleaned_IMDb_Movies_India.csv').

    Returns:
    None
    """
    # Save the cleaned dataset to a CSV file
    df.to_csv(filename, index=False)
    print(f"Cleaned dataset saved successfully as '{filename}'.")


In [31]:
save_cleaned_dataset(df_copy)


Cleaned dataset saved successfully as 'cleaned_IMDb_Movies_India.csv'.
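A quick way to confirm that the saved file is usable is to read it back and compare it with the DataFrame in memory. The sketch below does this in a temporary directory with a small sample frame. The helper name `verify_round_trip` and the file name `check.csv` are hypothetical.

```python
import os
import tempfile

import pandas as pd


def verify_round_trip(df, path):
    """Write df to CSV, reload it, and compare shape and column names."""
    df.to_csv(path, index=False)
    reloaded = pd.read_csv(path)
    same_shape = reloaded.shape == df.shape
    same_columns = list(reloaded.columns) == list(df.columns)
    return same_shape and same_columns


# One row taken from the preview of the raw data
sample = pd.DataFrame({"Name": ["#Yaaram"], "Year": ["(2019)"], "Rating": [4.4]})

with tempfile.TemporaryDirectory() as tmp:
    ok = verify_round_trip(sample, os.path.join(tmp, "check.csv"))

print(ok)
```

This only checks shape and column names. Column dtypes can still differ after reloading, because CSV does not store type information.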


---
<a id="five"></a>
## **Exploratory Data Analysis (EDA)**

Exploratory data analysis (EDA) is the process of analyzing datasets to summarize their key features, often through visualization. It aims to discover patterns, spot anomalies, and formulate hypotheses that inform subsequent analysis.
#### Advantages

- Helps in understanding the data before modeling.
- Provides insights that guide feature selection and engineering.
- Assists in choosing appropriate modeling techniques.
- Uncovers potential data quality issues early.

`The following methods were employed to communicate our objective:`



---
<a id="nine"></a>
## **Conclusion and Future Work**


##### Conclusion



##### Future Work

To build upon this study, future work could focus on the following areas:



---
<a id="ten"></a>
## **References**

## Additional Sections to Consider

**Contributors**: Nozipho Sithembiso Ndebele
