# IMDb Indian Movies Data Analysis and Visualization

![](https://cdn.tollywood.net/wp-content/uploads/2022/01/IMDB-Most-Anticipated-Indian-Movies-of-2022.jpg)


This code is an example of an Exploratory Data Analysis (EDA) of a dataset of Indian movies from IMDb. It cleans and preprocesses the data, then uses a series of visualizations to surface trends and patterns in Indian cinema.

### Importing Libraries

The code begins by importing the Python libraries it relies on: Pandas and NumPy for data manipulation, Seaborn, Matplotlib, and Plotly for visualization, and ydata-profiling for automated profiling reports.

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
from matplotlib import pyplot as plt
import plotly.graph_objects as go
import plotly.express as px
from collections import namedtuple
import ydata_profiling as pandas_profiling  # Use ydata_profiling instead of pandas_profiling
from IPython.display import display

### Data Loading

The IMDb India Movies dataset is loaded using Pandas from a CSV file. The first few rows of the dataset are displayed to get an initial overview of the data.

In [None]:
#Load dataset
df = pd.read_csv('/kaggle/input/imdb-india-movies/IMDb Movies India.csv', encoding='latin1')
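
The `encoding='latin1'` argument matters when a file contains bytes that are not valid UTF-8. The following is a minimal sketch of that situation, using a made-up two-line CSV held in memory rather than the real Kaggle file; the film name is hypothetical.

```python
import io

import pandas as pd

# Hypothetical CSV content; "\u00e9" (e with acute accent) becomes the single byte 0xE9 in Latin-1.
raw = "Name,Year\nCaf\u00e9 Film,(2019)\n".encode("latin1")

# The same bytes are not valid UTF-8, so a strict UTF-8 decode fails.
try:
    raw.decode("utf-8")
    decodes_as_utf8 = True
except UnicodeDecodeError:
    decodes_as_utf8 = False

# Telling pandas the real encoding recovers the text intact.
toy = pd.read_csv(io.BytesIO(raw), encoding="latin1")
print(decodes_as_utf8, toy.loc[0, "Name"])
```

Latin-1 maps every byte to a character, so it never raises a decode error. If a file is actually in another encoding, the text will load but may contain wrong characters.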

In [None]:
# Display the first few rows of the dataset
df.head()

### Data Profiling

The ydata-profiling library (the successor to Pandas Profiling, imported above under the alias `pandas_profiling`) is used to generate a comprehensive report on the dataset. The report covers summary statistics, data types, missing values, and more, giving an in-depth first look at the data.

In [None]:
# Generate a ydata-profiling report for initial data exploration
report = pandas_profiling.ProfileReport(df)
display(report)
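
When the full profiling report is too heavy, a rough subset of the same information can be assembled with plain pandas. This is only a sketch on a hypothetical three-row frame (the values are invented), not a replacement for the report.

```python
import pandas as pd

# Hypothetical miniature of the movies table; all values are made up.
toy = pd.DataFrame({
    "Name": ["Film A", "Film B", "Film C"],
    "Year": ["(2019)", None, "(2021)"],
    "Rating": [7.1, None, 6.4],
})

# One row per column: dtype, number of missing cells, number of distinct non-missing values.
summary = pd.DataFrame({
    "dtype": toy.dtypes.astype(str),
    "missing": toy.isna().sum(),
    "unique": toy.nunique(),
})
print(summary)
```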

### Data Cleaning

Several data cleaning steps are performed: dropping rows in which every one of the selected columns is missing (`how='all'`), removing duplicate rows based on movie name and year, and stripping the parentheses from 'Year' and the ' min' suffix from 'Duration'. Finally, rows with a 'Year' value of '2022' are removed.

In [None]:
# Drop rows in which all of the columns at positions 1-8 (df.columns[1:9]) are missing
df.dropna(subset=df.columns[1:9], how='all', inplace=True)
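
The `how='all'` argument is easy to misread, so here is a small sketch of how it differs from `how='any'`. The frame and its values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical rows: Film B is missing one field, Film C is missing both.
toy = pd.DataFrame({
    "Name": ["Film A", "Film B", "Film C"],
    "Year": ["(2019)", np.nan, np.nan],
    "Rating": [7.1, 6.0, np.nan],
})

cols = ["Year", "Rating"]
kept_all = toy.dropna(subset=cols, how="all")  # drops a row only if every listed column is missing
kept_any = toy.dropna(subset=cols, how="any")  # drops a row if at least one listed column is missing
print(kept_all["Name"].tolist(), kept_any["Name"].tolist())
```

With `how='all'`, partially filled rows survive, which is why missing values can still appear in later steps of the analysis.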

In [None]:
# Remove duplicate rows based on 'Name' and 'Year'
df.drop_duplicates(subset=['Name', 'Year'], keep='first', inplace=True)
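
As a quick illustration of which row survives, the sketch below applies the same `subset` and `keep='first'` arguments to a hypothetical frame with invented ratings.

```python
import pandas as pd

# Hypothetical data: the first two rows share Name and Year, the third differs in Year.
toy = pd.DataFrame({
    "Name": ["Film A", "Film A", "Film A"],
    "Year": ["(2019)", "(2019)", "(2021)"],
    "Rating": [7.1, 6.5, 8.0],
})

# Rows count as duplicates when both Name and Year match; the earliest one is kept.
deduped = toy.drop_duplicates(subset=["Name", "Year"], keep="first")
print(deduped)
```

Because only 'Name' and 'Year' are compared, any differing values in the other columns of a dropped row are discarded.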

In [None]:
# Drop rows in which columns 1, 2, 4, and 5 are all missing (position 3, 'Genre', is skipped)
df.dropna(subset=df.columns[[1, 2, 4, 5]], how='all', inplace=True)

In [None]:
# Clean 'Year' and 'Duration' columns
df['Year'] = df['Year'].str.replace(r'[()]', '', regex=True)
df['Duration'] = df['Duration'].str.replace(r' min', '', regex=True)
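
The cell above leaves both columns as strings. If numeric columns are wanted, one possible follow-up is sketched below on hypothetical values; the use of `pd.to_numeric` and the nullable `Int64` dtype is an assumption for illustration, not something the notebook does.

```python
import pandas as pd

# Hypothetical raw values in the same shape as the 'Year' and 'Duration' columns.
toy = pd.DataFrame({
    "Year": ["(2019)", "(2021)", None],
    "Duration": ["109 min", None, "142 min"],
})

# Strip the decoration, then convert; missing entries stay missing instead of raising.
year = pd.to_numeric(
    toy["Year"].str.replace(r"[()]", "", regex=True), errors="coerce"
).astype("Int64")
duration = pd.to_numeric(
    toy["Duration"].str.replace(" min", "", regex=False), errors="coerce"
).astype("Int64")
print(year.tolist(), duration.tolist())
```

Note that the later filter `df['Year'] != '2022'` compares against a string, so it would need to compare against the integer `2022` if 'Year' were converted this way.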

In [None]:
# Remove rows with 'Year' equal to '2022'
df = df[df['Year'] != '2022']

In [None]:
# Display the cleaned dataset shape
print(f"Cleaned dataset shape: {df.shape}")

### Data Visualization

The code utilizes various plotting libraries like Plotly and Seaborn to create informative visualizations. It starts by plotting the number of movies released by year to understand the temporal trends in Indian cinema.

In [None]:
# Plot the number of movies released by year
year_count = df['Year'].value_counts().reset_index()
year_count.columns = ['Year', 'Count']
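
`value_counts()` orders its result by frequency, not by year. The sketch below, on a hypothetical list of years, shows one way to name the columns explicitly and sort chronologically before plotting.

```python
import pandas as pd

# Hypothetical release years, stored as strings like the cleaned 'Year' column.
years = pd.Series(["2019", "2021", "2019", "2020", "2019", "2021"], name="Year")

year_count = (
    years.value_counts()
    .rename_axis("Year")            # name the index so reset_index yields a 'Year' column
    .reset_index(name="Count")      # name the counts column explicitly
    .sort_values("Year")            # chronological order (string sort works for 4-digit years)
    .reset_index(drop=True)
)
print(year_count)
```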

In [None]:
fig = px.bar(year_count, x='Year', y='Count', text='Count', title='Number of Movies Released by Year')
fig.update_traces(texttemplate='%{text:.2s}', textposition='outside')
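fig.update_layout(
    # title.font is used because the older titlefont attribute is deprecated in Plotly
    xaxis=dict(title=dict(text='Year of Movie Release', font=dict(size=16))),
    yaxis=dict(title=dict(text='Count of Movies Released', font=dict(size=16)), tickfont=dict(size=14))
)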
fig.show()

### Genre Analysis

Dummy columns are created for each genre, and genre trends over the years are visualized using line charts. This analysis helps identify popular movie genres over time.

In [None]:
# Create dummy columns for each genre
dummies = df['Genre'].str.get_dummies(', ')
df_genre = pd.concat([df, dummies], axis=1)
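
To make the effect of `str.get_dummies` concrete, here is a small sketch on a hypothetical 'Genre' column. Each comma-separated label becomes its own indicator column, and a missing genre produces a row of zeros.

```python
import pandas as pd

# Hypothetical genre strings, including one missing value.
genre = pd.Series(["Drama", "Comedy, Drama", None, "Action, Comedy"], name="Genre")

# Split on ', ' and create one indicator column per distinct label.
dummies = genre.str.get_dummies(", ")
print(dummies)
```

The separator must match the data exactly: splitting on `','` alone would leave a leading space on some labels and create separate columns for `'Drama'` and `' Drama'`.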

In [None]:
# Plot genre trends over the years
genre_columns = dummies.columns  # Take the genre names from the dummy frame instead of a fixed column offset
genre_count_by_year = df_genre.groupby('Year')[genre_columns].sum().reset_index()

fig = go.Figure()
for genre in genre_columns:
    fig.add_trace(go.Scatter(x=genre_count_by_year['Year'], y=genre_count_by_year[genre],
                             mode='lines', name=genre))

fig.update_layout(
    title='Genre Trends Over the Years',
    # title.font is used because the older titlefont attribute is deprecated in Plotly
    xaxis=dict(title=dict(text='Year', font=dict(size=16))),
    yaxis=dict(title=dict(text='Count', font=dict(size=16)), tickfont=dict(size=14)),
    legend=dict(y=0, x=1.0, bgcolor='rgba(255, 255, 255, 0)', bordercolor='rgba(255, 255, 255, 0)')
)
fig.show()
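
The aggregation behind the line chart can be checked on a tiny example. The sketch below uses a hypothetical three-movie frame: summing the indicator columns per year gives the number of movies carrying each genre label in that year.

```python
import pandas as pd

# Hypothetical movies; a film with two genres counts once for each of them.
toy = pd.DataFrame({
    "Year": ["2019", "2019", "2020"],
    "Genre": ["Drama", "Comedy, Drama", "Action"],
})

dummies = toy["Genre"].str.get_dummies(", ")
by_year = (
    pd.concat([toy[["Year"]], dummies], axis=1)
    .groupby("Year")[list(dummies.columns)]
    .sum()
    .reset_index()
)
print(by_year)
```

Since multi-genre films are counted under every label they carry, the per-year totals across genres can exceed the number of movies released that year.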

### Top Actors Analysis

The code identifies and plots the top 20 actors by the number of movies they have appeared in over the years. This analysis sheds light on the most prolific actors in Indian cinema.

In [None]:
# Reshape the three actor columns into one row per (movie, actor) pair.
# The melted values are the actor names, so they go in the 'Actor' column;
# 'Billing' records which source column ('Actor 1', 'Actor 2', 'Actor 3') each name came from.
actor_cols = ['Actor 1', 'Actor 2', 'Actor 3']
actor_df = pd.melt(df[['Year'] + actor_cols], id_vars=['Year'], value_vars=actor_cols, var_name='Billing', value_name='Actor')
actor_df.dropna(subset=['Actor'], inplace=True)

In [None]:
# Get the top 20 actors by movie count
top_20_actors = actor_df['Actor'].value_counts().head(20).index.tolist()
top_20_actor_df = actor_df[actor_df['Actor'].isin(top_20_actors)]
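
The wide-to-long reshape that this ranking depends on can be sketched on a hypothetical frame. In `melt`, `var_name` labels the column holding the original column names and `value_name` labels the column holding the cell values, so the actor names must end up under `value_name`. The `Billing` column name and the actor names below are invented for illustration.

```python
import pandas as pd

# Hypothetical cast lists in the same wide layout as the dataset.
toy = pd.DataFrame({
    "Year": ["2019", "2020", "2021"],
    "Actor 1": ["Actor X", "Actor Y", "Actor X"],
    "Actor 2": ["Actor Y", None, "Actor Z"],
    "Actor 3": [None, "Actor X", None],
})
actor_cols = ["Actor 1", "Actor 2", "Actor 3"]

# One row per (movie, actor) pair; empty cast slots are dropped.
long = toy.melt(
    id_vars="Year", value_vars=actor_cols, var_name="Billing", value_name="Actor"
).dropna(subset=["Actor"])

# Rank actors by number of appearances and keep the top two.
top_2 = long["Actor"].value_counts().head(2).index.tolist()
print(top_2)
```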

In [None]:
# Plot the top 20 actors by movie count over the years
fig = px.strip(top_20_actor_df, x='Year', y='Actor', color='Actor', title='Top 20 Actors by Number of Movies Made Over the Years')
fig.update_layout(
    xaxis_tickfont_size=14,
    height=600
)
fig.show()

### Top Directors Analysis

Similar to the actors analysis, the code identifies and plots the top 20 directors by the number of movies they have directed. This analysis provides insights into influential directors in Indian cinema.

In [None]:
# Top Directors Analysis
director_df = df[['Director', 'Year']].dropna()
director_df['Movie_Count'] = 1

In [None]:
# Get the top 20 directors by movie count
top_20_directors = director_df['Director'].value_counts().head(20).index.tolist()
top_20_director_df = director_df[director_df['Director'].isin(top_20_directors)]
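
A strip plot draws one point per film, so films by the same director in the same year overlap. If explicit per-year counts are wanted, one option is to aggregate first. This is a sketch on hypothetical data and is not part of the original analysis.

```python
import pandas as pd

# Hypothetical director credits; the last row has no director and is dropped.
toy = pd.DataFrame({
    "Director": ["Dir A", "Dir A", "Dir B", "Dir A", None],
    "Year": ["2019", "2019", "2019", "2020", "2020"],
})

# Count films per (director, year) pair.
per_year = (
    toy.dropna()
    .groupby(["Director", "Year"])
    .size()
    .reset_index(name="Movie_Count")
)
print(per_year)
```

A frame in this shape could then be passed to a bar or scatter chart with `Movie_Count` mapped to bar height or marker size.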

In [None]:
# Plot the top 20 directors by movie count over the years
fig = px.strip(top_20_director_df, x='Year', y='Director', color='Director', title='Top 20 Directors by Number of Movies Made Over the Years')
fig.update_layout(
    xaxis_tickfont_size=14,
    height=600
)
fig.show()

### Duration, Rating, and Votes Analysis

The code explores the relationships between movie duration, ratings, and votes. It creates a 3D scatter plot and a pairplot to visualize these relationships.

In [None]:
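# Build a numeric frame of Duration, Rating, and Votes, keeping only complete rows
dur_rat = df[['Duration', 'Rating', 'Votes']].dropna()
# ' min' was already stripped during cleaning, so this replace is only a safeguard
dur_rat['Duration'] = dur_rat['Duration'].str.replace(' min', '', regex=False).astype(int)
# Remove thousands separators before converting the vote counts to numbers
dur_rat['Votes'] = dur_rat['Votes'].str.replace(',', '', regex=False).astype(float)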
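
`astype(float)` raises an error if any entry is not a plain number once the commas are removed. A more forgiving variant, sketched here on hypothetical strings, uses `pd.to_numeric` with `errors='coerce'` so that unparseable entries become `NaN` and can be inspected or dropped. Whether the real 'Votes' column contains such entries is not established here.

```python
import pandas as pd

# Hypothetical vote strings: two valid counts, one unparseable entry, one missing value.
votes = pd.Series(["1,086", "35", "not available", None])

# Remove thousands separators, then convert; anything non-numeric becomes NaN.
clean = pd.to_numeric(votes.str.replace(",", "", regex=False), errors="coerce")
print(clean.tolist())
```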

In [None]:
# 3D Scatter Plot of Duration, Rating, and Votes
fig = px.scatter_3d(dur_rat, x='Duration', y='Rating', z='Votes', color='Rating', title='3D Plot of Duration, Rating, and Votes')
fig.show()

In [None]:
# Pairplot for Duration, Rating, and Votes
sns.pairplot(dur_rat)
plt.show()
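
The visual impression from the pairplot can be backed by a correlation table. The sketch below uses hypothetical numbers and computes Spearman rank correlation as the Pearson correlation of the ranks. Rank correlation is chosen on the assumption that vote counts tend to be heavily skewed, which makes ordinary Pearson correlation sensitive to a few very popular films.

```python
import pandas as pd

# Hypothetical numeric frame in the same shape as dur_rat.
toy = pd.DataFrame({
    "Duration": [90, 120, 150, 180],
    "Rating": [5.0, 6.5, 6.0, 8.0],
    "Votes": [400.0, 300.0, 200.0, 100.0],
})

# Spearman correlation is the Pearson correlation computed on ranks.
corr = toy.rank().corr()
print(corr.round(2))
```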