# Analysis of the “Movie IMDb Dataset”

#### Overview

This project analyzes the “Movie IMDb Dataset” to uncover patterns and insights related to movie ratings, genres, budgets, and social media influence. The work involves data cleaning, preprocessing, exploratory data analysis, and visualization using Python libraries such as Pandas, Matplotlib, Seaborn, and Plotly. By examining correlations, trends, and sentiment indicators, the project aims to highlight factors that contribute to a movie’s success and summarize actionable insights through clear visualizations and interpretations.

### Importing Libraries

In [5]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings 
warnings.filterwarnings("ignore")

In [None]:
#from IPython.display import FileLink

# Save CSV
#data.to_csv("movie_metadata_duplicate.csv", index=False)

## 1. Data Loading and Initial Overview

##### Loading dataset

In [57]:
data = pd.read_csv("movie_metadata.csv")

##### Describing an overview of Dataset

In [13]:
#first 5 rows
print("=========First 5 rows==========")
data.head(5)



,color,director_name,num_critic_for_reviews,duration,director_facebook_likes,actor_3_facebook_likes,actor_2_name,actor_1_facebook_likes,gross,genres,...,num_user_for_reviews,language,country,content_rating,budget,title_year,actor_2_facebook_likes,imdb_score,aspect_ratio,movie_facebook_likes
0,Color,James Cameron,723.0,178.0,0.0,855.0,Joel David Moore,1000.0,760505847.0,Action|Adventure|Fantasy|Sci-Fi,...,3054.0,English,USA,PG-13,237000000.0,2009.0,936.0,7.9,1.78,33000
1,Color,Gore Verbinski,302.0,169.0,563.0,1000.0,Orlando Bloom,40000.0,309404152.0,Action|Adventure|Fantasy,...,1238.0,English,USA,PG-13,300000000.0,2007.0,5000.0,7.1,2.35,0
2,Color,Sam Mendes,602.0,148.0,0.0,161.0,Rory Kinnear,11000.0,200074175.0,Action|Adventure|Thriller,...,994.0,English,UK,PG-13,245000000.0,2015.0,393.0,6.8,2.35,85000
3,Color,Christopher Nolan,813.0,164.0,22000.0,23000.0,Christian Bale,27000.0,448130642.0,Action|Thriller,...,2701.0,English,USA,PG-13,250000000.0,2012.0,23000.0,8.5,2.35,164000
4,,Doug Walker,,,131.0,,Rob Walker,131.0,,Documentary,...,,,,,,,12.0,7.1,,0


In [58]:
#Shape of dataset
print("===========Number of rows and columns============")
print("\nNumber of rows  : ",data.shape[0])
print("Number of columns : ", data.shape[1])


Number of rows  :  5043
Number of columns :  28


In [21]:
#Info of rows and columns
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5043 entries, 0 to 5042
Data columns (total 28 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   color                      5024 non-null   object 
 1   director_name              4939 non-null   object 
 2   num_critic_for_reviews     4993 non-null   float64
 3   duration                   5028 non-null   float64
 4   director_facebook_likes    4939 non-null   float64
 5   actor_3_facebook_likes     5020 non-null   float64
 6   actor_2_name               5030 non-null   object 
 7   actor_1_facebook_likes     5036 non-null   float64
 8   gross                      4159 non-null   float64
 9   genres                     5043 non-null   object 
 10  actor_1_name               5036 non-null   object 
 11  movie_title                5043 non-null   object 
 12  num_voted_users            5043 non-null   int64  
 13  cast_total_facebook_likes  5043 non-null   int64

In [22]:
#Basic summary for numerical columns
data.describe()

,num_critic_for_reviews,duration,director_facebook_likes,actor_3_facebook_likes,actor_1_facebook_likes,gross,num_voted_users,cast_total_facebook_likes,facenumber_in_poster,num_user_for_reviews,budget,title_year,actor_2_facebook_likes,imdb_score,aspect_ratio,movie_facebook_likes
count,4993.0,5028.0,4939.0,5020.0,5036.0,4159.0,5043.0,5043.0,5030.0,5022.0,4551.0,4935.0,5030.0,5043.0,4714.0,5043.0
mean,140.194272,107.201074,686.509212,645.009761,6560.047061,48468410.0,83668.16,9699.063851,1.371173,272.770808,39752620.0,2002.470517,1651.754473,6.442138,2.220403,7525.964505
std,121.601675,25.197441,2813.328607,1665.041728,15020.75912,68452990.0,138485.3,18163.799124,2.013576,377.982886,206114900.0,12.474599,4042.438863,1.125116,1.385113,19320.44511
min,1.0,7.0,0.0,0.0,0.0,162.0,5.0,0.0,0.0,1.0,218.0,1916.0,0.0,1.6,1.18,0.0
25%,50.0,93.0,7.0,133.0,614.0,5340988.0,8593.5,1411.0,0.0,65.0,6000000.0,1999.0,281.0,5.8,1.85,0.0
50%,110.0,103.0,49.0,371.5,988.0,25517500.0,34359.0,3090.0,1.0,156.0,20000000.0,2005.0,595.0,6.6,2.35,166.0
75%,195.0,118.0,194.5,636.0,11000.0,62309440.0,96309.0,13756.5,2.0,326.0,45000000.0,2011.0,918.0,7.2,2.35,3000.0
max,813.0,511.0,23000.0,23000.0,640000.0,760505800.0,1689764.0,656730.0,43.0,5060.0,12215500000.0,2016.0,137000.0,9.5,16.0,349000.0


In [59]:
#Data type of each column
data.dtypes

color                         object
director_name                 object
num_critic_for_reviews       float64
duration                     float64
director_facebook_likes      float64
actor_3_facebook_likes       float64
actor_2_name                  object
actor_1_facebook_likes       float64
gross                        float64
genres                        object
actor_1_name                  object
movie_title                   object
num_voted_users                int64
cast_total_facebook_likes      int64
actor_3_name                  object
facenumber_in_poster         float64
plot_keywords                 object
movie_imdb_link               object
num_user_for_reviews         float64
language                      object
country                       object
content_rating                object
budget                       float64
title_year                   float64
actor_2_facebook_likes       float64
imdb_score                   float64
aspect_ratio                 float64
m

#### Observations

The IMDb movie dataset contains 5,043 rows and 28 columns, including details on directors, actors, genres, budgets, box-office gross, and IMDb scores.
It combines categorical and numerical fields with popularity metrics such as Facebook likes.
These fields make the dataset well suited to analyzing movie performance, audience sentiment, and industry trends.
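
The split between categorical and numerical columns can be checked programmatically. The following is a minimal sketch on a small hypothetical frame (`toy` and its values are illustrative and not taken from `movie_metadata.csv`); it assumes `select_dtypes` is an acceptable way to separate the two groups.

```python
import pandas as pd

# Hypothetical toy frame with one text column and two numeric columns.
toy = pd.DataFrame({
    "director_name": ["X", "Y"],
    "duration": [120.0, 95.0],
    "num_voted_users": [1000, 250],
})

# Numeric columns (ints and floats) versus everything else.
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
categorical_cols = toy.select_dtypes(exclude="number").columns.tolist()

print("Numeric:", numeric_cols)
print("Categorical:", categorical_cols)
```

Applied to `data`, the same two calls would list which of the 28 columns fall into each group.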

## Data Pre-processing

#### Handling Duplicates

In [60]:
# Removing duplicate rows

data = data.drop_duplicates()

# Shape after removing duplicates
print("===========Number of rows and columns============")
print("\nNumber of rows  : ",data.shape[0])
print("Number of columns : ", data.shape[1])



Number of rows  :  4998
Number of columns :  28
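
Before dropping duplicates it can help to count how many rows are affected. A minimal sketch on a hypothetical toy frame (the names and values below are illustrative only):

```python
import pandas as pd

# Hypothetical toy frame: the last row repeats the first one exactly.
toy = pd.DataFrame({
    "movie_title": ["A", "B", "A"],
    "title_year": [2001, 2005, 2001],
})

n_dupes = int(toy.duplicated().sum())   # rows identical to an earlier row
deduped = toy.drop_duplicates()         # keeps the first occurrence by default

print("Duplicate rows:", n_dupes)
print("Shape after dropping:", deduped.shape)
```

On the movie data, the row count falls from 5,043 to 4,998, so the equivalent count there would be 45.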


#### Handling Missing Values

In [69]:
# Missing Values
print("Number of missing values for each Column  \n","="*50,"\n",data.isna().sum())

Number of missing values for each Column  
 color                        17
director_name                 0
num_critic_for_reviews        0
duration                      0
director_facebook_likes       0
actor_3_facebook_likes        0
actor_2_name                  0
actor_1_facebook_likes        0
gross                         0
genres                        0
actor_1_name                  0
movie_title                   0
num_voted_users               0
cast_total_facebook_likes     0
actor_3_name                  0
facenumber_in_poster          0
plot_keywords                 0
movie_imdb_link               0
num_user_for_reviews          0
language                      0
country                       0
content_rating                0
budget                        0
title_year                    0
actor_2_facebook_likes        0
imdb_score                    0
aspect_ratio                  0
movie_facebook_likes          0
dtype: int64
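
Missing-value counts are easier to compare when shown alongside percentages, and they are most informative when computed before any imputation. A minimal sketch, assuming a small hypothetical frame in place of `data`:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with gaps in two columns.
toy = pd.DataFrame({
    "gross": [100.0, np.nan, 300.0, np.nan],
    "color": ["Color", None, "Color", "Color"],
    "imdb_score": [7.1, 6.5, 8.0, 5.9],
})

# Count and percentage of missing values per column, worst first.
missing = pd.DataFrame({
    "missing": toy.isna().sum(),
    "percent": (toy.isna().mean() * 100).round(1),
}).sort_values("missing", ascending=False)

print(missing)
```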


In [62]:
# Removing rows with missing values in the less critical actor Facebook-likes columns
data.dropna(subset=["actor_3_facebook_likes", "actor_2_facebook_likes", "actor_1_facebook_likes"], inplace=True)

In [72]:
# Filling missing values with appropriate replacements.
# Results are assigned back to the column: calling fillna(inplace=True) on data["col"] is chained assignment,
# which newer pandas versions warn about and which, with Copy-on-Write, does not update `data`.
data["color"] = data["color"].fillna(data["color"].mode()[0])     # mode
data["director_name"] = data["director_name"].fillna("Unknown")
data["num_critic_for_reviews"] = data["num_critic_for_reviews"].fillna(int(data["num_critic_for_reviews"].mean()))     # mean, truncated to an integer
data["duration"] = data["duration"].fillna(data["duration"].mean())      # mean
data["director_facebook_likes"] = data["director_facebook_likes"].fillna(0)       # 0, treating "no data" as "no likes"
data["gross"] = data.groupby("genres")["gross"].transform(lambda x: x.fillna(x.mean()))       # mean of the movie's genre group
data["gross"] = data["gross"].fillna(data["gross"].mean())    # fallback: overall mean, for genre groups with no known gross

data["actor_1_name"] = data["actor_1_name"].fillna("Unknown")
data["actor_2_name"] = data["actor_2_name"].fillna("Unknown")
data["actor_3_name"] = data["actor_3_name"].fillna("Unknown")
data["facenumber_in_poster"] = data["facenumber_in_poster"].fillna(int(data["facenumber_in_poster"].mode()[0]))      # mode
data["plot_keywords"] = data["plot_keywords"].fillna("Unknown")
data["num_user_for_reviews"] = data["num_user_for_reviews"].fillna(int(data["num_user_for_reviews"].mean()))      # mean, truncated to an integer
data["language"] = data["language"].fillna("Unknown")
data["country"] = data["country"].fillna("Unknown")
data["content_rating"] = data["content_rating"].fillna("Not Rated")
data["budget"] = data.groupby("genres")["budget"].transform(lambda x: x.fillna(x.mean()))       # mean of the movie's genre group
data["budget"] = data["budget"].fillna(data["budget"].mean())           # fallback: overall mean, for genre groups with no known budget

data["title_year"] = data["title_year"].fillna(int(data["title_year"].mode()[0]))            # mode
data["aspect_ratio"] = data["aspect_ratio"].fillna(float(data["aspect_ratio"].mode()[0]))          # mode
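
The two-step fill used for `gross` and `budget` (genre-group mean first, overall mean as a fallback) can be seen on a small hypothetical frame. The values below are illustrative only; the "Western" group has no observed value, so only the fallback can fill it.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: one gap inside a group, one group with no data at all.
toy = pd.DataFrame({
    "genres": ["Action", "Action", "Drama", "Western"],
    "gross": [100.0, np.nan, 50.0, np.nan],
})

# Step 1: fill within each genre group using that group's mean.
toy["gross"] = toy.groupby("genres")["gross"].transform(lambda s: s.fillna(s.mean()))

# Step 2: groups with no observed value are still NaN; fall back to the overall mean.
toy["gross"] = toy["gross"].fillna(toy["gross"].mean())

print(toy)
```

Note that the fallback mean in step 2 is computed after step 1, so it already includes the group-filled values.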

#### Deriving New Columns

In [79]:
#Deriving the Profit column (gross minus budget)

data["Profit"]=data["gross"]-data["budget"]

#Deriving the Decade column - the decade in which the film was released

data["Decade"]= (data["title_year"]//10)*10

In [83]:
#Deriving the Success_status column - based on profit relative to budget

data["Success_status"] = np.where(data["Profit"] > data["budget"] * 0.1, "Successful", "Loss")   # "Successful" if profit exceeds 10% of budget; otherwise "Loss", which also covers small positive profits
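
A quick check of the three derived columns on a hypothetical toy frame (values are illustrative only). The second row has a positive profit of 5 on a budget of 100, which falls under the 10% threshold and is therefore labeled "Loss".

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with the three input columns the derivations need.
toy = pd.DataFrame({
    "gross": [300.0, 105.0, 80.0],
    "budget": [100.0, 100.0, 100.0],
    "title_year": [1999.0, 2014.0, 2009.0],
})

toy["Profit"] = toy["gross"] - toy["budget"]
toy["Decade"] = (toy["title_year"] // 10) * 10   # 1999 -> 1990, 2014 -> 2010
toy["Success_status"] = np.where(
    toy["Profit"] > 0.1 * toy["budget"], "Successful", "Loss"
)

print(toy)
```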

In [89]:
data.sample()

,color,director_name,num_critic_for_reviews,duration,director_facebook_likes,actor_3_facebook_likes,actor_2_name,actor_1_facebook_likes,gross,genres,...,content_rating,budget,title_year,actor_2_facebook_likes,imdb_score,aspect_ratio,movie_facebook_likes,Profit,Decade,Success_status
3076,Color,Oliver Parker,105.0,97.0,32.0,327.0,Rupert Everett,893.0,18535191.0,Comedy|Romance,...,PG-13,14000000.0,1999.0,692.0,6.9,1.85,646,4535191.0,1990.0,Successful


#### Filtering

In [96]:
#Recent movies

recent_movies =data[data["title_year"]>2013]    #movies released after 2013 (2014-2016) are considered recent
recent_movies

,color,director_name,num_critic_for_reviews,duration,director_facebook_likes,actor_3_facebook_likes,actor_2_name,actor_1_facebook_likes,gross,genres,...,content_rating,budget,title_year,actor_2_facebook_likes,imdb_score,aspect_ratio,movie_facebook_likes,Profit,Decade,Success_status
2,Color,Sam Mendes,602.0,148.0,0.0,161.0,Rory Kinnear,11000.0,2.000742e+08,Action|Adventure|Thriller,...,PG-13,2.450000e+08,2015.0,393.0,6.8,2.35,85000,-4.492582e+07,2010.0,Loss
8,Color,Joss Whedon,635.0,141.0,0.0,19000.0,Robert Downey Jr.,26000.0,4.589916e+08,Action|Adventure|Sci-Fi,...,PG-13,2.500000e+08,2015.0,21000.0,7.5,2.35,118000,2.089916e+08,2010.0,Successful
10,Color,Zack Snyder,673.0,183.0,0.0,2000.0,Lauren Cohan,15000.0,3.302491e+08,Action|Adventure|Sci-Fi,...,PG-13,2.500000e+08,2016.0,4000.0,6.9,2.35,197000,8.024906e+07,2010.0,Successful
20,Color,Peter Jackson,422.0,164.0,0.0,773.0,Adam Brown,5000.0,2.551084e+08,Adventure|Fantasy,...,PG-13,2.500000e+08,2014.0,972.0,7.5,2.35,65000,5.108370e+06,2010.0,Loss
27,Color,Anthony Russo,516.0,147.0,94.0,11000.0,Scarlett Johansson,21000.0,4.071973e+08,Action|Adventure|Sci-Fi,...,PG-13,2.500000e+08,2016.0,19000.0,8.2,2.35,72000,1.571973e+08,2010.0,Successful
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4999,Color,Kirk Loudon,3.0,95.0,0.0,279.0,Johnny Walter,883.0,4.346774e+07,Drama|Sci-Fi|Thriller,...,Not Rated,7.500000e+04,2014.0,507.0,5.0,2.35,87,4.339274e+07,2010.0,Successful
5012,Color,David Ayer,233.0,109.0,453.0,120.0,Martin Donovan,1000.0,1.049997e+07,Action|Crime|Drama|Thriller,...,R,3.500000e+07,2014.0,206.0,5.7,1.85,10000,-2.450003e+07,2010.0,Loss
5016,Color,Joseph Mazzella,140.0,90.0,0.0,9.0,Mikaal Bates,313.0,3.266586e+07,Crime|Drama|Thriller,...,Not Rated,2.500000e+04,2015.0,25.0,4.8,2.35,33,3.264086e+07,2010.0,Successful
5019,Color,Marcus Nispel,43.0,91.0,158.0,265.0,Brittany Curran,630.0,3.884903e+07,Horror|Mystery|Thriller,...,R,1.643410e+07,2015.0,512.0,4.6,1.85,0,2.241493e+07,2010.0,Successful


In [95]:
#Successful movies

Successful=data[data["Success_status"]=="Successful"]
Successful

,color,director_name,num_critic_for_reviews,duration,director_facebook_likes,actor_3_facebook_likes,actor_2_name,actor_1_facebook_likes,gross,genres,...,content_rating,budget,title_year,actor_2_facebook_likes,imdb_score,aspect_ratio,movie_facebook_likes,Profit,Decade,Success_status
0,Color,James Cameron,723.0,178.0,0.0,855.0,Joel David Moore,1000.0,7.605058e+08,Action|Adventure|Fantasy|Sci-Fi,...,PG-13,2.370000e+08,2009.0,936.0,7.9,1.78,33000,5.235058e+08,2000.0,Successful
3,Color,Christopher Nolan,813.0,164.0,22000.0,23000.0,Christian Bale,27000.0,4.481306e+08,Action|Thriller,...,PG-13,2.500000e+08,2012.0,23000.0,8.5,2.35,164000,1.981306e+08,2010.0,Successful
6,Color,Sam Raimi,392.0,156.0,0.0,4000.0,James Franco,24000.0,3.365303e+08,Action|Adventure|Romance,...,PG-13,2.580000e+08,2007.0,11000.0,6.2,2.35,0,7.853030e+07,2000.0,Successful
8,Color,Joss Whedon,635.0,141.0,0.0,19000.0,Robert Downey Jr.,26000.0,4.589916e+08,Action|Adventure|Sci-Fi,...,PG-13,2.500000e+08,2015.0,21000.0,7.5,2.35,118000,2.089916e+08,2010.0,Successful
9,Color,David Yates,375.0,153.0,282.0,10000.0,Daniel Radcliffe,25000.0,3.019570e+08,Adventure|Family|Fantasy|Mystery,...,PG,2.500000e+08,2009.0,11000.0,7.5,2.35,10000,5.195698e+07,2000.0,Successful
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5036,Color,Anthony Vallone,140.0,84.0,2.0,2.0,John Considine,45.0,2.281075e+07,Crime|Drama,...,PG-13,3.250000e+03,2005.0,44.0,7.8,2.35,4,2.280750e+07,2000.0,Successful
5038,Color,Scott Smith,1.0,87.0,2.0,318.0,Daphne Zuniga,637.0,2.185859e+07,Comedy|Drama,...,Not Rated,1.478137e+07,2013.0,470.0,7.7,2.35,84,7.077218e+06,2010.0,Successful
5039,Color,Unknown,43.0,43.0,0.0,319.0,Valorie Curry,841.0,3.780378e+07,Crime|Drama|Mystery|Thriller,...,TV-14,3.286044e+07,2009.0,593.0,7.5,16.00,32000,4.943331e+06,2000.0,Successful
5040,Color,Benjamin Roberds,13.0,76.0,0.0,0.0,Maxwell Moody,0.0,3.444259e+07,Drama|Horror|Thriller,...,Not Rated,1.400000e+03,2013.0,0.0,6.3,2.35,16,3.444119e+07,2010.0,Successful


In [97]:
#Movies with a high IMDb score (above 8)

High_rated = data[data["imdb_score"]>8]
High_rated

,color,director_name,num_critic_for_reviews,duration,director_facebook_likes,actor_3_facebook_likes,actor_2_name,actor_1_facebook_likes,gross,genres,...,content_rating,budget,title_year,actor_2_facebook_likes,imdb_score,aspect_ratio,movie_facebook_likes,Profit,Decade,Success_status
3,Color,Christopher Nolan,813.0,164.0,22000.0,23000.0,Christian Bale,27000.0,4.481306e+08,Action|Thriller,...,PG-13,2.500000e+08,2012.0,23000.0,8.5,2.35,164000,1.981306e+08,2010.0,Successful
17,Color,Joss Whedon,703.0,173.0,0.0,19000.0,Robert Downey Jr.,26000.0,6.232795e+08,Action|Adventure|Sci-Fi,...,PG-13,2.200000e+08,2012.0,21000.0,8.1,1.85,123000,4.032795e+08,2010.0,Successful
27,Color,Anthony Russo,516.0,147.0,94.0,11000.0,Scarlett Johansson,21000.0,4.071973e+08,Action|Adventure|Sci-Fi,...,PG-13,2.500000e+08,2016.0,19000.0,8.2,2.35,72000,1.571973e+08,2010.0,Successful
43,Color,Lee Unkrich,453.0,103.0,125.0,721.0,John Ratzenberger,15000.0,4.149845e+08,Adventure|Animation|Comedy|Family|Fantasy,...,G,2.000000e+08,2010.0,1000.0,8.3,1.85,30000,2.149845e+08,2010.0,Successful
58,Color,Andrew Stanton,421.0,98.0,475.0,522.0,Fred Willard,1000.0,2.238069e+08,Adventure|Animation|Family|Sci-Fi,...,G,1.800000e+08,2008.0,729.0,8.4,2.35,16000,4.380689e+07,2000.0,Successful
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4921,Color,Majid Majidi,46.0,89.0,373.0,27.0,Amir Farrokh Hashemian,36.0,9.254020e+05,Drama|Family,...,PG,1.800000e+05,1997.0,35.0,8.5,1.85,0,7.454020e+05,1990.0,Successful
4924,Color,Cary Bell,140.0,78.0,0.0,0.0,Stacie Evans,0.0,5.746749e+06,Documentary,...,Not Rated,1.800000e+05,2014.0,0.0,8.7,2.35,88,5.566749e+06,2010.0,Successful
4937,Color,Bill Melendez,43.0,25.0,36.0,27.0,Bill Melendez,39.0,1.354330e+08,Animation|Comedy|Family,...,TV-G,1.500000e+05,1965.0,36.0,8.4,1.33,0,1.352830e+08,1960.0,Successful
4972,Color,Sut Jhally,16.0,80.0,3.0,0.0,Seth Ackerman,103.0,5.746749e+06,Documentary,...,Not Rated,7.000000e+04,2004.0,0.0,8.3,2.35,110,5.676749e+06,2000.0,Successful
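
The three filters above can also be combined into a single boolean mask. A minimal sketch on a hypothetical toy frame (titles and values are illustrative only), using the same thresholds as the cells above:

```python
import pandas as pd

# Hypothetical toy frame with the columns used by the three filters.
toy = pd.DataFrame({
    "movie_title": ["A", "B", "C", "D"],
    "title_year": [2012.0, 2014.0, 2015.0, 2016.0],
    "imdb_score": [8.5, 8.2, 6.1, 8.0],
    "Success_status": ["Successful", "Successful", "Loss", "Successful"],
})

# Each condition needs its own parentheses; & is the element-wise "and".
mask = (
    (toy["title_year"] > 2013)
    & (toy["imdb_score"] > 8)
    & (toy["Success_status"] == "Successful")
)
recent_hits = toy[mask]

print(recent_hits["movie_title"].tolist())
```

Movie "D" is excluded because the comparison is strict: a score of exactly 8.0 does not satisfy `> 8`.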


#### Aggregating 

In [98]:
#Average IMDb score by genre (grouped on the full pipe-delimited genres string, so each combination is its own group)

avg_score_genres = data.groupby("genres")["imdb_score"].mean()
avg_score_genres

genres
Action                                                             5.972727
Action|Adventure                                                   6.770000
Action|Adventure|Animation|Comedy|Crime|Family|Fantasy             6.200000
Action|Adventure|Animation|Comedy|Drama|Family|Fantasy|Thriller    6.000000
Action|Adventure|Animation|Comedy|Drama|Family|Sci-Fi              7.950000
                                                                     ...   
Sci-Fi|Thriller                                                    6.370000
Thriller                                                           5.326316
Thriller|War                                                       7.900000
Thriller|Western                                                   8.100000
Western                                                            6.583333
Name: imdb_score, Length: 910, dtype: float64
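
Grouping on the raw `genres` string yields 910 combinations rather than individual genres. One possible alternative, sketched here on a hypothetical toy frame, is to split the string on `|` and explode it so that each movie contributes to every genre it lists. This is an assumption about what a per-genre average should mean, not the method used above.

```python
import pandas as pd

# Hypothetical toy frame using the same pipe-delimited genre format.
toy = pd.DataFrame({
    "genres": ["Action|Adventure", "Action", "Drama|Adventure"],
    "imdb_score": [7.0, 5.0, 8.0],
})

per_genre = (
    toy.assign(genre=toy["genres"].str.split("|"))  # list of genres per movie
       .explode("genre")                            # one row per (movie, genre)
       .groupby("genre")["imdb_score"]
       .mean()
       .sort_values(ascending=False)
)

print(per_genre)
```

With this approach a movie with several genres is counted once per genre, so the per-genre row counts sum to more than the number of movies.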

In [100]:
#Total gross by country

gross_by_country = data.groupby("country")["gross"].sum().sort_values(ascending=False)
gross_by_country

country
USA            1.980400e+11
UK             1.459280e+10
Canada         3.642970e+09
France         3.402603e+09
Germany        2.837204e+09
                   ...     
Kyrgyzstan     2.126511e+06
Afghanistan    1.127331e+06
Finland        6.117090e+05
Philippines    7.007100e+04
Georgia        1.714900e+04
Name: gross, Length: 64, dtype: float64
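
A country's total gross depends heavily on how many of its films appear in the dataset. Reporting the count and mean next to the sum makes the totals easier to interpret. A minimal sketch using named aggregation on a hypothetical toy frame (the output column names `total_gross`, `n_movies`, and `mean_gross` are illustrative choices):

```python
import pandas as pd

# Hypothetical toy frame with two films from one country and one each from two others.
toy = pd.DataFrame({
    "country": ["USA", "USA", "UK", "France"],
    "gross": [100.0, 300.0, 50.0, 20.0],
})

by_country = (
    toy.groupby("country")["gross"]
       .agg(total_gross="sum", n_movies="count", mean_gross="mean")
       .sort_values("total_gross", ascending=False)
)

print(by_country)
```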

In [101]:
#Number of movies per Decade

num_of_movies_decade = data.groupby("Decade")["movie_title"].count()
num_of_movies_decade

Decade
1910.0       1
1920.0       5
1930.0      15
1940.0      24
1950.0      28
1960.0      72
1970.0     111
1980.0     287
1990.0     782
2000.0    2177
2010.0    1473
Name: movie_title, dtype: int64
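
The per-decade counts can also be expressed as shares of the total, which makes the concentration in recent decades easier to compare. A minimal sketch on a hypothetical toy column; it assumes `Decade` has no missing values, as is the case after the mode fill of `title_year` above.

```python
import pandas as pd

# Hypothetical toy column in the same float format as the Decade column.
toy = pd.DataFrame({"Decade": [1990.0, 2000.0, 2000.0, 2010.0]})

# Cast to int for cleaner labels, count per decade, and order chronologically.
counts = toy["Decade"].astype(int).value_counts().sort_index()
share = (counts / counts.sum() * 100).round(1)

print(pd.DataFrame({"movies": counts, "share_percent": share}))
```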

#### Observations

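The dataset was cleaned by dropping 45 duplicate rows (5,043 to 4,998) and by filling missing values with the mean, the mode, or a genre-group mean. The derived columns Profit, Decade, and Success_status support the analysis of financial trends and movie success. Filtering and aggregation give a first view of patterns across genres, countries, and decades: for example, the USA accounts for the largest total gross (about 1.98e11), and the 2000s contain the most titles (2,177). Because missing gross and budget values were imputed, the financial totals are approximate.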

## Exploratory Data Analysis (EDA)