# MOVIE INDUSTRY ANALYSIS


## Business Problem
Your company now sees all the big companies creating original video content and they want to get in on the fun. They have decided to create a new movie studio, but they don’t know anything about creating movies. You are charged with exploring what types of films are currently doing the best at the box office. You must then translate those findings into actionable insights that the head of your company's new movie studio can use to help decide what type of films to create.



# Objective

This project aims to analyze movie data to identify trends, clean inconsistencies, and visualize insights. The workflow follows five structured steps:


### Workflow & Steps

#### Step 1: Importing, Connecting, Loading, and Analysing the Datasets

1.1 Import necessary libraries

1.2 Connect to and load the datasets

1.3 Check dataset shape (rows & columns) 

1.4 View dataset information (column types, missing values)

1.5 Summary statistics of numerical columns  

1.6 Check for duplicate rows  

1.7 Check for unique values in key categorical columns


#### Step 2: Data Cleaning

2.1 Remove duplicates 

2.2 Handle missing values appropriately  

2.3 Convert data types if necessary  

2.4 Standardize column names for consistency 

2.5 Correct inconsistent categorical values  

2.6 Save the cleaned dataset  

#### Step 1: Importing and Loading Data

#### 1.1: Importing necessary libraries

In [166]:
# Import the necessary libraries

import zipfile
from scipy import stats
import sqlite3
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

* In this project, we import libraries for data manipulation (pandas, NumPy), database access (sqlite3), statistics (SciPy), and visualization (Matplotlib, seaborn).
 They let us load, clean, merge, and analyze structured data from flat files and relational databases,
 and prepare it for deeper analysis.

#### 1.2: Connecting and Loading the Dataset

In [167]:
# Establishing a connection to the SQLite database containing structured data 
conn = sqlite3.connect("zippedData/im.db1/im.db")

* By creating a connection object, we can execute SQL queries and extract tables or records from the SQLite database. This connection is what lets us retrieve data stored in relational format and load it into a DataFrame for further analysis.
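Before querying a specific table, it can help to see which tables the database actually contains. The following is a minimal sketch, not part of the original notebook: `list_tables` is a hypothetical helper, and an in-memory database with two stand-in tables is used so the example runs without the project's `im.db` file.

```python
import sqlite3

def list_tables(conn):
    """Return the names of all tables in a SQLite database, sorted alphabetically."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name;"
    ).fetchall()
    return [name for (name,) in rows]

# Stand-in database (assumption): two empty tables named like the ones queried in this notebook.
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("CREATE TABLE movie_basics (movie_id TEXT PRIMARY KEY)")
demo_conn.execute("CREATE TABLE movie_ratings (movie_id TEXT, averagerating REAL)")

tables = list_tables(demo_conn)
print(tables)
demo_conn.close()
```

With the real database, the same helper could be called on `conn` to decide which tables are worth loading.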


In [168]:
# Query im.db to get data from movie_basics.

query = """
SELECT * 
 FROM movie_basics;
"""

# output query using pandas

movie_basics_df = pd.read_sql(query, conn)
movie_basics_df.head()

Unnamed: 0,movie_id,primary_title,original_title,start_year,runtime_minutes,genres
0,tt0063540,Sunghursh,Sunghursh,2013,175.0,"Action,Crime,Drama"
1,tt0066787,One Day Before the Rainy Season,Ashad Ka Ek Din,2019,114.0,"Biography,Drama"
2,tt0069049,The Other Side of the Wind,The Other Side of the Wind,2018,122.0,Drama
3,tt0069204,Sabse Bada Sukh,Sabse Bada Sukh,2018,,"Comedy,Drama"
4,tt0100275,The Wandering Soap Opera,La Telenovela Errante,2017,80.0,"Comedy,Drama,Fantasy"


* This will display the first 5 rows of the movie_basics_df DataFrame.

In [169]:
# Check the number of rows and columns in movie_basics
movie_basics_df.shape

(146144, 6)

* The `movie_basics` table has 146,144 rows and 6 columns.

In [170]:
# Summary statistics of `movie_basics_df`
# Note: describe must be called with parentheses; a bare `movie_basics_df.describe`
# only echoes the bound method instead of computing statistics.
movie_basics_df.describe()

* `describe()` generates summary statistics for the numerical columns of `movie_basics_df`.
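To illustrate what summary statistics look like, here is a small sketch on a made-up DataFrame (the values are illustrative, not taken from `im.db`). It also shows that `describe` is a method: referencing it without parentheses yields a callable object, not a table of statistics.

```python
import pandas as pd

# Toy data (assumption): four films, one with a missing runtime.
toy_df = pd.DataFrame({
    "title": ["a", "b", "c", "d"],
    "runtime_minutes": [80.0, 100.0, 120.0, None],
})

# Calling the method computes count, mean, std, min, quartiles, and max
# for numeric columns only; missing values are excluded from the count.
summary = toy_df.describe()
print(summary)

# Without parentheses we only get the method object itself.
is_method = callable(toy_df.describe)
print(is_method)
```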

In [171]:
# query movie_ratings table

query = """
SELECT *
 FROM movie_ratings;
"""

movie_rating_df = pd.read_sql(query, conn)
movie_rating_df.head()

Unnamed: 0,movie_id,averagerating,numvotes
0,tt10356526,8.3,31
1,tt10384606,8.9,559
2,tt1042974,6.4,20
3,tt1043726,4.2,50352
4,tt1060240,6.5,21


* This will display the first 5 rows of the movie_ratings DataFrame.

In [172]:
# Check the number of rows and columns in movie_ratings
movie_rating_df.shape

(73856, 3)

* The `movie_ratings` table has 73,856 rows and 3 columns.

In [173]:
# Summary statistics of `movie_rating_df`
# Note: describe must be called with parentheses; a bare `movie_rating_df.describe`
# only echoes the bound method instead of computing statistics.
movie_rating_df.describe()

* `describe()` generates summary statistics for the numerical columns of `movie_rating_df`.

In [174]:
# merge movie_basics_df & movie_rating_df

film_df = pd.merge(movie_basics_df, movie_rating_df, on='movie_id', how='inner')
film_df.head()

Unnamed: 0,movie_id,primary_title,original_title,start_year,runtime_minutes,genres,averagerating,numvotes
0,tt0063540,Sunghursh,Sunghursh,2013,175.0,"Action,Crime,Drama",7.0,77
1,tt0066787,One Day Before the Rainy Season,Ashad Ka Ek Din,2019,114.0,"Biography,Drama",7.2,43
2,tt0069049,The Other Side of the Wind,The Other Side of the Wind,2018,122.0,Drama,6.9,4517
3,tt0069204,Sabse Bada Sukh,Sabse Bada Sukh,2018,,"Comedy,Drama",6.1,13
4,tt0100275,The Wandering Soap Opera,La Telenovela Errante,2017,80.0,"Comedy,Drama,Fantasy",6.5,119


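* The `movie_basics` and `movie_ratings` tables are the most relevant to our question, so we merge them on `movie_id` into a single `film_df` DataFrame. Because this is an inner join, only movies that appear in both tables are kept.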
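An inner join silently drops rows without a match: here it keeps 73,856 of the 146,144 rows in `movie_basics`. One way to audit this is a left join with `indicator=True`, sketched below on made-up data. The frames and values are assumptions for illustration; `validate="one_to_one"` additionally raises an error if `movie_id` is repeated on either side.

```python
import pandas as pd

# Toy stand-ins (assumption) for movie_basics_df and movie_rating_df.
basics = pd.DataFrame({
    "movie_id": ["tt1", "tt2", "tt3"],
    "primary_title": ["A", "B", "C"],
})
ratings = pd.DataFrame({
    "movie_id": ["tt1", "tt3"],
    "averagerating": [7.0, 6.5],
})

# A left join with an indicator column shows which movies have no rating.
audit = pd.merge(
    basics, ratings, on="movie_id", how="left",
    indicator=True, validate="one_to_one",
)
unrated = int((audit["_merge"] == "left_only").sum())

# The inner join used in the notebook keeps only the matched rows.
inner = pd.merge(basics, ratings, on="movie_id", how="inner")
print(unrated, len(inner))
```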

In [175]:
# Check the number of rows and columns after merging
film_df.shape

(73856, 8)

* The merged `film_df` has 73,856 rows and 8 columns.

In [176]:
# Data types and non-null counts
film_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 73856 entries, 0 to 73855
Data columns (total 8 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   movie_id         73856 non-null  object 
 1   primary_title    73856 non-null  object 
 2   original_title   73856 non-null  object 
 3   start_year       73856 non-null  int64  
 4   runtime_minutes  66236 non-null  float64
 5   genres           73052 non-null  object 
 6   averagerating    73856 non-null  float64
 7   numvotes         73856 non-null  int64  
dtypes: float64(2), int64(2), object(4)
memory usage: 4.5+ MB


* `info()` gives an overview of the DataFrame, including the number of non-null entries and the data type of each column.

In [177]:
# Summary statistics of `film_df`
# Note: describe must be called with parentheses; a bare `film_df.describe`
# only echoes the bound method instead of computing statistics.
film_df.describe()

In [178]:
# The following are the column names in the `film_df` DataFrame:
film_df.columns.tolist()

['movie_id',
 'primary_title',
 'original_title',
 'start_year',
 'runtime_minutes',
 'genres',
 'averagerating',
 'numvotes']

In [179]:
# Load box office data from bom.movie_gross.csv.gz
# box_office_df: shows how much each movie earned

box_office_df = pd.read_csv('zippedData/bom.movie_gross.csv.gz')
box_office_df.head()

Unnamed: 0,title,studio,domestic_gross,foreign_gross,year
0,Toy Story 3,BV,415000000.0,652000000,2010
1,Alice in Wonderland (2010),BV,334200000.0,691300000,2010
2,Harry Potter and the Deathly Hallows Part 1,WB,296000000.0,664300000,2010
3,Inception,WB,292600000.0,535700000,2010
4,Shrek Forever After,P/DW,238700000.0,513900000,2010


In [180]:
# check box_office_df rows and columns
box_office_df.shape

(3387, 5)

In [181]:
box_office_df.describe()

Unnamed: 0,domestic_gross,year
count,3359.0,3387.0
mean,28745850.0,2013.958075
std,66982500.0,2.478141
min,100.0,2010.0
25%,120000.0,2012.0
50%,1400000.0,2014.0
75%,27900000.0,2016.0
max,936700000.0,2018.0


In [182]:
# Check dataset info
box_office_df.info()  # Summary of dataset, including column types and non-null values

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3387 entries, 0 to 3386
Data columns (total 5 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   title           3387 non-null   object 
 1   studio          3382 non-null   object 
 2   domestic_gross  3359 non-null   float64
 3   foreign_gross   2037 non-null   object 
 4   year            3387 non-null   int64  
dtypes: float64(1), int64(1), object(3)
memory usage: 132.4+ KB


#### 1.4 Checking for Missing Values

In [183]:
# Checking missing values
def check_missing_values(df, name):
    print(f"\n🔍 Missing values in {name}:")
    missing_values = df.isnull().sum()
    print(missing_values[missing_values > 0])  # Show only columns with missing values

# Check for missing values
check_missing_values(box_office_df, "Box Office Data")
check_missing_values(movie_basics_df, "Movies basics Data")
check_missing_values(movie_rating_df, "Movie Ratings Data")
check_missing_values(film_df, "Merged Movie basics and Movie ratings Data")


🔍 Missing values in Box Office Data:
studio               5
domestic_gross      28
foreign_gross     1350
dtype: int64

🔍 Missing values in Movies basics Data:
original_title        21
runtime_minutes    31739
genres              5408
dtype: int64

🔍 Missing values in Movie Ratings Data:
Series([], dtype: int64)

🔍 Missing values in Merged Movie basics and Movie ratings Data:
runtime_minutes    7620
genres              804
dtype: int64
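Raw counts are easier to act on when expressed as a share of the table: for example, 1,350 missing `foreign_gross` values out of 3,387 rows is roughly 40%. The sketch below shows one way to report both; `missing_summary` is a hypothetical helper and the toy data is made up for illustration.

```python
import pandas as pd

def missing_summary(df):
    """Return missing-value counts and percentages for columns that have any."""
    counts = df.isnull().sum()
    percent = (counts / len(df) * 100).round(1)
    out = pd.DataFrame({"missing": counts, "percent": percent})
    return out[out["missing"] > 0].sort_values("missing", ascending=False)

# Toy data (assumption), loosely shaped like the box office table.
toy_df = pd.DataFrame({
    "studio": ["bv", None, "wb", "wb"],
    "foreign_gross": [None, None, "1,200", "500"],
    "year": [2010, 2011, 2012, 2013],
})

summary = missing_summary(toy_df)
print(summary)
```

A percentage view of this kind makes the later choice between dropping rows and filling values easier to justify.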


#### 1.5 Handling Missing Values

Since there are only 5 missing studio values and 28 missing domestic_gross values, we drop the rows.

In [184]:
box_office_df.dropna(subset=["studio", "domestic_gross"], inplace=True)

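Since `foreign_gross` has 1,350 missing values, too many to drop, we fill them with the column median. The column is stored as text with thousands separators, so it is converted to a numeric type first.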

In [185]:
# Strip thousands separators, convert to float, then fill gaps with the median.
# Assigning the result back avoids the chained `inplace=True` call, which pandas warns about.
box_office_df["foreign_gross"] = box_office_df["foreign_gross"].replace(",", "", regex=True).astype(float)
box_office_df["foreign_gross"] = box_office_df["foreign_gross"].fillna(box_office_df["foreign_gross"].median())


In [186]:
# Checking handled missing values

# To confirm no missing values
box_office_df["foreign_gross"].dtype 
box_office_df.isnull().sum() 

title             0
studio            0
domestic_gross    0
foreign_gross     0
year              0
dtype: int64
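The same conversion and median fill can be sketched on made-up values using `pd.to_numeric` and plain assignment. The extra `foreign_gross_missing` flag column is an assumption added here so that imputed rows stay identifiable later; it is not part of the original notebook.

```python
import pandas as pd

# Toy values (assumption): strings, one with a thousands separator, one missing.
toy_df = pd.DataFrame({"foreign_gross": ["652000000", "1,131.6", None, "5200"]})

# Remove separators, then convert; anything unparseable becomes NaN.
cleaned = pd.to_numeric(
    toy_df["foreign_gross"].str.replace(",", "", regex=False),
    errors="coerce",
)

# Record which rows were missing before imputing, then fill with the median.
toy_df["foreign_gross_missing"] = cleaned.isna()
toy_df["foreign_gross"] = cleaned.fillna(cleaned.median())
print(toy_df)
```

With about 40% of the column imputed, the median fill pulls the distribution toward a single value, so analyses of foreign revenue may want to filter on the flag.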

In [187]:
# Film data cleaning

# Fill missing 'runtime_minutes' with the median value
median_runtime = film_df['runtime_minutes'].median()
film_df['runtime_minutes'] = film_df['runtime_minutes'].fillna(median_runtime)

# Fill missing 'genres' with 'Unknown'
film_df['genres'] = film_df['genres'].fillna('Unknown')

# Optional: fill any missing 'original_title' with the 'primary_title' from the same row
film_df['original_title'] = film_df['original_title'].fillna(film_df['primary_title'])

# --- Checking after cleaning ---
print("Missing Values After Cleaning:")
print(film_df.isnull().sum())


Missing Values After Cleaning:
movie_id           0
primary_title      0
original_title     0
start_year         0
runtime_minutes    0
genres             0
averagerating      0
numvotes           0
dtype: int64

In [188]:
# Drop any rows still missing 'runtime_minutes' or 'genres'
film_df.dropna(subset=['runtime_minutes', 'genres'], inplace=True)

# Final check
print("Missing Values After Final Cleaning:")
print(film_df.isnull().sum())


Missing Values After Final Cleaning:
movie_id           0
primary_title      0
original_title     0
start_year         0
runtime_minutes    0
genres             0
averagerating      0
numvotes           0
dtype: int64


In [189]:
# Standardize column names
film_df.columns = film_df.columns.str.strip().str.lower().str.replace(' ', '_')
box_office_df.columns = box_office_df.columns.str.strip().str.lower().str.replace(' ', '_')

In [190]:
# Correct inconsistent categorical values (example: convert text to lowercase and trim whitespace)
for col in film_df.select_dtypes(include=['object']).columns:
    film_df[col] = film_df[col].str.lower().str.strip()

for col in box_office_df.select_dtypes(include=['object']).columns:
    box_office_df[col] = box_office_df[col].str.lower().str.strip()

#### Handling Duplicates in the Data

In [191]:
# 2.2 Check for duplicates (labels match the DataFrame being checked)
print(f"🔍 Duplicates in Box Office Data: {box_office_df.duplicated().sum()}")
print(f"🔍 Duplicates in Movie Basics Data: {movie_basics_df.duplicated().sum()}")
print(f"🔍 Duplicates in Movie Ratings Data: {movie_rating_df.duplicated().sum()}")
print(f"🔍 Duplicates in Merged Film Data: {film_df.duplicated().sum()}")

🔍 Duplicates in Box Office Data: 0
🔍 Duplicates in Movie Basics Data: 0
🔍 Duplicates in Movie Ratings Data: 0
🔍 Duplicates in Merged Film Data: 0
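`duplicated()` with no arguments only flags rows that match in every column. Records that repeat an identifier but differ elsewhere go unnoticed, so it can be worth checking the key column on its own. A minimal sketch on made-up data (the frame below is an assumption, not project data):

```python
import pandas as pd

# Toy data (assumption): 'tt2' appears twice with different vote counts.
toy_df = pd.DataFrame({
    "movie_id": ["tt1", "tt2", "tt2", "tt3"],
    "numvotes": [10, 20, 25, 30],
})

# Full-row comparison finds nothing, because the two 'tt2' rows differ in numvotes.
full_row_dupes = int(toy_df.duplicated().sum())

# Restricting the check to the key column reveals the repeated identifier.
key_dupes = int(toy_df.duplicated(subset=["movie_id"]).sum())

# Keep the first record per movie_id.
deduped = toy_df.drop_duplicates(subset=["movie_id"], keep="first")
print(full_row_dupes, key_dupes, len(deduped))
```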


In [192]:
# Normalize casing and whitespace in `studio` and `title`

box_office_df["studio"] = box_office_df["studio"].str.strip().str.lower()
box_office_df["title"] = box_office_df["title"].str.strip().str.lower()


In [193]:
# Counting occurrences of each title
box_office_df["title"].value_counts().head(10)

title
bluebeard                                                   2
the chronicles of narnia: the voyage of the dawn treader    1
the king's speech                                           1
tron legacy                                                 1
the karate kid                                              1
prince of persia: the sands of time                         1
black swan                                                  1
megamind                                                    1
robin hood                                                  1
the last airbender                                          1
Name: count, dtype: int64

In [194]:
# Searching for the Movie "Bluebeard" in `box_office_df`
box_office_df[box_office_df["title"] == "bluebeard"]

Unnamed: 0,title,studio,domestic_gross,foreign_gross,year
317,bluebeard,strand,33500.0,5200.0,2010
3045,bluebeard,wgusa,43100.0,19400000.0,2017


In [195]:
# Differentiate repeated titles such as "bluebeard" by appending the release year to every title
box_office_df["title"] = box_office_df["title"] + " (" + box_office_df["year"].astype(str) + ")"
title_counts = box_office_df["title"].value_counts()
title_counts[title_counts > 1]

Series([], Name: count, dtype: int64)
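Appending the year to every title also changes titles that were already unique, and titles that already carry a year, such as "alice in wonderland (2010)", end up with it twice. An alternative, shown here only as a sketch on made-up data and not the approach used in this notebook, is to touch only the titles that actually repeat:

```python
import pandas as pd

# Toy data (assumption): one title shared by two different releases.
toy_df = pd.DataFrame({
    "title": ["bluebeard", "inception", "bluebeard"],
    "year": [2010, 2010, 2017],
})

# keep=False marks every occurrence of a repeated title, not just the later ones.
is_repeated = toy_df["title"].duplicated(keep=False)

# Append the release year only where the title is ambiguous.
toy_df.loc[is_repeated, "title"] = (
    toy_df.loc[is_repeated, "title"]
    + " ("
    + toy_df.loc[is_repeated, "year"].astype(str)
    + ")"
)
print(toy_df["title"].tolist())
```

Leaving unique titles unchanged also keeps them easier to match against titles in other tables.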

In [196]:
# 2.3 Display categorical features summary (a notebook cell only displays its last expression)
box_office_df.describe(include=['O'])
movie_basics_df.describe(include=['O'])
movie_rating_df.describe(include=['O'])

Unnamed: 0,movie_id
count,73856
unique,73856
top,tt9174828
freq,1


In [197]:
print("Missing Values:\n", box_office_df.isnull().sum())  # Check missing values 
print("Missing Values:\n", movie_basics_df.isnull().sum())  # Check missing values
print("Missing Values:\n", movie_rating_df.isnull().sum())  # Check missing values     
print("\nDuplicate Rows:", box_office_df.duplicated().sum())  # Check duplicate rows  
print("\nDuplicate Rows:", movie_basics_df.duplicated().sum())  # Check duplicate rows 
print("\nDuplicate Rows:",movie_rating_df.duplicated().sum())  # Check duplicate rows 
print("\nData Types:\n", box_office_df.dtypes)  # Check data type
print("\nData Types:\n", movie_basics_df.dtypes)  # Check data type
print("\nData Types:\n", movie_rating_df.dtypes)  # Check data type

Missing Values:
 title             0
studio            0
domestic_gross    0
foreign_gross     0
year              0
dtype: int64
Missing Values:
 movie_id               0
primary_title          0
original_title        21
start_year             0
runtime_minutes    31739
genres              5408
dtype: int64
Missing Values:
 movie_id         0
averagerating    0
numvotes         0
dtype: int64

Duplicate Rows: 0

Duplicate Rows: 0

Duplicate Rows: 0

Data Types:
 title              object
studio             object
domestic_gross    float64
foreign_gross     float64
year                int64
dtype: object

Data Types:
 movie_id            object
primary_title       object
original_title      object
start_year           int64
runtime_minutes    float64
genres              object
dtype: object

Data Types:
 movie_id          object
averagerating    float64
numvotes           int64
dtype: object


In [198]:
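# Save each cleaned dataset to its own file for Tableau.
# Writing both DataFrames to the same path would overwrite the first one.
# (The file names below are illustrative.)
film_df.to_csv("cleaned_film_data.csv", index=False)
box_office_df.to_csv("cleaned_box_office_data.csv", index=False)


In [199]:
from IPython.display import FileLink, display
display(FileLink("cleaned_film_data.csv"))  # Generates a download link
display(FileLink("cleaned_box_office_data.csv"))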
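Writing two DataFrames to the same path leaves only the last one on disk, so each frame needs its own file. A quick way to confirm an export is to read the files back and compare their shape and columns. The sketch below uses made-up frames and a temporary directory; the file names are assumptions for illustration.

```python
import os
import tempfile

import pandas as pd

# Toy stand-ins (assumption) for the cleaned film and box office frames.
film_toy = pd.DataFrame({"movie_id": ["tt1"], "averagerating": [7.0]})
box_toy = pd.DataFrame({"title": ["bluebeard (2010)"], "domestic_gross": [33500.0]})

with tempfile.TemporaryDirectory() as tmp_dir:
    film_path = os.path.join(tmp_dir, "film_toy.csv")
    box_path = os.path.join(tmp_dir, "box_toy.csv")

    # One file per DataFrame, so neither export overwrites the other.
    film_toy.to_csv(film_path, index=False)
    box_toy.to_csv(box_path, index=False)

    # Read both files back to confirm that each export survived.
    film_back = pd.read_csv(film_path)
    box_back = pd.read_csv(box_path)

print(list(film_back.columns), list(box_back.columns))
```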