# MOVIE INDUSTRY ANALYSIS

## 🎬 Project Title:
Box Office Gold: Data-Driven Insights for a Profitable Movie Studio Launch

### Business Understanding

#### 👥 Stakeholder

The primary stakeholder is the executive team of the company's new movie studio. They need insights into the film industry to make confident decisions on what type of movies to produce.

### 🌍 Domain:

Entertainment & Media Analytics (specifically, Film Industry/Box Office Performance)

#### 📘 Introduction:

With the growing trend of major companies venturing into original film production, my organization is planning to launch its own movie studio. However, without prior experience in the film industry, there’s uncertainty about what types of movies resonate with audiences and drive box office success. This project aims to analyze trends in box office data to uncover what genres, budgets, and other film attributes contribute most to commercial success — providing strategic guidance for profitable content creation.

#### 🎯 Business Objectives:

- **Identify high-performing film genres:** Analyze box office data to determine which movie genres consistently generate the highest revenue and audience engagement.
- **Examine the relationship between budget and profitability:** Investigate how production budgets influence box office success and identify the budget range that maximizes return on investment (ROI).
- **Assess the impact of key film attributes:** Explore how factors such as runtime, cast, release date (season), and film ratings (e.g., PG-13, R) affect a movie’s performance.
- **Benchmark against top studios:** Analyze which production studios are leading in terms of commercial success and identify patterns in their film portfolios.
- **Provide actionable recommendations:** Based on the insights, suggest the optimal type of film (genre, budget, release timing, etc.) that the company should produce for a successful studio launch.

#### 📊 Project Plan: Box Office Gold – Data-Driven Insights for a Profitable Movie Studio Launch
🔍 1. Problem Understanding & Goal Definition

Review business problem: Identify film types that succeed at the box office.

Define clear goals: Provide recommendations on genre, budget, and release strategy.

📦 2. Data Collection

Source box office datasets from:

- `im.db.zip` (tables `movie_basics` and `movie_ratings`)
- `bom.movie_gross.csv.gz`

Collect relevant data fields:

- Genre, budget, revenue, runtime, release date, production company, director, cast, rating, etc.

🧹 3. Data Cleaning & Preprocessing

- Handle missing values and inconsistencies.
- Standardize formats (dates, currencies, genres).
- Convert categorical variables where necessary.
- Remove duplicates or irrelevant records (e.g., short films, non-theatrical releases).

📊 4. Exploratory Data Analysis (EDA)

- Univariate and bivariate analysis (e.g., budget vs. revenue, genre vs. revenue).
- Correlation heatmaps, box plots, histograms.
- Identify outliers and common patterns in successful films.
- Segment data by genre, production studio, or release year.

🧠 5. Insights & Recommendations

- Summarize which genres are top performers.
- Recommend ideal budget ranges.
- Identify optimal release months/seasons.
- Suggest attributes linked to successful movies (e.g., popular runtimes, ratings).

📑 6. Reporting & Visualization

- Build clear and compelling visualizations (using Tableau, Power BI, or Python’s Seaborn/Matplotlib).
- Draft a business-focused report or slide deck that includes:
  - Key findings
  - Strategic suggestions
  - Visual evidence

📢 7. Presentation to Stakeholders

- Communicate insights in non-technical language.
- Show data-driven rationale for proposed movie types.
- Allow room for stakeholder feedback and Q&A.

#### Overview/Background
As the entertainment industry shifts toward original content production, many large companies are investing in their own movie studios to capture audience attention and drive revenue. The company seeks to follow this trend but lacks experience in film production. To ensure a successful studio launch, this project aims to analyze historical box office data to uncover key trends in genre performance, budget impact, and other critical success factors. The goal is to provide data-driven insights that will guide strategic decisions on what types of films to produce for maximum box office success.

#### Challenges
One of the main challenges in this project is acquiring comprehensive and reliable box office data that includes essential attributes such as genre, budget, revenue, and release details. Additionally, the film industry is influenced by unpredictable factors like audience trends, star power, and marketing, which are difficult to quantify. Ensuring data quality, handling missing or inconsistent entries, and drawing actionable insights that align with business goals also present key hurdles in the analysis process.

#### Proposed Solution
To address the business challenge, this project proposes a data-driven approach that involves collecting and analyzing historical box office data to identify patterns in successful films. By examining factors such as genre, budget, revenue, release timing, and other key attributes, the project will uncover trends that correlate with box office success. The insights will then be translated into practical recommendations to guide the company in producing films with higher chances of commercial success.

#### Conclusion:
Launching a successful movie studio requires more than creativity—it demands strategic, data-informed decisions. This project leverages box office analytics to uncover what drives film profitability, helping the company make confident choices about genre, budget, and release strategy. With clear insights and recommendations, the company will be well-positioned to enter the competitive film industry with a strong foundation for success.

#### ❗ Problem Statement
As the company plans to venture into original film production, it faces significant uncertainty due to a lack of industry experience. 🎬 Making informed decisions about what types of films to produce is challenging without a clear understanding of market trends and performance drivers. Additionally, obtaining accurate and complete box office data is difficult, and the film industry itself is influenced by various unpredictable factors such as changing audience preferences, marketing impact, and star power. These challenges make it hard to identify what contributes to a movie’s commercial success and pose a risk to the company’s new venture.

#### 📊 Data Understanding
The data for this project comes from multiple sources:

- `im.db.zip`: A zipped SQLite database that contains several tables. The two most relevant are:
  - `movie_basics`: Key information about each movie: ID, primary and original titles, start year, runtime in minutes, and genres.
  - `movie_ratings`: The average rating and number of votes for each movie, which indicate audience reception.
- `bom.movie_gross.csv.gz`: A compressed CSV file containing box office gross data (title, studio, domestic gross, foreign gross, and year). This file is essential for analyzing revenue trends and overall financial performance.

Combined, these sources cover both film attributes and performance metrics, which are crucial for understanding the factors behind movie success.

### Workflow & Steps

#### Step 1: Importing, Connecting, Loading, and Inspecting the Datasets

- Import the necessary libraries
- Connect to the database and load the datasets
- Check dataset shape (rows and columns)
- View dataset information (column types, missing values)
- Generate summary statistics for numerical columns
- Check for duplicate rows
- Check for unique values in key categorical columns



#### Step 2: Data Cleaning

- Remove duplicates
- Handle missing values appropriately
- Convert data types if necessary
- Standardize column names for consistency
- Correct inconsistent categorical values
- Save the cleaned dataset

## Step 1: Importing and Loading Data

### 📚 1.1 Importing Necessary Libraries

In this step, we import all the essential Python libraries that we will use throughout the project for:
- Extracting and reading data (from ZIP files and SQLite databases)
- Cleaning and analyzing data
- Performing statistical operations
- Visualizing data to uncover insights


In [214]:
# Import the necessary libraries

import zipfile
from scipy import stats
import sqlite3
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In this project, we import libraries that provide tools for data manipulation, database interaction, statistics, and visualization. These libraries enable efficient loading, cleaning, merging, and analysis of structured data from various sources, including flat files and relational databases. With these tools, we can prepare and organize the data for deeper analysis and insight.

### 🗄️ 1.2 Connecting to the SQLite Database

In this step, we establish a connection to the SQLite database that contains structured data related to our project.

**Why this is important:**  
SQLite databases allow us to store and retrieve data efficiently using SQL queries. We first need to connect to the database before we can explore or extract tables for analysis.


In [215]:
# Establishing a connection to the SQLite database containing structured data 
conn = sqlite3.connect("zippedData/im.db1/im.db")

By creating a connection object, we can execute SQL queries and extract tables or records from the database. This connection is essential for retrieving data stored in relational format and loading it into a DataFrame for further analysis.
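The notebook imports `zipfile` but connects to an already extracted file. The following is a minimal sketch of how a zipped SQLite database could be extracted and inspected before querying, assuming the archive contains a single database file. The helper names `extract_and_connect` and `list_tables`, and the tiny stand-in database `demo.db`, are illustrative assumptions and not part of the original project.

```python
import sqlite3
import tempfile
import zipfile
from pathlib import Path


def extract_and_connect(zip_path, member, dest_dir):
    """Extract one member of a ZIP archive and open it as a SQLite database."""
    with zipfile.ZipFile(zip_path) as archive:
        extracted_path = archive.extract(member, path=dest_dir)
    return sqlite3.connect(extracted_path)


def list_tables(conn):
    """Return the names of all tables in the connected database, sorted."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    return [row[0] for row in rows]


# Build a tiny stand-in database and zip it (hypothetical demo data only).
workdir = Path(tempfile.mkdtemp())
db_path = workdir / "demo.db"
builder = sqlite3.connect(db_path)
builder.execute("CREATE TABLE movie_basics (movie_id TEXT, primary_title TEXT)")
builder.execute("CREATE TABLE movie_ratings (movie_id TEXT, averagerating REAL)")
builder.commit()
builder.close()

zip_path = workdir / "demo.db.zip"
with zipfile.ZipFile(zip_path, "w") as archive:
    archive.write(db_path, arcname="demo.db")

# Extract the database from the archive, connect, and list its tables.
demo_conn = extract_and_connect(zip_path, "demo.db", workdir / "extracted")
tables = list_tables(demo_conn)
print(tables)
demo_conn.close()
```

Listing the tables first is a cheap way to confirm that the connection points at the intended file, since `sqlite3.connect` silently creates an empty database when the path does not exist.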


### Querying the `movie_basics` Table

We will query all records from the `movie_basics` table in the `im.db` SQLite database and display the first few rows using pandas.


In [216]:
# Query im.db to get data from movie_basics.

query = """
SELECT * 
 FROM movie_basics;
"""

# output query using pandas

movie_basics_df = pd.read_sql(query, conn)
movie_basics_df.head()

Unnamed: 0,movie_id,primary_title,original_title,start_year,runtime_minutes,genres
0,tt0063540,Sunghursh,Sunghursh,2013,175.0,"Action,Crime,Drama"
1,tt0066787,One Day Before the Rainy Season,Ashad Ka Ek Din,2019,114.0,"Biography,Drama"
2,tt0069049,The Other Side of the Wind,The Other Side of the Wind,2018,122.0,Drama
3,tt0069204,Sabse Bada Sukh,Sabse Bada Sukh,2018,,"Comedy,Drama"
4,tt0100275,The Wandering Soap Opera,La Telenovela Errante,2017,80.0,"Comedy,Drama,Fantasy"


- `query`: A multi-line string containing the SQL command to select all columns (`*`) from the `movie_basics` table.
- `pd.read_sql(query, conn)`: pandas reads the SQL query results directly into a DataFrame called `movie_basics_df`. The `conn` object represents the active connection to the `im.db` SQLite database.
- `.head()`: Shows the first 5 rows of `movie_basics_df`, allowing us to quickly inspect the structure and contents of the data.
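`SELECT *` is convenient for a first look, but a query can also name only the columns it needs and filter rows inside the database. The sketch below illustrates this with a parameterized query against a small in-memory table. The sample rows, the `demo_conn` connection, and the 2018 cutoff are made up for illustration.

```python
import sqlite3

import pandas as pd

# In-memory stand-in for the movie_basics table (hypothetical sample rows).
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute(
    "CREATE TABLE movie_basics ("
    "movie_id TEXT, primary_title TEXT, start_year INTEGER, "
    "runtime_minutes REAL, genres TEXT)"
)
demo_conn.executemany(
    "INSERT INTO movie_basics VALUES (?, ?, ?, ?, ?)",
    [
        ("tt0000001", "Film A", 2013, 175.0, "Action,Crime,Drama"),
        ("tt0000002", "Film B", 2018, None, "Comedy,Drama"),
        ("tt0000003", "Film C", 2019, 114.0, "Biography,Drama"),
    ],
)
demo_conn.commit()

# Select only the needed columns and let SQLite do the filtering.
# The "?" placeholder is filled from `params`, which avoids string formatting.
demo_query = """
SELECT movie_id, primary_title, start_year
  FROM movie_basics
 WHERE start_year >= ?
 ORDER BY start_year;
"""
recent_df = pd.read_sql(demo_query, demo_conn, params=(2018,))
print(recent_df)
demo_conn.close()
```

Filtering in SQL keeps the DataFrame small, which matters more as tables grow beyond the size of `movie_basics`.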


## Checking the Number of Rows and Columns in `movie_basics_df`

We will check the dimensions of the `movie_basics_df` DataFrame to understand how many rows (records) and columns (features) it contains.

In [217]:
# Check the number of rows and columns in movie_basics_df
movie_basics_df.shape

(146144, 6)

- `.shape` returns a tuple where:
  - The **first value** is the number of rows (how many records/movies are in the dataset).
  - The **second value** is the number of columns (how many fields/features each record has).
- The output `(146144, 6)` indicates that `movie_basics_df` has 146,144 rows and 6 columns.


## Summary Statistics of `movie_basics_df`

We will generate descriptive statistics for the numeric columns in the `movie_basics_df` DataFrame to better understand the distribution and range of the data.

In [218]:
# Summary statistics of movie_basics_df.
# describe is a method, so it needs parentheses; without them pandas
# returns the bound method object instead of the statistics.
movie_basics_df.describe()


### Querying the `movie_ratings` Table

We will query all records from the `movie_ratings` table in the `im.db` SQLite database and display the first few rows using pandas.


In [219]:
# query movie_ratings table

query = """
SELECT *
 FROM movie_ratings;
"""

movie_rating_df = pd.read_sql(query, conn)
movie_rating_df.head()

Unnamed: 0,movie_id,averagerating,numvotes
0,tt10356526,8.3,31
1,tt10384606,8.9,559
2,tt1042974,6.4,20
3,tt1043726,4.2,50352
4,tt1060240,6.5,21


- `query`: A multi-line string containing the SQL command to select all columns (`*`) from the `movie_ratings` table.
- `pd.read_sql(query, conn)`: Executes the query using pandas and loads the results directly into a DataFrame called `movie_rating_df`.
- `.head()`: Displays the first 5 rows of the DataFrame to quickly inspect the structure and contents.


## Checking the Number of Rows and Columns in `movie_rating_df`

We will check the dimensions of the `movie_rating_df` DataFrame to see how many records and fields it contains.


In [220]:
# Check the number of rows and columns in movie_rating_df
movie_rating_df.shape

(73856, 3)



- `.shape` returns a tuple where:
  - The **first element** is the number of rows (how many ratings records are present).
  - The **second element** is the number of columns (how many fields each rating has).
- This helps understand the size and complexity of the dataset at a glance.

- The output `(73856, 3)` indicates that `movie_rating_df` has 73,856 rows and 3 columns.

## Summary Statistics of `movie_rating_df`

We will generate descriptive statistics for the numeric columns in the `movie_rating_df` DataFrame to understand the distribution and range of the ratings data.


In [221]:
# Summary statistics of movie_rating_df.
# describe is a method, so it needs parentheses; without them pandas
# returns the bound method object instead of the statistics.
movie_rating_df.describe()

  
- `.describe()` automatically calculates important summary statistics for all **numeric columns**:
  - **count**: Number of non-null entries,
  - **mean**: Average value,
  - **std**: Standard deviation (measure of spread),
  - **min**: Minimum value,
  - **25%**, **50% (median)**, **75%**: Percentile values,
  - **max**: Maximum value.
- This gives a quick view of the distribution, range, and variation in the movie ratings dataset.
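The difference between referencing `describe` and calling `describe()` is easy to miss in a notebook, because both display something. The short sketch below, using a made-up four-row ratings table, shows that only the call with parentheses produces the statistics listed above.

```python
import pandas as pd

# Hypothetical sample with the same columns as movie_rating_df.
ratings_sample = pd.DataFrame(
    {
        "movie_id": ["tt1", "tt2", "tt3", "tt4"],
        "averagerating": [8.0, 6.0, 4.0, 6.0],
        "numvotes": [30, 500, 20, 50],
    }
)

# Without parentheses we only get a reference to the method itself.
not_called = ratings_sample.describe
print(type(not_called).__name__)

# With parentheses pandas computes the statistics for the numeric columns.
summary = ratings_sample.describe()
print(summary.loc[["count", "mean", "min", "max"]])
```

The text column `movie_id` is left out of `summary` because `describe()` covers only numeric columns by default.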


## Merging `movie_basics_df` and `movie_rating_df`

We will merge the two DataFrames on their common `movie_id` column to create a combined dataset, `film_df`, containing both movie details and their ratings.

In [222]:
# merge movie_basics_df & movie_rating_df

film_df = pd.merge(movie_basics_df, movie_rating_df, on='movie_id', how='inner')
film_df.head()

Unnamed: 0,movie_id,primary_title,original_title,start_year,runtime_minutes,genres,averagerating,numvotes
0,tt0063540,Sunghursh,Sunghursh,2013,175.0,"Action,Crime,Drama",7.0,77
1,tt0066787,One Day Before the Rainy Season,Ashad Ka Ek Din,2019,114.0,"Biography,Drama",7.2,43
2,tt0069049,The Other Side of the Wind,The Other Side of the Wind,2018,122.0,Drama,6.9,4517
3,tt0069204,Sabse Bada Sukh,Sabse Bada Sukh,2018,,"Comedy,Drama",6.1,13
4,tt0100275,The Wandering Soap Opera,La Telenovela Errante,2017,80.0,"Comedy,Drama,Fantasy",6.5,119


  
- `pd.merge(df1, df2, on='column', how='type')` merges two DataFrames:
  - `movie_basics_df` and `movie_rating_df` are merged **on** the common `movie_id` column.
  - `how='inner'` means:
    - Only include rows where `movie_id` exists in **both** DataFrames.
    - Movies without ratings (or ratings without movie info) are excluded.
- The resulting DataFrame `film_df` contains combined information: movie details + ratings.
- `.head()` shows the first 5 rows to confirm that the merge worked correctly.
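Because an inner merge silently drops unmatched rows, it can help to audit the join before relying on it. The sketch below is one possible check, using two small made-up tables: an outer merge with `indicator=True` counts how many IDs appear on each side, and `validate="one_to_one"` makes pandas raise an error if `movie_id` is duplicated in either table. The DataFrame names and values are illustrative only.

```python
import pandas as pd

# Hypothetical stand-ins for movie_basics_df and movie_rating_df.
basics_sample = pd.DataFrame(
    {"movie_id": ["tt1", "tt2", "tt3"], "primary_title": ["Film A", "Film B", "Film C"]}
)
ratings_sample = pd.DataFrame(
    {"movie_id": ["tt2", "tt3", "tt4"], "averagerating": [7.0, 6.1, 5.5]}
)

# Outer merge with an indicator column to see where each row came from.
audit = pd.merge(basics_sample, ratings_sample, on="movie_id", how="outer", indicator=True)
match_counts = audit["_merge"].value_counts().to_dict()
print(match_counts)

# Inner merge as in the notebook, with a uniqueness check on the key.
inner = pd.merge(
    basics_sample, ratings_sample, on="movie_id", how="inner", validate="one_to_one"
)
print(inner)
```

In this sample, one movie has no rating and one rating has no movie record, so both are excluded from `inner`.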


## Checking the Number of Rows and Columns After Merging

We will verify the size of the merged DataFrame `film_df` to see how many records and fields it contains after combining `movie_basics_df` and `movie_rating_df`.


In [223]:
# Check the number of rows and columns after merging
film_df.shape

(73856, 8)

  
- `.shape` returns a tuple:
  - The **first value** is the number of rows (movies with both basic info and ratings).
  - The **second value** is the number of columns (combined fields from both DataFrames).
- This helps confirm the size of the new dataset after merging, ensuring no unexpected data loss or duplication occurred.


## Checking Data Types and Basic Information of `film_df`

We will inspect the structure of `film_df`, including data types, non-null counts, and memory usage, to better understand the dataset.


In [225]:
# Checking Data Types and Basic Information of `film_df`
film_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 73856 entries, 0 to 73855
Data columns (total 8 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   movie_id         73856 non-null  object 
 1   primary_title    73856 non-null  object 
 2   original_title   73856 non-null  object 
 3   start_year       73856 non-null  int64  
 4   runtime_minutes  66236 non-null  float64
 5   genres           73052 non-null  object 
 6   averagerating    73856 non-null  float64
 7   numvotes         73856 non-null  int64  
dtypes: float64(2), int64(2), object(4)
memory usage: 4.5+ MB


  
- `.info()` provides a summary of the DataFrame, showing:
  - The **total number of entries** (rows),
  - **Non-null counts** for each column (helpful for spotting missing data),
  - **Data types** for each column:
    - `object` for text/string,
    - `int64` for integers,
    - `float64` for decimal numbers,
  - **Memory usage** (important for very large datasets).
- This is crucial for:
  - Detecting missing data early,
  - Planning data cleaning steps,
  - Knowing if data types need optimization (e.g., changing types to save memory).


## Summary Statistics of `film_df`

We will generate descriptive statistics for the numeric columns in the merged `film_df` DataFrame to understand the distribution and range of the data.


In [None]:
# Summary statistics of film_df.
# describe is a method, so it needs parentheses; without them pandas
# returns the bound method object instead of the statistics.
film_df.describe()

 
- `.describe()` provides summary statistics for **numeric columns** in the `film_df` DataFrame, including:
  - **count**: Number of non-null entries,
  - **mean**: Average value,
  - **std**: Standard deviation (how spread out the values are),
  - **min**: Minimum value,
  - **25%**, **50% (median)**, **75%**: Percentile values indicating distribution,
  - **max**: Maximum value.
- This gives a quick overview of the central tendency, spread, and variability of the dataset’s numerical features (like ratings or release year).


## Listing the Column Names in `film_df`

We will display the names of all the columns in the `film_df` DataFrame to understand its structure and the data it contains.


In [None]:
# The following are the column names in the `film_df` DataFrame:
film_df.columns.tolist()

['movie_id',
 'primary_title',
 'original_title',
 'start_year',
 'runtime_minutes',
 'genres',
 'averagerating',
 'numvotes']

 
- `.columns` retrieves the column names of the DataFrame.
- `.tolist()` converts the column names into a **Python list** for easier reading and use.
- This step is useful for:
  - Quickly checking what data is available in the merged DataFrame,
  - Ensuring all expected columns are present after merging the datasets.


## Loading Data from `bom.movie_gross.csv.gz`

We will load the movie box office earnings data from the `bom.movie_gross.csv.gz` file and display the first few rows to inspect its contents.


In [None]:
# load data from bom.movie_gross.csv.gz
# box_office_df : shows how movies were earning

box_office_df = pd.read_csv('zippedData/bom.movie_gross.csv.gz')
box_office_df.head()

Unnamed: 0,title,studio,domestic_gross,foreign_gross,year
0,Toy Story 3,BV,415000000.0,652000000,2010
1,Alice in Wonderland (2010),BV,334200000.0,691300000,2010
2,Harry Potter and the Deathly Hallows Part 1,WB,296000000.0,664300000,2010
3,Inception,WB,292600000.0,535700000,2010
4,Shrek Forever After,P/DW,238700000.0,513900000,2010



- `pd.read_csv()` is used to load the data from a CSV file into a pandas DataFrame.
  - In this case, the file is compressed as `.gz`, but pandas can read it directly without any extra steps.
- `.head()` displays the first 5 rows of the DataFrame to give a quick view of the data structure.
- The dataset records how each movie performed at the box office: its title, studio, domestic gross, foreign gross, and year.
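To illustrate how pandas handles the `.gz` extension, here is a small self-contained sketch that writes a gzip-compressed CSV to a temporary folder and reads it back twice: once relying on the default compression inference and once naming the compression explicitly. The file name `demo_gross.csv.gz` and its two rows are invented for the example.

```python
import gzip
import tempfile
from pathlib import Path

import pandas as pd

# Write a tiny gzip-compressed CSV (hypothetical rows, same columns as the dataset).
csv_text = (
    "title,studio,domestic_gross,foreign_gross,year\n"
    "Film A,BV,415000000,652000000,2010\n"
    "Film B,WB,292600000,,2010\n"
)
gz_path = Path(tempfile.mkdtemp()) / "demo_gross.csv.gz"
with gzip.open(gz_path, "wt", encoding="utf-8") as handle:
    handle.write(csv_text)

# pandas infers gzip compression from the ".gz" extension by default.
inferred_df = pd.read_csv(gz_path)

# The compression can also be stated explicitly, e.g. if the extension is missing.
explicit_df = pd.read_csv(gz_path, compression="gzip")

print(inferred_df)
```

The empty `foreign_gross` field in the second row is read as `NaN`, which is how missing values in the real file surface during loading.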


## Checking the Number of Rows and Columns in `box_office_df`

We will check the dimensions of the `box_office_df` DataFrame to see how many records and fields it contains.


In [None]:
# check box_office_df rows and columns
box_office_df.shape

(3356, 5)


- `.shape` returns a tuple:
  - The **first value** represents the **number of rows** (how many box office records are available).
  - The **second value** represents the **number of columns** (how many features or fields are provided for each record).
- This helps quickly assess the size and structure of the dataset.


## Summary Statistics of `box_office_df`

We will generate descriptive statistics for the numeric columns in the `box_office_df` DataFrame to understand the distribution and range of the box office earnings data.


In [None]:
# Generate summary (descriptive) statistics for the numeric columns
# .describe() summarizes key statistical measures for each numeric column
box_office_df.describe()

Unnamed: 0,domestic_gross,foreign_gross,year
count,3356.0,3356.0,3356.0
mean,28771490.0,53123330.0,2013.970203
std,67006940.0,110367400.0,2.479064
min,100.0,600.0,2010.0
25%,120000.0,12200000.0,2012.0
50%,1400000.0,19400000.0,2014.0
75%,27950000.0,29700000.0,2016.0
max,936700000.0,960500000.0,2018.0


  
- `.describe()` provides summary statistics for **numeric columns** in the `box_office_df` DataFrame, including:
  - **count**: Number of non-null entries,
  - **mean**: Average value,
  - **std**: Standard deviation (measure of spread),
  - **min**: Minimum value,
  - **25%**, **50% (median)**, **75%**: Percentile values (showing distribution),
  - **max**: Maximum value.
- This step is essential for getting a quick understanding of the box office data’s central tendency, spread, and overall range.


## Checking Dataset Info for `box_office_df`

We will inspect the structure of the `box_office_df` DataFrame, including column types, non-null counts, and memory usage, to better understand the dataset.


In [None]:
# Check dataset info
box_office_df.info()  # Summary of dataset, including column types and non-null values

<class 'pandas.core.frame.DataFrame'>
Index: 3356 entries, 0 to 3386
Data columns (total 5 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   title           3356 non-null   object 
 1   studio          3356 non-null   object 
 2   domestic_gross  3356 non-null   float64
 3   foreign_gross   3356 non-null   float64
 4   year            3356 non-null   int64  
dtypes: float64(2), int64(1), object(2)
memory usage: 157.3+ KB



- `.info()` provides an overview of the DataFrame, including:
  - **Data types**: It shows the type of data in each column (e.g., `int64`, `float64`, `object` for text).
  - **Non-null counts**: Helps detect missing values by showing how many non-null entries exist in each column.
  - **Memory usage**: Indicates how much memory the entire DataFrame is using (important for large datasets).
- This step is crucial for understanding the structure and cleanliness of the dataset.


## Step 2: Data Cleaning

## Checking for Missing Values in Multiple Datasets

We will define a function to check for missing values in different DataFrames (`box_office_df`, `movie_basics_df`, `movie_rating_df`, and `film_df`) and display the columns with missing data.


In [None]:
# Define a function to check for missing values in a DataFrame
def check_missing_values(df, name):
    print(f"\n🔍 Missing values in {name}:")
    # Use .isnull() to identify missing data and .sum() to count them
    missing_values = df.isnull().sum()
    # Print only columns with missing values (where the sum is greater than 0)
    print(missing_values[missing_values > 0])  

# Check for missing values in each DataFrame
check_missing_values(box_office_df, "Box Office Data")
check_missing_values(movie_basics_df, "Movies basics Data")
check_missing_values(movie_rating_df, "Movie Ratings Data")
check_missing_values(film_df, "Merged Movie basics and Movie ratings Data")



🔍 Missing values in Box Office Data:
Series([], dtype: int64)

🔍 Missing values in Movies basics Data:
Series([], dtype: int64)

🔍 Missing values in Movie Ratings Data:
Series([], dtype: int64)

🔍 Missing values in Merged Movie basics and Movie ratings Data:
Series([], dtype: int64)



- The function `check_missing_values(df, name)` checks for missing values in a given DataFrame:
  - `.isnull()` creates a DataFrame of boolean values (`True` for missing, `False` for not missing).
  - `.sum()` adds up the number of `True` values (i.e., missing entries) for each column.
  - We then filter and display only those columns with **missing values** (`missing_values > 0`).
- We run the function for each DataFrame to identify any missing data in the following:
  - **Box Office Data (`box_office_df`)**
  - **Movie Basics Data (`movie_basics_df`)**
  - **Movie Ratings Data (`movie_rating_df`)**
  - **Merged Movie Data (`film_df`)**
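A variation on the function above is to return the result instead of printing it and to add the share of missing rows, which makes it easier to decide between dropping and imputing. The sketch below is one way to do that. The name `missing_summary` and the sample data are assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def missing_summary(df):
    """Return missing-value counts and percentages for columns that have any."""
    counts = df.isnull().sum()
    summary = pd.DataFrame(
        {"missing": counts, "percent": (counts / len(df) * 100).round(1)}
    )
    # Keep only columns with at least one missing value, largest first.
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)


# Hypothetical sample with gaps in two columns.
sample = pd.DataFrame(
    {
        "title": ["Film A", "Film B", "Film C", "Film D"],
        "studio": ["BV", None, "WB", "WB"],
        "foreign_gross": [652000000.0, np.nan, np.nan, 535700000.0],
    }
)
report = missing_summary(sample)
print(report)
```

Returning a DataFrame also allows the reports for several datasets to be compared or saved, which printed output does not.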


## Handling Missing Values

#### Dropping Rows with Missing Values in Specific Columns

We want to drop rows from `box_office_df` where either the `studio` or `domestic_gross` columns have missing values. This is done using the `dropna()` function with the `subset` argument to target specific columns.


In [None]:
# Drop rows where either 'studio' or 'domestic_gross' has missing values
box_office_df.dropna(subset=["studio", "domestic_gross"], inplace=True)

# Check the shape of the DataFrame after dropping rows
print("Shape of box_office_df after dropping missing values in 'studio' or 'domestic_gross':")
print(box_office_df.shape)


Shape of box_office_df after dropping missing values in 'studio' or 'domestic_gross':
(3356, 5)


Since `foreign_gross` has 1,350 missing values, which is a significant share of the dataset, we will fill them with the median rather than drop those rows.

## Cleaning and Handling `foreign_gross` Column

We will:
- Remove any commas in the `foreign_gross` column.
- Convert the column values to `float`.
- Fill missing values (`NaN`) with the median value of the `foreign_gross` column.


In [None]:
# Remove commas from 'foreign_gross' and convert it to a float
box_office_df["foreign_gross"] = box_office_df["foreign_gross"].replace(",", "", regex=True).astype(float)

# Fill missing values with the median of the 'foreign_gross' column.
# Assigning the result back avoids the chained `inplace=True` pattern,
# which triggers a pandas FutureWarning and stops working in pandas 3.0.
box_office_df["foreign_gross"] = box_office_df["foreign_gross"].fillna(box_office_df["foreign_gross"].median())

# Display the first few rows to check the results
print(box_office_df[['foreign_gross']].head())


   foreign_gross
0    652000000.0
1    691300000.0
2    664300000.0
3    535700000.0
4    513900000.0


- **`.replace(",", "", regex=True)`**: Removes any commas in the `foreign_gross` column so that the values can be converted to numbers.
- **`.astype(float)`**: After removing commas, converts the column values to the `float` type for numerical operations.
- **`.fillna(box_office_df["foreign_gross"].median())`**: Fills any missing values in the `foreign_gross` column with the **median** value of the column, and the result is assigned back to the column. The median is often a good choice for filling missing values because it is less sensitive to outliers than the mean.
- We then print the first few rows of the `foreign_gross` column to check the changes.
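The same cleaning idea can be sketched on a small made-up column to show each intermediate result. This version uses `pd.to_numeric(..., errors="coerce")` as an alternative to `.astype(float)`, so that any unparseable entry becomes `NaN` instead of raising an error, and it records which rows were imputed. The sample values and the `foreign_gross_imputed` column are assumptions for illustration, not part of the original notebook.

```python
import pandas as pd

# Hypothetical raw values: plain digits, a value with a comma, and a gap.
raw = pd.DataFrame({"foreign_gross": ["652000000", "1,131.6", None, "535700000"]})

# Strip commas, then convert to numbers; unparseable entries become NaN.
cleaned = pd.to_numeric(
    raw["foreign_gross"].str.replace(",", "", regex=False), errors="coerce"
)

# Remember which rows were missing before filling them.
raw["foreign_gross_imputed"] = cleaned.isna()

# Fill the gaps with the median and assign the result back to the column.
median_value = cleaned.median()
raw["foreign_gross"] = cleaned.fillna(median_value)
print(raw)
```

Keeping a flag for imputed rows makes it possible to rerun later analyses with and without the filled values, which is useful when a large share of a column is imputed.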

 



## Confirming No Missing Values

We will:
- Check the data type of `foreign_gross` to ensure it's of type `float`.
- Use `.isnull().sum()` to verify that no missing values exist in the entire `box_office_df` DataFrame.


In [None]:
# Check the data type of 'foreign_gross' to ensure it's float
print("Data type of 'foreign_gross':", box_office_df["foreign_gross"].dtype)

# Check for any missing values across the entire dataset
print("\nMissing values in the dataset:")
print(box_office_df.isnull().sum())


Data type of 'foreign_gross': float64

Missing values in the dataset:
title             0
studio            0
domestic_gross    0
foreign_gross     0
year              0
dtype: int64


- `box_office_df["foreign_gross"].dtype`: Checks the data type of the `foreign_gross` column. After the conversion, it should return `float64`.
- `box_office_df.isnull().sum()`: Counts the missing values in every column of `box_office_df`. All counts should be zero if the missing values were handled correctly.

## Film Data Cleaning

We will:
- Fill missing `runtime_minutes` with the median value of the column.
- Fill missing `genres` with `"Unknown"`.
- Optionally, fill missing `original_title` with the value from `primary_title`.
- Finally, check for any remaining missing values in the dataset.


In [None]:
# Film data cleaning

# Fill missing 'runtime_minutes' with the median value.
# Assigning the result back avoids the chained inplace=True pattern,
# which pandas warns will stop working in pandas 3.0.
median_runtime = film_df['runtime_minutes'].median()
film_df['runtime_minutes'] = film_df['runtime_minutes'].fillna(median_runtime)

# Fill missing 'genres' with 'Unknown'
film_df['genres'] = film_df['genres'].fillna('Unknown')

# Optional: fill missing 'original_title' with the 'primary_title' from the same row of film_df
film_df['original_title'] = film_df['original_title'].fillna(film_df['primary_title'])

# --- Checking after cleaning ---
print("Missing Values After Cleaning:")
print(film_df.isnull().sum())


Missing Values After Cleaning:
movie_id           0
primary_title      0
original_title     0
start_year         0
runtime_minutes    0
genres             0
averagerating      0
numvotes           0
dtype: int64



## Movie Basics Data Cleaning

We will:
- Fill missing `runtime_minutes` with the median value of the column in `movie_basics_df`.
- Fill missing `genres` with `"Unknown"`.
- Optionally, fill missing `original_title` with the value from `primary_title`.
- Finally, check for any remaining missing values in the `movie_basics_df` dataset.


In [None]:
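# Fill missing 'runtime_minutes' with the median of movie_basics_df's own column.
# Assigning the result back avoids the chained inplace=True pattern,
# which pandas warns will stop working in pandas 3.0.
median_runtime = movie_basics_df['runtime_minutes'].median()
movie_basics_df['runtime_minutes'] = movie_basics_df['runtime_minutes'].fillna(median_runtime)

# Fill missing 'genres' with 'Unknown'
movie_basics_df['genres'] = movie_basics_df['genres'].fillna('Unknown')

# Optional: fill missing 'original_title' with the 'primary_title' from the same row
movie_basics_df['original_title'] = movie_basics_df['original_title'].fillna(movie_basics_df['primary_title'])

# --- Checking after cleaning ---
# NOTE: this prints film_df, which is what the output below shows.
# To check this table itself, use print(movie_basics_df.isnull().sum()).
print("Missing Values After Cleaning:")
print(film_df.isnull().sum())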


Missing Values After Cleaning:
movie_id           0
primary_title      0
original_title     0
start_year         0
runtime_minutes    0
genres             0
averagerating      0
numvotes           0
dtype: int64



* Fill missing runtime_minutes: The median runtime is computed and then used to fill missing values in the runtime_minutes column.

* Fill missing genres: Any missing value in the genres column is filled with "Unknown".

* Optional filling of original_title: Missing values in original_title are filled with the corresponding value from primary_title.

* Check for remaining missing values: The check above prints film_df.isnull().sum(), so the output describes film_df. To verify movie_basics_df itself, call movie_basics_df.isnull().sum().
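
Since the same three fills are applied to both `film_df` and `movie_basics_df`, they could be wrapped in one helper. The sketch below is a minimal version under stated assumptions: `fill_basic_gaps` is a hypothetical name, the toy data is illustrative, and the function assigns results back to columns (the pattern the pandas warning recommends) rather than calling `fillna(..., inplace=True)` on a column selection.

```python
import pandas as pd

def fill_basic_gaps(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with runtime, genres and original_title gaps filled."""
    out = df.copy()
    # Median of this DataFrame's own runtime column
    out["runtime_minutes"] = out["runtime_minutes"].fillna(out["runtime_minutes"].median())
    out["genres"] = out["genres"].fillna("Unknown")
    # Row-aligned fallback: use primary_title from the same row
    out["original_title"] = out["original_title"].fillna(out["primary_title"])
    return out

# Toy data (illustrative only)
toy = pd.DataFrame({
    "primary_title": ["A", "B", "C"],
    "original_title": ["A", None, "C"],
    "runtime_minutes": [90.0, None, 110.0],
    "genres": ["Drama", None, "Comedy"],
})

filled = fill_basic_gaps(toy)
print(filled.isnull().sum())
```

Returning a copy leaves the input untouched, which makes it easy to compare the data before and after cleaning.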

## Dropping Rows with Missing 'runtime_minutes' or 'genres'

We will drop any rows that still have missing values in the `runtime_minutes` or `genres` columns. Afterward, we will perform a final check to ensure that no missing values remain.


In [None]:
# Drop any rows still missing 'runtime_minutes' or 'genres'
film_df.dropna(subset=['runtime_minutes', 'genres'], inplace=True)

# Final check
print("Missing Values After Final Cleaning:")
print(film_df.isnull().sum())


Missing Values After Final Cleaning:
movie_id           0
primary_title      0
original_title     0
start_year         0
runtime_minutes    0
genres             0
averagerating      0
numvotes           0
dtype: int64


* .dropna(subset=['runtime_minutes', 'genres']): This ensures that rows with missing values in the specified columns (runtime_minutes and genres) are dropped from the DataFrame. We set inplace=True to modify the DataFrame directly without needing to create a new one.

* .isnull().sum(): After dropping rows, this command checks the entire film_df for any remaining missing values and displays the count of NaN values in each column.
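
To see how `subset` limits which columns trigger a drop, here is a small sketch on toy data (illustrative values only). It also records how many rows were removed, which is worth checking before discarding data.

```python
import pandas as pd

# Toy data: row 0 is missing only numvotes, which is outside the subset
toy = pd.DataFrame({
    "primary_title": ["A", "B", "C"],
    "runtime_minutes": [90.0, None, 110.0],
    "genres": ["Drama", "Comedy", None],
    "numvotes": [None, 20, 30],
})

rows_before = len(toy)

# Only missing values in the listed columns cause a row to be dropped
trimmed = toy.dropna(subset=["runtime_minutes", "genres"])

print(f"Dropped {rows_before - len(trimmed)} of {rows_before} rows")
print(trimmed)
```

Row 0 survives even though `numvotes` is missing, because that column is not part of `subset`.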



## Standardizing Column Names

We will:
- Strip leading and trailing spaces from column names.
- Convert column names to lowercase.
- Replace spaces in column names with underscores to improve consistency.


In [None]:
# Standardize column names in film_df
film_df.columns = film_df.columns.str.strip().str.lower().str.replace(' ', '_')

# Standardize column names in box_office_df
box_office_df.columns = box_office_df.columns.str.strip().str.lower().str.replace(' ', '_')

# Display column names to confirm changes
print("Standardized Column Names in film_df:", film_df.columns.tolist())
print("Standardized Column Names in box_office_df:", box_office_df.columns.tolist())


Standardized Column Names in film_df: ['movie_id', 'primary_title', 'original_title', 'start_year', 'runtime_minutes', 'genres', 'averagerating', 'numvotes']
Standardized Column Names in box_office_df: ['title', 'studio', 'domestic_gross', 'foreign_gross', 'year']



* .str.strip(): Removes leading and trailing spaces from each column name.

* .str.lower(): Converts each column name to lowercase for uniformity.

* .str.replace(' ', '_'): Replaces spaces in column names with underscores (_).
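
The chain above can be checked on a throwaway DataFrame whose column names deliberately contain stray spaces and mixed case. The column names here are made up for illustration.

```python
import pandas as pd

# Empty toy DataFrame with messy column names (illustrative only)
toy = pd.DataFrame(columns=[" Movie ID", "Primary Title ", "runtime minutes"])

# strip -> lower -> replace spaces with underscores (literal match, no regex)
toy.columns = toy.columns.str.strip().str.lower().str.replace(" ", "_", regex=False)

print(toy.columns.tolist())
```

Stripping first matters: it prevents leading or trailing spaces from turning into leading or trailing underscores.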

## Correcting Inconsistent Categorical Values

We will:
- Convert all text (string) values in categorical columns to lowercase.
- Strip any leading or trailing spaces from the string values for consistency.


In [None]:
# Correct inconsistent categorical values in film_df
for col in film_df.select_dtypes(include=['object']).columns:
    film_df[col] = film_df[col].str.lower().str.strip()

# Correct inconsistent categorical values in box_office_df
for col in box_office_df.select_dtypes(include=['object']).columns:
    box_office_df[col] = box_office_df[col].str.lower().str.strip()

# Displaying the first few rows to check the changes
print("First few rows of film_df after cleaning:")
print(film_df.head())

print("First few rows of box_office_df after cleaning:")
print(box_office_df.head())


First few rows of film_df after cleaning:
    movie_id                    primary_title              original_title  \
0  tt0063540                        sunghursh                   sunghursh   
1  tt0066787  one day before the rainy season             ashad ka ek din   
2  tt0069049       the other side of the wind  the other side of the wind   
3  tt0069204                  sabse bada sukh             sabse bada sukh   
4  tt0100275         the wandering soap opera       la telenovela errante   

   start_year  runtime_minutes                genres  averagerating  numvotes  
0        2013            175.0    action,crime,drama            7.0        77  
1        2019            114.0       biography,drama            7.2        43  
2        2018            122.0                 drama            6.9      4517  
3        2018             91.0          comedy,drama            6.1        13  
4        2017             80.0  comedy,drama,fantasy            6.5       119  
First few rows 

* .select_dtypes(include=['object']): This selects all columns that contain object (string) data.

* .str.lower(): Converts all string values to lowercase to ensure uniformity.

* .str.strip(): Removes any leading or trailing spaces from string values.

This ensures that the categorical columns in both film_df and box_office_df are standardized and ready for analysis.
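
A minimal sketch of the same loop on toy data (illustrative values) shows the effect on category counts. Values that differed only by case or padding collapse into one category, and missing values stay missing because the `.str` methods pass them through.

```python
import pandas as pd

# Toy data: three spellings of the same studio plus one missing value
toy = pd.DataFrame({
    "studio": pd.Series([" BV", "bv ", "WB", None], dtype="object"),
    "year": [2010, 2010, 2011, 2012],
})

# Normalize every object (string) column; numeric columns are skipped
for col in toy.select_dtypes(include=["object"]).columns:
    toy[col] = toy[col].str.strip().str.lower()

print(toy["studio"].value_counts())
```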

## Handling Duplicates in the Data

### Checking for Duplicates

We will:
- Check each dataset for duplicate rows and print out the number of duplicates for each.


In [None]:
# Check for duplicates

print(f"🔍 Duplicates in Box Office Data: {box_office_df.duplicated().sum()}")
print(f"🔍 Duplicates in Movie Basics Data: {movie_basics_df.duplicated().sum()}")
print(f"🔍 Duplicates in Movie Ratings Data: {movie_rating_df.duplicated().sum()}")
print(f"🔍 Duplicates in Merged Movie Data (film_df): {film_df.duplicated().sum()}")

🔍 Duplicates in Box Office Data: 0
🔍 Duplicates in Movie Basics Data: 0
🔍 Duplicates in Movie Ratings Data: 0
🔍 Duplicates in Merged Movie Data (film_df): 0


* .duplicated(): Identifies duplicate rows in the DataFrame. By default, it considers all columns to determine if a row is duplicated.

* .sum(): Counts the number of True values returned by .duplicated(), which corresponds to the number of duplicate rows.
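
Because `.duplicated()` compares all columns by default, two films that share a title but differ in studio or year are not counted. The sketch below, on toy data with illustrative values, contrasts the default with the `subset` and `keep` parameters.

```python
import pandas as pd

toy = pd.DataFrame({
    "title": ["bluebeard", "bluebeard", "inception", "inception"],
    "studio": ["strand", "wgusa", "wb", "wb"],
    "year": [2010, 2017, 2010, 2010],
})

# Default: every column must match, so only the repeated inception row counts
full_row_dupes = toy.duplicated().sum()

# subset: compare on title only, so the second bluebeard row counts too
title_dupes = toy.duplicated(subset=["title"]).sum()

# keep=False marks every copy, which is handy for inspecting the rows involved
all_copies = toy[toy.duplicated(subset=["title"], keep=False)]

print(full_row_dupes, title_dupes, len(all_copies))
```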


## Checking for Inconsistent Casing

We will:
- Strip extra spaces and convert the `studio` and `title` columns to lowercase to ensure consistency.


In [None]:
# Correct inconsistent casing and strip spaces in 'studio' and 'title' columns
box_office_df["studio"] = box_office_df["studio"].str.strip().str.lower()
box_office_df["title"] = box_office_df["title"].str.strip().str.lower()

# Displaying the first few rows to confirm the changes
print("First few rows of box_office_df after fixing casing:")
print(box_office_df[['studio', 'title']].head())


First few rows of box_office_df after fixing casing:
  studio                                              title
0     bv                                 toy story 3 (2010)
1     bv                  alice in wonderland (2010) (2010)
2     wb  harry potter and the deathly hallows part 1 (2...
3     wb                                   inception (2010)
4   p/dw                         shrek forever after (2010)


* .str.strip(): Removes any leading or trailing spaces from the studio and title columns to avoid inconsistencies caused by extra spaces.

* .str.lower(): Converts the entire text in the studio and title columns to lowercase, ensuring that casing does not cause issues when comparing or analyzing data.

This ensures that the values in studio and title are consistently formatted.

## Counting Occurrences of Each Title

We will:
- Count the occurrences of each unique title in the `title` column and display the top 10 most frequent titles.

In [None]:
# Counting occurrences of each title
box_office_df["title"].value_counts().head(10)

title
an actor prepares (2018)                              1
toy story 3 (2010)                                    1
alice in wonderland (2010) (2010)                     1
harry potter and the deathly hallows part 1 (2010)    1
inception (2010)                                      1
shrek forever after (2010)                            1
the twilight saga: eclipse (2010)                     1
iron man 2 (2010)                                     1
tangled (2010)                                        1
despicable me (2010)                                  1
Name: count, dtype: int64

* .value_counts(): This function counts the occurrences of each unique value in the title column of box_office_df.

* .head(10): Displays the top 10 titles with the highest frequency.
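
When most titles occur once, `.head(10)` shows an arbitrary-looking slice of single occurrences. Filtering the counts can be more informative. This is a minimal sketch on a toy Series with illustrative values.

```python
import pandas as pd

titles = pd.Series(["bluebeard", "inception", "bluebeard", "tangled"], name="title")

# Count each title once, then keep only those that appear more than once
counts = titles.value_counts()
repeated = counts[counts > 1]

print(repeated)
```

An empty result means every title is unique.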

## Searching for the Movie "Bluebeard" in Box Office Data

We will:
- Search for the movie "Bluebeard" in the `title` column of the `box_office_df` DataFrame and display the results.


In [None]:
# Searching for the Movie "Bluebeard" in `box_office_df`
box_office_df[box_office_df["title"] == "bluebeard"]

| | title | studio | domestic_gross | foreign_gross | year |
|---|---|---|---|---|---|
| 317 | bluebeard | strand | 33500.0 | 5200.0 | 2010 |
| 3045 | bluebeard | wgusa | 43100.0 | 19400000.0 | 2017 |


* box_office_df[box_office_df["title"] == "bluebeard"]: Filters the rows of box_office_df where the title column is exactly equal to "bluebeard". Because the title column was converted to lowercase earlier, this matches the title regardless of how "Bluebeard" was originally capitalized. The result shows two different releases (2010 and 2017) sharing the same title.

In [None]:
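# Differentiate same-named titles (such as the two "bluebeard" rows) by appending the release year
box_office_df["title"] = box_office_df["title"] + " (" + box_office_df["year"].astype(str) + ")"

# List any titles that still occur more than once (an empty Series means every title is now unique)
title_counts = box_office_df["title"].value_counts()
title_counts[title_counts > 1]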

Series([], Name: count, dtype: int64)
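
The cell above appends the year to every title. A variant, shown here as a sketch and not as the notebook's method, appends the year only to titles that actually collide and leaves unique titles unchanged. The toy data is illustrative.

```python
import pandas as pd

toy = pd.DataFrame({
    "title": ["bluebeard", "bluebeard", "inception"],
    "year": [2010, 2017, 2010],
})

# Flag every row whose title occurs more than once
is_repeated = toy["title"].duplicated(keep=False)

# Append " (year)" only to the flagged rows
toy.loc[is_repeated, "title"] = (
    toy.loc[is_repeated, "title"] + " (" + toy.loc[is_repeated, "year"].astype(str) + ")"
)

print(toy["title"].tolist())
print("All titles unique:", toy["title"].is_unique)
```

Leaving unique titles untouched may make later joins on title against other tables easier, though that depends on how those tables format their titles.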

## Displaying Categorical Features Summary

We will:
- Use `describe(include=['O'])` to get a summary of categorical features (object columns) for `box_office_df`, `movie_basics_df`, and `movie_rating_df`.


In [None]:
#  Display categorical features summary
box_office_df.describe(include=['O'])
movie_basics_df.describe(include=['O'])
movie_rating_df.describe(include=['O'])

| | movie_id |
|---|---|
| count | 73856 |
| unique | 73856 |
| top | tt9174828 |
| freq | 1 |


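* describe(include=['O']): This method provides summary statistics for all object (categorical) columns in the DataFrame: the count, the number of unique values, the most frequent value (top), and the frequency of that value (freq). A notebook cell displays only the value of its last expression, so the output above is the summary for movie_rating_df alone; the summaries for box_office_df and movie_basics_df are computed but not shown.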
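
To see the summary for every DataFrame from a single cell, each result has to be printed (or passed to `display`) explicitly. The sketch below uses two toy frames with illustrative values in place of the real ones. The dictionary keys are just labels.

```python
import pandas as pd

# Toy stand-ins for the real DataFrames (illustrative values only)
frames = {
    "box_office_df": pd.DataFrame(
        {"title": ["a", "b", "b"], "studio": ["bv", "wb", "wb"]}, dtype="object"
    ),
    "movie_rating_df": pd.DataFrame({"movie_id": ["tt1", "tt2", "tt3"]}, dtype="object"),
}

summaries = {}
for name, frame in frames.items():
    # Summarize only the object (string) columns of each frame
    summaries[name] = frame.describe(include=["O"])
    print(f"--- {name} ---")
    print(summaries[name])
```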

## Performing Data Integrity Checks

We will:
- Check for missing values in `box_office_df` and `film_df` using `.isnull().sum()`.
- Check for duplicate rows using `.duplicated().sum()`.
- Check the data types of each column in both DataFrames using `.dtypes`.


In [None]:
# Check missing values in box_office_df and film_df
print("Missing Values in Box Office Data:\n", box_office_df.isnull().sum())
print("\nMissing Values in Merged Movie Data (film_df):\n", film_df.isnull().sum())

# Check duplicate rows in box_office_df and film_df
print("\nDuplicate Rows in Box Office Data:", box_office_df.duplicated().sum())
print("\nDuplicate Rows in Merged Movie Data (film_df):", film_df.duplicated().sum())

# Check data types in box_office_df and film_df
print("\nData Types in Box Office Data:\n", box_office_df.dtypes)
print("\nData Types in Merged Movie Data (film_df):\n", film_df.dtypes)


Missing Values in Box Office Data:
 title             0
studio            0
domestic_gross    0
foreign_gross     0
year              0
dtype: int64

Missing Values in Merged Movie Data (film_df):
 movie_id           0
primary_title      0
original_title     0
start_year         0
runtime_minutes    0
genres             0
averagerating      0
numvotes           0
dtype: int64

Duplicate Rows in Box Office Data: 0

Duplicate Rows in Merged Movie Data (film_df): 0

Data Types in Box Office Data:
 title              object
studio             object
domestic_gross    float64
foreign_gross     float64
year                int64
dtype: object

Data Types in Merged Movie Data (film_df):
 movie_id            object
primary_title       object
original_title      object
start_year           int64
runtime_minutes    float64
genres              object
averagerating      float64
numvotes             int64
dtype: object


* Missing Values:

.isnull().sum() returns the count of missing (NaN) values for each column.

* Duplicate Rows:

.duplicated().sum() returns the number of duplicate rows in the DataFrame.

* Data Types:

.dtypes returns the data types of each column, which is helpful to ensure the correctness of column data types (e.g., object, int64, float64, etc.).
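
The three checks can be bundled into one small function so the same report is produced for every DataFrame. This is a sketch: `integrity_report` is a hypothetical helper name and the toy data is illustrative.

```python
import pandas as pd

def integrity_report(df: pd.DataFrame) -> dict:
    """Summarize missing values, duplicate rows and column dtypes."""
    return {
        "rows": len(df),
        "missing_total": int(df.isnull().sum().sum()),
        "duplicate_rows": int(df.duplicated().sum()),
        "dtypes": df.dtypes.astype(str).to_dict(),
    }

# Toy data: two missing grosses; rows 1 and 2 are identical
toy = pd.DataFrame({
    "title": ["a", "b", "b"],
    "domestic_gross": [1.0, None, None],
    "year": [2010, 2011, 2011],
})

report = integrity_report(toy)
print(report)
```

Note that `duplicated()` treats missing values in the same position as equal, so rows 1 and 2 count as duplicates of each other.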

## Saving Cleaned Datasets

We will:
- Save the cleaned `film_df` dataset to a CSV file.
- Save the cleaned `box_office_df` dataset to a separate CSV file.


In [None]:
film_df.to_csv("cleaned_film_dataset.csv", index=False)  # Save cleaned film dataset for Tableau
box_office_df.to_csv("cleaned_box_office_dataset.csv", index=False)  # Save cleaned box office dataset


* index=False: Prevents the DataFrame index from being saved as an additional column in the CSV file.

* Unique filenames: Giving each dataset its own filename (cleaned_film_dataset.csv and cleaned_box_office_dataset.csv) ensures that the second save does not overwrite the first.

## Generate Download Link for Cleaned Dataset

We will:
- Generate a download link for the cleaned dataset (`cleaned_film_dataset.csv` or `cleaned_box_office_dataset.csv`).


In [None]:
# Generate a download link (use "cleaned_box_office_dataset.csv" for the other file)
from IPython.display import FileLink
FileLink("cleaned_film_dataset.csv")
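
Saving each DataFrame under its own filename can be sanity-checked by reading the files back. The sketch below writes two toy frames (illustrative values) to a temporary directory, so it does not touch the project files.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy stand-ins for the cleaned DataFrames (illustrative values only)
film_toy = pd.DataFrame({"movie_id": ["tt1", "tt2"], "runtime_minutes": [90.0, 110.0]})
box_toy = pd.DataFrame({"title": ["a (2010)"], "domestic_gross": [1000.0]})

# One distinct filename per dataset, so neither save overwrites the other
out_dir = Path(tempfile.mkdtemp())
outputs = {
    "cleaned_film_dataset.csv": film_toy,
    "cleaned_box_office_dataset.csv": box_toy,
}
for filename, frame in outputs.items():
    frame.to_csv(out_dir / filename, index=False)

# Read each file back and confirm its shape
reloaded = {name: pd.read_csv(out_dir / name) for name in outputs}
print({name: frame.shape for name, frame in reloaded.items()})
```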

In [None]:
# Statistical Distribution
# A statistical distribution shows how the values of a variable are spread or arranged.
# In this project, the variables of interest include:
#   - Movie revenues (box office gross)
#   - Ratings (Rotten Tomatoes, IMDb)
#   - Number of reviews
#   - Movie release years

