# MOVIE INDUSTRY ANALYSIS

## 🎬 Project Title:
Box Office Gold: Data-Driven Insights for a Profitable Movie Studio Launch

### Business Understanding

👥 Stakeholder

The primary stakeholder is the executive team of the company's new movie studio. They need insights into the film industry to make confident decisions on what type of movies to produce.

### 🌍 Domain:

Entertainment & Media Analytics (specifically, Film Industry/Box Office Performance)

#### 📘 Introduction:

With the growing trend of major companies venturing into original film production, my organization is planning to launch its own movie studio. However, without prior experience in the film industry, there’s uncertainty about what types of movies resonate with audiences and drive box office success. This project aims to analyze trends in box office data to uncover what genres, budgets, and other film attributes contribute most to commercial success — providing strategic guidance for profitable content creation.

#### 🎯 Business Objectives:

- **Identify high-performing film genres:** Analyze box office data to determine which movie genres consistently generate the highest revenue and audience engagement.
- **Examine the relationship between budget and profitability:** Investigate how production budgets influence box office success and identify the budget range that maximizes return on investment (ROI).
- **Assess the impact of key film attributes:** Explore how factors such as runtime, cast, release date (season), and film ratings (e.g., PG-13, R) affect a movie's performance.
- **Benchmark against top studios:** Analyze which production studios lead in commercial success and identify patterns in their film portfolios.
- **Provide actionable recommendations:** Based on the insights, suggest the optimal type of film (genre, budget, release timing, etc.) that the company should produce for a successful studio launch.

#### 📊 Project Plan: Box Office Gold – Data-Driven Insights for a Profitable Movie Studio Launch
🔍 1. Problem Understanding & Goal Definition

Review business problem: Identify film types that succeed at the box office.

Define clear goals: Provide recommendations on genre, budget, and release strategy.

📦 2. Data Collection

Source box office datasets from the following files:

- `im.db.zip` (SQLite database; `movie_basics` and `movie_ratings` tables)
- `bom.movie_gross.csv.gz`

Collect relevant data fields:

- Genre, budget, revenue, runtime, release date, production company, director, cast, rating, etc.

🧹 3. Data Cleaning & Preprocessing

- Handle missing values and inconsistencies.
- Standardize formats (dates, currencies, genres).
- Convert categorical variables where necessary.
- Remove duplicates or irrelevant records (e.g., short films, non-theatrical releases).

📊 4. Exploratory Data Analysis (EDA)

- Univariate & Bivariate Analysis (e.g., budget vs revenue, genre vs revenue).
- Correlation heatmaps, box plots, histograms.
- Identify outliers and common patterns in successful films.
- Segment data by genre, production studio, or release year.

🧠 5. Insights & Recommendations

- Summarize which genres are top performers.
- Recommend ideal budget ranges.
- Identify optimal release months/seasons.
- Suggest attributes linked to successful movies (e.g., popular runtimes, ratings).

📑 6. Reporting & Visualization

- Build clear and compelling visualizations (using Tableau, Power BI, or Python’s Seaborn/Matplotlib).
- Draft a business-focused report or slide deck.
- Include:
  - Key findings
  - Strategic suggestions
  - Visual evidence

📢 7. Presentation to Stakeholders

- Communicate insights in non-technical language.
- Show data-driven rationale for proposed movie types.
- Allow room for stakeholder feedback and Q&A.

#### Overview/Background
As the entertainment industry shifts toward original content production, many large companies are investing in their own movie studios to capture audience attention and drive revenue. The company seeks to follow this trend but lacks experience in film production. To ensure a successful studio launch, this project aims to analyze historical box office data to uncover key trends in genre performance, budget impact, and other critical success factors. The goal is to provide data-driven insights that will guide strategic decisions on what types of films to produce for maximum box office success.

#### Challenges
One of the main challenges in this project is acquiring comprehensive and reliable box office data that includes essential attributes such as genre, budget, revenue, and release details. Additionally, the film industry is influenced by unpredictable factors like audience trends, star power, and marketing, which are difficult to quantify. Ensuring data quality, handling missing or inconsistent entries, and drawing actionable insights that align with business goals also present key hurdles in the analysis process.

#### Proposed Solution
To address the business challenge, this project proposes a data-driven approach that involves collecting and analyzing historical box office data to identify patterns in successful films. By examining factors such as genre, budget, revenue, release timing, and other key attributes, the project will uncover trends that correlate with box office success. The insights will then be translated into practical recommendations to guide the company in producing films with higher chances of commercial success.

#### Conclusion:
Launching a successful movie studio requires more than creativity—it demands strategic, data-informed decisions. This project leverages box office analytics to uncover what drives film profitability, helping the company make confident choices about genre, budget, and release strategy. With clear insights and recommendations, the company will be well-positioned to enter the competitive film industry with a strong foundation for success.

#### ❗ Problem Statement
As the company plans to venture into original film production, it faces significant uncertainty due to a lack of industry experience. 🎬 Making informed decisions about what types of films to produce is challenging without a clear understanding of market trends and performance drivers. Additionally, obtaining accurate and complete box office data is difficult, and the film industry itself is influenced by various unpredictable factors such as changing audience preferences, marketing impact, and star power. These challenges make it hard to identify what contributes to a movie’s commercial success and pose a risk to the company’s new venture.

#### 📊 Data Understanding
The data for this project comes from multiple sources:

- `im.db.zip`: A zipped SQLite database that contains several tables. The two most relevant are:
  - `movie_basics`: Key information about each movie: titles, release year, runtime, and genres.
  - `movie_ratings`: The average user rating and the number of votes for each movie.
- `bom.movie_gross.csv.gz`: A compressed CSV file containing box office gross data (title, studio, domestic and foreign gross, and year). This file is essential for analyzing revenue trends and overall financial performance.
- `tn.movie_budgets.csv.gz`: A compressed CSV file containing each movie's release date, production budget, and domestic and worldwide gross.

Combined, these sources offer a broad view of film attributes and performance metrics, which are crucial for understanding the factors behind movie success.









### Workflow & Steps

#### Step 1: Importing Libraries, Connecting to the Database, and Loading and Inspecting the Datasets

 Import necessary libraries

 Connect to the database and load the datasets

 Check dataset shape (rows & columns) 

 View dataset information (column types, missing values)

 Summary statistics of numerical columns  

 Check for duplicate rows  

 Check for unique values in key categorical columns


#### Step 2: Data Cleaning

 Remove duplicates 

 Handle missing values appropriately  

 Convert data types if necessary  

 Standardize column names for consistency 

 Correct inconsistent categorical values  

 Save the cleaned dataset  

## Step 1: Importing and Loading Data

### 📚 1.1 Importing Necessary Libraries

In this step, we import all the essential Python libraries that we will use throughout the project for:
- Extracting and reading data (from ZIP files and SQLite databases)
- Cleaning and analyzing data
- Performing statistical operations
- Visualizing data to uncover insights


In [40]:
# Import the necessary libraries

import zipfile
from scipy import stats
import sqlite3
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings; warnings.simplefilter('ignore', FutureWarning)


* In this project, we import libraries that provide tools for data manipulation and database interaction. 
 These libraries enable efficient loading, cleaning, merging, and analysis of structured data from various sources, 
 including flat files and relational databases. By using these tools, we can prepare and organize the data for deeper analysis and insight.

 ### 🗄️ 1.2: Connecting to the SQLite Database

In this step, we establish a connection to the SQLite database that contains structured data related to our project.

**Why this is important:**  
SQLite databases allow us to store and retrieve data efficiently using SQL queries. We first need to connect to the database before we can explore or extract tables for analysis.


In [41]:
# Establishing a connection to the SQLite database containing structured data 
conn = sqlite3.connect("zippedData/im.db1/im.db")
cur = conn.cursor()

cur.execute("SELECT name from sqlite_master").fetchall()

[('movie_basics',),
 ('directors',),
 ('known_for',),
 ('movie_akas',),
 ('movie_ratings',),
 ('persons',),
 ('principals',),
 ('writers',)]

* Creating a connection object lets us execute SQL queries and extract tables or records from the database. The connection is needed both to retrieve data stored in relational form and to load it into a DataFrame for further analysis. The query against `sqlite_master` returns the eight table names shown above.
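
The table listing can be narrowed to user tables and paired with a quick column check. The sketch below is illustrative only: it uses a throwaway in-memory database with two toy tables rather than the real `im.db`, and the helper names `list_tables` and `table_columns` are hypothetical.

```python
import sqlite3

def list_tables(conn):
    """Return the names of user tables, skipping indexes and views."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    return [name for (name,) in rows]

def table_columns(conn, table):
    """Return the column names of one table via PRAGMA table_info."""
    # PRAGMA statements cannot take bound parameters, so pass trusted names only.
    return [row[1] for row in conn.execute(f"PRAGMA table_info({table})")]

# Toy in-memory database standing in for im.db (assumption for illustration).
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("CREATE TABLE movie_basics (movie_id TEXT, primary_title TEXT)")
demo_conn.execute("CREATE TABLE movie_ratings (movie_id TEXT, averagerating REAL)")

tables = list_tables(demo_conn)
basics_columns = table_columns(demo_conn, "movie_basics")
print(tables)
print(basics_columns)
demo_conn.close()
```

Checking column names this way before writing a join makes it easier to see which keys the tables share.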


# Combined Dataset

This dataset merges information from the following tables:
- `principals`
- `movie_akas`
- `movie_basics`
- `movie_ratings`
- `persons`

### Columns:
- `movie_id` — Unique identifier for each movie
- `person_id` — Unique identifier for each person
- `category` — The category of job for the person (e.g., actor, director)
- `job` — Specific job title
- `characters` — Characters played (if applicable)
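- `ordering` — Appears twice: once from `principals` (position of the person within the movie's list of principals) and once from `movie_akas` (position of the alternative title within the movie's list of titles)
- `title` — Alternative title of the movie (from `movie_akas`)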
- `region` — Region associated with the title
- `language` — Language of the alternative title
- `types` — Types of alternative titles (e.g., working title, original)
- `attributes` — Additional attributes of the title
- `is_original_title` — Indicator if the title is the original
- `primary_title` — Official main title of the movie
- `original_title` — Original language title
- `start_year` — Year the movie was released
- `runtime_minutes` — Duration of the movie
- `genres` — Movie genres
- `averagerating` — Average rating from users
- `numvotes` — Number of votes received
- `primary_name` — Person’s name
- `birth_year` — Person’s birth year
- `death_year` — Person’s death year (if applicable)
- `primary_profession` — Main profession(s) of the person


In [42]:
#directors has person_id,movie_id
#writers has movie_id,person_id
#known_for has person_id,movie_id
#movie_ratings movie_id,averagerating,numvotes
#persons has person_id,primary_name,birth_year,death_year,primary_profession
#principals has movie_id,person_id,category,job,characters
#movie_akas has movie_id,ordering,title,region,language,types,attributes,is_original_title
#movie_basics has movie_id,primary_title,original_title,start_year,runtime_minutes,genres

#persons,principals,movie_akas,movie_basics,movie_ratings are the only tables we need for all the columns in the db

combined = """SELECT *
                from principals
                JOIN movie_akas USING (movie_id)
                JOIN movie_basics USING (movie_id)
                JOIN movie_ratings USING (movie_id)
                JOIN persons USING (person_id);"""

combined_df = pd.read_sql_query(combined,conn)
combined_df.head()




Unnamed: 0,movie_id,ordering,person_id,category,job,characters,ordering.1,title,region,language,...,original_title,start_year,runtime_minutes,genres,averagerating,numvotes,primary_name,birth_year,death_year,primary_profession
0,tt0323808,10,nm0059247,editor,,,1,May Day,GB,,...,The Wicker Tree,2011,96.0,"Drama,Horror",3.9,2328,Sean Barton,1944.0,,"editor,editorial_department,assistant_director"
1,tt0323808,10,nm0059247,editor,,,2,Cowboys for Christ,GB,,...,The Wicker Tree,2011,96.0,"Drama,Horror",3.9,2328,Sean Barton,1944.0,,"editor,editorial_department,assistant_director"
2,tt0323808,10,nm0059247,editor,,,3,The Wicker Tree,GB,,...,The Wicker Tree,2011,96.0,"Drama,Horror",3.9,2328,Sean Barton,1944.0,,"editor,editorial_department,assistant_director"
3,tt0323808,10,nm0059247,editor,,,4,The Wicker Tree,,,...,The Wicker Tree,2011,96.0,"Drama,Horror",3.9,2328,Sean Barton,1944.0,,"editor,editorial_department,assistant_director"
4,tt0323808,10,nm0059247,editor,,,5,Плетеное дерево,RU,,...,The Wicker Tree,2011,96.0,"Drama,Horror",3.9,2328,Sean Barton,1944.0,,"editor,editorial_department,assistant_director"


In [43]:
film_df = combined_df.copy()

## Checking the Number of Rows and Columns After Merging

We will verify the size of the merged DataFrame `film_df` to see how many records and fields it contains after combining the five tables.


In [44]:
# Check the number of rows and columns after merging
film_df.shape

(2422866, 24)

  
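- `.shape` returns a tuple:
  - The **first value** is the number of rows. Each row is one combination of a movie, a credited person from `principals`, and an alternative title from `movie_akas`, so a single movie appears in many rows (see the repeated `tt0323808` rows above).
  - The **second value** is the number of columns (the combined fields of the five joined tables).
- This confirms the size of the merged dataset. The 2,422,866 rows are not 2,422,866 distinct movies: the one-to-many joins repeat each movie, so per-movie analyses need to deduplicate or aggregate first.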
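
To see why the row count is so much larger than the number of movies, here is a small sketch with made-up rows standing in for `principals` and `movie_akas`. It is not the notebook's data, only an illustration of how one-to-many joins multiply rows and how `nunique()` and `drop_duplicates()` recover the per-movie view.

```python
import pandas as pd

# Toy stand-ins for the principals and movie_akas tables (made-up rows).
principals = pd.DataFrame({
    "movie_id": ["tt1", "tt1", "tt2"],
    "person_id": ["nm1", "nm2", "nm3"],
})
akas = pd.DataFrame({
    "movie_id": ["tt1", "tt1", "tt1", "tt2"],
    "title": ["A", "A (GB)", "A (RU)", "B"],
})

joined = principals.merge(akas, on="movie_id")

# tt1: 2 people x 3 titles = 6 rows; tt2: 1 person x 1 title = 1 row.
print(joined.shape)                    # (7, 3)
print(joined["movie_id"].nunique())    # 2

# One row per movie, keeping the first occurrence of each movie_id.
per_movie = joined.drop_duplicates(subset="movie_id")
print(len(per_movie))                  # 2
```

Running `film_df["movie_id"].nunique()` on the merged frame would give the number of distinct movies in the same way.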


## Checking Data Types and Basic Information of `film_df`

We will inspect the structure of `film_df`, including data types, non-null counts, and memory usage, to better understand the dataset.


In [45]:
# Checking Data Types and Basic Information of `film_df`
film_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2422866 entries, 0 to 2422865
Data columns (total 24 columns):
 #   Column              Dtype  
---  ------              -----  
 0   movie_id            object 
 1   ordering            int64  
 2   person_id           object 
 3   category            object 
 4   job                 object 
 5   characters          object 
 6   ordering            int64  
 7   title               object 
 8   region              object 
 9   language            object 
 10  types               object 
 11  attributes          object 
 12  is_original_title   float64
 13  primary_title       object 
 14  original_title      object 
 15  start_year          int64  
 16  runtime_minutes     float64
 17  genres              object 
 18  averagerating       float64
 19  numvotes            int64  
 20  primary_name        object 
 21  birth_year          float64
 22  death_year          float64
 23  primary_profession  object 
dtypes: float64(5), int64(4),

  
- `.info()` provides a summary of the DataFrame, showing:
  - The **total number of entries** (rows),
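  - **Non-null counts** for each column (helpful for spotting missing data); these are omitted in the output above because `film_df` has more rows than pandas' `display.max_info_rows` threshold,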
  - **Data types** for each column:
    - `object` for text/string,
    - `int64` for integers,
    - `float64` for decimal numbers,
  - **Memory usage** (important for very large datasets).
- This is crucial for:
  - Detecting missing data early,
  - Planning data cleaning steps,
  - Knowing if data types need optimization (e.g., changing types to save memory).
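
For very large frames, pandas leaves the non-null counts out of `.info()` unless asked for them. The sketch below reproduces that behaviour on a three-row toy frame by lowering the `display.max_info_rows` option, then forces the counts with `show_counts=True`. The toy data and the `info_text` helper are assumptions made for illustration.

```python
import io
import pandas as pd

toy = pd.DataFrame({
    "runtime_minutes": [96.0, None, 110.0],
    "genres": ["Drama", None, "Horror"],
})

def info_text(df, **kwargs):
    """Capture DataFrame.info() output as a string."""
    buf = io.StringIO()
    df.info(buf=buf, **kwargs)
    return buf.getvalue()

# Lower the row threshold so this 3-row frame behaves like a very large one.
with pd.option_context("display.max_info_rows", 2):
    default_text = info_text(toy)                     # counts omitted
    forced_text = info_text(toy, show_counts=True)    # counts shown

print(forced_text)
```

On the real data, `film_df.info(show_counts=True)` should list the non-null count per column, at the cost of a slower call.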


## Summary Statistics of `film_df`

We will generate descriptive statistics for the numeric columns in the merged `film_df` DataFrame to understand the distribution and range of the data.


In [46]:
# Summary Statistics
film_df.describe()


Unnamed: 0,ordering,ordering.1,is_original_title,start_year,runtime_minutes,averagerating,numvotes,birth_year,death_year
count,2422866.0,2422866.0,2422866.0,2422866.0,2327862.0,2422866.0,2422866.0,1054911.0,42418.0
mean,5.279298,6.401481,0.1377749,2014.116,100.5235,6.240323,31048.51,1967.515,1990.075251
std,2.825593,7.38387,0.3446636,2.576625,120.3028,1.234918,98109.64,20.26385,64.839456
min,1.0,1.0,0.0,2010.0,3.0,1.0,5.0,1.0,17.0
25%,3.0,2.0,0.0,2012.0,88.0,5.6,89.0,1960.0,1992.0
50%,5.0,3.0,0.0,2014.0,97.0,6.4,829.0,1970.0,2012.0
75%,8.0,8.0,0.0,2016.0,110.0,7.1,9625.0,1979.0,2016.0
max,10.0,61.0,1.0,2019.0,51420.0,10.0,1841066.0,2014.0,2019.0


 
- `.describe()` provides summary statistics for **numeric columns** in the `film_df` DataFrame, including:
  - **count**: Number of non-null entries,
  - **mean**: Average value,
  - **std**: Standard deviation (how spread out the values are),
  - **min**: Minimum value,
  - **25%**, **50% (median)**, **75%**: Percentile values indicating distribution,
  - **max**: Maximum value.
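- This gives a quick overview of the central tendency, spread, and variability of the dataset’s numerical features (like ratings or release year). The output above also flags values worth investigating during cleaning: a maximum `runtime_minutes` of 51,420, a minimum `birth_year` of 1, and a minimum `death_year` of 17.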
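
Extreme values in a summary, such as the minimum runtime of 3 minutes and the maximum of 51,420 minutes, can be isolated with a simple range filter. This is a minimal sketch on toy rows; the 40 to 300 minute window is a hypothetical plausibility range chosen for illustration, not a rule taken from the data.

```python
import pandas as pd

toy = pd.DataFrame({
    "primary_title": ["Film A", "Film B", "Film C", "Film D"],
    "runtime_minutes": [96.0, 3.0, 51420.0, 110.0],
})

# Hypothetical plausibility window for a feature film, in minutes.
MIN_RUNTIME, MAX_RUNTIME = 40, 300

in_range = toy["runtime_minutes"].between(MIN_RUNTIME, MAX_RUNTIME)
suspect = toy.loc[~in_range, ["primary_title", "runtime_minutes"]]
print(suspect)

# The median is barely affected by the extremes, unlike the mean.
runtime_median = toy["runtime_minutes"].median()
runtime_mean = toy["runtime_minutes"].mean()
print(runtime_median)   # 103.0
print(runtime_mean)
```

Listing the suspect rows first, rather than dropping them outright, leaves room to decide whether they are data errors or legitimate edge cases.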


## Listing the Column Names in `film_df`

We will display the names of all the columns in the `film_df` DataFrame to understand its structure and the data it contains.


 


In [47]:
# The following are the column names in the `film_df` DataFrame:
film_df.columns.tolist()

['movie_id',
 'ordering',
 'person_id',
 'category',
 'job',
 'characters',
 'ordering',
 'title',
 'region',
 'language',
 'types',
 'attributes',
 'is_original_title',
 'primary_title',
 'original_title',
 'start_year',
 'runtime_minutes',
 'genres',
 'averagerating',
 'numvotes',
 'primary_name',
 'birth_year',
 'death_year',
 'primary_profession']

 
- `.columns` retrieves the column names of the DataFrame.
- `.tolist()` converts the column names into a **Python list** for easier reading and use.
- This step is useful for:
  - Quickly checking what data is available in the merged DataFrame,
  - Ensuring all expected columns are present after merging the datasets. Note that `ordering` appears twice, once from `principals` and once from `movie_akas`.
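
Because both `principals` and `movie_akas` have an `ordering` column, `SELECT *` yields two columns with the same name. One way to avoid that is to list the columns explicitly and alias the clashing ones. The sketch below uses a toy in-memory database; the alias names `principal_ordering`, `aka_ordering`, and `aka_title` are hypothetical.

```python
import sqlite3
import pandas as pd

# Toy in-memory database with the two tables that share an `ordering` column.
demo_conn = sqlite3.connect(":memory:")
demo_conn.executescript("""
    CREATE TABLE principals (movie_id TEXT, ordering INTEGER, person_id TEXT);
    CREATE TABLE movie_akas (movie_id TEXT, ordering INTEGER, title TEXT);
    INSERT INTO principals VALUES ('tt1', 10, 'nm1');
    INSERT INTO movie_akas VALUES ('tt1', 1, 'May Day');
    INSERT INTO movie_akas VALUES ('tt1', 2, 'Cowboys for Christ');
""")

query = """
    SELECT movie_id,
           p.ordering AS principal_ordering,
           a.ordering AS aka_ordering,
           a.title    AS aka_title
    FROM principals AS p
    JOIN movie_akas AS a USING (movie_id);
"""
aliased_df = pd.read_sql_query(query, demo_conn)
demo_conn.close()

print(aliased_df.columns.tolist())
print(aliased_df.columns.is_unique)
```

Unique column names matter later on, since selecting `df["ordering"]` from a frame with duplicate names returns a DataFrame rather than a Series.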


## Loading Data from `bom.movie_gross.csv.gz`

We will load the movie box office earnings data from the `bom.movie_gross.csv.gz` file and display the first few rows to inspect its contents.


In [48]:
# Load data from bom.movie_gross.csv.gz
# box_office_df: box office earnings per movie

box_office_df = pd.read_csv('zippedData/bom.movie_gross.csv.gz')
box_office_df.head()

Unnamed: 0,title,studio,domestic_gross,foreign_gross,year
0,Toy Story 3,BV,415000000.0,652000000,2010
1,Alice in Wonderland (2010),BV,334200000.0,691300000,2010
2,Harry Potter and the Deathly Hallows Part 1,WB,296000000.0,664300000,2010
3,Inception,WB,292600000.0,535700000,2010
4,Shrek Forever After,P/DW,238700000.0,513900000,2010



- `pd.read_csv()` is used to load the data from a CSV file into a pandas DataFrame.
  - In this case, the file is compressed as `.gz`, but pandas can read it directly without any extra steps.
- `.head()` displays the first 5 rows of the DataFrame to give a quick view of the data structure.
- The dataset contains one row per movie with its title, studio, domestic gross, foreign gross, and release year.


## Checking the Number of Rows and Columns in `box_office_df`

We will check the dimensions of the `box_office_df` DataFrame to see how many records and fields it contains.


In [49]:
# check box_office_df rows and columns
box_office_df.shape

(3387, 5)


- `.shape` returns a tuple:
  - The **first value** represents the **number of rows** (how many box office records are available).
  - The **second value** represents the **number of columns** (how many features or fields are provided for each record).
- This helps quickly assess the size and structure of the dataset.


## Summary Statistics of `box_office_df`

We will generate descriptive statistics for the numeric columns in the `box_office_df` DataFrame to understand the distribution and range of the box office earnings data.


In [50]:
# Generate summary (descriptive) statistics for the numeric columns
# .describe() summarizes key statistical measures for each numeric column
box_office_df.describe()

Unnamed: 0,domestic_gross,year
count,3359.0,3387.0
mean,28745850.0,2013.958075
std,66982500.0,2.478141
min,100.0,2010.0
25%,120000.0,2012.0
50%,1400000.0,2014.0
75%,27900000.0,2016.0
max,936700000.0,2018.0


  
- `.describe()` provides summary statistics for **numeric columns** in the `box_office_df` DataFrame, including:
  - **count**: Number of non-null entries,
  - **mean**: Average value,
  - **std**: Standard deviation (measure of spread),
  - **min**: Minimum value,
  - **25%**, **50% (median)**, **75%**: Percentile values (showing distribution),
  - **max**: Maximum value.
- This step gives a quick view of the box office data’s central tendency, spread, and overall range. `foreign_gross` does not appear in the output above because it is stored as text (`object`) at this point.


## Checking Dataset Info for `box_office_df`

We will inspect the structure of the `box_office_df` DataFrame, including column types, non-null counts, and memory usage, to better understand the dataset.


In [51]:
# Check dataset info
box_office_df.info()  # Summary of dataset, including column types and non-null values

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3387 entries, 0 to 3386
Data columns (total 5 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   title           3387 non-null   object 
 1   studio          3382 non-null   object 
 2   domestic_gross  3359 non-null   float64
 3   foreign_gross   2037 non-null   object 
 4   year            3387 non-null   int64  
dtypes: float64(1), int64(1), object(3)
memory usage: 132.4+ KB



- `.info()` provides an overview of the DataFrame, including:
  - **Data types**: It shows the type of data in each column (e.g., `int64`, `float64`, `object` for text).
  - **Non-null counts**: Helps detect missing values by showing how many non-null entries exist in each column.
  - **Memory usage**: Indicates how much memory the entire DataFrame is using (important for large datasets).
- This step is crucial for understanding the structure and cleanliness of the dataset.


## Loading Data from `tn.movie_budgets.csv.gz`

We will load the movie budget data from the `tn.movie_budgets.csv.gz` file and display the first few rows to inspect its contents.


In [52]:
# Load data from tn.movie_budgets.csv.gz
# movie_budget_df: production budget and gross earnings per movie

movie_budget_df = pd.read_csv('zippedData/tn.movie_budgets.csv.gz')
movie_budget_df.head()

Unnamed: 0,id,release_date,movie,production_budget,domestic_gross,worldwide_gross
0,1,"Dec 18, 2009",Avatar,"$425,000,000","$760,507,625","$2,776,345,279"
1,2,"May 20, 2011",Pirates of the Caribbean: On Stranger Tides,"$410,600,000","$241,063,875","$1,045,663,875"
2,3,"Jun 7, 2019",Dark Phoenix,"$350,000,000","$42,762,350","$149,762,350"
3,4,"May 1, 2015",Avengers: Age of Ultron,"$330,600,000","$459,005,868","$1,403,013,963"
4,5,"Dec 15, 2017",Star Wars Ep. VIII: The Last Jedi,"$317,000,000","$620,181,382","$1,316,721,747"



- `pd.read_csv()` is used to load the data from a CSV file into a pandas DataFrame.
  - In this case, the file is compressed as `.gz`, but pandas can read it directly without any extra steps.
- `.head()` displays the first 5 rows of the DataFrame to give a quick view of the data structure.
- The dataset contains each movie's release date, title, production budget, domestic gross, and worldwide gross.


## Checking the Number of Rows and Columns in `movie_budget_df`

We will check the dimensions of the `movie_budget_df` DataFrame to see how many records and fields it contains.


In [53]:
# Check movie_budget_df rows and columns
movie_budget_df.shape

(5782, 6)


- `.shape` returns a tuple:
  - The **first value** represents the **number of rows** (how many movie budget records are available).
  - The **second value** represents the **number of columns** (how many features or fields are provided for each record).
- This helps quickly assess the size and structure of the dataset.


## Summary Statistics of `movie_budget_df`

We will generate descriptive statistics for the numeric columns in the `movie_budget_df` DataFrame to understand the distribution and range of the movie budget data.


In [54]:
# Generate summary (descriptive) statistics for the numeric columns
# .describe() must be called with parentheses; without them, Python only
# displays the bound method instead of computing the statistics
movie_budget_df.describe()

  
- `.describe()` provides summary statistics for **numeric columns** in the `movie_budget_df` DataFrame, including:
  - **count**: Number of non-null entries,
  - **mean**: Average value,
  - **std**: Standard deviation (measure of spread),
  - **min**: Minimum value,
  - **25%**, **50% (median)**, **75%**: Percentile values (showing distribution),
  - **max**: Maximum value.
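- At this stage only `id` is numeric in `movie_budget_df`. The budget and gross columns are stored as text (for example `$425,000,000`), so they must be converted to numbers before `.describe()` can summarize their central tendency, spread, and range.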
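
By default `.describe()` on a mixed DataFrame summarizes only the numeric columns, while calling it on a text column reports counts and the most frequent value instead. A minimal sketch on a toy frame whose budget values are copied from the first rows shown earlier:

```python
import pandas as pd

toy = pd.DataFrame({
    "id": [1, 2, 3],
    "production_budget": ["$425,000,000", "$410,600,000", "$350,000,000"],
})

# Mixed frame: only the numeric column `id` is summarized.
numeric_summary = toy.describe()

# Text column: the summary switches to count, unique, top, and freq.
text_summary = toy["production_budget"].describe()

print(numeric_summary.columns.tolist())   # ['id']
print(text_summary.index.tolist())        # ['count', 'unique', 'top', 'freq']
```

If a money column shows `unique`, `top`, and `freq` instead of a mean and quartiles, that is a quick sign it still needs converting to a numeric type.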


## Checking Dataset Info for `movie_budget_df`

We will inspect the structure of the `movie_budget_df` DataFrame, including column types, non-null counts, and memory usage, to better understand the dataset.


In [55]:
# Check dataset info
movie_budget_df.info()  # Summary of dataset, including column types and non-null values


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5782 entries, 0 to 5781
Data columns (total 6 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   id                 5782 non-null   int64 
 1   release_date       5782 non-null   object
 2   movie              5782 non-null   object
 3   production_budget  5782 non-null   object
 4   domestic_gross     5782 non-null   object
 5   worldwide_gross    5782 non-null   object
dtypes: int64(1), object(5)
memory usage: 271.2+ KB



- `.info()` provides an overview of the DataFrame, including:
  - **Data types**: It shows the type of data in each column (e.g., `int64`, `float64`, `object` for text).
  - **Non-null counts**: Helps detect missing values by showing how many non-null entries exist in each column.
  - **Memory usage**: Indicates how much memory the entire DataFrame is using (important for large datasets).
- This step is crucial for understanding the structure and cleanliness of the dataset.
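
Since every monetary column and `release_date` are stored as `object`, they need converting before any numeric or date-based analysis. The following is a sketch under the assumption that all values follow the `$425,000,000` and `Dec 18, 2009` patterns seen in the first rows; the toy frame copies two of those rows.

```python
import pandas as pd

toy = pd.DataFrame({
    "release_date": ["Dec 18, 2009", "May 20, 2011"],
    "movie": ["Avatar", "Pirates of the Caribbean: On Stranger Tides"],
    "production_budget": ["$425,000,000", "$410,600,000"],
    "worldwide_gross": ["$2,776,345,279", "$1,045,663,875"],
})

money_cols = ["production_budget", "worldwide_gross"]
for col in money_cols:
    # Strip "$" and "," and convert the remaining digits to a number.
    toy[col] = pd.to_numeric(toy[col].str.replace(r"[$,]", "", regex=True))

# Parse dates such as "Dec 18, 2009" into datetime64 values.
toy["release_date"] = pd.to_datetime(toy["release_date"], format="%b %d, %Y")

print(toy.dtypes)
release_months = toy["release_date"].dt.month.tolist()
print(release_months)   # [12, 5]
```

With real dates in place, release month or season can be derived through the `.dt` accessor.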


# Step 2: Data Cleaning 

## Checking for Missing Values in Multiple Datasets

We will define a function to check for missing values in different DataFrames (`box_office_df`, `film_df`, and `movie_budget_df`) and display the columns with missing data.


In [56]:
# Define a function to check for missing values in a DataFrame
def check_missing_values(df, name):
    print(f"\n🔍 Missing values in {name}:")
    # Use .isnull() to identify missing data and .sum() to count them
    missing_values = df.isnull().sum()
    # Print only columns with missing values (where the sum is greater than 0)
    print(missing_values[missing_values > 0])  

# Check for missing values in each DataFrame
check_missing_values(box_office_df, "Box Office Data")
check_missing_values(film_df, "Merged Movie basics and Movie ratings Data")
check_missing_values(movie_budget_df, "movie budget Data")



🔍 Missing values in Box Office Data:
studio               5
domestic_gross      28
foreign_gross     1350
dtype: int64

🔍 Missing values in Merged Movie basics and Movie ratings Data:
job                   1757131
characters            1468512
region                 393739
language              2073586
types                  974092
attributes            2301142
runtime_minutes         95004
genres                   9057
birth_year            1367955
death_year            2380448
primary_profession      52454
dtype: int64

🔍 Missing values in movie budget Data:
Series([], dtype: int64)



- The function `check_missing_values(df, name)` checks for missing values in a given DataFrame:
  - `.isnull()` creates a DataFrame of boolean values (`True` for missing, `False` for not missing).
  - `.sum()` adds up the number of `True` values (i.e., missing entries) for each column.
  - We then filter and display only those columns with **missing values** (`missing_values > 0`).
- We run the function for each DataFrame to identify any missing data in the following:
  - **Box Office Data (`box_office_df`)**
  - **Merged Movie Data (`film_df`)**
  - **Movie Budget Data (`movie_budget_df`)**
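
Raw counts are hard to compare across frames of very different sizes, so a variant that also reports percentages can help. The `missing_summary` helper below is a hypothetical alternative that returns a DataFrame instead of printing, shown on toy data:

```python
import numpy as np
import pandas as pd

def missing_summary(df):
    """Return missing counts and percentages for columns that have gaps."""
    counts = df.isna().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(df) * 100).round(1),
    })
    # Keep only columns with at least one gap, largest first.
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)

toy = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "studio": ["BV", None, "WB", "WB"],
    "foreign_gross": [np.nan, np.nan, 1.0e6, np.nan],
})

report = missing_summary(toy)
print(report)
```

Returning a DataFrame makes it possible to sort, filter, or plot the result rather than only reading it from printed output.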


## Handling Missing Values

#### Dropping Rows with Missing Values in Specific Columns

We want to drop rows from `box_office_df` where either the `studio` or `domestic_gross` columns have missing values. This is done using the `dropna()` function with the `subset` argument to target specific columns.


In [57]:
# Drop rows where either 'studio' or 'domestic_gross' has missing values
box_office_df.dropna(subset=["studio", "domestic_gross"], inplace=True)

# Check the shape of the DataFrame after dropping rows
print("Shape of box_office_df after dropping missing values in 'studio' or 'domestic_gross':")
print(box_office_df.shape)


Shape of box_office_df after dropping missing values in 'studio' or 'domestic_gross':
(3356, 5)
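
The `subset` argument restricts the check to the listed columns, so rows with gaps elsewhere are kept. A small sketch on made-up rows, using the non-inplace form so the original frame stays available for comparison:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "studio": ["BV", None, "WB", "WB"],
    "domestic_gross": [1.0e6, 2.0e6, np.nan, 3.0e6],
    "foreign_gross": [np.nan, 5.0e5, 7.0e5, np.nan],
})

required = ["studio", "domestic_gross"]
cleaned = toy.dropna(subset=required)

# Rows B and C are dropped; row D survives despite its missing foreign_gross.
dropped = len(toy) - len(cleaned)
print(dropped)                     # 2
print(cleaned["title"].tolist())   # ['A', 'D']
```

Comparing row counts before and after the drop is a quick way to confirm how many rows were removed.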


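The check above found 1,350 missing values in `foreign_gross`, roughly 40% of the rows. Dropping that many rows would discard too much data, so we fill them with the column median instead.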

## Cleaning and Handling `foreign_gross` Column

We will:
- Remove any commas in the `foreign_gross` column.
- Convert the column values to `float`.
- Fill missing values (`NaN`) with the median value of the `foreign_gross` column.


In [58]:
# Remove commas from 'foreign_gross' and convert it to a float
box_office_df["foreign_gross"] = box_office_df["foreign_gross"].replace(",", "", regex=True).astype(float)

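# Fill missing values with the median of the 'foreign_gross' column
# (assigning the result back avoids the unreliable chained inplace=True pattern)
box_office_df["foreign_gross"] = box_office_df["foreign_gross"].fillna(box_office_df["foreign_gross"].median())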

# Display the first few rows to check the results
print(box_office_df[['foreign_gross']].head())


   foreign_gross
0    652000000.0
1    691300000.0
2    664300000.0
3    535700000.0
4    513900000.0



- **`.replace(",", "", regex=True)`**: This removes any commas in the `foreign_gross` column to ensure that the values can be converted to numeric values.
- **`.astype(float)`**: After removing commas, we convert the column values to the `float` type for numerical operations.
- **`.fillna(box_office_df["foreign_gross"].median(), inplace=True)`**: Any missing values in the `foreign_gross` column are filled with the **median** value of the column. The median is often a good choice for filling missing values because it is less sensitive to outliers than the mean.
- We then print the first few rows of the `foreign_gross` column to check the changes.
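
As a hedged alternative, the same steps can be written with plain assignment instead of calling `fillna(..., inplace=True)` on a selected column, a pattern that recent pandas versions warn about. The sketch below uses a made-up `demo_df` and swaps `astype(float)` for `pd.to_numeric`; both choices are assumptions for illustration, not the notebook's own code.

```python
import pandas as pd

# Hypothetical toy data: numeric strings with commas plus one missing value
demo_df = pd.DataFrame({"foreign_gross": ["1,131.6", "652000000", None, "5200"]})

# Strip commas, then convert to numbers (the missing entry becomes NaN)
without_commas = demo_df["foreign_gross"].str.replace(",", "", regex=False)
demo_df["foreign_gross"] = pd.to_numeric(without_commas)

# Fill NaN with the median by assigning the result back to the column
median_gross = demo_df["foreign_gross"].median()
demo_df["foreign_gross"] = demo_df["foreign_gross"].fillna(median_gross)

print(demo_df)
```

Assigning the result back to the column behaves the same way whether or not pandas treats the selected column as a copy.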

 



## Confirming No Missing Values

We will:
- Check the data type of `foreign_gross` to ensure it's of type `float`.
- Use `.isnull().sum()` to verify that no missing values remain in `box_office_df` or `movie_budget_df`.


In [59]:
# Check the data type of 'foreign_gross' to ensure it's float
print("Data type of 'foreign_gross':", box_office_df["foreign_gross"].dtype)

# Check for any missing values in box_office_df
print("\nMissing values in box_office_df:")
print(box_office_df.isnull().sum())

# Check for any missing values in movie_budget_df
print("\nMissing values in movie_budget_df:")
print(movie_budget_df.isnull().sum())

Data type of 'foreign_gross': float64

Missing values in box_office_df:
title             0
studio            0
domestic_gross    0
foreign_gross     0
year              0
dtype: int64

Missing values in movie_budget_df:
id                   0
release_date         0
movie                0
production_budget    0
domestic_gross       0
worldwide_gross      0
dtype: int64


* `box_office_df["foreign_gross"].dtype`: Checks the data type of the `foreign_gross` column. After the conversion, it should be `float64`.

* `.isnull().sum()`: Counts the missing values in each column of a DataFrame. Every column of `box_office_df` and `movie_budget_df` should show zero if the missing values were handled correctly.

## Film Data Cleaning

We will:
- Fill missing `runtime_minutes` with the median value of the column.
- Fill missing `genres` with `"Unknown"`.
- Optionally, fill missing `original_title` with the value from `primary_title`.
- Finally, check for any remaining missing values in the dataset.


In [60]:
# film Data Cleaning 

# Fill missing 'runtime_minutes' with the median value
median_runtime = film_df['runtime_minutes'].median()
film_df['runtime_minutes'].fillna(median_runtime, inplace=True)

# Fill missing 'genres' with 'Unknown'
film_df['genres'].fillna('Unknown', inplace=True)

# Optional: If you want to fill missing 'original_title' with the 'primary_title' instead
film_df['original_title'].fillna(film_df['primary_title'], inplace=True)

# --- Checking after cleaning ---
print("Missing Values After Cleaning:")
print(film_df.isnull().sum())


Missing Values After Cleaning:
movie_id                    0
ordering                    0
person_id                   0
category                    0
job                   1757131
characters            1468512
ordering                    0
title                       0
region                 393739
language              2073586
types                  974092
attributes            2301142
is_original_title           0
primary_title               0
original_title              0
start_year                  0
runtime_minutes             0
genres                      0
averagerating               0
numvotes                    0
primary_name                0
birth_year            1367955
death_year            2380448
primary_profession      52454
dtype: int64


## Movie Basics Data Cleaning

We will:
- Fill missing `runtime_minutes` with the median value of the column.
- Fill missing `genres` with `"Unknown"`.
- Optionally, fill missing `original_title` with the value from `primary_title`.
- Finally, check for any remaining missing values.

Note that the cell below runs on `film_df` (the merged data), not on `movie_basics_df`. Because `film_df` was already cleaned in the previous step, the output is unchanged.


In [61]:
median_runtime = film_df['runtime_minutes'].median()
film_df['runtime_minutes'].fillna(median_runtime, inplace=True)

# Fill missing 'genres' with 'Unknown'
film_df['genres'].fillna('Unknown', inplace=True)

# Optional: If you want to fill missing 'original_title' with the 'primary_title' instead
film_df['original_title'].fillna(film_df['primary_title'], inplace=True)

# --- Checking after cleaning ---
print("Missing Values After Cleaning:")
print(film_df.isnull().sum())

Missing Values After Cleaning:
movie_id                    0
ordering                    0
person_id                   0
category                    0
job                   1757131
characters            1468512
ordering                    0
title                       0
region                 393739
language              2073586
types                  974092
attributes            2301142
is_original_title           0
primary_title               0
original_title              0
start_year                  0
runtime_minutes             0
genres                      0
averagerating               0
numvotes                    0
primary_name                0
birth_year            1367955
death_year            2380448
primary_profession      52454
dtype: int64



* Fill missing `runtime_minutes`: The median value is computed and then used to fill missing values in the `runtime_minutes` column.

* Fill missing `genres`: Any missing value in the `genres` column is filled with `"Unknown"`.

* Optional filling of `original_title`: If there are missing values in `original_title`, they are filled with the corresponding value from `primary_title`.

* Check for remaining missing values: The check is performed on `film_df` after cleaning, since that is the DataFrame the code operates on.
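
The three fills described above can also be expressed with a dictionary passed to `fillna`, which handles several columns in one call. This is a minimal sketch on a hypothetical `toy_films` frame, not the project's `film_df`.

```python
import pandas as pd

# Hypothetical toy frame with the same column names as film_df
toy_films = pd.DataFrame({
    "primary_title": ["a", "b", "c"],
    "original_title": ["a", None, "c"],
    "runtime_minutes": [90.0, None, 120.0],
    "genres": ["drama", None, "comedy"],
})

# Scalar defaults for several columns in a single fillna call
toy_films = toy_films.fillna({
    "runtime_minutes": toy_films["runtime_minutes"].median(),
    "genres": "Unknown",
})

# Row-wise fallback: use primary_title where original_title is missing
toy_films["original_title"] = toy_films["original_title"].fillna(
    toy_films["primary_title"]
)

print(toy_films.isnull().sum())
```

The dictionary form only accepts one fill value per column, which is why the `original_title` fallback (a different value per row) is still done separately.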

## Dropping Rows with Missing 'runtime_minutes' or 'genres'

We will drop any rows that still have missing values in the `runtime_minutes` or `genres` columns. Afterward, we will perform a final check to ensure that no missing values remain.


In [62]:
# Drop any rows still missing 'runtime_minutes' or 'genres'
film_df.dropna(subset=['runtime_minutes', 'genres'], inplace=True)

# Final check
print("Missing Values After Final Cleaning:")
print(film_df.isnull().sum())


Missing Values After Final Cleaning:
movie_id                    0
ordering                    0
person_id                   0
category                    0
job                   1757131
characters            1468512
ordering                    0
title                       0
region                 393739
language              2073586
types                  974092
attributes            2301142
is_original_title           0
primary_title               0
original_title              0
start_year                  0
runtime_minutes             0
genres                      0
averagerating               0
numvotes                    0
primary_name                0
birth_year            1367955
death_year            2380448
primary_profession      52454
dtype: int64


In [63]:
# Drop columns that have a large number of missing values
columns_to_drop = ['job', 'characters', 'region', 'language', 'types', 'attributes', 'birth_year', 'death_year', 'primary_profession']
film_df = film_df.drop(columns=columns_to_drop)
print("Missing Values After Final Cleaning:")
print(film_df.isnull().sum())

Missing Values After Final Cleaning:
movie_id             0
ordering             0
person_id            0
category             0
ordering             0
title                0
is_original_title    0
primary_title        0
original_title       0
start_year           0
runtime_minutes      0
genres               0
averagerating        0
numvotes             0
primary_name         0
dtype: int64


* `.dropna(subset=['runtime_minutes', 'genres'], inplace=True)`: Drops rows with missing values in the specified columns and modifies `film_df` directly instead of creating a new DataFrame. Both columns were already filled in the earlier step, so no rows are removed here; the call acts as a safeguard.

* `.isnull().sum()`: After dropping rows, this checks the entire `film_df` for any remaining missing values and displays the count of NaN values in each column.
* `.drop(columns=columns_to_drop)`: We dropped the columns `job`, `characters`, `region`, `language`, `types`, `attributes`, `birth_year`, `death_year`, and `primary_profession` from `film_df` because they contain many missing values (from 52,454 in `primary_profession` to 2,380,448 in `death_year`), which makes them unreliable for analysis. After this step, no column in `film_df` has missing values.
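
Rather than listing sparse columns by hand, one could compute the share of missing values per column and drop everything above a cutoff. The sketch below is illustrative only: the `toy` frame and the `threshold` of 0.5 are assumptions, not values used in this notebook.

```python
import pandas as pd

# Hypothetical toy frame: 'job' is mostly empty, 'genres' is mostly filled
toy = pd.DataFrame({
    "movie_id": ["tt1", "tt2", "tt3", "tt4"],
    "job": [None, None, None, "producer"],
    "genres": ["drama", "comedy", None, "drama"],
})

# Fraction of missing values in each column (0.0 to 1.0)
missing_share = toy.isnull().mean()

# Assumed cutoff: drop columns where more than half the values are missing
threshold = 0.5
sparse_columns = missing_share[missing_share > threshold].index.tolist()

trimmed = toy.drop(columns=sparse_columns)
print("Dropped:", sparse_columns)
print("Kept:", trimmed.columns.tolist())
```

A threshold makes the decision reproducible, but a column with few missing values may still be worth dropping by name if it is irrelevant to the analysis.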



## Standardizing Column Names

We will:
- Strip leading and trailing spaces from column names.
- Convert column names to lowercase.
- Replace spaces in column names with underscores to improve consistency.


In [64]:
# Standardize column names in film_df
film_df.columns = film_df.columns.str.strip().str.lower().str.replace(' ', '_')

# Standardize column names in box_office_df
box_office_df.columns = box_office_df.columns.str.strip().str.lower().str.replace(' ', '_')


# Standardize column names in movie_budget_df
movie_budget_df.columns = movie_budget_df.columns.str.strip().str.lower().str.replace(' ', '_')

# Display column names to confirm changes
print("Standardized Column Names in film_df:", film_df.columns.tolist())
print("Standardized Column Names in box_office_df:", box_office_df.columns.tolist())
print("Standardized Column Names in movie_budget_df:", movie_budget_df.columns.tolist())


Standardized Column Names in film_df: ['movie_id', 'ordering', 'person_id', 'category', 'ordering', 'title', 'is_original_title', 'primary_title', 'original_title', 'start_year', 'runtime_minutes', 'genres', 'averagerating', 'numvotes', 'primary_name']
Standardized Column Names in box_office_df: ['title', 'studio', 'domestic_gross', 'foreign_gross', 'year']
Standardized Column Names in movie_budget_df: ['id', 'release_date', 'movie', 'production_budget', 'domestic_gross', 'worldwide_gross']



* .str.strip(): Removes leading and trailing spaces from each column name.

* .str.lower(): Converts each column name to lowercase for uniformity.

* .str.replace(' ', '_'): Replaces spaces in column names with underscores (_).
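
Since the same three-step chain is applied to every DataFrame, it could be wrapped in a small helper. The function name `standardize_columns` and the sample column names are hypothetical and only serve to illustrate the idea.

```python
import pandas as pd


def standardize_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with stripped, lowercase, underscore-separated column names."""
    cleaned = df.copy()
    cleaned.columns = (
        cleaned.columns.str.strip().str.lower().str.replace(" ", "_", regex=False)
    )
    return cleaned


# Hypothetical messy column names
raw = pd.DataFrame(columns=[" Domestic Gross", "Studio ", "Release Date"])
tidy = standardize_columns(raw)

print(tidy.columns.tolist())
```

Returning a copy leaves the input untouched, so the helper can be applied as `film_df = standardize_columns(film_df)` without side effects on other references to the same frame.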

## Correcting Inconsistent Categorical Values

We will:
- Convert all text (string) values in categorical columns to lowercase.
- Strip any leading or trailing spaces from the string values for consistency.


In [65]:
# Correct inconsistent categorical values in film_df
for col in film_df.select_dtypes(include=['object']).columns:
    film_df[col] = film_df[col].str.lower().str.strip()

# Correct inconsistent categorical values in box_office_df
for col in box_office_df.select_dtypes(include=['object']).columns:
    box_office_df[col] = box_office_df[col].str.lower().str.strip()

# Displaying the first few rows to check the changes
print("First few rows of film_df after cleaning:")
print(film_df.head())

print("First few rows of box_office_df after cleaning:")
print(box_office_df.head())


First few rows of film_df after cleaning:
    movie_id  ordering  person_id category  ordering               title  \
0  tt0323808        10  nm0059247   editor         1             may day   
1  tt0323808        10  nm0059247   editor         2  cowboys for christ   
2  tt0323808        10  nm0059247   editor         3     the wicker tree   
3  tt0323808        10  nm0059247   editor         4     the wicker tree   
4  tt0323808        10  nm0059247   editor         5     плетеное дерево   

   is_original_title    primary_title   original_title  start_year  \
0                0.0  the wicker tree  the wicker tree        2011   
1                0.0  the wicker tree  the wicker tree        2011   
2                0.0  the wicker tree  the wicker tree        2011   
3                1.0  the wicker tree  the wicker tree        2011   
4                0.0  the wicker tree  the wicker tree        2011   

   runtime_minutes        genres  averagerating  numvotes primary_name  
0      

In [66]:

# Correct inconsistent categorical values in movie_budget_df
for col in movie_budget_df.select_dtypes(include=['object']).columns:
    movie_budget_df[col] = movie_budget_df[col].str.lower().str.strip()


print("First few rows of movie_budget_df after cleaning:")
print(movie_budget_df.head())


First few rows of movie_budget_df after cleaning:
   id  release_date                                        movie  \
0   1  dec 18, 2009                                       avatar   
1   2  may 20, 2011  pirates of the caribbean: on stranger tides   
2   3   jun 7, 2019                                 dark phoenix   
3   4   may 1, 2015                      avengers: age of ultron   
4   5  dec 15, 2017            star wars ep. viii: the last jedi   

  production_budget domestic_gross worldwide_gross  
0      $425,000,000   $760,507,625  $2,776,345,279  
1      $410,600,000   $241,063,875  $1,045,663,875  
2      $350,000,000    $42,762,350    $149,762,350  
3      $330,600,000   $459,005,868  $1,403,013,963  
4      $317,000,000   $620,181,382  $1,316,721,747  


* .select_dtypes(include=['object']): This selects all columns that contain object (string) data.

* .str.lower(): Converts all string values to lowercase to ensure uniformity.

* .str.strip(): Removes any leading or trailing spaces from string values.

This ensures that the text columns in `film_df`, `box_office_df`, and `movie_budget_df` are standardized and ready for analysis.

In [67]:
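# Remove dollar signs and commas, then convert to 64-bit integers
# (a raw string avoids an invalid-escape warning for '\$'; worldwide grosses
# above ~2.1 billion do not fit in a 32-bit integer)
money_columns = ['production_budget', 'domestic_gross', 'worldwide_gross']
for col in money_columns:
    movie_budget_df[col] = movie_budget_df[col].replace(r'[\$,]', '', regex=True).astype('int64')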

# Convert 'release_date' to MM/DD/YYYY format
movie_budget_df['release_date'] = pd.to_datetime(movie_budget_df['release_date'], format='mixed', errors='coerce').dt.strftime('%m/%d/%Y')


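* `pd.to_datetime(...)`: Converts the date strings (like "Dec 18, 2009") into proper datetime objects. `format='mixed'` (available from pandas 2.0) infers the format of each value individually, and `errors='coerce'` turns values that cannot be parsed into `NaT`.

* `.dt.strftime('%m/%d/%Y')`: Formats each date as a consistent MM/DD/YYYY string (e.g., "12/18/2009"). The result is text rather than a datetime, so sorting it alphabetically does not give chronological order; parse it again with `pd.to_datetime` before sorting or filtering by date.


Together with the numeric money columns, this gives `movie_budget_df` consistent formats for analysis and visualization.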
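
The difference between sorting formatted strings and sorting real datetimes can be seen in a short sketch. The three sample dates are taken from the output above; the explicit `format="%b %d, %Y"` is an assumption that fits these samples (with English month abbreviations) and is used here instead of `format='mixed'`.

```python
import pandas as pd

dates = pd.Series(["Dec 18, 2009", "May 20, 2011", "Jun 7, 2019"])

# Parse to datetime, then format as MM/DD/YYYY text
parsed = pd.to_datetime(dates, format="%b %d, %Y")
as_text = parsed.dt.strftime("%m/%d/%Y")

# Sorting the text orders by month first, not by year
sorted_as_text = as_text.sort_values().tolist()

# Sorting the datetimes first gives true chronological order
sorted_as_dates = parsed.sort_values().dt.strftime("%m/%d/%Y").tolist()

print("Text sort:    ", sorted_as_text)
print("Datetime sort:", sorted_as_dates)
```

For date-based analysis such as release-season comparisons, keeping a datetime column alongside (or instead of) the formatted string avoids this pitfall.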

## Handling Duplicates in the Data

### Checking for Duplicates

We will:
- Check each dataset for duplicate rows and print out the number of duplicates for each.


In [68]:
# Check for duplicates

print(f"🔍 Duplicates in Box Office Data: {box_office_df.duplicated().sum()}")
print(f"🔍 Duplicates in Movie Budgets Data: {movie_budget_df.duplicated().sum()}")
print(f"🔍 Duplicates in Merged Movie Data (film_df): {film_df.duplicated().sum()}")

🔍 Duplicates in Box Office Data: 0
🔍 Duplicates in Movie Budgets Data: 0
🔍 Duplicates in Merged Movie Data (film_df): 0


* .duplicated(): Identifies duplicate rows in the DataFrame. By default, it considers all columns to determine if a row is duplicated.

* .sum(): Counts the number of True values returned by .duplicated(), which corresponds to the number of duplicate rows.
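
Because `.duplicated()` compares whole rows by default, records that repeat only in a key column are not flagged. A minimal sketch of the `subset` parameter on a made-up frame (the `toy_box` rows are illustrative, not the full dataset):

```python
import pandas as pd

# Hypothetical toy data: the same title released by two studios in different years
toy_box = pd.DataFrame({
    "title": ["bluebeard", "bluebeard", "inception"],
    "studio": ["strand", "wgusa", "wb"],
    "year": [2010, 2017, 2010],
})

# Whole-row comparison: no two rows are identical
full_row_dupes = int(toy_box.duplicated().sum())

# Comparing on 'title' alone flags the second 'bluebeard'
title_dupes = int(toy_box.duplicated(subset=["title"]).sum())

# Comparing on title and year together treats them as distinct films
title_year_dupes = int(toy_box.duplicated(subset=["title", "year"]).sum())

print(full_row_dupes, title_dupes, title_year_dupes)
```

Checking a key such as title plus year is a reasonable guard before any later merge on title, where repeated keys would multiply rows.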


## Checking for Inconsistent Casing

We will:
- Strip extra spaces and convert the `studio` and `title` columns to lowercase to ensure consistency.


In [69]:
# Correct inconsistent casing and strip spaces in 'studio' and 'title' columns
box_office_df["studio"] = box_office_df["studio"].str.strip().str.lower()
box_office_df["title"] = box_office_df["title"].str.strip().str.lower()

# Displaying the first few rows to confirm the changes
print("First few rows of box_office_df after fixing casing:")
print(box_office_df[['studio', 'title']].head())


First few rows of box_office_df after fixing casing:
  studio                                        title
0     bv                                  toy story 3
1     bv                   alice in wonderland (2010)
2     wb  harry potter and the deathly hallows part 1
3     wb                                    inception
4   p/dw                          shrek forever after


* .str.strip(): Removes any leading or trailing spaces from the studio and title columns to avoid inconsistencies caused by extra spaces.

* .str.lower(): Converts the entire text in the studio and title columns to lowercase, ensuring that casing does not cause issues when comparing or analyzing data.

This ensures that the values in `studio` and `title` are consistently formatted.

## Counting Occurrences of Each Title

We will:
- Count the occurrences of each unique title in the `title` column and display the 10 most frequent titles.

In [70]:
# Counting occurrences of each title
box_office_df["title"].value_counts().head(10)

title
bluebeard                                                   2
the chronicles of narnia: the voyage of the dawn treader    1
the king's speech                                           1
tron legacy                                                 1
the karate kid                                              1
prince of persia: the sands of time                         1
black swan                                                  1
megamind                                                    1
robin hood                                                  1
the last airbender                                          1
Name: count, dtype: int64

* .value_counts(): This function counts the occurrences of each unique value in the title column of box_office_df.

* .head(10): Displays the top 10 titles with the highest frequency.

## Searching for the Movie "Bluebeard" in Box Office Data

We will:
- Search for the movie "Bluebeard" in the `title` column of the `box_office_df` DataFrame and display the results.


In [71]:
# Searching for the Movie "Bluebeard" in `box_office_df`
box_office_df[box_office_df["title"] == "bluebeard"]

          title  studio  domestic_gross  foreign_gross  year
317   bluebeard  strand         33500.0         5200.0  2010
3045  bluebeard   wgusa         43100.0     19400000.0  2017


* `box_office_df[box_office_df["title"] == "bluebeard"]`: Filters the rows in `box_office_df` where the `title` column is exactly equal to "bluebeard". Because the column was converted to lowercase earlier, this matches the title regardless of its original casing. The two matches are different releases (2010 and 2017), not duplicate records.

In [72]:
# Differentiating the titles Bluebeard by adding the release year
box_office_df["title"] = box_office_df["title"] + " (" + box_office_df["year"].astype(str) + ")"
box_office_df["title"].value_counts()[box_office_df["title"].value_counts() > 1]

Series([], Name: count, dtype: int64)
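
One caveat with overwriting `title` is that re-running the cell appends the year a second time. A possible variation, sketched here on made-up data, writes the combined value to a new column instead; the column name `title_year` is hypothetical.

```python
import pandas as pd

# Hypothetical toy data with one repeated title
toy = pd.DataFrame({
    "title": ["bluebeard", "bluebeard", "inception"],
    "year": [2010, 2017, 2010],
})

# Build a separate key so the original title stays available
toy["title_year"] = toy["title"] + " (" + toy["year"].astype(str) + ")"

# Count each key once and keep only those that still repeat
key_counts = toy["title_year"].value_counts()
repeated = key_counts[key_counts > 1]

print(toy["title_year"].tolist())
print("Still repeated:", repeated.index.tolist())
```

Keeping `title` unchanged also preserves it for matching against titles in the other datasets, which do not carry the year suffix.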

## Displaying Categorical Features Summary

We will:
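- Use `describe(include=['O'])` to get a summary of categorical features (object columns) for `box_office_df`, `movie_budget_df`, and `film_df`. A notebook cell renders only its last expression, so only the `film_df` summary appears in the output below.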


In [73]:
#  Display categorical features summary
box_office_df.describe(include=['O'])
movie_budget_df.describe(include=['O'])
film_df.describe(include=['O'])

         movie_id  person_id category       title primary_title original_title   genres primary_name
count     2422866    2422866  2422866     2422866       2422866        2422866  2422866      2422866
unique      69502     343794       12      193482         65862          66952      915       333736
top     tt2488496  nm0089658    actor  robin hood        frozen         frozen    drama   jason blum
freq          610       1321   580858         320           810            770   333400         1321


* describe(include=['O']): This method will provide summary statistics for all columns that are of object (categorical) type in the DataFrame. It includes the count, unique values, top (most frequent value), and frequency of the top value.

## Performing Data Integrity Checks

We will:
- Check for missing values in `box_office_df`, `film_df`, and `movie_budget_df` using `.isnull().sum()`.
- Check for duplicate rows using `.duplicated().sum()`.
- Check the data types of each column in all three DataFrames using `.dtypes`.


In [74]:
# Check missing values in box_office_df and film_df
print("Missing Values in Box Office Data:\n", box_office_df.isnull().sum())
print("\nMissing Values in Merged Movie Data (film_df):\n", film_df.isnull().sum())
print("\nMissing Values in Movie budget Data (movie_budget_df):\n", movie_budget_df.isnull().sum())

# Check duplicate rows in box_office_df and film_df
print("\nDuplicate Rows in Box Office Data:", box_office_df.duplicated().sum())
print("\nDuplicate Rows in movie budget Data:", movie_budget_df.duplicated().sum())
print("\nDuplicate Rows in Merged Movie Data (film_df):", film_df.duplicated().sum())

# Check data types in box_office_df and film_df
print("\nData Types in Box Office Data:\n", box_office_df.dtypes)
print("\nData Types in Merged Movie Data (film_df):\n", film_df.dtypes)
print("\nData Types in movie budget Data:\n", movie_budget_df.dtypes)

Missing Values in Box Office Data:
 title             0
studio            0
domestic_gross    0
foreign_gross     0
year              0
dtype: int64

Missing Values in Merged Movie Data (film_df):
 movie_id             0
ordering             0
person_id            0
category             0
ordering             0
title                0
is_original_title    0
primary_title        0
original_title       0
start_year           0
runtime_minutes      0
genres               0
averagerating        0
numvotes             0
primary_name         0
dtype: int64

Missing Values in Movie budget Data (movie_budget_df):
 id                   0
release_date         0
movie                0
production_budget    0
domestic_gross       0
worldwide_gross      0
dtype: int64

Duplicate Rows in Box Office Data: 0

Duplicate Rows in movie budget Data: 0

Duplicate Rows in Merged Movie Data (film_df): 0

Data Types in Box Office Data:
 title              object
studio             object
domestic_gross    float

* Missing values: `.isnull().sum()` returns the count of missing (NaN) values for each column.

* Duplicate rows: `.duplicated().sum()` returns the number of duplicate rows in the DataFrame.

* Data types: `.dtypes` returns the data type of each column, which helps confirm that every column has the expected type (e.g., `object`, `int64`, `float64`).
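
The three checks could be bundled into one helper so that every DataFrame is inspected the same way. The function `integrity_report` and the `toy` frame below are hypothetical; this is a sketch of the idea rather than code from the notebook.

```python
import pandas as pd


def integrity_report(df: pd.DataFrame) -> dict:
    """Summarize row count, missing values, duplicate rows, and dtypes."""
    return {
        "rows": len(df),
        "missing": int(df.isnull().sum().sum()),
        "duplicates": int(df.duplicated().sum()),
        "dtypes": df.dtypes.astype(str).to_dict(),
    }


# Hypothetical toy data: one repeated row and one missing value
toy = pd.DataFrame({
    "title": ["a", "b", "b", "c"],
    "gross": [1.0, 2.0, 2.0, None],
})

report = integrity_report(toy)
print(report)
```

Returning a dictionary instead of printing makes it possible to assert on the result, for example failing early if `missing` or `duplicates` is not zero.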

## Saving Cleaned Datasets

We will:
- Save each cleaned dataset to its own CSV file in the `zippedData` folder, skipping any file that already exists.


In [75]:
import os
from IPython.display import FileLink

# Ensure the target folder exists
os.makedirs("zippedData", exist_ok=True)

# Define file paths
box_office_path = "zippedData/cleaned_dataset_box_office.csv"
film_df_path = "zippedData/cleaned_dataset_film_df.csv"
movie_budget_path = "zippedData/cleaned_dataset_movie_budget.csv"

# Save each DataFrame, skipping files that already exist
datasets = [
    (box_office_df, box_office_path),
    (film_df, film_df_path),
    (movie_budget_df, movie_budget_path),
]
for df, path in datasets:
    if not os.path.exists(path):
        df.to_csv(path, index=False)
        print(f"✅ Saved: {path}")
    else:
        print(f"⚠️ Skipped (already exists): {path}")




⚠️ Skipped (already exists): zippedData/cleaned_dataset_box_office.csv
⚠️ Skipped (already exists): zippedData/cleaned_dataset_film_df.csv
✅ Saved: zippedData/cleaned_dataset_movie_budget.csv


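* `index=False`: Prevents the DataFrame index from being saved as an additional column in the CSV file.

* Unique filenames: Each dataset gets its own filename (for example, `cleaned_dataset_box_office.csv`), so the datasets are saved without overwriting each other.

* Existing files are skipped: Because of the `os.path.exists` check, a file already on disk is not overwritten. In the output above, two of the three files were skipped for this reason, so they may not reflect the latest cleaning steps; delete them first to refresh them.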
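
If refreshing stale files is a concern, the skip logic could take an explicit flag. The helper `save_csv` and its `overwrite` parameter are hypothetical names for this sketch, which writes to a temporary folder so it does not touch `zippedData`.

```python
import os
import tempfile

import pandas as pd


def save_csv(df: pd.DataFrame, path: str, overwrite: bool = False) -> bool:
    """Write df to path; skip existing files unless overwrite is True.

    Returns True when the file was written, False when it was skipped.
    """
    if os.path.exists(path) and not overwrite:
        return False
    df.to_csv(path, index=False)
    return True


# Hypothetical toy data written to a throwaway directory
toy = pd.DataFrame({"title": ["a"], "gross": [1.0]})
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "cleaned_toy.csv")
    first = save_csv(toy, path)                   # file does not exist yet
    second = save_csv(toy, path)                  # skipped, file exists
    forced = save_csv(toy, path, overwrite=True)  # rewritten on request
    reloaded = pd.read_csv(path)

print(first, second, forced)
print(reloaded)
```

Reading the file back, as in the last step, is a quick way to confirm that `index=False` produced only the expected columns.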

## Generate Download Link for Cleaned Dataset

We will:
- Generate a download link for each of the three cleaned CSV files saved above.


In [76]:
# Generate download links
print("\n📎 Download links:")
display(FileLink(box_office_path))
display(FileLink(film_df_path))
display(FileLink(movie_budget_path))


📎 Download links:


* `FileLink(path)`: Creates a clickable link to a local file, which `display` renders in the notebook output. The links only appear in a live notebook session, which is why no link text is shown here.