## 🎬  **Project Title** : 

**Box Office Gold**: **Data-Driven Insights for a Profitable Movie Studio Launch**

### **Business Understanding**

#### 👥 Stakeholder
The primary stakeholder is the executive team of the company's new movie studio. They need insights into the film industry to make confident decisions on what type of movies to produce.

🌍 **Domain**:

**Entertainment & Media Analytics (specifically, Film Industry/Box Office Performance)**

📘 **Introduction**:

With the growing trend of major companies venturing into original film production, my organization is planning to launch its own movie studio. However, without prior experience in the film industry, there’s uncertainty about what types of movies resonate with audiences and drive box office success. This project aims to analyze trends in box office data to uncover what genres, budgets, and other film attributes contribute most to commercial success — providing strategic guidance for profitable content creation.

#### 🎯 **Business Objectives** :

1. **To Identify High-Performing Film Genres**:

* Analyze box office data to determine which movie genres consistently generate the highest revenue and audience engagement.

2. **To Examine the Relationship Between Budget and Profitability**:

* Investigate how production budgets influence box office success and identify the budget range that maximizes return on investment (ROI).

3. **To Assess the Impact of Key Film Attributes**:

* Explore how factors such as runtime, cast, release date (season), and film ratings (e.g., PG-13, R) affect a movie’s performance.

4. **To Benchmark Against Top Studios** :

* Analyze which production studios are leading in terms of commercial success and identify patterns in their film portfolios.

5. **To Provide Actionable Recommendations**:

* Based on the insights, suggest the optimal type of film (genre, budget, release timing, etc.) that the company should produce for a successful studio launch.

#### 📊 **Project Plan**: **Box Office Gold – Data-Driven Insights for a Profitable Movie Studio Launch**

🔍 **1. Problem Understanding & Goal Definition**

* Review business problem: Identify film types that succeed at the box office.

* Define clear goals: Provide recommendations on genre, budget, and release strategy.


📦 **2. Data Collection**

**Source box office datasets from the following files**:

* im.db.zip (`movie_basics` & `movie_ratings` tables)

* bom.movie_gross.csv.gz

**Collect relevant data fields**:

Genre, budget, revenue, runtime, release date, production company, director, cast, rating, etc.


🧹 **3. Data Cleaning & Preprocessing**

* Handle missing values and inconsistencies.
* Standardize formats (dates, currencies, genres).
* Convert categorical variables where necessary.
* Remove duplicates or irrelevant records (e.g., short films, non-theatrical releases).
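The cleaning steps listed above can be sketched on a small toy table. This is only an illustration under stated assumptions: the rows are made up, the `gross` column and the 40-minute cutoff for short films (`MIN_FEATURE_RUNTIME`) are hypothetical and are not taken from the project data.

```python
import pandas as pd

# Toy records standing in for the real data (all values are made up).
raw = pd.DataFrame({
    "title": ["Film A", "Film A", "Film B", "Film C", "Film D"],
    "runtime_minutes": [120.0, 120.0, None, 15.0, 95.0],
    "genres": ["Action,Drama", "Action,Drama", "Drama", "Comedy", None],
    "gross": ["1,200,000", "1,200,000", "850000", "40000", "300000"],
})

MIN_FEATURE_RUNTIME = 40  # assumed cutoff for a "short film"; not from the source

cleaned = (
    raw.drop_duplicates()                  # remove exact duplicate rows
       .dropna(subset=["genres"])          # drop rows that have no genre
       .assign(
           # standardize currency-like text into a numeric column
           gross=lambda d: pd.to_numeric(d["gross"].str.replace(",", "", regex=False)),
           # turn the comma-separated genre string into a list
           genre_list=lambda d: d["genres"].str.split(","),
       )
)

# Drop short films; rows with a missing runtime are kept rather than guessed at.
is_feature = cleaned["runtime_minutes"].isna() | (cleaned["runtime_minutes"] >= MIN_FEATURE_RUNTIME)
cleaned = cleaned[is_feature].reset_index(drop=True)
```

Whether to keep or drop rows with a missing runtime is a judgment call; the sketch keeps them so that the decision stays visible instead of being made silently.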

📊 **4. Exploratory Data Analysis (EDA)**

* Univariate & Bivariate Analysis (e.g., budget vs revenue, genre vs revenue).
* Correlation heatmaps, box plots, histograms.
* Identify outliers and common patterns in successful films.
* Segment data by genre, production studio, or release year.
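Segmenting by genre needs some care, because a film can list several genres in one comma-separated string. A minimal sketch of one way to handle this, assuming columns named `genres` and `averagerating` and using made-up rows:

```python
import pandas as pd

# Toy rows; titles and ratings are invented for illustration.
films = pd.DataFrame({
    "primary_title": ["Film A", "Film B", "Film C"],
    "genres": ["Action,Drama", "Drama", "Comedy,Drama"],
    "averagerating": [7.0, 6.0, 8.0],
})

# One row per (film, genre) pair, so multi-genre films count toward each genre.
by_genre = (
    films.assign(genre=films["genres"].str.split(","))
         .explode("genre")
         .groupby("genre")["averagerating"]
         .agg(["count", "mean"])
         .sort_values("mean", ascending=False)
)
```

Counting a multi-genre film once per genre inflates the total number of rows, so the `count` column is best read as "films tagged with this genre" rather than as a share of all films.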

🧠 **5. Insights & Recommendations**

* Summarize which genres are top performers.
* Recommend ideal budget ranges.
* Identify optimal release months/seasons.
* Suggest attributes linked to successful movies (e.g., popular runtimes, ratings).

📑 **6. Reporting & Visualization**

* Build clear and compelling visualizations (using Tableau, Power BI, or Python’s Seaborn/Matplotlib).
* Draft a business-focused report or slide deck.

**Include:**

* Key findings
* Strategic suggestions
* Visual evidence

📢 **7. Presentation to Stakeholders**

* Communicate insights in non-technical language.
* Show data-driven rationale for proposed movie types.
* Allow room for stakeholder feedback and Q&A.









### **Overview/Background**

As the entertainment industry shifts toward original content production, many large companies are investing in their own movie studios to capture audience attention and drive revenue. The company seeks to follow this trend but lacks experience in film production. To ensure a successful studio launch, this project aims to analyze historical box office data to uncover key trends in genre performance, budget impact, and other critical success factors. The goal is to provide data-driven insights that will guide strategic decisions on what types of films to produce for maximum box office success.


### **Challenges**

One of the main challenges in this project is acquiring comprehensive and reliable box office data that includes essential attributes such as genre, budget, revenue, and release details. Additionally, the film industry is influenced by unpredictable factors like audience trends, star power, and marketing, which are difficult to quantify. Ensuring data quality, handling missing or inconsistent entries, and drawing actionable insights that align with business goals also present key hurdles in the analysis process.


### **Proposed Solution**

To address the business challenge, this project proposes a data-driven approach that involves collecting and analyzing historical box office data to identify patterns in successful films. By examining factors such as genre, budget, revenue, release timing, and other key attributes, the project will uncover trends that correlate with box office success. The insights will then be translated into practical recommendations to guide the company in producing films with higher chances of commercial success.

### **Conclusion**:

Launching a successful movie studio requires more than creativity—it demands strategic, data-informed decisions. This project leverages box office analytics to uncover what drives film profitability, helping the company make confident choices about genre, budget, and release strategy. With clear insights and recommendations, the company will be well-positioned to enter the competitive film industry with a strong foundation for success.

### ❗ **Problem Statement**

As the company plans to venture into original film production, it faces significant uncertainty due to a lack of industry experience. 🎬 Making informed decisions about what types of films to produce is challenging without a clear understanding of market trends and performance drivers. Additionally, obtaining accurate and complete box office data is difficult, and the film industry itself is influenced by various unpredictable factors such as changing audience preferences, marketing impact, and star power. These challenges make it hard to identify what contributes to a movie’s commercial success and pose a risk to the company’s new venture.

## 📊 **Data Understanding**

The data for this project comes from multiple sources:

1. **im.db.zip**:

**A zipped SQLite database that contains various tables. The two most relevant tables are**:

* **movie_basics**: Contains key information about each movie: its ID, primary and original titles, start year, runtime in minutes, and genres.
* **movie_ratings**: Contains each movie's average rating and number of votes, providing insight into movie reception.

2. **bom.movie_gross.csv.gz**:

* A compressed CSV file containing box office gross data. This file is essential for analyzing revenue trends and overall financial performance.

Combined, these sources offer a comprehensive view of film attributes and performance metrics, which are crucial for understanding the factors behind movie success.
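Before querying, it can help to confirm which tables a SQLite database actually holds. The sketch below reads the built-in `sqlite_master` catalog; the helper name `list_tables` is hypothetical, and the demonstration runs against an in-memory database with two stand-in tables rather than the real `im.db`.

```python
import sqlite3

def list_tables(conn):
    """Return the names of all tables in the connected SQLite database."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name;"
    ).fetchall()
    return [name for (name,) in rows]

# Demonstration on an in-memory database with two stand-in tables.
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE movie_basics (movie_id TEXT, primary_title TEXT)")
demo.execute("CREATE TABLE movie_ratings (movie_id TEXT, averagerating REAL)")
tables = list_tables(demo)
```

Calling `list_tables` on a connection to the extracted database should work the same way.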



In the cell below I **import the libraries** needed to achieve the project goals.

In [2]:
import sqlite3
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns


In the cells below I load and organize the data for analysis. The data is held in two sources, `im.db.zip` and `bom.movie_gross.csv.gz`. From `im.db` (after extracting) I query only the two tables this project focuses on, `movie_basics` and `movie_ratings`, which are later combined under the variable **film_df**. The compressed file `bom.movie_gross.csv.gz` is loaded separately further down.
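The extraction step mentioned above could also be done from Python with the standard `zipfile` module. This is a sketch only: the helper name `extract_archive` is hypothetical, and the demonstration uses a throwaway archive with a placeholder `im.db` rather than the project's real file.

```python
import tempfile
import zipfile
from pathlib import Path

def extract_archive(zip_path, dest_dir):
    """Extract every member of zip_path into dest_dir and return their paths."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest)
        return [dest / name for name in archive.namelist()]

# Demonstration with a temporary archive holding a placeholder file.
with tempfile.TemporaryDirectory() as tmp:
    demo_zip = Path(tmp) / "demo.zip"
    with zipfile.ZipFile(demo_zip, "w") as zf:
        zf.writestr("im.db", b"placeholder")
    extracted = extract_archive(demo_zip, Path(tmp) / "out")
    extracted_names = [p.name for p in extracted]
    extracted_ok = all(p.exists() for p in extracted)
```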

In [None]:
# create connection to im.db

conn = sqlite3.connect('zippedData/im.db/im.db')

# query im.db to get all rows from movie_basics

query = """
SELECT *
 FROM movie_basics;
"""

# output query using pandas

movie_basics_df = pd.read_sql(query, conn)
movie_basics_df.head()

|   | movie_id | primary_title | original_title | start_year | runtime_minutes | genres |
|---|---|---|---|---|---|---|
| 0 | tt0063540 | Sunghursh | Sunghursh | 2013 | 175.0 | Action,Crime,Drama |
| 1 | tt0066787 | One Day Before the Rainy Season | Ashad Ka Ek Din | 2019 | 114.0 | Biography,Drama |
| 2 | tt0069049 | The Other Side of the Wind | The Other Side of the Wind | 2018 | 122.0 | Drama |
| 3 | tt0069204 | Sabse Bada Sukh | Sabse Bada Sukh | 2018 | NaN | Comedy,Drama |
| 4 | tt0100275 | The Wandering Soap Opera | La Telenovela Errante | 2017 | 80.0 | Comedy,Drama,Fantasy |


In the cell below I check the shape of `movie_basics_df`.

In [None]:
# check for shape

movie_basics_df.shape

(146144, 6)

The cell above shows that `movie_basics_df` contains **146144** entries (rows) and **6** attributes.

In the cell below I query `im.db` to get the `movie_ratings` table.

In [None]:
# query movie_ratings table

query = """
SELECT *
 FROM movie_ratings;
"""

movie_rating_df = pd.read_sql(query, conn)
movie_rating_df.head()

|   | movie_id | averagerating | numvotes |
|---|---|---|---|
| 0 | tt10356526 | 8.3 | 31 |
| 1 | tt10384606 | 8.9 | 559 |
| 2 | tt1042974 | 6.4 | 20 |
| 3 | tt1043726 | 4.2 | 50352 |
| 4 | tt1060240 | 6.5 | 21 |


In [None]:
# check for shape

movie_rating_df.shape

(73856, 3)

The cell above shows the shape of `movie_rating_df`: **73856** entries (rows) and **3** features, a **difference of 72,288** rows from `movie_basics_df`.

The two tables clearly **differ in shape**: `movie_basics_df` has a shape of **(146144, 6)** and `movie_rating_df` a shape of **(73856, 3)**. Since both use the same **key (`movie_id`)**, the best option is to **merge the tables**. I use an **inner join** because it keeps only the records that appear in both tables. I avoided a **left join** because it keeps all `movie_basics` rows and adds ratings only where they exist, **creating nulls for the missing ratings**.
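The difference between the two join types can be seen on a toy example. The frames below are made up for illustration; only the column names `movie_id`, `primary_title`, and `averagerating` mirror the real tables.

```python
import pandas as pd

# Three movies, of which only two have a rating.
basics = pd.DataFrame({
    "movie_id": ["tt1", "tt2", "tt3"],
    "primary_title": ["Film A", "Film B", "Film C"],
})
ratings = pd.DataFrame({
    "movie_id": ["tt1", "tt3"],
    "averagerating": [7.0, 6.5],
})

# Inner join: only movies present in both tables survive.
inner = pd.merge(basics, ratings, on="movie_id", how="inner")

# Left join: every row of basics is kept; indicator=True adds a _merge column
# that records where each row came from.
left = pd.merge(basics, ratings, on="movie_id", how="left", indicator=True)

# Rows found only in basics carry NaN in the rating column.
unrated = left[left["_merge"] == "left_only"]
```

Here `inner` has two rows and `left` has three, with the unrated movie (`tt2`) showing a missing `averagerating`.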

In the cell below I merge the tables and store the result in the variable `film_df`.

In [12]:
# merge movie_basics_df & movie_rating_df

film_df = pd.merge(movie_basics_df, movie_rating_df, on='movie_id', how='inner')
film_df.head()

|   | movie_id | primary_title | original_title | start_year | runtime_minutes | genres | averagerating | numvotes |
|---|---|---|---|---|---|---|---|---|
| 0 | tt0063540 | Sunghursh | Sunghursh | 2013 | 175.0 | Action,Crime,Drama | 7.0 | 77 |
| 1 | tt0066787 | One Day Before the Rainy Season | Ashad Ka Ek Din | 2019 | 114.0 | Biography,Drama | 7.2 | 43 |
| 2 | tt0069049 | The Other Side of the Wind | The Other Side of the Wind | 2018 | 122.0 | Drama | 6.9 | 4517 |
| 3 | tt0069204 | Sabse Bada Sukh | Sabse Bada Sukh | 2018 | NaN | Comedy,Drama | 6.1 | 13 |
| 4 | tt0100275 | The Wandering Soap Opera | La Telenovela Errante | 2017 | 80.0 | Comedy,Drama,Fantasy | 6.5 | 119 |


In [13]:
# check for shape after merging

film_df.shape

(73856, 8)

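The cell above shows the shape of `film_df` after merging: **73856** entries and **8** features. There are 8 features rather than 9 (6 + 3) because the shared `movie_id` column appears only once, and the **number of entries is determined by the inner join**, which keeps only the movies present in both tables.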
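The expected shape of a merge can also be checked mechanically. The sketch below uses made-up frames and assumes `movie_id` is unique in each table; that assumption should be verified on the real data, which is what the `is_unique` check and the `validate` argument of `pd.merge` are for.

```python
import pandas as pd

# Toy tables with three columns each, sharing the movie_id key.
basics = pd.DataFrame({
    "movie_id": ["tt1", "tt2", "tt3"],
    "primary_title": ["Film A", "Film B", "Film C"],
    "start_year": [2013, 2018, 2019],
})
ratings = pd.DataFrame({
    "movie_id": ["tt1", "tt3"],
    "averagerating": [7.0, 6.5],
    "numvotes": [77, 13],
})

# The key must be unique on both sides for a one-to-one merge.
keys_unique = basics["movie_id"].is_unique and ratings["movie_id"].is_unique

# validate="one_to_one" makes pandas raise an error if keys are duplicated.
merged = pd.merge(basics, ratings, on="movie_id", how="inner", validate="one_to_one")

# The shared key column is counted once, hence the "- 1".
expected_cols = basics.shape[1] + ratings.shape[1] - 1
```

With these toy tables, `merged.shape` is `(2, 5)`: two matching movies and 3 + 3 - 1 columns.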

In cell below I load data from `bom.movie_gross.csv.gz` as I need it for the analysis.

In [None]:
# load data from bom.movie_gross.csv.gz
# box_office_df: how much each movie earned

box_office_df = pd.read_csv('zippedData/bom.movie_gross.csv.gz')
box_office_df.head()

|   | title | studio | domestic_gross | foreign_gross | year |
|---|---|---|---|---|---|
| 0 | Toy Story 3 | BV | 415000000.0 | 652000000 | 2010 |
| 1 | Alice in Wonderland (2010) | BV | 334200000.0 | 691300000 | 2010 |
| 2 | Harry Potter and the Deathly Hallows Part 1 | WB | 296000000.0 | 664300000 | 2010 |
| 3 | Inception | WB | 292600000.0 | 535700000 | 2010 |
| 4 | Shrek Forever After | P/DW | 238700000.0 | 513900000 | 2010 |


In [None]:
# check box_office_df shape
box_office_df.shape

(3387, 5)

The cell above shows that `box_office_df` has **3387 entries** and **5 features**. `box_office_df` shows **how much money a movie made**, while `film_df` holds **movie/film data** such as rating and runtime. **The best option is to keep the two datasets separate and work with them in parallel.**
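In the preview of `box_office_df`, `domestic_gross` prints with a decimal point while `foreign_gross` does not. That may mean `foreign_gross` is stored as text, which is worth confirming with `box_office_df.dtypes`. If so, a conversion along the following lines might be needed before any revenue arithmetic. This is a sketch on made-up rows; the comma-formatted value and the derived column names `foreign_gross_num` and `total_gross` are assumptions for illustration.

```python
import pandas as pd

# Toy rows; foreign_gross is deliberately stored as text, one value with commas.
box_office = pd.DataFrame({
    "title": ["Film A", "Film B", "Film C"],
    "domestic_gross": [415000000.0, 334200000.0, None],
    "foreign_gross": ["652000000", "691,300,000", None],
})

# Strip thousands separators, then convert; unparseable values become NaN.
box_office["foreign_gross_num"] = pd.to_numeric(
    box_office["foreign_gross"].str.replace(",", "", regex=False),
    errors="coerce",
)

# min_count=1 keeps the total as NaN when both components are missing,
# instead of reporting a misleading 0.
box_office["total_gross"] = box_office[["domestic_gross", "foreign_gross_num"]].sum(
    axis=1, min_count=1
)
```

A row with only one of the two figures still gets a total here, so it may be worth flagging such rows separately if worldwide totals are compared across films.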

In [18]:
box_office_df.columns.to_list()

['title', 'studio', 'domestic_gross', 'foreign_gross', 'year']

In [19]:
film_df.columns.tolist()

['movie_id',
 'primary_title',
 'original_title',
 'start_year',
 'runtime_minutes',
 'genres',
 'averagerating',
 'numvotes']