# 📊 Exploratory Data Analysis (EDA) – TMDb Yearly Movie Data  
**Author:** Joseph Tulani Aytch  
**Last Updated:** Aug 2025  

---

## 🎯 Objective
Combine yearly TMDb movie datasets, clean them for analysis, and answer key exploratory questions in four steps:

1. **Load** `csv.gz` files for each year  
2. **Merge** into a single DataFrame  
3. **Filter** for movies with valid financial data *(budget > 0 or revenue > 0)*  
4. **Summarize**:
   - Movies per certification category *(G, PG, PG‑13, R)*
   - Average revenue per certification
   - Average budget per certification

This structured EDA builds a baseline understanding before deeper analysis or visualization.

---

## 📌 How to Use This Notebook
- **View Only:** Browse to see combined dataset stats and summaries – no execution required.
- **Run Yourself:**
  1. Ensure the yearly `final_tmdb_data_YYYY.csv.gz` files exist in the repository's `Data/` folder.
  2. Install dependencies:  
     ```bash
     pip install -r requirements.txt
     ```
  3. Run all cells.

---

## 🛠 Skills Demonstrated
- Multi‑file ingestion and concatenation
- Data filtering with conditional logic
- Grouped summaries and aggregation in `pandas`
- Portfolio‑friendly documentation

In [1]:
import pandas as pd
import os
import glob

pd.set_option('display.float_format', '{:,.0f}'.format)

# === 1. AUTO‑LOAD ALL YEARLY TMDb FILES FROM Data/ FOLDER ===
data_dir = "Data"  # relative to the notebook's working directory
file_pattern = os.path.join(data_dir, "final_tmdb_data_*.csv.gz")

yearly_files = sorted(glob.glob(file_pattern))

if not yearly_files:
    raise FileNotFoundError(
        f"No TMDb yearly files found matching pattern: {file_pattern}\n"
        "Make sure cleaned yearly files are saved in the Data/ folder."
    )

# Load and merge
df_list = []
for file in yearly_files:
    print(f"📂 Loading {os.path.basename(file)}")
    df_list.append(pd.read_csv(file))

tmdb_results_combined = pd.concat(df_list, ignore_index=True)
print(f"✅ Combined {len(yearly_files)} files into one DataFrame with {len(tmdb_results_combined):,} rows.")

# === 2. SAVE COMBINED DATASET FOR REPRODUCIBILITY ===
combined_path = os.path.join(data_dir, "tmdb_results_combined.csv.gz")
tmdb_results_combined.to_csv(combined_path, compression="gzip", index=False)
print(f"💾 Combined dataset saved to: {combined_path}")

# Quick info check
tmdb_results_combined.info()


📂 Loading final_tmdb_data_2000.csv.gz
📂 Loading final_tmdb_data_2001.csv.gz
✅ Combined 2 files into one DataFrame with 2,507 rows.
💾 Combined dataset saved to: Data\tmdb_results_combined.csv.gz
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2507 entries, 0 to 2506
Data columns (total 26 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   imdb_id                2507 non-null   object 
 1   adult                  2505 non-null   float64
 2   backdrop_path          1334 non-null   object 
 3   belongs_to_collection  201 non-null    object 
 4   budget                 2505 non-null   float64
 5   genres                 2505 non-null   object 
 6   homepage               169 non-null    object 
 7   id                     2505 non-null   float64
 8   original_language      2505 non-null   object 
 9   original_title         2505 non-null   object 
 10  overview               2455 non-null   object 
 11  popularity        
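
The loading loop above can also record which file each row came from, which makes it easy to confirm that every year contributed rows after the merge. The following is a minimal sketch, not the notebook's own code: the `source_year` column and the two tiny stand-in files written to a temporary folder are assumptions made so the example runs without the real `Data/` directory.

```python
import glob
import os
import re
import tempfile

import pandas as pd

# Write two tiny stand-in files so the sketch runs without the real Data/ folder.
tmp_dir = tempfile.mkdtemp()
for year, titles in {2000: ["A", "B"], 2001: ["C"]}.items():
    pd.DataFrame({"title": titles}).to_csv(
        os.path.join(tmp_dir, f"final_tmdb_data_{year}.csv.gz"),
        compression="gzip",
        index=False,
    )

frames = []
pattern = os.path.join(tmp_dir, "final_tmdb_data_*.csv.gz")
for path in sorted(glob.glob(pattern)):
    # Extract the four-digit year from the filename.
    match = re.search(r"final_tmdb_data_(\d{4})\.csv\.gz$", os.path.basename(path))
    frame = pd.read_csv(path)  # gzip compression is inferred from the extension
    frame["source_year"] = int(match.group(1))  # hypothetical helper column
    frames.append(frame)

combined = pd.concat(frames, ignore_index=True)

# Row count contributed by each yearly file.
rows_per_year = combined["source_year"].value_counts().sort_index()
print(rows_per_year)
```

A per-year row count like this is a quick sanity check before saving the combined file.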

<details>
<summary><strong>Q1: How many movies have at least some valid financial information?</strong></summary>

We define “valid” as having **budget > 0** or **revenue > 0**.  
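Movies with both budget and revenue equal to 0 fail this filter and are left out of `financial_df`, the subset intended for later visualizations. The certification summaries in Q2 to Q4 are computed on the full combined dataset, zeros included.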

</details>
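
Since the question asks for a count, the definition above can be sketched on a toy frame that reports the number and share of rows passing the filter. The rows and the names `movies`, `has_financials`, `n_valid`, and `share_valid` are made up for illustration; only the `budget > 0 or revenue > 0` rule comes from this notebook.

```python
import pandas as pd

# Toy rows standing in for the combined TMDb data (values are invented).
movies = pd.DataFrame(
    {
        "title": ["A", "B", "C", "D"],
        "budget": [100.0, 0.0, 0.0, None],
        "revenue": [0.0, 50.0, 0.0, None],
    }
)

# True where either figure is positive. NaN > 0 evaluates to False, so rows
# with missing budget and revenue are excluded as well.
has_financials = (movies["budget"] > 0) | (movies["revenue"] > 0)

n_valid = int(has_financials.sum())   # number of True values
share_valid = has_financials.mean()   # fraction of True values
print(f"{n_valid} of {len(movies)} movies ({share_valid:.0%}) have financial data")
```

On the real data, the same two lines applied to `financial_filter` would give the answer to Q1 directly.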


In [13]:
# Boolean mask: True where budget > 0 OR revenue > 0
financial_filter = (
    (tmdb_results_combined['budget'] > 0) | 
    (tmdb_results_combined['revenue'] > 0)
)

# Apply filter to get only movies with some financial information
financial_df = tmdb_results_combined.loc[financial_filter]

# Preview a few relevant columns to verify the filter worked
financial_df[['title', 'budget', 'revenue']].head()


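|    | title                       | budget     | revenue    |
|---:|:----------------------------|-----------:|-----------:|
|  1 | The Fantasticks             | 10000000.0 | 0.0        |
|  4 | In the Mood for Love        | 150000.0   | 12854953.0 |
|  6 | Heavy Metal 2000            | 15000000.0 | 0.0        |
| 10 | Songs from the Second Floor | 0.0        | 80334.0    |
| 11 | Vulgar                      | 120000.0   | 14904.0    |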


<details>
<summary><strong>Q2: How many movies are there in each certification category (G / PG / PG‑13 / R)?</strong></summary>

Understanding the distribution of certifications helps identify gaps or biases in the dataset,  
and may guide which segments we compare in deeper analysis.

</details>


In [15]:
# Count movies per certification across all rows of tmdb_results_combined
# (value_counts drops rows with a missing certification by default)
cert_counts = tmdb_results_combined['certification'].value_counts()
cert_counts


certification
R          456
PG-13      183
NR          69
PG          63
G           24
NC-17        6
Unrated      1
-            1
Name: count, dtype: int64
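
The counts above include several labels that may describe the same thing (`NR`, `Unrated`, `-`). One possible cleanup, sketched here on a toy Series, is to strip whitespace and map such labels to a single bucket before counting. The `aliases` mapping is an assumption for illustration; whether these labels should really be merged is a judgment call this notebook does not make.

```python
import pandas as pd

# Toy certification values, including a missing entry and a trailing space.
certs = pd.Series(
    ["R", "PG-13", "NR", "Unrated", "-", "G", None, "R "],
    name="certification",
)

# Hypothetical mapping: treat these labels as one "not rated" bucket.
aliases = {"Unrated": "NR", "-": "NR"}

# Strip stray whitespace, then collapse the aliases.
cleaned = certs.str.strip().replace(aliases)

# dropna=False keeps missing certifications visible in the tally.
counts = cleaned.value_counts(dropna=False)
print(counts)
```

Passing `dropna=False` also shows how many rows have no certification at all, which the default `value_counts()` call hides.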

<details>
<summary><strong>Q3: What is the average revenue per certification category?</strong></summary>

This gives a high‑level view of how revenue potential varies by rating.  
Later, this can be paired with budget data to examine ROI.

</details>


In [28]:
# Mean revenue per certification over all rows of tmdb_results_combined
# (rows with revenue == 0 are included, which pulls the averages down)
avg_rev_by_cert = tmdb_results_combined.groupby('certification')['revenue'].mean()
avg_rev_by_cert


certification
-                  0
G         72,185,327
NC-17              0
NR         2,189,701
PG        62,583,104
PG-13     71,057,114
R         16,678,490
Unrated            0
Name: revenue, dtype: float64
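
Because these averages include rows with zero revenue, it can help to compare them against averages over rows with a positive revenue only. The sketch below uses invented numbers and assumes, as the Q1 definition implies, that a zero stands for an unreported figure rather than a true zero.

```python
import pandas as pd

# Toy data: two of the three R-rated rows have no reported revenue.
movies = pd.DataFrame(
    {
        "certification": ["R", "R", "R", "PG", "PG"],
        "revenue": [0.0, 0.0, 300.0, 100.0, 200.0],
    }
)

# Mean over every row: zeros count toward the denominator.
mean_all = movies.groupby("certification")["revenue"].mean()

# Mean over rows with a positive revenue only.
reported = movies[movies["revenue"] > 0]
mean_reported = reported.groupby("certification")["revenue"].mean()

# Side-by-side view of both definitions.
comparison = pd.DataFrame({"all_rows": mean_all, "revenue_gt_0": mean_reported})
print(comparison)
```

In this toy case the R average triples once the zero rows are dropped, while the PG average is unchanged, so the ranking of categories can depend on which definition is used.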

<details>
<summary><strong>Q4: What is the average budget per certification category?</strong></summary>

Shows typical spending levels by rating category, which can reveal whether films with more restrictive  
ratings tend to have larger or smaller production budgets.

</details>


In [30]:
# Mean budget per certification over all rows of tmdb_results_combined
# (rows with budget == 0 are included, which pulls the averages down)
avg_budget_by_cert = tmdb_results_combined.groupby('certification')['budget'].mean()
avg_budget_by_cert


certification
-                  0
G         23,833,333
NC-17              0
NR         1,552,755
PG        24,980,159
PG-13     30,891,573
R          9,916,676
Unrated            0
Name: budget, dtype: float64
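
Q3 mentions pairing revenue with budget to examine ROI. A minimal sketch of that pairing, on invented numbers, is shown below. The ROI definition used here, (average revenue - average budget) / average budget, and the choice to keep only rows where both figures are positive are assumptions for illustration, not the notebook's method.

```python
import pandas as pd

# Toy data (values are invented).
movies = pd.DataFrame(
    {
        "certification": ["G", "G", "R", "R", "R"],
        "budget": [10.0, 30.0, 5.0, 0.0, 20.0],
        "revenue": [40.0, 60.0, 0.0, 8.0, 30.0],
    }
)

# Keep rows where both figures are positive so the ratio is meaningful.
both = movies[(movies["budget"] > 0) & (movies["revenue"] > 0)]

# One row per certification with count and both averages.
summary = both.groupby("certification").agg(
    n_movies=("budget", "size"),
    avg_budget=("budget", "mean"),
    avg_revenue=("revenue", "mean"),
)

# Hypothetical ROI definition based on the group averages.
summary["roi"] = (summary["avg_revenue"] - summary["avg_budget"]) / summary["avg_budget"]
print(summary)
```

Reporting `n_movies` next to the averages makes small groups easy to spot, which matters for categories with only a handful of films.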