# 🎬 TMDb Data Extraction – Financial & Certification Enrichment  
**Author:** Joseph Tulani Aytch  
**Last Updated:** Aug 2025  

---

## 🎯 Overview
In **Part 1**, we saw that the IMDb dataset lacked financial metrics like budget and revenue — critical for analyzing movie performance.  
Here, we use **The Movie Database (TMDb)** API to supplement that dataset with:

- **Budget** 💰  
- **Revenue** 📈  
- **MPAA Certification** 🎟 (G, PG, PG‑13, R)

---

## 📌 Stakeholder Request
- Enrich the filtered IMDb dataset from Part 1 with budget, revenue, and US certification.
- Proof‑of‑concept: process movies from **2000** and **2001**.
- Save one compressed CSV file per year.

---

## 🛠 Approach
1. **Setup**: Import libraries, configure folders, load API key from a local JSON secret file.
2. **Helper Functions**:
   - `write_json()` → append API results to JSON storage.
   - `get_movie_with_rating()` → retrieve details + certification.
3. **Validation**: Test API calls with known titles.
4. **Data Extraction**:
   - Outer loop: iterate over years in scope.
   - Inner loop: call TMDb API for each movie and append to JSON.
5. **Output**:
   - `final_tmdb_data_YYYY.csv.gz` → per‑year results.
   - `tmdb_results_combined.csv.gz` → merged for downstream EDA.

---

## 📂 Deliverables
- `final_tmdb_data_2000.csv.gz`
- `final_tmdb_data_2001.csv.gz`
- `tmdb_results_combined.csv.gz`
- Well‑commented API extraction and enrichment code.

---

## ⚠ Notes
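- The TMDb API is rate‑limited, and each movie needs two requests (details and releases), so large runs can take a long time. The extraction loop also pauses 0.02 s after each successful fetch.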
- API key is loaded locally, never committed to the repo.
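
One way to cope with the rate limit is to retry failed calls after a growing delay. The following is a minimal sketch and is not part of the original notebook: `call_with_backoff`, its parameters, and the delay values are hypothetical choices, and it retries on any exception rather than inspecting TMDb-specific status codes.

```python
import time

def call_with_backoff(func, *args, max_retries=3, base_delay=0.5, sleep=time.sleep):
    """Call func(*args), retrying with exponential backoff on any exception.

    Hypothetical helper: the retry count and delays are illustrative
    assumptions, not values recommended by TMDb.
    """
    for attempt in range(max_retries + 1):
        try:
            return func(*args)
        except Exception:
            if attempt == max_retries:
                raise  # out of retries: surface the last error
            sleep(base_delay * 2 ** attempt)  # 0.5 s, 1 s, 2 s, ...
```

In the extraction loop this could wrap the existing helper, for example `call_with_backoff(get_movie_with_rating, movie_id)`. The `sleep` parameter exists only so the delay can be replaced in tests.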


> **🔑 API Key Required**  
> This notebook requires a valid **TMDb API key** to run.  
> - You can request a free key at [https://developer.themoviedb.org/docs](https://developer.themoviedb.org/docs).  
> - Store your key **securely** in a local file or environment variable — never commit it to GitHub.  
> - In this project, the key is read from a local JSON file at `~/.secret/TMDB_api.json`, which is excluded via `.gitignore`.  
> - If the key is missing, API requests will fail. The notebook will still render static outputs for portfolio viewing.
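
The note above mentions an environment variable as an alternative to the JSON file. A minimal sketch of a loader that checks both is shown here, assuming a hypothetical variable name `TMDB_API_KEY` and a hypothetical helper `load_tmdb_key`; the JSON layout (`{"api_key": "..."}`) and the default path are the ones this notebook reads.

```python
import json
import os
from pathlib import Path

def load_tmdb_key(secret_path=None, env_var="TMDB_API_KEY"):
    """Return the API key from the environment or a JSON secret file, else None.

    `TMDB_API_KEY` is an assumed variable name, not one defined by TMDb
    or by this project.
    """
    key = os.environ.get(env_var)
    if key:
        return key  # the environment variable takes precedence
    path = Path(secret_path) if secret_path else Path.home() / ".secret" / "TMDB_api.json"
    try:
        with open(path) as f:
            return json.load(f).get("api_key") or None
    except (FileNotFoundError, json.JSONDecodeError, AttributeError):
        return None  # missing file, invalid JSON, or JSON that is not an object
```

Returning `None` instead of raising keeps the "skip live API calls" behaviour described above.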


In [4]:
# pip install tmdbsimple

In [3]:
import os, time, json
import pandas as pd
import tmdbsimple as tmdb
from tqdm.notebook import tqdm_notebook

# === 1. Setup folders ===
data_dir = os.path.join("Data")
os.makedirs(data_dir, exist_ok=True)

print(f"📂 Data directory ready: {data_dir}")

# List current data files
print("📄 Current files in Data/:", os.listdir(data_dir))


📂 Data directory ready: Data
📄 Current files in Data/: ['final_akas.csv.gz', 'final_basics.csv.gz', 'final_ratings.csv.gz', 'final_tmdb_data_2000.csv.gz', 'final_tmdb_data_2001.csv.gz', 'title_basics.csv.gz', 'tmdb_api_results_2000.json', 'tmdb_api_results_2001.json', 'tmdb_results_combined.csv.gz']


In [18]:
# === 2. Load API credentials ===
import json
from pathlib import Path
import tmdbsimple as tmdb

secret_path = Path.home() / ".secret" / "TMDB_api.json"
api_key = None

try:
    with open(secret_path) as f:
        creds = json.load(f)
        api_key = creds.get("api_key")
    if not api_key:
        raise KeyError("Missing 'api_key' field.")
except FileNotFoundError:
    print(f"⚠️ No API key file found at: {secret_path}")
except (json.JSONDecodeError, KeyError) as e:
    print(f"⚠️ Error reading API key: {e}")

if api_key:
    tmdb.API_KEY = api_key
    print("✅ API key registered with tmdbsimple")
else:
    print("🔒 API key not available. Skipping live API calls.")


✅ API key registered with tmdbsimple


In [19]:
# === 3. Helper functions ===
def write_json(new_data, filename):
    """Append new_data (one record or a list of records) to the JSON list in filename."""
    with open(filename, 'r+') as file:
        file_data = json.load(file)
        if isinstance(new_data, list) and isinstance(file_data, list):
            file_data.extend(new_data)
        else:
            file_data.append(new_data)
        # Rewrite from the start and truncate so no stale bytes remain
        file.seek(0)
        json.dump(file_data, file)
        file.truncate()

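def get_movie_with_rating(movie_id):
    """Fetch movie details and add the US certification when one is listed."""
    movie = tmdb.Movies(movie_id)
    info = movie.info()          # request 1: budget, revenue, and other details
    releases = movie.releases()  # request 2: per-country release data
    # If several US entries exist, the last one wins; with none, the key is absent
    for c in releases['countries']:
        if c['iso_3166_1'] == 'US':
            info['certification'] = c['certification']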
    return info
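
The certification lookup can also be written as a pure function that is testable without an API key. This is a sketch, not the notebook's method: `pick_us_certification` is a hypothetical name, the payload shape is assumed to match what the loop above reads (`countries`, `iso_3166_1`, `certification`), and the sample data is hand-written for illustration. Unlike the helper above, it skips empty certification strings and returns `None` when nothing is found.

```python
def pick_us_certification(releases, country="US"):
    """Return the first non-empty certification for `country`, or None."""
    for entry in releases.get("countries", []):
        if entry.get("iso_3166_1") == country and entry.get("certification"):
            return entry["certification"]
    return None

# Hand-written sample payload (not real API output)
sample = {"countries": [
    {"iso_3166_1": "GB", "certification": "12"},
    {"iso_3166_1": "US", "certification": ""},
    {"iso_3166_1": "US", "certification": "PG-13"},
]}
print(pick_us_certification(sample))  # PG-13
```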


In [20]:
# Test with known movie IDs
get_movie_with_rating("tt0848228")  # The Avengers
get_movie_with_rating("tt0332280")  # The Notebook


{'adult': False,
 'backdrop_path': '/bWoa6FtpnD2qByTVTmL5pAZUAKv.jpg',
 'belongs_to_collection': None,
 'budget': 29000000,
 'genres': [{'id': 10749, 'name': 'Romance'}, {'id': 18, 'name': 'Drama'}],
 'homepage': 'http://www.newline.com/properties/notebookthe.html',
 'id': 11036,
 'imdb_id': 'tt0332280',
 'origin_country': ['US'],
 'original_language': 'en',
 'original_title': 'The Notebook',
 'overview': "An epic love story centered around an older man who reads aloud to a woman with Alzheimer's. From a faded notebook, the old man's words bring to life the story about a couple who is separated by World War II, and is then passionately reunited, seven years later, after they have taken different paths.",
 'popularity': 14.7851,
 'poster_path': '/rNzQyW4f8B8cQeg7Dgj3n6eT5k9.jpg',
 'production_companies': [{'id': 12,
   'logo_path': '/x33I3vv8nx1O7rECNN7X5MsAFoN.png',
   'name': 'New Line Cinema',
   'origin_country': 'US'},
  {'id': 1565, 'logo_path': None, 'name': 'Avery Pix', 'origin_

In [21]:
# === 4. Load IMDb basics from Part 1 ===
basics_path = os.path.join(data_dir, "final_basics.csv.gz")
if not os.path.exists(basics_path):
    raise FileNotFoundError(f"Missing {basics_path}; run Part 1 first.")

basics = pd.read_csv(basics_path)
print(f"✅ Loaded basics dataset: {len(basics):,} rows")


✅ Loaded basics dataset: 84,200 rows


In [22]:
# === 5. Extraction loop ===
errors = []
YEARS_TO_GET = [2000, 2001]

for YEAR in tqdm_notebook(YEARS_TO_GET, desc='YEARS', position=0):
    json_file = os.path.join(data_dir, f"tmdb_api_results_{YEAR}.json")
    csv_file  = os.path.join(data_dir, f"final_tmdb_data_{YEAR}.csv.gz")

    # Ensure JSON exists
    if not os.path.isfile(json_file):
        with open(json_file, 'w') as f:
            json.dump([{'imdb_id': 0}], f)

    # Filter for current year
    df_year = basics.loc[basics['startYear'] == YEAR].copy()
    movie_ids = df_year['tconst']

    # Skip already stored IDs
    previous_df = pd.read_json(json_file)
    movie_ids_to_get = movie_ids[~movie_ids.isin(previous_df['imdb_id'])]

    for movie_id in tqdm_notebook(movie_ids_to_get,
                                  desc=f"Movies {YEAR}",
                                  position=1, leave=True):
        try:
            temp = get_movie_with_rating(movie_id)
            write_json(temp, json_file)
            time.sleep(0.02)
        except Exception as e:
            errors.append([movie_id, e])

    # Save yearly CSV
    final_year_df = pd.read_json(json_file)
    final_year_df.to_csv(csv_file, compression="gzip", index=False)
    print(f"💾 Saved {YEAR} data: {csv_file}")


YEARS:   0%|          | 0/2 [00:00<?, ?it/s]

Movies 2000:   0%|          | 0/207 [00:00<?, ?it/s]

💾 Saved 2000 data: Data\final_tmdb_data_2000.csv.gz


Movies 2001:   0%|          | 0/240 [00:00<?, ?it/s]

💾 Saved 2001 data: Data\final_tmdb_data_2001.csv.gz
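
The loop above collects failures in `errors` as `[movie_id, exception]` pairs but never reports them. A small sketch of how they might be summarized follows; `summarize_errors` is a hypothetical helper, not part of the original notebook.

```python
from collections import Counter

def summarize_errors(errors):
    """Count failures by exception class name.

    Expects the [movie_id, exception] pairs collected by the extraction loop.
    """
    return Counter(type(exc).__name__ for _, exc in errors)
```

Calling `summarize_errors(errors).most_common()` after a run would show whether failures are dominated by one cause, such as IDs that TMDb does not recognize.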


In [23]:
# === 6. Merge yearly files for downstream EDA ===
import glob

yearly_files = sorted(glob.glob(os.path.join(data_dir, "final_tmdb_data_*.csv.gz")))
df_list = [pd.read_csv(f) for f in yearly_files]
tmdb_results_combined = pd.concat(df_list, ignore_index=True)

combined_path = os.path.join(data_dir, "tmdb_results_combined.csv.gz")
tmdb_results_combined.to_csv(combined_path, compression="gzip", index=False)

print(f"✅ Combined {len(yearly_files)} files → {len(tmdb_results_combined):,} rows total")
print(f"💾 Saved merged dataset to: {combined_path}")


✅ Combined 2 files → 2,586 rows total
💾 Saved merged dataset to: Data\tmdb_results_combined.csv.gz
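
Because the extraction loop seeds each JSON file with a placeholder record (`{'imdb_id': 0}`), the yearly and combined files each carry those seed rows. A minimal sketch for removing them before analysis is shown below; `drop_placeholders` is a hypothetical helper, and dropping repeated IDs is an added assumption rather than something the notebook does.

```python
import pandas as pd

def drop_placeholders(df, id_col="imdb_id"):
    """Remove seed rows (ID of 0) and repeated IDs from a results frame."""
    # The seed may be the integer 0 (read from JSON) or the string "0" (read from CSV)
    is_seed = df[id_col].astype(str) == "0"
    return df.loc[~is_seed].drop_duplicates(subset=id_col).reset_index(drop=True)
```

It could be applied as `tmdb_results_combined = drop_placeholders(tmdb_results_combined)` before saving or before downstream EDA.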
