# TMDB Data Extraction

## Overview
In Part 1, we found that the IMDb dataset lacked **financial metrics** such as budget and revenue, which are essential for analyzing movie success.  
To address this gap, we integrated **The Movie Database (TMDB)** as a supplemental data source, leveraging its free API to enrich our dataset.

## Stakeholder Request
- Add **budget**, **revenue**, and **MPAA Certification** (G, PG, PG‑13, R) to the filtered movie dataset from Part 1.
- Run a proof‑of‑concept extraction for movies released in **2000** and **2001**.
- Save one compressed CSV file per year.

## Approach
1. **Setup:** Import required libraries, configure folders, and load API credentials securely from a local JSON file.
2. **Helper Functions:**
   - `write_json()` → appends API results to JSON storage.
   - `get_movie_with_rating()` → retrieves movie details + US certification.
3. **Validation:** Test the extraction functions with two known titles to confirm correct API behavior.
4. **Data Extraction Loops:**
   - Outer loop iterates over the years in scope.
   - Inner loop calls the TMDB API for each movie ID, stores results in JSON, and writes yearly compressed CSVs.
5. **Merge Results:** Concatenate yearly files into a single combined dataset for downstream EDA.

## Deliverables
- `final_tmdb_data_2000.csv.gz`
- `final_tmdb_data_2001.csv.gz`
- `tmdb_results_combined.csv.gz` (all TMDB API data merged)
- Well‑commented code for both API extraction and EDA.

**Note:** API extraction is rate‑limited; execution on large datasets may require extended run times.
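
Because the extraction is rate-limited, transient failures are worth retrying rather than only recording. The following is a minimal sketch of retrying with exponential backoff, not the method used in this notebook. `call_with_retry`, `retries`, `base_delay`, and the `flaky` demo function are hypothetical names introduced for illustration; in practice the wrapped callable could be `get_movie_with_rating`.

```python
import time


def call_with_retry(func, *args, retries=3, base_delay=0.5, sleep=time.sleep):
    """Call func(*args); after a failure wait base_delay * 2**attempt, then retry.

    Hypothetical helper: the last failure is re-raised once retries are used up.
    """
    for attempt in range(retries):
        try:
            return func(*args)
        except Exception:
            if attempt == retries - 1:
                raise
            sleep(base_delay * 2 ** attempt)


# Demonstration with a simulated flaky call (no network access involved)
calls = {"n": 0}


def flaky(x):
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("simulated transient failure")
    return x * 2


delays = []  # record the requested waits instead of actually sleeping
result = call_with_retry(flaky, 21, sleep=delays.append)
print(result, delays)  # 42 [0.5, 1.0]
```

Passing `sleep` as a parameter keeps the demonstration instant; a real run would leave it at the default `time.sleep`.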


In [2]:
# pip install tmdbsimple

In [3]:
import os, time, json
import pandas as pd
import tmdbsimple as tmdb
from tqdm.notebook import tqdm_notebook

# Folder for storing data
FOLDER = "Data/"
os.makedirs(FOLDER, exist_ok=True)

# List files currently in the Data folder
os.listdir(FOLDER)


['final_akas.csv.gz',
 'final_basics.csv.gz',
 'final_ratings.csv.gz',
 'final_tmdb_data_2000.csv.gz',
 'final_tmdb_data_2001.csv.gz',
 'title_basics.csv.gz',
 'tmdb_api_results_2000.json',
 'tmdb_api_results_2001.json',
 'tmdb_results_combined.csv.gz']

In [4]:
import os, json

# Build path to your secret file in a cross-platform way
secret_path = os.path.expanduser("~/.secret/TMDB_api.json")

with open(secret_path, 'r') as f:
    login = json.load(f)

tmdb.API_KEY = login['api-key']
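
The cell above assumes the credentials file exists and contains an `api-key` entry. A small sketch of the same load with clearer failure messages follows. `load_api_key` and its `key_name` parameter are hypothetical names, and the demonstration writes a dummy key to a temporary directory rather than touching `~/.secret/`.

```python
import json
import os
import tempfile


def load_api_key(path, key_name="api-key"):
    """Return the API key stored under key_name in a JSON credentials file."""
    path = os.path.expanduser(path)
    if not os.path.isfile(path):
        raise FileNotFoundError(f"Credentials file not found: {path}")
    with open(path, "r") as f:
        login = json.load(f)
    if not isinstance(login, dict) or not login.get(key_name):
        raise KeyError(f"'{key_name}' is missing or empty in {path}")
    return login[key_name]


# Demonstration with a throwaway file and a dummy key (not a real credential)
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "TMDB_api.json")
    with open(demo_path, "w") as f:
        json.dump({"api-key": "dummy-key-for-illustration"}, f)
    demo_key = load_api_key(demo_path)

    # A missing file fails immediately with a descriptive message
    try:
        load_api_key(os.path.join(tmp_dir, "does_not_exist.json"))
        missing_file_detected = False
    except FileNotFoundError:
        missing_file_detected = True
```

With a helper like this, the assignment would read `tmdb.API_KEY = load_api_key("~/.secret/TMDB_api.json")`.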


In [5]:
def write_json(new_data, filename):
    """
    Appends new movie records to an existing JSON file whose top level is a list.
    A list of records is added item by item; a single record (dict) is appended.
    Adapted from: https://www.geeksforgeeks.org/append-to-json-file-using-python/
    """
    with open(filename, 'r+') as file:
        file_data = json.load(file)

        # Extend if both are lists, else append the single record
        if isinstance(new_data, list) and isinstance(file_data, list):
            file_data.extend(new_data)
        else:
            file_data.append(new_data)

        # Reset file pointer, overwrite, and discard any leftover bytes
        file.seek(0)
        json.dump(file_data, file)
        file.truncate()


def get_movie_with_rating(movie_id):
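    """
    Retrieves TMDB movie details plus the US certification rating.
    Accepts an IMDb ID such as "tt0332280". If no US release is listed,
    the returned dict has no 'certification' key.
    """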
    movie = tmdb.Movies(movie_id)
    info = movie.info()
    releases = movie.releases()

    # Search for US certification and add to info dict
    for c in releases['countries']:
        if c['iso_3166_1'] == 'US':
            info['certification'] = c['certification']

    return info
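
The loop in `get_movie_with_rating()` keeps the last US entry it sees and leaves the `certification` key out when there is no US entry. A sketch of one alternative follows: a pure function that returns the first non-empty US certification, or `None`. `extract_us_certification` is a hypothetical name, the sample payloads are made up, and treating blank certifications as missing is an assumption rather than documented API behavior. Only the keys already used above (`countries`, `iso_3166_1`, `certification`) are read.

```python
def extract_us_certification(releases):
    """Return the first non-empty US certification, or None if there is none."""
    for country in releases.get("countries", []):
        if country.get("iso_3166_1") == "US" and country.get("certification"):
            return country["certification"]
    return None


# Made-up payloads in the shape the function above reads
with_rating = {"countries": [
    {"iso_3166_1": "GB", "certification": "12A"},
    {"iso_3166_1": "US", "certification": ""},
    {"iso_3166_1": "US", "certification": "PG-13"},
]}
without_us = {"countries": [{"iso_3166_1": "FR", "certification": "U"}]}

print(extract_us_certification(with_rating))  # PG-13
print(extract_us_certification(without_us))   # None
```

Always setting `info['certification']` from a function like this would give every record the same keys, which keeps the yearly tables consistent.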


In [6]:
# Test with known IMDb IDs; the notebook displays only the last call's result
get_movie_with_rating("tt0848228")  # The Avengers
get_movie_with_rating("tt0332280")  # The Notebook


{'adult': False,
 'backdrop_path': '/bWoa6FtpnD2qByTVTmL5pAZUAKv.jpg',
 'belongs_to_collection': None,
 'budget': 29000000,
 'genres': [{'id': 10749, 'name': 'Romance'}, {'id': 18, 'name': 'Drama'}],
 'homepage': 'http://www.newline.com/properties/notebookthe.html',
 'id': 11036,
 'imdb_id': 'tt0332280',
 'origin_country': ['US'],
 'original_language': 'en',
 'original_title': 'The Notebook',
 'overview': "An epic love story centered around an older man who reads aloud to a woman with Alzheimer's. From a faded notebook, the old man's words bring to life the story about a couple who is separated by World War II, and is then passionately reunited, seven years later, after they have taken different paths.",
 'popularity': 14.7851,
 'poster_path': '/rNzQyW4f8B8cQeg7Dgj3n6eT5k9.jpg',
 'production_companies': [{'id': 12,
   'logo_path': '/x33I3vv8nx1O7rECNN7X5MsAFoN.png',
   'name': 'New Line Cinema',
   'origin_country': 'US'},
  {'id': 1565, 'logo_path': None, 'name': 'Avery Pix', 'origin_

In [7]:
# Load filtered dataset from Part 1
basics = pd.read_csv('Data/final_basics.csv.gz')

# Error tracking
errors = []

# Years to process in proof-of-concept run
YEARS_TO_GET = [2000, 2001]


In [8]:
for YEAR in tqdm_notebook(YEARS_TO_GET, desc='YEARS', position=0):

    JSON_FILE = f'{FOLDER}tmdb_api_results_{YEAR}.json'

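    # Create the JSON file if it doesn't exist, seeded with a placeholder record
    # so that pd.read_json() below returns an 'imdb_id' column to filter on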
    if not os.path.isfile(JSON_FILE):
        with open(JSON_FILE, 'w') as f:
            json.dump([{'imdb_id': 0}], f)

    # Filter movies for the current year
    df = basics.loc[basics['startYear'] == YEAR].copy()
    movie_ids = df['tconst'].copy()

    # Load existing JSON data & avoid re-fetching already-stored IDs
    previous_df = pd.read_json(JSON_FILE)
    movie_ids_to_get = movie_ids[~movie_ids.isin(previous_df['imdb_id'])]

    # Inner loop — fetch each movie from API
    for movie_id in tqdm_notebook(movie_ids_to_get,
                                  desc=f'Movies from {YEAR}',
                                  position=1, leave=True):
        try:
            temp = get_movie_with_rating(movie_id)
            write_json(temp, JSON_FILE)
            time.sleep(0.02)  # Avoid hammering the server
        except Exception as e:
            errors.append([movie_id, e])

    # Save yearly CSV output
    final_year_df = pd.read_json(JSON_FILE)
    final_year_df.to_csv(f"{FOLDER}final_tmdb_data_{YEAR}.csv.gz",
                         compression="gzip", index=False)


YEARS:   0%|          | 0/2 [00:00<?, ?it/s]

Movies from 2000:   0%|          | 0/189 [00:00<?, ?it/s]

Movies from 2001:   0%|          | 0/318 [00:00<?, ?it/s]
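
The extraction loop stores each failure in `errors` as a `[movie_id, exception]` pair but never inspects the list. A minimal sketch of summarizing it after a run follows. `summarize_errors` is a hypothetical helper, and the sample entries are simulated, not results from this run.

```python
from collections import Counter


def summarize_errors(errors):
    """Count failures by exception type for [movie_id, exception] pairs."""
    return Counter(type(exc).__name__ for _, exc in errors)


# Simulated entries; a real run holds whatever the API client raised
sample_errors = [
    ["tt0000001", KeyError("countries")],
    ["tt0000002", ValueError("simulated failure")],
    ["tt0000003", ValueError("simulated failure")],
]

summary = summarize_errors(sample_errors)
failed_ids = [movie_id for movie_id, _ in sample_errors]
print(summary.most_common())
print(failed_ids)
```

A count dominated by one exception type usually points to a single cause, and `failed_ids` gives the list of IDs to retry.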

In [9]:
# Load yearly files
year_2000 = pd.read_csv(f'{FOLDER}final_tmdb_data_2000.csv.gz')
year_2001 = pd.read_csv(f'{FOLDER}final_tmdb_data_2001.csv.gz')

# Combine into one DataFrame, in chronological order
tmdb_results_combined = pd.concat([year_2000, year_2001], ignore_index=True)

# Save merged file for EDA
tmdb_results_combined.to_csv(f"{FOLDER}tmdb_results_combined.csv.gz",
                             compression="gzip", index=False)
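
Each yearly JSON file is seeded with a placeholder record (`{'imdb_id': 0}`), so that row is carried into the yearly CSVs and the combined file. A sketch of one way to clean the merged table before EDA follows. `clean_combined` is a hypothetical helper, the sample rows are made up, and keeping only IDs that start with `tt` is an assumption about what counts as a real record.

```python
import pandas as pd


def clean_combined(df):
    """Keep rows with a real IMDb ID and drop repeated IDs."""
    has_real_id = df["imdb_id"].astype(str).str.startswith("tt")
    return (df.loc[has_real_id]
              .drop_duplicates(subset="imdb_id")
              .reset_index(drop=True))


# Made-up rows, including the placeholder as it looks after a CSV round trip
sample = pd.DataFrame({
    "imdb_id": ["0", "tt0000001", "tt0000002", "tt0000001", "0"],
    "budget": [None, 1000, 0, 1000, None],
})

cleaned = clean_combined(sample)
print(cleaned["imdb_id"].tolist())  # ['tt0000001', 'tt0000002']
```

Applied to the notebook's data, this would be `clean_combined(tmdb_results_combined)` before the file is written.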
