# Part 2 – Data Analytics

## Step 1: Crawl a real-world dataset

### Data source overview & Data collection
Register with TMDB, obtain an API key, and comply with the terms of use.   
NOTICE: "This program uses TMDB and the TMDB APIs but is not endorsed, certified, or otherwise approved by TMDB."  
Access rate limit: 40 requests per second   
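
The two points above (keeping the API key private and staying under the rate limit) can be sketched as follows. This is a minimal sketch under assumptions: the environment variable name `TMDB_API_KEY` and the helpers `load_api_key` and `Throttle` are hypothetical, and are not part of TMDB or `requests`.

```python
import os
import time


def load_api_key(env_var="TMDB_API_KEY"):
    """Return the API key stored in an environment variable (the name is an assumption)."""
    key = os.environ.get(env_var)
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable first")
    return key


class Throttle:
    """Space successive wait() calls at least 1/max_per_second seconds apart."""

    def __init__(self, max_per_second=40):
        self.min_interval = 1.0 / max_per_second
        self._last = None  # time of the previous call, None before the first one

    def wait(self):
        now = time.monotonic()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                time.sleep(remaining)
        self._last = time.monotonic()
```

Calling `throttle.wait()` immediately before each request would keep the request rate at or below `max_per_second`.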


- Dataset: The Movie Database (TMDB) public API; movie resource entry point: https://www.themoviedb.org/movie
- Endpoint called: the Discover Movies API, https://api.themoviedb.org/3/discover/movie
- Data type: movie metadata
  - Fields returned by the Discover Movies API include id, title, release_date, vote_average, vote_count, popularity, original_language, genre_ids, overview, etc.
  - For more details, refer to the official documentation: https://developer.themoviedb.org/reference/discover-movie
- Data extraction scale: 200 movie records, extracted page by page and restricted to films released in 2025
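
To make the page-by-page extraction concrete, here is a small sketch of how many pages 200 records require and what one request URL looks like. It assumes the API returns 20 results per page; `pages_needed` and `discover_url` are hypothetical helper names, and `"KEY"` is a placeholder.

```python
import math
from urllib.parse import urlencode

BASE_URL = "https://api.themoviedb.org/3/discover/movie"
PAGE_SIZE = 20  # assumption: number of results the API returns per page


def pages_needed(max_results, page_size=PAGE_SIZE):
    """Number of pages to request in order to collect max_results records."""
    return math.ceil(max_results / page_size)


def discover_url(api_key, page, start_date, end_date):
    """Build the Discover Movies URL for one page of a release-date range."""
    query = {
        "api_key": api_key,
        "primary_release_date.gte": start_date,
        "primary_release_date.lte": end_date,
        "page": page,
    }
    return f"{BASE_URL}?{urlencode(query)}"


print(pages_needed(200))
print(discover_url("KEY", 1, "2025-01-01", "2025-12-31"))
```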


In [3]:
import requests
import pandas as pd


def fetch_tmdb_movies(
    api_key="7f8c6e28203eb7d6a49fa0caf4466396",
    start_date="2025-01-01",
    end_date="2025-12-31",
    max_results=200,
    language="en-GB",  # "en-UK" is not a valid code; the region code for the UK is GB
):
    """Fetch up to max_results movies from the TMDB Discover API within a release-date range."""
    base_url = "https://api.themoviedb.org/3/discover/movie"
    params = {
        "api_key": api_key,
        "language": language,
        "primary_release_date.gte": start_date,
        "primary_release_date.lte": end_date,
        "page": 1,
    }

    records = []
    while len(records) < max_results:
        result = requests.get(base_url, params=params, timeout=10)
        result.raise_for_status()
        data = result.json()
        for movie in data.get("results", []):
            records.append({
                "id": movie.get("id"),
                "title": movie.get("title"),
                "release_date": movie.get("release_date"),
                "vote_average": movie.get("vote_average"),
                "vote_count": movie.get("vote_count"),
                "popularity": movie.get("popularity"),
                "original_language": movie.get("original_language"),
                "genre_ids": movie.get("genre_ids"),
            })
            if len(records) >= max_results:
                break
        # Stop once the last available page has been read
        total_pages = data.get("total_pages") or 0
        if params["page"] >= total_pages:
            break
        params["page"] += 1
    movie_data = pd.DataFrame(records)
    return movie_data


### Variables and schema
- `id`: The movie's unique ID
- `title`: Movie title
- `release_date`: Release date
- `vote_average`: TMDB average rating
- `vote_count`: Number of votes received
- `popularity`: Popularity index
- `original_language`: Original language code
- `genre_ids`: List of genre IDs
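
Because `genre_ids` holds numeric codes, a lookup table is needed to read them. The sketch below is illustrative only: `GENRE_NAMES` is a hand-written partial mapping covering the IDs in the sample rows, and `genre_names` is a hypothetical helper. The authoritative list should be taken from TMDB's genre list endpoint.

```python
# Assumption: partial mapping written by hand; verify against TMDB's genre list
GENRE_NAMES = {
    12: "Adventure",
    14: "Fantasy",
    18: "Drama",
    27: "Horror",
    28: "Action",
    35: "Comedy",
    53: "Thriller",
    80: "Crime",
    878: "Science Fiction",
    10751: "Family",
}


def genre_names(genre_ids, lookup=GENRE_NAMES):
    """Translate a list of genre IDs to names, flagging IDs missing from the lookup."""
    return [lookup.get(g, f"unknown ({g})") for g in genre_ids]


print(genre_names([18, 27, 14]))
```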


In [10]:
df_movies = fetch_tmdb_movies(start_date="2025-01-01", end_date="2025-12-31", max_results=200)
display(df_movies.head())

| | id | title | release_date | vote_average | vote_count | popularity | original_language | genre_ids |
|---|---|---|---|---|---|---|---|---|
| 0 | 1062722 | Frankenstein | 2025-10-17 | 7.866 | 1457 | 582.9067 | en | [18, 27, 14] |
| 1 | 1054867 | One Battle After Another | 2025-09-23 | 7.594 | 1424 | 414.7432 | en | [28, 53, 80] |
| 2 | 1248226 | Playdate | 2025-11-05 | 6.44 | 182 | 396.6845 | en | [28, 35, 10751] |
| 3 | 1242898 | Predator: Badlands | 2025-11-05 | 7.397 | 354 | 263.6485 | en | [28, 878, 12] |
| 4 | 1296504 | Stand Your Ground | 2025-05-09 | 5.839 | 31 | 218.2118 | en | [28, 53, 80] |


### Secondary data crawling & Table merging

The first API call returned only basic information about each film.   
To support the analysis, I also need more detailed data for each film:
 - `runtime`: Running time in minutes
 - `budget`: Production budget
 - `revenue`: Box office revenue  

I fetch these fields from the movie details endpoint and merge the two tables on the movie's unique ID.
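
The merge on the unique ID can be illustrated on a toy example. This is a sketch, not the notebook's actual cell: the rows reuse a few values from the sample output, and the third movie is deliberately left out of `details` to show what a left merge does when a detail record is missing. `validate="one_to_one"` makes pandas raise an error if an ID is duplicated in either table.

```python
import pandas as pd

# Toy tables for illustration only
movies = pd.DataFrame({
    "id": [1062722, 1248226, 1296504],
    "title": ["Frankenstein", "Playdate", "Stand Your Ground"],
})
details = pd.DataFrame({
    "id": [1062722, 1248226],
    "runtime": [150, 95],
    "budget": [120000000, 0],
})

# A left merge keeps every row of `movies`; unmatched detail columns become NaN
full = movies.merge(details, on="id", how="left", validate="one_to_one")

# Rows for which no detail record was found
missing = full[full["runtime"].isna()]
print(full)
print(missing["id"].tolist())
```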

In [None]:
import time

def fetch_movie_detail(movie_id, api_key="7f8c6e28203eb7d6a49fa0caf4466396"):
    """Fetch runtime, budget and revenue for one movie from the TMDB movie details endpoint."""
    url = f"https://api.themoviedb.org/3/movie/{movie_id}"
    resp = requests.get(url, params={"api_key": api_key}, timeout=10)
    resp.raise_for_status()
    details = resp.json()
    return {
        "id": movie_id,
        "runtime": details.get("runtime"),
        "budget": details.get("budget"),
        "revenue": details.get("revenue"),
    }

detail_record = []
for movie_id in df_movies["id"]:
    detail_record.append(fetch_movie_detail(movie_id))
    # Pause between requests to control the access frequency
    time.sleep(0.1)
df_movie_details = pd.DataFrame(detail_record)

display(df_movie_details.head())

# Merge the two tables on the unique movie ID
df_movie_full = df_movies.merge(df_movie_details, on="id", how="left")
display(df_movie_full.head())

| | id | runtime | budget | revenue |
|---|---|---|---|---|
| 0 | 1062722 | 150 | 120000000 | 144496 |
| 1 | 1054867 | 162 | 130000000 | 200300000 |
| 2 | 1248226 | 95 | 0 | 0 |
| 3 | 1242898 | 107 | 105000000 | 136304860 |
| 4 | 1296504 | 100 | 0 | 0 |


| | id | title | release_date | vote_average | vote_count | popularity | original_language | genre_ids | runtime | budget | revenue |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1062722 | Frankenstein | 2025-10-17 | 7.866 | 1457 | 582.9067 | en | [18, 27, 14] | 150 | 120000000 | 144496 |
| 1 | 1054867 | One Battle After Another | 2025-09-23 | 7.594 | 1424 | 414.7432 | en | [28, 53, 80] | 162 | 130000000 | 200300000 |
| 2 | 1248226 | Playdate | 2025-11-05 | 6.44 | 182 | 396.6845 | en | [28, 35, 10751] | 95 | 0 | 0 |
| 3 | 1242898 | Predator: Badlands | 2025-11-05 | 7.397 | 354 | 263.6485 | en | [28, 878, 12] | 107 | 105000000 | 136304860 |
| 4 | 1296504 | Stand Your Ground | 2025-05-09 | 5.839 | 31 | 218.2118 | en | [28, 53, 80] | 100 | 0 | 0 |
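
The detail crawl issues one request per movie, and `raise_for_status()` aborts the whole loop on the first failed request. One possible safeguard, sketched here under assumptions, is a small retry wrapper with exponential backoff. `call_with_retry` is a hypothetical helper; it catches every exception for brevity, and in practice it could be narrowed to `requests.RequestException`.

```python
import time


def call_with_retry(func, *args, attempts=3, base_delay=0.5, sleep=time.sleep):
    """Call func(*args), retrying on failure with delays of base_delay * 2**attempt."""
    for attempt in range(attempts):
        try:
            return func(*args)
        except Exception:
            # Re-raise once the final attempt has failed
            if attempt == attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)
```

In the crawl loop this would be used as `call_with_retry(fetch_movie_detail, movie_id)` in place of the direct call.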


### Storage as CSV
- Storage path: data/tmdb_movies_2025.csv



In [15]:
from pathlib import Path

# Relative path, matching the storage path stated above (resolved against the working directory)
output_path = Path("data") / "tmdb_movies_2025.csv"
output_path.parent.mkdir(parents=True, exist_ok=True)
# index=False avoids an extra "Unnamed: 0" column when the CSV is read back
df_movie_full.to_csv(output_path, index=False)
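
Two details of the CSV round trip are worth checking: the DataFrame index becomes an extra column unless `index=False` is passed, and list-valued cells such as `genre_ids` are stored as text. The sketch below uses an in-memory buffer and a one-row toy frame rather than the real file. Parsing the column back with `ast.literal_eval` is one possible approach, not necessarily the one used later in this notebook.

```python
import ast
import io

import pandas as pd

# One-row toy frame for illustration
df = pd.DataFrame({"id": [1062722], "genre_ids": [[18, 27, 14]]})

buffer = io.StringIO()
df.to_csv(buffer, index=False)  # do not write the index column
buffer.seek(0)

# genre_ids is stored as the text "[18, 27, 14]"; convert it back to a list
restored = pd.read_csv(buffer, converters={"genre_ids": ast.literal_eval})
print(restored.columns.tolist())
print(restored.loc[0, "genre_ids"])
```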
