# TMDB Movie Data Probability

### ---Question---

#### What's the proportion of TV shows on TMDB that are dramas?

### ---Description---

Random variable: Number of ratings (vote_count) for a randomly selected movie

One trial: Selecting one random movie from the TMDB dataset and recording its vote_count

Assumption: Movies are randomly sampled from TMDB; vote counts follow an approximately normal distribution

A possible bias is that I'm only sampling 1,000 movies rather than looking at more of the dataset.

In [3]:
import requests
import pandas as pd
import random
import time

# --- Your TMDB API Key ---
API_KEY = "e60bf5d10ffaceb9ad099377c20a9924"

# --- Function to fetch a random valid TV show ---
def get_random_tv_show():
    while True:
        random_id = random.randint(1, 250000)  # TV show IDs range is smaller than movies
        url = f"https://api.themoviedb.org/3/tv/{random_id}?api_key={API_KEY}"
        response = requests.get(url)

        if response.status_code == 200:
            data = response.json()
            if data.get("name"):  # ensure valid TV show
                return data
        time.sleep(0.1)

# --- Function to collect multiple unique TV shows ---
def get_unique_random_tv_shows(n=10):
    seen_ids = set()
    shows = []

    while len(shows) < n:
        show_data = get_random_tv_show()
        show_id = show_data.get("id")

        if show_id not in seen_ids:
            seen_ids.add(show_id)
            shows.append({
                "id": show_id,
                "name": show_data.get("name"),
                "popularity": show_data.get("popularity"),
                "vote_count": show_data.get("vote_count"),
                "genre_ids": [genre['id'] for genre in show_data.get("genres", [])],
                "genres": [genre['name'] for genre in show_data.get("genres", [])],
                "first_air_date": show_data.get("first_air_date"),
                "overview": show_data.get("overview")
            })
            print(f"✅ Added: {show_data.get('name')} (ID: {show_id}) [{len(shows)}/{n}]")
        else:
            print(f"⚠️ Duplicate ID {show_id}, retrying...")

        time.sleep(0.1)  # rate limit delay

    return pd.DataFrame(shows)

# --- Fetch random TV shows ---
df_tv = get_unique_random_tv_shows(10)

# --- Compute proportion of Drama shows ---
df_tv["is_drama"] = df_tv["genres"].apply(lambda g: "Drama" in g if isinstance(g, list) else False)
proportion_drama = df_tv["is_drama"].mean()

# --- Display results ---
print("\nüìä Proportion of TV shows that are Drama:")
print(f"{proportion_drama:.2%} ({df_tv['is_drama'].sum()} out of {len(df_tv)})")

pd.set_option('display.max_colwidth', None)
pd.set_option('display.colheader_justify', 'center')


✅ Added: How to Win at Everything (ID: 135133) [1/10]
✅ Added: A Story of "Grappler Baki" and Me (ID: 131573) [2/10]
✅ Added: Ninja Mono (ID: 44720) [3/10]
✅ Added: เพียงชายคนนี้  Piang Chai Khon Nee Mai Chai Poo Wised (ID: 68536) [4/10]
✅ Added: Gentle Mercy (ID: 57165) [5/10]
✅ Added: Super Animals (ID: 38525) [6/10]
✅ Added: Keibuho Yabe Kenzo (ID: 54456) [7/10]
✅ Added: Racon: Ailem İçin (ID: 86628) [8/10]
✅ Added: Taheyyaty ila Al-'Aila Al-Kareema (ID: 53858) [9/10]
✅ Added: 正しいロックバンドの作り方 (ID: 118295) [10/10]

📊 Proportion of TV shows that are Drama:
10.00% (1 out of 10)
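
With only 10 shows, the 10% estimate above is quite noisy. As a rough illustration (not part of the original analysis), the sketch below computes a 95% Wilson score interval for the observed count of 1 Drama show out of 10. The helper name `wilson_interval` and the choice of the Wilson method are assumptions made here for illustration.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a proportion (z=1.96 gives roughly 95% coverage)."""
    p_hat = successes / n
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width

# Observed result from the run above: 1 Drama show out of 10 sampled
low, high = wilson_interval(1, 10)
print(f"95% Wilson interval: {low:.3f} to {high:.3f}")
```

The interval is wide, which suggests that a larger sample would be needed before reading much into the 10% figure.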


### ---Answer---

The empirical probability from 1,000 simulated samples was 0.0000, meaning none of the sample means were greater than 500.

The theoretical probability assuming a normal model (μ = 32.67, σ = 272.96) was also 0.0000, confirming that 500 is far beyond the expected range of sample means.

The mean of all sample means was 31.98, with a standard deviation of 26.50 (close to the theoretical σ/√n = 27.30).
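
The arithmetic behind these figures can be checked with a short sketch. It assumes a sample size of n = 100, which is not stated in the text but is implied by σ/√n = 27.30 with σ = 272.96. It then computes the standard error, the z-score of 500, and the normal upper-tail probability using only the standard library.

```python
import math

mu = 32.67        # mean of the normal model, from the text
sigma = 272.96    # standard deviation of the normal model, from the text
n = 100           # ASSUMED sample size, inferred from sigma / sqrt(n) = 27.30
threshold = 500   # sample-mean threshold discussed in the text

# Standard error of the sample mean
standard_error = sigma / math.sqrt(n)

# How many standard errors the threshold lies above the mean
z = (threshold - mu) / standard_error

# Upper-tail probability of a standard normal: P(Z > z) = 0.5 * erfc(z / sqrt(2))
p_tail = 0.5 * math.erfc(z / math.sqrt(2))

print(f"standard error = {standard_error:.2f}")
print(f"z = {z:.2f}")
print(f"P(sample mean > {threshold}) = {p_tail:.4f}")
```

Under these assumptions the threshold sits about 17 standard errors above the mean, which is why both the empirical and theoretical probabilities round to 0.0000.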