#  📡 Download TMDB Data Notebook Overview  
This notebook pulls TMDB data for movies in the IMDb dataset, using IMDb `tconst` as the lookup key.  

## 📂 Steps in This Notebook  

1️⃣ **🔒 Hidden Code Blocks** → **Old, possibly useful code** from the earlier screenplay-based approach. Needs investigation before deletion.  

2️⃣ **Import Libraries & Setup** → Loads dependencies, sets **API key & base URL** for TMDB queries.  

3️⃣ **Retrieve TMDB IDs from IMDb `tconst`** →  
   - Queries TMDB using IMDb IDs to get **TMDB IDs**.  
   - ⚠️ **Already completed; do not run again** (a full run makes about 11K API calls).  

4️⃣ **Fetch TMDB Movie Data** →  
   - Uses **TMDB IDs** to pull movie details.  
   - Retrieves **release dates, credits, keywords, and other metadata**.  
   - Saves each movie’s data as a **JSON file** using its **IMDb `tconst` as the filename**.  

5️⃣ **(Old Code) Fuzzy Matching & Title Fixing** →  
   - Leftover code for **matching screenplay titles** to IMDb titles.  
   - ⚠️ **No longer needed, planned for deletion.**  
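
Steps 3 and 4 above come down to two TMDB requests per movie: a `/find` lookup by IMDb ID, then a `/movie` details call. As a minimal sketch, the code below only builds the URL, query parameters, and output path for each step and sends nothing. The helper names (`find_request`, `details_request`, `output_path`), the placeholder key, and the TMDB ID `123` are illustrative assumptions, not names or values from this notebook.

```python
BASE_URL = "https://api.themoviedb.org/3"

def find_request(tconst, api_key):
    """Step 3: URL and params for looking up a TMDB ID from an IMDb tconst."""
    params = {"api_key": api_key, "external_source": "imdb_id"}
    return f"{BASE_URL}/find/{tconst}", params

def details_request(tmdb_id, api_key):
    """Step 4: URL and params for movie details plus appended sub-resources."""
    params = {
        "api_key": api_key,
        "append_to_response": "release_dates,credits,keywords",
    }
    return f"{BASE_URL}/movie/{tmdb_id}", params

def output_path(output_folder, tconst):
    """Each movie is saved as <tconst>.json inside the output folder."""
    return f"{output_folder}/{tconst}.json"

# "YOUR_API_KEY" and 123 are placeholders, not real credentials or IDs
find_url, find_params = find_request("tt1375666", "YOUR_API_KEY")
details_url, details_params = details_request(123, "YOUR_API_KEY")
print(find_url)
print(details_url)
print(output_path("data/subgenre_titles_data", "tt1375666"))
```

Passing either pair to `requests.get(url, params=params, timeout=10)` would perform the actual call.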

## 🛠️ Next Steps  
✔ **Review & remove outdated code for clean public GitHub repo.**  
✔ **Ensure final dataset structure is correct.**  
✔ **Move forward with merging TMDB JSON files into the IMDb dataset.**  
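
One possible shape for the merge step, as a sketch only: read every `<tconst>.json` file into a dictionary keyed by `tconst`, then attach each record to the matching IMDb row. The function names, the selected fields, and the sample values are assumptions made for illustration; the real merge may well be done in pandas instead.

```python
import json
import tempfile
from pathlib import Path

def load_tmdb_records(folder, fields=("id", "title", "release_date")):
    """Read every <tconst>.json file in folder into a dict keyed by tconst."""
    records = {}
    for path in sorted(Path(folder).glob("*.json")):
        with open(path, encoding="utf-8") as f:
            details = json.load(f)
        # The file name (without .json) is the IMDb tconst
        records[path.stem] = {field: details.get(field) for field in fields}
    return records

def merge_with_imdb(imdb_rows, tmdb_records):
    """Left join: keep every IMDb row, with 'tmdb' set to None when no file exists."""
    return [{**row, "tmdb": tmdb_records.get(row["tconst"])} for row in imdb_rows]

# Tiny demo with made-up values written to a temporary folder
with tempfile.TemporaryDirectory() as tmp:
    sample = {"id": 1, "title": "Example Movie", "release_date": "2000-01-01"}
    (Path(tmp) / "tt0000001.json").write_text(json.dumps(sample), encoding="utf-8")
    tmdb_records = load_tmdb_records(tmp)

imdb_rows = [{"tconst": "tt0000001"}, {"tconst": "tt0000002"}]
merged = merge_with_imdb(imdb_rows, tmdb_records)
print(merged)
```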


In [None]:
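import os
import requests
import json
import time

# Read the TMDB API key from the environment (TMDB_API_KEY is an assumed variable name)
API_KEY = os.environ.get("TMDB_API_KEY")

# Define the base URL for the TMDB API
BASE_URL = 'https://api.themoviedb.org/3'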

def search_movie(title):
    """Search for a movie by title and return the movie ID."""
    search_url = f"{BASE_URL}/search/movie"
    params = {
        'api_key': API_KEY,
        'query': title
    }
    response = requests.get(search_url, params=params)
    data = response.json()
    results = data.get('results', [])
    if results:
        return results[0]['id']
    return None

def get_movie_details(movie_id):
    """Get full details of a movie by its ID."""
    details_url = f"{BASE_URL}/movie/{movie_id}"
    params = {
        'api_key': API_KEY
    }
    response = requests.get(details_url, params=params)
    return response.json()

def download_movie_data(titles, output_folder):
    """Download movie data for a list of titles and save to JSON files."""
    for title in titles:
        movie_id = search_movie(title)
        if movie_id:
            details = get_movie_details(movie_id)
            file_path = f"{output_folder}/{title.replace(' ', '_')}.json"
            with open(file_path, 'w', encoding='utf-8') as f:
                json.dump(details, f, ensure_ascii=False, indent=4)
            print(f"Downloaded data for '{title}' and saved to '{file_path}'")
        else:
            print(f"No data found for '{title}'")
        time.sleep(1)  # Respect the rate limit by adding a delay

# Example usage
titles = [
    "Inception",
    "The Dark Knight",
    "Interstellar"
]
output_folder = 'movie_data'

# Create the output folder if it doesn't exist
import os
if not os.path.exists(output_folder):
    os.makedirs(output_folder)

download_movie_data(titles, output_folder)


In [None]:
import os
import requests
import json
import time

# Read the TMDB API key from the environment instead of hardcoding it
# (TMDB_API_KEY is an assumed variable name)
API_KEY = os.environ.get("TMDB_API_KEY")

# Define the base URL for the TMDB API
BASE_URL = 'https://api.themoviedb.org/3'

def search_movie(title):
    """Search for a movie by title and return the movie ID."""
    search_url = f"{BASE_URL}/search/movie"
    params = {
        'api_key': API_KEY,
        'query': title
    }
    response = requests.get(search_url, params=params)
    data = response.json()
    results = data.get('results', [])
    if results:
        return results[0]['id']
    return None

def get_movie_details(movie_id):
    """Get full details of a movie by its ID."""
    details_url = f"{BASE_URL}/movie/{movie_id}"
    params = {
        'api_key': API_KEY,
        'append_to_response': 'release_dates,credits'
    }
    response = requests.get(details_url, params=params)
    return response.json()

def extract_box_office_by_country(details):
    """Extract box office numbers by country from movie details.

    Returns an empty dict unless a release entry carries a 'box_office' key.
    """
    box_office = {}
    if 'release_dates' in details and 'results' in details['release_dates']:
        for country_data in details['release_dates']['results']:
            country = country_data.get('iso_3166_1')
            for release in country_data.get('release_dates', []):
                if 'box_office' in release:
                    box_office[country] = release['box_office']
    return box_office

def extract_actors_and_characters(details):
    """Extract list of actors and their characters from movie credits."""
    actors_characters = []
    if 'credits' in details and 'cast' in details['credits']:
        for cast_member in details['credits']['cast']:
            actors_characters.append({
                'actor': cast_member.get('name'),
                'character': cast_member.get('character')
            })
    return actors_characters

def download_movie_data(titles, output_folder):
    """Download movie data for a list of titles and save to JSON files."""
    for title in titles:
        movie_id = search_movie(title)
        if movie_id:
            details = get_movie_details(movie_id)
            box_office = extract_box_office_by_country(details)
            actors_characters = extract_actors_and_characters(details)

            # Combine details with extracted information
            details['box_office_by_country'] = box_office
            details['actors_characters'] = actors_characters

            file_path = f"{output_folder}/{title.replace(' ', '_')}.json"
            with open(file_path, 'w', encoding='utf-8') as f:
                json.dump(details, f, ensure_ascii=False, indent=4)
            print(f"Downloaded data for '{title}' and saved to '{file_path}'")
        else:
            print(f"No data found for '{title}'")
        time.sleep(1)  # Respect the rate limit by adding a delay

# Example usage
titles = [
    "Inception",
    "The Dark Knight",
    "Interstellar"
]
output_folder = 'Data/movie_data'

# Create the output folder if it doesn't exist
import os
if not os.path.exists(output_folder):
    os.makedirs(output_folder)

download_movie_data(titles, output_folder)


### IMPORT LIBS

In [1]:
import requests
import os
import pandas as pd
import string
import json
import time

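# TMDB API details
BASE_URL = "https://api.themoviedb.org/3"
# Read the key from the environment instead of hardcoding it (TMDB_API_KEY is an assumed variable name)
API_KEY = os.environ.get("TMDB_API_KEY")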

### Get TMDB IDs for all IMDb titles (don't run again):

In [5]:
import pandas as pd

def get_movie_id(IMDB_ID="tt1375666"):
    """Search TMDB for a movie by IMDb ID and return its TMDB ID."""
    url = f"{BASE_URL}/find/{IMDB_ID}"
    params = {
        "api_key": API_KEY,
        "external_source": "imdb_id"
    }
    response = requests.get(url, params=params, timeout=10)
    data = response.json()

    if data.get("movie_results"):  # /find returns movie matches under "movie_results"
        return data["movie_results"][0]["id"]  # TMDB ID of the first match

    print(f"Error fetching {IMDB_ID}")  # No TMDB movie matched this IMDb ID
    return None

def update_lists(df, prior_fetches=None):
    """Fetch TMDB IDs for the IMDb IDs (tconst) in df, skipping any already fetched."""
    prior_fetches = set(prior_fetches or [])  # tconsts fetched in an earlier run
    tconsts = df["tconst"].tolist()

    pairs = []  # (tconst, tmdb_id) tuples; tmdb_id is None when TMDB has no match
    for i, t in enumerate(tconsts):
        if i % 1000 == 0:
            print(i)  # Progress indicator
        if t in prior_fetches:
            continue

        try:
            movie_id = get_movie_id(t)
        except requests.exceptions.RequestException:
            print("Request failed (for example, a timeout); returning the IDs fetched so far")
            return pairs
        pairs.append((t, movie_id))

        time.sleep(0.05)  # Avoid hitting TMDB rate limits

    return pairs

file_path = "data/titles_with_subgenres.csv"
df = pd.read_csv(file_path)
titles_ids = update_lists(df=df)

0
Error fetching tt26900526
1000
Error fetching tt6328046
Error fetching tt27850479
Error fetching tt23058594
2000
3000
Error fetching tt28504311
Error fetching tt11959018
4000
5000
Error fetching tt15354498
6000
Error fetching tt30836377
Error fetching tt32491118
7000
Error fetching tt21932820
Error fetching tt32389619
8000
9000
10000
11000


In [6]:
print(f"Number of downloaded ids: {len(titles_ids)}")

Number of downloaded ids: 11083
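
Since this lookup should not be repeated, it may help to write the `(tconst, TMDB ID)` pairs to disk and reload them in later sessions. The sketch below uses only the standard library; the function names, the file name `tmdb_ids.csv`, and the sample pairs are hypothetical and not part of this notebook.

```python
import csv
import os
import tempfile

def save_id_pairs(pairs, csv_path):
    """Write (tconst, tmdb_id) pairs to CSV; a missing TMDB ID becomes an empty cell."""
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["tconst", "tmdb_id"])
        for tconst, tmdb_id in pairs:
            writer.writerow([tconst, "" if tmdb_id is None else tmdb_id])

def load_id_pairs(csv_path):
    """Read the pairs back, turning empty cells into None again."""
    with open(csv_path, newline="", encoding="utf-8") as f:
        return [
            (row["tconst"], int(row["tmdb_id"]) if row["tmdb_id"] else None)
            for row in csv.DictReader(f)
        ]

# Made-up pairs standing in for titles_ids
pairs = [("tt0000001", 101), ("tt0000002", None)]
with tempfile.TemporaryDirectory() as tmp:
    csv_path = os.path.join(tmp, "tmdb_ids.csv")
    save_id_pairs(pairs, csv_path)
    reloaded = load_id_pairs(csv_path)

unmatched = [t for t, tmdb_id in reloaded if tmdb_id is None]
print(f"{len(reloaded)} pairs reloaded, {len(unmatched)} without a TMDB ID")
```

The reloaded `tconst` values could also be passed as `prior_fetches` if the lookup ever has to resume.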


### FETCH DATA

In [9]:
import os
import requests
import json
import time
import pandas as pd

def get_movie_details(movie_id):
    """Get full details of a movie by its ID."""
    details_url = f"{BASE_URL}/movie/{movie_id}"
    params = {
        'api_key': API_KEY,
        'append_to_response': 'release_dates,credits,keywords'
    }
    response = requests.get(details_url, params=params)
    return response.json()

def extract_actors_and_characters(details):
    """Extract list of actors and their characters from movie credits."""
    actors_characters = []
    if 'credits' in details and 'cast' in details['credits']:
        for cast_member in details['credits']['cast']:
            actors_characters.append({
                'actor': cast_member.get('name'),
                'character': cast_member.get('character')
            })
    return actors_characters

def download_movie_data(titles_ids, output_folder):
    """Download TMDB data for (tconst, TMDB ID) pairs and save one JSON file per movie."""
    os.makedirs(output_folder, exist_ok=True)  # Make sure the output folder exists
    for x, (t, i) in enumerate(titles_ids):
        if x % 500 == 0:
            print(x)  # Progress indicator

        file_path = f"{output_folder}/{t}.json"  # File is named after the IMDb tconst
        if os.path.exists(file_path):
            continue  # Already downloaded in an earlier run

        if i:
            details = get_movie_details(i)
            details['actors_characters'] = extract_actors_and_characters(details)
            with open(file_path, 'w', encoding='utf-8') as f:
                json.dump(details, f, ensure_ascii=False, indent=4)
        else:
            print(f"No data found for '{t}'")
        time.sleep(0.05)  # Avoid hitting TMDB rate limits

download_movie_data(titles_ids=titles_ids, output_folder="data/subgenre_titles_data")

0
500
No data found for 'tt26900526'
1000
No data found for 'tt6328046'
No data found for 'tt27850479'
No data found for 'tt23058594'
1500
2000
2500
3000
No data found for 'tt28504311'
No data found for 'tt11959018'
3500
4000
4500
5000
5500
No data found for 'tt15354498'
6000
No data found for 'tt30836377'
No data found for 'tt32491118'
6500
7000
7500
No data found for 'tt21932820'
No data found for 'tt32389619'
8000
8500
9000
9500
10000
10500
11000
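
The download loop saves whatever JSON body comes back, so a quick sanity pass over the output folder can catch files that are not movie records. The sketch below flags any file that fails to parse or has no top-level `id` key; treating a missing `id` as a failed download is an assumption, and the function name and sample payloads are made up for illustration.

```python
import json
import tempfile
from pathlib import Path

def find_suspect_files(folder):
    """Return names of JSON files that do not parse or lack a top-level 'id' key."""
    suspects = []
    for path in sorted(Path(folder).glob("*.json")):
        try:
            with open(path, encoding="utf-8") as f:
                details = json.load(f)
        except json.JSONDecodeError:
            suspects.append(path.name)
            continue
        if not isinstance(details, dict) or "id" not in details:
            suspects.append(path.name)
    return suspects

# Demo with made-up payloads in a temporary folder
with tempfile.TemporaryDirectory() as tmp:
    good = {"id": 1, "title": "Example Movie"}
    bad = {"success": False, "status_message": "made-up error"}
    (Path(tmp) / "tt0000001.json").write_text(json.dumps(good), encoding="utf-8")
    (Path(tmp) / "tt0000002.json").write_text(json.dumps(bad), encoding="utf-8")
    suspects = find_suspect_files(tmp)

print(suspects)
```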


Get a list of unmatched movies:

In [8]:
file_list = os.listdir("Data/movie_data")
titles_json = [f.replace(".json", "").replace("_", " ").lower() for f in file_list]
titles_screenplays = [s.lower() for s in screenplays]
fails = []

for t in titles_screenplays:
    if t not in titles_json:
        fails.append(t)

print(f"There were {len(fails)} unmatched titles")

There were 315 unmatched titles


Fuzzy matches:

In [45]:
# Get titles and release years from IMDb data for movies released since 1920
#dat = pd.read_csv("Data/title.basics.tsv", sep="\t")
dat2 = dat.copy()

def convert_to_int_or_1900(x):
    """Return x unchanged if it parses as an integer, else the placeholder '1900'."""
    try:
        int(x)
        return x
    except ValueError:
        return '1900'

def process_string(s):
    """Lowercase, strip, and remove punctuation from a title."""
    if isinstance(s, str):
        return s.lower().strip().translate(translator)
    else:
        return str(s)

# clean years (the helper must be defined before it is applied)
dat2["startYear"] = dat2["startYear"].apply(convert_to_int_or_1900)

# keep non-adult movies from 1920 onwards
is_movie = (dat2["titleType"] == "movie") & (dat2["isAdult"] == 0) & (dat2["startYear"].astype(int) >= 1920)

# extract and clean titles
translator = str.maketrans('', '', string.punctuation)
titles = list(dat2[is_movie]["primaryTitle"])
titles_clean = [process_string(s) for s in titles]

# extract years
years = list(dat2[is_movie]["startYear"])

titles_years = list(zip(titles, years))
print(f"The list has {len(titles_years)} titles")

The list has 557983 titles


In [56]:
pd.DataFrame(titles_years, columns=["title", "year"]).to_csv("Data/imdb_titles_years.csv")

In [51]:
from fuzzywuzzy import fuzz, process

BASE_URL = 'https://api.themoviedb.org/3'

def search_movie(title):
    """Search for a movie by title and return the results."""
    search_url = f"{BASE_URL}/search/movie"
    params = {
        'api_key': API_KEY,
        'query': title
    }
    response = requests.get(search_url, params=params)
    return response.json()

def find_best_match(title, candidates):
    """Find the best match for the given title from the list of candidates using fuzzy matching."""
    best_match = process.extractOne(title, candidates, scorer=fuzz.token_sort_ratio)
    return best_match

def match_titles_to_tmdb(titles, candidates, verbose=False):
    matched_results = []
    for title in titles:
        best_match = find_best_match(title, candidates)
        if best_match:
            matched_results.append({
                'original_title': title,
                'best_match_title': best_match[0],
                'similarity_score': best_match[1],
                'tmdb_data': next((candidate for candidate in candidates if candidate == best_match[0]), None)
            })
            if verbose:
                print(f"Best match for '{title}': '{best_match[0]}' with a similarity score of {best_match[1]}")
        else:
            print(f"No match found for '{title}'")
    return matched_results

# Example usage
matched_results = match_titles_to_tmdb(titles=fails, candidates=titles)
"""
# Print matched results
for result in matched_results:
    print(f"Original Title: {result['original_title']}")
    print(f"Best Match Title: {result['best_match_title']}")
    print(f"Similarity Score: {result['similarity_score']}")
    print(f"TMDB Data: {result['tmdb_data']}")
    print("\n")"""

print("The fuzzy matches are complete")



The fuzzy matches are complete
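
To show what a token-sort comparison does, here is a small standard-library stand-in. It is not `fuzzywuzzy`'s implementation: it lowercases each title, sorts its words, and scores the result with `difflib.SequenceMatcher`, so scores may differ from the ones printed in this notebook. The function names are hypothetical.

```python
from difflib import SequenceMatcher

def token_sort_score(a, b):
    """Similarity from 0 to 100 after lowercasing and sorting each title's words."""
    def normalise(s):
        return " ".join(sorted(s.lower().split()))
    return round(100 * SequenceMatcher(None, normalise(a), normalise(b)).ratio())

def best_match(title, candidates):
    """Return (candidate, score) for the highest-scoring candidate, or None if empty."""
    if not candidates:
        return None
    scored = ((c, token_sort_score(title, c)) for c in candidates)
    return max(scored, key=lambda pair: pair[1])

# Titles taken from the review log in this notebook
match = best_match("bay watch", ["Blizzard", "Baywatch", "Paperman"])
print(match)
```

Sorting the words first makes the score insensitive to word order, which is why "chronicle 2 martyr" and "martyr chronicle 2" receive the same match in the review output.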


Manual review of fuzzy matches:

In [54]:
import sys
import time  # Optional: For adding a delay, which can be useful for demonstration purposes

is_match = []

for d in matched_results:
    # Construct the message
    message = f"Screenplay: {d['original_title']}, Matched: {d['best_match_title']}, Score: {d['similarity_score']}   "
    
    # Clear the line and print the message
    sys.stdout.write('\033[K' + message + '\r')
    sys.stdout.flush()
    
    # Capture user input
    user_input = input("\nIs this a correct match? (yes/no): ")
    is_match.append(user_input)
    
    # Optional: Delay for demonstration purposes, remove this in the real script
    time.sleep(1)

# Optional: Clear the line after the loop is done
sys.stdout.write("\033[K\r")
sys.stdout.flush()

print("\nReview completed.")

Screenplay: hack slash, Matched: Hack/Slash, Score: 100
Is this a correct match? (yes/no):  y

Screenplay: how to get away with murder 1x01 pilot 2014, Matched: How to Get Away with It, Score: 70
Is this a correct match? (yes/no):  n

Screenplay: ground hog day, Matched: Groundhog Day, Score: 96
Is this a correct match? (yes/no):  y

Screenplay: ceramic life, Matched: Eri Ife, Score: 74
Is this a correct match? (yes/no):  n

Screenplay: bay watch, Matched: Baywatch, Score: 94
Is this a correct match? (yes/no):  y

Screenplay: chronicle 2 martyr, Matched: Mutant Chronicles, Score: 74
Is this a correct match? (yes/no):  n

Screenplay: the green effect, Matched: The Red Effect, Score: 87
Is this a correct match? (yes/no):  n

Screenplay: hell incorporated, Matched: Gals, Incorporated, Score: 82
Is this a correct match? (yes/no):  n

Screenplay: london rocks, Matched: London Rock, Score: 96
Is this a correct match? (yes/no):  n

Screenplay: peasantville, Matched: Pleasantville, Score: 96
Is this a correct match? (yes/no):  y

Screenplay: avengers worlds collide, Matched: As Worlds Collide, Score: 85
Is this a correct match? (yes/no):  n

Screenplay: bounty jumpers, Matched: Body Jumper, Score: 80
Is this a correct match? (yes/no):  n

Screenplay: sherlock ep3, Matched: Sherlock Jr., Score: 78
Is this a correct match? (yes/no):  n

Screenplay: the dragons of krull, Matched: Dragons on the Hill, Score: 82
Is this a correct match? (yes/no):  n

Screenplay: it came from the drive in, Matched: It Came from the Desert, Score: 79
Is this a correct match? (yes/no):  n

Screenplay: the silver linings playbook, Matched: Silver Linings Playbook, Score: 92
Is this a correct match? (yes/no):  y

Screenplay: fully automatic, Matched: Automatic, Score: 75
Is this a correct match? (yes/no):  n

Screenplay: driving miss daisey, Matched: Driving Miss Daisy, Score: 97
Is this a correct match? (yes/no):  y

Screenplay: burn after heading, Matched: Burn After Reading, Score: 94
Is this a correct match? (yes/no):  y

Screenplay: hudsuckerproxy, Matched: The Hudsucker Proxy, Score: 85
Is this a correct match? (yes/no):  y

Screenplay: danny graves man cave, Matched: Dancing on Graves, Score: 68
Is this a correct match? (yes/no):  n

Screenplay: the nine lives of chloe king salvation, Matched: Saving the Lives of Children, Score: 73
Is this a correct match? (yes/no):  n

Screenplay: alien engineers, Matched: Engineers, Score: 75
Is this a correct match? (yes/no):  n

Screenplay: the amity ville asylum, Matched: The Amityville Asylum, Score: 74
Is this a correct match? (yes/no):  y

Screenplay: martyr chronicle 2, Matched: Mutant Chronicles, Score: 74
Is this a correct match? (yes/no):  n

Screenplay: an october wedding, Matched: Another Wedding?, Score: 85
Is this a correct match? (yes/no):  n

Screenplay: killing charlie kaufman, Matched: Killing Christian, Score: 70
Is this a correct match? (yes/no):  n

Screenplay: papermoon, Matched: Paperman, Score: 82
Is this a correct match? (yes/no):  n

Screenplay: captain pillips, Matched: Captain Phillips, Score: 97
Is this a correct match? (yes/no):  y

Screenplay: back to the future 2&3, Matched: Back to the Future, Score: 90
Is this a correct match? (yes/no):  n

Screenplay: to kill a mocking bird, Matched: To Kill a Mockingbird, Score: 79
Is this a correct match? (yes/no):  y

Screenplay: youre dead meat piplowski, Matched: I Eat Your Skin, Score: 65
Is this a correct match? (yes/no):  n

Screenplay: a wizard of earthsea, Matched: Hearts of War, Score: 73
Is this a correct match? (yes/no):  n

Screenplay: big trouble in little china 2, Matched: Big Trouble in Little China, Score: 96
Is this a correct match? (yes/no):  n

Screenplay: hulk john turman, Matched: Johnny Tremain, Score: 67
Is this a correct match? (yes/no):  n

Screenplay: tom raider, Matched: Womb Raider, Score: 86
Is this a correct match? (yes/no):  n

Screenplay: bizzaro, Matched: Blizzard, Score: 80
Is this a correct match? (yes/no):  n

Screenplay: the keys to the street, Matched: The Street Descends to the Sea, Score: 77
Is this a correct match? (yes/no):  n

Screenplay: miss congeniality ii, Matched: Miss Congeniality, Score: 92
Is this a correct match? (yes/no):  n

Screenplay: latchkeepers annotated, Matched: Another Watcher, Score: 65
Is this a correct match? (yes/no):  n

Screenplay: all the pretty dead girls, Matched: All The Pretty Girls, Score: 89
Is this a correct match? (yes/no):  n

Screenplay: war of the worlds (1951), Matched: War of the Worlds, Score: 87

KeyboardInterrupt: Interrupted by user
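
Because the review was interrupted, `is_match` is shorter than `matched_results`, and the recorded answers are `y`/`n` rather than `yes`/`no`. A minimal sketch of how the two lists could be combined is shown below; `accepted_matches` is a hypothetical helper, and the sample entries are copied from the log above.

```python
def accepted_matches(matched_results, answers):
    """Map screenplay title to matched title for every answer starting with 'y'.

    zip() stops at the shorter list, so matches that were never reviewed
    (for example after an interrupted session) are left out.
    """
    accepted = {}
    for result, answer in zip(matched_results, answers):
        if answer.strip().lower().startswith("y"):
            accepted[result["original_title"]] = result["best_match_title"]
    return accepted

# Three entries from the review log; the last one was never answered
sample_results = [
    {"original_title": "ground hog day", "best_match_title": "Groundhog Day"},
    {"original_title": "ceramic life", "best_match_title": "Eri Ife"},
    {"original_title": "war of the worlds (1951)", "best_match_title": "War of the Worlds"},
]
sample_answers = ["y", "n"]

accepted = accepted_matches(sample_results, sample_answers)
print(accepted)
```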

Save fuzzy match preds and results:

In [None]:
# Saving dictionary as JSON
with open("Data/Fuzzy Title Matches.json", 'w') as json_file:
    json.dump(matched_results, json_file, indent=4)