# Fetching Movie Data

Our current dataset contains roughly 90,000 movies (87,585 rows in `links.csv`), but for each movie it includes only the TMDB id and the IMDB id. To enrich the dataset, we will fetch each movie's details using the TMDB API.

In [60]:
import pandas as pd
import requests
import time
from concurrent.futures import ThreadPoolExecutor, as_completed

import warnings
warnings.filterwarnings('ignore')

import os
from dotenv import load_dotenv
# Load the environment variables from the .env file
load_dotenv()

True
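`load_dotenv()` returning `True` only indicates that a `.env` file was loaded, not that it contains the key we need. `os.getenv` returns `None` for a missing variable, and that value would then be sent with every request. A minimal sketch of a fail-fast check follows; `require_env` and the variable name `DEMO_TMDB_KEY` are hypothetical and not part of the notebook:

```python
import os


def require_env(name):
    """Return the value of environment variable `name`; raise if it is unset or empty."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"Environment variable {name!r} is not set")
    return value


# Demo with a throwaway variable (hypothetical name, not used by the notebook)
os.environ["DEMO_TMDB_KEY"] = "abc123"
demo_key = require_env("DEMO_TMDB_KEY")
print(demo_key)
```

In this notebook the same check could wrap the lookup of `tmdb_api_key` before any request is sent.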

The `links.csv` file has three columns:

- `movieId`: the internal id used in our dataset
- `imdbId`: the id that the movie has on IMDB
- `tmdbId`: the id that the movie has on TMDB

We will keep only `movieId` and `tmdbId`.

In [None]:
mdf = pd.read_csv('../Data/Raw/links.csv')
mdf.head()

Unnamed: 0,movieId,imdbId,tmdbId
0,1,114709,862.0
1,2,113497,8844.0
2,3,113228,15602.0
3,4,114885,31357.0
4,5,113041,11862.0


In [62]:
mdf.shape

(87585, 3)

In [63]:
# Check how many null values we have
mdf.isna().sum()

movieId      0
imdbId       0
tmdbId     124
dtype: int64

In [64]:
# Check if there are duplicated values 
mdf[mdf.duplicated(subset=['tmdbId'], keep=False)].sort_values(by='tmdbId').head(10)

Unnamed: 0,movieId,imdbId,tmdbId
5892,6003,290538,4912.0
34340,144606,270288,4912.0
10103,34330,368089,9775.0
49943,178755,376800,9775.0
4138,4241,266860,10991.0
47959,174533,235679,10991.0
5562,5672,313487,12600.0
47964,174543,287635,12600.0
61157,202599,1016268,13020.0
9897,33154,413845,13020.0


We will perform the following actions:

- Remove any duplicate entries based on the TMDB ID.
- Drop the IMDB ID column, as it will not be used in the analysis.
- Remove rows that don't have a valid TMDB ID.
- Convert the TMDB ID column to integers for consistency.
- Rename the TMDB ID column to "id" in preparation for fetching the movie data.

In [65]:
mdf.drop_duplicates(subset=['tmdbId'], inplace=True) 
mdf.drop(columns='imdbId', inplace=True)  
mdf.dropna(subset=['tmdbId'], inplace=True)  
mdf['tmdbId'] = mdf['tmdbId'].astype('int')
mdf.rename(columns={'tmdbId': 'id'}, inplace=True)
mdf.head()  

Unnamed: 0,movieId,id
0,1,862
1,2,8844
2,3,15602
3,4,31357
4,5,11862
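One subtlety in the cleaning cell: `drop_duplicates` treats missing values as equal to each other, so it collapses every row with a missing `tmdbId` into one row, which `dropna` then removes. The sketch below repeats the same steps on made-up data (the values are illustrative, not taken from `links.csv`), written as a chain instead of with `inplace=True`:

```python
import numpy as np
import pandas as pd

# Toy data: one duplicated TMDB id and two missing ones
toy = pd.DataFrame({
    "movieId": [1, 2, 3, 4, 5],
    "imdbId": [10, 20, 30, 40, 50],
    "tmdbId": [862.0, 862.0, np.nan, np.nan, 8844.0],
})

clean = (
    toy.drop_duplicates(subset=["tmdbId"])   # keeps the first 862 and the first NaN
       .drop(columns="imdbId")
       .dropna(subset=["tmdbId"])            # removes the remaining NaN row
       .astype({"tmdbId": "int64"})          # safe only after the NaNs are gone
       .rename(columns={"tmdbId": "id"})
)
print(clean)
```

The integer conversion has to come after `dropna`, because a column that still contains `NaN` cannot be cast to a plain integer dtype.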


## Fetch the details of the films with the TMDB API

In [66]:
ids = mdf['id'].values
ids.shape

(87425,)

In [67]:
def fetch_movie(movie_id, api_key):
    """
    Function to fetch data for a single movie from TMDB API
    Args:
        movie_id (int): ID of the movie
        api_key (str): API key for authentication

    Returns:
        dict: Movie data if request is successful, None if failed
    """
    url = f"https://api.themoviedb.org/3/movie/{movie_id}?api_key={api_key}"
    try:
        # A timeout keeps a stalled connection from blocking a worker thread indefinitely
        response = requests.get(url, timeout=10)
        if response.status_code == 200:
            return response.json()
        else:
            return None  # Return None if response code is not 200 (success)
    
    except requests.exceptions.RequestException as e:
        # Handle any network-related or request errors
        print(f"Error fetching data for movie ID {movie_id}: {e}")
        return None

In [68]:
def fetch_movies(ids, api_key):
    """
    Function to fetch movie data for multiple movie IDs using concurrent requests
    Args:
        ids (list): List of movie IDs to fetch
        api_key (str): API key for authentication
    
    Returns:
        tuple: A tuple containing two lists:
            - List of successfully fetched movies (as JSON)
            - List of movie IDs for which fetching data failed
    """
    id_errors = [] 
    movies = [] 

    # Using ThreadPoolExecutor to send multiple requests concurrently
    with ThreadPoolExecutor(max_workers=10) as executor:
        # Submitting the fetch_movie function to the executor for each movie ID
        futures = {executor.submit(fetch_movie, movie_id, api_key): movie_id for movie_id in ids}

        # Processing results as they complete
        for future in as_completed(futures):
            movie_id = futures[future]
            movie_data = future.result()
            
            if movie_data:
                movies.append(movie_data)
            else:
                id_errors.append(movie_id)
            
            # Brief pause between results. All requests were already submitted above,
            # so this does not throttle the worker threads' calls to the API.
            time.sleep(0.001)

    return movies, id_errors
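If the API's rate limit becomes a concern, throttling has to happen inside the worker threads, before each request is sent. The following is a minimal sketch of one way to do that with a shared, thread-safe limiter. `RateLimiter`, `throttled_map`, and `fake_fetch` are hypothetical names, and `fake_fetch` stands in for a real HTTP call:

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


class RateLimiter:
    """Allow at most `rate` calls per second across all threads."""

    def __init__(self, rate):
        self.interval = 1.0 / rate
        self.lock = threading.Lock()
        self.next_slot = time.monotonic()

    def wait(self):
        # Reserve the next free time slot under the lock, then sleep until it arrives
        with self.lock:
            now = time.monotonic()
            slot = max(self.next_slot, now)
            self.next_slot = slot + self.interval
        delay = slot - now
        if delay > 0:
            time.sleep(delay)


def throttled_map(func, items, rate, max_workers=10):
    """Apply `func` to each item concurrently, starting at most `rate` calls per second."""
    limiter = RateLimiter(rate)

    def task(item):
        limiter.wait()
        return func(item)

    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # executor.map returns results in input order
        return list(executor.map(task, items))


def fake_fetch(movie_id):
    # Stand-in for a real request (assumption: the real call would return a dict or None)
    return {"id": movie_id}


start = time.monotonic()
results = throttled_map(fake_fetch, range(20), rate=100)
elapsed = time.monotonic() - start
print(len(results), round(elapsed, 2))
```

The rate used here is arbitrary; the appropriate value depends on the limits TMDB currently enforces, which the notebook does not state.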

In [69]:
api_key = os.getenv('tmdb_api_key')
movies, id_errors = fetch_movies(ids, api_key)

print(f"Fetched {len(movies)} movies")
print(f"Failed to fetch {len(id_errors)} movies")

Fetched 86351 movies
Failed to fetch 1074 movies


## Convert Successful Fetches to DataFrame

We convert the movies for which we successfully retrieved details into a DataFrame. Afterward, we will merge this DataFrame with the ids from the original dataset.

In [70]:
fetched = pd.DataFrame(data=movies)
fetched.head().transpose()

Unnamed: 0,0,1,2,3,4
adult,False,False,False,False,False
backdrop_path,/iZGiMD0p1M2AOmzKknFo5bkuz94.jpg,/62BnXyJtVEq4WKNSpnPG7QPYYDI.jpg,/oMGV48EGhsNavC1PL8HMeWs5Udq.jpg,/jP8lHNHD89xaRPfAdyz5KEVYcSb.jpg,/3Rfvhy1Nl6sSGJwyjb0QiZzZYlB.jpg
belongs_to_collection,,,,,"{'id': 10194, 'name': 'Toy Story Collection', ..."
budget,50000000,62000000,58000000,0,30000000
genres,"[{'id': 80, 'name': 'Crime'}, {'id': 18, 'name...","[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...","[{'id': 10749, 'name': 'Romance'}, {'id': 18, ...","[{'id': 10751, 'name': 'Family'}, {'id': 28, '...","[{'id': 16, 'name': 'Animation'}, {'id': 12, '..."
homepage,,,,,http://toystory.disney.com/toy-story
id,524,9087,11860,45325,862
imdb_id,tt0112641,tt0112346,tt0114319,tt0112302,tt0114709
origin_country,[US],[US],[US],[US],[US]
original_language,en,en,en,en,en


Finally, we will merge the fetched movie data with `mdf` on `id`, so that each record carries its original `movieId`.

In [71]:
final_df = pd.merge(fetched, mdf, on='id')
final_df.head().transpose()

Unnamed: 0,0,1,2,3,4
adult,False,False,False,False,False
backdrop_path,/iZGiMD0p1M2AOmzKknFo5bkuz94.jpg,/62BnXyJtVEq4WKNSpnPG7QPYYDI.jpg,/oMGV48EGhsNavC1PL8HMeWs5Udq.jpg,/jP8lHNHD89xaRPfAdyz5KEVYcSb.jpg,/3Rfvhy1Nl6sSGJwyjb0QiZzZYlB.jpg
belongs_to_collection,,,,,"{'id': 10194, 'name': 'Toy Story Collection', ..."
budget,50000000,62000000,58000000,0,30000000
genres,"[{'id': 80, 'name': 'Crime'}, {'id': 18, 'name...","[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...","[{'id': 10749, 'name': 'Romance'}, {'id': 18, ...","[{'id': 10751, 'name': 'Family'}, {'id': 28, '...","[{'id': 16, 'name': 'Animation'}, {'id': 12, '..."
homepage,,,,,http://toystory.disney.com/toy-story
id,524,9087,11860,45325,862
imdb_id,tt0112641,tt0112346,tt0114319,tt0112302,tt0114709
origin_country,[US],[US],[US],[US],[US]
original_language,en,en,en,en,en
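`pd.merge` defaults to an inner join, which silently drops fetched rows whose `id` has no match in `mdf` and would duplicate rows if an `id` appeared more than once on either side. A hedged way to make such problems visible is to use the `validate` and `indicator` options. The frames below are toy data with made-up ids and titles, not the notebook's `fetched` and `mdf`:

```python
import pandas as pd

# Toy stand-ins for the fetched details and the cleaned links table
fetched_toy = pd.DataFrame({"id": [862, 8844, 999], "title": ["A", "B", "C"]})
links_toy = pd.DataFrame({"movieId": [1, 2, 3], "id": [862, 8844, 15602]})

checked = pd.merge(
    fetched_toy,
    links_toy,
    on="id",
    how="left",              # keep every fetched row, matched or not
    validate="one_to_one",   # raise MergeError if an id repeats on either side
    indicator=True,          # add a `_merge` column describing each row's origin
)

# Fetched rows that found no partner in the links table
unmatched = checked[checked["_merge"] == "left_only"]
print(unmatched[["id", "title"]])
```

On the real data, an empty `unmatched` frame would confirm that every fetched movie maps back to a `movieId`.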


In [72]:
final_df[['title', 'id']].isna().sum()

title    0
id       0
dtype: int64

In [73]:
final_df.shape 

(86351, 27)

Our final dataset contains 86,351 movies with detailed metadata across 27 columns. We will save this dataset for now and proceed with preprocessing at a later stage.

In [None]:
final_df.to_parquet('../Data/Raw/movies_metadata.parquet', index=False)
