## Movies Crew and Keywords

Having cleaned and filtered the original movies dataset, we now fetch each movie's credits (such as the actors and directors) and its keywords, to build a dataset suitable for a content-based recommender system.

In [1]:

import pandas as pd
import numpy as np
import requests
import json
import time
from concurrent.futures import ThreadPoolExecutor, as_completed
import os
from dotenv import load_dotenv
import warnings
warnings.filterwarnings('ignore')

# Load the environment variables .env
load_dotenv()

True

Get the IDs of the desired films

In [2]:
ids = pd.read_csv('../Data/Processed/movies_ids.csv')
ids.head()

Unnamed: 0,id
0,524
1,9087
2,11860
3,862
4,12110


Create a function to handle the endpoints of the TMDB API

In [None]:
def fetch_tmdb_data(movie_id, api_key, endpoint="", retries=3):
    """Generic function to fetch any TMDB data for a movie"""
    url = f"https://api.themoviedb.org/3/movie/{movie_id}/{endpoint}?api_key={api_key}"
    try:
        # A timeout keeps a stalled connection from blocking a worker thread forever
        response = requests.get(url, timeout=10)
        if response.status_code == 200:
            return response.json()
        elif response.status_code == 429 and retries > 0:  # Too Many Requests
            # Sleep for a bit if we hit rate limits, then retry a bounded number of times
            time.sleep(2)
            return fetch_tmdb_data(movie_id, api_key, endpoint, retries - 1)
        else:
            print(f"Error: HTTP {response.status_code} for movie {movie_id}")
            return None
    except Exception as e:
        print(f"Error fetching data for movie {movie_id}: {e}")
        return None
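
The fixed two-second sleep on HTTP 429 could also be replaced by exponential backoff with a cap on the number of attempts. The following is a minimal sketch of that idea, not part of the original notebook: `fetch_with_backoff` and `fetch_once` are hypothetical names, and `fetch_once` stands in for a single `requests.get` call so the logic can be checked without touching the network.

```python
import time

def fetch_with_backoff(fetch_once, max_retries=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch_once() until it stops returning HTTP 429 or retries run out.

    fetch_once is a hypothetical zero-argument callable returning
    (status_code, payload); it stands in for one HTTP request.
    """
    for attempt in range(max_retries + 1):
        status, payload = fetch_once()
        if status == 200:
            return payload
        if status != 429:
            return None  # non-retryable error
        if attempt < max_retries:
            sleep(base_delay * 2 ** attempt)  # wait 1s, 2s, 4s, ...
    return None  # still rate limited after all retries


# Fake endpoint that is rate limited twice and then succeeds
responses = iter([(429, None), (429, None), (200, {"id": 862})])
waits = []  # records the requested sleep durations instead of really sleeping
result = fetch_with_backoff(lambda: next(responses), sleep=waits.append)
```

Passing `sleep` as a parameter is only a convenience for testing; in real use the default `time.sleep` would apply.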

In [None]:
# Create endpoint-specific functions that use the global one
def fetch_credits(movie_id, api_key):
    return fetch_tmdb_data(movie_id, api_key, "credits")

def fetch_keywords(movie_id, api_key):
    return fetch_tmdb_data(movie_id, api_key, "keywords")

In [6]:
def fetch_movies(ids, api_key, fetch_func, max_workers=10, delay=0.05):
    """
    Function to fetch movie data for multiple movie IDs concurrently
    
    Args:
        ids: List of movie IDs
        api_key: TMDB API key
        fetch_func: Function to fetch specific data (e.g., fetch_credits)
        max_workers: Max number of concurrent requests
        delay: Small pause after each completed result is processed (it does not space out the submitted requests)
        
    Returns:
        (successful_results, failed_ids)
    """
    successful = []
    failed = []
    
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = {executor.submit(fetch_func, movie_id, api_key): movie_id for movie_id in ids}
        
        for future in as_completed(futures):
            movie_id = futures[future]
            result = future.result()
            
            if result:
                successful.append(result)
            else:
                failed.append(movie_id)
            
            # Brief pause between processing results; request concurrency is capped by max_workers
            time.sleep(delay)
    
    return successful, failed
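
To see how the submit / `as_completed` pattern behaves, here is a small sketch that runs it against a stub instead of the TMDB API. The names `fake_fetch` and `collect` are hypothetical and exist only for this illustration. Because `as_completed` yields futures in the order they finish, the collected results do not necessarily follow the order of `ids`, so the example sorts them before inspecting them.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def fake_fetch(movie_id, api_key):
    """Hypothetical stand-in for fetch_credits: odd ids 'fail' by returning None."""
    return {"id": movie_id} if movie_id % 2 == 0 else None

def collect(ids, api_key, fetch_func, max_workers=4):
    """Same submit / as_completed pattern as fetch_movies, without the delay."""
    successful, failed = [], []
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # Map each future back to its movie id so failures can be reported
        futures = {executor.submit(fetch_func, i, api_key): i for i in ids}
        for future in as_completed(futures):
            result = future.result()
            if result:
                successful.append(result)
            else:
                failed.append(futures[future])
    return successful, failed

ok, bad = collect([1, 2, 3, 4, 5], "dummy-key", fake_fetch)
fetched_ids = sorted(r["id"] for r in ok)  # completion order is not guaranteed
failed_ids = sorted(bad)
```

Since the result order is arbitrary, later steps should join on the `id` field rather than rely on row position.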

In [7]:
api_key = os.getenv('tmdb_api_key')
movie_ids = ids['id']
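
`os.getenv` returns `None` when the variable is missing, in which case every request would fail and end up in the failed list. A small guard can surface that earlier. This is a sketch under the assumption that failing fast is preferred; `require_env` is a hypothetical helper, not part of the notebook.

```python
import os

def require_env(name):
    """Return the environment variable `name`, or raise if it is missing or empty."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"Environment variable {name!r} is not set; check the .env file")
    return value
```

In the notebook it could be used as `api_key = require_env('tmdb_api_key')`.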

With the fetching functions in place, we can now retrieve the keywords.

In [8]:
# Fetching keywords
keywords, failed_keywords = fetch_movies(movie_ids, api_key, fetch_keywords)
print(f"Fetched {len(keywords)} movie keywords, failed {len(failed_keywords)}")

Fetched 10073 movie keywords, failed 0


#### Keywords

In [9]:
keywords = pd.DataFrame(keywords)
keywords.head()

Unnamed: 0,id,keywords
0,4584,"[{'id': 7879, 'name': 'secret love'}, {'id': 2..."
1,21032,"[{'id': 1994, 'name': 'wolf'}, {'id': 2551, 'n..."
2,5,"[{'id': 612, 'name': 'hotel'}, {'id': 613, 'na..."
3,862,"[{'id': 10084, 'name': 'rescue'}, {'id': 6054,..."
4,8844,"[{'id': 7035, 'name': 'giant insect'}, {'id': ..."


Now retrieve the credits information

In [10]:
# Fetching credits
credits, failed_credits = fetch_movies(movie_ids, api_key, fetch_credits)
print(f"Fetched {len(credits)} movie credits, failed {len(failed_credits)}")

Fetched 10073 movie credits, failed 0


#### Credits

In [11]:
credits = pd.DataFrame(credits)
credits.head()

Unnamed: 0,id,cast,crew
0,9087,"[{'adult': False, 'gender': 2, 'id': 3392, 'kn...","[{'adult': False, 'gender': 2, 'id': 3026, 'kn..."
1,524,"[{'adult': False, 'gender': 2, 'id': 380, 'kno...","[{'adult': False, 'gender': 2, 'id': 1032, 'kn..."
2,4584,"[{'adult': False, 'gender': 1, 'id': 7056, 'kn...","[{'adult': False, 'gender': 2, 'id': 2226, 'kn..."
3,12110,"[{'adult': False, 'gender': 2, 'id': 7633, 'kn...","[{'adult': False, 'gender': 2, 'id': 71303, 'k..."
4,862,"[{'adult': False, 'gender': 2, 'id': 31, 'know...","[{'adult': False, 'gender': 0, 'id': 12893, 'k..."


Both fetches completed without failures: keywords and credits were retrieved for all 10073 movies.

In [12]:
# Ensure we do not have duplicated rows
credits[credits['id'].duplicated()].shape[0], keywords[keywords['id'].duplicated()].shape[0]

(0, 0)

## Extracting the Director and Actors

1. **Crew:** From the crew, we will only extract the director.
2. **Cast:** Lesser-known actors and minor roles rarely affect people's opinion of a movie. Therefore, we keep only the major characters and their actors by taking the first 3 actors that appear in the credits list.

In [13]:
# Convert from JSON to a Python object
credits['crew'] = credits['crew'].apply(lambda x: json.loads(x) if isinstance(x, str) else x)
credits['cast'] = credits['cast'].apply(lambda x: json.loads(x) if isinstance(x, str) else x)

In [14]:
# Extract the director from the crew
def get_director(x):
    for item in x:
        if item['job'] == 'Director':
            return item['name']
    return np.nan

In [15]:
credits['director'] = credits['crew'].apply(get_director)
# Get the top 3 actors from the cast (slicing also handles lists shorter than 3)
credits['cast'] = credits['cast'].apply(lambda x: [item['name'] for item in x[:3]])

In [16]:
credits.head()

Unnamed: 0,id,cast,crew,director
0,9087,"[Michael Douglas, Annette Bening, Martin Sheen]","[{'adult': False, 'gender': 2, 'id': 3026, 'kn...",Rob Reiner
1,524,"[Robert De Niro, Sharon Stone, Joe Pesci]","[{'adult': False, 'gender': 2, 'id': 1032, 'kn...",Martin Scorsese
2,4584,"[Emma Thompson, Kate Winslet, Alan Rickman]","[{'adult': False, 'gender': 2, 'id': 2226, 'kn...",Ang Lee
3,12110,"[Leslie Nielsen, Mel Brooks, Amy Yasbeck]","[{'adult': False, 'gender': 2, 'id': 71303, 'k...",Mel Brooks
4,862,"[Tom Hanks, Tim Allen, Don Rickles]","[{'adult': False, 'gender': 0, 'id': 12893, 'k...",John Lasseter
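
The extraction logic can be checked on a small hand-made record. The sketch below mirrors the two steps (first director, first three cast names) using hypothetical helper names `first_director` and `top_cast`; the record only mimics the shape of a credits response and its values are placeholders.

```python
import math

# Hand-made record mimicking the shape of a credits response (values are placeholders)
record = {
    "cast": [{"name": "Actor A"}, {"name": "Actor B"}, {"name": "Actor C"}, {"name": "Actor D"}],
    "crew": [{"job": "Producer", "name": "Person P"}, {"job": "Director", "name": "Person D"}],
}

def first_director(crew):
    """Return the name of the first crew member whose job is 'Director', else NaN."""
    return next((m["name"] for m in crew if m.get("job") == "Director"), float("nan"))

def top_cast(cast, n=3):
    """Return the names of the first n cast members (fewer if the list is shorter)."""
    return [m["name"] for m in cast[:n]]

director = first_director(record["crew"])
actors = top_cast(record["cast"])
no_director = first_director([{"job": "Producer", "name": "Person P"}])
```

As with `get_director`, a film with several directors yields only the first one listed.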


Finally, we delete the crew column, since we have already extracted the relevant information.

In [17]:
# Delete the crew column
credits.drop(columns='crew', inplace=True)

# Set cast to NaN if the film has no information about the actors
credits['cast'] = credits['cast'].apply(lambda x: np.nan if not x else x)
credits.isna().sum()

id          0
cast        9
director    0
dtype: int64

Every movie has a director, but 9 movies have no cast information.

## Extracting the Keywords

Now we are going to generate a list of all the keywords for each movie

In [18]:
keywords.iloc[0]['keywords']

[{'id': 7879, 'name': 'secret love'},
 {'id': 212, 'name': 'london, england'},
 {'id': 548, 'name': 'countryside'},
 {'id': 818, 'name': 'based on novel or book'},
 {'id': 4129, 'name': 'widow'},
 {'id': 11109, 'name': 'military officer'},
 {'id': 964, 'name': 'servant'},
 {'id': 2755, 'name': 'country life'},
 {'id': 4472, 'name': 'pneumonia'},
 {'id': 10911, 'name': 'inheritance'},
 {'id': 15060, 'name': 'period drama'},
 {'id': 156507, 'name': 'rainstorm'},
 {'id': 156512, 'name': 'decorum'},
 {'id': 159871, 'name': 'horse carriage'},
 {'id': 165100, 'name': 'young love'},
 {'id': 165388, 'name': 'dowry'},
 {'id': 192311, 'name': 'social class'},
 {'id': 207928, 'name': '19th century'},
 {'id': 217957, 'name': 'penniless'},
 {'id': 227556, 'name': 'social elite'},
 {'id': 238085, 'name': 'bloodletting'},
 {'id': 243090, 'name': 'free spirited'},
 {'id': 244113, 'name': 'sussex'},
 {'id': 274542, 'name': '1800s'},
 {'id': 285894, 'name': 'sisters love'},
 {'id': 307851, 'name': 'marr

As we can see, each entry in the keywords column is a list of dictionaries, each holding the id and name of one keyword.

In [19]:
keywords['keywords'] = keywords['keywords'].apply(lambda x: json.loads(x) if isinstance(x, str) else x) \
                        .apply(lambda x: [item['name'] for item in x]) \
                        .apply(lambda x: np.nan if not x else x)
keywords.head()

Unnamed: 0,id,keywords
0,4584,"[secret love, london, england, countryside, ba..."
1,21032,"[wolf, pet, cartoon, dog-sledding race, alaska..."
2,5,"[hotel, new year's eve, witch, bet, sperm, hot..."
3,862,"[rescue, friendship, mission, jealousy, villai..."
4,8844,"[giant insect, board game, disappearance, jung..."


In [20]:
keywords.isna().sum()

id            0
keywords    614
dtype: int64

Only 614 of the 10073 films (about 6%) have no keywords.

## Add the extra details to the films

Finally, we add the directors, actors and keywords to their corresponding films in the movies metadata CSV file.

In [21]:
mdf = pd.read_csv('../Data/Raw/movies_metadata.csv')
# Select only the relevant columns
mdf = mdf[['movieId', 'id', 'title', 'genres', 'overview', 'release_date', 
           'popularity', 'runtime', 'status', 'tagline',  'vote_average', 
           'vote_count', 'poster_path', 'backdrop_path']]
# Select the filtered movies
mdf = mdf[mdf['id'].isin(ids['id'])]

In [22]:
# Merge the DataFrames
df = pd.merge(pd.merge(mdf, credits, on='id'), keywords, on='id')
df.head().transpose()

Unnamed: 0,0,1,2,3,4
movieId,16.0,11.0,7.0,1.0,12.0
id,524,9087,11860,862,12110
title,Casino,The American President,Sabrina,Toy Story,Dracula: Dead and Loving It
genres,"[{'id': 80, 'name': 'Crime'}, {'id': 18, 'name...","[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...","[{'id': 10749, 'name': 'Romance'}, {'id': 18, ...","[{'id': 16, 'name': 'Animation'}, {'id': 12, '...","[{'id': 35, 'name': 'Comedy'}, {'id': 27, 'nam..."
overview,"In early-1970s Las Vegas, Sam ""Ace"" Rothstein ...","Widowed U.S. president Andrew Shepherd, one of...","Sabrina Fairchild, a chauffeur's daughter, gre...","Led by Woody, Andy's toys live happily in his ...",When a lawyer shows up at the vampire's doorst...
release_date,1995-11-22,1995-11-17,1995-12-15,1995-11-22,1995-12-22
popularity,6.6361,1.6287,2.7156,22.3485,2.0197
runtime,179.0,114.0,127.0,81.0,88.0
status,Released,Released,Released,Released,Released
tagline,No one stays at the top forever.,Why can't the most powerful man in the world h...,You are cordially invited to the most surprisi...,Hang on for the comedy that goes to infinity a...,You'll laugh until you die...then you'll rise ...


In [23]:
df.shape

(10073, 17)
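
`pd.merge` performs an inner join by default, so a movie missing from either table would be dropped without notice; here the row count still matches the 10073 fetched movies, so nothing was lost. A more defensive variant is sketched below on toy frames. The frames and their values are made up for illustration; `how`, `validate` and `indicator` are standard pandas merge parameters.

```python
import pandas as pd

# Toy frames: movie 3 has no credits row
movies = pd.DataFrame({"id": [1, 2, 3], "title": ["A", "B", "C"]})
toy_credits = pd.DataFrame({"id": [1, 2], "director": ["X", "Y"]})

# how="left" keeps every movie; validate raises MergeError if an id repeats on either side;
# indicator adds a _merge column telling where each row came from
merged = movies.merge(toy_credits, on="id", how="left", validate="one_to_one", indicator=True)

# Ids present in movies but absent from toy_credits
missing = merged.loc[merged["_merge"] == "left_only", "id"].tolist()
```

Listing the unmatched ids this way makes it possible to decide explicitly whether to drop or refetch them.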

Check whether we have duplicated values

In [26]:
rt = df['title'].duplicated().sum()
rtmdb = df['id'].duplicated().sum()
rid = df['movieId'].duplicated().sum()

print(f'There are {rt} duplicated titles, {rtmdb} duplicated TMDB ids, and {rid} duplicated movie ids.')

There are 256 duplicated titles, 0 duplicated TMDB ids, and 0 duplicated movie ids.


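For each duplicated title, keep only the entry with the highest vote_count. Since there are no duplicated TMDB ids, these are distinct TMDB entries that share a title, and only the most-voted one is kept.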

In [None]:
df_final = df.loc[df.groupby('title')['vote_count'].idxmax()]
df_final.shape

(9817, 17)
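
The `groupby(...).idxmax()` idiom returns, for each title, the index label of the row with the largest `vote_count`, and `.loc` then selects those rows. The toy frame below (made-up values) shows the effect, along with an equivalent formulation based on `sort_values` and `drop_duplicates`, offered here only as an alternative sketch.

```python
import pandas as pd

toy = pd.DataFrame({
    "title": ["A", "A", "B"],
    "id": [10, 20, 30],
    "vote_count": [150, 300, 5000],
})

# One row per title: the one with the highest vote_count (result is ordered by title)
kept = toy.loc[toy.groupby("title")["vote_count"].idxmax()]

# Alternative: sort by votes, keep the first row per title, restore the original order
alt = (toy.sort_values("vote_count", ascending=False)
          .drop_duplicates("title")
          .sort_index())
```

The first form orders the result by title, which is why `df_final` starts with titles such as `#Alive`; the second keeps the original row order.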

In [29]:
rt = df_final['title'].duplicated().sum()
rtmdb = df_final['id'].duplicated().sum()
rid = df_final['movieId'].duplicated().sum()

print(f'There are {rt} duplicated titles, {rtmdb} duplicated TMDB ids, and {rid} duplicated movie ids.')

There are 0 duplicated titles, 0 duplicated TMDB ids, and 0 duplicated movie ids.


In [34]:
df_final['movieId'] = df_final['movieId'].astype(int)
df_final.head()

Unnamed: 0,movieId,id,title,genres,overview,release_date,popularity,runtime,status,tagline,vote_average,vote_count,poster_path,backdrop_path,cast,director,keywords
8710,221850,614696,#Alive,"[{'id': 28, 'name': 'Action'}, {'id': 27, 'nam...","As a grisly virus rampages a city, a lone man ...",2020-06-24,3.6641,98.0,Released,You must survive.,7.231,1868.0,/lZPvLUMYEPLTE2df1VW5FHTYC8N.jpg,/k2SY15W9QXH9qL8f4a4BbytV1BE.jpg,"[Yoo Ah-in, Park Shin-hye, Lee Hyun-wook]",Cho Il,"[escape, alone, survival, drone, zombie, apart..."
8485,212989,605734,#Iamhere,"[{'id': 35, 'name': 'Comedy'}, {'id': 10749, '...",Stéphane lives a quiet life as an eminent Fren...,2020-02-05,0.6237,100.0,Released,,5.6,208.0,/yxHDlr90ww0XZt3U26W95JNExf3.jpg,/g5ZCG8coC1bfB81anxL52l30uDe.jpg,"[Alain Chabat, Bae Doona, Jules Sagot]",Eric Lartigau,"[seoul, south korea]"
7082,177545,455656,#realityhigh,"[{'id': 35, 'name': 'Comedy'}]",When nerdy high schooler Dani finally attracts...,2017-07-17,1.4162,99.0,Released,,6.28,1028.0,/iZliPeiiDta9KbONAhdFSXhTxrO.jpg,/smgZYp49OB6xo4hZewxzryrh5xN.jpg,"[Nesta Cooper, Keith Powers, Alicia Sanz]",Fernando Lebrija,"[high school, nerd, teenage crush, social media]"
5205,117867,252178,'71,"[{'id': 53, 'name': 'Thriller'}, {'id': 28, 'n...",A young British soldier must find his way back...,2014-10-10,1.3967,99.0,Released,,6.803,1137.0,/xjorsS84euahsmGlnEEeE3LFSVZ.jpg,/aTloiKdNs2c8vlstbx3wBWD6Thi.jpg,"[Jack O'Connell, Sean Harris, Paul Anderson]",Yann Demange,"[1970s, riot, northern ireland, survival, sold..."
3335,69757,19913,(500) Days of Summer,"[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...","Tom, greeting-card writer and hopeless romanti...",2009-07-17,9.5369,95.0,Released,This is not a love story. This is a story abou...,7.293,10342.0,/qXAuQ9hF30sQRsXf40OfRVl0MJZ.jpg,/1M2i4Mxd03elGOTmEkIvqrHfmyS.jpg,"[Joseph Gordon-Levitt, Zooey Deschanel, Geoffr...",Marc Webb,"[jealousy, gallery, fight, date, architect, in..."


## Save the Final Dataset

Once we have fetched the data, added it to the CSV metadata file, and removed any duplicate titles, we will save it as the final CSV file, which will be used for the recommender system

In [None]:
# Save it as the final dataset
df_final.to_csv('../Data/Processed/movies_final.csv', index=False)
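
One thing to keep in mind when loading this file later: CSV has no list type, so the `cast` and `keywords` lists are stored as their string representation. The sketch below shows a round trip on a toy frame and restores the lists with `ast.literal_eval`; `parse_list` is a hypothetical helper and the in-memory buffer stands in for the real file.

```python
import ast
import io
import pandas as pd

toy = pd.DataFrame({"id": [1, 2], "cast": [["Actor A", "Actor B"], float("nan")]})

buffer = io.StringIO()
toy.to_csv(buffer, index=False)  # the list is written as the text "['Actor A', 'Actor B']"
buffer.seek(0)

def parse_list(value):
    """Turn "['a', 'b']" back into a list; leave missing values (NaN) untouched."""
    return ast.literal_eval(value) if isinstance(value, str) else value

restored = pd.read_csv(buffer)
raw_type = type(restored.loc[0, "cast"])  # the column comes back as plain strings
restored["cast"] = restored["cast"].apply(parse_list)
```

`ast.literal_eval` is used rather than `json.loads` because the stored text uses Python's single-quoted list syntax, which is not valid JSON.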