# TMDB API - Practice

**Project Planning**
- The goal is to extract financial and certification data from TMDB's API and the prior IMDB dataset.
- We will use two nested loops: an OUTER loop and an INNER loop.

The OUTER loop will loop through the start years included in the IMDB data, filter the title basics data for the selected year, and save the list of movie IDs from that year to retrieve in the inner loop.

The INNER loop will loop through every movie ID from the selected year, extract its results from the TMDB API, and append them to a JSON file.
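
The two-loop plan described above can be sketched roughly as follows. This is a minimal illustration and not the project's final code: the tiny `basics` table and the `fetch_movie` function are hypothetical stand-ins, and results are kept in memory instead of being appended to a JSON file.

```python
import pandas as pd

# Toy stand-in for the IMDB title basics data (hypothetical rows)
basics = pd.DataFrame({
    "tconst": ["tt0000001", "tt0000002", "tt0000003"],
    "startYear": [2010, 2010, 2011],
})

def fetch_movie(movie_id):
    """Hypothetical placeholder for the real TMDB API call."""
    return {"imdb_id": movie_id}

results_by_year = {}

# OUTER loop: one pass per start year in the data
for year in sorted(basics["startYear"].unique()):
    # Filter the title basics data for the selected year
    movie_ids = basics.loc[basics["startYear"] == year, "tconst"]

    # INNER loop: one (placeholder) API call per movie ID from that year
    year_results = []
    for movie_id in movie_ids:
        year_results.append(fetch_movie(movie_id))

    results_by_year[int(year)] = year_results

print({year: len(rows) for year, rows in results_by_year.items()})
# {2010: 2, 2011: 1}
```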

**For this practice assignment**

- We will be practicing the inner loop of API calls for a single year's movie list from our IMDB *title basics* data. Specifically we will extract API results for every movie with startYear=2010.
    - Read the instructions below, including examples in the 'Getting Started' section, before starting to work.

# Planning 

### Preparation BEFORE the loop

- Designate a folder to save the information
- Define custom functions to use for API calls
- Load cleaned *title basics* data from Part 1 of Project 2
- Define the year we will retrieve (2010) and create an empty list for appending error messages

### Prepare the DataFrame and JSON file

- Use the selected year to define filenames and filter the data
    - Define a JSON_FILE filename to save the results in progress
    - Check if the file exists
        - If the file does NOT exist, create it using 'with open', containing only a placeholder record with the key 'imdb_id'
        - If the file DOES exist, print a message saying so.

Once the json file for results exists:
- Filter the IMDB *title basics* data for the selected year (2010) and save the movie IDs from that year as 'movie_ids'
- Check the JSON file for previously downloaded movie IDs and filter them out to prevent duplicate API calls by:
    - Loading the contents of the JSON file with pd.read_json
        - Compare the movie IDs already in the json file to the saved 'movie_ids'
    - Saving the final list of 'movie_ids_to_get' by filtering out movies that already exist in the json file

### Perform the loop of API calls

- Use the previously written function to combine the certification with the rest of the .info() from the TMDB API result
- Create a loop to make API calls for each id in the year specified (2010). Include a progress bar using tqdm_notebook.
- For each movie id:
  - Retrieve the dict of results for the current ID from the API
  - Append the new results to the list from the json file
  - Save the updated json file back to the disk

### Save the results to a compressed .csv

- After the loop, save the final result for the selected year (2010) as a .csv.gz file with the year in the filename.
- Note: at this point we have completed the inner loop that we will need for the next part of our project.

# Execution phase

## Imports and loading


**Designate a folder**

In [1]:
# Import packages
import os, time, json
import tmdbsimple as tmdb 
import pandas as pd
from tqdm.notebook import tqdm_notebook
# Create the folder for saving files (if it doesn't exist)
FOLDER = "Data/"
os.makedirs(FOLDER, exist_ok=True)
os.listdir(FOLDER)

['.ipynb_checkpoints', 'movie_basics.csv', 'movie_ratings.csv', 'untitled.txt']

**API Credentials**


In [6]:
with open('/Users/Malue/.secret/tmdb_api.json', 'r') as f:
    login = json.load(f)
# Display the keys of the loaded dict
login.keys()

dict_keys(['api-key'])

In [7]:
tmdb.API_KEY = login['api-key']

**Define functions**

In [2]:
def get_movie_with_rating(movie_id):
    """Adapted from source = https://github.com/celiao/tmdbsimple"""
    # Get the movie object for the current id
    movie = tmdb.Movies(movie_id)

    # Save the .info and .releases dictionaries
    info = movie.info()
    releases = movie.releases()

    # Loop through countries in releases
    for c in releases['countries']:
        # If the country abbreviation==US
        if c['iso_3166_1']=='US':
            # Save a 'certification' key in info with the fetched certification
            info['certification'] = c['certification']

    return info
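
The certification lookup inside the function above can be tried without any API call. The sketch below applies the same loop to hand-written dictionaries; `add_us_certification` and the sample data are hypothetical and only mimic the `'countries'`, `'iso_3166_1'`, and `'certification'` keys used above. It also shows that when there is no US entry, no `'certification'` key is added at all.

```python
def add_us_certification(info, releases):
    """Copy the US certification from a releases dict into info (same loop as above)."""
    for country in releases["countries"]:
        if country["iso_3166_1"] == "US":
            info["certification"] = country["certification"]
    return info

# Hypothetical release data with a US entry
releases_with_us = {"countries": [
    {"iso_3166_1": "GB", "certification": "12A"},
    {"iso_3166_1": "US", "certification": "PG-13"},
]}
# Hypothetical release data without a US entry
releases_without_us = {"countries": [
    {"iso_3166_1": "GB", "certification": "12A"},
]}

with_us = add_us_certification({"imdb_id": "tt0000001"}, releases_with_us)
without_us = add_us_certification({"imdb_id": "tt0000002"}, releases_without_us)

print(with_us.get("certification"))     # PG-13
print(without_us.get("certification"))  # None
```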

In [3]:
def write_json(new_data, filename): 
    """Appends a list of records (new_data) to a json file (filename). 
    Adapted from: https://www.geeksforgeeks.org/append-to-json-file-using-python/"""  
    
    with open(filename,'r+') as file:
        # First we load existing data into a dict.
        file_data = json.load(file)
        ## Choose extend or append
        if isinstance(new_data, list) and isinstance(file_data, list):
            file_data.extend(new_data)
        else:
            file_data.append(new_data)
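        # Move back to the start of the file before rewriting it.
        file.seek(0)
        # Convert back to json and overwrite the old contents.
        json.dump(file_data, file)
        # Drop any leftover bytes in case the new content is shorter.
        file.truncate()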
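
To see how this append-or-extend behavior plays out, here is a small self-contained round trip on a temporary file. The block re-declares `write_json` so that it runs on its own, with a `file.truncate()` call after the dump as a safeguard; the file path and the records are made up for illustration.

```python
import json
import os
import tempfile

def write_json(new_data, filename):
    """Append one record, or extend with a list of records, in a JSON file holding a list."""
    with open(filename, "r+") as file:
        file_data = json.load(file)
        if isinstance(new_data, list) and isinstance(file_data, list):
            file_data.extend(new_data)
        else:
            file_data.append(new_data)
        file.seek(0)
        json.dump(file_data, file)
        file.truncate()

# Hypothetical demo file seeded with a placeholder record
demo_path = os.path.join(tempfile.mkdtemp(), "demo_results.json")
with open(demo_path, "w") as f:
    json.dump([{"imdb_id": 0}], f)

# A single dict is appended; a list of dicts is merged in with extend
write_json({"imdb_id": "tt0000001"}, demo_path)
write_json([{"imdb_id": "tt0000002"}, {"imdb_id": "tt0000003"}], demo_path)

with open(demo_path) as f:
    demo_records = json.load(f)

print([record["imdb_id"] for record in demo_records])
# [0, 'tt0000001', 'tt0000002', 'tt0000003']
```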

**Test that API function works**

In [8]:
avengers = get_movie_with_rating('tt0848228')
avengers

{'adult': False,
 'backdrop_path': '/9BBTo63ANSmhC4e6r62OJFuK2GL.jpg',
 'belongs_to_collection': {'id': 86311,
  'name': 'The Avengers Collection',
  'poster_path': '/yFSIUVTCvgYrpalUktulvk3Gi5Y.jpg',
  'backdrop_path': '/zuW6fOiusv4X9nnW3paHGfXcSll.jpg'},
 'budget': 220000000,
 'genres': [{'id': 878, 'name': 'Science Fiction'},
  {'id': 28, 'name': 'Action'},
  {'id': 12, 'name': 'Adventure'}],
 'homepage': 'https://www.marvel.com/movies/the-avengers',
 'id': 24428,
 'imdb_id': 'tt0848228',
 'original_language': 'en',
 'original_title': 'The Avengers',
 'overview': 'When an unexpected enemy emerges and threatens global safety and security, Nick Fury, director of the international peacekeeping agency known as S.H.I.E.L.D., finds himself in need of a team to pull the world back from the brink of disaster. Spanning the globe, a daring recruitment effort begins!',
 'popularity': 122.648,
 'poster_path': '/RYMX2wcKCBAr24UyPD7xwmjaTn.jpg',
 'production_companies': [{'id': 420,
   'logo_path

In [9]:
notebook = get_movie_with_rating('tt0332280')
notebook

{'adult': False,
 'backdrop_path': '/qom1SZSENdmHFNZBXbtJAU0WTlC.jpg',
 'belongs_to_collection': None,
 'budget': 29000000,
 'genres': [{'id': 10749, 'name': 'Romance'}, {'id': 18, 'name': 'Drama'}],
 'homepage': 'http://www.newline.com/properties/notebookthe.html',
 'id': 11036,
 'imdb_id': 'tt0332280',
 'original_language': 'en',
 'original_title': 'The Notebook',
 'overview': "An epic love story centered around an older man who reads aloud to a woman with Alzheimer's. From a faded notebook, the old man's words bring to life the story about a couple who is separated by World War II, and is then passionately reunited, seven years later, after they have taken different paths.",
 'popularity': 58.251,
 'poster_path': '/rNzQyW4f8B8cQeg7Dgj3n6eT5k9.jpg',
 'production_companies': [{'id': 12,
   'logo_path': '/mevhneWSqbjU22D1MXNd4H9x0r0.png',
   'name': 'New Line Cinema',
   'origin_country': 'US'},
  {'id': 1565, 'logo_path': None, 'name': 'Avery Pix', 'origin_country': 'US'},
  {'id': 26

**Load in cleaned Title Basics data**

In [10]:
# Load in the dataframe from Project 2 Part 1 as basics
basics = pd.read_csv('Data/movie_basics.csv')

**Define a variable with the year to extract from the API**

In [12]:
# We have data available for years 2000-2020
# This assignment is only for year 2010
YEAR = 2010

**Define an errors list**
- We want to be able to save and inspect error messages for any movie which causes one.
- We need to create an empty list which will then be populated

In [13]:
errors = []

## Preparing DataFrame and files

**Select a JSON_FILE filename to save the results-in-progress**

- First, define a file path including the year.
- For our project, we will have multiple files, one for each year of movies. The code below places the file inside the FOLDER we defined above and names it based on the current year.

In [14]:
# Defining the JSON file to store results for a certain year
JSON_FILE = f'{FOLDER}tmdb_api_results{YEAR}.json'

**Determine if the JSON file exists:**
- Check if the file exists or not.
- If going through the lesson for the first time, it likely does not. However if we are revisiting this lesson, it will already be there. We don't want to do anything to it yet, just make sure it is a file we wish to add to.

In [20]:
# Check if file exists
file_exists = os.path.isfile(JSON_FILE)

- If the file does NOT exist:
    - Print a statement informing the user
    - Create the json file using 'with open', containing a list with one placeholder record: the key 'imdb_id' with the value 0.

In [21]:
# If file does not exist: create it
if not file_exists:
    # Print message indicating so
    print(f'Creating {JSON_FILE} for API results for {YEAR}.')
    # Save a list holding one placeholder dict with key 'imdb_id' and value 0
    with open(JSON_FILE, 'w') as f:
        json.dump([{'imdb_id':0}], f)

else:
    print(f'The file {JSON_FILE} already exists.')

The file Data/tmdb_api_results2010.json already exists.


- If the file DOES exist, the cell only prints a statement saying so, as in the output above.

**Filter for the selected year and save the movie ids**
- For this project we will be breaking up title_basics data by year. For this practice assignment we will only be working with 2010.
- We create a new DataFrame by filtering title_basics for the selected YEAR (defined above). We will then save the list of movie_ids as a separate variable.

In [22]:
# Saving new year as the current dataframe
df = basics.loc[ basics['startYear'] == YEAR].copy()
# Saving movie ids to separate variable
movie_ids = df['tconst']

## Check previous results and create the final list of movie_ids_to_get

Our API calls will have some built-in safeguards for looping through multiple calls.
- Load in any existing API results with pd.read_json
- Check to see if any of the movie_ids to get are already in the json file
- Filter out only the movies that are missing from the json file to use in the loop

The code loads any existing info from the json file into a dataframe called 'previous_df'. On the first run it holds only the placeholder record (imdb_id 0), but as the loop progresses and is re-run it will hold more and more results.

In [23]:
# Load existing data from json into a dataframe called 'previous_df'
previous_df = pd.read_json(JSON_FILE)

**Check for and filter out movie ids that already exist**
- The next bit of code will prevent wasted API calls on data we already have.
- Note that it is defining the ids being called in such a way that it excludes any ids that are already present in previous_df
- This allows us to pick up where we left off if the API call gets interrupted.

In [24]:
# Filter out any ids that are already in the JSON_FILE
movie_ids_to_get = movie_ids[~movie_ids.isin(previous_df['imdb_id'])]

- We now have defined 'movie_ids_to_get' which includes the ids from our dataframe in the year we are seeking, and excludes any that we have already made calls for.
- We will use this list for our loop of API calls.
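
The filtering step above can be checked on a toy example. The IDs and the contents of `previous_df` below are made up; the sketch only shows how `~movie_ids.isin(...)` keeps the IDs that are not yet in the saved results.

```python
import pandas as pd

# Hypothetical movie IDs for the selected year
movie_ids = pd.Series(["tt0000001", "tt0000002", "tt0000003"], name="tconst")

# Hypothetical saved results: the placeholder record plus one downloaded movie
previous_df = pd.DataFrame({"imdb_id": [0, "tt0000002"]})

# Keep only the IDs that do NOT already appear in the saved results
movie_ids_to_get = movie_ids[~movie_ids.isin(previous_df["imdb_id"])]

print(movie_ids_to_get.tolist())
# ['tt0000001', 'tt0000003']
```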

## Start loop through movie IDs

- We have the filtered list of movie_ids_to_get for the current year, and now we can create an inner loop to iterate through the list.
- For each ID we will:
    - Retrieve the movie info from the TMDB API
    - Append the movie_info dictionary to our JSON_FILE
    - Wait 20 ms to avoid overwhelming the API

**Setting up a progress bar**

In [25]:
# Loop through the movie_ids_to_get with a tqdm progress bar
"""
for movie_id in tqdm_notebook(movie_ids_to_get, f'Movies from {YEAR}'):
"""

"\nfor movie_id in tqdm_notebook(movie_ids_to_get, f'Movies from {YEAR}'):\n"

- Ultimately we will be creating a loop, but first let's explore the individual pieces of code

## Iterate through the list of movie IDs and make calls

- The following code will make use of two custom functions we used in previous lessons, namely get_movie_with_rating and write_json
    - Make sure both functions are defined above before trying to call them in the code below
- Because some movies exist in IMDB's database but not in TMDB's, we will get an error if we attempt to retrieve a movie ID that TMDB does not have in its database.
    - To get around this error, we use a try/except block around our API extraction code.

In [27]:
for movie_id in tqdm_notebook(movie_ids_to_get, f'Movies from {YEAR}'):
        
    # Try the API call; if it fails, record the movie id and the exception
    try:
        # Retrieve the data for the movie_id
        temp = get_movie_with_rating(movie_id)
        # Append/extend results to existing file using a pre-made function
        write_json(temp, JSON_FILE)
        # Short 20 ms sleep to prevent overwhelming server with API calls
        time.sleep(0.02)
    except Exception as e:
        errors.append([movie_id, e])

Movies from 2010:   0%|          | 0/3862 [00:00<?, ?it/s]

## After the loop

In [28]:
# Print a message showing the number of movie_id's that caused an error
print(f'- Total errors: {len(errors)}')

- Total errors: 1128
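
Beyond counting them, the saved errors can be inspected. One possible approach, sketched below, is to turn the list into a DataFrame and tally the exception types. The `errors` list here is made up and only copies the `[movie_id, exception]` shape used in the loop; the exception types are placeholders, not what the TMDB client actually raises.

```python
import pandas as pd

# Hypothetical errors in the same [movie_id, exception] shape as the loop's list
errors = [
    ["tt0000001", ValueError("not found")],
    ["tt0000002", KeyError("countries")],
    ["tt0000003", ValueError("not found")],
]

# One row per failed movie id, with the exception type and message as text
errors_df = pd.DataFrame(
    [(movie_id, type(exc).__name__, str(exc)) for movie_id, exc in errors],
    columns=["movie_id", "error_type", "message"],
)

# Count how many failures fall under each exception type
error_counts = errors_df["error_type"].value_counts()
print(error_counts.to_dict())
# {'ValueError': 2, 'KeyError': 1}
```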


**Save the year's results as csv.gz file**

- Once all of the API calls for the current year are made, load the .json file with pd.read_json and save it as a compressed csv to save space. This is done after the loop has finished running.

In [29]:
final_year_df = pd.read_json(JSON_FILE)
final_year_df.to_csv(f'{FOLDER}final_tmdb_data_{YEAR}.csv.gz', compression='gzip',
                     index=False)
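
Note that the JSON file was seeded with a placeholder record (imdb_id 0), so that row is carried into the saved file as well. If that is not wanted, one option is to drop it before saving. The sketch below uses a made-up DataFrame and a temporary output path, and is an optional variation, not part of the assignment's required steps.

```python
import os
import tempfile

import pandas as pd

# Toy stand-in for the loaded JSON results, including the placeholder row
final_year_df = pd.DataFrame({
    "imdb_id": [0, "tt0000001", "tt0000002"],
    "budget": [0, 1000, 2000],
})

# Drop the placeholder record written when the file was first created
cleaned_df = final_year_df[final_year_df["imdb_id"] != 0]

# Save as a compressed csv (hypothetical temporary location), then read it back
out_path = os.path.join(tempfile.mkdtemp(), "final_tmdb_data_demo.csv.gz")
cleaned_df.to_csv(out_path, compression="gzip", index=False)
reloaded = pd.read_csv(out_path)

print(reloaded["imdb_id"].tolist())
# ['tt0000001', 'tt0000002']
```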

# Summary

- This lesson exemplifies the importance of planning complex coding tasks so the objectives are clear and can be explained in plain language *before* starting to code.
- While this lesson demonstrates some code that may be useful in the next phase of our project, we must still make sure to understand the code at each step so we can independently put together a final product on our own.