**Efficient TMDB API Calls**

**Planning**

Before jumping into the code, it is important to outline in plain language what you are trying to do. Before you can ask the computer to do it, you have to really understand what you are asking. This week has introduced some new code that you may still be getting used to, so this lesson will help walk you through the task. We will go through the individual pieces of code, but for the project, you will need to put it all together, in a logical order, with correct formatting! There will be an OUTER and INNER loop: a loop within a loop!

**Our goal will be to**

- Determine where to save our results and in what file format.
- Decide what subset of movies to retrieve (based on Years).
- Develop code to make API calls based on our existing IMDB IDs with the INNER Loop
- Organize output by year into separate .json files using an OUTER LOOP

**BEFORE the Loops**

- Designate a folder to save your information
- Define the years you wish to retrieve
- Define any custom functions you will use

**Create an OUTER loop for each year with a progress bar using tqdm_notebook**

- Define a JSON_FILE filename to save the results in progress.
   - Check if the file exists.
     - if no:
        - Create the empty JSON file (using with open) that just contains the key "imdb_id"
     - if yes:
        - Do nothing.
- Define/filter the movie IDs you want to retrieve (that belongs to the year being retrieved)
- Check for and remove any previously downloaded movie IDs to prevent duplicate API calls.
   - Load in any existing/previous results with pd.read_json
      - Check to see if any of the movie_ids to get are already in the JSON file.
      - Filter out only movies that are missing from the JSON file to use in the loop.

**Create an INNER loop to make API calls for each id in the YEAR specified in the outer loop. For each id:**

- Load up results thus far from the JSON file as a list.
- Retrieve the current ID from the API and extract the dictionary of results
- Append the new results to the list from the JSON file
- Save the updated JSON file back to the disk

**After the inner loop,** save the final results for that year as a csv.gz file with the year in the filename.

 - Then, the outer loop repeats for the remaining years.
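
The outline above can be sketched as a pair of nested loops. This is a structural sketch only: ids_by_year and fake_api_call are hypothetical stand-ins for the filtered title basics data and the real TMDB call, and results are kept in memory instead of being written to disk.

```python
# Hypothetical stand-in data: IMDB IDs grouped by year.
ids_by_year = {2000: ["tt0000001", "tt0000002"], 2001: ["tt0000003"]}


def fake_api_call(movie_id):
    """Hypothetical placeholder for the real TMDB request."""
    return {"imdb_id": movie_id}


results_by_year = {}

# OUTER loop: one pass per year
for year in ids_by_year:
    year_results = []

    # INNER loop: one pass per movie ID in that year
    for movie_id in ids_by_year[year]:
        year_results.append(fake_api_call(movie_id))

    # After the inner loop, but still inside the outer loop:
    # this is where the year's results would be saved.
    results_by_year[year] = year_results

print({year: len(rows) for year, rows in results_by_year.items()})
```

Notice that the indentation level is what decides whether a line runs once per movie, once per year, or once overall.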

**BEFORE THE LOOPS**

**Designate a folder**

You will save API call data in the data folder you created for project Part 1.

In [1]:
import pandas as pd
import os, time, json
import tmdbsimple as tmdb
from tqdm.notebook import tqdm_notebook  # progress bars used in both loops
FOLDER = "Data/"
os.makedirs(FOLDER, exist_ok=True)
os.listdir(FOLDER)

['final_results_NY_pizza.csv.gz',
 'final_results_NY_Steak.csv.gz',
 'results_in_progress_NY_pizza.json',
 'results_in_progress_NY_Steak.json',
 'title_basics.csv.gz',
 'tmdb_data_2000.csv.gz',
 'tmdb_data_2001.csv.gz']

**Define Your Functions**

You should ultimately put any custom functions at the top of your notebook. You can first write them where you first use them in your project, but once you have the functions completed and tested, you should move their definitions to the top of your notebook after you import your packages.

You will need your function to get the movie rating from the prior lesson, as well as the new function below: write_json. This is a modified version of a function from **https://www.geeksforgeeks.org/append-to-json-file-using-python/**. Notice that the original source link is included in the function's docstring to give proper credit to the original authors.

In [2]:
def write_json(new_data, filename): 
    """Appends a record or a list of records (new_data) to a json file (filename). 
    Adapted from: https://www.geeksforgeeks.org/append-to-json-file-using-python/"""  
    
    with open(filename,'r+') as file:
        # First we load the existing data (a list of records).
        file_data = json.load(file)
        ## Choose extend or append
        if isinstance(new_data, list) and isinstance(file_data, list):
            file_data.extend(new_data)
        else:
            file_data.append(new_data)
        # Move back to the start of the file so the old contents are overwritten.
        file.seek(0)
        # convert back to json.
        json.dump(file_data, file)
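
To see the difference between the append and extend branches, here is a small usage sketch. It repeats the function's logic so that it runs on its own, and it writes to a hypothetical scratch file (demo_results.json in a temporary folder) instead of your real Data folder.

```python
import json
import os
import tempfile


def write_json(new_data, filename):
    """Copy of the lesson function's logic so this example is self-contained."""
    with open(filename, 'r+') as file:
        file_data = json.load(file)
        if isinstance(new_data, list) and isinstance(file_data, list):
            file_data.extend(new_data)
        else:
            file_data.append(new_data)
        file.seek(0)
        json.dump(file_data, file)


# Hypothetical scratch file that starts with the placeholder record
demo_path = os.path.join(tempfile.mkdtemp(), "demo_results.json")
with open(demo_path, "w") as f:
    json.dump([{"imdb_id": 0}], f)

# A single dictionary is appended as one record
write_json({"imdb_id": "tt0000001"}, demo_path)
# A list of dictionaries extends the file by one record per item
write_json([{"imdb_id": "tt0000002"}, {"imdb_id": "tt0000003"}], demo_path)

with open(demo_path) as f:
    saved = json.load(f)
print(len(saved))
```

Because the file only ever grows, rewriting from position 0 always covers the old contents completely.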

**Load in the Title Basics data**

You need to read in the filtered dataframe you created based on the specification of Project 3 Part 1.

You will be filtering out the movies for each year inside the loop, so we will need this loaded and ready to be filtered.

In [3]:
# Load in the dataframe from project part 1 as basics:
basics = pd.read_csv('/Users/tspiet/Documents/GitHub/Data_Enrichment/Data-Enrichment/Data/title_basics.csv.gz')

**Create Required Lists for the Loop**
    
Define a list of the Years to Extract from the API

We have data from 2000 - 2020 available. If we just want results for the first two years, we will create a YEARS_TO_GET list that only contains those 2 years (for now). This will control our outer loop.

In [4]:
YEARS_TO_GET = [2000,2001]

**Define an errors list**

We will want to be able to save the ids and error messages for any movie that causes an error. To do so, we will want to create an empty errors list before our loops that we can append to later.

In [5]:
errors = [ ]

**----Start OUTER Loop----**

**Ultimately we will be creating a loop, but let's explore each piece of the code:**


**Set up Progress Bar**

We want to keep track of our progress and ensure our calls are working. The progress bar works within the for statement of the for loop. Note that this will iterate through each year that is defined in the YEARS_TO_GET variable.

In [8]:
# Start of OUTER loop
for YEAR in tqdm_notebook(YEARS_TO_GET, desc='YEARS', position=0):

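Run on its own, this cell raises "SyntaxError: incomplete input" because the for statement has no body yet. The body of the loop is built up in the steps that follow.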

**Select a JSON_FILE filename to save the results in progress.**

 - Check if the file exists.
   - if no:
       - Create the empty JSON file (using with open) that just contains the key "imdb_id"
   - if yes:
       - Do nothing.

First, define the file path and name. We will have multiple files because we are creating a separate file for each year. The code below builds the path inside the FOLDER we defined above and names the file based on the current year.

In [9]:
#Defining the JSON file to store results for year
JSON_FILE = f'{FOLDER}tmdb_api_results_{YEAR}.json'

Run outside the loop, this cell raises "NameError: name 'YEAR' is not defined" because YEAR only exists once the outer loop is running.

Check if that file already exists or not.

In [None]:
# Check if file exists
file_exists = os.path.isfile(JSON_FILE)

The code below will create the file and save a list containing a single placeholder dictionary with just imdb_id. We will be appending new records to this list throughout our calls.

In [None]:
# If it does not exist: create it
if not file_exists:
    # save a list holding one placeholder record with just "imdb_id" to the new json file.
    with open(JSON_FILE,'w') as f:
        json.dump([{'imdb_id':0}],f)
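
The check-then-create steps above can be tried safely in a scratch folder. In this sketch, folder, year, and the helper ensure_json_file are hypothetical names used only for illustration; the lesson's own code uses FOLDER, YEAR, and JSON_FILE inside the outer loop.

```python
import json
import os
import tempfile

folder = tempfile.mkdtemp()   # hypothetical scratch folder standing in for FOLDER
year = 2000                   # hypothetical value standing in for YEAR
json_file = os.path.join(folder, f"tmdb_api_results_{year}.json")


def ensure_json_file(path):
    """Create the placeholder file only if it is missing. Returns True if created."""
    if os.path.isfile(path):
        return False
    with open(path, "w") as f:
        json.dump([{"imdb_id": 0}], f)
    return True


created_first = ensure_json_file(json_file)    # file is missing, so it is created
created_second = ensure_json_file(json_file)   # file exists now, so nothing happens
print(created_first, created_second)
```

Doing nothing when the file already exists is what lets you stop and restart the notebook without losing results that were already saved.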

**Define/filter the IDs to call**
    
We are going to break up our title_basics data by year, so we will define a new dataframe for each year. Which year gets selected depends on the current value of YEAR. Keeping YEAR as a variable makes the code easier to read and reuse.

In [None]:
#Saving new year as the current df
df = basics.loc[ basics['startYear']==YEAR].copy()
# saving the movie ids for this year (a pandas Series)
movie_ids = df['tconst'].copy()

**Check for and remove any previously downloaded Movie id's**

You may remember from our lesson on efficient API calls that we are going to build in some safeguards when looping through multiple calls.

- Load in any existing API results with pd.read_json
- Check to see if any of the movie_ids to get are already in the JSON file.
- Filter out only movies that are missing from the JSON file to use in the loop

The code loads any existing information from the JSON file into a dataframe called previous_df. At first it holds only the placeholder record, but as you iterate through the loop, it will contain more and more information.

In [None]:
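# Load existing data from the json file into a dataframe called "previous_df"
previous_df = pd.read_json(JSON_FILE)
# filter out any ids that are already in the JSON_FILE
movie_ids_to_get = movie_ids[~movie_ids.isin(previous_df['imdb_id'])]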

Now we have defined the "movie_ids_to_get". It includes the ids from our dataframe in the year we are seeking, and it excludes any that we have already made calls for.

We will use this list for our inner loop of API calls.
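
To make the filtering step concrete, here is a minimal sketch on toy data. The dataframes basics_demo and previous_demo are hypothetical stand-ins for your title basics data and for the result of pd.read_json on the year's JSON file.

```python
import pandas as pd

# Toy stand-in for the title basics dataframe
basics_demo = pd.DataFrame({
    "tconst": ["tt0000001", "tt0000002", "tt0000003", "tt0000004"],
    "startYear": [2000, 2000, 2000, 2001],
})

# Toy stand-in for previous results: the placeholder record plus one saved movie
previous_demo = pd.DataFrame({"imdb_id": [0, "tt0000002"]})

year = 2000
movie_ids = basics_demo.loc[basics_demo["startYear"] == year, "tconst"]

# isin marks the IDs that were already saved; ~ flips the mask to keep the rest
already_saved = movie_ids.isin(previous_demo["imdb_id"])
movie_ids_to_get = movie_ids[~already_saved]
print(movie_ids_to_get.tolist())
```

Only the IDs from the chosen year that are not yet in the previous results remain, so an interrupted run can pick up where it left off.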

**----Start INNER Loop----**

Now that we have the filtered list of movie_ids_to_get for the current year, we will create an inner loop to iterate through it. For each ID, we will retrieve the movie info from the TMDB API, append the movie info dictionary to our JSON_FILE, and wait 20 ms to avoid overwhelming the API.

**Iterate through the list of Movie IDs and make the calls**

The code below relies on the function you wrote in the previous lesson that made API calls and added the certification to the .info results. Here this function is named "get_movie_with_rating". Make sure you have the function from the earlier lesson in the code file before you plan to call on it! This loop also uses the function above (write_json) to extend/append the results to the .json file. **Make sure both functions are defined in your code file before you try to call them!**

Since some movies exist in IMDB's title basics dataset (our DataFrame) that do not exist within the database for TMDB's API, we will get an error whenever we attempt to retrieve a movie id that TMDB does not have in its database.

To get around this, we will use a try and except statement inside our inner loop. We will TRY to retrieve and save the data for the current movie_id, but if we get an error, we will save the movie_id and error message in our errors list and move on to the next ID.

In [None]:
    # INNER Loop: iterate through the movie ids for the current year
    for movie_id in tqdm_notebook(movie_ids_to_get,
                                  desc=f'Movies from {YEAR}',
                                  position=1,
                                  leave=True):
        try:
            # Retrieve the data for the movie id
            temp = get_movie_with_rating(movie_id)  
            # Append/extend results to existing file using a pre-made function
            write_json(temp,JSON_FILE)
            # Short 20 ms sleep to prevent overwhelming server
            time.sleep(0.02)
            
        except Exception as e:
            errors.append([movie_id, e])
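
The try and except pattern can be tested without making any API calls. In this sketch, fake_get_movie is a hypothetical stand-in for get_movie_with_rating that deliberately fails for one ID, and results go into a plain list instead of the JSON file.

```python
errors = []
saved = []


def fake_get_movie(movie_id):
    """Hypothetical stand-in for get_movie_with_rating: fails for one ID."""
    if movie_id == "tt0000002":
        raise ValueError("movie not found")
    return {"imdb_id": movie_id}


for movie_id in ["tt0000001", "tt0000002", "tt0000003"]:
    try:
        saved.append(fake_get_movie(movie_id))
    except Exception as e:
        # Record the failure and continue with the next ID
        errors.append([movie_id, e])

print(len(saved), len(errors))
```

The loop finishes all three IDs even though the second one raises an error, which is the behavior we want when TMDB is missing a movie.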

**After the Inner Loop**

Once the inner loop through the current movie_ids_to_get has finished, we will have all of our results for that year in our JSON_FILE. We now want to save them in a smaller file format.

**Save the year's results as a csv.gz file**

Once all of the API calls for the current year are made, you should open your .json file with pd.read_json and convert each json file to a compressed csv (".csv.gz") to save space. This is done after the inner loop but within the outer loop.

In [None]:
    final_year_df = pd.read_json(JSON_FILE)
    final_year_df.to_csv(f"{FOLDER}final_tmdb_data_{YEAR}.csv.gz", compression="gzip", index=False)
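
If you want to confirm that the compressed file can be read back, a round trip on toy data looks like this. The file names and the two records below are hypothetical, and everything is written to a temporary folder.

```python
import json
import os
import tempfile

import pandas as pd

folder = tempfile.mkdtemp()  # hypothetical scratch folder
json_path = os.path.join(folder, "tmdb_api_results_demo.json")
csv_path = os.path.join(folder, "final_tmdb_data_demo.csv.gz")

# Toy records standing in for a year's API results
records = [{"imdb_id": "tt0000001", "budget": 100},
           {"imdb_id": "tt0000002", "budget": 250}]
with open(json_path, "w") as f:
    json.dump(records, f)

# Same two steps as the lesson: read the JSON, then save as a compressed csv
year_df = pd.read_json(json_path)
year_df.to_csv(csv_path, compression="gzip", index=False)

# read_csv infers gzip compression from the .gz extension
reloaded = pd.read_csv(csv_path)
print(reloaded.shape)
```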

**After Your Inner & Outer Loop**

Print a message reporting back the number of movie ids that caused an error.
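
One possible way to write that message is sketched below. The errors list here is filled with hypothetical entries in the same [movie_id, error] shape that the inner loop appends; in your notebook you would use the real errors list instead.

```python
# Hypothetical errors list in the same [movie_id, exception] shape as the inner loop
errors = [["tt0000002", ValueError("movie not found")],
          ["tt0000009", KeyError("imdb_id")]]

# Report how many movie ids caused an error
message = f"- Total errors: {len(errors)}"
print(message)

# Optionally pull out just the ids that failed for later inspection
failed_ids = [movie_id for movie_id, _ in errors]
```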

**Troubleshooting:**
  
If you get an error message when trying to run pd.read_json, try replacing pd.read_json with the "read_and_fix_json" helper function in this repository: **https://github.com/coding-dojo-data-science/data-enrichment-helper-functions**

In [None]:
# Instead of previous_df=pd.read_json:
previous_df = read_and_fix_json(JSON_FILE)

**Summary**

This lesson shows the importance of planning complex coding tasks so that you are clear, in plain language, on what you are trying to do before translating it to code. While this lesson shows examples of the different segments of code that you may want to use in the next phase of the project, remember it is still up to you to read and understand each step so that you can put together the final product! **You will need to pay close attention to the order of the steps and to the formatting (especially indentation).**