Our goal will be to:

Determine where to save our results and in what file format.
Decide which subset of movies to retrieve (based on year).
Develop code that makes API calls for our existing IMDB IDs in an INNER loop.
Organize the output by year into separate .json files using an OUTER loop.

BEFORE the Loops

Designate a folder to save your information
Define the years you wish to retrieve
Define any custom functions you will use

Create an OUTER loop for each year, with a progress bar using tqdm_notebook
1. Define a JSON_FILE filename to save the results in progress.
2. Define/filter the movie IDs you want to retrieve (those that belong to the year being retrieved).
3. Check for and remove any previously downloaded movie IDs to prevent duplicate API calls.
Create an INNER loop to make API calls for each ID in the YEAR specified in the outer loop. For each ID:
1. Retrieve the data for the movie ID, including its certification.
2. Append the result to the JSON_FILE for that year.
3. Sleep briefly so the server is not overwhelmed.



In [1]:
import json
with open('/Users/siblose/.secret/tmdb_api.json', 'r') as f:
    login = json.load(f)
## Display the keys of the loaded dict
login.keys()

dict_keys(['api-key'])

In [2]:
# Save API call data in the data folder you created for Project Part 1.
import os, time, json
import tmdbsimple as tmdb 

tmdb.API_KEY =  login['api-key']
FOLDER = "Project3Data/"
os.makedirs(FOLDER, exist_ok=True)
os.listdir(FOLDER)

['final_tmdb_data_2000.csv.gz',
 'final_tmdb_data_2001.csv.gz',
 'final_tmdb_data_2019.csv.gz',
 'final_tmdb_data_2020.csv.gz',
 'title_akas.csv.gz',
 'title_basics.csv.gz',
 'title_ratings.csv.gz',
 'tmdb_api_results_2000.json',
 'tmdb_api_results_2001.json',
 'tmdb_api_results_2019.json',
 'tmdb_api_results_2020.json']

In [3]:
# We will need the function from the prior lesson that gets the movie rating, as well as the new function below: write_json.
def write_json(new_data, filename):
    """Appends a record, or extends with a list of records (new_data), to a json file (filename).
    Adapted from: https://www.geeksforgeeks.org/append-to-json-file-using-python/"""

    with open(filename, 'r+') as file:
        # First, load the existing data (expected to be a list of records).
        file_data = json.load(file)
        # Extend when both are lists; otherwise append the single record.
        if isinstance(new_data, list) and isinstance(file_data, list):
            file_data.extend(new_data)
        else:
            file_data.append(new_data)
        # Move back to the start of the file before rewriting it.
        file.seek(0)
        # Convert back to json, then drop any leftover bytes from the old contents.
        json.dump(file_data, file)
        file.truncate()
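
To see what write_json does to a results file, here is a minimal sketch that runs it against a scratch file. The function is restated so the block runs by itself; the file name demo_results.json and the IDs are made up for illustration and are not part of the project data.

```python
import json
import os
import tempfile


def write_json(new_data, filename):
    """Append a record, or extend with a list of records, in a JSON file holding a list."""
    with open(filename, 'r+') as file:
        file_data = json.load(file)
        if isinstance(new_data, list) and isinstance(file_data, list):
            file_data.extend(new_data)
        else:
            file_data.append(new_data)
        file.seek(0)
        json.dump(file_data, file)
        file.truncate()


# Hypothetical scratch file, seeded the same way the outer loop seeds a new year
demo_path = os.path.join(tempfile.mkdtemp(), "demo_results.json")
with open(demo_path, "w") as f:
    json.dump([{"imdb_id": 0}], f)

# A single dict is appended; a list of dicts is merged in with extend
write_json({"imdb_id": "tt0000001"}, demo_path)
write_json([{"imdb_id": "tt0000002"}, {"imdb_id": "tt0000003"}], demo_path)

with open(demo_path) as f:
    demo_records = json.load(f)

print(len(demo_records))
```

Note that every call reads and rewrites the whole file, so the cost of each write grows with the file. That is a trade-off for being able to resume after an interruption.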

Load in the Title Basics data

In [4]:
import pandas as pd

In [5]:
# Load in the dataframe from project part 1 as basics:
basics = pd.read_csv('Project3Data/title_basics.csv.gz')


In [6]:
basics.tail(15)

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres
162213,tt9908448,movie,The Bells of Hell,The Bells of Hell,0,2018,,88,"Drama,Fantasy"
162214,tt9909086,movie,Pheriaa Come Back,Pheriaa Come Back,0,2018,,137,Drama
162215,tt9909418,movie,White Dresses,White Dresses,0,1996,,50,Drama
162216,tt9911196,movie,The Marriage Escape,De beentjes van Sint-Hildegard,0,2020,,103,"Comedy,Drama"
162217,tt9911750,movie,Chambu Gabale,Chambu Gabale,0,1989,,131,Comedy
162218,tt9913660,movie,No Apology,No Apology,0,2019,,102,Drama
162219,tt9913872,movie,De la piel del Diablo,De la piel del Diablo,0,2019,,75,Thriller
162220,tt9913936,movie,Paradise,Paradise,0,2019,,135,"Crime,Drama"
162221,tt9914192,movie,No Gogó do Paulinho,No Gogó do Paulinho,0,2020,,98,Comedy
162222,tt9914828,movie,The War of Godzilla,The War of Godzilla,0,2015,,102,"Action,Comedy,Family"


Create Required Lists for the Loop

In [7]:
#We have data from 2000 - 2020 available. If we just want results for the first two years, 
#we will create a YEARS_TO_GET list that only contains those 2 years (for now). This will control our outer loop.

In [8]:
YEARS_TO_GET = [2000,2001]

In [9]:
#Define an errors list
errors = []

In [10]:
import tmdbsimple as tmdb

In [11]:
from tqdm.notebook import tqdm_notebook
import time

In [12]:
# Returns a dictionary of results that includes the certification.
def get_movie_with_rating(movie_id):
    # Get the movie object for the current id
    movie = tmdb.Movies(movie_id)
    # Save the .info and .releases dictionaries
    info = movie.info()
    releases = movie.releases()
    # Loop through the countries in releases
    for c in releases['countries']:
        # If the country abbreviation is US
        if c['iso_3166_1'] == 'US':
            # Save a "certification" key in the info dict with the certification
            info['certification'] = c['certification']
            break
    return info
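
The certification lookup can be tried without any API call. The sketch below assumes a releases dictionary shaped like the one the function reads (a 'countries' list whose entries carry 'iso_3166_1' and 'certification'). The helper name extract_us_certification and the sample values are hypothetical.

```python
def extract_us_certification(releases):
    """Return the US certification from a releases-style dict, or None if absent."""
    for country in releases.get("countries", []):
        if country.get("iso_3166_1") == "US":
            return country.get("certification")
    return None


# Made-up sample data, not real API output
sample_releases = {
    "countries": [
        {"iso_3166_1": "GB", "certification": "15"},
        {"iso_3166_1": "US", "certification": "R"},
    ]
}

us_cert = extract_us_certification(sample_releases)
no_cert = extract_us_certification({"countries": []})
print(us_cert, no_cert)
```

Unlike get_movie_with_rating, which leaves the 'certification' key out when there is no US entry, this helper returns None in that case. Which behavior is preferable depends on how missing values are handled later.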
    

In [13]:
# Start of OUTER loop
for YEAR in tqdm_notebook(YEARS_TO_GET, desc='YEARS', position=0):
    #Defining the JSON file to store results for year
    JSON_FILE = f'{FOLDER}tmdb_api_results_{YEAR}.json'
    # Check if file exists
    file_exists = os.path.isfile(JSON_FILE)
    # If it does not exist, create it
    if not file_exists:
        # Save a list holding one placeholder record ({'imdb_id': 0}) to the new json file
        with open(JSON_FILE, 'w') as f:
            json.dump([{'imdb_id': 0}], f)
    #Saving new year as the current df
    df = basics.loc[ basics['startYear']==YEAR].copy()
    # saving movie ids to list
    movie_ids = df['tconst'].copy()
    # Load existing data from json into a dataframe called "previous_df"
    previous_df = pd.read_json(JSON_FILE)
    # filter out any ids that are already in the JSON_FILE
    movie_ids_to_get = movie_ids[~movie_ids.isin(previous_df['imdb_id'])]
    
    
    # INNER loop: iterate over the movie ids that still need to be retrieved
    for movie_id in tqdm_notebook(movie_ids_to_get,
                                  desc=f'Movies from {YEAR}',
                                  position=1,
                                  leave=True):
        try:
            # Retrieve the data for the movie id
            temp = get_movie_with_rating(movie_id)  
            # Append/extend results to existing file using a pre-made function
            write_json(temp,JSON_FILE)
            # Short 20 ms sleep to prevent overwhelming server
            time.sleep(0.02)
            
        except Exception as e:
            errors.append([movie_id, e])
    # Reload the year's results and save them as a compressed csv
    final_year_df = pd.read_json(JSON_FILE)
    final_year_df.to_csv(f"{FOLDER}final_tmdb_data_{YEAR}.csv.gz", compression="gzip", index=False)
print(f"- Total errors: {len(errors)}")

YEARS:   0%|          | 0/2 [00:00<?, ?it/s]

Movies from 2000:   0%|          | 0/1464 [00:00<?, ?it/s]

Movies from 2001:   0%|          | 0/1584 [00:00<?, ?it/s]

- Total errors: 425
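
The output above only reports how many errors occurred, not what they were. Since each entry in errors is a [movie_id, exception] pair, one way to inspect them is to count exception types and collect the failed IDs. This is a minimal sketch on made-up entries; the exception types shown are placeholders, not what the run actually produced.

```python
from collections import Counter

# Made-up entries in the same [movie_id, exception] shape as the errors list
sample_errors = [
    ["tt0000001", KeyError("countries")],
    ["tt0000002", ValueError("bad id")],
    ["tt0000003", KeyError("countries")],
]

# Count how often each exception type occurred
error_counts = Counter(type(err).__name__ for _, err in sample_errors)

# Collect the ids that failed, e.g. to retry or review them later
failed_ids = [movie_id for movie_id, _ in sample_errors]

print(error_counts.most_common())
print(failed_ids)
```

Running the same two lines against the real errors list would show whether the failures are mostly of one kind.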


In [14]:
final_year_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1363 entries, 0 to 1362
Data columns (total 26 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   imdb_id                1363 non-null   object 
 1   adult                  1362 non-null   float64
 2   backdrop_path          753 non-null    object 
 3   belongs_to_collection  97 non-null     object 
 4   budget                 1362 non-null   float64
 5   genres                 1362 non-null   object 
 6   homepage               1362 non-null   object 
 7   id                     1362 non-null   float64
 8   original_language      1362 non-null   object 
 9   original_title         1362 non-null   object 
 10  overview               1362 non-null   object 
 11  popularity             1362 non-null   float64
 12  poster_path            1232 non-null   object 
 13  production_companies   1362 non-null   object 
 14  production_countries   1362 non-null   object 
 15  rele