# Project 3 Part 2 API

<mark> ***Use an API to extract box office revenue and profit data to add to your IMDB data and perform exploratory data analysis.***

### Task

Your stakeholder wants you to extract the budget, revenue, and MPAA Rating (G/PG/PG-13/R), which is also called "Certification".

Note: this process can take a long time and may need to run overnight.

### Specifications - Financial Data

Your stakeholder would like you to extract and save the results for movies that meet all of the criteria established in Part 1 of the project. (You should already have a filtered dataframe from Part 1 saved as a .csv.gz file.)

* [x] As a proof-of-concept, they requested you perform a test extraction of movies that started in 2000 or 2001

* [x] Each year should be saved as a separate .csv.gz file

Hint: Use the two custom functions from the lessons (Intro to TMDB API and Efficient TMDB API Calls). Be sure to define these functions before calling them in your code!

* One function will add the certification (MPAA Rating) to movie.info
* The other function will help you append/extend a JSON file with Python

### Confirm Your API Function Works

* [x] In order to ensure your function for extracting movie data from TMDB is working, test your function on these 2 movie ids: tt0848228 ("The Avengers") and tt0332280 ("The Notebook"). Make sure that your function runs without error and that it returns the correct movie's data for both test ids.

Once you have retrieved and saved the final results to 2 separate .csv.gz files, move on to a new Exploratory Data Analysis notebook to explore the following questions.

### Imports

In [1]:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
import seaborn as sns
from tqdm.notebook import tqdm_notebook
import tmdbsimple as tmdb
import json
import time
import os

### Loading in csvs

In [2]:
basics = pd.read_csv('Data/title_basics.csv.gz')
basics.head(2)

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres
0,tt0035423,movie,Kate & Leopold,Kate & Leopold,0,2001.0,,118,"Comedy,Fantasy,Romance"
1,tt0062336,movie,The Tango of the Widower and Its Distorting Mi...,El tango del viudo y su espejo deformante,0,2020.0,,70,Drama


In [3]:
# changing year to int because the float may be causing an error in later code
basics['startYear'] = basics['startYear'].astype(int)

In [4]:
# confirming
basics.head()

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres
0,tt0035423,movie,Kate & Leopold,Kate & Leopold,0,2001,,118,"Comedy,Fantasy,Romance"
1,tt0062336,movie,The Tango of the Widower and Its Distorting Mi...,El tango del viudo y su espejo deformante,0,2020,,70,Drama
2,tt0069049,movie,The Other Side of the Wind,The Other Side of the Wind,0,2018,,122,Drama
3,tt0088751,movie,The Naked Monster,The Naked Monster,0,2005,,100,"Comedy,Horror,Sci-Fi"
4,tt0096056,movie,Crime and Punishment,Crime and Punishment,0,2002,,126,Drama


## Setting up the API

In [5]:
# loading api-key
with open('/Users/cameron/.secret/tmdb_api.json', 'r') as f:
    login = json.load(f)
# confirming
login.keys()

dict_keys(['api-key'])

In [6]:
# setting key in tmdb module
tmdb.API_KEY =  login['api-key']
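
A minimal sketch of a slightly more defensive way to read the key, assuming the same JSON layout with an `api-key` field. The helper name `load_api_key` is hypothetical and not part of `tmdbsimple`; the demonstration uses a throwaway file and a fake key.

```python
import json
import os
import tempfile


def load_api_key(path, field="api-key"):
    """Return the API key stored under `field` in a JSON credentials file."""
    with open(path, "r") as f:
        login = json.load(f)
    # Fail early with a clear message instead of sending an empty key to the API.
    if not login.get(field):
        raise KeyError(f"'{field}' is missing or empty in {path}")
    return login[field]


# Demonstration with a temporary file and a fake key (not a real credential).
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "tmdb_api.json")
    with open(demo_path, "w") as f:
        json.dump({"api-key": "FAKE_KEY"}, f)
    demo_key = load_api_key(demo_path)

print(demo_key)
```

In the notebook, the returned value could then be assigned to `tmdb.API_KEY` exactly as in the cell above.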

### Setting up folder

In [7]:
folder = "Data/"
os.makedirs(folder, exist_ok=True)
os.listdir(folder);

['final_tmdb_data_2006.csv.gz',
 'tmdb_api_results_2010.json',
 'final_tmdb_data_2014.csv.gz',
 'tmdb_api_results_2006.json',
 'final_tmdb_data_2008.csv.gz',
 'final_tmdb_data_2004.csv.gz',
 'tmdb_api_results_2007.json',
 'tmdb_api_results_2011.json',
 'tmdb_api_results_2000.json',
 'final_tmdb_data_2000.csv.gz',
 'final_tmdb_data_2012.csv.gz',
 'tmdb_api_results_2001.json',
 'final_tmdb_data_2010.csv.gz',
 'final_tmdb_data_2002.csv.gz',
 'title_basics.csv.gz',
 'tmdb_api_results_2002.json',
 'final_tmdb_data_2007.csv.gz',
 'tmdb_api_results_2014.json',
 'tmdb_api_results_2015.json',
 'tmdb_api_results_2003.json',
 'final_tmdb_data_2009.csv.gz',
 'final_tmdb_data_2005.csv.gz',
 'final_tmdb_data_2001.csv.gz',
 'final_tmdb_data_2013.csv.gz',
 'tmdb_api_results_2004.json',
 'tmdb_api_results_2012.json',
 'tmdb_api_results_2008.json',
 'title_akas.csv.gz',
 'tmdb_api_results_2009.json',
 'tmdb_api_results_2013.json',
 'final_tmdb_data_2011.csv.gz',
 'final_tmdb_data_2003.csv.gz',
 'tmdb_ap

## Setting up functions

In [8]:
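def get_movie_with_rating(movie_id):
    """Return the TMDB info for a movie, with its US certification added."""
    # get the movie object for the current id
    movie = tmdb.Movies(movie_id)
    # save the .info and .releases dictionaries
    info = movie.info()
    releases = movie.releases()

    # default to None so the key exists even when there is no US release
    info['certification'] = None
    # loop through the countries in releases and keep the US certification
    for c in releases['countries']:
        if c['iso_3166_1'] == 'US':
            info['certification'] = c['certification']

    return info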
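
The certification lookup can be tried without calling the API. The sketch below assumes a payload shaped like the `releases` dictionary used above (a `countries` list whose entries have `iso_3166_1` and `certification` keys). The helper name `extract_us_certification` and the sample data are hypothetical. Unlike the loop above, which keeps the last US entry, this version returns the first non-empty US certification.

```python
def extract_us_certification(releases):
    """Return the first non-empty US certification in a releases payload, else None."""
    for country in releases.get("countries", []):
        if country.get("iso_3166_1") == "US" and country.get("certification"):
            return country["certification"]
    return None


# Made-up payload: one non-US entry, one US entry without a rating, one with a rating.
fake_releases = {
    "countries": [
        {"iso_3166_1": "GB", "certification": "12A"},
        {"iso_3166_1": "US", "certification": ""},
        {"iso_3166_1": "US", "certification": "PG-13"},
    ]
}

print(extract_us_certification(fake_releases))        # PG-13
print(extract_us_certification({"countries": []}))    # None
```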

In [9]:
def write_json(new_data, filename):
    """Appends a list of records (new_data) to a json file (filename).
    Adapted from: https://www.geeksforgeeks.org/append-to-json-file-using-python/"""

    with open(filename, 'r+') as file:
        # load the existing records into a list
        file_data = json.load(file)
        # extend with a list of records, or append a single record
        if isinstance(new_data, list) and isinstance(file_data, list):
            file_data.extend(new_data)
        else:
            file_data.append(new_data)
        # move back to the start of the file and overwrite it
        file.seek(0)
        json.dump(file_data, file)
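
As a quick check of the append/extend behaviour, the sketch below repeats the same logic so it runs on its own and exercises it against a temporary file. The file name and the ids are made up for the demonstration.

```python
import json
import os
import tempfile


def write_json(new_data, filename):
    """Same append/extend logic as the cell above, repeated so the sketch runs alone."""
    with open(filename, "r+") as file:
        file_data = json.load(file)
        if isinstance(new_data, list) and isinstance(file_data, list):
            file_data.extend(new_data)
        else:
            file_data.append(new_data)
        file.seek(0)
        json.dump(file_data, file)


with tempfile.TemporaryDirectory() as tmp:
    demo_file = os.path.join(tmp, "demo_results.json")
    # The file must already exist and hold a JSON list before the first call.
    with open(demo_file, "w") as f:
        json.dump([{"imdb_id": 0}], f)

    # A single record is appended; a list of records is extended.
    write_json({"imdb_id": "tt0000001"}, demo_file)
    write_json([{"imdb_id": "tt0000002"}, {"imdb_id": "tt0000003"}], demo_file)

    with open(demo_file) as f:
        records = json.load(f)

print([r["imdb_id"] for r in records])
```

Because the function opens the file in `r+` mode, it raises an error if the file does not exist yet, which is why a placeholder list is written first.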

### Testing Functions


In [10]:
# using function
avengers = get_movie_with_rating('tt0848228')
notebook = get_movie_with_rating('tt0332280')

In [28]:
# confirming results
for i in [avengers, notebook]:
    print(i['title'], i['release_date'], i['certification'])

The Avengers 2012-04-25 PG-13
The Notebook 2004-06-25 PG-13


<mark><u>**Comment:**</u>

<font color='dodgerblue' size=4><i>
Looks like everything worked here
</i></font>

### Loop to Gather Data

In [61]:
# years to get from api call
years_to_get = [2019]

In [62]:
# list to catch errors
errors = []

In [63]:
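for year in tqdm_notebook(years_to_get, desc='YEARS', position=0):
    # file that stores the raw API results for this year
    json_file = f'{folder}tmdb_api_results_{year}.json'
    # if the file does not exist yet, create it with a placeholder record
    if not os.path.isfile(json_file):
        with open(json_file, 'w') as f:
            json.dump([{'imdb_id': 0}], f)

    # ids of all movies that started in this year
    df = basics.loc[basics['startYear'] == year].copy()
    movie_ids = df['tconst'].copy()
    # skip ids that were already saved by a previous run
    previous_df = pd.read_json(json_file)
    movie_ids_to_get = movie_ids[~movie_ids.isin(previous_df['imdb_id'])]

    for movie_id in tqdm_notebook(movie_ids_to_get,
                                  desc=f'Movies from {year}',
                                  position=1,
                                  leave=True):
        try:
            # retrieve the data for the movie id and append it to the json file
            temp = get_movie_with_rating(movie_id)
            write_json(temp, json_file)
            # short pause between requests
            time.sleep(0.02)

        except Exception as e:
            # record the id and the error, then continue with the next movie
            errors.append([movie_id, e])

    # save this year's results as a compressed csv
    final_year_df = pd.read_json(json_file)
    final_year_df.to_csv(f"{folder}final_tmdb_data_{year}.csv.gz", compression="gzip", index=False)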

YEARS:   0%|          | 0/1 [00:00<?, ?it/s]

Movies from 2019:   0%|          | 0/5847 [00:00<?, ?it/s]
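
The loop can be restarted safely because it only requests ids that are not yet in the year's JSON file. Here is a minimal sketch of that resume step using only the standard library instead of pandas; the helper name `ids_still_needed`, the file name, and the ids are hypothetical.

```python
import json
import os
import tempfile


def ids_still_needed(all_ids, json_file):
    """Return the ids in all_ids that are not yet saved in json_file."""
    with open(json_file) as f:
        saved_ids = {record.get("imdb_id") for record in json.load(f)}
    return [movie_id for movie_id in all_ids if movie_id not in saved_ids]


with tempfile.TemporaryDirectory() as tmp:
    demo_file = os.path.join(tmp, "tmdb_api_results_demo.json")
    # Placeholder row plus one movie that was saved by an earlier run.
    with open(demo_file, "w") as f:
        json.dump([{"imdb_id": 0}, {"imdb_id": "tt0000001"}], f)

    remaining = ids_still_needed(["tt0000001", "tt0000002", "tt0000003"], demo_file)

print(remaining)
```

Note that ids which raise an error are never written to the JSON file, so they are requested again on every rerun.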

In [55]:
# checking the number of errors
print(f"- Total errors: {len(errors)}")

- Total errors: 1103


In [60]:
# checking the number of errors
print(f"- Total errors: {len(errors)}")

# demonstrating that the errors are related to ids not in the tmdb
count = 0
for i in errors:
    if '404 Client Error' in str(i[1]):
        count += 1
print(f'- 404 Client Errors: {count}')

- Total errors: 1103
- 404 Client Errors: 1103
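
Counting a single message by hand works when every error is of the same kind. As a sketch of a more general check, the errors can be grouped by the leading part of their message. The `demo_errors` list below is made up and only mimics the `[movie_id, exception]` pairs collected by the loop; `summarize_errors` is a hypothetical helper.

```python
from collections import Counter

# Made-up stand-ins for the [movie_id, exception] pairs stored in `errors`.
demo_errors = [
    ["tt0000001", Exception("404 Client Error: Not Found for url: ...")],
    ["tt0000002", Exception("404 Client Error: Not Found for url: ...")],
    ["tt0000003", TimeoutError("timed out")],
]


def summarize_errors(errors):
    """Count errors by the text that precedes the first colon in each message."""
    return Counter(str(exc).split(":")[0] for _, exc in errors)


summary = summarize_errors(demo_errors)
for message, count in summary.most_common():
    print(f"- {message}: {count}")
```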


In [57]:
errors[:5]

[['tt0850247',
  requests.exceptions.HTTPError('404 Client Error: Not Found for url: https://api.themoviedb.org/3/movie/tt0850247?api_key=<redacted>')],
 ['tt10013634',
  requests.exceptions.HTTPError('404 Client Error: Not Found for url: https://api.themoviedb.org/3/movie/tt10013634?api_key=<redacted>')],
 ['tt10018116',
  requests.exceptions.HTTPError('404 Client Error: Not Found for url: https://api.themoviedb.org/3/movie/tt10018116?api_key=<redacted>')],
 ['tt10027174',
  requests.exceptions.HTTPError('404 Client Error: Not Found for url: https://api.themoviedb.org/3/movie/tt10027174?api_key=<redacted>')],
 ['tt10052452',
  requests.exceptions.HTTPError('404 Client Error: Not Found for url: https://api.themoviedb.org/3/movie/tt10052452?api_key=<redacted>')]]
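
Each error message carries the full request URL, including the `api_key` query parameter, so it is worth masking that value before printing or sharing the errors. A minimal sketch, assuming the key consists of letters, digits, or underscores; `redact_api_key` is a hypothetical helper and the sample message uses a fake key.

```python
import re


def redact_api_key(text):
    """Replace the value of any api_key query parameter with a placeholder."""
    return re.sub(r"(api_key=)\w+", r"\1<redacted>", text)


sample = ("404 Client Error: Not Found for url: "
          "https://api.themoviedb.org/3/movie/tt0000001?api_key=FAKEKEY123")
cleaned = redact_api_key(sample)
print(cleaned)
```

In the notebook this could be applied as `redact_api_key(str(e))` before an error is stored or displayed.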

In [38]:
df_2015 = pd.read_json('Data/tmdb_api_results_2015.json')
df_2016 = pd.read_json('Data/tmdb_api_results_2016.json')

In [39]:
display(len(df_2015), len(df_2016))

3793

4040

In [40]:
df_2016.columns

Index(['imdb_id', 'adult', 'backdrop_path', 'belongs_to_collection', 'budget',
       'genres', 'homepage', 'id', 'original_language', 'original_title',
       'overview', 'popularity', 'poster_path', 'production_companies',
       'production_countries', 'release_date', 'revenue', 'runtime',
       'spoken_languages', 'status', 'tagline', 'title', 'video',
       'vote_average', 'vote_count', 'certification'],
      dtype='object')

In [41]:
df_2016[df_2016['certification'].isna()][['title', 'release_date', 'certification']].head(20)

Unnamed: 0,title,release_date,certification
0,,,
3,The History of Love,2016-11-09,
4,"Hot Country, Cold Winter",2016-05-04,
17,Chờ Em Đến Ngày Mai,2016-12-30,
18,Special Forces,2016-02-04,
19,半熟少女之梦想预备生,2013-12-12,
21,Damned Gold,2016-04-22,
22,Pociecha,2016-11-05,
27,Touch,2016-11-22,
30,The Merchant of Venice - Live at Shakespeare's...,2016-08-04,
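
Row 0 in the output above is entirely empty. It is most likely the `{'imdb_id': 0}` placeholder written when each JSON file is created, so it should be dropped before any analysis. The sketch below shows the idea on made-up records using plain Python; the titles and ids are hypothetical.

```python
# Made-up records shaped like the saved results: a placeholder row first,
# then one movie with a certification and one without.
demo_records = [
    {"imdb_id": 0},
    {"imdb_id": "tt0000001", "title": "Demo A", "certification": "PG-13"},
    {"imdb_id": "tt0000002", "title": "Demo B", "certification": None},
]

# Drop the placeholder, then list the movies that have no certification.
movies = [r for r in demo_records if r.get("imdb_id") != 0]
missing = [r["title"] for r in movies if not r.get("certification")]

print(len(movies), missing)
```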
