# Business Analysis on Movie Production

> ## TMDB Files ETL (Extract, Transform and Load)

## Business Problem:
In phase one, a dataset was obtained from IMDB to analyze the success of movies, but it turned out to contain no financial information (budget or revenue). To analyze the movies, additional financial data is required. The business stakeholder identified The Movie Database (TMDB) as a good source of this additional financial data.

**Data Source:** (https://www.themoviedb.org/)


## Data Specifications
>- Information on budget, revenue and MPAA rating (G/PG/PG-13/R), which is also called "Certification", should be extracted and added to the initial data.

>- The stakeholder is only interested in movies that meet the same criteria as those specified in the initial IMDB ETL.

>- As a proof of concept, extractions of movies from the years 2000 and 2001 were requested. Each year is to be saved in a separate .csv.gz file.

## Importing libraries

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import tmdbsimple as tmdb
import os, json, math, time
from tqdm.notebook import tqdm_notebook

##  APIs

### Loading API Credentials

In [2]:
# Loading API credentials
import json

# Load the API credentials from the .secret folder.
with open("/Users/heill/.secret/tmdb_api.json", "r") as f:
    login = json.load(f)

# Display the names of the entries in the credentials file (not their values)
login.keys()

dict_keys(['api_key'])

In [3]:
# Logging in with the API key
import tmdbsimple as tmdb
tmdb.API_KEY = login['api_key']
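
A slightly more defensive way to load the key is sketched below. The helper name `load_api_key` and the throwaway file used in the demonstration are assumptions made for illustration and are not part of the original notebook. The helper raises a clear error when the credentials file has no `api_key` entry.

```python
import json
import tempfile
from pathlib import Path


def load_api_key(path):
    """Hypothetical helper: read a JSON credentials file and return its 'api_key' value."""
    with open(path, "r") as f:
        login = json.load(f)
    if "api_key" not in login:
        raise KeyError(f"'api_key' not found in {path}")
    return login["api_key"]


# Demonstration with a temporary file holding a dummy key (not a real credential)
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "tmdb_api.json"
    demo_path.write_text(json.dumps({"api_key": "dummy-key"}))
    demo_key = load_api_key(demo_path)

print(demo_key)
```

In the notebook, the returned value would be assigned to `tmdb.API_KEY` in the same way as `login['api_key']` above.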

### Testing of API call

In [4]:
# Create a movie object with the tmdb.Movies class (603 is the TMDB ID)
movie = tmdb.Movies(603)

# Request the movie's details as a dictionary
movie_info = movie.info()
movie_info

{'adult': False,
 'backdrop_path': '/ncEsesgOJDNrTUED89hYbA117wo.jpg',
 'belongs_to_collection': {'id': 2344,
  'name': 'The Matrix Collection',
  'poster_path': '/bV9qTVHTVf0gkW0j7p7M0ILD4pG.jpg',
  'backdrop_path': '/bRm2DEgUiYciDw3myHuYFInD7la.jpg'},
 'budget': 63000000,
 'genres': [{'id': 28, 'name': 'Action'},
  {'id': 878, 'name': 'Science Fiction'}],
 'homepage': 'http://www.warnerbros.com/matrix',
 'id': 603,
 'imdb_id': 'tt0133093',
 'original_language': 'en',
 'original_title': 'The Matrix',
 'overview': 'Set in the 22nd century, The Matrix tells the story of a computer hacker who joins a group of underground insurgents fighting the vast and powerful computers who now rule the earth.',
 'popularity': 63.733,
 'poster_path': '/f89U3ADr1oiB1s9GkdPOEpXUk5H.jpg',
 'production_companies': [{'id': 79,
   'logo_path': '/tpFpsqbleCzEE2p5EgvUq6ozfCA.png',
   'name': 'Village Roadshow Pictures',
   'origin_country': 'US'},
  {'id': 372,
   'logo_path': None,
   'name': 'Groucho II Film

**Revenue was retrieved successfully, as can be seen in the output of .info(). However, certification is missing from that output. This needs to be rectified in order to meet the requirements.**

In [5]:
# View individual fields of the response
movie_info["budget"]

63000000

In [6]:
movie_info["revenue"]

463517383

In [7]:
movie_info["imdb_id"]

'tt0133093'

In [8]:
# Test lookup using an IMDB ID
movie = tmdb.Movies("tt1361336")
info = movie.info()
info["budget"]

50000000

**Below is a sample extraction of the certification using .releases()**

In [9]:
# example from package README - the rating of the movie if it is in the US
response = movie.releases()
for c in movie.countries:
    if c['iso_3166_1'] == 'US':
        print(c['certification'])

PG
PG
PG
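
The loop above prints one line per US release entry, which is why the same rating appears several times. A minimal sketch of reducing this to a single value follows. The helper name `us_certification` and the sample dictionary are illustrative assumptions (the sample is not real API output), and the sketch assumes that some entries may carry an empty certification string.

```python
def us_certification(releases):
    """Hypothetical helper: return the first non-empty US certification, or None."""
    for c in releases.get("countries", []):
        if c.get("iso_3166_1") == "US" and c.get("certification"):
            return c["certification"]
    return None


# Illustrative data shaped like the "countries" list used above (not real API output)
sample_releases = {
    "countries": [
        {"iso_3166_1": "GB", "certification": "12"},
        {"iso_3166_1": "US", "certification": ""},
        {"iso_3166_1": "US", "certification": "PG"},
        {"iso_3166_1": "US", "certification": "PG"},
    ]
}

print(us_certification(sample_releases))  # prints PG
```

Returning `None` when no US entry exists keeps the missing case explicit instead of silently leaving the key out.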


### Saving the Data

In [10]:
# Create the folder for saved data if it does not already exist, then list its contents.
FOLDER = 'Movies_files/'
os.makedirs(FOLDER, exist_ok =True)
os.listdir(FOLDER)

['Title_Akas.csv',
 'Title_Akas.csv.gz',
 'Title_Basics.csv',
 'Title_Basics.csv.gz',
 'Title_Ratings.csv',
 'Title_Ratings.csv.gz',
 'tmdb_api_results_2000.json',
 'tmdb_api_results_2001.json',
 'tmdb_api_results_2002.json',
 'tmdb_api_results_2003.json',
 'tmdb_api_results_2004.json',
 'tmdb_combined_years.csv.gz',
 'tmdb_results_combined.csv.gz',
 'tmdb_results_combined_final_df.csv.gz',
 'tmdb_years']

###  Extract Files

In [11]:
#Defining the years to extract
YEARS_TO_GET = list(range(2000, 2022,1))
print(YEARS_TO_GET)

[2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019, 2020, 2021]


####  Defining functions for ease of coding

**Movie Rating function**

In [12]:
def movie_with_rating(movie_id):
    """Return the movie's info dict with the US certification added, if available."""
    # Create the movie object for this ID
    movie = tmdb.Movies(movie_id)
    # Build the output dict from the movie details
    movie_info = movie.info()
    # Retrieve the release information
    releases = movie.releases()
    # Loop through the countries in the releases
    for c in releases["countries"]:
        # If the country abbreviation is US
        if c["iso_3166_1"] == "US":
            # Save a "certification" key in the info dict with the certification
            movie_info["certification"] = c["certification"]
    return movie_info

**Function to append results to the JSON file**

In [13]:
# Using a function to append new results to the existing JSON file

def write_json(new_data, filename):
    with open(filename, "r+") as file:
        # First, load the existing data (a list of records)
        file_data = json.load(file)
        # Extend with a list of new records, or append a single record
        if isinstance(new_data, list) and isinstance(file_data, list):
            file_data.extend(new_data)
        else:
            file_data.append(new_data)
        # Move the file position back to the start before rewriting
        file.seek(0)
        # Convert back to JSON and write to the file
        json.dump(file_data, file)
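
The behaviour of this append step can be checked on a throwaway file, as sketched below. The variant `append_json` is a hypothetical name used for illustration. Its `file.truncate()` call is an addition that the notebook's function does not have; it only matters if the rewritten content could ever be shorter than the old content, which does not happen when records are only added.

```python
import json
import os
import tempfile


def append_json(new_data, filename):
    """Illustrative variant of write_json; the truncate() call is an addition."""
    with open(filename, "r+") as file:
        file_data = json.load(file)  # assumed to be a list of records
        if isinstance(new_data, list):
            file_data.extend(new_data)
        else:
            file_data.append(new_data)
        file.seek(0)
        json.dump(file_data, file)
        file.truncate()  # discard leftover bytes if the new content were shorter


with tempfile.TemporaryDirectory() as tmp:
    demo_file = os.path.join(tmp, "demo_results.json")
    # Start from the same placeholder record that the API loop writes
    with open(demo_file, "w") as f:
        json.dump([{"imdb_id": 0}], f)
    # Append a single record, then a list of records (dummy IDs)
    append_json({"imdb_id": "tt0000001", "budget": 100}, demo_file)
    append_json([{"imdb_id": "tt0000002"}, {"imdb_id": "tt0000003"}], demo_file)
    with open(demo_file) as f:
        records = json.load(f)

print([r["imdb_id"] for r in records])
```

A single dict is appended as one record, while a list is merged element by element, so the file always holds a flat list of records.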

**Loading the dataframe to use in specifying the parameters of the loop.**

In [14]:
#Loading the dataframe from project part 1 as basics
basics = pd.read_csv('Data_files/Title_Basics.csv.gz')
basics.head(2)

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres
0,tt0035423,movie,Kate & Leopold,Kate & Leopold,0,2001,,118,"Comedy,Fantasy,Romance"
1,tt0062336,movie,The Tango of the Widower and Its Distorting Mi...,El Tango del Viudo y Su Espejo Deformante,0,2020,,70,Drama


### API call loop

In [None]:
# Start of OUTER loop
for YEAR in tqdm_notebook(YEARS_TO_GET,desc='YEARS',position=0):
    
    #Defining the JSON file to store results for year
    JSON_FILE = f"{FOLDER}tmdb_api_results_{YEAR}.json"

    # Check whether the file already exists
    file_exists = os.path.isfile(JSON_FILE)

    # If it does not exist, create it
    if not file_exists:

        # Save a list holding one placeholder dict with just "imdb_id" to the new JSON file
        with open(JSON_FILE, "w") as f:
            json.dump([{"imdb_id": 0}], f)

    # Define and filter the IDs to call

    # Save the movies from the current year as the working df
    df = basics.loc[basics["startYear"] == YEAR].copy()

    # Save that year's movie IDs
    movie_ids = df["tconst"].copy()

    # Load any existing API results with pd.read_json
    previous_df = pd.read_json(JSON_FILE)

    # Keep only the movie IDs that are not already in the JSON file
    movie_ids_to_get = movie_ids[~movie_ids.isin(previous_df["imdb_id"])]
   

    
    # INNER LOOP

    # Iterate over the movie IDs that still need to be retrieved
    for movie_id in tqdm_notebook(movie_ids_to_get,
                                  desc=f'Movies from {YEAR}',
                                  position=1,
                                  leave=True):

        try:
            # Retrieve the data for the movie ID using the pre-made function
            temp = movie_with_rating(movie_id)
            # Append the results to the existing file using the pre-made function
            write_json(temp, JSON_FILE)
            # Sleep for 20 ms to avoid overwhelming the server
            time.sleep(0.02)

        # If the call fails, skip this movie and move on to the next one
        except Exception as e:
            continue
        
    # Save this year's results to its own compressed CSV file (inside the outer loop)
    final_year_df = pd.read_json(JSON_FILE)
    final_year_df.to_csv(f"{FOLDER}final_tmdb_data_{YEAR}.csv.gz", compression="gzip", index=False)
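
The loop can be resumed after an interruption because it skips IDs that are already stored in the year's JSON file. The filter it relies on is sketched below with made-up IDs (these are not rows from the project data); `0` stands for the placeholder record written when the file is first created.

```python
import pandas as pd

# Illustrative IDs only: the IDs wanted for a year, and the records already saved
movie_ids = pd.Series(["tt0000001", "tt0000002", "tt0000003"])
previous_df = pd.DataFrame({"imdb_id": [0, "tt0000002"]})

# Keep only the IDs that have not been saved yet
movie_ids_to_get = movie_ids[~movie_ids.isin(previous_df["imdb_id"])]

print(movie_ids_to_get.tolist())  # prints ['tt0000001', 'tt0000003']
```

Note that a movie whose API call failed is never written to the file, so it is requested again on the next run.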