<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Capstone Project - Predicting Movie Box Office Revenue (Part 1 - Data Acquisition)

## Table of Contents

- [Background Information](#Background-Information)
- [Problem Statement](#Problem-Statement)
- [Executive Summary](#Executive-Summary)
- [Import libraries](#Import-libraries)
- [Data Acquisition](#Data-Acquisition)    
    - [Scraping movie details](#Scraping-movie-details)
    - [Scraping cast/crew details](#Scraping-cast/crew-details)
- [Export datasets](#Export-datasets)

## Background Information

<table><tr>
<td> <img src="../images/avatar.png" alt="Drawing" style="width: 200px;"/> </td>
<td> <img src="../images/mars-needs-mom.png" alt="Drawing" style="width: 200px;"/> </td>
</tr></table>

Some movies have been resounding theatrical successes. "Avatar", released in 2009, smashed several box office records at the time and is reported to have grossed 2.85 billion USD, according to Rotten Tomatoes. [(source)](https://editorial.rottentomatoes.com/article/highest-grossing-movies-all-time/)

While some movies are hits with mass audiences, others flop badly. "Mars Needs Moms", an animated film released by Walt Disney in 2011, was one of the worst-performing films in history. Despite costing Disney 150 million USD to produce, it grossed only 39.5 million USD worldwide [(source)](https://www.cbsnews.com/pictures/biggest-movie-flops-box-office-bombs/43/), leaving Disney with huge losses. Producing a movie typically requires a huge amount of resources and time, which carries significant financial and market risk for the production house. It is therefore imperative for production companies to make accurate revenue predictions so that they can allocate resources better.

## Problem Statement

The goal of this project is to build a supervised learning model that allows production companies to predict movie box office revenue more reliably than by relying on intuition and experience alone, and to identify the features that have a strong impact on box office revenue. This will allow them to make better data-driven decisions for resource allocation.

For this project, two metrics will be used to evaluate the performance of our models:
- Root mean squared error (RMSE)
- $R^{2}$ score
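
For reference, both metrics can be computed directly from their definitions. The sketch below uses NumPy with made-up revenue figures (in millions of USD); the function names `rmse` and `r2` are illustrative and are not taken from the project code.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error: square root of the mean squared residual."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def r2(y_true, y_pred):
    """R^2: one minus (residual sum of squares / total sum of squares)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

# Hypothetical revenues in millions of USD (not real data)
actual = [100.0, 200.0, 300.0, 400.0]
predicted = [110.0, 190.0, 310.0, 390.0]

rmse_value = rmse(actual, predicted)
r2_value = r2(actual, predicted)
print(rmse_value, r2_value)
```

RMSE is expressed in the same unit as revenue, which makes it easy to interpret, while $R^{2}$ is unitless and describes the share of variance in revenue that the model explains.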

## Executive Summary

The dataset used in this project comes from The Movie Database (TMDB), a popular user-editable database for movies and TV shows. Using the API provided on their website, we identified the movie ids of 10,000 movies under the "Popular Movies" section, scraped the primary movie details and the cast and crew data for each, and then combined them into a single dataset. The initial dataset contained 31 features, including revenue, movie budget, genre, cast and crew. Data cleaning, exploratory data analysis and feature engineering were carried out to make the data fit for modelling.

The best-performing model was the CatBoost regressor, which achieved a test $R^{2}$ score of 0.76 and a test RMSE of 69 million. Signs of overfitting remained even after regularization attempts during hyperparameter tuning. Possible reasons include insufficient data, since some rows had to be dropped due to missing revenue data, and possible data inaccuracies, since TMDB is user-editable and its entries might not be properly verified. The model identified vote count and movie budget as the strongest predictors of box office revenue. However, not all features would be useful to the relevant stakeholders during planning, as data on some of them only becomes available after theatrical release. The model could be improved further by collecting cross-platform data from IMDB, Rotten Tomatoes or Metacritic, and by using NLP techniques to analyze public sentiment about the movie during the early stages of production.

## Import libraries

In [1]:
#import libraries
from tqdm.notebook import trange
import pandas as pd
from pandas import json_normalize
import requests
import time

#set notebook parameters
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
pd.set_option('max_colwidth', 60)

In [2]:
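import os

#read the TMDB API key from an environment variable instead of hardcoding it in the notebook
#(the variable name TMDB_API_KEY is our own choice)
api_key = os.environ['TMDB_API_KEY']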

We will first use the TMDB API to scrape the id numbers of the first 10,000 movies under the "Popular" section on the TMDB website. Each page holds 20 movies, so we will request 500 pages, which is the maximum allowed.
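
As a quick check of the arithmetic, 20 movies per page across 500 pages gives 10,000 ids. The sketch below shows one way the page URL and the id extraction could be factored out. `popular_url` and `ids_from_page` are illustrative helper names, and `sample_payload` is a made-up stand-in for a real response that contains only the fields used here.

```python
MOVIES_PER_PAGE = 20   # movies returned per page, as described above
MAX_PAGES = 500        # number of pages requested in this notebook

def popular_url(api_key, page):
    """Build the 'popular movies' URL for one page (same pattern as the cells below)."""
    return ("https://api.themoviedb.org/3/movie/popular"
            "?api_key={}&language=en-US&page={}".format(api_key, page))

def ids_from_page(payload):
    """Return the movie ids found in one page of JSON results."""
    return [movie["id"] for movie in payload.get("results", [])]

expected_total = MOVIES_PER_PAGE * MAX_PAGES

# Made-up payload that mimics only the fields used here
sample_payload = {"page": 1, "results": [{"id": 101}, {"id": 102}, {"id": 103}]}

print(expected_total)
print(popular_url("YOUR_API_KEY", 1))
print(ids_from_page(sample_payload))
```

Reading ids from whatever `results` contains, rather than indexing a fixed 20 positions, keeps the extraction from failing on a page that returns fewer entries.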

In [3]:
link = "https://api.themoviedb.org/3/movie/popular?api_key={}&language=en-US&page=1".format(api_key)
response = requests.get(link)

In [4]:
id_list = []

In [5]:
#iterate through each page under 'Popular' for movies to scrape movie_id
for i in trange(1,501):
    try:
        response = requests.get("https://api.themoviedb.org/3/movie/popular?api_key={}&language=en-US&page={}".format(api_key,i))
        #raise an error on a failed request instead of parsing an error payload
        response.raise_for_status()
        #collect the id of every movie on the page (20 per page)
        for movie in response.json()['results']:
            id_list.append(movie['id'])

    except Exception as e:
        #notifies us which page failed and why
        print(f"Error occurred while scraping page {i}: {e}")

In [6]:
len(set(id_list))

10000
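
The check above confirms that all 10,000 scraped ids are distinct. If a rerun ever returned duplicates (for instance, should the ranking shift between page requests), they could be dropped while preserving the original order. A minimal sketch with made-up ids; `unique_in_order` is an illustrative helper name.

```python
def unique_in_order(ids):
    """Drop repeated ids while keeping the first occurrence of each."""
    # dict keys are unique and preserve insertion order
    return list(dict.fromkeys(ids))

# Made-up ids with two repeats
sample_ids = [550, 13, 550, 680, 13]
deduped = unique_in_order(sample_ids)
duplicates_removed = len(sample_ids) - len(deduped)

print(deduped)
print(duplicates_removed)
```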

## Data Acquisition

Using each movie id as a path parameter, we will scrape the primary information of each movie as well as its cast and crew details.
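
A minimal sketch of how the two endpoints used in this section are formed from a movie id, together with a defensive fetch helper. The helper names (`details_url`, `credits_url`, `fetch_json`) and the retry policy are assumptions for illustration and are not part of the original notebook. `get` can be any callable that behaves like `requests.get`, which lets the example run offline with a stub.

```python
import time

BASE_URL = "https://api.themoviedb.org/3/movie"

def details_url(movie_id, api_key):
    """URL for the primary details of one movie (movie id as path parameter)."""
    return "{}/{}?api_key={}&language=en-US".format(BASE_URL, movie_id, api_key)

def credits_url(movie_id, api_key):
    """URL for the cast/crew details of one movie."""
    return "{}/{}/credits?api_key={}&language=en-US".format(BASE_URL, movie_id, api_key)

def fetch_json(url, get, retries=3, pause=0.3):
    """Return the parsed JSON body, or None if no attempt gets HTTP 200."""
    for _ in range(retries):
        response = get(url)
        if response.status_code == 200:
            return response.json()
        time.sleep(pause)  # wait before trying again
    return None

class StubResponse:
    """Hypothetical stand-in for a requests.Response, for offline use."""
    def __init__(self, status_code, payload):
        self.status_code = status_code
        self._payload = payload

    def json(self):
        return self._payload

def stub_get(url):
    # Pretend that only movie id 42 exists
    if "/movie/42" in url:
        return StubResponse(200, {"id": 42, "title": "Demo"})
    return StubResponse(404, {"status_message": "not found"})

found = fetch_json(details_url(42, "YOUR_API_KEY"), stub_get, pause=0)
missing = fetch_json(credits_url(7, "YOUR_API_KEY"), stub_get, pause=0)
print(found)
print(missing)
```

In the notebook itself, `requests.get` would be passed as `get`. Checking the status code before parsing avoids adding an error payload to the dataset as if it were a movie record.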

#### Scraping movie details

In [7]:
#scrape 1st movie
response = requests.get("https://api.themoviedb.org/3/movie/{}?api_key={}&language=en-US".format(id_list[0],api_key))
data = response.json()
data = json_normalize(data)
movies = pd.DataFrame(data)   

In [8]:
#iterate through each movie_id to scrape movie primary information
for i in trange(1,len(id_list)):
    try:
        response = requests.get("https://api.themoviedb.org/3/movie/{}?api_key={}&language=en-US".format(id_list[i],api_key))
        data = response.json()
        data = json_normalize(data)
        temp_df = pd.DataFrame(data)
        movies = pd.concat([movies, temp_df], ignore_index=True)
        
        #0.3-second pause between requests to avoid overloading the server
        time.sleep(0.3)
        
    except Exception:
        #notifies us if there is an error during scraping
        print("Error occurred while scraping")
    

Error occurred while scraping
Error occurred while scraping


In [9]:
movies.shape

(9998, 29)
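
The row count is consistent with the log: one row from the first request plus 9,999 loop iterations, minus the two reported errors, gives 9,998. The loop above calls `pd.concat` once per movie, which copies the growing DataFrame on every iteration. A common alternative is to collect the normalized rows in a list and concatenate once at the end. The sketch below illustrates this with two made-up records; `build_frame` and the sample fields are illustrative and do not reflect TMDB's actual schema.

```python
import pandas as pd
from pandas import json_normalize

def build_frame(records):
    """Normalize each JSON record, then concatenate all rows in one call."""
    frames = [json_normalize(record) for record in records]
    return pd.concat(frames, ignore_index=True)

# Made-up records with one nested field, standing in for API responses
sample_records = [
    {"id": 1, "title": "Movie A", "budget": 100, "collection": {"name": "Series A"}},
    {"id": 2, "title": "Movie B", "budget": 250, "collection": {"name": "Series B"}},
]

movies_demo = build_frame(sample_records)
print(movies_demo.shape)
print(list(movies_demo.columns))
```

Nested dictionaries are flattened into dotted column names by `json_normalize`, which is why a single movie record can expand into many columns.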

#### Scraping cast/crew details

In [10]:
#scrape 1st movie
response = requests.get("https://api.themoviedb.org/3/movie/{}/credits?api_key={}&language=en-US".format(id_list[0],api_key))
data = response.json()
data = json_normalize(data)
credits = pd.DataFrame(data)   

In [11]:
#iterate through each movie_id to scrape cast/crew details
for i in trange(1,len(id_list)):
    try:
        response = requests.get("https://api.themoviedb.org/3/movie/{}/credits?api_key={}&language=en-US".format(id_list[i],api_key))
        data = response.json()
        data = json_normalize(data)
        temp_df = pd.DataFrame(data)
        credits = pd.concat([credits, temp_df], ignore_index=True)
        
        #0.3-second pause between requests to avoid overloading the server
        time.sleep(0.3)
        
    except Exception:
        #notifies us if there is an error during scraping
        print("Error occurred while scraping")
    

Error occurred while scraping


In [12]:
credits.shape

(9999, 3)
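
The two tables share the movie id, so they can later be combined on that key. Note that 29 movie columns plus 3 credits columns with one shared `id` column gives 31, which is consistent with the 31 features mentioned in the Executive Summary. A minimal sketch with made-up frames follows. The `cast` and `crew` column names and the inner join are assumptions for illustration, and the actual combination is not performed in this notebook.

```python
import pandas as pd

# Made-up stand-ins for the scraped tables
movies_demo = pd.DataFrame({
    "id": [1, 2, 3],
    "title": ["Movie A", "Movie B", "Movie C"],
    "revenue": [1000, 2500, 0],
})
credits_demo = pd.DataFrame({
    "id": [1, 2],
    "cast": [[{"name": "Actor X"}], []],
    "crew": [[{"job": "Director", "name": "Director Y"}], []],
})

# An inner join keeps only movies present in both tables (assumed choice)
combined = movies_demo.merge(credits_demo, on="id", how="inner")
print(combined.shape)
print(list(combined.columns))
```

An inner join silently drops movies whose credits request failed; `how="left"` would keep them with missing cast and crew values instead.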

## Export datasets

In [13]:
#uncomment the lines below to save/overwrite the files
# movies.to_csv('../data/movies_scraped.csv', index=False)
# credits.to_csv('../data/credits_scraped.csv', index=False)
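
One thing to keep in mind when reloading these files: columns that hold Python lists (presumably the cast and crew entries) are written to CSV as text, so they come back as strings. Below is a minimal sketch of the round trip, using an in-memory buffer instead of a file and a made-up one-row frame. `ast.literal_eval` is one way to restore the lists, assuming the stored strings are plain Python literals.

```python
import ast
import io

import pandas as pd

# Made-up frame with a list-valued column, similar in spirit to the credits data
credits_demo = pd.DataFrame({"id": [1], "cast": [[{"name": "Actor X"}]]})

# Write to an in-memory buffer rather than to disk
buffer = io.StringIO()
credits_demo.to_csv(buffer, index=False)
buffer.seek(0)

reloaded = pd.read_csv(buffer)
cast_type_after_reload = type(reloaded["cast"][0])

# Parse the text back into Python lists
restored = reloaded["cast"].apply(ast.literal_eval)
print(cast_type_after_reload)
print(restored[0])
```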