# Predicting Movie Earnings
    Thomas van der Molen
    S4-AI41

## Table of Contents
- [Version History](#Version-History)
- [Introduction](#Introduction)  
- [Proposal](#Proposal)
    * [Domain Understanding](#Domain-Understanding)
        * [Interview](#Interview)
        * [Real World Example](#Real-World-Example)
            * [House of Cards](#House-of-Cards)
            * [Sonic the Hedgehog](#Sonic-the-Hedgehog)
        * [Question Statement](#Question-Statement)
    * [Data Sourcing](#Data-Sourcing)
    * [Analytic Approach](#Analytic-Approach)
- [Iteration 1](#Iteration-1)
    - [Provisioning](#Provisioning)
        * [Data Requirements](#Data-Requirements)
            * [TMDb](#TMDb)
            * [The-Numbers](#The-Numbers)
        * [Data Collection](#Data-Collection)
        * [Data Understanding](#Data-Understanding)
        * [Data Preparation](#Data-Preparation)
    - [Predictions](#Predictions)
        * [Preprocessing](#Preprocessing)
        * [Modelling](#Modelling)
        * [Evaluation](#Evaluation)
- [Delivery](#Delivery)
    * [Collecting](#Collecting)
    * [Documenting](#Documenting)
    * [Reporting](#Reporting)

In [1]:
%%html
<style>
table {float:left}
img {float:left}
</style>

## Version History
| Version | Date | Change |
| :---: | :--- | :--- |
| 1.0 | 24-03-2022 | Created document and worked on the proposal |
| 1.1 | 25-03-2022 | Worked on proposal and provisioning |

## Introduction
  
This is Iteration 1 of my challenge. If you wish to see the exploratory research done during Iteration 0, you can find it in the Iteration 0 folder [here](Iteration0/ChallengeProposal.html) or [online](https://github.com/Thomas-Molen/FHICT-S4-AI/blob/main/Challenge/ChallengeProposal.html).
  
I have decided to do a project in the domain of movies because I enjoy watching movies and seeing what makes them succeed or fail. I have also chosen this domain because there is quite a bit of publicly available data about movies that could be used.

## Proposal

### Domain Understanding
As said in the [Introduction](#Introduction), I have chosen to do a project in the domain of movies. The film industry can be split up into a couple of subdomains. For me the most important one is the area connected to the production process, seeing as the companies and individuals operating in this area put a lot of investment (both time and money) into the movie itself.

#### Interview
The first step in getting a better understanding of my chosen domain is to get into contact with domain experts.  
During Iteration 0 I looked at which companies would be the target audience for my project. In this research I found 3 different groups:
- Production and distribution (e.g. [Paramount Pictures Studios](http://www.paramountstudios.com/), [Lionsgate](https://www.lionsgate.com/))
- Financing companies (e.g. [Peacock Film Finance](https://peacockfilmfinance.com/))
- Film Bond companies (film completion bonds are a form of insurance for investors, e.g. [Surety Bonds](https://www.suretybondsdirect.com/))

For my interview I originally approached smaller production companies such as [24fps production](https://www.24fpsproductions.com/) to ask for an interview with a domain expert working at the company. However, I never got a response from them.  
I discussed my problems with getting in contact with a domain expert with my semester coach, and he suggested sending a questionnaire to individuals instead, seeing as these professionals (understandably) might not have the time for an interview.  
I took this advice and messaged some domain experts currently working as film producers on LinkedIn.

\[No Responses yet\]

#### Real World Example
To gain a broader understanding of the specific direction I want to take within my domain, I have collected some real-world examples of data being used to gauge and produce successful media such as movies and series.  
  
##### House of Cards
House of Cards is a political drama/thriller series made by [Netflix](https://en.wikipedia.org/wiki/Netflix) (currently one of the largest subscription-based streaming services). While Netflix has made many original series and movies, this one was a bit special. As a platform that millions of people use daily, Netflix has a lot of data on its users' behavior and watch habits. Netflix used all this data to design a series that was mathematically likely to succeed, and [it worked](https://www.fastcompany.com/1671893/the-secret-sauce-behind-netflixs-hit-house-of-cards-big-data)! The series became the most watched series on Netflix at the time of its release, combined with [great reviews](https://www.rottentomatoes.com/tv/house-of-cards) from critics and audiences alike.  
Since this first dive into manufacturing success, Netflix has, over the 10 years since House of Cards was released, used its data to create successful movies, series and documentaries, showing that creating successful movies is not just luck and human emotions.

##### Sonic the Hedgehog
Sonic the Hedgehog was supposed to be a successful new film series made by Paramount Pictures Studios, based on the video game of the same name. However, this movie's road to success did not go as the executives at Paramount might have expected. When the first trailer launched, the internet reacted with widespread hatred of the way the movie portrayed the main character, Sonic. The uproar over the design became so bad that the movie was [delayed for 3 months](https://screenrant.com/sonic-hedgehog-movie-release-delay-why-good/). This delay was used to completely redo parts of the movie, the main one being Sonic himself.  
After this delay, however, the movie released to relative success, making [3.5 times its budget in revenue](https://en.wikipedia.org/wiki/Sonic_the_Hedgehog_(film)) and setting the movie's universe up for a sequel (which obviously was the plan from the start).  
This change from possible failure of the year to a successful movie shows that critic reviews and general public opinion in the period between the first showing of the movie and its release can still be a big turning point for a movie.

#### Question Statement
For this project I will be trying to:
*Predict the success of movies*

This main question was altered during Iteration 0 (discussed in the challenge proposal of Iteration 0). I ended up making revenue the main focus, because this is a value that a lot of companies focus on when making movies. Revenue is also more flexible and might offer more opportunities than a value such as budget, which is strongly tied to the kind of movie (for example, sci-fi needs a higher budget for post-production work) and to which actors are in the movie (popular actors take up a lot more of the budget).

There are also a couple subquestions that I would like to answer along the way:
- *Do genres impact the revenue in a significant way*
- *Are actors a big factor in a movie's success*
- *Is there any correlation between a movie's title and its success*
- *Does the time of year impact movie sales*

### Data Sourcing
For data sourcing there are many good movie database websites to choose from. For critic reviews there are websites such as [Metacritic](https://www.metacritic.com/) and [Rotten Tomatoes](https://www.rottentomatoes.com/), and for a more curated database of movies there are websites such as [IMDb](https://www.imdb.com/) and [TMDb](https://www.themoviedb.org/).

Extra data can also be obtained from more news-oriented websites such as [the-numbers](https://www.the-numbers.com/), which might have more specific data such as gross revenue or budget.

### Analytic Approach
During Iteration 0, a lot of time was spent figuring out a proper target value.
This research showed that revenue would be the best fit for predicting success. Revenue can be combined with values such as budget to create a profit coefficient, which could be used by companies as a KPI.  
With this in mind, the main goal of this challenge will be to accurately predict a movie's revenue based on public data. This will allow film production and investment companies/individuals to reduce the risks that come with making a movie.  
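As a rough illustration of the profit coefficient mentioned above, the sketch below assumes it is simply revenue divided by budget. That definition, the function name `profit_coefficient` and the numbers used are my own assumptions for illustration and do not come from a dataset.

```python
def profit_coefficient(revenue: float, budget: float) -> float:
    """Return revenue divided by budget.

    Assumed definition: a value above 1.0 means the movie earned
    more in revenue than it cost to make.
    """
    if budget <= 0:
        raise ValueError("budget must be a positive number")
    return revenue / budget


# Illustrative numbers only, not taken from any real movie.
example = profit_coefficient(revenue=350_000_000, budget=100_000_000)
print(f"profit coefficient: {example:.2f}")
```

A ratio like this makes movies with very different budgets comparable, which is why it could work as a KPI next to the raw revenue.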

Seeing as I will be attempting to predict a numeric value (revenue), regression is the obvious algorithm type to use, so I will be preparing my data with the goal of using regression to predict revenue.
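To make the idea of regression concrete, here is a minimal sketch of a one-feature least-squares fit that predicts revenue from budget. The toy numbers are made up, the helper names `fit_line` and `predict` are hypothetical, and the actual model in later sections will likely use more features and a library such as scikit-learn instead.

```python
from statistics import mean


def fit_line(xs, ys):
    """Fit y = slope * x + intercept with ordinary least squares."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical, cannot fit a line")
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept


def predict(x, slope, intercept):
    """Predict the target value for a single feature value."""
    return slope * x + intercept


# Toy data in millions, invented purely for illustration.
budgets = [10.0, 20.0, 30.0, 40.0]
revenues = [25.0, 45.0, 65.0, 85.0]

slope, intercept = fit_line(budgets, revenues)
print(predict(50.0, slope, intercept))
```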

# Iteration 1

## Provisioning

### Data Requirements
As stated in [Data Sourcing](#Data-Sourcing), there are several data sources in closely related domains that are able to supply movie data. The ones that I will initially be focusing on are:
- [TMDb](https://www.themoviedb.org/)
- [the-numbers](https://www.the-numbers.com/)

I will be using TMDb as my base data; from here I will be getting general movie data (such as title, release date, etc.).
I will be using the-numbers to enrich my data with budgets and revenue. I will not be using any critic data, as this data generally is not useful for predicting a movie's success before it has been released, seeing as there will be no reviews yet. However, it could be used in later iterations because of factors such as those in the Sonic movie example described in [Real World Example](#Real-World-Example).
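The enrichment step described above could look roughly like the sketch below: TMDb records form the base, and budget and revenue from the-numbers are attached where a match is found. The field names, the matching on title plus release year, and the records themselves are all assumptions for illustration, not the real structure of either source.

```python
# Hypothetical records; field names and values are illustrative only.
tmdb_movies = [
    {"title": "Example Movie A", "release_year": 2019, "genres": ["Drama"]},
    {"title": "Example Movie B", "release_year": 2020, "genres": ["Comedy"]},
]
the_numbers_financials = [
    {"title": "example movie a ", "release_year": 2019,
     "budget": 10_000_000, "revenue": 25_000_000},
]


def join_key(record):
    """Normalise title and year so small formatting differences still match."""
    return (record["title"].strip().lower(), record["release_year"])


def enrich(base, financials):
    """Left-join financial data onto the base movies; unmatched movies get None."""
    lookup = {join_key(row): row for row in financials}
    enriched = []
    for movie in base:
        match = lookup.get(join_key(movie), {})
        enriched.append({
            **movie,
            "budget": match.get("budget"),
            "revenue": match.get("revenue"),
        })
    return enriched


movies = enrich(tmdb_movies, the_numbers_financials)
for movie in movies:
    print(movie["title"], movie["budget"], movie["revenue"])
```

Keeping unmatched movies with empty values (instead of dropping them) makes it easy to see later how much of the base data actually has budget and revenue figures.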

As said previously, the domains that TMDb and the-numbers fall into are different but still related, and both share their data publicly for broadly similar reasons.  

#### TMDb
This website is pretty special, seeing as it is operated and maintained by one person (however, he is trying to [grow his team](https://www.themoviedb.org/talk/5c4f34c1c3a36805cc826e67)). The website gives away its data for free via an API, which is very generous compared to other companies such as [theTVDB](https://thetvdb.com/subscribe), which offers its data at a subscription cost (or free via a contribution program).  
After doing some research I found that TMDb is a subsidiary of [TiVo](https://www.tivo.com/). This company has been in the home entertainment industry since 1997, selling digital video recorders; currently, however, it sells multimedia set-top boxes that connect to your TV (for example to watch Netflix or record shows on TV).  
If I were to guess, this company subsidizes TMDb for the data it can collect on both movies and user interests/interactions.  
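As a small sketch of what using the TMDb API could involve, the snippet below only builds the URL for a movie details request and does not send it. It assumes the v3 endpoint layout (`/3/movie/{movie_id}` with an `api_key` query parameter) from TMDb's public API documentation; the helper name `movie_details_url` is hypothetical and `YOUR_API_KEY` is a placeholder for a personal key.

```python
from urllib.parse import urlencode

# Assumed base URL of version 3 of the TMDb API.
TMDB_API_BASE = "https://api.themoviedb.org/3"


def movie_details_url(movie_id: int, api_key: str) -> str:
    """Build the request URL for the details of a single movie."""
    query = urlencode({"api_key": api_key})
    return f"{TMDB_API_BASE}/movie/{movie_id}?{query}"


# Placeholder key; a real key has to be requested from TMDb.
url = movie_details_url(550, "YOUR_API_KEY")
print(url)
```

Building the URL separately from sending the request keeps the key handling in one place and makes the request logic easy to check without network access.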

#### The-Numbers
The other website I am interested in is the-numbers. This is a more news-based website, focusing on sharing relevant movie news with people for free. However, it does not offer its data in an easy way such as a data dump or an API.  
The the-numbers website is part of a parent company called [Nash Information Services, LLC](https://www.nashinfoservices.com/). This company has another daughter company called [OpusData](https://www.opusdata.com/), which is a for-profit consulting company mainly focused on the film industry.  
So Nash Information Services, LLC seems to subsidize the-numbers so that it can operate for free, gaining public awareness of and interest in the company and its services. 

### Data Collection

### Data Understanding

### Data Preparation

## Predictions

### Preprocessing

### Modelling

### Evaluation

## Delivery

### Collecting

### Documenting

### Reporting