# Microsoft Studios: Data-driven recommendations for a profitable initial release

## Data Sources

### Financial Data

The financial data was scraped from [The-Numbers: Where Data and the Movie Business Meet](https://www.the-numbers.com/movie/budgets/all), which yielded 2,700 films accompanied by the following data:
* Release Date
* Production Budget 
* Domestic Box Office Gross
* Worldwide Box Office Gross


### Genre Data

OpusData provided a data set free of charge in response to an educational request, on the condition that it be used for educational purposes only. The set contained 1,936 film listings, from which the genre classifications were obtained.

### Runtime and Personnel (Performer and Director) Data

Information courtesy of
IMDb
(http://www.imdb.com).
Used with permission.

This data is made available at https://datasets.imdbws.com/, updated daily as gzip-compressed .tsv.gz files. These files were accessed on July 30, 2019 and processed as a SQLite3 database, constructed and queried with IMDbPY, an unofficial Python package designed for this purpose by Davide Alberani *et al.* and made available via pip and conda.

## Data Analysis

### Establishment of Primary Metric: % ROI

Prior to further analysis, percent return on investment (% ROI) was selected as the best means of evaluating responsible allocation of funds and calculated per the formula:
<br>
<center>$\frac{\text{Domestic Gross} - \text{Production Budget}}{\text{Production Budget}}\cdot 100$</center>
<br>

Consideration of gross box office receipts was limited to domestic results because budget information pertaining to distribution, marketing, or promotion of overseas releases was unavailable.
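As a minimal sketch of the % ROI formula above, the calculation can be applied column-wise with pandas. The column names (`production_budget`, `domestic_gross`, `pct_roi`) and the two sample rows are assumptions for illustration, not the scraped schema or real data.

```python
import pandas as pd


def percent_roi(domestic_gross, production_budget):
    """Percent ROI: (domestic gross - budget) / budget * 100."""
    return (domestic_gross - production_budget) / production_budget * 100


# Hypothetical rows; names and values are illustrative only.
films = pd.DataFrame({
    "title": ["Film A", "Film B"],
    "production_budget": [20_000_000, 50_000_000],
    "domestic_gross": [50_000_000, 40_000_000],
})

# Works element-wise on whole columns as well as on single numbers.
films["pct_roi"] = percent_roi(films["domestic_gross"], films["production_budget"])
print(films[["title", "pct_roi"]])
```

A film that exactly recovers its production budget scores 0 % under this definition, which is the break-even line used in the later figures.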

### Genre

To assess financial performance by genre, the financial dataframe and the genre dataframe first had to be merged. Doing so left 1,388 titles with the data needed to assess aggregate performance by genre. % ROI was used to evaluate performance across movie genres. The following box plot shows the distribution of financial performance for each genre, presented in order of decreasing median % ROI.

![](ROI_by_Genre.jpg)

While musicals showed the highest overall performance, there are only 20 musicals in this set of 1,388 films, so they may not be representative of the market. The next three highest-performing genres were comedies, romantic comedies, and black comedies. Since romantic comedies and black comedies are both subgenres of comedy and their individual film counts were somewhat low (52 and 15, respectively), we have aggregated all three categories together for further analysis.
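The merge, the median ordering, and the comedy aggregation described in this section might look roughly like the sketch below. The table layouts, column names, genre labels, and the choice of `title` as the join key are all assumptions; a real merge would likely need a stricter key (for example, title plus release year) to avoid collisions between remakes.

```python
import pandas as pd

# Hypothetical stand-ins for the financial and genre tables.
financial = pd.DataFrame({
    "title": ["A", "B", "C", "D", "E"],
    "pct_roi": [120.0, -30.0, 45.0, 10.0, 80.0],
})
genres = pd.DataFrame({
    "title": ["A", "B", "C", "D", "F"],
    "genre": ["Comedy", "Drama", "Romantic Comedy", "Black Comedy", "Musical"],
})

# An inner join keeps only titles present in both tables.
merged = financial.merge(genres, on="title", how="inner")

# Median % ROI per genre, highest first (the ordering used for the box plot).
median_by_genre = (
    merged.groupby("genre")["pct_roi"].median().sort_values(ascending=False)
)

# Fold the comedy subgenres into a single "Comedy" label for later analysis.
COMEDY_LABELS = {"Romantic Comedy": "Comedy", "Black Comedy": "Comedy"}
merged["genre_group"] = merged["genre"].replace(COMEDY_LABELS)
comedies = merged[merged["genre_group"] == "Comedy"]
```

Titles that appear in only one table (here `E` and `F`) drop out of the inner join, which is the same effect that reduced the real data to 1,388 titles.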

### Release Date

Using the release date data from The-Numbers.com, financial performance could be mapped to each possible week of release. The distribution of % ROI over the possible weeks of release is shown in the figure below.

![](ROI_by_Week2.jpg)

Weeks of release with a median value below the break-even line are shaded in blue, while weeks with a median value above the break-even line are shaded in red.
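One way to map release dates to weeks and flag each week against the break-even line is sketched below. It assumes ISO week numbering and hypothetical column names and values; the original analysis may have numbered weeks differently.

```python
import pandas as pd

# Hypothetical films; dates and ROI values are illustrative only.
films = pd.DataFrame({
    "release_date": pd.to_datetime(
        ["2015-06-05", "2016-06-10", "2014-01-17", "2017-01-20"]
    ),
    "pct_roi": [60.0, 20.0, -40.0, -10.0],
})

# ISO week of the year (1-53) for each release date.
films["release_week"] = films["release_date"].dt.isocalendar().week.astype(int)

# Median % ROI per release week, pooled across years.
weekly_median = films.groupby("release_week")["pct_roi"].median()

# True where the weekly median sits above break-even (0 % ROI).
above_break_even = weekly_median > 0
```

The boolean series can then drive the shading of each box in the plot.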

Looking specifically at the release weeks for comedies, the figure below shows that the weekly distribution of % ROI for comedies roughly mirrors that of all genres.

![](ROI_by_Week_Comedy.jpg)

### Budget

To recommend an initial budget range, the budgets of the 278 comedies, including romantic comedies and black comedies, were considered collectively. The films were grouped into three categories relative to the median budget of \\$35 million:
* Below \\$23 million
* Between \\$23 and \\$46 million
* Above \\$46 million

These threshold values were calculated using the standard deviation of production budgets to create a $\frac{\sigma}{2}$ interval around the median.
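A rough sketch of this grouping follows. It reads the interval as median ± σ/2 and uses the sample standard deviation (the pandas default), both of which are assumptions; the budgets and tier labels are made up for illustration.

```python
import pandas as pd


def budget_tier(budgets: pd.Series) -> pd.Series:
    """Label each budget as below, near, or above the median band."""
    median = budgets.median()
    half_sigma = budgets.std() / 2  # pandas uses the sample std (ddof=1)
    lower, upper = median - half_sigma, median + half_sigma
    bins = [-float("inf"), lower, upper, float("inf")]
    return pd.cut(budgets, bins=bins, labels=["below", "near median", "above"])


# Hypothetical production budgets in millions of USD.
budgets = pd.Series([5, 20, 35, 40, 90])
tiers = budget_tier(budgets)
print(pd.DataFrame({"budget_musd": budgets, "tier": tiers}))
```

Grouping % ROI by the resulting `tier` column gives one distribution per subset.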

The plot below shows the distribution of % ROI for each subset.  

![](ROI_by_Compared_Med.jpeg)




### Runtime

Runtimes for comedies have a median of 103 minutes, a mean of 104 minutes, and a standard deviation of just under 13.5 minutes. The near-normal distribution has a tail toward higher values, with a maximum of 180 minutes, as shown in the figure below.

![](Runtime_Dist.jpg)

When runtime is evaluated against % ROI, the best results are achieved by films near the average length, with longer runtimes tending to favor higher ROI, as illustrated by the figure below.

![](ROI_by_Runtime2.jpg)
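The runtime summary and the comparison against % ROI could be reproduced along the lines of the sketch below. The 15-minute bin width, the column names, and the sample values are assumptions chosen for illustration, not the binning used for the figure.

```python
import pandas as pd

# Hypothetical comedies; runtimes in minutes, ROI in percent.
comedies = pd.DataFrame({
    "runtime_min": [88, 95, 101, 104, 110, 125],
    "pct_roi": [-20.0, 10.0, 55.0, 60.0, 40.0, 30.0],
})

# Median, mean, and sample standard deviation of runtime.
summary = comedies["runtime_min"].agg(["median", "mean", "std"])

# Left-closed 15-minute bins: [75, 90), [90, 105), ..., [135, 150).
bins = list(range(75, 151, 15))
comedies["runtime_bin"] = pd.cut(comedies["runtime_min"], bins=bins, right=False)

# Median % ROI within each runtime bin that contains at least one film.
median_roi_by_bin = (
    comedies.groupby("runtime_bin", observed=True)["pct_roi"].median()
)
```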


### Casting and Selection of Director

Because the IMDb database lists the four top-billed members of the cast for each film listing, performers in comedic films may be evaluated based on the % ROI of the films in which they play a starring role. Shown below are the top 50 performers and top 20 directors by this metric.

#### Top 50 Performers by Mean % ROI

![](Top_20_Performers.JPG)

![](Next_20_Performers.JPG)

![](Next_10_Performers.JPG)

#### Top 20 Directors by Mean % ROI

![](Top_20_Directors.JPG)
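The rankings above can be approximated by joining a credits table to per-film % ROI and averaging per person, as in the sketch below. The table layouts and names are hypothetical, and the `min_films` filter is an illustrative addition that is not described in the analysis; it guards against a single hit film dominating the list.

```python
import pandas as pd

# Hypothetical credits: one row per (film, top-billed performer).
cast = pd.DataFrame({
    "title": ["A", "A", "B", "B", "C"],
    "performer": ["P1", "P2", "P1", "P3", "P2"],
})
roi = pd.DataFrame({"title": ["A", "B", "C"], "pct_roi": [100.0, 50.0, -20.0]})


def rank_by_mean_roi(credits, roi, person_col, top_n, min_films=1):
    """Rank people by the mean % ROI of the films they are credited on."""
    joined = credits.merge(roi, on="title", how="inner")
    stats = joined.groupby(person_col)["pct_roi"].agg(
        mean_roi="mean", films="count"
    )
    stats = stats[stats["films"] >= min_films]  # hypothetical filter
    return stats.sort_values("mean_roi", ascending=False).head(top_n)


top_performers = rank_by_mean_roi(cast, roi, "performer", top_n=2)
print(top_performers)
```

The same function applies to directors by passing a director credits table and `person_col="director"`.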

### Final Recommendations

Comedies are likely to be the safest choice of genre for an initial offering, and a release week between the Memorial Day and Independence Day weekends has a solid track record for both comedies and movie attendance in general. The production budget should be kept near typical industry values (23-46 million USD), erring on the low end to maximize ROI, and runtime should be allowed to extend somewhat away from the mean of 104 minutes if warranted. The director is most safely chosen from the provided list, and casting directors are encouraged to start their search with the provided list of performers.

## Any questions?