# Lights, Data, Strategy: Informing Microsoft's Movie Studio with Box Office Analytics

1. Business Understanding
a) Introduction.

In an effort to tap into the growing trend of original video content and expand its footprint in the entertainment industry, Microsoft has made the strategic decision to establish its own movie studio. However, as a newcomer, Microsoft faces the challenge of understanding which types of films perform exceptionally well. To address this challenge, a data-driven analysis is necessary to uncover insights that can guide decisions about which types of films to produce first.

The film industry has grown remarkably in recent years, benefiting from technological advancements such as 5G internet, higher quality displays, and smartphones. These new technologies have changed consumer preferences with regard to content access, and in the process companies like Netflix that moved in early to cater to these new appetites benefitted greatly. This shift compelled many tech giants such as Apple and Amazon to venture into movie production, aiming to capitalize on lucrative opportunities the demand presents. Recognizing the potential and profitability, Microsoft seeks to strategically navigate this market, guided by existing data and insights gleaned from it.

The primary objective of this analysis is to identify the attributes that films successful at the box office have in common, using comprehensive datasets spanning various aspects of the film industry. By examining historical records, genre trends, and audience preferences, we aim to provide actionable insights to the leadership of Microsoft's new movie studio.
These insights will guide Microsoft in determining the types of films that are competitive and likely to thrive in this evolving market. 

Through an in-depth exploration of the available data and the application of analytics techniques, we will uncover valuable patterns, correlations, and market trends. Equipped with these data-driven insights, Microsoft will be better positioned to develop a successful film strategy, compete, and increase their chances of success.

b) Defining the metric for success. / Problem statement. 

Microsoft's new movie studio lacks domain expertise in filmmaking and needs guidance on which film genres to focus on to maximize box office success. Leveraging historical box office data and exploring genre trends, this investigation will guide the choice of films they should produce.
To address this problem, we will examine datasets containing information on past movie releases, their genres, and their revenue. Box office revenue, ROI, and ratings data will serve as indicators of financial viability and audience reception across different film genres. These elements will be examined in conjunction with other factors that contribute to a film's success, such as budget, release timing, and target audience demographics, to build a more comprehensive understanding of the dynamics that drive box office success. Based on the results of our analysis, we will provide actionable insights and recommendations regarding the film genres that have historically performed well at the box office. These insights will enable Microsoft to align its movie studio's content strategy with audience preferences and market trends and increase the probability of producing commercially successful films.
*** To be reviewed ***
The success of our analysis will be measured by the following metrics:

Accuracy of genre performance predictions: Our analysis should accurately identify the film genres that have historically performed well at the box office, enabling Microsoft to make informed decisions.

Revenue increase: The recommendations and insights derived from our analysis should contribute to an increase in box office revenue for Microsoft's movie studio compared to randomly selecting film genres.

Market share growth: Microsoft's movie studio should gain a larger market share within the film industry by producing successful films aligned with audience preferences and market trends.
*** reviewed ***

c) Main objectives and understanding the context
Main Objective of the Study:
The main objective of this study is to analyze historical film data and identify the most profitable film genres for Microsoft's new movie studio. Outcomes will determine the types of films to create, maximizing the chances of success.

Specific Objectives:
1. To determine the genres with the highest potential for success in the current market by analyzing genre trends and identifying the genres that have consistently performed well in terms of revenue and audience reception.

2. To investigate additional factors beyond genre, such as budget, release timing, and target audience demographics, to determine their impact on a film's success. This will allow us to find relationships between genre and other influential elements.

3. Based on the analysis, derive actionable insights and recommendations that align the studio's content strategy with audience preferences and market trends to increase the probability of producing commercially successful films.

d) Experimental Design
** To be updated as we go* 
1. Data collection
2.
3.
4.

e) Data relevance, data understanding. 

2. Reading the Data

In [4]:
# importing libraries. 
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np 
import seaborn as sns


%matplotlib inline 

In [27]:
# Read the CSV files

tmdb_movies_df = pd.read_csv('data/tmdb_movies.csv')
tn_movie_budgets_df = pd.read_csv('data/tn_movie_budgets.csv', index_col=0 )
bom_movie_gross_df = pd.read_csv('data/bom_movie_gross.csv')



Inspect the contents of the dataframes using head(), info(), and shape.

In [28]:
# Inspect the first 5 records of the dataframe
tmdb_movies_df.head()

Unnamed: 0.1,Unnamed: 0,genre_ids,id,original_language,original_title,popularity,release_date,title,vote_average,vote_count
0,0,"[12, 14, 10751]",12444,en,Harry Potter and the Deathly Hallows: Part 1,33.533,2010-11-19,Harry Potter and the Deathly Hallows: Part 1,7.7,10788
1,1,"[14, 12, 16, 10751]",10191,en,How to Train Your Dragon,28.734,2010-03-26,How to Train Your Dragon,7.7,7610
2,2,"[12, 28, 878]",10138,en,Iron Man 2,28.515,2010-05-07,Iron Man 2,6.8,12368
3,3,"[16, 35, 10751]",862,en,Toy Story,28.005,1995-11-22,Toy Story,7.9,10174
4,4,"[28, 878, 12]",27205,en,Inception,27.92,2010-07-16,Inception,8.3,22186


In [29]:
# Looks like our import added an extra index column; drop it in place before we proceed.
tmdb_movies_df.drop(tmdb_movies_df.columns[0], axis=1, inplace=True)

# Inspect our changes 
tmdb_movies_df.head()

Unnamed: 0,genre_ids,id,original_language,original_title,popularity,release_date,title,vote_average,vote_count
0,"[12, 14, 10751]",12444,en,Harry Potter and the Deathly Hallows: Part 1,33.533,2010-11-19,Harry Potter and the Deathly Hallows: Part 1,7.7,10788
1,"[14, 12, 16, 10751]",10191,en,How to Train Your Dragon,28.734,2010-03-26,How to Train Your Dragon,7.7,7610
2,"[12, 28, 878]",10138,en,Iron Man 2,28.515,2010-05-07,Iron Man 2,6.8,12368
3,"[16, 35, 10751]",862,en,Toy Story,28.005,1995-11-22,Toy Story,7.9,10174
4,"[28, 878, 12]",27205,en,Inception,27.92,2010-07-16,Inception,8.3,22186
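An alternative to dropping the extra column after the fact is to tell pandas at read time that the first column is the row index, as the notebook already does for tn_movie_budgets.csv. The sketch below is only illustrative: it reads a tiny in-memory stand-in (two rows copied from the output above) and assumes the real file's first column has an empty header, which is what produces the "Unnamed: 0" name.

```python
import io

import pandas as pd

# Tiny in-memory stand-in for data/tmdb_movies.csv.
# Assumption: the first column has an empty header, as in the real file.
csv_text = (
    ",genre_ids,id,original_language,original_title,popularity,"
    "release_date,title,vote_average,vote_count\n"
    '0,"[12, 14, 10751]",12444,en,Harry Potter and the Deathly Hallows: Part 1,'
    "33.533,2010-11-19,Harry Potter and the Deathly Hallows: Part 1,7.7,10788\n"
    '1,"[14, 12, 16, 10751]",10191,en,How to Train Your Dragon,'
    "28.734,2010-03-26,How to Train Your Dragon,7.7,7610\n"
)

# Default read: the unnamed first column becomes a regular column.
with_extra = pd.read_csv(io.StringIO(csv_text))

# index_col=0 uses that column as the row index instead.
without_extra = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(with_extra.columns)[:2])
print(without_extra.shape)
```

Reading with index_col=0 avoids mutating the dataframe afterwards, so re-running the cell cannot accidentally drop a second, legitimate column.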


In [30]:
tmdb_movies_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26517 entries, 0 to 26516
Data columns (total 9 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   genre_ids          26517 non-null  object 
 1   id                 26517 non-null  int64  
 2   original_language  26517 non-null  object 
 3   original_title     26517 non-null  object 
 4   popularity         26517 non-null  float64
 5   release_date       26517 non-null  object 
 6   title              26517 non-null  object 
 7   vote_average       26517 non-null  float64
 8   vote_count         26517 non-null  int64  
dtypes: float64(2), int64(2), object(5)
memory usage: 1.8+ MB


In [31]:
tmdb_movies_df.shape

(26517, 9)

What information do we obtain from checking the shape and info of the dataframe?
"""
The RangeIndex shows that the tmdb_movies dataframe has 26,517 rows.
There are 9 columns in the tmdb_movies dataframe.
Non-Null Count is the number of non-null values present in the column. All columns have 26,517 non-null values, indicating that there are no missing values in any of the columns.
Dtype refers to the data type of the values stored in the column. The DataFrame contains columns of three different data types:
    int64: Columns 'id' and 'vote_count' are of integer type.
    object: Columns 'genre_ids', 'original_language', 'original_title', 'release_date', and 'title' are of object type, which typically represents string values. Note that 'genre_ids' holds list-like text and 'release_date' holds dates stored as text.
    float64: Columns 'popularity' and 'vote_average' are of floating-point type.
"""

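Because genre_ids and release_date are reported as object dtype, they would likely need converting before any genre or release-timing analysis. The following is a minimal sketch on two rows taken from the head() output. It assumes genre_ids is stored as text that looks like a Python list and that dates follow the YYYY-MM-DD pattern seen above; the names sample, exploded, and release_year are illustrative and not part of the notebook.

```python
import ast

import pandas as pd

# Two rows copied from the head() output above.
sample = pd.DataFrame(
    {
        "genre_ids": ["[12, 14, 10751]", "[16, 35, 10751]"],
        "release_date": ["2010-11-19", "1995-11-22"],
        "title": ["Harry Potter and the Deathly Hallows: Part 1", "Toy Story"],
    }
)

# Assumption: each genre_ids value is a string such as "[12, 14, 10751]".
# ast.literal_eval safely turns that text into a real Python list.
sample["genre_ids"] = sample["genre_ids"].apply(ast.literal_eval)

# Parse the date strings and derive a release year for timing analysis.
sample["release_date"] = pd.to_datetime(sample["release_date"], format="%Y-%m-%d")
sample["release_year"] = sample["release_date"].dt.year

# One row per (film, genre id) makes per-genre aggregation straightforward.
exploded = sample.explode("genre_ids")
print(exploded[["title", "genre_ids", "release_year"]])
```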
Comparing the vote_average(rating) with popularity

In [59]:
# Extract and analyse ratings from tmdb
ratings_tmdb = tmdb_movies_df['vote_average']

# Calculate the minimum and maximum values
min_rating = ratings_tmdb.min()
max_rating = ratings_tmdb.max()

# Print the range of values
print("Minimum Rating:", min_rating)
print("Maximum Rating:", max_rating)


# aggregation functions. 
print(f'The mean value of the ratings column is {ratings_tmdb.mean():.3f}, median value is {ratings_tmdb.median()}.\
    \nThe standard deviation is {ratings_tmdb.std():.3f}.')


Minimum Rating: 0.0
Maximum Rating: 10.0
The mean value of the ratings column is 5.991, median value is 6.0.    
The standard deviation is 1.853.


On average, movies in the tmdb_movies.csv dataset received a rating of about 6.0 (mean 5.991). The median of 6.0 means that roughly half of the movies are rated at or below 6.0.
The standard deviation of 1.853 measures dispersion: a larger value indicates greater spread, while a smaller value indicates that data points are clustered around the mean.
Since ratings range from 0 to 10, a standard deviation of 1.853 indicates that ratings vary moderately but are not extremely spread out.
 

In [68]:
# Extract and analyse popularity from tmdb
popularity_tmdb = tmdb_movies_df['popularity']

# Calculate the minimum and maximum values
min_popularity = popularity_tmdb.min()
max_popularity = popularity_tmdb.max()

# Print the range of values
print("Minimum Popularity:", min_popularity)
print("Maximum Popularity:", max_popularity)


# aggregation functions. 
print(f'The mean value of the popularity column is {popularity_tmdb.mean():.3f}, median value is {popularity_tmdb.median()}.\
    \nThe standard deviation is {popularity_tmdb.std():.3f}.')

Minimum Popularity: 0.6
Maximum Popularity: 80.773
The mean value of the popularity column is 3.131, median value is 1.374.    
The standard deviation is 4.355.
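The popularity mean (3.131) sits well above the median (1.374), which suggests a right-skewed distribution with a long tail of very popular films. One way to compare popularity with vote_average under that skew is sketched below, using only the five rows shown in the head() output. The rank-based correlation and the log transform are illustrative choices rather than steps taken in the notebook, and five rows are far too few to draw any conclusion.

```python
import numpy as np
import pandas as pd

# The five rows shown in the head() output above.
sample = pd.DataFrame(
    {
        "popularity": [33.533, 28.734, 28.515, 28.005, 27.92],
        "vote_average": [7.7, 7.7, 6.8, 7.9, 8.3],
    }
)

# Sample skewness: a positive value indicates a longer right tail.
skewness = sample["popularity"].skew()

# Pearson correlation measures linear association.
pearson = sample["popularity"].corr(sample["vote_average"])

# Rank-based (Spearman-style) correlation: Pearson on the ranks,
# which is less sensitive to extreme popularity values.
rank_corr = sample["popularity"].rank().corr(sample["vote_average"].rank())

# log1p compresses the long tail before plotting or modelling.
log_popularity = np.log1p(sample["popularity"])

print(f"skew={skewness:.3f}, pearson={pearson:.3f}, rank={rank_corr:.3f}")
```

On the full dataset the same calls would be applied to tmdb_movies_df['popularity'] and tmdb_movies_df['vote_average'].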


In [32]:
bom_movie_gross_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3387 entries, 0 to 3386
Data columns (total 5 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   title           3387 non-null   object 
 1   studio          3382 non-null   object 
 2   domestic_gross  3359 non-null   float64
 3   foreign_gross   2037 non-null   object 
 4   year            3387 non-null   int64  
dtypes: float64(1), int64(1), object(3)
memory usage: 132.4+ KB


In [33]:
bom_movie_gross_df.shape

(3387, 5)

In [34]:
bom_movie_gross_df.head()

Unnamed: 0,title,studio,domestic_gross,foreign_gross,year
0,Toy Story 3,BV,415000000.0,652000000,2010
1,Alice in Wonderland (2010),BV,334200000.0,691300000,2010
2,Harry Potter and the Deathly Hallows Part 1,WB,296000000.0,664300000,2010
3,Inception,WB,292600000.0,535700000,2010
4,Shrek Forever After,P/DW,238700000.0,513900000,2010


In [35]:
tn_movie_budgets_df.shape

(5782, 5)

In [37]:
tn_movie_budgets_df.info()


<class 'pandas.core.frame.DataFrame'>
Int64Index: 5782 entries, 1 to 82
Data columns (total 5 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   release_date       5782 non-null   object
 1   movie              5782 non-null   object
 2   production_budget  5782 non-null   object
 3   domestic_gross     5782 non-null   object
 4   worldwide_gross    5782 non-null   object
dtypes: object(5)
memory usage: 271.0+ KB


In [38]:
tn_movie_budgets_df.head()

id,release_date,movie,production_budget,domestic_gross,worldwide_gross
1,"Dec 18, 2009",Avatar,"$425,000,000","$760,507,625","$2,776,345,279"
2,"May 20, 2011",Pirates of the Caribbean: On Stranger Tides,"$410,600,000","$241,063,875","$1,045,663,875"
3,"Jun 7, 2019",Dark Phoenix,"$350,000,000","$42,762,350","$149,762,350"
4,"May 1, 2015",Avengers: Age of Ultron,"$330,600,000","$459,005,868","$1,403,013,963"
5,"Dec 15, 2017",Star Wars Ep. VIII: The Last Jedi,"$317,000,000","$620,181,382","$1,316,721,747"
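The info() output shows production_budget, domestic_gross, and worldwide_gross as object dtype because the values are currency strings such as "$425,000,000". Before ROI can be computed they need to become numbers. The sketch below uses two rows from the output above and assumes one common ROI definition, (worldwide gross - production budget) / production budget. The notebook does not define ROI explicitly, so that formula and the roi column name are assumptions.

```python
import pandas as pd

# Two rows copied from the head() output above.
budgets = pd.DataFrame(
    {
        "movie": ["Avatar", "Dark Phoenix"],
        "production_budget": ["$425,000,000", "$350,000,000"],
        "domestic_gross": ["$760,507,625", "$42,762,350"],
        "worldwide_gross": ["$2,776,345,279", "$149,762,350"],
    }
)

money_cols = ["production_budget", "domestic_gross", "worldwide_gross"]

# Strip the dollar sign and thousands separators, then cast to integers.
for col in money_cols:
    budgets[col] = (
        budgets[col].str.replace(r"[$,]", "", regex=True).astype("int64")
    )

# Assumed ROI definition: profit relative to the production budget.
budgets["roi"] = (
    budgets["worldwide_gross"] - budgets["production_budget"]
) / budgets["production_budget"]

print(budgets[["movie", "production_budget", "worldwide_gross", "roi"]])
```

Under this definition a negative ROI means the film grossed less worldwide than it cost to produce, before marketing and distribution costs, which this dataset does not include.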


Identify and handle missing values
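The info() output for bom_movie_gross_df shows fewer non-null values in studio, domestic_gross, and foreign_gross than there are rows, and foreign_gross is stored as object rather than as a number. A minimal sketch of one way to count and handle these gaps follows. The sample rows are partly hypothetical (the films labelled "Hypothetical" and the comma-formatted value are invented for illustration), and dropping or filling is an assumption about strategy, not a decision made in the notebook.

```python
import numpy as np
import pandas as pd

# Small stand-in for bom_movie_gross_df; the last two rows are hypothetical.
gross = pd.DataFrame(
    {
        "title": ["Toy Story 3", "Inception", "Hypothetical Film A", "Hypothetical Film B"],
        "studio": ["BV", "WB", "WB", None],
        "domestic_gross": [415000000.0, 292600000.0, np.nan, 1000000.0],
        "foreign_gross": ["652000000", "535700000", "1,200.5", None],
        "year": [2010, 2010, 2011, 2012],
    }
)

# Count missing values per column.
missing_counts = gross.isna().sum()
print(missing_counts)

# Assumption: non-numeric foreign_gross entries only differ by thousands
# separators. Anything still unparseable becomes NaN via errors="coerce".
gross["foreign_gross"] = pd.to_numeric(
    gross["foreign_gross"].str.replace(",", "", regex=False), errors="coerce"
)

# One possible policy: drop rows without domestic gross, label unknown studios.
cleaned = gross.dropna(subset=["domestic_gross"]).copy()
cleaned["studio"] = cleaned["studio"].fillna("Unknown")
print(cleaned)
```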

In [None]:
# TODO: clean the data (text fields, numeric columns, etc.)

Explore data distributions matplotlib

Explore differences between subsets

In [None]:
# TODO: explore correlations

In [None]:
# TODO: engineer and explore new features