MICROSOFT MOVIE ANALYSIS

1.BUSINESS UNDERSTANDING

    A.INTRODUCTION

When choosing which kinds of films to make, resource limitations must be taken into account. Understanding the available resources, including the budget, talent pool, and technological infrastructure, helps determine the size and scope of the movie studio's production capabilities. By defining its key success measures, such as box office revenue, profitability, audience ratings, and critical acclaim, Microsoft can set standards for success and gauge the studio's performance in the market.

It is also crucial to take into account the needs of the stakeholders involved, particularly the head of Microsoft's film division. Understanding their vision, goals, and specific requirements ensures that the conclusions and insights drawn from the analysis suit their decision-making needs.

By developing a thorough understanding of the commercial context, Microsoft can investigate which kinds of movies are currently succeeding at the box office. The conclusions drawn from this project will give the head of the movie studio actionable recommendations to guide the choice of film production strategy. This includes choosing genres that appeal to the target demographic, allocating resources wisely, and aligning the movies' branding with Microsoft's. Ultimately, these insights will help Microsoft produce hit movies and establish a significant foothold in the film business.


    B.PROBLEM STATEMENT

Microsoft's inexperience in the movie-making industry is a major barrier to the success of its recently launched movie studio. The current challenge is to identify the traits and categories of movies that have enjoyed significant box office success. The goal is to give the head of Microsoft's movie studio the knowledge needed to choose which kinds of movies to make, increasing the likelihood of success.

    C.MAIN OBJECTIVE

To perform exploratory data analysis in order to learn which kinds of movies are currently doing well at the box office.

    D.SPECIFIC OBJECTIVES

-Conduct exploratory data analysis to find patterns, trends, and connections between audience preferences, box office success, and popular genres.
-Draw conclusions that can be put into practice from the data analysis, highlighting the categories of movies that are connecting with audiences and doing well at the box office.
-Deliver a thorough presentation that includes a summary of the data analysis's findings, conclusions, and suggestions.

2.IMPORTING LIBRARIES

In [1]:
import pandas as pd

3.READING THE DATA

In [3]:
#loading data on movie gross
bom_data = pd.read_csv('Datasets/bom.movie_gross.csv.gz')
bom_data.head()

Unnamed: 0,title,studio,domestic_gross,foreign_gross,year
0,Toy Story 3,BV,415000000.0,652000000,2010
1,Alice in Wonderland (2010),BV,334200000.0,691300000,2010
2,Harry Potter and the Deathly Hallows Part 1,WB,296000000.0,664300000,2010
3,Inception,WB,292600000.0,535700000,2010
4,Shrek Forever After,P/DW,238700000.0,513900000,2010


In [4]:
#loading data on budgets
budgets_data = pd.read_csv('Datasets/tn.movie_budgets.csv.gz')
budgets_data.head()

Unnamed: 0,id,release_date,movie,production_budget,domestic_gross,worldwide_gross
0,1,"Dec 18, 2009",Avatar,"$425,000,000","$760,507,625","$2,776,345,279"
1,2,"May 20, 2011",Pirates of the Caribbean: On Stranger Tides,"$410,600,000","$241,063,875","$1,045,663,875"
2,3,"Jun 7, 2019",Dark Phoenix,"$350,000,000","$42,762,350","$149,762,350"
3,4,"May 1, 2015",Avengers: Age of Ultron,"$330,600,000","$459,005,868","$1,403,013,963"
4,5,"Dec 15, 2017",Star Wars Ep. VIII: The Last Jedi,"$317,000,000","$620,181,382","$1,316,721,747"


In [6]:
#loading data on titles basics
basics_data = pd.read_csv('Datasets/imdb.title.basics.csv.gz')
basics_data.head()

Unnamed: 0,tconst,primary_title,original_title,start_year,runtime_minutes,genres
0,tt0063540,Sunghursh,Sunghursh,2013,175.0,"Action,Crime,Drama"
1,tt0066787,One Day Before the Rainy Season,Ashad Ka Ek Din,2019,114.0,"Biography,Drama"
2,tt0069049,The Other Side of the Wind,The Other Side of the Wind,2018,122.0,Drama
3,tt0069204,Sabse Bada Sukh,Sabse Bada Sukh,2018,,"Comedy,Drama"
4,tt0100275,The Wandering Soap Opera,La Telenovela Errante,2017,80.0,"Comedy,Drama,Fantasy"


In [7]:
#loading data on ratings
ratings_data = pd.read_csv('Datasets/imdb.title.ratings.csv.gz')
ratings_data.head()


Unnamed: 0,tconst,averagerating,numvotes
0,tt10356526,8.3,31
1,tt10384606,8.9,559
2,tt1042974,6.4,20
3,tt1043726,4.2,50352
4,tt1060240,6.5,21
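
The four files above carry a `.gz` suffix yet are read with a plain `pd.read_csv` call. That works because `read_csv` defaults to `compression='infer'` and picks the codec from the file extension. The sketch below illustrates this with a throwaway file; the file name `toy.ratings.csv.gz` is hypothetical, and the two rows are copied from the ratings preview above.

```python
import gzip
import os
import tempfile

import pandas as pd

# Two rows copied from the ratings preview; used only as toy input.
csv_text = (
    "tconst,averagerating,numvotes\n"
    "tt10356526,8.3,31\n"
    "tt10384606,8.9,559\n"
)

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical file name, created only for this illustration.
    path = os.path.join(tmp_dir, "toy.ratings.csv.gz")
    with gzip.open(path, "wt", encoding="utf-8") as handle:
        handle.write(csv_text)

    # compression='infer' is the default, so the .gz suffix is enough.
    toy_ratings = pd.read_csv(path)

print(toy_ratings.shape)
print(toy_ratings.dtypes)
```

Printing `dtypes` right after loading is a cheap way to spot columns that were read as strings and will need conversion later.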


4.DATA WRANGLING

DROPPING COLUMNS

Only some of the columns in the gathered datasets are relevant to the analysis. In this stage, the columns that were not needed were removed from each dataset. After that, the remaining datasets were connected.
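
As a rough idea of what "connecting" the datasets could look like, the sketch below joins two small frames on the shared `tconst` key. It is only an assumption about how the join might be done, not the notebook's actual method. The column names follow the IMDb previews above, while the rating values and the names `toy_basics`, `toy_ratings` and `merged` are placeholders for illustration.

```python
import pandas as pd

# Titles taken from the basics preview above.
toy_basics = pd.DataFrame({
    "tconst": ["tt0063540", "tt0066787", "tt0069049"],
    "primary_title": [
        "Sunghursh",
        "One Day Before the Rainy Season",
        "The Other Side of the Wind",
    ],
    "genres": ["Action,Crime,Drama", "Biography,Drama", "Drama"],
})

# Placeholder ratings: these numbers are made up for the example.
toy_ratings = pd.DataFrame({
    "tconst": ["tt0063540", "tt0069049"],
    "averagerating": [5.0, 6.0],
    "numvotes": [10, 20],
})

# An inner join keeps only titles that appear in both frames.
merged = toy_basics.merge(toy_ratings, on="tconst", how="inner")
print(merged)
```

An inner join silently drops titles without a rating. Using `how="left"` instead would keep every title and leave the rating columns empty where no match exists.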

In [8]:
#selecting necessary columns on the movie gross dataset
necessary_columns = ['title','studio','domestic_gross','foreign_gross','year']
bom_data_filtered = bom_data[necessary_columns]
bom_data_filtered.head()

Unnamed: 0,title,studio,domestic_gross,foreign_gross,year
0,Toy Story 3,BV,415000000.0,652000000,2010
1,Alice in Wonderland (2010),BV,334200000.0,691300000,2010
2,Harry Potter and the Deathly Hallows Part 1,WB,296000000.0,664300000,2010
3,Inception,WB,292600000.0,535700000,2010
4,Shrek Forever After,P/DW,238700000.0,513900000,2010


In [7]:
bom_data_filtered.shape

(3387, 5)

This dataset contains 3387 rows and 5 columns.

In [9]:
#selecting necessary columns on the budgets dataset
new_budgets = ['movie','domestic_gross','worldwide_gross','release_date']
budgets_data_filtered = budgets_data[new_budgets]
budgets_data_filtered.head()

Unnamed: 0,movie,domestic_gross,worldwide_gross,release_date
0,Avatar,"$760,507,625","$2,776,345,279","Dec 18, 2009"
1,Pirates of the Caribbean: On Stranger Tides,"$241,063,875","$1,045,663,875","May 20, 2011"
2,Dark Phoenix,"$42,762,350","$149,762,350","Jun 7, 2019"
3,Avengers: Age of Ultron,"$459,005,868","$1,403,013,963","May 1, 2015"
4,Star Wars Ep. VIII: The Last Jedi,"$620,181,382","$1,316,721,747","Dec 15, 2017"


In [10]:
budgets_data_filtered.shape

(5782, 4)

This dataset contains 5782 rows and 4 columns.

In [11]:
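#dropping unnecessary columns in the titles basics dataset
#drop() returns a new DataFrame, so the result is assigned to keep it
basics_data_filtered = basics_data.drop(['original_title', 'runtime_minutes'], axis=1)
basics_data_filtered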


Unnamed: 0,tconst,primary_title,start_year,genres
0,tt0063540,Sunghursh,2013,"Action,Crime,Drama"
1,tt0066787,One Day Before the Rainy Season,2019,"Biography,Drama"
2,tt0069049,The Other Side of the Wind,2018,Drama
3,tt0069204,Sabse Bada Sukh,2018,"Comedy,Drama"
4,tt0100275,The Wandering Soap Opera,2017,"Comedy,Drama,Fantasy"
...,...,...,...,...
146139,tt9916538,Kuambil Lagi Hatiku,2019,Drama
146140,tt9916622,Rodolpho Teóphilo - O Legado de um Pioneiro,2015,Documentary
146141,tt9916706,Dankyavar Danka,2013,Comedy
146142,tt9916730,6 Gunn,2017,


In [12]:
basics_data.shape

(146144, 6)

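The original basics_data still contains 146144 rows and 6 columns, because drop() returns a new DataFrame rather than modifying basics_data in place.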
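
The shape above can be reproduced on a small scale. The sketch below uses a toy frame whose column names mirror the basics dataset and whose rows come from the preview. The names `toy` and `toy_trimmed` are illustrative only.

```python
import pandas as pd

toy = pd.DataFrame({
    "tconst": ["tt0063540", "tt0069049"],
    "original_title": ["Sunghursh", "The Other Side of the Wind"],
    "runtime_minutes": [175.0, 122.0],
    "genres": ["Action,Crime,Drama", "Drama"],
})

# The returned frame is discarded here, so `toy` keeps all four columns.
toy.drop(["original_title", "runtime_minutes"], axis=1)
print(toy.shape)  # (2, 4)

# Assigning the result keeps the trimmed frame under a new name.
toy_trimmed = toy.drop(columns=["original_title", "runtime_minutes"])
print(toy_trimmed.shape)  # (2, 2)
```

Assigning to a new name leaves the original frame available for later steps that may still need the dropped columns.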

FORMATTING DATATYPES

Each dataset needs its own formatting.

In the budgets data, the 'production_budget', 'domestic_gross' and 'worldwide_gross' columns are stored as currency strings and need to be converted to a numeric (float) type. The 'release_date' column is then converted to a datetime type.

In [13]:
#remove '$' and ',' (anything that is not a digit or '.') before casting to float
budgets_data['domestic_gross'] = budgets_data['domestic_gross'].str.replace(r'[^\d.]', '', regex=True).astype(float)
budgets_data['worldwide_gross'] = budgets_data['worldwide_gross'].str.replace(r'[^\d.]', '', regex=True).astype(float)
#parse strings such as 'Dec 18, 2009' into datetime values
budgets_data['release_date'] = pd.to_datetime(budgets_data['release_date'])
print(budgets_data.head())


   id release_date                                        movie   
0   1   2009-12-18                                       Avatar  \
1   2   2011-05-20  Pirates of the Caribbean: On Stranger Tides   
2   3   2019-06-07                                 Dark Phoenix   
3   4   2015-05-01                      Avengers: Age of Ultron   
4   5   2017-12-15            Star Wars Ep. VIII: The Last Jedi   

  production_budget  domestic_gross  worldwide_gross  
0      $425,000,000     760507625.0     2.776345e+09  
1      $410,600,000     241063875.0     1.045664e+09  
2      $350,000,000      42762350.0     1.497624e+08  
3      $330,600,000     459005868.0     1.403014e+09  
4      $317,000,000     620181382.0     1.316722e+09  
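
Once 'release_date' is a datetime column, parts of the date can be pulled out for grouping. The sketch below is one possible way to do that, using three date strings from the preview. The explicit `format` argument and the names `parsed`, `release_year` and `release_month` are assumptions added for illustration, not part of the original notebook.

```python
import pandas as pd

# Date strings copied from the budgets preview.
dates = pd.Series(["Dec 18, 2009", "May 20, 2011", "Jun 7, 2019"])

# An explicit format makes parsing strict instead of inferred.
# %b expects English month abbreviations under the default locale.
parsed = pd.to_datetime(dates, format="%b %d, %Y")

# The .dt accessor exposes datetime components as plain columns.
release_year = parsed.dt.year
release_month = parsed.dt.month

print(release_year.tolist())
print(release_month.tolist())
```

Passing a format is optional. It mainly helps catch rows that do not follow the expected pattern.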


In [16]:
#convert columns to numeric type then to float types
#note: 'production_budget' still holds strings such as '$425,000,000', so errors='coerce' turns every value in it into NaN
budgets_data['production_budget'] = pd.to_numeric(budgets_data['production_budget'], errors='coerce')
budgets_data['domestic_gross'] = pd.to_numeric(budgets_data['domestic_gross'], errors='coerce')

#convert the columns to float type (both are already float after to_numeric, so this changes nothing)
budgets_data['production_budget'] = budgets_data['production_budget'].astype(float)
budgets_data['domestic_gross'] = budgets_data['domestic_gross'].astype(float)

print(budgets_data)


      id release_date                                        movie   
0      1   2009-12-18                                       Avatar  \
1      2   2011-05-20  Pirates of the Caribbean: On Stranger Tides   
2      3   2019-06-07                                 Dark Phoenix   
3      4   2015-05-01                      Avengers: Age of Ultron   
4      5   2017-12-15            Star Wars Ep. VIII: The Last Jedi   
...   ..          ...                                          ...   
5777  78   2018-12-31                                       Red 11   
5778  79   1999-04-02                                    Following   
5779  80   2005-07-13                Return to the Land of Wonders   
5780  81   2015-09-29                         A Plague So Pleasant   
5781  82   2005-08-05                            My Date With Drew   

      production_budget  domestic_gross  worldwide_gross  
0                   NaN     760507625.0     2.776345e+09  
1                   NaN     241063875.0  
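
The NaN values in 'production_budget' come from passing strings that still contain '$' and ',' to `pd.to_numeric` with `errors='coerce'`. The sketch below reproduces the effect on three values from the preview and shows one way to avoid it. The names `raw`, `coerced` and `cleaned` are illustrative.

```python
import pandas as pd

# Budget strings copied from the budgets preview.
raw = pd.Series(["$425,000,000", "$410,600,000", "$350,000,000"])

# Coercing the raw strings fails for every value, so each becomes NaN.
coerced = pd.to_numeric(raw, errors="coerce")
print(coerced.isna().all())  # True

# Stripping everything except digits and '.' first gives parseable strings.
stripped = raw.str.replace(r"[^\d.]", "", regex=True)
cleaned = pd.to_numeric(stripped, errors="coerce").astype(float)
print(cleaned.tolist())  # [425000000.0, 410600000.0, 350000000.0]
```

Because `errors='coerce'` hides parsing failures, checking `isna().sum()` after a conversion is a quick way to confirm that values were not lost.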

In [19]:
budgets_data_filtered = budgets_data.drop(['production_budget', 'worldwide_gross'], axis=1)
print(budgets_data_filtered)


      id release_date                                        movie   
0      1   2009-12-18                                       Avatar  \
1      2   2011-05-20  Pirates of the Caribbean: On Stranger Tides   
2      3   2019-06-07                                 Dark Phoenix   
3      4   2015-05-01                      Avengers: Age of Ultron   
4      5   2017-12-15            Star Wars Ep. VIII: The Last Jedi   
...   ..          ...                                          ...   
5777  78   2018-12-31                                       Red 11   
5778  79   1999-04-02                                    Following   
5779  80   2005-07-13                Return to the Land of Wonders   
5780  81   2015-09-29                         A Plague So Pleasant   
5781  82   2005-08-05                            My Date With Drew   

      domestic_gross  
0        760507625.0  
1        241063875.0  
2         42762350.0  
3        459005868.0  
4        620181382.0  
...              ... 