# Microsoft's foray into the entertainment industry: A detailed analysis
# 1. Business Understanding
# a) Introduction

In a move to ride the wave of original video content and expand its footprint in the entertainment industry, Microsoft has decided to launch its own movie studio. However, as a newcomer to the film world, Microsoft is still working out what makes a movie successful. To tackle this challenge, a comprehensive data analysis is necessary to identify the drivers of successful films and inform strategic decisions for the new venture.

The main aim of this analysis is to uncover the common factors among successful movies. It examines historical data, genre trends, and audience preferences to provide actionable insights that guide the studio's content creation and production strategies. The ultimate goal is to give practical advice to Microsoft's new movie studio as the company, a global tech powerhouse, steps into the entertainment arena and leverages its tech expertise to create visually captivating and immersive film content.







# b) Defining the Metric for Success
In navigating the intricate landscape of film production, our challenge lies in leveraging market research and consumer insights to not only inform strategic decision-making but also enhance the financial performance of our film productions. Identifying untapped market opportunities and tailoring films to meet the specific demands of target audiences is paramount for sustained success in the dynamic entertainment industry. The success of our analysis will be measured by considering the following metrics:

Box Office Performance: For revenue generation, the analysis centers on identifying the film genres with the highest box office revenue.

Market Research: The effectiveness of our market research efforts will be gauged by our ability to uncover valuable consumer insights that directly influence decision-making and financial outcomes.

Target Audience Alignment: Success metrics will be based on how well our films align with the demands and preferences of specific target audiences, ensuring a tailored and resonant cinematic experience. 
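
As a rough illustration of the box office metric, the sketch below ranks genres by median gross. The frame, its column names (`genre`, `domestic_gross`), and every figure in it are made up for the example; they are not taken from the project's datasets.

```python
import pandas as pd

# Toy data for illustration only; titles, genres and figures are invented.
films = pd.DataFrame({
    "title": ["A", "B", "C", "D", "E"],
    "genre": ["Action", "Action", "Drama", "Drama", "Comedy"],
    "domestic_gross": [300.0, 100.0, 50.0, 70.0, 120.0],  # in millions
})

# The median is less sensitive than the mean to a single blockbuster outlier.
gross_by_genre = (
    films.groupby("genre")["domestic_gross"]
    .agg(["count", "median", "sum"])
    .sort_values("median", ascending=False)
)
print(gross_by_genre)
```

Reporting the count next to the median helps avoid over-reading a genre that is represented by only one or two films.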



 

# c) Main Objective


The main objective of the analysis for Microsoft's movie studio project is to identify the types of films that have been most successful in terms of generating high box office revenue. The analysis aims to leverage market research, consumer insights, and collaboration with industry professionals to inform decision-making. Additionally, the analysis seeks to strike a balance between creativity and data-driven approaches, ultimately guiding the studio in producing films that resonate with audiences and achieve financial success in the dynamic landscape of the entertainment industry.

# Specific Objectives
Genre Identification: Determine the film genres that have historically performed well in terms of generating high box office revenue.

Market Research Utilization: Effectively use market research and consumer insights to inform decision-making in the production process.

Audience Targeting: Identify and understand target audiences, tailoring films to meet the specific demands and preferences of diverse viewer segments. 

Cultural Impact: Explore the potential for films to have a positive cultural impact, contributing to the broader cultural landscape and potentially enhancing the country's economic growth and development.


# d) Experimental Design
Data Collection: Collect data on the highest-grossing movies from the available data sources, covering variables such as genre, release year, production budget, and domestic gross revenue.

Data Cleaning: Preprocess the data to clean and standardize it. Remove any duplicate records and fill in missing numeric values with the mean or median. Ensure that all categorical variables are represented in a standardized format.

Exploratory Data Analysis: Explore the data to understand its patterns and trends. This can include examining the relationship between movie features such as genre, release year, and production budget and their box office revenue.

Analysis and Visualizations: Visualize the data using charts such as bar graphs and histograms to gain insights into the top-grossing films.

Conclusions and Presentation: Present the findings, explain the patterns and trends identified in the data, and discuss the implications for future film production. Suggest potential areas of improvement or expansion, such as exploring international markets, incorporating audience feedback, or leveraging data-driven strategies for film marketing and distribution.
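
The cleaning step described above could look roughly like the sketch below. It is only an illustration: the column names echo `tn_movie_budgets`, but the rows, the `genre` column, and the `"$1,000"` string format are assumptions, and `clean` is a hypothetical helper.

```python
import pandas as pd

# Toy frame for illustration; the rows and the "$1,000" format are assumptions.
raw = pd.DataFrame({
    "movie": ["A", "A", "B", "C"],
    "genre": [" action", " action", "Drama ", "DRAMA"],
    "production_budget": ["$1,000", "$1,000", "$2,500", None],
})

def clean(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: de-duplicate, standardize labels, fill numeric gaps."""
    out = df.drop_duplicates().copy()
    # Standardize categorical labels: trim whitespace and unify case.
    out["genre"] = out["genre"].str.strip().str.title()
    # Strip currency formatting, then convert to numbers.
    out["production_budget"] = pd.to_numeric(
        out["production_budget"].str.replace(r"[$,]", "", regex=True)
    )
    # Fill remaining gaps with the median, which resists outliers.
    median = out["production_budget"].median()
    out["production_budget"] = out["production_budget"].fillna(median)
    return out.reset_index(drop=True)

cleaned = clean(raw)
print(cleaned)
```

Whether the mean or the median is the better fill value depends on how skewed the column is; budgets and grosses are usually skewed, which is why the median is used here.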

# e) Data Relevance


# 2. Reading the Data


In [41]:
#import standard packages
import pandas as pd
import numpy as np 
import csv
import matplotlib.pyplot as plt
import seaborn as sns


In [42]:
#Loading data from the CSV files 
bom_movie_gross = pd.read_csv('bom.movie_gross.csv')
print("BOM Movie Gross Data:")
print(bom_movie_gross.head())


tmdb_movies = pd.read_csv('tmdb.movies.csv')
print("TMDB Movies Data:")
print(tmdb_movies.head())

tn_movie_budgets = pd.read_csv('tn.movie_budgets.csv')
print("TN Movie Budgets Data:")
print(tn_movie_budgets.head())


BOM Movie Gross Data:
   Unnamed: 0            genre_ids     id original_language  \
0           0      [12, 14, 10751]  12444                en   
1           1  [14, 12, 16, 10751]  10191                en   
2           2        [12, 28, 878]  10138                en   
3           3      [16, 35, 10751]    862                en   
4           4        [28, 878, 12]  27205                en   

                                 original_title  popularity release_date  \
0  Harry Potter and the Deathly Hallows: Part 1      33.533   19/11/2010   
1                      How to Train Your Dragon      28.734   26/03/2010   
2                                    Iron Man 2      28.515   07/05/2010   
3                                     Toy Story      28.005   22/11/1995   
4                                     Inception      27.920   16/07/2010   

                                          title  vote_average  vote_count  
0  Harry Potter and the Deathly Hallows: Part 1           7.7     
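
The `release_date` values printed above are day-first strings (for example `19/11/2010`). A minimal sketch of parsing them with an explicit format, assuming every row follows the same `DD/MM/YYYY` layout as the sample shown:

```python
import pandas as pd

# Sample values copied from the head() output above.
dates = pd.Series(["19/11/2010", "26/03/2010", "07/05/2010", "22/11/1995"])

# An explicit format avoids 07/05/2010 being read as July 5th.
parsed = pd.to_datetime(dates, format="%d/%m/%Y")
release_year = parsed.dt.year
release_month = parsed.dt.month
print(release_year.tolist(), release_month.tolist())
```

If some rows turn out to use a different layout, `pd.to_datetime` raises an error with this format string, which is usually preferable to silently swapping day and month.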

In [43]:
# Load data from the TSV files (tab-separated)
rt_movie = pd.read_csv('rt.movie_info.tsv', sep='\t')
print("\nMovie Info")
print(rt_movie.head())

# The reviews file is read with latin1 encoding
rt_reviews = pd.read_csv('rt.reviews.tsv', sep='\t', encoding='latin1')
print("\nRT Reviews Data:")
print(rt_reviews.head())
# Only the last expression in a cell is displayed as rich output
rt_reviews


Movie Info
   id                                           synopsis rating  \
0   1  This gritty, fast-paced, and innovative police...      R   
1   3  New York City, not-too-distant-future: Eric Pa...      R   
2   5  Illeana Douglas delivers a superb performance ...      R   
3   6  Michael Douglas runs afoul of a treacherous su...      R   
4   7                                                NaN     NR   

                                 genre          director  \
0  Action and Adventure|Classics|Drama  William Friedkin   
1    Drama|Science Fiction and Fantasy  David Cronenberg   
2    Drama|Musical and Performing Arts    Allison Anders   
3           Drama|Mystery and Suspense    Barry Levinson   
4                        Drama|Romance    Rodney Bennett   

                            writer  theater_date      dvd_date currency  \
0                   Ernest Tidyman   Oct 9, 1971  Sep 25, 2001      NaN   
1     David Cronenberg|Don DeLillo  Aug 17, 2012   Jan 1, 2013        $   

Unnamed: 0,id,review,rating,fresh,critic,top_critic,publisher,date
0,3,A distinctly gallows take on contemporary fina...,3/5,fresh,PJ Nabarro,0,Patrick Nabarro,"November 10, 2018"
1,3,It's an allegory in search of a meaning that n...,,rotten,Annalee Newitz,0,io9.com,"May 23, 2018"
2,3,... life lived in a bubble in financial dealin...,,fresh,Sean Axmaker,0,Stream on Demand,"January 4, 2018"
3,3,Continuing along a line introduced in last yea...,,fresh,Daniel Kasman,0,MUBI,"November 16, 2017"
4,3,... a perverse twist on neorealism...,,fresh,,0,Cinema Scope,"October 12, 2017"
...,...,...,...,...,...,...,...,...
54427,2000,The real charm of this trifle is the deadpan c...,,fresh,Laura Sinagra,1,Village Voice,"September 24, 2002"
54428,2000,,1/5,rotten,Michael Szymanski,0,Zap2it.com,"September 21, 2005"
54429,2000,,2/5,rotten,Emanuel Levy,0,EmanuelLevy.Com,"July 17, 2005"
54430,2000,,2.5/5,rotten,Christopher Null,0,Filmcritic.com,"September 7, 2003"
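
The `rating` column in the reviews display holds strings such as `3/5` and `2.5/5`, with many missing values. The sketch below shows one way to turn them into a numeric score between 0 and 1. The `rating_score` and `is_fresh` column names are introduced here for illustration, and ratings in any other format are simply left as NaN.

```python
import pandas as pd

# Sample values in the shape shown in the display above; None marks a missing rating.
reviews = pd.DataFrame({
    "rating": ["3/5", None, "1/5", "2.5/5"],
    "fresh": ["fresh", "rotten", "rotten", "rotten"],
})

# Split "x/y" into numerator and denominator; non-matching values stay NaN.
parts = reviews["rating"].str.extract(
    r"^\s*(\d+(?:\.\d+)?)\s*/\s*(\d+(?:\.\d+)?)\s*$"
)
reviews["rating_score"] = parts[0].astype(float) / parts[1].astype(float)

# Boolean flag that is easier to aggregate than the fresh/rotten strings.
reviews["is_fresh"] = reviews["fresh"].eq("fresh")
print(reviews)
```

With a boolean `is_fresh` column, `reviews.groupby("id")["is_fresh"].mean()` would give a per-movie share of fresh reviews, assuming the `id` column is kept alongside it.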


In [44]:
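# Display basic information about each dataset
# Note: DataFrame.info() prints its report and returns None,
# so wrapping it in print() also emits a trailing "None"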
print(bom_movie_gross.info())
print(tmdb_movies.info())
print(tn_movie_budgets.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26517 entries, 0 to 26516
Data columns (total 10 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Unnamed: 0         26517 non-null  int64  
 1   genre_ids          26517 non-null  object 
 2   id                 26517 non-null  int64  
 3   original_language  26517 non-null  object 
 4   original_title     26517 non-null  object 
 5   popularity         26517 non-null  float64
 6   release_date       26517 non-null  object 
 7   title              26517 non-null  object 
 8   vote_average       26517 non-null  float64
 9   vote_count         26517 non-null  int64  
dtypes: float64(2), int64(3), object(5)
memory usage: 2.0+ MB
None
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26517 entries, 0 to 26516
Data columns (total 10 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Unnamed: 0         26517 non-null  int

In [45]:
# Check for missing values in bom movie gross dataset
print("\nMissing Values in BOM:")
print(bom_movie_gross.isnull().sum())

# Check for missing values in tmdb dataset
print("\nMissing Values in tmdb:")
print(tmdb_movies.isnull().sum())

# Check for missing values in movie budgets dataset
print("\nMissing Values in movie budget:")
print(tn_movie_budgets.isnull().sum())

#check for missing data in rt movie info
print("\nMissing Values in RT Movie Info:")
print(rt_movie.isnull().sum())

#check for missing data in rt reviews
print("\nMissing Values in rt_reviews:")
print(rt_reviews.isnull().sum())


Missing Values in BOM:
Unnamed: 0           0
genre_ids            0
id                   0
original_language    0
original_title       0
popularity           0
release_date         0
title                0
vote_average         0
vote_count           0
dtype: int64

Missing Values in tmdb:
Unnamed: 0           0
genre_ids            0
id                   0
original_language    0
original_title       0
popularity           0
release_date         0
title                0
vote_average         0
vote_count           0
dtype: int64

Missing Values in movie budget:
id                   0
release_date         0
movie                0
production_budget    0
domestic_gross       0
worldwide_gross      0
dtype: int64

Missing Values in RT Movie Info:
id                 0
synopsis          62
rating             3
genre              8
director         199
writer           449
theater_date     359
dvd_date         359
currency        1220
box_office      1220
runtime           30
studio          
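
Raw missing counts are easier to act on when expressed as a share of rows. The sketch below uses a toy frame, since the total row count of `rt_movie` is not shown above; the `THRESHOLD` value is a hypothetical cut-off, not one taken from this analysis.

```python
import numpy as np
import pandas as pd

# Toy frame for illustration; values are invented.
df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "genre": ["Drama", None, "Comedy", "Drama"],
    "box_office": [np.nan, np.nan, np.nan, 600000.0],
})

# isnull().mean() gives the fraction of missing values per column.
missing_share = df.isnull().mean().sort_values(ascending=False)

THRESHOLD = 0.5  # hypothetical cut-off, to be tuned for the analysis
mostly_missing = missing_share[missing_share > THRESHOLD].index.tolist()
print(missing_share)
print("Candidates to drop:", mostly_missing)
```

Columns above the cut-off are only candidates: a sparsely populated column such as `box_office` may still be worth keeping for the subset of rows where it is present.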

In [46]:
# Find unique values in the TMDB movies dataset
# (each genre_ids value is a combination of genre IDs stored as a string)
print("\nUnique Genres in TMDB Movies:")
print(tmdb_movies['genre_ids'].unique())
print("\nFrequency of Unique Genres in TMDB Movies:")
print(tmdb_movies['genre_ids'].value_counts())



Unique Genres in TMDB Movies:
['[12, 14, 10751]' '[14, 12, 16, 10751]' '[12, 28, 878]' ...
 '[18, 14, 27, 878, 10749, 53]' '[16, 27, 9648]' '[10751, 12, 28]']

Frequency of Unique Genres in TMDB Movies:
genre_ids
[99]                       3700
[]                         2479
[18]                       2268
[35]                       1660
[27]                       1145
                           ... 
[37, 12]                      1
[10752, 878]                  1
[28, 53, 10749, 18, 35]       1
[99, 80, 53, 36]              1
[10751, 12, 28]               1
Name: count, Length: 2477, dtype: int64
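
Because each `genre_ids` value is a string such as `'[12, 14, 10751]'`, the counts above are per combination, not per genre. A minimal sketch of counting individual genre IDs, using a few sample strings in the same shape:

```python
import ast
import pandas as pd

# Sample strings in the same shape as the genre_ids column above.
genre_ids = pd.Series(
    ["[12, 14, 10751]", "[99]", "[]", "[18]", "[99]", "[12, 28, 878]"]
)

# literal_eval safely turns the string "[12, 14]" into the list [12, 14].
as_lists = genre_ids.apply(ast.literal_eval)

# explode() gives one row per genre ID; empty lists become NaN and are dropped.
per_genre = as_lists.explode().dropna().astype(int)
genre_counts = per_genre.value_counts()
print(genre_counts)
```

The IDs are still numeric codes at this point; mapping them to genre names needs a lookup table that is not part of the data shown here.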


In [47]:
#Check column names 

print("\nColumns in bom_movie_gross data:")
print(bom_movie_gross.columns)

print("\nColumns in tmdb_movies data:")
print(tmdb_movies.columns)

print("\nColumns in tn_movie_budgets data:")
print(tn_movie_budgets.columns)

print("Columns in rt_movie data:")
print(rt_movie.columns)

print("\nColumns in rt_reviews data:")
print(rt_reviews.columns)


Columns in bom_movie_gross data:
Index(['Unnamed: 0', 'genre_ids', 'id', 'original_language', 'original_title',
       'popularity', 'release_date', 'title', 'vote_average', 'vote_count'],
      dtype='object')

Columns in tmdb_movies data:
Index(['Unnamed: 0', 'genre_ids', 'id', 'original_language', 'original_title',
       'popularity', 'release_date', 'title', 'vote_average', 'vote_count'],
      dtype='object')

Columns in tn_movie_budgets data:
Index(['id', 'release_date', 'movie', 'production_budget', 'domestic_gross',
       'worldwide_gross'],
      dtype='object')
Columns in rt_movie data:
Index(['id', 'synopsis', 'rating', 'genre', 'director', 'writer',
       'theater_date', 'dvd_date', 'currency', 'box_office', 'runtime',
       'studio'],
      dtype='object')

Columns in rt_reviews data:
Index(['id', 'review', 'rating', 'fresh', 'critic', 'top_critic', 'publisher',
       'date'],
      dtype='object')
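
The listings above show `bom_movie_gross` and `tmdb_movies` with identical columns, which may mean the same data was loaded under two names. A small check like the one below can flag this early. `same_schema_pairs` is a hypothetical helper, and the frames are tiny stand-ins with shortened column lists.

```python
from itertools import combinations

import pandas as pd

def same_schema_pairs(frames: dict) -> list:
    """Return name pairs whose DataFrames share identical column lists."""
    return [
        (a, b)
        for (a, fa), (b, fb) in combinations(frames.items(), 2)
        if list(fa.columns) == list(fb.columns)
    ]

# Tiny stand-ins; in the notebook these would be the loaded DataFrames.
frames = {
    "bom_movie_gross": pd.DataFrame(columns=["genre_ids", "id", "title"]),
    "tmdb_movies": pd.DataFrame(columns=["genre_ids", "id", "title"]),
    "tn_movie_budgets": pd.DataFrame(columns=["id", "movie", "domestic_gross"]),
}
suspicious = same_schema_pairs(frames)
print(suspicious)
```

For any flagged pair, `DataFrame.equals` can then confirm whether the contents are identical as well, in which case the source file for one of them is worth re-checking.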


# Data analysis
