# Microsoft's New Movie Studio

## Overview
### Problem 
- Microsoft sees other big companies creating original video content and wants to join in. It is starting its own movie studio but has no experience making movies. That's where I come in. My task is to find out which types of films are doing well at the box office and translate that into actionable insights for the head of Microsoft's studio, to help decide what kind of films to create.
### Data Understanding 
- The data come from a variety of movie databases.
- The data represent various movie attributes. Key among them are revenue generated, popularity, and ratings, which will serve as the success metrics in this analysis.
#### Questions to consider
- Does expensive mean successful?
- Is studio important to success?
- Are some genres more successful than others?

In [3]:
# import relevant modules
import pandas as pd
import numpy as np
from scipy import stats

# visualization modules
import matplotlib.pyplot as plt
import seaborn as sns

### Data Preparation
- Get a better representation of user rating
- Fixing genre column in title.basics.csv

#### Getting a better representation of user rating
- It doesn't seem fair to compare a film with an average rating of 4 from only 100 voters to one with an average rating of 4 from 10,000 voters.
- I will use a method inspired by Bayesian probability to give a better representation of the rating.
- More information on how this works can be found [here](https://stackoverflow.com/a/50476254).
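
Written out, the adjusted rating computed in the cells that follow is a weighted average of a prior rating and each film's observed mean. The symbols $\bar{r}$ (a film's `averagerating`) and $n$ (its `numvotes`) are notation introduced here for readability; $R$, $W$, and $z$ match the variable names in the code.

```latex
\[
\text{bayesian\_rating} = \frac{W R + \bar{r}\, n}{W + n},
\qquad W = \frac{z^{2}}{4}
\]
```

With $n = 0$ the result equals the prior $R$, and as $n$ grows it approaches the film's own average $\bar{r}$.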

In [2]:
# load the data
ratings_df = pd.read_csv('data/title.ratings.csv')

In [5]:
# Define the parameter R as the median of the averagerating column
R = np.median(ratings_df['averagerating'])

# Calculate the z-score for the 95% confidence level
z = stats.norm.ppf(0.975)

# Calculate W as z^2/4
W = z**2/4

# Calculate the Bayesian rating using the formula
ratings_df["bayesian_rating"] = (W * R + ratings_df["averagerating"] * ratings_df["numvotes"]) / (W + ratings_df["numvotes"])

# preview first 5 rows of the dataframe
ratings_df.head()


| | tconst | averagerating | numvotes | bayesian_rating |
|---|---|---|---|---|
| 0 | tt10356526 | 8.3 | 31 | 8.245912 |
| 1 | tt10384606 | 8.9 | 559 | 8.895884 |
| 2 | tt1042974 | 6.4 | 20 | 6.404582 |
| 3 | tt1043726 | 4.2 | 50352 | 4.200044 |
| 4 | tt1060240 | 6.5 | 21 | 6.5 |
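
To see how the vote count changes the adjustment, here is a minimal sketch that applies the same formula to two hypothetical films, both rated 4.0, one with 100 votes and one with 10,000. The helper `bayesian_rating` and the prior value `R = 6.5` are assumptions made for illustration; the notebook itself takes `R` from the median of `averagerating`.

```python
from scipy import stats


def bayesian_rating(avg_rating, num_votes, prior, weight):
    """Weighted average of the prior and the observed mean rating."""
    return (weight * prior + avg_rating * num_votes) / (weight + num_votes)


# Same weight as the notebook: z**2 / 4 for a two-sided 95% level
z = stats.norm.ppf(0.975)
W = z**2 / 4

# Hypothetical prior, chosen only for this example
R = 6.5

few_votes = bayesian_rating(4.0, 100, R, W)
many_votes = bayesian_rating(4.0, 10_000, R, W)
no_votes = bayesian_rating(0.0, 0, R, W)

print(few_votes, many_votes, no_votes)
```

The film with fewer votes is pulled further toward the prior. Because `W` is below 1 here, the prior carries less weight than a single vote, so the pull is small even at 100 votes; a larger weight would strengthen it.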


#### Fixing genre column
- Convert each comma-separated string in the genre column to a list

In [8]:
# load the data
basics_df = pd.read_csv('data/title.basics.csv')

In [10]:
txt = 'Hello'
[txt]

['Hello']

In [16]:
def text_to_list(text):
    # return NaN if text is empty or not a string
    if not isinstance(text, str) or text == '':
        return np.nan
    # split on commas; a string with no comma becomes a one-item list
    return text.split(',')

In [17]:
# apply text_to_list to genres column
basics_df.loc[:, 'genres'] = basics_df['genres'].apply(text_to_list)

In [18]:
# preview results
basics_df['genres']

0           [Action, Crime, Drama]
1               [Biography, Drama]
2                          [Drama]
3                  [Comedy, Drama]
4         [Comedy, Drama, Fantasy]
                    ...           
146139                     [Drama]
146140               [Documentary]
146141                    [Comedy]
146142                         NaN
146143               [Documentary]
Name: genres, Length: 146144, dtype: object
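
With genres stored as lists, one way to analyse films per genre is `DataFrame.explode`, which produces one row per film and genre pair. The sketch below uses a small made-up frame (`toy_df`, with hypothetical `tconst` values) rather than the real `basics_df`, and drops films with missing genres first.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for basics_df after text_to_list has been applied
toy_df = pd.DataFrame({
    "tconst": ["tt0000001", "tt0000002", "tt0000003"],
    "genres": [["Action", "Drama"], ["Drama"], np.nan],
})

# One row per (film, genre) pair; films with missing genres are dropped first
genre_rows = toy_df.dropna(subset=["genres"]).explode("genres")

# Number of films per genre
genre_counts = genre_rows["genres"].value_counts()
print(genre_counts)
```

The same exploded frame could then be grouped by genre to compare ratings or revenue once those columns are merged in.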

### Data Analysis

#### Does expensive mean successful?
##### Questions to answer
- Do more expensive films get better ratings?
- Do more expensive films make more profit?
- Which genres cost the most to make on average?
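
One possible starting point for the first two questions is to compute profit and look at how budget correlates with rating and profit. The sketch below is only an illustration under stated assumptions: the frame `films`, the column names `production_budget` and `worldwide_gross`, and all values are hypothetical and do not come from the project data.

```python
import pandas as pd

# Hypothetical toy data; column names and values are illustrative only
films = pd.DataFrame({
    "production_budget": [10e6, 50e6, 100e6, 200e6],
    "worldwide_gross": [30e6, 40e6, 250e6, 600e6],
    "bayesian_rating": [6.1, 5.4, 7.0, 7.3],
})

# Profit is gross revenue minus budget
films["profit"] = films["worldwide_gross"] - films["production_budget"]

# Pearson correlation of budget with every numeric column
budget_corr = films.corr()["production_budget"]
print(budget_corr[["bayesian_rating", "profit"]])
```

A correlation close to 1 or -1 would suggest a strong linear relationship, while a value near 0 would suggest budget alone says little about that metric.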