![example](images/director_shot.jpeg)

# Microsoft Movie Analysis

**Authors:** Albane Colmenares
***

## Overview

A one-paragraph overview of the project, including the business problem, data, methods, results and recommendations.

## Business Problem

Summary of the business problem you are trying to solve, and the data questions that you plan to answer to solve them.

***
Questions to consider:
* What are the business's pain points related to this project?
* How did you pick the data analysis question(s) that you did?
* Why are these questions important from a business perspective?
***

## Data Understanding

Describe the data being used for this project.
***
Questions to consider:
* Where did the data come from, and how do they relate to the data analysis questions?
* What do the data represent? Who is in the sample and what variables are included?
* What is the target variable?
* What are the properties of the variables you intend to use?
***

In [1]:
# Import standard packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import sqlite3 

%matplotlib inline


In [2]:
# Here you run your code to explore the data


In [3]:
# Loading and inspecting available datasets
# Loading bom.movie_gross and storing data into df_movie_gross
df_movie_gross = pd.read_csv('data/bom.movie_gross.csv.gz', compression='gzip')

df_movie_gross.head()


Unnamed: 0,title,studio,domestic_gross,foreign_gross,year
0,Toy Story 3,BV,415000000.0,652000000,2010
1,Alice in Wonderland (2010),BV,334200000.0,691300000,2010
2,Harry Potter and the Deathly Hallows Part 1,WB,296000000.0,664300000,2010
3,Inception,WB,292600000.0,535700000,2010
4,Shrek Forever After,P/DW,238700000.0,513900000,2010


In [4]:
# conn = sqlite3.connect('data/im.db')

# df_imdbtest = pd.read_sql('data/im.db.zip', conn)
# #
# df_imdbtest.head()

In [5]:
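# Inspect overall shape and info of the dataframe
# (print the shape explicitly: a bare expression is only displayed when it is the last line of a cell)
print(df_movie_gross.shape)
df_movie_gross.info()

(3387, 5)
<class 'pandas.core.frame.DataFrame'>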
RangeIndex: 3387 entries, 0 to 3386
Data columns (total 5 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   title           3387 non-null   object 
 1   studio          3382 non-null   object 
 2   domestic_gross  3359 non-null   float64
 3   foreign_gross   2037 non-null   object 
 4   year            3387 non-null   int64  
dtypes: float64(1), int64(1), object(3)
memory usage: 132.4+ KB


In [6]:
# df_movie_gross data cleaning
# Inspect the 5 studio null information
df_movie_gross[df_movie_gross['studio'].isnull()]


Unnamed: 0,title,studio,domestic_gross,foreign_gross,year
210,Outside the Law (Hors-la-loi),,96900.0,3300000.0,2010
555,Fireflies in the Garden,,70600.0,3300000.0,2011
933,Keith Lemon: The Film,,,4000000.0,2012
1862,Plot for Peace,,7100.0,,2014
2825,Secret Superstar,,,122000000.0,2017


In [7]:
# Inspect the studios in general
studios = df_movie_gross.drop_duplicates(subset=['studio'])
print(studios['studio'].tolist())
print(len(studios['studio'].tolist()))
# The list has 258 entries, one of which is NaN. Only 5 of the 3,387 rows
# have no studio, so dropping them loses very little data.

['BV', 'WB', 'P/DW', 'Sum.', 'Par.', 'Uni.', 'Fox', 'Wein.', 'Sony', 'FoxS', 'SGem', 'WB (NL)', 'LGF', 'MBox', 'CL', 'W/Dim.', 'CBS', 'Focus', 'MGM', 'Over.', 'Mira.', 'IFC', 'CJ', 'NM', 'SPC', 'ParV', 'Gold.', 'JS', 'RAtt.', 'Magn.', 'Free', '3D', 'UTV', 'Rela.', 'Zeit.', 'Anch.', 'PDA', 'Lorb.', 'App.', 'Drft.', 'Osci.', 'IW', 'Rog.', nan, 'Eros', 'Relbig.', 'Viv.', 'Hann.', 'Strand', 'NGE', 'Scre.', 'Kino', 'Abr.', 'CZ', 'ATO', 'First', 'GK', 'FInd.', 'NFC', 'TFC', 'Pala.', 'Imag.', 'NAV', 'Arth.', 'CLS', 'Mont.', 'Olive', 'CGld', 'FOAK', 'IVP', 'Yash', 'ICir', 'FM', 'Vita.', 'WOW', 'Truly', 'Indic.', 'FD', 'Vari.', 'TriS', 'ORF', 'IM', 'Elev.', 'Cohen', 'NeoC', 'Jan.', 'MNE', 'Trib.', 'Rocket', 'OMNI/FSR', 'KKM', 'Argo.', 'SMod', 'Libre', 'FRun', 'WHE', 'P4', 'KC', 'SD', 'AM', 'MPFT', 'Icar.', 'AGF', 'A23', 'Da.', 'NYer', 'Rialto', 'DF', 'KL', 'ALP', 'LG/S', 'WGUSA', 'MPI', 'RTWC', 'FIP', 'RF', 'ArcEnt', 'PalUni', 'EpicPics', 'EOne', 'LD', 'AF', 'TFA', 'Myr.', 'BM&DH', 'SEG', 'PalT

In [8]:
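# Dropping rows with no studio; .copy() makes the result an independent
# DataFrame so later column assignments do not raise SettingWithCopyWarning
df_movie_gross = df_movie_gross[df_movie_gross['studio'].notna()].copy()
# Verifying that the NaN rows were dropped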
df_movie_gross.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3382 entries, 0 to 3386
Data columns (total 5 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   title           3382 non-null   object 
 1   studio          3382 non-null   object 
 2   domestic_gross  3356 non-null   float64
 3   foreign_gross   2033 non-null   object 
 4   year            3382 non-null   int64  
dtypes: float64(1), int64(1), object(3)
memory usage: 158.5+ KB


In [9]:
# Inspecting movies that don't have domestic revenue: do they have a 
# foreign revenue? 
df_movie_gross[df_movie_gross['domestic_gross'].isnull()]

Unnamed: 0,title,studio,domestic_gross,foreign_gross,year
230,It's a Wonderful Afterlife,UTV,,1300000,2010
298,Celine: Through the Eyes of the World,Sony,,119000,2010
302,White Lion,Scre.,,99600,2010
306,Badmaash Company,Yash,,64400,2010
327,Aashayein (Wishes),Relbig.,,3800,2010
537,Force,FoxS,,4800000,2011
713,Empire of Silver,NeoC,,19000,2011
871,Solomon Kane,RTWC,,19600000,2012
928,The Tall Man,Imag.,,5200000,2012
936,"Lula, Son of Brazil",NYer,,3800000,2012


In [10]:
# All movies that don't have domestic revenue were distributed overseas
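
The observation above rests on the rows displayed. A minimal sketch of checking it across every row, assuming a small stand-in frame in place of `df_movie_gross` (the sample rows are copied from the outputs shown earlier; `sample`, `missing_domestic` and `all_have_foreign` are illustrative names, not part of the notebook):

```python
import numpy as np
import pandas as pd

# Stand-in for df_movie_gross; rows copied from the outputs shown above
sample = pd.DataFrame({
    "title": ["Toy Story 3", "White Lion", "Force", "Flipped"],
    "domestic_gross": [415000000.0, np.nan, np.nan, 1800000.0],
    "foreign_gross": ["652000000", "99600", "4800000", None],
})

# Rows with no domestic revenue
missing_domestic = sample[sample["domestic_gross"].isna()]

# True only if every one of those rows reports a foreign gross
all_have_foreign = missing_domestic["foreign_gross"].notna().all()
print(len(missing_domestic), all_have_foreign)
```

Running the same two lines against the full dataframe would confirm or refute the claim for all rows rather than only the ones visible in the output.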

In [11]:
# Now inspecting the same info for foreign revenue
df_movie_gross[df_movie_gross['foreign_gross'].isnull()]

Unnamed: 0,title,studio,domestic_gross,foreign_gross,year
222,Flipped,WB,1800000.0,,2010
254,The Polar Express (IMAX re-issue 2010),WB,673000.0,,2010
267,Tiny Furniture,IFC,392000.0,,2010
269,Grease (Sing-a-Long re-issue),Par.,366000.0,,2010
280,Last Train Home,Zeit.,288000.0,,2010
...,...,...,...,...,...
3382,The Quake,Magn.,6200.0,,2018
3383,Edward II (2018 re-release),FM,4800.0,,2018
3384,El Pacto,Sony,2500.0,,2018
3385,The Swan,Synergetic,2400.0,,2018


In [12]:
# Convert the foreign_gross column to float, stripping comma separators first
df_movie_gross['foreign_gross'] = df_movie_gross['foreign_gross'].str.replace(',', '', regex=False).astype(np.float64)

# Fill NaN values with 0 in both revenue columns
revenue_cols = ['domestic_gross', 'foreign_gross']
df_movie_gross[revenue_cols] = df_movie_gross[revenue_cols].fillna(0)

df_movie_gross.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3382 entries, 0 to 3386
Data columns (total 5 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   title           3382 non-null   object 
 1   studio          3382 non-null   object 
 2   domestic_gross  3382 non-null   float64
 3   foreign_gross   3382 non-null   float64
 4   year            3382 non-null   int64  
dtypes: float64(2), int64(1), object(2)
memory usage: 158.5+ KB


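## Question
What does BV stand for? The production studio should be Pixar or Disney Pictures. BV most likely stands for Buena Vista, Disney's distribution label, which would mean the `studio` column records the distributor rather than the production company.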

In [None]:
df_movie_gross.loc[df_movie_gross['studio'] == 'BV']
# What does BV stand for? 

In [None]:
# Extracting im.db from its zip archive into the data folder

import zipfile
with zipfile.ZipFile('data/im.db.zip', 'r') as zip_ref:
    zip_ref.extractall('data')


In [None]:
# Connecting to the extracted im.db database

# Creating connection to database
conn = sqlite3.connect('data/im.db')

# Creating a cursor
cur = conn.cursor()


In [None]:
df = pd.read_sql("""
                SELECT * 
                FROM sqlite_master
""", con=conn)
df

In [None]:
# Viewing tables in database
# with read_sql_query
imdb_data = pd.read_sql_query(
    """
    SELECT name
    FROM sqlite_master
    WHERE type = 'table';
    """,
    conn,
)
imdb_data.head()
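
Once the table names are known, each table can be read into its own dataframe. A minimal sketch of that step, assuming an in-memory database in place of `data/im.db` (the `demo_movies` table and its columns are hypothetical and only stand in for whatever tables the query above returns):

```python
import sqlite3
import pandas as pd

# In-memory database standing in for data/im.db; the table is hypothetical
conn_demo = sqlite3.connect(":memory:")
conn_demo.execute("CREATE TABLE demo_movies (movie_id TEXT, title TEXT)")
conn_demo.execute("INSERT INTO demo_movies VALUES ('m1', 'Inception')")
conn_demo.commit()

# List the table names, as in the query above
tables = pd.read_sql_query(
    "SELECT name FROM sqlite_master WHERE type = 'table';", conn_demo
)

# Load the first listed table into a DataFrame
first_table = tables["name"].iloc[0]
df_demo = pd.read_sql_query(f'SELECT * FROM "{first_table}";', conn_demo)
print(df_demo.shape)

conn_demo.close()
```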

In [None]:
# Inspecting rt.movie_info file
# Loading rt.movie_info and storing data into df_movie_info

df_movie_info = pd.read_csv('data/rt.movie_info.tsv.gz', compression='gzip', sep='\t')

df_movie_info.head()


In [None]:
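# Inspect overall shape and info of the dataframe
# (print the shape explicitly: a bare expression is only displayed when it is the last line of a cell)
print(df_movie_info.shape)
df_movie_info.info()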

## notes
### df_movie_info data cleaning: 
-  all dates are strings, turn into datetime, keep only year? create month and year columns? 
-  only 340 rows with currency and box office, can we do anything with this information?
-  runtime should be useful to test correlation between rating or reviews, so turn into integer 
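
The cleaning steps in these notes could look roughly like the sketch below. It uses a small stand-in frame, and both the column names (`theater_date`, `runtime`) and the value formats (`"Jul 25, 1971"`, `"104 minutes"`) are assumptions that should be checked against `df_movie_info.head()` before reuse:

```python
import pandas as pd

# Stand-in for df_movie_info; column names and value formats are assumed
info_demo = pd.DataFrame({
    "theater_date": ["Jul 25, 1971", "Aug 13, 2012", None],
    "runtime": ["104 minutes", "108 minutes", None],
})

# Parse date strings; missing or unparseable values become NaT
info_demo["theater_date"] = pd.to_datetime(info_demo["theater_date"], errors="coerce")
info_demo["theater_year"] = info_demo["theater_date"].dt.year
info_demo["theater_month"] = info_demo["theater_date"].dt.month

# Keep the leading digits of runtime; nullable Int64 preserves missing values
runtime_digits = info_demo["runtime"].str.extract(r"(\d+)", expand=False)
info_demo["runtime_min"] = pd.to_numeric(runtime_digits, errors="coerce").astype("Int64")

print(info_demo)
```

The nullable `Int64` dtype is used here because a plain `int64` column cannot hold missing values.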

In [None]:
# Inspecting rt.reviews file
# Loading rt.reviews and storing data into df_reviews

df_reviews = pd.read_csv('data/rt.reviews.tsv.gz', compression='gzip', sep='\t', encoding = 'unicode_escape')
df_reviews.head()


In [None]:
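# Inspect overall shape and info of the dataframe
# (print the shape explicitly: a bare expression is only displayed when it is the last line of a cell)
print(df_reviews.shape)
df_reviews.info()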

## notes
-  review null rows and either drop or replace
-  is every rating out of 5, and can a single common scale be used for all of them?
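
One way to approach the rating question is to convert each rating to a fraction of its own scale. The sketch below assumes ratings are strings of the form `"score/scale"` and treats anything else (letter grades, missing values) as NaN; the helper `rating_to_fraction` and the sample values are hypothetical:

```python
import numpy as np
import pandas as pd

def rating_to_fraction(value):
    """Convert a 'score/scale' string to a 0-1 fraction; NaN if not in that form."""
    if not isinstance(value, str) or "/" not in value:
        return np.nan
    score, scale = value.split("/", 1)
    try:
        score, scale = float(score), float(scale)
    except ValueError:
        return np.nan
    return score / scale if scale > 0 else np.nan

# Illustrative ratings on different scales, plus values that cannot be parsed
ratings_demo = pd.Series(["3/5", "8/10", "B+", None])
normalized = ratings_demo.apply(rating_to_fraction)
print(normalized)
```

Counting the NaN values in the result gives a quick measure of how many ratings do not fit the fraction format and need separate handling.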

In [None]:
# Inspecting tmdb.movies file
# Loading tmdb.movies and storing data into df_tmdb_movies

df_tmdb_movies = pd.read_csv('data/tmdb.movies.csv.gz', compression='gzip')
df_tmdb_movies.head()

In [None]:
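import ast

# genre_ids is stored as a string such as "[12, 14]"; parse it into a real
# list of ints (slicing and splitting on ',' leaves stray spaces and turns
# "[]" into [''])
df_tmdb_movies['genre_ids'] = df_tmdb_movies['genre_ids'].apply(ast.literal_eval)
df_tmdb_movies.head()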

In [None]:
type(df_tmdb_movies.genre_ids[0])

In [None]:
df_tmdb_movies.explode('genre_ids')


To get one row per genre id, use `explode`. It returns a new DataFrame, so assign the result if it needs to be kept.


In [None]:
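# Inspect overall shape and info of the dataframe
# (print the shape explicitly: a bare expression is only displayed when it is the last line of a cell)
print(df_tmdb_movies.shape)
df_tmdb_movies.info()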

## notes
-  `Unnamed: 0` column to drop, e.g. `df_tmdb_movies = df_tmdb_movies.drop(['Unnamed: 0'], axis=1)`
-  genre_ids: keep as a list, or explode into one row per genre?
-  release date: turn into datetime to create 2 columns for year and month?
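
The three notes above can be sketched together on a small stand-in frame. The values are illustrative, and the `release_date` column name and its `YYYY-MM-DD` format are assumptions to verify against `df_tmdb_movies.head()`:

```python
import ast
import pandas as pd

# Stand-in for df_tmdb_movies; values are illustrative
tmdb_demo = pd.DataFrame({
    "Unnamed: 0": [0, 1],
    "genre_ids": ["[12, 14]", "[]"],
    "release_date": ["2010-11-19", "2010-03-26"],
})

# Drop the leftover index column
tmdb_demo = tmdb_demo.drop(columns=["Unnamed: 0"])

# Year and month from the release date
tmdb_demo["release_date"] = pd.to_datetime(tmdb_demo["release_date"])
tmdb_demo["release_year"] = tmdb_demo["release_date"].dt.year
tmdb_demo["release_month"] = tmdb_demo["release_date"].dt.month

# Parse the list-like strings, then produce one row per genre id
# (a movie with an empty list keeps a single row with NaN as its genre)
tmdb_demo["genre_ids"] = tmdb_demo["genre_ids"].apply(ast.literal_eval)
tmdb_long = tmdb_demo.explode("genre_ids")

print(tmdb_long)
```

Keeping the exploded result in a separate frame (`tmdb_long` here) avoids duplicating movies in the main dataframe, where one row per title is easier to join against other sources.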

In [None]:
# Inspecting tn.movie_budgets file
# Loading tn.movie_budgets and storing data into df_movie_budgets

df_movie_budgets = pd.read_csv('data/tn.movie_budgets.csv.gz', compression='gzip')

df_movie_budgets.head()



In [None]:
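# Inspect overall shape and info of the dataframe
# (print the shape explicitly: a bare expression is only displayed when it is the last line of a cell)
print(df_movie_budgets.shape)
df_movie_budgets.info()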

## notes
-  convert production_budget, domestic_gross and worldwide_gross to integers
-  strip the $ and , characters from these columns first
-  again, dates as datetime, with year and month columns?
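
A minimal sketch of the currency cleanup described in these notes, run on a stand-in frame. The three money column names come from the notes above; the sample values, the `release_date` column name and its format are assumptions for illustration:

```python
import pandas as pd

# Stand-in for df_movie_budgets; values are made up for illustration
budgets_demo = pd.DataFrame({
    "release_date": ["Dec 18, 2009"],
    "production_budget": ["$1,500,000"],
    "domestic_gross": ["$2,000,000"],
    "worldwide_gross": ["$5,250,000"],
})

# Strip $ and , from each money column, then convert to integers
money_cols = ["production_budget", "domestic_gross", "worldwide_gross"]
for col in money_cols:
    budgets_demo[col] = (
        budgets_demo[col].str.replace(r"[$,]", "", regex=True).astype("int64")
    )

# Parse the release date and derive the year and month
budgets_demo["release_date"] = pd.to_datetime(budgets_demo["release_date"])
budgets_demo["release_year"] = budgets_demo["release_date"].dt.year
budgets_demo["release_month"] = budgets_demo["release_date"].dt.month

print(budgets_demo.dtypes)
```

The 64-bit integer type is chosen deliberately: worldwide grosses above roughly 2.1 billion would overflow a 32-bit integer.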

## Data Preparation

Describe and justify the process for preparing the data for analysis.

***
Questions to consider:
* Were there variables you dropped or created?
* How did you address missing values or outliers?
* Why are these choices appropriate given the data and the business problem?
***

In [None]:
# Here you run your code to clean the data


## Data Modeling
Describe and justify the process for analyzing or modeling the data.

***
Questions to consider:
* How did you analyze or model the data?
* How did you iterate on your initial approach to make it better?
* Why are these choices appropriate given the data and the business problem?
***

In [None]:
# Drop the column Unnamed: 0 with axis=1
# Example heroes_df = heroes_df.drop(['Unnamed: 0'], axis=1)
# heroes_df.head()

# tn.movie_budgets: strip $ and , then convert the object columns to numbers

In [None]:
# Here you run your code to model the data


## Evaluation
Evaluate how well your work solves the stated business problem.

***
Questions to consider:
* How do you interpret the results?
* How well does your model fit your data? How much better is this than your baseline model?
* How confident are you that your results would generalize beyond the data you have?
* How confident are you that this model would benefit the business if put into use?
***

## Conclusions
Provide your conclusions about the work you've done, including any limitations or next steps.

***
Questions to consider:
* What would you recommend the business do as a result of this work?
* What are some reasons why your analysis might not fully solve the business problem?
* What else could you do in the future to improve this project?
***

In [None]:
# Closing connection
# conn.close()