# Project 1 Flatiron

### Chuck Nadel



## Importing Packages

In [3]:
import pandas as pd
import sqlite3
import plotly.express as px

## Loading the Datasets

In [4]:
# The Movie DB
theMovieDF = pd.read_csv('zippedData/tmdb.movies.csv.gz')

# The Numbers DF
numsDF = pd.read_csv('zippedData/tn.movie_budgets.csv.gz')

# IMDB SQL database: connect, then list the tables it contains
conn = sqlite3.connect('zippedData/im.db')
imdbDF = pd.read_sql('''
        SELECT *
        FROM sqlite_master
        WHERE type='table'
        ''', conn)
# IMDB Basics and Ratings Information, merged for ease of analysis
basics_ratingsDF = pd.read_sql('''
        SELECT * FROM movie_basics
        INNER JOIN movie_ratings ON movie_basics.movie_id=movie_ratings.movie_id
        ''', conn)

## Cleaning the Data

#### First, let's look at the information provided by each imported dataset

In [5]:
#The Movie Database
theMovieDF.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26517 entries, 0 to 26516
Data columns (total 10 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Unnamed: 0         26517 non-null  int64  
 1   genre_ids          26517 non-null  object 
 2   id                 26517 non-null  int64  
 3   original_language  26517 non-null  object 
 4   original_title     26517 non-null  object 
 5   popularity         26517 non-null  float64
 6   release_date       26517 non-null  object 
 7   title              26517 non-null  object 
 8   vote_average       26517 non-null  float64
 9   vote_count         26517 non-null  int64  
dtypes: float64(2), int64(3), object(5)
memory usage: 2.0+ MB


The Movie Dataframe seems to have zero null values in any columns, and the datatypes all make sense, making our job much easier!

In [6]:
#IMDB Dataframe
basics_ratingsDF.head()

| | movie_id | primary_title | original_title | start_year | runtime_minutes | genres | movie_id.1 | averagerating | numvotes |
|---|---|---|---|---|---|---|---|---|---|
| 0 | tt0063540 | Sunghursh | Sunghursh | 2013 | 175.0 | Action,Crime,Drama | tt0063540 | 7.0 | 77 |
| 1 | tt0066787 | One Day Before the Rainy Season | Ashad Ka Ek Din | 2019 | 114.0 | Biography,Drama | tt0066787 | 7.2 | 43 |
| 2 | tt0069049 | The Other Side of the Wind | The Other Side of the Wind | 2018 | 122.0 | Drama | tt0069049 | 6.9 | 4517 |
| 3 | tt0069204 | Sabse Bada Sukh | Sabse Bada Sukh | 2018 | NaN | Comedy,Drama | tt0069204 | 6.1 | 13 |
| 4 | tt0100275 | The Wandering Soap Opera | La Telenovela Errante | 2017 | 80.0 | Comedy,Drama,Fantasy | tt0100275 | 6.5 | 119 |


Our IMDB dataframe is mostly okay, but about 7,000 movies are missing a runtime and about 800 are missing a genre.
In addition, the movie_id column appears twice because the join selects it from both tables.
We will fill the missing runtime values with the median runtime.
Since the number of movies missing a genre is relatively small, we will simply drop those rows.
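
One way to avoid the duplicated movie_id column is to join with `USING`, which in SQLite keeps a single copy of the shared column. The sketch below is only an illustration on an in-memory database: the table and column names follow the notebook, but the rows and the `demo_` names are made up.

```python
import sqlite3
import pandas as pd

# Toy stand-ins for movie_basics and movie_ratings (rows are made up)
demo_conn = sqlite3.connect(':memory:')
demo_conn.executescript('''
    CREATE TABLE movie_basics (movie_id TEXT, primary_title TEXT,
                               runtime_minutes REAL, genres TEXT);
    CREATE TABLE movie_ratings (movie_id TEXT, averagerating REAL,
                                numvotes INTEGER);
    INSERT INTO movie_basics VALUES
        ('tt1', 'Movie A', 100.0, 'Drama'),
        ('tt2', 'Movie B', NULL, 'Comedy');
    INSERT INTO movie_ratings VALUES
        ('tt1', 7.0, 77),
        ('tt2', 6.1, 13);
''')

# USING (movie_id) joins on the shared column and returns it only once
demo_df = pd.read_sql('''
        SELECT *
        FROM movie_basics
        INNER JOIN movie_ratings USING (movie_id)
        ''', demo_conn)

# Per-column count of missing values, the same check used later in the notebook
demo_missing = demo_df.isnull().sum()
```

Here `demo_missing` reports one missing runtime, and `demo_df` has a single movie_id column.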

In [7]:
# The Numbers DF
numsDF.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5782 entries, 0 to 5781
Data columns (total 6 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   id                 5782 non-null   int64 
 1   release_date       5782 non-null   object
 2   movie              5782 non-null   object
 3   production_budget  5782 non-null   object
 4   domestic_gross     5782 non-null   object
 5   worldwide_gross    5782 non-null   object
dtypes: int64(1), object(5)
memory usage: 271.2+ KB


At first glance, all seems well here. However, this dataset stores the production_budget, domestic_gross, and worldwide_gross columns as objects (strings) rather than integers.
To address this, we will use Python string methods to strip the dollar signs and commas from these columns so that the values can be converted to integers for further analysis.

#### Now, let's clean the Basics and Ratings Dataframe and the Numbers Dataframe, based on what we found above.

Basics and Ratings Dataframe:

In [8]:
# Dropping Rows without genre
basics_ratingsDF.dropna(subset = ['genres'], axis=0, inplace=True)
# Finding Median Runtime
medianRuntime = basics_ratingsDF['runtime_minutes'].median()
# Replace null values in runtime_minutes (and only that column) with the median value
basics_ratingsDF['runtime_minutes'] = basics_ratingsDF['runtime_minutes'].fillna(medianRuntime)

In [9]:
basics_ratingsDF.isnull().sum()

movie_id           0
primary_title      0
original_title     0
start_year         0
runtime_minutes    0
genres             0
movie_id           0
averagerating      0
numvotes           0
dtype: int64

The Numbers Dataframe:

In [10]:
# Remove $ Sign, commas from production_budget, domestic_gross, and worldwide_gross columns
columnstofix = ['production_budget', 'domestic_gross','worldwide_gross']
for column in columnstofix:
    numsDF[column] = numsDF[column].apply(lambda x:x.replace('$',''))
    numsDF[column] = numsDF[column].apply(lambda x:x.replace(',',''))

In [11]:
# Convert cleaned columns to int data type
for column in columnstofix:
    numsDF[column] = pd.to_numeric(numsDF[column])
# Create a new 'year' column from the last four characters of release_date
numsDF['year'] = numsDF['release_date'].str[-4:]
# Convert 'year' from string to integer
numsDF['year'] = pd.to_numeric(numsDF['year'], downcast='integer')
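
As an alternative to slicing the last four characters, the release dates could be parsed into datetime objects and the year read from those. This is a minimal sketch, assuming every date follows the "Dec 18, 2009" pattern seen in release_date; `sample_dates` is a hypothetical stand-in for that column.

```python
import pandas as pd

# Hypothetical sample in the same "Mon D, YYYY" format as release_date
sample_dates = pd.Series(['Dec 18, 2009', 'May 20, 2011', 'Jun 7, 2019'])

# Parse the strings into datetime values using an explicit format
parsed_dates = pd.to_datetime(sample_dates, format='%b %d, %Y')

# The year is then available as an integer through the .dt accessor
sample_years = parsed_dates.dt.year
```

Parsing the full date also keeps the month and day available, which the string slice discards.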


Create a new column for worldwide profit (worldwide gross minus production budget)

In [12]:
numsDF['Worldwide_Profit'] = numsDF['worldwide_gross']-numsDF['production_budget']
numsDF.head()

| | id | release_date | movie | production_budget | domestic_gross | worldwide_gross | year | Worldwide_Profit |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Dec 18, 2009 | Avatar | 425000000 | 760507625 | 2776345279 | 2009 | 2351345279 |
| 1 | 2 | May 20, 2011 | Pirates of the Caribbean: On Stranger Tides | 410600000 | 241063875 | 1045663875 | 2011 | 635063875 |
| 2 | 3 | Jun 7, 2019 | Dark Phoenix | 350000000 | 42762350 | 149762350 | 2019 | -200237650 |
| 3 | 4 | May 1, 2015 | Avengers: Age of Ultron | 330600000 | 459005868 | 1403013963 | 2015 | 1072413963 |
| 4 | 5 | Dec 15, 2017 | Star Wars Ep. VIII: The Last Jedi | 317000000 | 620181382 | 1316721747 | 2017 | 999721747 |


Correlation between number of IMDB Reviews, Box Office Profit

### Safest (profit-wise) Movies by Genre

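Let's look at the percentage of movies in the green, black, and red for each genre, where profit is measured as a percentage of the production budget.

Green = 60%+ Profit

Black = 0-60% Profit (break-even up to, but not including, 60%)

Red = Lost Money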
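
To make the thresholds concrete, here is a small worked sketch of the classification. The helper `classify_return` is a hypothetical name used only for illustration; the Avatar and Dark Phoenix figures come from the numsDF table above, and the third case is a made-up movie.

```python
def classify_return(production_budget, worldwide_gross):
    """Label a movie Green, Black, or Red from its percent return on budget."""
    pct_return = (worldwide_gross - production_budget) / production_budget * 100
    if pct_return >= 60:
        return 'Green'
    if pct_return < 0:
        return 'Red'
    return 'Black'

# Avatar: roughly a 553% return on its budget
avatar_label = classify_return(425000000, 2776345279)

# Dark Phoenix: roughly a -57% return, so it lost money
dark_phoenix_label = classify_return(350000000, 149762350)

# Made-up movie: 100M budget, 130M gross, a 30% return
made_up_label = classify_return(100000000, 130000000)
```

With these inputs, Avatar is labeled Green, Dark Phoenix Red, and the made-up movie Black. A movie that exactly breaks even (0% return) falls in Black.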

In [23]:
# First, let's create a new numerical column in numsDF, % Return (profit / budget * 100)
numsDF['% Return'] = numsDF['Worldwide_Profit']/numsDF['production_budget']*100
# Now, let's categorize these results in a new column, simple_success, based on the parameters above
numsDF['simple_success'] = numsDF['% Return'].apply(lambda x: 'Green' if x >= 60 else 'Red' if x < 0 else 'Black')
# Finally, we need to merge this dataframe with the genre series in the IMDB dataframe
numsimdbDF = basics_ratingsDF.merge(numsDF, how = 'inner', left_on = ['primary_title', 'start_year'], right_on = ['movie', 'year'])
# Since multiple genres are listed together in one string, we want a dataframe with only one genre per row.
# We will do this by splitting the genre entries, then building one record per (movie, genre) pair
# Split by genres
numsimdbDF['genres'] = numsimdbDF['genres'].str.split(',')

# One record per (movie, genre) pair, carrying the movie's profit category
genre_list = [{'Profit?':row.simple_success,'genres':g} for idx, row in numsimdbDF.iterrows() for g in row.genres ]
# Convert it to dataframe
profit_genreDF = pd.DataFrame(genre_list)
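
The same one-genre-per-row reshaping can also be sketched with pandas `explode`, which avoids looping over rows. This is an alternative illustration rather than the approach used in the cell above; the `demo` frame and its two rows are made up, with column names borrowed from the notebook.

```python
import pandas as pd

# Made-up frame with the same two columns the notebook relies on
demo = pd.DataFrame({
    'simple_success': ['Green', 'Red'],
    'genres': ['Action,Crime,Drama', 'Comedy'],
})

# Split the comma-separated string into a list of genres
demo['genres'] = demo['genres'].str.split(',')

# explode() emits one row per list element, repeating the other columns
demo_long = (
    demo.explode('genres')
        .rename(columns={'simple_success': 'Profit?'})
        .reset_index(drop=True)
)
```

The first movie contributes three rows (one per genre) and the second contributes one, so `demo_long` has four rows in total.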


In [24]:


# Count how many (movie, genre) entries fall into each profit category.
# A movie with several genres is counted once per genre.
profit_genreDF['Profit?'].value_counts().sort_values(ascending=False)



Green    2301
Red      1081
Black     427
Name: Profit?, dtype: int64

In [27]:
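# Count the entries in each (genre, profit category) pair
df_stacked = profit_genreDF.groupby(['genres', 'Profit?']).size().reset_index(name='Frank')
# Convert counts to each category's share within its genre.
# transform() keeps the original index, so the result aligns row by row.
df_stacked['Frank'] = df_stacked.groupby('genres')['Frank'].transform(lambda x: x/x.sum())
# Keep the 'Green' rows separately for the second chart
df_green = df_stacked[df_stacked['Profit?'] == 'Green']
# Stacked bar chart of profit-category shares per genre
px.bar(df_stacked, x = 'genres', y = 'Frank', color = 'Profit?', barmode = 'stack', color_discrete_map = {'Black': 'black', 'Green':'seagreen','Red':'darkred'})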

In [28]:
px.bar(df_green, x = 'genres', y = 'Frank', color = 'Profit?', barmode = 'stack', color_discrete_map = {'Green':'seagreen'})
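
The per-genre shares behind these charts can be cross-checked with `pd.crosstab`, which normalizes each row to sum to 1 when `normalize='index'` is passed. This is a minimal sketch on made-up data; `demo_long` is a hypothetical stand-in for profit_genreDF.

```python
import pandas as pd

# Made-up long-format data: one row per (movie, genre) pair
demo_long = pd.DataFrame({
    'genres': ['Action', 'Action', 'Action', 'Action', 'Drama', 'Drama'],
    'Profit?': ['Green', 'Green', 'Green', 'Red', 'Green', 'Black'],
})

# Rows are genres, columns are profit categories, values are within-genre shares
shares = pd.crosstab(demo_long['genres'], demo_long['Profit?'], normalize='index')

# Share of Action entries that are Green (3 of 4)
action_green_share = shares.loc['Action', 'Green']
```

Because every row of `shares` sums to 1, it is a quick sanity check that the stacked bars should each reach a height of 1.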