## Data cleaning

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [2]:
# plt.style.available
# Note: Matplotlib 3.6+ renamed this style to 'seaborn-v0_8-darkgrid'
plt.style.use('seaborn-darkgrid')

In [3]:
# Reading in compressed csv

movie_gross = pd.read_csv('./zippedData/bom.movie_gross.csv.gz')
tn_budgets = pd.read_csv('./zippedData/tn.movie_budgets.csv.gz')

In [38]:
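# Reading in .dat files
# '::' is a multi-character separator, so request the Python parser engine explicitly

dataworld_movies = pd.read_csv('movies.dat', sep='::', engine='python', names=['movie_id', 'name/year', 'genre'])
dataworld_reviews = pd.read_csv('ratings.dat', sep='::', engine='python', names=['user_id', 'movie_id', 'rating', 'rating_timestamp'])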
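
As a small illustration of the `::` parsing step, the sketch below reads two rows from an in-memory string instead of `movies.dat`. The row layout (`movie_id::name (year)::genres`) is an assumption inferred from the `head()` output later in this notebook, and `raw` and `sample_movies` are hypothetical names used only here.

```python
import io
import pandas as pd

# Two illustrative rows in the assumed movies.dat layout
raw = io.StringIO(
    "8::Edison Kinetoscopic Record of a Sneeze (1894)::Documentary|Short\n"
    "131::Une nuit terrible (1896)::Short|Comedy|Horror\n"
)

# Multi-character separators are handled by the Python engine
sample_movies = pd.read_csv(raw, sep='::', engine='python',
                            names=['movie_id', 'name/year', 'genre'])
print(sample_movies)
```

Passing `engine='python'` explicitly avoids the fallback warning pandas emits when the default C engine cannot handle the separator.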


## Movie gross

### Sales null values:
- Summary: there are 1,350 null values in foreign_gross and 28 in domestic_gross (out of 3,387 rows, per the info() output)
- Approach: replace null values with 0
- Rationale: my assumption is that a null value means the film had no sales in that market (no foreign box office sales, for example). Therefore, replacing these values with 0 is an accurate representation. Also, we want to be able to sum across columns and perform other operations, which null values may hinder.

### Studio null values:
- I decided to drop these rows. There are only 5, and they are either very low-revenue domestic films or foreign-only films, which are out of scope for this project. (It would be great for a film to have foreign income, but I'm assuming that producing foreign-only films is out of scope for Microsoft.)

In [5]:
movie_gross.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3387 entries, 0 to 3386
Data columns (total 5 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   title           3387 non-null   object 
 1   studio          3382 non-null   object 
 2   domestic_gross  3359 non-null   float64
 3   foreign_gross   2037 non-null   object 
 4   year            3387 non-null   int64  
dtypes: float64(1), int64(1), object(3)
memory usage: 132.4+ KB


In [6]:
studio_na = movie_gross[movie_gross['studio'].isna()]

In [7]:
studio_na

Unnamed: 0,title,studio,domestic_gross,foreign_gross,year
210,Outside the Law (Hors-la-loi),,96900.0,3300000.0,2010
555,Fireflies in the Garden,,70600.0,3300000.0,2011
933,Keith Lemon: The Film,,,4000000.0,2012
1862,Plot for Peace,,7100.0,,2014
2825,Secret Superstar,,,122000000.0,2017


In [8]:
# Drop films that have a NaN studio

movie_gross_clean = movie_gross.dropna(subset=['studio'])

In [9]:
# Replace NaN in revenue with 0

to_replace = {'domestic_gross': 0, 'foreign_gross': 0}

movie_gross_clean = movie_gross_clean.fillna(value=to_replace)

In [10]:
movie_gross_clean.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3382 entries, 0 to 3386
Data columns (total 5 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   title           3382 non-null   object 
 1   studio          3382 non-null   object 
 2   domestic_gross  3382 non-null   float64
 3   foreign_gross   3382 non-null   object 
 4   year            3382 non-null   int64  
dtypes: float64(1), int64(1), object(3)
memory usage: 158.5+ KB


## The Numbers (TN) budgets

### Data Cleaning Summary

- Converted all budget info (revenue, costs) to integers
- Calculated ROI in both percent and dollars. ROI % is (worldwide gross - budget) / budget; ROI $ is worldwide gross - budget
- Split release_date into separate month and year columns
- Filtered for only movies released in 1990 or later. Rationale is that we are trying to make relevant recommendations to a newly launched studio. Trends prior to 1990 probably don't have as much relevance. There's an argument to be made that we should be filtering on even more recent data.
- Removed movies with 0 worldwide_gross. I believe these are a combination of (1) data errors (movies confirmed as released with revenue have 0 in this column) and (2) recently made movies that haven't been released yet.
- Adjusted worldwide gross, domestic gross, and production budget for inflation to 2021 dollars (after confirming via Google searches that the figures weren't already adjusted).

Source for the inflation multipliers: https://www.usinflationcalculator.com


In [11]:
tn_budgets.head()

Unnamed: 0,id,release_date,movie,production_budget,domestic_gross,worldwide_gross
0,1,"Dec 18, 2009",Avatar,"$425,000,000","$760,507,625","$2,776,345,279"
1,2,"May 20, 2011",Pirates of the Caribbean: On Stranger Tides,"$410,600,000","$241,063,875","$1,045,663,875"
2,3,"Jun 7, 2019",Dark Phoenix,"$350,000,000","$42,762,350","$149,762,350"
3,4,"May 1, 2015",Avengers: Age of Ultron,"$330,600,000","$459,005,868","$1,403,013,963"
4,5,"Dec 15, 2017",Star Wars Ep. VIII: The Last Jedi,"$317,000,000","$620,181,382","$1,316,721,747"


In [12]:
tn_budgets.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5782 entries, 0 to 5781
Data columns (total 6 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   id                 5782 non-null   int64 
 1   release_date       5782 non-null   object
 2   movie              5782 non-null   object
 3   production_budget  5782 non-null   object
 4   domestic_gross     5782 non-null   object
 5   worldwide_gross    5782 non-null   object
dtypes: int64(1), object(5)
memory usage: 271.2+ KB


In [13]:
# Convert currency strings such as "$425,000,000" to int

rev_cols = ['production_budget', 'domestic_gross', 'worldwide_gross']

# Strip '$' and ',' from each column, then cast to int64
tn_budgets[rev_cols] = tn_budgets[rev_cols].apply(
    lambda col: col.str.replace(r'[$,]', '', regex=True).astype('int64'))

In [14]:
# Calculate ROI ($ and %)

tn_budgets['ROI %'] = (tn_budgets['worldwide_gross'] - tn_budgets['production_budget']) / tn_budgets['production_budget']
tn_budgets['ROI $'] = tn_budgets['worldwide_gross'] - tn_budgets['production_budget']
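
As a quick sanity check of the ROI formula, here is a worked example using the Avatar row from the `tn_budgets.head()` output above. The variable names are illustrative only.

```python
# Values copied from the Avatar row of tn_budgets.head()
production_budget = 425_000_000
worldwide_gross = 2_776_345_279

# Same arithmetic as the 'ROI $' and 'ROI %' columns
roi_dollars = worldwide_gross - production_budget
roi_ratio = roi_dollars / production_budget

print(roi_dollars, round(roi_ratio, 2))
```

Note that the `ROI %` column holds a ratio rather than a value multiplied by 100, so a value of about 5.53 corresponds to a return of roughly 553%.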

In [15]:
# Parse release date for year / months

tn_budgets['year'] = tn_budgets['release_date'].apply(lambda x: int(x[-4:]))
tn_budgets['month'] = tn_budgets['release_date'].apply(lambda x: x[:3])

# 'May' is omitted because its abbreviation is already the full month name
month_full = {'Jan': 'January', 'Feb': 'February', 'Mar': 'March', 'Apr': 'April', 'Jun': 'June',
             'Jul': 'July', 'Aug': 'August', 'Sep': 'September', 'Oct': 'October', 'Nov': 'November',
              'Dec': 'December'}

tn_budgets['month'] = tn_budgets['month'].replace(month_full)
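
An alternative to slicing the date string is to let pandas parse it. The sketch below assumes every `release_date` follows the "Mon D, YYYY" pattern seen in the `head()` output; `dates`, `parsed`, `years`, and `months` are hypothetical names used only for this example.

```python
import pandas as pd

# Sample release dates copied from tn_budgets.head()
dates = pd.Series(['Dec 18, 2009', 'May 20, 2011', 'Jun 7, 2019'])

# Parse with an explicit format, then pull out the year and full month name
parsed = pd.to_datetime(dates, format='%b %d, %Y')
years = parsed.dt.year
months = parsed.dt.month_name()

print(years.tolist(), months.tolist())
```

This removes the need for a manual abbreviation-to-name dictionary and raises an error if a date does not match the expected format.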

In [16]:
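# Filter for only recent releases (1990 onward)
# .copy() makes this an independent dataframe, so adding columns later doesn't trigger SettingWithCopyWarning

tn_budgets_recent = tn_budgets[tn_budgets['year'] >= 1990].copy()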

In [17]:
# Filter out movies with 0 worldwide_gross

tn_budgets_recent = tn_budgets_recent[tn_budgets_recent['worldwide_gross'] != 0]

In [18]:
# Adjust revenue #'s for inflation

years = list(range(1990,2022))

inflation = [2.09, 2.00, 1.95, 1.89, 1.84, 1.79, 1.74, 1.70, 1.67, 1.64, 1.59, 1.54, 1.52, 1.48, 1.45, 1.40, 1.35, 
1.32, 1.27, 1.27, 1.25, 1.21, 1.19, 1.17, 1.15, 1.15, 1.14, 1.11, 1.09, 1.07, 1.05, 1.00]

inflation_dict = dict(zip(years,inflation))

In [19]:
tn_budgets_recent['inflation'] = tn_budgets_recent['year'].apply(lambda x: inflation_dict[x])

In [20]:
# Convert all $ to 2021

tn_budgets_recent['worldwide_gross_inf'] = tn_budgets_recent.loc[:,'worldwide_gross'] * tn_budgets_recent.loc[:,'inflation']
tn_budgets_recent['domestic_gross_inf'] = tn_budgets_recent.loc[:,'domestic_gross'] * tn_budgets_recent.loc[:,'inflation']
tn_budgets_recent['production_budget_inf'] = tn_budgets_recent.loc[:,'production_budget'] * tn_budgets_recent.loc[:,'inflation']
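
The adjustment above can be illustrated on a toy example. The two multipliers are copied from the `inflation` list (1990 and 2021); the gross figures and the `toy` dataframe are hypothetical.

```python
import pandas as pd

# Two multipliers copied from the inflation list above
inflation_dict = {1990: 2.09, 2021: 1.00}

# Hypothetical grosses, for illustration only
toy = pd.DataFrame({'year': [1990, 2021],
                    'worldwide_gross': [100_000_000, 100_000_000]})

# Look up each year's multiplier, then convert to 2021 dollars
toy['inflation'] = toy['year'].map(inflation_dict)
toy['worldwide_gross_inf'] = toy['worldwide_gross'] * toy['inflation']

print(toy)
```

Under these multipliers, $100M earned in 1990 is treated as roughly $209M in 2021 dollars. This sketch uses `Series.map`, which yields NaN for a year missing from the dictionary, whereas the `apply` with `inflation_dict[x]` used in the notebook would raise a `KeyError`.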

In [21]:
tn_budgets_recent['ROI $ Inf'] = tn_budgets_recent['worldwide_gross_inf'] - tn_budgets_recent['production_budget_inf']

In [22]:
tn_budgets_clean = tn_budgets_recent.copy()

## Dataworld movies / reviews

- Split the combined name/year column into separate name and year columns
- Separated genre into a list
- Joined the movies and reviews dataframes together
- Dropped the few NaN values from genre (only 71 out of 37.7K rows)
- For the reviews, grouped by movie ID and aggregated the average rating and rating count

In [25]:
dataworld_movies.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 37700 entries, 0 to 37699
Data columns (total 3 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   movie_id   37700 non-null  int64 
 1   name/year  37700 non-null  object
 2   genre      37629 non-null  object
dtypes: int64(1), object(2)
memory usage: 883.7+ KB


In [43]:
dataworld_movies.head(1)

Unnamed: 0,movie_id,name/year,genre
0,8,Edison Kinetoscopic Record of a Sneeze (1894),Documentary|Short


In [40]:
# Create a copy

dataworld_clean = dataworld_movies.copy()

In [42]:
# Separate out name and year into separate columns

dataworld_clean['year'] = dataworld_clean.loc[:,'name/year'].apply(lambda x: int(x[-5:-1]))
dataworld_clean['name'] = dataworld_clean.loc[:,'name/year'].apply(lambda x: x[0:-7])
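
The slicing above assumes every value ends in exactly " (YYYY)". A slightly more defensive sketch uses a regular expression so that rows that don't match come back as NaN instead of being silently mis-sliced. The sample titles are taken from the `head()` output; `titles` and `parts` are hypothetical names.

```python
import pandas as pd

titles = pd.DataFrame({'name/year': [
    'Edison Kinetoscopic Record of a Sneeze (1894)',
    'Une nuit terrible (1896)',
]})

# Capture everything before the trailing " (YYYY)" as name, and the 4 digits as year
parts = titles['name/year'].str.extract(r'^(?P<name>.*) \((?P<year>\d{4})\)$')

# Safe here because every sample row matched; unmatched rows would need handling first
parts['year'] = parts['year'].astype(int)

print(parts)
```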

In [66]:
# Drop NAs

dataworld_clean.dropna(subset=['genre'], inplace=True)

In [67]:
# Separate out the different genres

dataworld_clean['genre_list'] = dataworld_clean.loc[:,'genre'].apply(lambda x: x.split('|'))
dataworld_clean['genre_length'] = dataworld_clean.loc[:,'genre_list'].apply(lambda x: len(x))

In [69]:
# Add each genre to its own column (Genre1..Genre10); pad with NaN when a movie has fewer genres

for i in range(10):
    dataworld_clean[f"Genre{i + 1}"] = dataworld_clean['genre_list'].apply(lambda x: x[i] if len(x) > i else np.nan)
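
If the goal is to count or group by genre, a long format with one row per movie/genre pair can be easier to work with than ten wide columns. This is a sketch of that alternative, not the approach used in this notebook; `movies`, `long_genres`, and `genre_counts` are hypothetical names.

```python
import pandas as pd

# Two sample rows using genre strings from the head() output
movies = pd.DataFrame({'movie_id': [8, 131],
                       'genre': ['Documentary|Short', 'Short|Comedy|Horror']})

# Split the pipe-delimited string into a list, then emit one row per genre
long_genres = movies.assign(genre=movies['genre'].str.split('|')).explode('genre')

# Count how many movies carry each genre
genre_counts = long_genres['genre'].value_counts()

print(genre_counts)
```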

In [70]:
dataworld_clean.head()

Unnamed: 0,movie_id,name/year,genre,year,name,genre_list,genre_length,Genre1,Genre2,Genre3,Genre4,Genre5,Genre6,Genre7,Genre8,Genre9,Genre10
0,8,Edison Kinetoscopic Record of a Sneeze (1894),Documentary|Short,1894,Edison Kinetoscopic Record of a Sneeze,"[Documentary, Short]",2,Documentary,Short,,,,,,,,
1,10,La sortie des usines Lumière (1895),Documentary|Short,1895,La sortie des usines Lumière,"[Documentary, Short]",2,Documentary,Short,,,,,,,,
2,12,The Arrival of a Train (1896),Documentary|Short,1896,The Arrival of a Train,"[Documentary, Short]",2,Documentary,Short,,,,,,,,
4,91,Le manoir du diable (1896),Short|Horror,1896,Le manoir du diable,"[Short, Horror]",2,Short,Horror,,,,,,,,
5,131,Une nuit terrible (1896),Short|Comedy|Horror,1896,Une nuit terrible,"[Short, Comedy, Horror]",3,Short,Comedy,Horror,,,,,,,


In [26]:
dataworld_reviews.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 915370 entries, 0 to 915369
Data columns (total 4 columns):
 #   Column            Non-Null Count   Dtype
---  ------            --------------   -----
 0   user_id           915370 non-null  int64
 1   movie_id          915370 non-null  int64
 2   rating            915370 non-null  int64
 3   rating_timestamp  915370 non-null  int64
dtypes: int64(4)
memory usage: 27.9 MB


In [71]:
dataworld_reviews.head()

Unnamed: 0,user_id,movie_id,rating,rating_timestamp
0,1,114508,8,1381006850
1,2,499549,9,1376753198
2,2,1305591,8,1376742507
3,2,1428538,1,1371307089
4,3,75314,1,1595468524


In [81]:
reviews_clean = dataworld_reviews.copy()

In [83]:
# Duplicate the rating column so it can be aggregated twice (mean and count)
reviews_clean['rating2'] = reviews_clean.loc[:,'rating']

In [84]:
# Aggregate movies by avg review and review_count

new_cols = {'rating': 'avg_rating', 'rating2': 'rating_count'}
reviews_clean = reviews_clean.groupby('movie_id').agg({'rating':'mean', 'rating2':'count'}).rename(columns=new_cols)
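
The same per-movie summary can also be produced without a duplicated column by using pandas named aggregation. A minimal sketch on made-up ratings (the `ratings` and `summary` names are hypothetical):

```python
import pandas as pd

# Hypothetical ratings: two for one movie, one for another
ratings = pd.DataFrame({'movie_id': [114508, 114508, 499549],
                        'rating': [8, 6, 9]})

# Each keyword names an output column and the aggregation applied to 'rating'
summary = ratings.groupby('movie_id')['rating'].agg(avg_rating='mean',
                                                    rating_count='count')

print(summary)
```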

In [88]:
# Join the two dataframes on movie_id (a left join keeps movies that have no reviews)

dataworld_clean = dataworld_clean.merge(reviews_clean, on='movie_id', how='left')
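
Because this is a left join, movies without any reviews end up with NaN in `avg_rating` and `rating_count`. One way to check how many rows are affected is the `indicator` option of `merge`, sketched below on toy data (all values and the names `movies`, `reviews`, `merged`, and `unrated` are hypothetical):

```python
import pandas as pd

movies = pd.DataFrame({'movie_id': [8, 10], 'name': ['A', 'B']})

# Review summary indexed by movie_id, like the groupby result above
reviews = pd.DataFrame({'avg_rating': [5.5], 'rating_count': [12]},
                       index=pd.Index([8], name='movie_id'))

# indicator=True adds a '_merge' column showing where each row came from
merged = movies.merge(reviews, on='movie_id', how='left', indicator=True)
unrated = (merged['_merge'] == 'left_only').sum()

print(merged)
print('movies without reviews:', unrated)
```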

In [90]:
# Keep only movies from 1990 onward, matching the filter applied to the TN budgets data
dataworld_clean = dataworld_clean[dataworld_clean['year'] >= 1990]

In [92]:
# Final datasets to export to CSV

movie_gross_clean.to_csv('movies_gross.csv')
tn_budgets_clean.to_csv('theNumbers.csv')
dataworld_clean.to_csv('DataWorld_reviews.csv')