### **Project 3 - Part 4: Hypothesis Testing**

#### **Author: Pieter Slabber**

#### **Part 4:**

For part 4 of the project, you will be using your MySQL database from part 3 to answer meaningful questions for your stakeholder. They want you to use your hypothesis testing and statistics knowledge to answer 3 questions about what makes a successful movie.

#### **Questions to Answer:**

- The stakeholder's first question is: does the MPAA rating of a movie (G/PG/PG-13/R) affect how much revenue the movie generates?
   - If so, what was the p-value of your analysis?
   - And which rating earns the most revenue?
   - They want you to prepare a visualization that supports your finding.
- It is then up to you to think of 2 additional hypotheses to test that your stakeholder may want to know.
- Some example hypotheses you could test:
   - Do movies that are over 2.5 hours long earn more revenue than movies that are 1.5 hours long (or less)?
   - Do movies released in 2020 earn less revenue than movies released in 2018?
      - How do the years compare for movie ratings?
   - Do some movie genres earn more revenue than others?
   - Are some genres higher rated than others?
   - etc.

#### **Specifications**

#### Your Data

- A critical first step for this assignment will be to retrieve additional movie data to add to your SQL database.
  - You will want to use the TMDB API again and extract data for additional years
  - You may want to review the optional lesson from Week 1 on "Using Glob to Load Many Files" to load and combine all of your API results for each year.
- However, trying to extract the TMDB data for all movies from 2000-2022 could take >24 hours!
- To address this issue, you should EITHER:
   - Define a smaller (but logical) period of time to use for your analyses (e.g., last 10 years, 2010-2019 (pre-pandemic, etc).
   - OR coordinate with cohort-mates and divide the API calls so that you can all download the data for a smaller number of years and then share your downloaded JSON data.

#### **Deliverables**

- You should use the same project repository you have been using for Parts 1-3 (for your portfolio)
  - Create a new notebook in your project repository just for the hypothesis testing (like "Part 4 - Hypothesis Testing.ipynb")
  - Make sure the results and visualization for all 3 hypotheses are in your notebook.



#### **Imports**

In [21]:
# Imports
import json
import pandas as pd
import numpy as np
import tmdbsimple as tmdb
from tqdm.notebook import tqdm_notebook
import seaborn as sns
from scipy import stats
import scipy
import os, time, json
FOLDER = "C:/Users/Shaun/Documents/GitHub/Data_Enrichment/Data"
os.makedirs(FOLDER, exist_ok=True)
os.listdir(FOLDER)

import pymysql
pymysql.install_as_MySQLdb()

from sqlalchemy import create_engine
from sqlalchemy_utils import create_database, database_exists

In [22]:
basics_data="C:/Users/Shaun/Documents/GitHub/Data_Enrichment/Data/title.basics.tsv.gz"

In [23]:
basics = pd.read_csv(basics_data, sep='\t', low_memory=False)
basics.head()

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres
0,tt0000001,short,Carmencita,Carmencita,0,1894,\N,1,"Documentary,Short"
1,tt0000002,short,Le clown et ses chiens,Le clown et ses chiens,0,1892,\N,5,"Animation,Short"
2,tt0000003,short,Pauvre Pierrot,Pauvre Pierrot,0,1892,\N,4,"Animation,Comedy,Romance"
3,tt0000004,short,Un bon bock,Un bon bock,0,1892,\N,12,"Animation,Short"
4,tt0000005,short,Blacksmith Scene,Blacksmith Scene,0,1893,\N,1,"Comedy,Short"


In [24]:
df = basics

In [25]:
df.head()

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres
0,tt0000001,short,Carmencita,Carmencita,0,1894,\N,1,"Documentary,Short"
1,tt0000002,short,Le clown et ses chiens,Le clown et ses chiens,0,1892,\N,5,"Animation,Short"
2,tt0000003,short,Pauvre Pierrot,Pauvre Pierrot,0,1892,\N,4,"Animation,Comedy,Romance"
3,tt0000004,short,Un bon bock,Un bon bock,0,1892,\N,12,"Animation,Short"
4,tt0000005,short,Blacksmith Scene,Blacksmith Scene,0,1893,\N,1,"Comedy,Short"


In [26]:
# Reduce memory usage
def reduce_mem_usage(df, verbose=True):
    numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
    start_mem = df.memory_usage(deep=True).sum() / 1024**2    
    for col in df.columns:
        col_type = df[col].dtypes
        if col_type in numerics:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':  # for integers
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)  
            else:  # for floats.
                if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)    
    end_mem = df.memory_usage(deep=True).sum() / 1024**2
    if verbose: print('Mem. usage decreased to {:5.2f} Mb ({:.1f}% reduction)'.format(end_mem, 100 * (start_mem - end_mem) / start_mem))
    return df

In [27]:
reduce_mem_usage(df)

Mem. usage decreased to 5832.01 Mb (0.0% reduction)


Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres
0,tt0000001,short,Carmencita,Carmencita,0,1894,\N,1,"Documentary,Short"
1,tt0000002,short,Le clown et ses chiens,Le clown et ses chiens,0,1892,\N,5,"Animation,Short"
2,tt0000003,short,Pauvre Pierrot,Pauvre Pierrot,0,1892,\N,4,"Animation,Comedy,Romance"
3,tt0000004,short,Un bon bock,Un bon bock,0,1892,\N,12,"Animation,Short"
4,tt0000005,short,Blacksmith Scene,Blacksmith Scene,0,1893,\N,1,"Comedy,Short"
...,...,...,...,...,...,...,...,...,...
10332731,tt9916848,tvEpisode,Episode #3.17,Episode #3.17,0,2009,\N,\N,"Action,Drama,Family"
10332732,tt9916850,tvEpisode,Episode #3.19,Episode #3.19,0,2010,\N,\N,"Action,Drama,Family"
10332733,tt9916852,tvEpisode,Episode #3.20,Episode #3.20,0,2010,\N,\N,"Action,Drama,Family"
10332734,tt9916856,short,The Wind,The Wind,0,2015,\N,27,Short


**Replace "\N" with np.nan**

In [28]:
# Replace \N with nan
df = df.replace({'\\N':np.nan})
df.head()

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres
0,tt0000001,short,Carmencita,Carmencita,0,1894,,1,"Documentary,Short"
1,tt0000002,short,Le clown et ses chiens,Le clown et ses chiens,0,1892,,5,"Animation,Short"
2,tt0000003,short,Pauvre Pierrot,Pauvre Pierrot,0,1892,,4,"Animation,Comedy,Romance"
3,tt0000004,short,Un bon bock,Un bon bock,0,1892,,12,"Animation,Short"
4,tt0000005,short,Blacksmith Scene,Blacksmith Scene,0,1893,,1,"Comedy,Short"


**Eliminate movies that are null for runtimeMinutes**

In [29]:
df = df.dropna(subset=['runtimeMinutes'])

In [30]:
if df['runtimeMinutes'].isnull().any():
    print("There are null values in the runtimeMinutes column.")
else:
    print("No null values found in the runtimeMinutes column.")

No null values found in the runtimeMinutes column.


**Eliminate movies that are null for genre**

In [31]:
df = df.dropna(subset=['genres'])

In [32]:
if df['genres'].isnull().any():
    print("There are null values in the genres column.")
else:
    print("No null values found in the genres column.")

No null values found in the genres column.


In [33]:
df = df[df['titleType'] == 'movie']
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 392714 entries, 8 to 10332686
Data columns (total 9 columns):
 #   Column          Non-Null Count   Dtype 
---  ------          --------------   ----- 
 0   tconst          392714 non-null  object
 1   titleType       392714 non-null  object
 2   primaryTitle    392714 non-null  object
 3   originalTitle   392714 non-null  object
 4   isAdult         392714 non-null  object
 5   startYear       386076 non-null  object
 6   endYear         0 non-null       object
 7   runtimeMinutes  392714 non-null  object
 8   genres          392714 non-null  object
dtypes: object(9)
memory usage: 30.0+ MB


**Keep startYear 2000-2022**

In [34]:
#Check all records with year 
df = df[df['startYear'].str.contains('2000|2001|2002|2003|2004|2005|2006|2007|2008|2009|2010|2011|2012|2012|2013|2014|2015|2016|2017|2018|2019|2020|2021|2022', na=False, regex=True, case=False)] 
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 225606 entries, 13081 to 10332686
Data columns (total 9 columns):
 #   Column          Non-Null Count   Dtype 
---  ------          --------------   ----- 
 0   tconst          225606 non-null  object
 1   titleType       225606 non-null  object
 2   primaryTitle    225606 non-null  object
 3   originalTitle   225606 non-null  object
 4   isAdult         225606 non-null  object
 5   startYear       225606 non-null  object
 6   endYear         0 non-null       object
 7   runtimeMinutes  225606 non-null  object
 8   genres          225606 non-null  object
dtypes: object(9)
memory usage: 17.2+ MB


**Eliminate movies that include "Documentary" in genre**

In [35]:
# Exclude movies that are included in the documentary category.
is_documentary = df['genres'].str.contains('documentary',case=False)
df = df[~is_documentary]
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 148871 entries, 34800 to 10332676
Data columns (total 9 columns):
 #   Column          Non-Null Count   Dtype 
---  ------          --------------   ----- 
 0   tconst          148871 non-null  object
 1   titleType       148871 non-null  object
 2   primaryTitle    148871 non-null  object
 3   originalTitle   148871 non-null  object
 4   isAdult         148871 non-null  object
 5   startYear       148871 non-null  object
 6   endYear         0 non-null       object
 7   runtimeMinutes  148871 non-null  object
 8   genres          148871 non-null  object
dtypes: object(9)
memory usage: 11.4+ MB


**Keep only US movies**

In [36]:
akas_data="C:/Users/Shaun/Documents/GitHub/Data_Enrichment/Data/title.akas.tsv.gz"

mylist = []

for chunk in  pd.read_csv(akas_data, sep='\t', chunksize=20000):
    mylist.append(chunk)

akas = pd.concat(mylist, axis= 0)
del mylist

In [37]:
# Filter the basics table down to only include the US by using the filter akas dataframe
keepers =df['tconst'].isin(akas['titleId'])
keepers

34800       True
61110       True
67662       True
80547       True
86789       True
            ... 
10332418    True
10332457    True
10332502    True
10332586    True
10332676    True
Name: tconst, Length: 148871, dtype: bool

In [38]:
df = df[keepers]
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 148165 entries, 34800 to 10332676
Data columns (total 9 columns):
 #   Column          Non-Null Count   Dtype 
---  ------          --------------   ----- 
 0   tconst          148165 non-null  object
 1   titleType       148165 non-null  object
 2   primaryTitle    148165 non-null  object
 3   originalTitle   148165 non-null  object
 4   isAdult         148165 non-null  object
 5   startYear       148165 non-null  object
 6   endYear         0 non-null       object
 7   runtimeMinutes  148165 non-null  object
 8   genres          148165 non-null  object
dtypes: object(9)
memory usage: 11.3+ MB


In [39]:
## Save current dataframe to file.
df.to_csv("C:/Users/Shaun/Documents/GitHub/Data_Enrichment/Data/title_basics.csv.gz",compression='gzip',index=False)

In [40]:
# Open saved file and preview again
df = pd.read_csv("C:/Users/Shaun/Documents/GitHub/Data_Enrichment/Data/title_basics.csv.gz", low_memory = False)
df.head()

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres
0,tt0035423,movie,Kate & Leopold,Kate & Leopold,0,2001,,118,"Comedy,Fantasy,Romance"
1,tt0062336,movie,The Tango of the Widower and Its Distorting Mi...,El tango del viudo y su espejo deformante,0,2020,,70,Drama
2,tt0069049,movie,The Other Side of the Wind,The Other Side of the Wind,0,2018,,122,Drama
3,tt0082328,movie,Embodiment of Evil,Encarnação do Demônio,0,2008,,94,Horror
4,tt0088751,movie,The Naked Monster,The Naked Monster,0,2005,,100,"Comedy,Horror,Sci-Fi"


### **Funtions**

In [41]:
def get_movie_with_rating(movie_id):
        movie = tmdb.Movies(movie_id)
        info = movie.info()
        
        release = movie.releases()
        for c in release['countries']:
            if c['iso_3166_1'] == 'US':
                info['certification'] = c['certification']

        return info

In [42]:
def write_json(new_data, filename): 
    """Appends a list of records (new_data) to a json file (filename). 
    Adapted from: https://www.geeksforgeeks.org/append-to-json-file-using-python/"""  
    
    with open(filename,'r+') as file:
        # First we load existing data into a dict.
        file_data = json.load(file)
        ## Choose extend or append
        if (type(new_data) == list) & (type(file_data) == list):
            file_data.extend(new_data)
        else:
             file_data.append(new_data)
        # Sets file's current position at offset.
        file.seek(0)
        # convert back to json.
        json.dump(file_data, file)

In [44]:
with open('C:/Users/Shaun/Desktop/.secret/tmdb_api.json') as f: #change the path to match YOUR path!!
    login = json.load(f)
login.keys()

dict_keys(['client-id', 'api-key'])

In [54]:
tmdb.API_KEY =  login['api-key']

In [55]:
tst = get_movie_with_rating("tt0332281") 
tst

{'adult': False,
 'backdrop_path': '/sQLAF4RGTgOBwscrHEKHD4Qnrfv.jpg',
 'belongs_to_collection': None,
 'budget': 0,
 'genres': [{'id': 35, 'name': 'Comedy'}, {'id': 18, 'name': 'Drama'}],
 'homepage': '',
 'id': 29079,
 'imdb_id': 'tt0332281',
 'original_language': 'en',
 'original_title': 'Nowhere to Go But Up',
 'overview': "Val is 23 years old and full of dreams. She travels to New York to become an actress. She is lonely in a strange country, in a strange city, with little money and no friends. In her path, she meets weird people who they, also, seek their dreams but everyday life gets in the way. Tired and hungry she sits on the corner of a building. Across the street a writer whose fantasy has dry out. In an instant she becomes his muse... At the Oscar's night she will be the one with the Golden Globe in her hands.",
 'popularity': 3.49,
 'poster_path': '/jbLxetPVJH03a8CZPpa0bw4E2tI.jpg',
 'production_companies': [{'id': 2813,
   'logo_path': None,
   'name': 'Forensic Films',
 

In [56]:
# Load in the dataframe from project part 1 as basics:
basics = pd.read_csv("C:/Users/Shaun/Documents/GitHub/Data_Enrichment/Data/title_basics.csv.gz")

In [57]:
basics['startYear'].value_counts()

2018    9744
2019    9489
2017    9480
2022    9385
2016    9040
2015    8561
2021    8463
2014    8100
2013    7841
2020    7696
2012    7352
2011    6803
2010    6411
2009    6019
2008    5262
2007    4649
2006    4425
2005    3939
2004    3578
2003    3266
2002    3016
2001    2884
2000    2762
Name: startYear, dtype: int64

In [58]:
basics.head(20)

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres
0,tt0035423,movie,Kate & Leopold,Kate & Leopold,0,2001,,118,"Comedy,Fantasy,Romance"
1,tt0062336,movie,The Tango of the Widower and Its Distorting Mi...,El tango del viudo y su espejo deformante,0,2020,,70,Drama
2,tt0069049,movie,The Other Side of the Wind,The Other Side of the Wind,0,2018,,122,Drama
3,tt0082328,movie,Embodiment of Evil,Encarnação do Demônio,0,2008,,94,Horror
4,tt0088751,movie,The Naked Monster,The Naked Monster,0,2005,,100,"Comedy,Horror,Sci-Fi"
5,tt0096056,movie,Crime and Punishment,Crime and Punishment,0,2002,,126,Drama
6,tt0096235,movie,Taxi Killer,Taxi Killer,0,2022,,106,"Action,Crime,Drama"
7,tt0100275,movie,The Wandering Soap Opera,La Telenovela Errante,0,2017,,80,"Comedy,Drama,Fantasy"
8,tt0102362,movie,Istota,Istota,0,2000,,80,"Drama,Romance"
9,tt0103340,movie,Life for Life: Maximilian Kolbe,Zycie za zycie. Maksymilian Kolbe,0,2006,,90,"Biography,Drama"


In [59]:
## Save current dataframe to file.
#df.to_csv("Data/title_basics.csv.gz",compression='gzip',index=False)

In [60]:
#YEARS_TO_GET = [2000,2001,2002,2003,2004,2005,2006,2007,2008,2009,2010,2011,2012,2013,2014,2015,2016,2017,2018,2019,2020,2021,2022]
YEARS_TO_GET = [2015,2016,2017,2018,2019,2020,2021,2022]

In [61]:
errors = [ ]

### **Outer / Inner Loop**

In [62]:
# Start of OUTER loop
for YEAR in tqdm_notebook(YEARS_TO_GET, desc='YEARS', position=0):
    # Defining the JSON file to store results for the year
    JSON_FILE = f'{FOLDER}tmdb_api_results_{YEAR}.json'
    
    # Check if the file exists
    file_exists = os.path.isfile(JSON_FILE)
    
    # If it does not exist: create it
    if not file_exists:  # Simplified condition
        # Save an empty dict with just "imdb_id" to the new json file.
        with open(JSON_FILE, 'w') as f:
            json.dump([{'imdb_id': 0}], f)

    # Saving the new year as the current df
    df = basics.loc[basics['startYear'] == YEAR].copy()
    # Saving movie ids to list
    movie_ids = df['tconst'].copy()

    # Load existing data from json into a dataframe called "previous_df"
    previous_df = pd.read_json(JSON_FILE)

    # Filter out any ids that are already in the JSON_FILE
    movie_ids_to_get = movie_ids[~movie_ids.isin(previous_df['imdb_id'])]

    # Get index and movie id from the list
    # INNER Loop
    for movie_id in tqdm_notebook(movie_ids_to_get,
                                  desc=f'Movies from {YEAR}',
                                  position=1,
                                  leave=True):
        try:
            # Retrieve the data for the movie id
            temp = get_movie_with_rating(movie_id)
            # Append/extend results to the existing file using a pre-made function
            write_json(temp, JSON_FILE)
            # Short 20 ms sleep to prevent overwhelming the server
            time.sleep(0.02)

        except Exception as e:
            errors.append([movie_id, e])

    final_year_df = pd.read_json(JSON_FILE)
    final_year_df.to_csv(f"{FOLDER}final_tmdb_data_{YEAR}.csv.gz", compression="gzip", index=False)

print(f"- Total errors: {len(errors)}")

YEARS:   0%|          | 0/8 [00:00<?, ?it/s]

Movies from 2015:   0%|          | 0/8561 [00:00<?, ?it/s]

Movies from 2016:   0%|          | 0/9040 [00:00<?, ?it/s]

KeyboardInterrupt: 

### **Load in your csv.gz's of results for each year extracted.**

### **Concatenate the data into 1 dataframe for the remainder of the analysis.**

In [70]:
movies_2010 = pd.read_csv('C:/Users/Shaun/Documents/GitHub/Data_Enrichment/Data/final_tmdb_data_2010.csv.gz')
movies_2011 = pd.read_csv('C:/Users/Shaun/Documents/GitHub/Data_Enrichment/Data/final_tmdb_data_2011.csv.gz')
movies_2012 = pd.read_csv('C:/Users/Shaun/Documents/GitHub/Data_Enrichment/Data/final_tmdb_data_2012.csv.gz')
movies_2013 = pd.read_csv('C:/Users/Shaun/Documents/GitHub/Data_Enrichment/Data/final_tmdb_data_2013.csv.gz')
movies_2014 = pd.read_csv('C:/Users/Shaun/Documents/GitHub/Data_Enrichment/Data/final_tmdb_data_2014.csv.gz')
movies_2015 = pd.read_csv('C:/Users/Shaun/Documents/GitHub/Data_Enrichment/Data/final_tmdb_data_2015.csv.gz')
movies_2016 = pd.read_csv('C:/Users/Shaun/Documents/GitHub/Data_Enrichment/Data/final_tmdb_data_2016.csv.gz')
movies_2017 = pd.read_csv('C:/Users/Shaun/Documents/GitHub/Data_Enrichment/Data/final_tmdb_data_2017.csv.gz')
movies_2018 = pd.read_csv('C:/Users/Shaun/Documents/GitHub/Data_Enrichment/Data/final_tmdb_data_2018.csv.gz')
movies_2019 = pd.read_csv('C:/Users/Shaun/Documents/GitHub/Data_Enrichment/Data/final_tmdb_data_2019.csv.gz')

# Concatenate the DataFrameC:\Users\Shaun\Documents\GitHub\Data_Enrichment\Data
all_movies = pd.concat([movies_2010, movies_2011, movies_2012, movies_2013, movies_2014, movies_2015, movies_2016, movies_2017, movies_2018, movies_2019], ignore_index=True)

In [74]:
all_movies.info

Unnamed: 0,imdb_id,adult,backdrop_path,belongs_to_collection,budget,genres,homepage,id,original_language,original_title,...,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count,certification
0,0,,,,,,,,,,...,,,,,,,,,,
1,tt0312305,0.0,/lqUbt2cy2pnqvxKefbQAtxLS0WA.jpg,,0.0,"[{'id': 10751, 'name': 'Family'}, {'id': 16, '...",http://www.qqthemovie.com/,23738.0,en,Quantum Quest: A Cassini Space Odyssey,...,0.0,45.0,"[{'english_name': 'English', 'iso_639_1': 'en'...",Released,,Quantum Quest: A Cassini Space Odyssey,0.0,7.9,8.0,
2,tt0326965,0.0,/xt2klJdKCVGXcoBGQrGfAS0aGDE.jpg,,0.0,"[{'id': 53, 'name': 'Thriller'}, {'id': 9648, ...",http://www.inmysleep.com,40048.0,en,In My Sleep,...,0.0,90.0,"[{'english_name': 'English', 'iso_639_1': 'en'...",Released,Sleepwalking can be deadly.,In My Sleep,0.0,5.318,33.0,PG-13
3,tt0331312,0.0,,,0.0,[],,214026.0,en,This Wretched Life,...,0.0,0.0,[],Released,,This Wretched Life,0.0,5.0,1.0,
4,tt0393049,0.0,/gc9FN5zohhzCt05RkejQIIPLtBl.jpg,,300000.0,"[{'id': 10749, 'name': 'Romance'}, {'id': 35, ...",,324352.0,en,Anderson's Cross,...,0.0,98.0,"[{'english_name': 'English', 'iso_639_1': 'en'...",Released,Sometimes the boy next door is more than the b...,Anderson's Cross,0.0,4.0,5.0,


### **The file name should be "tmdb_results_combined.csv.gz"**

In [76]:
all_movies.to_csv(f"{FOLDER}tmdb_results_combined.csv.gz", compression="gzip", index=False)

In [77]:
ratings = pd.read_csv('C:/Users/Shaun/Documents/GitHub/Data_Enrichment/data/title.ratings.tsv.gz')
ratings.head()

Unnamed: 0,tconst\taverageRating\tnumVotes
0,tt0000001\t5.7\t2002
1,tt0000002\t5.8\t269
2,tt0000003\t6.5\t1893
3,tt0000004\t5.5\t178
4,tt0000005\t6.2\t2678


In [78]:
basics = pd.read_csv('C:/Users/Shaun/Documents/GitHub/Data_Enrichment/data/title_basics.csv.gz')
basics.head()

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres
0,tt0035423,movie,Kate & Leopold,Kate & Leopold,0,2001,,118,"Comedy,Fantasy,Romance"
1,tt0062336,movie,The Tango of the Widower and Its Distorting Mi...,El tango del viudo y su espejo deformante,0,2020,,70,Drama
2,tt0069049,movie,The Other Side of the Wind,The Other Side of the Wind,0,2018,,122,Drama
3,tt0082328,movie,Embodiment of Evil,Encarnação do Demônio,0,2008,,94,Horror
4,tt0088751,movie,The Naked Monster,The Naked Monster,0,2005,,100,"Comedy,Horror,Sci-Fi"


In [79]:
imdb = pd.read_csv('C:/Users/Shaun/Documents/GitHub/Data_Enrichment/data/tmdb_results_combined.csv.gz')
imdb.head()

Unnamed: 0,imdb_id,adult,backdrop_path,belongs_to_collection,budget,genres,homepage,id,original_language,original_title,...,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count,certification
0,0,,,,,,,,,,...,,,,,,,,,,
1,tt0113026,0.0,/vMFs7nw6P0bIV1jDsQpxAieAVnH.jpg,,10000000.0,"[{'id': 35, 'name': 'Comedy'}, {'id': 10402, '...",,62127.0,en,The Fantasticks,...,0.0,86.0,"[{'english_name': 'English', 'iso_639_1': 'en'...",Released,Try to remember the first time magic happened,The Fantasticks,0.0,5.5,22.0,
2,tt0113092,0.0,,,0.0,"[{'id': 878, 'name': 'Science Fiction'}]",,110977.0,en,For the Cause,...,0.0,100.0,"[{'english_name': 'English', 'iso_639_1': 'en'...",Released,The ultimate showdown on a forbidden planet.,For the Cause,0.0,5.45,10.0,
3,tt0116391,0.0,,,0.0,"[{'id': 18, 'name': 'Drama'}, {'id': 28, 'name...",,442869.0,hi,Gang,...,0.0,152.0,"[{'english_name': 'Hindi', 'iso_639_1': 'hi', ...",Released,,Gang,0.0,4.0,1.0,
4,tt0116748,0.0,/wr0hTHwkYIRC82MwNbhOvqrw27N.jpg,,0.0,"[{'id': 18, 'name': 'Drama'}, {'id': 10749, 'n...",,579396.0,hi,Karobaar,...,0.0,180.0,"[{'english_name': 'Hindi', 'iso_639_1': 'hi', ...",Released,The Business of Love,Karobaar,0.0,7.0,3.0,


In [80]:
## create a col with a list of genres
basics['genres_split'] = basics['genres'].str.split(',')
basics


Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres,genres_split
0,tt0035423,movie,Kate & Leopold,Kate & Leopold,0,2001,,118,"Comedy,Fantasy,Romance","[Comedy, Fantasy, Romance]"
1,tt0062336,movie,The Tango of the Widower and Its Distorting Mi...,El tango del viudo y su espejo deformante,0,2020,,70,Drama,[Drama]
2,tt0069049,movie,The Other Side of the Wind,The Other Side of the Wind,0,2018,,122,Drama,[Drama]
3,tt0082328,movie,Embodiment of Evil,Encarnação do Demônio,0,2008,,94,Horror,[Horror]
4,tt0088751,movie,The Naked Monster,The Naked Monster,0,2005,,100,"Comedy,Horror,Sci-Fi","[Comedy, Horror, Sci-Fi]"
...,...,...,...,...,...,...,...,...,...,...
148160,tt9916190,movie,Safeguard,Safeguard,0,2020,,95,"Action,Adventure,Thriller","[Action, Adventure, Thriller]"
148161,tt9916270,movie,Il talento del calabrone,Il talento del calabrone,0,2020,,84,Thriller,[Thriller]
148162,tt9916362,movie,Coven,Akelarre,0,2020,,92,"Drama,History","[Drama, History]"
148163,tt9916538,movie,Kuambil Lagi Hatiku,Kuambil Lagi Hatiku,0,2019,,123,Drama,[Drama]


In [81]:
exploded_genres = basics.explode('genres_split')
exploded_genres


Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres,genres_split
0,tt0035423,movie,Kate & Leopold,Kate & Leopold,0,2001,,118,"Comedy,Fantasy,Romance",Comedy
0,tt0035423,movie,Kate & Leopold,Kate & Leopold,0,2001,,118,"Comedy,Fantasy,Romance",Fantasy
0,tt0035423,movie,Kate & Leopold,Kate & Leopold,0,2001,,118,"Comedy,Fantasy,Romance",Romance
1,tt0062336,movie,The Tango of the Widower and Its Distorting Mi...,El tango del viudo y su espejo deformante,0,2020,,70,Drama,Drama
2,tt0069049,movie,The Other Side of the Wind,The Other Side of the Wind,0,2018,,122,Drama,Drama
...,...,...,...,...,...,...,...,...,...,...
148161,tt9916270,movie,Il talento del calabrone,Il talento del calabrone,0,2020,,84,Thriller,Thriller
148162,tt9916362,movie,Coven,Akelarre,0,2020,,92,"Drama,History",Drama
148162,tt9916362,movie,Coven,Akelarre,0,2020,,92,"Drama,History",History
148163,tt9916538,movie,Kuambil Lagi Hatiku,Kuambil Lagi Hatiku,0,2019,,123,Drama,Drama


In [82]:
## exploding the column of lists
genres_split = basics['genres'].str.split(",")

unique_genres = genres_split.explode().unique()
unique_genres

array(['Comedy', 'Fantasy', 'Romance', 'Drama', 'Horror', 'Sci-Fi',
       'Action', 'Crime', 'Biography', 'Mystery', 'Adventure', 'Musical',
       'Thriller', 'Music', 'Animation', 'Family', 'History', 'War',
       'Sport', 'Western', 'Adult', 'News', 'Reality-TV', 'Talk-Show',
       'Game-Show'], dtype=object)

In [83]:
basics.drop(['originalTitle', 'isAdult', 'genres'], axis=1, inplace=True)


In [84]:
basics.head()

Unnamed: 0,tconst,titleType,primaryTitle,startYear,endYear,runtimeMinutes,genres_split
0,tt0035423,movie,Kate & Leopold,2001,,118,"[Comedy, Fantasy, Romance]"
1,tt0062336,movie,The Tango of the Widower and Its Distorting Mi...,2020,,70,[Drama]
2,tt0069049,movie,The Other Side of the Wind,2018,,122,[Drama]
3,tt0082328,movie,Embodiment of Evil,2008,,94,[Horror]
4,tt0088751,movie,The Naked Monster,2005,,100,"[Comedy, Horror, Sci-Fi]"


In [85]:
# Split the 'tconst' column into three separate columns
ratings[['tconst', 'averageRating', 'numVotes']] = ratings['tconst\taverageRating\tnumVotes'].str.split('\t', expand=True)
# Drop the original combined column
ratings = ratings.drop('tconst\taverageRating\tnumVotes', axis=1)
ratings.head()

Unnamed: 0,tconst,averageRating,numVotes
0,tt0000001,5.7,2002
1,tt0000002,5.8,269
2,tt0000003,6.5,1893
3,tt0000004,5.5,178
4,tt0000005,6.2,2678


In [86]:
title_genres = exploded_genres[['tconst','genres_split']].copy()
title_genres.head()

Unnamed: 0,tconst,genres_split
0,tt0035423,Comedy
0,tt0035423,Fantasy
0,tt0035423,Romance
1,tt0062336,Drama
2,tt0069049,Drama


In [87]:
## Making the genre mapper dictionary
genre_ints = range(len(unique_genres))
genre_map = dict(zip(unique_genres, genre_ints))
genre_map


{'Comedy': 0,
 'Fantasy': 1,
 'Romance': 2,
 'Drama': 3,
 'Horror': 4,
 'Sci-Fi': 5,
 'Action': 6,
 'Crime': 7,
 'Biography': 8,
 'Mystery': 9,
 'Adventure': 10,
 'Musical': 11,
 'Thriller': 12,
 'Music': 13,
 'Animation': 14,
 'Family': 15,
 'History': 16,
 'War': 17,
 'Sport': 18,
 'Western': 19,
 'Adult': 20,
 'News': 21,
 'Reality-TV': 22,
 'Talk-Show': 23,
 'Game-Show': 24}

In [88]:
title_genres['genre_id'] = title_genres['genres_split'].map(genre_map)
title_genres.drop('genres_split', axis=1, inplace=True)
title_genres.head()

Unnamed: 0,tconst,genre_id
0,tt0035423,0
0,tt0035423,1
0,tt0035423,2
1,tt0062336,3
2,tt0069049,3


In [89]:
genre_lookup = pd.DataFrame({'Genre_Name': list(genre_id_map.keys()),
                             'Genre_ID': list(genre_id_map.values())})
genre_lookup.head()

NameError: name 'genre_id_map' is not defined

In [32]:
# Create connection string using credentials following this format# 
connection_str = "mysql+pymysql://root:root@127.0.0.1/Movies"
engine = create_engine(connection_str)

In [1]:
q = """SELECT t.primaryTitle, m.revenue, m.certification 
FROM movies.tmdb_data as m
JOIN title_basics as t 
ON m.imdb_id = t.tconst
WHERE m.certification = 'G';"""
df_G = pd.read_sql(q,engine)
df_G

NameError: name 'pd' is not defined

In [7]:
q = """SELECT t.primaryTitle, m.revenue, m.certification 
FROM movies.tmdb_data as m
JOIN title_basics as t 
ON m.imdb_id = t.tconst
WHERE m.certification = 'PG-13';"""
df_PG13 = pd.read_sql(q,engine)
df_PG13

Unnamed: 0,primaryTitle,revenue,certification
0,Mission: Impossible II,546388105.0,PG-13
1,X-Men,296339527.0,PG-13
2,Supernova,14828081.0,PG-13
3,The Hiding Place,0.0,PG-13
4,Waterproof,0.0,PG-13
...,...,...,...
181,Metropolis,4035192.0,PG-13
182,Betaville,0.0,PG-13
183,Inuyasha the Movie: Affections Touching Across...,0.0,PG-13
184,One Piece: Clockwork Island Adventure,0.0,PG-13


In [8]:
q = """SELECT t.primaryTitle, m.revenue, m.certification 
FROM movies.tmdb_data as m
JOIN title_basics as t 
ON m.imdb_id = t.tconst
WHERE m.certification = 'R';"""
df_R = pd.read_sql(q,engine)
df_R

Unnamed: 0,primaryTitle,revenue,certification
0,Chinese Coffee,0.0,R
1,Heavy Metal 2000,0.0,R
2,Love 101,0.0,R
3,Vulgar,14904.0,R
4,The Million Dollar Hotel,105983.0,R
...,...,...,...
469,Sex Court: The Movie,0.0,R
470,Jesus Christ Vampire Hunter,0.0,R
471,WXIII: Patlabor,0.0,R
472,Death Game,0.0,R


In [9]:
q = """SELECT t.primaryTitle, m.revenue, m.certification 
FROM movies.tmdb_data as m
JOIN title_basics as t 
ON m.imdb_id = t.tconst
WHERE m.certification = 'PG';"""
df_PG = pd.read_sql(q,engine)
df_PG

Unnamed: 0,primaryTitle,revenue,certification
0,In the Mood for Love,14204632.0,PG
1,Titan A.E.,36754634.0,PG
2,Return to Me,36609995.0,PG
3,Dinosaur,354248063.0,PG
4,The Adventures of Rocky & Bullwinkle,35134820.0,PG
...,...,...,...
63,Bug Off!,0.0,PG
64,Little Secrets,0.0,PG
65,Mr. Bones,0.0,PG
66,The Living Forest,482902.0,PG
