# Specifications - Database
### - Your stakeholder wants you to take the data you have been cleaning and collecting in Parts 1 & 2 of the project and create a MySQL database for them.

### - Specifically, they want the data from the following files included in your database:
 - Title Basics:
   - Movie ID (tconst)
   - Primary Title
   - Start Year
   - Runtime (in Minutes)
   - Genres
 - Title Ratings
   - Movie ID (tconst)
   - Average Movie Rating
   - Number of Votes
 - The TMDB API Results (multiple files)
   - Movie ID
   - Revenue
   - Budget
   - Certification (MPAA Rating)
- You should normalize the tables as best you can before adding them to your new database.

 - Note: an important exception to their request is that they would like you to keep all of the data from the TMDB API together in one table (even though it will not be perfectly normalized).
 - From the TMDB API results, you only need to keep the imdb_id, revenue, budget, and certification columns.
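
As a rough illustration of the TMDB requirement above, the sketch below combines two toy DataFrames into one table and keeps only the four requested columns. The frames `tmdb_a` and `tmdb_b`, the extra `title` column, and every value in them are invented for the example; the real API result files may be shaped differently.

```python
import pandas as pd

# Two toy frames standing in for separate TMDB result files.
# All values here are made up for illustration only.
tmdb_a = pd.DataFrame({
    'imdb_id': ['tt0000001'],
    'title': ['Example A'],
    'revenue': [1000.0],
    'budget': [500.0],
    'certification': ['PG'],
})
tmdb_b = pd.DataFrame({
    'imdb_id': ['tt0000002'],
    'title': ['Example B'],
    'revenue': [0.0],
    'budget': [250.0],
    'certification': ['R'],
})

# Columns the stakeholder asked to keep
keep_cols = ['imdb_id', 'revenue', 'budget', 'certification']

# Stack the files into a single table, then select the requested columns
tmdb = pd.concat([tmdb_a, tmdb_b], ignore_index=True)[keep_cols]
```

Using `ignore_index=True` gives the combined table a fresh 0..n-1 index instead of repeating each file's own index.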

### Imports

In [1]:
# Standard Imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import json
# Imports for creating database
import pymysql
from sqlalchemy import create_engine
from sqlalchemy_utils import database_exists, create_database
from urllib.parse import quote_plus as urlquote

# Required Transformation Steps for Title Basics:
- Normalize Genre:

 - Convert the single string of genres from title basics into 2 new tables.
  1. title_genres: with the columns:

    - tconst
    - genre_id
  2. genres:

    - genre_id
    - genre_name
- Discard unnecessary information:

 - For the title basics table, drop the following columns:
   - "originalTitle" (we will use the "primaryTitle" column instead).
   - "isAdult" ("Adult" will show up in the genres, so this is redundant information).
   - "titleType" (every row will be a movie).
   - "genres" and other variants of genre (genre is now represented in the 2 new tables described above).
  - Do not include the title_akas table in your SQL database.
   - You have already filtered out the desired movies using this table, and the remaining data is mostly nulls and not of interest to the stakeholder.
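
The column drops listed above could look roughly like the following. This is a minimal sketch on a one-row toy frame that mirrors the columns of the title basics file; `toy_basics`, `cols_to_drop`, and `trimmed` are hypothetical names. In the real workflow the "genres" column would only be dropped after the two genre tables have been built from it.

```python
import pandas as pd

# One-row stand-in with the same columns as the title basics file
toy_basics = pd.DataFrame({
    'tconst': ['tt0035423'],
    'titleType': ['movie'],
    'primaryTitle': ['Kate & Leopold'],
    'originalTitle': ['Kate & Leopold'],
    'isAdult': [0],
    'startYear': [2001.0],
    'endYear': [float('nan')],
    'runtimeMinutes': [118],
    'genres': ['Comedy,Fantasy,Romance'],
})

# Columns the specification asks us to discard
cols_to_drop = ['originalTitle', 'isAdult', 'titleType', 'genres']

# drop() returns a new frame; the original is left untouched
trimmed = toy_basics.drop(columns=cols_to_drop)
```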

### Load 'Basics' Data

In [7]:
# Load data and datatypes
basics = pd.read_csv('Data/title_basics.csv.gz')
basics.info()
basics.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 81900 entries, 0 to 81899
Data columns (total 9 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   tconst          81900 non-null  object 
 1   titleType       81900 non-null  object 
 2   primaryTitle    81900 non-null  object 
 3   originalTitle   81900 non-null  object 
 4   isAdult         81900 non-null  int64  
 5   startYear       81900 non-null  float64
 6   endYear         0 non-null      float64
 7   runtimeMinutes  81900 non-null  int64  
 8   genres          81900 non-null  object 
dtypes: float64(2), int64(2), object(5)
memory usage: 5.6+ MB


Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres
0,tt0035423,movie,Kate & Leopold,Kate & Leopold,0,2001.0,,118,"Comedy,Fantasy,Romance"
1,tt0062336,movie,The Tango of the Widower and Its Distorting Mi...,El tango del viudo y su espejo deformante,0,2020.0,,70,Drama
2,tt0068865,movie,Lives of Performers,Lives of Performers,0,2016.0,,90,Drama
3,tt0069049,movie,The Other Side of the Wind,The Other Side of the Wind,0,2018.0,,122,Drama
4,tt0088751,movie,The Naked Monster,The Naked Monster,0,2005.0,,100,"Comedy,Horror,Sci-Fi"


## I) Normalizing Genres - Overview
- In order to normalize genres, we will need to:

 - Convert the single string of genres from title basics into 2 new tables.
   1. title_genres: with the columns:

     - tconst
     - genre_id
   2. genres:

     - genre_id
     - genre_name
- Creating these tables will be a multi-step process.

 1. Get a list of all individual genres.
 2. Create a new title_genres table with the movie ids duplicated, once for each genre that a movie belongs to.
 3. Create a mapper dictionary with numeric ids for each genre.
 4. Use the mapper dictionary to replace the string genres in title_genres with numeric genre_ids.
 5. Convert the mapper dictionary into a final genres table with the numeric genre_id and the string genre.
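
The five steps above can be sketched end to end on a toy DataFrame. This is only an illustration, assuming a frame with `tconst` and `genres` columns like the one loaded earlier. The names `toy`, `genre_map`, and `genres_table` are hypothetical, and numbering the genre ids from 0 in alphabetical order is an assumption rather than a requirement.

```python
import pandas as pd

# Toy stand-in for the real basics table (values are illustrative only)
toy = pd.DataFrame({
    'tconst': ['tt001', 'tt002'],
    'genres': ['Comedy,Fantasy', 'Drama'],
})

# Step 1: split the genre string and collect the unique genres
toy['genres_split'] = toy['genres'].str.split(',')
exploded = toy.explode('genres_split')
unique_genres = sorted(exploded['genres_split'].unique())

# Step 2: one row per (movie, genre) pair
title_genres = exploded[['tconst', 'genres_split']].copy()

# Step 3: mapper dictionary from genre name to numeric id (ids start at 0 here)
genre_map = dict(zip(unique_genres, range(len(unique_genres))))

# Step 4: replace the string genre with its numeric id
title_genres['genre_id'] = title_genres['genres_split'].map(genre_map)
title_genres = title_genres.drop(columns='genres_split').reset_index(drop=True)

# Step 5: turn the mapper into the genres lookup table
genres_table = pd.DataFrame({
    'genre_id': list(genre_map.values()),
    'genre_name': list(genre_map.keys()),
})
```

Keeping the mapper as a plain dictionary means the same object drives both the id replacement in `title_genres` and the final `genres_table`, so the two tables cannot drift apart.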

### 1. Getting a List of Unique Genres

In [9]:
## Create a col with a list of genres
basics['genres_split'] = basics['genres'].str.split(',')
basics

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres,genres_split
0,tt0035423,movie,Kate & Leopold,Kate & Leopold,0,2001.0,,118,"Comedy,Fantasy,Romance","[Comedy, Fantasy, Romance]"
1,tt0062336,movie,The Tango of the Widower and Its Distorting Mi...,El tango del viudo y su espejo deformante,0,2020.0,,70,Drama,[Drama]
2,tt0068865,movie,Lives of Performers,Lives of Performers,0,2016.0,,90,Drama,[Drama]
3,tt0069049,movie,The Other Side of the Wind,The Other Side of the Wind,0,2018.0,,122,Drama,[Drama]
4,tt0088751,movie,The Naked Monster,The Naked Monster,0,2005.0,,100,"Comedy,Horror,Sci-Fi","[Comedy, Horror, Sci-Fi]"
...,...,...,...,...,...,...,...,...,...,...
81895,tt9914942,movie,Life Without Sara Amat,La vida sense la Sara Amat,0,2019.0,,74,Drama,[Drama]
81896,tt9915872,movie,The Last White Witch,Boku no kanojo wa mahoutsukai,0,2019.0,,97,"Comedy,Drama,Fantasy","[Comedy, Drama, Fantasy]"
81897,tt9916170,movie,The Rehearsal,O Ensaio,0,2019.0,,51,Drama,[Drama]
81898,tt9916190,movie,Safeguard,Safeguard,0,2020.0,,95,"Action,Adventure,Thriller","[Action, Adventure, Thriller]"


In [10]:
# Use .explode() to separate the list of genres into new rows
exploded_genres = basics.explode('genres_split')
exploded_genres

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres,genres_split
0,tt0035423,movie,Kate & Leopold,Kate & Leopold,0,2001.0,,118,"Comedy,Fantasy,Romance",Comedy
0,tt0035423,movie,Kate & Leopold,Kate & Leopold,0,2001.0,,118,"Comedy,Fantasy,Romance",Fantasy
0,tt0035423,movie,Kate & Leopold,Kate & Leopold,0,2001.0,,118,"Comedy,Fantasy,Romance",Romance
1,tt0062336,movie,The Tango of the Widower and Its Distorting Mi...,El tango del viudo y su espejo deformante,0,2020.0,,70,Drama,Drama
2,tt0068865,movie,Lives of Performers,Lives of Performers,0,2016.0,,90,Drama,Drama
...,...,...,...,...,...,...,...,...,...,...
81898,tt9916190,movie,Safeguard,Safeguard,0,2020.0,,95,"Action,Adventure,Thriller",Action
81898,tt9916190,movie,Safeguard,Safeguard,0,2020.0,,95,"Action,Adventure,Thriller",Adventure
81898,tt9916190,movie,Safeguard,Safeguard,0,2020.0,,95,"Action,Adventure,Thriller",Thriller
81899,tt9916362,movie,Coven,Akelarre,0,2020.0,,92,"Drama,History",Drama


In [None]:
# Use .unique() to get the unique genres from the genres_split column
unique_genres = sorted(exploded_genres['genres_split'].unique())