# Part 3 (Core ETL to MySQL)

## Goal
Transform cleaned movie datasets into a relational MySQL database (`movies`) with:
- `title_basics`  
- `title_ratings`  
- `title_genres` (join table)  
- `genres` (lookup table)  
- `tmdb_data`

All tables are created with the correct primary keys and verified with SQL queries.
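
To make the target layout concrete, here is a minimal sketch of how the join and lookup tables relate. It uses the standard-library `sqlite3` module as a stand-in for MySQL so it runs without a server. The column types, the lowercase `genre_id`/`genre_name` names, and the composite key on `title_genres` are assumptions for illustration, not the schema this notebook actually creates.

```python
import sqlite3

# In-memory SQLite database used only as a stand-in for the MySQL `movies` database.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE title_basics (tconst TEXT PRIMARY KEY, primaryTitle TEXT);
CREATE TABLE genres (genre_id INTEGER PRIMARY KEY, genre_name TEXT);
-- Assumed composite key: one row per (title, genre) pair.
CREATE TABLE title_genres (
    tconst TEXT,
    genre_id INTEGER,
    PRIMARY KEY (tconst, genre_id)
);
""")

# Toy rows, not taken from the real data.
conn.execute("INSERT INTO title_basics VALUES ('tt0000001', 'Example Title')")
conn.executemany("INSERT INTO genres VALUES (?, ?)", [(0, "Comedy"), (1, "Drama")])
conn.executemany(
    "INSERT INTO title_genres VALUES (?, ?)",
    [("tt0000001", 0), ("tt0000001", 1)],
)

# Resolve each title's genres through the join table.
rows = conn.execute("""
SELECT b.primaryTitle, g.genre_name
FROM title_basics AS b
JOIN title_genres AS tg ON tg.tconst = b.tconst
JOIN genres AS g ON g.genre_id = tg.genre_id
ORDER BY g.genre_name
""").fetchall()
```

A query of this shape is also a reasonable smoke test once the real tables are loaded.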


In [12]:
# pip install yelpapi

In [14]:
# ## 1. Setup & Imports
import os, time, json
import pandas as pd
import tmdbsimple as tmdb
from tqdm.notebook import tqdm_notebook

# Ensure data folder exists
FOLDER = "Data/"
os.makedirs(FOLDER, exist_ok=True)
os.listdir(FOLDER)


['final_akas.csv.gz',
 'final_basics.csv.gz',
 'final_ratings.csv.gz',
 'final_tmdb_data_2000.csv.gz',
 'final_tmdb_data_2001.csv.gz',
 'title_basics.csv.gz',
 'tmdb_api_results_2000.json',
 'tmdb_api_results_2001.json',
 'tmdb_results_combined.csv.gz']

In [17]:
# ## 2. API Credentials (TMDB)
with open('C:/Users/tulan/.secret/TMDB_api.json', 'r') as f:
    login = json.load(f)

tmdb.API_KEY = login['api-key']
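
The credential loading above can be wrapped in a small helper that fails early when the key is missing. This is a sketch only: `load_api_key` is a hypothetical name, and the default field name simply mirrors the `'api-key'` entry used in the cell above.

```python
import json
from pathlib import Path


def load_api_key(path, field="api-key"):
    """Read a JSON credentials file and return the value stored under `field`.

    Hypothetical helper: raises KeyError if the field is missing or empty,
    so a bad credentials file is caught before any API call is made.
    """
    path = Path(path).expanduser()
    with path.open("r", encoding="utf-8") as f:
        login = json.load(f)
    if not login.get(field):
        raise KeyError(f"'{field}' missing or empty in {path}")
    return login[field]
```

It would be called as `tmdb.API_KEY = load_api_key('C:/Users/tulan/.secret/TMDB_api.json')`.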


In [19]:
# ## 3. Load & Inspect Title Basics
basics = pd.read_csv('Data/final_basics.csv.gz')
basics.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 91011 entries, 0 to 91010
Data columns (total 9 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   tconst          91011 non-null  object 
 1   titleType       91011 non-null  object 
 2   primaryTitle    91010 non-null  object 
 3   originalTitle   91010 non-null  object 
 4   isAdult         91011 non-null  int64  
 5   startYear       91011 non-null  float64
 6   endYear         0 non-null      float64
 7   runtimeMinutes  91011 non-null  int64  
 8   genres          91011 non-null  object 
dtypes: float64(2), int64(2), object(5)
memory usage: 6.2+ MB
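
The `info()` output shows `endYear` with 0 non-null values. A minimal sketch for listing such columns programmatically, on a toy frame rather than the real data (`all_null_columns` is a hypothetical helper):

```python
import pandas as pd


def all_null_columns(df):
    """Return the names of columns that contain no non-null values."""
    return [col for col in df.columns if df[col].isna().all()]


# Toy frame that mimics the shape of the issue; values are made up.
toy = pd.DataFrame({
    "tconst": ["tt0000001", "tt0000002"],
    "startYear": [2000.0, 2001.0],
    "endYear": [float("nan"), float("nan")],
})
empty_cols = all_null_columns(toy)
```

Whether such a column should be dropped before loading is a schema decision; the sketch only reports it.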


In [21]:
# ## 4. Drop Unnecessary Columns (Normalization step)
basics = basics.drop(['originalTitle', 'isAdult', 'titleType'], axis=1)


In [23]:
# ## 5. Split & Normalize Genres
# Create a list of genres per row
basics['genres_split'] = basics['genres'].str.split(',')

# Explode to long format (1 genre per row)
exploded_genres = basics.explode('genres_split')

# Unique genre list
unique_genres = sorted(exploded_genres['genres_split'].unique())
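
To see what the split-and-explode step does, here is the same sequence on a two-row toy frame (the titles and genres are invented for illustration):

```python
import pandas as pd

toy = pd.DataFrame({
    "tconst": ["tt0000001", "tt0000002"],
    "genres": ["Comedy,Drama", "Drama"],
})

# "Comedy,Drama" becomes the list ["Comedy", "Drama"].
toy["genres_split"] = toy["genres"].str.split(",")

# One output row per list element; the original index label is repeated.
toy_exploded = toy.explode("genres_split")

toy_unique = sorted(toy_exploded["genres_split"].unique())
```

Because `explode` repeats the original index labels, the exploded frame has a non-unique index, which matters if it is later aligned with other frames by index.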


In [26]:
# ## 6. Build title_genres Table
title_genres = exploded_genres[['tconst', 'genres_split']].copy()

# Map genres to integer IDs
genre_map = dict(zip(unique_genres, range(len(unique_genres))))
title_genres['genre_id'] = title_genres['genres_split'].map(genre_map)
title_genres = title_genres.drop(columns='genres_split')
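
If `title_genres` is to carry a composite key on (`tconst`, `genre_id`), which is an assumption here rather than something the notebook states, each pair must appear only once. A sketch of that check on toy data, with `duplicate_key_rows` as a hypothetical helper:

```python
import pandas as pd


def duplicate_key_rows(df, key_cols):
    """Return every row whose key_cols combination occurs more than once."""
    return df[df.duplicated(subset=key_cols, keep=False)]


# Toy join table with one repeated (tconst, genre_id) pair.
toy_title_genres = pd.DataFrame({
    "tconst": ["tt0000001", "tt0000001", "tt0000002", "tt0000002"],
    "genre_id": [0, 1, 1, 1],
})

dupes = duplicate_key_rows(toy_title_genres, ["tconst", "genre_id"])
deduped = toy_title_genres.drop_duplicates(subset=["tconst", "genre_id"])
```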


In [28]:
# ## 7. Build genres Lookup Table
# Materialize the dict views as lists so the columns are plain sequences
genre_lookup = pd.DataFrame({
    'Genre_name': list(genre_map.keys()),
    'Genre_ID': list(genre_map.values())
})
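
One way to sanity-check the two genre tables together is a round trip: join them and rebuild the original comma-separated string. The sketch below does this on toy frames that copy the column names used above; the rows themselves are invented.

```python
import pandas as pd

toy_title_genres = pd.DataFrame({
    "tconst": ["tt0000001", "tt0000001", "tt0000002"],
    "genre_id": [0, 1, 1],
})
toy_genre_lookup = pd.DataFrame({
    "Genre_name": ["Comedy", "Drama"],
    "Genre_ID": [0, 1],
})

# Left join keeps every title/genre pair, even if an ID had no match.
merged = toy_title_genres.merge(
    toy_genre_lookup, left_on="genre_id", right_on="Genre_ID", how="left"
)

# Rebuild the comma-separated genre string per title.
rebuilt = merged.groupby("tconst")["Genre_name"].agg(",".join)
```

Any missing value in `merged["Genre_name"]` would indicate a `genre_id` with no entry in the lookup table.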


In [30]:
# ## 8. Load Title Ratings
ratings = pd.read_csv('Data/final_ratings.csv.gz')
# No transformations required


In [32]:
# ## 9. Load & Filter TMDB API Results
results = pd.read_csv('Data/tmdb_results_combined.csv.gz')
results = results[['imdb_id', 'revenue', 'budget', 'certification']]
results = results.rename(columns={'imdb_id': 'tconst'})
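
Both `ratings` and `results` are keyed on `tconst`, so it is worth confirming that the column is complete and unique before declaring it a primary key. A minimal sketch on toy data; `is_valid_primary_key` is a hypothetical helper, and the `averageRating` column name is an assumption because the ratings file's columns are not shown above.

```python
import pandas as pd


def is_valid_primary_key(df, col):
    """True if `col` has no missing values and no repeated values."""
    return bool(df[col].notna().all() and df[col].is_unique)


# Toy frames; values are invented.
toy_ratings = pd.DataFrame({
    "tconst": ["tt0000001", "tt0000002"],
    "averageRating": [7.1, 5.4],  # assumed column name
})
toy_results = pd.DataFrame({
    "tconst": ["tt0000001", "tt0000001", None],
    "revenue": [0, 0, 0],
})

ratings_ok = is_valid_primary_key(toy_ratings, "tconst")
results_ok = is_valid_primary_key(toy_results, "tconst")
```

A frame that fails the check needs its duplicate or missing keys resolved before the table is loaded.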


In [34]:
# ## 10. Re-create Cleaned Basics (Drop genre column for SQL table)
basics = pd.read_csv('Data/final_basics.csv.gz')
basics = basics.drop(['originalTitle', 'isAdult', 'titleType', 'genres'], axis=1)
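
When the cleaned frame is later written to SQL, text columns usually need an explicit length, for example for a key column. A sketch of measuring the longest value per text column, on toy data; `max_text_lengths` is a hypothetical helper, and how the lengths are then mapped to column types is left open.

```python
import pandas as pd


def max_text_lengths(df, text_cols):
    """Map each column in text_cols to the length of its longest value.

    Missing values are treated as empty strings so they count as length 0.
    """
    return {
        col: int(df[col].fillna("").astype(str).str.len().max())
        for col in text_cols
    }


# Toy frame with one missing title, mirroring the single null seen in info().
toy_basics = pd.DataFrame({
    "tconst": ["tt0000001", "tt0000002"],
    "primaryTitle": ["Alpha", None],
})
lengths = max_text_lengths(toy_basics, ["tconst", "primaryTitle"])
```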
