# Demo: ETL on PostgreSQL

This is a sample ETL script that reads CSV files and inserts the data into PostgreSQL tables.
To fully understand the notebook, you need to be familiar with Python and the basic usage of [pandas](https://pandas.pydata.org/).
  

**Notes**
  - The dataset in this notebook is a snapshot taken during course creation, which is available to [download here]().
  - The original dataset is taken from [here](https://www.kaggle.com/stefanoleone992/imdb-extensive-dataset).
  - The dataset at the original location might change over time; for the same results, download the [course snapshot dataset]().
  
----

## In this notebook
In this step, we will read the CSV data sources and write them into staging tables.


## Required Python Packages
```bash
  pip install pandas
  pip install psycopg2-binary
  pip install sqlalchemy
```

In [None]:
import pandas as pd
import psycopg2
import sqlalchemy

Open the database connection.  
This time, we will use SQLAlchemy to help insert data from the CSV files into the database. Please [install SQLAlchemy](https://docs.sqlalchemy.org/) to use this notebook.

In [None]:
try:
    engine = sqlalchemy.create_engine('postgresql://postgres:postgres@34.101.89.227:5432/postgres')

    # Keep one raw connection so that autocommit applies to the cursor opened below
    conn = engine.raw_connection()
    conn.set_session(autocommit=True)

    # Open cursor
    cur = conn.cursor()
except Exception as e:
    print("Error: cannot open cursor for SQL interaction")
    print(e)
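
The connection string above embeds the credentials directly in the notebook. As a hedged alternative, the sketch below builds the same kind of URL from environment variables. The variable names (`PGUSER`, `PGPASSWORD`, `PGHOST`, `PGPORT`, `PGDATABASE`), the fallback defaults, and the helper `build_postgres_url` are assumptions for illustration and are not part of the original notebook.

```python
import os
from urllib.parse import quote_plus


def build_postgres_url(env=os.environ):
    """Build a SQLAlchemy-style PostgreSQL URL from environment variables.

    The variable names and defaults are assumptions for this sketch;
    adjust them to whatever your environment provides.
    """
    user = env.get("PGUSER", "postgres")
    password = env.get("PGPASSWORD", "")
    host = env.get("PGHOST", "localhost")
    port = env.get("PGPORT", "5432")
    database = env.get("PGDATABASE", "postgres")

    # Percent-encode user and password so characters such as "@" or ":"
    # do not break the URL structure.
    credentials = quote_plus(user)
    if password:
        credentials += ":" + quote_plus(password)

    return f"postgresql://{credentials}@{host}:{port}/{database}"


url = build_postgres_url()
# The result could then be passed to sqlalchemy.create_engine(url).
```

Keeping the password out of the notebook means the file can be shared without exposing the database credentials.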

Read the files and create a `DataFrame` for each file.

In [None]:
movies_file = "data/movie-datasets/IMDb_movies.csv"
names_file = "data/movie-datasets/IMDb_names.csv"
ratings_file = "data/movie-datasets/IMDb_ratings.csv"
title_principals_file = "data/movie-datasets/IMDb_title_principals.csv"

In [None]:
movies = pd.read_csv(movies_file,low_memory=False)
names = pd.read_csv(names_file,low_memory=False)
ratings = pd.read_csv(ratings_file,low_memory=False)
title_principals = pd.read_csv(title_principals_file,low_memory=False)

### Extract movies dataset

Take a quick peek at `movies`.

In [None]:
movies.info()

Create staging table for `movies`.

In [None]:
cur.execute("""
    CREATE TABLE IF NOT EXISTS stg_movies(
        imdb_title_id varchar,
        title varchar,
        original_title varchar,
        year varchar,
        date_published varchar,
        genre varchar,
        duration varchar,
        country varchar,
        language varchar,
        director varchar,
        writer varchar,
        production_company varchar,
        actors varchar,
        description varchar,
        avg_vote varchar,
        votes varchar,
        budget varchar,
        usa_gross_income varchar,
        worlwide_gross_income varchar,
        metascore varchar,
        reviews_from_users varchar,
        reviews_from_critics varchar,
        created_date timestamp default now()
    );
""")

### Extract names dataset

Take a quick peek at `names`.

In [None]:
names.info()

Create staging table for `names`.

In [None]:
cur.execute("""
    CREATE TABLE IF NOT EXISTS stg_names(
        imdb_name_id varchar,
        name varchar,
        birth_name varchar,
        height varchar,
        bio varchar,
        birth_details varchar,
        date_of_birth varchar,
        place_of_birth varchar,
        death_details varchar,
        date_of_death varchar,
        place_of_death varchar,
        reason_of_death varchar,
        spouses_string varchar,
        spouses varchar,
        divorces varchar,
        spouses_with_children varchar,
        children varchar,
        created_date timestamp default now()
    );
""")

### Extract ratings dataset

Take a quick peek at `ratings`.

In [None]:
ratings.info()

Create staging table for `ratings`.

In [None]:
cur.execute("""
    CREATE TABLE IF NOT EXISTS stg_ratings(
        imdb_title_id varchar,
        weighted_average_vote varchar,
        total_votes varchar,
        mean_vote varchar,
        median_vote varchar,
        votes_10 varchar,
        votes_9 varchar,
        votes_8 varchar,
        votes_7 varchar,
        votes_6 varchar,
        votes_5 varchar,
        votes_4 varchar,
        votes_3 varchar,
        votes_2 varchar,
        votes_1 varchar,
        allgenders_0age_avg_vote varchar,
        allgenders_0age_votes varchar,
        allgenders_18age_avg_vote varchar,
        allgenders_18age_votes varchar,
        allgenders_30age_avg_vote varchar,
        allgenders_30age_votes varchar,
        allgenders_45age_avg_vote varchar,
        allgenders_45age_votes varchar,
        males_allages_avg_vote varchar,
        males_allages_votes varchar,
        males_0age_avg_vote varchar,
        males_0age_votes varchar,
        males_18age_avg_vote varchar,
        males_18age_votes varchar,
        males_30age_avg_vote varchar,
        males_30age_votes varchar,
        males_45age_avg_vote varchar,
        males_45age_votes varchar,
        females_allages_avg_vote varchar,
        females_allages_votes varchar,
        females_0age_avg_vote varchar,
        females_0age_votes varchar,
        females_18age_avg_vote varchar,
        females_18age_votes varchar,
        females_30age_avg_vote varchar,
        females_30age_votes varchar,
        females_45age_avg_vote varchar,
        females_45age_votes varchar,
        top1000_voters_rating varchar,
        top1000_voters_votes varchar,
        us_voters_rating varchar,
        us_voters_votes varchar,
        non_us_voters_rating varchar,
        non_us_voters_votes varchar,
        created_date timestamp default now()
    );
""")

### Extract title_principals dataset

Take a quick peek at `title_principals`.

In [None]:
title_principals.info()

Create staging table for `title_principals`.

In [None]:
cur.execute("""
    CREATE TABLE IF NOT EXISTS stg_title_principals(
        imdb_title_id varchar,
        ordering varchar,
        imdb_name_id varchar,
        category varchar,
        job varchar,
        characters varchar,
        created_date timestamp default now()
    );
""")
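
The four `CREATE TABLE` statements above share one shape: every source column is a `varchar`, followed by a `created_date` column that defaults to `now()`. As a rough sketch, the statement could be generated from a list of column names instead of being written by hand. The helper `build_staging_ddl` and the identifier check are assumptions for illustration; the column list is copied from the `stg_title_principals` statement.

```python
import re

# Column names copied from the stg_title_principals statement above.
TITLE_PRINCIPALS_COLUMNS = [
    "imdb_title_id", "ordering", "imdb_name_id", "category", "job", "characters",
]

# Only allow simple lowercase identifiers, because the names are
# interpolated directly into the SQL text.
_IDENTIFIER = re.compile(r"^[a-z_][a-z0-9_]*$")


def build_staging_ddl(table, columns):
    """Return a CREATE TABLE statement with all-varchar staging columns."""
    for name in [table, *columns]:
        if not _IDENTIFIER.match(name):
            raise ValueError(f"unsafe identifier: {name!r}")
    column_lines = [f"        {col} varchar," for col in columns]
    column_lines.append("        created_date timestamp default now()")
    body = "\n".join(column_lines)
    return f"    CREATE TABLE IF NOT EXISTS {table}(\n{body}\n    );"


ddl = build_staging_ddl("stg_title_principals", TITLE_PRINCIPALS_COLUMNS)
print(ddl)
# In the notebook this could be run with: cur.execute(ddl)
```

Writing the statements out by hand, as the notebook does, keeps each table definition visible; a generator like this mainly helps when a table has many columns, such as `stg_ratings`.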

### Load the data from CSV into each staging table

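For each staging table, clean up existing data first.  
In this sample, the cleanup is based on date: assuming data is loaded on a daily basis, rows created today are deleted before the load, so rerunning the notebook on the same day does not duplicate them.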
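
To illustrate why this delete-then-insert pattern can be rerun safely, here is a minimal sketch that uses an in-memory SQLite database as a stand-in for PostgreSQL. This is an assumption made so the example runs without a server: SQLite has no `date_trunc`, so `date(...)` is used instead. The table `stg_demo`, the helper `reload_today`, and the sample rows are hypothetical.

```python
import sqlite3


def reload_today(conn, table, rows):
    """Delete rows created today from `table`, insert `rows`, return the row count."""
    cur = conn.cursor()
    # SQLite stand-in for: date_trunc('day', created_date) = date_trunc('day', now())
    cur.execute(f"DELETE FROM {table} WHERE date(created_date) = date('now')")
    cur.executemany(
        f"INSERT INTO {table} (imdb_title_id, title) VALUES (?, ?)", rows
    )
    conn.commit()
    cur.execute(f"SELECT count(*) FROM {table}")
    return cur.fetchone()[0]


conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE stg_demo(
        imdb_title_id varchar,
        title varchar,
        created_date timestamp default current_timestamp
    )
""")

# One hypothetical row loaded "yesterday"; the cleanup should leave it alone.
conn.execute(
    "INSERT INTO stg_demo VALUES ('tt0000000', 'Older row', datetime('now', '-1 day'))"
)

rows = [("tt0000001", "First"), ("tt0000002", "Second")]
first_count = reload_today(conn, "stg_demo", rows)
second_count = reload_today(conn, "stg_demo", rows)
print(first_count, second_count)
```

Both calls report the same total, because the second run removes the rows inserted by the first before loading them again, while the row from the previous day is untouched.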

In [None]:
cur.execute("""
    DELETE FROM stg_movies
        WHERE date_trunc('day', created_date) = date_trunc('day', now())
""")

movies.to_sql("stg_movies", con=engine, if_exists="append", index=False, method="multi", chunksize=500)
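
In the `to_sql` call above, `chunksize=500` asks pandas to write the rows in batches of 500, and `method="multi"` passes multiple rows in a single `INSERT` statement. The batching idea can be sketched in plain Python as follows. The helper `chunked` is hypothetical and only illustrates how rows are split; it is not how pandas implements it internally.

```python
def chunked(rows, size):
    """Yield consecutive slices of `rows`, each with at most `size` items."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(rows), size):
        yield rows[start:start + size]


# Hypothetical input: 1,200 placeholder rows, batched like chunksize=500.
rows = list(range(1200))
batch_sizes = [len(batch) for batch in chunked(rows, 500)]
print(batch_sizes)
```

The last batch simply holds whatever rows remain, so the total number of rows written is unchanged.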

In [None]:
cur.execute("""
    DELETE FROM stg_names
        WHERE date_trunc('day', created_date) = date_trunc('day', now())
""")

names.to_sql("stg_names", con=engine, if_exists="append", index=False, method="multi", chunksize=500)

In [None]:
cur.execute("""
    DELETE FROM stg_ratings
        WHERE date_trunc('day', created_date) = date_trunc('day', now())
""")

ratings.to_sql("stg_ratings", con=engine, if_exists="append", index=False, method="multi", chunksize=500)

In [None]:
cur.execute("""
    DELETE FROM stg_title_principals
        WHERE date_trunc('day', created_date) = date_trunc('day', now())
""")

title_principals.to_sql("stg_title_principals", con=engine, if_exists="append", index=False, method="multi", chunksize=500)

Check the staging tables.

In [None]:
cur.execute("SELECT count(*) FROM stg_movies WHERE date_trunc('day', created_date) = date_trunc('day', now())")
print("stg_movies today's data : {} rows".format(cur.fetchone()[0]))

cur.execute("SELECT count(*) FROM stg_names WHERE date_trunc('day', created_date) = date_trunc('day', now())")
print("stg_names today's data : {} rows".format(cur.fetchone()[0]))

cur.execute("SELECT count(*) FROM stg_ratings WHERE date_trunc('day', created_date) = date_trunc('day', now())")
print("stg_ratings today's data : {} rows".format(cur.fetchone()[0]))

cur.execute("SELECT count(*) FROM stg_title_principals WHERE date_trunc('day', created_date) = date_trunc('day', now())")
print("stg_title_principals today's data : {} rows".format(cur.fetchone()[0]))
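
The four checks above differ only in the table name, so they could be folded into one loop. The sketch below assumes a DB-API cursor such as the `cur` opened earlier; the helpers `build_count_query` and `count_todays_rows` are hypothetical names introduced for illustration.

```python
STAGING_TABLES = ("stg_movies", "stg_names", "stg_ratings", "stg_title_principals")


def build_count_query(table):
    """Return the row-count query for today's data in one staging table."""
    # Only known table names are accepted, because the name is
    # interpolated directly into the SQL text.
    if table not in STAGING_TABLES:
        raise ValueError(f"unknown staging table: {table!r}")
    return (
        f"SELECT count(*) FROM {table} "
        "WHERE date_trunc('day', created_date) = date_trunc('day', now())"
    )


def count_todays_rows(cur, tables=STAGING_TABLES):
    """Run the count query for each table and return {table: row_count}."""
    counts = {}
    for table in tables:
        cur.execute(build_count_query(table))
        counts[table] = cur.fetchone()[0]
    return counts


# In the notebook this could be used as:
#     for table, count in count_todays_rows(cur).items():
#         print("{} today's data : {} rows".format(table, count))
```

Comparing each count with the length of the matching `DataFrame` (for example `len(movies)`) is a simple way to confirm that every row was loaded.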