# Exercise 1 - Sakila Star Schema & ETL

All the database tables in this demo are based on public sample databases and transformations of them:
- `Sakila` is a sample database created by `MySQL` [Link](https://video.udacity-data.com/topher/2021/August/61120e06_pagila-3nf/pagila-3nf.png)
- The PostgreSQL version of it is called `Pagila` [Link](https://github.com/devrimgunduz/pagila)
- The design of the fact and dimension tables is based on O'Reilly's public dimensional modelling tutorial schema [Link](https://video.udacity-data.com/topher/2021/August/61120d38_pagila-star/pagila-star.png)

# STEP0: Using ipython-sql

- Load ipython-sql: `%load_ext sql`

- To execute SQL queries, put one of the following at the top of the cell:
    - `%sql`
        - For a one-line SQL query
        - You can access a Python variable using `$`
    - `%%sql`
        - For a multi-line SQL query
        - You can **NOT** access a Python variable using `$`


- Running `%sql` with a connection string such as
`postgresql://postgres:postgres@db:5432/pagila` connects to the database
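
To see which part of the connection string above is which, the URL can be split with Python's standard library. This is only an illustrative sketch: ipython-sql does not need this step, and the `connection` dictionary is a name made up for this example.

```python
from urllib.parse import urlsplit

# The same connection string as in the bullet above.
url = "postgresql://postgres:postgres@db:5432/pagila"

parts = urlsplit(url)

# Layout: dialect://username:password@host:port/database
connection = {
    "dialect": parts.scheme,
    "username": parts.username,
    "password": parts.password,
    "host": parts.hostname,
    "port": parts.port,
    "database": parts.path.lstrip("/"),
}

for name, value in connection.items():
    print(f"{name:10} {value}")
```

Here `db` is the host name, which only resolves where a machine or container with that name exists; on a local installation the host is typically `127.0.0.1`.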


# STEP1 : Connect to the local database where Pagila is loaded

## 1.1 Create the pagila db and fill it with data
- Adding `"!"` at the beginning of a line in a Jupyter cell runs that line as a shell command, i.e. we are not running Python code but the `createdb` and `psql` PostgreSQL command-line utilities

In [1]:
# Force Jupyter to use the zsh shell
%env SHELL=/bin/zsh


env: SHELL=/bin/zsh


In [3]:
%load_ext sql

In [4]:
%sql postgresql://postgres:postgres@db:5432/pagila


Traceback (most recent call last):
  File "/Users/Moaze002/.local/share/virtualenvs/DataEngineering_Nano_degree-hrkbftWL/lib/python3.11/site-packages/sqlalchemy/engine/base.py", line 146, in __init__
    self._dbapi_connection = engine.raw_connection()
                             ^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/Moaze002/.local/share/virtualenvs/DataEngineering_Nano_degree-hrkbftWL/lib/python3.11/site-packages/sqlalchemy/engine/base.py", line 3302, in raw_connection
    return self.pool.connect()
           ^^^^^^^^^^^^^^^^^^^
  File "/Users/Moaze002/.local/share/virtualenvs/DataEngineering_Nano_degree-hrkbftWL/lib/python3.11/site-packages/sqlalchemy/pool/base.py", line 449, in connect
    return _ConnectionFairy._checkout(self)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/Moaze002/.local/share/virtualenvs/DataEngineering_Nano_degree-hrkbftWL/lib/python3.11/site-packages/sqlalchemy/pool/base.py", line 1263, in _checkout
    fairy = _ConnectionRecord.checkout(pool)


In [10]:
# !PGPASSWORD=admin createdb -h 127.0.0.1 -U Moaze002 pagila
# !PGPASSWORD=admin psql -q -h 127.0.0.1 -U Moaze002 -d pagila -f Data/pagila-schema.sql
# !PGPASSWORD=admin psql -q -h 127.0.0.1 -U Moaze002 -d pagila -f Data/pagila-data.sql
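
The three shell commands in the cell above can also be assembled in Python, which makes the role of each flag easier to see. The sketch below only builds and prints the commands and does not execute anything. The helper `build_load_commands` is a hypothetical name, and the host, user, and file paths are copied from the commented-out cell.

```python
import shlex


def build_load_commands(host, user, db, schema_file, data_file):
    """Return the createdb/psql invocations as argument lists (nothing is executed)."""
    # createdb: create an empty database named `db`
    createdb = ["createdb", "-h", host, "-U", user, db]
    # psql: -q quiet, -d target database, -f SQL file to run
    psql_base = ["psql", "-q", "-h", host, "-U", user, "-d", db, "-f"]
    return [createdb, psql_base + [schema_file], psql_base + [data_file]]


commands = build_load_commands(
    host="127.0.0.1",
    user="Moaze002",
    db="pagila",
    schema_file="Data/pagila-schema.sql",
    data_file="Data/pagila-data.sql",
)

for argv in commands:
    # shlex.join renders the list as a copy-pasteable shell command
    print(shlex.join(argv))
```

The order matters: the database has to exist before the schema is loaded, and the schema before the data. To actually run the commands, each list could be passed to `subprocess.run(argv, check=True)` with `PGPASSWORD` set in the environment.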

## 1.2 Connect to the newly created db

In [3]:
%load_ext sql

In [7]:
DB_ENDPOINT = "127.0.0.1"
DB = 'pagila'
DB_USER = 'Moaze002'
DB_PASSWORD = 'admin'
DB_PORT = '5432'

# postgresql://username:password@host:port/database
conn_string = "postgresql://{}:{}@{}:{}/{}" \
                        .format(DB_USER, DB_PASSWORD, DB_ENDPOINT, DB_PORT, DB)

print(conn_string)


postgresql://Moaze002:admin@127.0.0.1:5432/pagila


In [8]:
%sql $conn_string
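
The `.format` call above works because the user name and password contain no special characters. If a password contained characters such as `@`, `:` or `/`, the URL would be ambiguous, so those parts are usually percent-encoded first. A minimal sketch, where `make_conn_string` and the second password are made up for illustration:

```python
from urllib.parse import quote_plus


def make_conn_string(user, password, host, port, db):
    """Build a PostgreSQL URL, percent-encoding the user name and password."""
    return "postgresql://{}:{}@{}:{}/{}".format(
        quote_plus(user), quote_plus(password), host, port, db
    )


# Same values as in the cell above: nothing needs encoding here.
print(make_conn_string("Moaze002", "admin", "127.0.0.1", "5432", "pagila"))

# Hypothetical password with characters that must be encoded.
print(make_conn_string("Moaze002", "p@ss:word/1", "127.0.0.1", "5432", "pagila"))
```

With the exercise's credentials the result is identical to the `conn_string` printed earlier, so the encoding step is harmless when it is not needed.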

# STEP2 : Explore the 3NF Schema

<img src="./pagila-3nf.png" width="50%"/>

## 2.1 How much? What data sizes are we looking at?

In [9]:
nStores = %sql select count(*) from store;
nFilms = %sql select count(*) from film;
nCustomers = %sql select count(*) from customer;
nRentals = %sql select count(*) from rental;
nPayment = %sql select count(*) from payment;
nStaff = %sql select count(*) from staff;
nCity = %sql select count(*) from city;
nCountry = %sql select count(*) from country;

print("nFilms\t\t=", nFilms[0][0])
print("nCustomers\t=", nCustomers[0][0])
print("nRentals\t=", nRentals[0][0])
print("nPayment\t=", nPayment[0][0])
print("nStaff\t\t=", nStaff[0][0])
print("nStores\t\t=", nStores[0][0])
print("nCities\t\t=", nCity[0][0])
print("nCountry\t\t=", nCountry[0][0])

 * postgresql://Moaze002:***@127.0.0.1:5432/pagila
1 rows affected.
 * postgresql://Moaze002:***@127.0.0.1:5432/pagila
1 rows affected.
 * postgresql://Moaze002:***@127.0.0.1:5432/pagila
1 rows affected.
 * postgresql://Moaze002:***@127.0.0.1:5432/pagila
1 rows affected.
 * postgresql://Moaze002:***@127.0.0.1:5432/pagila
1 rows affected.
 * postgresql://Moaze002:***@127.0.0.1:5432/pagila
1 rows affected.
 * postgresql://Moaze002:***@127.0.0.1:5432/pagila
1 rows affected.
 * postgresql://Moaze002:***@127.0.0.1:5432/pagila
1 rows affected.
nFilms		= 1000
nCustomers	= 599
nRentals	= 16044
nPayment	= 16049
nStaff		= 1500
nStores		= 500
nCities		= 600
nCountry		= 109
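
The eight near-identical `%sql` lines above could also be written as a loop over table names. The sketch below shows the idea outside of ipython-sql, using an in-memory SQLite database as a stand-in so that it runs anywhere. The toy tables and their row counts are invented and are not the Pagila data, and `count_rows` is a hypothetical helper.

```python
import sqlite3

TABLES = ["store", "film", "customer", "rental", "payment", "staff", "city", "country"]


def count_rows(conn, tables):
    """Return {table: row count}.

    Table names cannot be bound as query parameters, so each name is
    checked against the known list before it is placed in the SQL text.
    """
    counts = {}
    for table in tables:
        if table not in TABLES:
            raise ValueError(f"unexpected table name: {table}")
        counts[table] = conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
    return counts


# Toy stand-in database: each table has one id column and a few rows.
conn = sqlite3.connect(":memory:")
for n, table in enumerate(TABLES, start=1):
    conn.execute(f"CREATE TABLE {table} (id INTEGER PRIMARY KEY)")
    conn.executemany(f"INSERT INTO {table} (id) VALUES (?)", [(i,) for i in range(n)])

counts = count_rows(conn, TABLES)
for table, n_rows in counts.items():
    print(f"{table:10} = {n_rows}")
```

Against the real database the same function should work with any DB-API connection whose `execute` returns a cursor; with a driver such as psycopg2 a cursor would have to be created first.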


## 2.2 When? What time period are we talking about?

In [11]:
%%sql 
select min(payment_date) as start, max(payment_date) as end from payment;

 * postgresql://Moaze002:***@127.0.0.1:5432/pagila
1 rows affected.


start,end
2022-01-23 14:03:52.212496+01:00,2022-07-27 12:39:20.739759+02:00
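
To turn the two timestamps into a duration, they can be parsed and subtracted in Python. This sketch hard-codes the values from the output above; both carry a UTC offset (`+01:00` and `+02:00`), which the subtraction takes into account.

```python
from datetime import datetime

# Values copied from the query output above.
start = datetime.fromisoformat("2022-01-23 14:03:52.212496+01:00")
end = datetime.fromisoformat("2022-07-27 12:39:20.739759+02:00")

# Subtracting two timezone-aware datetimes yields a timedelta.
span = end - start

print(span.days, "full days")
print(round(span.days / 30.44, 1), "months, approximately")
```

The payments therefore cover 184 full days, or roughly six months.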


## 2.3 Where? Where do events in this database occur?

Write a query that displays the number of addresses by district in the address table. Limit the table to the top 10 districts.

In [27]:
%%sql
select COUNT(address),district
from address
GROUP BY district
ORDER BY COUNT(address) DESC
LIMIT 10 

 * postgresql://Moaze002:***@127.0.0.1:5432/pagila
10 rows affected.


count,district
10,Buenos Aires
9,Shandong
9,California
9,West Bengali
8,Uttar Pradesh
8,So Paulo
7,England
7,Maharashtra
6,Southern Tagalog
5,Gois
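
Several districts in the result above share the same count, and `ORDER BY COUNT(address) DESC` alone does not specify how such ties are ordered. The sketch below shows one way to make the ordering deterministic by adding the district name as a second sort key. It runs against an in-memory SQLite table with invented rows, not against Pagila.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE address (address_id INTEGER PRIMARY KEY, address TEXT, district TEXT)"
)

# Invented sample rows: two districts tie with two addresses each.
rows = [
    ("1 Main St", "Alberta"),
    ("2 Oak Ave", "Alberta"),
    ("3 Pine Rd", "QLD"),
    ("4 Elm St", "QLD"),
    ("5 Bay Rd", "Texas"),
]
conn.executemany("INSERT INTO address (address, district) VALUES (?, ?)", rows)

query = """
    SELECT district, COUNT(*) AS n_addresses
    FROM address
    GROUP BY district
    ORDER BY n_addresses DESC, district ASC  -- second key breaks ties
    LIMIT 10
"""
top_districts = conn.execute(query).fetchall()

for district, n_addresses in top_districts:
    print(district, n_addresses)
```

Note that `COUNT(*)` counts all rows in a group, while `COUNT(address)` skips rows where `address` is NULL; the two agree whenever the column has no NULL values.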
