# Exercise 1 - Sakila Star Schema & ETL

All the database tables in this demo are based on public sample databases and transformations of them:
- `Sakila` is a sample database created by `MySQL` [Link](https://video.udacity-data.com/topher/2021/August/61120e06_pagila-3nf/pagila-3nf.png)
- The PostgreSQL version of it is called `Pagila` [Link](https://github.com/devrimgunduz/pagila)
- The design of the fact and dimension tables is based on O'Reilly's public dimensional modelling tutorial schema [Link](https://video.udacity-data.com/topher/2021/August/61120d38_pagila-star/pagila-star.png)

# STEP0: Using ipython-sql

- Load ipython-sql: `%load_ext sql`

- To execute SQL queries, write one of the following at the top of your cell:
    - `%sql`
        - For a one-line SQL query
        - You can access a Python variable using `$`
    - `%%sql`
        - For a multi-line SQL query
        - You can **NOT** access a Python variable using `$`


- Passing a connection string such as
`postgresql://postgres:postgres@db:5432/pagila` to `%sql` connects to the database
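
As a minimal sketch of how such a connection string is put together, the snippet below assembles it from its parts. The variable names are illustrative, and the values simply mirror the example string above; they are assumptions about your environment, not required settings. The magic commands appear only as comments because they run in a notebook cell, not in plain Python.

```python
# Connection details mirroring the example string above (adjust for your setup).
user, password = "postgres", "postgres"
host, port, dbname = "db", 5432, "pagila"

# Format: postgresql://<user>:<password>@<host>:<port>/<database>
conn_string = f"postgresql://{user}:{password}@{host}:{port}/{dbname}"
print(conn_string)

# In a notebook cell, the string would then be used like this:
#   %load_ext sql
#   %sql $conn_string
#   %sql SELECT count(*) FROM film;
```

Because `$` substitution works only with the one-line `%sql` form, the connection is opened with `%sql $conn_string` rather than inside a `%%sql` cell.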


# STEP1: Connect to the local database where Pagila is loaded

## 1.1 Create the pagila db and fill it with data
- Adding `"!"` at the beginning of a line in a Jupyter cell runs it as a shell command instead of Python code. Here that means running the `createdb` and `psql` PostgreSQL command-line utilities.
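
The shell commands themselves are not shown in this section, so the following is only a sketch of what they might look like. The helper `pagila_setup_commands` is hypothetical, and the host, user, and file names (`pagila-schema.sql` and `pagila-data.sql`, as found in the Pagila repository) are assumptions to adapt to your setup. The code only builds and prints the commands; it does not run them.

```python
import shlex


def pagila_setup_commands(
    host="127.0.0.1",                  # assumed host
    user="postgres",                   # assumed database user
    dbname="pagila",
    schema_file="pagila-schema.sql",   # assumed path to the schema script
    data_file="pagila-data.sql",       # assumed path to the data script
):
    """Return the argv lists that create the database and load schema and data."""
    return [
        ["createdb", "-h", host, "-U", user, dbname],
        ["psql", "-q", "-h", host, "-U", user, "-d", dbname, "-f", schema_file],
        ["psql", "-q", "-h", host, "-U", user, "-d", dbname, "-f", data_file],
    ]


for argv in pagila_setup_commands():
    # The leading "!" is what makes Jupyter hand the line to the shell.
    print("!" + shlex.join(argv))
```

Each printed line can be pasted into its own notebook cell. How the password is supplied (for example through the `PGPASSWORD` environment variable) depends on the local PostgreSQL configuration.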

## 1.2 Connect to the newly created db

In [None]:
from src.database import get_pg_connection, execute_pg_query, close_pg_connection
from IPython.display import display

conn = get_pg_connection(sample_pagila=True)
cur = conn.cursor()

# STEP2: Explore the 3NF Schema

![](../images/pagila-3nf.png)

## 2.1 How much? What data sizes are we looking at?

In [None]:
nStores = "select count(*) from store;"
nFilms = "select count(*) from film;"
nCustomers = "select count(*) from customer;"
nRentals = "select count(*) from rental;"
nPayment = "select count(*) from payment;"
nStaff = "select count(*) from staff;"
nCity = "select count(*) from city;"
nCountry = "select count(*) from country;"

print(
    "nFilms\t\t=",
    execute_pg_query(conn, nFilms, fetch=1).to_string(index=False, header=False),
)
print(
    "nCustomers\t=",
    execute_pg_query(conn, nCustomers, fetch=1).to_string(index=False, header=False),
)
print(
    "nRentals\t=",
    execute_pg_query(conn, nRentals, fetch=1).to_string(index=False, header=False),
)
print(
    "nPayment\t=",
    execute_pg_query(conn, nPayment, fetch=1).to_string(index=False, header=False),
)
print(
    "nStaff\t\t=",
    execute_pg_query(conn, nStaff, fetch=1).to_string(index=False, header=False),
)
print(
    "nStores\t\t=",
    execute_pg_query(conn, nStores, fetch=1).to_string(index=False, header=False),
)
print(
    "nCities\t\t=",
    execute_pg_query(conn, nCity, fetch=1).to_string(index=False, header=False),
)
print(
    "nCountry\t=",
    execute_pg_query(conn, nCountry, fetch=1).to_string(index=False, header=False),
)
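
The same counts could also be gathered in a loop rather than one query string per table. The sketch below illustrates the idea against an in-memory SQLite database with made-up rows, because the signature of `execute_pg_query` is specific to this project; `count_rows` and the toy tables are hypothetical stand-ins, not part of the exercise code.

```python
import sqlite3

# Toy stand-in for the Pagila connection: an in-memory SQLite database.
demo = sqlite3.connect(":memory:")
demo.executescript(
    """
    CREATE TABLE store (store_id INTEGER PRIMARY KEY);
    CREATE TABLE staff (staff_id INTEGER PRIMARY KEY);
    INSERT INTO store VALUES (1), (2);
    INSERT INTO staff VALUES (1), (2), (3);
    """
)


def count_rows(connection, tables):
    """Return {table: row count}. Table names come from a fixed list, never user input."""
    counts = {}
    for table in tables:
        # Identifiers cannot be bound as query parameters, hence the f-string.
        row = connection.execute(f"SELECT COUNT(*) FROM {table}").fetchone()
        counts[table] = row[0]
    return counts


counts = count_rows(demo, ["store", "staff"])
for table, n in counts.items():
    print(f"{table:<10} = {n}")
demo.close()
```

Keeping the table list in one place makes it harder to print the same count twice or to forget a table.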

## 2.2 When? What time period are we talking about?

In [None]:
result = execute_pg_query(
    conn, "select min(payment_date) as start, max(payment_date) as end from payment;"
)
display(result.style.set_caption("Payment Date Range").hide(axis="index"))

## 2.3 Where? Where do events in this database occur?
TODO: Write a query that displays the number of addresses by district in the address table. Limit the table to the top 10 districts. Your results should match the table below.

In [None]:
result = execute_pg_query(
    conn,
    """
SELECT
  district,
  COUNT(*) AS address_count
FROM address
GROUP BY district
ORDER BY address_count DESC, district ASC
LIMIT 10;
    """,
)
styled = (
    result.style.set_caption("Top 10 Districts by Address Count")
    .format({"address_count": "{:,}"})
    .hide(axis="index")
)
display(styled)

| district | n |
|---|---|
| Buenos Aires | 10 |
| California | 9 |
| Shandong | 9 |
| West Bengali | 9 |
| So Paulo | 8 |
| Uttar Pradesh | 8 |
| Maharashtra | 7 |
| England | 7 |
| Southern Tagalog | 6 |
| Punjab | 5 |
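
To make the `GROUP BY` / `ORDER BY` / `LIMIT` pattern concrete, here is a small sketch on made-up rows in an in-memory SQLite database (a stand-in, not the Pagila data). It also shows the effect of the secondary sort key: districts with equal counts come out alphabetically, so the order of tied rows can differ from a reference table that was produced without a tie-breaker.

```python
import sqlite3

demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE address (address_id INTEGER PRIMARY KEY, district TEXT)")

# Made-up sample: two districts tied at 2 addresses, one district with 1.
rows = [("Punjab",), ("England",), ("England",), ("Maharashtra",), ("Maharashtra",)]
demo.executemany("INSERT INTO address (district) VALUES (?)", rows)

top = demo.execute(
    """
    SELECT district, COUNT(*) AS address_count
    FROM address
    GROUP BY district
    ORDER BY address_count DESC, district ASC  -- ties broken alphabetically
    LIMIT 2
    """
).fetchall()
print(top)
demo.close()
```

When comparing against the expected table, check the counts per district rather than the exact order of rows that share a count.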

Close the connection

In [None]:
close_pg_connection(conn)