# Exercise 02 - OLAP Cubes - Roll Up and Drill Down

All the database tables in this demo are based on public sample databases and transformations of them:
- `Sakila` is a sample database created by `MySQL` [Link](https://video.udacity-data.com/topher/2021/August/61120e06_pagila-3nf/pagila-3nf.png)
- The PostgreSQL version of it is called `Pagila` [Link](https://github.com/devrimgunduz/pagila)
- The design of the fact and dimension tables is based on O'Reilly's public dimensional modelling tutorial schema [Link](https://video.udacity-data.com/topher/2021/August/61120e06_pagila-3nf/pagila-3nf.png)

Start by connecting to the database by running the cells below. If you are coming back to this exercise, then uncomment and run the first cell to recreate the database. If you recently completed the slicing and dicing exercise, then skip to the second cell.

In [None]:
%%capture 
# ^hide output of this cell (comment to debug)

# run step 5 (and 4) to create and populate tables (can take a couple of seconds)
%run "E1 - Step 5.ipynb"

# retrieve tables from step 4 & 5
%store -r tables 

from src.database import (
    get_pg_connection,
    execute_pg_query,
    create_pg_table,
    insert_pg_rows,
    drop_pg_table,
    close_pg_connection,
)
from IPython.display import display
import pandas as pd

# get active connection
conn = get_pg_connection(sample_pagila=True)

### Star Schema

![](./../images/pagila-star.png)

## Roll-up
- Stepping up the level of aggregation to a larger grouping
- e.g. `city` is summed up into `country`
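
To make the idea concrete before querying the star schema, here is a minimal sketch of a roll-up on toy data. It uses Python's built-in `sqlite3` module with an in-memory database, so it does not touch the Pagila connection. The table `toy_sales`, the variable `toy_con`, and all the rows are made up for illustration and are not part of the exercise data.

```python
import sqlite3

# Hypothetical toy rows: (city, country, sales_amount in whole units)
toy_rows = [
    ("Shanghai", "China", 10),
    ("Beijing", "China", 5),
    ("Mumbai", "India", 7),
    ("Delhi", "India", 4),
]

# In-memory database, separate from the notebook's PostgreSQL connection
toy_con = sqlite3.connect(":memory:")
toy_con.execute(
    "CREATE TABLE toy_sales (city TEXT, country TEXT, sales_amount INTEGER)"
)
toy_con.executemany("INSERT INTO toy_sales VALUES (?, ?, ?)", toy_rows)

# Fine-grained level: one total per city
by_city = toy_con.execute(
    "SELECT city, SUM(sales_amount) FROM toy_sales GROUP BY city ORDER BY city"
).fetchall()

# Roll-up: group by the coarser attribute, so city totals merge per country
by_country = toy_con.execute(
    "SELECT country, SUM(sales_amount) FROM toy_sales "
    "GROUP BY country ORDER BY country"
).fetchall()

print(by_city)     # [('Beijing', 5), ('Delhi', 4), ('Mumbai', 7), ('Shanghai', 10)]
print(by_country)  # [('China', 15), ('India', 11)]

toy_con.close()
```

Rolling up only changes the `GROUP BY` column to a coarser attribute. The result has fewer rows, and each value is the sum of the finer-grained values beneath it.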

TODO: Write a query that calculates revenue (sales_amount) by day, rating, and country. Sort the data by revenue in descending order, and limit the data to the top 20 results. The first few rows of your output should match the table below.

In [None]:
%%time
result = execute_pg_query(
    conn,
    """
SELECT
    dim_d.day,
    dim_m.rating,
    dim_c.country,
    SUM(s.sales_amount) AS revenue
FROM factSales s
JOIN dimDate dim_d ON s.date_key = dim_d.date_key
JOIN dimMovie dim_m ON s.movie_key = dim_m.movie_key
JOIN dimCustomer dim_c ON s.customer_key = dim_c.customer_key
GROUP BY dim_d.day, dim_m.rating, dim_c.country
ORDER BY revenue DESC
LIMIT 20
    """
)
display(result.style.set_caption("Top 20 Revenue by Day, Movie Rating, and Country").hide(axis="index"))

<div class="p-Widget jp-RenderedHTMLCommon jp-RenderedHTML jp-mod-trusted jp-OutputArea-output jp-OutputArea-executeResult" data-mime-type="text/html"><table>
    <tbody><tr>
        <th>day</th>
        <th>rating</th>
        <th>country</th>
        <th>revenue</th>
    </tr>
    <tr>
        <td>30</td>
        <td>G</td>
        <td>China</td>
        <td>169.67</td>
    </tr>
    <tr>
        <td>30</td>
        <td>PG</td>
        <td>India</td>
        <td>156.67</td>
    </tr>
    <tr>
        <td>30</td>
        <td>NC-17</td>
        <td>India</td>
        <td>153.64</td>
    </tr>
    <tr>
        <td>30</td>
        <td>PG-13</td>
        <td>China</td>
        <td>146.67</td>
    </tr>
    <tr>
        <td>30</td>
        <td>R</td>
        <td>China</td>
        <td>145.66</td>
    </tr>
</tbody></table></div>

## Drill-down
- Breaking one of the dimensions down to a lower, more detailed level
- e.g. `city` is broken up into `districts`
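
A drill-down can be sketched the same way, again on made-up data in an in-memory `sqlite3` database rather than the Pagila tables. The names `toy_sales` and `toy_con` and the rows are hypothetical. The sketch also checks a useful property: the finer-grained totals add back up to the coarser totals they were split from.

```python
import sqlite3

# Hypothetical toy rows: (country, district, sales_amount in whole units)
toy_rows = [
    ("China", "Shandong", 6),
    ("China", "Inner Mongolia", 9),
    ("India", "Punjab", 3),
    ("India", "Punjab", 2),
    ("India", "Kerala", 8),
]

toy_con = sqlite3.connect(":memory:")
toy_con.execute(
    "CREATE TABLE toy_sales (country TEXT, district TEXT, sales_amount INTEGER)"
)
toy_con.executemany("INSERT INTO toy_sales VALUES (?, ?, ?)", toy_rows)

# Coarse level: one total per country
country_totals = dict(
    toy_con.execute(
        "SELECT country, SUM(sales_amount) FROM toy_sales GROUP BY country"
    ).fetchall()
)

# Drill-down: add the finer attribute to the grouping
district_totals = toy_con.execute(
    "SELECT country, district, SUM(sales_amount) FROM toy_sales "
    "GROUP BY country, district ORDER BY country, district"
).fetchall()

# Sanity check: district totals recombine into the country totals
recombined = {}
for country, district, total in district_totals:
    recombined[country] = recombined.get(country, 0) + total

print(country_totals)   # {'China': 15, 'India': 13} (key order may vary)
print(district_totals)
print(recombined == country_totals)  # True

toy_con.close()
```

Drilling down yields more rows with smaller values per row, which is the same pattern visible when the `country` grouping is replaced by `district` in the exercise query.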

TODO: Write a query that calculates revenue (sales_amount) by day, rating, and district. Sort the data by revenue in descending order, and limit the data to the top 20 results. The first few rows of your output should match the table below.

In [None]:
%%time
result = execute_pg_query(
    conn,
    """
SELECT
    dim_d.day,
    dim_m.rating,
    dim_c.district,
    SUM(s.sales_amount) AS revenue
FROM factSales s
JOIN dimDate dim_d ON s.date_key = dim_d.date_key
JOIN dimMovie dim_m ON s.movie_key = dim_m.movie_key
JOIN dimCustomer dim_c ON s.customer_key = dim_c.customer_key
GROUP BY dim_d.day, dim_m.rating, dim_c.district
ORDER BY revenue DESC
LIMIT 20
    """
)
display(result.style.set_caption("Top 20 Revenue by Day, Movie Rating, and District").hide(axis="index"))

<div class="p-Widget jp-RenderedHTMLCommon jp-RenderedHTML jp-mod-trusted jp-OutputArea-output jp-OutputArea-executeResult" data-mime-type="text/html"><table>
    <tbody><tr>
        <th>day</th>
        <th>rating</th>
        <th>district</th>
        <th>revenue</th>
    </tr>
    <tr>
        <td>30</td>
        <td>PG-13</td>
        <td>Southern Tagalog</td>
        <td>53.88</td>
    </tr>
    <tr>
        <td>30</td>
        <td>G</td>
        <td>Inner Mongolia</td>
        <td>38.93</td>
    </tr>
    <tr>
        <td>30</td>
        <td>G</td>
        <td>Shandong</td>
        <td>36.93</td>
    </tr>
    <tr>
        <td>30</td>
        <td>NC-17</td>
        <td>West Bengali</td>
        <td>36.92</td>
    </tr>
    <tr>
        <td>17</td>
        <td>PG-13</td>
        <td>Shandong</td>
        <td>34.95</td>
    </tr>
</tbody></table></div>

In [None]:
for table in tables:
    drop_pg_table(conn, table, cascade=True)
close_pg_connection(conn)