# Exercise 02 - OLAP Cubes - Slicing and Dicing

All the database tables in this demo are based on public sample databases and transformations of them:
- `Sakila` is a sample database created by `MySQL` [Link](https://video.udacity-data.com/topher/2021/August/61120e06_pagila-3nf/pagila-3nf.png)
- The PostgreSQL version of it is called `Pagila` [Link](https://github.com/devrimgunduz/pagila)
- The fact and dimension table design is based on O'Reilly's public dimensional modelling tutorial schema [Link](https://video.udacity-data.com/topher/2021/August/61120d38_pagila-star/pagila-star.png)

Start by creating and connecting to the database by running the cells below.

In [None]:
%%capture
# ^hide output of this cell (comment to debug)

# run step 5 (and 4) to create and populate tables (can take a couple of seconds)
%run "E1 - Step 5.ipynb"

# retrieve tables from step 4 & 5
%store -r tables

from src.database import (
    get_pg_connection,
    execute_pg_query,
    create_pg_table,
    insert_pg_rows,
    drop_pg_table,
    close_pg_connection,
)
from IPython.display import display
import pandas as pd

# get active connection
conn = get_pg_connection("pagila")

### Star Schema

![](./../images/pagila-star.png)

## Start with a simple cube
TODO: Write a query that calculates the revenue (sales_amount) by day, rating, and city. Remember to join with the appropriate dimension tables to replace the keys with the dimension labels. Sort by revenue in descending order and limit to the first 20 rows. The first few rows of your output should match the table below.
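
To illustrate what "replacing the keys with the dimension labels" means, here is a minimal sketch that uses Python's built-in `sqlite3` module and a tiny made-up star schema. The table names (`fact_sales`, `dim_movie`, `dim_customer`) and every value are hypothetical and are not the Pagila tables. As a simplification, `day` is stored directly on the fact table instead of in a date dimension.

```python
import sqlite3

# Toy star schema with hypothetical data (NOT the Pagila tables).
con = sqlite3.connect(":memory:")
con.executescript(
    """
CREATE TABLE dim_movie (movie_key INTEGER PRIMARY KEY, rating TEXT);
CREATE TABLE dim_customer (customer_key INTEGER PRIMARY KEY, city TEXT);
CREATE TABLE fact_sales (
    movie_key INTEGER, customer_key INTEGER, day INTEGER, sales_amount REAL
);
INSERT INTO dim_movie VALUES (1, 'PG'), (2, 'PG-13');
INSERT INTO dim_customer VALUES (1, 'Bellevue'), (2, 'Lancaster');
INSERT INTO fact_sales VALUES
    (1, 1, 1, 2.0), (1, 1, 1, 3.0), (2, 2, 15, 4.0), (2, 1, 30, 1.0);
"""
)

# Join each fact row to its dimensions, then aggregate per (day, rating, city).
cube = con.execute(
    """
SELECT f.day, m.rating, c.city, SUM(f.sales_amount) AS revenue
FROM fact_sales f
JOIN dim_movie m ON f.movie_key = m.movie_key
JOIN dim_customer c ON f.customer_key = c.customer_key
GROUP BY f.day, m.rating, c.city
ORDER BY revenue DESC
"""
).fetchall()

for row in cube:
    print(row)
con.close()
```

The two fact rows for day 1 share the same rating and city, so they collapse into a single cube cell whose revenue is their sum.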

In [None]:
%%time
result = execute_pg_query(
    conn,
    """
SELECT
    dim_d.day,
    dim_m.rating,
    dim_c.city,
    SUM(s.sales_amount) AS revenue
FROM factSales s
JOIN dimDate dim_d ON s.date_key = dim_d.date_key
JOIN dimMovie dim_m ON s.movie_key = dim_m.movie_key
JOIN dimCustomer dim_c ON s.customer_key = dim_c.customer_key
GROUP BY dim_d.day, dim_m.rating, dim_c.city
ORDER BY revenue DESC
LIMIT 20
    """,
)
display(
    result.style.set_caption("Top 20 Revenue by Day, Movie Rating, and City").hide(
        axis="index"
    )
)

<div class="p-Widget jp-RenderedHTMLCommon jp-RenderedHTML jp-mod-trusted jp-OutputArea-output jp-OutputArea-executeResult" data-mime-type="text/html"><table>
    <tbody><tr>
        <th>day</th>
        <th>rating</th>
        <th>city</th>
        <th>revenue</th>
    </tr>
    <tr>
        <td>30</td>
        <td>G</td>
        <td>San Bernardino</td>
        <td>24.97</td>
    </tr>
    <tr>
        <td>30</td>
        <td>NC-17</td>
        <td>Apeldoorn</td>
        <td>23.95</td>
    </tr>
    <tr>
        <td>21</td>
        <td>NC-17</td>
        <td>Belm</td>
        <td>22.97</td>
    </tr>
    <tr>
        <td>30</td>
        <td>PG-13</td>
        <td>Zanzibar</td>
        <td>21.97</td>
    </tr>
    <tr>
        <td>28</td>
        <td>R</td>
        <td>Mwanza</td>
        <td>21.97</td>
    </tr>
</tbody></table></div>

## Slicing

Slicing reduces the dimensionality of a cube by one (e.g. from 3 dimensions to 2) by fixing one of the dimensions to a single value. In the example above, we have a 3-dimensional cube on day, rating, and city.
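
The drop in dimensionality is easiest to see on a small in-memory cube. The sketch below models a cube as a plain Python dict keyed by `(day, rating, city)`. The helper `slice_cube` and all values are hypothetical and only meant to illustrate the idea, not how the database does it.

```python
# Toy 3-D cube: (day, rating, city) -> revenue. Values are made up.
cube = {
    (1, "PG", "Bellevue"): 5.0,
    (1, "PG-13", "Bellevue"): 2.0,
    (15, "PG-13", "Lancaster"): 4.0,
    (30, "R", "Lancaster"): 7.0,
}


def slice_cube(cube, rating):
    """Fix the rating dimension to one value; keys shrink to (day, city)."""
    return {
        (day, city): revenue
        for (day, cell_rating, city), revenue in cube.items()
        if cell_rating == rating
    }


pg13 = slice_cube(cube, "PG-13")
print(pg13)
```

The resulting keys have two components instead of three. A SQL slice that filters on `rating` may still select the `rating` column, but it holds the same value in every row, so it no longer acts as a dimension.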

TODO: Write a query that reduces the dimensionality of the above example by limiting the results to only include movies with a `rating` of "PG-13". Again, sort by revenue in descending order and limit to the first 20 rows. The first few rows of your output should match the table below. 

In [None]:
%%time
result = execute_pg_query(
    conn,
    """
SELECT
    dim_d.day,
    dim_m.rating,
    dim_c.city,
    SUM(s.sales_amount) AS revenue
FROM factSales s
JOIN dimDate dim_d ON s.date_key = dim_d.date_key
JOIN dimMovie dim_m ON s.movie_key = dim_m.movie_key
JOIN dimCustomer dim_c ON s.customer_key = dim_c.customer_key
WHERE dim_m.rating = 'PG-13'
GROUP BY dim_d.day, dim_m.rating, dim_c.city
ORDER BY revenue DESC
LIMIT 20
    """,
)
display(
    result.style.set_caption(
        "Top 20 Revenue by Day, Movie Rating, and City (PG-13)"
    ).hide(axis="index")
)

<div class="p-Widget jp-RenderedHTMLCommon jp-RenderedHTML jp-mod-trusted jp-OutputArea-output jp-OutputArea-executeResult" data-mime-type="text/html"><table>
    <tbody><tr>
        <th>day</th>
        <th>rating</th>
        <th>city</th>
        <th>revenue</th>
    </tr>
    <tr>
        <td>30</td>
        <td>PG-13</td>
        <td>Zanzibar</td>
        <td>21.97</td>
    </tr>
    <tr>
        <td>28</td>
        <td>PG-13</td>
        <td>Dhaka</td>
        <td>19.97</td>
    </tr>
    <tr>
        <td>29</td>
        <td>PG-13</td>
        <td>Shimoga</td>
        <td>18.97</td>
    </tr>
    <tr>
        <td>30</td>
        <td>PG-13</td>
        <td>Osmaniye</td>
        <td>18.97</td>
    </tr>
    <tr>
        <td>21</td>
        <td>PG-13</td>
        <td>Asuncin</td>
        <td>18.95</td>
    </tr>
</tbody></table></div>

## Dicing
Dicing creates a subcube with the same dimensionality but fewer values for two or more dimensions.
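
As a rough sketch of the idea, the snippet below dices a tiny pre-aggregated cube stored in an in-memory SQLite table. The table name `cube` and its rows are hypothetical. Each dimension is restricted to a set of allowed values with `IN`, and all three dimension columns remain in the result.

```python
import sqlite3

# Toy pre-aggregated cube with made-up rows (NOT the Pagila data).
con = sqlite3.connect(":memory:")
con.executescript(
    """
CREATE TABLE cube (day INTEGER, rating TEXT, city TEXT, revenue REAL);
INSERT INTO cube VALUES
    (1,  'PG',    'Bellevue',  5.0),
    (2,  'PG',    'Bellevue',  9.0),
    (15, 'PG-13', 'Lancaster', 4.0),
    (30, 'R',     'Lancaster', 7.0),
    (30, 'PG-13', 'Zanzibar',  8.0);
"""
)

# Dice: keep a subset of values on every dimension; dimensionality stays at 3.
subcube = con.execute(
    """
SELECT day, rating, city, revenue
FROM cube
WHERE rating IN ('PG', 'PG-13')
  AND city IN ('Bellevue', 'Lancaster')
  AND day IN (1, 15, 30)
ORDER BY revenue DESC
"""
).fetchall()

print(subcube)
con.close()
```

Rows are dropped as soon as they fail any one of the three filters: day 2 is not an allowed day, `R` is not an allowed rating, and Zanzibar is not an allowed city.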

TODO: Write a query to create a subcube of the initial cube that includes only sales with:
* a rating of PG or PG-13
* a city of Bellevue or Lancaster
* a day equal to 1, 15, or 30

The first few rows of your output should match the table below. 

In [None]:
%%time
result = execute_pg_query(
    conn,
    """
SELECT
    dim_d.day,
    dim_m.rating,
    dim_c.city,
    SUM(s.sales_amount) AS revenue
FROM factSales s
JOIN dimDate dim_d ON s.date_key = dim_d.date_key
JOIN dimMovie dim_m ON s.movie_key = dim_m.movie_key
JOIN dimCustomer dim_c ON s.customer_key = dim_c.customer_key
WHERE dim_m.rating IN ('PG-13', 'PG')
AND dim_c.city IN ('Bellevue', 'Lancaster')
AND dim_d.day IN (1, 15, 30)
GROUP BY dim_d.day, dim_m.rating, dim_c.city
ORDER BY revenue DESC
LIMIT 20
    """,
)
display(
    result.style.set_caption(
        "Top 20 Revenue by Day, Movie Rating, and City (Diced Subcube)"
    ).hide(axis="index")
)

<div class="p-Widget jp-RenderedHTMLCommon jp-RenderedHTML jp-mod-trusted jp-OutputArea-output jp-OutputArea-executeResult" data-mime-type="text/html"><table>
    <tbody><tr>
        <th>day</th>
        <th>rating</th>
        <th>city</th>
        <th>revenue</th>
    </tr>
    <tr>
        <td>30</td>
        <td>PG</td>
        <td>Lancaster</td>
        <td>12.98</td>
    </tr>
    <tr>
        <td>1</td>
        <td>PG-13</td>
        <td>Lancaster</td>
        <td>5.99</td>
    </tr>
    <tr>
        <td>30</td>
        <td>PG-13</td>
        <td>Bellevue</td>
        <td>3.99</td>
    </tr>
    <tr>
        <td>30</td>
        <td>PG-13</td>
        <td>Lancaster</td>
        <td>2.99</td>
    </tr>
    <tr>
        <td>15</td>
        <td>PG-13</td>
        <td>Bellevue</td>
        <td>1.98</td>
    </tr>
</tbody></table></div>

In [None]:
# clean up: drop the tables retrieved from steps 4 & 5, then close the connection
for table in tables:
    drop_pg_table(conn, table, cascade=True)
close_pg_connection(conn)