# <center>Big Data &ndash; Exercises</center>
## <center>Fall 2021 &ndash; Week 13 &ndash; ETH Zurich</center>
## <center>Cubes</center>

## Introduction

In this exercise, we analyze the sales data of a fictitious wholesale supplier
(taken from the database system benchmark [TPC-H](http://www.tpc.org/tpch/))
in our favorite spreadsheet application. Then, we will use SQL to query the data in the shape of a cube.
A cube is a collection of numeric data organized by arrays of discrete identifiers (Janus and Fouché, 2009). As we saw in the lectures, it is quite natural to map cubes to tables.
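
To make the mapping from cubes to tables concrete, here is a minimal Python sketch. It uses made-up numbers rather than TPC-H data, and the names `cube` and `cube_to_rows` are purely illustrative. The cube is stored as a dictionary keyed by dimension coordinates and is flattened into table rows with one column per dimension plus one column for the measure.

```python
# A tiny 2-dimensional cube: (year, region) -> revenue.
# The values are made up for illustration only.
cube = {
    (1995, "AFRICA"): 10.0,
    (1995, "ASIA"): 20.0,
    (1996, "AFRICA"): 30.0,
}

def cube_to_rows(cube, dimensions, measure):
    """Flatten a cube into a list of rows (dicts), one row per cell."""
    rows = []
    for coordinates, value in cube.items():
        row = dict(zip(dimensions, coordinates))  # one column per dimension
        row[measure] = value                      # plus one column for the measure
        rows.append(row)
    return rows

rows = cube_to_rows(cube, ["year", "region"], "revenue")
for row in rows:
    print(row)
```

Each printed row corresponds to one cell of the cube, which is the shape a fact table has in a relational database.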

## 1. The TPC-H Dataset as OLAP Cube

Let us get familiar with the dataset.
It consists of orders; each order is placed by a customer and consists of lineitems.
Think of an order as a shopping cart with several items in it.
The items of an order are parts that may be provided by different suppliers.
Suppliers and customers come from different nations, which are grouped into regions of the world.
The following figure illustrates the schema of the TPC-H dataset.
<br>

![Schema of the TPC-H dataset](https://cloud.inf.ethz.ch/s/MNY8ksxgX78zf9a/download)


### Task 1

1. Which table(s) of the TPC-H schema is/are the fact table(s)?
It depends on what you are interested in. Any table can serve as a fact table. If you are interested in orders, then orders is the fact table, and its dimensions connect to the other tables (star schema, or even snowflake schema).

1. What is/are the measure(s)?
E.g., in lineitems:
quantity, tax, price... anything that can be aggregated. Shipmode is probably not a measure, since it is categorical data.
1. What are the dimensions?
linenumber, date, shipdate, commitdate... Anything except measures and keys. Primary keys and foreign keys are not categorical data; they only establish the relationships between tables.
1. What do you call this flavor of OLAP?
ROLAP

### Solutions

...

## 2. Analyzing TPC-H with a Pivot Table

Download the [OLAP Cubes Excel file](https://cloud.inf.ethz.ch/s/rJKGqkHxbAtYQog/download)
and open it with your favorite spreadsheet application.
The file contains a universal table (a fully denormalized table) of a small TPC-H dataset.
The schema has been modified slightly to make analysis in a spreadsheet application easier:
two precomputed measures, revenue and cost,
as well as the levels of the time hierarchy of the attribute *orderdate* have been added in materialized form,
and some other attributes have been removed.

You may need to look up how to use pivot tables in your spreadsheet application.

1. Microsoft Excel: [PivotTable](https://support.office.com/en-us/article/Create-a-PivotTable-to-analyze-worksheet-data-a9a84538-bfe9-40a9-a8e9-f99134456576)
1. Google Sheets: [pivot tables](https://support.google.com/docs/answer/1272900?co=GENIE.Platform%3DDesktop&hl=en)
1. Open Office: [DataPilot](https://openoffice.blogs.com/openoffice/2006/11/data_pilots_in_.html)

### Task 1: Discussion

Discuss the terms "slice and dice", "drill down", "roll up", and "pivoting".

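pivot: The values of a column in one table become the names of columns in another table.

dice: SELECT (which dimensions to use); a projection.

slice: WHERE (restrict a dimension to a value); a selection.

drill down: move to a finer level of a dimension hierarchy, e.g., from year to quarter.

roll up: aggregate to a coarser level of a dimension hierarchy, e.g., from quarter to year.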
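
These operations can be tried out on a handful of rows. The following Python sketch uses made-up fact rows (not TPC-H data) and a hypothetical helper `aggregate` that behaves like a `GROUP BY` with `SUM`. It slices on one nation, rolls up to the year level, and drills down to the quarter level.

```python
from collections import defaultdict

# Toy fact rows (made-up values): dimensions year, quarter, nation; measure revenue.
facts = [
    {"year": 1995, "quarter": 1, "nation": "KENYA", "revenue": 5.0},
    {"year": 1995, "quarter": 2, "nation": "KENYA", "revenue": 7.0},
    {"year": 1995, "quarter": 1, "nation": "CHINA", "revenue": 3.0},
    {"year": 1996, "quarter": 1, "nation": "KENYA", "revenue": 4.0},
]

def aggregate(rows, dims):
    """Sum revenue per combination of the given dimensions (like GROUP BY)."""
    totals = defaultdict(float)
    for row in rows:
        totals[tuple(row[d] for d in dims)] += row["revenue"]
    return dict(totals)

# Slice: fix one dimension to a single value (like WHERE nation = 'KENYA').
kenya = [row for row in facts if row["nation"] == "KENYA"]

# Roll up: aggregate at the coarser level of the time hierarchy (year).
by_year = aggregate(kenya, ["year"])

# Drill down: move to the finer level (quarter within year).
by_quarter = aggregate(kenya, ["year", "quarter"])

print(by_year)
print(by_quarter)
```

Choosing the list passed as `dims` decides which dimensions appear in the result, while the list comprehension plays the role of the `WHERE` clause.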

### Solution: Discussion

...

### Task 2: Create the following pivot tables:

1. Show how much revenue suppliers from different regions (as columns) produced in every year (as rows).
1. Show how much revenue suppliers from nations of Africa (as columns) produced in every year (as rows).
1. Show how much revenue suppliers from nations of Africa produced in every quarter of every year.
1. Show how much revenue suppliers from nations of Africa produced in every week of every month of Q1 in 1996.
1. Show how much revenue suppliers from nations of Africa produced in every year with "urgent" orders.
1. Show the average order quantity for parts from suppliers from nations of Africa per year.
1. Show how much revenue suppliers from nations of Africa (as rows) produced in every quarter of every year (as columns).

### Solution:
...

## 3. OLAP Cubes and SQL

### Part 1: SQL

Write SQL queries for the pivot tables from Task 2 of Section 2.

#### Notes

* Assume that the revenue is calculated as `olquantity * partretailprice * (1-oldiscount)`.
* To get the year or quarter from a date in PostgreSQL, you can use [`DATE_PART('field', date)`](https://www.postgresqltutorial.com/postgresql-date_part/). The field name is case-insensitive: `DATE_PART('YEAR', date)` and `DATE_PART('year', date)` are equivalent.
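
As a cross-check outside the database, the two notes above can be mirrored in plain Python. This is only a sketch: the function names `revenue` and `date_parts` are hypothetical, and the week computation assumes that PostgreSQL's `'week'` field is the ISO 8601 week number, which is what `date.isocalendar()` returns. The sample values are taken from rows of the cube preview shown later in this notebook.

```python
from datetime import date

def revenue(olquantity, partretailprice, oldiscount):
    """Revenue of one order line, following the formula in the notes."""
    return olquantity * partretailprice * (1 - oldiscount)

def date_parts(d):
    """Year, quarter, month and ISO week of a date."""
    return {
        "year": d.year,
        "quarter": (d.month - 1) // 3 + 1,  # months 1-3 -> 1, 4-6 -> 2, ...
        "month": d.month,
        "week": d.isocalendar()[1],         # ISO 8601 week number
    }

# First row of the cube preview: quantity 17.0, price 1453.55, discount 0.04.
print(revenue(17.0, 1453.55, 0.04))

# An order date from the same preview.
print(date_parts(date(1996, 12, 1)))
```

Up to floating-point rounding, the first value agrees with the `revenue` column of that row, and the date parts agree with its `orderyear`, `orderquarter`, `ordermonth`, and `orderweek` columns.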

#### Database Set-up

Please wait for the message `PostgreSQL init process complete; ready for start up` before proceeding!
As before, we set up our connection to the database and enable use of `%sql` and `%%sql`.

In [8]:
server  ='postgres'
user    ='postgres'
password='bigdataclass'
database='tpch-db'

connection_string = f'postgresql://{user}:{password}@{server}:5432/{database}'

In [9]:
%reload_ext sql
%sql $connection_string

Check the tables in TPC-H. They are empty for the moment.

In [10]:
%%sql 
SELECT table_name
FROM information_schema.tables
WHERE table_schema = 'public';

 * postgresql://postgres:***@postgres:5432/tpch-db
8 rows affected.


table_name
region
nation
supplier
part
supplypart
customer
orders
orderline


Populate the tables in TPC-H with data from the `.tbl` files.

In [11]:
import numpy as np
import os
import pandas
import sqlalchemy
import sys

tables = [ # Order is important because of FKs
        'region',
        'nation',
        'supplier',
        'part',
        'supplypart',
        'customer',
        'orders',
        'orderline'
        ]

engine = sqlalchemy.create_engine(connection_string)

for table in tables:
    # Find column names
    columns = engine.execute('SELECT * FROM {0}'.format(table)).keys()

    # Load content
    data = pandas.read_csv('docker/postgres/tpch/data/{0}.tbl'.format(table), sep='|', header=None, names=columns)
    msg = 'Loading table "{0}": {1}% done\r'
    for idx, chunk in enumerate(np.array_split(data, 100)):
        sys.stdout.write(msg.format(table, idx))
        chunk.to_sql(name=table, if_exists='append', con=engine, index=False, method='multi')
    print(msg.format(table, str(100)))

Loading table "region": 0% done

IntegrityError: (psycopg2.errors.UniqueViolation) duplicate key value violates unique constraint "region_pkey"
DETAIL:  Key (regionid)=(0) already exists.

[SQL: INSERT INTO region (regionid, regionname) VALUES (%(regionid_m0)s, %(regionname_m0)s)]
[parameters: {'regionid_m0': 0, 'regionname_m0': 'AFRICA'}]
(Background on this error at: https://sqlalche.me/e/14/gkpj)
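
The `IntegrityError` above reports that the key `regionid = 0` already exists, which is what one would expect when the loading cell runs against tables that already hold data (an assumption about this particular run). The sketch below reproduces the situation with Python's built-in `sqlite3` module as a stand-in for PostgreSQL and shows one possible guard: skip the table if it is not empty. The helper `load_once` is hypothetical and not part of the exercise code.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # in-memory stand-in for the PostgreSQL database
conn.execute("CREATE TABLE region (regionid INTEGER PRIMARY KEY, regionname TEXT)")

def load_once(conn, rows):
    """Insert rows into region only if the table is still empty; return rows inserted."""
    (count,) = conn.execute("SELECT COUNT(*) FROM region").fetchone()
    if count > 0:
        return 0  # already populated: skip instead of violating the primary key
    conn.executemany("INSERT INTO region VALUES (?, ?)", rows)
    conn.commit()
    return len(rows)

rows = [(0, "AFRICA"), (1, "AMERICA")]
first = load_once(conn, rows)   # inserts both rows
second = load_once(conn, rows)  # table is populated, nothing is inserted

# Without the guard, inserting the same key again fails, as in the traceback above.
try:
    conn.execute("INSERT INTO region VALUES (0, 'AFRICA')")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

print(first, second, duplicate_rejected)
```

If the tables are already populated, the loading cell does not need to be run again before continuing with the queries.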

First, however, we define the fact table using a `WITH` clause (copy it to the beginning of every other query).

In [13]:
%%sql
WITH cube AS (
    SELECT olquantity, partretailprice, oldiscount,
           orderdate, nationname, regionname, orderpriority,
           olquantity * partretailprice * (1-oldiscount) AS revenue,
           DATE_PART('year', orderdate) AS orderyear,
           DATE_PART('quarter', orderdate) AS orderquarter,
           DATE_PART('month', orderdate) AS ordermonth,
           DATE_PART('week', orderdate) AS orderweek
    FROM orderline ol
    JOIN orders o      ON ol.orderid = o.orderid
    JOIN supplypart sp ON ol.partId = sp.partId AND ol.supplierId = sp.supplierId
    JOIN part p        ON sp.partId = p.partId
    JOIN supplier s    ON sp.supplierId = s.supplierId
    JOIN nation sn     ON s.nationId = sn.nationId
    JOIN region sr     ON sn.regionId = sr.regionId
)
SELECT * FROM cube LIMIT 10

 * postgresql://postgres:***@postgres:5432/tpch-db
10 rows affected.


olquantity,partretailprice,oldiscount,orderdate,nationname,regionname,orderpriority,revenue,orderyear,orderquarter,ordermonth,orderweek
17.0,1453.55,0.04,1996-01-02,MOZAMBIQUE,AFRICA,5-LOW,23721.936,1996.0,1.0,1.0,1.0
36.0,1574.67,0.09,1996-01-02,CHINA,ASIA,5-LOW,51586.1892,1996.0,1.0,1.0,1.0
8.0,1537.63,0.1,1996-01-02,EGYPT,MIDDLE EAST,5-LOW,11070.936,1996.0,1.0,1.0,1.0
28.0,922.02,0.09,1996-01-02,KENYA,AFRICA,5-LOW,23493.0696,1996.0,1.0,1.0,1.0
24.0,1141.24,0.1,1996-01-02,INDONESIA,ASIA,5-LOW,24650.784,1996.0,1.0,1.0,1.0
32.0,1057.15,0.07,1996-01-02,UNITED STATES,AMERICA,5-LOW,31460.784,1996.0,1.0,1.0,1.0
38.0,963.06,0.0,1996-12-01,GERMANY,EUROPE,1-URGENT,36596.28,1996.0,4.0,12.0,48.0
45.0,943.04,0.06,1993-10-14,UNITED STATES,AMERICA,5-LOW,39890.592,1993.0,4.0,10.0,41.0
49.0,1091.19,0.1,1993-10-14,FRANCE,EUROPE,5-LOW,48121.479,1993.0,4.0,10.0,41.0
27.0,1186.28,0.06,1993-10-14,INDIA,ASIA,5-LOW,30107.7864,1993.0,4.0,10.0,41.0


Note that, for the purpose of this exercise, we dropped some dimensions of the cube because none of the queries uses them. Also, we materialize some hierarchy levels of the `orderdate` dimension in order to make the subsequent queries more readable. This makes them *look* like they were new dimensions -- conceptually, they are not! (They are, well, levels of a hierarchy of the `orderdate` dimension.)

OK, you are good to go. Use the SQL cell below and add more cells as you need.

Note that the numbers you obtain with the SQL queries will not be identical to those in the pivot tables from Task 2, because the fact table in the database contains more rows.
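
Copying the `WITH` clause into every cell by hand is error-prone. One possible alternative, sketched here, keeps the clause in a Python string and prepends it to each `SELECT`. The names `CUBE_CTE` and `with_cube` are hypothetical, and executing the resulting string against the database is left out because it depends on the setup.

```python
# The WITH clause from the cell above, stored once as a Python string.
CUBE_CTE = """WITH cube AS (
    SELECT olquantity, partretailprice, oldiscount,
           orderdate, nationname, regionname, orderpriority,
           olquantity * partretailprice * (1-oldiscount) AS revenue,
           DATE_PART('year', orderdate) AS orderyear,
           DATE_PART('quarter', orderdate) AS orderquarter,
           DATE_PART('month', orderdate) AS ordermonth,
           DATE_PART('week', orderdate) AS orderweek
    FROM orderline ol
    JOIN orders o      ON ol.orderid = o.orderid
    JOIN supplypart sp ON ol.partId = sp.partId AND ol.supplierId = sp.supplierId
    JOIN part p        ON sp.partId = p.partId
    JOIN supplier s    ON sp.supplierId = s.supplierId
    JOIN nation sn     ON s.nationId = sn.nationId
    JOIN region sr     ON sn.regionId = sr.regionId
)"""

def with_cube(select_statement):
    """Return a complete query: the shared WITH clause followed by the SELECT."""
    return CUBE_CTE + "\n" + select_statement.strip()

query = with_cube("SELECT * FROM cube LIMIT 10")
print(query)
```

The resulting string could then be handed to any client that accepts a SQL string, for example `pandas.read_sql(query, engine)` with the `engine` created earlier, assuming that connection is available.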

#### Your Answers

#### 1. Show how much revenue suppliers from different regions (as columns) produced in every year (as rows).

In [17]:
%%sql
WITH cube AS (
    SELECT olquantity, partretailprice, oldiscount,
           orderdate, nationname, regionname, orderpriority,
           olquantity * partretailprice * (1-oldiscount) AS revenue,
           DATE_PART('year', orderdate) AS orderyear,
           DATE_PART('quarter', orderdate) AS orderquarter,
           DATE_PART('month', orderdate) AS ordermonth,
           DATE_PART('week', orderdate) AS orderweek
    FROM orderline ol
    JOIN orders o      ON ol.orderid = o.orderid
    JOIN supplypart sp ON ol.partId = sp.partId AND ol.supplierId = sp.supplierId
    JOIN part p        ON sp.partId = p.partId
    JOIN supplier s    ON sp.supplierId = s.supplierId
    JOIN nation sn     ON s.nationId = sn.nationId
    JOIN region sr     ON sn.regionId = sr.regionId
)
SELECT orderyear, regionname, SUM(revenue)
FROM cube
GROUP BY CUBE(orderyear, regionname)
ORDER BY orderyear, regionname

 * postgresql://postgres:***@postgres:5432/tpch-db
48 rows affected.


orderyear,regionname,sum
1992.0,AFRICA,65268216.7295
1992.0,AMERICA,62430452.5656
1992.0,ASIA,83435906.5716
1992.0,EUROPE,62009679.643
1992.0,MIDDLE EAST,35338119.828
1992.0,,308482375.3377
1993.0,AFRICA,66164509.0978
1993.0,AMERICA,64204286.247
1993.0,ASIA,85616278.5117
1993.0,EUROPE,64131945.3828


#### 2. Show how much revenue suppliers from nations of Africa produced in every year.

In [20]:
%%sql
WITH cube AS (
    SELECT olquantity, partretailprice, oldiscount,
           orderdate, nationname, regionname, orderpriority,
           olquantity * partretailprice * (1-oldiscount) AS revenue,
           DATE_PART('year', orderdate) AS orderyear,
           DATE_PART('quarter', orderdate) AS orderquarter,
           DATE_PART('month', orderdate) AS ordermonth,
           DATE_PART('week', orderdate) AS orderweek
    FROM orderline ol
    JOIN orders o      ON ol.orderid = o.orderid
    JOIN supplypart sp ON ol.partId = sp.partId AND ol.supplierId = sp.supplierId
    JOIN part p        ON sp.partId = p.partId
    JOIN supplier s    ON sp.supplierId = s.supplierId
    JOIN nation sn     ON s.nationId = sn.nationId
    JOIN region sr     ON sn.regionId = sr.regionId
)
SELECT orderyear, nationname, SUM(revenue)
FROM cube
WHERE regionname = 'AFRICA'
GROUP BY CUBE( orderyear, nationname)
ORDER BY orderyear, nationname

 * postgresql://postgres:***@postgres:5432/tpch-db
48 rows affected.


orderyear,nationname,sum
1992.0,ALGERIA,8568744.1201
1992.0,ETHIOPIA,9077948.018
1992.0,KENYA,18844592.312
1992.0,MOROCCO,5808581.6439
1992.0,MOZAMBIQUE,22968350.6355
1992.0,,65268216.7295
1993.0,ALGERIA,9816982.5956
1993.0,ETHIOPIA,8537000.0214
1993.0,KENYA,18350385.061
1993.0,MOROCCO,6731663.7897


#### 3. Show how much revenue suppliers from nations of Africa produced in every quarter of every year.

In [23]:
%%sql
WITH cube AS (
    SELECT olquantity, partretailprice, oldiscount,
           orderdate, nationname, regionname, orderpriority,
           olquantity * partretailprice * (1-oldiscount) AS revenue,
           DATE_PART('YEAR', orderdate) AS orderyear,
           DATE_PART('QUARTER', orderdate) AS orderquarter,
           DATE_PART('MONTH', orderdate) AS ordermonth,
           DATE_PART('WEEK', orderdate) AS orderweek
    FROM orderline ol
    JOIN orders o      ON ol.orderid = o.orderid
    JOIN supplypart sp ON ol.partId = sp.partId AND ol.supplierId = sp.supplierId
    JOIN part p        ON sp.partId = p.partId
    JOIN supplier s    ON sp.supplierId = s.supplierId
    JOIN nation sn     ON s.nationId = sn.nationId
    JOIN region sr     ON sn.regionId = sr.regionId
)
SELECT orderyear, orderquarter, nationname, SUM(revenue)
FROM cube
WHERE regionname = 'AFRICA'
GROUP BY CUBE(orderyear, orderquarter, nationname)
ORDER BY orderyear, orderquarter, nationname

 * postgresql://postgres:***@postgres:5432/tpch-db
234 rows affected.


orderyear,orderquarter,nationname,sum
1992.0,1.0,ALGERIA,1958124.794
1992.0,1.0,ETHIOPIA,2268245.9869
1992.0,1.0,KENYA,4728310.6342
1992.0,1.0,MOROCCO,1397538.0375
1992.0,1.0,MOZAMBIQUE,6965242.8761
1992.0,1.0,,17317462.3287
1992.0,2.0,ALGERIA,2375436.3316
1992.0,2.0,ETHIOPIA,2555879.4404
1992.0,2.0,KENYA,4904382.2641
1992.0,2.0,MOROCCO,1355226.78


#### 4. Show how much revenue suppliers from nations of Africa produced in every week of every month of Q1 in 1996.

Note that `orderweek` is from a different hierarchy of the `orderdate` dimension than `orderquarter` and `ordermonth` because a week does not generally belong to only one quarter or month. (However, a month always belongs to exactly one quarter.) This does not change anything in the SQL query below, but is an important conceptual subtlety of cubes.
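
The point about weeks can be checked with Python's `datetime` module. This is a small sketch with a hypothetical helper `months_of_iso_week`; it assumes ISO 8601 week numbering, which is also what PostgreSQL's `'week'` field uses, and it needs Python 3.8 or later for `date.fromisocalendar`.

```python
from datetime import date, timedelta

def months_of_iso_week(year, week):
    """Return the sorted list of calendar months touched by the given ISO week."""
    monday = date.fromisocalendar(year, week, 1)  # Monday of that ISO week
    days = [monday + timedelta(days=offset) for offset in range(7)]
    return sorted({d.month for d in days})

# Week 5 of 1996 starts in January and ends in February.
print(months_of_iso_week(1996, 5))

# Week 40 of 1996 starts in September (Q3) and ends in October (Q4).
print(months_of_iso_week(1996, 40))
```

This is why week 5 shows up under both month 1 and month 2 in the query result below.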

In [26]:
%%sql
WITH cube AS (
    SELECT olquantity, partretailprice, oldiscount,
           orderdate, nationname, regionname, orderpriority,
           olquantity * partretailprice * (1-oldiscount) AS revenue,
           DATE_PART('year', orderdate) AS orderyear,
           DATE_PART('quarter', orderdate) AS orderquarter,
           DATE_PART('month', orderdate) AS ordermonth,
           DATE_PART('week', orderdate) AS orderweek
    FROM orderline ol
    JOIN orders o      ON ol.orderid = o.orderid
    JOIN supplypart sp ON ol.partId = sp.partId AND ol.supplierId = sp.supplierId
    JOIN part p        ON sp.partId = p.partId
    JOIN supplier s    ON sp.supplierId = s.supplierId
    JOIN nation sn     ON s.nationId = sn.nationId
    JOIN region sr     ON sn.regionId = sr.regionId
)
SELECT ordermonth, orderweek, nationname, SUM(revenue)
FROM cube
WHERE regionname = 'AFRICA' AND orderquarter = 1 AND orderyear = 1996
GROUP BY CUBE(ordermonth, orderweek, nationname)
ORDER BY nationname, ordermonth, orderweek

 * postgresql://postgres:***@postgres:5432/tpch-db
190 rows affected.


ordermonth,orderweek,nationname,sum
1.0,1.0,ALGERIA,9150.9824
1.0,2.0,ALGERIA,57006.4876
1.0,4.0,ALGERIA,81375.0602
1.0,5.0,ALGERIA,112027.8822
1.0,,ALGERIA,259560.4124
2.0,5.0,ALGERIA,62421.2568
2.0,6.0,ALGERIA,302201.4134
2.0,7.0,ALGERIA,110386.8806
2.0,8.0,ALGERIA,204250.4232
2.0,9.0,ALGERIA,67871.568


#### 5. Show how much revenue suppliers from nations of Africa produced in every year with "urgent" orders.

In [27]:
%%sql
WITH cube AS (
    SELECT olquantity, partretailprice, oldiscount,
           orderdate, nationname, regionname, orderpriority,
           olquantity * partretailprice * (1-oldiscount) AS revenue,
           DATE_PART('YEAR', orderdate) AS orderyear,
           DATE_PART('QUARTER', orderdate) AS orderquarter,
           DATE_PART('MONTH', orderdate) AS ordermonth,
           DATE_PART('WEEK', orderdate) AS orderweek
    FROM orderline ol
    JOIN orders o      ON ol.orderid = o.orderid
    JOIN supplypart sp ON ol.partId = sp.partId AND ol.supplierId = sp.supplierId
    JOIN part p        ON sp.partId = p.partId
    JOIN supplier s    ON sp.supplierId = s.supplierId
    JOIN nation sn     ON s.nationId = sn.nationId
    JOIN region sr     ON sn.regionId = sr.regionId
)
SELECT orderyear, nationname, SUM(revenue)
FROM cube
WHERE regionname = 'AFRICA' AND orderpriority = '1-URGENT'
GROUP BY CUBE(orderyear, nationname)
ORDER BY orderyear, nationname

 * postgresql://postgres:***@postgres:5432/tpch-db
48 rows affected.


orderyear,nationname,sum
1992.0,ALGERIA,2079676.9205
1992.0,ETHIOPIA,1787769.0179
1992.0,KENYA,4379720.3826
1992.0,MOROCCO,1063439.5187
1992.0,MOZAMBIQUE,5123490.183
1992.0,,14434096.0227
1993.0,ALGERIA,1849416.9956
1993.0,ETHIOPIA,1628439.7711
1993.0,KENYA,2988587.9992
1993.0,MOROCCO,1495434.2609


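#### 6. Show the average order quantity for parts from suppliers from nations in Africa per year.

Note that the query below still contains the condition `orderpriority = '1-URGENT'` from the previous query, so the averages shown cover urgent orders only. Drop that condition to average over all orders, as the task statement asks.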

In [28]:
%%sql
WITH cube AS (
    SELECT olquantity, partretailprice, oldiscount,
           orderdate, nationname, regionname, orderpriority,
           olquantity * partretailprice * (1-oldiscount) AS revenue,
           DATE_PART('YEAR', orderdate) AS orderyear,
           DATE_PART('QUARTER', orderdate) AS orderquarter,
           DATE_PART('MONTH', orderdate) AS ordermonth,
           DATE_PART('WEEK', orderdate) AS orderweek
    FROM orderline ol
    JOIN orders o      ON ol.orderid = o.orderid
    JOIN supplypart sp ON ol.partId = sp.partId AND ol.supplierId = sp.supplierId
    JOIN part p        ON sp.partId = p.partId
    JOIN supplier s    ON sp.supplierId = s.supplierId
    JOIN nation sn     ON s.nationId = sn.nationId
    JOIN region sr     ON sn.regionId = sr.regionId
)
SELECT orderyear, nationname, AVG(olquantity)
FROM cube
WHERE regionname = 'AFRICA' AND orderpriority = '1-URGENT'
GROUP BY CUBE(orderyear, nationname)
ORDER BY orderyear, nationname

 * postgresql://postgres:***@postgres:5432/tpch-db
48 rows affected.


orderyear,nationname,avg
1992.0,ALGERIA,24.06153846153846
1992.0,ETHIOPIA,23.339285714285715
1992.0,KENYA,27.6198347107438
1992.0,MOROCCO,26.516129032258064
1992.0,MOZAMBIQUE,24.76282051282051
1992.0,,25.403263403263406
1993.0,ALGERIA,25.714285714285715
1993.0,ETHIOPIA,25.468085106382976
1993.0,KENYA,23.742268041237114
1993.0,MOROCCO,27.82926829268293


#### 7. Show how much revenue suppliers from nations of Africa (as rows) produced in every quarter of every year (as columns).

The columns and rows of a cube are both represented as columns when mapped to relations and SQL. A tool similar to Excel's PivotTable that automatically generates SQL queries would probably just flip the order of the columns.

In [29]:
%%sql
WITH cube AS (
    SELECT olquantity, partretailprice, oldiscount,
           orderdate, nationname, regionname, orderpriority,
           olquantity * partretailprice * (1-oldiscount) AS revenue,
           DATE_PART('YEAR', orderdate) AS orderyear,
           DATE_PART('QUARTER', orderdate) AS orderquarter,
           DATE_PART('MONTH', orderdate) AS ordermonth,
           DATE_PART('WEEK', orderdate) AS orderweek
    FROM orderline ol
    JOIN orders o      ON ol.orderid = o.orderid
    JOIN supplypart sp ON ol.partId = sp.partId AND ol.supplierId = sp.supplierId
    JOIN part p        ON sp.partId = p.partId
    JOIN supplier s    ON sp.supplierId = s.supplierId
    JOIN nation sn     ON s.nationId = sn.nationId
    JOIN region sr     ON sn.regionId = sr.regionId
)
SELECT nationname, orderyear, orderquarter, SUM(revenue)
FROM cube
WHERE regionname = 'AFRICA'
GROUP BY CUBE(nationname, orderyear, orderquarter)
ORDER BY nationname, orderyear, orderquarter

 * postgresql://postgres:***@postgres:5432/tpch-db
234 rows affected.


nationname,orderyear,orderquarter,sum
ALGERIA,1992.0,1.0,1958124.794
ALGERIA,1992.0,2.0,2375436.3316
ALGERIA,1992.0,3.0,2310787.5367
ALGERIA,1992.0,4.0,1924395.4578
ALGERIA,1992.0,,8568744.1201
ALGERIA,1993.0,1.0,1850076.134
ALGERIA,1993.0,2.0,2194947.1278
ALGERIA,1993.0,3.0,3305010.8167
ALGERIA,1993.0,4.0,2466948.5171
ALGERIA,1993.0,,9816982.5956
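
To obtain the layout that the task describes (nations as rows, quarters of years as columns), a client would still have to pivot the relational result. The Python sketch below does this for four rows copied from the result above; the helper `pivot` is hypothetical, and subtotal rows are left out for brevity.

```python
# Rows copied from the query result above: (nationname, orderyear, orderquarter, sum).
long_rows = [
    ("ALGERIA", 1992, 1, 1958124.794),
    ("ALGERIA", 1992, 2, 2375436.3316),
    ("ALGERIA", 1992, 3, 2310787.5367),
    ("ALGERIA", 1992, 4, 1924395.4578),
]

def pivot(rows):
    """Turn (nation, year, quarter, value) tuples into {nation: {(year, quarter): value}}."""
    table = {}
    for nation, year, quarter, value in rows:
        table.setdefault(nation, {})[(year, quarter)] = value
    return table

wide = pivot(long_rows)
columns = sorted({col for cells in wide.values() for col in cells})

# Print a header line followed by one line per nation.
print("nation", *[f"{y}-Q{q}" for y, q in columns], sep="\t")
for nation, cells in wide.items():
    print(nation, *[cells.get(col, "") for col in columns], sep="\t")
```

The data is unchanged; only the presentation moves the `(orderyear, orderquarter)` pairs from row values to column headers.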


### Part 2: MDX (Optional)

Choose one of the queries you wrote in SQL and implement it in MDX.