# <center>Big Data &ndash; Exercises &ndash; Solution</center>
## <center>Fall 2022 &ndash; Week 13 &ndash; ETH Zurich</center>
## <center>Cubes</center>

## Introduction

In the first part of this exercise, we will use a spreadsheet application to analyze the sales data of a fictitious wholesale supplier (taken from the database system benchmark [TPC-H](http://www.tpc.org/tpch/)). Then, we will use SQL to query the data in the shape of a cube.
A cube is a collection of numeric data organized by arrays of discrete identifiers (Janus and Fouché, 2009).

## 1. The TPC-H Dataset as OLAP Cube

Let us get familiar with the dataset.
It consists of **orders**, each of which is placed by a **customer** and is made up of **lineitems**.
Think of an order as a shopping cart with several items in it.
The items of an order are **parts** that may be provided by different **suppliers**.
Suppliers and customers come from different **nations**, which are grouped into **regions** of the world.
The following figure illustrates the schema of the TPC-H dataset.
<br>

![Schema of the TPC-H dataset](./tpch.png)


### Task 1

1. Which table(s) of the TPC-H schema is/are the fact table(s)?
1. What is/are the measure(s)?
1. What are the dimensions?
1. What do you call this flavor of OLAP?

### Solutions

1. **Which table(s) of the TPC-H schema is/are the fact table(s)?**  
   It depends on what we want to analyze, i.e., it depends on which cube we define on the tables.
   In general, the fact table is the table that contains the measure(s).
   If we analyze orders, then *orders* is the fact table.
   We could also analyze line items, in which case we probably also want information about the order each item belongs to,
   so we would take the join of *lineitem* and *orders* as the fact table. We could likewise analyze parts, in which case we would join *part* as well.



2. **What is/are the measure(s)?**  
   As we have seen in the lecture, a fact table can contain more than one measure.
   Fact tables often contain many different measures,
   and it makes sense to have a cube with several of them at the same time
   (for example, revenue, profit, net price, and gross price).
   In the table *lineitem*, quantity, extendedprice, discount, and tax can be used as measures;
   in *orders*, totalprice; in *partsupp*, availqty and supplycost;
   in *part*, retailprice; and in *customer* and *supplier*, acctbal.
   **Intuitively**, if it makes sense to calculate the sum or average of an attribute,
   then we can probably use it as a measure.



3. **What are the dimensions?**  
   All other attributes (except keys, which are only needed for reference when joining tables).
   Some of them have explicitly materialized hierarchies, such as the geographical hierarchy of nations and regions.
   The time dimension is also hierarchical, but the tables do not materialize it explicitly:
   each date can be broken down into year, quarter, month, week, etc.
   Dimensions typically take discrete values.
   Another way to look at it is:
   if it makes sense to have a foreign key to another table instead of an inline attribute,
   then the attribute is a dimension.



4. **What do you call this flavor of OLAP?**  
   If the data is stored and presented as relations, we speak of "Relational OLAP" (ROLAP).
   If, instead, cubes are first-class citizens in a system optimized for multidimensional analysis,
   we speak of "Multidimensional OLAP" (MOLAP).
   Here, the data is stored in relational tables and queried with SQL, so we are dealing with ROLAP.
   Since Multidimensional Expressions (MDX) is designed for cubes and SQL is not, we need to note the following:
   1. MDX is better suited in terms of language; in SQL, programmers need to map cube concepts to relations themselves.
   1. The result of an MDX query is directly presented as a cube,
      whereas postprocessing is needed to get roll-ups out of a SQL query.
   1. Execution of roll-up queries is potentially more efficient in an engine that is aware of grouping sets
      (which is the case for many advanced SQL-based systems).

   So, while there is a certain impedance mismatch between MDX and SQL, it is relatively small.
   As we will see in the following exercises, the translation of cube concepts to SQL is *relatively* easy.

## 2. Analyzing TPC-H with a Pivot Table

Open the local file [Exercise13_OLAP_Cubes](./Exercise13_OLAP_Cubes.xls) with your favorite spreadsheet application.
The file contains a universal table (a fully denormalized table that contains all of the tables joined) of a small TPC-H dataset.
The schema has been modified slightly to make analysis in a spreadsheet application easier:
the two precomputed measures revenue and cost
as well as the hierarchy levels of the time dimension in the attribute *orderdate* have been added in materialized form,
and some other attributes have been removed.

You may need to look up how to use pivot tables in your spreadsheet application.

1. Microsoft Excel: [PivotTable](https://support.office.com/en-us/article/Create-a-PivotTable-to-analyze-worksheet-data-a9a84538-bfe9-40a9-a8e9-f99134456576)
1. Google Sheets  : [pivot tables](https://support.google.com/docs/answer/1272900?co=GENIE.Platform%3DDesktop&hl=en)
1. Open Office    : [DataPilot](https://openoffice.blogs.com/openoffice/2006/11/data_pilots_in_.html)

### Task 1: Discussion

Discuss the terms "slice and dice", "drill down", "roll up", and "pivoting".

### Solution: Discussion

* **Slice and dice:**  
  The terms might make you think of cooking. Indeed, they have their roots there and describe cutting at different granularities: to slice means to cut, and to dice means to cut into very small, uniform pieces. The two actions are often performed one after the other.
  
  In cube operations, a slice is a subset of the data obtained by picking a value for one dimension and showing only the data for that value (for instance, only the data at one point in time). You can think of this in terms of a SQL `WHERE` clause.
  By filtering out some of the facts, we only look at a particular "slice" of the cube.
  For example, in the next task, we will first analyze revenue from all regions and then only consider the "slice" for Africa.
  
  By selecting dimensions, we decide which surfaces of the cube to show, i.e., we "dice" it. Oftentimes, a dice operation involves an aggregation.
  For example, we will show the regions as columns and the years as rows. Continuing the SQL analogy, this corresponds to what you would put in the `SELECT` clause.
  
  

* **Drill-down:**  
  Oftentimes, we want information on the various levels of granularity presented together. Typically, we do this in a hierarchical manner: we show results in a broader category first, and then narrower ones. 
  For instance, in the next task, we will look at revenue per region first, then we will "zoom in" in the subsequent steps to see the revenue per nation.



* **Roll-up:**  
  Roll-up is the inverse of drill-down: starting from a fine-grained view, we summarize the data at coarser levels.
  For example, while looking at the revenue at week granularity, we may want to see the summaries per month and per quarter at the same time, in the same table.
  The subtotals shown in the pivot table "roll up" that information.
<br> <br>
* **Pivoting:** 
  Pivoting changes the positions of the dimensions, for example by swapping rows and columns.
  We then look at the cube from a different angle, i.e., we rotate ("pivot") it in space to see its other faces. This also includes moving quantities from being dimensions to being measures.
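
The four operations above can be sketched on a tiny in-memory fact list. This is only an illustration under stated assumptions: the facts are made-up toy values (not TPC-H data), and `slice_` and `dice` are hypothetical helper names that follow the SQL analogy used in this section.

```python
from collections import defaultdict

# Hypothetical toy facts: (region, year, revenue). Not taken from the TPC-H data.
facts = [
    ("AFRICA", 1995, 10),
    ("AFRICA", 1996, 20),
    ("ASIA", 1995, 30),
    ("ASIA", 1996, 40),
]

def slice_(rows, region):
    """Slice: keep only the facts for one value of a dimension (SQL WHERE)."""
    return [row for row in rows if row[0] == region]

def dice(rows, key):
    """Dice: aggregate revenue by the chosen dimensions (SQL SELECT / GROUP BY)."""
    totals = defaultdict(int)
    for row in rows:
        totals[key(row)] += row[2]
    return dict(totals)

# Finest granularity: one cell per (region, year).
by_region_year = dice(facts, key=lambda r: (r[0], r[1]))

# Roll-up: drop the year and summarize per region.
by_region = dice(facts, key=lambda r: r[0])

# Slice first, then dice: revenue per year for Africa only.
africa_by_year = dice(slice_(facts, "AFRICA"), key=lambda r: r[1])

# Pivot: same cells, but the dimensions swap positions.
by_year_region = {(year, region): v for (region, year), v in by_region_year.items()}

print(by_region)
print(africa_by_year)
```

Drill-down is simply the step in the other direction: going from `by_region` back to `by_region_year`.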


### Task 2: Create the following pivot tables:

1. Show how much revenue suppliers from different regions (as columns) produced in every year (as rows).
1. Show how much revenue suppliers from nations of Africa (as columns) produced in every year (as rows).
1. Show how much revenue suppliers from nations of Africa produced in every quarter of every year.
1. Show how much revenue suppliers from nations of Africa produced in every week of every month of Q1 in 1996.
1. Show how much revenue suppliers from nations of Africa produced in every year with "urgent" orders.
1. Show the average order quantity for parts from suppliers from nations of Africa per year.
1. Show how much revenue suppliers from nations of Africa (as rows) produced in every quarter of every year (as columns).

### Solution:
[Exercise13_Solution.xlsx](./Exercise13_Solution.xlsx)

## 3. OLAP Cubes and SQL

Write SQL queries for the PivotTables from Task 2 of the previous section.

#### Notes

* Assume that the revenue is calculated as `olquantity * partretailprice * (1-oldiscount)`. You will already find it as a column.
* To get the year or quarter from a date in PostgreSQL, you can use [`DATE_PART ('field', date )  `](https://www.postgresqltutorial.com/postgresql-date_part/). Note that the field is case-insensitive. You can write `DATE_PART('YEAR', date)` or `DATE_PART('year', date)`, which are equivalent. 

#### Database Set-up

Just like in any other week, you need to run `docker compose up`. Please wait for the message `PostgreSQL init process complete; ready for start up` before proceeding!
As before, we set up our connection to the database and enable use of `%sql` and `%%sql`.

In [1]:
server  ='postgres'
user    ='postgres'
password='bigdataclass'
database='tpch-db'

connection_string = f'postgresql://{user}:{password}@{server}:5432/{database}'

In [2]:
%reload_ext sql
%sql $connection_string

Check the tables in TPC-H. They are empty for the moment.

In [3]:
%%sql 
SELECT table_name
FROM information_schema.tables
WHERE table_schema = 'public';

 * postgresql://postgres:***@postgres:5432/tpch-db
8 rows affected.


table_name
region
nation
supplier
part
supplypart
customer
orders
orderline


Populate the tables in TPC-H with data from .tbl files

In [4]:
import numpy as np
import os
import pandas
import sqlalchemy
import sys

tables = [ # Order is important because of FKs
        'region',
        'nation',
        'supplier',
        'part',
        'supplypart',
        'customer',
        'orders',
        'orderline'
        ]

engine = sqlalchemy.create_engine(connection_string)

for table in tables:
    # Find column names
    columns = engine.execute('SELECT * FROM {0}'.format(table)).keys()

    # Load content
    data = pandas.read_csv('docker/postgres/tpch/data/{0}.tbl'.format(table), sep='|', header=None, names=columns)
    msg = 'Loading table "{0}": {1}% done\r'
    for idx, chunk in enumerate(np.array_split(data, 100)):
        sys.stdout.write(msg.format(table, idx))
        chunk.to_sql(name=table, if_exists='append', con=engine, index=False, method='multi')
    print(msg.format(table, str(100)))

Loading table "region": 100% done
Loading table "nation": 100% done
Loading table "supplier": 100% done
Loading table "part": 100% done
Loading table "supplypart": 100% done
Loading table "customer": 100% done
Loading table "orders": 100% done
Loading table "orderline": 100% done


First, however, we define the fact table using a `WITH` clause (**copy this to the beginning of all other queries**).

In [5]:
%%sql
WITH cube AS (
    SELECT olquantity, partretailprice, oldiscount,
           orderdate, nationname, regionname, orderpriority,
           olquantity * partretailprice * (1-oldiscount) AS revenue,
           DATE_PART('year', orderdate) AS orderyear,
           DATE_PART('quarter', orderdate) AS orderquarter,
           DATE_PART('month', orderdate) AS ordermonth,
           DATE_PART('week', orderdate) AS orderweek
    FROM orderline ol
    JOIN orders o      ON ol.orderid = o.orderid
    JOIN supplypart sp ON ol.partId = sp.partId AND ol.supplierId = sp.supplierId
    JOIN part p        ON sp.partId = p.partId
    JOIN supplier s    ON sp.supplierId = s.supplierId
    JOIN nation sn     ON s.nationId = sn.nationId
    JOIN region sr     ON sn.regionId = sr.regionId
)
SELECT * FROM cube LIMIT 10

 * postgresql://postgres:***@postgres:5432/tpch-db
10 rows affected.


olquantity,partretailprice,oldiscount,orderdate,nationname,regionname,orderpriority,revenue,orderyear,orderquarter,ordermonth,orderweek
17.0,1453.55,0.04,1996-01-02,MOZAMBIQUE,AFRICA,5-LOW,23721.936,1996.0,1.0,1.0,1.0
36.0,1574.67,0.09,1996-01-02,CHINA,ASIA,5-LOW,51586.1892,1996.0,1.0,1.0,1.0
8.0,1537.63,0.1,1996-01-02,EGYPT,MIDDLE EAST,5-LOW,11070.936,1996.0,1.0,1.0,1.0
28.0,922.02,0.09,1996-01-02,KENYA,AFRICA,5-LOW,23493.0696,1996.0,1.0,1.0,1.0
24.0,1141.24,0.1,1996-01-02,INDONESIA,ASIA,5-LOW,24650.784,1996.0,1.0,1.0,1.0
32.0,1057.15,0.07,1996-01-02,UNITED STATES,AMERICA,5-LOW,31460.784,1996.0,1.0,1.0,1.0
38.0,963.06,0.0,1996-12-01,GERMANY,EUROPE,1-URGENT,36596.28,1996.0,4.0,12.0,48.0
45.0,943.04,0.06,1993-10-14,UNITED STATES,AMERICA,5-LOW,39890.592,1993.0,4.0,10.0,41.0
49.0,1091.19,0.1,1993-10-14,FRANCE,EUROPE,5-LOW,48121.479,1993.0,4.0,10.0,41.0
27.0,1186.28,0.06,1993-10-14,INDIA,ASIA,5-LOW,30107.7864,1993.0,4.0,10.0,41.0


Note that, for the purpose of this exercise, we dropped some dimensions of the cube because none of the queries uses them. Also, we materialize some hierarchy levels of the `orderdate` dimension in order to make the subsequent queries more readable. This makes them *look* like they were new dimensions -- conceptually, they are not! (They are, well, levels of a hierarchy of the `orderdate` dimension.)
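
To make the derived columns concrete, the following sketch recomputes them in Python for one row of the output above (the `1996-12-01` row). It is only a cross-check under one assumption: `derive_columns` is a hypothetical helper, not part of the notebook. It relies on the fact that `DATE_PART('week', ...)` in PostgreSQL returns the ISO 8601 week number, which is also what `date.isocalendar()` reports.

```python
from datetime import date

def derive_columns(olquantity, partretailprice, oldiscount, orderdate):
    """Recompute the derived columns of the cube for a single row."""
    return {
        "revenue": olquantity * partretailprice * (1 - oldiscount),
        "orderyear": orderdate.year,
        # Months 1-3 map to quarter 1, months 4-6 to quarter 2, and so on.
        "orderquarter": (orderdate.month - 1) // 3 + 1,
        "ordermonth": orderdate.month,
        # ISO 8601 week number (second field of the ISO calendar tuple).
        "orderweek": orderdate.isocalendar()[1],
    }

# Input values copied from the GERMANY row of the query result above.
row = derive_columns(38.0, 963.06, 0.0, date(1996, 12, 1))
print(row)
```

The year, quarter, month, and week obtained this way agree with the values `1996`, `4`, `12`, and `48` in that row, which shows that all four are just different levels derived from the single `orderdate` value.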

OK, you are good to go. Use the SQL cell below and add more as you need.

Note that the numbers you obtain with the SQL queries will not be identical to those in the pivot tables from Task 2, because the data in the database has more rows in its fact table.

### Solution

#### 1. Show how much revenue suppliers from different regions (as columns) produced in every year (as rows).

When we map the cube to SQL, there are no explicit columns and rows anymore -- both are expressed as values of ordinary columns (`regionname` and `orderyear` in this case). Similarly, the sub-totals are expressed as rows where some of the dimensions are `NULL` (which Python prints as `None` in this notebook). As discussed in the lecture, in order to visualize the result of the query as a cube (like Excel does), some postprocessing is needed.
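
The postprocessing step mentioned above could look roughly like the sketch below. It assumes result rows of the form `(regionname, orderyear, sum)` in which `None` marks a sub-total; the rows are made-up toy values and `to_pivot` is a hypothetical helper, not something the notebook or Excel provides.

```python
# Hypothetical result rows (regionname, orderyear, sum); None marks a sub-total.
rows = [
    ("AFRICA", 1995, 10), ("AFRICA", 1996, 20), ("AFRICA", None, 30),
    ("ASIA", 1995, 30), ("ASIA", 1996, 40), ("ASIA", None, 70),
    (None, 1995, 40), (None, 1996, 60), (None, None, 100),
]

def to_pivot(rows):
    """Arrange (column_dim, row_dim, value) tuples as a 2-D grid with totals."""
    cells = {(r, c): v for c, r, v in rows}

    # Sort real values first and put the None (sub-total) entry last.
    def totals_last(x):
        return (x is None, x)

    col_keys = sorted({c for c, _, _ in rows}, key=totals_last)
    row_keys = sorted({r for _, r, _ in rows}, key=totals_last)

    def label(x):
        return "Total" if x is None else str(x)

    header = [""] + [label(c) for c in col_keys]
    body = [[label(r)] + [cells.get((r, c)) for c in col_keys] for r in row_keys]
    return [header] + body

grid = to_pivot(rows)
for line in grid:
    print(line)
```

The `NULL` rows of the SQL result end up in the "Total" row and column of the grid, which is where a pivot table would show them.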

We start with a version that computes the sub-totals manually.

In [19]:
%%sql
WITH cube AS (
    SELECT olquantity, partretailprice, oldiscount,
           orderdate, nationname, regionname, orderpriority,
           olquantity * partretailprice * (1-oldiscount) AS revenue,
           DATE_PART('year', orderdate) AS orderyear,
           DATE_PART('quarter', orderdate) AS orderquarter,
           DATE_PART('month', orderdate) AS ordermonth,
           DATE_PART('week', orderdate) AS orderweek
    FROM orderline ol
    JOIN orders o      ON ol.orderid = o.orderid
    JOIN supplypart sp ON ol.partId = sp.partId AND ol.supplierId = sp.supplierId
    JOIN part p        ON sp.partId = p.partId
    JOIN supplier s    ON sp.supplierId = s.supplierId
    JOIN nation sn     ON s.nationId = sn.nationId
    JOIN region sr     ON sn.regionId = sr.regionId
)

    SELECT NULL, NULL, SUM(revenue)
    FROM cube


 * postgresql://postgres:***@postgres:5432/tpch-db
1 rows affected.


?column?,?column?_1,sum
,,2045134942.0939

The following unexecuted cells illustrate the general pattern on a generic `sales` table with the columns `brand`, `segment`, and `quantity`, which is not part of the TPC-H database: first the sub-totals computed manually with `UNION ALL`, then the equivalent `GROUPING SETS`, `CUBE`, and `ROLLUP` forms.


In [None]:
SELECT brand, segment, SUM (quantity)
FROM sales
GROUP BY brand, segment

UNION ALL

SELECT brand, NULL, SUM (quantity)
FROM sales
GROUP BY brand

UNION ALL

SELECT NULL, segment, SUM (quantity)
FROM sales
GROUP BY segment

UNION ALL

SELECT NULL, NULL, SUM (quantity)
FROM sales;

In [None]:
SELECT brand, segment, SUM (quantity)
FROM sales
GROUP BY
    GROUPING SETS (
        (brand, segment),
        (brand),
        (segment),
        ()
    );

In [None]:
-- CUBE(c1,c2,c3) is shorthand for:

GROUPING SETS (
    (c1,c2,c3), 
    (c1,c2),
    (c1,c3),
    (c2,c3),
    (c1),
    (c2),
    (c3), 
    ()
 ) 

In [None]:
SELECT brand, segment, SUM (quantity)
FROM sales
GROUP BY CUBE (brand, segment);

In [None]:
-- ROLLUP(c1,c2,c3) is shorthand for:

GROUPING SETS (
    (c1,c2,c3),
    (c1,c2),
    (c1),
    ()
)

In [None]:
SELECT brand, segment, SUM (quantity)
FROM sales
GROUP BY ROLLUP (brand, segment);

In [None]:
SELECT segment, brand, SUM (quantity)
FROM sales
GROUP BY ROLLUP (segment, brand)
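
The expansions of `CUBE` and `ROLLUP` into explicit grouping sets shown in the cells above can be reproduced with a few lines of Python. The sketch is only an illustration; `cube_sets` and `rollup_sets` are hypothetical helper names, not PostgreSQL functions.

```python
from itertools import combinations

def cube_sets(columns):
    """All subsets of the columns, largest first: what CUBE expands to."""
    n = len(columns)
    return [combo
            for size in range(n, -1, -1)
            for combo in combinations(columns, size)]

def rollup_sets(columns):
    """All prefixes of the columns, longest first: what ROLLUP expands to."""
    return [tuple(columns[:size]) for size in range(len(columns), -1, -1)]

print(cube_sets(["c1", "c2", "c3"]))
print(rollup_sets(["c1", "c2", "c3"]))
```

For `n` columns, `CUBE` yields `2**n` grouping sets and `ROLLUP` yields `n + 1`. Because `ROLLUP` only keeps prefixes, the order of its arguments matters, which is why `ROLLUP (brand, segment)` and `ROLLUP (segment, brand)` produce different sub-totals.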

In [6]:
%%sql
WITH cube AS (
    SELECT olquantity, partretailprice, oldiscount,
           orderdate, nationname, regionname, orderpriority,
           olquantity * partretailprice * (1-oldiscount) AS revenue,
           DATE_PART('year', orderdate) AS orderyear,
           DATE_PART('quarter', orderdate) AS orderquarter,
           DATE_PART('month', orderdate) AS ordermonth,
           DATE_PART('week', orderdate) AS orderweek
    FROM orderline ol
    JOIN orders o      ON ol.orderid = o.orderid
    JOIN supplypart sp ON ol.partId = sp.partId AND ol.supplierId = sp.supplierId
    JOIN part p        ON sp.partId = p.partId
    JOIN supplier s    ON sp.supplierId = s.supplierId
    JOIN nation sn     ON s.nationId = sn.nationId
    JOIN region sr     ON sn.regionId = sr.regionId
)
(
    SELECT regionname, orderyear, SUM(revenue)
    FROM cube
    GROUP BY regionname, orderyear
) UNION ALL (
    SELECT NULL, orderyear, SUM(revenue)
    FROM cube
    GROUP BY orderyear
) UNION ALL (
    SELECT regionname, NULL, SUM(revenue)
    FROM cube
    GROUP BY regionname
) UNION ALL (
    SELECT NULL, NULL, SUM(revenue)
    FROM cube
)
ORDER BY regionname, orderyear

 * postgresql://postgres:***@postgres:5432/tpch-db
48 rows affected.


regionname,orderyear,sum
AFRICA,1992.0,65268216.7295
AFRICA,1993.0,66164509.0978
AFRICA,1994.0,63964877.7068
AFRICA,1995.0,60191092.4238
AFRICA,1996.0,63326113.0036
AFRICA,1997.0,64551600.1987
AFRICA,1998.0,37799441.1835
AFRICA,,421265850.3437
AMERICA,1992.0,62430452.5656
AMERICA,1993.0,64204286.247


Alternatively, we can use `GROUPING SETS`, which PostgreSQL supports. This makes the query easier to read and potentially also faster to execute. If the database system is not clever, it might execute each subquery in the `UNION ALL` expressions independently of each other, doing the (expensive) `JOIN`s for each of them again.
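
The difference between the two strategies can be sketched on toy data. The example below uses an in-memory SQLite database (an assumption made only so that the sketch is self-contained; SQLite has no `GROUPING SETS`) to run the `UNION ALL` variant, and compares it with a single scan in Python that feeds all four groupings at once. The single scan only illustrates the idea; it does not describe how PostgreSQL actually executes `GROUPING SETS`.

```python
import sqlite3
from collections import defaultdict

# Hypothetical toy cube; integer revenue keeps the sums exact.
facts = [
    ("AFRICA", 1995, 10), ("AFRICA", 1996, 20),
    ("ASIA", 1995, 30), ("ASIA", 1996, 40),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cube (regionname TEXT, orderyear INTEGER, revenue INTEGER)")
conn.executemany("INSERT INTO cube VALUES (?, ?, ?)", facts)

# Manual sub-totals: four separate aggregations over the same input.
union_all = conn.execute("""
    SELECT regionname, orderyear, SUM(revenue) FROM cube GROUP BY regionname, orderyear
    UNION ALL
    SELECT NULL, orderyear, SUM(revenue) FROM cube GROUP BY orderyear
    UNION ALL
    SELECT regionname, NULL, SUM(revenue) FROM cube GROUP BY regionname
    UNION ALL
    SELECT NULL, NULL, SUM(revenue) FROM cube
""").fetchall()

# Grouping-sets idea: one scan that updates all four groupings per fact.
totals = defaultdict(int)
for region, year, revenue in facts:
    for key in [(region, year), (None, year), (region, None), (None, None)]:
        totals[key] += revenue
single_pass = [(region, year, value) for (region, year), value in totals.items()]

# Compare as multisets; repr() gives a sort key that tolerates None values.
same = sorted(union_all, key=repr) == sorted(single_pass, key=repr)
print(same)
```

Both variants produce the same nine rows (four cells, two sub-totals per dimension, one grand total), but the second one reads the facts only once.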

In [14]:
%%sql
WITH cube AS (
    SELECT olquantity, partretailprice, oldiscount,
           orderdate, nationname, regionname, orderpriority,
           olquantity * partretailprice * (1-oldiscount) AS revenue,
           DATE_PART('year', orderdate) AS orderyear,
           DATE_PART('quarter', orderdate) AS orderquarter,
           DATE_PART('month', orderdate) AS ordermonth,
           DATE_PART('week', orderdate) AS orderweek
    FROM orderline ol
    JOIN orders o      ON ol.orderid = o.orderid
    JOIN supplypart sp ON ol.partId = sp.partId AND ol.supplierId = sp.supplierId
    JOIN part p        ON sp.partId = p.partId
    JOIN supplier s    ON sp.supplierId = s.supplierId
    JOIN nation sn     ON s.nationId = sn.nationId
    JOIN region sr     ON sn.regionId = sr.regionId
)
SELECT regionname, orderyear, SUM(revenue)
FROM cube
GROUP BY GROUPING SETS (
    (regionname, orderyear),
    (regionname),
    (orderyear),
    ()
)
ORDER BY regionname, orderyear

 * postgresql://postgres:***@postgres:5432/tpch-db
48 rows affected.


regionname,orderyear,sum
AFRICA,1992.0,65268216.7295
AFRICA,1993.0,66164509.0978
AFRICA,1994.0,63964877.7068
AFRICA,1995.0,60191092.4238
AFRICA,1996.0,63326113.0036
AFRICA,1997.0,64551600.1987
AFRICA,1998.0,37799441.1835
AFRICA,,421265850.3437
AMERICA,1992.0,62430452.5656
AMERICA,1993.0,64204286.247


PostgreSQL also supports the `CUBE` subclause of the `GROUP BY` clause, which is just syntactic sugar for the grouping sets above: `CUBE (regionname, orderyear)` generates all four of them.

In [11]:
%%sql
WITH cube AS (
    SELECT olquantity, partretailprice, oldiscount,
           orderdate, nationname, regionname, orderpriority,
           olquantity * partretailprice * (1-oldiscount) AS revenue,
           DATE_PART('year', orderdate) AS orderyear,
           DATE_PART('quarter', orderdate) AS orderquarter,
           DATE_PART('month', orderdate) AS ordermonth,
           DATE_PART('week', orderdate) AS orderweek
    FROM orderline ol
    JOIN orders o      ON ol.orderid = o.orderid
    JOIN supplypart sp ON ol.partId = sp.partId AND ol.supplierId = sp.supplierId
    JOIN part p        ON sp.partId = p.partId
    JOIN supplier s    ON sp.supplierId = s.supplierId
    JOIN nation sn     ON s.nationId = sn.nationId
    JOIN region sr     ON sn.regionId = sr.regionId
)
SELECT regionname, orderyear, SUM(revenue)
FROM cube
GROUP BY CUBE (regionname, orderyear)
ORDER BY regionname, orderyear

 * postgresql://postgres:***@postgres:5432/tpch-db
48 rows affected.


regionname,orderyear,sum
AFRICA,1992.0,65268216.7295
AFRICA,1993.0,66164509.0978
AFRICA,1994.0,63964877.7068
AFRICA,1995.0,60191092.4238
AFRICA,1996.0,63326113.0036
AFRICA,1997.0,64551600.1987
AFRICA,1998.0,37799441.1835
AFRICA,,421265850.3437
AMERICA,1992.0,62430452.5656
AMERICA,1993.0,64204286.247


#### 2. Show how much revenue suppliers from nations of Africa produced in every year.

We will use the short syntax from now on. What changes now are how we dice and drill down (in the `SELECT` and `GROUP BY` clauses) and how we slice (in the `WHERE` clause).

In [12]:
%%sql
WITH cube AS (
    SELECT olquantity, partretailprice, oldiscount,
           orderdate, nationname, regionname, orderpriority,
           olquantity * partretailprice * (1-oldiscount) AS revenue,
           DATE_PART('year', orderdate) AS orderyear,
           DATE_PART('quarter', orderdate) AS orderquarter,
           DATE_PART('month', orderdate) AS ordermonth,
           DATE_PART('week', orderdate) AS orderweek
    FROM orderline ol
    JOIN orders o      ON ol.orderid = o.orderid
    JOIN supplypart sp ON ol.partId = sp.partId AND ol.supplierId = sp.supplierId
    JOIN part p        ON sp.partId = p.partId
    JOIN supplier s    ON sp.supplierId = s.supplierId
    JOIN nation sn     ON s.nationId = sn.nationId
    JOIN region sr     ON sn.regionId = sr.regionId
)
SELECT nationname, orderyear, SUM(revenue)
FROM cube
WHERE regionname = 'AFRICA'
GROUP BY CUBE (nationname, orderyear)
ORDER BY nationname, orderyear

 * postgresql://postgres:***@postgres:5432/tpch-db
48 rows affected.


nationname,orderyear,sum
ALGERIA,1992.0,8568744.1201
ALGERIA,1993.0,9816982.5956
ALGERIA,1994.0,9097585.2349
ALGERIA,1995.0,9168430.0924
ALGERIA,1996.0,8137148.2759
ALGERIA,1997.0,10714639.4294
ALGERIA,1998.0,5554601.7346
ALGERIA,,61058131.4829
ETHIOPIA,1992.0,9077948.018
ETHIOPIA,1993.0,8537000.0214


Note that PostgreSQL distinguishes between single and double quotation marks; see <a href="https://www.prisma.io/dataguide/postgresql/short-guides/quoting-rules#:~:text=In%20PostgreSQL%2C%20double%20quotes%20(like,name%20or%20a%20column%20name.">this post</a> for more detail.



In a nutshell, you should know:
- Double quotation marks indicate quoted identifiers. An identifier is the name of an object within PostgreSQL, such as a table or a column. Quoted identifiers are case sensitive, so PostgreSQL treats "CUSTOMER" and "customer" as entirely different objects.
- Single quotation marks, on the other hand, indicate that a token is a string literal.


In the query above, if we write "AFRICA" instead of 'AFRICA', PostgreSQL will look for a column named AFRICA and throw an error.

#### 3. Show how much revenue suppliers from nations of Africa produced in every quarter of every year.

In [None]:
%%sql
WITH cube AS (
    SELECT olquantity, partretailprice, oldiscount,
           orderdate, nationname, regionname, orderpriority,
           olquantity * partretailprice * (1-oldiscount) AS revenue,
           DATE_PART('YEAR', orderdate) AS orderyear,
           DATE_PART('QUARTER', orderdate) AS orderquarter,
           DATE_PART('MONTH', orderdate) AS ordermonth,
           DATE_PART('WEEK', orderdate) AS orderweek
    FROM orderline ol
    JOIN orders o      ON ol.orderid = o.orderid
    JOIN supplypart sp ON ol.partId = sp.partId AND ol.supplierId = sp.supplierId
    JOIN part p        ON sp.partId = p.partId
    JOIN supplier s    ON sp.supplierId = s.supplierId
    JOIN nation sn     ON s.nationId = sn.nationId
    JOIN region sr     ON sn.regionId = sr.regionId
)
SELECT nationname, orderyear, orderquarter, SUM(revenue)
FROM cube
WHERE regionname = 'AFRICA'
GROUP BY CUBE (nationname, orderyear, orderquarter)
ORDER BY nationname, orderyear, orderquarter

#### 4. Show how much revenue suppliers from nations of Africa produced in every week of every month of Q1 in 1996.

Note that `orderweek` is from a different hierarchy of the `orderdate` dimension than `orderquarter` and `ordermonth` because a week does not generally belong to only one quarter or month. (However, a month always belongs to exactly one quarter.) This does not change anything in the SQL query below, but is an important conceptual subtlety of cubes.
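
The claim that a week does not always belong to a single month can be checked directly for Q1 of 1996. The sketch below uses only the Python standard library and assumes ISO 8601 week numbering, which is what `DATE_PART('week', ...)` uses in PostgreSQL.

```python
from collections import defaultdict
from datetime import date, timedelta

# For every day of Q1 1996, record which months each ISO week touches.
months_per_week = defaultdict(set)
day = date(1996, 1, 1)
while day <= date(1996, 3, 31):
    months_per_week[day.isocalendar()[1]].add(day.month)
    day += timedelta(days=1)

# Weeks that contain days from more than one month.
spanning = sorted(w for w, months in months_per_week.items() if len(months) > 1)

# Number of distinct (week, month) combinations in the quarter.
pairs = sum(len(months) for months in months_per_week.values())

print(spanning)
print(pairs)
```

Weeks 5 and 9 straddle a month boundary, so the quarter has 15 (week, month) combinations for its 13 weeks. Assuming every combination contains at least one order, `CUBE (orderweek, ordermonth)` then yields 15 + 13 + 3 + 1 = 32 rows, which is consistent with the row count reported for the query below.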

In [15]:
%%sql
WITH cube AS (
    SELECT olquantity, partretailprice, oldiscount,
           orderdate, nationname, regionname, orderpriority,
           olquantity * partretailprice * (1-oldiscount) AS revenue,
           DATE_PART('year', orderdate) AS orderyear,
           DATE_PART('quarter', orderdate) AS orderquarter,
           DATE_PART('month', orderdate) AS ordermonth,
           DATE_PART('week', orderdate) AS orderweek
    FROM orderline ol
    JOIN orders o      ON ol.orderid = o.orderid
    JOIN supplypart sp ON ol.partId = sp.partId AND ol.supplierId = sp.supplierId
    JOIN part p        ON sp.partId = p.partId
    JOIN supplier s    ON sp.supplierId = s.supplierId
    JOIN nation sn     ON s.nationId = sn.nationId
    JOIN region sr     ON sn.regionId = sr.regionId
)
SELECT orderweek, ordermonth, SUM(revenue)
FROM cube
WHERE regionname = 'AFRICA' AND
      orderyear = 1996 AND
      orderquarter = 1
GROUP BY CUBE (orderweek, ordermonth)
ORDER BY orderweek, ordermonth

 * postgresql://postgres:***@postgres:5432/tpch-db
32 rows affected.


orderweek,ordermonth,sum
1.0,1.0,1083354.5782
1.0,,1083354.5782
2.0,1.0,1359476.548
2.0,,1359476.548
3.0,1.0,706544.0276
3.0,,706544.0276
4.0,1.0,700302.1218
4.0,,700302.1218
5.0,1.0,547571.5048
5.0,2.0,415328.8306


#### 5. Show how much revenue suppliers from nations of Africa produced in every year with "urgent" orders.

In [17]:
%%sql
WITH cube AS (
    SELECT olquantity, partretailprice, oldiscount,
           orderdate, nationname, regionname, orderpriority,
           olquantity * partretailprice * (1-oldiscount) AS revenue,
           DATE_PART('YEAR', orderdate) AS orderyear,
           DATE_PART('QUARTER', orderdate) AS orderquarter,
           DATE_PART('MONTH', orderdate) AS ordermonth,
           DATE_PART('WEEK', orderdate) AS orderweek
    FROM orderline ol
    JOIN orders o      ON ol.orderid = o.orderid
    JOIN supplypart sp ON ol.partId = sp.partId AND ol.supplierId = sp.supplierId
    JOIN part p        ON sp.partId = p.partId
    JOIN supplier s    ON sp.supplierId = s.supplierId
    JOIN nation sn     ON s.nationId = sn.nationId
    JOIN region sr     ON sn.regionId = sr.regionId
)

SELECT nationname, orderyear, SUM(revenue)
FROM cube
WHERE regionname = 'AFRICA' AND
      orderpriority = '1-URGENT'
GROUP BY CUBE (nationname, orderyear)
ORDER BY nationname, orderyear

 * postgresql://postgres:***@postgres:5432/tpch-db
48 rows affected.


nationname,orderyear,sum
ALGERIA,1992.0,2079676.9205
ALGERIA,1993.0,1849416.9956
ALGERIA,1994.0,2507140.8116
ALGERIA,1995.0,1603099.6504
ALGERIA,1996.0,1994410.8918
ALGERIA,1997.0,2582016.1908
ALGERIA,1998.0,1277526.8451
ALGERIA,,13893288.3058
ETHIOPIA,1992.0,1787769.0179
ETHIOPIA,1993.0,1628439.7711


#### 6. Show the average order quantity for parts from suppliers from nations in Africa per year.

In [None]:
%%sql
WITH cube AS (
    SELECT olquantity, partretailprice, oldiscount,
           orderdate, nationname, regionname, orderpriority,
           olquantity * partretailprice * (1-oldiscount) AS revenue,
           DATE_PART('YEAR', orderdate) AS orderyear,
           DATE_PART('QUARTER', orderdate) AS orderquarter,
           DATE_PART('MONTH', orderdate) AS ordermonth,
           DATE_PART('WEEK', orderdate) AS orderweek
    FROM orderline ol
    JOIN orders o      ON ol.orderid = o.orderid
    JOIN supplypart sp ON ol.partId = sp.partId AND ol.supplierId = sp.supplierId
    JOIN part p        ON sp.partId = p.partId
    JOIN supplier s    ON sp.supplierId = s.supplierId
    JOIN nation sn     ON s.nationId = sn.nationId
    JOIN region sr     ON sn.regionId = sr.regionId
)
SELECT nationname, orderyear, AVG(olquantity)
FROM cube
WHERE regionname = 'AFRICA'
GROUP BY CUBE(nationname, orderyear)
ORDER BY nationname, orderyear

#### 7. Show how much revenue suppliers from nations of Africa (as rows) produced in every quarter of every year (as columns).

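As discussed above, columns and rows of a cube are both represented as columns when mapped to relations and SQL. A tool similar to Excel's PivotTable that automatically generates SQL queries would probably just flip the order of the columns. Note that the query below aggregates by `regionname` over all regions; to restrict the result to the nations of Africa as the task asks, group by `nationname` instead and add `WHERE regionname = 'AFRICA'`.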
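
How such a tool might generate its queries can be sketched as follows. `build_pivot_query` and `quote_literal` are hypothetical helpers written for this illustration only; identifiers are inserted verbatim, so the sketch is suitable for trusted input only and is not how any particular spreadsheet application works.

```python
def quote_literal(value):
    """Wrap a string in single quotes, doubling embedded single quotes."""
    return "'" + value.replace("'", "''") + "'"

def build_pivot_query(rows, columns, measure, filters=None, table="cube"):
    """Generate a GROUP BY CUBE query for a pivot-table layout.

    rows / columns: lists of dimension names shown as pivot rows / columns.
    measure: an aggregate expression such as "SUM(revenue)".
    filters: optional dict mapping a column name to a string value (the slice).
    """
    # Rows and columns of the pivot table both become plain SQL columns;
    # the layout only influences their order.
    dims = list(columns) + list(rows)
    dim_list = ", ".join(dims)
    lines = [f"SELECT {dim_list}, {measure}", f"FROM {table}"]
    if filters:
        conditions = [f"{col} = {quote_literal(val)}" for col, val in filters.items()]
        lines.append("WHERE " + " AND ".join(conditions))
    lines.append(f"GROUP BY CUBE ({dim_list})")
    lines.append(f"ORDER BY {dim_list}")
    return "\n".join(lines)

africa = {"regionname": "AFRICA"}
# Nations as columns, years as rows.
query_a = build_pivot_query(["orderyear"], ["nationname"], "SUM(revenue)", africa)
# Nations as rows, quarters of years as columns.
query_b = build_pivot_query(["nationname"], ["orderyear", "orderquarter"], "SUM(revenue)", africa)
print(query_a)
print(query_b)
```

Swapping rows and columns changes nothing but the order of the dimension list, so the set of result rows stays the same and only the presentation differs.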

In [18]:
%%sql
WITH cube AS (
    SELECT olquantity, partretailprice, oldiscount,
           orderdate, nationname, regionname, orderpriority,
           olquantity * partretailprice * (1-oldiscount) AS revenue,
           DATE_PART('YEAR', orderdate) AS orderyear,
           DATE_PART('QUARTER', orderdate) AS orderquarter,
           DATE_PART('MONTH', orderdate) AS ordermonth,
           DATE_PART('WEEK', orderdate) AS orderweek
    FROM orderline ol
    JOIN orders o      ON ol.orderid = o.orderid
    JOIN supplypart sp ON ol.partId = sp.partId AND ol.supplierId = sp.supplierId
    JOIN part p        ON sp.partId = p.partId
    JOIN supplier s    ON sp.supplierId = s.supplierId
    JOIN nation sn     ON s.nationId = sn.nationId
    JOIN region sr     ON sn.regionId = sr.regionId
)
SELECT orderyear, orderquarter, regionname, SUM(revenue)
FROM cube
GROUP BY CUBE (orderyear, orderquarter, regionname)
ORDER BY orderyear, orderquarter, regionname

 * postgresql://postgres:***@postgres:5432/tpch-db
234 rows affected.


orderyear,orderquarter,regionname,sum
1992.0,1.0,AFRICA,17317462.3287
1992.0,1.0,AMERICA,16465556.7073
1992.0,1.0,ASIA,22176165.1785
1992.0,1.0,EUROPE,15566873.2354
1992.0,1.0,MIDDLE EAST,9727904.2786
1992.0,1.0,,81253961.7285
1992.0,2.0,AFRICA,16458398.5488
1992.0,2.0,AMERICA,15584191.2981
1992.0,2.0,ASIA,20738684.3131
1992.0,2.0,EUROPE,15922466.4617
