# STEP 6: Repeat the computation using the fact and dimension tables

Note: You do not need to write any code in this notebook. Its only purpose is to illustrate the performance difference between the star and 3NF schemas.

Start by running the cells below to create the pagila_star database, load the data, and connect to it.

In [5]:
!PGPASSWORD=student createdb -h 127.0.0.1 -U student pagila_star
!PGPASSWORD=student psql -q -h 127.0.0.1 -U student -d pagila_star -f Data/pagila-data.sql

createdb: database creation failed: ERROR:  database "pagila_star" already exists
psql:Data/pagila-data.sql:23: ERROR:  relation "actor" does not exist
psql:Data/pagila-data.sql:224: invalid command \.
psql:Data/pagila-data.sql:231: ERROR:  syntax error at or near "1"
LINE 1: 1 PENELOPE GUINESS 2017-02-15 09:34:33
        ^
psql:Data/pagila-data.sql:341: invalid command \.
psql:Data/pagila-data.sql:348: ERROR:  syntax error at or near "1"
LINE 1: 1 Afghanistan 2017-02-15 09:44:00
        ^
psql:Data/pagila-data.sql:949: invalid command \.
psql:Data/pagila-data.sql:956: ERROR:  syntax error at or near "1"
LINE 1: 1 A Corua (La Corua) 87 2017-02-15 09:45:25
        ^
psql:Data/pagila-data.sql:957: invalid command \N
psql:Data/pagila-data.sql:958: invalid command \N
psql:Data/pagila-data.sql:959: invalid command \N
psql:Data/pagila-data.sql:960: invalid command \N
psql:Data/pagila-data.sql:1560: invalid command \.
psql:Data/pagila-data.sql:1567: ERROR:  syntax error at or near "1"
LINE 1:

In [3]:
%load_ext sql

DB_ENDPOINT = "127.0.0.1"
DB = 'pagila_star'
DB_USER = 'student'
DB_PASSWORD = 'student'
DB_PORT = '5432'

# postgresql://username:password@host:port/database
conn_string = "postgresql://{}:{}@{}:{}/{}" \
                        .format(DB_USER, DB_PASSWORD, DB_ENDPOINT, DB_PORT, DB)

print(conn_string)
%sql $conn_string

postgresql://student:student@127.0.0.1:5432/pagila_star


'Connected: student@pagila_star'
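
The connection string above is assembled with plain string formatting, which works because these credentials contain no reserved URL characters. As a minimal sketch, assuming a password might contain characters such as `@` or `/`, the hypothetical helper `build_conn_string` below (not part of the notebook) percent-encodes the user name and password with the standard library's `urllib.parse.quote_plus` before building the same URL.

```python
from urllib.parse import quote_plus


def build_conn_string(user, password, host, port, db):
    """Build a postgresql:// URL, percent-encoding the credentials.

    Hypothetical helper for illustration; the notebook itself uses str.format.
    """
    return "postgresql://{}:{}@{}:{}/{}".format(
        quote_plus(user), quote_plus(password), host, port, db
    )


# Same values as the notebook cell above
conn_string = build_conn_string("student", "student", "127.0.0.1", "5432", "pagila_star")
print(conn_string)
```

With the notebook's values this yields the same string as the original cell; the encoding only changes the result when the credentials contain reserved characters.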

## 6.1 The fact table holds every dimension key needed, so no deep joins are required

In [4]:
%%time
%%sql
SELECT movie_key, date_key, customer_key, sales_amount
FROM factSales 
limit 5;


 * postgresql://student:***@127.0.0.1:5432/pagila_star
(psycopg2.ProgrammingError) relation "factsales" does not exist
LINE 2: FROM factSales 
             ^
 [SQL: 'SELECT movie_key, date_key, customer_key, sales_amount\nFROM factSales \nlimit 5;']
CPU times: user 2.27 ms, sys: 353 µs, total: 2.63 ms
Wall time: 13.9 ms
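
The error above names the relation as "factsales" even though the query spells it factSales. PostgreSQL folds unquoted identifiers to lowercase, so both spellings refer to the same table; only a double-quoted identifier keeps its case. The hypothetical function below is a simplified model of that rule, written for illustration only. It is not PostgreSQL's actual parser.

```python
def fold_identifier(name):
    """Simplified model of PostgreSQL identifier case folding.

    Hypothetical helper: a double-quoted identifier keeps its case
    (with "" standing for one literal quote); anything else is lowercased.
    """
    if len(name) >= 2 and name[0] == '"' and name[-1] == '"':
        return name[1:-1].replace('""', '"')
    return name.lower()


print(fold_identifier("factSales"))    # unquoted: folded to lowercase
print(fold_identifier('"factSales"'))  # quoted: case preserved
```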


## 6.2 Join the fact table with the dimension tables to replace keys with attributes

As you run each cell, pay attention to the time that is printed. Which schema do you think will run faster?
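
`%%time` reports a single run, so one measurement can be skewed by caching or other activity on the machine. IPython's `%%timeit` magic repeats a cell automatically; the sketch below shows the same idea in plain Python. The helper name `best_of` and the stand-in workload are assumptions for illustration. In practice the callable would execute one of the queries below.

```python
import time


def best_of(run_query, repeats=5):
    """Return the fastest wall-clock time, in seconds, over `repeats` calls."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_query()
        timings.append(time.perf_counter() - start)
    return min(timings)


# Stand-in workload; replace the lambda with a call that runs a query
elapsed = best_of(lambda: sum(range(100_000)))
print(f"best of 5: {elapsed * 1000:.3f} ms")
```

Taking the minimum rather than the mean is a design choice: it approximates the run with the least interference from the rest of the system.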

##### Star Schema

In [6]:
%%time
%%sql
SELECT dimMovie.title, dimDate.month, dimCustomer.city, sum(sales_amount) as revenue
FROM factSales 
JOIN dimMovie    on (dimMovie.movie_key      = factSales.movie_key)
JOIN dimDate     on (dimDate.date_key         = factSales.date_key)
JOIN dimCustomer on (dimCustomer.customer_key = factSales.customer_key)
group by (dimMovie.title, dimDate.month, dimCustomer.city)
order by dimMovie.title, dimDate.month, dimCustomer.city, revenue desc;

 * postgresql://student:***@127.0.0.1:5432/pagila_star
(psycopg2.ProgrammingError) relation "factsales" does not exist
LINE 2: FROM factSales 
             ^
 [SQL: 'SELECT dimMovie.title, dimDate.month, dimCustomer.city, sum(sales_amount) as revenue\nFROM factSales \nJOIN dimMovie    on (dimMovie.movie_key      = factSales.movie_key)\nJOIN dimDate     on (dimDate.date_key         = factSales.date_key)\nJOIN dimCustomer on (dimCustomer.customer_key = factSales.customer_key)\ngroup by (dimMovie.title, dimDate.month, dimCustomer.city)\norder by dimMovie.title, dimDate.month, dimCustomer.city, revenue desc;']
CPU times: user 5.09 ms, sys: 177 µs, total: 5.27 ms
Wall time: 6.63 ms


##### 3NF Schema

In [None]:
%%time
%%sql
SELECT f.title, EXTRACT(month FROM p.payment_date) as month, ci.city, sum(p.amount) as revenue
FROM payment p
JOIN rental r    ON ( p.rental_id = r.rental_id )
JOIN inventory i ON ( r.inventory_id = i.inventory_id )
JOIN film f ON ( i.film_id = f.film_id)
JOIN customer c  ON ( p.customer_id = c.customer_id )
JOIN address a ON ( c.address_id = a.address_id )
JOIN city ci ON ( a.city_id = ci.city_id )
group by (f.title, month, ci.city)
order by f.title, month, ci.city, revenue desc;
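
To see that the two queries above answer the same question, the sketch below rebuilds both schemas in an in-memory SQLite database and compares the results. Everything here is an assumption for illustration: the rows are made up (not Pagila data), the tables are trimmed to the columns the queries touch, the `sales_key` column and the `YYYYMMDD` form of `date_key` are guesses, and SQLite's `strftime` stands in for PostgreSQL's `EXTRACT`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- 3NF tables, trimmed to the columns the query touches
CREATE TABLE city      (city_id INTEGER PRIMARY KEY, city TEXT);
CREATE TABLE address   (address_id INTEGER PRIMARY KEY, city_id INTEGER);
CREATE TABLE customer  (customer_id INTEGER PRIMARY KEY, address_id INTEGER);
CREATE TABLE film      (film_id INTEGER PRIMARY KEY, title TEXT);
CREATE TABLE inventory (inventory_id INTEGER PRIMARY KEY, film_id INTEGER);
CREATE TABLE rental    (rental_id INTEGER PRIMARY KEY, inventory_id INTEGER);
CREATE TABLE payment   (payment_id INTEGER PRIMARY KEY, customer_id INTEGER,
                        rental_id INTEGER, amount REAL, payment_date TEXT);

-- Made-up rows for illustration only
INSERT INTO city      VALUES (1, 'City A'), (2, 'City B');
INSERT INTO address   VALUES (10, 1), (20, 2);
INSERT INTO customer  VALUES (100, 10), (200, 20);
INSERT INTO film      VALUES (1, 'Film X'), (2, 'Film Y');
INSERT INTO inventory VALUES (1, 1), (2, 2);
INSERT INTO rental    VALUES (1, 1), (2, 2), (3, 1), (4, 1);
INSERT INTO payment   VALUES
    (1, 100, 1, 2.5, '2017-01-15'),
    (2, 200, 2, 4.0, '2017-01-20'),
    (3, 100, 3, 1.5, '2017-01-28'),
    (4, 200, 4, 3.0, '2017-02-03');

-- Star schema tables, filled from the 3NF tables
CREATE TABLE dimMovie    (movie_key INTEGER PRIMARY KEY, title TEXT);
CREATE TABLE dimDate     (date_key INTEGER PRIMARY KEY, month INTEGER);
CREATE TABLE dimCustomer (customer_key INTEGER PRIMARY KEY, city TEXT);
CREATE TABLE factSales   (sales_key INTEGER PRIMARY KEY, date_key INTEGER,
                          customer_key INTEGER, movie_key INTEGER,
                          sales_amount REAL);

INSERT INTO dimMovie SELECT film_id, title FROM film;
INSERT INTO dimCustomer
    SELECT c.customer_id, ci.city
    FROM customer c
    JOIN address a ON c.address_id = a.address_id
    JOIN city ci   ON a.city_id = ci.city_id;
INSERT INTO dimDate
    SELECT DISTINCT CAST(strftime('%Y%m%d', payment_date) AS INTEGER),
                    CAST(strftime('%m', payment_date) AS INTEGER)
    FROM payment;
INSERT INTO factSales (date_key, customer_key, movie_key, sales_amount)
    SELECT CAST(strftime('%Y%m%d', p.payment_date) AS INTEGER),
           p.customer_id, i.film_id, p.amount
    FROM payment p
    JOIN rental r    ON p.rental_id = r.rental_id
    JOIN inventory i ON r.inventory_id = i.inventory_id;
""")

STAR_QUERY = """
SELECT dimMovie.title, dimDate.month, dimCustomer.city, SUM(sales_amount) AS revenue
FROM factSales
JOIN dimMovie    ON dimMovie.movie_key       = factSales.movie_key
JOIN dimDate     ON dimDate.date_key         = factSales.date_key
JOIN dimCustomer ON dimCustomer.customer_key = factSales.customer_key
GROUP BY dimMovie.title, dimDate.month, dimCustomer.city
ORDER BY dimMovie.title, dimDate.month, dimCustomer.city;
"""

NF_QUERY = """
SELECT f.title, CAST(strftime('%m', p.payment_date) AS INTEGER) AS month,
       ci.city, SUM(p.amount) AS revenue
FROM payment p
JOIN rental r    ON p.rental_id = r.rental_id
JOIN inventory i ON r.inventory_id = i.inventory_id
JOIN film f      ON i.film_id = f.film_id
JOIN customer c  ON p.customer_id = c.customer_id
JOIN address a   ON c.address_id = a.address_id
JOIN city ci     ON a.city_id = ci.city_id
GROUP BY f.title, month, ci.city
ORDER BY f.title, month, ci.city;
"""

star_rows = conn.execute(STAR_QUERY).fetchall()
nf_rows = conn.execute(NF_QUERY).fetchall()
print(star_rows == nf_rows)  # both schemas produce the same report
```

The star query reaches every attribute with three joins from the fact table, while the 3NF query needs six joins to walk from `payment` to `film` and `city`.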

# Conclusion

We were able to show that:
* The star schema is easier to understand and to write queries against: the same report needs three joins instead of six.
* Queries against the star schema are more performant, since they need fewer joins.