# STEP 6: Repeat the computation from the fact and dimension tables

Note: You do not have to write any code in this notebook. Its only purpose is to illustrate the performance difference between the star and 3NF schemas.

Start by running the code in the cell below to connect to the database.

In [1]:
import os

In [None]:
# !PGPASSWORD=student createdb -h 127.0.0.1 -U student pagila_star
# !PGPASSWORD=student psql -q -h 127.0.0.1 -U student -d pagila_star -f Data/pagila-data.sql

In [3]:
%load_ext sql

DB_ENDPOINT = os.environ["PGHOST"]
DB = 'pagila'
DB_USER = os.environ["PGUSER"]
DB_PASSWORD = os.environ["PGPASSWORD"]
DB_PORT = os.environ["PGPORT"]

# postgresql://username:password@host:port/database
conn_string = "postgresql://{}:{}@{}:{}/{}" \
                        .format(DB_USER, DB_PASSWORD, DB_ENDPOINT, DB_PORT, DB)

# print(conn_string)
%sql $conn_string

The sql extension is already loaded. To reload it, use:
  %reload_ext sql


'Connected: postgres@pagila'
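
The connection string above follows the `postgresql://username:password@host:port/database` pattern. If a password contains characters such as `@` or `:`, the plain `.format()` call produces an ambiguous URL. The sketch below shows one way to percent-encode the credentials first. The helper name `build_conn_string` and the example values are hypothetical, and it assumes that whatever parses the URL percent-decodes the user and password.

```python
from urllib.parse import quote


def build_conn_string(user, password, host, port, database):
    """Build a postgresql:// URL, percent-encoding the user and password.

    Hypothetical helper for illustration; not part of the notebook.
    """
    return "postgresql://{}:{}@{}:{}/{}".format(
        quote(user, safe=""),      # encode every reserved character
        quote(password, safe=""),  # e.g. '@' -> '%40', ':' -> '%3A'
        host,
        port,
        database,
    )


# Made-up credentials, chosen only to show the encoding.
example = build_conn_string("student", "p@ss:word", "127.0.0.1", "5432", "pagila")
print(example)  # postgresql://student:p%40ss%3Aword@127.0.0.1:5432/pagila
```

With the notebook's environment variables, the call would take `DB_USER`, `DB_PASSWORD`, `DB_ENDPOINT`, `DB_PORT`, and `DB` in that order.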

## 6.1 The fact table holds all the needed dimension keys, so no deep joins are needed

In [4]:
%%time
%%sql

SELECT
     movie_key
    ,date_key
    ,customer_key
    ,sales_amount
    
FROM
    star.fact_sales
    
LIMIT 5
;

 * postgresql://postgres:***@localhost:5432/pagila
5 rows affected.
Wall time: 4.99 ms


movie_key,date_key,customer_key,sales_amount
535,20170124,333,4.99
422,20170124,456,4.99
565,20170124,126,4.99
347,20170124,261,4.99
396,20170124,399,5.99
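
The `date_key` values in this output (for example `20170124`) look like dates packed into an integer as `YYYYMMDD`. That encoding is an assumption read off the sample rows, not something the notebook states. Under that assumption, a key can be decoded without joining to `dim_date`, as this sketch shows (the function name `date_key_to_date` is hypothetical):

```python
from datetime import date, datetime


def date_key_to_date(date_key: int) -> date:
    """Decode an integer key of the assumed form YYYYMMDD into a date."""
    return datetime.strptime(str(date_key), "%Y%m%d").date()


print(date_key_to_date(20170124))  # 2017-01-24
```

For anything beyond a quick check, joining to `dim_date` is still the safer route, since the dimension table is the authoritative source of date attributes such as `month`.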


## 6.2 Join fact table with dimensions to replace keys with attributes

As you run each cell, pay attention to the time that is printed. Which schema do you think will run faster?

##### Star Schema

In [14]:
%%time
%%sql
SELECT 
     dim_movie.title
    ,dim_date.month
    ,dim_customer.city
    ,SUM(fact_sales.sales_amount) as revenue
    
FROM
    star.fact_sales AS fact_sales    
INNER JOIN
    star.dim_movie AS dim_movie
ON
    dim_movie.movie_key = fact_sales.movie_key
INNER JOIN
    star.dim_date AS dim_date
ON
    dim_date.date_key = fact_sales.date_key
INNER JOIN
    star.dim_customer AS dim_customer
ON
    dim_customer.customer_key = fact_sales.customer_key
GROUP BY
     dim_movie.title
    ,dim_date.month
    ,dim_customer.city
    
ORDER BY
     dim_movie.title
    ,dim_date.month
    ,dim_customer.city
    ,revenue DESC
 
LIMIT
    10
;

 * postgresql://postgres:***@localhost:5432/pagila
10 rows affected.
Wall time: 35 ms


title,month,city,revenue
Academy Dinosaur,1,Celaya,0.99
Academy Dinosaur,1,Cianjur,1.99
Academy Dinosaur,2,San Lorenzo,0.99
Academy Dinosaur,2,Sullana,1.99
Academy Dinosaur,2,Udaipur,0.99
Academy Dinosaur,3,Almirante Brown,1.99
Academy Dinosaur,3,Goinia,0.99
Academy Dinosaur,3,Kaliningrad,0.99
Academy Dinosaur,3,Kurashiki,0.99
Academy Dinosaur,3,Livorno,0.99


##### 3NF Schema

In [15]:
%%time
%%sql

SELECT
     f.title
    ,EXTRACT(month FROM p.payment_date) as month
    ,ci.city
    ,SUM(p.amount) as revenue
    
FROM 
    payment p
INNER JOIN
    rental r
ON
    p.rental_id = r.rental_id
INNER JOIN
    inventory i
ON 
    r.inventory_id = i.inventory_id
INNER JOIN
    film f
ON 
    i.film_id = f.film_id
INNER JOIN
    customer c 
ON 
    p.customer_id = c.customer_id
INNER JOIN
    address a
ON 
    c.address_id = a.address_id
INNER JOIN
    city ci 
ON 
    a.city_id = ci.city_id

GROUP BY
     f.title
    ,month
    ,ci.city
     
ORDER BY
     f.title
    ,month
    ,ci.city
    ,revenue DESC
    
LIMIT
    10
;

 * postgresql://postgres:***@localhost:5432/pagila
10 rows affected.
Wall time: 87 ms


title,month,city,revenue
ACADEMY DINOSAUR,1.0,Celaya,0.99
ACADEMY DINOSAUR,1.0,Cianjur,1.99
ACADEMY DINOSAUR,2.0,San Lorenzo,0.99
ACADEMY DINOSAUR,2.0,Sullana,1.99
ACADEMY DINOSAUR,2.0,Udaipur,0.99
ACADEMY DINOSAUR,3.0,Almirante Brown,1.99
ACADEMY DINOSAUR,3.0,Goinia,0.99
ACADEMY DINOSAUR,3.0,Kaliningrad,0.99
ACADEMY DINOSAUR,3.0,Kurashiki,0.99
ACADEMY DINOSAUR,3.0,Livorno,0.99
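
The two outputs differ only in presentation: the 3NF query returns upper-case titles, and `EXTRACT` returns the month as `1.0` rather than `1`. The sketch below checks that the rows match once those differences are normalized. It copies the first three rows of each output by hand, and the `normalize` helper is hypothetical.

```python
from decimal import Decimal

# First three rows of each output above, copied by hand.
star_rows = [
    ("Academy Dinosaur", 1, "Celaya", "0.99"),
    ("Academy Dinosaur", 1, "Cianjur", "1.99"),
    ("Academy Dinosaur", 2, "San Lorenzo", "0.99"),
]
nf_rows = [
    ("ACADEMY DINOSAUR", 1.0, "Celaya", "0.99"),
    ("ACADEMY DINOSAUR", 1.0, "Cianjur", "1.99"),
    ("ACADEMY DINOSAUR", 2.0, "San Lorenzo", "0.99"),
]


def normalize(rows):
    """Upper-case titles, cast months to int, and parse revenue exactly."""
    return [
        (title.upper(), int(month), city, Decimal(revenue))
        for title, month, city, revenue in rows
    ]


rows_match = normalize(star_rows) == normalize(nf_rows)
print(rows_match)  # True
```

`Decimal` is used for revenue so that amounts such as `0.99` compare exactly rather than as binary floats.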


# Conclusion

We were able to show that:
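* The star schema is easier to understand and write queries against: the star query joins the fact table to three dimension tables, while the 3NF query needs six joins.
* The star schema query ran faster in this run: 35 ms wall time versus 87 ms for the 3NF query returning the same rows.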
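
The wall times reported in section 6.2 (35 ms and 87 ms) each come from a single `%%time` run, so they are sensitive to caching and background load. A more stable comparison repeats each query and takes the median. The sketch below is a minimal version of that idea. The helper names are hypothetical, and `run_query` stands for any zero-argument callable that executes one of the queries.

```python
import statistics
import time


def median_wall_time_ms(run_query, repeats=5):
    """Call run_query `repeats` times and return the median wall time in ms."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_query()
        timings.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(timings)


def speedup(slow_ms, fast_ms):
    """Return how many times faster the fast measurement is."""
    return slow_ms / fast_ms


# Single-run wall times reported in section 6.2.
print(round(speedup(87, 35), 2))  # 2.49
```

On those single-run numbers the star schema query is roughly 2.5 times faster. Repeated measurements on a different machine or a larger dataset may give a different ratio.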