# TP2 - DB Normalization and Querying

The objectives of this TP are:
1. Apply normalization 1NF -> 2NF -> 3NF
2. Perform SQL queries on the normalized database

In this TP, we will use the database **`wine.db`** (available on the course website), which contains wine information related to 'production' and 'sales'.

> Production <---> Wine <---> Sales


---

### Working with db files in Jupyter
- Python provides an interface to SQLite through the *sqlite3* module.
- The **`%%sql`** magic builds upon it (and other tools) to enable the use of SQL commands within a Jupyter Notebook, as in common SQL clients.
- Before proceeding, make sure that **`wine.db`** is in the same directory as this notebook.
  - If **`wine.db`** is not in the same directory, an empty **`wine.db`** file will be created, resulting in errors in later steps of the TP.
- The SQLite module in Python commits transactions automatically. This means that any change to the DB, e.g. creating or deleting tables, is immediately written to the file.
  - For this reason, it is recommended to keep a backup of **`wine.db`** as it is provided on the course website.
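
One possible guard against the empty-file problem mentioned above is to open the database through an SQLite URI with `mode=rw`, which refuses to create a missing file. This is only a minimal sketch; the helper name `open_existing_db` is hypothetical and not part of the TP material.

```python
import sqlite3
from pathlib import Path

def open_existing_db(path):
    """Open an SQLite file only if it already exists (hypothetical helper)."""
    # mode=rw opens the file read-write WITHOUT creating it, so a missing
    # database raises sqlite3.OperationalError instead of silently
    # producing an empty file.
    uri = Path(path).resolve().as_uri() + "?mode=rw"
    return sqlite3.connect(uri, uri=True)
```

Calling `open_existing_db('wine.db')` in place of `sqlite3.connect('wine.db')` would then fail early when the file is not next to the notebook.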

---

**`wine.db`** contains the following unnormalized tables:

<center>**Master1**</center>

|*Attribute*|         *Description*          |
| -------   |--------------------------------|
| NV        | Wine number                    |
| CRU       | Vineyard or group of vineyards |
| DEGRE     | Alcohol content                |
| MILL      | Vintage year                   |
| QTE       | Number of bottles harvested    |
| NP        | Producer number                |
| NOM       | Producer's last name           |
| PRENOM    | Producer's first name          |
| REGION    | Production region              |

From Wikipedia:

__Cru__: Often used to indicate a specifically named and legally defined vineyard or ensemble of vineyards and the vines "which grow on [such] a reputed terroir; by extension of good quality." The term is also used to refer to the wine produced from such vines.


<center>**Master2**</center>

|*Attribute*|                         *Description*                  |
| -------   |--------------------------------------------------------|
| NV        | Wine number                                            |
| CRU       | Vineyard or group of vineyards                         |
| DEGRE     | Alcohol content                                        |
| MILL      | Vintage year                                           |
| DATES     | Buying date                                            |
| LIEU      | Place where the wine was sold                          |
| QTE       | Number of bottles bought                               |
| NB        | Client (buveur) number                                 |
| NOM       | Client's last name                                     |
| PRENOM    | Client's first name                                    |
| TYPE      | Type of client by volume of purchases                  |
| REGION    | Administrative Region (different to production region) |


In [27]:
import sqlite3    # Python interface for SQLite databases

In [28]:
def printSchema(connection):
    # Function to print the DB schema
    # Source: http://stackoverflow.com/a/35092773/4765776
    for (tableName,) in connection.execute(
        """
        select NAME from SQLITE_MASTER where TYPE='table' order by NAME;
        """
    ):
        print("{}:".format(tableName))
        for (
            columnID, columnName, columnType,
            columnNotNull, columnDefault, columnPK,
        ) in connection.execute("pragma table_info('{}');".format(tableName)):
            print("  {id}: {name}({type}){null}{default}{pk}".format(
                id=columnID,
                name=columnName,
                type=columnType,
                null=" not null" if columnNotNull else "",
                default=" [{}]".format(columnDefault) if columnDefault else "",
                pk=" *{}".format(columnPK) if columnPK else "",
            ))

In [29]:
conn = sqlite3.connect('wine.db')
c = conn.cursor()
print("Database schema:")
printSchema(conn)           # A useful way to visualize the content of the database

Database schema:


From this point on, we will use the __%%sql__ magic.

In [30]:
%load_ext sql
%sql sqlite:///wine.db

The sql extension is already loaded. To reload it, use:
  %reload_ext sql


'Connected: @wine.db'

# PART I: Database normalization

The first task in this TP is to normalize the wine data. In their current state, both tables **Master1** and **Master2** are in First Normal Form (1NF).

By inspecting the content of these tables we can see that multiple tuples have NULL values.

In [31]:
%%sql SELECT *
FROM Master1
LIMIT 10;

 * sqlite:///wine.db
(sqlite3.OperationalError) no such table: Master1 [SQL: 'SELECT *\nFROM Master1\nLIMIT 10;'] (Background on this error at: http://sqlalche.me/e/e3q8)


* Notice that Jupyter *displays* 'None' instead of 'NULL'.
  - This is only to comply with Python notation.
* To account for NULL values, your SQL queries must test explicitly for 'NULL', using `IS NULL` or `IS NOT NULL`.
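
The following sketch illustrates why the explicit test matters. It uses a small in-memory table with made-up values (the table name `demo` and its rows are assumptions, not data from `wine.db`): comparing with `= NULL` never matches, while `IS NULL` does.

```python
import sqlite3

# In-memory toy table; the rows are invented for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE demo (NV INTEGER, DEGRE REAL)")
conn.executemany(
    "INSERT INTO demo VALUES (?, ?)",
    [(1, 11.5), (2, None), (3, 12.0)],  # Python None is stored as SQL NULL
)

# '= NULL' evaluates to NULL (unknown) for every row, so nothing is returned.
eq_null = conn.execute("SELECT NV FROM demo WHERE DEGRE = NULL").fetchall()
# 'IS NULL' / 'IS NOT NULL' are the explicit tests.
is_null = conn.execute("SELECT NV FROM demo WHERE DEGRE IS NULL").fetchall()
not_null = conn.execute(
    "SELECT NV FROM demo WHERE DEGRE IS NOT NULL ORDER BY NV"
).fetchall()

print(eq_null)   # []
print(is_null)   # [(2,)]
print(not_null)  # [(1,), (3,)]
```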

Another problem in **Master1** and **Master2** is data redundancy, for example:

In [32]:
%%sql SELECT *
FROM Master1
WHERE NV = 45;

 * sqlite:///wine.db
(sqlite3.OperationalError) no such table: Master1 [SQL: 'SELECT *\nFROM Master1\nWHERE NV = 45;'] (Background on this error at: http://sqlalche.me/e/e3q8)


In [33]:
%%sql SELECT *
FROM Master1
ORDER BY NV DESC
LIMIT 10 ;

 * sqlite:///wine.db
(sqlite3.OperationalError) no such table: Master1 [SQL: 'SELECT *\nFROM Master1\nORDER BY NV DESC\nLIMIT 10 ;'] (Background on this error at: http://sqlalche.me/e/e3q8)


---

Additional resource for Normalization:

https://www.youtube.com/watch?v=UrYLYV7WSHM

---

#### Exercise 1.1

Convert table **Master1** to the Second Normal Form (2NF) or Third Normal Form (3NF) as applicable.
* Explain your answer
* List main functional dependencies (not all of them)
* Describe the schema of new tables and how they relate
  * You can write Tables as above or you can insert images in the notebook.
  
Remember that **`wine.db`** contains information related to wine 'production' and 'sales'.

> Production <---> Wine <---> Sales

A good starting point is to look for the 'Wine' attributes.

**Hint:** Look for redundant information between the master tables.
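
When listing functional dependencies, it can help to check a candidate against the data. The sketch below is one way to do that, under stated assumptions: `fd_violations` is a hypothetical helper, and the table `toy` holds made-up rows rather than the content of **Master1**. It reports every left-hand value that maps to more than one distinct right-hand value (NULLs on the right-hand side are ignored by `COUNT(DISTINCT ...)`).

```python
import sqlite3

def fd_violations(conn, table, lhs, rhs):
    """Return (lhs_value, n_distinct_rhs) pairs that contradict lhs -> rhs."""
    # Table and column names are interpolated into the SQL text, so this
    # helper should only be called with trusted identifiers.
    query = (
        f"SELECT {lhs}, COUNT(DISTINCT {rhs}) FROM {table} "
        f"WHERE {lhs} IS NOT NULL "
        f"GROUP BY {lhs} HAVING COUNT(DISTINCT {rhs}) > 1 "
        f"ORDER BY {lhs}"
    )
    return conn.execute(query).fetchall()

# Made-up rows, NOT taken from wine.db.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE toy (NV INTEGER, CRU TEXT, NP INTEGER, QTE INTEGER)")
conn.executemany(
    "INSERT INTO toy VALUES (?, ?, ?, ?)",
    [(1, "Pommard", 10, 100), (1, "Pommard", 11, 50), (2, "Brouilly", 10, 70)],
)

holds = fd_violations(conn, "toy", "NV", "CRU")   # NV -> CRU holds in the toy data
broken = fd_violations(conn, "toy", "NV", "QTE")  # NV 1 has two different QTE values
print(holds)   # []
print(broken)  # [(1, 2)]
```

An empty result only shows that the data does not contradict the dependency; whether it truly holds is still a modelling decision.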

In [9]:
%%sql SELECT NV, NOM, PRENOM,QTE
FROM Master1 
ORDER BY NV DESC
LIMIT  5 ;

 * sqlite:///wine.db
(sqlite3.OperationalError) no such table: Master1 [SQL: 'SELECT NV, NOM, PRENOM,QTE\nFROM Master1 \nORDER BY NV DESC\nLIMIT  5 ;'] (Background on this error at: http://sqlalche.me/e/e3q8)


#### Exercise 1.2

Convert table **Master2** to the Second Normal Form (2NF) or Third Normal Form (3NF) as applicable.
* Explain your answer
* List main functional dependencies (not all of them)
* Describe the schema of new tables and how they relate
  * You can write Tables as above or you can insert images in the notebook.

**Note:** For this part, consider that a wine can be bought in multiple locations and multiple times per day.
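
To see what this note implies for the choice of key, the sketch below counts repeated value combinations in a toy sales table. The helper `duplicated_groups`, the table `toy_sales`, its rows, and the date format are all assumptions made for illustration.

```python
import sqlite3

def duplicated_groups(conn, table, cols):
    """Return value combinations of `cols` that occur in more than one row."""
    col_list = ", ".join(cols)  # trusted identifiers only
    query = (
        f"SELECT {col_list}, COUNT(*) FROM {table} "
        f"GROUP BY {col_list} HAVING COUNT(*) > 1"
    )
    return conn.execute(query).fetchall()

# Made-up purchases: same client and wine, same day, two places, bought twice in Paris.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE toy_sales (NB INTEGER, NV INTEGER, DATES TEXT, LIEU TEXT, QTE INTEGER)"
)
conn.executemany(
    "INSERT INTO toy_sales VALUES (?, ?, ?, ?, ?)",
    [
        (1, 5, "1980-01-01", "Paris", 2),
        (1, 5, "1980-01-01", "Lyon", 3),
        (1, 5, "1980-01-01", "Paris", 4),
    ],
)

by_client_wine = duplicated_groups(conn, "toy_sales", ["NB", "NV"])
by_four_cols = duplicated_groups(conn, "toy_sales", ["NB", "NV", "DATES", "LIEU"])
print(by_client_wine)  # [(1, 5, 3)]
print(by_four_cols)    # [(1, 5, '1980-01-01', 'Paris', 2)]
```

In this toy data, neither (NB, NV) nor (NB, NV, DATES, LIEU) identifies a purchase uniquely, which is the situation the note describes.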

In [34]:
%%sql SELECT NV, CRU, DEGRE, MILL, REGION
FROM Master2
ORDER BY NV DESC
LIMIT 5;

 * sqlite:///wine.db
(sqlite3.OperationalError) no such table: Master2 [SQL: 'SELECT NV, CRU, DEGRE, MILL, REGION\nFROM Master2\nORDER BY NV DESC\nLIMIT 5;'] (Background on this error at: http://sqlalche.me/e/e3q8)


In [11]:
%%sql SELECT NOM, PRENOM, TYPE, QTE, DATES, LIEU, NV
FROM Master2
ORDER BY NV DESC
LIMIT 5;

 * sqlite:///wine.db
(sqlite3.OperationalError) no such table: Master2 [SQL: 'SELECT NOM, PRENOM, TYPE, QTE, DATES, LIEU, NV\nFROM Master2\nORDER BY NV DESC\nLIMIT 5;'] (Background on this error at: http://sqlalche.me/e/e3q8)


Once you have defined the 2NF or 3NF (as applicable), we need to split the data into new tables.

A table can be created from the result of a query.

In the following example we will create a new table "dummy" to store the different values of alcohol content.

In [12]:
%%sql DROP TABLE IF EXISTS dummy;

-- Create dummy table
CREATE TABLE dummy AS
SELECT DISTINCT DEGRE
FROM MASTER1;

 * sqlite:///wine.db
Done.
(sqlite3.OperationalError) no such table: MASTER1 [SQL: '-- Create dummy table\nCREATE TABLE dummy AS\nSELECT DISTINCT DEGRE\nFROM MASTER1;'] (Background on this error at: http://sqlalche.me/e/e3q8)


In [13]:
print("\nContent of the database")
printSchema(conn)


Content of the database


In [14]:
%%sql
SELECT *
FROM dummy;

 * sqlite:///wine.db
(sqlite3.OperationalError) no such table: dummy [SQL: 'SELECT *\nFROM dummy;'] (Background on this error at: http://sqlalche.me/e/e3q8)


Looking into "dummy", we notice that the result of our query includes NULL. This would not be allowed if we were to use DEGRE as the key of a table.

To correct this, we need to change the query to explicitly test for NULL as follows:

In [15]:
%%sql DROP TABLE IF EXISTS dummy;

-- Create dummy table
CREATE TABLE dummy AS
SELECT DISTINCT DEGRE
FROM MASTER1
WHERE DEGRE IS NOT NULL;

SELECT *
FROM dummy;

 * sqlite:///wine.db
Done.
(sqlite3.OperationalError) no such table: MASTER1 [SQL: '-- Create dummy table\nCREATE TABLE dummy AS\nSELECT DISTINCT DEGRE\nFROM MASTER1\nWHERE DEGRE IS NOT NULL;'] (Background on this error at: http://sqlalche.me/e/e3q8)


Notice that we test for `NULL`, given that `None` is only used for display.

In [16]:
# Remove "dummy" table
%sql DROP TABLE IF EXISTS dummy;

 * sqlite:///wine.db
Done.


[]

#### Exercise 1.3

Create the new tables from Master1:

In [17]:
%%sql 
DROP TABLE IF EXISTS Master1_wine;
DROP TABLE IF EXISTS Master1_person;
DROP TABLE IF EXISTS Master1_sell;

CREATE TABLE Master1_wine AS
SELECT DISTINCT NV, CRU, DEGRE, MILL
FROM MASTER1
WHERE NV IS NOT NULL;

CREATE TABLE Master1_person AS
SELECT DISTINCT NP, NOM, PRENOM
FROM MASTER1
WHERE NP IS NOT NULL;


CREATE TABLE Master1_sell AS
SELECT DISTINCT NP, NV, QTE, REGION
FROM MASTER1
WHERE NP IS NOT NULL AND NV IS NOT NULL;



 * sqlite:///wine.db
Done.
Done.
Done.
(sqlite3.OperationalError) no such table: MASTER1 [SQL: 'CREATE TABLE Master1_wine AS\nSELECT DISTINCT NV, CRU, DEGRE, MILL\nFROM MASTER1\nWHERE NV IS NOT NULL;'] (Background on this error at: http://sqlalche.me/e/e3q8)


#### Exercise 1.4

Create the new tables from Master2:

In [18]:
%%sql 
DROP TABLE IF EXISTS Master2_wine;
DROP TABLE IF EXISTS Master2_person;
DROP TABLE IF EXISTS Master2_sell;

CREATE TABLE Master2_wine AS
SELECT DISTINCT NV,CRU, DEGRE, MILL
FROM MASTER2
WHERE NV IS NOT NULL;

CREATE TABLE Master2_person AS
SELECT DISTINCT NB,NOM, PRENOM, TYPE
FROM MASTER2
WHERE NB IS NOT NULL;


CREATE TABLE Master2_sell AS
SELECT DISTINCT NB, NV, QTE, DATES, LIEU, REGION
FROM MASTER2
WHERE NB IS NOT NULL AND NV IS NOT NULL;


 * sqlite:///wine.db
Done.
Done.
Done.
(sqlite3.OperationalError) no such table: MASTER2 [SQL: 'CREATE TABLE Master2_wine AS\nSELECT DISTINCT NV,CRU, DEGRE, MILL\nFROM MASTER2\nWHERE NV IS NOT NULL;'] (Background on this error at: http://sqlalche.me/e/e3q8)


# PART II: SQL QUERIES

In the second part of this TP you will create SQL queries to retrieve information from the database.

**Important:**

- You MUST use the normalized tables created in previous steps.
  - The normalized tables will also be used in TP3.
- Do NOT use **Master1** and **Master2** in your queries.

#### Exercise 2.1

What are the different types of clients (buveurs) by volume of purchases?

In [19]:
%%sql
SELECT TYPE from Master2_person
GROUP BY TYPE
;


 * sqlite:///wine.db
(sqlite3.OperationalError) no such table: Master2_person [SQL: 'SELECT TYPE from Master2_person\nGROUP BY TYPE\n;'] (Background on this error at: http://sqlalche.me/e/e3q8)


In [20]:
%%sql
SELECT sum(QTE),TYPE FROM Master2_person, Master2_sell
WHERE Master2_person.NB=Master2_sell.NB
GROUP BY TYPE;

 * sqlite:///wine.db
(sqlite3.OperationalError) no such table: Master2_person [SQL: 'SELECT sum(QTE),TYPE FROM Master2_person, Master2_sell\nWHERE Master2_person.NB=Master2_sell.NB\nGROUP BY TYPE;'] (Background on this error at: http://sqlalche.me/e/e3q8)



#### Exercise 2.2

What regions produce Pommard or Brouilly?

In [None]:
%%sql
SELECT DISTINCT CRU, REGION from Master1_wine, Master1_sell
WHERE Master1_wine.NV=Master1_sell.NV
AND Master1_wine.CRU in ('Pommard','Brouilly');

#### Exercise 2.3

What regions produce Pommard and Brouilly?

In [None]:
%%sql
SELECT DISTINCT REGION from Master1_wine, Master1_sell
WHERE Master1_wine.NV=Master1_sell.NV AND
Master1_wine.CRU='Pommard'
INTERSECT
SELECT DISTINCT REGION from Master1_wine, Master1_sell
WHERE Master1_wine.NV=Master1_sell.NV AND
Master1_wine.CRU='Brouilly';
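
The difference between "or" (Exercise 2.2) and "and" (Exercise 2.3) can be checked on a toy example. The sketch below uses made-up tables `wine` and `prod` (hypothetical names and rows, not the TP tables) to contrast `IN`, `INTERSECT`, and the tempting but always-empty `CRU = ... AND CRU = ...`.

```python
import sqlite3

# Made-up data for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE wine (NV INTEGER, CRU TEXT)")
conn.execute("CREATE TABLE prod (NV INTEGER, REGION TEXT)")
conn.executemany(
    "INSERT INTO wine VALUES (?, ?)",
    [(1, "Pommard"), (2, "Brouilly"), (3, "Pommard")],
)
conn.executemany(
    "INSERT INTO prod VALUES (?, ?)",
    [(1, "Bourgogne"), (2, "Beaujolais"), (3, "Beaujolais")],
)

# "Pommard OR Brouilly": one row per region producing either cru.
either = conn.execute(
    "SELECT DISTINCT REGION FROM wine JOIN prod USING (NV) "
    "WHERE CRU IN ('Pommard', 'Brouilly') ORDER BY REGION"
).fetchall()

# "Pommard AND Brouilly": regions present in both result sets.
both = conn.execute(
    "SELECT REGION FROM wine JOIN prod USING (NV) WHERE CRU = 'Pommard' "
    "INTERSECT "
    "SELECT REGION FROM wine JOIN prod USING (NV) WHERE CRU = 'Brouilly'"
).fetchall()

# A single row cannot have two different CRU values, so this is always empty.
naive = conn.execute(
    "SELECT DISTINCT REGION FROM wine JOIN prod USING (NV) "
    "WHERE CRU = 'Pommard' AND CRU = 'Brouilly'"
).fetchall()

print(either)  # [('Beaujolais',), ('Bourgogne',)]
print(both)    # [('Beaujolais',)]
print(naive)   # []
```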

#### Exercise 2.4

Get the number of wines bought by CRU and vintage (MILL)

In [None]:
%%sql
SELECT CRU, MILL, SUM(QTE) AS total_bought from Master2_wine, Master2_sell
WHERE Master2_sell.NV=Master2_wine.NV
GROUP BY Master2_wine.CRU, Master2_wine.MILL
LIMIT 10;

#### Exercise 2.5

Retrieve the wine number (NV) of wines produced by more than three producers

In [None]:
%%sql
SELECT Master1_wine.NV, COUNT(DISTINCT Master1_sell.NP) from Master1_sell, Master1_wine 
WHERE Master1_sell.NV=Master1_wine.NV
GROUP BY Master1_wine.NV
HAVING COUNT(DISTINCT Master1_sell.NP)>3;

#### Exercise 2.6

Which producers have not produced any wine?

In [None]:
%%sql
SELECT * FROM Master1_person
WHERE NP NOT IN (
SELECT NP FROM Master1_sell
WHERE QTE > 0);

#### Exercise 2.7

What clients (buveurs) have bought at least one wine from 1980?

In [None]:
%%sql
SELECT DISTINCT NB, NOM, PRENOM 
from Master2_person
WHERE NB IN (SELECT NB from Master2_sell
             WHERE NV IN (SELECT NV from Master2_wine WHERE MILL=1980));

#### Exercise 2.8

What clients (buveurs) have NOT bought any wine from 1980?

In [None]:
%%sql
SELECT NB, NOM, PRENOM FROM Master2_person 
WHERE NB NOT IN (SELECT NB FROM Master2_sell WHERE NV IN (SELECT NV FROM Master2_wine WHERE MILL=1980))
ORDER BY NB
LIMIT 10;
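
One pitfall worth keeping in mind with `NOT IN`: if the subquery returns a NULL, the comparison becomes unknown for every non-matching row and the outer query returns nothing. The sketch below shows this on made-up tables (`person` and `sell` here are hypothetical toy tables, not the normalized TP tables).

```python
import sqlite3

# Made-up data: the second purchase has an unknown client number.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE person (NB INTEGER)")
conn.execute("CREATE TABLE sell (NB INTEGER, NV INTEGER)")
conn.executemany("INSERT INTO person VALUES (?)", [(1,), (2,), (3,)])
conn.executemany("INSERT INTO sell VALUES (?, ?)", [(1, 10), (None, 11)])

# The NULL in the subquery makes 'NB NOT IN (...)' unknown for clients 2 and 3.
with_null = conn.execute(
    "SELECT NB FROM person WHERE NB NOT IN (SELECT NB FROM sell) ORDER BY NB"
).fetchall()

# Filtering NULLs out of the subquery restores the expected answer.
without_null = conn.execute(
    "SELECT NB FROM person "
    "WHERE NB NOT IN (SELECT NB FROM sell WHERE NB IS NOT NULL) ORDER BY NB"
).fetchall()

print(with_null)     # []
print(without_null)  # [(2,), (3,)]
```

Normalized tables built with an `IS NOT NULL` filter on the key columns avoid this situation.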

#### Exercise 2.9

What clients (buveurs) have bought ONLY wines from 1980?

In [None]:
%%sql
SELECT DISTINCT NB, NOM, PRENOM FROM MASTER2_person
WHERE NB IN 
(
SELECT DISTINCT NB FROM MASTER2_sell 
    WHERE NV IN (
    SELECT NV FROM MASTER2_wine
    WHERE MILL=1980
    )
)
INTERSECT
SELECT DISTINCT NB, NOM, PRENOM FROM MASTER2_person
WHERE NB NOT IN 
(
SELECT DISTINCT NB FROM MASTER2_sell 
    WHERE NV IN 
    (SELECT DISTINCT NV FROM MASTER2_wine
    WHERE MILL!=1980
    ) 
);
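
A query for "ONLY" can also be phrased as a set difference: clients who bought some 1980 wine, minus clients who bought some wine from another year. The sketch below checks this idea on made-up tables (`wine` and `sell` are hypothetical toy tables with invented rows).

```python
import sqlite3

# Made-up data: wines 1 and 2 are from 1980, wine 3 is from 1975.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE wine (NV INTEGER, MILL INTEGER)")
conn.execute("CREATE TABLE sell (NB INTEGER, NV INTEGER)")
conn.executemany("INSERT INTO wine VALUES (?, ?)", [(1, 1980), (2, 1980), (3, 1975)])
conn.executemany(
    "INSERT INTO sell VALUES (?, ?)",
    [(1, 1), (1, 2), (2, 1), (2, 3), (3, 3)],
)

only_1980 = conn.execute(
    """
    SELECT NB FROM sell WHERE NV IN (SELECT NV FROM wine WHERE MILL = 1980)
    EXCEPT
    SELECT NB FROM sell WHERE NV IN (SELECT NV FROM wine WHERE MILL <> 1980)
    ORDER BY NB
    """
).fetchall()

# Client 1 bought only 1980 wines; client 2 mixed years; client 3 bought only 1975.
print(only_1980)  # [(1,)]
```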


#### Exercise 2.10

List all wines from 1980

In [None]:
%%sql
SELECT DISTINCT * FROM MASTER2_wine
WHERE MILL=1980

#### Exercise 2.11

What are the wines from 1980 bought by NB=2?

In [None]:
%%sql
SELECT * FROM MASTER2_wine
WHERE NV IN (SELECT NV FROM MASTER2_sell WHERE NB=2)
AND
MILL=1980;

#### Exercise 2.12

What clients (buveurs) have bought ALL the wines from 1980?

In [None]:
%%sql 
SELECT * from MASTER2_person
WHERE NB IN(SELECT NB  from MASTER2_sell WHERE NV IN 
             (SELECT NV from MASTER2_wine WHERE MILL=1980)
            GROUP BY NB
            HAVING COUNT(DISTINCT NV) = 
            (SELECT COUNT(DISTINCT NV)
            FROM MASTER2_wine
            WHERE MILL = 1980)
           );
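
"ALL" queries of this kind are instances of relational division. Besides comparing counts, a common alternative formulation uses a double `NOT EXISTS`: keep the clients for whom there is no 1980 wine that they have not bought. The sketch below tries it on made-up tables (`person`, `wine` and `sell` are hypothetical toy tables with invented rows).

```python
import sqlite3

# Made-up data: wines 1 and 2 are from 1980, wine 3 is from 1975.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE person (NB INTEGER)")
conn.execute("CREATE TABLE wine (NV INTEGER, MILL INTEGER)")
conn.execute("CREATE TABLE sell (NB INTEGER, NV INTEGER)")
conn.executemany("INSERT INTO person VALUES (?)", [(1,), (2,), (3,)])
conn.executemany("INSERT INTO wine VALUES (?, ?)", [(1, 1980), (2, 1980), (3, 1975)])
conn.executemany(
    "INSERT INTO sell VALUES (?, ?)",
    [(1, 1), (1, 2), (2, 1), (2, 3), (3, 1), (3, 2), (3, 3)],
)

bought_all_1980 = conn.execute(
    """
    SELECT p.NB FROM person p
    WHERE NOT EXISTS (            -- there is no 1980 wine ...
        SELECT 1 FROM wine w
        WHERE w.MILL = 1980
          AND NOT EXISTS (        -- ... that this client has not bought
              SELECT 1 FROM sell s
              WHERE s.NB = p.NB AND s.NV = w.NV
          )
    )
    ORDER BY p.NB
    """
).fetchall()

print(bought_all_1980)  # [(1,), (3,)]
```

Note that if there were no 1980 wines at all, this formulation would return every client, whereas the count-based version would return none, since clients with no qualifying purchase never appear in the grouped subquery.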

