# Modeling

When the amount or complexity of the data we work with overwhelms us, we look for tools that can help. Databases are among the best tools for this job. There are several kinds, depending on the data they are designed to work with:

 - document database: books, blog posts, scientific papers, tweets, ...
 - timeseries database: weather station readings
 - spatial database: positions and tracking of a fleet of vehicles
 - graph database: friendships, likes, retweets
 - relational database: customers, orders, providers, ...
 - and many more!

Even though in this course we will only discuss relational databases, it is important for you to know that other alternatives exist, and that they might be more suitable for some problem of yours in the future.

Relational databases are the most widely used because they are very versatile and can accommodate a very broad spectrum of data.

## E-R Diagrams

To store information in a relational database, we must first define the structure of the data, also called the relational model. As a first step, we will use Entity-Relationship (E-R) diagrams.

An E-R diagram is composed of:
 - Entities: refer to common real-world concepts, and are described by a set of attributes.
 - Relations: logical associations between entities, which may also have attributes. By cardinality, we find:
    - `1-1  -----`
    - `1-N  ----<`
    - `N-1  >----`
    - `N-N  >---<`

We will start with a basic example, defining the E-R diagram of the data needed to handle the enrollments of students in a university. As the starting point, we shall have a textual description of the problem:

*UAB offers a broad range of subjects for its students.
For each student, we must store their DNI, full name and birth date.
For each subject, we must store the name and its price.
Also, each subject is divided into several groups, so that we can adapt to the number of students and their different schedules. For each group, we must store its name, the schedule (either morning, afternoon or evening) and the teacher's name.
Finally, for each time a student enrolls in a subject, we must store a unique code, the date of the enrollment and whether they have paid yet or not.*

With this, we have enough information to build our E-R diagram. First, we shall identify the entities and their corresponding attributes:
 - student: dni, full name, birth date
 - subject: name, price
 - group: name, schedule, teacher name

We proceed by identifying the relations:
 - A subject has several groups, but a group corresponds to only one subject.
 - A student may be taking several subjects, enrolling in those groups which fit their schedule. In every group, there may be multiple students enrolled. Attributes: code, enrollment date, whether it has been paid or not.

The resulting E-R diagram is as follows:

    Subject          Group                                  Student
    =======   ---<   ================   >---------------<   =======
     name             name               code                dni
     price            schedule           enrollment date     full name
                      teacher's name     has paid            birth date

## Relational model

The next step is to transform that E-R diagram into a relational model specification. We shall adhere to the following conventions:
 - Each entity becomes a table, mapping each attribute to a different column.
 - All values of a column belong to the same data domain (data type).
 - For each table, there shall be at least one subset of columns that uniquely identifies each row: it is called the primary key.
 - 1-1 / 1-N / N-1 relationships between tables are implemented by adding a subset of columns (at least the primary key) from the referenced table into the referencing one. These subsets are called foreign keys.
 - N-N relationships are implemented by unfolding them into a pair of 1-N / N-1 relationships with an intermediate table. This table will contain a foreign key to each of the referenced tables and any additional attributes defined for that relationship.

Among all the guidelines a good model should follow, the ones referring to "normalization" are a MUST. The science behind the normalization of a model is vast and complex. For now, we can think of it as:
 - Values in a cell must be atomic
 - Values are stored in only one "place"
 - Columns that are not part of the primary key (PK) must depend ONLY on the PK

First, we define the tables:
 - Student: dni, full_name, birth_date
 - Subject: name, price
 - Group: name, schedule, teacher_name
 - Enrollment: code, date, has_paid

Then, we define the primary keys (*) and foreign keys (~):
 - Student: *dni, full_name, birth_date
 - Subject: *name, price
 - Group: *name, schedule, teacher_name, ~subject_name
 - Enrollment: *code, date, has_paid, ~student_dni, ~group_name

And the diagram:

    Subject          Group                   Enrollment             Student
    =======   ---<   ==============   ---<   =============   >---   =======
    *name            *name                   *code                  *dni
     price            schedule                date                   full_name
                      teacher_name            has_paid               birth_date
                     ~subject_name           ~group_name
                                             ~student_dni
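
As a rough illustration, the model above could be declared as follows using Python's built-in `sqlite3` module. This is only a sketch: the column types, the `CHECK` constraint and the sample rows are assumptions that are not part of the original description.

```python
import sqlite3

# Hypothetical DDL for the university model; all column types are assumptions.
SCHEMA = """
CREATE TABLE student (
    dni        TEXT PRIMARY KEY,
    full_name  TEXT NOT NULL,
    birth_date TEXT
);
CREATE TABLE subject (
    name  TEXT PRIMARY KEY,
    price REAL
);
-- "group" is a reserved word in SQL, so the table name must be quoted.
CREATE TABLE "group" (
    name         TEXT PRIMARY KEY,
    schedule     TEXT CHECK (schedule IN ('morning', 'afternoon', 'evening')),
    teacher_name TEXT,
    subject_name TEXT NOT NULL REFERENCES subject (name)
);
CREATE TABLE enrollment (
    code        INTEGER PRIMARY KEY,
    date        TEXT,
    has_paid    INTEGER NOT NULL DEFAULT 0,
    student_dni TEXT NOT NULL REFERENCES student (dni),
    group_name  TEXT NOT NULL REFERENCES "group" (name)
);
"""

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces foreign keys when asked to
conn.executescript(SCHEMA)

# Made-up sample rows.
conn.execute("""INSERT INTO subject VALUES ('Databases', 900)""")
conn.execute("""INSERT INTO "group" VALUES ('DB-1', 'morning', 'Ada Lovelace', 'Databases')""")

# A group that points at a subject which does not exist is rejected.
try:
    conn.execute("""INSERT INTO "group" VALUES ('X-1', 'evening', 'Nobody', 'Alchemy')""")
    fk_rejected = False
except sqlite3.IntegrityError:
    fk_rejected = True
print(fk_rejected)
```

The foreign key is what lets the database itself refuse a group that belongs to no known subject.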

### Exercise

We will describe the data model of an online shoe shop:

*Our shop sells lots of different shoes. Each shoe has a brand, model and its price.
To be able to buy in our shop, customers must register their personal data, such as full name, email and password.
Customers may order multiple shoes, indicating the number of pairs for each shoe, the size and the color. For each order, we will store a unique code, the date and the shipping address.*

### Solution

Entities:
 - shoe: brand, model, price
 - customer: full name, email, password
 - order: code, date, address

Relations:
 - A customer may place many orders, but one order belongs to one customer only.
 - An order may include multiple shoes, and one shoe may be present in several orders. Attributes: size, color and units
 
E-R diagram:

    Customer              Order                Shoe
    ===========   ---<   =========   >-----<   =======
     full name            code        size      brand
     email                date        color     model
     password             address     units     price

Tables:
 - Shoe: brand, model, price
 - Customer: full_name, email, password
 - Order: code, date, address
 - Item: size, color, units
 
Keys:
 - Shoe: *brand, *model, price
 - Customer: full_name, *email, password
 - Order: *code, date, address, ~customer_email
 - Item: *size, *color, units, *~shoe_brand, *~shoe_model, *~order_code
 
Relational diagram:

    Customer             Order                     Item                   Shoe
    ===========   ---<   ================   ---<   =============   >---   =======
     full_name           *code                     *size                  *brand
    *email                date                     *color                 *model
     password             address                   units                  price
                         ~customer_email           *~shoe_brand
                                                   *~shoe_model
                                                   *~order_code
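
A possible declaration of this model, again sketched with Python's `sqlite3` module, shows how composite primary and foreign keys are written. The column types and sample rows are assumptions. The actual schema of the course's `shop.sqlite` file may differ.

```python
import sqlite3

# Hypothetical DDL for the shop model; column types are assumptions.
SCHEMA = """
CREATE TABLE shoe (
    brand TEXT,
    model TEXT,
    price REAL,
    PRIMARY KEY (brand, model)
);
CREATE TABLE customer (
    full_name TEXT,
    email     TEXT PRIMARY KEY,
    password  TEXT
);
-- "order" is a reserved word in SQL, so the table name must be quoted.
CREATE TABLE "order" (
    code           INTEGER PRIMARY KEY,
    date           TEXT,
    address        TEXT,
    customer_email TEXT NOT NULL REFERENCES customer (email)
);
CREATE TABLE item (
    size       INTEGER,
    color      TEXT,
    units      INTEGER,
    shoe_brand TEXT,
    shoe_model TEXT,
    order_code INTEGER REFERENCES "order" (code),
    PRIMARY KEY (size, color, shoe_brand, shoe_model, order_code),
    FOREIGN KEY (shoe_brand, shoe_model) REFERENCES shoe (brand, model)
);
"""

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript(SCHEMA)

# Made-up sample rows.
conn.execute("""INSERT INTO shoe VALUES ('Maggio', 'Christmouth', 30)""")
conn.execute("""INSERT INTO customer VALUES ('Grace Hopper', 'grace@example.com', 'not-a-real-hash')""")
conn.execute("""INSERT INTO "order" VALUES (1, '2020-01-01', '1 Example St', 'grace@example.com')""")
conn.execute("""INSERT INTO item VALUES (38, 'white', 2, 'Maggio', 'Christmouth', 1)""")

# The same size, color, shoe and order again: the composite primary key rejects it.
try:
    conn.execute("""INSERT INTO item VALUES (38, 'white', 1, 'Maggio', 'Christmouth', 1)""")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True
print(duplicate_rejected)
```

With this key, ordering more pairs of the same shoe, size and color within one order means updating `units` rather than adding a second row.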

# SQL

Once the relational model has been defined, it needs to be implemented in the relational database.
Most interactions with relational databases are done through textual commands using a specific declarative language called SQL (Structured Query Language).

## SELECT

SQL is very powerful, but for this course we will only have a look at the SELECT statement to filter, group, aggregate and retrieve information from the database.

Here is a simplified syntax of the SELECT statement:

    SELECT expression [, ...]
    FROM table
    [ JOIN table ON condition ]
    [ WHERE condition ]
    [ GROUP BY expression [, ...] ]
    [ HAVING condition ]
    [ ORDER BY expression [ ASC | DESC ] [, ...] ]
    [ LIMIT number ]


To learn how to use the SELECT statement, we will use examples on top of the relational models previously described.

### Select all columns from `student` table

    SELECT *
    FROM student

### Select the `name` of all `Subjects`, and retrieve only 3 entries

    SELECT name
    FROM subject
    LIMIT 3

### Select the average price of all `Subjects`

    SELECT AVG(price)
    FROM subject

### Select the name and price of the most expensive `Subject`

    SELECT name, price
    FROM subject
    ORDER BY price DESC
    LIMIT 1

### Select the names of all `Subjects` cheaper than 1000

    SELECT name
    FROM subject
    WHERE price < 1000

### Select how many `Students` we have

    SELECT COUNT(*)
    FROM student

### Select how many `Groups` we have, grouped by schedule

    SELECT schedule, COUNT(*)
    FROM "group"
    GROUP BY schedule
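
To try queries like the ones above without a database server, one option is an in-memory SQLite database driven from Python. The sketch below uses made-up subjects, prices and groups, and a small `run` helper that is not part of the course material.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE subject (name TEXT PRIMARY KEY, price REAL);
INSERT INTO subject VALUES
    ('Algebra', 800), ('Databases', 1200), ('Physics', 950), ('Statistics', 1050);

CREATE TABLE "group" (name TEXT PRIMARY KEY, schedule TEXT);
INSERT INTO "group" VALUES ('A1', 'morning'), ('A2', 'morning'), ('D1', 'evening');
""")

def run(sql):
    """Execute a query and return all rows as a list of tuples."""
    return conn.execute(sql).fetchall()

cheap = run("SELECT name FROM subject WHERE price < 1000 ORDER BY name")
most_expensive = run("SELECT name, price FROM subject ORDER BY price DESC LIMIT 1")
average = run("SELECT AVG(price) FROM subject")[0][0]

# "group" is a reserved word, so the table name is quoted.
per_schedule = run("""
    SELECT schedule, COUNT(*)
    FROM "group"
    GROUP BY schedule
    ORDER BY schedule
""")

print(cheap)           # [('Algebra',), ('Physics',)]
print(most_expensive)  # [('Databases', 1200.0)]
print(average)         # 1000.0
print(per_schedule)    # [('evening', 1), ('morning', 2)]
```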

## JOIN

We can also combine data from more than one table. Here is how we can use JOIN to achieve this.

### Select all the `Groups` with its corresponding `Subject`

    SELECT *
    FROM "group"
    JOIN subject
      ON "group".subject_name = subject.name

### Select all `Subject` names taught by a teacher named 'Ada Lovelace'

    SELECT DISTINCT subject.name
    FROM subject
    JOIN "group"
      ON subject.name = "group".subject_name
    WHERE "group".teacher_name = 'Ada Lovelace'

### Select the names of all `Students` who have some pending payment

    SELECT student.full_name
    FROM student
    JOIN enrollment
      ON student.dni = enrollment.student_dni
    GROUP BY student.dni, student.full_name
    -- bool_and is PostgreSQL-specific
    HAVING bool_and(enrollment.has_paid) = FALSE

### Select the teachers of the `Student` with DNI 1234

    SELECT "group".teacher_name
    FROM student
    JOIN enrollment
      ON student.dni = enrollment.student_dni
    JOIN "group"
      ON "group".name = enrollment.group_name
    WHERE student.dni = '1234'
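
The JOIN queries can be checked the same way against a tiny in-memory SQLite database. Everything below is illustrative: the students, groups and enrollments are invented, and because SQLite has no `bool_and` aggregate, `MIN(has_paid) = 0` is used here as a stand-in for "at least one enrollment is unpaid".

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE student (dni TEXT PRIMARY KEY, full_name TEXT);
CREATE TABLE "group" (name TEXT PRIMARY KEY, teacher_name TEXT, subject_name TEXT);
CREATE TABLE enrollment (
    code        INTEGER PRIMARY KEY,
    has_paid    INTEGER,
    student_dni TEXT REFERENCES student (dni),
    group_name  TEXT REFERENCES "group" (name)
);
INSERT INTO student VALUES ('1234', 'Alan Turing'), ('5678', 'Emmy Noether');
INSERT INTO "group" VALUES
    ('DB-1', 'Ada Lovelace', 'Databases'),
    ('AL-1', 'Mary Somerville', 'Algebra');
INSERT INTO enrollment VALUES
    (1, 1, '1234', 'DB-1'),
    (2, 0, '1234', 'AL-1'),
    (3, 1, '5678', 'DB-1');
""")

# Teachers of the student with DNI 1234.
teachers = conn.execute("""
    SELECT "group".teacher_name
    FROM enrollment
    JOIN "group" ON "group".name = enrollment.group_name
    WHERE enrollment.student_dni = ?
    ORDER BY "group".teacher_name
""", ("1234",)).fetchall()

# Students with at least one unpaid enrollment (has_paid stored as 0 or 1).
pending = conn.execute("""
    SELECT student.full_name
    FROM student
    JOIN enrollment ON student.dni = enrollment.student_dni
    GROUP BY student.dni, student.full_name
    HAVING MIN(enrollment.has_paid) = 0
""").fetchall()

print(teachers)  # [('Ada Lovelace',), ('Mary Somerville',)]
print(pending)   # [('Alan Turing',)]
```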

## Exercise

In [20]:
%load_ext sql

The sql extension is already loaded. To reload it, use:
  %reload_ext sql


In [21]:
%sql sqlite:///../resources/shop.sqlite

u'Connected: None@../resources/shop.sqlite'

### Select the brand, model and price of the cheapest shoe

In [22]:
%%sql
    SELECT brand, model, price
    FROM shoe
    ORDER BY price ASC
    LIMIT 1

Done.


brand,model,price
Maggio,Christmouth,30


### Count how many different models each brand has

In [23]:
%%sql
    SELECT brand, COUNT(*)
    FROM shoe
    GROUP BY brand

Done.


brand,COUNT(*)
Boehm,4
Effertz,3
Maggio,5


### Select how many pairs of shoes were sold in order number 4521

In [24]:
%%sql
    SELECT SUM(units)
    FROM "order"
    JOIN item
      ON "order".code = item.order_code
    WHERE code = 4521

Done.


SUM(units)
4


### How many shoes have been sold per brand?

In [25]:
%%sql
    SELECT brand, COUNT(*)
    FROM shoe
    JOIN item
      ON item.shoe_brand = shoe.brand AND item.shoe_model = shoe.model
    GROUP BY shoe.brand

Done.


brand,COUNT(*)
Boehm,6
Effertz,5
Maggio,8


### And per brand and model?

In [26]:
%%sql
    SELECT brand, model, COUNT(*)
    FROM shoe
    JOIN item
      ON item.shoe_brand = shoe.brand AND item.shoe_model = shoe.model
    GROUP BY shoe.brand, shoe.model

Done.


brand,model,COUNT(*)
Boehm,Cruz,2
Boehm,Nathanielville,1
Boehm,Tierraland,1
Boehm,Westralia,2
Effertz,Bergnaumstad,1
Effertz,Citadel,3
Effertz,Hazelmouth,1
Maggio,Asiashire,2
Maggio,Braulio,2
Maggio,Christmouth,1


### Select the shipping address of the last 3 orders placed by the customer named 'Joan Clarke'

In [27]:
%%sql
    SELECT address
    FROM "order"
    JOIN customer
      ON customer.email = "order".customer_email
    WHERE customer.full_name = 'Joan Clarke'
    ORDER BY "order".date DESC
    LIMIT 3

Done.


address
1931 Mi Ave
1931 Mi Ave
9592 Volutpat Ave


### Select the brand, model, color and size of all shoes ever bought by the customer named 'Grace Hopper'

In [28]:
%%sql
    SELECT shoe.brand, shoe.model, item.color, item.size
    FROM customer
    JOIN "order"
      ON customer.email = "order".customer_email
    JOIN item
      ON "order".code = item.order_code
    JOIN shoe
      ON shoe.brand = item.shoe_brand AND shoe.model = item.shoe_model
    WHERE customer.full_name = 'Grace Hopper'

Done.


brand,model,color,size
Effertz,Citadel,white,38
Maggio,Christmouth,blue,37
Effertz,Hazelmouth,brown,42
Boehm,Tierraland,white,41
Boehm,Westralia,black,40
Boehm,Westralia,red,37
Effertz,Bergnaumstad,brown,38
Maggio,Asiashire,blue,39
Boehm,Cruz,black,36
Effertz,Citadel,blue,37


### How much did she spend?

In [29]:
%%sql
    SELECT SUM(shoe.price) AS amount
    FROM customer
    JOIN "order"
      ON customer.email = "order".customer_email
    JOIN item
      ON "order".code = item.order_code
    JOIN shoe
      ON shoe.brand = item.shoe_brand AND shoe.model = item.shoe_model
    WHERE customer.full_name = 'Grace Hopper'

Done.


amount
1314


# The importance of normalization

Suppose you have the following table. Each row stores a galaxy id, its position and the fluxes measured in several bands. Not all fluxes may be present: those missing will have a NULL value.

    Galaxy
    ===========
    *id
     ra
     dec
     flux_g
     flux_r
     flux_i
     flux_z
     flux_y

## Exercise

Before working through the following SQL statements, think about:
 - Is this model normalized?
 - What do we do if we want to measure on more bands?
 - How can we store also the flux_error for every band?

### Select id from galaxies with non-null flux on `g` band

    SELECT id
    FROM galaxy
    WHERE flux_g IS NOT NULL

### Select id from galaxies with all fluxes present

    SELECT id
    FROM galaxy
    WHERE flux_g IS NOT NULL
      AND flux_r IS NOT NULL
      AND flux_i IS NOT NULL
      AND flux_z IS NOT NULL
      AND flux_y IS NOT NULL

### Select all galaxies with 3 fluxes present

!!!
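
For the curious, here is one way this last query could be forced through the wide table. It is a sketch with invented rows, and it relies on SQLite evaluating each `IS NOT NULL` test to 0 or 1 so that the results can be added up. Other engines may need a `CASE` expression per band instead. It also shows why NULL must be tested with `IS NOT NULL` rather than `!=`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE galaxy (
    id INTEGER PRIMARY KEY, ra REAL, dec REAL,
    flux_g REAL, flux_r REAL, flux_i REAL, flux_z REAL, flux_y REAL
);
INSERT INTO galaxy VALUES
    (1, 10.0, -5.0, 1.1, 1.2,  1.3,  NULL, NULL),
    (2, 11.0, -6.0, 2.1, NULL, NULL, NULL, NULL),
    (3, 12.0, -7.0, 3.1, 3.2,  3.3,  3.4,  3.5);
""")

# Comparing with NULL using != yields NULL for every row, so no row is selected.
with_ne = conn.execute(
    "SELECT id FROM galaxy WHERE flux_g != NULL").fetchall()
with_is = conn.execute(
    "SELECT id FROM galaxy WHERE flux_g IS NOT NULL ORDER BY id").fetchall()

# One term per band: every new band means editing this query again.
exactly_three = conn.execute("""
    SELECT id
    FROM galaxy
    WHERE (flux_g IS NOT NULL) + (flux_r IS NOT NULL) + (flux_i IS NOT NULL)
        + (flux_z IS NOT NULL) + (flux_y IS NOT NULL) = 3
""").fetchall()

print(with_ne)        # []
print(with_is)        # [(1,), (2,), (3,)]
print(exactly_three)  # [(1,)]
```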

## Considerations

I cannot stress enough how important it is to have a good relational model in order to work efficiently with all the data.

Non-normalized models may appear "simpler" and "easier", but it is just a mirage. Behind the simple facade, such models are more difficult to maintain and evolve. Also, information present in them can be very hard to extract.

Not all data model requirements can be determined at the beginning. That means we must plan and prepare our data models for change. CHANGE IS UNAVOIDABLE, and data models must be able to adapt to the evolution of requirements.

## Exercise

Could you propose a normalized model?

### Solution

    Galaxy          Measure
    ======   ---<   ============
    *id             *~galaxy_id
     ra             *band
     dec             flux

## Exercise

### Select id from galaxies with non-null flux on `g` band

    SELECT galaxy_id
    FROM measure
    WHERE band = 'g'

### Select id from galaxies with 3 fluxes present

    SELECT galaxy_id
    FROM measure
    GROUP BY galaxy_id
    HAVING COUNT(flux) = 3

### Select id from galaxies with at least 3 fluxes present

    SELECT galaxy_id
    FROM measure
    GROUP BY galaxy_id
    HAVING COUNT(flux) >= 3
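
The queries above can be tried on a small in-memory SQLite database. The rows below are invented for illustration and the column types are assumptions. The `ids` helper is just a convenience for this sketch.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE galaxy (id INTEGER PRIMARY KEY, ra REAL, dec REAL);
CREATE TABLE measure (
    galaxy_id INTEGER REFERENCES galaxy (id),
    band      TEXT,
    flux      REAL,
    PRIMARY KEY (galaxy_id, band)
);
INSERT INTO galaxy VALUES (1, 10.0, -5.0), (2, 11.0, -6.0), (3, 12.0, -7.0);
INSERT INTO measure VALUES
    (1, 'g', 1.1), (1, 'r', 1.2), (1, 'i', 1.3),
    (2, 'g', 2.1),
    (3, 'g', 3.1), (3, 'r', 3.2), (3, 'i', 3.3), (3, 'z', 3.4), (3, 'y', 3.5);
""")

def ids(sql):
    """Return the first column of every result row as a plain list."""
    return [row[0] for row in conn.execute(sql)]

exactly_three = ids("""
    SELECT galaxy_id FROM measure
    GROUP BY galaxy_id
    HAVING COUNT(flux) = 3
""")
at_least_three = ids("""
    SELECT galaxy_id FROM measure
    GROUP BY galaxy_id
    HAVING COUNT(flux) >= 3
    ORDER BY galaxy_id
""")

print(exactly_three)   # [1]
print(at_least_three)  # [1, 3]
```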

## Considerations

 - What do we do if we want to measure on more bands?
 - How can we store also the flux_error for every band?
 - And the magnitude?
 - And the magnitude error?
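
One possible answer to the first two questions, sketched with SQLite: a new per-measurement quantity such as `flux_error` becomes a single new column, and a new band becomes just another row. The band name `'u'` and all values are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE measure (
    galaxy_id INTEGER,
    band      TEXT,
    flux      REAL,
    PRIMARY KEY (galaxy_id, band)
);
INSERT INTO measure VALUES (1, 'g', 1.1), (1, 'r', 1.2);

-- A new quantity is one new column, regardless of how many bands exist.
ALTER TABLE measure ADD COLUMN flux_error REAL;
UPDATE measure SET flux_error = 0.05 WHERE galaxy_id = 1 AND band = 'g';

-- A new band is just another row; the schema is left untouched.
INSERT INTO measure VALUES (1, 'u', 0.9, 0.10);
""")

# PRAGMA table_info returns one row per column; index 1 holds the column name.
columns = [row[1] for row in conn.execute("PRAGMA table_info(measure)")]
bands = [row[0] for row in conn.execute(
    "SELECT band FROM measure WHERE galaxy_id = 1 ORDER BY band")]

print(columns)  # ['galaxy_id', 'band', 'flux', 'flux_error']
print(bands)    # ['g', 'r', 'u']
```

In the wide table, the same two changes would have required six new columns (one `flux_error` per band, plus `flux_u` and its error) and edits to every query that enumerates the bands.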

## Security

Although mastering SQL is a must if we work with relational databases, it becomes tedious to write all those queries manually. It is also error-prone: one has to validate each and every user-provided input, or the application could suffer from massive and fatal security issues.

The most common security problem with hand-crafted SQL is SQL injection, where a user provides some kind of parameter to the query. If this parameter is not properly validated or escaped, the user is effectively able to run ANY statement on OUR database. They could steal our customers' data or our credentials, delete all our data or, worse, modify critical data without our knowledge.

As an example, in our online shop we have a section to browse the shoes we sell. When a user selects a particular brand, we display all the models and their prices. Suppose we use the following query:

    SELECT brand, model, price
    FROM shoe
    WHERE brand = {$ parameter $}
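
In application code, such a template is often filled in with plain string formatting. The sketch below assumes a hypothetical `build_query` helper and Python's `str.format` in place of the `{$ parameter $}` notation. It only builds the query text, to show that whatever the user types becomes part of the statement itself.

```python
TEMPLATE = "SELECT brand, model, price FROM shoe WHERE brand = {parameter}"

def build_query(parameter):
    # Unsafe on purpose: the user's text is pasted straight into the SQL statement.
    return TEMPLATE.format(parameter=parameter)

normal = build_query("'Maggio'")
malicious = build_query("'';SELECT * FROM customer")

print(normal)
# SELECT brand, model, price FROM shoe WHERE brand = 'Maggio'
print(malicious)
# SELECT brand, model, price FROM shoe WHERE brand = '';SELECT * FROM customer
```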

### Normal use

    parameter = 'Maggio'

In [30]:
%%sql    
    SELECT brand, model, price
    FROM shoe
    WHERE brand = 'Maggio'

Done.


brand,model,price
Maggio,Asiashire,119
Maggio,Braulio,80
Maggio,Christmouth,30
Maggio,Gunnarport,70
Maggio,Hyman,147


### Issue 1: Ask for a nonexistent brand

    parameter = 'TheBestBrandInDaWorld'

In [31]:
%%sql
    SELECT brand, model, price
    FROM shoe
    WHERE brand = 'TheBestBrandInDaWorld'

Done.


brand,model,price


### Issue 2: Make the query fail

    parameter = ;

In [32]:
%%sql
    SELECT brand, model, price
    FROM shoe
    WHERE brand = ;

(sqlite3.OperationalError) near ";": syntax error [SQL: u'SELECT brand, model, price\n    FROM shoe\n    WHERE brand = ;']


### Issue 3: Select anything else

    parameter = '';SELECT * FROM customer

In [33]:
%%sql
    SELECT brand, model, price
    FROM shoe
    WHERE brand = '';SELECT * FROM customer
    

Done.
Done.


full_name,email,password
Grace Hopper,grace.hooper@example.com,85fe339c5c2678ed62e5c25c832d88665e6b25a2
Joan Clarke,joan.clarke@example.com,cdfaa02667f3fa7ceea7d86c30619b71965e4d6d
Ada Lovelace,ada.lovelace@example.com,243d943367175e42269a2d447a3ac3c0ae0d38ad


### Issue 4: DO anything else

    parameter = '';DROP TABLE you_are_lucky_this_table_does_not_exist

In [34]:
%%sql
    SELECT brand, model, price
    FROM shoe
    WHERE brand = '';DROP TABLE you_are_lucky_this_table_does_not_exist

Done.
(sqlite3.OperationalError) no such table: you_are_lucky_this_table_does_not_exist [SQL: u'DROP TABLE you_are_lucky_this_table_does_not_exist']
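
A common way to close this hole is to pass user input as a bound parameter instead of pasting it into the SQL text. The following is a minimal sketch using the `?` placeholder of Python's `sqlite3` module. The `shoes_by_brand` helper and the second sample row are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE shoe (brand TEXT, model TEXT, price REAL, PRIMARY KEY (brand, model));
INSERT INTO shoe VALUES ('Maggio', 'Christmouth', 30), ('Boehm', 'Cruz', 55);
""")

def shoes_by_brand(brand):
    # The "?" placeholder passes the value separately from the SQL text,
    # so it is always treated as data and never parsed as SQL.
    return conn.execute(
        "SELECT brand, model, price FROM shoe WHERE brand = ?", (brand,)
    ).fetchall()

normal = shoes_by_brand("Maggio")
attack = shoes_by_brand("'';DROP TABLE shoe")
remaining = conn.execute("SELECT COUNT(*) FROM shoe").fetchone()[0]

print(normal)     # [('Maggio', 'Christmouth', 30.0)]
print(attack)     # []
print(remaining)  # 2
```

The malicious string is simply compared against the `brand` column, matches nothing, and the table is left intact.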
