# SQL basics : Exercises

## Prerequisite : modifying a table or a column, with `ALTER TABLE`

Once a table is created, you may sometimes need to modify it, by changing its name, or the name of a column, adding a column or deleting one. (Again : remember CRUD).

For these operations we have the `ALTER TABLE` command which is very intuitive to use.

Unfortunately, the `ALTER TABLE` command has not always been fully implemented in SQLite. Pay attention to the notes below.

First note : it is always wise to create a backup of your database before any attempt to make significant changes to its schema.

### Modifying a table : change name

To change the name of a table, the syntax is very simple : 

```SQL
ALTER TABLE <old_table_name> 
    RENAME TO <new_table_name>;
```

### Modifying a table : add a column

To add a column, the command is also very clear and simple :

```SQL
ALTER TABLE <table_name> 
    ADD COLUMN <new_column_name>;
```

You can add constraints when you create a new column. But when you add a column to an existing table, some constraints can’t be specified :

- the new column can’t be a `PRIMARY KEY`
- it can’t be `UNIQUE` (it is empty at its creation and records already exist)
- it can’t have `CURRENT_TIME`, `CURRENT_DATE` or `CURRENT_TIMESTAMP` as default value (the records that already exist can’t receive the current datetime)
- there are also limitations on other constraints, but they are based on more elaborate notions that we have not seen so far.
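
As a minimal sketch of these restrictions, the snippet below uses Python’s `sqlite3` module on an in-memory database. The `pets` table and its columns are hypothetical and only serve the illustration : a plain column with a constant default is accepted, while a `UNIQUE` column is rejected.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pets (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO pets (name) VALUES ('Rex'), ('Tom')")

# a plain column with a constant default is accepted :
# the rows that already exist receive the default value
conn.execute("ALTER TABLE pets ADD COLUMN species TEXT DEFAULT 'unknown'")
rows = conn.execute("SELECT name, species FROM pets ORDER BY id").fetchall()
print(rows)

# a UNIQUE column is rejected when added to an existing table
try:
    conn.execute("ALTER TABLE pets ADD COLUMN chip TEXT UNIQUE")
    unique_error = None
except sqlite3.Error as err:
    unique_error = str(err)
print(unique_error)
```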

### Modifying a table/column : drop a column

Simply :

```SQL
ALTER TABLE <table_name> 
    DROP COLUMN <column_name>;
```

- you cannot drop a column if it is a `PRIMARY KEY` or if it is `UNIQUE`
- you also cannot drop a column if it is referenced elsewhere in the schema (see `FOREIGN KEY` constraints below)

Attention ! This syntax is valid only for SQLite 3.35.0 and later.

Old method (with previous versions) : 
1. create a new table like the one you are trying to change, without the column to drop,
2. copy all the data,
3. drop the old table,
4. rename the new one.

What does the `DROP COLUMN` command do ? The effect is the same as the steps above : SQLite rewrites the table content without the dropped column.
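
The four steps of the old method can be sketched as follows, assuming a hypothetical `pets` table from which we want to remove the `nickname` column. This is only an illustration run on an in-memory database, not a recipe for a production schema (indexes, triggers and foreign keys would need extra care).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE pets (id INTEGER PRIMARY KEY, name TEXT, nickname TEXT);
INSERT INTO pets (name, nickname) VALUES ('Rex', 'R'), ('Tom', 'T');

-- 1. create a new table without the column to drop
CREATE TABLE pets_new (id INTEGER PRIMARY KEY, name TEXT);
-- 2. copy all the data we keep
INSERT INTO pets_new (id, name) SELECT id, name FROM pets;
-- 3. drop the old table
DROP TABLE pets;
-- 4. rename the new one
ALTER TABLE pets_new RENAME TO pets;
""")

# check the result : column names and number of rows
columns = [row[1] for row in conn.execute("PRAGMA table_info(pets)")]
n_rows = conn.execute("SELECT COUNT(*) FROM pets").fetchone()[0]
print(columns, n_rows)
```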

### Modifying a table/column : change name

The syntax is clear, again :

```SQL
ALTER TABLE <table_name> 
    RENAME COLUMN <old_name> 
        TO <new_name>;
```

### Changing a column type

There is no standard SQL syntax to do this : each DBMS has its own particularities.

SQLite is especially tricky. As SQLite is weakly typed, the best (simplest) move is to :

1. Add a column to the table with the new datatype (you know how to code this!)

2. Copy the old column by casting the data to the new type :

```SQL
UPDATE <table_name> 
       SET <new_column_name> = CAST(<old_column_name> as <new_data_type>)
```
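
Here is a small sketch of this simple method with Python’s `sqlite3` module. The `places` table and its `zip` / `zip_text` columns are hypothetical names chosen for the example : an `INTEGER` column is copied into a new `TEXT` column.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE places (id INTEGER PRIMARY KEY, zip INTEGER)")
conn.executemany("INSERT INTO places (zip) VALUES (?)", [(75001,), (13002,)])

# 1. add a column with the new datatype
conn.execute("ALTER TABLE places ADD COLUMN zip_text TEXT")
# 2. copy the old column, casting the data to the new type
conn.execute("UPDATE places SET zip_text = CAST(zip AS TEXT)")

# typeof() shows the storage class actually used for each value
values = conn.execute(
    "SELECT zip_text, typeof(zip_text) FROM places ORDER BY id"
).fetchall()
print(values)
```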

Another, canonical (and sure) but more complex method:

1. Create a new table with the same columns, except the one we want to modify, which will have the type we want
2. Copy the old table data into the new table (casting the column we want to modify)
3. Drop the old table
4. Rename the new table with the name of the old table
5. Commit !

Exercise : try to implement this sequence with a table you used. For example cast an `INTEGER` column to `TEXT`.



In [None]:
# your code here (choose the database and the table - and column - you want). 
# Make a backup before playing with it !




## Foreign key constraints

We have seen that when creating tables, we can declare a column as `PRIMARY KEY`.
When we created ERDs, we saw that we can declare references between tables, connecting PK and FK.

In fact, we can declare a FK in the database schema. It adds constraints : if, when creating a table, we declare a FK and a relationship with another table, the tables involved cannot be dropped or modified as easily, to preserve the structure of the database or schema. It makes life harder for the data scientist or administrator, but it secures the data. That’s the advantage of the relational model and of storing data in databases (rather than in flat files).

How do we declare a FK ?

Suppose we are creating two tables : `students` and `cursus` :

```SQL
CREATE TABLE cursus(
    code INTEGER PRIMARY KEY, 
    name TEXT,
    major TEXT,
    referent TEXT
);

CREATE TABLE students(
    id INTEGER,
    firstname TEXT, 
    lastname TEXT,
    cursus_id INTEGER     -- This column will refer to a cursus.code
);
```

The declaration of a FK is a bit more complicated than the declaration of a PK :

```SQL
CREATE TABLE cursus(
    code INTEGER PRIMARY KEY, 
    name TEXT,
    major TEXT,
    referent TEXT
);

CREATE TABLE students(
    id INTEGER,
    firstname TEXT, 
    lastname TEXT,
    cursus_id INTEGER,     -- This column will refer to cursus.code
    FOREIGN KEY(cursus_id) REFERENCES cursus(code)
);
```

SQLite will then forbid (as long as foreign key enforcement is enabled, see the `PRAGMA` below) :

- to delete a record in `cursus` if at least one `students` record references it
- to record (insert) a student whose `cursus_id` does not match any existing `cursus.code`
- only a `NULL` value of `students.cursus_id` would not raise an error
- it is allowed to add a `NOT NULL` constraint to a `FOREIGN KEY` column : by doing so we define a very strict relationship between the tables (each student recorded in the table `students` must refer to a cursus). 
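
The behaviour described above can be checked with a short sketch, run here on an in-memory database with simplified versions of the `cursus` and `students` tables. One point to keep in mind : in SQLite, foreign key enforcement is off by default and has to be switched on for each connection. The helper `violates()` is a hypothetical name introduced for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # enforcement is off by default in SQLite
conn.executescript("""
CREATE TABLE cursus(code INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE students(
    id INTEGER PRIMARY KEY,
    lastname TEXT,
    cursus_id INTEGER,
    FOREIGN KEY(cursus_id) REFERENCES cursus(code)
);
INSERT INTO cursus VALUES (1, 'Data science');
INSERT INTO students VALUES (1, 'Martin', 1);     -- refers to an existing cursus
INSERT INTO students VALUES (2, 'Dubois', NULL);  -- NULL is accepted
""")

def violates(sql: str) -> bool:
    """Return True if the statement is rejected by an integrity constraint."""
    try:
        conn.execute(sql)
        return False
    except sqlite3.IntegrityError:
        return True

# a student referring to a cursus that does not exist
orphan_rejected = violates("INSERT INTO students VALUES (3, 'Petit', 99)")
# deleting a cursus that is still referenced by a student
delete_rejected = violates("DELETE FROM cursus WHERE code = 1")
n_students = conn.execute("SELECT COUNT(*) FROM students").fetchone()[0]
print(orphan_rejected, delete_rejected, n_students)
```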

As we said before, this foreign key constraint restricts the modification or deletion of the table to which the FK refers. But sometimes we need to modify or drop such tables or their columns. We can temporarily disable these constraints with a `PRAGMA` statement : 

``` SQL
PRAGMA foreign_keys=off; 

-- operations on the table

PRAGMA foreign_keys=on;
```

## Exercise : create a database

### Understanding the schema

In the data folder of this repo you will find 5 `.csv` files called `customers.csv`, `orders.csv`, `items.csv`, `orders_items.csv` and `carriers.csv`.

1. Open the files, observe the columns, datatypes, etc. and create an ERD to understand the schema

*Edit this cell (Markdown)*

```SQL
 -- you can copy DBML code of the ERD here
```

You can insert an image by pasting its path between the parentheses (edit the markdown code) :

![ERD of the database]( )


### Store the data in a database

2. Write a function `add_table()` that takes

* a connector to a (new) database called `orders-exercise.sqlite`
* the path to a `.csv` file containing the data to store in the added table
* the name of the added table

as arguments and which adds (creates) a new table in the database referenced by the connector (you can use Pandas here if you think it makes your life easier).

Then add the tables and data from the 5 `.csv` files.

**Optional** – if you want to automate the file loading, you can get the files list with these lines :
```python
import os

csv_path = '<write_here_the_path_where_csv_are_stored>'
# the best way to write a path is to use os.path.join('folder', 'subfolder1', …)
file_names = [f for f in os.listdir(csv_path) if f.endswith('.csv')]
file_names
```

You then just have to iterate through the list of `file_names` and call your function for each file.

In [1]:
# your code here

import pandas as pd
import sqlite3

conn = sqlite3.connect('data/orders-exercise.sqlite')
c = conn.cursor()


In [40]:
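def add_table(connector: sqlite3.Connection, path: str, table_name: str) -> None:
    # load the csv file into a DataFrame, then write it as a table
    # (the table is replaced if it already exists)
    df = pd.read_csv(path)
    df.to_sql(table_name, connector, if_exists='replace', index=False)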
    

In [42]:
import os

csv_path = os.path.join('.','data','exercise-orders')
file_names = [f for f in os.listdir(csv_path) if f.endswith('.csv')]
for f in file_names:
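    # os.path.join builds a portable path, os.path.splitext removes the '.csv' extension
    add_table(conn, os.path.join(csv_path, f), os.path.splitext(f)[0])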

## Check that everything is ok

3. Write a script that will give us :

* database tables list
* columns list for each table
* request the first rows of each table

Format the output in a structured and readable way. For example :

```
----- table items columns -----

id
product
volume
weight

 table items first lines :

(1, 'Table', 40398, '27')
(2, 'Chaise', 23829, '7')
(3, 'Ordinateur', 2912, '3')
(4, 'Lampe', 2014, '0,3')
(5, 'Canapé', 43852, '14')

----- table orders_items columns -----

id
order_id

[… etc. …]
```

The idea is to automate (script) as much as possible. Data scientists build their own tools ! You may reuse or adapt that script next time you have to deal with databases :)

When you write a complex program, try to write each functionality one after the other, step by step. « Little strokes fell great oaks ».

In [2]:
# first, the well-known function to query the database

def exe(cursor: sqlite3.Cursor, query: str) -> None:
    # run the query and print every row of the result
    cursor.execute(query)
    for row in cursor.fetchall():
        print(row)

In [4]:
list_table = '''
PRAGMA table_list;
'''

exe(c, list_table)

('main', 'orders', 'table', 8, 0, 0)
('main', 'items', 'table', 4, 0, 0)
('main', 'shippers', 'table', 2, 0, 0)
('main', 'orders_items', 'table', 3, 0, 0)
('main', 'customers', 'table', 5, 0, 0)
('main', 'sqlite_schema', 'table', 5, 0, 0)
('temp', 'sqlite_temp_schema', 'table', 5, 0, 0)


In [45]:
# list all columns of all tables

# first, get the table names
c.execute(list_table)
for table_row in c.fetchall():
    table = table_row[1] # to improve readability

    # then get the column names
    column_list = 'PRAGMA table_info(' + table + ')'
    print('\n----- table ' + table + ' columns -----\n')
    c.execute(column_list)
    
    # print the column names (a distinct variable name avoids shadowing the outer loop variable)
    for column_row in c.fetchall():
        column = column_row[1] # to improve readability
        print(column)
    
    # finally print the first lines of each table   
    print('\n table ' + table + ' first lines :\n')
    select_all = 'SELECT * FROM ' + table + ' LIMIT 5'
    exe(c, select_all)



----- table orders columns -----

id
customer_id
order_date
order_status
expedition_date
reception_date
shipping_company
shipping_cost

 table orders first lines :

(1, 2, '2023-08-14', 'Annulé', None, None, 2, '2,22')
(2, 6, '2023-06-15', 'Livré', '2023-06-24', '2023-06-30', 1, '33,88')
(3, 7, '2023-04-08', 'Expédié', '2023-04-17', None, 2, '0,74')
(4, 6, '2023-02-04', 'Expédié', '2023-02-08', None, 4, '2,959')
(5, 20, '2023-03-23', 'Annulé', None, None, 3, '0,972')

----- table items columns -----

id
product
volume
weight

 table items first lines :

(1, 'Table', 40398, '27')
(2, 'Chaise', 23829, '7')
(3, 'Ordinateur', 2912, '3')
(4, 'Lampe', 2014, '0,3')
(5, 'Canapé', 43852, '14')

----- table shippers columns -----

id
company_name

 table shippers first lines :

(1, 'HCL')
(2, 'USD')
(3, 'DeFix')
(4, 'Colipost')
(5, 'Chronossimo')

----- table orders_items columns -----

id
order_id
item_id

 table orders_items first lines :

(1, 80, 3)
(2, 21, 35)
(3, 82, 12)
(4, 84, 32)
(5, 58

## Analyses

As a data scientist you have to answer this question : is there a problem with delivery, and more specifically with some delivery companies (shippers) ?

4. Do some simple initial analyses :

* how many orders, items, customers, shippers… ?
* are there customers who never bought anything / never got their order ? some items never bought ?
* check if there are errors. What would be inconsistent/missing data ? How to deal with them ? Apply a correction or delete such data (you choose !)

5. Simple statistics :

* average number of items in an order ?
* average number of orders by customer ?
* what are the items most frequently sold ?
* average weight of an order ?
* compare the number of deliveries made by each shipper ?
* percentage of canceled orders ? other statuses ?

6. Build more complex indicators :

* number of days between expedition and delivery ? (result can be stored in a column)
* what are the slowest shippers ? the fastest ? average for each shipper ?
* add a column to the shippers table : speed of delivery (speed = distance / time, by the way…).
* add another column to the shippers table : price / kg of shipping (total cost = weight x price)

What is the best shipping company ? The fastest ? The cheapest ? The best compromise ?

Disclaimer : these data were randomly generated, I don’t know what the answer is, maybe there is something to find out, maybe not. Only your analysis will allow you to decide.

### Initial analyses

#### How many orders, items, customers, shippers… ?

In [46]:
n_orders = 'SELECT COUNT(*) FROM orders'
print('Number of orders recorded :')
exe(c, n_orders)

Number of orders recorded :
(100,)


In [47]:
n_status = '''
SELECT 
    order_status, COUNT(*) 
FROM 
    orders 
GROUP BY
    order_status
'''
print('Number of orders grouped by status :')
exe(c, n_status)

Number of orders grouped by status :
('Annulé', 15)
('En attente', 18)
('Expédié', 24)
('Livré', 43)


In [48]:
n_items = 'SELECT COUNT(*) FROM items'
print('Number of items in the catalogue :')
exe(c, n_items)

Number of items in the catalogue :
(50,)


In [49]:
n_customers = 'SELECT COUNT(*) FROM customers'
print('Number of customers :')
exe(c, n_customers)

Number of customers :
(20,)


In [50]:
n_shippers = 'SELECT COUNT(*) FROM shippers'
print('Number of shippers :')
exe(c, n_shippers)

Number of shippers :
(5,)


#### Are there customers who never bought / got their order ? Some items never bought ?

Remember the diagram of the exclusive left join :

![Exclusive left join diagram](./images/left_join_exclusive.png)

We want to get the customers whose id does not appear in the `customer_id` column of the `orders` table. After a `LEFT JOIN`, such customers have `NULL` in every column coming from `orders`. It corresponds to an exclusive left join (A would be the `customers` table, and B the `orders` table).

In [51]:
exclusive_left_join = '''
SELECT 
    *
FROM
    customers as c
LEFT JOIN 
    orders as o ON c.id = o.customer_id
WHERE
    o.customer_id IS NULL;
'''

exe(c, exclusive_left_join)

The query returns no rows : there are no customers who did not place any orders.
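
Since the result is empty on our data, it can be reassuring to check the exclusive left join on a toy example where the answer is known. The sketch below uses hypothetical, simplified `customers` and `orders` tables in an in-memory database : customer 3 has no order and should be the only row returned.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (id INTEGER PRIMARY KEY, firstname TEXT);
CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER);
INSERT INTO customers VALUES (1, 'Jean'), (2, 'Marie'), (3, 'Luc');
INSERT INTO orders VALUES (10, 1), (11, 1), (12, 2);
""")

# keep only the customers for which the join found no matching order
no_orders = conn.execute("""
SELECT c.id, c.firstname
FROM customers AS c
LEFT JOIN orders AS o ON c.id = o.customer_id
WHERE o.customer_id IS NULL
""").fetchall()
print(no_orders)  # [(3, 'Luc')]
```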

An alternative (and somewhat roundabout) way to answer the question is to look at how many orders each customer has placed by doing a `GROUP BY` :

In [52]:
group_orders = '''
SELECT 
    customer_id, COUNT(*)
FROM
    orders
GROUP BY
    customer_id;
'''

exe(c, group_orders)

(1, 3)
(2, 6)
(3, 3)
(4, 4)
(5, 5)
(6, 7)
(7, 6)
(8, 4)
(9, 7)
(10, 8)
(11, 9)
(12, 5)
(13, 2)
(14, 4)
(15, 6)
(16, 5)
(17, 4)
(18, 4)
(19, 2)
(20, 6)


We see that every customer has at least 2 orders. With a `JOIN` we can display the names of the customers :

In [53]:
join_group_orders = '''
SELECT 
    firstname, lastname, COUNT(*)
FROM
    customers as c
JOIN 
    orders as o ON c.id = o.customer_id
GROUP BY
    o.customer_id;
'''

exe(c, join_group_orders)

('Jean', 'Martin', 3)
('Marie', 'Bernard', 6)
('Pierre', 'Dubois', 3)
('Sophie', 'Thomas', 4)
('Luc', 'Robert', 5)
('Camille', 'Richard', 7)
('Antoine', 'Petit', 6)
('Isabelle', 'Durand', 4)
('Julien', 'Lefevre', 7)
('Charlotte', 'Moreau', 8)
('Nicolas', 'Garcia', 9)
('Laurence', 'Andre', 5)
('Kevin', 'Lemoine', 2)
('Elodie', 'Roux', 4)
('Bruno', 'Vincent', 6)
('Manon', 'Guillaume', 5)
('Laurent', 'Lacroix', 4)
('Sarah', 'Mercier', 4)
('Victor', 'Blanc', 2)
('Celine', 'Guerin', 6)


Can we count the number of orders by customer, excluding the canceled orders ?

In [54]:
join_group_non_canceled_orders = '''
SELECT 
    firstname, lastname, COUNT(*)
FROM
    customers as c
JOIN 
    orders as o ON c.id = o.customer_id
WHERE
    o.order_status IS NOT 'Annulé'
GROUP BY
    o.customer_id;
'''
exe(c, join_group_non_canceled_orders)

('Jean', 'Martin', 3)
('Marie', 'Bernard', 5)
('Pierre', 'Dubois', 3)
('Sophie', 'Thomas', 3)
('Luc', 'Robert', 5)
('Camille', 'Richard', 7)
('Antoine', 'Petit', 5)
('Isabelle', 'Durand', 4)
('Julien', 'Lefevre', 5)
('Charlotte', 'Moreau', 6)
('Nicolas', 'Garcia', 8)
('Laurence', 'Andre', 5)
('Kevin', 'Lemoine', 2)
('Elodie', 'Roux', 4)
('Bruno', 'Vincent', 6)
('Manon', 'Guillaume', 3)
('Laurent', 'Lacroix', 2)
('Sarah', 'Mercier', 4)
('Victor', 'Blanc', 2)
('Celine', 'Guerin', 3)


Even if we exclude the canceled orders, each customer still has at least 2 valid orders.
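
A remark on the `IS NOT` operator used in the query above : in SQLite it behaves like a comparison that also handles `NULL`, whereas the more common `<>` operator silently drops the rows where the status is `NULL`. The sketch below illustrates the difference on a hypothetical three-row `orders` table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, order_status TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(1, 'Livré'), (2, 'Annulé'), (3, None)],
)

# IS NOT keeps the row whose status is NULL
with_is_not = conn.execute(
    "SELECT COUNT(*) FROM orders WHERE order_status IS NOT 'Annulé'"
).fetchone()[0]

# <> evaluates to NULL for that row, so it is filtered out
with_not_equal = conn.execute(
    "SELECT COUNT(*) FROM orders WHERE order_status <> 'Annulé'"
).fetchone()[0]

print(with_is_not, with_not_equal)  # 2 1
```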

How about the items ? Are there items that are not included in any order ?

In [55]:
exclusive_left_join = '''
SELECT 
    *
FROM
    items as i
LEFT JOIN 
    orders_items as o ON i.id = o.item_id
WHERE
    o.item_id IS NULL;
'''

exe(c, exclusive_left_join)

(10, 'Bureau', 54057, '28', None, None, None)
(42, 'Enceinte', 21886, '10', None, None, None)


Two items have never been ordered : the desk (`Bureau`) and the speaker (`Enceinte`).

#### Check if there are errors. What would be inconsistent/missing data ? How to deal with them ? Apply a correction or delete such data (you choose !)

A major inconsistency would be an error in the labeling of the order status : for example a canceled order with a reception date, or a delivered order with no reception date. Let’s check if there are such inconsistencies. 

In [56]:
canceled_inconsistancies = '''
SELECT 
    id, order_status, reception_date
FROM
    orders
WHERE
    order_status = 'Annulé' AND reception_date IS NOT NULL;
'''

exe(c, canceled_inconsistancies)

(52, 'Annulé', '2023-03-29')


There is one inconsistency !

Let’s look at the orders with a delivered status :

In [57]:
delivered_inconsistancies = '''
SELECT 
    id, order_status, reception_date
FROM
    orders
WHERE
    order_status = 'Livré' AND reception_date IS NULL;
'''

exe(c, delivered_inconsistancies)

(45, 'Livré', None)
(50, 'Livré', None)
(68, 'Livré', None)


Other inconsistencies would be an `'Expédié'` status with a `reception_date` or without an `expedition_date`, and an `'En attente'` status with an `expedition_date` or a `reception_date`. In fact we could reason the other way around : having an `expedition_date` with an `'Annulé'` or `'En attente'` status would be an inconsistency, as well as not having an `expedition_date` with an `'Expédié'` or `'Livré'` status. Same with having a `reception_date` with a status other than `'Livré'`. To make it clear, drawing a table helps us to identify the situations :

| order_status | expedition_date | reception_date | inconsistency | condition to test to detect inconsistency                        |
|--------------|-----------------|----------------|---------------|------------------------------------------------------------------|
| Annulé       | no              | no             | no            | there is an expedition **or** a reception date                   |
|              | yes             | no             | yes           |                                                                  |
|              | no              | yes            | yes           |                                                                  |
|              | yes             | yes            | yes           |                                                                  |
| En attente   | no              | no             | no            | **same** as the previous one                                     |
|              | yes             | no             | yes           |                                                                  |
|              | no              | yes            | yes           |                                                                  |
|              | yes             | yes            | yes           |                                                                  |
| Expédié      | no              | no             | yes           | there is **no** expedition date **or** there is a reception date |
|              | yes             | no             | no            |                                                                  |
|              | no              | yes            | yes           |                                                                  |
|              | yes             | yes            | yes           |                                                                  |
| Livré        | no              | no             | yes           | there is **no** expedition date **or no** reception date         |
|              | yes             | no             | yes           |                                                                  |
|              | no              | yes            | yes           |                                                                  |
|              | yes             | yes            | no            |                                                                  |

This table helps us to write the following conditions to detect inconsistencies :

In [7]:
status_inconsistancies = '''
SELECT 
    order_status, id, expedition_date, reception_date
FROM
    orders
WHERE
    ((order_status = 'Annulé' OR order_status = 'En attente') AND (expedition_date IS NOT NULL OR reception_date IS NOT NULL))
    OR
    (order_status = 'Expédié' AND (expedition_date IS NULL OR reception_date IS NOT NULL))
    OR
    (order_status = 'Livré' AND (expedition_date IS NULL OR reception_date IS NULL))
ORDER BY
    order_status;
'''

exe(c, status_inconsistancies)

('Annulé', 52, '2023-03-22', '2023-03-29')
('Annulé', 55, '2023-07-02', None)
('En attente', 35, '2023-10-19', '2023-10-26')
('En attente', 38, '2023-05-24', '2023-06-03')
('En attente', 41, '2023-03-20', '2023-03-24')
('En attente', 57, '2023-05-16', None)
('En attente', 98, '2023-05-08', None)
('Expédié', 40, '2023-10-24', '2023-10-28')
('Expédié', 49, '2023-08-07', '2023-08-11')
('Expédié', 53, '2023-10-22', '2023-10-30')
('Expédié', 79, '2023-07-26', '2023-08-01')
('Expédié', 91, '2023-10-18', '2023-10-23')
('Livré', 45, '2023-02-01', None)
('Livré', 50, '2023-06-13', None)
('Livré', 68, '2023-05-29', None)
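
To convince ourselves that the `WHERE` clause above matches the table, we can transcribe the same conditions in plain Python and enumerate the 16 combinations. This is only a cross-check sketch : the function name `is_inconsistent` is an assumption made for this example, and `True` / `False` stand for the presence or absence of a date.

```python
from itertools import product

STATUSES = ('Annulé', 'En attente', 'Expédié', 'Livré')

def is_inconsistent(status: str, has_expedition: bool, has_reception: bool) -> bool:
    """Same logic as the WHERE clause of status_inconsistancies."""
    if status in ('Annulé', 'En attente'):
        return has_expedition or has_reception
    if status == 'Expédié':
        return (not has_expedition) or has_reception
    if status == 'Livré':
        return (not has_expedition) or (not has_reception)
    raise ValueError('unknown status : ' + status)

# list the combinations that are NOT flagged : one per status is expected
consistent = [
    (status, exp, rec)
    for status in STATUSES
    for exp, rec in product((False, True), repeat=2)
    if not is_inconsistent(status, exp, rec)
]
for row in consistent:
    print(row)
```

Only four combinations out of sixteen are consistent, one for each status, which corresponds to the four "no" cells of the inconsistency column.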


In [9]:
n_inconsistancies = '''
SELECT
    COUNT(*)
FROM
    (
''' + status_inconsistancies[:-2] + ');'

exe(c, n_inconsistancies)

(15,)


The first thing to do is to investigate the process of recording statuses and dates, to detect which column can be trusted (if any). If the dates are correct, we have to update the status ; on the contrary, if the statuses are correct, we have to discard the dates. For this exercise we will consider that the status was wrongly updated. Therefore we will change the status according to the dates : we will get the ids of the records with an expedition date only (status -> `'Expédié'`) and of the records with both an expedition and a reception date (status -> `'Livré'`). Let’s begin with the records we will update to `'Expédié'` :

In [13]:
get_exp_id = ''' 
SELECT 
    id, order_status
FROM (
''' + status_inconsistancies[:-2] + '''
)
WHERE
    expedition_date IS NOT NULL AND reception_date IS NULL;
'''
# we write status_inconsistancies[:-2] just to get rid of the ';' and the newline at the end of the string, 
# that would lead to a syntax error in the new query
exe(c, get_exp_id)

c.execute(get_exp_id)

id_to_exp = [ row[0] for row in c.fetchall()]
id_to_exp

(55, 'Annulé')
(57, 'En attente')
(98, 'En attente')
(45, 'Livré')
(50, 'Livré')
(68, 'Livré')


[55, 57, 98, 45, 50, 68]

Same to get the list of ids we’ll have to update to `'Livré'` :

In [11]:
get_liv_id = ''' 
SELECT 
    id, order_status
FROM (
''' + status_inconsistancies[:-2] + '''
)
WHERE
    expedition_date IS NOT NULL AND reception_date IS NOT NULL;
'''

exe(c, get_liv_id)

c.execute(get_liv_id)

id_to_liv = [ row[0] for row in c.fetchall()]
id_to_liv

(52, 'Annulé')
(35, 'En attente')
(38, 'En attente')
(41, 'En attente')
(40, 'Expédié')
(49, 'Expédié')
(53, 'Expédié')
(79, 'Expédié')
(91, 'Expédié')


[52, 35, 38, 41, 40, 49, 53, 79, 91]

Now, let’s update the records’ status !

In [14]:
update_exp = '''
UPDATE orders
SET
  order_status = 'Expédié'
WHERE
  id = 
'''

for i in id_to_exp:
    query = update_exp + str(i) + ';'
    c.execute(query)

# let’s verify :
exe(c, '''SELECT * FROM orders WHERE order_status = 'Expédié';''')

(3, 7, '2023-04-08', 'Expédié', '2023-04-17', None, 2, '0,74')
(4, 6, '2023-02-04', 'Expédié', '2023-02-08', None, 4, '2,959')
(12, 2, '2023-04-20', 'Expédié', '2023-04-30', None, 1, '12,474')
(16, 7, '2023-08-16', 'Expédié', '2023-08-25', None, 4, '37,66')
(19, 10, '2023-01-18', 'Expédié', '2023-01-19', None, 1, '7,7')
(23, 9, '2023-01-10', 'Expédié', '2023-01-20', None, 4, None)
(27, 11, '2023-08-26', 'Expédié', '2023-09-01', None, 1, '22,946')
(28, 15, '2023-03-03', 'Expédié', '2023-03-11', None, 1, '5,467')
(29, 11, '2023-09-09', 'Expédié', '2023-09-16', None, 1, '8,239')
(36, 11, '2023-09-06', 'Expédié', '2023-09-10', None, 4, '90,653')
(40, 16, '2023-10-15', 'Expédié', '2023-10-24', '2023-10-28', 3, '11,34')
(45, 9, '2023-01-24', 'Expédié', '2023-02-01', None, 4, '37,66')
(49, 17, '2023-07-28', 'Expédié', '2023-08-07', '2023-08-11', 4, '1,614')
(50, 12, '2023-06-11', 'Expédié', '2023-06-13', None, 4, None)
(51, 9, '2023-01-12', 'Expédié', '2023-01-15', None, 1, '3,894')
(53, 9, '

In [15]:
update_liv = '''
UPDATE orders
SET
  order_status = 'Livré'
WHERE
  id = 
'''

for i in id_to_liv:
    query = update_liv + str(i) + ';'
    c.execute(query)

# let’s verify :
exe(c, "SELECT * FROM orders WHERE order_status = 'Livré';")  # SQL string literals take single quotes

(2, 6, '2023-06-15', 'Livré', '2023-06-24', '2023-06-30', 1, '33,88')
(7, 10, '2023-03-12', 'Livré', '2023-03-20', '2023-03-25', 1, '12,166')
(13, 4, '2023-06-04', 'Livré', '2023-06-11', '2023-06-20', 5, '82,269')
(15, 15, '2023-09-08', 'Livré', '2023-09-11', '2023-09-18', 5, None)
(21, 13, '2023-03-03', 'Livré', '2023-03-05', '2023-03-12', 2, '5,55')
(24, 10, '2023-08-01', 'Livré', '2023-08-06', '2023-08-08', 1, '22,561')
(26, 16, '2023-02-06', 'Livré', '2023-02-11', '2023-02-19', 5, '57,024')
(31, 14, '2023-10-27', 'Livré', '2023-11-05', '2023-11-15', 5, '91,476')
(32, 15, '2023-09-03', 'Livré', '2023-09-08', '2023-09-10', 5, '115,533')
(33, 5, '2023-01-08', 'Livré', '2023-01-14', '2023-01-18', 1, '0,154')
(35, 3, '2023-10-14', 'Livré', '2023-10-19', '2023-10-26', 2, None)
(37, 12, '2023-10-12', 'Livré', '2023-10-16', '2023-10-20', 4, None)
(38, 14, '2023-05-14', 'Livré', '2023-05-24', '2023-06-03', 2, '4,44')
(39, 19, '2023-10-25', 'Livré', '2023-11-01', '2023-11-02', 1, '0,616')
(4

Another verification of the consistency : 

In [3]:
exe(c, "SELECT COUNT(*) FROM orders WHERE order_status = 'Livré'")

(49,)


In [4]:
exe(c, 'SELECT COUNT(*) FROM orders WHERE reception_date IS NOT NULL')

(49,)


Do not forget to commit once you have verified correctness of the update.

In [18]:
conn.commit()

The data have been cleaned ! Another choice would have been to drop all the inconsistent records (especially if we could not determine where the error came from). But as order ids are used as foreign keys in other tables, it would have implied updating those tables too (due to the constraints that preserve database integrity). 
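
The update loops above build each statement by concatenating the id to the SQL string. A common alternative is to let `sqlite3` bind the values through `?` placeholders, for instance with `executemany()`. The sketch below shows the idea on a hypothetical three-row `orders` table in memory ; the ids are taken from the example above but the data are otherwise made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, order_status TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(52, 'Annulé'), (55, 'Annulé'), (57, 'En attente')],
)

id_to_exp = [55, 57]

# one parameterized statement, executed once per (status, id) pair
conn.executemany(
    "UPDATE orders SET order_status = ? WHERE id = ?",
    [('Expédié', i) for i in id_to_exp],
)
conn.commit()

statuses = conn.execute(
    "SELECT id, order_status FROM orders ORDER BY id"
).fetchall()
print(statuses)
```

Placeholders avoid manual conversions with `str()` and protect against malformed or malicious values, which matters as soon as the values come from outside the program.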

### Simple statistics :
#### Average number of items in an order ?

Just a `GROUP BY` order and a `COUNT()`. To get the average of these counts, we have to use a subquery. The table that gives information about orders and items is the `orders_items` table :

In [19]:
avg_items = '''
SELECT
    AVG(n_items)
FROM
    (
     SELECT 
         COUNT(*) as n_items
    FROM
        orders_items
    GROUP BY
        order_id
    );
'''

exe(c, avg_items)

(2.027027027027027,)


#### Average number of orders by customer ?

Same logic :

In [20]:
avg_orders = '''
SELECT
    AVG(n_orders)
FROM
    (
     SELECT 
         COUNT(*) as n_orders
    FROM
        orders
    GROUP BY
        customer_id
    );
'''

exe(c, avg_orders)

(5.0,)


#### What are the items most frequently sold ?

In [21]:
join_items_ordered = '''
SELECT 
    i.product, COUNT(*) as n_items
FROM
    items as i
JOIN 
    orders_items as o ON i.id = o.item_id
GROUP BY
    i.product
ORDER BY
    n_items DESC
LIMIT 10;
'''

exe(c, join_items_ordered)

('Vase', 7)
('Tasse', 7)
('Couteau', 7)
('Lampe', 6)
('Télévision', 5)
('Réfrigérateur', 5)
('Rideaux', 5)
('Étagère', 4)
('Table', 4)
('Serviette', 4)


With `LIMIT 10` we have selected the 10 most frequently ordered items. But if we had written `LIMIT 11` we would have seen that the 11th item is ordered as often as the 8th, 9th and 10th (4 times). `LIMIT` has nothing to do with frequency, only with the number of lines returned. If we want to deal with frequency inside a sample (or population), the `NTILE()` (quantile) function is a better choice. It is a window function.

We will build two subqueries and a final query :
* the first subquery counts the number of times each item appears in orders, with a join to get the item names (see the query `join_items_ordered` above)
* the second subquery uses the `NTILE()` window (`OVER`) function to determine to which quartile each item belongs
* the final query selects only the items of the last quartile

In [23]:
last_quartile = '''
SELECT 
    product
FROM(
    SELECT
    product,
    NTILE(4) OVER (
        ORDER BY n_items
        ) AS Quartile
FROM 
( ''' + join_items_ordered[:-10] + ")) WHERE Quartile = 4;" # [:-10] cut the end of the query join_items_ordered ( to get rid of LIMIT 10;)
# we select the last quartile (Quartile = 4)

print('25% products the most frequently ordered (last quartile):')
exe(c, last_quartile)

25% products the most frequently ordered (last quartile):
('Serviette',)
('Porte-serviette',)
('Micro-ondes',)
('Clavier',)
('Bouilloire',)
('Télévision',)
('Réfrigérateur',)
('Rideaux',)
('Lampe',)
('Vase',)
('Tasse',)
('Couteau',)
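
To see what `NTILE(4)` does on its own, here is a small sketch with a hypothetical `sales` table of 8 products with distinct counts (window functions require SQLite 3.25 or later). `NTILE(4)` splits the ordered rows into 4 buckets of equal size, so the last bucket contains the 2 most ordered products.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (product TEXT, n_items INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [('Vase', 8), ('Tasse', 7), ('Lampe', 6), ('Table', 5),
     ('Chaise', 4), ('Bureau', 3), ('Canapé', 2), ('Miroir', 1)],
)

# each row receives the number (1 to 4) of the bucket it falls into
quartiles = conn.execute("""
SELECT product, NTILE(4) OVER (ORDER BY n_items) AS quartile
FROM sales
ORDER BY n_items DESC
""").fetchall()

top_quartile = [product for product, quartile in quartiles if quartile == 4]
print(top_quartile)  # ['Vase', 'Tasse']
```

Note that `NTILE()` builds buckets of (almost) equal size : when several products have the same count, they may still end up in different buckets.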


#### Average weight of an order ?

It should be a reflex : always check a new variable when you have to process it. Is its actual datatype congruent with its supposed datatype ? Are there missing values or outliers ? Any errors in certain records ? Etc.  

In [61]:
exe(c, 'SELECT weight FROM items LIMIT 10')

('27',)
('7',)
('3',)
('0,3',)
('14',)
('29',)
('12',)
('5,4',)
('28',)
('28',)


First thing : it seems that weight values are of string (`TEXT`) type :

In [66]:
c.execute('SELECT weight FROM items LIMIT 10')
type(c.fetchone()[0]) # we’re checking the first element of the tuple returned by .fetchone() 

str

We also see that the decimal separator (`'0,3'` or `'5,4'`) is a comma and not a point. That’s often the case when we import `.csv` files or other file formats that may contain localized data for which the separators may vary.

Is it a problem when we perform arithmetic operations on such data ?

In [68]:
exe(c, 'SELECT SUM(weight) FROM items')

(425.0,)


It seems that strings are automatically converted to numbers, but the decimal part is ignored.
In fact SQLite does not recognize the comma as a decimal separator : it appears to keep only the leading numeric part of the string (`'0,3'` is read as `0`). We have to `REPLACE` every comma by a point before casting the values to `REAL`.
This illustrates the importance of data preparation !

In [69]:
exe(c, "SELECT SUM(CAST(REPLACE(weight, ',', '.') AS REAL)) FROM items")

(440.0999999999999,)


That’s quite a different result ! 

* `REPLACE(weight, ',', '.')` replaces all the commas by dots
* `CAST(x AS REAL)` converts x to the `REAL` type (here x is the result of the previous line)
* `SUM()` simply returns the sum
* The result shows a floating-point rounding artifact. We can call the `ROUND` function to get a result correctly formatted with two decimals.
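
The effect of the decimal comma can be isolated on a single value, without any table. The sketch below compares a direct cast of `'0,3'` with a cast after replacing the comma :

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# left : direct cast of the localized string
# right : comma replaced by a point before the cast
raw, fixed = conn.execute(
    "SELECT CAST('0,3' AS REAL), CAST(REPLACE('0,3', ',', '.') AS REAL)"
).fetchone()

print(raw, fixed)  # 0.0 0.3
```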

Now, just sum the weights for each order. We have to :
* calculate the `SUM`, taking care to carry out the conversion as seen previously and rounding the result
* `JOIN` `items` and `orders_items` tables `ON` items id
* and `GROUP BY` orders id to get result for each order :

In [75]:
exe(c, ''' 
    SELECT 
        o.order_id, ROUND(SUM(CAST(REPLACE(weight, ',', '.') AS REAL)), 2) 
            AS total_weight
    FROM
        items as i
    JOIN 
        orders_items as o ON i.id = o.item_id
    GROUP BY
        o.order_id
    LIMIT 10;
    ''')

(1, 0.6)
(2, 44.0)
(3, 0.2)
(4, 1.1)
(5, 1.2)
(6, 29.8)
(7, 15.8)
(8, 5.6)
(9, 4.8)
(11, 2.8)


It seems ok. Now let’s store the query (without the `LIMIT`) to get the total weight for each order :

In [63]:
orders_weight = '''
SELECT 
    o.order_id, 
    ROUND(SUM(CAST(REPLACE(weight, ',', '.') AS REAL)), 2) 
        AS total_weight
FROM
    items as i
JOIN 
    orders_items as o ON i.id = o.item_id
GROUP BY
    o.order_id;
'''

exe(c, orders_weight)

(1, 0.6)
(2, 44.0)
(3, 0.2)
(4, 1.1)
(5, 1.2)
(6, 29.8)
(7, 15.8)
(8, 5.6)
(9, 4.8)
(11, 2.8)
(12, 16.2)
(13, 27.7)
(14, 25.0)
(16, 14.0)
(19, 10.0)
(20, 0.3)
(21, 1.5)
(22, 0.2)
(24, 29.3)
(25, 1.8)
(26, 19.2)
(27, 29.8)
(28, 7.1)
(29, 10.7)
(30, 12.9)
(31, 30.8)
(32, 38.9)
(33, 0.2)
(34, 2.0)
(36, 33.7)
(38, 1.2)
(39, 0.8)
(40, 14.0)
(41, 20.8)
(42, 1.3)
(43, 11.1)
(45, 14.0)
(47, 58.0)
(48, 0.3)
(49, 0.6)
(51, 2.2)
(52, 31.5)
(53, 0.9)
(55, 1.6)
(56, 4.0)
(57, 0.2)
(58, 0.9)
(60, 34.0)
(61, 34.8)
(62, 25.3)
(66, 32.3)
(67, 27.0)
(68, 11.1)
(71, 42.0)
(74, 14.0)
(75, 1.9)
(77, 27.0)
(78, 44.0)
(79, 38.0)
(80, 4.4)
(81, 69.0)
(82, 29.9)
(83, 1.4)
(84, 47.9)
(85, 31.5)
(87, 28.0)
(88, 28.0)
(91, 0.2)
(92, 30.8)
(94, 3.0)
(95, 0.6)
(96, 12.3)
(99, 0.7)
(100, 23.0)


And now just compute the mean, reusing `orders_weight` as a subquery (the slice `[:-2]` removes its trailing `;` and newline) :

In [78]:
avg_weight_orders = '''
SELECT
    ROUND(AVG(total_weight),2)
FROM
    (
''' + orders_weight[:-2] + ');'

exe(c, avg_weight_orders)

(16.52,)


When we build complex queries, we have to build them step by step, decomposing them into simpler operations.
Don’t try to write a long, perfect and optimized query on the first try. Readability and comprehensibility are your priority here.
When working step by step, subqueries are of great help !
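
Here is a minimal sketch of this step-by-step approach, on a toy table (`order_lines` and its values are hypothetical, chosen only for the illustration). The first query is checked on its own, then reused as a subquery :

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Hypothetical toy table: one row per (order, item weight).
cur.execute("CREATE TABLE order_lines (order_id INTEGER, weight REAL)")
cur.executemany(
    "INSERT INTO order_lines VALUES (?, ?)",
    [(1, 1.0), (1, 2.0), (2, 5.0)],
)

# Step 1: a simple query that is easy to verify by eye.
per_order = """
SELECT order_id, SUM(weight) AS total_weight
FROM order_lines
GROUP BY order_id
"""
step1 = cur.execute(per_order + " ORDER BY order_id").fetchall()

# Step 2: reuse step 1 as a subquery in the FROM clause.
average = f"SELECT AVG(total_weight) FROM ({per_order})"
step2 = cur.execute(average).fetchone()[0]

print(step1, step2)  # [(1, 3.0), (2, 5.0)] 4.0
conn.close()
```

Storing the inner query without a trailing `;` avoids having to slice the string before embedding it.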

#### Compare the number of deliveries made by each shipper ?

Let’s analyse the question and link some terms to SQL clauses :

- deliveries -> select with `WHERE` the orders that have a `reception_date` or a `Livré` status
- number of deliveries -> `COUNT` orders
- for each shipper -> `GROUP BY`

If we only use the `orders` table we’ll just have the shipping companies’ ids. To get their names we’ll have to do a join :

- get companies’ names -> `JOIN` on `orders.shipping_company` = `shippers.id`

Let’s first write the query that only returns the companies’ `id` :

In [79]:
n_deliveries = '''
SELECT 
    shipping_company, COUNT(*)
FROM
    orders
WHERE
    reception_date IS NOT NULL
GROUP BY 
    shipping_company
'''

exe(c, n_deliveries)

(1, 11)
(2, 14)
(3, 6)
(4, 8)
(5, 10)


Now let’s extend our query with a `JOIN` to get the companies’ names :

In [80]:
n_deliveries = '''
SELECT 
    company_name, COUNT(*)
FROM
    orders AS o
JOIN 
    shippers AS s 
    ON 
        s.id = o.shipping_company
WHERE
    reception_date IS NOT NULL
GROUP BY 
    company_name
'''

exe(c, n_deliveries)

('Chronossimo', 10)
('Colipost', 8)
('DeFix', 6)
('HCL', 11)
('USD', 14)


#### Percentage of canceled orders ? other status ?

We can use a subquery to get the number of canceled orders. Then we can compute the percentage by dividing it by the total number of orders, obtained with `COUNT(*)`.

**Important** : to compute the percentage we multiply by `1.0`, so that the operand becomes a float and SQLite performs a float division instead of an integer division.

In [81]:
p_canceled = '''
SELECT 
    (
        SELECT 
            COUNT(*)
        FROM 
            orders
        WHERE order_status = 'Annulé'
    ) * 1.0 / COUNT(*) * 100
FROM
    orders
'''
print('Percentage of canceled orders :')
exe(c, p_canceled)

Percentage of canceled orders :
(13.0,)
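
The role of the `* 1.0` can be checked in isolation. In this small sketch, the numbers 13 and 100 simply stand in for the two counts :

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Two integers: SQLite performs an integer division and truncates.
int_div = cur.execute("SELECT 13 / 100").fetchone()[0]

# One float operand is enough to switch to a float division.
float_div = cur.execute("SELECT 13 * 1.0 / 100").fetchone()[0]

# Same idea, scaled to a percentage and rounded to one decimal.
percent = cur.execute("SELECT ROUND(13 * 100.0 / 100, 1)").fetchone()[0]

print(int_div, float_div, percent)  # 0 0.13 13.0
conn.close()
```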


If we want to compute the share of each status in one query, we have to `GROUP BY` status and get the total number of orders with a subquery (don’t forget to multiply by `1.0` to force a float division). Note that the query below returns fractions between 0 and 1; multiply them by 100 to get percentages :

In [5]:
p_status = '''
        SELECT 
            order_status, COUNT(*) * 1.0 / 
            (SELECT COUNT(*) FROM ORDERS)
        FROM 
            orders
        GROUP BY
            order_status

'''
print('Percentage of orders in each status:')
exe(c, p_status)

Percentage of orders in each status:
('Annulé', 0.13)
('En attente', 0.13)
('Expédié', 0.25)
('Livré', 0.49)


### Build more complex indicators :

#### Number of days between expedition and delivery ? (result can be stored in a column)

For this one, we’ll have to use SQLite’s date and time functions, in particular `JULIANDAY()`.

First thing to do : check the format of the date columns :

In [6]:
exe(c, 'SELECT order_date, expedition_date, reception_date FROM orders LIMIT 20')

('2023-08-14', None, None)
('2023-06-15', '2023-06-24', '2023-06-30')
('2023-04-08', '2023-04-17', None)
('2023-02-04', '2023-02-08', None)
('2023-03-23', None, None)
('2023-06-09', None, None)
('2023-03-12', '2023-03-20', '2023-03-25')
('2023-01-25', None, None)
('2023-03-27', None, None)
('2023-05-09', None, None)
('2023-02-12', None, None)
('2023-04-20', '2023-04-30', None)
('2023-06-04', '2023-06-11', '2023-06-20')
('2023-07-17', None, None)
('2023-09-08', '2023-09-11', '2023-09-18')
('2023-08-16', '2023-08-25', None)
('2023-08-09', None, None)
('2023-05-03', None, None)
('2023-01-18', '2023-01-19', None)
('2023-10-24', None, None)


It seems that all the dates share the same format : the string 'YYYY-MM-DD', a common format (ISO 8601) that SQLite’s date functions understand.

The `JULIANDAY()` function provides us with a way to perform subtraction operations to find the number of days between two dates:

In [16]:
delivery_days = '''
SELECT
    id, CAST((JULIANDAY(reception_date) - JULIANDAY(expedition_date)) AS INTEGER) AS delivery_days
FROM
    orders
WHERE
    reception_date IS NOT NULL;
'''

exe(c, delivery_days[:-2] + ' LIMIT 10')

(2, 6)
(7, 5)
(13, 9)
(15, 7)
(21, 7)
(24, 2)
(26, 8)
(31, 10)
(32, 2)
(33, 4)
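
As a quick check, here is a sketch applying the same subtraction to the two dates of order 2 shown earlier. It also shows what happens when one of the dates is missing :

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Reception and expedition dates of order 2, taken from the listing above.
days = cur.execute(
    "SELECT CAST(JULIANDAY(?) - JULIANDAY(?) AS INTEGER)",
    ("2023-06-30", "2023-06-24"),
).fetchone()[0]

# With a NULL date the whole expression is NULL (None in Python).
missing = cur.execute(
    "SELECT JULIANDAY(NULL) - JULIANDAY(?)", ("2023-04-17",)
).fetchone()[0]

print(days, missing)  # 6 None
conn.close()
```

This is why the `WHERE reception_date IS NOT NULL` filter matters : it keeps only the orders for which a duration can be computed.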


Let’s add a column to the `orders` table. The best choice is to set `DEFAULT` to `NULL`, because another value, like 0, could distort statistics such as the mean. Aggregate functions simply ignore `NULL` values :

In [35]:
add_days = '''
ALTER TABLE 
    orders 
ADD COLUMN 
    delivery_time INTEGER DEFAULT NULL;
'''

c.execute(add_days)
conn.commit()

In [36]:
exe(c, 'PRAGMA table_info(orders)')

(0, 'id', 'INTEGER', 0, None, 0)
(1, 'customer_id', 'INTEGER', 0, None, 0)
(2, 'order_date', 'TEXT', 0, None, 0)
(3, 'order_status', 'TEXT', 0, None, 0)
(4, 'expedition_date', 'TEXT', 0, None, 0)
(5, 'reception_date', 'TEXT', 0, None, 0)
(6, 'shipping_company', 'INTEGER', 0, None, 0)
(7, 'shipping_cost', 'TEXT', 0, None, 0)
(8, 'delivery_time', 'INTEGER', 0, 'NULL', 0)
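
To see why `NULL` is a safer default than 0, here is a small sketch on a throwaway table (the table `t` and its values are invented for the comparison) :

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Same measurements twice: the missing one is NULL in one column, 0 in the other.
cur.execute("CREATE TABLE t (with_null INTEGER, with_zero INTEGER)")
cur.executemany("INSERT INTO t VALUES (?, ?)", [(6, 6), (None, 0), (4, 4)])

avg_null, avg_zero = cur.execute(
    "SELECT AVG(with_null), AVG(with_zero) FROM t"
).fetchone()

# AVG skips the NULL: (6 + 4) / 2, whereas the 0 is counted: (6 + 0 + 4) / 3.
print(avg_null, round(avg_zero, 2))  # 5.0 3.33
conn.close()
```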


If you make a mistake, drop the column and recreate one (**do not execute this cell if everything is ok !**) :

In [33]:
drop_days = '''
ALTER TABLE 
    orders 
DROP COLUMN 
    delivery_time;
'''

c.execute(drop_days)
conn.commit()

Now it’s time to insert the delivery days into the newly created column.

Let’s start by generating the data with the `delivery_days` query we wrote just before :

In [37]:
c.execute(delivery_days)
delivery_data = c.fetchall()

Insert the data into the `delivery_time` column.
We iterate through the data list. Each element is a tuple whose first element is the `id` of the order and whose second element is the delivery time in days. Before concatenating these values to the query we’re building, we need to cast them to strings.

In [38]:
update_delivery_time = '''
UPDATE orders
SET
  delivery_time = '''

for i in delivery_data:
    query = update_delivery_time + str(i[1]) + ' WHERE id = ' + str(i[0]) + ' ;'
    c.execute(query)

# let’s verify :
exe(c, 'SELECT id, delivery_time FROM orders LIMIT 10;')

(1, None)
(2, 6)
(3, None)
(4, None)
(5, None)
(6, None)
(7, 5)
(8, None)
(9, None)
(10, None)
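
String concatenation works here because the values are integers that we produced ourselves. A common alternative is to let `sqlite3` bind the values through `?` placeholders, which avoids the manual `str()` casts. The sketch below uses a reduced, hypothetical `orders` table holding only the two columns involved :

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Reduced stand-in for the real table: only the columns we need.
cur.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, delivery_time INTEGER DEFAULT NULL)"
)
cur.executemany("INSERT INTO orders (id) VALUES (?)", [(1,), (2,), (7,)])

# (order id, delivery days): first two rows of the output shown earlier.
delivery_data = [(2, 6), (7, 5)]

# Placeholders are filled in order, so each pair is swapped to (days, id).
cur.executemany(
    "UPDATE orders SET delivery_time = ? WHERE id = ?",
    [(days, order_id) for order_id, days in delivery_data],
)
conn.commit()

rows = cur.execute("SELECT id, delivery_time FROM orders ORDER BY id").fetchall()
print(rows)  # [(1, None), (2, 6), (7, 5)]
conn.close()
```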


Everything seems OK ! We can `.commit()` :

In [None]:
conn.commit()

#### What are the slowest shippers ? the fastest ? average for each shipper ?

The first two questions are easy to answer : we just have to `GROUP` the average `delivery_time` `BY` shipper and sort the result. If we want the shippers’ names, a `JOIN` will do the job :

In [40]:
mean_delivery = '''
SELECT 
    shipping_company, 
    AVG(delivery_time) AS mean_delivery
FROM
    orders
GROUP BY
    shipping_company
ORDER BY
    mean_delivery ;
'''

exe(c, mean_delivery)

(1, 4.363636363636363)
(3, 5.666666666666667)
(4, 6.0)
(5, 6.1)
(2, 6.5)


Let’s join with the `shippers` table to get companies’ names :

In [41]:
mean_delivery = '''
SELECT 
    s.company_name, 
    AVG(o.delivery_time) AS mean_delivery
FROM
    orders as o
JOIN 
    shippers AS s 
        ON o.shipping_company = s.id
GROUP BY
    o.shipping_company
ORDER BY
    mean_delivery ;
'''

exe(c, mean_delivery)

('HCL', 4.363636363636363)
('DeFix', 5.666666666666667)
('Colipost', 6.0)
('Chronossimo', 6.1)
('USD', 6.5)


The companies with the fastest deliveries are HCL (about 4.4 days on average) and DeFix (about 5.7 days); the slowest are Chronossimo (6.1 days) and USD (6.5 days). USD’s average is roughly 50% longer than HCL’s.

#### Add a column to the shippers table : speed of delivery (speed = distance / time)

To answer this question, we will apply again the same steps :

1. create the data
2. create a new column
3. insert data in the newly created column

To compute the speed, we need the distance stored in the `customers` table : we’ll have to join the tables.

In [45]:
speed_data = '''
SELECT
    o.id,
    c.distance,
    o.delivery_time
FROM
    orders AS o
JOIN 
    customers AS c
    ON
        o.customer_id = c.id
WHERE 
    o.delivery_time IS NOT NULL ;
'''

exe(c, speed_data[:-2] + ' LIMIT 10 ;')

(2, 840, 6)
(7, 706, 5)
(13, 829, 9)
(15, 826, 7)
(21, 681, 7)
(24, 706, 2)
(26, 649, 8)
(31, 603, 10)
(32, 826, 2)
(33, 906, 4)


That seems good. Now we can rewrite the query to generate the data : we just need the order id, the shipper id and the speed. Then we’ll `GROUP` the speed (`AVG()`) `BY` `shipping_company` and add the average speed to the `shippers` table. First of all, let’s check that we manage to generate the speed column. As the time and distance values are `INTEGER`, we need to multiply by `1.0` to get a float division (distance / time) :

In [51]:
speed_data = '''
SELECT
    o.id,
    o.shipping_company,
    c.distance,
    o.delivery_time,
    (c.distance * 1.0 / o.delivery_time) AS speed
FROM
    orders AS o
JOIN 
    customers AS c
    ON
        o.customer_id = c.id
WHERE 
    o.delivery_time IS NOT NULL ;
'''

exe(c, speed_data[:-2] + ' LIMIT 10 ;')

(2, 1, 840, 6, 140.0)
(7, 1, 706, 5, 141.2)
(13, 5, 829, 9, 92.11111111111111)
(15, 5, 826, 7, 118.0)
(21, 2, 681, 7, 97.28571428571429)
(24, 1, 706, 2, 353.0)
(26, 5, 649, 8, 81.125)
(31, 5, 603, 10, 60.3)
(32, 5, 826, 2, 413.0)
(33, 1, 906, 4, 226.5)


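Finally, we just have to `GROUP BY` company and take the `AVG()` of the speed. The aggregate must wrap the speed expression : without it, SQLite evaluates the expression on a single, arbitrary order of each group, which is how the figures printed below were obtained (for example 140.0 = 840 / 6 is the speed of order 2 alone). Rerun the cell to get the actual means. As we will add the speed to the `shippers` table, we don’t need the companies’ names, just their ids :

In [55]:
speed_data = '''
SELECT
    o.shipping_company,
    ROUND(AVG(c.distance * 1.0 / o.delivery_time), 2) AS speed
FROM
    orders AS o
JOIN 
    customers AS c
    ON
        o.customer_id = c.id
WHERE 
    o.delivery_time IS NOT NULL 
GROUP BY
    o.shipping_company ;
'''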

exe(c, speed_data)

(1, 140.0)
(2, 97.29)
(3, 162.25)
(4, 187.0)
(5, 92.11)
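
When averaging a computed expression per group, the aggregate has to wrap the expression. SQLite accepts a grouped query without any aggregate, but it then evaluates the expression on one arbitrary row of each group. Here is a minimal sketch on a toy `deliveries` table (hypothetical name and values) :

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Toy data: (shipper id, distance, delivery days).
cur.execute("CREATE TABLE deliveries (shipper INTEGER, distance INTEGER, days INTEGER)")
cur.executemany(
    "INSERT INTO deliveries VALUES (?, ?, ?)",
    [(1, 100, 2), (1, 300, 3), (2, 200, 4)],
)

# AVG() wraps the per-delivery speed, so every row of the group contributes.
avg_speed = cur.execute("""
    SELECT shipper, ROUND(AVG(distance * 1.0 / days), 2) AS speed
    FROM deliveries
    GROUP BY shipper
    ORDER BY shipper
""").fetchall()

# Shipper 1: (50.0 + 100.0) / 2 = 75.0 ; shipper 2: 50.0
print(avg_speed)  # [(1, 75.0), (2, 50.0)]
conn.close()
```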


In [56]:
c.execute(speed_data)
avg_speed_data = c.fetchall()
avg_speed_data

[(1, 140.0), (2, 97.29), (3, 162.25), (4, 187.0), (5, 92.11)]

Now we create a new column in the `shippers` table (don’t forget to `.commit()`) :

In [57]:
add_speed = '''
ALTER TABLE 
    shippers 
ADD COLUMN 
    delivery_speed REAL DEFAULT NULL;
'''

c.execute(add_speed)
conn.commit()

In [59]:
exe(c, 'PRAGMA table_info(shippers)')

(0, 'id', 'INTEGER', 0, None, 0)
(1, 'company_name', 'TEXT', 0, None, 0)
(2, 'delivery_speed', 'REAL', 0, 'NULL', 0)


Finally, add the data to the column :

In [61]:
update_delivery_speed = '''
UPDATE shippers
SET
  delivery_speed = '''

for i in avg_speed_data:
    query = update_delivery_speed + str(i[1]) + ' WHERE id = ' + str(i[0]) + ' ;'
    c.execute(query)

# let’s verify :
exe(c, 'SELECT * FROM shippers;')

(1, 'HCL', 140.0)
(2, 'USD', 97.29)
(3, 'DeFix', 162.25)
(4, 'Colipost', 187.0)
(5, 'Chronossimo', 92.11)


Everything is ok, we can `.commit()` :

In [62]:
conn.commit()
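
The fetch-then-loop pattern can also be replaced by a single `UPDATE` with a correlated subquery, so that the data never leaves SQLite. This is an alternative sketch, not the method used in this notebook. The `deliveries` table and its values are hypothetical stand-ins for the `orders` / `customers` join :

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
    CREATE TABLE shippers (
        id INTEGER PRIMARY KEY,
        company_name TEXT,
        delivery_speed REAL DEFAULT NULL
    );
    CREATE TABLE deliveries (shipper INTEGER, distance INTEGER, days INTEGER);
    INSERT INTO shippers (id, company_name) VALUES (1, 'A'), (2, 'B'), (3, 'C');
    INSERT INTO deliveries VALUES (1, 100, 2), (1, 300, 3), (2, 200, 4);
""")

# For each shipper row, the subquery averages only that shipper's deliveries.
cur.execute("""
    UPDATE shippers
    SET delivery_speed = (
        SELECT ROUND(AVG(d.distance * 1.0 / d.days), 2)
        FROM deliveries AS d
        WHERE d.shipper = shippers.id
    )
""")
conn.commit()

result = cur.execute("SELECT id, delivery_speed FROM shippers ORDER BY id").fetchall()
print(result)  # [(1, 75.0), (2, 50.0), (3, None)]
conn.close()
```

A shipper without any delivery keeps `NULL`, which is consistent with the `DEFAULT NULL` chosen for the column.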

#### Add another column to the shippers table : price / kg of shipping (total cost = weight x price)

We have already calculated the weight of each order with the query `orders_weight`. From it we can calculate the shipping price per kg for each order. Grouping by shipper and averaging gives us our price data. It will then be simple to add a price column to the `shippers` table.

First, let’s have a look at the weight of each order (refresher) :


In [64]:
exe(c, orders_weight[:-2] + ' LIMIT 10 ;')

(1, 0.6)
(2, 44.0)
(3, 0.2)
(4, 1.1)
(5, 1.2)
(6, 29.8)
(7, 15.8)
(8, 5.6)
(9, 4.8)
(11, 2.8)


This query returns the order id and the order weight. Let’s use it as a starting point to write a subquery that gives us the order `id`, the `shipping_company`, the `shipping_cost` and the `total_weight`. Just write a test query to see if we can get all those columns :

In [76]:
orders_weight = '''
SELECT 
    o.id, 
    o.shipping_company,
    o.shipping_cost,
    ROUND(SUM(CAST(REPLACE(i.weight, ",", ".") AS REAL)), 2) 
        AS total_weight
FROM
    items as i
JOIN 
    orders_items as oi ON i.id = oi.item_id
JOIN
    orders as o ON oi.order_id = o.id
GROUP BY
    oi.order_id
LIMIT 10
'''

exe(c, orders_weight)

(1, 2, '2,22', 0.6)
(2, 1, '33,88', 44.0)
(3, 2, '0,74', 0.2)
(4, 4, '2,959', 1.1)
(5, 3, '0,972', 1.2)
(6, 2, '110,26', 29.8)
(7, 1, '12,166', 15.8)
(8, 4, '28,783', 5.6)
(9, 3, '3,888', 4.8)
(11, 5, '8,316', 2.8)


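OK, that’s fine. Now, given this subquery, we just have to calculate the price per kg of each order (`shipping_cost` / `total_weight`, after converting the cost to `REAL` as we did for the weights), then average those prices, grouping them by `shipping_company`. Thus we obtain our price data :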

In [74]:
shipping_prices = '''

SELECT
    shipping_company,
    ROUND(AVG(CAST(REPLACE(shipping_cost, ",", ".") AS REAL) / total_weight),2) AS price
FROM 
    (SELECT 
        o.id, 
        o.shipping_company,
        o.shipping_cost,
        ROUND(SUM(CAST(REPLACE(i.weight, ",", ".") AS REAL)), 2) 
            AS total_weight
    FROM
        items as i
    JOIN 
        orders_items as oi ON i.id = oi.item_id
    JOIN
        orders as o ON oi.order_id = o.id
    GROUP BY
        oi.order_id)
GROUP BY
    shipping_company ;
'''

exe(c, shipping_prices)

(1, 0.82)
(2, 3.7)
(3, 0.81)
(4, 2.87)
(5, 2.97)
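
A nested subquery like this one can also be written with a `WITH` clause (a common table expression), which some readers find easier to follow. The sketch below is an alternative formulation on toy data. The three tables mirror the notebook’s schema, but their contents are invented, and the name `order_weights` is an assumption :

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
    CREATE TABLE items (id INTEGER PRIMARY KEY, weight TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, shipping_company INTEGER, shipping_cost TEXT);
    CREATE TABLE orders_items (order_id INTEGER, item_id INTEGER);
    INSERT INTO items VALUES (1, '0,5'), (2, '1,5'), (3, '2,0');
    INSERT INTO orders VALUES (1, 1, '4,0'), (2, 1, '3,0'), (3, 2, '10,0');
    INSERT INTO orders_items VALUES (1, 1), (1, 2), (2, 2), (3, 3);
""")

# The CTE computes cost and weight per order; the outer query averages cost / weight.
prices = cur.execute("""
    WITH order_weights AS (
        SELECT
            o.id,
            o.shipping_company,
            CAST(REPLACE(o.shipping_cost, ',', '.') AS REAL) AS cost,
            SUM(CAST(REPLACE(i.weight, ',', '.') AS REAL)) AS total_weight
        FROM items AS i
        JOIN orders_items AS oi ON i.id = oi.item_id
        JOIN orders AS o ON oi.order_id = o.id
        GROUP BY o.id
    )
    SELECT shipping_company, ROUND(AVG(cost / total_weight), 2) AS price
    FROM order_weights
    GROUP BY shipping_company
    ORDER BY shipping_company
""").fetchall()

# Company 1: orders at 4.0 / 2.0 and 3.0 / 1.5 ; company 2: 10.0 / 2.0
print(prices)  # [(1, 2.0), (2, 5.0)]
conn.close()
```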


In [75]:
c.execute(shipping_prices)
prices_data = c.fetchall()
prices_data

[(1, 0.82), (2, 3.7), (3, 0.81), (4, 2.87), (5, 2.97)]

Finally, create a new column, insert data, verify and commit :

In [77]:
add_prices = '''
ALTER TABLE 
    shippers 
ADD COLUMN 
    delivery_price REAL DEFAULT NULL;
'''

c.execute(add_prices)
conn.commit()

In [78]:
exe(c, 'PRAGMA table_info(shippers)')

(0, 'id', 'INTEGER', 0, None, 0)
(1, 'company_name', 'TEXT', 0, None, 0)
(2, 'delivery_speed', 'REAL', 0, 'NULL', 0)
(3, 'delivery_price', 'REAL', 0, 'NULL', 0)


In [79]:
update_delivery_prices = '''
UPDATE shippers
SET
  delivery_price = '''

for i in prices_data:
    query = update_delivery_prices + str(i[1]) + ' WHERE id = ' + str(i[0]) + ' ;'
    c.execute(query)

# let’s verify :
exe(c, 'SELECT * FROM shippers;')

(1, 'HCL', 140.0, 0.82)
(2, 'USD', 97.29, 3.7)
(3, 'DeFix', 162.25, 0.81)
(4, 'Colipost', 187.0, 2.87)
(5, 'Chronossimo', 92.11, 2.97)


In [80]:
conn.commit()

#### What is the best shipping company ? The fastest ? The cheapest ? Best compromise ?

According to the values stored in the `shippers` table above, the fastest company is Colipost with a speed of 187 km/day, and the cheapest is DeFix with a price of 0.81 €/kg. DeFix is also the second fastest company (162.25 km/day), so it offers the best compromise and can be considered the best shipping company.
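
One simple way to make "best compromise" explicit is to add up each company’s rank on both criteria. This is only a sketch of one possible scoring rule, which is an assumption and not part of the exercise. The rows are copied from the `shippers` table as displayed above :

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute(
    "CREATE TABLE shippers (id INTEGER PRIMARY KEY, company_name TEXT, "
    "delivery_speed REAL, delivery_price REAL)"
)
# Values copied from the table printed in the notebook.
cur.executemany(
    "INSERT INTO shippers VALUES (?, ?, ?, ?)",
    [
        (1, "HCL", 140.0, 0.82),
        (2, "USD", 97.29, 3.7),
        (3, "DeFix", 162.25, 0.81),
        (4, "Colipost", 187.0, 2.87),
        (5, "Chronossimo", 92.11, 2.97),
    ],
)

# Fastest first, then cheapest first.
by_speed = [row[0] for row in cur.execute(
    "SELECT company_name FROM shippers ORDER BY delivery_speed DESC")]
by_price = [row[0] for row in cur.execute(
    "SELECT company_name FROM shippers ORDER BY delivery_price ASC")]

# Hypothetical score: sum of the two rank positions (0 = best), lower is better.
score = {name: by_speed.index(name) + by_price.index(name) for name in by_speed}
best = min(score, key=score.get)

print(by_speed[0], by_price[0], best)  # Colipost DeFix DeFix
conn.close()
```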