# SQL basics : Exercises

## Prerequisite : modifying a table or a column, with `ALTER TABLE`

Once a table is created, you may sometimes need to modify it, by changing its name, or the name of a column, adding a column or deleting one. (Again : remember CRUD).

For these operations we have the `ALTER TABLE` command which is very intuitive to use.

Unfortunately, the `ALTER TABLE` command has not always been fully implemented in SQLite. Pay attention to the notes below.

First note : it is always wise to create a backup of your database before any attempt to make significant changes to its schema.

### Modifying a table : change name

To change the name of a table, the syntax is very simple : 

```SQL
ALTER TABLE <old_table_name> 
    RENAME TO <new_table_name>;
```
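As a quick check from Python, here is a minimal sketch run against a throwaway in-memory database. The table names `pupils` and `students` and the sample row are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway database, nothing is written to disk
conn.execute("CREATE TABLE pupils (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO pupils (name) VALUES ('Ada')")

conn.execute("ALTER TABLE pupils RENAME TO students")

# The old name is gone from the schema and the data follows the new name
tables = [r[0] for r in conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'")]
rows = conn.execute("SELECT id, name FROM students").fetchall()
print(tables, rows)
```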

### Modifying a table : add a column

To add a column, the command is also very clear and simple :

```SQL
ALTER TABLE <table_name> 
    ADD COLUMN <new_column_name>;
```

You can add constraints when you create a new column. But when you add a column to an existing table, some constraints can’t be specified :

- the new column can’t be a `PRIMARY KEY`
- it can’t be `UNIQUE` (records already exist, and they all receive the same initial value in the new column)
- it can’t have `CURRENT_TIME`, `CURRENT_DATE` or `CURRENT_TIMESTAMP` as its default value (records already exist, and SQLite only accepts a constant default for them)
- there are also limitations on other constraints, but they are based on more elaborate notions that we have not seen so far.
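The sketch below illustrates two of these rules on an in-memory database: a constant default is accepted, while `UNIQUE` is refused. The table, the column names (`country`, `email`) and the rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE students (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO students (name) VALUES ('Ada'), ('Alan')")

# A constant default is allowed: existing rows read back that default
conn.execute("ALTER TABLE students ADD COLUMN country TEXT DEFAULT 'unknown'")

# A UNIQUE column cannot be added to an existing table
try:
    conn.execute("ALTER TABLE students ADD COLUMN email TEXT UNIQUE")
    unique_allowed = True
except sqlite3.OperationalError as err:
    unique_allowed = False
    print("refused :", err)

rows = conn.execute("SELECT name, country FROM students ORDER BY id").fetchall()
print(rows)
```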

### Modifying a table/column : drop a column

Simply :

```SQL
ALTER TABLE <table_name> 
    DROP COLUMN <column_name>;
```

- you cannot drop a column if it is a `PRIMARY KEY` or has a `UNIQUE` constraint
- you also cannot drop a column if it is referenced elsewhere in the schema (see `FOREIGN KEY` constraints below)

Attention ! This syntax is only valid for SQLite ≥ 3.35.0.

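Old method (with previous versions) : 
1. create a new table like the one you are trying to change, without the column to drop,
2. copy all the data you want to keep,
3. drop the old table,
4. rename the new one.

What does the `DROP COLUMN` command do ? Something close to the steps above : it removes the column from the schema, then rewrites the table content without that column’s data.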
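The two routes can be sketched side by side. This is only an illustration on a made-up `students` table with a hypothetical `nickname` column : it uses `DROP COLUMN` when the SQLite library bundled with Python is recent enough, and falls back to the four manual steps otherwise.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE students (id INTEGER PRIMARY KEY, name TEXT, nickname TEXT)")
conn.execute("INSERT INTO students (name, nickname) VALUES ('Ada', 'Countess')")
conn.commit()

if sqlite3.sqlite_version_info >= (3, 35, 0):
    # Recent SQLite: one statement is enough
    conn.execute("ALTER TABLE students DROP COLUMN nickname")
else:
    # Older SQLite: create, copy, drop, rename, all in one transaction
    conn.executescript("""
        BEGIN;
        CREATE TABLE students_new (id INTEGER PRIMARY KEY, name TEXT);
        INSERT INTO students_new (id, name) SELECT id, name FROM students;
        DROP TABLE students;
        ALTER TABLE students_new RENAME TO students;
        COMMIT;
    """)

columns = [r[1] for r in conn.execute("PRAGMA table_info(students)")]
rows = conn.execute("SELECT * FROM students").fetchall()
print(columns, rows)
```

Wrapping the manual steps in a single transaction means a failure halfway through leaves the original table untouched.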

### Modifying a table/column : change name

The syntax is clear, again :

```SQL
ALTER TABLE <table_name> 
    RENAME COLUMN <old_name> 
        TO <new_name>;
```

### Changing a column type

There is no standard SQL syntax to do this : each DBMS has its own particularities.

SQLite is especially tricky. As SQLite is weakly typed, the simplest move is to :

1. Add a column to the table with the new datatype (you know how to code this!)

2. Copy the old column by casting the data to the new type :

```SQL
UPDATE <table_name> 
    SET <new_column_name> = CAST(<old_column_name> AS <new_data_type>);
```
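Here is a minimal sketch of this simple method from Python, on an in-memory `items` table. The new column name `volume_txt` is an assumption chosen for the example, and `typeof()` is used to confirm how each value is actually stored.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, product TEXT, volume INTEGER)")
conn.executemany(
    "INSERT INTO items (product, volume) VALUES (?, ?)",
    [("Table", 40398), ("Chaise", 23829)],
)

# Step 1: add a column with the new datatype
conn.execute("ALTER TABLE items ADD COLUMN volume_txt TEXT")
# Step 2: copy the old column, casting each value to the new type
conn.execute("UPDATE items SET volume_txt = CAST(volume AS TEXT)")
conn.commit()

rows = conn.execute(
    "SELECT volume, typeof(volume), volume_txt, typeof(volume_txt) FROM items ORDER BY id"
).fetchall()
print(rows)
```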

Another, canonical (and safer) but more complex method :

1. Create a new table with the same columns, except the one we want to modify, which will have the type we want
2. Copy the old table data into the new table (casting the column we want to modify)
3. Drop the old table
4. Rename the new table with the name of the old table
5. Commit !

Exercise : try to implement this sequence with a table you have already used. For example, cast an `INTEGER` column to `TEXT`.



In [None]:
# your code here (choose the database, the table and the column you want).
# Make a backup before playing with it !




## Foreign key constraints

We have seen that when creating tables, we can declare a column as `PRIMARY KEY`.
When we created ERDs, we saw that we can declare references between tables, connecting a PK and a FK.

In fact, we can declare a FK in the database schema. It adds constraints : if, when creating a table, we declare a FK referencing another table, the referenced table can no longer be dropped or modified as easily, which preserves the structure of the database. It makes life harder for the data scientist or administrator, but it secures the data. That’s an advantage of the relational model, and of storing data in databases rather than in flat files.

How do we declare a FK ?

Suppose we are creating two tables : `students` and `cursus` :

```SQL
CREATE TABLE cursus(
    code INTEGER PRIMARY KEY, 
    name TEXT,
    major TEXT,
    referent TEXT
);

CREATE TABLE students(
    id INTEGER,
    firstname TEXT, 
    lastname TEXT,
    cursus_id INTEGER     -- This column will refer to a cursus.code
);
```

The declaration of a FK is a bit more complicated than the declaration of a PK :

```SQL
CREATE TABLE cursus(
    code INTEGER PRIMARY KEY, 
    name TEXT,
    major TEXT,
    referent TEXT
);

CREATE TABLE students(
    id INTEGER,
    firstname TEXT, 
    lastname TEXT,
    cursus_id INTEGER,     -- This column will refer to a cursus.code
    FOREIGN KEY(cursus_id) REFERENCES cursus(code)
);
```

With foreign key enforcement enabled (`PRAGMA foreign_keys=on;`, to run on each new connection because enforcement is off by default in SQLite), SQLite will then forbid :

- deleting a record in `cursus` if at least one `students` record references it
- inserting a student whose `cursus_id` does not match any existing `cursus.code`

Two remarks :

- a `NULL` value in `students.cursus_id` does not raise an error
- you can add a `NOT NULL` constraint to a `FOREIGN KEY` column : by doing so we define a very strict relationship between the tables (each student recorded in the table `students` must refer to a cursus).
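These rules can be observed from Python with a small in-memory sketch. The tables are simplified versions of `cursus` and `students` and the sample rows are made up. Note the `PRAGMA foreign_keys = ON` at the start : without it, SQLite does not enforce foreign keys by default.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # enforcement must be switched on per connection

conn.execute("CREATE TABLE cursus (code INTEGER PRIMARY KEY, name TEXT)")
conn.execute("""
    CREATE TABLE students (
        id INTEGER PRIMARY KEY,
        firstname TEXT,
        cursus_id INTEGER,
        FOREIGN KEY (cursus_id) REFERENCES cursus (code)
    )
""")
conn.execute("INSERT INTO cursus (code, name) VALUES (1, 'Data science')")
conn.execute("INSERT INTO students (firstname, cursus_id) VALUES ('Ada', 1)")      # valid reference
conn.execute("INSERT INTO students (firstname, cursus_id) VALUES ('Alan', NULL)")  # NULL is accepted

errors = []
for statement in (
    "INSERT INTO students (firstname, cursus_id) VALUES ('Grace', 99)",  # no cursus with code 99
    "DELETE FROM cursus WHERE code = 1",                                 # still referenced by Ada
):
    try:
        conn.execute(statement)
    except sqlite3.IntegrityError as err:
        errors.append(str(err))

print(errors)
```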

As we said before, this foreign key constraint forbids the modification or deletion of the table to which the FK refers. But sometimes we need to modify or drop such tables or their columns. We can temporarily disable these constraints with a `PRAGMA` statement : 

```SQL
PRAGMA foreign_keys=off; 

-- operations on the table

PRAGMA foreign_keys=on;
```
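One caveat when doing this from Python : SQLite ignores `PRAGMA foreign_keys` while a transaction is open, so it is safer to commit first. The sketch below (hypothetical tables and rows) disables enforcement, performs a delete that would otherwise be refused, re-enables enforcement, then uses `PRAGMA foreign_key_check` to list the rows left without a parent.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.execute("CREATE TABLE cursus (code INTEGER PRIMARY KEY, name TEXT)")
conn.execute(
    "CREATE TABLE students (id INTEGER PRIMARY KEY, cursus_id INTEGER REFERENCES cursus (code))"
)
conn.execute("INSERT INTO cursus VALUES (1, 'Data science')")
conn.execute("INSERT INTO students VALUES (1, 1)")
conn.commit()  # the pragma below would be ignored inside an open transaction

conn.execute("PRAGMA foreign_keys = OFF")
conn.execute("DELETE FROM cursus WHERE code = 1")  # accepted now, but leaves an orphan student
conn.commit()
conn.execute("PRAGMA foreign_keys = ON")

state = conn.execute("PRAGMA foreign_keys").fetchone()[0]        # 1 means enforcement is on
orphans = conn.execute("PRAGMA foreign_key_check").fetchall()    # rows violating a FK
print(state, orphans)
```

Turning the constraints back on does not repair anything : rows orphaned while enforcement was off stay in the table until you fix them yourself.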

## Exercise : create a database

### Understanding the schema

In the data folder of this repo you will find 5 `.csv` files called `customers.csv`, `orders.csv`, `items.csv`, `orders_items.csv` and `shippers.csv`.

1. Open the files, observe the columns, datatypes, etc. and create an ERD to understand the schema

*Edit this cell (Markdown)*

```SQL
 -- you can copy DBML code of the ERD here
```

You can insert an image by pasting its path between the parentheses (edit the markdown code) :

![ERD of the database]( )


### Store the data in a database

2. Write a function `add_table()` that takes

* a connector to a (new) database called `orders-exercise.sqlite`
* the path to a `.csv` file containing the data to store in the added table
* the name of the added table

as arguments and adds (creates) a new table in the database referenced by the connector (you can use Pandas here if you think it makes your life easier).

Then add the tables and data from the 5 `.csv` files.

**Optional** – if you want to automate the file loading, you can get the files list with these lines :
```python
import os

csv_path = '<write_here_the_path_where_csv_are_stored>'
# the best way to write a path is to use os.path.join('folder', 'subfolder1', …)
file_names = [f for f in os.listdir(csv_path) if f.endswith('.csv')]
file_names
```

You then just have to iterate through the list of `file_names` and call your function for each file.

In [1]:
# your code here

import pandas as pd
import sqlite3

conn = sqlite3.connect('data/orders-exercise.sqlite')
c = conn.cursor()


In [2]:
def add_table(connector: sqlite3.Connection, path: str, table_name: str) -> None:
    """Load the CSV file at `path` into table `table_name`, replacing it if it exists."""
    df = pd.read_csv(path)
    df.to_sql(table_name, connector, if_exists='replace', index=False)
    

In [3]:
import os

csv_path = os.path.join('.', 'data', 'exercise-orders')
file_names = [f for f in os.listdir(csv_path) if f.endswith('.csv')]
for f in file_names:
    # table name = file name without its '.csv' extension
    add_table(conn, os.path.join(csv_path, f), os.path.splitext(f)[0])

## Check that everything is ok

3. Write a script that will give us :

* the list of tables in the database
* the list of columns for each table
* the first rows of each table

Format the output in a structured and readable way. For example :

```
----- table items columns -----

id
product
volume
weight

 table items first lines :

(1, 'Table', 40398, '27')
(2, 'Chaise', 23829, '7')
(3, 'Ordinateur', 2912, '3')
(4, 'Lampe', 2014, '0,3')
(5, 'Canapé', 43852, '14')

----- table orders_items columns -----

id
order_id

[… etc. …]
```

The idea is to automate (script) as much as possible. Data scientists build their own tools ! You may reuse or adapt that script next time you have to deal with databases :)

When you write a complex program, try to write each functionality one after the other, step by step. « Little strokes fell great oaks ».

In [4]:
# first, the well-known function to query the database

def exe(cursor: sqlite3.Cursor, query: str) -> None:
    """Execute `query` and print every returned row."""
    cursor.execute(query)
    for row in cursor.fetchall():
        print(row)

In [5]:
list_table = '''
PRAGMA table_list;
'''

exe(c, list_table)

('main', 'items', 'table', 4, 0, 0)
('main', 'orders_items', 'table', 3, 0, 0)
('main', 'orders', 'table', 7, 0, 0)
('main', 'shippers', 'table', 3, 0, 0)
('main', 'customers', 'table', 5, 0, 0)
('main', 'sqlite_schema', 'table', 5, 0, 0)
('temp', 'sqlite_temp_schema', 'table', 5, 0, 0)


In [20]:
# list all columns of all tables

# first, get table names
c.execute(list_table)
for table_row in c.fetchall():
    table = table_row[1] # to improve readability

    # then get column names
    column_list = 'PRAGMA table_info(' + table + ')'
    print('\n----- table ' + table + ' columns -----\n')
    c.execute(column_list)
    
    # print column names (distinct loop variable, so the outer one is not shadowed)
    for column_row in c.fetchall():
        column = column_row[1] # to improve readability
        print(column)
    
    # finally print first lines of each table   
    print('\n table ' + table + ' first lines :\n')
    select_all = 'SELECT * FROM ' + table + ' LIMIT 5'
    exe(c, select_all)



----- table items columns -----

id
product
volume
weight

 table items first lines :

(1, 'Table', 40398, '27')
(2, 'Chaise', 23829, '7')
(3, 'Ordinateur', 2912, '3')
(4, 'Lampe', 2014, '0,3')
(5, 'Canapé', 43852, '14')

----- table orders_items columns -----

id
order_id
item_id

 table orders_items first lines :

(1, 80, 3)
(2, 21, 35)
(3, 82, 12)
(4, 84, 32)
(5, 58, 40)

----- table orders columns -----

id
customer_id
order_date
order_status
expedition_date
reception_date
shipping_company

 table orders first lines :

(1, 2, '2023-08-14', 'Annulé', '2023-08-15', '2023-08-24', 2)
(2, 6, '2023-06-15', 'Livré', '2023-06-24', '2023-06-30', 1)
(3, 7, '2023-04-08', 'Expédié', '2023-04-17', '2023-04-21', 2)
(4, 6, '2023-02-04', 'Expédié', '2023-02-08', '2023-02-16', 4)
(5, 20, '2023-03-23', 'Annulé', '2023-03-28', '2023-03-30', 3)

----- table shippers columns -----

id
company_name
price

 table shippers first lines :

(1, 'DHL', 0.77)
(2, 'UPS', 3.7)
(3, 'FedEx', 0.81)
(4, 'Colissimo'

## Analyses

As a data scientist you have to answer this question : is there a problem with delivery, and more specifically with some delivery companies (shippers) ?

4. Do some simple initial analyses :

* how many orders, items, customers, shippers… ?
* are there customers who never ordered, or never received their order ? items that were never bought ?
* check if there are errors. What would inconsistent or missing data look like ? How to deal with them ? Apply a correction or delete such data (you choose !)

5. Simple statistics :

* average number of items in an order ?
* average number of orders per customer ?
* what are the most frequently sold items ?
* average weight of an order ?
* compare the number of deliveries made by each shipper
* percentage of canceled orders ? of the other statuses ?

6. Build more complex indicators :

* number of days between expedition and delivery ? (the result can be stored in a column)
* what are the slowest shippers ? the fastest ? the average for each shipper ?
* add a column to the shippers table : speed of delivery (speed = distance / time, by the way…).
* add another column to the shippers table : price / kg of shipping (total cost = weight x price)
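As a starting hint for the date arithmetic, here is a minimal sketch using SQLite's `julianday()` function on an in-memory table. It is not a full solution : the table only reproduces three of the `orders` rows printed earlier, restricted to the columns needed, and it ignores order status.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        expedition_date TEXT,
        reception_date TEXT,
        shipping_company INTEGER
    )
""")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?, ?)", [
    (1, '2023-08-15', '2023-08-24', 2),
    (2, '2023-06-24', '2023-06-30', 1),
    (3, '2023-04-17', '2023-04-21', 2),
])

# julianday() turns an ISO date into a day count, so a difference is a number of days
query = """
SELECT shipping_company,
       AVG(julianday(reception_date) - julianday(expedition_date)) AS avg_days
FROM orders
GROUP BY shipping_company
ORDER BY shipping_company
"""
averages = conn.execute(query).fetchall()
print(averages)
```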

What is the best shipping company ? The fastest ? The cheapest ? The best compromise ?

Disclaimer : these data were randomly generated. I don’t know what the answer is : maybe there is something to find out, maybe not. Only your analysis will allow you to decide.