# SQL Basics

## Intro

### Me

* [Jean Delpech](https://github.com/Jehadel), PhD (Cognitive science with a touch of social psychology)
* Data Scientist ([Adalab collective](https://adalab.fr/))
* [Data for Good](https://dataforgood.fr/) treasurer (+ Provence chapter)

### Data Analysis / Data Science

This course covers two topics :

* SQL (Structured Query Language) : no data-oriented project can succeed without high-quality data. Data sourcing can be the hardest part of such projects. Many datasets are stored in databases (public/private, open/proprietary, from governments, NGOs, institutions, corporations, etc.). Those databases are queried and managed (creation, update…) using the SQL language. Any data scientist must be familiar with SQL.
* EDA (Exploratory Data Analysis) : data exploration, cleansing and data visualization. It is often the most time-consuming stage of a data science project, but it is necessary to improve the understanding and the quality of the dataset. Python provides powerful tools for this, such as the `numpy`, `pandas` (`geopandas`), `matplotlib`, `seaborn` and `plotly` libraries. The latter, in association with the `dash` library, can produce high-end interactive dashboards.

A few examples of what we can do with just these tools :
* [Carbon Bombs](https://www.carbonbombs.org/)
* [Taxplorer](https://www.taxplorer.eu/) with the [EU Tax Observatory](https://www.taxobservatory.eu/) hosted by the Paris School of Economics (some machine learning and AI have been used to facilitate data sourcing).
* [Our world in data](https://ourworldindata.org/)

### Resources

* [Free Code Camp](https://www.freecodecamp.org/learn/relational-database/) : learn PostgreSQL + bash + git and other tools. The best resource for hands-on learning !
* [W3Schools](https://www.w3schools.com/sql/default.asp) : learn and test SQL queries online
* [Learn SQL in y minutes](https://learnxinyminutes.com/sql/) : a great site to learn the essentials of a lot of languages
* [Data Camp SQL basics cheatsheet](https://media.datacamp.com/legacy/image/upload/v1714149594/Marketing/Blog/SQL_for_Data_Science.pdf) and [SQL joins cheatsheet](https://media.datacamp.com/legacy/v1714587799/Marketing/Blog/Joining_Data_in_SQL_2.pdf) (DataCamp’s cheatsheets are great, do not hesitate to look for their cheatsheets on other topics)
* Another efficient [cheatsheet](https://www.sqltutorial.org/wp-content/uploads/2016/04/SQL-cheat-sheet.pdf) (from [sqltutorial.org](https://www.sqltutorial.org/))
* [https://sql.sh/](https://sql.sh/) : courses and reference (in French)
* Tutorials and reference specifically written for SQLite, which we will use in this course : [sqlitetutorial.net](https://www.sqlitetutorial.net/).

### Python ?

First of all : are you fluent in Python ?

If you need a refresher : [Virgile Pesce’s Python beginner courses](https://github.com/virgilus/python-courses)

Python resources :

* [Learn Python in y minutes](https://learnxinyminutes.com/python/)
* [Datacamp Python Cheatsheet](https://media.datacamp.com/legacy/image/upload/v1694526244/Marketing/Blog/Python_Basics_Cheat_Sheet-updated.pdf), another cheatsheet [focused on Python for data science](https://media.datacamp.com/cms/python-cheat-sheet.pdf) and a cheatsheet about [importing data from different format](https://media.datacamp.com/legacy/image/upload/v1676302004/Marketing/Blog/Importing_Data_Cheat_Sheet.pdf)
* And don’t ignore [the official documentation, of course !](https://docs.python.org/3/) (select the version of Python you are using)
* The best way to learn to code is to… code, and to solve problems. You don’t have to learn all the syntax or all the methods of the Python standard library… you must acquire the method and the way of thinking and reasoning needed to write code. When you write code, first think about what data you have (type, structure…), what result you want (type…) and how to get it, step by step. Sometimes, pen and paper are the best tools to write code. Then, search the docs, Google, etc. to find out how to implement what you want to do, step by step.
* [codewars.com](https://www.codewars.com/) is a site (free access) where you solve increasingly complex problems with code, checked automatically. Smooth progression and gamification keep you motivated. Try to solve several problems a day (at the beginning, problems are easy and short). Coding is like learning a language : perseverance, regularity and getting out of your comfort zone are the keys. Code is hard, code is demanding, be ready to put in a lot of time and personal commitment if you want to master it.
* [exercism.org](https://exercism.org/) is another great (and free) site to practice with exercises and automatic correction. It offers tracks specially designed for beginners, organized by learning objective.

## Databases
### Definitions

To use data you have to :
* store it (securely)
* keep it up to date
* process and analyse
* connect it with other data

[Charles Bachman](https://fr.wikipedia.org/wiki/Charles_Bachman) (Turing Award laureate) was a pioneer who developed one of the first database management systems.

A database architecture can be decomposed into 3 **independent** levels :

* physical / hardware : data are physically stored (HDD, SSD, computer…)
* logical / organisation : data are connected and organized according to a certain model
* interaction / user : the end user accesses and processes the data

> « \[…] a database is an organized collection of data or a type of data store based on the use of a database management system (DBMS), the software that interacts with end users, applications, and the database itself to capture and analyze the data. » [Wikipedia](https://en.wikipedia.org/wiki/Database)

There are 4 basic operations that the user should be able to perform on data in persistent storage, known as CRUD operations :

* **C**reate
* **R**ead / **R**etrieve
* **U**pdate
* **D**elete / **D**estroy

The term was popularized by [James Martin](https://en.wikipedia.org/wiki/Database)
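
To make the four CRUD operations concrete, here is a minimal sketch using Python’s standard `sqlite3` module and SQL statements (both are presented later in this course). The `notes` table and its content are hypothetical and exist only for this illustration; the database lives in memory, so nothing is written to disk.

```python
import sqlite3

# In-memory database: nothing is written to disk (illustration only).
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Hypothetical table, used only for this demonstration.
cur.execute("CREATE TABLE notes (id INTEGER PRIMARY KEY, body TEXT)")

# Create: add a new record.
cur.execute("INSERT INTO notes (body) VALUES ('first note')")

# Read: retrieve the records.
cur.execute("SELECT id, body FROM notes")
after_create = cur.fetchall()

# Update: modify an existing record.
cur.execute("UPDATE notes SET body = 'edited note' WHERE id = 1")
cur.execute("SELECT body FROM notes WHERE id = 1")
after_update = cur.fetchone()

# Delete: remove the record.
cur.execute("DELETE FROM notes WHERE id = 1")
cur.execute("SELECT COUNT(*) FROM notes")
after_delete = cur.fetchone()

conn.close()
print(after_create, after_update, after_delete)
```

Each CRUD operation maps to one SQL keyword here : `INSERT`, `SELECT`, `UPDATE` and `DELETE`.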

### Relational model

Model invented by [Ted Codd](https://en.wikipedia.org/wiki/Edgar_F._Codd) (Turing Award laureate) while working at IBM in the 1970s.

It uses mathematical notions such as logical operators, relational algebra and set theory to build a relational model of data.

A few concepts :

* a **relation** or a **table** groups together a set of homogeneous data which relate to the same element. For example, we could have a table for students, one for teachers, one for the cursus (programs of study), another for buildings, etc. Splitting information into several homogeneous elements (tables) makes updating easier and more secure
* an **attribute** or **field** characterizes the relation. Fields are represented by the **columns** of a table. Each value has a type. For example, students have a Name (string), an Age (int), etc.
* a **record** is a **line** of a table, a **tuple** of the values taken by each attribute of the table. Each record is unique and identified by a **primary key**, an attribute which takes a different value for each record. Choosing a primary key is mandatory. Sometimes several attributes meet the uniqueness condition : they are called **candidate keys**, and one (and only one) of them is chosen as the primary key.
* **primary keys** are used to define relationships between tables. In a table, a record is identified by a unique primary key value. This value can be stored in another table to refer to this record. In that other table, the attribute holding it is called a **foreign key**.

An example will make this clearer. Let’s have two tables : the table Students and the table Cursus.

* a student (a record) is defined by 7 attributes : *INE* (« identifiant national étudiant », the French national student identifier), *Firstname, Lastname, Birthdate, Birthplace, Gender, Years* (number of years in the institution)

Students
| INE     | Firstname | Lastname | Birthdate  | Birthplace | Gender | Years |
|---------|-----------|----------|------------|------------|--------|-------|
| 1196186 | John      | Doe      | 12/12/2005 | Marseille  | M      | 6     |
| 2238957 | Jane      | Doe      | 01/03/2007 | Aix        | F      | 1     |

What would be the primary key here ?
What are the types of the attributes ?

* a cursus (a record) is defined by 5 attributes : *Code, Level, Major, Minor, Referent*

| Code | Level | Major        | Minor   | Referent  |
|------|-------|--------------|---------|-----------|
| L1DS | 1     | Computer Sc. | Math    | Pr. Smith |
| L2EF | 2     | Management   | Finance | Pr. Jones |
| M2ME | 5     | Macroeconomy | Math    | Pr. Brown |

What would be the primary key here ?

A reason to link the tables would be that each student follows a cursus. We don’t have to copy all the information from the *Cursus* table to the *Students* table. We just need to know the id of the cursus followed by a student, and query the *Cursus* table with this key, to get the *Referent* of a student for example.

Is Code a primary or a foreign key ?

Students
| INE     | Firstname | Lastname | Birthdate  | Birthplace | Gender | Years | Cursus_id |
|---------|-----------|----------|------------|------------|--------|-------|-----------|
| 1196186 | John      | Doe      | 12/12/2005 | Marseille  | M      | 6     | M2ME      |
| 2238957 | Jane      | Doe      | 01/03/2007 | Aix        | F      | 1     | L1DS      |

Note : we don’t have to keep the same name when a primary key becomes a foreign key

It is good practice, for simplicity and to make updates easier, to use an int field as primary key (even if a string field could be a candidate). Such a key, if it replaces another field, is called a **surrogate key**.

In some cases, a key (a primary key, and the foreign keys that reference it) can be defined as a combination of several fields of a table, if this combination is unique. It is then called a **composite key**.
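
The link between the two tables can be sketched with Python’s standard `sqlite3` module. This is only an illustration under a few assumptions : the column types are my own choice, only some of the attributes are kept, and the third student (`Ann Lee`, cursus `XXXX`) is hypothetical, added to show what happens when a foreign key points to a record that does not exist.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite does not enforce foreign keys unless asked to.
conn.execute("PRAGMA foreign_keys = ON")

conn.executescript("""
CREATE TABLE Cursus (
    Code     TEXT PRIMARY KEY,
    Level    INTEGER,
    Major    TEXT,
    Minor    TEXT,
    Referent TEXT
);
CREATE TABLE Students (
    INE       INTEGER PRIMARY KEY,
    Firstname TEXT,
    Lastname  TEXT,
    Cursus_id TEXT REFERENCES Cursus(Code)  -- foreign key
);
INSERT INTO Cursus VALUES
    ('L1DS', 1, 'Computer Sc.', 'Math', 'Pr. Smith'),
    ('M2ME', 5, 'Macroeconomy', 'Math', 'Pr. Brown');
INSERT INTO Students VALUES
    (1196186, 'John', 'Doe', 'M2ME'),
    (2238957, 'Jane', 'Doe', 'L1DS');
""")

# Step 1: read the foreign key stored in the student's record.
code = conn.execute(
    "SELECT Cursus_id FROM Students WHERE INE = 1196186"
).fetchone()[0]

# Step 2: use it to look up the matching record in Cursus.
referent = conn.execute(
    "SELECT Referent FROM Cursus WHERE Code = ?", (code,)
).fetchone()[0]

# A foreign key that matches no Cursus.Code is rejected (hypothetical student).
try:
    conn.execute("INSERT INTO Students VALUES (3, 'Ann', 'Lee', 'XXXX')")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

conn.close()
print(code, referent, rejected)
```

The *Referent* is stored only once, in *Cursus* : if it changes, a single record has to be updated, whatever the number of students.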

### Relationships - ERD

The *[Entity Relationship Diagram](https://www.visual-paradigm.com/guide/data-modeling/what-is-entity-relationship-diagram/)* (ERD) delivers information about the organisation (schema) of the database : tables, attributes and their types, primary keys, foreign keys, relationships between tables.

When a database consists of many tables, it can be valuable to draw its ERD. We can use [many apps](https://dbmstools.com/categories/database-design-tools) with a graphical interface (like [https://sql.toad.cz/](https://sql.toad.cz/), the online demo version of [WWW SQL designer](https://github.com/ondras/wwwsqldesigner)), but here I recommend [dbdiagram.io](https://dbdiagram.io/) : you can generate figures by writing very simple and clear code (DBML), which is perfect for thinking about data organization (tables, attributes, types, relationships) and for getting familiar with code. The [documentation](https://dbml.dbdiagram.io/docs) can be useful.

For example, the ERD of a CRM : 

![ERD example](./images/ERD-example.png)

```
// write a comment by starting it with //

// declaration of a first table
Table <table1 name> {
    <attribute1 name> <type> [primary key]
    <attribute2 name> <type>
    <attribute3 name> <type>
    …
}

// another table declaration
Table <table2 name> {
    <attribute1 name> <type> [primary key]
    <attribute2 name> <type> [note: "you can even add notes"]
    <attribute3 name> <type>
    …
}

// relationship declaration
Ref: <table1 name>.<attribute2 name> < <table2 name>.<attribute1 name> // foreign key declaration. What is the foreign key ?
// you can also declare a relationship inside a table (see example below)
```

We can refer to an attribute belonging to a particular table by using the dot `.` operator : `<table name>.<attribute name>`. This notation allows us to use the same attribute name in different tables. It avoids confusion and is more readable.

Here is the code that produced the CRM ERD :

```
Table Structures {
	nom tinytext [primary key]
	domaine uniquechoice [note: 'santé / environnement']
	type uniquechoice [note: 'labo privé, hôpital public, etc.']
	description_activite text
	tags multiplechoice [note: 'oncologie, imagerie, biodiversité, etc.']
	priorite uniquechoice [note: 'haute / moyenne / basse']
	voie tinytext
	code_postal tinytext
	ville tinytext
	tel_generique tinytext
	mail_generique mailto
	site_web url
	statut uniquechoice [note: 'A contacter/contacté/a relancer/ok/abandon']
	contact_lists multiplechoice [ref: < Contacts.id]
	membres_Adalab multiplechoice [ref: < Membres.id]
	opportunites multiplechoice [ref: < Opportunites.id]
	date_MaJ timestamp
}

Table Contacts {
	id tinytext [primary key]
	nom tinytext
	prenom tinytext
	statut uniquechoice [note: 'a contacter / contacté / a relancer / ok / abandon']
	telephone tinytext
	email mailto
	poste tinytext
	linkedIn url
	structure uniquechoice [ref: - Structures.nom]
}

Table Membres {
	id tinytext [primary key]
	nom tinytext
	prenom tinytext
	disponibilite uniquechoice [note: 'totale / partielle / aucune']
	competences multiplechoices [note: 'Python / SQL / BI / ML / etc.']
	siret tinytext
	linkedIn url
	gitHub url
	email mailto
	cv document
	presentation text
	experience uniquechoice [note: 'junior / senior']
	nogo multiplechoices [note: 'enseignement / BI / etc.']
	aRejoint timestamp
	aQuitte timestamp
}

Table Echanges {
	id tinytext [primary key]
	date timestamp
	notes text
	decision tinytext
	type uniquechoice [note: 'tél / visio / mail / réunion / rencontre']
	contacts multiplechoices [ref: < Contacts.id]
	structure uniquechoice [ref: - Structures.nom]
	membres multiplechoices [ref: < Membres.id]
	opportunites multiplechoices [ref: < Opportunites.id]
	missions multiplechoices [ref: < Missions.id]
}

Table Opportunites {
	id tinytext [primary key]
	description text
	validee checkbox
	toDo text
	structure uniquechoice [ref: - Structures.nom]
	contacts multiplechoices [ref: < Contacts.id]
}

Table Missions {
	id tinytext [primary key]
	description text
	statut uniquechoice [note: 'faisabilité / devis / en cours / facturée / payée']
	demarree_le timestamp
	achevee_le timestamp
	devis money
	facture money
	structure uniquechoice [ref: - Structures.nom]
}
```

We can model different types of relationships :

* **one-to-one** : each record in a Table A is associated with one and only one record in a Table B. A foreign key in table B references the primary key in table A. For example, each member has only one contact sheet (phone number, e-mail…)
  Interest : avoids overloading one table, distributes information to facilitate organization, data retrieval, updates…
* **one-to-many** : each record in a Table A can be associated with multiple records in a Table B. **Each record in Table B is associated with only one record in Table A** : the foreign key in table B references the primary key in table A, but several records can have the same value. Example : one structure may have several contacts but one contact belongs to only one structure, or one cursus may have several students but one student follows only one cursus (in our example)
  Interest : can represent hierarchical data structures, minimizes data duplication and makes data retrieval and updates easier…
* **many-to-one** : many records in table B can be associated with one record in table A. Logically equivalent to one-to-many.
* **many-to-many** : each record in Table A can be associated with multiple records in Table B, but this time each record in Table B can also be associated with multiple records in table A. For example, an interaction involves several contacts, but a contact may have (hope so !) several interactions. It needs a **junction** or **bridge table** to be managed efficiently.
* **self-referencing** : it may seem strange, but a table can have a primary key and a foreign key that refers to this primary key. The classic use case is a hierarchical relationship between the records of a table. For example, in a table Employees where some employees are managers, we can have a field *Managed_by* which refers to the manager’s *Id* :

Employees
| Id | Name     |  Role      | Managed_by |
|----|----------|------------|------------|
| 1  | Muhammad | Manager    | 6          | 
| 2  | Olivia   | Technician | 1          |
| 3  | Noah     | Engineer   | 5          |
| 4  | Amelia   | Technician | 1          |
| 5  | Olivia   | Manager    | 6          |
| 6  | Isla     | Director   | NULL       |

Here are the operators that define the different types of relationships in DBML : 
```
    <  // one-to-many
    >  // many-to-one
    -  // one-to-one
    <> // many-to-many
```
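
The self-referencing Employees table above can be queried by joining the table with itself. The sketch below assumes column types of my own choosing and uses the `JOIN` and `AS` keywords, which are presented later in this course; it only illustrates how *Managed_by* points back to *Id*.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Employees (
    Id         INTEGER PRIMARY KEY,
    Name       TEXT,
    Role       TEXT,
    Managed_by INTEGER REFERENCES Employees(Id)  -- points to the same table
);
INSERT INTO Employees VALUES
    (1, 'Muhammad', 'Manager',    6),
    (2, 'Olivia',   'Technician', 1),
    (3, 'Noah',     'Engineer',   5),
    (4, 'Amelia',   'Technician', 1),
    (5, 'Olivia',   'Manager',    6),
    (6, 'Isla',     'Director',   NULL);
""")

# The table is used twice, under two aliases:
# e = the employee, m = the manager of this employee.
pairs = conn.execute("""
SELECT e.Name, m.Name
FROM Employees AS e
JOIN Employees AS m ON e.Managed_by = m.Id
ORDER BY e.Id;
""").fetchall()

conn.close()
print(pairs)
```

Isla does not appear in the first column : her *Managed_by* value is `NULL`, so no manager record matches.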

When relationships are defined, we can use algebraic operators like union, intersection, difference, projection, selection, join… to process the data. The database organization, relationships and operations are implemented by the DBMS, along with storage and access.
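
As an illustration of the junction (bridge) table needed by a many-to-many relationship, here is a minimal sketch with Python’s `sqlite3` module. All table names, column names and values (`Contacts`, `Interactions`, `Interaction_Contacts`, Ana, Ben…) are hypothetical and loosely inspired by the CRM example; they are not the actual CRM schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Contacts     (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE Interactions (id INTEGER PRIMARY KEY, kind TEXT);

-- Junction table: one record per (interaction, contact) pair.
-- Its primary key is the combination of the two foreign keys.
CREATE TABLE Interaction_Contacts (
    interaction_id INTEGER REFERENCES Interactions(id),
    contact_id     INTEGER REFERENCES Contacts(id),
    PRIMARY KEY (interaction_id, contact_id)
);

INSERT INTO Contacts VALUES (1, 'Ana'), (2, 'Ben');
INSERT INTO Interactions VALUES (10, 'mail'), (11, 'meeting');
INSERT INTO Interaction_Contacts VALUES (10, 1), (11, 1), (11, 2);
""")

# One interaction, several contacts.
meeting_contacts = conn.execute("""
SELECT c.name
FROM Interaction_Contacts AS ic
JOIN Contacts AS c ON c.id = ic.contact_id
WHERE ic.interaction_id = 11
ORDER BY c.name;
""").fetchall()

# One contact, several interactions.
ana_interactions = conn.execute("""
SELECT COUNT(*)
FROM Interaction_Contacts
WHERE contact_id = 1;
""").fetchone()[0]

conn.close()
print(meeting_contacts, ana_interactions)
```

The many-to-many relationship is thus split into two one-to-many relationships, each going through the junction table.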

### DBMS

**D**ata**B**ase **M**anagement **S**ystem. It can be a software component like `SQLite` : a library, with an API, that can be directly integrated in a program. Other systems, the most robust ones, are standalone applications with a more complex (client/server) architecture. That’s the case of popular open source solutions like MySQL/MariaDB or PostgreSQL, or proprietary solutions like Oracle, Microsoft SQL Server, IBM DB2, etc. :

![DBMS schema](./images/DBMS.png)

This architecture makes it possible to manage security and concurrent access : several users can query the database with different access levels. The DBMS guarantees data persistence, backups and efficient access.

Users access databases in different ways. Ordinary users don’t know how to write a query; they usually use a form or a GUI to send or receive data. DBAs (database administrators) and developers can use an API or a dedicated language to interact with the database : SQL.

### SQL

Building on Ted Codd’s relational model, Donald Chamberlin and Raymond Boyce designed SEQUEL (Structured English Query Language) at IBM in the 1970s. Its successor, SQL, was standardized for the first time in 1986 (read this [article](https://learnsql.com/blog/history-of-sql-standards/) if you are interested in SQL history), because each DBMS implemented its own version of SQL, which led to incompatibilities (some variations between implementations remain today). SQL has several advantages :

* it is a well-defined standard (current : SQL:2023)
* this means that standard SQL queries work largely regardless of how the DBMS implements data management
* SQL can be interfaced with other languages (C/C++, C#, Cobol, PHP, Java, Python…)
* the technology is mature, therefore reliable and efficient


## SQLite

We could learn to manage and query databases with interactive tools like [dbeaver](https://dbeaver.io/), but the aim of this lecture is to teach scripting in Python to do the job. By doing so, you can automate and scale operations, and build a *pipeline* where the data will then be processed by other scripts to perform analyses or generate dataviz. GUI tools remain useful to explore databases.

We won’t learn to use a DBMS like MySQL or PostgreSQL : it adds a layer of complexity and we are not training you to be a DBA. Considering our goal of learning the basics of SQL, SQLite is thus the best choice.

SQLite is not as powerful as a DBMS built on a client/server architecture (MySQL, PostgreSQL, etc.) and can’t be used in *certain* (non-local) production contexts : it stores the database in a single file, which is a limitation in certain situations. The one-file architecture makes access by multiple clients very complicated (a client reading can block other clients from writing), the processing load can’t be distributed between multiple devices… SQLite is not made to implement the client/server model. Nevertheless, it is the most widely used DBMS, as it is tailored to manage local databases. SQLite is lightweight (<600 KB), which is why it is very popular in embedded development. iOS and Android use SQLite as an embedded database. Web browsers use SQLite to manage bookmarks, local storage, etc. Adobe PDF reader, internet boxes, trading bots and embedded software in cars use SQLite. SQLite is certified for use in avionics and aerospace. It was originally developed in 2000 by [Richard Hipp](https://www.hwaci.com/drh/) to be used… on board guided-missile destroyers.

* SQLite is lightweight (600 KB), perfect for embedded systems
* SQLite is heavily tested, it is probably the most robust db engine
* backward compatibility of the SQLite file format is guaranteed until 2050 (it is considered a safe format for very long-term data storage by the US government)
* public domain
* local use, not a client/server architecture, does not manage user access

SQLite is a software component which can be used as a minimalist DBMS and is perfect to learn how to write SQL queries. If you want to interact with db engines built upon a client/server architecture, you should turn toward `sqlalchemy` for example. With SQLite you don’t need to add any third-party component, which is great for a course aimed at non-experts in computer science. It is perfect for learning purposes, prototyping or development of local or embedded applications. In this course we will use the `sqlite3` library, which belongs to the standard distribution of Python.

![Different architectures for apps with database (illustration)](./images/App-Database-Architectures.png)

Some resources :
* [SQLite site](https://sqlite.org/)
* If you want to train online and focus on SQL queries only (you’ll have to upload an existing database) : [SQLite online](https://sqliteonline.com/)
* [sqlitetutorial.net](https://sqlitetutorial.net), already cited.

If you want to learn a client/server DBMS, I recommend following, in addition to the present course, this online course on [Free code camp](https://www.freecodecamp.org/learn/relational-database/) (already cited in the resources section).


### Import a database

To begin with, we will import an existing database. We will see later how to create a database from scratch; we will avoid boilerplate for the moment.

First of all, create a folder `data`, download the dataset/database [European soccer](https://www.kaggle.com/datasets/hugomathien/soccer) (you will be asked to register) and extract it there. If the filename is just `database.sqlite`, please rename it `european-soccer.sqlite` to avoid any future confusion with other files with such a generic name.

Then, you can either execute the cell of this notebook if you downloaded it, or copy the code in a python module, let’s call it `sql-intro.py`


In [5]:
# import sqlite3 to use it
import sqlite3

In [6]:
# open a connection to the database
conn = sqlite3.connect('data/european-soccer.sqlite')
# create a cursor to run queries and read their results
c = conn.cursor()

That’s it ! Our database is loaded !

Note : when we are done with our database, don’t forget to close it properly :

In [7]:
c.close()
conn.close()
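
To avoid forgetting these calls, one possible pattern (a sketch, not the only way to do it) relies on `contextlib.closing` from the standard library, which calls `.close()` automatically at the end of the `with` block. The in-memory database `":memory:"` is used here only to keep the example self-contained. Note that `with sqlite3.connect(...) as conn:` alone manages transactions (commit/rollback) but does not close the connection.

```python
import sqlite3
from contextlib import closing

# closing(...) guarantees that .close() is called when the block ends,
# even if an exception is raised inside it.
with closing(sqlite3.connect(":memory:")) as conn:
    with closing(conn.cursor()) as cur:
        cur.execute("SELECT 1 + 1")
        result = cur.fetchone()

# After the block the connection is closed: using it raises an error.
try:
    conn.execute("SELECT 1")
    still_open = True
except sqlite3.ProgrammingError:
    still_open = False

print(result, still_open)
```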

### First request : list table lines

What’s in our database ? Let’s print some records. On the metadata page on Kaggle we read that the first table in the list is called `Country` and has 11 records for 2 columns. 

In [3]:
c.execute("SELECT * FROM Country;")
rows = c.fetchall()
rows

[(1, 'Belgium'),
 (1729, 'England'),
 (4769, 'France'),
 (7809, 'Germany'),
 (10257, 'Italy'),
 (13274, 'Netherlands'),
 (15722, 'Poland'),
 (17642, 'Portugal'),
 (19694, 'Scotland'),
 (21518, 'Spain'),
 (24558, 'Switzerland')]

Which one is the primary key ?

`*` is a wildcard (a "joker") which, by convention, means "all columns"

`;` marks the end of a query. It is not mandatory in most DBMS when a single query is sent.
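
If the Kaggle metadata page is not at hand, SQLite itself can describe a database. The sketch below runs on a hypothetical in-memory table `Demo` (not part of the soccer database) : the internal table `sqlite_master` lists the tables, and `PRAGMA table_info` describes the columns of one table. Both are specific to SQLite, not standard SQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table, created only for this demonstration.
conn.execute("CREATE TABLE Demo (id INTEGER PRIMARY KEY, label TEXT)")

# Names of all the tables stored in the database.
tables = conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table';"
).fetchall()

# One row per column: (cid, name, type, notnull, dflt_value, pk).
# The last value is non-zero for the primary key column(s).
columns = conn.execute("PRAGMA table_info(Demo);").fetchall()

conn.close()
print(tables)
print(columns)
```

The same two queries can be run on the cursor of the soccer database, replacing `Demo` with the name of a table you want to inspect.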

The structure of the query is :
```SQL
SELECT <columns>
FROM <table_name>
```

Note the following structure; from now on we will focus on the queries.

```python
import sqlite3

# for readability, write queries in a triple-quoted string
query = '''
YOUR SQL QUERY;
'''

conn = sqlite3.connect('data/your_database.sqlite')
c = conn.cursor()
c.execute(query)
rows = c.fetchall()
print(rows)
# => list (rows) of tuples (columns)
```

You may create a function that takes a string (our query) and the cursor as arguments and returns the result of the query.
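
One possible version of such a function is sketched here. The name `run_query` and the small in-memory `Country` table are my own choices for the demonstration; with the course database you would pass the cursor `c` created earlier.

```python
import sqlite3


def run_query(query, cursor):
    """Execute `query` with `cursor` and return all rows as a list of tuples."""
    cursor.execute(query)
    return cursor.fetchall()


# Self-contained demonstration on an in-memory database.
conn = sqlite3.connect(":memory:")
c = conn.cursor()
c.execute("CREATE TABLE Country (id INTEGER PRIMARY KEY, name TEXT)")
c.execute("INSERT INTO Country VALUES (1, 'Belgium'), (1729, 'England')")

query = '''
SELECT *
FROM Country;
'''
rows = run_query(query, c)
conn.close()
print(rows)
```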

In [17]:
query = '''
SELECT *
FROM Country;
'''
# replace those lines by your function if you want
c.execute(query)
rows = c.fetchall()
rows


[(1, 'Belgium'),
 (1729, 'England'),
 (4769, 'France'),
 (7809, 'Germany'),
 (10257, 'Italy'),
 (13274, 'Netherlands'),
 (15722, 'Poland'),
 (17642, 'Portugal'),
 (19694, 'Scotland'),
 (21518, 'Spain'),
 (24558, 'Switzerland')]

To improve readability :
* write keywords in uppercase (to distinguish them from column or table names)
* write each instruction on a different line

We can define aliases with the keyword `AS` :

In [18]:
query = '''
SELECT m.id, m.date, m.season, m.goal
FROM Match AS m;
'''
# replace those lines by your function if you want
c.execute(query)
rows = c.fetchall()
rows

[(1, '2008-08-17 00:00:00', '2008/2009', None),
 (2, '2008-08-16 00:00:00', '2008/2009', None),
 (3, '2008-08-16 00:00:00', '2008/2009', None),
 (4, '2008-08-17 00:00:00', '2008/2009', None),
 (5, '2008-08-16 00:00:00', '2008/2009', None),
 (6, '2008-09-24 00:00:00', '2008/2009', None),
 (7, '2008-08-16 00:00:00', '2008/2009', None),
 (8, '2008-08-16 00:00:00', '2008/2009', None),
 (9, '2008-08-16 00:00:00', '2008/2009', None),
 (10, '2008-11-01 00:00:00', '2008/2009', None),
 (11, '2008-10-31 00:00:00', '2008/2009', None),
 (12, '2008-11-02 00:00:00', '2008/2009', None),
 (13, '2008-11-01 00:00:00', '2008/2009', None),
 (14, '2008-11-01 00:00:00', '2008/2009', None),
 (15, '2008-11-01 00:00:00', '2008/2009', None),
 (16, '2008-11-01 00:00:00', '2008/2009', None),
 (17, '2008-11-01 00:00:00', '2008/2009', None),
 (18, '2008-11-02 00:00:00', '2008/2009', None),
 (19, '2008-11-08 00:00:00', '2008/2009', None),
 (20, '2008-11-08 00:00:00', '2008/2009', None),
 ...

Quite long, isn’t it ?
If we just want to print a few records to have a look at them, without being exhaustive, we can limit the number of lines with the keyword `LIMIT` : 

In [20]:
query = '''
SELECT m.id, m.date, m.season, m.goal
FROM Match AS m
LIMIT 10;
'''
# replace those lines by your function if you want
c.execute(query)
rows = c.fetchall()
rows

[(1, '2008-08-17 00:00:00', '2008/2009', None),
 (2, '2008-08-16 00:00:00', '2008/2009', None),
 (3, '2008-08-16 00:00:00', '2008/2009', None),
 (4, '2008-08-17 00:00:00', '2008/2009', None),
 (5, '2008-08-16 00:00:00', '2008/2009', None),
 (6, '2008-09-24 00:00:00', '2008/2009', None),
 (7, '2008-08-16 00:00:00', '2008/2009', None),
 (8, '2008-08-16 00:00:00', '2008/2009', None),
 (9, '2008-08-16 00:00:00', '2008/2009', None),
 (10, '2008-11-01 00:00:00', '2008/2009', None)]

If we want only one occurrence of each value of a field, we use the keyword `DISTINCT` :

In [40]:
query = '''
SELECT DISTINCT season
FROM Match;
'''
# replace those lines by your function if you want
c.execute(query)
rows = c.fetchall()
rows

[('2008/2009',),
 ('2009/2010',),
 ('2010/2011',),
 ('2011/2012',),
 ('2012/2013',),
 ('2013/2014',),
 ('2014/2015',),
 ('2015/2016',)]

By the way, how many records in the table `Match` ? We can count them with the function… `COUNT()` !

In [21]:
query = '''
SELECT COUNT(*)
FROM Match;
'''
# replace those lines by your function if you want
c.execute(query)
rows = c.fetchall()
rows

[(25979,)]

Other useful functions (try them !) :
```SQL
SUM()
AVG()
MAX()
MIN()
```
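
A quick way to try these functions is a tiny in-memory table. The `Scores` table and its values are hypothetical; the example also shows that these functions ignore `NULL` values.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Scores (player TEXT, points INTEGER);
INSERT INTO Scores VALUES ('a', 10), ('b', 20), ('c', NULL), ('d', 30);
""")

# NULL is ignored: the average is computed over 3 values, not 4.
stats = conn.execute("""
SELECT SUM(points), AVG(points), MAX(points), MIN(points)
FROM Scores;
""").fetchone()

conn.close()
print(stats)
```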

It seems that the `goal` column is pretty empty… `COUNT()` can help us determine how empty it is, since `COUNT(column)` only counts the non-null values of that column :

In [22]:
query = '''
SELECT COUNT(goal)
FROM Match;
'''
# replace those lines by your function if you want
c.execute(query)
rows = c.fetchall()
rows

[(14217,)]

We can even compute a percentage, as we can perform mathematical operations with `+`, `-`, `*`, `/` and the function `ROUND()` :

In [30]:
query = '''
SELECT ROUND(COUNT(goal)*100.0/COUNT(*), 2) AS goal_percent
FROM Match;
'''
# replace those lines by your function if you want
c.execute(query)
rows = c.fetchall()
rows

[(54.72,)]

**Very important !**
Did you notice that we wrote 100.0 ? That’s because datatypes are very important here. `100` is an `int` whereas `100.0` is a `float`.
This has repercussions on the result. Even the order of the operations is important.

Compare results obtained with :

`COUNT(goal)*100.0/COUNT(*)`

`COUNT(goal)*100/COUNT(*)`

`COUNT(goal)/COUNT(*)*100`

`COUNT(goal)/COUNT(*)*100.0`

To explain this behavior, track the type of each intermediary result.
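
The same four patterns can be tried on small literal numbers, where the intermediate results are easy to follow by hand. This is only a sketch : the values 1 and 4 are arbitrary and stand in for the two `COUNT()` results.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Same four patterns as above, with 1 in place of COUNT(goal)
# and 4 in place of COUNT(*).
expressions = ["1*100.0/4", "1*100/4", "1/4*100", "1/4*100.0"]

results = {}
for expr in expressions:
    # The expression is evaluated by SQLite, not by Python.
    value = conn.execute(f"SELECT {expr};").fetchone()[0]
    results[expr] = value
    print(f"{expr:12} -> {value!r} ({type(value).__name__})")

conn.close()
```

When both operands are integers, SQLite performs an integer division and drops the fractional part; as soon as one operand is a float, the result is a float.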

### Conditions

An important feature when processing data is the ability to select data using conditions. Conditions can be specified with the keyword `WHERE` :

In [33]:
query = '''
SELECT player_name
FROM Player
WHERE height < 165;
'''
# replace those lines by your function if you want
c.execute(query)
rows = c.fetchall()
rows

[('Anthony Deroin',),
 ('Bakari Kone',),
 ('Diego Buonanotte',),
 ('Edgar Salli',),
 ('Fouad Rachid',),
 ('Frederic Sammaritano',),
 ('Juan Quero',),
 ('Lorenzo Insigne',),
 ('Maxi Moralez',),
 ('Pablo Piatti',),
 ('Quentin Othon',),
 ('Samuel Asamoah',)]
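
When the threshold comes from a Python variable, `sqlite3` lets us write a `?` placeholder in the query and pass the value separately, rather than building the query string by hand. Here is a minimal sketch; the in-memory `Player` table and its three players are hypothetical stand-ins for the real table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Player (player_name TEXT, height REAL);
INSERT INTO Player VALUES
    ('Player A', 160.0),
    ('Player B', 180.3),
    ('Player C', 203.2);
""")

max_height = 165  # value coming from Python

query = '''
SELECT player_name
FROM Player
WHERE height < ?;
'''
# The second argument is a tuple holding one value per placeholder.
short_players = conn.execute(query, (max_height,)).fetchall()

conn.close()
print(short_players)
```

The placeholder mechanism also protects against malformed or malicious values, which matters as soon as the value is typed in by a user.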

We can also use the classical logical operators : `AND`, `OR`…

In [36]:
query = '''
SELECT player_name
FROM Player
WHERE height < 165 OR height > 200;
'''
# replace those lines by your function if you want
c.execute(query)
rows = c.fetchall()
rows

[('Abdoul Ba',),
 ('Anthony Deroin',),
 ('Asmir Begovic',),
 ('Bakari Kone',),
 ('Bogdan Milic',),
 ('Costel Pantilimon',),
 ('Daniel Burn',),
 ('Danny Wintjens',),
 ('Diego Buonanotte',),
 ('Edgar Salli',),
 ('Fejsal Mulic',),
 ('Fouad Rachid',),
 ('Fraser Forster',),
 ('Frederic Sammaritano',),
 ('Juan Quero',),
 ('Jurgen Wevers',),
 ('Kevin Vink',),
 ('Konrad Jalocha',),
 ('Kristof van Hout',),
 ('Lacina Traore',),
 ('Lorenzo Insigne',),
 ('Maxi Moralez',),
 ('Nikola Zigic',),
 ('Pablo Piatti',),
 ('Paolo Acerbis',),
 ('Peter Crouch',),
 ('Pietro Marino',),
 ('Quentin Othon',),
 ('Robert Jones',),
 ('Samuel Asamoah',),
 ('Stefan Maierhofer',),
 ('Vanja Milinkovic-Savic',),
 ('Wojciech Kaczmarek',),
 ('Zeljko Kalac',)]

The keyword `BETWEEN` simplifies the writing of conditions that select a range (bounds included) :

In [39]:
query = '''
SELECT player_name
FROM Player
WHERE height BETWEEN 200 AND 205;
'''
# replace those lines by your function if you want
c.execute(query)
rows = c.fetchall()
rows

[('Abdoul Ba',),
 ('Asmir Begovic',),
 ('Bogdan Milic',),
 ('Costel Pantilimon',),
 ('Daniel Burn',),
 ('Danny Wintjens',),
 ('Fejsal Mulic',),
 ('Fraser Forster',),
 ('Jurgen Wevers',),
 ('Kevin Vink',),
 ('Konrad Jalocha',),
 ('Lacina Traore',),
 ('Nikola Zigic',),
 ('Paolo Acerbis',),
 ('Peter Crouch',),
 ('Pietro Marino',),
 ('Robert Jones',),
 ('Stefan Maierhofer',),
 ('Vanja Milinkovic-Savic',),
 ('Wojciech Kaczmarek',),
 ('Zeljko Kalac',)]

We can also use `WHERE` to know the number of non-null (or null) values with the keywords `IS NULL` or `IS NOT NULL` :

In [31]:
query = '''
SELECT COUNT(*)
FROM Match
WHERE goal IS NOT NULL;
'''
# replace those lines by your function if you want
c.execute(query)
rows = c.fetchall()
rows

[(14217,)]

The `IN` keyword lets us test whether a value belongs to a list of values, in one instruction :

In [41]:
query = '''
SELECT COUNT(*)
FROM Match
WHERE stage IN (1, 10, 20, 30);
'''
# replace those lines by your function if you want
c.execute(query)
rows = c.fetchall()
rows

[(2948,)]

We can write complex conditions on text variables (string type) with the keyword `LIKE` :

In [43]:
query = '''
SELECT DISTINCT season
FROM Match
WHERE season LIKE '%2009%';
'''
# replace those lines by your function if you want
c.execute(query)
rows = c.fetchall()
rows

[('2008/2009',), ('2009/2010',)]

In [46]:
query = '''
SELECT DISTINCT season
FROM Match
WHERE season LIKE '201_%';
'''
# replace those lines by your function if you want
c.execute(query)
rows = c.fetchall()
rows

[('2010/2011',),
 ('2011/2012',),
 ('2012/2013',),
 ('2013/2014',),
 ('2014/2015',),
 ('2015/2016',)]

**Attention** : here the digits are characters and form a string !

* `LIKE '%2009%'` -> contains the text 2009
* `LIKE '%2009'` -> ends with the text 2009
* `LIKE '2009%'` -> begins with the text 2009
* `LIKE '201_%'` -> `_` stands for exactly one character, `%` for any sequence of characters (including none)
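
These patterns can be checked on a few values. The sketch below uses a hypothetical in-memory table `S` holding three season strings, and a small helper `matching` (my own name) that returns the seasons matching a pattern.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE S (season TEXT)")
conn.executemany(
    "INSERT INTO S VALUES (?)",
    [("2008/2009",), ("2009/2010",), ("2010/2011",)],
)


def matching(pattern):
    """Return the seasons that match the given LIKE pattern."""
    rows = conn.execute(
        "SELECT season FROM S WHERE season LIKE ? ORDER BY season;",
        (pattern,),
    )
    return [row[0] for row in rows]


contains_2009 = matching("%2009%")
ends_with_2009 = matching("%2009")
begins_with_2009 = matching("2009%")
tens = matching("201_%")
exact_length = matching("20__/2010")  # each _ matches exactly one character

print(contains_2009, ends_with_2009, begins_with_2009, tens, exact_length)
conn.close()
```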


### Group results, sort

In data analysis, there always comes a point where we want to compare groups or categories (countries, situations/conditions, periods…).

This is why the `GROUP BY` clause is so important in SQL:

In [48]:
query = '''
SELECT Season, COUNT(*)
FROM Match
GROUP BY Season;
'''
# replace those lines by your function if you want
c.execute(query)
rows = c.fetchall()
rows

[('2008/2009', 3326),
 ('2009/2010', 3230),
 ('2010/2011', 3260),
 ('2011/2012', 3220),
 ('2012/2013', 3260),
 ('2013/2014', 3032),
 ('2014/2015', 3325),
 ('2015/2016', 3326)]

What do the numbers in the second column represent?

Note: you have to `SELECT` the column you `GROUP BY` if you want to know which group each output number belongs to…

How does this work?

![Group by schema](./images/sql_groupby.png)
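
The mechanism can also be sketched in plain Python: rows are split into buckets by the value of the grouping column, then the aggregate function is applied to each bucket. The example below is an illustration on made-up rows (the table `demo_match` is hypothetical), comparing a hand-made count with the result of `GROUP BY`.

```python
import sqlite3
from collections import Counter

# Made-up rows: one tuple per match, holding only the season
rows = [
    ("2008/2009",), ("2008/2009",),
    ("2009/2010",), ("2009/2010",), ("2009/2010",),
    ("2010/2011",),
]

# SQL version: let the database group and count
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE demo_match (season TEXT)")
cur.executemany("INSERT INTO demo_match VALUES (?)", rows)
cur.execute(
    "SELECT season, COUNT(*) FROM demo_match GROUP BY season ORDER BY season"
)
sql_counts = cur.fetchall()
conn.close()

# Manual version: one bucket per season, count the rows in each bucket
manual = Counter(season for (season,) in rows)
manual_counts = sorted(manual.items())

print(sql_counts)     # [('2008/2009', 2), ('2009/2010', 3), ('2010/2011', 1)]
print(manual_counts)  # same list
```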


We may want to order the results by this number rather than by season:

In [59]:
query = '''
SELECT Season, COUNT(*) AS n_matches
FROM Match
GROUP BY Season
ORDER BY n_matches;
'''
# replace those lines by your function if you want
c.execute(query)
rows = c.fetchall()
rows

[('2013/2014', 3032),
 ('2011/2012', 3220),
 ('2009/2010', 3230),
 ('2010/2011', 3260),
 ('2012/2013', 3260),
 ('2014/2015', 3325),
 ('2008/2009', 3326),
 ('2015/2016', 3326)]

By default, `ORDER BY` sorts results in ascending order. Use the keyword `DESC` for descending order: `ORDER BY n_matches DESC`
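
A short sketch of both sort directions, on a hypothetical in-memory table (`demo_match`, invented for the illustration) with 2, 3 and 1 matches per season:

```python
import sqlite3

# Hypothetical in-memory table: one row per match
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE demo_match (season TEXT)")
cur.executemany(
    "INSERT INTO demo_match VALUES (?)",
    [("2008/2009",)] * 2 + [("2009/2010",)] * 3 + [("2010/2011",)] * 1,
)

base = "SELECT season, COUNT(*) AS n_matches FROM demo_match GROUP BY season "

# Default direction: ascending
cur.execute(base + "ORDER BY n_matches")
ascending = cur.fetchall()

# Explicit descending order
cur.execute(base + "ORDER BY n_matches DESC")
descending = cur.fetchall()

print(ascending)   # [('2010/2011', 1), ('2008/2009', 2), ('2009/2010', 3)]
print(descending)  # [('2009/2010', 3), ('2008/2009', 2), ('2010/2011', 1)]
conn.close()
```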

You can’t test conditions on aggregate functions such as `AVG()`, `SUM()` or `COUNT()` with `WHERE`, because `WHERE` filters individual rows before they are grouped. Those functions make more sense once rows are grouped into categories. To test conditions on their results, you have to use the keyword `HAVING`:

In [69]:
query = '''
SELECT Season, COUNT(*) AS n_matches
FROM Match
GROUP BY Season
HAVING n_matches < 3300
ORDER BY n_matches;
'''
# replace those lines by your function if you want
c.execute(query)
rows = c.fetchall()
rows

[('2013/2014', 3032),
 ('2011/2012', 3220),
 ('2009/2010', 3230),
 ('2010/2011', 3260),
 ('2012/2013', 3260)]
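
The difference between `WHERE` and `HAVING` can be sketched on a small made-up table (`demo_match` is a hypothetical name, not the course database): `WHERE` removes rows before the count, `HAVING` removes groups after the count, and SQLite rejects an aggregate function placed in `WHERE`.

```python
import sqlite3

# Hypothetical in-memory table: 3, 2 and 4 matches per season
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE demo_match (season TEXT, stage INTEGER)")
demo_rows = (
    [("2008/2009", s) for s in (1, 2, 3)]
    + [("2009/2010", s) for s in (1, 2)]
    + [("2010/2011", s) for s in (1, 2, 3, 4)]
)
cur.executemany("INSERT INTO demo_match VALUES (?, ?)", demo_rows)

# WHERE acts on rows: stage 1 is dropped before counting
cur.execute("""
SELECT season, COUNT(*) AS n_matches
FROM demo_match
WHERE stage > 1
GROUP BY season
ORDER BY season;
""")
where_result = cur.fetchall()

# HAVING acts on groups: seasons with 4 matches or more are dropped
cur.execute("""
SELECT season, COUNT(*) AS n_matches
FROM demo_match
GROUP BY season
HAVING COUNT(*) < 4
ORDER BY n_matches;
""")
having_result = cur.fetchall()

# An aggregate function inside WHERE is refused by SQLite
try:
    cur.execute("SELECT season FROM demo_match WHERE COUNT(*) < 4 GROUP BY season")
    aggregate_in_where_failed = False
except sqlite3.Error:
    aggregate_in_where_failed = True

print(where_result)   # [('2008/2009', 2), ('2009/2010', 1), ('2010/2011', 3)]
print(having_result)  # [('2009/2010', 2), ('2008/2009', 3)]
print(aggregate_in_where_failed)  # True
conn.close()
```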

### First look at `JOIN`

In [67]:
query = '''
SELECT Player.player_name, Player_Attributes.overall_rating
FROM Player_Attributes
JOIN Player ON Player_Attributes.id = Player.id
LIMIT 20;
'''
# replace those lines by your function if you want
c.execute(query)
rows = c.fetchall()
rows

[('Aaron Appindangoye', 67),
 ('Aaron Cresswell', 67),
 ('Aaron Doran', 62),
 ('Aaron Galindo', 61),
 ('Aaron Hughes', 61),
 ('Aaron Hunt', 74),
 ('Aaron Kuhl', 74),
 ('Aaron Lennon', 73),
 ('Aaron Lennox', 73),
 ('Aaron Meijers', 73),
 ('Aaron Mokoena', 73),
 ('Aaron Mooy', 74),
 ('Aaron Muirhead', 73),
 ('Aaron Niguez', 71),
 ('Aaron Ramsey', 71),
 ('Aaron Splaine', 71),
 ('Aaron Taylor-Sinclair', 70),
 ('Aaron Wilbraham', 70),
 ('Aatif Chahechouhe', 70),
 ('Abasse Ba', 70)]

Here, aliases can improve readability:

In [70]:
query = '''
SELECT p.player_name, pa.overall_rating
FROM Player_Attributes AS pa
JOIN Player AS p ON pa.id = p.id
LIMIT 20;
'''
# replace those lines by your function if you want
c.execute(query)
rows = c.fetchall()
rows

[('Aaron Appindangoye', 67),
 ('Aaron Cresswell', 67),
 ('Aaron Doran', 62),
 ('Aaron Galindo', 61),
 ('Aaron Hughes', 61),
 ('Aaron Hunt', 74),
 ('Aaron Kuhl', 74),
 ('Aaron Lennon', 73),
 ('Aaron Lennox', 73),
 ('Aaron Meijers', 73),
 ('Aaron Mokoena', 73),
 ('Aaron Mooy', 74),
 ('Aaron Muirhead', 73),
 ('Aaron Niguez', 71),
 ('Aaron Ramsey', 71),
 ('Aaron Splaine', 71),
 ('Aaron Taylor-Sinclair', 70),
 ('Aaron Wilbraham', 70),
 ('Aatif Chahechouhe', 70),
 ('Abasse Ba', 70)]
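
To see what a join does row by row, here is a minimal sketch on two made-up tables (`demo_player` and `demo_rating`, with hypothetical column names chosen for the illustration). Each rating row is matched with the player whose identifier it carries, so a player with several ratings appears several times, and a player with no rating does not appear at all. The result is only meaningful if the two columns in the `ON` condition identify the same entity in both tables, which is worth checking in the schema of any real database before joining.

```python
import sqlite3

# Two hypothetical in-memory tables, invented for this illustration
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE demo_player (id INTEGER, player_name TEXT)")
cur.execute("CREATE TABLE demo_rating (player_id INTEGER, overall_rating INTEGER)")
cur.executemany(
    "INSERT INTO demo_player VALUES (?, ?)",
    [(1, "Alice"), (2, "Bob"), (3, "Chloe")],
)
cur.executemany(
    "INSERT INTO demo_rating VALUES (?, ?)",
    [(1, 67), (1, 70), (2, 62)],  # Alice has two ratings, Chloe has none
)

# Match each rating with the player it belongs to
cur.execute("""
SELECT p.player_name, r.overall_rating
FROM demo_rating AS r
JOIN demo_player AS p ON r.player_id = p.id
ORDER BY p.player_name, r.overall_rating;
""")
joined = cur.fetchall()

print(joined)  # [('Alice', 67), ('Alice', 70), ('Bob', 62)]
conn.close()
```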

### Summary

Structure of a query (without join) : 

```SQL
SELECT <columns>
FROM <table_name>
WHERE <conditions>
GROUP BY <columns>
HAVING <conditions_with_functions>
ORDER BY <column_or_result> <DESC_or_ASC>
LIMIT <number>
```

The order of the clauses [is important](https://www.sisense.com/blog/sql-query-order-of-operations/)!
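
As an illustration, the sketch below runs one query that uses every clause of the template in the expected order, then a second one where `WHERE` is wrongly placed after `GROUP BY`. It relies on a hypothetical in-memory table (`demo_match`, made up for the example) rather than the course database.

```python
import sqlite3

# Hypothetical in-memory table: 3, 2 and 4 matches per season
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE demo_match (season TEXT, stage INTEGER)")
demo_rows = (
    [("2008/2009", s) for s in (1, 2, 3)]
    + [("2009/2010", s) for s in (1, 2)]
    + [("2010/2011", s) for s in (1, 2, 3, 4)]
)
cur.executemany("INSERT INTO demo_match VALUES (?, ?)", demo_rows)

# Every clause, written in the order given by the template
cur.execute("""
SELECT season, COUNT(*) AS n_matches
FROM demo_match
WHERE stage > 1
GROUP BY season
HAVING COUNT(*) >= 2
ORDER BY n_matches DESC
LIMIT 2;
""")
ordered_result = cur.fetchall()

# Same idea with WHERE after GROUP BY: SQLite refuses to parse it
try:
    cur.execute(
        "SELECT season, COUNT(*) FROM demo_match GROUP BY season WHERE stage > 1"
    )
    wrong_order_failed = False
except sqlite3.Error:
    wrong_order_failed = True

print(ordered_result)      # [('2010/2011', 3), ('2008/2009', 2)]
print(wrong_order_failed)  # True
conn.close()
```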

## Exercises

### 1. ERD

1. Create the ERD corresponding to the example given in the section *Relational model* (students/cursus)
2. Try to create the ERD corresponding to [this (real – and complex) dataset](https://www.kaggle.com/datasets/olistbr/brazilian-ecommerce) (no need to download, just read the metadata and context). Don’t spend too much time on this: if you have trouble understanding it, try at least to represent 3 tables (and don’t hesitate to ask for help!). You will finish at home.

### 2. SQL

We have seen a lot of keywords and instructions!

1. Just play around with those instructions. Create an empty file or an empty notebook and repeat the queries with some variations (try to avoid copy/paste) in order to understand what they do and how they do it, and to understand the dataset (you can draw an ERD if it helps you, which is recommended, but beware: one table contains a LOT of columns, so select the most significant ones).

2. Try to answer these questions with queries:
* How many matches were played in Belgium?
* How many matches were played in Belgium or France?
* What is the average weight of the 20 tallest players, and of the 20 shortest?
* What are the birthdates of the players named Adil?
* What is the average weight of the players named Sylvain?
* How many players have a name derived from Thomas (Tomas, Tomi, etc.)?
* How many matches were played in each country? In each league?
* Present the previous results in descending order of number of matches
* Who are the 10 players with the best ratings?
* For each league, how many matches were played? But this time, order your answer by country name.

## To be continued…

Next time we will see :

* different types of join
* how to get the table list of a database, and for each table the column list
* how to create a database
* how to load data from files (.csv) into a database
* sub-queries (`WITH`)
* some remaining keywords (`UNION`, etc.)
* conditional execution with `CASE WHEN THEN ELSE`