# Exercise solutions, courses S1 - S2

In [1]:
import pandas as pd
import sqlite3

In [2]:
c = sqlite3.connect('data/european-soccer.sqlite')
cursor = c.cursor()

In [3]:
def exe(cursor: sqlite3.Cursor, query: str) -> None:
    """Run a query and print every row of the result."""
    cursor.execute(query)
    for row in cursor.fetchall():
        print(row)

## 1. How many matches were played in Belgium?

Country names and matches are stored in two different tables, so we need a join to combine them. If we don't know how to write a join yet, we can look up the id for Belgium in the `Country` table by hand and select the matches with that `Country_id`.

In [5]:
countries = 'SELECT * FROM Country'
exe(cursor, countries)

(1, 'Belgium')
(1729, 'England')
(4769, 'France')
(7809, 'Germany')
(10257, 'Italy')
(13274, 'Netherlands')
(15722, 'Poland')
(17642, 'Portugal')
(19694, 'Scotland')
(21518, 'Spain')
(24558, 'Switzerland')


The country id for Belgium is 1:

In [6]:
matches_belgium = 'SELECT COUNT(*) FROM Match WHERE Country_id = 1'
exe(cursor, matches_belgium)

(1728,)
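
Rather than copying the id by hand into the query string, the value can be passed as a parameter. The following is a minimal sketch on an in-memory database: the rows are invented for the demo, and only the table and column names mirror the ones used above.

```python
import sqlite3

# In-memory stand-in for the real database; the rows are invented for the demo.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Country (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE Match (id INTEGER PRIMARY KEY, country_id INTEGER);
INSERT INTO Country VALUES (1, 'Belgium'), (1729, 'England');
INSERT INTO Match (country_id) VALUES (1), (1), (1729);
""")

# Look up the id by name, then reuse it through a `?` placeholder.
(belgium_id,) = conn.execute(
    "SELECT id FROM Country WHERE name = ?", ("Belgium",)
).fetchone()
(n_matches,) = conn.execute(
    "SELECT COUNT(*) FROM Match WHERE country_id = ?", (belgium_id,)
).fetchone()
print(belgium_id, n_matches)
```

Placeholders keep the SQL text constant and let `sqlite3` handle quoting of the value.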


Second method, if we know how to join tables:

In [7]:
matches_belgium_join = '''
SELECT 
    COUNT(*) 
FROM 
    Match AS m
JOIN 
    Country AS c 
    ON 
        m.Country_id = c.id 
WHERE 
    c.name = 'Belgium';
'''
exe(cursor,matches_belgium_join)

(1728,)


## 2. How many matches were played in Belgium or France?

We can use either method from question 1, with an `OR` condition:

In [9]:
matches_belgium_or_france = '''
SELECT 
    count(*) 
FROM 
    Match AS m
JOIN 
    Country AS c 
    ON 
        m.Country_id = c.id 
WHERE 
    c.name = 'Belgium' 
    OR 
        c.name = 'France';
'''
exe(cursor, matches_belgium_or_france)

(4768,)


## 3. What is the average weight of the 20 tallest players, and of the 20 shortest?

First, let's have a look at the `Player` table:

In [10]:
players = 'SELECT * FROM Player LIMIT 10'
exe(cursor, players)

(1, 505942, 'Aaron Appindangoye', 218353, '1992-02-29 00:00:00', 182.88, 187)
(2, 155782, 'Aaron Cresswell', 189615, '1989-12-15 00:00:00', 170.18, 146)
(3, 162549, 'Aaron Doran', 186170, '1991-05-13 00:00:00', 170.18, 163)
(4, 30572, 'Aaron Galindo', 140161, '1982-05-08 00:00:00', 182.88, 198)
(5, 23780, 'Aaron Hughes', 17725, '1979-11-08 00:00:00', 182.88, 154)
(6, 27316, 'Aaron Hunt', 158138, '1986-09-04 00:00:00', 182.88, 161)
(7, 564793, 'Aaron Kuhl', 221280, '1996-01-30 00:00:00', 172.72, 146)
(8, 30895, 'Aaron Lennon', 152747, '1987-04-16 00:00:00', 165.1, 139)
(9, 528212, 'Aaron Lennox', 206592, '1993-02-19 00:00:00', 190.5, 181)
(10, 101042, 'Aaron Meijers', 188621, '1987-10-28 00:00:00', 175.26, 170)


Then, get the weights of the 20 tallest players:

In [13]:
weights_20tallest = '''
SELECT 
    weight 
FROM 
    Player 
ORDER BY 
    height DESC
LIMIT 20 
'''
exe(cursor, weights_20tallest)

(243,)
(216,)
(212,)
(185,)
(212,)
(194,)
(192,)
(212,)
(190,)
(209,)
(216,)
(203,)
(209,)
(212,)
(183,)
(192,)
(168,)
(198,)
(187,)
(185,)


How do we calculate the average of those values?
A single flat `SELECT` will not do it in SQLite: `AVG()` is computed over every row matched by the query, and `LIMIT` is only applied afterwards to the one-row result, so the average would cover the whole table.
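
To make the order of evaluation concrete, here is a small sketch on an in-memory `Player` table with four invented rows. It compares a flat query with a nested one; the table contents are assumptions made up for the demo.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Player (height REAL, weight INTEGER)")
# Invented rows: (height, weight)
conn.executemany(
    "INSERT INTO Player VALUES (?, ?)",
    [(200, 220), (195, 210), (180, 170), (165, 140)],
)

# Flat query: AVG() runs over all four rows, LIMIT then trims the one-row result.
flat = conn.execute(
    "SELECT AVG(weight) FROM Player ORDER BY height DESC LIMIT 2"
).fetchone()[0]

# Nested query: LIMIT picks the two tallest first, AVG() sees only those rows.
nested = conn.execute(
    "SELECT AVG(weight) FROM "
    "(SELECT weight FROM Player ORDER BY height DESC LIMIT 2)"
).fetchone()[0]
print(flat, nested)
```

The flat query averages the whole table, while the nested one averages only the two tallest players.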

The first way around this limitation is to use Python: we already got the individual weights with a SQL query, so we can compute the average in Python.

In [20]:
cursor.execute(weights_20tallest)
weights = [row[0] for row in cursor.fetchall()]
mean = sum(weights)/len(weights)
mean

200.9

The second method is to use a subquery, written here as a common table expression with `WITH` (method learned in the S2 course).

In [23]:
mean_weights_20tallest = '''

WITH 
    tallest AS (
    SELECT 
        weight 
    FROM 
        Player 
    ORDER BY 
        height DESC
    LIMIT 20 
)

SELECT 
    AVG(weight) 
FROM 
    tallest;
'''

exe(cursor, mean_weights_20tallest)

(200.9,)


To get the average weight of the 20 shortest players, do the same, this time ordering the heights in ascending order:

In [25]:
mean_weights_20shortest = '''

WITH 
    shortest AS (
    SELECT 
        weight 
    FROM 
        Player 
    ORDER BY 
        height ASC
    LIMIT 20 
)

SELECT 
    AVG(weight) 
FROM 
    shortest;
'''

exe(cursor, mean_weights_20shortest)

(135.7,)


## 3. What are the birthdates of players named Adil?

This one is easy: filter the `TEXT` column with `LIKE`. If you know other string functions such as `SUBSTR()`, you can also format the output.

In [34]:
birthdates = '''
SELECT 
    player_name, 
    SUBSTR(birthday, 1, 10)
FROM 
    Player
WHERE 
    player_name 
    LIKE 
        'Adil %';
'''

exe(cursor, birthdates)

('Adil Auassar', '1986-10-06')
('Adil Chihi', '1988-02-21')
('Adil Hermach', '1986-06-27')
('Adil Rami', '1985-12-27')
('Adil Ramzi', '1977-07-14')


## 4. What is the average weight of players named Sylvain?
This time, the aggregate function can be combined directly with a `WHERE` clause:

In [35]:
average_weights_sylvain = '''
SELECT 
    AVG(weight)
FROM 
    Player
WHERE 
    player_name 
    LIKE 
        'Sylvain %';
'''

exe(cursor, average_weights_sylvain)

(174.66666666666666,)


## 5. How many players have a name derived from Thomas (Tomas, Tomi, etc.)?

First, let's explore what names starting with Tom… or Thom… look like:

In [48]:
tom = '''
SELECT 
    player_name
FROM 
    Player
WHERE 
    player_name LIKE 'Tom%' 
    OR 
        player_name LIKE 'Thom%';
'''

exe(cursor, tom)

('Thom Haye',)
('Thomas Agyepong',)
('Thomas Ayasse',)
('Thomas Bosmel',)
('Thomas Broich',)
('Thomas Bruns',)
('Thomas Buffel',)
('Thomas Carroll',)
('Thomas Chatelle',)
('Thomas Enevoldsen',)
('Thomas Fekete',)
('Thomas Foket',)
('Thomas Goddeeris',)
('Thomas Guerbert',)
('Thomas Guimaraes Azevedo',)
('Thomas Heurtaux',)
('Thomas Hitzlsperger',)
('Thomas Ince',)
('Thomas Kahlenberg',)
('Thomas Kaminski',)
('Thomas Kessler',)
('Thomas Kind Bendiksen',)
('Thomas Kleine',)
('Thomas Konrad',)
('Thomas Kotte',)
('Thomas Kraft',)
('Thomas Kristensen',)
('Thomas Lam',)
('Thomas Lemar',)
('Thomas Manfredini',)
('Thomas Mangani',)
('Thomas Matton',)
('Thomas Meunier',)
('Thomas Mueller',)
('Thomas Ouwejan',)
('Thomas Phibel',)
('Thomas Piermayr',)
('Thomas Pledl',)
('Thomas Reilly',)
('Thomas Reinmann',)
('Thomas Robson',)
('Thomas Rogne',)
('Thomas Scobbie',)
('Thomas Sorensen',)
('Thomas Toure',)
('Thomas Vermaelen',)
('Thomas Vincensini',)
('Thomas Welnicki',)
('Thomas Wils',)
('Thomas',)


Some names beginning with Tom… are obviously not derived from Thomas (Tomoaki, Tomane…). We should refine our test condition:

In [49]:
tom = '''
SELECT 
    COUNT(*)
FROM 
    Player
WHERE 
    (
    player_name LIKE 'Tom%' 
    OR 
        player_name LIKE 'Thom%'
    ) 
    AND 
        player_name NOT LIKE 'Tomane' 
    AND 
        player_name NOT LIKE 'Tomo%';
'''

exe(cursor, tom)

(133,)
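
The same filter can be tried on a handful of rows to check which names it keeps. This is only a sketch: the names below are invented for the demo, and it relies on SQLite's default behaviour where `LIKE` ignores case for ASCII letters.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Player (player_name TEXT)")
# Invented sample names, not taken from the real table.
conn.executemany(
    "INSERT INTO Player VALUES (?)",
    [("Thomas A",), ("Tomas B",), ("Tomane",), ("Tomoaki C",), ("tommy d",)],
)

query = """
SELECT player_name FROM Player
WHERE (player_name LIKE 'Tom%' OR player_name LIKE 'Thom%')
  AND player_name NOT LIKE 'Tomane'
  AND player_name NOT LIKE 'Tomo%'
ORDER BY player_name
"""
kept = [row[0] for row in conn.execute(query)]
print(kept)
```

`tommy d` is kept even though the pattern is written `'Tom%'`, which shows that the match does not depend on letter case.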


## 6. How many matches were played in each country? In each league?

This time we should use `GROUP BY`:

In [56]:
matches_by_countries = '''
SELECT 
    Country_id, 
    COUNT(*)
FROM 
    Match
GROUP BY 
    Country_id;
'''

exe(cursor, matches_by_countries)

(1, 1728)
(1729, 3040)
(4769, 3040)
(7809, 2448)
(10257, 3017)
(13274, 2448)
(15722, 1920)
(17642, 2052)
(19694, 1824)
(21518, 3040)
(24558, 1422)


If we know how to do a join, we can replace `Country_id` with the country name:

In [57]:
matches_by_countries = '''
SELECT 
    c.name, 
    COUNT(*)
FROM 
    Match AS m
JOIN 
    Country AS c 
    ON 
        m.Country_id = c.id 
GROUP BY 
    Country_id
ORDER BY 
    c.name;
'''
exe(cursor, matches_by_countries)

('Belgium', 1728)
('England', 3040)
('France', 3040)
('Germany', 2448)
('Italy', 3017)
('Netherlands', 2448)
('Poland', 1920)
('Portugal', 2052)
('Scotland', 1824)
('Spain', 3040)
('Switzerland', 1422)


Sort the previous results by descending number of matches:

In [59]:
matches_by_countries = '''
SELECT 
    c.name, 
    COUNT(*) AS n_matches
FROM 
    Match AS m
JOIN 
    Country AS c 
    ON 
        m.Country_id = c.id 
GROUP BY 
    Country_id
ORDER BY 
    n_matches DESC, 
    c.name;
'''
exe(cursor, matches_by_countries)

('England', 3040)
('France', 3040)
('Spain', 3040)
('Italy', 3017)
('Germany', 2448)
('Netherlands', 2448)
('Portugal', 2052)
('Poland', 1920)
('Scotland', 1824)
('Belgium', 1728)
('Switzerland', 1422)


Matches for each league:

In [66]:
matches_by_leagues = '''
SELECT 
    l.name, 
    COUNT(*) AS n_matches
FROM 
    Match AS m
JOIN 
    League AS l 
    ON 
        m.League_id = l.id 
GROUP BY 
    League_id
ORDER BY 
    n_matches DESC, 
    l.name;
'''
exe(cursor, matches_by_leagues)

('England Premier League', 3040)
('France Ligue 1', 3040)
('Spain LIGA BBVA', 3040)
('Italy Serie A', 3017)
('Germany 1. Bundesliga', 2448)
('Netherlands Eredivisie', 2448)
('Portugal Liga ZON Sagres', 2052)
('Poland Ekstraklasa', 1920)
('Scotland Premier League', 1824)
('Belgium Jupiler League', 1728)
('Switzerland Super League', 1422)


## 7. Who are the 10 players with the best ratings?

As usual, let's take a look at the content of the relevant table:

In [62]:
ratings = '''
SELECT 
    id, 
    player_fifa_api_id, 
    player_api_id, 
    overall_rating
FROM 
    Player_Attributes
LIMIT 10
'''
exe(cursor, ratings)

(1, 218353, 505942, 67)
(2, 218353, 505942, 67)
(3, 218353, 505942, 62)
(4, 218353, 505942, 61)
(5, 218353, 505942, 61)
(6, 189615, 155782, 74)
(7, 189615, 155782, 74)
(8, 189615, 155782, 73)
(9, 189615, 155782, 73)
(10, 189615, 155782, 73)


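It appears that a player has several ratings. Therefore, we should:
1. group the ratings by player
2. order the players by rating
3. limit the results to 10
4. optional: match player ids with player names

Note that the query below selects the bare `overall_rating` column alongside `GROUP BY`: SQLite accepts this and returns the rating from one arbitrary row of each group, not an average. For a well-defined ranking, wrap the column in an aggregate such as `AVG(overall_rating)` or `MAX(overall_rating)`.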

In [63]:
top_ratings = '''
SELECT 
    player_api_id, 
    overall_rating
FROM 
    Player_Attributes
GROUP BY 
    player_api_id
ORDER BY 
    overall_rating DESC
LIMIT 10;
'''
exe(cursor, top_ratings)

(30981, 94)
(30893, 93)
(40636, 90)
(27299, 90)
(19533, 90)
(35724, 89)
(30834, 89)
(107417, 88)
(93447, 88)
(80562, 88)
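
If the ranking should rest on each player's average rating, the column can be wrapped in `AVG()`. Here is a minimal sketch on an in-memory table with invented ratings; run against the real database, it would likely give a different top 10 than the output shown above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Player_Attributes (player_api_id INTEGER, overall_rating INTEGER)"
)
# Invented ratings: two players with two ratings each, one with a single rating.
conn.executemany(
    "INSERT INTO Player_Attributes VALUES (?, ?)",
    [(1, 90), (1, 80), (2, 88), (2, 86), (3, 70)],
)

avg_ratings = conn.execute("""
SELECT player_api_id, AVG(overall_rating) AS avg_rating
FROM Player_Attributes
GROUP BY player_api_id
ORDER BY avg_rating DESC
LIMIT 2
""").fetchall()
print(avg_ratings)
```

Player 2 ranks first here because its average (87.0) beats player 1's (85.0), even though player 1 holds the single highest rating.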


If we know how to do a join, we can display the players' names rather than their API ids:

In [64]:
top_ratings = '''
SELECT 
    p.player_name, 
    pa.overall_rating
FROM 
    Player_Attributes AS pa
JOIN 
    Player AS p 
    ON 
        pa.player_api_id = p.player_api_id
GROUP BY 
    pa.player_api_id
ORDER BY 
    pa.overall_rating DESC
LIMIT 10;
'''
exe(cursor, top_ratings)

('Lionel Messi', 94)
('Cristiano Ronaldo', 93)
('Luis Suarez', 90)
('Manuel Neuer', 90)
('Neymar', 90)
('Zlatan Ibrahimovic', 89)
('Arjen Robben', 89)
('Eden Hazard', 88)
('Robert Lewandowski', 88)
('Thiago Silva', 88)


## 8. For each league, how many matches were played? Order your answer by country name.

In [70]:
matches_by_leagues_and_countries = '''
SELECT 
    c.name,
    l.name, 
    COUNT(*) AS n_matches
FROM 
    Match AS m
JOIN 
    League AS l 
    ON 
        m.League_id = l.id 
JOIN
    Country AS c
    ON
        l.Country_id = c.id
GROUP BY 
    League_id
ORDER BY 
    c.name
'''
exe(cursor, matches_by_leagues_and_countries)

('Belgium', 'Belgium Jupiler League', 1728)
('England', 'England Premier League', 3040)
('France', 'France Ligue 1', 3040)
('Germany', 'Germany 1. Bundesliga', 2448)
('Italy', 'Italy Serie A', 3017)
('Netherlands', 'Netherlands Eredivisie', 2448)
('Poland', 'Poland Ekstraklasa', 1920)
('Portugal', 'Portugal Liga ZON Sagres', 2052)
('Scotland', 'Scotland Premier League', 1824)
('Spain', 'Spain LIGA BBVA', 3040)
('Switzerland', 'Switzerland Super League', 1422)


## 9. Make an inner join between the `Country` table and the `League_null` table. Note the difference: which rows disappeared?

## 10. Write RIGHT JOIN clauses (inclusive and exclusive)

## 11. Try to implement a full join 
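
One possible approach to exercises 10 and 11, sketched on invented data: `RIGHT JOIN` and `FULL OUTER JOIN` are only built into SQLite from version 3.39.0, so this sketch emulates them with `LEFT JOIN` (tables swapped) and `UNION`, which also runs on older versions. The `Country` and `League` rows below are assumptions made up for the demo, not the contents of `League_null`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Country (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE League (id INTEGER PRIMARY KEY, country_id INTEGER, name TEXT);
INSERT INTO Country VALUES (1, 'Belgium'), (2, 'France'), (3, 'Iceland');
INSERT INTO League VALUES (10, 1, 'Belgium Jupiler League'),
                          (20, 2, 'France Ligue 1'),
                          (30, NULL, 'Orphan League');
""")

# Exclusive "right" join: leagues that match no country.
right_exclusive = conn.execute("""
SELECT l.name
FROM League AS l
LEFT JOIN Country AS c ON l.country_id = c.id
WHERE c.id IS NULL
""").fetchall()

# Full join: rows of both LEFT JOIN directions, duplicates removed by UNION.
# ORDER BY 1, 2 sorts on the first, then the second output column.
full_join = conn.execute("""
SELECT c.name, l.name
FROM Country AS c
LEFT JOIN League AS l ON l.country_id = c.id
UNION
SELECT c.name, l.name
FROM League AS l
LEFT JOIN Country AS c ON l.country_id = c.id
ORDER BY 1, 2
""").fetchall()
print(right_exclusive)
print(full_join)
```

The full join keeps both the country without a league and the league without a country, which an inner join would drop.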

## 12. Cross join
1. Create a new database called `club.sqlite`
2. Create a table `Members` with id, name (you can quickly create data by writing a `.csv` file, importing it and converting it to a table with `pandas`)
3. Insert some members (3 or 4)
4. Create a table `Reunions` with id, reunion_date (you can set the default to the current datetime to save time)
5. Insert some reunions: you can save time by writing a script that fills the table with a loop
6. Write a query that produces a matrix with every reunion for each member (`CROSS JOIN`), ordered by date
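
The steps above can be sketched as follows. This is one possible outline, not a reference solution: it uses an in-memory database instead of the `club.sqlite` file so that it leaves nothing on disk, it skips the `pandas` route, and the member names and dates are invented.

```python
import sqlite3

# ":memory:" stands in for "club.sqlite" in this sketch.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Members (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE Reunions (
    id INTEGER PRIMARY KEY,
    reunion_date TEXT DEFAULT CURRENT_TIMESTAMP
);
""")

# Invented members.
conn.executemany(
    "INSERT INTO Members (name) VALUES (?)", [("Ana",), ("Ben",), ("Chloe",)]
)
# Fill Reunions with a loop, using explicit (invented) dates.
for day in ("2024-01-10", "2024-02-14"):
    conn.execute("INSERT INTO Reunions (reunion_date) VALUES (?)", (day,))

# Every reunion paired with every member, ordered by date.
matrix = conn.execute("""
SELECT r.reunion_date, m.name
FROM Reunions AS r
CROSS JOIN Members AS m
ORDER BY r.reunion_date, m.name
""").fetchall()
for row in matrix:
    print(row)
```

With 2 reunions and 3 members, the cross join yields 2 × 3 = 6 rows.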

## 13. Self join

Create a detailed `employee` table that contains a manager/managed relation. You can also imagine other situations in which a self join would be pertinent. Write a query that shows, for a list of employees, their manager (take care of the presentation of the results: order by manager, and use string functions to build a sentence like "X manages Y").
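
A minimal sketch of such a self join, under assumptions: the `Employee` schema, the `manager_id` column and the four rows are hypothetical, chosen only to illustrate the technique.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Employee (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    manager_id INTEGER REFERENCES Employee (id)  -- NULL for the top manager
);
INSERT INTO Employee VALUES
    (1, 'Alice', NULL),
    (2, 'Bob', 1),
    (3, 'Carla', 1),
    (4, 'Dan', 2);
""")

# The table is joined to itself under two aliases: emp (managed) and mgr (manager).
sentences = [row[0] for row in conn.execute("""
SELECT mgr.name || ' manages ' || emp.name
FROM Employee AS emp
JOIN Employee AS mgr ON emp.manager_id = mgr.id
ORDER BY mgr.name, emp.name
""")]
for sentence in sentences:
    print(sentence)
```

Because this is an inner join, Alice does not appear as a managed employee: her `manager_id` is `NULL` and matches no row.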