In [1]:
import numpy as np
import pandas as pd
import pyodbc
import sqlalchemy
import sqlite3
from subprocess import check_output
import os

%sql sqlite://

'Connected: @None'

In [2]:
summer = pd.read_csv('/kaggle/input/data-sql/summer.csv')
summer.head()

Unnamed: 0,Year,City,Sport,Discipline,Athlete,Country,Gender,Event,Medal
0,1896,Athens,Aquatics,Swimming,HAJOS Alfred,HUN,Men,100M Freestyle,Gold
1,1896,Athens,Aquatics,Swimming,HERSCHMANN Otto,AUT,Men,100M Freestyle,Silver
2,1896,Athens,Aquatics,Swimming,DRIVAS Dimitrios,GRE,Men,100M Freestyle For Sailors,Bronze
3,1896,Athens,Aquatics,Swimming,MALOKINIS Ioannis,GRE,Men,100M Freestyle For Sailors,Gold
4,1896,Athens,Aquatics,Swimming,CHASAPIS Spiridon,GRE,Men,100M Freestyle For Sailors,Silver


In [3]:
from sqlalchemy import create_engine
engine = create_engine('sqlite:////summers', echo=False)
summer.to_sql('Summer_Medals', con = engine)

# PRACTICEs
## 1. Fetching: 4 functions.

#### A. Relative
> **`LAG`(column, n) :** returns the column's value at the row `n` rows **before** the current row
>
> **`LEAD`(column, n) :** returns the column's value at the row `n` rows **after** the current row
#### B. Absolute
> **`FIRST_VALUE`(column) :** returns the **first value** in the table or partition
>
> **`LAST_VALUE`(column) :** returns the **last value** in the table or partition (provided the window frame extends to the last row)
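
The four definitions above can be checked side by side on a tiny table. The following is a minimal sketch using Python's built-in `sqlite3` module; the table `champions` and its single-letter athletes are made up for illustration, and it assumes the bundled SQLite is version 3.25 or newer (the first release with window functions).

```python
import sqlite3

# Hypothetical toy data, not taken from the Olympics dataset.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE champions (year INTEGER, athlete TEXT)")
conn.executemany(
    "INSERT INTO champions VALUES (?, ?)",
    [(2000, "A"), (2004, "B"), (2008, "C"), (2012, "D")],
)

fetch_rows = conn.execute(
    """
    SELECT year, athlete,
           -- Relative: one row before / one row after the current row
           LAG(athlete, 1)  OVER (ORDER BY year) AS previous_champion,
           LEAD(athlete, 1) OVER (ORDER BY year) AS next_champion,
           -- Absolute: first / last row of the whole table
           FIRST_VALUE(athlete) OVER (ORDER BY year) AS first_champion,
           LAST_VALUE(athlete) OVER (
               ORDER BY year
               ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
           ) AS last_champion
    FROM champions
    ORDER BY year
    """
).fetchall()

for row in fetch_rows:
    print(row)
```

Rows at the edges have no neighbour to fetch, so `LAG` returns `NULL` (`None` in Python) on the first row and `LEAD` returns `NULL` on the last one.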

### 1.1. Future gold medalists
Fetching functions allow you to get values from different parts of the table into one row. If you have time-ordered data, you can "peek into the future" with the **`LEAD`** `fetching function`. This is especially useful if you want to compare a current value to a future value.

#### Instructions
For each year, fetch the `current gold medalist` and the `gold medalist 3 competitions` ahead of the current row.

In [4]:
pd.read_sql(
    """ 
        WITH Discus_Medalists AS (SELECT DISTINCT year, athlete
                                  FROM Summer_Medals
                                  WHERE (Medal = 'Gold') AND (Event = 'Discus Throw')
                                        AND (Gender = 'Women') AND (Year >= 2000)
                                        )
        SELECT year, athlete,
               LEAD(athlete, 3) OVER (ORDER BY year ASC) AS Future_Champion
        FROM Discus_Medalists
        ORDER BY Year ASC;    
    """, con = engine)

Unnamed: 0,year,athlete,Future_Champion
0,2000,ZVEREVA Ellina,PERKOVIC Sandra
1,2004,SADOVA Natalya,
2,2008,BROWN TRAFTON Stephanie,
3,2012,PERKOVIC Sandra,


You fetched future competitions' results with **`LEAD()`**; now practice with **`FIRST_VALUE()`**.

### 1.2. First athlete by name
It's often useful to get the first or last value in a dataset to compare all other values to it. 

With `absolute fetching functions` like **`FIRST_VALUE`**, you can fetch a value at an absolute position in the table, like its `beginning` or `end`.

#### Instructions
Return all athletes, along with the first athlete in alphabetical order.

In [5]:
pd.read_sql(
    """
        WITH All_Male_Medalists AS (SELECT DISTINCT athlete
                                    FROM Summer_Medals
                                    WHERE (Medal = 'Gold') AND (Gender = 'Men')
                                    )
        SELECT athlete,
               FIRST_VALUE(athlete) OVER (ORDER BY athlete ASC) AS First_Athlete
        FROM All_Male_Medalists;    
    """, con = engine)

Unnamed: 0,athlete,First_Athlete
0,AABYE Edgar,AABYE Edgar
1,AALTONEN Paavo Johannes,AABYE Edgar
2,AAS Thomas Valentin,AABYE Edgar
3,ABALMASAU Aliaksei,AABYE Edgar
4,ABALO Luc,AABYE Edgar
...,...,...
6240,ÖRVIG Thor,AABYE Edgar
6241,ÖSTERVOLD Henrik,AABYE Edgar
6242,ÖSTERVOLD Jan Olsen,AABYE Edgar
6243,ÖSTERVOLD Kristian Olsen,AABYE Edgar


You can now use absolute fetching functions to fetch values at fixed positions in your table or partition.

### 1.3. Last country by name
Just like you can get the first row's value in a dataset, you can get the last row's value. 

This is often useful when you want **to compare the most recent value to previous values**.

#### Instructions
Return the `year` and the `city` in which each Olympic Games was held.

Fetch the `last city` in which the Olympic games were held.

In [6]:
pd.read_sql(
    """ 
        WITH Hosts AS ( SELECT DISTINCT Year, City
                        FROM Summer_Medals)
        SELECT year, city,
               -- Get the last city in which the Olympic games were held
               LAST_VALUE(city) OVER( ORDER BY year ASC
                                      RANGE BETWEEN UNBOUNDED PRECEDING 
                                            AND UNBOUNDED FOLLOWING
                                      ) AS Last_City
        FROM Hosts
        ORDER BY Year ASC;    
    """, con = engine)

Unnamed: 0,Year,City,Last_City
0,1896,Athens,London
1,1900,Paris,London
2,1904,St Louis,London
3,1908,London,London
4,1912,Stockholm,London
5,1920,Antwerp,London
6,1924,Paris,London
7,1928,Amsterdam,London
8,1932,Los Angeles,London
9,1936,Berlin,London


Well done! Now you can get the values of the rows at the beginning and end of your table.
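
The `RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING` clause in the query above matters: when a window has an `ORDER BY` but no explicit frame, the default frame ends at the current row, so `LAST_VALUE` simply returns the current row's value. Here is a minimal sketch with the built-in `sqlite3` module, using the first three hosts from the output above in a throwaway table named `hosts` (the table name is an assumption for this example).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE hosts (year INTEGER, city TEXT)")
conn.executemany(
    "INSERT INTO hosts VALUES (?, ?)",
    [(1896, "Athens"), (1900, "Paris"), (1904, "St Louis")],
)

frame_rows = conn.execute(
    """
    SELECT year, city,
           -- Default frame: from the first row up to the current row
           LAST_VALUE(city) OVER (ORDER BY year) AS default_frame,
           -- Explicit frame: the whole table
           LAST_VALUE(city) OVER (
               ORDER BY year
               RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
           ) AS full_frame
    FROM hosts
    ORDER BY year
    """
).fetchall()

for row in frame_rows:
    print(row)
```

With the default frame, the third column just repeats `city`; only the explicit frame yields the last city on every row.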

## 2. Ranking : ranking functions.
> **`ROW_NUMBER()` :** always assigns unique numbers, even if two rows' values are the same
>
> **`RANK()` :** assigns the same number to rows with identical values, skipping over the next numbers in such cases
>
> **`DENSE_RANK()` :** also assigns the same number to rows with identical values, **but doesn't skip** over the next numbers.

### 2.1. Ranking athletes by medals

In **Chapter 1: Introduction to window functions**, you used **`ROW_NUMBER`** to **rank athletes** by awarded medals. 

However, **`ROW_NUMBER`** assigns different numbers to athletes with the same count of awarded medals, **so it's not a useful ranking function**; ***if two athletes earned the same number of medals, they should have the same rank***.

#### Instructions
Rank each `athlete` by the number of `medals` they've earned -- `the higher the count, the higher the rank` -- with identical numbers in case of identical values.

In [7]:
pd.read_sql(
    """ 
        WITH Athlete_Medals AS (SELECT athlete,
                                       COUNT(*) AS Medals
                                FROM Summer_Medals
                                GROUP BY Athlete
                                )
        SELECT athlete, medals,
               -- Rank athletes by the medals they've won
               RANK() OVER (ORDER BY medals DESC) AS Rank_N
        FROM Athlete_Medals
        ORDER BY Medals DESC;    
    """, con = engine)

Unnamed: 0,athlete,Medals,Rank_N
0,PHELPS Michael,22,1
1,LATYNINA Larisa,18,2
2,ANDRIANOV Nikolay,15,3
3,MANGIAROTTI Edoardo,13,4
4,ONO Takashi,13,4
...,...,...,...
22757,ÖSTERVOLD Henrik,1,5267
22758,ÖSTERVOLD Jan Olsen,1,5267
22759,ÖSTERVOLD Kristian Olsen,1,5267
22760,ÖSTERVOLD Ole Olsen,1,5267


Well, **`RANK()`**'s output corresponds to the actual Olympics' ranking system.

### 2.2. Ranking athletes from multiple countries
In the previous exercise, you used **`RANK`** to assign rankings to one group of athletes. ***In real-world data, however, you'll often find numerous groups within your data. Without partitioning your data, one group's values will influence the rankings of the others***.

Also, while **`RANK`** skips numbers in case of identical values, the most natural way to assign rankings is not to skip numbers. ***If two countries are tied for second place, the country after them is considered to be third by most people***.

#### Instructions
Rank each country's athletes by the count of medals they've earned -- `the higher the count, the higher the rank` -- ***without skipping numbers in case of identical values***.

In [8]:
pd.read_sql(
    """ 
        WITH Athlete_Medals AS (SELECT country, athlete, COUNT(*) AS Medals
                                FROM Summer_Medals
                                WHERE country IN ('JPN', 'KOR')
                                  AND Year >= 2000
                                GROUP BY Country, Athlete
                                HAVING COUNT(*) > 1)
        SELECT country,
                -- Rank athletes in each country by the medals they've won
                athlete,
                DENSE_RANK() OVER (PARTITION BY country
                                    ORDER BY Medals DESC) AS Rank_N
        FROM Athlete_Medals
        ORDER BY Country ASC, RANK_N ASC;    
    """, con = engine)

Unnamed: 0,country,athlete,Rank_N
0,JPN,KITAJIMA Kosuke,1
1,JPN,UCHIMURA Kohei,2
2,JPN,TACHIBANA Miya,3
3,JPN,TAKEDA Miho,3
4,JPN,ICHO Kaori,4
...,...,...,...
69,KOR,OH Yong Ran,4
70,KOR,PARK Jinman,4
71,KOR,PARK Kyung-Mo,4
72,KOR,YOO Yong-Sung,4


Good job! DENSE_RANK's way of ranking is how we'd typically assign ranks in real life.
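
To see how partitioning changes the ranks, it can help to compute a global rank and a per-country rank on the same rows. The sketch below uses the built-in `sqlite3` module with made-up countries (`X`, `Y`) and athletes, so none of the values come from the Olympics dataset.

```python
import sqlite3

# Hypothetical toy data: two countries with their athletes' medal counts.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE athlete_medals (country TEXT, athlete TEXT, medals INTEGER)"
)
conn.executemany(
    "INSERT INTO athlete_medals VALUES (?, ?, ?)",
    [("X", "a", 5), ("X", "b", 3), ("Y", "c", 4), ("Y", "d", 4), ("Y", "e", 2)],
)

partition_rows = conn.execute(
    """
    SELECT country, athlete, medals,
           -- One ranking across all rows
           DENSE_RANK() OVER (ORDER BY medals DESC) AS global_rank,
           -- Ranking restarts at 1 for every country
           DENSE_RANK() OVER (PARTITION BY country
                              ORDER BY medals DESC) AS country_rank
    FROM athlete_medals
    ORDER BY country, athlete
    """
).fetchall()

for row in partition_rows:
    print(row)
```

Athlete `b` is third overall but second within country `X`, because the athletes of country `Y` no longer influence the ranking once the window is partitioned.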

#### Summary question. DENSE_RANK's output
You have the following table:

| Country | Medals |
|---------|--------|
| IRN     | 23     |
| IRQ     | 19     |
| LBN     | 19     |
| SYR     | 19     |
| BHR     | 7      |
| KSA     | 3      |

If you were to use **`DENSE_RANK`** to order the Medals column in descending order, what rank would `BHR` be assigned?

**Remarks**

**5** is **incorrect**: that is the rank `BHR` would get from **`RANK`**, which skips ranks 3 and 4 after the three-way tie at rank 2. ***Remember that*** **`DENSE_RANK`** ***doesn't skip numbers when ranking***.

The **correct answer** is **3**: `IRN` is ranked 1, the three countries tied at 19 medals share rank 2, so `BHR` is ranked 3 and `KSA` is ranked 4.
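
The answer can be checked by loading the six rows of the table into an in-memory SQLite database and comparing the three ranking functions. This is a small sketch using the built-in `sqlite3` module; the table name `country_medals` is an assumption, and `country` is added as a tie-breaker for `ROW_NUMBER` only so that its output is deterministic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE country_medals (country TEXT, medals INTEGER)")
conn.executemany(
    "INSERT INTO country_medals VALUES (?, ?)",
    [("IRN", 23), ("IRQ", 19), ("LBN", 19), ("SYR", 19), ("BHR", 7), ("KSA", 3)],
)

rank_rows = conn.execute(
    """
    SELECT country, medals,
           ROW_NUMBER() OVER (ORDER BY medals DESC, country ASC) AS row_n,
           RANK()       OVER (ORDER BY medals DESC) AS rank_n,
           DENSE_RANK() OVER (ORDER BY medals DESC) AS dense_rank_n
    FROM country_medals
    ORDER BY medals DESC, country ASC
    """
).fetchall()

for row in rank_rows:
    print(row)
```

For `BHR`, `RANK` gives 5 while `DENSE_RANK` gives 3.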

==================================

## 3. Paging.

#### Definition: Splitting data into (approximately) equal chunks

**Usages**: 
> Many `APIs` return data in pages to reduce the amount of data being sent.
>
> Separating data into quartiles or thirds (top, middle, and bottom third) to judge performance.

#### Enter NTILE
**`NTILE`** splits data into `n` approximately equal pages.
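
"Approximately equal" means that when the row count is not divisible by `n`, the leftover rows go to the first pages. A minimal sketch with the built-in `sqlite3` module, using a made-up table `items` of ten rows split into three pages:

```python
import sqlite3
from collections import Counter

# Hypothetical toy data: ten rows with ids 1..10.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER)")
conn.executemany("INSERT INTO items VALUES (?)", [(i,) for i in range(1, 11)])

pages = conn.execute(
    """
    SELECT id, NTILE(3) OVER (ORDER BY id ASC) AS page
    FROM items
    ORDER BY id ASC
    """
).fetchall()

# Count how many rows landed on each page.
page_sizes = dict(Counter(page for _, page in pages))
print(page_sizes)
```

Ten rows cannot be split into three equal pages, so page 1 receives four rows and pages 2 and 3 receive three rows each.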

### 3.1. Paging events
There are exactly `666 unique events` in the **`Summer Medals Olympics dataset`**. 

If you want to `chunk` them up to `analyze` them `piece by piece`, you'll need to split the events into groups of approximately `equal size`.

#### Instructions
Split the **`distinct events`** into exactly `111 groups`, ordered by event in `alphabetical order`.

In [9]:
pd.read_sql(
    """ 
        WITH Events AS (SELECT DISTINCT event
                        FROM Summer_Medals)
        SELECT 
              --- Split up the distinct events into 111 unique groups
              event,
              NTILE(111) OVER (ORDER BY event ASC) AS Page
        FROM Events
        ORDER BY Event ASC;    
    """, con = engine)

Unnamed: 0,event,Page
0,+ 100KG,1
1,+ 100KG (Heavyweight),1
2,+ 100KG (Super Heavyweight),1
3,+ 105KG,1
4,+ 108KG Total (Super Heavyweight),1
...,...,...
661,York Round (100Y - 80Y - 60Y),111
662,Épée Amateurs And Masters,111
663,Épée Individual,111
664,Épée Masters,111


Good! **`NTILE()`** allows you to make the size of the dataset you're working with more manageable.

### 3.2. Top, middle, and bottom thirds
Splitting your data into thirds or quartiles is often useful to understand how the values in your dataset are spread. 

Getting **summary statistics** (`averages`, `sums`, `standard deviations`, etc.) of the `top`, `middle`, and `bottom thirds` (each holding roughly 33% of the rows) can help you determine what distribution your values follow.

#### Instructions 
**Step 1.** Split the `athletes` into `top`, `middle`, and `bottom thirds` based on their `count of medals`.

In [10]:
pd.read_sql(
    """ 
        WITH Athlete_Medals AS ( SELECT Athlete, COUNT(*) AS Medals
                                  FROM Summer_Medals
                                  GROUP BY Athlete
                                  HAVING COUNT(*) > 1
                                 )
        SELECT athlete, medals,
               -- Split athletes into thirds by their earned medals
               NTILE(3) OVER (ORDER BY medals DESC) AS Third
        FROM Athlete_Medals
        ORDER BY Medals DESC, Athlete ASC;    
    """, con = engine)

Unnamed: 0,Athlete,Medals,Third
0,PHELPS Michael,22,1
1,LATYNINA Larisa,18,1
2,ANDRIANOV Nikolay,15,1
3,MANGIAROTTI Edoardo,13,1
4,ONO Takashi,13,1
...,...,...,...
5261,ZVEREVA Ellina,2,3
5262,ZWERVER Ronald,2,3
5263,ZWOLLE Hendrik Jan,2,3
5264,ZYKINA Olesya,2,3


**Step 2.** Return the average of each third.

In [11]:
pd.read_sql(
    """
        WITH Athlete_Medals AS ( SELECT athlete, COUNT(*) AS medals
                                  FROM Summer_Medals
                                  GROUP BY athlete
                                  HAVING COUNT(*) > 1),  
              Thirds AS ( SELECT athlete, medals,
                                 NTILE(3) OVER (ORDER BY Medals DESC) AS third
                          FROM Athlete_Medals)

        SELECT third, AVG(medals) AS avg_Medals
        FROM Thirds
        GROUP BY third
        ORDER BY third ASC;    
    """, con = engine)

Unnamed: 0,third,avg_Medals
0,1,3.786446
1,2,2.0
2,3,2.0


Great! Using **`NTILE()`** and `summary statistic` functions, you could see the differences in the `top`, `middle`, and `bottom thirds`.
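
The identical averages for the middle and bottom thirds are consistent with how `NTILE` works: it cuts purely by row position, so a long run of tied values can fill more than one tile. The sketch below reproduces that pattern with the built-in `sqlite3` module on made-up medal counts (the table contents are hypothetical, not the Olympic data).

```python
import sqlite3

# Hypothetical toy data: most athletes are tied at 2 medals.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE athlete_medals (athlete TEXT, medals INTEGER)")
conn.executemany(
    "INSERT INTO athlete_medals VALUES (?, ?)",
    [("a", 5), ("b", 3), ("c", 2), ("d", 2), ("e", 2), ("f", 2)],
)

third_averages = conn.execute(
    """
    WITH Thirds AS (
        SELECT athlete, medals,
               NTILE(3) OVER (ORDER BY medals DESC) AS third
        FROM athlete_medals
    )
    SELECT third, AVG(medals) AS avg_medals
    FROM Thirds
    GROUP BY third
    ORDER BY third ASC
    """
).fetchall()

print(third_averages)
```

Which of the tied athletes end up in the second or third tile is arbitrary, but the averages are the same either way: only the top third differs from 2.0.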

### 3.3. Do the same thing for quartiles

In [12]:
pd.read_sql(
    """
        WITH Athlete_Medals AS ( SELECT athlete, COUNT(*) AS medals
                                  FROM Summer_Medals
                                  GROUP BY athlete
                                  HAVING COUNT(*) > 1),  
          Quartiles AS ( SELECT athlete, medals,
                                 NTILE(4) OVER (ORDER BY Medals DESC) AS quartile
                          FROM Athlete_Medals)

        SELECT quartile, AVG(medals) AS avg_Medals
        FROM Quartiles
        GROUP BY quartile
        ORDER BY quartile ASC;    
    """, con = engine)

Unnamed: 0,quartile,avg_Medals
0,1,4.133637
1,2,2.248292
2,3,2.0
3,4,2.0
