# Aggregation using window functions
© ExploreAI Academy

> ⚠️ This notebook will not run on Google Colab because it cannot connect to a local database. Please make sure that this notebook is running on the same local machine as your MySQL Workbench installation and MySQL `united_nations` database.

## Learning objectives

- Understand the concept of window functions in SQL.
- Learn how to use window functions for data aggregation.
- Understand how to use window functions to partition data.
- Practise using window functions to solve complex problems.


## Overview

In this notebook, we will explore how to use window functions in SQL to solve complex problems. A window function performs a calculation across a set of table rows that are related to the current row, without collapsing those rows into a single output row the way `GROUP BY` does.
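
To make the difference between grouping and windowing concrete, here is a minimal sketch. It assumes an in-memory SQLite database (version 3.25 or later, which supports window functions) as a stand-in for MySQL, and a hypothetical `demo` table holding a handful of land-area values taken from this notebook's output. The Central Asia rows are only a subset of that subregion, so the totals are illustrative.

```python
import sqlite3

# In-memory SQLite database, used only as a stand-in for the MySQL database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE demo (Sub_region TEXT, Country_name TEXT, Land_area REAL)")
conn.executemany(
    "INSERT INTO demo VALUES (?, ?, ?)",
    [
        ("Australia and New Zealand", "Australia", 7692020),
        ("Australia and New Zealand", "New Zealand", 263310),
        ("Central Asia", "Kazakhstan", 2699700),
        ("Central Asia", "Tajikistan", 138790),
    ],
)

# GROUP BY collapses each subregion into a single row.
grouped = conn.execute(
    "SELECT Sub_region, SUM(Land_area) FROM demo GROUP BY Sub_region ORDER BY Sub_region"
).fetchall()

# The window version keeps every country row and attaches the subregion total to it.
windowed = conn.execute(
    """
    SELECT Country_name,
           SUM(Land_area) OVER (PARTITION BY Sub_region) AS sub_region_total
    FROM demo
    ORDER BY Country_name
    """
).fetchall()

print(len(grouped), "grouped rows")
print(len(windowed), "windowed rows")
for row in windowed:
    print(row)
```

The grouped query returns one row per subregion, while the windowed query returns one row per country, each carrying its subregion's total.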

We will use the `united_nations.Access_to_Basic_Services` table, which contains information about different countries, their GDP, and access to basic services.

Let's begin by calculating each country's land area as a percentage of its subregion's total land area for the year 2020.


### Connecting to our MySQL database

Since we have a MySQL database, we can connect to it from the notebook using the `sql` extension and the `pymysql` driver.

In [3]:
# Load and activate the SQL extension to allow us to execute SQL in a Jupyter notebook.


%load_ext sql

In [4]:
# Establish a connection to the local database using the '%sql' magic command.
# Replace the username, password, and database name in the connection string with your own.

%sql mysql+pymysql://cybergod:example-password@localhost:3306/united_nations
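
If the password contains characters such as `@` or `/`, they need to be URL-encoded before they go into a connection string of this shape. The sketch below shows one way to build such a string with the standard library. The user name and password are hypothetical values chosen for illustration, not credentials from this notebook.

```python
from urllib.parse import quote_plus

# Hypothetical credentials, for illustration only.
user = "my_user"
password = "p@ss/word"
database = "united_nations"

# quote_plus escapes characters such as '@' and '/' that would otherwise
# be read as part of the URL structure.
connection_url = (
    f"mysql+pymysql://{user}:{quote_plus(password)}@localhost:3306/{database}"
)
print(connection_url)
```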

## Exercise

### Task 1: Select the data required for the analysis

The columns you select should include:
- `Sub_region`
- `Country_name`
- `Land_area`

Filter the results using the following criteria:
- For the `Time_period` of `2020`.
- For `Land_area` values that are not missing.
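
The two filter criteria can be tried out on a small scale first. This is a sketch only: it uses an in-memory SQLite database in place of MySQL and a hypothetical `demo` table with made-up rows. It also shows why missing values are tested with `IS NOT NULL` rather than with a comparison operator.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE demo (Country_name TEXT, Time_period INTEGER, Land_area REAL)")
conn.executemany(
    "INSERT INTO demo VALUES (?, ?, ?)",
    [
        ("Country A", 2020, 1000.0),
        ("Country B", 2020, None),   # missing land area
        ("Country C", 2019, 500.0),  # different year
    ],
)

# Both criteria together: the right year and a land area that is present.
kept = conn.execute(
    "SELECT Country_name FROM demo WHERE Time_period = 2020 AND Land_area IS NOT NULL"
).fetchall()

# A comparison with NULL never evaluates to true, so this matches no rows at all.
wrong = conn.execute(
    "SELECT Country_name FROM demo WHERE Time_period = 2020 AND Land_area <> NULL"
).fetchall()

print(kept)
print(wrong)
```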


In [8]:
%%sql
SELECT 
    Sub_region,
    Country_name,
    Land_area,
    Time_period
FROM 
    Access_to_basic_services
WHERE 
    Time_period=2020
    AND
        Land_area IS NOT NULL;

 * mysql+pymysql://cybergod:***@localhost:3306/united_nations
130 rows affected.


| Sub_region | Country_name | Land_area | Time_period |
|---|---|---|---|
| Central Asia | Kazakhstan | 2699700 | 2020 |
| Central Asia | Tajikistan | 138790 | 2020 |
| Central Asia | Turkmenistan | 469930 | 2020 |
| Central Asia | Uzbekistan | 440650 | 2020 |
| Southern Asia | Afghanistan | 652230 | 2020 |
| Southern Asia | Bangladesh | 130170 | 2020 |
| Southern Asia | Bhutan | 38140 | 2020 |
| Southern Asia | India | 2973190 | 2020 |
| Southern Asia | Maldives | 300 | 2020 |
| Southern Asia | Nepal | 143350 | 2020 |


### Task 2: Calculate the land area covered as a percentage of the country's subregion


Calculate each country's land area as a percentage of the total land area of its subregion:
- Divide `Land_area` by the `SUM()` of `Land_area` taken `OVER` a window that is `PARTITION`ed `BY` `Sub_region`, and multiply the result by `100`. Name this column `pct_sub_region_land_area`.
- `ROUND` the calculation off to `4` decimal places.

Add this column to the query from the first task.
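
The calculation described above can be sketched on a small sample before running it against the full table. The example assumes an in-memory SQLite database as a stand-in for MySQL and a hypothetical `demo` table. The land-area values are copied from this notebook's output for two subregions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Land_area is stored as REAL here because SQLite performs integer division
# on integers; MySQL's `/` operator does not have this behaviour.
conn.execute("CREATE TABLE demo (Sub_region TEXT, Country_name TEXT, Land_area REAL)")
conn.executemany(
    "INSERT INTO demo VALUES (?, ?, ?)",
    [
        ("Australia and New Zealand", "Australia", 7692020),
        ("Australia and New Zealand", "New Zealand", 263310),
        ("Caribbean", "Barbados", 430),
        ("Caribbean", "British Virgin Islands", 150),
        ("Caribbean", "Cuba", 103800),
        ("Caribbean", "Dominican Republic", 48310),
        ("Caribbean", "Haiti", 27560),
        ("Caribbean", "Jamaica", 10830),
        ("Caribbean", "Puerto Rico", 8870),
        ("Caribbean", "Trinidad and Tobago", 5130),
    ],
)

# Each country's area divided by the total for its own subregion, times 100.
pct_rows = conn.execute(
    """
    SELECT Country_name,
           ROUND(Land_area / SUM(Land_area) OVER (PARTITION BY Sub_region) * 100, 4)
               AS pct_sub_region_land_area
    FROM demo
    ORDER BY Sub_region, Country_name
    """
).fetchall()

pct = dict(pct_rows)
for name, value in pct_rows:
    print(name, value)
```

Because the sum is restarted for every `Sub_region`, the percentages within each subregion should add up to roughly 100, allowing for rounding.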


In [14]:
%%sql
SELECT 
    Sub_region,
    Country_name,
    ROUND(Land_area / SUM(Land_area) OVER (PARTITION BY Sub_region) * 100, 4) AS pct_sub_region_land_area,
    Time_period
FROM 
    Access_to_basic_services
WHERE 
    Time_period=2020
    AND
        Land_area IS NOT NULL;

 * mysql+pymysql://cybergod:***@localhost:3306/united_nations
130 rows affected.


| Sub_region | Country_name | pct_sub_region_land_area | Time_period |
|---|---|---|---|
| Australia and New Zealand | Australia | 96.6901 | 2020 |
| Australia and New Zealand | New Zealand | 3.3099 | 2020 |
| Caribbean | Barbados | 0.2097 | 2020 |
| Caribbean | British Virgin Islands | 0.0731 | 2020 |
| Caribbean | Cuba | 50.6144 | 2020 |
| Caribbean | Dominican Republic | 23.5567 | 2020 |
| Caribbean | Haiti | 13.4387 | 2020 |
| Caribbean | Jamaica | 5.2809 | 2020 |
| Caribbean | Puerto Rico | 4.3251 | 2020 |
| Caribbean | Trinidad and Tobago | 2.5015 | 2020 |


### Task 3: Calculate the running average population for each country's subregion

Start by selecting the columns needed for this analysis:
- `Sub_region`
- `Country_name`
- `Time_period`
- `Pct_managed_drinking_water_services`
- `Pct_managed_sanitation_services`
- `Est_gdp_in_billions`
- `Est_population_in_millions`

Calculate the running average:
- Calculate the `AVG()` of `Est_population_in_millions`.
- Apply it `OVER` a window that is `PARTITION`ed `BY` `Sub_region` and `ORDER`ed `BY` `Time_period`, so that the average accumulates year by year within each subregion. Name this column `Running_average_population`.
- `ROUND` the calculation off to `4` decimal places.
- Filter the results to rows `WHERE` `Est_gdp_in_billions` `IS NOT NULL`.
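
The running average described above depends on ordering the window by `Time_period`. The sketch below illustrates this on a small sample, assuming an in-memory SQLite database as a stand-in for MySQL and a hypothetical `demo` table. The population figures are copied from this notebook's output for one subregion and three years.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE demo (Sub_region TEXT, Country_name TEXT, "
    "Time_period INTEGER, Est_population_in_millions REAL)"
)
conn.executemany(
    "INSERT INTO demo VALUES (?, ?, ?, ?)",
    [
        ("Australia and New Zealand", "Australia", 2015, 23.815995),
        ("Australia and New Zealand", "New Zealand", 2015, 4.6094),
        ("Australia and New Zealand", "Australia", 2016, 24.190907),
        ("Australia and New Zealand", "New Zealand", 2016, 4.7141),
        ("Australia and New Zealand", "Australia", 2017, 24.594202),
        ("Australia and New Zealand", "New Zealand", 2017, 4.8136),
    ],
)

# With ORDER BY inside OVER, the average covers all rows of the subregion
# up to and including the current row's year.
running = conn.execute(
    """
    SELECT Country_name,
           Time_period,
           ROUND(AVG(Est_population_in_millions)
                 OVER (PARTITION BY Sub_region ORDER BY Time_period), 4)
               AS Running_average_population
    FROM demo
    ORDER BY Time_period, Country_name
    """
).fetchall()

for row in running:
    print(row)
```

Rows that share a `Time_period` are treated as peers under the default window frame, so both countries receive the same running average for a given year.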


In [18]:
%%sql
SELECT 
    Sub_region,
    Country_name,
    Time_period,
    Pct_managed_drinking_water_services,
    Pct_managed_sanitation_services,
    Est_gdp_in_billions,
    Est_population_in_millions,
    ROUND(AVG(Est_population_in_millions) OVER (PARTITION BY Sub_region ORDER BY Time_period), 4) AS Running_average_population
FROM 
    Access_to_basic_services
WHERE 
    Est_gdp_in_billions IS NOT NULL;

 * mysql+pymysql://cybergod:***@localhost:3306/united_nations
800 rows affected.


| Sub_region | Country_name | Time_period | Pct_managed_drinking_water_services | Pct_managed_sanitation_services | Est_gdp_in_billions | Est_population_in_millions | Running_average_population |
|---|---|---|---|---|---|---|---|
| Australia and New Zealand | Australia | 2015 | 100.0 | 100.0 | 1350.62 | 23.815995 | 14.2127 |
| Australia and New Zealand | New Zealand | 2015 | 100.0 | 100.0 | 178.06 | 4.6094 | 14.2127 |
| Australia and New Zealand | Australia | 2016 | 100.0 | 100.0 | 1206.54 | 24.190907 | 14.3326 |
| Australia and New Zealand | New Zealand | 2016 | 100.0 | 100.0 | 188.84 | 4.7141 | 14.3326 |
| Australia and New Zealand | New Zealand | 2017 | 100.0 | 100.0 | 206.62 | 4.8136 | 14.4564 |
| Australia and New Zealand | Australia | 2017 | 100.0 | 100.0 | 1326.52 | 24.594202 | 14.4564 |
| Australia and New Zealand | Australia | 2018 | 100.0 | 100.0 | 1428.29 | 24.966643 | 14.5757 |
| Australia and New Zealand | New Zealand | 2018 | 100.0 | 100.0 | 211.95 | 4.9006 | 14.5757 |
| Australia and New Zealand | Australia | 2019 | 100.0 | 100.0 | 1392.23 | 25.340217 | 14.6925 |
| Australia and New Zealand | New Zealand | 2019 | 100.0 | 100.0 | 213.43 | 4.9792 | 14.6925 |


## Solutions

### Task 1: Select the data required for the analysis

In [None]:
%%sql
SELECT
    Sub_region,
    Country_name,
    Land_area
FROM united_nations.Access_to_Basic_Services
WHERE Time_period = 2020
    AND Land_area IS NOT NULL;

### Task 2: Calculate the land area covered as a percentage of the country's subregion

In [14]:
%%sql
SELECT
    Sub_region,
    Country_name,
    Land_area,
    ROUND(Land_area / SUM(Land_area) OVER (PARTITION BY Sub_region) * 100, 4) AS pct_sub_region_land_area
FROM united_nations.Access_to_basic_services
WHERE Time_period = 2020
    AND Land_area IS NOT NULL;

 * mysql+pymysql://cybergod:***@localhost:3306/united_nations
130 rows affected.


| Sub_region | Country_name | Land_area | pct_sub_region_land_area |
|---|---|---|---|
| Australia and New Zealand | Australia | 7692020 | 96.6901 |
| Australia and New Zealand | New Zealand | 263310 | 3.3099 |
| Caribbean | Barbados | 430 | 0.2097 |
| Caribbean | British Virgin Islands | 150 | 0.0731 |
| Caribbean | Cuba | 103800 | 50.6144 |
| Caribbean | Dominican Republic | 48310 | 23.5567 |
| Caribbean | Haiti | 27560 | 13.4387 |
| Caribbean | Jamaica | 10830 | 5.2809 |
| Caribbean | Puerto Rico | 8870 | 4.3251 |
| Caribbean | Trinidad and Tobago | 5130 | 2.5015 |


### Task 3: Calculate the running average population for each country's subregion

In [21]:
%%sql
SELECT 
    Sub_region,
    Country_name,
    Time_period,
    Pct_managed_drinking_water_services,
    Pct_managed_sanitation_services,
    Est_gdp_in_billions,
    Est_population_in_millions,
    ROUND(AVG(Est_population_in_millions) OVER (PARTITION BY Sub_region ORDER BY Time_period), 4) AS Running_average_population
FROM united_nations.Access_to_basic_services
WHERE Est_gdp_in_billions IS NOT NULL;

 * mysql+pymysql://cybergod:***@localhost:3306/united_nations
800 rows affected.


| Sub_region | Country_name | Time_period | Pct_managed_drinking_water_services | Pct_managed_sanitation_services | Est_gdp_in_billions | Est_population_in_millions | Running_average_population |
|---|---|---|---|---|---|---|---|
| Australia and New Zealand | Australia | 2015 | 100.0 | 100.0 | 1350.62 | 23.815995 | 14.2127 |
| Australia and New Zealand | New Zealand | 2015 | 100.0 | 100.0 | 178.06 | 4.6094 | 14.2127 |
| Australia and New Zealand | Australia | 2016 | 100.0 | 100.0 | 1206.54 | 24.190907 | 14.3326 |
| Australia and New Zealand | New Zealand | 2016 | 100.0 | 100.0 | 188.84 | 4.7141 | 14.3326 |
| Australia and New Zealand | New Zealand | 2017 | 100.0 | 100.0 | 206.62 | 4.8136 | 14.4564 |
| Australia and New Zealand | Australia | 2017 | 100.0 | 100.0 | 1326.52 | 24.594202 | 14.4564 |
| Australia and New Zealand | Australia | 2018 | 100.0 | 100.0 | 1428.29 | 24.966643 | 14.5757 |
| Australia and New Zealand | New Zealand | 2018 | 100.0 | 100.0 | 211.95 | 4.9006 | 14.5757 |
| Australia and New Zealand | Australia | 2019 | 100.0 | 100.0 | 1392.23 | 25.340217 | 14.6925 |
| Australia and New Zealand | New Zealand | 2019 | 100.0 | 100.0 | 213.43 | 4.9792 | 14.6925 |
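
In the output above, both countries in a given year share the same running average, because rows with equal `Time_period` values are peers under the default window frame. As a sketch of one alternative, the example below adds an explicit `ROWS` frame and a tie-breaking sort on `Country_name` so that the average advances one row at a time. It assumes an in-memory SQLite database as a stand-in for MySQL and a hypothetical `demo` table with four rows copied from the output. The `ROWS` variant is an illustration, not part of the task's solution.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE demo (Sub_region TEXT, Country_name TEXT, "
    "Time_period INTEGER, Est_population_in_millions REAL)"
)
conn.executemany(
    "INSERT INTO demo VALUES (?, ?, ?, ?)",
    [
        ("Australia and New Zealand", "Australia", 2015, 23.815995),
        ("Australia and New Zealand", "New Zealand", 2015, 4.6094),
        ("Australia and New Zealand", "Australia", 2016, 24.190907),
        ("Australia and New Zealand", "New Zealand", 2016, 4.7141),
    ],
)

frames = conn.execute(
    """
    SELECT Country_name,
           Time_period,
           -- Default frame: peers (same Time_period) are averaged together.
           ROUND(AVG(Est_population_in_millions)
                 OVER (PARTITION BY Sub_region ORDER BY Time_period), 4) AS default_frame_avg,
           -- Explicit ROWS frame: the average grows one row at a time.
           ROUND(AVG(Est_population_in_millions)
                 OVER (PARTITION BY Sub_region
                       ORDER BY Time_period, Country_name
                       ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW), 4) AS rows_frame_avg
    FROM demo
    ORDER BY Time_period, Country_name
    """
).fetchall()

for row in frames:
    print(row)
```

The two columns agree on the last row of each year and differ on the earlier rows, which is where the peer rows have not yet all been counted by the `ROWS` frame.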


<div align="center" style=" font-size: 80%; text-align: center; margin: 0 auto">
<img src="https://raw.githubusercontent.com/Explore-AI/Pictures/master/ExploreAI_logos/EAI_Blue_Dark.png" style="width:200px;"/>
</div>