<div align="right" style=" font-size: 80%; text-align: center; margin: 0 auto">
<img src="https://raw.githubusercontent.com/Explore-AI/Pictures/master/ExploreAI_logos/Logo blue_dark.png"  style="width:25px" align="right";/>
</div>

# Using value-based window functions
© ExploreAI Academy

In this notebook, we will explore the use of value-based window functions to access values from the previous row and use these values to calculate the rate of change between consecutive rows.

> ⚠️ This notebook will not run on Google Colab because it cannot connect to a local database. Please make sure that this notebook is running on the same local machine as your MySQL Workbench installation and MySQL `united_nations` database.

## Learning objectives

In this train, we will learn:
- How to use the `LAG()` value-based window function to extract particular column values from the previous row.
- How the results from `LAG()` can be used to perform analysis such as calculating the rate of change between consecutive values. 
 

## Overview

Say we want to investigate how the percentage of managed drinking water changes from one year to the next in every country. We can add a new column with the previous year's percentage of managed drinking water using the `LAG()` function. This is a value-based window function that extracts the value of a specific column from a previous row. 
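To make the idea concrete before we connect to MySQL, here is a minimal sketch of what `LAG()` returns. It is only an illustration under stated assumptions: it uses Python's built-in `sqlite3` module with an in-memory database (SQLite 3.25 or later is assumed, since that is when window functions were added), and the table name `water` and column `Pct` are hypothetical. The three rows are copied from the Afghanistan output later in this notebook.

```python
import sqlite3

# Assumption: an in-memory SQLite database stands in for the MySQL server.
# The table name `water` and column `Pct` are made up for this sketch.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE water (Country_name TEXT, Time_period INTEGER, Pct REAL)"
)
conn.executemany(
    "INSERT INTO water VALUES (?, ?, ?)",
    [
        ("Afghanistan", 2015, 67.0),
        ("Afghanistan", 2016, 69.67),
        ("Afghanistan", 2017, 72.33),
    ],
)

# LAG(Pct) looks one row back in the order defined by the OVER clause.
lag_rows = conn.execute(
    """
    SELECT
        Time_period,
        Pct,
        LAG(Pct) OVER (ORDER BY Time_period) AS Prev_pct
    FROM water
    ORDER BY Time_period
    """
).fetchall()

for row in lag_rows:
    print(row)
```

Each row carries the value from the row before it, and the first row has nothing to look back to, so its `Prev_pct` comes back as `None`.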

## Connecting to our MySQL database

We will use the `Access_to_Basic_Services` table in the `united_nations` database that we created in MySQL Workbench. Once this notebook is connected to our MySQL server by running the cells below, we can run the same queries here that we used in MySQL Workbench.


In [2]:
# Load and activate the SQL extension to allow us to execute SQL in a Jupyter notebook. 
# If you get an error here, make sure that mysql and pymysql are installed correctly. 

%load_ext sql

The sql extension is already loaded. To reload it, use:
  %reload_ext sql


In [4]:
# Establish a connection to the local database using the '%sql' magic command.
# Replace the username, password, and database name in the connection string below with your own.
# If you get an error here, please make sure the database name or password is correct.

%sql mysql+pymysql://cybergod:example-password@localhost:3306/united_nations

## Exercise

Let us enter the following base query which selects the three columns we will be using from our `Access_to_Basic_Services` table: `Country_name`, `Time_period`, and `Pct_managed_drinking_water_services`. 


In [5]:
%%sql

SELECT
    Country_name,
    Time_period,
    Pct_managed_drinking_water_services
FROM 
    united_nations.Access_to_basic_services;

 * mysql+pymysql://cybergod:***@localhost:3306/united_nations
1048 rows affected.


Country_name,Time_period,Pct_managed_drinking_water_services
Kazakhstan,2015,94.67
Kazakhstan,2016,94.67
Kazakhstan,2017,95.0
Kazakhstan,2018,95.0
Kazakhstan,2019,95.0
Kazakhstan,2020,95.0
Kyrgyzstan,2015,89.67
Kyrgyzstan,2016,90.33
Kyrgyzstan,2017,91.0
Kyrgyzstan,2018,91.33


### 1. Add a new column with the previous year's percentage of managed drinking water.

Add the line with the `LAG()` function to the base query above to extract the previous year's percentage of managed drinking water within each country. Store the results in a new column.

In [9]:
%%sql

SELECT
    Country_name,
    Time_period,
    Pct_managed_drinking_water_services,
    LAG(Pct_managed_drinking_water_services, 1) OVER (PARTITION BY Country_name ORDER BY Time_period) AS prev_year_pct_water_access
FROM 
    united_nations.Access_to_basic_services;

 * mysql+pymysql://cybergod:***@localhost:3306/united_nations
1048 rows affected.


Country_name,Time_period,Pct_managed_drinking_water_services,prev_year_pct_water_access
Afghanistan,2015,67.0,
Afghanistan,2016,69.67,67.0
Afghanistan,2017,72.33,69.67
Afghanistan,2018,75.33,72.33
Afghanistan,2019,78.0,75.33
Afghanistan,2020,80.33,78.0
Algeria,2015,92.0,
Algeria,2016,93.0,92.0
Algeria,2017,93.0,93.0
Algeria,2018,93.0,93.0


### 2. Determine the Annual Rate of Change between consecutive years.

Building on the query above, let us determine the Annual Rate of Change between consecutive years. This is the difference, in percentage points, between each year's value and its lag (the previous year's value).

In [11]:
%%sql

SELECT
    Country_name,
    Time_period,
    Pct_managed_drinking_water_services,
    Pct_managed_drinking_water_services - LAG(Pct_managed_drinking_water_services, 1) OVER (PARTITION BY Country_name ORDER BY Time_period) AS Annual_rate_of_change
FROM 
    united_nations.Access_to_basic_services;

 * mysql+pymysql://cybergod:***@localhost:3306/united_nations
1048 rows affected.


Country_name,Time_period,Pct_managed_drinking_water_services,Annual_rate_of_change
Afghanistan,2015,67.0,
Afghanistan,2016,69.67,2.6700000000000017
Afghanistan,2017,72.33,2.6599999999999966
Afghanistan,2018,75.33,3.0
Afghanistan,2019,78.0,2.6700000000000017
Afghanistan,2020,80.33,2.3299999999999983
Algeria,2015,92.0,
Algeria,2016,93.0,1.0
Algeria,2017,93.0,0.0
Algeria,2018,93.0,0.0


## Solutions

### 1. Add a new column with the previous year's percentage of managed drinking water.

We apply the `LAG()` function as follows:

First, the `OVER` clause partitions our dataset by country (the `Country_name` column) and then orders each partition by year (the `Time_period` column) from lowest to highest.

Then, the `LAG()` function extracts the previous year's percentage of managed drinking water within each country partition, following the row order set by the `ORDER BY` clause. The results are stored in a new column, `Prev_year_pct_managed_drinking_water_services`.

In [None]:
%%sql

SELECT
    Country_name,
    Time_period,
    Pct_managed_drinking_water_services,
    LAG(Pct_managed_drinking_water_services) OVER( PARTITION BY Country_name 
    ORDER BY Time_period ASC) AS Prev_year_pct_managed_drinking_water_services
FROM 
    united_nations.Access_to_Basic_Services;



We can see that the `Prev_year_pct_managed_drinking_water_services` column holds the previous year's `Pct_managed_drinking_water_services` value in the current year's row. This shows that the lag was implemented correctly.

**NOTE:** The first year in every country partition has a `NULL` lag value since there is no previous year to extract a value from.
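The partitioning behaviour described above can be sketched on a few rows. This is an illustration rather than the notebook's actual setup: it assumes an in-memory SQLite database (3.25 or later) in place of MySQL, and it rebuilds a toy three-column copy of `Access_to_basic_services` from values shown in this notebook's output. It also places `LAG(col)` next to `LAG(col, 1)` to show that the offset defaults to one row.

```python
import sqlite3

# Assumption: in-memory SQLite replaces the MySQL `united_nations` database,
# and this toy table only holds the three columns used in this notebook.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Access_to_basic_services ("
    "Country_name TEXT, Time_period INTEGER, "
    "Pct_managed_drinking_water_services REAL)"
)
# Rows are inserted out of order on purpose: the ORDER BY inside OVER,
# not the insertion order, decides which row counts as "previous".
conn.executemany(
    "INSERT INTO Access_to_basic_services VALUES (?, ?, ?)",
    [
        ("Algeria", 2016, 93.0),
        ("Afghanistan", 2016, 69.67),
        ("Algeria", 2015, 92.0),
        ("Afghanistan", 2015, 67.0),
    ],
)

partition_rows = conn.execute(
    """
    SELECT
        Country_name,
        Time_period,
        LAG(Pct_managed_drinking_water_services) OVER (
            PARTITION BY Country_name ORDER BY Time_period ASC
        ) AS Prev_default_offset,
        LAG(Pct_managed_drinking_water_services, 1) OVER (
            PARTITION BY Country_name ORDER BY Time_period ASC
        ) AS Prev_explicit_offset
    FROM Access_to_basic_services
    ORDER BY Country_name, Time_period
    """
).fetchall()

for row in partition_rows:
    print(row)
```

The lag restarts for each country, so Algeria's 2015 row does not pick up Afghanistan's 2016 value. Both `LAG()` columns hold the same values.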

### 2. Determine the Annual Rate of Change between consecutive years.

Let’s go further and determine the Annual Rate of Change between consecutive years. That is, the difference between `Pct_managed_drinking_water_services` and `Prev_year_pct_managed_drinking_water_services`.

Therefore, we subtract the previous year's percentage of managed drinking water from the current year's percentage.

The query works similarly to the previous one, except that we reuse the `LAG()` function to calculate the Annual Rate of Change.

**NOTE:** SQL does not allow us to reference a column alias elsewhere in the same `SELECT` list in which it is defined. Hence, we have to write out the `LAG()` expression again.

The results will then be returned as a new column named `ARC_pct_managed_drinking_water_services`.

In [12]:
%%sql

SELECT
    Country_name,
    Time_period,
    Pct_managed_drinking_water_services,
    LAG(Pct_managed_drinking_water_services) OVER( PARTITION BY Country_name 
    ORDER BY Time_period ASC) AS Prev_year_pct_managed_drinking_water_services,
    Pct_managed_drinking_water_services - LAG(Pct_managed_drinking_water_services) OVER( PARTITION BY Country_name 
    ORDER BY Time_period ASC) AS ARC_pct_managed_drinking_water_services
FROM 
    united_nations.Access_to_basic_services
LIMIT 80;

 * mysql+pymysql://cybergod:***@localhost:3306/united_nations
80 rows affected.


Country_name,Time_period,Pct_managed_drinking_water_services,Prev_year_pct_managed_drinking_water_services,ARC_pct_managed_drinking_water_services
Afghanistan,2015,67.0,,
Afghanistan,2016,69.67,67.0,2.6700000000000017
Afghanistan,2017,72.33,69.67,2.6599999999999966
Afghanistan,2018,75.33,72.33,3.0
Afghanistan,2019,78.0,75.33,2.6700000000000017
Afghanistan,2020,80.33,78.0,2.3299999999999983
Algeria,2015,92.0,,
Algeria,2016,93.0,92.0,1.0
Algeria,2017,93.0,93.0,0.0
Algeria,2018,93.0,93.0,0.0


There is a new column containing the **Annual Rate of Change** values calculated based on the `Pct_managed_drinking_water_services` column values and the `Prev_year_pct_managed_drinking_water_services` column values.

For instance, we can see that in Afghanistan, in the year 2017, the percentage of managed drinking water services increased to `72.33` from `69.67` in the previous year, resulting in an Annual Rate of Change of `2.66`.

In Bangladesh, the percentage of managed drinking water services in `2016` **remained the same** as the previous year, and therefore the Annual Rate of Change was `0`.
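The raw differences in the output, such as `2.6599999999999966`, are floating-point artefacts rather than meaningful extra precision. One possible way to tidy them is to wrap the subtraction in `ROUND(..., 2)`, which exists in both MySQL and SQLite. The sketch below is an assumption-laden illustration: it runs against an in-memory SQLite database (3.25 or later) with a toy copy of the table, and the alias `ARC_rounded` is hypothetical.

```python
import sqlite3

# Assumption: in-memory SQLite stands in for MySQL; the rows are the
# Afghanistan values shown in this notebook's output.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Access_to_basic_services ("
    "Country_name TEXT, Time_period INTEGER, "
    "Pct_managed_drinking_water_services REAL)"
)
conn.executemany(
    "INSERT INTO Access_to_basic_services VALUES (?, ?, ?)",
    [
        ("Afghanistan", 2015, 67.0),
        ("Afghanistan", 2016, 69.67),
        ("Afghanistan", 2017, 72.33),
    ],
)

# ROUND(x, 2) keeps two decimal places; ROUND of NULL stays NULL,
# so the first year of the partition is still returned as None.
rounded_rows = conn.execute(
    """
    SELECT
        Time_period,
        ROUND(
            Pct_managed_drinking_water_services
            - LAG(Pct_managed_drinking_water_services) OVER (
                PARTITION BY Country_name ORDER BY Time_period ASC
            ),
            2
        ) AS ARC_rounded
    FROM Access_to_basic_services
    ORDER BY Time_period
    """
).fetchall()

for row in rounded_rows:
    print(row)
```

Rounding only changes how the result is presented. The underlying column values are untouched.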

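**NOTE:** The Annual Rate of Change is `NULL` (displayed as `None` in the notebook output) for the first year of each country, because subtracting a `NULL` lag value yields `NULL`, as discussed earlier.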
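Because an alias cannot be reused within the same `SELECT` list, the solution above writes the `LAG()` expression twice. One alternative, offered here only as a sketch, is to compute the lag once in a common table expression (CTE) and subtract it in the outer query. CTEs are available in MySQL 8.0, the same release that introduced window functions. The example assumes an in-memory SQLite database (3.25 or later) with a toy copy of the table; the CTE name `lagged` and the shortened aliases are hypothetical.

```python
import sqlite3

# Assumption: in-memory SQLite stands in for MySQL; the rows are the
# Algeria values shown in this notebook's output.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Access_to_basic_services ("
    "Country_name TEXT, Time_period INTEGER, "
    "Pct_managed_drinking_water_services REAL)"
)
conn.executemany(
    "INSERT INTO Access_to_basic_services VALUES (?, ?, ?)",
    [
        ("Algeria", 2015, 92.0),
        ("Algeria", 2016, 93.0),
        ("Algeria", 2017, 93.0),
    ],
)

# The CTE computes the lag once; the outer query can then refer to the
# alias Prev_year_pct like any ordinary column.
cte_rows = conn.execute(
    """
    WITH lagged AS (
        SELECT
            Country_name,
            Time_period,
            Pct_managed_drinking_water_services,
            LAG(Pct_managed_drinking_water_services) OVER (
                PARTITION BY Country_name ORDER BY Time_period ASC
            ) AS Prev_year_pct
        FROM Access_to_basic_services
    )
    SELECT
        Country_name,
        Time_period,
        Pct_managed_drinking_water_services - Prev_year_pct AS ARC
    FROM lagged
    ORDER BY Country_name, Time_period
    """
).fetchall()

for row in cte_rows:
    print(row)
```

The first year still yields `None`, since the subtraction involves a `NULL` lag. Whether the CTE form is preferable to repeating the expression is largely a matter of readability.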

## Summary

Further analysis can now be done to understand the factors influencing the change in the water access percentages over time.


<div align="center" style=" font-size: 80%; text-align: center; margin: 0 auto">
<img src="https://raw.githubusercontent.com/Explore-AI/Pictures/master/ExploreAI_logos/EAI_Blue_Dark.png"  style="width:200px";/>
</div>