<div align="right" style=" font-size: 80%; text-align: center; margin: 0 auto">
<img src="https://raw.githubusercontent.com/Explore-AI/Pictures/master/ExploreAI_logos/Logo blue_dark.png"  style="width:25px" align="right";/>
</div>

# Filtering and analysing a summary statistic report
© ExploreAI Academy

In this notebook, we demonstrate how to filter and analyse a summary statistic report using the `WHERE` and `HAVING` clauses.

> ⚠️ This notebook will not run on Google Colab because it cannot connect to a local database. Please make sure that this notebook is running on the same local machine as your MySQL Workbench installation and MySQL `united_nations` database.

## Learning objectives

By the end of this train, you should:
-  Know how to analyse a summary statistic report using the `WHERE` and `HAVING` clauses.

## Connecting to our MySQL database


Continuing with the numerical analysis of the `Access_to_Basic_Services` table we created in MySQL Workbench, we now want to filter and analyse the summary statistic report we have already built. We can run the same queries in MySQL Workbench and in this notebook, provided the notebook is connected to our MySQL server. Since we have a MySQL database, we connect to it using `mysql` and `pymysql`.

In [6]:
# Load and activate the SQL extension to allow us to execute SQL in a Jupyter notebook. 
# If you get an error here, make sure that mysql and pymysql are installed correctly. 
%load_ext sql

The sql extension is already loaded. To reload it, use:
  %reload_ext sql


In [7]:
# Establish a connection to the local database using the '%sql' magic command.
# Replace 'cybergod' and 'example-password' with your own username and password, and 'united_nations' with your database name.
# If you get an error here, please make sure the database name or password is correct.

%sql mysql+pymysql://cybergod:example-password@localhost:3306/united_nations
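
The connection string above follows the pattern `mysql+pymysql://user:password@host:port/database`. As a minimal sketch, the helper below assembles such a string and URL-encodes the username and password, which matters when a password contains characters such as `@` or `/`. The function name `build_mysql_url` and the sample password are hypothetical and only serve as illustration.

```python
from urllib.parse import quote_plus


def build_mysql_url(user, password, host="localhost", port=3306,
                    database="united_nations"):
    """Assemble a mysql+pymysql connection URL (hypothetical helper).

    The user and password are URL-encoded so that special characters
    do not break the URL structure.
    """
    return (
        f"mysql+pymysql://{quote_plus(user)}:{quote_plus(password)}"
        f"@{host}:{port}/{database}"
    )


# 'p@ss/word' is a made-up password chosen to show the encoding.
url = build_mysql_url("cybergod", "p@ss/word")
print(url)
# mysql+pymysql://cybergod:p%40ss%2Fword@localhost:3306/united_nations
```

The resulting string is what follows the `%sql` magic command in the cell above.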

To run a query, we put the `%%sql` command on the first line of a cell, leave one blank line, write the query as shown below, and then run the cell.

In [8]:
%%sql

SELECT 
    *
FROM
    Access_to_basic_services
LIMIT 5;

 * mysql+pymysql://cybergod:***@localhost:3306/united_nations
5 rows affected.


Region,Sub_region,Country_name,Time_period,Pct_managed_drinking_water_services,Pct_managed_sanitation_services,Est_population_in_millions,Est_gdp_in_billions,Land_area,Pct_unemployment
Central and Southern Asia,Central Asia,Kazakhstan,2015,94.67,98,17.542806,184.39,2699700,4.93
Central and Southern Asia,Central Asia,Kazakhstan,2016,94.67,98,17.794055,137.28,2699700,4.96
Central and Southern Asia,Central Asia,Kazakhstan,2017,95.0,98,18.037776,166.81,2699700,4.9
Central and Southern Asia,Central Asia,Kazakhstan,2018,95.0,98,18.276452,179.34,2699700,4.85
Central and Southern Asia,Central Asia,Kazakhstan,2019,95.0,98,18.513673,181.67,2699700,4.8


## Exercise

We started by finding the minimum, maximum, and average percentage of people with access to managed drinking water services, the number of countries, and the total estimated GDP for each region and sub-region. We also ordered the results by total estimated GDP.

In [None]:
%%sql

SELECT Region, 
    Sub_region,
    MIN(Pct_managed_drinking_water_services) AS min_Pct_managed_drinking_water_services, 
    MAX(Pct_managed_drinking_water_services) AS max_Pct_managed_drinking_water_services, 
    AVG(Pct_managed_drinking_water_services) AS avg_Pct_managed_drinking_water_services,
    COUNT(DISTINCT(Country_name)) AS Number_of_countries,
    SUM(EST_gdp_in_billions) AS EST_total_gdp_in_billions 
FROM united_nations.Access_to_Basic_Services 
GROUP BY Region, Sub_region
ORDER BY EST_total_gdp_in_billions;

Let's continue with our analysis.

We want to use the summary statistic report to do the following:

1. Filter for the year 2020.
2. Focus on countries where the percentage of managed drinking water services is below 60%.
3. Filter for the regions and sub-regions that have fewer than four countries.

### 1. Filter for the year 2020.
Starting from the query above, use the `WHERE` clause to keep only the rows where the time period is 2020.

In [15]:
%%sql

SELECT Region, 
    Sub_region,
    Time_period,
    MIN(Pct_managed_drinking_water_services) AS min_Pct_managed_drinking_water_services, 
    MAX(Pct_managed_drinking_water_services) AS max_Pct_managed_drinking_water_services, 
    AVG(Pct_managed_drinking_water_services) AS avg_Pct_managed_drinking_water_services,
    COUNT(DISTINCT(Country_name)) AS Number_of_countries,
    SUM(EST_gdp_in_billions) AS EST_total_gdp_in_billions 
FROM united_nations.Access_to_basic_services 
WHERE Time_period=2020
GROUP BY Region, Sub_region
ORDER BY EST_total_gdp_in_billions
;

 * mysql+pymysql://cybergod:***@localhost:3306/united_nations
18 rows affected.


Region,Sub_region,Time_period,min_Pct_managed_drinking_water_services,max_Pct_managed_drinking_water_services,avg_Pct_managed_drinking_water_services,Number_of_countries,EST_total_gdp_in_billions
Oceania,Micronesia,2020,77.0,100.0,94.5,6,6.67
Oceania,Polynesia,2020,92.0,100.0,98.55555555555556,9,7.84
Oceania,Melanesia,2020,56.67,99.0,82.934,5,40.21
Sub-Saharan Africa,Middle Africa,2020,38.33,77.33,59.3325,8,123.22
Central and Southern Asia,Central Asia,2020,85.0,100.0,94.134,5,239.1
Latin America and the Caribbean,Caribbean,2020,65.0,100.0,95.91066666666666,15,343.26
Sub-Saharan Africa,Eastern Africa,2020,48.33,100.0,70.01882352941175,17,359.1
Sub-Saharan Africa,Southern Africa,2020,76.33,92.0,83.668,5,369.34
Northern Africa and Western Asia,Northern Africa,2020,62.33,100.0,90.05333333333333,6,386.29
Sub-Saharan Africa,Western Africa,2020,53.33,99.0,73.6070588235294,17,631.91


### 2. Focus on countries where the percentage of managed drinking water services is below 60%.

Building on the query above, add a second condition to the `WHERE` clause so that only rows where the percentage of managed drinking water services is below 60% are kept.

In [17]:
%%sql

SELECT Region, 
    Sub_region,
    Time_period,
    MIN(Pct_managed_drinking_water_services) AS min_Pct_managed_drinking_water_services, 
    MAX(Pct_managed_drinking_water_services) AS max_Pct_managed_drinking_water_services, 
    AVG(Pct_managed_drinking_water_services) AS avg_Pct_managed_drinking_water_services,
    COUNT(DISTINCT(Country_name)) AS Number_of_countries,
    SUM(EST_gdp_in_billions) AS EST_total_gdp_in_billions 
FROM united_nations.Access_to_basic_services 
WHERE Time_period=2020
    AND
    Pct_managed_drinking_water_services<60
GROUP BY Region, Sub_region
ORDER BY EST_total_gdp_in_billions;


 * mysql+pymysql://cybergod:***@localhost:3306/united_nations
4 rows affected.


Region,Sub_region,Time_period,min_Pct_managed_drinking_water_services,max_Pct_managed_drinking_water_services,avg_Pct_managed_drinking_water_services,Number_of_countries,EST_total_gdp_in_billions
Oceania,Melanesia,2020,56.67,56.67,56.67,1,23.85
Sub-Saharan Africa,Western Africa,2020,53.33,57.33,55.33,2,31.67
Sub-Saharan Africa,Middle Africa,2020,38.33,52.67,47.75,4,66.67
Sub-Saharan Africa,Eastern Africa,2020,48.33,58.0,54.9975,4,127.59
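
Because `WHERE` removes rows before the aggregates are computed, adding a condition changes the statistics of every group, not only which groups appear. In the outputs above, for example, Melanesia drops from 5 countries to 1. The sketch below reproduces that effect on a small scale. It assumes an in-memory SQLite database through Python's built-in `sqlite3` module instead of MySQL, and the `services` table, its shortened column names, and all rows are made up for illustration.

```python
import sqlite3

# In-memory SQLite database standing in for the MySQL server (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE services ("
    "Sub_region TEXT, Country_name TEXT, Time_period INTEGER, Pct_water REAL)"
)

# Made-up rows: sub-region A has three countries in 2020, B has one.
conn.executemany(
    "INSERT INTO services VALUES (?, ?, ?, ?)",
    [
        ("A", "Country 1", 2020, 55.0),
        ("A", "Country 2", 2020, 90.0),
        ("A", "Country 3", 2020, 40.0),
        ("B", "Country 4", 2020, 80.0),
        ("B", "Country 5", 2019, 50.0),
    ],
)

# The same report, with an optional extra row-level condition.
query = """
SELECT Sub_region,
       COUNT(DISTINCT Country_name) AS Number_of_countries,
       MAX(Pct_water) AS max_pct
FROM services
WHERE Time_period = 2020 {extra}
GROUP BY Sub_region
ORDER BY Sub_region
"""

year_only = conn.execute(query.format(extra="")).fetchall()
below_60 = conn.execute(query.format(extra="AND Pct_water < 60")).fetchall()

print(year_only)  # [('A', 3, 90.0), ('B', 1, 80.0)]
print(below_60)   # [('A', 2, 55.0)]
```

Sub-region B disappears from the second result because none of its 2020 rows pass the new condition, and the count and maximum for A are recomputed from the two rows that remain.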


### 3. Filter for the regions and sub-regions that have fewer than four countries.

Using the `HAVING` clause, filter the results above so that only the region and sub-region groups whose `Number_of_countries` value is below four are kept.

In [19]:
%%sql

SELECT Region, 
    Sub_region,
    Time_period,
    MIN(Pct_managed_drinking_water_services) AS min_Pct_managed_drinking_water_services, 
    MAX(Pct_managed_drinking_water_services) AS max_Pct_managed_drinking_water_services, 
    AVG(Pct_managed_drinking_water_services) AS avg_Pct_managed_drinking_water_services,
    COUNT(DISTINCT(Country_name)) AS Number_of_countries,
    SUM(EST_gdp_in_billions) AS EST_total_gdp_in_billions 
FROM united_nations.Access_to_basic_services 
WHERE Time_period=2020
    AND
    Pct_managed_drinking_water_services<60
GROUP BY Region, Sub_region
HAVING Number_of_countries<4
ORDER BY EST_total_gdp_in_billions;


 * mysql+pymysql://cybergod:***@localhost:3306/united_nations
2 rows affected.


Region,Sub_region,Time_period,min_Pct_managed_drinking_water_services,max_Pct_managed_drinking_water_services,avg_Pct_managed_drinking_water_services,Number_of_countries,EST_total_gdp_in_billions
Oceania,Melanesia,2020,56.67,56.67,56.67,1,23.85
Sub-Saharan Africa,Western Africa,2020,53.33,57.33,55.33,2,31.67


## Solutions

### 1. Filter for the year 2020.

In [None]:
%%sql

SELECT Region, 
    Sub_region,
    MIN(Pct_managed_drinking_water_services) AS min_Pct_managed_drinking_water_services, 
    MAX(Pct_managed_drinking_water_services) AS max_Pct_managed_drinking_water_services, 
    AVG(Pct_managed_drinking_water_services) AS avg_Pct_managed_drinking_water_services, 
    COUNT(DISTINCT(Country_name)) AS Number_of_countries,
    SUM(EST_gdp_in_billions) AS EST_total_gdp_in_billions
FROM Access_to_Basic_Services 
WHERE Time_period = 2020
GROUP BY Region, Sub_region
ORDER BY EST_total_gdp_in_billions ASC;


### 2. Focus on countries where the percentage of managed drinking water services is below 60%.

In [None]:
%%sql

SELECT Region, 
    Sub_region,
    MIN(Pct_managed_drinking_water_services) AS min_Pct_managed_drinking_water_services, 
    MAX(Pct_managed_drinking_water_services) AS max_Pct_managed_drinking_water_services, 
    AVG(Pct_managed_drinking_water_services) AS avg_Pct_managed_drinking_water_services, 
    COUNT(DISTINCT(Country_name)) AS Number_of_countries,
    SUM(EST_gdp_in_billions) AS EST_total_gdp_in_billions
FROM Access_to_Basic_Services 
WHERE Time_period = 2020
    AND Pct_managed_drinking_water_services < 60
GROUP BY Region, Sub_region
ORDER BY EST_total_gdp_in_billions ASC;

### 3. Filter for the regions and sub-regions that have fewer than four countries.

In [20]:
%%sql

SELECT Region, 
    Sub_region,
    MIN(Pct_managed_drinking_water_services) AS min_Pct_managed_drinking_water_services, 
    MAX(Pct_managed_drinking_water_services) AS max_Pct_managed_drinking_water_services, 
    AVG(Pct_managed_drinking_water_services) AS avg_Pct_managed_drinking_water_services, 
    COUNT(DISTINCT(Country_name)) AS Number_of_countries,
    SUM(EST_gdp_in_billions) AS EST_total_gdp_in_billions
FROM Access_to_basic_services 
WHERE Time_period = 2020
    AND Pct_managed_drinking_water_services < 60
GROUP BY Region, Sub_region
HAVING Number_of_countries < 4
ORDER BY EST_total_gdp_in_billions ASC;

 * mysql+pymysql://cybergod:***@localhost:3306/united_nations
2 rows affected.


Region,Sub_region,min_Pct_managed_drinking_water_services,max_Pct_managed_drinking_water_services,avg_Pct_managed_drinking_water_services,Number_of_countries,EST_total_gdp_in_billions
Oceania,Melanesia,56.67,56.67,56.67,1,23.85
Sub-Saharan Africa,Western Africa,53.33,57.33,55.33,2,31.67


The `WHERE` clause may come to mind first when trying to apply this criterion. However, the condition depends on `Number_of_countries`, an aggregate that only exists once the rows have been grouped by region and sub-region. The `WHERE` clause is evaluated before `GROUP BY` and the aggregate functions, so it can only filter individual rows and has no access to the per-group count. The `HAVING` clause is evaluated after grouping, which makes it the appropriate clause for keeping only the groups that have fewer than four countries.
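
The difference between the two clauses can be sketched on a small scale. The example below assumes an in-memory SQLite database through Python's built-in `sqlite3` module instead of MySQL. The `services` table and its rows are made up for illustration, and the exact error message will differ between database engines.

```python
import sqlite3

# In-memory SQLite database standing in for the MySQL server (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE services (Sub_region TEXT, Country_name TEXT)")

# Made-up data: sub-region A has five countries, sub-region B has two.
conn.executemany(
    "INSERT INTO services VALUES (?, ?)",
    [("A", f"Country {i}") for i in range(5)]
    + [("B", "Country 5"), ("B", "Country 6")],
)

# Attempt 1: an aggregate condition in WHERE is rejected, because WHERE
# is applied to individual rows before any grouping takes place.
try:
    conn.execute(
        "SELECT Sub_region FROM services "
        "WHERE COUNT(DISTINCT Country_name) < 4 "
        "GROUP BY Sub_region"
    )
    where_error = None
except sqlite3.Error as exc:
    where_error = str(exc)

print("WHERE attempt failed:", where_error is not None)

# Attempt 2: HAVING is applied after GROUP BY, so it can filter on the count.
small_groups = conn.execute(
    """
    SELECT Sub_region,
           COUNT(DISTINCT Country_name) AS Number_of_countries
    FROM services
    GROUP BY Sub_region
    HAVING Number_of_countries < 4
    """
).fetchall()

print(small_groups)  # [('B', 2)]
```

Only sub-region B is returned, since it is the only group whose country count is below four.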

With this report, we can answer questions like “Out of the sub-regions that meet the criteria, which sub-region has the lowest GDP?”
Since the results are ordered by total estimated GDP in ascending order, the first row contains the answer: Melanesia, with a total estimated GDP of 23.85 billion. Note that this total covers only the countries that passed the `WHERE` filters, not every country in the sub-region.
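
As a quick check of that reasoning, the sketch below takes the two rows returned by the final query (region, sub-region, and total estimated GDP, copied from the output above) and confirms that the first row of an ascending sort is also the minimum. The variable names are illustrative only.

```python
# Rows from the final report: (Region, Sub_region, EST_total_gdp_in_billions).
report = [
    ("Oceania", "Melanesia", 23.85),
    ("Sub-Saharan Africa", "Western Africa", 31.67),
]

# With ORDER BY ... ASC, the first row holds the lowest total GDP.
first_row = report[0]

# Independent check: search for the minimum GDP directly.
lowest = min(report, key=lambda row: row[2])

print(first_row == lowest)                 # True
print(f"{lowest[1]}: {lowest[2]} billion")  # Melanesia: 23.85 billion
```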


<div align="center" style=" font-size: 80%; text-align: center; margin: 0 auto">
<img src="https://raw.githubusercontent.com/Explore-AI/Pictures/master/ExploreAI_logos/EAI_Blue_Dark.png"  style="width:200px";/>
</div>