<div align="right" style=" font-size: 80%; text-align: center; margin: 0 auto">
<img src="https://raw.githubusercontent.com/Explore-AI/Pictures/master/ExploreAI_logos/Logo blue_dark.png"  style="width:25px" align="right";/>
</div>

# Create a summary statistic report in SQL 
© ExploreAI Academy

In this notebook, we will demonstrate how to create a summary statistic report in SQL using numeric functions and aggregations. 



> ⚠️ This notebook will not run on Google Colab because it cannot connect to a local database. Please make sure that this notebook is running on the same local machine as your MySQL Workbench installation and MySQL `united_nations` database.

## Learning objectives

By the end of this train, you should:
- Know how to use the `GROUP BY` clause to examine the same dataset at different levels of granularity.
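
As a quick illustration of what "different levels of granularity" means, the sketch below groups the same rows first by region and then by region and sub-region. It is only a sketch under stated assumptions: it uses Python's built-in `sqlite3` module with an in-memory database instead of our MySQL server, and the rows are made up for illustration.

```python
import sqlite3

# Assumption: an in-memory SQLite database stands in for the MySQL
# `united_nations` database so that this sketch runs anywhere.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Access_to_basic_services ("
    "Region TEXT, Sub_region TEXT, Country_name TEXT, Est_gdp_in_billions REAL)"
)

# Made-up rows, for illustration only.
rows = [
    ("Region A", "Sub 1", "Country W", 10.0),
    ("Region A", "Sub 1", "Country X", 20.0),
    ("Region A", "Sub 2", "Country Y", 30.0),
    ("Region B", "Sub 3", "Country Z", 40.0),
]
conn.executemany("INSERT INTO Access_to_basic_services VALUES (?, ?, ?, ?)", rows)

# Coarse granularity: one result row per region.
by_region = conn.execute(
    "SELECT Region, SUM(Est_gdp_in_billions) "
    "FROM Access_to_basic_services "
    "GROUP BY Region ORDER BY Region"
).fetchall()

# Finer granularity: one result row per (region, sub-region) pair.
by_sub_region = conn.execute(
    "SELECT Region, Sub_region, SUM(Est_gdp_in_billions) "
    "FROM Access_to_basic_services "
    "GROUP BY Region, Sub_region ORDER BY Region, Sub_region"
).fetchall()

print(by_region)
print(by_sub_region)
```

Adding a column to the `GROUP BY` clause splits each existing group into smaller ones, so the second query returns more rows than the first.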

## Connecting to our MySQL database
Using the `Access_to_basic_services` table we created in MySQL Workbench, we want to answer some summary questions about our dataset. The same queries can be run in MySQL Workbench or in this notebook once we connect to our MySQL server. Since we have a MySQL database, we can connect to it using `mysql` and `pymysql`.

In [1]:
# Load and activate the SQL extension to allow us to execute SQL in a Jupyter notebook. 
# If you get an error here, make sure that mysql and pymysql are installed correctly. 

%load_ext sql

In [3]:
# Establish a connection to the local database using the '%sql' magic command.
# Replace the username, password, and database name in the URL below with your own.
# If you get an error here, please make sure the database name and password are correct.

%sql mysql+pymysql://cybergod:example-password@localhost:3306/united_nations
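
If the connection password contains characters such as `@` or `/`, the URL above may be parsed incorrectly. A minimal sketch of one way to handle this is to percent-encode the password with `urllib.parse.quote_plus` from the Python standard library before building the URL. The username and password below are hypothetical and used only for illustration.

```python
from urllib.parse import quote_plus

# Hypothetical credentials, for illustration only.
user = "example_user"
password = "p@ss/word"  # contains characters that are special inside a URL
database = "united_nations"

# quote_plus percent-encodes '@' and '/' so they are not read as URL delimiters.
connection_url = (
    f"mysql+pymysql://{user}:{quote_plus(password)}@localhost:3306/{database}"
)
print(connection_url)
```

The resulting string can then be used in place of the hard-coded connection URL.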

To run a query, we add the `%%sql` command at the start of a cell, leave one blank line, write the query as shown below, and then run the cell.

In [4]:
%%sql

SELECT 
    *
FROM
    Access_to_basic_services
LIMIT 5;

 * mysql+pymysql://cybergod:***@localhost:3306/united_nations
5 rows affected.


Region,Sub_region,Country_name,Time_period,Pct_managed_drinking_water_services,Pct_managed_sanitation_services,Est_population_in_millions,Est_gdp_in_billions,Land_area,Pct_unemployment
Central and Southern Asia,Central Asia,Kazakhstan,2015,94.67,98,17.542806,184.39,2699700,4.93
Central and Southern Asia,Central Asia,Kazakhstan,2016,94.67,98,17.794055,137.28,2699700,4.96
Central and Southern Asia,Central Asia,Kazakhstan,2017,95.0,98,18.037776,166.81,2699700,4.9
Central and Southern Asia,Central Asia,Kazakhstan,2018,95.0,98,18.276452,179.34,2699700,4.85
Central and Southern Asia,Central Asia,Kazakhstan,2019,95.0,98,18.513673,181.67,2699700,4.8


## Exercise
We want to determine the following:
1. What is the minimum, maximum, and average percentage of people that have access to managed drinking water services per region and sub_region?
2. What is the number of countries within each region and sub_region? 
3. What is the total GDP for each region and sub_region?

### 1. What is the minimum, maximum, and average percentage of people that have access to managed drinking water services per region and sub_region?


Calculate the minimum, maximum, and average percentage of people that have access to managed drinking water services per `region` and `sub_region` in our dataset using the `MIN`, `MAX`, and `AVG` functions. Return the result with aliases.

In [10]:
%%sql 
SELECT
    Region,
    Sub_region,
    MIN(Pct_managed_drinking_water_services),
    MAX(Pct_managed_drinking_water_services),
    AVG(Pct_managed_drinking_water_services)
    
FROM 
    Access_to_basic_services
GROUP BY
    Region,
    Sub_region
ORDER BY 
    AVG(Pct_managed_drinking_water_services) DESC;

 * mysql+pymysql://cybergod:***@localhost:3306/united_nations
18 rows affected.


Region,Sub_region,MIN(Pct_managed_drinking_water_services),MAX(Pct_managed_drinking_water_services),AVG(Pct_managed_drinking_water_services)
Oceania,Australia and New Zealand,100.0,100.0,100.0
Oceania,Polynesia,91.0,100.0,98.50648148148149
Europe and Northern America,Northern America,91.0,100.0,97.91133333333332
Latin America and the Caribbean,Caribbean,64.0,100.0,96.005
Northern Africa and Western Asia,Western Asia,59.0,100.0,95.03120370370372
Latin America and the Caribbean,South America,86.0,100.0,94.88095238095238
Latin America and the Caribbean,Central America,79.0,100.0,93.79812499999998
Oceania,Micronesia,73.67,100.0,93.41463414634148
Central and Southern Asia,Central Asia,80.33,100.0,93.14466666666668
Eastern and South-Eastern Asia,Eastern Asia,75.67,100.0,92.69966666666666


### 2. What is the total number of countries within each region and sub_region?

Determine the number of countries within each region and sub-region by using the `COUNT` function. Use an alias to name the result.

In [11]:
%%sql
SELECT 
    Region,
    Sub_region,
    count(distinct Country_name)
FROM 
    Access_to_basic_services
GROUP BY
    Region,
    Sub_region

 * mysql+pymysql://cybergod:***@localhost:3306/united_nations
18 rows affected.


Region,Sub_region,count(distinct Country_name)
Central and Southern Asia,Central Asia,5
Central and Southern Asia,Southern Asia,9
Eastern and South-Eastern Asia,Eastern Asia,5
Eastern and South-Eastern Asia,South-Eastern Asia,11
Europe and Northern America,Northern America,5
Latin America and the Caribbean,Caribbean,27
Latin America and the Caribbean,Central America,8
Latin America and the Caribbean,South America,14
Northern Africa and Western Asia,Northern Africa,6
Northern Africa and Western Asia,Western Asia,18


### 3. What is the total GDP for each region and sub_region?

Determine the total GDP for each region and sub-region by using the `SUM` function to add all GDP values for each `region` and `sub_region`. Use an alias to name the result.

In [18]:
%%sql
SELECT 
    Region,
    Sub_region,
    SUM(Est_gdp_in_billions) Total_gdp_per_region
FROM 
    Access_to_basic_services
GROUP BY
    Region,
    Sub_region
ORDER BY
    Total_gdp_per_region DESC

 * mysql+pymysql://cybergod:***@localhost:3306/united_nations
18 rows affected.


Region,Sub_region,Total_gdp_per_region
Eastern and South-Eastern Asia,Eastern Asia,107123.36999999998
Latin America and the Caribbean,South America,19959.58
Central and Southern Asia,Southern Asia,19824.660000000003
Eastern and South-Eastern Asia,South-Eastern Asia,15563.180000000002
Northern Africa and Western Asia,Western Asia,13605.830000000002
Europe and Northern America,Northern America,9905.96
Oceania,Australia and New Zealand,9241.73
Latin America and the Caribbean,Central America,8524.66
Sub-Saharan Africa,Western Africa,3621.309999999999
Northern Africa and Western Asia,Northern Africa,2736.8


## Solutions

### 1. What is the minimum, maximum, and average percentage of people that have access to managed drinking water services per region and sub_region?

In [12]:
%%sql

SELECT Region,
    Sub_region,
    MIN(Pct_managed_drinking_water_services) AS min_Pct_managed_drinking_water_services,
    MAX(Pct_managed_drinking_water_services) AS max_Pct_managed_drinking_water_services,
    AVG(Pct_managed_drinking_water_services) AS avg_Pct_managed_drinking_water_services
FROM united_nations.Access_to_basic_services
GROUP BY Region, Sub_region;


Remember that we use the `MIN`, `MAX`, and `AVG` functions to aggregate the values in the `Pct_managed_drinking_water_services` column while also returning the `Region` and `Sub_region` columns. We therefore have to specify the grouping columns using the `GROUP BY` clause, so that each aggregate is calculated once per `Region` and `Sub_region` combination.
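
To see what the grouping changes, the sketch below runs the same three aggregates with and without `GROUP BY`. It assumes an in-memory SQLite database (via Python's built-in `sqlite3` module) in place of our MySQL server, and the rows are made up for illustration.

```python
import sqlite3

# Assumption: in-memory SQLite stands in for the MySQL database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Access_to_basic_services ("
    "Region TEXT, Sub_region TEXT, Pct_managed_drinking_water_services REAL)"
)

# Made-up rows, for illustration only.
rows = [
    ("Region A", "Sub 1", 80.0),
    ("Region A", "Sub 1", 100.0),
    ("Region A", "Sub 2", 60.0),
    ("Region A", "Sub 2", 70.0),
]
conn.executemany("INSERT INTO Access_to_basic_services VALUES (?, ?, ?)", rows)

# Without GROUP BY, the whole table collapses into one summary row.
overall = conn.execute(
    "SELECT MIN(Pct_managed_drinking_water_services), "
    "MAX(Pct_managed_drinking_water_services), "
    "AVG(Pct_managed_drinking_water_services) "
    "FROM Access_to_basic_services"
).fetchall()

# With GROUP BY, we get one summary row per (Region, Sub_region) pair.
grouped = conn.execute(
    "SELECT Region, Sub_region, "
    "MIN(Pct_managed_drinking_water_services), "
    "MAX(Pct_managed_drinking_water_services), "
    "AVG(Pct_managed_drinking_water_services) "
    "FROM Access_to_basic_services "
    "GROUP BY Region, Sub_region ORDER BY Region, Sub_region"
).fetchall()

print(overall)
print(grouped)
```

Note that in recent MySQL versions, where the `ONLY_FULL_GROUP_BY` SQL mode is enabled by default, selecting `Region` and `Sub_region` next to an aggregate without a matching `GROUP BY` clause is rejected with an error rather than silently collapsed.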

### 2. What is the total number of countries within each region and sub_region?

In [13]:
%%sql

SELECT Region,
    Sub_region,
    COUNT(DISTINCT(Country_name)) AS Number_of_countries
FROM united_nations.Access_to_basic_services 
GROUP BY Region, Sub_region;


 * mysql+pymysql://cybergod:***@localhost:3306/united_nations
18 rows affected.


Region,Sub_region,Number_of_countries
Central and Southern Asia,Central Asia,5
Central and Southern Asia,Southern Asia,9
Eastern and South-Eastern Asia,Eastern Asia,5
Eastern and South-Eastern Asia,South-Eastern Asia,11
Europe and Northern America,Northern America,5
Latin America and the Caribbean,Caribbean,27
Latin America and the Caribbean,Central America,8
Latin America and the Caribbean,South America,14
Northern Africa and Western Asia,Northern Africa,6
Northern Africa and Western Asia,Western Asia,18
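
The `DISTINCT` keyword matters here because the table holds one row per country per time period (Kazakhstan, for example, appears once for each year from 2015 to 2019 in the first five rows). The sketch below compares `COUNT(*)` with `COUNT(DISTINCT Country_name)`. It assumes an in-memory SQLite database in place of our MySQL server, and the rows are made up for illustration.

```python
import sqlite3

# Assumption: in-memory SQLite stands in for the MySQL database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Access_to_basic_services ("
    "Region TEXT, Sub_region TEXT, Country_name TEXT, Time_period INTEGER)"
)

# Made-up rows: each country appears once per year, as in the real table.
rows = [
    ("Region A", "Sub 1", "Country X", 2015),
    ("Region A", "Sub 1", "Country X", 2016),
    ("Region A", "Sub 1", "Country Y", 2015),
    ("Region A", "Sub 1", "Country Y", 2016),
    ("Region A", "Sub 2", "Country Z", 2015),
]
conn.executemany("INSERT INTO Access_to_basic_services VALUES (?, ?, ?, ?)", rows)

counts = conn.execute(
    "SELECT Region, Sub_region, "
    "COUNT(*) AS Number_of_rows, "
    "COUNT(DISTINCT Country_name) AS Number_of_countries "
    "FROM Access_to_basic_services "
    "GROUP BY Region, Sub_region ORDER BY Region, Sub_region"
).fetchall()

print(counts)
```

In this sketch, `COUNT(*)` counts every country-year row, while `COUNT(DISTINCT Country_name)` counts each country only once per group.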


### 3. What is the total GDP for each region and sub_region?

In [None]:
%%sql

SELECT Region,
    Sub_region,
    SUM(EST_gdp_in_billions) AS EST_total_gdp_in_billions
FROM united_nations.Access_to_basic_services
GROUP BY Region, Sub_region;


### Summary

We can also combine all of our queries into a single query that returns all of the summary values in one result set.

In [22]:
%%sql

SELECT Region,
    Sub_region,
    MIN(Pct_managed_drinking_water_services) AS min_Pct_managed_drinking_water_services, 
    MAX(Pct_managed_drinking_water_services) AS max_Pct_managed_drinking_water_services, 
    AVG(Pct_managed_drinking_water_services) AS avg_Pct_managed_drinking_water_services,
    COUNT(DISTINCT(Country_name)) AS Number_of_countries,
    SUM(EST_gdp_in_billions) AS EST_total_gdp_in_billions
FROM united_nations.Access_to_basic_services 
GROUP BY Region, Sub_region
ORDER BY EST_total_gdp_in_billions DESC;

 * mysql+pymysql://cybergod:***@localhost:3306/united_nations
18 rows affected.


Region,Sub_region,min_Pct_managed_drinking_water_services,max_Pct_managed_drinking_water_services,avg_Pct_managed_drinking_water_services,Number_of_countries,EST_total_gdp_in_billions
Eastern and South-Eastern Asia,Eastern Asia,75.67,100.0,92.69966666666666,5,107123.37
Latin America and the Caribbean,South America,86.0,100.0,94.88095238095238,14,19959.58000000001
Central and Southern Asia,Southern Asia,67.0,99.67,91.89407407407406,9,19824.660000000003
Eastern and South-Eastern Asia,South-Eastern Asia,73.33,100.0,90.6260606060606,11,15563.180000000004
Northern Africa and Western Asia,Western Asia,59.0,100.0,95.03120370370372,18,13605.830000000002
Europe and Northern America,Northern America,91.0,100.0,97.91133333333332,5,9905.96
Oceania,Australia and New Zealand,100.0,100.0,100.0,2,9241.729999999998
Latin America and the Caribbean,Central America,79.0,100.0,93.79812499999998,8,8524.659999999998
Sub-Saharan Africa,Western Africa,53.33,99.0,72.3656862745098,17,3621.309999999999
Northern Africa and Western Asia,Northern Africa,61.33,100.0,88.9061111111111,6,2736.7999999999997
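
Some of the aggregated values above, such as `19959.58000000001`, carry floating-point noise. One possible tidy-up for a report is to wrap the aggregates in the `ROUND` numeric function. The sketch below assumes an in-memory SQLite database in place of our MySQL server (both provide `ROUND(value, decimals)`), and the rows and the two-decimal precision are choices made for illustration.

```python
import sqlite3

# Assumption: in-memory SQLite stands in for the MySQL database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Access_to_basic_services ("
    "Region TEXT, Sub_region TEXT, "
    "Pct_managed_drinking_water_services REAL, Est_gdp_in_billions REAL)"
)

# Made-up rows, for illustration only.
rows = [
    ("Region A", "Sub 1", 90.0, 10.11),
    ("Region A", "Sub 1", 92.0, 20.22),
    ("Region A", "Sub 1", 95.0, 30.33),
]
conn.executemany("INSERT INTO Access_to_basic_services VALUES (?, ?, ?, ?)", rows)

# ROUND(..., 2) keeps two decimal places in the reported aggregates.
report = conn.execute(
    "SELECT Region, Sub_region, "
    "ROUND(AVG(Pct_managed_drinking_water_services), 2) AS avg_pct, "
    "ROUND(SUM(Est_gdp_in_billions), 2) AS total_gdp "
    "FROM Access_to_basic_services "
    "GROUP BY Region, Sub_region"
).fetchall()

print(report)
```

Rounding is applied after aggregation here, so the underlying `AVG` and `SUM` are still computed on the unrounded values.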


<div align="center" style=" font-size: 80%; text-align: center; margin: 0 auto">
<img src="https://raw.githubusercontent.com/Explore-AI/Pictures/master/ExploreAI_logos/EAI_Blue_Dark.png"  style="width:200px";/>
</div>