<div align="right" style=" font-size: 80%; text-align: center; margin: 0 auto">
<img src="https://raw.githubusercontent.com/Explore-AI/Pictures/master/ExploreAI_logos/Logo blue_dark.png"  style="width:25px" align="right"/>
</div>

# Recreating the Access_to_Basic_Services dataset 
© ExploreAI Academy

In this notebook, we cover how Entity-Relationship Diagrams (ERDs) help us understand database joins. We also focus on the `LEFT JOIN` technique and highlight the importance of picking the right joining strategy, since an incorrect join can produce inaccurate results.



> ⚠️ This notebook will not run on Google Colab because it cannot connect to a local database. Please make sure that this notebook is running on the same local machine as your MySQL Workbench installation and MySQL `united_nations` database.

## Learning objectives

By the end of this train, you will:
- Understand how Entity-Relationship Diagrams can help us understand database joins better. 
- Understand the `LEFT JOIN` technique and how it is used to combine tables.
- Know the importance of picking the right joining strategy and how incorrect joins can lead to inaccurate results.


## Overview

Entity-Relationship Diagrams (ERDs) play a valuable role in determining the table relationships and join strategies within a database. They show how the tables relate to one another, so we can make informed decisions about which tables to join and which columns to join them on.


Let’s recall our `united_nations` ERD, which has three entities: `Geographic_Location`, `Basic_Services`, and `Economic_Indicators`.

<img src="https://raw.githubusercontent.com/Explore-AI/Pictures/master/Northwind_ERD.png" alt= "united_nations ERD" width="60%" height="60%">

One common joining technique involves selecting a central table that serves as the core of all relationships in the database and employing a `LEFT JOIN`. 
In our case, the `Geographic_Location` table would be the central table.  

With a `LEFT JOIN`, all records from the left table are returned, along with the corresponding matching records from the right table. In cases where there is no match, the result will include NULL values on the right side.
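
The behaviour described above can be tried without a MySQL server. The following is a minimal sketch, assuming Python's built-in `sqlite3` module and two tiny stand-in tables; `Exampleland` is a made-up country added purely to show an unmatched row, and the tables are not the real `united_nations` data.

```python
import sqlite3

# Hypothetical miniature tables in an in-memory SQLite database (not the real dataset).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Geographic_Location (Country_name TEXT, Region TEXT);
    CREATE TABLE Economic_Indicators (
        Country_name TEXT, Time_period INTEGER, Est_gdp_in_billions REAL
    );
    INSERT INTO Geographic_Location VALUES
        ('Afghanistan', 'Central and Southern Asia'),
        ('Exampleland', 'Made-up region');  -- invented: has no economic rows
    INSERT INTO Economic_Indicators VALUES
        ('Afghanistan', 2015, 20.0),
        ('Afghanistan', 2016, 18.02);
""")

# Every row of the left table is kept; unmatched rows get NULL on the right side.
rows = conn.execute("""
    SELECT geo.Country_name, econ.Time_period, econ.Est_gdp_in_billions
    FROM Geographic_Location AS geo
    LEFT JOIN Economic_Indicators AS econ
        ON geo.Country_name = econ.Country_name
    ORDER BY geo.Country_name, econ.Time_period
""").fetchall()

for row in rows:
    print(row)
```

The made-up country still appears in the result, with `None` (SQL `NULL`) in the columns that come from the right-hand table.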

## Connecting to our MySQL database

We'll start by connecting to the `united_nations` database. To connect to the MySQL server, run the cells below.

In [1]:
# Load and activate the SQL extension to allow us to execute SQL in a Jupyter notebook. 
# If you get an error here, make sure that mysql and pymysql are installed correctly. 

%load_ext sql

In [2]:
# Establish a connection to the local database using the '%sql' magic command.
# Replace 'password' with your own connection password.
# If you get an error here, please make sure the database name and password are correct.

%sql mysql+pymysql://root:password@localhost:3306/united_nations

'Connected: root@united_nations'
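
Typing a password directly into a notebook cell makes it easy to share by accident. One possible alternative, sketched below, builds the connection URL from an environment variable and percent-encodes the password so that special characters do not break the URL. The helper `build_mysql_url` and the variable name `MYSQL_PASSWORD` are hypothetical names chosen for this illustration, not part of any library.

```python
import os
from urllib.parse import quote_plus


def build_mysql_url(user, password, host="localhost", port=3306,
                    database="united_nations"):
    """Return a mysql+pymysql URL with user and password percent-encoded.

    Hypothetical helper for illustration only.
    """
    return (
        f"mysql+pymysql://{quote_plus(user)}:{quote_plus(password)}"
        f"@{host}:{port}/{database}"
    )


# MYSQL_PASSWORD is an assumed environment variable name; it falls back to
# the placeholder 'password' when the variable is not set.
password = os.environ.get("MYSQL_PASSWORD", "password")
connection_url = build_mysql_url("root", password)
```

In a notebook, the resulting string could then be supplied to the `%sql` magic in place of the hard-coded URL.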

We'll then use a simple `SELECT` query to fetch all records from the `Geographic_Location` table.

In [3]:
%%sql
SELECT 
	* 
FROM 
	united_nations.Geographic_Location as geo
LIMIT 5;

 * mysql+pymysql://root:***@localhost:3306/united_nations
5 rows affected.


Country_name,Sub_region,Region,Land_area
Afghanistan,Southern Asia,Central and Southern Asia,652230.0
Algeria,Northern Africa,Northern Africa and Western Asia,2381741.0
American Samoa,Polynesia,Oceania,200.0
Angola,Middle Africa,Sub-Saharan Africa,1246700.0
Anguilla,Caribbean,Latin America and the Caribbean,


## Exercise


### 1. First `LEFT JOIN`

Combine the `Geographic_Location` table with the `Economic_Indicators` table based on the `Country_name` column. 

In [6]:
%%sql
# LEFT JOIN keeps every Geographic_Location row, whether or not it has a match.
# Add a LIMIT clause at the end of the query if you expect a large result set.
SELECT
    *
FROM
    Geographic_Location
LEFT JOIN
    Economic_Indicators
ON
    Economic_Indicators.Country_name = Geographic_Location.Country_name;

 * mysql+pymysql://root:***@localhost:3306/united_nations
1048 rows affected.

In [9]:
%%sql

SELECT
    *
FROM
    Basic_Services
LIMIT
    5;

 * mysql+pymysql://root:***@localhost:3306/united_nations
5 rows affected.


Country_name,Time_period,Pct_managed_drinking_water_services,Pct_managed_sanitation_services
Afghanistan,2015,67.0,45.67
Afghanistan,2016,69.67,47.0
Afghanistan,2017,72.33,49.33
Afghanistan,2018,75.33,50.67
Afghanistan,2019,78.0,52.33


### 2. Second `LEFT JOIN`

Combine the previously joined tables with the `Basic_Services` table, again based on the `Country_name` column.

In [11]:
%%sql
# Add your code here
SELECT
    *
FROM
    Geographic_Location
LEFT JOIN
    Economic_Indicators
ON
    Economic_Indicators.Country_name = Geographic_Location.Country_name
LEFT JOIN
    Basic_Services
ON
    Geographic_Location.Country_name = Basic_Services.Country_name
# Use LIMIT if you expect a large result set
#LIMIT 21;

 * mysql+pymysql://root:***@localhost:3306/united_nations
6156 rows affected.


Country_name,Sub_region,Region,Land_area,Country_name_1,Time_period,Est_gdp_in_billions,Est_population_in_millions,Pct_unemployment,Country_name_2,Time_period_1,Pct_managed_drinking_water_services,Pct_managed_sanitation_services
Afghanistan,Southern Asia,Central and Southern Asia,652230.0,Afghanistan,2015,20.0,33.753499,,Afghanistan,2015,67.0,45.67
Afghanistan,Southern Asia,Central and Southern Asia,652230.0,Afghanistan,2015,20.0,33.753499,,Afghanistan,2016,69.67,47.0
Afghanistan,Southern Asia,Central and Southern Asia,652230.0,Afghanistan,2015,20.0,33.753499,,Afghanistan,2017,72.33,49.33
Afghanistan,Southern Asia,Central and Southern Asia,652230.0,Afghanistan,2015,20.0,33.753499,,Afghanistan,2018,75.33,50.67
Afghanistan,Southern Asia,Central and Southern Asia,652230.0,Afghanistan,2015,20.0,33.753499,,Afghanistan,2019,78.0,52.33
Afghanistan,Southern Asia,Central and Southern Asia,652230.0,Afghanistan,2015,20.0,33.753499,,Afghanistan,2020,80.33,54.0
Afghanistan,Southern Asia,Central and Southern Asia,652230.0,Afghanistan,2016,18.02,34.636207,,Afghanistan,2015,67.0,45.67
Afghanistan,Southern Asia,Central and Southern Asia,652230.0,Afghanistan,2016,18.02,34.636207,,Afghanistan,2016,69.67,47.0
Afghanistan,Southern Asia,Central and Southern Asia,652230.0,Afghanistan,2016,18.02,34.636207,,Afghanistan,2017,72.33,49.33
Afghanistan,Southern Asia,Central and Southern Asia,652230.0,Afghanistan,2016,18.02,34.636207,,Afghanistan,2018,75.33,50.67


In [15]:
%%sql
# Add your code here
SELECT
    *
FROM
    Economic_Indicators

 * mysql+pymysql://root:***@localhost:3306/united_nations
1048 rows affected.


Country_name,Time_period,Est_gdp_in_billions,Est_population_in_millions,Pct_unemployment
Afghanistan,2015,20.0,33.753499,
Afghanistan,2016,18.02,34.636207,
Afghanistan,2017,18.9,35.643418,11.18
Afghanistan,2018,18.42,36.686784,
Afghanistan,2019,18.9,37.769499,
Afghanistan,2020,20.14,38.97223,11.71
Algeria,2015,165.98,39.543154,11.21
Algeria,2016,160.03,40.339329,10.2
Algeria,2017,170.1,41.136546,12.0
Algeria,2018,174.91,41.927007,


In [17]:
%%sql
#Add your code here
SELECT
    *
FROM
    Geographic_Location

 * mysql+pymysql://root:***@localhost:3306/united_nations
182 rows affected.


Country_name,Sub_region,Region,Land_area
Afghanistan,Southern Asia,Central and Southern Asia,652230.0
Algeria,Northern Africa,Northern Africa and Western Asia,2381741.0
American Samoa,Polynesia,Oceania,200.0
Angola,Middle Africa,Sub-Saharan Africa,1246700.0
Anguilla,Caribbean,Latin America and the Caribbean,
Antigua and Barbuda,Caribbean,Latin America and the Caribbean,440.0
Argentina,South America,Latin America and the Caribbean,2736690.0
Armenia,Western Asia,Northern Africa and Western Asia,28470.0
Aruba,Caribbean,Latin America and the Caribbean,180.0
Australia,Australia and New Zealand,Oceania,7690400.0


### 3. Refine the second `LEFT JOIN`

At first glance, the results of the above query might seem fine, but take a closer look at the two `Time_period` columns. They don't align: each `Economic_Indicators` row is paired with every `Basic_Services` row for the same country, regardless of year, which inflates the result to 6156 rows.

Refine the second `LEFT JOIN` query by adding a condition on the `Time_period` column.

In [24]:
%%sql
# Match Basic_Services on both country and year so that each row describes a single year.
# Add a LIMIT clause at the end of the query if you expect a large result set.
SELECT
    *
FROM
    Geographic_Location
LEFT JOIN
    Economic_Indicators
ON
    Economic_Indicators.Country_name = Geographic_Location.Country_name
LEFT JOIN
    Basic_Services
ON
    Geographic_Location.Country_name = Basic_Services.Country_name
    AND
    Economic_Indicators.Time_period = Basic_Services.Time_period;

 * mysql+pymysql://root:***@localhost:3306/united_nations
1048 rows affected.

## Solutions

### 1. First `LEFT JOIN`

In [12]:
%%sql

SELECT 
	* 
FROM 
	united_nations.Geographic_Location as geo 
LEFT JOIN 
	united_nations.Economic_Indicators as econ 	
	ON geo.Country_name = econ.Country_name
LIMIT 50;

 * mysql+pymysql://root:***@localhost:3306/united_nations
50 rows affected.


Country_name,Sub_region,Region,Land_area,Country_name_1,Time_period,Est_gdp_in_billions,Est_population_in_millions,Pct_unemployment
Afghanistan,Southern Asia,Central and Southern Asia,652230.0,Afghanistan,2015,20.0,33.753499,
Afghanistan,Southern Asia,Central and Southern Asia,652230.0,Afghanistan,2016,18.02,34.636207,
Afghanistan,Southern Asia,Central and Southern Asia,652230.0,Afghanistan,2017,18.9,35.643418,11.18
Afghanistan,Southern Asia,Central and Southern Asia,652230.0,Afghanistan,2018,18.42,36.686784,
Afghanistan,Southern Asia,Central and Southern Asia,652230.0,Afghanistan,2019,18.9,37.769499,
Afghanistan,Southern Asia,Central and Southern Asia,652230.0,Afghanistan,2020,20.14,38.97223,11.71
Algeria,Northern Africa,Northern Africa and Western Asia,2381741.0,Algeria,2015,165.98,39.543154,11.21
Algeria,Northern Africa,Northern Africa and Western Asia,2381741.0,Algeria,2016,160.03,40.339329,10.2
Algeria,Northern Africa,Northern Africa and Western Asia,2381741.0,Algeria,2017,170.1,41.136546,12.0
Algeria,Northern Africa,Northern Africa and Western Asia,2381741.0,Algeria,2018,174.91,41.927007,


With this `LEFT JOIN`, we get every record from the `Geographic_Location` table and only the matching records from the `Economic_Indicators` table. If a country has no match, its `Geographic_Location` data is still returned, and the columns from the `Economic_Indicators` table are `NULL`.
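
Because unmatched rows carry `NULL` in the right-hand columns, the same join can also be used to list the countries that have no economic data at all. The sketch below assumes Python's built-in `sqlite3` module and small stand-in tables; `Nodataland` is an invented country used only to produce an unmatched row.

```python
import sqlite3

# Hypothetical stand-in tables in an in-memory SQLite database (not the real dataset).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Geographic_Location (Country_name TEXT, Sub_region TEXT);
    CREATE TABLE Economic_Indicators (
        Country_name TEXT, Time_period INTEGER, Est_gdp_in_billions REAL
    );
    INSERT INTO Geographic_Location VALUES
        ('Algeria', 'Northern Africa'),
        ('Nodataland', 'Made-up sub-region');  -- invented: no economic rows
    INSERT INTO Economic_Indicators VALUES
        ('Algeria', 2015, 165.98);
""")

# Keep only the left-table rows for which the join found no partner.
unmatched = conn.execute("""
    SELECT geo.Country_name
    FROM Geographic_Location AS geo
    LEFT JOIN Economic_Indicators AS econ
        ON geo.Country_name = econ.Country_name
    WHERE econ.Country_name IS NULL
""").fetchall()

print(unmatched)
```

Filtering on `IS NULL` for the join column of the right table is one common way to check how many left-table rows found no match.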


### 2. Second `LEFT JOIN`

In [13]:
%%sql

SELECT 
	* 
FROM 
	united_nations.Geographic_Location as geo 
LEFT JOIN 
	united_nations.Economic_Indicators as econ 	
	ON geo.Country_name = econ.Country_name 
LEFT JOIN 
	united_nations.Basic_Services as svc 	
	ON geo.Country_name = svc.Country_name
#LIMIT 20;

 * mysql+pymysql://root:***@localhost:3306/united_nations
6156 rows affected.


Country_name,Sub_region,Region,Land_area,Country_name_1,Time_period,Est_gdp_in_billions,Est_population_in_millions,Pct_unemployment,Country_name_2,Time_period_1,Pct_managed_drinking_water_services,Pct_managed_sanitation_services
Afghanistan,Southern Asia,Central and Southern Asia,652230.0,Afghanistan,2015,20.0,33.753499,,Afghanistan,2015,67.0,45.67
Afghanistan,Southern Asia,Central and Southern Asia,652230.0,Afghanistan,2015,20.0,33.753499,,Afghanistan,2016,69.67,47.0
Afghanistan,Southern Asia,Central and Southern Asia,652230.0,Afghanistan,2015,20.0,33.753499,,Afghanistan,2017,72.33,49.33
Afghanistan,Southern Asia,Central and Southern Asia,652230.0,Afghanistan,2015,20.0,33.753499,,Afghanistan,2018,75.33,50.67
Afghanistan,Southern Asia,Central and Southern Asia,652230.0,Afghanistan,2015,20.0,33.753499,,Afghanistan,2019,78.0,52.33
Afghanistan,Southern Asia,Central and Southern Asia,652230.0,Afghanistan,2015,20.0,33.753499,,Afghanistan,2020,80.33,54.0
Afghanistan,Southern Asia,Central and Southern Asia,652230.0,Afghanistan,2016,18.02,34.636207,,Afghanistan,2015,67.0,45.67
Afghanistan,Southern Asia,Central and Southern Asia,652230.0,Afghanistan,2016,18.02,34.636207,,Afghanistan,2016,69.67,47.0
Afghanistan,Southern Asia,Central and Southern Asia,652230.0,Afghanistan,2016,18.02,34.636207,,Afghanistan,2017,72.33,49.33
Afghanistan,Southern Asia,Central and Southern Asia,652230.0,Afghanistan,2016,18.02,34.636207,,Afghanistan,2018,75.33,50.67


### 3. Refine second `LEFT JOIN`

In [25]:
%%sql

SELECT 
	* 
FROM 
	united_nations.Geographic_Location as geo 
LEFT JOIN 
	united_nations.Economic_Indicators as econ 	
	ON geo.Country_name = econ.Country_name 
LEFT JOIN 
	united_nations.Basic_Services as svc 	
	ON geo.Country_name = svc.Country_name
	AND econ.Time_period = svc.Time_period
#LIMIT 20;

 * mysql+pymysql://root:***@localhost:3306/united_nations
1048 rows affected.


Country_name,Sub_region,Region,Land_area,Country_name_1,Time_period,Est_gdp_in_billions,Est_population_in_millions,Pct_unemployment,Country_name_2,Time_period_1,Pct_managed_drinking_water_services,Pct_managed_sanitation_services
Afghanistan,Southern Asia,Central and Southern Asia,652230.0,Afghanistan,2015,20.0,33.753499,,Afghanistan,2015,67.0,45.67
Afghanistan,Southern Asia,Central and Southern Asia,652230.0,Afghanistan,2016,18.02,34.636207,,Afghanistan,2016,69.67,47.0
Afghanistan,Southern Asia,Central and Southern Asia,652230.0,Afghanistan,2017,18.9,35.643418,11.18,Afghanistan,2017,72.33,49.33
Afghanistan,Southern Asia,Central and Southern Asia,652230.0,Afghanistan,2018,18.42,36.686784,,Afghanistan,2018,75.33,50.67
Afghanistan,Southern Asia,Central and Southern Asia,652230.0,Afghanistan,2019,18.9,37.769499,,Afghanistan,2019,78.0,52.33
Afghanistan,Southern Asia,Central and Southern Asia,652230.0,Afghanistan,2020,20.14,38.97223,11.71,Afghanistan,2020,80.33,54.0
Algeria,Northern Africa,Northern Africa and Western Asia,2381741.0,Algeria,2015,165.98,39.543154,11.21,Algeria,2015,92.0,85.0
Algeria,Northern Africa,Northern Africa and Western Asia,2381741.0,Algeria,2016,160.03,40.339329,10.2,Algeria,2016,93.0,85.33
Algeria,Northern Africa,Northern Africa and Western Asia,2381741.0,Algeria,2017,170.1,41.136546,12.0,Algeria,2017,93.0,84.67
Algeria,Northern Africa,Northern Africa and Western Asia,2381741.0,Algeria,2018,174.91,41.927007,,Algeria,2018,93.0,84.67


With the additional condition, the `Time_period` values from `Economic_Indicators` and `Basic_Services` align, and the result drops from 6156 rows to 1048.
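
The drop in row count can be reproduced on a toy scale. The sketch below assumes Python's built-in `sqlite3` module and three invented rows per table for a single country; it counts the joined rows with and without the `Time_period` condition.

```python
import sqlite3

# Hypothetical stand-in tables: one country with three years in each table.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Geographic_Location (Country_name TEXT);
    CREATE TABLE Economic_Indicators (Country_name TEXT, Time_period INTEGER);
    CREATE TABLE Basic_Services (Country_name TEXT, Time_period INTEGER);
    INSERT INTO Geographic_Location VALUES ('Afghanistan');
    INSERT INTO Economic_Indicators VALUES
        ('Afghanistan', 2015), ('Afghanistan', 2016), ('Afghanistan', 2017);
    INSERT INTO Basic_Services VALUES
        ('Afghanistan', 2015), ('Afghanistan', 2016), ('Afghanistan', 2017);
""")

# Join on country only; the second ON clause is left open so that
# an extra condition can be appended to it.
BASE_QUERY = """
    SELECT COUNT(*)
    FROM Geographic_Location AS geo
    LEFT JOIN Economic_Indicators AS econ
        ON geo.Country_name = econ.Country_name
    LEFT JOIN Basic_Services AS svc
        ON geo.Country_name = svc.Country_name
"""

# Country only: every economic year pairs with every services year.
country_only = conn.execute(BASE_QUERY).fetchone()[0]

# Country and year: each economic year pairs with its own services year.
country_and_year = conn.execute(
    BASE_QUERY + " AND econ.Time_period = svc.Time_period"
).fetchone()[0]

print(country_only, country_and_year)
```

With three years on each side, joining on the country alone yields 3 × 3 = 9 rows, while adding the year condition yields 3, which mirrors the multiplication seen in the full dataset.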



## Summary

This notebook showed how Entity-Relationship Diagrams can help us understand database joins better. We focused on the `LEFT JOIN` technique, which is widely used to combine tables, and we saw why picking the right joining strategy matters: joining on the country alone multiplied the rows, while joining on both the country and the time period gave the correct result.

#  

<div align="center" style=" font-size: 80%; text-align: center; margin: 0 auto">
<img src="https://raw.githubusercontent.com/Explore-AI/Pictures/master/ExploreAI_logos/EAI_Blue_Dark.png"  style="width:200px"/>
</div>