<div align="right" style=" font-size: 80%; text-align: center; margin: 0 auto">
<img src="https://raw.githubusercontent.com/Explore-AI/Pictures/master/ExploreAI_logos/Logo blue_dark.png"  style="width:25px" align="right";/>
</div>

# OUR DATA-DRIVEN JOURNEY IN MAJI NDOGO USING SQL
© ExploreAI Academy



## Introduction

In this first part of the integrated project, we dive into Maji Ndogo's expansive dataset, which contains just over 60,000 records spread across several tables. As we navigate this trove of data, we'll use basic queries to familiarise ourselves with the contents of each table in the database. We'll also use SQL **Data Manipulation Language (DML)** to correct some data points along the way.

## Connecting to our MySQL database


In [43]:
# Load and activate the SQL extension to allow us to execute SQL in a Jupyter notebook. 
# If you get an error here, make sure that mysql and pymysql are installed correctly. 

%load_ext sql

The sql extension is already loaded. To reload it, use:
  %reload_ext sql


In [45]:
# Establish a connection to the local database using the '%sql' magic command.
# Replace 'password' with our connection password and `db_name` with our database name. 
# If you get an error here, please make sure the database name or password is correct.

%sql mysql+pymysql://root:password@localhost:3306/md_water_services

## Familiarising Ourselves With the Data

Let's start by reviewing the first few records of each table to get a high-level overview of what our data looks like. First, let's list the tables in Maji Ndogo's database.

In [46]:
%sql SHOW TABLES

   mysql+pymysql://root:***@localhost:3306/
 * mysql+pymysql://root:***@localhost:3306/md_water_services
13 rows affected.


Tables_in_md_water_services
auditor_report
combined_anallysis_table
data_dictionary
employee
global_water_access
incorrect_records
incorrect_tables
location
project_progress
visits


The query returns 13 tables, although the listing above shows only 10 of them. Let's see what each of these tables contains, starting with the data_dictionary table.

In [50]:
%sql SELECT * FROM data_dictionary LIMIT 10;

   mysql+pymysql://root:***@localhost:3306/
 * mysql+pymysql://root:***@localhost:3306/md_water_services
10 rows affected.


table_name,column_name,description,datatype,related_to
employee,assigned_employee_id,Unique ID assigned to each employee,INT,visits
employee,employee_name,Name of the employee,VARCHAR(255),
employee,phone_number,Contact number of the employee,VARCHAR(15),
employee,email,Email address of the employee,VARCHAR(255),
employee,address,Residential address of the employee,VARCHAR(255),
employee,town_name,Name of the town where the employee resides,VARCHAR(255),
employee,province_name,Name of the province where the employee resides,VARCHAR(255),
employee,position,Position or job title of the employee,VARCHAR(255),
visits,record_id,Unique ID assigned to each visit,int,"water_quality, water_source"
visits,location_id,ID of the location visited,varchar(255),location


We notice that the data dictionary describes every column of each table in the database. To get the column names of a specific table, along with a description of each column, we can run a query like the one below.

In [20]:
%sql SELECT column_name, description, datatype, related_to FROM data_dictionary WHERE table_name = 'employee';

 * mysql+pymysql://root:***@localhost:3306/md_water_services
8 rows affected.


column_name,description,datatype,related_to
assigned_employee_id,Unique ID assigned to each employee,INT,visits
employee_name,Name of the employee,VARCHAR(255),
phone_number,Contact number of the employee,VARCHAR(15),
email,Email address of the employee,VARCHAR(255),
address,Residential address of the employee,VARCHAR(255),
town_name,Name of the town where the employee resides,VARCHAR(255),
province_name,Name of the province where the employee resides,VARCHAR(255),
position,Position or job title of the employee,VARCHAR(255),


The information above tells us that the employee table has 8 columns, one of which seems to be a primary key that is related to another table: assigned_employee_id is used to reference information in the visits table. We can also retrieve the names of the tables that have relationships to other tables by running a query like this one.

In [21]:
%%sql
# Retrieve related tables
SELECT DISTINCT table_name
FROM data_dictionary
WHERE related_to != "";

 * mysql+pymysql://root:***@localhost:3306/md_water_services
5 rows affected.


table_name
employee
visits
water_source
well_pollution
location


We can see that, according to the data_dictionary table, only 5 tables have recorded relationships to other tables. With the data_dictionary table as our map and the md_water_services database as our landscape, we now know how to navigate our data. We can go ahead and view the first few rows of every table except data_dictionary, which we already know is a reference for the real data in the database. You can run the query below multiple times, changing the table name after the FROM clause each time, and it should display the first 10 records of each table/entity along with their attributes.

In [24]:
%sql SELECT * FROM employee LIMIT 10;

 * mysql+pymysql://root:***@localhost:3306/md_water_services
10 rows affected.


assigned_employee_id,employee_name,phone_number,email,address,province_name,town_name,position
0,Amara Jengo,99637993287,amara.jengo@ndogowater.gov,36 Pwani Mchangani Road,Sokoto,Ilanga,Field Surveyor
1,Bello Azibo,99643864786,bello.azibo@ndogowater.gov,129 Ziwa La Kioo Road,Kilimani,Rural,Field Surveyor
2,Bakari Iniko,99222599041,bakari.iniko@ndogowater.gov,18 Mlima Tazama Avenue,Hawassa,Rural,Field Surveyor
3,Malachi Mavuso,99945849900,malachi.mavuso@ndogowater.gov,100 Mogadishu Road,Akatsi,Lusaka,Field Surveyor
4,Cheche Buhle,99381679640,cheche.buhle@ndogowater.gov,1 Savanna Street,Akatsi,Rural,Field Surveyor
5,Zuriel Matembo,99034075111,zuriel.matembo@ndogowater.gov,26 Bahari Ya Faraja Road,Kilimani,Rural,Field Surveyor
6,Deka Osumare,99379364631,deka.osumare@ndogowater.gov,104 Kenyatta Street,Akatsi,Rural,Field Surveyor
7,Lalitha Kaburi,99681623240,lalitha.kaburi@ndogowater.gov,145 Sungura Amanpour Road,Kilimani,Rural,Field Surveyor
8,Enitan Zuri,99248509202,enitan.zuri@ndogowater.gov,117 Kampala Road,Hawassa,Zanzibar,Field Surveyor
10,Farai Nia,99570082739,farai.nia@ndogowater.gov,33 Angélique Kidjo Avenue,Amanzi,Dahabu,Field Surveyor
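Instead of editing the table name by hand for each preview, the same idea can be expressed as a loop. The following is a minimal sketch, not the notebook's own method: it assumes an in-memory SQLite database as a stand-in for the MySQL server, uses a few rows copied from the outputs in this notebook, and the helper name `preview_tables` is hypothetical.

```python
import sqlite3

# Assumption: an in-memory SQLite database stands in for md_water_services.
# The rows are copied from outputs shown in this notebook; the real tables
# have more columns and many more records.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employee (assigned_employee_id INTEGER, employee_name TEXT);
    INSERT INTO employee VALUES (0, 'Amara Jengo'), (1, 'Bello Azibo');

    CREATE TABLE water_source (
        source_id TEXT,
        type_of_water_source TEXT,
        number_of_people_served INTEGER
    );
    INSERT INTO water_source VALUES
        ('AkRu05704224', 'shared_tap', 3398),
        ('AmRu14612224', 'shared_tap', 3118),
        ('HaRu19538224', 'shared_tap', 3142);
""")


def preview_tables(connection, limit=10):
    """Return {table_name: first `limit` rows} for every table (hypothetical helper)."""
    # sqlite_master is SQLite-specific; on MySQL the table list would come
    # from SHOW TABLES instead.
    names = [
        row[0]
        for row in connection.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        )
    ]
    previews = {}
    for name in names:
        # Table names cannot be bound as query parameters, so only names read
        # from the database catalogue itself are interpolated here.
        previews[name] = connection.execute(
            f'SELECT * FROM "{name}" LIMIT ?', (limit,)
        ).fetchall()
    return previews


previews = preview_tables(conn, limit=10)
for table, rows in previews.items():
    print(table, rows)
```

The row limit is passed as a bound parameter, while the table name is taken only from the database's own catalogue, which keeps the loop safe from accidental SQL injection.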


## Diving Into Water Sources

Now that we are familiar with what each entity in our database contains, we can dive deeper into specific aspects of our database. A good starting point is understanding the types of water sources recorded in the database. To get that information, we can inspect the water_source table.

In [25]:
%sql SELECT DISTINCT type_of_water_source FROM water_source;

 * mysql+pymysql://root:***@localhost:3306/md_water_services
5 rows affected.


type_of_water_source
tap_in_home
tap_in_home_broken
well
shared_tap
river


We can see that we have 5 unique types of water sources recorded in our database. Understanding what each of these types means is essential for producing reports that support data-driven decision-making.

## Unpacking the Visits to Water Sources

The visits entity logs information about a water source each time that source is visited. From our data exploration above, we also noticed that this entity has a time_in_queue attribute. Let's experiment and retrieve the records from this entity WHERE time_in_queue > 500.

%%sql 
SELECT * 
FROM md_water_services.visits 
WHERE time_in_queue > 500 
ORDER BY time_in_queue DESC;


We can further investigate which type_of_water_source has such a long time_in_queue. To do this, let's take the first three source_ids from the result of the visits query and look them up in the water_source entity.


In [27]:
%%sql
SELECT 
    source_id,
    type_of_water_source,
    number_of_people_served
FROM water_source
WHERE source_id IN ("AmRu14612224", "HaRu19538224", "AkRu05704224");

 * mysql+pymysql://root:***@localhost:3306/md_water_services
3 rows affected.


source_id,type_of_water_source,number_of_people_served
AkRu05704224,shared_tap,3398
AmRu14612224,shared_tap,3118
HaRu19538224,shared_tap,3142


We can see that these are water sources of the type shared_tap, each serving more than 3000 people. Keep in mind that, according to the project description, some other sources were visited more than once by the surveyors to see whether time_in_queue had changed.
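Copying source_ids by hand from one result into another query works for three rows, but a join can do the same lookup in one step. The sketch below is an illustration under stated assumptions: it uses an in-memory SQLite database rather than the MySQL server, the water_source rows are taken from the output above, and the time_in_queue values in the toy visits table are invented for the example.

```python
import sqlite3

# Assumption: in-memory SQLite stand-in for md_water_services.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE water_source (
        source_id TEXT PRIMARY KEY,
        type_of_water_source TEXT,
        number_of_people_served INTEGER
    );
    -- Rows copied from the water_source output shown in this notebook
    INSERT INTO water_source VALUES
        ('AkRu05704224', 'shared_tap', 3398),
        ('AmRu14612224', 'shared_tap', 3118),
        ('HaRu19538224', 'shared_tap', 3142);

    CREATE TABLE visits (
        record_id INTEGER PRIMARY KEY,
        source_id TEXT,
        time_in_queue INTEGER
    );
    -- time_in_queue values below are invented for illustration only
    INSERT INTO visits VALUES
        (1, 'AkRu05704224', 533),
        (2, 'AmRu14612224', 120),
        (3, 'HaRu19538224', 509);
""")

# Join each long-queue visit to its water source instead of pasting IDs.
long_queues = conn.execute(
    """
    SELECT v.source_id,
           v.time_in_queue,
           ws.type_of_water_source,
           ws.number_of_people_served
    FROM visits AS v
    JOIN water_source AS ws ON ws.source_id = v.source_id
    WHERE v.time_in_queue > ?
    ORDER BY v.time_in_queue DESC
    """,
    (500,),
).fetchall()

for row in long_queues:
    print(row)
```

The SELECT statement itself uses only standard SQL, so it should carry over to MySQL with little change, though that has not been verified against the real schema.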

## Assessing the Quality of Water Sources

Now that we've explored the visits made to various water sources, we can narrow our analysis down to the quality of those sources, which is the whole point of the survey. Recall that our database has a water_quality entity, which holds the quality scores assigned by surveyors.

In [49]:
%%sql
# Retrieve records of good water sources (score of 10) that were visited twice
SELECT *
FROM water_quality
WHERE subjective_quality_score = 10
AND visit_count = 2
LIMIT 10;

   mysql+pymysql://root:***@localhost:3306/
 * mysql+pymysql://root:***@localhost:3306/md_water_services
10 rows affected.


record_id,subjective_quality_score,visit_count
59,10,2
137,10,2
269,10,2
363,10,2
378,10,2
618,10,2
752,10,2
801,10,2
819,10,2
850,10,2


## Investigating Pollution Issues

Finally, let's investigate the pollution issues as per Maji Ndogo's database. There's a table with recorded pollution/contamination data of wells in Maji Ndogo. Let's have a quick look at that table.

In [34]:
%sql SELECT * FROM well_pollution LIMIT 10;

 * mysql+pymysql://root:***@localhost:3306/md_water_services
10 rows affected.


source_id,date,description,pollutant_ppm,biological,results
KiRu28935224,2021-01-04 09:17:00,Bacteria: Giardia Lamblia,0.0,495.898,Contaminated: Biological
AkLu01628224,2021-01-04 09:53:00,Bacteria: E. coli,0.0,6.09608,Contaminated: Biological
HaZa21742224,2021-01-04 10:37:00,"Inorganic contaminants: Zinc, Zinc, Lead, Cadmium",2.715,0.0,Contaminated: Chemical
HaRu19725224,2021-01-04 11:04:00,Clean,0.0288593,9.56996e-05,Clean
SoRu35703224,2021-01-04 11:29:00,Bacteria: E. coli,0.0,22.5009,Contaminated: Biological
AkHa00070224,2021-01-04 11:42:00,Inorganic contaminants: Cadmium,5.46739,0.0,Contaminated: Chemical
HaSe21346224,2021-01-04 11:52:00,Clean,0.0140376,8.98989e-05,Clean
HaYa21468224,2021-01-04 12:03:00,"Inorganic contaminants: Chromium, Barium, Chromium, Lead",6.05137,0.0,Contaminated: Chemical
SoRu36278224,2021-01-04 12:24:00,Parasite: Cryptosporidium,0.0,485.162,Contaminated: Biological
AkLu02155224,2021-01-04 12:29:00,"Inorganic contaminants: Selenium, Arsenic",7.64106,0.0,Contaminated: Chemical



Looking at the results column, we can see that some wells are Clean, some are Contaminated: Biological, and others are Contaminated: Chemical. Each record includes a source_id, which lets us link the results to the sources in other tables in the database.

As per the project description, the description column is free text recorded by scientists, which will be challenging to process. The biological column is numeric, measured in CFU/mL, and indicates how much biological contamination is in the water: 0 is clean, but anything more than 0.01 is contaminated.

We can check the integrity of the records to make sure that no contaminated well has been wrongly marked as Clean, which might mislead people into consuming contaminated, unsafe water.

In [35]:
%%sql
SELECT * 
FROM well_pollution
WHERE results = "Clean"
AND biological > 0.01;

 * mysql+pymysql://root:***@localhost:3306/md_water_services
0 rows affected.


source_id,date,description,pollutant_ppm,biological,results


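On the original data it seems that, in some cases, if the description field begins with the word "Clean", the results column has also been set to "Clean", even though the biological value is > 0.01. (The output above shows 0 rows, most likely because it was captured after the corrections later in this notebook had already been applied.) Let's dive deeper into the cause of the issue with the biological contamination data.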

As per the project specifications, descriptions should only have the word "Clean" if there is no biological contamination (and no chemical pollutants). This means that we need to find and remove the "Clean" part from all the descriptions that do have biological contamination, so that this mistake is not made again.

The second issue has arisen from this error, but it is much more problematic. Some wells have been marked as "Clean" in the results column because the description had the word "Clean" in it, even though they have a biological contamination. So we need to find all the results that have a value greater than 0.01 in the biological column and have been set to "Clean" in the results column.

First, let's look at the descriptions. We need to identify the records that mistakenly have the word "Clean" in the description. However, it is important to remember that not all of the field surveyors used the description to set the results - some checked the actual data.

In [47]:
%%sql
# Retrieve all records with erroneous descriptions
SELECT *
FROM well_pollution
WHERE description LIKE "Clean_%";

   mysql+pymysql://root:***@localhost:3306/
 * mysql+pymysql://root:***@localhost:3306/md_water_services
0 rows affected.


source_id,date,description,pollutant_ppm,biological,results


On the original data, this query returned 38 wrong descriptions. Now we need to fix these descriptions so that we don't encounter this issue again in the future. Looking at the results, we can see two different descriptions that we need to fix:

- All records that mistakenly have Clean Bacteria: E. coli should be updated to Bacteria: E. coli
- All records that mistakenly have Clean Bacteria: Giardia Lamblia should be updated to Bacteria: Giardia Lamblia

The second issue we need to fix is in our results column. We need to update the results column from "Clean" to "Contaminated: Biological" where the biological column has a value greater than 0.01.

NOTE: The query below 👇🏾 should only be run once, as the changes persist in the database. Keep that in mind when running queries in the notebook environment, in case you restart the kernel and run all cells.

In [48]:
%%sql
# Update all the erroneous values in the description and results attributes
UPDATE
    well_pollution
SET
    description = "Bacteria: E. coli"
WHERE
    description = "Clean Bacteria: E. coli";

UPDATE
    well_pollution
SET
    description = "Bacteria: Giardia Lamblia"
WHERE
    description = "Clean Bacteria: Giardia Lamblia";

UPDATE
    well_pollution
SET
    results = "Contaminated: Biological"
WHERE
    biological > 0.01 AND results = "Clean";

   mysql+pymysql://root:***@localhost:3306/
 * mysql+pymysql://root:***@localhost:3306/md_water_services
0 rows affected.
0 rows affected.
0 rows affected.


[]
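Because these updates change the data permanently, one cautious pattern is to run them inside a single transaction and commit only if a follow-up check finds no remaining inconsistencies. The sketch below illustrates that pattern under stated assumptions: it uses an in-memory SQLite database instead of the MySQL server, the rows are invented to reproduce the two error patterns described above, and `apply_corrections` is a hypothetical helper name. How transactions behave through the `%sql` magic depends on its autocommit settings, which are not covered here.

```python
import sqlite3

# Assumption: in-memory SQLite stand-in; the rows are invented to mimic
# the two error patterns described in this section.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE well_pollution (
        source_id TEXT,
        description TEXT,
        biological REAL,
        results TEXT
    );
    INSERT INTO well_pollution VALUES
        ('toy_1', 'Clean Bacteria: E. coli', 6.1, 'Clean'),
        ('toy_2', 'Clean Bacteria: Giardia Lamblia', 495.9, 'Contaminated: Biological'),
        ('toy_3', 'Clean', 0.0001, 'Clean');
""")

# Counts rows that still show either problem: a description that starts
# with "Clean" followed by more text, or a contaminated well marked Clean.
CHECK_SQL = """
    SELECT COUNT(*) FROM well_pollution
    WHERE description LIKE 'Clean_%'
       OR (results = 'Clean' AND biological > 0.01)
"""


def apply_corrections(connection):
    """Run the three updates atomically (hypothetical helper)."""
    # Used as a context manager, the connection commits if the block
    # succeeds and rolls back if an exception is raised inside it.
    with connection:
        connection.execute(
            "UPDATE well_pollution SET description = 'Bacteria: E. coli' "
            "WHERE description = 'Clean Bacteria: E. coli'"
        )
        connection.execute(
            "UPDATE well_pollution SET description = 'Bacteria: Giardia Lamblia' "
            "WHERE description = 'Clean Bacteria: Giardia Lamblia'"
        )
        connection.execute(
            "UPDATE well_pollution SET results = 'Contaminated: Biological' "
            "WHERE biological > 0.01 AND results = 'Clean'"
        )
        remaining = connection.execute(CHECK_SQL).fetchone()[0]
        if remaining:
            raise RuntimeError(f"{remaining} inconsistent rows remain")
    return remaining


before = conn.execute(CHECK_SQL).fetchone()[0]
after = apply_corrections(conn)
fixed_rows = conn.execute(
    "SELECT source_id, description, results FROM well_pollution ORDER BY source_id"
).fetchall()
print(before, after, fixed_rows)
```

Running `apply_corrections` a second time leaves the toy table unchanged, since none of the WHERE clauses match any rows after the first run.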


<div align="center" style=" font-size: 80%; text-align: center; margin: 0 auto">
<img src="https://raw.githubusercontent.com/Explore-AI/Pictures/master/ExploreAI_logos/EAI_Blue_Dark.png"  style="width:200px";/>
</div>