# Clustering data to unveil Maji Ndogo's water crisis
Maji Ndogo: From analysis to action

We will be working with a database of 60,000 records, meticulously collected by our devoted team of engineers, field workers, scientists, and analysts.
We need to make sense of this immense data trove: extract meaningful insights, breathe life into these records, and listen to the story they are telling us.

# Overview:

#### 1. Get to know our data:
Before we do anything else, let's take a good look at our data. We'll load up the database and pull up the first few records from each table. It's like getting to know a new city: we need to explore the lay of the land before we can start our journey.

#### 2. Dive into the water sources: 
We've got a whole table dedicated to the types of water sources in our database. Let's dig into it and figure out all the unique types of water sources we're dealing with.

#### 3. Unpack the visits to water sources: 
The 'visits' table in our database is like a logbook of all the trips made to different water sources. We need to unravel this logbook to understand the frequency and distribution of these visits. Let's identify which locations have been visited more than a certain number of times.

#### 4. Assess the quality of water sources: 
The quality of water sources is a pretty big deal. We'll turn to the water_quality table to find records where the subjective_quality_score is within a certain range and the visit_count is above a certain threshold. This should help us spot the water sources that are frequently visited and have a decent quality score.

#### 5. Investigate any pollution issues: 
We can't overlook the pollution status of our water sources. Let's use the well_pollution table to find the wells whose results came back as contaminated, and check that the recorded results agree with the measurements. This will help us flag the areas that need immediate attention.

### Connecting to our MySQL database

Since we have a MySQL database, we can connect to it from the notebook using the SQL magic extension together with the pymysql driver.

In [1]:
# Load and activate the SQL extension to allow us to execute SQL in a Jupyter notebook.

%load_ext sql

# Establish a connection to the local database using the '%sql' magic command.
# Replace 'password' with our connection password; 'md_water_services' is our database name.

%sql mysql+pymysql://root:password@localhost:3306/md_water_services
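
Rather than typing the password into the notebook, the connection string can be assembled in Python first. The sketch below is one possible approach, not part of the original workflow: the helper `build_connection_url` and the environment variable name `MD_WATER_DB_PASSWORD` are hypothetical. The password is URL-encoded so that special characters such as `!` or `@` do not break the URL.

```python
import os
from urllib.parse import quote_plus


def build_connection_url(user, password, host, port, database):
    """Return a mysql+pymysql connection URL with the password URL-encoded."""
    return f"mysql+pymysql://{user}:{quote_plus(password)}@{host}:{port}/{database}"


# MD_WATER_DB_PASSWORD is a hypothetical environment variable name (an assumption).
# The fallback value 'password' is only a placeholder.
password = os.environ.get("MD_WATER_DB_PASSWORD", "password")
connection_url = build_connection_url(
    "root", password, "localhost", 3306, "md_water_services"
)

# Print the URL shape without revealing the real password.
print(build_connection_url("root", "***", "localhost", 3306, "md_water_services"))
```

The resulting `connection_url` string could then be handed to the `%sql` magic in place of the literal connection string.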

# 1. Get to know our data:
    
Before we do anything else, let's take a good look at our data. We'll load up the database and pull up the first few records from each table. It's like getting to know a new city: we need to explore the lay of the land before we can start our journey.


Let's start by listing the tables in the database:

In [7]:
%%sql

SHOW TABLES;

 * mysql+pymysql://root:***@localhost:3306/md_water_services
8 rows affected.


Tables_in_md_water_services
data_dictionary
employee
global_water_access
location
visits
water_quality
water_source
well_pollution


Now let's have a look at one of these tables, starting with location. We use SELECT * to return every column, name the table in the FROM clause, and use LIMIT to restrict the output to the first 10 rows.

In [8]:
%%sql

SELECT *
FROM location
LIMIT 10;

 * mysql+pymysql://root:***@localhost:3306/md_water_services
10 rows affected.


location_id,address,province_name,town_name,location_type
AkHa00000,2 Addis Ababa Road,Akatsi,Harare,Urban
AkHa00001,10 Addis Ababa Road,Akatsi,Harare,Urban
AkHa00002,9 Addis Ababa Road,Akatsi,Harare,Urban
AkHa00003,139 Addis Ababa Road,Akatsi,Harare,Urban
AkHa00004,17 Addis Ababa Road,Akatsi,Harare,Urban
AkHa00005,125 Addis Ababa Road,Akatsi,Harare,Urban
AkHa00006,98 Addis Ababa Road,Akatsi,Harare,Urban
AkHa00007,21 Addis Ababa Road,Akatsi,Harare,Urban
AkHa00008,11 Addis Ababa Road,Akatsi,Harare,Urban
AkHa00009,6 Addis Ababa Road,Akatsi,Harare,Urban


So we can see that this table has information on specific locations: an address, the province and town the location is in, and whether it is in a city (Urban) or not. We can't really tell which place each row describes, but each one has an identifying location_id.

A data dictionary has been embedded into the database. If you query the data_dictionary table, an explanation of each column is given there. We can also use it to look up other tables.


# 2. Dive into the water sources:

Now that you're familiar with the structure of the tables, let's dive deeper. We need to understand the types of water sources we're dealing with. So, we write a SQL query to find all the unique types of water sources from the water_source table.


In [9]:
%%sql

SELECT
    DISTINCT type_of_water_source
FROM 
    water_source
LIMIT 10;

 * mysql+pymysql://root:***@localhost:3306/md_water_services
5 rows affected.


type_of_water_source
tap_in_home
tap_in_home_broken
well
shared_tap
river


Water source types:

River - People collect drinking water along a river. This is an open water source that millions of people use in Maji Ndogo. Water from a river has a high risk of being contaminated with biological and other pollutants, so it is the worst source of water possible.

Well - These sources draw water from underground sources, and are commonly shared by communities. Since these are closed water sources, contamination is much less likely compared to a river. Unfortunately, due to the aging infrastructure and the corruption of officials in the past, many of our wells are not clean.

Shared tap - This is a tap in a public area shared by communities.

Tap in home - These are taps that are inside the homes of our citizens. On average about 6 people live together in Maji Ndogo, so each of these taps serves about 6 people.

Broken tap in home - These are taps that have been installed in a citizen’s home, but the infrastructure connected to that tap is not functional. This can be due to burst pipes, broken pumps or water treatment plants that are not working.


#### An important note on the home taps: 
About 6-10 million people have running water installed in their homes in Maji Ndogo, including broken taps. If we were to document this, we would have a row of data for each home, so that one record is one tap. That means our database would contain about 1 million rows of data, which may slow our systems down. For now, the surveyors combined the data of many households together into a single record.

For example, the first record, AkHa00000224, is for a tap_in_home that serves 956 people. What this means is that the records of about 160 homes nearby were combined into one record, with an average of 6 people living in each house (160 × 6 ≈ 956). So 1 tap_in_home or tap_in_home_broken record actually refers to multiple households, with the sum of the people living in these homes equal to number_of_people_served.
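
The household estimate can be checked with a quick calculation. This is a minimal sketch: the function name `estimated_homes` is hypothetical, and the household size of 6 is the average stated above.

```python
PEOPLE_PER_HOME = 6  # average household size stated in the text


def estimated_homes(people_served, people_per_home=PEOPLE_PER_HOME):
    """Estimate how many households were combined into a single record."""
    return people_served / people_per_home


homes = estimated_homes(956)
print(f"956 people / {PEOPLE_PER_HOME} per home = {homes:.1f} homes")
print(f"160 homes x {PEOPLE_PER_HOME} people = {160 * PEOPLE_PER_HOME} people")
```

The quotient comes out a little under 160, which is why the text describes the record as covering "about 160 homes".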


# 3. Unpack the visits to water sources:

We have a table in our database (visits) that logs the visits made to different water sources.

We write an SQL query that retrieves all records from this table where the time_in_queue is some crazy time, say 500 minutes or more. How would it feel to queue for more than 8 hours for water?


In [10]:
%%sql

SELECT *
FROM
    visits
WHERE
    time_in_queue >= 500
LIMIT 10;

 * mysql+pymysql://root:***@localhost:3306/md_water_services
10 rows affected.


record_id,location_id,source_id,time_of_record,visit_count,time_in_queue,assigned_employee_id
899,SoRu35083,SoRu35083224,2021-01-16 10:14:00,6,515,28
2304,SoKo33124,SoKo33124224,2021-02-06 07:53:00,5,512,16
2315,KiRu26095,KiRu26095224,2021-02-06 14:32:00,3,529,8
3206,SoRu38776,SoRu38776224,2021-02-20 15:03:00,5,509,46
3701,HaRu19601,HaRu19601224,2021-02-27 12:53:00,3,504,0
4154,SoRu38869,SoRu38869224,2021-03-06 10:44:00,2,533,24
5483,AmRu14089,AmRu14089224,2021-03-27 18:15:00,4,509,12
9177,SoRu37635,SoRu37635224,2021-05-22 18:48:00,2,515,1
9648,SoRu36096,SoRu36096224,2021-05-29 11:24:00,2,533,3
11631,AkKi00881,AkKi00881224,2021-06-26 06:15:00,6,502,32


To investigate how this is possible, we need another table that lists the types of water sources. The water_source table has a type_of_water_source column and a source_id column. So we write down a couple of these source_id values from our results and search for them in the water_source table.

AkKi00881224; AkLu01628224; HaRu19601224; SoRu36096224; SoRu37635224; SoRu38776224

If we just select the first couple of records of the visits table without a WHERE filter, we can see that some rows have a queue time of 0 minutes. So we write down one or two of these too.
I chose these two:
AkRu05234224; HaZa21742224


In [13]:
%%sql
SELECT *
FROM 
    water_source
WHERE 
    source_id IN ('AkKi00881224', 'AkLu01628224', 'HaRu19601224', 'SoRu36096224', 'SoRu37635224', 'SoRu38776224', 'AkRu05234224', 'HaZa21742224')
LIMIT 10;

 * mysql+pymysql://root:***@localhost:3306/md_water_services
8 rows affected.


source_id,type_of_water_source,number_of_people_served
AkKi00881224,shared_tap,3398
AkLu01628224,well,210
AkRu05234224,tap_in_home_broken,496
HaRu19601224,shared_tap,3322
HaZa21742224,well,308
SoRu36096224,shared_tap,3786
SoRu37635224,shared_tap,3920
SoRu38776224,shared_tap,3180


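If we check these against the visits results, we can see which sources have people queueing: every looked-up source that appears in the long-queue results above (for example AkKi00881224 and HaRu19601224) is a shared tap. The field surveyors measured sources that had queues a few times to see if the queue time changed.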
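
Copying source_id values by hand works, but the same lookup can be done in one step by joining the two tables on source_id. The sketch below illustrates the idea with an in-memory SQLite database standing in for MySQL (an assumption made so the example is self-contained). The rows are copied from the outputs above, with the two zero-queue sources given a queue time of 0 as described in the text, and only the columns needed for the join are included.

```python
import sqlite3

# In-memory SQLite stand-in for the MySQL database (assumption for illustration).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE visits (source_id TEXT, time_in_queue INTEGER);
    CREATE TABLE water_source (
        source_id TEXT, type_of_water_source TEXT, number_of_people_served INTEGER
    );
    INSERT INTO visits VALUES
        ('SoRu38776224', 509), ('HaRu19601224', 504), ('SoRu37635224', 515),
        ('SoRu36096224', 533), ('AkKi00881224', 502),
        ('AkRu05234224', 0),   ('HaZa21742224', 0);
    INSERT INTO water_source VALUES
        ('AkKi00881224', 'shared_tap', 3398),
        ('AkRu05234224', 'tap_in_home_broken', 496),
        ('HaRu19601224', 'shared_tap', 3322),
        ('HaZa21742224', 'well', 308),
        ('SoRu36096224', 'shared_tap', 3786),
        ('SoRu37635224', 'shared_tap', 3920),
        ('SoRu38776224', 'shared_tap', 3180);
""")

# Join each long-queue visit to its source to see the source type directly.
long_queues = conn.execute("""
    SELECT ws.type_of_water_source, v.source_id, v.time_in_queue
    FROM visits AS v
    JOIN water_source AS ws ON ws.source_id = v.source_id
    WHERE v.time_in_queue >= 500
    ORDER BY v.time_in_queue DESC, v.source_id
""").fetchall()

for source_type, source_id, minutes in long_queues:
    print(source_type, source_id, minutes)
```

The SELECT statement itself uses only standard SQL, so it should carry over to a `%%sql` cell largely unchanged.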

# 4. Assess the quality of water sources:
The quality of our water sources is the whole point of this survey. The `water_quality` table contains a quality score for each visit made to a water source, assigned by a field surveyor. They scored each source from 1, being terrible, to 10 for a good, clean water source in a home. Shared taps are not rated as high, and the score also depends on how long the queue times are.

The surveyors only made multiple visits to shared taps and did not revisit other types of water sources. So there should be no records of second visits to locations where there are good water sources, like taps in homes. Let's check if this is true.

So we write a query to find records where the subjective_quality_score is 10 (only looking for home taps) and where the source was visited a second time.


In [21]:
%%sql

SELECT *
FROM
    water_quality
WHERE
    subjective_quality_score = 10 
    AND
    visit_count = 2;

 * mysql+pymysql://root:***@localhost:3306/md_water_services
218 rows affected.


record_id,subjective_quality_score,visit_count
59,10,2
137,10,2
269,10,2
363,10,2
378,10,2
618,10,2
752,10,2
801,10,2
819,10,2
850,10,2


I get 218 rows of data, but this should not be happening! It means that some of the employees may have made mistakes, so our data does contain some errors. We will come back to this later when we have the audited dataset!
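
One way to dig further would be to join these suspicious records back to their sources and count them by source type. The sketch below shows the shape of such a query on an in-memory SQLite stand-in. Every row in it is hypothetical demo data, not taken from the survey, and it assumes that record_id identifies the same visit in both water_quality and visits.

```python
import sqlite3

# In-memory SQLite stand-in; ALL rows below are hypothetical demo data.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE water_quality (
        record_id INTEGER, subjective_quality_score INTEGER, visit_count INTEGER
    );
    CREATE TABLE visits (record_id INTEGER, source_id TEXT);
    CREATE TABLE water_source (source_id TEXT, type_of_water_source TEXT);
    INSERT INTO water_quality VALUES (1, 10, 2), (2, 10, 1), (3, 6, 2), (4, 10, 2);
    INSERT INTO visits VALUES
        (1, 'demo_home_1'), (2, 'demo_home_2'),
        (3, 'demo_shared_1'), (4, 'demo_home_3');
    INSERT INTO water_source VALUES
        ('demo_home_1', 'tap_in_home'), ('demo_home_2', 'tap_in_home'),
        ('demo_shared_1', 'shared_tap'), ('demo_home_3', 'tap_in_home');
""")

# Count second visits with a perfect score, grouped by source type.
# Assumes record_id links water_quality to visits (an assumption).
suspicious = conn.execute("""
    SELECT ws.type_of_water_source, COUNT(*) AS suspicious_records
    FROM water_quality AS wq
    JOIN visits AS v ON v.record_id = wq.record_id
    JOIN water_source AS ws ON ws.source_id = v.source_id
    WHERE wq.subjective_quality_score = 10
      AND wq.visit_count = 2
    GROUP BY ws.type_of_water_source
    ORDER BY ws.type_of_water_source
""").fetchall()

print(suspicious)
```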

# 5. Investigate pollution issues:
We recorded contamination/pollution data for all of the well sources.


Let’s print the first few records of the `well_pollution` table.


In [22]:
%%sql

SELECT *
FROM
    well_pollution
LIMIT 5;

 * mysql+pymysql://root:***@localhost:3306/md_water_services
5 rows affected.


source_id,date,description,pollutant_ppm,biological,results
KiRu28935224,2021-01-04 09:17:00,Bacteria: Giardia Lamblia,0.0,495.898,Contaminated: Biological
AkLu01628224,2021-01-04 09:53:00,Bacteria: E. coli,0.0,6.09608,Contaminated: Biological
HaZa21742224,2021-01-04 10:37:00,"Inorganic contaminants: Zinc, Zinc, Lead, Cadmium",2.715,0.0,Contaminated: Chemical
HaRu19725224,2021-01-04 11:04:00,Clean,0.0288593,9.56996e-05,Clean
SoRu35703224,2021-01-04 11:29:00,Bacteria: E. coli,0.0,22.5009,Contaminated: Biological


It looks like the scientists diligently recorded the water quality of all the wells. Some are contaminated with biological contaminants, while others are polluted with an excess of heavy metals and other pollutants. Based on the results, each well was classified as Clean, Contaminated: Biological or Contaminated: Chemical. This matters because wells polluted with biological or other contaminants are not safe to drink from. The source_id of each test was also recorded, so we can link each result to a source somewhere in Maji Ndogo.


In the well_pollution table, the descriptions are notes taken by our scientists as free text, so they will be challenging to process. The biological column is in units of CFU/mL, so it measures how much biological contamination is in the water. 0 is clean, and anything more than 0.01 is contaminated.

Let's check the integrity of the data. The worst case is if we have contamination, but we think we don't.

Let’s write a query that checks if the result is Clean but the biological column is > 0.01.


In [24]:
%%sql

SELECT *
FROM
    well_pollution
WHERE
    results = 'Clean'
    AND
    biological > 0.01
LIMIT 10;

 * mysql+pymysql://root:***@localhost:3306/md_water_services
10 rows affected.


source_id,date,description,pollutant_ppm,biological,results
AkRu08936224,2021-01-08 09:22:00,Bacteria: E. coli,0.0406458,35.0068,Clean
AkRu06489224,2021-01-10 09:44:00,Clean Bacteria: Giardia Lamblia,0.0897904,38.467,Clean
SoRu38011224,2021-01-14 15:35:00,Bacteria: E. coli,0.0425095,19.2897,Clean
AkKi00955224,2021-01-22 12:47:00,Bacteria: E. coli,0.0812092,40.2273,Clean
KiHa22929224,2021-02-06 13:54:00,Bacteria: E. coli,0.0722537,18.4482,Clean
KiRu25473224,2021-02-07 15:51:00,Clean Bacteria: Giardia Lamblia,0.0630094,24.4536,Clean
HaRu17401224,2021-03-01 13:44:00,Clean Bacteria: Giardia Lamblia,0.0649209,25.8129,Clean
AkRu07137224,2021-03-04 13:41:00,Clean Bacteria: Giardia Lamblia,0.0656843,18.2978,Clean
KiRu27205224,2021-03-13 14:17:00,Clean Bacteria: Giardia Lamblia,0.0418018,49.4281,Clean
AkLu02307224,2021-03-13 15:41:00,Bacteria: E. coli,0.0709682,35.203,Clean


It seems that, in some cases, if the description field begins with the word “Clean”, the results have been classified as “Clean” in the results column, even though the biological column is > 0.01.

The descriptions should only have the word “Clean” if there is no biological contamination (and no chemical pollutants). Some data personnel must have copied the data from the scientists' notes into our database incorrectly.

We need to find and remove the “Clean” part from all the descriptions that do have biological contamination so this mistake is not made again.

Because of that stray word, some wells were marked as Clean in the results column even though they have biological contamination.

So we also need to find all the records that have a value greater than 0.01 in the biological column and have been set to Clean in the results column.


### Looking at the results we can see two different descriptions that we need to fix:
All records that mistakenly have Clean Bacteria: E. coli should be updated to Bacteria: E. coli

All records that mistakenly have Clean Bacteria: Giardia Lamblia should be updated to Bacteria: Giardia Lamblia

The second issue we need to fix is in our results column. We need to update the results column from Clean to Contaminated: Biological where the biological column has a value greater than 0.01.


In [26]:
%%sql

UPDATE
    well_pollution
SET
    description = 'Bacteria: E. coli'
WHERE
    description = 'Clean Bacteria: E. coli';

UPDATE
    well_pollution
SET
    description = 'Bacteria: Giardia Lamblia'
WHERE
    description = 'Clean Bacteria: Giardia Lamblia';

UPDATE
    well_pollution
SET
    results = 'Contaminated: Biological'
WHERE
    biological > 0.01 AND results = 'Clean';

 * mysql+pymysql://root:***@localhost:3306/md_water_services
26 rows affected.
12 rows affected.
64 rows affected.


[]
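
After running updates like these, it is worth re-running the integrity check to confirm that no contaminated well is still labelled Clean. The sketch below rehearses the three updates on an in-memory SQLite stand-in (an assumption made so the example is self-contained). Most rows are copied from the outputs above with only the relevant columns kept, and the `demo_well_1` row is hypothetical, added so that each update has something to change.

```python
import sqlite3

# In-memory SQLite stand-in for well_pollution (assumption for illustration).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE well_pollution "
    "(source_id TEXT, description TEXT, biological REAL, results TEXT)"
)
conn.executemany(
    "INSERT INTO well_pollution VALUES (?, ?, ?, ?)",
    [
        ("AkRu06489224", "Clean Bacteria: Giardia Lamblia", 38.467, "Clean"),
        ("AkRu08936224", "Bacteria: E. coli", 35.0068, "Clean"),
        ("HaRu19725224", "Clean", 9.56996e-05, "Clean"),
        ("KiRu28935224", "Bacteria: Giardia Lamblia", 495.898, "Contaminated: Biological"),
        ("demo_well_1", "Clean Bacteria: E. coli", 12.0, "Clean"),  # hypothetical row
    ],
)

INTEGRITY_CHECK = (
    "SELECT COUNT(*) FROM well_pollution "
    "WHERE results = 'Clean' AND biological > 0.01"
)
before = conn.execute(INTEGRITY_CHECK).fetchone()[0]

# The same three updates as in the notebook cell.
updates = [
    "UPDATE well_pollution SET description = 'Bacteria: E. coli' "
    "WHERE description = 'Clean Bacteria: E. coli'",
    "UPDATE well_pollution SET description = 'Bacteria: Giardia Lamblia' "
    "WHERE description = 'Clean Bacteria: Giardia Lamblia'",
    "UPDATE well_pollution SET results = 'Contaminated: Biological' "
    "WHERE biological > 0.01 AND results = 'Clean'",
]
for statement in updates:
    conn.execute(statement)

after = conn.execute(INTEGRITY_CHECK).fetchone()[0]
# Genuinely clean wells (biological <= 0.01) must keep their Clean result.
still_clean = conn.execute(
    "SELECT COUNT(*) FROM well_pollution WHERE results = 'Clean'"
).fetchone()[0]

print(f"mislabelled before: {before}, after: {after}, still clean: {still_clean}")
```

If the same check is run against the real table after the updates, a count of zero would suggest the fix covered every mislabelled row.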