## Water Restoration in Maji Ndogo

This is the first phase of the water restoration project in Maji Ndogo. The water survey has been carried out and the data is stored in our database. The following are the key steps in this first phase as we unravel the project.

1: Get to know our data

2: Dive into the water sources

3: Unpack the visits to water sources

4: Assess the quality of water sources

5: Investigate any pollution issues with the water sources

### Connecting to our database

In [2]:
# A side note on some of the packages that were installed for this to run appropriately.
#pip install sqlalchemy
#pip install pymysql
#pip install eralchemy
#pip install graphviz

In [3]:
# Load and activate the SQL extension to allow us to execute SQL in a Jupyter notebook.
%load_ext sql

# Establish a connection to the local database using the '%sql' magic command.
%sql mysql+pymysql://root:Blessings@localhost:3306/md_water_services

### Getting to know our data

In [4]:
%%sql

# Find out all the tables present in the database
SHOW TABLES;

Tables_in_md_water_services
data_dictionary
employee
global_water_access
location
visits
water_quality
water_source
well_pollution


In [5]:
# There is a data dictionary included with the tables, 
# so we can get a lot of information about the columns in each table in the database from this


%config SqlMagic.displaylimit = None # reset display limit to None to permit a full display 
                                    # of the data_dictionary
%sql SELECT * FROM data_dictionary; # taking a look at the data_dictionary table


table_name,column_name,description,datatype,related_to
employee,assigned_employee_id,Unique ID assigned to each employee,INT,visits
employee,employee_name,Name of the employee,VARCHAR(255),
employee,phone_number,Contact number of the employee,VARCHAR(15),
employee,email,Email address of the employee,VARCHAR(255),
employee,address,Residential address of the employee,VARCHAR(255),
employee,town_name,Name of the town where the employee resides,VARCHAR(255),
employee,province_name,Name of the province where the employee resides,VARCHAR(255),
employee,position,Position or job title of the employee,VARCHAR(255),
visits,record_id,Unique ID assigned to each visit,int,"water_quality, water_source"
visits,location_id,ID of the location visited,varchar(255),location


In [6]:
# reset display limit back to 10
%config SqlMagic.displaylimit = 10

In [7]:
%%sql

# taking a look at the location table
SELECT *
FROM location
LIMIT 5;

location_id,address,province_name,town_name,location_type
AkHa00000,2 Addis Ababa Road,Akatsi,Harare,Urban
AkHa00001,10 Addis Ababa Road,Akatsi,Harare,Urban
AkHa00002,9 Addis Ababa Road,Akatsi,Harare,Urban
AkHa00003,139 Addis Ababa Road,Akatsi,Harare,Urban
AkHa00004,17 Addis Ababa Road,Akatsi,Harare,Urban


In [8]:
%%sql

# checking if the location_id is unique for each entry
SELECT COUNT(location_id),
    COUNT(DISTINCT location_id)
FROM location;

COUNT(location_id),COUNT(DISTINCT location_id)
39650,39650


The location table has information on each specific location: an address, the province and town the location is in, and whether it is Urban or not. Each location has a unique identifier, location_id.

In [9]:
%%sql

# taking a look at the visits table
SELECT *
FROM  visits
LIMIT 5;

record_id,location_id,source_id,time_of_record,visit_count,time_in_queue,assigned_employee_id
0,SoIl32582,SoIl32582224,2021-01-01 09:10:00,1,15,12
1,KiRu28935,KiRu28935224,2021-01-01 09:17:00,1,0,46
2,HaRu19752,HaRu19752224,2021-01-01 09:36:00,1,62,40
3,AkLu01628,AkLu01628224,2021-01-01 09:53:00,1,0,1
4,AkRu03357,AkRu03357224,2021-01-01 10:11:00,1,28,14


In [10]:
%%sql

#checking if the source_id is unique for each entry
SELECT COUNT(DISTINCT source_id) AS total_distinct_sourceid,
    COUNT(source_id) AS total_sourceid
FROM  visits
;

total_distinct_sourceid,total_sourceid
39650,60146


There are 39650 unique locations, as evidenced by the location_ids, and 39650 unique source_ids, so it looks like each location contains one source. However, there are 60146 records in the visits table, which implies that some source_ids were visited more than once.

In [11]:
# examine which sources or locations were visited more than once
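
One way to examine this, sketched here with Python's built-in sqlite3 module and a small in-memory stand-in for the visits table. The toy rows, their record_ids and the helper name `repeated_sources` are hypothetical, not survey data. The same GROUP BY ... HAVING query should also work in a %%sql cell against md_water_services, though it has not been run there.

```python
import sqlite3

# Hypothetical toy rows standing in for the real visits table.
# Columns: record_id, location_id, source_id, visit_count
toy_visits = [
    (0, "SoIl32582", "SoIl32582224", 1),
    (1, "KiRu28935", "KiRu28935224", 1),
    (2, "AkHa00036", "AkHa00036224", 1),
    (3, "AkHa00036", "AkHa00036224", 2),
    (4, "AkHa00036", "AkHa00036224", 3),
]


def repeated_sources(rows):
    """Return (source_id, number_of_visits) for sources with more than one visit."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE visits ("
        "record_id INTEGER, location_id TEXT, source_id TEXT, visit_count INTEGER)"
    )
    conn.executemany("INSERT INTO visits VALUES (?, ?, ?, ?)", rows)
    query = """
        SELECT source_id, COUNT(*) AS number_of_visits
        FROM visits
        GROUP BY source_id
        HAVING COUNT(*) > 1
        ORDER BY number_of_visits DESC
    """
    result = conn.execute(query).fetchall()
    conn.close()
    return result


print(repeated_sources(toy_visits))
```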



## Dive into the water sources

In [12]:
%%sql

# taking a look at the water_source table
SELECT *
FROM  water_source
LIMIT 5;

source_id,type_of_water_source,number_of_people_served
AkHa00000224,tap_in_home,956
AkHa00001224,tap_in_home_broken,930
AkHa00002224,tap_in_home_broken,486
AkHa00003224,well,364
AkHa00004224,tap_in_home_broken,942


In [13]:
%%sql

#checking if the source_id is unique for each entry
SELECT COUNT(DISTINCT source_id) AS total_distinct_sourceid,
    COUNT(source_id) AS total_sourceid
FROM  water_source
;

total_distinct_sourceid,total_sourceid
39650,39650


In [14]:
%%sql

# checking the different types of water sources, how many of each type exists in maji ndogo,
# and number of people served by each type
SELECT type_of_water_source, 
    COUNT(source_id),
    SUM(number_of_people_served)
FROM  water_source
GROUP BY type_of_water_source
ORDER BY SUM(number_of_people_served) DESC
;

type_of_water_source,COUNT(source_id),SUM(number_of_people_served)
shared_tap,5767,11945272
well,17383,4841724
tap_in_home,7265,4678880
tap_in_home_broken,5856,3799720
river,3379,2362544


The shared_tap serves the largest number of people in Maji Ndogo. By record count, the well looks like the most common type of water source. However, a note from the survey team shows that this comparison is misleading, because many home taps were combined into single records during the recording process. Still, 17383 wells within the community is a lot of wells.
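
To put the totals in proportion, the sketch below recomputes each type's share of the people served, using the figures copied from the query output above. Grouping the working and broken home taps together is a choice made for this sketch only, prompted by the survey team's note. The helper name `share_percent` is likewise an assumption.

```python
# People served per source type, copied from the query output above.
people_served = {
    "shared_tap": 11945272,
    "well": 4841724,
    "tap_in_home": 4678880,
    "tap_in_home_broken": 3799720,
    "river": 2362544,
}

total_people = sum(people_served.values())


def share_percent(count, total=total_people):
    """Return count as a percentage of total, rounded to a whole percent."""
    return round(100 * count / total)


shares = {source: share_percent(n) for source, n in people_served.items()}

# Assumption of this sketch: treat working and broken home taps as one group.
home_taps = people_served["tap_in_home"] + people_served["tap_in_home_broken"]

print(total_people)
print(shares)
print(share_percent(home_taps))
```

On these figures, shared taps serve roughly 43% of the surveyed population and home taps (working or broken) roughly 31%.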

#### A note from the survey team
An important note on the home taps: About 6-10 million people have running water installed in their homes in Maji Ndogo, including broken taps. If we were to document this, we would have a row of data for each home, so that one record is one tap. That means our database would contain about 1 million rows of data, which may slow our systems down. For now, the surveyors combined the data of many households together into a single record.

For example, the first record, AkHa00000224, is for a tap_in_home that serves 956 people. What this means is that the records of about 160 homes nearby were combined into one record, with an average of 6 people living in each house: 160 x 6 = 960 ≈ 956. So 1 tap_in_home or tap_in_home_broken record actually refers to multiple households, with the sum of the people living in these homes equal to number_of_people_served.
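
The arithmetic in the note can be turned around to estimate how many households sit behind a combined record. This is a minimal sketch that assumes the average of 6 people per household quoted above applies to every record; the helper name `estimated_households` is hypothetical.

```python
# Average household size quoted in the survey team's note.
PEOPLE_PER_HOUSEHOLD = 6


def estimated_households(people_served, per_household=PEOPLE_PER_HOUSEHOLD):
    """Estimate the number of households combined into one home-tap record."""
    return round(people_served / per_household)


# number_of_people_served for AkHa00000224, from the water_source output above
print(estimated_households(956))
```

For the 956 people in the first record this gives about 159 households, in line with the "about 160 homes" in the note.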

## Unpack the visits to water sources

In [15]:
%%sql

# retrieving all records from the visits table where the time_in_queue is more than some very
# crazy time span, say 500 min. Imagine queuing 8 hours for water?

SELECT *
FROM visits
WHERE time_in_queue > 500
ORDER BY time_in_queue DESC;

record_id,location_id,source_id,time_of_record,visit_count,time_in_queue,assigned_employee_id
30007,AmRu14612,AmRu14612224,2022-04-02 08:55:00,2,539,8
51858,HaRu19538,HaRu19538224,2023-03-04 18:04:00,3,539,4
53278,AkRu05704,AkRu05704224,2023-03-25 13:48:00,2,539,36
45317,HaRu20126,HaRu20126224,2022-11-19 14:22:00,6,538,16
57408,SoRu35388,SoRu35388224,2023-05-27 08:52:00,5,538,1
20372,KiZu31117,KiZu31117224,2021-11-06 09:37:00,3,537,10
33650,KiRu29348,KiRu29348224,2022-05-28 12:58:00,2,537,10
31310,SoRu37865,SoRu37865224,2022-04-23 06:01:00,2,535,40
38947,SoRu38095,SoRu38095224,2022-08-13 13:48:00,6,535,30
52264,HaRu17383,HaRu17383224,2023-03-11 07:10:00,5,535,30


There are indeed places where citizens wait 539 minutes, almost 9 hours, before getting access to water. That doesn't look good.
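
For a sense of scale, the queue times can be converted from minutes to hours and minutes. This small sketch assumes time_in_queue is recorded in whole minutes; the helper name `minutes_to_hours` is hypothetical.

```python
def minutes_to_hours(minutes):
    """Split a duration in whole minutes into (hours, remaining_minutes)."""
    return divmod(minutes, 60)


# 500 is the threshold used in the query; 539 is the longest wait in its output.
for minutes in (500, 539):
    hours, rest = minutes_to_hours(minutes)
    print(f"{minutes} min = {hours} h {rest} min")
```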

## Assess the quality of water sources

The quality of our water sources is the whole point of this survey. So it is important to look into the quality of water sources.

In [16]:
%%sql

# an overview of the water_quality table
SELECT *
FROM water_quality
;

record_id,subjective_quality_score,visit_count
0,0,1
1,1,1
2,5,1
3,10,1
4,4,1
5,0,1
6,9,1
7,10,1
8,2,1
9,10,1


According to the survey team, each record_id has a subjective quality score, where 1 is poor and 10 is excellent. The team also noted that shared_tap sources were visited more than once. This raises two points to check:

1. Only shared_tap sources should have a visit_count greater than 1.

2. How was the subjective score recorded across the multiple visits?


In [55]:
%%sql

# check to see if there are any other types of sources that have visit_count > 1
SELECT visits.visit_count AS visit_count,
    water_source.type_of_water_source AS source_type  
FROM visits
JOIN water_source
ON visits.source_id = water_source.source_id
WHERE visit_count > 1 AND type_of_water_source <> 'shared_tap'
;

visit_count,source_type


In [34]:
%%sql

# checking to see how the subjective scores were recorded for the multiple visits.
SELECT locationid_multiple_visit.location_id,
    locationid_multiple_visit.record_id,
    water_quality.subjective_quality_score
FROM (SELECT DISTINCT location_id, record_id
      FROM (SELECT record_id, location_id, source_id, visit_count
            FROM visits
            WHERE visit_count > 1) AS multiple_visit) AS locationid_multiple_visit
JOIN water_quality
ON water_quality.record_id = locationid_multiple_visit.record_id
;



location_id,record_id,subjective_quality_score
AkHa00036,50812,3
AkHa00036,50912,3
AkHa00036,50974,3
AkHa00036,50993,3
AkHa00036,51028,3
AkHa00036,51103,3
AkHa00036,51268,3
AkHa00090,29928,5
AkHa00090,30006,5
AkHa00090,30023,5


In [50]:
%%sql

# deliberately using a more elaborate method to check how the subjective scores were recorded
# for the multiple visits (same objective as the previous cell)

WITH locationid_multiple_visit AS (
    # keep the distinct (location_id, record_id) pairs from the repeat visits
    SELECT DISTINCT location_id, record_id
    FROM (  # identify the records where visit_count was more than 1
            SELECT record_id, location_id, source_id, visit_count
            FROM visits
            WHERE visit_count > 1) AS multiple_visit
        ),

    multiple_visit_subjective_quality_score AS(
    #join to the water_quality table to obtain the subjective score
    SELECT locationid_multiple_visit.location_id,
        locationid_multiple_visit.record_id,
        water_quality.subjective_quality_score
    FROM locationid_multiple_visit
    JOIN water_quality
    ON locationid_multiple_visit.record_id = water_quality.record_id
    )

SELECT location_id,
    MAX(subjective_quality_score)-MIN(subjective_quality_score) AS range_of_sub_score
FROM multiple_visit_subjective_quality_score
GROUP BY location_id
ORDER BY range_of_sub_score DESC


location_id,range_of_sub_score
AkHa00036,0
AkHa00090,0
AkHa00103,0
AkHa00137,0
AkHa00158,0
AkHa00236,0
AkHa00259,0
AkHa00263,0
AkHa00373,0
AkHa00460,0


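Finding: The range of scores is 0 for every location, even when sorted from largest to smallest. So the exact same subjective quality score was recorded for each repeat visit (visit_count > 1) made to the locations that were visited multiple times.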
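
The same check can be written more compactly with a single join and GROUP BY. The sketch below runs it with Python's sqlite3 module on a small in-memory stand-in. The AkHa rows reuse record_ids and scores from the output above, but their visit_count values are assumed, and the `Made00001` location is entirely made up to show what a non-zero range would look like. Unlike the cells above, this variant also includes the first visit to each revisited location.

```python
import sqlite3

# (record_id, location_id, visit_count); visit_count values are assumed.
toy_visits = [
    (50812, "AkHa00036", 2),
    (50912, "AkHa00036", 3),
    (50974, "AkHa00036", 4),
    (29928, "AkHa00090", 2),
    (30006, "AkHa00090", 3),
    (900001, "Made00001", 1),  # made-up location, not in the survey
    (900002, "Made00001", 2),
]

# (record_id, subjective_quality_score)
toy_quality = [
    (50812, 3), (50912, 3), (50974, 3),
    (29928, 5), (30006, 5),
    (900001, 4), (900002, 7),  # made-up scores that differ between visits
]


def score_ranges(visits_rows, quality_rows):
    """Return (location_id, max - min score) for locations visited more than once."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE visits (record_id INTEGER, location_id TEXT, visit_count INTEGER);
        CREATE TABLE water_quality (record_id INTEGER, subjective_quality_score INTEGER);
    """)
    conn.executemany("INSERT INTO visits VALUES (?, ?, ?)", visits_rows)
    conn.executemany("INSERT INTO water_quality VALUES (?, ?)", quality_rows)
    query = """
        SELECT v.location_id,
            MAX(q.subjective_quality_score) - MIN(q.subjective_quality_score)
                AS range_of_sub_score
        FROM visits AS v
        JOIN water_quality AS q
        ON q.record_id = v.record_id
        WHERE v.location_id IN (SELECT location_id FROM visits WHERE visit_count > 1)
        GROUP BY v.location_id
        ORDER BY range_of_sub_score DESC, v.location_id
    """
    result = conn.execute(query).fetchall()
    conn.close()
    return result


print(score_ranges(toy_visits, toy_quality))
```

A location whose scores differ between visits would sort to the top with a non-zero range, which is the pattern the real data does not show.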