# Maji Ndogo: From analysis to action

## Clustering data to unveil Maji Ndogo's water crisis

### Introduction

Dear Team,

Our mission, as arduous as it is essential, requires us to delve deeper into our reservoir of data. To truly illuminate the road ahead, we must magnify
our analysis, moving beyond isolated data points to discern larger patterns and trends.

In this next step, we will cluster our data, stepping back from the individual figures to gain a panoramic understanding. This bird's eye view will
allow us to unearth broader narratives and hidden correlations concealed within our rich dataset.

Next, we must pay heed to the different forms of data in our possession. They are not mere numbers or dates; they are stories waiting to be
deciphered. Their unique structure, though challenging, brims with valuable insights. As we process these, we unlock deeper layers of
understanding.

Bear in mind that every piece of information you decipher, every category you determine, brings us one stride closer to our noble goal. It's through
the intricate details and broader brushstrokes of data that we will uncover the solutions to Maji Ndogo's water crisis.

Your unwavering commitment to this mission emboldens me. Together, we continue marching forward, using data and dedication as our compass,
towards a brighter, more secure future for Maji Ndogo.

Thank you for all your tireless efforts.

Warm regards,

Aziza Naledi

Hi Pres. Naledi,

I hope you're doing well. While diving into our recent survey data for the Maji Ndogo water project, our team stumbled upon some inconsistencies
that caught our eye. It's nothing alarming, but we think it's worth a closer look.

Would you consider bringing in an independent auditor to double-check some of the records? I think it's a smart move to ensure everything is on
the up-and-up. After all, we're all about accuracy and trust.

Feel free to reach out if you want to chat more about this or need more details.

Take care,

Chidi Kunto

Hi Chidi,

Thanks for catching that, and for being so attentive to detail. I'm right there with you on this - we want to be sure we're working with the best
information possible.

I'll get an independent auditor on this ASAP. They'll touch base with you and the rest of the team to get things rolling. I've cc'ed everyone so that
we're all on the same page.

Appreciate your diligence, Chidi. Let's keep up the great work. Maji Ndogo is counting on us.

All the best,

Aziza Naledi

Before we start, scan through the data dictionary, and perhaps query a couple of tables to get a feel for the database again.

### Cleaning our data

Ok, bring up the employee table. It has info on all of our workers, but note that the email addresses have not been added. We will have to send
them reports and figures, so let's update it. Luckily the emails for our department are easy: first_name.last_name@ndogowater.gov.

`I am going to guide you through this one, so code along.`

We can determine the email address for each employee by:
- selecting the employee_name column
- replacing the space with a full stop
- making it lowercase
- stitching it all together with the domain

We have to update the database again with these email addresses, so before we do, let's use a SELECT query to get the format right, then use
UPDATE and SET to make the changes.
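Before moving to SQL, the four steps listed above can be sketched in plain Python to make the target format concrete. This is only an illustration: the helper name `make_email` is hypothetical and not part of the database or the notebook, and the SQL functions named in the comments are the ones used in the queries that follow.

```python
def make_email(employee_name: str, domain: str = "ndogowater.gov") -> str:
    """Build first_name.last_name@domain from a 'First Last' name string."""
    # Replace the space with a full stop, as REPLACE(employee_name, ' ', '.') does.
    dotted = employee_name.replace(" ", ".")
    # Make it lowercase, as LOWER() does.
    local_part = dotted.lower()
    # Stitch the name part and the domain together, as CONCAT() does.
    return f"{local_part}@{domain}"


print(make_email("Amara Jengo"))
```

The SQL version does the same thing column-wise, for every row of the employee table at once.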

First up, let's remove the space between the first and last names using REPLACE(). You can try this:

SELECT
    REPLACE(employee_name, ' ', '.') -- Replace the space with a full stop
FROM
    employee;

In [1]:
# Load and activate the SQL extension to allow us to execute SQL in a Jupyter notebook.

%load_ext sql

In [2]:
# Establish a connection to the local database using the '%sql' magic command.
# Replace 'password' with our connection password and `db_name` with our database name.

%sql mysql+pymysql://root:password@localhost:3306/md_water_services

In [4]:
%%sql 

SELECT 
    REPLACE(employee_name, ' ', '.') 
FROM 
    employee
LIMIT 10;

 * mysql+pymysql://root:***@localhost:3306/md_water_services
10 rows affected.


"REPLACE(employee_name, ' ', '.')"
Amara.Jengo
Bello.Azibo
Bakari.Iniko
Malachi.Mavuso
Cheche.Buhle
Zuriel.Matembo
Deka.Osumare
Lalitha.Kaburi
Enitan.Zuri
Farai.Nia


Then we can use `LOWER()` with the result we just got. Now the name part is correct:

SELECT
    LOWER(REPLACE(employee_name, ' ', '.')) -- Make it all lower case
FROM
    employee;

In [7]:
%%sql

SELECT 
    LOWER(REPLACE(employee_name, ' ', '.'))
FROM 
    employee
LIMIT 10;

 * mysql+pymysql://root:***@localhost:3306/md_water_services
10 rows affected.


"LOWER(REPLACE(employee_name, ' ', '.'))"
amara.jengo
bello.azibo
bakari.iniko
malachi.mavuso
cheche.buhle
zuriel.matembo
deka.osumare
lalitha.kaburi
enitan.zuri
farai.nia


We then use CONCAT() to add the rest of the email address:

SELECT
    CONCAT(LOWER(REPLACE(employee_name, ' ', '.')), '@ndogowater.gov') AS new_email -- Add it all together
FROM
    employee;

In [9]:
%%sql

SELECT
    CONCAT(LOWER(REPLACE(employee_name, ' ', '.')), '@ndogowater.gov') AS new_email
FROM
    employee
LIMIT 10;

 * mysql+pymysql://root:***@localhost:3306/md_water_services
10 rows affected.


new_email
amara.jengo@ndogowater.gov
bello.azibo@ndogowater.gov
bakari.iniko@ndogowater.gov
malachi.mavuso@ndogowater.gov
cheche.buhle@ndogowater.gov
zuriel.matembo@ndogowater.gov
deka.osumare@ndogowater.gov
lalitha.kaburi@ndogowater.gov
enitan.zuri@ndogowater.gov
farai.nia@ndogowater.gov


Quick win! Since you have done this before, you can go ahead and UPDATE the email column this time with the email addresses. Just make sure to
check if it worked!

UPDATE employee
SET email = CONCAT(LOWER(REPLACE(employee_name, ' ', '.')), '@ndogowater.gov');

In [11]:
%%sql

UPDATE
    employee
SET email = CONCAT(LOWER(REPLACE(employee_name, ' ', '.')), '@ndogowater.gov');

 * mysql+pymysql://root:***@localhost:3306/md_water_services
56 rows affected.


[]
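The UPDATE reports 56 rows affected, but it is still worth checking the result. Here is a minimal sketch of such a check, run against an in-memory SQLite table rather than the real MySQL database. The three names are taken from the output above, and SQLite's `||` operator is used here as a stand-in for MySQL's CONCAT(), so treat the exact syntax as an assumption.

```python
import sqlite3

# Toy stand-in for the employee table (three rows only; the real table has 56).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employee (employee_name TEXT, email TEXT)")
conn.executemany(
    "INSERT INTO employee (employee_name) VALUES (?)",
    [("Amara Jengo",), ("Bello Azibo",), ("Bakari Iniko",)],
)

# Same logic as the UPDATE above; || stands in for MySQL's CONCAT().
conn.execute(
    "UPDATE employee "
    "SET email = LOWER(REPLACE(employee_name, ' ', '.')) || '@ndogowater.gov'"
)

# Check 1: no row was left without an email address.
missing = conn.execute(
    "SELECT COUNT(*) FROM employee WHERE email IS NULL"
).fetchone()[0]

# Check 2: look at the names next to the generated addresses.
rows = conn.execute("SELECT employee_name, email FROM employee").fetchall()

print("rows without email:", missing)
for name, email in rows:
    print(name, "->", email)
```

On the real database, the equivalent check would be a SELECT of employee_name and email from the employee table after the UPDATE has run.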

I picked up another bit we have to clean up. Often when databases are created and updated, or information is collected from different sources,
errors creep in. For example, the values in the phone_number column are stored as strings, so stray characters can slip in unnoticed.

The phone numbers should be 12 characters long, consisting of the plus sign, area code (99), and the phone number digits. However, when we use
the LENGTH(column) function, it returns 13 characters, indicating there's an extra character:
SELECT
LENGTH(phone_number)
FROM
employee;

In [16]:
%%sql

SELECT 
    LENGTH(phone_number)
FROM 
    employee
LIMIT 1;

 * mysql+pymysql://root:***@localhost:3306/md_water_services
1 rows affected.


LENGTH(phone_number)
13


That's because there is a space at the end of the number! If you try to send an automated SMS to that number it will fail. This happens so often
that SQL has a function specifically for trimming off the space, called TRIM(column).
It removes any leading or trailing spaces from a string.
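To see the effect of TRIM() in isolation, here is a small sketch using an in-memory SQLite table. The phone number is an illustrative value built from the format described above (plus sign, 99, digits, then a trailing space), not a row read from the real database. LENGTH() and TRIM() behave the same way in SQLite and MySQL for this case.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employee (phone_number TEXT)")
# Illustrative value: 12 visible characters followed by one trailing space.
conn.execute("INSERT INTO employee VALUES ('+99643864786 ')")

# Length of the stored string, including the trailing space.
before = conn.execute(
    "SELECT LENGTH(phone_number) FROM employee"
).fetchone()[0]

# Length after TRIM() strips leading and trailing spaces.
after = conn.execute(
    "SELECT LENGTH(TRIM(phone_number)) FROM employee"
).fetchone()[0]

print(before, after)
```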

`Use TRIM() to write a SELECT query again, make sure we get the string without the space, and then UPDATE the record like you just did for the
emails. If you need more information about TRIM(), Google "TRIM documentation MySQL".`

In [18]:
%%sql

UPDATE 
    employee
SET phone_number = TRIM(phone_number);

 * mysql+pymysql://root:***@localhost:3306/md_water_services
56 rows affected.


[]

In [22]:
%%sql

SELECT 
    COUNT(*)
FROM 
    employee
WHERE LENGTH(phone_number) >= 13;

 * mysql+pymysql://root:***@localhost:3306/md_water_services
1 rows affected.


COUNT(*)
0


### Honouring the workers

Before we dive into the analysis, let's get you warmed up a bit!
Let's have a look at where our employees live.

Use the employee table to count how many of our employees live in each town. Think carefully about what function we should use and how we
should aggregate the data.

In [26]:
%%sql

SELECT 
    town_name,
    COUNT(*) AS num_employees
FROM 
    employee
GROUP BY town_name
LIMIT 10;

 * mysql+pymysql://root:***@localhost:3306/md_water_services
9 rows affected.


town_name,num_employees
Ilanga,3
Rural,29
Lusaka,4
Zanzibar,4
Dahabu,6
Kintampo,1
Harare,5
Yaounde,1
Serowe,3


Note how many of our workers are living in smaller communities in the rural parts of Maji Ndogo.

Pres. Naledi congratulated the team for completing the survey, but we would not have this data were it not for our field workers. So let's gather
some data on their performance in this process, so we can thank those who really put all their effort in.

Pres. Naledi has asked we send out an email or message congratulating the top 3 field surveyors. So let's use the database to get the
employee_ids and use those to get the names, email and phone numbers of the three field surveyors with the most location visits.

`Let's first look at the number of records each employee collected. So find the correct table, figure out what function to use and how to group, order
and limit the results to only see the top 3 employee_ids with the highest number of locations visited.`

In [32]:
%%sql

SELECT 
    assigned_employee_id,
    COUNT(*) AS total_visits
FROM 
    visits
GROUP BY 
    assigned_employee_id
ORDER BY 
    total_visits DESC
LIMIT 3;

 * mysql+pymysql://root:***@localhost:3306/md_water_services
3 rows affected.


assigned_employee_id,total_visits
1,3708
30,3676
34,3539


In [44]:
%%sql

WITH top_employees AS (
    SELECT 
        assigned_employee_id,
        COUNT(*) AS visit_count
    FROM 
        visits
    GROUP BY 
        assigned_employee_id
    ORDER BY 
        visit_count DESC
    LIMIT 3
)
SELECT 
    e.assigned_employee_id,
    e.employee_name,
    e.phone_number,
    e.email,
    te.visit_count
FROM 
    employee AS e
JOIN 
    top_employees AS te ON e.assigned_employee_id = te.assigned_employee_id
ORDER BY
    te.visit_count DESC;

 * mysql+pymysql://root:***@localhost:3306/md_water_services
3 rows affected.


assigned_employee_id,employee_name,phone_number,email,visit_count
1,Bello Azibo,99643864786,bello.azibo@ndogowater.gov,3708
30,Pili Zola,99822478933,pili.zola@ndogowater.gov,3676
34,Rudo Imani,99046972648,rudo.imani@ndogowater.gov,3539


I'll send that off to Pres. Naledi. But this survey is not primarily about our employees, so let's get working on the main task! We'll start looking at
some of the tables in the dataset at a larger scale, identify some trends, summarise important data, and draw insights.

### Analysing locations

Looking at the location table, let’s focus on the province_name, town_name and location_type to understand where the water sources are in
Maji Ndogo.

`Create a query that counts the number of records per town`

In [56]:
%%sql

SELECT
    town_name,
    COUNT(town_name) AS number_of_records
FROM
    location
GROUP BY town_name 
ORDER BY number_of_records DESC;

 * mysql+pymysql://root:***@localhost:3306/md_water_services
25 rows affected.


town_name,number_of_records
Rural,23740
Harare,1650
Amina,1090
Lusaka,1070
Mrembo,990
Asmara,930
Dahabu,930
Kintampo,780
Ilanga,780
Isiqalo,770


`Now count the records per province.`

In [57]:
%%sql

SELECT
    province_name,
    COUNT(province_name) AS number_of_records
FROM
    location
GROUP BY province_name
ORDER BY number_of_records DESC;

 * mysql+pymysql://root:***@localhost:3306/md_water_services
5 rows affected.


province_name,number_of_records
Kilimani,9510
Akatsi,8940
Sokoto,8220
Amanzi,6950
Hawassa,6030


From this table, it's pretty clear that most of the water sources in the survey are situated in small rural communities, scattered across Maji Ndogo.
If we count the records for each province, most of them have a similar number of sources, so every province is well-represented in the survey.

Can you find a way to do the following:
1. Create a result set showing:
• province_name
• town_name
• An aggregated count of records for each town (consider naming this records_per_town).
• Ensure your data is grouped by both province_name and town_name.
2. Order your results primarily by province_name. Within each province, further sort the towns by their record counts in descending order.

In [63]:
%%sql

SELECT 
    province_name,
    town_name,
    COUNT(town_name) AS records_per_town
FROM
    location
GROUP BY province_name, town_name
ORDER BY records_per_town DESC;

 * mysql+pymysql://root:***@localhost:3306/md_water_services
31 rows affected.


province_name,town_name,records_per_town
Akatsi,Rural,6290
Kilimani,Rural,5440
Sokoto,Rural,5010
Hawassa,Rural,3900
Amanzi,Rural,3100
Akatsi,Lusaka,1070
Kilimani,Mrembo,990
Amanzi,Asmara,930
Amanzi,Dahabu,930
Kilimani,Harare,850
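Note that the task asks for the results to be ordered by province_name first, and only then by record count within each province, whereas the query above sorts by records_per_town alone. Here is a minimal sketch of the two-level ordering, using an in-memory SQLite table with made-up, scaled-down counts (the province and town names come from the output above; the numbers are hypothetical).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE location (province_name TEXT, town_name TEXT)")

# Hypothetical, scaled-down number of rows per (province, town) pair.
toy_counts = {
    ("Kilimani", "Harare"): 1,
    ("Akatsi", "Lusaka"): 2,
    ("Kilimani", "Rural"): 5,
    ("Akatsi", "Rural"): 6,
    ("Kilimani", "Mrembo"): 3,
}
for (province, town), n in toy_counts.items():
    conn.executemany("INSERT INTO location VALUES (?, ?)", [(province, town)] * n)

# Sort by province first, then by record count (descending) within each province.
rows = conn.execute(
    """
    SELECT province_name, town_name, COUNT(town_name) AS records_per_town
    FROM location
    GROUP BY province_name, town_name
    ORDER BY province_name ASC, records_per_town DESC
    """
).fetchall()

for row in rows:
    print(row)
```

With this ORDER BY clause, all Akatsi towns are listed together, largest first, followed by all Kilimani towns, and so on.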


These results show us that our field surveyors did an excellent job of documenting the status of our country's water crisis. Every province and town
has many documented sources.
This makes me confident that the data we have is reliable enough to base our decisions on. This is an insight we can use to communicate data
integrity, so let's make a note of that.

`Finally, look at the number of records for each location type`

In [69]:
%%sql

SELECT 
    location_type,
    COUNT(location_type) AS num_sources
FROM
    location
GROUP BY location_type
ORDER BY num_sources DESC
LIMIT 10;

 * mysql+pymysql://root:***@localhost:3306/md_water_services
2 rows affected.


location_type,num_sources
Rural,23740
Urban,15910


We can see that there are more rural sources than urban, but it's really hard to understand those numbers. Percentages are more relatable.
If we use SQL as a very overpowered calculator:

SELECT 23740 / (15910 + 23740) * 100;

We can see that about 60% (59.9% before rounding) of all water sources in the data set are in rural communities.

In [72]:
%%sql

SELECT ROUND(23740 / (15910 + 23740) * 100) AS Percentage_in_rural

 * mysql+pymysql://root:***@localhost:3306/md_water_services
1 rows affected.


Percentage_in_rural
60
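Typing the counts in by hand works, but the percentage can also be derived from the table itself, so it stays correct if the data changes. Below is a sketch of that idea using an in-memory SQLite table that recreates only the two counts reported above (23740 rural and 15910 urban rows). The column alias `pct_of_sources` is a name chosen for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE location (location_type TEXT)")
# Recreate only the two counts reported above; no other columns are modelled.
conn.executemany(
    "INSERT INTO location VALUES (?)",
    [("Rural",)] * 23740 + [("Urban",)] * 15910,
)

# Share of each location type as a rounded percentage of all records.
rows = conn.execute(
    """
    SELECT
        location_type,
        ROUND(100.0 * COUNT(*) / (SELECT COUNT(*) FROM location)) AS pct_of_sources
    FROM location
    GROUP BY location_type
    ORDER BY pct_of_sources DESC
    """
).fetchall()

for location_type, pct in rows:
    print(location_type, pct)
```

Multiplying by 100.0 rather than 100 keeps the division in floating point in engines that would otherwise use integer division.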
