## Maji Ndogo: From analysis to action
### Weaving the data threads of Maji Ndogo's narrative

What's Happening


An auditor has reviewed the Maji Ndogo water source database.

Some data inconsistencies were found.
Our job is to validate the integrity of the reported water source data and to find out who made the mistakes and why.


### Understand the relationships in the md_water_services database.

#### SQL/ERD Insight:

* The visits table is central and links:
    * location_id → location
    * source_id → water_source
    * assigned_employee_id → employee
* One-to-many relationships are common:
    * One location → many visits
    * One employee → many visits
* Each visit → one water_quality record, so this relationship is one-to-one

**Fix the ERD if water_quality wrongly shows many-to-one. Use database design tools to correct it.**
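
Before editing the ERD, it can help to confirm that the data really is one-to-one by looking for any record_id that appears more than once in water_quality. The sketch below is a minimal illustration using Python's built-in `sqlite3` module and a few made-up rows as a stand-in for md_water_services; the same `GROUP BY ... HAVING` query should work against the real MySQL database.

```python
import sqlite3

# In-memory toy database; the rows are invented for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE visits (record_id INTEGER PRIMARY KEY, location_id TEXT);
    CREATE TABLE water_quality (record_id INTEGER, subjective_quality_score INTEGER);
    INSERT INTO visits VALUES (1, 'LocA'), (2, 'LocA'), (3, 'LocB');
    INSERT INTO water_quality VALUES (1, 10), (2, 7), (3, 5);
""")

# A record_id that occurs more than once would break the one-to-one assumption.
duplicates = conn.execute("""
    SELECT record_id, COUNT(*) AS n
    FROM water_quality
    GROUP BY record_id
    HAVING COUNT(*) > 1
""").fetchall()

print(duplicates)  # an empty list means no visit has more than one water_quality row
```

If the query returns rows on the real data, the relationship is not one-to-one and the ERD should be left as it is.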

### Load Auditor Report:
#### Create the auditor_report table:

In [1]:
# Load and activate the SQL extension to allow us to execute SQL in a Jupyter notebook. 
# If you get an error here, make sure that mysql and pymysql are installed correctly. 

%load_ext sql

In [None]:
# Establish a connection to the local database using the '%sql' magic command.
# Replace `password` with your connection password and, if your database has a different name,
# replace `md_water_services` with it.
# If you get an error here, please make sure the database name and password are correct.

%sql mysql+pymysql://root:password@localhost:3306/md_water_services
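
If the password contains characters such as `@` or `/`, the connection string has to escape them. The sketch below builds the URL in plain Python with `urllib.parse.quote_plus`. The credentials are hypothetical placeholders. Exporting the URL as `DATABASE_URL` is an assumption based on the extension's "Environment variable $DATABASE_URL not set" error message, which suggests it falls back to that variable when no connection string is given.

```python
import os
from urllib.parse import quote_plus

# Hypothetical credentials for illustration; substitute your own values.
user = "root"
password = "p@ss/word"  # example password containing URL-special characters
host = "localhost"
port = 3306
database = "md_water_services"

# quote_plus escapes characters such as '@' and '/' that would otherwise break the URL.
connection_url = f"mysql+pymysql://{user}:{quote_plus(password)}@{host}:{port}/{database}"
print(connection_url)

# Assumption: the SQL extension reads DATABASE_URL when no connection string is passed.
os.environ["DATABASE_URL"] = connection_url
```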

In [2]:
%%sql
DROP TABLE IF EXISTS auditor_report;

CREATE TABLE auditor_report (
  location_id VARCHAR(32),
  type_of_water_source VARCHAR(64),
  true_water_source_score INT,
  statements VARCHAR(255)
);

**Note:** If this cell fails with `ConnectionError: Environment variable $DATABASE_URL not set, and no connect string given`, no database connection is open. Run the `%sql mysql+pymysql://...` connection cell above first, then re-run this cell.


### Compare Scores
#### Compare scores from the auditor vs. the original database.
**Goal:**</br>
Compare subjective quality scores from surveyors with independently verified scores from auditors.</br>

Join Logic:</br>

auditor_report.location_id → visits.location_id</br>

visits.record_id → water_quality.record_id</br>

**Tasks:**</br>

Create a joined table showing:</br>

location_id, record_id, auditor_score, surveyor_score</br>

Filter where visit_count = 1

In [None]:
%%sql
SELECT
  ar.location_id,
  v.record_id,
  ar.true_water_source_score AS auditor_score,
  wq.subjective_quality_score AS surveyor_score
FROM
  auditor_report ar
JOIN visits v ON ar.location_id = v.location_id
JOIN water_quality wq ON v.record_id = wq.record_id
WHERE v.visit_count = 1;

**Expected: 1518 correct matches out of 1620 auditor records: 94% accurate**
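
The 94% figure is 1518 divided by 1620, rounded. The sketch below reproduces that arithmetic and then shows how the same ratio could be computed with a SQL aggregate. It uses Python's built-in `sqlite3` module and a hypothetical `joined_scores` table with made-up rows rather than the real joined result.

```python
import sqlite3

# Figures quoted in this notebook.
total_records = 1620
matching_records = 1518
accuracy = matching_records / total_records
print(f"{accuracy:.1%}")

# The same ratio computed in SQL on a toy joined result (rows are made up).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE joined_scores (location_id TEXT, auditor_score INTEGER, surveyor_score INTEGER);
    INSERT INTO joined_scores VALUES
        ('LocA', 3, 3), ('LocB', 7, 7), ('LocC', 2, 10), ('LocD', 5, 5);
""")
total, matches = conn.execute("""
    SELECT COUNT(*),
           SUM(CASE WHEN auditor_score = surveyor_score THEN 1 ELSE 0 END)
    FROM joined_scores
""").fetchone()
print(total, matches, matches / total)
```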

### Find Incorrect Records
#### Add WHERE auditor_score != surveyor_score
**Goal:**</br>
Filter the joined results to get only mismatched scores (auditor vs. surveyor).</br>

**Expected Output:**</br>

1620 total auditor records</br>

1518 correct matches</br>

102 incorrect scores

In [None]:
%%sql
SELECT
  ar.location_id,
  v.record_id,
  ar.true_water_source_score AS auditor_score,
  wq.subjective_quality_score AS surveyor_score
FROM
  auditor_report ar
JOIN visits v ON ar.location_id = v.location_id
JOIN water_quality wq ON v.record_id = wq.record_id
WHERE 
  v.visit_count = 1
  AND ar.true_water_source_score != wq.subjective_quality_score;

**Expected: 102 incorrect records**

### Check if Water Source Type Matches
**Goal:**</br>
Check whether water source types (type_of_water_source) were accurately recorded.</br>
Join *visits.source_id* to *water_source.source_id* to get the surveyor's source type (`ws_source`), then compare it with *auditor_report.type_of_water_source* (`auditor_source`).

In [None]:
%%sql
SELECT
    ar.location_id AS location_id,
    ar.true_water_source_score AS auditor_score,
    wq.subjective_quality_score AS surveyor_score,
    ar.type_of_water_source AS auditor_source,
    ws.type_of_water_source AS ws_source
FROM
    auditor_report AS ar
JOIN
    visits AS vs
ON
    ar.location_id = vs.location_id
JOIN
    water_quality AS wq
ON
    vs.record_id = wq.record_id
JOIN
    water_source AS ws
ON
    vs.source_id = ws.source_id
WHERE
    ar.true_water_source_score != wq.subjective_quality_score
AND
    vs.visit_count = 1
AND
    ws.type_of_water_source != ar.type_of_water_source;

**Result: The query returns no rows, so the source types match and the previous water source analysis is still valid.**

### Link Mistakes to Employees
**Goal:**</br>
Determine which employees recorded incorrect scores.</br>

Join Logic:</br>

Add assigned_employee_id from visits</br>

Join to employee to get employee_name

In [None]:
%%sql
SELECT
    ar.location_id AS location_id,
    vs.record_id,
    em.employee_name,
    ar.true_water_source_score AS auditor_score,
    wq.subjective_quality_score AS surveyor_score
FROM
    auditor_report AS ar
JOIN
    visits AS vs
ON
    ar.location_id = vs.location_id
JOIN
    water_quality AS wq
ON
    vs.record_id = wq.record_id
JOIN
    employee AS em
ON
    vs.assigned_employee_id = em.assigned_employee_id
WHERE
    ar.true_water_source_score != wq.subjective_quality_score
AND
    vs.visit_count = 1;

**Output: Table of employees who submitted the incorrect scores.**

### Use a CTE (or View)
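Save the joined result of incorrect records as a view named `Incorrect_records` so that later queries can reuse it. A CTE would also work, but it only exists within the statement that defines it: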

In [None]:
%%sql
-- OR REPLACE lets this cell be re-run without an "already exists" error.
CREATE OR REPLACE VIEW Incorrect_records AS
SELECT
    ar.location_id AS location_id,
    vs.record_id,
    em.employee_name,
    ar.true_water_source_score AS auditor_score,
    wq.subjective_quality_score AS surveyor_score,
    ar.statements
FROM
    auditor_report AS ar
JOIN
    visits AS vs
ON
    ar.location_id = vs.location_id
JOIN
    water_quality AS wq
ON
    vs.record_id = wq.record_id
JOIN
    employee AS em
ON
    vs.assigned_employee_id = em.assigned_employee_id
WHERE
    ar.true_water_source_score != wq.subjective_quality_score
AND
    vs.visit_count = 1;
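
The practical difference between a CTE and a view is scope: a CTE disappears when its statement ends, while a view stays in the database until it is dropped. The sketch below illustrates this with Python's built-in `sqlite3` module as a stand-in for MySQL. The `mistakes` table, its rows, and the name `error_count_view` are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Toy stand-in for Incorrect_records; the rows are invented.
    CREATE TABLE mistakes (employee_name TEXT, location_id TEXT);
    INSERT INTO mistakes VALUES
        ('Employee A', 'Loc1'), ('Employee A', 'Loc2'), ('Employee B', 'Loc3');
""")

# A CTE is visible only to the single statement that defines it.
cte_rows = conn.execute("""
    WITH error_count AS (
        SELECT employee_name, COUNT(*) AS number_of_mistakes
        FROM mistakes
        GROUP BY employee_name
    )
    SELECT * FROM error_count ORDER BY employee_name
""").fetchall()

# Referring to the CTE from a later statement fails because it no longer exists.
try:
    conn.execute("SELECT * FROM error_count")
    cte_still_visible = True
except sqlite3.OperationalError:
    cte_still_visible = False

# A view is stored in the database, so later statements can query it by name.
conn.execute("""
    CREATE VIEW error_count_view AS
    SELECT employee_name, COUNT(*) AS number_of_mistakes
    FROM mistakes
    GROUP BY employee_name
""")
view_rows = conn.execute(
    "SELECT * FROM error_count_view ORDER BY employee_name"
).fetchall()

print(cte_rows, cte_still_visible, view_rows)
```

In a notebook, this means any name that several cells need to share should be a view (or a table), not a CTE.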

### Count Errors Per Employee
We will be using the previous view ('Incorrect_records')

In [None]:
%%sql
WITH error_count AS (
SELECT
    employee_name,
    COUNT(*) AS number_of_mistakes
FROM
	Incorrect_records
GROUP BY
	employee_name
)
SELECT
	*
FROM
	error_count
ORDER BY
	number_of_mistakes DESC;

### Find Suspect List
To find possible suspects, we should try to find all of the employees who have an above-average number of mistakes. Let's break it down into steps first:
* Calculate average mistakes:

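In [3]:
%%sql
-- The error_count CTE above only exists within its own statement, so it is recomputed inline here.
SELECT AVG(number_of_mistakes) AS avg_number_of_mistakes
FROM (SELECT COUNT(*) AS number_of_mistakes FROM Incorrect_records GROUP BY employee_name) AS error_count;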

* Employees with above-average mistakes:

In [None]:
%%sql
-- Several CTEs can be chained in one statement: WITH x AS (...), y AS (...), z AS (...) ...
-- Saving the result as a view keeps suspect_list available to the cells that follow.
CREATE OR REPLACE VIEW suspect_list AS
WITH error_count AS (
    SELECT
        employee_name,
        COUNT(*) AS number_of_mistakes
    FROM
        Incorrect_records
    GROUP BY
        employee_name
),
above_average AS (
    SELECT
        *
    FROM
        error_count
    WHERE
        number_of_mistakes > (SELECT AVG(number_of_mistakes) FROM error_count)
)
SELECT
    *
FROM
    above_average;

SELECT
    *
FROM
    suspect_list;
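
The above-average filter can be checked by hand on a small example. The sketch below uses Python's built-in `sqlite3` module with a hypothetical, made-up `error_count` table: four employees with 10, 2, 3 and 9 mistakes give an average of 6, so only the two with 10 and 9 pass the filter.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Made-up counts for illustration; not the real audit figures.
    CREATE TABLE error_count (employee_name TEXT, number_of_mistakes INTEGER);
    INSERT INTO error_count VALUES
        ('Employee A', 10), ('Employee B', 2), ('Employee C', 3), ('Employee D', 9);
""")

# Step 1: the average number of mistakes per employee.
average = conn.execute(
    "SELECT AVG(number_of_mistakes) FROM error_count"
).fetchone()[0]

# Step 2: keep only employees whose count is above that average.
suspects = conn.execute("""
    SELECT employee_name
    FROM error_count
    WHERE number_of_mistakes > (SELECT AVG(number_of_mistakes) FROM error_count)
    ORDER BY employee_name
""").fetchall()

print(average, suspects)
```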

### Check Their Statements
**Output: Revealing statements mentioning shady behavior.**

In [None]:
%%sql
SELECT 
    employee_name, 
    location_id, 
    statements
FROM 
    Incorrect_records
WHERE 
    employee_name 
IN (
    SELECT 
        employee_name 
    FROM 
        suspect_list
);

If you have a look, you will notice some alarming statements about these four officials (look at records **AkRu04508, AkRu07310, KiRu29639, AmAm09607**, for example). See how often the word "cash" appears in these statements.

### Find "cash" in Statements
To confirm whether these employees face allegations of bribery, we should filter the statements with '%cash%'.

In [None]:
%%sql
SELECT 
    *
FROM 
    Incorrect_records
WHERE 
    employee_name 
IN (
    SELECT 
        employee_name 
    FROM 
        suspect_list
)
AND statements LIKE '%cash%';
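
In a `LIKE` pattern, `%` matches any run of characters, so '%cash%' matches any statement that contains "cash" anywhere. The sketch below shows this with Python's built-in `sqlite3` module and made-up statements in a hypothetical `statements_demo` table. SQLite's `LIKE` ignores ASCII case by default. MySQL's behaviour depends on the column collation, so treating the match as case-insensitive there is an assumption to verify.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Invented statements for illustration only.
    CREATE TABLE statements_demo (location_id TEXT, statements TEXT);
    INSERT INTO statements_demo VALUES
        ('Loc1', 'The official asked for cash before recording the score.'),
        ('Loc2', 'Villagers praised the team for their hard work.'),
        ('Loc3', 'Cash changed hands during the visit.');
""")

# '%cash%' matches "cash" at any position in the statement.
flagged = conn.execute("""
    SELECT location_id
    FROM statements_demo
    WHERE statements LIKE '%cash%'
    ORDER BY location_id
""").fetchall()

print(flagged)
```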

**One final check**</br>
Check whether any employees in the Incorrect_records view have statements mentioning "cash" but are not in our suspect list.</br> To do this, change the IN operator in the query above to NOT IN.

In [None]:
%%sql
SELECT
    employee_name,
    location_id,
    statements
FROM
    Incorrect_records
WHERE
    employee_name NOT IN (
        SELECT
            employee_name
        FROM
            suspect_list
    )
AND statements LIKE '%cash%';

**The result is empty, so no one except the four suspects has these allegations of bribery.**

#### The employees suspected of corruption are:

In [None]:
%%sql
WITH corrupted AS(
SELECT 
    *
FROM 
    Incorrect_records
WHERE 
    employee_name 
IN (
    SELECT 
        employee_name 
    FROM 
        suspect_list
)
AND statements LIKE '%cash%')
SELECT
    DISTINCT employee_name
FROM
    corrupted;

#### You should see the following list:
*Zuriel Matembo*</br>
*Malachi Mavuso*</br>
*Bello Azibo*</br>
*Lalitha Kaburi*</br>

These employees had:</br>

* Above-average errors AND

* Incriminating statements including cash

They should be flagged for further investigation.