# <span style="color:orange"> **PREDICTING OPERATING CONDITION OF WATER WELLS IN TANZANIA**</span>

 By: *Grace Rotich*

This is a machine learning project with models that predict the condition of water wells in Tanzania.

![introduction well](imageWell)



## BUSINESS UNDERSTANDING
According to Water.org (2024), Tanzania faces a significant water and sanitation crisis: out of a population of 65 million people, 58 million (88% of the population) lack access to safe water. People living under these circumstances, particularly women and girls, spend a significant amount of time traveling long distances to collect water. Other pressures, such as underfunding of planned government projects, population growth, and extreme weather events due to climate change, make matters worse for those living in poverty. Now more than ever, access to safe water at home is critical to families in Tanzania.

The project's main goal is to build a model that predicts the operational condition of water points in Tanzania. With accurate predictions of whether a water point is functional, in need of repair, or non-functional, the Tanzanian government can improve its maintenance decision-making and help ensure that communities have sustainable access to drinking water.

With such a model, the government of Tanzania can:

* Extend the operational lifespan of water wells.
* Ensure that more communities have a dependable source of clean drinking water.
* Make better use of public funds and resources by identifying which boreholes should get first priority for maintenance, based on how urgently they need it.

### Problem Statement:


<img src="Problem.jpg" alt="child problem water" width="700" height="350">


In Tanzania, many water wells fall into disrepair due to inadequate maintenance and the lack of timely repairs. This project aims to build a data-driven model to predict the condition of water wells, thus enabling the government to prioritize and address maintenance needs efficiently.


### Objectives
The main goals of this project are:

1. To understand the data and the various factors that affect the condition of water wells in Tanzania.
2. To create a machine learning model that can reliably predict the operational state of water wells using past data.
3. To advise on how to allocate funds and resources more effectively for water well maintenance and repair.

### Metrics for success
Model performance will be evaluated on the accuracy, precision, and ROC AUC of the predictive model.
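
As a rough illustration of how these three metrics could be computed for a three-class problem with scikit-learn, here is a minimal sketch. It assumes synthetic data from `make_classification` in place of the real well features, a plain `LogisticRegression` as a stand-in model, and macro averaging with a one-vs-rest ROC AUC; none of these choices are confirmed by the project itself.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic three-class data standing in for the well features and labels (assumption).
X, y = make_classification(
    n_samples=600, n_features=8, n_informative=5, n_classes=3, random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# Stand-in model; any classifier exposing predict_proba would work here.
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_pred = model.predict(X_test)
y_proba = model.predict_proba(X_test)

# Accuracy: share of test rows classified correctly.
accuracy = accuracy_score(y_test, y_pred)
# Macro precision: unweighted mean of the per-class precision values.
precision = precision_score(y_test, y_pred, average="macro", zero_division=0)
# Multiclass ROC AUC needs class probabilities; "ovr" scores each class against the rest.
roc_auc = roc_auc_score(y_test, y_proba, multi_class="ovr", average="macro")

print(f"accuracy={accuracy:.3f} precision={precision:.3f} roc_auc={roc_auc:.3f}")
```

Macro averaging weights the three classes equally, which may matter if one condition is much rarer than the others; a weighted average is an equally reasonable alternative.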

## Data Sourcing

The data comes from Taarifa (http://taarifa.org/) and the Tanzanian Ministry of Water (http://maji.go.tz/). It is used to predict which pumps are functional, which need some repairs, and which don't work at all.

## DATA UNDERSTANDING

<img src="understanding.jpg" alt="Data Analysis" width="500" height="300">






Importing Libraries and data

# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.preprocessing import LabelEncoder, OneHotEncoder, StandardScaler
from sklearn.model_selection import (
    train_test_split, cross_val_score, RepeatedStratifiedKFold,
    GridSearchCV, RandomizedSearchCV,
)
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
import warnings
warnings.filterwarnings('ignore')

# loading training set values
data = pd.read_csv("training_set_values.csv")
data.head()

| | id | amount_tsh | date_recorded | funder | gps_height | installer | longitude | latitude | wpt_name | num_private | ... | payment_type | water_quality | quality_group | quantity | quantity_group | source | source_type | source_class | waterpoint_type | waterpoint_type_group |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 69572 | 6000.0 | 2011-03-14 | Roman | 1390 | Roman | 34.938093 | -9.856322 | none | 0 | ... | annually | soft | good | enough | enough | spring | spring | groundwater | communal standpipe | communal standpipe |
| 1 | 8776 | 0.0 | 2013-03-06 | Grumeti | 1399 | GRUMETI | 34.698766 | -2.147466 | Zahanati | 0 | ... | never pay | soft | good | insufficient | insufficient | rainwater harvesting | rainwater harvesting | surface | communal standpipe | communal standpipe |
| 2 | 34310 | 25.0 | 2013-02-25 | Lottery Club | 686 | World vision | 37.460664 | -3.821329 | Kwa Mahundi | 0 | ... | per bucket | soft | good | enough | enough | dam | dam | surface | communal standpipe multiple | communal standpipe |
| 3 | 67743 | 0.0 | 2013-01-28 | Unicef | 263 | UNICEF | 38.486161 | -11.155298 | Zahanati Ya Nanyumbu | 0 | ... | never pay | soft | good | dry | dry | machine dbh | borehole | groundwater | communal standpipe multiple | communal standpipe |
| 4 | 19728 | 0.0 | 2011-07-13 | Action In A | 0 | Artisan | 31.130847 | -1.825359 | Shuleni | 0 | ... | never pay | soft | good | seasonal | seasonal | rainwater harvesting | rainwater harvesting | surface | communal standpipe | communal standpipe |


# Last 5 rows
data.tail()

| | id | amount_tsh | date_recorded | funder | gps_height | installer | longitude | latitude | wpt_name | num_private | ... | payment_type | water_quality | quality_group | quantity | quantity_group | source | source_type | source_class | waterpoint_type | waterpoint_type_group |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 59395 | 60739 | 10.0 | 2013-05-03 | Germany Republi | 1210 | CES | 37.169807 | -3.253847 | Area Three Namba 27 | 0 | ... | per bucket | soft | good | enough | enough | spring | spring | groundwater | communal standpipe | communal standpipe |
| 59396 | 27263 | 4700.0 | 2011-05-07 | Cefa-njombe | 1212 | Cefa | 35.249991 | -9.070629 | Kwa Yahona Kuvala | 0 | ... | annually | soft | good | enough | enough | river | river/lake | surface | communal standpipe | communal standpipe |
| 59397 | 37057 | 0.0 | 2011-04-11 | NaN | 0 | NaN | 34.017087 | -8.750434 | Mashine | 0 | ... | monthly | fluoride | fluoride | enough | enough | machine dbh | borehole | groundwater | hand pump | hand pump |
| 59398 | 31282 | 0.0 | 2011-03-08 | Malec | 0 | Musa | 35.861315 | -6.378573 | Mshoro | 0 | ... | never pay | soft | good | insufficient | insufficient | shallow well | shallow well | groundwater | hand pump | hand pump |
| 59399 | 26348 | 0.0 | 2011-03-23 | World Bank | 191 | World | 38.104048 | -6.747464 | Kwa Mzee Lugawa | 0 | ... | on failure | salty | salty | enough | enough | shallow well | shallow well | groundwater | hand pump | hand pump |


The following is the set of information about the waterpoints in the data:

* amount_tsh - Total static head (amount of water available to waterpoint)

* date_recorded - The date the row was entered

* funder - Who funded the well

* gps_height - Altitude of the well

* installer - Organization that installed the well

* longitude - GPS coordinate

* latitude - GPS coordinate

* wpt_name - Name of the waterpoint if there is one

* num_private -

* basin - Geographic water basin

* subvillage - Geographic location

* region - Geographic location

* region_code - Geographic location (coded)

* district_code - Geographic location (coded)

* lga - Geographic location

* ward - Geographic location

* population - Population around the well

* public_meeting - True/False

* recorded_by - Group entering this row of data

* scheme_management - Who operates the waterpoint

* scheme_name - Who operates the waterpoint

* permit - If the waterpoint is permitted

* construction_year - Year the waterpoint was constructed

* extraction_type - The kind of extraction the waterpoint uses

* extraction_type_group - The kind of extraction the waterpoint uses

* extraction_type_class - The kind of extraction the waterpoint uses

* management - How the waterpoint is managed

* management_group - How the waterpoint is managed

* payment - What the water costs

* payment_type - What the water costs

* water_quality - The quality of the water

* quality_group - The quality of the water

* quantity - The quantity of water

* quantity_group - The quantity of water

* source - The source of the water

* source_type - The source of the water

* source_class - The source of the water

* waterpoint_type - The kind of waterpoint

* waterpoint_type_group - The kind of waterpoint
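
Several of the features above share the same description (for example `quantity` and `quantity_group`), which suggests that some columns may carry duplicate information. One possible way to check this is sketched below. It assumes a toy DataFrame copied from the five rows shown by `data.head()`, and the helper `identical_column_pairs` is hypothetical, introduced only for illustration.

```python
from itertools import combinations

import pandas as pd

# Toy frame copied from the five rows shown by data.head() (illustration only).
sample = pd.DataFrame({
    "quantity": ["enough", "insufficient", "enough", "dry", "seasonal"],
    "quantity_group": ["enough", "insufficient", "enough", "dry", "seasonal"],
    "source": ["spring", "rainwater harvesting", "dam", "machine dbh", "rainwater harvesting"],
    "source_type": ["spring", "rainwater harvesting", "dam", "borehole", "rainwater harvesting"],
})


def identical_column_pairs(df):
    """Hypothetical helper: return pairs of columns whose values match row by row.

    Rows containing missing values never compare equal, so columns with NaN
    entries will not be reported as identical by this simple check.
    """
    return [
        (a, b)
        for a, b in combinations(df.columns, 2)
        if (df[a] == df[b]).all()
    ]


pairs = identical_column_pairs(sample)
print(pairs)
```

On these five rows only `quantity` and `quantity_group` match exactly, while `source` and `source_type` differ on the borehole row. Whether that holds across the full dataset would need to be checked on the real data.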


**Labels in this dataset**

The labels in this dataset are simple. There are three possible values:

* functional - the waterpoint is operational and there are no repairs needed

* functional needs repair - the waterpoint is operational, but needs repairs

* non functional - the waterpoint is not operational
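
Most scikit-learn workflows represent these three labels as integers. A minimal sketch of one way to do that with `LabelEncoder` follows; the short `labels` list is made up for illustration and is not taken from the dataset.

```python
from sklearn.preprocessing import LabelEncoder

# Illustrative label values only; the real labels come from the dataset.
labels = ["functional", "non functional", "functional needs repair", "functional"]

encoder = LabelEncoder()
encoded = encoder.fit_transform(labels)

# LabelEncoder assigns integer codes in sorted (alphabetical) order of the class names.
classes = list(encoder.classes_)
print(classes)
print(list(encoded))

# inverse_transform maps the integer codes back to the original strings.
decoded = list(encoder.inverse_transform(encoded))
```

Keeping the fitted encoder around makes it possible to translate model predictions back into readable labels.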


# Summary of the data
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 59400 entries, 0 to 59399
Data columns (total 40 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   id                     59400 non-null  int64  
 1   amount_tsh             59400 non-null  float64
 2   date_recorded          59400 non-null  object 
 3   funder                 55765 non-null  object 
 4   gps_height             59400 non-null  int64  
 5   installer              55745 non-null  object 
 6   longitude              59400 non-null  float64
 7   latitude               59400 non-null  float64
 8   wpt_name               59400 non-null  object 
 9   num_private            59400 non-null  int64  
 10  basin                  59400 non-null  object 
 11  subvillage             59029 non-null  object 
 12  region                 59400 non-null  object 
 13  region_code            59400 non-null  int64  
 14  district_code          59400 non-null  int64  
 15  lg

The dataset has null values in some of the features, such as funder, installer, subvillage and public_meeting. It also mixes object (string), integer, and float data types, so the object columns need to be encoded as numbers before they can be used by the models later on.
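
A minimal sketch of how the missing values and text columns might be handled is shown below. It assumes a three-row toy DataFrame rather than the real data, fills gaps with the placeholder string `"unknown"`, and uses one-hot encoding through `pd.get_dummies`; these are illustrative choices, not necessarily the steps this project takes.

```python
import numpy as np
import pandas as pd

# Toy frame with the same kind of gaps seen in funder and installer (illustration only).
sample = pd.DataFrame({
    "funder": ["Roman", np.nan, "Unicef"],
    "installer": ["Roman", np.nan, "UNICEF"],
    "gps_height": [1390, 0, 263],
})

# Count missing values per column.
missing_counts = sample.isnull().sum()

# Treat every non-numeric column as categorical text.
text_cols = list(sample.select_dtypes(exclude="number").columns)

# Fill gaps with a placeholder category (assumption: "unknown" is an acceptable stand-in).
filled = sample.copy()
filled[text_cols] = filled[text_cols].fillna("unknown")

# One-hot encode the text columns so that every column is numeric.
encoded = pd.get_dummies(filled, columns=text_cols, dtype=float)
print(encoded.columns.tolist())
```

One-hot encoding adds a column per distinct value, so high-cardinality features such as `funder` may first need their rare values grouped together.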

# Dimensions of the data
data.shape

(59400, 40)