# **Predicting Water Pump Functionality in Tanzania: A Machine Learning Approach to Enhance Resource Allocation**


### **Business Understanding**

### Overview

Tanzania is in the midst of a water crisis: out of its population of 65 million, 55% of the rural population and 11% of the urban population do not have access to clean water. People living under these circumstances, particularly women and girls, spend a significant amount of time traveling long distances to collect water. This poses significant risks to public health, economic productivity, and educational opportunities. Now more than ever, access to safe water at home is critical to families in Tanzania.

#### Problem Statement

Access to clean and functional water points is vital for rural and urban communities in Tanzania. However, many wells in Tanzania are either non-functional or require repairs, affecting water accessibility and community well-being.

The central question is: **How can we predict the functionality of water points to prioritize maintenance and improve resource allocation?**

#### Key Stakeholders

Tanzanian Government Agencies, NGOs, Community Leaders, Technical Team (Engineers and Technicians)



### Business Goals

- **Optimize resource allocation:** Predict well functionality to prioritize repairs for non-functional and poorly functioning wells.

- **Improve community access to clean water:** Reduce repair downtime and increase the availability of functional water points.

- **Support the sustainability and durability of wells:** Provide insights for future installations and maintenance strategies, including which factors contribute to well failures.

### **Proposed Solution (Metric: Accuracy Score >= 80%)**

Develop a machine learning model to accurately predict the functionality status of water wells (functional, non-functional, functional needs repair) based on available data.
 
This model will enable repair prioritization, data-driven decision making, and improved water access.
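As a minimal sketch of how the 80% accuracy target could be checked, the snippet below compares overall accuracy against the threshold and also reports per-class recall, since a three-class problem can reach high accuracy while still missing a small class. The helper name `evaluate`, the label spellings, and the toy labels are assumptions made for illustration; they are not results from the dataset.

```python
from sklearn.metrics import accuracy_score, recall_score

TARGET_ACCURACY = 0.80  # threshold stated in the proposed solution

# Assumed label spellings; adjust to match the actual status_group values.
LABELS = ["functional", "functional needs repair", "non functional"]


def evaluate(y_true, y_pred, threshold=TARGET_ACCURACY):
    """Return overall accuracy, whether it meets the threshold, and per-class recall."""
    acc = accuracy_score(y_true, y_pred)
    recalls = recall_score(
        y_true, y_pred, labels=LABELS, average=None, zero_division=0
    )
    return acc, acc >= threshold, dict(zip(LABELS, recalls))


# Invented toy labels for illustration only.
y_true = [
    "functional", "functional", "functional",
    "non functional", "non functional",
    "functional needs repair",
]
y_pred = [
    "functional", "functional", "functional",
    "non functional", "non functional",
    "functional",  # the single 'needs repair' well is misclassified
]

accuracy, meets_target, recall_by_class = evaluate(y_true, y_pred)
print(f"accuracy={accuracy:.3f}, meets target: {meets_target}")
print(recall_by_class)
```

In this toy case the accuracy target is met even though no well in the "functional needs repair" class is identified, which is why per-class metrics are worth tracking alongside the headline accuracy score.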

#### Objectives
- **Develop a predictive model:** Create a machine learning model that accurately predicts the functionality status of water wells using features like pump type, region and other relevant factors.

- **Identify key predictors:** Analyze the model to identify the key factors that contribute to well functionality or failure, providing insights that can inform future well design and maintenance strategies.
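One possible way to approach the second objective is to rank features by the importance scores of a tree-based model. The sketch below uses synthetic data from `make_classification` and hypothetical feature names (`feature_0` ... `feature_4`) purely to show the mechanics; it is not the analysis performed on the wells data.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data: 5 numeric features, 3 classes (not the wells dataset).
X, y = make_classification(
    n_samples=300,
    n_features=5,
    n_informative=3,
    n_redundant=0,
    n_classes=3,
    random_state=0,
)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]  # hypothetical names

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X, y)

# Impurity-based importances sum to 1; sort so the strongest predictors come first.
importances = pd.Series(
    model.feature_importances_, index=feature_names
).sort_values(ascending=False)
print(importances)
```

Impurity-based importances can favor features with many distinct values, so a cross-check such as `sklearn.inspection.permutation_importance` on held-out data may be worth considering before drawing conclusions.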

### **Data Understanding**

The dataset is provided on https://www.drivendata.org/ by **Taarifa** and the **Tanzanian Ministry of Water**. More details on the competition can be found [here](https://www.drivendata.org/competitions/7/pump-it-up-data-mining-the-water-table/page/23/).
Feature descriptions for the data can be found in [data description](data_description.txt).

##### **Data assumptions:**
- The dataset is representative of all wells in Tanzania
- Historical data trends will hold for future predictions

### **Methodology**

### **Import Libraries**

In [1]:
import pandas as pd
import numpy as np
import sys
import os

# Data visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Machine learning
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Feature selection and engineering
from sklearn.feature_selection import RFE

# Model evaluation
from sklearn.metrics import precision_score, recall_score, f1_score, roc_auc_score

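# Add the directory that contains the 'modules' package to sys.path,
# so that `from modules.X import ...` below can resolve.
# Assumes 'modules' sits one level above the working directory; adjust if not.
sys.path.append(os.path.abspath(".."))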

# Now import
from modules.EDA import EDA
from modules.dataprocessor import DataProcessor
from modules.testprocessor import TestDatasetProcessor

### **Load Dataset**

In [2]:
# Load the dataset
df = pd.read_csv("./data/wells_data_cleaned.csv")
# Check the first few rows to confirm the structure
df.head()

| Unnamed: 0 | id | amount_tsh | date_recorded | funder | gps_height | installer | longitude | latitude | basin | region | ... | construction_year | extraction_type_class | management_group | payment_type | quality_group | quantity | source_class | waterpoint_type | status_group | year_recorded |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 69572 | 6000.0 | 2011-03-14 | ROMAN | 1390.0 | ROMAN | 34.938093 | -9.856322 | LAKE NYASA | IRINGA | ... | 1999 | GRAVITY | USER-GROUP | ANNUALLY | GOOD | ENOUGH | GROUNDWATER | COMMUNAL STANDPIPE | FUNCTIONAL | 2011 |
| 1 | 8776 | 0.0 | 2013-03-06 | GRUMETI | 1399.0 | GRUMETI | 34.698766 | -2.147466 | LAKE VICTORIA | MARA | ... | 2010 | GRAVITY | USER-GROUP | NEVER PAY | GOOD | INSUFFICIENT | SURFACE | COMMUNAL STANDPIPE | FUNCTIONAL | 2013 |
| 2 | 34310 | 25.0 | 2013-02-25 | LOTTERY CLUB | 686.0 | WORLD VISION | 37.460664 | -3.821329 | PANGANI | MANYARA | ... | 2009 | GRAVITY | USER-GROUP | PER BUCKET | GOOD | ENOUGH | SURFACE | COMMUNAL STANDPIPE MULTIPLE | FUNCTIONAL | 2013 |
| 3 | 67743 | 0.0 | 2013-01-28 | UNICEF | 263.0 | UNICEF | 38.486161 | -11.155298 | RUVUMA / SOUTHERN COAST | MTWARA | ... | 1986 | SUBMERSIBLE | USER-GROUP | NEVER PAY | GOOD | DRY | GROUNDWATER | COMMUNAL STANDPIPE MULTIPLE | NON FUNCTIONAL | 2013 |
| 4 | 19728 | 0.0 | 2011-07-13 | ACTION IN A | 0.0 | ARTISAN | 31.130847 | -1.825359 | LAKE VICTORIA | KAGERA | ... | 0 | GRAVITY | OTHER | NEVER PAY | GOOD | SEASONAL | SURFACE | COMMUNAL STANDPIPE | FUNCTIONAL | 2011 |
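The preview includes an `Unnamed: 0` column, which usually appears when a DataFrame index was written to CSV and then read back as a regular column. Assuming that is the case here, the sketch below shows how `index_col=0` avoids the extra column and how the class balance of `status_group` can be inspected. The inline CSV reuses three columns from the five rows shown above so that the example runs without the project file.

```python
import io

import pandas as pd

# Three columns from the five preview rows; the blank first header
# mimics an index that was saved with to_csv().
csv_text = """,id,region,status_group
0,69572,IRINGA,FUNCTIONAL
1,8776,MARA,FUNCTIONAL
2,34310,MANYARA,FUNCTIONAL
3,67743,MTWARA,NON FUNCTIONAL
4,19728,KAGERA,FUNCTIONAL
"""

# Default read: the saved index comes back as a column named 'Unnamed: 0'.
raw = pd.read_csv(io.StringIO(csv_text))
print(raw.columns.tolist())

# Treat the first column as the index instead.
clean = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(clean.columns.tolist())

# Share of each target class in this small sample.
class_share = clean["status_group"].value_counts(normalize=True)
print(class_share)
```

With the real file, the same idea would be `pd.read_csv("./data/wells_data_cleaned.csv", index_col=0)`, provided the first column really is a saved index.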


### **Exploratory Data Analysis**