# **Project: Predicting the Condition of Water Wells in Tanzania**

## 1. Business Understanding

### Background:

  Access to clean, reliable water is a critical challenge in Tanzania, where over 57 million people depend on water wells. Keeping wells functional can significantly improve quality of life, reduce waterborne diseases, and support local economies.
  However, many wells fall into disrepair or become non-functional due to preventable issues. With 59,400 wells recorded in the training data, the ability to predict well functionality can help optimize resource allocation for maintenance and repairs.

### Problem Statement:

  NGOs and government bodies currently rely on limited, manual assessments to determine well conditions. This approach is time-consuming and inefficient. A predictive model could provide an automated, data-driven alternative, enabling stakeholders to prioritize interventions effectively.

### Objectives:

 1. Develop a classification model to predict whether a water well is Functional, Needs Repair, or Non-functional.
 2. Identify the key factors contributing to well condition and recommend actionable strategies to improve well functionality.
 3. Compare multiple machine learning models (Logistic Regression, Decision Tree, and Random Forest) to determine the best-performing algorithm.
 4. Deliver insights to stakeholders, including feature importance and predictions, to inform policy and maintenance strategies.

### Audience:
This project targets:
  - NGOs focused on water access and sustainable development.
  - The Tanzanian government, seeking to improve public infrastructure and water security.
  - Data scientists interested in real-world applications of classification models for social impact.


### **1. Import Required Libraries**

In [10]:
## 1. Import Required Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (
    accuracy_score, f1_score, confusion_matrix, classification_report,
    roc_curve, auc, precision_recall_curve, average_precision_score,
)

### **2. Loading the Datasets**

In [11]:
# Load the feature values and the labels, then join them on the shared 'id' column
train_values = pd.read_csv('Training_Set_Values.csv')
train_labels = pd.read_csv('Training_Set_Labels.csv')
data = train_values.merge(train_labels, on='id')
print("Data Preview:")
data.head()


Data Preview:


|  | id | amount_tsh | date_recorded | funder | gps_height | installer | longitude | latitude | wpt_name | num_private | ... | water_quality | quality_group | quantity | quantity_group | source | source_type | source_class | waterpoint_type | waterpoint_type_group | status_group |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 69572 | 6000.0 | 3/14/2011 | Roman | 1390 | Roman | 34.938093 | -9.856322 | none | 0 | ... | soft | good | enough | enough | spring | spring | groundwater | communal standpipe | communal standpipe | functional |
| 1 | 8776 | 0.0 | 3/6/2013 | Grumeti | 1399 | GRUMETI | 34.698766 | -2.147466 | Zahanati | 0 | ... | soft | good | insufficient | insufficient | rainwater harvesting | rainwater harvesting | surface | communal standpipe | communal standpipe | functional |
| 2 | 34310 | 25.0 | 2/25/2013 | Lottery Club | 686 | World vision | 37.460664 | -3.821329 | Kwa Mahundi | 0 | ... | soft | good | enough | enough | dam | dam | surface | communal standpipe multiple | communal standpipe | functional |
| 3 | 67743 | 0.0 | 1/28/2013 | Unicef | 263 | UNICEF | 38.486161 | -11.155298 | Zahanati Ya Nanyumbu | 0 | ... | soft | good | dry | dry | machine dbh | borehole | groundwater | communal standpipe multiple | communal standpipe | non functional |
| 4 | 19728 | 0.0 | 7/13/2011 | Action In A | 0 | Artisan | 31.130847 | -1.825359 | Shuleni | 0 | ... | soft | good | seasonal | seasonal | rainwater harvesting | rainwater harvesting | surface | communal standpipe | communal standpipe | functional |
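
The merge above assumes that every `id` appears exactly once in both files. A minimal sketch of how that assumption could be checked is shown here on toy data; the two small frames are hypothetical stand-ins for `train_values` and `train_labels`, and `validate="one_to_one"` is an optional pandas argument that the notebook itself does not use.

```python
import pandas as pd

# Toy stand-ins for the values and labels tables (hypothetical data).
values = pd.DataFrame({"id": [1, 2, 3], "gps_height": [1390, 1399, 686]})
labels = pd.DataFrame({
    "id": [1, 2, 3],
    "status_group": ["functional", "functional", "non functional"],
})

# validate="one_to_one" raises pandas.errors.MergeError if an id is duplicated
# on either side; how="inner" keeps only ids present in both tables.
merged = values.merge(labels, on="id", how="inner", validate="one_to_one")

# Confirm that the merge neither dropped nor duplicated rows.
rows_preserved = len(merged) == len(values) == len(labels)
print(merged.shape, rows_preserved)
```

If the row counts differed, some ids would be missing from one of the files and would be worth inspecting before modeling.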


### **3. Data Cleaning and Core Feature Selection**

In [12]:
# Select core numerical features
core_features = ['amount_tsh', 'gps_height', 'population', 'construction_year']
data = data[['status_group'] + core_features].copy()


In [13]:

# Treat zeros in 'construction_year' as missing values (a year of 0 is not valid)
data['construction_year'] = data['construction_year'].replace(0, np.nan)

# Fill the missing years with the median of the known construction years
median_value = data['construction_year'].median()
data['construction_year'] = data['construction_year'].fillna(median_value)

# Drop rows with missing values
data.dropna(inplace=True)

print("\nCleaned Data Info:")
data.info()



Cleaned Data Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 59400 entries, 0 to 59399
Data columns (total 5 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   status_group       59400 non-null  object 
 1   amount_tsh         59400 non-null  float64
 2   gps_height         59400 non-null  int64  
 3   population         59400 non-null  int64  
 4   construction_year  59400 non-null  float64
dtypes: float64(2), int64(2), object(1)
memory usage: 2.3+ MB
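
The zero-to-median step can be illustrated on a small series. This is a sketch only: `impute_zero_with_median` is a hypothetical helper name, and the years are made-up values. It also shows why `construction_year` appears as `float64` in the output above, since introducing `NaN` upcasts an integer column to float.

```python
import numpy as np
import pandas as pd


def impute_zero_with_median(series: pd.Series) -> pd.Series:
    """Treat zeros as missing and fill them with the median of the remaining values."""
    cleaned = series.replace(0, np.nan)      # zeros become NaN (column is upcast to float)
    return cleaned.fillna(cleaned.median())  # median() ignores NaN by default


# Made-up construction years; 0 marks an unknown year.
years = pd.Series([1999, 0, 2005, 2011, 0])
imputed = impute_zero_with_median(years)
print(imputed.tolist())
```

One design choice to be aware of: the notebook computes the median on the full dataset before splitting. Computing it on the training split only, then applying it to the validation split, would keep validation information out of the preprocessing.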


### **4. Encode the Target Variable**

In [14]:

from sklearn.preprocessing import LabelEncoder

label_encoder = LabelEncoder()
data['status_group'] = label_encoder.fit_transform(data['status_group'])
print("\nEncoded Target Variable Distribution:")
data['status_group'].value_counts()



Encoded Target Variable Distribution:


status_group
0    32259
2    22824
1     4317
Name: count, dtype: int64
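
To read the encoded counts, it helps to know which integer stands for which status. `LabelEncoder` assigns codes in sorted order of the labels, so the mapping can be recovered from `classes_`. The sketch below uses a short toy list and assumes the three raw labels are `functional`, `functional needs repair`, and `non functional`; only the first and last of these are visible in the preview above.

```python
from sklearn.preprocessing import LabelEncoder

# Toy labels; the exact label strings are an assumption about the raw data.
statuses = ["functional", "non functional", "functional needs repair", "functional"]

encoder = LabelEncoder()
codes = encoder.fit_transform(statuses)

# classes_ is sorted, so each label's position is its integer code.
mapping = {str(label): code for code, label in enumerate(encoder.classes_)}
print(mapping)

# inverse_transform turns integer predictions back into readable labels.
decoded = encoder.inverse_transform(codes)
print(list(decoded))
```

Under that assumption, code 0 is functional, code 1 is functional needs repair, and code 2 is non functional, which makes the middle class by far the smallest.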

### **5. Split the Data into Training and Validation Sets**

In [17]:

# Define features and target
X = data.drop(columns=['status_group'])
y = data['status_group']

# Split the data
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

print(f"\nTraining Data Shape: {X_train.shape}")
print(f"Validation Data Shape: {X_val.shape}")


Training Data Shape: (47520, 4)
Validation Data Shape: (11880, 4)
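
Passing `stratify=y` keeps the class proportions the same in both splits, which matters here because one class is much smaller than the others. A minimal sketch on a synthetic target (the 60/10/30 class mix is made up for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced target: 60% class 0, 10% class 1, 30% class 2.
y_toy = np.array([0] * 60 + [1] * 10 + [2] * 30)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

X_tr, X_va, y_tr, y_va = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# Share of each class in the training and validation splits.
train_share = np.bincount(y_tr) / len(y_tr)
val_share = np.bincount(y_va) / len(y_va)
print(train_share, val_share)
```

The same check can be run on `y_train` and `y_val` to confirm that the minority class is represented in the validation set.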


### **6. Training The Models**

In [18]:
# The three model classes were already imported in the first cell
print("### Training Models ###")

# Train the Decision Tree model
# class_weight='balanced' weights classes inversely to their frequency
print("1. Training Decision Tree Model...")
dt_model = DecisionTreeClassifier(class_weight='balanced', random_state=42)
dt_model.fit(X_train, y_train)
print("Decision Tree Model Training Complete.\n")

# Train the Logistic Regression model
# max_iter is raised from the default of 100 to give the solver more iterations to converge
print("2. Training Logistic Regression Model...")
lr_model = LogisticRegression(max_iter=200, random_state=42)
lr_model.fit(X_train, y_train)
print("Logistic Regression Model Training Complete.\n")

# Train the Random Forest model
# 100 trees (the default n_estimators) balances performance and training speed
print("3. Training Random Forest Model...")
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)
print("Random Forest Model Training Complete.\n")

print("### All Models Have Been Trained Successfully ###")


### Training Models ###
1. Training Decision Tree Model...
Decision Tree Model Training Complete.

2. Training Logistic Regression Model...
Logistic Regression Model Training Complete.

3. Training Random Forest Model...
Random Forest Model Training Complete.

### All Models Have Been Trained Successfully ###
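
Once the models are fitted, they can be compared on the validation split. The sketch below shows one possible comparison loop using accuracy and macro-averaged F1. It runs on synthetic data from `make_classification` rather than the well data, so the scores it prints say nothing about the real models. Two choices here are assumptions, not part of the notebook: the logistic regression is wrapped in a `StandardScaler` pipeline, because its solver tends to converge more easily on scaled features, and macro F1 is used because it weights the small class equally with the large ones.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic 3-class data with 4 features, standing in for the well data.
X_demo, y_demo = make_classification(
    n_samples=600, n_features=4, n_informative=3, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, weights=[0.55, 0.07, 0.38],
    random_state=42,
)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

demo_models = {
    "Decision Tree": DecisionTreeClassifier(class_weight="balanced", random_state=42),
    # Scaling is an added assumption; the notebook fits on unscaled features.
    "Logistic Regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=200, random_state=42)
    ),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
}

# Fit each model and score it on the held-out validation split.
scores = {}
for name, model in demo_models.items():
    model.fit(X_tr, y_tr)
    preds = model.predict(X_va)
    scores[name] = {
        "accuracy": accuracy_score(y_va, preds),
        "macro_f1": f1_score(y_va, preds, average="macro"),
    }

for name, metrics in scores.items():
    print(f"{name}: accuracy={metrics['accuracy']:.3f}, macro F1={metrics['macro_f1']:.3f}")
```

The same loop could be applied to `dt_model`, `lr_model`, and `rf_model` with `X_val` and `y_val` in place of the synthetic split.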
