# **Tanzanian Water Wells: Predictive Maintenance for Clean Water Access**

## Executive Summary

This project addresses a critical challenge in Tanzania: **predicting the operational status of water wells** to optimize maintenance efforts and ensure reliable access to clean water for communities across the country. By leveraging machine learning classification techniques, we aim to help stakeholders prioritize resources and prevent water access disruptions.

**Key Findings:**
- Built and evaluated multiple classification models to predict well functionality
- Identified the features most predictive of well failure
- Provided actionable recommendations for maintenance prioritization

# 1. Business Understanding

## 1.1 The Stakeholder

Our primary stakeholders are:
- **Tanzanian Ministry of Water**: Government agency responsible for water infrastructure
- **International NGOs**: Organizations funding and maintaining water wells
- **Local Communities**: End users dependent on functional water sources

## 1.2 The Business Problem

Tanzania faces a significant challenge with water well functionality. Many wells fall into disrepair, leaving communities without access to clean water. Three issues stand out:

- **Reactive maintenance is inefficient**: Wells are serviced only after they fail completely
- **Resource constraints**: Limited funding and personnel require strategic allocation
- **Impact on communities**: Non-functional wells force communities to use unsafe water sources

## 1.3 The Business Goal

Develop a predictive model that can classify water wells into three categories:
1. **Functional**: Well is operational and needs no immediate attention
2. **Functional needs repair**: Well works but requires maintenance soon
3. **Non-functional**: Well is broken and needs immediate intervention
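
The three categories above can be represented as integer class codes before modelling. The following is a minimal sketch, not the project's actual encoding: the label strings, the `STATUS_TO_CODE` mapping, and the `encode_status` helper are all assumptions made for illustration, and the spelling of the labels in the real labels file should be checked first.

```python
import pandas as pd

# Hypothetical label strings mirroring the three categories above;
# verify them against the values in the labels file before use.
STATUS_TO_CODE = {
    "functional": 0,
    "functional needs repair": 1,
    "non functional": 2,
}

def encode_status(status: pd.Series) -> pd.Series:
    """Map status strings to integer class codes, failing on unknown labels."""
    codes = status.map(STATUS_TO_CODE)
    if codes.isna().any():
        unknown = sorted(status[codes.isna()].unique())
        raise ValueError(f"Unexpected status labels: {unknown}")
    return codes.astype(int)

# Toy example (not real data)
example = pd.Series(["functional", "non functional", "functional needs repair"])
encoded = encode_status(example)
print(encoded.tolist())
```

Raising on unknown labels, rather than silently producing missing values, makes a spelling mismatch visible as soon as the labels are encoded.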

## 1.4 Why This Matters

Accurate predictions enable stakeholders to:
- **Prevent failures**: Address wells needing repair before they break completely
- **Optimize resources**: Focus maintenance crews on high-risk areas
- **Improve public health**: Ensure consistent access to clean water
- **Save costs**: Preventive maintenance is cheaper than emergency repairs
- **Data-driven decisions**: Move from reactive to proactive water management

# 2. Data Understanding

## 2.1 Dataset Overview

We work with three data files, plus a submission format template that is loaded alongside them:
- **Training set values**: Features for 59,400 wells used to train models
- **Training set labels**: The operational status (target variable) for training wells
- **Test set values**: Features for 14,850 wells where we need to predict status

### Importing necessary libraries

In [1]:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import (
    accuracy_score,
    precision_score,
    recall_score,
    f1_score,
    confusion_matrix,
    classification_report,
)
import warnings

# Silence library warnings to keep the notebook output readable
warnings.filterwarnings('ignore')

### Loading the datasets and submission format

In [2]:
# Load the training labels, test features, training features, and submission template
train_labels = pd.read_csv("0bf8bc6e-30d0-4c50-956a-603fc693d966.csv")
test_values = pd.read_csv("702ddfc5-68cd-4d1d-a0de-f5f566f76d91.csv")
train_values = pd.read_csv("4910797b-ee55-40a7-8668-10efd5c1b960.csv")
submission_format = pd.read_csv("SubmissionFormat.csv")

# Print the number of wells and features in each dataset
print(f"Training set: {train_values.shape[0]:,} wells, {train_values.shape[1]} features")
print(f"Test set: {test_values.shape[0]:,} wells, {test_values.shape[1]} features")
print(f"Labels: {train_labels.shape[0]:,} records")

Training set: 59,400 wells, 40 features
Test set: 14,850 wells, 40 features
Labels: 59,400 records
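
Because features and labels arrive in separate files, it is worth confirming that they describe the same wells before joining them. The sketch below runs on toy frames instead of the real files. It assumes both files share an `id` column and that the labels file has a `status_group` column; neither is shown above, and `gps_height` and the `join_values_and_labels` helper are likewise hypothetical.

```python
import pandas as pd

def join_values_and_labels(values: pd.DataFrame, labels: pd.DataFrame,
                           key: str = "id") -> pd.DataFrame:
    """Join features to labels one-to-one on `key`, raising if rows do not match up."""
    # validate="one_to_one" makes pandas raise if `key` is duplicated on either side
    merged = values.merge(labels, on=key, how="inner", validate="one_to_one")
    if len(merged) != len(values) or len(merged) != len(labels):
        raise ValueError("Some wells are missing from the values or the labels")
    return merged

# Toy frames standing in for train_values and train_labels (not real data;
# the column names are assumptions)
toy_values = pd.DataFrame({"id": [1, 2, 3], "gps_height": [1390, 1399, 686]})
toy_labels = pd.DataFrame({
    "id": [3, 1, 2],
    "status_group": ["functional", "non functional", "functional"],
})

toy_train = join_values_and_labels(toy_values, toy_labels)
print(toy_train.shape)

# Share of each class, useful for spotting class imbalance early
print(toy_train["status_group"].value_counts(normalize=True).round(2))
```

If the column-name assumptions hold, the same helper could be called with `train_values` and `train_labels`; matching row counts alone (59,400 in both) do not guarantee that the two files contain the same well identifiers.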
