# Predicting Tanzanian Water Well Functionality

**Author:** Diana Terry Awuor Aloo  
**Project:** AI for Sustainable Water Access — Tanzanian Water Well Classification  
**Repository:** `Tanzanian-Water-Well-Classification--Phase-3-Project`

---

## 1. Business Understanding (CRISP-DM)

### 1.1 Stakeholders
- **Primary:** Tanzanian Ministry of Water & Irrigation, local water agencies, and NGOs (e.g., WaterAid, UNICEF).  
- **Secondary:** Donor organizations, field maintenance teams, local communities.

### 1.2 Problem Statement
Many water points (wells) in Tanzania become non-functional or fall into disrepair, leaving communities without reliable access to clean water and wasting limited maintenance resources.  
I will **build a predictive classification model** that forecasts the operational status of a water point (e.g., *functional*, *functional needs repair*, *non-functional*) based on installation, environmental, and usage features.

### 1.3 Business Objectives
- **Primary objective:** Provide a model that **accurately identifies wells at risk of failure** so stakeholders can prioritize inspections and allocate repair budgets efficiently.
- **Secondary objectives:**
  - Identify key features associated with failing wells (e.g., installer, pump type, age).
  - Provide prescriptive recommendations for improving well longevity (e.g., preferred pump types or installer practices).
  - Package a reproducible notebook + trained model for handoff.

### 1.4 Data Science Objectives
- Build an end-to-end predictive pipeline that includes:
  - EDA and feature engineering
  - Baseline interpretable model (Logistic Regression or single Decision Tree)
  - Tuned model(s) (regularized logistic regression, tuned decision tree, and optionally an ensemble)
  - Model evaluation with appropriate classification metrics, reported on both the training and test sets
- Demonstrate reproducibility (clear code, saved model artifacts, README).

### 1.5 Success Criteria & Metrics
I will choose metrics that align with stakeholder priorities. For this project I will focus on:
- **Primary metric:** *F1-score* for the *non-functional* class (balances precision & recall when false positives and false negatives both matter).
- **Complementary metrics:** Precision, Recall, Confusion Matrix, and **ROC-AUC** (macro or per-class where appropriate).  
- **Operational threshold example:** If the model achieves **F1 ≥ 0.65** on the holdout test set for the non-functional class, it will be considered viable for pilot deployment (example target — adjust after EDA).

**Why F1 for non-functional?**  
False negatives (missed failing wells) may leave communities without water; false positives waste repair resources. F1 balances both concerns. I may also prioritize **Recall** if avoiding missed failures turns out to be more critical than avoiding unnecessary inspections.
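
To make the metric concrete, here is a minimal sketch of how precision, recall, and F1 for the *non-functional* class can be computed and compared against the example 0.65 target. The labels and predictions are made-up illustrative values, not project results, and `per_class_scores` is a hypothetical helper name. In the notebook itself, `sklearn.metrics.f1_score` with `labels=["non-functional"]` and `average=None` should give the same per-class figure.

```python
def per_class_scores(y_true, y_pred, target):
    """Return (precision, recall, f1), treating `target` as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == target and p == target for t, p in pairs)  # failing wells caught
    fp = sum(t != target and p == target for t, p in pairs)  # wasted inspections
    fn = sum(t == target and p != target for t, p in pairs)  # missed failing wells
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Illustrative (made-up) labels; class names follow the problem statement.
y_true = ["non-functional", "functional", "non-functional",
          "functional needs repair", "non-functional", "functional"]
y_pred = ["non-functional", "functional", "functional",
          "non-functional", "non-functional", "functional"]

F1_TARGET = 0.65  # example pilot threshold from section 1.5; adjust after EDA

precision, recall, f1 = per_class_scores(y_true, y_pred, "non-functional")
meets_target = f1 >= F1_TARGET
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f} viable={meets_target}")
```

Reporting precision and recall alongside F1 keeps the trade-off visible: if recall is what drags F1 down, the model is missing failing wells, which is the costlier error for communities.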



### 1.6 Next Steps
Proceed to **Data Understanding**:
- Load training and label CSVs
- Merge features and labels into a single DataFrame
- Inspect columns, datatypes, missingness, and target balance
- Create initial EDA visuals (target distribution, key numeric feature histograms)
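
The first three steps above can be sketched with pandas roughly as follows. This is a sketch under stated assumptions: the column names (`id`, `status_group`, `amount_tsh`, `installer`) and the tiny inline CSVs are placeholders so the snippet runs on its own. In the notebook, the `io.StringIO` objects would be replaced by the paths to the actual training and label files.

```python
import io

import pandas as pd

# Tiny inline stand-ins for the real CSVs (hypothetical columns and values).
features_csv = io.StringIO(
    "id,amount_tsh,installer\n"
    "1,0.0,A\n"
    "2,50.0,B\n"
    "3,0.0,\n"  # empty installer -> parsed as a missing value
)
labels_csv = io.StringIO(
    "id,status_group\n"
    "1,functional\n"
    "2,non-functional\n"
    "3,functional\n"
)

features = pd.read_csv(features_csv)
labels = pd.read_csv(labels_csv)

# Merge on the shared key; validate= raises if ids are duplicated on either side.
df = features.merge(labels, on="id", how="inner", validate="one_to_one")

# Inspect datatypes, share of missing values per column, and target balance.
print(df.dtypes)
missing_share = df.isna().mean().sort_values(ascending=False)
target_balance = df["status_group"].value_counts(normalize=True)
print(missing_share)
print(target_balance)
```

Checking `target_balance` early matters for the metric choice in section 1.5: if one class is rare, overall accuracy will look better than the per-class F1 suggests.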
