# **Beyond the Surface**
#### *Machine Learning for Predicting Well Status in Tanzania*

## **Business Understanding**
### **Background**
In a nation where one third of the country is arid to semi-arid, access to basic water has been a constant challenge over many years. From around the 1990s to the early 2000s, Tanzania's water supply and sanitation sector was characterised by decreasing access to at least basic water sources (especially in urban areas), steady access to some form of sanitation, intermittent water supply, and generally low quality of service.

Other than the three major lakes in the region, groundwater has been the major source of water for the nation's people. In 2006, the Government of Tanzania adopted a National Water Sector Development Strategy that aimed to promote integrated water resources management and the development of urban and rural water supply. Since its adoption, Tanzania has made significant progress in improving water access for thousands of citizens.

Despite the Tanzanian Government's efforts, the problem has proved difficult to solve, mainly due to a lack of resources. According to the [Tanzania Economic Update 2023](https://www.worldbank.org/en/country/tanzania/publication/tanzania-economic-update-universal-access-to-water-and-sanitation-could-transform-social-and-economic-development) by the World Bank Group, only 61% of households in Tanzania currently have access to a basic water supply. This is a marked gain on the 2000s, but there is still room for improvement.

When it comes to resources, the Tanzanian water sector remains heavily dependent on external donors, with 88% of the available funds being provided by external donor organisations. However, results have been mixed. For example, a report by GIZ, cited on [Wikipedia](https://en.wikipedia.org/wiki/Water_supply_and_sanitation_in_Tanzania), notes that "despite heavy investments brought in by the World Bank and the European Union, the utility serving Dar es Salaam has remained one of the worst performing water entities in Tanzania."

### **Project Overview**
The Government of Tanzania is constantly working to increase the share of the population with access to water. However, this share is repeatedly dragged down by a lack of routine well maintenance and follow-up.

The goal of this project is to help the Government keep track of the functionality status of wells across the country (functional, functional but in need of repair, or non-functional), so that it can maintain existing wells and build new ones on the way to 100% access to clean, potable water across Tanzania.

### **Project Objectives**
This project seeks to:
- Develop a machine learning model that predicts whether a well is:
    - *Functional*

    - *Non-functional*

    - *Functional but needs repair*

- Achieve a target classification accuracy of **at least 85%**.

- Identify key features (e.g., installer, construction year) that drive well functionality.

- Support national water development by identifying underperforming or non-functional wells using data science tools.

- Develop a blueprint system that can be adapted for similar water access initiatives globally.

### **Stakeholders**
- *The Tanzanian Government*: By predicting which wells are functional, non-functional or in need of repair, the government gains a clear picture of where resources are needed most. This helps it drive its Water Development Strategy further and improve the country's water situation.

- *External Donors*: These are individuals or institutions who provide resources to help the nation.

- *Non-Profit Organizations*: These are organizations looking to help.

- *Local Communities*: These are the people the wells help directly. 

### **In Scope**
- Exploratory Data Analysis (EDA) to understand key trends and relationships in the dataset.

- Data preprocessing: handling missing values, encoding categorical variables, and scaling numerical ones.

- Feature engineering, including interaction terms and domain-specific transformations.

- Model development using classification algorithms: *Logistic Regression, Decision Trees, Random Forests*

- Model evaluation using accuracy, precision, recall, F1-score, confusion matrix, and ROC-AUC.

- Model validation using k-fold cross-validation and hyperparameter tuning (GridSearchCV).
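
The in-scope steps above can be sketched end to end with scikit-learn. This is a minimal illustration under stated assumptions, not the project's final pipeline: the data is synthetic, the column names (`construction_year`, `gps_height`, `installer`) and class proportions are hypothetical stand-ins for the real dataset, and the small parameter grid and macro-averaged F1 scoring are example choices only.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in data (assumption): NOT the real well dataset.
rng = np.random.default_rng(0)
n = 300
X = pd.DataFrame({
    "construction_year": rng.integers(1970, 2014, size=n).astype(float),
    "gps_height": rng.normal(1000, 300, size=n),
    "installer": rng.choice(["A", "B", "C"], size=n),
})
# Inject some missing values so the imputation step has work to do.
X.loc[rng.choice(n, size=20, replace=False), "construction_year"] = np.nan

# Three target classes with made-up, imbalanced proportions.
y = rng.choice(
    ["functional", "non functional", "functional needs repair"],
    size=n,
    p=[0.55, 0.38, 0.07],
)

numeric = ["construction_year", "gps_height"]
categorical = ["installer"]

# Preprocessing: impute and scale numeric columns, one-hot encode categorical ones.
preprocess = ColumnTransformer([
    ("num", Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
    ]), numeric),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),
])

# Keeping preprocessing inside the pipeline means it is re-fit on each
# training fold, which avoids leaking validation data into the transforms.
model = Pipeline([
    ("prep", preprocess),
    ("clf", RandomForestClassifier(random_state=0)),
])

# k-fold cross-validation (k=3 here, stratified by class) with a small grid.
search = GridSearchCV(
    model,
    param_grid={"clf__n_estimators": [50, 100], "clf__max_depth": [5, None]},
    cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=0),
    scoring="f1_macro",
)
search.fit(X, y)

print(search.best_params_)
print(round(search.best_score_, 3))
```

Because the labels here are random, the reported score carries no meaning; the sketch only shows how preprocessing, a classifier, cross-validation and `GridSearchCV` fit together. Swapping `RandomForestClassifier` for `LogisticRegression` or `DecisionTreeClassifier` only requires changing the `clf` step and its parameter grid.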