## <span style='color:blue'>USING MACHINE LEARNING TO PREDICT FUNCTIONALITY OF WATER WELLS IN TANZANIA</span>

![Water is essential for life](1.jpg)

#### Business Problem
Challenges such as extreme weather events driven by climate change, unprecedented population growth, forest clearance, and land demarcation have all contributed to water crises in many parts of Africa. Many people in Africa, particularly in Tanzania, are experiencing water access, sanitation, and hygiene crises. Although challenging, the water crisis in Tanzania doesn’t define the country. Each day, parents are working hard to provide better lives for their children, and the country as a whole has seen steady growth over the last decade.

Tanzania has the largest population in East Africa, currently 64.51 million according to Worldometer. Water is a basic human need, yet more than half of Tanzania's population has no access to clean drinking water. The Ministry of Water and Sanitation in Tanzania has called for inclusive and collaborative strategies to solve this problem and improve clean water sources. Many water wells have already been established in Tanzania, yet some are not functioning at all and need to be repaired.

![Water](final.jpg)

#### Business Objectives
Our main objective is to build a classification model that predicts the functionality of water points. We will recommend this predictive model to the authorities to help them understand which water wells are (i) functional, (ii) nonfunctional, and (iii) functional but still in need of repair. The model is intended to assist the Tanzanian Ministry of Water and Sanitation in identifying wells that require maintenance and to provide insights for future well projects. With this model, we can guide Tanzanian authorities in maximizing the productivity of water sources and optimizing government investments in wells.

#### My Approach
I am using this dataset to apply what I have learned in this phase of my Data Science immersive program and to gauge my skills and knowledge. The methods covered are Decision Trees, Logistic Regression, Random Forest, Bagging, GridSearchCV, AdaBoost, XGBoost, and Support Vector Machines, all applied as supervised machine learning algorithms. I will evaluate several algorithms with hyperparameter tuning to get a better sense of how the models perform given different inputs and hyperparameter settings. I will use accuracy, ROC-AUC, and the classification report to present model performance.

#### Data
The data used for this project comes from the DrivenData website. The dataset contains nearly 60,000 rows and about 40 columns describing water wells across Tanzania. Each record includes location data, technical specifications of the well, information about the water, and more. Full details about the dataset are available at https://www.drivendata.org/competitions/7/pump-it-up-data-mining-the-water-table/.

#### Challenges
This project involves a dataset with nearly 40 features, some of which have more than 2000 unique values (e.g., installer, funder). As a student, I faced numerous challenges while learning to implement models using various algorithms. I also had to find ways to simplify the modeling process to keep it manageable.
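One common way to tame high-cardinality columns such as `installer` or `funder` is to keep only the most frequent values and group the rest under a single label. The sketch below illustrates that idea only. It is not necessarily the simplification used in this project, and the helper name `collapse_rare_categories`, the `top_n` cutoff, and the sample values are all hypothetical.

```python
import pandas as pd


def collapse_rare_categories(series, top_n=20, other_label="other"):
    """Keep the top_n most frequent values and replace the rest with other_label.

    Missing values are left untouched so a later imputation step can handle them.
    """
    top_values = series.value_counts().nlargest(top_n).index
    keep_mask = series.isin(top_values) | series.isna()
    return series.where(keep_mask, other_label)


# Made-up sample values, for illustration only.
installer = pd.Series(
    ["DWE", "DWE", "DWE", "Government", "Government", "RWE", "Commu"]
)
collapsed = collapse_rare_categories(installer, top_n=2)
print(collapsed.tolist())
# → ['DWE', 'DWE', 'DWE', 'Government', 'Government', 'other', 'other']
```

Grouping rare values this way keeps the one-hot encoded matrix small, at the cost of losing detail about infrequent installers or funders.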

#### Plan:
1. Understanding Data
2. Cleaning and Exploring Data
3. Preparing Data for Modeling
4. Finding a baseline model/ Binary regression
5. Fit Decision Tree, Logistic Regression, Random Forest, AdaBoost, XGBoost, and Stacking classifiers, with GridSearchCV integrated in a Pipeline.
6. Results
7. Recommendation 
8. Next Step

#### Understanding Dataset

Column Descriptions:


| Features | Descriptions |
|----------|--------------|
| amount_tsh | Total static head (amount of water available to the waterpoint) |
| date_recorded | The date the row was entered |
| funder | Who funded the well |
| gps_height | Altitude of the well |
| installer | Organization that installed the well |
| longitude | GPS coordinate |
| latitude | GPS coordinate |
| wpt_name | Name of the waterpoint, if there is one |
| num_private | (no description provided) |
| basin | Geographic water basin |
| subvillage | Geographic location |
| region | Geographic location |
| region_code | Geographic location (coded) |
| district_code | Geographic location (coded) |
| lga | Geographic location |
| ward | Geographic location |
| population | Population around the well |
| public_meeting | True/False |
| recorded_by | Group entering this row of data |
| scheme_management | Who operates the waterpoint |
| scheme_name | Who operates the waterpoint |
| permit | Whether the waterpoint is permitted |
| construction_year | Year the waterpoint was constructed |
| extraction_type | The kind of extraction the waterpoint uses |
| extraction_type_class | The kind of extraction the waterpoint uses |
| management | How the waterpoint is managed |
| management_group | How the waterpoint is managed |
| payment | What the water costs |
| payment_type | What the water costs |
| water_quality | The quality of the water |
| quality_group | The quality of the water |
| quantity | The quantity of water |
| quantity_group | The quantity of water |
| source | The source of the water |
| source_type | The source of the water |
| source_class | The source of the water |
| waterpoint_type | The kind of waterpoint |
| waterpoint_type_group | The kind of waterpoint |

### Methods
#### Label Encode Target Variables
This project focused on descriptive analysis, visualization, and machine learning modeling to describe trends for water wells across Tanzania that are in need of repair.

To better define the status of each well for this analysis, I set up a binary system for the wells:

0 = Functional

1 = Needs Repair
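The binary encoding above can be sketched as a simple label mapping. This is a minimal illustration that assumes the target column is named `status_group` and that both "non functional" and "functional needs repair" are folded into the Needs Repair class. The exact mapping used in the notebook may differ.

```python
import pandas as pd

# Toy labels; the column name and label strings are assumptions about the raw target file.
labels = pd.DataFrame({
    "status_group": [
        "functional",
        "non functional",
        "functional needs repair",
        "functional",
    ]
})

# Assumed mapping: anything that is not fully functional counts as "needs repair".
status_to_binary = {
    "functional": 0,
    "non functional": 1,
    "functional needs repair": 1,
}

labels["target"] = labels["status_group"].map(status_to_binary)

# map() leaves NaN for labels missing from the dictionary, so fail loudly if that happens.
assert labels["target"].notna().all(), "unexpected status_group value"
labels["target"] = labels["target"].astype(int)

print(labels["target"].tolist())
# → [0, 1, 1, 0]
```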

To minimize data leakage and keep preprocessing consistent across models, I implemented a pipeline. The pipeline first separates the numerical data from the categorical data, allowing each type of data to undergo its respective preprocessing steps.

For the numerical data:

Since the numerical columns do not contain any missing values, an imputer is not necessary. To standardize and scale the numerical data, the StandardScaler is applied.

For the categorical data:

To handle missing data in the categorical columns, the SimpleImputer is used to fill them with the most frequent value in each respective column. A 'missing' indicator is also included to indicate that the data was originally missing from the dataset. To convert the categorical information into a binary system suitable for modeling, the OneHotEncoder is utilized.

By employing this pipeline, the analysis is made more efficient, data integrity is preserved, and the numerical and categorical data are appropriately prepared for modeling purposes.
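The preprocessing described above can be sketched with scikit-learn's `ColumnTransformer`. This is a minimal illustration rather than the exact notebook code: the column selection, the toy data, and the choice of `LogisticRegression` as the final step are assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical column split; the real notebook may select different columns.
numeric_cols = ["gps_height", "population"]
categorical_cols = ["installer", "quantity"]

# Numerical branch: no imputer, only scaling.
numeric_steps = Pipeline([("scale", StandardScaler())])

# Categorical branch: most-frequent imputation plus a missing indicator, then one-hot encoding.
categorical_steps = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent", add_indicator=True)),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])

preprocessor = ColumnTransformer([
    ("num", numeric_steps, numeric_cols),
    ("cat", categorical_steps, categorical_cols),
])

model = Pipeline([
    ("preprocess", preprocessor),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Tiny made-up frame, only to show that the pipeline fits end to end.
X = pd.DataFrame({
    "gps_height": [1390, 1399, 686, 263, 0, 120],
    "population": [109, 280, 250, 58, 0, 35],
    "installer": ["Roman", np.nan, "World vision", "UNICEF", "Artisan", "Roman"],
    "quantity": ["enough", "insufficient", "enough", "dry", np.nan, "dry"],
})
y = [0, 0, 0, 1, 1, 1]

model.fit(X, y)
predictions = model.predict(X)
print(predictions)
```

Because the scaler, imputer, and encoder live inside the pipeline, they are fitted only on the data passed to `fit`, which is what keeps test-set statistics from leaking into training when the pipeline is used with a train/test split or cross-validation.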

### Results 

#### Visualization

***Extraction Type Class*** - This graph shows the various methods people in Tanzania use to extract water from the wells in different places.


![image.png](attachment:image.png)

***Region*** - This graph displays the regions of Tanzania where the wells are located.

![image.png](attachment:image.png)

***Quantity_Water*** - This graph displays the amount of water available across the regions.

![image.png](attachment:image.png)

***Source Visualization*** - This graph shows the various sources of water in the region.

![image.png](attachment:image.png)

#### Modeling

***Logistic Regression Classifier*** - with the following scores


1. Train Accuracy: 0.76
2. Test Accuracy: 0.75
3. Train ROC-AUC: 0.84
4. Test ROC-AUC: 0.83

![](http://localhost:8888/files/Logistic%20Regression.png)

***Decision Tree Classifier*** - With the following scores

1. Train Accuracy: 0.93
2. Test Accuracy: 0.78
3. Train ROC-AUC: 0.99
4. Test ROC-AUC: 0.82

![image.png](attachment:image.png)

Decision Tree Classifier  - Confusion Matrix

![image.png](attachment:image.png)

***Random Forest Classifier*** - scores:

1. Train Accuracy: 0.98
2. Test Accuracy: 0.82
3. Train ROC-AUC: 0.99
4. Test ROC-AUC: 0.90

![image.png](attachment:image.png)

Random Forest Classifier - Confusion Matrix

![image-2.png](attachment:image-2.png)

***AdaBoost Classifier*** 

1. Train Accuracy: 0.76
2. Test Accuracy: 0.76
3. Train ROC-AUC: 0.84
4. Test ROC-AUC: 0.83

![image.png](attachment:image.png)

***AdaBoost*** Confusion Matrix

![image-2.png](attachment:image-2.png)

***XGBoost Classifier***

1. Train Accuracy: 0.84
2. Test Accuracy: 0.81
3. Train ROC-AUC: 0.93
4. Test ROC-AUC: 0.89

![image.png](attachment:image.png)

***XGBoost*** Confusion Matrix

![image-2.png](attachment:image-2.png)

***Stack Classifier***

1. Train Accuracy: 0.96
2. Test Accuracy: 0.82
3. Train ROC-AUC: 0.99
4. Test ROC-AUC: 0.90

![image.png](attachment:image.png)
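For readers unfamiliar with stacking, the sketch below shows how a stack classifier can be assembled with scikit-learn's `StackingClassifier`. The base estimators, the final estimator, and the synthetic data are assumptions chosen for illustration. They are not the configuration that produced the scores above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary data standing in for the preprocessed well features.
X, y = make_classification(n_samples=300, n_features=8, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=1
)

# Hypothetical base learners; their cross-validated predictions feed the final estimator.
stack = StackingClassifier(
    estimators=[
        ("tree", DecisionTreeClassifier(max_depth=5, random_state=1)),
        ("forest", RandomForestClassifier(n_estimators=50, random_state=1)),
    ],
    final_estimator=LogisticRegression(max_iter=1000),
    cv=3,
)
stack.fit(X_train, y_train)

stack_test_accuracy = stack.score(X_test, y_test)
print(f"Stack test accuracy: {stack_test_accuracy:.2f}")
```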

### Results Interpretation
In this project I ran several models to analyze and predict which Tanzanian water wells are functional, non-functional, or in need of repair. I compared the models on accuracy, which measures the share of wells the model classifies correctly, and on ROC-AUC, which measures how well the model separates the two classes across all decision thresholds.
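The metrics used in this comparison can be computed with `sklearn.metrics`. The snippet below is a minimal sketch on synthetic data with a plain logistic regression. The data, the model, and the class names passed to `target_names` are placeholders rather than the project's actual setup.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic binary data standing in for the preprocessed well features.
X, y = make_classification(n_samples=500, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Accuracy uses hard class predictions; ROC-AUC uses the predicted probability of class 1.
test_pred = clf.predict(X_test)
test_proba = clf.predict_proba(X_test)[:, 1]

test_accuracy = accuracy_score(y_test, test_pred)
test_auc = roc_auc_score(y_test, test_proba)

print(f"Test Accuracy: {test_accuracy:.2f}")
print(f"Test ROC-AUC: {test_auc:.2f}")
print(classification_report(y_test, test_pred, target_names=["Functional", "Needs Repair"]))
```

Computing the same metrics on the training split as well makes overfitting visible as a gap between the train and test scores.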

The Stack Classifier is among the best-performing models: as listed above, it reaches a test accuracy of 0.82 and a test ROC-AUC of 0.90. The Random Forest Classifier reports the same rounded test scores.

### Conclusion

I first built a base model for each classifier type, fitted with default parameters. I then selected key parameters to tune using sklearn's GridSearchCV, and the best parameters were used to run the final model. I compared the tuned model against the base model of the same type, as well as across model types. I evaluated the models using a classification report, a confusion matrix, ROC plots with AUC scores, and feature importances.
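The base-versus-tuned workflow described above can be sketched as follows. The parameter grid, the scoring metric, and the synthetic data are illustrative assumptions, not the grids that were actually searched in this project.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary data standing in for the preprocessed well features.
X, y = make_classification(n_samples=400, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Base model with default parameters.
base = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Hypothetical grid; cross-validation runs on the training split only.
param_grid = {"max_depth": [3, 5, None], "min_samples_leaf": [1, 5, 20]}
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid,
    scoring="roc_auc",
    cv=3,
)
search.fit(X_train, y_train)

base_score = base.score(X_test, y_test)
tuned_score = search.best_estimator_.score(X_test, y_test)

print("Best parameters:", search.best_params_)
print(f"Base test accuracy: {base_score:.2f}")
print(f"Tuned test accuracy: {tuned_score:.2f}")
```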

### Recommendation
To get better results, especially for the class of wells that need repair, more features need to be included during the data collection process. Adding information to well reports, such as when the well was last serviced, what kind of repairs have been done, or whether any parts have been replaced, could prove useful for better predictions.

### Next Step

The next steps in the project would be to implement the model and continue fine-tuning it as new data becomes available. With the introduction of new data, the model can learn and improve its predictions over time. This iterative process allows for ongoing refinement and optimization of the model's performance.

By utilizing this model, the detection rate of wells needing repair would be significantly improved. This improvement would result in a reduction in the cost of well maintenance and a decrease in the manpower required for inspections. The model's ability to accurately identify wells in need of repair would enable more efficient allocation of resources and timely intervention, ultimately leading to cost savings and improved operational effectiveness.