### README

### Machine Learning Model: Tanzanian Water Wells

#### Problem Description
Water Aid is an NGO based in the United Kingdom that works on access to clean water around the world. It considers access to clean water, decent toilets, and good hygiene to be basic human rights. For over 30 years, it has worked in partnership to improve access to these three essentials through a combination of programmatic and policy work.

Water Aid works in several countries around the globe, including Tanzania and surrounding East African countries. According to the World Sector Report (2019), around 60% of the population in this area has access to improved water, but the degree of access and the quality and quantity of water vary. Drought, landscape change, and the amplifying effects of climate change are straining existing surface water supplies.

Water Aid is launching a program to repair non-functioning wells in the water basins that Eastern African countries share across their borders. The status of wells is not clearly recorded in the countries surrounding Tanzania. Identifying non-functioning wells, securing funding, and traveling to these rural locations to repair wells is both time- and resource-intensive. Water Aid needs a predictive model that accurately identifies which wells are not functioning, so that it can reduce costs and use its resources wisely. It also needs to identify a specific water basin in which to begin the work.



### Goals:

1. Using an iterative process, build a predictive machine learning model based on existing water well data to accurately classify non-functioning wells.

2. Recommend a specific water basin where Water Aid should begin its work and where, combined with the model, repairs have the best chance of success.

    

### Data

The Tanzanian Water Wells dataset comes from the Taarifa waterpoints dashboard, which aggregates data from the Tanzania Ministry of Water. The dataset is accessible through www.drivendata.org. It contains over 50,000 records detailing the functional status of water wells throughout the country. Each record covers four key areas: geographic data, water quality and quantity, water pump features, and management schemes. The features are a mix of numerical and categorical data.

### Modeling process

Exploratory Data Analysis --> Split data into test set and training set --> Preprocessing Data --> Logistic Regression Initial Model --> Decision Tree Model --> Tuned Random Forest Model.

The target for the model is multiclass, with three well statuses: functional, functional but in need of repair, and non-functional.

The predictors for the model include all the other non-redundant features that describe the well.
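The modeling sequence described above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the data is synthetic, the column names (`gps_height`, `construction_year`, `basin`, `quantity`) and label strings are placeholders chosen for the example, and the hyperparameters are untuned defaults rather than the tuned values used for the final model.

```python
import numpy as np
import pandas as pd
from sklearn.base import clone
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; column names and labels are assumptions.
rng = np.random.default_rng(0)
n = 300
X = pd.DataFrame({
    "gps_height": rng.normal(1000, 300, n),
    "construction_year": rng.integers(1960, 2014, n),
    "basin": rng.choice(["Ruvuma", "Pangani", "Rufiji"], n),
    "quantity": rng.choice(["enough", "insufficient", "dry"], n),
})
labels = ["functional", "functional needs repair", "non functional"]
y = rng.choice(labels, n, p=[0.55, 0.07, 0.38])

# Split first so preprocessing is only ever fitted on training data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# Scale numeric features and one-hot encode categorical ones.
numeric = ["gps_height", "construction_year"]
categorical = ["basin", "quantity"]
preprocess = ColumnTransformer([
    ("num", Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
    ]), numeric),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),
])

# Same preprocessing, three successive model families.
models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(random_state=42),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=42),
}

fitted, predictions = {}, {}
for name, estimator in models.items():
    pipe = Pipeline([("preprocess", clone(preprocess)), ("model", estimator)])
    fitted[name] = pipe.fit(X_train, y_train)
    predictions[name] = pipe.predict(X_test)
```

Wrapping preprocessing and estimator in one `Pipeline` keeps the held-out test set out of the fitting step, so each model is compared on the same unseen rows.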

### Key Results 

Key model results focus primarily on precision. Precision is prioritized over recall because Water Aid needs its predictions of functioning and non-functioning wells to be correct in order to decrease cost, maximize use of program resources, and deliver mission success metrics to its funders.
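For readers less familiar with the two metrics, here is a small sketch of how per-class precision and recall are computed for a multiclass target. The labels below are invented toy values, not project data, and the helper name `per_class_precision_recall` is hypothetical.

```python
from collections import Counter


def per_class_precision_recall(y_true, y_pred):
    """Return {label: (precision, recall)} for a multiclass problem."""
    predicted = Counter(y_pred)  # how often each label was predicted
    actual = Counter(y_true)     # how often each label truly occurs
    true_pos = Counter(t for t, p in zip(y_true, y_pred) if t == p)

    scores = {}
    for label in sorted(set(actual) | set(predicted)):
        # Precision: of the wells predicted as `label`, how many really are.
        precision = true_pos[label] / predicted[label] if predicted[label] else 0.0
        # Recall: of the wells that really are `label`, how many were found.
        recall = true_pos[label] / actual[label] if actual[label] else 0.0
        scores[label] = (precision, recall)
    return scores


# Toy labels, invented for illustration only.
y_true = ["non functional"] * 4 + ["functional"] * 4 + ["functional needs repair"] * 2
y_pred = [
    "non functional", "non functional", "non functional", "functional",
    "functional", "functional", "functional", "non functional",
    "functional needs repair", "functional",
]
scores = per_class_precision_recall(y_true, y_pred)
for label, (precision, recall) in scores.items():
    print(f"{label:<25} precision={precision:.2f} recall={recall:.2f}")
```

In this toy case, three of the four wells predicted as non-functional really are non-functional, so precision for that class is 0.75. A high-precision model means fewer trips to wells that turn out to be working.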

#### Initial Model Results using Logistic Regression:

    Precision Score:  Functional Wells         0.78
                      Functional Needs Repair  0.55
                      Non-Functioning Wells    0.80

    Recall Score:     Functional Wells         0.88
                      Functional Needs Repair  0.25
                      Non-Functioning Wells    0.74

#### Final Model using Random Forest Classifier:

    Precision Score:  Functional Wells         0.82
                      Functional Needs Repair  0.34
                      Non-Functioning Wells    0.83

    Recall Score:     Functional Wells         0.79
                      Functional Needs Repair  0.60
                      Non-Functioning Wells    0.75
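To read the final model's precision of 0.83 for non-functioning wells in operational terms, the arithmetic can be sketched as below. The number of flagged wells (200) is an arbitrary example figure, and the function name is hypothetical. The result is an expectation and assumes the wells being visited resemble the data the precision was measured on.

```python
def expected_visit_outcomes(flagged_wells, precision):
    """Split wells flagged as non-functioning into expected hits and misses."""
    truly_non_functioning = flagged_wells * precision
    false_alarms = flagged_wells - truly_non_functioning
    return truly_non_functioning, false_alarms


# 200 flagged wells is an illustrative figure, not a project number.
hits, misses = expected_visit_outcomes(flagged_wells=200, precision=0.83)
print(f"Expected wells needing repair: {hits:.0f}")
print(f"Expected unnecessary visits:   {misses:.0f}")
```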


### Data Visuals









### Recommendations

1. Model Use - Precision vs. Recall

Water Aid should rely on the model's precision when identifying non-functioning wells in shared basins. With a precision of 0.83 for the final model, about 83% of the wells it flags as non-functioning are expected to be non-functioning, which helps direct program resources toward wells that need repair.

2. Basin Location

The model makes significant use of data from Tanzanian water basins that also span nearby countries. The Ruvuma Basin stretches from southern Tanzania to northern Mozambique. More than 55% of wells in the Ruvuma Basin are non-functioning, and over 7% need repair. Because Water Aid has a presence in both countries, and other NGOs and both governments are working to manage this area across national borders, the Ruvuma Basin offers an opportunity to make an impact using this model.

3. Well Age

In the Ruvuma Basin, a much higher proportion of older wells than newer wells need repair. Wells built between 1975 and 1990 should be targeted first: nearly 70% of the wells built in that period are non-functional.
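Basin-level and age-level shares like the ones quoted above can be computed along these lines. This is a sketch on a tiny invented table: the column names `basin`, `construction_year`, and `status_group` are assumptions about how the data is laid out, and the rows are toy values that do not reproduce the reported percentages.

```python
import pandas as pd

# Toy records, invented for illustration; column names are assumptions.
wells = pd.DataFrame({
    "basin": ["Ruvuma", "Ruvuma", "Ruvuma", "Ruvuma", "Pangani", "Pangani"],
    "construction_year": [1980, 1988, 2005, 2010, 1985, 2008],
    "status_group": [
        "non functional", "non functional", "functional",
        "functional needs repair", "functional", "non functional",
    ],
})

# Share of each status within every basin (each row sums to 1).
status_share = pd.crosstab(
    wells["basin"], wells["status_group"], normalize="index"
)
print(status_share.round(2))

# Non-functional share among Ruvuma wells built from 1975 through 1990.
ruvuma_old = wells[
    (wells["basin"] == "Ruvuma")
    & wells["construction_year"].between(1975, 1990)
]
old_non_functional_share = (ruvuma_old["status_group"] == "non functional").mean()
print(f"Ruvuma 1975-1990 non-functional share: {old_non_functional_share:.0%}")
```

Normalizing by row makes basins of different sizes comparable, which matters when picking a single basin to start in.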

### Navigating Repository

Data Visuals - Contains three visuals communicating data context and model results.

Job Files - Contains the fitted models, from the initial model to the final model.

Water Wells.ipynb - A Jupyter notebook containing the worked model code.

Test Set Values - A hold out test set for the models

Training Set Values - Features, or predictors, data set

Training Set Labels - Target variable for well functional status.
