# Capstone 2 - Predicting Water Pump Condition in Tanzania Model

Kenneth Liao

---

## Background

The UN publishes and reviews a list of least developed countries (LDCs) every three years. LDCs are “low-income countries confronting severe structural impediments to sustainable development. They are highly vulnerable to economic and environmental shocks and have low levels of human assets.”$^{1}$ Tanzania has been classified as an LDC since the UN published the first list of LDCs in 1971$^{2}$. A common challenge for LDCs is a lack of infrastructure to support the development of the nation, including access to education and healthcare, waste management, and potable water.

According to UNICEF, as of 2017, more than 24 million Tanzanians lacked access to basic drinking water$^{3}$. This corresponds to only 56.7% of the country’s population having access to basic drinking water. Outside of developed urban areas, much of the potable water is accessed via water pumps. 

Taarifa is an open-source platform for crowd-sourced reporting and triaging of infrastructure-related issues. Together with the Tanzanian Ministry of Water, Taarifa has collected data on thousands of water pumps throughout Tanzania. The goal of this project is to predict the condition of these water pumps in order to improve maintenance, reduce pump downtime, and ensure basic water access for tens of millions of Tanzanians.

**References**

1. https://www.un.org/development/desa/dpad/least-developed-country-category.html
2. https://www.un.org/development/desa/dpad/wp-content/uploads/sites/45/publication/ldc_list.pdf
3. https://washwatch.org/en/countries/tanzania/summary/statistics/


### Problem Description

Predict the operating condition of water pumps in Tanzania given various metadata on each water pump.

### Strategy

The strategy will be to implement a Random Forest model for multiclass classification of the state of water pumps.

### Data

The dataset is provided by Taarifa, together with the Tanzanian Ministry of Water, and is hosted by DrivenData.org:

https://www.drivendata.org/competitions/7/pump-it-up-data-mining-the-water-table/page/23/

---

We first import the necessary libraries and the cleaned datasets.

In [None]:
import pandas as pd
import numpy as np
import plotly.graph_objs as go
from plotly.offline import iplot, plot, init_notebook_mode
from config import credentials
from sklearn.model_selection import train_test_split, GridSearchCV, StratifiedKFold
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import make_scorer, confusion_matrix, f1_score, recall_score, accuracy_score, precision_score
from sklearn.preprocessing import StandardScaler

init_notebook_mode(connected=True)

In [None]:
X_train = pd.read_pickle('../data/X_train.pkl')
X_test = pd.read_pickle('../data/X_cv.pkl')
y_train = pd.read_pickle('../data/y_train.pkl')
y_test = pd.read_pickle('../data/y_cv.pkl')

In [None]:
X_train.head()

In [None]:
y_train.head()

In [None]:
labels = {0: 'functional', 1: 'functional needs repair', 2: 'non functional'}
labels_list = list(labels.values())

### Out-of-box Random Forest

I'll start by building a baseline against which we can compare later models. Recall that the majority class was **functional**, which comprised 54.3% of the data. Let's see what the precision, recall, and f1-score metrics look like for an out-of-box random forest model.

In [None]:
%%time
# define and train the model
model = RandomForestClassifier(n_jobs=-1)
model.fit(X_train, y_train)

In [None]:
# get the predicted labels from the model
y_pred = model.predict(X_test)

Let's look at the confusion matrix for the model.

In [None]:
# sklearn's confusion_matrix puts the actual classes on the rows
# and the predicted classes on the columns
rows = pd.MultiIndex.from_tuples(('Actual', i) for i in labels_list)
cols = pd.MultiIndex.from_tuples(('Predicted', i) for i in labels_list)
cm = pd.DataFrame(confusion_matrix(y_test, y_pred), index=rows, columns=cols)
cm

From the confusion matrix it's easy to see that the majority of functional pumps were correctly classified as functional. We were less accurate in classifying the non functional pumps and worse still at classifying the functional pumps needing repair. Let's summarize this by computing the precision, recall, and f1 scores for each class.

In [None]:
def score(y_test, y_pred):
    # per-class metrics (average=None returns one value per class)
    scores = pd.DataFrame(
        {'precision': precision_score(y_test, y_pred, average=None),
         'recall': recall_score(y_test, y_pred, average=None),
         'f1-score': f1_score(y_test, y_pred, average=None)},
        index=labels_list).T
    return scores

In [None]:
score(y_test, y_pred)
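
For comparison with the scores above, a naive majority-class baseline can be sketched with scikit-learn's `DummyClassifier`. This is only an illustration on synthetic labels, not the pump data: the names `X_demo`, `y_demo`, and `baseline`, and the class proportions, are assumptions made up for the example. A model that always predicts the majority class gets accuracy equal to that class's share of the data, but zero recall on every other class.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Synthetic labels (NOT the pump data); proportions are illustrative only.
# 0 = functional, 1 = functional needs repair, 2 = non functional
rng = np.random.default_rng(0)
y_demo = rng.choice([0, 1, 2], size=1000, p=[0.54, 0.07, 0.39])
X_demo = np.zeros((len(y_demo), 1))  # the dummy model ignores the features

# always predict the most frequent class seen during fit
baseline = DummyClassifier(strategy='most_frequent')
baseline.fit(X_demo, y_demo)
y_base = baseline.predict(X_demo)

baseline_accuracy = accuracy_score(y_demo, y_base)
baseline_recall = recall_score(y_demo, y_base, average=None)

print('accuracy:', baseline_accuracy)
print('recall per class:', baseline_recall)
```

Any useful model should beat this accuracy while also achieving non-zero recall on the minority classes.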

Let's take a moment to interpret these scores and understand what is most important for our problem statement. First, some definitions:

<br>

\begin{equation*}
\text{Precision} = \frac{\text{True Positive}}{\text{True Positive} + \text{False Positive}}
\end{equation*}

<br>

\begin{equation*}
\text{Recall} = \frac{\text{True Positive}}{\text{True Positive} + \text{False Negative}}
\end{equation*}

<br>

\begin{equation*}
\text{F1} = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}
\end{equation*}

<br>

We're interested in predicting which pumps are functioning normally, which pumps are functioning but need to be repaired, and which pumps are completely non functional. If a pump is non functional, it requires immediate attention because the population dependent on that water source cannot access clean water. Therefore, it's most critical that we predict this class with high recall. That is, for non functional pumps, we want to minimize the number of pumps we classify as functional when they are actually non functional (false negatives). Of course, if we took this to the extreme and assumed all pumps were non functional, we would have perfect recall but very low precision. This would be impractical because we would essentially have to send surveyors to every pump to check its status, in which case the model is useless. With this in mind, the next step is to optimize the model to improve the recall of the non functional class without losing too much precision.
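
The extreme case described above can be checked on a small set of made-up labels. This is a minimal sketch: `y_true` and `y_all_broken` are hypothetical arrays invented for illustration, using the same 0/1/2 encoding as `labels`.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Made-up ground truth: 5 functional, 1 needs repair, 4 non functional
y_true = np.array([0, 0, 0, 0, 0, 1, 2, 2, 2, 2])

# Extreme strategy: predict every pump as non functional (class 2)
y_all_broken = np.full_like(y_true, 2)

# labels=[2] with average=None returns the metric for class 2 only
recall_nf = recall_score(y_true, y_all_broken, labels=[2], average=None)[0]
precision_nf = precision_score(y_true, y_all_broken, labels=[2], average=None)[0]

print('recall (non functional):   ', recall_nf)     # 4 / 4 = 1.0
print('precision (non functional):', precision_nf)  # 4 / 10 = 0.4
```

Every truly non functional pump is caught, but six of the ten flagged pumps would be wasted visits.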

In [None]:
model.feature_importances_
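
The raw importance array is easier to read when paired with the column names and sorted. The sketch below shows one way to do that on synthetic data; `X_demo`, `y_demo`, `rf`, and the `feature_i` column names are assumptions for illustration. On the notebook's data, the same pattern would presumably use `model.feature_importances_` and `X_train.columns`.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic 3-class data standing in for the pump features
X_arr, y_demo = make_classification(n_samples=300, n_features=6,
                                    n_informative=3, n_classes=3,
                                    random_state=0)
X_demo = pd.DataFrame(X_arr, columns=[f'feature_{i}' for i in range(6)])

rf = RandomForestClassifier(n_estimators=50, random_state=0)
rf.fit(X_demo, y_demo)

# label each importance with its column name and sort, largest first
importances = pd.Series(rf.feature_importances_, index=X_demo.columns)
importances = importances.sort_values(ascending=False)
print(importances)
```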

### Dimensionality Reduction with PCA

### Hyperparameter Tuning & Model Optimization

For hyperparameter tuning, we'll turn to GridSearchCV. To optimize the model, I want to 

In [None]:
# define the model
clf = RandomForestClassifier(n_jobs=-1, random_state=42)

param_grid = {'n_estimators':[10,100]}

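# the default average='binary' raises an error for multiclass targets,
# so an averaging strategy must be specified; macro averaging weights
# each class equally
scorers = {'Precision': make_scorer(precision_score, average='macro'),
           'Recall': make_scorer(recall_score, average='macro'),
           'F1_score': make_scorer(f1_score, average='macro')}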
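
If the aim is specifically recall on the non functional class, one option is a scorer restricted to that label. This is a sketch under that assumption rather than the notebook's chosen metric; `nf_recall_scorer` is a hypothetical name, and the dummy models and toy labels exist only to show how the scorer behaves.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import make_scorer, recall_score

# recall computed over label 2 (non functional) only
nf_recall_scorer = make_scorer(recall_score, labels=[2], average='macro')

# Toy data: features are irrelevant, labels use the 0/1/2 encoding
X_demo = np.zeros((6, 1))
y_demo = np.array([0, 0, 2, 2, 2, 1])

# Two constant models: one always predicts class 2, one always class 0
always_nf = DummyClassifier(strategy='constant', constant=2).fit(X_demo, y_demo)
always_ok = DummyClassifier(strategy='constant', constant=0).fit(X_demo, y_demo)

# scorers are called as scorer(estimator, X, y)
score_nf = nf_recall_scorer(always_nf, X_demo, y_demo)
score_ok = nf_recall_scorer(always_ok, X_demo, y_demo)
print(score_nf, score_ok)
```

A scorer like this could be added to the `scorers` dictionary under its own key and used as the refit metric.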

In [None]:
def grid_search(model, param_grid, scorers, refit_score='Recall'):
    # define the stratified k-fold model
    skf = StratifiedKFold(n_splits=5)
    
    # define the grid search; with multiple scorers, refit must name the
    # metric used to pick the best parameters and refit the final model
    gs = GridSearchCV(model, param_grid, scoring=scorers, refit=refit_score,
                      cv=skf, return_train_score=True, n_jobs=-1)
    
    # train the model
    gs.fit(X_train.values, y_train.values)
    
    # make predictions on the test set with the refit best estimator
    y_pred = gs.predict(X_test.values)
    
    print('Best params for {}'.format(refit_score))
    print(gs.best_params_)
    
    # per-class scores on the test set
    print(score(y_test, y_pred))
    
    return gs

In [None]:
%%time
grid_search(clf, param_grid, scorers)
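
Once a multi-metric grid search has been fit, the per-candidate scores live in `cv_results_` under keys of the form `mean_test_<scorer name>`. The sketch below shows how they might be tabulated; it runs on synthetic data, and `gs_demo`, `demo_scorers`, and the small parameter grid are assumptions for illustration.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import make_scorer, precision_score, recall_score
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic 3-class data standing in for the pump dataset
X_demo, y_demo = make_classification(n_samples=300, n_features=6,
                                     n_informative=3, n_classes=3,
                                     random_state=0)

demo_scorers = {
    'Precision': make_scorer(precision_score, average='macro', zero_division=0),
    'Recall': make_scorer(recall_score, average='macro'),
}

gs_demo = GridSearchCV(RandomForestClassifier(random_state=0),
                       {'n_estimators': [5, 20]},
                       scoring=demo_scorers, refit='Recall',
                       cv=StratifiedKFold(n_splits=3))
gs_demo.fit(X_demo, y_demo)

# one row per parameter combination, one column per averaged metric
results = pd.DataFrame(gs_demo.cv_results_)[
    ['param_n_estimators', 'mean_test_Precision', 'mean_test_Recall']]
print(results)
print(gs_demo.best_params_)
```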