# Capstone 2 - Predicting Water Pump Condition in Tanzania Model

Kenneth Liao

---

## Background

The UN publishes and reviews a list of least developed countries (LDCs) every three years. LDCs are “low-income countries confronting severe structural impediments to sustainable development. They are highly vulnerable to economic and environmental shocks and have low levels of human assets.”$^{1}$ Tanzania has been classified as an LDC since the UN published the first list of LDCs in 1971$^{2}$. A common challenge for LDCs is a lack of infrastructure to support the development of the nation, including access to education and healthcare, waste management, and potable water.

According to UNICEF, as of 2017, more than 24 million Tanzanians lacked access to basic drinking water$^{3}$. This corresponds to only 56.7% of the country’s population having access to basic drinking water. Outside of developed urban areas, much of the potable water is accessed via water pumps. 

Taarifa is an open-source platform for crowd-sourced reporting and triaging of infrastructure-related issues. Together with the Tanzanian Ministry of Water, Taarifa has collected data on thousands of water pumps throughout Tanzania. The goal of this project is to predict the condition of these water pumps in order to improve maintenance, reduce pump downtime, and ensure basic water access for tens of millions of Tanzanians.

**References**

1. https://www.un.org/development/desa/dpad/least-developed-country-category.html
2. https://www.un.org/development/desa/dpad/wp-content/uploads/sites/45/publication/ldc_list.pdf
3. https://washwatch.org/en/countries/tanzania/summary/statistics/


### Problem Description

Predict the operating condition of water pumps in Tanzania given various metadata on each water pump.

### Strategy

The strategy will be to implement a Random Forest model for multiclass classification of the state of water pumps.

### Data

The dataset is provided by Taarifa, together with the Tanzanian Ministry of Water, and is hosted by DrivenData.org:

https://www.drivendata.org/competitions/7/pump-it-up-data-mining-the-water-table/page/23/

---

We first import the necessary libraries and the cleaned datasets.

In [1]:
import pandas as pd
import numpy as np
import plotly.graph_objs as go
from plotly.offline import iplot, plot, init_notebook_mode
from config import credentials
from sklearn.model_selection import train_test_split, GridSearchCV, StratifiedKFold
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import make_scorer, confusion_matrix, f1_score, recall_score, accuracy_score, precision_score
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

init_notebook_mode(connected=True)

In [2]:
X_train = pd.read_pickle('../data/X_train.pkl')
X_test = pd.read_pickle('../data/X_cv.pkl')
y_train = pd.read_pickle('../data/y_train.pkl')
y_test = pd.read_pickle('../data/y_cv.pkl')

In [3]:
X_train.head()

id,funder_0,funder_A/co Germany,funder_Aar,funder_Abas Ka,funder_Abasia,funder_Abc-ihushi Development Cent,funder_Abd,funder_Abdala,funder_Abddwe,funder_Abdul,...,latitude,num_private,region_code,district_code,population,construction_year,year_recorded,month_recorded,day_recorded,years_since_install
60371,0,0,0,0,0,0,0,0,0,0,...,-3.025106,0,18,8,0,0,2011,7,21,2011
17088,0,0,0,0,0,0,0,0,0,0,...,-6.0305,0,1,3,0,0,2011,3,11,2011
16532,0,0,0,0,0,0,0,0,0,0,...,-1.692329,0,18,1,0,0,2011,7,18,2011
11098,0,0,0,0,0,0,0,0,0,0,...,-3.18494,0,3,1,1,1975,2013,2,20,38
20249,0,0,0,0,0,0,0,0,0,0,...,-4.417159,0,14,1,0,0,2013,1,18,2013


In [4]:
y_train.head()

id
60371    2
17088    0
16532    2
11098    0
20249    0
Name: status_group, dtype: int64

In [5]:
labels = {0: 'functional', 1: 'functional needs repair', 2: 'non functional'}
labels_list = list(labels.values())

### Out-of-box Random Forest

I'll start by building a baseline against which we can compare later models. Recall that the majority class was **functional**, which comprised 54.3% of the data. Let's see what the precision, recall, and f1-score look like for an out-of-box random forest model.

In [6]:
%%time
# define and train the model
model = RandomForestClassifier(n_jobs=-1, random_state=42)
model.fit(X_train, y_train)


The default value of n_estimators will change from 10 in version 0.20 to 100 in 0.22.



Wall time: 53 s


RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                       max_depth=None, max_features='auto', max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=-1,
                       oob_score=False, random_state=42, verbose=0,
                       warm_start=False)

In [7]:
# get the predicted labels from the model
y_pred = model.predict(X_test)

Let's look at the confusion matrix for the model.

In [8]:
# confusion_matrix(y_true, y_pred) puts the actual classes on the rows
# and the predicted classes on the columns
rows = pd.MultiIndex.from_tuples([('Actual', i) for i in labels_list])
cols = pd.MultiIndex.from_tuples([('Predicted', i) for i in labels_list])
cm = pd.DataFrame(confusion_matrix(y_test, y_pred), index=rows, columns=cols)
cm

,,Predicted,Predicted,Predicted
,,functional,functional needs repair,non functional
Actual,functional,9493,276,950
Actual,functional needs repair,758,453,214
Actual,non functional,1788,144,5526


From this summary it's easy to see that most functional pumps were correctly classified as functional. We were less accurate on the non functional pumps, and worse still on the functional pumps needing repair. Let's quantify this by computing the precision, recall, and f1-score for each class.

In [9]:
def score(y_test, y_pred):
    scores = pd.DataFrame({'precision': precision_score(y_test, y_pred, average=None),
             'recall': recall_score(y_test, y_pred, average=None),
            'f1-score': f1_score(y_test, y_pred, average=None)},
            index=labels_list).T
    return scores

In [10]:
score(y_test, y_pred)

,functional,functional needs repair,non functional
precision,0.788521,0.5189,0.826009
recall,0.885624,0.317895,0.740949
f1-score,0.834256,0.394256,0.78117


Let's take a moment to interpret these scores and understand what is most important for our problem statement. First, some definitions:

<br>

\begin{equation*}
\text{Precision} = \frac{\text{True Positive}}{\text{True Positive} + \text{False Positive}}
\end{equation*}

<br>

\begin{equation*}
\text{Recall} = \frac{\text{True Positive}}{\text{True Positive} + \text{False Negative}}
\end{equation*}

<br>

\begin{equation*}
F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}
\end{equation*}

<br>
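As a quick check on these definitions, the sketch below recomputes the per-class scores directly from the confusion-matrix counts reported above. It reads the rows as actual classes and the columns as predicted classes, which is scikit-learn's convention for `confusion_matrix(y_true, y_pred)`. The variable names are illustrative only.

```python
import numpy as np

# Counts from the baseline confusion matrix above
# (rows = actual class, columns = predicted class).
cm_counts = np.array([
    [9493, 276, 950],    # actual: functional
    [758, 453, 214],     # actual: functional needs repair
    [1788, 144, 5526],   # actual: non functional
])

true_positive = np.diag(cm_counts)
false_positive = cm_counts.sum(axis=0) - true_positive  # predicted as the class, but actually another
false_negative = cm_counts.sum(axis=1) - true_positive  # actually the class, but predicted as another

precision = true_positive / (true_positive + false_positive)
recall = true_positive / (true_positive + false_negative)
f1 = 2 * precision * recall / (precision + recall)

print(np.round(precision, 6))
print(np.round(recall, 6))
print(np.round(f1, 6))
```

The three arrays should agree with the rows of the score table produced by `score(y_test, y_pred)`.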

We're interested in predicting which pumps are functioning normally, which pumps are functioning but need repair, and which pumps are completely non functional. If a pump is non functional, it requires immediate attention because the population dependent on that water source cannot access clean water. Therefore, it's most critical that we predict this class with high recall. That is, for non functional pumps, we want to minimize the number of pumps we classify as functional when they are actually non functional (false negatives). Of course, if we took this to the extreme and assumed all pumps were non functional, we would have perfect recall but very low precision. This would be impractical because we would essentially have to send surveyors to every pump anyway to check its status, in which case the model is useless. With this in mind, the next step is to try to optimize this model to improve the recall of the non functional group without losing too much precision.
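To make the extreme case concrete, here is a small sketch that labels every pump in the held-out set as non functional and scores that prediction. The class counts are the row sums of the baseline confusion matrix above; the variable names are made up for this illustration.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Actual class counts in the held-out set (row sums of the confusion matrix):
# 0 = functional, 1 = functional needs repair, 2 = non functional
actual_counts = {0: 10719, 1: 1425, 2: 7458}
y_true = np.repeat(list(actual_counts.keys()), list(actual_counts.values()))

# The "assume everything is broken" strategy: predict class 2 for every pump
y_all_non_functional = np.full_like(y_true, 2)

# Score only the non functional class
recall_nf = recall_score(y_true, y_all_non_functional, labels=[2], average=None)[0]
precision_nf = precision_score(y_true, y_all_non_functional, labels=[2], average=None)[0]

print(f"recall: {recall_nf:.3f}, precision: {precision_nf:.3f}")
```

Recall is perfect because no non functional pump is missed, while precision falls to the share of non functional pumps in the data (7458 of 19602, roughly 0.38).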

In [11]:
# pair each feature name with its importance, then sort from most to least important
feat_importances = pd.DataFrame({'feature': X_train.columns,
                                 'importance': model.feature_importances_})
feat_importances = feat_importances.sort_values('importance', ascending=False).reset_index(drop=True)

In [12]:
feat_importances.head(10)

,feature,importance
0,latitude,0.040428
1,longitude,0.039417
2,quantity_dry,0.026119
3,gps_height,0.024455
4,years_since_install,0.020194
5,quantity_enough,0.019303
6,construction_year,0.018846
7,day_recorded,0.018829
8,population,0.018732
9,extraction_type_other,0.011949


### Hyperparameter Tuning & Model Optimization

In [47]:
%%time
class_weight = {0:1/0.5, 1:1/0.073, 2:1/0.35}
clf = RandomForestClassifier(bootstrap=True, class_weight=class_weight, criterion='gini',
                       max_depth=5, max_features='auto', max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=10000, n_jobs=-1,
                       oob_score=False, random_state=42, verbose=0,
                       warm_start=False)
clf.fit(X_train, y_train)

Wall time: 2min 40s


RandomForestClassifier(bootstrap=True,
                       class_weight={0: 2.0, 1: 13.698630136986303,
                                     2: 2.857142857142857},
                       criterion='gini', max_depth=5, max_features='auto',
                       max_leaf_nodes=None, min_impurity_decrease=0.0,
                       min_impurity_split=None, min_samples_leaf=1,
                       min_samples_split=2, min_weight_fraction_leaf=0.0,
                       n_estimators=10000, n_jobs=-1, oob_score=False,
                       random_state=42, verbose=0, warm_start=False)
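The hand-written weights above are reciprocals of 0.5, 0.073, and 0.35. Assuming those numbers are meant to approximate each class's share of the data, a similar weighting can be derived from the labels instead of being typed in. The sketch below uses a toy label vector (the counts are hypothetical, chosen only to resemble an imbalanced three-class problem) and compares the reciprocal-of-share weights with scikit-learn's built-in `'balanced'` heuristic.

```python
import numpy as np
import pandas as pd
from sklearn.utils.class_weight import compute_class_weight

# Toy labels standing in for y_train (hypothetical counts, not the real data)
y_demo = pd.Series([0] * 543 + [1] * 73 + [2] * 384, name='status_group')

# Reciprocal of each class's share of the data
class_share = y_demo.value_counts(normalize=True).sort_index()
inverse_share_weights = (1 / class_share).to_dict()

# scikit-learn's 'balanced' heuristic: n_samples / (n_classes * class_count)
balanced_weights = compute_class_weight(
    class_weight='balanced', classes=np.array([0, 1, 2]), y=y_demo
)

print(inverse_share_weights)
print(dict(zip([0, 1, 2], balanced_weights)))
```

The two weightings differ only by a constant factor (the number of classes), so they rank the classes identically. Passing `class_weight='balanced'` to `RandomForestClassifier` would be one way to get the second variant without computing anything by hand.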

In [48]:
# get the predicted labels from the model
y_pred = clf.predict(X_test)

In [49]:
score(y_test, y_pred)

,functional,functional needs repair,non functional
precision,0.762044,0.320824,0.600771
recall,0.633081,0.305965,0.752212
f1-score,0.691602,0.313218,0.668016


To optimize the model, I reduced max_depth to 5 and increased the number of estimators to 10,000 to reduce overfitting to the training data. I also adjusted the class weights (1/0.5, 1/0.073, and 1/0.35 for the three classes) so that "non functional" pumps carry more weight than "functional" ones. The results above show that the recall of "non functional" pumps rose from 0.74 to 0.75, an improvement of about 1 percentage point over the baseline model. However, the precision for that class dropped from 0.83 to 0.60, and the recall of "functional" pumps fell from 0.89 to 0.63. The next step would be to run an exhaustive grid search to find a model that improves the recall of "non functional" pumps while maintaining the f1-score, so that we don't lose too much precision.
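One possible shape for that grid search is sketched below. It is not a search that was run for this project: the data is a synthetic three-class stand-in from `make_classification`, the parameter grid is deliberately tiny, and the scorer names are made up. The idea is to track both the recall of class 2 ("non functional") and the macro f1-score, and to refit on the recall metric.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score, make_scorer, recall_score
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic, imbalanced three-class data (a stand-in for X_train / y_train)
X_demo, y_demo = make_classification(
    n_samples=600, n_features=12, n_informative=6, n_classes=3,
    weights=[0.55, 0.07, 0.38], random_state=42,
)

scoring = {
    # recall restricted to class 2, the "non functional" label in this notebook
    'recall_non_functional': make_scorer(recall_score, labels=[2], average='macro'),
    # macro f1 as a guard against trading away too much precision
    'f1_macro': make_scorer(f1_score, average='macro'),
}

# A deliberately small, illustrative grid
param_grid = {
    'max_depth': [5, None],
    'class_weight': [None, 'balanced'],
}

search = GridSearchCV(
    RandomForestClassifier(n_estimators=50, random_state=42),
    param_grid=param_grid,
    scoring=scoring,
    refit='recall_non_functional',  # pick the final model by class-2 recall
    cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=42),
)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(search.cv_results_['mean_test_f1_macro'])
```

Because both metrics are recorded in `cv_results_`, candidates with high "non functional" recall can be compared on macro f1 before settling on one, rather than relying on the refit metric alone.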

The model found that the GPS location (latitude, longitude, and height) of water pumps is critical in determining whether they are functioning or not. One feature that I engineered, "years_since_install", was also among the top 5 most important features. This is not surprising, since manufactured goods tend to degrade over time, especially with heavy usage and exposure to weather.

## Conclusions

I trained a Random Forest model to classify water pumps as "functional", "functional needs repair", or "non functional". I focused on improving the recall score for "non functional" pumps, as this class is the most critical to get right. The tuned model predicts "non functional" water pumps with a precision of 0.60 and a recall of 0.75 on the held-out data. This is a good start toward deploying resources where they are needed most and ensuring that Tanzanians have access to clean, potable water.

## Next Steps

Since GPS latitude and longitude were the two most important features in this model, I would like to explore them further by finding correlations between those two features and the other features. Why does the location matter? Are the "non functional" pumps clustered in certain geographical regions? Why are those areas so different? The first step to answering these questions would be to plot the locations of failing pumps and explore how the rest of the features are distributed for these broken pumps.
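As a possible starting point for that exploration, the sketch below bins pumps into one-degree latitude/longitude cells and computes the share of "non functional" pumps in each cell. The data frame is randomly generated (the coordinate ranges are rough, illustrative values), so the output carries no real signal; with the actual data, the same steps could be applied after joining `X_train[['latitude', 'longitude']]` with `y_train` on the `id` index.

```python
import numpy as np
import pandas as pd

# Randomly generated stand-in for the joined features and labels
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    'latitude': rng.uniform(-11.5, -1.0, size=1000),
    'longitude': rng.uniform(29.5, 40.5, size=1000),
    'status_group': rng.integers(0, 3, size=1000),  # 2 = non functional
})

# Snap each pump to a one-degree grid cell
demo['lat_bin'] = demo['latitude'].round(0)
demo['lon_bin'] = demo['longitude'].round(0)
demo['non_functional'] = demo['status_group'].eq(2)

# Share of non functional pumps and pump count per cell
rate_by_cell = (
    demo.groupby(['lat_bin', 'lon_bin'])['non_functional']
    .agg(['mean', 'size'])
    .rename(columns={'mean': 'non_functional_rate', 'size': 'n_pumps'})
    .sort_values('non_functional_rate', ascending=False)
)

print(rate_by_cell.head())
```

Cells with a high failure rate and a reasonable number of pumps would be natural candidates for a closer look at the remaining features.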