# Capstone 2 - Predicting Water Pump Condition in Tanzania: Data Munging

Kenneth Liao

---

## Background

The UN publishes and reviews a list of least developed countries (LDCs) every 3 years. LDCs are “low-income countries confronting severe structural impediments to sustainable development. They are highly vulnerable to economic and environmental shocks and have low levels of human assets.”$^{1}$ Tanzania has been classified as an LDC since the UN published the first list of LDCs in 1971$^{2}$. A common challenge for LDCs is a lack of the infrastructure needed to support national development, including access to education and healthcare, waste management, and access to potable water.

According to UNICEF, as of 2017, more than 24 million Tanzanians lacked access to basic drinking water$^{3}$. In other words, only 56.7% of the country’s population had access to basic drinking water. Outside of developed urban areas, much of the potable water is accessed via water pumps.

Taarifa is an open-source platform for crowd-sourced reporting and triaging of infrastructure-related issues. Together with the Tanzanian Ministry of Water, Taarifa has collected data on thousands of water pumps throughout Tanzania. The goal of this project is to predict the condition of these water pumps in order to improve maintenance, reduce pump downtime, and ensure basic water access for millions of Tanzanians.

**References**

1. https://www.un.org/development/desa/dpad/least-developed-country-category.html
2. https://www.un.org/development/desa/dpad/wp-content/uploads/sites/45/publication/ldc_list.pdf
3. https://washwatch.org/en/countries/tanzania/summary/statistics/


### Problem Description

Predict the operating condition of water pumps in Tanzania given various metadata on each water pump.

### Strategy

The strategy is to train both an XGBoost model and a neural network model on the same data and compare their predictive performance.

### Data

The dataset is provided by Taarifa, together with the Tanzanian Ministry of Water, and is hosted by DrivenData.org:

https://www.drivendata.org/competitions/7/pump-it-up-data-mining-the-water-table/page/23/

---

## Data Munging

In [None]:
import pandas as pd
import plotly.graph_objs as go
from plotly.offline import iplot, plot, init_notebook_mode
from config import credentials
from sklearn.model_selection import train_test_split
import xgboost as xgb

init_notebook_mode(connected=True)

In [None]:
# load the data
train = pd.read_csv('../data/train.csv')
train_labels = pd.read_csv('../data/train-labels.csv')

I'll start by removing the unwanted feature columns identified in the EDA part of the analysis. These include duplicate, irrelevant, and single-value columns.

In [None]:
duplicated = ['recorded_by', 'payment_type', 'quantity_group']

train_clean = train.drop(duplicated, axis=1)
train_clean.columns
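
The columns dropped above were chosen by hand during the EDA. As a rough cross-check, single-value and exactly duplicated columns can also be flagged programmatically. The sketch below uses a small made-up frame (the values and the `toy` name are illustrative assumptions, not the real data). Note that it only catches columns whose values match exactly, so near-duplicates with different labels would still need manual review.

```python
import pandas as pd

# Toy frame standing in for the training data (values are made up).
toy = pd.DataFrame({
    'recorded_by': ['A', 'A', 'A', 'A'],
    'payment': ['monthly', 'never', 'monthly', 'never'],
    'payment_type': ['monthly', 'never', 'monthly', 'never'],
    'gps_height': [10, 20, 30, 40],
})

# Columns with a single distinct value carry no information for a model.
single_value = [c for c in toy.columns if toy[c].nunique(dropna=False) == 1]

# Columns whose values exactly repeat an earlier column.
cols = list(toy.columns)
duplicate_cols = [
    b for i, b in enumerate(cols)
    if any(toy[a].equals(toy[b]) for a in cols[:i])
]

print(single_value)
print(duplicate_cols)
```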

In [None]:
train_clean.set_index(['id', 'date_recorded'], inplace=True)

In [None]:
train_clean.head()

Next, I need to convert the categorical text features into dummy variables.

In [None]:
# list of all categorical variables
cat_cols = [col for col in train_clean.columns if train_clean[col].dtype == 'object']
cat_cols

In [None]:
%%time
cat_dummies = pd.get_dummies(train_clean[cat_cols], dummy_na=True)

I use `pd.get_dummies` with the argument `dummy_na=True` so that null values are not ignored. Instead, nulls are encoded like any other value, so each feature gets an additional dummy column indicating whether the sample was null for that feature. The resulting categorical feature set has 65,828 features.
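
To make the effect of `dummy_na=True` concrete, here is a minimal sketch on a made-up four-row frame (the `toy` frame and its values are illustrative assumptions). The null row ends up flagged in its own `funder_nan` column rather than being encoded as all zeros.

```python
import pandas as pd

# Toy frame with one missing value (values are made up).
toy = pd.DataFrame({'funder': ['Danida', None, 'Hesawa', 'Danida']})

# dummy_na=True adds a dedicated indicator column for nulls.
dummies = pd.get_dummies(toy, dummy_na=True)

print(dummies.columns.tolist())
print(dummies)
```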

In [None]:
cat_dummies.head()

In [None]:
# list of all numerical variables
num_cols = [col for col in train_clean.columns if train_clean[col].dtype != 'object']
num_cols

In [None]:
numerical = train_clean[num_cols]
numerical.head()

In [None]:
numerical.info()

Luckily, none of the numerical columns have null values. Tree-based models split on thresholds and are insensitive to feature scale, so the numerical columns don't need to be normalized for XGBoost. For a neural network model, however, normalization will be necessary. I'll leave the data as-is for now and apply normalization later, when working with the neural network model specifically.
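
When the neural network stage is reached, the normalization could look roughly like the sketch below. It assumes scikit-learn's `StandardScaler` is an acceptable choice (the notebook does not say which scaler will be used) and uses made-up frames named `X_train_num` and `X_cv_num` in place of the real numerical columns. The scaler is fit on the training split only so that no statistics leak from the validation split.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy numeric frames standing in for the numerical columns of the
# training and validation splits (names and values are made up).
X_train_num = pd.DataFrame({'gps_height': [0.0, 100.0, 200.0],
                            'population': [10.0, 20.0, 30.0]})
X_cv_num = pd.DataFrame({'gps_height': [50.0], 'population': [40.0]})

# Fit on the training split only, then reuse the same statistics for validation.
scaler = StandardScaler()
X_train_scaled = pd.DataFrame(scaler.fit_transform(X_train_num),
                              columns=X_train_num.columns,
                              index=X_train_num.index)
X_cv_scaled = pd.DataFrame(scaler.transform(X_cv_num),
                           columns=X_cv_num.columns,
                           index=X_cv_num.index)

print(X_train_scaled)
print(X_cv_scaled)
```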

In [None]:
# merge the categorical dummies and the numerical columns back together
train_full = pd.concat([cat_dummies, numerical], axis=1)
train_full.head()

In [None]:
# move date_recorded out of the index and derive date-based features
train_full = train_full.reset_index().set_index('id')
date_recorded = pd.to_datetime(train_full['date_recorded'])
train_full['year_recorded'] = date_recorded.dt.year
train_full['month_recorded'] = date_recorded.dt.month
train_full['day_recorded'] = date_recorded.dt.day
train_full = train_full.drop('date_recorded', axis=1)
# pump age in years at the time the record was made
train_full['years_since_install'] = train_full['year_recorded'] - train_full['construction_year']

In [None]:
train_labels = train_labels.set_index('id')
train_labels = train_labels['status_group'].map({'functional': 0, 'functional needs repair': 1, 'non functional': 2})

In [None]:
# print any feature names that contain a comma
for col in train_full.columns:
    if ',' in str(col):
        print(col)

In [None]:
X_train, X_cv, y_train, y_cv = train_test_split(train_full, train_labels, test_size=0.25, random_state=42)

In [None]:
X_train.to_pickle('../data/X_train.pkl')
X_cv.to_pickle('../data/X_cv.pkl')
y_train.to_pickle('../data/y_train.pkl')
y_cv.to_pickle('../data/y_cv.pkl')

In [None]:
X_train.head()

In [None]:
y_train.head()

The full dataset is now ready for training. The dummy-variable encoding may have made the dataset too high-dimensional, however: its shape is now 59,400 × 69,572. If the model performs poorly, it may benefit from using another model to reduce the feature set to the most important features. This can be done with a number of techniques, including PCA, step-wise feature selection, and genetic algorithms for feature selection.
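
As one possible illustration of model-based feature reduction, the sketch below uses scikit-learn's `SelectFromModel` with an `ExtraTreesClassifier` to keep only the features whose importance is at or above the median. This is an assumption about how the idea could be implemented, not the method used in this project, and it runs on synthetic binary data in which only the first two columns are informative.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import SelectFromModel

# Synthetic stand-in for a wide dummy-encoded matrix (not the real data):
# 50 binary features, of which only the first two determine the label.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(500, 50)).astype(float)
y = (X[:, 0] + X[:, 1] >= 1).astype(int)

# Fit a tree ensemble and keep features with importance >= the median importance.
selector = SelectFromModel(
    ExtraTreesClassifier(n_estimators=100, random_state=0),
    threshold='median',
)
X_reduced = selector.fit_transform(X, y)

print(X.shape, '->', X_reduced.shape)
```

On the real data, the selector would be fit on `X_train` and `y_train` only and then applied to `X_cv` with `selector.transform`, so that the validation split does not influence which features are kept.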