# Capstone 2 - Predicting Water Pump Condition in Tanzania

Kenneth Liao

---

## Background

The UN publishes and reviews a list of least developed countries (LDCs) every 3 years. LDCs are described as “low-income countries confronting severe structural impediments to sustainable development. They are highly vulnerable to economic and environmental shocks and have low levels of human assets.”$^{1}$ Tanzania has been classified as an LDC since the UN published the first list of LDCs in 1971$^{2}$. A common challenge for LDCs is a lack of infrastructure to support national development, including access to education and healthcare, waste management, and access to potable water.

According to UNICEF, as of 2017, more than 24 million Tanzanians lacked access to basic drinking water$^{3}$. This corresponds to only 56.7% of the country’s population having access to basic drinking water. Outside of developed urban areas, much of the potable water is accessed via water pumps. 

Taarifa is an open-source platform for crowd-sourced reporting and triaging of infrastructure-related issues. Together with the Tanzanian Ministry of Water, it has collected data on thousands of water pumps throughout Tanzania. The goal of this project is to predict the condition of these water pumps in order to improve maintenance, reduce pump downtime, and ensure basic water access for millions of Tanzanians.

**References**

1. https://www.un.org/development/desa/dpad/least-developed-country-category.html
2. https://www.un.org/development/desa/dpad/wp-content/uploads/sites/45/publication/ldc_list.pdf
3. https://washwatch.org/en/countries/tanzania/summary/statistics/


### Problem Description

Predict the operating condition of water pumps in Tanzania given various metadata on each water pump.

### Strategy

The strategy will be to implement an XGBoost model as well as a neural network model for predictions and compare their performance.

### Data

The dataset is provided by Taarifa, together with the Tanzanian Ministry of Water, and is hosted by DrivenData.org:

https://www.drivendata.org/competitions/7/pump-it-up-data-mining-the-water-table/page/23/

---

## Exploratory Analysis

Start by importing the necessary libraries and datasets.

In [10]:
import pandas as pd
import plotly.graph_objs as go
from plotly.offline import iplot, plot, init_notebook_mode

init_notebook_mode(connected=True)

In [11]:
# load the data
train = pd.read_csv('../data/train.csv')
train_labels = pd.read_csv('../data/train-labels.csv')

### Prediction Labels

In [12]:
train_labels.head()

| | id | status_group |
|---|---|---|
| 0 | 69572 | functional |
| 1 | 8776 | functional |
| 2 | 34310 | functional |
| 3 | 67743 | non functional |
| 4 | 19728 | functional |


The train_labels file contains the labels we want to predict, `status_group`. This is the condition of a given water pump.
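Because the labels live in a separate file, they need to be matched to the feature rows through the shared `id` column before any modelling. The following is a minimal sketch of that join, using a few ids and values copied from the dataset previews in this notebook. The frame names `features_demo` and `labels_demo` are illustrative assumptions and are not part of the original analysis.

```python
import pandas as pd

# Illustrative frames (hypothetical names) built from rows shown in this notebook.
labels_demo = pd.DataFrame({
    "id": [69572, 8776, 67743],
    "status_group": ["functional", "functional", "non functional"],
})
features_demo = pd.DataFrame({
    "id": [67743, 69572, 8776],          # deliberately in a different order
    "gps_height": [263, 1390, 1399],
})

# Join on the shared key; validate="one_to_one" raises if an id is duplicated.
merged = features_demo.merge(
    labels_demo, on="id", how="inner", validate="one_to_one"
)
print(merged)
```

Joining on the key rather than relying on row order guards against the two files being sorted differently.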

In [54]:
counts = train_labels.groupby('status_group').count()
counts

| status_group | id |
|---|---|
| functional | 32259 |
| functional needs repair | 4317 |
| non functional | 22824 |


In [55]:
trace0 = go.Bar(name='functional', x=['functional'], y=[counts.loc['functional','id']],
               marker=dict(color='lightgreen'), showlegend=False)

trace1 = go.Bar(name='functional needs repair', x=['functional needs repair'], y=[counts.loc['functional needs repair','id']],
               marker=dict(color='orange'), showlegend=False)

trace2 = go.Bar(name='non functional', x=['non functional'], y=[counts.loc['non functional','id']],
               marker=dict(color='tomato'), showlegend=False)

layout = go.Layout(title='Pump Condition Distribution',
                  yaxis=dict(title='Count'))

fig = go.Figure([trace0, trace1, trace2], layout=layout)

iplot(fig, filename='pump-conditions.html')

In [58]:
counts/counts.id.sum()

| status_group | id |
|---|---|
| functional | 0.543081 |
| functional needs repair | 0.072677 |
| non functional | 0.384242 |


54.3% of pumps are functional, while 7.3% are functional but require repair and 38.4% are non functional.
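These proportions also give a useful reference point: a trivial model that always predicts the most common class would already be right about 54.3% of the time. The sketch below reproduces the proportions from the counts reported above and derives that majority-class baseline. The baseline framing and the name `class_counts` are assumptions added for illustration, not something the original analysis states.

```python
# Class counts copied from the groupby output above.
class_counts = {
    "functional": 32259,
    "functional needs repair": 4317,
    "non functional": 22824,
}
total = sum(class_counts.values())
shares = {label: n / total for label, n in class_counts.items()}

# Accuracy of a trivial model that always predicts the most common class.
majority_label = max(class_counts, key=class_counts.get)
baseline_accuracy = shares[majority_label]

for label, share in shares.items():
    print(f"{label}: {share:.1%}")
print(f"majority-class baseline ({majority_label}): {baseline_accuracy:.1%}")
```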

### Features

In [61]:
train = train.set_index('id')
train.head()

| id | amount_tsh | date_recorded | funder | gps_height | installer | longitude | latitude | wpt_name | num_private | basin | ... | payment_type | water_quality | quality_group | quantity | quantity_group | source | source_type | source_class | waterpoint_type | waterpoint_type_group |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 69572 | 6000.0 | 2011-03-14 | Roman | 1390 | Roman | 34.938093 | -9.856322 | none | 0 | Lake Nyasa | ... | annually | soft | good | enough | enough | spring | spring | groundwater | communal standpipe | communal standpipe |
| 8776 | 0.0 | 2013-03-06 | Grumeti | 1399 | GRUMETI | 34.698766 | -2.147466 | Zahanati | 0 | Lake Victoria | ... | never pay | soft | good | insufficient | insufficient | rainwater harvesting | rainwater harvesting | surface | communal standpipe | communal standpipe |
| 34310 | 25.0 | 2013-02-25 | Lottery Club | 686 | World vision | 37.460664 | -3.821329 | Kwa Mahundi | 0 | Pangani | ... | per bucket | soft | good | enough | enough | dam | dam | surface | communal standpipe multiple | communal standpipe |
| 67743 | 0.0 | 2013-01-28 | Unicef | 263 | UNICEF | 38.486161 | -11.155298 | Zahanati Ya Nanyumbu | 0 | Ruvuma / Southern Coast | ... | never pay | soft | good | dry | dry | machine dbh | borehole | groundwater | communal standpipe multiple | communal standpipe |
| 19728 | 0.0 | 2011-07-13 | Action In A | 0 | Artisan | 31.130847 | -1.825359 | Shuleni | 0 | Lake Victoria | ... | never pay | soft | good | seasonal | seasonal | rainwater harvesting | rainwater harvesting | surface | communal standpipe | communal standpipe |


The preview above shows that the training data contains 39 features with mixed datatypes. Each feature is described below, using the descriptions from the data source.

`amount_tsh` - Total static head (amount of water available to waterpoint)
<br>`date_recorded` - The date the row was entered
<br>`funder` - Who funded the well
<br>`gps_height` - Altitude of the well
<br>`installer` - Organization that installed the well
<br>`longitude` - GPS coordinate
<br>`latitude` - GPS coordinate
<br>`wpt_name` - Name of the waterpoint if there is one
<br>`num_private` -
<br>`basin` - Geographic water basin
<br>`subvillage` - Geographic location
<br>`region` - Geographic location
<br>`region_code` - Geographic location (coded)
<br>`district_code` - Geographic location (coded)
<br>`lga` - Geographic location
<br>`ward` - Geographic location
<br>`population` - Population around the well
<br>`public_meeting` - True/False
<br>`recorded_by` - Group entering this row of data
<br>`scheme_management` - Who operates the waterpoint
<br>`scheme_name` - Who operates the waterpoint
<br>`permit` - If the waterpoint is permitted
<br>`construction_year` - Year the waterpoint was constructed
<br>`extraction_type` - The kind of extraction the waterpoint uses
<br>`extraction_type_group` - The kind of extraction the waterpoint uses
<br>`extraction_type_class` - The kind of extraction the waterpoint uses
<br>`management` - How the waterpoint is managed
<br>`management_group` - How the waterpoint is managed
<br>`payment` - What the water costs
<br>`payment_type` - What the water costs
<br>`water_quality` - The quality of the water
<br>`quality_group` - The quality of the water
<br>`quantity` - The quantity of water
<br>`quantity_group` - The quantity of water
<br>`source` - The source of the water
<br>`source_type` - The source of the water
<br>`source_class` - The source of the water
<br>`waterpoint_type` - The kind of waterpoint
<br>`waterpoint_type_group` - The kind of waterpoint
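
Several features share an identical description (for example `quantity` and `quantity_group`), which suggests that some columns may carry overlapping information. One way to check this is to compare the columns row by row. The sketch below does so on the five preview rows only, so it says nothing conclusive about the full dataset. The helper `same_values` and the frame `preview` are hypothetical names introduced for illustration.

```python
import pandas as pd

# Values copied from the five preview rows above; not the full dataset.
preview = pd.DataFrame({
    "quantity": ["enough", "insufficient", "enough", "dry", "seasonal"],
    "quantity_group": ["enough", "insufficient", "enough", "dry", "seasonal"],
    "source": ["spring", "rainwater harvesting", "dam",
               "machine dbh", "rainwater harvesting"],
    "source_type": ["spring", "rainwater harvesting", "dam",
                    "borehole", "rainwater harvesting"],
})

def same_values(df, a, b):
    """Return True if columns a and b agree on every row of df."""
    return bool((df[a] == df[b]).all())

print(same_values(preview, "quantity", "quantity_group"))
print(same_values(preview, "source", "source_type"))
```

In these five rows, `quantity` and `quantity_group` agree everywhere, while `source` and `source_type` differ on one row (`machine dbh` versus `borehole`), so `source_type` looks like a coarser grouping rather than an exact copy.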

In [62]:
train.shape

(59400, 39)

The full feature data has shape (59400, 39). Having more than 2 orders of magnitude more samples than features helps avoid the curse of dimensionality. The main concern is the label with the smallest sample size: the data for the condition "functional needs repair" has shape (4317, 39). In this case, the number of samples is still 2 orders of magnitude larger than the number of features.
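
The order-of-magnitude comparison can be checked with a quick calculation using the shapes and counts reported above. This is a rough sketch: it counts the 39 raw columns, and the effective number of features would change if categorical columns were later expanded, which is an assumption about later steps rather than something shown here.

```python
import math

n_features = 39
sample_counts = {
    "all pumps": 59400,
    "functional needs repair": 4317,   # smallest class
}

# Samples per feature, and the same ratio expressed as a power of ten.
orders = {}
for name, n_samples in sample_counts.items():
    ratio = n_samples / n_features
    orders[name] = math.log10(ratio)
    print(f"{name}: {ratio:.0f} samples per feature (10^{orders[name]:.2f})")
```

Both ratios exceed 100 samples per feature, consistent with the "2 orders of magnitude" statement.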

In [81]:
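def plot_hist(data, col, ylog=False, xlog=False):
    """Plot a histogram of data[col], optionally with log-scaled axes."""

    # plotly axis type: 'log' for a logarithmic axis, None for the default
    ymode = 'log' if ylog else None
    xmode = 'log' if xlog else None

    # label the trace with the column name, not the literal string 'col'
    trace = go.Histogram(x=data[col], name=col)

    # title the figure after the column being plotted
    layout = go.Layout(title=f'{col} Distribution',
                       yaxis=dict(title='Count', type=ymode),
                       xaxis=dict(title=col, type=xmode))

    fig = go.Figure([trace], layout=layout)

    iplot(fig, filename=f'{col}-dist.html')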

In [84]:
plot_hist(train, col='amount_tsh', ylog=True)

In [93]:
train.amount_tsh.astype(str).unique()

array(['6000.0', '0.0', '25.0', '20.0', '200.0', '500.0', '50.0',
       '4000.0', '1500.0', '6.0', '250.0', '10.0', '1000.0', '100.0',
       '30.0', '2000.0', '400.0', '1200.0', '40.0', '300.0', '25000.0',
       '750.0', '5000.0', '600.0', '7200.0', '2400.0', '5.0', '3600.0',
       '450.0', '40000.0', '12000.0', '3000.0', '7.0', '20000.0',
       '2800.0', '2200.0', '70.0', '5500.0', '10000.0', '2500.0',
       '6500.0', '550.0', '33.0', '8000.0', '4700.0', '7000.0', '14000.0',
       '1300.0', '100000.0', '700.0', '1.0', '60.0', '350.0', '0.2',
       '35.0', '306.0', '8500.0', '117000.0', '3500.0', '520.0', '15.0',
       '6300.0', '9000.0', '150.0', '120000.0', '138000.0', '350000.0',
       '4500.0', '13000.0', '45000.0', '2.0', '15000.0', '11000.0',
       '50000.0', '7500.0', '16300.0', '800.0', '16000.0', '30000.0',
       '53.0', '5400.0', '70000.0', '250000.0', '200000.0', '26000.0',
       '18000.0', '26.0', '590.0', '900.0', '9.0', '1400.0', '170000.0',
       '220.0', '