# Predicting the Working Status (functional or not) of Pumps at Waterpoints in Tanzania

## Introduction to the Research Question

<p>For some communities in Africa, and here more specifically in Tanzania, secure access to fresh water is not a given. The number of waterpoints available to these communities is limited, so making sure that the pumps at these waterpoints are functional is critical.</p>
<p>The goal of this study is to find an algorithm that identifies, or predicts, which pumps are likely to be broken. This information is invaluable for improving maintenance operations and ensuring continuous access to water.</p>
<p>The data set that I will be using for this study is provided by
<a href="http://taarifa.org/">Taarifa</a> and the <a href="http://maji.go.tz/">Tanzanian Ministry of Water</a>. To predict whether a pump at a waterpoint is functional, I will need to identify the explanatory variables in the data set that are most strongly associated with the response variable, the pump's working status. Predictors that may be useful include the construction year, how the waterpoint is managed, the waterpoint location, and many others.</p>

## Method

### Sample

Taarifa is an open-source platform for crowd-sourced reporting and triaging of infrastructure-related issues.
The sample data comes from the Taarifa waterpoint dashboard, which aggregates data from the Tanzanian Ministry of Water. The data set therefore contains data for Tanzanian waterpoints only.

The data provides information on what kind of pump is operating, when it was installed, how it is managed, and other variables that can be used to predict the status of pumps at waterpoints. All available observations will be used for this study (sample size: 59400). However, variables that turn out not to be associated with the response variable will be dropped.

In [37]:
import pandas

trainingData = pandas.read_csv("Training Set Values.csv")
#Display the number of observations in the full data set
numberOfObservations = len(trainingData)
print("Number of observations in full data set: " + str(numberOfObservations))

Number of observations in full data set: 59400


### Measures
<p>After looking at the data dictionary, some of the variables (type of pump, water quality...) seem like good candidates for predicting whether a pump is working, needs repair, or is not working at all.</p>
<p>The variables chosen are:</p>
* **funder** - Who funded the well
* **installer** - Organization that installed the well
* **basin** - Geographic water basin
* **region** - Geographic location
* **population** - Population around the well
* **public_meeting** - Indicates whether the pump is at a public meeting place (True/False)
* **scheme_management** - Who operates the waterpoint
* **construction_year** - Year the waterpoint was constructed
* **extraction_type** - The kind of extraction the waterpoint uses
* **extraction_type_group** - The kind of extraction the waterpoint uses
* **extraction_type_class** - The kind of extraction the waterpoint uses
* **management** - How the waterpoint is managed
* **management_group** - How the waterpoint is managed
* **water_quality** - The quality of the water
* **source** - The source of the water
* **waterpoint_type** - The kind of waterpoint
* **waterpoint_type_group** - The kind of waterpoint


The data is provided in two files: "Training Set Values.csv", which contains the explanatory variables, and "Training Set Labels.csv", which contains the response variable status_group.

In [38]:
responseData = pandas.read_csv("Training Set Labels.csv")
print("Number of observations in Training Set Labels.csv: " +
      str(len(responseData)))

Number of observations in Training Set Labels.csv: 59400


To perform some of the analysis, the two data sets need to be joined. This can be done using the id variable, which is available in both data sets.

In [39]:
myData = trainingData.merge(responseData, on='id', how='outer')
print(len(myData))

59400
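
The row count alone does not prove that every id matched exactly once. As a minimal sketch of how the join could be checked more strictly, the snippet below uses the `validate` and `indicator` parameters of `DataFrame.merge` on two small, made-up frames (the `values` and `labels` frames and their contents are illustrative assumptions, not the study data):

```python
import pandas

# Hypothetical toy frames standing in for the values and labels files
values = pandas.DataFrame({
    "id": [1, 2, 3],
    "basin": ["Rufiji", "Pangani", "Lake Victoria"],
})
labels = pandas.DataFrame({
    "id": [3, 1, 2],
    "status_group": ["functional", "non functional", "functional"],
})

# validate="one_to_one" raises an error if an id is duplicated on either side;
# indicator=True adds a "_merge" column telling where each row came from
merged = values.merge(labels, on="id", how="outer",
                      validate="one_to_one", indicator=True)

# Rows present in only one of the two frames
unmatched = merged[merged["_merge"] != "both"]
print(len(merged), len(unmatched))
```

If `unmatched` is empty and no error is raised, the outer join behaves exactly like an inner join and no observation lacks a label.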


### Analysis
<p>The first step of the analysis will be to check whether the chosen explanatory variables are associated with the response variable.</p>
The response variable is categorical, therefore we will use:
* Chi-square test of independence if the explanatory variable is categorical
* Chi-square test of independence if the explanatory variable is quantitative, after first binning the explanatory variable into categories
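
The binning step for a quantitative variable could look like the following minimal sketch, which uses `pandas.cut`. The toy values, the bin edges, the labels, and the `population_group` column name are all illustrative assumptions, not choices made in this study:

```python
import pandas

# Hypothetical toy values standing in for a quantitative column such as population
toy = pandas.DataFrame({
    "population": [0, 1, 20, 35, 50, 120, 250, 400, 800, 1500],
    "status_group": ["non functional", "functional", "functional",
                     "non functional", "functional", "functional",
                     "non functional", "functional", "functional",
                     "non functional"],
})

# Hand-picked bin edges (illustrative only); intervals are closed on the right,
# so the first bin (-1, 0] captures exactly the value 0
bins = [-1, 0, 100, 500, float("inf")]
group_labels = ["0", "1-100", "101-500", ">500"]
toy["population_group"] = pandas.cut(toy["population"], bins=bins,
                                     labels=group_labels)

# The binned variable can now be cross-tabulated like any categorical one
print(pandas.crosstab(toy["population_group"], toy["status_group"]))
```

Bins with very few observations make the Chi-square approximation less reliable, so in practice the edges would be chosen so that each category holds a reasonable number of pumps.
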
<p>I will start by testing the public_meeting explanatory variable.</p>

In [40]:
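# Cross-tabulate public_meeting against the pump status
crosstab_pm = pandas.crosstab(myData['public_meeting'], myData['status_group'])
# Row percentages: share of each status within each public_meeting value
print(crosstab_pm.apply(lambda r: r*100/r.sum(), axis=1))
# Raw counts
print(crosstab_pm)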

status_group    functional  functional needs repair  non functional
public_meeting                                                     
False            42.987141                 8.743818       48.269041
True             55.689949                 7.290584       37.019466
status_group    functional  functional needs repair  non functional
public_meeting                                                     
False                 2173                      442            2440
True                 28408                     3719           18884


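The row percentages show that 55.7% of the pumps located at a public meeting place are fully functional, compared with only 43.0% of the pumps located elsewhere. Note that the table covers 56,066 of the 59,400 observations; `pandas.crosstab` leaves out rows where either variable is missing, which here presumably means a missing public_meeting value.
Next, I will run a Chi-square test of independence to check whether the two variables are associated.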
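
To make clear what such a test computes, here is a minimal hand-rolled sketch applied to the counts printed above. It is not the routine used in this study; a library function such as `scipy.stats.chi2_contingency` would normally be preferred. The function name `chi_square_independence` is a hypothetical helper, and the closed-form p-value only holds because this table has exactly 2 degrees of freedom:

```python
import math

# Observed counts copied from the public_meeting crosstab
# (columns: functional, functional needs repair, non functional)
observed = [
    [2173, 442, 2440],      # public_meeting == False
    [28408, 3719, 18884],   # public_meeting == True
]

def chi_square_independence(table):
    """Return the Pearson chi-square statistic and degrees of freedom."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for row, row_total in zip(table, row_totals):
        for count, col_total in zip(row, col_totals):
            # Count expected in this cell if the two variables were independent
            expected = row_total * col_total / total
            stat += (count - expected) ** 2 / expected
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return stat, dof

chi2, dof = chi_square_independence(observed)

# With exactly 2 degrees of freedom, the chi-square survival function
# reduces to exp(-x / 2); other table shapes need a statistics library
p_value = math.exp(-chi2 / 2) if dof == 2 else None

print("chi-square:", chi2, "dof:", dof, "p-value:", p_value)
```

A p-value below the chosen significance level (0.05 is a common choice) would lead to rejecting the hypothesis that pump status is independent of public_meeting.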