<img src = '../../sb_tight.png'>
<h1 align = 'center'> Capstone Project 2: Pump It Up </h1>

---

### Notebook 4: Modeling
**Author:<br>
Tashi T. Gurung**<br>
**hseb.tashi@gmail.com**

### About the project:
The **objective** of this project is to **predict failures of water points** spread across Tanzania before they occur.

50% of Tanzania's population does not have access to safe water. Among other sources, Tanzanians depend on water points, mostly pumps (~60K), spread across the country. Compared to other infrastructure projects, water point projects consist of a huge number of inspection points that are geographically spread out. Gathering data on the condition of these pumps has been a challenge. From working with local agencies to implementing mobile-based crowdsourcing projects, no approach has produced satisfactory results.

The lack of quality data creates a number of problems for stakeholders such as the Tanzanian Government, specifically the Ministry of Water. Consequences include not only higher maintenance costs, but also all the problems and nuanced issues faced by communities when their access to water is compromised or threatened.

While better data collection infrastructure should be built over time, this project (with its models, analyses, and insights) will be key for efficient resource allocation to maximize the number of people and communities with access to water.
In the long run, it will assist stakeholders in project planning, and even in local, regional, and national policy formation.

### About the notebook:
The data for our project exists in two separate datasets:
1. Containing potential features
2. Containing target variable

In this notebook, we combine these datasets.\
We also perform preliminary EDA, and look at duplicate values and missing values.

Finally, we export this combined dataset for further EDA.

In [1]:
import pandas as pd

In [2]:
df = pd.read_csv('../data/intermediate_data/df.csv')

FileNotFoundError: [Errno 2] File ../data/intermediate_data/df.csv does not exist: '../data/intermediate_data/df.csv'

**Stupid Model**

In [19]:
df['target_var'].isna().sum()

0

In [20]:
df['target_var'].value_counts(normalize = True)

functional                 0.543081
non functional             0.384242
functional needs repair    0.072677
Name: target_var, dtype: float64

A 'stupid' model that predicts every pump to be 'functional' will still have an ***accuracy of about 54%***, because 54.3% of the pumps carry that label.
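The same majority-class baseline can be expressed with scikit-learn's `DummyClassifier`. The sketch below is only illustrative: the label array is synthetic (an assumption, built to roughly match the class shares printed above), and the names `labels`, `X_dummy`, and `baseline` do not appear in the notebook.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Synthetic labels (assumption) roughly matching the class shares printed above
labels = np.array(
    ["functional"] * 54
    + ["non functional"] * 38
    + ["functional needs repair"] * 8
)

# The 'most_frequent' strategy ignores the features, so a column of zeros suffices
X_dummy = np.zeros((len(labels), 1))

baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_dummy, labels)

# Accuracy equals the share of the most common class in the labels
baseline_accuracy = baseline.score(X_dummy, labels)
print(baseline_accuracy)
```

Any real model should beat this number before it is considered useful.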

**Base Model**

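Let us intuitively select a feature and build a simple logistic regression model. To keep the problem binary, pumps labelled 'functional needs repair' are counted as 'functional'.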

In [21]:
df.head(1).T

Unnamed: 0,0
id,69572
amount_tsh,6000
date_recorded,2011-03-14
funder,Roman
gps_height,1390
installer,Roman
longitude,34.9381
latitude,-9.85632
wpt_name,none
basin,Lake Nyasa


In [22]:
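# Assign the result back; calling replace(..., inplace=True) on a selected
# column is not guaranteed to modify df in recent pandas versions
df['target_var'] = df['target_var'].replace({'functional needs repair': 'functional'})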

In [23]:
# Encode the binary target: 1 = functional, 0 = non functional
df['target_var'] = df['target_var'].replace({'functional': 1,
                                             'non functional': 0})
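The two replacement steps above collapse three labels into a binary target. As a hedged alternative, the same encoding can be written as a single explicit mapping with `Series.map`, which also makes it easy to detect labels the mapping does not cover. The series `status` and the dictionary `status_to_binary` are hypothetical names used only for this sketch.

```python
import pandas as pd

# Toy labels (assumption) standing in for df['target_var']
status = pd.Series(
    ["functional", "non functional", "functional needs repair", "functional"]
)

# Hypothetical mapping: both 'functional' variants become 1, 'non functional' becomes 0
status_to_binary = {
    "functional": 1,
    "functional needs repair": 1,
    "non functional": 0,
}

encoded = status.map(status_to_binary)

# map() turns unmapped labels into NaN, so this check catches unexpected categories
assert encoded.notna().all()

encoded = encoded.astype(int)
print(encoded.tolist())
```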

In [56]:
df = df[~df['construction_year'].isna()]

In [74]:
df['construction_year'].isna().sum()

0

In [95]:
df['target_var'].value_counts(normalize = True)

1    0.626296
0    0.373704
Name: target_var, dtype: float64

In [80]:
from sklearn.model_selection import train_test_split

# Single feature (construction_year) and binary target; default 75/25 split
X_train, X_test, y_train, y_test = train_test_split(
    df.loc[:, 'construction_year'],
    df.loc[:, 'target_var'],
    random_state = 23,
)

In [81]:
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression()

In [83]:
X_train

52624    1997.0
5721     2008.0
2904     2008.0
42986    1970.0
10000    2006.0
          ...  
9483     2004.0
14879    2008.0
17182    1994.0
40864    2006.0
14193    1992.0
Name: construction_year, Length: 29018, dtype: float64

In [84]:
y_train

52624    1
5721     1
2904     1
42986    0
10000    0
        ..
9483     1
14879    1
17182    1
40864    1
14193    0
Name: target_var, Length: 29018, dtype: int64

In [86]:
logreg.fit(X_train.values.reshape(-1,1),y_train.values)

LogisticRegression()
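The `reshape(-1, 1)` call is needed because scikit-learn estimators expect a 2-D feature array, while a single selected column is a 1-D Series. One possible alternative, sketched here on a toy frame (an assumption, not the project data), is to select the column with double brackets so it stays a one-column DataFrame.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Toy frame (assumption) with the same column names as the notebook
toy = pd.DataFrame({
    "construction_year": [1970.0, 1985.0, 1994.0, 2004.0, 2006.0, 2008.0],
    "target_var": [0, 0, 1, 1, 0, 1],
})

# Double brackets select a one-column DataFrame, which is already 2-D
X_2d = toy[["construction_year"]]
y_1d = toy["target_var"]

model = LogisticRegression()
model.fit(X_2d, y_1d)

# One row per sample, one column per feature; one coefficient for the single feature
print(X_2d.shape, model.coef_.shape)
```

With this selection style, the same `X` can be passed to `fit`, `predict`, and `score` without reshaping each time.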

In [92]:
y_pred = logreg.predict(X_test.values.reshape(-1,1))

In [94]:
logreg.score(X_test.values.reshape(-1,1), y_test)

0.6248320066163549
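This score of about 0.625 is close to the 0.626 share of class 1 printed earlier for the full dataset, so it may be worth checking whether the model does more than predict the majority class. The sketch below shows one way to make that comparison; it uses toy arrays (assumptions standing in for `y_test` and `y_pred`), not the project's results.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Toy labels and predictions (assumption), standing in for y_test and y_pred
y_true = np.array([1, 1, 1, 0, 0, 1, 0, 1])
y_hat = np.ones_like(y_true)  # a model that always predicts class 1

model_accuracy = accuracy_score(y_true, y_hat)

# Accuracy obtained by always predicting the most common class in y_true
majority_accuracy = np.bincount(y_true).max() / len(y_true)

# Rows are true classes (0, 1); columns are predicted classes (0, 1)
cm = confusion_matrix(y_true, y_hat, labels=[0, 1])

print(model_accuracy, majority_accuracy)
print(cm)
```

If the first column of the confusion matrix is all zeros, the model never predicts class 0, and its accuracy is just the majority-class share.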

In [None]:
# Target Class Balance/Imbalance
df['target_var'].value_counts(normalize = True) * 100