<img src = '../../sb_tight.png'>
<h1 align = 'center'> Capstone Project 2: Pump It Up </h1>

---

### Notebook 1: Data Wrangling
**Author:<br>
Tashi T. Gurung**<br>
**hseb.tashi@gmail.com**

### About the project:
The **objective** of this project is to **predict failures of water points** spread across Tanzania before they occur.

50% of Tanzania's population does not have access to safe water. Among other sources, Tanzanians depend on water points, mostly pumps (~60K), spread across the country. Compared to other infrastructure projects, water point projects consist of a huge number of inspection points that are geographically spread out. Gathering data on the condition of these pumps has been a challenge. Approaches ranging from working with local agencies to implementing mobile-based crowdsourcing projects have not produced satisfactory results.

The lack of quality data creates a number of problems for stakeholders such as the Tanzanian Government, specifically the Ministry of Water. Consequences include not only higher maintenance costs, but also the many problems communities face when their access to water is compromised or threatened.

While better data collection infrastructure should be built over time, this project (with its models, analyses, and insights) can help allocate resources efficiently to maximize the number of people and communities with access to water.
In the long run, it can assist stakeholders in project planning, and even in local, regional, and national policy formation.

### About the notebook:
The data for our project exists in two separate datasets:
1. Containing potential features
2. Containing target variable

In this notebook, we combine these datasets.\
We also perform preliminary EDA, and look at duplicate values and missing values.

Finally, we export this combined dataset for further EDA.

---

## Import Libraries and Datasets

In [1]:
import pandas as pd

The data source (drivendata.org) provides the data as two separate datasets:
1. Containing potential features
2. Containing target variable

Let us import and combine these datasets.<br>
**Data Dictionary**: https://www.drivendata.org/competitions/7/pump-it-up-data-mining-the-water-table/page/25/

In [4]:
file_location = '../data/raw/'

# import features
df1 = pd.read_csv(f'{file_location}train.csv')
df1['date_recorded'] = pd.to_datetime(df1['date_recorded'])

# import labels (target-variables) for features
df2 = pd.read_csv(f'{file_location}trainlabels.csv')
df2.columns = ['id', 'target_var']

# merge features and their labels (target variable) on the shared 'id' column
df = pd.merge(df1, df2, on = 'id')
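
When no key is given, `pd.merge` joins on every column name the two frames share, so it can help to name the key and let pandas verify that the relationship is one-to-one. The sketch below uses small made-up frames (`features` and `labels` are illustrative names, not the project data) and assumes `id` is the only key.

```python
import pandas as pd

# Toy stand-ins for the feature and label tables (illustrative values only)
features = pd.DataFrame({'id': [1, 2, 3], 'amount_tsh': [6000.0, 0.0, 25.0]})
labels = pd.DataFrame({'id': [3, 1, 2],
                       'target_var': ['functional', 'non functional', 'functional']})

# validate='one_to_one' raises MergeError if an id repeats on either side;
# indicator=True adds a '_merge' column showing where each row came from
merged = pd.merge(features, labels, on='id', how='outer',
                  validate='one_to_one', indicator=True)

# Every row should be present in both tables
all_matched = (merged['_merge'] == 'both').all()
print(merged.shape, all_matched)
```

An outer join is used here only so that unmatched ids on either side would show up in `_merge` instead of being silently dropped.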

---

Number of rows and columns

In [5]:
df.shape

(59400, 41)

Check whether the `id` column is unique

In [6]:
df['id'].is_unique

True

Assign appropriate datatypes

In [5]:
# convert identifier and code columns from int to str (they are labels, not quantities)
cols = ['id', 'region_code', 'district_code']
for col in cols:
    df[col] = df[col].astype('str')

Let us look at the first row, along with the datatype and the number of unique values in each column

In [10]:
pd.concat([df.iloc[0], df.dtypes, df.nunique()],
          axis = 1,
          keys = ['first row','data type','cardinality'])

| column | first row | data type | cardinality |
|---|---|---|---|
| id | 69572 | int64 | 59400 |
| amount_tsh | 6000 | float64 | 98 |
| date_recorded | 2011-03-14 00:00:00 | datetime64[ns] | 356 |
| funder | Roman | object | 1897 |
| gps_height | 1390 | int64 | 2428 |
| installer | Roman | object | 2145 |
| longitude | 34.9381 | float64 | 57516 |
| latitude | -9.85632 | float64 | 57517 |
| wpt_name | none | object | 37400 |
| num_private | 0 | int64 | 65 |


---

### Data Cleaning
A) Duplicates<br>
B) Missing Values

**drop duplicate rows**

In [11]:
print(df.shape)
df.drop_duplicates(inplace = True)
print(df.shape)

(59400, 41)
(59400, 41)


No duplicates found\
Next, look at missing values
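
Since `id` is unique for every row, a full-row comparison cannot flag two records that describe the same water point under different ids. A minimal sketch of a stricter check, on made-up data, compares rows on every column except `id`. Whether such rows should count as duplicates in this dataset is an assumption that would need to be verified.

```python
import pandas as pd

# Toy frame: rows 0 and 2 are identical apart from their id (made-up values)
toy = pd.DataFrame({
    'id': ['10', '11', '12'],
    'wpt_name': ['none', 'Zahanati', 'none'],
    'gps_height': [1390, 1399, 1390],
})

# Full-row check: finds nothing because id differs on every row
full_dupes = toy.duplicated().sum()

# Check on all columns except id: flags the repeated record
non_id_cols = list(toy.columns.difference(['id']))
content_dupes = toy.duplicated(subset=non_id_cols).sum()

print(full_dupes, content_dupes)
```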

**Columns with missing values**

In [11]:
# count missing values per column, keeping only columns that have any
miss = df.isna().sum()
miss = miss[miss > 0]

pd.concat([miss, round(miss/df.shape[0]*100,1), df.nunique(), df.dtypes], 
          axis = 1, join = 'inner',
          keys = ['missing','missing %', 'nunique()', 'datatype'])\
          .sort_values(by = 'missing %',
                       ascending = False)

| column | missing | missing % | nunique() | datatype |
|---|---|---|---|---|
| scheme_name | 28166 | 47.4 | 2696 | object |
| scheme_management | 3877 | 6.5 | 12 | object |
| installer | 3655 | 6.2 | 2145 | object |
| funder | 3635 | 6.1 | 1897 | object |
| public_meeting | 3334 | 5.6 | 2 | object |
| permit | 3056 | 5.1 | 2 | object |
| subvillage | 371 | 0.6 | 19287 | object |


Observation:
1. 7 of the 40 columns (not including the target) have missing data.
2. Of these 7 features, 5 are strings and 2 are booleans.

Step:
No imputation is applied to the missing values in these string/object (categorical) features.

---

Preliminary EDA revealed two data quality issues:

1. Among numeric features, column ***construction_year*** has missing values encoded as **0**s.

In [12]:
df['construction_year'].value_counts(normalize = True).head()

0       0.348636
2010    0.044529
2008    0.043990
2009    0.042643
2000    0.035202
Name: construction_year, dtype: float64

34.86% of the observations are **0**s, which most likely represent missing values.

**data range** for column: '*construction_year*'

In [13]:
# range of construction years, ignoring the 0 placeholders
valid_years = df.loc[df['construction_year'] != 0, 'construction_year']
print(valid_years.min(), ' : ', valid_years.max())

1960  :  2013
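
If the zeros are indeed placeholders, one way to handle them downstream is to convert them to proper missing values so they stop distorting summary statistics. This is a sketch on made-up values, not a step performed in this notebook, and the nullable `Int64` dtype is just one possible choice.

```python
import pandas as pd

# Made-up construction years, with 0 standing in for "unknown"
years = pd.Series([0, 2010, 1998, 0, 2008], name='construction_year')

# Blank out the 0 placeholders, then use nullable Int64 so years stay integers
years_clean = years.mask(years == 0).astype('Int64')

# Count of former zeros, and the range computed without them
print(years_clean.isna().sum())
print(years_clean.min(), years_clean.max())
```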


2. For column '*num_private*',<br>
98.7% of the values are 0.

In [14]:
print('Missing: ',df['num_private'].isna().sum())
print(df['num_private'].value_counts(normalize = True))

Missing:  0
0      0.987256
6      0.001364
1      0.001229
5      0.000774
8      0.000774
         ...   
180    0.000017
213    0.000017
23     0.000017
55     0.000017
94     0.000017
Name: num_private, Length: 65, dtype: float64


Because the column is almost constant, drop '*num_private*'

In [15]:
df.drop('num_private',axis = 1, inplace = True)
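
The same reasoning can be applied to any column: measure the share of rows taken by the most frequent value and flag columns above a cutoff. In the sketch below, the helper name `dominant_share` and the 0.95 cutoff are illustrative assumptions, and the data is made up.

```python
import pandas as pd

def dominant_share(frame: pd.DataFrame) -> pd.Series:
    """Share of rows taken by the most frequent value in each column."""
    return frame.apply(
        lambda col: col.value_counts(normalize=True, dropna=False).iloc[0]
    )

# Made-up frame: 'num_private' is almost always 0, 'gps_height' is varied
toy = pd.DataFrame({
    'num_private': [0] * 99 + [6],
    'gps_height': list(range(100)),
})

shares = dominant_share(toy)

# Columns where a single value covers more than 95% of rows (arbitrary cutoff)
near_constant = shares[shares > 0.95].index.tolist()
print(near_constant)
```

A high share alone does not prove a column is uninformative, so flagged columns still deserve a look before they are dropped.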

---

### Export data

In [17]:
df.to_csv('../data/interim/df.csv', index = False)
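
A CSV file does not store dtypes, so the string and datetime conversions made above are lost on reload unless they are restated. A minimal sketch of the round trip, using a made-up two-row frame and an in-memory buffer in place of the real file:

```python
import io
import pandas as pd

# Made-up frame with the same kinds of columns as the real one
toy = pd.DataFrame({
    'id': ['69572', '8776'],
    'region_code': ['11', '20'],
    'date_recorded': pd.to_datetime(['2011-03-14', '2013-03-06']),
})

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)

# Restate the intended dtypes when reading the data back
reloaded = pd.read_csv(buffer,
                       dtype={'id': str, 'region_code': str},
                       parse_dates=['date_recorded'])
print(reloaded.dtypes)
```

Without the `dtype` and `parse_dates` arguments, `id` and `region_code` would come back as integers and `date_recorded` as plain text.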

### Summary

1. The raw training dataset was provided as two separate datasets, one with features and the other with the target variable.<br>
We imported and combined them.
2. No duplicate rows were found, but 7 columns have missing values.<br>
2.1) These columns are all categorical, including two boolean columns.
3. Preliminary EDA revealed two data quality issues:<br>
3.1) First, column '*construction_year*' has ~35% of its values encoded as 0, which most likely represent missing values.<br>
3.2) Second, for column '*num_private*', ~99% of observations have the value 0. Hence we dropped this column.



---

<h2 align = 'center'> END </h2>