# Data Summary and Goal

## Summary

Tanzania, as a developing country, struggles with providing clean water to its population of over 57,000,000. There are many waterpoints already established in the country, but some are in need of repair while others have failed altogether.

## Goal

Build a classifier to predict the condition of a water well, using information provided in the data. This information includes:
- Date
- Location
- Source
- Funder
- And more!

This data is from the DrivenData.org website. It is part of the "Pump It Up: Data Mining the Water Table" competition. DrivenData split the data into two sets, the "Training Set" and the "Test Set".

The names imply that the training set is for building our models and the test set is for testing them. For this project, we considered merging the two dataframes to have more data to work with; however, the training set alone has 59,400 entries, which should be more than enough to make good predictions.

If our models are subpar, we may merge the tables to acquire more data points and potentially improve model efficacy.

# Data Cleaning

## Import Libraries and Data

In [1]:
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from warnings import simplefilter

# ignore all future warnings
simplefilter(action='ignore', category=FutureWarning)

# Load data into Pandas dataframes
status_groups = pd.read_csv('status_groups.csv')
testset = pd.read_csv('test_set.csv')
df = pd.read_csv('training_set.csv')


# Let's add our target series to the dataframe!
# (a positional concat assumes both files list the wells in the same row order)
status_groups.drop(['id'], axis=1, inplace=True)
df = pd.concat([df, status_groups], axis=1)

# Analyze shape of dataset
print(f'Shape of dataset: {df.shape}')
# display(df.head())


Shape of dataset: (59400, 41)
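
The concatenation above lines the labels up by row position. If the two files were ever ordered differently, joining on the shared `id` column would be safer. Below is a minimal sketch of that alternative using made-up toy frames (`toy_features` and `toy_labels` are hypothetical stand-ins, not the real CSVs):

```python
import pandas as pd

# Toy stand-ins for the training features and the status labels (values are made up)
toy_features = pd.DataFrame({'id': [3, 1, 2], 'gps_height': [1390, 0, 686]})
toy_labels = pd.DataFrame({
    'id': [1, 2, 3],
    'status_group': ['functional', 'non functional', 'functional'],
})

# Join on the shared key; validate raises if an id is duplicated on either side
merged = toy_features.merge(toy_labels, on='id', how='inner', validate='one_to_one')
print(merged)
```

Because the join is keyed on `id`, the row order of either file no longer matters.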


## Drop unneeded columns and deal with missing values

In [None]:

# Create a column that shows the age of the well at the time of recording
# Note: construction_year == 0 marks an unknown year (it is binned later as
# 'no_construction_year'), so those rows get an age equal to the recording year
df['date_recorded'] = pd.to_datetime(df['date_recorded'])
df['well_age'] = df['date_recorded'].dt.year - df['construction_year']
# df.well_age.value_counts()
# df['well_age'][df.well_age<0]
# df.iloc[[10441, 8729, 13366, 23373, 27501, 32619, 33942, 39559]][['construction_year', 'date_recorded', 'status_group']]


# Interestingly enough, there are some negative values for the age of wells.
# This suggests that the well was only PLANNED at the time of recording and had not yet been built.
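
One way to keep unknown construction years from distorting `well_age` is to mask the zeros before subtracting. The sketch below uses a made-up toy frame and assumes that `0` means "year unknown"; the `built_after_recorded` column is a hypothetical helper for inspecting the negative ages.

```python
import numpy as np
import pandas as pd

# Toy data (made up): one normal well, one with an unknown year, one "built" after recording
toy = pd.DataFrame({
    'date_recorded': pd.to_datetime(['2011-03-14', '2013-02-05', '2004-06-01']),
    'construction_year': [1999, 0, 2008],
})

# Assumption: 0 means the construction year is unknown, so treat it as missing
year_built = toy['construction_year'].replace(0, np.nan)
toy['well_age'] = toy['date_recorded'].dt.year - year_built

# Flag records whose construction year is later than the recording date
toy['built_after_recorded'] = toy['well_age'] < 0
print(toy)
```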

In [None]:
# EXPLORATORY! We are analyzing the columns in order to remove redundancies
# **Uncomment if viewing the unique values is desired**

# df.region.unique()
# df.scheme_management

# display(df['extraction_type_class'].unique())
# display(df['extraction_type'].unique())
# display(df['extraction_type_group'].unique())

# display(df['source_class'].unique())
# display(df['source'].unique())
# display(df['source_type'].unique())

# display(df.waterpoint_type.unique())
# display(df.waterpoint_type_group.unique())

# display(df.water_quality.unique())
# display(df.quality_group.unique())

# display(df.management.unique())
# display(df.management_group.unique())

# display(df.payment.unique())
# display(df.payment_type.unique())

# display(df.quantity.unique())
# display(df.quantity_group.unique())

df.isna().sum()[df.isna().sum()>0]

In [2]:
# Label encode the target variable
status_labels = {'status_group':{'non functional': 0, 'functional': 1, 'functional needs repair': 2}}
df = df.replace(status_labels)
df.status_group.value_counts()

# Group funders that occur fewer than 100 times into a single 'other' category
funder_counts = df['funder'].map(df['funder'].value_counts())
df.loc[funder_counts < 100, 'funder'] = 'other'

# Drop redundant and unneeded columns
to_drop = ['scheme_name', 'recorded_by', 'wpt_name', 'extraction_type', 'extraction_type_group',
           'region_code', 'district_code', 'latitude', 'longitude', 'lga', 'ward', 'public_meeting', 'date_recorded', 
           'source', 'source_class', 'waterpoint_type', 'water_quality', 'management_group', 
           'payment', 'quantity_group','subvillage', 'num_private']

df.drop(to_drop, axis=1, inplace=True)

# Deal with missing values
df['scheme_management'] = df['scheme_management'].replace({'None': 'Ignore', np.nan: 'Ignore'})
df['permit'] = df['permit'].fillna(False)
df.dropna(axis=0, inplace=True)

# Set 'id' as the index of the dataframe
df.set_index('id', inplace=True)

## Grouping and Labeling Column Values

In [3]:
# Permit

# Encode the boolean permit flag as 1/0 (no missing values remain at this point)
df['permit'] = df['permit'].astype(int)

In [4]:
# Population

def population(obs):
    """Return a labeled population bin for one row of the dataframe."""
    x = obs['population']
    if x == 0:
        return 'No population'
    if 0 < x <= 100:
        return 'Less than 100'
    if 100 < x <= 200:
        return 'Between 100 and 200'
    if 200 < x <= 300:
        return 'Between 200 and 300'
    if 300 < x <= 400:
        return 'Between 300 and 400'
    if 400 < x <= 500:
        return 'Between 400 and 500'
    if x > 500:
        return 'Over 500'
    return ''  # negative or missing values match no bin

df['population'] = df.apply(population, axis=1)
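
The same kind of binning can also be expressed with `pd.cut`, which avoids a row-by-row `apply`. This is only a sketch on made-up values; the bin labels here are our own choice and differ from the labels used in the cell above.

```python
import pandas as pd

# Made-up population counts covering each edge case
toy_population = pd.Series([0, 1, 100, 101, 450, 2000])

# Right-closed bins: (-1, 0], (0, 100], (100, 200], ..., (500, inf]
bins = [-1, 0, 100, 200, 300, 400, 500, float('inf')]
labels = ['No population', '1 to 100', '101 to 200', '201 to 300',
          '301 to 400', '401 to 500', 'Over 500']

binned = pd.cut(toy_population, bins=bins, labels=labels)
print(binned.tolist())
```

Values outside every bin (here, anything below 0) would come back as missing rather than as an empty string.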


In [5]:
# # Well_age

# # Drop all items that have a value less than 0 (very few)
# df.drop(df[df['well_age'] < 0].index, inplace = True)

# # Bin
# conditions = [df.well_age==0, (df.well_age>0)&(df.well_age<=4), (df.well_age>4)&(df.well_age<=12), (df.well_age>12)&(df.well_age<=25), 
#               (df.well_age>25)&(df.well_age<=48), df.well_age>48]
# choices = ['new', '0-4 years', '4-12 years', '12-25 years', '25-48 years', 'more than 48 years']
# df['well_age'] = np.select(conditions, choices)

# Construction_year
# Bin by decade; a value of 0 marks a missing construction year
year = df['construction_year']
conditions = [year == 0,
              (year >= 1960) & (year <= 1970),
              (year > 1970) & (year <= 1980),
              (year > 1980) & (year <= 1990),
              (year > 1990) & (year <= 2000),
              (year > 2000) & (year <= 2010),
              year > 2010]
choices = ['no_construction_year', '1960_1970', '1971_1980', '1981_1990', '1991_2000', '2001_2010', '2011_over']
# Rows matching no condition get 'unknown' instead of NumPy's numeric default of 0
df['construction_year'] = np.select(conditions, choices, default='unknown')


In [6]:
# Amount_tsh
# Bin
amount = df['amount_tsh']
conditions = [amount == 0,
              (amount > 0) & (amount <= 10),
              (amount > 10) & (amount <= 100),
              (amount > 100) & (amount <= 1000),
              (amount > 1000) & (amount <= 2000),
              (amount > 2000) & (amount <= 10000),
              (amount > 10000) & (amount <= 100000),
              amount > 100000]
choices = ['zero', '1 to 10', '11 to 100', '101 to 1k', '1k to 2k', '2k to 10k', '10k to 100k', 'greater than 100k']
# Rows matching no condition get 'unknown' instead of NumPy's numeric default of 0
df['amount_tsh'] = np.select(conditions, choices, default='unknown')
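
`np.select` assigns its `default` value to any row that matches none of the conditions, and that default is the number `0` unless told otherwise. With string choices, the numeric default may be coerced to the string `'0'` or rejected outright, depending on the NumPy version, so passing an explicit string default is the safer habit. A small sketch on made-up amounts (the three-bin scheme is only for illustration):

```python
import numpy as np
import pandas as pd

# Made-up amounts; -3 deliberately matches none of the conditions
toy_amount = pd.Series([0, 5, 250, -3])

conditions = [toy_amount == 0,
              (toy_amount > 0) & (toy_amount <= 10),
              toy_amount > 10]
choices = ['zero', '1 to 10', 'over 10']

# Unmatched rows receive the explicit string default
binned = np.select(conditions, choices, default='unknown')
print(binned)
```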

In [None]:
df.gps_height = pd.qcut(df.gps_height, 8, duplicates='drop', 
        labels=['-90m - sea level', 'sea level to 46m', '46m to 393m', '393m to 1017m', '1017m to 1316m', '1316m to 1586.75m', '1586.75m to 2770m'])
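
With `duplicates='drop'`, the number of bins that survive depends on the data, so a hard-coded label list only works while the count happens to match. The sketch below, on made-up heights, first asks `pd.qcut` for the surviving edges and then builds one label per bin; the label format is our own assumption.

```python
import pandas as pd

# Made-up heights with many zeros, so several quantile edges coincide
toy_heights = pd.Series([0, 0, 0, 0, 10, 50, 400, 1200, 1500, 2000])

# First pass: drop duplicate edges and return the edges that remain
_, edges = pd.qcut(toy_heights, 8, duplicates='drop', retbins=True)

# Build exactly one label per surviving bin
labels = [f'{lo:g}m to {hi:g}m' for lo, hi in zip(edges[:-1], edges[1:])]

# Second pass: cut on the known edges, keeping the lowest value in the first bin
binned = pd.cut(toy_heights, bins=edges, labels=labels, include_lowest=True)
print(binned.value_counts(sort=False))
```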

In [7]:
pd.get_dummies(df, drop_first=True)

id,gps_height,permit,status_group,amount_tsh_101 to 1k,amount_tsh_10k to 100k,amount_tsh_11 to 100,amount_tsh_1k to 2k,amount_tsh_2k to 10k,amount_tsh_greater than 100k,amount_tsh_zero,...,source_type_other,source_type_rainwater harvesting,source_type_river/lake,source_type_shallow well,source_type_spring,waterpoint_type_group_communal standpipe,waterpoint_type_group_dam,waterpoint_type_group_hand pump,waterpoint_type_group_improved spring,waterpoint_type_group_other
69572,1390,0,1,0,0,0,0,1,0,0,...,0,0,0,0,1,1,0,0,0,0
8776,1399,1,1,0,0,0,0,0,0,1,...,0,1,0,0,0,1,0,0,0,0
34310,686,1,1,0,0,1,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0
67743,263,1,0,0,0,0,0,0,0,1,...,0,0,0,0,0,1,0,0,0,0
19728,0,1,1,0,0,0,0,0,0,1,...,0,1,0,0,0,1,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
11164,351,1,0,1,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0
60739,1210,1,1,0,0,0,0,0,0,0,...,0,0,0,0,1,1,0,0,0,0
27263,1212,1,1,0,0,0,0,1,0,0,...,0,0,1,0,0,1,0,0,0,0
31282,0,1,1,0,0,0,0,0,0,1,...,0,0,0,1,0,0,0,1,0,0
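
To make the effect of `drop_first=True` concrete, here is a small sketch on a made-up frame. Numeric columns pass through unchanged, while each categorical column loses its first (alphabetically sorted) level, which becomes the implicit baseline. The `dtype=int` argument is our addition so the output shows 0/1 on any pandas version.

```python
import pandas as pd

# Made-up frame: one numeric column and one categorical column
toy = pd.DataFrame({
    'permit': [1, 0, 1],
    'source_type': ['spring', 'dam', 'borehole'],
})

# get_dummies returns a new dataframe, so the result has to be assigned
encoded = pd.get_dummies(toy, drop_first=True, dtype=int)
print(encoded)
```

Here `borehole` is the dropped level, so a row with zeros in both `source_type_dam` and `source_type_spring` represents a borehole.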


# Visualizations

In [None]:
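# Distribution of the encoded target (0 = non functional, 1 = functional, 2 = functional needs repair)
sns.histplot(df.status_group, stat='density')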

In [None]:
df.population.value_counts().plot(kind='bar')

In [None]:
df.well_age.value_counts().plot(kind='bar')

# Model Building

In [None]:
# Identify features and target
features = df.drop('status_group', axis=1)
target = df.status_group

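# Dummy the features (get_dummies returns a new dataframe, so keep the result)
features = pd.get_dummies(features, drop_first=True)

# Split the data
# train_test_split returns X_train, X_test, y_train, y_test, in that order
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(features, target, random_state=33)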
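
Since the three status classes are unlikely to be equally common, it may be worth keeping their proportions the same in both splits. The sketch below shows the return order of `train_test_split` together with the optional `stratify` argument on made-up data; stratifying is our suggestion, not something the notebook does.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up features and an imbalanced three-class target
X = pd.DataFrame({'x': range(20)})
y = pd.Series([0] * 10 + [1] * 6 + [2] * 4)

# Return order: features (train, test), then target (train, test)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=33)

print(y_train.value_counts(normalize=True).sort_index())
print(y_test.value_counts(normalize=True).sort_index())
```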