# Data Summary and Goal

## Summary

Tanzania, as a developing country, struggles with providing clean water to its population of over 57,000,000. There are many waterpoints already established in the country, but some are in need of repair while others have failed altogether.

## Goal

Build a classifier to predict the condition of a water well, using information provided in the data. This information includes:
- Date
- Location
- Source
- Funder
- And more!

This data is from the DrivenData.org website. It is part of the "Pump It Up: Data Mining the Water Table" competition. DrivenData split the data into two sets, the "Training Set" and the "Test Set".

The names imply that the training set is for building our models and the test set is for testing them. For this project, we considered merging the two dataframes to have more data to work with; however, the training set has 59,400 entries, which should be more than enough to make good predictions.

If our models are subpar, we may merge the tables to acquire more data points and potentially improve model efficacy.

# Data Cleaning and Feature Engineering

## Import Libraries and Data

In [None]:
# Import data-handling and plotting libraries
import pandas as pd
import numpy as np
import math
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.impute import MissingIndicator, SimpleImputer
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.feature_selection import SelectFromModel

# plot_confusion_matrix and plot_roc_curve were added in scikit-learn 0.22 and removed in 1.2
# (ConfusionMatrixDisplay and RocCurveDisplay replace them), so guard the import and fall back
# to plain confusion_matrix when they are unavailable

from sklearn.metrics import confusion_matrix
try:
    from sklearn.metrics import plot_confusion_matrix, plot_roc_curve
except ImportError:
    plot_confusion_matrix = plot_roc_curve = None

# ignore all future warnings
from warnings import simplefilter
simplefilter(action='ignore', category=FutureWarning)

# Load data into Pandas dataframes
status_groups = pd.read_csv('status_groups.csv')
testset = pd.read_csv('test_set.csv')
df = pd.read_csv('training_set.csv')


# Let's add our target series to the dataframe!
status_groups.drop(['id'], axis=1, inplace=True)
df = pd.concat([df, status_groups], axis=1)

# Analyze shape of dataset
print(f'Shape of dataset: {df.shape}')
# display(df.head())
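
The positional `pd.concat` above relies on both CSV files listing the wells in the same row order. A minimal sketch of a key-based join follows, using small made-up frames in place of the real files; the names `toy_features` and `toy_labels` and their values are illustrative assumptions, not data from the competition.

```python
import pandas as pd

# Toy stand-ins for training_set.csv and status_groups.csv (values are made up)
toy_features = pd.DataFrame({'id': [3, 1, 2], 'amount_tsh': [0.0, 50.0, 20.0]})
toy_labels = pd.DataFrame({'id': [1, 2, 3],
                           'status_group': ['functional', 'non functional', 'functional']})

# Join on the shared key; validate raises if an id is duplicated on either side
merged = toy_features.merge(toy_labels, on='id', how='inner', validate='one_to_one')
print(merged)
```

Joining on `id` keeps each label attached to the right well even if one file is sorted differently, so the `id` column would not need to be dropped from the label frame first.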


## Drop unneeded columns and deal with missing values

In [None]:

# Parse the recording date; the well-age column itself is left commented out below
df['date_recorded'] = pd.to_datetime(df['date_recorded'])
# recorded_year = [x.year for x in df.date_recorded]
# df['well_age'] = recorded_year - df.construction_year
# df.well_age.value_counts()
# df['well_age'][df.well_age<0]
# df.iloc[[10441, 8729, 13366, 23373, 27501, 32619, 33942, 39559]][['construction_year', 'date_recorded', 'status_group']]


# Interestingly enough, there are some negative values for the age of wells
# This may indicate that the well was only PLANNED at the time of recording and had not yet been built
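
The well-age check described above can be sketched on a few made-up rows. The frame `toy` and its values are assumptions for illustration only. Treating a `construction_year` of 0 as unknown follows the binning used later in this notebook.

```python
import pandas as pd

# Toy rows (made up); a construction_year of 0 is treated as unknown
toy = pd.DataFrame({
    'date_recorded': pd.to_datetime(['2011-03-14', '2004-01-07', '2013-02-01']),
    'construction_year': [1999, 2008, 0],
})

# Age in years at recording time; unknown construction years become missing
known_year = toy['construction_year'].where(toy['construction_year'] > 0)
toy['well_age'] = toy['date_recorded'].dt.year - known_year

# Rows recorded before their stated construction year
negative_age = toy[toy['well_age'] < 0]
print(negative_age)
```

Masking the zeros first avoids ages near 2,000 years for wells with no recorded construction year.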

In [None]:
# EXPLORATORY! We are analyzing the columns in order to remove redundancies
# **Uncomment if viewing the unique values is desired**

# df.region.unique()
# df.scheme_management

# display(df['extraction_type_class'].unique())
# display(df['extraction_type'].unique())
# display(df['extraction_type_group'].unique())

# display(df['source_class'].unique())
# display(df['source'].unique())
# display(df['source_type'].unique())

# display(df.waterpoint_type.unique())
# display(df.waterpoint_type_group.unique())

# display(df.water_quality.unique())
# display(df.quality_group.unique())

# display(df.management.unique())
# display(df.management_group.unique())

# display(df.payment.unique())
# display(df.payment_type.unique())

# display(df.quantity.unique())
# display(df.quantity_group.unique())

df.isna().sum()[df.isna().sum()>0]

In [None]:
# Label encode the target variable
status_labels = {'status_group':{'non functional': 0, 'functional': 1, 'functional needs repair': 2}}
df = df.replace(status_labels)
df.status_group.value_counts()


# Drop redundant and unneeded columns
to_drop = ['scheme_name', 'recorded_by', 'wpt_name', 'extraction_type', 'extraction_type_group',
           'region_code', 'district_code', 'lga', 'ward', 'public_meeting', 'date_recorded', 
           'source', 'source_class', 'waterpoint_type', 'water_quality', 'management_group', 
           'payment', 'quantity_group','subvillage', 'num_private', 'scheme_management']

# I am keeping latitude and longitude

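# Drop the columns listed above
df.drop(to_drop, axis=1, inplace=True)

# Deal with missing values: treat a missing permit as False, then drop rows with any other gaps
df['permit'] = df['permit'].fillna(False)
df.dropna(axis=0, inplace=True)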

# Set 'id' as the index of the dataframe
df.set_index('id', inplace=True)

## Grouping and Labeling Column Values

In [None]:
# Funder
# A funder recorded as '0' is a placeholder, so relabel it as 'unknown'
df['funder'] = df['funder'].replace('0', 'unknown')
df.funder.value_counts().head(20)

In [None]:
# Group every funder with fewer than 484 wells into a single 'other' category
funder_counts = df['funder'].value_counts()
other = list(funder_counts[funder_counts < 484].index)
df['funder'] = df['funder'].replace(other, 'other')


In [None]:
# Installer

# Group every installer with fewer than 392 wells into a single 'other' category
installer_counts = df['installer'].value_counts()
other = list(installer_counts[installer_counts < 392].index)
df['installer'] = df['installer'].replace(other, 'other')
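
The funder and installer cells apply the same idea: any category seen fewer than some threshold number of times is folded into `'other'`. A minimal sketch of that step as a reusable helper follows. The function name `group_rare` and the toy series are hypothetical, introduced only for illustration.

```python
import pandas as pd

def group_rare(series, min_count, label='other'):
    """Replace values occurring fewer than min_count times with a single label."""
    counts = series.value_counts()
    rare = counts[counts < min_count].index
    # Keep values that are not rare; substitute the label everywhere else
    return series.where(~series.isin(rare), label)

toy_funder = pd.Series(['Danida', 'Danida', 'Danida', 'Hesawa', 'Hesawa', 'Kkkt'])
grouped = group_rare(toy_funder, min_count=2)
print(grouped.value_counts())
```

With such a helper, the two cells would reduce to calls like `group_rare(df['funder'], 484)` and `group_rare(df['installer'], 392)`.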

In [None]:
# Permit

# Encode the boolean permit flag as 1/0
df['permit'] = df['permit'].astype(int)

In [None]:
# Population
    
def population(x):
    """Map a raw population count to a labelled bin."""
    if x == 0:
        return 'No population'
    elif 0 < x <= 100:
        return 'Less than 100'
    elif 100 < x <= 200:
        return 'Between 100 and 200'
    elif 200 < x <= 300:
        return 'Between 200 and 300'
    elif 300 < x <= 400:
        return 'Between 300 and 400'
    elif 400 < x <= 500:
        return 'Between 400 and 500'
    elif x > 500:
        return 'Over 500'
    return ''  # negative or missing counts stay unlabelled

df['population'] = df['population'].apply(population)
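
The same kind of binning can also be written without a row-by-row function. The sketch below uses `pd.cut` on a made-up series. The edges mirror the thresholds in the function above, and the names `toy_population`, `edges` and `bin_labels` are illustrative assumptions.

```python
import numpy as np
import pandas as pd

toy_population = pd.Series([0, 1, 100, 101, 250, 500, 501, 3000])

# Right-closed bins: (-1, 0], (0, 100], (100, 200], ..., (500, inf]
edges = [-1, 0, 100, 200, 300, 400, 500, np.inf]
bin_labels = ['No population', 'Less than 100', 'Between 100 and 200',
              'Between 200 and 300', 'Between 300 and 400',
              'Between 400 and 500', 'Over 500']

binned = pd.cut(toy_population, bins=edges, labels=bin_labels)
print(binned.tolist())
```

`pd.cut` returns an ordered categorical, so plots and dummy columns follow the bin order rather than the order of first appearance.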


In [None]:
# # Well_age

# # Drop all items that have a value less than 0 (very few)
# df.drop(df[df['well_age'] < 0].index, inplace = True)

# # Bin
# conditions = [df.well_age==0, (df.well_age>0)&(df.well_age<=4), (df.well_age>4)&(df.well_age<=12), (df.well_age>12)&(df.well_age<=25), 
#               (df.well_age>25)&(df.well_age<=48), df.well_age>48]
# choices = ['new', '0-4 years', '4-12 years', '12-25 years', '25-48 years', 'more than 48 years']
# df['well_age'] = np.select(conditions, choices)

conditions = [df['construction_year']==0, (df['construction_year']>=1960)&(df['construction_year']<=1970), (df['construction_year']>1970)&(df['construction_year']<=1980),
             (df['construction_year']>1980)&(df['construction_year']<=1990), (df['construction_year']>1990)&(df['construction_year']<=2000),
             (df['construction_year']>2000)&(df['construction_year']<=2010), df['construction_year']>2010]
choices = ['no_construction_year', '1960_1970', '1971_1980', '1981_1990', '1991_2000', '2001_2010', '2011_over']
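# Give unmatched rows a string default so it shares a dtype with the string choices
df['construction_year'] = np.select(conditions, choices, default='unknown')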


In [None]:
# Amount_tsh
# Bin
conditions = [df.amount_tsh==0,(df.amount_tsh>0)&(df.amount_tsh<=10),(df.amount_tsh>10)&(df.amount_tsh<=100), (df.amount_tsh>100)&(df.amount_tsh<=1000),
             (df.amount_tsh>1000)&(df.amount_tsh<=2000), (df.amount_tsh>2000)&(df.amount_tsh<=10000), (df.amount_tsh>10000)&(df.amount_tsh<=100000),
             df.amount_tsh>100000]
choices = ['zero', '1 to 10', '11 to 100', '101 to 1k', '1k to 2k', '2k to 10k', '10k to 100k', 'greater than 100k']
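# Give unmatched rows a string default so it shares a dtype with the string choices
df['amount_tsh'] = np.select(conditions, choices, default='unknown')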

In [None]:
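# Gps_height
# Bin into quantiles; duplicate bin edges are dropped, which leaves 7 labelled bins
df['gps_height'] = pd.qcut(df['gps_height'], 8, duplicates='drop',
        labels=['-90m to sea level', 'sea level to 46m', '46m to 393m', '393m to 1017m', '1017m to 1316m', '1316m to 1586.75m', '1586.75m to 2770m'])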

In [None]:
df.funder.value_counts()

In [None]:
pd.get_dummies(df, drop_first=True)

In [None]:
df.columns

In [None]:
df.installer.value_counts().head(20)

# Visualizations

In [None]:
# Names of the 20 funders to compare
top_funders = ['Government Of Tanzania', 'Tasaf', 'Danida', 'Hesawa', 'Rwssp',
               'World Bank', 'Kkkt', 'World Vision', 'Unicef', 'unknown',
               'District Council', 'Dhv', 'Private Individual', 'Dwsp', 'Norad',
               'Germany Republi', 'Tcrs', 'Ministry Of Water', 'Water', 'Dwe']

# Stack the rows for each funder in list order so the plot keeps that order
top_20 = pd.concat([df.loc[df['funder'] == name] for name in top_funders],
                   ignore_index=True)

fig, ax = plt.subplots(figsize=(20,20))
sns.countplot(x='funder', hue="status_group", data=top_20)
plt.tight_layout()

In [None]:
fig, ax = plt.subplots(figsize=(25,25))
sns.countplot(x='funder', hue='status_group', data=top_20)
plt.tight_layout()

In [None]:
fig, ax = plt.subplots(figsize=(25,25))
sns.countplot(x='funder', hue='status_group', data=df)
plt.tight_layout()

In [None]:
fig, ax = plt.subplots(figsize=(12,12))
# Longitude on x and latitude on y so the points form a map of the country
sns.scatterplot(x='longitude', y='latitude', hue='status_group', data=df)
plt.tight_layout()

In [None]:
fig, ax = plt.subplots(figsize=(12,12))
sns.countplot(x='waterpoint_type_group', hue='status_group', data=df)
plt.tight_layout()

In [None]:
fig, ax = plt.subplots(figsize=(12,12))
sns.countplot(x='source_type', hue='status_group', data=df)
plt.tight_layout()

In [None]:
fig, ax = plt.subplots(figsize=(12,12))
sns.countplot(x='quality_group', hue='status_group', data=df)
plt.tight_layout()

In [None]:
fig, ax = plt.subplots(figsize=(12,12))
sns.countplot(x='payment_type', hue='status_group', data=df)
plt.tight_layout()

In [None]:
fig, ax = plt.subplots(figsize=(12,12))
sns.countplot(x='construction_year', hue='status_group', data=df)
plt.tight_layout()

In [None]:
fig, ax = plt.subplots(figsize=(12,12))
sns.countplot(x='population', hue='status_group', data=df)
plt.tight_layout()

In [None]:
fig, ax = plt.subplots(figsize=(6,6))
sns.countplot(x='quantity', hue='status_group', data=df)
plt.tight_layout()

In [None]:
fig, ax = plt.subplots(figsize=(15,15))
sns.countplot(x='region', hue='status_group', data=df)
plt.tight_layout()

In [None]:
fig, ax = plt.subplots(figsize=(12,12))
sns.countplot(x='management', hue='status_group', data=df)
plt.tight_layout()

In [None]:
fig, ax = plt.subplots(figsize=(10,10))
sns.countplot(x='amount_tsh', hue='status_group', data=df)
plt.tight_layout()

In [None]:
fig, ax = plt.subplots(figsize=(15,15))
sns.countplot(x='basin', hue='status_group', data=df)
plt.tight_layout()

In [None]:
fig, ax = plt.subplots(figsize=(6,6))
sns.countplot(x='permit', hue='status_group', data=df)
plt.tight_layout()

In [None]:
df.status_group.value_counts().plot(kind='bar', color='orange')

# Model Building

## Define X and y and Create a dummied Feature df

In [None]:
# Identify features and target
features = df.drop('status_group', axis=1)
target = df.status_group

# Dummy the features
features_dummied = pd.get_dummies(features, drop_first=True)
features_dummied.head()

## Decision Tree

In [None]:
# Split the data into training and test sets
from sklearn.tree import DecisionTreeClassifier
X_train, X_test, y_train, y_test = train_test_split(features_dummied, target, test_size=0.3, random_state=123)

# Fit and cross-validate Decision Tree
tree = DecisionTreeClassifier(random_state=123)
cross_val_score(tree, X_train, y_train, cv=10)

In [None]:
# Use recursive feature elimination to trim dataset columns
# from sklearn.feature_selection import RFE
# rfe = RFE(tree, 20)
# rfe = rfe.fit(X_train, y_train)

# rfe_columns = pd.Series(X_train.columns)
# rfe_code = pd.Series(rfe.support_)
# columns_to_keep = pd.concat([rfe_columns, rfe_code], axis=1)

# rfe_X_train = X_train[list(columns_to_keep[columns_to_keep[1]==True][0])]
# rfe_X_test = X_test[list(columns_to_keep[columns_to_keep[1]==True][0])]

# Cross validate again
# cross_val_score(tree, rfe_X_train, y_train, cv=10)

# print(rfe.support_)
# print(rfe.ranking_)

## Random Forest

In [None]:
# Build a Random Forest classifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score, accuracy_score
rfc = RandomForestClassifier(criterion = 'entropy', random_state = 42)
rfc.fit(X_train, y_train)

# Evaluating on Training set
rfc_pred_train = rfc.predict(X_train)
print('Training Set Evaluation F1-Score=>', f1_score(y_train,rfc_pred_train, average=None))
print('Training Set Evaluation Accuracy-Score=>', accuracy_score(y_train,rfc_pred_train))
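
The scores above are computed on the rows the forest was trained on, so they say little about unseen wells. A minimal sketch of comparing training and held-out scores follows. It runs on synthetic three-class data from `make_classification` rather than the well data, and every name ending in `_demo` or starting with `Xd_`/`yd_` is an assumption made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split

# Synthetic three-class data standing in for the dummied well features
X_demo, y_demo = make_classification(n_samples=600, n_features=12, n_informative=6,
                                     n_classes=3, random_state=42)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=123)

rfc_demo = RandomForestClassifier(criterion='entropy', random_state=42)
rfc_demo.fit(Xd_train, yd_train)

# Compare scores on seen and unseen rows; a large gap suggests overfitting
train_acc = accuracy_score(yd_train, rfc_demo.predict(Xd_train))
test_pred = rfc_demo.predict(Xd_test)
test_acc = accuracy_score(yd_test, test_pred)
test_f1 = f1_score(yd_test, test_pred, average=None)
print(f'train accuracy={train_acc:.3f}, test accuracy={test_acc:.3f}')
print('per-class test F1:', test_f1)
```

In the notebook itself, the equivalent step would be to call `rfc.predict(X_test)` and score it against `y_test`.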

## Logistic Regression

In [57]:
# Scale the data sets
# from sklearn.preprocessing import StandardScaler
# scaler = StandardScaler()
# scaler.fit_transform(features_dummied, target)

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(features_dummied, target, test_size=0.3, random_state=123)

# Fit and cross-validate Logistic Regression
lr = LogisticRegression(max_iter=10000, random_state=123)
cross_val_score(lr, X_train, y_train)





# # Use recursive feature elimination to create a trimmed dataset that could potentially improve score
# from sklearn.feature_selection import RFE
# rfe = RFE(tree, 20)
# rfe = rfe.fit(X_train, y_train)

# rfe_columns = pd.Series(X_train.columns)
# rfe_code = pd.Series(rfe.support_)
# columns_to_keep = pd.concat([rfe_columns, rfe_code], axis=1)

# rfe_X_train = X_train[list(columns_to_keep[columns_to_keep[1]==True][0])]
# rfe_X_test = X_test[list(columns_to_keep[columns_to_keep[1]==True][0])]

# Cross validate again
# cross_val_score(lr, rfe_X_train, y_train, cv=10)

In [58]:
# Display classification report
from sklearn.metrics import classification_report
lr.fit(X_train, y_train)
y_preds = lr.predict(X_test)
print(classification_report(y_test, y_preds))

              precision    recall  f1-score   support

           0       0.79      0.64      0.71      6536
           1       0.73      0.90      0.80      9032
           2       0.56      0.11      0.18      1140

    accuracy                           0.74     16708
   macro avg       0.69      0.55      0.56     16708
weighted avg       0.74      0.74      0.72     16708



In [56]:

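from sklearn.metrics import roc_auc_score
from sklearn.metrics import roc_curve

# lr must already be fitted (lr.fit is called in the classification report cell)
y_proba = lr.predict_proba(X_test)
# Multiclass AUC needs class probabilities, not hard predictions
logit_roc_auc = roc_auc_score(y_test, y_proba, multi_class='ovr')
# roc_curve is binary only, so plot class 1 ('functional') against the other two classes
fpr, tpr, thresholds = roc_curve((y_test == 1).astype(int), y_proba[:, 1])
plt.figure()
plt.plot(fpr, tpr, label='Logistic Regression, functional vs rest (OvR AUC = %0.2f)' % logit_roc_auc)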
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic')
plt.legend(loc="lower right")
plt.savefig('Log_ROC')
plt.show()

NotFittedError: This LogisticRegression instance is not fitted yet. Call 'fit' with appropriate arguments before using this estimator.

To do: plot one-vs-all (one-vs-rest) ROC curves, one per class, with scikit-learn.
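
A one-vs-rest ROC analysis treats each class in turn as the positive class and the other two as negative. The sketch below shows one way to compute a curve per class. It uses synthetic data from `make_classification` instead of the well data, and the names `lr_demo`, `roc_per_class` and the `Xd_`/`yd_` splits are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import auc, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import label_binarize

X_demo, y_demo = make_classification(n_samples=600, n_features=12, n_informative=6,
                                     n_classes=3, random_state=0)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=123)

# Fit first, then ask for class probabilities (one column per class)
lr_demo = LogisticRegression(max_iter=10000, random_state=123).fit(Xd_train, yd_train)
proba = lr_demo.predict_proba(Xd_test)

# One indicator column per class: column k is 1 where the true label is class k
classes = lr_demo.classes_
y_onehot = label_binarize(yd_test, classes=classes)

roc_per_class = {}
for k, cls in enumerate(classes):
    fpr_k, tpr_k, _ = roc_curve(y_onehot[:, k], proba[:, k])
    roc_per_class[cls] = (fpr_k, tpr_k, auc(fpr_k, tpr_k))
    print(f'class {cls}: AUC = {roc_per_class[cls][2]:.3f}')
```

Each stored `(fpr_k, tpr_k)` pair can be passed to `plt.plot` to draw one curve per class on the same axes.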