
# Driven Data - Pump it Up
## Initial Submission Notebook

Latest Edit : *5-Sep-2020*

This notebook walks through the process of making a first submission to the Driven Data Challenge '[Pump it Up](https://www.drivendata.org/competitions/7/pump-it-up-data-mining-the-water-table/)'. A second notebook will follow with hyperparameter optimization for the models used here, and the results will be compared.

To be updated:

1.   Background
2.   Exploratory Data Analysis




**Preprocessing of Data - Basics** 

In this section, we take a few basic steps:

1.   Import the dataset CSV files as dataframes
2.   Combine the training and test data for preprocessing
3.   Assign numerical values to the `status_group` target categories so the models can be trained on them



In [1]:
# Importing Basic Libraries
import pandas as pd
import numpy as np

train_data = pd.read_csv('/content/drive/My Drive/Colab Notebooks/Pump It Up/Data Values_Train.csv')
train_labels = pd.read_csv('/content/drive/My Drive/Colab Notebooks/Pump It Up/Data Labels_Train.csv')
test_data = pd.read_csv('/content/drive/My Drive/Colab Notebooks/Pump It Up/Data Values_Test.csv')

# Combine train and test data
train_data['train'] = 1
test_data['train'] = 0
combined = pd.concat([train_data,test_data])

# assign numerical labels
label_dict_status_group = {'functional':0, 'non functional': 1, 'functional needs repair': 2}
train_labels.status_group = train_labels.status_group.replace(label_dict_status_group)
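
The label mapping is applied here and inverted again later, when the submission file is written. A minimal sketch of keeping the two directions in sync by deriving the inverse from the forward dictionary; the names `label_to_id` and `id_to_label` are illustrative and do not appear in the notebook:

```python
# Forward mapping, with the same values as in the notebook
label_to_id = {'functional': 0, 'non functional': 1, 'functional needs repair': 2}

# Derive the inverse mapping instead of typing it a second time
id_to_label = {idx: label for label, idx in label_to_id.items()}

labels = ['functional', 'functional needs repair', 'non functional']
encoded = [label_to_id[name] for name in labels]
decoded = [id_to_label[idx] for idx in encoded]

print(encoded)
print(decoded == labels)
```

Deriving one dictionary from the other means a renamed or added class only has to be changed in one place.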

**Preprocessing of Data - Advanced**

In this section we take a few crucial steps:

1.   Remove the redundant features identified in the previous section, which include
    *   Columns with many NaN values
    *   Columns with duplicate information
    *   Columns with a single value for all rows
    *   Columns with coordinates (other features act as a proxy for location information)
2.   Convert True/False boolean values to integer type
3.   Perform one-hot encoding of categorical features (only the 25 most common values of each feature are used)
4.   Merge the training data with the labels




In [2]:
# Dropping Redundant Features (ie. columns with too many blanks, duplicate information, single value for all rows, coordinates)
combined = combined.drop(['source_type','date_recorded', 'source_class', 'waterpoint_type_group', 'longitude','latitude', 'quantity_group', 'num_private', 'subvillage', 'region','scheme_name','recorded_by','extraction_type'],axis=1)

# Converting Booleans to Integer Values
# Note: astype(bool) treats NaN as True, so missing entries are encoded as 1
combined.permit = combined.permit.astype(bool).astype(int)
combined.public_meeting = combined.public_meeting.astype(bool).astype(int)

# List of columns to be one-hot encoded
cat_columns = ['funder', 'installer', 'wpt_name', 'basin', 'region_code', 'district_code', 'lga', 'ward', 'public_meeting', 'scheme_management', 'permit', 'extraction_type_group', 'extraction_type_class', 'management', 'management_group', 'payment', 'payment_type', 'water_quality', 'quality_group', 'quantity', 'source', 'waterpoint_type']

# Selecting only the top 25 values per column for one-hot encoding
for col in cat_columns:
  # List of top 25 values for the feature 'col'
  top = [x for x in combined[col].value_counts().iloc[:25].index]
  for val in top:
    # Manual one-hot encoding for each value 'val' in the top 25 values
    combined[col + "_" + str(val)]=np.where(combined[col]==val,1,0)

# Dropping original columns after encoding
combined.drop(cat_columns, axis=1, inplace=True)

# Separating test and train data
train_data=combined[combined['train']==1]
test_data=combined[combined['train']==0]
train_data = train_data.drop(['train'],axis=1)
test_data = test_data.drop(['train'],axis=1)

# Merging labels with train data
train_data = pd.merge(train_data, train_labels, on = 'id')
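
The top-25 encoding loop above can be sketched on a toy column as follows. The helper name `one_hot_top_k` and the toy data are assumptions made for illustration; the notebook itself uses an inline loop with `np.where`:

```python
import pandas as pd

def one_hot_top_k(df, column, k):
    """Replace `column` with 0/1 indicator columns for its k most frequent values."""
    top_values = df[column].value_counts().index[:k]
    indicators = {
        f"{column}_{value}": (df[column] == value).astype(int)
        for value in top_values
    }
    # Rare and missing values get 0 in every indicator column
    return df.assign(**indicators).drop(columns=[column])

toy = pd.DataFrame({'funder': ['A', 'A', 'A', 'B', 'B', 'C', None]})
encoded = one_hot_top_k(toy, 'funder', k=2)
print(encoded)
```

Values outside the top `k`, including missing ones, end up as a row of zeros, which is how the notebook's encoding also disposes of NaN entries in the categorical columns.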

In [3]:
# Cross checking if any Nulls remain
train_data.isnull().sum().sum()

0
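
One reason no nulls remain is that `astype(bool)` turns a missing value into `True`, so any row with an unknown `permit` or `public_meeting` would be encoded as 1. A small sketch of that behaviour, together with one possible alternative that maps missing values to 0; the toy series and the alternative are assumptions, not the notebook's choice:

```python
import numpy as np
import pandas as pd

permit = pd.Series([True, False, np.nan], dtype=object)

# NaN is truthy, so the missing entry becomes 1
as_in_notebook = permit.astype(bool).astype(int)

# Alternative (an assumption): only an explicit True becomes 1
missing_as_zero = permit.eq(True).astype(int)

print(as_in_notebook.tolist())
print(missing_as_zero.tolist())
```

Which convention is better depends on what a missing value means in the data; a separate "unknown" indicator column is a third option.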

**Modeling - Unoptimized**

Here we run a variety of standard classifiers on our dataset. First we import the necessary libraries, then fit the following models:

1.   XGBoost Classifier
2.   Support Vector Classifier
3.   Multilayer Perceptron Classifier

A Logistic Regression classifier and a TensorFlow classifier were also planned, but neither is fitted in this notebook (the TensorFlow draft is kept as disabled code near the end).



In [4]:
# Global libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import time
from sklearn.model_selection import train_test_split, cross_val_score, cross_val_predict, RandomizedSearchCV, KFold, GridSearchCV
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report, confusion_matrix
from sklearn import metrics

#Creating test-train split
X_train, X_test, y_train, y_test = train_test_split(train_data.drop(columns=['status_group','id']), train_data['status_group'],test_size=0.33, random_state=420)


**XGBoost Classifier**

In [5]:
from xgboost.sklearn import XGBClassifier

clf_xgb = XGBClassifier(objective='multi:softmax')

start_time = time.time()
clf_xgb.fit(X_train,y_train)

print("System took %s seconds to model" % (time.time() - start_time))
print(classification_report(y_test,clf_xgb.predict(X_test)))

System took 69.59227991104126 seconds to model
              precision    recall  f1-score   support

           0       0.71      0.94      0.81     10659
           1       0.85      0.60      0.71      7498
           2       0.81      0.09      0.17      1445

    accuracy                           0.75     19602
   macro avg       0.79      0.55      0.56     19602
weighted avg       0.77      0.75      0.72     19602



In [6]:
accuracy = accuracy_score(y_test, clf_xgb.predict(X_test))
print ("SCORE:", accuracy)

SCORE: 0.7490052035506581
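
The accuracy score hides how weak the model is on class 2. As a worked check using the rounded figures from the classification report above, accuracy equals the support-weighted mean of the per-class recalls, so the two large classes dominate it:

```python
# Recall and support per class, copied from the classification report above
recall = {0: 0.94, 1: 0.60, 2: 0.09}
support = {0: 10659, 1: 7498, 2: 1445}

total = sum(support.values())

# Accuracy is the support-weighted mean of the per-class recalls
accuracy = sum(recall[c] * support[c] for c in recall) / total

# Macro recall weights every class equally
macro_recall = sum(recall.values()) / len(recall)

print(round(accuracy, 3), round(macro_recall, 3))
```

Because the inputs are rounded to two decimals, the results only approximately reproduce the reported accuracy and macro-average recall. The gap between the two numbers is the point: class 2 accounts for about 7% of the hold-out rows, so its low recall barely moves the accuracy.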


In [7]:
from hyperopt import hp, fmin, tpe, STATUS_OK, Trials

space={
    'max_depth' : hp.choice('max_depth', range(3, 12, 1)),
    'learning_rate' : hp.quniform('learning_rate',0.05, 1, 0.05),
    'n_estimators' : hp.choice('n_estimators', range(50,350,10)),
    'gamma' : hp.quniform('gamma', 0, 5, 1),
    'reg_alpha' : hp.quniform('reg_alpha', 0, 50, 0.5),
    'min_child_weight' : hp.quniform('min_child_weight', 1, 20, 1),
    'subsample' : hp.quniform('subsample', 0.5, 1, 0.05),
    'colsample_bytree' : hp.quniform('colsample_bytree', 0.5, 1.0, 0.1)
    }
def objective(space):
    clf=XGBClassifier(
                    n_estimators =space['n_estimators'], 
                    max_depth = int(space['max_depth']), 
                    gamma = space['gamma'],
                    reg_alpha = int(space['reg_alpha']),
                    min_child_weight=int(space['min_child_weight']),
                    # keep as float: int() would truncate values such as 0.7 to 0
                    colsample_bytree=space['colsample_bytree'],
                    learning_rate=space['learning_rate'],
                    objective='multi:softmax',
                    subsample=space['subsample'])
    
    evaluation = [( X_train, y_train), ( X_test, y_test)]
    
    clf.fit(X_train, y_train,
            eval_set=evaluation, eval_metric="mlogloss",
            early_stopping_rounds=10,verbose=False)
    

    pred = clf.predict(X_test)
    accuracy = accuracy_score(y_test, pred)
    print ("SCORE:", accuracy)
    return {'loss': -accuracy, 'status': STATUS_OK }


trials = Trials()
best_hyperparams = fmin(fn = objective,
                        space = space,
                        algo = tpe.suggest,
                        max_evals = 100,
                        trials = trials)

SCORE:
0.6909499030711151
SCORE:
0.6869707172737476
SCORE:
0.6901846750331598
SCORE:
0.7140087746148353
SCORE:
0.7210488725640241
SCORE:
0.7072237526782982
SCORE:
0.713549637792062
SCORE:
0.6926844199571472
SCORE:
0.7194674012855831
SCORE:
0.7747678808284869
SCORE:
0.6969186817671666
SCORE:
0.7136006529945924
SCORE:
0.6301907968574635
SCORE:
0.7029384756657484
SCORE:
0.6412610958065503
SCORE:
0.6676359555147434
SCORE:
0.6791653912866034
SCORE:
0.6613100704009794
SCORE:
0.7019181716151414
SCORE:
0.6996735027038057
SCORE:
0.7784919906132027
SCORE:
0.7755331088664422
SCORE:
0.7803795531068258
SCORE:
0.7800224466891134
SCORE:
0.7805325987144169
SCORE:
0.7768595041322314
SCORE:
0.7714518926640139
SCORE:
0.7866544230180594
SCORE:
0.7165085195388226
SCORE:
0.7169676563615958
SCORE:
0.7784409754106724
SCORE:
0.7114580144883175
SCORE:
0.7229364350576472
SCORE:
0.75216814610754
SCORE:
0.7026323844505663
SCORE:
0.7232935414753596
SCORE:
0.7117641057034997
SCORE:
0.6549331700846852
SCORE:
0.773390

In [8]:
best_hyperparams

{'colsample_bytree': 1.0,
 'gamma': 0.0,
 'learning_rate': 0.75,
 'max_depth': 8,
 'min_child_weight': 16.0,
 'n_estimators': 22,
 'reg_alpha': 33.0,
 'subsample': 0.8}
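
For parameters declared with `hp.choice`, `fmin` reports the index of the chosen option rather than the value itself, so `max_depth: 8` and `n_estimators: 22` above are positions in the corresponding `range`. Hyperopt's `space_eval(space, best_hyperparams)` performs this conversion; the sketch below does the same lookup by hand for the two `hp.choice` entries:

```python
# Options exactly as declared in the search space
max_depth_options = list(range(3, 12, 1))
n_estimators_options = list(range(50, 350, 10))

# Indices reported by fmin for the hp.choice parameters
best = {'max_depth': 8, 'n_estimators': 22}

decoded = {
    'max_depth': max_depth_options[best['max_depth']],
    'n_estimators': n_estimators_options[best['n_estimators']],
}
print(decoded)
```

The `hp.quniform` parameters are not affected; their reported values can be used directly.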

In [9]:
clf_xgb = XGBClassifier(objective='multi:softmax',
                        colsample_bytree=1.0,
                        gamma=0.0,
                        learning_rate=0.4,
                        max_depth=6,
                        min_child_weight=1.0,
                        n_estimators=11,
                        reg_alpha=9.0,
                        subsample=0.7000000000000001,
                        )

start_time = time.time()
clf_xgb.fit(X_train,y_train)

print("System took %s seconds to model" % (time.time() - start_time))
print(classification_report(y_test,clf_xgb.predict(X_test)))

# Finding the predictions from the models on the final test data
xgb_pred = pd.DataFrame({'status_group' : clf_xgb.predict(test_data.drop(['id'],axis=1))})

# Adding the id column back to the predictions
id_df = pd.DataFrame(test_data['id'])
xgb_pred = id_df.merge(xgb_pred,left_index=True,right_index=True)

# Creating Submission Files
label_dict_status_group = {0:'functional', 1:'non functional', 2:'functional needs repair'}
xgb_pred.status_group = xgb_pred.status_group.replace(label_dict_status_group)

# Exporting the data to csv files for submission
pd.DataFrame(xgb_pred).to_csv('xgb_pred_opt.csv',index=False)

System took 19.506816625595093 seconds to model
              precision    recall  f1-score   support

           0       0.72      0.93      0.81     10659
           1       0.84      0.63      0.72      7498
           2       0.74      0.11      0.20      1445

    accuracy                           0.76     19602
   macro avg       0.77      0.56      0.58     19602
weighted avg       0.77      0.76      0.73     19602



**Support Vector Classifier**

In [10]:
from sklearn.svm import LinearSVC
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import reciprocal, uniform

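clf_svm = LinearSVC()

# Random search over C and max_iter.
# Note: uniform() samples floats; newer scikit-learn releases may reject a non-integer max_iter
# (scipy.stats.randint(500, 2000) would sample integers over the same range).
param_distributions = {"max_iter": uniform(500, 1500), "C": uniform(1, 20)}
rnd_search_cv = RandomizedSearchCV(clf_svm, param_distributions, n_iter=10, verbose=2, cv=3)
rnd_search_cv.fit(X_train, y_train)

# Note: the search result is not used below. clf_svm is refitted with its default
# parameters, so the report that follows describes the untuned model.
# rnd_search_cv.best_estimator_ holds the tuned one.
start_time = time.time()
clf_svm.fit(X_train,y_train)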

print("System took %s seconds to model" % (time.time() - start_time))
print(classification_report(y_test,clf_svm.predict(X_test)))

Fitting 3 folds for each of 10 candidates, totalling 30 fits
[CV] C=16.6246997315387, max_iter=1251.0460575266338 .................


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


[CV] .. C=16.6246997315387, max_iter=1251.0460575266338, total=  14.4s
[CV] C=16.6246997315387, max_iter=1251.0460575266338 .................


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:   14.4s remaining:    0.0s


[CV] .. C=16.6246997315387, max_iter=1251.0460575266338, total=  13.9s
[CV] C=16.6246997315387, max_iter=1251.0460575266338 .................
[CV] .. C=16.6246997315387, max_iter=1251.0460575266338, total=  15.2s
[CV] C=13.847264010980407, max_iter=1698.0557296001612 ...............
[CV]  C=13.847264010980407, max_iter=1698.0557296001612, total=  20.4s
[CV] C=13.847264010980407, max_iter=1698.0557296001612 ...............
[CV]  C=13.847264010980407, max_iter=1698.0557296001612, total=  20.8s
[CV] C=13.847264010980407, max_iter=1698.0557296001612 ...............
[CV]  C=13.847264010980407, max_iter=1698.0557296001612, total=  20.0s
[CV] C=5.438227931143861, max_iter=593.0844584177224 .................
[CV] .. C=5.438227931143861, max_iter=593.0844584177224, total=   7.5s
[CV] C=5.438227931143861, max_iter=593.0844584177224 .................
[CV] .. C=5.438227931143861, max_iter=593.0844584177224, total=   6.7s
[CV] C=5.438227931143861, max_iter=593.0844584177224 .................
[CV] .

[Parallel(n_jobs=1)]: Done  30 out of  30 | elapsed:  6.5min finished


System took 21.607239246368408 seconds to model
              precision    recall  f1-score   support

           0       0.68      0.90      0.78     10659
           1       0.84      0.48      0.61      7498
           2       0.21      0.17      0.19      1445

    accuracy                           0.69     19602
   macro avg       0.58      0.52      0.52     19602
weighted avg       0.71      0.69      0.67     19602





In [11]:
clf_svm

LinearSVC(C=1.0, class_weight=None, dual=True, fit_intercept=True,
          intercept_scaling=1, loss='squared_hinge', max_iter=1000,
          multi_class='ovr', penalty='l2', random_state=None, tol=0.0001,
          verbose=0)

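Note that we used LinearSVC, which trains one-vs-rest linear classifiers, because the kernel-based SVC (which handles multiclass problems with a one-versus-one scheme) takes too long on this dataset: its training time grows at least quadratically with the number of samples, i.e. O(n^2) or worse. The linear model struggles with the third class, reaching only 0.21 precision and 0.17 recall in the report above.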
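
In the SVC cell above, the randomized search is fitted, but the report is produced by `clf_svm` with its default parameters. A minimal sketch of evaluating the tuned model instead, on a small synthetic dataset that stands in for the pump data; the dataset, the `loguniform` range for `C`, and the variable names are assumptions made for illustration:

```python
from scipy.stats import loguniform, randint
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.svm import LinearSVC

# Small synthetic 3-class problem standing in for the pump data
X, y = make_classification(n_samples=300, n_features=10, n_informative=5,
                           n_classes=3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.33, random_state=0)

param_distributions = {
    "C": loguniform(1e-2, 1e1),
    "max_iter": randint(500, 2000),  # integers, as max_iter expects
}
search = RandomizedSearchCV(LinearSVC(), param_distributions,
                            n_iter=5, cv=3, random_state=0)
search.fit(X_tr, y_tr)

# With the default refit=True, best_estimator_ is already refitted on X_tr
tuned_svm = search.best_estimator_
print(search.best_params_)
print("hold-out accuracy:", tuned_svm.score(X_te, y_te))
```

The same pattern applies to the MLP search: `rnd_search_cv.best_estimator_` is the model to evaluate and to use for the submission if the tuned result is wanted.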

**Artificial Neural Network Classifier / Multilayer Perceptron**

In [12]:
import sklearn.neural_network
clf_mlp = sklearn.neural_network.MLPClassifier(hidden_layer_sizes=(100, ), activation='relu', solver='adam', 
                                                 alpha=0.0001, batch_size='auto', learning_rate='constant', learning_rate_init=0.001, power_t=0.5, 
                                                 max_iter=1000, shuffle=True, random_state=None, tol=0.0001, verbose=False, warm_start=False, momentum=0.9, 
                                                 nesterovs_momentum=True, early_stopping=False, validation_fraction=0.1, beta_1=0.9, beta_2=0.999, epsilon=1e-08, 
                                                 n_iter_no_change=10)

parameter_space = {
    'hidden_layer_sizes': [(50,50,50), (50,100,50), (100,),(150,),(200,),(50,), (200,100,50), (100,100,50), (100,100,100), (200,150,100), (200,200,200), (50,50,50,50)],
    'activation': ['tanh', 'relu'],
    'solver': ['sgd', 'adam'],
    'alpha': [0.0001, 0.0005, 0.001, 0.005, 0.01, 0.05, 0.1],
    'learning_rate_init' :[0.0001, 0.0005, 0.001, 0.005, 0.01, 0.05, 0.1],
    'tol': [0.0001, 0.0005, 0.001, 0.005, 0.01, 0.05, 0.1],
    'learning_rate': ['constant','adaptive'],
}

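rnd_search_cv = RandomizedSearchCV(clf_mlp, parameter_space, n_iter=10, verbose=2, cv=3)
rnd_search_cv.fit(X_train, y_train)

# Note: the search result is not used below. clf_mlp is refitted with the parameters
# it was constructed with, so the report describes the untuned model.
# rnd_search_cv.best_estimator_ holds the tuned one.
start_time = time.time()
clf_mlp.fit(X_train,y_train)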

print("System took %s seconds to model" % (time.time() - start_time))
print(classification_report(y_test,clf_mlp.predict(X_test)))

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


Fitting 3 folds for each of 10 candidates, totalling 30 fits
[CV] tol=0.1, solver=sgd, learning_rate_init=0.01, learning_rate=constant, hidden_layer_sizes=(50, 50, 50, 50), alpha=0.001, activation=tanh 
[CV]  tol=0.1, solver=sgd, learning_rate_init=0.01, learning_rate=constant, hidden_layer_sizes=(50, 50, 50, 50), alpha=0.001, activation=tanh, total=   6.5s
[CV] tol=0.1, solver=sgd, learning_rate_init=0.01, learning_rate=constant, hidden_layer_sizes=(50, 50, 50, 50), alpha=0.001, activation=tanh 


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    6.5s remaining:    0.0s


[CV]  tol=0.1, solver=sgd, learning_rate_init=0.01, learning_rate=constant, hidden_layer_sizes=(50, 50, 50, 50), alpha=0.001, activation=tanh, total=   6.4s
[CV] tol=0.1, solver=sgd, learning_rate_init=0.01, learning_rate=constant, hidden_layer_sizes=(50, 50, 50, 50), alpha=0.001, activation=tanh 
[CV]  tol=0.1, solver=sgd, learning_rate_init=0.01, learning_rate=constant, hidden_layer_sizes=(50, 50, 50, 50), alpha=0.001, activation=tanh, total=   6.4s
[CV] tol=0.0001, solver=adam, learning_rate_init=0.01, learning_rate=constant, hidden_layer_sizes=(50, 50, 50), alpha=0.0001, activation=relu 
[CV]  tol=0.0001, solver=adam, learning_rate_init=0.01, learning_rate=constant, hidden_layer_sizes=(50, 50, 50), alpha=0.0001, activation=relu, total=  22.4s
[CV] tol=0.0001, solver=adam, learning_rate_init=0.01, learning_rate=constant, hidden_layer_sizes=(50, 50, 50), alpha=0.0001, activation=relu 
[CV]  tol=0.0001, solver=adam, learning_rate_init=0.01, learning_rate=constant, hidden_layer_sizes=(

[Parallel(n_jobs=1)]: Done  30 out of  30 | elapsed: 19.5min finished


System took 31.027844667434692 seconds to model
              precision    recall  f1-score   support

           0       0.75      0.89      0.81     10659
           1       0.82      0.67      0.73      7498
           2       0.51      0.26      0.34      1445

    accuracy                           0.76     19602
   macro avg       0.69      0.61      0.63     19602
weighted avg       0.76      0.76      0.75     19602



**Results - Unoptimized**

XGBoost and MLP are our best-performing models on the hold-out split, each with an accuracy of 0.76. For reference, we run all of the models on the competition's test data and submit the predictions to get a conclusive view of their effectiveness.

In [13]:
clf_mlp

MLPClassifier(activation='relu', alpha=0.0001, batch_size='auto', beta_1=0.9,
              beta_2=0.999, early_stopping=False, epsilon=1e-08,
              hidden_layer_sizes=(100,), learning_rate='constant',
              learning_rate_init=0.001, max_fun=15000, max_iter=1000,
              momentum=0.9, n_iter_no_change=10, nesterovs_momentum=True,
              power_t=0.5, random_state=None, shuffle=True, solver='adam',
              tol=0.0001, validation_fraction=0.1, verbose=False,
              warm_start=False)

In [None]:
'''
import tensorflow as tf
from tensorflow import keras

# Three output units with softmax, since status_group has three classes (0, 1, 2)
model = keras.Sequential([
keras.layers.Dense(200, input_dim=287, activation='relu'),
keras.layers.Dense(100, activation='relu'),
keras.layers.Dense(3, activation='softmax'),
])

# Integer class labels call for the sparse categorical loss
model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

model.fit(X_train, y_train, epochs=10, batch_size=10)
'''

In [None]:
# Finding the predictions from the models on the final test data
xgb_pred = pd.DataFrame({'status_group' : clf_xgb.predict(test_data.drop(['id'],axis=1))})
svm_pred = pd.DataFrame({'status_group' : clf_svm.predict(test_data.drop(['id'],axis=1))})
mlp_pred = pd.DataFrame({'status_group' : clf_mlp.predict(test_data.drop(['id'],axis=1))})

# Adding the id column back to the predictions
id_df = pd.DataFrame(test_data['id'])
xgb_pred = id_df.merge(xgb_pred,left_index=True,right_index=True)
svm_pred = id_df.merge(svm_pred,left_index=True,right_index=True)
mlp_pred = id_df.merge(mlp_pred,left_index=True,right_index=True)


# Creating Submission Files
label_dict_status_group = {0:'functional', 1:'non functional', 2:'functional needs repair'}
xgb_pred.status_group = xgb_pred.status_group.replace(label_dict_status_group)
svm_pred.status_group = svm_pred.status_group.replace(label_dict_status_group)
mlp_pred.status_group = mlp_pred.status_group.replace(label_dict_status_group)

# Exporting the data to csv files for submission
pd.DataFrame(xgb_pred).to_csv('xgb_pred.csv',index=False)
pd.DataFrame(svm_pred).to_csv('svm_pred.csv',index=False)
pd.DataFrame(mlp_pred).to_csv('mlp_pred.csv',index=False)
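
The index-based merge above works only while the test rows still carry their original 0..n-1 index, because the prediction frames are created with a fresh default index. A minimal sketch of an alternative that pairs ids and predictions by position; the ids, the predictions, and the name `id_to_label` are made-up placeholders:

```python
import numpy as np
import pandas as pd

id_to_label = {0: 'functional', 1: 'non functional', 2: 'functional needs repair'}

# Placeholder test frame whose index is not 0..n-1, plus fake model output
test_frame = pd.DataFrame({'id': [101, 102, 103]}, index=[7, 3, 9])
predictions = np.array([0, 2, 1])

submission = pd.DataFrame({
    'id': test_frame['id'].to_numpy(),  # positional, ignores the index
    'status_group': pd.Series(predictions).map(id_to_label),
})
print(submission)
```

Since `predict` returns rows in the same order as its input, pairing by position keeps each id with its own prediction even if the frame has been filtered or shuffled beforehand.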

We submitted the files to the portal and found that XGBoost performed best, followed by MLP and SVM. The scores are given below:

> XGB received 0.7467

> MLP received 0.7459

> SVM received 0.6927

In the next notebook, we will optimize the parameters and discuss possible improvements to the project.