# Ideas

Plot histograms of the other variables' distributions within each variable section.

Could lump together needs repair/non functional

Have variables like "installer = funder" and such. Those variables seem to be very similar.

Number of functional wells over the years, non functional wells over the years, etc.

Use SMOTE for data with no null values, all known, and no one-time value variables.

Train models with a numerical status group variable and models with a categorical status group variable.

Better evaluate which variables you should include in the model. (correlation, etc.)

Find a way of evaluating the success of each model graphically, and in a more detailed fashion.

Find ways of dissecting how well each model predicts the non functional (nf), functional needs repair (fnr), and functional (f) categories.

Write functions to make this entire notebook more organized.

SMOTE on functional needs repair data

Look for patterns in what each individual model says for FNR data points.
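A few of the ideas above (lumping the two failing classes together, an "installer = funder" flag, and status counts per construction year) could be sketched with pandas roughly as follows. The rows are made up for illustration; only the column names come from the dataset, and `needs_attention` is a hypothetical column name.

```python
import pandas as pd

# Made-up rows; only the column names mirror the real dataset.
toy = pd.DataFrame({
    'funder': ['Danida', 'Unicef', 'Unknown', 'Danida'],
    'installer': ['Danida', 'DWE', 'Unknown', 'DWE'],
    'construction_year': [1999, 2005, 1999, 2005],
    'status_group': ['functional', 'functional needs repair',
                     'non functional', 'functional'],
})

# Idea: lump "functional needs repair" and "non functional" into one class.
toy['needs_attention'] = (toy['status_group'] != 'functional').astype(int)

# Idea: flag wells whose funder and installer are the same organization.
toy['funder_installer'] = (toy['funder'] == toy['installer']).astype(int)

# Idea: number of wells per status for each construction year.
status_by_year = pd.crosstab(toy['construction_year'], toy['status_group'])
print(status_by_year)
```

Note that two "Unknown" placeholders also compare as equal, so a real version of the flag would probably need to treat missing values separately.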

In [292]:
# !/usr/bin/env python
"""Utility script with functions to be used with the results of GridSearchCV.

**plot_grid_search** plots one graph per parameter in the grid search results.

**table_grid_search** shows tables with the grid search results.

Inspired by the [Displaying the results of a Grid Search](https://www.kaggle.com/grfiv4/displaying-the-results-of-a-grid-search) notebook
by [George Fisher](https://www.kaggle.com/grfiv4)
"""

import pandas as pd
from plotly.subplots import make_subplots
import plotly.graph_objs as go
import pprint
from scipy import stats
from IPython.display import display

__author__ = "Juanma Hernández"
__copyright__ = "Copyright 2019"
__credits__ = ["Juanma Hernández", "George Fisher"]
__license__ = "GPL"
__maintainer__ = "Juanma Hernández"
__email__ = "https://twitter.com/juanmah"
__status__ = "Utility script"


def plot_grid_search(clf):
    """Plot as many graphs as parameters are in the grid search results.

    Each graph has the values of each parameter in the X axis and the Score in the Y axis.

    Parameters
    ----------
    clf: estimator object result of a GridSearchCV
        This object contains all the information of the cross validated results for all the parameters combinations.
    """
    # Convert the cross-validated results into a DataFrame ordered by `rank_test_score` and `mean_fit_time`.
    # As it is frequent to have more than one combination with the same max score,
    # the one with the least mean fit time SHALL appear first.
    cv_results = pd.DataFrame(clf.cv_results_).sort_values(by=['rank_test_score', 'mean_fit_time'])

    # Get parameters
    parameters = cv_results['params'].iloc[0].keys()

    # Calculate the number of rows and columns necessary
    rows = -(-len(parameters) // 2)
    columns = min(len(parameters), 2)
    # Create the subplot
    fig = make_subplots(rows=rows, cols=columns)
    # Initialize row and column indexes
    row = 1
    column = 1

    # For each of the parameters
    for parameter in parameters:

        # As all the graphs have the same traces, and by default all traces are shown in the legend,
        # the description appears multiple times. Then, only show legend of the first graph.
        if row == 1 and column == 1:
            show_legend = True
        else:
            show_legend = False

        # Mean test score
        mean_test_score = cv_results[cv_results['rank_test_score'] != 1]
        fig.add_trace(go.Scatter(
            name='Mean test score',
            x=mean_test_score['param_' + parameter],
            y=mean_test_score['mean_test_score'],
            mode='markers',
            marker=dict(size=mean_test_score['mean_fit_time'],
                        color='SteelBlue',
                        sizeref=2. * cv_results['mean_fit_time'].max() / (40. ** 2),
                        sizemin=4,
                        sizemode='area'),
            text=mean_test_score['params'].apply(
                lambda x: pprint.pformat(x, width=-1).replace('{', '').replace('}', '').replace('\n', '<br />')),
            showlegend=show_legend),
            row=row,
            col=column)

        # Best estimators
        rank_1 = cv_results[cv_results['rank_test_score'] == 1]
        fig.add_trace(go.Scatter(
            name='Best estimators',
            x=rank_1['param_' + parameter],
            y=rank_1['mean_test_score'],
            mode='markers',
            marker=dict(size=rank_1['mean_fit_time'],
                        color='Crimson',
                        sizeref=2. * cv_results['mean_fit_time'].max() / (40. ** 2),
                        sizemin=4,
                        sizemode='area'),
            text=rank_1['params'].apply(str),
            showlegend=show_legend),
            row=row,
            col=column)

        fig.update_xaxes(title_text=parameter, row=row, col=column)
        fig.update_yaxes(title_text='Score', row=row, col=column)

        # Check the linearity of the series
        # Only for numeric series
        if pd.to_numeric(cv_results['param_' + parameter], errors='coerce').notnull().all():
            x_values = cv_results['param_' + parameter].sort_values().unique().tolist()
            r = stats.linregress(x_values, range(0, len(x_values))).rvalue
            # If not so linear, then represent the data as logarithmic
            if r < 0.86:
                fig.update_xaxes(type='log', row=row, col=column)

        # Increment the row and column indexes
        column += 1
        if column > columns:
            column = 1
            row += 1

    # Show first the best estimators
    fig.update_layout(legend=dict(traceorder='reversed'),
                      width=columns * 360 + 100,
                      height=rows * 360,
                      title='Best score: {:.6f} with {}'.format(cv_results['mean_test_score'].iloc[0],
                                                                str(cv_results['params'].iloc[0]).replace('{',
                                                                                                          '').replace(
                                                                    '}', '')),
                      hovermode='closest',
                      template='none')
    fig.show()


def table_grid_search(clf, all_columns=False, all_ranks=False, save=True):
    """Show tables with the grid search results.

    Parameters
    ----------
    clf: estimator object result of a GridSearchCV
        This object contains all the information of the cross validated results for all the parameters combinations.

    all_columns: boolean, default: False
        If true all columns are returned. If false, the following columns are dropped:

        - params. As each parameter has a column with the value.
        - std_*. Standard deviations.
        - split*. Split scores.

    all_ranks: boolean, default: False
        If true all ranks are returned. If false, only the rows with rank equal to 1 are returned.

    save: boolean, default: True
        If true, results are saved to a CSV file.
    """
    # Convert the cross-validated results into a DataFrame ordered by `rank_test_score` and `mean_fit_time`.
    # As it is frequent to have more than one combination with the same max score,
    # the one with the least mean fit time SHALL appear first.
    cv_results = pd.DataFrame(clf.cv_results_).sort_values(by=['rank_test_score', 'mean_fit_time'])

    # Reorder
    columns = cv_results.columns.tolist()
    # rank_test_score first, mean_test_score second and std_test_score third
    columns = columns[-1:] + columns[-3:-1] + columns[:-3]
    cv_results = cv_results[columns]

    if save:
        cv_results.to_csv('--'.join(cv_results['params'][0].keys()) + '.csv', index=True, index_label='Id')

    # Unless all_columns are True, drop not wanted columns: params, std_* split*
    if not all_columns:
        cv_results.drop('params', axis='columns', inplace=True)
        cv_results.drop(list(cv_results.filter(regex='^std_.*')), axis='columns', inplace=True)
        cv_results.drop(list(cv_results.filter(regex='^split.*')), axis='columns', inplace=True)

    # Unless all_ranks is True, keep only the rows whose rank equals one
    if not all_ranks:
        cv_results = cv_results[cv_results['rank_test_score'] == 1]
        cv_results = cv_results.drop('rank_test_score', axis='columns')

    display(cv_results)

In [361]:
import numpy as np
import pandas as pd
from tqdm import tqdm
from sklearn.model_selection import train_test_split, cross_val_score, StratifiedKFold
from sklearn.linear_model import LinearRegression, LogisticRegression, Ridge, Lasso
from sklearn.preprocessing import MinMaxScaler, StandardScaler, OneHotEncoder
from sklearn.metrics import accuracy_score, get_scorer_names, confusion_matrix, classification_report
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier, VotingClassifier
from sklearn.ensemble import HistGradientBoostingClassifier, ExtraTreesClassifier
from sklearn.naive_bayes import GaussianNB, CategoricalNB
from xgboost import XGBClassifier
import xgboost as xgb
import matplotlib.pyplot as plt
from sklearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

In [260]:
X_test = pd.read_csv("tanzanian_water_wells/X_test.csv")
X_train = pd.read_csv("tanzanian_water_wells/X_train.csv")
y_train = pd.read_csv("tanzanian_water_wells/y_train.csv")

df = pd.concat([X_train, y_train], axis=1)

In [261]:
desc = {'amount_tsh': 'Total static head (amount water available to waterpoint)',
                    'date_recorded': 'The date the row was entered',
                    'funder': 'Who funded the well',
                    'gps_height': 'Altitude of the well',
                    'installer': 'Organization that installed the well',
                    'longitude': 'GPS coordinate',
                    'latitude': 'GPS coordinate',
                    'wpt_name': 'Name of the waterpoint if there is one',
                    'subvillage': 'Geographic location',
                    'region': 'Geographic location',
                    'region_code': 'Geographic location (coded)',
                    'district_code': 'Geographic location (coded)',
                    'lga': 'Geographic location',
                    'ward': 'Geographic location',
                    'population': 'Population around the well',
                    'public_meeting': 'True/False',
                    'recorded_by': 'Group entering this row of data',
                    'scheme_management': 'Who operates the waterpoint',
                    'scheme_name': 'Who operates the waterpoint',
                    'permit': 'If the waterpoint is permitted',
                    'construction_year': 'Year the waterpoint was constructed',
                    'extraction_type': 'The kind of extraction the waterpoint uses',
                    'extraction_type_group': 'The kind of extraction the waterpoint uses',
                    'extraction_type_class': 'The kind of extraction the waterpoint uses',
                    'management': 'How the waterpoint is managed',
                    'management_group': 'How the waterpoint is managed',
                    'payment': 'What the water costs',
                    'payment_type': 'What the water costs',
                    'water_quality': 'The quality of the water',
                    'quality_group': 'The quality of the water',
                    'quantity': 'The quantity of water',
                    'quantity_group': 'The quantity of water',
                    'source': 'The source of the water',
                    'source_type': 'The source of the water',
                    'source_class': 'The source of the water',
                    'waterpoint_type': 'The kind of waterpoint',
                    'waterpoint_type_group': 'The kind of waterpoint'}

In [262]:
# Eliminating null values

fill_values = {
    'funder': 'Unknown',
    'installer': 'Unknown',
    'scheme_management': 'None',
    'permit': 'Unknown',
    'scheme_name': 'Unknown',
    'subvillage': 'Unknown',
    'public_meeting': 'Unknown',
}
# Assign the result back instead of calling fillna(inplace=True) on a single column,
# which pandas may apply to a temporary copy rather than to df itself
df = df.fillna(value=fill_values)

In [263]:
# df['fundernum'] = df['funder'].map(df.funder.value_counts())

# df['funder_installer'] = df['funder'] == df['installer']
# df['funder_installer'] = df['funder_installer'].astype('int')

# df['permit'] = df['permit'].map({True: 1, False: 0, 'Unknown': 2})

# df['status_id'] = df['status_group'].map({'non functional': 0, 'functional needs repair': 1, 'functional': 2})

# Defining the train and test sets

In [264]:
X = df.copy()

columns = ['amount_tsh', 'gps_height', 'population', 'region', 'lga', 
           'scheme_management', 'permit', 'construction_year',
           'extraction_type_group', 'payment', 'management', 
           'quality_group', 'quantity', 'source', 'waterpoint_type']

X = X[columns]

# X['public_meeting'] = X['public_meeting'].map({True: 'Yes', False: 'No', 'Unknown': 'Unknown'})
X['permit'] = X['permit'].map({True: 'Yes', False: 'No', 'Unknown': 'Unknown'})
X['gps_height'] = X['gps_height'].astype('float64')
# X['district_code'] = X['district_code'].astype('float64')
X['population'] = X['population'].astype('float64')
# X['district_code'] = X['district_code'].astype('object')

X_cat = X.drop(list(X.select_dtypes(['float64']).columns), axis=1)
X_numeric = X[list(X.select_dtypes(['float64']).columns)]

y = df['status_group']

X_cat = pd.get_dummies(X_cat)

X = pd.concat([X_numeric, X_cat], axis=1)

X_train, X_test, y_train, y_test = train_test_split(X, y)

scaler = StandardScaler()
scaler.fit(X_train)
X_train = pd.DataFrame(scaler.transform(X_train),
                index = X_train.index,
                columns = X_train.columns)
X_test = pd.DataFrame(scaler.transform(X_test),
                index = X_test.index,
                columns = X_test.columns)
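As an alternative to encoding and scaling by hand, the same preprocessing could be wrapped in a scikit-learn `Pipeline` with a `ColumnTransformer`, so that the encoder and scaler are fitted only on whatever data the pipeline is trained on. The sketch below uses made-up rows and just three of the columns; it assumes that only the numeric columns should be scaled, which differs from the cell above (where the dummy columns are scaled too).

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up rows; only the column names mirror the real dataset.
toy = pd.DataFrame({
    'gps_height': [1390.0, 1399.0, 686.0, 263.0, 0.0, 1200.0],
    'population': [109.0, 280.0, 250.0, 58.0, 0.0, 150.0],
    'quantity': ['enough', 'insufficient', 'enough', 'dry', 'seasonal', 'enough'],
    'status_group': ['functional', 'functional', 'functional',
                     'non functional', 'non functional', 'functional'],
})
numeric_cols = ['gps_height', 'population']
categorical_cols = ['quantity']

preprocess = ColumnTransformer([
    ('num', StandardScaler(), numeric_cols),
    # handle_unknown='ignore' encodes categories unseen during fit as all zeros.
    ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_cols),
])
model = Pipeline([('preprocess', preprocess),
                  ('clf', LogisticRegression(max_iter=1000))])
model.fit(toy[numeric_cols + categorical_cols], toy['status_group'])

# A new row with a category that never appeared in the training rows.
new_rows = pd.DataFrame({'gps_height': [500.0], 'population': [80.0],
                         'quantity': ['unknown']})
toy_preds = model.predict(new_rows)
print(toy_preds)
```

Because the whole pipeline is a single estimator, it can be passed directly to `cross_val_score` or `GridSearchCV`, which then refit the preprocessing inside each fold.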

In [265]:
y_train.value_counts()/len(y_train)

status_group
functional                 0.544691
non functional             0.383075
functional needs repair    0.072233
Name: count, dtype: float64

In [266]:
y_train.value_counts()

status_group
functional                 24266
non functional             17066
functional needs repair     3218
Name: count, dtype: int64

In [267]:
strategy = {'functional needs repair': 10000}
smote = SMOTE(sampling_strategy=strategy)
X_train_resampled, y_train_resampled = smote.fit_resample(X_train, y_train)

In [268]:
y_train_resampled.value_counts()/len(y_train_resampled)

status_group
functional                 0.472727
non functional             0.332463
functional needs repair    0.194810
Name: count, dtype: float64
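The resampled proportions follow directly from the class counts: the sampling strategy raises only the "functional needs repair" class to 10,000 rows and leaves the other two classes untouched. A small arithmetic check, using the counts printed earlier:

```python
# Class counts in y_train before resampling (from the value_counts output above).
counts_before = {'functional': 24266, 'non functional': 17066,
                 'functional needs repair': 3218}

# The sampling strategy only raises the minority class to 10,000 rows.
counts_after = dict(counts_before)
counts_after['functional needs repair'] = 10000

total_after = sum(counts_after.values())
shares_after = {label: n / total_after for label, n in counts_after.items()}
synthetic_rows = (counts_after['functional needs repair']
                  - counts_before['functional needs repair'])

for label, share in shares_after.items():
    print(f'{label:<25} {share:.6f}')
print('synthetic rows added:', synthetic_rows)
```

So roughly two thirds of the "functional needs repair" rows seen by the models during training are synthetic.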

# Base Model – Logistic Regression (default L2 regularization)

In [269]:
base_model = LogisticRegression(solver='liblinear', fit_intercept=False)
base_model.fit(X_train_resampled, y_train_resampled)

base_y_hat_train = base_model.predict(X_train)
base_y_hat_test = base_model.predict(X_test)

accuracy_score(y_train, base_y_hat_train)

0.6591470258136924

In [270]:
cross_val_score(base_model, X_train, y_train, cv=5)

array([0.6976431 , 0.72233446, 0.70965208, 0.70381594, 0.7047138 ])

In [383]:
confusion_matrix(y_train, base_y_hat_train)

array([[17326,  5326,  1614],
       [  802,  2212,   204],
       [ 3917,  3322,  9827]])

In [272]:
print(classification_report(y_train, base_y_hat_train))

                         precision    recall  f1-score   support

             functional       0.79      0.71      0.75     24266
functional needs repair       0.20      0.69      0.31      3218
         non functional       0.84      0.58      0.68     17066

               accuracy                           0.66     44550
              macro avg       0.61      0.66      0.58     44550
           weighted avg       0.77      0.66      0.69     44550
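The figures in this report can be recovered from the base model's confusion matrix shown earlier. The sketch below assumes the matrix rows are true labels and the columns are predictions, in scikit-learn's default sorted label order.

```python
import numpy as np

# Base-model confusion matrix from above; rows are true labels, columns are
# predictions, both in sorted label order.
labels = ['functional', 'functional needs repair', 'non functional']
cm = np.array([[17326, 5326, 1614],
               [  802, 2212,  204],
               [ 3917, 3322, 9827]])

correct = np.diag(cm)
recall = correct / cm.sum(axis=1)     # share of each true class that was found
precision = correct / cm.sum(axis=0)  # share of each predicted class that was right
accuracy = correct.sum() / cm.sum()

for name, p, r in zip(labels, precision, recall):
    print(f'{name:<25} precision={p:.2f} recall={r:.2f}')
print(f'accuracy={accuracy:.4f}')
```

The low precision for "functional needs repair" comes from the middle column: of the 10,860 wells predicted as needing repair, only 2,212 actually belong to that class.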



In [375]:
from sklearn.metrics import classification_report, accuracy_score, make_scorer

def classification_report_with_accuracy_score(y_true, y_pred):
    """Print the classification report for one fold and return that fold's accuracy."""
    print(classification_report(y_true, y_pred))
    print('\n\n')
    return accuracy_score(y_true, y_pred)

# 5-fold cross-validation of the base model; the custom scorer prints a report for each fold
fold_scores = cross_val_score(base_model, X=X_train, y=y_train, cv=5,
                              scoring=make_scorer(classification_report_with_accuracy_score))
print(fold_scores)

                         precision    recall  f1-score   support

             functional       0.77      0.78      0.78      4853
functional needs repair       0.23      0.48      0.31       643
         non functional       0.80      0.62      0.70      3414

               accuracy                           0.70      8910
              macro avg       0.60      0.63      0.60      8910
           weighted avg       0.74      0.70      0.71      8910




                         precision    recall  f1-score   support

             functional       0.75      0.83      0.79      4854
functional needs repair       0.28      0.38      0.32       643
         non functional       0.81      0.64      0.71      3413

               accuracy                           0.72      8910
              macro avg       0.61      0.61      0.61      8910
           weighted avg       0.74      0.72      0.73      8910




                         precision    recall  f1-score   support

            

In [380]:
report = classification_report(y_train, base_y_hat_train, output_dict=True)
report

{'functional': {'precision': 0.7859378543887503,
  'recall': 0.7140031319541745,
  'f1-score': 0.7482455572110298,
  'support': 24266},
 'functional needs repair': {'precision': 0.20368324125230203,
  'recall': 0.6873834679925419,
  'f1-score': 0.3142491831226027,
  'support': 3218},
 'non functional': {'precision': 0.8438814942035209,
  'recall': 0.5758232743466541,
  'f1-score': 0.6845459928250497,
  'support': 17066},
 'accuracy': 0.6591470258136924,
 'macro avg': {'precision': 0.611167529948191,
  'recall': 0.6590699580977901,
  'f1-score': 0.582346911052894,
  'support': 44550},
 'weighted avg': {'precision': 0.766076368687421,
  'recall': 0.6591470258136924,
  'f1-score': 0.692494780608837,
  'support': 44550}}
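A dictionary in this shape converts easily into a DataFrame, which makes it simpler to sort classes by a metric or to stack reports from several models. A minimal sketch, using the per-class entries from above rounded to three decimals:

```python
import pandas as pd

# Per-class entries copied (rounded) from the `report` dictionary above.
report_excerpt = {
    'functional': {'precision': 0.786, 'recall': 0.714,
                   'f1-score': 0.748, 'support': 24266},
    'functional needs repair': {'precision': 0.204, 'recall': 0.687,
                                'f1-score': 0.314, 'support': 3218},
    'non functional': {'precision': 0.844, 'recall': 0.576,
                       'f1-score': 0.685, 'support': 17066},
}

# Transpose so that each class is a row and each metric is a column.
report_df = pd.DataFrame(report_excerpt).T
report_df['support'] = report_df['support'].astype(int)

# Sort to see which class the model recovers least often.
print(report_df.sort_values('recall'))
```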

# Second Model – Decision Tree

In [273]:
dtc = DecisionTreeClassifier()

In [274]:
param_grid = {
    'criterion': ['gini', 'entropy'],
    'max_depth': [1, 2, 5, 10],
    'min_samples_split': [5, 10, 20, 40],
    'min_samples_leaf': [5, 10, 20],
    'splitter': ['best', 'random']
}

gs_tree = GridSearchCV(dtc, param_grid, cv=3)
gs_tree.fit(X_train_resampled, y_train_resampled)

gs_tree.best_params_

{'criterion': 'entropy',
 'max_depth': 10,
 'min_samples_leaf': 5,
 'min_samples_split': 10,
 'splitter': 'best'}
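One way to avoid retyping these values is to take them straight from the fitted search object. The sketch below uses synthetic data from `make_classification` and a deliberately small grid, so the names `toy_search`, `best_tree`, and `fresh_tree` are illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data, not the wells dataset.
X_toy, y_toy = make_classification(n_samples=300, n_features=8, n_informative=4,
                                   n_classes=3, random_state=0)

toy_grid = {'criterion': ['gini', 'entropy'], 'max_depth': [2, 5]}
toy_search = GridSearchCV(DecisionTreeClassifier(random_state=0), toy_grid, cv=3)
toy_search.fit(X_toy, y_toy)

# Option 1: the search already holds a tree refit on all the data with the
# best parameters (refit=True is the default).
best_tree = toy_search.best_estimator_

# Option 2: build a fresh, unfitted tree from the same parameters.
fresh_tree = DecisionTreeClassifier(random_state=0, **toy_search.best_params_)

print(toy_search.best_params_)
```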

In [275]:
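# Note: criterion and min_samples_split differ from gs_tree.best_params_ above
# ('entropy' and 10); the results below were produced with these hand-typed values.
dtc = DecisionTreeClassifier(criterion='gini', max_depth=10, min_samples_split=5,
                             min_samples_leaf=5, splitter='best')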

In [276]:
dtc.fit(X_train_resampled, y_train_resampled)

In [277]:
# plt.figure(figsize=(12,12), dpi=500)
# tree.plot_tree(dtc, 
#                feature_names=X_train.columns,
#                class_names=np.unique(y).astype('str'),
#                filled=True, rounded=True)
# plt.show()

In [278]:
dtc_y_hat_train = dtc.predict(X_train)
dtc_y_hat_test = dtc.predict(X_test)

accuracy_score(y_train, dtc_y_hat_train)

0.7278338945005611

In [279]:
cross_val_score(dtc, X_train, y_train, cv=5)

array([0.73905724, 0.73434343, 0.74141414, 0.74287318, 0.74489338])

In [280]:
confusion_matrix(y_train, dtc_y_hat_train, labels = ['non functional', 'functional needs repair', 'functional'])

array([[ 8777,   234,  8055],
       [  165,   736,  2317],
       [  763,   591, 22912]])

In [281]:
print(classification_report(y_train, dtc_y_hat_train))

                         precision    recall  f1-score   support

             functional       0.69      0.94      0.80     24266
functional needs repair       0.47      0.23      0.31      3218
         non functional       0.90      0.51      0.66     17066

               accuracy                           0.73     44550
              macro avg       0.69      0.56      0.59     44550
           weighted avg       0.76      0.73      0.71     44550



In [282]:
len(pd.DataFrame({'true': y_train, 'pred': dtc_y_hat_train}).query('true == pred')) / len(X_train)

0.7278338945005611

# Third Model – K Nearest Neighbors

In [283]:
knn = KNeighborsClassifier(n_neighbors=3)

knn.fit(X_train_resampled, y_train_resampled)

In [284]:
knn_y_hat_train = knn.predict(X_train)
knn_y_hat_test = knn.predict(X_test)

accuracy_score(y_train, knn_y_hat_train)

0.8384960718294051

In [285]:
cross_val_score(knn, X_train, y_train, cv=5)

array([0.75207632, 0.75140292, 0.75993266, 0.76139169, 0.75342312])

In [286]:
confusion_matrix(y_train, knn_y_hat_train, labels = ['non functional', 'functional needs repair', 'functional'])

array([[13812,   417,  2837],
       [  256,  1957,  1005],
       [ 1835,   845, 21586]])

# Fourth Model – Bagging Classifier

In [295]:
bg_clf = BaggingClassifier()

In [299]:
# parameters = {
#     'n_estimators': [20, 50, 100]
# }
# clf = GridSearchCV(bg_clf, parameters, cv=5)
# clf.fit(X_train_resampled, y_train_resampled)
# plot_grid_search(clf)
# table_grid_search(clf)

In [300]:
# parameters = {
#     'max_samples': [x / 10 for x in range(1, 11)]
# }
# clf = GridSearchCV(bg_clf, parameters, cv=5)
# clf.fit(X_train, y_train)
# plot_grid_search(clf)
# table_grid_search(clf)

In [301]:
# parameters = {
#     'max_features': [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.85, 0.90, 0.92, 0.95, 1.0]
# }
# clf = GridSearchCV(bg_clf, parameters, cv=5)
# clf.fit(X_train_resampled, y_train_resampled)
# plot_grid_search(clf)
# table_grid_search(clf)

In [302]:
# parameters = {
#     'bootstrap': [True, False]
# }
# clf = GridSearchCV(bg_clf, parameters, cv=5)
# clf.fit(X_train_resampled, y_train_resampled)
# plot_grid_search(clf)
# table_grid_search(clf)

In [303]:
# parameters = {
#     'bootstrap_features': [True, False]
# }
# clf = GridSearchCV(bg_clf, parameters, cv=5)
# clf.fit(X_train_resampled, y_train_resampled)
# plot_grid_search(clf)
# table_grid_search(clf)

In [304]:
# parameters = {
#     'bootstrap_features': [True, False]
# }
# clf = GridSearchCV(bg_clf, parameters, cv=5)
# clf.fit(X_train_resampled, y_train_resampled)
# plot_grid_search(clf)
# table_grid_search(clf)

In [305]:
# parameters = {
#     'oob_score': [True, False]
# }
# clf = GridSearchCV(bg_clf, parameters, cv=5)
# clf.fit(X_train_resampled, y_train_resampled)
# plot_grid_search(clf)
# table_grid_search(clf)

In [306]:
# parameters = {
#     'warm_start': [True, False]
# }
# clf = GridSearchCV(bg_clf, parameters, cv=5)
# clf.fit(X_train_resampled, y_train_resampled)
# plot_grid_search(clf)
# table_grid_search(clf)

In [307]:
bagged_tree = BaggingClassifier(estimator=DecisionTreeClassifier(), n_estimators=50, max_features=50)
bagged_tree.fit(X_train_resampled, y_train_resampled)

In [308]:
bagged_preds = bagged_tree.predict(X_train)

In [309]:
cross_val_score(bagged_tree, X_train, y_train, cv=5)

array([0.7630752 , 0.77407407, 0.76734007, 0.76565657, 0.77025814])

In [310]:
confusion_matrix(y_train, bagged_preds, labels = ['non functional', 'functional needs repair', 'functional'])

array([[13835,   175,  3056],
       [  180,  1668,  1370],
       [  432,   365, 23469]])

In [311]:
print(classification_report(y_train, bagged_preds))

                         precision    recall  f1-score   support

             functional       0.84      0.97      0.90     24266
functional needs repair       0.76      0.52      0.61      3218
         non functional       0.96      0.81      0.88     17066

               accuracy                           0.87     44550
              macro avg       0.85      0.77      0.80     44550
           weighted avg       0.88      0.87      0.87     44550



In [312]:
len(pd.DataFrame({'true': y_train, 'pred': bagged_preds}).query('true == pred')) / len(X_train)

0.8747923681257015

# Fifth Model – Random Forest

In [313]:
forest = RandomForestClassifier()

param_grid = {
    'criterion': ['gini', 'entropy'],
    'max_depth': [1, 2, 5, 10],
    'min_samples_split': [5, 10, 20, 40],
    'min_samples_leaf': [5, 10, 20]
}

gs_forest = GridSearchCV(forest, param_grid, cv=3)
gs_forest.fit(X_train_resampled, y_train_resampled)

gs_forest.best_params_

{'criterion': 'entropy',
 'max_depth': 10,
 'min_samples_leaf': 5,
 'min_samples_split': 10}

In [314]:
forest_preds = gs_forest.predict(X_train)

In [315]:
# Training accuracy score
gs_forest.score(X_train, y_train)

0.7387654320987654

In [316]:
cross_val_score(gs_forest, X_train, y_train, cv=5)

array([0.74163861, 0.74118967, 0.74657688, 0.73928171, 0.73894501])

In [317]:
confusion_matrix(y_train, forest_preds, labels = ['non functional', 'functional needs repair', 'functional'])

array([[ 9144,   206,  7716],
       [  223,   728,  2267],
       [  752,   474, 23040]])

In [318]:
print(classification_report(y_train, forest_preds))

                         precision    recall  f1-score   support

             functional       0.70      0.95      0.80     24266
functional needs repair       0.52      0.23      0.31      3218
         non functional       0.90      0.54      0.67     17066

               accuracy                           0.74     44550
              macro avg       0.71      0.57      0.60     44550
           weighted avg       0.76      0.74      0.72     44550



# Sixth Model – XGBoost

In [319]:
xgboost_y_train_resampled = y_train_resampled.map({'non functional': 0, 'functional needs repair': 1, 'functional': 2})

In [320]:
xgboost_y_train = y_train.map({'non functional': 0, 'functional needs repair': 1, 'functional': 2})

In [321]:
# Named xgb_clf so that it does not shadow the `xgb` module alias imported earlier
xgb_clf = XGBClassifier()
xgb_clf.fit(X_train_resampled, xgboost_y_train_resampled)

In [322]:
xgboost_train_preds = xgb_clf.predict(X_train)
xgboost_test_preds = xgb_clf.predict(X_test)

In [323]:
accuracy_score(xgboost_y_train, xgboost_train_preds)

0.8000224466891134

In [324]:
cross_val_score(xgb_clf, X_train, xgboost_y_train, cv=5)

array([0.78170595, 0.78226712, 0.78395062, 0.78496072, 0.78361392])

In [325]:
confusion_matrix(xgboost_y_train, xgboost_train_preds)

array([[12441,   597,  4028],
       [  337,  1499,  1382],
       [ 1451,  1114, 21701]])

In [326]:
print(classification_report(xgboost_y_train, xgboost_train_preds))

              precision    recall  f1-score   support

           0       0.87      0.73      0.80     17066
           1       0.47      0.47      0.47      3218
           2       0.80      0.89      0.84     24266

    accuracy                           0.80     44550
   macro avg       0.71      0.70      0.70     44550
weighted avg       0.80      0.80      0.80     44550



# Eighth Model – Adaboost Classifier

In [327]:
# Instantiate an AdaBoostClassifier
adaboost_clf = AdaBoostClassifier(estimator=DecisionTreeClassifier(), n_estimators=100, random_state=42)

In [328]:
# Fit AdaBoostClassifier
adaboost_clf.fit(X_train_resampled, y_train_resampled)

In [329]:
adaboost_train_preds = adaboost_clf.predict(X_train)

In [330]:
accuracy_score(y_train, adaboost_train_preds)

0.9362289562289562

In [331]:
cross_val_score(adaboost_clf, X_train, y_train, cv=5)

array([0.74107744, 0.75241302, 0.75712682, 0.75521886, 0.75420875])
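Comparing the training accuracy with the mean cross-validation score gives a rough indication of overfitting. The sketch below uses the numbers printed so far for the decision tree and AdaBoost. The two figures are not strictly comparable, since the training accuracy comes from a model fitted on the SMOTE-resampled data while `cross_val_score` refits on the original folds, so the gap should be read as a hint rather than a measurement.

```python
from statistics import mean

# (training accuracy, 5-fold CV scores) copied from the outputs above.
results = {
    'decision tree': (0.7278338945005611,
                      [0.73905724, 0.73434343, 0.74141414, 0.74287318, 0.74489338]),
    'adaboost': (0.9362289562289562,
                 [0.74107744, 0.75241302, 0.75712682, 0.75521886, 0.75420875]),
}

gaps = {}
for name, (train_acc, cv_scores) in results.items():
    cv_mean = mean(cv_scores)
    gaps[name] = train_acc - cv_mean
    print(f'{name:<14} train={train_acc:.3f} cv_mean={cv_mean:.3f} gap={gaps[name]:+.3f}')
```

A gap of roughly 0.18 for AdaBoost, against a slightly negative gap for the depth-limited tree, suggests that the boosted fully grown trees memorize much of the training set.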

In [332]:
confusion_matrix(y_train, adaboost_train_preds, labels = ['non functional', 'functional needs repair', 'functional'])

array([[15728,   303,  1035],
       [   82,  2712,   424],
       [  437,   560, 23269]])

In [333]:
print(classification_report(y_train, adaboost_train_preds))

                         precision    recall  f1-score   support

             functional       0.94      0.96      0.95     24266
functional needs repair       0.76      0.84      0.80      3218
         non functional       0.97      0.92      0.94     17066

               accuracy                           0.94     44550
              macro avg       0.89      0.91      0.90     44550
           weighted avg       0.94      0.94      0.94     44550



# Ninth Model – Gradient Boosting Classifier

In [334]:
# Instantiate a GradientBoostingClassifier
gbt_clf = GradientBoostingClassifier(random_state=42, n_estimators=50, max_features=50)

In [335]:
# Fit GradientBoostingClassifier
gbt_clf.fit(X_train_resampled, y_train_resampled)

In [336]:
gbt_clf_train_preds = gbt_clf.predict(X_train)

In [337]:
cross_val_score(gbt_clf, X_train, y_train, cv=5)

array([0.7342312 , 0.7312009 , 0.73815937, 0.7342312 , 0.73546577])

In [338]:
confusion_matrix(y_train, gbt_clf_train_preds, labels = ['non functional', 'functional needs repair', 'functional'])

array([[ 9312,   397,  7357],
       [  301,   777,  2140],
       [ 1084,   615, 22567]])

In [339]:
print(classification_report(y_train, gbt_clf_train_preds))

                         precision    recall  f1-score   support

             functional       0.70      0.93      0.80     24266
functional needs repair       0.43      0.24      0.31      3218
         non functional       0.87      0.55      0.67     17066

               accuracy                           0.73     44550
              macro avg       0.67      0.57      0.59     44550
           weighted avg       0.75      0.73      0.72     44550



# Tenth Model – Histogram Gradient Boosting Classifier

In [340]:
hbg = HistGradientBoostingClassifier(max_iter=100)
hbg.fit(X_train_resampled, y_train_resampled)

In [341]:
hbg_preds = hbg.predict(X_train)

In [342]:
cross_val_score(hbg, X_train, y_train, cv=5)

array([0.78271605, 0.78136925, 0.78574635, 0.78473625, 0.77968575])

In [343]:
confusion_matrix(y_train, hbg_preds, labels = ['non functional', 'functional needs repair', 'functional'])

array([[12432,   596,  4038],
       [  326,  1581,  1311],
       [ 1617,  1179, 21470]])

In [344]:
print(classification_report(y_train, hbg_preds))

                         precision    recall  f1-score   support

             functional       0.80      0.88      0.84     24266
functional needs repair       0.47      0.49      0.48      3218
         non functional       0.86      0.73      0.79     17066

               accuracy                           0.80     44550
              macro avg       0.71      0.70      0.70     44550
           weighted avg       0.80      0.80      0.80     44550



# Eleventh Model – Extra Randomized Trees

In [345]:
extra_trees = ExtraTreesClassifier(n_estimators=50, random_state=42)
extra_trees.fit(X_train_resampled, y_train_resampled)
extra_trees_train_preds = extra_trees.predict(X_train)
accuracy_score(y_train, extra_trees_train_preds)

0.9360942760942761

In [346]:
cross_val_score(extra_trees, X_train, y_train, cv=5)

array([0.77037037, 0.77396184, 0.77968575, 0.78125701, 0.77384961])

In [347]:
confusion_matrix(y_train, extra_trees_train_preds, labels = ['non functional', 'functional needs repair', 'functional'])

array([[15587,   336,  1143],
       [   57,  2723,   438],
       [  333,   540, 23393]])

In [348]:
print(classification_report(y_train, extra_trees_train_preds))

                         precision    recall  f1-score   support

             functional       0.94      0.96      0.95     24266
functional needs repair       0.76      0.85      0.80      3218
         non functional       0.98      0.91      0.94     17066

               accuracy                           0.94     44550
              macro avg       0.89      0.91      0.90     44550
           weighted avg       0.94      0.94      0.94     44550



# Thirteenth Model – Gaussian Naive Bayes

In [349]:
X_train_GNB_resampled = X_train_resampled[['amount_tsh', 'gps_height', 'population']]
X_train_GNB = X_train[['amount_tsh', 'gps_height', 'population']]

In [350]:
gnb = GaussianNB()
gnb.fit(X_train_GNB_resampled, y_train_resampled)
gnb_preds = gnb.predict(X_train_GNB)
accuracy_score(y_train, gnb_preds)

0.41672278338945007

In [351]:
gnb.classes_

array(['functional', 'functional needs repair', 'non functional'],
      dtype='<U23')

In [352]:
cross_val_score(gnb, X_train_GNB, y_train, cv=10)

array([0.41683502, 0.41705948, 0.41885522, 0.41728395, 0.41054994,
       0.41324355, 0.42312009, 0.41369248, 0.41997755, 0.42334456])

In [353]:
confusion_matrix(y_train, gnb_preds, labels = gnb.classes_)

array([[ 2221,     0, 22045],
       [  179,     0,  3039],
       [  722,     0, 16344]])

In [354]:
print(classification_report(y_train, gnb_preds))


Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.



                         precision    recall  f1-score   support

             functional       0.71      0.09      0.16     24266
functional needs repair       0.00      0.00      0.00      3218
         non functional       0.39      0.96      0.56     17066

               accuracy                           0.42     44550
              macro avg       0.37      0.35      0.24     44550
           weighted avg       0.54      0.42      0.30     44550
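The warning appears because the model never predicts "functional needs repair" (the middle column of the confusion matrix is all zeros), so that class's precision is 0/0. The sketch below rebuilds label vectors from the confusion matrix and passes `zero_division=0`, which reports 0.0 for that class explicitly instead of warning.

```python
import numpy as np
from sklearn.metrics import precision_score

labels = np.array(['functional', 'functional needs repair', 'non functional'])
# Gaussian Naive Bayes confusion matrix from above (rows: true, columns: predicted).
cm = np.array([[2221, 0, 22045],
               [ 179, 0,  3039],
               [ 722, 0, 16344]])

# Rebuild label vectors that reproduce the matrix cell by cell.
true_idx, pred_idx = np.indices(cm.shape)
y_true = np.repeat(labels[true_idx.ravel()], cm.ravel())
y_pred = np.repeat(labels[pred_idx.ravel()], cm.ravel())

# No row is ever predicted as "functional needs repair", so its precision is 0/0.
# zero_division=0 reports 0.0 for that class without emitting the warning.
precision = precision_score(y_true, y_pred, labels=labels,
                            average=None, zero_division=0)
print(dict(zip(labels, precision.round(2))))
```

The same `zero_division` argument is accepted by `classification_report`.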






# VotingClassifier

In [355]:
eclf = VotingClassifier(estimators=[
    ('lr', base_model), ('dtc', dtc), ('knn', knn), ('bag', bagged_tree),
    ('forest', gs_forest), ('ada', adaboost_clf), ('gbt', gbt_clf),
    ('hbg', hbg), ('extra_trees', extra_trees)])
eclf.fit(X_train_resampled, y_train_resampled)
eclf_preds = eclf.predict(X_train)
accuracy_score(y_train, eclf_preds)

0.8296296296296296

In [356]:
cross_val_score(eclf, X_train, y_train, cv=5)

array([0.77609428, 0.77654321, 0.78204265, 0.77699214, 0.78226712])

In [357]:
confusion_matrix(y_train, eclf_preds)

array([[23183,   544,   539],
       [ 1430,  1664,   124],
       [ 4610,   343, 12113]])

In [359]:
print(classification_report(y_train, eclf_preds))

                         precision    recall  f1-score   support

             functional       0.79      0.96      0.87     24266
functional needs repair       0.65      0.52      0.58      3218
         non functional       0.95      0.71      0.81     17066

               accuracy                           0.83     44550
              macro avg       0.80      0.73      0.75     44550
           weighted avg       0.84      0.83      0.82     44550
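With its default `voting='hard'`, `VotingClassifier` predicts the label that most of its member models choose. The sketch below reproduces that majority vote by hand on hypothetical per-model predictions (not real output from the models above) and adds an agreement score, which could help when looking at how individual models disagree on "functional needs repair" wells.

```python
import pandas as pd

# Hypothetical per-model predictions for five wells (not real model output).
toy_votes = pd.DataFrame({
    'lr': ['functional', 'functional needs repair', 'non functional',
           'functional', 'functional needs repair'],
    'knn': ['functional', 'functional', 'non functional',
            'functional', 'functional needs repair'],
    'forest': ['functional', 'functional', 'functional',
               'functional', 'non functional'],
})

# Hard voting: the most frequent label in each row wins.
majority = toy_votes.mode(axis=1)[0]

# Share of models agreeing with the winner; low values mark contested wells.
agreement = toy_votes.eq(majority, axis=0).mean(axis=1)

print(pd.DataFrame({'majority': majority, 'agreement': agreement}))
```

With three voters a tie only occurs when all three disagree; in that case this sketch takes the alphabetically first label.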

