# Ideas

Plot histograms of the other variables' distributions within each variable's section.

Could lump together needs repair/non functional

Have variables like "installer = funder" and such. Those variables seem to be very similar.

Number of functional wells over the years, non functional wells over the years, etc.

Use SMOTE for data with no null values, all known, and no one-time value variables.

Train some models with a numerical status group variable and others with a categorical status group variable.

Better evaluate which variables you should include in the model. (correlation, etc.)

Find a way of evaluating the success of each model graphically, and in a more detailed fashion.

Find ways of dissecting how well each model predicts nf, fnr, and f categories.

Write functions to make this entire notebook more organized.

SMOTE on functional needs repair data

Look for patterns in what each individual model says for FNR data points.

Parallel notebooks for demanding models. Hyperparameter tuning, etc.

Make your own X-test and y-test hold-out set from the training data.

Find any differences between Kaggle's X-test and your own, such as extra categories.

Consider again creating extra features.

Test out different groups of features.

Do cross validation and synthetic over-sampling at the same time.

Try different degrees of over-sampling.

Figure out how to get `mode` to throw some sort of error; if it does, fall back to the functional master for the prediction.

How to get XGBoost to take in a dataset with categorical target, fit to it, and then re-map the predictions back to categorical

ROC/AUC Curve for both train and test splits.

Evaluate your models on the training data as well. Even if there's a good score on the test dataset, it might be a lot better on the training dataset, meaning it heavily overfits.

For the final ensemble, also try different re-sampling ratios.

Consider using only chi-squared importances to make decisions about variables.

In [329]:
import numpy as np
import pandas as pd
from tqdm import tqdm
from sklearn.model_selection import train_test_split, cross_val_score, StratifiedKFold
from sklearn.linear_model import LinearRegression, LogisticRegression, Ridge, Lasso
from sklearn.preprocessing import MinMaxScaler, StandardScaler, OneHotEncoder, OrdinalEncoder
from sklearn.metrics import accuracy_score, get_scorer_names, confusion_matrix, classification_report
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier, VotingClassifier
from sklearn.ensemble import HistGradientBoostingClassifier, ExtraTreesClassifier
from sklearn.ensemble import StackingClassifier
from sklearn.naive_bayes import GaussianNB, CategoricalNB
from sklearn.feature_selection import SelectKBest, chi2, mutual_info_classif
from xgboost import XGBClassifier
import xgboost as xgb
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
from statistics import mode as md
from IPython.display import clear_output

In [330]:
testing = pd.read_csv("tanzanian_water_wells/X_test.csv")
X = pd.read_csv("tanzanian_water_wells/X_train.csv")
y = pd.read_csv("tanzanian_water_wells/y_train.csv")

In [331]:
(X.id == y.id).value_counts()

id
True    59400
Name: count, dtype: int64

In [332]:
y = pd.read_csv("tanzanian_water_wells/y_train.csv")['status_group']

# Finding columns between X_train and X_test that do not differ drastically

In [333]:
differences = []

columns = list(X.select_dtypes(exclude=['float64', 'int64']).columns)

for col in columns:
    difference = set(list(X[col])) ^ set(list(testing[col]))
    differences.append(len(difference))
    
pd.DataFrame({'column': list(columns), 'differences': differences}).sort_values(by=['differences'], ascending=False)

Unnamed: 0,column,differences
3,wpt_name,43128
5,subvillage,15120
2,installer,1584
1,funder,1403
12,scheme_name,1251
8,ward,145
0,date_recorded,51
14,extraction_type,1
23,quantity,0
21,water_quality,0
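The symmetric-difference count above says how many category labels differ, but not how many test rows are affected. A minimal sketch of a complementary check, assuming a hypothetical helper `unseen_share` and toy series in place of the real columns:

```python
import pandas as pd

def unseen_share(train: pd.Series, test: pd.Series) -> float:
    """Fraction of test rows whose value never appears in train (hypothetical helper)."""
    return float((~test.isin(train.unique())).mean())

# Toy columns; in the notebook these would be X[col] and testing[col].
train_col = pd.Series(["a", "b", "b", "c"])
test_col = pd.Series(["a", "d", "b", "e"])

share = unseen_share(train_col, test_col)
print(share)
```

A column with many differing labels but a small unseen share might still be usable after grouping rare values.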


In [334]:
X = X.drop(['id', 'wpt_name', 'subvillage', 'installer', 'funder', 'scheme_name', 'ward', 'date_recorded', 'recorded_by'], axis=1)
testing = testing.drop(['id', 'wpt_name', 'subvillage', 'installer', 'funder', 'scheme_name', 'ward', 'date_recorded', 'recorded_by'], axis=1)

Eliminating recorded_by because it only has one value.

Eliminating id because it carries no predictive information.

Eliminating all the rest because they are categorical variables that differ heavily between the test and train sets.
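Single-value and identifier-like columns can also be found programmatically. A rough sketch on a toy frame; the column values and the 0.9 cut-off are assumptions for illustration only:

```python
import pandas as pd

# Toy frame standing in for X; values are made up.
toy = pd.DataFrame({
    "recorded_by": ["same"] * 4,
    "wpt_name": ["a", "b", "c", "d"],
    "quantity": ["dry", "enough", "dry", "enough"],
})

n_unique = toy.nunique()
constant_cols = n_unique[n_unique == 1].index.tolist()
# Assumed cut-off: a column is identifier-like when nearly every row is distinct.
id_like_cols = n_unique[n_unique / len(toy) > 0.9].index.tolist()
print(constant_cols, id_like_cols)
```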

# Defining the train and test sets

In [335]:
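# Eliminating null values from X_train
# Assign the result back instead of calling fillna(inplace=True) on a column,
# which can act on a temporary copy in recent pandas versions.
X['scheme_management'] = X['scheme_management'].fillna("None")
X['permit'] = X['permit'].fillna('Unknown')
X['public_meeting'] = X['public_meeting'].fillna('Unknown')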
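In the cells shown here the fills are applied to `X` only, so `testing` would need the same treatment before any prediction. One way to keep the two in step is a single cleaning function used on both frames. This is a sketch: the `clean` helper and the toy frames are assumptions, not part of the notebook.

```python
import pandas as pd

def clean(frame: pd.DataFrame) -> pd.DataFrame:
    """Apply the same null handling to any frame with these three columns."""
    out = frame.copy()
    out["scheme_management"] = out["scheme_management"].fillna("None")
    out["public_meeting"] = out["public_meeting"].fillna("Unknown")
    # Map booleans to labels first; anything unmapped (nulls) becomes 'Unknown'.
    out["permit"] = out["permit"].map({True: "Yes", False: "No"}).fillna("Unknown")
    return out

toy_train = pd.DataFrame({
    "scheme_management": ["VWC", None, "WUG"],
    "public_meeting": [True, None, False],
    "permit": [True, None, False],
})
toy_test = toy_train.iloc[::-1].reset_index(drop=True)

cleaned_train = clean(toy_train)
cleaned_test = clean(toy_test)
```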

In [336]:
# X['public_meeting'] = X['public_meeting'].map({True: 'Yes', False: 'No', 'Unknown': 'Unknown'})
X['permit'] = X['permit'].map({True: 'Yes', False: 'No', 'Unknown': 'Unknown'})
X['gps_height'] = X['gps_height'].astype('float64')
# X['district_code'] = X['district_code'].astype('float64')
X['population'] = X['population'].astype('float64')
X['construction_year'] = X['construction_year'].astype('int64')
X['region_code'] = X['region_code'].astype('str')
X['district_code'] = X['district_code'].astype('str')

X_cat = X.select_dtypes(exclude=['float64', 'int64'])
X_cat = X_cat.astype('str')
X_numeric = X.select_dtypes(['float64', 'int64'])

In [337]:
df = pd.concat([X_numeric, X_cat, y], axis=1)

In [338]:
oe = OrdinalEncoder()
oe.fit(X_cat)
X_cat = pd.DataFrame(oe.transform(X_cat), index = X_cat.index, columns = X_cat.columns)
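With default settings, `OrdinalEncoder` raises an error when `transform` meets a category it did not see during `fit`, which could matter when encoding the test set. A minimal sketch of one option, using toy frames and an assumed sentinel of `-1` for unseen categories:

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

train_cat = pd.DataFrame({"basin": ["Rufiji", "Pangani", "Rufiji"]})
test_cat = pd.DataFrame({"basin": ["Pangani", "Lake Nyasa"]})  # second value unseen in train

# Unseen categories are encoded as -1 instead of raising.
encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
encoder.fit(train_cat)
encoded_test = encoder.transform(test_cat)
print(encoded_test)
```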

In [339]:
mms = MinMaxScaler()
mms.fit(X_numeric)
X_numeric = pd.DataFrame(mms.transform(X_numeric), columns = X_numeric.columns, index = X_numeric.index)

In [340]:
y = y.map({'functional': 0, 'functional needs repair': 1, 'non functional': 2})
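The ideas list asks how to map integer predictions back to the categorical labels. A small sketch that inverts the same dictionary; `preds` is a stand-in for the output of any fitted model's `predict`:

```python
import pandas as pd

status_to_code = {"functional": 0, "functional needs repair": 1, "non functional": 2}
# Invert the mapping so integer predictions can be turned back into labels.
code_to_status = {code: status for status, code in status_to_code.items()}

y_labels = pd.Series(["functional", "non functional", "functional needs repair"])
y_codes = y_labels.map(status_to_code)

preds = pd.Series([0, 2, 1])  # stand-in for model.predict(...)
pred_labels = preds.map(code_to_status)
print(pred_labels.tolist())
```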

# Feature Selection with Chi2

In [341]:
fs = SelectKBest(score_func=chi2, k='all')
fs.fit(X_cat, y)
# X_train_fs = fs.transform(X)

ch2_scores = pd.DataFrame({'feature': fs.feature_names_in_, 'score': fs.scores_, 'pvalue': fs.pvalues_})
ch2_scores['significant'] = ch2_scores.pvalue.map(lambda x: 'Yes' if x < 0.05 else 'No')
ch2_scores.sort_values(by=['score'], ascending=False).reset_index()

Unnamed: 0,index,feature,score,pvalue,significant
0,4,lga,9184.155815,0.0,Yes
1,10,extraction_type_class,4962.445269,0.0,Yes
2,9,extraction_type_group,3427.761791,0.0,Yes
3,22,waterpoint_type,3348.517448,0.0,Yes
4,8,extraction_type,2638.196579,0.0,Yes
5,23,waterpoint_type_group,2540.881101,0.0,Yes
6,1,region,1805.634614,0.0,Yes
7,2,region_code,1788.823521,0.0,Yes
8,13,payment,866.203572,8.059059e-189,Yes
9,3,district_code,705.287786,7.05835e-154,Yes
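Note that sklearn's `chi2` treats feature values as non-negative counts, so on ordinal codes the score depends on which integer each category happened to receive. A possible cross-check is the classical chi-squared test on a contingency table, sketched here with `scipy.stats.chi2_contingency` on made-up data:

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Made-up feature and target; in the notebook these would be a raw
# categorical column and the status group.
feature = pd.Series(["dry", "dry", "enough", "enough", "dry", "enough"] * 10)
target = pd.Series([2, 2, 0, 0, 2, 1] * 10)

table = pd.crosstab(feature, target)
stat, p_value, dof, expected = chi2_contingency(table)
print(stat, p_value, dof)
```

This statistic depends only on the category counts, not on the integer codes.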


In [342]:
fs = SelectKBest(score_func=chi2, k='all')
fs.fit(X_numeric, y)
# X_train_fs = fs.transform(X)

scores = pd.DataFrame({'feature': fs.feature_names_in_, 'score': fs.scores_, 'pvalue': fs.pvalues_})
scores['significant'] = scores.pvalue.map(lambda x: 'Yes' if x < 0.05 else 'No')
scores.sort_values(by=['score'], ascending=False).reset_index()

Unnamed: 0,index,feature,score,pvalue,significant
0,1,gps_height,172.710476,3.1361489999999997e-38,Yes
1,6,construction_year,66.593012,3.463468e-15,Yes
2,2,longitude,27.408583,1.11764e-06,Yes
3,3,latitude,22.005588,1.665511e-05,Yes
4,0,amount_tsh,13.868288,0.0009739564,Yes
5,5,population,0.76877,0.6808692,No
6,4,num_private,0.401568,0.818089,No


# Feature Selection with Mutual Information

In [343]:
fs = SelectKBest(score_func=mutual_info_classif, k='all')
fs.fit(X_cat, y)

mi_scores = pd.DataFrame({'feature': fs.feature_names_in_, 'score': fs.scores_})
mi_scores.sort_values(by=['score'], ascending=False).reset_index()

Unnamed: 0,index,feature,score
0,18,quantity_group,0.111623
1,17,quantity,0.109195
2,4,lga,0.088932
3,22,waterpoint_type,0.06511
4,9,extraction_type_group,0.064002
5,8,extraction_type,0.063288
6,10,extraction_type_class,0.059152
7,23,waterpoint_type_group,0.052851
8,2,region_code,0.044029
9,1,region,0.040636
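For a dense array, `mutual_info_classif` defaults to treating features as continuous. Since `X_cat` holds integer category codes, passing `discrete_features=True` may be the better fit. A sketch on synthetic codes (the data is random, not from the wells dataset):

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(0)
labels = rng.integers(0, 3, size=300)
informative = labels.copy()            # codes that fully determine the label
noise = rng.integers(0, 3, size=300)   # codes unrelated to the label
codes = np.column_stack([informative, noise])

# Treat both columns as discrete rather than continuous.
mi = mutual_info_classif(codes, labels, discrete_features=True, random_state=0)
print(mi)
```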


In [344]:
fs = SelectKBest(score_func=mutual_info_classif, k='all')
fs.fit(X_numeric, y)

scores = pd.DataFrame({'feature': fs.feature_names_in_, 'score': fs.scores_})
scores.sort_values(by=['score'], ascending=False).reset_index()

Unnamed: 0,index,feature,score
0,2,longitude,0.066279
1,3,latitude,0.060944
2,6,construction_year,0.036787
3,0,amount_tsh,0.035425
4,5,population,0.021281
5,1,gps_height,0.019295
6,4,num_private,0.002849


# Feature Selection with Decision Trees

In [402]:
forest = RandomForestClassifier(random_state=42, n_jobs=6, class_weight='balanced')
forest.fit(X_cat, y)

forest_scores = pd.DataFrame({'feature': forest.feature_names_in_, 'score': forest.feature_importances_})
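# Note: cumsum is computed in the original column order, before sorting,
# so it does not read as cumulative importance in the sorted table.
forest_scores['cumsum'] = forest_scores['score'].cumsum()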
forest_scores.sort_values(by=['score'], ascending=False).reset_index()

Unnamed: 0,index,feature,score,cumsum
0,184,quantity_group_dry,0.086442,0.784592
1,185,quantity_group_enough,0.034136,0.818728
2,205,waterpoint_type_other,0.027375,1.000000
3,186,quantity_group_insufficient,0.025360,0.844088
4,155,extraction_type_class_other,0.025346,0.518938
...,...,...,...,...
201,161,management_other - school,0.000180,0.535155
202,121,lga_Songea Urban,0.000161,0.314227
203,202,waterpoint_type_dam,0.000047,0.958715
204,57,lga_Lindi Urban,0.000027,0.199115
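For the `cumsum` column to read as "share of importance covered by the top k features", the sort has to come before the cumulative sum. A sketch with made-up scores:

```python
import pandas as pd

# Made-up importances that sum to 1.
importances = pd.DataFrame({
    "feature": ["amount_tsh", "longitude", "latitude"],
    "score": [0.2, 0.5, 0.3],
})

# Sort first, then accumulate.
ranked = importances.sort_values("score", ascending=False).reset_index(drop=True)
ranked["cumsum"] = ranked["score"].cumsum()
print(ranked)
```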


In [346]:
forest = RandomForestClassifier(random_state=42, n_jobs=6, class_weight='balanced')
forest.fit(X_numeric, y)

scores = pd.DataFrame({'feature': forest.feature_names_in_, 'score': forest.feature_importances_})
# Note: as above, cumsum follows the original column order, not the sorted order.
scores['cumsum'] = scores['score'].cumsum()
scores.sort_values(by=['score'], ascending=False).reset_index()

Unnamed: 0,index,feature,score,cumsum
0,2,longitude,0.331271,0.50878
1,3,latitude,0.319055,0.827835
2,1,gps_height,0.132658,0.177509
3,5,population,0.085367,0.916226
4,6,construction_year,0.083774,1.0
5,0,amount_tsh,0.044851,0.044851
6,4,num_private,0.003024,0.830859


# Ranking variables based on importance through multiple feature selection methods

In [347]:
cols = list(X_cat.columns)

In [348]:
ch2_rankings = []
mi_rankings = []
forest_rankings = []

df = ch2_scores.sort_values(by=['score'], ascending=False).reset_index()
for col in cols:
    ch2_rankings.append(df[df.feature == col].index[0])
    
df = mi_scores.sort_values(by=['score'], ascending=False).reset_index()
for col in cols:
    mi_rankings.append(df[df.feature == col].index[0])
    
df = forest_scores.sort_values(by=['score'], ascending=False).reset_index()
for col in cols:
    forest_rankings.append(df[df.feature == col].index[0])

In [349]:
rankings = pd.DataFrame({'feature': cols, 'chi_squared': ch2_rankings, 'mutual_information': mi_rankings, 'random_forest': forest_rankings})
rankings['average'] = rankings.apply(lambda row: (row.chi_squared + row.mutual_information + row.random_forest)/3, axis=1)

In [350]:
rankings.sort_values(by=['average'])

Unnamed: 0,feature,chi_squared,mutual_information,random_forest,average
4,lga,0,2,2,1.333333
18,quantity_group,10,0,1,3.666667
17,quantity,11,1,0,4.0
22,waterpoint_type,3,3,9,5.0
2,region_code,7,8,4,6.333333
9,extraction_type_group,2,4,13,6.333333
10,extraction_type_class,1,6,12,6.333333
1,region,6,9,5,6.666667
8,extraction_type,4,5,14,7.666667
13,payment,8,10,8,8.666667
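The three ranking loops could probably be collapsed with `DataFrame.rank`. A sketch assuming the raw scores sit in one frame indexed by feature; the feature names and numbers below are placeholders, not results:

```python
import pandas as pd

# Placeholder scores for three hypothetical features.
scores = pd.DataFrame({
    "chi_squared": [30.0, 5.0, 20.0],
    "mutual_information": [0.09, 0.02, 0.06],
    "random_forest": [0.10, 0.30, 0.05],
}, index=["a", "b", "c"])

# Rank 0 is the highest score in each column, matching the 0-based ranks above.
ranks = scores.rank(ascending=False, method="min") - 1
ranks["average"] = ranks.mean(axis=1)
print(ranks.sort_values("average"))
```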


# Correlation

In [351]:
X_numeric.corr()

Unnamed: 0,amount_tsh,gps_height,longitude,latitude,num_private,population,construction_year
amount_tsh,1.0,0.07665,0.022134,-0.05267,0.002944,0.016288,0.067915
gps_height,0.07665,1.0,0.149155,-0.035751,0.007237,0.135003,0.658727
longitude,0.022134,0.149155,1.0,-0.425802,0.023873,0.08659,0.396732
latitude,-0.05267,-0.035751,-0.425802,1.0,0.006837,-0.022152,-0.245278
num_private,0.002944,0.007237,0.023873,0.006837,1.0,0.003818,0.026056
population,0.016288,0.135003,0.08659,-0.022152,0.003818,1.0,0.26091
construction_year,0.067915,0.658727,0.396732,-0.245278,0.026056,0.26091,1.0
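To read the matrix faster, the strongly correlated pairs can be listed directly. In the table above, a cut-off of 0.6 would leave only gps_height and construction_year (0.659). A sketch with a hypothetical `strong_pairs` helper on toy data:

```python
import numpy as np
import pandas as pd

def strong_pairs(frame: pd.DataFrame, threshold: float = 0.6) -> pd.Series:
    """Return feature pairs whose absolute correlation exceeds the threshold."""
    corr = frame.corr().abs()
    # Keep the upper triangle only so each pair appears once and the diagonal is skipped.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    pairs = upper.stack()
    return pairs[pairs > threshold].sort_values(ascending=False)

x = np.arange(10.0)
toy = pd.DataFrame({"x": x, "y": 2 * x + 1, "z": [3, 1, 4, 1, 5, 9, 2, 6, 5, 3]})
result = strong_pairs(toy)
print(result)
```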


In [352]:
X_cat = X_cat.drop(['quantity', 'waterpoint_type_group', 'extraction_type_group', 
                    'region', 'extraction_type', 'payment_type', 'source_type', 
                    'management_group', 'water_quality', 'source_class', 
                    'region_code', 'district_code'], axis=1)

testing = testing.drop(['quantity', 'waterpoint_type_group', 'extraction_type_group', 
                    'region', 'extraction_type', 'payment_type', 'source_type', 
                    'management_group', 'water_quality', 'source_class', 
                    'region_code', 'district_code'], axis=1)

# Re-encoding datasets

In [353]:
X_cat = X[list(X_cat.columns)]

In [354]:
X_cat = pd.get_dummies(X_cat, dtype='int64')

In [355]:
X = pd.concat([X_numeric, X_cat], axis=1)

In [356]:
X_train, X_test, y_train, y_test = train_test_split(X, y)
X_train.reset_index(inplace=True, drop=True)
y_train = y_train.reset_index(drop=True)
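Because the "functional needs repair" class is small, a stratified split with a fixed seed may give more stable and reproducible numbers. A sketch on toy labels; the 60/10/30 class mix and `test_size=0.2` are assumptions chosen for the example:

```python
import numpy as np
from sklearn.model_selection import train_test_split

features = np.arange(200).reshape(100, 2)
labels = np.array([0] * 60 + [1] * 10 + [2] * 30)

# stratify keeps the class proportions the same in both parts;
# random_state makes the split repeatable.
X_tr, X_te, y_tr, y_te = train_test_split(
    features, labels, test_size=0.2, stratify=labels, random_state=42
)
print(np.bincount(y_tr), np.bincount(y_te))
```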

# Model Analysis Function

In [357]:
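def cval(X, y, cval, estimator, resample = False):
    """Run `cval`-fold cross-validation and average the per-fold results.

    When `resample` is True, SMOTE is applied to the training folds only.
    Rows are selected by position, so X and y need a fresh 0..n-1 index
    (as after reset_index). Returns the per-fold reports and confusion
    matrices, plus their averages.
    """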
    
    reports = []
    matrices = []
    numpy_reports = []
    numpy_matrices = []
    
    report_columns = ['functional', 'functional needs repair', 
                      'non functional', 'accuracy', 'macro avg', 
                      'weighted avg']
    
    report_rows = ['precision', 'recall', 
                   'f1-score', 'support']
    
    matrix_labels = ['functional', 'functional needs repair', 
                     'non functional']
    
    idx = list(X.index)
    np.random.shuffle(idx)
    
    for i in list(range(cval)):
        arrs = np.array_split(idx, cval)
        
        test = arrs.pop(i)
        train = np.concatenate(arrs)
        
        test_x = X.take(test)
        train_x = X.take(train)
        test_y = y.take(test)
        train_y = y.take(train)
            
        if resample:
            strategy = {1: int((len(train_x))/4)}
            smote = SMOTE(sampling_strategy=strategy)
#             smote=SMOTE()
            
            train_x_resampled, train_y_resampled = smote.fit_resample(train_x, train_y)
            model = estimator
            model.fit(train_x_resampled, train_y_resampled)
            
        else:
            model = estimator
            model.fit(train_x, train_y)
            
        preds = model.predict(test_x)
        
        report = pd.DataFrame(classification_report(test_y, preds, output_dict=True))
        reports.append(report)
        numpy_reports.append(np.array(report))
        
        matrix = pd.DataFrame(confusion_matrix(test_y, preds))
        matrices.append(matrix)
        numpy_matrices.append(np.array(matrix))
        
        clear_output(wait=True)
        print(f"Fold #{i+1} out of {cval} done.")
    
    numpy_report = pd.DataFrame(np.sum(numpy_reports, axis=0)/cval, 
                                columns=report_columns, index=report_rows)
    
    numpy_matrix = pd.DataFrame(np.sum(numpy_matrices, axis=0)/cval, 
                                columns=matrix_labels, index=matrix_labels)
    
    print("Analyis complete.")
    
    return reports, matrices, numpy_report, numpy_matrix
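`StratifiedKFold` is imported at the top but not used in `cval`. A possible variant keeps the class balance equal across folds and clones the estimator for each fold. This is a sketch with a hypothetical helper and synthetic data, and it leaves out the SMOTE step:

```python
import numpy as np
from sklearn.base import clone
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import StratifiedKFold

def per_class_recall_cv(estimator, X, y, n_splits=5, seed=0):
    """Average per-class recall over stratified folds (hypothetical helper)."""
    classes = np.unique(y)
    folds = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    recalls = []
    for train_idx, test_idx in folds.split(X, y):
        # clone gives a fresh, unfitted copy for every fold.
        model = clone(estimator).fit(X[train_idx], y[train_idx])
        preds = model.predict(X[test_idx])
        recalls.append(recall_score(y[test_idx], preds, labels=classes, average=None))
    return np.mean(recalls, axis=0)

rng = np.random.default_rng(0)
y_demo = np.repeat([0, 1, 2], [90, 15, 45])          # imbalanced toy labels
X_demo = rng.normal(size=(150, 4)) + y_demo[:, None]  # class-dependent shift
mean_recall = per_class_recall_cv(LogisticRegression(max_iter=1000), X_demo, y_demo)
print(mean_recall)
```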

# Base Model – Logistic Regression, No Regularization

In [358]:
estimator = LogisticRegression(solver='liblinear', fit_intercept=False)
reports, matrices, numpy_report, numpy_matrix = cval(X_train, y_train, 5, estimator, resample=True)

Fold #5 out of 5 done.
Analyis complete.


In [359]:
numpy_report

Unnamed: 0,functional,functional needs repair,non functional,accuracy,macro avg,weighted avg
precision,0.777537,0.229749,0.765024,0.676925,0.59077,0.733797
recall,0.702564,0.568191,0.661085,0.676925,0.643947,0.676925
f1-score,0.738127,0.327046,0.709128,0.676925,0.591433,0.697694
support,4832.6,633.6,3443.8,0.676925,8910.0,8910.0


In [360]:
numpy_matrix

Unnamed: 0,functional,functional needs repair,non functional
functional,3395.2,825.4,612.0
functional needs repair,186.2,359.8,87.6
non functional,785.4,382.0,2276.4


# Second Model – Decision Tree

In [361]:
dtc = DecisionTreeClassifier()

param_grid = {
    'criterion': ['gini', 'entropy'],
    'max_depth': [1, 2, 5, 10],
    'min_samples_split': [5, 10, 20, 40],
    'min_samples_leaf': [5, 10, 20],
    'splitter': ['best', 'random']
}

gs_tree = GridSearchCV(dtc, param_grid, cv=3)
gs_tree.fit(X_train, y_train)
gs_tree.best_params_

{'criterion': 'entropy',
 'max_depth': 10,
 'min_samples_leaf': 5,
 'min_samples_split': 10,
 'splitter': 'random'}
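`GridSearchCV` scores classifiers by accuracy unless told otherwise, which can favor the majority class. A sketch that searches on macro F1 instead and reuses the refit winner through `best_estimator_`; the data is synthetic and the grid is shortened:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic 3-class data with a small middle class.
X_demo, y_demo = make_classification(
    n_samples=300, n_classes=3, n_informative=4,
    weights=[0.6, 0.1, 0.3], random_state=0,
)

search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    {"max_depth": [2, 5, 10], "min_samples_leaf": [5, 10]},
    scoring="f1_macro",
    cv=3,
)
search.fit(X_demo, y_demo)

# best_estimator_ is already refit on all the data with the best parameters.
tuned_tree = search.best_estimator_
print(search.best_params_)
```

Reusing `best_estimator_` also avoids retyping the parameters by hand.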

In [362]:
# Note: these settings differ from gs_tree.best_params_ printed above.
dtc = DecisionTreeClassifier(criterion= 'gini', max_depth= 10, min_samples_split= 5, min_samples_leaf=10, splitter='best')

In [363]:
dtc_reports, dtc_matrices, dtc_numpy_report, dtc_numpy_matrix = cval(X_train, y_train, 5, dtc, resample=True)

Fold #5 out of 5 done.
Analyis complete.


In [364]:
dtc_numpy_matrix

Unnamed: 0,functional,functional needs repair,non functional
functional,3187.2,1241.4,404.0
functional needs repair,158.0,416.6,59.0
non functional,831.0,634.8,1978.0


In [365]:
dtc_numpy_report

Unnamed: 0,functional,functional needs repair,non functional,accuracy,macro avg,weighted avg
precision,0.763689,0.184416,0.814213,0.626465,0.587439,0.742112
recall,0.659656,0.657373,0.574431,0.626465,0.630486,0.626465
f1-score,0.707302,0.286861,0.671567,0.626465,0.555243,0.663576
support,4832.6,633.6,3443.8,0.626465,8910.0,8910.0


# Third Model - K Nearest Neighbors

In [366]:
knn = KNeighborsClassifier(n_neighbors=3)

In [367]:
knn_reports, knn_matrices, knn_numpy_report, knn_numpy_matrix = cval(X_train, y_train, 5, knn, resample=True)

Fold #5 out of 5 done.
Analyis complete.


In [368]:
knn_numpy_matrix

Unnamed: 0,functional,functional needs repair,non functional
functional,3771.2,432.2,629.2
functional needs repair,249.6,294.4,89.6
non functional,681.6,189.2,2573.0


In [369]:
knn_numpy_report

Unnamed: 0,functional,functional needs repair,non functional,accuracy,macro avg,weighted avg
precision,0.801978,0.321727,0.781639,0.745073,0.635115,0.760027
recall,0.780413,0.464659,0.747155,0.745073,0.664076,0.745073
f1-score,0.791015,0.380078,0.763983,0.745073,0.645025,0.75136
support,4832.6,633.6,3443.8,0.745073,8910.0,8910.0


# Fourth Model – Bagging Classifier

In [370]:
bagged_tree = BaggingClassifier(estimator=DecisionTreeClassifier(), n_estimators=100, max_features=50)

In [371]:
bagged_tree_reports, bagged_tree_matrices, bagged_tree_numpy_report, bagged_tree_numpy_matrix = cval(X_train, y_train, 5, bagged_tree, resample=True)

Fold #5 out of 5 done.
Analyis complete.


In [372]:
bagged_tree_numpy_matrix

Unnamed: 0,functional,functional needs repair,non functional
functional,3767.6,507.0,558.0
functional needs repair,196.4,357.4,79.8
non functional,610.2,189.8,2643.8


In [373]:
bagged_tree_numpy_report

Unnamed: 0,functional,functional needs repair,non functional,accuracy,macro avg,weighted avg
precision,0.823592,0.339344,0.805675,0.759686,0.656204,0.782317
recall,0.779577,0.564389,0.767682,0.759686,0.703883,0.759686
f1-score,0.800977,0.423507,0.786205,0.759686,0.67023,0.768488
support,4832.6,633.6,3443.8,0.759686,8910.0,8910.0


# Fifth Model – Random Forest

In [374]:
forest = RandomForestClassifier()

In [375]:
forest_reports, forest_matrices, forest_numpy_report, forest_numpy_matrix = cval(X_train, y_train, 5, forest, resample=True)

Fold #5 out of 5 done.
Analyis complete.


In [376]:
forest_numpy_matrix

Unnamed: 0,functional,functional needs repair,non functional
functional,3940.2,358.2,534.2
functional needs repair,235.6,304.4,93.6
non functional,606.4,134.4,2703.0


In [377]:
forest_numpy_report

Unnamed: 0,functional,functional needs repair,non functional,accuracy,macro avg,weighted avg
precision,0.823907,0.382319,0.811461,0.779753,0.672563,0.787758
recall,0.815323,0.480614,0.784906,0.779753,0.693615,0.779753
f1-score,0.81959,0.425641,0.797942,0.779753,0.681058,0.783235
support,4832.6,633.6,3443.8,0.779753,8910.0,8910.0


# Sixth Model – XGBoost

In [378]:
# Named xgb_clf so the `xgb` module alias imported at the top is not overwritten.
xgb_clf = XGBClassifier()

In [379]:
xgboost_reports, xgboost_matrices, xgboost_numpy_report, xgboost_numpy_matrix = cval(X_train, y_train, 5, xgb_clf, resample=True)

Fold #5 out of 5 done.
Analyis complete.


In [380]:
xgboost_numpy_matrix

Unnamed: 0,functional,functional needs repair,non functional
functional,3677.0,679.0,476.6
functional needs repair,172.8,390.8,70.0
non functional,653.8,267.6,2522.4


In [381]:
xgboost_numpy_report

Unnamed: 0,functional,functional needs repair,non functional,accuracy,macro avg,weighted avg
precision,0.816455,0.292332,0.822015,0.739641,0.643601,0.781342
recall,0.760866,0.617132,0.73238,0.739641,0.703459,0.739641
f1-score,0.787656,0.396626,0.774565,0.739641,0.652949,0.754807
support,4832.6,633.6,3443.8,0.739641,8910.0,8910.0


# Eighth Model – Adaboost Classifier

In [382]:
# Instantiate an AdaBoostClassifier
adaboost_clf = AdaBoostClassifier(estimator=DecisionTreeClassifier(), n_estimators=100, random_state=42)

In [383]:
adaboost_reports, adaboost_matrices, adaboost_numpy_report, adaboost_numpy_matrix = cval(X_train, y_train, 5, adaboost_clf, resample=True)

Fold #5 out of 5 done.
Analyis complete.


In [384]:
adaboost_numpy_matrix

Unnamed: 0,functional,functional needs repair,non functional
functional,3699.4,432.8,700.4
functional needs repair,238.4,288.0,107.2
non functional,642.6,172.6,2628.6


In [385]:
adaboost_numpy_report

Unnamed: 0,functional,functional needs repair,non functional,accuracy,macro avg,weighted avg
precision,0.807667,0.322501,0.764979,0.742536,0.631716,0.756755
recall,0.765544,0.454436,0.76329,0.742536,0.66109,0.742536
f1-score,0.786019,0.376988,0.764114,0.742536,0.642374,0.748514
support,4832.6,633.6,3443.8,0.742536,8910.0,8910.0


# Ninth Model – Gradient Boosting Classifier

In [386]:
# Instantiate a GradientBoostingClassifier
gbt_clf = GradientBoostingClassifier(random_state=42, n_estimators=200, max_features=50)

In [387]:
gbt_reports, gbt_matrices, gbt_numpy_report, gbt_numpy_matrix = cval(X_train, y_train, 5, gbt_clf, resample=True)

Fold #5 out of 5 done.
Analyis complete.


In [388]:
gbt_numpy_matrix

Unnamed: 0,functional,functional needs repair,non functional
functional,3450.0,882.2,500.4
functional needs repair,155.6,405.4,72.6
non functional,718.4,412.4,2313.0


In [389]:
gbt_numpy_report

Unnamed: 0,functional,functional needs repair,non functional,accuracy,macro avg,weighted avg
precision,0.797901,0.238649,0.801559,0.692301,0.612703,0.759627
recall,0.713881,0.63956,0.671689,0.692301,0.675044,0.692301
f1-score,0.753513,0.347452,0.730822,0.692301,0.610595,0.715919
support,4832.6,633.6,3443.8,0.692301,8910.0,8910.0


# Eleventh Model – Extra Randomized Trees

In [390]:
extra_trees = ExtraTreesClassifier(n_estimators=100, random_state=42)

In [391]:
extra_trees_reports, extra_trees_matrices, extra_trees_numpy_report, extra_trees_numpy_matrix = cval(X_train, y_train, 5, extra_trees, resample=True)

Fold #5 out of 5 done.
Analyis complete.


In [392]:
extra_trees_numpy_matrix

Unnamed: 0,functional,functional needs repair,non functional
functional,3911.8,348.2,572.6
functional needs repair,242.0,292.2,99.4
non functional,628.8,133.4,2681.6


In [393]:
extra_trees_numpy_report

Unnamed: 0,functional,functional needs repair,non functional,accuracy,macro avg,weighted avg
precision,0.817914,0.377793,0.799652,0.772795,0.665119,0.779649
recall,0.809432,0.462191,0.778719,0.772795,0.683447,0.772795
f1-score,0.813625,0.415294,0.789015,0.772795,0.672644,0.775805
support,4832.6,633.6,3443.8,0.772795,8910.0,8910.0


# Voting Classifier

In [394]:
X_train_2, X_test_2, y_train_2, y_test_2 = train_test_split(X_train, y_train)

In [395]:
# strategy = {1: int((len(X_train))/4)}
smote = SMOTE()
X_train_resampled, y_train_resampled = smote.fit_resample(X_train, y_train)

In [396]:
vc_1 = BaggingClassifier(estimator=DecisionTreeClassifier(), n_estimators=50, max_features=50).fit(X_train_resampled, y_train_resampled)
vc_2 = XGBClassifier().fit(X_train_resampled, y_train_resampled)
vc_3 = LogisticRegression(solver='liblinear', fit_intercept=False).fit(X_train_resampled, y_train_resampled)

vc_preds_1 = vc_1.predict(X_test)
vc_preds_2 = vc_2.predict(X_test)
vc_preds_3 = vc_3.predict(X_test)

predictions_df = pd.DataFrame({'BaggingClassifier': vc_preds_1, 
                               'LogisticRegression': vc_preds_3, 
                               'XGBoost': vc_preds_2, 
                               'True Values': y_test})

In [397]:
modes = []

for i in range(len(predictions_df)):
    arr = [predictions_df.BaggingClassifier.iloc[i], 
           predictions_df.LogisticRegression.iloc[i], 
           predictions_df.XGBoost.iloc[i]]
    mode = md(arr)
    modes.append(mode)
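Since Python 3.8, `statistics.mode` no longer raises on a tie; it returns the first mode it encounters, so a three-way disagreement here resolves to the first model in `arr`. A vectorized sketch that makes the fallback explicit; the `majority_vote` helper and its `fallback_col` parameter are hypothetical:

```python
import numpy as np

def majority_vote(preds, fallback_col=0, n_classes=3):
    """Row-wise majority vote; rows without a strict majority use one chosen model."""
    preds = np.asarray(preds)
    # counts[i, c] = number of models that predicted class c for row i.
    counts = np.stack([(preds == c).sum(axis=1) for c in range(n_classes)], axis=1)
    winners = counts.argmax(axis=1)
    has_majority = counts.max(axis=1) > preds.shape[1] / 2
    return np.where(has_majority, winners, preds[:, fallback_col])

# Columns: three models' predictions; the second row is a three-way tie.
votes = [[0, 2, 0],
         [0, 1, 2],
         [2, 2, 1]]
voted = majority_vote(votes)
voted_alt = majority_vote(votes, fallback_col=2)
print(voted, voted_alt)
```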

In [398]:
predictions_df['mode'] = modes

In [399]:
predictions_df

Unnamed: 0,BaggingClassifier,LogisticRegression,XGBoost,True Values,mode
37001,0,2,0,0,0
14069,0,0,0,0,0
44177,0,0,0,0,0
37702,0,0,0,0,0
49963,0,2,2,0,2
...,...,...,...,...,...
6486,2,1,1,0,1
58404,0,2,2,0,2
32939,2,0,2,2,2
33267,0,0,0,0,0


In [400]:
pd.DataFrame(classification_report(y_test, predictions_df['mode'], output_dict=True))

Unnamed: 0,0,1,2,accuracy,macro avg,weighted avg
precision,0.821159,0.322788,0.807452,0.743232,0.650466,0.777424
recall,0.765069,0.612707,0.738448,0.743232,0.705408,0.743232
f1-score,0.792122,0.422823,0.77141,0.743232,0.662118,0.75573
support,8096.0,1149.0,5605.0,0.743232,14850.0,14850.0


In [401]:
labels = ['functional', 'functional needs repair', 'non functional']
pd.DataFrame(confusion_matrix(y_test, predictions_df['mode']), columns=labels, index=labels)

Unnamed: 0,functional,functional needs repair,non functional
functional,6194,1060,842
functional needs repair,300,704,145
non functional,1049,417,4139
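One of the ideas at the top is to score models on the training data as well, to see how much they overfit. A sketch of that comparison on synthetic data, using a random forest as a stand-in for any of the models above:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

X_demo, y_demo = make_classification(
    n_samples=400, n_classes=3, n_informative=5, random_state=0
)
X_a, X_b, y_a, y_b = train_test_split(X_demo, y_demo, stratify=y_demo, random_state=0)

model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_a, y_a)

# Score the same fitted model on both splits.
train_f1 = f1_score(y_a, model.predict(X_a), average="macro")
test_f1 = f1_score(y_b, model.predict(X_b), average="macro")
gap = train_f1 - test_f1  # a large gap suggests overfitting
print(train_f1, test_f1, gap)
```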
