## Capstone - NYC Tree Census - Dealing with imbalanced data

### Table of contents
1. [Background](#Background)
     -   1.1 [Data Source](#Data-Source)
     -   1.2 [Objective](#Objective)
     

## 1. Loading and Preparing Data

#### 1.1 Load Libraries

In [20]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import (cross_val_score, train_test_split)

*Successfully loaded all the required libraries*

#### 1.2 Load the cleaned data 

In [2]:
df = pd.read_csv('encoded_data_health.csv')

<div class="alert alert-success">
  <strong>Success!</strong> Successfully loaded the cleaned and encoded file.
</div>

In [3]:
df.head(2)

Unnamed: 0.1,Unnamed: 0,tree_dbh,health,latitude,longitude,x_sp,y_sp,problem_count,curb_loc_OffsetFromCurb,curb_loc_OnCurb,...,trnk_light_No,trnk_light_Yes,trnk_other_No,trnk_other_Yes,brch_light_No,brch_light_Yes,brch_shoe_No,brch_shoe_Yes,brch_other_No,brch_other_Yes
0,0,3,Fair,40.723092,-73.844215,1027431.148,202756.7687,0,0,1,...,1,0,1,0,1,0,1,0,1,0
1,1,21,Fair,40.794111,-73.818679,1034455.701,228644.8374,1,0,1,...,1,0,1,0,1,0,1,0,1,0


In [4]:
df = df.drop('Unnamed: 0', axis=1)


In [5]:
df.head(2)

Unnamed: 0,tree_dbh,health,latitude,longitude,x_sp,y_sp,problem_count,curb_loc_OffsetFromCurb,curb_loc_OnCurb,steward_1or2,...,trnk_light_No,trnk_light_Yes,trnk_other_No,trnk_other_Yes,brch_light_No,brch_light_Yes,brch_shoe_No,brch_shoe_Yes,brch_other_No,brch_other_Yes
0,3,Fair,40.723092,-73.844215,1027431.148,202756.7687,0,0,1,0,...,1,0,1,0,1,0,1,0,1,0
1,21,Fair,40.794111,-73.818679,1034455.701,228644.8374,1,0,1,0,...,1,0,1,0,1,0,1,0,1,0


#### 1.3 Setting variables

In [6]:
# setting X and y variables
y = df['health'].values
X = df.drop('health', axis=1).values

#### 1.4 Setting train and test data 

In [9]:
# train test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42, stratify=y)

print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)

(481791, 36) (481791,)
(160597, 36) (160597,)


## 2. Machine Learning Models - Attempt One 

#### 2.1 Logistic Regression 

In [10]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report,confusion_matrix

lr = LogisticRegression(random_state=42)
lr.fit(X_train, y_train)
y_pred = lr.predict(X_test)
        
# accuracy scores
print( 'Training Set Accuracy Score: ', lr.score(X_train, y_train))
print('Test Set Accuracy Score: ', lr.score(X_test, y_test))
    
# classification metrics
print('Classification Metrics \n')
print(classification_report(y_test, y_pred))

Training Set Accuracy Score:  0.8113559614023508
Test Set Accuracy Score:  0.8113538858135582
Classification Metrics 





              precision    recall  f1-score   support

        Fair       0.00      0.00      0.00     23719
        Good       0.81      1.00      0.90    130301
        Poor       0.00      0.00      0.00      6577

    accuracy                           0.81    160597
   macro avg       0.27      0.33      0.30    160597
weighted avg       0.66      0.81      0.73    160597





In [None]:
# sns.heatmap(confusion_matrix(y_test, y_pred), annot = True,cmap='YlGnBu')

<div class="alert alert-info">
  <strong>Finding!</strong> Logistic Regression reached 81% test accuracy, but only by predicting "Good" for every tree: recall for Fair and Poor is 0.00.
</div>
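To put the 81% in context, the short calculation below (a sketch that only reuses the support counts from the classification report above) computes the accuracy of a rule that always predicts the most frequent class. It comes out at the same value as the Logistic Regression test accuracy, so that model adds nothing over the majority-class baseline.

```python
# Test-set support per class, copied from the classification report above.
support = {'Fair': 23719, 'Good': 130301, 'Poor': 6577}

total = sum(support.values())
majority_class = max(support, key=support.get)

# Accuracy of a rule that always predicts the majority class.
baseline_accuracy = support[majority_class] / total

print(majority_class, total, round(baseline_accuracy, 4))  # Good 160597 0.8114
```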

#### 2.2 K-Nearest Neighbors (KNN)

In [11]:
from sklearn.neighbors import KNeighborsClassifier

In [12]:
knn = KNeighborsClassifier(n_neighbors=5)

In [13]:
knn.fit(X_train, y_train)

KNeighborsClassifier()

In [14]:
y_pred = knn.predict(X_test)

In [15]:
# accuracy scores
print('Training Set Accuracy Score: ', knn.score(X_train, y_train))
print('Test Set Accuracy Score: ', knn.score(X_test, y_test))
    
# classification report
print('Classification Metrics \n')
print(classification_report(y_test, y_pred))

Training Set Accuracy Score:  0.8499183255810092
Test Set Accuracy Score:  0.8010236803925354
Classification Metrics 

              precision    recall  f1-score   support

        Fair       0.40      0.25      0.31     23719
        Good       0.85      0.94      0.89    130301
        Poor       0.38      0.06      0.10      6577

    accuracy                           0.80    160597
   macro avg       0.54      0.42      0.43    160597
weighted avg       0.76      0.80      0.77    160597



<div class="alert alert-info">
  <strong>Finding!</strong> K-Nearest Neighbors reached 80% test accuracy against 85% on the training set, a first sign of overfitting. Recall stays low for Fair (0.25) and Poor (0.06).
</div>
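Because accuracy is dominated by the Good class, it can help to track metrics that weight every class equally. The sketch below uses made-up labels (an assumption for illustration, not notebook output) to compare plain accuracy with scikit-learn's `balanced_accuracy_score` and macro-averaged F1 for a model that always answers "Good":

```python
from sklearn.metrics import accuracy_score, balanced_accuracy_score, f1_score

# Toy labels (hypothetical): 8 Good, 1 Fair, 1 Poor, and a model that always says "Good".
y_true = ['Good'] * 8 + ['Fair', 'Poor']
y_hat = ['Good'] * 10

acc = accuracy_score(y_true, y_hat)
bal_acc = balanced_accuracy_score(y_true, y_hat)  # mean of per-class recall
macro_f1 = f1_score(y_true, y_hat, average='macro', zero_division=0)

print(f'accuracy={acc:.2f}  balanced accuracy={bal_acc:.2f}  macro F1={macro_f1:.2f}')
```

The same two functions could be called on `y_test` and `y_pred` from the cells above to summarize each model in a single number that is sensitive to the minority classes.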

In [None]:
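# Overfitting: reduce model complexity or add regularization.
# Underfitting: use a more expressive model or more informative features; more data alone rarely helps.
#
# The problem can sit at the model level (capacity, regularization) or at the class level (imbalance between health classes).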
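The remark about reducing model complexity can be sketched with a decision tree. The example below runs on synthetic, noisy data from `make_classification` (an assumption for illustration, not the tree census) and compares an unrestricted tree with one capped at `max_depth=4`; the depth value is arbitrary and would normally be tuned.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, noisy data (an assumption for illustration; not the tree census).
X_syn, y_syn = make_classification(
    n_samples=2000, n_features=20, n_informative=5,
    flip_y=0.2, random_state=42,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.25, random_state=42, stratify=y_syn
)

results = {}
for name, depth in [('unrestricted', None), ('max_depth=4', 4)]:
    tree = DecisionTreeClassifier(max_depth=depth, random_state=42).fit(X_tr, y_tr)
    # Store (training accuracy, test accuracy) for each setting.
    results[name] = (tree.score(X_tr, y_tr), tree.score(X_te, y_te))
    print(f'{name:>12}: train={results[name][0]:.3f}  test={results[name][1]:.3f}')
```

An unrestricted tree can memorize the training rows, so a large gap between its training and test accuracy is the usual symptom of overfitting; limiting the depth typically narrows that gap.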

#### 2.3 Decision Tree

In [16]:
from sklearn.tree import DecisionTreeClassifier
decision_tree = DecisionTreeClassifier(random_state=42)
decision_tree.fit(X_train, y_train)
y_pred = decision_tree.predict(X_test)
    
    
# accuracy scores
print('Training Set Accuracy Score: ', decision_tree.score(X_train, y_train))
print('Test Set Accuracy Score: ', decision_tree.score(X_test, y_test))

# classification report
print('Classification Metrics \n')
print(classification_report(y_test, y_pred))

Training Set Accuracy Score:  0.9999854708784515
Test Set Accuracy Score:  0.7415642882494692
Classification Metrics 

              precision    recall  f1-score   support

        Fair       0.30      0.31      0.30     23719
        Good       0.86      0.85      0.85    130301
        Poor       0.17      0.18      0.18      6577

    accuracy                           0.74    160597
   macro avg       0.44      0.45      0.44    160597
weighted avg       0.75      0.74      0.74    160597



#### 2.4 Support Vector Machine

In [None]:
from sklearn import svm
svm_clf = svm.SVC()
svm_clf.fit(X_train, y_train)
y_pred = svm_clf.predict(X_test)

# accuracy scores
print('Training Set Accuracy Score: ', svm_clf.score(X_train, y_train))
print('Test Set Accuracy Score: ', svm_clf.score(X_test, y_test))

# classification report
print('Classification Metrics \n')
print(classification_report(y_test, y_pred))

#### 2.5 Random Forest

In [18]:
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(random_state=42)
rf.fit(X_train, y_train)
y_pred = rf.predict(X_test)
    
    
# accuracy scores
print('Training Set Accuracy Score: ', rf.score(X_train, y_train))
print('Test Set Accuracy Score: ', rf.score(X_test, y_test))
    
# classification report
print('Classification Metrics \n')
print(classification_report(y_test, y_pred))

Training Set Accuracy Score:  0.9999315055698426
Test Set Accuracy Score:  0.814896915882613
Classification Metrics 

              precision    recall  f1-score   support

        Fair       0.47      0.20      0.28     23719
        Good       0.85      0.96      0.90    130301
        Poor       0.39      0.12      0.19      6577

    accuracy                           0.81    160597
   macro avg       0.57      0.43      0.46    160597
weighted avg       0.77      0.81      0.78    160597



#### 2.6 Gradient Boosting Classifier

In [20]:
from sklearn.ensemble import GradientBoostingClassifier
gradient_booster = GradientBoostingClassifier()
gradient_booster.fit(X_train,y_train)
y_pred = gradient_booster.predict(X_test)


# accuracy scores
print('Training Set Accuracy Score: ', gradient_booster.score(X_train, y_train))
print('Test Set Accuracy Score: ', gradient_booster.score(X_test, y_test))
    
# classification report
print('Classification Metrics \n')
print(classification_report(y_test, y_pred))

Training Set Accuracy Score:  0.8128856703425344
Test Set Accuracy Score:  0.81292925770718
Classification Metrics 

              precision    recall  f1-score   support

        Fair       0.51      0.01      0.02     23719
        Good       0.81      1.00      0.90    130301
        Poor       0.48      0.03      0.05      6577

    accuracy                           0.81    160597
   macro avg       0.60      0.35      0.32    160597
weighted avg       0.76      0.81      0.73    160597



## 4. Resampling the data 

In [19]:
# dividing the data by the health status

poor = df[df['health']=='Poor']
fair = df[df['health']=='Fair']
good = df[df['health']=='Good']

In [20]:
# print the shape of each class
print('Poor: ',poor.shape)
print('Fair: ',fair.shape)
print('Good: ',good.shape)

Poor:  (26309, 37)
Fair:  (94874, 37)
Good:  (521205, 37)
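The shapes above can be turned into class shares and imbalance ratios with a few lines of arithmetic. This is a small sketch that only reuses the three row counts printed above:

```python
# Row counts per class, copied from the shapes printed above.
class_counts = {'Poor': 26309, 'Fair': 94874, 'Good': 521205}

total_rows = sum(class_counts.values())
largest = max(class_counts.values())

for cls, n in class_counts.items():
    share = n / total_rows
    ratio = largest / n  # majority-class rows per row of this class
    print(f'{cls}: {n:>7} rows, {share:6.1%} of data, 1 : {ratio:.1f} against Good')
```

Roughly one tree in 24 is labeled Poor, which is why a model can ignore that class entirely and still score about 81% accuracy.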


### 4.1 Oversampling Data

### 4.1.1 Random oversampling using imblearn

In [26]:
# import libraries
from imblearn.over_sampling import RandomOverSampler
from collections import Counter

ros = RandomOverSampler(random_state=42)

# fit predictor and target variables
X_ros, y_ros = ros.fit_resample(X, y)

print('Original dataset shape', Counter(y))
print('Resample dataset shape', Counter(y_ros))

Original dataset shape Counter({'Good': 521205, 'Fair': 94874, 'Poor': 26309})
Resample dataset shape Counter({'Fair': 521205, 'Good': 521205, 'Poor': 521205})


In [27]:
# train test split
X_train_rs, X_test_rs, y_train_rs, y_test_rs = train_test_split(X_ros, y_ros, test_size=0.25, random_state=42)

print(X_train_rs.shape, y_train_rs.shape)
print(X_test_rs.shape, y_test_rs.shape)

(1172711, 36) (1172711,)
(390904, 36) (390904,)
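One caveat about this order of operations: the data is oversampled first and split afterwards. After random oversampling, 494,896 of the 521,205 Poor rows are copies (521,205 minus the 26,309 originals), so copies of the same tree can land in both the training and the test part, and the test scores that follow may be optimistic. A common alternative is to split first and oversample only the training part. The sketch below illustrates that order on made-up data; `random_oversample` is a hypothetical NumPy stand-in for `RandomOverSampler`, written out so the example does not depend on imblearn.

```python
from collections import Counter

import numpy as np
from sklearn.model_selection import train_test_split


def random_oversample(X, y, seed=42):
    """Duplicate rows at random until every class is as large as the biggest one.

    Hand-rolled stand-in for imblearn's RandomOverSampler (illustration only).
    """
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    keep = []
    for cls, count in zip(classes, counts):
        cls_idx = np.flatnonzero(y == cls)
        extra = rng.choice(cls_idx, size=counts.max() - count, replace=True)
        keep.append(np.concatenate([cls_idx, extra]))
    idx = np.concatenate(keep)
    return X[idx], y[idx]


# Toy imbalanced data (hypothetical; stands in for X and y).
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(1000, 4))
y_toy = rng.choice(['Good', 'Fair', 'Poor'], size=1000, p=[0.80, 0.15, 0.05])

# 1) Split first, so the test set keeps the original class mix.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=42, stratify=y_toy
)

# 2) Oversample the training part only.
X_tr_bal, y_tr_bal = random_oversample(X_tr, y_tr)

print('train before:', Counter(y_tr.tolist()))
print('train after: ', Counter(y_tr_bal.tolist()))
print('test (untouched):', Counter(y_te.tolist()))
```

With this order, the test set contains only original rows in their original proportions, so its metrics are comparable with those in Section 2.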


#### 4.1.1a Logistic Regression 

In [32]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report,confusion_matrix

lr_rs = LogisticRegression(random_state=42)
lr_rs.fit(X_train_rs, y_train_rs)
y_pred_rs = lr_rs.predict(X_test_rs)
        
# accuracy scores
print( 'Training Set Accuracy Score: ', lr_rs.score(X_train_rs, y_train_rs))
print('Test Set Accuracy Score: ', lr_rs.score(X_test_rs, y_test_rs))
    
# classification metrics
print('Classification Metrics \n')
print(classification_report(y_test_rs, y_pred_rs))

Training Set Accuracy Score:  0.3325448469401242
Test Set Accuracy Score:  0.33137292020547243
Classification Metrics 

              precision    recall  f1-score   support

        Fair       0.32      0.02      0.03    130385
        Good       0.33      0.55      0.41    129681
        Poor       0.34      0.43      0.38    130838

    accuracy                           0.33    390904
   macro avg       0.33      0.33      0.27    390904
weighted avg       0.33      0.33      0.27    390904
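An accuracy of 33% on three balanced classes is what random guessing would give. One possible contributor, not verified in this notebook, is that the features sit on very different scales (for example `x_sp` near 1e6 next to 0/1 dummy columns), which can keep the solver from converging. The sketch below, on synthetic data with deliberately mismatched scales (all values are assumptions for illustration), shows how a `StandardScaler` can be placed in front of `LogisticRegression` in a pipeline:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (hypothetical): the label depends on two standard-normal signals.
rng = np.random.default_rng(42)
signal = rng.normal(size=(2000, 3))
y_syn = (signal[:, 0] + signal[:, 1] > 0).astype(int)

# Stretch the columns onto very different scales, loosely like x_sp next to 0/1 dummies.
X_syn = signal * np.array([1e4, 1.0, 1e-2]) + np.array([1e6, 0.0, 0.0])

X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.25, random_state=42, stratify=y_syn
)

# Standardize inside the pipeline so the scaler is fitted on training data only.
scaled_lr = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_lr.fit(X_tr, y_tr)
scaled_test_acc = scaled_lr.score(X_te, y_te)
print('Scaled pipeline test accuracy:', round(scaled_test_acc, 3))
```

Whether scaling changes the census results would need to be checked by rerunning the cell above with such a pipeline; the same consideration applies to K-Nearest Neighbors, which relies on distances.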



#### 4.1.1b K-Nearest Neighbour

In [None]:
from sklearn.neighbors import KNeighborsClassifier
knn_rs = KNeighborsClassifier(n_neighbors=5)
knn_rs.fit(X_train_rs, y_train_rs)
y_pred_rs = knn_rs.predict(X_test_rs)
    

# accuracy scores
print('Accuracy Score, Training Set: ', knn_rs.score(X_train_rs, y_train_rs))
print('Accuracy Score, Test Set: ', knn_rs.score(X_test_rs, y_test_rs))
    
# classification report
print('Classification Metrics \n')
print(classification_report(y_test_rs, y_pred_rs))

#### 4.1.1c Decision Tree

In [34]:
from sklearn.tree import DecisionTreeClassifier
decision_tree_rs = DecisionTreeClassifier(random_state=42)
decision_tree_rs.fit(X_train_rs, y_train_rs)
y_pred_rs = decision_tree_rs.predict(X_test_rs)
    
    
# accuracy scores
print('Training Set Accuracy Score: ', decision_tree_rs.score(X_train_rs, y_train_rs))
print('Test Set Accuracy Score: ', decision_tree_rs.score(X_test_rs, y_test_rs))

# classification report
print('Classification Metrics \n')
print(classification_report(y_test_rs, y_pred_rs))

Training Set Accuracy Score:  0.9999855036748184
Test Set Accuracy Score:  0.9384733847696621
Classification Metrics 

              precision    recall  f1-score   support

        Fair       0.88      0.99      0.93    130385
        Good       0.99      0.82      0.90    129681
        Poor       0.97      1.00      0.98    130838

    accuracy                           0.94    390904
   macro avg       0.94      0.94      0.94    390904
weighted avg       0.94      0.94      0.94    390904



#### 4.1.1d Random Forest 

In [36]:
from sklearn.ensemble import RandomForestClassifier
rf_rs = RandomForestClassifier(random_state=42)
rf_rs.fit(X_train_rs, y_train_rs)
y_pred_rs = rf_rs.predict(X_test_rs)
    
    
# accuracy scores
print('Training Set Accuracy Score: ', rf_rs.score(X_train_rs, y_train_rs))
print('Test Set Accuracy Score: ', rf_rs.score(X_test_rs, y_test_rs))
    
# classification report
print('Classification Metrics \n')
print(classification_report(y_test_rs, y_pred_rs))

Training Set Accuracy Score:  0.9999735655246689
Test Set Accuracy Score:  0.9596115670343613
Classification Metrics 

              precision    recall  f1-score   support

        Fair       0.91      0.99      0.95    130385
        Good       0.99      0.89      0.94    129681
        Poor       0.98      1.00      0.99    130838

    accuracy                           0.96    390904
   macro avg       0.96      0.96      0.96    390904
weighted avg       0.96      0.96      0.96    390904



#### 4.1.1e Gradient Boosting 

In [37]:
from sklearn.ensemble import GradientBoostingClassifier
gradient_booster_rs = GradientBoostingClassifier()
gradient_booster_rs.fit(X_train_rs,y_train_rs)
y_pred_rs = gradient_booster_rs.predict(X_test_rs)


# accuracy scores
print('Training Set Accuracy Score: ', gradient_booster_rs.score(X_train_rs, y_train_rs))
print('Test Set Accuracy Score: ', gradient_booster_rs.score(X_test_rs, y_test_rs))
    
# classification report
print('Classification Metrics \n')
print(classification_report(y_test_rs, y_pred_rs))

Training Set Accuracy Score:  0.45321822682655827
Test Set Accuracy Score:  0.4528554325358656
Classification Metrics 

              precision    recall  f1-score   support

        Fair       0.42      0.30      0.35    130385
        Good       0.45      0.59      0.51    129681
        Poor       0.48      0.47      0.48    130838

    accuracy                           0.45    390904
   macro avg       0.45      0.45      0.45    390904
weighted avg       0.45      0.45      0.45    390904



### 4.1.2 Synthetic Minority Oversampling Technique (SMOTE)

In [8]:
from imblearn.over_sampling import SMOTE
from collections import Counter

smote = SMOTE()

# fit target and predictor variable
X_smote , y_smote = smote.fit_resample(X, y)

print('Original dataset shape:', Counter(y))
print('Resampled dataset shape:', Counter(y_smote))

Original dataset shape: Counter({'Good': 521205, 'Fair': 94874, 'Poor': 26309})
Resampled dataset shape: Counter({'Fair': 521205, 'Good': 521205, 'Poor': 521205})
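Unlike random oversampling, SMOTE does not copy rows. It creates new minority rows by interpolating between an existing minority row and one of its nearest minority-class neighbours. The function below is a simplified NumPy illustration of that idea, not imblearn's implementation; `smote_like`, its parameters, and the toy data are all hypothetical.

```python
import numpy as np


def smote_like(X_min, n_new, k=5, seed=42):
    """Create n_new synthetic rows by interpolating between minority-class neighbours.

    Simplified illustration of the SMOTE idea, not imblearn's implementation.
    """
    rng = np.random.default_rng(seed)
    # Pairwise Euclidean distances within the minority class.
    dist = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)                # a point is not its own neighbour
    neighbours = np.argsort(dist, axis=1)[:, :k]  # k nearest neighbours per row

    base = rng.integers(0, len(X_min), size=n_new)  # random starting rows
    partner = neighbours[base, rng.integers(0, k, size=n_new)]
    lam = rng.random((n_new, 1))                    # position along the segment
    return X_min[base] + lam * (X_min[partner] - X_min[base])


# Toy minority class (hypothetical values).
rng = np.random.default_rng(0)
X_poor = rng.normal(loc=[2.0, -1.0], scale=0.5, size=(30, 2))
X_new = smote_like(X_poor, n_new=90)
print(X_new.shape)  # (90, 2)
```

Every synthetic row lies on a line segment between two real minority rows. For the 0/1 dummy columns in this dataset, that interpolation can produce fractional values that no real tree has.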


In [9]:
# train test split
X_train_smote, X_test_smote, y_train_smote, y_test_smote = train_test_split(X_smote, y_smote, test_size=0.25, random_state=42)

print(X_train_smote.shape, y_train_smote.shape)
print(X_test_smote.shape, y_test_smote.shape)

(1172711, 36) (1172711,)
(390904, 36) (390904,)


#### 4.1.2a Logistic Regression

In [10]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report,confusion_matrix

lr_smote = LogisticRegression(random_state=42)
lr_smote.fit(X_train_smote, y_train_smote)
y_pred_smote = lr_smote.predict(X_test_smote)
        
# accuracy scores
print( 'Training Set Accuracy Score: ', lr_smote.score(X_train_smote, y_train_smote))
print('Test Set Accuracy Score: ', lr_smote.score(X_test_smote, y_test_smote))
    
# classification metrics
print('Classification Metrics \n')
print(classification_report(y_test_smote, y_pred_smote))

Training Set Accuracy Score:  0.33286376609411866
Test Set Accuracy Score:  0.33113245195751384
Classification Metrics 





              precision    recall  f1-score   support

        Fair       0.00      0.00      0.00    130385
        Good       0.33      0.54      0.41    129681
        Poor       0.34      0.46      0.39    130838

    accuracy                           0.33    390904
   macro avg       0.22      0.33      0.26    390904
weighted avg       0.22      0.33      0.26    390904





#### 4.1.2b K-Nearest Neighbour

In [None]:
from sklearn.neighbors import KNeighborsClassifier
knn_smote = KNeighborsClassifier(n_neighbors=5)
knn_smote.fit(X_train_smote, y_train_smote)
y_pred_smote = knn_smote.predict(X_test_smote)
    

# accuracy scores
print('Accuracy Score, Training Set: ', knn_smote.score(X_train_smote, y_train_smote))
print('Accuracy Score, Test Set: ', knn_smote.score(X_test_smote, y_test_smote))
    
# classification report
print('Classification Metrics \n')
print(classification_report(y_test_smote, y_pred_smote))

#### 4.1.2c Decision Tree

In [12]:
from sklearn.tree import DecisionTreeClassifier
decision_tree_smote = DecisionTreeClassifier(random_state=42)
decision_tree_smote.fit(X_train_smote, y_train_smote)
y_pred_smote = decision_tree_smote.predict(X_test_smote)
    
    
# accuracy scores
print('Training Set Accuracy Score: ', decision_tree_smote.score(X_train_smote, y_train_smote))
print('Test Set Accuracy Score: ', decision_tree_smote.score(X_test_smote, y_test_smote))

# classification report
print('Classification Metrics \n')
print(classification_report(y_test_smote, y_pred_smote))

Training Set Accuracy Score:  0.9999940309249252
Test Set Accuracy Score:  0.8016981151382436
Classification Metrics 

              precision    recall  f1-score   support

        Fair       0.74      0.73      0.73    130385
        Good       0.85      0.84      0.84    129681
        Poor       0.82      0.84      0.83    130838

    accuracy                           0.80    390904
   macro avg       0.80      0.80      0.80    390904
weighted avg       0.80      0.80      0.80    390904



#### 4.1.2d Random Forest

In [13]:
from sklearn.ensemble import RandomForestClassifier
rf_smote = RandomForestClassifier(random_state=42)
rf_smote.fit(X_train_smote, y_train_smote)
y_pred_smote = rf_smote.predict(X_test_smote)
    
    
# accuracy scores
print('Training Set Accuracy Score: ', rf_smote.score(X_train_smote, y_train_smote))
print('Test Set Accuracy Score: ', rf_smote.score(X_test_smote, y_test_smote))
    
# classification report
print('Classification Metrics \n')
print(classification_report(y_test_smote, y_pred_smote))

Training Set Accuracy Score:  0.9999803873247544
Test Set Accuracy Score:  0.8741097558479831
Classification Metrics 

              precision    recall  f1-score   support

        Fair       0.87      0.79      0.83    130385
        Good       0.85      0.93      0.89    129681
        Poor       0.90      0.91      0.90    130838

    accuracy                           0.87    390904
   macro avg       0.87      0.87      0.87    390904
weighted avg       0.87      0.87      0.87    390904



#### 4.1.2e Gradient Boosting

In [15]:
from sklearn.ensemble import GradientBoostingClassifier
gradient_booster_smote = GradientBoostingClassifier()
gradient_booster_smote.fit(X_train_smote,y_train_smote)
y_pred_smote = gradient_booster_smote.predict(X_test_smote)


# accuracy scores
print('Training Set Accuracy Score: ', gradient_booster_smote.score(X_train_smote, y_train_smote))
print('Test Set Accuracy Score: ', gradient_booster_smote.score(X_test_smote, y_test_smote))
    
# classification report
print('Classification Metrics \n')
print(classification_report(y_test_smote, y_pred_smote))

Training Set Accuracy Score:  0.6226657718738888
Test Set Accuracy Score:  0.6214825123303931
Classification Metrics 

              precision    recall  f1-score   support

        Fair       0.57      0.33      0.41    130385
        Good       0.64      0.99      0.78    129681
        Poor       0.62      0.55      0.59    130838

    accuracy                           0.62    390904
   macro avg       0.61      0.62      0.59    390904
weighted avg       0.61      0.62      0.59    390904



### 4.1.3 BorderlineSMOTE

In [18]:
from imblearn.over_sampling import BorderlineSMOTE

bsmote = BorderlineSMOTE(random_state=42)

# fit target and predictor variable
X_bsmote , y_bsmote = bsmote.fit_resample(X, y)

print('Original dataset shape:', Counter(y))
print('Resampled dataset shape:', Counter(y_bsmote))

Original dataset shape: Counter({'Good': 521205, 'Fair': 94874, 'Poor': 26309})
Resampled dataset shape: Counter({'Fair': 521205, 'Good': 521205, 'Poor': 521205})


In [19]:
# train test split
X_train_bsmote, X_test_bsmote, y_train_bsmote, y_test_bsmote = train_test_split(X_bsmote, y_bsmote, test_size=0.25, random_state=42)

print(X_train_bsmote.shape, y_train_bsmote.shape)
print(X_test_bsmote.shape, y_test_bsmote.shape) 

(1172711, 36) (1172711,)
(390904, 36) (390904,)


#### 4.1.3a Logistic Regression

In [20]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report,confusion_matrix

lr_bsmote = LogisticRegression(random_state=42)
lr_bsmote.fit(X_train_bsmote, y_train_bsmote)
y_pred_bsmote = lr_bsmote.predict(X_test_bsmote)
        
# accuracy scores
print( 'Training Set Accuracy Score: ', lr_bsmote.score(X_train_bsmote, y_train_bsmote))
print('Test Set Accuracy Score: ', lr_bsmote.score(X_test_bsmote, y_test_bsmote))
    
# classification metrics
print('Classification Metrics \n')
print(classification_report(y_test_bsmote, y_pred_bsmote))

Training Set Accuracy Score:  0.33924726552407203
Test Set Accuracy Score:  0.3388888320406033
Classification Metrics 

              precision    recall  f1-score   support

        Fair       0.34      0.40      0.37    130385
        Good       0.33      0.15      0.21    129681
        Poor       0.34      0.46      0.39    130838

    accuracy                           0.34    390904
   macro avg       0.34      0.34      0.32    390904
weighted avg       0.34      0.34      0.32    390904



#### 4.1.3b K-Nearest Neighbour

In [None]:
from sklearn.neighbors import KNeighborsClassifier
knn_bsmote = KNeighborsClassifier(n_neighbors=5)
knn_bsmote.fit(X_train_bsmote, y_train_bsmote)
y_pred_bsmote = knn_bsmote.predict(X_test_bsmote)
    

# accuracy scores
print('Accuracy Score, Training Set: ', knn_bsmote.score(X_train_bsmote, y_train_bsmote))
print('Accuracy Score, Test Set: ', knn_bsmote.score(X_test_bsmote, y_test_bsmote))
    
# classification report
print('Classification Metrics \n')
print(classification_report(y_test_bsmote, y_pred_bsmote))

#### 4.1.3c Decision Tree

In [22]:
from sklearn.tree import DecisionTreeClassifier
decision_tree_bsmote = DecisionTreeClassifier(random_state=42)
decision_tree_bsmote.fit(X_train_bsmote, y_train_bsmote)
y_pred_bsmote = decision_tree_bsmote.predict(X_test_bsmote)
    
    
# accuracy scores
print('Training Set Accuracy Score: ', decision_tree_bsmote.score(X_train_bsmote, y_train_bsmote))
print('Test Set Accuracy Score: ', decision_tree_bsmote.score(X_test_bsmote, y_test_bsmote))

# classification report
print('Classification Metrics \n')
print(classification_report(y_test_bsmote, y_pred_bsmote))

Training Set Accuracy Score:  0.9999940309249252
Test Set Accuracy Score:  0.8332045719665186
Classification Metrics 

              precision    recall  f1-score   support

        Fair       0.79      0.77      0.78    130385
        Good       0.85      0.84      0.84    129681
        Poor       0.86      0.89      0.87    130838

    accuracy                           0.83    390904
   macro avg       0.83      0.83      0.83    390904
weighted avg       0.83      0.83      0.83    390904



#### 4.1.3d Random Forest

In [23]:
from sklearn.ensemble import RandomForestClassifier
rf_bsmote = RandomForestClassifier(random_state=42)
rf_bsmote.fit(X_train_bsmote, y_train_bsmote)
y_pred_bsmote = rf_bsmote.predict(X_test_bsmote)
    
    
# accuracy scores
print('Training Set Accuracy Score: ', rf_bsmote.score(X_train_bsmote, y_train_bsmote))
print('Test Set Accuracy Score: ', rf_bsmote.score(X_test_bsmote, y_test_bsmote))
    
# classification report
print('Classification Metrics \n')
print(classification_report(y_test_bsmote, y_pred_bsmote))

Training Set Accuracy Score:  0.9999778291497223
Test Set Accuracy Score:  0.8986119354112518
Classification Metrics 

              precision    recall  f1-score   support

        Fair       0.90      0.83      0.86    130385
        Good       0.86      0.93      0.90    129681
        Poor       0.94      0.93      0.94    130838

    accuracy                           0.90    390904
   macro avg       0.90      0.90      0.90    390904
weighted avg       0.90      0.90      0.90    390904



#### 4.1.3e Gradient Boosting 

In [24]:
from sklearn.ensemble import GradientBoostingClassifier
gradient_booster_bsmote = GradientBoostingClassifier()
gradient_booster_bsmote.fit(X_train_bsmote,y_train_bsmote)
y_pred_bsmote = gradient_booster_bsmote.predict(X_test_bsmote)


# accuracy scores
print('Training Set Accuracy Score: ', gradient_booster_bsmote.score(X_train_bsmote, y_train_bsmote))
print('Test Set Accuracy Score: ', gradient_booster_bsmote.score(X_test_bsmote, y_test_bsmote))
    
# classification report
print('Classification Metrics \n')
print(classification_report(y_test_bsmote, y_pred_bsmote))

Training Set Accuracy Score:  0.6325744364979948
Test Set Accuracy Score:  0.6301649509854081
Classification Metrics 

              precision    recall  f1-score   support

        Fair       0.58      0.36      0.44    130385
        Good       0.65      0.99      0.78    129681
        Poor       0.63      0.55      0.59    130838

    accuracy                           0.63    390904
   macro avg       0.62      0.63      0.60    390904
weighted avg       0.62      0.63      0.60    390904



### 4.1.4 SVMSMOTE

In [21]:
from imblearn.over_sampling import SVMSMOTE
from collections import Counter

# named svm_smote so it does not shadow the `svm` module imported in section 2.4
svm_smote = SVMSMOTE(random_state=42)

# fit target and predictor variable
X_svm, y_svm = svm_smote.fit_resample(X, y)

print('Original dataset shape:', Counter(y))
print('Resampled dataset shape:', Counter(y_svm))

KeyboardInterrupt: 

In [None]:
# train test split
X_train_svm, X_test_svm, y_train_svm, y_test_svm = train_test_split(X_svm, y_svm, test_size=0.25, random_state=42)

print(X_train_svm.shape, y_train_svm.shape)
print(X_test_svm.shape, y_test_svm.shape)

#### 4.1.4a Logistic Regression

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report,confusion_matrix

lr_svm = LogisticRegression(random_state=42)
lr_svm.fit(X_train_svm, y_train_svm)
y_pred_svm = lr_svm.predict(X_test_svm)
        
# accuracy scores
print( 'Training Set Accuracy Score: ', lr_svm.score(X_train_svm, y_train_svm))
print('Test Set Accuracy Score: ', lr_svm.score(X_test_svm, y_test_svm))
    
# classification metrics
print('Classification Metrics \n')
print(classification_report(y_test_svm, y_pred_svm))

#### 4.1.4b K-Nearest Neighbour

In [None]:
from sklearn.neighbors import KNeighborsClassifier
knn_svm = KNeighborsClassifier(n_neighbors=5)
knn_svm.fit(X_train_svm, y_train_svm)
y_pred_svm = knn_svm.predict(X_test_svm)
    

# accuracy scores
print('Accuracy Score, Training Set: ', knn_svm.score(X_train_svm, y_train_svm))
print('Accuracy Score, Test Set: ', knn_svm.score(X_test_svm, y_test_svm))
    
# classification report
print('Classification Metrics \n')
print(classification_report(y_test_svm, y_pred_svm))

#### 4.1.4c Decision Tree

In [None]:
from sklearn.tree import DecisionTreeClassifier
decision_tree_svm = DecisionTreeClassifier(random_state=42)
decision_tree_svm.fit(X_train_svm, y_train_svm)
y_pred_svm = decision_tree_svm.predict(X_test_svm)
    
    
# accuracy scores
print('Training Set Accuracy Score: ', decision_tree_svm.score(X_train_svm, y_train_svm))
print('Test Set Accuracy Score: ', decision_tree_svm.score(X_test_svm, y_test_svm))

# classification report
print('Classification Metrics \n')
print(classification_report(y_test_svm, y_pred_svm))

#### 4.1.4d Random Forest

In [None]:
from sklearn.ensemble import RandomForestClassifier
rf_svm = RandomForestClassifier(random_state=42)
rf_svm.fit(X_train_svm, y_train_svm)
y_pred_svm = rf_svm.predict(X_test_svm)
    
    
# accuracy scores
print('Training Set Accuracy Score: ', rf_svm.score(X_train_svm, y_train_svm))
print('Test Set Accuracy Score: ', rf_svm.score(X_test_svm, y_test_svm))
    
# classification report
print('Classification Metrics \n')
print(classification_report(y_test_svm, y_pred_svm))

#### 4.1.4e Gradient Boosting

In [None]:
from sklearn.ensemble import GradientBoostingClassifier
gradient_booster_svm = GradientBoostingClassifier()
gradient_booster_svm.fit(X_train_svm,y_train_svm)
y_pred_svm = gradient_booster_svm.predict(X_test_svm)


# accuracy scores
print('Training Set Accuracy Score: ', gradient_booster_svm.score(X_train_svm, y_train_svm))
print('Test Set Accuracy Score: ', gradient_booster_svm.score(X_test_svm, y_test_svm))
    
# classification report
print('Classification Metrics \n')
print(classification_report(y_test_svm, y_pred_svm))

### 4.1.5 ADASYN

In [8]:
from imblearn.over_sampling import ADASYN
from collections import Counter

ada = ADASYN(random_state=42)

# fit target and predictor variable
X_ada , y_ada = ada.fit_resample(X, y)

print('Original dataset shape:', Counter(y))
print('Resampled dataset shape:', Counter(y_ada))

Original dataset shape: Counter({'Good': 521205, 'Fair': 94874, 'Poor': 26309})
Resampled dataset shape: Counter({'Good': 521205, 'Poor': 518554, 'Fair': 517354})


In [9]:
# train test split
X_train_ada, X_test_ada, y_train_ada, y_test_ada = train_test_split(X_ada, y_ada, test_size=0.25, random_state=42)

print(X_train_ada.shape, y_train_ada.shape)
print(X_test_ada.shape, y_test_ada.shape)

(1167834, 36) (1167834,)
(389279, 36) (389279,)


#### 4.1.5a Logistic Regression 

In [12]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report,confusion_matrix

lr_ada = LogisticRegression(random_state=42)
lr_ada.fit(X_train_ada, y_train_ada)
y_pred_ada = lr_ada.predict(X_test_ada)
        
# accuracy scores
print( 'Training Set Accuracy Score: ', lr_ada.score(X_train_ada, y_train_ada))
print('Test Set Accuracy Score: ', lr_ada.score(X_test_ada, y_test_ada))
    
# classification metrics
print('Classification Metrics \n')
print(classification_report(y_test_ada, y_pred_ada))

Training Set Accuracy Score:  0.33718490812906626
Test Set Accuracy Score:  0.33577459868115156
Classification Metrics 





              precision    recall  f1-score   support

        Fair       0.33      0.22      0.26    129890
        Good       0.34      0.79      0.47    129552
        Poor       0.00      0.00      0.00    129837

    accuracy                           0.34    389279
   macro avg       0.22      0.34      0.25    389279
weighted avg       0.22      0.34      0.25    389279





#### 4.1.5b K-Nearest Neighbour

In [None]:
from sklearn.neighbors import KNeighborsClassifier
knn_ada = KNeighborsClassifier(n_neighbors=5)
knn_ada.fit(X_train_ada, y_train_ada)
y_pred_ada = knn_ada.predict(X_test_ada)
    

# accuracy scores
print('Accuracy Score, Training Set: ', knn_ada.score(X_train_ada, y_train_ada))
print('Accuracy Score, Test Set: ', knn_ada.score(X_test_ada, y_test_ada))
    
# classification report
print('Classification Metrics \n')
print(classification_report(y_test_ada, y_pred_ada))

#### 4.1.5c Decision Tree

In [14]:
from sklearn.tree import DecisionTreeClassifier
decision_tree_ada = DecisionTreeClassifier(random_state=42)
decision_tree_ada.fit(X_train_ada, y_train_ada)
y_pred_ada = decision_tree_ada.predict(X_test_ada)
    
    
# accuracy scores
print('Training Set Accuracy Score: ', decision_tree_ada.score(X_train_ada, y_train_ada))
print('Test Set Accuracy Score: ', decision_tree_ada.score(X_test_ada, y_test_ada))

# classification report
print('Classification Metrics \n')
print(classification_report(y_test_ada, y_pred_ada))

Training Set Accuracy Score:  0.9999922934252642
Test Set Accuracy Score:  0.7933512981691795
Classification Metrics 

              precision    recall  f1-score   support

        Fair       0.73      0.71      0.72    129890
        Good       0.85      0.84      0.84    129552
        Poor       0.80      0.83      0.82    129837

    accuracy                           0.79    389279
   macro avg       0.79      0.79      0.79    389279
weighted avg       0.79      0.79      0.79    389279



#### 4.1.5d Random Forest 

In [15]:
from sklearn.ensemble import RandomForestClassifier
rf_ada = RandomForestClassifier(random_state=42)
rf_ada.fit(X_train_ada, y_train_ada)
y_pred_ada = rf_ada.predict(X_test_ada)
    
    
# accuracy scores
print('Training Set Accuracy Score: ', rf_ada.score(X_train_ada, y_train_ada))
print('Test Set Accuracy Score: ', rf_ada.score(X_test_ada, y_test_ada))
    
# classification report
print('Classification Metrics \n')
print(classification_report(y_test_ada, y_pred_ada))

Training Set Accuracy Score:  0.9999837305644467
Test Set Accuracy Score:  0.8688318660909014
Classification Metrics 

              precision    recall  f1-score   support

        Fair       0.87      0.77      0.82    129890
        Good       0.85      0.93      0.89    129552
        Poor       0.89      0.91      0.90    129837

    accuracy                           0.87    389279
   macro avg       0.87      0.87      0.87    389279
weighted avg       0.87      0.87      0.87    389279



#### 4.1.5e Gradient Boosting 

In [16]:
from sklearn.ensemble import GradientBoostingClassifier
gradient_booster_ada = GradientBoostingClassifier()
gradient_booster_ada.fit(X_train_ada,y_train_ada)
y_pred_ada = gradient_booster_ada.predict(X_test_ada)


# accuracy scores
print('Training Set Accuracy Score: ', gradient_booster_ada.score(X_train_ada, y_train_ada))
print('Test Set Accuracy Score: ', gradient_booster_ada.score(X_test_ada, y_test_ada))
    
# classification report
print('Classification Metrics \n')
print(classification_report(y_test_ada, y_pred_ada))

Training Set Accuracy Score:  0.6247274869544815
Test Set Accuracy Score:  0.6226639505341927
Classification Metrics 

*The report printed in the recorded run repeated the Random Forest figures from section 4.1.5d (87% accuracy), which contradicts the 62% test accuracy above, so it is omitted here. Rerun the cell to get the Gradient Boosting report.*

