# General Overview - Machine Learning

The goal of our machine learning model is to predict a tree's health from a set of independent variables. This is a multiclass classification problem: the target takes one of three categorical values (Poor, Fair, or Good). We treat these classes as nominal, meaning that we do not rely on any intrinsic order among them, unlike ordinal variables. To measure each model's success, we rely on a [classification report](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html), which shows the main classification metrics: precision, recall, and f1-score.

Our algorithms of choice are: [LogisticRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html), [KNeighborsClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html), [DecisionTreeClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html), [RandomForestClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html), and [Gaussian Naive Bayes](https://scikit-learn.org/stable/modules/naive_bayes.html#gaussian-naive-bayes). To prepare our data, we separate our target variable, y (tree health), from the independent variables, X. Next, we split X and y into training and test sets: we train on 75% of the data and test on the remaining 25%. After splitting, we are ready to begin testing the models.

Because the classes in our data are heavily imbalanced, we also apply undersampling and oversampling methods in an effort to improve our precision and recall scores and find the best possible model. Those techniques live in two separate notebooks, one for undersampling and one for oversampling, because of the number of methods used and the length of the notebooks.

In [1]:
import numpy as np
import pandas as pd
import sklearn
from sklearn import datasets
from sklearn import metrics
from collections import Counter

from sklearn.model_selection import (KFold, 
                                     cross_val_score, 
                                     GridSearchCV, 
                                     train_test_split)
from sklearn.metrics import (classification_report,
                             confusion_matrix)

from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB

In [2]:
np.random.seed(42)

In [3]:
data = pd.read_csv('tree_ml.csv', index_col=0) # import data
tree = data.copy() # save a copy of data as tree

In [4]:
tree.head()

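| | health | health_l | num_problems | tree_dbh | root_stone_l | root_grate_l | root_other_l | trunk_wire_l | trnk_light_l | trnk_other_l | ... | OnCurb | Harmful | Helpful | Unsure | Damage | Bronx | Brooklyn | Manhattan | Queens | Staten Island |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Fair | 1 | 0 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 1 | Fair | 1 | 1 | 21 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 |
| 2 | Good | 2 | 0 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 |
| 3 | Good | 2 | 1 | 10 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 |
| 4 | Good | 2 | 1 | 21 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 |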


In [5]:
tree.shape

(651535, 26)

# Machine Learning Models

We start by splitting our data into training and testing sets in a stratified fashion, so that both sets have the same class proportions as the full dataset. 75% of our data is used to train the models while the remaining 25% is used for testing.

## Separate Variables Using Train Test Split

In [6]:
tree_ml = tree.drop(columns='health_l') # drop the numeric label; keep the categorical 'health' column

In [7]:
y = tree_ml['health'].values # target variable
X = tree_ml.drop('health', axis=1).values 

# train test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42, stratify=y)

print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)

(488651, 24) (488651,)
(162884, 24) (162884,)
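
To confirm that `stratify=y` preserved the class balance, the label proportions of the two splits can be compared. The sketch below is only an illustration: `X_demo`, `y_demo`, and `class_shares` are hypothetical names, and the small synthetic label array stands in for the real `health` column.

```python
from collections import Counter

import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced labels standing in for the real 'health' column (assumption).
y_demo = np.array(['Good'] * 800 + ['Fair'] * 150 + ['Poor'] * 50)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)  # placeholder feature

_, _, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42, stratify=y_demo
)

def class_shares(labels):
    """Return each class's share of the given label array."""
    counts = Counter(labels)
    total = len(labels)
    return {cls: n / total for cls, n in sorted(counts.items())}

train_shares = class_shares(y_tr)
test_shares = class_shares(y_te)
print('train:', train_shares)
print('test: ', test_shares)
```

With stratification, the shares in the two dictionaries should differ only by rounding. The same helper could be applied to `y_train` and `y_test` from the cell above.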


## Baseline - DummyClassifier

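We start with the DummyClassifier, which makes predictions using simple rules that ignore the input features. The `stratified` strategy guesses at random according to the class distribution of the training set, and the `most_frequent` strategy always predicts the most common class. These scores are the baseline that the other models need to beat.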

In [8]:
stratified = DummyClassifier(strategy='stratified').fit(X_train, y_train)
dc_pred = stratified.predict(X_test)

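# Note: this scores on the full dataset (X, y), which includes the training rows;
# score on (X_test, y_test) for a held-out estimate.
print('Accuracy Score: ', stratified.score(X, y))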

Accuracy Score:  0.68159346773389


In [9]:
frequent = DummyClassifier(strategy='most_frequent').fit(X_train, y_train)
mfreq_pred = frequent.predict(X_test)

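# Note: this scores on the full dataset (X, y), which includes the training rows;
# score on (X_test, y_test) for a held-out estimate.
print('Accuracy Score: ', frequent.score(X, y))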

Accuracy Score:  0.8108958075928384
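
Both baseline scores follow directly from the class distribution. As a rough check, the sketch below recomputes them from the test-set class counts that appear in the `support` column of the classification reports later in this notebook. The variable names are illustrative only.

```python
# Test-set class counts, taken from the 'support' column of the classification reports.
support = {'Fair': 24107, 'Good': 132082, 'Poor': 6695}
total = sum(support.values())

# Always predicting the majority class is correct exactly as often as that class occurs.
majority_class = max(support, key=support.get)
majority_share = support[majority_class] / total

# A 'stratified' dummy guesses class k with probability p_k, independently of the
# true label, so its expected accuracy is the sum of p_k squared.
expected_stratified = sum((n / total) ** 2 for n in support.values())

print(majority_class, round(majority_share, 4))
print(round(expected_stratified, 4))
```

The two values come out at roughly 0.81 and 0.68, in line with the scores printed above. Any model whose accuracy sits near 0.81 is therefore doing little better than always predicting Good.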


## Logistic Regression

In [10]:
logreg = LogisticRegression(random_state=42)
logreg.fit(X_train, y_train)
logreg_pred = logreg.predict(X_test)

# accuracy scores
print('Accuracy Score, Training Set: ', logreg.score(X_train, y_train))
print('Accuracy Score, Test Set: ', logreg.score(X_test, y_test))

# confusion matrix
# rows are true classes, columns are predicted classes (order: Fair, Good, Poor)
cm = confusion_matrix(y_test, logreg_pred)
print('Confusion Matrix \n', cm)

# classification report
print('Classification Report \n')
print(classification_report(y_test, logreg_pred))

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


Accuracy Score, Training Set:  0.8106972051627849
Accuracy Score, Test Set:  0.8108224257753984
Confusion Matrix 
 [[   450  23657      0]
 [   464 131618      0]
 [   344   6349      2]]
Classification Report 

              precision    recall  f1-score   support

        Fair       0.36      0.02      0.04     24107
        Good       0.81      1.00      0.90    132082
        Poor       1.00      0.00      0.00      6695

    accuracy                           0.81    162884
   macro avg       0.72      0.34      0.31    162884
weighted avg       0.75      0.81      0.73    162884
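
The per-class numbers in the report can be derived from the confusion matrix. The sketch below copies the matrix printed above and recomputes precision, recall, and f1-score with NumPy. It is meant as a reading aid for the report, not as a replacement for `classification_report`.

```python
import numpy as np

# Confusion matrix printed above for logistic regression
# (rows = true class, columns = predicted class, order: Fair, Good, Poor).
labels = ['Fair', 'Good', 'Poor']
cm = np.array([
    [450, 23657, 0],
    [464, 131618, 0],
    [344, 6349, 2],
])

true_positives = np.diag(cm)
precision = true_positives / cm.sum(axis=0)  # column sums = number predicted as each class
recall = true_positives / cm.sum(axis=1)     # row sums = number of true members (support)
f1 = 2 * precision * recall / (precision + recall)

for name, p, r, f in zip(labels, precision, recall, f1):
    print(f'{name}: precision={p:.2f} recall={r:.2f} f1={f:.2f}')
```

The precision of 1.00 for Poor is misleading on its own: the model predicted Poor only twice, and both predictions happened to be correct, while recall for that class is effectively zero.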



In [11]:
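# GridSearch over the inverse regularization strength C
logreg_gs = LogisticRegression(random_state=42)
params = {'C':[0.001, 0.01, 0.1, 1, 10, 20, 40, 60]}
gridsearch = GridSearchCV(logreg_gs, params)

# fit to data
# Note: this fits on the full dataset (X, y), so the test rows are seen during tuning;
# fitting on (X_train, y_train) would keep the test set held out.
gridsearch.fit(X, y)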

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression

GridSearchCV(cv=None, error_score=nan,
             estimator=LogisticRegression(C=1.0, class_weight=None, dual=False,
                                          fit_intercept=True,
                                          intercept_scaling=1, l1_ratio=None,
                                          max_iter=100, multi_class='auto',
                                          n_jobs=None, penalty='l2',
                                          random_state=42, solver='lbfgs',
                                          tol=0.0001, verbose=0,
                                          warm_start=False),
             iid='deprecated', n_jobs=None,
             param_grid={'C': [0.001, 0.01, 0.1, 1, 10, 20, 40, 60]},
             pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
             scoring=None, verbose=0)
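
The convergence warnings above suggest scaling the data or raising `max_iter`. One possible way to do both while tuning `C` is sketched below, under several assumptions: the data is synthetic (`make_classification`), the names `pipe`, `search`, and `param_grid` are hypothetical, and the grid and `max_iter=1000` are arbitrary choices rather than the settings used in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic 3-class data standing in for the tree features (assumption).
X_demo, y_demo = make_classification(
    n_samples=600, n_features=8, n_informative=4,
    n_classes=3, random_state=42,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42, stratify=y_demo
)

pipe = Pipeline([
    ('scale', StandardScaler()),  # scaling usually helps the lbfgs solver converge
    ('clf', LogisticRegression(max_iter=1000, random_state=42)),
])
param_grid = {'clf__C': [0.01, 0.1, 1, 10]}

search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X_tr, y_tr)  # tune on the training split only

print('Best params:', search.best_params_)
print('Best CV accuracy:', search.best_score_)
print('Held-out accuracy:', search.score(X_te, y_te))
```

Putting the scaler inside the pipeline means it is refit on each training fold, so the validation folds do not influence the scaling. `best_params_` exposes the selected value of `C`, which can then be passed to the final model.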

In [12]:
# cross validation - 5-fold
cv_scores = cross_val_score(logreg, X, y, cv=5)

print(cv_scores)
print("Average 5-Fold CV Score: {}".format(np.mean(cv_scores)))

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression

[0.80930418 0.81148365 0.81116901 0.81186736 0.80646473]
Average 5-Fold CV Score: 0.8100577866116172
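
`cross_val_score` reports accuracy by default for classifiers, and the average above is essentially the majority-class baseline. Since our interest is in precision and recall for every class, a class-averaged metric may be more informative. The sketch below compares accuracy with macro-averaged F1 on synthetic imbalanced data. The data, the class weights, and the names `X_demo`, `y_demo`, and `model` are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic imbalanced 3-class data (assumption), roughly 80/15/5 like the tree labels.
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=10, n_informative=5, n_classes=3,
    weights=[0.80, 0.15, 0.05], random_state=42,
)

model = LogisticRegression(max_iter=1000, random_state=42)

# For classifiers, an integer cv uses stratified folds.
acc_scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring='accuracy')
f1_scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring='f1_macro')

print('Mean accuracy:', np.mean(acc_scores))
print('Mean macro F1:', np.mean(f1_scores))
```

Macro F1 weights every class equally, so weak performance on the rare classes pulls it down even when accuracy looks acceptable.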


## KNN Classifier

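This model takes a long time to run: KNN stores the training set and computes distances to the training points at prediction time, which is costly with roughly 650,000 rows.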

In [13]:
# GridSearch
knn = KNeighborsClassifier()
parameters = {'n_neighbors':[3, 10]}
gridsearch = GridSearchCV(knn, parameters)

# fit to data
gridsearch.fit(X, y)

GridSearchCV(cv=None, error_score=nan,
             estimator=KNeighborsClassifier(algorithm='auto', leaf_size=30,
                                            metric='minkowski',
                                            metric_params=None, n_jobs=None,
                                            n_neighbors=5, p=2,
                                            weights='uniform'),
             iid='deprecated', n_jobs=None, param_grid={'n_neighbors': [3, 10]},
             pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
             scoring=None, verbose=0)

In [14]:
# n_neighbors=5 is the scikit-learn default; it was not one of the values in the grid search above
knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
knn_pred = knn.predict(X_test)

# accuracy scoring
print('Accuracy Score, Training Set: ', knn.score(X_train, y_train))
print('Accuracy Score, Test Set: ', knn.score(X_test, y_test))

# confusion matrix
cm = confusion_matrix(y_test, knn_pred)
print('Confusion Matrix \n', cm)

# classification report
print('Classification Report \n')
print(classification_report(y_test, knn_pred))

Accuracy Score, Training Set:  0.7919373949915175
Accuracy Score, Test Set:  0.7815009454581174
Confusion Matrix 
 [[  2426  21546    135]
 [  7247 124721    114]
 [   837   5711    147]]
Classification Report 

              precision    recall  f1-score   support

        Fair       0.23      0.10      0.14     24107
        Good       0.82      0.94      0.88    132082
        Poor       0.37      0.02      0.04      6695

    accuracy                           0.78    162884
   macro avg       0.47      0.36      0.35    162884
weighted avg       0.71      0.78      0.73    162884



In [15]:
# cross validation - 5-fold
cv_scores = cross_val_score(knn, X, y, cv=5)

print('CV scores: ', cv_scores)
print('Average 5-Fold CV Score: {}'.format(np.mean(cv_scores)))

CV scores:  [0.77301296 0.79154612 0.78173851 0.78559095 0.75418051]
Average 5-Fold CV Score: 0.7772138104629835


## Decision Tree Classifier

In [16]:
decision_tree = DecisionTreeClassifier(random_state=42)
decision_tree.fit(X_train, y_train)
decision_tree_pred = decision_tree.predict(X_test)

# accuracy scores
print('Accuracy Score, Training Set:', decision_tree.score(X_train, y_train))
print('Accuracy Score, Test Set:', decision_tree.score(X_test, y_test))

# confusion matrix
cm = confusion_matrix(y_test, decision_tree_pred)
print('Confusion Matrix \n', cm)

# classification report
print('Classification Report \n')
print(classification_report(y_test, decision_tree_pred))

Accuracy Score, Training Set: 0.8254725765423585
Accuracy Score, Test Set: 0.8016011394612117
Confusion Matrix 
 [[  1534  22314    259]
 [  2877 128816    389]
 [   562   5915    218]]
Classification Report 

              precision    recall  f1-score   support

        Fair       0.31      0.06      0.11     24107
        Good       0.82      0.98      0.89    132082
        Poor       0.25      0.03      0.06      6695

    accuracy                           0.80    162884
   macro avg       0.46      0.36      0.35    162884
weighted avg       0.72      0.80      0.74    162884



In [17]:
# cross validation
cv_scores = cross_val_score(decision_tree, X, y, cv=5)

print('CV scores: ', cv_scores)
print('Average 5-Fold CV Score: {}'.format(np.mean(cv_scores)))

CV scores:  [0.79194518 0.80511408 0.80618846 0.80332599 0.77837722]
Average 5-Fold CV Score: 0.7969901847176284


## Random Forest Classifier

In [18]:
forest = RandomForestClassifier(random_state=42)
forest.fit(X_train, y_train)
y_pred = forest.predict(X_test)

# accuracy scores
print('Accuracy Score, Training Set:', forest.score(X_train, y_train))
print('Accuracy Score, Test Set:', forest.score(X_test, y_test))

# confusion matrix
cm = confusion_matrix(y_test, y_pred)
print('Confusion Matrix \n', cm)

# classification report
print('Classification Report \n')
print(classification_report(y_test, y_pred))

Accuracy Score, Training Set: 0.8254664371913697
Accuracy Score, Test Set: 0.805198791778198
Confusion Matrix 
 [[  1155  22663    289]
 [  1934 129745    403]
 [   421   6020    254]]
Classification Report 

              precision    recall  f1-score   support

        Fair       0.33      0.05      0.08     24107
        Good       0.82      0.98      0.89    132082
        Poor       0.27      0.04      0.07      6695

    accuracy                           0.81    162884
   macro avg       0.47      0.36      0.35    162884
weighted avg       0.72      0.81      0.74    162884



In [19]:
# cross validation - 5-fold
cv_scores = cross_val_score(forest, X, y, cv=5)

print(cv_scores)
print('Average 5-Fold CV Score: {}'.format(np.mean(cv_scores)))

[0.79773151 0.80782306 0.80825282 0.80649543 0.78629698]
Average 5-Fold CV Score: 0.8013199597872716


## Gaussian Naive Bayes

In [20]:
gaussian = GaussianNB()
gaussian.fit(X_train, y_train)
gaussian_pred = gaussian.predict(X_test)

# accuracy scores
print('Accuracy Score, Training Set:', gaussian.score(X_train, y_train))
print('Accuracy Score, Test Set:', gaussian.score(X_test, y_test))

# confusion matrix
cm = confusion_matrix(y_test, gaussian_pred)
print('Confusion Matrix \n', cm)

# classification report
print('Classification Report \n')
print(classification_report(y_test, gaussian_pred))

Accuracy Score, Training Set: 0.7369062991787595
Accuracy Score, Test Set: 0.7380896834557108
Confusion Matrix 
 [[  2517  18452   3138]
 [  8899 116393   6790]
 [   594   4788   1313]]
Classification Report 

              precision    recall  f1-score   support

        Fair       0.21      0.10      0.14     24107
        Good       0.83      0.88      0.86    132082
        Poor       0.12      0.20      0.15      6695

    accuracy                           0.74    162884
   macro avg       0.39      0.39      0.38    162884
weighted avg       0.71      0.74      0.72    162884



In [21]:
# cross validation - 5-fold
cv_scores = cross_val_score(gaussian, X, y, cv=5)

print(cv_scores)
print('Average 5-Fold CV Score: {}'.format(np.mean(cv_scores)))

[0.70158932 0.7710407  0.77120953 0.76104891 0.66612692]
Average 5-Fold CV Score: 0.7342030742784347


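Precision and recall are relatively high for Good trees, but the scores for Fair and Poor trees are weak: recall stays at or below 0.20 for both classes in every model. Accuracy alone therefore overstates how well the models perform, so we need to look into undersampling and oversampling methods to address the class imbalance.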
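
As a rough illustration of the idea, the sketch below implements plain random undersampling with NumPy. It is not necessarily one of the methods used in the follow-up notebooks. The function `random_undersample` and the toy arrays are hypothetical, and in practice the resampling would be applied to the training split only.

```python
import numpy as np

def random_undersample(X, y, random_state=42):
    """Randomly drop rows so every class has as many rows as the rarest class."""
    rng = np.random.default_rng(random_state)
    classes, counts = np.unique(y, return_counts=True)
    n_min = counts.min()
    keep = np.concatenate([
        rng.choice(np.flatnonzero(y == cls), size=n_min, replace=False)
        for cls in classes
    ])
    rng.shuffle(keep)  # avoid leaving the rows grouped by class
    return X[keep], y[keep]

# Toy labels with the same kind of imbalance as the tree data (assumption).
y_demo = np.array(['Good'] * 80 + ['Fair'] * 15 + ['Poor'] * 5)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)

X_bal, y_bal = random_undersample(X_demo, y_demo)
balanced_counts = dict(zip(*np.unique(y_bal, return_counts=True)))
print(balanced_counts)
```

Undersampling discards most of the majority class, which trades information about Good trees for balanced classes. Oversampling makes the opposite trade by repeating or synthesizing minority rows.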