In [10]:
pip install matplotlib numpy pandas scikit-learn xgboost

Collecting xgboost
  Downloading xgboost-1.6.2-py3-none-manylinux2014_x86_64.whl (255.9 MB)
[K     |████████████████████████████████| 255.9 MB 120 kB/s  eta 0:00:011
Installing collected packages: xgboost
Successfully installed xgboost-1.6.2
Note: you may need to restart the kernel to use updated packages.


In [2]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn import metrics

In [3]:
filename = "smoke_detection_iot.csv"
sd_data = pd.read_csv(filename)
sd_vector = sd_data.to_numpy()
print("Size of data: ",sd_vector.shape)
print(f"{np.sum(sd_vector[:,-1])} Positive cases")
print(f"{sd_vector.shape[0] - np.sum(sd_vector[:,-1])} Negative cases")
sd_data.head()

Size of data:  (62630, 16)
44757.0 Positive cases
17873.0 Negative cases


Unnamed: 0.1,Unnamed: 0,UTC,Temperature[C],Humidity[%],TVOC[ppb],eCO2[ppm],Raw H2,Raw Ethanol,Pressure[hPa],PM1.0,PM2.5,NC0.5,NC1.0,NC2.5,CNT,Fire Alarm
0,0,1654733331,20.0,57.36,0,400,12306,18520,939.735,0.0,0.0,0.0,0.0,0.0,0,0
1,1,1654733332,20.015,56.67,0,400,12345,18651,939.744,0.0,0.0,0.0,0.0,0.0,1,0
2,2,1654733333,20.029,55.96,0,400,12374,18764,939.738,0.0,0.0,0.0,0.0,0.0,2,0
3,3,1654733334,20.044,55.28,0,400,12390,18849,939.736,0.0,0.0,0.0,0.0,0.0,3,0
4,4,1654733335,20.059,54.69,0,400,12403,18921,939.744,0.0,0.0,0.0,0.0,0.0,4,0


## Task 1
Decision Tree Classifier Implementation. Solving the problem from last assignment of fire detection with sensor fusion.

First starting with the vanilla implementation and testing its accuracy:

In [27]:
def preprocess(sd_vector):
    # 80% train 20% test split
    train_size = int(sd_vector.shape[0]*0.8)
    x = sd_vector[:,2:-2] 
    # normalize data
    x = (x-np.mean(x,axis=0))/(np.std(x,axis=0))
    y = sd_vector[:,-1]
    return x[:train_size].copy(),y[:train_size].copy(),x[train_size:].copy(),y[train_size:].copy()

x_train, y_train, x_test, y_test = preprocess(sd_vector)

# Running the default model
rfc = DecisionTreeClassifier()
rfc.fit(x_train,y_train)
y_pred = rfc.predict(x_train)
print(f"Training Accuracy: {100*metrics.accuracy_score(y_train, y_pred):0.4f}%")
y_pred = rfc.predict(x_test)
print(f"Testing Accuracy: {100*metrics.accuracy_score(y_test, y_pred):0.4f}%")

Training Accuracy: 100.0000%
Testing Accuracy: 43.2780%


As we can see the decision tree from scikit learn with default parameters slightly underperforms the linear regressor from the last assignment that reached a near 70% accuracy. 

Next trying to overcome the data imbalances in the last assignment by setting the decision tree to balance the dataset weighting before training:

In [28]:

rfc = DecisionTreeClassifier(class_weight ="balanced")
rfc.fit(x_train,y_train)
y_pred = rfc.predict(x_train)
print(f"Training Accuracy: {100*metrics.accuracy_score(y_train, y_pred):0.4f}%")
y_pred = rfc.predict(x_test)
print(f"Testing Accuracy: {100*metrics.accuracy_score(y_test, y_pred):0.4f}%")

Training Accuracy: 100.0000%
Testing Accuracy: 83.4265%


From this test we can see a balanced dataset for the decision tree drastically improves performance to a higher score then acheived with linear classifiers. This so far is the best performing method for smoke detection.

Next changing the model from selecting the best criteria in every node of the tree to selecting a crieteria randomly to hopefully get better generalization without the greedy approach. Because this changes the model training dynamics to be highly variable I will run this experiment multiple times to see the best model it produces.

In [29]:
# Loop to train multiple random models
for i in range(20):
    rfco = DecisionTreeClassifier(class_weight ="balanced", splitter ="random")
    rfco.fit(x_train,y_train)
    y_pred = rfco.predict(x_train)
    print(f"Model {i+1} Training Accuracy: {100*metrics.accuracy_score(y_train, y_pred):0.4f}%")
    y_pred = rfc.predict(x_test)
    print(f"Model {i+1} Testing Accuracy: {100*metrics.accuracy_score(y_test, y_pred):0.4f}%")

Model 1 Training Accuracy: 100.0000%
Model 1 Testing Accuracy: 83.4265%
Model 2 Training Accuracy: 100.0000%
Model 2 Testing Accuracy: 83.4265%
Model 3 Training Accuracy: 100.0000%
Model 3 Testing Accuracy: 83.4265%
Model 4 Training Accuracy: 100.0000%
Model 4 Testing Accuracy: 83.4265%
Model 5 Training Accuracy: 100.0000%
Model 5 Testing Accuracy: 83.4265%
Model 6 Training Accuracy: 100.0000%
Model 6 Testing Accuracy: 83.4265%
Model 7 Training Accuracy: 100.0000%
Model 7 Testing Accuracy: 83.4265%
Model 8 Training Accuracy: 100.0000%
Model 8 Testing Accuracy: 83.4265%
Model 9 Training Accuracy: 100.0000%
Model 9 Testing Accuracy: 83.4265%
Model 10 Training Accuracy: 100.0000%
Model 10 Testing Accuracy: 83.4265%
Model 11 Training Accuracy: 100.0000%
Model 11 Testing Accuracy: 83.4265%
Model 12 Training Accuracy: 100.0000%
Model 12 Testing Accuracy: 83.4265%
Model 13 Training Accuracy: 100.0000%
Model 13 Testing Accuracy: 83.4265%
Model 14 Training Accuracy: 100.0000%
Model 14 Testing A

From this experiment we can see that while random decisions make the model on average perform less accuratly, by searching the model space we can actually find models that perform near 90% accuracy on the training and test sets. It is worth to note though with this sort of bias towards selecting higher performing models on the test set we may be biasing the model.

## Task 2
From the Bagging and Boosting ensemble methods pick any one algorithm
from each category. Implement both the algorithms using the same data. Use k-fold cross
validation to find the effectiveness of both the models. 


Starting with Bagging using KNeighborsClassifier implementation:

In [30]:
from sklearn.ensemble import BaggingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import RepeatedStratifiedKFold

x_train, y_train, x_test, y_test = preprocess(sd_vector)

X  = np.concatenate([x_train,x_test],axis=0)
Y  = np.concatenate([y_train,y_test],axis=0)

# Support vector machine bagging clasisfier
bag_model = BaggingClassifier(base_estimator=KNeighborsClassifier(),n_estimators =5 )
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
n_scores = cross_val_score(bag_model, X, Y, scoring='accuracy', cv=cv, n_jobs=-1, error_score='raise')
print(n_scores)
print('Accuracy: %.3f (%.3f)' % (np.mean(n_scores), np.std(n_scores)))
print("Done")

bag_model.fit(x_train,y_train)
y_pred = bag_model.predict(x_test)
print(f"Model Testing Accuracy: {100*metrics.accuracy_score(y_test, y_pred):0.4f}%")

[0.99872266 0.99792432 0.99936133 0.99920166 0.99776465 0.99776465
 0.99840332 0.99888232 0.99856299 0.99888232 0.99840332 0.99776465
 0.99808399 0.99936133 0.99904199 0.99840332 0.99856299 0.99776465
 0.99888232 0.99872266 0.99872266 0.99888232 0.99776465 0.99840332
 0.99872266 0.99840332 0.99824365 0.99904199 0.999521   0.99744531]
Accuracy: 0.999 (0.001)
Done
Model Testing Accuracy: 14.9848%


These are the results of the KNeighbors based classifier.
From this we can see the k-fold cross validation with 10 splits resulted in an extremely high model accuracy of 0.999 average across all of the folds with a low standard deviation of 0.001. However, when actually fit on a dataset the examples went down to 15% accuracy. This could be due to the different fitting metrics the scikitlearn cross validation uses or simply the differences in having a relatively high train test ratio vs a low ratio in the low accuracy example.

Next implementing the Boosting method with the XGBoost implementation that includes regularization to help with overfitting while improving the performance of stock gradient boosting algorithms:

In [31]:
from sklearn.ensemble import AdaBoostClassifier
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
import xgboost as xgb

x_train, y_train, x_test, y_test = preprocess(sd_vector)

X  = np.concatenate([x_train,x_test],axis=0)
Y  = np.concatenate([y_train,y_test],axis=0)

boost_model = xgb.XGBClassifier(n_estimators = 5)
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
n_scores = cross_val_score(boost_model, X, Y, scoring='accuracy', cv=cv, n_jobs=-1, error_score='raise')
print(n_scores)
print('Accuracy: %.3f (%.3f)' % (np.mean(n_scores), np.std(n_scores)))
print("Done")

boost_model.fit(x_train, y_train)
y_pred = boost_model.predict(x_test)
print(f"Model Testing Accuracy: {100*metrics.accuracy_score(y_test, y_pred):0.4f}%")

[0.99936133 0.99984033 1.         0.99984033 0.99984033 0.99920166
 0.99968066 0.999521   0.99968066 0.99920166 0.999521   0.99968066
 0.99984033 0.999521   0.99984033 0.999521   0.99936133 0.99984033
 0.99968066 1.         0.99936133 0.99984033 0.99968066 0.99984033
 0.999521   0.999521   0.99984033 1.         0.99968066 0.999521  ]
Accuracy: 1.000 (0.000)
Done
Model Testing Accuracy: 56.8338%


The boosting method seems to have performed very similarly to the bagging model with slightly higher cross validation scores. For differences the extreme gradient boosting method seems to be an order of magnitude faster then the bagging method likely due to the fact that the boosting method can be run in parrallel and the individual model sizes for this model are much much smaller. Along with this, the training and testing figures after the cross validation indicate the boosting method is more stable as it performed better in the one off training example while the bagging method had failed. Both models seemed to struggle with the particular training test set I used for my training and evaluation while both models did well with the cross validation. This may be due to improper class balancing as my test and training sets were selected sequentially potentially leading to harder examples in the relatively large test set that the model struggles with unlike the smaller partitions in cross validation.


## Task 3 
Comparing the three models implemented above utilizing the various stats that can be derived from a confusion matrix as they help contextualize a model's base accuracy and indicate exactly how the model will perform in the real world. 

Confusion matrixes show the difference between true and false positive and negative predictions to hopefully reveal and mitigate model bias. In this case of fire detections an extremely high precision is required since missing a fire would be devastating but also triggering false alarms will desensitize people to the fire alarm. Along with this recall from the confusion matrix can be used to see just how many actual cases were missed and how many were correctly classified. Also the imbalance in classes for this binary classifcation task lends itself to model bias. Simply looking at accuracy does not reveal the depth of information needed to decide if a model should be deployed in the real world.

This source gives a great example of where confusion matrices are needed for model analysis (https://machinelearningmastery.com/confusion-matrix-machine-learning/#:~:text=Classification%20accuracy%20alone%20can%20be,of%20errors%20it%20is%20making.) and was part of the motivation of me using this technique.


In [47]:
from sklearn.metrics import confusion_matrix
x_train, y_train, x_test, y_test = preprocess(sd_vector)

# second model from task 1
print("First Model")
y_pred = rfc.predict(x_test)
one_cof = confusion_matrix(y_test, y_pred)
print("Confusion Matrix: ",one_cof)
print(f"Precision {100*(one_cof[0][0]/(one_cof[0][0]+one_cof[0][1])):0.2f}%")
print(f"Accuracy {100*((one_cof[0][0]+one_cof[1][1])/(np.sum(one_cof))):0.2f}%")
print(f"Recall  {100*(one_cof[0][0]/(one_cof[0][0]+one_cof[1][0])):0.2f}%")
print(f"Specificity  {100*(one_cof[1][1]/(one_cof[1][1]+one_cof[0][1])):0.2f}%")
print()

# bagging model
print("Second Model")
y_pred = bag_model.predict(x_test)
two_cof = confusion_matrix(y_test, y_pred)
print("Confusion Matrix: ",two_cof)
print(f"Precision {100*(two_cof[0][0]/(two_cof[0][0]+two_cof[0][1])):0.2f}%")
print(f"Accuracy {100*((two_cof[0][0]+two_cof[1][1])/(np.sum(two_cof))):0.2f}%")
print(f"Recall  {100*(two_cof[0][0]/(two_cof[0][0]+two_cof[1][0])):0.2f}%")
print(f"Specificity  {100*(two_cof[1][1]/(two_cof[1][1]+two_cof[0][1])):0.2f}%")
print()

# boosting model
print("Third Model")
y_pred = boost_model.predict(x_test)
three_cof = confusion_matrix(y_test, y_pred)
print("Confusion Matrix: ",three_cof)
print(f"Precision {100*(three_cof[0][0]/(three_cof[0][0]+three_cof[0][1])):0.2f}%")
print(f"Accuracy {100*((three_cof[0][0]+three_cof[1][1])/(np.sum(three_cof))):0.2f}%")
print(f"Recall  {100*(three_cof[0][0]/(three_cof[0][0]+three_cof[1][0])):0.2f}%")
print(f"Specificity  {100*(three_cof[1][1]/(three_cof[1][1]+three_cof[0][1])):0.2f}%")
print()

First Model
Confusion Matrix:  [[10324  1160]
 [  916   126]]
Precision 89.90%
Accuracy 83.43%
Recall  91.85%
Specificity  9.80%

Second Model
Confusion Matrix:  [[1588 9896]
 [ 753  289]]
Precision 13.83%
Accuracy 14.98%
Recall  67.83%
Specificity  2.84%

Third Model
Confusion Matrix:  [[7016 4468]
 [ 939  103]]
Precision 61.09%
Accuracy 56.83%
Recall  88.20%
Specificity  2.25%



By analyzing these various metrics we can see the pro and con of each model and what information would be lost by only analyzing accuracy. The first destinction between the models is that across the board the First model with data balancing random forest outperforms all other methods of ensamble learning. This is likely due to the data balancing as all other models have to compensate for the data imbalance. The recall and precision of this model are high as well indicating that the model accurately assigns it's positive cases which is important in fire detection to avoiding false alarms that desensitize people's reactions.

An accuracy of only 80% makes the first model seem poor for fire detection but you would be missing the model's impressive low true negative rate indicating that not many fires would be missed with this alarm. This is also true for the third model that has a low accuracy but a high recall so it's rate of predicted positives over the true positives is high indicating the model is good at detecting fires when present but it struggles with precision meaning it has many false positives. Not only is this useful with fine tuning the model but also shows how one metric may appear high while the model's representation of the problem is poor. This Third model likely is clearly predicting true in nearly all cases and since the dataset is biased towards present fires. The second model is relatively unremarkable in metric analysis since in all regards it is poor at predicting, although even the even spread of this model can indicate a training failure as in binary classification 50% accuracy should be expected in a balanced dataset.

Overall, multiple metrics should always be used in evaluating different machine learning models as they can help with fine tuning their output and predicting what problems will arise in ht ereal world. The confusion matrix is great for this role as it offers multiple perspectives of a simple classification problem and is probably why this practice is so common in the feild. In the future I hope to use confusion matrices to analyze the performance of classification problems especially in spaces that are sensitive to false positives such as fire detection, cancer detection, and model's whose positive case should be relatively rare in the real world.





