# <font color='#d50283'>IT Academy - Data Science Itinerary</font>
## Sprint 10 Task 1 - Supervised Classification
### Model 1: Classification Tree

### Assignment by: Kat Weissman

#### General objective:

- Practice and become familiar with classification algorithms.

#### Python Learning Objectives:
- Classification trees
- KNN - k-Nearest Neighbors
- Logistic Regression
- Support Vector Machine
- XGBoost

*Recommended learning resources:*
- https://www.datacamp.com/community/tutorials/decision-tree-classification-python
- https://towardsdatascience.com/how-to-best-evaluate-a-classification-model-2edb12bcc587
- https://www.ritchieng.com/machine-learning-evaluate-classification-model/
- https://towardsdatascience.com/hackcvilleds-4636c6c1ba53
- https://scikit-learn.org/stable/modules/cross_validation.html

Classification Models:
- https://stackabuse.com/k-nearest-neighbors-algorithm-in-python-and-scikit-learn/
- https://www.datacamp.com/community/tutorials/understanding-logistic-regression-python
- https://towardsdatascience.com/logistic-regression-using-python-sklearn-numpy-mnist-handwriting-recognition-matplotlib-a6b31e2b166a
- https://www.datacamp.com/community/tutorials/svm-classification-scikit-learn-python
- https://www.datacamp.com/community/tutorials/xgboost-in-python

Imbalanced Data:
- https://towardsdatascience.com/methods-for-dealing-with-imbalanced-data-5b761be45a18
- https://www.kdnuggets.com/2019/05/fix-unbalanced-dataset.html
- https://machinelearningmastery.com/tactics-to-combat-imbalanced-classes-in-your-machine-learning-dataset/
- https://machinelearningmastery.com/smote-oversampling-for-imbalanced-classification/

In [1]:
#Import libraries - basic
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats

In [2]:
#Import libraries - classification
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import StratifiedKFold

### Level 1
### Exercise 1 
Create at least three different classification models to predict flight arrival delay (ArrDelay) from DelayedFlights.csv. Treat the target as binary: a flight is late when ArrDelay > 0.

Reference: https://towardsdatascience.com/lazy-predict-fit-and-evaluate-all-the-models-from-scikit-learn-with-a-single-line-of-code-7fe510c7281
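
A minimal sketch of how the binary target in the exercise statement could be derived, assuming a `Delayed` column equal to 1 when `ArrDelay > 0`. The five `ArrDelay` values are copied from the first rows displayed later in this notebook. The actual pre-processing was done in a separate notebook and may differ in detail.

```python
import pandas as pd

# ArrDelay values (minutes) copied from the first five rows shown in this notebook
flights = pd.DataFrame({"ArrDelay": [0.0, 15.0, 11.0, 23.0, 14.0]})

# A flight counts as late only when it arrives after schedule (ArrDelay > 0)
flights["Delayed"] = (flights["ArrDelay"] > 0).astype(int)

print(flights["Delayed"].tolist())
```

A flight with `ArrDelay` of exactly 0 is treated as on time under this rule.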

In [3]:
pd.set_option('display.max_columns', None)  #set display to show all columns

I will load the data, which I pre-processed and resampled with the SMOTE technique in a separate notebook:

- https://github.com/KatBCN/Supervisat_Classificacio/blob/main/Sprint%2010%20-%20Classification%20Model%20-%20Pre-Processing.ipynb

In [4]:
data_link = 'https://github.com/KatBCN/Supervisat_Classificacio/blob/main/flights-sampled-smoted.pkl.bz2?raw=true'
df = pd.read_pickle(data_link,compression='bz2')

#### Data Exploration

In [5]:
# Show number of rows and columns in dataframe
df.shape

(172248, 20)

In [6]:
# Show column names
df.columns

Index(['Month', 'DayOfWeek', 'DepTime', 'CRSDepTime', 'ArrTime', 'CRSArrTime',
       'UniqueCarrier', 'ActualElapsedTime', 'CRSElapsedTime', 'AirTime',
       'ArrDelay', 'DepDelay', 'Origin', 'Dest', 'Distance', 'TaxiIn',
       'TaxiOut', 'Cancelled', 'Diverted', 'Delayed'],
      dtype='object')

In [7]:
# Display first 5 rows of dataframe
df.head(5)

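| | Month | DayOfWeek | DepTime | CRSDepTime | ArrTime | CRSArrTime | UniqueCarrier | ActualElapsedTime | CRSElapsedTime | AirTime | ArrDelay | DepDelay | Origin | Dest | Distance | TaxiIn | TaxiOut | Cancelled | Diverted | Delayed |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7 | 5 | 1511.0 | 1505 | 1740.0 | 1740 | WN | 269.0 | 275.0 | 252.0 | 0.0 | 6.0 | BNA | OAK | 1959 | 5.0 | 12.0 | 0 | 0 | 0 |
| 1 | 1 | 5 | 1736.0 | 1730 | 1913.0 | 1858 | FL | 97.0 | 88.0 | 69.0 | 15.0 | 6.0 | BOS | BWI | 370 | 9.0 | 19.0 | 0 | 0 | 1 |
| 2 | 4 | 2 | 1611.0 | 1600 | 1711.0 | 1700 | WN | 60.0 | 60.0 | 41.0 | 11.0 | 11.0 | ONT | LAS | 197 | 3.0 | 16.0 | 0 | 0 | 1 |
| 3 | 5 | 5 | 1551.0 | 1535 | 1803.0 | 1740 | WN | 132.0 | 125.0 | 121.0 | 23.0 | 16.0 | STL | HOU | 687 | 3.0 | 8.0 | 0 | 0 | 1 |
| 4 | 4 | 3 | 1754.0 | 1725 | 1834.0 | 1820 | WN | 100.0 | 115.0 | 86.0 | 14.0 | 29.0 | SLC | LAX | 590 | 6.0 | 8.0 | 0 | 0 | 1 |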


In [8]:
# check data set variables
df.dtypes

Month                 object
DayOfWeek             object
DepTime              float64
CRSDepTime             int64
ArrTime              float64
CRSArrTime             int64
UniqueCarrier         object
ActualElapsedTime    float64
CRSElapsedTime       float64
AirTime              float64
ArrDelay             float64
DepDelay             float64
Origin                object
Dest                  object
Distance               int64
TaxiIn               float64
TaxiOut              float64
Cancelled             object
Diverted              object
Delayed                int64
dtype: object

In [9]:
# Check for duplicates
df.duplicated().sum()

0

### Model 1 - Decision Tree Classifier

The first model will be a Decision Tree Classifier.

I would like to know if we can predict if a flight will be delayed based on actual elapsed time, distance, TaxiIn, and TaxiOut.

In [10]:
X = df[['ActualElapsedTime', 'Distance', 'TaxiIn', 'TaxiOut']]
y = df['Delayed']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2, random_state=42)

In [11]:
X.head()

| | ActualElapsedTime | Distance | TaxiIn | TaxiOut |
|---|---|---|---|---|
| 0 | 269.0 | 1959 | 5.0 | 12.0 |
| 1 | 97.0 | 370 | 9.0 | 19.0 |
| 2 | 60.0 | 197 | 3.0 | 16.0 |
| 3 | 132.0 | 687 | 3.0 | 8.0 |
| 4 | 100.0 | 590 | 6.0 | 8.0 |


In [12]:
# Create Decision Tree classifier object with random seed
clf = DecisionTreeClassifier(random_state=324)

In [13]:
# Train Decision Tree Classifier
clf = clf.fit(X_train,y_train)

In [14]:
#Predict the response for test dataset
y_pred = clf.predict(X_test)

### Level 1
### Exercise 2 : Confusion Matrix & Metrics
Compare classification models using accuracy, a confusion matrix, and other more advanced metrics.

Balanced Accuracy is important for imbalanced datasets. https://scikit-learn.org/stable/modules/generated/sklearn.metrics.balanced_accuracy_score.html

In this case, I balanced the data with SMOTE before splitting it, so both the training set and the test set come from the balanced data. The balanced accuracy should therefore be close to the plain accuracy.

In [15]:
def evaluateModel(y_test, y_pred):
    print(metrics.confusion_matrix(y_test, y_pred))
    print("Accuracy:",metrics.accuracy_score(y_test, y_pred))
    print("Balanced Accuracy:",metrics.balanced_accuracy_score(y_test, y_pred))
    print("Classification Report:")
    print(metrics.classification_report(y_test, y_pred))

In [16]:
# Decision Tree Model Confusion Matrix
evaluateModel(y_test,y_pred)

[[15362  1990]
 [ 2013 15085]]
Accuracy: 0.8838026124818578
Balanced Accuracy: 0.8837913727719698
Classification Report:
              precision    recall  f1-score   support

           0       0.88      0.89      0.88     17352
           1       0.88      0.88      0.88     17098

    accuracy                           0.88     34450
   macro avg       0.88      0.88      0.88     34450
weighted avg       0.88      0.88      0.88     34450
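
As a worked check, the sketch below recomputes accuracy and balanced accuracy by hand from the confusion-matrix counts printed above for the first tree. It assumes scikit-learn's layout (rows are true classes, columns are predicted classes) and treats balanced accuracy as the unweighted mean of the per-class recalls.

```python
# Confusion-matrix counts reported above for the first tree
# (rows = true class, columns = predicted class)
tn, fp = 15362, 1990
fn, tp = 2013, 15085

total = tn + fp + fn + tp
accuracy = (tn + tp) / total

# Balanced accuracy is the unweighted mean of the per-class recalls
recall_0 = tn / (tn + fp)
recall_1 = tp / (tp + fn)
balanced_accuracy = (recall_0 + recall_1) / 2

print(f"accuracy={accuracy:.4f}, balanced accuracy={balanced_accuracy:.4f}")
```

Because the two classes have nearly equal support (17352 and 17098), the two metrics agree to about four decimal places.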



In [17]:
clf.get_depth()

54

In [18]:
clf.get_n_leaves()

15389

CART Classification Feature Importance

Reference:
- https://machinelearningmastery.com/calculate-feature-importance-with-python/

The order of importance for the features used in this model is:
1. TaxiOut (0.45)
2. TaxiIn (0.27)
3. Distance (0.17)
4. ActualElapsedTime (0.11)

In [19]:
# get importance
importance = clf.feature_importances_

# summarize feature importance
for i,v in enumerate(importance):
    print('Feature: %0d, Score: %.5f' % (i,v))

Feature: 0, Score: 0.11209
Feature: 1, Score: 0.17199
Feature: 2, Score: 0.26523
Feature: 3, Score: 0.45069
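
The scores above are indexed by column position, so reading them requires remembering the column order of `X`. A small sketch of pairing each score with its column name follows. The scores are hard-coded from the output above so that the snippet runs on its own.

```python
# Column order of X and the importance scores printed above for the first tree
feature_names = ["ActualElapsedTime", "Distance", "TaxiIn", "TaxiOut"]
importance = [0.11209, 0.17199, 0.26523, 0.45069]

# Pair each score with its column name and sort from most to least important
ranked = sorted(zip(feature_names, importance), key=lambda pair: pair[1], reverse=True)

for name, score in ranked:
    print(f"{name}: {score:.2f}")
```

With the fitted model in memory, `X.columns` and `clf.feature_importances_` would take the place of the two hard-coded lists.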


### Level 1
### Exercise 3 : Parameter Tuning
Train them using the different parameters they support.

- https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html?highlight=decision%20tree%20classifier#

Default parameters:

class sklearn.tree.DecisionTreeClassifier(*, criterion='gini', splitter='best', max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features=None, random_state=None, max_leaf_nodes=None, min_impurity_decrease=0.0, class_weight=None, ccp_alpha=0.0)


Here are the parameters I used in the first case:

In [20]:
clf.get_params()

{'ccp_alpha': 0.0,
 'class_weight': None,
 'criterion': 'gini',
 'max_depth': None,
 'max_features': None,
 'max_leaf_nodes': None,
 'min_impurity_decrease': 0.0,
 'min_impurity_split': None,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'presort': 'deprecated',
 'random_state': 324,
 'splitter': 'best'}

For my second tree, I will change the criterion and limit the tree to a maximum depth of 20.
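
To make the difference between the two criteria concrete, here is a minimal sketch of the impurity measures behind `'gini'` and `'entropy'`, evaluated on the class proportions of a single node. The helper names `gini` and `entropy` are illustrative and not part of scikit-learn. Entropy is measured in bits (log base 2) here.

```python
import math

def gini(proportions):
    """Gini impurity: 1 - sum(p_k^2)."""
    return 1.0 - sum(p * p for p in proportions)

def entropy(proportions):
    """Shannon entropy in bits: sum(-p_k * log2(p_k)), skipping empty classes."""
    return sum(-p * math.log2(p) for p in proportions if p > 0)

balanced_node = [0.5, 0.5]  # equal mix of delayed and on-time flights
pure_node = [1.0, 0.0]      # only one class present

print("balanced:", gini(balanced_node), entropy(balanced_node))
print("pure:    ", gini(pure_node), entropy(pure_node))
```

Both measures are zero for a pure node and largest for an even class mix. They differ in how they score the cases in between, which is why the two trees can choose different splits.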

In [21]:
# Create Decision Tree classifier object with different parameters
clf2 = DecisionTreeClassifier(random_state=324, max_depth=20, criterion='entropy')

In [22]:
# Train Decision Tree Classifier
clf2 = clf2.fit(X_train,y_train)

In [23]:
#Predict the response for test dataset
y_pred2 = clf2.predict(X_test)

Limiting the maximum depth of the tree and changing the criterion had an effect on the results of the model. Both accuracy and balanced accuracy improved slightly, from about 0.884 to about 0.889. Precision and recall also shifted relative to the first tree: recall for class 1 rose from 0.88 to 0.94, while recall for class 0 fell from 0.89 to 0.84.

- https://en.wikipedia.org/wiki/Precision_and_recall

In [24]:
# Decision Tree Model Confusion Matrix
evaluateModel(y_test,y_pred2)

[[14509  2843]
 [  986 16112]]
Accuracy: 0.8888534107402032
Balanced Accuracy: 0.8892448259244392
Classification Report:
              precision    recall  f1-score   support

           0       0.94      0.84      0.88     17352
           1       0.85      0.94      0.89     17098

    accuracy                           0.89     34450
   macro avg       0.89      0.89      0.89     34450
weighted avg       0.89      0.89      0.89     34450
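
The per-class precision and recall in the report can be recomputed by hand from the confusion matrix above. This sketch assumes scikit-learn's layout: rows are true classes and columns are predicted classes.

```python
# Confusion-matrix counts reported above for the second tree
# (rows = true class, columns = predicted class)
matrix = [[14509, 2843],
          [986, 16112]]

report = {}
for cls in (0, 1):
    true_positives = matrix[cls][cls]
    predicted_as_cls = matrix[0][cls] + matrix[1][cls]  # column sum
    actually_cls = sum(matrix[cls])                     # row sum
    report[cls] = {
        "precision": true_positives / predicted_as_cls,
        "recall": true_positives / actually_cls,
    }
    print(f"class {cls}: precision={report[cls]['precision']:.2f}, "
          f"recall={report[cls]['recall']:.2f}")
```

The second tree labels more flights as delayed (18955 predictions versus 17098 actual delays), which raises recall for class 1 and lowers its precision.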



The order of importance for the features used in this model is:
1. TaxiOut (0.52)
2. TaxiIn (0.28)
3. Distance (0.13)
4. ActualElapsedTime (0.07)

In [25]:
# get importance
importance2 = clf2.feature_importances_

# summarize feature importance
for i,v in enumerate(importance2):
    print('Feature: %0d, Score: %.5f' % (i,v))

Feature: 0, Score: 0.07424
Feature: 1, Score: 0.12860
Feature: 2, Score: 0.27539
Feature: 3, Score: 0.52177


### Level 1
### Exercise 4 : Cross Validation
Compare your performance using the train / test or cross-validation approach.

- https://scikit-learn.org/stable/modules/cross_validation.html

Cross Validation on the first model of the Decision tree using the default parameters:

In [26]:
# k-fold cross validation
# CV model
model = DecisionTreeClassifier(random_state=324)
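# random_state only matters with shuffle=True; recent scikit-learn versions
# raise an error if it is set without shuffling, so it is omitted here
kfold = KFold(n_splits=10)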
results = cross_val_score(model, X, y, cv=kfold)
print("%0.2f accuracy with a standard deviation of %0.2f" % (results.mean(), results.std()))



0.88 accuracy with a standard deviation of 0.07


Stratified k-fold matters when the classes are imbalanced, because it keeps the class proportions the same in every fold. In this case, it gives a similar mean accuracy because I used SMOTE to balance the dataset before running the classification models.
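
A small sketch of what stratification changes, using hypothetical toy labels (90 negatives followed by 10 positives) rather than the flights data. The names `X_toy`, `y_toy`, and `positives_per_fold` are made up for this illustration.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Hypothetical imbalanced labels, sorted by class: 90 negatives then 10 positives
y_toy = np.array([0] * 90 + [1] * 10)
X_toy = np.zeros((len(y_toy), 1))  # features are irrelevant to how the folds are built

def positives_per_fold(splitter):
    """Count the positive labels that land in each test fold."""
    return [int(y_toy[test_idx].sum()) for _, test_idx in splitter.split(X_toy, y_toy)]

plain = positives_per_fold(KFold(n_splits=5))
stratified = positives_per_fold(StratifiedKFold(n_splits=5))

print("KFold:          ", plain)
print("StratifiedKFold:", stratified)
```

Without shuffling, plain `KFold` cuts the data into consecutive blocks, so all ten positives land in the last fold. `StratifiedKFold` places two in each fold.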

In [27]:
# CV model
model = DecisionTreeClassifier(random_state=324)
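# Using stratified KFold
# random_state only matters with shuffle=True; recent scikit-learn versions
# raise an error if it is set without shuffling, so it is omitted here
kfold = StratifiedKFold(n_splits=10)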
results = cross_val_score(model, X, y, cv=kfold)
print("%0.2f accuracy with a standard deviation of %0.2f" % (results.mean(), results.std()))



0.88 accuracy with a standard deviation of 0.11


Cross Validation on the second model of the Decision tree using a maximum depth and 'entropy' as the criterion.

In [28]:
# k-fold cross validation
# CV model
model = DecisionTreeClassifier(random_state=324, max_depth=20, criterion='entropy')
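# random_state only matters with shuffle=True; recent scikit-learn versions
# raise an error if it is set without shuffling, so it is omitted here
kfold = KFold(n_splits=10)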
results = cross_val_score(model, X, y, cv=kfold)
print("%0.2f accuracy with a standard deviation of %0.2f" % (results.mean(), results.std()))



0.88 accuracy with a standard deviation of 0.03


In [29]:
# CV model
model = DecisionTreeClassifier(random_state=324, max_depth=20, criterion='entropy')
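# Using stratified KFold
# random_state only matters with shuffle=True; recent scikit-learn versions
# raise an error if it is set without shuffling, so it is omitted here
kfold = StratifiedKFold(n_splits=10)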
results = cross_val_score(model, X, y, cv=kfold)
print("%0.2f accuracy with a standard deviation of %0.2f" % (results.mean(), results.std()))



0.88 accuracy with a standard deviation of 0.12


Both trees perform similarly under cross-validation, with a mean accuracy of 0.88 in every run. The second model is simpler, though, because it is limited to a depth of 20, compared with a depth of 54 for the first tree.
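
One way to quantify "simpler" is to compare depth and leaf count directly. The sketch below does this on synthetic stand-in data from `make_classification`, not on the flights dataset, and uses a cap of 5 instead of 20 only because the toy data is small.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption; not the flights dataset)
X_demo, y_demo = make_classification(n_samples=2000, n_features=4, n_informative=3,
                                     n_redundant=1, flip_y=0.1, random_state=0)

# Same two configurations as in this notebook, with a smaller depth cap
full_tree = DecisionTreeClassifier(random_state=324).fit(X_demo, y_demo)
capped_tree = DecisionTreeClassifier(random_state=324, max_depth=5,
                                     criterion="entropy").fit(X_demo, y_demo)

for label, tree in [("unrestricted", full_tree), ("max_depth=5", capped_tree)]:
    print(label, "depth:", tree.get_depth(), "leaves:", tree.get_n_leaves())
```

A binary tree of depth `d` has at most `2**d` leaves, so capping the depth bounds the number of leaves as well.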

### Level 2
### Exercise 5 :  Feature Engineering

I skipped this level due to time constraints.

### Level 3
### Exercise 6 : Don't use DepDelay Variable

My model did not use the DepDelay variable.