# <font color='#d50283'>IT Academy - Data Science Itinerary</font>
## Sprint 10 Task 1 - Supervised Classification
### Model 3: Random Forest

### Assignment by: Kat Weissman

#### General objective:

- Practice and become familiar with classification algorithms.

#### Python Learning Objectives:
- Classification trees
- KNN - k-Nearest Neighbors
- Logistic Regression
- Support Vector Machine
- XGBoost

*Recommended learning resources:*
- https://www.datacamp.com/community/tutorials/decision-tree-classification-python
- https://towardsdatascience.com/how-to-best-evaluate-a-classification-model-2edb12bcc587
- https://www.ritchieng.com/machine-learning-evaluate-classification-model/
- https://towardsdatascience.com/hackcvilleds-4636c6c1ba53
- https://scikit-learn.org/stable/modules/cross_validation.html

Classification Models:
- https://stackabuse.com/k-nearest-neighbors-algorithm-in-python-and-scikit-learn/
- https://www.datacamp.com/community/tutorials/understanding-logistic-regression-python
- https://towardsdatascience.com/logistic-regression-using-python-sklearn-numpy-mnist-handwriting-recognition-matplotlib-a6b31e2b166a
- https://www.datacamp.com/community/tutorials/svm-classification-scikit-learn-python
- https://www.datacamp.com/community/tutorials/xgboost-in-python

Imbalanced Data:
- https://towardsdatascience.com/methods-for-dealing-with-imbalanced-data-5b761be45a18
- https://www.kdnuggets.com/2019/05/fix-unbalanced-dataset.html
- https://machinelearningmastery.com/tactics-to-combat-imbalanced-classes-in-your-machine-learning-dataset/
- https://machinelearningmastery.com/smote-oversampling-for-imbalanced-classification/

In [2]:
#Import libraries - basic
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats

In [3]:
#Import libraries - classification
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import StratifiedKFold

### Level 1
### Exercise 1 
Create at least three different classification models to predict flight delay (ArrDelay) in DelayedFlights.csv as accurately as possible. Treat the target as binary: whether the flight arrived late or not (ArrDelay > 0).
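
The binary target described above could be derived roughly as follows. This is a minimal sketch, not the code from the pre-processing notebook: the small `flights` frame is hypothetical, with the first three `ArrDelay` values taken from the sample rows shown later and the negative value made up for illustration.

```python
import pandas as pd

# Hypothetical mini-frame; only the ArrDelay column matters here
flights = pd.DataFrame({"ArrDelay": [0.0, 15.0, 11.0, -4.0]})

# A flight counts as delayed only when ArrDelay is strictly greater than 0
flights["Delayed"] = (flights["ArrDelay"] > 0).astype(int)

print(flights)
```

Note that a flight arriving exactly on time (ArrDelay equal to 0) falls into the "not delayed" class under this rule.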

Reference: https://towardsdatascience.com/lazy-predict-fit-and-evaluate-all-the-models-from-scikit-learn-with-a-single-line-of-code-7fe510c7281

In [4]:
pd.set_option('display.max_columns', None)  #set display to show all columns

Note about imbalanced data:

I will load data that I pre-processed and resampled with the SMOTE technique in a separate notebook. This produced a balanced dataset from the original imbalanced one.

- https://github.com/KatBCN/Supervisat_Classificacio/blob/main/Sprint%2010%20-%20Classification%20Model%20-%20Pre-Processing.ipynb
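
As a rough illustration of the idea behind SMOTE (and not the code used in the pre-processing notebook linked above), new minority-class rows can be created by interpolating between a minority sample and one of its nearest minority neighbours. The function name `smote_like_samples`, the choice of `k`, and the toy points are all assumptions made for this sketch; a library implementation handles many more details.

```python
import numpy as np

def smote_like_samples(X_minority, n_new, k=3, seed=0):
    """Create n_new synthetic rows by interpolating between minority neighbours."""
    rng = np.random.default_rng(seed)
    X_minority = np.asarray(X_minority, dtype=float)
    n = len(X_minority)

    # Pairwise Euclidean distances between all minority samples
    diffs = X_minority[:, None, :] - X_minority[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbour

    # Indices of the k nearest minority neighbours of each sample
    neighbors = np.argsort(dists, axis=1)[:, :k]

    new_rows = []
    for _ in range(n_new):
        i = rng.integers(n)                  # random minority sample
        j = neighbors[i, rng.integers(k)]    # one of its k nearest neighbours
        gap = rng.random()                   # position along the segment i -> j
        new_rows.append(X_minority[i] + gap * (X_minority[j] - X_minority[i]))
    return np.array(new_rows)

# Toy minority class with two features (hypothetical values)
minority = np.array([[1.0, 2.0], [1.5, 1.8], [2.0, 2.2], [1.2, 2.5], [1.8, 1.6]])
synthetic = smote_like_samples(minority, n_new=10)
print(synthetic.shape)
```

Because every synthetic row lies on a segment between two existing minority points, the new rows stay inside the range already covered by the minority class.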

In [5]:
data_link = 'https://github.com/KatBCN/Supervisat_Classificacio/blob/main/flights-sampled-smoted.pkl.bz2?raw=true'
df = pd.read_pickle(data_link,compression='bz2')

#### Data Exploration

In [6]:
# Show number of rows and columns in dataframe
df.shape

(172248, 20)

In [7]:
# Show column names
df.columns

Index(['Month', 'DayOfWeek', 'DepTime', 'CRSDepTime', 'ArrTime', 'CRSArrTime',
       'UniqueCarrier', 'ActualElapsedTime', 'CRSElapsedTime', 'AirTime',
       'ArrDelay', 'DepDelay', 'Origin', 'Dest', 'Distance', 'TaxiIn',
       'TaxiOut', 'Cancelled', 'Diverted', 'Delayed'],
      dtype='object')

In [8]:
# Display first 5 rows of dataframe
df.head(5)

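| | Month | DayOfWeek | DepTime | CRSDepTime | ArrTime | CRSArrTime | UniqueCarrier | ActualElapsedTime | CRSElapsedTime | AirTime | ArrDelay | DepDelay | Origin | Dest | Distance | TaxiIn | TaxiOut | Cancelled | Diverted | Delayed |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7 | 5 | 1511.0 | 1505 | 1740.0 | 1740 | WN | 269.0 | 275.0 | 252.0 | 0.0 | 6.0 | BNA | OAK | 1959 | 5.0 | 12.0 | 0 | 0 | 0 |
| 1 | 1 | 5 | 1736.0 | 1730 | 1913.0 | 1858 | FL | 97.0 | 88.0 | 69.0 | 15.0 | 6.0 | BOS | BWI | 370 | 9.0 | 19.0 | 0 | 0 | 1 |
| 2 | 4 | 2 | 1611.0 | 1600 | 1711.0 | 1700 | WN | 60.0 | 60.0 | 41.0 | 11.0 | 11.0 | ONT | LAS | 197 | 3.0 | 16.0 | 0 | 0 | 1 |
| 3 | 5 | 5 | 1551.0 | 1535 | 1803.0 | 1740 | WN | 132.0 | 125.0 | 121.0 | 23.0 | 16.0 | STL | HOU | 687 | 3.0 | 8.0 | 0 | 0 | 1 |
| 4 | 4 | 3 | 1754.0 | 1725 | 1834.0 | 1820 | WN | 100.0 | 115.0 | 86.0 | 14.0 | 29.0 | SLC | LAX | 590 | 6.0 | 8.0 | 0 | 0 | 1 |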


In [9]:
# check data set variables
df.dtypes

Month                 object
DayOfWeek             object
DepTime              float64
CRSDepTime             int64
ArrTime              float64
CRSArrTime             int64
UniqueCarrier         object
ActualElapsedTime    float64
CRSElapsedTime       float64
AirTime              float64
ArrDelay             float64
DepDelay             float64
Origin                object
Dest                  object
Distance               int64
TaxiIn               float64
TaxiOut              float64
Cancelled             object
Diverted              object
Delayed                int64
dtype: object

In [10]:
# Check for duplicates
sum(df.duplicated())

0

### Model 3 - Random Forest Classifier

The third model will be an ensemble Random Forest Classifier.

I would like to know whether we can predict if a flight will be delayed based on actual elapsed time, distance, TaxiIn, and TaxiOut.

I performed the same task using a Decision Tree Classifier and a Logistic Regression Classifier in different notebooks:
- https://github.com/KatBCN/Supervisat_Classificacio/blob/main/Sprint%2010%20-%20Classification%20Model%20-%20Classification%20Tree.ipynb
- https://github.com/KatBCN/Supervisat_Classificacio/blob/main/Sprint%2010%20-%20Classification%20Model%20-%20Logistic%20Regression%20Classifier.ipynb

The Decision Tree Classifier performed much better than the Logistic Regression Classifier.

In [11]:
X = df[['ActualElapsedTime', 'Distance', 'TaxiIn', 'TaxiOut']]
y = df['Delayed']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2, random_state=42)

In [12]:
X.head()

| | ActualElapsedTime | Distance | TaxiIn | TaxiOut |
|---|---|---|---|---|
| 0 | 269.0 | 1959 | 5.0 | 12.0 |
| 1 | 97.0 | 370 | 9.0 | 19.0 |
| 2 | 60.0 | 197 | 3.0 | 16.0 |
| 3 | 132.0 | 687 | 3.0 | 8.0 |
| 4 | 100.0 | 590 | 6.0 | 8.0 |


References:
- https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html

In [13]:
# Create Random Forest Classifier
clf = RandomForestClassifier(random_state=324)

In [14]:
# train model
clf.fit(X_train, y_train)

RandomForestClassifier(random_state=324)

In [15]:
# make predictions for test data
y_pred = clf.predict(X_test)

### Level 1
### Exercise 2 : Confusion Matrix & Metrics
Compare classification models using accuracy, a confusion matrix, and other more advanced metrics.

Balanced Accuracy is important for imbalanced datasets. https://scikit-learn.org/stable/modules/generated/sklearn.metrics.balanced_accuracy_score.html

In this case, I balanced the data using SMOTE before training the model, and I am also testing on the SMOTEd dataset, so the balanced accuracy should be similar to the accuracy.

In [16]:
def evaluateModel(y_test, y_pred):
    print(metrics.confusion_matrix(y_test, y_pred))
    print("Accuracy:",metrics.accuracy_score(y_test, y_pred))
    print("Balanced Accuracy:",metrics.balanced_accuracy_score(y_test, y_pred))
    print("Classification Report:")
    print(metrics.classification_report(y_test, y_pred))

The Random Forest classifier performs even better than the Decision Tree Classifier, as expected. Balanced accuracy increased from 88.4% with the Decision Tree to 91.9% with the Random Forest.

In [17]:
# Random Forest Confusion Matrix
evaluateModel(y_test,y_pred)

[[15329  2023]
 [  773 16325]]
Accuracy: 0.9188388969521045
Balanced Accuracy: 0.9191020247987614
Classification Report:
              precision    recall  f1-score   support

           0       0.95      0.88      0.92     17352
           1       0.89      0.95      0.92     17098

    accuracy                           0.92     34450
   macro avg       0.92      0.92      0.92     34450
weighted avg       0.92      0.92      0.92     34450
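
To make the relationship between the two scores concrete, the accuracy and balanced accuracy reported above can be recomputed by hand from the four cells of the confusion matrix. This sketch assumes scikit-learn's layout for a binary problem (rows are true classes, columns are predicted classes); the variable names are chosen for illustration.

```python
# Cells of the confusion matrix printed above
tn, fp = 15329, 2023   # true class 0: predicted 0, predicted 1
fn, tp = 773, 16325    # true class 1: predicted 0, predicted 1

# Accuracy: share of all test rows that were classified correctly
accuracy = (tn + tp) / (tn + fp + fn + tp)

# Balanced accuracy: mean of the recall obtained on each class
recall_class_0 = tn / (tn + fp)
recall_class_1 = tp / (fn + tp)
balanced_accuracy = (recall_class_0 + recall_class_1) / 2

print("Accuracy:", accuracy)
print("Balanced Accuracy:", balanced_accuracy)
```

The two values are close because the test set holds nearly the same number of rows in each class (17352 and 17098), so the per-class average and the overall average weight the classes almost equally.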



### Level 1
### Exercise 3 : Parameter Tuning
Train them using the different parameters they support.

- https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html

Default parameters:

class sklearn.ensemble.RandomForestClassifier(n_estimators=100, *, criterion='gini', max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features='auto', max_leaf_nodes=None, min_impurity_decrease=0.0, bootstrap=True, oob_score=False, n_jobs=None, random_state=None, verbose=0, warm_start=False, class_weight=None, ccp_alpha=0.0, max_samples=None)


Here are the parameters I used in the first case:

In [18]:
clf.get_params()

{'bootstrap': True,
 'ccp_alpha': 0.0,
 'class_weight': None,
 'criterion': 'gini',
 'max_depth': None,
 'max_features': 'auto',
 'max_leaf_nodes': None,
 'max_samples': None,
 'min_impurity_decrease': 0.0,
 'min_impurity_split': None,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'n_estimators': 100,
 'n_jobs': None,
 'oob_score': False,
 'random_state': 324,
 'verbose': 0,
 'warm_start': False}

For my second Random Forest model, I will use a different split criterion (entropy) and set the maximum depth to 20.

In [19]:
# Create Random Forest classifier object with different parameters
clf2 = RandomForestClassifier(random_state=324, criterion='entropy', max_depth=20)

In [20]:
# Train classifier
clf2 = clf2.fit(X_train,y_train)

In [21]:
#Predict the response for test dataset
y_pred2 = clf2.predict(X_test)

The results are similar to the first model, but slightly worse: balanced accuracy drops from 91.9% to 91.1%. Overall, the random forest classifier is consistent.

In [22]:
# Confusion Matrix
evaluateModel(y_test,y_pred2)

[[15274  2078]
 [  995 16103]]
Accuracy: 0.9107982583454282
Balanced Accuracy: 0.9110252057121313
Classification Report:
              precision    recall  f1-score   support

           0       0.94      0.88      0.91     17352
           1       0.89      0.94      0.91     17098

    accuracy                           0.91     34450
   macro avg       0.91      0.91      0.91     34450
weighted avg       0.91      0.91      0.91     34450
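
Instead of trying parameter combinations one at a time, a grid search could evaluate several of them with cross-validation in one call. The sketch below is an assumption-laden illustration rather than part of the original analysis: it uses synthetic data from `make_classification` in place of the flight data, a deliberately small forest to keep it fast, and a hypothetical grid over the two parameters varied above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the four flight features (not the real data)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=4, n_informative=3, n_redundant=1, random_state=0
)

# Hypothetical grid over the two parameters changed in the second model
param_grid = {"criterion": ["gini", "entropy"], "max_depth": [5, 20, None]}

search = GridSearchCV(
    RandomForestClassifier(n_estimators=20, random_state=324),
    param_grid,
    scoring="balanced_accuracy",  # same metric used to compare the models above
    cv=3,
)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(round(search.best_score_, 3))
```

On the full flight dataset, a search like this would take considerably longer, so a smaller grid or a subsample may be more practical.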



### Level 1
### Exercise 4 : Cross Validation
Compare your performance using the train / test or cross-validation approach.

- https://scikit-learn.org/stable/modules/cross_validation.html

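Cross-validation on the first Random Forest model, using the default parameters:

The mean cross-validation accuracy (0.92) is consistent with the accuracy from the train/test split, although the standard deviation of 0.05 shows some variation across folds.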

In [23]:
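# k-fold cross validation
# CV model
model = RandomForestClassifier(random_state=324)
# random_state only has an effect when shuffle=True, so it is omitted here
# (recent scikit-learn versions raise an error if it is set without shuffling)
kfold = KFold(n_splits=10)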
results = cross_val_score(model, X, y, cv=kfold)
print("%0.2f accuracy with a standard deviation of %0.2f" % (results.mean(), results.std()))



0.92 accuracy with a standard deviation of 0.05
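
`StratifiedKFold` is imported at the top of this notebook but not used. As a possible alternative to plain `KFold`, it keeps the class proportions roughly equal in every fold and can shuffle the rows before splitting. The sketch below runs on synthetic data from `make_classification` with a small forest, both of which are assumptions for illustration; it makes no claim about how the flight data would score.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for the four flight features (not the real data)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=4, n_informative=3, n_redundant=1, random_state=0
)

# Shuffle before splitting and preserve the class ratio in each fold;
# random_state is meaningful here because shuffle=True
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

model_demo = RandomForestClassifier(n_estimators=20, random_state=324)
scores = cross_val_score(model_demo, X_demo, y_demo, cv=skf)

print("%0.2f accuracy with a standard deviation of %0.2f" % (scores.mean(), scores.std()))
```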
