# STATEMENT
We are going to practice and become familiar with classification algorithms.

Level 1

- Exercise 1
Create at least three different classification models to predict, as accurately as possible, the flight delays (ArrDelay) in DelayedFlights.csv. Treat the target as binary: whether the flight arrived late or not (ArrDelay > 0).

- Exercise 2
Compare the classification models using accuracy, a confusion matrix and other more advanced metrics.

- Exercise 3
Train them using the different hyperparameters they accept.

- Exercise 4
Compare their performance using the train/test or cross-validation approach.

Level 2
- Task 5
Carry out some feature engineering to improve your prediction.

Level 3
- Exercise 6
Do not use the variable DepDelay when making predictions.

# Level 1
## - Exercise 1
Create at least three different classification models to predict, as accurately as possible, the flight delays (ArrDelay) in DelayedFlights.csv. Treat the target as binary: whether the flight arrived late or not (ArrDelay > 0).

### 1.1.- Decision Tree Classification

A decision tree is a flowchart-like structure in which each internal node tests a feature (or attribute), each branch represents a decision rule, and each leaf node represents an outcome. The topmost node is known as the root node. The tree is built by recursive partitioning: the algorithm splits the data on the value of one attribute and then repeats the process on each resulting subset. Because the result can be drawn as a flowchart that mirrors human reasoning, decision trees are easy to understand and interpret.

![](2022-03-15-11-32-53.png)

A decision tree is a white-box type of ML algorithm: its internal decision-making logic can be inspected, which is not possible with black-box algorithms such as neural networks. It also tends to train faster than a neural network. The time complexity of a decision tree is a function of the number of records and the number of attributes in the data. The decision tree is a distribution-free (non-parametric) method, so it does not depend on assumptions about the underlying probability distribution. Decision trees can also handle high-dimensional data with good accuracy.

From: https://app.datacamp.com/workspace/w/0ede46e9-76cd-4232-9215-adb63ba6efad/edit
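
The white-box property described above can be seen directly by printing the rules of a fitted tree. The following is a minimal sketch, not the notebook's model: it assumes synthetic data from `make_classification` and hypothetical feature names `f0` to `f3`, and it uses scikit-learn's `export_text` to render each split and leaf.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic stand-in data (an assumption; this is not the flights dataset).
X_demo, y_demo = make_classification(
    n_samples=200, n_features=4, n_informative=2, n_redundant=0, random_state=0
)
feature_names = ["f0", "f1", "f2", "f3"]  # hypothetical feature names

# A shallow tree keeps the printed rules short.
tree_demo = DecisionTreeClassifier(max_depth=2, random_state=0)
tree_demo.fit(X_demo, y_demo)

# export_text renders the learned decision rules, one line per node.
rules = export_text(tree_demo, feature_names=feature_names)
print(rules)
```

Each indented line in the printout is a threshold test on one feature, and each `class:` line is a leaf, which matches the root / internal node / leaf vocabulary used above.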

In [1]:
# Load libraries
from IPython.display import display
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split  # Train/test split helper
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier  # Decision tree classifier
from sklearn import metrics  # scikit-learn metrics module, used for accuracy calculation

import Pers_lib as Pers # Import Personal functions ( my functions :) )

# settings to display all columns (default is 20, now is None (all))
pd.set_option("display.max_columns", None)


In [2]:
# Load dataset
df = pd.read_csv(r'..\Data\DelayedFlights.csv')  # raw string, so the backslashes are not read as escape sequences

# Dataset information: ✈
![](2022-02-28-17-46-54.png)

In [3]:
# Description of raw dataframe
df.describe().T

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| Unnamed: 0 | 1936758.0 | 3341651.0 | 2066065.0 | 0.0 | 1517452.5 | 3242558.0 | 4972466.75 | 7009727.0 |
| Year | 1936758.0 | 2008.0 | 0.0 | 2008.0 | 2008.0 | 2008.0 | 2008.0 | 2008.0 |
| Month | 1936758.0 | 6.111106 | 3.482546 | 1.0 | 3.0 | 6.0 | 9.0 | 12.0 |
| DayofMonth | 1936758.0 | 15.75347 | 8.776272 | 1.0 | 8.0 | 16.0 | 23.0 | 31.0 |
| DayOfWeek | 1936758.0 | 3.984827 | 1.995966 | 1.0 | 2.0 | 4.0 | 6.0 | 7.0 |
| DepTime | 1936758.0 | 1518.534 | 450.4853 | 1.0 | 1203.0 | 1545.0 | 1900.0 | 2400.0 |
| CRSDepTime | 1936758.0 | 1467.473 | 424.7668 | 0.0 | 1135.0 | 1510.0 | 1815.0 | 2359.0 |
| ArrTime | 1929648.0 | 1610.141 | 548.1781 | 1.0 | 1316.0 | 1715.0 | 2030.0 | 2400.0 |
| CRSArrTime | 1936758.0 | 1634.225 | 464.6347 | 0.0 | 1325.0 | 1705.0 | 2014.0 | 2400.0 |
| FlightNum | 1936758.0 | 2184.263 | 1944.702 | 1.0 | 610.0 | 1543.0 | 3422.0 | 9742.0 |


## Data Preprocessing

Processing steps carried over from the previous task:

* Delete "non-relevant" columns:
    *    'Unnamed: 0' ➡️ repeated index
    *    ['FlightNum','TailNum'] ➡️ identifiers that give no useful information
    *    ['Origin','Dest'] ➡️ this info is already in the Distance column
    *    ['CarrierDelay','WeatherDelay','NASDelay','SecurityDelay','LateAircraftDelay'] ➡️ the kind of delay isn't relevant, and their sum is fully dependent on ArrDelay
    *    ['Cancelled','CancellationCode','Diverted'] ➡️ carry no information once the rows with NaNs are deleted
    *    'DepDelay' ➡️ strongly dependent on ArrDelay
* Clean NaNs
* Delete duplicates
* Create the column Date of the flight
* Sample the data (1% stratified by airline)
* Standardize all numerical columns (except day of week)
* One-hot encode (OHE) the UniqueCarrier attribute
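
The last three steps of the list (stratified sampling, standardization and one-hot encoding) are not shown in the cells of this section. A minimal sketch of how they could look follows. It assumes a toy frame with made-up values in place of the flights data, `train_test_split` as the sampling tool, and a 10% sample (rather than the 1% stated above) so that the toy frame keeps enough rows. The `Carrier` prefix is a hypothetical naming choice.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Toy stand-in for the preprocessed flights frame (all values are made up).
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "UniqueCarrier": rng.choice(["WN", "AA", "UA"], size=300),
    "Distance": rng.integers(100, 2500, size=300).astype(float),
    "AirTime": rng.normal(110, 60, size=300),
})

# Stratified sample: the carrier proportions of the full frame are preserved.
sample, _ = train_test_split(
    toy, train_size=0.10, stratify=toy["UniqueCarrier"], random_state=0
)
sample = sample.copy()

# Standardize the numerical columns to zero mean and unit variance.
num_cols = ["Distance", "AirTime"]
sample[num_cols] = StandardScaler().fit_transform(sample[num_cols])

# One-hot encode the carrier into one indicator column per airline.
sample = pd.get_dummies(sample, columns=["UniqueCarrier"], prefix="Carrier")
print(sample.head())
```

When a model is evaluated on held-out data, the scaler would normally be fitted on the training split only and then applied to the test split, so that test statistics do not leak into training.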

In [4]:
# Delete columns that we find not relevant for our model.
try:
    # Drop the first column, which is a repeated index.
    df = df.drop(columns='Unnamed: 0')
    # Drop FlightNum and TailNum, as these columns don't give us any useful information.
    df = df.drop(columns=['FlightNum','TailNum'])
    # Drop Origin and Dest, as this info is already in the Distance column.
    df = df.drop(columns=['Origin','Dest'])
    # Finally, drop the delay columns other than ArrDelay. ArrDelay is the sum of all the others,
    # so they are completely dependent on it, and we don't think that the kind of delay
    # will be relevant.
    df = df.drop(columns=['CarrierDelay','WeatherDelay','NASDelay','SecurityDelay','LateAircraftDelay'])
    # Once the rows with NaNs are deleted, the cancelled and diverted flights are gone as well,
    # so the Cancelled, CancellationCode and Diverted columns no longer carry any information.
    # As those flights were so few, it is OK to drop them, but they could be useful for another exploration:
    # in another dataset / practice we could extract only the cancelled / diverted flights to reach interesting conclusions.
    df = df.drop(columns=['Cancelled','CancellationCode','Diverted'])
    # Also drop the DepDelay column, as it is strongly dependent on ArrDelay,
    # and we want to find different relationships with the other variables.
    df = df.drop(columns='DepDelay')
except KeyError:
    # Raised by drop() when the cell is re-run and the columns no longer exist.
    print("Columns already deleted")

# Drop all rows with NaNs (as explained in S0901, the % of NaNs is very small in all the columns, 0.4% max).
list_cols = df.columns
array_cols = list_cols.values
NumTotalRegisters = df.shape[0]
df = df.dropna(subset=array_cols)
print(f"Number of registers deleted are {NumTotalRegisters-df.shape[0]}")
print(f"% of registers with NaNs deleted are {((NumTotalRegisters-df.shape[0])/NumTotalRegisters)*100:.2f}%")

# Delete duplicates
index_dupl_df = df.duplicated()
print("Num. duplicates =", index_dupl_df.sum())
# As there are very few duplicates, we drop them.
df.drop_duplicates(inplace=True)

# Create the column Date of the flight and delete the columns Year / Month / DayofMonth. We keep DayOfWeek for potential correlations.
## Date of the flight
try:
    df['Date'] = pd.to_datetime(df['Year'].astype(str)+'-'+ df['Month'].astype(str)+'-'+ df['DayofMonth'].astype(str))
    df = df.drop(columns=['Year','Month','DayofMonth'])
except KeyError:
    # Raised when the cell is re-run and the source columns no longer exist.
    print("Date column already created and columns Year, Month & DayofMonth already deleted")

# Replace the numeric day of week with a string label for later use in plots
df.DayOfWeek = df.DayOfWeek.replace({1:"MON",2:"TUE",3:"WED",4:"THU",5:"FRI",6:"SAT",7:"SUN"})

Number of registers deleted are 8387
% of registers with NaNs deleted are 0.43%
Num. duplicates = 3


In [5]:
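# Description of preprocessed dataframe
# Note: datetime_is_numeric was removed in pandas 2.0, where datetimes are always summarized as numeric; drop the argument there.
display(df.describe(include='all',datetime_is_numeric=True).round(2).T)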

| | count | unique | top | freq | mean | min | 25% | 50% | 75% | max | std |
|---|---|---|---|---|---|---|---|---|---|---|---|
| DayOfWeek | 1928368.0 | 7.0 | FRI | 321982.0 | | | | | | | |
| DepTime | 1928368.0 | | | | 1518.65 | 1.0 | 1203.0 | 1545.0 | 1900.0 | 2400.0 | 450.44 |
| CRSDepTime | 1928368.0 | | | | 1467.72 | 0.0 | 1135.0 | 1510.0 | 1815.0 | 2359.0 | 424.73 |
| ArrTime | 1928368.0 | | | | 1610.24 | 1.0 | 1316.0 | 1715.0 | 2030.0 | 2400.0 | 548.0 |
| CRSArrTime | 1928368.0 | | | | 1634.2 | 0.0 | 1325.0 | 1705.0 | 2014.0 | 2359.0 | 464.63 |
| UniqueCarrier | 1928368.0 | 20.0 | WN | 376200.0 | | | | | | | |
| ActualElapsedTime | 1928368.0 | | | | 133.31 | 14.0 | 80.0 | 116.0 | 165.0 | 1114.0 | 72.06 |
| CRSElapsedTime | 1928368.0 | | | | 134.2 | -21.0 | 82.0 | 116.0 | 165.0 | 660.0 | 71.23 |
| AirTime | 1928368.0 | | | | 108.28 | 0.0 | 58.0 | 90.0 | 137.0 | 1091.0 | 68.64 |
| ArrDelay | 1928368.0 | | | | 42.2 | -109.0 | 9.0 | 24.0 | 56.0 | 2461.0 | 56.78 |


In [6]:
df.head()

| | DayOfWeek | DepTime | CRSDepTime | ArrTime | CRSArrTime | UniqueCarrier | ActualElapsedTime | CRSElapsedTime | AirTime | ArrDelay | Distance | TaxiIn | TaxiOut | Date |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | THU | 2003.0 | 1955 | 2211.0 | 2225 | WN | 128.0 | 150.0 | 116.0 | -14.0 | 810 | 4.0 | 8.0 | 2008-01-03 |
| 1 | THU | 754.0 | 735 | 1002.0 | 1000 | WN | 128.0 | 145.0 | 113.0 | 2.0 | 810 | 5.0 | 10.0 | 2008-01-03 |
| 2 | THU | 628.0 | 620 | 804.0 | 750 | WN | 96.0 | 90.0 | 76.0 | 14.0 | 515 | 3.0 | 17.0 | 2008-01-03 |
| 3 | THU | 1829.0 | 1755 | 1959.0 | 1925 | WN | 90.0 | 90.0 | 77.0 | 34.0 | 515 | 3.0 | 10.0 | 2008-01-03 |
| 4 | THU | 1940.0 | 1915 | 2121.0 | 2110 | WN | 101.0 | 115.0 | 87.0 | 11.0 | 688 | 4.0 | 10.0 | 2008-01-03 |
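
From this preprocessed frame, the binary target required by the statement (`ArrDelay > 0`) can be derived in one line. The sketch below uses only the five `ArrDelay` values from the `df.head()` output above; the frame name `head` and the column name `Late` are hypothetical choices, not taken from the notebook.

```python
import pandas as pd

# The five ArrDelay values shown in the df.head() output above.
head = pd.DataFrame({"ArrDelay": [-14.0, 2.0, 14.0, 34.0, 11.0]})

# Binary target from the statement: 1 if the flight arrived late (ArrDelay > 0), else 0.
head["Late"] = (head["ArrDelay"] > 0).astype(int)  # "Late" is a hypothetical column name

print(head["Late"].tolist())  # [0, 1, 1, 1, 1]
```

Since the 25th percentile of `ArrDelay` in the description above is 9.0, more than three quarters of the rows fall in the positive class, so accuracy alone is likely to look optimistic and is worth reading alongside other metrics.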



## - Exercise 2
Compare the classification models using accuracy, a confusion matrix and other more advanced metrics.
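
One possible way to compute these metrics with scikit-learn is sketched below. It is an illustration under stated assumptions, not the notebook's result: the data is synthetic (`make_classification`, with an imbalanced positive class chosen to loosely resemble a mostly-late dataset), and the single model is a shallow decision tree.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import (
    accuracy_score, confusion_matrix, f1_score,
    precision_score, recall_score, roc_auc_score,
)
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced binary problem (an assumption; not the flights data).
X, y = make_classification(
    n_samples=500, n_features=6, weights=[0.25, 0.75], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)
y_pred = clf.predict(X_test)
y_score = clf.predict_proba(X_test)[:, 1]  # probability of the positive class

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_test, y_pred)
scores = {
    "accuracy": accuracy_score(y_test, y_pred),
    "precision": precision_score(y_test, y_pred),
    "recall": recall_score(y_test, y_pred),
    "f1": f1_score(y_test, y_pred),
    "roc_auc": roc_auc_score(y_test, y_score),
}
print(cm)
print(scores)
```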



## - Exercise 3
Train them using the different hyperparameters they accept.
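
A common way to explore the parameters of a model is an exhaustive grid search with cross-validation. The sketch below assumes synthetic data and an illustrative grid over three `DecisionTreeClassifier` parameters; the specific values in `param_grid` are arbitrary examples, not tuned recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption; not the flights dataset).
X, y = make_classification(n_samples=400, n_features=6, random_state=0)

# Illustrative grid: 2 * 3 * 2 = 12 parameter combinations.
param_grid = {
    "criterion": ["gini", "entropy"],
    "max_depth": [2, 4, 8],
    "min_samples_leaf": [1, 10],
}

# Each combination is scored by 5-fold cross-validation on the F1 score.
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0), param_grid, cv=5, scoring="f1"
)
search.fit(X, y)

print(search.best_params_)
print(round(search.best_score_, 3))
```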



## - Exercise 4
Compare their performance using the train/test or cross-validation approach.
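
The cross-validation approach can be sketched with `cross_val_score`, which repeats the train/test split over several folds and returns one score per fold. The example assumes synthetic data, and the logistic regression and k-nearest-neighbours models are illustrative choices to reach three models; this section of the notebook itself only names the decision tree.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption; not the flights dataset).
X, y = make_classification(n_samples=400, n_features=6, random_state=0)

# Three candidate classifiers; the last two are illustrative choices.
models = {
    "decision_tree": DecisionTreeClassifier(max_depth=4, random_state=0),
    "logistic_regression": LogisticRegression(max_iter=1000),
    "knn": KNeighborsClassifier(n_neighbors=5),
}

# Mean accuracy over 5 folds for each model.
cv_means = {
    name: cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()
    for name, model in models.items()
}
for name, mean_acc in cv_means.items():
    print(f"{name}: {mean_acc:.3f}")
```

Compared with a single train/test split, the fold average is less sensitive to which rows happen to land in the test set, at the cost of fitting each model five times.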



# Level 2
## - Task 5
Carry out some feature engineering to improve your prediction.
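
One candidate transformation, offered as a sketch rather than as the notebook's chosen approach, is to re-encode the time columns. It assumes that `CRSDepTime` stores clock times as HHMM integers, which the 0 to 2359 range in the description above suggests. The derived column names (`DepMinutes`, `DepSin`, `DepCos`) are hypothetical.

```python
import numpy as np
import pandas as pd

# CRSDepTime values from the df.head() output, assumed to be HHMM integers.
times = pd.DataFrame({"CRSDepTime": [1955, 735, 620, 1755, 1915]})

# HHMM -> minutes since midnight (for example 1955 -> 19 * 60 + 55 = 1195).
times["DepMinutes"] = (times["CRSDepTime"] // 100) * 60 + times["CRSDepTime"] % 100

# Cyclical encoding, so that times just before and just after midnight end up close together.
angle = 2 * np.pi * times["DepMinutes"] / (24 * 60)
times["DepSin"] = np.sin(angle)
times["DepCos"] = np.cos(angle)

print(times)
```

The raw HHMM value is not a linear scale (759 and 800 are one minute apart), which is why converting to minutes first may give distance-based or linear models a more faithful input.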



# Level 3
## - Exercise 6
Do not use the variable DepDelay when making predictions.