# Classification

Today you will consider data from the American Bureau of Transportation Statistics, which recorded (a lot of) data on flights in the US from 1987 to 2008 and analysed the causes of delays.
We will only look at data from 2008, using a subset of around 100,000 instances. We also removed some of the columns to simplify the analysis:

* we removed non-ordinal data
* we removed data that can only be known when the plane has already arrived

## Modelling task
For the majority of this module, our task is to build classifiers that can predict whether a flight will arrive with a *major delay* given the parameters at takeoff. We define a *major delay* as an arrival delay of 30 minutes or more.

Imagine that you are the data scientist for an airline. The company must refund customers if their flight is delayed by 30 minutes or more. If you can determine the scenarios that make flights late, the company could focus its efforts on improving them and you could save it a lot of money (not to mention make customers happy)!

One approach would be to put your flights into two classes, delayed and not delayed, and fit a classification model. Below we create the data we need for this classification model and briefly analyse it. Feel free to perform additional analyses of the data yourself!

## Import, view, and clean data

N.B. Future notebooks in this module do not depend on this notebook: they import `data/flights08_clean.csv`, which is the result of running this notebook.

In [None]:
# Import packages you'll need here
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import display


In [None]:
# import data/flights08_raw.csv
raw_data = ...
raw_data = pd.read_csv("data/flights08_raw.csv")


### Basic EDA

Use basic Pandas methods to get the raw data's:
1. shape
1. types
1. descriptive statistics
1. number of missing values in each column
1. number of unique values in each column

In [None]:
# I would run these in separate cells normally - I call display for convenience
display(raw_data.shape)
display(raw_data.dtypes)
display(raw_data.describe())
display(raw_data.isnull().sum())
display(raw_data.nunique())
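
The separate calls above can also be gathered into one per-column table, which makes it easier to spot columns that are mostly missing or nearly constant. The following is a minimal sketch on a small made-up DataFrame (`toy` and its values are assumptions for illustration, not rows from the real file); the same expressions should work on `raw_data`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for raw_data; the values are invented for illustration.
toy = pd.DataFrame({
    "DepDelay": [5.0, np.nan, 12.0, -3.0],
    "ArrDelay": [10.0, np.nan, np.nan, -5.0],
    "Distance": [300, 450, 450, 1200],
})

# One row per column: dtype, missing count, missing share (%), unique count.
summary = pd.DataFrame({
    "dtype": toy.dtypes.astype(str),
    "n_missing": toy.isnull().sum(),
    "pct_missing": toy.isnull().mean() * 100,
    "n_unique": toy.nunique(),  # nunique ignores missing values by default
})
print(summary)
```

Because `isnull()` returns booleans, taking the mean gives the share of missing values directly.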


Some rows have missing values for `DepDelay` and/or `ArrDelay`. Explore these rows and determine whether they should be used in the modelling.

In [None]:
# Again, run these in separate cells normally - I'm calling display for convenience

# rows with a missing DepDelay: we see they all have Cancelled == 1
display(raw_data[raw_data.DepDelay.isnull()])

# check this is true for all cases: this selection should be empty
display(raw_data[raw_data.DepDelay.isnull() & (raw_data.Cancelled==0)])

# remaining rows with a missing ArrDelay: we see they all have Diverted == 1
display(raw_data[raw_data.ArrDelay.isnull() & (raw_data.Cancelled==0)])

# check this is true for all cases: this selection should be empty
display(raw_data[raw_data.ArrDelay.isnull() & (raw_data.Cancelled==0) & (raw_data.Diverted==0)])
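
Instead of checking by eye that a displayed selection is empty, the same checks can be expressed as boolean tests. This is a sketch on a small made-up DataFrame (`toy` is an assumption for illustration); the column names match those used above.

```python
import numpy as np
import pandas as pd

# Toy stand-in for raw_data: one cancelled flight, one diverted flight.
toy = pd.DataFrame({
    "DepDelay":  [5.0, np.nan, 12.0, 0.0],
    "ArrDelay":  [10.0, np.nan, np.nan, -5.0],
    "Cancelled": [0, 1, 0, 0],
    "Diverted":  [0, 0, 1, 0],
})

missing_dep = toy.DepDelay.isnull()
missing_arr = toy.ArrDelay.isnull()

# Is every row with a missing DepDelay a cancelled flight?
dep_explained = bool((toy.loc[missing_dep, "Cancelled"] == 1).all())

# Is every row with a missing ArrDelay either cancelled or diverted?
cancelled_or_diverted = (toy.Cancelled == 1) | (toy.Diverted == 1)
arr_explained = bool(cancelled_or_diverted[missing_arr].all())

print(dep_explained, arr_explained)
```

If either value were `False` on the real data, some missing delays would have another cause and would need a closer look.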


### EDA summary

1. `CarrierDelay`, `WeatherDelay`, `NASDelay`, `SecurityDelay` and `LateAircraftDelay` are mostly missing. Whilst some models can handle missing values, it's unclear how to fairly impute these values for comparison, so we drop these features.
1. The rows with missing `DepDelay` values are for flights that were cancelled. Whilst we could consider these rows as `MajorDelay`=True, that would leave open the question of what to fill in as `DepDelay` for modelling. We exclude these rows from the analysis.
1. The remaining rows with missing `ArrDelay` are flights that were diverted. We don't know why they were diverted, so we exclude these from the analysis too.
1. `Month` has only a single value in this dataset, so we exclude this feature.
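
The first and last points can also be found programmatically rather than by reading the EDA output. The sketch below uses a made-up DataFrame, and the 50% cut-off for "mostly missing" is an assumption chosen for illustration, not a threshold taken from this notebook.

```python
import numpy as np
import pandas as pd

# Toy stand-in for raw_data; the values are invented for illustration.
toy = pd.DataFrame({
    "Month": [1, 1, 1, 1],
    "CarrierDelay": [np.nan, np.nan, np.nan, 15.0],
    "DepDelay": [5.0, 0.0, 12.0, -3.0],
})

# Columns where more than half of the values are missing (assumed threshold).
missing_share = toy.isnull().mean()
mostly_missing = missing_share[missing_share > 0.5].index.tolist()

# Columns with a single distinct value carry no information for a model.
constant = [col for col in toy.columns if toy[col].nunique(dropna=False) == 1]

print(mostly_missing, constant)
```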

### Cleaning data

First, create a new dataset by calling `.copy()` on the raw data, so that the cleaning steps below do not modify `raw_data`.

In [None]:
data = ...
data = raw_data.copy()
data.shape


Drop variables with many missing values

In [None]:
data.drop(['CarrierDelay', 'WeatherDelay', 'NASDelay', 'SecurityDelay', 'LateAircraftDelay'],
          axis=1, inplace=True)
data.shape


Drop diverted flights

In [None]:
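# Filter and drop in one chained step; dropping in place on a filtered
# slice can trigger pandas' SettingWithCopyWarning
data = data[data.Diverted==0].drop(columns='Diverted')
data.shape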


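Drop cancelled flights

In [None]:
# Filter and drop in one chained step; dropping in place on a filtered
# slice can trigger pandas' SettingWithCopyWarning
data = data[data.Cancelled==0].drop(columns='Cancelled')
data.shape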


Drop `Month`

In [None]:
data.drop('Month', axis=1, inplace=True)
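
The cleaning steps above could also be collected into a single function that returns a new DataFrame and leaves its input untouched, which makes the cell safe to re-run. This is a sketch only: `clean_flights`, `DELAY_CAUSE_COLS` and the toy data are hypothetical names and values introduced for illustration.

```python
import numpy as np
import pandas as pd

DELAY_CAUSE_COLS = ["CarrierDelay", "WeatherDelay", "NASDelay",
                    "SecurityDelay", "LateAircraftDelay"]


def clean_flights(raw):
    """Hypothetical helper: apply the cleaning steps without mutating `raw`."""
    # Keep only flights that were neither diverted nor cancelled.
    kept = raw[(raw.Diverted == 0) & (raw.Cancelled == 0)]
    # Drop the mostly-missing delay causes, the now-constant flags, and Month.
    return kept.drop(columns=DELAY_CAUSE_COLS + ["Diverted", "Cancelled", "Month"])


# Toy stand-in for raw_data: one normal, one cancelled, one diverted flight.
toy_raw = pd.DataFrame({
    "Month": [1, 1, 1],
    "DepDelay": [5.0, np.nan, 40.0],
    "ArrDelay": [10.0, np.nan, np.nan],
    "Cancelled": [0, 1, 0],
    "Diverted": [0, 0, 1],
    **{col: [np.nan] * 3 for col in DELAY_CAUSE_COLS},
})

toy_clean = clean_flights(toy_raw)
print(toy_clean.shape, toy_raw.shape)
```

Since `drop` without `inplace=True` returns a new object, `toy_raw` keeps all of its rows and columns after the call.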


Check the number of missing values in each column now, and compare the final shape with the original shape.

In [None]:
display(data.isnull().sum())
display((data.shape, raw_data.shape))


## Create response variable

Create a variable `MajorDelay` which is `1` when `ArrDelay >= 30`, and `0` otherwise, then drop `ArrDelay` from the DataFrame.

In [None]:
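# Cast the boolean comparison to int so that MajorDelay is 1/0 as described above
data['MajorDelay'] = (data.ArrDelay >= 30).astype(int)
data.drop('ArrDelay', axis=1, inplace=True)
# Uncomment to write the cleaned file that later notebooks import
# data.to_csv('data/flights08_clean.csv', index=False)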
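
Before modelling, it can help to check how balanced the two classes are, since the share of delayed flights affects how metrics such as accuracy should be read. The sketch below uses made-up arrival delays (`toy` is an assumption for illustration) and the same 30-minute threshold as above.

```python
import pandas as pd

# Toy arrival delays in minutes; the values are invented for illustration.
toy = pd.DataFrame({"ArrDelay": [-5.0, 12.0, 30.0, 95.0, 29.0]})

# 1 when the flight arrives 30 minutes or more late, 0 otherwise.
toy["MajorDelay"] = (toy.ArrDelay >= 30).astype(int)

# Share of each class; normalize=True returns proportions instead of counts.
class_share = toy.MajorDelay.value_counts(normalize=True)
print(class_share)
```

On the cleaned data, the equivalent call would be `data.MajorDelay.value_counts(normalize=True)`.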