# Supervised Learning

You will now explore data from the American Bureau of Transportation Statistics, which recorded (a lot of) data about flights in the US from 1987 to 2008 and analysed the causes of delays. 
We will only look at data from 2008, restricted to a subset of around 100 000 instances. We also removed some of the columns to simplify the analysis.

The aim is to build a classifier that can predict whether a flight will arrive with a significant delay given the parameters at takeoff.

### Loading the data

As usual, start by importing `pandas`, `numpy`, `matplotlib` and `seaborn`, then load the data from the file `flights08.csv`.
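
As a rough illustration of the loading step, here is a minimal sketch. The in-memory CSV is a made-up stand-in for `flights08.csv` (its values are invented, not real flight records) so that the snippet runs on its own. In the notebook you would pass the file path to `pd.read_csv` instead, and also import `numpy`, `matplotlib.pyplot` and `seaborn`.

```python
import io

import pandas as pd

# Hypothetical three-row sample standing in for flights08.csv (made-up values).
sample_csv = io.StringIO(
    "Month,DayofMonth,DepDelay,ArrDelay,Origin\n"
    "1,3,8,-14,IAD\n"
    "1,3,19,2,IAD\n"
    "1,4,,,IND\n"
)

# In the notebook this would be: data = pd.read_csv("flights08.csv")
data = pd.read_csv(sample_csv)

print(data.shape)   # (number of rows, number of columns)
print(data.head())  # first few rows; empty fields are read as NaN
```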

In [0]:
# code to load the libraries



In [0]:
# add your code to load the data



### Getting a first look at the data

Have a look at the data:

* Do the attributes make sense? (see [here](http://stat-computing.org/dataexpo/2009/the-data.html) if needed)
* What's the shape of the dataset?
* How many missing values are present?
* How many unique values are present per attribute? What does that tell you? 
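
The questions above map onto a handful of `pandas` calls. The sketch below runs them on a made-up toy frame (column names mimic the flights data, values are invented); on the real data you would apply the same calls to your loaded DataFrame.

```python
import numpy as np
import pandas as pd

# Made-up toy frame; the values are invented for illustration only.
toy = pd.DataFrame({
    "Year": [2008, 2008, 2008, 2008],
    "DepDelay": [8.0, 19.0, np.nan, -4.0],
    "WeatherDelay": [np.nan, np.nan, 5.0, 12.0],
    "Origin": ["IAD", "IAD", "IND", "ISP"],
})

print(toy.shape)            # (rows, columns)
print(toy.dtypes)           # type of each attribute
missing = toy.isna().sum()  # number of missing values per column
unique = toy.nunique()      # number of distinct non-missing values per column
print(missing)
print(unique)

# Columns that take a single value cannot help a classifier tell instances apart.
constant_cols = unique[unique <= 1].index.tolist()
print(constant_cols)
```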

In [0]:
# add your code here to do a first exploration of the data



### Dealing with missing values

The previous step should have shown you two things:

1. some features have a **lot** of missing values (in particular those associated with Delay at departure). In what follows we will assume that a missing value for a Delay amounts to no Delay. 
2. some features don't have enough unique values to be interesting (which ones?) and should probably be removed. 

Based on this:

* fill the missing values in the `*Delay` columns with 0
* remove the feature(s) that don't have enough variability
* remove all instances that have missing values left
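
The three cleaning steps can be sketched as follows on a made-up toy frame (invented values). Selecting the delay columns by the suffix `"Delay"` and treating "not enough variability" as "a single unique value" are assumptions made for this illustration.

```python
import numpy as np
import pandas as pd

# Made-up toy frame; the values are invented for illustration only.
toy = pd.DataFrame({
    "Year": [2008, 2008, 2008, 2008],
    "ArrDelay": [-14.0, 2.0, np.nan, 34.0],
    "DepDelay": [8.0, 19.0, np.nan, 30.0],
    "WeatherDelay": [np.nan, np.nan, np.nan, 12.0],
    "TaxiOut": [10.0, np.nan, 7.0, 9.0],
})

# 1. Fill the missing values of every column whose name ends in "Delay" with 0.
delay_cols = [c for c in toy.columns if c.endswith("Delay")]
toy[delay_cols] = toy[delay_cols].fillna(0)

# 2. Drop the columns that take a single value (assumed criterion).
constant_cols = [c for c in toy.columns if toy[c].nunique() <= 1]
toy = toy.drop(columns=constant_cols)

# 3. Drop the rows that still contain a missing value.
toy = toy.dropna()

print(toy)
```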

In [0]:
# add your code here to clean the data



### Extracting the response

Our aim is to predict whether there will be a significant delay. 
The variable that encodes the delay is `ArrDelay`. 

* Start by having a look at it using `distplot` from `seaborn` (deprecated since seaborn 0.11; `histplot` is the replacement in recent versions)
* then compute the delay threshold such that 70% of the positive delays are lower than that threshold
* form a response vector `major_delay` equal to 1 if the delay is greater than or equal to the threshold and 0 otherwise
* finally remove the `ArrDelay` column from the dataset.
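
A minimal sketch of the threshold and response steps, on made-up arrival delays (in minutes, invented values). It assumes the threshold is the 0.7 quantile of the strictly positive delays; the plotting step is left out so that the snippet has no display dependency.

```python
import numpy as np
import pandas as pd

# Made-up toy data: negative values stand for early arrivals.
toy = pd.DataFrame({
    "ArrDelay": [-10, -5, 0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 60],
    "Distance": np.arange(13) * 100,
})

# Keep only the strictly positive delays and take their 70% quantile.
positive_delays = toy.loc[toy["ArrDelay"] > 0, "ArrDelay"]
threshold = positive_delays.quantile(0.7)
print(threshold)

# Response: 1 if the delay reaches the threshold, 0 otherwise.
major_delay = (toy["ArrDelay"] >= threshold).astype(int)
print(major_delay.value_counts())

# Remove ArrDelay so that it cannot be used as a feature.
data = toy.drop(columns="ArrDelay")
```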

In [0]:
# add your code here to show the distribution of `ArrDelay`


# compute the delay threshold


# form the response vector major_delay and remove ArrDelay from the dataset



### Splitting the data into a training and a testing set

Now that you have reasonably clean data, it's time to split it into a training set to train your model(s) and a test set to test them! Sklearn has all that sorted for you, of course. 

Import the function `train_test_split` from `sklearn.model_selection` and check the documentation using the `?` as usual. 

In [0]:
# add your code to load the function and check the doc



The key options that you are most likely to use are:

* `test_size`: the proportion of the data kept for testing, a number between 0 and 1, typically `0.2` or `0.3`
* `random_state`: an arbitrary integer to seed the train-test split so that your experiments are reproducible
* `stratify`: in the case of imbalanced data, you want to make sure your test set and your training set contain similar proportions of the different classes; pass the class labels here to enforce this. 

Create `X_train`, `X_test`, `y_train`, `y_test` out of `data` and `major_delay`. Use `0.3` as the test proportion, set the random state to `5175` and specify `major_delay` as the stratifier. 
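
To see what `stratify` does, here is a small sketch on made-up imbalanced data (80 instances of class 0 and 20 of class 1, invented feature values). The class proportions should come out nearly identical in both parts of the split.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up toy data: 80 "on time" (0) and 20 "major delay" (1) instances.
rng = np.random.default_rng(0)
data = pd.DataFrame({
    "Distance": rng.integers(100, 3000, size=100),
    "CRSDepTime": rng.integers(0, 2400, size=100),
})
major_delay = pd.Series([0] * 80 + [1] * 20, name="major_delay")

X_train, X_test, y_train, y_test = train_test_split(
    data, major_delay,
    test_size=0.3, random_state=5175,
    stratify=major_delay,
)

# Share of class 1 in each part of the split.
print(y_train.mean(), y_test.mean())
```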

In [0]:
# add your code here



## Decision Tree Classifier (DTC)

We will apply a DTC to the dataset and see how it does.

### Using SkLearn's DTC

Fitting a DTC can be highly optimised, which makes it very fast. Much like for kNN, SkLearn offers a ready-made implementation: the `DecisionTreeClassifier` from `sklearn.tree`. Have a look at the documentation, then declare a tree with no more than 3 levels and fit it on the training data. 

In [0]:
from sklearn.tree import DecisionTreeClassifier

dtc = DecisionTreeClassifier(max_depth=3)

dtc.fit(X_train, y_train)

### (Bonus) Visualising the DTC

A nice feature is to export the tree and display it using `graphviz` (http://www.graphviz.org/Download..php) 

* on Mac: install with `Homebrew` using `brew install graphviz`
* on Windows: http://www.graphviz.org/Download_windows.php 
* on Linux: http://www.graphviz.org/Download..php

To do this, 

* import `export_graphviz` from `sklearn.tree`
* use `export_graphviz` on the tree you fitted above specifying a name for the output file like `tree.dot`
* (see also [the documentation](http://scikit-learn.org/stable/modules/generated/sklearn.tree.export_graphviz.html))
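
The export step might look like the sketch below, shown on a tree fitted to made-up data. The feature and class names are hypothetical labels chosen for the illustration. With `out_file=None` the DOT source is returned as a string; passing `out_file="tree.dot"` writes it to disk instead.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_graphviz

# Made-up toy data: three random features and a binary response.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

# out_file=None returns the DOT description as a string.
dot_source = export_graphviz(
    tree,
    out_file=None,
    feature_names=["f0", "f1", "f2"],         # hypothetical feature names
    class_names=["on_time", "major_delay"],   # hypothetical class labels
    filled=True,
)
print(dot_source[:200])
```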

To see how it looks, use graphviz: 

```bash
dot -Tpng tree.dot -o tree.png
```

![](tree.png)


In [0]:
# your code here to export the tree



### Assessing the quality of a DTC

Using

```python
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score
```

* recover the confusion matrix on the training or the test set
* recover the classification report on the training or the test set
* adjust the depth of the tree to get optimal results

(**Bonus**) If you have the time, explore the other parameters of the DTC: what do they mean? Do they help? See also [the doc](http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html#sklearn.tree.DecisionTreeClassifier). 
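
One possible way to organise the assessment is sketched below on made-up data (invented features and response): compare training and test accuracy for a few depths, then print the confusion matrix and classification report for one of the trees. The depth grid is an arbitrary choice for the illustration.

```python
import numpy as np
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Made-up toy data with a non-linear decision boundary.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
y = (X[:, 0] + X[:, 1] ** 2 > 1).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=5175, stratify=y)

# A growing gap between train and test accuracy suggests overfitting.
for depth in (1, 3, 5, 10):
    model = DecisionTreeClassifier(max_depth=depth, random_state=0)
    model.fit(X_train, y_train)
    acc_train = accuracy_score(y_train, model.predict(X_train))
    acc_test = accuracy_score(y_test, model.predict(X_test))
    print(f"max_depth={depth:2d}  train={acc_train:.3f}  test={acc_test:.3f}")

# Detailed report for the last tree fitted in the loop (max_depth=10).
y_test_pred = model.predict(X_test)
cm = confusion_matrix(y_test, y_test_pred)
print(cm)
print(classification_report(y_test, y_test_pred, digits=3))
```

Rows of the confusion matrix correspond to the true classes and columns to the predicted ones.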

In [0]:
# your code here to get a prediction out of the DTC


# print the `confusion_matrix` and `classification_report`: how good is your model?



## Random Forest Classifier

The accuracy is already extremely high. This is because some of the features are "too informative". Let's remove a few of them and fit the models again.
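
One way to check which features dominate a fitted tree is its `feature_importances_` attribute. The sketch below uses made-up data in which the response is deliberately built from a `DepDelay`-like column, so that this column ends up carrying almost all of the importance; the data-generating rule is an assumption made for the illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Made-up toy data (invented values).
rng = np.random.default_rng(0)
n = 500
toy = pd.DataFrame({
    "DepDelay": rng.exponential(20, size=n),
    "Distance": rng.integers(100, 3000, size=n),
    "DayOfWeek": rng.integers(1, 8, size=n),
})

# Invented response: the arrival delay is the departure delay plus a little noise.
arr_delay = toy["DepDelay"] + rng.normal(0, 5, size=n)
major_delay = (arr_delay >= 30).astype(int)

tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(toy, major_delay)

# Importance of each feature in the fitted tree (the values sum to 1).
importances = pd.Series(tree.feature_importances_, index=toy.columns)
importances = importances.sort_values(ascending=False)
print(importances)
```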

In [0]:
# drop the features that make the prediction too easy
too_informative = [
    "DepDelay", "TaxiOut", "Cancelled", "Diverted",
    "CarrierDelay", "WeatherDelay", "NASDelay",
    "SecurityDelay", "LateAircraftDelay",
]
data = data.drop(columns=too_informative)

In [0]:
X_train, X_test, y_train, y_test = train_test_split(data, major_delay, 
                                                    test_size=0.3, random_state=5175,
                                                    stratify=major_delay)

In [0]:
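from sklearn.metrics import confusion_matrix, classification_report

# refit a depth-3 tree on the reduced feature set
dtc = DecisionTreeClassifier(max_depth=3)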

dtc.fit(X_train, y_train)

y_test_pred2 = dtc.predict(X_test)

print("The confusion matrix: \n")
print(confusion_matrix(y_test, y_test_pred2))
print("\nThe classification report:\n")
print(classification_report(y_test, y_test_pred2, digits=3))

In [0]:
from sklearn.ensemble import RandomForestClassifier

In [0]:
rf = RandomForestClassifier(n_estimators=50)

rf.fit(X_train, y_train)

y_test_pred3 = rf.predict(X_test)

In [0]:
print("The confusion matrix: \n")
print(confusion_matrix(y_test, y_test_pred3))
print("\nThe classification report:\n")
print(classification_report(y_test, y_test_pred3, digits=3))