In [2]:
import pandas as pd
import numpy as np
from pathlib import Path
from sklearn import datasets
import sys
sys.path.append('..')
from src.data import load_data
from src.metric import classificationSummary, confusion_matrix



# Classification

One of the most straightforward machine learning techniques is classification: determining which category a particular observation belongs to.  For example:
* whether an individual passenger on the Titanic was likely to survive,
* which income bracket an individual is likely to fall into,
* whether a customer will qualify for a loan.

All of these are classification questions.  Given a set of independent variables (e.g. factors), what is the likelihood that the dependent variable falls into a given category?


## Assessing Predictive Performance
Before we get too far into running algorithms, we need a method to determine how good our model is.  There are a number of different metrics we might use, and the right choice depends on whether we are predicting a continuous target or a categorical target.  Some of the common metrics are listed here.

Continuous Target
* MAE - Mean absolute error
* ME - Mean error
* MPE - Mean percentage error
* MAPE - Mean absolute percentage error
* RMSE - Root mean square error

We will look at the continuous methods in depth in the [Estimation Notebook](04-EstimationPrediction.ipynb).
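As a quick preview of the continuous metrics listed above, here is a minimal sketch in plain Python. The function name `regression_errors`, the sample values, and the sign convention (error = actual minus predicted) are assumptions made for illustration; the percentage metrics also assume that no actual value is zero.

```python
import math

def regression_errors(actual, predicted):
    """Return common error metrics for a continuous target.

    Assumes error = actual - predicted and non-zero actual values.
    """
    errors = [a - p for a, p in zip(actual, predicted)]
    n = len(errors)
    pct_errors = [e / a for e, a in zip(errors, actual)]
    return {
        "ME": sum(errors) / n,                               # mean error (signed)
        "MAE": sum(abs(e) for e in errors) / n,              # mean absolute error
        "MPE": 100 * sum(pct_errors) / n,                    # mean percentage error
        "MAPE": 100 * sum(abs(x) for x in pct_errors) / n,   # mean absolute percentage error
        "RMSE": math.sqrt(sum(e * e for e in errors) / n),   # root mean square error
    }

# Hypothetical values, for illustration only
actual = [100.0, 200.0, 400.0]
predicted = [110.0, 190.0, 400.0]
metrics = regression_errors(actual, predicted)
print(metrics)
```

Note how the signed metrics (ME, MPE) let over- and under-predictions cancel out, while the absolute and squared metrics do not.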

Categorical Target
- Misclassification Rate
- Recall (aka True Positive Rate, hit rate, sensitivity)
- Precision

We'll focus here on _accuracy_ (one minus the misclassification rate)

$$\text{accuracy} = \frac{\text{correct predictions}}{\text{all predictions}}$$

and _precision_

$$\text{precision} = \frac{\text{true positives}}{\text{true positives} + \text{false positives}}$$
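The two formulas above can be sketched directly in plain Python. The helper names `accuracy` and `precision` and the label lists are hypothetical, chosen only to illustrate the arithmetic; class `1` is assumed to be the positive class.

```python
def accuracy(y_true, y_pred):
    """Share of predictions that match the actual labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

def precision(y_true, y_pred, positive=1):
    """Share of predicted positives that are actually positive."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if p == positive and t == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if p == positive and t != positive)
    # Return 0.0 when nothing was predicted positive, to avoid dividing by zero
    return tp / (tp + fp) if (tp + fp) else 0.0

# Hypothetical labels, for illustration only
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 1, 1, 0]

print(accuracy(y_true, y_pred))   # 0.625 (5 of 8 correct)
print(precision(y_true, y_pred))  # 0.6   (3 true positives, 2 false positives)
```

In practice, `accuracy_score` and `precision_score` from `sklearn.metrics` should give the same results for binary labels like these.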

### Naive Performance
The simplest way to determine whether a model is valuable is to compare it to the "naive rule".  In essence, if we predict the most common outcome every time, what is the misclassification rate?  If our model is no better than this, then it isn't very helpful.

For instance, let's say that in a list of 10 loan applicants, 8 will be approved for a loan and 2 won't.  If we were to blindly assume that all 10 applicants would be approved, we would be wrong 20% of the time (2/10) and right 80% of the time.  Therefore, if we can't develop a model that is more than 80% accurate, we don't have a very good model.
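The loan example can be reproduced with a few lines of Python. This is a minimal sketch; the function name `naive_rule_accuracy` and the label strings are assumptions made for illustration.

```python
from collections import Counter

def naive_rule_accuracy(y):
    """Return the most common class and the accuracy of always predicting it."""
    most_common_class, count = Counter(y).most_common(1)[0]
    return most_common_class, count / len(y)

# 8 approved applicants and 2 denied applicants, as in the example above
outcomes = ["approved"] * 8 + ["denied"] * 2
majority, baseline = naive_rule_accuracy(outcomes)

print(majority)      # approved
print(baseline)      # 0.8
print(1 - baseline)  # naive misclassification rate, approximately 0.2
```

Any model we build should be compared against `baseline` before we trust its accuracy figure.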

### Cut-off values for classification
We also need to consider that nearly all of our classification models return the _probability_ (or likelihood) that an observation belongs to a given class, so the value we get from our classification algorithms is on a scale from 0 to 1.  Most often, the default is to assign an observation to a class if its probability of being in that class is > 0.5.  But it can be useful to move this cut-off.  For instance, if the cost of wrongly predicting a particular class is unusually high, we may want to raise the cut-off to 0.6 or greater.
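To see the effect of moving the cut-off, here is a small sketch that turns predicted probabilities into class labels. The function name `classify` and the probability values are hypothetical.

```python
def classify(probabilities, cutoff=0.5):
    """Assign class 1 when the predicted probability exceeds the cut-off."""
    return [1 if p > cutoff else 0 for p in probabilities]

# Hypothetical predicted probabilities of belonging to class 1
probs = [0.15, 0.45, 0.55, 0.62, 0.90]

default_labels = classify(probs)             # default cut-off of 0.5
strict_labels = classify(probs, cutoff=0.6)  # stricter cut-off of 0.6

print(default_labels)  # [0, 0, 1, 1, 1]
print(strict_labels)   # [0, 0, 0, 1, 1]
```

Raising the cut-off from 0.5 to 0.6 moves the observation with probability 0.55 out of class 1, which trades fewer false positives for more false negatives.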

## Classification using Logistic Regression
Logistic regression, despite its name, is a linear model for classification rather than regression. Logistic regression is also known in the literature as logit regression, maximum-entropy classification (MaxEnt) or the log-linear classifier. In this model, the probabilities describing the possible outcomes of a single trial are modeled using a [logistic function](https://en.wikipedia.org/wiki/Logistic_function).
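As a rough illustration of how the logistic function maps a linear combination of predictors to a probability, consider the sketch below. The coefficients, intercept, and feature values are made up for the example and are not fitted to any data; `predict_probability` is a hypothetical helper, not part of any library.

```python
import math

def logistic(z):
    """Logistic (sigmoid) function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def predict_probability(features, coefficients, intercept):
    """Probability of the positive class under a logistic model."""
    z = intercept + sum(c * x for c, x in zip(coefficients, features))
    return logistic(z)

# Hypothetical model parameters and a single observation
coefficients = [0.8, -1.2]
intercept = -0.5
p = predict_probability([1.0, 0.0], coefficients, intercept)

print(logistic(0))  # 0.5, the midpoint of the curve
print(p)            # a probability strictly between 0 and 1
```

The output of the linear part can be any real number, but the logistic function always squeezes it into the interval from 0 to 1, which is what lets us read it as a probability.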

For this problem, we are going to jump right into the pattern we will use for running nearly all of our models.
- Start by importing the data
- Get a sense of the shape, type and features of the data
- Perform any

In [3]:
# Start by importing a few key functions
from sklearn.metrics import accuracy_score, average_precision_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

delays_df = load_data('FlightDelays')
delays_df.head()


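| | CRS_DEP_TIME | CARRIER | DEP_TIME | DEST | DISTANCE | FL_DATE | FL_NUM | ORIGIN | Weather | DAY_WEEK | DAY_OF_MONTH | TAIL_NUM | Flight Status |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1455 | OH | 1455 | JFK | 184 | 01/01/2004 | 5935 | BWI | 0 | 4 | 1 | N940CA | ontime |
| 1 | 1640 | DH | 1640 | JFK | 213 | 01/01/2004 | 6155 | DCA | 0 | 4 | 1 | N405FJ | ontime |
| 2 | 1245 | DH | 1245 | LGA | 229 | 01/01/2004 | 7208 | IAD | 0 | 4 | 1 | N695BR | ontime |
| 3 | 1715 | DH | 1709 | LGA | 229 | 01/01/2004 | 7215 | IAD | 0 | 4 | 1 | N662BR | ontime |
| 4 | 1039 | DH | 1035 | LGA | 229 | 01/01/2004 | 7792 | IAD | 0 | 4 | 1 | N698BR | ontime |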


In [None]:
# TODO: It may be helpful here to get a sense of the dataset and decide how best to prepare it for modelling

## Data Preparation

In [4]:
# convert to categorical
delays_df.DAY_WEEK = delays_df.DAY_WEEK.astype('category')

# bin the scheduled departure time (HHMM) into hourly categories
delays_df.CRS_DEP_TIME = [round(t / 100) for t in delays_df.CRS_DEP_TIME]
delays_df.CRS_DEP_TIME = delays_df.CRS_DEP_TIME.astype('category')

predictors = ['DAY_WEEK', 'CRS_DEP_TIME', 'ORIGIN', 'DEST', 'CARRIER']
outcome = 'Flight Status'

X = pd.get_dummies(delays_df[predictors])
y = (delays_df[outcome] == 'delayed').astype(int)
classes = ['ontime', 'delayed']

# split into training and validation
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.40, random_state=1)

# run naive Bayes
delays_nb = MultinomialNB(alpha=0.01)
delays_nb.fit(X_train, y_train)

# predict probabilities
predProb_train = delays_nb.predict_proba(X_train)
predProb_valid = delays_nb.predict_proba(X_valid)

# predict class membership
y_valid_pred = delays_nb.predict(X_valid)
y_train_pred = delays_nb.predict(X_train)

# summarize performance on the training set
classificationSummary(y_train, y_train_pred)

Confusion Matrix (Accuracy 0.7955)

       Prediction
Actual   0   1
     0 998  61
     1 209  52
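The counts in this confusion matrix are enough to recompute the metrics discussed earlier. The sketch below assumes that rows are actual classes and columns are predicted classes (as the output labels suggest) and that class `1` means "delayed"; the variable names are chosen for illustration.

```python
# Counts read from the training confusion matrix above
tn, fp = 998, 61   # actual 0 (ontime): predicted 0, predicted 1
fn, tp = 209, 52   # actual 1 (delayed): predicted 0, predicted 1

total = tn + fp + fn + tp
accuracy = (tp + tn) / total
precision = tp / (tp + fp)
recall = tp / (tp + fn)

# Naive rule: always predict the most common actual class
naive_accuracy = max(tn + fp, fn + tp) / total

print(round(accuracy, 4))        # 0.7955, matching the summary above
print(round(precision, 4))       # 0.4602
print(round(recall, 4))          # 0.1992
print(round(naive_accuracy, 4))  # 0.8023
```

On these training counts, the model's accuracy sits slightly below the naive rule, which suggests that accuracy alone may not be the best lens for this problem and that precision, recall, or a different cut-off are worth examining.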
