<p style="font-family: Arial; font-size:3em;color:navy; font-style:bold"><br>
Lecture 7: Logistic Regression and Decision Trees
<br><br></p>

This week we're discussing more classifiers and their applications. 

<p style="font-family: Arial; font-size:2.5em;color:purple; font-style:bold"><br>
Logistic Regression
<br><br></p>

Logistic regression, like linear regression, is a generalized linear model. However, the final output of a logistic regression model is not continuous; it's binary (0 or 1). The following sections will explain how this works.

<p style="font-family: Arial; font-size:2em;color:indigo; font-style:bold"><br>
What is Conditional Probability?
<br><br></p>

Conditional probability is the probability that an event (A) will occur given that some condition (B) is true, written P(A | B). For example, say you want to find the probability that a student will take the bus rather than walk to class today (A) given that it's snowing heavily outside (B). The probability that the student takes the bus when it's snowing is likely higher than the probability that they take the bus on some other day.

<p style="font-family: Arial; font-size:2em;color:indigo; font-style:bold"><br>
The Logistic Equation
<br><br></p>

The <b>logistic equation</b> is the basis of the logistic regression model. It looks like this:
![image](https://wikimedia.org/api/rest_v1/media/math/render/svg/5e648e1dd38ef843d57777cd34c67465bbca694f)

The t in the equation is some linear combination of n variables, or a linear function in an n-dimensional feature space. The formulation of t is therefore identical to the linear regression formula.

The logistic equation works as follows:
1. Takes an input of n variables
2. Takes a linear combination of the variables as parameter t
3. Outputs a value given the input and parameter t

The output of the logistic equation is always between 0 and 1.
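The three steps above can be sketched in a few lines of NumPy. This is a minimal illustration rather than a fitted model: the weights `w`, the intercept `b`, the inputs `X_demo`, and the function name `logistic` are all made up for the example.

```python
import numpy as np

def logistic(t):
    """Logistic (sigmoid) function: maps any real t into the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-t))

# hypothetical coefficients for a model with n = 2 features
w = np.array([0.8, -0.5])
b = 0.1

# step 1: three example inputs, one per row
X_demo = np.array([[0.0, 0.0],
                   [2.0, 1.0],
                   [-3.0, 4.0]])

# step 2: linear combination of the inputs, exactly as in linear regression
t = X_demo @ w + b

# step 3: pass t through the logistic function
p = logistic(t)
print(p)
```

Whatever values `t` takes, every entry of `p` stays strictly between 0 and 1, and `logistic(0)` is exactly 0.5.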

The outputs of the logistic equation look like this when plotted (note that this is only one possible curve; the coefficients of a fitted logistic regression model determine where the curve sits and how steep it is):
![image](https://upload.wikimedia.org/wikipedia/commons/8/88/Logistic-curve.svg)

<p style="font-family: Arial; font-size:2em;color:indigo; font-style:bold"><br>
Threshold Value
<br><br></p>

The final output of a logistic regression model should be a binary set of numbers - that is, 0 or 1. However, you'll notice that the output of the logistic equation is a continuous set of numbers between 0 and 1. We have to convert it to binary. 

We do this by picking a <b>threshold value</b>. This is a value between 0 and 1 such that if f(x) > threshold, we give it the value 1, and otherwise it is 0. 
![image](https://wikimedia.org/api/rest_v1/media/math/render/svg/aab892e7cf0d00aa6da3aa051335900ff52d12a0)

The threshold value is the epsilon value in the equation, and is a key parameter in logistic regression, because it determines two key characteristics of a logistic regression classifier: 

1. <b>Sensitivity</b>
2. <b>Specificity</b>
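As a small sketch of the thresholding step, the snippet below converts continuous outputs into 0/1 labels. The helper name `classify` and the example probabilities are assumptions made for illustration.

```python
import numpy as np

def classify(probabilities, threshold=0.5):
    """Return 1 where the probability exceeds the threshold, otherwise 0."""
    return (np.asarray(probabilities) > threshold).astype(int)

# made-up outputs of the logistic equation
probs = np.array([0.05, 0.40, 0.55, 0.90])

low = classify(probs, threshold=0.3)
high = classify(probs, threshold=0.7)
print(low)   # [0 1 1 1]
print(high)  # [0 0 0 1]
```

The same outputs produce different labels depending on the threshold, which is why the choice of threshold matters so much.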

<p style="font-family: Arial; font-size:2em;color:indigo; font-style:bold"><br>
Sensitivity and Specificity
<br><br></p>

<b>The Confusion Matrix</b>
![image](http://rasbt.github.io/mlxtend/user_guide/evaluate/confusion_matrix_files/confusion_matrix_1.png)

<b>Sensitivity</b>, also known as the <b>true positive rate</b>, is the proportion of true positives out of all "actual positives" - that is, it is the proportion of positives that are correctly identified as positives.
    
    Sensitivity = True Positives / (True Positives + False Negatives)

<b>Specificity</b>, also called the <b>true negative rate</b>, is the proportion of true negatives out of all "actual negatives" - that is, it is the proportion of negatives that are correctly identified as negatives.

    Specificity = True Negatives / (True Negatives + False Positives)

There is always a trade-off between the two characteristics. Both depend on the <b>threshold value</b> we choose; the higher the threshold, the lower the sensitivity and the higher the specificity. If we have an arbitrarily high threshold value (i.e. 1), all points will be classified as negative; sensitivity = 0 and specificity = 1. The opposite will be true if we set the threshold to be arbitrarily low (i.e. 0). 
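The trade-off can be checked numerically. The sketch below uses made-up labels and scores, and a helper named `sensitivity_specificity` that is our own, to compute both quantities from scikit-learn's `confusion_matrix` at several thresholds.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

def sensitivity_specificity(y_true, y_prob, threshold):
    """Threshold the scores, then read TP/FN/TN/FP off the confusion matrix."""
    y_pred = (np.asarray(y_prob) > threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    return tp / (tp + fn), tn / (tn + fp)

# made-up true labels and model outputs
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1])
y_prob = np.array([0.1, 0.3, 0.45, 0.6, 0.4, 0.65, 0.8, 0.95])

for thr in (0.0, 0.35, 0.5, 0.7, 1.0):
    sens, spec = sensitivity_specificity(y_true, y_prob, thr)
    print(f"threshold={thr:.2f}  sensitivity={sens:.2f}  specificity={spec:.2f}")
```

As the threshold rises from 0 to 1, sensitivity falls from 1 to 0 while specificity rises from 0 to 1.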

<p style="font-family: Arial; font-size:2em;color:indigo; font-style:bold"><br>
The ROC Curve
<br><br></p>

The ROC curve represents how well a model performs in terms of sensitivity and specificity over all possible thresholds. Sensitivity (on the y-axis) is plotted against 1-specificity, or equivalently the false positive rate (on the x-axis) as the threshold value varies from 0 to 1. An example:
![image](https://www.statsdirect.com/help/resources/images/ebx_1266835018.gif)
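The points of an ROC curve can be computed with `roc_curve` from `sklearn.metrics`. The sketch below reuses made-up labels and scores; `roc_auc_score`, which summarizes the curve as the area underneath it, is included as an optional extra rather than something the lecture depends on.

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

# made-up true labels and model outputs
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1])
y_score = np.array([0.1, 0.3, 0.45, 0.6, 0.4, 0.65, 0.8, 0.95])

# each (fpr, tpr) pair corresponds to one candidate threshold
fpr, tpr, thresholds = roc_curve(y_true, y_score)
for f, t in zip(fpr, tpr):
    print(f"1 - specificity = {f:.2f}   sensitivity = {t:.2f}")

# area under the ROC curve: 1.0 is a perfect ranking, 0.5 is chance level
auc = roc_auc_score(y_true, y_score)
print(f"AUC = {auc:.3f}")
```

Plotting `fpr` against `tpr` (for example with matplotlib's `plt.plot(fpr, tpr)`) gives a curve like the one shown above.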

<p style="font-family: Arial; font-size:2em;color:indigo; font-style:bold"><br>
Example 1: Predicting Income from Census Data
<br><br></p>

We'll use logistic regression to predict whether annual income is greater than $50k based on census data. You can read more about the dataset <a href="https://www.kaggle.com/uciml/adult-census-income">here</a>.

In [7]:
# import necessary packages
import pandas as pd
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.linear_model import LogisticRegression

In [8]:
inc_data = pd.read_csv('C:/Input/adult.csv')

# drop null values
inc_data = inc_data.dropna()


The following uses LabelEncoder() in scikit-learn to encode all features to categorical integer values. Many features in this particular dataset, such as race and sex, are represented as strings with a limited number of possible values. LabelEncoder() re-labels these values as integers between 0 and number of classes-1.

In [9]:
# convert all features to categorical integer values
enc = LabelEncoder()
for i in inc_data.columns:
    inc_data[i] = enc.fit_transform(inc_data[i])

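To see what `LabelEncoder()` does to a single column, here is a minimal sketch on a made-up list of strings (the variable names `sex` and `demo_enc` are illustrative and not part of the census example).

```python
from sklearn.preprocessing import LabelEncoder

# a made-up string-valued feature with two possible values
sex = ["Male", "Female", "Female", "Male"]

demo_enc = LabelEncoder()
codes = demo_enc.fit_transform(sex)

print(demo_enc.classes_.tolist())  # ['Female', 'Male']
print(codes.tolist())              # [1, 0, 0, 1]

# the mapping can be reversed to recover the original labels
print(demo_enc.inverse_transform(codes).tolist())
```

The integers follow the sorted order of the labels, so they are identifiers only and carry no numeric meaning.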

In [10]:
# target is stored in y
y = inc_data['income']

# X contains all other features, which we will use to predict target
X = inc_data.drop('income', axis=1)


Here we split the data into train and test sets, where the test set is 30% of the initial dataset.

In [11]:
# train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)


In [12]:
# build model and fit on train set
logit = LogisticRegression()
logit.fit(X_train, y_train)


In [13]:
# make predictions on test set
pred_logit = logit.predict(X_test)
pred_logit


In [14]:
# measure accuracy
accuracy_score(y_true = y_test, y_pred = pred_logit)


In [15]:
# plot ROC curve?

<p style="font-family: Arial; font-size:2em;color:indigo; font-style:bold"><br>
Example 2: Predict Iris Species
<br><br></p>

In [16]:
# import necessary packages
import pandas as pd
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

In [46]:
from sklearn import datasets

# load the built-in iris dataset
iris = datasets.load_iris()

# use only the first two features (sepal length and sepal width)
X = iris.data[:, :2]

# binary target: 1 if the flower is setosa (class 0), 0 otherwise
Y = (iris.target == 0).astype(int)

In [47]:
#Here we create the train/test split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3)

In [48]:
#Building the model
logreg = LogisticRegression()
logreg.fit(X_train, Y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [49]:
#Predictions
pred = logreg.predict(X_test)

#Accuracy
accuracy_score(y_true = Y_test, y_pred = pred)

#ROC

0.97777777777777775

<p style="font-family: Arial; font-size:2em;color:indigo; font-style:bold"><br>
Multinomial Logistic Regression
<br><br></p>

We won't discuss this in detail here, but it's worth mentioning briefly. Multinomial logistic regression is another classification algorithm. The difference is that the output isn't binary; as the name implies, the target can take more than two categories. For example, we can use multinomial logistic regression to predict which movie genre people will like based on their other characteristics. If you're interested in learning how this model works in more detail, there are a lot of good resources on the internet and we encourage you to explore.
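For a quick look at a target with more than two categories, the sketch below fits scikit-learn's `LogisticRegression` on all three iris species. It assumes a recent scikit-learn release, where a multi-class target is fitted as a multinomial model by default; older releases, such as the one that produced the output earlier in this notebook, defaulted to one-vs-rest (`multi_class='ovr'`). The `max_iter=1000` setting is our own choice to give the solver room to converge.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

iris_all = load_iris()
X_all, y_all = iris_all.data, iris_all.target  # three species: 0, 1, 2

multi = LogisticRegression(max_iter=1000)
multi.fit(X_all, y_all)

# one probability per class for each flower; each row sums to 1
proba = multi.predict_proba(X_all[:3])
print(proba.round(3))

# the predicted class is the one with the highest probability
print(multi.predict(X_all[:3]))
```

Instead of a single probability and a threshold, the model returns one probability per category.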

<p style="font-family: Arial; font-size:2.5em;color:purple; font-style:bold"><br>
Decision Trees
<br><br></p>

The decision tree algorithm can be used for both classification and regression, and has the advantage of not assuming a linear model. Decision trees are usually easy to represent visually, which makes it easy to understand how the model actually works.

A frequently used decision tree algorithm is <b>CART</b>, or <b>classification and regression trees.</b>
![image](https://upload.wikimedia.org/wikipedia/commons/f/f3/CART_tree_titanic_survivors.png)
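To get a feel for how readable a tree is, the sketch below fits a very small tree on the built-in iris dataset and prints its rules as text with `export_text`. The depth limit `max_depth=2` and the name `demo_tree` are our own choices for illustration; scikit-learn's documentation describes its tree implementation as an optimized version of CART.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris_demo = load_iris()

# a shallow tree keeps the printed rules short
demo_tree = DecisionTreeClassifier(max_depth=2, random_state=0)
demo_tree.fit(iris_demo.data, iris_demo.target)

# each line is one split (feature <= value) or one leaf (predicted class)
rules = export_text(demo_tree, feature_names=list(iris_demo.feature_names))
print(rules)
```

Each path from the top of the printout to a leaf reads as a chain of if/else conditions on the features.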

<p style="font-family: Arial; font-size:2em;color:indigo; font-style:bold"><br>
Geometric Interpretation
<br><br></p>

![image](https://docs.microsoft.com/en-us/azure/machine-learning/media/machine-learning-algorithm-choice/image5.png)

<p style="font-family: Arial; font-size:2em;color:indigo; font-style:bold"><br>
Mathematical Formulation
<br><br></p>

<p style="font-family: Arial; font-size:2em;color:indigo; font-style:bold"><br>
Example 1: Credit Card Fraud
<br><br></p>

We'll use the decision tree classifier to predict whether credit card transactions are fraudulent. You can read about the dataset we're using <a href="https://www.kaggle.com/dalpozz/creditcardfraud">here</a>.

In [21]:
# import necessary packages
import pandas as pd
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.tree import export_graphviz

In [22]:
cc_data = pd.read_csv('C:/Input/creditcard.csv')

# drop null values
cc_data = cc_data.dropna()


In [23]:
# target is stored in y
y = cc_data['Class']

# X contains all other features, which we will use to predict target
X = cc_data.drop('Class', axis=1)


The following note may seem self-evident, but just to be extra clear:

In the cell above, we create X so that it contains all features except for the target variable, and we'll make predictions using X. This doesn't have to be the case, and in fact is usually not the best practice; we can pick features that we think are significant rather than using the entire dataset, and doing so often results in more accurate predictions. For simplicity's sake, however, we omit this in this example. 

In [24]:
# train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)


In [25]:
# build model and fit on train set
tree_classifier = DecisionTreeClassifier(max_leaf_nodes=15)
tree_classifier.fit(X_train, y_train)


In [26]:
# make predictions on test set
tree_pred = tree_classifier.predict(X_test)
tree_pred


In [27]:
# measure accuracy
accuracy_score(y_true = y_test, y_pred = tree_pred)


Out of context, this accuracy score looks extremely good. However, the dataset is highly unbalanced: the positive class, fraudulent transactions, accounts for only a very small portion of the data. The baseline accuracy is the score we would get by always predicting the most frequent value, which here is 0 (non-fraudulent), and it is already very high. While the classifier's accuracy is higher than the baseline, it is not as impressive as a similar score would be on a balanced dataset.
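The effect of class imbalance on accuracy can be illustrated with made-up labels. In the sketch below, 1% of the labels are positive (an assumed proportion, not the real fraud rate), and the "model" simply predicts 0 every time. `recall_score` from `sklearn.metrics` computes sensitivity.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# made-up labels: 10 positives (fraud) and 990 negatives
y_fake = np.array([1] * 10 + [0] * 990)

# a "classifier" that always predicts the most frequent class
always_negative = np.zeros_like(y_fake)

acc = accuracy_score(y_true=y_fake, y_pred=always_negative)
sens = recall_score(y_true=y_fake, y_pred=always_negative)
print(f"accuracy    = {acc:.2f}")   # accuracy    = 0.99
print(f"sensitivity = {sens:.2f}")  # sensitivity = 0.00
```

A 99% accuracy here reflects nothing but the class proportions, since not a single positive is detected. This is why sensitivity and specificity are more informative than accuracy on unbalanced data.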

In [28]:
import numpy as np
# create an array of 0s the same length as y_test
arr_0 = np.zeros(y_test.size)

# baseline accuracy
accuracy_score(y_true = y_test, y_pred = arr_0)


<p style="font-family: Arial; font-size:2em;color:indigo; font-style:bold"><br>
Example 2: Predict Higgs Boson Signal
<br><br></p>

In [29]:
# import necessary packages
import pandas as pd
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

In [52]:
higgs = pd.read_csv('higgsTest.csv')
higgs = higgs.dropna()

# target is stored in y
Y = higgs['Class']

# X contains all other features, which we will use to predict target
X = higgs.drop('Class', axis=1)

In [55]:
# train/test split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3)

# build model and fit on train set
dTree = DecisionTreeClassifier(max_leaf_nodes=15)
dTree.fit(X_train, Y_train)

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=15, min_impurity_split=1e-07,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')

In [57]:
# make predictions on test set
dTree_pred = dTree.predict(X_test)
dTree_pred

array([1, 0, 1, 1, 0, 1, 1, 1, 0, 0, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 0,
       1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 0, 1, 1, 0, 1, 0, 1, 1, 0, 0, 0, 1,
       0, 1, 1, 1, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 1, 0, 0, 1, 1, 0, 1, 1, 1,
       0, 1, 0, 1, 1, 1, 0, 1, 1, 0, 0, 0, 1, 1, 0, 1, 1, 1, 1, 0, 1, 0, 1,
       0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1,
       1, 1, 0, 1, 1, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 0, 0,
       1, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 1, 1,
       1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 0, 1,
       0, 0, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1])

In [59]:
# measure accuracy
accuracy_score(y_true = Y_test, y_pred = dTree_pred)

0.69154228855721389