In [None]:
%matplotlib inline
import sys
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import math
import os

<h4>Classification Overview</h4>
<ul>
<li>Predict a binary class as output based on given features.
</li>

<li>Examples: Do we need to follow up on a customer review? Is this transaction fraudulent or valid? Are there signs of the onset of a medical condition or disease? Is this food considered junk food or not?</li>

<li>Linear Model. Estimated Target = w<sub>0</sub> + w<sub>1</sub>x<sub>1</sub> 
+ w<sub>2</sub>x<sub>2</sub> + w<sub>3</sub>x<sub>3</sub> 
+ … + w<sub>n</sub>x<sub>n</sub><br>
where w<sub>0</sub> is the intercept, w<sub>1</sub> … w<sub>n</sub> are the weights, and x<sub>1</sub> … x<sub>n</sub> are the features
</li>

<li><b>Logistic Regression</b>. Estimated Probability = <b>sigmoid</b>(w<sub>0</sub> + w<sub>1</sub>x<sub>1</sub> 
+ w<sub>2</sub>x<sub>2</sub> + w<sub>3</sub>x<sub>3</sub> 
+ … + w<sub>n</sub>x<sub>n</sub>)<br>
where w<sub>0</sub> is the intercept, w<sub>1</sub> … w<sub>n</sub> are the weights, and x<sub>1</sub> … x<sub>n</sub> are the features
</li>
<li>The linear model output is fed through a sigmoid (logistic) function to produce the probability.</li>
<li>Predicted Value: Probability of a binary outcome. A value closer to 1 indicates the positive class; a value closer to 0 indicates the negative class</li>
<li>Algorithm Used: Logistic Regression. Objective is to find the weights w that maximize separation between the two classes</li>
<li>Optimization: Stochastic Gradient Descent. Seeks to minimize loss/cost so that predicted value is as close to actual as possible</li>
<li>Cost/Loss Calculation: Logistic loss function</li>
</ul>
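
The logistic loss named in the list above can be sketched for a single observation as follows. This is a minimal illustration, not necessarily the exact form any particular library uses; the function name `logistic_loss` and the clipping constant `eps` are assumptions introduced here.

```python
import math

def logistic_loss(y_true, p, eps=1e-15):
    """Logistic (log) loss for one observation.

    y_true is the actual class (0 or 1); p is the predicted
    probability of the positive class.
    """
    # Clip p away from exactly 0 and 1 so that log() stays finite
    p = min(max(p, eps), 1.0 - eps)
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1.0 - p))

# Actual class is positive in both cases
confident_right = logistic_loss(1, 0.9)  # predicted 0.9: small loss
confident_wrong = logistic_loss(1, 0.1)  # predicted 0.1: large loss
print(confident_right, confident_wrong)
```

The loss grows quickly as a confident prediction turns out to be wrong, which is what pushes the optimizer toward weights whose probabilities agree with the actual labels.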

In [None]:
# Sigmoid or logistic function
# For any x, the output is bounded between 0 and 1.
def sigmoid_func(x):
    return 1.0/(1+math.exp(-x))

In [None]:
sigmoid_func(10)

In [None]:
sigmoid_func(-100)

In [None]:
sigmoid_func(0)

In [None]:
# Sigmoid function example
x = pd.Series(np.arange(-8,8,0.5))
y = x.map(sigmoid_func)

In [None]:
x.head()

In [None]:
plt.plot(x,y)
plt.ylim((-0.2,1.2))
plt.xlabel('input')
plt.ylabel('sigmoid output')
plt.grid(True)

# xmin/xmax and ymin/ymax are fractions of the axes (0 to 1), not data values;
# the defaults already span the full axes, so they are omitted here
plt.axvline(x=0, ls='dashed')
plt.axhline(y=0.5, ls='dashed')
plt.axhline(y=1.0, color='r')
plt.axhline(y=0.0, color='r')
plt.title('Sigmoid')

Example Dataset - Hours spent and Exam Results: 
https://en.wikipedia.org/wiki/Logistic_regression

The sigmoid function produces an output between 0 and 1. An input of 0 produces an output of exactly 0.5, and inputs close to 0 produce outputs close to 0.5. A negative input produces a value less than 0.5, while a positive input produces a value greater than 0.5.
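
These properties can be checked numerically. The sketch below uses a vectorized helper, `sigmoid_array`, which is a name introduced here for illustration and is separate from the notebook's `sigmoid_func`.

```python
import numpy as np

def sigmoid_array(z):
    """Vectorized sigmoid: accepts a scalar or an array-like of inputs."""
    z = np.asarray(z, dtype=float)
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-4.0, -1.0, 0.0, 1.0, 4.0])
p = sigmoid_array(z)

print(p)                      # every value lies strictly between 0 and 1
print(p[2])                   # input 0 maps to 0.5
print(p + sigmoid_array(-z))  # sigmoid(z) + sigmoid(-z) is 1 for every z
```

The last line shows the symmetry around 0: the output for `-z` is the complement of the output for `z`.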

In [None]:
data_path = \
r'HoursExamResult.csv'

In [None]:
df = pd.read_csv(data_path)

In [None]:
df.head()

Input Feature: Hours<br>
Output: Pass (1 = pass, 0 = fail)

In [None]:
# Fitted weights reported for this dataset in the Wikipedia article:
# slope (weight for Hours) = 1.5046, intercept = -4.0777
def straight_line(x):
    return 1.5046*x - 4.0777

In [None]:
y_linear = df.Hours.map(straight_line)

In [None]:
y_linear

In [None]:
df

In [None]:
df[df.Pass==1]

In [None]:

plt.scatter(x=df.Hours,y=y_linear,color='b',label='linear')
plt.scatter(x=df[df.Pass==1].Hours,y=df[df.Pass==1].Pass, color='g', label='pass')
plt.scatter(x=df[df.Pass==0].Hours,y=df[df.Pass==0].Pass, color='r',label='fail')
plt.xlim((0,7))
plt.ylim((-1,2))

In [None]:
# Generate the probability by passing the feature through the linear model and then through the sigmoid function
y_vals = df.Hours.map(straight_line).map(sigmoid_func)

In [None]:
plt.scatter(x=df.Hours,y=y_vals,color='b',label='logistic')
plt.scatter(x=df[df.Pass==1].Hours,y=df[df.Pass==1].Pass, color='g', label='pass')
plt.scatter(x=df[df.Pass==0].Hours,y=df[df.Pass==0].Pass, color='r',label='fail')
plt.title('Hours Spent Reading - Pass Probability')
plt.xlabel('Hours')
plt.ylabel('Pass Probability')
#plt.grid(True)
#plt.legend()
plt.xlim((0,7))
plt.ylim((-0.2,1.5))


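# Vertical line at 3 hours; horizontal dashed line at the 0.6 cutoff.
# xmin/xmax are axes fractions (0 to 1), so the defaults are used to span the plot
plt.axvline(x=3)
plt.axhline(y=0.6, label='cutoff at 0.6', ls='dashed')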
#plt.legend()

In [None]:
myhours = 10
y = straight_line(myhours)

In [None]:
y

In [None]:
sigmoid_func(y)

With these weights, the predicted probability reaches 0.5 at about 2.7 hours of study time (4.0777 / 1.5046 ≈ 2.71). Any student who studied longer than that has a predicted probability of passing above 0.5.
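
The 2.7-hour figure follows from setting the linear part to zero, since the sigmoid outputs 0.5 exactly when its input is 0. A minimal sketch using the weights from `straight_line`; the names `w0`, `w1`, `pass_probability`, and `boundary_hours` are introduced here for illustration.

```python
import math

w0 = -4.0777  # intercept used in straight_line
w1 = 1.5046   # weight for Hours used in straight_line

def pass_probability(hours):
    """Linear model followed by the sigmoid."""
    return 1.0 / (1.0 + math.exp(-(w0 + w1 * hours)))

# Solve w0 + w1 * hours = 0 for hours
boundary_hours = -w0 / w1
print(round(boundary_hours, 2))          # 2.71
print(pass_probability(boundary_hours))  # approximately 0.5
```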

In the plot above, the vertical line at 3 hours and the horizontal cutoff line divide the points into four quadrants:<br>
1. Top right quadrant = true positive. A pass is classified correctly as pass
2. Bottom left quadrant = true negative. A fail is classified correctly as fail
3. Top left quadrant = false negative. A pass is classified incorrectly as fail
4. Bottom right quadrant = false positive. A fail is classified incorrectly as pass

The cutoff can be adjusted: instead of 0.5, it could be set at 0.4 or 0.6, depending on the nature of the problem and the impact of each kind of misclassification.
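
To see how the cutoff changes the four outcome counts, here is a small sketch. The observations in `sample_hours` and `sample_pass` are hypothetical values made up for illustration and are not rows from `HoursExamResult.csv`; the helper names are also assumptions.

```python
import math

def pass_probability(hours, w0=-4.0777, w1=1.5046):
    """Linear model followed by the sigmoid, using the weights from straight_line."""
    return 1.0 / (1.0 + math.exp(-(w0 + w1 * hours)))

def confusion_counts(hours, actual, cutoff):
    """Count true/false positives and negatives at a given probability cutoff."""
    counts = {'TP': 0, 'TN': 0, 'FP': 0, 'FN': 0}
    for h, a in zip(hours, actual):
        predicted = 1 if pass_probability(h) >= cutoff else 0
        if predicted == 1 and a == 1:
            counts['TP'] += 1
        elif predicted == 0 and a == 0:
            counts['TN'] += 1
        elif predicted == 1 and a == 0:
            counts['FP'] += 1
        else:
            counts['FN'] += 1
    return counts

# Hypothetical observations (illustration only)
sample_hours = [1.0, 2.0, 2.5, 2.9, 3.5, 5.0]
sample_pass = [0, 1, 0, 1, 0, 1]

for cutoff in (0.4, 0.5, 0.6):
    print(cutoff, confusion_counts(sample_hours, sample_pass, cutoff))
```

On this toy sample, lowering the cutoff to 0.4 adds a false positive, while raising it to 0.6 adds a false negative. Which trade-off is acceptable depends on the cost of each kind of error.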

<h4>Summary</h4>
<p><b>Binary Classifier</b> - Predicts the probability that an observation belongs to the positive class </p>
<p><b>Logistic or Sigmoid function</b> has an important property: its output is between 0 and 1 for any input.  Binary classifiers use this output as the probability of the positive class</p>

<p><b>True Positive</b> - Samples that are actual-positives correctly predicted as positive </p>
<p><b>True Negative</b> - Samples that are actual-negatives correctly predicted as negative </p>
<p><b>False Negative</b> - Samples that are actual-positives incorrectly predicted as negative </p>
<p><b>False Positive</b> - Samples that are actual-negatives incorrectly predicted as positive </p>

<p><b>Logistic Loss Function</b> is convex (bowl-shaped). It tells us not only the loss at a given weight but also, through its slope, which way to move the weight to reduce the loss</p>
<p><b>Gradient Descent</b> optimization algorithm uses the loss function to adjust the weights of all the features, iterating until the weights reach their optimal values</p>

<p><b>Batch Gradient Descent</b> predicts the y value for all training examples and then adjusts the weights based on the loss. It can converge much more slowly when the training set is very large. Training set order does not matter, because every example in the training set is considered before each adjustment</p>

<p><b>Stochastic Gradient Descent</b> predicts the y value for the next training example and immediately adjusts the weights. It can converge faster when the training set is very large.  The training set should be in random order; otherwise the model will not learn correctly.  <b>AWS ML uses Stochastic Gradient Descent</b></p>
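
The difference between the two update schemes can be sketched as below. This is a simplified illustration under stated assumptions, not how AWS ML implements its optimizer: the toy data, the learning rate `lr`, and all function names are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mean_log_loss(w, X, y):
    """Average logistic loss over all examples."""
    p = np.clip(sigmoid(X @ w), 1e-12, 1 - 1e-12)
    return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))

def batch_step(w, X, y, lr):
    """Batch: one weight update from the gradient averaged over every example."""
    grad = X.T @ (sigmoid(X @ w) - y) / len(y)
    return w - lr * grad

def sgd_epoch(w, X, y, lr, rng):
    """Stochastic: one weight update per example, visited in shuffled order."""
    for i in rng.permutation(len(y)):
        grad = (sigmoid(X[i] @ w) - y[i]) * X[i]
        w = w - lr * grad
    return w

# Hypothetical toy data: a bias column of ones plus a single feature
X = np.array([[1.0, 0.5], [1.0, 1.5], [1.0, 2.5], [1.0, 3.5], [1.0, 4.5]])
y = np.array([0.0, 0.0, 0.0, 1.0, 1.0])

rng = np.random.default_rng(0)
start_loss = mean_log_loss(np.zeros(2), X, y)

w_batch = np.zeros(2)
w_sgd = np.zeros(2)
for _ in range(200):
    w_batch = batch_step(w_batch, X, y, lr=0.1)      # 1 update per pass
    w_sgd = sgd_epoch(w_sgd, X, y, lr=0.1, rng=rng)  # 5 updates per pass

loss_batch = mean_log_loss(w_batch, X, y)
loss_sgd = mean_log_loss(w_sgd, X, y)
print(start_loss, loss_batch, loss_sgd)
```

Each pass over the data gives the batch version a single update, while the stochastic version makes one update per example, which is why it can make progress sooner on very large training sets. The shuffle in `sgd_epoch` reflects the random-order requirement.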