# Chapter 16 - Logistic Regression

The goal is to predict which users have paid accounts and which do not. The first attempt is to use linear regression with multiple variables to find the best model.

paid account = beta_0 + beta_1 * experience + beta_2 * salary + error
(see book for graphics)

This works as a first pass, but it leads to a couple of immediate problems.

1. We'd like our outputs to be 0 or 1, or somewhere in between, so we can read them as probabilities. But the outputs of the linear model can be huge positive or negative numbers, and it is not clear how to interpret those.

2. The model assumes that the errors are uncorrelated with the columns of x, and here they are not. Example: it outputs very large values for people with lots of experience, but we know the actual values must be at most 1. Large outputs therefore come with large negative errors, meaning our beta estimate is biased.

We'd like large positive values of dot(x_i, beta) to correspond to probabilities close to 1, and large negative values to correspond to probabilities close to 0. We can accomplish this by applying another function to the result. In logistic regression this is the logistic function.
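To make the two problems concrete, here is a minimal sketch that compares the raw linear score with the same score passed through the logistic function. The coefficients and user rows are hypothetical numbers picked for illustration, not values from the book.

```python
import math

def dot(v, w):
    """Sum of componentwise products."""
    return sum(v_i * w_i for v_i, w_i in zip(v, w))

def logistic(x):
    return 1.0 / (1 + math.exp(-x))

# Hypothetical coefficients [intercept, experience, salary], for illustration only
beta = [0.2, 0.4, -0.00001]

# Each row is [1, years_experience, salary]; the leading 1 multiplies the intercept
users = [
    [1, 0.5, 48000],
    [1, 10.0, 83000],
    [1, 40.0, 60000],
]

scores = [dot(x_i, beta) for x_i in users]   # raw linear outputs, unbounded
probs = [logistic(s) for s in scores]        # squashed into the interval (0, 1)

for score, prob in zip(scores, probs):
    print(f"linear score = {score:6.2f}   logistic(score) = {prob:.4f}")
```

The linear scores wander outside the range 0 to 1 as experience grows, while the logistic outputs always stay strictly between 0 and 1.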

In [6]:
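import math
import random
from functools import reduce, partial

# helpers assumed from earlier chapters of the book, repeated here so this cell runs on its own
def dot(v, w):
    return sum(v_i * w_i for v_i, w_i in zip(v, w))

def vector_add(v, w):
    return [v_i + w_i for v_i, w_i in zip(v, w)]

def step(v, direction, step_size):
    # move step_size in the given direction, starting from v
    return [v_i + step_size * d_i for v_i, d_i in zip(v, direction)]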

def logistic(x):
    return 1.0 / (1 + math.exp(-x))

# and conveniently, its derivative can be written in terms of logistic itself

def logistic_prime(x):
    return logistic(x) * (1 - logistic(x))

# log likelihood of a single data point

def logistic_log_likelihood_i(x_i, y_i, beta):
    if y_i == 1:
        return math.log(logistic(dot(x_i, beta)))
    else:
        return math.log(1 - logistic(dot(x_i, beta)))
    
# the overall likelihood that we see the data we do with the coefficients we have is just the product of all the individual 
# likelihoods, or the sum of all the log likelihoods:

def logistic_log_likelihood(x, y, beta):
    return sum(logistic_log_likelihood_i(x_i, y_i, beta) for x_i, y_i in zip(x ,y))

# the gradient of the log likelihood, which we want to maximize (not minimize)
def logistic_log_partial_ij(x_i, y_i, beta, j):
    return (y_i - logistic(dot(x_i, beta))) * x_i[j]

def logistic_log_gradient_i( x_i, y_i, beta):
    return [logistic_log_partial_ij(x_i, y_i, beta, j) for j, _ in enumerate(beta)]

def logistic_log_gradient(x, y, beta):
    return reduce(vector_add, [logistic_log_gradient_i(x_i, y_i, beta) for x_i, y_i in zip(x,y)])

# helpers for splitting the data into training and test sets
def split_data(data, prob):
    results = [], []
    for row in data:
        results[0 if random.random() < prob else 1].append(row)
    return results

def train_test_split(x, y, test_pct):
    data = zip(x,y)
    train, test = split_data(data, 1 - test_pct)
    x_train, y_train = zip(*train)
    x_test, y_test = zip(*test)
    return x_train, x_test, y_train, y_test

def minimize_batch(target_fn, gradient_fn, theta_0, tolerance=0.000001):
    step_sizes = [100, 10, 1, 0.1, 0.01, 0.001, 0.0001, 0.00001]
    theta = theta_0
    target_fn = safe(target_fn)
    value = target_fn(theta)
    
    while True:
        gradient = gradient_fn(theta)
        next_thetas = [step(theta, gradient, -step_size) for step_size in step_sizes]
        next_theta = min(next_thetas, key=target_fn) # choosing the one that minimizes the error function
        next_value = target_fn(next_theta)
        
        # stop if converging
        if abs(value - next_value) < tolerance:
            return theta
        else:
            theta, value = next_theta, next_value
            
def negate(f):
    return lambda *args, **kwargs: -f(*args, **kwargs)

def negate_all(f):
    return lambda *args, **kwargs: [-y for y in f(*args, **kwargs)]

def maximize_batch(target_fn, gradient_fn, theta_0, tolerance=0.000001):
    return minimize_batch(negate(target_fn), negate_all(gradient_fn), theta_0, tolerance)

def safe(f):
    # return a version of f that outputs infinity whenever f raises an error
    def safe_f(*args, **kwargs):
        try:
            return f(*args, **kwargs)
        except Exception:
            return float('inf')
            
    return safe_f
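One way to gain confidence in the derivative and gradient formulas above is to compare them against finite differences. The sketch below is self-contained, so it redefines compact versions of the functions. The data, the starting beta, and the `numerical_gradient` helper are made up for this check.

```python
import math

def dot(v, w):
    return sum(v_i * w_i for v_i, w_i in zip(v, w))

def logistic(x):
    return 1.0 / (1 + math.exp(-x))

def logistic_prime(x):
    return logistic(x) * (1 - logistic(x))

def log_likelihood(x, y, beta):
    total = 0.0
    for x_i, y_i in zip(x, y):
        p = logistic(dot(x_i, beta))
        total += math.log(p) if y_i == 1 else math.log(1 - p)
    return total

def log_gradient(x, y, beta):
    # j-th partial derivative: sum over points of (y_i - p_i) * x_i[j]
    return [sum((y_i - logistic(dot(x_i, beta))) * x_i[j] for x_i, y_i in zip(x, y))
            for j in range(len(beta))]

def numerical_gradient(f, v, h=1e-6):
    # hypothetical helper: central difference in each coordinate
    grad = []
    for j in range(len(v)):
        up = [v_k + (h if k == j else 0.0) for k, v_k in enumerate(v)]
        down = [v_k - (h if k == j else 0.0) for k, v_k in enumerate(v)]
        grad.append((f(up) - f(down)) / (2 * h))
    return grad

# Tiny made-up data set: each row is [1, feature_1, feature_2]
x = [[1, 0.5, -1.2], [1, 2.0, 0.3], [1, -1.0, 0.8], [1, 0.1, 0.1]]
y = [0, 1, 0, 1]
beta = [0.1, -0.2, 0.3]

analytic = log_gradient(x, y, beta)
numeric = numerical_gradient(lambda b: log_likelihood(x, y, b), beta)
max_gap = max(abs(a - n) for a, n in zip(analytic, numeric))

h = 1e-6
prime_gap = abs(logistic_prime(0.7) - (logistic(0.7 + h) - logistic(0.7 - h)) / (2 * h))

print("largest gradient mismatch:", max_gap)
print("logistic_prime mismatch:", prime_gap)
```

Both mismatches should be tiny. A large gap would point to a sign error or a wrong index in the hand-written gradient.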


In [8]:
# Split data into training and test sets (example only: rescaled_x and y are not defined in these notes, so this cell raises a NameError)

random.seed(0)
x_train, x_test, y_train, y_test = train_test_split(rescaled_x, y, 0.33)

# and want to maximize log likelihood on the training data

fn = partial(logistic_log_likelihood, x_train, y_train)
gradient_fn = partial(logistic_log_gradient, x_train, y_train)

# pick random starting point
beta_0 = [random.random() for _ in range(3)]

# and maximize using gradient descent
beta_hat = maximize_batch(fn, gradient_fn, beta_0)

# alternatively you can use stochastic gradient descent, which is shown in the book. Either way we get our betas (for the rescaled data)

NameError: name 'rescaled_x' is not defined
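Since the cell above needs the book's data, here is a minimal runnable sketch of the same idea on synthetic data. Everything in it is an assumption made for illustration: the data is generated from a hypothetical `true_beta`, and the fit uses plain fixed-step gradient ascent on the averaged gradient instead of the step-size search in `maximize_batch`.

```python
import math
import random

def dot(v, w):
    return sum(v_i * w_i for v_i, w_i in zip(v, w))

def logistic(x):
    return 1.0 / (1 + math.exp(-x))

def mean_log_gradient(x, y, beta):
    # gradient of the log likelihood, divided by the number of points
    n = len(x)
    return [sum((y_i - logistic(dot(x_i, beta))) * x_i[j] for x_i, y_i in zip(x, y)) / n
            for j in range(len(beta))]

random.seed(0)
true_beta = [-1.0, 2.0]   # hypothetical intercept and slope used to generate labels
x = [[1, random.uniform(-3, 3)] for _ in range(200)]
y = [1 if random.random() < logistic(dot(x_i, true_beta)) else 0 for x_i in x]

beta = [0.0, 0.0]
step_size = 0.5           # fixed step, chosen by hand for this data
for _ in range(2000):
    gradient = mean_log_gradient(x, y, beta)
    # ascent: move WITH the gradient because we are maximizing
    beta = [b + step_size * g for b, g in zip(beta, gradient)]

final_gradient = mean_log_gradient(x, y, beta)
predictions = [1 if logistic(dot(x_i, beta)) >= 0.5 else 0 for x_i in x]
accuracy = sum(p == y_i for p, y_i in zip(predictions, y)) / len(y)

print("fitted beta:", beta)
print("training accuracy:", accuracy)
```

The fitted beta will not match `true_beta` exactly because the labels are noisy, but the slope should come out positive and the gradient should be close to zero at the end.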

These coefficients are not as easy to interpret as linear regression coefficients. All else being equal, a unit increase in a feature adds its coefficient not to the output value itself (as in linear regression, e.g. minutes spent per day) but to the input of the logistic function. We can say qualitative things based on whether a coefficient is positive or negative, but for anything quantitative we have to look at the logistic output, which gives us the predicted probability.
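A small sketch of what this means in practice, using a hypothetical coefficient of 0.8. Adding the coefficient to the logistic input changes the probability by a different amount depending on where we start, while the change in log-odds (the inverse of the logistic function) is always the coefficient itself.

```python
import math

def logistic(x):
    return 1.0 / (1 + math.exp(-x))

def log_odds(p):
    # inverse of the logistic function
    return math.log(p / (1 - p))

beta_j = 0.8   # hypothetical coefficient for one feature

prob_changes = []
log_odds_changes = []
for base_score in [-4.0, 0.0, 4.0]:
    before = logistic(base_score)
    after = logistic(base_score + beta_j)   # the feature goes up by one unit
    prob_changes.append(after - before)
    log_odds_changes.append(log_odds(after) - log_odds(before))
    print(f"start at {base_score:5.1f}: probability {before:.4f} -> {after:.4f}")

print("probability changes:", prob_changes)
print("log-odds changes:   ", log_odds_changes)
```

The probability moves most when the starting score is near 0 and barely moves when the model is already confident either way.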

In [10]:
# Goodness of Fit
# we can use the test data we held back from the model. We can assume that we predict a paid account whenever the 
# probability exceeds 0.5

true_positives = false_positives = true_negatives = false_negatives = 0 # never seen this instantiation method before. I like it.

for x_i, y_i in zip(x_test, y_test):
    predict = logistic(dot(beta_hat, x_i))
    
    if y_i == 1 and predict >= 0.5:
        true_positives += 1
        
    elif y_i == 1:
        false_negatives += 1
        
    elif predict >= 0.5:
        false_positives += 1
    else:
        true_negatives += 1
        
    # this is so elegant. My intuition tells me there are lots of ways to increment the predictions but damn this is a goodun
    

precision = true_positives / (true_positives + false_positives)
recall = true_positives / (true_positives + false_negatives)



NameError: name 'x_test' is not defined
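The same counting logic can be tried without the book's data. This sketch uses made-up labels and predicted probabilities. Returning 0.0 when a denominator is zero is a convention chosen here, not something from the book.

```python
def confusion_counts(y_true, probabilities, threshold=0.5):
    """Count true/false positives and negatives at the given threshold."""
    tp = fp = tn = fn = 0
    for y_i, p_i in zip(y_true, probabilities):
        predicted_paid = p_i >= threshold
        if y_i == 1 and predicted_paid:
            tp += 1
        elif y_i == 1:
            fn += 1
        elif predicted_paid:
            fp += 1
        else:
            tn += 1
    return tp, fp, tn, fn

def precision(tp, fp):
    # of everything we predicted as paid, how much really was paid
    return tp / (tp + fp) if tp + fp else 0.0

def recall(tp, fn):
    # of everything that really was paid, how much we caught
    return tp / (tp + fn) if tp + fn else 0.0

# Made-up labels and model outputs, for illustration only
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
probabilities = [0.9, 0.6, 0.3, 0.2, 0.7, 0.1, 0.4, 0.8]

tp, fp, tn, fn = confusion_counts(y_true, probabilities)
print("tp, fp, tn, fn =", tp, fp, tn, fn)    # 3 1 3 1
print("precision =", precision(tp, fp))      # 0.75
print("recall    =", recall(tp, fn))         # 0.75
```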

## Support Vector Machines

Support vector machines find the hyperplane that maximizes the distance to the nearest point in each class. Said more simply, they separate our data as cleanly as possible into the two categories we care about. Sometimes no single hyperplane separates the data perfectly, but if we map the data into a higher-dimensional space, one might. See the book for pretty pictures of this.
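A toy sketch of the higher-dimensional idea, with made-up one-dimensional points. This is not a support vector machine, because the separating hyperplane is picked by hand (midway between the nearest points of each class) and not learned. It only shows that data which no single threshold can split becomes linearly separable after mapping x to (x, x squared).

```python
def dot(v, w):
    return sum(v_i * w_i for v_i, w_i in zip(v, w))

# Made-up 1-D data: class 1 sits in the middle, class 0 on both sides
points = [-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
labels = [0, 0, 1, 1, 1, 0, 0]

def label_changes(xs, ys):
    """Number of times the label flips when walking along the sorted points.
    A single threshold can only separate the classes if this is at most 1."""
    ordered = [y for _, y in sorted(zip(xs, ys))]
    return sum(1 for a, b in zip(ordered, ordered[1:]) if a != b)

def lift(x):
    # map a 1-D point into 2-D space
    return (x, x * x)

# Hand-picked hyperplane in the lifted space: predict 1 when 2.5 - x^2 > 0
w, b = (0.0, -1.0), 2.5

def classify(point):
    return 1 if dot(w, point) + b > 0 else 0

predictions = [classify(lift(x)) for x in points]

print("label flips in 1-D:", label_changes(points, labels))
print("predictions after lifting:", predictions)
print("all correct:", predictions == labels)
```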

## This concludes Chapter 16

For further exploration, see scikit-learn, which has modules for logistic regression and support vector machines, and libsvm, which is the support vector machine implementation scikit-learn uses under the hood. See its documentation for more info.