## Lab 3 Assignment: Logistic Regression for Spam Analysis ##

This lab revisits the spam data set, this time using Logistic Regression instead of SVM.  Because it follows the same steps as our earlier Logistic Regression review of the Phishing dataset, the code below should look familiar.

# Brief Review #

Let's quickly review what Logistic Regression is:

The linear regression model represents its output as a weighted sum of the input features, which yields a straight line in the Cartesian plane. In the following image:

<div>
<img src="http://vbehzadan.com/AISec/linreg1.png" width="700"/>
</div>

- X = the series of $x_{1}, ..., x_{n}$
- w = weights
- $β$ = bias

How is this model different from a perceptron?
- There is no activation function $f$ applied to the weighted sum

Optimization objective:

The weights determine how important each feature is.  Even when the equation is written out term by term, each weight stays paired with its feature: $w_{1}x_{1}, ..., w_{n}x_{n}$.  This may be reminiscent of correlation.
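The weighted sum described above can be sketched in a few lines. This is a minimal illustration, not the course's implementation; the function name `linear_predict` and the sample values are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def linear_predict(x, w, bias):
    """Return w_1*x_1 + ... + w_n*x_n + bias for one sample."""
    if len(x) != len(w):
        raise ValueError("x and w must have the same length")
    # Each weight is multiplied by its own feature, then the bias is added.
    return sum(w_i * x_i for w_i, x_i in zip(w, x)) + bias


# Hypothetical feature values, weights, and bias (illustration only).
x = [1.0, 2.0, 3.0]
w = [0.5, -1.0, 2.0]
bias = 0.25

y_hat = linear_predict(x, w, bias)
print(y_hat)  # 0.5 - 2.0 + 6.0 + 0.25 = 4.75
```

A larger weight (here the 2.0 on the third feature) moves the output more per unit of its feature, which is the sense in which weights express feature importance.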

There remains a problem with using linear regression for classification: these equations are meant for prediction problems.  Linear regression is meant to minimize forecasting error, which in turn causes classification errors.
> Consider this enforced and enhanced overfitting.
> The desire is to find a hyperplane very close (minimum prefix margin) to the data separation.

To remediate this problem, estimate the probability that a sample belongs to each individual class.  We want to train a model which has **n** input features and **m** outputs.  _This is **not** an n-inputs-to-1-output problem!_
SVM or multi-layer perceptrons can also be used to remediate this issue.

With multiple outputs, the algorithm evaluates which output has the highest probability.  The output with the highest probability is taken as the predicted class.
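Picking the most probable output can be sketched as follows. The helper name `pick_class` and the probability values are hypothetical; in practice the probabilities would come from a trained model rather than being typed in by hand.

```python
def pick_class(class_probabilities):
    """Return the class label whose estimated probability is highest."""
    if not class_probabilities:
        raise ValueError("at least one class is required")
    return max(class_probabilities, key=class_probabilities.get)


# Hypothetical per-class probabilities for a single message (illustration only).
probs = {"ham": 0.82, "spam": 0.18}

predicted = pick_class(probs)
print(predicted)  # ham
```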

How do we evaluate the probability of a class given an input?  (See the image below):
<div>
<img src="http://vbehzadan.com/AISec/logistic1.png" width="700"/>
</div>

- P = probability
- c = classification
- x = input
- sigmoid function: $e^{z}/(1 + e^{z})$
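As a small sketch of the sigmoid listed above, the function below computes $e^{z}/(1 + e^{z})$ using only the standard library. Splitting on the sign of `z` is an implementation choice made here to avoid overflow for large inputs; it is algebraically the same formula. The function name is an assumption for illustration.

```python
import math


def sigmoid(z):
    """Compute e^z / (1 + e^z), which maps any real z into the range 0 to 1."""
    if z >= 0:
        # Equivalent form 1 / (1 + e^-z); avoids overflow when z is large.
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


for z in (-4.0, 0.0, 4.0):
    print(z, sigmoid(z))
```

At `z = 0` the result is exactly 0.5, and the outputs for `z` and `-z` always add up to 1, which is why the value can be read as a class probability.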

In [1]:
# Set up the coding environment for the notebook

import os
import warnings

import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Suppress library warnings to keep the notebook output readable
warnings.simplefilter('ignore')

df = pd.read_csv(os.path.join('Data','sms_spam_svm.csv'))

# Column 0 holds the label strings. The targets must be numbers,
# so map 'spam' to -1 and every other label to 1.
# np.where(condition, value_if_true, value_if_false)
y = df.iloc[:, 0].values
y = np.where(y == 'spam', -1, 1)

# Columns 1 and 2 hold the two input features.
X = df.iloc[:, [1, 2]].values

In [2]:
from sklearn.model_selection import train_test_split

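# Hold out 30% of the samples for testing; random_state fixes the shuffle for reproducibility
training_samples, testing_samples, training_targets, testing_targets = train_test_split(
         X, y, test_size=0.3, random_state=0)

# Fit the model on the training split, then predict labels for the test split
log_classifier = LogisticRegression()
log_classifier.fit(training_samples, training_targets)
predictions = log_classifier.predict(testing_samples)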

In [3]:
accuracy = 100.0 * accuracy_score(testing_targets, predictions)
print("Logistic Regression accuracy: " + str(accuracy))

Logistic Regression accuracy: 84.44444444444444
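
For reference, an accuracy figure like the one above is the percentage of test predictions that match their targets. A minimal sketch of that calculation follows; the helper name `accuracy_percent` and the toy label lists are hypothetical and are not taken from the spam data.

```python
def accuracy_percent(targets, predictions):
    """Return the percentage of predictions that equal their targets."""
    if len(targets) != len(predictions):
        raise ValueError("targets and predictions must have the same length")
    if not targets:
        raise ValueError("at least one sample is required")
    correct = sum(1 for t, p in zip(targets, predictions) if t == p)
    return 100.0 * correct / len(targets)


# Toy labels using the same -1 (spam) / 1 (not spam) encoding as the notebook.
toy_targets = [1, -1, 1, 1, -1]
toy_predictions = [1, -1, -1, 1, -1]

toy_accuracy = accuracy_percent(toy_targets, toy_predictions)
print(toy_accuracy)  # 4 of 5 match, so 80.0
```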
