# An Introduction to Classification

* A Statistical Model Revisited

* Contingency Tables Revisited
    * odds
    * other hypothesis tests

* Logistic Regression
    * log odds
    * connection to linear regression
    * interpretation
    * parameters in scikit-learn
    * decision function
    * precision 
    * recall
    * f1 score
    * MCC

## A Statistical Model Revisited

Thus far we have looked at statistical models that carry out the regression task.  That is, they take in a set of one or more variables and produce a number.  Specifically, when we say regression we mean:

$$ \hat{y} = mX + b $$

On the right-hand side:

`X` is a tensor of one or more variables.  When `X` represents a single variable, we call it a vector.  When `X` represents more than one variable, we typically refer to it as a matrix.  It is also possible for `X` to have higher dimensions.

`m` and `b` are just scalars, typically real numbers (when `X` has several columns, `m` becomes a vector with one coefficient per column).

On the left hand side:

$\hat{y}$ is also typically from the real numbers.  

And we say that we regress `y` on `X`.

One important thing to note about this procedure is the nature of $\hat{y}$: because it comes from the reals, its output carries a notion of order and distance.  That means:

If for one set of X's $\hat{y} = 5.32$ and for another set of X's $\hat{y} = -1.83$, then we can say that the output for the first set is strictly higher than the output for the second, and that the two differ by 7.15.
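To make the ordering concrete, here is a minimal sketch of the regression equation.  The values of `m`, `b`, and `X` are made up for illustration and are not fitted to any data:

```python
import numpy as np

# Hypothetical slope and intercept, chosen only for illustration.
m, b = 2.0, -1.0

# Two inputs for a single-variable X.
X = np.array([3.0, 0.5])

# The regression output is a real number for every input.
y_hat = m * X + b

# Because the outputs are real numbers, we can order them and measure the gap.
first_is_larger = y_hat[0] > y_hat[1]
gap = y_hat[0] - y_hat[1]
print(y_hat, first_is_larger, gap)
```

Both the comparison and the subtraction are meaningful here, which is exactly what we mean by the output carrying distance.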

It is not always useful for our output to be metrizable, that is, measurable in terms of distance.  Sometimes our output should not carry any sense of distance or comparison in any way.

For this we need to introduce a new statistical task, that of classification.

## Classification

The basic idea behind classification is, what if we output a $\hat{y}$ that was categorical rather than continuous?  We've already seen categorical variables in the Applying Statistical Tests chapter.  But more formally, a categorical variable is one in which the different classes are just that, classes.  They are just designations.  So let's say we had two classes, A and B.  They could be classes of anything.  Like tall and short people.  Or young and old people.  Or different flavors of ice cream.  As much as people might try to rank order these different classes, neither is truly better than the other.  

If you want to try a fun experiment, ask some friends what they think about different classes of things, like maybe whether it's better to be young or old, better to be tall or short, better to eat vanilla or chocolate ice cream.  I bet, as long as your friends aren't too similar, they'll all answer differently.  And that's the point!  There is no objective ordering of any of these classes.  And therefore, we cannot define an explicit metric to rank them.
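pandas can express the same idea.  The sketch below uses a small made-up list of ice cream flavors (an assumption for illustration) and declares it as an unordered categorical, so equality checks work but ranking does not:

```python
import pandas as pd

# An unordered categorical: the classes are labels and nothing more.
flavors = pd.Categorical(["vanilla", "chocolate", "vanilla"],
                         categories=["vanilla", "chocolate"],
                         ordered=False)

# Equality checks are meaningful for classes...
same_flavor = flavors[0] == flavors[2]

# ...but pandas refuses to rank an unordered categorical.
try:
    flavors < "chocolate"
    can_rank = True
except TypeError:
    can_rank = False

print(same_flavor, can_rank)
```

The `TypeError` is pandas' way of saying that "less than" has no meaning for plain class labels.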

So what?  How are categorical variables useful?  Well, it turns out they have tons of uses!  We used them extensively in Applying Statistical Tests: some of the demographic variables and the converted variable were all categorical.  Without categorical data, we'd never be able to model any of that, and we'd be greatly constraining the set of problems we can solve with statistical modeling and analysis.

Hopefully I've convinced you that classification is cool!  Now let's look at a basic definition of it, so we can compare against our regression task.

### Linear Discriminant Analysis

We'll start our analysis of classification by looking at Linear Discriminant Analysis.  This technique was invented by the great Ronald Fisher, who also laid many of the other foundations of statistics.

Let's start with the problem set up:

Assume we have two classes and a bunch of data about the population in general.  The data about the population of interest is referred to as features of the data.  And the two classes are called the labels or target.  

To make this practical, let's set up a concrete example:

Assume you want to understand whether someone is likely to vote Republican or Democrat in the upcoming election.  Let's assume you have:

* Age
* Salary
* Location (latitude and longitude)

Let's first generate the dataset, and then we can start to go over the technique:

In [1]:
import pandas as pd
import random
import numpy as np

df = pd.DataFrame()

df["party"] = [random.choice(["republican", "democrat"])
               for _ in range(1000)]
df["Age"] = np.random.normal(50, 15, size=1000)
df["Age"] = df["Age"].astype(int)
df["Salary"] = np.random.normal(45000, 1500, size=1000)
df["Salary"] = df["Salary"].apply(lambda x: round(x, 2))
df["Latitude"] = np.random.normal(39, 15, size=1000)
df["Latitude"] = df["Latitude"].apply(lambda x: round(x, 4))
df["Longitude"] = np.random.normal(94, 15, size=1000)
df["Longitude"] = df["Longitude"].apply(lambda x: round(x, 4))

In [4]:
df.head()

|   | party | Age | Salary | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | republican | 41 | 46287.29 | 30.1092 | 99.0837 |
| 1 | republican | 38 | 45969.98 | 28.902 | 77.9553 |
| 2 | republican | 80 | 45823.78 | 39.7949 | 119.1563 |
| 3 | republican | 73 | 43861.54 | 59.1974 | 57.3633 |
| 4 | republican | 38 | 48256.24 | 46.4441 | 85.7561 |


As you can see, we've also generated a target variable, `party`.  This is what we want our model to predict.  Linear Discriminant Analysis can also be used for dimensionality reduction, which we will look at in a different chapter.

For classification the procedure is:

1. Calculate the mean of each variable within each class.
2. Calculate the covariance of the variables within each class.
3. Apply the least squares algorithm to the two matrices calculated above and take the first element of the result, which holds the solution.
4. Use the diagonal of the dot product between the means and the coefficients to recover the intercept.

In [53]:
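import numpy as np

def mean_per_class(df, target_column):
    # One row per class, one column per feature.
    return df.groupby(target_column).mean()

def covariance_per_class(df, target_column):
    # One feature-by-feature covariance matrix per class.
    return {label: np.cov(group.drop(columns=target_column).values, rowvar=False, bias=True)
            for label, group in df.groupby(target_column)}

means = mean_per_class(df, "party")
covariances = covariance_per_class(df, "party")
# Pool the per-class covariances, weighting each class by its share of the rows.
priors = df["party"].value_counts(normalize=True)
pooled = sum(priors[label] * cov for label, cov in covariances.items())
# lstsq returns a tuple; its first element is the solution (one row per class after transposing).
coefficients = np.linalg.lstsq(pooled, means.values.T, rcond=None)[0].T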


As you can see, we recover the coefficients!  I decided not to show the intercept because it's a bit complex and doesn't add much in terms of teaching value.  
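For readers who do want to see the intercept, the four steps can be sketched end to end with NumPy alone.  This is a minimal sketch under assumptions: the two-feature toy data, the class labels 0 and 1, and the `np.log(priors)` term (the usual correction for unequal class sizes) are illustrative additions and are not part of the voting example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data (an assumption for illustration): two well-separated classes, two features.
X = np.vstack([rng.normal(0.0, 1.0, size=(200, 2)),
               rng.normal(4.0, 1.0, size=(200, 2))])
y = np.array([0] * 200 + [1] * 200)

classes = np.unique(y)
priors = np.array([np.mean(y == c) for c in classes])

# Step 1: mean of every feature within each class (classes x features).
means = np.vstack([X[y == c].mean(axis=0) for c in classes])

# Step 2: covariance within each class, pooled by class frequency.
pooled_cov = sum(p * np.cov(X[y == c], rowvar=False, bias=True)
                 for p, c in zip(priors, classes))

# Step 3: least squares solve of pooled_cov @ coefficients.T = means.T.
coefficients = np.linalg.lstsq(pooled_cov, means.T, rcond=None)[0].T

# Step 4: intercept from the diagonal of means @ coefficients.T.
intercepts = -0.5 * np.diag(means @ coefficients.T) + np.log(priors)

# Score every row for every class and pick the class with the larger score.
scores = X @ coefficients.T + intercepts
predictions = classes[np.argmax(scores, axis=1)]
accuracy = np.mean(predictions == y)
print(coefficients.shape, intercepts.shape, accuracy)
```

Because the toy classes sit far apart, nearly every row should be assigned to its own class.  On the randomly labeled voting data, where `party` is independent of the features, no classifier could be expected to do much better than chance.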

So, we've covered how you train this classifier.  But how do you make predictions?  This is the major difference between classification and regression.  For regression problems you simply apply your matrix to new data and whatever you output is what you get.  With classification the steps are as follows:

1. Apply a decision function, which gives you back the log likelihood ratio of the positive class.

2. Calculate the predicted probabilities of membership in each class from the results of the decision function.

3. Get the classes by taking the argmax, which maps each observation to whichever class has the higher probability.
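The three prediction steps can be sketched for the two-class case as follows.  The weights, intercept, and new data points are hypothetical values chosen for illustration, and treating `republican` as the positive class is likewise an assumption:

```python
import numpy as np

# Hypothetical fitted parameters for a two-feature, two-class problem.
weights = np.array([1.5, -0.5])
intercept = -1.0
classes = np.array(["democrat", "republican"])  # second entry plays the positive class

# Two hypothetical new observations.
new_data = np.array([[2.0, 1.0],
                     [0.0, 3.0]])

# Step 1: the decision function scores each row (log likelihood ratio of the positive class).
decision = new_data @ weights + intercept

# Step 2: the sigmoid turns each score into a probability for the positive class;
# the negative class gets the remainder.
positive_probability = 1.0 / (1.0 + np.exp(-decision))
probabilities = np.column_stack([1.0 - positive_probability, positive_probability])

# Step 3: the argmax picks the class with the higher probability in each row.
predicted = classes[np.argmax(probabilities, axis=1)]
print(decision, probabilities, predicted)
```

A positive decision score corresponds to a probability above 0.5 for the positive class, so in the two-class case the argmax simply follows the sign of the decision function.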