# An Introduction to Classification

* A Statistical Model Revisited
* Linear Discriminant Analysis
* Quadratic Discriminant Analysis
* Logistic Regression
    * log odds
    * connection to linear regression
    * interpretation
    * parameters in scikit-learn
    * decision function
    * precision 
    * recall
    * f1 score
    * MCC
    
* Contingency Tables Revisited
    * odds
    * other hypothesis tests


## A Statistical Model Revisited

Thus far we have looked at statistical models that carry out the regression task.  That is, they take in a set of one or more variables and produce a number.  Specifically, when we say regression we mean:

$$ \hat{y} = mX + b $$

On the right hand side:

`X` is a tensor of one or more variables.  When `X` represents a single variable, we call it a vector.  When `X` represents more than one variable, we typically refer to it as a matrix.  However, it is also possible for `X` to represent higher dimensions.

`m` and `b` are the slope and intercept, typically from the real numbers.  With a single variable both are just scalars; with several variables `m` holds one weight per variable.

On the left hand side:

$\hat{y}$ is also typically from the real numbers.  

And we say that we regress y on X.

One of the important things to note about this procedure is the nature of $\hat{y}$: because it comes from the reals, its output carries distance.  That means:

If, for a given set of X's, $\hat{y} = 5.32$ and, for another set of X's, $\hat{y} = -1.83$, then we can say that the output of the first set of variables is strictly higher than the output of the second set, and that the two outputs are 7.15 apart.
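To make the idea of an ordered, real-valued output concrete, here is a minimal sketch that fits a line to a handful of made-up points with NumPy.  The data and the helper name `predict` are invented purely for illustration.

```python
import numpy as np

# Made-up data lying exactly on y = 2x + 1 (an assumption for illustration).
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 1.0

# Fit y_hat = m * x + b by ordinary least squares.
m, b = np.polyfit(x, y, deg=1)

def predict(new_x):
    """Return the real-valued regression output for new_x."""
    return m * new_x + b

# Because the outputs are real numbers, they can be compared and subtracted.
first, second = predict(5.0), predict(-2.0)
print(first > second, first - second)
```

The comparison and the subtraction on the last line are exactly the operations that stop making sense once the output is a class label.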

It is not always useful for our output to be metrizable, that is, measurable in terms of distance.  It may be the case that our output should not carry any sense of distance or comparison in any way.

For this we need to introduce a new statistical task, that of classification.

## Classification

The basic idea behind classification is, what if we output a $\hat{y}$ that was categorical rather than continuous?  We've already seen categorical variables in the Applying Statistical Tests chapter.  But more formally, a categorical variable is one in which the different classes are just that, classes.  They are just designations.  So let's say we had two classes, A and B.  They could be classes of anything.  Like tall and short people.  Or young and old people.  Or different flavors of ice cream.  As much as people might try to rank order these different classes, neither is truly better than the other.  

If you want to try a fun experiment, ask some friends what they think about different classes of things, like maybe whether it's better to be young or old, better to be tall or short, better to eat vanilla or chocolate ice cream.  I bet, as long as your friends aren't too similar, they'll all answer differently.  And that's the point!  There is no objective ordering of any of these classes.  And therefore, we cannot define an explicit metric to rank them.

So what?  How are categorical variables useful?  Well turns out they have tons of uses!  We used them extensively in Applying Statistical Tests!  Specifically some of the demographic variables and the converted variable were all categorical.  Without categorical data, we'd never be able to model any of that!  And then we'd be greatly constraining the set of problems we can solve with statistical modeling and analysis.

Hopefully I've convinced you that classification is cool!  Now let's look at a basic definition of it, so we can compare against our regression task.

### Linear Discriminant Analysis

We'll start our analysis of classification by looking at Linear Discriminant Analysis.  This technique was invented by the great Ronald Fisher, who also laid many of the other foundations of statistics.

Let's start with the problem set up:

Assume we have two classes and a bunch of data about the population in general.  The data about the population of interest is referred to as features of the data.  And the two classes are called the labels or target.  

To make this practical, let's set up a concrete example:

Assume you want to understand whether someone is likely to vote republican or democrat in the upcoming election.  Let's assume you have:

* Age
* Salary
* Location (latitude and longitude)

Let's first generate the dataset, and then we can start to go over the technique:

In [27]:
import pandas as pd
import random
import numpy as np

df = pd.DataFrame()

df["party"] = [random.choice(["republican", "democrat"])
               for _ in range(2000)]
df["Age"] = np.random.normal(50, 15, size=2000)
df["Age"] = df["Age"].astype(int)
df["Salary"] = np.random.normal(45000, 1500, size=2000)
df["Salary"] = df["Salary"].apply(lambda x: round(x, 2))
df["Latitude"] = np.random.normal(39, 15, size=2000)
df["Latitude"] = df["Latitude"].apply(lambda x: round(x, 4))
df["Longitude"] = np.random.normal(94, 15, size=2000)
df["Longitude"] = df["Longitude"].apply(lambda x: round(x, 4))

In [23]:
df.head()

|   | party | Age | Salary | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | republican | 42 | 49100.44 | 50.4067 | 112.0127 |
| 1 | republican | 53 | 44189.51 | 31.2893 | 97.2973 |
| 2 | republican | 48 | 45943.18 | 10.4703 | 84.7361 |
| 3 | republican | 46 | 42621.86 | 30.6266 | 94.1107 |
| 4 | republican | 53 | 44060.8 | 13.4962 | 74.786 |


As you can see, we've also generated a target variable, `party`.  This will be what we want our model to predict.  Linear Discriminant Analysis can also be used for dimensionality reduction, which we will look at in a different chapter.

For classification the procedure is:

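1. Calculate the mean of each variable per class.
2. Calculate the covariance matrix per class and pool these into a single shared covariance matrix, weighting each class by its share of the data.
3. Apply the least squares algorithm to the two matrices calculated above and take the first element of the result (the solution), which gives the coefficients.
4. Use the diagonal of the dot product between the means and the coefficients to recover the intercept.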

In [24]:
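import numpy as np

FEATURES = ["Age", "Salary", "Longitude", "Latitude"]

def mean_per_class(df, target_column):
    # One row of feature means per class.
    return df.groupby(target_column)[FEATURES].mean()

def pooled_covariance(df, target_column):
    # Per-class covariance matrices, averaged with class-frequency weights.
    return sum(
        (len(group) / len(df)) * np.cov(group[FEATURES].values, rowvar=False, bias=True)
        for _, group in df.groupby(target_column)
    )

means = mean_per_class(df, "party")
covariance = pooled_covariance(df, "party")
# Solve covariance @ w = means.T in the least squares sense; [0] is the solution.
coefficients = np.linalg.lstsq(covariance, means.values.T, rcond=None)[0].T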



The final line recovers the coefficients!  I decided not to show the intercept because it's a bit complex and doesn't add much in terms of teaching value.

So, we've covered how you train this classifier.  But how do you make predictions?  This is the major difference between classification and regression.  For regression problems you simply apply your matrix to new data and whatever you output is what you get.  With classification the steps are as follows:

1. Apply a decision function, which gives you back the log likelihood ratio of the positive class.

2. Calculate the predicted probability of membership in each class from the results of the decision function.

3. Get the classes by taking the argmax, which maps to whichever class has the higher probability.
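The steps above can be sketched for the two-class case as follows.  This is a minimal illustration rather than scikit-learn's implementation: it assumes the score from step 1 is the log of the ratio between the positive and negative class probabilities, and both the helper name `predict_binary` and the example scores are hypothetical.

```python
import numpy as np

def predict_binary(scores, classes=(0, 1)):
    """Turn decision-function scores into probabilities and class labels."""
    scores = np.asarray(scores, dtype=float)
    # Step 2: the logistic function maps a log ratio to P(positive class).
    p_positive = 1.0 / (1.0 + np.exp(-scores))
    probabilities = np.column_stack([1.0 - p_positive, p_positive])
    # Step 3: pick the class with the higher probability.
    labels = np.asarray(classes)[np.argmax(probabilities, axis=1)]
    return probabilities, labels

# Hypothetical scores standing in for the output of step 1.
probabilities, labels = predict_binary([-2.0, 0.5, 3.0])
print(probabilities.round(3))
print(labels)
```

A negative score ends up in the first class and a positive score in the second, so for two classes the argmax is equivalent to checking the sign of the score.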

We won't look at those methods explicitly; however, it is important to know that this is the general procedure for classification.  Let's turn now to making use of scikit-learn's implementation of Linear Discriminant Analysis for a classification task:

In [29]:
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

y = df["party"].map({"republican": 0, "democrat": 1})
X = df[["Age", "Salary", "Longitude", "Latitude"]].values
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
clf = LinearDiscriminantAnalysis(solver="lsqr")
clf.fit(X_train, y_train)
y_hat = clf.predict(X_test)
print(classification_report(y_test, y_hat))

              precision    recall  f1-score   support

           0       0.54      0.74      0.63       261
           1       0.53      0.32      0.40       239

    accuracy                           0.54       500
   macro avg       0.54      0.53      0.51       500
weighted avg       0.54      0.54      0.52       500



Linear Discriminant Analysis is rarely used for classification because of its many assumptions, namely:

* Each of the variables is independent 
* Each of the variables is normally distributed
* The covariances of variables of each class must be the same.

The third assumption means, for example, that the variance of `Age|republican` must be the same as the variance of `Age|democrat`.  This assumption is rarely met.

We can check this third assumption, one variable at a time, via Levene's statistical test, which has the null hypothesis:

* all inputs have the same variance

In [28]:
from scipy import stats

republican = df[df["party"] == "republican"]
democrat = df[df["party"] == "democrat"]
print(stats.levene(republican["Age"], democrat["Age"], center="mean"))
print(stats.levene(republican["Salary"], democrat["Salary"], center="mean"))

LeveneResult(statistic=0.28586435489564416, pvalue=0.5929425110240991)
LeveneResult(statistic=1.0050551737574565, pvalue=0.31621178435344255)


We fail to reject the null hypothesis in both cases!  So we have no evidence that the equal variance assumption is violated for `Age` or `Salary`.  With that check done, we can do something that we can't do with many of our other classifiers: easily interpret the coefficients of our model!

In [30]:
coef = clf.coef_[0]
for index, column in enumerate(["Age", "Salary", "Longitude", "Latitude"]):
    print(column, coef[index])


Age 0.0027272691522056136
Salary -5.358556337285714e-06
Longitude 0.0033216839376942753
Latitude -0.000390080483058608


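Because we are dealing with binary classification there is only one discriminant function.  We see that `Longitude` has the largest magnitude and therefore carries the most weight per unit in the function used to separate the data into one of the two classes.  Keep in mind that the features are on very different scales (`Salary` is in the tens of thousands), so raw magnitudes are only a rough guide to importance.  In fact, we can directly recover the discriminant score function, up to the intercept, by taking a linear combination of the data and its weights: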

In [31]:
from functools import partial

def get_score_function(clf, x):
    coef = clf.coef_[0]
    summation = 0
    for index, column in enumerate(["Age", "Salary", "Longitude", "Latitude"]):
        summation += x[column] * coef[index]
    return summation

get_score = partial(get_score_function, clf)

df["score"] = df.apply(get_score, axis=1)

In [32]:
df.head()

|   | party | Age | Salary | Latitude | Longitude | score |
|---|---|---|---|---|---|---|
| 0 | republican | 54 | 47206.99 | 40.0968 | 82.0462 | 0.151202 |
| 1 | democrat | 22 | 46681.9 | 45.918 | 84.8948 | 0.073934 |
| 2 | republican | 42 | 44900.31 | 65.8046 | 83.6835 | 0.126246 |
| 3 | democrat | 62 | 45391.79 | 47.562 | 74.3915 | 0.154408 |
| 4 | republican | 37 | 45319.49 | 50.4515 | 97.7834 | 0.163187 |


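Here we can think of this score as the value the classifier uses to decide whether the voter is republican or democrat.  Once the intercept is added, a positive value points to class 1 (democrat) and a negative value to class 0 (republican).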
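As a sanity check on the idea of a score, the sketch below compares a hand-computed linear combination (weights plus intercept) with scikit-learn's `decision_function`.  It uses small synthetic data rather than the voter dataset, and the names `X_demo`, `y_demo` and `demo_clf` are made up for this example.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
# Synthetic two-class data (an assumption; not the voter dataset above).
X_demo = np.vstack([rng.normal(0.0, 1.0, size=(100, 2)),
                    rng.normal(1.5, 1.0, size=(100, 2))])
y_demo = np.array([0] * 100 + [1] * 100)

demo_clf = LinearDiscriminantAnalysis(solver="lsqr").fit(X_demo, y_demo)

# Linear combination of features and weights, plus the intercept.
manual_scores = X_demo @ demo_clf.coef_[0] + demo_clf.intercept_[0]

# Compare the manual score with scikit-learn's decision function ...
matches_decision = np.allclose(manual_scores, demo_clf.decision_function(X_demo))
# ... and check that its sign determines the predicted class.
matches_predict = np.array_equal((manual_scores > 0).astype(int),
                                 demo_clf.predict(X_demo))
print(matches_decision, matches_predict)
```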

Now, just for completeness, let's see what our coefficients would look like with three possible classes:

In [19]:
df = pd.DataFrame()

df["party"] = [random.choice(["republican", "democrat", "other"])
               for _ in range(2000)]
df["Age"] = np.random.normal(50, 15, size=2000)
df["Age"] = df["Age"].astype(int)
df["Salary"] = np.random.normal(45000, 1500, size=2000)
df["Salary"] = df["Salary"].apply(lambda x: round(x, 2))
df["Latitude"] = np.random.normal(39, 15, size=2000)
df["Latitude"] = df["Latitude"].apply(lambda x: round(x, 4))
df["Longitude"] = np.random.normal(94, 15, size=2000)
df["Longitude"] = df["Longitude"].apply(lambda x: round(x, 4))

In [20]:
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

y = df["party"].map({"republican": 0, "democrat": 1, "other": 2})
X = df[["Age", "Salary", "Longitude", "Latitude"]].values
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
clf = LinearDiscriminantAnalysis(solver="lsqr")
clf.fit(X_train, y_train)
y_hat = clf.predict(X_test)
print(classification_report(y_test, y_hat))

              precision    recall  f1-score   support

           0       0.30      0.33      0.31       164
           1       0.25      0.18      0.21       148
           2       0.40      0.47      0.43       188

    accuracy                           0.34       500
   macro avg       0.32      0.32      0.32       500
weighted avg       0.33      0.34      0.33       500



In [21]:
clf.coef_

array([[0.14169747, 0.02075122, 0.3577216 , 0.19963037],
       [0.13783221, 0.02068559, 0.35913213, 0.20505868],
       [0.13546089, 0.02070732, 0.36078705, 0.19971238]])

As you can see, now there are three discriminant functions which are used to differentiate the data.  Each row represents a different linear discriminant function.
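To see how several discriminant functions turn into a single prediction, here is a small sketch on synthetic three-class data: each row of `coef_` produces one score per sample, and the predicted class is the one with the highest score.  The data and the names `X_multi`, `y_multi` and `multi_clf` are assumptions made for this illustration.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(1)
# Three synthetic classes with different centers (illustrative assumption).
centers = [(0.0, 0.0), (3.0, 0.0), (0.0, 3.0)]
X_multi = np.vstack([rng.normal(c, 1.0, size=(80, 2)) for c in centers])
y_multi = np.repeat([0, 1, 2], 80)

multi_clf = LinearDiscriminantAnalysis(solver="lsqr").fit(X_multi, y_multi)

# One score per class: each row of coef_ gives one discriminant function.
scores = X_multi @ multi_clf.coef_.T + multi_clf.intercept_
predicted = multi_clf.classes_[np.argmax(scores, axis=1)]

print(scores.shape)
print(np.array_equal(predicted, multi_clf.predict(X_multi)))
```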

## Quadratic Discriminant Analysis

With Linear Discriminant Analysis we made some very strong assumptions:

* Each of the variables is independent 
* Each of the variables is normally distributed
* The covariances of variables of each class must be the same.

It is unlikely that the third assumption will be met.  This is the major motivation for Quadratic Discriminant Analysis, which drops this assumption.

The other assumptions of Linear Discriminant Analysis persist.  Therefore our assumptions are:

* Each of the variables is independent 
* Each of the variables is normally distributed

In addition to relaxing the equivalence of covariances of classes, we can now have decision boundaries of a quadratic nature, hence the name.  Therefore the boundary between classes can take the shape of any conic section:

* line
* circle
* ellipse
* parabola
* hyperbola

Notice that we can still fit a line, meaning that quadratic discriminant analysis has a superset of the curve fitting power of linear discriminant analysis.  One use case I haven't discussed in detail yet is dimensionality reduction.  It is possible to use linear discriminant analysis to do this.  However, with quadratic discriminant analysis, we can only do classification.  So while quadratic discriminant analysis has many desirable features, it doesn't have the flexibility of linear discriminant analysis.  We will cover dimensionality reduction in detail in another chapter, so if you don't know what it is, don't worry!
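To illustrate the extra curve fitting power, the sketch below builds a synthetic dataset in which one class sits inside the other, so the natural boundary is roughly a circle, and compares the two classifiers on it.  The data and variable names are invented for this example, and training accuracy is used only to keep the sketch short.

```python
import numpy as np
from sklearn.discriminant_analysis import (
    LinearDiscriminantAnalysis,
    QuadraticDiscriminantAnalysis,
)

rng = np.random.default_rng(2)
# Class 0 is a tight cluster; class 1 surrounds it with a much larger spread,
# so no straight line separates them well (synthetic, illustrative data).
X_inner = rng.normal(0.0, 0.5, size=(300, 2))
X_outer = rng.normal(0.0, 3.0, size=(300, 2))
X_demo = np.vstack([X_inner, X_outer])
y_demo = np.array([0] * 300 + [1] * 300)

lda_score = LinearDiscriminantAnalysis().fit(X_demo, y_demo).score(X_demo, y_demo)
qda_score = QuadraticDiscriminantAnalysis().fit(X_demo, y_demo).score(X_demo, y_demo)
print(f"LDA accuracy: {lda_score:.2f}")
print(f"QDA accuracy: {qda_score:.2f}")
```

Because the two classes share a center but differ in spread, the per-class covariances are what carry the signal here, which is exactly the information linear discriminant analysis pools away.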

Let's look at what's changed in the implementation of quadratic discriminant analysis:

In [None]:
import numpy as np

def mean_per_class(df, target_column):
    # One row of feature means per class.
    return df.groupby(target_column).mean()

def covariance_per_class(df, target_column):
    # One full covariance matrix per class; no pooling across classes.
    return df.groupby(target_column).cov()

means = mean_per_class(df, "party")
