# Lecture 9: Introduction to Classification
### 4/19/22

### Table of Contents
* [What is Machine Learning?](#ml)  
    * [Taxonomy of Machine Learning](#taxonomy)

* [Classification](#classification)
    * [Binary Classification](#binary)
    * [Extending Linear Regression to Classification](#extension)
* [Logistic Regression](#logistic_regression)
    * [The Sigmoid Function](#sigmoid)
    * [Fitting Logistic Regression Models](#logistic_fitting)
    * [Logistic Loss](#log_loss)
* [Assessing Classification Models](#assessing)
    * [Decision Boundaries](#decisions)
    * [Classification Metrics](#metrics)
    * [ROC Curves](#roc)

### Hosted and maintained by the [Student Association for Applied Statistics (SAAS)](https://saas.berkeley.edu).

Presented by Jonathan Pan and Gilbert Feng

<a id='ml'></a>
## What is Machine Learning? 

The term machine learning was coined in 1959 by computer scientist Arthur Samuel after his work on creating a checkers-playing program. Samuel defined machine learning as:
> "A field of study that gives computers the ability to learn without being explicitly programmed"

Although this quote is likely misattributed, it still serves as a good baseline definition, despite being overly abstract and ambiguous. However, Samuel's work in 1959 is by no means the moment in time at which machine learning was "invented". The least squares method, one of the most common methods of data fitting in machine learning, was published over 150 years prior, in 1805, by Adrien-Marie Legendre.

Some of the theory and methods behind machine learning have been around for decades, so why is it so popular now?
* Abundance of data (cloud storage)
* Abundance of computing power (advancements in GPUs)
* Money (companies have been able to make the above two profitable)

Machine learning is an evolving field: it has been for nearly a century and will continue to evolve for the foreseeable future. Despite this, there are some commonalities in how machine learning is used today:
![](images/ml_steps.png)
As you can see, there is a lot more to data science and machine learning than building linear regression models. The 7 steps above are a good summary, but the machine learning process actually is not one-directional; it resembles more of a cycle, just like the scientific method. 

<a id='taxonomy'></a>
### Taxonomy of Machine Learning

The background information and definitions above are still kind of ambiguous. Let's do a better job at concretely understanding the foundation of modern machine learning.

If you take a class like Data 100, you will likely see an image similar to the one below. It's not a perfect taxonomy, but it captures most of the key ideas. Let's take a few minutes to understand what's going on here. 

![](images/ml_taxonomy.jpg)

#### Exercise: What group of machine learning algorithms does Linear Regression fall under? 

**Answer:** [Type your answer here]

#### Exercise: What is the difference between Supervised and Unsupervised Learning? Come up with one example of a problem which falls under each one. 

**Answer:** [Type your answer here]

<a id='classification'></a>
## Classification 

So far in your CX journey, you've learned how to solve problems involving Regression, which is the left-most branch of the ML taxonomy tree diagram. 

Now we will jump over to its sibling, *classification*.

Instead of trying to predict a quantitative response variable, we'll simplify our prediction to selecting between categories. For example, consider the following classification problem of predicting whether a particular image is a cat or a dog: 
![](images/cat_dog.gif)
In a classification problem, we are given data $\mathbf{X}$ and labels $\vec{y}$ and we want to learn the relationship between them. In this particular example, $\mathbf{X}$ might represent a grid of pixels containing the image of a cat, while $\vec{y}$ would contain a 1 or 0, depending on whether the image was a cat or a dog.

#### Exercise: If you were asked to develop a method to classify images of cats and dogs, how would you do it?

**Answer:** [Type your answer here]

<a id='binary'></a>
### Binary Classification

In general, it's totally possible to have a classification model which can make predictions between many classes all at once; for example, predicting if a given image is a dog, cat, chair, plane, etc. For simplicity today, we will focus on the *binary classification* case: you are only predicting whether a data point belongs to a particular class (**1**) or not (**0**). 

One way to build a binary classification model is to think about conditional probability: $\mathbb{P}[y_i = 1 | \vec{x}_i]$. In words: what is the chance that this data point has a label of 1, given its features $\vec{x}_i$? (Note: The notation $\vec{x}_i$ represents a row vector of all the features for a single individual). 

#### Exercise: If you knew exactly what $\mathbb{P}[y_i = 1 | \vec{x}_i]$ was for each individual $i$, for what range of probabilities would you classify the individual into class 1? How about class 0? 

**Answer**: [Type your answer here]

<a id='extension'></a>
### Extending Linear Regression to Classification

Now, we have observed that if we magically knew what $\mathbb{P}[y_i = 1 | \vec{x}_i]$ was for each individual, we could build our binary classifier. But, the problem is that we don't know the relationship between $\vec{y}$ and $\mathbf{X}$ and therefore, $\mathbb{P}[y_i = 1 | \vec{x}_i]$ is unknown. 

Instead, maybe we can **estimate** $\mathbb{P}[y_i = 1 | \vec{x}_i]$ using linear regression! Let's use the following formula:

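$$\mathbb{P}[y_i = 1 | \vec{x}_i] \approx \beta_0 + \beta_1 x_{i,1} + \beta_2x_{i,2} + \dots + \beta_d x_{i,d} = \vec{\beta}^T\vec{x}_i$$

Like before, the $\beta_j$ terms are the coefficients of each feature; they are shared across all individuals rather than specific to one. (For the compact form $\vec{\beta}^T\vec{x}_i$ to include the intercept $\beta_0$, we treat $\vec{x}_i$ as having a constant 1 as its first entry.) Unfortunately, there's a problem with this setup. Let's examine the image below: 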
![](images/lin_prob.png)

#### Exercise: What are the features used in the linear probability model above? What is the model trying to predict? 

**Answer**: [Type your answer here]

#### Exercise: Is linear regression a valid model to predict probabilities? Why or why not? 

**Answer**: [Type your answer here]

<a id='logistic_regression'></a>
## Logistic Regression



<a id='sigmoid'></a>
### The Sigmoid Function

To fix the issue we observed in the previous section, we will apply a transformation to our linear model called the **sigmoid function**, which ensures that the output always lies between 0 and 1. Here's an image of what it looks like: 
![](images/sigmoid.png)

#### Exercise: Implement the sigmoid function. 

In [None]:
import numpy as np

def sigmoid(z):
    """Returns the value of the sigmoid function for an input z."""
    pass # TODO: Replace this line with your implementation of sigmoid 

coefficients = np.array([[5, 4, 2, 1]])
features = np.array([[1, 4, 1, 2]])

# Calculate the linear prediction 
prediction = _________  @  _________ # TODO: Compute a dot product between the coefficients and features 

# Test sigmoid (no changes needed)
prob = sigmoid(prediction)
print(f'Your sigmoid function returned: {prob}')
assert (prob <= 1) and (prob >= 0), ValueError('There\'s something wrong with your sigmoid function implementation.')
print('Success!')

<a id='logistic_fitting'></a>
### Fitting Logistic Regression Models 

As we saw in the previous section, we will use the sigmoid function to make our predictions with Logistic Regression. We take the following steps to generate our prediction: 
- First, calculate the linear prediction: 
    $$z = \beta_0 + \beta_1x_1 + \beta_2x_2 + \dots + \beta_dx_d$$
- Next, calculate the predicted probability:
    $$p = \frac{1}{1 + e^{-z}}$$

So, if $p$ is less than 0.5, we classify the data point as $0$, and if $p$ is greater than 0.5, we predict 1. (A value of exactly 0.5 is a tie, which is broken by convention.) 
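
The two steps above can be sketched in a few lines of NumPy. This is only an illustration: the coefficients and the data point are made-up values, and `predict_probability` is a hypothetical helper name, not part of any library.

```python
import numpy as np

def predict_probability(beta, x):
    """Return p = 1 / (1 + e^(-z)), where z = beta_0 + beta_1*x_1 + ... + beta_d*x_d."""
    z = beta[0] + np.dot(beta[1:], x)  # linear prediction (intercept stored first)
    return 1.0 / (1.0 + np.exp(-z))    # squash z into a probability

# Hypothetical coefficients (intercept first) and one hypothetical data point
beta = np.array([-1.0, 2.0, 0.5])
x = np.array([0.25, 3.0])

p = predict_probability(beta, x)  # here z = -1 + 0.5 + 1.5 = 1.0
label = int(p > 0.5)              # classify with the 0.5 cutoff
print(f"p = {p:.3f}, predicted class = {label}")
```

Because the sigmoid is increasing and equals 0.5 at $z = 0$, comparing $p$ to 0.5 is the same as checking the sign of $z$.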



#### Exercise: Let's say for a particular data point, you have a value of $z = -0.5$. What would you predict for that point: 0 or 1? 

**Answer**: [Type your answer here]

#### Logistic Regression in Scikit-learn

In practice, how can we write a model?

This example explores a data set of cars, and uses characteristics of the car to predict if oil was expensive (1) or inexpensive (0) when it was released.

In [None]:
import matplotlib.pyplot as plt
from mpl_toolkits import mplot3d
%matplotlib inline

import random
import numpy as np
import pandas as pd
from sklearn import metrics
from sklearn.model_selection import train_test_split, GridSearchCV, KFold, cross_val_score
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import Ridge, Lasso, ElasticNet, LogisticRegression
from sklearn.metrics import confusion_matrix

In [None]:
mpg_cat = pd.read_csv("./data/mpg_category.csv", index_col="name") 

# Encode OilExpensive such that 0 is inexpensive and 1 is expensive
mpg_cat["OilExpensive"] = (mpg_cat.OilExpensive == "expensive")*1
mpg_cat["Old?"] = (mpg_cat.loc[:, "Old?"] == "old")*1


# Train, test split
mpg_cat_train, mpg_cat_test = train_test_split(mpg_cat, 
                                       test_size = .2, 
                                       random_state = 0) 

# Train, validation split
mpg_cat_train, mpg_cat_validation = train_test_split(mpg_cat_train, 
                                             test_size = .25, 
                                             random_state = 0)

# Notice that the splitting above creates a 60/20/20 split
print("Unique Values of OilExpensive: " + str(mpg_cat.OilExpensive.unique()))
mpg_cat.head()

In [None]:
X_train = np.array(mpg_cat_train.drop("OilExpensive", axis=1))
y_train = mpg_cat_train.OilExpensive.values

X_val = np.array(mpg_cat_validation.drop("OilExpensive", axis=1))
y_val = mpg_cat_validation.OilExpensive.values

In [None]:
%%capture
lr = LogisticRegression()
lr = lr.fit(X_train, y_train)
y_val_pred = lr.predict(X_val)

In [None]:
# The true labels of our validation set
y_val

In [None]:
# The labels we predicted for our validation set
y_val_pred

<a id='log_loss'></a>
### Logistic Loss

Now, you've learned about how Logistic Regression makes predictions and how it is fitted to datasets. Lastly, we should learn about its loss function, called *log loss*, also sometimes called *cross entropy loss*.

Recall that for the case of linear regression, we used the sum of squared residuals as our loss function:

$$L(\beta) = \sum_{i=1}^n (x_i^T \beta - y_i)^2  = || X \beta - y ||_2^2$$

In the case of logistic regression, we will instead use the following loss function: 

$$L(\beta) = -\sum_{i=1}^n \left[ y_i \log(p_i) + (1 - y_i) \log(1 - p_i) \right]$$

where $p_i$ is the predicted probability of the $i$th row. At first, this may seem kind of random compared to what we've seen before. But, this loss function turns out to have a deep connection with the view of binary classification in terms of probabilities. Here's a brief explanation:

Let's say we had 10 total data points (5 from class 1 and 5 from class 0) we wanted to use to build a classifier. Imagine each of these data points has a fixed (but unknown) probability $p$ of belonging to class 1. Therefore, we can express the likelihood of observing these 10 data points as the following:

$$\text{Lik}(p) = p^5(1-p)^5$$

Now, we can take a log on both sides to simplify the powers. 

$$\log(\text{Lik}(p)) = \log(p^5(1-p)^5) = \log(p^5) + \log((1-p)^5) = 5 \log(p) + 5 \log(1 - p)$$

Notice that this "log-likelihood" actually looks really similar to the log loss function we wrote above! Specifically, the log-loss is the negative of the log-likelihood. If you'd like to learn more about this connection between Logistic Regression and probability, [check out this video](https://www.youtube.com/watch?v=3wqXRQzJBpE&t=1s). 
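
To check this connection numerically, here is a small sketch that evaluates the log loss on the 10-point example for a few candidate values of $p$ and compares it with the negative log-likelihood. The helper `log_loss` is written by hand for illustration, and the candidate values of $p$ are arbitrary; note that it would return infinity if any probability were exactly 0 or 1.

```python
import numpy as np

def log_loss(y, p):
    """Sum of -[y*log(p) + (1 - y)*log(1 - p)] over all data points."""
    y = np.asarray(y, dtype=float)
    p = np.asarray(p, dtype=float)
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

# The 10-point example: 5 points from class 1 and 5 from class 0,
# all sharing a single candidate probability p
y = np.array([1] * 5 + [0] * 5)

for p in (0.2, 0.5, 0.8):
    neg_log_lik = -(5 * np.log(p) + 5 * np.log(1 - p))
    loss = log_loss(y, np.full(10, p))
    print(f"p = {p}: log loss = {loss:.4f}, -log-likelihood = {neg_log_lik:.4f}")
```

The two columns agree for every $p$, and both are smallest at $p = 0.5$, which matches the fraction of class-1 points in this sample.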

#### Exercise: For a particular data point $i$, what would the log loss be if $y_i = 1$? 

**Answer**: [Type your answer here]

<a id='assessing'></a>
## Assessing Classification Models

Now that we know how to make a model... how do we know if our model is good? And how do we make our model the best it can be?



<a id='decisions'></a>
### Decision Boundaries

Part of our modeling is choosing our **decision boundary**: a threshold value where if our predicted probability is below our threshold we predict 0 and if our predicted probability is above we predict 1.
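
As a rough sketch of how the threshold changes predictions, the snippet below applies three different cutoffs to the same set of predicted probabilities. The probabilities are invented for illustration and `classify` is a hypothetical helper.

```python
import numpy as np

# Hypothetical predicted probabilities for six data points
probs = np.array([0.05, 0.30, 0.45, 0.55, 0.70, 0.95])

def classify(probs, threshold=0.5):
    """Predict 1 where the probability exceeds the threshold, otherwise 0."""
    return (probs > threshold).astype(int)

for t in (0.25, 0.5, 0.75):
    print(f"threshold = {t}: {classify(probs, t)}")
```

Raising the threshold makes the classifier more reluctant to predict 1; lowering it does the opposite.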


<img src="images/decision_boundary.png" width="500">

This decision boundary is actually a parameter of our model that we can tune; we'll see how to set our decision boundary based on a few metrics. The models that find decision boundaries are called **discriminative models**. Another example of a discriminative model besides logistic regression is the support vector machine (SVM).

The simplest SVM is the hard-margin SVM, which calculates the decision boundary by maximizing the margin: the distance from the boundary line to the nearest sample point. Without going into too much detail, this boils down to a quadratic program, an optimization problem that has a unique solution as long as the data is linearly separable. There are other, more complicated SVMs, such as the soft-margin SVM, which allow more flexibility (for example, the data doesn't need to be linearly separable), but these are outside the scope of this lecture.

In [None]:
# Train using SVM
from sklearn.svm import SVC

lr_svm = SVC(kernel="linear")
lr_svm = lr_svm.fit(X_train, y_train)
y_svm_val_pred = lr_svm.predict(X_val)

In [None]:
y_svm_val_pred

### Generative Models
The other kind of models used are called **generative models**. Generative models focus more on calculating the probabilities of being a certain class for all points. As you can see below, instead of having a clear decision boundary, we instead have probabilities.

<img src="images/gda.png" width="300">

An example of a generative model is Gaussian Discriminant Analysis (GDA). GDA assumes that each class can be modeled as a Gaussian distribution and estimates those distributions through maximum likelihood estimation (MLE).
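
To make the idea concrete, here is a minimal one-feature sketch of GDA written by hand in NumPy, separate from the scikit-learn implementation. The data and the helper names (`fit_gaussians`, `gaussian_pdf`, `prob_class_1`) are assumptions made up for this illustration.

```python
import numpy as np

# Hypothetical one-feature training data: class 0 sits near 2, class 1 near 6
x = np.array([1.0, 2.0, 3.0, 5.0, 6.0, 7.0])
y = np.array([0, 0, 0, 1, 1, 1])

def fit_gaussians(x, y):
    """MLE for each class: prior (class frequency), mean, and variance."""
    params = {}
    for c in np.unique(y):
        xc = x[y == c]
        # np.var divides by n by default, which is the MLE of the variance
        params[int(c)] = (len(xc) / len(x), xc.mean(), xc.var())
    return params

def gaussian_pdf(x, mean, var):
    """Density of a Gaussian with the given mean and variance."""
    return np.exp(-(x - mean) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

def prob_class_1(x_new, params):
    """Bayes' rule: weight each class density by its prior, then normalize."""
    weighted = {c: prior * gaussian_pdf(x_new, mean, var)
                for c, (prior, mean, var) in params.items()}
    return weighted[1] / (weighted[0] + weighted[1])

params = fit_gaussians(x, y)
for x_new in (2.0, 4.0, 6.0):
    print(f"x = {x_new}: P(y = 1 | x) = {prob_class_1(x_new, params):.3f}")
```

The output is a probability for every point rather than a hard boundary; a point halfway between the two class means gets a probability of 0.5 here because both classes have the same prior and variance.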

In [None]:
# Train using GDA
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis

lr_gda = QuadraticDiscriminantAnalysis()
lr_gda = lr_gda.fit(X_train, y_train)
y_gda_val_pred = lr_gda.predict(X_val)

In [None]:
y_gda_val_pred

<a id='metrics'></a>
### Classification Metrics

Once we have a classifier, we want to be able to evaluate it. *Is our classifier even good?* There are a variety of metrics we can use, a few of which we'll highlight in this section.

#### Accuracy 

One metric we can look at is accuracy. Is accuracy all we need to look at? We'll soon see the answer is no!

Accuracy can be defined as:

$$ accuracy = \frac{\text{number of points classified correctly}}{\text{total number of points}} $$

Simply put, accuracy is **how many predictions we got right out of our total predictions**. This is pretty useful when the classes we're looking at have roughly similar frequencies.
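
As a quick illustration of the formula, accuracy can be computed by hand as the fraction of predictions that match the true labels. The labels below are invented for this example.

```python
import numpy as np

# Hypothetical true labels and predictions for eight data points
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])

# Fraction of positions where the prediction equals the true label
accuracy = np.mean(y_true == y_pred)
print(f"accuracy = {accuracy}")  # 6 of 8 correct
```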

But what if our frequencies are unbalanced? 

**Side note:** When using Scikit-Learn for logistic regression, you can calculate accuracy just by calling `model.score()`

Learn how!  [Scikit-Learn Documentation](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html#sklearn.linear_model.LogisticRegression.score)

In [None]:
# This is the accuracy of our model from above!
lr.score(X_val, y_val)

In [None]:
# accuracy for gda
lr_gda.score(X_val, y_val)

In [None]:
# accuracy for svm
lr_svm.score(X_val, y_val)

#### Exercise: Let's see when accuracy fails us...

Your group is tasked with building a classifier to predict if someone is **COVID Negative (1)** or **COVID Positive (0)**. A rival group decides this is too tough of a problem, and makes their classifier assign **everyone** a negative result. 

Your sample is a group of 100 people. From a much more accurate test, we know **98** are truly **COVID negative**, and only **2** in our group are **COVID positive**.

How accurate is this other group's classifier? Does this mean they have a good solution?

**Answer:** [Type your answer here]

The above is an example of when one class (in this case, being COVID positive) is much rarer than the other class (COVID negative). Now that we've established we can't use accuracy to make every evaluation, what else can we use?

#### Precision / Recall

For these two metrics, we should first talk about the two types of errors (and two types of successes) in classification.

##### Types of classification errors

True Negative (TN): We predicted negative (0), and we were right!

False Negative (FN): We predicted negative (0), and we were wrong. )):

False Positive (FP): We predicted positive (1), and we were wrong. 

True Positive (TP): We predicted positive (1), and we were right!

<img src="images/confusion_matrix_2.png" width="450">

##### Precision


$$ precision = \frac{\text{TP}}{\text{TP + FP}} $$

This punishes false positives. The more false positives the lower (worse) our precision score will be.

##### Recall


$$ recall = \frac{\text{TP}}{\text{TP + FN}} $$

This punishes false negatives. The more false negatives the lower our recall score will be.
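
The four counts and both metrics can be computed directly from label arrays, as in the sketch below. The labels are made up for illustration, with 1 as the positive class.

```python
import numpy as np

# Hypothetical true labels and predictions (1 = positive, 0 = negative)
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 1])

tp = int(np.sum((y_pred == 1) & (y_true == 1)))  # predicted 1, truly 1
fp = int(np.sum((y_pred == 1) & (y_true == 0)))  # predicted 1, truly 0
fn = int(np.sum((y_pred == 0) & (y_true == 1)))  # predicted 0, truly 1
tn = int(np.sum((y_pred == 0) & (y_true == 0)))  # predicted 0, truly 0

precision = tp / (tp + fp)
recall = tp / (tp + fn)
print(f"TP={tp}, FP={fp}, FN={fn}, TN={tn}")
print(f"precision = {precision}, recall = {recall}")
```

Both formulas divide by zero in edge cases (no positive predictions for precision, no true positives in the data for recall), so a more careful version would handle those separately.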



<a id='roc'></a>
### ROC Curves

Now, we've covered a lot of metrics to evaluate your classifier. We also established that you should consider a few metrics when evaluating, not just one. How should we consider our metrics against/alongside each other? 

One visualization we can consider is an **ROC curve**.

This graphs the true positive rate (TPR) on the vertical axis against the false positive rate (FPR) on the horizontal axis, with one point for each threshold we try. 

$$ \text{false positive rate} = \frac{\text{FP}}{\text{FP + TN}} $$

$$ \text{true positive rate} = \frac{\text{TP}}{\text{TP + FN}} $$

![](images/roc.png)

#### Exercise: Let's think about our ROC graph...

In the ROC graph above, the beginning of the line represents our model having a false positive rate of 0 and a true positive rate of 0. In this case, our threshold is so high that our model predicts negative only—so we don't have any positive predictions (hence why our true positive *and* false positive rates are both 0).

Similarly, how would you explain the right end of the ROC graph above?

**Answer:** [Type your answer here]

What are we looking for in our ROC curve? A perfect model would have a TPR of 1 and an FPR of 0. This corresponds to the top left of our graph, so we'd like our curve to be as close to the top left as possible (see the orange curve below). 

One indicator from ROC curves we can calculate is the **area under curve (AUC)** of our model. An AUC of 1 would represent a perfect model. Random guessing would have an AUC of 0.5 (this wouldn't be very good for our model).

![](images/roc_ideal.png)
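
To see where the points on an ROC curve come from, here is a small hand-rolled sketch that sweeps a threshold over a set of predicted probabilities, records (FPR, TPR) at each step, and estimates the AUC with the trapezoid rule. The labels, probabilities, and the helper `roc_point` are assumptions made up for this illustration; in practice a library routine would usually do this for you.

```python
import numpy as np

# Hypothetical true labels and predicted probabilities for six data points
y_true = np.array([0, 0, 1, 0, 1, 1])
probs = np.array([0.1, 0.3, 0.4, 0.6, 0.7, 0.9])

def roc_point(y_true, probs, threshold):
    """Return (FPR, TPR) when predicting 1 for probabilities >= threshold."""
    y_pred = (probs >= threshold).astype(int)
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    tn = np.sum((y_pred == 0) & (y_true == 0))
    return fp / (fp + tn), tp / (tp + fn)

# Sweep thresholds from high to low so that FPR never decreases
thresholds = [1.1, 0.9, 0.7, 0.6, 0.4, 0.3, 0.1]
points = [roc_point(y_true, probs, t) for t in thresholds]
fpr = np.array([pt[0] for pt in points])
tpr = np.array([pt[1] for pt in points])

# Area under the curve via the trapezoid rule
auc = np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2)

for t, f, r in zip(thresholds, fpr, tpr):
    print(f"threshold = {t}: FPR = {f:.2f}, TPR = {r:.2f}")
print(f"AUC = {auc:.3f}")
```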