# **Lecture 6. Classification Algorithms**
## Applied Machine Learning

## **Part 1: Classification**
So far, every supervised learning algorithm that we've seen has been an instance of regression.

Next, let's look at some classification algorithms. First, we will define what classification is.

### **Review: Components of a Supervised Learning Problem**
At a high level, a supervised machine learning problem has the following structure:
$$\underbrace{\text{Training Set}}_{\text{Attributes + Features}} + \underbrace{\text{Learning Algorithm}}_{\text{Model class + Objective + Optimizer}} \to \text{Predictive Model}$$

### **Regression Vs. Classification**
Consider a training dataset $\mathcal{D} = \{(x^{(i)}, y^{(i)})\mid i= 1,2,\dots, n\}.$

We distinguish between two types of supervised learning problems depending on the targets $y^{(i)}$.


1.   **Regression:** The target variable $y \in \mathcal{Y}$ is continuous: $\mathcal{Y} \subseteq \mathbb{R}$ 
2.   **Classification:** The target variable $y$ is discrete and takes on one of $K$ possible values: $\mathcal{Y}=\{y_1, y_2, \dots, y_K\}$. Each discrete value corresponds to a *class* that we want to predict.


### **Binary Classification**
An important special case of classification is when the number of classes $K=2$.

In this case, we have a *binary classification* problem.

### **Classification Dataset: Iris Flowers**
To demonstrate classification algorithms, we are going to use the Iris flower dataset.

It's a classical dataset that was published by R. A. Fisher in 1936. Nowadays, it's widely used for demonstrating machine learning algorithms.

In [None]:
import numpy as np
import pandas as pd 
from sklearn import datasets
import warnings
warnings.filterwarnings('ignore')

# Load Iris dataset
iris = datasets.load_iris(as_frame=True)

print(iris.DESCR)

In [None]:
# print part of dataset
iris_X, iris_y = iris.data, iris.target
pd.concat([iris_X,iris_y], axis=1).head()

Here is a visualization of the dataset in 3D. Note that we project the four features of the data onto their first three principal components (PCA) to obtain the three plotting axes.

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
from sklearn.decomposition import PCA

fig = plt.figure(1, figsize=(8, 6))
ax = fig.add_subplot(111, projection="3d", elev=-150, azim=110)

X_reduced = PCA(n_components=3).fit_transform(iris.data)
p1 = ax.scatter(X_reduced[:, 0], X_reduced[:, 1], X_reduced[:, 2], c=iris_y,
                cmap=plt.cm.Set1, edgecolor="k", s=40)

ax.set_title("First three PCA directions")
# The axes are principal components, not the raw features
ax.set_xlabel("1st principal component")
ax.xaxis.set_ticklabels([])
ax.set_ylabel("2nd principal component")
ax.yaxis.set_ticklabels([])
ax.set_zlabel("3rd principal component")
ax.zaxis.set_ticklabels([])

plt.legend(handles=p1.legend_elements()[0], labels=['Iris Setosa', 'Iris Versicolour', 'Iris Virginica'])

plt.show()

### **Understanding Classification**
How is classification different from regression?


*   In regression, we try to fit a curve through the set of $y^{(i)}$.
*   **In classification, classes define a partition of the feature space, and our goal is to find the boundaries that separate these regions.**
*   The outputs of classification models often have a simple probabilistic interpretation: they are probabilities that a data point belongs to a given class.

Let's visualise our Iris dataset to see this. Note that we are using the first two features in this dataset.





In [None]:
import matplotlib.pyplot as plt
plt.rcParams['figure.figsize'] =[12,4]

# Plot also the training points
p1 = plt.scatter(iris_X.iloc[:,0], iris_X.iloc[:,1],  c=iris_y,
                 edgecolor='k', s=50, cmap = plt.cm.Paired)
plt.xlabel('Sepal length')
plt.ylabel('Sepal width')
plt.legend(handles=p1.legend_elements()[0], labels=['Setosa', 'Versicolour', 'Virginica'], loc='lower right')

Let's train a classification algorithm on this dataset.

Below we see the regions predicted to be associated with the blue and non-blue classes; the line between them is the decision boundary.

In [None]:
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression(C=1e5)

# Create an instance of Logistic Regression Classifier and fit the data
X = iris_X.to_numpy()[:,:2]
# Rename class 2 to class 1
iris_y2 = iris_y.copy()
iris_y2[iris_y2==2] = 1
Y = iris_y2
logreg.fit(X,Y)

xx, yy = np.meshgrid(np.arange(4, 8.2, .02), np.arange(1.8, 4.5, .02))
Z = logreg.predict(np.c_[xx.ravel(), yy.ravel()])

# Put the results into the color plot
Z = Z.reshape(xx.shape)
plt.pcolormesh(xx,yy,Z,cmap=plt.cm.Paired)

# Plot also the training points
plt.scatter(X[:,0], X[:,1], c=Y, edgecolor='k', cmap=plt.cm.Paired, s=50)
plt.xlabel('Sepal length')
plt.ylabel('Sepal width')

plt.show()


## **Part 2: Nearest Neighbors**
Previously, we have seen what defines a classification problem. Let's now look at our first classification algorithm.

### **Review: Classification**
Consider a training dataset $\mathcal{D} = \{(x^{(1)},y^{(1)}), (x^{(2)},y^{(2)}), \dots, (x^{(n)},y^{(n)})\}.$

We distinguish between two types of supervised learning problems depending on the targets $y^{(i)}$.


1.   **Regression:** The target variable $y \in \mathcal{Y}$ is continuous: $\mathcal{Y} \subseteq \mathbb{R}$
2.   **Classification:** The target variable $y$ is discrete and takes on one of $K$ possible values: $\mathcal{Y} = \{y_1, y_2, \dots, y_K\}$. Each discrete value corresponds to a *class* that we want to predict.



### **A Simple Classification Algorithm: Nearest Neighbors**
Suppose we have a training dataset $\mathcal{D} = \{(x^{(1)},y^{(1)}), (x^{(2)},y^{(2)}), \dots, (x^{(n)},y^{(n)})\}.$ At inference time, we receive a query point $x'$ and we want to predict its label $y'$.

A really simple but surprisingly effective way of returning $y'$ is the nearest neighbors approach.


*   Given a query datapoint $x'$, find the training example $(x,y)$ in $\mathcal{D}$ that is closest to $x'$, in the sense that $x$ is "nearest" to $x'$.
*   Return $y$, the label of the "nearest neighbor" $x$.
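
The two steps above can be sketched in a few lines of NumPy. This is a minimal illustration rather than a reference implementation: it assumes the Euclidean distance, and the function name `nearest_neighbor_predict` and the toy arrays are made up for this example.

```python
import numpy as np

def nearest_neighbor_predict(X_train, y_train, x_query):
    """Return the label of the training point closest to x_query (Euclidean)."""
    # Distance from the query to every training point, shape (n,)
    dists = np.linalg.norm(X_train - x_query, axis=1)
    # The prediction is the label of the closest training point
    return y_train[np.argmin(dists)]

# Toy data (hypothetical): two points per class
X_toy = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
y_toy = np.array([0, 0, 1, 1])

print(nearest_neighbor_predict(X_toy, y_toy, np.array([0.95, 1.0])))  # prints 1
print(nearest_neighbor_predict(X_toy, y_toy, np.array([0.0, 0.1])))   # prints 0
```

Note that there is no training step: all of the work happens when a query arrives.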

In the example below on the Iris dataset, the red cross denotes the query $x'$. The training point closest to it belongs to the class "Virginica". (We're only using the first two features in the dataset for simplicity.)



In [None]:
import matplotlib.pyplot as plt
plt.rcParams['figure.figsize'] = [12,4]

# Plot also the training points
p1 = plt.scatter(iris_X.iloc[:, 0], iris_X.iloc[:, 1], c=iris_y,
                 edgecolor='k', s=50, cmap=plt.cm.Paired )
p2 = plt.plot([7.5], [4], "rx", ms=10, mew=5)
plt.xlabel('Sepal length')
plt.ylabel('Sepal width')
# One legend entry per class, plus one for the query point
plt.legend(handles=p1.legend_elements()[0]+p2, labels = ['Setosa', 'Versicolour', 'Virginica', 'Query'], loc='lower right')

### **Choosing a Distance Function**
How do we select the point $x$ that is closest to $x'$? There are many options:


*   The Euclidean distance $\|x - x'\|_2 = \sqrt{\sum_{j=1}^d|x_j - x_j'|^2}$ is a popular choice.
*   The Minkowski distance $\|x - x'\|_p = (\sum_{j=1}^d|x_j - x_j'|^p)^{1/p}$ generalizes the Euclidean, L1 and other distances.
*   The Mahalanobis distance $\sqrt{(x - x')^\top V (x - x')}$ for a positive semidefinite matrix $V \in \mathbb{R}^{d \times d}$ also generalizes the Euclidean distance.
*   Discrete-valued inputs can be compared via the Hamming distance $|\{j : x_j \neq x_j'\}|$ and other distances.
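
As a rough sketch, these distances can be written directly in NumPy. The function names are chosen for this example only, and the Mahalanobis distance is taken here to be $\sqrt{(x - x')^\top V (x - x')}$, which reduces to the Euclidean distance when $V$ is the identity.

```python
import numpy as np

def minkowski(x, x2, p=2):
    """Minkowski distance; p=2 is Euclidean, p=1 is the L1 distance."""
    return np.sum(np.abs(x - x2) ** p) ** (1.0 / p)

def mahalanobis(x, x2, V):
    """Distance sqrt((x - x2)^T V (x - x2)) for a positive semidefinite V."""
    diff = x - x2
    return np.sqrt(diff @ V @ diff)

def hamming(x, x2):
    """Number of coordinates in which two discrete vectors differ."""
    return int(np.sum(x != x2))

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 6.0, 3.0])

print(minkowski(a, b, p=2))          # 5.0
print(minkowski(a, b, p=1))          # 7.0
print(mahalanobis(a, b, np.eye(3)))  # 5.0 (identity V gives the Euclidean distance)
print(hamming(np.array([1, 0, 1, 1]), np.array([1, 1, 0, 1])))  # 2
```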

Let's apply Nearest Neighbors to the above dataset using the Euclidean distance (or equivalently Minkowski with $p=2$).



In [None]:
from sklearn import neighbors
from matplotlib.colors import ListedColormap

# Train a nearest neighbors model
clf = neighbors.KNeighborsClassifier(n_neighbors=1, metric='minkowski', p=2)
clf.fit(iris_X.iloc[:,:2], iris_y)

# Create color maps
cmap_light = ListedColormap(['orange', 'cyan', 'cornflowerblue'])
cmap_bold = ListedColormap(['darkorange', 'c', 'darkblue'])

# Plot the decision boundary. For that, we will assign a color to each point
# in the mesh [x_min, x_max]x[y_min, y_max].
x_min, x_max = iris_X.iloc[:, 0].min() - 1, iris_X.iloc[:, 0].max() + 1
y_min, y_max = iris_X.iloc[:, 1].min() - 1, iris_X.iloc[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.2),
                     np.arange(y_min, y_max, 0.2))
Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])

# Put the result into the color plot
Z = Z.reshape(xx.shape)
plt.figure()
plt.pcolormesh(xx, yy, Z, cmap = cmap_light)

# Plot also the training points
plt.scatter(iris_X.iloc[:,0], iris_X.iloc[:,1], c=iris_y, cmap=cmap_bold,
            edgecolor='k', s=60)
plt.xlim(xx.min(), xx.max())
plt.ylim(yy.min(), yy.max())

plt.xlabel('Sepal length')
plt.ylabel('Sepal width')






In the above example, the regions of the 2D space that are assigned to each class are highly irregular. In the area where two classes overlap, the decision boundary flips between the classes, depending on which training point is closest.

### **K-Nearest Neighbors**
Intuitively, we expect the true decision boundary to be smooth. Therefore, we average over the $K$ nearest neighbors of a query point.

*   Given a query datapoint $x'$, find the $K$ training examples $\mathcal{N} = \{(x^{(1)},y^{(1)}), (x^{(2)},y^{(2)}), \dots, (x^{(K)},y^{(K)})\} \subseteq \mathcal{D}$ that are closest to $x'$.
*   Return $y_{\mathcal{N}}$, the consensus label of the neighborhood $\mathcal{N}$.

The consensus $y_{\mathcal{N}}$ can be determined by voting, weighted average, etc.
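
Here is a minimal sketch of the procedure with majority voting as the consensus rule. The Euclidean distance, the function name `knn_predict`, and the toy data (which include one deliberately mislabeled point) are assumptions made for illustration.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_query, k=3):
    """Predict the majority label among the k training points nearest to x_query."""
    dists = np.linalg.norm(X_train - x_query, axis=1)
    nearest = np.argsort(dists)[:k]             # indices of the k closest points
    votes = Counter(y_train[nearest].tolist())  # count labels in the neighborhood
    return votes.most_common(1)[0][0]

# Toy data (hypothetical): the last point is labeled 1 but sits inside class 0
X_toy = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                  [5.0, 5.0], [5.0, 6.0], [6.0, 5.0],
                  [0.5, 0.5]])
y_toy = np.array([0, 0, 0, 1, 1, 1, 1])
x_query = np.array([0.4, 0.4])

print(knn_predict(X_toy, y_toy, x_query, k=1))  # prints 1 (follows the mislabeled point)
print(knn_predict(X_toy, y_toy, x_query, k=3))  # prints 0 (outvoted by its neighbors)
```

With $K=1$ the prediction follows the single noisy point, while with $K=3$ the vote smooths it out.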

Let's look at Nearest Neighbors with a neighborhood of 30. The decision boundary is much smoother than before.




In [None]:
# Train a Nearest Neighbors Model
clf = neighbors.KNeighborsClassifier(n_neighbors=30, metric='minkowski', p=2)
clf.fit(iris_X.iloc[:,:2], iris_y)
Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])

# Put the result into the color plot
Z = Z.reshape(xx.shape)
plt.figure()
plt.pcolormesh(xx, yy, Z, cmap=cmap_light)

# Plot also the training points
plt.scatter(iris_X.iloc[:,0], iris_X.iloc[:,1], c=iris_y,
            edgecolor='k', s=60, cmap=cmap_bold)
plt.xlim(xx.min(), xx.max())
plt.ylim(yy.min(), yy.max())

plt.xlabel('Sepal Length')
plt.ylabel('Sepal Width')

### **Review: Data Distribution**
We will assume that the dataset is governed by a probability distribution $\mathbb{P}$, which we will call the *data distribution*. We will denote this as:
$$x, y \sim \mathbb{P}.$$

The training set $\mathcal{D} = \{(x^{(1)},y^{(1)}), (x^{(2)},y^{(2)}), \dots, (x^{(n)},y^{(n)})\}$ consists of independent and identically distributed (IID) samples from $\mathbb{P}$.

### **KNN Estimates Data Distribution** 
Suppose that the output $y'$ of KNN is the average target in the neighborhood $\mathcal{N}(x')$ around the query $x'$. Observe that we can write:
$$y' = \frac{1}{K}\sum_{(x,y) \in \mathcal{N}(x')}y \approx \mathbb{E}(y \mid x').$$

*   When $x \approx x'$ and when $\mathbb{P}$ is reasonably smooth, each $y$ for $(x, y) \in \mathcal{N}(x')$ is approximately a sample from $\mathbb{P}(y \mid x')$ (since $\mathbb{P}$ doesn't change much around $x'$, $\mathbb{P}(y \mid x') \approx \mathbb{P}(y \mid x)$).
*   Thus $y'$ is essentially a Monte Carlo estimate of $\mathbb{E}(y \mid x')$ (the average of $K$ samples from $\mathbb{P}(y \mid x')$).
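
We can check this claim on synthetic data where the conditional expectation is known. The sketch below assumes a made-up data distribution with $x \sim \text{Uniform}(0,1)$ and $\mathbb{P}(y = 1 \mid x) = x$, so that $\mathbb{E}(y \mid x') = x'$; the sample size, $K$, and the seed are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n, K = 20000, 500

# Synthetic data distribution (an assumption for this demo): P(y = 1 | x) = x
x = rng.uniform(0.0, 1.0, size=n)
y = (rng.uniform(size=n) < x).astype(float)

x_query = 0.7
# Indices of the K training inputs closest to the query
nearest = np.argsort(np.abs(x - x_query))[:K]
# Average target in the neighborhood: a Monte Carlo estimate of E[y | x_query]
knn_estimate = y[nearest].mean()

print(knn_estimate)  # should be close to the true value E[y | x'] = 0.7
```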



### **Algorithm**: K-Nearest Neighbors


*   **Type:** Supervised learning (regression and classification).
*   **Model Family:** Consensus over $K$ training instances.
*   **Objective function:** Euclidean, Minkowski, Hamming, etc. (the distance function used to find the neighbors).
*   **Optimizer:** None at training. Nearest neighbor search at inference using specialized search algorithms (hashing, KD-trees).
*   **Probabilistic Interpretation:** Directly approximating the density $\mathbb{P}(y \mid x).$



### **Pros and Cons of KNN**

*   **Pros**
 * Can approximate any data distribution arbitrarily well.
*   **Cons**
 * Need to store the entire dataset to make queries, which can be computationally prohibitive for large datasets.
 * The amount of data needed scales exponentially with the dimension (curse of dimensionality).



## **Part 3: Non-parametric Models**
Nearest-Neighbors is the first example of an important type of machine learning algorithm called a non-parametric model.

### **Review: Supervised Learning Model**
We'll say that a model is a function:
$$f : \mathcal{X} \to \mathcal{Y}$$

that maps inputs $x \in \mathcal{X}$ to targets $y \in \mathcal{Y}$.

Often, models have parameters $\theta \in \Theta$. We will then write a model as:
$$f_{\theta}: \mathcal{X} \to \mathcal{Y}$$

to denote that it's parametrized by $\theta$.

### **Review: K-Nearest Neighbors**
Suppose that we are given a training set $\mathcal{D} = \{(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \ldots, (x^{(n)}, y^{(n)})\}$. At inference time, we receive a query point $x'$ and we want to predict its label $y'$.

*   Given a query datapoint $x'$, find the $K$ training examples $\mathcal{N} = \{(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \ldots, (x^{(K)}, y^{(K)})\} \subseteq \mathcal{D}$ that are closest to $x'$.
*   Return $y_{\mathcal{N}}$, the consensus label of the neighborhood $\mathcal{N}$.

The consensus $y_{\mathcal{N}}$ can be determined by voting, weighted average, etc.



### **Non-Parametric Models**
Nearest Neighbors is an example of a *non-parametric* model. Parametric vs. non-parametric is a key distinguishing characteristic for machine learning models. 

A parametric model $f_{\theta}: \mathcal{X} \times \Theta \to \mathcal{Y}$ is defined by a finite set of parameters $\theta \in \Theta$ whose dimensionality is constant with respect to the dataset. Linear models of the form 
$$f_{\theta}(x) = \theta^\top x$$

are an example of a parametric model.

In a non-parametric model, the function $f$ uses the entire training dataset (or a post-processed version of it) to make predictions, as in $K$-Nearest Neighbors. In other words, the complexity of the model increases with dataset size.

Non-parametric models have the advantage of not losing any information at training time. However, they are also computationally less tractable and may easily overfit the training set.
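
The distinction can be made concrete by counting how many numbers each kind of model has to keep after training. The sketch below uses synthetic data and two hypothetical helpers (`fit_linear`, `fit_knn`): the linear model keeps one parameter per feature regardless of $n$, while the nearest-neighbors "model" is the stored dataset itself.

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_linear(X, y):
    """Parametric: least squares returns one parameter per feature."""
    theta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return theta

def fit_knn(X, y):
    """Non-parametric: 'training' just stores the whole dataset."""
    return X.copy(), y.copy()

def stored_numbers(n, d=3):
    """Return (numbers kept by the linear model, numbers kept by KNN) for n points."""
    X = rng.normal(size=(n, d))
    y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=n)
    theta = fit_linear(X, y)
    X_stored, y_stored = fit_knn(X, y)
    return theta.size, X_stored.size + y_stored.size

print(stored_numbers(100))    # (3, 400)
print(stored_numbers(10000))  # (3, 40000)
```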

## **Part 4: Logistic Regression**
Next, we are going to see a simple parametric classification algorithm that addresses many of these limitations of Nearest Neighbors.

### **Review: Classification**
Consider a training dataset $\mathcal{D} = \{(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \ldots, (x^{(n)}, y^{(n)})\}.$

We distinguish between two types of supervised learning problems depending on the targets $y^{(i)}$.

1.   **Regression:** The target variable $y \in \mathcal{Y}$ is continuous: $\mathcal{Y} \subseteq \mathbb{R}$
2.   **Classification:** The target variable $y$ is discrete and takes on one of $K$ possible values: $\mathcal{Y}=\{y_1, y_2, \dots,y_K\}$. Each discrete value corresponds to a *class* that we want to predict.


### **Binary Classification and the Iris Dataset**

We are going to start by looking at binary (two-class) classification.

To keep things simple, we will use the Iris dataset. We will try to distinguish class 0 (Iris setosa) from the other two classes.



In [None]:
import matplotlib.pyplot as plt
plt.rcParams['figure.figsize'] = [12,4]
# Rename class two to class one 
iris_y2 = iris_y.copy()
iris_y2[iris_y2==2] = 1

# Plot also the training points
p1 = plt.scatter(iris_X.iloc[:,0], iris_X.iloc[:,1], c=iris_y2,
                 edgecolor='k', s=50, cmap=plt.cm.Paired)
plt.xlabel('Sepal length')
plt.ylabel('Sepal width')
plt.legend(handles=p1.legend_elements()[0], labels=['Setosa', 'Non-setosa'], loc='lower right')

### **Review: Least Squares**
Recall that the linear regression algorithm fits a linear model of the form:
$$f(x) = \sum_{j=0}^d\theta_jx_j = \theta^\top x.$$

It minimizes the mean squared error (MSE) 
$$J(\theta) = \frac{1}{2n}\sum_{i=1}^n(y^{(i)} - \theta^\top x^{(i)})^2$$

on a dataset $\mathcal{D} = \{(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \ldots, (x^{(n)}, y^{(n)})\}.$

We could also use the above model for a classification problem for which $\mathcal{Y} = \{0,1\}$.

In [None]:
import warnings
warnings.filterwarnings('ignore')
from sklearn.linear_model import LinearRegression

# Create an instance of a linear regression model and fit the data
linreg = LinearRegression()
X = iris_X.to_numpy()[:,:2]
Y = iris_y2
linreg.fit(X,Y)

# Plot the decision boundary. For that, we will assign a color to each point
# in the mesh [x_min, x_max]x[y_min, y_max]
x_min, x_max = X[:,0].min() - .5, X[:,0].max() + .5
y_min, y_max = X[:,1].min() -.5, X[:,1].max() + .5
h = .02 # stepsize in the mesh

xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                     np.arange(y_min, y_max, h))

# Threshold the real-valued predictions at 0.5 to obtain class labels
Z = linreg.predict(np.c_[xx.ravel(), yy.ravel()])
Z = (Z >= 0.5).astype(int)

# Put the result into the color plot
Z = Z.reshape(xx.shape)
plt.pcolormesh(xx, yy, Z, cmap=plt.cm.Paired)

# Plot also the training points
plt.scatter(X[:,0], X[:,1], c=Y, edgecolor='k', s=60, cmap = plt.cm.Paired)
plt.xlabel('Sepal length')
plt.ylabel('Sepal width')

plt.show()

Least squares returns an acceptable decision boundary on this dataset. However, it is problematic for a few reasons:


*   There is nothing to prevent outputs larger than one or smaller than zero, which is conceptually wrong.
*   We also don't have optimal performance: at least one point is misclassified, and others are too close to the decision boundary.




### **The Logistic Function**
To address this problem, we will look at a different hypothesis class. We will choose models of the form:
$$f(x) = \sigma (\theta^\top x) = \frac{1}{1+ \exp(- \theta^\top x)},$$

where
$$\sigma(z) = \frac{1}{1 + \exp(-z)}$$

is known as the *sigmoid* or *logistic* function.

The logistic function $\sigma : \mathbb{R} \to [0,1]$ "squeezes" points on the real line into $[0,1]$.

In [None]:
import numpy as np
from matplotlib import pyplot as plt

def logistic(z):
  return 1/(1 + np.exp(-z))

z = np.linspace(-5,5)
plt.plot(z, logistic(z))

### **The Logistic Function: Properties**
The sigmoid (or logistic) function is defined as:
$$\sigma(z) = \frac{1}{1 + \exp(-z)}.$$

A few observations:


*   The function tends to $1$ as $z \to \infty$ and tends to $0$ as $z \to -\infty$.
*   Thus models of the form $\sigma(\theta^\top x)$ output values in $[0,1]$, which is suitable for binary classification.
*   It's easy to show that the derivative of $\sigma(z)$ has a simple form $\frac{d\sigma}{dz} = \sigma(z)(1 - \sigma(z))$.
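
These properties are easy to verify numerically. The sketch below compares a central finite-difference estimate of the derivative with the closed form $\sigma(z)(1 - \sigma(z))$; the grid of test points and the step `eps` are arbitrary choices.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.linspace(-5, 5, 11)
eps = 1e-6

# Central finite-difference approximation of the derivative
numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)
# Closed-form derivative from the last bullet above
analytic = sigmoid(z) * (1 - sigmoid(z))

print(np.max(np.abs(numeric - analytic)))  # a very small number
print(sigmoid(0.0))                        # 0.5
print(sigmoid(30.0), sigmoid(-30.0))       # close to 1 and close to 0
```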





Let's implement our model using the sigmoid function.

In [None]:
def f(X, theta):
  return logistic(X.dot(theta))

### **Review: Probabilistic Least Squares**
Recall that least squares can be interpreted as fitting a Gaussian probabilistic model:
$$p(y \mid x; \theta) = \frac{1}{\sqrt{2\pi}\sigma}\exp\left(-\frac{(y - \theta^\top x)^2}{2\sigma^2}\right).$$

The log-likelihood of this model at a point $(x,y)$ equals:
$$\log L(\theta) = \log p(y \mid x; \theta) = \text{const}_1 \cdot (y - \theta^\top x)^2 + \text{const}_2$$

for some constants $\text{const}_1$ and $\text{const}_2$.

Least squares thus amounts to fitting a Gaussian $\mathcal{N}(y; \mu(x), \sigma)$ with a standard deviation $\sigma$ of one and a mean of $\mu(x) = \theta^\top x$.
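
For reference, taking the log of the Gaussian density above makes the two constants explicit (a short worked step using the same symbols):
$$\log p(y \mid x; \theta) = -\frac{1}{2\sigma^2}(y - \theta^\top x)^2 - \log\left(\sqrt{2\pi}\sigma\right),$$

so that $\text{const}_1 = -\frac{1}{2\sigma^2}$ and $\text{const}_2 = -\log(\sqrt{2\pi}\sigma)$. Since $\text{const}_1 < 0$, maximizing the log-likelihood is the same as minimizing the squared error.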

### **A Probabilistic Approach to Classification**
We can take this probabilistic perspective to derive a new algorithm for binary classification.

We will start by using our logistic model to parametrize a probability distribution as follows:
$$\begin{align*}
p(y = 1 \mid x; \theta ) &= \sigma(\theta^\top x)\\
p(y = 0 \mid x; \theta ) &= 1 -  \sigma(\theta^\top x).
\end{align*}$$

A probability distribution over $y \in \{0,1\}$ of the form $P(y=1) = p$ is called *Bernoulli*.

Note that we can write this compactly as:
$$p(y \mid x; \theta) = \sigma(\theta^\top x)^y(1 - \sigma(\theta^\top x))^{1-y}.$$

### **Review: Conditional Maximum Likelihood**
A general approach to optimizing conditional models of the form $P_{\theta}(y\mid x)$ is to minimize the expected KL divergence with respect to the data distribution:
$$\min_{\theta}\mathbb{E}_{x \sim \mathbb{P}}[D(\mathbb{P}(y \mid x) \| P_{\theta}(y \mid x))].$$

With a bit of math, we can show that this is equivalent to the *maximum likelihood objective*
$$\max_{\theta}\mathbb{E}_{x, y \sim \mathbb{P}}\log P_{\theta}(y \mid x).$$

This is the principle of *conditional maximum likelihood*.
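
The "bit of math" is short. Expanding the definition of the KL divergence and writing the expectation over both $x$ and $y$ gives:
$$\mathbb{E}_{x \sim \mathbb{P}}[D(\mathbb{P}(y \mid x) \| P_{\theta}(y \mid x))] = \mathbb{E}_{x, y \sim \mathbb{P}}[\log \mathbb{P}(y \mid x)] - \mathbb{E}_{x, y \sim \mathbb{P}}[\log P_{\theta}(y \mid x)].$$

The first term does not depend on $\theta$, so minimizing the expected KL divergence over $\theta$ is the same as maximizing the second expectation, which is the expected log-likelihood.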





### **Applying Maximum Likelihood**
Following the principle of maximum likelihood, we want to optimize the following objective defined over the training set $\mathcal{D} = \{(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \ldots, (x^{(n)}, y^{(n)})\}$.

$$\begin{align*}
L(\theta) &= \prod_{i=1}^np(y^{(i)} \mid x^{(i)}; \theta)\\
&= \prod_{i=1}^n\sigma(\theta^\top x^{(i)})^{y^{(i)}}(1 - \sigma(\theta^\top x^{(i)}))^{1 - y^{(i)}}.
\end{align*}$$

The negative log of this objective is also often called the *log-loss* or **cross-entropy**.

This model and objective function define **logistic regression**, one of the most widely used classification algorithms (the name *regression* is an unfortunate misnomer!).

Let's implement the log-likelihood objective, averaged over the dataset.

In [None]:
def log_likelihood(theta, X, y):
  """
  The mean log-likelihood of the data under the logistic model.

  We add the 1e-6 term inside the logs to avoid evaluating log(0) (-inf).

  Parameters:
  theta (np.array): d-dimensional vector of parameters
  X (np.array): (n,d)-dimensional design matrix
  y (np.array): n-dimensional vector of targets
  """

  return (y*np.log(f(X, theta) + 1e-6) + (1-y)*np.log(1 - f(X, theta) + 1e-6)).mean()
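
As an aside, the `1e-6` offset can be avoided by computing the two log terms directly from the logits $z = \theta^\top x$, using $\log \sigma(z) = -\log(1 + e^{-z})$ and $\log(1 - \sigma(z)) = -\log(1 + e^{z})$. The sketch below is one possible alternative, not the version used in the rest of this lecture; the name `log_likelihood_stable` and the demo arrays are made up.

```python
import numpy as np

def log_likelihood_stable(theta, X, y):
    """Mean Bernoulli log-likelihood computed from the logits, without an offset."""
    z = X @ theta
    # np.logaddexp(0, t) evaluates log(1 + exp(t)) without overflow
    log_p1 = -np.logaddexp(0.0, -z)   # log sigmoid(z)
    log_p0 = -np.logaddexp(0.0, z)    # log (1 - sigmoid(z))
    return np.mean(y * log_p1 + (1 - y) * log_p0)

# Demo inputs (hypothetical); the last row produces a very large logit
X_demo = np.array([[1.0, 0.0], [0.0, 1.0], [50.0, 50.0]])
y_demo = np.array([1.0, 0.0, 1.0])

print(log_likelihood_stable(np.zeros(2), X_demo, y_demo))             # log(0.5)
print(log_likelihood_stable(np.array([20.0, 20.0]), X_demo, y_demo))  # finite
```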

### **Review: Gradient Descent**
If we want to optimize $J(\theta)$, we start with an initial guess $\theta_0$ for the parameters and repeat the following update:
$$\theta_i := \theta_{i-1} - \alpha\nabla J(\theta_{i-1}).$$



### **Derivatives of the Log-Likelihood**
Let's work out the gradient for our log-likelihood objective at a single point $(x, y)$:
$$\begin{align*}
\frac{\partial \log L(\theta)}{\partial \theta_j} &= \frac{\partial}{\partial \theta_j} \log (\sigma(\theta^\top x)^y(1 - \sigma(\theta^\top x))^{1-y})\\
&= \left(y \cdot \frac{1}{\sigma(\theta^\top x)} - (1-y)\frac{1}{1 - \sigma(\theta^\top x)}\right)\frac{\partial}{\partial \theta_j}\sigma(\theta^\top x)\\
&= \left(y \cdot \frac{1}{\sigma(\theta^\top x)} - (1-y)\frac{1}{1 - \sigma(\theta^\top x)}\right)\sigma(\theta^\top x)(1 - \sigma(\theta^\top x))\frac{\partial}{\partial \theta_j}\theta^\top x\\
&= (y \cdot (1 - \sigma(\theta^\top x)) -  (1-y) \cdot \sigma(\theta^\top x))x_j\\
&= (y - f_{\theta}(x))\cdot x_j.
\end{align*}$$

### **Gradient of the Log-Likelihood**
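Using the above expression, we obtain the following gradient:
$$\nabla_{\theta}\log L(\theta) = (y - f_{\theta}(x))\cdot x.$$

This expression looks very similar to the gradient of the mean squared error, but it is different because the model $f_{\theta}$ is different. Since we want to maximize the log-likelihood, we will take steps along this gradient (equivalently, gradient descent on $J(\theta) = -\log L(\theta)$).

Let's implement the gradient.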

In [None]:
def loglik_gradient(theta, X, y):
  # Average of (y - f(x)) * x over the dataset; the result has shape (d,)
  return np.mean((y - f(X, theta))*X.T, axis=1)
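
A common sanity check for a hand-derived gradient is to compare it with finite differences of the objective. The block below is a self-contained sketch on random data: it redefines small stand-ins for the functions above (the names `sigmoid`, `mean_log_likelihood`, `analytic_gradient` and `numeric_gradient` are chosen for this example) and omits the `1e-6` offset.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mean_log_likelihood(theta, X, y):
    p = sigmoid(X @ theta)
    return np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def analytic_gradient(theta, X, y):
    # Average of (y - f_theta(x)) * x over the dataset
    return X.T @ (y - sigmoid(X @ theta)) / len(y)

def numeric_gradient(theta, X, y, eps=1e-6):
    # Central finite differences, one coordinate of theta at a time
    grad = np.zeros_like(theta)
    for j in range(len(theta)):
        step = np.zeros_like(theta)
        step[j] = eps
        grad[j] = (mean_log_likelihood(theta + step, X, y)
                   - mean_log_likelihood(theta - step, X, y)) / (2 * eps)
    return grad

rng = np.random.default_rng(0)
X_chk = rng.normal(size=(50, 3))
y_chk = (rng.uniform(size=50) < 0.5).astype(float)
theta_chk = rng.normal(size=3)

gap = np.max(np.abs(analytic_gradient(theta_chk, X_chk, y_chk)
                    - numeric_gradient(theta_chk, X_chk, y_chk)))
print(gap)  # should be very small if the derivation is right
```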

### **Gradient Descent for Logistic Regression**
Putting this together, we obtain a complete learning algorithm, logistic regression.

```python
theta, theta_prev = random_initialization()
while abs(J(theta) - J(theta_prev)) > conv_threshold:
    theta_prev = theta
    theta = theta_prev - step_size * (f(x, theta)-y) * x
```
Let's implement this algorithm.

In [None]:
threshold = 5e-5
stepsize = 1e-1

theta, theta_prev = np.zeros((3,)), np.ones((3,))
opt_pts = [theta]
opt_grads = []
iter = 0
iris_X['one'] = 1
X_train = iris_X.iloc[:,[0,1,-1]].to_numpy()
y_train = iris_y2.to_numpy()

while np.linalg.norm(theta - theta_prev) > threshold:
  if iter % 50000 == 0:
    print('Iteration %d. Log-likelihood: %.6f' % (iter, log_likelihood(theta, X_train, y_train)))
  # Take one step along the gradient of the log-likelihood
  theta_prev = theta
  gradient = loglik_gradient(theta, X_train, y_train)
  theta = theta_prev + stepsize*gradient
  opt_pts += [theta]
  opt_grads +=[gradient]
  iter += 1

Let's now visualize the result.

In [None]:
xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.02), np.arange(y_min, y_max, 0.02))
Z = f(np.c_[xx.ravel(), yy.ravel(), np.ones(xx.ravel().shape)], theta)
Z[Z<0.5] = 0
Z[Z>=0.5] = 1

# Put the result into the color plot
Z = Z.reshape(xx.shape)
plt.pcolormesh(xx, yy, Z, cmap=plt.cm.Paired)

# Plot also the training points
plt.scatter(X[:,0], X[:,1], c=Y, edgecolor='k', s=60, cmap=plt.cm.Paired)
plt.xlabel('Sepal length')
plt.ylabel('Sepal width')

plt.show()

This is how we use the algorithm via sklearn.

In [None]:
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression(C=1e5, fit_intercept=True)

# Create an instance of Logistic Regression Classifier and fit the data
X = iris_X.to_numpy()[:,:2]
Y = iris_y2
logreg.fit(X,Y)

xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.02), np.arange(y_min, y_max, 0.02))
Z = logreg.predict(np.c_[xx.ravel(), yy.ravel()])

# Put the result into the color plot
Z = Z.reshape(xx.shape)
plt.pcolormesh(xx, yy, Z, cmap=plt.cm.Paired)

# Plot also the training points
plt.scatter(X[:,0], X[:,1], c=Y, edgecolor='k', s=40, cmap=plt.cm.Paired)
plt.xlabel('Sepal length')
plt.ylabel('Sepal width')

plt.show()

### **Observations About Logistic Regression**


*   Logistic regression finds a linear decision boundary, which is the set of points for which $P(y = 1 \mid x) = P(y = 0 \mid x)$, or equivalently:
$$0 = \log \frac{P(y = 1 \mid x)}{P(y = 0 \mid x)} = \log \frac{\frac{1}{1 + \exp(-\theta^\top x)}}{1 - \frac{1}{1 + \exp(-\theta^\top x)}} = \theta^\top x.$$
The boundary is linear because the set of points satisfying $\theta^\top x = 0$ is a hyperplane (a line in two dimensions).
*   Unlike least squares, we don't have a closed-form solution for $\theta$, but we can still apply gradient descent.
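
For two features plus a constant feature, the boundary $\theta^\top x = 0$ can be solved for the second feature to give an explicit line. The sketch below assumes the feature order used earlier (two features, then the constant one) and uses made-up weights rather than the fitted ones; `boundary_x2` is a hypothetical helper.

```python
import numpy as np

def boundary_x2(theta, x1):
    """Solve theta[0]*x1 + theta[1]*x2 + theta[2] = 0 for x2."""
    return -(theta[0] * x1 + theta[2]) / theta[1]

theta_example = np.array([2.0, -4.0, 1.0])  # hypothetical weights, bias last
x1 = np.array([0.0, 1.0, 2.0])
x2 = boundary_x2(theta_example, x1)
print(x2)  # [0.25 0.75 1.25]

# On the boundary the logit is zero, so the predicted probability is 0.5
logits = theta_example[0] * x1 + theta_example[1] * x2 + theta_example[2]
print(1.0 / (1.0 + np.exp(-logits)))  # [0.5 0.5 0.5]
```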



### **Algorithm: Logistic Regression**



*   **Type**: Supervised learning (binary classification)
*   **Model family**: Linear decision boundary 
*   **Objective function**: Cross-entropy, a special case of log-likelihood
*   **Optimizer**: Gradient descent
*   **Probabilistic Interpretation**: Parametrized Bernoulli distribution




## **Part 5: Multi-class classification**
Finally, let's look at an extension of logistic regression to an arbitrary number of classes.

### **Review: Logistic Regression**
Logistic regression fits models of the form:
$$f(x) = \sigma(\theta^\top x) = \frac{1}{1 + \exp(-\theta^\top x)}$$

where 
$$\sigma(z) = \frac{1}{1 + \exp(-z)}$$

is known as the *sigmoid function* or *logistic function*.

### **Multi-class Classification**
Logistic regression only applies to binary classification problems. What if we have an arbitrary number of classes $K$?

*   The simplest approach, which can be used with any machine learning algorithm, is the "one vs. all" approach. We train one classifier for each class to distinguish that class from all the others.
*   This works, but it loses a valid probabilistic interpretation and is not very elegant.
*   Alternatively, we may fit a probabilistic model that outputs multi-class probabilities.


Let's load a fully multiclass version of the Iris dataset.


In [None]:
# Plot also the training points
p1 = plt.scatter(iris_X.iloc[:,0], iris_X.iloc[:,1], c=iris_y, 
                 edgecolor='k', s=40, cmap=plt.cm.Paired)
plt.xlabel('Sepal length')
plt.ylabel('Sepal width')
plt.legend(handles=p1.legend_elements()[0], labels=['Setosa', 'Versicolor', 'Virginica'], loc='lower right')


### **The Softmax Function**
The logistic function $\sigma : \mathbb{R} \to [0,1]$ can be seen as mapping an input $z \in \mathbb{R}$ to a probability.

Its multi-class extension $\vec{\sigma}:\mathbb{R}^K \to [0,1]^K$ maps a $K$-dimensional input $\vec{z} \in \mathbb{R}^K$ to a $K$-dimensional vector of probabilities.

Each component of $\vec{\sigma}(\vec{z})$ is defined as:
$$\sigma(\vec{z})_k = \frac{\exp(z_k)}{\sum_{l=1}^K \exp(z_l)}.$$

We call this the *softmax* function.

When $K=2$, this looks as follows:
$$\sigma(\vec{z})_1 = \frac{\exp(z_1)}{\exp(z_1) + \exp(z_2)}.$$

Multiplying the numerator and denominator by the same constant doesn't change any of the probabilities, so we can divide both by $\exp(z_1)$. Only the difference $z_2 - z_1$ then matters, and we may assume without loss of generality that $z_1 = 0$ (that is, $\exp(z_1) = 1$). Thus we obtain:
$$\sigma(\vec{z})_1 = \frac{1}{1 + \exp(z_2)}.$$

This is our sigmoid function evaluated at $-z_2$. Hence the softmax function generalizes the sigmoid function.
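
A minimal NumPy sketch of the softmax function is shown below, together with a numerical check of the two-class case. Subtracting the maximum before exponentiating is a common trick to avoid overflow and is an addition of this example; it does not change the output because it rescales the numerator and denominator by the same constant.

```python
import numpy as np

def softmax(z):
    """Map a K-dimensional vector of scores to K probabilities."""
    z = z - np.max(z)        # for numerical stability; the result is unchanged
    e = np.exp(z)
    return e / e.sum()

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

probs = softmax(np.array([1.0, 2.0, 3.0]))
print(probs, probs.sum())    # three probabilities that sum to 1

# With K = 2, the first softmax output equals the sigmoid of z_1 - z_2
z1, z2 = 0.3, -1.2
print(softmax(np.array([z1, z2]))[0], sigmoid(z1 - z2))

# Large scores do not overflow thanks to the max subtraction
print(softmax(np.array([1000.0, 1000.0])))  # [0.5 0.5]
```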

### **The Softmax Model**
We can use the softmax function to define a $K$-class classification model.

In the binary classification setting, we mapped weights $\theta$ and features $x$ into a probability as follows:
$$\sigma(\theta^\top x) = \frac{1}{1 + \exp(- \theta^\top x)},$$

In the multi-class setting, we define a model $f: \mathcal{X} \to [0,1]^K$ that outputs the probability of class $k$ based on the features $x$ and class-specific weights $\theta_k$:
$$\vec{\sigma}(\theta^\top x)_k = \frac{\exp(\theta_k^\top x)}{\sum_{l=1}^K \exp(\theta_l^\top x)},$$

where $\theta = (\theta_1, \dots, \theta_K)$ collects the weights of all classes.

Its parameter space is $\Theta^K$, where $\Theta = \mathbb{R}^d$ is the parameter space of logistic regression.

You may have noticed that this model is slightly over-parametrized: adding the same vector to every $\theta_k$ multiplies the numerator and the denominator by the same constant and results in an equivalent model. For this reason, it is often assumed that one of the class weights $\theta_l = 0$.
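
The redundancy can be checked numerically. In the sketch below (random weights and inputs, hypothetical names), shifting every $\theta_k$ by the same vector $c$ multiplies each $\exp(\theta_k^\top x)$ by the same factor $\exp(c^\top x)$, which cancels in the ratio; choosing $c = -\theta_1$ pins the first class weights to zero without changing the model.

```python
import numpy as np

def softmax_probs(Theta, x):
    """Class probabilities for weights Theta of shape (K, d) and an input x of shape (d,)."""
    logits = Theta @ x
    logits = logits - logits.max()   # numerical stability only
    e = np.exp(logits)
    return e / e.sum()

rng = np.random.default_rng(0)
Theta = rng.normal(size=(3, 4))   # K = 3 classes, d = 4 features
x = rng.normal(size=4)
c = rng.normal(size=4)

p_original = softmax_probs(Theta, x)
p_shifted = softmax_probs(Theta + c, x)         # add c to every theta_k
p_pinned = softmax_probs(Theta - Theta[0], x)   # shift so that theta_1 = 0

print(np.allclose(p_original, p_shifted), np.allclose(p_original, p_pinned))  # True True
```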

### **Softmax Regression**
We again take a probabilistic perspective to derive a $K$-class classification algorithm based on this model.

We will start by using our softmax model to parametrize a probability distribution as follows:
$$p(y=k \mid x; \theta) = \vec{\sigma}(\theta^\top x)_k.$$

This is called the **categorical distribution**, and it generalizes the Bernoulli distribution.

Following the principle of maximum likelihood, we want to optimize the following objective over a training dataset:

$$L(\theta) = \prod_{i=1}^n p(y^{(i)} \mid x^{(i)}; \theta) = \prod_{i=1}^n  \vec{\sigma}(\theta^\top x^{(i)})_{y^{(i)}}.$$

This model and objective function define *softmax regression*.
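
To make the algorithm concrete, here is a rough from-scratch sketch that maximizes the mean log-likelihood by gradient steps on synthetic data. It relies on the standard result, not derived in this lecture, that the gradient with respect to $\theta_k$ is the average of $(1\{y = k\} - p_k(x))\,x$. The function names, step size, number of steps, and the three-blob dataset are all assumptions made for illustration.

```python
import numpy as np

def softmax_rows(Z):
    """Apply the softmax function to each row of Z."""
    Z = Z - Z.max(axis=1, keepdims=True)   # numerical stability only
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)

def fit_softmax_regression(X, y, num_classes, step_size=0.2, num_steps=2000):
    """Fit class weights Theta of shape (K, d) by gradient steps on the mean log-likelihood."""
    n, d = X.shape
    Theta = np.zeros((num_classes, d))
    Y_onehot = np.eye(num_classes)[y]            # (n, K) indicator of the true class
    for _ in range(num_steps):
        P = softmax_rows(X @ Theta.T)            # (n, K) predicted probabilities
        grad = (Y_onehot - P).T @ X / n          # (K, d) gradient of the mean log-likelihood
        Theta = Theta + step_size * grad         # step uphill on the log-likelihood
    return Theta

# Synthetic data (hypothetical): three well-separated blobs plus a constant feature
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [4.0, 0.0], [0.0, 4.0]])
y_syn = np.repeat(np.arange(3), 50)
X_syn = centers[y_syn] + 0.5 * rng.normal(size=(150, 2))
X_syn = np.c_[X_syn, np.ones(len(X_syn))]

Theta_hat = fit_softmax_regression(X_syn, y_syn, num_classes=3)
accuracy = np.mean(np.argmax(X_syn @ Theta_hat.T, axis=1) == y_syn)
print(accuracy)  # training accuracy; should be high on such well-separated data
```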

Let's now apply softmax regression to the Iris dataset by using the implementation from sklearn.

In [None]:
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression(C=1e5, multi_class = 'multinomial')

# Create an instance of Softmax Regression and fit the data
logreg.fit(X,iris_y)
Z = logreg.predict(np.c_[xx.ravel(), yy.ravel()])

# Put the result into the color plot
Z = Z.reshape(xx.shape)
plt.pcolormesh(xx, yy, Z, cmap=plt.cm.Paired)

# Plot also the training points
plt.scatter(X[:,0], X[:,1], c=iris_y, cmap=plt.cm.Paired,
            edgecolor='k', s=40)
plt.xlabel('Sepal length')
plt.ylabel('Sepal width')


### **Algorithm: Softmax Regression**


*   **Type:** Supervised learning (classification)
*   **Model Family:** Linear decision boundaries.
*   **Objective function:** Softmax loss, a special case of log-likelihood
*   **Optimizer:** Gradient descent
*   **Probabilistic Interpretation:** Parametrized categorical distribution.



