## **Overview of Multiclass Classification with Logistic Regression**

### Algorithm Overview

Multiclass classification with logistic regression extends the standard logistic regression approach, which traditionally handles binary classification, to address problems involving more than two classes. 

### Multiclass Classification Basics
- Problem Definition  
Multiclass classification is the problem of classifying instances into one of three or more classes.

- Input and Output 
  - Inputs $X$ typically come from a feature space.
  - Outputs $Y$ are from a finite set of labels $ Y = \{1, 2, \ldots, k\} $, where $ k $ is the number of classes.

### Multiclass Classification Strategies
In binary logistic regression, a linear function predicts the probability of the positive class using a logistic (sigmoid) function. Extending this to multiclass classification can be done using the following approaches:

##### One-vs-All Approach
One-vs-All involves training a single binary classifier for each class, with the samples of that class as positive samples and all other samples as negatives. The class with the highest probability score is selected for each input.

##### All-pairs Approach
All-pairs trains $\binom{k}{2} = k(k - 1)/2$ binary classifiers. Each classifier receives only the samples of one pair of classes from the original training set and learns to distinguish those two classes. For prediction, all $k(k - 1)/2$ classifiers are applied to an unseen sample, and the combined classifier predicts the class that received the highest number of "+1" predictions.

### Logistic Regression
Logistic Regression is a statistical method used for binary classification, predicting one of two possible outcomes based on input features. It estimates the parameters of a logistic model (the coefficients of the linear combination of features) and transforms that linear combination using the sigmoid function, which maps any real-valued number into a value between 0 and 1. Logistic regression belongs to the family of generalized linear models and is widely used when the target variable is binary.

#### Loss Function
In logistic regression, the loss function quantifies the error between the predicted probabilities and the actual class labels. The most commonly used loss function for binary logistic regression is the logistic loss (sometimes called cross-entropy loss or log loss). Training aims to minimize the average log loss across all training observations. Because confident incorrect predictions are penalized heavily, minimizing the loss pushes the model toward probabilities that are close to the true class labels.

#### Optimization
Gradient descent and its variants, like stochastic gradient descent (SGD), are common optimization techniques for logistic regression. Gradient descent works by computing the gradient (partial derivatives) of the loss function with respect to each parameter, and updating each parameter in the opposite direction of the gradient to minimize the loss.
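
The update rule described above can be sketched in a few lines of NumPy. This is a minimal illustration rather than a reference implementation: the function name `fit_logistic_gd`, the learning rate, the step count, and the toy data are all assumptions made for the example, and labels are taken to be in $\{0, 1\}$.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic_gd(X, y, lr=0.5, n_steps=2000):
    """Batch gradient descent on the mean logistic loss; y holds labels in {0, 1}."""
    m, d = X.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(n_steps):
        p = sigmoid(X @ w + b)        # predicted probabilities
        grad_w = X.T @ (p - y) / m    # gradient of the mean loss w.r.t. w
        grad_b = np.mean(p - y)       # gradient of the mean loss w.r.t. b
        w -= lr * grad_w              # step against the gradient
        b -= lr * grad_b
    return w, b

# Toy 1-D data (assumed for illustration): small x is class 0, large x is class 1.
X_toy = np.array([[0.0], [1.0], [2.0], [3.0]])
y_toy = np.array([0.0, 0.0, 1.0, 1.0])
w_fit, b_fit = fit_logistic_gd(X_toy, y_toy)
pred = (sigmoid(X_toy @ w_fit + b_fit) >= 0.5).astype(int)
print(pred)
```

Stochastic gradient descent follows the same update but estimates the gradient from one example (or a small batch) per step instead of the full training set.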


### Representation

#### Logistic regression
Logistic regression is a common hypothesis class for classification.

$$ \mathcal{X} = \mathbb{R}^d \quad \mathcal{Y} = \{1, -1\} $$

The predictor applies the sigmoid function to a linear function of the input, so it outputs a continuous value in $[0, 1]$:

$$ h_w(\mathbf{x}) = \frac{1}{1 + e^{-\langle \mathbf{w}, \mathbf{x} \rangle}} $$

Where:

* $\mathbf{x} \in \mathcal{X}$ represents the input vector with dimension $d$
* $\mathbf{w}$ is the weight vector
* $\langle \mathbf{w}, \mathbf{x} \rangle$ denotes the dot product between $\mathbf{w}$ and $\mathbf{x}$

This predictor is a mapping:

$$ h : \mathcal{X} \rightarrow [0, 1] $$
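
As a small illustration of the predictor $h_w$, the following sketch evaluates it for a hand-picked weight vector. The weights and inputs are arbitrary values chosen for the example, not taken from any dataset.

```python
import numpy as np

def h(w, x):
    """Logistic predictor h_w(x) = 1 / (1 + exp(-<w, x>))."""
    return 1.0 / (1.0 + np.exp(-np.dot(w, x)))

w = np.array([1.0, -2.0])             # assumed example weights

print(h(w, np.array([0.0, 0.0])))     # <w, x> = 0, so the output is 0.5
print(h(w, np.array([3.0, 0.0])))     # positive inner product, output above 0.5
print(h(w, np.array([0.0, 3.0])))     # negative inner product, output below 0.5
```

The sign of $\langle \mathbf{w}, \mathbf{x} \rangle$ decides which side of 0.5 the output falls on, and its magnitude decides how close the output is to 0 or 1.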

#### One-versus-All Pseudo Code
input:  
* training set $S = (x_1, y_1), \ldots, (x_m, y_m)$
* algorithm for binary classification $ A $ (here $A$ is Logistic Regression)

foreach $ i \in \mathcal{Y} $:   
* let $ S_i = (x_1, (-1)^{\mathbb{I}_{[y_1 \neq i]}}), \ldots, (x_m, (-1)^{\mathbb{I}_{[y_m \neq i]}}) $
* let $ h_i = A(S_i) $

output:  
- the multiclass hypothesis defined by $ h(x) \in \arg\max_{i \in \mathcal{Y}} h_i(x) $
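
The One-versus-All pseudo code above could be sketched in Python as follows. The helper `train_binary` is a hypothetical stand-in for the binary algorithm $A$ (a plain gradient-descent logistic regression with assumed hyperparameters), and the toy data is made up for the example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_binary(X, y_pm, lr=0.1, n_steps=5000):
    """Hypothetical stand-in for A: logistic regression via gradient descent.
    y_pm holds labels in {+1, -1}; an appended constant feature acts as a bias."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    t = (y_pm + 1) / 2                        # map {-1, +1} to {0, 1}
    w = np.zeros(Xb.shape[1])
    for _ in range(n_steps):
        w -= lr * Xb.T @ (sigmoid(Xb @ w) - t) / len(t)
    # The returned hypothesis h_i outputs a probability for the positive class.
    return lambda Xq: sigmoid(np.hstack([Xq, np.ones((Xq.shape[0], 1))]) @ w)

def one_vs_all(X, y, classes):
    # S_i relabels class i as +1 and every other class as -1, then h_i = A(S_i).
    return {i: train_binary(X, np.where(y == i, 1.0, -1.0)) for i in classes}

def predict_ova(models, X):
    classes = list(models)
    scores = np.column_stack([models[i](X) for i in classes])
    return np.array(classes)[np.argmax(scores, axis=1)]   # argmax_i h_i(x)

# Toy data (assumed): three well-separated clusters in the plane.
X = np.array([[0.0, 0.0], [0.5, 0.2], [4.0, 0.0],
              [4.2, 0.5], [0.0, 4.0], [0.3, 4.5]])
y = np.array([1, 1, 2, 2, 3, 3])

models = one_vs_all(X, y, classes=[1, 2, 3])
preds = predict_ova(models, X)
print(preds)
```

Note that the $k$ probabilities come from independently trained classifiers, so they need not sum to 1; only their ranking is used for the prediction.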

#### All-Pairs Pseudo Code
input:  
- training set $ S = (x_1, y_1), \ldots, (x_m, y_m) $
- algorithm for binary classification $ A $ (here $A$ is Logistic Regression)

foreach $ i, j \in \mathcal{Y} $ such that $ i < j $:
- initialize $ S_{i, j} $ to be the empty sequence
- for $ t = 1, \ldots, m $:
  - If $ y_t = i $, add $ (x_t, 1) $ to $ S_{i, j} $
  - If $ y_t = j $, add $ (x_t, -1) $ to $ S_{i, j} $
- let $ h_{i, j} = A(S_{i, j}) $

output:  
- the multiclass hypothesis defined by  
  $ h(x) \in \arg\max_{i \in \mathcal{Y}} \left( \sum_{j \in \mathcal{Y}} \text{sign}(j - i) h_{i, j}(x) \right) $
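
A possible Python sketch of the All-Pairs procedure is shown below. It implements the output rule as a vote count: a "+1" prediction from $h_{i,j}$ counts as a vote for class $i$, and a "-1" prediction as a vote for class $j$. The helper `train_binary`, its hyperparameters, and the toy data are assumptions for illustration; ties are broken in favor of the smaller class label, which is a choice of this sketch.

```python
import numpy as np
from itertools import combinations

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_binary(X, y_pm, lr=0.1, n_steps=5000):
    """Hypothetical stand-in for A: logistic regression via gradient descent.
    Returns a hypothesis that outputs +1 or -1."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    t = (y_pm + 1) / 2                        # map {-1, +1} to {0, 1}
    w = np.zeros(Xb.shape[1])
    for _ in range(n_steps):
        w -= lr * Xb.T @ (sigmoid(Xb @ w) - t) / len(t)
    def h(Xq):
        p = sigmoid(np.hstack([Xq, np.ones((Xq.shape[0], 1))]) @ w)
        return np.where(p >= 0.5, 1, -1)
    return h

def all_pairs(X, y, classes):
    models = {}
    for i, j in combinations(sorted(classes), 2):   # every pair with i < j
        mask = (y == i) | (y == j)                   # S_{i,j}: only classes i and j
        models[(i, j)] = train_binary(X[mask], np.where(y[mask] == i, 1.0, -1.0))
    return models

def predict_all_pairs(models, X, classes):
    classes = sorted(classes)
    col = {c: n for n, c in enumerate(classes)}
    votes = np.zeros((X.shape[0], len(classes)), dtype=int)
    for (i, j), h in models.items():
        pred = h(X)
        votes[pred > 0, col[i]] += 1    # +1 is a vote for class i
        votes[pred <= 0, col[j]] += 1   # -1 is a vote for class j
    return np.array(classes)[np.argmax(votes, axis=1)]

# Toy data (assumed): three well-separated clusters in the plane.
X = np.array([[0.0, 0.0], [0.5, 0.2], [4.0, 0.0],
              [4.2, 0.5], [0.0, 4.0], [0.3, 4.5]])
y = np.array([1, 1, 2, 2, 3, 3])

models = all_pairs(X, y, classes=[1, 2, 3])
preds = predict_all_pairs(models, X, classes=[1, 2, 3])
print(len(models), preds)
```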




### Loss

For binary classification, logistic regression uses the sigmoid function:
$$P(y = 1 | x) = \sigma(w^{T}x + b)$$
Where:
* $x$ is the input vector
* $w$ is the weights
* $b$ is the bias
* $\sigma(z) = \frac{1}{1 + e^{-z}}$ is the sigmoid function

Binary Cross-Entropy Loss:
$$L(y, \hat y) = -\left(y \log(\hat y) + (1 - y) \log(1 - \hat y)\right)$$
Where:
* $y$ is the true label (0 or 1; a label of $-1$ in the $\{1, -1\}$ convention corresponds to 0 here)
* $\hat y$ is the predicted probability of the positive class ($y = 1$)
* and $\hat y = \sigma(w^T x + b)$
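
To make the behavior of the binary cross-entropy loss concrete, here is a short NumPy sketch that evaluates it per example. The clipping constant `eps` is an assumption added to avoid taking `log(0)`; the labels and predictions are made-up values.

```python
import numpy as np

def binary_cross_entropy(y, y_hat, eps=1e-12):
    """Per-example loss -(y*log(y_hat) + (1-y)*log(1-y_hat)), with y in {0, 1}."""
    y_hat = np.clip(y_hat, eps, 1 - eps)   # guard against log(0)
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

y = np.array([1.0, 1.0, 0.0])        # true labels
y_hat = np.array([0.9, 0.5, 0.9])    # predicted probabilities of the positive class

losses = binary_cross_entropy(y, y_hat)
print(losses)         # smallest for the confident correct prediction, largest for the confident wrong one
print(losses.mean())  # average loss over the three examples
```

The third example shows why the loss discourages overconfidence: predicting 0.9 for an example whose true label is 0 costs far more than the uncertain prediction of 0.5.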

One-vs-All:
For one-vs-all, we train $K$ binary classifiers, one per class, so that classifier $k$ learns to distinguish class $k$ from all the others.
The loss for the $i$-th example of classifier $k$ is:
$$L_k(y_k^{(i)}, \hat y_k^{(i)}) = -\left[y_k^{(i)} \log(\hat y_k^{(i)}) + (1 - y_k^{(i)}) \log(1 - \hat y_k^{(i)})\right]$$
Where:
* $y_k^{(i)} = 1$ if the true class of the $i$-th example is class $k$, otherwise $y_k^{(i)} = 0$
* $\hat y_k^{(i)}$ is the predicted probability for class $k$

The predicted class is the one whose classifier outputs the highest probability (or confidence).

All-Pairs:
For All-Pairs, we train one classifier for every pair of classes instead of the $K$ classifiers used in One-vs-All. For $K$ classes, this gives $\frac{K(K - 1)}{2}$ classifiers, each of which distinguishes between exactly two classes.

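The loss function is still the binary cross-entropy loss. For the classifier that separates classes $k$ and $j$, the loss on the $i$-th example is:
$$L_{k,j}(y_{k,j}^{(i)}, \hat y_{k, j}^{(i)}) = -\left[y_{k, j}^{(i)} \log(\hat y_{k, j}^{(i)}) + (1 - y_{k, j}^{(i)}) \log(1 - \hat y_{k, j}^{(i)})\right]$$
Where:
* $y_{k,j}^{(i)} = 1$ if the true class of the $i$-th example is class $k$, and $y_{k,j}^{(i)} = 0$ if it is class $j$ (examples from other classes are not used by this classifier)
* $\hat y_{k,j}^{(i)}$ is the predicted probability of class $k$

Each classifier votes for one of its two classes, and the predicted class is the one that receives the most votes.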


### Optimizer