# **DATA 2060 Final Project**

### Written by: Rui Gao,
### Link to the github repo: https://github.com/Jingxian2022/Multiclass-Classification-with-Logistic-Regression/tree/main

## **Overview of Multiclass Classification with Logistic Regression**

### Algorithm Overview

Multiclass classification with logistic regression extends the standard logistic regression approach, which traditionally handles binary classification, to address problems involving more than two classes. 

### Multiclass Classification Basics
- Problem Definition  
Multiclass classification is the problem of classifying instances into one of three or more classes.

- Input and Output 
  - Inputs $X$ typically come from a feature space.
  - Outputs $Y$ are from a finite set of labels $ Y = \{1, 2, \ldots, k\} $, where $ k $ is the number of classes.

### Multiclass Classification Strategies
In binary logistic regression, a linear function predicts the probability of the positive class using a logistic (sigmoid) function. Extending this to multiclass classification can be done using the following approaches:

##### One-vs-All Approach
One-vs-All involves training a single binary classifier for each class, with the samples of that class as positive samples and all other samples as negatives. The class with the highest probability score is selected for each input.

##### All-pairs Approach
All-pairs involves training $\binom{k}{2} = k(k - 1)/2$ binary classifiers, each receives the samples of a pair of classes from the original training set, and learn to distinguish these two classes. For prediction, all $k (k − 1) / 2$ classifiers are applied to an unseen sample and the class that got the highest number of "+1" predictions gets predicted by the combined classifier.

### Logistic Regression
Logistic Regression is a statistical method used for binary classification, predicting one of two possible outcomes based on input features. It estimates the parameters of a logistic model (the coefficients in the linear or non linear combinations) and transforms the linear combination of features using the sigmoid function, which maps any real-valued number into a value between 0 and 1. Logistic regression belongs to the family of generalized linear models and is widely used when the target variable is binary. 

#### Loss Function
In logistic regression, the loss function quantifies the error between the predicted probabilities and the actual class labels. The most commonly used loss function for binary logistic regression is logistic loss(sometimes called cross-entropy loss). This function aims to minimize the log loss across all training observations. By penalizing incorrect predictions, the loss function encourages the model to produce probabilities that are closer to the true class labels.

#### Optimization
Gradient descent and its variants, like stochastic gradient descent (SGD), are common optimization techniques for logistic regression. Gradient descent works by computing the gradient (partial derivatives) of the loss function with respect to each parameter, and updating each parameter in the opposite direction of the gradient to minimize the loss.


### Representation

#### Logistic regression
Logistic regression is common hypothesis class for classification

$$ \mathcal{X} = \mathbb{R}^d \quad \mathcal{Y} = \{1, -1\} $$

Now we use a linear predictor that outputs a continuous value in [0, 1]

$$ h_w(\mathbf{x}) = \frac{1}{1 + e^{-\langle \mathbf{w}, \mathbf{x} \rangle}} $$

Where:

* $\mathbf{x} \in \mathcal{X}$ represents the input vector with dimension $d$
* $\mathbf{w}$ is the weight vector
* $\langle \mathbf{w}, \mathbf{x} \rangle$ denotes the dot product between $\mathbf{w}$ and $\mathbf{x}$

This linear predictor maps to:

$$ h : \mathcal{X} \rightarrow [0, 1] $$

#### One-versus-All Pseudo Code
input:  
* training set $S = (x_1, y_1), \ldots, (x_m, y_m)$
* algorithm for binary classification $ A $ (here $A$ is Logistic Regression)

foreach $ i \in \mathcal{Y} $:   
* let $ S_i = (x_1, (-1)^{\mathbb{I}_{[y_1 \neq i]}}), \ldots, (x_m, (-1)^{\mathbb{I}_{[y_m \neq i]}}) $
* let $ h_i = A(S_i) $

output:  
- the multiclass hypothesis defined by $ h(x) \in \arg\max_{i \in \mathcal{Y}} h_i(x) $

#### All-Pairs Pseudo Code
input:  
- training set $ S = (x_1, y_1), \ldots, (x_m, y_m) $
- algorithm for binary classification $ A $ (here $A$ is Logistic Regression)

foreach $ i, j \in \mathcal{Y} $ such that $ i < j $:
- initialize $ S_{i, j} $ to be the empty sequence
- for $ t = 1, \ldots, m $:
  - If $ y_t = i $, add $ (x_t, 1) $ to $ S_{i, j} $
  - If $ y_t = j $, add $ (x_t, -1) $ to $ S_{i, j} $
- let $ h_{i, j} = A(S_{i, j}) $

output:  
- the multiclass hypothesis defined by  
  $ h(x) \in \arg\max_{i \in \mathcal{Y}} \left( \sum_{j \in \mathcal{Y}} \text{sign}(j - i) h_{i, j}(x) \right) $




### Loss

For binary classification, logistic regression uses the sigmoid function:
$$P(y = 1 | x) = \sigma(w^{T}x + b)$$
Where:
* $x$ is the input vector
* $w$ is the weights
* $b$ is the bias
* $\sigma(z) = \frac{1}{1 + e^{-z}}$ is the sigmoid function

Binary Cross-Entropy Loss:
$$L(y, \hat y) = -(y log(\hat y) + (1 - y)log(1 - \hat y))$$
Where:
* $y$ is the true label (0 or 1)
* $\hat y$ is the predicted probability of the first class
* and $\hat y = \sigma(w^T x + b)$

One-vs-All:
For one-vs-all, we have to train $K$ different classifiers for each class so that each classifier $k$ can learn to distinguish one class from all the others.
The loss for the $i$-th example of classifier $k$ is:
$$L_k(y^{(i)}, \hat y_k^{(i)}) = -[y_k^{(i)}log(\hat y_k^{(i)}) + (1 - y_k^{(i)})log(1 - \hat y_k^{(i)})]$$
Where:
* $y_k^{(i)} = 1$ if the true class of the $i$-th example is class $k$, otherwise $y_k^{(i)} = 0$
* $\hat y_k^{(i)}$ is the predicted probability for class $k$

The overall class is determined by selecting the classifier that has the highest probability (or confidence).

All-Pairs:
For All-Pairs, we have to train a classifier for every pair of classes instead of $K$ classifiers in One-vs-All training. For $K$ classes, we train $\frac{K(K - 1)}{2}$ classifiers to distinguish between 2 classes for each classifier.

The loss function is still the binary cross-entropy loss and rewritten for the $i$-th example as:
$$L_{k,j}(y_{k,j}^{(i)}, \hat y_{k, j}^{(i)}) = -[y_{k, j}^{(i)}log(\hat y_{k, j}^{(i)}) + (1 - y_{k, j}^{(i)})log(1 - \hat y_{k, j}^{(i)})]$$
Each classifier will vote for one of two classes and the overall class is the class that receives the most votes.




## Optimizer
### One-vs-All (OvR) with SGD

In One-vs-All, we train a separate binary classifier for each class. Each classifier learns to distinguish one class from all others. Below is the pesudo code on sample S.

$\text{Initialize parameters } \mathbf{w} \text{ for each class, learning rate } \alpha, \text{ and batch size } b$<br />
$\text{converge} = \text{False}$<br />

$\text{while not converge:}$ <br />
    $\quad \text{epoch} += 1$<br />
    $\quad \text{Shuffle training examples}$<br />
    $\quad \text{Calculate last epoch loss}$<br />
    
$\quad \text{for } i = 0, 1, \dots, \left\lceil \frac{n_{\text{examples}}}{b} \right\rceil - 1 \text{: } \quad \text{(iterate over batches)}$<br />
        $\quad \quad X_{\text{batch}} = X[i \cdot b : (i + 1) \cdot b] \quad \text{(select the } X \text{ in the current batch)}$<br />
        $\quad \quad \mathbf{y}_{\text{batch}} = \mathbf{y}[i \cdot b : (i + 1) \cdot b] \quad \text{(select the labels in the current batch)}$<br />
        $\quad \quad \nabla L_{\mathbf{w}} = \mathbf{0} \quad \text{(initialize gradient matrix for each class)}$

$\quad \quad \text{for each pair of training data } (x, y) \in (X_{\text{batch}}, \mathbf{y}_{\text{batch}}) \text{:}$<br />
            $\quad \quad \quad \text{for } j = 0, 1, \dots, n_{\text{classes}} - 1 \text{:}$<br />
                $\quad \quad \quad \quad \text{if } y = j \text{:}$<br />
                    $\quad \quad \quad \quad \quad \nabla L_{\mathbf{w}_j} += \left( \sigma(\mathbf{w}_j^T x) - 1\right) \cdot x  \quad \text{(for correct class)}$<br />
                    $\quad \quad \quad \quad \text{else:}$<br />
                    $\quad \quad \quad \quad \quad \nabla L_{\mathbf{w}_j} += \sigma(\mathbf{w}_j^T x) \cdot x  \quad \text{(for other classes)}$<br />

$\quad \quad \text{for } j = 0, 1, \dots, n_{\text{classes}} - 1 \text{:}$<br />
            $\quad \quad \quad \mathbf{w}_j = \mathbf{w}_j - \alpha \cdot \frac{\nabla L_{\mathbf{w}_j}}{\text{len}(X_{\text{batch}})}  \quad \text{(update weights for each class)}$<br />

$\quad \text{Calculate this epoch loss}$<br />
    $\quad \text{if } \left| \text{Loss}(X, \mathbf{y})_{\text{this-epoch}} - \text{Loss}(X, \mathbf{y})_{\text{last-epoch}} \right| < \text{CONV-THRESHOLD:}$<br />
        $\quad \quad \text{converge} = \text{True}  \quad \text{(break the loop if loss converged)}$



Here, $sigmoid(w_j^T x)$ gives the probability that 
𝑥
x belongs to class 
𝑗
 (treated as a binary classification for that specific class).

### All Pairs (OvO) with SGD

In All Pairs, we train a separate binary classifier for each pair of classes, focusing only on the data points belonging to the two classes in each pair.

$\text{Initialize parameters } \mathbf{w} \text{ for each pair of classes, learning rate } \alpha, \text{ and batch size } b$<br />
$\text{converge} = \text{False}$<br />

$\text{while not converge:}$<br />
    $\quad \text{epoch} += 1$<br />
    $\quad \text{Shuffle training examples}$<br />
    $\quad \text{Calculate last epoch loss}$<br />
    
$\quad \text{for } i = 0, 1, \dots, \left\lceil \frac{n_{\text{examples}}}{b} \right\rceil - 1 \text{: } \quad \text{(iterate over batches)}$<br />
        $\quad \quad X_{\text{batch}} = X[i \cdot b : (i + 1) \cdot b] \quad \text{(select the } X \text{ in the current batch)}$<br />
        $\quad \quad \mathbf{y}_{\text{batch}} = \mathbf{y}[i \cdot b : (i + 1) \cdot b] \quad \text{(select the labels in the current batch)}$<br />
        $\quad \quad \text{for each unique pair of classes } (A, B) \text{:}$<br />
            $\quad \quad \quad \nabla L_{\mathbf{w}_{AB}} = \mathbf{0} \quad \text{(initialize gradient for each pair (A, B))}$<br />
            $\quad \quad \quad \text{for each } (x, y) \in (X_{\text{batch}}, \mathbf{y}_{\text{batch}}) \text{:}$<br />
                $\quad \quad \quad \quad \text{if } y = A \text{ or } y = B \text{:} \quad \text{(focus on examples for classes A and B)}$<br />
                    $\quad \quad \quad \quad \quad \text{if } y = A \text{:}$<br />
                        $\quad \quad \quad \quad \quad \quad \nabla L_{\mathbf{w}_{AB}} += \left( \sigma(\mathbf{w}_{AB}^T x) - 1 \right) \cdot x  \quad \text{(for class A)}$<br />
                    $\quad \quad \quad \quad \text{else:}$<br />
                        $\quad \quad \quad \quad \quad \quad \nabla L_{\mathbf{w}_{AB}} += \sigma(\mathbf{w}_{AB}^T x) \cdot x  \quad \text{(for class B)}$<br />

$\quad \quad \quad \mathbf{w}_{AB} = \mathbf{w}_{AB} - \alpha \cdot \frac{\nabla L_{\mathbf{w}_{AB}}}{\text{len}(X_{\text{batch}})}  \quad \text{(update weights for the pair (A, B))}$<br />

$\quad \text{Calculate this epoch loss}$<br />
    $\quad \text{if } \left| \text{Loss}(X, \mathbf{y})_{\text{this-epoch}} - \text{Loss}(X, \mathbf{y})_{\text{last-epoch}} \right| < \text{CONV-THRESHOLD:}$<br />
        $\quad \quad \text{converge} = \text{True}  \quad \text{(break the loop if loss converged)}$

            
                   

### Reference

https://github.com/danhergir/Logistic_regression </br>
https://danhergir.medium.com/implementing-multi-class-logistic-regression-with-scikit-learn-53d919b72c13

Run the environment test below and make sure all the requirements are met.

In [217]:
from __future__ import print_function
from packaging.version import parse as Version
from platform import python_version

OK = '\x1b[42m[ OK ]\x1b[0m'
FAIL = "\x1b[41m[FAIL]\x1b[0m"

try:
    import importlib
except ImportError:
    print(FAIL, "Python version 3.12.5 is required,"
                " but %s is installed." % sys.version)

def import_version(pkg, min_ver, fail_msg=""):
    mod = None
    try:
        mod = importlib.import_module(pkg)
        if pkg in {'PIL'}:
            ver = mod.VERSION
        else:
            ver = mod.__version__
        if Version(ver) == Version(min_ver):
            print(OK, "%s version %s is installed."
                  % (lib, min_ver))
        else:
            print(FAIL, "%s version %s is required, but %s installed."
                  % (lib, min_ver, ver))    
    except ImportError:
        print(FAIL, '%s not installed. %s' % (pkg, fail_msg))
    return mod


# first check the python version
pyversion = Version(python_version())

if pyversion >= Version("3.12.5"):
    print(OK, "Python version is %s" % pyversion)
elif pyversion < Version("3.12.5"):
    print(FAIL, "Python version 3.12.5 is required,"
                " but %s is installed." % pyversion)
else:
    print(FAIL, "Unknown Python version: %s" % pyversion)

    
print()
requirements = {'matplotlib': "3.9.1", 'numpy': "2.0.1",'sklearn': "1.5.1", 
                'pandas': "2.2.2"}

# now the dependencies
for lib, required_version in list(requirements.items()):
    import_version(lib, required_version)

[42m[ OK ][0m Python version is 3.12.5

[42m[ OK ][0m matplotlib version 3.9.1 is installed.
[42m[ OK ][0m numpy version 2.0.1 is installed.
[42m[ OK ][0m sklearn version 1.5.1 is installed.
[42m[ OK ][0m pandas version 2.2.2 is installed.


### Model

In [318]:
import numpy as np

def sigmoid(x):
    '''
    Apply sigmoid function to an array.
    @params:
        x: The input array.
    @return:
        An array with the sigmoid function applied elementwise.
    '''
    return 1 / (1 + np.exp(-x))

class MulticlassLogisticRegression:
    '''
    Multiclass Logistic Regression with One-vs-All (OvA) and All-Pairs (OvO) strategies,
    trained using stochastic gradient descent.
    '''
    def __init__(self, n_features, n_classes, batch_size=32, conv_threshold=1e-4, strategy='one-vs-all'):
        '''
        Initializes the Multiclass Logistic Regression classifier.
        @attrs:
            n_features: Number of features in the dataset.
            n_classes: Number of unique classes.
            weights: Model weights, initialized to zeros.
            strategy: Multiclass strategy ('one-vs-all' or 'all-pairs').
            alpha: Learning rate for SGD.
        '''
        self.n_classes = n_classes
        self.n_features = n_features
        self.strategy = strategy
        self.weights = None  # Initialize dynamically based on the strategy
        self.alpha = 0.1
        self.batch_size = batch_size
        self.conv_threshold = conv_threshold

    def train(self, X, Y):
        '''
        Trains the model using stochastic gradient descent.
        Supports both One-vs-All and All-Pairs strategies.
        @params:
            X: 2D Numpy array where each row is an example, padded with one column for bias.
            Y: 1D Numpy array of labels for each example.
        @return:
            Number of epochs taken to converge.
        '''
        if self.strategy == 'one-vs-all':
            self._train_one_vs_all(X, Y)
        elif self.strategy == 'all-pairs':
            self._train_all_pairs(X, Y)
        else:
            raise ValueError(f"Invalid strategy: {self.strategy}. Use 'one-vs-all' or 'all-pairs'.")

    def _train_one_vs_all(self, X, Y):
        '''
        Trains the model using the One-vs-All (OvA) strategy. 
        Each class is treated as a binary classification problem against all other classes, 
        and a separate weight vector is trained for each class.

        @params:
            X: A 2D Numpy array where each row is a feature vector of an example, 
               padded with one column for the bias term.
            Y: A 1D Numpy array of class labels for each example in X.
            
            Labels are converted into binary format for each class during training.
        '''
        self.weights = np.zeros((self.n_classes, self.n_features + 1))
        for class_label in range(self.n_classes):
            binary_Y = (Y == class_label).astype(int) #if label matches then assign 1, otherwise 0
            self._train_binary_class(X, binary_Y, class_label)

    def _train_all_pairs(self, X, Y):
        '''
        Trains the model using the All-Pairs (OvO) strategy.
        Each pair of classes is treated as a binary classification problem, 
        and a separate weight vector is trained for each class pair.

        @params:
            X: A 2D Numpy array where each row is a feature vector of an example, 
               padded with one column for the bias term.
            Y: A 1D Numpy array of class labels for each example in X.
            
            Only examples belonging to any two distinct classes are used for training each classifier.
        '''
        #The weights for all binary classifiers are stored in a dictionary
        #Keys: Tuples representing a pair of classes (e.g., (0, 1), (0, 2))
        #Values: Weight vectors for the corresponding classifier.
        self.weights = {}
        
        #a total of n(n-1)/2 classifiers are trained
        for i in range(self.n_classes):
            for j in range(i + 1, self.n_classes):
                #identifies the indices of examples where the label is either i or j
                indices = np.where((Y == i) | (Y == j))[0]
                X_subset = X[indices]
                Y_subset = Y[indices]

                #labels converted into binary format
                binary_Y = (Y_subset == i).astype(int) #class i = 1, class j = 0
                self.weights[(i, j)] = np.zeros(self.n_features + 1)
                self._train_binary_class(X_subset, binary_Y, (i, j))

    def _train_binary_class(self, X, Y, label):
        '''
        Trains a binary logistic regression model for a specific class or pair of classes.
        @params:
            X: A 2D Numpy array where each row contains a feature vector for a training example.
            Y: A 1D Numpy array with binary labels (0 or 1) corresponding to the examples in X.
            label: An integer (for OvA) or tuple (for OvO) representing the class or class pair being trained.
        @return:
            Number of epochs taken to converge during the training process.
        '''
        num_examples = X.shape[0]
        epoch = 0
        converged = False
        last_loss = float('inf')
        while not converged:
            epoch += 1
            indices = np.arange(num_examples)
            np.random.shuffle(indices)
            X = X[indices]
            Y = Y[indices]
                
            for i in range(int(np.ceil(num_examples/self.batch_size))):
                batch_X = X[i * self.batch_size:(i + 1) * self.batch_size]
                batch_Y = Y[i * self.batch_size:(i + 1) * self.batch_size]

                grad_w = np.zeros_like(self.weights[label] if isinstance(label, tuple) else self.weights[label])
                for x, y in zip(batch_X, batch_Y):
                    raw = np.dot(self.weights[label], x)
                    prob = sigmoid(raw)  # Probability of positive class
                    grad_w += (prob - y) * x

                grad_w /= len(batch_X)
                self.weights[label] -= self.alpha * grad_w
                
            this_loss = self.loss(X, Y, label)
            if abs(this_loss - last_loss) < self.conv_threshold:
                converged = True
                
            last_loss = this_loss

        return epoch

    def predict(self, X):
        '''
        Predicts the class for each example in X.
        @params:
            X: 2D Numpy array of examples, padded with one column for bias.
        @return:
            1D Numpy array of predicted class labels.
        '''
        if self.strategy == 'one-vs-all':
            return self._predict_one_vs_all(X)
        elif self.strategy == 'all-pairs':
            return self._predict_all_pairs(X)
        else:
            raise ValueError(f"Invalid strategy: {self.strategy}. Use 'one-vs-all' or 'all-pairs'.")

    def _predict_one_vs_all(self, X):
        '''
        Predicts the class labels for a given dataset using the One-vs-All (OvA) strategy.
        @params:
            X: A 2D Numpy array where each row is a feature vector of an example, padded with one column for the bias term.
        @return:
            A 1D Numpy array containing the predicted class labels for each example in X.
            Each label corresponds to the class with the highest probability.
        '''
        probabilities = np.dot(X, self.weights.T)
        return np.argmax(probabilities, axis=1)

    def _predict_all_pairs(self, X):
        '''
        Predicts the class labels for a given dataset using the All-Pairs (OvO) strategy.
        @params:
            X: A 2D Numpy array where each row is an example, padded with one column for bias.
        @return:
            A 1D Numpy array of predicted class labels for each example in X.
        '''
        votes = np.zeros((X.shape[0], self.n_classes))
        for (i, j), weight in self.weights.items():
            #raw score for the (i,j) classifier
            raw = X @ weight
            #1 or class i if >= 0, 0 or class j if < 0, decision boundary
            predictions = (raw >= 0).astype(int)
            votes[:, i] += predictions
            votes[:, j] += (1 - predictions)
        #select class with the most votes
        return np.argmax(votes, axis=1)

    def loss(self, X, Y, label):
        '''
        Computes the log loss for the model.
        @params:
            X: 2D Numpy array of examples, padded with one column for bias.
            Y: 1D Numpy array of labels for each example.
            label: Binary classification label or class pair.
        @return:
            Average log loss.
        '''
        total_loss = 0
        num_examples = X.shape[0]

        if isinstance(label, tuple):
            # Binary classification loss (OvO for a specific class pair)
            for x, y in zip(X, Y):
                raw = np.dot(self.weights[label], x)  # Raw score for the OvO classifier
                prob = sigmoid(raw)  # Sigmoid for binary probabilities
                if y == 1:  # Positive class in the pair
                    total_loss += -np.log(prob + 1e-6)
                else:  # Negative class in the pair
                    total_loss += -np.log(1 - prob + 1e-6)
        else:
            # Binary classification loss (OvA for a specific class)
            for x, y in zip(X, Y):
                raw = np.dot(self.weights[label], x)  # Raw score for the OvA classifier
                probability = sigmoid(raw)  # Sigmoid for binary probabilities
                if y == 1:  # Positive class
                    total_loss += -np.log(probability + 1e-6)
                else:  # Negative class (all other classes)
                    total_loss += -np.log(1 - probability + 1e-6)

        return total_loss / num_examples
   

    def accuracy(self, X, Y):
        '''
        Computes accuracy on a given dataset.
        @params:
            X: 2D Numpy array of examples, padded with one column for bias.
            Y: 1D Numpy array of true labels.
        @return:
            Float value representing accuracy.
        '''
        predictions = self.predict(X)
        return np.mean(predictions == Y)


### Check Model

In [320]:
import random
import pytest
from sklearn.multiclass import OneVsOneClassifier
from sklearn.multiclass import OneVsRestClassifier
from sklearn.metrics import log_loss
from sklearn.linear_model import SGDClassifier

random.seed(0)
np.random.seed(0)

# binary classification 

x_bias = np.array([[0,4,1], [0,3,1], [5,0,1], [4,1,1], [0,5,1]])
x = x_bias[:,:-1]
y = np.array([0,0,1,1,0])
x_bias_test = np.array([[0,0,1], [-5,3,1], [9,0,1], [1,0,1], [6,-7,1]])
x_test = x_bias_test[:,:-1]
y_test = np.array([0,0,1,0,1])


# generate binary classification model
binary_test_model = MulticlassLogisticRegression(2, 2)
binary_test_model.weights = np.zeros((2, 3))
# check loss function
assert binary_test_model.loss(x_bias, y, 1) == pytest.approx(log_loss(y,sigmoid(x_bias @ np.zeros((3,1)))), .01)
binary_test_model._train_binary_class(x_bias, y, label=1)

# generate one-vs-all binary classification model
one_vs_all_binary_test_model = MulticlassLogisticRegression(2, 2)
one_vs_all_binary_test_model.weights = np.zeros((2, 3))
# check loss function with ova
assert one_vs_all_binary_test_model.loss(x_bias,y,1) == pytest.approx(log_loss(y,sigmoid(x_bias @ np.zeros((3,1)))), .01)
one_vs_all_binary_test_model.train(x_bias, y)

# generate all-pairs binary classification model
all_pairs_binary_test_model = MulticlassLogisticRegression(2, 2, strategy="all-pairs")
all_pairs_binary_test_model.weights = np.zeros((2, 3))
all_pairs_binary_test_model.train(x_bias, y)
# assert one_vs_all_binary_test_model.loss(x_bias,y,1) == pytest.approx(log_loss(y,sigmoid(x_bias @ np.zeros((3,1)))), .01)

# check using binary classification, ovo model, and ova model has the same result with 2 classes
assert np.allclose(binary_test_model.weights[1], one_vs_all_binary_test_model.weights[1], atol=0.001)
assert np.allclose(one_vs_all_binary_test_model.weights[0], all_pairs_binary_test_model.weights[(0,1)], atol=0.001)

assert binary_test_model.accuracy(x_bias_test, y_test) == one_vs_all_binary_test_model.accuracy(x_bias_test, y_test)
assert one_vs_all_binary_test_model.accuracy(x_bias_test, y_test) == all_pairs_binary_test_model.accuracy(x_bias_test, y_test)

assert (all_pairs_binary_test_model.predict(x_bias_test) == one_vs_all_binary_test_model.predict(x_bias_test)).all()

random.seed(0)
np.random.seed(0)

# sklearn SGDClassifier
sgd_logistic = SGDClassifier(
    loss='log_loss', penalty=None, alpha=0, max_iter=1000, tol=1e-4, shuffle=True, learning_rate='constant', eta0=0.1, early_stopping=False, epsilon=1e-6, average=32,
)
ova_model = OneVsRestClassifier(sgd_logistic)
ova_model.fit(x, y)
sklearn_ova_model_weight=[]
for i, estimator in enumerate(ova_model.estimators_):
    sklearn_ova_model_weight.append(np.hstack([estimator.coef_, estimator.intercept_.reshape(-1, 1)]))

# check 
assert log_loss(y,ova_model.predict_proba(x)) == pytest.approx(one_vs_all_binary_test_model.loss(x_bias,y,1), abs=0.01)
assert ova_model.score(x_test,y_test) == one_vs_all_binary_test_model.accuracy(x_bias_test, y_test)
assert (ova_model.predict(x_test) == one_vs_all_binary_test_model.predict(x_bias_test)).all()
assert np.allclose(one_vs_all_binary_test_model.weights[1], sklearn_ova_model_weight, atol=0.5)

sgd_logistic = SGDClassifier(
    loss='log_loss', penalty=None, alpha=0, max_iter=1000, tol=1e-4, shuffle=True, learning_rate='constant', eta0=0.1, early_stopping=False, epsilon=1e-6, average=32,
)
ovo_model = OneVsOneClassifier(sgd_logistic)
ovo_model.fit(x, y)
sklearn_ovo_model_weight = []
for i, estimator in enumerate(ovo_model.estimators_):
    sklearn_ovo_model_weight.append(np.hstack([estimator.coef_, estimator.intercept_.reshape(-1, 1)]))
sklearn_ovo_model_weight *= -1
assert ovo_model.score(x_test,y_test) == all_pairs_binary_test_model.accuracy(x_bias_test, y_test)
assert (ovo_model.predict(x_test) == all_pairs_binary_test_model.predict(x_bias_test)).all()
for a, b in zip(all_pairs_binary_test_model.weights[(0,1)], sklearn_ovo_model_weight):
    assert a == pytest.approx(b, rel=.01)

# multiclass classification
x_bias2 = np.array([[0,0,1], [0,3,1], [4,0,1], [6,1,1], [0,1,1], [0,4,1]])
y2 = np.array([0,1,2,2,0,1])
x_bias_test2 = np.array([[0,0,1], [-5,3,1], [9,0,1], [1,0,1]])
y_test2 = np.array([0,1,2,0])
x2 = x_bias2[:,:-1]
x2_test = x_bias_test2[:,:-1]

# train the multiclass classification with one-vs-all
test_model_one_vs_all_1 = MulticlassLogisticRegression(2, 3)
test_model_one_vs_all_1.train(x_bias2, y2)

random.seed(0)
np.random.seed(0)

sgd_logistic = SGDClassifier(
    loss='log_loss', penalty=None, alpha=0, max_iter=1000, tol=1e-4, shuffle=True, learning_rate='constant', eta0=0.1, early_stopping=False, epsilon=1e-6, average=32,
)
# train sklearn OneVsRestClassifier
ova_model = OneVsRestClassifier(sgd_logistic)
ova_model.fit(x2, y2)
ova_model_weight = []
for i, estimator in enumerate(ova_model.estimators_):
    weights = estimator.coef_
    bias = estimator.intercept_
    ova_model_weight.append(np.hstack([weights, bias.reshape(-1, 1)]))

# assert log_loss(y,ova_model.predict_proba(x2)) == pytest.approx(test_model_one_vs_all_1.loss(x_bias2,y2,1), abs=0.01)
assert ova_model.score(x2_test,y_test2) == test_model_one_vs_all_1.accuracy(x_bias_test2, y_test2)
assert (ova_model.predict(x2_test) == test_model_one_vs_all_1.predict(x_bias_test2)).all()
for a, b in zip(test_model_one_vs_all_1.weights, ova_model_weight):
    assert np.allclose(a, b, atol=0.5)

# train the multiclass classification with all-pairs
test_model_all_pairs_1 = MulticlassLogisticRegression(2,3,strategy="all-pairs")
test_model_all_pairs_1.train(x_bias2, y2)

sgd_logistic = SGDClassifier(
    loss='log_loss', penalty=None, alpha=0, max_iter=1000, tol=1e-4, shuffle=True, learning_rate='constant', eta0=0.1, early_stopping=False, epsilon=1e-6, average=32,
)
# sklearn multiclass classification with all-pairs
ovo_model = OneVsOneClassifier(sgd_logistic)
ovo_model.fit(x2, y2)
ovo_model_weight = []
for i, estimator in enumerate(ovo_model.estimators_):
    ovo_model_weight.append(np.hstack([estimator.coef_, estimator.intercept_.reshape(-1, 1)]))

assert ovo_model.score(x2_test,y_test2) == test_model_all_pairs_1.accuracy(x_bias_test2, y_test2)
assert (ovo_model.predict(x2_test) == test_model_all_pairs_1.predict(x_bias_test2)).all()
for a, b in zip(test_model_all_pairs_1.weights.values(), ovo_model_weight):
    a_ = -a
    assert (np.allclose(a, b, atol=0.1) | np.allclose(a_, b, atol=0.1))

# check raise error when input is not expected
classifier = MulticlassLogisticRegression(x_bias,y,strategy='one-vs-one')
with pytest.raises(ValueError, match="Invalid strategy"):
    classifier.predict(x_bias)

In [179]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from imblearn.under_sampling import RandomUnderSampler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, accuracy_score, ConfusionMatrixDisplay
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
import pandas as pd

DATA_FILE = '../data/Dry_Bean.csv'

def get_data(file_path):
    df = pd.read_csv(file_path)
    df['Class'].unique()

    undersample = RandomUnderSampler(random_state=42)

    X = df.drop('Class', axis=1)
    y = df.Class

    X_over, y_over = undersample.fit_resample(X, y)

    # sns.countplot(x=y_over, data=df)
    # plt.xticks(rotation=45)
    # plt.show()

    y_over.replace(list(np.unique(y_over)), [1, 2, 3, 4, 5, 6, 7], inplace=True)
    df_dea = X_over
    df_dea['Class'] = y_over
    
    # This columns may create an overfitted model
    X_over.drop(['ConvexArea', 'EquivDiameter'], axis=1, inplace=True)

    X_train, X_test, y_train, y_test = train_test_split(X_over, y_over, random_state=0, shuffle=True, test_size=.2)
    
    # scale our data
    st_x = StandardScaler()
    X_train = st_x.fit_transform(X_train)
    X_test = st_x.transform(X_test)
    y_train = y_train.to_numpy()
    y_test = y_test.to_numpy()

    return X_train, y_train, X_test, y_test

def test_dry_bean_ovr():
    X_train, Y_train, X_test, Y_test = get_data(DATA_FILE)
    num_features = X_train.shape[1]
    NUM_CLASS = 7
    BATCH_SIZE = 100
    CONV_THRESHOLD = 1e-3

    X_train_b = np.hstack((X_train, np.ones((X_train.shape[0], 1))))
    X_test_b = np.hstack((X_test, np.ones((X_test.shape[0], 1))))

    model = MulticlassLogisticRegression(num_features, NUM_CLASS, BATCH_SIZE, CONV_THRESHOLD)
    model.train(X_train_b, Y_train)
    acc = model.accuracy(X_test_b, Y_test)
    print("One-vs-all model accuracy: ",acc)

    logistic_regression_model = LogisticRegression(solver='liblinear')
    ova_model = OneVsRestClassifier(logistic_regression_model)
    ova_model.fit(X_train, Y_train)
    print("Library model accuracy: ",ova_model.score(X_test,Y_test))
    
def test_dry_bean_ovo():
    X_train, Y_train, X_test, Y_test = get_data(DATA_FILE)
    num_features = X_train.shape[1]
    NUM_CLASS = 7
    BATCH_SIZE = 100
    CONV_THRESHOLD = 1e-3
    
    X_train_b = np.hstack((X_train, np.ones((X_train.shape[0], 1))))
    X_test_b = np.hstack((X_test, np.ones((X_test.shape[0], 1))))

    model = MulticlassLogisticRegression(num_features, NUM_CLASS, BATCH_SIZE, CONV_THRESHOLD, 'all-pairs')
    model.train(X_train_b, Y_train)
    acc = model.accuracy(X_test_b, Y_test)
    print("All-pairs model accuracy: ",acc)

    logistic_regression_model = LogisticRegression(solver='liblinear')
    ovo_model = OneVsOneClassifier(logistic_regression_model)
    ovo_model.fit(X_train, Y_train)
    print("Library model accuracy: ",ovo_model.score(X_test,Y_test))

test_dry_bean_ovr()
test_dry_bean_ovo()

  y_over.replace(list(np.unique(y_over)), [1, 2, 3, 4, 5, 6, 7], inplace=True)


One-vs-all model accuracy:  0.841313269493844
Library model accuracy:  0.9863201094391245


  y_over.replace(list(np.unique(y_over)), [1, 2, 3, 4, 5, 6, 7], inplace=True)


All-pairs model accuracy:  0.8467852257181943
Library model accuracy:  0.9931600547195623
