# Demo Lab Logistic Regression


## Logistic Regression 

The _logistic regression_ model can be derived in several ways.  
We'll illustrate 4 of them.

### Derivation 001

One intuitive path is saying that we're after calculating the probability: $p \left( y = 1 \mid \boldsymbol{x} \right)$.  
Since it is a probability, it must obey some rules. The first one is being in the range $\left[ 0, 1 \right]$.  

A function which maps $\left( -\infty, \infty \right) \to \left( 0, 1 \right)$ is the [Sigmoid Function](https://en.wikipedia.org/wiki/Sigmoid_function): $\sigma \left( z \right) = \frac{1}{1 + \exp \left( -z \right)}$.
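
As a minimal sketch of the Sigmoid Function $\sigma \left( z \right) = \frac{1}{1 + \exp \left( -z \right)}$, the snippet below evaluates it in plain Python. The function name `sigmoid` and the branching on the sign of `z` (to avoid overflow for large negative inputs) are choices made for this illustration, not part of the lab.

```python
import math


def sigmoid(z: float) -> float:
    """Logistic sigmoid 1 / (1 + exp(-z)), evaluated without overflow."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    # For negative z, exp(-z) may overflow, so use the equivalent form exp(z) / (1 + exp(z)).
    ez = math.exp(z)
    return ez / (1.0 + ez)


# The output stays in [0, 1] for any real input and is symmetric around z = 0.
for z in (-1000.0, -2.0, 0.0, 2.0, 1000.0):
    print(f"sigmoid({z:8.1f}) = {sigmoid(z):.6f}")
```

Note the symmetry $\sigma \left( -z \right) = 1 - \sigma \left( z \right)$, which is why a single output suffices for two classes.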

So now we can say that: $p \left( y = 1 \mid \boldsymbol{x} \right) = \sigma \left( {z}_{i} \right)$.  
Now the problem is modeling the parameter ${z}_{i}$, which in the linear case is modeled as ${z}_{i} = \boldsymbol{w}_{i}^{T} \boldsymbol{x} + {b}_{i}$.  
With a linear model and the choice of the Sigmoid Function, the objective function (the negative log likelihood) is convex in $\boldsymbol{w}_{i}$ and ${b}_{i}$:

![](https://upload.wikimedia.org/wikipedia/commons/thumb/c/cb/Exam_pass_logistic_curve.svg/640px-Exam_pass_logistic_curve.svg.png)

* <font color='brown'>(**#**)</font> Actually, the objective is always convex, but it has a finite minimizer only if the problem is not linearly separable.
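
To see what linear separability does to the objective, here is a small sketch on hypothetical 1D data (the points, labels and the helper `mean_nll` are made up for this illustration). On the separable labels the loss keeps shrinking as the weight grows, so no finite weight attains the minimum. On the overlapping labels large weights are penalized.

```python
import math


def mean_nll(w, b, xs, ys):
    """Mean negative log-likelihood of a 1-D logistic model, labels in {0, 1}."""
    total = 0.0
    for x, y in zip(xs, ys):
        z = w * x + b
        # log(1 + exp(z)) computed stably (softplus), then minus y * z.
        total += max(z, 0.0) + math.log1p(math.exp(-abs(z))) - y * z
    return total / len(xs)


xs = [-2.0, -1.0, 1.0, 2.0]
separable = [0, 0, 1, 1]    # a threshold at x = 0 splits the classes
overlapping = [0, 1, 0, 1]  # no threshold splits the classes

for w in (1.0, 10.0, 100.0):
    print(
        f"w = {w:6.1f}  "
        f"separable: {mean_nll(w, 0.0, xs, separable):.3e}  "
        f"overlapping: {mean_nll(w, 0.0, xs, overlapping):.3e}"
    )
```

In practice a regularization term is commonly added to keep the weights finite in the separable case.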

If we extend the above to the multi class case we'll get the [Softmax Function](https://en.wikipedia.org/wiki/Softmax_function), as in the slides.

### Derivation 002

By _Bayes Theorem_ for the $L$ classes model:

$$
\begin{aligned}
p \left( y = {L}_{i} \mid \boldsymbol{x} \right) & = \frac{ p \left( \boldsymbol{x} \mid y = {L}_{i} \right) p \left( y = {L}_{i} \right) }{ p \left( \boldsymbol{x} \right) } && \text{} \\
& = \frac{ p \left( \boldsymbol{x} \mid y = {L}_{i} \right) p \left( y = {L}_{i} \right) }{ \sum_{j = 1}^{L} p \left( \boldsymbol{x} \mid y = {L}_{j} \right) p \left( y = {L}_{j} \right) } && \text{Expanding by the law of total probability} \\
& = \frac{ p \left( \boldsymbol{x} \mid y = {L}_{i} \right) p \left( y = {L}_{i} \right) }{ p \left( \boldsymbol{x} \mid y = {L}_{i} \right) p \left( y = {L}_{i} \right) + p \left( \boldsymbol{x} \mid y \neq {L}_{i} \right) p \left( y \neq {L}_{i} \right) } && \text{Grouping the classes into $y = {L}_{i}$ and $y \neq {L}_{i}$} \\
& = \frac{ 1 }{ 1 + \frac{ p \left( \boldsymbol{x} \mid y \neq {L}_{i} \right) p \left( y \neq {L}_{i} \right)}{p \left( \boldsymbol{x} \mid y = {L}_{i} \right) p \left( y = {L}_{i} \right)} } && \text{Dividing by $p \left( \boldsymbol{x} \mid y = {L}_{i} \right) p \left( y = {L}_{i} \right)$} \\
& = \frac{ 1 }{ 1 + {e}^{\log \frac{ p \left( \boldsymbol{x} \mid y \neq {L}_{i} \right) p \left( y \neq {L}_{i} \right)}{p \left( \boldsymbol{x} \mid y = {L}_{i} \right) p \left( y = {L}_{i} \right)}} } && \text{for $x \in \left( 0, \infty \right) \Rightarrow x = \exp \log x$} \\
& = \frac{ 1 }{ 1 + {e}^{-\log \frac{ p \left( \boldsymbol{x} \mid y = {L}_{i} \right) p \left( y = {L}_{i} \right) }{ p \left( \boldsymbol{x} \mid y \neq {L}_{i} \right) p \left( y \neq {L}_{i} \right) }} } && \text{$\log x = - \log \frac{1}{x}$} \\
\end{aligned}
$$

Now, if we model the log of the likelihood ratio (weighted by the priors) of the ${L}_{i}$ label with a linear model:

$$ \log \frac{ p \left( \boldsymbol{x} \mid y = {L}_{i} \right) p \left( y = {L}_{i} \right) }{ p \left( \boldsymbol{x} \mid y \neq {L}_{i} \right) p \left( y \neq {L}_{i} \right) } = \boldsymbol{w}_{i}^{T} \boldsymbol{x} + {b}_{i} $$

So we get:

$$ p \left( y = {L}_{i} \mid \boldsymbol{x} \right) = \frac{1}{ 1 + {e}^{- \left( \boldsymbol{w}_{i}^{T} \boldsymbol{x} + {b}_{i} \right)} } $$

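Yet, since $1 = {e}^{- \log \frac{p \left( \boldsymbol{x} \mid y = {L}_{i} \right) p \left( y = {L}_{i} \right)}{p \left( \boldsymbol{x} \mid y = {L}_{i} \right) p \left( y = {L}_{i} \right)}} = {e}^{0}$, multiplying the numerator and the denominator by ${e}^{\boldsymbol{w}_{i}^{T} \boldsymbol{x} + {b}_{i}}$ shows the above can be written in a Softmax like form:

$$
\begin{aligned}
p \left( y = {L}_{i} \mid \boldsymbol{x} \right) & = \frac{1}{ 1 + {e}^{- \left( \boldsymbol{w}_{i}^{T} \boldsymbol{x} + {b}_{i} \right)} } \\
& = \frac{ {e}^{\boldsymbol{w}_{i}^{T} \boldsymbol{x} + {b}_{i}} }{ {e}^{\boldsymbol{w}_{i}^{T} \boldsymbol{x} + {b}_{i}} + {e}^{0} }
\end{aligned}
$$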
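
A case where the linear model of the log ratio holds exactly can be checked numerically. The sketch below assumes two classes with 1D Gaussian class conditionals sharing the same standard deviation. This assumption, and all the numbers and function names, are introduced only for illustration. Under it the log ratio is linear in $x$, so the posterior computed by Bayes Theorem should match the Sigmoid of a linear function.

```python
import math


def gaussian_pdf(x, mu, sigma):
    """Density of a 1-D Gaussian N(mu, sigma^2) at x."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))


# Hypothetical class conditionals p(x | y = 0), p(x | y = 1) and priors.
mu0, mu1, sigma = -1.0, 2.0, 1.5
prior1 = 0.3
prior0 = 1.0 - prior1

# Log ratio of p(x | y = 1) p(y = 1) to p(x | y = 0) p(y = 0), written as w * x + b.
w = (mu1 - mu0) / sigma**2
b = (mu0**2 - mu1**2) / (2.0 * sigma**2) + math.log(prior1 / prior0)


def posterior_bayes(x):
    """p(y = 1 | x) computed directly from Bayes Theorem."""
    num = gaussian_pdf(x, mu1, sigma) * prior1
    return num / (num + gaussian_pdf(x, mu0, sigma) * prior0)


def posterior_sigmoid(x):
    """p(y = 1 | x) computed as the Sigmoid of the linear log ratio."""
    return 1.0 / (1.0 + math.exp(-(w * x + b)))


for x in (-3.0, 0.0, 0.5, 4.0):
    print(f"x = {x:5.1f}  Bayes: {posterior_bayes(x):.6f}  Sigmoid: {posterior_sigmoid(x):.6f}")
```

For other class conditionals the linear form is a modeling choice rather than an identity.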

### Derivation 003

By _Bayes Theorem_ for the $L$ classes model:

$$
\begin{aligned}
p \left( y = {L}_{i} \mid \boldsymbol{x} \right) & = \frac{ p \left( \boldsymbol{x} \mid y = {L}_{i} \right) p \left( y = {L}_{i} \right) }{ p \left( \boldsymbol{x} \right) } && \text{} \\
& = \frac{ p \left( \boldsymbol{x} \mid y = {L}_{i} \right) p \left( y = {L}_{i} \right) }{ \sum_{j = 1}^{L} p \left( \boldsymbol{x} \mid y = {L}_{j} \right) p \left( y = {L}_{j} \right) } && \text{Expanding by the law of total probability} \\
& = \frac{ p \left( \boldsymbol{x} \mid y = {L}_{i} \right) p \left( y = {L}_{i} \right) }{ p \left( \boldsymbol{x} \mid y = {L}_{k} \right) p \left( y = {L}_{k} \right) + \sum_{j \neq k} p \left( \boldsymbol{x} \mid y = {L}_{j} \right) p \left( y = {L}_{j} \right) } && \text{} \\
& = \frac{ \frac{p \left( \boldsymbol{x} \mid y = {L}_{i} \right) p \left( y = {L}_{i} \right)}{p \left( \boldsymbol{x} \mid y = {L}_{k} \right) p \left( y = {L}_{k} \right)} }{ 1 + \sum_{j \neq k} \frac{p \left( \boldsymbol{x} \mid y = {L}_{j} \right) p \left( y = {L}_{j} \right)}{p \left( \boldsymbol{x} \mid y = {L}_{k} \right) p \left( y = {L}_{k} \right)} } && \text{Dividing by $p \left( \boldsymbol{x} \mid y = {L}_{k} \right) p \left( y = {L}_{k} \right)$} \\
\end{aligned}
$$

As above, we may model each log likelihood ratio (relative to the class ${L}_{k}$) by a linear function of $\boldsymbol{x}$, which gives:

$$ p \left( y = {L}_{i} \mid \boldsymbol{x} \right) = \frac{ \exp{\left( \boldsymbol{w}_{i}^{T} \boldsymbol{x} \right)} }{ 1 + \sum_{j \neq k} \exp{\left( \boldsymbol{w}_{j}^{T} \boldsymbol{x} \right)}} $$

Since $1 = \exp{ \left( \log{ \frac{p \left( \boldsymbol{x} \mid y = {L}_{k} \right) p \left( y = {L}_{k} \right)}{p \left( \boldsymbol{x} \mid y = {L}_{k} \right) p \left( y = {L}_{k} \right)} } \right)}$, namely the reference class ${L}_{k}$ has $\boldsymbol{w}_{k} = \boldsymbol{0}$, we can write:

$$ p \left( y = {L}_{i} \mid \boldsymbol{x} \right) = \frac{ \exp{\left( \boldsymbol{w}_{i}^{T} \boldsymbol{x} \right)} }{ \sum_{j} \exp{\left( \boldsymbol{w}_{j}^{T} \boldsymbol{x} \right)}} $$
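
The Softmax form can be sketched in a few lines of plain Python. The helper `softmax` and the scores below are hypothetical stand-ins for the values $\boldsymbol{w}_{j}^{T} \boldsymbol{x}$. Subtracting the maximum score before exponentiating is a common numerical safeguard. It is valid because shifting all the scores by the same constant leaves the result unchanged, which is also why one class can be given a score of zero.

```python
import math


def softmax(scores):
    """Softmax of a list of scores, shifted by the maximum for numerical stability."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical scores for 3 classes; the last class plays the role of the reference class.
scores = [2.0, -1.0, 0.5]
probs = softmax(scores)

# Shift the scores so the reference class has a score of exactly zero.
shifted = [s - scores[-1] for s in scores]
probs_shifted = softmax(shifted)

print("probabilities:        ", [round(p, 6) for p in probs])
print("after shifting scores:", [round(p, 6) for p in probs_shifted])
```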

### Derivation 004

Given $L$ classes, we can choose a reference class: ${L}_{k}$. Then define a linear model for the log ratio of the posterior probabilities relative to it:

$$ \log{ \left( \frac{ p \left( y = {L}_{i} \mid \boldsymbol{x} \right) }{ p \left( {y} = {L}_{k} \mid \boldsymbol{x} \right) } \right) } = \boldsymbol{w}_{i}^{T} \boldsymbol{x} + {b}_{i} $$

By definition $p \left( y = {L}_{i} \mid \boldsymbol{x} \right) = p \left( y = {L}_{k} \mid \boldsymbol{x} \right) \exp{ \left( \boldsymbol{w}_{i}^{T} \boldsymbol{x} + {b}_{i} \right) }$

Then:

$$
\begin{aligned}
1 - p \left( y = {L}_{k} \mid \boldsymbol{x} \right) & = \sum_{j \neq k} p \left( y = {L}_{j} \mid \boldsymbol{x} \right) && \text{} \\
& = \sum_{j \neq k} p \left( y = {L}_{k} \mid \boldsymbol{x} \right) \exp{ \left( \boldsymbol{w}_{j}^{T} \boldsymbol{x} + {b}_{j} \right) } && \text{Since $p \left( y = {L}_{i} \mid \boldsymbol{x} \right) = p \left( y = {L}_{k} \mid \boldsymbol{x} \right) \exp{ \left( \boldsymbol{w}_{i}^{T} \boldsymbol{x} + {b}_{i} \right) }$} \\
& = p \left( y = {L}_{k} \mid \boldsymbol{x} \right) \sum_{j \neq k} \exp{ \left( \boldsymbol{w}_{j}^{T} \boldsymbol{x} + {b}_{j} \right) } && \text{} \\
& \Rightarrow p \left( y = {L}_{k} \mid \boldsymbol{x} \right) = \frac{1}{1 + \sum_{j \neq k} \exp{ \left( \boldsymbol{w}_{j}^{T} \boldsymbol{x} + {b}_{j} \right) }} \\
& \Rightarrow p \left( y = {L}_{i} \mid \boldsymbol{x} \right) = \frac{ \exp{ \left( \boldsymbol{w}_{i}^{T} \boldsymbol{x} + {b}_{i} \right) } }{1 + \sum_{j \neq k} \exp{ \left( \boldsymbol{w}_{j}^{T} \boldsymbol{x} + {b}_{j} \right) }} && \text{}
\end{aligned}
$$

Since $1 = \exp{\left( \log{ \frac{ p \left( y = {L}_{k} \mid \boldsymbol{x} \right) }{ p \left( y = {L}_{k} \mid \boldsymbol{x} \right) } } \right)}$, namely $\boldsymbol{w}_{k} = \boldsymbol{0}$ and ${b}_{k} = 0$ for the reference class, we can write:

$$
\begin{aligned}
p \left( y = {L}_{i} \mid \boldsymbol{x} \right) & = \frac{ \exp{ \left( \boldsymbol{w}_{i}^{T} \boldsymbol{x} + {b}_{i} \right) } }{1 + \sum_{j \neq k} \exp{ \left( \boldsymbol{w}_{j}^{T} \boldsymbol{x} + {b}_{j} \right) }} \\
& = \frac{ \exp{ \left( \boldsymbol{w}_{i}^{T} \boldsymbol{x} + {b}_{i} \right) } }{\exp{ \left( \boldsymbol{w}_{k}^{T} \boldsymbol{x} + {b}_{k} \right) } + \sum_{j \neq k} \exp{ \left( \boldsymbol{w}_{j}^{T} \boldsymbol{x} + {b}_{j} \right) }} \\
& = \frac{ \exp{ \left( \boldsymbol{w}_{i}^{T} \boldsymbol{x} + {b}_{i} \right) } }{ \sum_{j} \exp{ \left( \boldsymbol{w}_{j}^{T} \boldsymbol{x} + {b}_{j} \right) }}
\end{aligned}
$$
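
The reference class construction can be traced numerically. In the sketch below the weights, biases, input and names (`weights`, `biases`, `ref`, `class_probs`) are hypothetical and chosen only for illustration. It builds the probabilities from the $1 + \sum_{j \neq k}$ form and then checks that they sum to one and that the log ratio to the reference class recovers the linear score.

```python
import math


def dot(u, v):
    """Inner product of two equally sized lists."""
    return sum(a * c for a, c in zip(u, v))


# Hypothetical parameters for 3 classes; class 2 is the reference class L_k.
weights = {0: [0.5, -1.0], 1: [-0.3, 0.8]}
biases = {0: 0.1, 1: -0.4}
ref = 2


def class_probs(x):
    """Class probabilities from the reference class form of the model."""
    scores = {i: dot(weights[i], x) + biases[i] for i in weights}
    denom = 1.0 + sum(math.exp(s) for s in scores.values())
    probs = {i: math.exp(s) / denom for i, s in scores.items()}
    probs[ref] = 1.0 / denom  # the reference class has a score of zero
    return probs, scores


x = [1.0, -2.0]
probs, scores = class_probs(x)

print("sum of probabilities:", sum(probs.values()))
for i, s in scores.items():
    print(f"class {i}: linear score {s:.4f}, log ratio {math.log(probs[i] / probs[ref]):.4f}")
```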

### Summary

While there are many ways to derive the logistic regression (for instance, also by assuming a Binomial Distribution), the main motivation is its numerical properties.  
Namely, being convex with an easy to calculate gradient.
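
To make the "easy to calculate gradient" concrete: for the mean cross-entropy loss of the binary model, the gradient with respect to the weights is the average of $\left( \sigma \left( z \right) - y \right) \boldsymbol{x}$ over the samples. The sketch below uses hypothetical toy data and names, and compares this expression to a finite difference approximation as a sanity check.

```python
import math


def loss_and_grad(w, b, X, y):
    """Mean cross-entropy of a logistic model and its gradient w.r.t. (w, b)."""
    n = len(X)
    loss, grad_w, grad_b = 0.0, [0.0] * len(w), 0.0
    for xi, yi in zip(X, y):
        z = sum(wj * xj for wj, xj in zip(w, xi)) + b
        # Stable log(1 + exp(z)) - y * z, the per-sample negative log-likelihood.
        loss += max(z, 0.0) + math.log1p(math.exp(-abs(z))) - yi * z
        err = 1.0 / (1.0 + math.exp(-z)) - yi  # sigma(z) - y
        for j, xj in enumerate(xi):
            grad_w[j] += err * xj
        grad_b += err
    return loss / n, [g / n for g in grad_w], grad_b / n


# Hypothetical toy data: 4 samples, 2 features, labels in {0, 1}.
X = [[0.5, 1.2], [-1.0, 0.3], [1.5, -0.7], [-0.2, -1.1]]
y = [1, 0, 1, 0]
w, b = [0.3, -0.2], 0.1

loss, grad_w, grad_b = loss_and_grad(w, b, X, y)

# Central finite differences as an independent check of the analytic gradient.
eps = 1e-6
num_grad_w = []
for j in range(len(w)):
    w_plus, w_minus = list(w), list(w)
    w_plus[j] += eps
    w_minus[j] -= eps
    diff = loss_and_grad(w_plus, b, X, y)[0] - loss_and_grad(w_minus, b, X, y)[0]
    num_grad_w.append(diff / (2.0 * eps))
num_grad_b = (loss_and_grad(w, b + eps, X, y)[0] - loss_and_grad(w, b - eps, X, y)[0]) / (2.0 * eps)

print("analytic gradient:", grad_w, grad_b)
print("numeric gradient: ", num_grad_w, num_grad_b)
```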

* <font color='brown'>(**#**)</font> The first "Deep Learning" models were actually chaining many logistic regression layers.
* <font color='brown'>(**#**)</font> Most classification layers in Deep Learning models are basically Logistic Regression.
* <font color='brown'>(**#**)</font> The concept of Logistic Regression can also be used as pure regression for continuous data bounded in the range $\left[ a, b \right]$.

## DEMO

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix

# Generate a synthetic dataset
X, y = make_classification(n_samples=1000, n_features=2, n_redundant=0, n_clusters_per_class=1, flip_y=0.01, random_state=42)

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Create and train the logistic regression model
model = LogisticRegression(random_state=42)
model.fit(X_train, y_train)

# Predict on the testing set
y_pred = model.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)

print(f"Accuracy: {accuracy*100:.2f}%")
print("Confusion Matrix:")
print(conf_matrix)

# Optional: Plot decision boundary (requires additional code)
```

```
Accuracy: 88.33%
Confusion Matrix:
[[137  10]
 [ 25 128]]
```
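
The confusion matrix printed above carries more than the accuracy. As a small sketch, assuming the scikit-learn convention where rows are the true classes and columns are the predicted classes, the numbers can be unpacked by hand. The matrix is copied from the output and the variable names are chosen for illustration.

```python
# Confusion matrix copied from the output above
# (rows: true class, columns: predicted class).
conf_matrix = [[137, 10],
               [25, 128]]

tn, fp = conf_matrix[0]  # true class 0: predicted as 0, predicted as 1
fn, tp = conf_matrix[1]  # true class 1: predicted as 0, predicted as 1

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)  # among samples predicted as class 1, the share which is truly class 1
recall = tp / (tp + fn)     # among samples of class 1, the share which was detected

print(f"Accuracy:  {accuracy * 100:.2f}%")
print(f"Precision: {precision * 100:.2f}%")
print(f"Recall:    {recall * 100:.2f}%")
```

The accuracy recomputed this way agrees with the value reported by `accuracy_score`.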


## Cross-Entropy Loss

In this snippet, `predict_proba` is used instead of `predict` to obtain the probabilities of the class labels. The `log_loss` function from `sklearn.metrics` then computes the cross-entropy loss of these predictions against the actual labels. A lower cross-entropy loss indicates a model which assigns higher probability to the correct class labels.

```python
from sklearn.metrics import log_loss

# Predict probabilities for the test set
y_pred_probs = model.predict_proba(X_test)

# Compute the cross-entropy loss
loss = log_loss(y_test, y_pred_probs)

print(f"Cross-Entropy Loss: {loss:.4f}")
```

```
Cross-Entropy Loss: 0.2992
```
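
For intuition about what the cross-entropy measures, here is a minimal sketch of the binary case in plain Python. The helper `cross_entropy`, the clipping constant `eps` and the toy probabilities are assumptions made for this illustration. It is not a reproduction of the scikit-learn implementation.

```python
import math


def cross_entropy(y_true, p_pos, eps=1e-15):
    """Mean binary cross-entropy; p_pos[i] is the predicted probability of class 1."""
    total = 0.0
    for y, p in zip(y_true, p_pos):
        p = min(max(p, eps), 1.0 - eps)  # keep log() away from 0 (assumed safeguard)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(y_true)


y_true = [1, 0]
print("confident and correct:", cross_entropy(y_true, [0.9, 0.1]))
print("hesitant but correct: ", cross_entropy(y_true, [0.6, 0.4]))
print("confident and wrong:  ", cross_entropy(y_true, [0.1, 0.9]))
```

The loss rewards probability mass on the correct label, so two models with the same accuracy can have quite different cross-entropy values.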
