# The logistic regression

## Introduction

The Logistic regression applies to cases where:

* $Y$ is a random qualitative variable with 2 categories (a binary variable by convention, $Y = 0$ if the event does not occur, and $Y = 1$ if it does),
* $X_1,\ldots,X_k$ are non-random qualitative or quantitative variables ($K$ explanatory variables in total).

* $(Y, X_1,\ldots,X_k)$ represent the population variables, from which a sample of $n$ individuals $(i)$ is drawn, and $(y, x_i)$ is the vector of observed realizations of $(Y_i, X_i)$ for each individual in the sample.

Unlike simple linear regression, logistic regression estimates **the probability** of an event occurring, rather than predicting a specific numerical value.

## The model

The variable $Y_i$ follow a Bernoulli distribution with parameter $p_i$ representing the probability that $Y_i=1$.    

$$Y_i \sim B(p_i)$$


$$P(Y_i=1) = p_i \quad, \quad P(Y_i = 0) = 1 - p_i$$

which is equivalent to: 

$$P(Y_i = k) = {p_i}^k(1 - p_i)^{1-k} \quad \text{for k} \in \{0, 1\}$$

## The linear LOGIT model

To ensure that the expected value of $Y, E(Y)$, only takes values between 0 and 1, we use the logistic function:  

$$f(x) = \dfrac{\text{exp(x)}}{1 + \text{exp(x)}} = p$$

or similarly:  

$$f(x) = \dfrac{1}{1 + \text{exp(-x)}} = p$$

This guarantees that $0 < f(x) < 1$, so $E[Y]$ can represent a valid probability.  

The logit function is used to transform a probability $p$ into an **unrestricted real value**:

$\quad \text{Notations:} \quad X = (1,X_1, \ldots, X_k) \quad \text{and} \quad \beta = (\beta_0,\beta_1, \ldots, \beta_k)$

$$\text{logit}(p) = \text{log}(\dfrac{p}{1 - p})$$

$$\text{logit}(p) = \beta .X$$

$$\text{logit}(p) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \ldots + \beta_k X_k$$

$$\text{log}\left( \dfrac{p}{1-p} \right) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \ldots + \beta_k X_k$$

$$p = \frac{1}{1 + \exp(-\beta .X)}$$

Demonstration: 

$$p(x) = \dfrac{1}{1 + \exp(-\beta x)}$$

$$\underset{inverse}   \iff \dfrac{1}{p} = 1 + \exp(-\beta x)$$

$$\iff \dfrac{1}{p} - 1 = \exp(-\beta x)$$

$$\iff \dfrac{1}{p} - \dfrac{p}{p} = \exp(-\beta x)$$

$$\iff \dfrac{1-p}{p} = \exp(-\beta x)$$

$$\iff \log(\dfrac{1-p}{p}) = -\beta x$$

$$\iff \log(\dfrac{p}{1-p}) = \beta x$$

To simplify the writing we have put $p$ rather than $p(x)$

## Key Assumptions for Generalizability of the logit model

* **Linearity of Log-Odds:** The relationship between each continuous predictor and the log-odds of $Y=1$ is linear. If this assumption is violated (e.g., non-linear effects), the interpretation of $\beta_1$​ may not hold.  
* **No Multicollinearity:** Predictors should not be highly correlated, as this can distort the interpretation of individual coefficients.  
* **Additivity:** The effect of each predictor on the log-odds is additive. There should be no significant interaction effects unless explicitly modeled.  
* **Independence of Observations:** The model assumes that observations are independent of each other.

## Coefficients interpretation

**The Odds**

The odds are defined by:  

$$\text{Odds} = \dfrac{p}{1-p}$$


$\text{Where} \quad p = P(target=1|X)$

>_If a student has a 3 in 4 chance of passing and a 1 in 4 chance of failing, their odds are '3 to 1':_ $\text{Odds} = \dfrac{3/4}{1/4}=3$  

* **Notation:**
$$\text{Odds}(Y=1|X=0)=\dfrac{P(Y=1|X=0)}{1-P(Y=1|X=0)}$$

### **The Odds Ratio**

The odds ratio comparing the **probability of $target=1$** between individuals with value $X$ and those without it.

$$\text{Odds Ratio} = \dfrac{\text{Odds}(Y=1|X=1)}{\text{Odds}(Y=1|X=0)}$$

$$\text{Odds Ratio} = \dfrac{P(Y_i=1 | X=1)}{1 - P(Y_i=1 | X=1)} / \dfrac{P(Y_i=1 | X=0)}{1 - P(Y_i=1 | X=0)}$$

We know that logit is given by:

$$\text{logit}(p) = \text{log}(\dfrac{p}{1-p}) = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \ldots + \beta_k x_{ik}$$

### **Interpreting the Intercept**  

The intercept $\beta_0$​ represents the **log-odds of the outcome $Y=1$ when all predictors are equal to zero**.  
$\beta_0$​ defines the **baseline probability** of the outcome when all predictors are zero.  

⚠️ **Caveat**:  
This interpretation of $\beta_0$ is often not meaningful if some predictors cannot logically be zero (e.g., age=0, blood pressure). In such cases, $\beta_0$​ is primarily a mathematical component of the model and is rarely interpreted in isolation.  
  
  
* **Odds for the baseline group:**  

$$\text{Odds}(Y=1∣X_1=0)=\exp⁡(\beta_0)$$

* **Probability for the baseline group:**
$$P(Y=1∣X_1=0)=\dfrac{\exp⁡(\beta_0)}{1 + \exp(\beta_0)}$$


>If $X_1$​ is "smoking status" ($0$ = non-smoker, $1$ = smoker), then 
>* $\beta_0$​ gives the **log-odds** of the outcome for non-smokers  
>* $\exp(\beta_0)$ gives the **odds** of the outcome for non-smokers.  
>
>If $\beta_0 = -1$, then: $$\exp(\beta_0) = \exp(-1) \approx 0.37$$
>
>$$P(Y=1∣X_1=0)= \dfrac{0.37}{1 + 0.37} \approx 0.27$$
>
>27% of non-smokers are predicted to have the outcome (e.g., lung cancer), assuming no other predictors.  
>It is the observed proportion of lung cancer for non-smokers.  

### **Interpreting the Slope**

In a model with multiple predictors, each $\beta_i$​ (and its corresponding odds ratio $\exp(\beta_i)$ represents the effect of that predictor on the log-odds of $Y=1$, holding all other predictors constant.  
This is the key assumption of multivariable regression: ceteris paribus (all else being equal).

The coefficient $\beta_1$​ represents the **change in the log-odds of** $Y=1$ for a **one-unit change** in $X_1$​. The odds ratio $\exp(\beta_1)$​ quantifies how the odds of $Y=1$ change with $X_1$​.

**General Formula for Odds Ratio**

For any type of predictor $X_1$, the odds ratio for a one-unit increase is:  
$$\text{Odds Ratio} = \frac{\text{Odds}(Y=1 | X_1 = x+1)}{\text{Odds}(Y=1 | X_1 = x)} = \exp(\beta_1)$$

📌 **Note:**  
$\exp(\beta_1​)$ compares the odds of $Y=1$ between $X_1=1$ and $X_1=0$, controlling for all other variables in the model (all others features constant).

* Case: $X_1$ is Binary

For a binary predictor $X_1$​ (e.g., $0$ = non-smoker, $1$ = smoker), the odds ratio $\exp(\beta_1)$​ compares the odds of $Y=1$ between the two groups.

* **Logistic regression equation:**

$$\log\left(\dfrac{P(Y=1 | X_1)}{1 - P(Y=1 | X_1)}\right) = \beta_0 + \beta_1 1_{\{X_1 = 1\}}​$$

* **Odds ratio:**
$$\text{Odds Ratio} = \dfrac{P(Y=1 | X_1=1)}{1 - P(Y=1 | X_1=1)} / \dfrac{P(Y=1 | X_1=0)}{1 - P(Y=1 | X_1=0)} = \exp(\beta_1)$$

**Interpretation:**

* If $\exp(\beta_1) = 1$: No effect of the feature $X_1$​ on the odds of $Y=1$.
* If $\exp(\beta_1)>1$: The odds of $Y=1$ are higher when $X_1​=1$. The feature $X_1$​ is **positively associated** with the outcome.
* If $\exp(\beta_1) < 1$: The odds of $Y=1$ are lower when $X_1​=1$. The feature $X_1$​ is **negatively associated** with the outcome.

>**Example:**  
>If $\beta_1 = 0.7 \rightarrow \exp(\beta_1) \approx 2.01$. The odds of lung cancer for smokers $(X_1=1)$ are twice as high as for non-smokers $(X_1=0)$.

* Case: $X_1$ is Categorical

For a categorical predictor $X_1$ with more than two levels (e.g., color = red, green, blue), you use **dummy variables**. 

* **The logistic regression model becomes:**

$$\text{log}\left( \dfrac{P(Y=1)}{1 - P(Y=1)} \right) = \beta_0 + \beta_{green}1_{\{X_1 = \text{green}\}} + \beta_{blue}1_{\{X_1 = \text{blue}\}}$$

* **Reference Category ("red"):** When $1_{\{X_1 = \text{green}\}}=0$ and $1_{\{X_1 = \text{blue}\}}=0$, the log-odds are:

$$\text{log}\left( \dfrac{P(Y=1)}{1 - P(Y=1)} \right) = \beta_0$$

**Interpretation:** This means $\beta_0$​ represents the log-odds of $Y=1$ for the reference category ("red").

* **Category ("green"):** When $1_{\{X_1 = \text{green}\}}=1$ and $1_{\{X_1 = \text{blue}\}}=0$, the log-odds are:

$$\text{log}\left( \dfrac{P(Y=1)}{1 - P(Y=1)} \right) = \beta_0 + \beta_{green}$$

* **Category ("blue"):** When $1_{\{X_1 = \text{green}\}}=0$ and $1_{\{X_1 = \text{blue}\}}=1$, the log-odds are:

$$\text{log}\left( \dfrac{P(Y=1)}{1 - P(Y=1)} \right) = \beta_0 + \beta_{blue}$$

The **odds ratio for "blue" relative to the reference "red"** is:
$$\exp(\beta_{blue}) = \dfrac{\text{Odds}(Y=1 | \text{blue})}{\text{Odds}(Y=1 | \text{red})}$$

The same way, $\exp(\beta_{\text{green}})$​ compares the odds for "green" vs. the "red" reference.

>**Interpretation:**
>
>If $\exp(\beta_{\text{green}})​=1.5$, the odds of $Y=1$ are $1.5$ times higher for "green" compared to "red".

* ##### Case: $X_1$ is Quantitative

For a continuous predictor $X_1$ (e.g., age, blood pressure), the odds ratio $\exp(\beta_1)$​ represents the multiplicative change in the odds of $Y=1$ for a one-unit increase in $X_1$​.  

* **Logistic regression equation:**

$$\log\left(\dfrac{P(Y=1 | X_1)}{1 - P(Y=1 | X_1)}\right) = \beta_0 + \beta_1 X_1​$$

* **Odds ratio for a one-unit increase:**

$$\text{Odds Ratio} = \frac{\text{Odds}(Y=1 | X_1 = x+1)}{\text{Odds}(Y=1 | X_1 = x)} = \exp(\beta_1)​$$

**In short**: $\beta_1$​ captures the **constant log-odds** change per unit increase in $X_1$, so $\exp(\beta_1)$​ is the **odds ratio** for that one-unit change.

This holds regardless of the starting value of $X_1$​ because the model assumes a constant multiplicative effect on the odds (a key assumption of logistic regression).


>**Interpretation:**
>
>* If $\beta_1=0.095 \rightarrow \exp(\beta_1)=1.1$, the odds of $Y=1$ increase by 10% for each one-unit increase in $X_1$​.
>* If $X_1$​ is "years of smoking" and $\beta_1 = 0.7 \rightarrow  \exp(\beta_1) \approx 2.01$. For each additional year of smoking, the odds of lung cancer double.
>.

**Summary**

| Type of $X_1$         | Interpretation of $\exp(\beta_1)$                                       |
|-----------------------|-------------------------------------------------------------------------|
| Binary                | Compares odds of $Y=1$ between $X_1=1$ and $X_1=0$                      |
| Categorical           | Compares odds of $Y=1$ for a given category relative to the reference.  |
| Quantitative          | Multiplicative change in odds of $Y=1$ for a one-unit increase in $X_1$ |


In [None]:
import os
import numpy as np
import pandas as pd
import warnings
warnings.filterwarnings("ignore")

In [39]:
# Load the data

rep = '/Users/davidtbo/Library/Mobile Documents/com~apple~CloudDocs/data/external'

filename = os.path.join(rep, 'diabetes.csv')

df = pd.read_csv(filename)

In [42]:
df.columns = [str.lower(col) for col in df.columns.str.strip()]
df.head()

Unnamed: 0,pregnancies,glucose,bloodpressure,skinthickness,insulin,bmi,diabetespedigreefunction,age,outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1


In [43]:
# Data preparation
features = df.drop('outcome', axis=1)

In [50]:
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

# separate the features of the target
features = df.drop('outcome', axis=1)
target = df['outcome']

# Data standardization
scaler = StandardScaler()
scaler.fit(features)
standardized_data = scaler.transform(features)

# Reassign the names of the columns of origin
standardized_df = pd.DataFrame(standardized_data, columns=features.columns)

# Display the head of the standardized DataFrame
print(standardized_df.head().to_string())


   pregnancies   glucose  bloodpressure  skinthickness   insulin       bmi  diabetespedigreefunction       age
0     0.639947  0.848324       0.149641       0.907270 -0.692891  0.204013                  0.468492  1.425995
1    -0.844885 -1.123396      -0.160546       0.530902 -0.692891 -0.684422                 -0.365061 -0.190672
2     1.233880  1.943724      -0.263941      -1.288212 -0.692891 -1.103255                  0.604397 -0.105584
3    -0.844885 -0.998208      -0.160546       0.154533  0.123302 -0.494043                 -0.920763 -1.041549
4    -1.141852  0.504055      -1.504687       0.907270  0.765836  1.409746                  5.484909 -0.020496


In [51]:
# train_test_split
X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.2, random_state=2)

### Logistic Regression with scipy

In [52]:
import statsmodels.api as sm

# Add a constant for statsmodels
X_train_sm = sm.add_constant(X_train)

# Logistic Regression with L2 regularization
logit_model = sm.Logit(y_train, X_train_sm)
result = logit_model.fit_regularized(method='l1', alpha=0.1)

# Display the summary
print(result.summary())


Optimization terminated successfully    (Exit mode 0)
            Current function value: 0.46842232325133
            Iterations: 79
            Function evaluations: 95
            Gradient evaluations: 79
                           Logit Regression Results                           
Dep. Variable:                outcome   No. Observations:                  614
Model:                          Logit   Df Residuals:                      605
Method:                           MLE   Df Model:                            8
Date:                Sun, 05 Oct 2025   Pseudo R-squ.:                  0.2875
Time:                        19:24:54   Log-Likelihood:                -286.64
converged:                       True   LL-Null:                       -402.31
Covariance Type:            nonrobust   LLR p-value:                 1.532e-45
                               coef    std err          z      P>|z|      [0.025      0.975]
-------------------------------------------------------------------

In [53]:
odds_ratios = np.exp(result.params)
print(odds_ratios)

const                       0.000218
pregnancies                 1.174037
glucose                     1.038217
bloodpressure               0.986439
skinthickness               1.003852
insulin                     0.998607
bmi                         1.089609
diabetespedigreefunction    2.732962
age                         1.006341
dtype: float64


**Interpreting the coefficients:**

For each feature, the exponentiated coefficient (exp(coef)) represents the change in odds for a one-unit increase in that feature, holding all other features constant.

For example, has the coefficient for 'pregnancies' is 0.1604, the odds ratio is exp(0.1604) ≈ 1.174037.  
This means that for each one-unit increase in pregnancies, the odds of the outcome occurring (e.g., having diabetes) increase by approximately 17.4%, assuming all other features remain constant.


# APPENDIX

### Interpreting the coefficients

**Demonstration:** The coefficient $\beta_1$​ represents the **change in the log-odds of** $Y=1$ for a **one-unit change** in $X_1$​ quantitative feature.  

Notations:  
$$\text{Odds}(Y=1|X=x+1)=P(Y=1|X=x+1) / (1-P(Y=1|X=x+1))$$

$$\text{Odds}(Y=1|X=x)=P(Y=1|X=x) / (1-P(Y=1|X=x))$$

We know that:  

$$\text{log(Odds)}(Y=1|X=x+1)=\beta_0 + \beta_1 \times (x+1)$$

$$\text{log(Odds)}(Y=1|X=x)=\beta_0 + \beta_1 \times x$$

By difference:

$$\text{log(Odds)}(Y=1|X=x+1) - \text{log(Odds)}(Y=1|X=x) =\beta_0 + \beta_1 \times (x+1) - (\beta_0 + \beta_1 \times x) = \beta_1$$

$$\text{log}\left(\dfrac{\text{Odds}(Y=1|X=x+1)}{\text{Odds}(Y=1|X=x)}\right) =\beta_1$$

**CQFD**

Note:  

$$\dfrac{\text{Odds}(Y=1|X=x+1)}{\text{Odds}(Y=1|X=x)} = \exp(\beta_1)$$

### Model formulation

The prediction $y_i=1$ of the logistic regression is defined:

$$\hat{y_i} = P(y_i=1 | x_i; \theta) = \frac{1}{1 + \exp(-\theta^Tx_i)} = h_{\theta}(x_i)$$

* If $y_i=1$, then $P(y_i|x_i; \theta)=P(y_i=1|x_i; \theta)$
* If $y_i=0$, then $P(y_i|x_i; \theta)=P(y_i=0|x_i; \theta) = 1 - P(y_i=1|x_i; \theta)$

We can write these two equations into a single one:  

$$P(y_i|x_i; \theta)=P(y_i=1|x_i; \theta)^{y_i}\times (1 - P(y_i=1|x_i; \theta))^{1-y_i}$$

With the notations:

$$P(y_i|x_i; \theta)=h_{\theta}(x_i)^{y_i}\times (1 - h_{\theta}(x_i))^{1-y_i}$$


### Likelihood function

The **likelihood** of the observations $y_i$ given the inputs $x_i$ and parameters $\theta$ is defined as:  

$$L(\theta) = \prod_{i=1}^n P(y_i|x_i;\theta) = \prod_{i=1}^n (h_{\theta}(x_i))^{y_i} (1 -h_{\theta}(x_i))^{1-y_i}$$

where the prediction of the logistic regression is defined:

$$h_{\theta}(x_i) = P(y_i=1 | x_i; \theta) = \frac{1}{1 + \exp(-\theta^Tx_i)}$$

The **log-likelihood** is defined as:  

$$l(\theta) = \log(L(\theta))=\sum_{i=1}^n[y_i \log (h_{\theta}(x_i)) + (1 - y_i) \log(1 - h_{\theta}(x_i))]$$

### Objective of Logistic Regession

The goal of learning a **logistic regression model** is to **minimize the cost function** by adjusting the parameters $\theta$.The cost function measures the average prediction error across all $n$ training samples.

### Cost function (general definition)

The cost function $J(\theta)$ is defined as the average penalty for prediction errors across the training set. Mathematically, it is expressed as:

$$J(θ) = -\frac{1}{n} \sum cost(h_\theta(x_i), y_i)$$

where:

* $h_\theta(x_i)$ the model's prediction for sample $x_i$
* $y_i$ the observed (true) value.

Alternatively, it can be written as:

$$J(θ) = \frac{1}{n} \sum cost(\hat{y_i}, y_i)$$

Where:

* $\hat{y_i} = h_{\theta}(x_i)$

### Cost function **log loss**

For logistic regression, the cost function is called **log loss** (or logistic loss).  
**Log loss** is derived from the log-likelihood $l(\theta)$ and is defined as:

$$J(\theta) = -\dfrac{1}{n}l(\theta) = -\frac{1}{n} \sum(y_i \log(h_\theta(x_i)) + (1 - y_i) \log(1 - h_\theta(x_i)))$$


#### How Log-Loss penalizes prediction errors

The log-loss function penalizes prediction errors based on the estimated probability from the model. It assigns higher penalties when predictions are far from the true labels—specifically:

- If the label $y_i = 1$ (positive class), the penalty is $-log(h_{\theta}(x_i))$. The closer $h_{\theta}(x_i)$ is to 0 (far from the true label), the higher the penalty ($-\log(0^+) \approx + \infty$).  

- If the label $y_i = 0$ (negative class), the penalty is $-log(1-h_{\theta}(x_i))$, the closer $h_{\theta}(x_i)$ is to 1 (far from the true label), the higher the penalty ($-\log(1 -1^+) \approx + \infty$).  

The two cases are combined into a single formula for observation $i$:
$$y_i log(h_\theta(x_i)) + (1 - y_i) log(1 - h_\theta(x_i))$$


**Key insights:**

* **Log loss** evaluates how well the model fits the training data.
* The log loss is the negative average the log likelihood.
* **Higher likelihood** leads to **lower the log loss** (since $J(\theta) = -\frac{1}{n}l(\theta)$).
* The log-loss function heavily penalizes confident wrong predictions.

### Optimizing the parameters

By minimizing the cost function $J(\theta)$, we aim to find the parameters $\theta$ that maximize the likelihood of observing the training data given the model parameters.  

To achieve this, we use an iterative optimization method, gradient descent, to find the values of $\theta$ that minimize the cost function over the training set:

$$\underset{\theta}{minimize}(J(\theta))$$

## Computation of the gradient of the cost function
$$J(\theta) = -\frac{1}{n} \sum(y_i \log(h_\theta(x_i)) + (1 - y_i) \log(1 - h_\theta(x_i)))$$

To compute the gradient $\nabla J(\theta)$ we start by transforming the expression of:

$$\log(h_\theta(x_i)) = \log\left(\frac{1}{1 + \exp(-\theta^Tx_i)}\right) = -\log(1 + \exp(-\theta^Tx_i))$$

And:  

$$\log(1 - h_\theta(x_i)) = \log\left(1 - \frac{1}{1 + \exp(-\theta^Tx_i)}\right)$$

$$\log(1 - h_\theta(x_i)) = \log\left(\frac{1 + \exp(-\theta^Tx_i) - 1}{1 + \exp(-\theta^Tx_i)}\right)$$

$$\log(1 - h_\theta(x_i)) = \log\left(\frac{\exp(-\theta^Tx_i)}{1 + \exp(-\theta^Tx_i)}\right)$$

$$\log(1 - h_\theta(x_i)) = \log(\exp(-\theta^Tx_i)) - \log({1 + \exp(-\theta^Tx_i)})$$

$$\log(1 - h_\theta(x_i)) = -\theta^Tx_i - \log({1 + \exp(-\theta^Tx_i)})$$

We integrate these modifications:  

$$J(\theta) = -\frac{1}{n} \sum[y_i (-\log(1 + \exp(-\theta^Tx_i))) + (1 - y_i) (-\theta^Tx_i - \log({1 + \exp(-\theta^Tx_i)}))]$$

$$J(\theta) = -\frac{1}{n} \sum[y_i (-\log(1 + \exp(-\theta^Tx_i))) + (1 - y_i) (-\theta^Tx_i - \log(1 + exp(-\theta^Tx_i)))]$$

$$J(\theta) = -\frac{1}{n} \sum[y_i (-\log(1 + \exp(-\theta^Tx_i))) -\theta^Tx_i - \log(1 + \exp(-\theta^Tx_i)) + y_i \theta^Tx_i  + y_i \log(1 + \exp(-\theta^Tx_i))]$$

$$J(\theta) = -\frac{1}{n} \sum[- y_i \log(1 + \exp(-\theta^Tx_i)) -\theta^Tx_i - \log({1 + \exp(-\theta^Tx_i)}) + y_i \theta^Tx_i  + y_i \log(1 + \exp(-\theta^Tx_i))]$$

$$J(\theta) = -\frac{1}{n} \sum[\cancel{- y_i \log(1 + \exp(-\theta^Tx_i))} -\theta^Tx_i - \log({1 + \exp(-\theta^Tx_i)}) + y_i \theta^Tx_i  + \cancel{y_i \log(1 + \exp(-\theta^Tx_i))}]$$

$$J(\theta) = -\frac{1}{n} \sum[-\theta^Tx_i - \log({1 + \exp(-\theta^Tx_i)}) + y_i \theta^Tx_i  ]$$

$$J(\theta) = -\frac{1}{n} \sum[y_i \theta^Tx_i  -\theta^Tx_i - \log({1 + \exp(-\theta^Tx_i)}) ]$$

with:

$$-\theta^Tx_i - \log({1 + \exp(-\theta^Tx_i)}) = - \log(\exp(\theta^T x_i)) - \log(1 + \exp(-\theta^Tx_i))$$

$$-\theta^Tx_i - \log({1 + \exp(-\theta^Tx_i)}) = -(\log(\exp(\theta^T x_i)) + \log(1 + \exp(-\theta^Tx_i)))$$

$$-\theta^Tx_i - \log({1 + \exp(-\theta^Tx_i)}) = -\log[\exp(\theta^T x_i)(1 + \exp(-\theta^Tx_i))]$$

$$-\theta^Tx_i - \log({1 + \exp(-\theta^Tx_i)}) = -\log(\exp(\theta^T x_i) + 1)$$

$$J(\theta) = -\frac{1}{n} \sum[y_i \theta^Tx_i  -\log(\exp(\theta^T x_i + 1)) ]$$

$$J(\theta) = -\frac{1}{n} \sum[y_i \theta^Tx_i  -\log(1 + \exp(\theta^T x_i)) ]$$

$$\frac{\partial}{\partial \theta_j}J(\theta) = -\frac{1}{n} \sum[y_i \frac{\partial}{\partial \theta_j} (\theta^Tx_i)  - \frac{\partial}{\partial \theta_j}\log(1 + \exp(\theta^T x_i)) ]$$

Knowing that:

$$\theta^Tx_i = \theta_1 {x_i}^{(1)} + \theta_2 {x_i}^{(2)} + \ldots + \theta_k {x_i}^{(k)}$$

$$\frac{\partial}{\partial \theta_j} (\theta^Tx_i) = x_i^{(j)}$$
$$\dfrac{\partial}{\partial \theta_j}\left(\log(1 + \exp(\theta^T x_i))\right) \underset{\log(u)^{'} = \dfrac{u^{'}}{u}} = \dfrac{\dfrac{\partial}{\partial \theta_j}(1 + \exp(\theta^T x_i))} {1 + \exp(\theta x_i)}$$

And:  

$$\dfrac{\partial}{\partial \theta_j}\left(\log(1 + \exp(\theta^T x_i))\right) \underset{\log(u)^{'} = \dfrac{u^{'}}{u}} = \dfrac{\dfrac{\partial}{\partial \theta_j}(1 + \exp(\theta^T x_i))} {1 + \exp(\theta x_i)}$$
$$\dfrac{\dfrac{\partial}{\partial \theta_j}(1 + \exp(\theta^T x_i))} {1 + \exp(\theta x_i)} = \dfrac{\dfrac{\partial}{\partial \theta_j}(\exp(\theta^T x_i))} {1 + \exp(\theta x_i)}$$
$$\dfrac{\dfrac{\partial}{\partial \theta_j}(\exp(\theta^T x_i))} {1 + \exp(\theta x_i)} \underset{\exp(u)^{'} = u^{'}\exp(u)} =  \dfrac{\dfrac{\partial}{\partial \theta_j}(\theta^T x_i) * (\exp(\theta^T x_i))} {1 + \exp(\theta x_i)}$$
$$\dfrac{\dfrac{\partial}{\partial \theta_j}(\theta^T x_i) * (\exp(\theta^T x_i))} {1 + \exp(\theta x_i)} = \frac{x_i^{(j)} * (\exp(\theta^T x_i))} {1 + \exp(\theta x_i)}$$
$$\dfrac{x_i^{(j)} * (\exp(\theta^T x_i))} {1 + \exp(\theta x_i)} = x_i^{(j)} * h_\theta(x_i)$$
$$\dfrac{\partial}{\partial \theta_j}J(\theta) = -\dfrac{1}{n} \sum[y_i x_i^{(j)}  - x_i^{(j)} h_\theta(x_i)]$$

$$-\dfrac{1}{n} \sum[y_i x_i^{(j)}  - x_i^{(j)} h_\theta(x_i)] = -\dfrac{1}{n} \sum[y_i - h_\theta(x_i) ] x_i^{(j)}$$

$$\frac{\partial}{\partial \theta_j}J(\theta) = \frac{1}{n} \sum[h_\theta(x_i) - y_i ] x_i^{(j)}$$

We know that the expression of the Gradient descent to update the weights is for the weight $\theta_j$ :  

$$\theta_j = \theta_j - \alpha \frac{\partial}{\partial \theta_j}J(\theta)$$

where $\alpha$ is the learning rate, we have:

$$\theta_j = \theta_j - \frac{\alpha}{n} \sum[h_\theta(x_i) - y_i ] x_i^{(j)}$$
## Convexity

Convexity is a crucial property in optimization, as it ensures that any local minimum is also a global minimum. This makes it easier to find the optimal solution using methods like gradient descent.  

The Log Loss function is convex due to its logarithmic form, which is always convex for positive values. While convexity guarantees that a stationary point (where the derivative is zero) is a global minimum, it does not guarantee the existence of such a point.  

To prove the existence of a minimum, the function must also be bounded below and attain this lower bound. For Log Loss, the function is bounded below by zero (since the logarithm of a positive number is always defined) and reaches this bound when predictions are perfectly accurate (i.e., the predicted probability for the correct class is 1).  

Combining convexity with the fact that Log Loss is bounded below and attains its lower bound, we conclude that the function reaches a global minimum when predictions are perfectly correct.
## The Newton-Raphson algorithm
The Newton-Raphson algorithm is used to find the coefficients of logistic regression by maximizing the likelihood function.  

Logistic regression is a regression model used to predict the probability of a binary event based on one or more predictive variables. In logistic regression, the likelihood function is convex and can be maximized using the Newton-Raphson algorithm.  

The Newton-Raphson algorithm is an iterative method for finding the maximum of a function using its first and second derivatives. For logistic regression, the likelihood function is given by:

$L(\theta | X, y) = \prod(P(yi | x_i, \theta)^{yi} (1 - P(y_i | x_i, \theta))^{(1 - y_i)})$

where: 
* $\theta$ is the vector of logistic regression coefficients, 
* $X$ is the matrix of predictive variables, 
* $y$ is the vector of binary response variables, and
* $P(y_i | x_i, \theta)$ is the predicted probability of the binary event for observation $i$.

To maximize the likelihood function, the Newton-Raphson algorithm updates the coefficient vector $\theta$ at each iteration using the following formula:


$\theta_{i+1} = \theta_i - H^{-1} . g$  

where  

- $H = \dfrac{\partial^2L}{\partial \theta \partial\theta'}$ is the Hessian matrix of the likelihood function.  

- $g = \dfrac{\partial L}{\partial \theta }$ is the gradient vector of the likelihood function, and,  

- $\theta_i$ is the coefficient vector at iteration $i$. 

The Hessian matrix and gradient vector of the likelihood function are computed using the partial derivatives of the likelihood with respect to the coefficients $\theta$.  

Thus, in logistic regression, the Newton-Raphson algorithm is used to estimate the coefficients by maximizing the likelihood function. This allows the prediction of binary event probabilities based on the predictive variables.  

**Note**: The iterations stop when the difference between two successive solution vectors becomes negligible.

# The Algorithm steps


Note: $\theta = (w,b)$   

with $h_\theta(x) = \frac{1}{1 + \exp(-w x + b)}$

## Training


- Initialize weights as zero
- Initialize bias as zero

## Given a data point

- Predict result by using $\hat{y} = \frac{1}{1 + \exp(-wx+b)}$
- Calculate the error
- Use Gradient descent to figure out new weights and bias values
- Repeat n times

## Testing

Given a data point:  
- Put the values from the data point into the equation $\hat{y} = \frac{1}{1 + \exp(-w+b)}$
- Choose the label based on the probability

Another way to compute the coefficients

### Logistic Regression from scratch

In [None]:
class Logistic_Regression:
    
    def __init__(self, learning_rate=0.01, n_iter=1000):
        '''Initiate the constructor
            INPUT:
                learning_rate: magnitude of the step
                n_iter: number of iterations
        '''
        self.learning_rate = learning_rate
        self.n_iter = n_iter

    def fit(self, X, y):
        '''Train the model
        INPUTS:
            X: the dataset of the features
            y: the target
        OUTPUTS:
            The model
        '''

        self.n_samples, self.n_features = X.shape
        
        # initialize the parameters:
        self.weights = np.zeros(self.n_features)
        self.bias = 0

        self.X = X
        self.y = y

        for _ in range(self.n_iter):
            return self.update_weights()

    def update_weights(self):
        '''Update of the weights with Gradient descent'''

        # we compute the prediction (the probability)
        y_pred = 1 / (1 + np.exp( - (np.dot(self.X, self.weights) + self.bias)))

        # update the weights:
        # w_j = w_j - (alpha / n) * S(p_hat - y_i)xij
        # b = b - (alpha / n) * S(p_hat - y_i)
        dw = (1 / self.n_samples) * np.dot(self.X.T, (y_pred - self.y))
        db = (1 / self.n_samples) * np.sum(y_pred - self.y)

        self.weights = self.weights - self.learning_rate*dw
        self.bias = self.bias - self.learning_rate*db

    def predict(self, X):
        y_pred = 1 / (1 + np.exp( - (X.dot(self.weights) + self.bias)))
        y_pred = np.where(y_pred > 0.5 , 1 , 0)
        return y_pred

    

### Testing the from scratch algorithm

In [None]:
classifier = Logistic_Regression(learning_rate=0.01, n_iter=1000)

In [None]:
classifier.fit(X_train, y_train)

In [None]:
# Accuracy

from sklearn.metrics import accuracy_score

In [None]:
# Accuracy on the training data
y_train_pred = classifier.predict(X_train)
training_data_accuracy = accuracy_score(y_train, y_train_pred)
training_data_accuracy

0.754071661237785

In [None]:
# Accuracy on the training data
y_test_pred = classifier.predict(X_test)
test_data_accuracy = accuracy_score(y_test, y_test_pred)
test_data_accuracy

0.7272727272727273

In [None]:
# Predictive system

input_data = (5, 166, 72, 19, 175, 25.8, 0.587, 51)

# to numpy array
input_data_array = np.asarray(input_data)

# Reshape
input_data_reshape = input_data_array.reshape(1, -1)

# Standardized the data
input_data_std = scaler.transform(input_data_reshape)

pred = classifier.predict(input_data_std)

if pred:
    print('The person is diabetic')
else:
    print('The person is not diabetic')

The person is diabetic


In [None]:
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn import datasets
import matplotlib.pyplot as plt

In [None]:
df = datasets.load_breast_cancer()
X, y = df.data, df.target

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y)

In [None]:
def accuracy(y_true, y_pred):
    accuracy = sum(y_true==y_pred)/len(y_true)
    return accuracy

In [None]:
clf = Logistic_Regression(learning_rate=0.0001)
clf.fit(X_train, y_train)

In [None]:
predictions = clf.predict(X_test)

In [None]:
predictions

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])

In [None]:
accuracy(y_test, predictions)

0.3986013986013986

The inputs:

$$X=\begin{pmatrix} x_{1,1} & \ldots & x_{1,k} \\ x_{2,1} & \ldots & x_{2,k} \\ \ldots & x_{i,j} & \ldots \\ x_{n,1} & \ldots & x_{n,k} \end{pmatrix}, w=\begin{pmatrix} w_1 \\ w_2 \\ \ldots \\ w_k \end{pmatrix}, b = \text{constant}$$

The linear model:

$$X.w + b = \begin{pmatrix} x_{1,1} & \ldots & x_{1,k} \\ x_{2,1} & \ldots & x_{2,k} \\ \ldots & x_{i,j} & \ldots \\ x_{n,1} & \ldots & x_{n,k} \end{pmatrix}.\begin{pmatrix} w_1 \\ w_2 \\ \ldots \\ w_k \end{pmatrix} + b = \begin{pmatrix} x_{1,1}w_1 + & \ldots & + x_{1,k}w_k + b \\ x_{2,1}w_1 + & \ldots & + x_{2,k}w_k + b \\ \ldots & \ldots &   \ldots \\ x_{n,1}w_1 + & \ldots & + x_{n,k}w_k + b \end{pmatrix}$$

The model prediction (output) is given by:

$$\text{sigmoid}(X.w+b) = \frac{1}{1 + \exp(-X.w+b)}= \hat{p} = h_\omega(X)$$

The updates of the weights and bias are given by:

$$\omega_j = \omega_j - \frac{\alpha}{n} \sum[h_\omega(x_i) - y_i ] x_{i,j}$$

$$b = b - \frac{\alpha}{n} \sum[h_\omega(x_i) - y_i ]$$

For $\omega$ using linear algebra formula:

$$\omega = X^t.(\hat{p} - y) = \begin{pmatrix} x_{1,1} & \ldots & x_{1,n} \\ x_{2,1} & \ldots & x_{2,n} \\ \ldots & x_{i,j} & \ldots \\ x_{k,1} & \ldots & x_{k,n} \end{pmatrix}.\begin{pmatrix} \hat{p_1} - y_1 \\ \hat{p_2} - y_2 \\ \ldots \\ \hat{p_n} - y_n \end{pmatrix}$$

For $b$ using linear algebra formula:

$$b = \sum(\hat{p} - y) = \sum\begin{pmatrix} \hat{p_1} - y_1 \\ \hat{p_2} - y_2 \\ \ldots \\ \hat{p_n} - y_n \end{pmatrix}$$

The weights and bias are given by:

$$\text{sigmoid}(X.w+b) = \frac{1}{1 + \exp(-X.w+b)}= \hat{p}$$

In [None]:
# The Logistic Regression from scratch

## Coefficients significativity

The Wald statistic allows to test the coefficients significativity $\hat{w_j}$. Wald statistic is given by::    

$(\frac{\hat{w_j}}{\sigma(\hat{w_j})})^2$  

Under $H_0 : \{\hat{w_j} = 0 \} \Longrightarrow \frac{\hat{w_j}}{\sigma(\hat{w_j})} $ ~ $\mathcal{N}(0, 1)$

The added-value of the variable $X_j$ is only real if the Wald statistic > 4 $(3.84 = 1.96^2)$

$Wald > 4$    

$\iff (\frac{\hat{w_j}}{\sigma(\hat{w_j})})^2 > 4$  

$\iff \frac{\hat{w_j}}{\sigma(\hat{w_j})} > 2$  

$\iff \hat{w_j} > 2\sigma(\hat{w_j}) $  

$\iff \hat{w_j} - 2\sigma(\hat{w_j}) > 0$  

$\iff \hat{w_j}$ se trouve à plus de 2 écarts-type de 0  

$\iff $ l'intervalle de confiance de $\hat{w_j}$ ne contient pas 0 à 95%  

CQFD

## Model quality mesure (Deviance)

Cf. S.Tufféry p.315

$n:$ number of observations  
$k:$ number of features

$L(\omega_k)$ Likelihood of the "modèle ajusté"  

$L(\omega_0)$ Likelihood of the "modèle réduit à la constante"  

$L(\omega_{max})$ Likelihood of the "modèle saturé". The one the model will compare.  


The Deviance formula:  

$D(\omega_k) = -2[log(L(\omega_k)) - log(L(\omega_{max}))]$  $^{(*)}$

As the target is 0 or 1 $\Longrightarrow L(\omega_{max})=1 \Longrightarrow log(L(\omega_{max}))=0$  

$\Longrightarrow D(\omega_k) = -2[log(L(\omega_k))]$

(*) $D(\omega_k) = (\frac{log(L(\omega_k))}{log(L(\omega_{max}))}^2)$ 

The goal of the logistic regression is to maximise the Likelihood which is equivalent to minimize the Deviance.

The Deviance is equivalent to the SCE for the linear regression.