# The logistic regression

## The model

Logistic regression is a (generalized) linear model used to estimate the probability of a binary outcome, rather than a continuous value as in simple linear regression.


The $Y_i$ follow a Bernoulli distribution with parameter $p_i$:

$Y_i \sim B(p_i)$

This means that:

$P(Y_i=1) = p_i$, $P(Y_i = 0) = 1 - p_i$

Which is equivalent to:  

$P(Y_i = k) = {p_i}^k(1 - p_i)^{1-k}$ for $k \in \{0, 1\}$

## Sample (_échantillon_)

The joint distribution (_loi conjointe_) or joint probability of the $(Y_i)_{1, \ldots, n}$ ~ $B(p_i)$ is given by:

$P(Y_1=y_1, Y_2=y_2, \ldots, Y_n=y_n) = \underset{1,\ldots, n}\prod P(Y_i=y_i)$, since the $Y_i$ are independent.

With $P(Y_i = y_i) = p_i^{y_i}(1-p_i)^{1-y_i}$, and assuming the $Y_i$ are independent and identically distributed, so that $p_i = p$ for all $i \in \{1, \ldots, n\}$, we get:


$\underset{1,\ldots, n}\prod P(Y_i=y_i) = p^{\sum y_i}(1-p)^{\sum (1-y_i)} = p^{\sum y_i}(1-p)^{n - \sum y_i}$
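As a quick numerical check of this factorization, the sketch below compares the product of the individual Bernoulli probabilities with the closed form $p^{\sum y_i}(1-p)^{n - \sum y_i}$. The sample `y`, the value `p = 0.3`, and the function names are illustrative assumptions, not part of the derivation.

```python
import math

def joint_probability_product(y, p):
    """Product of the individual Bernoulli probabilities P(Y_i = y_i)."""
    return math.prod(p**yi * (1 - p)**(1 - yi) for yi in y)

def joint_probability_closed_form(y, p):
    """Closed form p^(sum y_i) * (1 - p)^(n - sum y_i)."""
    s = sum(y)
    return p**s * (1 - p)**(len(y) - s)

# Hypothetical sample and parameter, chosen only for illustration.
y = [0, 1, 1, 0, 1]
p = 0.3

product_value = joint_probability_product(y, p)
closed_form_value = joint_probability_closed_form(y, p)

# Both values should agree up to floating-point rounding.
print(product_value, closed_form_value)
```

The two functions only differ in how they group the factors, which is exactly what the i.i.d. assumption allows.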

# Estimator of p : $\hat{p}$

## The likelihood (_vraisemblance_)

The likelihood is a function that measures the probability of observing a given sample.

The likelihood is defined as the <u>joint probability</u> of the data given the model parameters $\theta$:  

$L_{\theta}(Y_1,Y_2,\ldots,Y_n) = \underset{1,\ldots, n}\prod P_{\theta}(Y_i=y_i)$.

In our case, $\theta = p$

## The Maximum likelihood Method

The objective of the maximum likelihood method is to find the value of the parameter $\theta$ that maximizes the likelihood for the observed sample $(y_1, \ldots, y_n)$.  
In other words, we look for the parameter value that makes the observed sample most probable.

The question is: where does the likelihood reach its maximum?


### The Log-likelihood


Because the logarithm is strictly increasing, the likelihood and the log-likelihood reach their maximum at the same value of $\theta$.  
In other words, <u>finding the maximizer of the likelihood is equivalent to finding the maximizer of the log-likelihood</u>.

Taking the $\log$ simplifies the calculation, since it turns the product into a sum:

$\log(L_{\theta}(Y_1,Y_2,\ldots,Y_n)) = \log(\underset{1,\ldots, n}\prod P_{\theta}(Y_i=y_i)) = \underset{1,\ldots, n}\sum \log(P_{\theta}(Y_i=y_i))$

$= \underset{1,\ldots, n}\sum \log (p^{y_i}(1-p)^{1-y_i}) = \underset{1,\ldots, n}\sum [y_i\log(p) + (1-y_i)\log(1-p)]$

$= \underset{1,\ldots, n}\sum y_i\log(p) + \underset{1,\ldots, n}\sum(1-y_i)\log(1-p) = \underset{1,\ldots, n}\sum y_i\log(p) + \underset{1,\ldots, n}\sum\log(1-p)  - \underset{1,\ldots, n}\sum y_i\log(1-p) = \underset{1,\ldots, n}\sum y_i\log(p) +  n\log(1-p) - \underset{1,\ldots, n}\sum y_i\log(1-p)$

$= \underset{1,\ldots, n}\sum y_i\log(p) - \underset{1,\ldots, n}\sum y_i\log(1-p) + n\log(1-p) = \underset{1,\ldots, n}\sum y_i\log(\frac{p}{1-p}) + n\log(1-p)$ 

We have now:  

$\log(L_{\theta}(Y_1,Y_2,\ldots,Y_n)) = \underset{1,\ldots, n}\sum y_i\log(\frac{p}{1-p}) + n\log(1-p)$ 
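The simplification can be checked numerically. The sketch below evaluates the log-likelihood once as a direct sum of $\log P(Y_i = y_i)$ and once with the simplified formula, using the coin-toss sample from the appendix. The function names and the choice `p = 0.4` are assumptions made for the example.

```python
import math

def log_likelihood_direct(y, p):
    """Sum of log P(Y_i = y_i) for an i.i.d. Bernoulli(p) sample."""
    return sum(yi * math.log(p) + (1 - yi) * math.log(1 - p) for yi in y)

def log_likelihood_simplified(y, p):
    """Simplified form: sum(y_i) * log(p / (1 - p)) + n * log(1 - p)."""
    return sum(y) * math.log(p / (1 - p)) + len(y) * math.log(1 - p)

# Coin-toss sample from the appendix; p = 0.4 is an arbitrary test value.
y = [0, 1, 1, 0, 1, 1, 1, 0, 0, 1]
p = 0.4

direct_value = log_likelihood_direct(y, p)
simplified_value = log_likelihood_simplified(y, p)

# Both expressions should agree up to floating-point rounding.
print(direct_value, simplified_value)
```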

## The Maximum Log-likelihood Method

To find where the Log-likelihood reaches its maximum we calculate the derivative of the Log-likelihood:  


$\frac{\partial }{\partial \theta} \log(L_{\theta}(Y_1,Y_2,\ldots,Y_n)) =\frac{\partial }{\partial \theta} \left[ \underset{1,\ldots, n}\sum y_i\log(\frac{p}{1-p}) + n\log(1-p) \right]$

With $\theta = p$:

$\frac{\partial }{\partial p} \left[ \underset{1,\ldots, n}\sum y_i\log(\frac{p}{1-p}) + n\log(1-p) \right] = \underset{1,\ldots, n}\sum  y_i \frac{\partial }{\partial p} \log(\frac{p}{1-p}) + n \frac{\partial }{\partial p} \log(1-p) $



- $ \frac{\partial }{\partial p} \log(\frac{p}{1-p}) \underset{(log(u))'=\frac{u'}{u}}= \frac{\partial }{\partial p} (\frac{p}{1-p}) \times \frac{1-p}{p} = \frac{1\times(1-p)-p(-1)}{(1-p)^2} \times \frac{1-p}{p} = \frac{1}{p(1-p)}$

- $n \frac{\partial }{\partial p} \log(1-p) \underset{(log(u))'=\frac{u'}{u}}= n \frac{(-1)}{1-p}$  

$\Rightarrow \underset{1,\ldots, n}\sum  y_i \frac{\partial }{\partial p} \log(\frac{p}{1-p}) + n \frac{\partial }{\partial p} \log(1-p) = \underset{1,\ldots, n}\sum  y_i \frac{1}{p(1-p)} + n  \frac{(-1)}{1-p}  $  

Setting the derivative to zero gives:

$\frac{\partial }{\partial \theta} \log(L_{\theta}(Y_1,Y_2,\ldots,Y_n)) = 0$  

$\iff \underset{1,\ldots, n}\sum  y_i \frac{1}{p(1-p)} + n  \frac{(-1)}{1-p} = 0$  

$\iff \underset{1,\ldots, n}\sum  y_i \frac{1}{p(1-p)}  = \frac{n}{1-p}$  

$\iff \frac{1}{n}\underset{1,\ldots, n}\sum  y_i   = p$  

$\iff \hat{p} = \frac{1}{n}\underset{1,\ldots, n}\sum  y_i = \bar{y}$
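To see this result on data, the sketch below maximizes the log-likelihood by brute force over a grid of candidate values and compares the maximizer with the sample mean. It uses the coin-toss sample from the appendix. The grid resolution and the function name are assumptions for illustration, and a grid search is used here only as a simple check, not as the recommended way to compute $\hat{p}$.

```python
import math

def log_likelihood(y, p):
    """Log-likelihood of an i.i.d. Bernoulli(p) sample (simplified form)."""
    return sum(y) * math.log(p / (1 - p)) + len(y) * math.log(1 - p)

# Coin-toss sample from the appendix.
y = [0, 1, 1, 0, 1, 1, 1, 0, 0, 1]
y_bar = sum(y) / len(y)

# Candidate values 0.001, 0.002, ..., 0.999.
# The endpoints 0 and 1 are excluded because the logarithm is undefined there.
grid = [k / 1000 for k in range(1, 1000)]
p_best = max(grid, key=lambda p: log_likelihood(y, p))

print("sample mean:", y_bar)
print("grid maximizer:", p_best)
```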



We now know that the log-likelihood has a unique critical point, at $p = \bar{y}$.

Let's verify that this critical point is a maximum.

Usually, we would compute the second derivative $\frac{\partial^2}{\partial \theta^2} \log(L_{\theta}(Y_1,Y_2,\ldots,Y_n))$ and check that it is negative. Its expectation, with the sign flipped, is the **Fisher Information** $I(\theta)$:

$I(\theta) = - \mathbb{E} \left[ \frac{\partial^2}{\partial \theta^2} \log(L_{\theta}(Y_1,Y_2,\ldots,Y_n)) \right]$

Here, we instead study the sign of the first derivative $\frac{\partial }{\partial \theta} \log(L_{\theta}(Y_1,Y_2,\ldots,Y_n))$:

$\frac{\partial }{\partial \theta} \log(L_{\theta}(Y_1,Y_2,\ldots,Y_n)) > 0$  

$\iff \underset{1,\ldots, n}\sum  y_i \frac{1}{p(1-p)} + n  \frac{(-1)}{1-p} > 0$  

$\iff \underset{1,\ldots, n}\sum  y_i \frac{1}{p(1-p)}  > \frac{n}{1-p}$  

$\iff \frac{1}{n}\underset{1,\ldots, n}\sum  y_i   > p$ (multiplying both sides by $\frac{p(1-p)}{n} > 0$, which preserves the inequality)  

$\iff \bar{y} > p $


And, $\frac{\partial }{\partial \theta} \log(L_{\theta}(Y_1,Y_2,\ldots,Y_n)) < 0 \iff \bar{y} < p $


Thus the log-likelihood is increasing for $p < \bar{y}$ and decreasing for $p > \bar{y}$: its variation table shows that $\bar{y}$ is indeed a maximum, and a global one on $(0, 1)$.
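The sign change can be observed directly. The sketch below evaluates the derivative of the log-likelihood on both sides of $\bar{y}$ for the coin-toss sample from the appendix. The function name and the test points are assumptions chosen for the example.

```python
def log_likelihood_derivative(y, p):
    """Derivative of the log-likelihood: sum(y_i) / (p (1 - p)) - n / (1 - p)."""
    return sum(y) / (p * (1 - p)) - len(y) / (1 - p)

# Coin-toss sample from the appendix, with sample mean 0.6.
y = [0, 1, 1, 0, 1, 1, 1, 0, 0, 1]
y_bar = sum(y) / len(y)

# Arbitrary test points on each side of the sample mean.
left_values = [log_likelihood_derivative(y, p) for p in (0.2, 0.4, 0.5)]
right_values = [log_likelihood_derivative(y, p) for p in (0.7, 0.8, 0.9)]
value_at_mean = log_likelihood_derivative(y, y_bar)

print("p < y_bar:", left_values)    # expected to be positive
print("p > y_bar:", right_values)   # expected to be negative
print("p = y_bar:", value_at_mean)  # expected to be zero up to rounding
```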

#### $\bar{y}$ is unbiased

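$\mathbb{E} \left[ \hat{p} \right] = \mathbb{E} \left[ \frac{1}{n} \sum Y_i \right] = \frac{1}{n} \sum \mathbb{E} \left[ Y_i \right] = \frac{1}{n} np = p$, by linearity of the expectation, for every $n$.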

$\text{Var} \left[ \hat{p} \right] = \text{Var} \left[ \frac{1}{n} \sum Y_i \right] = \frac{1}{n^2} \text{Var} \left[  \sum Y_i \right] = \frac{1}{n^2} \sum \text{Var} \left[  Y_i \right] = \frac{1}{n^2} np(1-p) =  \frac{1}{n} p(1-p) = \frac{1}{I(\theta)}$, where $I(\theta) = \frac{n}{p(1-p)}$ is the Fisher information of the sample (the third equality uses the independence of the $Y_i$).

The Cramér-Rao bound is attained: $\hat{p}$ has the smallest variance among all unbiased estimators of $p$.
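For reference, here is a sketch of the Fisher information computation for this model, using the definition of $I(\theta)$ given earlier with $\theta = p$. Since $\frac{1}{p(1-p)} = \frac{1}{p} + \frac{1}{1-p}$, the first derivative of the log-likelihood can be rewritten as:

$\frac{\partial }{\partial p} \log(L_{p}(Y_1,Y_2,\ldots,Y_n)) = \frac{\sum y_i}{p} - \frac{n - \sum y_i}{1-p}$

Differentiating once more:

$\frac{\partial^2}{\partial p^2} \log(L_{p}(Y_1,Y_2,\ldots,Y_n)) = - \frac{\sum y_i}{p^2} - \frac{n - \sum y_i}{(1-p)^2}$

Taking the expectation with $\mathbb{E} \left[ \sum Y_i \right] = np$:

$I(p) = \frac{np}{p^2} + \frac{n - np}{(1-p)^2} = \frac{n}{p} + \frac{n}{1-p} = \frac{n}{p(1-p)}$

The Cramér-Rao bound for an unbiased estimator of $p$ is therefore $\frac{1}{I(p)} = \frac{p(1-p)}{n}$, which is exactly $\text{Var} \left[ \hat{p} \right]$.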

$\text{Var} \left[ \hat{p} \right] = \frac{p(1-p)}{n} \underset{n \to \infty}{\longrightarrow} 0$

$\Rightarrow \hat{p}$ is a consistent, minimum-variance unbiased estimator (MVUE; ESBVM in French). In that sense, it is optimal :)
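These properties can be illustrated with a small simulation: draw many Bernoulli samples, compute $\hat{p}$ for each, and compare the empirical mean and variance of the estimates with $p$ and $\frac{p(1-p)}{n}$. The values of `p_true`, `n`, `n_repeats`, and the seed are assumptions chosen for the example. This is an empirical illustration, not a proof.

```python
import random

random.seed(0)  # fixed seed so the run is reproducible

p_true = 0.3       # hypothetical true parameter
n = 50             # size of each simulated sample
n_repeats = 20000  # number of simulated samples

estimates = []
for _ in range(n_repeats):
    # One Bernoulli(p_true) sample of size n, and its estimate p_hat = y_bar.
    sample = [1 if random.random() < p_true else 0 for _ in range(n)]
    estimates.append(sum(sample) / n)

mean_estimate = sum(estimates) / n_repeats
var_estimate = sum((e - mean_estimate) ** 2 for e in estimates) / n_repeats
theoretical_var = p_true * (1 - p_true) / n

print("empirical mean of p_hat:", mean_estimate)
print("empirical variance of p_hat:", var_estimate)
print("theoretical variance p(1-p)/n:", theoretical_var)
```

With these settings, the empirical mean should land close to `p_true` and the empirical variance close to `theoretical_var`. Increasing `n` should shrink both variances roughly in proportion to $1/n$.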

## APPENDIX

Reminder: the likelihood

Given an observed sample $(x_1, x_2,\dots,x_n)$ and a probability distribution $P(\theta)$, the likelihood quantifies how probable it is that the observations come from a theoretical sample drawn from $P(\theta)$.

Example:  

We toss a coin 10 times.  

The observed binary sample is $0,1,1,0,1,1,1,0,0,1$

For a sample of size 10 from a Bernoulli distribution with parameter $p_j=p$, the probability of this realization is:  

$p^6(1-p)^4$  

| $p$ | 0.2 | 0.3 | 0.4 | 0.5 | 0.6 | 0.7 | 0.8 |
|-----|---|---|---|---|---|---|---|
| $p^6(1-p)^4$ | $2.6 \times 10^{-5}$ | $1.8 \times 10^{-4}$ | $5.3 \times 10^{-4}$ | $9.8 \times 10^{-4}$ | $1.2 \times 10^{-3}$ | $9.5 \times 10^{-4}$ | $4.2 \times 10^{-4}$ |

```python
# Likelihood p^6 (1 - p)^4 of the observed sample for each candidate value of p
for p in [.2, .3, .4, .5, .6, .7, .8]:
    likeli = round(p**6 * (1 - p)**4, 10)
    print(likeli)
```

```
2.62144e-05
0.0001750329
0.0005308416
0.0009765625
0.0011943936
0.0009529569
0.0004194304
```


It is natural to choose as the estimate of $p$ the value for which the probability of the observed sample is highest, namely $p=0.6$, which is also the sample mean $\bar{y} = 6/10$.