# Notes for Homework 1

In [1]:
import numpy as np

## Part (a) fitting histogram

### Probabilities

$p_\theta(x) = \frac{\exp(\theta_x)}{\sum_{x'} \exp(\theta_{x'})}$

This is a softmax model. Let's see how to compute it with a simple example.

In [None]:
# define theta as an array of shape 'd'
# let's say d=3
theta: np.ndarray = np.array([0.5, 1.0, -0.5])

In [None]:
# We compute all the probabilities at once with numpy
# it is a softargmax...
probs: np.ndarray = np.exp(theta) / np.sum(np.exp(theta))
assert np.isclose(np.sum(probs), 1.0)  # compare floats with a tolerance, not ==
probs

array([0.33149896, 0.54654939, 0.12195165])
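The direct formula can overflow when the entries of `theta` are large, because `np.exp` grows very quickly. A common remedy is to subtract the maximum before exponentiating, which leaves the result unchanged. Below is a minimal sketch of that idea; the helper name `softmax` and the variables `theta_example`, `probs_stable` and `probs_naive` are chosen here for illustration and are not part of the homework code.

```python
import numpy as np


def softmax(theta: np.ndarray) -> np.ndarray:
    """Softmax of a 1-D array, shifted by its max so that exp never overflows."""
    shifted = theta - np.max(theta)  # largest entry becomes 0
    exp_shifted = np.exp(shifted)
    return exp_shifted / np.sum(exp_shifted)


theta_example = np.array([0.5, 1.0, -0.5])
probs_stable = softmax(theta_example)
probs_naive = np.exp(theta_example) / np.sum(np.exp(theta_example))

# Both versions agree on small inputs
print(np.allclose(probs_stable, probs_naive))

# The shifted version stays finite even for very large parameters
print(softmax(np.array([1000.0, 1001.0, 999.0])))
```

The shift works because multiplying the numerator and the denominator by the same constant $\exp(-\max_j \theta_j)$ does not change the ratio.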

### Negative Log Likelihood

$L(θ) = - (1/N) Σ log(p_θ(x_i))$

In [4]:
# the data contains one 0, two 1s and one 2
# 4 datapoints over 3 distinct classes, matching len(probs)
data = np.array([0, 2, 1, 1])
assert len(np.unique(data)) == len(probs)
n = len(data)
data

array([0, 2, 1, 1])

In [5]:
loss_theta = (-1/n) * np.sum(np.log(probs[data]))
loss_theta

np.float64(1.1041306053367284)

Use `np.mean` to compute the $\frac{1}{N} \sum$ part directly:

In [6]:
loss_theta_better = -np.mean(np.log(probs[data]))
loss_theta_better

np.float64(1.1041306053367284)

What is `probs[data]`?
* `probs` is an array containing the model's probability for each class
* `data` contains the observed class index of each datapoint
* `probs[data]` uses integer-array indexing to look up, for each observation, the probability the model assigns to its class. In other words, it is the likelihood of each datapoint under the model.

In [7]:
probs[data]

array([0.33149896, 0.12195165, 0.54654939, 0.54654939])

In [8]:
assert np.isclose(loss_theta, loss_theta_better)  # same value up to floating-point rounding

## Deriving the Gradient
$\nabla L = -\frac{1}{N} \sum_{i=1}^{N} (e_{x_i} - p_\theta)$

The gradient is given above. Where does it come from?

**1. Starting with Negative Log Likelihood**

The loss function is:
$$L(\theta) = -\frac{1}{N} \sum \log(p_\theta(x_i))$$
where: $$p_\theta(x) = \frac{\exp(\theta_x)}{\sum_j \exp(\theta_j)}$$

**2. Taking the Derivative**

For one example $x$, let's differentiate $\log(p_\theta(x))$ with respect to $\theta_k$:
$$\begin{align*}
\frac{\partial}{\partial\theta_k} \log(p_\theta(x)) &= \frac{\partial}{\partial\theta_k} [\log(\exp(\theta_x)) - \log(\sum_j \exp(\theta_j))] \\
&= \frac{\partial}{\partial\theta_k} [\theta_x - \log(\sum_j \exp(\theta_j))]
\end{align*}$$


**3. Two Cases to Consider.**

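The second term, $\log(\sum_j \exp(\theta_j))$, always depends on $\theta_k$, and its derivative is $p_\theta(k)$. The first term, $\theta_x$, depends on $\theta_k$ only when $k = x$:

Either $k = x$, and then $\frac{\partial \theta_x}{\partial\theta_k} = 1$.

Or $k \neq x$, and then $\frac{\partial \theta_x}{\partial\theta_k} = 0$.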

Case 1: $k = x$ (observed class)
$$\frac{\partial}{\partial\theta_k} [\theta_x - \log(\sum_j \exp(\theta_j))] = 1 - \frac{\exp(\theta_k)}{\sum_j \exp(\theta_j)} = 1 - p_\theta(k)$$

Case 2: $k \neq x$
$$\frac{\partial}{\partial\theta_k} [\theta_x - \log(\sum_j \exp(\theta_j))] = 0 - \frac{\exp(\theta_k)}{\sum_j \exp(\theta_j)} = -p_\theta(k)$$

**4. Combining Results**

This gives us:
$$\frac{\partial L}{\partial\theta_k} = -\frac{1}{N} \sum_{i=1}^{N} [\mathbb{1}\{x_i=k\} - p_\theta(k)]$$
where $\mathbb{1}\{x_i=k\}$ is the indicator function: 1 if $x_i=k$, 0 otherwise.

**5. Final Form (vectorized)**
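$$\nabla L = -\frac{1}{N} \sum_{i=1}^{N} (e_{x_i} - p_\theta)$$
where $e_{x_i}$ is the one-hot vector of the observed class $x_i$ and $p_\theta$ is the vector of class probabilities.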
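A quick way to gain confidence in this result is to compare it with a finite-difference estimate of the loss. The sketch below does that on the toy example from Part (a). The function names `nll`, `nll_grad` and the variables `theta_check`, `data_check` are introduced here for illustration only. It uses the fact that the average of the one-hot vectors is the vector of empirical class frequencies, so the gradient reduces to $p_\theta$ minus those frequencies.

```python
import numpy as np


def softmax(theta: np.ndarray) -> np.ndarray:
    z = np.exp(theta - np.max(theta))
    return z / z.sum()


def nll(theta: np.ndarray, data: np.ndarray) -> float:
    """Average negative log-likelihood of integer class labels under softmax(theta)."""
    return -np.mean(np.log(softmax(theta)[data]))


def nll_grad(theta: np.ndarray, data: np.ndarray) -> np.ndarray:
    """Analytic gradient: -(1/N) * sum_i (e_{x_i} - p_theta) = p_theta - frequencies."""
    freq = np.bincount(data, minlength=len(theta)) / len(data)  # mean of the one-hot vectors
    return softmax(theta) - freq


theta_check = np.array([0.5, 1.0, -0.5])
data_check = np.array([0, 2, 1, 1])

analytic = nll_grad(theta_check, data_check)

# Central finite differences, one coordinate at a time
eps = 1e-6
numeric = np.zeros_like(theta_check)
for k in range(len(theta_check)):
    step = np.zeros_like(theta_check)
    step[k] = eps
    numeric[k] = (nll(theta_check + step, data_check)
                  - nll(theta_check - step, data_check)) / (2 * eps)

print(np.allclose(analytic, numeric, atol=1e-6))
```

The components of the gradient sum to zero, since both $p_\theta$ and the empirical frequencies sum to one.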

---

## Part (b) Fitting Discretized Mixture of Logistics

### How to initialize the coefficients of the model?

In [18]:
d = 20
n_logistics_models: int = 4
theta = np.zeros((n_logistics_models, 3))
theta[:, 0] = np.random.uniform(0, d-1, size=n_logistics_models) # mu: means drawn uniformly from [0, d-1)
theta[:, 1] = np.exp(np.random.rand(n_logistics_models)) # s: scales in [1, e), positive thanks to exp
theta[:, 2] = np.full(n_logistics_models, (1 / n_logistics_models)) # pi: uniform mixture weights

To use them efficiently in vectorized form, we rely on NumPy's broadcasting.

In [None]:
# shape (4)
theta[:, 0] # mu

array([ 7.61667106,  1.50266247, 12.50647138, 13.26991431])

In [None]:
# [:, None] adds a trailing axis; it is equivalent to unsqueeze(1) in PyTorch.
mu = theta[:, 0][:, None] # shape (4, 1)
s = theta[:, 1][:, None] # shape (4, 1)
pi = theta[:, 2][:, None] # shape (4, 1)

In [24]:
mu # shape (4, 1)

array([[ 7.61667106],
       [ 1.50266247],
       [12.50647138],
       [13.26991431]])

### Deriving the gradient

# Gradient Derivation for a Mixture of Discretized Logistics

## Model recap
Our probability model is:
$p_\theta(x) = \sum_{i=1}^4 \pi_i[\sigma(\frac{x+0.5 - \mu_i}{s_i}) - \sigma(\frac{x-0.5-\mu_i}{s_i})]$

where $\sigma(z) = \frac{1}{1 + e^{-z}}$ is the sigmoid function.
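The `(4, 1)` shapes prepared earlier let this model be evaluated for every value of $x$ at once. The following is a minimal sketch under a few assumptions: the helper names `sigmoid` and `mixture_probs` and the fixed parameter values are illustrative, and the code follows the formula exactly as written above, with no special treatment of the first and last bins.

```python
import numpy as np


def sigmoid(z: np.ndarray) -> np.ndarray:
    return 1.0 / (1.0 + np.exp(-z))


def mixture_probs(x: np.ndarray, mu: np.ndarray, s: np.ndarray, pi: np.ndarray) -> np.ndarray:
    """x has shape (n,); mu, s, pi have shape (k, 1). Returns p_theta(x) with shape (n,)."""
    upper = sigmoid((x[None, :] + 0.5 - mu) / s)  # (1, n) against (k, 1) broadcasts to (k, n)
    lower = sigmoid((x[None, :] - 0.5 - mu) / s)  # (k, n)
    return np.sum(pi * (upper - lower), axis=0)   # weighted sum over the k components


d = 20
x = np.arange(d)                                      # all possible values 0 .. d-1
mu = np.array([7.6, 1.5, 12.5, 13.3])[:, None]        # illustrative values, shape (4, 1)
s = np.array([1.5, 2.0, 1.2, 2.5])[:, None]           # illustrative values, shape (4, 1)
pi = np.full((4, 1), 0.25)                            # uniform weights, shape (4, 1)

p = mixture_probs(x, mu, s, pi)
print(p.shape)
print(p.sum())
```

With the formula as written, the probabilities over $x = 0, \dots, d-1$ sum to slightly less than 1, because each logistic component places some mass below $-0.5$ and above $d - 0.5$.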

## Loss Function
The negative log-likelihood is:
$L = -\frac{1}{N}\sum_{n=1}^N \log(p_\theta(x_n))$

## Partial derivatives

### With respect to μᵢ
$\frac{\partial L}{\partial \mu_i} = -\frac{1}{N}\sum_{n=1}^N \frac{1}{p_\theta(x_n)} \cdot \pi_i[\frac{\partial \sigma}{\partial \mu_i}(\frac{x_n+0.5 - \mu_i}{s_i}) - \frac{\partial \sigma}{\partial \mu_i}(\frac{x_n-0.5-\mu_i}{s_i})]$

where $\frac{\partial \sigma}{\partial \mu_i}(z) = \sigma(z)(1-\sigma(z))(-\frac{1}{s_i})$

### With respect to sᵢ
$\frac{\partial L}{\partial s_i} = -\frac{1}{N}\sum_{n=1}^N \frac{1}{p_\theta(x_n)} \cdot \pi_i[\frac{\partial \sigma}{\partial s_i}(\frac{x_n+0.5 - \mu_i}{s_i}) - \frac{\partial \sigma}{\partial s_i}(\frac{x_n-0.5-\mu_i}{s_i})]$

where $\frac{\partial \sigma}{\partial s_i}(z) = \sigma(z)(1-\sigma(z))(-\frac{z}{s_i})$

### With respect to πᵢ
$\frac{\partial L}{\partial \pi_i} = -\frac{1}{N}\sum_{n=1}^N \frac{1}{p_\theta(x_n)} \cdot [\sigma(\frac{x_n+0.5 - \mu_i}{s_i}) - \sigma(\frac{x_n-0.5-\mu_i}{s_i})]$

---
## General form of the gradient

For a point $x_n$, define:
$\alpha_i(x_n) = \frac{x_n+0.5 - \mu_i}{s_i}$ and $\beta_i(x_n) = \frac{x_n-0.5 - \mu_i}{s_i}$

Then, for a parameter $\theta_i \in \{\mu_i, s_i\}$, the gradient sums one contribution per point $x_n$ and can be written:

$\frac{\partial L}{\partial \theta_i} = -\frac{1}{N}\sum_{n=1}^N \frac{1}{p_\theta(x_n)} \cdot \pi_i \cdot [\sigma'(\alpha_i(x_n)) \cdot \frac{\partial \alpha_i}{\partial \theta_i} - \sigma'(\beta_i(x_n)) \cdot \frac{\partial \beta_i}{\partial \theta_i}]$

where:
- For $\mu_i$: $\frac{\partial \alpha_i}{\partial \mu_i} = \frac{\partial \beta_i}{\partial \mu_i} = -\frac{1}{s_i}$
- For $s_i$: $\frac{\partial \alpha_i}{\partial s_i} = -\frac{\alpha_i}{s_i}$ and $\frac{\partial \beta_i}{\partial s_i} = -\frac{\beta_i}{s_i}$
- For $\pi_i$: this is a special case. $p_\theta$ is linear in $\pi_i$, so the factor $\pi_i$ drops out, no chain rule is needed, and the bracket is simply $\sigma(\alpha_i(x_n)) - \sigma(\beta_i(x_n))$

And $\sigma'(z) = \sigma(z)(1-\sigma(z))$ is the derivative of the sigmoid function.
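These formulas can be checked numerically in the same way as in Part (a). The sketch below implements the three partial derivatives with broadcasting and compares them with central finite differences. The function names `nll_and_grads` and `numeric_grad`, the toy data and the parameter values are illustrative assumptions. As in the derivation above, $\pi_i$ is treated as a free parameter, without enforcing that the weights sum to 1.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def nll_and_grads(x, mu, s, pi):
    """x: shape (n,); mu, s, pi: shape (k,). Returns the loss and its gradients, each of shape (k,)."""
    alpha = (x[None, :] + 0.5 - mu[:, None]) / s[:, None]  # (k, n)
    beta = (x[None, :] - 0.5 - mu[:, None]) / s[:, None]   # (k, n)
    sa, sb = sigmoid(alpha), sigmoid(beta)
    dsa, dsb = sa * (1 - sa), sb * (1 - sb)                # sigma'(alpha), sigma'(beta)

    p = np.sum(pi[:, None] * (sa - sb), axis=0)            # p_theta(x_n), shape (n,)
    loss = -np.mean(np.log(p))

    w = pi[:, None] / p[None, :]                           # pi_i / p_theta(x_n), shape (k, n)
    grad_mu = -np.mean(w * (dsa - dsb) * (-1.0 / s[:, None]), axis=1)
    grad_s = -np.mean(w * (dsa * (-alpha / s[:, None]) - dsb * (-beta / s[:, None])), axis=1)
    grad_pi = -np.mean((sa - sb) / p[None, :], axis=1)
    return loss, grad_mu, grad_s, grad_pi


def numeric_grad(f, v, eps=1e-6):
    """Central finite-difference gradient of a scalar function f at the vector v."""
    g = np.zeros_like(v)
    for i in range(len(v)):
        step = np.zeros_like(v)
        step[i] = eps
        g[i] = (f(v + step) - f(v - step)) / (2 * eps)
    return g


# Toy data and illustrative parameter values
x = np.array([3, 7, 7, 12, 15, 1])
mu = np.array([7.6, 1.5, 12.5, 13.3])
s = np.array([1.5, 2.0, 1.2, 2.5])
pi = np.full(4, 0.25)

loss, g_mu, g_s, g_pi = nll_and_grads(x, mu, s, pi)

num_mu = numeric_grad(lambda v: nll_and_grads(x, v, s, pi)[0], mu)
num_s = numeric_grad(lambda v: nll_and_grads(x, mu, v, pi)[0], s)
num_pi = numeric_grad(lambda v: nll_and_grads(x, mu, s, v)[0], pi)

print(np.allclose(g_mu, num_mu, atol=1e-6),
      np.allclose(g_s, num_s, atol=1e-6),
      np.allclose(g_pi, num_pi, atol=1e-6))
```

If the weights are instead parameterized through a softmax so that they always sum to 1, the gradient with respect to the underlying logits needs an extra chain-rule step that this sketch does not cover.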