# ***Logistic Regression and Ensemble Methods***

### 1. Logistic Regression
Logistic regression is used for classification problems, where the goal is to predict a binary outcome given input data. Given a dataset:

$$
(x_1, y_1), (x_2, y_2), \dots, (x_n, y_n)
$$

where $ y_i $ is either $ 0 $ or $ 1 $, we want to model the probability:

$$
P(y = 1 | x).
$$

To achieve this, we use the **sigmoid function**, which maps real numbers to the range $ (0,1) $, making it suitable for probability estimation:

$$
P(y = 1 | x) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x)}}.
$$

Similarly, for the probability of $ y = 0 $:

$$
P(y = 0 | x) = 1 - P(y = 1 | x).
$$
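As a minimal sketch of the two formulas above, the following evaluates the binary model for one scalar input. The function names and coefficient values are hypothetical and chosen only for illustration; the branch on the sign of `z` is a common numerical safeguard, not part of the model.

```python
import math

def sigmoid(z):
    """Logistic function 1 / (1 + exp(-z)), arranged to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def predict_proba(x, beta0, beta1):
    """Return (P(y=0|x), P(y=1|x)) for a single scalar feature x."""
    p1 = sigmoid(beta0 + beta1 * x)
    return 1.0 - p1, p1

# Hypothetical coefficients, chosen only for illustration.
p0, p1 = predict_proba(x=2.0, beta0=-1.0, beta1=0.5)
print(p0, p1)  # the two probabilities sum to 1
```

In practice the coefficients would be fitted to the dataset, typically by maximum likelihood, rather than chosen by hand.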

### 2. Multiclass Classification and Softmax Function
For **multiclass classification**, logistic regression is extended using the **softmax function**. Suppose $ y $ can take one of $ M $ values, $ 1, 2, 3, \dots, M $, and let $ \beta_j $ denote the coefficient for class $ j $. We then define:

$$
P(y = j | x) = \frac{e^{\beta_j x}}{\sum_{k=1}^{M} e^{\beta_k x}}.
$$

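As in the binary case, softmax ensures that all class probabilities sum to 1. For $ M = 2 $ it reduces to the sigmoid form above, with the difference $ \beta_2 - \beta_1 $ playing the role of the slope coefficient.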
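The softmax formula can be sketched in a few lines of standard-library Python. The coefficient values and the input `x` below are hypothetical; subtracting the maximum score before exponentiating is a standard numerical precaution that leaves the probabilities unchanged. The last lines also compare the two-class case against the sigmoid.

```python
import math

def softmax(scores):
    """Map a list of real scores to probabilities that sum to 1."""
    m = max(scores)  # shifting by the max avoids overflow and cancels in the ratio
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical per-class coefficients beta_j and a scalar input x.
betas = [0.2, -0.4, 1.0]
x = 1.5
probs = softmax([b * x for b in betas])
print(probs, sum(probs))

# With M = 2, softmax over the scores (0, z) matches the sigmoid of z.
z = 0.7
two_class = softmax([0.0, z])
print(two_class[1], 1.0 / (1.0 + math.exp(-z)))
```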

### 3. Statistical Mechanics and Ensembles
The probabilistic formulation of logistic regression can be understood using ideas from **statistical mechanics**, particularly **Boltzmann’s work on ensembles**.

- **Total $ N $ data points** are considered as an **ensemble**.
- **Each data point is a system**, and it can exist in **one of M possible states** (labels).
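- The **macrostate** of the ensemble is defined by how the $ N $ data points are distributed across the $ M $ labels, i.e. by the occupation counts $ (a_1, a_2, \dots, a_M) $, where $ a_j $ is the number of data points with label $ j $.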

Assuming that every individual assignment of labels to data points is equally likely, the probability of a specific configuration $ (a_1, a_2, \dots, a_M) $ is given by:

$$
P(a_1, a_2, \dots, a_M) = \frac{W(a_1, a_2, \dots, a_M)}{W_{\text{total}}}
$$

where:
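- $ W(a_1, a_2, \dots, a_M) $ is the number of ways to assign $ N $ data points to $ M $ states so that exactly $ a_j $ of them end up in state $ j $,
- $ W_{\text{total}} $ is the total number of possible assignments, which equals $ M^N $ when no further constraints are imposed.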

Using the multinomial expression for $ W $:

$$
W(a_1, a_2, \dots, a_M) = \frac{N!}{a_1! a_2! \dots a_M!}.
$$
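To make the counting concrete, here is a small sketch that evaluates the multinomial expression for a toy ensemble and enumerates all configurations. The sizes `N = 4` and `M = 3` are arbitrary illustrative choices, and the check against $ M^N $ assumes that no constraint other than $ \sum_j a_j = N $ restricts the assignments.

```python
import math
from itertools import product

def multiplicity(counts):
    """W = N! / (a_1! a_2! ... a_M!) for occupation counts a_j summing to N."""
    n = sum(counts)
    w = math.factorial(n)
    for a in counts:
        w //= math.factorial(a)  # each intermediate quotient is an integer
    return w

N, M = 4, 3
# Enumerate every occupation vector (a_1, ..., a_M) with a_1 + ... + a_M = N.
configs = [c for c in product(range(N + 1), repeat=M) if sum(c) == N]
w_total = sum(multiplicity(c) for c in configs)

print(multiplicity((2, 1, 1)))   # 4! / (2! 1! 1!) = 12
print(w_total == M ** N)         # the multiplicities add up to M^N
best = max(configs, key=multiplicity)
print(best, multiplicity(best))  # one of the most even splits
```

Even at this size the most even splits carry the largest multiplicity, which is the behaviour the maximization of $ \ln W $ exploits for large $ N $.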

Applying Stirling’s approximation $ \ln N! \approx N \ln N - N $, we derive the most probable distribution by maximizing $ \ln W $ subject to constraints.

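This leads to the probability of a system being in state $ j $, where $ X_j $ is the score (logit) attached to state $ j $ and $ \beta $ is a Lagrange multiplier fixed by the constraints:

$$
P_j = \frac{a_j}{N} = \frac{e^{\beta X_j}}{\sum_{k=1}^{M} e^{\beta X_k}}.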
$$
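The argument above rests on Stirling's approximation, so it is worth seeing how accurate the leading-order form is. The sketch below compares $ n \ln n - n $ with the exact $ \ln n! $ obtained from the log-gamma function; the sample values of `n` are arbitrary.

```python
import math

def ln_factorial_exact(n):
    """Exact ln n! via the log-gamma function: ln n! = lgamma(n + 1)."""
    return math.lgamma(n + 1)

def ln_factorial_stirling(n):
    """Leading-order Stirling approximation: ln n! ~ n ln n - n."""
    return n * math.log(n) - n

def relative_error(n):
    exact = ln_factorial_exact(n)
    return (exact - ln_factorial_stirling(n)) / exact

for n in (10, 100, 1000, 10000):
    print(n, round(ln_factorial_exact(n), 3),
          round(ln_factorial_stirling(n), 3),
          f"{relative_error(n):.2e}")
```

The relative error shrinks as `n` grows, which is why the approximation is only trusted when every occupation count $ a_j $ is large.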

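Using **Lagrange multipliers** to enforce the constraints (normalization and a fixed average cost), we obtain the Boltzmann form

$$
P(y = j) = \frac{e^{-\beta E_j}}{Z}, \qquad Z = \sum_{k=1}^{M} e^{-\beta E_k},
$$

where $ E_j $ is the cost (energy) of state $ j $ and $ Z $ is the partition function. Identifying the score with the negative cost, $ X_j = -E_j $, turns this into the **softmax function** and the fixed-average constraint into

$$
\sum_{j=1}^{M} a_j X_j = X_{\text{total}}.
$$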

To find the most probable configuration (i.e. the set $ \{a_j\} $) that maximizes the multiplicity $ W $, we maximize the logarithm of $ W $ (using Stirling’s approximation) subject to two constraints:
- **Normalization:** $ \sum_{j=1}^{M} a_j = N, $
- **Logit Constraint:** $ \sum_{j=1}^{M} a_j X_j = X_{\text{total}}. $

The corresponding Lagrangian is

$$
\mathcal{L} = \ln W + \alpha \left( \sum_{j=1}^{M} a_j - N \right) + \beta \left( \sum_{j=1}^{M} a_j X_j - X_{\text{total}} \right).
$$

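Taking the derivative with respect to $ a_j $ and applying Stirling’s approximation to each $ \ln a_j! $ (so that $ \frac{\partial \ln W}{\partial a_j} \approx -\ln a_j $, because the $ -1 $ from differentiating $ a_j \ln a_j $ cancels against the $ +1 $ from differentiating $ -a_j $), we have

$$
-\ln a_j + \alpha + \beta X_j = 0.
$$

Rearranging yields

$$
\ln a_j = \alpha' + \beta X_j,
$$

where $ \alpha' = \alpha $ collects the constant terms. (Dropping the $ -N $ term in Stirling’s approximation would give $ \alpha' = \alpha - 1 $ instead; either way the constant is fixed by normalization.) Exponentiating both sides gives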

$$
a_j = e^{\alpha'} e^{\beta X_j}.
$$

Normalization requires that

$$
\sum_{j=1}^{M} a_j = e^{\alpha'} \sum_{j=1}^{M} e^{\beta X_j} = N,
$$

so that

$$
e^{\alpha'} = \frac{N}{\sum_{j=1}^{M} e^{\beta X_j}}.
$$

Substituting back, the count for each state becomes

$$
a_j = \frac{N e^{\beta X_j}}{\sum_{k=1}^{M} e^{\beta X_k}}.
$$
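The expression for $ a_j $ still contains $ \beta $, which is fixed implicitly by the logit constraint. As a rough sketch, $ \beta $ can be found numerically by bisection, because $ \sum_j a_j X_j $ increases monotonically with $ \beta $. The scores, `N`, `X_total`, the function names, and the search bracket below are all hypothetical values chosen for illustration.

```python
import math

def occupation_counts(beta, scores, n):
    """a_j = N * exp(beta * X_j) / sum_k exp(beta * X_k)."""
    m = max(beta * s for s in scores)  # shift for numerical stability
    weights = [math.exp(beta * s - m) for s in scores]
    z = sum(weights)
    return [n * w / z for w in weights]

def solve_beta(scores, n, x_total, lo=-50.0, hi=50.0, tol=1e-12):
    """Bisection on beta so that sum_j a_j X_j matches x_total.

    Assumes x_total / n lies strictly between min(scores) and max(scores);
    the bracket [lo, hi] is an arbitrary illustrative choice.
    """
    def constraint_gap(beta):
        counts = occupation_counts(beta, scores, n)
        return sum(a * s for a, s in zip(counts, scores)) - x_total

    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        # The constrained sum grows with beta, so a negative gap means beta is too small.
        if constraint_gap(mid) < 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

scores = [0.0, 1.0, 2.0]  # hypothetical X_j values
N = 1000
X_total = 1300.0          # hypothetical target for sum_j a_j X_j

beta = solve_beta(scores, N, X_total)
counts = occupation_counts(beta, scores, N)
print(beta, counts)
print(sum(counts), sum(a * s for a, s in zip(counts, scores)))
```

The resulting counts are generally not integers; like the Stirling step, this treats the $ a_j $ as continuous, which is reasonable only for large $ N $.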

Thus, the probability for a data point to be assigned to class $ j $ is

$$
P(y = j) = \frac{a_j}{N} = \frac{e^{\beta X_j}}{\sum_{k=1}^{M} e^{\beta X_k}}.
$$
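One way to sanity-check the result is to perturb the softmax counts along a direction that preserves both constraints and confirm that $ \ln W $ decreases. The sketch below uses the Stirling form $ \ln W \approx N \ln N - \sum_j a_j \ln a_j $; the scores, $ \beta $, $ N $, and the perturbation sizes are hypothetical choices.

```python
import math

def softmax_counts(beta, scores, n):
    """a_j = N * exp(beta * X_j) / sum_k exp(beta * X_k)."""
    m = max(beta * s for s in scores)
    weights = [math.exp(beta * s - m) for s in scores]
    z = sum(weights)
    return [n * w / z for w in weights]

def ln_w_stirling(counts):
    """ln W ~ N ln N - sum_j a_j ln a_j, using ln n! ~ n ln n - n."""
    n = sum(counts)
    return n * math.log(n) - sum(a * math.log(a) for a in counts)

scores = [0.0, 1.0, 2.0]  # hypothetical X_j values
N, beta = 1000, 0.5       # hypothetical ensemble size and multiplier
a_star = softmax_counts(beta, scores, N)

# Direction d with sum_j d_j = 0 and sum_j d_j X_j = 0, so that moving
# along it keeps both the normalization and the logit constraint satisfied.
d = [1.0, -2.0, 1.0]

diffs = []
for t in (-20.0, -5.0, 5.0, 20.0):
    shifted = [a + t * dj for a, dj in zip(a_star, d)]
    diffs.append(ln_w_stirling(shifted) - ln_w_stirling(a_star))
print(diffs)  # every entry is negative: a_star is the constrained maximum
```

With three states and two constraints there is only one free direction, so this single family of perturbations covers every admissible move.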

If we absorb the Lagrange multiplier $ \beta $ into the definition of the logit (or set $ \beta = 1 $ by a suitable choice of units), this immediately recovers the familiar softmax function used in multiclass logistic regression:

$$
P(y = j | x) = \frac{e^{X_j}}{\sum_{k=1}^{M} e^{X_k}},
$$

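with $ X_j = \beta_j x $ (including any bias terms as needed). Here $ \beta_j $ denotes the class-$ j $ regression coefficient from Section 2, not the Lagrange multiplier $ \beta $.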
