![image.png](attachment:image.png)

The main problem here is that we had no metric for how good our model is other than accuracy, and optimizing for accuracy alone might reduce the model's ability to generalize.

![image.png](attachment:image.png)

Here, we can see that model 2 is clearly better.

![image.png](attachment:image.png)

But here, we can't say which model is better. This is where a loss function helps us.

![image.png](attachment:image.png)

# 📘 Logistic Regression: Maximum Log-Likelihood

---

## 🧠 Objective

In **logistic regression**, we aim to model the probability that a binary target variable $y \in \{0, 1\}$ is 1 given an input vector $\mathbf{x}$:

$$
P(y = 1 \mid \mathbf{x}) = \sigma(\mathbf{w}^T \mathbf{x}) = \frac{1}{1 + e^{-\mathbf{w}^T \mathbf{x}}}
$$

Where:
- $\mathbf{w}$ is the vector of weights (including the bias, which assumes a constant 1 is appended to $\mathbf{x}$)
- $\sigma(\cdot)$ is the sigmoid function
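
As a minimal sketch of the formula above, the snippet below evaluates $\sigma(\mathbf{w}^T \mathbf{x})$ in plain Python. The names `sigmoid` and `predict_proba` and the example numbers are assumptions made for illustration, not part of the original notes. The two-branch form of `sigmoid` is one common way to avoid overflow in `math.exp` for large negative inputs.

```python
import math
from typing import Sequence


def sigmoid(z: float) -> float:
    """Logistic function 1 / (1 + exp(-z)), written to avoid overflow."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)  # z < 0 here, so exp(z) cannot overflow
    return ez / (1.0 + ez)


def predict_proba(w: Sequence[float], x: Sequence[float]) -> float:
    """P(y = 1 | x) = sigmoid(w^T x); x is assumed to end with a 1 for the bias."""
    z = sum(wj * xj for wj, xj in zip(w, x))
    return sigmoid(z)


w = [2.0, -1.0, 0.5]  # hypothetical weights; the last entry acts as the bias
x = [1.0, 3.0, 1.0]   # hypothetical input with a constant 1 appended
p = predict_proba(w, x)  # w^T x = 2 - 3 + 0.5 = -0.5, so p is below 0.5
print(p)
```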

---

## 🎯 Goal: Maximum Likelihood Estimation (MLE)

We want to find the weights $\mathbf{w}$ that **maximize the probability of observing the data**.

Let the training data be:
- $\{(\mathbf{x}^{(i)}, y^{(i)})\}_{i=1}^n$ where $y^{(i)} \in \{0, 1\}$

Then the **likelihood** is:

$$
\mathcal{L}(\mathbf{w}) = \prod_{i=1}^{n} P(y^{(i)} \mid \mathbf{x}^{(i)}; \mathbf{w})
$$

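For binary classification, the model outputs the probability of the positive class:

$$
\hat{y}^{(i)} = P(y^{(i)} = 1 \mid \mathbf{x}^{(i)}; \mathbf{w}) = \sigma(\mathbf{w}^T \mathbf{x}^{(i)})
$$

so $P(y^{(i)} = 0 \mid \mathbf{x}^{(i)}; \mathbf{w}) = 1 - \hat{y}^{(i)}$. Both cases combine into a single Bernoulli expression, $[\hat{y}^{(i)}]^{y^{(i)}} [1 - \hat{y}^{(i)}]^{1 - y^{(i)}}$.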

So,

$$
\mathcal{L}(\mathbf{w}) = \prod_{i=1}^{n} [\hat{y}^{(i)}]^{y^{(i)}} [1 - \hat{y}^{(i)}]^{1 - y^{(i)}}
$$
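
To make the product concrete, here is a small sketch that evaluates the likelihood for two sets of predicted probabilities. The function name `likelihood` and the toy numbers are invented for illustration. Both sets classify every example correctly at a 0.5 threshold, so accuracy cannot separate them, but the likelihood can.

```python
def likelihood(y_true, y_prob):
    """Product over examples of p^y * (1 - p)^(1 - y)."""
    result = 1.0
    for y, p in zip(y_true, y_prob):
        result *= (p ** y) * ((1.0 - p) ** (1 - y))
    return result


y_true = [1, 0, 1, 1]              # toy labels (assumed)
probs_a = [0.9, 0.2, 0.8, 0.7]     # confident, correct predictions
probs_b = [0.6, 0.4, 0.6, 0.6]     # correct, but barely

lik_a = likelihood(y_true, probs_a)  # 0.9 * 0.8 * 0.8 * 0.7, about 0.4032
lik_b = likelihood(y_true, probs_b)  # 0.6 ** 4, about 0.1296
print(lik_a, lik_b)
```

The more confident model gets the higher likelihood even though both reach 100% accuracy on this toy data.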

---

## 📉 Log Likelihood (to simplify computation)

We take the **log** of the likelihood. The log turns the product into a sum, and because it is monotonically increasing, the maximizing $\mathbf{w}$ is unchanged:

$$
\log \mathcal{L}(\mathbf{w}) = \sum_{i=1}^{n} \left[ y^{(i)} \log(\hat{y}^{(i)}) + (1 - y^{(i)}) \log(1 - \hat{y}^{(i)}) \right]
$$

This is the **log-likelihood function** for logistic regression.
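
The sketch below illustrates why the sum is easier to work with than the product. It assumes 2000 toy examples that each receive probability 0.5; the function name `log_likelihood` is illustrative. The raw product underflows to zero in double precision, while the log-likelihood remains an ordinary finite number.

```python
import math


def log_likelihood(y_true, y_prob):
    """Sum of y*log(p) + (1-y)*log(1-p); assumes every p is strictly in (0, 1)."""
    return sum(
        y * math.log(p) + (1 - y) * math.log(1.0 - p)
        for y, p in zip(y_true, y_prob)
    )


n = 2000
y_true = [1] * n    # toy labels (assumed)
y_prob = [0.5] * n  # every example predicted with probability 0.5

raw_likelihood = 1.0
for p in y_prob:
    raw_likelihood *= p  # 0.5 ** 2000 is too small for a float

ll = log_likelihood(y_true, y_prob)  # equals n * log(0.5), a finite value
print(raw_likelihood, ll)
```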

---

## ❌ Loss Function (to Minimize)

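In machine learning, we usually **minimize** a loss, so we negate the log-likelihood and average it over the $n$ examples to get the **average negative log-likelihood**: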

$$
\mathcal{J}(\mathbf{w}) = -\frac{1}{n} \sum_{i=1}^{n} \left[ y^{(i)} \log(\hat{y}^{(i)}) + (1 - y^{(i)}) \log(1 - \hat{y}^{(i)}) \right]
$$

This is known as the **binary cross-entropy loss** or **log loss**.
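
A minimal sketch of this loss in plain Python follows. The function name `binary_cross_entropy` and the `eps` clipping constant are assumptions, not part of the original notes. Clipping is one common way to keep `log(0)` from occurring when a prediction is exactly 0 or 1.

```python
import math


def binary_cross_entropy(y_true, y_prob, eps=1e-12):
    """Average negative log-likelihood for binary labels."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # clip so log() never receives 0
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(y_true)


# A confident correct prediction costs little, a confident wrong one costs a lot.
loss_right = binary_cross_entropy([1], [0.99])
loss_wrong = binary_cross_entropy([1], [0.01])
# Without clipping this would raise an error, because log(0) is undefined.
loss_clipped = binary_cross_entropy([1], [0.0])
print(loss_right, loss_wrong, loss_clipped)
```

The loss grows without bound as a prediction becomes confidently wrong, which is what makes it more informative than accuracy alone.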

---

## 📌 Summary

- Likelihood = probability of data given parameters
- Log-likelihood simplifies the math
- Maximizing log-likelihood = minimizing binary cross-entropy
- Used as the loss function for logistic regression

---

## 🛠️ Tips

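- Work with the **log-likelihood** rather than the raw likelihood: a product of many probabilities underflows to zero in floating point, while a sum of logs stays representable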
- Use gradient descent to optimize weights using this loss
- Regularization (like L2) can be added to the loss function for better generalization
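
The last two tips can be sketched together. The code below runs batch gradient descent on the binary cross-entropy, using the standard gradient $\nabla \mathcal{J}(\mathbf{w}) = \frac{1}{n} \sum_{i} (\hat{y}^{(i)} - y^{(i)}) \mathbf{x}^{(i)}$ plus an optional L2 term. The function name `fit_logistic`, the hyperparameters, and the toy data are all assumptions. For brevity the penalty is applied to every weight, including the bias, which many implementations avoid.

```python
import math


def sigmoid(z):
    """Logistic function, written to avoid overflow for negative z."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def fit_logistic(X, y, lr=0.1, l2=0.0, epochs=1000):
    """Batch gradient descent on BCE + (l2 / 2) * ||w||^2, starting from w = 0."""
    n, d = len(X), len(X[0])
    w = [0.0] * d
    for _ in range(epochs):
        grad = [0.0] * d
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi))) - yi
            for j in range(d):
                grad[j] += err * xi[j] / n  # (1/n) * sum of (y_hat - y) * x
        # Gradient step; the l2 * w term comes from the penalty.
        w = [wj - lr * (gj + l2 * wj) for wj, gj in zip(w, grad)]
    return w


# Toy 1-D data (assumed); the second column is the constant 1 for the bias.
X = [[0.0, 1.0], [1.0, 1.0], [2.0, 1.0], [3.0, 1.0]]
y = [0, 0, 1, 1]

w_plain = fit_logistic(X, y)
w_reg = fit_logistic(X, y, l2=0.1)

norm_plain = math.sqrt(sum(v * v for v in w_plain))
norm_reg = math.sqrt(sum(v * v for v in w_reg))
print(w_plain, w_reg)
```

On this linearly separable toy set the unregularized weights keep growing as training continues, while the L2 penalty holds them at a smaller norm.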



![image.png](attachment:image.png)