# Logistic Regression: Unveiling the Power of Binary Classification

Logistic Regression is a crucial tool for handling binary outcomes, such as yes/no or true/false responses. In the world of classification tasks, this supervised learning method shines brightly, providing deep insights into estimating probabilities. Unlike linear regression, which deals with continuous values, logistic regression focuses on predicting the chances of a particular outcome based on input features.

## Unveiling the Mechanism:

The process begins much like linear regression, where input features are harmonized like instruments in a symphony, each carrying its weight determined by specific coefficients:

$$ z = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \ldots + \theta_n x_n $$

Here, $ z $ is the linear combination of the inputs: the coefficients $ \theta_1, \ldots, \theta_n $ weight the input features $ x_1, x_2, \ldots, x_n $, and $ \theta_0 $ is the intercept.

However, the appeal of logistic regression lies in its transformative capability. In contrast to linear regression, logistic regression passes $ z $ through the sigmoidal curve of the logistic function, mapping it into the domain of probabilities:

$$ y^p = h(x) = \frac{1}{1 + e^{-z}} = \frac{1}{1 + e^{-(\theta_0 + \theta_1x_1 + \theta_2x_2 + \ldots + \theta_nx_n)}} $$

In this context, $ y^p $ is the predicted probability of the positive class (for example, rain) given the input features $ x_1, x_2, \ldots, x_n $. The sigmoidal curve confines the output strictly between 0 and 1, which is what makes it interpretable as a probability.

<center><img src="./imgs/sigmoid.png"/></center>
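
The two steps above (linear combination, then sigmoid) can be sketched in a few lines of NumPy. This is only an illustration: the function names `sigmoid` and `predict_proba`, the coefficient values, and the feature values are assumptions made up for the example, not part of the original text.

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps any real z into the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(theta, x):
    """Predicted probability y^p = h(x) for a single example.

    theta: array of shape (n + 1,), with theta[0] the intercept theta_0.
    x:     array of shape (n,), the input features x_1..x_n.
    """
    z = theta[0] + np.dot(theta[1:], x)  # linear combination z
    return sigmoid(z)

# Hypothetical coefficients and features, chosen only for illustration.
theta = np.array([-1.0, 2.0, 0.5])
x = np.array([1.0, 2.0])

# z = -1 + 2*1 + 0.5*2 = 2, so y^p = 1 / (1 + e^-2)
print(predict_proba(theta, x))  # about 0.881
```

Whatever the value of $ z $, the result stays strictly between 0 and 1, and $ z = 0 $ maps to exactly 0.5.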


#### Embracing Probability:

The output $ y^p $ is read as the probability that the event of interest occurs. For instance, $ y^p = 0.8 $ means an 80% chance of the event, such as rain, given the factors being considered.


## Cross-Entropy Loss and Binary Cross-Entropy Loss

Cross-entropy loss is a commonly used loss function in machine learning, especially in classification tasks. It measures the dissimilarity between two probability distributions: the predicted probabilities output by the model and the actual true distribution of the labels.

In the context of binary classification, where we have only two classes (usually denoted as 0 and 1), binary cross-entropy loss is a specific form of cross-entropy loss. It is particularly suited for binary classification problems, where the task is to predict whether an example belongs to one class or another.

#### Cross-Entropy Loss

Cross-entropy loss measures the difference between two probability distributions: the predicted probability distribution ($ y_p $, the same quantity written $ y^p $ earlier) output by the model and the true probability distribution ($ y $) represented by the labels.

The general formula for cross-entropy loss is:

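$$ J(\theta) = - \frac{1}{m} \sum_{i=1}^{m} \sum_{c=1}^{C} y_c^{(i)} \log(y_{p,c}^{(i)}) $$

where:
- $ m $ is the number of examples and $ C $ is the number of classes.
- $ y_c^{(i)} $ is 1 if the $ i $-th example belongs to class $ c $ and 0 otherwise.
- $ y_{p,c}^{(i)} $ is the predicted probability of class $ c $ for the $ i $-th example.
- The sums are taken over all examples in the dataset and over all classes.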
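
As a rough sketch, the cross-entropy between true labels and predicted class probabilities can be computed as below. The function name `cross_entropy`, the one-hot array layout, and the toy numbers are assumptions for illustration. The result is averaged over the examples to match the $ \frac{1}{m} $ factor used in the binary formula; dropping the division gives the plain sum.

```python
import numpy as np

def cross_entropy(y_true, y_pred):
    """Average cross-entropy between true and predicted class distributions.

    y_true: shape (m, C), one-hot rows (the true distribution per example).
    y_pred: shape (m, C), predicted probabilities, each row summing to 1.
    Assumes every entry of y_pred is strictly positive.
    """
    m = y_true.shape[0]
    return -np.sum(y_true * np.log(y_pred)) / m

# Two made-up examples with three classes each.
y_true = np.array([[1.0, 0.0, 0.0],
                   [0.0, 1.0, 0.0]])
y_pred = np.array([[0.7, 0.2, 0.1],
                   [0.1, 0.8, 0.1]])

# Only the probability assigned to the true class contributes:
# -(log 0.7 + log 0.8) / 2
loss = cross_entropy(y_true, y_pred)
print(loss)  # about 0.290
```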

#### Binary Cross-Entropy Loss

In binary classification, we have only two classes (0 and 1), and the label $ y_i $ for each example is either 0 or 1. Binary cross-entropy loss is a special case of cross-entropy loss tailored specifically for this scenario.

For binary classification, the formula for binary cross-entropy loss simplifies to:

$$ J(\theta) = - \frac{1}{m} \sum_{i=1}^{m} [y^{(i)} \log(y_p^{(i)}) + (1-y^{(i)})\log(1-y_p^{(i)})] $$

where:
- $ m $ is the number of examples in the dataset.
- $ y^{(i)} $ is the true label (either 0 or 1) for the $ i $-th example.
- $ y_p^{(i)} $ is the predicted probability of the positive class for the $ i $-th example.
- The sum is taken over all examples in the dataset.

Binary cross-entropy loss quantifies the difference between the true labels and the predicted probabilities for binary classification problems. It penalizes predictions that deviate from the true labels, encouraging the model to output accurate probabilities for each class.
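
A minimal sketch of the binary cross-entropy formula follows. The function name and the toy data are made up for illustration, and the clipping constant `eps` is an added assumption (not part of the formula) that keeps `log(0)` from appearing when a prediction is exactly 0 or 1.

```python
import numpy as np

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    """Mean binary cross-entropy over m examples.

    y_true: labels in {0, 1}; y_pred: predicted probability of class 1.
    eps is an assumption added for numerical safety only.
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.clip(np.asarray(y_pred, dtype=float), eps, 1.0 - eps)
    per_example = y_true * np.log(y_pred) + (1.0 - y_true) * np.log(1.0 - y_pred)
    return -np.mean(per_example)

# Made-up labels and predictions.
y = [1, 0, 1, 0]
y_p = [0.9, 0.1, 0.8, 0.3]

loss = binary_cross_entropy(y, y_p)
print(loss)  # about 0.198
```

Confident, correct predictions pull the mean towards 0, while a single confident mistake can dominate it.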

#### Unveiling the Formula for Binary Classification:

For a single example, the heart of Log Loss resides in a concise yet powerful formula:

$$ J(\theta) = -y \cdot \log(y^p) - (1-y) \cdot \log(1-y^p) $$

Here, $y$ signifies the actual outcome, while $y^p$ denotes the predicted probability.

#### Decoding Log Loss:

The essence of Log Loss reveals itself in its capacity to impose a stringent penalty for incorrect predictions, guiding models towards precision and accuracy. Let's examine its influence through examples:

**1. Precision in Prediction:**
   - When the actual outcome is $y = 1$ and the model predicts $y^p \approx 1$:
     - Log Loss $J(\theta)$ approaches 0, signaling a near-ideal prediction.
     - Mathematically:
       $$ y = 1 \quad \text{and} \quad y^p \approx 1 \implies J(\theta) \approx 0 $$

     Let's assume $ y = 1 $ (actual outcome) and $ y^p = 0.95 $ (predicted probability).  
     - Using the Log Loss formula: $ J(\theta) = -y \cdot \log(y^p) - (1-y) \cdot \log(1-y^p) $  
     - Substituting the values: $ J(\theta) = -(1) \cdot \log(0.95) - (1-1) \cdot \log(1-0.95) $  
     - Calculating: $ J(\theta) \approx 0.051 $ (approximately)

**2. Error Amplification:**
   - Conversely, when predictions stray from reality:
     - Log Loss grows logarithmically, and without bound, as the probability assigned to the true class shrinks.
     - Mathematically:
       $$ y = 1 \quad \text{but} \quad y^p \rightarrow 0 \implies J(\theta) \rightarrow \infty $$
   
     Let's assume $ y = 1 $ (actual outcome) and $ y^p = 0.2 $ (predicted probability).  
     - Using the Log Loss formula: $ J(\theta) = -y \cdot \log(y^p) - (1-y) \cdot \log(1-y^p) $  
     - Substituting the values: $ J(\theta) = -(1) \cdot \log(0.2) - (1-1) \cdot \log(1-0.2) $  
     - Calculating: $ J(\theta) \approx 1.609 $ (approximately)  

**3. Sensitivity to Misclassifications:**
   - The same sensitivity applies when $y = 0$:
     - Accurate predictions (where $ y^p \approx 0 $) entail minimal Log Loss.
     - However, misclassifications (where $ y^p \approx 1 $) incur substantial Log Loss.
     - Mathematically:
       $$ y = 0 \quad \text{and} \quad y^p \approx 0 \implies J(\theta) \approx 0 $$
       $$ y = 0 \quad \text{but} \quad y^p \approx 1 \implies J(\theta) \rightarrow \infty $$

     Let's assume $ y = 0 $ (actual outcome) and $ y^p = 0.05 $ (predicted probability).  
     - Using the Log Loss formula: $ J(\theta) = -y \cdot \log(y^p) - (1-y) \cdot \log(1-y^p) $
     - Substituting the values: $ J(\theta) = -(0) \cdot \log(0.05) - (1-0) \cdot \log(1-0.05) $
     - Calculating: $ J(\theta) \approx 0.051 $ (approximately)
    
     Let's assume $ y = 0 $ (actual outcome) and $ y^p = 0.95 $ (predicted probability).  
     - Using the Log Loss formula: $ J(\theta) = -y \cdot \log(y^p) - (1-y) \cdot \log(1-y^p) $
     - Substituting the values: $ J(\theta) = -(0) \cdot \log(0.95) - (1-0) \cdot \log(1-0.95) $
     - Calculating: $ J(\theta) \approx 2.996 $ (large but finite; the loss tends to $ \infty $ only as $ y^p \rightarrow 1 $)

Log Loss thus becomes a beacon for model refinement, compelling the model to yield accurate probabilities for each class. By penalizing deviations from ground truth, and penalizing confident mistakes most heavily, Log Loss propels classification models towards enhanced performance.
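
The four worked cases can be reproduced with a few lines of Python, using the natural logarithm. The helper name `log_loss_single` is an assumption for this sketch.

```python
import math

def log_loss_single(y, y_p):
    """Log Loss for one example: -y*log(y_p) - (1-y)*log(1-y_p)."""
    return -y * math.log(y_p) - (1 - y) * math.log(1 - y_p)

# (actual outcome, predicted probability) pairs from the examples above.
cases = [(1, 0.95), (1, 0.2), (0, 0.05), (0, 0.95)]
for y, y_p in cases:
    print(f"y={y}, y_p={y_p}: J = {log_loss_single(y, y_p):.3f}")
# y=1, y_p=0.95: J = 0.051
# y=1, y_p=0.2: J = 1.609
# y=0, y_p=0.05: J = 0.051
# y=0, y_p=0.95: J = 2.996
```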

## Gradient Descent for Parameter Optimisation
Begin by evaluating the log loss function $ J(\theta) $ from the predicted probabilities and the actual labels. Then compute the gradient of the loss with respect to each parameter $ \theta_j $ using the following partial derivative:

$$\frac{\partial J(\theta)}{\partial \theta_j} = \frac{1}{m} \sum_{i=1}^{m} (y_p^{(i)} - y^{(i)}) \cdot x_j^{(i)}$$

Here $ x_0^{(i)} = 1 $, so the same formula also covers the intercept $ \theta_0 $.
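
A short derivation, sketched for a single example (dropping the index $ i $ and taking $ x_0 = 1 $ for the intercept), shows where this form comes from. It applies the chain rule through $ y^p = \frac{1}{1 + e^{-z}} $ and $ z = \theta_0 + \theta_1 x_1 + \ldots + \theta_n x_n $:

$$ \frac{\partial J}{\partial y^p} = -\frac{y}{y^p} + \frac{1-y}{1-y^p} = \frac{y^p - y}{y^p (1 - y^p)}, \qquad \frac{\partial y^p}{\partial z} = y^p (1 - y^p), \qquad \frac{\partial z}{\partial \theta_j} = x_j $$

$$ \frac{\partial J}{\partial \theta_j} = \frac{\partial J}{\partial y^p} \cdot \frac{\partial y^p}{\partial z} \cdot \frac{\partial z}{\partial \theta_j} = (y^p - y) \cdot x_j $$

The factor $ y^p (1 - y^p) $ cancels, and averaging the result over the $ m $ examples gives the gradient formula.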

Next, update all parameters $ \theta_j $ simultaneously by stepping against their gradients, scaled by the learning rate $ \alpha $, and repeat both steps until the loss stops decreasing:

$$\theta_j = \theta_j - \alpha \cdot \frac{\partial J(\theta)}{\partial \theta_j}$$
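
The loop below is a minimal sketch of batch gradient descent built from the gradient and update formulas. Everything not in the text is an assumption for illustration: the function names, the toy data, the values of `alpha` and `n_iters`, the fixed iteration count used instead of a convergence test, and the 0.5 threshold used to turn probabilities into labels. The first column of `X` is the constant 1 so that `theta[0]` plays the role of $ \theta_0 $.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_loss(theta, X, y, eps=1e-12):
    """Mean log loss; eps clipping is a numerical-safety assumption."""
    y_p = np.clip(sigmoid(X @ theta), eps, 1.0 - eps)
    return -np.mean(y * np.log(y_p) + (1.0 - y) * np.log(1.0 - y_p))

def gradient(theta, X, y):
    """Vector of partial derivatives: (1/m) * sum_i (y_p - y) * x_j."""
    m = X.shape[0]
    return X.T @ (sigmoid(X @ theta) - y) / m

def fit(X, y, alpha=0.1, n_iters=5000):
    """Repeat theta_j := theta_j - alpha * dJ/dtheta_j for all j at once."""
    theta = np.zeros(X.shape[1])
    for _ in range(n_iters):
        theta = theta - alpha * gradient(theta, X, y)
    return theta

# Made-up one-feature dataset; column 0 is the constant x_0 = 1.
X = np.array([[1.0, 0.5], [1.0, 1.0], [1.0, 1.5],
              [1.0, 3.0], [1.0, 3.5], [1.0, 4.0]])
y = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0])

theta = fit(X, y)
print(log_loss(np.zeros(2), X, y))  # log(2), about 0.693, at theta = 0
print(log_loss(theta, X, y))        # smaller after training
print((sigmoid(X @ theta) >= 0.5).astype(int))  # labels via an assumed 0.5 threshold
```

Because all components of `theta` are updated from the same gradient vector, the update is simultaneous, as the formula intends.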


## One-vs-All Classification:

In multi-class classification scenarios, where the task involves predicting among more than two classes, logistic regression can be extended using the One-vs-All (OvA) method. This approach enables us to train multiple binary classification models, each focused on distinguishing one class from the rest. Subsequently, the class with the highest predicted probability across these models is assigned as the final prediction.

#### Method Overview:

1. **Model Construction:**
    - **Model-1:** Distinguishes *Group 1* from *Everything Else*
    - **Model-2:** Distinguishes *Group 2* from *Everything Else*
    - **Model-3:** Distinguishes *Group 3* from *Everything Else*
    - This pattern continues for each class in the dataset.
  
2. **Training:**
    - Each model is trained independently using logistic regression.
    - Training involves minimizing the log loss, a common loss function for logistic regression, to optimize the model's parameters.
  
3. **Prediction:**
    - To make a prediction for a given input, all models are used to compute the probabilities for each class.
    - The class with the highest predicted probability becomes the final predicted class.
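
The three steps above can be sketched as follows. This is an illustrative implementation under stated assumptions, not a reference one: the function names, the plain gradient-descent trainer with fixed `alpha` and `n_iters`, the integer group labels 1 to 3, and the toy 2-D data are all made up for the example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_binary(X, y, alpha=0.1, n_iters=2000):
    """Train one binary logistic regression model by gradient descent."""
    theta = np.zeros(X.shape[1])
    for _ in range(n_iters):
        theta = theta - alpha * (X.T @ (sigmoid(X @ theta) - y)) / X.shape[0]
    return theta

def fit_one_vs_all(X, labels, classes):
    """One parameter row per class: class k (label 1) vs everything else (label 0)."""
    return np.array([fit_binary(X, (labels == k).astype(float)) for k in classes])

def predict_one_vs_all(Theta, X, classes):
    """Score every input with every model and pick the most probable class."""
    probs = sigmoid(X @ Theta.T)  # shape (m, number of classes)
    return classes[np.argmax(probs, axis=1)]

# Made-up data: three groups of 2-D points; column 0 is the constant 1.
X = np.array([[1.0, -2.0, -1.0], [1.0, -2.5, -1.5], [1.0, -1.5, -1.2],   # Group 1
              [1.0,  2.0, -1.0], [1.0,  2.5, -1.5], [1.0,  1.5, -1.2],   # Group 2
              [1.0,  0.0,  2.0], [1.0,  0.5,  2.5], [1.0, -0.5,  2.2]])  # Group 3
labels = np.array([1, 1, 1, 2, 2, 2, 3, 3, 3])
classes = np.array([1, 2, 3])

Theta = fit_one_vs_all(X, labels, classes)
print(predict_one_vs_all(Theta, X, classes))
```

Each row of `Theta` is trained without reference to the others, which is what makes the models independent.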

#### Example:

Let's consider a scenario where we have a dataset with three classes: *Cat*, *Dog*, and *Rabbit*. We can apply the OvA method to build three binary logistic regression models:

1. **Model-1:** *Cat* vs *Everything Else*
2. **Model-2:** *Dog* vs *Everything Else*
3. **Model-3:** *Rabbit* vs *Everything Else*

For instance, when presented with an image of an animal, each model calculates the probability that the image belongs to its respective class. If the probabilities are as follows:

- Model-1 (*Cat*): 0.8
- Model-2 (*Dog*): 0.5
- Model-3 (*Rabbit*): 0.3

The OvA approach would predict the input as a *Cat*, as it has the highest predicted probability.
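
The selection step in this example amounts to an argmax over the per-model probabilities, as in the small sketch below (the dictionary layout is an assumption for illustration). The three values need not sum to 1, because each one comes from a separately trained binary model.

```python
# Probabilities reported by the three binary models in the example.
probs = {"Cat": 0.8, "Dog": 0.5, "Rabbit": 0.3}

# Pick the class whose model reports the highest probability.
predicted = max(probs, key=probs.get)
print(predicted)  # Cat

# The scores are independent, so their total is not constrained to 1.
total = sum(probs.values())
print(round(total, 2))  # 1.6
```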


#### Advantages:

- **Simplicity:** OvA is straightforward to implement and understand.
- **Flexibility:** It can be applied with any binary classification algorithm, not limited to logistic regression.
- **Interpretability:** Each binary model provides insights into the decision boundaries for its respective class.

#### Limitations:

- **Imbalance:** If classes are imbalanced, predictions may be biased towards the majority class. The one-vs-rest split adds to this, since each model sees one class against all the others combined.
- **Complexity:** As the number of classes increases, the number of binary models also increases, potentially leading to scalability issues.
- **Overlap:** Decision boundaries between classes may overlap, leading to ambiguous predictions.

In summary, the One-vs-All classification method expands logistic regression to handle multi-class classification tasks effectively. By training multiple binary models, it allows for the prediction of the most probable class among several options, making it a valuable technique in machine learning pipelines.


## Regularization

Regularization in logistic regression follows a similar principle to linear regression, but with a modified cost function.

The standard cost function for logistic regression, known as Log Loss or Binary Cross Entropy Loss, is given by:

$$ J(\theta) = -\frac{1}{m}\sum_{i=1}^{m} [y^{(i)} \log(y_p^{(i)}) + (1-y^{(i)})\log(1-y_p^{(i)})] $$

where $ m $ is the number of training examples, $ y^{(i)} $ is the actual label of the $ i $-th example, $ y_p^{(i)} $ is the predicted probability of the positive class for the $ i $-th example, and $ \theta $ represents the parameters (weights) of the model.

Regularization adds a penalty term to the cost function to discourage overly complex models. This penalty term helps prevent overfitting by penalizing large parameter values. The regularized cost function for logistic regression is:

$$ J(\theta) = -\frac{1}{m}\sum_{i=1}^{m} [y^{(i)} \log(y_p^{(i)}) + (1-y^{(i)})\log(1-y_p^{(i)})] + \frac{\lambda}{2m}\sum_{j = 1}^{n}\theta_{j}^{2} $$

Here, $ \lambda $ is the regularization parameter, and $ n $ is the number of features. The second term in the equation, $ \frac{\lambda}{2m}\sum_{j = 1}^{n}\theta_{j}^{2} $, is the L2 (ridge) regularization term. It penalizes large weights $ \theta_{j} $ by adding their squared values to the cost function. By convention the sum starts at $ j = 1 $, so the intercept $ \theta_0 $ is not penalized. The factor $ \frac{\lambda}{2m} $ controls the strength of regularization relative to the data term.

Regularization helps in controlling overfitting and improving the generalization of the model to unseen data. It achieves this by discouraging overly complex models that fit the training data too closely.
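
A minimal sketch of the regularized cost and its gradient is given below. Several things here are assumptions rather than part of the text: the function names, the toy data, the `eps` clipping, and the choice to leave `theta[0]` out of the penalty (a common convention; summing over all of `theta` instead would include the $ j = 0 $ term). The gradient of the penalty, $ \frac{\lambda}{m}\theta_j $, follows from differentiating $ \frac{\lambda}{2m}\theta_j^2 $ and is not stated in the text.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def regularized_log_loss(theta, X, y, lam, eps=1e-12):
    """Log loss plus (lam / 2m) * sum of squared weights.

    Assumption: column 0 of X is the constant 1 and theta[0] is not penalized.
    """
    m = X.shape[0]
    y_p = np.clip(sigmoid(X @ theta), eps, 1.0 - eps)
    data_term = -np.mean(y * np.log(y_p) + (1.0 - y) * np.log(1.0 - y_p))
    penalty = lam / (2.0 * m) * np.sum(theta[1:] ** 2)
    return data_term + penalty

def regularized_gradient(theta, X, y, lam):
    """Gradient of the regularized cost with respect to theta."""
    m = X.shape[0]
    grad = X.T @ (sigmoid(X @ theta) - y) / m
    grad[1:] += lam / m * theta[1:]  # derivative of the penalty term
    return grad

# Made-up data: column 0 is the constant 1, followed by two features.
X = np.array([[1.0, 0.5, 1.0], [1.0, 1.5, -0.5],
              [1.0, -1.0, 0.3], [1.0, 2.0, 2.0]])
y = np.array([1.0, 0.0, 1.0, 1.0])
theta = np.array([0.5, -1.0, 2.0])

print(regularized_log_loss(theta, X, y, lam=0.0))  # data term only
print(regularized_log_loss(theta, X, y, lam=3.0))  # data term + penalty
print(regularized_gradient(theta, X, y, lam=3.0))
```

Setting `lam=0.0` recovers the unregularized cost, and larger values of `lam` push the weights (but not the intercept) towards zero during training.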