## Logistic Regression: Base Concepts in Detail

**Key pointers:**

* Supervised algorithm used to solve classification problems
* Models the probability of a certain event
* Works well on linearly separable datasets, where the outcomes can be separated into distinct parts (e.g. a binary response such as yes/no)
* Generally used for binary classification problems
    
**Types of Logistic Regression:**

1. Binary Logistic Regression - only two separable classes in the target variable, without any ordering
2. Multinomial Logistic Regression - three or more separable classes in the target variable, without any ordering
3. Ordinal Logistic Regression - three or more separable classes in the target variable, with an ordering present (low, medium, high)
    
    
Logistic regression builds on linear regression as its baseline model, where we fit the best straight line through all the points in the data, e.g. $Y = b_0 + b_1x_1 + b_2x_2 + b_3x_3 + \dots + b_nx_n$.

The value predicted by the linear regression equation cannot be used directly, because the expected output of logistic regression is a class (a categorical value), not a continuous value.

To tackle this issue we need to bring the predicted value into a certain range, but which range? Since we want to predict the probability of an event, the result needs to be a probability value, which always lies in the range [0, 1].

For this we take the odds of the event, $\frac{p}{1-p}$, but odds can only take non-negative values in $[0, \infty)$. To remove this restriction to positive values we take the log of the odds, $\log\left(\frac{p}{1-p}\right)$, which can be any real number and can therefore be set equal to the linear equation:

$$\log\left(\frac{p}{1-p}\right) = b_0 + b_1x_1 + b_2x_2 + b_3x_3 + \dots + b_nx_n$$

To recover the probability from this equation we take the exponential of both sides and solve for $p$, which gives

$$p = \frac{1}{1 + e^{-(b_0 + b_1x_1 + b_2x_2 + b_3x_3 + \dots + b_nx_n)}}$$

This is also known as the sigmoid function.
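The relationship between the linear equation, the log of the odds and the sigmoid can be checked with a small sketch. The coefficients and feature values below are hypothetical and chosen only for illustration; `sigmoid` and `logit` are helper names introduced here, not part of any library.

```python
import math

def sigmoid(z: float) -> float:
    """Map any real number z to a probability between 0 and 1."""
    # Two branches avoid overflow in math.exp for large |z|.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def logit(p: float) -> float:
    """Log of the odds p / (1 - p); the inverse of the sigmoid."""
    return math.log(p / (1.0 - p))

# Hypothetical coefficients and one observation (illustration only).
b0, b1, b2 = -1.0, 0.8, 0.5
x1, x2 = 2.0, 1.0

z = b0 + b1 * x1 + b2 * x2   # linear part, can be any real number
p = sigmoid(z)               # squeezed into the probability range

print(f"z = {z:.2f}, p = {p:.3f}")
print(f"log-odds recovered from p: {logit(p):.2f}")
```

Here the linear part evaluates to 1.1 and the resulting probability is roughly 0.75; applying `logit` to that probability gives back the linear value, which is the "log of the odds" step read in reverse.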


Now, to predict the classes, we define a threshold, identify which class each probability falls into, and get the final results.
   ![Screenshot%202022-08-04%20at%208.24.40%20PM.png](attachment:Screenshot%202022-08-04%20at%208.24.40%20PM.png)
    
    
Here z is the linear regression equation, i.e. $z = b_0 + b_1x_1 + b_2x_2 + \dots + b_nx_n$.
    
    
The sigmoid function returns a probability value between 0 and 1. This probability value is then mapped to a discrete class, which is either “0” or “1”. In order to map the probability to a discrete class (pass/fail, yes/no, true/false), we select a threshold value. This threshold value is called the decision boundary. Probability values at or above the threshold are mapped to class 1, and values below it are mapped to class 0.

Mathematically, it can be expressed as follows:

    p ≥ 0.5 => class = 1

    p < 0.5 => class = 0

Generally, the decision boundary is set to 0.5. So, if the probability value is 0.8 (> 0.5), we map the observation to class 1. Similarly, if the probability value is 0.2 (< 0.5), we map the observation to class 0. This is represented in the graph below.
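The thresholding rule can be sketched in a few lines of Python. The function name `predict_class` and the sample probabilities are made up for illustration.

```python
def predict_class(p: float, threshold: float = 0.5) -> int:
    """Return class 1 if p is at or above the threshold, else class 0."""
    return 1 if p >= threshold else 0

# Hypothetical predicted probabilities.
probabilities = [0.8, 0.2, 0.5, 0.49]
classes = [predict_class(p) for p in probabilities]
print(classes)  # [1, 0, 1, 0]

# Raising the threshold makes the model more reluctant to predict class 1.
print(predict_class(0.6, threshold=0.7))  # 0
```

The threshold does not have to be 0.5; moving it trades false positives against false negatives.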
    
Similar to linear regression, logistic regression also has some assumptions:
    

### Assumptions of Logistic Regression

1. The target value needs to be discrete in nature (binary, multinomial or ordinal).
2. Observations should be independent of each other (no duplicate or repeated measurements).
3. Independent variables should have little or no multicollinearity, i.e. they should not be highly correlated with each other.
4. It requires a large sample size to give good results.
5. It is better if the input features are scaled to the same range (e.g. min-max scaling), although this is not strictly required.
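As a minimal sketch of the min-max scaling mentioned in the last point, the helper below rescales a feature to the range [0, 1]. The function name and the two sample features are hypothetical.

```python
def min_max_scale(values):
    """Rescale a list of numbers linearly so min -> 0.0 and max -> 1.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant feature carries no spread; map everything to 0.0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# Two hypothetical features on very different scales.
ages = [20, 30, 40, 60]
incomes = [20_000, 50_000, 80_000, 120_000]

print(min_max_scale(ages))     # [0.0, 0.25, 0.5, 1.0]
print(min_max_scale(incomes))  # values between 0.0 and 1.0
```

After scaling, both features span the same range, so neither dominates the other purely because of its units.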

### Cost Function in Logistic Regression

The cost function measures how correctly, or how wrongly, our model predicts the relationship between x and y.
    
   ![Screenshot%202022-08-04%20at%209.48.29%20PM.png](attachment:Screenshot%202022-08-04%20at%209.48.29%20PM.png)
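The formula in the screenshot is not reproduced in the text, so the sketch below assumes the usual cost for logistic regression, the binary cross-entropy (log loss): the average of `-(y * log(p) + (1 - y) * log(1 - p))` over all observations. The function name, the clipping constant `eps` and the sample data are illustrative choices.

```python
import math

def log_loss(y_true, y_prob, eps=1e-15):
    """Average binary cross-entropy between labels and predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(y_true)

y_true = [1, 0, 1, 0]
mostly_right = [0.9, 0.1, 0.8, 0.2]   # probabilities close to the true labels
mostly_wrong = [0.1, 0.9, 0.2, 0.8]   # probabilities far from the true labels

print(log_loss(y_true, mostly_right))  # small cost
print(log_loss(y_true, mostly_wrong))  # much larger cost
```

Confident wrong predictions are penalized heavily, which is what pushes the solver towards weights that give well-calibrated probabilities.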

**Types of Solvers in scikit-learn**


Scikit-learn ships with several solvers for logistic regression. Each solver tries to find the parameter weights that minimize a cost function. Here are five of them:

**newton-cg** - A Newton method. Newton methods use an exact Hessian matrix. It's slow for large datasets, because it computes the second derivatives.

**lbfgs** - Stands for Limited-memory Broyden–Fletcher–Goldfarb–Shanno. It approximates the second derivative matrix updates with gradient evaluations. It stores only the last few updates, so it saves memory. It isn't super fast with large datasets. It is the default solver as of scikit-learn version 0.22.0.


**liblinear** - Library for Large Linear Classification. Uses a coordinate descent algorithm. Coordinate descent is based on minimizing a multivariate function by solving univariate optimization problems in a loop. In other words, it moves toward the minimum in one direction at a time. It was the default solver for scikit-learn versions earlier than 0.22.0. It performs pretty well with high dimensionality. It does have a number of drawbacks: it can get stuck, is unable to run in parallel, and can only solve multi-class logistic regression with one-vs.-rest.

**sag** - Stochastic Average Gradient descent. A variation of gradient descent and incremental aggregated gradient approaches that uses a random sample of previous gradient values. Fast for big datasets.


**saga** - Extension of sag that also allows for L1 regularization. Should generally train faster than sag.

L1: penalty supported by the liblinear and saga solvers.

L2: penalty supported by the newton-cg, lbfgs, liblinear, sag and saga solvers.

elasticnet: penalty supported only by the saga solver.
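A rough sketch of how a solver and penalty are chosen together in scikit-learn's `LogisticRegression`, using synthetic data. The helper `is_supported` is a hypothetical name introduced here; it relies on the estimator raising a `ValueError` at fit time for an unsupported pairing, and the way penalties are specified may differ between scikit-learn versions, so treat this as an assumption to check against your installed release.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary classification data, for illustration only.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

def is_supported(solver: str, penalty: str) -> bool:
    """Try to fit with the given solver/penalty pair; False if it is rejected."""
    try:
        LogisticRegression(solver=solver, penalty=penalty, max_iter=1000).fit(X, y)
        return True
    except ValueError:
        return False

print("lbfgs + l2:    ", is_supported("lbfgs", "l2"))
print("liblinear + l1:", is_supported("liblinear", "l1"))
print("lbfgs + l1:    ", is_supported("lbfgs", "l1"))  # expected to be rejected
```

Checking the pairing up front like this is mainly useful when solver and penalty come from a hyperparameter search grid, where invalid combinations would otherwise fail mid-search.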

The notes below are taken from: https://stackoverflow.com/questions/38640109/logistic-regression-python-solvers-definitions

   

liblinear is fast with small datasets, but has problems with saddle points and can't be parallelized over multiple processor cores. It can only use one-vs.-rest to solve multi-class problems. It also penalizes the intercept, which isn't good for interpretation.

lbfgs avoids these drawbacks and is relatively fast. It's the best choice for most cases without a really large dataset. 

### Applications of Logistic Regression:
1. Credit Scoring
2. Spam detection
3. Disease detection (whether a disease is present or not)

### Model Evaluation:
   
   

A confusion matrix is a tool for summarizing the performance of a classification algorithm. A confusion matrix will give us a clear picture of classification model performance and the types of errors produced by the model. It gives us a summary of correct and incorrect predictions broken down by each category. The summary is represented in a tabular form.


Four types of outcomes are possible while evaluating a classification model performance. These four outcomes are described below:-


**True Positives (TP)** – True Positives occur when we predict an observation belongs to a certain class and the observation actually belongs to that class.


**True Negatives (TN)** – True Negatives occur when we predict an observation does not belong to a certain class and the observation actually does not belong to that class.


**False Positives (FP)** – False Positives occur when we predict an observation belongs to a certain class but the observation actually does not belong to that class. This type of error is called **Type I error.**



**False Negatives (FN)** – False Negatives occur when we predict an observation does not belong to a certain class but the observation actually belongs to that class. This can be a very serious error (for example, a disease that goes undetected) and it is called **Type II error.**



These four outcomes are summarized in a confusion matrix given below.
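As a rough illustration, the four outcomes can be counted directly from true and predicted labels. The helper name `confusion_counts` and the labels below are made up for this sketch.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, TN, FP and FN for a binary problem."""
    tp = tn = fp = fn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive and actual == positive:
            tp += 1   # predicted positive, actually positive
        elif predicted == positive:
            fp += 1   # predicted positive, actually negative (Type I error)
        elif actual == positive:
            fn += 1   # predicted negative, actually positive (Type II error)
        else:
            tn += 1   # predicted negative, actually negative
    return tp, tn, fp, fn

# Hypothetical labels for eight observations.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0, 1, 0]

tp, tn, fp, fn = confusion_counts(y_true, y_pred)
print("                 predicted 1   predicted 0")
print(f"actual 1         TP = {tp}        FN = {fn}")
print(f"actual 0         FP = {fp}        TN = {tn}")
```

For these labels the counts are TP = 3, TN = 3, FP = 1 and FN = 1.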

   
   
  

### Precision


**Precision** can be defined as the percentage of correctly predicted positive outcomes out of all the predicted positive outcomes. It can be given as the ratio of true positives (TP) to the sum of true and false positives (TP + FP). 


So, **Precision** identifies the proportion of correctly predicted positive outcomes. It is more concerned with the positive class than the negative class.



Mathematically, precision can be defined as the ratio of `TP to (TP + FP).`




### Recall


Recall can be defined as the percentage of correctly predicted positive outcomes out of all the actual positive outcomes.
It can be given as the ratio of true positives (TP) to the sum of true positives and false negatives (TP + FN). **Recall** is also called **Sensitivity**.


**Recall** identifies the proportion of correctly predicted actual positives.


Mathematically, recall can be given as the ratio of `TP to (TP + FN).`





### f1-score


**f1-score** is the harmonic mean of precision and recall, `2 * (precision * recall) / (precision + recall)`. The best possible **f1-score** is 1.0 and the worst is 0.0. Because it is a harmonic mean, **f1-score** always lies between precision and recall and is pulled towards the lower of the two, so it cannot be high unless both are high. When the classes are imbalanced, the weighted average of `f1-score` is a better basis for comparing classifier models than global accuracy.
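Precision, recall and f1-score can all be computed from the confusion matrix counts. The sketch below uses hypothetical counts; the function names are illustrative and the zero-division fallbacks are a convention chosen here.

```python
def precision(tp: int, fp: int) -> float:
    """TP / (TP + FP); 0.0 if nothing was predicted positive."""
    return tp / (tp + fp) if (tp + fp) else 0.0

def recall(tp: int, fn: int) -> float:
    """TP / (TP + FN); 0.0 if there are no actual positives."""
    return tp / (tp + fn) if (tp + fn) else 0.0

def f1_score(p: float, r: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if (p + r) else 0.0

# Hypothetical counts from a confusion matrix.
tp, fp, fn = 30, 10, 20

p = precision(tp, fp)   # 30 / 40
r = recall(tp, fn)      # 30 / 50
f1 = f1_score(p, r)

print(f"precision = {p:.3f}, recall = {r:.3f}, f1 = {f1:.3f}")
```

With these counts precision is 0.75, recall is 0.60 and the f1-score is about 0.667, which sits between the two and closer to the lower value.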







### ROC Curve


Another tool to measure the classification model performance visually is **ROC Curve**. ROC Curve stands for **Receiver Operating Characteristic Curve**. An **ROC Curve** is a plot which shows the performance of a classification model at various 
classification threshold levels. 



The **ROC Curve** plots the **True Positive Rate (TPR)** against the **False Positive Rate (FPR)** at various threshold levels.



**True Positive Rate (TPR)** is also called **Recall**. It is defined as the ratio of `TP to (TP + FN).`



**False Positive Rate (FPR)** is defined as the ratio of `FP to (FP + TN).`




Each point on the ROC Curve gives the TPR (True Positive Rate) and FPR (False Positive Rate) at a single threshold level, and the curve as a whole shows the general performance of the classifier across the various threshold levels. So, an ROC Curve plots TPR vs FPR at different classification threshold levels. If we lower the threshold, more items may be classified as positive, which can increase both True Positives (TP) and False Positives (FP).
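To make the effect of the threshold concrete, the sketch below computes one (FPR, TPR) point per threshold for a small set of hypothetical labels and predicted probabilities. The helper name `roc_point` is introduced here for illustration.

```python
def roc_point(y_true, y_prob, threshold):
    """Return (FPR, TPR) when predicting class 1 for p >= threshold."""
    tp = fp = tn = fn = 0
    for y, p in zip(y_true, y_prob):
        predicted = 1 if p >= threshold else 0
        if predicted == 1 and y == 1:
            tp += 1
        elif predicted == 1:
            fp += 1
        elif y == 1:
            fn += 1
        else:
            tn += 1
    tpr = tp / (tp + fn)   # TP / (TP + FN)
    fpr = fp / (fp + tn)   # FP / (FP + TN)
    return fpr, tpr

# Hypothetical labels and predicted probabilities.
y_true = [0, 0, 1, 1, 0, 1]
y_prob = [0.1, 0.4, 0.35, 0.8, 0.6, 0.7]

for threshold in (0.9, 0.5, 0.3, 0.0):
    fpr, tpr = roc_point(y_true, y_prob, threshold)
    print(f"threshold = {threshold:.1f}  FPR = {fpr:.2f}  TPR = {tpr:.2f}")
```

As the threshold drops from 0.9 to 0.0, the point moves from (0, 0) to (1, 1): both rates rise together, and plotting these points traces out the ROC curve.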





**ROC AUC** stands for **Receiver Operating Characteristic - Area Under Curve**. It is a technique to compare classifier performance. In this technique, we measure the `area under the curve (AUC)`. A perfect classifier will have a ROC AUC equal to 1, whereas a purely random classifier will have a ROC AUC equal to 0.5. 


So, **ROC AUC** is the fraction of the ROC plot that lies underneath the curve.
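One way to compute ROC AUC without drawing the curve is a sketch based on an equivalent formulation: the area under the ROC curve equals the probability that a randomly chosen positive is scored higher than a randomly chosen negative, with ties counted as half. The function name and data below are hypothetical.

```python
def roc_auc(y_true, y_prob):
    """ROC AUC via pairwise comparison of positive and negative scores."""
    positives = [p for y, p in zip(y_true, y_prob) if y == 1]
    negatives = [p for y, p in zip(y_true, y_prob) if y == 0]
    wins = 0.0
    for pos in positives:
        for neg in negatives:
            if pos > neg:
                wins += 1.0    # positive ranked above negative
            elif pos == neg:
                wins += 0.5    # a tie counts as half
    return wins / (len(positives) * len(negatives))

# Hypothetical labels and predicted probabilities.
y_true = [0, 0, 1, 1, 0, 1]
y_prob = [0.1, 0.4, 0.35, 0.8, 0.6, 0.7]

print(f"ROC AUC = {roc_auc(y_true, y_prob):.3f}")

# A classifier that ranks every positive above every negative scores 1.0.
print(roc_auc([0, 0, 1, 1], [0.1, 0.2, 0.8, 0.9]))
```

The pairwise loop is quadratic in the number of observations, so it is only meant for small examples like this one.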