# Intro to Machine Learning and Logistic Regression

## Objectives 
* Understand what machine learning is and the different types of models 
* Understand more specifically what logistic regression is and how it's different from linear regression 
* Be able to explain the sigmoid function and how it transforms our model 
* Be able to explain how we evaluate ML models going forward 

![](https://wordstream-files-prod.s3.amazonaws.com/s3fs-public/styles/simple_image/public/images/machine-learning1.png?SnePeroHk5B9yZaLY7peFkULrfW8Gtaf&itok=yjEJbEKD)

[Powerpoint](https://docs.google.com/presentation/d/1jdcJsWAmpwH0kAPW4UJNvBQrngyWn-ZIOmLV4qi4KgU/edit#slide=id.g54a549331a_0_0)

## What is Logistic Regression? 

![](https://miro.medium.com/max/400/1*zLfpo6F_Bfi6uvRL6iLX_Q.jpeg)
Logistic regression belongs to a class of predictive models called _Generalized Linear Models_ (GLMs). All of these models have two things in common: they model the relationship between the independent and dependent variables, and they indicate the strength of that relationship. 

Unlike linear regression, logistic regression predicts the probability of a success or a failure. Is this email likely spam? What is the probability that this citizen will vote Republican? Is this homeowner likely to default on their mortgage? Is this person likely to buy our product? Is this tumor likely to be cancerous or benign?

## Assumptions 
**Logistic Regression Assumptions:**

* Binary logistic regression requires the dependent variable to be binary.

* Only the meaningful variables should be included.

* The independent variables should be independent of each other. That is, the model should have little or no multicollinearity.

* The independent variables are linearly related to the log odds.

* Logistic regression requires quite large sample sizes.

**Key differences from Linear Regression:**
* A GLM does not assume a linear relationship between the dependent and independent variables. In the logit model, it instead assumes a linear relationship between the link function (the log odds) and the independent variables.

* The dependent variable need not be normally distributed.

* It does not use OLS (Ordinary Least Squares) for parameter estimation. Instead, it uses maximum likelihood estimation (MLE).

* Errors need to be independent, but they need not be normally distributed.

## Logistic Regression Equation
![](https://miro.medium.com/max/571/0*tGVPGu3aa1rhTdfl.png)
Let's say we've constructed our best-fit line, i.e. our linear predictor, $\hat{L} = \beta_0 + \beta_1x_1 + ... + \beta_nx_n$.

Consider the following transformation:
$\large\hat{y} = \Large\frac{1}{1 + e^{-\hat{L}}} \large= \Large\frac{1}{1 + e^{-(\beta_0 + ... + \beta_nx_n)}}$. This is called the sigmoid function.

This function squeezes our predictions between 0 and 1. 
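To make the squeezing concrete, here is a minimal sketch of the sigmoid in plain Python. The function name `sigmoid` and the sample inputs are our own choices for illustration; the two-branch form is just a common way to avoid overflow for very negative inputs.

```python
import math

def sigmoid(z):
    """Map any real-valued linear predictor z to the open interval (0, 1)."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    # For very negative z, exp(-z) can overflow, so use the equivalent form.
    ez = math.exp(z)
    return ez / (1.0 + ez)

# Large negative inputs land near 0, large positive inputs near 1,
# and an input of 0 lands exactly in the middle at 0.5.
for z in (-10, -2, 0, 2, 10):
    print(z, sigmoid(z))
```

However large or small the linear predictor gets, the output stays strictly between 0 and 1, which is what lets us read it as a probability.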

Suppose I'm building a model to predict whether a plant is poisonous or not, based perhaps on certain biological features of its leaves. I'll let '1' indicate a poisonous plant and '0' indicate a non-poisonous plant.

Now I'm forcing my predictions to be between 0 and 1, so suppose for test plant $P$ I get some value like 0.19.

I can naturally understand this as the probability that $P$ is poisonous.

If I truly want a binary prediction, I can simply round my score appropriately.
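"Rounding appropriately" amounts to comparing the score against a threshold. The sketch below assumes a default threshold of 0.5; the helper name `predict_label` and the alternative threshold of 0.1 are hypothetical, chosen only to show that the cutoff is a modeling decision.

```python
def predict_label(probability, threshold=0.5):
    """Return 1 (poisonous) if the probability reaches the threshold, else 0."""
    return 1 if probability >= threshold else 0

p_poisonous = 0.19  # the score for test plant P from the text

print(predict_label(p_poisonous))        # 0: classified as non-poisonous
print(predict_label(p_poisonous, 0.1))   # 1: a more cautious cutoff flags the plant
```

An explicit comparison is used instead of Python's built-in `round`, which rounds exact halves to the nearest even integer and so would map a score of 0.5 to 0.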

How do we fit a line to our dependent variable if its values are already stored as probabilities? We can use the inverse of the sigmoid function, and just set our regression equation equal to that. The inverse of the sigmoid function is called the logit function, and it looks like this:

$\large f(y) = \ln\left(\frac{y}{1 - y}\right)$. Notice that the domain of this function is $(0, 1)$.

Quick proof that logit and sigmoid are inverse functions:

$x = \frac{1}{1 + e^{-y}}$;

so $1 + e^{-y} = \frac{1}{x}$;

so $e^{-y} = \frac{1 - x}{x}$;

so $-y = \ln\left(\frac{1 - x}{x}\right)$;

so $y = \ln\left(\frac{x}{1 - x}\right)$.
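The inverse relationship can also be checked numerically. This is a small sanity check, not a proof; the function names and the sample values are our own.

```python
import math

def sigmoid(z):
    """Squeeze a real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def logit(p):
    """Log odds of a probability p in the open interval (0, 1)."""
    return math.log(p / (1.0 - p))

# Applying logit after sigmoid should return the original input.
for z in (-3.0, -0.5, 0.0, 1.2, 4.0):
    assert math.isclose(logit(sigmoid(z)), z, abs_tol=1e-9)

# Applying sigmoid after logit should return the original probability.
for p in (0.05, 0.19, 0.5, 0.9):
    assert math.isclose(sigmoid(logit(p)), p, abs_tol=1e-12)
```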

Our regression equation will now look like this:

$\large\ln\left(\frac{y}{1 - y}\right) = \beta_0 + \beta_1x_1 + ... + \beta_nx_n$.
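In practice the observed labels are exactly 0 or 1, where the logit is undefined, so the coefficients are estimated by maximum likelihood instead of by transforming the labels directly. The sketch below fits a one-feature model with plain gradient ascent on the log-likelihood. The data, the function name `fit_logistic`, the learning rate, and the iteration count are all assumptions made for illustration; statistical libraries typically use more sophisticated optimizers.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(xs, ys, learning_rate=0.5, n_iter=5000):
    """Estimate (beta_0, beta_1) by gradient ascent on the mean log-likelihood."""
    b0, b1 = 0.0, 0.0
    n = len(xs)
    for _ in range(n_iter):
        # The gradient of the log-likelihood is built from the residuals y - y_hat.
        errors = [y - sigmoid(b0 + b1 * x) for x, y in zip(xs, ys)]
        b0 += learning_rate * sum(errors) / n
        b1 += learning_rate * sum(e * x for e, x in zip(errors, xs)) / n
    return b0, b1

# Hypothetical toy data: one leaf measurement per plant, 1 = poisonous.
# The classes overlap (x = 2.0 is poisonous, x = 2.5 is not), so the
# maximum likelihood estimate is finite.
xs = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
ys = [0, 0, 0, 1, 0, 1, 1, 1]

b0, b1 = fit_logistic(xs, ys)
print("beta_0:", b0, "beta_1:", b1)
print("P(poisonous | x = 1.0):", sigmoid(b0 + b1 * 1.0))
print("P(poisonous | x = 3.5):", sigmoid(b0 + b1 * 3.5))
```

A positive $\beta_1$ means that larger measurements raise the log odds, and therefore the predicted probability, of the plant being poisonous.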

## Evaluating Classification Models 

For classification problems, the target is a categorical variable. This means that we can simply count the number of times that our model predicts the correct category and the number of times that it predicts something else.

We can visualize this by means of a **confusion matrix**, a tabular representation of Actual vs Predicted values.
![](https://miro.medium.com/max/350/0*rhntpf-9O0A5HjCP)

**The metrics for evaluating your model's performance can be drawn from this matrix** 

* Accuracy = $\frac{TP + TN}{TP + TN + FP + FN}$

* Recall = $\frac{TP}{TP + FN}$

* Precision = $\frac{TP}{TP + FP}$

* F-1 Score = $\frac{2 \cdot Pr \cdot Rc}{Pr + Rc}$ = $\frac{2TP}{2TP + FP + FN}$, where $Pr$ is precision and $Rc$ is recall 
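The four formulas above can be computed directly from the confusion-matrix counts. The labels below are made up for illustration, and the helper names are hypothetical; the sketch also assumes every denominator is nonzero.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, TN, FP, FN for binary labels where 1 is the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn

def classification_metrics(tp, tn, fp, fn):
    """Return accuracy, recall, precision, and F-1 (denominators assumed nonzero)."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    f1 = 2 * tp / (2 * tp + fp + fn)
    return accuracy, recall, precision, f1

# Hypothetical actual and predicted labels for ten observations.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

tp, tn, fp, fn = confusion_counts(y_true, y_pred)
print(tp, tn, fp, fn)                              # 3 5 1 1
print(classification_metrics(tp, tn, fp, fn))      # (0.8, 0.75, 0.75, 0.75)
```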

**General Lessons**: 
First, let's make some general observations about the metrics we've so far defined.

**Accuracy:**

* **Pro:** Takes into account both false positives and false negatives.

* **Con:** Can be misleadingly high when there is a significant class imbalance. (A lottery-ticket predictor that always predicts a loser will be highly accurate.)

**Recall:**

* **Pro:** Highly sensitive to false negatives.

* **Con:** No sensitivity to false positives.

**Precision:**

* **Pro:** Highly sensitive to false positives.

* **Con:** No sensitivity to false negatives.
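The lottery example can be made concrete with a few lines of Python. The numbers are hypothetical: 10,000 tickets with a single winner, and a "model" that always predicts a loser.

```python
# Hypothetical lottery: 10,000 tickets, exactly one winner (label 1).
y_true = [1] + [0] * 9999
y_pred = [0] * 10000  # always predict "loser"

correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
accuracy = correct / len(y_true)

tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
recall = tp / (tp + fn)

print(accuracy)  # 0.9999
print(recall)    # 0.0
# Precision is undefined here: the model never predicts a positive,
# so TP + FP = 0.
```

The accuracy looks excellent, yet the recall shows that the model never finds the one case we care about.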
    
    
**PRACTICE DOCUMENT** - https://docs.google.com/document/d/1KkBBSRqaDXaHMoSNa9hrP0xipFtXoG5UEq7Mdon6RZ4/edit?usp=sharing

**F-1 Score:**

Harmonic mean of recall and precision.
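As a short worked derivation, writing $Pr = \frac{TP}{TP + FP}$ for precision and $Rc = \frac{TP}{TP + FN}$ for recall, the harmonic mean reduces to a formula in the raw counts:

$\frac{1}{Pr} + \frac{1}{Rc} = \frac{TP + FP}{TP} + \frac{TP + FN}{TP} = \frac{2TP + FP + FN}{TP}$;

so $F_1 = \frac{2}{\frac{1}{Pr} + \frac{1}{Rc}} = \frac{2 \cdot Pr \cdot Rc}{Pr + Rc} = \frac{2TP}{2TP + FP + FN}$.

Because the harmonic mean is pulled toward the smaller of its two inputs, a model cannot earn a high F-1 score by doing well on only one of precision and recall.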

**AIC (Akaike Information Criterion):** AIC plays a role in logistic regression analogous to that of adjusted R² in linear regression. It measures goodness of fit while penalizing the model for its number of coefficients, so when comparing models fit to the same data we prefer the one with the lowest AIC.
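The standard definition is $AIC = 2k - 2\ln(\hat{L})$, where $k$ is the number of estimated coefficients and $\hat{L}$ is the maximized likelihood. The sketch below applies it to two hypothetical models; the log-likelihood values and parameter counts are invented for illustration.

```python
def aic(log_likelihood, n_params):
    """Akaike Information Criterion: 2k - 2 ln(L). Lower is better."""
    return 2 * n_params - 2 * log_likelihood

# Hypothetical fits to the same data set.
aic_small = aic(log_likelihood=-120.0, n_params=3)
aic_large = aic(log_likelihood=-119.5, n_params=6)

print(aic_small)  # 246.0
print(aic_large)  # 251.0
```

Here the larger model fits slightly better, but not by enough to justify three extra coefficients, so the smaller model has the lower AIC.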

**ROC Curve:** The Receiver Operating Characteristic (ROC) curve summarizes the model's performance by showing the trade-off between the true positive rate (sensitivity) and the false positive rate (1 - specificity). Each point on the curve corresponds to one classification threshold p, and the curve is traced by sweeping p across its full range from 0 to 1, not only the default of 0.5. The area under the curve (AUC), also referred to as the index of accuracy (A) or concordance index, is a single-number summary of the ROC curve: the higher the AUC, the better the predictive power of the model. Below is a sample ROC curve. The ROC curve of a perfect predictive model passes through the point where the true positive rate equals 1 and the false positive rate equals 0, so it touches the top left corner of the graph.
![](https://miro.medium.com/max/300/0*20UWoOC5Gi4SdbAw.jpg)
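A minimal sketch of how such a curve and its AUC can be computed by hand follows. The labels and scores are hypothetical, the function names are our own, and the trapezoidal rule is one common way to approximate the area.

```python
def roc_points(y_true, scores):
    """Return (FPR, TPR) pairs from sweeping the threshold over every score."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    points = [(0.0, 0.0)]  # threshold above every score: nothing predicted positive
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for t, s in zip(y_true, scores) if s >= threshold and t == 1)
        fp = sum(1 for t, s in zip(y_true, scores) if s >= threshold and t == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points

def auc(points):
    """Area under the curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area

# Hypothetical true labels and predicted probabilities.
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

points = roc_points(y_true, scores)
print(points)
print(auc(points))  # about 0.889 (8/9)

# A model that ranks every positive above every negative reaches AUC = 1.
print(auc(roc_points([1, 1, 0, 0], [0.9, 0.8, 0.3, 0.1])))
```

The value 8/9 also has a ranking interpretation: of the nine possible (positive, negative) pairs in this toy data, the model scores the positive higher in eight.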
