# Logistic Regression

Logistic Regression is a supervised learning technique, i.e. you need labeled data. It is a classification algorithm that uses regression to predict the probability (from 0 to 1) of a data sample belonging to a specific category, or class.

In spam filtering, a Logistic Regression model would predict the probability of an incoming email being spam. If that predicted probability is greater than or equal to 0.5, the email is classified as spam. We would call spam the positive class, with the label 1, since the positive class is the class our model is looking to detect. If the predicted probability is less than 0.5, the email is classified as ham (a real email). We would call ham the negative class, with the label 0. This act of deciding which of two classes a data sample belongs to is called binary classification.

## Predicting whether students will pass their final exam

We need to predict the probability of each student passing. 

We'll first try `Linear Regression`, which models the relationship between a dependent variable and one or more independent variables using a line of best fit.

The line of best fit follows the formula:

![Linear Regression](img/linear-regression-1.png)

 - `y` is the value we're trying to predict
 
 - `b_0` is the intercept of the regression line
 
 - `b_1`, `b_2`, ..., `b_n` are the coefficients of the features
 
 - `x_1`, `x_2`, ..., `x_n` are the independent variables (features)

If we plot `y` (`1` for passing, `0` for failing) against `num_hours_studied` (our one feature), we get the following graph:

![Linear Regression](img/linear-regression-2.png)

For low values of `num_hours_studied` (below 2.6 hrs) the regression line predicts negative probabilities of passing, and for high values of `num_hours_studied` (above 18 hrs) it predicts probabilities of passing greater than 1. Both are meaningless as probabilities. This occurs because the output of a `Linear Regression` model can range from -∞ to +∞.
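The same effect can be reproduced with a small sketch. The `hours` and `passed` arrays below are made-up illustration data (not the dataset behind the graph above), and `np.polyfit` stands in for the linear regression fit.

```python
import numpy as np

# Hypothetical data: 20 students; those who studied 10+ hours passed.
hours = np.arange(20, dtype=float)
passed = (hours >= 10).astype(float)

# Least-squares straight line: y = b_1 * x + b_0
# (np.polyfit returns the highest-degree coefficient first).
b_1, b_0 = np.polyfit(hours, passed, deg=1)
predictions = b_1 * hours + b_0

print(predictions.min())  # below 0 for the fewest hours studied
print(predictions.max())  # above 1 for the most hours studied
```

Even on this tidy data, the straight line leaves the valid probability range at both ends.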

### Using Logistic Regression

In Logistic Regression we are also looking to find coefficients for our features, but this time we are fitting a logistic curve to the data so that we can predict probabilities. 

To predict the probability of a data sample belonging to a class, we:

1. Initialize all feature coefficients and the intercept to 0.

2. Multiply each feature coefficient by its respective feature value, sum the results, and add the intercept to get what is known as the `log-odds`.

3. Pass the `log-odds` through the sigmoid function to map the output to the range [0,1], giving us a probability.

By comparing the predicted probabilities to the actual classes of our data points, we can evaluate how well our model makes predictions and use `gradient descent` to update the coefficients and find the best ones for our model.

To then make a final classification, we use a classification threshold to determine whether the data sample belongs to the positive class or the negative class.

![Logistic Regression](img/logistic-regression-1.png)

In Linear Regression we multiply the coefficients of our features by their respective feature values and add the intercept, resulting in our prediction, which can range from -∞ to +∞. In Logistic Regression, we make the same multiplication of feature coefficients and feature values and add the intercept, but instead of the prediction, we get what is called the `log-odds`.

The log-odds are another way of expressing the probability of a sample belonging to the positive class, or a student passing the exam. In probability, we calculate the odds of an event occurring as follows:

![Logistic Regression](img/logistic-regression-2.png)

The odds tell us how many times more likely an event is to occur than not to occur. If a student will pass the exam with probability 0.7, they will fail with probability 1 - 0.7 = 0.3. We can then calculate the odds of passing as:

![Logistic Regression](img/logistic-regression-3.png)

The log-odds are then understood as the logarithm of the odds!

![Logistic Regression](img/logistic-regression-4.png)
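As a quick numeric check of the example above, here is the same calculation in numpy. The variable names are illustrative only, and the natural logarithm is assumed.

```python
import numpy as np

p_pass = 0.7  # probability of passing, from the example above

# Odds: probability of the event divided by probability of its complement.
odds_of_passing = p_pass / (1 - p_pass)        # 0.7 / 0.3, roughly 2.33

# Log-odds: natural logarithm of the odds.
log_odds_of_passing = np.log(odds_of_passing)  # roughly 0.85

print(odds_of_passing, log_odds_of_passing)
```

A probability of 0.5 gives odds of 1 and log-odds of 0; probabilities above 0.5 give positive log-odds and probabilities below 0.5 give negative log-odds.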

For our Logistic Regression model, however, we calculate the log-odds, represented by `z` below, by multiplying each feature value (`x_1`, `x_2`, etc.) by its respective coefficient (`b_1`, `b_2`, etc.), summing the products, and adding the intercept (`b_0`). This allows us to map our feature values to a measure of how likely it is that a data sample belongs to the positive class.

![Logistic Regression](img/logistic-regression-5.png)

This kind of multiplication and summing is known as a `dot product`, which can be performed in `numpy` with `np.dot()`.  Given feature matrix `features`, coefficient vector `coefficients`, and an `intercept`, we can calculate the `log-odds` in numpy as follows:

```py
log_odds = np.dot(features, coefficients) + intercept
```

`np.dot()` will take each row, or student, in `features` and multiply each individual feature value by its respective coefficient in `coefficients`, summing the result, as shown below.

![Logistic Regression](img/logistic-regression-6.png)

We then add in the intercept to get the log-odds!
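To make the row-by-row arithmetic concrete, here is a small sketch with invented numbers: three students, two hypothetical features each. None of these values come from the exam data.

```python
import numpy as np

# Hypothetical feature matrix: one row per student, one column per feature.
features = np.array([[1.0, 2.0],
                     [3.0, 4.0],
                     [5.0, 6.0]])
coefficients = np.array([0.5, -0.25])  # one coefficient per feature (made up)
intercept = 0.1                        # made-up intercept

# Row 1: 1.0*0.5 + 2.0*(-0.25) = 0.0, then + 0.1 = 0.1
# Row 2: 3.0*0.5 + 4.0*(-0.25) = 0.5, then + 0.1 = 0.6
# Row 3: 5.0*0.5 + 6.0*(-0.25) = 1.0, then + 0.1 = 1.1
log_odds = np.dot(features, coefficients) + intercept
print(log_odds)
```

The result has one log-odds value per row of `features`, so each student gets a single score.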

### Sigmoid Function

We use the `Sigmoid Function` to map the log-odds `z` to the range `[0, 1]`.

![Logistic Regression](img/logistic-regression-7.png)

`e^(-z)` is the exponential function applied to `-z`, which can be written in numpy as `np.exp(-z)`. The output is the probability of a sample belonging to the positive class, or in our case, a student passing the final exam. Plotting the resulting probabilities produces an S-shaped curve.

![Logistic Regression](img/logistic-regression-8.png)
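A short sketch of the sigmoid applied to a few log-odds values. The inputs are chosen for illustration; `np.log(0.7 / 0.3)` is the log-odds of the 0.7 passing probability used earlier, so the sigmoid should map it back to 0.7.

```python
import numpy as np

def sigmoid(z):
    # 1 / (1 + e^(-z)), applied element-wise to an array of log-odds
    return 1 / (1 + np.exp(-z))

z = np.array([-5.0, 0.0, np.log(0.7 / 0.3), 5.0])
probabilities = sigmoid(z)
print(probabilities)  # small for very negative z, 0.5 at z = 0, near 1 for large z
```

This shows the sigmoid acting as the inverse of the log-odds calculation: it turns a log-odds value back into a probability.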

### Determining Model Performance

To evaluate the performance of our model, we calculate the `loss` for each data sample (how wrong the model's prediction was) and then average the loss across all samples. This average is known as the `Log Loss`. The goal of the `Logistic Regression` model is to find the intercept and feature coefficients that minimize the log-loss of our training data.

Log-loss function:

![Logistic Regression](img/logistic-regression-9.png)

Consider the case when a data sample has class y = 1, or for our data, when a student passed the exam. The right side of the equation drops out because we end up with 1 - 1 (or 0) multiplied by some value. The loss for that individual student becomes:

![Logistic Regression](img/logistic-regression-10.png)

The loss for a student who passed the exam is just the negative log of the predicted probability that the student passed the exam.

For a student who fails the exam, where a sample has class y = 0, the left side of the equation drops out and the loss for that student becomes:

![Logistic Regression](img/logistic-regression-11.png)

The loss for a student who failed the exam is the negative log of one minus the predicted probability that the student passed, which is just the negative log of the predicted probability that the student failed.

Graphing the loss for individual samples when the class label is y = 1 and y = 0 gives:

![Logistic Regression](img/logistic-regression-12.png)

The four possible cases are summarized below:

![Logistic Regression](img/logistic-regression-13.png)

From the graphs and the table you can see that confident correct predictions result in small losses, while confident incorrect predictions result in large losses that approach infinity.
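The four cases can also be checked numerically. In this sketch the predicted probabilities 0.9 and 0.1 are picked for illustration and are not taken from the table; `sample_loss` is a hypothetical helper that computes the loss for a single sample.

```python
import numpy as np

def sample_loss(p, y):
    # Per-sample log-loss: -[y*log(p) + (1 - y)*log(1 - p)]
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

# (predicted probability of passing, actual class)
cases = [(0.9, 1),   # confident and correct
         (0.1, 1),   # confident and wrong
         (0.1, 0),   # confident and correct
         (0.9, 0)]   # confident and wrong

losses = [sample_loss(p, y) for p, y in cases]
for (p, y), loss in zip(cases, losses):
    print(f"p={p}, y={y}, loss={loss:.3f}")
```

The two correct cases give a loss of about 0.105, while the two incorrect cases give about 2.303, and the gap widens as the wrong prediction gets more confident.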

We want to punish our model with an increasing loss as its predictions become progressively more incorrect, and we want to reward the model with a small loss as it makes correct predictions.

Just like in Linear Regression, we can then use gradient descent to find the coefficients that minimize log-loss across all of our training data.
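A minimal sketch of what that gradient descent loop could look like. Everything here is an assumption for illustration: the data is made up, the learning rate and iteration count are arbitrary, and the update uses the standard gradient of the mean log-loss (the average of `probability - actual_class`, weighted by the feature values for the coefficients).

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def log_loss(probabilities, actual_class):
    # Mean log-loss across all samples.
    return -np.mean(actual_class * np.log(probabilities)
                    + (1 - actual_class) * np.log(1 - probabilities))

# Hypothetical data: hours studied (one feature) and pass (1) / fail (0).
features = np.arange(1, 21, dtype=float).reshape(-1, 1)
actual_class = np.array([0, 0, 0, 0, 0, 0, 1, 0, 0, 1,
                         0, 1, 1, 1, 1, 1, 1, 1, 1, 1], dtype=float)

coefficients = np.zeros(features.shape[1])  # start at 0, as in step 1
intercept = 0.0
learning_rate = 0.01                        # assumed value

initial_loss = log_loss(sigmoid(np.dot(features, coefficients) + intercept),
                        actual_class)

for _ in range(5000):                       # assumed iteration count
    probabilities = sigmoid(np.dot(features, coefficients) + intercept)
    error = probabilities - actual_class
    # Gradient of the mean log-loss w.r.t. coefficients and intercept.
    coefficients -= learning_rate * np.dot(features.T, error) / len(actual_class)
    intercept -= learning_rate * error.mean()

final_loss = log_loss(sigmoid(np.dot(features, coefficients) + intercept),
                      actual_class)
print(initial_loss, final_loss)  # the loss after training is lower
```

In practice a library such as scikit-learn handles this optimization for you; the loop is only meant to show the idea of repeatedly nudging the coefficients in the direction that lowers the log-loss.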


### Classification Threshold

Once the `Logistic Regression` model calculates our probability, we need to decide which class the sample belongs to. The cutoff we use to make this decision is called the `Classification Threshold`.

The default threshold is often 0.5. If the predicted probability of an observation belonging to the positive class is greater than or equal to the threshold, the sample is classified as the positive class. If the predicted probability is less than the threshold, the sample is classified as the negative class.

If we were trying to identify cancer, we might lower the threshold to 0.4 or 0.3, thus increasing the sensitivity of our model to predict a positive cancer classification. While this might result in more overall misclassifications, we are now missing fewer of the cases we are trying to detect: actual cancer patients.
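The effect of lowering the threshold can be seen on a handful of made-up probabilities (these values are illustrative, not model output):

```python
import numpy as np

# Hypothetical predicted probabilities of the positive class.
probabilities = np.array([0.10, 0.35, 0.45, 0.60, 0.90])

default_classes = np.where(probabilities >= 0.5, 1, 0)
sensitive_classes = np.where(probabilities >= 0.3, 1, 0)

print(default_classes)    # [0 0 0 1 1]
print(sensitive_classes)  # [0 1 1 1 1]
```

With the lower threshold, the two borderline samples (0.35 and 0.45) are now flagged as positive, which is the trade-off described above: more positives caught, at the cost of more false alarms.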

```py
import numpy as np
from exam import hours_studied, calculated_coefficients, intercept, passed_exam, probabilities_2

# Create your log_odds() function here
def log_odds(features, coefficients, intercept):
  # dot product of each row of features with the coefficients, plus the intercept
  return np.dot(features, coefficients) + intercept

# Calculate the log-odds for the Codecademy University data here
calculated_log_odds = log_odds(hours_studied, calculated_coefficients, intercept) 

# Create your sigmoid function here
def sigmoid(z):
  # equal to 1 plus the exponential of -z.
  denominator = 1 + np.exp(-z)
  return 1 / denominator

# Calculate the sigmoid of the log-odds here
probabilities = sigmoid(calculated_log_odds)

# calculates the log-loss for a set of predicted probabilities and their actual classes
def log_loss(probabilities,actual_class):
  return np.sum(-(1/actual_class.shape[0])*(actual_class*np.log(probabilities) + (1-actual_class)*np.log(1-probabilities)))
  
# Print the actual classes, pass (1), or fail (0), for the students.
print(passed_exam)
# [[0]
#  [0]
#  [0]
#  [0]
#  ...
#  [1]
#  [1]
#  [1]]
 
# Calculate and print loss_1 here
loss_1 = log_loss(probabilities, passed_exam)
print(loss_1) # 0.398640332141742
```

Now that we have calculated the loss for our best coefficients, let's compare this loss to the loss we begin with when we initialize our coefficients and intercept to 0. `probabilities_2` contains the calculated probabilities of the students passing the exam with the coefficient for `hours_studied` set to 0.

```py
# Calculate and print loss_2 here
loss_2 = log_loss(probabilities_2, passed_exam)
print(loss_2) # 13.862943611198906
```

```py
# Create predict_class() function here. It takes a features matrix,
# a coefficients vector, an intercept, and a threshold as parameters.
def predict_class(features, coefficients, intercept, threshold):
  calculated_log_odds = log_odds(features, coefficients, intercept)
  # find the probabilities that the samples belong to the positive class
  probabilities = sigmoid(calculated_log_odds)
  # compare all the values in an array with some threshold
  return np.where(probabilities >= threshold, 1, 0)

# Make final classifications on Codecademy University data here
final_results = predict_class(hours_studied, calculated_coefficients, intercept, 0.5)
print(final_results)

# [[0]
#  [0]
#  [0]
#  [0]
#  [0]
#  [0]
#  [0]
#  [0]
#  [0]
#  [1]
#  [1]
#  [1]
#  [1]
#  [1]
#  [1]
#  [1]
#  [1]
#  [1]
#  [1]
#  [1]]
```