### Logistic regression

Logistic regression is a supervised machine learning algorithm that predicts the probability, ranging from 0 to 1, of a datapoint belonging to a specific category, or class. These probabilities can then be used to assign, or classify, observations to the more probable group.

For example, we could use a logistic regression model to predict the probability that an incoming email is spam. If that probability is greater than 0.5, we could automatically send it to a spam folder. This is called binary classification because there are only two groups (e.g., spam or not spam).

We saw that predicted outcomes from a linear regression model range from negative to positive infinity. These predictions don’t make sense for a classification problem, where we want a probability between 0 and 1. Enter logistic regression!

To build a logistic regression model, we apply a logit link function to the left-hand side of our linear regression equation. Recall that the equation for a linear model looks like this:
$$y = b_{0}+m_{1}x_{1}+m_{2}x_{2}+...+m_{n}x_{n}$$
When we apply the logit function to the left-hand side, we get the following:
$$\ln\left(\frac{y}{1 - y}\right) = b_{0}+m_{1}x_{1}+m_{2}x_{2}+...+m_{n}x_{n}$$

#### Log-Odds

So far, we’ve learned that the equation for a logistic regression model looks like this:
$$\ln\left(\frac{p}{1 - p}\right) = b_{0}+m_{1}x_{1}+m_{2}x_{2}+...+m_{n}x_{n}$$
Note that we’ve replaced y with the letter p because we are going to interpret it as a probability (e.g., the probability of a student passing the exam). The whole left-hand side of this equation is called the log-odds because it is the natural logarithm (ln) of the odds (p/(1-p)). The right-hand side of this equation looks exactly like regular linear regression!

In order to understand how this link function works, let’s dig into the interpretation of log-odds a little more. The odds of an event occurring are:
$$\text{Odds} = \frac{p}{1 - p} = \frac{P(\text{event occurring})}{P(\text{event not occurring})}$$

For example, suppose that the probability a student passes an exam is 0.7. That means the probability of failing is 1 - 0.7 = 0.3. Thus, the odds of passing are:
$$\text{Odds of passing} = \frac{0.7}{1 - 0.7} \approx 2.33$$
This means that a student is about 2.33 times as likely to pass as to fail.

Odds can never be negative: they range from 0 to positive infinity. When we take the natural log of the odds (the log-odds), we transform the odds from a positive value to a number between negative and positive infinity, which is exactly what we need! The logit function (log-odds) transforms a probability (which is a number between 0 and 1) into a continuous value that can be positive or negative.
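
To make this transformation concrete, here is a minimal sketch that computes the odds and log-odds for a few probabilities using only Python’s standard library. The helper names `odds` and `log_odds` are illustrative choices, not functions from any library.

```python
import math

def odds(p):
    """Odds of an event with probability p (requires 0 <= p < 1)."""
    return p / (1 - p)

def log_odds(p):
    """Natural log of the odds, i.e. the logit (requires 0 < p < 1)."""
    return math.log(odds(p))

# Print probability, odds, and log-odds side by side
for p in [0.1, 0.3, 0.5, 0.7, 0.9]:
    print(f"p={p:.1f}  odds={odds(p):.2f}  log-odds={log_odds(p):+.2f}")
```

Probabilities below 0.5 give negative log-odds, a probability of 0.5 gives log-odds of zero, and probabilities above 0.5 give positive log-odds.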

#### Sigmoid Function

Let’s return to the logistic regression equation and see how to turn log-odds back into a probability. The equation is:
$$\ln\left(\frac{p}{1-p}\right) = b_{0}+m_{1}x_{1}+m_{2}x_{2}+...+m_{n}x_{n}$$

Suppose the right-hand side evaluates to some value, which we will call output:
$$\ln\left(\frac{p}{1-p}\right) = \text{output}$$
We can turn these log-odds into a probability by solving for p:
$$p = \frac{e^{\text{output}}}{1+e^{\text{output}}}$$

This calculation uses the sigmoid function, which is the inverse of the logit function. Writing z for the output and dividing the numerator and denominator by the exponential term gives the usual form of the sigmoid, which produces an S-shaped curve:
$$g(z) = \frac{1}{1+e^{-z}}$$
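
As a quick check that the sigmoid really undoes the logit, the sketch below converts a probability to log-odds and back. The function names `logit` and `sigmoid` are defined here for illustration, and the starting value of 0.7 is the exam example from earlier.

```python
import math

def logit(p):
    """Log-odds of a probability p (requires 0 < p < 1)."""
    return math.log(p / (1 - p))

def sigmoid(z):
    """Map any real number z to a probability between 0 and 1."""
    return 1 / (1 + math.exp(-z))

p = 0.7
z = logit(p)                                  # probability -> log-odds
recovered = sigmoid(z)                        # log-odds -> probability
alternative = math.exp(z) / (1 + math.exp(z)) # the e^z / (1 + e^z) form

print(z, recovered, alternative)
```

Both forms of the formula return the original probability, up to floating-point rounding.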

#### Fitting a model in sklearn

Now that we’ve learned a little bit about how logistic regression works, let’s fit a model using sklearn.

To do this, we’ll begin by importing the LogisticRegression class and creating a LogisticRegression object:

In [None]:
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
# Illustrative example only; there is no need to run this cell

One important note is that sklearn’s logistic regression implementation applies regularization by default. Because the regularization penalty depends on the scale of the coefficients, the features should be standardized before fitting.

In [None]:
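# Standardize the features (x1, x2, ..., xn) to zero mean and unit variance.
# X is assumed to be a feature matrix of shape (n_samples, n_features).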
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(X)
X = scaler.transform(X)
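
One way to make sure scaling is always applied before the model is to chain the two steps in a pipeline. The sketch below is only an illustration: the data comes from `make_classification` (synthetic, not part of the original example), and the variable name `clf` is arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, used here purely for illustration
X, y = make_classification(n_samples=200, n_features=4, random_state=0)

# The pipeline standardizes the features, then fits the regularized model
clf = make_pipeline(StandardScaler(), LogisticRegression())
clf.fit(X, y)

# Mean accuracy on the data the model was trained on
print(clf.score(X, y))
```

With a pipeline, the scaling parameters learned during fitting are reused automatically whenever the model makes predictions on new data.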

#### Predictions in sklearn

Using a trained model, we can predict whether new datapoints belong to the positive class (the group labeled as 1) using the .predict() method. The input is a matrix of features and the output is a vector of predicted labels, 1 or 0.

In [None]:
print(model.predict(features))
# Sample output: [0 1 1 0 0]

If we are more interested in the predicted probability of group membership, we can use the .predict_proba() method. The input to predict_proba() is also a matrix of features and the output is an array of probabilities, ranging from 0 to 1:

In [None]:
print(model.predict_proba(features))
# Output format: one row per sample, [P(class 0), P(class 1)]

print(model.predict_proba(features)[:,1])
# Sample output: [0.32 0.75 0.55 0.20 0.44]
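
To see how the two methods relate, here is a small self-contained sketch. The dataset (hours studied versus passing an exam) is made up for illustration, and the variable names are assumptions rather than anything from the original example.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Made-up data: hours studied -> passed (1) or failed (0)
hours = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0], [7.0], [8.0]])
passed = np.array([0, 0, 0, 1, 0, 1, 1, 1])

model = LogisticRegression().fit(hours, passed)

new_hours = np.array([[2.5], [4.5], [7.5]])
proba = model.predict_proba(new_hours)  # shape (3, 2): [P(class 0), P(class 1)]
labels = model.predict(new_hours)       # shape (3,): each label is 0 or 1

print(proba.sum(axis=1))  # each row of probabilities sums to 1
print(proba[:, 1])        # probability of the positive class only
print(labels)
```

Each row of `predict_proba` holds the probabilities for class 0 and class 1, and `predict` returns the class with the higher probability.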

#### Classification Thresholding

As we’ve seen, logistic regression is used to predict the probability of group membership. Once we have this probability, we need to make a decision about what class a datapoint belongs to. This is where the classification threshold comes in!

The default threshold for sklearn is 0.5. If the predicted probability of an observation belonging to the positive class is greater than or equal to the threshold, 0.5, the datapoint is assigned to the positive class.

We can choose to change the threshold of classification based on the use-case of our model. For example, if we are creating a logistic regression model that classifies whether or not an individual has cancer, we may want to be more sensitive to the positive cases. We wouldn’t want to tell someone they don’t have cancer when they actually do!

In order to ensure that most patients with cancer are identified, we can move the classification threshold down to 0.3 or 0.4, increasing the sensitivity of our model to predicting a positive cancer classification. While this might result in more overall misclassifications, we are now missing fewer of the cases we are trying to detect: actual cancer patients.
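
A custom threshold can be applied directly to the predicted probabilities. The sketch below reuses the sample probabilities shown earlier as hypothetical values; the helper `classify` is an illustrative name, not an sklearn function.

```python
import numpy as np

# Hypothetical positive-class probabilities,
# e.g. from model.predict_proba(features)[:, 1]
proba_positive = np.array([0.32, 0.75, 0.55, 0.20, 0.44])

def classify(probabilities, threshold=0.5):
    """Assign class 1 when the probability meets the threshold, else class 0."""
    return (probabilities >= threshold).astype(int)

print(classify(proba_positive))       # [0 1 1 0 0]
print(classify(proba_positive, 0.3))  # [1 1 1 0 1]
```

Lowering the threshold from 0.5 to 0.3 moves two more observations into the positive class, which is the trade-off described above: more positives are caught, at the cost of more false alarms.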

#### Confusion matrix

When we fit a machine learning model, we need some way to evaluate it. Often, we do this by splitting our data into training and test datasets. We use the training data to fit the model; then we use the test set to see how well the model performs with new data.
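
A minimal sketch of this workflow is shown below. It assumes synthetic data from `make_classification` and a 75/25 split; both are illustrative choices, not part of the original example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic data, used here purely for illustration
X, y = make_classification(n_samples=200, n_features=4, random_state=0)

# Hold out 25% of the rows as a test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Fit on the training data only, then evaluate on the unseen test data
model = LogisticRegression().fit(X_train, y_train)
test_accuracy = model.score(X_test, y_test)
print(test_accuracy)
```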

As a first step, data scientists often look at a confusion matrix, which shows the number of true positives, false positives, true negatives, and false negatives.

For example, suppose that the true and predicted classes for a logistic regression model are:

In [2]:
y_true = [0, 0, 1, 1, 1, 0, 0, 1, 0, 1]
y_pred = [0, 1, 1, 0, 1, 0, 1, 1, 0, 1]

We can create a confusion matrix as follows:

In [3]:
from sklearn.metrics import confusion_matrix
print(confusion_matrix(y_true, y_pred))

[[3 2]
 [1 4]]


This output tells us that there are 3 true negatives, 1 false negative, 4 true positives, and 2 false positives. Ideally, we want the numbers on the main diagonal (in this case, 3 and 4, which are the true negatives and true positives, respectively) to be as large as possible.
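
For a binary problem, the four counts can be pulled out of the matrix by flattening it. This sketch reuses the `y_true` and `y_pred` lists from above; the variable names `tn`, `fp`, `fn`, and `tp` are just conventional labels.

```python
from sklearn.metrics import confusion_matrix

y_true = [0, 0, 1, 1, 1, 0, 0, 1, 0, 1]
y_pred = [0, 1, 1, 0, 1, 0, 1, 1, 0, 1]

# Rows are true classes, columns are predicted classes,
# so flattening yields TN, FP, FN, TP in that order
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TN={tn} FP={fp} FN={fn} TP={tp}")  # TN=3 FP=2 FN=1 TP=4
```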

#### Accuracy, Recall, Precision, F1 Score

Once we have a confusion matrix, there are a few different statistics we can use to summarize the four values in the matrix. These include accuracy, precision, recall, and F1 score. We won’t go into much detail about these metrics here, but a quick summary is shown below (T = true, F = false, P = positive, N = negative). For all of these metrics, a value closer to 1 is better and closer to 0 is worse.

* Accuracy = (TP + TN)/(TP + FP + TN + FN)
* Precision = TP/(TP + FP)
* Recall = TP/(TP + FN)
* F1 score = 2 × (Precision × Recall)/(Precision + Recall), the harmonic mean of precision and recall
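
As a worked example, these formulas can be applied by hand to the counts from the confusion matrix above (4 true positives, 2 false positives, 3 true negatives, 1 false negative). This is plain arithmetic, with no library calls involved.

```python
# Counts taken from the confusion matrix example
tp, fp, tn, fn = 4, 2, 3, 1

accuracy = (tp + tn) / (tp + fp + tn + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(round(accuracy, 2), round(precision, 2), round(recall, 2), round(f1, 2))
# 0.7 0.67 0.8 0.73
```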

In sklearn, we can calculate these metrics as follows:

In [5]:
# accuracy:
from sklearn.metrics import accuracy_score
print(accuracy_score(y_true, y_pred))
# output: 0.7

# precision:
from sklearn.metrics import precision_score
print(precision_score(y_true, y_pred))
# output: 0.67

# recall: 
from sklearn.metrics import recall_score
print(recall_score(y_true, y_pred))
# output: 0.8

# F1 score
from sklearn.metrics import f1_score
print(f1_score(y_true, y_pred))
# output: 0.73

0.7
0.6666666666666666
0.8
0.7272727272727272
