# Logistic Regression

+ Logistic regression is used to perform **binary classification**.
+ Logistic regression is an extension of linear regression where we use a logit link function to fit a sigmoid curve to the data, rather than a line.
+ We can use the coefficients from a logistic regression model to estimate the log odds that a datapoint belongs to the positive class. We can then transform the log odds into a probability.
+ When the features are on the same scale (for example, after standardization), the magnitudes of the coefficients of a logistic regression model can be used to estimate relative feature importance.
+ A classification threshold is the probability cutoff above which a datapoint is classified as belonging to the positive class; at or below it, the datapoint is assigned to the negative class. The default cutoff in sklearn is 0.5.
+ We can evaluate a logistic regression model using a confusion matrix or summary statistics such as accuracy, precision, recall, and F1 score.
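
The link between log odds and probability described in the bullets above can be written out explicitly. The notation here is assumed for illustration and is not used elsewhere in this notebook: $b_0$ is the intercept, $b_1, \dots, b_k$ are the coefficients, $x_1, \dots, x_k$ are the feature values, and $p$ is the probability of the positive class.

$$
\ln\left(\frac{p}{1-p}\right) = b_0 + b_1 x_1 + \dots + b_k x_k = z
\qquad\Longrightarrow\qquad
p = \frac{1}{1 + e^{-z}}
$$

The left-hand side is the log odds (the logit of $p$); solving for $p$ gives the sigmoid function on the right.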

## Fitting a model in sklearn

In [63]:
# Import libraries and data
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression

codecademyU = pd.read_csv('data.csv')
print(codecademyU.head())

   hours_studied  practice_test  passed_exam
0              0             55            0
1              1             75            0
2              2             32            0
3              3             80            0
4              4             75            0


In [65]:
# Import pandas and the data
import pandas as pd
codecademyU = pd.read_csv('data.csv')


# Separate out X and y
X = codecademyU[['hours_studied', 'practice_test']]
y = codecademyU.passed_exam


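# Transform X (standardize each feature to mean 0 and unit variance)
# Note: the scaler is fit on all of X here; fitting it on the training split only
# would keep test-set statistics out of the training process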
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(X)
X = scaler.transform(X)


# Split data into training and testing sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state = 27)


# Create and fit the logistic regression model here:
from sklearn.linear_model import LogisticRegression


cc_lr = LogisticRegression()
cc_lr.fit(X_train, y_train)
# Print the coefficients, then the intercept:
print(cc_lr.coef_)
print(cc_lr.intercept_)

[[1.5100409  0.12002228]]
[-0.13173123]
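
Because the features were standardized, each coefficient is the change in log odds for a one-standard-deviation increase in that feature. As a minimal sketch, assuming the coefficient values printed above are copied in by hand (the variable names are illustrative), exponentiating them gives odds ratios that are easier to compare:

```python
import math

# Coefficients copied from the output above (order: hours_studied, practice_test)
coefficients = {"hours_studied": 1.5100409, "practice_test": 0.12002228}

# exp(coefficient) is the multiplicative change in the odds of passing
# for a one-standard-deviation increase in the standardized feature
odds_ratios = {name: math.exp(b) for name, b in coefficients.items()}

for name, ratio in odds_ratios.items():
    print(f"{name}: odds ratio = {ratio:.2f}")
```

With these values, hours_studied has by far the larger odds ratio, which suggests it carries more weight in this model than practice_test.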


## Predictions in sklearn

Using a trained model, we can predict whether new datapoints belong to the positive class (the group labeled as 1) using the .predict() method. The input is a matrix of features and the output is a vector of predicted labels, 1 or 0.

print(model.predict(features))

Sample output: [0 1 1 0 0]

If we are more interested in the **predicted probability** of group membership, we can use the .predict_proba() method. The input to predict_proba() is also a matrix of features and the output is an array of probabilities, ranging from 0 to 1:

print(model.predict_proba(features)[:,1])

Sample output: [0.32 0.75  0.55 0.20 0.44]


By default, .predict_proba() returns the probability of class membership for both possible groups. In the example code above, we’ve only printed out the probability of belonging to the positive class. Notice that datapoints with predicted probabilities greater than 0.5 (the second and third datapoints in this example) were classified as 1s by the .predict() method. This is a process known as thresholding. As we can see here, sklearn sets the default classification threshold probability as 0.5.
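
Thresholding can also be done by hand, which is useful when a cutoff other than 0.5 is wanted. The sketch below reuses the sample probabilities above; `apply_threshold` is a hypothetical helper written for this illustration, not part of sklearn, and the 0.6 cutoff is an arbitrary example value.

```python
# Positive-class probabilities from the sample output above
probabilities = [0.32, 0.75, 0.55, 0.20, 0.44]


def apply_threshold(probs, threshold=0.5):
    """Label a datapoint 1 when its positive-class probability exceeds the threshold."""
    return [1 if p > threshold else 0 for p in probs]


# Default cutoff of 0.5, the same cutoff .predict() uses
default_labels = apply_threshold(probabilities)

# Hypothetical stricter cutoff: fewer datapoints are labeled positive
strict_labels = apply_threshold(probabilities, threshold=0.6)

print(default_labels)
print(strict_labels)
```

At the 0.5 cutoff the labels match the sample .predict() output shown earlier; raising the cutoff to 0.6 flips the third datapoint (probability 0.55) to the negative class.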

In [85]:
# Print out the predicted outcomes for the test data
print(cc_lr.predict(X_test))
print()
print(X_test)
print()
# Print out the predicted probabilities for the test data
print(cc_lr.predict_proba(X_test))
print()
# Print out the true outcomes for the test data
print(y_test)

[0 1 0 1 1]

[[-0.43355498  0.29722219]
 [ 0.95382097  0.29722219]
 [-1.64750894 -1.79313169]
 [ 0.26013299  0.42786931]
 [ 1.30066495  0.62383999]]

[[0.67934073 0.32065927]
 [0.2068119  0.7931881 ]
 [0.94452517 0.05547483]
 [0.42252072 0.57747928]
 [0.12929566 0.87070434]]

7     0
15    1
0     0
11    0
17    1
Name: passed_exam, dtype: int64


### Understanding prediction results

With predict_proba(), each row of the output gives the predicted probability of class 0 ('fail') followed by the predicted probability of class 1 ('pass'). For the first test sample, [0.67934073 0.32065927], the model assigns a 68% probability of failing and a 32% probability of passing (the two always add up to 100%).

You should see that the fourth datapoint was incorrectly classified as having passed the exam; however, the predicted probability of passing for this datapoint was only 57.7%, which is much lower than for the two students who were correctly predicted to pass the exam (79.3% and 87.1%, respectively).
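
These probabilities can be reproduced by hand from the fitted model. The sketch below assumes the intercept, coefficients, and first row of X_test are copied from the outputs printed earlier in this notebook; the variable names are illustrative.

```python
import math

# Values copied from the outputs earlier in this notebook
intercept = -0.13173123
coefficients = [1.5100409, 0.12002228]        # hours_studied, practice_test
first_test_point = [-0.43355498, 0.29722219]  # standardized feature values

# Log odds: intercept plus the coefficient-weighted sum of the features
log_odds = intercept + sum(b * x for b, x in zip(coefficients, first_test_point))

# The sigmoid function converts log odds into the probability of class 1 ('pass')
prob_pass = 1 / (1 + math.exp(-log_odds))
prob_fail = 1 - prob_pass

print(round(prob_pass, 3), round(prob_fail, 3))
```

The result should come out close to the 32% and 68% that predict_proba() reported for the first test sample.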

## Confusion matrix

When we fit a machine learning model, we need some way to evaluate it. Often, we do this by splitting our data into training and test datasets. We use the training data to fit the model; then we use the test set to see how well the model performs with new data.

As a first step, data scientists often look at a confusion matrix, which shows the number of true positives, false positives, true negatives, and false negatives.

For example, suppose that the true and predicted classes for a logistic regression model are:

y_true = [0, 0, 1, 1, 1, 0, 0, 1, 0, 1]

y_pred = [0, 1, 1, 0, 1, 0, 1, 1, 0, 1]


We can create a confusion matrix as follows:

    from sklearn.metrics import confusion_matrix
    print(confusion_matrix(y_true, y_pred))


Output:

    [[3 2]
     [1 4]]


This output tells us that there are 3 true negatives, 1 false negative, 4 true positives, and 2 false positives. Ideally, we want the numbers on the main diagonal (in this case, 3 and 4, which are the true negatives and true positives, respectively) to be as large as possible.
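
To make the four cells concrete, the counts can be tallied directly from the two lists. This is only an illustrative sketch in plain Python, not how sklearn implements confusion_matrix; it follows the same layout, with true classes as rows and predicted classes as columns.

```python
y_true = [0, 0, 1, 1, 1, 0, 0, 1, 0, 1]
y_pred = [0, 1, 1, 0, 1, 0, 1, 1, 0, 1]

pairs = list(zip(y_true, y_pred))

# Count each combination of (true class, predicted class)
tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives

# Rows are true classes (0, 1); columns are predicted classes (0, 1)
matrix = [[tn, fp], [fn, tp]]
print(matrix)
```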

In [91]:
# Save and print the predicted outcomes
y_pred = cc_lr.predict(X_test)
print('predicted classes: ', y_pred)

# Print out the true outcomes for the test data
print('true classes: ', y_test)

# Print out the confusion matrix here
from sklearn.metrics import confusion_matrix

print(confusion_matrix(y_test, y_pred))

predicted classes:  [0 1 0 1 1]
true classes:  7     0
15    1
0     0
11    0
17    1
Name: passed_exam, dtype: int64
[[2 1]
 [0 2]]


## Accuracy, Recall, Precision, F1 Score

Once we have a confusion matrix, there are a few different statistics we can use to summarize the four values in the matrix. These include accuracy, precision, recall, and F1 score. We won’t go into much detail about these metrics here, but a quick summary is shown below (T = true, F = false, P = positive, N = negative). 

For all of these metrics, a value closer to 1 is better and closer to 0 is worse.

+ Accuracy = (TP + TN)/(TP + FP + TN + FN)
+ Precision = TP/(TP + FP)
+ Recall = TP/(TP + FN)
+ F1 score = 2 * Precision * Recall/(Precision + Recall), the harmonic mean of precision and recall
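
As a worked example, the formulas can be applied by hand to the counts from the earlier example confusion matrix (3 true negatives, 2 false positives, 1 false negative, 4 true positives). The F1 score here is computed as the harmonic mean of precision and recall; the variable names are illustrative.

```python
# Counts from the example confusion matrix above
tp, fp, tn, fn = 4, 2, 3, 1

accuracy = (tp + tn) / (tp + fp + tn + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)

# F1 is the harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)

print(round(accuracy, 2), round(precision, 2), round(recall, 2), round(f1, 2))
```

This gives an accuracy of 0.7, a precision of about 0.67, a recall of 0.8, and an F1 score of about 0.73.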

In sklearn, we can calculate these metrics as follows:

In [101]:
# The "example:" comments below give the values for the earlier example
# (y_true = [0, 0, 1, 1, 1, 0, 0, 1, 0, 1] and y_pred = [0, 1, 1, 0, 1, 0, 1, 1, 0, 1]).
# The printed output is for the test data (y_test and the predictions from cc_lr).

# accuracy:
from sklearn.metrics import accuracy_score
print(accuracy_score(y_test, y_pred))
# example: 0.7

# precision:
from sklearn.metrics import precision_score
print(precision_score(y_test, y_pred))
# example: 0.67

# recall:
from sklearn.metrics import recall_score
print(recall_score(y_test, y_pred))
# example: 0.8

# F1 score
from sklearn.metrics import f1_score
print(f1_score(y_test, y_pred))
# example: 0.73

0.8
0.6666666666666666
1.0
0.8
