# Introduction to Logistic Regression

In this notebook, we'll learn the basics of building a logistic regression model using scikit-learn.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

We'll be working with the penguins dataset again.

In [None]:
# drop = True avoids keeping the old row labels as an extra 'index' column
penguins = pd.read_csv('data/penguins.csv').dropna().reset_index(drop = True)

In [None]:
penguins.columns

In [None]:
penguins.sample(5)[['body_mass_g', 'species']]

First, let's prepare our dataset. Our initial target will be predicting whether or not a penguin is of the Gentoo species, so we'll create a Boolean target to indicate this.

Note also that in our train/test split, we are going to use the **stratify** keyword, which keeps the proportion of each target class the same (or very nearly the same) in the training and test data.

In [None]:
variables = ['body_mass_g']

X = penguins[variables]
y = penguins['species'] == 'Gentoo'

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 321, stratify = y)
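
The effect of **stratify** can be checked on a small made-up dataset. The sketch below is only an illustration: the arrays `X_demo` and `y_demo` are invented for the example and are not the penguins data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up imbalanced target: 60 True and 140 False (30% True overall).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 1))
y_demo = np.array([True] * 60 + [False] * 140)

# Same split twice: once without and once with stratification.
_, _, y_tr_plain, y_te_plain = train_test_split(X_demo, y_demo, random_state=321)
_, _, y_tr_strat, y_te_strat = train_test_split(
    X_demo, y_demo, random_state=321, stratify=y_demo
)

# Share of True values in each piece; the stratified pieces should match
# the overall share closely, while the unstratified ones may drift.
print("overall:        ", y_demo.mean())
print("no stratify:    ", y_tr_plain.mean(), y_te_plain.mean())
print("with stratify:  ", y_tr_strat.mean(), y_te_strat.mean())
```

With a rare target class, an unstratified split can leave the test set with too few positive cases to evaluate on, which is the main reason to stratify.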

The syntax for fitting a logistic regression model is the same as for the other scikit-learn models that we've seen so far.

In [None]:
logreg = LogisticRegression().fit(X_train, y_train)

We can extract the intercept and the coefficient, which are stored as attributes of the fitted model.

In [None]:
logreg.intercept_

In [None]:
logreg.coef_

In [None]:
# Body mass at which the predicted probability of Gentoo is exactly 0.5
boundary = - logreg.intercept_[0] / logreg.coef_[0, 0]
boundary
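
The ratio of intercept to coefficient computed above is the point where the model switches its prediction. A minimal sketch of why, using synthetic one-feature data (the names `x_demo`, `model`, `b0`, and `b1` are made up for this illustration):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic one-feature data: two overlapping groups (not the penguins data).
rng = np.random.default_rng(0)
x_demo = np.concatenate([rng.normal(-1.0, 1.0, 100), rng.normal(2.0, 1.0, 100)])
X_demo = x_demo.reshape(-1, 1)
y_demo = np.array([False] * 100 + [True] * 100)

model = LogisticRegression().fit(X_demo, y_demo)
b0 = model.intercept_[0]
b1 = model.coef_[0, 0]

# The model predicts P(True) = 1 / (1 + exp(-(b0 + b1 * x))).
# That equals 0.5 exactly when b0 + b1 * x = 0, i.e. when x = -b0 / b1.
boundary_demo = -b0 / b1
prob_at_boundary = model.predict_proba([[boundary_demo]])[0, 1]

print("boundary:", boundary_demo)
print("P(True) at boundary:", prob_at_boundary)
```

Values of the feature above the boundary are predicted as the positive class when the coefficient is positive, and values below it as the negative class.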

In [None]:
fontsize = 18

plt.figure(figsize = (10,6))

sns.boxplot(data = penguins.assign(Gentoo = penguins['species'] == 'Gentoo'), x = 'Gentoo', y = 'body_mass_g')
plt.xticks(fontsize = fontsize)
plt.yticks(fontsize = fontsize)
plt.xlabel('Gentoo', fontsize = fontsize)
plt.ylabel('body_mass_g', fontsize = fontsize)

xmin, xmax = plt.xlim()
plt.hlines(y = boundary, xmin = xmin, xmax = xmax, linestyle = '--', color = 'red', linewidth = 3)
plt.xlim(xmin, xmax);

Now, let's look at some metrics.

In [None]:
from sklearn.metrics import classification_report, accuracy_score, confusion_matrix

Like before, we can use the `predict` method to see the predicted class.

In [None]:
logreg.predict(X_test)

In [None]:
accuracy_score(y_test, logreg.predict(X_test))

In [None]:
print(classification_report(y_test, logreg.predict(X_test)))

In [None]:
print(confusion_matrix(y_test, logreg.predict(X_test)))
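
To read the printed matrix, it helps to know that scikit-learn puts the true classes on the rows and the predicted classes on the columns, with labels in sorted order (so `False` comes before `True`). A small sketch with hand-checkable, made-up labels:

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Tiny made-up labels, chosen so the counts are easy to verify by hand.
y_true_demo = [False, False, True, True, True]
y_pred_demo = [False, True, True, True, False]

cm_demo = confusion_matrix(y_true_demo, y_pred_demo)
print(cm_demo)

# For a binary target, flattening the matrix row by row gives:
# true negatives, false positives, false negatives, true positives.
tn, fp, fn, tp = cm_demo.ravel()

# Precision and recall recomputed from the counts, next to scikit-learn's values.
print("precision:", tp / (tp + fp), precision_score(y_true_demo, y_pred_demo))
print("recall:   ", tp / (tp + fn), recall_score(y_true_demo, y_pred_demo))
```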

Let's look at the confusion matrix in a different way.

In [None]:
from cm import plot_confusion_matrix

In [None]:
plot_confusion_matrix(y_test, logreg.predict(X_test), labels = ['Not Gentoo', 'Gentoo']);

In [None]:
plot_confusion_matrix(y_test, logreg.predict(X_test), labels = ['Not Gentoo', 'Gentoo'], metric = 'accuracy');

In [None]:
plot_confusion_matrix(y_test, logreg.predict(X_test), labels = ['Not Gentoo', 'Gentoo'], metric = 'precision');

In [None]:
plot_confusion_matrix(y_test, logreg.predict(X_test), labels = ['Not Gentoo', 'Gentoo'], metric = 'recall');

In [None]:
plot_confusion_matrix(y_test, logreg.predict(X_test), labels = ['Not Gentoo', 'Gentoo'], metric = 'f1');

We can also access the predicted probabilities using the `predict_proba` method.

In [None]:
logreg.predict_proba(X_test)
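
The array returned by `predict_proba` has one column per class, in the order given by the model's `classes_` attribute, and each row sums to 1. For a binary target, `predict` amounts to checking whether the second column exceeds 0.5. A short sketch on synthetic data (all names here are invented for the example):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic binary problem with one feature (not the penguins data).
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(100, 1))
y_demo = (X_demo[:, 0] + rng.normal(scale=0.5, size=100)) > 0

model = LogisticRegression().fit(X_demo, y_demo)
proba = model.predict_proba(X_demo)

# Column order of predict_proba follows classes_.
print(model.classes_)

# Each row is a probability distribution over the two classes.
rows_sum_to_one = np.allclose(proba.sum(axis=1), 1.0)

# predict picks the positive class when its probability is above 0.5.
matches_threshold = np.array_equal(model.predict(X_demo), proba[:, 1] > 0.5)

print(rows_sum_to_one, matches_threshold)
```

Keeping the probabilities rather than only the predicted class makes it possible to apply a threshold other than 0.5 when one kind of error is more costly than the other.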

## Multi-class Classification

Now, let's see what it looks like to fit a multi-class model. Notice that this time, we are keeping the original species column as the target. We'll also add flipper length to our predictors.

In [None]:
variables = ['body_mass_g', 'flipper_length_mm']

X = penguins[variables]
y = penguins['species']

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 321, stratify = y)

logreg = LogisticRegression().fit(X_train, y_train)

Now, let's look at our coefficients.

In [None]:
logreg.intercept_

In [None]:
logreg.coef_

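We've got three sets of coefficients, one for each target class. With the default settings in recent versions of scikit-learn, these belong to a single multinomial (softmax) model rather than to three separately fit models: each class gets its own linear score, and the scores are converted together into probabilities.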
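
One way to see how the three sets of coefficients combine is to rebuild the predicted probabilities by hand. The sketch below assumes a recent scikit-learn version, where the default multi-class behavior is multinomial, and it uses the bundled iris data as a stand-in three-class problem rather than the penguins data:

```python
import numpy as np
from scipy.special import softmax
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

# Three-class stand-in dataset, restricted to two features
# to mirror the two predictors used in this section.
X_demo, y_demo = load_iris(return_X_y=True)
X_demo = X_demo[:, :2]

model = LogisticRegression(max_iter=1000).fit(X_demo, y_demo)

# One row of coefficients and one intercept per class.
print(model.coef_.shape)       # (3, 2)
print(model.intercept_.shape)  # (3,)

# One linear score per class, then softmax across the classes.
scores = X_demo @ model.coef_.T + model.intercept_
manual_proba = softmax(scores, axis=1)

# Compare with what the model itself reports.
same = np.allclose(manual_proba, model.predict_proba(X_demo))
print(same)
```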

If we use the `predict_proba` method, we can see that we now get three probabilities for each penguin, one per species.

In [None]:
logreg.predict_proba(X_test)

The `predict` method now assigns the target class that has the highest predicted probability.

In [None]:
logreg.predict(X_test)
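
The link between `predict` and `predict_proba` can be verified directly: taking the column with the largest probability in each row and looking it up in `classes_` should reproduce the predictions. A minimal sketch, again using the iris data as a stand-in with string labels:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

# Stand-in three-class data with string labels, similar to a species column.
iris = load_iris()
X_demo = iris.data[:, :2]
y_demo = iris.target_names[iris.target]

model = LogisticRegression(max_iter=1000).fit(X_demo, y_demo)
proba = model.predict_proba(X_demo)

# Index of the most probable class in each row, mapped back to its label.
by_argmax = model.classes_[np.argmax(proba, axis=1)]

agree = np.array_equal(by_argmax, model.predict(X_demo))
print(model.classes_)
print(agree)
```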

In [None]:
# Compute the probabilities once; columns follow the order of logreg.classes_
proba = logreg.predict_proba(X_test)

X_test.assign(adelie_prob = proba[:,0],
              chinstrap_prob = proba[:,1],
              gentoo_prob = proba[:,2],
              prediction = logreg.predict(X_test),
              true = y_test)

In [None]:
accuracy_score(y_test, logreg.predict(X_test))

In [None]:
confusion_matrix(y_test, logreg.predict(X_test))

In [None]:
print(classification_report(y_test, logreg.predict(X_test)))

In [None]:
from cm import plot_confusion_matrix_multiclass

In [None]:
plot_confusion_matrix_multiclass(y_test, 
                                 logreg.predict(X_test), 
                                 labels = logreg.classes_,
                                 figsize = (8,8))