<a target="_blank" href="https://colab.research.google.com/github/JLDC/Data-Science-Fundamentals/blob/master/notebooks/107_logistic-regression-roc.ipynb">
    <img src="https://i.ibb.co/2P3SLwK/colab.png"  style="padding-bottom:5px;" />Open this notebook in Google Colab
</a>

___

# Logistic Regression for Bank Loan Defaults
___

The *logistic* regression is very similar to *linear* regression, but it is used for **classification** problems, whereas linear regression is used for **regression** problems. Recall that a **regression** problem is one where the target variable is a numerical variable where the numerical values have real meaning. Examples include crop yield, the price of a stock, temperature, sales, blood pressure... A **classification** problem, on the other hand, consists of predicting a *categorical* target, such as predicting whether a bank loan will default, whether an iris belongs to the *setosa* or *versicolor* species, or whether the price of a stock goes *up* or *down*.

___
#### 🤔 Pause and ponder
Did you pay attention above? Predicting the price of a stock is a *regression* problem but predicting whether it will go up or down is a *classification* problem. Does it make sense? Is it not the same problem? Can you think of other prediction tasks where we could take both a *regression* and a *classification* perspective?
___

Both linear and logistic regressions are the prototypical *plain vanilla* models for regression and classification problems, respectively. They can be seen as the natural starting point in nearly all applications (except when you have *non-tabular data* like in computer vision or natural language processing). 

Remember the notebook on MSE visualization, with cancer types from the WDBC data. This was in fact also a classification problem. However, when deriving weights by minimizing the MSE using gradient descent, we have treated it as a regression problem! Actually, the weights we derived by gradient descent are the exact same weights of an OLS regression! 

While this is fine to a certain extent, in practice we usually prefer a different approach to classification problems where the models predict a number between 0 and 1, such that it can be interpreted as a *probability* or *share*. By contrast our score ($w_1 \cdot x_1 + w_2 \cdot x_2$) could, in theory, take on any values. A logistic regression provides a simple way to convert a score into a number between 0 and 1 that we can thus interpret as a probability. 

In fact, there is a neat function called the [sigmoid function](https://en.wikipedia.org/wiki/Sigmoid_function) which takes any real number as input and outputs a number between 0 and 1. Looking at the plot below, can you see how this function can be useful to map our unbounded score to a probability range?

The sigmoid function is given as 

$$\sigma(x) = \frac{1}{1 + e^{-x}} = \frac{e^x}{e^x + 1} = 1 - \sigma(-x)$$
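
As a quick numerical sanity check of the three equivalent expressions above, here is a minimal sketch. The helper name `sigmoid` and the sample scores are our own choices for illustration and are not used elsewhere in the notebook.

```python
import numpy as np

def sigmoid(x):
    """Map any real number (or array of numbers) to the interval (0, 1)."""
    return 1 / (1 + np.exp(-x))

# A few arbitrary scores, chosen only for illustration
scores = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])

# The three forms of the sigmoid given in the formula above
form_1 = sigmoid(scores)
form_2 = np.exp(scores) / (np.exp(scores) + 1)
form_3 = 1 - sigmoid(-scores)

# All three forms should agree up to floating-point error
print(np.allclose(form_1, form_2), np.allclose(form_1, form_3))
# Every output lies strictly between 0 and 1
print(form_1)
```

Note that a score of exactly zero is mapped to a probability of one half, which is why a sign change in the score corresponds to crossing the 0.5 mark.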

In [None]:
import matplotlib.pyplot as plt # Plotting
import numpy as np # Numerical computations
import pandas as pd # Dataframes

# Define the path where the data is stored
DATA_PATH = "https://raw.githubusercontent.com/JLDC/Data-Science-Fundamentals/master/data"

In [None]:
# Create canvas
fig, ax = plt.subplots(figsize=(12,8))
# x-axis, 100 linearly spaced points between -7 and +7
xs = np.linspace(-7, 7, num=100)
# y-axis, sigmoid function
ys = 1 / (1 + np.exp(-xs))
# Draw sigmoid function
ax.plot(xs, ys)
# Add ticks, grid, title
ax.set_yticks(np.linspace(0, 1, num=11))
ax.grid(True)
ax.set_title("Sigmoid function")
ax.set_xlabel("$x$")
ax.set_ylabel(r"$\sigma(x)$")  # Raw string, so that "\s" is not read as an escape sequence

___
## Data pre-processing
We will work with data on bank loans to illustrate a use case of logistic regression. If you feel like it, however, a good exercise is to replicate parts of this notebook using the WDBC dataset instead, as it also features a categorical target.

In [None]:
# Read the bank loan dataset
loans = pd.read_csv(f"{DATA_PATH}/data/loan_defaults.csv")
loans # Display the data

As we can see from the excerpt above, we have quite a lot of data (195k observations) and there are a lot of different variables (columns) in our dataset. Normally, you aren't just *given* some dataset, but you obtained it from a specific source and you can find some information on what each variable represents or how it is measured. This is not the case here, so you have to rely on the column names...

In any case, our target variable is the `Loan_Status`, and a `1` indicates a default, i.e., that the loan was not paid back. As you can already see, without this particular information, it would be hard to understand only from the data whether `0` or `1` means a default.

____
#### 🤔 Pause and ponder
When you have some dataset and you don't know what the value means exactly (e.g., if we hadn't given you this information on `Loan_Status`), is it still possible to make an educated guess as to which number means what? How would you proceed?
___

First, let's check the value of `Loan_Status`, if there are any missing values, and if it is balanced or not.

In [None]:
# Count of values for the prediction label
# (dropna=False also lists missing values, if there are any)
loans["Loan_Status"].value_counts(dropna=False)

There are approximately half as many defaults as there are repaid loans, which is expected. The nice thing is that there are no missing values for our label. What about the other columns? Do they have missing values?

In [None]:
# Let's check the percentage of missing values in each column
loans.isnull().sum() / loans.shape[0]

`Months_since_last_delinquent` has a whopping 54.71% of missing values, and `Current_Loan_Amount` also has quite a few, with 18.02%! This raises an issue we have largely ignored until now: how do we deal with missing data?

___
### Handling missing data
Unfortunately, there is no *best* way to handle missing data. As with many aspects of data science, it is more of an art than a science. Here are some possible approaches and the drawbacks they might entail:

#### 1. Removing columns with missing values
This is what we will do for `Months_since_last_delinquent`: more than half of the data is missing, so it really is an extreme case. However, when choosing this approach, we might be dropping important variables that could help our prediction. This is something we would like to check in general, e.g., if the 45% of non-missing data in this column were able to perfectly predict the loan defaults, we would obviously still prefer to keep it and deal with the missing values in a different way.

#### 2. Removing rows with missing values
What if, instead of dropping the entire column, we simply dropped every row that has a missing value in any of the columns? This also works. However, this can reduce our sample size significantly. For instance, we do not want to drop 55% of our dataset because a single variable is missing, as that might impact the analysis too much. Furthermore, what if there is a *reason* why the data is missing? There might be a *structure* to the missing data, and by ignoring the missing data, we might be biasing our model strongly.

For instance, consider that in some geographical area, the data is missing because it was too hard to gather. If we drop missing data, our model will probably perform badly when it comes to predicting observations in that particular area.

#### 3. Imputing missing values
A third method of dealing with missing values is to *impute* them, i.e., replace them with some other value. For example, we might want to replace the `Current_Loan_Amount` values with the mean of the non-missing values. This would allow us to keep our full dataset instead of discarding 18% of it if we were to remove the rows as mentioned above in 2.

The difficulty of this approach is to choose an imputation strategy. Using the mean or the mode of the non-missing values is the simplest way to do it, but it might not be the best. In fact, if there are a lot of missing observations, it might make things worse.
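
To make mean imputation concrete, here is a minimal sketch on a tiny made-up dataframe (not the loan data). The column name `amount` and all values are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy data, made up for illustration (not the loan dataset)
toy = pd.DataFrame({"amount": [100.0, np.nan, 300.0, np.nan, 200.0]})

# Mean of the non-missing values: (100 + 300 + 200) / 3 = 200
mean_amount = toy["amount"].mean()

# Replace the missing entries with that mean
toy["amount_imputed"] = toy["amount"].fillna(mean_amount)
print(toy)

# Compare the spread before and after imputation
print(toy["amount"].std(), toy["amount_imputed"].std())
```

The imputed column has a smaller standard deviation than the original non-missing values, because every imputed entry sits exactly at the mean. This is one reason why imputing many missing observations this way can make things worse.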

#### 4. Recoding
In our case, a missing value for `Months_since_last_delinquent` may mean that the person has never been delinquent on a payment so far. We may create a new variable `ever_delinquent` and set it to 0 for all missing values and to 1 for all non-missing values. We could also set the missing values to a very high value: literally speaking, the number of months since the last delinquency is infinite (or the age of the person) for those who have never been delinquent. Whether this is a good idea also depends on whether the procedure demonstrably improves the predictive performance of the model.
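
The two recoding ideas above could be sketched as follows, again on a tiny made-up dataframe. The placeholder value `999` and the column name `months_filled` are arbitrary assumptions for illustration, and whether a missing value really means "never delinquent" is itself an assumption about the data.

```python
import numpy as np
import pandas as pd

# Toy data, made up for illustration (not the loan dataset)
toy = pd.DataFrame({"Months_since_last_delinquent": [12.0, np.nan, 3.0, np.nan]})

# Indicator: 1 if a delinquency was recorded, 0 if the value is missing
toy["ever_delinquent"] = toy["Months_since_last_delinquent"].notna().astype(int)

# Alternative: replace missing values with a very high number
# (999 is an arbitrary placeholder, not a recommended value)
toy["months_filled"] = toy["Months_since_last_delinquent"].fillna(999)

print(toy)
```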

___

As you can take away from this discussion, handling missing data is complex and full of somewhat arbitrary decisions. It is important to double check with our data that our chosen strategy isn't impairing our model, and, most importantly, we must be able to justify our choices.

In this case, learning to deal with missing data is not the main objective of this notebook. So we will take the easy way out: drop `Months_since_last_delinquent` and drop all rows that have missing values anywhere else. But be mindful that we could surely obtain a better model by spending more time and effort on our missing data!

In [None]:
# Drop the Months_since_last_delinquent column
loans.drop(columns=["Months_since_last_delinquent"], inplace=True)
# Drop the rows with missing values
loans.dropna(inplace=True)
loans # Display the data

After this cleaning step, we are left with 152k observations.

As you can see, many of the variables have string values (text). In a statistical model, we need numerical values for all features (and for the target), so we will need to convert these values into numbers.

#### Dummy-encoding and One-hot-encoding
The most common way to encode non-numerical variables is **dummy-encoding** or **one-hot-encoding**. While there is a difference between the encodings, the names are often used interchangeably. As a matter of fact, dummy-encoding is a special case of one-hot-encoding, when the variable can take only two values. So, one-hot-encoding is just the more general encoding.

Take the variable `Term` in our dataframe: it can take the values `Short Term` or `Long Term`. This is perfect for dummy-encoding, e.g., we can replace `Short Term` with a `1` and `Long Term` with a `0`. This way, we have replaced the string variable by a single number.

Take `Home_Ownership` on the other hand. It can take the values `Home Mortgage`, `Own Home`, `Rent`, or `HaveMortgage`, so up to 4 different values, and we won't be able to dummy-encode it. Instead, we can use one-hot-encoding. The idea is to create 3 new variables, also called dummies (always one less than the number of possible values, same as for dummy-encoding!):  
1. `Home_Mortgage`: takes value `1` if the `Home_Ownership` is equal to `Home Mortgage`, `0` otherwise.
2. `Own_Home`: takes value `1` if the `Home_Ownership` is equal to `Own Home`, `0` otherwise.
3. `Rent`: takes value `1` if the `Home_Ownership` is equal to `Rent`, `0` otherwise.



___
#### ➡️ ✏️<font color=green>**Question 1**</font>

Why do we need one less dummy than the number of possible values a variable can take? Are we not forgetting about `HaveMortgage` or `Long Term`? Can we just get away with discarding information like that?
___


`pandas` provides a useful function to encode text data, namely the `get_dummies()` function. Let's go ahead and use it.

In [None]:
# First let's have a look at how the function works on a single column
# Notice the use of `drop_first=True`, try to see what happens if you set it to false
pd.get_dummies(loans["Home_Ownership"], drop_first=True)

Unfortunately, the above only returns the encoding for a single column, so we would have to reconstruct our dataframe by replacing every string column with its encoding. This seems cumbersome. Luckily, `pandas` is very flexible, and we can make `get_dummies` work on the full dataframe, transforming only the necessary columns.

In [None]:
# Make a list of columns which have an 'object' type (string)
str_cols = [col for col in loans.columns if loans[col].dtype == "O"]

# To understand what the code does, check this
loans["Home_Ownership"].dtype
# A type 'O' just stands for "object", which in pandas' world is a string.

print(str_cols)

# Encode these columns
loans = pd.get_dummies(loans, columns=str_cols, drop_first=True)
loans # Display the data

___
## Logistic regression on the entire dataset
Let's start with a first logistic regression estimation on the entire dataset.

In [None]:
from sklearn.linear_model import LogisticRegression # Logistic regression

In [None]:
# Create the X and y variables
X = loans.drop(columns=["Loan_Status"])
y = loans["Loan_Status"]

In [None]:
# Create the logistic regression model
model = LogisticRegression()
# Fit it to the data
model.fit(X, y)

In [None]:
# Compute predicted label and probability of default
pred = model.predict(X)
pred_prob = model.predict_proba(X)[:, 1]

In [None]:
# Compute error metrics
N = loans.shape[0] # Number of observations
# Misclassification percentage
misclass = np.mean(y != pred)
# False positive rate
false_pos = np.mean(pred[np.where(y == 0)])
# False negative rate
false_neg = 1 - np.mean(pred[np.where(y == 1)])

In [None]:
# Print error metrics
print(f"The misclassification rate is {100 * misclass:.2f}%")
print(f"The false positive rate is {100 * false_pos:.2f}%")
print(f"The false negative rate is {100 * false_neg:.2f}%")

___
#### ➡️ ✏️<font color=green>**Question 2**</font>
Can we rely on these values? Why is the false negative rate so much higher than the false positive rate?
___

## Logistic regression with cross-validation
Remember the discussion we had about training, testing, and cross-validation in the previous notebooks. Above, we have estimated the model on the entire sample and we have also used the same sample to evaluate the metrics. We know that this is a capital sin of data science, so let's do better!

We will use cross-validation to evaluate our results. In the last notebooks, we emphasized that cross-validation helps us select a model, but here, we will not do any model selection. Instead, we will use cross-validation to evaluate the performance of our model out-of-sample. Now you might begin to realize why this differentiation between testing and validating is not always clear. But as we said, this is not extremely important. **Just make sure you never judge a model's performance based on the data you used to fit it and you will be fine!**

In [None]:
from sklearn.model_selection import cross_validate # CV function
from sklearn.metrics import make_scorer # To make our custom scorers

In [None]:
# Create custom scorers for false positive rate and false negative rates
fpos_rate = lambda y_true, y_pred: np.mean(y_pred[np.where(y_true == 0)])
fneg_rate = lambda y_true, y_pred: 1 - np.mean(y_pred[np.where(y_true == 1)])
fpos_scorer = make_scorer(fpos_rate, greater_is_better=False)
fneg_scorer = make_scorer(fneg_rate, greater_is_better=False)

In [None]:
# Estimate 10-fold cross-validation, return accuracy, FPR, FNR
nfolds = 10
cv_results = cross_validate(model, X, y, cv=nfolds,
    scoring={"accuracy": "accuracy", "fpr": fpos_scorer, "fnr": fneg_scorer})

In [None]:
# Compute the metrics for the cross-validation
misclass_cv = 1 - cv_results["test_accuracy"]
false_pos_cv = -cv_results["test_fpr"]
false_neg_cv = -cv_results["test_fnr"]

In [None]:
# Quick and dirty helper to compute the mean and its standard error
mean_and_se = lambda x, n: (np.mean(100 * x), np.std(100 * x) / np.sqrt(n))

In [None]:
# Compute means and standard errors
misclass_mean, misclass_se = mean_and_se(misclass_cv, nfolds)
false_pos_mean, false_pos_se = mean_and_se(false_pos_cv, nfolds)
false_neg_mean, false_neg_se = mean_and_se(false_neg_cv, nfolds)

In [None]:
# Print error metrics
print(f"The mean misclassification rate is {misclass_mean:.2f}% (± {misclass_se:.2f}%)")
print(f"The mean false positive rate is {false_pos_mean:.2f}% (± {false_pos_se:.2f}%)")
print(f"The mean false negative rate is {false_neg_mean:.2f}% (± {false_neg_se:.2f}%)")

___
#### ➡️ ✏️<font color=green>**Question 3**</font>
What do you conclude when comparing those values to the ones obtained on the full dataset? Discuss with your classmates.
___

## The receiver operating characteristic (ROC) curve
The [receiver operating characteristic](https://en.wikipedia.org/wiki/Receiver_operating_characteristic) (ROC) curve is a diagnostic visualization of a binary classifier's performance and how this performance changes as the discrimination threshold is varied.

First, what is the **discrimination threshold**? We have mentioned above that the output of the logistic regression is a value between 0 and 1 (see `model.predict_proba(X)`). But in the end, we are not using the predicted probabilities as the labels. Instead, we are classifying the labels based on the probabilities. E.g., if we classify the label as a 1 when our predicted probability is at least 0.5, and as a 0 when it is below 0.5, then the value 0.5 is our **discrimination threshold** (also called critical threshold).

Of course, 0.5 seems like an intuitive threshold since it is *in the middle* between 0 and 1. But depending on the use case, it might be better to adjust this threshold. In fact, the ROC curve will help us select the optimal threshold for our goal.
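
To see the threshold at work, here is a minimal sketch on synthetic data generated with `make_classification` (not the loan dataset). All names ending in `_toy` or starting with `toy_`, as well as the thresholds 0.3 and 0.7, are our own illustrative choices.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data, only for illustration (not the loan dataset)
X_toy, y_toy = make_classification(n_samples=500, n_features=5, random_state=0)
toy_model = LogisticRegression().fit(X_toy, y_toy)

# Predicted probability of class 1 for each observation
prob_1 = toy_model.predict_proba(X_toy)[:, 1]

# Thresholding at 0.5 should reproduce predict() (barring exact ties at 0.5)
pred_default = (prob_1 >= 0.5).astype(int)
print(np.mean(pred_default == toy_model.predict(X_toy)))

# A lower threshold labels more observations as 1, a higher one fewer
pred_low = (prob_1 >= 0.3).astype(int)
pred_high = (prob_1 >= 0.7).astype(int)
print(pred_low.sum(), pred_default.sum(), pred_high.sum())
```

Moving the threshold does not change the fitted model or its probabilities at all. It only changes how the probabilities are turned into labels.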

To learn about the ROC curve and how we can use it to tune our model, we will again work with a training and testing set, as we did in the OLS notebook.

In [None]:
from sklearn.model_selection import train_test_split # Data splitter

In [None]:
# Split data into train and test sets
Xtrain, Xtest, ytrain, ytest = train_test_split(
    X, y, test_size=0.3, random_state=72)

In [None]:
# Initialize a logistic regression model
model_t = LogisticRegression()
# Fit the model on the training data
model_t.fit(Xtrain, ytrain)

<center><img src="https://upload.wikimedia.org/wikipedia/commons/thumb/1/13/Roc_curve.svg/1280px-Roc_curve.svg.png" style="width:300px"></center>

The above shows the ROC curve plot. It is a plot where the x-axis represents the false positive rate, and the y-axis represents the true positive rate. `sklearn` provides a function to directly plot the ROC curve from an estimator.

In [None]:
from sklearn.metrics import RocCurveDisplay

In [None]:
# Use the fitted model to predict the train and test labels
pred_prob_train = model_t.predict_proba(Xtrain)
pred_train = model_t.predict(Xtrain)
pred_prob_test = model_t.predict_proba(Xtest)
pred_test = model_t.predict(Xtest)

In [None]:
# Use the RocCurveDisplay class from sklearn
fig, ax = plt.subplots(figsize=(12, 8))
RocCurveDisplay.from_estimator(model_t, Xtrain, ytrain, ax=ax, label="Train")
RocCurveDisplay.from_estimator(model_t, Xtest, ytest, ax=ax, label="Test")
# Add grid, title
ax.grid(True)
ax.set_title("Receiver Operating Characteristic Curve")

___
#### 🤔 Pause and ponder
It's not always easy to get accustomed to the ROC curve, but try to think of it in an intuitive manner. For the logistic regression model we have set up, changing the discrimination threshold allows us to *slide along the curve*, e.g., if we choose a discrimination threshold of zero, everything will be classified as a 1 (because every probability will be above zero), and we will have a TPR of one and a FPR of one, i.e., we will be in the top right of the curve. On the other hand, if we set the discrimination threshold to one, we will be in the lower left corner, classifying everything as a 0.

If you compare this to the above picture, it means that even with the best discrimination threshold possible, we can't reach the *perfect classifier* spot using this particular model. But on the flip side, we will always do better than the random classifier (unless we choose a discrimination threshold of zero or one).

Did all of this make sense? It can be somewhat abstract at first, so don't worry if not, but make sure to ask if you have questions!
___
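
The idea of sliding along the curve can be sketched with a handful of made-up labels and probabilities. The helper `tpr_fpr` and the toy arrays are hypothetical and only serve to show the two corner cases and one point in between.

```python
import numpy as np

# Toy labels and predicted probabilities (made up for illustration)
y_true_toy = np.array([0, 0, 1, 0, 1, 1, 0, 1])
prob_toy = np.array([0.1, 0.3, 0.35, 0.4, 0.6, 0.7, 0.8, 0.9])

def tpr_fpr(y_true, prob, threshold):
    """Return (TPR, FPR) when every prob >= threshold is classified as 1."""
    pred = prob >= threshold
    tpr_t = np.mean(pred[y_true == 1])  # Share of actual 1s classified as 1
    fpr_t = np.mean(pred[y_true == 0])  # Share of actual 0s classified as 1
    return tpr_t, fpr_t

# Threshold 0 is the top right corner, threshold 1 the lower left corner
for t in [0.0, 0.5, 1.0]:
    tpr_t, fpr_t = tpr_fpr(y_true_toy, prob_toy, t)
    print(f"threshold={t:.1f}: TPR={tpr_t:.2f}, FPR={fpr_t:.2f}")
```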

___
#### ➡️ ✏️<font color=green>**Question 4**</font>
Are the true positive rate (also called recall or sensitivity) and true negative rate (also called specificity) equally important? Can you think of problems where one is more important than the other?
___

Of course, other metrics can also be helpful to evaluate the performance of our model. Let's do a confusion matrix. And then we calculate and plot the precision and recall based on the numbers in the confusion matrix.

In [None]:
from sklearn.metrics import confusion_matrix, precision_score

In [None]:
# Compute the confusion matrix
conf_mat_test = confusion_matrix(ytest, pred_test, normalize="all")

# Setting normalization to "all" means that we express all the numbers as a fraction of the overall total of all cases

In [None]:
# Display the confusion matrix in tabular format (using pandas)
pd.DataFrame(100 * conf_mat_test)

To better understand the following calculations, note the following relationships. Denote true positives and true negatives by $TP$ and $TN$, respectively, and denote false positives and false negatives by $FP$ and $FN$, respectively. We then have

$$\text{Precision} = \frac{TP}{TP + FP}$$

and

$$\text{Recall} = \frac{TP}{TP + FN}.$$

Thus, precision gives the answer to the question: out of all cases predicted as positive, what share is actually positive? And recall gives us the answer to the question: out of all cases that *are* actually positive, what share is predicted ("recalled" by the model) as positive?
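
As a small worked example of these two formulas, the sketch below uses ten made-up labels and predictions (not the loan data) and compares the manual calculation with `precision_score` and `recall_score` from `sklearn`. The array names `y_true_toy` and `y_pred_toy` are hypothetical.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Ten made-up observations: 4 actual positives, 6 actual negatives
y_true_toy = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred_toy = np.array([1, 1, 1, 0, 1, 1, 0, 0, 0, 0])

# Unpack the absolute counts: here TP = 3, FN = 1, FP = 2, TN = 4
(tn, fp), (fn, tp) = confusion_matrix(y_true_toy, y_pred_toy)

precision_manual = tp / (tp + fp)  # 3 / (3 + 2) = 0.60
recall_manual = tp / (tp + fn)     # 3 / (3 + 1) = 0.75

# The manual values should match sklearn's built-in scorers
print(precision_manual, precision_score(y_true_toy, y_pred_toy))
print(recall_manual, recall_score(y_true_toy, y_pred_toy))
```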

___
#### 🤔 Pause and ponder
Can you see that there is a trade-off between precision and recall? I.e., a higher value of one may mean a lower value of the other...

___

Now let's do the calculations...

In [None]:
# Compute TPR, precision for different values of the discrimination threshold
threshold = np.linspace(0, 1, num=100)
# Pre-allocate arrays for TPR and precision, filled in one threshold at a time
tpr = np.empty(threshold.shape)
precision = np.empty(threshold.shape)
# Iterate over thresholds
for i in range(threshold.shape[0]):
    # Make predictions according to threshold
    ypred = pred_prob_test[:, 1] >= threshold[i]
    # Compute (absolute values of) TP, FP, TN, FN using the confusion_matrix
    (tn, fp), (fn, tp) = confusion_matrix(ytest, ypred)  # Note that we do not use the normalize argument here!
    tpr[i] = tp / (tp + fn)
    # For precision, we need to be careful with division by zero!!
    if tp + fp > 0:
        precision[i] = tp / (tp + fp)
    else: # Add a NaN object
        precision[i] = np.nan

In [None]:
# Plot the TPR and precision as a function of the probability cutoff
fig, ax = plt.subplots(figsize=(12, 8))
# Plot TPR and precision
ax.plot(threshold, tpr, label="True positive rate")
ax.plot(threshold, precision, label="Precision")
# Add x-label, grid, legend, ticks
ax.set_xlabel("Discrimination threshold")
ax.grid(True)
ax.legend()
ax.set_xticks(np.linspace(0, 1, num=11))

___
#### ➡️ ✏️<font color=green>**Question 5**</font>

1. Translate this plot into plain language, such that the bank's managers could understand it! 
2. What happens with the precision just after the discrimination threshold 0.7?
3. Suppose we automate our credit decision procedures based on our logistic model. Does the choice of a particular critical probability threshold have an implication on the bottom line of our business? Why or why not? How or how not?
4. How do you decide about the best combination of precision and recall?
5. Would you want other metrics to make an informed decision for the bank? If so, what information would you additionally want to consider?