<a href="https://colab.research.google.com/github/CBravoR/AdvancedAnalyticsLabs/blob/master/notebooks/python/Lab_6_Logistic_Regression_and_Scorecards.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Logistic Regression and Scorecards

In this lab we will finally start running models! For this we will use the excellent [```scikit-learn```](https://scikit-learn.org/stable/) package, which implements many, many data science methods. It is the go-to tool for structured data analysis.

First, we will import the data from last week. We will download them from my Google Drive.

In [None]:
# Import the data files from last week.
!gdown 'https://drive.google.com/uc?id=12AFRYPBY6N_hnvZJkSDL_nhWwjt43M-n'
!gdown 'https://drive.google.com/uc?id=1IEvsKnMMwHrOsqR1EaaaMQcms1vTiQgu'
!gdown 'https://drive.google.com/uc?id=1aDraDSR2OQbIMjIY07s-rD5cel2x_iS-'

In [None]:
!ls

Now we install the scorecardpy package and clean our data.

In [None]:
!pip install git+https://github.com/CBravoR/scorecardpy

In [None]:
# Data wrangling
import pandas as pd
import polars as pl
import polars.selectors as cs

# Scorecard construction
import scorecardpy as sc
import numpy as np

# Sampling from scikit-learn
from sklearn.model_selection import train_test_split

# Plotting
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

# Formatting
from string import ascii_letters

# Sklearn
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV, SGDClassifier
from sklearn.metrics import confusion_matrix, log_loss, auc, roc_auc_score, roc_curve
from sklearn.utils.class_weight import compute_class_weight


In [None]:
# Import the files
bankloan_train_WoE = pl.read_parquet('train_woe.parquet')
bankloan_test_WoE = pl.read_parquet('test_woe.parquet')
bankloan_data = pd.read_pickle('BankloanCleanNewVars.pkl')

# Eliminate unused variables
# bankloan_data.drop(columns=['Education'], inplace = True)

# Same train-test split as before (because of seed!)
bankloan_train_noWoE, bankloan_test_noWoE = train_test_split(bankloan_data, # Dataframe
                                                             test_size=0.3, # Test percentage
                                                             random_state=20251023, # Seed for reproducibility
                                                             stratify=bankloan_data['Default'] # How to stratify sampling
                                                             )

# Give breaks for WoE
breaks_adj = {'Address': [1.0,2.0,7.0,11.0],
              'Age': [21.0,30.0,37.0,46.0],
              'Creddebt': [1.0,6.0],
              'Education': ['Bas','Posg%,%SupInc','Med','SupCom'],
              'Employ': [2.0,4.0,11.0,18.0],
              'Income': [30.0,40.0,80.0,140.0],
              'Leverage': [3.0,7.0,10.0,17.0],
              'MonthlyLoad': [0.1,0.25,0.65],
              'OthDebt': [0.4, 1.6, 3.2],
              'OthDebtRatio': [0.04,0.07,0.09,0.13]
              }

# Apply breaks.
bins_adj = sc.woebin(bankloan_train_noWoE, y="Default",
                     breaks_list=breaks_adj)


## Generating a logistic regression object

To train a logistic regression, we first need to create an object that stores how we want the model to be trained. In general, all scikit-learn models work this way:

- We create the model we want to train, with all required parameters. This model is **not trained yet**, it just keeps the logic we will use.

- We apply the ```fit``` function to the object we just created. This takes as input the training set and the targets (if the model is supervised), and will update our model with trained parameters.

- We then apply our trained model to a test set and calculate the outputs.
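
The three steps above can be sketched on a small synthetic dataset. The data, the variable names and the default model settings here are assumptions for illustration only; they are not the bank loan data used in this lab.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical synthetic data: 200 training cases, 3 features, binary target.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 3))
y_train = (X_train[:, 0] + rng.normal(scale=0.5, size=200) > 0).astype(int)
X_test = rng.normal(size=(50, 3))

# 1. Create the model object. Nothing is trained at this point.
model = LogisticRegression(max_iter=1000)

# 2. Fit it, giving the training set and the targets.
model.fit(X_train, y_train)

# 3. Apply the trained model to new data.
test_probs = model.predict_proba(X_test)
print(test_probs.shape)
```

The same create, fit, apply pattern is what the rest of the lab follows with ```LogisticRegressionCV```.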

Logistic regression is included in the [```linear_model subpackage```](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.linear_model) and it comes pre-packaged with all regularization algorithms: the LASSO penalization, the Ridge penalization and the ElasticNet method (refer to the lectures for the explanation of these, or read this [excellent tutorial](https://codingstartups.com/practical-machine-learning-ridge-regression-vs-lasso/)).

In a nutshell, LASSO and Ridge penalize including variables by adding a term to the function being minimized: the sum of the absolute values of the coefficients (LASSO), the sum of their squares (Ridge), or a combination of the two if using Elastic Net.
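
To make the difference concrete, here is a minimal sketch of the penalty term for a given coefficient vector. The function name ```elastic_net_penalty``` is hypothetical, and the factor 0.5 on the quadratic term is one common convention (I am assuming the one in the scikit-learn user guide); other references scale it differently.

```python
import numpy as np

def elastic_net_penalty(coef, l1_ratio):
    """Penalty for a coefficient vector (hypothetical helper).

    l1_ratio = 1 gives the LASSO penalty, l1_ratio = 0 the Ridge penalty,
    and anything in between mixes the two.
    """
    coef = np.asarray(coef, dtype=float)
    l1_term = np.sum(np.abs(coef))      # grows linearly with each |coefficient|
    l2_term = 0.5 * np.sum(coef ** 2)   # grows quadratically with each coefficient
    return l1_ratio * l1_term + (1.0 - l1_ratio) * l2_term

coef = [0.5, -2.0, 0.0]
lasso = elastic_net_penalty(coef, 1.0)
ridge = elastic_net_penalty(coef, 0.0)
mixed = elastic_net_penalty(coef, 0.5)
print(lasso, ridge, mixed)
```

Note how the large coefficient (-2.0) dominates the quadratic term, while the linear term treats every unit of coefficient the same. That is the intuition behind LASSO pushing small coefficients all the way to zero.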

These methods have hyperparameters that need to be optimized. For this we will use a cross-validation procedure (again, refer to the lectures). Luckily for us, scikit-learn already comes with an object that allows cross-validated optimization of the penalization parameter. The class to use is [```LogisticRegressionCV```](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegressionCV.html#sklearn.linear_model.LogisticRegressionCV).

Let's start by creating this object.




In [None]:
bankloan_logreg = LogisticRegressionCV(penalty='elasticnet', # Type of penalization l1 = lasso, l2 = ridge, elasticnet
                                     Cs = 10,        # How many parameters to try. Can also be a vector with parameters to try.
                                     tol=0.000001, # Tolerance for parameters
                                     cv = 3,     # How many CV folds to try. 3 or 5 should be enough.
                                     fit_intercept=True, # Use constant?
                                     class_weight='balanced', # Weights, see below
                                     random_state=20190301, # Random seed
                                     max_iter=100, # Maximum iterations
                                     verbose=2, # Show process. 1 is yes.
                                     solver='saga', # How to optimize.
                                     n_jobs=2,      # Processes to use. Set to number of physical cores.
                                     refit=True,     # If to retrain with the best parameter and all data after finishing.
                                     l1_ratios = np.arange(0, 1.01, 0.1), # The LASSO / Ridge ratios.
                                    )

Let's dig deeper into what is needed.

**Penalty**

The 'l1' penalty refers to LASSO regression (great at selecting variables), 'l2' to Ridge regression (not very good at selecting variables), and 'elasticnet' to a combination of the two. My advice: as long as you have more samples than variables, start with LASSO; if it doesn't work or you are not happy with the results, move to elasticnet.

**Penalty constants to try (```Cs```)**

This refers to how many LASSO or Ridge parameters to try. These parameters measure the weight of the error in prediction versus the regularization (penalty) error. When optimizing the parameters, the model will minimize the following:

$$
Error = Error_{prediction} + \frac{1}{C} \times Error_{penalty}
$$

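So the $C$ constant balances both objectives: a smaller $C$ means a stronger penalty. If ```Cs``` is an integer, scikit-learn tries that many values of $C$, chosen on a logarithmic scale between 1e-4 and 1e4; if it is a list, it tries exactly the values given.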
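
As a rough sketch of what an integer ```Cs``` means: the scikit-learn documentation states that the values are chosen on a logarithmic scale between 1e-4 and 1e4. Assuming that grid is equivalent to ```np.logspace(-4, 4, Cs)```, the candidate values and the corresponding penalty weights $1/C$ would be:

```python
import numpy as np

n_values = 10  # same as Cs = 10 in the cell above

# Candidate C values (assumed equivalent to what an integer Cs produces).
c_grid = np.logspace(-4, 4, n_values)

# Weight given to the penalty term for each candidate: 1 / C.
penalty_weight = 1.0 / c_grid

for c, w in zip(c_grid, penalty_weight):
    print(f"C = {c:12.4f} -> penalty weight = {w:12.4f}")
```

Small values of $C$ give a heavily penalized model, large values give a model that is almost unpenalized.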

**Class weighting**

Most interesting problems are unbalanced. This means the interesting class (Default in our case) has fewer cases than the opposite class. Models optimise the sum over **all** cases, so if we minimize the error, which class do you think will be better classified?

This means we need to balance the classes to make them equal. Luckily for us, Scikit-Learn includes automatic weighting that assigns the same error to both classes. The error becomes the following:

$$
Error = Weight_1 \times Error_{predictionClass1} + Weight_2 \times Error_{predictionClass2} + \frac{1}{C} \times Error_{penalty}
$$

The weights are selected so the theoretical maximum error in both classes is the same (see the help for the exact equation).
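
For reference, the 'balanced' weights follow the formula ```n_samples / (n_classes * np.bincount(y))```. A small sketch on a hypothetical target with 8 non-defaulters and 2 defaulters (made-up numbers, not the lab data):

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical unbalanced target: 8 cases of class 0, 2 cases of class 1.
y = np.array([0] * 8 + [1] * 2)

# Weights computed by scikit-learn.
weights = compute_class_weight("balanced", classes=np.array([0, 1]), y=y)

# The same weights computed by hand.
manual = len(y) / (2 * np.bincount(y))

# Total weight carried by each class: cases in the class times its weight.
total_per_class = np.bincount(y) * weights
print(weights, manual, total_per_class)
```

Each class ends up carrying the same total weight, which is what lets the minority class matter as much as the majority one.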

**Random State**

The random seed. Remember to use your student ID.

**Iterations**

The solution comes from an iterative procedure, thus we specify a maximum number of iterations. Remember to check for convergence after it has been solved!

**Solver**

Data science functions are complex ones, with thousands, millions, or even billions of parameters. Thus we need to use the best possible solver for our problems. Several are implemented in scikit-learn. The help states that:


- For small datasets, ‘liblinear’ is a good choice, whereas ‘sag’ and ‘saga’ are faster for large ones.
- For multiclass problems, only ‘newton-cg’, ‘sag’, ‘saga’ and ‘lbfgs’ handle multinomial loss; ‘liblinear’ is limited to one-versus-rest schemes.
- ‘newton-cg’, ‘lbfgs’ and ‘sag’ only handle L2 penalty, whereas ‘liblinear’ and ‘saga’ handle L1 penalty.

We will use 'saga', a very efficient solver and the only one here that supports the 'elasticnet' penalty. You can read all about it [here](https://www.di.ens.fr/~fbach/Defazio_NIPS2014.pdf).

**refit**

If your data is sufficiently small to fit in memory, you will be able to use all of the training data for the cross-validation process. If so, then with ```refit=True``` you will retrain the model after the parameter search, using the optimal parameter found.

However, in large datasets this might not be possible. In this case:

1. Obtain a **validation sample** from the original training data. Usually 20% of data is used, but it depends on memory and time constraints.

2. Run the Cross-validation process over this validation data and find the optimal parameter. Let's call it $C^*$.

3. Train a logistic regression with all training data, but with a fixed parameter $C^*$. For this you need to use the function [```LogisticRegression```](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html#sklearn.linear_model.LogisticRegression) in scikit-learn and give the parameter ```C=YOUR_OPTIMAL_C```. The rest of the parameters are similar to ```LogisticRegressionCV```.
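
The three steps above can be sketched as follows. This uses hypothetical synthetic data and the default (Ridge) penalty to keep it short, so treat it as an outline of the workflow rather than the exact settings of this lab.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
from sklearn.model_selection import train_test_split

# Hypothetical "large" training set.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))
y = (X[:, 0] - X[:, 1] + rng.normal(size=1000) > 0).astype(int)

# 1. Obtain a validation sample with 20% of the training data.
_, X_val, _, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# 2. Run the cross-validation over the validation sample only.
search = LogisticRegressionCV(Cs=5, cv=3, max_iter=1000)
search.fit(X_val, y_val)
best_C = float(np.ravel(search.C_)[0])  # this is C*

# 3. Train on all the training data with C fixed at C*.
final_model = LogisticRegression(C=best_C, max_iter=1000)
final_model.fit(X, y)
print(best_C, final_model.coef_)
```

With truly large data, step 3 is the only one that touches the whole training set, and it only has to be run once.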

The ```LogisticRegression``` object has another interesting parameter for big data models: ```warm_start```.

**Warm start**

Scikit-learn allows for multiple adjustments to the training. For example, you can first train with a little bit of data just to check that everything is working; then, if you set ```warm_start = True``` beforehand, the next call to ```fit``` will start from the previously learned parameters. This also allows for dynamic updating. ```warm_start = False``` means that whenever we give it new data, it will start from scratch, forgetting what it previously learned.

**l1_ratios**

These are the balance parameters between LASSO and Ridge for the ElasticNet optimization, with 0 <= l1_ratio <= 1. A value of 0 is equivalent to using penalty='l2', while 1 is equivalent to using penalty='l1'.

## Training!

Now we are ready to train. We simply apply the method ```fit``` to our data, giving it the training set and the target variable as inputs.

In [None]:
bankloan_train_WoE.columns

In [None]:
bankloan_logreg.fit(X=bankloan_train_WoE.drop("Default"),
                    y=bankloan_train_WoE['Default'] # The target
                   )

Let's read the output:

```convergence after 25 epochs took 0 seconds```

The method was able to find a solution at the given tolerance, and it took 25 epochs and almost no time. **If the method says it did not converge, then you need to increase the iterations, change C, or both!**

The rest of the output refers to what it did, it is not relevant at this stage.

Done! We have a logistic regression! Let's check the parameters, sorted into a nice table.


In [None]:
coef_df = pd.concat([pd.DataFrame({'column': bankloan_train_WoE.drop("Default").columns}),
                    pd.DataFrame(np.transpose(bankloan_logreg.coef_))],
                    axis = 1
                   )

coef_df

We can see the parameter for each variable now. This does not include the constant. We can get it with

In [None]:
bankloan_logreg.intercept_

We can see all variables are being used, and the intercept is really close to 0. This is expected in a balanced logistic regression that uses WoE transform and is a way to check everything is working as intended.

We can see that most coefficients are correctly determined, even in the presence of correlations. This happens because the **ElasticNet penalty deals with correlations gracefully**. This is NOT the case if we had a LASSO regression. Try it yourself and see. In that case, you would need to manually eliminate the variables so everything works correctly.

However, Income is correlated with other variables, and its coefficient reflects this. At this point, it may be better to remove the variable and rerun our models.

In [None]:
# Retrain, without income.
x_drop = ["Default", "Income_woe"]
bankloan_logreg.fit(X = bankloan_train_WoE.drop(x_drop),
                    y = bankloan_train_WoE['Default'] # The target
                   )

# Check coefficients.
coef_df = pd.concat([pd.DataFrame({'column': bankloan_train_WoE.drop(x_drop).columns}),
                    pd.DataFrame(np.transpose(bankloan_logreg.coef_))],
                    axis = 1
                   )

coef_df

We can also check the optimal hyperparameters found.

In [None]:
print(f"The L1 ratio is {bankloan_logreg.l1_ratio_.item():.3f}")
print(f"The C parameter is {bankloan_logreg.C_.item():.3f}")

And now we can train the final logistic regression.

In [None]:
# Define the object
bankloan_logreg = LogisticRegression(penalty='elasticnet', # Type of penalization l1 = lasso, l2 = ridge, elasticnet
                                     C = 0.359,        # Optimal C found by the cross-validation above.
                                     l1_ratio = 0.100, # l1_ratio
                                     tol=0.000001, # Tolerance for parameters
                                     fit_intercept=True, # Use constant?
                                     class_weight='balanced', # Weights, see below
                                     random_state=20190301, # Random seed
                                     max_iter=100, # Maximum iterations
                                     verbose=2, # Show process. 1 is yes.
                                     solver = 'saga', # How to optimize.
                                     n_jobs = 2,      # Processes to use. Set to number of physical cores.
                                    )

# Train it
bankloan_logreg.fit(X = bankloan_train_WoE.drop(x_drop),
                    y = bankloan_train_WoE['Default'])

## Applying to the test set

We can now apply our model to the test set and check the results. Most models in scikit-learn have the ```predict``` method, which applies the model to new data and gives the 0-1 prediction. Alternatively (and more usefully), we can use the ```predict_proba``` method, which gives the probability.

In [None]:
pred_class_test = bankloan_logreg.predict(bankloan_test_WoE.drop(x_drop))
probs_test = bankloan_logreg.predict_proba(bankloan_test_WoE.drop(x_drop))
print(probs_test[0:5], pred_class_test[0:5])

Scikit-learn will give, by default, one probability per class.  The second column is the one that applies for class Default = 1.
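
A short sketch of how the two outputs relate, on hypothetical synthetic data. The columns of ```predict_proba``` follow the order of ```model.classes_```, and for a binary logistic regression the 0-1 prediction matches a 0.5 cut-off on the probability of class 1.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical synthetic data with a binary target.
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 2))
y = (X[:, 0] + rng.normal(scale=0.5, size=300) > 0).astype(int)

model = LogisticRegression(max_iter=1000).fit(X, y)
probs = model.predict_proba(X)

# Find which column belongs to class 1 instead of assuming it.
positive_col = list(model.classes_).index(1)
prob_default = probs[:, positive_col]

# Reproduce predict() by applying a 0.5 cut-off to that probability.
manual_class = (prob_default > 0.5).astype(int)
print(probs[:3], manual_class[:3])
```

Working with the probabilities lets you choose a cut-off other than 0.5, which is usually what a credit decision needs.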

We will get the confusion matrix to check our accuracy. It is included in the subpackage ```sklearn.metrics```, and we will plot it using seaborn.

In [None]:
# Calculate confusion matrix
confusion_matrix_cs = confusion_matrix(y_true = bankloan_test_WoE['Default'],
                                        y_pred = pred_class_test)


# Turn matrix to percentages
confusion_matrix_cs = confusion_matrix_cs.astype('float') / confusion_matrix_cs.sum(axis=1)[:, np.newaxis]

# Turn to dataframe
df_cm = pd.DataFrame(
        confusion_matrix_cs, index=['Non Defaulter', 'Defaulter'],
        columns=['Non Defaulter', 'Defaulter'],
)

# Parameters of the image
figsize = (10,7)
fontsize=14

# Create image
fig = plt.figure(figsize=figsize)
heatmap = sns.heatmap(df_cm, annot=True, fmt='.2f')

# Make it nicer
heatmap.yaxis.set_ticklabels(heatmap.yaxis.get_ticklabels(), rotation=0,
                             ha='right', fontsize=fontsize)
heatmap.xaxis.set_ticklabels(heatmap.xaxis.get_ticklabels(), rotation=45,
                             ha='right', fontsize=fontsize)

# Add labels
plt.ylabel('True label')
plt.xlabel('Predicted label')

# Plot!
plt.show()

Pretty good model!
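
As a side note, the row normalization used for the heatmap can also be requested directly with the ```normalize='true'``` argument of ```confusion_matrix```. A small sketch with made-up labels shows both routes give the same matrix:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Made-up labels: 4 true non-defaulters and 2 true defaulters.
y_true = np.array([0, 0, 0, 0, 1, 1])
y_pred = np.array([0, 0, 0, 1, 1, 0])

# Raw counts, then divide each row by the number of true cases in that class.
counts = confusion_matrix(y_true, y_pred)
manual = counts / counts.sum(axis=1, keepdims=True)

# Same result in one call.
built_in = confusion_matrix(y_true, y_pred, normalize="true")
print(manual)
print(built_in)
```

Each row then sums to 1, so the diagonal reads as the share of each true class that was correctly classified.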

## Larger-than-memory training: Partial fit.

What if our data does not fit in memory? In these cases, most models come with some sort of solution. Polars offers the very useful [LazyFrame](https://docs.pola.rs/api/python/stable/reference/lazyframe/index.html), which can delay reading the data until the last possible moment. That still leaves one problem: we can only ever read the data in chunks. How can we train the model over partial segments of data, so we can train even if our data does not fit in RAM?

This can be easily done by combining LazyFrames with the ```sklearn``` models that have a [```partial_fit```](https://scikit-learn.org/0.15/modules/scaling_strategies.html#incremental-learning) method available. This is called **Incremental Learning**. Besides many sklearn methods, XGBoost also comes with this functionality, as we will see in future labs.

There is one important consideration in this type of fit: you must pass the full list of classes on the very first call to ```partial_fit```, so that the model knows all the classes that are available. Other than that, partial data is handled gracefully.

For logistic regression, the classifier we will use is the Stochastic Gradient Descent classifier [```SGDClassifier```](https://scikit-learn.org/0.15/modules/generated/sklearn.linear_model.SGDClassifier.html#sklearn.linear_model.SGDClassifier). The strategy will be:

1. Get the dimension of the data using polars. We will not load the data until the very last possible moment.
2. Load a chunk that we know fits in memory. Our data is small, so for the purpose of our example, we will train in chunks of 200 cases.
3. Iterate until we cover the full data. For that, randomly select 200 cases of the train set, pass them to ```partial_fit``` and check convergence. We stop once the model stops improving.
4. Validate the model over the test set and compare our performance.

Let's start by creating our LazyFrame and checking the dimension.

In [None]:
bankloan_lazy_train = pl.scan_parquet('train_woe.parquet')
bankloan_lazy_train.describe()

We have the same 1034 cases as before. This operation was run **without ever loading the full dataset into memory**. We can, in theory, run descriptives over thousands of variables as long as one fits in memory.

Let's assume we measured the RAM usage and we determined that we can load at most **200 cases** into memory. How can we train over those? For that, we first need to define an SGD classifier.

In [None]:
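# Add a random column for sampling. The row count is computed lazily,
# so the full dataset is not collected just to get its height.
n_rows = bankloan_lazy_train.select(pl.len()).collect().item()
bankloan_lazy_train = bankloan_lazy_train.with_columns(
    pl.lit(np.random.rand(n_rows)).alias("rand")
    )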

# Split train and validation.
bankloan_lazy_train_sample = bankloan_lazy_train.filter(pl.col("rand") < 0.7).drop("Income_woe")
bankloan_lazy_val_sample = bankloan_lazy_train.filter(pl.col("rand") >= 0.7).drop("Income_woe").collect()

# Describe train
bankloan_lazy_train_sample.describe()

# Calculate the class weight
class_weight = compute_class_weight("balanced", classes=np.array([0,1]),
                                    y=bankloan_lazy_train_sample.select(pl.col("Default")).collect().to_numpy().ravel()
                                    )

# Alternative: never load the Default column into memory.
# pos_cases = bankloan_lazy_train_sample.select(pl.col("Default")).sum().collect().to_numpy().ravel().item()
# neg_cases = bankloan_lazy_train_sample.select(pl.col("Default")).count().collect().to_numpy().ravel().item() - pos_cases
# class_weight = {0: 1, 1:  neg_cases / pos_cases}
# print(class_weight)

# Calculate class_weight dictionary
class_weight = {0: class_weight[0], 1: class_weight[1]}

# Define SGDClassifier
sgd_bankloan = SGDClassifier(loss='log_loss',  # Loss to use. 'log_loss' gives a logistic regression.
                             penalty='elasticnet', # Type of penalization l1 = lasso, l2 = ridge, elasticnet
                             alpha=0.359, #Multiplicative constant of the penalization.
                             l1_ratio=0.100, # Lasso penalty
                             verbose=0, # Output to give. Write 0 for silent training.
                             n_jobs=-1, # Use all cores.
                             random_state=20251030, # Random state.
                             class_weight=class_weight, # How to balance classes.
                             warm_start=False # If each call to fit deletes the original parameters. VERY IMPORTANT IN INCREMENTAL LEARNING.
                             )

Now that we have the model defined and the train and validation sets split, we can train.

Our training sample has 726 cases. In each iteration we will randomly select a fraction of it, set by ```batch_fraction``` in the code below. Let's now create the loop. We will run at most ```n_iter``` iterations, or stop earlier once the validation loss is within ```tol``` of the best loss seen so far.

In [None]:
# Define the arrays where we will save the performance.
train_loss = np.array([])

# Define the constants to measure.
best_loss = 10e9
n_iter = 1000
batch_fraction = 0.33
tol = 10e-6

# Create the loop
for i in range(n_iter):
  # Get the sample from the train set. Select 33% of total.
  train_slice = bankloan_lazy_train_sample.filter(
    pl.linear_space(0, 1, pl.len()) # Generate random number
      .sample(fraction=1, with_replacement=True, shuffle=True) <= batch_fraction).drop("rand") # Sample

  # Collect to RAM
  train_slice = train_slice.collect()

  # Train the chunk
  sgd_bankloan.partial_fit(X = train_slice.drop("Default"),
                           y = train_slice["Default"],
                           classes=np.array([0, 1]))

  # Get the values of the loss over the validation set.
  y_pred_proba = sgd_bankloan.predict_proba(bankloan_lazy_val_sample.drop(["Default", "rand"]))
  iter_log_loss = log_loss(bankloan_lazy_val_sample["Default"], y_pred_proba)
  train_loss = np.append(train_loss, iter_log_loss)
  tol_iter = np.abs(best_loss - iter_log_loss)
  print(f"Iteration {i}: Log Loss = {iter_log_loss:.3f}, Tolerance = {tol_iter:.6f}")

  # Check if best_loss is worse than current loss. If so, replace.
  if best_loss > iter_log_loss:
    best_loss = iter_log_loss

  # Check if tolerance is within range.
  if tol_iter < tol:
    print("Converged!")
    break


Now we have a model fully trained!

In [None]:
# Check coefficients.
coef_df = pd.concat([pd.DataFrame({'column': bankloan_lazy_val_sample.drop(["Default", "rand"]).columns}),
                    pd.DataFrame(np.transpose(sgd_bankloan.coef_))],
                    axis = 1
                   )

coef_df

## Scorecards

The package ```scorecardpy``` has the function ```scorecard```, which receives the WoE bins, a logistic regression model trained over WoE-transformed data **of the same variables**, and a list of matched columns (that is, the order of the columns in the trained model). As optional arguments it receives a PDO, a base score, and decimal base odds (so instead of 50:1, it receives 0.02).

You should adjust these values so the score is in an acceptable range, typically between 0 and 1000.
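
To get a feeling for how these three values interact, here is a sketch of a common scorecard scaling convention. The function ```scale_score``` is hypothetical and I am assuming ```scorecardpy``` follows a similar convention internally, so check the package source before relying on the exact numbers.

```python
import numpy as np

def scale_score(log_odds_bad, points0=750, odds0=0.01, pdo=50):
    """Turn log-odds of being bad into a score (hypothetical helper).

    At odds0 (bads:goods) the score equals points0, and every time the
    odds of being bad are halved the score goes up by pdo points.
    """
    factor = pdo / np.log(2)
    offset = points0 + factor * np.log(odds0)
    return offset - factor * log_odds_bad

base = scale_score(np.log(0.01))      # a case exactly at the base odds
safer = scale_score(np.log(0.005))    # half the odds of being bad
riskier = scale_score(np.log(0.02))   # double the odds of being bad
print(base, safer, riskier)
```

Under this convention, raising ```points0``` shifts every score up, while raising ```pdo``` stretches the range, which is how you would bring the scores into the 0 to 1000 band.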

In [None]:
bankloan_sc = sc.scorecard(bins_adj,         # bins from the WoE
                           bankloan_logreg,  # Trained logistic regression
                           bankloan_test_WoE.drop(x_drop).columns, # The column names in the trained LR
                           points0=750, # Base points
                           odds0=0.01, # Base odds bads:goods
                           pdo=50 # PDO
                           )


In [None]:
bankloan_sc

In [None]:
# Applying the credit score. Applies over the original data!
train_score = sc.scorecard_ply(bankloan_train_noWoE, bankloan_sc,
                               print_step=0)
test_score = sc.scorecard_ply(bankloan_test_noWoE, bankloan_sc,
                               print_step=0)

In [None]:
train_score.describe()

## ROC Curves

To finish the lab, let's compare the logistic regression and the SGD over the test set.

In [None]:
# Get predicted probabilities over the test set
y_pred_proba_test = bankloan_logreg.predict_proba(bankloan_test_WoE.drop(x_drop))
y_pred_proba_test_sgd = sgd_bankloan.predict_proba(bankloan_test_WoE.drop(["Income_woe","Default"]))

# Set models and probabilities. This structure is a list of dictionaries.
models = [
{
    'label': 'Logistic Regression',
    'probs': y_pred_proba_test[:,1]
},
{
    'label': 'SGD',
    'probs': y_pred_proba_test_sgd[:,1]
}
]

# Loop that creates the plot. I will pass each ROC curve one by one.
for m in models:
  # Named auc_score so it does not overwrite the auc function imported from sklearn.metrics.
  auc_score = roc_auc_score(y_true = bankloan_test_WoE['Default'],
                            y_score = m['probs'])
  fpr, tpr, thresholds = roc_curve(bankloan_test_WoE['Default'],
                                   m['probs'])
  plt.plot(fpr, tpr, label=f'{m["label"]} ROC (area = {auc_score:.3f})')


# Settings
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('1 - Specificity (False Positive Rate)')
plt.ylabel('Sensitivity (True Positive Rate)')
plt.title('Receiver Operating Characteristic')
plt.legend(loc="lower right")

# Plot!
plt.show()

We can see a slight difference between the SGD training and the LogReg training, due to the training strategy. Still, pretty close!
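
If you want to check what the area in the legend means, this small sketch with made-up scores computes it in two ways: directly with ```roc_auc_score```, and by integrating the curve returned by ```roc_curve``` with the ```auc``` function.

```python
import numpy as np
from sklearn.metrics import auc, roc_auc_score, roc_curve

# Made-up targets and predicted probabilities.
y_true = np.array([0, 0, 1, 1, 0, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.9])

# Area obtained by integrating the ROC curve.
fpr, tpr, thresholds = roc_curve(y_true, scores)
area_from_curve = auc(fpr, tpr)

# Area obtained directly from the scores.
area_direct = roc_auc_score(y_true, scores)
print(area_from_curve, area_direct)
```

Here 8 out of the 9 possible (defaulter, non-defaulter) pairs are ranked correctly, so both routes give 8/9.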

And that's it! We have a fully functional credit scorecard. In later labs we will contrast this with two more models: a Random Forest and an XGBoost model.