# Lesson 32 - Regularized Logistic Regression

## Prepare Environment

In [0]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, expr

from pyspark.ml.feature import VectorAssembler, OneHotEncoder, StringIndexer
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.classification import LogisticRegression, DecisionTreeClassifier
from pyspark.ml import Pipeline

from pyspark.ml.tuning import CrossValidator

spark = SparkSession.builder.getOrCreate()

## Regularized Logistic Regression

In machine learning, **regularization** refers to a collection of techniques that can be used to reduce the chance of a model overfitting the training data, hopefully producing a model that generalizes better to out-of-sample observations. One common approach to regularization when working with logistic regression models is to encourage the training algorithm to favor models with smaller coefficients. The logic behind this approach is that by selecting models with relatively small coefficients, we reduce the chance of selecting a model that is overly dependent on a single feature that might have high predictive value within the training set but might not generalize well to out-of-sample data. By encouraging the model to spread its attention more evenly across the features, we will hopefully reduce the chance that the model relies on spurious correlations present in the training set.

### Loss Functions
To understand how regularized logistic regression works, we need to take a more careful look at how the optimal logistic regression model is selected by the training algorithm. Like most machine learning algorithms, the logistic regression training algorithm identifies the optimal model by minimizing some loss function which assigns a score to each possible model. The loss function for logistic regression is known as negative log-likelihood loss. We will now discuss how it is calculated.

Consider a logistic regression model. For the \\(i\\)-th of the \\(n\\) observations in the training set, let \\(\pi_i\\) denote the probability of the observation being assigned its true label, as estimated by the model. The negative log-likelihood score for the model is then calculated using the following formula:

$$NLL\:=-\sum\limits_{i=1}^{n} \ln(\pi_i)$$

An explanation of why this loss function is used for logistic regression is beyond the scope of this course, but suffice it to say that models with lower negative log-likelihood scores tend to perform better. During training, the training algorithm will consider many different models and will ultimately select the one with the lowest value for NLL.
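
As a small illustration of the formula above, the following sketch computes the NLL score for two sets of made-up probabilities. The function name `negative_log_likelihood` and the sample values are hypothetical and do not come from the lesson's dataset.

```python
import math

def negative_log_likelihood(true_label_probs):
    """Return the sum of -ln(pi_i) over all observations."""
    return -sum(math.log(p) for p in true_label_probs)

# Hypothetical probabilities that two models assign to the TRUE label
# of each of four observations.
confident_model = [0.9, 0.8, 0.95, 0.7]
hesitant_model = [0.6, 0.5, 0.55, 0.4]

nll_confident = negative_log_likelihood(confident_model)
nll_hesitant = negative_log_likelihood(hesitant_model)

print(f'NLL (confident model): {nll_confident:.4f}')
print(f'NLL (hesitant model):  {nll_hesitant:.4f}')
```

The model that assigns higher probabilities to the true labels receives the lower NLL score, which is why minimizing NLL favors models that are confident and correct.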

### L1 and L2 Regularization

When performing regularized logistic regression, we adjust this loss function by adding a penalty based on the size of the (non-intercept) coefficients in the model. This penalty term can be defined in a variety of ways, but the two most common penalties are the L1 and L2 penalties.

$$\textrm{L1 Penalty:}\hspace{5 mm}\sum\limits_{i=1}^K \left |\hat{\beta}_i \right|$$ 
$$\textrm{L2 Penalty:}\hspace{5 mm}\sum\limits_{i=1}^K \hat{\beta}_i^2$$ 




In either case, the size of the sum is dependent on the size of the coefficient values. To construct the loss function for a regularized logistic regression model, we select one of these penalties, multiply it by a **regularization parameter** \\(\lambda\\), and then add the result to the NLL loss. This gives us the following loss functions: 

 

  
$$\textrm{L1 Regularization:}\hspace{5 mm}Loss = NLL +\lambda \sum\limits_{i=1}^K \left |\hat{\beta}_i \right|$$ 
$$\textrm{L2 Regularization:}\hspace{5 mm}Loss = NLL +\lambda \sum\limits_{i=1}^K \hat{\beta}_i^2$$ 



The regularization parameter \\(\lambda\\) controls how much weight we wish to put on the penalty terms. If we select a small value for \\(\lambda\\), then the resulting model will not be much different from a standard logistic regression model. If we select a large value for \\(\lambda\\), then the resulting model will likely have relatively small coefficients.
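
The following sketch shows how the two penalties and the regularized loss could be computed by hand. The helper names, the coefficient values, and the NLL value are all hypothetical, and the coefficient list is assumed to exclude the intercept.

```python
def l1_penalty(coefs):
    """Sum of the absolute values of the coefficients."""
    return sum(abs(b) for b in coefs)

def l2_penalty(coefs):
    """Sum of the squared coefficients."""
    return sum(b ** 2 for b in coefs)

def regularized_loss(nll, coefs, lam, penalty):
    """NLL plus lambda times the chosen penalty."""
    return nll + lam * penalty(coefs)

# Hypothetical non-intercept coefficients and NLL score for one candidate model.
coefs = [2.0, -0.5, 0.0, 1.5]
nll = 10.0

for lam in [0.0, 0.1, 1.0]:
    l1_loss = regularized_loss(nll, coefs, lam, l1_penalty)
    l2_loss = regularized_loss(nll, coefs, lam, l2_penalty)
    print(f'lambda={lam}: L1 loss={l1_loss:.3f}, L2 loss={l2_loss:.3f}')
```

With \\(\lambda = 0\\) both losses reduce to the plain NLL. As \\(\lambda\\) grows, the same coefficients cost more, so the training algorithm is pushed toward models with smaller coefficients.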

We will discuss techniques for identifying an appropriate value of \\(\lambda\\) in the next lesson. In the same lesson, we will discuss techniques for choosing between an L1 and an L2 penalty when creating a regularized logistic regression model.


These penalties are also commonly known by the names of the regression methods that use them:

- L1 penalty: lasso regression
- L2 penalty: ridge regression

### Elastic Net Regularization

As noted, we can select either an L1 or L2 penalty when building a regularized logistic regression model. It is also possible to use a blended penalty created as a weighted sum of the L1 and L2 penalties. This is referred to as **Elastic Net Regularization**. The penalty term for elastic net regularization is shown below. 

$$\textrm{Elastic Net Penalty:}\hspace{5 mm}(1-\alpha) \sum\limits_{i=1}^K \hat{\beta}_i^2 + \alpha \sum\limits_{i=1}^K \left |\hat{\beta}_i \right|$$ 

The **elastic net parameter** \\(\alpha\\) controls how the penalty term is constructed. This is selected to be a value between 0 and 1. Setting \\(\alpha = 0\\) results in a pure L2 penalty, while setting \\(\alpha = 1\\) results in a pure L1 penalty. The loss function for an elastic net model is provided below. 

$$\textrm{Elastic Net Regularization:}\hspace{5 mm} Loss = NLL +\lambda \left[(1-\alpha) \sum\limits_{i=1}^K \hat{\beta}_i^2 + \alpha \sum\limits_{i=1}^K \left |\hat{\beta}_i \right| \right]$$
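
A minimal sketch of the elastic net penalty, following the formula exactly as written in this lesson, is given below. The function name and coefficient values are hypothetical, and library implementations may scale the individual terms differently.

```python
def elastic_net_penalty(coefs, alpha):
    """Blend the L2 and L1 penalties using the elastic net parameter alpha."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError('alpha must be between 0 and 1')
    l1 = sum(abs(b) for b in coefs)
    l2 = sum(b ** 2 for b in coefs)
    return (1 - alpha) * l2 + alpha * l1

# Hypothetical non-intercept coefficients.
coefs = [2.0, -0.5, 0.0, 1.5]

for alpha in [0.0, 0.5, 1.0]:
    print(f'alpha={alpha}: penalty={elastic_net_penalty(coefs, alpha):.3f}')
```

At `alpha=0.0` the result equals the pure L2 penalty, at `alpha=1.0` it equals the pure L1 penalty, and intermediate values give a weighted blend of the two.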



  

## Load and Process Data

To demonstrate regularized logistic regression, we will use the [South German Credit dataset](https://archive.ics.uci.edu/ml/datasets/South+German+Credit+%28UPDATE%29). We will use several features from this dataset to try to predict a risk category (Good or Bad) for a potential borrower.

In [0]:
gc = (
    spark.read
    .option('delimiter', '\t')
    .option('header', True)
    .option('inferSchema', True)
    .csv('/FileStore/tables/SouthGermanCredit.txt')
)

gc.printSchema()

In [0]:
gc.select(gc.columns[:10]).show(5)
gc.select(gc.columns[10:]).show(5)

In [0]:
N = gc.count()
print(N)

### Distribution of Label Values

We will now determine the distribution of label values in the dataset.

In [0]:
gc.select('credit_risk').groupby('credit_risk')\
  .agg(
     expr('COUNT(*) as count'), 
     expr(f'ROUND(COUNT(*)/{N},4) as prop')
  ).show()

### Numerical and Categorical Features

The only numerical features in this dataset are `duration`, `amount`, and `age`. All other features are categorical.

In [0]:
num_features = ['duration', 'amount', 'age']
cat_features = [c for c in gc.columns[:-1] if c not in num_features]

print(num_features)
print(cat_features)

###  Preprocessing Pipeline

We will now create stages associated with various pre-processing tasks.

In [0]:
ix_features = [c + '_ix' for c in cat_features]
vec_features = [c + '_vec' for c in cat_features]

label_indexer = StringIndexer(inputCol='credit_risk', outputCol='label')

feature_indexer = StringIndexer(inputCols=cat_features, outputCols=ix_features)

encoder = OneHotEncoder(inputCols=ix_features, outputCols=vec_features, dropLast=False)

assembler = VectorAssembler(inputCols=num_features + vec_features, outputCol='features')
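
To make the effect of the indexing and encoding stages more concrete, here is a plain-Python sketch of the same idea applied to one made-up categorical column. It assumes the default `StringIndexer` ordering, in which the most frequent category receives index 0, and it keeps one slot per category, as `dropLast=False` does. The helper names and the category values are hypothetical.

```python
from collections import Counter

def fit_string_index(values):
    """Map each category to an index, with the most frequent category first."""
    counts = Counter(values)
    # Ties are broken alphabetically here only to keep this sketch deterministic.
    ordered = sorted(counts, key=lambda v: (-counts[v], v))
    return {v: i for i, v in enumerate(ordered)}

def one_hot(index, num_categories):
    """Return a vector with a single 1.0 in the slot given by index."""
    vec = [0.0] * num_categories
    vec[index] = 1.0
    return vec

# Hypothetical values for a single categorical feature.
purpose = ['car', 'tv', 'car', 'repairs', 'car', 'tv']

mapping = fit_string_index(purpose)
encoded = [one_hot(mapping[v], len(mapping)) for v in purpose]

print(mapping)
for value, vec in zip(purpose, encoded):
    print(value, vec)
```

In the actual pipeline, the assembler then concatenates the numerical columns with one such vector per categorical feature to form the `features` column.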

Next, we will combine the pre-processing stages into a pipeline and then apply that pipeline to our dataset.

In [0]:
pre_pipe = Pipeline(stages=[label_indexer, feature_indexer, encoder, assembler]).fit(gc)
train = pre_pipe.transform(gc)
train.persist()

train.select(['features', 'credit_risk', 'label']).show(10)

### Evaluator

We will create an accuracy evaluator for use in scoring our models.

In [0]:
accuracy_eval = MulticlassClassificationEvaluator(
    predictionCol='prediction', labelCol='label', metricName='accuracy')
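
For reference, the accuracy metric is simply the proportion of observations whose predicted label matches the true label. The sketch below computes it by hand for a few made-up labels; the function name and the values are hypothetical.

```python
def accuracy(labels, predictions):
    """Proportion of predictions that match the true labels."""
    if len(labels) != len(predictions):
        raise ValueError('labels and predictions must have the same length')
    correct = sum(1 for y, p in zip(labels, predictions) if y == p)
    return correct / len(labels)

# Hypothetical true labels and model predictions for five observations.
labels = [0.0, 1.0, 0.0, 0.0, 1.0]
predictions = [0.0, 0.0, 0.0, 0.0, 1.0]

print('Accuracy:', accuracy(labels, predictions))
```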

## Basic Logistic Regression Model

The first model we will consider is a standard logistic regression model.

In [0]:
logreg = LogisticRegression(featuresCol='features', labelCol='label')
logreg_model = logreg.fit(train)

### Training Score

In [0]:
pred = logreg_model.transform(train)
pred.persist()

train_acc = accuracy_eval.evaluate(pred)
print('Training Accuracy:', train_acc)

### Cross-Validation Score

In [0]:
cv = CrossValidator(estimator=logreg, estimatorParamMaps=[{}], 
                    evaluator=accuracy_eval, numFolds=10, seed=1)

cv_model = cv.fit(train)

cv_acc = cv_model.avgMetrics[0]

print('\nCross-Validation Estimate of Out-Of-Sample Performance:', cv_acc)
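
The idea behind the cross-validation estimate can be sketched in plain Python as follows: the rows are divided into folds, each fold is held out once while a model is trained on the remaining rows, and the validation scores are averaged. This is only an illustration of the general technique. The helper names and labels are hypothetical, a majority-class baseline stands in for the real model, and Spark's own splitting logic may differ in detail.

```python
import random

def k_fold_indices(n, k, seed=1):
    """Shuffle the row indices 0..n-1 and deal them into k validation folds."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    return [indices[j::k] for j in range(k)]

def majority_baseline_score(labels, train_idx, valid_idx):
    """Stand-in for fit + evaluate: predict the most common training label."""
    train_labels = [labels[i] for i in train_idx]
    majority = max(set(train_labels), key=train_labels.count)
    hits = sum(1 for i in valid_idx if labels[i] == majority)
    return hits / len(valid_idx)

def cross_val_score(labels, k, seed=1):
    """Average the validation accuracy over k folds."""
    n = len(labels)
    scores = []
    for valid_idx in k_fold_indices(n, k, seed):
        held_out = set(valid_idx)
        train_idx = [i for i in range(n) if i not in held_out]
        scores.append(majority_baseline_score(labels, train_idx, valid_idx))
    return sum(scores) / len(scores)

# Hypothetical labels: 14 observations of class 0 and 6 of class 1.
labels = [0] * 14 + [1] * 6

print('Cross-validation accuracy:', cross_val_score(labels, k=5))
```

Because every row is used for validation exactly once, the averaged score is an estimate of how the model would perform on data it was not trained on.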

## L1 Regularization

We will now create an L1-regularized logistic regression model with a regularization parameter value of \\(\lambda=0.1\\).

In [0]:
l1_logreg = LogisticRegression(featuresCol='features', labelCol='label',
                               elasticNetParam=1, regParam=0.1)
# elasticNetParam is the elastic net parameter alpha, which ranges from 0 to 1:
#   elasticNetParam = 1 gives L1 regularization
#   elasticNetParam = 0 gives L2 regularization
# regParam is the regularization parameter lambda, which can be any non-negative value

l1_model = l1_logreg.fit(train)

### Training Score

In [0]:
l1_pred = l1_model.transform(train)
l1_pred.persist()

l1_train_acc = accuracy_eval.evaluate(l1_pred)
print('Training Accuracy:', l1_train_acc)

### Cross-Validation Score

In [0]:
cv = CrossValidator(estimator=l1_logreg, estimatorParamMaps=[{}], evaluator=accuracy_eval, 
                       numFolds=10, seed=1)
cv_model = cv.fit(train)

l1_cv_acc = cv_model.avgMetrics[0]

print('\nCross-Validation Estimate of Out-Of-Sample Performance:', l1_cv_acc)

# Possibly over-regularized: both scores are lower than those of the basic model.

## L2 Regularization

We will now create an L2-regularized logistic regression model with a regularization parameter value of \\(\lambda=0.1\\).

In [0]:
l2_logreg = LogisticRegression(featuresCol='features', labelCol='label',
                               elasticNetParam=0, regParam=0.1)

l2_model = l2_logreg.fit(train)

### Training Score

In [0]:
l2_pred = l2_model.transform(train)
l2_pred.persist()

l2_train_acc = accuracy_eval.evaluate(l2_pred)
print('Training Accuracy:', l2_train_acc)

### Cross-Validation Score

In [0]:
cv = CrossValidator(estimator=l2_logreg, estimatorParamMaps=[{}], evaluator=accuracy_eval, 
                       numFolds=10, seed=1)
cv_model = cv.fit(train)

l2_cv_acc = cv_model.avgMetrics[0]

print('\nCross-Validation Estimate of Out-Of-Sample Performance:', l2_cv_acc)

## Comparison of Results

We will close the section by comparing the results obtained by the three models.

In [0]:
pd.DataFrame(
    data = [[train_acc, cv_acc],[l1_train_acc, l1_cv_acc],[l2_train_acc, l2_cv_acc]],
    columns = ['Training', 'Cross-Validation'],
    index = ['Basic Model', 'L1 Model', 'L2 Model']
)

| Model | Training | Cross-Validation |
|---|---|---|
| Basic Model | 0.787 | 0.753963 |
| L1 Model | 0.7 | 0.700085 |
| L2 Model | 0.783 | 0.764685 |
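
One way to read these results is to look at the gap between the training and cross-validation accuracy for each model, since a large gap can be a sign of overfitting. The sketch below uses the scores reported above; the variable names are hypothetical.

```python
# (training accuracy, cross-validation accuracy), copied from the results above.
results = {
    'Basic Model': (0.787, 0.753963),
    'L1 Model': (0.7, 0.700085),
    'L2 Model': (0.783, 0.764685),
}

# Gap between training and cross-validation accuracy for each model.
gaps = {name: train - cv for name, (train, cv) in results.items()}

# Model with the highest cross-validation accuracy.
best = max(results, key=lambda name: results[name][1])

for name, gap in gaps.items():
    print(f'{name}: gap = {gap:.4f}')
print('Highest cross-validation accuracy:', best)
```

In this run, the L2 model has both a smaller gap than the basic model and the highest cross-validation accuracy. The L1 model has almost no gap but the lowest scores overall, which is consistent with it being over-regularized at \\(\lambda=0.1\\).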
