# Advanced models: SVM, DecisionTrees, Randomforest

## Date: Nov 21, 2023

---------------

## Introduction

The models in this notebook will use:

- Grid search for hyperparameter optimization
- K-fold cross-validation

Unlike logistic regression, these models do not have odds-ratio explainability.

Note:
- SVMs are very slow, especially non-linear ones. Running the models may take over an hour. The sklearnex `patch_sklearn()` call has helped tremendously.

---------------

### Table of Contents

1. [Introduction](#Introduction)
   - [Table of Contents](#Table-of-Contents)
   - [Import Libraries](#Import-Libraries)
   - [Data Dictionary](#Data-Dictionary)
   - [Load the Data](#Load-the-Data)
2. [SVM Model](#SVM-model)
   - [Assumptions](#Assumptions)
   - [Preprocessing](#Preprocessing)
   - [1st Iteration](#1st-Iteration)
   - [Evaluation](#Evaluation)
3. [Conclusion](#Conclusion)


### Import Libraries

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, ConfusionMatrixDisplay
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder
from sklearn.utils import resample
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OrdinalEncoder
from sklearn.compose import make_column_selector

from statsmodels.stats.outliers_influence import variance_inflation_factor
from pathlib import Path
from sklearnex import patch_sklearn 


from helpers import display_corr_heatmap, data_dict

Sklearnex will be used to speed up the sklearn library. If any bugs occur, simply comment out the code below.

In [None]:
#accelerate sklearn
patch_sklearn()

### Data Dictionary

In [None]:
data_dict()

### Load the Data

In [None]:
# Define the relative path to the file
parquet_file_path = Path('../Data/Lending_club/model_cleaned')

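try:
    # Read the parquet file
    loans_df = pd.read_parquet(parquet_file_path)
except FileNotFoundError as e:
    # Print the full error; e.args[1] is not guaranteed to exist
    print(e)
    print('Check file location')
    # Re-raise so later cells do not fail with an undefined loans_df
    raise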

In [None]:
loans_df.head()

### SVM model

Like logistic regression, a linear SVM creates a linear decision boundary. However, SVM aims to maximize the margin (the distance between the decision boundary and the closest points of each class) while minimizing misclassified data points. This larger buffer between classes should help the model handle the extra variance when it is evaluated on the test set.

As SVM is a linear model (for the linear kernel), it is assumed that the data is linearly separable, at least to some degree, and that there is no collinearity or multicollinearity. These linear considerations are less important for the non-linear kernels. The observations are also assumed to be independent and identically distributed (IID).

The following hyperparameters will be varied:

- C value: The C hyperparameter tells the SVM optimization how much to avoid misclassifying each training example. For very large C values, the SVM will fit the training data very closely and tends to find a very small margin in favor of making fewer mistakes. For very small C values, the SVM will be allowed to pick a wider margin by misclassifying more points. Since accuracy is preferred for this project, a higher C value is going to be preferred, as long as it does not overfit.
- Scaler: As SVM is distance based, it will benefit from feature scaling.
- Kernel: A linear kernel will be tried first, but different kernels can be used to better handle non-linearity.

More information on SVMs:  
https://towardsdatascience.com/https-medium-com-pupalerushikesh-svm-f4b42800e989s
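
The effect of C on the margin described above can be illustrated on toy data. This is a minimal sketch, assuming two overlapping synthetic clusters (not the loan data); the names `X_demo`, `y_demo`, `margins` and `support_counts` are made up for this example. For a linear kernel the margin width is 2 / ||w||, where w is the fitted coefficient vector.

```python
import numpy as np
from sklearn.svm import SVC

# Synthetic, overlapping two-class data (an assumption for illustration only)
rng = np.random.default_rng(0)
X_demo = np.vstack([
    rng.normal(loc=-1.0, scale=1.0, size=(100, 2)),
    rng.normal(loc=1.0, scale=1.0, size=(100, 2)),
])
y_demo = np.repeat([0, 1], 100)

margins = {}
support_counts = {}

for C in [0.01, 1, 100]:
    clf = SVC(C=C, kernel='linear').fit(X_demo, y_demo)

    # For a linear kernel the margin width is 2 / ||w||
    margins[C] = 2 / np.linalg.norm(clf.coef_[0])
    support_counts[C] = int(clf.n_support_.sum())

    print(f"C={C}: margin width={margins[C]:.3f}, support vectors={support_counts[C]}")
```

A small C should report a wider margin than a large C, which is the trade-off the C bullet describes.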

## Assumptions

### Collinearity

Plot a correlation heatmap for the remaining features.

In [None]:
display_corr_heatmap(loans_df)

Check for multicollinearity and collinearity before splitting the data or encoding categorical variables. First, check for multicollinearity using the Variance Inflation Factor (VIF).

In [None]:
numeric_df = loans_df.select_dtypes(include=[np.number])

#define a vif threshold
vif_cutoff = 10

#create a dataframe to hold the vif scores for each feature
vif_data = pd.DataFrame()
vif_data['feature'] = numeric_df.columns

In [None]:
%%time
#calculate the vif. This may take a few minutes
print('Running vif calculations ...')
vif_data['VIF'] = [variance_inflation_factor(numeric_df.values, i) for i in range(len(numeric_df.columns))]
print('Finished vif calculations')

In [None]:
#sort vif's in descending order (sort_values returns a new dataframe, so reassign it)
vif_data = vif_data.sort_values(by=['VIF'], ascending=False)

#filter columns with a vif greater than the cutoff and place in a list
high_vif_columns = vif_data[vif_data['VIF'] > vif_cutoff]['feature'].tolist()
display(high_vif_columns)
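
As a sanity check on what the VIF measures, it can be computed by hand: regress each feature on the remaining features and take 1 / (1 - R²). The sketch below uses synthetic data and a hypothetical helper `vif_by_hand`; it is not the statsmodels implementation. Note that it adds an intercept to each regression, whereas `variance_inflation_factor` uses the columns exactly as given, so the values can differ from the ones above.

```python
import numpy as np

def vif_by_hand(X):
    """Return the VIF of each column by regressing it on the other columns plus an intercept."""
    X = np.asarray(X, dtype=float)
    vifs = []
    for j in range(X.shape[1]):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        design = np.column_stack([np.ones(len(target)), others])

        # Ordinary least squares fit of this column on all the others
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        residuals = target - design @ coef

        r_squared = 1 - residuals.var() / target.var()
        vifs.append(1 / (1 - r_squared))
    return np.array(vifs)

# Synthetic example: x2 is almost a copy of x1, x3 is independent
rng = np.random.default_rng(0)
x1 = rng.normal(size=500)
x2 = x1 + rng.normal(scale=0.1, size=500)
x3 = rng.normal(size=500)

demo_vifs = vif_by_hand(np.column_stack([x1, x2, x3]))
print(demo_vifs)
```

The two nearly identical columns should land far above the cutoff of 10, while the independent column stays close to 1.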

In [None]:
# Drop features with high VIF, but keep loan_amnt, term and int_rate regardless of their VIF
# VIF interpretation: https://easystats.github.io/performance/reference/check_collinearity.html
filtered_high_vif_columns = [feature for feature in high_vif_columns if feature not in ['loan_amnt', 'term', 'int_rate']]

loans_df.drop(columns = filtered_high_vif_columns, inplace=True)

The remaining features:

In [None]:
loans_df.head(0)

In [None]:
display_corr_heatmap(loans_df)

### Preprocessing

***Train test split***

In [None]:
# Split the data
X = loans_df.drop(columns=['loan_status'], inplace=False)
y = loans_df['loan_status']

# Split into train and test sets. Stratify so the class imbalance of the original data is preserved in both sets.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=11, stratify=y)

***Data Imbalance***

As shown in the log reg and EDA notebooks, the data is imbalanced. For SVMs, class weights can be used to combat this; however, we have more than enough data to balance the training set by downsampling the majority class.

In [None]:
print('Number of class 1 examples before:', X_train[y_train == 1].shape[0])

# Downsample majority class
X_downsampled, y_downsampled  = resample(X_train[y_train == 1],
                                   y_train[y_train == 1],
                                   replace=False,
                                   n_samples=X_train[y_train == 0].shape[0],
                                   random_state=1)

print('\nNumber of class 1 examples after:', X_downsampled.shape[0])

# Combine the downsampled successful loans with the failed loans. Will keep as a df since changing to 
X_train_bal = pd.concat([X_train[y_train == 0], X_downsampled])
y_train_bal = np.hstack((y_train[y_train == 0], y_downsampled))

print("New X_train shape: ", X_train_bal.shape)
print("New y_train shape: ", y_train_bal.shape)
print("X_test shape: ", X_test.shape)
print("y_test shape: ", y_test.shape)
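
For comparison, the class-weight alternative mentioned above could look like the sketch below. It uses synthetic data with a 20/80 class split as an assumption; `X_demo`, `y_demo`, `weights` and `weighted_model` are hypothetical names. With `class_weight='balanced'`, each class weight is n_samples / (n_classes * class_count), and the weight multiplies C for that class.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.utils.class_weight import compute_class_weight

# Synthetic imbalanced labels: 20 examples of class 0 and 80 of class 1
y_demo = np.array([0] * 20 + [1] * 80)

# Three synthetic features, shifted by the class label so the classes differ
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3)) + y_demo[:, None]

# Weights that 'balanced' assigns: n_samples / (n_classes * class_count)
weights = compute_class_weight(class_weight='balanced', classes=np.array([0, 1]), y=y_demo)
print("Balanced class weights:", weights)

# The same weighting applied inside the model instead of resampling the data
weighted_model = SVC(kernel='linear', class_weight='balanced').fit(X_demo, y_demo)
print("Weights used by the model:", weighted_model.class_weight_)
```

This keeps every training row at the cost of a larger, slower fit, which is why downsampling is used in this notebook.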

***Inspect Categorical Features***

As in the log reg notebook, categorical features will be encoded.

In [None]:
categorical_columns = X_train_bal.select_dtypes('object').columns.tolist()
display(categorical_columns)
categorical_columns.remove('verification_status')

In [None]:
#instantiate onehot encoder
categorical_transformer = OneHotEncoder(handle_unknown='ignore')

#instantiate ordinal encoder
ordinal_transformer = OrdinalEncoder(categories=[['Not Verified', 'Source Verified', 'Verified']])

#combine into a ColumnTransformer
#verification_status is ordinal, so it is only passed to the ordinal encoder
preprocessor = ColumnTransformer(
    transformers=[
        ('cat', categorical_transformer, ['home_ownership', 'purpose', 'application_type']),
        ('ord', ordinal_transformer, ['verification_status'])],
    remainder='passthrough',
    n_jobs=2
)

#fit to the train set
preprocessor.fit(X_train_bal)

#transform the train and test sets
X_train_transformed = preprocessor.transform(X_train_bal)
X_test_transformed = preprocessor.transform(X_test)

print("Shape of train transformed: ", X_train_transformed.shape)
print("Shape of test transformed: ", X_test_transformed.shape)

### 1st Iteration

For the first iteration, the C value will be varied, as it is the most important hyperparameter for SVMs. Manually iterating will give us a target range for the grid search later.

In [None]:
%%time
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, classification_report, f1_score, precision_score

# Scaling the data
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train_transformed)
X_test_scaled = scaler.transform(X_test_transformed)

# Define a range of C values
C_values = [0.001, 0.01, 0.1, 1, 10, 100]

# Lists to store results
C_values_used = []
f1_scores = []
precision_scores_class_1 = []
train_accuracies = []
test_accuracies = []


# Loop through each C value
for C in C_values:

    print()
    print(f'Starting the loop for C value: {C}')
    
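    # Fit an SVM model. max_iter=300 caps the solver to keep the run time down,
    # so the solver may stop before it has converged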
    print('Fitting the model')
    model = SVC(C=C, kernel='linear', max_iter=300, verbose=1, cache_size=800, random_state=1)
    model.fit(X_train_scaled, y_train_bal)

    # Predict and evaluate
    print('Evaluating')
    y_pred_train = model.predict(X_train_scaled)
    y_pred = model.predict(X_test_scaled)
    
    report = classification_report(y_test, y_pred)
    
    #get the specific scores out for class 1
    f1 = f1_score(y_test, y_pred, average='binary', pos_label=1)
    precision = precision_score(y_test, y_pred, pos_label=1)
    
    # Calculate accuracies
    train_accuracy = accuracy_score(y_train_bal, y_pred_train)
    test_accuracy = accuracy_score(y_test, y_pred)
    
    train_accuracies.append(train_accuracy)
    test_accuracies.append(test_accuracy)

    # Store results
    C_values_used.append(C)
    f1_scores.append(f1)
    precision_scores_class_1.append(precision)

    # Print the results
    print(f"C value: {C}")
    print(f"F1 Score for class 1: {f1}")
    print(f"Precision for class 1: {precision}")
    print()
    print("Classification Report:")
    print(report)
    print("---------------------------------------------------")

# Print summary of results
print("Finished training and evaluation.")
print("C values used:", C_values_used)
print("F1 Scores for class 1:", f1_scores)
print("Precision for class 1:", precision_scores_class_1)
print("Train Accuracies:", train_accuracies)
print("Test Accuracies:", test_accuracies)

In [None]:
# Plotting the F1 Scores and Precision for class 1
plt.figure(figsize=(10, 6))

# Plot F1 scores for class 1
plt.plot(C_values_used, f1_scores, marker='o', label='F1 Score for class 1')
plt.plot(C_values_used, precision_scores_class_1, marker='x', linestyle='--', label='Precision for class 1')
plt.plot(C_values_used, train_accuracies, marker='*', linestyle='-', label='Train accuracy')
plt.plot(C_values_used, test_accuracies, marker='^', linestyle='-', label='Test accuracy')



plt.xscale('log')  # Since C values vary in orders of magnitude
plt.xlabel('C value')
plt.ylabel('Score')
plt.title('SVM Model Performance for Different C Values')
plt.legend()  # Add a legend to distinguish the lines
plt.grid(True)
plt.show()


In [None]:
# Plotting the results
plt.figure(figsize=(10, 6))
plt.plot(C_values_used, f1_scores, marker='o')
plt.xscale('log')  # Since C values vary in orders of magnitude
plt.xlabel('C value')
plt.ylabel('F1 Score')
plt.title('SVM Model Performance for Different C Values')
plt.grid(True)
plt.show()

In [None]:
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, classification_report, roc_auc_score
from sklearn.preprocessing import StandardScaler

# Scaling the data
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train_transformed)
X_test_scaled = scaler.transform(X_test_transformed)

# Define a range of C values
C_values = [0.001, 0.01, 0.1, 1, 10, 100]

# Loop through each C value
for C in C_values:
    print(f'Starting the loop for C value: {C}')
    
    # Fit an SVM model
    print('Fitting the model')
    model = SVC(C=C, kernel='linear', probability=True, verbose=1, random_state=1)
    model.fit(X_train_scaled, y_train_bal)

    print('Evaluating')
    # Predict and evaluate
    y_pred = model.predict(X_test_scaled)
    accuracy = accuracy_score(y_test, y_pred)
    report = classification_report(y_test, y_pred)

    # Print the results
    print(f"C value: {C}")
    print(f"Accuracy: {accuracy}")
    print("Classification Report:")
    print(report)
    print("---------------------------------------------------")

    # Calculate train accuracy
    y_train_pred = model.predict(X_train_scaled)
    train_accuracy = accuracy_score(y_train_bal, y_train_pred)
    print(f"Train Accuracy: {train_accuracy}")

    # Calculate AUC
    probs = model.predict_proba(X_test_scaled)[:, 1]
    roc_auc = roc_auc_score(y_test, probs)
    print(f'Area Under Curve (AUC): {roc_auc:.2f}')
    print()

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import validation_curve

# Define a range of C values
param_range = np.logspace(-4, 4, 20)

# Calculate validation curve
train_scores, test_scores = validation_curve(
    LogisticRegression(),
    X_train_transformed, y_train_bal, 
    param_name="C", param_range=param_range,
    cv=5, scoring="accuracy", n_jobs=-1)

# Calculate the mean training score across the CV folds
train_mean = np.mean(train_scores, axis=1)

# Calculate the mean cross-validation score across the CV folds
test_mean = np.mean(test_scores, axis=1)

# Plot mean accuracy scores for training and test sets
plt.plot(param_range, train_mean, label="Training score", color="black")
plt.plot(param_range, test_mean, label="Cross-validation score", color="dimgrey")

# Create plot
plt.title("Validation Curve With Logistic Regression")
plt.xlabel("C")
plt.ylabel("Accuracy Score")
plt.tight_layout()
plt.legend(loc="best")
plt.xscale('log')
plt.show()

In [None]:
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, classification_report
from sklearn.preprocessing import StandardScaler

# Define different C values to test
C_values = [0.001, 0.01, 0.1, 1, 10, 100, 1000]

# Initialize the scaler just once, outside the loop
#scaler = RobustScaler()
scaler = StandardScaler()

# Apply scaler to train data
X_train_scaled = scaler.fit_transform(X_train_transformed)

# Apply scaler to test data
X_test_scaled = scaler.transform(X_test_transformed)

train_scores = []
test_scores = []

# Loop over the C values
for C in C_values:
    
    # Create and fit the model
    model = SVC(C=C, verbose=3, random_state=1)
    model.fit(X_train_scaled, y_train_bal)

    # Calculate and store the scores on the same scaled data the model was fitted on
    train_score = model.score(X_train_scaled, y_train_bal)
    test_score = model.score(X_test_scaled, y_test)

    train_scores.append(train_score)
    test_scores.append(test_score)

# Plotting the scores
plt.plot(C_values, train_scores, label='Train')
plt.plot(C_values, test_scores, label='Test')
plt.xscale('log')  # Since C values vary exponentially
plt.xlabel('C value')
plt.ylabel('Accuracy')
plt.title('Accuracy vs. C value for SVM')
plt.legend()
plt.show()


# Loop through each C value
for C in C_values:
    
    print(f'Starting the loop for C value: {C}')
    
    # Fit an SVM model
    print('Fitting the model')
    model = SVC(C=C, verbose=1, random_state=1)
    model.fit(X_train_scaled, y_train_bal)

    print('Evaluating')
    # Predict and evaluate
    y_pred = model.predict(X_test_scaled)
    accuracy = accuracy_score(y_test, y_pred)
    report = classification_report(y_test, y_pred)

    # Print the results
    print(f"C value: {C}")
    print(f"Accuracy: {accuracy}")
    print("Classification Report:")
    print(report)
    print("---------------------------------------------------")

    # If you want to track train accuracy as well, you'll need to predict on the train set
    y_train_pred = model.predict(X_train_scaled)
    train_accuracy = accuracy_score(y_train_bal, y_train_pred)
    print(f"Train Accuracy: {train_accuracy}")

    #AUC
    #calculate the probability scores
    #probs = model.predict_proba(X_test_scaled)
    #roc_auc = roc_auc_score(y_test, probs)
    #print(f'Area under curve (AUC):{roc_auc}')
    #print()

Next, we will manually vary the scaler. Since SVM is distance based, it is sensitive to how the data is scaled and shifted. The standard scaler is the most common and transforms each feature so that it has a mean of 0 and a standard deviation (STD) of 1. However, it works best when the data is roughly normally distributed. The min-max scaler scales the data so that it lands within the range [0, 1]. Finally, the robust scaler works essentially the same as the standard scaler, except it uses the median and interquartile range (IQR) as opposed to the mean and STD. Since the robust scaler uses the median rather than the mean, it handles outliers and skewness better than the standard scaler.

More information on the different scalers can be found here:  
https://medium.com/@onersarpnalcin/standardscaler-vs-minmaxscaler-vs-robustscaler-which-one-to-use-for-your-next-ml-project-ae5b44f571b9
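
The difference between the three scalers is easiest to see on a single column with one outlier. This is a small sketch on made-up values; `values` and `scaled` are hypothetical names used only here.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler

# One feature with an obvious outlier (made-up values)
values = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

scaled = {}
for scaler in (StandardScaler(), MinMaxScaler(), RobustScaler()):
    name = type(scaler).__name__
    scaled[name] = scaler.fit_transform(values).ravel()
    print(f"{name:>15}: {np.round(scaled[name], 3)}")
```

With the standard and min-max scalers, the outlier squeezes the four ordinary values close together. The robust scaler centers on the median and divides by the IQR, so the ordinary values keep their spread and only the outlier ends up far away.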

In [None]:
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler
from sklearn.svm import SVC
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score, classification_report

#instantiate the scalers
standard_scaler = StandardScaler()
minmax_scaler = MinMaxScaler()
robust_scaler = RobustScaler()

# List of scalers
scalers = [standard_scaler, minmax_scaler, robust_scaler]
scaler_names = ['StandardScaler', 'MinMaxScaler', 'RobustScaler']

# Loop through each scaler
for scaler, name in zip(scalers, scaler_names):
    
    print(f'Starting the loop for scaler: {name}')
    
    # Apply scaler to train and test data
    X_train_scaled = scaler.fit_transform(X_train_transformed)
    X_test_scaled = scaler.transform(X_test_transformed)
    print('Applied the scaler')

    # Fit an SVM model
    print('Fitting the model')
    #model = LinearSVC(max_iter=10000, verbose=1)
    model = SVC(verbose=1, random_state=1)
    model.fit(X_train_scaled, y_train_bal)

    print('Evaluating')
    # Predict and evaluate
    y_pred = model.predict(X_test_scaled)
    accuracy = accuracy_score(y_test, y_pred)
    report = classification_report(y_test, y_pred)

    #print the results
    print(f"Scaler: {name}")
    print(f"Accuracy: {accuracy}")
    print("Classification Report:")
    print(report)
    print("---------------------------------------------------")

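We can plot the feature weights of a linear SVM.

In [None]:
#get the feature weights out. coef_ only exists for linear models, and the loop above fits
#the default RBF kernel, so fit a LinearSVC on the last scaled training set first
linear_model = LinearSVC(max_iter=10000, random_state=1)
linear_model.fit(X_train_scaled, y_train_bal)

feature_weights = pd.DataFrame({
    'Feature': preprocessor.get_feature_names_out(),
    'Coefficient': linear_model.coef_[0]
})

# Sort the features by their coefficient value
feature_weights = feature_weights.sort_values(by='Coefficient', ascending=True)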

In [None]:
# Plotting the feature weights
plt.figure(figsize=(10, 10))
plt.barh(feature_weights['Feature'], feature_weights['Coefficient'], color='lightblue')
plt.xlabel('Coefficient Value')
plt.ylabel('Features')
plt.title('Feature Importance')
plt.show()

Hyperparameters to use:

Since we don't know exactly which parameters to use, RandomizedSearchCV can be used. It works like GridSearchCV but only evaluates a random sample of the parameter combinations. This allows us to gauge what the range for our hyperparameters should be.
In general, different kernels work well for different tasks, and in the absence of prior knowledge, sometimes the best option is to try out various kernels.
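
A randomized search over C and the kernel could be sketched as follows. This is an illustration on synthetic data, not the search used for the loan data: the parameter ranges, `n_iter`, the `f1` scoring choice and the names `X_demo`, `y_demo`, `pipe` and `search` are all assumptions. Putting the scaler inside a pipeline means it is refitted on each training fold of the k-fold cross-validation.

```python
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Small synthetic binary classification problem
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

# Scaler and model in one pipeline so scaling happens inside each CV fold
pipe = make_pipeline(StandardScaler(), SVC())

# Sample C on a log scale and try two kernels
param_distributions = {
    'svc__C': loguniform(1e-3, 1e2),
    'svc__kernel': ['linear', 'rbf'],
}

search = RandomizedSearchCV(
    pipe,
    param_distributions,
    n_iter=10,  # only 10 random combinations are evaluated
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
    scoring='f1',
    random_state=0,
)
search.fit(X_demo, y_demo)

print("Best parameters:", search.best_params_)
print("Best mean CV F1:", round(search.best_score_, 3))
```

The best parameters from a run like this give a rough range that a finer GridSearchCV can then explore.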
