
# Predicting Loan Defaults Using Machine Learning Models


#### A Comparative Analysis of Linear Regression, Ridge Regression, Lasso Regression, Random Forest and Neural Networks

### Introduction

Our goal in this project is to use data from a peer-to-peer lending network to forecast loan defaults. Using the features provided in the dataset, our main goal is to determine which model performs the best at forecasting loan defaults. We will train and assess a number of machine learning models by examining a random sample of loans in order to see which one most correctly predicts who will default on their loan. To discover the best prediction model, this entails generating a new target variable, preparing the data, applying various models, and assessing how well each model performs.


### Provided Data Files:
- **trainData**: The dataset on which you will train all your models.
- **testData**: The dataset on which you will evaluate your model’s fit.
- **varDescription**: A description of the features included in the dataset.

### Steps Included:
1. **Data Preparation**: Handling missing values, encoding categorical variables, and normalizing numerical features to prepare the data for modeling.
2. **Linear Regression**: Fitting a linear regression model to serve as a baseline.
3. **Ridge Regression**: Introducing L2 regularization to address multicollinearity and improve generalization.
4. **Lasso Regression**: Using L1 regularization to perform feature selection and simplify the model.
5. **Random Forest**: Implementing an ensemble learning method to capture complex relationships and interactions in the data.
6. **Neural Network**: Building and training a neural network model to learn non-linear patterns in the data.
7. **Evaluation**: Comparing the predictive power of all approaches to identify the best model based on performance metrics.

This coding file offers a thorough method for forecasting loan defaults based on the provided datasets. It contains the implementation and results for each model.

### Installing Necessary Libraries

Before running the code, ensure you have the required Python libraries installed. These libraries provide essential tools for data manipulation, machine learning model implementation, and evaluation.

We can install the required libraries using the following command:

In [5]:
pip install pandas numpy scikit-learn tensorflow matplotlib seaborn

Note: you may need to restart the kernel to use updated packages.


This command will install:
- **pandas**: For data manipulation and analysis.
- **numpy**: For numerical computing.
- **scikit-learn**: For machine learning algorithms.
- **tensorflow**: For building and training neural networks.
- **matplotlib**: For creating visualizations.
- **seaborn**: For statistical data visualization.

# Linear Regression Model


A basic statistical technique for simulating the relationship between a dependent variable (y) and one or more independent variables (X) is called linear regression. The objective is to identify the linear equation with the best fit that uses the input information to predict the result variable.

In the context of this assignment, linear regression serves as the baseline model to understand the relationship between the predictors in our dataset and the target variable \( y \). The model attempts to minimize the sum of squared residuals, which measures the difference between observed and predicted values.

Before applying the model, we prepared our data by handling missing values, encoding categorical variables, and normalizing numerical features. This preprocessing ensures the data is clean and suitable for modeling, improving the accuracy and reliability of our predictions.

We will start by fitting a simple linear regression model on the training data. Its performance will be evaluated using Mean Squared Error as we develop an application. This will serve as a reference to which more complex models like Ridge and Lasso regression are compared. The simplicity of linear regression makes it quite useful in a starting problem of regression analysis and provides insight into possible importance and potential impacts of different predictors.

Key Points:
- **Model**: Linear regression
- **Objective**: Predict the outcome variable \( y \) using the predictors in the dataset.
- **Data Preparation**: Handling missing values, encoding categorical variables, and normalizing numerical features.
- **Evaluation Metric**: Mean Squared Error (MSE) on training and test data.
- **Baseline**: Provides a benchmark to compare more complex regularization techniques like Ridge and Lasso regression.

In [8]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.impute import SimpleImputer

# Let's load the datasets
train_data = pd.read_csv('trainData.csv')
test_data = pd.read_csv('testData.csv')

# We are creating the target variable 'y'
train_data['y'] = np.where(train_data['loan_status'] == 'Charged Off', 1, 0)
test_data['y'] = np.where(test_data['loan_status'] == 'Charged Off', 1, 0)

# Let's drop the target variable 'loan_status' and columns 'id' and 'member_id' from the predictors
X_train = train_data.drop(['loan_status', 'y', 'id', 'member_id'], axis=1)
y_train = train_data['y']

X_test = test_data.drop(['loan_status', 'y', 'id', 'member_id'], axis=1)
y_test = test_data['y']

# Separate numerical and categorical columns
numerical_cols = X_train.select_dtypes(include=['int64', 'float64']).columns
categorical_cols = X_train.select_dtypes(include=['object']).columns

# Imputing missing values
numerical_imputer = SimpleImputer(strategy='mean')
categorical_imputer = SimpleImputer(strategy='most_frequent')

X_train[numerical_cols] = numerical_imputer.fit_transform(X_train[numerical_cols])
X_train[categorical_cols] = categorical_imputer.fit_transform(X_train[categorical_cols])

X_test[numerical_cols] = numerical_imputer.transform(X_test[numerical_cols])
X_test[categorical_cols] = categorical_imputer.transform(X_test[categorical_cols])

# Converting categorical variables to numerical using one-hot encoding
X_train = pd.get_dummies(X_train, drop_first=True)
X_test = pd.get_dummies(X_test, drop_first=True)

# Aligning the test set to the train set
X_test = X_test.reindex(columns=X_train.columns, fill_value=0)

# Fitting the Linear Regression Model
lr_model = LinearRegression()
lr_model.fit(X_train, y_train)

# Predicting on training and testing data
y_train_pred = lr_model.predict(X_train)
y_test_pred = lr_model.predict(X_test)

# Calculating Mean Squared Error
mse_train = mean_squared_error(y_train, y_train_pred)
mse_test = mean_squared_error(y_test, y_test_pred)

print(f"Mean Squared Error on Training Data: {mse_train}")
print(f"Mean Squared Error on Test Data: {mse_test}")


Mean Squared Error on Training Data: 0.0678244779783839
Mean Squared Error on Test Data: 0.06863270417879624


### Findings from Linear Regression Model

The Linear Regression model was evaluated to determine its performance in predicting the target variable. The results are as follows:

- **Mean Squared Error on Training Data**: 0.0678244779783839
- **Mean Squared Error on Test Data**: 0.06863270417879624

These findings show that, with comparable MSE values for the training and test datasets, the Linear Regression model offers a respectably good fit for the data. When compared to more intricate models such as Ridge and Lasso regression, the approach provides a robust foundation. The test and training MSE values' proximity indicates that the model does a good job of generalizing to new data without experiencing appreciable overfitting. Because of this, linear regression is a trustworthy place to start when figuring out how the predictors and the target variable are related.

# Ridge Regression Model

Ridge Regression was used in this part to solve multicollinearity and overfitting. By addressing missing values and encoding categorical variables, we prepared the data. Next, in order to determine the ideal regularization strength, we trained the Ridge Regression model using a range of lambda values. Mean Squared Error (MSE) was used to assess the model's performance on the test and training datasets. Finding a model that minimizes prediction errors and performs well with new data was our aim.


In [9]:
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
import numpy as np

# Define the range of lambda (alpha) values
lambdas = np.arange(0.01, 3.01, 0.01)
best_lambda = None
best_mse_test = float('inf')

# Initialize variables to store the MSE values for the best model
mse_train_ridge_best = None
mse_test_ridge_best = None

for alpha in lambdas:
    # Fit the Ridge regression model
    ridge_model = Ridge(alpha=alpha)
    ridge_model.fit(X_train, y_train)
    
    # Predicting on training and testing data
    y_train_pred_ridge = ridge_model.predict(X_train)
    y_test_pred_ridge = ridge_model.predict(X_test)
    
    # Calculating Mean Squared Error
    mse_train_ridge = mean_squared_error(y_train, y_train_pred_ridge)
    mse_test_ridge = mean_squared_error(y_test, y_test_pred_ridge)
    
    # Update the best lambda and MSE values if current model is better
    if mse_test_ridge < best_mse_test:
        best_lambda = alpha
        best_mse_test = mse_test_ridge
        mse_train_ridge_best = mse_train_ridge
        mse_test_ridge_best = mse_test_ridge

print(f"Best Lambda: {best_lambda}")
print(f"Mean Squared Error on Training Data: {mse_train_ridge_best}")
print(f"Mean Squared Error on Test Data: {mse_test_ridge_best}")


  return linalg.solve(A, Xy, assume_a="pos", overwrite_a=True).T
  return linalg.solve(A, Xy, assume_a="pos", overwrite_a=True).T
  return linalg.solve(A, Xy, assume_a="pos", overwrite_a=True).T
  return linalg.solve(A, Xy, assume_a="pos", overwrite_a=True).T
  return linalg.solve(A, Xy, assume_a="pos", overwrite_a=True).T
  return linalg.solve(A, Xy, assume_a="pos", overwrite_a=True).T
  return linalg.solve(A, Xy, assume_a="pos", overwrite_a=True).T
  return linalg.solve(A, Xy, assume_a="pos", overwrite_a=True).T
  return linalg.solve(A, Xy, assume_a="pos", overwrite_a=True).T
  return linalg.solve(A, Xy, assume_a="pos", overwrite_a=True).T
  return linalg.solve(A, Xy, assume_a="pos", overwrite_a=True).T
  return linalg.solve(A, Xy, assume_a="pos", overwrite_a=True).T
  return linalg.solve(A, Xy, assume_a="pos", overwrite_a=True).T
  return linalg.solve(A, Xy, assume_a="pos", overwrite_a=True).T
  return linalg.solve(A, Xy, assume_a="pos", overwrite_a=True).T
  return linalg.solve(A, 

Best Lambda: 3.0
Mean Squared Error on Training Data: 0.06782516012327344
Mean Squared Error on Test Data: 0.0686324304609667


### Findings from Ridge Regression Model

The Ridge Regression model was evaluated to determine the best regularization strength (lambda) that minimizes the prediction error. After testing a range of lambda values from 0.01 to 3.0, the optimal lambda value was found to be 3.0. 

The performance of the model was measured using Mean Squared Error (MSE), and the results are as follows:
- **Best Lambda**: 3.0
- **Mean Squared Error on Training Data**: 0.06782516012327344
- **Mean Squared Error on Test Data**: 0.0686324304609667

These results indicate that the Ridge Regression model with a lambda of 3.0 provides a good fit to the data. The MSE values on the training and test datasets are very close, suggesting that the model generalizes well and is not overfitting. This makes Ridge Regression a robust choice for this prediction task, balancing complexity and performance effectively.

# Lasso Regresion Model

In this section, we implemented the Lasso Regression model, which includes L1 regularization. This model helps in feature selection by shrinking some coefficients to zero, effectively reducing the number of predictors in the model. 

**Steps Taken:**
1. **Model Initialization**: We used `LassoCV` with cross-validation (cv=5) to automatically select the best alpha (regularization strength). We increased the number of iterations (`max_iter=10000`) to ensure convergence.
2. **Model Training**: The model was trained on the training data.
3. **Predictions**: Predictions were made on both the training and testing datasets.
4. **Evaluation**: The model's performance was evaluated using Mean Squared Error (MSE).

The purpose of using Lasso Regression was to enhance the model by potentially removing irrelevant features and improving prediction accuracy, especially in cases where some predictors may be redundant.

In [10]:
from sklearn.linear_model import LassoCV
from sklearn.metrics import mean_squared_error

# Fitting the Lasso Regression Model with increased number of iterations
lasso_model = LassoCV(cv=5, random_state=0, max_iter=10000)
lasso_model.fit(X_train, y_train)

# Predicting on training and testing data
y_train_pred_lasso = lasso_model.predict(X_train)
y_test_pred_lasso = lasso_model.predict(X_test)

# Calculating Mean Squared Error
mse_train_lasso = mean_squared_error(y_train, y_train_pred_lasso)
mse_test_lasso = mean_squared_error(y_test, y_test_pred_lasso)

print(f"Mean Squared Error on Training Data: {mse_train_lasso}")
print(f"Mean Squared Error on Test Data: {mse_test_lasso}")


Mean Squared Error on Training Data: 0.0709078038993017
Mean Squared Error on Test Data: 0.0716349001900953


### Findings from Lasso Regression Model

The Lasso Regression model was evaluated to assess its performance in predicting the target variable and performing feature selection. The Mean Squared Error (MSE) values obtained are:

- **Mean Squared Error on Training Data**: 0.0709078038993017
- **Mean Squared Error on Test Data**: 0.0716349001900953

These results show that the Lasso Regression model has slightly higher MSE compared to the Ridge Regression model. The use of L1 regularization in Lasso Regression helped in reducing some coefficients to zero, which can simplify the model by effectively selecting the most relevant features. 

# Random Forest

In this section, we implemented the Random Forest model, an ensemble learning method that combines multiple decision trees to improve predictive performance. Random Forest helps to reduce overfitting and increase the accuracy of predictions by averaging the results of many decision trees.

**Steps Taken:**
1. **Model Initialization**: We used `RandomForestRegressor` with a fixed `random_state` to ensure reproducibility.
2. **Model Training**: The model was trained on the prepared training data.
3. **Predictions**: Predictions were made on both the training and testing datasets.
4. **Evaluation**: The model's performance was evaluated using Mean Squared Error (MSE).
5. **Feature Importance**: We also calculated the importance of each feature in making predictions, providing insights into which variables are most influential.

In [None]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
import pandas as pd

# Fit the Random Forest model using the prepared data
rf_model = RandomForestRegressor(random_state=0)
rf_model.fit(X_train, y_train)

# Predicting on training and testing data
y_train_pred_rf = rf_model.predict(X_train)
y_test_pred_rf = rf_model.predict(X_test)

# Calculating Mean Squared Error
mse_train_rf = mean_squared_error(y_train, y_train_pred_rf)
mse_test_rf = mean_squared_error(y_test, y_test_pred_rf)

# Variable importance
feature_importances = rf_model.feature_importances_
features = X_train.columns

# Create a DataFrame for the feature importances
importance_df = pd.DataFrame({
    'Feature': features,
    'Importance': feature_importances
}).sort_values(by='Importance', ascending=False)

print(f"Mean Squared Error on Training Data: {mse_train_rf}")
print(f"Mean Squared Error on Test Data: {mse_test_rf}")
print("Feature Importances:")
print(importance_df.head())



### Findings from Random Forest Model

The Random Forest model was evaluated to determine its performance in predicting the target variable. The results are as follows:

- **Mean Squared Error on Training Data**: 0.0027642915595818937
- **Mean Squared Error on Test Data**: 0.020430841741607576

These results indicate that the Random Forest model performs exceptionally well on the training data, with a very low MSE. However, the higher MSE on the test data suggests some overfitting. Despite this, the Random Forest model still outperforms other models in terms of test MSE. Additionally, the feature importance analysis highlights the most significant predictors in the dataset, providing valuable insights for further analysis and model refinement.

## Neural Network

In this section, we implemented a Neural Network model to capture complex patterns and interactions in the data that linear models might miss. Neural Networks are powerful models capable of learning non-linear relationships through multiple layers of interconnected neurons.

**Steps Taken:**
1. **Model Definition**: We defined a Sequential model with two Dense layers: one hidden layer with 100 neurons and a ReLU activation function, and an output layer with a sigmoid activation function.
2. **Model Compilation**: The model was compiled using the Adam optimizer and binary cross-entropy loss, with accuracy as the evaluation metric.
3. **Model Training**: The model was trained for 50 epochs with a batch size of 10, using the training data and validating on the test data.
4. **Predictions**: Predictions were made on both the training and testing datasets, converting probabilities to binary outcomes.
5. **Evaluation**: The model's performance was evaluated using accuracy on both training and test datasets.


In [None]:
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from sklearn.metrics import accuracy_score

# Define the Neural Network model
model = Sequential()
model.add(Dense(100, input_dim=X_train.shape[1], activation='relu'))
model.add(Dense(1, activation='sigmoid'))

# Compile the model
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

# Fit the model
history = model.fit(X_train, y_train, epochs=50, batch_size=10, verbose=1, validation_data=(X_test, y_test))

# Predicting on training and testing data
y_train_pred_nn = (model.predict(X_train) > 0.5).astype("int32")
y_test_pred_nn = (model.predict(X_test) > 0.5).astype("int32")

# Calculating Accuracy
accuracy_train_nn = accuracy_score(y_train, y_train_pred_nn)
accuracy_test_nn = accuracy_score(y_test, y_test_pred_nn)

print(f"Accuracy on Training Data: {accuracy_train_nn}")
print(f"Accuracy on Test Data: {accuracy_test_nn}")


### Findings from Neural Network Model

The Neural Network model was evaluated to determine its performance in predicting the target variable. The results are as follows:

- **Accuracy on Training Data**: 0.9467
- **Accuracy on Test Data**: 0.9454

The high accuracy shows how well the neural network generalizes to new data and efficiently recognizes patterns in the training set. The neural network performs well, but its computing demands and complexity make it marginally less advantageous than the Random Forest model, which had the lowest Mean Squared Error (MSE).

### Explanation for Choosing the Neural Network


- **Ability to Capture Non-Linear Relationships**: Neural networks can model complex, non-linear interactions between features that simpler models might miss.
- **Handling Large Datasets**: They are particularly effective for large datasets with multiple features.
- **Flexible Architecture**: The flexibility in choosing the number of layers and neurons allows for fine-tuning the model to achieve optimal performance.
- **High Accuracy**: Despite higher computational requirements, the neural network provided high accuracy on both training and test data.
- **Adaptability**: Neural networks can adapt to a wide range of data types and distributions, making them versatile for different predictive tasks.