# Task 3: Credit Risk Analysis

## The Task:

The risk manager has collected data on the loan borrowers. The data is in tabular format, with each row providing details of the borrower, including their income, total loans outstanding, and a few other metrics. There is also a column indicating whether the borrower has previously defaulted on a loan. Use this data to build a model that, given the details of any such loan, predicts the probability that the borrower will default (PD, the probability of default). Assuming a recovery rate of 10%, the PD can then be used to compute the expected loss on a loan.

- You should produce a function that can take in the properties of a loan and output the expected loss.
- You can explore any technique ranging from a simple regression or a decision tree to something more advanced. You can also use multiple methods and provide a comparative analysis.

## The Solution:

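The solution is split into classes with one responsibility each: `DataProcessor` prepares the data, `LogisticRegressionModel` estimates the probability of default, `LoanRiskCalculator` converts that probability into an expected loss, and `SampleInput` prepares a new loan for prediction. Because the model implements the abstract `BaseModel` interface, a different model can be swapped in without changing the other classes.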

### Data Processor (DataProcessor class)

-   Purpose:
    -   The `DataProcessor` class serves to prepare the data for model training.
-   Functionalities:
    -   Preprocessing: Drops unnecessary columns (e.g., `customer_id`).
    -   Splitting: Divides data into training and testing sets.
    -   Normalization: Standardizes the features to zero mean and unit variance with `StandardScaler`, so that features on very different scales (such as income and years employed) are comparable.
-   Key Components:
    -   `preprocess()`: Removes irrelevant columns from the data.
    -   `split_data()`: Segregates the data into training and test subsets.
    -   `normalize()`: Uses `StandardScaler` to standardize feature values.
    -   `process()`: A high-level function calling the other three functions to streamline data processing.
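
To make the normalization step concrete, here is a minimal pure-Python sketch of what standardization does to a single feature. The helper names `fit_standardizer` and `standardize` and the income figures are illustrative assumptions, not part of the solution. The point it shows is that the mean and standard deviation are learned from the training split only and then reused on the test split, which is what `normalize()` does with `fit_transform` and `transform`.

```python
from statistics import fmean, pstdev


def fit_standardizer(train_values):
    """Learn mean and population standard deviation from training data only."""
    mean = fmean(train_values)
    std = pstdev(train_values)
    # Guard against constant features, which would otherwise divide by zero
    return mean, (std if std > 0 else 1.0)


def standardize(values, mean, std):
    """Apply previously learned statistics to any data split."""
    return [(v - mean) / std for v in values]


# Hypothetical income values, for illustration only
train_income = [30000.0, 40000.0, 50000.0, 60000.0, 70000.0]
test_income = [45000.0, 80000.0]

mean, std = fit_standardizer(train_income)
train_scaled = standardize(train_income, mean, std)
test_scaled = standardize(test_income, mean, std)  # reuses training statistics

print(f"mean={mean:.1f}, std={std:.1f}")
print([round(v, 3) for v in test_scaled])
```

Fitting on the training split alone keeps information about the test rows out of the model, so the test set remains a fair stand-in for unseen loans.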

### Model Development (LogisticRegressionModel class)

-   Purpose:
    -   Defines and trains a logistic regression model to predict default probabilities.
-   Functionalities:
    -   Training: Fits a logistic regression model on the training data.
    -   Prediction: Predicts class labels and default probabilities for given input data.
-   Key Components:
    -   `train()`: Fits the logistic regression model on the training data and returns the fitted scikit-learn estimator.
    -   `predict()`: Returns the predicted class labels for input data.
    -   `predict_proba()`: Returns the probability of default (the positive class) for input data.
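
As a rough illustration of what `predict_proba()` computes for the positive class, logistic regression passes a weighted sum of the standardized features through the sigmoid function. The weights, intercept, and feature values below are made up for the example; the fitted model's actual coefficients are not given in this write-up.

```python
import math


def sigmoid(z):
    """Map a real-valued score to a probability, avoiding overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def probability_of_default(features, weights, intercept):
    """PD = sigmoid(intercept + sum(weight_i * feature_i))."""
    score = intercept + sum(w * x for w, x in zip(weights, features))
    return sigmoid(score)


# Hypothetical coefficients and standardized features, for illustration only
weights = [1.2, 0.4, -0.8]
intercept = -3.0
features = [0.5, -1.0, 0.25]

pd_estimate = probability_of_default(features, weights, intercept)
print(f"PD = {pd_estimate:.4f}")
```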

### Risk Calculator (LoanRiskCalculator class)

-   Purpose:
    -   Calculates the expected financial loss based on the probability of default and given loan attributes.
-   Functionalities:
    -   Expected Loss Calculation: Uses the formula `PD * EAD * (1 - RR)`, where:
        -   PD: Probability of Default, as predicted by the model
        -   EAD: Exposure At Default, the amount at risk if the borrower defaults
        -   RR: Recovery Rate (10% by default)
-   Key Components:
    -   `calculate_expected_loss()`: Determines the expected loss based on the model's predictions and loan data.
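
The formula above can be checked by hand with a small standalone function. This is a minimal sketch, not the class method itself: the name `expected_loss`, the input validation, and the example loan are assumptions added for illustration.

```python
def expected_loss(prob_default, ead, recovery_rate=0.1):
    """Expected loss = PD * EAD * (1 - RR)."""
    if not 0.0 <= prob_default <= 1.0:
        raise ValueError("prob_default must be a probability in [0, 1]")
    if not 0.0 <= recovery_rate <= 1.0:
        raise ValueError("recovery_rate must be in [0, 1]")
    return prob_default * ead * (1.0 - recovery_rate)


# Hypothetical loan: 5% chance of default on 5,000 outstanding
loss = expected_loss(prob_default=0.05, ead=5000)
print(f"Expected loss: {loss:.2f}")  # Expected loss: 225.00
```

With a 10% recovery rate, 90% of the exposure is lost on default, so the expected loss is 0.05 × 5000 × 0.9 = 225.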

### Sample Input (SampleInput class)

-   Purpose:
    -   Prepares raw user input for prediction by applying the same normalization that was fitted on the training data.
-   Functionalities:
    -   Input Transformation: Converts raw loan properties into a suitable format for prediction.
    -   Normalization: Scales the input using the same scaler from `DataProcessor`.
-   Key Components:
    -   `get_normalized_input()`: Transforms and scales raw input values for model prediction.

### Expected Loss Calculation

-   Steps:
    1.  Enter the loan properties manually.
    2.  Use the `SampleInput` class to prepare and normalize these values.
    3.  Use the `LoanRiskCalculator` class to compute the expected loss.
    4.  Print the expected financial loss for the given loan details.
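
The steps above can be folded into one function, which is closer to the task's request for a function that takes loan properties and returns the expected loss. The sketch below is an outline under stated assumptions rather than the notebook's code: `expected_loss_for_loan` is a hypothetical helper, `scale` and `predict_pd` are stand-ins for the fitted scaler and model, and the exposure at default is assumed to equal `loan_amt_outstanding`.

```python
FEATURE_COLUMNS = [
    "credit_lines_outstanding",
    "loan_amt_outstanding",
    "total_debt_outstanding",
    "income",
    "years_employed",
    "fico_score",
]


def expected_loss_for_loan(loan, scale, predict_pd, recovery_rate=0.1):
    """Return the expected loss for one loan given as a dict of properties.

    scale:      callable mapping a raw feature row to a scaled row
    predict_pd: callable mapping a scaled row to a probability of default
    """
    missing = [c for c in FEATURE_COLUMNS if c not in loan]
    if missing:
        raise KeyError(f"Missing loan properties: {missing}")
    # Build the row in the column order used during training
    row = [float(loan[c]) for c in FEATURE_COLUMNS]
    prob_default = predict_pd(scale(row))
    # Assumption: exposure at default is the outstanding loan amount
    ead = float(loan["loan_amt_outstanding"])
    return prob_default * ead * (1.0 - recovery_rate)


# Stand-ins so the sketch runs on its own; replace with the fitted objects
def identity_scale(row):
    return row


def constant_pd(row):
    return 0.02


loan = {
    "credit_lines_outstanding": 1,
    "loan_amt_outstanding": 5000,
    "total_debt_outstanding": 10000,
    "income": 40000,
    "years_employed": 3,
    "fico_score": 580,
}
result = expected_loss_for_loan(loan, identity_scale, constant_pd)
print(f"Expected loss: {result:.2f}")
```

Passing the loan as a dictionary and ordering the values inside the function avoids the silent errors that a misordered positional list would cause.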

### Library Imports

In [24]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from abc import ABC, abstractmethod
from sklearn.linear_model import LogisticRegression
import numpy as np

### Data Processor

In [25]:
class DataProcessor:
    def __init__(self, data_path, test_size=0.2, random_state=None):
        """Initializes the DataProcessor object.
        
        Args:
            data_path (str): Path to the CSV file containing the data.
            test_size (float, optional): Proportion of data to be used for testing. Defaults to 0.2.
            random_state (int, optional): Random seed for reproducibility. Defaults to None.
        """
        self.data = pd.read_csv(data_path)
        self.drop_columns = "customer_id"  # Column to drop from data
        self.test_size = test_size
        self.random_state = random_state
        self.scaler = StandardScaler()  # Standard scaler for data normalization
        self.X_train, self.X_test, self.y_train, self.y_test = None, None, None, None

    def preprocess(self):
        """Preprocesses the data by dropping unnecessary columns."""
        print("Preprocessing data...")
        self.data = self.data.drop(columns=self.drop_columns)
        print(f"Data shape after preprocessing: {self.data.shape}")

    def split_data(self):
        """Splits the data into training and testing sets."""
        print("Splitting data...")
        X = self.data.drop(columns="default")  # Feature columns
        y = self.data["default"]               # Target column
        # Splitting data
        self.X_train, self.X_test, self.y_train, self.y_test = train_test_split(
            X, y, test_size=self.test_size, random_state=self.random_state)
        print(f"Training data shape: {self.X_train.shape}, Testing data shape: {self.X_test.shape}")

    def normalize(self):
        """Normalizes the training and testing data using StandardScaler."""
        print("Normalizing data...")
        # Fit the scaler using training data and transform
        self.X_train = self.scaler.fit_transform(self.X_train)
        # Transform the testing data
        self.X_test = self.scaler.transform(self.X_test)

    def process(self):
        """Processes the data by performing preprocessing, splitting, and normalization."""
        self.preprocess()
        self.split_data()
        self.normalize()

### Model Development

In [26]:
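class BaseModel(ABC):
    """Common interface that every default-probability model must implement."""

    @abstractmethod
    def train(self, X_train, y_train):
        """Fit the model on the training features and labels."""

    @abstractmethod
    def predict(self, X):
        """Return the predicted class labels for X."""

    @abstractmethod
    def predict_proba(self, X):
        """Return the predicted probability of default for each row of X."""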


class LogisticRegressionModel(BaseModel):

    def __init__(self):
        self.model = LogisticRegression()

    def train(self, X_train, y_train) -> LogisticRegression:
        """Train the Logistic Regression model."""
        print("Training the Logistic Regression model...")
        self.model.fit(X_train, y_train)
        return self.model

    def predict(self, X):
        """Predict the class labels for given data."""
        print("Predicting class labels...")
        return self.model.predict(X)

    def predict_proba(self, X):
        """Predict the probabilities for given data."""
        print("Predicting probabilities...")
        return self.model.predict_proba(X)[:, 1]
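
The task statement mentions decision trees as an alternative technique. As a sketch of how a second model could sit behind the same interface, the block below defines a hypothetical `DecisionTreeModel`. It is not part of the original solution: the class name, the `max_depth` setting, and the synthetic data are assumptions, and the interface is repeated only so the example runs on its own. Unlike `LogisticRegressionModel.train()`, this `train()` returns the wrapper itself.

```python
from abc import ABC, abstractmethod

from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier


class BaseModel(ABC):
    """Same interface as above, repeated so the sketch is self-contained."""

    @abstractmethod
    def train(self, X_train, y_train): ...

    @abstractmethod
    def predict(self, X): ...

    @abstractmethod
    def predict_proba(self, X): ...


class DecisionTreeModel(BaseModel):
    """Hypothetical alternative model exposing the same three methods."""

    def __init__(self, max_depth=4, random_state=0):
        self.model = DecisionTreeClassifier(max_depth=max_depth, random_state=random_state)

    def train(self, X_train, y_train):
        self.model.fit(X_train, y_train)
        return self  # return the wrapper so calls can be chained

    def predict(self, X):
        return self.model.predict(X)

    def predict_proba(self, X):
        # Column 1 holds the probability of the positive (default) class
        return self.model.predict_proba(X)[:, 1]


# Synthetic data stands in for Loan_Data.csv, which is not reproduced here
X, y = make_classification(n_samples=200, n_features=6, random_state=0)
tree_model = DecisionTreeModel().train(X, y)
pd_estimates = tree_model.predict_proba(X[:5])
print(pd_estimates.shape)
```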

### Risk Calculator

In [27]:
class LoanRiskCalculator:

    def __init__(self, model: BaseModel, recovery_rate=0.1):
        """
        :param model: An instance of a model implementing the BaseModel protocol.
        :param recovery_rate: The recovery rate for the loan. Defaults to 0.1 (10%).
        """
        self.model = model
        self.RR = recovery_rate

    def calculate_expected_loss(self, X, EAD):
        """
        Calculate the expected loss for a given set of loan properties.

        :param X: The input data for which to predict the probability of default.
        :param EAD: Exposure At Default, the amount at risk in case of a default.
        :return: Expected loss for the given data.
        """
        print("Calculating expected loss...")
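        # Wrapper models return a 1-D array of PDs; a raw scikit-learn
        # estimator returns an (n_samples, 2) array of class probabilities
        PD_array = np.asarray(self.model.predict_proba(X))
        PD = float(PD_array[0] if PD_array.ndim == 1 else PD_array[0, 1])  # PD of the first row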
        expected_loss = PD * EAD * (1 - self.RR)
        print(f"Probability of Default (PD) for the input: {PD:.5f}")
        return expected_loss

### Sample Input

In [28]:
class SampleInput:
    def __init__(self, processor: DataProcessor):
        """Initializes the SampleInput with a reference to the data processor.
        
        Args:
            processor (DataProcessor): Reference to the data processor for normalization.
        """
        self.processor = processor
        self.columns = [
            "credit_lines_outstanding",
            "loan_amt_outstanding",
            "total_debt_outstanding",
            "income",
            "years_employed",
            "fico_score"
        ]
    
    def get_normalized_input(self, values: list) -> np.ndarray:
        """Convert raw input values to a DataFrame and normalize them.
        
        Args:
            values (list): Raw input values.
        
        Returns:
            np.ndarray: Normalized values.
        """
        df = pd.DataFrame([values], columns=self.columns)
        return self.processor.scaler.transform(df)

### Expected Loss Calculation


In [30]:
# 1. Load and process the data
processor = DataProcessor('Loan_Data.csv')
processor.process()

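# 2. Train the logistic regression model with processed data
# Note: train() returns the fitted scikit-learn estimator, so lr_model is a
# plain LogisticRegression rather than the LogisticRegressionModel wrapper
lr_model = LogisticRegressionModel().train(processor.X_train, processor.y_train)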

# 3. Initialize the LoanRiskCalculator
calculator = LoanRiskCalculator(model=lr_model)

# 4. Create a sample input and calculate expected loss
# Initialize the SampleInput class
sample_input = SampleInput(processor)

# Manually input values for the prediction
sample_values = [
    1,              # credit_lines_outstanding
    5000,           # loan_amt_outstanding
    10000,          # total_debt_outstanding
    40000,          # income
    3,              # years_employed
    580             # fico_score (credit score)
]

# Obtain normalized values using SampleInput class
X_sample_normalized = sample_input.get_normalized_input(sample_values)

# 5. Calculate expected loss using the manually input values
EAD_sample = 5000  # Just an example value for Exposure At Default
expected_loss = calculator.calculate_expected_loss(X_sample_normalized, EAD_sample)

print(f"Expected Loss: ${expected_loss}")

Preprocessing data...
Data shape after preprocessing: (10000, 7)
Splitting data...
Training data shape: (8000, 6), Testing data shape: (2000, 6)
Normalizing data...
Training the Logistic Regression model...
Calculating expected loss...
Probability of Default (PD) for the input: 0.00153
Expected Loss: $6.899230555657993
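
As a quick consistency check on the output above, the expected-loss formula can be inverted to recover the probability of default from the printed loss. The snippet uses only numbers shown in this section; the variable names are illustrative.

```python
expected_loss = 6.899230555657993  # value printed above
ead = 5000                         # exposure at default used in the example
recovery_rate = 0.1

# Invert EL = PD * EAD * (1 - RR) to solve for PD
implied_pd = expected_loss / (ead * (1 - recovery_rate))
print(f"Implied PD: {implied_pd:.5f}")  # Implied PD: 0.00153
```

The implied value agrees with the PD printed by `calculate_expected_loss()`, so the loss figure and the reported probability are consistent with each other.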
