# Creating a PD model

- What is Probability of Default?

- The probability of default (PD) is the probability of a borrower or debtor defaulting on loan repayments. Within financial markets, an asset’s probability of default is the probability that the asset yields no return to its holder over its lifetime and the asset price goes to zero. Investors use the probability of default to calculate the expected loss from an investment.


- What is a PD model?
- A probability of default (PD) model is a statistical model used to estimate the likelihood that a borrower will default on their loan obligations within a given time frame. There are various approaches to building PD models, including traditional statistical methods and machine learning algorithms. The steps below walk through developing a PD model.
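
As a rough illustration of how a PD feeds into an expected-loss figure, the sketch below uses the common decomposition expected loss = PD x LGD x EAD. The function name `expected_loss`, the loss-given-default (LGD) and exposure-at-default (EAD) inputs, and all numbers are assumptions for illustration; they do not come from the dataset used in this notebook.

```python
def expected_loss(pd_estimate: float, lgd: float, ead: float) -> float:
    """Expected loss from a probability of default, a loss-given-default
    fraction, and an exposure at default (hypothetical helper)."""
    if not 0.0 <= pd_estimate <= 1.0:
        raise ValueError("pd_estimate must be a probability between 0 and 1")
    if not 0.0 <= lgd <= 1.0:
        raise ValueError("lgd must be a fraction between 0 and 1")
    return pd_estimate * lgd * ead


# Made-up inputs: 5% PD, 40% loss given default, 10,000 USD exposure
el = expected_loss(0.05, 0.40, 10_000)
print(f"Expected loss: {el:.2f} USD")
```

The PD is the only one of the three inputs that the model built in this notebook would supply; LGD and EAD would have to come from elsewhere.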

## Step-1 
### Data Collection: The bank gathers data on past borrowers, including relevant variables such as age, income, credit score, outstanding debt, and default status.

In [60]:
# Here, the pandas library is imported to read the CSV file that is already present in the folder.

import pandas as pd

In [61]:
# In this step, the CSV file is read using the read_csv command,
# and then we examine the last 10 observations to get a sense of how the data looks.
# One can also check the first 10 observations by using df.head(10).

df=pd.read_csv("PD_model_data.csv")
df.tail(10)

Unnamed: 0,Customer,Income (USD),Age,Credit History (months),Outstanding Debt (USD),Defaulted
541,542,70000,38.0,42.0,4000.0,No
542,543,65000,33.0,36.0,3000.0,No
543,544,55000,29.0,24.0,2000.0,No
544,545,48000,45.0,18.0,3000.0,Yes
545,546,60000,55.0,60.0,3000.0,No
546,547,35000,28.0,12.0,15000.0,Yes
547,548,55000,31.0,18.0,3000.0,Yes
548,549,80000,43.0,72.0,2000.0,No
549,550,60000,50.0,24.0,4000.0,May be
550,551,45000,31.0,18.0,,


In [62]:
# df.info() summarizes the dataset's structure: its size, column data types, and non-null counts (which reveal missing values).
# This information serves as a starting point for further data exploration and analysis tasks.

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 551 entries, 0 to 550
Data columns (total 6 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   Customer                 551 non-null    int64  
 1   Income (USD)             551 non-null    int64  
 2   Age                      540 non-null    float64
 3   Credit History (months)  545 non-null    float64
 4   Outstanding Debt (USD)   546 non-null    float64
 5   Defaulted                549 non-null    object 
dtypes: float64(3), int64(2), object(1)
memory usage: 26.0+ KB


Based on the provided information, here is the summary:

- There are a total of 551 observations with 6 columns.
- There are no missing values in the "Customer" and "Income" variables.
- There are 11 missing values in the "Age" variable, 6 missing values in the "Credit History" variable, 5 missing values in the "Outstanding Debt" variable, and 2 missing values in the "Defaulted" variable.

## Step-2

### Data Preprocessing: The bank preprocesses the data by handling missing values, converting categorical variables (such as defaulted status) into numerical representations, and scaling the numerical variables if necessary.

#### Handling the missing value- 
When dealing with missing values in a PD (Probability of Default) model, there are several approaches one can consider. Here are a few common strategies:

1- Complete Case Analysis: This approach involves excluding observations with missing values from the analysis. In our case, we could remove the rows with missing values for "Age," "Credit History," "Outstanding Debt," and "Defaulted." However, keep in mind that this method discards data and can bias the results if the values are not missing completely at random.

2- Mean/Median/Mode Imputation: In this method, missing values are replaced with the mean, median, or mode of the respective variable. For numerical variables like "Age," "Credit History," and "Outstanding Debt," you can calculate the mean or median and substitute it for the missing values. For a categorical variable like "Defaulted," you can replace the missing values with the mode (the most frequent value), although imputing the target variable is rarely advisable because it puts guessed labels into the training data.

3- Regression Imputation: If you have other variables that are strongly correlated with the missing variable, you can use regression models to estimate the missing values. For example, you can use a regression model with "Income" as the independent variable and "Age" as the dependent variable to predict missing "Age" values.

4- Multiple Imputation: This method involves creating multiple imputations for missing values based on the observed data. It takes into account the uncertainty associated with missing values and produces multiple complete datasets for analysis.

It is important to note that the choice of handling missing values depends on the specific characteristics of your data and the underlying assumptions of the PD model. It's essential to carefully consider the potential impact of each method on the model's accuracy and validity. Additionally, it's important to evaluate the potential bias introduced by the imputation methods and assess the robustness of your model's results.

Remember, handling missing values is a crucial step in the data preprocessing phase of building a PD model, and the chosen approach should align with the specific requirements and goals of your analysis.
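
As an alternative to dropping rows, the median-imputation strategy described above could look roughly like the sketch below. The `toy` frame and its values are made up for illustration (only the column names mirror the dataset), and dropping rows with a missing target instead of imputing it is an assumption of this sketch, not a step taken in the notebook.

```python
import numpy as np
import pandas as pd

# Small made-up frame that mimics some of the notebook's columns
toy = pd.DataFrame({
    "Age": [38.0, np.nan, 29.0, 45.0],
    "Outstanding Debt (USD)": [4000.0, 3000.0, np.nan, 3000.0],
    "Defaulted": ["No", "No", "Yes", None],
})

numeric_cols = ["Age", "Outstanding Debt (USD)"]

# Fill gaps in each numeric column with that column's median
imputed = toy.copy()
imputed[numeric_cols] = imputed[numeric_cols].fillna(imputed[numeric_cols].median())

# Drop rows whose target label is missing rather than guessing the label
imputed = imputed.dropna(subset=["Defaulted"])
print(imputed)
```

Median imputation keeps more rows than `dropna()`, at the cost of shrinking the variance of the imputed columns.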

In [63]:
# Remove rows with missing values by using dropna()
# Then check again whether any missing values remain

df2=df.dropna()
df2.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 529 entries, 0 to 549
Data columns (total 6 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   Customer                 529 non-null    int64  
 1   Income (USD)             529 non-null    int64  
 2   Age                      529 non-null    float64
 3   Credit History (months)  529 non-null    float64
 4   Outstanding Debt (USD)   529 non-null    float64
 5   Defaulted                529 non-null    object 
dtypes: float64(3), int64(2), object(1)
memory usage: 28.9+ KB



As can be seen, there are no missing values in the df2 DataFrame: 529 of the original 551 rows remain.

#### Dealing with invalid data

When dealing with invalid data in a PD (Probability of Default) model, it's important to identify and address these issues appropriately. Invalid data refers to observations that contain incorrect or nonsensical values that do not align with the expected range or format for a particular variable. Here are some common approaches for handling invalid data:

- Data Cleaning: Review the data and identify any obvious errors or inconsistencies. If you notice extreme values or values that are outside the expected range for a variable, you may need to correct or remove those observations. For example, if you find negative values for "Age" or "Income," it may indicate an error that needs to be addressed.

- Outlier Detection and Treatment: Use statistical techniques to identify outliers in the data. Outliers are extreme values that deviate significantly from the majority of the data points. Depending on the nature of the outliers and their impact on the analysis, you can choose to remove them or transform them to more reasonable values.

- Validation and Cross-Checking: Cross-check the data against external sources or perform additional validations to verify the accuracy and validity of the information. For example, you can verify employment status or income information by contacting the respective sources or using independent verification methods.

- Expert Knowledge and Business Rules: Consult with domain experts or subject matter specialists who have a deep understanding of the data and the business context. They can provide valuable insights and guidance on how to handle specific invalid data scenarios based on their expertise.

- Imputation or Replacement: In some cases, if the invalid data is missing or incomplete but can be reasonably estimated, you can use imputation techniques to replace the invalid values with plausible values based on other variables or statistical models. However, be cautious when imputing data and consider the potential impact on the model's results and validity.

Remember, handling invalid data requires careful consideration and should be done in a manner that aligns with the specific characteristics of the data and the objectives of the PD model. It's crucial to maintain data integrity and ensure that any treatment applied to invalid data does not introduce bias or compromise the accuracy of the model's predictions.
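
The outlier-detection approach mentioned above can be sketched with the interquartile-range (IQR) rule. The helper name `iqr_outlier_mask`, the multiplier of 1.5, and the sample ages are assumptions chosen for illustration; the notebook itself uses fixed range checks instead.

```python
import pandas as pd


def iqr_outlier_mask(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Return a boolean mask that is True where a value lies more than
    k * IQR below the first quartile or above the third quartile."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)


# Made-up ages, including two implausible entries
ages = pd.Series([38, 33, 29, 45, 55, 28, 31, 43, -55, 255], name="Age")
mask = iqr_outlier_mask(ages)
print(ages[mask])
```

A statistical rule like this only flags candidates; whether a flagged value is truly invalid is still a judgment based on domain knowledge.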






In [64]:


# Iterate over the columns in the DataFrame
for column in df2.columns:
    # Check if the column is numeric (is_numeric_dtype does not depend on the platform's integer width)
    if pd.api.types.is_numeric_dtype(df2[column]):
        # Print the range
        print("range of", column, ":", "(", df2[column].min(), ",",df2[column].max(),")")


range of Customer : ( 1 , 550 )
range of Income (USD) : ( 35000 , 90000 )
range of Age : ( -55.0 , 255.0 )
range of Credit History (months) : ( 6.0 , 72.0 )
range of Outstanding Debt (USD) : ( 2000.0 , 15000.0 )


In [65]:
# Count the distinct values in a column
distinct_counts = df2['Defaulted'].value_counts()
# Print the count of distinct values
print("Distinct count of Defaulted column :\n",distinct_counts)


Distinct count of Defaulted column :
 No        332
Yes       194
May be      3
Name: Defaulted, dtype: int64




From the above two analyses, we have discovered the following:

- There are invalid values in the Age variable (Age < 0 or Age > 100).
- There are invalid inputs in the Defaulted variable, indicated by the value "May be" (count - 3).

Before making any decisions, we will count how many invalid values exist in the Age variable.

In [66]:
# Filter the DataFrame to get the invalid values in the "Age" variable
invalid_age_count = len(df2[(df2['Age'] < 0) | (df2['Age'] > 100)])

# Print the count of invalid values
print("Count of invalid values in Age variable:", invalid_age_count)


Count of invalid values in Age variable: 7


- Since this count is very small relative to the 529 remaining rows, we can delete these rows.

In [67]:
# Create a new DataFrame excluding rows with invalid age inputs
# (the strict inequalities also drop ages of exactly 0 or 100)
df_cleaned_age = df2[(df2['Age'] > 0) & (df2['Age'] < 100)].copy()

# Reset the index of the new DataFrame
df_cleaned_age.reset_index(drop=True, inplace=True)


In [68]:
# Check the range of Age again
# Iterate over the columns in the DataFrame
for column in df_cleaned_age.columns:
    # Check if the column is numeric (is_numeric_dtype does not depend on the platform's integer width)
    if pd.api.types.is_numeric_dtype(df_cleaned_age[column]):
        # Print the range
        print("range of", column, ":", "(", df_cleaned_age[column].min(), ",",df_cleaned_age[column].max(),")")


range of Customer : ( 1 , 550 )
range of Income (USD) : ( 35000 , 90000 )
range of Age : ( 27.0 , 55.0 )
range of Credit History (months) : ( 6.0 , 72.0 )
range of Outstanding Debt (USD) : ( 2000.0 , 15000.0 )


In [69]:
# Delete invalid inputs in the Defaulted variable

# Create a new DataFrame excluding rows with invalid Defaulted values
df_cleaned = df_cleaned_age[(df_cleaned_age['Defaulted']!= "May be")].copy()

# Reset the index of the new DataFrame
df_cleaned.reset_index(drop=True, inplace=True)

In [70]:
# Check again for invalid inputs in the Defaulted variable
distinct_counts = df_cleaned['Defaulted'].value_counts()
# Print the count of distinct values
print("Distinct count of Defaulted column :\n",distinct_counts)

Distinct count of Defaulted column :
 No     325
Yes    193
Name: Defaulted, dtype: int64


In [71]:
df_cleaned.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 518 entries, 0 to 517
Data columns (total 6 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   Customer                 518 non-null    int64  
 1   Income (USD)             518 non-null    int64  
 2   Age                      518 non-null    float64
 3   Credit History (months)  518 non-null    float64
 4   Outstanding Debt (USD)   518 non-null    float64
 5   Defaulted                518 non-null    object 
dtypes: float64(3), int64(2), object(1)
memory usage: 24.4+ KB


Now we have a cleaned dataset that does not have any missing or invalid values.

In [72]:
# Create a new DataFrame df_modified based on df_cleaned
df_modified = df_cleaned.copy()

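# Update values in the "Defaulted" column based on conditions
# Note: this encoding maps "Yes" (defaulted) to 0 and "No" (did not default) to 1,
# so class 0 is the default class in all later outputs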
df_modified.loc[df_modified['Defaulted'] == "Yes", "Defaulted"] = 0
df_modified.loc[df_modified['Defaulted'] == "No", "Defaulted"] = 1

# Convert the "defaulted" column to integer type
df_modified["Defaulted"] = df_modified["Defaulted"].astype(int)


In [73]:
df_modified.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 518 entries, 0 to 517
Data columns (total 6 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   Customer                 518 non-null    int64  
 1   Income (USD)             518 non-null    int64  
 2   Age                      518 non-null    float64
 3   Credit History (months)  518 non-null    float64
 4   Outstanding Debt (USD)   518 non-null    float64
 5   Defaulted                518 non-null    int64  
dtypes: float64(3), int64(3)
memory usage: 24.4 KB


## Step-3

#### Feature selection, also known as variable selection, is the process of choosing a subset of relevant features or variables from a larger set of available features in a dataset. The goal of feature selection is to improve model performance, reduce overfitting, enhance interpretability, and decrease computational complexity by focusing on the most informative and influential features.



The process of feature selection typically involves the following steps:

- Data Understanding: Gain a thorough understanding of the dataset, including the nature of the variables, their relationships, and their potential impact on the target variable. This step helps in identifying irrelevant or redundant variables.

- Univariate Analysis: Conduct a univariate analysis by examining each feature individually and assessing its correlation or association with the target variable. Statistical tests, such as t-tests or chi-square tests, can be employed for this purpose. Features that exhibit a strong relationship with the target variable are generally considered more relevant.

- Multivariate Analysis: Analyze the relationships among features themselves to identify potential redundancies or dependencies. This can involve techniques such as correlation analysis or variance inflation factor (VIF) calculation to identify highly correlated features. Redundant features may be removed as they provide redundant information.

- Domain Knowledge and Expertise: Leverage domain knowledge or expert input to identify features that are known to be significant in the given problem domain. Expert insights can help prioritize certain variables based on their perceived importance.

- Feature Ranking: Utilize various ranking methods to prioritize features based on their importance. Common techniques include:

  -  Filter Methods: These methods assess the relevance of features independently of any specific machine learning model. They use statistical measures like correlation, mutual information, or chi-square tests to rank features and select the top-ranked ones.

  - Wrapper Methods: Wrapper methods evaluate subsets of features by training and testing a machine learning model. They use the performance of the model as a criterion for feature selection, such as recursive feature elimination (RFE) or forward/backward selection.

  - Embedded Methods: These methods incorporate feature selection within the model training process itself. They utilize built-in feature selection techniques available in certain machine learning algorithms, such as L1 regularization (Lasso) or tree-based feature importance.

- Model Evaluation: Assess the performance of the model using the selected features. This evaluation provides insights into the impact of feature selection on the model's predictive ability and helps determine if further iterations of feature selection are necessary.

- Iterative Process: Feature selection is often an iterative process where the steps from univariate analysis through model evaluation are repeated multiple times to refine the feature set and improve model performance. Different techniques and combinations of features can be explored to identify the most effective subset.

It's important to note that the specific feature selection techniques and steps may vary depending on the problem domain, the type of data, and the modeling approach being employed.
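
A filter method of the kind described above can be sketched with scikit-learn's `SelectKBest` and the ANOVA F-test (`f_classif`). The data below is synthetic and the column names `debt`, `income`, and `noise` are hypothetical; the target is constructed to depend only on `debt`, so that column should rank first.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.default_rng(0)
n = 200

# Synthetic features: only "debt" drives the target
debt = rng.normal(5000, 2000, n)
income = rng.normal(55000, 10000, n)
noise = rng.normal(0, 1, n)
X = pd.DataFrame({"debt": debt, "income": income, "noise": noise})
y = (debt + rng.normal(0, 500, n) > 5000).astype(int)

# Score each feature independently against the target and keep the best one
selector = SelectKBest(score_func=f_classif, k=1).fit(X, y)
scores = pd.Series(selector.scores_, index=X.columns).sort_values(ascending=False)
selected = list(X.columns[selector.get_support()])

print(scores)
print("Selected:", selected)
```

Because filter methods score features one at a time, they are fast but cannot detect features that only matter in combination; wrapper or embedded methods address that at higher computational cost.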

#### In this example, no feature selection has been performed; all four independent variables are kept on the assumption that each is relevant to the model.

## Step-4

#### Splitting the Data: Split the dataset into a training set and a test set. For example, you can allocate 80% of the data for training and 20% for testing.

#### Model Training: Train a PD model using a suitable algorithm such as logistic regression, random forest, or gradient boosting. Fit the model to the training data, using the features (income, age, credit history, outstanding debt) to predict the target variable (defaulted or not).

In [74]:
# Splitting the Data

from sklearn.model_selection import train_test_split

# Split the data into training and test sets
train_data, test_data, train_labels, test_labels = train_test_split(df_modified[['Income (USD)', 'Age', 'Credit History (months)', 'Outstanding Debt (USD)']], df_modified['Defaulted'], test_size=0.2, random_state=42)


In [75]:
train_data.head(5)

Unnamed: 0,Income (USD),Age,Credit History (months),Outstanding Debt (USD)
405,48000,45.0,18.0,3000.0
331,60000,35.0,56.0,4000.0
220,35000,28.0,12.0,15000.0
148,45000,31.0,18.0,7000.0
301,60000,50.0,24.0,4000.0


In [76]:
#fit the logistic regression model

from sklearn.linear_model import LogisticRegression

# Create a logistic regression model object
logreg_model = LogisticRegression()

# Fit the model to the training data
logreg_model.fit(train_data, train_labels)
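
Step-2 mentions scaling the numerical variables, which the cells above do not do. A minimal sketch of how scaling and a stratified split could be combined with logistic regression is shown below. The data is synthetic, the names `scaled_logreg`, `X_train`, and `X_test` are hypothetical, and this is one possible setup rather than the notebook's own method.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
n = 300

# Synthetic features on very different scales, loosely modeled on the dataset
X = pd.DataFrame({
    "Income (USD)": rng.integers(35000, 90000, n),
    "Outstanding Debt (USD)": rng.integers(2000, 15000, n),
})
# Synthetic label driven by the debt-to-income ratio plus noise
ratio = X["Outstanding Debt (USD)"] / X["Income (USD)"]
y = (ratio + rng.normal(0, 0.03, n) > ratio.median()).astype(int)

# stratify=y keeps the class proportions similar in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# The scaler is fitted on the training data only, inside the pipeline
scaled_logreg = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_logreg.fit(X_train, y_train)
test_accuracy = scaled_logreg.score(X_test, y_test)
print("Test accuracy:", test_accuracy)
```

Putting the scaler inside the pipeline avoids leaking test-set statistics into training, and standardized inputs usually help the logistic regression solver converge.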


## Evaluating the performance of this model

To evaluate the performance of the logistic regression model on the test set, you can use various evaluation metrics. Here are a few commonly used metrics for binary classification models:

- Accuracy: Accuracy measures the overall correctness of the model's predictions, i.e. the share of test observations that are classified correctly.

In [77]:
# Evaluate accuracy
accuracy = logreg_model.score(test_data, test_labels)
print("Accuracy:", accuracy)


Accuracy: 0.7980769230769231


- Confusion Matrix: A confusion matrix provides a detailed breakdown of the model's predictions by showing the true positive, true negative, false positive, and false negative counts.

In [78]:
from sklearn.metrics import confusion_matrix

# Generate predictions on the test data
predictions = logreg_model.predict(test_data)

# Calculate confusion matrix
conf_matrix = confusion_matrix(test_labels, predictions)
print("Confusion Matrix:")
print(conf_matrix)


Confusion Matrix:
[[23 20]
 [ 1 60]]


- Precision, Recall, and F1-Score: These metrics provide insights into the model's performance in terms of precision (ability to correctly identify positive instances), recall (ability to find all positive instances), and their harmonic mean F1-score.

In [79]:
from sklearn.metrics import classification_report

# Generate a classification report
classification_rep = classification_report(test_labels, predictions)
print("Classification Report:")
print(classification_rep)


Classification Report:
              precision    recall  f1-score   support

           0       0.96      0.53      0.69        43
           1       0.75      0.98      0.85        61

    accuracy                           0.80       104
   macro avg       0.85      0.76      0.77       104
weighted avg       0.84      0.80      0.78       104



Based on the above evaluation metrics, the logistic regression model demonstrates a decent level of performance. Here's a breakdown of the metrics:

- Accuracy: The accuracy of the model is 0.798, indicating that around 80% of the predictions on the test set are correct. However, accuracy alone may not provide a complete picture of the model's performance.

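- Confusion Matrix: Rows are actual classes and columns are predicted classes. Treating class 1 (did not default) as the positive class, the confusion matrix reveals the following:

True Positives (TP): 60

True Negatives (TN): 23

False Positives (FP): 20

False Negatives (FN): 1

The model has a relatively high number of false positives (20) but only a single false negative (1). Because class 0 is the default class under the encoding used here, the 20 false positives are actual defaulters that the model predicted as non-defaults.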

- Classification Report: The precision, recall, and F1-score for both classes (0 and 1) are as follows:

  - Class 0 (defaulted): The precision is 0.96, so almost every borrower the model flags as a default really did default. The recall is 0.53, meaning the model identifies only about half of the actual defaults (23 of 43). The F1-score is 0.69.
  - Class 1 (did not default): The precision is 0.75, so a quarter of the borrowers predicted as non-defaults actually defaulted. The recall is 0.98, indicating that the model captures nearly all actual non-defaults. The F1-score is 0.85.

Considering these metrics, we can draw the following conclusions about the model's performance:

- The accuracy of 0.798 is reasonable, indicating that the model is making correct predictions for the majority of the instances in the test set.
- The high recall for class 1 shows that the model is effective at recognizing borrowers who did not default.
- However, the low recall for class 0 shows that the model misses almost half of the actual defaults, which is where the 20 misclassified cases in the confusion matrix come from.

Overall, while the model performs well in recognizing non-defaults (class 1), it needs improvement in detecting defaults (class 0), which is typically the costlier error for a lender. Depending on the specific requirements and the associated costs of false positives and false negatives, further model tuning or adjustments to the decision threshold may be necessary.
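
Adjusting the decision threshold, as suggested above, can be sketched with `predict_proba`. The example uses synthetic data from `make_classification`, and in it label 1 plays the role of "default"; that is an assumption of this sketch and the reverse of the encoding used in this notebook, where 0 means defaulted.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic binary problem; label 1 stands in for "default" in this sketch only
X, y = make_classification(n_samples=500, n_features=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Probability assigned to the "default" class for each test observation
proba_default = model.predict_proba(X_test)[:, 1]

# Compare recall on the default class at the usual 0.5 cut-off and a lower one
recalls = {}
for threshold in (0.5, 0.3):
    flagged = (proba_default >= threshold).astype(int)
    recalls[threshold] = recall_score(y_test, flagged)
    print(f"threshold={threshold}: recall on defaults={recalls[threshold]:.2f}")
```

Lowering the threshold can only keep or raise recall on the default class, because every observation flagged at 0.5 is also flagged at 0.3; the price is more non-defaulters flagged by mistake.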


### Let's try the PD model with a random forest approach

In [80]:
from sklearn.ensemble import RandomForestClassifier

# Create a Random Forest classifier object
rf_model = RandomForestClassifier()

# Fit the model to the training data
rf_model.fit(train_data, train_labels)


In [81]:
# Check the accuracy of the model

# Generate predictions on the test data
rf_predictions = rf_model.predict(test_data)

# Evaluate accuracy
rf_accuracy = rf_model.score(test_data, test_labels)
print("Accuracy:", rf_accuracy)

# Calculate confusion matrix
rf_conf_matrix = confusion_matrix(test_labels, rf_predictions)
print("Confusion Matrix:")
print(rf_conf_matrix)

# Generate a classification report
rf_classification_rep = classification_report(test_labels, rf_predictions)
print("Classification Report:")
print(rf_classification_rep)


Accuracy: 0.9903846153846154
Confusion Matrix:
[[43  0]
 [ 1 60]]
Classification Report:
              precision    recall  f1-score   support

           0       0.98      1.00      0.99        43
           1       1.00      0.98      0.99        61

    accuracy                           0.99       104
   macro avg       0.99      0.99      0.99       104
weighted avg       0.99      0.99      0.99       104



The Random Forest model demonstrates excellent performance on the test set, as indicated by the evaluation metrics:

- Accuracy: The accuracy of the Random Forest model is 0.990, meaning that it correctly predicts the default status for approximately 99% of the instances in the test set.

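- Confusion Matrix: Treating class 1 (did not default) as the positive class, the model has 60 true positives, 43 true negatives, no false positives, and only one false negative. In other words, all 43 actual defaults (class 0) in the test set were detected, and a single non-default was flagged as a default.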

- Classification Report: The precision, recall, and F1-scores for both classes (0 and 1) are very high. Class 0 has a precision of 0.98, a recall of 1.00, and an F1-score of 0.99; class 1 has a precision of 1.00, a recall of 0.98, and an F1-score of 0.99. This indicates that the Random Forest model performs exceptionally well in correctly identifying both defaults and non-defaults on this test set.

#### In summary, the Random Forest model demonstrates significantly improved performance compared to the logistic regression model on this test set. With an accuracy of 0.990 versus 0.798 and high precision, recall, and F1-scores for both classes, it is highly effective in predicting the default status of the 104 held-out borrowers.
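
Some feature combinations in the outputs above repeat (for example, 48000 / 45 / 18 / 3000 appears both in the tail of the raw data and in the training sample), so a near-perfect test score may partly reflect identical rows landing in both splits. The sketch below shows one way to check for this. The `toy` frame is made up, and the names `n_dupes`, `overlap`, and `share` are hypothetical.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up rows in which some feature combinations repeat
toy = pd.DataFrame({
    "Income (USD)": [48000, 60000, 35000, 48000, 60000, 35000, 70000, 55000],
    "Age": [45, 50, 28, 45, 50, 28, 38, 29],
    "Defaulted": [0, 1, 0, 0, 1, 0, 1, 1],
})
features = ["Income (USD)", "Age"]

# Rows whose feature values already appeared earlier in the frame
n_dupes = int(toy.duplicated(subset=features).sum())
print("Duplicated feature rows:", n_dupes)

train_X, test_X = train_test_split(toy[features], test_size=0.25, random_state=0)

# Inner merge on all feature columns: distinct test rows that also occur in training
test_unique = test_X.drop_duplicates()
overlap = test_unique.merge(train_X.drop_duplicates(), on=features, how="inner")
share = len(overlap) / len(test_unique)
print(f"Share of distinct test rows also present in training: {share:.0%}")
```

If the share is high on the real data, deduplicating before the split or using cross-validation grouped by customer would give a more conservative estimate of performance.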

In [82]:

#Save the Model:

import joblib

# Save the trained model
joblib.dump(rf_model, 'random_forest_model.pkl')


['random_forest_model.pkl']

In [83]:
# Load the saved model
loaded_model = joblib.load('random_forest_model.pkl')

# Prepare new input data for prediction
new_data = [[60000, 35, 30, 5000]]  # Example of new input data

# Specify the feature names
feature_names = ['Income (USD)', 'Age', 'Credit History (months)', 'Outstanding Debt (USD)']

# Create a DataFrame for the new input data
new_data_df = pd.DataFrame(new_data, columns=feature_names)

# Use the loaded model to predict the result for new input data
predictions = loaded_model.predict(new_data_df)

# Print the predicted result
# (under the encoding above, 1 = did not default and 0 = defaulted)
print("Prediction:", predictions)


Prediction: [1]
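
Since a PD model is meant to produce a probability rather than only a class label, `predict_proba` is the more natural output. The sketch below trains a small random forest on synthetic data and reads off the probability of the default class. It follows this notebook's encoding (0 = defaulted, 1 = did not default), but the data, the labeling rule, and the names `new_applicant` and `pd_estimate` are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(1)
n = 200

# Synthetic features loosely modeled on two of the notebook's columns
X = pd.DataFrame({
    "Income (USD)": rng.integers(35000, 90000, n),
    "Outstanding Debt (USD)": rng.integers(2000, 15000, n),
})
# Synthetic labels using the notebook's encoding: 0 = defaulted, 1 = did not default
y = np.where(X["Outstanding Debt (USD)"] > 9000, 0, 1)

model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# Hypothetical new applicant
new_applicant = pd.DataFrame([[60000, 5000]], columns=X.columns)

# Look up which predict_proba column corresponds to class 0 (defaulted)
default_col = list(model.classes_).index(0)
pd_estimate = model.predict_proba(new_applicant)[0, default_col]
print(f"Estimated probability of default: {pd_estimate:.2f}")
```

Looking up the column through `classes_` avoids silently reading the wrong probability when the default class is encoded as 0 rather than 1.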
