# Credit Risk Prediction Model

**Column Descriptions:**

Id: Unique identifier for each loan applicant.  
Age: Age of the loan applicant.  
Income: Income of the loan applicant.  
Home: Home ownership status (OWN, MORTGAGE, RENT, OTHER).  
Emp_length: Employment length in years.  
Intent: Purpose of the loan (e.g., EDUCATION, HOMEIMPROVEMENT, MEDICAL).  
Amount: Loan amount applied for.  
Rate: Interest rate on the loan.  
Status: Loan status, stored as an integer flag (0 or 1) in this dataset.  
Percent_income: Loan amount as a fraction of income (e.g., 0.59).  
Default: Whether the applicant has defaulted on a loan previously (Y, N).  
Cred_length: Length of the applicant's credit history.  

In [17]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.model_selection import train_test_split

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

In [3]:
data = pd.read_csv("C:/Users/himan/Documents/Books/Coding Project/Version 1/Databases/Datasets List1/KAGGLE/Credit Risk Analysis/credit_risk.csv")

In [4]:
# info() prints its summary and returns None, so call it on its own instead of passing it to display()
data.info()
display(data.head())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32581 entries, 0 to 32580
Data columns (total 12 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Id              32581 non-null  int64  
 1   Age             32581 non-null  int64  
 2   Income          32581 non-null  int64  
 3   Home            32581 non-null  object 
 4   Emp_length      31686 non-null  float64
 5   Intent          32581 non-null  object 
 6   Amount          32581 non-null  int64  
 7   Rate            29465 non-null  float64
 8   Status          32581 non-null  int64  
 9   Percent_income  32581 non-null  float64
 10  Default         32581 non-null  object 
 11  Cred_length     32581 non-null  int64  
dtypes: float64(3), int64(6), object(3)
memory usage: 3.0+ MB

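| | Id | Age | Income | Home | Emp_length | Intent | Amount | Rate | Status | Percent_income | Default | Cred_length |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 22 | 59000 | RENT | 123.0 | PERSONAL | 35000 | 16.02 | 1 | 0.59 | Y | 3 |
| 1 | 1 | 21 | 9600 | OWN | 5.0 | EDUCATION | 1000 | 11.14 | 0 | 0.1 | N | 2 |
| 2 | 2 | 25 | 9600 | MORTGAGE | 1.0 | MEDICAL | 5500 | 12.87 | 1 | 0.57 | N | 3 |
| 3 | 3 | 23 | 65500 | RENT | 4.0 | MEDICAL | 35000 | 15.23 | 1 | 0.53 | N | 2 |
| 4 | 4 | 24 | 54400 | RENT | 8.0 | MEDICAL | 35000 | 14.27 | 1 | 0.55 | Y | 4 |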


In [5]:
data.isna().sum()

Id                   0
Age                  0
Income               0
Home                 0
Emp_length         895
Intent               0
Amount               0
Rate              3116
Status               0
Percent_income       0
Default              0
Cred_length          0
dtype: int64

In [6]:
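# Exploratory check only: this expression is not the last line of the cell, so its result is not shown
data[data['Emp_length'] > 35]

# Drop rows with an implausible employment length (e.g., the 123.0 in row 0).
# Rows where Emp_length is NaN do not match the comparison and are kept here.
rows_to_delete = data[data['Emp_length'] > 50].index
data = data.drop(rows_to_delete, axis = 0)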

In [13]:
data.dropna(inplace = True)
data.isna().sum()

Id                0
Age               0
Income            0
Home              0
Emp_length        0
Intent            0
Amount            0
Rate              0
Status            0
Percent_income    0
Default           0
Cred_length       0
dtype: int64

# Regression Begins

In [15]:
# Using LabelEncoder, encode the 'Default' column as integers; classes are sorted, so N becomes 0 and Y becomes 1
data['Default'] = LabelEncoder().fit_transform(data['Default'])

# Convert categorical columns in the 'data' DataFrame to one-hot encoded columns.
# If a column has N unique categories, it creates N-1 binary columns (due to 'drop_first = True')
data_encoded = pd.get_dummies(data, drop_first = True)

# Set 'X' to all columns except for the 'Default' column for our features
X = data_encoded.drop('Default', axis = 1)

# Set 'y' to the 'Default' column as it will be our target variable
y = data_encoded['Default']

# Initialize a StandardScaler instance, which will standardize the features (mean=0 and variance=1)
scaler = StandardScaler()

# Apply the StandardScaler to 'X' and convert the result back into a DataFrame with the original column names
X_scaled = pd.DataFrame(scaler.fit_transform(X), columns = X.columns)

# Split the dataset into training (70%) and testing (30%) sets with a random state for reproducibility
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size = 0.3, random_state = 42)

# Display the first few rows of the training features
X_train.head()


| | Id | Age | Income | Emp_length | Amount | Rate | Status | Percent_income | Cred_length | Home_OTHER | Home_OWN | Home_RENT | Intent_EDUCATION | Intent_HOMEIMPROVEMENT | Intent_MEDICAL | Intent_PERSONAL | Intent_VENTURE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 9842 | -0.540501 | -0.432251 | 0.085878 | 0.302212 | -0.73569 | 0.353077 | -0.52579 | -0.935235 | -0.691814 | -0.057388 | -0.28791 | -1.016337 | -0.498734 | -0.354567 | -0.476182 | 2.207453 | -0.459937 |
| 6883 | -0.898039 | -0.749192 | -0.186763 | 0.054413 | 0.844656 | 0.758752 | -0.52579 | 0.945091 | -0.939431 | -0.057388 | 3.473308 | -1.016337 | -0.498734 | -0.354567 | -0.476182 | -0.453011 | 2.174213 |
| 23819 | 1.148427 | 0.51857 | 0.403424 | -1.184583 | -1.174236 | -1.117883 | 1.901899 | -1.405316 | -0.196582 | -0.057388 | -0.28791 | -1.016337 | -0.498734 | -0.354567 | 2.100039 | -0.453011 | -0.459937 |
| 25065 | 1.297898 | -0.115311 | 0.855687 | -0.688984 | 0.220419 | 0.511011 | -0.52579 | -0.747202 | 0.051035 | -0.057388 | -0.28791 | 0.983926 | 2.005078 | -0.354567 | -0.476182 | -0.453011 | -0.459937 |
| 24090 | 1.179749 | -0.115311 | 0.607904 | -1.184583 | 2.108933 | 2.28855 | 1.901899 | 0.192961 | 0.793884 | -0.057388 | -0.28791 | -1.016337 | -0.498734 | -0.354567 | 2.100039 | -0.453011 | -0.459937 |


In [24]:
# Initialize a logistic regression model with a maximum of 1000 iterations and a specific random state for reproducibility
logreg = LogisticRegression(max_iter = 1000, random_state = 42)

# Fit the logistic regression model on the training data
logreg.fit(X_train, y_train)

# Predict the target values (Default) for the test set
y_pred = logreg.predict(X_test)

# Calculate the accuracy of the predictions by comparing them with the true values from the test set
accuracy = accuracy_score(y_test, y_pred)

# Generate a classification report, which includes precision, recall, f1-score, and support for each class
class_report = classification_report(y_test, y_pred)

# Generate a confusion matrix to see the true positive, false positive, true negative, and false negative counts
conf_matrix = confusion_matrix(y_test, y_pred)

# Print the calculated accuracy, classification report, and confusion matrix.
# print() renders the report's line breaks; display() would show the raw string with \n escapes.
print(accuracy)
print(class_report)
print(conf_matrix)


0.8182982190664649
              precision    recall  f1-score   support

           0       0.86      0.93      0.89      7022
           1       0.50      0.31      0.39      1569

    accuracy                           0.82      8591
   macro avg       0.68      0.62      0.64      8591
weighted avg       0.79      0.82      0.80      8591

[[6538  484]
 [1077  492]]

# Regression Outcomes


### **Credit Risk Prediction Model**

**Objective**:
Develop a predictive model to determine the likelihood of a borrower defaulting on a loan based on their personal and financial details.

---

**1. Data Exploration & Cleaning**:

- **Dataset Size**: 32,581 rows and 12 columns.
- **Features**: Age, Income, Home type, Employment length, Loan Intent, Amount, Interest Rate, etc.
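- **Target Variable**: Default (Y/N, encoded as 1/0).
- **Data Cleaning**:
  - Dropped rows with an implausible employment length (Emp_length above 50, such as the 123-year entry in row 0).
  - Removed 3,943 rows with NaN values (Emp_length had 895 missing entries and Rate had 3,116). The 8,591-row test set, which is 30% of the cleaned data, implies roughly 28,600 rows remain.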
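
Dropping every row with a missing value discards about 12% of the dataset. One common alternative is median imputation with a flag that records which values were filled. The sketch below illustrates the idea on a small made-up frame; the values and the helper name `impute_median` are hypothetical and not part of the notebook. For modeling, the medians would normally be computed on the training split only.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the loan table; values are illustrative only.
toy = pd.DataFrame({
    "Emp_length": [5.0, np.nan, 1.0, 4.0, 8.0],
    "Rate": [16.02, 11.14, np.nan, 15.23, np.nan],
})


def impute_median(df, columns):
    """Return a copy with NaNs in `columns` replaced by the column median.

    A boolean `<col>_missing` column records where values were filled.
    (Hypothetical helper, not part of the original notebook.)
    """
    out = df.copy()
    for col in columns:
        out[f"{col}_missing"] = out[col].isna()
        out[col] = out[col].fillna(out[col].median())
    return out


imputed = impute_median(toy, ["Emp_length", "Rate"])
print(imputed)
```

Whether imputation beats dropping rows here is an empirical question; comparing both variants on held-out data would settle it.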

---

**2. Data Preprocessing**:

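- **Feature Engineering**: 
  - Label-encoded the target (Default: N as 0, Y as 1) and one-hot encoded the categorical variables (Home type, Loan Intent), dropping the first level of each.
  - Standardized every feature column to zero mean and unit variance with StandardScaler, fitted on the full dataset before the split.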
- **Data Split**: 70% training set and 30% test set.
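
Because the scaler in the notebook is fitted before the split, test-set statistics influence the training features. A possible way to avoid that is to wrap preprocessing and the classifier in a scikit-learn `Pipeline`, so that scaling and encoding are learned from the training rows only. The following is a minimal sketch on synthetic data, not the notebook's method: the generated values, the choice of columns, and the omission of `Id` (a row identifier) are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for the cleaned loan table (hypothetical values).
rng = np.random.default_rng(42)
n = 200
toy = pd.DataFrame({
    "Age": rng.integers(20, 60, n),
    "Income": rng.integers(10_000, 120_000, n),
    "Rate": rng.uniform(5.0, 20.0, n),
    "Home": rng.choice(["RENT", "OWN", "MORTGAGE"], n),
    "Intent": rng.choice(["EDUCATION", "MEDICAL", "PERSONAL"], n),
})
# Hypothetical target: rows with higher rates are labeled 1 more often.
toy["Default"] = (toy["Rate"] + rng.normal(0, 3, n) > 15).astype(int)

numeric_cols = ["Age", "Income", "Rate"]
categorical_cols = ["Home", "Intent"]

# Scale numeric columns and one-hot encode categorical ones (first level dropped).
preprocess = ColumnTransformer([
    ("num", StandardScaler(), numeric_cols),
    ("cat", OneHotEncoder(drop="first"), categorical_cols),
])

model = Pipeline([
    ("preprocess", preprocess),
    ("clf", LogisticRegression(max_iter=1000, random_state=42)),
])

X = toy.drop(columns="Default")
y = toy["Default"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# fit() learns the scaling statistics and category levels from X_train only.
model.fit(X_train, y_train)
test_accuracy = model.score(X_test, y_test)
print(f"toy test accuracy: {test_accuracy:.3f}")
```

The accuracy printed here describes the synthetic data only and says nothing about the credit dataset.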

---

**3. Model Selection**:

- **Model Chosen**: Logistic Regression.
- **Rationale**: Suitable for binary classification problems.
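
For reference, the standard logistic regression model (the general definition, not something specific to this notebook) maps a weighted sum of the features $\mathbf{x}$ to a probability of the positive class:

$$
P(y = 1 \mid \mathbf{x}) = \sigma(\mathbf{w}^\top \mathbf{x} + b), \qquad \sigma(z) = \frac{1}{1 + e^{-z}}
$$

Here $\mathbf{w}$ and $b$ are the learned coefficients and intercept. With scikit-learn's default behavior, `predict` assigns class 1 when this probability exceeds 0.5.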

---

**4. Model Performance**:

- **Accuracy**: 81.83% on the test set.
- **Classification Report**:
  - **No Default (0)**:
    - Precision: 0.86
    - Recall: 0.93
    - F1-score: 0.89
  - **Default (1)**:
    - Precision: 0.50
    - Recall: 0.31
    - F1-score: 0.39

- **Confusion Matrix**:

$$
\begin{array}{c|cc}
 & \textbf{Predicted: No} & \textbf{Predicted: Yes} \\
\hline
\textbf{Actual: No} & \textbf{True Negative (TN)} & \textbf{False Positive (FP)} \\
\textbf{Actual: Yes} & \textbf{False Negative (FN)} & \textbf{True Positive (TP)} \\
\end{array}
$$

$$
\begin{matrix}
6538 & 484 \\
1077 & 492 \\
\end{matrix}
$$
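
As a cross-check, the headline metrics can be recomputed by hand from the four counts printed by `confusion_matrix` in the notebook output (6538, 484, 1077, 492). The short sketch below does this in plain Python and also computes the accuracy of a hypothetical baseline that always predicts class 0.

```python
# Counts from the notebook's confusion_matrix output
# (rows = actual class, columns = predicted class).
tn, fp = 6538, 484
fn, tp = 1077, 492

total = tn + fp + fn + tp

accuracy = (tp + tn) / total
precision = tp / (tp + fp)   # share of predicted defaults that are real
recall = tp / (tp + fn)      # share of real defaults that are caught
f1 = 2 * precision * recall / (precision + recall)

# Accuracy of a baseline that always predicts the majority class (0).
baseline_accuracy = (tn + fp) / total

print(f"accuracy          = {accuracy:.4f}")
print(f"precision (1)     = {precision:.2f}")
print(f"recall (1)        = {recall:.2f}")
print(f"f1 (1)            = {f1:.2f}")
print(f"baseline accuracy = {baseline_accuracy:.4f}")
```

The recomputed values round to 0.8183, 0.50, 0.31, and 0.39, matching the printed classification report, while the always-0 baseline reaches 0.8174.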

---

**5. Key Insights**:

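- The model identifies non-defaults well (recall 0.93) but catches only about 31% of actual defaults, missing 1,077 of the 1,569 defaulters in the test set.
- Class imbalance is a likely reason for the weak performance on defaults: only about 18% of test rows belong to class 1, so always predicting "no default" would already reach 81.74% accuracy, barely below the model's 81.83%. Future iterations could benefit from techniques like SMOTE or undersampling to balance the classes, and should be judged on class 1 recall and F1 rather than on accuracy alone.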
- Feature importance or other model interpretability tools can be used to understand which features are the most influential in predicting defaults.
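
The two follow-ups above can be sketched as below. Note the substitution: instead of SMOTE or undersampling, this example uses scikit-learn's built-in `class_weight="balanced"` option, which reweights the loss and needs no extra library. The data is synthetic (generated with `make_classification` at roughly 18% positives) and the feature names are placeholders, so the printed numbers say nothing about the credit dataset.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data standing in for the loan features (hypothetical).
X, y = make_classification(
    n_samples=3000, n_features=6, n_informative=3, n_redundant=0,
    weights=[0.82, 0.18], random_state=42,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Same model twice: once unweighted, once with errors on the rare class weighted up.
plain = LogisticRegression(max_iter=1000).fit(X_train, y_train)
weighted = LogisticRegression(max_iter=1000, class_weight="balanced").fit(X_train, y_train)

recall_plain = recall_score(y_test, plain.predict(X_test))
recall_weighted = recall_score(y_test, weighted.predict(X_test))
print(f"minority recall, unweighted: {recall_plain:.2f}")
print(f"minority recall, balanced:   {recall_weighted:.2f}")

# Rank features by absolute coefficient. Magnitudes are only comparable
# when the features share a common scale (e.g., after standardization).
feature_names = [f"feature_{i}" for i in range(X.shape[1])]
ranking = sorted(
    zip(feature_names, weighted.coef_[0]),
    key=lambda pair: abs(pair[1]),
    reverse=True,
)
for name, coef in ranking:
    print(f"{name:>10}: {coef:+.3f}")
```

Class weighting typically raises recall on the rare class at the cost of precision, so the right trade-off depends on how expensive a missed default is relative to a false alarm.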

