# Credit Risk Prediction Project


## Introduction
This project predicts credit risk with machine learning models: given applicant and loan attributes, it classifies whether a loan is approved (`Loan_Status`).

## Setup
The notebook relies on pandas, NumPy, and scikit-learn. The first cell imports everything needed for the baseline models.

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
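
The notebook does not record which library versions it was run with, and some calls further down behave differently across releases. A minimal sketch for logging the installed versions, assuming the packages were installed under their usual PyPI names; the helper name `installed_versions` is hypothetical:

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Return {package name: version string, or None if not installed}."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


versions = installed_versions(["pandas", "numpy", "scikit-learn"])
for name, ver in versions.items():
    print(f"{name}: {ver or 'not installed'}")
```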


# Data Loading and Exploration
Load the dataset and explore its structure and characteristics.

In [2]:
data = pd.read_csv("CreditRisk.csv")

In [3]:
data.head()

Unnamed: 0,Loan_ID,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status
0,LP001002,Male,No,0.0,Graduate,No,5849,0.0,,360.0,1.0,Urban,Y
1,LP001003,Male,Yes,1.0,Graduate,No,4583,1508.0,128.0,360.0,1.0,Rural,N
2,LP001005,Male,Yes,0.0,Graduate,Yes,3000,0.0,66.0,360.0,1.0,Urban,Y
3,LP001006,Male,Yes,0.0,Not Graduate,No,2583,2358.0,120.0,360.0,1.0,Urban,Y
4,LP001008,Male,No,0.0,Graduate,No,6000,0.0,141.0,360.0,1.0,Urban,Y


In [6]:
dups = data.duplicated()

In [7]:
data[dups]

Unnamed: 0,Loan_ID,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status


In [8]:
data.size

12753

In [9]:
data.Property_Area.unique()

array(['Urban', 'Rural', 'Semiurban'], dtype=object)

In [10]:
data.Dependents.unique()

array([ 0.,  1.,  2.,  4.,  3., nan])

In [11]:
data.Loan_Amount_Term.unique()

array([360., 120., 240.,  nan, 180.,  60., 300., 480.,  36.,  84.,  12.,
       350.,   6.])

In [12]:
data.Credit_History.unique()

array([ 1.,  0., nan])

In [13]:
data.shape

(981, 13)

In [14]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 981 entries, 0 to 980
Data columns (total 13 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Loan_ID            981 non-null    object 
 1   Gender             957 non-null    object 
 2   Married            978 non-null    object 
 3   Dependents         956 non-null    float64
 4   Education          981 non-null    object 
 5   Self_Employed      926 non-null    object 
 6   ApplicantIncome    981 non-null    int64  
 7   CoapplicantIncome  981 non-null    float64
 8   LoanAmount         954 non-null    float64
 9   Loan_Amount_Term   961 non-null    float64
 10  Credit_History     902 non-null    float64
 11  Property_Area      981 non-null    object 
 12  Loan_Status        981 non-null    object 
dtypes: float64(5), int64(1), object(7)
memory usage: 99.8+ KB


# Data Preprocessing
Handle missing values and encode categorical variables.

In [15]:
data.isnull().sum()

Loan_ID               0
Gender               24
Married               3
Dependents           25
Education             0
Self_Employed        55
ApplicantIncome       0
CoapplicantIncome     0
LoanAmount           27
Loan_Amount_Term     20
Credit_History       79
Property_Area         0
Loan_Status           0
dtype: int64

In [16]:
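# Fill gaps in numeric columns with the column mean; categorical columns are left untouched.
# numeric_only=True is required on recent pandas versions, where mean() no longer skips object columns silently.
data.fillna(value=data.mean(numeric_only=True), inplace=True)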


In [17]:
data.isnull().sum()

Loan_ID               0
Gender               24
Married               3
Dependents            0
Education             0
Self_Employed        55
ApplicantIncome       0
CoapplicantIncome     0
LoanAmount            0
Loan_Amount_Term      0
Credit_History        0
Property_Area         0
Loan_Status           0
dtype: int64

In [18]:
data.dropna(inplace =True)

In [19]:
data.isnull().sum()

Loan_ID              0
Gender               0
Married              0
Dependents           0
Education            0
Self_Employed        0
ApplicantIncome      0
CoapplicantIncome    0
LoanAmount           0
Loan_Amount_Term     0
Credit_History       0
Property_Area        0
Loan_Status          0
dtype: int64
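
Dropping rows with missing categorical values reduces the data from 981 to 899 rows (the count shown by `describe()` below). A possible alternative, sketched here on a made-up toy frame rather than the loan data, fills numeric gaps with the median and categorical gaps with the most frequent value. This is a swapped-in technique, not what the notebook does, and the helper name `impute_by_dtype` is hypothetical:

```python
import numpy as np
import pandas as pd

# Toy frame mimicking the mix of numeric and categorical gaps (values are made up).
toy = pd.DataFrame({
    "Gender": ["Male", None, "Female", "Male"],
    "Self_Employed": ["No", "No", None, "Yes"],
    "LoanAmount": [128.0, np.nan, 66.0, 120.0],
})


def impute_by_dtype(frame):
    """Fill numeric gaps with the column median and other gaps with the mode."""
    filled = frame.copy()
    for column in filled.columns:
        if pd.api.types.is_numeric_dtype(filled[column]):
            filled[column] = filled[column].fillna(filled[column].median())
        else:
            filled[column] = filled[column].fillna(filled[column].mode().iloc[0])
    return filled


imputed = impute_by_dtype(toy)
print(imputed)
```

Whether imputing or dropping is preferable depends on how much data can be spared and on whether the missingness is informative, so both variants are worth comparing.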

In [20]:
data.describe()

Unnamed: 0,Dependents,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History
count,899.0,899.0,899.0,899.0,899.0,899.0
mean,0.872656,5067.968854,1574.714038,141.336806,342.533744,0.839099
std,1.23593,5216.909976,2463.858869,74.141877,64.54522,0.352587
min,0.0,0.0,0.0,9.0,6.0,0.0
25%,0.0,2885.5,0.0,100.0,360.0,1.0
50%,0.0,3829.0,1103.0,128.0,360.0,1.0
75%,2.0,5506.0,2362.5,160.0,360.0,1.0
max,4.0,81000.0,33837.0,650.0,480.0,1.0


In [21]:
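# Restrict the correlation matrix to numeric columns; recent pandas versions raise an error on object columns otherwise.
data.corr(numeric_only=True)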


Unnamed: 0,Dependents,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History
Dependents,1.0,0.139411,-0.049462,0.126309,-0.085572,-0.038714
ApplicantIncome,0.139411,1.0,-0.10321,0.508971,0.013122,0.006933
CoapplicantIncome,-0.049462,-0.10321,1.0,0.187397,-0.007212,-0.029132
LoanAmount,0.126309,0.508971,0.187397,1.0,0.089592,-0.021967
Loan_Amount_Term,-0.085572,0.013122,-0.007212,0.089592,1.0,-0.014867
Credit_History,-0.038714,0.006933,-0.029132,-0.021967,-0.014867,1.0


In [22]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

features = ['Gender', 'Married', 'Dependents', 'Education', 'Self_Employed',
            'ApplicantIncome', 'CoapplicantIncome', 'LoanAmount',
            'Loan_Amount_Term', 'Credit_History', 'Property_Area']
target = 'Loan_Status'

# Encode categorical variables
label_encoder = LabelEncoder()
for feature in features:
    if data[feature].dtype == 'object':
        data[feature] = label_encoder.fit_transform(data[feature].astype(str))
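
`LabelEncoder` assigns integers in sorted order, so a three-valued column such as `Property_Area` becomes Rural=0, Semiurban=1, Urban=2, which a linear model may read as an ordering. One-hot encoding avoids that. A minimal sketch with `pd.get_dummies` on a made-up toy frame (not the notebook's method):

```python
import pandas as pd

# Toy frame with one multi-valued categorical column (values are illustrative).
toy = pd.DataFrame({
    "Property_Area": ["Urban", "Rural", "Semiurban", "Urban"],
    "ApplicantIncome": [5849, 4583, 3000, 2583],
})

# One indicator column per category; dtype=int yields 0/1 instead of booleans.
encoded = pd.get_dummies(toy, columns=["Property_Area"], dtype=int)
print(encoded.columns.tolist())
```

Tree-based models are largely insensitive to this choice, so the difference is most likely to matter for Logistic Regression.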






# Data Splitting
Split the dataset into training (80%) and testing (20%) sets.

In [23]:
# Split the data into training and testing sets
X = data[features]
y = data[target]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
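
The split above is purely random, so the share of approved loans can differ between the training and test sets. A sketch of a stratified split, which keeps the class proportions equal in both parts, using made-up toy labels rather than the loan data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy labels with a 70/30 imbalance (made-up data, not the loan dataset).
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.array(["Y"] * 70 + ["N"] * 30)

# stratify=y_toy preserves the class ratio in both the training and test parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

print("Share of 'Y' in train:", (y_tr == "Y").mean())
print("Share of 'Y' in test: ", (y_te == "Y").mean())
```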

# Model Training and Evaluation
Train various machine learning models and evaluate their performance.

In [24]:

from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

model = LogisticRegression(max_iter=1000)  # Increase max_iter if necessary
model.fit(X_train, y_train)

# Make predictions on the testing data
y_pred = model.predict(X_test)

# Calculate performance metrics
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred, pos_label='Y')
recall = recall_score(y_test, y_pred, pos_label='Y')
f1 = f1_score(y_test, y_pred, pos_label='Y')

# Print the performance metrics
print('Logistic Regression Performance Metrics:')
print(f'Accuracy: {accuracy:.4f}')
print(f'Precision: {precision:.4f}')
print(f'Recall: {recall:.4f}')
print(f'F1-score: {f1:.4f}')

Logistic Regression Performance Metrics:
Accuracy: 0.8444
Precision: 0.8239
Recall: 1.0000
F1-score: 0.9034
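
To make the four scores concrete, the sketch below recomputes them from confusion-matrix counts. The counts are inferred from the reported metrics under the assumption that the 20% split of 899 rows gives 180 test rows; the notebook itself does not print them, and the helper name `classification_metrics` is hypothetical:

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall, and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Inferred counts (assumption): 131 approved loans all predicted 'Y',
# 28 rejected loans wrongly predicted 'Y', 21 rejected loans predicted 'N'.
metrics = classification_metrics(tp=131, fp=28, fn=0, tn=21)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Under these counts the model predicts approval for 159 of the 180 test rows, which explains the perfect recall alongside the lower precision.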


In [25]:
model = DecisionTreeClassifier()
model.fit(X_train, y_train)

# Make predictions on the testing data
y_pred = model.predict(X_test)

# Calculate performance metrics
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred, pos_label='Y')
recall = recall_score(y_test, y_pred, pos_label='Y')
f1 = f1_score(y_test, y_pred, pos_label='Y')

# Print the performance metrics
print('Decision Tree Performance Metrics:')
print(f'Accuracy: {accuracy:.4f}')
print(f'Precision: {precision:.4f}')
print(f'Recall: {recall:.4f}')
print(f'F1-score: {f1:.4f}')



Decision Tree Performance Metrics:
Accuracy: 0.7611
Precision: 0.8333
Recall: 0.8397
F1-score: 0.8365


In [26]:
random_forest_model = RandomForestClassifier()
random_forest_model.fit(X_train, y_train)

# Make predictions on the testing data
y_pred = random_forest_model.predict(X_test)

# Calculate performance metrics
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred, pos_label='Y')
recall = recall_score(y_test, y_pred, pos_label='Y')
f1 = f1_score(y_test, y_pred, pos_label='Y')

# Print the performance metrics
print('Random Forest Performance Metrics:')
print(f'Accuracy: {accuracy:.4f}')
print(f'Precision: {precision:.4f}')
print(f'Recall: {recall:.4f}')
print(f'F1-score: {f1:.4f}')

Random Forest Performance Metrics:
Accuracy: 0.8389
Precision: 0.8312
Recall: 0.9771
F1-score: 0.8982


# Conclusion:

**Logistic Regression:**

Achieves the highest test accuracy (0.8444) and the highest F1-score (0.9034) of the three models.
A recall of 1.0000 means it identified every instance of the positive class (loan approval) in the test set.
Its precision (0.8239) is the lowest of the three, so some rejected loans are predicted as approved (false positives).
It is a reasonable baseline for this problem, although in credit risk a false positive means approving a loan that should have been rejected.

**Decision Tree:**

Achieves the lowest accuracy of the three models (0.7611).
Precision (0.8333) and recall (0.8397) are balanced; precision is slightly above that of Logistic Regression, but recall is well below it.
May be overfitting, since an unconstrained tree tends to create complex decision boundaries.

**Random Forest:**

Performs close to Logistic Regression in accuracy (0.8389), with slightly higher precision (0.8312) and slightly lower recall (0.9771).
Its F1-score (0.8982) indicates a good balance between precision and recall.
Ensemble methods like Random Forest generally provide robust performance and can handle complex relationships in the data.
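
Accuracy figures on imbalanced labels are easier to judge against a majority-class baseline. A minimal sketch with scikit-learn's `DummyClassifier`, using made-up labels rather than the loan data, so the printed value is illustrative only:

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Made-up labels with an imbalance toward 'Y' (not the loan dataset).
y_train_toy = np.array(["Y"] * 70 + ["N"] * 30)
y_test_toy = np.array(["Y"] * 15 + ["N"] * 5)
X_train_toy = np.zeros((len(y_train_toy), 1))  # features are ignored by this baseline
X_test_toy = np.zeros((len(y_test_toy), 1))

# Always predicts the most frequent training label.
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_train_toy, y_train_toy)
baseline_accuracy = accuracy_score(y_test_toy, baseline.predict(X_test_toy))
print(f"Majority-class baseline accuracy: {baseline_accuracy:.4f}")
```

A model is only informative to the extent that it beats this number on the same test set.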

# Feature scaling using StandardScaler.

In [28]:
from sklearn.preprocessing import StandardScaler
# Feature Scaling
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
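
Fitting the scaler on the whole training set and then cross-validating on `X_train_scaled` lets each validation fold influence the scaling statistics. Wrapping the scaler and the model in a pipeline refits the scaler inside every fold. A sketch on synthetic data from `make_classification` (an assumption for illustration, not the loan data):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the training data.
X_toy, y_toy = make_classification(n_samples=200, n_features=6, random_state=42)

# The scaler is fit on the training part of each fold only.
scaled_lr = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scores = cross_val_score(scaled_lr, X_toy, y_toy, cv=5)
print("Mean CV accuracy:", scores.mean())
```

For standard scaling the effect is usually small, but the pipeline form also keeps preprocessing and model together when the model is saved.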



# Hyperparameter tuning for Decision Tree and Random Forest using GridSearchCV.

In [30]:
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression





# Hyperparameter Tuning for Decision Tree
param_grid_dt = {'max_depth': [3, 5, 7, 10], 'min_samples_leaf': [1, 3, 5, 10]}
dt_grid_search = GridSearchCV(DecisionTreeClassifier(random_state=42), param_grid_dt, cv=5)
dt_grid_search.fit(X_train_scaled, y_train)
best_dt_model = dt_grid_search.best_estimator_





In [31]:
# Hyperparameter Tuning for Random Forest
param_grid_rf = {'n_estimators': [100, 200, 300], 'max_depth': [5, 10, 15], 'min_samples_leaf': [1, 3, 5]}
rf_grid_search = GridSearchCV(RandomForestClassifier(random_state=42), param_grid_rf, cv=5)
rf_grid_search.fit(X_train_scaled, y_train)
best_rf_model = rf_grid_search.best_estimator_
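
The cells above keep only `best_estimator_`. It is often useful to also look at which parameter combination won and what score it reached, via the `best_params_` and `best_score_` attributes. A sketch on synthetic data with a deliberately small grid (both are assumptions for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the training data.
X_toy, y_toy = make_classification(n_samples=200, n_features=6, random_state=42)

param_grid = {"max_depth": [3, 5], "min_samples_leaf": [1, 5]}
search = GridSearchCV(DecisionTreeClassifier(random_state=42), param_grid, cv=5)
search.fit(X_toy, y_toy)

# best_score_ is the mean cross-validated accuracy of the winning combination.
print("Best parameters:", search.best_params_)
print("Best mean CV accuracy:", round(search.best_score_, 4))
```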


# Ensemble methods: AdaBoost and Gradient Boosting.

In [32]:

# Ensemble Methods: AdaBoost
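# `estimator` is the parameter name from scikit-learn 1.2 onward; older releases call it `base_estimator`.
ada_boost_model = AdaBoostClassifier(estimator=DecisionTreeClassifier(max_depth=5), n_estimators=100, random_state=42)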
ada_boost_model.fit(X_train_scaled, y_train)





In [33]:
# Ensemble Methods: Gradient Boosting
gradient_boost_model = GradientBoostingClassifier(n_estimators=100, max_depth=5, random_state=42)
gradient_boost_model.fit(X_train_scaled, y_train)

# Cross-validation for Logistic Regression, Decision Tree, Random Forest, AdaBoost, and Gradient Boosting models.

In [34]:
# Cross-Validation
# Logistic Regression
lr_scores = cross_val_score(LogisticRegression(max_iter=1000), X_train_scaled, y_train, cv=5)
print("Logistic Regression Cross-Validation Mean Accuracy:", lr_scores.mean())

# Decision Tree
dt_scores = cross_val_score(best_dt_model, X_train_scaled, y_train, cv=5)
print("Decision Tree Cross-Validation Mean Accuracy:", dt_scores.mean())

# Random Forest
rf_scores = cross_val_score(best_rf_model, X_train_scaled, y_train, cv=5)
print("Random Forest Cross-Validation Mean Accuracy:", rf_scores.mean())

# AdaBoost
ada_scores = cross_val_score(ada_boost_model, X_train_scaled, y_train, cv=5)
print("AdaBoost Cross-Validation Mean Accuracy:", ada_scores.mean())

# Gradient Boosting
gradient_scores = cross_val_score(gradient_boost_model, X_train_scaled, y_train, cv=5)
print("Gradient Boosting Cross-Validation Mean Accuracy:", gradient_scores.mean())

Logistic Regression Cross-Validation Mean Accuracy: 0.8609654234654235
Decision Tree Cross-Validation Mean Accuracy: 0.86513209013209
Random Forest Cross-Validation Mean Accuracy: 0.8706973581973582
AdaBoost Cross-Validation Mean Accuracy: 0.8525738150738151
Gradient Boosting Cross-Validation Mean Accuracy: 0.8526126651126651
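
The mean scores above differ by less than two percentage points, so the spread across folds matters when ranking the models. A sketch that reports the standard deviation next to the mean, on synthetic data and with two illustrative candidates (assumptions, not the notebook's tuned models):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the training data.
X_toy, y_toy = make_classification(n_samples=200, n_features=6, random_state=42)

candidates = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=42),
}

summary = {}
for name, estimator in candidates.items():
    fold_scores = cross_val_score(estimator, X_toy, y_toy, cv=5)
    summary[name] = (fold_scores.mean(), fold_scores.std())
    print(f"{name}: {fold_scores.mean():.4f} +/- {fold_scores.std():.4f}")
```

If the gap between two means is smaller than the fold-to-fold spread, the ranking between those models is not well supported.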


Based on the cross-validation mean accuracy scores for the Logistic Regression, Decision Tree, Random Forest, AdaBoost, and Gradient Boosting models, the conclusions are as follows:

**Logistic Regression:**

Achieves a mean cross-validation accuracy of approximately 86.10%.
Only the mean is reported, so consistency across folds cannot be judged from this output.
Provides a good baseline model for credit risk prediction.

**Decision Tree:**

Achieves a mean cross-validation accuracy of approximately 86.51% after tuning.
Performs slightly better than Logistic Regression in terms of accuracy.
Shows potential for capturing non-linear relationships in the data.

**Random Forest:**

Achieves a mean cross-validation accuracy of approximately 87.07% after tuning.
Performs slightly better than both the Logistic Regression and Decision Tree models.
Is the only ensemble method here that improves on the simpler models.

**AdaBoost:**

Achieves a mean cross-validation accuracy of approximately 85.26%.
Scores slightly below Logistic Regression and the tuned Decision Tree.
Was run with fixed settings rather than a grid search, so tuning may improve it.

**Gradient Boosting:**

Achieves a mean cross-validation accuracy of approximately 85.26%.
Performs on par with AdaBoost in terms of accuracy.
Was also run without hyperparameter tuning.

**Overall Conclusion:**

Among the models evaluated, the tuned Random Forest has the highest mean cross-validation accuracy (87.07%), followed closely by the tuned Decision Tree (86.51%).
The ensemble methods do not uniformly outperform the simpler models: Random Forest ranks first, but AdaBoost and Gradient Boosting (both about 85.26%) score below Logistic Regression (86.10%) and the Decision Tree.
All five models fall within about two percentage points of each other, so the ranking should be read with caution.
It is recommended to fine-tune the best-performing model (Random Forest) further and to evaluate it on held-out data that was not used for tuning before deployment.

In [None]:
import pickle


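# Save the tuned Random Forest as a pickle file.
# `model` still refers to the Decision Tree from the earlier cell, so the tuned forest is dumped explicitly.
# It was fit on StandardScaler-transformed features, so inputs must be scaled the same way at inference.
filename = 'random_forest_model.pkl'
with open(filename, 'wb') as file:
    pickle.dump(best_rf_model, file)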
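
A model trained on scaled features needs the fitted scaler at prediction time as well. The sketch below bundles both into one pickle and checks that predictions survive a round trip. It uses synthetic data and serializes in memory with `pickle.dumps` to avoid writing a file; the dictionary keys `"scaler"` and `"model"` are arbitrary choices:

```python
import pickle

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the training data.
X_toy, y_toy = make_classification(n_samples=100, n_features=6, random_state=42)

scaler = StandardScaler().fit(X_toy)
forest = RandomForestClassifier(n_estimators=20, random_state=42)
forest.fit(scaler.transform(X_toy), y_toy)

# Bundle scaler and model so inference applies the same preprocessing.
payload = pickle.dumps({"scaler": scaler, "model": forest})

# Reload and confirm the restored objects reproduce the original predictions.
restored = pickle.loads(payload)
restored_pred = restored["model"].predict(restored["scaler"].transform(X_toy))
original_pred = forest.predict(scaler.transform(X_toy))
print("Predictions match after reload:", bool((restored_pred == original_pred).all()))
```

Pickle files should only be loaded from trusted sources, and they are generally tied to the scikit-learn version that produced them.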
