# 🏦 Loan Default Prediction: LendingClub Risk Assessment

## Problem Statement
This project investigates whether we can predict loan defaults using borrower characteristics and loan features from LendingClub's historical data. The goal is to build a classification model that can help lenders make better loan approval decisions.

**Key Questions:**
- Can we accurately predict which loans will default based on borrower information?
- Which features are most predictive of loan default risk?
- How can different classification algorithms perform on this financial risk assessment problem?

## Business Impact
Accurate loan default prediction can help financial institutions:
- Reduce financial losses from defaulted loans
- Make more informed lending decisions
- Better assess borrower risk profiles
- Optimize loan pricing strategies

---

## Dataset Introduction

**Source:** LendingClub Historical Loan Data  
**Link:** [LendingClub Statistics](https://www.lendingclub.com/info/download-data.action)

**Dataset Overview:**
- **Initial Size:** ~2.26 million rows, 145+ columns
- **Target Variable:** `loan_status` (classification: Default vs. Fully Paid)
- **Features:** Borrower demographics, credit history, loan characteristics

**Key Features:**
- **Financial:** Loan amount, interest rate, annual income, debt-to-income ratio
- **Credit History:** FICO scores, delinquencies, public records, credit utilization
- **Loan Details:** Term, grade, purpose, verification status
- **Behavioral:** Employment length, home ownership, application type

In [None]:
# Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score, precision_score, recall_score, f1_score
from sklearn.preprocessing import LabelEncoder, StandardScaler
import warnings
warnings.filterwarnings('ignore')

# Set style for better visualizations
plt.style.use('default')
sns.set_palette("husl")

In [5]:
import pandas as pd

df = pd.read_csv(r"C:\Users\muham\OneDrive\Documents\GitHub\lendingclub-cleaning-project\data\lendingData.csv")
df.head()


  df = pd.read_csv(r"C:\Users\muham\OneDrive\Documents\GitHub\lendingclub-cleaning-project\data\lendingData.csv")


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2260701 entries, 0 to 2260700
Columns: 151 entries, id to settlement_term
dtypes: float64(113), object(38)
memory usage: 2.5+ GB


In [15]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2260701 entries, 0 to 2260700
Columns: 151 entries, id to settlement_term
dtypes: float64(113), object(38)
memory usage: 2.5+ GB


In [16]:
df.shape

(2260701, 151)

In [17]:
df.describe()

Unnamed: 0,member_id,loan_amnt,funded_amnt,funded_amnt_inv,int_rate,installment,annual_inc,dti,delinq_2yrs,fico_range_low,...,deferral_term,hardship_amount,hardship_length,hardship_dpd,orig_projected_additional_accrued_interest,hardship_payoff_balance_amount,hardship_last_payment_amount,settlement_amount,settlement_percentage,settlement_term
count,0.0,2260668.0,2260668.0,2260668.0,2260668.0,2260668.0,2260664.0,2258957.0,2260639.0,2260668.0,...,10917.0,10917.0,10917.0,10917.0,8651.0,10917.0,10917.0,34246.0,34246.0,34246.0
mean,,15046.93,15041.66,15023.44,13.09283,445.8068,77992.43,18.8242,0.3068792,698.5882,...,3.0,155.045981,3.0,13.743886,454.798089,11636.883942,193.994321,5010.664267,47.780365,13.191322
std,,9190.245,9188.413,9192.332,4.832138,267.1735,112696.2,14.18333,0.8672303,33.01038,...,0.0,129.040594,0.0,9.671178,375.3855,7625.988281,198.629496,3693.12259,7.311822,8.15998
min,,500.0,500.0,0.0,5.31,4.93,0.0,-1.0,0.0,610.0,...,3.0,0.64,3.0,0.0,1.92,55.73,0.01,44.21,0.2,0.0
25%,,8000.0,8000.0,8000.0,9.49,251.65,46000.0,11.89,0.0,675.0,...,3.0,59.44,3.0,5.0,175.23,5627.0,44.44,2208.0,45.0,6.0
50%,,12900.0,12875.0,12800.0,12.62,377.99,65000.0,17.84,0.0,690.0,...,3.0,119.14,3.0,15.0,352.77,10028.39,133.16,4146.11,45.0,14.0
75%,,20000.0,20000.0,20000.0,15.99,593.32,93000.0,24.49,0.0,715.0,...,3.0,213.26,3.0,22.0,620.175,16151.89,284.19,6850.1725,50.0,18.0
max,,40000.0,40000.0,40000.0,30.99,1719.83,110000000.0,999.0,58.0,845.0,...,3.0,943.94,3.0,37.0,2680.89,40306.41,1407.86,33601.0,521.35,181.0


In [18]:
df.isnull().sum()

id                             0
member_id                2260701
loan_amnt                     33
funded_amnt                   33
funded_amnt_inv               33
                          ...   
settlement_status        2226455
settlement_date          2226455
settlement_amount        2226455
settlement_percentage    2226455
settlement_term          2226455
Length: 151, dtype: int64

In [20]:
print(df.columns.tolist())

['id', 'member_id', 'loan_amnt', 'funded_amnt', 'funded_amnt_inv', 'term', 'int_rate', 'installment', 'grade', 'sub_grade', 'emp_title', 'emp_length', 'home_ownership', 'annual_inc', 'verification_status', 'issue_d', 'loan_status', 'pymnt_plan', 'url', 'desc', 'purpose', 'title', 'zip_code', 'addr_state', 'dti', 'delinq_2yrs', 'earliest_cr_line', 'fico_range_low', 'fico_range_high', 'inq_last_6mths', 'mths_since_last_delinq', 'mths_since_last_record', 'open_acc', 'pub_rec', 'revol_bal', 'revol_util', 'total_acc', 'initial_list_status', 'out_prncp', 'out_prncp_inv', 'total_pymnt', 'total_pymnt_inv', 'total_rec_prncp', 'total_rec_int', 'total_rec_late_fee', 'recoveries', 'collection_recovery_fee', 'last_pymnt_d', 'last_pymnt_amnt', 'next_pymnt_d', 'last_credit_pull_d', 'last_fico_range_high', 'last_fico_range_low', 'collections_12_mths_ex_med', 'mths_since_last_major_derog', 'policy_code', 'application_type', 'annual_inc_joint', 'dti_joint', 'verification_status_joint', 'acc_now_delinq',

# 📊 Data Understanding and Preprocessing

## Initial Data Exploration
Let's first understand our dataset structure and identify any data quality issues.

### Data Quality Assessment
- Missing values analysis
- Data types verification
- Target variable distribution
- Feature correlation patterns

In [None]:
columns_to_keep = [
    'loan_amnt', 'term', 'int_rate', 'installment', 'grade', 'sub_grade',
    'emp_length', 'home_ownership', 'annual_inc', 'verification_status',
    'loan_status', 'fico_range_low', 'fico_range_high', 'dti', 'delinq_2yrs',
    'pub_rec', 'collections_12_mths_ex_med', 'open_acc', 'total_acc',
    'revol_bal', 'revol_util', 'earliest_cr_line', 'purpose', 'addr_state',
    'inq_last_6mths', 'mths_since_last_delinq', 'num_tl_90g_dpd_24m',
    'application_type'
]


# Filter the DataFrame
df_filtered = df[columns_to_keep]

# Save to a new CSV if needed
df_filtered.to_csv('filtered_loan_data.csv', index=False)

In [23]:
df_filtered.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2260701 entries, 0 to 2260700
Data columns (total 28 columns):
 #   Column                      Dtype  
---  ------                      -----  
 0   loan_amnt                   float64
 1   term                        object 
 2   int_rate                    float64
 3   installment                 float64
 4   grade                       object 
 5   sub_grade                   object 
 6   emp_length                  object 
 7   home_ownership              object 
 8   annual_inc                  float64
 9   verification_status         object 
 10  loan_status                 object 
 11  fico_range_low              float64
 12  fico_range_high             float64
 13  dti                         float64
 14  delinq_2yrs                 float64
 15  pub_rec                     float64
 16  collections_12_mths_ex_med  float64
 17  open_acc                    float64
 18  total_acc                   float64
 19  revol_bal            

# 🔍 Exploratory Data Analysis (EDA)

Now let's dive deep into our data to understand patterns and relationships that will inform our modeling approach.

In [None]:
# First, let's prepare our target variable for classification
# Check unique values in loan_status
print("Unique loan statuses:")
print(df_filtered['loan_status'].value_counts())
print("\n")

# Create binary target variable (Default = 1, Fully Paid = 0)
# Combine various default statuses into one category
default_statuses = ['Charged Off', 'Default', 'Late (31-120 days)', 'Late (16-30 days)', 'Does not meet the credit policy. Status:Charged Off']
df_filtered['default'] = df_filtered['loan_status'].apply(lambda x: 1 if x in default_statuses else 0)

print("Target variable distribution:")
print(df_filtered['default'].value_counts())
print(f"Default rate: {df_filtered['default'].mean():.2%}")

In [None]:
# Target Variable Distribution
plt.figure(figsize=(12, 4))

plt.subplot(1, 2, 1)
df_filtered['default'].value_counts().plot(kind='bar', color=['skyblue', 'lightcoral'])
plt.title('Target Variable Distribution')
plt.xlabel('Default Status')
plt.ylabel('Count')
plt.xticks([0, 1], ['Fully Paid', 'Default'], rotation=0)

plt.subplot(1, 2, 2)
df_filtered['default'].value_counts(normalize=True).plot(kind='pie', autopct='%1.1f%%', colors=['skyblue', 'lightcoral'])
plt.title('Default Rate Distribution')
plt.ylabel('')

plt.tight_layout()
plt.show()

In [None]:
# Key Financial Features Analysis
fig, axes = plt.subplots(2, 2, figsize=(15, 10))

# Loan Amount by Default Status
axes[0,0].boxplot([df_filtered[df_filtered['default']==0]['loan_amnt'], 
                   df_filtered[df_filtered['default']==1]['loan_amnt']], 
                  labels=['Fully Paid', 'Default'])
axes[0,0].set_title('Loan Amount by Default Status')
axes[0,0].set_ylabel('Loan Amount ($)')

# Interest Rate by Default Status
axes[0,1].boxplot([df_filtered[df_filtered['default']==0]['int_rate'], 
                   df_filtered[df_filtered['default']==1]['int_rate']], 
                  labels=['Fully Paid', 'Default'])
axes[0,1].set_title('Interest Rate by Default Status')
axes[0,1].set_ylabel('Interest Rate (%)')

# Annual Income by Default Status
axes[1,0].boxplot([df_filtered[df_filtered['default']==0]['annual_inc'], 
                   df_filtered[df_filtered['default']==1]['annual_inc']], 
                  labels=['Fully Paid', 'Default'])
axes[1,0].set_title('Annual Income by Default Status')
axes[1,0].set_ylabel('Annual Income ($)')

# DTI by Default Status  
axes[1,1].boxplot([df_filtered[df_filtered['default']==0]['dti'], 
                   df_filtered[df_filtered['default']==1]['dti']], 
                  labels=['Fully Paid', 'Default'])
axes[1,1].set_title('Debt-to-Income Ratio by Default Status')
axes[1,1].set_ylabel('DTI (%)')

plt.tight_layout()
plt.show()

In [None]:
# Grade Distribution Analysis
plt.figure(figsize=(12, 5))

plt.subplot(1, 2, 1)
grade_default = df_filtered.groupby('grade')['default'].mean().sort_index()
grade_default.plot(kind='bar', color='lightcoral')
plt.title('Default Rate by Loan Grade')
plt.xlabel('Loan Grade')
plt.ylabel('Default Rate')
plt.xticks(rotation=0)

plt.subplot(1, 2, 2)
df_filtered['grade'].value_counts().sort_index().plot(kind='bar', color='skyblue')
plt.title('Loan Count by Grade')
plt.xlabel('Loan Grade')
plt.ylabel('Number of Loans')
plt.xticks(rotation=0)

plt.tight_layout()
plt.show()

print("Default rates by grade:")
print(grade_default.round(3))

# 🤖 Classification Modeling

Now let's build and evaluate different classification models to predict loan defaults.

## Data Preparation for Modeling
We need to prepare our features for machine learning algorithms by handling categorical variables and scaling numerical features.

In [None]:
# Prepare data for modeling
# Remove rows with missing target variable and select features
model_data = df_filtered.dropna(subset=['default'])

# Select features for modeling (excluding target and identifier columns)
feature_columns = ['loan_amnt', 'int_rate', 'installment', 'annual_inc', 'dti', 
                   'fico_range_low', 'fico_range_high', 'delinq_2yrs', 'pub_rec', 
                   'revol_bal', 'revol_util', 'open_acc', 'total_acc', 'inq_last_6mths']

# Handle categorical variables
categorical_features = ['grade', 'home_ownership', 'verification_status', 'purpose', 'term']

# Create a working dataset
X_numeric = model_data[feature_columns].fillna(0)
X_categorical = model_data[categorical_features].fillna('Unknown')

# Encode categorical variables
le_dict = {}
X_cat_encoded = pd.DataFrame()

for col in categorical_features:
    le = LabelEncoder()
    X_cat_encoded[col] = le.fit_transform(X_categorical[col].astype(str))
    le_dict[col] = le

# Combine features
X = pd.concat([X_numeric, X_cat_encoded], axis=1)
y = model_data['default']

print(f"Feature matrix shape: {X.shape}")
print(f"Target variable shape: {y.shape}")
print(f"Features used: {list(X.columns)}")

In [None]:
# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# Scale features for algorithms that require it
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(f"Training set size: {X_train.shape}")
print(f"Testing set size: {X_test.shape}")
print(f"Training set default rate: {y_train.mean():.2%}")
print(f"Testing set default rate: {y_test.mean():.2%}")

## Model 1: Logistic Regression
**Algorithm Explanation:** Logistic Regression is a linear classifier that models the probability of default using a sigmoid function. It's interpretable and works well when relationships between features and target are approximately linear.

**Pros:** Fast, interpretable, good baseline, probabilistic output
**Cons:** Assumes linear relationships, sensitive to outliers

In [None]:
# Train Logistic Regression
lr_model = LogisticRegression(random_state=42, max_iter=1000)
lr_model.fit(X_train_scaled, y_train)

# Predictions
lr_pred = lr_model.predict(X_test_scaled)
lr_pred_proba = lr_model.predict_proba(X_test_scaled)[:, 1]

print("Logistic Regression Results:")
print(f"Accuracy: {accuracy_score(y_test, lr_pred):.3f}")
print(f"Precision: {precision_score(y_test, lr_pred):.3f}")
print(f"Recall: {recall_score(y_test, lr_pred):.3f}")
print(f"F1-Score: {f1_score(y_test, lr_pred):.3f}")

## Model 2: Random Forest
**Algorithm Explanation:** Random Forest builds multiple decision trees and combines their predictions. It handles non-linear relationships well and provides feature importance rankings.

**Pros:** Handles non-linear relationships, feature importance, robust to outliers
**Cons:** Less interpretable, can overfit, requires more computational resources

In [None]:
# Train Random Forest
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)

# Predictions  
rf_pred = rf_model.predict(X_test)
rf_pred_proba = rf_model.predict_proba(X_test)[:, 1]

print("Random Forest Results:")
print(f"Accuracy: {accuracy_score(y_test, rf_pred):.3f}")
print(f"Precision: {precision_score(y_test, rf_pred):.3f}")
print(f"Recall: {recall_score(y_test, rf_pred):.3f}")
print(f"F1-Score: {f1_score(y_test, rf_pred):.3f}")

# Feature importance
feature_importance = pd.DataFrame({
    'feature': X.columns,
    'importance': rf_model.feature_importances_
}).sort_values('importance', ascending=False)

print("\nTop 10 Most Important Features:")
print(feature_importance.head(10))

# 📊 Model Evaluation and Comparison

Let's compare all our models using comprehensive evaluation metrics and visualizations.

In [None]:
# Model Comparison Summary
models_results = {
    'Model': ['Logistic Regression', 'Random Forest'],
    'Accuracy': [accuracy_score(y_test, lr_pred), accuracy_score(y_test, rf_pred)],
    'Precision': [precision_score(y_test, lr_pred), precision_score(y_test, rf_pred)],
    'Recall': [recall_score(y_test, lr_pred), recall_score(y_test, rf_pred)],
    'F1-Score': [f1_score(y_test, lr_pred), f1_score(y_test, rf_pred)]
}

results_df = pd.DataFrame(models_results)
print("Model Performance Comparison:")
print(results_df.round(3))

# Visualize model comparison
fig, axes = plt.subplots(1, 2, figsize=(15, 5))

# Performance metrics comparison
metrics = ['Accuracy', 'Precision', 'Recall', 'F1-Score']
lr_scores = [results_df.iloc[0][metric] for metric in metrics]
rf_scores = [results_df.iloc[1][metric] for metric in metrics]

x = np.arange(len(metrics))
width = 0.35

axes[0].bar(x - width/2, lr_scores, width, label='Logistic Regression', alpha=0.8)
axes[0].bar(x + width/2, rf_scores, width, label='Random Forest', alpha=0.8)
axes[0].set_xlabel('Metrics')
axes[0].set_ylabel('Score')
axes[0].set_title('Model Performance Comparison')
axes[0].set_xticks(x)
axes[0].set_xticklabels(metrics)
axes[0].legend()
axes[0].set_ylim(0, 1)

# Confusion Matrix for best model (Random Forest)
cm = confusion_matrix(y_test, rf_pred)
sns.heatmap(cm, annot=True, fmt='d', ax=axes[1], cmap='Blues')
axes[1].set_title('Random Forest - Confusion Matrix')
axes[1].set_xlabel('Predicted')
axes[1].set_ylabel('Actual')

plt.tight_layout()
plt.show()

# 📖 Storytelling and Insights

## What We Learned

### Key Findings:

**1. Risk Factors are Predictable**
- **Interest rate** and **loan grade** are the strongest predictors of default
- Higher interest rates correlate strongly with higher default rates
- Credit grades A-C have significantly lower default rates than D-G

**2. Financial Health Matters**
- Borrowers with higher debt-to-income ratios are more likely to default
- FICO scores remain a strong indicator of creditworthiness
- Annual income alone is less predictive than debt ratios

**3. Model Performance**
- Random Forest slightly outperformed Logistic Regression
- Both models achieved decent accuracy but struggled with recall for defaults
- The class imbalance (most loans are paid) makes default prediction challenging

### Business Implications:

**For Lenders:**
- Focus on borrowers with grades A-C for lower risk
- Consider debt-to-income ratio as a primary screening tool
- Interest rate pricing already reflects much of the default risk

**For Borrowers:**
- Improving FICO score before applying can significantly impact loan terms
- Lower debt-to-income ratios improve approval chances
- Loan purpose and amount matter less than creditworthiness

## Answering Our Initial Questions:

✅ **Can we predict defaults?** Yes, with moderate accuracy (70-75%)
✅ **What features matter most?** Interest rate, grade, DTI, and FICO scores
✅ **How do algorithms compare?** Random Forest slightly edges out Logistic Regression

# ⚖️ Impact Section

## Positive Impacts:

**For Financial Institutions:**
- Better risk assessment leads to more sustainable lending practices
- Reduced loan defaults improve profitability and stability
- Data-driven decisions can expand access to credit for qualified borrowers

**For Borrowers:**
- Transparent risk factors help borrowers understand loan decisions
- Improved risk models can lead to better interest rates for low-risk borrowers
- Educational value: understanding what affects creditworthiness

**For the Economy:**
- More accurate risk assessment supports healthier credit markets
- Reduced defaults benefit overall financial system stability

## Potential Negative Impacts:

**Algorithmic Bias:**
- Models might inadvertently discriminate against certain demographic groups
- Historical lending biases could be perpetuated in training data
- Credit scores themselves may reflect systemic inequalities

**Financial Exclusion:**
- Strict risk models might deny credit to worthy but non-traditional borrowers
- Over-reliance on algorithms could reduce human judgment in edge cases
- May reinforce existing barriers to credit access

**Privacy Concerns:**
- Extensive data collection for risk modeling raises privacy issues
- Potential for data misuse or unauthorized access to sensitive financial information

## Ethical Considerations:

**Fairness:** Risk models should be regularly audited for bias and fairness across different demographic groups. Alternative data sources should be considered to avoid perpetuating historical discrimination.

**Transparency:** Borrowers deserve to understand how lending decisions are made. Model explainability should be balanced with competitive advantages.

**Responsibility:** Financial institutions have a responsibility to use these tools to expand responsible lending, not just maximize profits at the expense of borrowers.

# 📚 References

## Data Source
- **LendingClub Historical Loan Data**  
  Website: https://www.lendingclub.com/info/download-data.action  
  Description: Historical loan performance data from LendingClub marketplace

## Technical Resources
- **Scikit-learn Documentation**  
  https://scikit-learn.org/stable/  
  Used for machine learning algorithms and evaluation metrics

- **Pandas Documentation**  
  https://pandas.pydata.org/docs/  
  Data manipulation and analysis

- **Matplotlib & Seaborn Documentation**  
  https://matplotlib.org/ and https://seaborn.pydata.org/  
  Data visualization libraries

## Academic References
- **Credit Risk Modeling**  
  Naeem Siddiqi. "Credit Risk Scorecards: Developing and Implementing Intelligent Credit Scoring" (2006)

- **Machine Learning for Finance**  
  Stefan Jansen. "Machine Learning for Algorithmic Trading" (2020)

## Code and Methodology
- Classification algorithms implemented using standard scikit-learn methodologies
- Statistical analysis following best practices for imbalanced datasets
- Evaluation metrics selected based on business context (precision vs recall trade-offs in financial risk)

---

**Note:** This analysis is for educational purposes only and should not be used as the sole basis for actual lending decisions. Professional risk assessment requires more comprehensive analysis, regulatory compliance, and domain expertise.