# A Loan Approval Prediction Model: Senior Data Analyst Assessment for Concept Group.
---
By Abdulbasit Bello



## Understanding the Problem
We need to predict the outcome of a loan application (approval or denial) from the provided historical data.
This is a binary classification problem:
1. The Input: Features such as credit score, employment length, annual income, and loan amount.
2. The Output: A loan approval decision, approved (1) or denied (0).

## Planning the Solution
Here's a breakdown of the steps I'll take to build this loan approval prediction model:

1. **Import Libraries**: Import all necessary Python libraries (pandas, numpy, scikit-learn, matplotlib, seaborn).

2. **Load Data**: Load the loan_application_data.csv file into a pandas DataFrame.

3. **Exploratory Data Analysis (EDA)**: Perform EDA to understand the data, including data types, summary statistics, distributions of key variables, missing values, and correlations between features. Visualizations are crucial here (histograms, box plots, scatter plots, etc.).

4. **Data Cleaning**: Handle missing values (imputation or removal) and outliers (clipping, transformation, or removal). These choices would be justified in the code if missing data/outliers are found.

5. **Feature Engineering**: Create new features (Debt-to-Income Ratio, Loan-to-Income Ratio, Employment Length, Loan-to-Debt Ratio (LDR)) from existing columns. Handle potential errors (e.g., division by zero, invalid dates).

6. **Data Preprocessing**: Prepare the data for modeling. This includes:

    1. Encoding categorical features as numbers the model can use (one-hot encoding, label encoding).
    2. Log-transforming skewed numerical features to compress the gap between small and large values.
    3. Scaling numerical features to a comparable range (standardization, min-max scaling).
    4. Splitting the data into training and testing sets.

7. **Model Selection and Training**: Choose an appropriate machine learning algorithm (logistic regression, decision tree, random forest, etc.). Train the model on the training data. Use cross-validation for model evaluation.

8. **Model Evaluation**: Evaluate the performance of the model on the testing set using relevant metrics (accuracy, precision, recall, F1-score, AUC-ROC). The choice of metrics is justified in the code. Present results clearly and concisely using classification reports and confusion matrices.

9. **Model Selection**: Choose the best-performing model based on the evaluation metrics.

10. **Model Interpretation**: Interpret the chosen model. For example, analyze feature importances (for tree-based models) or coefficients (for logistic regression) to understand which features are most influential in predicting loan approval.

11. **Report Generation**: Create a well-documented report summarizing my findings, including EDA results, data cleaning and feature engineering steps, model selection and evaluation, and model interpretation, with relevant visualizations.
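Steps 6 to 8 of the plan can be sketched with scikit-learn's `Pipeline` and `ColumnTransformer`, which keep scaling and encoding inside cross-validation. The snippet below is a minimal sketch on synthetic data, not the assessment dataset: the column names mirror those used later in this notebook, but the values, the `make_toy_loans` helper, and the approval rule are all made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler


def make_toy_loans(n=400, seed=42):
    """Hypothetical helper: a synthetic stand-in for the loan data."""
    rng = np.random.default_rng(seed)
    toy = pd.DataFrame({
        "CreditScore": rng.integers(300, 851, n),
        "AnnualIncome": rng.uniform(20_000, 150_000, n),
        "LoanAmount": rng.uniform(1_000, 50_000, n),
        "EmploymentStatus": rng.choice(["Employed", "Self-employed", "Unemployed"], n),
    })
    # Made-up approval rule so the toy target carries some signal
    toy["LoanOutcome"] = (toy["CreditScore"] + rng.normal(0, 60, n) > 600).astype(int)
    return toy


toy = make_toy_loans()
numeric_cols = ["CreditScore", "AnnualIncome", "LoanAmount"]
categorical_cols = ["EmploymentStatus"]

# Scale numeric columns and one-hot encode categorical ones in a single step
preprocess = ColumnTransformer([
    ("num", StandardScaler(), numeric_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
])
model = Pipeline([
    ("preprocess", preprocess),
    ("clf", LogisticRegression(max_iter=1000)),
])

X_toy, y_toy = toy[numeric_cols + categorical_cols], toy["LoanOutcome"]
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=42
)

# Cross-validate on the training split only; preprocessing is re-fit inside each fold
cv_auc = cross_val_score(model, X_tr, y_tr, cv=5, scoring="roc_auc")
model.fit(X_tr, y_tr)
test_accuracy = model.score(X_te, y_te)
print(f"CV AUC: {cv_auc.mean():.3f}, test accuracy: {test_accuracy:.3f}")
```

Wrapping the preprocessing in the pipeline means the scaler never sees validation or test rows, which is one way to avoid the leakage concern raised in the preprocessing step.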

## Step 1: Importing Libraries

In [113]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score

print("Libraries successfully imported")



## Step 2: Loading the data

In [114]:
try:
    df = pd.read_csv('loan_application_data.csv')
    print("Data loaded successfully!\n")
except FileNotFoundError:
    print("Named file is missing from the directory, kindly double-check the file name and location\n")
except Exception as e:
    print(f"Something's wrong, {e}\n")



## Step 3: Exploratory Data Analysis Part 1 - Raw Data

In [115]:
# Display first 5 rows of the data
print("Snippet of the data\n")
display(df.head())

# Display information about the columns of the data and its shape
print("\nData Shape:\n",df.shape)
print("\nDataset Info")
print(df.info())

# Display summary statistics about the data
print("\nDataset Description\n",df.describe())








### Exploratory Data Analysis Part 1 - Raw Data

1. **Dataset Overview**:
    - The dataset contains **1000 rows** and **13 columns**.
    - Columns include numerical, categorical, and datetime data types.

2. **Key Observations**:
    - No missing values are present in the dataset.
    - Numerical features like `CreditScore`, `AnnualIncome`, and `LoanAmount` have varying ranges.
    - Categorical features include `EmploymentStatus` and `LoanOutcome`.

---

### Checking for missing data and inconsistencies in the dataset

### Further Exploration and Visualization

In [116]:
# Grouping the features for easier analysis
numerical_features = df.select_dtypes(include=['number'])
categorical_features = df.select_dtypes(include=['object'])

# Create a new figure for LoanOutcome Distribution and EmploymentStatus vs LoanOutcome 
plt.figure(figsize=(16, 10))

# LoanOutcome Distribution
plt.subplot(1, 2, 1)
sns.countplot(x='LoanOutcome', data=df)
plt.title('Loan Outcome Distribution')
plt.xlabel('Loan Outcome')
plt.ylabel('Count')

# EmploymentStatus vs LoanOutcome
plt.subplot(1, 2, 2)
sns.countplot(x='EmploymentStatus', hue='LoanOutcome', data=df, palette='Set1')
plt.title('Employment Status vs Loan Outcome')
plt.xlabel('Employment Status')
plt.ylabel('Count')

plt.tight_layout()
plt.show()




In [117]:
# Distributions of numerical features
plt.figure(figsize=(18, 10))

for i, col in enumerate(numerical_features):
    plt.subplot(2, 2, i + 1)
    sns.histplot(df[col], bins=20, kde=True, color='lightcoral')
    plt.title(f"Distribution of {col}")
    plt.xlabel(col)
    plt.ylabel('Frequency')
plt.tight_layout()
plt.show()



### Outlier Detection using Boxplots

In [118]:
# Outlier detection using boxplots
plt.figure(figsize=(20, 20))

# Generate the boxplots for each numerical feature in the dataset
for i, col in enumerate(numerical_features):
    plt.subplot(2, 2, i + 1)
    sns.boxplot(x=df['LoanOutcome'], y=df[col])
    plt.title(f"{col} by Loan Outcome")
plt.tight_layout()
plt.show()
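The boxplots flag points that lie more than 1.5 times the interquartile range beyond the quartiles (seaborn's default whisker length). A small sketch for counting those points per column is shown below on a made-up frame; `iqr_outlier_counts` is a hypothetical helper, not part of the notebook.

```python
import pandas as pd


def iqr_outlier_counts(frame, k=1.5):
    """Count values outside [Q1 - k*IQR, Q3 + k*IQR] for each numeric column."""
    numeric = frame.select_dtypes(include="number")
    q1 = numeric.quantile(0.25)
    q3 = numeric.quantile(0.75)
    iqr = q3 - q1
    outside = (numeric < q1 - k * iqr) | (numeric > q3 + k * iqr)
    return outside.sum()


# Made-up values for illustration: one extreme income, no extreme credit scores
demo = pd.DataFrame({
    "AnnualIncome": [40_000, 42_000, 45_000, 47_000, 50_000, 400_000],
    "CreditScore": [610, 640, 660, 690, 700, 720],
})
outlier_counts = iqr_outlier_counts(demo)
print(outlier_counts)
```

Counts like these make it easier to decide between clipping, transforming, or removing outliers than reading the plots by eye.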



### Correlation analysis using a heatmap

In [119]:
# Checking for Correlation between numerical columns

correlation_matrix = numerical_features.corr()
print("Correlation Matrix \n",correlation_matrix)

# Visualize correlation matrix
plt.figure(figsize=(8, 6))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt=".2f")
plt.title('Correlation Matrix')
plt.show()





## Step 4: Data Cleaning

In [120]:
# Analyze missing data patterns
missing_data_summary = df.isnull().sum().to_frame(name='Missing Count')
missing_data_summary['Missing Percentage'] = (missing_data_summary['Missing Count'] / len(df)) * 100
display(missing_data_summary)

# Compare loan outcomes for rows with and without any missing value
# (a separate table, because it is indexed by row status rather than by column name)
rows_with_missing = df.isnull().any(axis=1).rename('HasMissing')
display(pd.crosstab(rows_with_missing, df['LoanOutcome']))

# Visualize where missing values fall in the dataset (blank if there are none)

plt.figure(figsize=(10, 6))
sns.heatmap(df.isnull(), cbar=False, cmap='coolwarm')
plt.title('Missing Data Heatmap')
plt.show()





Breaking missingness down by loan outcome helps identify potential bias in which rows have missing values. The heatmap shows where missing values fall, which would reveal any patterns.

---
The plot is empty because there are no missing values in the dataset.
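Since nothing is missing, no imputation is needed here. Had there been gaps, the plan's data cleaning step lists imputation as one option; a minimal sketch with `SimpleImputer` (imported in Step 1) on made-up values, using the median as an assumed strategy:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Made-up frame with two gaps, for illustration only
demo = pd.DataFrame({
    "CreditScore": [600.0, np.nan, 720.0, 680.0],
    "AnnualIncome": [50_000.0, 62_000.0, np.nan, 58_000.0],
})

# Replace each missing value with the median of its column
imputer = SimpleImputer(strategy="median")
imputed = pd.DataFrame(imputer.fit_transform(demo), columns=demo.columns)
print(imputed)
```

In a modeling workflow the imputer would normally be fit on the training split only and then applied to the test split.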

In [121]:
# Unique values in the dataset
print("\nUnique values in the dataset/column:\n",df.nunique())

#Check for inconsistencies (example: unexpected values in categorical columns)
print("\nUnique values in EmploymentStatus:\n", df['EmploymentStatus'].unique())
print("\nUnique values in LoanOutcome:\n", df['LoanOutcome'].unique())

#Handle inconsistencies (example: standardize capitalization)
df['EmploymentStatus'] = df['EmploymentStatus'].str.capitalize() 
df['LoanOutcome'] = df['LoanOutcome'].str.capitalize()

# Handle date columns in the dataset
df['EmploymentStartDate'] = pd.to_datetime(df['EmploymentStartDate'], errors='coerce')
df['ApplicationDate'] = pd.to_datetime(df['ApplicationDate'], errors='coerce')
print("\nDates successfully updated!\n")
display(df.head())
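With `errors='coerce'`, any value that cannot be parsed becomes `NaT` instead of raising an error, so it is worth counting how many dates were lost. A short sketch on made-up date strings:

```python
import pandas as pd

# Made-up date strings; the second one is not a valid calendar date
raw_dates = pd.Series(["2019-03-15", "2021-02-30", "2020-11-01"])
parsed = pd.to_datetime(raw_dates, errors="coerce")

# Invalid entries are converted to NaT rather than raising an exception
n_invalid = parsed.isna().sum()
print(parsed)
print(f"Unparseable dates: {n_invalid}")
```

A non-zero count would be a reason to inspect the raw strings before deriving features from these columns.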








## Step 5: Feature Engineering

In [122]:
# Debt-to-Income Ratio (DTI): outstanding debt divided by monthly income
df['DTI'] = df['OutstandingDebt'] / (df['AnnualIncome'] / 12)

# Loan-to-Income Ratio
df['LoanToIncome'] = df['LoanAmount'] / df['AnnualIncome']

# Credit score bins
df['CreditScoreBins'] = pd.cut(df['CreditScore'], bins=[0, 580, 670, 740, 800, 850],labels=['Poor', 'Fair', 'Good', 'Very Good', 'Excellent'])

# Employment Length Calculation
df['EmploymentLength'] = ((df['ApplicationDate'] - df['EmploymentStartDate']).dt.days / 365.25).fillna(0)

# Loan-to-Debt Ratio (LDR): loan amount relative to outstanding debt
df['LDR'] = df['LoanAmount'] / df['OutstandingDebt']

# Display the updated DataFrame
display(df.head())
print(df.describe())
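The ratio features divide by `AnnualIncome` and `OutstandingDebt`, so a zero in either column would produce `inf`. One way to guard against that is to turn zero denominators into `NaN` before dividing. The sketch below uses made-up rows, and `safe_ratio` is a hypothetical helper:

```python
import numpy as np
import pandas as pd


def safe_ratio(numerator, denominator):
    """Element-wise division that returns NaN where the denominator is zero."""
    return numerator / denominator.replace(0, np.nan)


# Made-up rows: one applicant with zero income, one with zero outstanding debt
demo = pd.DataFrame({
    "LoanAmount": [10_000, 5_000, 8_000],
    "AnnualIncome": [60_000, 0, 40_000],
    "OutstandingDebt": [3_000, 1_000, 0],
})
demo["DTI"] = safe_ratio(demo["OutstandingDebt"], demo["AnnualIncome"] / 12)
demo["LoanToIncome"] = safe_ratio(demo["LoanAmount"], demo["AnnualIncome"])
demo["LDR"] = safe_ratio(demo["LoanAmount"], demo["OutstandingDebt"])
print(demo)
```

The resulting `NaN` values can then be imputed or dropped explicitly, rather than letting `inf` reach the scaler or the model.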






In [123]:
# Additional Feature Engineering
df['LoanAmount_log'] = np.log1p(df['LoanAmount']) #Log transformation of LoanAmount
df['AnnualIncome_log'] = np.log1p(df['AnnualIncome']) #Log transformation of AnnualIncome
df['CreditScore_scaled'] = (df['CreditScore'] - df['CreditScore'].min()) / (df['CreditScore'].max() - df['CreditScore'].min()) #Scale CreditScore
display(df)
print(df.info())
print(df.describe())





## Step 5.5: Exploratory Data Analysis Part 2 - Enriched Data

In [124]:
# Visualizing the new features
plt.figure(figsize=(20, 15))

numerical_features2 = ['DTI', 'LoanToIncome', 'EmploymentLength', 'LDR', 'AnnualIncome_log', 'LoanAmount_log', 'CreditScore_scaled']

# Plotting the distributions of the new numerical features
for i, col in enumerate(numerical_features2):
    plt.subplot(5, 2, i + 1)
    sns.histplot(df[col], bins=20, kde=True, color='lightcoral')
    plt.title(f"Distribution of {col}")
    plt.xlabel(col)
    plt.ylabel('Frequency')

# Plotting the categorical feature
plt.subplot(5, 2, i + 2)
sns.countplot(x='CreditScoreBins', data=df, palette='Set1')
plt.title('Credit Score Bins Distribution')
plt.xlabel('Credit Score Bins')
plt.ylabel('Count')

plt.tight_layout()
plt.show()





In [125]:
# Checking for outliers in the new features using boxplots
plt.figure(figsize=(20, 20))

# Generate the boxplots for each numerical feature in the dataset
for i, col in enumerate(numerical_features2):
    plt.subplot(4, 2, i + 1)
    sns.boxplot(x=df['LoanOutcome'], y=df[col])
    plt.title(f"{col} by Loan Outcome")
plt.tight_layout()
plt.show()



In [126]:
# Checking for correlation between all numerical columns, including the new features
numerical_features = df.select_dtypes(include=['number'])

# Drop existing columns that have been log transformed or scaled
numerical_features = numerical_features.drop(columns=['LoanAmount', 'AnnualIncome', 'CreditScore'])
 
# Calculate the correlation matrix
correlation_matrix = numerical_features.corr()
print("Correlation Matrix")
display(correlation_matrix)

# Visualize correlation matrix
plt.figure(figsize=(8, 6))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt=".2f")
plt.title('Correlation Matrix')
plt.show()







In [127]:
# Let's write our analysis based on the correlation matrix

# Strong Negative Correlations
print("Strong Negative Correlations:")
print("- Annual Income (log) and DTI:", "r = -0.605")
print("- Annual Income (log) and LoanToIncome:", "r = -0.591")

# Moderate to Strong Positive Correlations
print("\nModerate to Strong Positive Correlations:")
print("- Outstanding Debt and DTI:", "r = 0.692")
print("- Loan Amount (log) and LoanToIncome:", "r = 0.630")

# Key Insights
print("\nKey Insights:")
print("1. Higher annual income is associated with lower debt-to-income ratios")
print("2. Higher annual income corresponds to lower loan-to-income ratios")
print("3. Outstanding debt strongly influences debt-to-income ratio")
print("4. Loan amount has a strong positive relationship with loan-to-income ratio")
print("5. Credit score shows very weak correlations with other features (all |r| < 0.05)")
print("6. Employment length shows minimal correlation with other variables")
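Rather than typing the coefficients by hand, the strongest pairs can be pulled straight from a correlation matrix. A rough sketch on synthetic columns follows; the `strong_pairs` helper and the 0.5 threshold are assumptions for illustration.

```python
import numpy as np
import pandas as pd


def strong_pairs(corr, threshold=0.5):
    """Return feature pairs whose absolute correlation is at least `threshold`."""
    # Keep the upper triangle only, so each pair appears once and the diagonal is dropped
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack()
    pairs = pairs[pairs.abs() >= threshold]
    return pairs.sort_values(key=np.abs, ascending=False)


rng = np.random.default_rng(0)
income = rng.uniform(20_000, 150_000, 500)
demo = pd.DataFrame({
    "AnnualIncome": income,
    "LoanToIncome": 10_000 / income,             # falls as income rises
    "CreditScore": rng.integers(300, 851, 500),  # independent of the other columns
})
result = strong_pairs(demo.corr())
print(result)
```

Applied to the notebook's own `correlation_matrix`, this would keep the written analysis in sync with the data if the features change.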



## Step 6: Data Preprocessing 

In [None]:
# 1. One-hot encode the categorical features
df = pd.get_dummies(df, columns=['EmploymentStatus', 'CreditScoreBins'], drop_first=True)

# 2. Encode the target as binary: 1 for Approved, 0 for Denied
df['LoanOutcome'] = np.where(df['LoanOutcome'] == 'Approved', 1, 0)
display(df.head())

# 3. Select the model features, including the one-hot encoded columns
numerical_features = ['CreditScore', 'AnnualIncome', 'LoanAmount',
                      'OutstandingDebt', 'EmploymentLength', 'DTI', 'LoanToIncome']
encoded_features = [col for col in df.columns
                    if col.startswith(('EmploymentStatus_', 'CreditScoreBins_'))]
features = numerical_features + encoded_features

# 4. Split before scaling to avoid data leakage
X = df[features].astype(float)  # floats, so the scaled values can be written back in place
y = df['LoanOutcome']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train, X_test = X_train.copy(), X_test.copy()  # explicit copies avoid SettingWithCopyWarning

# 5. Scale only the numerical features (not the one-hot encoded ones),
#    fitting the scaler on the training split alone
scaler = StandardScaler()
X_train[numerical_features] = scaler.fit_transform(X_train[numerical_features])
X_test[numerical_features] = scaler.transform(X_test[numerical_features])



In [112]:
# Alternative preprocessing: label encoding instead of one-hot encoding.
# Run this instead of (not after) the cell above, because it needs the original
# string-valued EmploymentStatus and LoanOutcome columns.
le = LabelEncoder()
df['EmploymentStatus'] = le.fit_transform(df['EmploymentStatus'])
df['LoanOutcome'] = (df['LoanOutcome'] == 'Approved').astype(int)  # 1 for Approved, 0 for Denied

# Select features and target
features = ['CreditScore', 'AnnualIncome', 'LoanAmount', 'OutstandingDebt',
            'EmploymentStatus', 'EmploymentLength', 'DTI', 'LoanToIncome']
X = df[features]
y = df['LoanOutcome']

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Scale numerical features (fit on the training split only)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)



The loan-to-debt ratio (LDR) is left out of the feature list. Possible reasons for excluding it include:

1. **Low Correlation with Target Variable**: If correlation analysis shows that LDR has little to no relationship with the target variable (`LoanOutcome`), it adds little predictive power.

2. **Multicollinearity**: LDR is derived from `LoanAmount` and `OutstandingDebt`, so it may be highly correlated with them. Highly correlated features can make coefficients unstable and harder to interpret.

3. **Outliers**: LDR may contain extreme outliers (as observed in the earlier boxplots), since small values of `OutstandingDebt` inflate the ratio. Left untreated, these can skew the model's predictions.

4. **Feature Importance**: Feature selection techniques such as recursive feature elimination (RFE) or tree-based feature importance may indicate that LDR is not a significant predictor.

5. **Domain Knowledge**: LDR may be less critical to loan approval decisions than features like `CreditScore`, `DTI`, or `LoanToIncome`.

LDR can still be added to the feature list and its effect on the evaluation metrics checked.
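Whether LDR earns its place can be tested directly by comparing cross-validated scores with and without it. The sketch below uses synthetic data and a made-up approval rule, so its numbers say nothing about the real dataset; only the comparison pattern carries over. The `mean_cv_auc` helper is hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
n = 500
toy = pd.DataFrame({
    "CreditScore": rng.integers(300, 851, n),
    "LoanAmount": rng.uniform(1_000, 50_000, n),
    "OutstandingDebt": rng.uniform(500, 30_000, n),
})
toy["LDR"] = toy["LoanAmount"] / toy["OutstandingDebt"]
# Made-up rule: approval depends on credit score only
toy["LoanOutcome"] = (toy["CreditScore"] + rng.normal(0, 60, n) > 600).astype(int)

base_features = ["CreditScore", "LoanAmount", "OutstandingDebt"]


def mean_cv_auc(feature_list):
    """Mean 5-fold ROC AUC of a scaled logistic regression on the given columns."""
    model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
    scores = cross_val_score(
        model, toy[feature_list], toy["LoanOutcome"], cv=5, scoring="roc_auc"
    )
    return scores.mean()


auc_without = mean_cv_auc(base_features)
auc_with = mean_cv_auc(base_features + ["LDR"])
print(f"AUC without LDR: {auc_without:.3f}, with LDR: {auc_with:.3f}")
```

If the two scores are close, leaving LDR out keeps the model simpler at little cost; a clear gain would argue for including it.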

## Step 7: Model Selection 