
Exercise 1 : Defining The Problem And Data Collection For Loan Default Prediction



 Predicting loan defaults is a critical task for financial institutions to mitigate risks associated with lending and to make informed decisions regarding loan approvals. By leveraging historical data on loan applicants, including their personal details, financial information, and repayment history, the aim is to create a predictive tool that enhances the institution's ability to assess the creditworthiness of applicants and minimize potential losses due to defaults.

In [15]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score, roc_curve
import numpy as np


In [2]:
import random
import csv

# Generate FICO scores for each applicant
def generate_fico_score():
    return random.randint(300, 850)

# Generate random loan amount ($10,000 to $100,000)
def generate_loan_amount():
    return '$' + str(random.randint(10000, 100000))

# Generate random interest rate (3% to 10%)
def generate_interest_rate():
    return str(round(random.uniform(3, 10), 2)) + '%'

# Generate random loan term (1 to 5 years)
def generate_loan_term():
    return str(random.randint(1, 5)) + ' years'

# Generate random purpose of the loan
def generate_loan_purpose():
    purposes = ['Home improvement', 'Debt consolidation', 'Education', 'Medical expenses', 'Car purchase']
    return random.choice(purposes)

# Generate random repayment history
def generate_repayment_history():
    history = ['Excellent', 'Good', 'Fair']
    return random.choice(history)

# Example data
data = [
    ["Name", "Age", "Gender", "Marital Status", "Employment Status", "Income", "Education Level",
     "Credit Score", "Loan Amount", "Interest Rate", "Loan Term", "Purpose of the Loan",
     "Repayment History", "Debt-to-Income Ratio", "Default"],
    ["John", 35, "Male", "Married", "Employed", "$60,000", "Bachelor's", generate_fico_score(), generate_loan_amount(),
     generate_interest_rate(), generate_loan_term(), generate_loan_purpose(), generate_repayment_history(), 0.4, "No"],
    ["Sarah", 28, "Female", "Single", "Employed", "$45,000", "Master's", generate_fico_score(), generate_loan_amount(),
     generate_interest_rate(), generate_loan_term(), generate_loan_purpose(), generate_repayment_history(), 0.3, "No"],
    ["Michael", 45, "Male", "Married", "Self-employed", "$80,000", "High School", generate_fico_score(), generate_loan_amount(),
     generate_interest_rate(), generate_loan_term(), generate_loan_purpose(), generate_repayment_history(), 0.5, "Yes"],
    ["Emily", 30, "Female", "Single", "Employed", "$55,000", "Bachelor's", generate_fico_score(), generate_loan_amount(),
     generate_interest_rate(), generate_loan_term(), generate_loan_purpose(), generate_repayment_history(), 0.35, "No"],
    ["David", 40, "Male", "Married", "Employed", "$70,000", "Doctorate", generate_fico_score(), generate_loan_amount(),
     generate_interest_rate(), generate_loan_term(), generate_loan_purpose(), generate_repayment_history(), 0.6, "Yes"]
]

# Save data to CSV file
with open('loan_data.csv', 'w', newline='') as file:
    writer = csv.writer(file)
    writer.writerows(data)

print("Data saved to loan_data.csv")


Data saved to loan_data.csv


In [9]:
# Load the data
data = pd.read_csv('loan_data.csv')

# Preprocess numerical columns: strip '$' and ',' before converting to float
numerical_columns = ['Income', 'Loan Amount']
for column in numerical_columns:
    data[column] = data[column].replace({r'\$': '', ',': ''}, regex=True).astype(float)

# Preprocess 'Interest Rate' column: drop the trailing '%'
data['Interest Rate'] = data['Interest Rate'].str.rstrip('%').astype(float)

# Preprocess 'Loan Term' column: keep the number of years as an integer
data['Loan Term'] = data['Loan Term'].str.extract(r'(\d+)', expand=False).astype(int)

# Convert categorical variables into numerical ones using one-hot encoding
data = pd.get_dummies(data, columns=['Gender', 'Marital Status', 'Employment Status', 'Education Level', 'Purpose of the Loan', 'Repayment History'])

# Define features (X) and target variable (y)
X = data.drop(['Name', 'Default'], axis=1)
y = data['Default']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize the decision tree classifier
clf = DecisionTreeClassifier(random_state=42)

# Train the classifier
clf.fit(X_train, y_train)

# Make predictions on the testing set
predictions = clf.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, predictions)
print("Accuracy:", accuracy)

Accuracy: 0.0


With only five rows and test_size=0.2, the test set contains a single sample, so accuracy can only be 0.0 or 1.0. An accuracy of 0.0 here means that this one prediction was wrong.

In [17]:
# Convert string labels to binary numerical values
y_test_binary = np.where(y_test == 'Yes', 1, 0)
predictions_binary = np.where(predictions == 'Yes', 1, 0)

# Calculate precision
precision = precision_score(y_test_binary, predictions_binary)

# Calculate recall
recall = recall_score(y_test_binary, predictions_binary)

# Calculate F1-score
f1 = f1_score(y_test_binary, predictions_binary)

# Calculate ROC AUC score if both classes are present
if len(np.unique(y_test_binary)) > 1:
    roc_auc = roc_auc_score(y_test_binary, predictions_binary)
else:
    roc_auc = None

print("Precision:", precision)
print("Recall:", recall)
print("F1-score:", f1)
print("ROC AUC score:", roc_auc)

Precision: 0.0
Recall: 0.0
F1-score: 0.0
ROC AUC score: None


  _warn_prf(average, modifier, msg_start, len(result))


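Because the test set contains a single sample, an accuracy of 0.0 means that this one prediction was wrong, so there are no true positive predictions. As a consequence, precision, recall, and F1-score are all 0.0 (at least one of them is undefined here and is reported as 0.0, which is what triggers the warning above), and the ROC AUC score is not computed because the test set contains only one class.

This outcome says little about how well a decision tree can predict default cases. The thing is, I created the data myself: there are only five rows, and several of the feature values are random, so they carry no real signal about default.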
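To make the metrics meaningful, the test set needs enough rows and both classes. The following is a minimal sketch, assuming a hypothetical synthetic dataset from `make_classification` as a stand-in for real loan records (the variable names are illustrative). It uses a stratified split so that both classes appear in the test set, and it passes predicted probabilities rather than hard labels to `roc_auc_score`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Hypothetical stand-in for loan data: 500 rows, roughly 20% positives ("default").
X_demo, y_demo = make_classification(
    n_samples=500, n_features=8, n_informative=4,
    weights=[0.8, 0.2], random_state=42,
)

# stratify=y_demo keeps the class ratio the same in the train and test sets.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

demo_clf = DecisionTreeClassifier(max_depth=4, random_state=42).fit(X_tr, y_tr)

# ROC AUC ranks samples by score, so pass the predicted probability of class 1.
default_probability = demo_clf.predict_proba(X_te)[:, 1]
demo_auc = roc_auc_score(y_te, default_probability)

print("Classes in test set:", np.unique(y_te))
print("ROC AUC:", round(demo_auc, 3))
```

With 100 test rows and both classes present, none of the metrics degenerate the way they do on a single-row test set.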

Sources for Data Collection:

Financial Institution's Internal Records:
Loan application forms
Customer databases
Loan transaction records
Repayment history logs

Credit Bureaus:
Experian, Equifax, TransUnion, etc.
Credit reports and scores

Government Databases:
Tax records (for income verification)
Social security databases (for identity verification)

Online Surveys or Questionnaires:
Gather additional information directly from applicants during the loan application process

Publicly Available Data:
Economic indicators (e.g., unemployment rates, inflation rates)
Demographic data (e.g., census data)

Comprehensive Plan for Data Collection:

Data Gathering: Obtain access to the financial institution's internal records, credit bureau data, and any other relevant sources mentioned above.

Data Cleaning and Preprocessing: Clean the collected data by handling missing values, outliers, and inconsistencies. Preprocess the data to ensure uniformity in formats and scales.

Feature Engineering: Create new features or transform existing ones to capture relevant information and improve model performance. This may include calculating ratios (e.g., debt-to-income ratio) and deriving new variables from existing ones.

Exploratory Data Analysis (EDA): Conduct exploratory data analysis to understand the distribution of variables, identify correlations, and uncover insights that could inform feature selection and model building.

Model Development: Develop machine learning models using appropriate algorithms such as logistic regression, decision trees, random forests, etc. Train the models on historical data, validate their performance using appropriate evaluation metrics, and iterate on model selection and tuning as necessary.

Model Deployment and Monitoring: Deploy the trained model into a production environment, integrating it with the financial institution's loan application system. Implement monitoring mechanisms to track model performance over time, detect concept drift or data drift, and retrain the model as necessary to maintain its effectiveness.
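The cleaning and feature-engineering steps of the plan above could look roughly like the sketch below. The records, the column names (`income`, `monthly_debt`, `loan_amount`), and the helper `to_number` are hypothetical, and median imputation is just one simple option among several.

```python
import pandas as pd

# Hypothetical raw records; column names and values are illustrative only.
raw = pd.DataFrame({
    "income": ["$60,000", "$45,000", None, "$80,000"],
    "monthly_debt": [2000.0, 1200.0, 900.0, None],
    "loan_amount": ["$20,000", "$15,000", "$30,000", "$50,000"],
})

def to_number(series):
    """Strip '$' and ',' from currency strings and convert to float."""
    return pd.to_numeric(
        series.str.replace(r"[$,]", "", regex=True), errors="coerce"
    )

clean = raw.copy()
clean["income"] = to_number(clean["income"])
clean["loan_amount"] = to_number(clean["loan_amount"])

# Impute missing numeric values with the column median (one simple option).
for column in ["income", "monthly_debt"]:
    clean[column] = clean[column].fillna(clean[column].median())

# Engineered feature: debt-to-income ratio from annual debt payments and income.
clean["debt_to_income"] = (clean["monthly_debt"] * 12) / clean["income"]

print(clean)
```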

Exercise 2 : Feature Selection And Model Choice For Loan Default Prediction

From "Explainable prediction of loan default based on machine learning models" by Xu Zhu, Qingyong Chu, Xinchang Song, Ping Hu, and Lu Peng:

"To make the prediction model rules more understandable and thereby increase the user’s faith in the model, an explanatory model must be used. Logistic regression, decision tree, XGBoost, and LightGBM models are employed to predict a loan default. The prediction results show that LightGBM and XGBoost outperform logistic regression and decision tree models in terms of the predictive ability. The area under curve for LightGBM is 0.7213. The accuracies of LightGBM and XGBoost exceed 0.8. The precisions of LightGBM and XGBoost exceed 0.55. Simultaneously, we employed the local interpretable model-agnostic explanations approach to undertake an explainable analysis of the prediction findings. The results show that factors such as **the loan term, loan grade, credit rating, and loan amount** affect the predicted outcomes."

Exercise 3 : Training, Evaluating, And Optimizing The Model

To train, evaluate, and optimize the model for predicting loan defaults, we can follow these steps:

Data Preprocessing:

Handle missing values: Impute or remove missing values in the dataset.
Encode categorical variables: Convert categorical variables into numerical format using techniques like one-hot encoding.
Scale numerical features: Scale numerical features to ensure they have similar ranges, which can improve the performance of some machine learning algorithms.

Feature Selection:

Select relevant features based on domain knowledge and feature importance techniques. Focus on features that are likely to have the strongest impact on loan default prediction.

Split Data:

Split the dataset into training and testing sets to evaluate model performance. Typically, 70-80% of the data is used for training, and the remaining portion is used for testing.

Choose Evaluation Metrics:

Select appropriate evaluation metrics to assess the performance of the model. For loan default prediction, the following metrics are relevant:
Accuracy: Measures the overall correctness of the predictions.
Precision: Indicates the proportion of true positive predictions among all positive predictions. It measures the accuracy of positive predictions.
Recall: Measures the proportion of true positive predictions among all actual positive instances. It measures the ability of the model to identify all positive instances.
F1-score: Harmonic mean of precision and recall, providing a balance between the two metrics.
ROC AUC Score: Area under the Receiver Operating Characteristic curve, which measures the ability of the model to discriminate between positive and negative cases across different thresholds.

Select a Model:

Choose a suitable machine learning algorithm for the problem. Common choices include logistic regression, decision trees, random forests, gradient boosting machines, and support vector machines.
Experiment with different algorithms to determine which one performs best for the given dataset.

Train the Model:

Train the chosen model on the training data using the selected features.

Evaluate the Model:

Use the testing set to evaluate the model's performance using the chosen evaluation metrics.
Analyze the confusion matrix to understand the distribution of true positive, true negative, false positive, and false negative predictions.
Plot ROC curves and calculate ROC AUC scores to assess the model's ability to discriminate between positive and negative instances.

Optimize the Model:

Fine-tune hyperparameters of the chosen algorithm using techniques like grid search or randomized search.
Experiment with different feature subsets and preprocessing techniques to improve model performance.
Consider addressing class imbalance if present, using techniques such as oversampling, undersampling, or using class weights.

Iterate:

Iterate on the model training, evaluation, and optimization process based on insights gained from previous iterations.
Continuously monitor model performance and make adjustments as necessary.

By following these steps, we can effectively train, evaluate, and optimize a model for predicting loan defaults, ultimately improving its accuracy and reliability in real-world applications.
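The steps above can be sketched end to end with scikit-learn. This is an illustration under stated assumptions, not a tuned solution: the data is synthetic, the rule that generates `default` is invented for the example, and the random forest with a small grid is just one possible model choice.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical synthetic loans; the noisy default rule below is an assumption.
rng = np.random.default_rng(0)
n = 400
loans = pd.DataFrame({
    "credit_score": rng.integers(300, 851, n),
    "debt_to_income": rng.uniform(0.1, 0.7, n),
    "purpose": rng.choice(["Education", "Car purchase", "Medical expenses"], n),
})
risk = 0.004 * (850 - loans["credit_score"]) + 2.0 * loans["debt_to_income"]
loans["default"] = (risk + rng.normal(0, 0.4, n) > 2.3).astype(int)

# Preprocessing: scale numeric columns, one-hot encode the categorical one.
preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["credit_score", "debt_to_income"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["purpose"]),
])
model = Pipeline([
    ("preprocess", preprocess),
    # class_weight="balanced" is one way to address class imbalance.
    ("clf", RandomForestClassifier(class_weight="balanced", random_state=0)),
])

# Split: 75% train, 25% test, stratified on the target.
loan_X = loans.drop(columns="default")
loan_y = loans["default"]
X_train, X_test, y_train, y_test = train_test_split(
    loan_X, loan_y, test_size=0.25, stratify=loan_y, random_state=0
)

# Optimize: grid search with 5-fold cross-validation on the training set only.
search = GridSearchCV(
    model,
    param_grid={"clf__n_estimators": [50, 100], "clf__max_depth": [3, 6]},
    scoring="roc_auc",
    cv=5,
)
search.fit(X_train, y_train)

# Evaluate on the held-out test set.
test_pred = search.predict(X_test)
test_proba = search.predict_proba(X_test)[:, 1]
cm = confusion_matrix(y_test, test_pred)
test_auc = roc_auc_score(y_test, test_proba)

print("Best parameters:", search.best_params_)
print(cm)
print(classification_report(y_test, test_pred))
print("ROC AUC:", round(test_auc, 3))
```

Keeping the preprocessing inside the pipeline means the scaler and encoder are refitted on each cross-validation fold, so no information from the validation fold leaks into training.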

Exercise 4 : Designing Machine Learning Solutions For Specific Problems

Predicting Stock Prices:

Suitable Type of Machine Learning: Supervised Learning
Explanation: Supervised learning is appropriate for predicting stock prices because historical data with corresponding target labels (stock prices) is readily available. The algorithm can learn patterns and relationships from past stock price movements and use this information to make predictions about future prices. Regression techniques within supervised learning can be used to predict numerical values (stock prices) based on input features such as past stock prices, trading volume, economic indicators, etc.
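As a rough illustration of the regression setup described above, the sketch below builds lagged-price features and fits a linear model. The price series is a hypothetical random walk, the choice of five lags is arbitrary, and the names are illustrative. Note that the split is chronological rather than shuffled, since shuffling would let the model train on data from after the days it is tested on.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

# Hypothetical price series: a random walk starting near 100.
rng = np.random.default_rng(1)
prices = 100 + np.cumsum(rng.normal(0, 1, 300))

n_lags = 5
# Row j holds prices[j], ..., prices[j + n_lags - 1]; the target is prices[j + n_lags].
X_lags = np.column_stack(
    [prices[i : len(prices) - n_lags + i] for i in range(n_lags)]
)
y_next = prices[n_lags:]

# Chronological split: train on the first 80% of days, test on the rest.
split = int(0.8 * len(y_next))
reg = LinearRegression().fit(X_lags[:split], y_next[:split])
forecast = reg.predict(X_lags[split:])

mae = mean_absolute_error(y_next[split:], forecast)
# Naive baseline: predict that tomorrow's price equals today's price.
naive_mae = mean_absolute_error(y_next[split:], X_lags[split:, -1])

print("Model MAE:", round(mae, 3))
print("Naive baseline MAE:", round(naive_mae, 3))
```

Comparing against the naive baseline is a useful sanity check: on a pure random walk there is no pattern to learn, so the model should not be expected to beat it by much.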

Organizing a Library of Books:
If the books are already assigned to genres and labeled accordingly:
Suitable Type of Machine Learning: Supervised Learning
Explanation: A classifier can learn from the labeled books and assign new books to the existing genres.

If the books come without predefined labels or categories:
Suitable Type of Machine Learning: Unsupervised Learning
Explanation: Unsupervised learning is well-suited to this case because the algorithm has to discover patterns or similarities in the data itself in order to group books into categories. Clustering algorithms, such as K-means or hierarchical clustering, can group similar books together based on features like content, author, and reader reviews, without the need for labeled data.
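For the unlabeled case, a minimal sketch of clustering books by their text could look like this. The four descriptions are made up for the example, and TF-IDF with K-means is only one reasonable combination.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical book descriptions: two mysteries, two science-fiction titles.
descriptions = [
    "A detective investigates a murder in a small town",
    "The detective solves a murder mystery on a train",
    "A spaceship crew explores a distant planet",
    "Astronauts travel by spaceship to a new planet",
]

# Turn each description into a TF-IDF vector, ignoring common English words.
vectorizer = TfidfVectorizer(stop_words="english")
features = vectorizer.fit_transform(descriptions)

# Group the books into two clusters without using any genre labels.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(features)

for label, text in zip(labels, descriptions):
    print(label, text)
```

The cluster numbers themselves are arbitrary; what matters is which books end up sharing a cluster.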

Program a Robot to Navigate and Find the Shortest Path in a Maze:

Suitable Type of Machine Learning: Reinforcement Learning
Explanation: Reinforcement learning is ideal for programming a robot to navigate and find the shortest path in a maze because the robot learns through trial and error interactions with the environment (maze). The robot receives feedback in the form of rewards or penalties based on its actions (movement in the maze). By exploring different paths and receiving feedback on their success (e.g., reaching the goal or hitting a dead end), the robot can learn to navigate the maze efficiently and find the shortest path. Reinforcement learning algorithms, such as Q-learning or Deep Q-Networks (DQN), can be used to train the robot to make optimal decisions while navigating the maze.
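A minimal tabular Q-learning sketch for a tiny maze is shown below. The maze layout, the rewards (-1 per move, +10 at the goal), and the hyperparameters are all assumptions chosen for illustration. `#` marks a wall, and bumping into a wall or the border leaves the robot in place.

```python
import random

# Hypothetical 3x4 maze: S = start, G = goal, # = wall, . = free cell.
MAZE = ["S.#.",
        ".#..",
        "...G"]
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right
START, GOAL = (0, 0), (2, 3)

def step(state, action):
    """Apply a move and return (next_state, reward, done)."""
    row, col = state[0] + action[0], state[1] + action[1]
    inside = 0 <= row < len(MAZE) and 0 <= col < len(MAZE[0])
    if not inside or MAZE[row][col] == "#":
        return state, -1.0, False  # blocked: stay in place, pay the step cost
    if (row, col) == GOAL:
        return (row, col), 10.0, True
    return (row, col), -1.0, False

def train(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    """Learn a Q-table with epsilon-greedy Q-learning."""
    rng = random.Random(seed)
    q = {(r, c): [0.0] * 4
         for r in range(len(MAZE)) for c in range(len(MAZE[0]))}
    for _ in range(episodes):
        state = START
        for _ in range(100):  # cap the episode length
            if rng.random() < epsilon:
                a = rng.randrange(4)  # explore
            else:
                a = max(range(4), key=lambda i: q[state][i])  # exploit
            nxt, reward, done = step(state, ACTIONS[a])
            target = reward if done else reward + gamma * max(q[nxt])
            q[state][a] += alpha * (target - q[state][a])
            state = nxt
            if done:
                break
    return q

def greedy_path(q, limit=50):
    """Follow the highest-valued action from START until GOAL (or the limit)."""
    path, state = [START], START
    while state != GOAL and len(path) <= limit:
        a = max(range(4), key=lambda i: q[state][i])
        state, _, _ = step(state, ACTIONS[a])
        path.append(state)
    return path

q_table = train()
path = greedy_path(q_table)
print("Moves:", len(path) - 1)
print(path)
```

The per-move penalty is what pushes the learned policy toward the shortest route: longer paths accumulate more -1 rewards before reaching the goal.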

Exercise 5 : Designing An Evaluation Strategy For Different ML Models

Let's focus on a Support Vector Machine (SVM) classifier as our classification model. SVMs are powerful classifiers that are effective in high-dimensional spaces and versatile enough to handle both linear and non-linear classification tasks through the use of different kernel functions.

Here's an outline of a strategy to evaluate the performance of an SVM classifier, including the choice of metrics and methods:

Data Preprocessing:

Handle missing values: Impute or remove missing values.
Encode categorical variables: Convert categorical variables into numerical format using techniques like one-hot encoding.
Scale numerical features: Scale numerical features to ensure they have similar ranges.

Split Data:

Split the dataset into training and testing sets. Typically, 70-80% of the data is used for training, and the remaining portion is used for testing.

Train the Model:

Train the SVM classifier on the training data. Choose an appropriate kernel function (e.g., linear, polynomial, radial basis function) based on the nature of the data and the problem.

Evaluate Performance:

Use the testing set to evaluate the model's performance.

Choice of Metrics:

Accuracy: Measures the overall correctness of the predictions.
Precision: Indicates the proportion of true positive predictions among all positive predictions. It measures the accuracy of positive predictions.
Recall: Measures the proportion of true positive predictions among all actual positive instances. It measures the ability of the model to identify all positive instances.
F1-score: Harmonic mean of precision and recall, providing a balance between the two metrics.

Cross-Validation:

Perform k-fold cross-validation to assess the model's robustness and generalization performance. This involves splitting the dataset into k subsets, training the model on k-1 subsets, and evaluating it on the remaining subset. Repeat this process k times, each time using a different subset as the test set.

ROC Curves:

Plot Receiver Operating Characteristic (ROC) curves and calculate the Area Under the ROC Curve (ROC AUC score) to assess the model's ability to discriminate between positive and negative instances at various threshold levels.

Hyperparameter Tuning:

Conduct hyperparameter tuning to optimize the performance of the SVM classifier. Parameters such as the choice of kernel, regularization parameter (C), and kernel parameters (e.g., degree for polynomial kernel, gamma for radial basis function kernel) can significantly impact the model's performance.

By following this strategy, we can effectively evaluate the performance of the SVM classifier for our classification task and make informed decisions about its deployment in real-world applications.
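The evaluation strategy above could be sketched as follows. The dataset is a hypothetical one from `make_classification`, and the parameter grid is deliberately small; both are assumptions for the sake of a short example.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import classification_report, roc_auc_score
from sklearn.model_selection import GridSearchCV, cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Hypothetical binary classification data.
X_svm, y_svm = make_classification(
    n_samples=300, n_features=10, n_informative=5, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X_svm, y_svm, test_size=0.25, stratify=y_svm, random_state=0
)

# Scaling lives inside the pipeline so each CV fold is scaled on its own training part.
svm = make_pipeline(StandardScaler(), SVC(kernel="rbf"))

# k-fold cross-validation (k=5) of the default model, scored with F1.
cv_f1 = cross_val_score(svm, X_train, y_train, cv=5, scoring="f1")
print("CV F1 per fold:", cv_f1.round(3))

# Hyperparameter tuning over C and gamma, selected by cross-validated ROC AUC.
grid = GridSearchCV(
    svm,
    param_grid={"svc__C": [0.1, 1, 10], "svc__gamma": ["scale", 0.01, 0.1]},
    scoring="roc_auc",
    cv=5,
)
grid.fit(X_train, y_train)

# decision_function returns signed distances to the boundary; ROC AUC only needs a ranking.
test_scores = grid.decision_function(X_test)
test_auc = roc_auc_score(y_test, test_scores)

print("Best parameters:", grid.best_params_)
print(classification_report(y_test, grid.predict(X_test)))
print("Test ROC AUC:", round(test_auc, 3))
```

The test set is touched only once, after tuning, so the reported test metrics are not biased by the hyperparameter search.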

For the reinforcement learning model, let's consider a classic algorithm called Q-learning. Q-learning is a model-free reinforcement learning technique used to learn a policy for making decisions by trying to find the optimal action-value function.

Here's how we can measure the success of a Q-learning model:

Cumulative Reward:

The cumulative reward over multiple episodes is a key metric for measuring the success of a reinforcement learning model. The cumulative reward is the sum of rewards obtained by the agent over a sequence of actions taken within an episode.
Higher cumulative rewards indicate that the agent is achieving its objectives and performing well in the environment.

Convergence:

Convergence refers to the point at which the Q-values stabilize, indicating that the agent has learned an effective policy for making decisions in the environment.
To measure convergence, we can track the changes in Q-values over time and monitor whether they converge to stable values. Techniques such as plotting Q-value updates over episodes can help visualize convergence.

Exploration vs. Exploitation Balance:

Balancing exploration (trying new actions to discover more about the environment) and exploitation (using known strategies to maximize rewards) is crucial for effective reinforcement learning.
We can measure the exploration vs. exploitation balance by analyzing the agent's behavior over time. Early in the learning process, we expect to see more exploration as the agent tries different actions to learn about the environment. As learning progresses, the agent should shift towards exploitation, favoring actions with higher expected rewards based on learned Q-values.
Techniques such as epsilon-greedy exploration, where the agent selects a random action with probability epsilon and the action with the highest Q-value with probability 1 - epsilon, help balance exploration and exploitation.

Episodic Performance:

Evaluating the agent's performance over individual episodes provides insights into its learning progress and effectiveness. We can track metrics such as episode length, total rewards obtained per episode, and the percentage of successful episodes (e.g., reaching the goal state).
Monitoring episodic performance helps identify trends and patterns in the agent's behavior and performance over time.

By considering these aspects, we can effectively measure the success of a Q-learning model in reinforcement learning. Evaluating cumulative rewards, convergence, exploration vs. exploitation balance, and episodic performance provides comprehensive insights into the model's learning progress and effectiveness in achieving its objectives in the environment.
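The four measures above can be logged during training. The sketch below is a minimal illustration under assumed settings: a hypothetical six-state corridor (start at state 0, goal at state 5), rewards of -1 per move and +10 at the goal, and an epsilon that decays from 1.0 to a floor of 0.05. For each episode it records the cumulative reward, the episode length, whether the goal was reached, the largest Q-value change, and the epsilon in use.

```python
import random

N_STATES, GOAL_STATE = 6, 5   # hypothetical corridor: states 0..5, goal at 5
MOVES = [-1, +1]              # action 0 = left, action 1 = right

def run(episodes=200, alpha=0.5, gamma=0.9, seed=0, max_steps=50):
    """Q-learning with decaying epsilon; returns the Q-table and per-episode logs."""
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    epsilon = 1.0
    history = []
    for _ in range(episodes):
        state, total, steps, max_delta = 0, 0.0, 0, 0.0
        while state != GOAL_STATE and steps < max_steps:
            if rng.random() < epsilon:
                a = rng.randrange(2)                          # explore
            else:
                a = 0 if q[state][0] > q[state][1] else 1     # exploit
            nxt = min(max(state + MOVES[a], 0), N_STATES - 1)
            done = nxt == GOAL_STATE
            reward = 10.0 if done else -1.0
            target = reward if done else reward + gamma * max(q[nxt])
            delta = alpha * (target - q[state][a])
            q[state][a] += delta
            max_delta = max(max_delta, abs(delta))  # convergence signal
            total += reward                         # cumulative reward
            state, steps = nxt, steps + 1
        history.append({
            "reward": total,
            "steps": steps,
            "success": state == GOAL_STATE,
            "max_delta": max_delta,
            "epsilon": epsilon,
        })
        epsilon = max(0.05, epsilon * 0.97)  # shift from exploration to exploitation
    return q, history

def mean(values):
    return sum(values) / len(values)

q_values, history = run()
early, late = history[:20], history[-20:]
for name, chunk in (("first 20 episodes", early), ("last 20 episodes", late)):
    print(name)
    print("  mean reward:     ", round(mean([h["reward"] for h in chunk]), 2))
    print("  mean steps:      ", round(mean([h["steps"] for h in chunk]), 2))
    print("  success rate:    ", mean([h["success"] for h in chunk]))
    print("  mean max |dQ|:   ", round(mean([h["max_delta"] for h in chunk]), 4))
```

Comparing the first and last episodes side by side shows the expected pattern: rewards rise, episodes get shorter, and the Q-value updates shrink as epsilon decays.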