This project aims to predict loan payment defaults using various classification algorithms. By analyzing data on past loans, the goal is to optimize predictions to minimize financial risks, ensuring robust credit decisions.
- Develop and evaluate classifiers to predict defaults for 500 new clients.
- Identify the most effective classifier to balance risk mitigation and operational efficiency.
- Data Exploration
  - Examined variables to identify predictors of defaults.
  - Eliminated redundant or non-informative features (e.g., `clientID` and the constant `categorie`).
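
The feature-elimination step above can be sketched in base R. This is only an illustration on a toy data frame, not the project's actual code: the values are invented, and every column name other than `clientID` is an assumption about the dataset.

```r
# Toy data; values are invented and column names are illustrative assumptions.
clients <- data.frame(
  clientID  = 1:5,
  age       = c(25, 40, 33, 51, 47),
  categorie = rep("A", 5),                 # constant column: carries no information
  debcarte  = c(1.2, 0.4, 3.1, 0.9, 2.2)
)

# Flag columns that hold a single distinct non-missing value.
is_constant <- vapply(
  clients,
  function(col) length(unique(col[!is.na(col)])) <= 1,
  logical(1)
)

# Drop the identifier and any constant column.
drop_cols <- union("clientID", names(clients)[is_constant])
clients_clean <- clients[, setdiff(names(clients), drop_cols), drop = FALSE]

names(clients_clean)
```

Detecting constant columns programmatically, rather than naming them by hand, keeps the step valid if the input file changes.
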
- Data Preprocessing
  - Imputed missing values using:
    - Median substitution.
    - `missForest` for enhanced accuracy.
  - Transformed `education` into numeric values for compatibility with algorithms.
  - Created new features (e.g., `DE = debcarte/emploi`) to improve model accuracy.
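
A minimal base R sketch of the preprocessing steps, under stated assumptions: the toy values, the `education` labels, and their ordering are hypothetical, and only the median variant of imputation is shown (the `missForest` variant is omitted here).

```r
# Toy data; the education labels and their ordering are assumptions.
clients <- data.frame(
  age       = c(25, NA, 33, 51),
  debcarte  = c(1.2, 0.4, NA, 2.0),
  emploi    = c(2, 10, 5, 4),
  education = c("lycee", "licence", "master", "lycee"),
  stringsAsFactors = FALSE
)

# Replace missing values in a numeric vector with its median.
impute_median <- function(x) {
  x[is.na(x)] <- median(x, na.rm = TRUE)
  x
}
num_cols <- vapply(clients, is.numeric, logical(1))
clients[num_cols] <- lapply(clients[num_cols], impute_median)

# Encode education as an integer scale (hypothetical level order).
edu_levels <- c("lycee", "licence", "master")
clients$education <- as.integer(factor(clients$education, levels = edu_levels))

# Ratio feature named in the text: DE = debcarte / emploi.
clients$DE <- clients$debcarte / clients$emploi
```

Note that a zero in `emploi` would produce `Inf` or `NaN` in `DE`, so real data may need a guard before the division.
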
- Clustering
  - Identified pivotal variables (e.g., `age`, `adresse`) through clustering.
  - Created boolean variables for values around these pivots.
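
One possible reading of the pivot idea, sketched with base R `kmeans`. The text does not say which clustering method was used or how pivots were defined, so both the two-cluster k-means and the midpoint rule below are assumptions, and the `age` values are invented.

```r
set.seed(1)
age <- c(21, 23, 26, 24, 52, 55, 58, 60)

# Two-cluster k-means on a single variable; nstart guards against a poor start.
km <- kmeans(age, centers = 2, nstart = 10)

# Assumed pivot rule: the midpoint between the two cluster centers.
pivot <- mean(km$centers[, 1])

# Boolean feature: which side of the pivot each client falls on.
age_above_pivot <- age > pivot
```

The same pattern could be repeated for `adresse` or any other numeric variable to produce one boolean per pivot.
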
- Classifier Development
  - Tested models using the `caret` package with:
    - Decision Trees (C5.0, rpart)
    - Random Forest (rf, ranger, Rborist)
    - Neural Networks (avNNet, nnet, pcaNNet)
    - Naive Bayes
    - Support Vector Machines (svmLinear2, svmPoly, svmRadial)
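
A hedged sketch of how several `caret` methods might be trained and compared under one resampling scheme. The simulated data, the target name `defaut`, and the 5-fold setup are assumptions for illustration; the project's actual settings are in its R script. The sketch assumes `caret` and `rpart` are installed.

```r
library(caret)

set.seed(42)
n <- 200
toy <- data.frame(age = rnorm(n, 40, 10), DE = runif(n, 0, 5))
# Hypothetical binary target; the real column name is not given here.
toy$defaut <- factor(ifelse(toy$DE + rnorm(n) > 3, "Oui", "Non"),
                     levels = c("Non", "Oui"))

# 5-fold cross-validation scored by ROC AUC.
ctrl <- trainControl(method = "cv", number = 5,
                     classProbs = TRUE,
                     summaryFunction = twoClassSummary)

# caret method strings; extend with e.g. "rf", "nb", "svmRadial"
# once their backing packages are installed.
methods <- c("rpart")

fits <- lapply(methods, function(m) {
  train(defaut ~ ., data = toy, method = m, metric = "ROC", trControl = ctrl)
})
names(fits) <- methods

# Best cross-validated AUC per method.
best_auc <- vapply(fits, function(f) max(f$results$ROC, na.rm = TRUE), numeric(1))
best_auc
```

Sharing one `trainControl` object across methods keeps the comparison on the same resampling scheme.
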
- Evaluation
  - Metrics used:
    - AUC (Area Under Curve) for global performance.
    - Positive Predictive Value (PPV) to focus on risk reduction.
  - Balanced sampling ensured unbiased evaluations.
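
To make the two metrics concrete, here is a small base R sketch that computes them by hand on invented scores. The helper names `auc` and `ppv` and the 0.5 threshold are illustrative assumptions, not part of the project's script; balanced sampling is not shown.

```r
# Toy scores and labels (1 = default, 0 = no default).
score <- c(0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1)
label <- c(1, 1, 0, 1, 0, 0, 1, 0)

# AUC via the rank-sum (Mann-Whitney) identity.
auc <- function(score, label) {
  n_pos <- sum(label == 1)
  n_neg <- sum(label == 0)
  (sum(rank(score)[label == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# PPV = TP / (TP + FP) at a given decision threshold.
ppv <- function(score, label, threshold = 0.5) {
  predicted <- score > threshold
  sum(predicted & label == 1) / sum(predicted)
}

auc(score, label)  # 0.75
ppv(score, label)  # 0.75
```

AUC summarizes ranking quality across all thresholds, while PPV depends on the chosen threshold, which is why the two can favor different models.
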
- Optimal Classifier Selection
  - Chose Naive Bayes with the `SansrevQ` dataset as the optimal model, balancing AUC and PPV.
- Prediction
  - Applied the optimal classifier to predict defaults for new clients.
- Best Model: Naive Bayes on the `SansrevQ` dataset.
- Feature Engineering:
  - New variable `DE` significantly improved model accuracy.
  - Removing `revenus` simplified models without compromising performance.
- Code: Code_DataMining_project.R
- Dataset: projet.csv
- New Predictions Dataset: projet_new.csv
- Load and preprocess data from `projet.csv`.
- Train and evaluate classifiers using the provided R script.
- Apply the optimal model to `projet_new.csv` for predictions.
- Integrate financial impact analysis to weigh false negatives and positives.
- Enhance feature engineering with domain-specific insights.
- Tristan Gonçalves
- Pierre-François Pinelli