A comprehensive machine learning project to predict the probability of loan defaults using borrower and loan characteristics.
Objective: Develop an accurate machine learning model to identify high-risk loans and minimize financial losses for lenders.
Dataset: `Loan_default.csv`, containing borrower profiles and loan information with features such as:
- Credit score
- Loan amount
- Interest rate
- Borrower's income
- Debt-to-income ratio (DTI)
- Employment length
- Home ownership status
- Loan purpose
- Past payment history
- Loan status (target variable)
```
Project_5/
├── Loan_default.csv                # Original dataset
├── loan_default_prediction.ipynb  # Main Jupyter notebook
├── examine_data.py                # Data exploration script
├── requirements.txt               # Python dependencies
├── README.md                      # This file
├── loan_default_model.pkl         # Saved trained model (generated)
└── model_comparison_results.csv   # Model performance comparison (generated)
```
- Missing Value Handling: Median imputation for numerical features
- Outlier Detection: IQR-based outlier removal
- Feature Engineering:
  - Income-to-loan ratio
  - Age group categorization
  - Credit score categories
  - DTI ratio categories
- Categorical Encoding: One-hot encoding for categorical variables
- Feature Scaling: StandardScaler normalization (see the sketch below)
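A minimal sketch of these preprocessing steps, assuming hypothetical column names such as `Income`, `LoanAmount`, and `Age` (the actual columns in `Loan_default.csv` may differ):

```python
import pandas as pd

df = pd.read_csv('Loan_default.csv')

# Missing values: median imputation for numerical features
num_cols = df.select_dtypes(include='number').columns
df[num_cols] = df[num_cols].fillna(df[num_cols].median())

# Outliers: IQR-based removal, shown here for a single column
q1, q3 = df['Income'].quantile([0.25, 0.75])
iqr = q3 - q1
df = df[df['Income'].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)]

# Feature engineering: ratio feature and binned categories
df['IncomeToLoanRatio'] = df['Income'] / df['LoanAmount']
df['AgeGroup'] = pd.cut(df['Age'], bins=[18, 30, 45, 60, 100],
                        labels=['18-30', '31-45', '46-60', '60+'])

# One-hot encoding for categorical variables
# (StandardScaler is fitted later, after the train/test split)
df = pd.get_dummies(df, drop_first=True)
```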
- SMOTE: Synthetic Minority Oversampling Technique
- Balanced Training: Ensures equal representation of default/non-default cases
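SMOTE is typically applied to the training split only, so that synthetic samples never leak into the test set. A minimal sketch, assuming `X_train`/`y_train` come from a prior stratified `train_test_split`:

```python
from imblearn.over_sampling import SMOTE

# Oversample the minority (default) class to a 50/50 balance
smote = SMOTE(random_state=42)
X_train_bal, y_train_bal = smote.fit_resample(X_train, y_train)
```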
- Logistic Regression: Linear baseline model
- Random Forest: Ensemble tree-based model
- XGBoost: Gradient boosting classifier
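A sketch of how the three models might be instantiated and trained on the balanced data (the hyperparameters here are illustrative, not the notebook's exact settings):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier

models = {
    'Logistic Regression': LogisticRegression(max_iter=1000),
    'Random Forest': RandomForestClassifier(n_estimators=200, random_state=42),
    'XGBoost': XGBClassifier(eval_metric='logloss', random_state=42),
}

# Fit each model on the SMOTE-balanced training data
for name, model in models.items():
    model.fit(X_train_bal, y_train_bal)
```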
- Metrics: Accuracy, Precision, Recall, F1-Score, AUC-ROC
- Visualization: ROC curves, confusion matrices, performance comparisons
- Hyperparameter Tuning: GridSearchCV for optimal parameters
- Feature Importance: Analysis of most predictive features
- Performance Comparison: Side-by-side model evaluation
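As an example of the tuning step, a GridSearchCV sketch for the random forest, scored on AUC-ROC (the notebook's actual search space may differ):

```python
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier

# Illustrative parameter grid, not the notebook's exact one
param_grid = {'n_estimators': [100, 200], 'max_depth': [5, 10, None]}
search = GridSearchCV(RandomForestClassifier(random_state=42), param_grid,
                      scoring='roc_auc', cv=5, n_jobs=-1)
search.fit(X_train_bal, y_train_bal)
print(search.best_params_, search.best_score_)
```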
- Python 3.7 or higher
- pip package manager
1. Clone or download the project files
2. Install dependencies:
   ```bash
   pip install -r requirements.txt
   ```
3. Launch Jupyter Notebook:
   ```bash
   jupyter notebook
   ```
4. Open the main notebook:
   - Navigate to `loan_default_prediction.ipynb`
   - Run all cells sequentially
1. Data Exploration:
   ```bash
   python examine_data.py
   ```
2. Full ML Pipeline:
   - Open `loan_default_prediction.ipynb`
   - Execute all cells to run the complete pipeline
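For reference, the kind of quick exploration `examine_data.py` performs might look like the sketch below (a hypothetical outline, not the script's exact contents; the target column name `Default` is assumed):

```python
import pandas as pd

# Load the dataset and print basic structure
df = pd.read_csv('Loan_default.csv')
print(df.shape)         # rows x columns
print(df.dtypes)        # feature types
print(df.isna().sum())  # missing values per column
print(df.describe())    # summary statistics

# Class balance of the target ('Default' is an assumed column name)
print(df['Default'].value_counts(normalize=True))
```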
After training, the model is saved as `loan_default_model.pkl`. Use it for predictions:
```python
import joblib
import pandas as pd

# Load the saved model and its preprocessing artifacts
model_artifacts = joblib.load('loan_default_model.pkl')
model = model_artifacts['model']
scaler = model_artifacts['scaler']
feature_columns = model_artifacts['feature_columns']

# Prepare new data (must use the same feature columns as training)
new_data = pd.DataFrame({
    # Add your loan data here with the same column names
})

# Scale features, then predict class and default probability
new_data_scaled = scaler.transform(new_data[feature_columns])
predictions = model.predict(new_data_scaled)
probabilities = model.predict_proba(new_data_scaled)[:, 1]

print(f"Default Probability: {probabilities[0]:.4f}")
print(f"Prediction: {'Default' if predictions[0] == 1 else 'No Default'}")
```
The notebook provides:
- Comprehensive EDA: Data distribution analysis, correlation matrices, feature relationships
- Model Comparison: Performance metrics for all three models
- Best Model Selection: Automated selection based on AUC-ROC score
- Hyperparameter Tuning: Optimized parameters for the best model
- Feature Importance: Insights into most predictive features
- Production-Ready Model: Saved model with preprocessing pipeline
The models are evaluated using:
- Accuracy: Overall correct predictions
- Precision: True positives / (True positives + False positives)
- Recall: True positives / (True positives + False negatives)
- F1-Score: Harmonic mean of precision and recall
- AUC-ROC: Area under the receiver operating characteristic curve
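Assuming `y_test` holds the true labels and `y_pred`/`y_prob` a model's class predictions and default probabilities, these metrics can be computed with scikit-learn:

```python
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score)

print(f"Accuracy : {accuracy_score(y_test, y_pred):.4f}")
print(f"Precision: {precision_score(y_test, y_pred):.4f}")
print(f"Recall   : {recall_score(y_test, y_pred):.4f}")
print(f"F1-Score : {f1_score(y_test, y_pred):.4f}")
print(f"AUC-ROC  : {roc_auc_score(y_test, y_prob):.4f}")
```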
- Risk Assessment: Early identification of high-risk loans
- Loss Reduction: Minimize financial losses through better screening
- Decision Support: Data-driven loan approval process
- Regulatory Compliance: Transparent and explainable model decisions
- pandas: Data manipulation and analysis
- numpy: Numerical computing
- scikit-learn: Machine learning algorithms
- xgboost: Gradient boosting framework
- imbalanced-learn: Handling imbalanced datasets
- matplotlib/seaborn: Data visualization
- joblib: Model serialization
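For reference, a minimal `requirements.txt` covering these packages might look like the following (the project's actual file may pin specific versions):

```text
pandas
numpy
scikit-learn
xgboost
imbalanced-learn
matplotlib
seaborn
joblib
```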
- Data loading and exploration
- Data cleaning and preprocessing
- Feature engineering
- Train/test split with stratification
- Feature scaling
- Class imbalance handling (SMOTE)
- Model training (3 algorithms)
- Model evaluation and comparison
- Hyperparameter tuning for best model
- Feature importance analysis
- Model persistence
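The persistence step bundles the model with its preprocessing artifacts so the prediction example above works unchanged. A minimal sketch, assuming `best_model`, `scaler`, and `feature_columns` exist from the earlier steps:

```python
import joblib

# Key names mirror those read back in the prediction snippet
model_artifacts = {
    'model': best_model,
    'scaler': scaler,
    'feature_columns': feature_columns,
}
joblib.dump(model_artifacts, 'loan_default_model.pkl')
```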
- Model Monitoring: Track performance over time
- Ensemble Methods: Combine multiple models for better performance
- Feature Selection: Automated feature selection techniques
- Real-time Predictions: API development for live predictions
- Model Explainability: SHAP values for individual predictions
- A/B Testing: Compare model versions in production
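As an illustration of the explainability item, a hypothetical SHAP sketch for a trained tree-based model (this requires the optional `shap` package, which is not among the current dependencies):

```python
import shap

# Explain predictions of a tree-based model such as the random forest or XGBoost
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_test)
shap.summary_plot(shap_values, X_test)  # global view of feature impact
```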
- Missing Dependencies:
  ```bash
  pip install --upgrade -r requirements.txt
  ```
- Memory Issues: Reduce dataset size or use sampling for large datasets
- Jupyter Kernel Issues:
  ```bash
  python -m ipykernel install --user --name=loan_prediction
  ```
- XGBoost Installation Issues:
  ```bash
  conda install -c conda-forge xgboost
  ```
- Fork the repository
- Create a feature branch
- Make your changes
- Add tests if applicable
- Submit a pull request
This project is for educational and research purposes.
For questions or issues, please refer to the notebook documentation or create an issue in the repository.