This project aims to develop a credit scoring model using historical banking data. The goal is to predict customers' credit score based on various features. The project involves several steps, including data exploration, data preprocessing, feature engineering, feature selection, model building, evaluation, and application on new data.
- Introduction
- Data Exploration
- Data Preprocessing
- Feature Engineering
- Model Building
- Model Evaluation
- Applying Model on New Data
- Conclusion
Credit scoring is a crucial task in the banking and financial sectors, enabling institutions to evaluate the creditworthiness of potential borrowers. This project uses a dataset containing various features related to customer information and loan details to build a predictive model.
The dataset comprises both numerical and categorical features. Initial exploration examined its structure and summary statistics:
- Rows and Columns: 100,000 rows and 28 columns
- Data Types: Mixed data types, including integers, floats, and objects
- Missing Values: Various columns with missing values were identified and addressed
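The checks listed above can be sketched with pandas roughly as follows. This is a minimal illustration, not the project's actual notebook: the toy frame and its column names (`age`, `annual_income`, `occupation`) are assumptions standing in for the real 100,000-row dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the real dataset (assumed column names).
df = pd.DataFrame({
    "age": [23, 35, np.nan, 41],
    "annual_income": [19114.12, 34847.84, 143162.64, np.nan],
    "occupation": ["Scientist", None, "Engineer", "Teacher"],
})

n_rows, n_cols = df.shape      # rows and columns
dtypes = df.dtypes             # data type of each column
missing = df.isna().sum()      # number of missing values per column
summary = df.describe()        # summary statistics for numeric columns

print(f"{n_rows} rows, {n_cols} columns")
print(dtypes)
print(missing[missing > 0])    # only the columns that have missing values
```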
To ensure the data was clean and suitable for modeling, several preprocessing steps were undertaken:
- Unwanted Characters Removal: Stripped and replaced unwanted characters
- Missing Value Imputation: Filled missing values using group modes and custom values
- Data Type Conversion: Converted columns to appropriate data types
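A rough sketch of these three steps is shown below. The column names, the underscore as the unwanted character, the grouping key `customer_id`, and the fill value `0.0` are all assumptions made for illustration; the actual characters, groups, and custom values used in the project may differ.

```python
import numpy as np
import pandas as pd

# Hypothetical sample rows (all column names and values are assumptions).
df = pd.DataFrame({
    "customer_id": ["c1", "c1", "c1", "c2", "c2"],
    "age": ["23_", "23", "23", "_41", "41"],
    "occupation": ["Scientist", None, "Scientist", "Teacher", None],
    "monthly_balance": [312.5, np.nan, 340.1, np.nan, 120.0],
})

# 1. Unwanted character removal + 3. data type conversion:
#    strip stray underscores, then convert the strings to nullable integers.
df["age"] = pd.to_numeric(df["age"].str.strip("_"), errors="coerce").astype("Int64")


def fill_with_group_mode(values: pd.Series) -> pd.Series:
    """Fill missing entries with the most frequent value in the group."""
    modes = values.mode()  # mode() ignores missing values by default
    return values.fillna(modes.iloc[0]) if not modes.empty else values


# 2a. Missing value imputation using the mode within each customer's records.
df["occupation"] = df.groupby("customer_id")["occupation"].transform(fill_with_group_mode)

# 2b. Missing value imputation using a custom value (0.0 is an arbitrary choice here).
df["monthly_balance"] = df["monthly_balance"].fillna(0.0)
```

Imputing within a group keeps a customer's filled-in values consistent with that customer's other records, rather than with the dataset as a whole.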
Feature engineering involved transforming existing features and creating new ones to improve the model's predictive power:
- Credit History Age: Converted from years and months to total months
- Outlier Capping: Applied the IQR method to cap outliers
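Both transformations can be sketched as below. The text format `"N Years and M Months"`, the column names, and the conventional multiplier `k = 1.5` are assumptions; the source only states that the age was converted to total months and that the IQR method was used for capping.

```python
import pandas as pd

# Hypothetical values; the "N Years and M Months" format is an assumption.
df = pd.DataFrame({
    "credit_history_age": [
        "22 Years and 1 Months",
        "5 Years and 11 Months",
        "0 Years and 7 Months",
    ],
})

# Credit history age: pull out the two numbers and combine them into total months.
parts = (
    df["credit_history_age"]
    .str.extract(r"(\d+)\s+Years?\s+and\s+(\d+)\s+Months?")
    .astype(int)
)
df["credit_history_months"] = parts[0] * 12 + parts[1]


def cap_outliers_iqr(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Clip values to the range [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return values.clip(lower=q1 - k * iqr, upper=q3 + k * iqr)


# Outlier capping on a hypothetical numeric feature with one extreme value.
income = pd.Series([10.0, 12.0, 11.0, 13.0, 12.0, 1000.0])
income_capped = cap_outliers_iqr(income)
```

Capping keeps every row in the dataset and only pulls extreme values back to the fence, whereas dropping outliers would shrink the training set.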
The model used for this project was a Random Forest Classifier, which performed best among the machine learning models tested. It was trained and evaluated using a 70-30 train-test split.
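A minimal sketch of this setup with scikit-learn might look like the following. Synthetic data from `make_classification` stands in for the preprocessed banking features, and the hyperparameters (`n_estimators=100`), the stratified split, and the random seeds are assumptions, not the project's recorded settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the preprocessed features and credit score labels.
X, y = make_classification(
    n_samples=1000,
    n_features=10,
    n_informative=5,
    n_classes=3,
    random_state=42,
)

# 70-30 train-test split; stratifying keeps class proportions similar in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

# Assumed hyperparameters; the source does not state the ones actually used.
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

test_accuracy = model.score(X_test, y_test)  # mean accuracy on the held-out 30%
print(f"Test accuracy: {test_accuracy:.2f}")
```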
Model performance was measured by accuracy on the test split. The Random Forest Classifier achieved an accuracy of 0.86.
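For reference, accuracy is the fraction of predictions that match the true labels. The sketch below computes it on ten made-up labels (the class names `Good`, `Standard`, and `Poor` are assumptions) and also prints a confusion matrix and per-class report, which show where the errors fall when classes are imbalanced.

```python
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Made-up labels for illustration only; these are not the project's results.
labels = ["Good", "Standard", "Poor"]
y_true = ["Good", "Good", "Standard", "Standard", "Standard",
          "Poor", "Poor", "Standard", "Good", "Poor"]
y_pred = ["Good", "Standard", "Standard", "Standard", "Standard",
          "Poor", "Standard", "Standard", "Good", "Poor"]

accuracy = accuracy_score(y_true, y_pred)            # correct predictions / total
cm = confusion_matrix(y_true, y_pred, labels=labels)  # rows: true, columns: predicted

print(f"Accuracy: {accuracy:.2f}")
print(cm)
print(classification_report(y_true, y_pred, labels=labels))
```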
The trained model was applied to a new dataset to predict credit scores. The new data underwent the same preprocessing and feature engineering steps as the training data.
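One way to keep the two paths identical is to put the cleaning logic in a single function and call it on both datasets, as in the sketch below. The function `preprocess`, the feature list, and all column names and values are hypothetical; only the idea of reusing the same steps comes from the text.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

FEATURES = ["age", "num_credit_cards"]  # assumed feature columns


def preprocess(raw: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical cleaning step shared by training data and new data."""
    out = raw.copy()
    out["age"] = pd.to_numeric(out["age"].str.strip("_"), errors="coerce")
    return out


# Hypothetical training data with a known credit score.
train = pd.DataFrame({
    "age": ["23_", "35", "41", "_52", "29", "60"],
    "num_credit_cards": [4, 2, 7, 3, 9, 1],
    "credit_score": ["Good", "Good", "Poor", "Standard", "Poor", "Good"],
})

# Hypothetical new data without a credit score.
new = pd.DataFrame({
    "age": ["31_", "47"],
    "num_credit_cards": [5, 2],
})

model = RandomForestClassifier(n_estimators=50, random_state=0)
model.fit(preprocess(train)[FEATURES], train["credit_score"])

# Same function, same column order, then predict.
new_clean = preprocess(new)[FEATURES]
new["predicted_credit_score"] = model.predict(new_clean)
print(new)
```

Selecting `FEATURES` explicitly on both sides guards against the new file having extra columns or a different column order than the training data.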
This project developed and evaluated a credit scoring model using a Random Forest Classifier, which reached an accuracy of 0.86. Future improvements could include exploring additional feature engineering techniques, hyperparameter tuning, and incorporating domain-specific knowledge to enhance model performance.