# **Project Name**    - Paisa Bazaar Credit Score Prediction




##### **Project Type**    - Supervised Classification
##### **Contribution**    - Individual - Gyanvir Singh


# **Project Summary -**

This machine learning project focuses on predicting an individual’s credit score category—specifically, whether their credit score is Poor, Standard, or Good. Credit scores are an essential indicator of a person's creditworthiness and play a crucial role in loan approvals, interest rates, and financial risk assessment. By building a model that can classify users based on their credit behavior and financial profile, this project aims to assist financial institutions in making informed, data-driven decisions when evaluating potential borrowers.

We began by analyzing a dataset containing 100,000 customer records and 28 columns, which included various features such as Age, Annual_Income, Monthly_Inhand_Salary, Occupation, Num_of_Loan, Outstanding_Debt, and behavioral indicators like Payment_of_Min_Amount and Payment_Behaviour. The target column was Credit_Score, which had three unique values representing different credit categories.

The initial phase involved extensive data exploration and cleaning. We checked for missing values, duplicates, and data inconsistencies. Some columns like Credit_Mix had missing entries, which we filled using the most frequent values. Multi-valued fields such as Type_of_Loan were carefully processed to extract individual loan types into binary columns (e.g., "Auto Loan", "Student Loan", etc.), enabling the model to better understand each customer’s loan history. Unnecessary identifiers such as customer IDs and names were removed, and categorical values were encoded properly. We also applied scaling to numerical features and used square root transformations to reduce skewness in heavily skewed columns like Total_EMI_per_month and Amount_invested_monthly.
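
As a rough illustration of the cleaning steps described above, the following pandas sketch runs on a tiny made-up frame. The sample values, the `Loan_` column prefix, the `_sqrt` suffix, and the handling of a leading "and" in the last loan type are assumptions for illustration, not the project's actual code.

```python
import numpy as np
import pandas as pd

# Tiny made-up sample; the real dataset has 100,000 rows and 28 columns.
df = pd.DataFrame({
    "Credit_Mix": ["Good", None, "Standard", "Good"],
    "Type_of_Loan": [
        "Auto Loan, Student Loan",
        "Student Loan",
        None,
        "Auto Loan, and Home Equity Loan",
    ],
    "Total_EMI_per_month": [0.0, 49.0, 100.0, 400.0],
})

# 1) Fill missing Credit_Mix entries with the most frequent value (the mode).
df["Credit_Mix"] = df["Credit_Mix"].fillna(df["Credit_Mix"].mode()[0])


# 2) Split the multi-valued Type_of_Loan field into one binary column per loan type.
def split_loans(raw):
    """Return the set of loan types named in one comma-separated cell."""
    if not isinstance(raw, str):
        return set()  # missing value: no loans recorded
    parts = [p.strip() for p in raw.split(",")]
    # Assumption: the last item may be written as "and <loan type>".
    parts = [p[4:].strip() if p.startswith("and ") else p for p in parts]
    return {p for p in parts if p}


loan_sets = df["Type_of_Loan"].apply(split_loans)
for loan_type in sorted(set().union(*loan_sets)):
    column = "Loan_" + loan_type.replace(" ", "_")
    df[column] = loan_sets.apply(lambda s, t=loan_type: int(t in s))

# 3) Square-root transform to reduce right skew (values must be non-negative).
df["Total_EMI_per_month_sqrt"] = np.sqrt(df["Total_EMI_per_month"].clip(lower=0))

print(df.drop(columns="Type_of_Loan"))
```

Keeping the loan types as separate 0/1 columns lets a tree-based model split on each loan type independently instead of treating every distinct combination string as its own category.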

After preprocessing, we conducted hypothesis testing and exploratory visualizations to understand the relationships between key features and credit score. These visual insights helped us justify why certain features (like number of delayed payments or outstanding debt) were likely to impact a person’s credit score significantly.

Before model training, we applied SMOTE (Synthetic Minority Over-sampling Technique) to handle class imbalance, since most customers in the dataset had a "Standard" credit score. The data was then split into training and test sets. We first built a baseline Logistic Regression model, which reached about 71% 5-fold cross-validation accuracy (66.06% on the test set). To capture more complex relationships, we then trained an XGBoost classifier, which achieved about 82% cross-validation accuracy with default settings. After hyperparameter tuning, it scored 76.10% in cross-validation and 77.59% on the test set.
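
To make the oversampling idea concrete, here is a simplified NumPy sketch of the interpolation step behind SMOTE. It is not the implementation used in this project (in practice a library class such as `SMOTE` from `imbalanced-learn` is the usual choice); the function name `smote_like_oversample`, its parameters, and the sample rows are hypothetical.

```python
import numpy as np


def smote_like_oversample(X_minority, n_new, k=3, seed=0):
    """Create n_new synthetic rows by interpolating between minority-class neighbours."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X_minority, dtype=float)
    k = min(k, len(X) - 1)  # cannot have more neighbours than other points

    # Pairwise Euclidean distances; a point is never its own neighbour.
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)
    neighbours = np.argsort(dists, axis=1)[:, :k]

    # Pick a random base point and one of its k nearest neighbours for each new row.
    base = rng.integers(0, len(X), size=n_new)
    chosen = neighbours[base, rng.integers(0, k, size=n_new)]

    # Place the synthetic point at a random position on the segment between them.
    gap = rng.random((n_new, 1))
    return X[base] + gap * (X[chosen] - X[base])


# Made-up minority-class rows with two numeric features.
X_good = np.array([[1.0, 2.0], [1.5, 1.8], [2.0, 2.2], [1.2, 2.5]])
X_synthetic = smote_like_oversample(X_good, n_new=6)
print(X_synthetic.shape)  # (6, 2)
```

Because every synthetic row lies on a line segment between two real minority rows, the new points stay inside the region the minority class already occupies rather than being exact duplicates.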

To interpret the model, we analyzed feature importance from the XGBoost classifier. The most influential factors turned out to be Num_of_Loan, Payment_Behaviour, Delay_from_due_date, and Credit_Utilization_Ratio. This matched our understanding of creditworthiness and provided clear business insights: for example, users with long payment delays or many loans are more likely to fall in the “Poor” credit category.
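
A ranking like this can be read from a fitted tree ensemble's `feature_importances_` attribute. The sketch below uses scikit-learn's `GradientBoostingClassifier` as a stand-in for XGBoost so that it only needs scikit-learn (`XGBClassifier` exposes an attribute of the same name). The data is synthetic and the column names are only labels borrowed from the dataset, so the printed ranking carries no meaning about real credit behavior.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Labels borrowed from the dataset; the values below are synthetic.
feature_names = [
    "Num_of_Loan", "Payment_Behaviour", "Delay_from_due_date",
    "Credit_Utilization_Ratio", "Age", "Annual_Income",
]
X, y = make_classification(
    n_samples=300, n_features=6, n_informative=4, n_redundant=0,
    n_classes=3, random_state=0,
)

# Stand-in for the XGBoost classifier described in the text.
model = GradientBoostingClassifier(n_estimators=50, max_depth=3, random_state=0)
model.fit(X, y)

# Importances are normalized to sum to 1; sort from most to least important.
importances = (
    pd.Series(model.feature_importances_, index=feature_names)
    .sort_values(ascending=False)
)
print(importances)
```

Note that impurity-based importances show how much a feature is used by the trees, not the direction of its effect, so statements such as "more loans lead to poorer scores" still rely on the exploratory charts.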

In conclusion, this project demonstrates how machine learning can classify credit score categories using real-world financial and behavioral data. The final tuned XGBoost model reaches 77.59% accuracy on the test set and exposes feature importances that make its predictions easier to interpret. With this tool, financial institutions can reduce risk, detect potential defaults earlier, and personalize credit offerings based on predicted credit score categories. This supports more responsible lending practices and better financial outcomes for both banks and borrowers.



# **GitHub Link -**

https://github.com/Gyanvir/PaisaBazaar_Credit_Score_Prediction

# **Problem Statement**


The main goal of this project is to build a machine learning model that can predict the credit score category of a person—classified as Poor, Standard, or Good—using various financial, personal, and behavioral features. This classification will help financial institutions in evaluating whether a customer is creditworthy before offering any loans or credit services.

The dataset contains detailed records of individuals, including information like their age, income, number of loans, payment behavior, credit history, and other indicators. By analyzing these factors, the project aims to develop a system that can accurately assess a customer’s credit profile and automate the credit scoring process.

This predictive model will not only help in reducing manual efforts and inconsistencies in credit evaluation but also support data-driven decision-making in the financial sector. Ultimately, it will assist banks and lending companies in minimizing risk, detecting potentially unreliable borrowers early, and making more informed and fair lending decisions.



# **Chart Descriptions**


**Chart 1: Distribution of Age**
    
Why this chart?
To understand the age demographics of customers in the dataset.

Insights:
Most customers are between 20–35 years old, with a sharp drop-off beyond 40.

Business Impact:
Products like credit cards or digital lending can be targeted toward younger audiences. Age-related segmentation can be critical for credit risk modeling.




**Chart 2: Count of Occupation**

Why this chart?
To understand the occupational background of customers.

Insights:
‘Healthcare’, ‘Engineer’, and ‘Lawyer’ dominate the customer base.

Business Impact:
Risk tolerance, income stability, and credit product offerings can vary by profession. Useful for building occupation-specific credit models.

**Chart 3: Target Variable – Credit Score Distribution**

Why this chart?
To check class balance for model building.

Insights:
‘Standard’ scores dominate; ‘Good’ is underrepresented.

Business Impact:
The dataset is imbalanced, so classification models may require sampling or weighted techniques to avoid bias.
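
As one example of a weighted technique, the snippet below computes "balanced" class weights with the formula `n_samples / (n_classes * class_count)`, the same heuristic scikit-learn applies for `class_weight="balanced"`. The label counts are made up for illustration and do not reflect the actual class distribution.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class inversely to its frequency."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Made-up label counts: 'Standard' is the majority, 'Good' the smallest class.
labels = ["Standard"] * 5 + ["Poor"] * 3 + ["Good"] * 2
weights = balanced_class_weights(labels)
print(weights)
```

Rarer classes receive larger weights, so misclassifying a 'Good' customer costs the model more during training than misclassifying a 'Standard' one.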

**Chart 4: Annual Income Distribution**

Why this chart?
To observe the spread of income levels.

Insights:
The majority earn less than ₹50,000, and the distribution is right-skewed.

Business Impact:
Helps identify low-income segments that may be high risk or need different credit rules. Normalization may be required.

**Chart 5: Number of Bank Accounts**

Why this chart?
To understand how financially diversified users are.

Insights:
Most customers have 2–5 accounts. Few have more than 7.

Business Impact:
Diversification of accounts might relate to financial maturity and credit exposure. Can be an important feature in modeling.

**Chart 6: Age vs Credit Score**

Why this chart?
To see if age has a trend with creditworthiness.

Insights:
Younger people are more likely to fall in the ‘Poor’ category; older individuals skew toward ‘Good’.

Business Impact:
Age might be a predictor for credit behavior, helping in age-based credit product offerings.

**Chart 7: Occupation vs Credit Score**

Why this chart?
To analyze credit score distributions across professions.

Insights:
‘Scientists’ and ‘Engineers’ have more ‘Good’ scores. ‘Developers’ and ‘Musicians’ have higher ‘Poor’ rates.

Business Impact:
Could inform profession-based risk profiling or interest rate adjustments.
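
A comparison like this can also be tabulated as the share of each credit score within every occupation. The sketch below assumes a frame with `Occupation` and `Credit_Score` columns; the six rows are made up for illustration.

```python
import pandas as pd

# Made-up rows; the real dataset has 100,000 records.
df = pd.DataFrame({
    "Occupation": ["Engineer", "Engineer", "Musician", "Musician", "Scientist", "Scientist"],
    "Credit_Score": ["Good", "Standard", "Poor", "Standard", "Good", "Good"],
})

# normalize="index" turns counts into row-wise shares, so each occupation sums to 1.
shares = pd.crosstab(df["Occupation"], df["Credit_Score"], normalize="index")
print(shares.round(2))
```

Row-wise shares make occupations of different sizes comparable, which raw counts in a bar chart can obscure.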


**Chart 8: Annual Income vs Credit Score**

Why this chart?
To evaluate how income levels vary across credit score tiers.

Insights:
Higher income correlates with ‘Good’ scores; ‘Poor’ scores are concentrated at lower income levels.

Business Impact:
Supports the idea that income is a significant driver of credit health. Can influence feature selection and credit thresholds.



**Chart 9: Number of Loans vs Credit Score**

Why this chart?
To study how the volume of loans impacts credit health.

Insights:
Credit score deteriorates as the number of loans increases, especially beyond 3 loans.

Business Impact:
Customers with high loan counts can be flagged for risk mitigation or review.



**Chart 10: Credit Utilization Ratio vs Credit Score**

Why this chart?
To check whether high utilization, which is often associated with lower creditworthiness, is linked to weaker credit scores in this dataset.

Insights:
‘Poor’ credit holders tend to have significantly higher utilization.

Business Impact:
Confirms that managing credit utilization is crucial. Strong business signal for feature inclusion.

**Chart 11: Annual Income vs Monthly Salary by Credit Score**

Why this chart?
To check if income–salary ratio clusters vary by credit score.

Insights:
‘Good’ credit scores form a clear upper cluster—high income and stable salary. ‘Poor’ scores are scattered across low-to-mid income.

Business Impact:
Verifies consistency in income-to-salary behavior. High discrepancy could be a signal for hidden debts or instability.

**Chart 12: Age vs Number of Loans by Credit Score**

Why this chart?
To see how credit behavior (loans taken) evolves with age and credit health.

Insights:
Younger users with many loans often have poor credit scores. Older individuals trend toward fewer loans with better scores.

Business Impact:
Helps tailor loan offers based on age–loan profile. Prevents over-lending to risk-prone segments.

**Chart 13: Monthly Balance vs Outstanding Debt by Credit Score**

Why this chart?
To compare financial surplus/deficit across credit types and debt exposure.

Insights:
Poor scores cluster around low or negative balances despite high debt. Good scores correlate with higher monthly surplus.

Business Impact:
Strongly justifies inclusion of debt-to-balance ratios in modeling risk.

**Chart 14: Credit History Age vs Credit Score + Payment of Min Amount**

Why this chart?
To examine how payment behavior influences long-term credit health.

Insights:
Those who skip minimum payments have shorter histories and worse scores.

Business Impact:
Behavioral variables (like Payment_of_Min_Amount) are highly predictive of credit class and longevity.

**Chart 15: Credit Mix vs Credit Score vs Credit Utilization**

Why this chart?
To assess how mix and utilization interact across credit categories.

Insights:
Good scorers have diverse mix and lower utilization. Poor scorers often have ‘Standard’ mix and higher usage.

Business Impact:
Emphasizes need to educate users on managing credit mix and lowering utilization to improve score.

# **Model Selections**


### Logistic Regression

**Model Explanation:**  
Logistic Regression is a simple, linear classification algorithm that works well for linearly separable data. It assumes a linear relationship between the independent variables and the probability of belonging to a particular class. This model was used as a **baseline** due to its interpretability and ease of implementation.

**Performance:**  
- Accuracy on test set: **66.06%**  
- Cross-Validation Accuracy (5-fold): **71.04%**

**Business Interpretation:**  
The Logistic Regression model provides a basic understanding of which features might influence credit scores. However, its relatively lower accuracy suggests that it may not capture complex interactions between variables like credit behavior and financial history, which are crucial for making accurate credit risk predictions.
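
A baseline of this kind can be sketched with scikit-learn as follows. The data is synthetic (`make_classification`), and the pipeline, split ratio, and `max_iter` value are assumptions, so the printed accuracies will not match the figures reported above.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic three-class data standing in for the preprocessed credit features.
X, y = make_classification(
    n_samples=600, n_features=10, n_informative=5, n_classes=3, random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Scaling inside the pipeline is re-fitted on each CV training fold.
baseline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# 5-fold cross-validation accuracy on the training portion.
cv_scores = cross_val_score(baseline, X_train, y_train, cv=5, scoring="accuracy")

# Fit on the full training set and score once on the held-out test set.
baseline.fit(X_train, y_train)
test_accuracy = baseline.score(X_test, y_test)

print(f"CV accuracy: {cv_scores.mean():.4f}, test accuracy: {test_accuracy:.4f}")
```

Reporting both numbers, as done above, helps reveal whether a model's cross-validation score is a fair guide to its performance on held-out data.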


### XGBoost (Tuned)

**Model Explanation:**  
XGBoost (Extreme Gradient Boosting) is a powerful tree-based ensemble algorithm known for its high performance on structured/tabular data. It uses gradient boosting techniques to sequentially build strong learners from weak learners and can handle feature interactions, outliers, and missing values effectively.

We performed **hyperparameter tuning** using GridSearchCV, and the best parameters found were:
```python
{'learning_rate': 0.1, 'max_depth': 5, 'n_estimators': 150}
```
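
A grid search over these three parameters might look like the sketch below. It uses scikit-learn's `GradientBoostingClassifier` as a stand-in for XGBoost so that it only depends on scikit-learn; the parameter names happen to be the same. The candidate values in `param_grid`, the 3-fold setting, and the synthetic data are assumptions, since only the best parameters are reported here.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Small synthetic three-class problem so the search finishes quickly.
X, y = make_classification(
    n_samples=200, n_features=8, n_informative=4, n_classes=3, random_state=0
)

# Hypothetical candidate values; the actual grid used in the project is not stated.
param_grid = {
    "learning_rate": [0.1, 0.3],
    "max_depth": [3, 5],
    "n_estimators": [50, 150],
}

# Every combination is scored by cross-validated accuracy; the best one is refitted.
search = GridSearchCV(
    GradientBoostingClassifier(random_state=0),  # stand-in for XGBClassifier
    param_grid,
    cv=3,
    scoring="accuracy",
)
search.fit(X, y)

print(search.best_params_, round(search.best_score_, 4))
```

`best_score_` is the mean cross-validated accuracy of the winning combination, which is the kind of figure reported as the tuned cross-validation accuracy.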

**Performance:**
- Accuracy on test set: **77.59%**
- Cross-Validation Accuracy (5-fold, default XGBoost): **82.44%**
- Cross-Validation Accuracy (with tuned XGBoost): **76.10%**

**Business Interpretation:**
XGBoost clearly outperformed Logistic Regression, reaching 77.59% test accuracy against 66.06%. It was able to better capture non-linear relationships in the data, such as how combinations of delayed payments, loan count, and credit utilization affect credit scores. This model is more suitable for real-world credit scoring tasks, where subtle patterns and interactions play a critical role in risk assessment.

# **Final Model Chosen**: Tuned XGBoost

Even though the tuned XGBoost model had a lower cross-validation score than the untuned one (76.10% vs. 82.44%), we selected it for the following reasons:
- It reduces the risk of overfitting by using more conservative hyperparameters (learning_rate 0.1, max_depth 5).
- Its test accuracy (77.59%) is in line with its cross-validation accuracy (76.10%), which suggests it generalizes to unseen data.
- Its hyperparameters come from an explicit grid search, which makes the configuration documented and reproducible for production environments.

Overall, the tuned XGBoost model strikes the right balance between accuracy, stability, and business usefulness, making it the best fit for predicting credit score categories in this project.