# CREDIT RISK SCORING SYSTEM

### Problem Statement: Developing a Robust Credit Risk Scoring System using Machine Learning

#### Context and Background

In the rapidly evolving fintech industry, accurate credit risk assessment is paramount to financial institutions. Credit risk scoring is a critical tool that helps lenders evaluate the likelihood of a borrower defaulting on a loan. Traditional credit scoring models often rely on a limited set of financial indicators and can be prone to biases and inaccuracies. With the advent of big data and advanced machine learning techniques, there is a significant opportunity to enhance the accuracy and reliability of credit risk scoring models.

#### Objective

The primary objective of this project is to develop a robust, scalable, and interpretable machine learning model that predicts the probability of default for loan applicants. The model aims to leverage a wide array of features, including demographic data, financial status, credit history, and behavioral data, to deliver accurate credit risk assessments.

#### Key Questions

1. **Data Integration**: How can we effectively integrate diverse data sources to build a comprehensive dataset for credit risk assessment?
2. **Feature Engineering**: What are the most predictive features for assessing credit risk, and how can we engineer new features to improve model performance?
3. **Model Selection**: Which machine learning algorithms provide the best performance in terms of accuracy, interpretability, and computational efficiency for credit risk scoring?
4. **Model Evaluation**: What metrics should be used to evaluate the performance of the credit risk model, and how can we ensure the model generalizes well to unseen data?
5. **Bias and Fairness**: How can we detect and mitigate biases in the credit risk model to ensure fair treatment of all applicants?
6. **Deployment and Monitoring**: How can we deploy the model in a real-world setting, and what mechanisms should be in place for continuous monitoring and updating of the model?

#### Scope and Deliverables

1. **Data Collection and Preprocessing**:
   - Collect and preprocess data from the chosen datasets (Home Credit Default Risk, LendingClub Loan Data, and Give Me Some Credit).
   - Handle missing values, outliers, and data inconsistencies.

2. **Exploratory Data Analysis (EDA)**:
   - Conduct EDA to understand data distribution, correlations, and key insights.
   - Visualize important patterns and relationships in the data.

3. **Feature Engineering**:
   - Develop and select features that significantly impact credit risk prediction.
   - Implement techniques such as one-hot encoding, scaling, and normalization.

4. **Model Development**:
   - Train and compare multiple machine learning models, including Logistic Regression, Decision Trees, Random Forests, Gradient Boosting Machines, XGBoost, and Neural Networks.
   - Perform hyperparameter tuning to optimize model performance.

5. **Model Evaluation**:
   - Evaluate models using AUC-ROC, precision-recall curves, F1 score, and confusion matrices.
   - Ensure model interpretability using SHAP values or LIME.

6. **Deployment**:
   - Develop an API for the model using Flask or FastAPI.
   - Deploy the model on a cloud platform (e.g., AWS, GCP, Azure) and integrate with a front-end application.

7. **Monitoring and Maintenance**:
   - Implement monitoring tools to track model performance over time.
   - Set up a pipeline for continuous learning and model updates based on new data.

#### Expected Impact

By developing a sophisticated credit risk scoring system, this project aims to:
- Enhance the accuracy and reliability of credit risk assessments.
- Enable financial institutions to make more informed lending decisions.
- Reduce the risk of defaults and financial losses.
- Promote fair and unbiased credit evaluation processes.

#### Conclusion

This project applies advanced machine learning techniques to a critical challenge in the fintech space: credit risk assessment. The outcomes are intended to benefit financial institutions and to contribute to the broader goal of financial inclusion and stability.

### IMPORT NECESSARY MODULES

In [14]:
# Importing standard libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Importing libraries for data preprocessing
from sklearn.preprocessing import StandardScaler, MinMaxScaler, LabelEncoder, OneHotEncoder
from sklearn.model_selection import train_test_split, GridSearchCV, RandomizedSearchCV
from sklearn.impute import SimpleImputer

# Importing libraries for machine learning models
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from sklearn.neural_network import MLPClassifier

# Importing libraries for model evaluation
from sklearn.metrics import roc_auc_score, accuracy_score, precision_score, recall_score, f1_score, confusion_matrix, roc_curve, precision_recall_curve

# Importing libraries for model interpretability
import shap
import lime
import lime.lime_tabular

# Importing libraries for API development and deployment
from flask import Flask, request, jsonify
import joblib

# Miscellaneous libraries
import warnings
warnings.filterwarnings('ignore')

# Display all rows and columns
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)


### IMPORT THE TRAIN AND TEST CSV FILES

In [6]:
# Import training set
train_df = pd.read_csv(r"C:\Users\Black Concept\WorkSpace\ALTSCHOOL\Datasets\application_train.csv")

# Import test set
test_df = pd.read_csv(r"C:\Users\Black Concept\WorkSpace\ALTSCHOOL\Datasets\application_test.csv")

### DATA PREPROCESSING

- **Cleaning**: Handle missing values, outliers, and data inconsistencies.
- **Normalization/Standardization**: Scale numerical features.
- **Encoding**: Convert categorical variables into numerical values using techniques like one-hot encoding or label encoding.
- **Feature Engineering**: Create new features based on domain knowledge (e.g., debt-to-income ratio, credit utilization rate).
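The four steps listed here can be sketched as a single scikit-learn pipeline. This is a minimal illustration, not the notebook's final preprocessing: `toy_df` and its values are made up, `CREDIT_INCOME_RATIO` is a hypothetical engineered feature, and the choice of median and most-frequent imputation is an assumption.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy frame standing in for train_df (values are invented for illustration).
toy_df = pd.DataFrame({
    "AMT_INCOME_TOTAL": [100000.0, 200000.0, np.nan, 50000.0],
    "AMT_CREDIT": [400000.0, 300000.0, 250000.0, 100000.0],
    "NAME_CONTRACT_TYPE": ["Cash loans", "Revolving loans", "Cash loans", "Cash loans"],
})

# Feature engineering: a hypothetical credit-to-income ratio.
toy_df["CREDIT_INCOME_RATIO"] = toy_df["AMT_CREDIT"] / toy_df["AMT_INCOME_TOTAL"]

numeric_cols = ["AMT_INCOME_TOTAL", "AMT_CREDIT", "CREDIT_INCOME_RATIO"]
categorical_cols = ["NAME_CONTRACT_TYPE"]

# Cleaning + scaling for numeric columns.
numeric_pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])

# Cleaning + one-hot encoding for categorical columns.
categorical_pipe = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("encode", OneHotEncoder(handle_unknown="ignore")),
])

# sparse_threshold=0 forces a dense NumPy array as output.
preprocessor = ColumnTransformer(
    [
        ("num", numeric_pipe, numeric_cols),
        ("cat", categorical_pipe, categorical_cols),
    ],
    sparse_threshold=0,
)

X_toy = preprocessor.fit_transform(toy_df)
print(X_toy.shape)
```

On the real data, the preprocessor would be fitted on the training set only and then applied to the test set with `transform`, so that imputation values and scaling statistics do not leak from the test set.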

The dataset is already split into a training set and a test set. This lets us examine the two separately during exploratory data analysis, looking at the relationships between features as well as the distribution of each feature.

Examining both the training and test sets is very important: if the two differ markedly, a model fitted to the training set may not generalize to the test set. It is also important to check that the features are well represented and similarly distributed across the two datasets.
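One possible way to check whether features are distributed similarly across the two sets is a two-sample Kolmogorov-Smirnov test per shared numeric column. The sketch below runs on synthetic stand-ins rather than the real files; `compare_distributions`, `toy_train`, and `toy_test` are hypothetical names, and the KS test is just one reasonable choice of comparison.

```python
import numpy as np
import pandas as pd
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)

# Synthetic stand-ins for train_df and test_df (not the real data).
toy_train = pd.DataFrame({
    "AMT_CREDIT": rng.normal(500_000, 100_000, 1_000),
    "CNT_CHILDREN": rng.integers(0, 4, 1_000),
})
toy_test = pd.DataFrame({
    "AMT_CREDIT": rng.normal(500_000, 100_000, 300),
    "CNT_CHILDREN": rng.integers(2, 6, 300),  # deliberately shifted
})


def compare_distributions(train, test):
    """Compare means and KS statistics for numeric columns present in both frames."""
    shared = [
        col for col in train.columns
        if col in test.columns and pd.api.types.is_numeric_dtype(train[col])
    ]
    rows = []
    for col in shared:
        result = ks_2samp(train[col].dropna(), test[col].dropna())
        rows.append({
            "column": col,
            "train_mean": train[col].mean(),
            "test_mean": test[col].mean(),
            "ks_statistic": result.statistic,
            "p_value": result.pvalue,
        })
    return pd.DataFrame(rows).set_index("column")


drift_report = compare_distributions(toy_train, toy_test)
print(drift_report)
```

A large KS statistic (with a small p-value) flags a column whose distribution differs between the two sets. With hundreds of thousands of rows even tiny differences become statistically significant, so the statistic itself is usually more informative than the p-value.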

### How large is the dataset that we are working with and what are the different features?

In [11]:
# Check the size of the datasets
train_size = train_df.shape # Training size
test_size = test_df.shape # Test size

print(f'The training set has {train_size[0]} rows (observations) and {train_size[-1]} columns (features)')

print(f'The test set has {test_size[0]} rows (observations) and {test_size[-1]} columns (features)')

The training set has 307511 rows (observations) and 122 columns (features)
The test set has 48744 rows (observations) and 121 columns (features)
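The two shapes differ by one column. A quick way to find out which column that is, sketched here on toy frames (`toy_train` and `toy_test` are hypothetical stand-ins), is to take the set difference of the column indexes. For this dataset one would expect the label column `TARGET` to be the one absent from the test set, but that should be confirmed by running the same two lines on `train_df` and `test_df`.

```python
import pandas as pd

# Toy stand-ins for train_df and test_df (values are invented).
toy_train = pd.DataFrame({
    "SK_ID_CURR": [1, 2],
    "TARGET": [0, 1],
    "AMT_CREDIT": [1.0, 2.0],
})
toy_test = pd.DataFrame({
    "SK_ID_CURR": [3],
    "AMT_CREDIT": [3.0],
})

# Columns that appear in one frame but not the other.
only_in_train = toy_train.columns.difference(toy_test.columns)
only_in_test = toy_test.columns.difference(toy_train.columns)

print("Only in train:", list(only_in_train))
print("Only in test:", list(only_in_test))
```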


In [15]:
# List the column names of the training set (the displayed output is truncated)
train_df.columns

Index(['SK_ID_CURR', 'TARGET', 'NAME_CONTRACT_TYPE', 'CODE_GENDER',
       'FLAG_OWN_CAR', 'FLAG_OWN_REALTY', 'CNT_CHILDREN', 'AMT_INCOME_TOTAL',
       'AMT_CREDIT', 'AMT_ANNUITY',
       ...
       'FLAG_DOCUMENT_18', 'FLAG_DOCUMENT_19', 'FLAG_DOCUMENT_20',
       'FLAG_DOCUMENT_21', 'AMT_REQ_CREDIT_BUREAU_HOUR',
       'AMT_REQ_CREDIT_BUREAU_DAY', 'AMT_REQ_CREDIT_BUREAU_WEEK',
       'AMT_REQ_CREDIT_BUREAU_MON', 'AMT_REQ_CREDIT_BUREAU_QRT',
       'AMT_REQ_CREDIT_BUREAU_YEAR'],
      dtype='object', length=122)
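Because the `Index` display above elides most of the 122 names, a per-column summary can be more useful than the raw listing. The following is a small sketch on a toy frame; `summarize_columns` and `toy_df` are hypothetical, and the same function could be called on `train_df`.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for train_df (values are invented).
toy_df = pd.DataFrame({
    "SK_ID_CURR": [100001, 100002, 100003, 100004],
    "CODE_GENDER": ["M", "F", "F", None],
    "AMT_ANNUITY": [24700.5, np.nan, 6750.0, np.nan],
})


def summarize_columns(df):
    """Return dtype, distinct-value count, and missing share for every column."""
    summary = pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "n_unique": df.nunique(),
        "missing_share": df.isna().mean(),
    })
    # Columns with the most missing values come first.
    return summary.sort_values("missing_share", ascending=False)


column_summary = summarize_columns(toy_df)
print(column_summary)

# Full, untruncated list of column names.
print(toy_df.columns.tolist())
```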