- Person or organization developing model: Agnes, agnes@gmc.com
- Model date: August, 2022
- Model version: 1.0.2
- License: MIT
- Model implementation code: DNSC_6301_Project.ipynb
- Primary intended uses: This model is an example probability of default classifier, with an example use case for determining eligibility for a credit line increase.
- Primary intended users: Students in GWU DNSC 6301 bootcamp.
- Out-of-scope use cases: Any use beyond an educational example is out-of-scope.
- Data dictionary:
Name | Modeling Role | Measurement Level | Description |
---|---|---|---|
ID | ID | int | unique row identifier |
LIMIT_BAL | input | float | amount of previously awarded credit |
SEX | demographic information | int | 1 = male; 2 = female |
RACE | demographic information | int | 1 = hispanic; 2 = black; 3 = white; 4 = asian |
EDUCATION | demographic information | int | 1 = graduate school; 2 = university; 3 = high school; 4 = others |
MARRIAGE | demographic information | int | 1 = married; 2 = single; 3 = others |
AGE | demographic information | int | age in years |
PAY_0, PAY_2 - PAY_6 | inputs | int | history of past payment; PAY_0 = the repayment status in September, 2005; PAY_2 = the repayment status in August, 2005; ...; PAY_6 = the repayment status in April, 2005. The measurement scale for the repayment status is: -1 = pay duly; 1 = payment delay for one month; 2 = payment delay for two months; ...; 8 = payment delay for eight months; 9 = payment delay for nine months and above |
BILL_AMT1 - BILL_AMT6 | inputs | float | amount of bill statement; BILL_AMT1 = amount of bill statement in September, 2005; BILL_AMT2 = amount of bill statement in August, 2005; ...; BILL_AMT6 = amount of bill statement in April, 2005 |
PAY_AMT1 - PAY_AMT6 | inputs | float | amount of previous payment; PAY_AMT1 = amount paid in September, 2005; PAY_AMT2 = amount paid in August, 2005; ...; PAY_AMT6 = amount paid in April, 2005 |
DELINQ_NEXT | target | int | whether a customer's next payment is delinquent (late), 1 = late; 0 = on-time |
- Source of training data: GWU Blackboard, email jphall@gwu.edu for more information
- How training data was divided into training and validation data: 50% training, 25% validation, 25% test
- Number of rows in training and validation data:
  - Training rows: 15,000
  - Validation rows: 7,500
- Source of test data: GWU Blackboard, email jphall@gwu.edu for more information
- Number of rows in test data: 7,500
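
The row counts above imply 30,000 rows in total (15,000 + 7,500 + 7,500). As a minimal sketch of how a 50/25/25 partition could be produced, the snippet below shuffles row indices using only the standard library. The function name `split_indices`, the seed, and the shuffle-then-slice approach are assumptions for illustration, not necessarily how the notebook performs the split.

```python
import random


def split_indices(n_rows, train_frac=0.50, valid_frac=0.25, seed=12345):
    """Shuffle row indices and cut them into train/validation/test partitions."""
    indices = list(range(n_rows))
    # Seed is an assumption; the model card does not state the split seed.
    random.Random(seed).shuffle(indices)
    n_train = int(n_rows * train_frac)
    n_valid = int(n_rows * valid_frac)
    train = indices[:n_train]
    valid = indices[n_train:n_train + n_valid]
    test = indices[n_train + n_valid:]  # the remainder goes to the test partition
    return train, valid, test


train, valid, test = split_indices(30_000)
print(len(train), len(valid), len(test))  # 15000 7500 7500
```
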
- State any differences in columns between training and test data: None
- Columns used as inputs in the final model: 'LIMIT_BAL', 'PAY_0', 'PAY_2', 'PAY_3', 'PAY_4', 'PAY_5', 'PAY_6', 'BILL_AMT1', 'BILL_AMT2', 'BILL_AMT3', 'BILL_AMT4', 'BILL_AMT5', 'BILL_AMT6', 'PAY_AMT1', 'PAY_AMT2', 'PAY_AMT3', 'PAY_AMT4', 'PAY_AMT5', 'PAY_AMT6'
- Column(s) used as target(s) in the final model: 'DELINQ_NEXT'
- Type of model: Decision Tree
- Software used to implement the model: Python, scikit-learn
- Version of the modeling software: Python 3.7.13, scikit-learn 1.0.2
- Hyperparameters or other settings of your model:
DecisionTreeClassifier {'ccp_alpha': 0.0,'class_weight': None,'criterion': 'gini',
'max_depth': 12,'max_features': None,'max_leaf_nodes': None,
'min_impurity_decrease': 0.0,'min_samples_leaf': 1,'min_samples_split': 2,
'min_weight_fraction_leaf': 0.0,'random_state': 12345,'splitter': 'best'}
Variables with a negative correlation move in opposite directions.
At a depth of 12, the model shows high performance and minimal bias, a good trade-off between the best validation performance and the best fairness.
- Best Model AUC:
Training AUC | Validation AUC | Test AUC |
---|---|---|
0.783722 | 0.749610 | 0.743847 |
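
For readers unfamiliar with the metric, AUC can be read as the probability that a randomly chosen delinquent row receives a higher predicted score than a randomly chosen on-time row. The sketch below computes it directly from that definition on a tiny hypothetical example; the function name `roc_auc` and the toy labels and scores are assumptions, and the values in the table above were presumably produced with a library routine rather than this code.

```python
def roc_auc(y_true, y_score):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative row")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical labels (1 = late, 0 = on-time) and predicted probabilities.
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc(y_true, y_score)
print(auc)  # 0.75
```

The pairwise loop is quadratic in the number of rows, so it is suitable only for small illustrations like this one.
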
- Best Model AIR:
Hispanic-to-White AIR | Black-to-White AIR | Asian-to-White AIR | Female-to-Male AIR |
---|---|---|---|
0.83 | 0.85 | 1.00 | 1.02 |
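
The adverse impact ratio (AIR) compares the rate of favorable outcomes for a protected group with the rate for a reference group. The sketch below shows the arithmetic on hypothetical decisions. The function names, the toy data, and the convention that `1` means a favorable outcome (for example, predicted on-time under some probability cutoff) are assumptions; the card does not state the cutoff used to produce the values above.

```python
def acceptance_rate(decisions):
    """Share of rows receiving the favorable outcome (1 = favorable, 0 = unfavorable)."""
    return sum(decisions) / len(decisions)


def adverse_impact_ratio(protected, reference):
    """Acceptance rate of the protected group divided by that of the reference group."""
    return acceptance_rate(protected) / acceptance_rate(reference)


# Hypothetical decisions for two groups of five applicants each.
reference_group = [1, 1, 1, 1, 0]  # 4 of 5 favorable
protected_group = [1, 1, 0, 1, 0]  # 3 of 5 favorable
air = adverse_impact_ratio(protected_group, reference_group)
print(round(air, 2))  # 0.75
```

An AIR of 1.00 means both groups receive favorable outcomes at the same rate. A commonly cited rule of thumb flags values below 0.80 for closer review.
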
- Describe potential negative impacts of using your model:
  - Math or software problems: With an accuracy of roughly 60%, about 40% of decisions are wrong, and those errors can increase discrimination.
  - Real-world risks: who, what, when or how: Fewer Hispanic and Black applicants would receive the credit product.
- Describe potential uncertainties relating to the impacts of using your model:
  - Math or software problems: Inaccurate input data and model failure are potential sources of uncertainty.
  - Real-world risks: who, what, when or how: Misusing the model may cause commercial or social harm to both the company and its customers.
- Describe any unexpected results: The model may not grasp the context of the task it performs.