# Classical credit scoring modeling tutorial

## Fields

- **ID**: ID of each client
- **LIMIT_BAL:** Amount of given credit in NT dollars (includes individual and family/supplementary credit)
- **SEX:** Gender (1=male, 2=female)
- **EDUCATION:** (1=graduate school, 2=university, 3=high school, 4=others, 5=unknown, 6=unknown)
- **MARRIAGE:** Marital status (1=married, 2=single, 3=others)
- **AGE:** Age in years
- **PAY_0 to PAY_6:** Repayment status from last month back to 6 months ago, 2005 (-1=pay duly, 1=payment delay for one month, 2=payment delay for two months, … 8=payment delay for eight months, 9=payment delay for nine months and above)
- **BILL_AMT1 to BILL_AMT6:** Amount of bill statement from last month back to 6 months ago, 2005 (NT dollar)
- **PAY_AMT1 to PAY_AMT6:** Amount of previous payment from last month back to 6 months ago, 2005 (NT dollar)

- **default.payment.next.month:** Default payment (1=yes, 0=no)

# Modeling steps

## Reading data:
 * For all types:
  * Check missing values: null, none, blank, sentinel codes such as 9999..99 (also common), -1 and so on.
 * Categorical variables:
  * STRING: Clean string variables: trim, normalize case etc.
  * INTEGER: Check whether integer variables are continuous or categorical codes.
 * Date and Time:
  * They MUST NOT enter the model directly. See the feature engineering section below.
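
The checks listed above can be sketched on a toy table. The column names and the sentinel codes (`99999`, `-1`) are assumptions made for illustration; real sentinels have to be confirmed with whoever owns the data.

```python
import numpy as np
import pandas as pd

# Toy data; column names and sentinel codes are illustrative assumptions.
raw = pd.DataFrame({
    "income": [1200.0, 99999.0, -1.0, 3400.0],
    "city": ["  Taipei", "taipei ", "", "Tainan"],
    "status_code": [1, 2, 2, 3],
})

clean = raw.copy()

# Turn sentinel codes into real missing values.
SENTINELS = [99999.0, -1.0]
clean["income"] = clean["income"].replace(SENTINELS, np.nan)

# Trim, normalize case, and treat blank strings as missing.
clean["city"] = clean["city"].str.strip().str.lower().replace("", np.nan)

# Few distinct integer values suggest a categorical code rather than a count.
n_levels = clean["status_code"].nunique()

missing_per_column = clean.isna().sum()
print(missing_per_column)
print("distinct status codes:", n_levels)
```

Counting the missing values only after the sentinels are replaced matters: `isna()` on the raw table would report no missing income at all.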
  
## Population and database design

* Observation definition:
 * You should define which field is the primary key of your observation (or row of the database).
 * Define your population. Should you consider all the documents/clients/entities, or is a subset enough?
  * Many models are influenced by the proportion of a certain class in the training set and try to predict the same proportion on the testing set.
  * Example: Are all the people in the CRM database the target of your model, or only the subset that has already bought something?
  * Consider undersampling and upsampling cautiously, because they require an extra step to calibrate the probabilities.
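
As a sketch of that extra calibration step: assuming all events are kept and only a fraction `beta` of the non-events is sampled, a probability estimated on the undersampled data can be mapped back to the original population with the standard prior-correction formula. The function name and the numbers are illustrative, not part of this notebook.

```python
import numpy as np

def correct_undersampled_prob(p_sampled, beta):
    """Map probabilities learned on undersampled data back to the original prior.

    beta: fraction of non-events kept during undersampling (all events are kept).
    """
    p_sampled = np.asarray(p_sampled, dtype=float)
    return beta * p_sampled / (beta * p_sampled - p_sampled + 1.0)

# Toy check: a 10% event rate looks much higher after keeping 25% of the non-events.
true_rate = 0.10
beta = 0.25
sampled_rate = true_rate / (true_rate + beta * (1 - true_rate))
recovered = correct_undersampled_prob(sampled_rate, beta)
print(round(sampled_rate, 4), round(float(recovered), 4))
```

With `beta = 1` (no undersampling) the function returns its input unchanged, which is a quick sanity check.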
  
## Feature engineering

* Main tips:
 * Many models, linear ones in particular, perform better on variables with a linear relationship with the target (not exponential, not unbalanced, not long-tailed, and so on).
 * All businesses need to trust your model: yesterday, to understand what happened; today, to make decisions; and tomorrow, to forecast risk/revenue/demand/infections etc.
 * The goal of feature engineering is to stabilize the model across time and subsamples.

* For date-time variables:
 * REFDATE: **First**, don't mess with the snapshot/reference date/month-on-book/whatever field. ALL DATABASES must have the date of extraction, or the snapshot. This information is part of the primary key. It is important for audit, reporting, stability checks, model validation and so on.
  * The refdate is used to create reports.
 * Transform dates into timedeltas such as years (ages) or months (time of relationship, months until bankruptcy).
 * When combined with other variables, they can produce good attributes (e.g. expenses in the last 3 months).
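
A minimal sketch of the date-to-timedelta idea, assuming hypothetical `BIRTH_DATE` and `FIRST_CONTRACT_DATE` columns (the dataset used in this notebook has no date fields). The raw dates stay out of the feature set; only the derived durations and the reference date used for reporting are kept.

```python
import pandas as pd

# Hypothetical client table; column names are assumptions for illustration.
clients = pd.DataFrame({
    "REFDATE": pd.to_datetime(["2005-09-30", "2005-09-30"]),
    "BIRTH_DATE": pd.to_datetime(["1975-09-30", "1990-03-15"]),
    "FIRST_CONTRACT_DATE": pd.to_datetime(["2003-09-30", "2005-06-30"]),
})

# Age in completed years at the snapshot (365.25 approximates leap years).
age_days = (clients["REFDATE"] - clients["BIRTH_DATE"]).dt.days
clients["AGE_YEARS"] = (age_days / 365.25).astype(int)

# Time of relationship in calendar months at the snapshot.
clients["MONTHS_ON_BOOK"] = (
    (clients["REFDATE"].dt.year - clients["FIRST_CONTRACT_DATE"].dt.year) * 12
    + (clients["REFDATE"].dt.month - clients["FIRST_CONTRACT_DATE"].dt.month)
)

# REFDATE is kept for reporting; the raw dates are dropped from the features.
features = clients[["REFDATE", "AGE_YEARS", "MONTHS_ON_BOOK"]]
print(features)
```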
 
* For string/categorical data:
 * Do you have many categories? Why not use Pareto's law: keep only the relevant ones and group the rest into an "others" category?
 * Do they have similar meanings, like "single" and "divorced"? Why not group them as "single-divorced"?
 * Do they have the same odds ratio? Group them.
 * Do they have an order (low, medium, high)? Why not replace them with the mean of the target on a sample?
 * Keep the number of categories manageable, around 3 to 10.
 * Do you have only 2 categories? Why not create a single binary variable, e.g. gender (male, female) ==> (isFemale)?
 * Pseudo-categorical: e.g. education (high school, college, etc.) can hide a continuous variable, years of education.
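
The "others" grouping can be sketched as follows. The helper name, the 5% threshold and the toy occupations are assumptions; in practice the threshold depends on the sample size and on how stable each category is over time.

```python
import pandas as pd

def group_rare_categories(series, min_share=0.05, other_label="others"):
    """Keep categories with at least `min_share` of the rows, group the rest."""
    shares = series.value_counts(normalize=True)
    keep = shares[shares >= min_share].index
    return series.where(series.isin(keep), other_label)

# Toy categorical variable with a long tail of rare values.
occupation = pd.Series(
    ["clerk"] * 50 + ["teacher"] * 40 + ["pilot"] * 6 + ["diver"] * 2 + ["poet"] * 2
)
grouped = group_rare_categories(occupation, min_share=0.05)
print(grouped.value_counts())
```

The shares should be computed on the training sample only and the resulting list of kept categories reused on later data, so the grouping does not change from one snapshot to the next.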
 
 
* For real/floating variables:
 * **Monetary values**: These are prices, income, payments, costs, rents, exchange rates, etc. **In 99% of cases DON'T USE** them directly in your model. They are fragile and oscillate according to macroeconomic movements, like inflation, demand, etc. Create relative variables like BALANCE_INCOME = BALANCE / BASIC_INCOME_AT_MONTH.
 * Take care with fake precision. Does 5.009943323 dollars mean something? Why not round it to 5.01?
 * What is a good range for percentage variables?
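
A small sketch of the relative-variable idea, assuming a hypothetical per-month reference income table (`BASIC_INCOME_AT_MONTH`); the figures are made up. Both balances grow 10% between the two months, but so does the reference income, so the ratio stays put.

```python
import pandas as pd

# Toy balances observed in two different months.
accounts = pd.DataFrame({
    "YEARMONTH": ["2005-04", "2005-04", "2005-09", "2005-09"],
    "BALANCE": [5000.0, 20000.0, 5500.0, 22000.0],
})

# Hypothetical reference income for each month (an assumption, not real data).
basic_income = pd.Series(
    {"2005-04": 10000.0, "2005-09": 11000.0}, name="BASIC_INCOME_AT_MONTH"
)

accounts = accounts.join(basic_income, on="YEARMONTH")

# Relative variable, rounded to avoid fake precision.
accounts["BALANCE_INCOME"] = (
    accounts["BALANCE"] / accounts["BASIC_INCOME_AT_MONTH"]
).round(2)
print(accounts)
```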
 

* For integer variables:
 * Counting variables usually have a very asymmetric distribution. Sometimes grouping them into 1, 2, >=3 leads to good results.
 * **Special cases**:
  * **Age**: age is usually measured in years and is highly correlated with marital status, education and so on. Its distribution should be checked against the country's demographic distribution to prevent unwanted biases. In many cases, consider creating a categorical variable: 19-23 (college age), 23-27 (first job), 27-32 (senior positions), 60+ (retired) etc.
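
Both ideas can be sketched with `clip` and `pd.cut`. The column names are hypothetical, and the `<19` and `32-60` bands are assumptions added only to close the gaps between the bands named above.

```python
import numpy as np
import pandas as pd

people = pd.DataFrame({
    "N_LATE_PAYMENTS": [0, 1, 2, 5, 11],   # hypothetical counting variable
    "AGE": [20, 25, 30, 45, 67],
})

# Cap the long tail of the counting variable: every value >= 3 becomes 3.
people["N_LATE_CAPPED"] = people["N_LATE_PAYMENTS"].clip(upper=3)

# Age bands; intervals are closed on the left, e.g. 23 falls in "23-27".
edges = [0, 19, 23, 27, 32, 60, np.inf]
labels = ["<19", "19-23", "23-27", "27-32", "32-60", "60+"]
people["AGE_BAND"] = pd.cut(people["AGE"], bins=edges, labels=labels, right=False)
print(people)
```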
  
 
* Derived attributes:
 * Lagging attributes:
  * Ex: What was the client's last status? Has the machine failed in the last 10 days (for maintenance prediction)?
 * Ratio, variation and delta attributes: They usually help to stabilize the model.
  * Ex: ratio between the load and the truck strength (Load_Stress = load/strength). Ratios can be greater than one.
  * Ex: Balance_used = (expenses/balance). Proportions should be less than one, especially when they express probabilities.
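
A sketch of lagging, ratio and delta attributes on a toy client history in long format (one row per client and month). The column names are assumptions; the dataset in this notebook stores the history in wide format instead (PAY_1..PAY_6 and so on).

```python
import pandas as pd

# Toy history: one row per client and month, sorted so that shifts follow time.
history = pd.DataFrame({
    "ID": [1, 1, 1, 2, 2, 2],
    "YEARMONTH": [200507, 200508, 200509] * 2,
    "STATUS": [0, 1, 2, 0, 0, 0],
    "EXPENSES": [100.0, 300.0, 450.0, 50.0, 50.0, 80.0],
    "LIMIT": [500.0] * 3 + [1000.0] * 3,
}).sort_values(["ID", "YEARMONTH"])

# Lagging attribute: the client's status in the previous month.
history["STATUS_LAG1"] = history.groupby("ID")["STATUS"].shift(1)

# Ratio attribute: share of the limit that was used.
history["LIMIT_USED"] = history["EXPENSES"] / history["LIMIT"]

# Delta attribute: month-over-month change of the ratio, per client.
history["DELTA_LIMIT_USED"] = history.groupby("ID")["LIMIT_USED"].diff()
print(history)
```

The first month of each client has no previous row, so the lag and the delta are missing there; how to fill them is a modeling decision.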
  
 
* External attributes:
 * Other scores - bureau score: should be used as a last resort due to its cost and its vulnerability to provider errors.
  
  
  
Concepts:
* **Stable model**: a model that is stable across time and subsamples.
* **Future data**: an attribute that uses information only available after the snapshot, and therefore will break your model.
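
One possible way to make "stable across time" checkable, sketched under assumptions: compute the mean target per period and per group (a score band or a category) and verify that the groups keep the same risk ordering in every period. The function name is hypothetical, and it assumes the group labels are already sorted from lowest to highest risk.

```python
import pandas as pd

def ordering_is_stable(df, period="YMS", group="score", target="target"):
    """True if the mean target increases from group to group in every period."""
    rates = df.pivot_table(index=period, columns=group, values=target, aggfunc="mean")
    # Difference between each group and the previous one; the first column has no neighbor.
    steps = rates.diff(axis=1).iloc[:, 1:]
    return bool((steps > 0).all().all())

# Toy data: two periods, two score bands.
stable = pd.DataFrame({
    "YMS":    [1, 1, 1, 1, 2, 2, 2, 2],
    "score":  [0, 0, 1, 1, 0, 0, 1, 1],
    "target": [0, 0, 1, 0, 0, 1, 1, 1],
})
# Same layout, but in period 2 the "good" band is riskier than the "bad" one.
crossing = stable.assign(target=[0, 0, 1, 0, 1, 1, 1, 0])

print(ordering_is_stable(stable), ordering_is_stable(crossing))
```

A period in which a group has no observations produces a missing rate and is reported as unstable, which is a conservative choice.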

# Comments

-  Real credit datasets cover at least 12 months, not one month, because people behave differently after Christmas, Valentine's Day etc.
--  The variable YMS will be created to **S**imulate **Y**ear**M**onth

- The lagging variable PAY_0 seems to be PAY_1, following the naming of the other variables



In [None]:
!pip install sweetviz

In [None]:
%matplotlib inline

#basic data analysis libraries
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns

from IPython.display  import HTML
import sweetviz

In [None]:
#machine learning libraries
from sklearn.metrics import pairwise_distances
from sklearn.cluster import KMeans
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_curve, auc

import shap

## Loading data and doing preprocessing

In [None]:
df = pd.read_csv('/kaggle/input/default-of-credit-card-clients-dataset/UCI_Credit_Card.csv')
df.rename(columns={'default.payment.next.month':'target','PAY_0':'PAY_1'}, inplace=True)

#year month simulation creating 10 groups
df['YMS'] = np.ceil(10*np.random.RandomState(seed=42).rand(len(df)))
# YMS <7  is the training sample and YMS >=7 is the test sample.

In [None]:
df.describe(percentiles=np.linspace(0,1,11)).T.drop(columns=['count']).style.background_gradient(axis=1)

### Feature engineering

LIMIT_BAL and BILL_AMT can vary by income, region, time/inflation etc., and this information is not included in the dataset. So, the best way of creating stable features is to calculate relative metrics.

In [None]:
#binarizing
df['SEX'] = df['SEX'] - 1

df['MARRIED'] = df['MARRIAGE'].apply(lambda x: 1 if x == 1 else 0);

#converting education into years of study
df['EDUCATION'] = df['EDUCATION'].map({1:5,2:15, 3:9, 4:0, 5:0, 6:0, 0:0}) #1 = graduate school; 2 = university; 3 = high school; 4 = other


# IU = BILL_AMT/LIMIT  =>  percentage of the limit used
# PER_PAYED = pay / bill
for i in range(1,7):
    df['IU'+str(i)]= df['BILL_AMT'+str(i)] / df['LIMIT_BAL'];
    df['PER_PAYED'+str(i)]= (df['PAY_AMT'+str(i)] / df['BILL_AMT'+str(i)]).replace([np.inf, -np.inf], np.nan).fillna(0);

In [None]:
df[['IU'+str(i) for i in range(1,7)]+['YMS']]\
    .groupby('YMS').mean().round(2).plot(figsize=(10,3));
plt.legend(loc='best');

Is there any trend from the 6th month to the most recent month? Is the correlation of the percentages with the vector [6,5,4,3,2,1] positive?
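
To make the idea concrete, here is a small standalone sketch of the same measure for a single row, using `np.corrcoef` instead of the pairwise-distance call used in this notebook. The helper name and the toy rows are assumptions. Columns are ordered from the most recent month (1) to the oldest (6), so the reference vector is 6, 5, ..., 1 and a positive value means the quantity grew towards the present.

```python
import numpy as np

def trend_correlation(values_recent_first):
    """Pearson correlation of a row ordered [most recent, ..., oldest] with [n, ..., 1]."""
    values = np.asarray(values_recent_first, dtype=float)
    weights = len(values) - np.arange(len(values))  # n, n-1, ..., 1
    if values.std() == 0:
        # A constant row has no defined correlation; treat it as "no trend".
        return 0.0
    return float(np.corrcoef(values, weights)[0, 1])

growing = trend_correlation([0.9, 0.8, 0.6, 0.5, 0.3, 0.1])    # usage rising over time
shrinking = trend_correlation([0.1, 0.3, 0.5, 0.6, 0.8, 0.9])  # usage falling over time
flat = trend_correlation([0.5] * 6)
print(growing, shrinking, flat)
```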

In [None]:
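# 1 - correlation distance = Pearson correlation of each row with the vector [6,5,4,3,2,1]
# (column 1 is the most recent month, so a positive value means growth towards the present)
df['TREND_PER_PAYED'] = 1- pairwise_distances(
    df[['PER_PAYED'+str(i) for i in range(1,7)]],
    (6 - np.arange(6)).reshape(1,-1),
    metric='correlation').ravel();  # (n,1) -> (n,) so it fits a single column
# constant rows have an undefined correlation; treat them as "no trend"
df['TREND_PER_PAYED'] = df['TREND_PER_PAYED'].replace([np.inf, -np.inf], np.nan).fillna(0)

df['TREND_PER_IU'] = 1- pairwise_distances(
    df[['IU'+str(i) for i in range(1,7)]],
    (6 - np.arange(6)).reshape(1,-1),
    metric='correlation').ravel()
df['TREND_PER_IU'] = df['TREND_PER_IU'].replace([np.inf, -np.inf], np.nan).fillna(0)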


The **EDA_continuous_v1** function encodes the following design/business information:
- The univariate analysis (stats table and histogram), where outliers can be seen.
- Continuous variables are grouped by percentile, with each bucket containing 1% of the data.
- The bivariate analysis shows the relationship between the bucket mean of the variable and the bucket mean of the target.
-- A desirable attribute has a linear relationship / correlation with the target.

In [None]:
def EDA_continuous_v1(df, var,target='target', datetime='YMS', preprocess=None, hist_bins=100, figsize=(10,5)):
    display(HTML("EDA %s"%var))
    display(df[var].to_frame().describe(percentiles=np.linspace(0,1,11)).round(2).T)
    
    plt.figure(figsize=figsize);
    
    ### plot 1
    plt.subplot(121);
    plt.title(" Histogram - %s"%var);
    df[var].hist(bins=hist_bins, density=True);
    plt.xlabel(var);
    
    ### plot 2
    plt.subplot(122);
    temp =pd.DataFrame({        
        'rank': np.ceil(hist_bins*df[var].rank(pct=True)),
        'rank10': KMeans(n_clusters=5,random_state=42).fit_predict(df[var].values.reshape(-1, 1)),
        #'rank10': np.ceil(5*df[var].rank(pct=True))*2,
        var: preprocess(df[var].values) if preprocess is not None else df[var],
        target:df[target],
        datetime:df[datetime]
    })    
    grouped = temp.groupby('rank').mean();
    sns.regplot(x=var,y=target,ax=plt.gca(), data=grouped);
    
    # correlation between the bucket mean of the variable and the bucket mean of the target
    plt.title("Correlation %0.4f" % grouped[var].corr(grouped[target]));
    plt.tight_layout();
    plt.show();
    
    #plot 3
    plt.figure(figsize=(figsize[0],3))
    ax = plt.gca();
    temp.pivot_table(index=datetime, columns='rank10',values=target).plot(ax=ax);
    ax.legend(loc='upper center', bbox_to_anchor=(0.5, -0.10), fancybox=True, shadow=True, ncol=10);
    ax.set_xlabel("YMS");
    ax.set_ylabel("Mean target");
    ax.set_title("target per cluster of %s over time"%var);

    plt.tight_layout();

In [None]:
EDA_continuous_v1(df.query('YMS < 7'),'TREND_PER_PAYED', preprocess = lambda x:1-np.abs(x))

**TREND_PER_IU**: its relationship with the target has an inverted "V" shape in the lower values, which may indicate that this variable should be used as categorical (each category representing a part of the curve).

In [None]:
EDA_continuous_v1(df.query('YMS < 7'),'TREND_PER_IU')

In [None]:
#applying the transformation seen above that make the relationship more linear
df['TREND2_PER_PAYED'] = 1-np.abs(df['TREND_PER_PAYED'].values)

## Age var

In [None]:
EDA_continuous_v1(df.query('YMS < 7'),'AGE', preprocess=lambda x: abs(x-32)/32)

In [None]:
df['AGE_NORM'] = np.abs(df['AGE']-32)/32

### Creating categorical variables from non-linear or unstable variables

In [None]:
def temp(x):
    if x < -0.8:
        return 'a';
    elif x < -0.5:
        return 'b';
    elif x < 0.5:
        return 'c';
    elif x < 0.8:
        return 'd';
    else:
        return 'x';

df['TREND2_IU']=df['TREND_PER_IU'].apply(temp)
pd.DataFrame({
        'rank': df['TREND_PER_IU'].apply(temp),
        'YMS':df['YMS'],
        'var': df['TREND_PER_IU'],
        'target':df['target']
    }).pivot_table(index='YMS',columns='rank',values='target').round(2).plot(figsize=(15,4));

In [None]:
def income2Range(x):
    # income is generally exponential, so bucket it on a log scale.
    x = x/1000;
    
    return str(int(np.log(x)))

# Assuming the limit given to the client has something to do with their income.
# In practice, risk + income define the limit, so in the real world using the limit as an input would break your model.
df['PROXY_INCOME']=df['LIMIT_BAL'].apply(income2Range)
pd.DataFrame({
        'rank': df['LIMIT_BAL'].apply(income2Range),
        'YMS':df['YMS'],
        'var': df['LIMIT_BAL'],
        'target':df['target']
    }).pivot_table(index='YMS',columns='rank',values='target').round(2).plot(figsize=(15,4));

display(HTML("We can see that different limits/incomes have different risks"))

## Auxiliary functions / reports
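
The report built in this section summarizes, per score band, the share of events and non-events, the odds, the lift and the KS statistic. As a hedged, standalone illustration of those quantities, here they are on made-up counts (three bands, 1000 clients); the column names mirror the report loosely and are not its exact output.

```python
import pandas as pd

# Made-up counts per score band, ordered from lowest to highest risk.
report = pd.DataFrame({
    "score": [0, 1, 2],
    "event": [10, 30, 60],
    "non_event": [490, 270, 140],
}).set_index("score")

report["count"] = report["event"] + report["non_event"]
report["bad_rate"] = report["event"] / report["count"]

# Share of all events / non-events that falls in each band.
report["per_event"] = report["event"] / report["event"].sum()
report["per_non_event"] = report["non_event"] / report["non_event"].sum()
report["odds"] = report["per_event"] / report["per_non_event"]

# Lift: bad rate of the band relative to the overall bad rate.
overall_rate = report["event"].sum() / report["count"].sum()
report["lift"] = report["bad_rate"] / overall_rate

# KS: largest gap between the cumulative event and non-event distributions.
ks = (report["per_event"].cumsum() - report["per_non_event"].cumsum()).abs().max()
print(report.round(3))
print("KS = %0.3f" % ks)
```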

In [None]:
def sort_score(df):
    """
    If the score was created using KMeans, the cluster labels do not come ordered. This function relabels the scores in place so that they increase with the mean predicted probability.
    """
    temp = df.groupby('score').apply(lambda x:x['prob'].mean());
    temp = temp.reset_index().sort_values(0).reset_index(drop=True).reset_index()
    temp = dict(zip(temp['score'],temp['index']));
    df['score'] = df['score'].map(temp)

In [None]:
def model_report(df, query_test, target='target',prob='prob'):
    def model_stats(x):
        s = pd.Series({
            'prob':x[prob].mean(),
            target:x[target].mean(),
            'count':len(x),
            'event':x[target].sum(),
            'non_event': len(x) - x[target].sum()
        });
        return s

    temp = df.query(query_test).groupby('score').apply(model_stats).round(2);
    temp['per_event']     = temp['event'] / temp['event'].sum();
    temp['non_per_event'] = temp['non_event'] / temp['non_event'].sum();
    temp['odds']          = temp['per_event']/temp['non_per_event']
    temp['per_pop_acc']   = 1-temp['count'].cumsum()/temp['count'].sum();
    temp['per_event_acc'] = 1-temp['event'].cumsum()/temp['event'].sum();
    temp['lift']          = temp[target] / ( temp['event'].sum()/ temp['count'].sum());
    
    display(temp.reset_index().round(2))
    temp_total = temp.mean();
    display(temp_total.to_frame().T.round(2))

    plt.figure(figsize=(12,8))
    plt.subplot(221);
    temp.plot.scatter(x='prob',y=target,ax=plt.gca());
    plt.xlabel('Prob');
    plt.ylabel('Mean target');

    plt.subplot(222);
    plt.title('Strategy plot')
    temp[['per_pop_acc','per_event_acc']].plot(ax=plt.gca());
    plt.grid();
    #print(temp.reset_index().columns)
    
    plt.subplot(223);
    plt.title('KS %0.2f'% np.abs( temp['per_event'].cumsum() -temp['non_per_event'].cumsum()).max())
    temp.reset_index().plot.bar(x='prob',y=['per_event','non_per_event'],ax=plt.gca());
    
    #ROC (computed on the same subset as the rest of the report)
    test = df.query(query_test)
    fpr, tpr, _ = roc_curve(test[target],test[prob])
    roc_auc = auc(fpr, tpr)
    
    plt.subplot(224);
    plt.plot(fpr, tpr, color='darkorange',label='ROC curve (area = %0.2f)' % roc_auc);
    plt.plot([0, 1], [0, 1], color='navy',  linestyle='--');
    plt.xlim([0.0, 1.0]);
    plt.ylim([0.0, 1.05]);
    plt.xlabel('False Positive Rate');
    plt.ylabel('True Positive Rate');
    plt.title('ROC curve');
    plt.legend(loc="lower right");
    plt.tight_layout();
    plt.show();
    
    
    # stability 
    plt.figure(figsize=(12,4))
    plt.subplot(121);
    df.pivot_table(index='YMS',columns='score',values=target).plot(ax=plt.gca());
    plt.title("Score stability - good scores must not cross each other");
    
    plt.subplot(122);
    df.pivot_table(index='YMS',columns='score',values=target, aggfunc='count').plot(ax=plt.gca(), kind='bar', stacked=True);
    plt.title("Score distribution - number of observations per score over time");
    plt.tight_layout();
    plt.show();
    
#model_report(df,'YMS >= 7 ')

In [None]:
import statsmodels.api as sm
from statsmodels.formula.api import logit

vars = [
    'PROXY_INCOME',
    'TREND2_IU',
    'TREND_PER_PAYED' ,
    'SEX',
    'MARRIED',
    'IU1',
    'AGE_NORM'
];

df2 = df.copy().replace([np.inf, -np.inf], np.nan).fillna(0);

logit_mod = logit('target ~ ' + (' + '.join(vars)), df2.query('YMS < 7'))
logit_res = logit_mod.fit(disp=0)
display(logit_res.summary())

df['prob'] = logit_res.predict(df2)
#df['score'] = np.ceil(df['prob'].rank(pct=True)*10)
df['score']= KMeans(n_clusters=7,random_state=42).fit_predict(df['prob'].values.reshape(-1, 1))
sort_score(df)

In [None]:
model_report(df,'YMS >= 7 ')

# Let's try some machine learning models
    
    
    
    

In [None]:
df2 = df.copy().replace([np.inf, -np.inf], np.nan).fillna(0);

# creating the dummies manually
dummy = pd.get_dummies(df[['PROXY_INCOME','TREND2_IU']],drop_first=True,prefix=['PROXY_INCOME','TREND2_IU'])
# concatenate onto the cleaned frame (df2), otherwise the inf/NaN cleanup above is discarded
df2 = pd.concat([
    df2,
    dummy
], axis=1)

vars = [
    'SEX',
    'MARRIED',
    'TREND_PER_PAYED' ,
    'IU1',
] + dummy.columns.tolist()

## Let's try regularized logistic regression

In [None]:
%%time
from sklearn.linear_model import LogisticRegression

clf = LogisticRegression(C=0.2,random_state=42, max_iter=1000)\
        .fit(df2.query('YMS < 7')[vars],df2.query('YMS < 7')['target'])
#display(logit_res.summary())
#display(pd.DataFrame({'var':vars, 'importance':clf.feature_importances_}))

df['prob']  = clf.predict_proba(df2[vars])[:, 1]  # probability of the positive class (target=1)
#df['score'] = np.ceil(df['prob'].rank(pct=True)*10)
df['score']= KMeans(n_clusters=7,random_state=42).fit_predict(df['prob'].values.reshape(-1, 1))
sort_score(df)


display(HTML("<h2>Model with %s </h2>"% clf.__class__.__name__))
model_report(df,'YMS >= 7 ')

## Let's try vanilla neural nets

In [None]:
%%time
from sklearn.neural_network import MLPClassifier

clf = MLPClassifier(hidden_layer_sizes=[20,10,5], random_state=42)\
        .fit(df2.query('YMS < 7')[vars],df2.query('YMS < 7')['target'])
#display(logit_res.summary())
#display(pd.DataFrame({'var':vars, 'importance':clf.feature_importances_}))

df['prob']  = clf.predict_proba(df2[vars])[:, 1]  # probability of the positive class (target=1)
#df['score'] = np.ceil(df['prob'].rank(pct=True)*10)
df['score']= KMeans(n_clusters=7,random_state=42).fit_predict(df['prob'].values.reshape(-1, 1))
sort_score(df)


display(HTML("<h2>Model with %s </h2>"% clf.__class__.__name__))
model_report(df,'YMS >= 7 ')

## Let's try simple decision trees

In [None]:
%%time
from sklearn.tree import DecisionTreeClassifier, plot_tree

clf = DecisionTreeClassifier(
        criterion='gini',
        min_samples_split=0.05,
        #max_depth=4,
        max_leaf_nodes=20,
        random_state=42)\
        .fit(df2.query('YMS < 7')[vars],df2.query('YMS < 7')['target'])

df['prob']  = clf.predict_proba(df2[vars])[:, 1]  # probability of the positive class (target=1)
#df['score'] = np.ceil(df['prob'].rank(pct=True)*10)
df['score']= KMeans(n_clusters=7,random_state=42).fit_predict(df['prob'].values.reshape(-1, 1))
sort_score(df)


display(HTML("<h2>Model with %s </h2>"% clf.__class__.__name__))
model_report(df,'YMS >= 7 ')

In [None]:
plt.figure(figsize=(20,10))
plot_tree(clf,filled=True,proportion=True, feature_names=vars);

In [None]:
%%time
from sklearn.ensemble import BaggingClassifier

# the base model is passed positionally because its keyword name differs across scikit-learn versions
clf = BaggingClassifier(LogisticRegression(), n_estimators=15, random_state=42)\
        .fit(df2.query('YMS < 7')[vars],df2.query('YMS < 7')['target'])

df['prob']  = clf.predict_proba(df2[vars])[:, 1]  # probability of the positive class (target=1)
#df['score'] = np.ceil(df['prob'].rank(pct=True)*10)
df['score']= KMeans(n_clusters=8,random_state=42).fit_predict(df['prob'].values.reshape(-1, 1))
sort_score(df)


display(HTML("<h2>Model with %s </h2>"% clf.__class__.__name__))
model_report(df,'YMS >= 7 ')

## Let's try a boosting algorithm

In [None]:
%%time
from sklearn.ensemble import GradientBoostingClassifier

clf = GradientBoostingClassifier(min_samples_split=0.05,random_state=42)\
        .fit(df2.query('YMS < 7')[vars],df2.query('YMS < 7')['target'])

pd.DataFrame({'var':vars, 'importance':clf.feature_importances_})\
    .sort_values('importance',ascending=False)\
    .set_index('var').plot(kind='bar', title='Feature Importance')

df['prob']  = clf.predict_proba(df2[vars])[:, 1]  # probability of the positive class (target=1)
#df['score'] = np.ceil(df['prob'].rank(pct=True)*10)
df['score']= KMeans(n_clusters=7,random_state=42).fit_predict(df['prob'].values.reshape(-1, 1))
sort_score(df)

display(HTML("<h2>Model with %s </h2>"% clf.__class__.__name__))
model_report(df,'YMS >= 7 ')

# Model quality check

Checking if the model has bias by income level

In [None]:
df.pivot_table(index='score',columns='PROXY_INCOME',values='BILL_AMT1').round(0).style.background_gradient(axis=0)

In [None]:
pd.DataFrame({
        'rank': np.ceil(df['score'].rank(pct=True)*10),
        'mean_age': df['AGE'],
        'per_female':df['SEX'],
        'MARRIED': df['MARRIED'],
    }).groupby('rank').mean().round(2).style.background_gradient(axis=0)
