This notebook shows how to develop a model using `parceling` as the reject inference method.

<span style='color:blue'>Import the modules</span>

In [1]:
import numpy as np, pandas as pd, pyken as pyk

<span style='color:blue'>Load the dataset</span>

In [2]:
df = pd.read_csv('hmeq.csv')
print('The dataset has {} rows and {} columns (including the target)'.format(df.shape[0], df.shape[1]))

The dataset has 5960 rows and 13 columns (including the target)


<span style='color:blue'>Convert the column names to lowercase, rename the target to `target_original` and add an id column</span>

In [3]:
df.columns = ['target_original'] + [col.lower() for col in list(df.columns[1:])]
df.insert(0, 'id', [str(i).zfill(4) for i in range(1, len(df)+1)])
df.head()

Unnamed: 0,id,target_original,loan,mortdue,value,reason,job,yoj,derog,delinq,clage,ninq,clno,debtinc
0,1,1,1100,25860.0,39025.0,HomeImp,Other,10.5,0.0,0.0,94.366667,1.0,9.0,
1,2,1,1300,70053.0,68400.0,HomeImp,Other,7.0,0.0,2.0,121.833333,0.0,14.0,
2,3,1,1500,13500.0,16700.0,HomeImp,Other,4.0,0.0,0.0,149.466667,1.0,10.0,
3,4,1,1500,,,,,,,,,,,
4,5,0,1700,97800.0,112000.0,HomeImp,Office,3.0,0.0,0.0,93.333333,0.0,14.0,


<span style='color:blue'>Generate the rejected applicants at random, flagging 25% of the rows as rejected and setting their `target` to -3. In practice rejects should not be random, because they generally have a worse risk profile, but this is only an example.</span>

In [4]:
# True marks an approved application (75% of the rows); False marks a rejected one
n_approved = round(len(df)*0.75)
mask_approved = np.array([True]*n_approved + [False]*(len(df)-n_approved))
np.random.seed(123) # Fix the seed so the split is reproducible
np.random.shuffle(mask_approved)
df['decision'] = np.where(mask_approved, 'aprobado', 'rechazado')
df['target'] = np.where(mask_approved, df['target_original'], -3)
df.head()

Unnamed: 0,id,target_original,loan,mortdue,value,reason,job,yoj,derog,delinq,clage,ninq,clno,debtinc,decision,target
0,1,1,1100,25860.0,39025.0,HomeImp,Other,10.5,0.0,0.0,94.366667,1.0,9.0,,aprobado,1
1,2,1,1300,70053.0,68400.0,HomeImp,Other,7.0,0.0,2.0,121.833333,0.0,14.0,,aprobado,1
2,3,1,1500,13500.0,16700.0,HomeImp,Other,4.0,0.0,0.0,149.466667,1.0,10.0,,aprobado,1
3,4,1,1500,,,,,,,,,,,,aprobado,1
4,5,0,1700,97800.0,112000.0,HomeImp,Office,3.0,0.0,0.0,93.333333,0.0,14.0,,rechazado,-3
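
The random flagging step can also be written with NumPy's `Generator` API. This is only a sketch: the helper name `random_approval_mask` is hypothetical, and because it uses a different random stream it will not select the same rows as the seeded `np.random.shuffle` call above. It does confirm the arithmetic: 75% of 5960 rows leaves 4470 approved and 1490 rejected.

```python
import numpy as np

def random_approval_mask(n_rows, approval_rate=0.75, seed=123):
    """Boolean array with exactly round(n_rows * approval_rate) True values (hypothetical helper)."""
    n_approved = round(n_rows * approval_rate)
    mask = np.zeros(n_rows, dtype=bool)
    mask[:n_approved] = True          # first block approved, remainder rejected
    rng = np.random.default_rng(seed)  # local generator, no global state touched
    return rng.permutation(mask)       # shuffled copy of the mask

mask = random_approval_mask(5960)
print(int(mask.sum()), int((~mask).sum()))
```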


<span style='color:blue'>Check the distribution of rejected applicants (-3), goods (0) and bads (1)</span>

In [5]:
pyk.proc_freq(df, 'target').to_dict()['frequency']

{-3: 1490, 0: 3567, 1: 903}
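
For readers without `pyken`, a similar frequency check can be sketched with plain pandas. The series below is rebuilt from the counts printed above rather than from the real data, and treating `value_counts` as a stand-in for `pyk.proc_freq` is an assumption. It also gives the shares and the bad rate among accepted applicants.

```python
import pandas as pd

# Rebuild the target column from the counts reported above.
target = pd.Series([-3] * 1490 + [0] * 3567 + [1] * 903, name="target")

counts = target.value_counts().sort_index()
shares = counts / counts.sum()

# Bad rate among accepted applicants only (rejects carry the -3 flag).
accepted = target[target != -3]
bad_rate_accepted = accepted.mean()

print(counts.to_dict())
print(shares.round(4).to_dict())
print(round(bad_rate_accepted, 4))
```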

<span style='color:blue'>First, build a scorecard *on accepted applicants only*</span>

In [6]:
df_aceptados = df[df.decision == 'aprobado']
pyk.proc_freq(df_aceptados, 'target').to_dict()['frequency']

{0: 3567, 1: 903}

In [7]:
X, y = df_aceptados.drop('target', axis=1), df_aceptados.target.values

<span style='color:blue'>Exclude the variables that cannot be used for development, plus `debtinc` because it is too strongly discriminating</span>

In [8]:
modelo_aceptados = pyk.autoscorecard(excluded_vars=['id', 'target_original', 'decision', 'debtinc']).fit(X, y)

Stratified 70-30 partition on the target completed.
------------------------------------------------------------------------------------------------------------------------------------------------------
Autogrouping completed. Maximum number of buckets = 5. Minimum percentage per bucket = 0.05
------------------------------------------------------------------------------------------------------------------------------------------------------
Ungrouped variables: ['reason', 'job']
------------------------------------------------------------------------------------------------------------------------------------------------------
Warning: the maximum number of iterations (12) is greater than the number of candidate variables (9)
------------------------------------------------------------------------------------------------------------------------------------------------------
Step 01 | Time - 0:00:00.363279 | p-value = 8.86e-68 | Gini train = 33.58% | Gini test = 32.99% -

<span style='color:blue'>Use this accepted-only scorecard to *infer* what the target of the rejected applicants would have been</span>

In [9]:
prediction = modelo_aceptados.transform(df, id_columns=['id'])[['id', 'scorecardpoints']]
df2 = df.merge(prediction.rename(columns={'scorecardpoints': 'scorecardpoints_acep'}), how='left', on='id')
df2.head()

Unnamed: 0,id,target_original,loan,mortdue,value,reason,job,yoj,derog,delinq,clage,ninq,clno,debtinc,decision,target,scorecardpoints_acep
0,1,1,1100,25860.0,39025.0,HomeImp,Other,10.5,0.0,0.0,94.366667,1.0,9.0,,aprobado,1,470.0
1,2,1,1300,70053.0,68400.0,HomeImp,Other,7.0,0.0,2.0,121.833333,0.0,14.0,,aprobado,1,493.0
2,3,1,1500,13500.0,16700.0,HomeImp,Other,4.0,0.0,0.0,149.466667,1.0,10.0,,aprobado,1,496.0
3,4,1,1500,,,,,,,,,,,,aprobado,1,440.0
4,5,0,1700,97800.0,112000.0,HomeImp,Office,3.0,0.0,0.0,93.333333,0.0,14.0,,rechazado,-3,522.0
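
The left join on `id` assumes exactly one score per application. A minimal sketch of how that assumption could be guarded, using toy frames rather than the real data: pandas' `validate` argument makes the merge fail loudly if an id is duplicated on either side.

```python
import pandas as pd

# Toy stand-ins for the application table and the scored output.
applications = pd.DataFrame({"id": ["0001", "0002", "0003"], "loan": [1100, 1300, 1500]})
scores = pd.DataFrame({"id": ["0001", "0002", "0003"], "scorecardpoints": [470.0, 493.0, 496.0]})

merged = applications.merge(
    scores.rename(columns={"scorecardpoints": "scorecardpoints_acep"}),
    how="left",
    on="id",
    validate="one_to_one",  # raises pandas.errors.MergeError if ids repeat
)

# Every application should have received a score.
print(merged["scorecardpoints_acep"].notna().all())
```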


<span style='color:blue'>Apply *parceling*. More information at: https://blogs.sas.com/content/sasla/2020/11/12/sommelier-de-riesgo-entrega-8-riesgo-de-credito-tecnicas-de-inferencia-de-denegados/</span>

In [10]:
df3, c = pyk.parceling(df2, randomly=False)
df3.head()

Breakpoints: [310.0, 332.07, 354.13, 376.2, 398.27, 420.33, 442.4, 464.47, 486.53, 508.6, 530.67, 552.73, 574.8, 596.87, 618.93]


Unnamed: 0,id,target_original,loan,mortdue,value,reason,job,yoj,derog,delinq,clage,ninq,clno,debtinc,decision,target,scorecardpoints_acep,parcel,target_inf,target_def
0,1,1,1100,25860.0,39025.0,HomeImp,Other,10.5,0.0,0.0,94.366667,1.0,9.0,,aprobado,1,470.0,8,,1.0
1,2,1,1300,70053.0,68400.0,HomeImp,Other,7.0,0.0,2.0,121.833333,0.0,14.0,,aprobado,1,493.0,9,,1.0
2,3,1,1500,13500.0,16700.0,HomeImp,Other,4.0,0.0,0.0,149.466667,1.0,10.0,,aprobado,1,496.0,9,,1.0
3,4,1,1500,,,,,,,,,,,,aprobado,1,440.0,6,,1.0
4,5,0,1700,97800.0,112000.0,HomeImp,Office,3.0,0.0,0.0,93.333333,0.0,14.0,,rechazado,-3,522.0,10,0.0,0.0
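
To make the idea concrete, here is a rough sketch of one common form of parceling. It is not the `pyk.parceling` implementation, whose internals are not shown here. The assumptions are: equal-width score bands (the printed breakpoints are spaced about 22.07 points apart, which is consistent with 15 such bands), rejects in each band receive bads in the same proportion as the accepted applicants of that band, and bands without accepted applicants fall back to the overall accepted bad rate. The function name and its parameters are hypothetical; only the output column names `parcel`, `target_inf` and `target_def` mirror the table above.

```python
import numpy as np
import pandas as pd

def parceling_sketch(df, score_col, target_col, reject_flag=-3, n_bins=15, seed=0):
    """Illustrative parceling: label rejects per score band using the accepted bad rate."""
    out = df.copy()
    rejected = out[target_col] == reject_flag

    # Equal-width bands over the observed score range, numbered from 1.
    edges = np.linspace(out[score_col].min(), out[score_col].max(), n_bins + 1)
    out["parcel"] = pd.cut(out[score_col], bins=edges, labels=False, include_lowest=True) + 1

    overall_bad_rate = out.loc[~rejected, target_col].mean()
    rng = np.random.default_rng(seed)
    out["target_inf"] = np.nan

    for _, band in out.groupby("parcel"):
        is_rej = (band[target_col] == reject_flag).to_numpy()
        n_rej = int(is_rej.sum())
        if n_rej == 0:
            continue
        accepted_targets = band.loc[~is_rej, target_col]
        # Assumption: fall back to the overall rate when a band has no accepted rows.
        bad_rate = accepted_targets.mean() if len(accepted_targets) else overall_bad_rate
        n_bad = int(round(bad_rate * n_rej))
        labels = np.zeros(n_rej)
        labels[:n_bad] = 1.0
        # Which rejects in the band become bads is chosen at random here.
        out.loc[band.index[is_rej], "target_inf"] = rng.permutation(labels)

    # Final target: observed for accepted rows, inferred for rejected rows.
    out["target_def"] = np.where(rejected, out["target_inf"], out[target_col]).astype(float)
    return out

toy = pd.DataFrame({
    "score":  [300, 320, 340, 360, 310, 330, 350, 370, 420, 440, 460, 500, 450, 480],
    "target": [1,   1,   0,   0,   -3,  -3,  -3,  -3,  0,   0,   0,   0,   -3,  -3],
})
result = parceling_sketch(toy, "score", "target", n_bins=2)
print(result[["score", "target", "parcel", "target_inf", "target_def"]])
```

In the toy data the low band has a 50% accepted bad rate, so two of its four rejects are labelled bad, while the high band has no accepted bads and its rejects are all labelled good.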


<span style='color:blue'>Now that the rejected applicants have an inferred target, develop another scorecard on a new 70-30 partition (using everyone: accepted + rejected)</span>

In [11]:
X_def, y_def = df3[X.columns], df3.target_def

In [12]:
modelo_todos = pyk.autoscorecard(excluded_vars=['id', 'target_original', 'decision', 'debtinc'], save_whole_tables=True).fit(X_def, y_def)

Stratified 70-30 partition on the target completed.
------------------------------------------------------------------------------------------------------------------------------------------------------
Autogrouping completed. Maximum number of buckets = 5. Minimum percentage per bucket = 0.05
------------------------------------------------------------------------------------------------------------------------------------------------------
Ungrouped variables: ['reason', 'job']
------------------------------------------------------------------------------------------------------------------------------------------------------
Warning: the maximum number of iterations (12) is greater than the number of candidate variables (9)
------------------------------------------------------------------------------------------------------------------------------------------------------
Step 01 | Time - 0:00:00.490329 | p-value = 7.92e-91 | Gini train = 34.75% | Gini test = 32.90% -
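
The log reports a 70-30 partition stratified on the target. How `pyken` performs that split is not shown, so the following is only a sketch of the general technique with pandas: sample 70% within each target class so that train and test keep the same bad rate. The helper name `stratified_split` and the toy data are hypothetical.

```python
import pandas as pd

def stratified_split(df, target_col, train_frac=0.7, seed=123):
    """Sample train_frac of the rows within each target class; the rest is the test set."""
    train = df.groupby(target_col, group_keys=False).sample(frac=train_frac, random_state=seed)
    test = df.drop(train.index)
    return train, test

# Toy data: 100 rows with a 20% bad rate.
toy = pd.DataFrame({"id": range(100), "target": [1] * 20 + [0] * 80})
train, test = stratified_split(toy, "target")
print(len(train), len(test), train["target"].mean(), test["target"].mean())
```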

In [13]:
modelo_def = modelo_todos

<span style='color:blue'>Also evaluate the model on the accepted applicants only (within the 70-30 partition of the last model)</span>

In [14]:
data_train = modelo_def.X_train.copy()
data_train['target'] = modelo_def.y_train
data_train_oa = data_train[data_train.decision == 'aprobado'].reset_index(drop=True)
data_train_final_oa, ks_train_oa, gini_train_oa = modelo_def.transform(data_train_oa, target_name='target', metrics=['ks', 'gini'])

The model has a KS of 50.11% and a Gini of 64.69% on this sample


In [15]:
data_test = modelo_def.X_test.copy()
data_test['target'] = modelo_def.y_test
data_test_oa = data_test[data_test.decision == 'aprobado'].reset_index(drop=True)
data_test_final_oa, ks_test_oa, gini_test_oa = modelo_def.transform(data_test_oa, target_name='target', metrics=['ks', 'gini'])

The model has a KS of 56.75% and a Gini of 69.13% on this sample
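
For reference, KS and Gini can be computed from scores and a binary target with NumPy alone. This sketch uses the standard definitions (KS as the maximum gap between the cumulative score distributions of bads and goods, Gini as 2·AUC − 1); whether `pyken` computes them in exactly this way is an assumption. The function name `gini_and_ks` is hypothetical, and it returns fractions rather than percentages.

```python
import numpy as np

def gini_and_ks(scores, target):
    """Return (gini, ks) as fractions; target is 1 for bad and 0 for good."""
    scores = np.asarray(scores, dtype=float)
    bad = np.asarray(target).astype(bool)

    # Cumulative share of bads and goods with a score <= each distinct threshold.
    thresholds = np.unique(scores)
    cdf_bad = np.searchsorted(np.sort(scores[bad]), thresholds, side="right") / bad.sum()
    cdf_good = np.searchsorted(np.sort(scores[~bad]), thresholds, side="right") / (~bad).sum()

    ks = float(np.max(np.abs(cdf_bad - cdf_good)))

    # Area under the (cdf_good, cdf_bad) curve by the trapezoid rule, starting at (0, 0).
    x = np.concatenate([[0.0], cdf_good])
    y = np.concatenate([[0.0], cdf_bad])
    auc = float(np.sum((x[1:] - x[:-1]) * (y[1:] + y[:-1]) / 2.0))

    gini = abs(2.0 * auc - 1.0)
    return gini, ks

# Toy example: bads score 1 and 3, goods score 2 and 4.
gini, ks = gini_and_ks([1, 2, 3, 4], [1, 0, 1, 0])
print(f"Gini = {gini:.2%}, KS = {ks:.2%}")
```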


<span style='color:blue'>Display the final scorecard</span>

In [16]:
pyk.pretty_scorecard(modelo_def)

Unnamed: 0,Variable,Group,Count,Percent,Goods,Bads,Bad rate,WoE,IV,Raw score,Aligned score
0,delinq,Missing,404,0.096836,361,43,0.106436,0.751242,0.043006,-0.756171,89
1,delinq,"(-inf, 0.50)",2930,0.702301,2515,415,0.141638,0.425313,0.111249,-0.428104,80
2,delinq,"[0.50, 2.50)",622,0.149089,384,238,0.382637,-0.898064,0.15062,0.903957,41
3,delinq,"[2.50, inf)",216,0.051774,71,145,0.671296,-2.09049,0.315871,2.104206,7
4,clage,Missing,225,0.053931,178,47,0.208889,-0.0448,0.00011,0.042189,66
5,clage,"(-inf, 92.78)",520,0.12464,333,187,0.359615,-0.799402,0.097834,0.752817,46
6,clage,"[92.78, 120.72)",595,0.142617,454,141,0.236975,-0.207099,0.006495,0.19503,62
7,clage,"[120.72, 148.14)",500,0.119847,352,148,0.296,-0.510017,0.035858,0.480296,54
8,clage,"[148.14, 239.43)",1446,0.346596,1219,227,0.156985,0.3044,0.029234,-0.286661,76
9,clage,"[239.43, inf)",886,0.212368,795,91,0.102709,0.791046,0.103202,-0.744948,89
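
The `Bad rate`, `WoE` and `IV` columns can be checked by hand. The sketch below takes the goods and bads counts of the `delinq` rows from the table above and applies the usual definitions, WoE = ln(share of goods / share of bads) and IV = (share of goods − share of bads) × WoE. The results agree with the table to the printed precision. The `Raw score` and `Aligned score` columns depend on the fitted coefficients and on a scaling that is not shown, so they are not reproduced.

```python
import numpy as np
import pandas as pd

# Goods and bads per delinq group, copied from the scorecard table.
delinq = pd.DataFrame({
    "group": ["Missing", "(-inf, 0.50)", "[0.50, 2.50)", "[2.50, inf)"],
    "goods": [361, 2515, 384, 71],
    "bads": [43, 415, 238, 145],
})

# Share of all goods and of all bads that falls in each group.
dist_goods = delinq["goods"] / delinq["goods"].sum()
dist_bads = delinq["bads"] / delinq["bads"].sum()

delinq["bad_rate"] = delinq["bads"] / (delinq["goods"] + delinq["bads"])
delinq["woe"] = np.log(dist_goods / dist_bads)
delinq["iv"] = (dist_goods - dist_bads) * delinq["woe"]

print(delinq.round(6))
```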
