This notebook shows how to develop a model using `parceling` for reject inference

<span style='color:blue'>Import the modules

In [1]:
import sys, numpy as np, pandas as pd, memento as me

<span style='color:blue'>Load the data

In [2]:
df = pd.read_csv('hmeq.csv')

<span style='color:blue'>Lowercase the column names, rename the target to `target_original`, and add an `id`

In [3]:
df.columns = ['target_original'] + [col.lower() for col in df.columns[1:]]
df.insert(0, 'id', [str(i).zfill(4) for i in range(1, len(df)+1)])

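<span style='color:blue'>Generate the rejected applications at random, marking 25% of the sample as rejected (`denegado`). In practice this is unrealistic, because rejected applicants generally have a worse risk profile, but it is good enough for an example. Approved rows keep their original target; rejected rows get `target = -3`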

In [4]:
mask_approved = np.array([True]*round(len(df)*0.75)+[False]*(len(df)-round(len(df)*0.75)))  # True = approved (75%)
np.random.seed(123) # Fix the seed so the split is reproducible
np.random.shuffle(mask_approved)
df['decision'] = np.where(mask_approved, 'aprobado', 'denegado')  # labels kept in Spanish: approved / rejected
df['target'] = np.where(mask_approved, df['target_original'], -3)  # -3 marks a rejected row with unknown outcome
df.head()

,id,target_original,loan,mortdue,value,reason,job,yoj,derog,delinq,clage,ninq,clno,debtinc,decision,target
0,1,1,1100,25860.0,39025.0,HomeImp,Other,10.5,0.0,0.0,94.366667,1.0,9.0,,aprobado,1
1,2,1,1300,70053.0,68400.0,HomeImp,Other,7.0,0.0,2.0,121.833333,0.0,14.0,,aprobado,1
2,3,1,1500,13500.0,16700.0,HomeImp,Other,4.0,0.0,0.0,149.466667,1.0,10.0,,aprobado,1
3,4,1,1500,,,,,,,,,,,,aprobado,1
4,5,0,1700,97800.0,112000.0,HomeImp,Office,3.0,0.0,0.0,93.333333,0.0,14.0,,denegado,-3
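The masking step can be checked on a toy sample. The sketch below uses eight made-up rows and NumPy's `default_rng` generator rather than the global seed used in the notebook (a swapped-in choice, not what the cell above does). The array is called `mask_approved` here because `True` marks an approved row. Building the mask with an exact count of `True` values and then shuffling it gives an exact 75/25 split, not an approximate one.

```python
import numpy as np

n = 8                                   # toy sample size (hypothetical)
n_approved = round(n * 0.75)            # 6 approved, 2 rejected
mask_approved = np.array([True] * n_approved + [False] * (n - n_approved))

rng = np.random.default_rng(123)        # local generator; the notebook uses np.random.seed
rng.shuffle(mask_approved)              # in-place shuffle keeps the exact counts

target_original = np.array([1, 0, 0, 1, 0, 0, 1, 0])   # made-up outcomes
decision = np.where(mask_approved, 'aprobado', 'denegado')
target = np.where(mask_approved, target_original, -3)  # -3 = outcome unknown

print(decision)
print(target)
```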


<span style='color:blue'>Check the distribution of rejected, good, and bad cases

In [5]:
me.proc_freq(df, 'decision', 'target')

decision,target=-3,target=0,target=1
aprobado,0,3567,903
denegado,1490,0,0
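The table above has the shape of a plain two-way frequency table. Assuming `me.proc_freq` behaves like a cross-tabulation (its implementation is not shown here), a similar check can be sketched with pandas alone. The toy rows are made up.

```python
import pandas as pd

# Toy data with the same coding as the notebook: -3 = rejected, 0 = good, 1 = bad
toy = pd.DataFrame({
    'decision': ['aprobado', 'aprobado', 'denegado', 'aprobado', 'denegado'],
    'target':   [0, 1, -3, 0, -3],
})

# Rows: decision, columns: target value, cells: row counts
freq = pd.crosstab(toy['decision'], toy['target'])
print(freq)
```

Rejected rows should only ever appear in the `-3` column and approved rows never should, which makes this a quick sanity check on the masking step.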


<span style='color:blue'>The first step is to fit a scorecard on the accepted applications only

In [6]:
df_aceptados = df[df.decision == 'aprobado']
X, y = df_aceptados.drop('target', axis=1), df_aceptados.target.values
modelo_aceptados = me.Scorecard(excluded_vars=['id', 'target_original', 'decision']).fit(X, y)

Particionado 70-30 estratificado en el target terminado
------------------------------------------------------------------------------------------------------------------------
Autogrouping terminado. Máximo número de buckets = 5. Mínimo porcentaje por bucket = 0.05
------------------------------------------------------------------------------------------------------------------------
Cuidado, has puesto un valor numero máximo de iteraciones (14) superior al número de variables candidatas (11)
------------------------------------------------------------------------------------------------------------------------
Step 01 | 0:00:00.565534 | pv = 0.00e+00 | Gini train = 68.93% | Gini test = 66.22% ---> Feature selected: debtinc
Step 02 | 0:00:00.491939 | pv = 1.77e-36 | Gini train = 76.58% | Gini test = 72.63% ---> Feature selected: delinq
Step 03 | 0:00:00.568667 | pv = 5.14e-27 | Gini train = 81.30% | Gini test = 76.92% ---> Feature selected: clage
Step 04 | 0:00:00.672944 | pv = 1.39e-

<span style='color:blue'>Score the whole sample with the accepted-only scorecard; these scores are what we use to infer the target the rejected applications would have had

In [7]:
prediction = modelo_aceptados.predict(df, keep_columns=['id'])[['id', 'scorecardpoints']]
prediction = prediction.rename(columns={'scorecardpoints': 'scorecardpoints_acep'})
df2 = df.merge(prediction, how='left', on='id')

<span style='color:blue'>Apply parceling

In [8]:
df3, c = me.parceling(df2)
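The internals of `me.parceling` are not shown in this notebook, so the following is only a minimal sketch of the textbook parceling technique, not the library's implementation. It splits the population into score bands, measures the bad rate of the accepted cases in each band, and randomly labels the rejected cases in that band as good or bad at that rate, optionally inflated by a factor. The function name `parceling_sketch`, its parameters, and the toy data are all hypothetical; only the column names `scorecardpoints_acep`, `target`, and `target_def` come from the notebook. It assumes there are no missing scores.

```python
import numpy as np
import pandas as pd

def parceling_sketch(data, score_col, target_col, reject_value=-3,
                     n_bins=5, factor=1.0, seed=123):
    """Illustrative parceling; NOT memento's implementation."""
    out = data.reset_index(drop=True).copy()
    rng = np.random.default_rng(seed)
    rejected = (out[target_col] == reject_value).to_numpy()
    # Score bands built from quantiles of the whole population
    parcel = pd.qcut(out[score_col], q=n_bins, labels=False,
                     duplicates='drop').to_numpy()
    target_def = out[target_col].to_numpy().copy()
    overall_bad_rate = target_def[~rejected].mean()
    for p in np.unique(parcel):
        accepted = (parcel == p) & ~rejected
        to_infer = (parcel == p) & rejected
        # Bad rate of accepted cases in the band (overall rate if the band has none)
        base = target_def[accepted].mean() if accepted.any() else overall_bad_rate
        bad_rate = min(1.0, factor * base)
        # Label each rejected case as bad (1) with probability bad_rate
        draws = rng.random(int(to_infer.sum()))
        target_def[to_infer] = (draws < bad_rate).astype(int)
    out['parcel'] = parcel
    out['target_def'] = target_def
    return out

# Toy data: higher score means lower risk (made-up relationship)
rng = np.random.default_rng(0)
n = 1000
score = rng.normal(500, 50, n)
p_bad = 1 / (1 + np.exp((score - 450) / 25))
y = (rng.random(n) < p_bad).astype(int)
is_rejected = rng.random(n) < 0.25
toy = pd.DataFrame({'scorecardpoints_acep': score,
                    'target': np.where(is_rejected, -3, y)})

result = parceling_sketch(toy, 'scorecardpoints_acep', 'target')
print(result.groupby('parcel')['target_def'].mean())
```

A `factor` above 1 encodes the usual assumption that rejected applicants are riskier than accepted ones with the same score. Accepted cases always keep their observed target.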

<span style='color:blue'>Now that the rejected applications have an inferred target, fit a second scorecard on a new 70-30 partition, using the whole sample (accepted + rejected)

In [9]:
X_def, y_def = df3[X.columns], df3.target_def
modelo_def = me.Scorecard(
    excluded_vars=['id', 'target_original', 'decision'], save_tables='all'
).fit(X_def, y_def)

Particionado 70-30 estratificado en el target terminado
------------------------------------------------------------------------------------------------------------------------
Autogrouping terminado. Máximo número de buckets = 5. Mínimo porcentaje por bucket = 0.05
------------------------------------------------------------------------------------------------------------------------
Cuidado, has puesto un valor numero máximo de iteraciones (14) superior al número de variables candidatas (11)
------------------------------------------------------------------------------------------------------------------------
Step 01 | 0:00:00.397887 | pv = 0.00e+00 | Gini train = 67.53% | Gini test = 67.49% ---> Feature selected: debtinc
Step 02 | 0:00:00.543674 | pv = 4.65e-42 | Gini train = 74.84% | Gini test = 75.90% ---> Feature selected: delinq
Step 03 | 0:00:00.589229 | pv = 1.37e-34 | Gini train = 79.47% | Gini test = 79.77% ---> Feature selected: clage
Step 04 | 0:00:00.628488 | pv = 6.20e-

<span style='color:blue'>Also evaluate the model on the accepted applications only (within the 70-30 partition of the final model)

In [10]:
data_train = modelo_def.X_train.copy()
data_train['target'] = modelo_def.y_train
data_train_oa = data_train[data_train.decision == 'aprobado'].reset_index(drop=True)
data_train_final_oa = modelo_def.predict(data_train_oa, target_name='target')
ks_train, gini_train = me.compute_metrics(data_train_final_oa, 'target', ['gini', 'ks'], True)
print('-'*80)
data_test = modelo_def.X_test.copy()
data_test['target'] = modelo_def.y_test
data_test_oa = data_test[data_test.decision == 'aprobado'].reset_index(drop=True)
data_test_final_oa = modelo_def.predict(data_test_oa, target_name='target')
ks_test, gini_test = me.compute_metrics(data_test_final_oa, 'target', ['gini', 'ks'], True)

El  modelo tiene un 69.64% de KS y un 84.26% de Gini en esta muestra
--------------------------------------------------------------------------------
El  modelo tiene un 71.04% de KS y un 83.38% de Gini en esta muestra
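For readers without access to `me.compute_metrics`, the two figures printed above can be approximated from first principles. The sketch below uses the standard definitions: KS is the largest gap between the cumulative distributions of bads and goods along the score, and Gini is `2 * AUC - 1`. The function name `gini_ks` is hypothetical, it assumes a higher score means lower risk, and it is not claimed to match the library's implementation exactly.

```python
import numpy as np

def gini_ks(y_true, score):
    """Gini and KS for a score where higher means lower risk (assumption)."""
    y = np.asarray(y_true)
    s = np.asarray(score, dtype=float)
    order = np.argsort(s, kind='mergesort')    # ascending: riskiest cases first
    y, s = y[order], s[order]
    cum_bad = np.cumsum(y) / y.sum()           # share of bads at or below each score
    cum_good = np.cumsum(1 - y) / (1 - y).sum()
    # Keep one point per distinct score so ties are handled consistently
    last = np.r_[s[1:] != s[:-1], True]
    cb = np.r_[0.0, cum_bad[last]]
    cg = np.r_[0.0, cum_good[last]]
    ks = float(np.max(np.abs(cb - cg)))
    # Trapezoidal area under the cumulative-bads vs cumulative-goods curve
    auc = float(np.sum(np.diff(cg) * (cb[1:] + cb[:-1]) / 2))
    return 2 * auc - 1, ks

# Perfect separation: both bads score below both goods
gini_perfect, ks_perfect = gini_ks([1, 1, 0, 0], [10, 20, 30, 40])
# Partial separation: one bad scores above one good
gini_partial, ks_partial = gini_ks([1, 0, 1, 0], [10, 20, 30, 40])
print(gini_perfect, ks_perfect, gini_partial, ks_partial)
```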


<span style='color:blue'>Show the final scorecard

In [11]:
me.pretty_scorecard(modelo_def)

,Variable,Group,Count,Percent,Goods,Bads,Bad rate,WoE,IV,Raw score,Aligned score
0,debtinc,Missing,877,0.210211,328,549,0.625998,-1.907984,1.074769,1.841146,24
1,debtinc,"(-inf, 22.58)",266,0.063758,241,25,0.093985,0.873022,0.03666,-0.84244,101
2,debtinc,"[22.58, 30.77)",763,0.182886,734,29,0.038008,1.838314,0.339517,-1.773917,128
3,debtinc,"[30.77, 40.89)",1844,0.441994,1723,121,0.065618,1.263133,0.467077,-1.218884,112
4,debtinc,"[40.89, 42.90)",257,0.061601,230,27,0.105058,0.749343,0.027194,-0.723094,98
5,debtinc,"[42.90, inf)",165,0.039549,86,79,0.478788,-1.308,0.090837,1.26218,41
6,delinq,Missing,400,0.095877,358,42,0.105,0.749964,0.042387,-0.699137,97
7,delinq,"(-inf, 0.50)",2926,0.701342,2515,411,0.140465,0.418536,0.107716,-0.39017,88
8,delinq,"[0.50, 2.50)",633,0.151726,388,245,0.387046,-0.933152,0.167111,0.869909,52
9,delinq,"[2.50, inf)",213,0.051055,81,132,0.619718,-1.881252,0.253591,1.753754,26
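The `WoE` and `IV` columns for `debtinc` can be reproduced from the `Goods` and `Bads` counts in the table using the usual definitions: WoE is the log of the ratio between a group's share of goods and its share of bads, and the IV contribution is the difference between those shares times the WoE. The counts below are copied from the table. That `memento` computes the columns exactly this way is an inference from the numbers matching, not something the notebook states.

```python
import math

# (group, goods, bads) for debtinc, copied from the scorecard table
debtinc = [
    ('Missing',         328, 549),
    ('(-inf, 22.58)',   241,  25),
    ('[22.58, 30.77)',  734,  29),
    ('[30.77, 40.89)', 1723, 121),
    ('[40.89, 42.90)',  230,  27),
    ('[42.90, inf)',     86,  79),
]
total_goods = sum(goods for _, goods, _ in debtinc)
total_bads = sum(bads for _, _, bads in debtinc)

rows = []
for group, goods, bads in debtinc:
    pct_goods = goods / total_goods        # group's share of all goods
    pct_bads = bads / total_bads           # group's share of all bads
    woe = math.log(pct_goods / pct_bads)   # positive = safer than average
    iv = (pct_goods - pct_bads) * woe      # group's contribution to the variable's IV
    rows.append((group, woe, iv))
    print(f'{group:<16} WoE = {woe:+.4f}  IV = {iv:.4f}')
```

The `Missing` group has by far the most negative WoE and the largest IV contribution, so a missing `debtinc` is the strongest single risk signal in this scorecard.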
