This notebook shows how to build `manual groupings` interactively, which is very useful for more advanced development work.<br>It also shows how `missings` are handled internally.

<span style='color:blue'>We import the modules

In [1]:
import sys, numpy as np, pandas as pd, memento as me

<span style='color:blue'>We load the data

In [2]:
data = pd.read_csv('stroke_data.csv')
X, y = data.drop('stroke', axis=1), data['stroke']
print('The dataset has {} rows and {} columns'.format(X.shape[0], X.shape[1]))

The dataset has 5110 rows and 11 columns


<span style='color:blue'>The numeric variable `bmi` is the only one with missing values

In [3]:
X.isna().sum()

id                     0
gender                 0
age                    0
hypertension           0
heart_disease          0
ever_married           0
work_type              0
Residence_type         0
avg_glucose_level      0
bmi                  201
smoking_status         0
dtype: int64

<span style='color:blue'>We also inject missing values into the text variable `work_type` by replacing the value `Self-employed` with NaN

In [4]:
X['work_type'] = X['work_type'].replace('Self-employed', np.nan)
X.isna().sum()

id                     0
gender                 0
age                    0
hypertension           0
heart_disease          0
ever_married           0
work_type            819
Residence_type         0
avg_glucose_level      0
bmi                  201
smoking_status         0
dtype: int64

<span style='color:blue'>We fit the automatic model, excluding the variable `id` for obvious reasons (it is only a record identifier)

In [5]:
modelo1 = me.scorecard(excluded_vars=['id']).fit(X, y)

Particionado 70-30 estratificado en el target terminado
------------------------------------------------------------------------------------------------------------------------
Autogrouping terminado. Máximo número de buckets = 5. Mínimo porcentaje por bucket = 0.05
------------------------------------------------------------------------------------------------------------------------
Cuidado, has puesto un valor numero máximo de iteraciones (14) superior al número de variables candidatas (9)
------------------------------------------------------------------------------------------------------------------------
Step 01 | 0:00:00.207201 | pv = 1.89e-36 | Gini train = 64.73% | Gini test = 66.78% ---> Feature selected: age
Step 02 | 0:00:00.289906 | pv = 5.96e-11 | Gini train = 70.88% | Gini test = 67.94% ---> Feature selected: bmi
Step 03 | 0:00:00.316756 | pv = 1.58e-05 | Gini train = 72.59% | Gini test = 68.10% ---> Feature selected: avg_glucose_level
----------------------------------

<span style='color:red'>What if I don't want to use the groupings from the autogrouping? What if I want to modify them, or simply use whichever groupings I like?<span style='color:blue'><br> The ideal tool is the `reagrupa_var` function. Called with only the model and the variable name, it shows the cut points produced by the automatic grouping. Passing modified cut points in the `new_bp` argument shows how the variable would look once regrouped

In [6]:
# me.reagrupa_var(modelo1, 'age')
me.reagrupa_var(modelo1, 'age', [30, 60])

Agrupación automática (puntos de corte redondeados a 4 decimales): [48.5, 56.5, 67.5, 75.5]


Unnamed: 0,Group,Count,Percent,Goods,Bads,Bad rate,WoE,IV
0,"(-inf, 48.50)",1996,0.55801,1984,12,0.006012,2.134606,1.097293
1,"[48.50, 56.50)",454,0.126922,434,20,0.044053,0.103955,0.001309
2,"[56.50, 67.50)",535,0.149567,496,39,0.072897,-0.430343,0.033732
3,"[67.50, 75.50)",278,0.077719,238,40,0.143885,-1.189966,0.190331
4,"[75.50, inf)",314,0.087783,251,63,0.200637,-1.591039,0.458713
Totals,,3577,1.0,3403,174,0.048644,,1.781379


--------------------------------------------------------------------------------
Agrupación propuesta:


Unnamed: 0,Group,Count,Percent,Goods,Bads,Bad rate,WoE,IV
0,"(-inf, 30.00)",1077,0.30109,1075,2,0.001857,3.313571,1.008663
1,"[30.00, 60.00)",1538,0.429969,1494,44,0.028609,0.551665,0.102693
2,"[60.00, inf)",962,0.26894,834,128,0.133056,-1.099154,0.539195
Totals,,3577,1.0,3403,174,0.048644,,1.65055
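
The WoE and IV columns in these tables can be recomputed from the Goods and Bads counts alone. The sketch below assumes the usual definitions, WoE = ln(share of goods / share of bads) and IV = (share of goods - share of bads) × WoE. The notebook does not state which formula `memento` uses internally, but these definitions agree with the values shown for the proposed `age` grouping. The helper name `woe_iv` is hypothetical.

```python
import math


def woe_iv(groups):
    """Return (woes, ivs) for a list of (goods, bads) counts, one per group.

    Assumed definitions (not taken from memento's source code):
    WoE = ln(share_of_goods / share_of_bads)
    IV  = (share_of_goods - share_of_bads) * WoE
    """
    total_goods = sum(goods for goods, _ in groups)
    total_bads = sum(bads for _, bads in groups)
    woes, ivs = [], []
    for goods, bads in groups:
        p_good = goods / total_goods
        p_bad = bads / total_bads
        woe = math.log(p_good / p_bad)
        woes.append(woe)
        ivs.append((p_good - p_bad) * woe)
    return woes, ivs


# (Goods, Bads) per group for the proposed `age` grouping [30, 60]
age_groups = [(1075, 2), (1494, 44), (834, 128)]
woes, ivs = woe_iv(age_groups)
for woe, iv in zip(woes, ivs):
    print(f"WoE = {woe:.4f}  IV = {iv:.4f}")
print(f"Total IV = {sum(ivs):.4f}")
```

Under these definitions a group with zero goods or zero bads has no finite WoE, so every group needs at least one of each.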


<span style='color:blue'>To use this new grouping of the variable `age`, we fit a scorecard again, passing the grouping in the `user_breakpoints` dictionary

In [7]:
modelo2 = me.scorecard(excluded_vars=['id'], user_breakpoints={'age': [30, 60]}).fit(X, y)

Particionado 70-30 estratificado en el target terminado
------------------------------------------------------------------------------------------------------------------------
Autogrouping terminado. Máximo número de buckets = 5. Mínimo porcentaje por bucket = 0.05
------------------------------------------------------------------------------------------------------------------------
Cuidado, has puesto un valor numero máximo de iteraciones (14) superior al número de variables candidatas (9)
------------------------------------------------------------------------------------------------------------------------
Step 01 | 0:00:00.207035 | pv = 8.30e-27 | Gini train = 56.54% | Gini test = 54.72% ---> Feature selected: age
Step 02 | 0:00:00.283104 | pv = 2.00e-10 | Gini train = 64.46% | Gini test = 58.66% ---> Feature selected: bmi
Step 03 | 0:00:00.337202 | pv = 1.00e-06 | Gini train = 67.69% | Gini test = 60.01% ---> Feature selected: avg_glucose_level
Step 04 | 0:00:00.341134 | pv = 7.

<span style='color:blue'>**Note**: The variable `hypertension` now enters the model as well, which did not happen before. This is because `age` is apparently less discriminating with the new grouping: after the first step the model now has a train Gini of 56.54%, whereas with the automatic grouping it had 64.73%. As a result, the variable selection method ends up choosing `hypertension` too.<br>To avoid this, the exact variables we want in the scorecard can be passed in the `features` parameter, which makes it easier to compare the impact of the manual grouping

In [8]:
modelo3 = me.scorecard(
    features=['age', 'bmi', 'avg_glucose_level'],
    user_breakpoints={'age': [30, 60]}
).fit(X, y)


Particionado 70-30 estratificado en el target terminado
------------------------------------------------------------------------------------------------------------------------
Autogrouping terminado. Máximo número de buckets = 5. Mínimo porcentaje por bucket = 0.05
------------------------------------------------------------------------------------------------------------------------
Step 01 | 0:00:00.000000 | pv = 8.30e-27 | Gini train = 56.54% | Gini test = 54.72% ---> Feature selected: age
Step 02 | 0:00:00.000000 | pv = 2.00e-10 | Gini train = 64.46% | Gini test = 58.66% ---> Feature selected: bmi
Step 03 | 0:00:00.000000 | pv = 1.00e-06 | Gini train = 67.69% | Gini test = 60.01% ---> Feature selected: avg_glucose_level
------------------------------------------------------------------------------------------------------------------------
Selección terminada: ['age', 'bmi', 'avg_glucose_level']
---------------------------------------------------------------------------------------
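
The step 1 train Gini figures compared in the note above (64.73% with the automatic `age` grouping, 56.54% with the manual one) can be recovered from the grouped train counts. The sketch assumes Gini = 2·AUC - 1 and that a one-variable model ranks groups by their bad rate, so ties only occur inside a group. `grouped_gini` is a hypothetical helper, not part of `memento`.

```python
def grouped_gini(groups):
    """Gini = 2*AUC - 1 for a score that is constant within each group.

    `groups` holds (goods, bads) counts. Each group's bad rate acts as its
    score; groups are assumed to have distinct bad rates.
    """
    total_goods = sum(goods for goods, _ in groups)
    total_bads = sum(bads for _, bads in groups)
    # Order groups from lowest to highest bad rate (lowest to highest risk).
    ordered = sorted(groups, key=lambda gb: gb[1] / (gb[0] + gb[1]))
    concordant = 0.0
    goods_below = 0  # goods sitting in strictly lower-risk groups
    for goods, bads in ordered:
        concordant += bads * goods_below  # bad correctly ranked above a good
        concordant += 0.5 * bads * goods  # ties inside a group count half
        goods_below += goods
    auc = concordant / (total_goods * total_bads)
    return 2 * auc - 1


# (Goods, Bads) per group, taken from the two `age` tables shown earlier
auto_age = [(1984, 12), (434, 20), (496, 39), (238, 40), (251, 63)]
manual_age = [(1075, 2), (1494, 44), (834, 128)]

print(f"Automatic grouping: {grouped_gini(auto_age):.2%}")
print(f"Manual grouping:    {grouped_gini(manual_age):.2%}")
```

Coarser bins create more good/bad pairs that share a score, and each such tie only counts half, which is one way to see why the three-bin grouping loses Gini.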

<span style='color:blue'>Let's see how the scorecard is shaping up

In [9]:
me.pretty_scorecard(modelo3, 'pink')

Unnamed: 0,Variable,Group,Count,Percent,Goods,Bads,Bad rate,WoE,IV,Raw score,Aligned score
0,age,"(-inf, 30.00)",1077,0.30109,1075,2,0.001857,3.313571,1.008663,-2.948821,281
1,age,"[30.00, 60.00)",1538,0.429969,1494,44,0.028609,0.551665,0.102693,-0.490939,210
2,age,"[60.00, inf)",962,0.26894,834,128,0.133056,-1.099154,0.539195,0.978162,167
3,bmi,Missing,147,0.041096,118,29,0.197279,-1.569969,0.207222,1.11143,163
4,bmi,"(-inf, 23.75)",910,0.254403,893,17,0.018681,0.988016,0.16274,-0.699448,216
5,bmi,"[23.75, 30.75)",1306,0.36511,1239,67,0.051302,-0.05599,0.001174,0.039637,194
6,bmi,"[30.75, 32.05)",199,0.055633,181,18,0.090452,-0.665232,0.033435,0.470939,182
7,bmi,"[32.05, 36.45)",501,0.140062,489,12,0.023952,0.734098,0.05486,-0.519692,211
8,bmi,"[36.45, inf)",514,0.143696,483,31,0.060311,-0.227328,0.008235,0.160933,191
9,avg_glucose_level,"(-inf, 72.72)",645,0.180319,624,21,0.032558,0.418271,0.026216,-0.226027,202


<span style='color:blue'>We see that one of the two variables with missings also entered the model: `bmi`. In fact, its missings were placed in a group of their own, and this is no coincidence:<br>
- In the autogrouping of a numeric variable with missings, the missings always get a separate group, provided that group contains at least one bad and one good, regardless of its size (note that here the group does not even reach the 5% minimum that is usually required, and it makes no difference). This is done by initially assigning the missings the value -12345678, which we assume will be the minimum of the variable, so that a cut point immediately after it guarantees the missings end up in a group of their own.
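
The sentinel idea can be sketched in a few lines. This is an illustration only, not `memento`'s implementation: missings are replaced by -12345678, and a first cut point just above it (-12345670.0, the value that appears among the `bmi` cut points in this notebook) leaves them alone in the first interval. The names `MISSING_SENTINEL` and `group_index` are hypothetical.

```python
import math
from bisect import bisect_right

MISSING_SENTINEL = -12345678  # value assigned to missings, as described above


def group_index(value, cut_points):
    """Return the index of the interval [cut[i-1], cut[i]) containing value."""
    if value is None or math.isnan(value):
        value = MISSING_SENTINEL
    return bisect_right(cut_points, value)


# The first cut sits just above the sentinel, so only missings fall below it.
bmi_cuts = [-12345670.0, 23.75, 30.75, 32.05, 36.45]

for bmi in [None, 18.0, 23.75, 31.0, 40.0]:
    print(bmi, "->", group_index(bmi, bmi_cuts))
```

Index 0 is reserved for the missings; every real `bmi` value lands in index 1 or higher.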
    
<span style='color:blue'>OK, but what if I want to merge those missings with another group? How do I do that?<br>
- Pass a dictionary that gives, on the one hand, the cut points (which may or may not be those of the autogrouping), leaving out the value -12345670.0 if the missings should be merged with another group, and, on the other hand, the group the missings should be sent to. In the next cell these are the keys `bp` and `mg`, and `'mg': 2` sends the missings to the second group

In [10]:
me.reagrupa_var(modelo1, 'bmi', {'bp': [20, 30], 'mg': 2})

Agrupación automática (puntos de corte redondeados a 4 decimales): [-12345670.0, 23.75, 30.75, 32.05, 36.45]


Unnamed: 0,Group,Count,Percent,Goods,Bads,Bad rate,WoE,IV
0,Missing,147,0.041096,118,29,0.197279,-1.569969,0.207222
1,"(-inf, 23.75)",910,0.254403,893,17,0.018681,0.988016,0.16274
2,"[23.75, 30.75)",1306,0.36511,1239,67,0.051302,-0.05599,0.001174
3,"[30.75, 32.05)",199,0.055633,181,18,0.090452,-0.665232,0.033435
4,"[32.05, 36.45)",501,0.140062,489,12,0.023952,0.734098,0.05486
5,"[36.45, inf)",514,0.143696,483,31,0.060311,-0.227328,0.008235
Totals,,3577,1.0,3403,174,0.048644,,0.467667


--------------------------------------------------------------------------------
Agrupación propuesta:


Unnamed: 0,Group,Count,Percent,Goods,Bads,Bad rate,WoE,IV
0,"(-inf, 20.00)",367,0.1026,364,3,0.008174,1.825184,0.163761
1,"[20.00, 30.00), Missing",1865,0.521387,1760,105,0.0563,-0.154249,0.013305
2,"[30.00, inf)",1345,0.376013,1279,66,0.049071,-0.009178,3.2e-05
Totals,,3577,1.0,3403,174,0.048644,,0.177098
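
A minimal sketch of how a `{'bp': ..., 'mg': ...}` specification could be turned into group assignments. It assumes `mg` is a 1-based group number, which is inferred from the output above (with `'mg': 2` the missings appear in the second group) rather than from `memento`'s documentation. `assign_group` is a hypothetical helper.

```python
import math
from bisect import bisect_right


def assign_group(value, bp, mg):
    """Return the 0-based group of `value` for cut points `bp`.

    Missing values (None or NaN) are sent to group `mg`, read here as a
    1-based group number (an assumption inferred from the notebook output).
    """
    if value is None or math.isnan(value):
        return mg - 1
    # Intervals are closed on the left: [bp[i-1], bp[i])
    return bisect_right(bp, value)


spec = {'bp': [20, 30], 'mg': 2}
labels = ['(-inf, 20)', '[20, 30), Missing', '[30, inf)']

for bmi in [18.5, 20.0, None, 29.9, 30.0]:
    group = assign_group(bmi, spec['bp'], spec['mg'])
    print(bmi, '->', labels[group])
```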


<span style='color:blue'>Let's fit another scorecard with this grouping for `bmi`. Since it is weaker than the automatic one (total IV of 0.177 versus 0.468), the resulting scorecard should have a lower Gini

In [11]:
modelo4 = me.scorecard(
    features=['age', 'bmi', 'avg_glucose_level'],
    user_breakpoints={
        'age': [30, 60],
        'bmi': {'bp': [20, 30], 'mg': 2}
    }
).fit(X, y)

Particionado 70-30 estratificado en el target terminado
------------------------------------------------------------------------------------------------------------------------
Autogrouping terminado. Máximo número de buckets = 5. Mínimo porcentaje por bucket = 0.05
------------------------------------------------------------------------------------------------------------------------
Step 01 | 0:00:00.000000 | pv = 8.30e-27 | Gini train = 56.54% | Gini test = 54.72% ---> Feature selected: age
Step 02 | 0:00:00.000000 | pv = 4.08e-01 | Gini train = 58.28% | Gini test = 55.94% ---> Feature selected: bmi
Step 03 | 0:00:00.000000 | pv = 2.13e-08 | Gini train = 63.89% | Gini test = 58.96% ---> Feature selected: avg_glucose_level
------------------------------------------------------------------------------------------------------------------------
Selección terminada: ['age', 'bmi', 'avg_glucose_level']
---------------------------------------------------------------------------------------

<span style='color:blue'>Let's take a look at how the scorecard would turn out

In [12]:
me.pretty_scorecard(modelo4, 'red')

Unnamed: 0,Variable,Group,Count,Percent,Goods,Bads,Bad rate,WoE,IV,Raw score,Aligned score
0,age,"(-inf, 30.00)",1077,0.30109,1075,2,0.001857,3.313571,1.008663,-3.001008,282
1,age,"[30.00, 60.00)",1538,0.429969,1494,44,0.028609,0.551665,0.102693,-0.499628,210
2,age,"[60.00, inf)",962,0.26894,834,128,0.133056,-1.099154,0.539195,0.995473,167
3,bmi,"(-inf, 20.00)",367,0.1026,364,3,0.008174,1.825184,0.163761,-0.505628,210
4,bmi,"[20.00, 30.00), Missing",1865,0.521387,1760,105,0.0563,-0.154249,0.013305,0.042731,194
5,bmi,"[30.00, inf)",1345,0.376013,1279,66,0.049071,-0.009178,3.2e-05,0.002543,195
6,avg_glucose_level,"(-inf, 72.72)",645,0.180319,624,21,0.032558,0.418271,0.026216,-0.255072,203
7,avg_glucose_level,"[72.72, 76.48)",201,0.056192,185,16,0.079602,-0.52559,0.019757,0.320518,186
8,avg_glucose_level,"[76.48, 165.21)",2278,0.636847,2205,73,0.032046,0.434666,0.099285,-0.26507,203
9,avg_glucose_level,"[165.21, 213.28)",240,0.067095,203,37,0.154167,-1.271069,0.194461,0.775131,173


<span style='color:blue'>What if the variable with missings is a text variable? Much easier: in a text variable a missing is treated as just another category, with no distinction from the rest. Let's see it with the variable `work_type`, into which we artificially injected missings

In [13]:
modelo5 = me.scorecard(
    features=['age', 'bmi', 'avg_glucose_level', 'work_type'],
    user_breakpoints={
        'age': [30, 60],
        'bmi': {'bp': [20, 30], 'mg': 2}
    }
).fit(X, y)

Particionado 70-30 estratificado en el target terminado
------------------------------------------------------------------------------------------------------------------------
Autogrouping terminado. Máximo número de buckets = 5. Mínimo porcentaje por bucket = 0.05
------------------------------------------------------------------------------------------------------------------------
Step 01 | 0:00:00.000000 | pv = 8.30e-27 | Gini train = 56.54% | Gini test = 54.72% ---> Feature selected: age
Step 02 | 0:00:00.000000 | pv = 4.08e-01 | Gini train = 58.28% | Gini test = 55.94% ---> Feature selected: bmi
Step 03 | 0:00:00.000000 | pv = 2.13e-08 | Gini train = 63.89% | Gini test = 58.96% ---> Feature selected: avg_glucose_level
Step 04 | 0:00:00.000000 | pv = 3.11e-01 | Gini train = 63.45% | Gini test = 57.89% ---> Feature selected: work_type
------------------------------------------------------------------------------------------------------------------------
Selección terminada: ['age'

<span style='color:blue'>We regroup with a plain list of lists, regardless of whether the variable has missings or not

In [14]:
me.reagrupa_var(
    modelo5, 'work_type',
    [['Private', 'Govt_job'], ['Missing', 'Never_worked'], ['children']]
)

Agrupación automática: [['Missing'], ['Private'], ['Govt_job'], ['children', 'Never_worked']]


Unnamed: 0,Group,Count,Percent,Goods,Bads,Bad rate,WoE,IV
0,[Missing],572,0.159911,530,42,0.073427,-0.43815,0.037521
1,[Private],2025,0.566117,1915,110,0.054321,-0.116365,0.008081
2,[Govt_job],467,0.130556,447,20,0.042827,0.133469,0.002191
3,"[children, Never_worked]",513,0.143416,511,2,0.003899,2.569865,0.356356
Totals,,3577,1.0,3403,174,0.048644,,0.404149


--------------------------------------------------------------------------------
Agrupación propuesta:


Unnamed: 0,Group,Count,Percent,Goods,Bads,Bad rate,WoE,IV
0,"[Private, Govt_job]",2492,0.696673,2362,130,0.052167,-0.073628,0.003905
1,"[Missing, Never_worked]",590,0.164943,548,42,0.071186,-0.404752,0.03252
2,[children],495,0.138384,493,2,0.00404,2.534005,0.33798
Totals,,3577,1.0,3403,174,0.048644,,0.374405
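
Because a missing is just another category here, a grouping for a text variable boils down to a lookup from category to group. The sketch below illustrates that idea under the assumption that missings are labelled `'Missing'`, as in the output above. `category_to_group` and `group_of` are hypothetical helpers, not `memento` functions.

```python
def category_to_group(groups):
    """Build a {category: group_index} lookup from a list of category lists."""
    lookup = {}
    for index, categories in enumerate(groups):
        for category in categories:
            if category in lookup:
                raise ValueError(f"{category!r} appears in more than one group")
            lookup[category] = index
    return lookup


def group_of(value, lookup):
    """Look up a value's group; missings use the label 'Missing'."""
    key = 'Missing' if value is None else value
    return lookup[key]


proposed = [['Private', 'Govt_job'], ['Missing', 'Never_worked'], ['children']]
lookup = category_to_group(proposed)

for work_type in ['Private', None, 'Never_worked', 'children']:
    print(work_type, '->', group_of(work_type, lookup))
```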


<span style='color:blue'>We fit the final scorecard with all the regroupings we have made

In [15]:
modelo6 = me.scorecard(
    features=['age', 'bmi', 'avg_glucose_level', 'work_type'],
    user_breakpoints={
        'age': [30, 60],
        'bmi': {'bp': [20, 30], 'mg': 2},
        'work_type': [['Private', 'Govt_job'], ['Missing', 'Never_worked'], ['children']]
    }
).fit(X, y)


Particionado 70-30 estratificado en el target terminado
------------------------------------------------------------------------------------------------------------------------
Autogrouping terminado. Máximo número de buckets = 5. Mínimo porcentaje por bucket = 0.05
------------------------------------------------------------------------------------------------------------------------
Step 01 | 0:00:00.000000 | pv = 8.30e-27 | Gini train = 56.54% | Gini test = 54.72% ---> Feature selected: age
Step 02 | 0:00:00.000000 | pv = 4.08e-01 | Gini train = 58.28% | Gini test = 55.94% ---> Feature selected: bmi
Step 03 | 0:00:00.000000 | pv = 2.13e-08 | Gini train = 63.89% | Gini test = 58.96% ---> Feature selected: avg_glucose_level
Step 04 | 0:00:00.000000 | pv = 7.66e-02 | Gini train = 63.96% | Gini test = 58.05% ---> Feature selected: work_type
------------------------------------------------------------------------------------------------------------------------
Selección terminada: ['age'

In [16]:
me.pretty_scorecard(modelo6, 'orange')

Unnamed: 0,Variable,Group,Count,Percent,Goods,Bads,Bad rate,WoE,IV,Raw score,Aligned score
0,age,"(-inf, 30.00)",1077,0.30109,1075,2,0.001857,3.313571,1.008663,-3.230982,240
1,age,"[30.00, 60.00)",1538,0.429969,1494,44,0.028609,0.551665,0.102693,-0.537915,162
2,age,"[60.00, inf)",962,0.26894,834,128,0.133056,-1.099154,0.539195,1.071758,115
3,bmi,"(-inf, 20.00)",367,0.1026,364,3,0.008174,1.825184,0.163761,-0.762567,168
4,bmi,"[20.00, 30.00), Missing",1865,0.521387,1760,105,0.0563,-0.154249,0.013305,0.064446,145
5,bmi,"[30.00, inf)",1345,0.376013,1279,66,0.049071,-0.009178,3.2e-05,0.003835,146
6,avg_glucose_level,"(-inf, 72.72)",645,0.180319,624,21,0.032558,0.418271,0.026216,-0.253087,154
7,avg_glucose_level,"[72.72, 76.48)",201,0.056192,185,16,0.079602,-0.52559,0.019757,0.318024,137
8,avg_glucose_level,"[76.48, 165.21)",2278,0.636847,2205,73,0.032046,0.434666,0.099285,-0.263008,154
9,avg_glucose_level,"[165.21, 213.28)",240,0.067095,203,37,0.154167,-1.271069,0.194461,0.769098,124
