# Financial Application: Credit Risk
**This notebook presents a credit-score prediction model. The datasets are available from [INEI](http://iinei.inei.gob.pe/microdatos/).**

- Survey: ENAHO, updated methodology
- Year: 2020
- Period: Annual

---

Variables in the model:

##### Target variable
- `P107E`: Have you had difficulties that prevented you from meeting the payment schedule of the credit or loan you obtained? Yes/No


##### Independent variables

- `P207`: Sex
- `P208A`: What is your age in completed years? (in years)
- `P209`: What is your marital or conjugal status?
> 1. Cohabiting
> 2. Married
> 3. Widowed
> 4. Divorced
> 5. Separated
> 6. Single

- `P107D4`: What was the total amount of credit received?
- `P524A1`: Income: total income, amount (S/.)
- `P523`: In your main occupation, are you paid:
> 1. Daily?
> 2. Weekly?
> 3. Every two weeks?
> 4. Monthly?

- `P301A`: What is the last year or grade of studies and level that you completed? (Educational level)
> 1 No level  
> 2 Preschool  
> 3 Incomplete primary  
> 4 Complete primary  
> 5 Incomplete secondary  
> 6 Complete secondary  
> 7 Incomplete non-university higher education  
> 8 Complete non-university higher education  
> 9 Incomplete university education  
> 10 Complete university education  
> 11 Master's/Doctorate  
> 12 Special basic education

- `ESTRATO`: Urban/rural area
- `P105A`: The dwelling your household occupies is:
> 1 Rented  
> 2 Owned, fully paid  
> 3 Owned, through land occupation  
> 4 Owned, being bought in installments  
> 5 Provided by the workplace  
> 6 Provided by another household or institution  
> 7 Other

In [5]:
# Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns  # Built on Matplotlib; adds higher-level statistical plots.

# Import the datasets

In [6]:
data1=pd.read_csv('./datos/Enaho01-2020-100.csv',encoding='latin')
data1.columns

Index(['AÑO', 'MES', 'CONGLOME', 'VIVIENDA', 'HOGAR', 'UBIGEO', 'DOMINIO',
       'ESTRATO', 'PERIODO', 'TIPENC',
       ...
       'FACTOR07', 'FACTOR_P', 'RECHAZO_RAZONES', 'NCONGLOME', 'SUB_CONGLOME',
       'CODCCPP', 'NOMCCPP', 'LONGITUD', 'LATITUD', 'ALTITUD'],
      dtype='object', length=331)

In [7]:
data1 = data1[['CONGLOME','VIVIENDA','HOGAR','UBIGEO',"P107E","P107D4","ESTRATO","P105A"]]
data1.head()

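| | CONGLOME | VIVIENDA | HOGAR | UBIGEO | P107E | P107D4 | ESTRATO | P105A |
|---|---|---|---|---|---|---|---|---|
| 0 | 5009 | 12 | 11 | 10101 | | | 4 | 6.0 |
| 1 | 5009 | 41 | 11 | 10101 | | | 4 | 2.0 |
| 2 | 5009 | 56 | 11 | 10101 | | | 4 | |
| 3 | 5009 | 84 | 11 | 10101 | | | 4 | |
| 4 | 5009 | 98 | 11 | 10101 | | | 4 | 1.0 |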


In [8]:
data2=pd.read_csv('./datos/Enaho01-2020-200.csv',encoding='latin')
data2.columns

Index(['AÑO', 'MES', 'CONGLOME', 'VIVIENDA', 'HOGAR', 'CODPERSO', 'UBIGEO',
       'DOMINIO', 'ESTRATO', 'P201P', 'P203', 'P203A', 'P203B', 'P204', 'P205',
       'P206', 'P207', 'P208A', 'P208B', 'P209', 'P210', 'P211A', 'P211D',
       'P212', 'P213', 'P214', 'P215', 'P216', 'P217', 'T211', 'OCUPAC_R3',
       'OCUPAC_R4', 'RAMA_R3', 'RAMA_R4', 'CODTAREA', 'CODTIEMPO', 'TICUEST01',
       'TIPOCUESTIONARIO', 'TIPOENTREVISTA', 'FACPOB07', 'FACTOR_P',
       'NCONGLOME', 'SUB_CONGLOME'],
      dtype='object')

In [9]:
data2 = data2[['CONGLOME','VIVIENDA','HOGAR','UBIGEO',"P207","P208A","P209"]]
data2.head()

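| | CONGLOME | VIVIENDA | HOGAR | UBIGEO | P207 | P208A | P209 |
|---|---|---|---|---|---|---|---|
| 0 | 5009 | 12 | 11 | 10101 | 2 | 49 | 5.0 |
| 1 | 5009 | 12 | 11 | 10101 | 2 | 16 | 6.0 |
| 2 | 5009 | 41 | 11 | 10101 | 2 | 61 | 6.0 |
| 3 | 5009 | 41 | 11 | 10101 | 2 | 29 | 6.0 |
| 4 | 5009 | 41 | 11 | 10101 | 2 | 10 | |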


In [10]:
data3=pd.read_csv('./datos/Enaho01A-2020-300.csv',encoding='latin')
data3.columns

Index(['AÑO', 'MES', 'CONGLOME', 'VIVIENDA', 'HOGAR', 'CODPERSO', 'UBIGEO',
       'DOMINIO', 'ESTRATO', 'CODINFOR',
       ...
       'I3120C', 'NIVEL', 'TICUEST01A', 'TIPOCUESTIONARIO', 'TIPOENTREVISTA',
       'FACTOR07', 'FACTORA07', 'FACTORA_P', 'NCONGLOME', 'SUB_CONGLOME'],
      dtype='object', length=568)

In [None]:
data3= data3[['CONGLOME','VIVIENDA','HOGAR','UBIGEO',"P301A"]]
data3.head()

In [None]:
data5=pd.read_csv('./datos/Enaho01A-2020-500.csv',encoding='latin')
data5.columns

In [None]:
data5= data5[['CONGLOME','VIVIENDA','HOGAR','UBIGEO',"P524A1","P523"]]
data5.head()

In [None]:
d1 = pd.merge(data1, data2, on=['CONGLOME','VIVIENDA','HOGAR','UBIGEO'])

In [None]:
d2 = pd.merge(d1, data3,  on=['CONGLOME','VIVIENDA','HOGAR','UBIGEO'])

In [None]:
d3 = pd.merge(d2, data5,  on=['CONGLOME','VIVIENDA','HOGAR','UBIGEO'])
d3
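
The merges above join on household-level keys only. The sketch below uses made-up toy tables (not ENAHO data) to show how pandas behaves when both sides repeat a key, and how the `validate` argument of `pd.merge` can guard against it. `CODPERSO` appears in the person-level column listings above, but using it as a join key is an assumption here, not the notebook's method.

```python
import pandas as pd

# Toy tables (hypothetical values): two people in the same household.
keys = ['CONGLOME', 'VIVIENDA', 'HOGAR']
personas = pd.DataFrame({
    'CONGLOME': [5009, 5009], 'VIVIENDA': [12, 12], 'HOGAR': [11, 11],
    'CODPERSO': [1, 2], 'P208A': [49, 16],
})
educacion = pd.DataFrame({
    'CONGLOME': [5009, 5009], 'VIVIENDA': [12, 12], 'HOGAR': [11, 11],
    'CODPERSO': [1, 2], 'P301A': [6, 5],
})

# Joining on household keys pairs every person with every person in the household.
por_hogar = pd.merge(personas, educacion, on=keys)

# Adding the person identifier keeps one row per person;
# validate raises an error if the keys are not unique on both sides.
por_persona = pd.merge(personas, educacion, on=keys + ['CODPERSO'],
                       validate='one_to_one')

print(len(por_hogar), len(por_persona))
```

Comparing row counts before and after each merge is a quick way to detect this kind of unintended row multiplication.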

In [None]:
d3 = d3.rename(columns={
    'P107E': 'prob_pago',
    'P107D4': 'MONTO_CREDITO',
    'P207': 'SEXO',
    'P208A': 'EDAD',
    'P209': 'ESTADO_CIVIL',
    'P301A': 'EDUCACION',
    'P524A1': 'INGRESO',
    'P523': 'INGRESO_T',
    'P105A': 'T_VIVIENDA',
})

In [None]:
d3.head()

In [None]:
d3.dtypes

In [None]:
base= d3.copy()

In [None]:
base.isnull().sum()

## Quantitative variables

In [None]:
base['MONTO_CREDITO']=pd.to_numeric(base['MONTO_CREDITO'], errors='coerce')
base['EDAD']=pd.to_numeric(base['EDAD'], errors='coerce')
base['INGRESO']=pd.to_numeric(base['INGRESO'], errors='coerce')

In [None]:
base[['MONTO_CREDITO','EDAD','INGRESO']].head()
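
A small sketch of what `errors='coerce'` does, using a made-up series rather than the survey columns: values that cannot be parsed as numbers become `NaN` instead of raising an error.

```python
import pandas as pd

# Hypothetical raw column: numbers stored as text, a blank, and a stray label.
crudo = pd.Series(['1500', ' ', '320.5', 'abc'])

# errors='coerce' converts what it can and replaces the rest with NaN.
numerico = pd.to_numeric(crudo, errors='coerce')

print(numerico.tolist())
print(int(numerico.isna().sum()), 'values became NaN')
```

Counting the `NaN` values before and after the conversion shows how many entries were not numeric to begin with.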

## Filter out missing values

In [None]:
base= base[base["MONTO_CREDITO"].notnull() & base["INGRESO"].notnull() & base["INGRESO_T"].notnull()]
base.tail()
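
The boolean mask above can also be written with `DataFrame.dropna(subset=...)`. A minimal sketch on a toy frame (hypothetical values) checking that both forms keep the same rows:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'MONTO_CREDITO': [1000.0, np.nan, 500.0],
    'INGRESO': [800.0, 900.0, np.nan],
    'INGRESO_T': [4, 4, 2],
})
cols_requeridas = ['MONTO_CREDITO', 'INGRESO', 'INGRESO_T']

# Explicit mask, as in the cell above.
con_mascara = toy[toy['MONTO_CREDITO'].notnull()
                  & toy['INGRESO'].notnull()
                  & toy['INGRESO_T'].notnull()]

# Equivalent: drop rows with a missing value in any of the listed columns.
con_dropna = toy.dropna(subset=cols_requeridas)

print(con_mascara.equals(con_dropna))
```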

In [None]:
base.info()

## Create the monthly income variable

In [None]:
# Cast both columns to integers (astype is the vectorized equivalent of apply(np.int64))
base['INGRESO']=base['INGRESO'].astype(np.int64)
base['INGRESO_T']=base['INGRESO_T'].astype(np.int64)
base[['INGRESO','INGRESO_T']].head()

In [None]:
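# INGRESO_T holds the pay-frequency code from P523 (1 = daily ... 4 = monthly),
# so this product scales income by that code, not by the number of pay periods in a month.
base['iNGRESO_MENSUAL']=base['INGRESO']*base['INGRESO_T']
base['iNGRESO_MENSUAL'].head()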
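
Since `INGRESO_T` stores the pay-frequency code rather than a number of periods, one possible alternative is to map each code to an assumed number of pay periods per month. The factors (30, 4, 2, 1) and the names `PERIODOS_POR_MES` and `INGRESO_MENSUAL_ALT` are illustrative assumptions, not taken from the ENAHO documentation or from this notebook.

```python
import pandas as pd

# Assumed pay periods per month for each P523 code (illustrative, not official):
# 1 = daily, 2 = weekly, 3 = every two weeks, 4 = monthly.
PERIODOS_POR_MES = {1: 30, 2: 4, 3: 2, 4: 1}

# Toy data (hypothetical values).
toy = pd.DataFrame({'INGRESO': [50, 300, 600, 1200],
                    'INGRESO_T': [1, 2, 3, 4]})

# Multiply by the number of periods rather than by the code itself.
toy['INGRESO_MENSUAL_ALT'] = toy['INGRESO'] * toy['INGRESO_T'].map(PERIODOS_POR_MES)
print(toy)
```

Whether a daily wage should be scaled by 30 calendar days or by a smaller number of working days is a modelling choice that would need to be checked against the survey definition of `P524A1`.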

## Create the urban/rural area variable

In [None]:
# Collapse the ESTRATO codes into a binary flag: codes 1-5 -> 1, codes 6-8 -> 0
base['ESTRATO']= base['ESTRATO'].replace({1:1,2:1,3:1,4:1,5:1,6:0,7:0,8:0})
base['ESTRATO']

In [None]:
base.head()

In [None]:
## Exploring the education variable

In [None]:
base["EDUCACION"].value_counts()

In [None]:
# Collapse the education level into two groups: codes 10 (complete university)
# and 11 (master's/doctorate) become 'Graduado'; every other code becomes 'No graduado'.
# The codes may arrive as integers or as text, so both forms are mapped.
mapa_educacion = {}
for codigo in range(1, 13):
    etiqueta = 'Graduado' if codigo in (10, 11) else 'No graduado'
    mapa_educacion[codigo] = etiqueta
    mapa_educacion[str(codigo)] = etiqueta

base['EDUCACION'] = base['EDUCACION'].replace(mapa_educacion)
base["EDUCACION"].value_counts()

## Keep respondents aged 18 and over

In [None]:
base= base[base["EDAD"]>=18]
base.tail()

## Working dataset

In [None]:
base_final = base.drop(columns=['CONGLOME', 'VIVIENDA','HOGAR','INGRESO_T', 'INGRESO'])
base_final.head()

In [None]:
base_final.dtypes

In [None]:
base_final.describe()

In [None]:
base_final.describe(include="O")

In [None]:
base_final["prob_pago"].value_counts()

In [None]:
# Separate the continuous variables from the categorical ones
# (ESTRATO was recoded to 0/1 above, so it is binary rather than truly continuous)
cols_continuas = ["EDAD","ESTRATO", "iNGRESO_MENSUAL"]
cols_categoricas = ["SEXO","ESTADO_CIVIL","EDUCACION", "T_VIVIENDA"]

In [None]:
for c in cols_continuas:
    fig, ax = plt.subplots()
    base_final[c].hist(bins=50, ax= ax)
    ax.set_title(c)
    plt.show()

In [None]:
for c in cols_categoricas:
    fig, ax = plt.subplots()
    sns.countplot(x =c, data = base_final)
    ax.set_title(c)
    plt.show()

In [None]:
## Assess extreme values

In [None]:
# Collect the 1st, 5th, 95th and 99th percentiles of each continuous variable.
# Rows are gathered in a list because DataFrame.append was removed in pandas 2.0.
filas = []
for c in cols_continuas:
    p1, p5, p95, p99 = base_final[c].quantile([0.01, 0.05, 0.95, 0.99])
    filas.append([c, p1, p5, p95, p99])

df_aux = pd.DataFrame(filas, columns=['Variable', 'P1', 'P5', 'P95', 'P99'])

In [None]:
df_aux.head()

In [None]:
cols= df_aux["Variable"].unique()

In [None]:
# Winsorize: cap each continuous variable at its 5th and 95th percentiles
for c in cols:
    cota_izquierda = df_aux.loc[df_aux["Variable"] == c,"P5"].values[0]
    cota_derecha = df_aux.loc[df_aux["Variable"] == c ,"P95"].values[0]
    base_final[c] = base_final[c].astype("float64")
    base_final.loc[base_final[c] > cota_derecha,c] = cota_derecha
    base_final.loc[base_final[c] < cota_izquierda,c] = cota_izquierda
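
The capping loop above can be expressed more compactly with `Series.clip`. A minimal sketch on a toy series (the values 1 to 100 are made up for illustration):

```python
import pandas as pd

valores = pd.Series(range(1, 101), dtype='float64')  # 1.0, 2.0, ..., 100.0

# Percentile bounds, then cap everything that falls outside them.
p5, p95 = valores.quantile([0.05, 0.95])
recortado = valores.clip(lower=p5, upper=p95)

print(p5, p95, recortado.min(), recortado.max())
```

Values between the two bounds are left untouched; only the tails are pulled in to the bounds.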

In [None]:
for c in cols_continuas:
    fig, ax = plt.subplots()
    base_final[c].hist(bins=50, ax= ax)
    ax.set_title(c)
    plt.show()

In [None]:
## Recoding values

In [None]:
base_final['EDUCACION'] = base_final['EDUCACION'].replace({'Graduado': 1,'No graduado': 2})
base_final['EDUCACION'].value_counts()

In [None]:
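# Recode the target to 0/1: 1 = had payment difficulties, 2 ("No") -> 0.
# to_numeric makes the recode work whether the codes were read as text or as numbers.
base_final['prob_pago'] = pd.to_numeric(base_final['prob_pago'], errors='coerce').replace({2: 0})
base_final['prob_pago'].value_counts()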

In [None]:
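# Convert to numeric first so that text and integer codes are treated alike, then
# recode: 2, 3, 4 (owned) -> 1; 5, 6 (provided by others) -> 0.
# Codes 1 (rented) and 7 (other) are not in the mapping and are left unchanged.
base_final['T_VIVIENDA'] = pd.to_numeric(base_final['T_VIVIENDA'], errors='coerce').replace({2: 1, 3: 1, 4: 1, 5: 0, 6: 0})
base_final['T_VIVIENDA'].value_counts()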

In [None]:
base_final.head()

### Split the sample

In [None]:
X, y = base_final.drop(columns = ["prob_pago"]), base_final["prob_pago"]

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split( X, y, test_size=0.3, random_state=4)

In [None]:
print ('Train set:', X_train.shape,  y_train.shape)
print ('Test set:', X_test.shape,  y_test.shape)
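
When the target is imbalanced, `train_test_split` accepts a `stratify` argument that keeps the class proportions similar in both parts. A minimal sketch on a toy target (90 zeros and 10 ones, made up for illustration); the split in this notebook does not use it.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy imbalanced target (hypothetical data): 90 zeros and 10 ones.
y_toy = np.array([0] * 90 + [1] * 10)
X_toy = np.arange(100).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=4, stratify=y_toy)

# With stratify, both parts keep the share of the positive class.
print(y_tr.mean(), y_te.mean())
```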

In [None]:
df_train = pd.concat([X_train, y_train], axis=1)
df_test = pd.concat([X_test, y_test], axis=1)

In [None]:
df_train.head()

In [None]:
df_test.head()

In [None]:
### Missing-value analysis

In [None]:
# Count the missing values in each column of the training set
print("Missing values per column")
print(df_train.isnull().sum())

In [None]:
target ="prob_pago"

In [None]:
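# Impute missing values with statistics learned on the training set only,
# and apply the same fill values to the test set so it can be scored later.
for column in cols_categoricas:
    if column != target:
        moda = df_train[column].mode()[0]
        df_train[column] = df_train[column].fillna(moda)
        df_test[column] = df_test[column].fillna(moda)

for column in cols_continuas:
    if column != target:
        mediana = df_train[column].median()
        df_train[column] = df_train[column].fillna(mediana)
        df_test[column] = df_test[column].fillna(mediana)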
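
Fill values learned on the training set can also be reused on held-out data. A minimal sketch with scikit-learn's `SimpleImputer` on toy values (the frames `train` and `test` are made up for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frames (hypothetical values), each with one missing age.
train = pd.DataFrame({'EDAD': [20.0, 30.0, np.nan, 40.0]})
test = pd.DataFrame({'EDAD': [np.nan, 55.0]})

imputador = SimpleImputer(strategy='median')
train_imp = imputador.fit_transform(train)  # learns the median from train only
test_imp = imputador.transform(test)        # reuses the training median

print(imputador.statistics_, test_imp.ravel())
```

Keeping the fitted imputer around makes it easy to apply exactly the same fill values to any later scoring data.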

### Class balancing

We check the distribution of the classes to predict. In this loan-default problem the classes are 0 and 1 (the original code 2 was recoded to 0), where 1 is the event of interest: the person reported difficulties keeping up with the payment schedule (default).

In [None]:
df_train["prob_pago"].value_counts()

In [None]:
cols = df_train.columns

In [None]:
X, y = df_train.drop(columns = [target]), df_train["prob_pago"]

In [None]:
X.head()

In [None]:
y.head()

In [None]:
#pip install imbalanced-learn

In [None]:
#pip install imblearn

In [None]:
from imblearn.over_sampling import SMOTE

# Oversample the minority class of the training data.
# random_state is fixed so the synthetic rows are reproducible.
smote = SMOTE(random_state=4)
X_sm, y_sm = smote.fit_resample(X, y)

columns_X = X.columns
columns_y = [target]

df_X_sm = pd.DataFrame(data=X_sm,columns=columns_X)
df_y_sm = pd.DataFrame(data=y_sm,columns=columns_y)

# Concatenate features and target
df_balanceado_sm = pd.concat([df_X_sm, df_y_sm], axis=1)

In [None]:
df_balanceado_sm["prob_pago"].value_counts()
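
SMOTE creates synthetic minority rows by interpolating between a minority sample and one of its nearest minority neighbours. The NumPy sketch below illustrates that idea on toy points. The function `sintetizar` is hypothetical and is not imbalanced-learn's implementation.

```python
import numpy as np

def sintetizar(minoria, n_nuevos, k=2, seed=0):
    """Create n_nuevos points on segments between minority points and their neighbours."""
    rng = np.random.default_rng(seed)
    nuevos = []
    for _ in range(n_nuevos):
        i = rng.integers(len(minoria))
        # Distances from point i to every minority point; position 0 after sorting is itself.
        dist = np.linalg.norm(minoria - minoria[i], axis=1)
        vecinos = np.argsort(dist)[1:k + 1]
        j = rng.choice(vecinos)
        lam = rng.random()  # position along the segment, in [0, 1)
        nuevos.append(minoria[i] + lam * (minoria[j] - minoria[i]))
    return np.array(nuevos)

# Toy minority class (hypothetical): the four corners of the unit square.
minoria = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
sinteticos = sintetizar(minoria, n_nuevos=5)
print(sinteticos.shape)
```

Because the new points are interpolations, integer-coded categorical columns can receive fractional values after resampling, which is worth checking in the balanced data.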

In [None]:
df_balanceado_sm.shape

In [None]:
X_ros = df_X_sm
y_ros = df_y_sm[target]  # 1-D Series, the shape scikit-learn estimators expect for y

### Feature selection

In [None]:
# Feature selection with recursive feature elimination (RFE)
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Keep the 8 best-ranked features; n_features_to_select must be passed by keyword
# in current scikit-learn versions.
model = LogisticRegression()
rfe = RFE(model, n_features_to_select=8)
fit = rfe.fit(X_ros, y_ros)

In [None]:
# Horizontal bar plot of the RFE ranking (1 = selected; higher = eliminated earlier)
pos = np.arange(len(X_ros.columns)) + 0.5
plt.barh(pos, fit.ranking_, align='center')
plt.title("RFE feature ranking")
plt.xlabel("Ranking (1 = selected)")
plt.ylabel("Features")
plt.yticks(pos, (X.columns))
plt.grid(True)

In [None]:
fit.support_

In [None]:
features_selected = X.columns[fit.support_]

In [None]:
features_selected
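
To make the meaning of `support_` and `ranking_` concrete, here is a minimal sketch on synthetic data (not the survey data) in which only the first two of four columns drive the target:

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 4))
# Only the first two columns determine the (synthetic) target.
y_toy = (X_toy[:, 0] + X_toy[:, 1] > 0).astype(int)

selector = RFE(LogisticRegression(), n_features_to_select=2).fit(X_toy, y_toy)

print(selector.support_)  # boolean mask of the features that were kept
print(selector.ranking_)  # 1 = kept; larger values were eliminated earlier
```

`ranking_` is an elimination order rather than an importance score, so a bar chart of it is best read as "lower is better".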

In [None]:
y_train = y_ros
X_train = X_ros[features_selected]

In [None]:
from sklearn import model_selection

# Shuffle before splitting: the resampled rows may be ordered by class,
# which would leave some unshuffled folds with a single class.
kfold = model_selection.KFold(n_splits=10, shuffle=True, random_state=4)
lr = LogisticRegression()
scoring = 'accuracy'
results = model_selection.cross_val_score(lr, X_train, y_train, cv=kfold, scoring=scoring)
print("10-fold cross validation average accuracy: %.3f" % (results.mean()))
results
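
When resampled rows are appended after the original ones, the target can end up ordered by class, and the fold layout then matters. A sketch on a toy ordered target (hypothetical data) comparing unshuffled `KFold` with `StratifiedKFold`; the helper `clases_por_fold` is introduced here for illustration only.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Toy target ordered by class: 50 zeros followed by 50 ones.
y_toy = np.array([0] * 50 + [1] * 50)
X_toy = np.zeros((100, 1))

def clases_por_fold(cv):
    """Number of distinct classes found in each validation fold."""
    return [len(np.unique(y_toy[val])) for _, val in cv.split(X_toy, y_toy)]

sin_mezclar = clases_por_fold(KFold(n_splits=10))
estratificado = clases_por_fold(
    StratifiedKFold(n_splits=10, shuffle=True, random_state=4))

print(sin_mezclar)
print(estratificado)
```

Folds that contain a single class make per-fold accuracy hard to interpret, so shuffling or stratifying is usually preferable for classification.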

In [None]:
### Model evaluation

In [None]:
from sklearn.metrics import classification_report
from sklearn.metrics import log_loss
from sklearn.metrics import (precision_score, recall_score,f1_score,accuracy_score)

In [None]:
from sklearn.linear_model import LogisticRegression

lr = LogisticRegression()
lr.fit(X_train, y_train)

In [None]:
X_test = df_test[features_selected]
y_pred=lr.predict(X_test)
yhat_prob = lr.predict_proba(X_test)

In [None]:
### Model performance metrics

In [None]:
# Print the classification report (precision, recall, F1 and support per class)
print(classification_report(y_test, y_pred))

In [None]:
print("\tMetrics:")
print("\t1. Accuracy: %1.3f" % accuracy_score(y_test, y_pred))
print("\t2. Precision: %1.3f" % precision_score(y_test, y_pred))
print("\t3. Recall: %1.3f" % recall_score(y_test, y_pred))
print("\t4. F1: %1.3f" % f1_score(y_test, y_pred))
print("\t5. Log loss: %1.3f\n" % log_loss(y_test, yhat_prob))
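
For reference, the metrics printed above can be computed by hand from the confusion-matrix counts. A minimal pure-Python sketch on made-up labels and probabilities (not the model's output):

```python
import math

# Hypothetical true labels, predicted labels, and predicted P(class = 1).
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_hat = [1, 0, 0, 1, 0, 1, 1, 0]
p_hat = [0.9, 0.2, 0.4, 0.8, 0.1, 0.6, 0.7, 0.3]

# Confusion-matrix counts, taking class 1 as the positive class.
tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_hat))
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_hat))
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_hat))
tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_hat))

accuracy = (tp + tn) / len(y_true)
precision = tp / (tp + fp)   # share of predicted positives that are correct
recall = tp / (tp + fn)      # share of actual positives that are found
f1 = 2 * precision * recall / (precision + recall)

# Binary log loss: mean negative log-likelihood of the true labels.
log_loss = -sum(t * math.log(p) + (1 - t) * math.log(1 - p)
                for t, p in zip(y_true, p_hat)) / len(y_true)

print(accuracy, precision, recall, f1, round(log_loss, 3))
```

For a default model, recall on class 1 is often the metric of most interest, since it measures how many actual defaulters the model identifies.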