## **Credit Score Classification**

## **Student: José Llashag Huamán**

## I. Introduction<a class="anchor" id="Seccion_II"></a>

**Problem:**

You are working as a data scientist at a global finance company. Over the years, the company has collected basic bank details and a large amount of credit-related information. Management wants to build an intelligent system that segregates people into credit score brackets to reduce manual effort.

**Objective:**

Given a person's credit-related information, build a machine learning model that can classify their credit score.

## II. Importing Libraries and Loading the Dataset<a class="anchor" id="Seccion_II"></a>

### 2.1. Importing Libraries<a class="anchor" id="Seccion_II_2_1"></a>

In [3]:
import sklearn
sklearn.__version__

'1.2.2'

In [4]:
import pandas as pd
import numpy as np

In [5]:
from pycaret.datasets import get_data

In [6]:
from pycaret.classification import *

### 2.2. Loading the Dataset<a class="anchor" id="Seccion_II_2_2"></a>

Load the training dataset

In [8]:
dataset_train = pd.read_csv('datasets/train.csv', encoding='latin-1')

Load the test dataset

In [9]:
dataset_test = pd.read_csv('datasets/test.csv', encoding='latin-1')

## III. Variable Analysis and Value Cleaning<a class="anchor" id="Seccion_III"></a>

### 3.1. Variable Analysis<a class="anchor" id="Seccion_II_2_3"></a>

Number of variables and number of records

In [10]:
dataset_train.shape

(100000, 28)

In [11]:
dataset_train.head(3)

,ID,Customer_ID,Month,Name,Age,SSN,Occupation,Annual_Income,Monthly_Inhand_Salary,Num_Bank_Accounts,...,Credit_Mix,Outstanding_Debt,Credit_Utilization_Ratio,Credit_History_Age,Payment_of_Min_Amount,Total_EMI_per_month,Amount_invested_monthly,Payment_Behaviour,Monthly_Balance,Credit_Score
0,0x1602,CUS_0xd40,January,Aaron Maashoh,23,821-00-0265,Scientist,19114.12,1824.843333,3,...,_,809.98,26.82262,22 Years and 1 Months,No,49.574949,80.41529543900253,High_spent_Small_value_payments,312.49408867943663,Good
1,0x1603,CUS_0xd40,February,Aaron Maashoh,23,821-00-0265,Scientist,19114.12,,3,...,Good,809.98,31.94496,,No,49.574949,118.28022162236736,Low_spent_Large_value_payments,284.62916249607184,Good
2,0x1604,CUS_0xd40,March,Aaron Maashoh,-500,821-00-0265,Scientist,19114.12,,3,...,Good,809.98,28.609352,22 Years and 3 Months,No,49.574949,81.699521264648,Low_spent_Medium_value_payments,331.2098628537912,Good


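By dtype, the dataset has 8 numeric variables and 20 object (qualitative) variables, including the target (Credit_Score). Several object columns, such as Age and Annual_Income, hold numbers stored as text.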

In [12]:
dataset_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 28 columns):
 #   Column                    Non-Null Count   Dtype  
---  ------                    --------------   -----  
 0   ID                        100000 non-null  object 
 1   Customer_ID               100000 non-null  object 
 2   Month                     100000 non-null  object 
 3   Name                      90015 non-null   object 
 4   Age                       100000 non-null  object 
 5   SSN                       100000 non-null  object 
 6   Occupation                100000 non-null  object 
 7   Annual_Income             100000 non-null  object 
 8   Monthly_Inhand_Salary     84998 non-null   float64
 9   Num_Bank_Accounts         100000 non-null  int64  
 10  Num_Credit_Card           100000 non-null  int64  
 11  Interest_Rate             100000 non-null  int64  
 12  Num_of_Loan               100000 non-null  object 
 13  Type_of_Loan              88592 non-null   ob

Count the null values in each column

In [13]:
dataset_train.isna().sum().sort_values()

ID                              0
Payment_Behaviour               0
Total_EMI_per_month             0
Payment_of_Min_Amount           0
Credit_Utilization_Ratio        0
Outstanding_Debt                0
Credit_Mix                      0
Changed_Credit_Limit            0
Delay_from_due_date             0
Num_of_Loan                     0
Interest_Rate                   0
Credit_Score                    0
Num_Bank_Accounts               0
Annual_Income                   0
Occupation                      0
SSN                             0
Age                             0
Month                           0
Customer_ID                     0
Num_Credit_Card                 0
Monthly_Balance              1200
Num_Credit_Inquiries         1965
Amount_invested_monthly      4479
Num_of_Delayed_Payment       7002
Credit_History_Age           9030
Name                         9985
Type_of_Loan                11408
Monthly_Inhand_Salary       15002
dtype: int64

### 3.2. Value Cleaning<a class="anchor" id="Seccion_II_2_4"></a>

In [14]:
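def isfloat(num):
    """Return True if `num` parses as a float once underscores are stripped."""
    try:
        # Values such as '28_' or '__10000__' carry stray underscores.
        float(str(num).replace('__', "").replace('_', ""))
        return True
    except ValueError:
        return False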
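As a quick illustration of what this check accepts, the sketch below runs the same function on a few hand-picked strings. The sample values are illustrative and not drawn from the dataset, apart from the `__10000__` pattern that appears later in the notebook. Note that the string `'nan'` passes, because `float('nan')` is valid in Python.

```python
def isfloat(num):
    try:
        float(str(num).replace('__', "").replace('_', ""))
        return True
    except ValueError:
        return False

# Illustrative inputs: clean numbers, underscore-polluted numbers, and non-numbers.
samples = ["23", "28_", "__10000__", "-500", "", "Scientist", None, "nan"]
results = {repr(s): isfloat(s) for s in samples}

for value, ok in results.items():
    print(f"{value:>12} -> {ok}")
```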

Cleaning irregular values

In [15]:
import re

# For the training data
dataset_train = dataset_train.replace(to_replace=r'!@9#%8', value='', regex=True)
# re.escape keeps '$' and '*' from being read as regex metacharacters
dataset_train = dataset_train.replace(to_replace=re.escape('#F%$D@*&8'), value='', regex=True)
dataset_train = dataset_train.replace('_______', "")
dataset_train = dataset_train.replace('__', "")
dataset_train = dataset_train.replace('_', "")

# For the test data
dataset_test = dataset_test.replace(to_replace=r'!@9#%8', value='', regex=True)
dataset_test = dataset_test.replace(to_replace=re.escape('#F%$D@*&8'), value='', regex=True)
dataset_test = dataset_test.replace('_______', "")
dataset_test = dataset_test.replace('__', "")
dataset_test = dataset_test.replace('_', "")
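With `regex=True`, the pattern string is interpreted as a regular expression, so characters such as `$` and `*` only match themselves when escaped. The minimal check below uses the standard `re` module on the second placeholder string to show the difference. It is a sketch of the matching behavior, not a statement about how pandas applies the pattern internally.

```python
import re

literal = "#F%$D@*&8"  # placeholder string targeted by the cleaning cell

# Unescaped: '$' is an end-of-string anchor and '*' a quantifier, so no match.
unescaped_hit = re.search(literal, literal)

# Escaped: every metacharacter is treated as a plain character.
escaped_hit = re.search(re.escape(literal), literal)

cleaned = re.sub(re.escape(literal), "", literal)
print(unescaped_hit, escaped_hit is not None, repr(cleaned))
```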

Filling null values

In [16]:
dataset_train = dataset_train.ffill().bfill()
dataset_test = dataset_test.ffill().bfill()
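Forward fill followed by backward fill only touches true missing values (`NaN`). Cells that were replaced with an empty string in the previous step are ordinary strings and stay empty. The toy frame below, with made-up values, sketches both effects.

```python
import numpy as np
import pandas as pd

# Toy data: one numeric column with NaN gaps, one text column with
# an empty string (from placeholder cleaning) and a NaN.
toy = pd.DataFrame({
    "Monthly_Inhand_Salary": [np.nan, 1824.84, np.nan],
    "Credit_Mix": ["", "Good", np.nan],
})

# ffill copies the previous row's value down; bfill then covers leading gaps.
filled = toy.ffill().bfill()
print(filled)
```

If empty strings should also be filled, one option is to convert them to `NaN` before filling. The notebook does not do this.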

Count the null values again after filling

In [17]:
dataset_train.isna().sum().sort_values()

ID                          0
Payment_Behaviour           0
Amount_invested_monthly     0
Total_EMI_per_month         0
Payment_of_Min_Amount       0
Credit_History_Age          0
Credit_Utilization_Ratio    0
Outstanding_Debt            0
Credit_Mix                  0
Num_Credit_Inquiries        0
Changed_Credit_Limit        0
Num_of_Delayed_Payment      0
Delay_from_due_date         0
Type_of_Loan                0
Num_of_Loan                 0
Interest_Rate               0
Num_Credit_Card             0
Num_Bank_Accounts           0
Monthly_Inhand_Salary       0
Annual_Income               0
Occupation                  0
SSN                         0
Age                         0
Name                        0
Month                       0
Customer_ID                 0
Monthly_Balance             0
Credit_Score                0
dtype: int64

Cleaning the Age variable

In [18]:
dataset_train['Age'].describe()

count     100000
unique      1788
top           38
freq        2833
Name: Age, dtype: object

In [19]:
dataset_train['Age'] = dataset_train['Age'].apply(lambda x : int(x.replace('__', "").replace('_', "")) if isfloat(x) else int(0))
dataset_test['Age'] = dataset_test['Age'].apply(lambda x : int(x.replace('__', "").replace('_', "")) if isfloat(x) else int(0))
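The same strip-and-convert expression is repeated for each column that follows. One possible consolidation, not part of the original notebook, is a small helper. The name `to_number` and its parameters are hypothetical.

```python
def to_number(value, cast=float, default=0):
    """Strip stray underscores and convert; fall back to `default` on failure."""
    text = str(value).replace("_", "")
    try:
        return cast(text)
    except ValueError:
        # Mirrors the notebook's behavior of substituting 0 for unparseable values.
        return cast(default)

# Illustrative raw values, not taken from the dataset.
raw_ages = ["23", "28_", "-500", "", "__10000__"]
cleaned_ages = [to_number(v, cast=int) for v in raw_ages]
cleaned_income = to_number("19114.12_")
print(cleaned_ages, cleaned_income)
```

Under this sketch a column could be converted with `dataset_train['Age'].apply(to_number, cast=int)`, since `Series.apply` forwards extra keyword arguments to the function.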

In [20]:
dataset_train['Age'].describe()

count    100000.000000
mean        110.649700
std         686.244717
min        -500.000000
25%          24.000000
50%          33.000000
75%          42.000000
max        8698.000000
Name: Age, dtype: float64

Cleaning the Annual_Income variable

In [21]:
dataset_train['Annual_Income'].describe()

count       100000
unique       18940
top       36585.12
freq            16
Name: Annual_Income, dtype: object

In [22]:
dataset_train['Annual_Income'] = dataset_train['Annual_Income'].apply(lambda x : float(x.replace('__', "").replace('_', "")) if isfloat(x) else float(0))
dataset_test['Annual_Income'] = dataset_test['Annual_Income'].apply(lambda x : float(x.replace('__', "").replace('_', "")) if isfloat(x) else float(0))

In [23]:
dataset_train['Annual_Income'].describe()

count    1.000000e+05
mean     1.764157e+05
std      1.429618e+06
min      7.005930e+03
25%      1.945750e+04
50%      3.757861e+04
75%      7.279092e+04
max      2.419806e+07
Name: Annual_Income, dtype: float64

Cleaning the Changed_Credit_Limit variable

In [24]:
dataset_train['Changed_Credit_Limit'].describe()

count     100000
unique      4384
top             
freq        2091
Name: Changed_Credit_Limit, dtype: object

In [25]:
dataset_train['Changed_Credit_Limit'] = dataset_train['Changed_Credit_Limit'].apply(lambda x : float(x.replace('__', "").replace('_', "")) if isfloat(x) else float(0))
dataset_test['Changed_Credit_Limit'] = dataset_test['Changed_Credit_Limit'].apply(lambda x : float(x.replace('__', "").replace('_', "")) if isfloat(x) else float(0))

In [26]:
dataset_train['Changed_Credit_Limit'].describe()

count    100000.000000
mean         10.171791
std           6.880628
min          -6.490000
25%           4.970000
50%           9.250000
75%          14.660000
max          36.970000
Name: Changed_Credit_Limit, dtype: float64

Cleaning the Outstanding_Debt variable

In [27]:
dataset_train['Outstanding_Debt'].describe()

count      100000
unique      13178
top       1360.45
freq           24
Name: Outstanding_Debt, dtype: object

In [28]:
dataset_train['Outstanding_Debt'] = dataset_train['Outstanding_Debt'].apply(lambda x : float(x.replace('__', "").replace('_', "")) if isfloat(x) else float(0))
dataset_test['Outstanding_Debt'] = dataset_test['Outstanding_Debt'].apply(lambda x : float(x.replace('__', "").replace('_', "")) if isfloat(x) else float(0))

In [29]:
dataset_train['Outstanding_Debt'].describe()

count    100000.000000
mean       1426.220376
std        1155.129026
min           0.230000
25%         566.072500
50%        1166.155000
75%        1945.962500
max        4998.070000
Name: Outstanding_Debt, dtype: float64

Cleaning the Amount_invested_monthly variable

In [30]:
dataset_train['Amount_invested_monthly'].describe()

count        100000
unique        91049
top       __10000__
freq           4500
Name: Amount_invested_monthly, dtype: object

In [31]:
dataset_train['Amount_invested_monthly'] = dataset_train['Amount_invested_monthly'].apply(lambda x : float(x.replace('__', "").replace('_', "")) if isfloat(x) else float(0))
dataset_test['Amount_invested_monthly'] = dataset_test['Amount_invested_monthly'].apply(lambda x : float(x.replace('__', "").replace('_', "")) if isfloat(x) else float(0))

In [32]:
dataset_train['Amount_invested_monthly'].describe()

count    100000.000000
mean        636.932570
std        2041.827136
min           0.000000
25%          74.616863
50%         135.959898
75%         266.118215
max       10000.000000
Name: Amount_invested_monthly, dtype: float64

Cleaning the Monthly_Balance variable

In [33]:
dataset_train['Monthly_Balance'].describe()

count                               100000
unique                               98792
top       __-333333333333333333333333333__
freq                                     9
Name: Monthly_Balance, dtype: object

In [34]:
dataset_train['Monthly_Balance'] = dataset_train['Monthly_Balance'].apply(lambda x : float(str(x).replace('__', "").replace('_', "")) if isfloat(x) else float(0))
dataset_test['Monthly_Balance'] = dataset_test['Monthly_Balance'].apply(lambda x : float(str(x).replace('__', "").replace('_', "")) if isfloat(x) else float(0))

In [35]:
dataset_train['Monthly_Balance'].describe()

count    1.000000e+05
mean    -3.000000e+22
std      3.162151e+24
min     -3.333333e+26
25%      2.700394e+02
50%      3.370244e+02
75%      4.714038e+02
max      1.602041e+03
Name: Monthly_Balance, dtype: float64
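The mean of about -3e22 is driven almost entirely by the placeholder `__-333333333333333333333333333__`, which survives cleaning as a very large negative number. The arithmetic below reproduces that mean from figures in the two `describe()` outputs above (9 placeholder rows out of 100000). It ignores the contribution of the ordinary balances, which is negligible at this scale.

```python
# Values copied from the describe() outputs above.
placeholder = float("-333333333333333333333333333")  # underscores already stripped
placeholder_rows = 9      # freq of the top value before conversion
total_rows = 100_000      # count

# Contribution of the placeholder rows alone to the column mean.
mean_shift = placeholder_rows * placeholder / total_rows
print(f"{mean_shift:.3e}")
```

Treating the placeholder as a missing value before conversion would be one way to avoid this distortion. The notebook keeps it as is.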

Class counts for the Credit_Score variable

In [36]:
dataset_train['Credit_Score'].value_counts()

Standard    53174
Poor        28998
Good        17828
Name: Credit_Score, dtype: int64
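The counts show an imbalanced target. Converting them to shares gives a reference point: a classifier that always predicts the largest class would already reach that share as accuracy on the full training set. A small sketch using the counts above:

```python
# Class counts copied from the value_counts() output above.
counts = {"Standard": 53174, "Poor": 28998, "Good": 17828}

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}

# Accuracy of always predicting the most frequent class.
majority_baseline = max(shares.values())

for label, share in shares.items():
    print(f"{label:<9} {share:.3%}")
print(f"majority baseline: {majority_baseline:.3%}")
```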

Drop some fields that will not be used for prediction

In [37]:
# Columns excluded from modeling
cols_to_drop = ["ID", "Name", "Type_of_Loan", "SSN", "Monthly_Inhand_Salary",
                "Credit_History_Age", "Num_Credit_Inquiries"]

dataset_train = dataset_train.drop(columns=cols_to_drop)
dataset_test = dataset_test.drop(columns=cols_to_drop)

This leaves 21 columns in the training set

In [38]:
dataset_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 21 columns):
 #   Column                    Non-Null Count   Dtype  
---  ------                    --------------   -----  
 0   Customer_ID               100000 non-null  object 
 1   Month                     100000 non-null  object 
 2   Age                       100000 non-null  int64  
 3   Occupation                100000 non-null  object 
 4   Annual_Income             100000 non-null  float64
 5   Num_Bank_Accounts         100000 non-null  int64  
 6   Num_Credit_Card           100000 non-null  int64  
 7   Interest_Rate             100000 non-null  int64  
 8   Num_of_Loan               100000 non-null  object 
 9   Delay_from_due_date       100000 non-null  int64  
 10  Num_of_Delayed_Payment    100000 non-null  object 
 11  Changed_Credit_Limit      100000 non-null  float64
 12  Credit_Mix                100000 non-null  object 
 13  Outstanding_Debt          100000 non-null  fl

Take a sample of the full dataset

In [39]:
dataset_train_sample = dataset_train.sample(n=12500)
dataset_train_sample.shape

dataset_test_sample = dataset_test.sample(n=12500)
dataset_test_sample.shape

(12500, 20)
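Because `sample(n=12500)` is called without `random_state`, each run draws a different subset, and the class proportions of the sample can drift from those of the full data. The sketch below, on a toy frame with illustrative class counts, shows a repeatable draw and a stratified variant based on `groupby(...).sample`. The seed value 123 is an assumption chosen to match the `session_id` used later.

```python
import pandas as pd

# Toy frame standing in for dataset_train (class counts are illustrative).
toy = pd.DataFrame({
    "Credit_Score": ["Standard"] * 50 + ["Poor"] * 30 + ["Good"] * 20,
    "Age": range(100),
})

# random_state makes the draw repeatable across runs.
simple_sample = toy.sample(n=50, random_state=123)

# Stratified variant: draw the same fraction within each class.
stratified = toy.groupby("Credit_Score").sample(frac=0.5, random_state=123)

print(simple_sample.shape, stratified["Credit_Score"].value_counts().to_dict())
```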

## IV. Using the Dataset with PyCaret

Run the setup

In [40]:
dataset_credit = setup(data=dataset_train_sample, target='Credit_Score', session_id=123)

Compare models and select the best one

In [41]:
best = compare_models()

Tune the hyperparameters of the best model

In [42]:
tuned_best = tune_model(estimator = best)

Fold,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC
0,0.696,0.8075,0.696,0.7595,0.7028,0.5374,0.5597
1,0.6629,0.7828,0.6629,0.7264,0.6672,0.4894,0.5122
2,0.6709,0.7893,0.6709,0.7417,0.675,0.5038,0.5297
3,0.6629,0.7968,0.6629,0.7302,0.6683,0.4898,0.513
4,0.6617,0.7889,0.6617,0.7335,0.6673,0.4891,0.5134
5,0.6503,0.7792,0.6503,0.7151,0.6546,0.4706,0.493
6,0.6549,0.7873,0.6549,0.7269,0.6615,0.4789,0.5027
7,0.6514,0.7905,0.6514,0.7377,0.6554,0.4807,0.5114
8,0.672,0.7964,0.672,0.7416,0.6749,0.5057,0.5319
9,0.6583,0.7779,0.6583,0.7221,0.664,0.4812,0.5026


Fitting 10 folds for each of 10 candidates, totalling 100 fits
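To summarize the ten folds in a single figure, the per-fold accuracies from the table above can be averaged. The sketch below uses only the standard library. The values are copied from the Accuracy column.

```python
from statistics import mean, stdev

# Accuracy per fold, copied from the tuning output above (folds 0 to 9).
fold_accuracy = [0.696, 0.6629, 0.6709, 0.6629, 0.6617,
                 0.6503, 0.6549, 0.6514, 0.672, 0.6583]

mean_accuracy = mean(fold_accuracy)
accuracy_spread = stdev(fold_accuracy)  # sample standard deviation across folds

print(f"mean accuracy: {mean_accuracy:.4f}")
print(f"std across folds: {accuracy_spread:.4f}")
```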


Training and evaluating the model

In [43]:
final_model = finalize_model(estimator = tuned_best)

In [44]:
evaluate_model(final_model)


Test the model

In [45]:
predict_model(final_model, data = dataset_test_sample)

,Customer_ID,Month,Age,Occupation,Annual_Income,Num_Bank_Accounts,Num_Credit_Card,Interest_Rate,Num_of_Loan,Delay_from_due_date,...,Credit_Mix,Outstanding_Debt,Credit_Utilization_Ratio,Payment_of_Min_Amount,Total_EMI_per_month,Amount_invested_monthly,Payment_Behaviour,Monthly_Balance,prediction_label,prediction_score
0,CUS_0x8142,September,22,Entrepreneur,31156.179688,7,4,15,5,5,...,Standard,159.080002,27.886053,Yes,113.901184,128.070190,Low_spent_Large_value_payments,274.863464,Standard,0.75
1,CUS_0x4056,October,46,Developer,71842.796875,3,7,12,2,18,...,Standard,39.310001,34.422684,No,65.273575,70.093201,High_spent_Large_value_payments,714.823242,Standard,0.63
2,CUS_0xa224,October,26,Accountant,9166.464844,9,1363,22,3,13,...,,1943.209961,26.799608,Yes,21.385996,14.261943,High_spent_Small_value_payments,276.139282,Poor,0.70
3,CUS_0x9124,September,24,Entrepreneur,52561.621094,8,8,32,2,30,...,Standard,1312.290039,33.199986,Yes,48.935272,286.340485,High_spent_Small_value_payments,381.937744,Poor,0.71
4,CUS_0x5d6d,November,39,Writer,126514.718750,8,7,5,3,21,...,Standard,470.420013,39.855457,No,312.032867,387.999115,Low_spent_Large_value_payments,604.157349,Standard,0.82
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
12495,CUS_0xa95d,September,47,Architect,38690.039062,5,3,16,2,11,...,Standard,590.940002,28.963427,Yes,40.095337,71.680115,,445.641541,Standard,0.81
12496,CUS_0x487a,September,27,Scientist,58728.058594,8,5,18,5,5,...,,1272.790039,41.076851,Yes,129.832825,130.280716,High_spent_Medium_value_payments,553.835083,Poor,0.49
12497,CUS_0x3d60,September,24,Teacher,40289.789062,5,6,10,2,11,...,Good,1073.739990,32.286114,NM,38.976391,101.192390,High_spent_Medium_value_payments,417.179474,Standard,0.54
12498,CUS_0x58aa,September,41,,7664.314941,6,5,15,5,36,...,Standard,1294.449951,36.152924,Yes,24.236105,23.742453,High_spent_Large_value_payments,283.890747,Poor,0.59


Save the model

In [46]:
save_model(final_model, 'Credit_Score_Model')

Transformation Pipeline and Model Successfully Saved


(Pipeline(memory=FastMemory(location=C:\Users\josel\AppData\Local\Temp\joblib),
          steps=[('label_encoding',
                  TransformerWrapperWithInverse(exclude=None, include=None,
                                                transformer=LabelEncoder())),
                 ('numerical_imputer',
                  TransformerWrapper(exclude=None,
                                     include=['Age', 'Annual_Income',
                                              'Num_Bank_Accounts',
                                              'Num_Credit_Card', 'Interest_Rate',
                                              'Delay_from_due_date',
                                              'C...
                  RandomForestClassifier(bootstrap=True, ccp_alpha=0.0,
                                         class_weight=None, criterion='gini',
                                         max_depth=None, max_features='sqrt',
                                         max_leaf_nodes=None, max_sample