# Practical development of the bachelor's thesis (TFG)

For the practical part of the project, we use a dataset in which each entry describes a person applying to a bank for credit. Each person is classified by the risk of lending to them (the loan is either good or bad).

In [2]:
import pandas as pd
import altair as alt
from IPython.display import display
import warnings

warnings.filterwarnings("ignore")
%load_ext autoreload
%autoreload 2

### Data analysis

First, we load the data from a CSV file into a DataFrame.

In [3]:
# read_csv already returns a DataFrame; drop the "Id" column
data = pd.read_csv("./archive/german_credit_data.csv").drop(columns="Id")
data

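| | Age | Sex | Job | Housing | Saving accounts | Checking account | Credit amount | Duration | Purpose | Risk |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 67 | male | 2 | own | NaN | little | 1169 | 6 | radio/TV | good |
| 1 | 22 | female | 2 | own | little | moderate | 5951 | 48 | radio/TV | bad |
| 2 | 49 | male | 1 | own | little | NaN | 2096 | 12 | education | good |
| 3 | 45 | male | 2 | free | little | little | 7882 | 42 | furniture/equipment | good |
| 4 | 53 | male | 2 | free | little | little | 4870 | 24 | car | bad |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 995 | 31 | female | 1 | own | little | NaN | 1736 | 12 | furniture/equipment | good |
| 996 | 40 | male | 3 | own | little | little | 3857 | 30 | car | good |
| 997 | 38 | male | 2 | own | little | NaN | 804 | 12 | radio/TV | good |
| 998 | 23 | male | 2 | free | little | little | 1845 | 45 | radio/TV | bad |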


As the display above shows, the dataset has 1000 rows in total (each row being one person), and each row has 10 columns.

In [3]:
# unique to extract values

In [4]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 10 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   Age               1000 non-null   int64 
 1   Sex               1000 non-null   object
 2   Job               1000 non-null   int64 
 3   Housing           1000 non-null   object
 4   Saving accounts   817 non-null    object
 5   Checking account  606 non-null    object
 6   Credit amount     1000 non-null   int64 
 7   Duration          1000 non-null   int64 
 8   Purpose           1000 non-null   object
 9   Risk              1000 non-null   object
dtypes: int64(4), object(6)
memory usage: 78.2+ KB


In [4]:
data.fillna(value="unknown", inplace=True)
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 10 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   Age               1000 non-null   int64 
 1   Sex               1000 non-null   object
 2   Job               1000 non-null   int64 
 3   Housing           1000 non-null   object
 4   Saving accounts   1000 non-null   object
 5   Checking account  1000 non-null   object
 6   Credit amount     1000 non-null   int64 
 7   Duration          1000 non-null   int64 
 8   Purpose           1000 non-null   object
 9   Risk              1000 non-null   object
dtypes: int64(4), object(6)
memory usage: 78.2+ KB


In [13]:
data.groupby("Job").size()

Job
0     22
1    200
2    630
3    148
dtype: int64

### Model training

In [3]:
from carla.data.catalog import CsvCatalog, OnlineCatalog
from carla.models.catalog.catalog import MLModelCatalog
from carla.models.negative_instances import predict_negative_instances
import carla.recourse_methods.catalog as recourse_catalog
from carla.data.causal_model import CausalModel
from carla.recourse_methods import GrowingSpheres, Wachter, CCHVAE

Using TensorFlow backend.


[INFO] Using Python-MIP package version 1.12.0 [model.py <module>]


Having analysed the data, we now need to prepare it for training the model.


In [22]:
# data = pd.DataFrame(pd.read_csv("./archive/german_credit_data.csv"))
# data.isnull().sum().sort_values(ascending=False)

data = pd.DataFrame(pd.read_csv('./statlog+german+credit+data/german.data-no-numeric.csv'))
data.columns

Index(['Checking Account', 'Duration', 'Credit History', 'Purpose',
       'Credit Amount', 'Saving Accounts', 'Employment Since',
       'Installment Rate', 'Personal Status', 'Other debtors',
       'Present Residence', 'Property', 'Age', 'Installment Plans', 'Housing',
       'Credits', 'Job', 'People Liable', 'Telephone', 'Foreign Worker',
       'Risk'],
      dtype='object')

As the earlier `data.info()` output shows, in `german_credit_data.csv` the Checking account column has 394 rows with NaN values (1000 - 606) and Saving accounts has 183 (1000 - 817).
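
The counts quoted above can also be obtained directly instead of being read off the non-null counts. A minimal sketch, assuming a small made-up frame in place of the real file (the values are invented for illustration only):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the credit data; values are invented for illustration.
toy = pd.DataFrame({
    "Saving accounts": ["little", np.nan, "rich", "little"],
    "Checking account": [np.nan, "moderate", np.nan, "little"],
    "Age": [67, 22, 49, 45],
})

# Number of missing values per column, largest first.
missing = toy.isna().sum().sort_values(ascending=False)
print(missing)
```

On the real DataFrame the same expression, `data.isna().sum()`, would be applied before any call to `fillna`.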

Since the CARLA library cannot handle data containing NaN values, we need to preprocess the data into a format it accepts. The first option is to drop the rows that contain a NaN value.

In [10]:
data_no_nan = data.dropna()
print("Dataset size after dropping NaN values:", data_no_nan.shape)

Dataset size after dropping NaN values: (1000, 21)


On `german_credit_data.csv`, dropping these rows shrinks the dataset considerably, from 1000 rows to 522. (The output shown above, `(1000, 21)`, corresponds to the 21-column Statlog file, which has no missing values, so nothing is dropped there.)

Losing nearly half of the rows (478 of 1000) would seriously reduce the quality of the study, so we do not consider the reduced dataset acceptable.

Instead, since the NaN values were already replaced with the value "unknown" in the previous section, we reuse that modified dataset so that 100% of the rows can be used.
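
The trade-off between the two options can be illustrated on a toy frame. This is only a sketch with invented values, not the real data: `dropna` keeps complete rows only, while `fillna("unknown")` keeps every row and turns missingness into an explicit category.

```python
import numpy as np
import pandas as pd

# Toy frame with invented values; two of the four rows contain a NaN.
toy = pd.DataFrame({
    "Saving accounts": ["little", np.nan, "rich", "little"],
    "Checking account": [np.nan, "moderate", "little", "little"],
    "Risk": ["good", "bad", "good", "bad"],
})

dropped = toy.dropna()           # keeps only rows without any NaN
filled = toy.fillna("unknown")   # keeps all rows, adds an "unknown" category

print("original:", len(toy), "dropped:", len(dropped), "filled:", len(filled))
```

Note that "unknown" then becomes one more level of the categorical feature, so the model can learn from the fact that a value was missing.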

In [8]:
# data.fillna(value="unknown", inplace=True)
# # .copy() avoids modifying a view of `data`
# data_carla = data.loc[:, ~data.columns.str.contains("^Unnamed")].copy()

# # Encode the target: 'good' -> 1.0, anything else -> 0.0
# data_carla["Risk"] = (data_carla["Risk"] == "good").astype(float)

# data_carla.info()
# data_carla.to_csv("./archive/german_credit_data_noNan.csv", index=False)



Now we can start using the data with the CARLA library. First, the data must be made compatible with CARLA, using the *CsvCatalog* class.

In [35]:
continuous = ["Duration", "Credit Amount", 
              "Installment Rate",  
               "Present Residence", "Age", 
               "Credits",  "People Liable"] 

categorical = ["Checking Account", "Credit History", "Purpose", "Saving Accounts", "Employment Since", 
               "Personal Status", "Other debtors", "Property", "Installment Plans", "Housing", "Job",
               "Telephone", "Foreign Worker"]
immutable = []
# data_bank = CsvCatalog(file_path = "./archive/german_credit_data_noNan.csv",
#                  continuous=continuous,
#                  categorical=categorical,
#                  immutables=immutable,
#                  target='Risk')

data_bank = CsvCatalog(file_path='./statlog+german+credit+data/german.data-no-numeric.csv',
                      continuous=continuous,
                      categorical=categorical,
                      immutables=immutable,
                      target="Risk")
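
Before building the catalog, it can help to check that the feature lists agree with the file's columns. The sketch below uses the column names printed earlier and the two lists from the cell above; `check_feature_lists` is a hypothetical helper written for this illustration and is not part of CARLA.

```python
def check_feature_lists(columns, continuous, categorical, target):
    """Return (missing, unassigned, duplicated) feature names.

    missing:    declared features that are not columns of the file
    unassigned: columns that are neither declared nor the target
    duplicated: features declared as both continuous and categorical
    """
    declared = list(continuous) + list(categorical)
    missing = sorted(set(declared) - set(columns))
    unassigned = sorted(set(columns) - set(declared) - {target})
    duplicated = sorted(set(continuous) & set(categorical))
    return missing, unassigned, duplicated


columns = ["Checking Account", "Duration", "Credit History", "Purpose",
           "Credit Amount", "Saving Accounts", "Employment Since",
           "Installment Rate", "Personal Status", "Other debtors",
           "Present Residence", "Property", "Age", "Installment Plans",
           "Housing", "Credits", "Job", "People Liable", "Telephone",
           "Foreign Worker", "Risk"]

continuous = ["Duration", "Credit Amount", "Installment Rate",
              "Present Residence", "Age", "Credits", "People Liable"]

categorical = ["Checking Account", "Credit History", "Purpose",
               "Saving Accounts", "Employment Since", "Personal Status",
               "Other debtors", "Property", "Installment Plans", "Housing",
               "Job", "Telephone", "Foreign Worker"]

result = check_feature_lists(columns, continuous, categorical, "Risk")
print(result)  # ([], [], []) means the lists cover every non-target column
```

With the lists used here, the 7 continuous and 13 categorical features plus the target account for all 21 columns.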

Once the data is ready, we set up the model so that it can be trained.

In [48]:
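# batch_size    = number of samples used in each gradient update
# learning rate = step size of each weight update
# epochs        = number of full passes over the training data
# hidden_size   = number of units in each hidden layer

# Parameters for training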
training_params = {"lr": 0.01, "epochs": 10, 
                   "batch_size": 100, "hidden_size": [18, 9, 3], 
                   "max_depth": 100}

model = MLModelCatalog(data=data_bank,
                      model_type="ann",
                       backend="tensorflow",
                      load_online=False)

model.train(learning_rate=training_params["lr"],
            epochs=training_params["epochs"],
            batch_size=training_params["batch_size"],
            hidden_size=training_params["hidden_size"],
            max_depth=training_params["max_depth"],
            force_train=True)

some_factuals = predict_negative_instances(model, data_bank.df).iloc[:5]

balance on test set 1.2933333333333332, balance on test set 1.32
Model: "sequential_18"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_66 (Dense)             (None, 18)                1080      
_________________________________________________________________
dense_67 (Dense)             (None, 9)                 171       
_________________________________________________________________
dense_68 (Dense)             (None, 3)                 30        
_________________________________________________________________
dense_69 (Dense)             (None, 2)                 8         
Total params: 1,289
Trainable params: 1,289
Non-trainable params: 0
_________________________________________________________________
None
Train on 750 samples, validate on 250 samples
Epoch 1/10


InvalidArgumentError: Incompatible shapes: [100,3] vs. [100,2]
	 [[{{node training_16/RMSprop/gradients/loss_16/dense_69_loss/weighted_binary_cross_entropy/mul_2_grad/BroadcastGradientArgs}}]]
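
One possible reading of the error above (an assumption, not verified against the CARLA source): the reported balance of about 1.29 is the mean of the target, which is what one would expect if `Risk` were coded 1/2 rather than 0/1. If the labels are then one-hot encoded by position, as Keras' `to_categorical` does, labels in {1, 2} yield three columns while the last layer of the network has two units. The sketch below reproduces that mismatch with NumPy only; `one_hot` is a hypothetical helper that mimics positional one-hot encoding, and the label values are invented.

```python
import numpy as np


def one_hot(labels):
    """Positional one-hot encoding: one column per integer 0..max(labels)."""
    labels = np.asarray(labels, dtype=int)
    return np.eye(labels.max() + 1)[labels]


# Labels coded 1/2 (assumed coding, invented values).
risk_12 = np.array([1, 2, 1, 1, 2])
encoded_12 = one_hot(risk_12)
print(encoded_12.shape)  # three columns: one each for 0, 1 and 2

# Remapping to 0/1 before encoding gives two columns,
# which matches a two-unit output layer.
risk_01 = risk_12 - 1
encoded_01 = one_hot(risk_01)
print(encoded_01.shape)
```

Which original code should become the favourable class (1) has to be checked against the dataset documentation; the subtraction here only illustrates the change in shape.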

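At this point we would expect to have a trained model of type `ann`. The run shown above, however, stopped with an `InvalidArgumentError` in the first epoch, and the counterfactuals below use the one-hot columns of the first dataset (`Checking account_*`, `Saving accounts_*`, `Sex_male`), so they appear to come from an earlier run.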

In [16]:
recourse_method = recourse_catalog.GrowingSpheres(model)
data_counterfactuals = recourse_method.get_counterfactuals(some_factuals)
display(data_counterfactuals)

| | Age | Checking account_little | Checking account_moderate | Checking account_rich | Checking account_unknown | ... | Saving accounts_moderate | Saving accounts_quite rich | Saving accounts_rich | Saving accounts_unknown | Sex_male |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.80856 | 1.0 | 0.0 | 0.0 | 1.0 | ... | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| 1 | 0.456959 | 1.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 |
| 2 | 0.606403 | 0.0 | 0.0 | 1.0 | 1.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 3 | 0.258234 | 0.0 | 1.0 | 0.0 | 0.0 | ... | 0.0 | 1.0 | 1.0 | 0.0 | 1.0 |
| 4 | 0.768767 | 0.0 | 1.0 | 1.0 | 1.0 | ... | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 |
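
One way to read a counterfactual table like the one above is to compare each row with its factual and list the features that changed. A minimal sketch, assuming both rows share the same columns and encoding; `changed_features` is a hypothetical helper and the values are invented, not taken from the output above.

```python
import pandas as pd


def changed_features(factual, counterfactual, tol=1e-6):
    """Names of features whose value differs by more than `tol`."""
    delta = (counterfactual - factual).abs()
    return delta[delta > tol].index.tolist()


# Invented example rows using the same kind of encoded columns.
factual = pd.Series(
    {"Age": 0.30, "Sex_male": 1.0, "Checking account_little": 1.0}
)
counterfactual = pd.Series(
    {"Age": 0.46, "Sex_male": 1.0, "Checking account_little": 0.0}
)

changed = changed_features(factual, counterfactual)
print(changed)
```

Applied row by row to `some_factuals` and `data_counterfactuals`, this kind of comparison shows which changes the recourse method proposes for each person.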
