# Practical development of the bachelor's thesis (TFG)

For the practical part of the project, we use a dataset in which each entry is a person applying to a bank for credit. Each person is classified according to the risk that lending to them entails (a loan is either good or bad).

In [1]:
import pandas as pd
import altair as alt
from IPython.display import display
import warnings

warnings.filterwarnings("ignore")
%load_ext autoreload
%autoreload 2

### Data analysis

First, we load the data from the CSV file into a DataFrame.

In [4]:
data = pd.read_csv("./archive/german_credit_data.csv")  # read_csv already returns a DataFrame
print(data['Age'].min(), data['Age'].max())

19 75


The output above shows that ages range from 19 to 75. In total, the dataset has 1000 rows (one per person) and 10 columns, as the `data.info()` summary below confirms.

In [3]:
# unique to extract values

In [4]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 10 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   Age               1000 non-null   int64 
 1   Sex               1000 non-null   object
 2   Job               1000 non-null   int64 
 3   Housing           1000 non-null   object
 4   Saving accounts   817 non-null    object
 5   Checking account  606 non-null    object
 6   Credit amount     1000 non-null   int64 
 7   Duration          1000 non-null   int64 
 8   Purpose           1000 non-null   object
 9   Risk              1000 non-null   object
dtypes: int64(4), object(6)
memory usage: 78.2+ KB
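
The summary above reports 817 non-null values for `Saving accounts` and 606 for `Checking account`, so 183 and 394 entries are missing. A minimal sketch of how such counts can be read directly, using a small made-up frame (`toy` and its values are illustrative, not taken from the dataset):

```python
import numpy as np
import pandas as pd

# Illustrative toy frame; the values are made up, not taken from the dataset
toy = pd.DataFrame({
    "Age": [25, 40, 33, 58],
    "Saving accounts": ["little", np.nan, "rich", np.nan],
    "Checking account": [np.nan, "moderate", np.nan, np.nan],
})

# Number of missing entries per column
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Replace the gaps with an explicit category
toy_filled = toy.fillna("unknown")
print(toy_filled.isna().sum().sum())
```

Treating the gaps as their own `"unknown"` category keeps all rows, at the cost of mixing "no account" and "not recorded" into one level.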


In [4]:
data.fillna(value="unknown", inplace=True)
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 10 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   Age               1000 non-null   int64 
 1   Sex               1000 non-null   object
 2   Job               1000 non-null   int64 
 3   Housing           1000 non-null   object
 4   Saving accounts   1000 non-null   object
 5   Checking account  1000 non-null   object
 6   Credit amount     1000 non-null   int64 
 7   Duration          1000 non-null   int64 
 8   Purpose           1000 non-null   object
 9   Risk              1000 non-null   object
dtypes: int64(4), object(6)
memory usage: 78.2+ KB


In [13]:
data.groupby("Job").size()

Job
0     22
1    200
2    630
3    148
dtype: int64
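
The counts above can be turned into shares to see how unevenly the `Job` levels are populated. A small sketch that re-enters the four counts from the output by hand (the variable names are illustrative):

```python
import pandas as pd

# Counts copied from the groupby output above
job_counts = pd.Series({0: 22, 1: 200, 2: 630, 3: 148}, name="count")

# Share of each job level among all applicants
job_share = job_counts / job_counts.sum()
print(job_share)
```

On the full frame, `data["Job"].value_counts(normalize=True)` should give the same shares. Level 2 alone covers 63% of the applicants, while level 0 has only 22 rows.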

### Model training

In [2]:
from carla.data.catalog import CsvCatalog, OnlineCatalog
from carla.models.catalog import MLModelCatalog
from carla.models.negative_instances import predict_negative_instances
import carla.recourse_methods.catalog as recourse_catalog
from carla.data.causal_model import CausalModel
from carla.recourse_methods import GrowingSpheres, Wachter, CCHVAE, Dice, FOCUS

Using TensorFlow backend.


[INFO] Using Python-MIP package version 1.12.0 [model.py <module>]


In [52]:
data = pd.read_csv('./archive/german_credit_data.csv')
data.fillna(value="unknown", inplace=True)
# Drop any unnamed index column; copy so later assignments do not act on a view
data_carla = data.loc[:, ~data.columns.str.contains("^Unnamed")].copy()

# Encode the target: 'good' -> 1.0, anything else -> 0.0
data_carla["Risk"] = (data_carla["Risk"] == "good").astype(float)

data_carla.to_csv("./archive/german_credit_data_noNan.csv", index=False)

continuous = ["Age", "Credit amount", "Duration"]
categorical = ["Sex", "Job", "Housing", "Saving accounts", "Checking account", "Purpose"]
immutables = []

data_bank = CsvCatalog(file_path = "./archive/german_credit_data_noNan.csv",
                 continuous=continuous,
                 categorical=categorical,
                 immutables=immutables,
                 target='Risk')

In [7]:
data = pd.read_csv('./archive/german_credit_data.csv')

data.fillna(value="unknown", inplace=True)
# Copy so the edits below act on an independent frame rather than a view of `data`
data_carla = data.loc[:, ~data.columns.str.contains("^Unnamed")].copy()

# Encode the target as float: 'good' -> 1.0, 'bad' -> 0.0
data_carla["Risk"] = data_carla["Risk"].map({"good": 1.0, "bad": 0.0})

# Binarize Saving accounts and Checking account into 'rich' / 'poor'
wealth_map = {"quite rich": "rich", "moderate": "poor", "little": "poor", "unknown": "poor"}
data_carla["Saving accounts"] = data_carla["Saving accounts"].replace(wealth_map)
data_carla["Checking account"] = data_carla["Checking account"].replace(wealth_map)

# Binarize Housing: 'own' and 'rent' -> 'not free'; 'free' is kept as is
data_carla["Housing"] = data_carla["Housing"].replace({"own": "not free", "rent": "not free"})

# Binarize Job: levels 0-2 -> 0, levels 3-4 -> 1
data_carla["Job"] = data_carla["Job"].map({0: 0, 1: 0, 2: 0, 3: 1, 4: 1})

# Binarize Purpose into 'home/whim' and 'education/work'
purpose_map = {
    "car": "home/whim", "furniture/equipment": "home/whim", "radio/TV": "home/whim",
    "domestic appliances": "home/whim", "repairs": "home/whim", "vacation/others": "home/whim",
    "education": "education/work", "business": "education/work",
}
data_carla["Purpose"] = data_carla["Purpose"].replace(purpose_map)

data_carla.to_csv("./archive/german_credit_data_bin.csv", index=False)

continuous = ["Age", "Credit amount", "Duration"]
categorical = ["Sex", "Job", "Housing", "Saving accounts", "Checking account", "Purpose"]
immutables = []

data_bank = CsvCatalog(file_path = "./archive/german_credit_data_bin.csv",
                 continuous=continuous,
                 categorical=categorical,
                 immutables=immutables,
                 target='Risk')

display(data_bank.df)

Unnamed: 0,Age,Credit amount,Duration,Risk,Sex_male,Job_1,Housing_not free,Saving accounts_rich,Checking account_rich,Purpose_home/whim
0,0.857143,0.050567,0.029412,1.0,1.0,0.0,1.0,0.0,0.0,1.0
1,0.053571,0.313690,0.647059,0.0,0.0,0.0,1.0,0.0,0.0,1.0
2,0.535714,0.101574,0.117647,1.0,1.0,0.0,1.0,0.0,0.0,0.0
3,0.464286,0.419941,0.558824,1.0,1.0,0.0,0.0,0.0,0.0,1.0
4,0.607143,0.254209,0.294118,0.0,1.0,0.0,0.0,0.0,0.0,1.0
...,...,...,...,...,...,...,...,...,...,...
995,0.214286,0.081765,0.117647,1.0,0.0,0.0,1.0,0.0,0.0,1.0
996,0.375000,0.198470,0.382353,1.0,1.0,1.0,1.0,0.0,0.0,1.0
997,0.339286,0.030483,0.117647,1.0,1.0,0.0,1.0,0.0,0.0,1.0
998,0.071429,0.087763,0.602941,0.0,1.0,0.0,0.0,0.0,0.0,1.0


Now we can start using the data with the CARLA library. First, the data must be processed so that it is compatible with CARLA, using the *CsvCatalog* class.

In [20]:
import csv
archivo_data = "./statlog+german+credit+data/german.data"
archivo_csv = "./statlog+german+credit+data/german.csv"
headers = ["Checking account","Duration","Credit history",
           "Purpose","Credit amount","Saving accounts",
           "Employment since","Installment rate","Personal status",
           "Other debtors","Present residence","Property","Age",
           "Installment plans ","Housing","Credits","Job",
           "People liable","Telephone","Foreign worker","Risk"]

with open(archivo_data, "r") as file:
    data = [line.strip().split() for line in file]
    
with open(archivo_csv, "w", newline="") as file:
    writer = csv.writer(file)
    writer.writerow(headers)
    writer.writerows(data)
    
print("ARCHIVO CREADO")

continuous = ["Duration", "Credit amount", 
              "Installment rate",  
               "Present residence", "Age", 
               "Credits",  "People liable"] 

categorical = ["Checking account", "Credit history", "Purpose", 
               "Saving accounts", "Employment since", 
               "Personal status", "Other debtors", 
               "Property", 
               "Housing", "Job", "Telephone", 
               "Foreign worker"]
immutable = []

data_bank = CsvCatalog(file_path='./statlog+german+credit+data/german.csv',
                      continuous=continuous,
                      categorical=categorical,
                      immutables=immutable,
                      target="Risk")

display(data_bank.df.shape)


ARCHIVO CREADO


(1000, 58)
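
The resulting frame has 58 columns although the file has only 21, which is what one would expect if each categorical feature is expanded into indicator columns. A rough illustration with `pandas.get_dummies` on made-up rows; this is not CARLA's internal encoder, and the category codes below are only examples:

```python
import pandas as pd

# Made-up rows with one continuous and two categorical features
toy = pd.DataFrame({
    "Duration": [6, 48, 12],
    "Housing": ["A152", "A151", "A153"],
    "Telephone": ["A192", "A191", "A191"],
})

# Each categorical column becomes one indicator column per level
encoded = pd.get_dummies(toy, columns=["Housing", "Telephone"])
print(encoded.shape)
print(list(encoded.columns))
```

Three input columns become six here: `Duration` is kept, `Housing` contributes three indicators and `Telephone` two.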

Once the data is ready, we set up our model so that it can be trained.

In [9]:
import numpy as np
# Params for training
training_params = {"lr": 0.002, "epochs": 10, 
                   "batch_size": 1024, "hidden_size": [18, 9, 3]}

model = MLModelCatalog(data=data_bank,
                      model_type="linear",
                      backend="pytorch",
                      load_online=False)

model.train(learning_rate=training_params["lr"],
            epochs=training_params["epochs"],
            batch_size=training_params["batch_size"],
            hidden_size=training_params["hidden_size"])


some_factuals = predict_negative_instances(model, data=data_bank.df)
display(some_factuals)
some_factuals.to_excel("./contrafactuals.xlsx")

factual_to_compare = some_factuals.iloc[0]

Loaded model from C:\Users\gerar\carla\models\custom\linear.pt
test accuracy for model: 0.712


Unnamed: 0,Age,Credit amount,Duration,Risk,Sex_male,Job_1,Housing_not free,Saving accounts_rich,Checking account_rich,Purpose_home/whim
11,0.089286,0.223286,0.647059,0.0,0.0,0.0,1.0,0.0,0.0,0.0
18,0.446429,0.678387,0.294118,0.0,0.0,1.0,0.0,0.0,0.0,1.0
85,0.178571,0.063937,0.117647,1.0,0.0,1.0,1.0,0.0,0.0,0.0
116,0.196429,0.380984,0.558824,0.0,0.0,1.0,1.0,0.0,0.0,1.0
141,0.196429,0.250083,0.470588,1.0,0.0,1.0,1.0,0.0,0.0,1.0
195,0.267857,0.068835,0.073529,0.0,0.0,1.0,1.0,0.0,0.0,0.0
244,0.285714,0.175911,0.117647,1.0,0.0,0.0,1.0,1.0,0.0,0.0
332,0.089286,0.393859,0.823529,0.0,0.0,1.0,1.0,0.0,0.0,1.0
340,0.089286,0.302245,0.294118,1.0,0.0,0.0,0.0,0.0,0.0,0.0
374,0.732143,0.799604,0.823529,0.0,0.0,1.0,0.0,0.0,0.0,1.0
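
Some of the rows above have `Risk` equal to 1.0, which suggests that `predict_negative_instances` selects rows by the model's prediction rather than by the true label. A simplified sketch of that idea with a stand-in scoring function (`toy_predict_proba` and the 0.5 threshold are assumptions, not CARLA internals):

```python
import numpy as np
import pandas as pd

def toy_predict_proba(df):
    """Stand-in model: probability of the positive class from one feature."""
    return 1.0 / (1.0 + np.exp(-(4.0 * df["Age"] - 1.5)))

toy = pd.DataFrame({
    "Age": [0.09, 0.45, 0.18, 0.73],
    "Risk": [0.0, 0.0, 1.0, 0.0],
})

# Keep the rows the model scores below the threshold, whatever their true label
proba = toy_predict_proba(toy)
negatives = toy[proba < 0.5]
print(negatives)
```

Row 2 is kept even though its label is 1.0, because the stand-in model scores it below the threshold. These are the instances for which a counterfactual explanation is worth computing.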


In [3]:
data_name = "adult"
dataset = OnlineCatalog(data_name)

from carla.models.catalog import MLModelCatalog

training_params = {"lr": 0.002, "epochs": 10, "batch_size": 1024, "hidden_size": [18, 9, 3]}

ml_model = MLModelCatalog(
    dataset,
    model_type="ann",
    load_online=True,
    backend="pytorch"
)

ml_model.train(
    learning_rate=training_params["lr"],
    epochs=training_params["epochs"],
    batch_size=training_params["batch_size"],
    hidden_size=training_params["hidden_size"]
)

factuals = predict_negative_instances(ml_model, dataset.df)
test_factual = factuals[:100]

display(test_factual)

hyperparams = {
        "data_name": data_name,
        "n_search_samples": 100,
        "p_norm": 1,
        "step": 0.1,
        "max_iter": 1000,
        "clamp": True,
        "binary_cat_features": True,
        "vae_params": {
            "layers": [len(ml_model.feature_input_order), 512, 256, 8],
            "train": True,
            "lambda_reg": 1e-6,
            "epochs": 5,
            "lr": 1e-3,
            "batch_size": 32,
        },
    }

recourse_method = recourse_catalog.CCHVAE(ml_model, hyperparams)
df_cfs = recourse_method.get_counterfactuals(test_factual)

display(df_cfs)

# recourse_method = recourse_catalog.GrowingSpheres(ml_model)
# data_counterfactuals = recourse_method.get_counterfactuals(some_factuals)
# display(data_counterfactuals)
# data_counterfactuals.to_excel("prueba.xlsx")

balance on test set 0.23883245958934032, balance on test set 0.2408256880733945
Epoch 0/9
----------
train Loss: 0.5672 Acc: 0.6450

test Loss: 0.4304 Acc: 0.7788

Epoch 1/9
----------
train Loss: 0.4109 Acc: 0.7997

test Loss: 0.4010 Acc: 0.8064

Epoch 2/9
----------
train Loss: 0.3958 Acc: 0.8081

test Loss: 0.3924 Acc: 0.8119

Epoch 3/9
----------
train Loss: 0.3862 Acc: 0.8167

test Loss: 0.3846 Acc: 0.8168

Epoch 4/9
----------
train Loss: 0.3769 Acc: 0.8240

test Loss: 0.3728 Acc: 0.8303

Epoch 5/9
----------
train Loss: 0.3679 Acc: 0.8301

test Loss: 0.3648 Acc: 0.8355

Epoch 6/9
----------
train Loss: 0.3605 Acc: 0.8350

test Loss: 0.3583 Acc: 0.8376

Epoch 7/9
----------
train Loss: 0.3558 Acc: 0.8352

test Loss: 0.3583 Acc: 0.8348

Epoch 8/9
----------
train Loss: 0.3519 Acc: 0.8381

test Loss: 0.3499 Acc: 0.8397

Epoch 9/9
----------
train Loss: 0.3468 Acc: 0.8400

test Loss: 0.3469 Acc: 0.8421
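
The `balance` values printed at the top of the log are presumably the share of positive labels (about 0.24), so a classifier that always predicts the majority class would already reach roughly 0.76 accuracy. A quick check of how far the final epoch is above that baseline, assuming that reading of `balance`:

```python
# Values copied from the training log above
test_balance = 0.2408256880733945   # assumed: share of positive labels in the test set
final_test_acc = 0.8421             # test accuracy after the last epoch

# Accuracy of always predicting the majority class
majority_baseline = max(test_balance, 1.0 - test_balance)
gain = final_test_acc - majority_baseline

print(f"baseline={majority_baseline:.4f}, gain={gain:.4f}")
```

The trained network is therefore about 8 percentage points better than the trivial baseline, which is a more informative figure than the raw accuracy alone.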



Unnamed: 0,age,fnlwgt,education-num,capital-gain,capital-loss,...,occupation_Other,race_White,relationship_Non-Husband,sex_Male,workclass_Private
0,0.301370,0.044131,0.800000,0.02174,0.0000,...,0.0,1.0,1.0,1.0,0.0
1,0.452055,0.048052,0.800000,0.00000,0.0000,...,0.0,1.0,0.0,1.0,0.0
2,0.287671,0.137581,0.533333,0.00000,0.0000,...,1.0,1.0,1.0,1.0,1.0
3,0.493151,0.150486,0.400000,0.00000,0.0000,...,1.0,0.0,0.0,1.0,1.0
6,0.438356,0.100061,0.266667,0.00000,0.0000,...,1.0,0.0,1.0,0.0,1.0
...,...,...,...,...,...,...,...,...,...,...,...
127,0.191781,0.069448,0.733333,0.00000,0.0000,...,0.0,1.0,0.0,1.0,1.0
128,0.246575,0.079168,0.533333,0.00000,0.0000,...,0.0,1.0,0.0,1.0,1.0
129,0.301370,0.239125,0.600000,0.00000,0.0000,...,1.0,1.0,1.0,1.0,1.0
130,0.150685,0.038790,0.733333,0.00000,0.0000,...,1.0,1.0,1.0,0.0,1.0


[INFO] Start training of Variational Autoencoder... [models.py fit]
[INFO] [Epoch: 0/5] [objective: 0.362] [models.py fit]
[INFO] [ELBO train: 0.36] [models.py fit]
[INFO] [ELBO train: 0.14] [models.py fit]
[INFO] [ELBO train: 0.12] [models.py fit]
[INFO] [ELBO train: 0.12] [models.py fit]
[INFO] [ELBO train: 0.12] [models.py fit]
[INFO] ... finished training of Variational Autoencoder. [models.py fit]


Unnamed: 0,age,fnlwgt,education-num,capital-gain,capital-loss,...,occupation_Other,relationship_Non-Husband,race_White,sex_Male,native-country_US
0,,,,,,...,,,,,
1,,,,,,...,,,,,
2,,,,,,...,,,,,
3,,,,,,...,,,,,
6,,,,,,...,,,,,
...,...,...,...,...,...,...,...,...,...,...,...
127,,,,,,...,,,,,
128,,,,,,...,,,,,
129,,,,,,...,,,,,
130,,,,,,...,,,,,
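
Every counterfactual row shown above is NaN, so for the displayed factuals the method did not return a usable result. A small sketch for counting how many rows actually contain a counterfactual, on a made-up frame (`toy_cfs` stands in for `df_cfs`):

```python
import numpy as np
import pandas as pd

# Stand-in for df_cfs: one found counterfactual, two all-NaN rows
toy_cfs = pd.DataFrame(
    {"age": [0.31, np.nan, np.nan], "sex_Male": [1.0, np.nan, np.nan]},
    index=[0, 1, 2],
)

# Rows that are not entirely NaN count as found counterfactuals
found = toy_cfs.dropna(how="all")
success_rate = len(found) / len(toy_cfs)
print(f"{len(found)} of {len(toy_cfs)} found ({success_rate:.0%})")
```

Running the same two lines on `df_cfs` would show whether the failure affects all 100 factuals or only the rows that happen to be displayed.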


At this point, we have a trained model of type `ann` and have seen the output it returns.

In [10]:
recourse_method = recourse_catalog.GrowingSpheres(model)
data_counterfactuals = recourse_method.get_counterfactuals(some_factuals)
data_counterfactuals.to_excel("prueba.xlsx")

counterfactual = data_counterfactuals.iloc[0]
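
With one factual and its counterfactual at hand, the change that flips the prediction can be read off feature by feature. A sketch on two made-up rows (`toy_factual` and `toy_counterfactual` are illustrative stand-ins for `some_factuals.iloc[0]` and `counterfactual`):

```python
import pandas as pd

toy_factual = pd.Series(
    {"Age": 0.09, "Credit amount": 0.22, "Duration": 0.65, "Sex_male": 0.0}
)
toy_counterfactual = pd.Series(
    {"Age": 0.09, "Credit amount": 0.15, "Duration": 0.40, "Sex_male": 0.0}
)

# Per-feature change, keeping only the features that moved
delta = toy_counterfactual - toy_factual
changed = delta[delta.abs() > 1e-9]
print(changed)

# L1 distance: total size of the change
l1_distance = delta.abs().sum()
print(round(l1_distance, 2))
```

A small L1 distance with few changed features is usually what makes a counterfactual easy to act on: here only the credit amount and the duration would need to go down.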


In [12]:
from carla.plotting.plotting import summary_plot, single_sample_plot
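# Assumed completion of the truncated call: plot the first factual against its counterfactual
single_sample_plot(some_factuals.iloc[0], counterfactual, data_bank)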