# German Credit Data Exploration

## Conceptual description

| Variable                | Description                                             | Type                 | Values                                                                               |
| ----------------------- | ------------------------------------------------------- | -------------------- | ------------------------------------------------------------------------------------ |
| credit-risk             | Binary label of the applicant's credit behavior         | Binary               | `good`, `bad`                                                                        |
| checking\_status        | Status of the applicant's checking account              | Ordinal categorical  | `<0`, `0<=X<200`, `>=200`, `no checking`                                             |
| duration                | Credit duration in months                               | Numeric (discrete)   | 6, 12, 24, 48, 60, ...                                                               |
| credit\_history         | Applicant's prior credit history                        | Nominal categorical  | `critical/other existing credit`, `existing paid`, `delayed previously`, ...         |
| purpose                 | Stated purpose of the credit                            | Nominal categorical  | `radio/tv`, `education`, `furniture/equipment`, `new car`, `business`, ...           |
| credit\_amount          | Requested credit amount                                 | Numeric (continuous) | 250 – 18424 (e.g., 1169, 5951, 9055, ...)                                            |
| savings\_status         | Declared savings level                                  | Ordinal categorical  | `<100`, `100<=X<500`, `500<=X<1000`, `>=1000`, `no known savings`                    |
| employment              | Time in current employment                              | Ordinal categorical  | `<1`, `1<=X<4`, `4<=X<7`, `>=7`, `unemployed`                                        |
| installment\_commitment | Percentage of income allocated to the monthly payment   | Ordinal categorical  | 1, 2, 3, 4                                                                           |
| other\_parties          | Presence of a co-applicant or guarantor                 | Nominal categorical  | `none`, `guarantor`, `co applicant`                                                  |
| residence\_since        | Years of residence at the current address               | Numeric (discrete)   | 1, 2, 3, 4                                                                           |
| property\_magnitude     | Type of property or collateral declared                 | Nominal categorical  | `real estate`, `life insurance`, `car`, `no known property`                          |
| age                     | Applicant's age                                         | Numeric (continuous) | 19 – 75 (e.g., 22, 45, 53, ...)                                                      |
| other\_payment\_plans   | Other available payment plans                           | Nominal categorical  | `none`, `bank`, `stores`                                                             |
| housing                 | Housing arrangement                                     | Nominal categorical  | `own`, `rent`, `for free`                                                            |
| existing\_credits       | Number of existing credits                              | Numeric (discrete)   | 1, 2, 3, 4                                                                           |
| job                     | Type of occupation                                      | Nominal categorical  | `skilled`, `unskilled resident`, `high qualif/self emp/mgmt`, `unemp/unskilled non res` |
| num\_dependents         | Number of dependents                                    | Numeric (discrete)   | 1, 2                                                                                 |
| own\_telephone          | Whether the applicant has a telephone                   | Binary               | `yes`, `none`                                                                        |
| foreign\_worker         | Whether the applicant is a foreign worker               | Binary               | `yes`, `no`                                                                          |
| sex                     | Applicant's gender                                      | Nominal categorical  | `male`, `female`                                                                     |
| marital\_status         | Applicant's marital status                              | Nominal categorical  | `single`, `div/dep/mar`, `div/sep`, `mar/wid`                                        |


## Requirements

In [None]:
import os
import polars as pl

current_path = os.getcwd()
data_path = os.path.join(current_path, '..', '..', 'data')
filename = 'uci_german_credit.csv'
data_file_path = os.path.join(data_path, filename)

## Data view

In [43]:
df = pl.read_csv(source=data_file_path)

In [44]:
df

credit-risk,checking_status,duration,credit_history,purpose,credit_amount,savings_status,employment,installment_commitment,other_parties,residence_since,property_magnitude,age,other_payment_plans,housing,existing_credits,job,num_dependents,own_telephone,foreign_worker,sex,marital_status
str,str,i64,str,str,i64,str,str,i64,str,i64,str,i64,str,str,i64,str,i64,str,str,str,str
"""good""","""<0""",6,"""critical/other existing credit""","""radio/tv""",1169,"""no known savings""",""">=7""",4,"""none""",4,"""real estate""",67,"""none""","""own""",2,"""skilled""",1,"""yes""","""yes""","""male""","""single"""
"""bad""","""0<=X<200""",48,"""existing paid""","""radio/tv""",5951,"""<100""","""1<=X<4""",2,"""none""",2,"""real estate""",22,"""none""","""own""",1,"""skilled""",1,"""none""","""yes""","""female""","""div/dep/mar"""
"""good""","""no checking""",12,"""critical/other existing credit""","""education""",2096,"""<100""","""4<=X<7""",2,"""none""",3,"""real estate""",49,"""none""","""own""",1,"""unskilled resident""",2,"""none""","""yes""","""male""","""single"""
"""good""","""<0""",42,"""existing paid""","""furniture/equipment""",7882,"""<100""","""4<=X<7""",2,"""guarantor""",4,"""life insurance""",45,"""none""","""for free""",1,"""skilled""",2,"""none""","""yes""","""male""","""single"""
"""bad""","""<0""",24,"""delayed previously""","""new car""",4870,"""<100""","""1<=X<4""",3,"""none""",4,"""no known property""",53,"""none""","""for free""",2,"""skilled""",2,"""none""","""yes""","""male""","""single"""
…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…
"""good""","""no checking""",12,"""existing paid""","""furniture/equipment""",1736,"""<100""","""4<=X<7""",3,"""none""",4,"""real estate""",31,"""none""","""own""",1,"""unskilled resident""",1,"""none""","""yes""","""female""","""div/dep/mar"""
"""good""","""<0""",30,"""existing paid""","""used car""",3857,"""<100""","""1<=X<4""",4,"""none""",4,"""life insurance""",40,"""none""","""own""",1,"""high qualif/self emp/mgmt""",1,"""yes""","""yes""","""male""","""div/sep"""
"""good""","""no checking""",12,"""existing paid""","""radio/tv""",804,"""<100""",""">=7""",4,"""none""",4,"""car""",38,"""none""","""own""",1,"""skilled""",1,"""none""","""yes""","""male""","""single"""
"""bad""","""<0""",45,"""existing paid""","""radio/tv""",1845,"""<100""","""1<=X<4""",4,"""none""",4,"""no known property""",23,"""none""","""for free""",1,"""skilled""",1,"""yes""","""yes""","""male""","""single"""


In [45]:
df.shape

(1000, 22)

## Data types

In [46]:
print('Data types:\n')

for col, dtype in zip(df.columns, df.dtypes):
    print(f'- {col}: {dtype}\n')

Data types:

- credit-risk: String

- checking_status: String

- duration: Int64

- credit_history: String

- purpose: String

- credit_amount: Int64

- savings_status: String

- employment: String

- installment_commitment: Int64

- other_parties: String

- residence_since: Int64

- property_magnitude: String

- age: Int64

- other_payment_plans: String

- housing: String

- existing_credits: Int64

- job: String

- num_dependents: Int64

- own_telephone: String

- foreign_worker: String

- sex: String

- marital_status: String



## Unique values

In [47]:
print('Unique values:\n')

for col in df.columns:
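    try:
        # NumPy arrays print more compactly than Polars Series
        unique_values = df[col].unique().to_numpy()
    except Exception:
        # Fall back to the Series itself if the conversion fails
        unique_values = df[col].unique()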

    print(f'- {col}: {unique_values}\n')

Unique values:

- credit-risk: ['bad' 'good']

- checking_status: ['0<=X<200' '>=200' '<0' 'no checking']

- duration: [ 4  5  6  7  8  9 10 11 12 13 14 15 16 18 20 21 22 24 26 27 28 30 33 36
 39 40 42 45 47 48 54 60 72]

- credit_history: ['delayed previously' 'no credits/all paid' 'all paid' 'existing paid'
 'critical/other existing credit']

- purpose: ['radio/tv' 'education' 'domestic appliance' 'used car' 'other' 'business'
 'retraining' 'furniture/equipment' 'new car' 'repairs']

- credit_amount: [  250   276   338   339   343   362   368   385   392   409   426   428
   433   448   454   458   484   518   522   571   585   590   601   609
   618   625   626   629   639   640   652   654   660   662   666   672
   674   682   683   684   685   691   697   700   701   707   708   709
   717   719   727   730   731   741   745   750   753   754   759   760
   763   766   776   781   783   790   795   797   802   804   806   836
   841   846   860   866   874   882   884   886   888

## Processing

### Sort variables according to data type

Order the variables by type (the p1 quantitative columns first, then the p2 binary columns, then the p3 multiclass columns) so that our algorithms can be applied correctly.

In [48]:
quant_cols = [col for col, dtype in zip(df.columns, df.dtypes) if dtype != pl.String]
cat_cols = [col for col in df.columns if col not in quant_cols]
binary_cols = [col for col in cat_cols if len(df[col].unique()) == 2]
multiclass_cols = [col for col in cat_cols if col not in binary_cols]

In [49]:
quant_cols

['duration',
 'credit_amount',
 'installment_commitment',
 'residence_since',
 'age',
 'existing_credits',
 'num_dependents']

In [50]:
binary_cols

['credit-risk', 'own_telephone', 'foreign_worker', 'sex']

In [51]:
multiclass_cols

['checking_status',
 'credit_history',
 'purpose',
 'savings_status',
 'employment',
 'other_parties',
 'property_magnitude',
 'other_payment_plans',
 'housing',
 'job',
 'marital_status']

In [52]:
df = df[quant_cols + binary_cols + multiclass_cols]

### Encode categorical variables

In [53]:
exceptions = ['checking_status', 'savings_status', 'employment']
categorical_cols = [col for col in df.columns if df[col].dtype == pl.String]
encoding = {}

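# Nominal columns get integer codes in alphabetical order of their labels;
# the ordinal columns listed in `exceptions` are encoded by hand afterwards.
for col in [x for x in categorical_cols if x not in exceptions]:
    unique_values_sorted = sorted(df[col].unique().to_list())
    new_values = list(range(len(unique_values_sorted)))
    encoding[col] = dict(zip(unique_values_sorted, new_values))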

In [54]:
encoding['checking_status'] = {
    'no checking': 0, 
    '<0': 1,
    '0<=X<200': 2,
    '>=200': 3
}

encoding['savings_status'] = {
    'no known savings': 0,
    '<100': 1,
    '100<=X<500': 2,
    '500<=X<1000': 3,
    '>=1000': 4
}

encoding['employment'] = {
    'unemployed': 0,
    '<1': 1,
    '1<=X<4': 2,
    '4<=X<7': 3,
    '>=7': 4
}
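
The two encoding styles above can be contrasted in a small standalone sketch. It uses plain Python rather than Polars, and the helper `alphabetical_codes` is a hypothetical name introduced only for illustration. It shows why the ordinal columns are encoded by hand: alphabetical codes would scramble their ranking.

```python
def alphabetical_codes(values):
    """Map each distinct value to its position in sorted order (hypothetical helper)."""
    distinct = sorted(set(values))
    return {value: code for code, value in enumerate(distinct)}

# Nominal column: the codes are arbitrary identifiers with no ranking.
housing_codes = alphabetical_codes(['own', 'own', 'for free', 'rent', 'own'])

# Ordinal column: codes written by hand so that they follow the ranking.
employment_codes = {'unemployed': 0, '<1': 1, '1<=X<4': 2, '4<=X<7': 3, '>=7': 4}

# Encoding the same labels alphabetically would break that ranking.
scrambled = alphabetical_codes(employment_codes)

print(housing_codes)
print(scrambled)
```

In the scrambled mapping `unemployed` receives the highest code, which is the reason for keeping `checking_status`, `savings_status` and `employment` out of the automatic loop.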

In [55]:
encoding

{'credit-risk': {'bad': 0, 'good': 1},
 'own_telephone': {'none': 0, 'yes': 1},
 'foreign_worker': {'no': 0, 'yes': 1},
 'sex': {'female': 0, 'male': 1},
 'credit_history': {'all paid': 0,
  'critical/other existing credit': 1,
  'delayed previously': 2,
  'existing paid': 3,
  'no credits/all paid': 4},
 'purpose': {'business': 0,
  'domestic appliance': 1,
  'education': 2,
  'furniture/equipment': 3,
  'new car': 4,
  'other': 5,
  'radio/tv': 6,
  'repairs': 7,
  'retraining': 8,
  'used car': 9},
 'other_parties': {'co applicant': 0, 'guarantor': 1, 'none': 2},
 'property_magnitude': {'car': 0,
  'life insurance': 1,
  'no known property': 2,
  'real estate': 3},
 'other_payment_plans': {'bank': 0, 'none': 1, 'stores': 2},
 'housing': {'for free': 0, 'own': 1, 'rent': 2},
 'job': {'high qualif/self emp/mgmt': 0,
  'skilled': 1,
  'unemp/unskilled non res': 2,
  'unskilled resident': 3},
 'marital_status': {'div/dep/mar': 0, 'div/sep': 1, 'mar/wid': 2, 'single': 3},
 'checking_stat

In [56]:
for col in categorical_cols: 
    df = df.with_columns(pl.col(col).replace_strict(encoding[col]))

### Split into predictors and response

In [57]:
response = 'credit-risk'
predictors = [col for col in df.columns if col != response]

X = df[predictors]
y = df[response]

In [58]:
y.head(5)

credit-risk
i64
1
0
1
1
0


In [59]:
X.head()

duration,credit_amount,installment_commitment,residence_since,age,existing_credits,num_dependents,own_telephone,foreign_worker,sex,checking_status,credit_history,purpose,savings_status,employment,other_parties,property_magnitude,other_payment_plans,housing,job,marital_status
i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64
6,1169,4,4,67,2,1,1,1,1,1,1,6,0,4,2,3,1,1,1,3
48,5951,2,2,22,1,1,0,1,0,2,3,6,1,2,2,3,1,1,1,0
12,2096,2,3,49,1,2,0,1,1,0,1,2,1,3,2,3,1,1,3,3
42,7882,2,4,45,1,2,0,1,1,1,3,3,1,3,1,1,1,0,1,3
24,4870,3,4,53,2,2,0,1,1,1,2,4,1,2,2,2,1,0,1,3


### Compute p1, p2, p3

In [61]:
quant_predictors = [col for col in predictors if col in quant_cols]
binary_predictors = [col for col in predictors if col in binary_cols]
multiclass_predictors = [col for col in predictors if col in multiclass_cols]

p1 = len(quant_predictors)
p2 = len(binary_predictors)
p3 = len(multiclass_predictors)

In [64]:
print(f'p1: {p1},', f'p2: {p2},', f'p3: {p3}')

p1: 7, p2: 3, p3: 11
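
As a sanity check on these counts, here is a minimal sketch using only the column lists shown earlier. It assumes (this is not confirmed by the library's documentation here) that the clustering algorithm reads the first `p1` columns as quantitative, the next `p2` as binary and the last `p3` as multiclass, which is why the column order matters.

```python
quant = ['duration', 'credit_amount', 'installment_commitment',
         'residence_since', 'age', 'existing_credits', 'num_dependents']
binary = ['own_telephone', 'foreign_worker', 'sex']  # response 'credit-risk' excluded
multiclass = ['checking_status', 'credit_history', 'purpose', 'savings_status',
              'employment', 'other_parties', 'property_magnitude',
              'other_payment_plans', 'housing', 'job', 'marital_status']

predictors = quant + binary + multiclass
p1, p2, p3 = len(quant), len(binary), len(multiclass)

# The three blocks must cover every predictor exactly once.
assert p1 + p2 + p3 == len(predictors)

# Positional slices that an order-dependent algorithm would see (assumed behavior).
blocks = {
    'quantitative': predictors[:p1],
    'binary': predictors[p1:p1 + p2],
    'multiclass': predictors[p1 + p2:],
}
print({name: len(cols) for name, cols in blocks.items()})
```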


### Outlier analysis

In [87]:
from BigEDA.descriptive import outliers_table

In [89]:
outliers_df = outliers_table(X, auto=False, col_names=quant_predictors, h=1.5)
outliers_df

quant_variables,lower_bound,upper_bound,n_outliers,n_not_outliers,prop_outliers,prop_not_outliers
str,f64,f64,i64,i64,f64,f64
"""duration""",-6.0,42.0,70,930,0.07,0.93
"""credit_amount""",-2543.0,7881.0,73,927,0.073,0.927
"""installment_commitment""",-1.0,7.0,0,1000,0.0,1.0
"""residence_since""",-1.0,7.0,0,1000,0.0,1.0
"""age""",4.5,64.5,23,977,0.023,0.977
"""existing_credits""",-0.5,3.5,6,994,0.006,0.994
"""num_dependents""",1.0,1.0,155,845,0.155,0.845


In [92]:
outliers_df['prop_outliers'].mean()

0.046714285714285715
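
The bounds in the table are consistent with Tukey's rule, `Q1 - h*IQR` and `Q3 + h*IQR` (for `duration`, quartiles of 12 and 24 give -6 and 42). That `outliers_table` computes them this way is an assumption, and quartile conventions differ between libraries. Under that assumption, the sketch below reproduces the `num_dependents` row with the standard library. The column takes only the values 1 and 2, and the 845/155 split is inferred from the table. Because the interquartile range is zero, both bounds collapse to 1 and every value of 2 is flagged, so the 15.5% figure reflects a degenerate fence rather than genuinely extreme observations.

```python
from statistics import quantiles

def tukey_fences(values, h=1.5):
    """Return (lower, upper) bounds Q1 - h*IQR and Q3 + h*IQR."""
    q1, _, q3 = quantiles(values, n=4, method='inclusive')
    iqr = q3 - q1
    return q1 - h * iqr, q3 + h * iqr

# Counts inferred from the outlier table: 845 ones and 155 twos.
num_dependents = [1] * 845 + [2] * 155

lower, upper = tukey_fences(num_dependents)
n_outliers = sum(v < lower or v > upper for v in num_dependents)

print(lower, upper, n_outliers)
```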

---

Try one of our algorithms on the processed dataset to check that everything works correctly:

In [20]:
from FastKmedoids.models import FastKmedoidsGGower, FoldFastKmedoidsGGower

In [81]:
config = {
    'frac_sample_size': 0.05,
    'n_clusters': len(y.unique()),
    'method': 'pam',
    'init': 'heuristic',
    'max_iter': 100,
    'p1': p1,
    'p2': p2,
    'p3': p3,
    'd1': 'robust_mahalanobis',
    'd2': 'sokal',
    'd3': 'hamming',
    'robust_method': 'trimmed',
    'alpha': 0.05,
    'epsilon': 0.05,
    'n_iters': 20,
    'VG_sample_size': 1000,
    'VG_n_samples': 5
}

In [82]:

fast_kmedoids = FastKmedoidsGGower(
    n_clusters=config["n_clusters"], 
    method=config["method"], 
    init=config["init"], 
    max_iter=config["max_iter"], 
    random_state=123,
    frac_sample_size=config["frac_sample_size"], 
    p1=config["p1"], 
    p2=config["p2"], 
    p3=config["p3"], 
    d1=config["d1"], 
    d2=config["d2"], 
    d3=config["d3"], 
    robust_method=config["robust_method"], 
    alpha=config["alpha"], 
    epsilon=config["epsilon"], 
    n_iters=config["n_iters"], 
    VG_sample_size=config["VG_sample_size"], 
    VG_n_samples=config["VG_n_samples"]
)

fast_kmedoids.fit(X=X) 

In [83]:
fast_kmedoids.labels_

array([0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 1, 1,
       0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 1,
       0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0,
       0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1,
       0, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1,
       1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1,
       0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0,
       0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1,
       0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0,
       0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1,
       0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1,
       0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1,

In [84]:
from FastKmedoids.metrics import adjusted_accuracy

In [85]:
adj_acc, adj_labels = adjusted_accuracy(y_pred=fast_kmedoids.labels_, y_true=y.to_numpy())

In [86]:
adj_acc

0.618
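
Cluster labels are arbitrary, so cluster 0 need not correspond to class 0. The sketch below illustrates one common way to score a clustering against known classes: take the accuracy under the relabeling of clusters that maximizes it. Whether `adjusted_accuracy` works exactly like this is an assumption. The function `best_match_accuracy` is a hypothetical name, and the brute-force search over permutations is only practical for a handful of clusters (and requires no more clusters than classes).

```python
from itertools import permutations

def best_match_accuracy(y_true, y_pred):
    """Accuracy under the cluster-to-class relabeling that maximizes it."""
    clusters = sorted(set(y_pred))
    classes = sorted(set(y_true))
    best_acc, best_labels = -1.0, None
    # Try every one-to-one assignment of clusters to classes.
    for perm in permutations(classes, len(clusters)):
        mapping = dict(zip(clusters, perm))
        relabeled = [mapping[c] for c in y_pred]
        acc = sum(t == p for t, p in zip(y_true, relabeled)) / len(y_true)
        if acc > best_acc:
            best_acc, best_labels = acc, relabeled
    return best_acc, best_labels

# Toy example: the raw labels agree on 1 of 5 rows, the swapped labels on 4 of 5.
toy_true = [1, 0, 1, 1, 0]
toy_pred = [0, 1, 0, 0, 0]
toy_acc, toy_labels = best_match_accuracy(toy_true, toy_pred)
print(toy_acc, toy_labels)
```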