# Generalization Comparison - Statlog (German Credit Data) Data Set

[Dataset (OpenML)](https://www.openml.org/search?type=data&sort=runs&status=active&qualities.NumberOfClasses=%3D_2&id=31)

**Reference:** Dua, D. and Graff, C. (2019). UCI Machine Learning Repository [https://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.

## Loading the dataset

In [66]:
from sklearn.datasets import fetch_openml

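# as_frame=True makes the DataFrame return type explicit; the cells below rely on it
credit_dataset = fetch_openml(data_id=31, as_frame=True)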

## Dataset metadata

In [67]:
credit_dataset.keys()

dict_keys(['data', 'target', 'frame', 'categories', 'feature_names', 'target_names', 'DESCR', 'details', 'url'])

In [68]:
credit_dataset['DESCR']

'**Author**: Dr. Hans Hofmann  \n**Source**: [UCI](https://archive.ics.uci.edu/ml/datasets/statlog+(german+credit+data)) - 1994    \n**Please cite**: [UCI](https://archive.ics.uci.edu/ml/citation_policy.html)\n\n**German Credit dataset**  \nThis dataset classifies people described by a set of attributes as good or bad credit risks.\n\nThis dataset comes with a cost matrix: \n``` \nGood  Bad (predicted)  \nGood   0    1   (actual)  \nBad    5    0  \n```\n\nIt is worse to class a customer as good when they are bad (5), than it is to class a customer as bad when they are good (1).  \n\n### Attribute description  \n\n1. Status of the existing checking account, in Deutsche Mark.  \n2. Duration in months  \n3. Credit history (credits taken, paid back duly, delays, critical accounts)  \n4. Purpose of the credit (car, television,...)  \n5. Credit amount  \n6. Status of savings account/bonds, in Deutsche Mark.  \n7. Present employment, in number of years.  \n8. Installment rate in percentage o
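The cost matrix in the description above can be turned into a single cost-sensitive score. The following is a minimal sketch, not part of the original notebook: the helper `total_cost` and the five-label example are illustrative assumptions, and only the cost values (1 for a good customer classified as bad, 5 for a bad customer classified as good) come from the dataset description.

```python
# Misclassification costs from the dataset description, keyed as (actual, predicted).
COSTS = {
    ('good', 'good'): 0,
    ('good', 'bad'): 1,   # good customer rejected
    ('bad', 'good'): 5,   # bad customer accepted: the expensive mistake
    ('bad', 'bad'): 0,
}


def total_cost(y_true, y_pred):
    """Sum the cost of every (actual, predicted) pair. Hypothetical helper."""
    return sum(COSTS[(actual, predicted)] for actual, predicted in zip(y_true, y_pred))


# Toy labels for illustration only, not taken from the dataset.
example_true = ['good', 'good', 'bad', 'bad', 'good']
example_pred = ['good', 'bad', 'good', 'bad', 'good']
example_cost = total_cost(example_true, example_pred)
print(example_cost)  # one rejected good customer (1) + one accepted bad customer (5) = 6
```

A score like this penalizes accepting a bad customer five times more than rejecting a good one, which plain accuracy does not capture.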

In [69]:
credit_dataset['data']

| | checking_status | duration | credit_history | purpose | credit_amount | savings_status | employment | installment_commitment | personal_status | other_parties | residence_since | property_magnitude | age | other_payment_plans | housing | existing_credits | job | num_dependents | own_telephone | foreign_worker |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | <0 | 6.0 | critical/other existing credit | radio/tv | 1169.0 | no known savings | >=7 | 4.0 | male single | none | 4.0 | real estate | 67.0 | none | own | 2.0 | skilled | 1.0 | yes | yes |
| 1 | 0<=X<200 | 48.0 | existing paid | radio/tv | 5951.0 | <100 | 1<=X<4 | 2.0 | female div/dep/mar | none | 2.0 | real estate | 22.0 | none | own | 1.0 | skilled | 1.0 | none | yes |
| 2 | no checking | 12.0 | critical/other existing credit | education | 2096.0 | <100 | 4<=X<7 | 2.0 | male single | none | 3.0 | real estate | 49.0 | none | own | 1.0 | unskilled resident | 2.0 | none | yes |
| 3 | <0 | 42.0 | existing paid | furniture/equipment | 7882.0 | <100 | 4<=X<7 | 2.0 | male single | guarantor | 4.0 | life insurance | 45.0 | none | for free | 1.0 | skilled | 2.0 | none | yes |
| 4 | <0 | 24.0 | delayed previously | new car | 4870.0 | <100 | 1<=X<4 | 3.0 | male single | none | 4.0 | no known property | 53.0 | none | for free | 2.0 | skilled | 2.0 | none | yes |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 995 | no checking | 12.0 | existing paid | furniture/equipment | 1736.0 | <100 | 4<=X<7 | 3.0 | female div/dep/mar | none | 4.0 | real estate | 31.0 | none | own | 1.0 | unskilled resident | 1.0 | none | yes |
| 996 | <0 | 30.0 | existing paid | used car | 3857.0 | <100 | 1<=X<4 | 4.0 | male div/sep | none | 4.0 | life insurance | 40.0 | none | own | 1.0 | high qualif/self emp/mgmt | 1.0 | yes | yes |
| 997 | no checking | 12.0 | existing paid | radio/tv | 804.0 | <100 | >=7 | 4.0 | male single | none | 4.0 | car | 38.0 | none | own | 1.0 | skilled | 1.0 | none | yes |
| 998 | <0 | 45.0 | existing paid | radio/tv | 1845.0 | <100 | 1<=X<4 | 4.0 | male single | none | 4.0 | no known property | 23.0 | none | for free | 1.0 | skilled | 1.0 | yes | yes |


In [70]:
from pandas import DataFrame

dataframe: DataFrame = credit_dataset['data']
dataframe.dtypes

checking_status           category
duration                   float64
credit_history            category
purpose                   category
credit_amount              float64
savings_status            category
employment                category
installment_commitment     float64
personal_status           category
other_parties             category
residence_since            float64
property_magnitude        category
age                        float64
other_payment_plans       category
housing                   category
existing_credits           float64
job                       category
num_dependents             float64
own_telephone             category
foreign_worker            category
dtype: object

In [71]:
target = credit_dataset['target']
target

0      good
1       bad
2      good
3      good
4       bad
       ... 
995    good
996    good
997    good
998     bad
999    good
Name: class, Length: 1000, dtype: category
Categories (2, object): ['good', 'bad']

## Model training

### Run each classification method 10 times and store the evaluation metrics

In [72]:
# Select only the 7 numeric features (the 13 categorical columns are dropped)
features = dataframe.select_dtypes(include=['number'])

# Random states used to shuffle the dataset before each split
random_states = [1, 2, 3, 5, 8, 13, 21, 34, 55, 89]

# Initialize the lists that store the per-run results
knn_accuracy, knn_iou, knn_precision, knn_recall = [], [], [], []
logistic_accuracy, logistic_iou, logistic_precision, logistic_recall = [], [], [], []
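The numeric features selected above sit on very different scales (in the preview, `credit_amount` is in the thousands while `installment_commitment` takes single-digit values), and the Euclidean distance used by KNN is then dominated by the largest-scale column. The sketch below is not part of the original experiment: it uses a toy array in place of the real features, and `raw_knn` and `scaled_knn` are hypothetical names. It shows one common remedy, standardizing the features inside a pipeline so that the scaler is fitted on the training data only.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Toy stand-in for two numeric columns on very different scales
# (think credit_amount and installment_commitment). Illustrative values only.
x_toy = np.array([
    [1000.0, 1.0], [1100.0, 1.0], [5000.0, 1.0],
    [1000.0, 4.0], [1200.0, 4.0], [5200.0, 4.0],
])
# In this toy data the label depends only on the small-scale second column.
y_toy = np.array(['good', 'good', 'good', 'bad', 'bad', 'bad'])

# Same classifier, with and without standardization.
raw_knn = KNeighborsClassifier(n_neighbors=1).fit(x_toy, y_toy)
scaled_knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=1)).fit(x_toy, y_toy)

# The second column says 'good', but the first column is closest to a 'bad' row.
query = np.array([[5150.0, 1.5]])
raw_prediction = raw_knn.predict(query)[0]
scaled_prediction = scaled_knn.predict(query)[0]
print(raw_prediction, scaled_prediction)  # bad good
```

Without scaling, the nearest neighbor is chosen almost entirely by the large-scale column. After standardization both columns contribute comparably. Whether this changes the results on the real dataset would have to be measured.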

In [73]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import accuracy_score, jaccard_score, recall_score, precision_score

for state in random_states:

    # Hold out 25% of the dataset for testing and train on the remaining 75%
    x_train, x_test, y_train, y_test = train_test_split(features, target, test_size=0.25, random_state=state)

    logistic_model = LogisticRegression(max_iter=350)
    logistic_model.fit(x_train, y_train)
    logistic_y_predicted = logistic_model.predict(x_test)

    knn_model = KNeighborsClassifier(n_neighbors=50)
    knn_model.fit(x_train, y_train)
    knn_y_predicted = knn_model.predict(x_test)

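    # Encode the string labels as integers for the metrics. LabelEncoder sorts the
    # classes alphabetically ('bad' -> 0, 'good' -> 1), so the precision, recall and
    # IoU below are computed for the 'good' class. Fitting on the full target keeps
    # both classes known to the encoder whatever the split contains.
    label_encoder = LabelEncoder().fit(target)
    y_test_int = label_encoder.transform(y_test)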

    knn_y_predict_int = label_encoder.transform(knn_y_predicted)
    knn_accuracy.append(accuracy_score(y_test_int, knn_y_predict_int))
    knn_iou.append(jaccard_score(y_test_int, knn_y_predict_int))
    knn_precision.append(precision_score(y_test_int, knn_y_predict_int))
    knn_recall.append(recall_score(y_test_int, knn_y_predict_int))

    logistic_y_predict_int = label_encoder.transform(logistic_y_predicted)
    logistic_accuracy.append(accuracy_score(y_test_int, logistic_y_predict_int))
    logistic_iou.append(jaccard_score(y_test_int, logistic_y_predict_int))
    logistic_precision.append(precision_score(y_test_int, logistic_y_predict_int))
    logistic_recall.append(recall_score(y_test_int, logistic_y_predict_int))
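Because `LabelEncoder` maps 'bad' to 0 and 'good' to 1, and the scikit-learn metrics used above default to `pos_label=1`, the precision, recall and Jaccard scores collected in the loop describe the 'good' class. The sketch below is written for illustration only, with a hypothetical helper `binary_metrics` and toy labels. It spells out what each of the four metrics counts.

```python
def binary_metrics(y_true, y_pred, positive='good'):
    """Accuracy, IoU, precision and recall for one positive class.

    Hypothetical helper; assumes at least one positive label and one
    positive prediction, so no denominator is zero.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)  # true positives
    fp = sum(t != positive and p == positive for t, p in pairs)  # false positives
    fn = sum(t == positive and p != positive for t, p in pairs)  # false negatives
    tn = len(pairs) - tp - fp - fn                               # true negatives
    return {
        'accuracy': (tp + tn) / len(pairs),
        'iou': tp / (tp + fp + fn),   # Jaccard index: true negatives are ignored
        'precision': tp / (tp + fp),
        'recall': tp / (tp + fn),
    }


# Toy labels for illustration only, not taken from the dataset.
toy_true = ['good', 'good', 'good', 'bad', 'bad']
toy_pred = ['good', 'good', 'bad', 'good', 'bad']
toy_metrics = binary_metrics(toy_true, toy_pred)
print(toy_metrics)
```

Here there are 2 true positives, 1 false positive, 1 false negative and 1 true negative, giving an accuracy of 0.6, an IoU of 0.5, and a precision and recall of 2/3 each.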

## Visualizing the results

### Overall metrics (Accuracy, Intersection over Union, Precision, Recall)

In [74]:
from numpy import mean

data = {
    'Accuracy (mean)': [mean(logistic_accuracy), mean(knn_accuracy)],
    'Jaccard Index / IoU (mean)': [mean(logistic_iou), mean(knn_iou)],
    'Precision (mean)': [mean(logistic_precision), mean(knn_precision)],
    'Recall (mean)': [mean(logistic_recall), mean(knn_recall)],
}

columns = ['Logistic Regression', 'KNN']
DataFrame.from_dict(data, orient='index', columns=columns)

| | Logistic Regression | KNN |
|---|---|---|
| Accuracy (mean) | 0.6948 | 0.6988 |
| Jaccard Index / IoU (mean) | 0.684867 | 0.692986 |
| Precision (mean) | 0.711563 | 0.707311 |
| Recall (mean) | 0.947999 | 0.972277 |
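
For context, the full dataset contains 700 'good' and 300 'bad' customers, so a classifier that always predicts 'good' is a useful yardstick. The arithmetic below is a sketch added for illustration. It uses the class counts of the whole dataset, so the figures for any particular 25% test split will differ slightly.

```python
# Class counts for the full German Credit dataset (700 good, 300 bad).
n_good, n_bad = 700, 300

# A classifier that always answers 'good', scored with 'good' as the positive class.
tp, fp, fn, tn = n_good, n_bad, 0, 0

baseline = {
    'accuracy': (tp + tn) / (tp + fp + fn + tn),
    'iou': tp / (tp + fp + fn),
    'precision': tp / (tp + fp),
    'recall': tp / (tp + fn),
}
print(baseline)  # accuracy, IoU and precision are 0.7; recall is 1.0
```

Against this yardstick, the mean accuracies reported above (0.6948 and 0.6988) are on par with always predicting 'good'. The recall close to 1 suggests that both models assign most customers to the 'good' class.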
