### Credit risk classification (credit scoring)

The main goal of this example is to get familiar with Python packages for data processing and machine learning, focusing on **scikit-learn** and **LightGBM**. We use the **credit-g v.2** dataset, which contains information about bank clients. The task is to classify each client as a good or a bad credit risk.

**Package documentation:**
* [pandas docs](https://pandas.pydata.org/docs/)
* [scikit-learn docs](https://scikit-learn.org/stable/)
* [LightGBM docs](https://lightgbm.readthedocs.io/en/stable/)
* [Matplotlib docs](https://matplotlib.org/stable/index.html)
* [seaborn docs](https://seaborn.pydata.org/)

Using scikit-learn, load the *credit-g v.2* dataset (dataset id=44096), explore it, analyze the data, and prepare it for processing.  
Identify the data types: numeric and categorical.  
Check for missing values and handle them if any are found.  
Count the unique values in each column.

In [1]:
import pandas as pd
from sklearn.datasets import fetch_openml

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import (
    OrdinalEncoder,
    StandardScaler,
    LabelEncoder
)
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.metrics import accuracy_score

import seaborn as sns
import matplotlib.pyplot as plt
import lightgbm as lgb

In [2]:
credit = fetch_openml(data_id=44096, as_frame=True)

In [None]:
credit_df = credit['data']
credit_df.head()

In [None]:
credit_df_target = credit["target"]
credit_df_target.head()

In [None]:
print(credit_df.info())

In [None]:
print(credit_df.isnull().sum())

In [None]:
print(credit_df.nunique())
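
The three checks above (`info`, `isnull`, `nunique`) can also be combined into one overview table, which is often easier to scan. The following is a minimal sketch on a toy frame; `toy_df` and its values are hypothetical stand-ins for `credit_df`, not the real data.

```python
import pandas as pd

# Toy frame standing in for credit_df (hypothetical values, not the real data).
toy_df = pd.DataFrame(
    {
        "duration": [6, 48, 12, 42],
        "purpose": pd.Categorical(["radio/tv", "radio/tv", "education", None]),
    }
)

# One row per column: dtype, number of missing values, number of distinct values.
summary = pd.DataFrame(
    {
        "dtype": toy_df.dtypes.astype(str),
        "missing": toy_df.isnull().sum(),
        "unique": toy_df.nunique(),  # nunique() ignores missing values by default
    }
)
print(summary)
```

Replacing `toy_df` with `credit_df` should give the same overview for the real dataset.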

Separate the numeric and categorical columns.

In [None]:
num_cols = credit_df.select_dtypes(include=["int64", "float64"]).columns.tolist()
cat_cols = credit_df.select_dtypes(include=["category"]).columns.tolist()

print(f"Numeric columns:\n{num_cols}")
print(f"Categorical columns:\n{cat_cols}")

Inspect the unique values of the categorical columns.

In [None]:
for col in cat_cols:
    print(f"{col} : {credit_df[col].unique().tolist()}")

Create a preprocessing pipeline that:

- applies `StandardScaler` to all numeric columns except `age`
- applies `OrdinalEncoder` to the categorical columns
- uses a custom transformer that maps each client's age to the corresponding age group

Integrate these transformations into a single transformer pipeline using `ColumnTransformer`. In the cells below, `age` passes through the `ColumnTransformer` unchanged (`remainder="passthrough"`) and is binned in a subsequent pipeline step.

In [None]:
num_cols_without_age = num_cols.copy()
num_cols_without_age.remove("age")
print(num_cols_without_age)

In [11]:
preprocessor = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), num_cols_without_age),
        (
            "cat",
            OrdinalEncoder(
                handle_unknown="use_encoded_value",
                unknown_value=-1,
                encoded_missing_value=-1,
            ),
            cat_cols,
        ),
    ],
    remainder="passthrough",
    verbose_feature_names_out=False,
).set_output(transform="pandas")

In [12]:
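class AgeTransformer(BaseEstimator, TransformerMixin):
    """Replace the age column with an ordinal age-group label."""

    def __init__(self, age_column):
        self.age_column = age_column

    def fit(self, X, y=None):
        # Stateless transformer: the bin edges are fixed, so there is nothing to learn.
        return self

    def transform(self, X):
        age_bins = [0, 18, 25, 35, 45, 55, 65, 70, 100]

        # Work on a copy so the caller's DataFrame is not modified in place.
        X = X.copy()
        X[self.age_column] = pd.cut(
            X[self.age_column], bins=age_bins, labels=range(len(age_bins) - 1)
        )
        return X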
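
The binning inside `AgeTransformer` relies on `pd.cut`. The sketch below applies the same bin edges to a few hand-picked ages (the ages are illustrative) to show how group labels are assigned. Intervals are right-closed by default, so an age of exactly 25 falls into the `(18, 25]` group.

```python
import pandas as pd

age_bins = [0, 18, 25, 35, 45, 55, 65, 70, 100]
ages = pd.Series([19, 25, 26, 67, 75], name="age")

# Bins are (0, 18], (18, 25], (25, 35], ... and are labelled 0, 1, 2, ...
groups = pd.cut(ages, bins=age_bins, labels=range(len(age_bins) - 1))
print(groups.tolist())  # [1, 1, 2, 6, 7]
```

Ages outside the outermost edges (0 or below, above 100) would become missing values, so the edges should cover the full range present in the data.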

In [None]:
pipeline = Pipeline(
    [
        ("preprocessor", preprocessor),
        (
            "age",
            AgeTransformer(age_column="age"),
        ),
    ]
)

transformed_credit_df = pipeline.fit_transform(credit_df)
transformed_credit_df.head()

Encode the target variable using `LabelEncoder`.

In [14]:
label_encoder = LabelEncoder()
y_encoded = label_encoder.fit_transform(credit_df_target)
y = pd.Series(y_encoded, name="target")
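
`LabelEncoder` assigns integer codes in sorted label order, so it is worth checking which class ends up as 0 and which as 1 before interpreting correlations or predictions. A minimal sketch, assuming string labels `"good"` and `"bad"` (hypothetical; the actual labels in the dataset may differ):

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical target labels used only for illustration.
toy_target = ["good", "bad", "good", "good"]

encoder = LabelEncoder()
encoded = encoder.fit_transform(toy_target)

# classes_ is sorted, and the position of a label in it is its integer code.
mapping = {str(label): code for code, label in enumerate(encoder.classes_)}
print(mapping)            # {'bad': 0, 'good': 1}
print(encoded.tolist())   # [1, 0, 1, 1]
```

For the notebook's encoder, the same check is `label_encoder.classes_`.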

Add the target variable to the dataset and compute the correlation matrix.

In [15]:
df_with_target = transformed_credit_df.copy()
df_with_target["target"] = y

In [None]:
cor_mat = df_with_target.corr()

plt.figure(figsize=(16, 8))
sns.heatmap(cor_mat, annot=True, cmap="coolwarm", fmt=".2f")
plt.title("Correlation matrix")
plt.show()

Analyze the correlations with the target variable and pick out the 4 largest and 4 smallest correlations.

In [None]:
target_correlations = cor_mat["target"].sort_values(ascending=False)
print("Correlation:")
print(target_correlations)

In [None]:
threshold = 0.06
selected_features = target_correlations[
    target_correlations.abs() > threshold
].index.tolist()
selected_features.remove("target")

print(selected_features)
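
The cell above selects features with an absolute-correlation threshold. If exactly the 4 largest and 4 smallest correlations are wanted, `Series.nlargest` and `Series.nsmallest` offer a direct alternative. A sketch on a made-up correlation series (`toy_corr` and its values are hypothetical):

```python
import pandas as pd

# Made-up correlations with the target, standing in for cor_mat["target"].
toy_corr = pd.Series(
    {
        "target": 1.0, "a": 0.30, "b": -0.25, "c": 0.02, "d": 0.10,
        "e": -0.05, "f": 0.15, "g": -0.12, "h": 0.01, "i": -0.20,
    }
)

# Drop the target's correlation with itself before ranking.
feature_corr = toy_corr.drop("target")

top4 = feature_corr.nlargest(4).index.tolist()
bottom4 = feature_corr.nsmallest(4).index.tolist()
print(top4)     # ['a', 'f', 'd', 'c']
print(bottom4)  # ['b', 'i', 'g', 'e']
```

Note that "smallest" here means most negative. Ranking by `feature_corr.abs()` instead would pick the weakest relationships.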

Split the data into training and test sets.

In [19]:
X = transformed_credit_df[selected_features]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.35, random_state=42
)
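
When the classes are imbalanced, a plain random split can leave the training and test sets with somewhat different class ratios. As an optional variation, `train_test_split` accepts a `stratify` argument that preserves the ratio. The sketch below uses synthetic data with an assumed 70/30 class balance (not the real dataset):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 100 samples, 70 of class 1 and 30 of class 0 (assumed ratio).
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.array([1] * 70 + [0] * 30)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.35, random_state=42, stratify=y_toy
)

# Both splits keep roughly the original share of class 1.
print(len(y_tr), len(y_te))
print(round(y_tr.mean(), 3), round(y_te.mean(), 3))
```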

Train a LightGBM model and report its evaluation result using `accuracy_score`.

In [None]:
model = lgb.LGBMClassifier(num_leaves=8, max_depth=4, random_state=42, n_jobs=12)
model.fit(X_train, y_train)

In [None]:
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy score: {accuracy:.3f}")
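
An accuracy value is easier to interpret next to a trivial baseline. The sketch below computes the accuracy of always predicting the majority class, using scikit-learn's `DummyClassifier` on synthetic labels with an assumed 70/30 split (not the real data):

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Synthetic labels: 7 samples of class 1 and 3 of class 0 (assumed ratio).
y_true = np.array([1] * 7 + [0] * 3)
X_dummy = np.zeros((len(y_true), 1))  # features are ignored by the baseline

# Always predicts the most frequent class seen during fit.
baseline = DummyClassifier(strategy="most_frequent").fit(X_dummy, y_true)
baseline_acc = accuracy_score(y_true, baseline.predict(X_dummy))
print(f"Baseline accuracy: {baseline_acc:.3f}")  # 0.700
```

Fitting the same baseline on `X_train, y_train` and scoring it on `X_test, y_test` gives a reference point for the LightGBM accuracy.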

Using scikit-learn's `model_selection.GridSearchCV`, tune the hyperparameters of the LightGBM model.

In [None]:
from sklearn.model_selection import GridSearchCV

# Keep each LightGBM fit single-threaded and let GridSearchCV parallelize
# across candidates and folds; this avoids thread oversubscription.
lgb_model = lgb.LGBMClassifier(random_state=42, n_jobs=1)

param_grid = {
    "num_leaves": [4, 5, 6, 7],
    "learning_rate": [0.1, 0.01],
    "max_depth": [3, 4, 5, 6],
    "n_estimators": [200, 500, 1000],
}

grid_search = GridSearchCV(
    estimator=lgb_model,
    param_grid=param_grid,
    scoring="accuracy",  # equivalent to make_scorer(accuracy_score)
    n_jobs=-1,  # use all available cores
    cv=5,
    verbose=1,
)

grid_search.fit(X_train, y_train)

print(f"Best parameters found: {grid_search.best_params_}")
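
Grid search cost grows multiplicatively with each parameter list, so it helps to count the fits before running it. The sketch below uses scikit-learn's `ParameterGrid` on the same grid to compute the number of candidates and, with 5-fold cross-validation, the total number of model fits (excluding the final refit on the full training set).

```python
from sklearn.model_selection import ParameterGrid

param_grid = {
    "num_leaves": [4, 5, 6, 7],
    "learning_rate": [0.1, 0.01],
    "max_depth": [3, 4, 5, 6],
    "n_estimators": [200, 500, 1000],
}

n_candidates = len(ParameterGrid(param_grid))  # 4 * 2 * 4 * 3
n_fits = n_candidates * 5                      # one fit per candidate per fold
print(n_candidates, n_fits)  # 96 480
```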

In [None]:
best_model = grid_search.best_estimator_
y_pred = best_model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy score of the best model: {accuracy:.3f}")