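# Overview
* Incorporates feedback on an assignment I submitted to a deep learning study group, plus further improvements (ongoing; more improvements planned)
* Source data: [Kaggle CreditCard Fraud Detection](https://www.kaggle.com/datasets/mlg-ulb/creditcardfraud)
* Feedback applied
  * Use the weighted F1 score
* Additional experiments
  * Tried Optuna (`OptunaSearchCV`)
  * I have not studied random forests yet, so I picked only a few parameters and tuned them with Optuna
    * Applied pruning (`ccp_alpha`), which I was told helps prevent overfitting
  * Saving the best parameters separately and feeding them back into the estimator looked inconvenient, but the `refit` option makes the tuned estimator usable right away, so I applied it
* Results and impressions
  * In class we were told that, depending on the data, classical machine learning can be a better fit; it was interesting that the random forest scored higher than the deep learning model tuned with Keras Tuner
  * I was told that resampling such as SMOTE can actually hurt more complex machine learning models, so I planned to decide after seeing how the scores trended; in the end I did not apply it
  * Optuna takes its search space as int, float, and categorical distributions, which is easier to specify than in Keras Tuner, and overall it feels pleasant to use
    * In sample code I found with a quick search, the best parameters had to be carried over in a dict, but with `refit` I could load `best_estimator_` directly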


# Improvement Work

## Building the Dataset

In [None]:
import pandas as pd
import sqlite3

conn = sqlite3.connect('creditcard.db')
df = pd.read_sql_query("SELECT * FROM creditcard", conn)
conn.close()
df

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
0,0.0,-1.359807,-0.072781,2.536347,1.378155,-0.338321,0.462388,0.239599,0.098698,0.363787,...,-0.018307,0.277838,-0.110474,0.066928,0.128539,-0.189115,0.133558,-0.021053,149.62,0
1,0.0,1.191857,0.266151,0.166480,0.448154,0.060018,-0.082361,-0.078803,0.085102,-0.255425,...,-0.225775,-0.638672,0.101288,-0.339846,0.167170,0.125895,-0.008983,0.014724,2.69,0
2,1.0,-1.358354,-1.340163,1.773209,0.379780,-0.503198,1.800499,0.791461,0.247676,-1.514654,...,0.247998,0.771679,0.909412,-0.689281,-0.327642,-0.139097,-0.055353,-0.059752,378.66,0
3,1.0,-0.966272,-0.185226,1.792993,-0.863291,-0.010309,1.247203,0.237609,0.377436,-1.387024,...,-0.108300,0.005274,-0.190321,-1.175575,0.647376,-0.221929,0.062723,0.061458,123.50,0
4,2.0,-1.158233,0.877737,1.548718,0.403034,-0.407193,0.095921,0.592941,-0.270533,0.817739,...,-0.009431,0.798278,-0.137458,0.141267,-0.206010,0.502292,0.219422,0.215153,69.99,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
284802,172786.0,-11.881118,10.071785,-9.834783,-2.066656,-5.364473,-2.606837,-4.918215,7.305334,1.914428,...,0.213454,0.111864,1.014480,-0.509348,1.436807,0.250034,0.943651,0.823731,0.77,0
284803,172787.0,-0.732789,-0.055080,2.035030,-0.738589,0.868229,1.058415,0.024330,0.294869,0.584800,...,0.214205,0.924384,0.012463,-1.016226,-0.606624,-0.395255,0.068472,-0.053527,24.79,0
284804,172788.0,1.919565,-0.301254,-3.249640,-0.557828,2.630515,3.031260,-0.296827,0.708417,0.432454,...,0.232045,0.578229,-0.037501,0.640134,0.265745,-0.087371,0.004455,-0.026561,67.88,0
284805,172788.0,-0.240440,0.530483,0.702510,0.689799,-0.377961,0.623708,-0.686180,0.679145,0.392087,...,0.265245,0.800049,-0.163298,0.123205,-0.569159,0.546668,0.108821,0.104533,10.00,0


In [None]:
df_x = df.drop(['Time', 'Class'], axis=1).copy()
df_y = df['Class'].copy()

df_x.shape, df_y.shape

((284807, 29), (284807,))

### Splitting into Train, Validation, and Test
* Split into train and test only; there is no separate validation split because Optuna's CV feature handles validation

In [None]:
# Split into train and test
from sklearn.model_selection import train_test_split

# Stratify on the label so both splits keep the same class ratio
x_train, x_test, y_train, y_test = train_test_split(df_x, df_y, test_size=0.1, stratify=df_y)

print(f"{x_train.shape}, {x_test.shape}")
print(f"{y_train.shape}, {y_test.shape}")
print()
print(f"y_train {y_train.value_counts()}")
print(f"y_test {y_test.value_counts()}")

(256326, 29), (28481, 29)
(256326,), (28481,)

y_train Class
0    255883
1       443
Name: count, dtype: int64
y_test Class
0    28432
1       49
Name: count, dtype: int64


## Model Setup and Training (Machine Learning)

### RandomForestClassifier with optuna(OptunaSearchCV)
* Optuna `OptunaSearchCV` official documentation
  * [https://optuna.readthedocs.io/en/v2.0.0/reference/generated/optuna.integration.OptunaSearchCV.html](https://optuna.readthedocs.io/en/v2.0.0/reference/generated/optuna.integration.OptunaSearchCV.html)
* Optuna `OptunaSearchCV` sample code
  * [https://github.com/optuna/optuna-examples/blob/main/sklearn/sklearn_optuna_search_cv_simple.py](https://github.com/optuna/optuna-examples/blob/main/sklearn/sklearn_optuna_search_cv_simple.py)
* Scikit-learn `RandomForestClassifier` official documentation
  * [https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html#randomforestclassifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html#randomforestclassifier)
* Scikit-learn `ccp_alpha` (cost-complexity pruning, used to prevent overfitting)
  * [https://scikit-learn.org/stable/auto_examples/tree/plot_cost_complexity_pruning.html#sphx-glr-auto-examples-tree-plot-cost-complexity-pruning-py](https://scikit-learn.org/stable/auto_examples/tree/plot_cost_complexity_pruning.html#sphx-glr-auto-examples-tree-plot-cost-complexity-pruning-py)
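
To make the effect of `ccp_alpha` concrete, here is a small sketch on a single `DecisionTreeClassifier` (a random forest applies the same parameter to each of its trees). The synthetic dataset and the three alpha values are assumptions for illustration only; larger values of `ccp_alpha` prune more aggressively, so the node count should not increase.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced data with some label noise so the unpruned tree overgrows.
X, y = make_classification(
    n_samples=2_000, n_features=10, weights=[0.9, 0.1], flip_y=0.05, random_state=0
)

node_counts = {}
for alpha in (0.0, 0.005, 0.05):
    # Same random_state, so every run grows the same full tree before pruning.
    tree = DecisionTreeClassifier(ccp_alpha=alpha, random_state=0).fit(X, y)
    node_counts[alpha] = tree.tree_.node_count
    print(f"ccp_alpha={alpha}: {node_counts[alpha]} nodes")
```

In the search space below, `ccp_alpha` ranges from 0 to 0.05, so the tuner is free to choose anything from no pruning to fairly heavy pruning.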

In [None]:
import optuna
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(n_estimators=10) # n_estimators : number of trees

param_distributions = {
    "n_estimators": optuna.distributions.IntDistribution(10, 10),  # number of trees (fixed at 10)
    "max_depth": optuna.distributions.IntDistribution(2, 32, log=True),
    "criterion": optuna.distributions.CategoricalDistribution(['gini', 'entropy', 'log_loss']),
    "class_weight" : optuna.distributions.CategoricalDistribution(['balanced', 'balanced_subsample']),
    "ccp_alpha" : optuna.distributions.FloatDistribution(0, 0.05, step=0.005)
}

optuna_search = optuna.integration.OptunaSearchCV(
    clf, 
    param_distributions, 
    n_jobs=-1, # Number of parallel jobs. -1 means using all processors.
    cv=5,  # if the estimator is a classifier and the labels are binary or multiclass, sklearn.model_selection.StratifiedKFold is used (otherwise sklearn.model_selection.KFold)
    n_trials=100, 
    timeout=600, 
    verbose=2,
    scoring='f1_weighted',
    refit=True  # refit on the full training data with the best parameters; the refitted estimator is exposed as best_estimator_ and can predict right away
)

X, y = x_train, y_train
optuna_search.fit(X, y)

print("Best trial:")
trial = optuna_search.study_.best_trial

print("  Value: ", trial.value)
print("  Params: ")
for key, value in trial.params.items():
    print("    {}: {}".format(key, value))

  optuna_search = optuna.integration.OptunaSearchCV(
[I 2024-07-30 22:37:21,982] A new study created in memory with name: no-name-b5929343-b4ec-493f-9805-5e3c4e45dd7c
[I 2024-07-30 22:37:44,249] Trial 1 finished with value: 0.9952872331714773 and parameters: {'n_estimators': 10, 'max_depth': 2, 'criterion': 'log_loss', 'class_weight': 'balanced_subsample', 'ccp_alpha': 0.025}. Best is trial 1 with value: 0.9952872331714773.
[I 2024-07-30 22:37:45,772] Trial 4 finished with value: 0.995843694897976 and parameters: {'n_estimators': 10, 'max_depth': 3, 'criterion': 'gini', 'class_weight': 'balanced', 'ccp_alpha': 0.05}. Best is trial 4 with value: 0.995843694897976.
[I 2024-07-30 22:37:59,089] Trial 6 finished with value: 0.9904473457587857 and parameters: {'n_estimators': 10, 'max_depth': 5, 'criterion': 'gini', 'class_weight': 'balanced', 'ccp_alpha': 0.015}. Best is trial 4 with value: 0.995843694897976.
[I 2024-07-30 22:38:01,804] Trial 3 finished with value: 0.9926729729394594 and pa

Best trial:
  Value:  0.999488225721899
  Params: 
    n_estimators: 10
    max_depth: 18
    criterion: log_loss
    class_weight: balanced
    ccp_alpha: 0.0


#### Attributes(Best Parameter, Scorer, Best estimator[Fitted])

* Best Parameter

In [None]:
optuna_search.best_params_

{'n_estimators': 10,
 'max_depth': 18,
 'criterion': 'log_loss',
 'class_weight': 'balanced',
 'ccp_alpha': 0.0}

* Scorer

In [None]:
optuna_search.scorer_

make_scorer(f1_score, response_method='predict', pos_label=None, average=weighted)

* Best estimator[Fitted]

In [None]:
best_model_randomforest = optuna_search.best_estimator_
best_model_randomforest
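
The convenience of `refit` can be sketched by comparing the two paths mentioned in the overview: carrying `best_params_` over by hand versus using `best_estimator_` directly. To keep the example dependent on scikit-learn alone, it uses `GridSearchCV` as a stand-in for `OptunaSearchCV`; this is a substitution made for illustration, relying on the fact that both expose `best_params_` and `best_estimator_` when `refit=True`. The data, grid, and seeds are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=500, n_features=8, random_state=0)

search = GridSearchCV(
    RandomForestClassifier(n_estimators=10, random_state=0),
    param_grid={"max_depth": [2, 4]},
    scoring="f1_weighted",
    cv=3,
    refit=True,  # retrain on all of X, y with the best parameters
)
search.fit(X, y)

# Path 1: carry the parameters over by hand and fit a fresh estimator.
manual = RandomForestClassifier(
    n_estimators=10, random_state=0, **search.best_params_
).fit(X, y)

# Path 2: use the estimator that refit=True already trained.
refitted = search.best_estimator_

same = bool((manual.predict(X) == refitted.predict(X)).all())
print(search.best_params_, same)
```

With fixed seeds the two paths should yield the same predictions, so `refit` mainly saves the extra step of rebuilding and refitting the estimator.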

#### Model Evaluation
* Scikit-learn `cross_val_score` official documentation
    * [https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html#sklearn.model_selection.cross_val_score](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html#sklearn.model_selection.cross_val_score)
* Scikit-learn `f1_score` official documentation (explains the weighted F1 score used here)
  *  [https://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html#sklearn.metrics.f1_score](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html#sklearn.metrics.f1_score)
  * 'weighted': Calculate metrics for each label, and find their average weighted by support (the number of true instances for each label). `This alters ‘macro’ to account for label imbalance`; it can result in an F-score that is not between precision and recall.
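
The difference between the 'weighted' and 'macro' averages can be worked through on a toy example. The labels below (95 negatives, 5 positives, with 2 of the 5 positives detected) are made up for illustration; the sketch computes both averages by hand from the per-class F1 scores, so they can be compared with what `f1_score` returns.

```python
import numpy as np
from sklearn.metrics import f1_score

# Toy labels: 95 negatives followed by 5 positives.
y_true = np.array([0] * 95 + [1] * 5)
# All negatives are predicted correctly, but only 2 of the 5 positives are found.
y_pred = np.array([0] * 95 + [1] * 2 + [0] * 3)

per_class = f1_score(y_true, y_pred, average=None)  # F1 for class 0 and class 1
support = np.bincount(y_true)                       # true instances per class

macro = per_class.mean()                            # every class counts equally
weighted = np.average(per_class, weights=support)   # classes count by support

print("per-class F1:", per_class.round(3))
print("macro:", round(float(macro), 3), "weighted:", round(float(weighted), 3))
```

Because the majority class dominates the support, the weighted average stays close to the majority-class F1 even when the minority class is detected poorly, while the macro average drops. The same pattern is visible in the cross-validation scores below.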

In [None]:
import sklearn.model_selection

In [None]:
sklearn.model_selection.cross_val_score(best_model_randomforest, x_test, y_test, scoring='f1_weighted', cv=5, 
                                        n_jobs=None, verbose=0)

array([0.99921023, 0.99962696, 0.99945948, 0.99866279, 0.99897608])

In [None]:
sklearn.model_selection.cross_val_score(best_model_randomforest, x_test, y_test, scoring='f1_macro', cv=5, 
                                        n_jobs=None, verbose=0)

array([0.87482422, 0.9374121 , 0.94991206, 0.81548173, 0.78545062])

In [None]:
sklearn.model_selection.cross_val_score(best_model_randomforest, x_test, y_test, scoring='accuracy', cv=5, 
                                        n_jobs=None, verbose=0)


array([0.99912235, 0.99982444, 0.99964888, 0.99877107, 0.99929775])

In [None]:
from sklearn.metrics import classification_report 
print(classification_report(y_test, best_model_randomforest.predict(x_test)))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00     28432
           1       0.97      0.78      0.86        49

    accuracy                           1.00     28481
   macro avg       0.99      0.89      0.93     28481
weighted avg       1.00      1.00      1.00     28481
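
One point worth keeping in mind when reading the numbers in this section: `cross_val_score` clones the estimator and refits it on each training fold, so the `cross_val_score` calls above train new models on folds of `x_test` rather than scoring the model that was fitted on `x_train`. The classification report, by contrast, scores the fitted model directly. A minimal sketch of getting the weighted and macro F1 in that direct way is shown here; the synthetic data, model settings, and seeds are assumptions for illustration, not the notebook's setup.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

X, y = make_classification(
    n_samples=5_000, n_features=10, weights=[0.95, 0.05], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.1, stratify=y, random_state=0
)

# Fit once on the training split only.
model = RandomForestClassifier(
    n_estimators=10, class_weight="balanced", random_state=0
).fit(X_tr, y_tr)

# Score the already-fitted model on the held-out split; no refitting happens here.
y_hat = model.predict(X_te)
scores = {avg: f1_score(y_te, y_hat, average=avg) for avg in ("weighted", "macro")}
print(scores)
```

In the notebook itself, the equivalent would be calling `f1_score(y_test, best_model_randomforest.predict(x_test), average=...)`.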

