------------------------------
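1. Input features
The model uses the following variables as input features:

- hour: hour at which the accident occurred.

- is_holiday: whether the day was a public holiday.

- road_form_class: road form category.

- road_formD: detailed road form.

- carFLg: flag for the accident vehicle.

- carClassF: class of the accident vehicle.

- carClassVic: class of the victim vehicle.

- lo_crd: longitude of the accident location.

- la_crd: latitude of the accident location.

- grid_id: categorical variable created by binning the accident location into a grid of about 2 km x 2 km cells (0.02-degree bins in the code).

These features form the model's input data (X_combined).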
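The grid binning behind grid_id can be sketched as follows. This is a minimal illustration, not the notebook's exact code: the helper name `grid_id` and the sample coordinates are hypothetical, and it uses integer cell indices instead of rounded coordinates so the IDs stay short. Note that 0.02 degrees of latitude is about 2.2 km, while the east-west width of a 0.02-degree cell shrinks as latitude increases, so the cells are only approximately 2 km x 2 km.

```python
import numpy as np

def grid_id(lat, lon, cell_deg=0.02):
    """Return one string cell ID per point, binning coordinates into cell_deg-sized squares."""
    # floor(coord / cell size) gives the integer index of the cell along each axis
    lat_idx = np.floor(np.asarray(lat, dtype=float) / cell_deg).astype(int)
    lon_idx = np.floor(np.asarray(lon, dtype=float) / cell_deg).astype(int)
    return np.array([f"{a}_{b}" for a, b in zip(lat_idx, lon_idx)])

# Hypothetical coordinates: the first two points fall in the same cell,
# the third lies one cell further north.
lat = [37.501, 37.519, 37.521]
lon = [127.001, 127.019, 127.001]
ids = grid_id(lat, lon)
print(ids)
```

Because the IDs are arbitrary strings, they still need to be label-encoded (or otherwise treated as categorical) before being passed to a model.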

-------------------------

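2. Prediction target

 - The prediction target (dependent variable) is **accTypeD_merged_combined**.

It is derived from the original accident type (accTypeD) by merging some classes.
The final target classes are:

  1. 0 : crossing the road
  2. 1 : walking in the roadway
  3. 2 (merged class)
    - original classes (2, 6, 7, 8)
      - 2 : walking along the road edge
      - 6 : road departure
      - 7 : overturn / rollover
      - 8 : other
  4. 3 : rear-end collision (struck from behind)
  5. 4 : collision (any impact other than from behind)

  --------------------
  3. Removed classes
  - Classes 5 and 9 were removed because they have too few samples.
    - 5 : walking on the sidewalk
    - 9 : other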
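The class filtering and merging described above can be sketched on a toy series. The helper name `merge_classes` and the sample values are illustrative assumptions; the mapping itself (drop 5 and 9, fold 6, 7, 8 into 2) is the one stated in the text.

```python
import pandas as pd

DROPPED_CLASSES = [5, 9]               # removed: too few samples
MERGE_INTO_2 = {6: 2, 7: 2, 8: 2}      # folded into class 2

def merge_classes(acc_type: pd.Series) -> pd.Series:
    """Drop the rare classes, then map the merged classes onto label 2."""
    kept = acc_type[~acc_type.isin(DROPPED_CLASSES)]
    return kept.replace(MERGE_INTO_2)

# Toy input containing every original label once
raw = pd.Series([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
merged = merge_classes(raw)
print(merged.tolist())
```

The result contains only the labels 0 to 4, which is what the classifiers below are trained on.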

-------------------------------
SMOTE

In [None]:
# Import libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report
from sklearn.preprocessing import LabelEncoder
from imblearn.over_sampling import SMOTENC
from lightgbm import LGBMClassifier

# 1. Load the data
data = pd.read_csv('TA_cleaned.csv')  # replace with the actual file path

# 2. Drop classes 5 and 9 (.copy() avoids pandas' SettingWithCopyWarning on the assignments below)
data_filtered = data[data['accTypeD'].isin([0, 1, 2, 3, 4, 6, 7, 8])].copy()

# 3. Merge classes 6, 7, 8 into class 2
data_filtered['accTypeD_merged_combined'] = data_filtered['accTypeD'].replace({6: 2, 7: 2, 8: 2})

# 4. Build the grid and define the features
features_with_grid = ['hour', 'is_holiday', 'road_form_class', 'road_formD', 'carFLg',
                      'carClassF', 'carClassVic', 'lo_crd', 'la_crd']
data_filtered['grid_lat'] = (data_filtered['la_crd'] // 0.02) * 0.02  # 0.02-degree bins (about 2 km)
data_filtered['grid_lon'] = (data_filtered['lo_crd'] // 0.02) * 0.02
data_filtered['grid_id'] = data_filtered['grid_lat'].astype(str) + '_' + data_filtered['grid_lon'].astype(str)

# Categorical feature names
categorical_features = ['is_holiday', 'road_form_class', 'road_formD', 'carFLg',
                        'carClassF', 'carClassVic', 'grid_id']

# 5. Label-encode the categorical features
data_encoded = data_filtered.copy()
label_encoders = {}
for col in categorical_features:
    le = LabelEncoder()
    data_encoded[col] = le.fit_transform(data_encoded[col].astype(str))
    label_encoders[col] = le

# 6. Define features and labels
X = data_encoded[features_with_grid + ['grid_id']]
y = data_encoded['accTypeD_merged_combined']

# 7. Train/test split
X_train_combined, X_test_combined, y_train_combined, y_test_combined = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Column positions of the categorical features (required by SMOTENC)
categorical_indices = [X.columns.get_loc(col) for col in categorical_features]

# 8. Oversample the training set only with SMOTENC
smotenc = SMOTENC(categorical_features=categorical_indices, random_state=42)
X_train_resampled, y_train_resampled = smotenc.fit_resample(X_train_combined, y_train_combined)

# 9. Train the LightGBM model
lgbm_model = LGBMClassifier(random_state=42)
lgbm_model.fit(X_train_resampled, y_train_resampled)

# 10. Predict on the untouched test set and evaluate
y_pred_lgbm_resampled = lgbm_model.predict(X_test_combined)
accuracy_lgbm_resampled = accuracy_score(y_test_combined, y_pred_lgbm_resampled)
report_lgbm_resampled = classification_report(
    y_test_combined, y_pred_lgbm_resampled, target_names=[str(cls) for cls in np.unique(y_train_combined)]
)

# Print the results
print("LightGBM Accuracy after SMOTE:", accuracy_lgbm_resampled)
print("LightGBM Classification Report after SMOTE:\n", report_lgbm_resampled)



[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000808 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 823
[LightGBM] [Info] Number of data points in the train set: 3315, number of used features: 10
[LightGBM] [Info] Start training from score -1.609438
[LightGBM] [Info] Start training from score -1.609438
[LightGBM] [Info] Start training from score -1.609438
[LightGBM] [Info] Start training from score -1.609438
[LightGBM] [Info] Start training from score -1.609438
LightGBM Accuracy after SMOTE: 0.6202247191011236
LightGBM Classification Report after SMOTE:
               precision    recall  f1-score   support

           0       0.60      0.70      0.65        76
           1       0.44      0.33      0.38        21
           2       0.53      0.52      0.52       139
           3       0.64      0.75      0.69        48
           4 

---------------------------------------
Ensemble

In [None]:
!pip install catboost

Collecting catboost
  Downloading catboost-1.2.7-cp310-cp310-manylinux2014_x86_64.whl.metadata (1.2 kB)
Downloading catboost-1.2.7-cp310-cp310-manylinux2014_x86_64.whl (98.7 MB)
Installing collected packages: catboost
Successfully installed catboost-1.2.7


In [None]:
# Import libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report
from sklearn.preprocessing import LabelEncoder
from imblearn.over_sampling import SMOTENC
from lightgbm import LGBMClassifier
from xgboost import XGBClassifier
from catboost import CatBoostClassifier
from sklearn.ensemble import VotingClassifier

# 1. Load the data
data = pd.read_csv('TA_cleaned.csv')  # replace with the actual file path

# 2. Drop classes 5 and 9 (.copy() avoids pandas' SettingWithCopyWarning on the assignments below)
data_filtered = data[data['accTypeD'].isin([0, 1, 2, 3, 4, 6, 7, 8])].copy()

# 3. Merge classes 6, 7, 8 into class 2
data_filtered['accTypeD_merged_combined'] = data_filtered['accTypeD'].replace({6: 2, 7: 2, 8: 2})

# 4. Build the grid and define the features
features_with_grid = ['hour', 'is_holiday', 'road_form_class', 'road_formD', 'carFLg',
                      'carClassF', 'carClassVic', 'lo_crd', 'la_crd']
data_filtered['grid_lat'] = (data_filtered['la_crd'] // 0.02) * 0.02  # 0.02-degree bins (about 2 km)
data_filtered['grid_lon'] = (data_filtered['lo_crd'] // 0.02) * 0.02
data_filtered['grid_id'] = data_filtered['grid_lat'].astype(str) + '_' + data_filtered['grid_lon'].astype(str)

# Categorical feature names
categorical_features = ['is_holiday', 'road_form_class', 'road_formD', 'carFLg',
                        'carClassF', 'carClassVic', 'grid_id']

# 5. Label-encode the categorical features
data_encoded = data_filtered.copy()
label_encoders = {}
for col in categorical_features:
    le = LabelEncoder()
    data_encoded[col] = le.fit_transform(data_encoded[col].astype(str))
    label_encoders[col] = le

# 6. Define features and labels
X = data_encoded[features_with_grid + ['grid_id']]
y = data_encoded['accTypeD_merged_combined']

# 7. Train/test split (10% test here, unlike the 20% used in the previous cell)
X_train_combined, X_test_combined, y_train_combined, y_test_combined = train_test_split(
    X, y, test_size=0.1, random_state=42
)

# Column positions of the categorical features (required by SMOTENC)
categorical_indices = [X.columns.get_loc(col) for col in categorical_features]

# 8. Oversample the training set only with SMOTENC
smotenc = SMOTENC(categorical_features=categorical_indices, random_state=42)
X_train_resampled, y_train_resampled = smotenc.fit_resample(X_train_combined, y_train_combined)

# 9. Define the base models
lgbm_model = LGBMClassifier(random_state=42)
# use_label_encoder is reported as unused by this XGBoost version (see the log below)
xgb_model = XGBClassifier(random_state=42, use_label_encoder=False, eval_metric='mlogloss')
catboost_model = CatBoostClassifier(random_state=42, verbose=0)

# 10. Build the ensemble (VotingClassifier)
ensemble_model = VotingClassifier(
    estimators=[
        ('lightgbm', lgbm_model),
        ('xgboost', xgb_model),
        ('catboost', catboost_model)
    ],
    voting='soft'  # soft voting: average the predicted class probabilities
)

# Train the ensemble
ensemble_model.fit(X_train_resampled, y_train_resampled)

# 11. Predict and evaluate
y_pred_ensemble = ensemble_model.predict(X_test_combined)
accuracy_ensemble = accuracy_score(y_test_combined, y_pred_ensemble)
report_ensemble = classification_report(
    y_test_combined, y_pred_ensemble, target_names=[str(cls) for cls in np.unique(y_train_combined)]
)

# Print the results
print("Ensemble Model Accuracy after SMOTE:", accuracy_ensemble)
print("Ensemble Model Classification Report after SMOTE:\n", report_ensemble)



[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000438 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 825
[LightGBM] [Info] Number of data points in the train set: 3725, number of used features: 10
[LightGBM] [Info] Start training from score -1.609438
[LightGBM] [Info] Start training from score -1.609438
[LightGBM] [Info] Start training from score -1.609438
[LightGBM] [Info] Start training from score -1.609438
[LightGBM] [Info] Start training from score -1.609438


Parameters: { "use_label_encoder" } are not used.



Ensemble Model Accuracy after SMOTE: 0.6681614349775785
Ensemble Model Classification Report after SMOTE:
               precision    recall  f1-score   support

           0       0.68      0.66      0.67        41
           1       0.10      0.12      0.11         8
           2       0.58      0.59      0.58        63
           3       0.71      0.78      0.75        32
           4       0.80      0.75      0.77        79

    accuracy                           0.67       223
   macro avg       0.57      0.58      0.58       223
weighted avg       0.68      0.67      0.67       223
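
For reference, soft voting as used by the ensemble above amounts to averaging the per-class probabilities of the base models and taking the argmax. The sketch below shows that rule in plain NumPy; the function name `soft_vote` and the probability values are made up for illustration and are not outputs of the models in this notebook.

```python
import numpy as np

def soft_vote(probas, weights=None):
    """Average per-model class probabilities and return (labels, averaged probabilities)."""
    # Stack into shape (n_models, n_samples, n_classes)
    stacked = np.stack([np.asarray(p, dtype=float) for p in probas])
    avg = np.average(stacked, axis=0, weights=weights)
    return np.argmax(avg, axis=1), avg

# One sample, two classes, three hypothetical models.
# Two models lean slightly towards class 1, one is very confident in class 0.
model_a = [[0.90, 0.10]]
model_b = [[0.40, 0.60]]
model_c = [[0.45, 0.55]]

labels, avg = soft_vote([model_a, model_b, model_c])
print(labels, avg)
```

In this example a majority (hard) vote would pick class 1, while soft voting picks class 0 because the confident model dominates the average. That sensitivity to confidence is why soft voting relies on reasonably calibrated base models.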



------------------------------------------------

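**Compensating for the shortage of class 1 data**


1. Data augmentation: increase the number of class 1 samples with SMOTENC or ADASYN.
2. Class weights: give the minority class a larger weight.
3. Threshold adjustment: use a lower decision threshold for class 1.
4. Focal loss: use a loss function that is more sensitive to minority-class predictions.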
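
Item 3 (threshold adjustment) can be sketched as a small post-processing step on predicted probabilities. This is one possible rule, not the only one: predict the target class whenever its probability reaches the threshold, otherwise fall back to the usual argmax. The function name `predict_with_class_threshold` and the probabilities are hypothetical, and the sketch assumes the probability columns are ordered by class label 0 to 4.

```python
import numpy as np

def predict_with_class_threshold(proba, target_class=1, threshold=0.3):
    """Argmax prediction, overridden by target_class when its probability >= threshold."""
    proba = np.asarray(proba, dtype=float)
    preds = np.argmax(proba, axis=1)
    # Lowering the threshold trades precision on target_class for recall
    preds[proba[:, target_class] >= threshold] = target_class
    return preds

# Hypothetical probabilities for three samples over classes 0..4
proba = np.array([
    [0.50, 0.35, 0.05, 0.05, 0.05],  # class 1 at 0.35: overridden to class 1
    [0.60, 0.10, 0.10, 0.10, 0.10],  # class 1 below threshold: stays class 0
    [0.10, 0.20, 0.40, 0.20, 0.10],  # class 1 below threshold: stays class 2
])
preds = predict_with_class_threshold(proba)
print(preds)
```

With a fitted classifier this would be applied to the output of `predict_proba`. The threshold value itself is best chosen on a validation split rather than on the test set.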

In [None]:
# Import libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report
from sklearn.preprocessing import LabelEncoder
from imblearn.over_sampling import SMOTENC, ADASYN
from lightgbm import LGBMClassifier
from xgboost import XGBClassifier
from catboost import CatBoostClassifier
from sklearn.ensemble import VotingClassifier
from sklearn.metrics import precision_recall_curve

# 1. Load the data
data = pd.read_csv('TA_cleaned.csv')  # replace with the actual file path

# 2. Drop classes 5 and 9 (.copy() avoids pandas' SettingWithCopyWarning on the assignments below)
data_filtered = data[data['accTypeD'].isin([0, 1, 2, 3, 4, 6, 7, 8])].copy()

# 3. Merge classes 6, 7, 8 into class 2
data_filtered['accTypeD_merged_combined'] = data_filtered['accTypeD'].replace({6: 2, 7: 2, 8: 2})

# 4. Build the grid and define the features
features_with_grid = ['hour', 'is_holiday', 'road_form_class', 'road_formD', 'carFLg',
                      'carClassF', 'carClassVic', 'lo_crd', 'la_crd']
data_filtered['grid_lat'] = (data_filtered['la_crd'] // 0.02) * 0.02  # 0.02-degree bins (about 2 km)
data_filtered['grid_lon'] = (data_filtered['lo_crd'] // 0.02) * 0.02
data_filtered['grid_id'] = data_filtered['grid_lat'].astype(str) + '_' + data_filtered['grid_lon'].astype(str)

# Additional engineered features
data_filtered['is_night'] = data_filtered['hour'].apply(lambda x: 1 if x < 6 or x >= 18 else 0)
data_filtered['is_rush_hour'] = data_filtered['hour'].apply(lambda x: 1 if 7 <= x <= 9 or 17 <= x <= 19 else 0)
data_filtered['holiday_road_combo'] = data_filtered['is_holiday'].astype(str) + '_' + data_filtered['road_form_class'].astype(str)

# Categorical feature names
categorical_features = ['is_holiday', 'road_form_class', 'road_formD', 'carFLg',
                        'carClassF', 'carClassVic', 'grid_id', 'holiday_road_combo']

# 5. Label-encode the categorical features
data_encoded = data_filtered.copy()
label_encoders = {}
for col in categorical_features:
    le = LabelEncoder()
    data_encoded[col] = le.fit_transform(data_encoded[col].astype(str))
    label_encoders[col] = le

# 6. Define features and labels
X = data_encoded[features_with_grid + ['grid_id', 'is_night', 'is_rush_hour', 'holiday_road_combo']]
y = data_encoded['accTypeD_merged_combined']

# 7. Train/test split
X_train_combined, X_test_combined, y_train_combined, y_test_combined = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Column positions of the categorical features
# (not used below: ADASYN has no categorical option and interpolates the encoded codes as numbers)
categorical_indices = [X.columns.get_loc(col) for col in categorical_features]

# 8. Data augmentation (ADASYN)
# Oversample class 1 only; the dict value is the target total for class 1 (150 samples), not the number added
adasyn = ADASYN(sampling_strategy={1: 150}, random_state=42)
X_train_resampled, y_train_resampled = adasyn.fit_resample(X_train_combined, y_train_combined)

# 9. Define the ensemble
# Weight class 1 five times as heavily as the other classes
class_weights = {0: 1, 1: 5, 2: 1, 3: 1, 4: 1}

lgbm_model = LGBMClassifier(random_state=42, class_weight=class_weights)
# scale_pos_weight and use_label_encoder are reported as unused by XGBoost here (see the log below),
# so the XGBoost model is effectively unweighted
xgb_model = XGBClassifier(random_state=42, scale_pos_weight=5, use_label_encoder=False, eval_metric='mlogloss')
catboost_model = CatBoostClassifier(random_state=42, class_weights=class_weights, verbose=0)

ensemble_model = VotingClassifier(
    estimators=[
        ('lightgbm', lgbm_model),
        ('xgboost', xgb_model),
        ('catboost', catboost_model)
    ],
    voting='soft'
)

# 10. Train the ensemble
ensemble_model.fit(X_train_resampled, y_train_resampled)

# 11. Threshold adjustment
y_prob_ensemble = ensemble_model.predict_proba(X_test_combined)
threshold = 0.3  # intended lower threshold for class 1
# NOTE: `threshold` is defined but never applied; the predictions are a plain argmax,
# and the results reported below reflect that
y_pred_adjusted = np.argmax(y_prob_ensemble, axis=1)

# 12. Evaluate
accuracy_ensemble = accuracy_score(y_test_combined, y_pred_adjusted)
report_ensemble = classification_report(y_test_combined, y_pred_adjusted, target_names=[str(cls) for cls in np.unique(y)])

# Print the results
print("Ensemble Model Accuracy after ADASYN and Class Weighting:", accuracy_ensemble)
print("Ensemble Model Classification Report:\n", report_ensemble)



[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000316 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 836
[LightGBM] [Info] Number of data points in the train set: 1852, number of used features: 13
[LightGBM] [Info] Start training from score -2.039211
[LightGBM] [Info] Start training from score -1.205068
[LightGBM] [Info] Start training from score -1.452476
[LightGBM] [Info] Start training from score -2.748257
[LightGBM] [Info] Start training from score -1.301338


Parameters: { "scale_pos_weight", "use_label_encoder" } are not used.



Ensemble Model Accuracy after ADASYN and Class Weighting: 0.6202247191011236
Ensemble Model Classification Report:
               precision    recall  f1-score   support

           0       0.60      0.67      0.63        76
           1       0.22      0.19      0.21        21
           2       0.56      0.53      0.55       139
           3       0.70      0.67      0.68        48
           4       0.70      0.71      0.71       161

    accuracy                           0.62       445
   macro avg       0.56      0.55      0.55       445
weighted avg       0.62      0.62      0.62       445



In [None]:
# Import libraries
# (this cell reuses class_weights, the resampled training data, and the imports from the previous cell)
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report

# Define the base models
lgbm_model = LGBMClassifier(random_state=42, class_weight=class_weights)
# scale_pos_weight and use_label_encoder are reported as unused by XGBoost here (see the log below)
xgb_model = XGBClassifier(random_state=42, scale_pos_weight=5, use_label_encoder=False, eval_metric='mlogloss')
catboost_model = CatBoostClassifier(random_state=42, class_weights=class_weights, verbose=0)

# Define the StackingClassifier
stacking_model = StackingClassifier(
    estimators=[
        ('lightgbm', lgbm_model),
        ('xgboost', xgb_model),
        ('catboost', catboost_model)
    ],
    final_estimator=LogisticRegression(max_iter=1000),  # logistic regression as the meta-model
    stack_method='predict_proba',  # pass each base model's class probabilities to the meta-model
    cv=5  # 5-fold cross-validation to generate the meta-features
)

# Train the stacking model
stacking_model.fit(X_train_resampled, y_train_resampled)

# Predict and evaluate
y_pred_stacking = stacking_model.predict(X_test_combined)
accuracy_stacking = accuracy_score(y_test_combined, y_pred_stacking)
report_stacking = classification_report(y_test_combined, y_pred_stacking, target_names=[str(cls) for cls in np.unique(y)])

# Print the results
print("Stacking Model Accuracy after ADASYN and Class Weighting:", accuracy_stacking)
print("Stacking Model Classification Report:\n", report_stacking)


[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000330 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 836
[LightGBM] [Info] Number of data points in the train set: 1852, number of used features: 13
[LightGBM] [Info] Start training from score -2.039211
[LightGBM] [Info] Start training from score -1.205068
[LightGBM] [Info] Start training from score -1.452476
[LightGBM] [Info] Start training from score -2.748257
[LightGBM] [Info] Start training from score -1.301338


Parameters: { "scale_pos_weight", "use_label_encoder" } are not used.



[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000172 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 837
[LightGBM] [Info] Number of data points in the train set: 1481, number of used features: 13
[LightGBM] [Info] Start training from score -2.041682
[LightGBM] [Info] Start training from score -1.203460
[LightGBM] [Info] Start training from score -1.452579
[LightGBM] [Info] Start training from score -2.754790
[LightGBM] [Info] Start training from score -1.300310
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000278 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 836
[LightGBM] [Info] Number of data points in the train set: 1481, number of used features: 13
[LightGBM] [Info] Start trai

Parameters: { "scale_pos_weight", "use_label_encoder" } are not used.




Stacking Model Accuracy after ADASYN and Class Weighting: 0.6134831460674157
Stacking Model Classification Report:
               precision    recall  f1-score   support

           0       0.59      0.74      0.65        76
           1       0.27      0.14      0.19        21
           2       0.56      0.47      0.51       139
           3       0.72      0.58      0.64        48
           4       0.66      0.75      0.70       161

    accuracy                           0.61       445
   macro avg       0.56      0.54      0.54       445
weighted avg       0.60      0.61      0.60       445



In [None]:
!pip install optuna

Collecting optuna
  Downloading optuna-4.1.0-py3-none-any.whl.metadata (16 kB)
Collecting alembic>=1.5.0 (from optuna)
  Downloading alembic-1.14.0-py3-none-any.whl.metadata (7.4 kB)
Collecting colorlog (from optuna)
  Downloading colorlog-6.9.0-py3-none-any.whl.metadata (10 kB)
Collecting Mako (from alembic>=1.5.0->optuna)
  Downloading Mako-1.3.6-py3-none-any.whl.metadata (2.9 kB)
Downloading optuna-4.1.0-py3-none-any.whl (364 kB)
Downloading alembic-1.14.0-py3-none-any.whl (233 kB)
Downloading colorlog-6.9.0-py3-none-any.whl (11 kB)
Downloading Mako-1.3.6-py3-none-any.whl (78 kB)
Installing collected packages: M

In [None]:
# Import libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report
from sklearn.preprocessing import LabelEncoder
from imblearn.over_sampling import SMOTENC, ADASYN
from xgboost import XGBClassifier
from catboost import CatBoostClassifier
from lightgbm import LGBMClassifier
from sklearn.ensemble import StackingClassifier

# Optuna for Hyperparameter Tuning
import optuna

# 1. Load and preprocess the data
data = pd.read_csv('TA_cleaned.csv')  # replace with the actual file path

# Drop classes 5 and 9, then merge classes 6, 7, 8 into class 2
# (.copy() avoids pandas' SettingWithCopyWarning on the assignments below)
data_filtered = data[data['accTypeD'].isin([0, 1, 2, 3, 4, 6, 7, 8])].copy()
data_filtered['accTypeD_merged_combined'] = data_filtered['accTypeD'].replace({6: 2, 7: 2, 8: 2})

# Feature Engineering
data_filtered['grid_lat'] = (data_filtered['la_crd'] // 0.02) * 0.02
data_filtered['grid_lon'] = (data_filtered['lo_crd'] // 0.02) * 0.02
data_filtered['is_night'] = data_filtered['hour'].apply(lambda x: 1 if x < 6 or x >= 18 else 0)
data_filtered['holiday_road_combo'] = data_filtered['is_holiday'].astype(str) + '_' + data_filtered['road_form_class'].astype(str)

categorical_features = ['is_holiday', 'road_form_class', 'road_formD', 'carFLg',
                        'carClassF', 'carClassVic', 'holiday_road_combo']

# Here the binned coordinates are used as numeric features instead of a categorical grid_id
numerical_features = ['hour', 'grid_lat', 'grid_lon', 'is_night']

# Label encoding
data_encoded = data_filtered.copy()
label_encoders = {}
for col in categorical_features:
    le = LabelEncoder()
    data_encoded[col] = le.fit_transform(data_encoded[col].astype(str))
    label_encoders[col] = le

X = data_encoded[numerical_features + categorical_features]
y = data_encoded['accTypeD_merged_combined']

# Train-Test Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Data augmentation (SMOTENC followed by ADASYN)
categorical_indices = [X.columns.get_loc(col) for col in categorical_features]

smotenc = SMOTENC(categorical_features=categorical_indices, random_state=42)
X_train_resampled, y_train_resampled = smotenc.fit_resample(X_train, y_train)

# SMOTENC has already balanced the classes, so ADASYN with 'auto' appears to add nothing here:
# the training log below shows the same 3315 rows as with SMOTENC alone
adasyn = ADASYN(sampling_strategy='auto', random_state=42)
X_train_resampled, y_train_resampled = adasyn.fit_resample(X_train_resampled, y_train_resampled)

# 2. Define the model (hyperparameter search with Optuna)
def objective(trial):
    params = {
        "n_estimators": trial.suggest_int("n_estimators", 100, 500),
        "learning_rate": trial.suggest_float("learning_rate", 0.01, 0.3),
        "max_depth": trial.suggest_int("max_depth", 3, 12),
        "min_child_weight": trial.suggest_int("min_child_weight", 1, 10),
        "subsample": trial.suggest_float("subsample", 0.5, 1.0),
        "colsample_bytree": trial.suggest_float("colsample_bytree", 0.5, 1.0),
        # reported as unused by XGBoost for this multiclass problem (see the log below)
        "scale_pos_weight": trial.suggest_float("scale_pos_weight", 1, 10),
        "random_state": 42
    }
    model = XGBClassifier(**params)
    model.fit(X_train_resampled, y_train_resampled)
    # NOTE: each trial is scored on the test set, so the selected parameters are tuned to the
    # test data and the final test accuracy is an optimistic estimate
    preds = model.predict(X_test)
    return accuracy_score(y_test, preds)

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=20)

best_params = study.best_params
print("Best parameters:", best_params)

# XGBoost model with the tuned parameters
xgb_model = XGBClassifier(**best_params)

# StackingClassifier (meta-model: CatBoost)
stacking_model = StackingClassifier(
    estimators=[
        ('xgb', xgb_model),
        ('lgbm', LGBMClassifier(random_state=42)),
        ('catboost', CatBoostClassifier(random_state=42, verbose=0))
    ],
    final_estimator=CatBoostClassifier(random_state=42, verbose=0)
)

# Train the model
stacking_model.fit(X_train_resampled, y_train_resampled)

# 3. Threshold optimization
y_prob = stacking_model.predict_proba(X_test)
threshold = 0.3  # intended threshold
# NOTE: `threshold` is defined but never applied; the predictions are a plain argmax
y_pred_adjusted = np.argmax(y_prob, axis=1)

# 4. Evaluate
accuracy = accuracy_score(y_test, y_pred_adjusted)
report = classification_report(y_test, y_pred_adjusted, target_names=[str(cls) for cls in np.unique(y)])

# Print the results
print("Optimized Model Accuracy:", accuracy)
print("Optimized Model Classification Report:\n", report)



Best parameters: {'n_estimators': 198, 'learning_rate': 0.19689411117951314, 'max_depth': 3, 'min_child_weight': 7, 'subsample': 0.6097067370243756, 'colsample_bytree': 0.6603930177444711, 'scale_pos_weight': 4.405419387496526}
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000393 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 579
[LightGBM] [Info] Number of data points in the train set: 3315, number of used features: 11
[LightGBM] [Info] Start training from score -1.609438
[LightGBM] [Info] Start training from score -1.609438
[LightGBM] [Info] Start training from score -1.609438
[LightGBM] [Info] Start training from score -1.609438
[LightGBM] [Info] Start training from score -1.609438


Parameters: { "scale_pos_weight" } are not used.




[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000347 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 580
[LightGBM] [Info] Number of data points in the train set: 2652, number of used features: 11
[LightGBM] [Info] Start training from score -1.610192
[LightGBM] [Info] Start training from score -1.608307
[LightGBM] [Info] Start training from score -1.610192
[LightGBM] [Info] Start training from score -1.610192
[LightGBM] [Info] Start training from score -1.608307
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000295 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 579
[LightGBM] [Info] Number of data points in the train set: 2652, number of used features: 11
[LightGBM] [Info] Start trai
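
The tuning cell above scores every Optuna trial on the test set. A common alternative is to carve a validation split out of the training data, tune against that, and touch the test set only once at the end. The sketch below illustrates the split logic on synthetic data. It is an assumption-laden stand-in: a plain loop over a few candidate values replaces Optuna, `LogisticRegression` replaces the boosted models, and `make_classification` replaces `TA_cleaned.csv`.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic 3-class data standing in for the accident dataset
X, y = make_classification(n_samples=600, n_features=10, n_informative=6,
                           n_classes=3, random_state=42)

# Hold out the test set first, then split the remainder into train and validation
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, stratify=y_trainval, random_state=42)

def validation_score(C):
    """Fit on the training split and score on the validation split only."""
    model = LogisticRegression(C=C, max_iter=1000)
    model.fit(X_train, y_train)
    return accuracy_score(y_val, model.predict(X_val))

# Pick the candidate with the best validation accuracy
candidates = [0.01, 0.1, 1.0, 10.0]
best_C = max(candidates, key=validation_score)

# Refit on train + validation, then evaluate once on the held-out test set
final_model = LogisticRegression(C=best_C, max_iter=1000).fit(X_trainval, y_trainval)
test_accuracy = accuracy_score(y_test, final_model.predict(X_test))
print("best C:", best_C, "test accuracy:", test_accuracy)
```

In the notebook's setting, the equivalent change would be for `objective` to return the score on a validation split, with oversampling applied to the training portion only so that no synthetic samples leak into validation or test data.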