# Predicting the probability of a traffic accident within one hour for each region of Chungnam, Daejeon, and Sejong from real-time weather data
- Team name: 국민대민쑤

## Code execution environment

In [None]:
import platform
print(platform.platform())
!cat /etc/issue.net
!python --version
!nvidia-smi

Linux-5.10.147+-x86_64-with-glibc2.29
Ubuntu 20.04.5 LTS
Python 3.8.10
Wed Feb 15 07:10:32 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.47.03    Driver Version: 510.47.03    CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   50C    P0    27W /  70W |      0MiB / 15360MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+----------------

## Mounting Google Drive when using Google Colab

In [None]:
# Mount Google Drive
from google.colab import drive
drive.mount('/content/gdrive')

# Colab data path -> change this to match your own environment
DATA_PATH = '/content/gdrive/MyDrive/지역치안공모전/data/'

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).


## Import & Install

In [None]:
# Install the libraries needed on Colab

!pip install catboost
!pip install haversine
!pip install optuna

In [None]:
#Base & visualization
import pandas as pd
import random
import os
import numpy as np
import warnings
import matplotlib.pylab as plt
import seaborn as sns

#Feature engineering
import datetime
from haversine import haversine

#Scaling
from sklearn.preprocessing import StandardScaler

#Sklearn module & utils
from sklearn.model_selection import StratifiedKFold , KFold, train_test_split, cross_val_score, cross_validate

#Metric
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

#Modeling
import lightgbm as lgb
import catboost as cb
import optuna
from sklearn.ensemble import VotingClassifier

#Model save
import pickle

## Fix Seed

In [None]:
# Fix the random seed
class CFG:
    SEED = 42

def seed_everything(seed):
    random.seed(seed)
    os.environ['PYTHONHASHSEED'] = str(seed)
    np.random.seed(seed)
seed_everything(CFG.SEED)  # Fix the random seed

## Data Load

In [None]:
kp2020 = pd.read_csv(DATA_PATH + 'KP2020.csv', encoding = 'cp949')
kp2021 = pd.read_csv(DATA_PATH + 'KP2021.csv', encoding = 'cp949')
npa2020 = pd.read_csv(DATA_PATH + 'NPA2020.csv', encoding = 'cp949')
codeBook = pd.read_excel(DATA_PATH + 'codeBook_v3.xlsx')

In [None]:
# External data (Korea Meteorological Administration, KMA)
temp2020 = pd.read_csv(DATA_PATH + '2020년기상청관측데이터.csv', encoding = 'cp949')
temp2021 = pd.read_csv(DATA_PATH + '2021년기상청관측데이터.csv', encoding = 'cp949')
temp2022 = pd.read_csv(DATA_PATH + '2022년기상청관측데이터.csv', encoding = 'cp949')
temp2023 = pd.read_csv(DATA_PATH + '2023년기상청관측데이터.csv', encoding = 'cp949')
temp2023_02 = pd.read_csv(DATA_PATH + '2023년최신기상청관측데이터.csv', encoding = 'cp949')
location = pd.read_csv(DATA_PATH + '관측지점정보.csv', encoding = 'cp949')

#### **External data description**
- temp2020-temp2022 : weather data for 2020-2022
- temp2023 : weather data up to January 18, 2023
- temp2023_02 : data from January 19, 2023 through 09:00 on February 14, 2023
- location : locations (latitude and longitude) of weather stations nationwide

[External data source] : https://data.kma.go.kr/data/grnd/selectAsosRltmList.do?pgmNo=36

## EDA

In [None]:
codeBook.head()

Unnamed: 0,No,컬럼명,컬럼 그룹,코드명,코드값
0,1,NPA_CL,경찰청 구분,본청,1
1,2,NPA_CL,경찰청 구분,서울청,8
2,3,NPA_CL,경찰청 구분,부산청,9
3,4,NPA_CL,경찰청 구분,대구청,10
4,5,NPA_CL,경찰청 구분,인천청,11


#### **- codeBook contents**
**NPA_CL : police agency**  
**EVT_STAT_CD : incident status code**  
**EVT_CL_CD : incident type code**  
**RPTER_SEX : sex**

In [None]:
# Extract the code value for traffic accidents ('교통사고')
codeBook.query("컬럼명 == 'EVT_CL_CD'").query("코드명 == '교통사고'")

Unnamed: 0,No,컬럼명,컬럼 그룹,코드명,코드값
68,69,EVT_CL_CD,사건종별코드,교통사고,401


In [None]:
npa2020.sort_values(by='RECV_CPLT_DT')

Unnamed: 0,RECV_CPLT_DT,RECV_CPLT_TM,NPA_CL,EVT_STAT_CD,EVT_CL_CD,RPTER_SEX,HPPN_OLD_ADDR,HPPN_X,HPPN_Y,SME_EVT_YN
0,20200101,7,13,10,501,2,대전광역시 중구 목동(행정:목동) 360,127.409270,36.333010,Y
2134,20200101,233731,19,10,208,1,충청남도 당진시 읍내동(행정:당진1동) 1525,126.631020,36.901677,
2135,20200101,233828,19,10,610,1,충청남도 홍성군 홍성읍 남장리(행정:홍성읍) 95-54,126.665464,36.593157,
2136,20200101,233715,19,10,606,3,충청남도 예산군 예산읍 창소리 304-3 도원가게 앞,126.841990,36.732200,N
2137,20200101,234120,19,10,501,1,충청남도 천안시 서북구 쌍용동(행정:쌍용3동) 1279,127.122227,36.802885,
...,...,...,...,...,...,...,...,...,...,...
1178093,20201122,1043,19,10,601,3,,,,
1178094,20201122,1517,19,10,601,1,충청남도 천안시 동남구 풍세면 남관리(풍세면) 136-1,,,
1178095,20201122,1857,13,10,601,1,,127.404663,36.341685,
1178083,20201122,2007,13,7,406,1,대전광역시 중구 오류동(행정:오류동) 175-2,127.404540,36.325219,


In [None]:
kp2020.sort_values(by='RECV_CPLT_DM')

Unnamed: 0,RECV_DEPT_NM,RECV_CPLT_DM,NPA_CL,EVT_STAT_CD,EVT_CL_CD,RPTER_SEX,HPPN_PNU_ADDR,HPPN_X,HPPN_Y,SME_EVT_YN
219,대전청,20/12/01 00:00:54.000000000,13,10,406,1.0,대전광역시 서구 복수동(행정:복수동) 475,127.377982,36.299738,
168,충남청,20/12/01 00:02:07.000000000,19,10,406,1.0,충청남도 아산시 배방읍 공수리(행정:배방읍) 141-7,127.057063,36.774519,Y
218,대전청,20/12/01 00:02:08.000000000,13,10,501,2.0,대전광역시 중구 오류동(행정:오류동) 160-4,127.408753,36.321745,
217,대전청,20/12/01 00:02:22.000000000,13,10,505,2.0,대전광역시 유성구 죽동(행정:온천2동) 688,127.337740,36.368390,Y
169,충남청,20/12/01 00:03:18.000000000,19,10,201,2.0,충청남도 천안시 서북구 성정동(행정:성정2동) 1345,127.141725,36.828716,
...,...,...,...,...,...,...,...,...,...,...
74153,대전청,20/12/31 23:59:10.000000000,13,10,601,3.0,,127.404663,36.341685,
74126,충남청,20/12/31 23:59:16.000000000,19,10,307,2.0,충청남도 부여군 부여읍 동남리(행정:부여읍) 515-3,126.907614,36.277902,Y
74182,대전청,20/12/31 23:59:39.000000000,13,10,601,3.0,,127.404663,36.341685,
73845,충남청,20/12/31 23:59:44.000000000,19,10,204,1.0,충청남도 아산시 배방읍 회룡리(행정:배방읍) 65,127.072778,36.753030,


In [None]:
kp2021.sort_values(by='RECV_CPLT_DM')

Unnamed: 0,RECV_DEPT_NM,RECV_CPLT_DM,NPA_CL,EVT_STAT_CD,EVT_CL_CD,RPTER_SEX,HPPN_PNU_ADDR,HPPN_X,HPPN_Y,SME_EVT_YN
2076,대전청,21/01/01 00:01:09.000000000,19,10,301,1.0,충청남도 아산시 탕정면 호산리(행정:탕정면) 490-3,127.077039,36.809625,
2065,대전청,21/01/01 00:01:09.000000000,13,10,301,1.0,충청남도 아산시 탕정면 호산리(행정:탕정면) 490-3,127.077039,36.809625,
2064,대전청,21/01/01 00:01:21.000000000,13,10,501,3.0,,127.404663,36.341685,
2050,대전청,21/01/01 00:01:25.000000000,13,10,601,3.0,,,,
2077,충남청,21/01/01 00:01:37.000000000,19,10,601,1.0,,,,
...,...,...,...,...,...,...,...,...,...,...
2580512,대전청,23/01/18 23:58:33.000000000,13,10,501,1.0,,127.404663,36.341685,
2583744,충남청,23/01/18 23:58:39.000000000,19,10,501,1.0,,,,
2579949,충남청,23/01/18 23:58:52.000000000,19,10,601,3.0,,,,
2579902,충남청,23/01/18 23:59:40.000000000,19,10,406,1.0,충청남도 아산시 모종동 (온양3동 ) 661,127.019848,36.778266,


- npa2020 : data from January 2020 through November 2020
- kp2020 : data for December 2020
- kp2021 : data from January 2021 through January 18, 2023

# Feature engineering

### Weather data

In [None]:
# Merge the data from January 2020 through January 18, 2023
temp_all = pd.concat([temp2020,temp2021,temp2022,temp2023]).sort_values(by=["지점", "일시"]).reset_index(drop=True)

In [None]:
temp_all

Unnamed: 0,지점,지점명,일시,기온(°C),풍속(m/s),풍향(16방위),습도(%),증기압(hPa),이슬점온도(°C),현지기압(hPa),해면기압(hPa),전운량(10분위),시정(10m),지면온도(°C)
0,129,서산,2020-01-01 00:00,-7.2,0.2,0.0,84.0,3.0,-9.4,1030.9,1034.2,9.0,2056.0,-1.4
1,129,서산,2020-01-01 01:00,-5.7,0.2,0.0,80.0,3.2,-8.6,1029.6,1032.9,8.0,2642.0,-0.9
2,129,서산,2020-01-01 02:00,-5.1,1.1,200.0,77.0,3.2,-8.5,1029.4,1032.7,9.0,1923.0,-0.9
3,129,서산,2020-01-01 03:00,-4.3,0.7,70.0,77.0,3.4,-7.7,1029.7,1033.0,9.0,550.0,-0.8
4,129,서산,2020-01-01 04:00,-4.1,1.1,50.0,85.0,3.8,-6.2,1029.6,1032.9,9.0,709.0,-0.8
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
213883,239,세종,2023-01-18 19:00,-0.6,1.1,320.0,41.0,2.4,-12.2,1014.8,1026.2,0.0,1688.0,-1.2
213884,239,세종,2023-01-18 20:00,-0.5,1.2,270.0,33.0,1.9,-14.8,1014.9,1026.3,0.0,2021.0,-1.9
213885,239,세종,2023-01-18 21:00,-1.6,0.0,0.0,35.0,1.9,-15.1,1014.8,1026.2,0.0,1892.0,-2.5
213886,239,세종,2023-01-18 22:00,-2.5,0.7,180.0,60.0,3.0,-9.2,1014.7,1026.2,0.0,1506.0,-3.2


In [None]:
# Check missing values in the weather data
temp_all.isnull().sum()

지점              0
지점명             0
일시              0
기온(°C)         91
풍속(m/s)       410
풍향(16방위)      420
습도(%)         110
증기압(hPa)      139
이슬점온도(°C)     161
현지기압(hPa)      77
해면기압(hPa)      92
전운량(10분위)    3507
시정(10m)      1202
지면온도(°C)       82
dtype: int64

In [None]:
temp_all.지점.unique()

array([129, 133, 177, 232, 235, 236, 238, 239])

In [None]:
지점_list = [129, 133, 177, 232, 235, 236, 238, 239]
temp_fine = pd.DataFrame(columns=['지점', '지점명', '일시', '기온(°C)', '풍속(m/s)', '풍향(16방위)', '습도(%)', '증기압(hPa)','이슬점온도(°C)', '현지기압(hPa)', '해면기압(hPa)', '전운량(10분위)', '시정(10m)','지면온도(°C)'])

In [None]:
# Fill missing values with interpolate(), which generally works well for time-series data
# Interpolate station by station; otherwise values from a different station could leak into a gap
for i in range(len(지점_list)):
  temp_fine = pd.concat([temp_fine,temp_all.query(f'지점=={지점_list[i]}').interpolate()])
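The reason for interpolating one station at a time can be seen in a small pure-Python sketch. It is not the pandas implementation: `interpolate_interior` and `interpolate_by_station` are hypothetical helpers that only fill gaps lying between two known values and leave leading or trailing gaps untouched.

```python
from collections import defaultdict


def interpolate_interior(values):
    """Linearly fill None gaps that lie between two known values."""
    filled = list(values)
    last_known = None  # index of the most recent non-missing value
    for i, v in enumerate(values):
        if v is None:
            continue
        if last_known is not None and i - last_known > 1:
            start, end = values[last_known], v
            span = i - last_known
            for k in range(1, span):
                filled[last_known + k] = start + (end - start) * k / span
        last_known = i
    return filled


def interpolate_by_station(rows):
    """rows: (station_id, value) pairs already sorted by station and time."""
    by_station = defaultdict(list)
    for station_id, value in rows:
        by_station[station_id].append(value)
    return {sid: interpolate_interior(vals) for sid, vals in by_station.items()}


rows = [(129, 1.0), (129, None), (129, 3.0), (133, None), (133, 10.0)]

# Grouped: the gap at the start of station 133 stays missing.
grouped = interpolate_by_station(rows)

# Ungrouped: the same gap is filled from station 129's last value.
ungrouped = interpolate_interior([value for _, value in rows])
```

In the ungrouped case the first value of station 133 is computed from station 129's last reading, which is the leakage the comment in the cell warns about.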

In [None]:
# 1,547 missing values remain -> all of them are at the Sejong station
# Reason: Sejong did not measure total cloud cover before 2020-03-05 -> replace with values from nearby Daejeon
temp_fine.isnull().sum()

지점              0
지점명             0
일시              0
기온(°C)          0
풍속(m/s)         0
풍향(16방위)        0
습도(%)           0
증기압(hPa)        0
이슬점온도(°C)       0
현지기압(hPa)       0
해면기압(hPa)       0
전운량(10분위)    1547
시정(10m)         0
지면온도(°C)        0
dtype: int64

In [None]:
# Fill Sejong's unobserved cloud cover with Daejeon data
cloud_list = temp_fine.query('지점명=="대전"')['전운량(10분위)'][:1547].to_list()
temp_fine.loc[range(187152, 188699), '전운량(10분위)'] = cloud_list

In [None]:
# Total cloud cover is a categorical-like feature expressed in tenths
# interpolate() produces fractional values, so round them
temp_fine["전운량(10분위)"] = temp_fine["전운량(10분위)"].round()

### Weather station data

In [None]:
# From location, keep only the weather stations in Chungnam, Daejeon, and Sejong
location = location.query(f'지점 == {지점_list}')
location = location.drop_duplicates(subset='지점', keep='first').reset_index(drop=True)
location['위도'] = location['위도'].astype(float)

In [None]:
location

Unnamed: 0,지점,지점명,위도,경도
0,129,서산,36.7766,126.4939
1,133,대전,36.372,127.3721
2,177,홍성,36.6576,126.6877
3,232,천안,36.7624,127.2927
4,235,보령,36.3272,126.5574
5,236,부여,36.2724,126.9208
6,238,금산,36.1056,127.4818
7,239,세종,36.4852,127.2444


### Traffic accident data

In [None]:
# Keep only traffic accident records
kp2020_traffic = kp2020.query('EVT_CL_CD == 401').reset_index(drop=True)
kp2021_traffic = kp2021.query('EVT_CL_CD == 401').reset_index(drop=True)
npa2020_traffic = npa2020.query('EVT_CL_CD == 401').reset_index(drop=True)

In [None]:
# Drop traffic accident records with missing latitude/longitude
kp2020_traffic = kp2020_traffic.dropna(subset=['HPPN_X']).reset_index(drop=True)
kp2021_traffic = kp2021_traffic.dropna(subset=['HPPN_X']).reset_index(drop=True)
npa2020_traffic = npa2020_traffic.dropna(subset=['HPPN_X']).reset_index(drop=True)
npa2020_traffic = npa2020_traffic[npa2020_traffic.HPPN_X != 0]

In [None]:
# Unify the date format of npa2020 with that of kp2020 and kp2021
# The target is the 1-hour window, so e.g. 00:00-00:59 is floored to 00:00
def add_zeros(x):
    x = str(x)
    return x.zfill(6)

npa2020_traffic['RECV_CPLT_TM'] = npa2020_traffic['RECV_CPLT_TM'].apply(add_zeros)
npa2020_traffic['RECV_CPLT_DM'] = npa2020_traffic['RECV_CPLT_DT'].astype(str).str[:4] + '-' + npa2020_traffic['RECV_CPLT_DT'].astype(str).str[4:6] + '-' + npa2020_traffic['RECV_CPLT_DT'].astype(str).str[6:8] + ' ' + npa2020_traffic['RECV_CPLT_TM'].astype(str).str[:2]+':00'
npa2020_traffic = npa2020_traffic.drop(columns=['RECV_CPLT_DT','RECV_CPLT_TM','HPPN_OLD_ADDR'])

kp2020_traffic['RECV_CPLT_DM'] = '20' + kp2020_traffic['RECV_CPLT_DM']
kp2020_traffic['RECV_CPLT_DM'] = pd.to_datetime(kp2020_traffic['RECV_CPLT_DM'])
kp2020_traffic['RECV_CPLT_DM'] = kp2020_traffic['RECV_CPLT_DM'].dt.strftime('%Y-%m-%d %H')
kp2020_traffic['RECV_CPLT_DM'] = kp2020_traffic['RECV_CPLT_DM'] + ':00'

kp2021_traffic['RECV_CPLT_DM'] = '20' + kp2021_traffic['RECV_CPLT_DM']
kp2021_traffic['RECV_CPLT_DM'] = pd.to_datetime(kp2021_traffic['RECV_CPLT_DM'])
kp2021_traffic['RECV_CPLT_DM'] = kp2021_traffic['RECV_CPLT_DM'].dt.strftime('%Y-%m-%d %H')
kp2021_traffic['RECV_CPLT_DM'] = kp2021_traffic['RECV_CPLT_DM'] + ':00'
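The hour flooring done in the cell above can be checked on single values with the standard library. This is a sketch under the assumption that the two raw formats are exactly the ones visible in the EDA outputs; `floor_kp_timestamp` and `floor_npa_timestamp` are hypothetical helper names, not part of the notebook.

```python
from datetime import datetime


def floor_kp_timestamp(raw):
    """'20/12/01 00:02:07.000000000' -> '2020-12-01 00:00'."""
    without_fraction = raw.split('.')[0]  # drop the nanosecond part
    parsed = datetime.strptime(without_fraction, '%y/%m/%d %H:%M:%S')
    return parsed.replace(minute=0, second=0).strftime('%Y-%m-%d %H:%M')


def floor_npa_timestamp(date_int, time_int):
    """(20200101, 233731) -> '2020-01-01 23:00'.

    The time column is stored as an integer, so leading zeros are lost
    (00:00:07 is stored as 7) and must be restored with zfill(6).
    """
    date_str = str(date_int)
    time_str = str(time_int).zfill(6)
    return f'{date_str[:4]}-{date_str[4:6]}-{date_str[6:8]} {time_str[:2]}:00'


kp_example = floor_kp_timestamp('20/12/01 00:02:07.000000000')
npa_midnight = floor_npa_timestamp(20200101, 7)
npa_evening = floor_npa_timestamp(20200101, 233731)
```

Both helpers produce the `YYYY-MM-DD HH:00` string form that the weather data's `일시` column uses, which is what makes the later merge on `일시` possible.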

In [None]:
# Merge the traffic accident data
kp_all = pd.concat([kp2020_traffic,kp2021_traffic])
kp_all = kp_all.drop(columns=['RECV_DEPT_NM','HPPN_PNU_ADDR'])
traffic_all = pd.concat([npa2020_traffic,kp_all]).reset_index(drop=True)

In [None]:
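# haversine: library for computing distances from latitude/longitude (needs separate installation)
# Compute the distance from each accident location to every weather station in Chungnam, Daejeon, and Sejong, then take the ID ('지점') of the nearest station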
traffic_all['지점'] = traffic_all.apply(lambda x: location.loc[np.argmin([haversine((x.HPPN_Y, x.HPPN_X), (loc.위도, loc.경도), unit='km') for i, loc in location.iterrows()]), '지점'], axis=1)
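The nearest-station assignment can be reproduced without the `haversine` package. The sketch below writes out the haversine formula with the standard library and uses the station coordinates from the `location` table above; `EARTH_RADIUS_KM` is an assumed mean Earth radius, and `haversine_km` and `nearest_station` are hypothetical names.

```python
from math import asin, cos, radians, sin, sqrt

# Station ID -> (latitude, longitude), copied from the location table.
STATIONS = {
    129: (36.7766, 126.4939),  # 서산
    133: (36.372, 127.3721),   # 대전
    177: (36.6576, 126.6877),  # 홍성
    232: (36.7624, 127.2927),  # 천안
    235: (36.3272, 126.5574),  # 보령
    236: (36.2724, 126.9208),  # 부여
    238: (36.1056, 127.4818),  # 금산
    239: (36.4852, 127.2444),  # 세종
}

EARTH_RADIUS_KM = 6371.0088  # assumed mean Earth radius


def haversine_km(p, q):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1 = map(radians, p)
    lat2, lon2 = map(radians, q)
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


def nearest_station(lat, lon):
    """Return the ID of the station closest to the given point."""
    return min(STATIONS, key=lambda sid: haversine_km((lat, lon), STATIONS[sid]))


# HPPN_Y is latitude and HPPN_X is longitude in the accident data.
daejeon_case = nearest_station(36.333010, 127.409270)  # Daejeon Jung-gu record
dangjin_case = nearest_station(36.901677, 126.631020)  # Dangjin record
```

Note the argument order: the accident data stores longitude in `HPPN_X` and latitude in `HPPN_Y`, so the point is passed as `(HPPN_Y, HPPN_X)`, as the cell above does.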

In [None]:
# Sort by station, then by time
traffic_all = traffic_all[['지점','RECV_CPLT_DM']].sort_values(by=["지점", "RECV_CPLT_DM"]).reset_index(drop=True)

In [None]:
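# Add a '사고유무' (accident occurred) column to the accident data, then merge with the weather data
# Merging on '지점' (station) and '일시' (timestamp) gives, for each hourly weather row, whether an accident occurred
traffic_all['사고유무'] = 1
# NOTE: the result of drop_duplicates() below is not assigned back, so repeated (지점, 일시) rows are kept
traffic_all.drop_duplicates(subset=['지점','RECV_CPLT_DM']).reset_index(drop=True)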
traffic_all.rename(columns = {"RECV_CPLT_DM": "일시"}, inplace = True)
X = pd.merge(temp_fine, traffic_all, on=['지점', '일시'], how='left')
X['사고유무'] = X['사고유무'].fillna(0)
X['일시'] = pd.to_datetime(X['일시'])
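The labelling step amounts to a lookup: a station-hour gets label 1 if at least one accident was recorded for it. A minimal sketch, assuming one row per station-hour is the goal; `label_hours` is a hypothetical helper and the keys are made-up examples in the notebook's format.

```python
def label_hours(weather_keys, accident_keys):
    """Return {(station, hour): 0 or 1} for every weather row key."""
    accident_hours = set(accident_keys)  # a set collapses repeated reports
    return {key: int(key in accident_hours) for key in weather_keys}


weather_keys = [
    (129, '2020-01-01 00:00'),
    (129, '2020-01-01 01:00'),
    (129, '2020-01-01 02:00'),
]

# Three separate reports fall into the 01:00 hour.
accident_keys = [
    (129, '2020-01-01 00:00'),
    (129, '2020-01-01 01:00'),
    (129, '2020-01-01 01:00'),
    (129, '2020-01-01 01:00'),
]

labels = label_hours(weather_keys, accident_keys)
```

In pandas terms, a left merge against a right-hand table with repeated keys yields one output row per match, so whether the accident table is deduplicated beforehand decides whether an hour with three reports appears once or three times in `X`.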

In [None]:
X

Unnamed: 0,지점,지점명,일시,기온(°C),풍속(m/s),풍향(16방위),습도(%),증기압(hPa),이슬점온도(°C),현지기압(hPa),해면기압(hPa),전운량(10분위),시정(10m),지면온도(°C),사고유무
0,129,서산,2020-01-01 00:00:00,-7.2,0.2,0.0,84.0,3.0,-9.4,1030.9,1034.2,9.0,2056.0,-1.4,1.0
1,129,서산,2020-01-01 01:00:00,-5.7,0.2,0.0,80.0,3.2,-8.6,1029.6,1032.9,8.0,2642.0,-0.9,1.0
2,129,서산,2020-01-01 01:00:00,-5.7,0.2,0.0,80.0,3.2,-8.6,1029.6,1032.9,8.0,2642.0,-0.9,1.0
3,129,서산,2020-01-01 01:00:00,-5.7,0.2,0.0,80.0,3.2,-8.6,1029.6,1032.9,8.0,2642.0,-0.9,1.0
4,129,서산,2020-01-01 02:00:00,-5.1,1.1,200.0,77.0,3.2,-8.5,1029.4,1032.7,9.0,1923.0,-0.9,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
376389,239,세종,2023-01-18 20:00:00,-0.5,1.2,270.0,33.0,1.9,-14.8,1014.9,1026.3,0.0,2021.0,-1.9,1.0
376390,239,세종,2023-01-18 21:00:00,-1.6,0.0,0.0,35.0,1.9,-15.1,1014.8,1026.2,0.0,1892.0,-2.5,0.0
376391,239,세종,2023-01-18 22:00:00,-2.5,0.7,180.0,60.0,3.0,-9.2,1014.7,1026.2,0.0,1506.0,-3.2,1.0
376392,239,세종,2023-01-18 22:00:00,-2.5,0.7,180.0,60.0,3.0,-9.2,1014.7,1026.2,0.0,1506.0,-3.2,1.0


## Train/test split for model building and performance evaluation

In [None]:
# Save the index columns so the predictions can be inspected later
X_train_index = X.query('일시 <= "2022-01-18 22:00:00"').iloc[:,:3]
X_test_index = X.query('일시 > "2022-01-18 22:00:00"').iloc[:,:3]

# Train data: 2020-01-01 through 2022-01-18 22:00
# Test data: 2022-01-18 23:00 through 2023-01-18 23:00
X_train = X.query('일시 <= "2022-01-18 22:00:00"').iloc[:,3:]
X_test = X.query('일시 > "2022-01-18 22:00:00"').iloc[:,3:]

# Separate the label y
y_train = X_train['사고유무']
X_train = X_train.drop(columns ='사고유무')
y_test = X_test['사고유무']
X_test = X_test.drop(columns ='사고유무')

In [None]:
X_train.head()

Unnamed: 0,기온(°C),풍속(m/s),풍향(16방위),습도(%),증기압(hPa),이슬점온도(°C),현지기압(hPa),해면기압(hPa),전운량(10분위),시정(10m),지면온도(°C)
0,-7.2,0.2,0.0,84.0,3.0,-9.4,1030.9,1034.2,9.0,2056.0,-1.4
1,-5.7,0.2,0.0,80.0,3.2,-8.6,1029.6,1032.9,8.0,2642.0,-0.9
2,-5.7,0.2,0.0,80.0,3.2,-8.6,1029.6,1032.9,8.0,2642.0,-0.9
3,-5.7,0.2,0.0,80.0,3.2,-8.6,1029.6,1032.9,8.0,2642.0,-0.9
4,-5.1,1.1,200.0,77.0,3.2,-8.5,1029.4,1032.7,9.0,1923.0,-0.9


# Scaling

In [None]:
# Scale the numeric features (total cloud cover is excluded because it is closer to a categorical feature)
num_features = X_train.drop(columns ='전운량(10분위)').columns

In [None]:
# Use StandardScaler, which generally works well for classification
scaler = StandardScaler()
X_train[num_features] = scaler.fit_transform(X_train[num_features])
X_test[num_features] = scaler.transform(X_test[num_features])
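The point of calling `fit_transform` on the train set and only `transform` on the test set is that the mean and standard deviation come from the training data alone. A standard-library sketch of that idea for a single feature, with made-up numbers; `fit_standardizer` and `standardize` are hypothetical names, and the population standard deviation is used because that is what `StandardScaler` computes.

```python
from statistics import fmean, pstdev


def fit_standardizer(train_values):
    """Learn the mean and population standard deviation from training data."""
    return fmean(train_values), pstdev(train_values)


def standardize(values, mean, std):
    """Apply z = (x - mean) / std with previously learned statistics."""
    return [(v - mean) / std for v in values]


train = [2.0, 4.0, 6.0]
test = [8.0]

mean, std = fit_standardizer(train)           # statistics from train only
train_scaled = standardize(train, mean, std)
test_scaled = standardize(test, mean, std)    # reuses the train statistics
```

The scaled training values are centred on zero, while the test value is expressed relative to the training distribution rather than its own.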

# Modeling

### LGBMClassifier

In [None]:
# Define objective function for Optuna
def objective_lgbm(trial):
    # Define parameters to be optimized
    lgbm_params = {
        'objective': 'binary',
        'boosting_type': 'gbdt',
        'verbosity': -1,
        'random_state' : CFG.SEED,
        'num_leaves': trial.suggest_int('num_leaves', 2, 256),
        'learning_rate': trial.suggest_uniform('learning_rate', 0.01, 0.1),
        'feature_fraction': trial.suggest_uniform('feature_fraction', 0.1, 1.0),
        'bagging_fraction': trial.suggest_uniform('bagging_fraction', 0.1, 1.0),
        'bagging_freq': trial.suggest_int('bagging_freq', 1, 10),
        'min_child_samples': trial.suggest_int('min_child_samples', 5, 100),
    }
    
    # Create LGBM classifier with parameters
    lgbm_clf = lgb.LGBMClassifier(**lgbm_params)
    
    # Compute cross validation score
    lgbm_score = cross_val_score(lgbm_clf, X_train, y_train, cv=15).mean()
    
    return lgbm_score

# Run optuna to optimize hyperparameters
study_lgbm = optuna.create_study(direction='maximize')
study_lgbm.optimize(objective_lgbm, n_trials=15)

# Get best hyperparameters and create LGBM classifier with those parameters
lgbm_params = study_lgbm.best_params
lgbm_model = lgb.LGBMClassifier(**lgbm_params)

[I 2023-02-15 07:14:05,430] A new study created in memory with name: no-name-faf858d5-749a-4dd0-aca3-91cb7b183355
[I 2023-02-15 07:14:42,946] Trial 0 finished with value: 0.7309960642359201 and parameters: {'num_leaves': 92, 'learning_rate': 0.0646498615614581, 'feature_fraction': 0.6264218894167806, 'bagging_fraction': 0.7900666950415415, 'bagging_freq': 2, 'min_child_samples': 78}. Best is trial 0 with value: 0.7309960642359201.
[I 2023-02-15 07:15:09,605] Trial 1 finished with value: 0.729199150122422 and parameters:

### CatBoostClassifier

In [None]:
# Define objective function for Optuna for CatBoost
def objective_cb(trial):
    # Define parameters to be optimized
    cb_params = {
        'loss_function': 'Logloss',
        'iterations': 1000,
        'task_type' : "GPU",
        'verbose' : False,
        'random_state' : CFG.SEED,
        'learning_rate': trial.suggest_uniform('learning_rate', 0.01, 0.1),
        'depth': trial.suggest_int('depth', 2, 10),
        'l2_leaf_reg': trial.suggest_loguniform('l2_leaf_reg', 0.01, 10.0),
        'random_strength': trial.suggest_uniform('random_strength', 0.1, 1.0),
        'bagging_temperature': trial.suggest_uniform('bagging_temperature', 0.0, 10.0),
        'border_count': trial.suggest_int('border_count', 1, 255),
    }

    # Create Catboost classifier with parameters
    cb_clf = cb.CatBoostClassifier(**cb_params)
    
    # Compute cross validation score
    cb_score = cross_val_score(cb_clf, X_train, y_train, cv=5).mean()
    
    return cb_score    

# Run optuna to optimize hyperparameters
study_cb = optuna.create_study(direction='maximize')
study_cb.optimize(objective_cb, n_trials=5)

# Get best hyperparameters and create Catboost classifier with those parameters
cb_params = study_cb.best_params
catboost_model = cb.CatBoostClassifier(**cb_params)

[I 2023-02-15 07:21:48,300] A new study created in memory with name: no-name-67786f33-5466-4227-8148-4b54ce51dc48
[I 2023-02-15 07:22:18,563] Trial 0 finished with value: 0.6897791030721545 and parameters: {'learning_rate': 0.057691792284794104, 'depth': 6, 'l2_leaf_reg': 9.182961164818833, 'random_strength': 0.603987631453176, 'bagging_temperature': 2.5275382275431557, 'border_count': 63}. Best is trial 0 with value: 0.6897791030721545.
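Note that `study.best_params` returns only the parameters suggested through `trial.suggest_*` calls, so the fixed settings in `cb_params` (`loss_function`, `iterations`, `task_type`, `verbose`, `random_state`) are not passed to the final `CatBoostClassifier` when it is built from `best_params` alone; the same holds for the LightGBM cell. Below is a minimal sketch of merging the two, assuming the intent is for the final model to keep the fixed settings. `FIXED_CB_PARAMS` and `merge_params` are hypothetical names, and the tuned values are rounded from Trial 0 in the log above purely for illustration.

```python
# Fixed settings, as written inside objective_cb (random_state is CFG.SEED = 42).
FIXED_CB_PARAMS = {
    'loss_function': 'Logloss',
    'iterations': 1000,
    'task_type': 'GPU',
    'verbose': False,
    'random_state': 42,
}


def merge_params(fixed, tuned):
    """Tuned values win on key clashes; fixed values fill in the rest."""
    return {**fixed, **tuned}


# Stand-in for study_cb.best_params (illustrative values only).
tuned = {
    'learning_rate': 0.0577,
    'depth': 6,
    'l2_leaf_reg': 9.183,
    'random_strength': 0.604,
    'bagging_temperature': 2.528,
    'border_count': 63,
}

final_params = merge_params(FIXED_CB_PARAMS, tuned)
```

The merged dictionary could then be passed as `cb.CatBoostClassifier(**final_params)`, so the final model would be trained with the same fixed settings as the models scored during tuning.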

### Ensemble

In [None]:
# Create voting ensemble of LightGBM and CatBoost models
ensemble_model = VotingClassifier(estimators=[('lgbm', lgbm_model), ('catboost', catboost_model)], voting='soft')
ensemble_model.fit(X_train, y_train)

# y_pred holds the predicted accident label (0/1); y_pred_proba holds the predicted class probabilities
y_pred = ensemble_model.predict(X_test)
y_pred_proba = ensemble_model.predict_proba(X_test)

0:	learn: 0.6793404	total: 21.7ms	remaining: 21.7s
1:	learn: 0.6674043	total: 41.3ms	remaining: 20.6s
2:	learn: 0.6559403	total: 60.8ms	remaining: 20.2s
3:	learn: 0.6456775	total: 81.2ms	remaining: 20.2s
4:	learn: 0.6368522	total: 100ms	remaining: 19.9s
5:	learn: 0.6287895	total: 120ms	remaining: 19.9s
6:	learn: 0.6213741	total: 138ms	remaining: 19.6s
7:	learn: 0.6144102	total: 157ms	remaining: 19.5s
8:	learn: 0.6080921	total: 175ms	remaining: 19.3s
9:	learn: 0.6022154	total: 195ms	remaining: 19.3s
10:	learn: 0.5967690	total: 215ms	remaining: 19.3s
11:	learn: 0.5909963	total: 239ms	remaining: 19.7s
12:	learn: 0.5866228	total: 258ms	remaining: 19.6s
13:	learn: 0.5824670	total: 277ms	remaining: 19.5s
14:	learn: 0.5784633	total: 298ms	remaining: 19.6s
15:	learn: 0.5741337	total: 319ms	remaining: 19.6s
16:	learn: 0.5708716	total: 337ms	remaining: 19.5s
17:	learn: 0.5678805	total: 357ms	remaining: 19.5s
18:	learn: 0.5653542	total: 376ms	remaining: 19.4s
19:	learn: 0.5628993	total: 395ms	rem
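With `voting='soft'`, the ensemble averages the class probabilities predicted by each model and picks the class with the highest average. A pure-Python sketch of that rule with equal weights and made-up probabilities; `soft_vote` is a hypothetical helper, not scikit-learn's implementation.

```python
def soft_vote(probas_per_model):
    """probas_per_model: one [p_class0, p_class1] list per model.

    Returns the averaged probabilities and the index of the winning class.
    """
    n_models = len(probas_per_model)
    n_classes = len(probas_per_model[0])
    averaged = [
        sum(model[c] for model in probas_per_model) / n_models
        for c in range(n_classes)
    ]
    predicted = max(range(n_classes), key=lambda c: averaged[c])
    return averaged, predicted


# Model A is confident in "no accident"; model B is undecided.
averaged, predicted = soft_vote([[0.8, 0.2], [0.5, 0.5]])

# Both models lean towards "accident".
_, predicted_positive = soft_vote([[0.4, 0.6], [0.45, 0.55]])
```

The second column of the averaged probabilities corresponds to the value the notebook later reports as the accident probability.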

In [None]:
# Reuse the index columns saved earlier to build the submission tables
X_test_y_t = X_test_index.copy()
X_test_y_p = X_test_index.copy()
X_test_y_proba = X_test_index.copy()
X_test_y_t['사고유무'] = y_test
X_test_y_p['사고유무'] = y_pred
X_test_y_proba['사고확률'] = [row[1] for row in y_pred_proba]
X_test_y_proba['사고확률'] = X_test_y_proba['사고확률'].apply(lambda x: str(round(x * 100, 2)) + '%')

## Training and performance test results

In [None]:
accuracy = accuracy_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

print('F1 Score : ', f1)
print('Accuracy : ', accuracy)

F1 Score :  0.858374669172211
Accuracy :  0.79348112307034
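For reference, both metrics can be written in terms of confusion-matrix counts. The sketch below uses made-up counts, not the notebook's actual confusion matrix, and `accuracy_and_f1` is a hypothetical helper.

```python
def accuracy_and_f1(tp, fp, fn, tn):
    """Compute accuracy and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, f1


# Illustrative counts in which the positive class is the majority.
accuracy, f1 = accuracy_and_f1(tp=70, fp=15, fn=5, tn=10)
```

F1 does not count true negatives, so when positives dominate and most of them are recovered, F1 can sit above accuracy, which is the pattern seen in the output above.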


In [None]:
X_test_y_p.head()

Unnamed: 0,지점,지점명,일시,사고유무
26477,129,서산,2022-01-18 23:00:00,0.0
26478,129,서산,2022-01-19 00:00:00,0.0
26479,129,서산,2022-01-19 01:00:00,0.0
26480,129,서산,2022-01-19 02:00:00,0.0
26481,129,서산,2022-01-19 03:00:00,0.0


In [None]:
X_test_y_proba.head()

Unnamed: 0,지점,지점명,일시,사고확률
26477,129,서산,2022-01-18 23:00:00,25.69%
26478,129,서산,2022-01-19 00:00:00,29.89%
26479,129,서산,2022-01-19 01:00:00,21.58%
26480,129,서산,2022-01-19 02:00:00,25.74%
26481,129,서산,2022-01-19 03:00:00,20.44%


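#### Results
- F1 Score :  0.8577479940831054
- Accuracy :  0.7910865343471248

#### **- The model predicts whether a traffic accident occurs within the hour correctly in about 4 out of 5 cases (accuracy of about 0.79), and it also outputs an accident probability.**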



# **Building the production model (trained on all data)**
- Use all the given data to predict actual future data (predictions run through 09:00 on February 14, 2023)


#### Building the train data (2020-01-01 through 2023-01-18)

In [None]:
# Separate the label from X, which contains every date
y = X['사고유무']
X_index = X.iloc[:,:3]
X = X.drop(columns='사고유무')
X = X.iloc[:,3:]

#### Building the test data (2023-01-19 through 2023-02-14 09:00)

In [None]:
# Preprocess the test data
temp2023_02 = temp2023_02.sort_values(by=["지점", "일시"]).reset_index(drop=True)

# Handle missing values
X_t = pd.DataFrame(columns=['지점', '지점명', '일시', '기온(°C)', '풍속(m/s)', '풍향(16방위)', '습도(%)', '증기압(hPa)','이슬점온도(°C)', '현지기압(hPa)', '해면기압(hPa)', '전운량(10분위)', '시정(10m)','지면온도(°C)'])

# Process each station separately
for i in range(len(지점_list)):
  X_t = pd.concat([X_t,temp2023_02.query(f'지점=={지점_list[i]}').interpolate()])

# Convert the '일시' column to datetime
X_t['일시'] = pd.to_datetime(X_t['일시'])

# Extract the index columns
X_t_index = X_t.iloc[:,:3]

# Build the test data
X_t = X_t.iloc[:,3:]

#### Scaling

In [None]:
# Scaling
X[num_features] = scaler.fit_transform(X[num_features])
X_t[num_features] = scaler.transform(X_t[num_features])

#### LGBMClassifier

In [None]:
# Define objective function for Optuna
def objective_lgbm(trial):
    # Define parameters to be optimized
    lgbm_params = {
        'objective': 'binary',
        'boosting_type': 'gbdt',
        'verbosity': -1,
        'random_state' : CFG.SEED,
        'num_leaves': trial.suggest_int('num_leaves', 2, 256),
        'learning_rate': trial.suggest_uniform('learning_rate', 0.01, 0.1),
        'feature_fraction': trial.suggest_uniform('feature_fraction', 0.1, 1.0),
        'bagging_fraction': trial.suggest_uniform('bagging_fraction', 0.1, 1.0),
        'bagging_freq': trial.suggest_int('bagging_freq', 1, 10),
        'min_child_samples': trial.suggest_int('min_child_samples', 5, 100),
    }
    
    # Create LGBM classifier with parameters
    lgbm_clf = lgb.LGBMClassifier(**lgbm_params)
    
    # Compute cross validation score
    lgbm_score = cross_val_score(lgbm_clf, X, y, cv=15).mean()
    
    return lgbm_score

# Run optuna to optimize hyperparameters
study_lgbm = optuna.create_study(direction='maximize')
study_lgbm.optimize(objective_lgbm, n_trials=15)

# Get best hyperparameters and create LGBM classifier with those parameters
lgbm_params = study_lgbm.best_params
lgbm_model = lgb.LGBMClassifier(**lgbm_params)

[I 2023-02-15 07:25:44,959] A new study created in memory with name: no-name-55dbefa0-535c-4b0b-9c7b-90b0e98b2c84
[I 2023-02-15 07:26:19,816] Trial 0 finished with value: 0.7245305814037649 and parameters: {'num_leaves': 57, 'learning_rate': 0.018545077358635483, 'feature_fraction': 0.3718088903393284, 'bagging_fraction': 0.4627646618026091, 'bagging_freq': 2, 'min_child_samples': 91}. Best is trial 0 with value: 0.7245305814037649.
[I 2023-02-15 07:27:09,554] Trial 1 finished with value: 0.7253034959485757 and paramete

#### CatBoostClassifier

In [None]:
# Define objective function for Optuna for CatBoost
def objective_cb(trial):
    # Define parameters to be optimized
    cb_params = {
        'loss_function': 'Logloss',
        'iterations': 1000,
        'task_type' : "GPU",
        'verbose' : False,
        'random_state' : CFG.SEED,
        'learning_rate': trial.suggest_uniform('learning_rate', 0.01, 0.1),
        'depth': trial.suggest_int('depth', 2, 10),
        'l2_leaf_reg': trial.suggest_loguniform('l2_leaf_reg', 0.01, 10.0),
        'random_strength': trial.suggest_uniform('random_strength', 0.1, 1.0),
        'bagging_temperature': trial.suggest_uniform('bagging_temperature', 0.0, 10.0),
        'border_count': trial.suggest_int('border_count', 1, 255),
    }

    # Create Catboost classifier with parameters
    cb_clf = cb.CatBoostClassifier(**cb_params)
    
    # Compute cross validation score
    cb_score = cross_val_score(cb_clf, X, y, cv=5).mean()
    
    return cb_score    

# Run optuna to optimize hyperparameters
study_cb = optuna.create_study(direction='maximize')
study_cb.optimize(objective_cb, n_trials=5)

# Get best hyperparameters and create Catboost classifier with those parameters
cb_params = study_cb.best_params
catboost_model = cb.CatBoostClassifier(**cb_params)

[I 2023-02-15 07:38:39,866] A new study created in memory with name: no-name-3a9dd23e-1237-447e-9165-c26b9c4ba037
[I 2023-02-15 07:40:21,159] Trial 0 finished with value: 0.6502913859998888 and parameters: {'learning_rate': 0.060017872010730326, 'depth': 10, 'l2_leaf_reg': 0.053694108798102144, 'random_strength': 0.2631854979885574, 'bagging_temperature': 1.069840763737897, 'border_count': 53}. Best is trial 0 with value: 0.6502913859998888.

#### Ensemble

In [None]:
# Create voting ensemble of LightGBM and CatBoost models
ensemble_model = VotingClassifier(estimators=[('lgbm', lgbm_model), ('catboost', catboost_model)], voting='soft')
ensemble_model.fit(X, y)

0:	learn: 0.6786726	total: 32.5ms	remaining: 32.5s
1:	learn: 0.6657046	total: 65ms	remaining: 32.4s
2:	learn: 0.6541075	total: 101ms	remaining: 33.7s
3:	learn: 0.6422345	total: 135ms	remaining: 33.6s
4:	learn: 0.6327883	total: 167ms	remaining: 33.3s
5:	learn: 0.6228726	total: 201ms	remaining: 33.4s
6:	learn: 0.6138565	total: 239ms	remaining: 34s
7:	learn: 0.6070968	total: 272ms	remaining: 33.7s
8:	learn: 0.6005981	total: 304ms	remaining: 33.4s
9:	learn: 0.5947542	total: 336ms	remaining: 33.2s
10:	learn: 0.5891480	total: 371ms	remaining: 33.4s
11:	learn: 0.5840309	total: 402ms	remaining: 33.1s
12:	learn: 0.5784931	total: 438ms	remaining: 33.3s
13:	learn: 0.5733365	total: 473ms	remaining: 33.3s
14:	learn: 0.5690127	total: 506ms	remaining: 33.2s
15:	learn: 0.5646265	total: 539ms	remaining: 33.2s
16:	learn: 0.5615295	total: 573ms	remaining: 33.1s
17:	learn: 0.5576655	total: 608ms	remaining: 33.2s
18:	learn: 0.5531342	total: 638ms	remaining: 33s
19:	learn: 0.5494180	total: 672ms	remaining: 

VotingClassifier(estimators=[('lgbm',
                              LGBMClassifier(bagging_fraction=0.9490834605207769,
                                             bagging_freq=5,
                                             feature_fraction=0.9882200407222264,
                                             learning_rate=0.028222798631533333,
                                             min_child_samples=98,
                                             num_leaves=116)),
                             ('catboost',
                              <catboost.core.CatBoostClassifier object at 0x7ff070631fa0>)],
                 voting='soft')

#### Model Save

In [None]:
# Save the trained model
with open('traffic_accident_prediction.pkl', 'wb') as f:
    pickle.dump(ensemble_model, f)
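Loading the saved file back is the mirror image of the cell above. The sketch below round-trips a plain dictionary as a stand-in for the fitted ensemble so that it runs without the modelling libraries; `save_model` and `load_model` are hypothetical helper names.

```python
import os
import pickle
import tempfile


def save_model(model, path):
    """Serialize an object to disk with pickle."""
    with open(path, 'wb') as f:
        pickle.dump(model, f)


def load_model(path):
    """Load a pickled object; only do this for files you trust."""
    with open(path, 'rb') as f:
        return pickle.load(f)


# A plain dict stands in for the fitted ensemble so the sketch runs anywhere.
stand_in_model = {'name': 'ensemble', 'voting': 'soft'}
path = os.path.join(tempfile.mkdtemp(), 'traffic_accident_prediction.pkl')

save_model(stand_in_model, path)
restored = load_model(path)
```

Two practical points follow from how the notebook is written: unpickling the real model requires the same libraries (scikit-learn, LightGBM, CatBoost) to be installed, and the pickle contains only `ensemble_model`, so the fitted `scaler` would need to be saved as well if new data is to be scaled the same way at prediction time.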

### Results

In [None]:
# Predict actual future data with the trained model
y_t_pred = ensemble_model.predict(X_t)
y_t_pred_proba = ensemble_model.predict_proba(X_t)

In [None]:
X_t_y_p = X_t_index.copy()
X_t_y_proba = X_t_index.copy()
X_t_y_p['예측사고유무'] = y_t_pred
X_t_y_proba['예측사고확률'] = [row[1] for row in y_t_pred_proba]
X_t_y_proba['예측사고확률'] = X_t_y_proba['예측사고확률'].apply(lambda x: str(round(x * 100, 2)) + '%')

In [None]:
X_t_y_p

Unnamed: 0,지점,지점명,일시,예측사고유무
0,129,서산,2023-01-19 00:00:00,0.0
1,129,서산,2023-01-19 01:00:00,0.0
2,129,서산,2023-01-19 02:00:00,0.0
3,129,서산,2023-01-19 03:00:00,0.0
4,129,서산,2023-01-19 04:00:00,0.0
...,...,...,...,...
5067,239,세종,2023-02-14 05:00:00,0.0
5068,239,세종,2023-02-14 06:00:00,0.0
5069,239,세종,2023-02-14 07:00:00,0.0
5070,239,세종,2023-02-14 08:00:00,1.0


In [None]:
X_t_y_proba

Unnamed: 0,지점,지점명,일시,예측사고확률
0,129,서산,2023-01-19 00:00:00,33.97%
1,129,서산,2023-01-19 01:00:00,33.86%
2,129,서산,2023-01-19 02:00:00,30.83%
3,129,서산,2023-01-19 03:00:00,34.43%
4,129,서산,2023-01-19 04:00:00,31.13%
...,...,...,...,...
5067,239,세종,2023-02-14 05:00:00,41.27%
5068,239,세종,2023-02-14 06:00:00,42.01%
5069,239,세종,2023-02-14 07:00:00,33.28%
5070,239,세종,2023-02-14 08:00:00,62.54%


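## Regional grouping
- Real-time weather forecasts rely on the Automated Synoptic Observing System (ASOS). ASOS is the most accurate source, and forecasts for neighboring areas are issued on the basis of it, so ASOS observation data was used here.
- There are eight main ASOS stations in Chungnam, Daejeon, and Sejong ('서산' Seosan, '대전' Daejeon, '홍성' Hongseong, '천안' Cheonan, '보령' Boryeong, '부여' Buyeo, '금산' Geumsan, '세종' Sejong). To combine them with the traffic accident data, each accident was matched, using latitude and longitude, to the weather data of its nearest ASOS station.
- Accidents occur all over the Chungnam area, so a prediction labeled '서산' may concern not only Seosan but also Taean or Dangjin. The regions were therefore grouped by ASOS station location.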

In [None]:
X_index.지점명.unique()

array(['서산', '대전', '홍성', '천안', '보령', '부여', '금산', '세종'], dtype=object)

![Image](https://www.kma.go.kr/daejeon/images/info/area2_2.jpg)

[Image source] : https://www.kma.go.kr/daejeon/html/info/business01.jsp

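- The '지점명' column contains only eight values, ['서산', '대전', '홍성', '천안', '보령', '부여', '금산', '세종']; these are the eight representative areas for which real-time forecasts are issued.
- Because each accident was assigned to the nearest KMA station by latitude and longitude, the station names in the model output should be read as follows.
- 서산 (Seosan) -> Seosan, Dangjin, Taean
- 대전 (Daejeon) -> Daejeon, Gyeryong
- 홍성 (Hongseong) -> Hongseong, Yesan
- 천안 (Cheonan) -> Cheonan, Asan
- 보령 (Boryeong) -> Boryeong, Seocheon
- 부여 (Buyeo) -> Buyeo, Cheongyang
- 금산 (Geumsan) -> Geumsan
- 세종 (Sejong) -> Sejong, Gongju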
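The station-to-region reading listed above can be kept next to the predictions as a simple lookup. A minimal sketch: the keys are the `지점명` values that appear in the model output, the region names are romanized, and `STATION_TO_REGIONS` and `regions_for` are hypothetical names.

```python
# Station name in the model output -> regions it should be read as covering.
STATION_TO_REGIONS = {
    '서산': ['Seosan', 'Dangjin', 'Taean'],
    '대전': ['Daejeon', 'Gyeryong'],
    '홍성': ['Hongseong', 'Yesan'],
    '천안': ['Cheonan', 'Asan'],
    '보령': ['Boryeong', 'Seocheon'],
    '부여': ['Buyeo', 'Cheongyang'],
    '금산': ['Geumsan'],
    '세종': ['Sejong', 'Gongju'],
}


def regions_for(station_name):
    """Return the regions covered by a station name from the model output."""
    return STATION_TO_REGIONS[station_name]


seosan_regions = regions_for('서산')
```

Applied to a prediction table, this would turn a row labeled '서산' into a statement about Seosan, Dangjin, and Taean together.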

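# Future potential and practical use

- **With a real-time traffic accident prediction model, patrols can prioritize areas with a high predicted accident probability, which helps prevent accidents and speeds up response.**

- **By calling the KMA Open API in real time, the model can additionally provide the accident probability and predicted accident occurrence within the next hour for each region.**

- **Because the model is trained on time-series data, more data becomes available for training as time passes, so its performance can keep improving.**

- **The model uses hourly data, but with the KMA's daily data the same mechanism could be used to build a model that predicts daily accident probability and occurrence.**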