#### **<span style="color:red">[Project Overview]</span>**
* Mission: build a machine learning model that predicts fine dust (PM10) concentration

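#### **<span style="color:red">[Data Description]</span>**

* Training data
    * air_2021.csv : 2021 fine dust data
    * weather_2021.csv : 2021 weather data
* Test data
    * air_2022.csv : 2022 fine dust data
    * weather_2022.csv : 2022 weather data

###### Data sources
- AirKorea (에어코리아): fine dust data
- Korea Meteorological Administration open weather data portal (기상청 기상자료개방포털): weather data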

# Data Processing

#### Import libraries

In [1]:
import pandas as pd
import datetime

#### Load the data

In [40]:
# Load the data
air_21 = pd.read_csv("./data/air_2021.csv", sep=',', index_col = 0, encoding = 'utf-8' )
air_22 = pd.read_csv("./data/air_2022.csv", sep=',', index_col = 0, encoding = 'utf-8' )
weather_21 = pd.read_csv("./data/weather_2021.csv", sep = ',', encoding='cp949')
weather_22 = pd.read_csv("./data/weather_2022.csv", sep = ',', encoding='cp949')

#### Select only the required columns

In [3]:
air_21 = air_21[['측정일시','SO2','CO','O3','NO2','PM10','PM25']]
air_22 = air_22[['측정일시','SO2','CO','O3','NO2','PM10','PM25']]

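weather_21 = weather_21.iloc[:,2:] # keep every column except the first two (지점 = station ID, 지점명 = station name)
weather_22 = weather_22.iloc[:,2:] # keep every column except the first two (지점 = station ID, 지점명 = station name)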

#### Merge the data frames

In [4]:
# Shift the hour back by 1 before parsing (e.g. 2021100101 -> 2021-10-01 00:00)
air_21['time'] = pd.to_datetime(air_21['측정일시']-1, format = '%Y%m%d%H') # convert to datetime
weather_21['time']=pd.to_datetime(weather_21['일시']) # convert to datetime
df_21 = pd.merge(air_21, weather_21, on='time', how='left') # merge the data frames on the time column

air_22['time'] = pd.to_datetime(air_22['측정일시']-1, format = '%Y%m%d%H') # convert to datetime
weather_22['time']=pd.to_datetime(weather_22['일시']) # convert to datetime
df_22 = pd.merge(air_22, weather_22, on='time', how='left') # merge the data frames on the time column
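
The `- 1` in the conversion above is easy to miss. Judging by the preview values (for example `2021100101` becoming `2021-10-01 00:00`), `측정일시` appears to label hours as 01 to 24, while `%H` only accepts 00 to 23. A minimal sketch with made-up toy values (not taken from the dataset) illustrates the effect:

```python
import pandas as pd

# Toy values in the same YYYYMMDDHH layout as the 측정일시 column (assumed hours 01..24).
raw = pd.Series([2021100101, 2021100102, 2021100124])

# Subtracting 1 from the integer moves the hour field from 01..24 to 00..23.
shifted = raw - 1

# Convert to strings explicitly so the format string is applied character by character.
parsed = pd.to_datetime(shifted.astype(str), format="%Y%m%d%H")
print(parsed)
```

Without the subtraction, the value ending in `24` would not be a valid hour for `%H`.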

In [5]:
df_21

Unnamed: 0,측정일시,SO2,CO,O3,NO2,PM10,PM25,time,일시,기온(°C),...,최저운고(100m ),시정(10m),지면상태(지면상태코드),현상번호(국내식),지면온도(°C),지면온도 QC플래그,5cm 지중온도(°C),10cm 지중온도(°C),20cm 지중온도(°C),30cm 지중온도(°C)
0,2021100101,0.003,0.6,0.002,0.039,31.0,18.0,2021-10-01 00:00:00,2021-10-01 00:00,19.2,...,,2000.0,,,17.8,,22.3,22.3,22.7,22.9
1,2021100102,0.003,0.6,0.002,0.035,27.0,16.0,2021-10-01 01:00:00,2021-10-01 01:00,18.7,...,,2000.0,,,17.4,,22.1,22.0,22.5,22.9
2,2021100103,0.003,0.6,0.002,0.033,28.0,18.0,2021-10-01 02:00:00,2021-10-01 02:00,18.3,...,,2000.0,,,17.2,,21.8,21.8,22.4,22.8
3,2021100104,0.003,0.6,0.002,0.030,26.0,16.0,2021-10-01 03:00:00,2021-10-01 03:00,17.7,...,,2000.0,,,17.0,,21.6,21.6,22.2,22.8
4,2021100105,0.003,0.5,0.003,0.026,26.0,16.0,2021-10-01 04:00:00,2021-10-01 04:00,17.4,...,,2000.0,,,16.5,,21.3,21.4,22.0,22.7
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8755,2021093020,0.003,0.7,0.020,0.036,35.0,24.0,2021-09-30 19:00:00,2021-09-30 19:00,22.7,...,,2000.0,,,20.6,,24.2,23.7,23.0,22.6
8756,2021093021,0.003,0.6,0.016,0.035,34.0,21.0,2021-09-30 20:00:00,2021-09-30 20:00,21.7,...,,2000.0,,,19.9,,23.7,23.4,23.1,22.8
8757,2021093022,0.003,0.6,0.012,0.036,30.0,19.0,2021-09-30 21:00:00,2021-09-30 21:00,20.9,...,,2000.0,,,19.2,,23.3,23.1,23.0,22.9
8758,2021093023,0.003,0.6,0.004,0.042,33.0,19.0,2021-09-30 22:00:00,2021-09-30 22:00,20.4,...,,2000.0,,,18.6,,23.0,22.8,22.9,22.9
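
A left merge keeps every air-quality row even when no weather row shares its timestamp, and the weather columns of such rows become NaN. One way to count those rows is the `indicator` option of `pd.merge`. The sketch below uses two tiny made-up frames, not the real files:

```python
import pandas as pd

# Toy air-quality frame with two hourly rows.
air = pd.DataFrame({
    "time": pd.to_datetime(["2021-01-01 00:00", "2021-01-01 01:00"]),
    "PM10": [24.0, 25.0],
})

# Toy weather frame that is missing the 00:00 observation.
weather = pd.DataFrame({
    "time": pd.to_datetime(["2021-01-01 01:00"]),
    "기온(°C)": [-8.7],
})

# indicator=True adds a "_merge" column that says where each row came from.
merged = pd.merge(air, weather, on="time", how="left", indicator=True)

# Rows marked "left_only" had no matching weather timestamp.
n_unmatched = int((merged["_merge"] == "left_only").sum())
print(n_unmatched)
```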


#### Set and sort the index

In [6]:
drop_col = ['측정일시','일시'] # columns to remove

In [7]:
df_21.drop(drop_col, axis=1, inplace=True) # drop the columns
df_21 = df_21.set_index(keys='time') # set the index
df_21 = df_21.sort_index(ascending=True) # sort by index

df_22.drop(drop_col, axis=1, inplace=True) # drop the columns
df_22 = df_22.set_index(keys='time') # set the index
df_22 = df_22.sort_index(ascending=True) # sort by index

#### Remove unneeded columns
- Uses the drop() method

In [8]:
# List of columns to remove
drop_col_2 = ['기온 QC플래그',
              '강수량 QC플래그',
              '풍속 QC플래그',
              '풍향 QC플래그',
              '습도 QC플래그',
              '현지기압 QC플래그',
              '해면기압 QC플래그',
              '일조 QC플래그',
              '일사 QC플래그',
              '현상번호(국내식)',
              '지면온도 QC플래그',
              '지면상태(지면상태코드)',
              '운형(운형약어)']

In [9]:
df_21.drop(drop_col_2, axis=1, inplace=True) # drop the columns

df_22.drop(drop_col_2, axis=1, inplace=True) # drop the columns
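
Most entries in `drop_col_2` are QC flag columns that share the suffix `QC플래그`. As an alternative to typing each name, they could be collected by suffix. This is only a sketch on a toy frame whose columns are borrowed from the list above; the remaining non-QC columns in `drop_col_2` would still need to be listed by hand:

```python
import pandas as pd

# Toy frame with two measurement columns and their QC flag columns.
toy = pd.DataFrame({
    "기온(°C)": [1.0],
    "기온 QC플래그": [0],
    "풍속(m/s)": [2.0],
    "풍속 QC플래그": [0],
})

# Collect every column whose name ends with the QC flag suffix.
qc_cols = [col for col in toy.columns if col.endswith("QC플래그")]

# Drop them in one call; the measurement columns are left untouched.
cleaned = toy.drop(columns=qc_cols)
print(list(cleaned.columns))
```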

In [10]:
df_21.head()

time,SO2,CO,O3,NO2,PM10,PM25,기온(°C),강수량(mm),풍속(m/s),풍향(16방위),...,3시간신적설(cm),전운량(10분위),중하층운량(10분위),최저운고(100m ),시정(10m),지면온도(°C),5cm 지중온도(°C),10cm 지중온도(°C),20cm 지중온도(°C),30cm 지중온도(°C)
2021-01-01 00:00:00,0.002,0.5,0.022,0.016,24.0,14.0,,,,,...,,,,,,,,,,
2021-01-01 01:00:00,0.002,0.6,0.018,0.02,25.0,14.0,-8.7,,2.4,270.0,...,,0.0,0.0,,2000.0,-6.9,-1.0,-0.8,0.3,1.6
2021-01-01 02:00:00,0.002,0.6,0.013,0.025,27.0,16.0,-9.1,,1.6,270.0,...,,0.0,0.0,,2000.0,-7.1,-1.1,-0.8,0.3,1.6
2021-01-01 03:00:00,0.003,0.6,0.011,0.027,23.0,13.0,-9.3,,1.1,250.0,...,,0.0,0.0,,2000.0,-7.3,-1.2,-0.9,0.3,1.6
2021-01-01 04:00:00,0.003,0.6,0.008,0.032,24.0,14.0,-9.3,,0.3,0.0,...,,0.0,0.0,,2000.0,-7.5,-1.3,-1.0,0.2,1.5


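#### Handle missing values

- For precipitation, NaN is replaced with 0
- Other missing values are filled with the previous value (forward fill); anything still missing is set to 0

- Uses the fillna() and ffill() methods


In [11]:
df_21['강수량(mm)'] = df_21['강수량(mm)'].fillna(0)
df_21 = df_21.ffill() # fill NaN with the previous value (fillna(method='pad') is deprecated in recent pandas)
df_21 = df_21.fillna(0) # fill any remaining NaN with 0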

df_21.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 8760 entries, 2021-01-01 00:00:00 to 2021-12-31 23:00:00
Data columns (total 28 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   SO2            8760 non-null   float64
 1   CO             8760 non-null   float64
 2   O3             8760 non-null   float64
 3   NO2            8760 non-null   float64
 4   PM10           8760 non-null   float64
 5   PM25           8760 non-null   float64
 6   기온(°C)         8760 non-null   float64
 7   강수량(mm)        8760 non-null   float64
 8   풍속(m/s)        8760 non-null   float64
 9   풍향(16방위)       8760 non-null   float64
 10  습도(%)          8760 non-null   float64
 11  증기압(hPa)       8760 non-null   float64
 12  이슬점온도(°C)      8760 non-null   float64
 13  현지기압(hPa)      8760 non-null   float64
 14  해면기압(hPa)      8760 non-null   float64
 15  일조(hr)         8760 non-null   float64
 16  일사(MJ/m2)      8760 non-null   float64
 17  적설(cm)         8

In [12]:
df_22['강수량(mm)'] = df_22['강수량(mm)'].fillna(0)
df_22 = df_22.ffill() # fill NaN with the previous value
df_22 = df_22.fillna(0) # fill any remaining NaN with 0
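
The order of the three fill steps matters: forward fill cannot repair a gap in the very first row, because there is no earlier value to copy, so the final `fillna(0)` catches it. A minimal sketch on made-up values (column names reused from the notebook, numbers invented):

```python
import numpy as np
import pandas as pd

# Toy frame with gaps at the start and at the end of each column.
toy = pd.DataFrame({
    "강수량(mm)": [np.nan, 1.5, np.nan],
    "기온(°C)": [np.nan, -8.7, np.nan],
})

# Step 1: missing precipitation is treated as "no rain".
toy["강수량(mm)"] = toy["강수량(mm)"].fillna(0)

# Step 2: other gaps take the most recent earlier value.
toy = toy.ffill()

# Step 3: a gap in the first row has no earlier value, so it falls back to 0.
toy = toy.fillna(0)
print(toy)
```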

#### Add calendar features and the PM10 value at the same hour on the previous day

In [13]:
df_21['month'] = df_21.index.month
df_21['day'] = df_21.index.day
df_21['hour'] = df_21.index.hour
df_21['PM10_lag1'] = df_21['PM10'].shift(24)

df_22['month'] = df_22.index.month
df_22['day'] = df_22.index.day
df_22['hour'] = df_22.index.hour
df_22['PM10_lag1'] = df_22['PM10'].shift(24)
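
`shift(24)` moves values by 24 rows, not by 24 hours, so it only means "same hour yesterday" when the index has exactly one row per hour with no gaps. The sketch below checks that assumption on a toy hourly index before building the lag; `is_hourly` is a hypothetical helper name used only for this illustration:

```python
import numpy as np
import pandas as pd

# Toy frame: 48 consecutive hourly rows with PM10 values 0, 1, 2, ...
idx = pd.date_range("2021-01-01", periods=48, freq=pd.Timedelta(hours=1))
toy = pd.DataFrame({"PM10": np.arange(48, dtype=float)}, index=idx)

# True only if every step between consecutive index entries is exactly one hour.
steps = toy.index.to_series().diff().dropna()
is_hourly = bool((steps == pd.Timedelta(hours=1)).all())

# With a gap-free hourly index, shifting by 24 rows gives the value from 24 hours earlier.
toy["PM10_lag1"] = toy["PM10"].shift(24)
print(is_hourly)
print(toy.iloc[22:26])
```

The first 24 rows have no previous-day value, so their lag is NaN.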

#### Create the target column
- The goal is to predict the PM10 value one hour ahead.
- Uses the shift() method

In [14]:
df_21['target'] = df_21['PM10'].shift(-1)
df_21 = df_21.dropna()

df_22['target'] = df_22['PM10'].shift(-1)
df_22 = df_22.dropna()
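
Both shifts leave NaN at the edges: `shift(24)` blanks the first 24 rows and `shift(-1)` blanks the last one, so `dropna()` should remove 25 rows. For the 8760 hourly rows of 2021 that leaves 8735, assuming no other NaN survives the filling step; this agrees with the training-set size that LightGBM reports later in the notebook. A small sketch on a toy 72-hour frame:

```python
import numpy as np
import pandas as pd

# Toy frame: 72 consecutive hourly rows.
n_hours = 72
idx = pd.date_range("2021-01-01", periods=n_hours, freq=pd.Timedelta(hours=1))
toy = pd.DataFrame({"PM10": np.arange(n_hours, dtype=float)}, index=idx)

# Same two shifts as in the notebook.
toy["PM10_lag1"] = toy["PM10"].shift(24)   # NaN in the first 24 rows
toy["target"] = toy["PM10"].shift(-1)      # NaN in the last row

# dropna() removes every row that has a NaN in any column.
kept = toy.dropna()
print(len(kept))

# The same arithmetic applied to one full year of hourly data.
rows_2021 = 8760 - 24 - 1
print(rows_2021)
```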

In [15]:
df_21.head(2)

time,SO2,CO,O3,NO2,PM10,PM25,기온(°C),강수량(mm),풍속(m/s),풍향(16방위),...,지면온도(°C),5cm 지중온도(°C),10cm 지중온도(°C),20cm 지중온도(°C),30cm 지중온도(°C),month,day,hour,PM10_lag1,target
2021-01-02 00:00:00,0.003,0.4,0.032,0.009,32.0,15.0,-3.4,0.0,2.5,320.0,...,-3.8,-0.1,-0.5,0.1,1.3,1,2,0,24.0,25.0
2021-01-02 01:00:00,0.003,0.4,0.031,0.01,25.0,9.0,-4.1,0.0,2.3,320.0,...,-4.7,-0.2,-0.5,0.1,1.3,1,2,1,25.0,29.0


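#### Split into train and test data
- Training data: df_21
- Test data: df_22
1. Separate x and y
    - Uses the drop() method
2. Separate the train and test sets
    - The split is by year (2021 for training, 2022 for testing), so train_test_split from sklearn.model_selection is not needed here

In [16]:
# Set the target (= y)
target = 'target'

# Split the data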
x_train = df_21.drop(target,axis=1)
y_train = df_21.loc[:, target]

x_test = df_22.drop(target,axis=1)
y_test = df_22.loc[:, target]

In [17]:
# Save the data
x_train.to_csv('x_train.csv',index=False, encoding="utf-8-sig")
y_train.to_csv('y_train.csv',index=False, encoding="utf-8-sig")
x_test.to_csv('x_test.csv',index=False, encoding="utf-8-sig")
y_test.to_csv('y_test.csv',index=False, encoding="utf-8-sig")

# Machine Learning Modeling

In [18]:
# Import the required libraries

import pandas as pd

from sklearn.neighbors import KNeighborsRegressor
from sklearn.linear_model import LinearRegression, Lasso, Ridge, ElasticNet
from sklearn.tree import DecisionTreeRegressor as DTR
from sklearn.ensemble import RandomForestRegressor as RFR
from sklearn.ensemble import GradientBoostingRegressor as GBR
from sklearn.metrics import mean_squared_error as mse
from sklearn.metrics import r2_score
import joblib

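#### Load the data

In [19]:
# Training data
x_train = pd.read_csv('./x_train.csv')
y_train = pd.read_csv('./y_train.csv').squeeze('columns') # single column -> 1-D Series, the shape scikit-learn expects for y

# Test data
x_test = pd.read_csv('./x_test.csv')
y_test = pd.read_csv('./y_test.csv').squeeze('columns') # single column -> 1-D Series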

In [20]:
x_test

Unnamed: 0,SO2,CO,O3,NO2,PM10,PM25,기온(°C),강수량(mm),풍속(m/s),풍향(16방위),...,시정(10m),지면온도(°C),5cm 지중온도(°C),10cm 지중온도(°C),20cm 지중온도(°C),30cm 지중온도(°C),month,day,hour,PM10_lag1
0,0.004,0.8,0.002,0.052,38.0,24.0,-2.8,0.0,2.3,50,...,2000,-3.3,-0.4,-0.7,-0.3,0.8,1,2,0,23.0
1,0.004,0.8,0.002,0.052,34.0,23.0,-2.9,0.0,2.1,50,...,2000,-3.4,-0.4,-0.7,-0.3,0.8,1,2,1,20.0
2,0.004,0.8,0.002,0.052,35.0,26.0,-2.6,0.0,1.9,50,...,2000,-2.5,-0.4,-0.7,-0.3,0.8,1,2,2,20.0
3,0.004,0.6,0.002,0.046,33.0,24.0,-2.1,0.0,2.5,50,...,2000,-2.1,-0.4,-0.7,-0.3,0.8,1,2,3,19.0
4,0.003,0.5,0.005,0.039,33.0,25.0,-1.9,0.0,2.0,50,...,2000,-2.1,-0.3,-0.7,-0.3,0.8,1,2,4,24.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2130,0.002,0.4,0.044,0.010,10.0,9.0,13.2,0.0,4.3,340,...,2000,12.4,12.9,12.0,10.8,10.1,3,31,18,29.0
2131,0.002,0.4,0.036,0.017,11.0,8.0,12.3,0.0,2.9,340,...,2000,10.0,12.6,11.8,11.0,10.2,3,31,19,34.0
2132,0.002,0.4,0.032,0.018,10.0,7.0,11.6,0.0,2.7,340,...,2000,8.9,12.2,11.6,11.0,10.3,3,31,20,49.0
2133,0.003,0.3,0.038,0.013,11.0,5.0,10.5,0.0,3.5,320,...,2000,7.8,11.8,11.4,11.0,10.4,3,31,21,51.0


#### Machine learning modeling

In [21]:
# Declare the LinearRegression model
lg_model = LinearRegression()

lg_model.fit(x_train, y_train)

In [22]:
# Predict
y_pred_LR = lg_model.predict(x_test)

# Evaluate (first line: MSE, second line: R2)
print(mse(y_test, y_pred_LR))
print(r2_score(y_test, y_pred_LR))

37.71227049086118
0.9321963888313434
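
MSE is expressed in squared units, so its square root (RMSE) is easier to compare with the PM10 values themselves. A small worked computation using the MSE printed above; the µg/m³ unit is an assumption based on how PM10 is normally reported, since the notebook does not state it:

```python
import math

# MSE of the linear regression on the 2022 test set, copied from the output above.
mse_value = 37.71227049086118

# RMSE is the square root of MSE and has the same unit as the target (assumed µg/m³).
rmse = math.sqrt(mse_value)
print(round(rmse, 2))
```

An RMSE of roughly 6.14 means the one-hour-ahead predictions are typically off by about 6 PM10 units.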


# Modeling with Multiple Algorithms
- Try several algorithms and find the one that performs best
```
    from sklearn.linear_model import LinearRegression
    from sklearn.neighbors import KNeighborsRegressor
    from sklearn.tree import DecisionTreeRegressor
    from sklearn.ensemble import RandomForestRegressor
    from xgboost import XGBRegressor
    from lightgbm import LGBMRegressor
    
    # Evaluation metrics
    from sklearn.metrics import mean_absolute_error
    from sklearn.metrics import mean_squared_error
    from sklearn.metrics import mean_absolute_percentage_error
    from sklearn.metrics import r2_score
```

In [30]:
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import GradientBoostingRegressor
from xgboost import XGBRegressor
from lightgbm import LGBMRegressor

# Evaluation metrics
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_percentage_error
from sklearn.metrics import r2_score

#### Create a dictionary to store the results

In [36]:
results = {}

In [37]:
# Declare the models
LR_model = LinearRegression()
Tree_model = DecisionTreeRegressor()
RFR_model = RandomForestRegressor()
xg_model = XGBRegressor()
lgb_model = LGBMRegressor()
gb_model = GradientBoostingRegressor()

models = [LR_model, Tree_model, RFR_model, xg_model, lgb_model, gb_model]

In [38]:
# Train and evaluate each model
for n, model in enumerate(models):
    model.fit(x_train, y_train) # train the model
    
    y_pred = model.predict(x_test) # predict with the trained model
    
    model_mae = mean_absolute_error(y_test, y_pred)
    model_mse = mean_squared_error(y_test, y_pred)
    model_r2 = r2_score(y_test, y_pred)
    results[n]=[model_mae, model_mse,model_r2] # stored as [MAE, MSE, R2]

[LightGBM] [Info] Total Bins 4154
[LightGBM] [Info] Number of data points in the train set: 8735, number of used features: 32
[LightGBM] [Info] Start training from score 38.810303




In [39]:
results

{0: [3.9072058090607067, 37.71227049086118, 0.9321963888313434],
 1: [6.147540983606557, 90.4032786885246, 0.837462219145521],
 2: [3.998922716627635, 40.60448824355972, 0.9269964153116748],
 3: [4.283485021859198, 45.51755702188999, 0.9181631151482306],
 4: [3.863094135736097, 40.44897937389284, 0.9272760076776323],
 5: [3.9628791517059727, 43.00139321855987, 0.9226869740922045]}
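
The dictionary keys are just positions in the `models` list, which makes the output hard to read. One way to label it is to turn the results into a DataFrame. The sketch below copies the numbers from the output above; the row labels are the class names in the order the models were declared, and the column names `MAE`, `MSE`, `R2` are chosen here for illustration:

```python
import pandas as pd

# Results copied from the notebook output: index -> [MAE, MSE, R2].
results = {
    0: [3.9072058090607067, 37.71227049086118, 0.9321963888313434],
    1: [6.147540983606557, 90.4032786885246, 0.837462219145521],
    2: [3.998922716627635, 40.60448824355972, 0.9269964153116748],
    3: [4.283485021859198, 45.51755702188999, 0.9181631151482306],
    4: [3.863094135736097, 40.44897937389284, 0.9272760076776323],
    5: [3.9628791517059727, 43.00139321855987, 0.9226869740922045],
}

# Same order as the models list in the notebook.
model_names = [
    "LinearRegression",
    "DecisionTreeRegressor",
    "RandomForestRegressor",
    "XGBRegressor",
    "LGBMRegressor",
    "GradientBoostingRegressor",
]

# One row per model, one column per metric, sorted so the lowest MSE comes first.
table = pd.DataFrame.from_dict(results, orient="index", columns=["MAE", "MSE", "R2"])
table.index = model_names
table = table.sort_values("MSE")
print(table)
```

In this run the plain linear regression has the lowest MSE and highest R2, while LightGBM has a slightly lower MAE.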

---

In [29]:
LR_model.fit(x_train, y_train)

# Predict
y_pred = LR_model.predict(x_test)

# Evaluate (first line: MSE, second line: R2)
print(mse(y_test, y_pred))
print(r2_score(y_test, y_pred))

37.71227049086118
0.9321963888313434
