# Seoul Mean Temperature Prediction Baseline

## Workflow
1. Load libraries and the dataset
2. EDA
   1. Check data types
   2. Summarize insights
3. Preprocessing
   1. Train/validation split  
   2. Missing-value handling
   3. Encoding (LabelEncoder or dummy variables): one-hot or label encoding for object-type columns
   4. Scaling: StandardScaler / MinMaxScaler
4. Modeling
   1. Model selection
      - Machine learning models: random forest, SVM, LightGBM, XGBoost, CatBoost, etc.
      - Deep learning models: LSTM, CNN-LSTM, GRU, etc.
      - Classical models: ARIMA, linear regression, etc.
   2. Hyperparameter tuning (Optuna, grid search)
   3. Cross-validation
   4. Ensembling
5. Producing results
   1. pd.DataFrame.to_csv()
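
Because the rows are daily observations ordered in time, one option for step 3.1 is a chronological split rather than a random one, so that every validation row comes after every training row. The sketch below is only an illustration: `chronological_split`, the cutoff date, and the ten-row `demo` frame are hypothetical and not part of the original notebook.

```python
import numpy as np
import pandas as pd


def chronological_split(df, date_col, valid_start):
    """Return (train, validation): rows before valid_start, and rows on or after it."""
    dates = pd.to_datetime(df[date_col])
    cutoff = pd.Timestamp(valid_start)
    train_part = df.loc[dates < cutoff].copy()
    valid_part = df.loc[dates >= cutoff].copy()
    return train_part, valid_part


# Synthetic stand-in for train.csv (values are made up for illustration only).
demo = pd.DataFrame({
    "일시": pd.date_range("2020-01-01", periods=10, freq="D").strftime("%Y-%m-%d"),
    "평균기온": np.arange(10, dtype=float),
})

train_part, valid_part = chronological_split(demo, "일시", "2020-01-08")
print(len(train_part), len(valid_part))
```

On the real data the cutoff could be, for example, the start of the last full year, so that validation resembles the one-year-ahead forecasting task.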

## 1. Loading Libraries and the Dataset

In [14]:
# 1. Load libraries and the dataset

# Core libraries
import pandas as pd
import numpy as np

# Libraries for EDA and visualization
import seaborn as sns
import matplotlib.pyplot as plt

# Preprocessing
from sklearn.preprocessing import StandardScaler, MinMaxScaler, LabelEncoder

# Modeling
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV


In [54]:
# 2. Load the dataset
train = pd.read_csv('./dataset/train.csv')
# `test` is read from the submission template: dates plus a placeholder target column
test = pd.read_csv('./dataset/sample_submission.csv')
train.head()

Unnamed: 0,일시,최고기온,최저기온,일교차,강수량,평균습도,평균풍속,일조합,일사합,일조율,평균기온
0,1960-01-01,2.2,-5.2,7.4,,68.3,1.7,6.7,,,-1.6
1,1960-01-02,1.2,-5.6,6.8,0.4,87.7,1.3,0.0,,,-1.9
2,1960-01-03,8.7,-2.1,10.8,0.0,81.3,3.0,0.0,,,4.0
3,1960-01-04,10.8,1.2,9.6,0.0,79.7,4.4,2.6,,,7.5
4,1960-01-05,1.3,-8.2,9.5,,44.0,5.1,8.2,,,-4.6


In [57]:
# Is it really appropriate to predict 2023 mean temperatures from data going back to 1960...?
# Let's try it anyway.
test.head()

Unnamed: 0,일시,평균기온
0,2023-01-01,0
1,2023-01-02,0
2,2023-01-03,0
3,2023-01-04,0
4,2023-01-05,0
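
Since the test frame contains only dates and a placeholder target, any prediction has to come from information derivable from the date itself. One simple reference point is a calendar-day climatology: the historical mean of 평균기온 for each (month, day) pair. The sketch below is not the notebook's method; it assumes synthetic data, and `train_demo`, `test_demo`, `climatology`, and `pred` are hypothetical names.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for train.csv: two years of a made-up seasonal signal.
dates = pd.date_range("2021-01-01", "2022-12-31", freq="D")
doy = dates.dayofyear.to_numpy()
train_demo = pd.DataFrame({
    "일시": dates,
    "평균기온": 12 - 14 * np.cos(2 * np.pi * (doy - 15) / 365.25),
})

# Stand-in for sample_submission.csv: dates only, target filled with 0.
test_demo = pd.DataFrame({
    "일시": pd.date_range("2023-01-01", periods=5, freq="D").strftime("%Y-%m-%d"),
    "평균기온": 0,
})

# Climatology: mean target for each (month, day) pair in the training data.
keys = [
    train_demo["일시"].dt.month.rename("month"),
    train_demo["일시"].dt.day.rename("day"),
]
climatology = train_demo.groupby(keys)["평균기온"].mean().rename("pred").reset_index()

# Attach the calendar keys to the test dates and look up the climatology.
test_dates = pd.to_datetime(test_demo["일시"])
lookup = pd.DataFrame({"month": test_dates.dt.month, "day": test_dates.dt.day})
merged = lookup.merge(climatology, on=["month", "day"], how="left")
test_demo["평균기온"] = merged["pred"].to_numpy()
print(test_demo)
```

A baseline like this gives a score that any trained model should beat. It would also need special handling for 29 February if that date appears in the test period.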


In [71]:
# Drop the helper column if present; errors='ignore' avoids a KeyError when it does not exist.
train.drop(columns=['일시_datetime'], errors='ignore', inplace=True)

In [None]:
train.head()

Unnamed: 0,일시,최고기온,최저기온,일교차,강수량,평균습도,평균풍속,일조합,일사합,일조율,평균기온
0,1960-01-01,2.2,-5.2,7.4,,68.3,1.7,6.7,,,-1.6
1,1960-01-02,1.2,-5.6,6.8,0.4,87.7,1.3,0.0,,,-1.9
2,1960-01-03,8.7,-2.1,10.8,0.0,81.3,3.0,0.0,,,4.0
3,1960-01-04,10.8,1.2,9.6,0.0,79.7,4.4,2.6,,,7.5
4,1960-01-05,1.3,-8.2,9.5,,44.0,5.1,8.2,,,-4.6


## 2. EDA

- info
- describe
- isna().sum()

In [None]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 23011 entries, 0 to 23010
Data columns (total 11 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   일시      23011 non-null  object 
 1   최고기온    23008 non-null  float64
 2   최저기온    23008 non-null  float64
 3   일교차     23007 non-null  float64
 4   강수량     9150 non-null   float64
 5   평균습도    23011 non-null  float64
 6   평균풍속    23007 non-null  float64
 7   일조합     22893 non-null  float64
 8   일사합     18149 non-null  float64
 9   일조율     22645 non-null  float64
 10  평균기온    23011 non-null  float64
dtypes: float64(10), object(1)
memory usage: 1.9+ MB


### EDA - 1) info
#### Column descriptions
> 일시: date (datetime)  
> 최고기온: maximum temperature  
> 최저기온: minimum temperature  
> 일교차: diurnal temperature range (maximum - minimum)  
> 강수량: rainfall  
> 평균습도: mean humidity  
> 평균풍속: mean wind speed  
> 일조합: total sunshine duration (time during which direct sunlight reaches the ground without being blocked by clouds or fog), measured in hours (hr)  
> 일사합: total insolation (sum of solar energy reaching the ground), in MJ/m²  
> 일조율: sunshine rate (sunshine duration / possible sunshine duration), the fraction of the time between sunrise and sunset with sunshine  
> 평균기온: mean temperature (`Target value`)
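
The definition of 일교차 as maximum minus minimum can be checked directly against the data. The sketch below does so for the five rows shown in the earlier `train.head()` output; `head`, `recomputed`, and `matches` are hypothetical names introduced for this check.

```python
import numpy as np
import pandas as pd

# The five rows from the train.head() output earlier in the notebook.
head = pd.DataFrame({
    "최고기온": [2.2, 1.2, 8.7, 10.8, 1.3],
    "최저기온": [-5.2, -5.6, -2.1, 1.2, -8.2],
    "일교차": [7.4, 6.8, 10.8, 9.6, 9.5],
})

# Recompute the range and compare with a small tolerance for float rounding.
recomputed = head["최고기온"] - head["최저기온"]
matches = np.isclose(recomputed, head["일교차"], atol=1e-6)
print(matches.all())
```

If the relationship holds across the full dataset, 일교차 carries no information beyond the two temperature columns, and a missing 일교차 could be recomputed wherever both temperatures are present.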

In [72]:
# 일시 is an object-type column, so convert it to datetime first.
train['일시'] = train['일시'].astype('datetime64[ns]')
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 23011 entries, 0 to 23010
Data columns (total 11 columns):
 #   Column  Non-Null Count  Dtype         
---  ------  --------------  -----         
 0   일시      23011 non-null  datetime64[ns]
 1   최고기온    23008 non-null  float64       
 2   최저기온    23008 non-null  float64       
 3   일교차     23007 non-null  float64       
 4   강수량     9150 non-null   float64       
 5   평균습도    23011 non-null  float64       
 6   평균풍속    23007 non-null  float64       
 7   일조합     22893 non-null  float64       
 8   일사합     18149 non-null  float64       
 9   일조율     22645 non-null  float64       
 10  평균기온    23011 non-null  float64       
dtypes: datetime64[ns](1), float64(10)
memory usage: 1.9 MB


In [73]:
# Check missing values
train.isna().sum()

일시          0
최고기온        3
최저기온        3
일교차         4
강수량     13861
평균습도        0
평균풍속        4
일조합       118
일사합      4862
일조율       366
평균기온        0
dtype: int64

일사합 and 강수량 have a large number of missing values: 4,862 and 13,861 of 23,011 rows, respectively.



In [78]:
train.loc[train['최고기온'].isna()]

Unnamed: 0,일시,최고기온,최저기온,일교차,강수량,평균습도,평균풍속,일조합,일사합,일조율,평균기온
2606,1967-02-19,,,,,62.0,1.8,9.5,,93.1,-1.7
5037,1973-10-16,,,,0.4,74.0,1.8,3.5,9.24,29.7,12.3
21104,2017-10-12,,8.8,,,71.0,2.0,,2.23,0.0,11.4
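
Only three rows lack 최고기온, and they are isolated days far apart in time, so a light-touch fill may be enough. One possible approach, shown here on made-up values rather than the real rows, is time-based interpolation between neighboring days. The frame `demo` and the `limit=2` setting are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Made-up temperatures around a single missing day (not the real 1967 values).
demo = pd.DataFrame({
    "일시": pd.date_range("1967-02-17", periods=5, freq="D"),
    "최고기온": [3.0, 4.0, np.nan, 6.0, 7.0],
})

# method="time" needs a DatetimeIndex and weights neighbors by elapsed time;
# limit=2 leaves longer runs of missing values only partially filled.
filled = demo.set_index("일시")["최고기온"].interpolate(method="time", limit=2)
print(filled)
```

Interpolation uses the following day's value as well as the previous one, which is acceptable for cleaning historical training data but would leak future information if applied to a forecasting input.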


In [None]:
train.describe()

Unnamed: 0,최고기온,최저기온,일교차,강수량,평균습도,평균풍속,일조합,일사합,일조율,평균기온
count,23008.0,23008.0,23007.0,9150.0,23011.0,23007.0,22893.0,18149.0,22645.0,23011.0
mean,17.071714,8.45196,8.619277,9.593683,65.202851,2.380993,5.858826,11.93317,48.653526,12.415419
std,10.714471,10.578285,2.907096,21.966135,14.549077,0.947595,3.816941,6.419122,31.662321,10.489515
min,-13.6,-20.2,1.0,0.0,17.9,0.1,0.0,0.0,0.0,-16.4
25%,7.8,-0.3,6.6,0.1,54.9,1.7,2.2,7.0,17.8,3.4
50%,18.9,9.2,8.6,1.4,65.5,2.2,6.6,11.22,55.7,13.8
75%,26.4,17.9,10.6,8.5,75.8,2.9,9.0,16.62,78.0,21.8
max,39.6,30.3,19.6,332.8,99.8,7.8,13.7,33.48,96.9,33.7



In [None]:
# 11 columns, 23,011 rows
train.shape

(23011, 11)

In [None]:
# Check the fraction of missing values in each column.
train.isna().sum() / train.shape[0]

일시      0.000000
최고기온    0.000130
최저기온    0.000130
일교차     0.000174
강수량     0.602364
평균습도    0.000000
평균풍속    0.000174
일조합     0.005128
일사합     0.211290
일조율     0.015905
평균기온    0.000000
dtype: float64

In [None]:
train['강수량'].head(20)

0     NaN
1     0.4
2     0.0
3     0.0
4     NaN
5     0.0
6     0.1
7     0.0
8     1.2
9     0.1
10    NaN
11    0.0
12    NaN
13    0.4
14    0.0
15    0.0
16    NaN
17    0.0
18    NaN
19    0.0
Name: 강수량, dtype: float64
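
The column mixes explicit `0.0` values with `NaN`, so the two probably do not mean the same thing. One possible reading, which the source does not confirm, is that `NaN` marks days with no precipitation recorded at all. Under that assumption, a sketch of a fill that still preserves the distinction could look like this; the `강수량_missing` and `강수량_filled` columns are hypothetical.

```python
import numpy as np
import pandas as pd

# First ten 강수량 values from the output above.
rain = pd.Series(
    [np.nan, 0.4, 0.0, 0.0, np.nan, 0.0, 0.1, 0.0, 1.2, 0.1],
    name="강수량",
)

demo = rain.to_frame()
# Keep a flag so a model can still tell "originally missing" from "recorded 0.0".
demo["강수량_missing"] = demo["강수량"].isna().astype(int)
# ASSUMPTION: NaN means no rain; verify against the data provider's notes first.
demo["강수량_filled"] = demo["강수량"].fillna(0.0)
print(demo)
```

Keeping the indicator column means the choice of fill value can be revisited later without losing track of which rows were originally missing.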

In [None]:
test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 358 entries, 0 to 357
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   일시      358 non-null    object
 1   평균기온    358 non-null    int64 
dtypes: int64(1), object(1)
memory usage: 5.7+ KB
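
As with `train`, the 일시 column in `test` is still an object type, and it is the only usable input at prediction time. A minimal sketch of turning it into model features follows, assuming a cyclical day-of-year encoding is wanted. The helper `add_date_features` and the feature names `year`, `month`, `doy_sin`, and `doy_cos` are hypothetical and do not appear in the original notebook.

```python
import numpy as np
import pandas as pd


def add_date_features(df, date_col="일시"):
    """Return a copy of df with the date parsed and calendar features added."""
    out = df.copy()
    out[date_col] = pd.to_datetime(out[date_col])
    doy = out[date_col].dt.dayofyear
    out["year"] = out[date_col].dt.year
    out["month"] = out[date_col].dt.month
    # Sine/cosine pair so that 31 December and 1 January end up close together.
    out["doy_sin"] = np.sin(2 * np.pi * doy / 365.25)
    out["doy_cos"] = np.cos(2 * np.pi * doy / 365.25)
    return out


# Small stand-in with the same columns as the submission template.
test_demo = pd.DataFrame({
    "일시": ["2023-01-01", "2023-01-02", "2023-01-03"],
    "평균기온": [0, 0, 0],
})
features = add_date_features(test_demo)
print(features.dtypes)
```

Applying the same function to `train` and `test` keeps the feature columns identical on both sides.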
