# Before We Start

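### What is **Logloss (log loss)**?
- A model performance metric (loss function) used in classification problems
- Measures the difference between the probability distribution the model predicts and the probability distribution of the true labels -> the smaller, the better the performance!
- Why use it?
  - It evaluates the model directly on the ***probabilities it predicts***!
    - (If performance is judged only on the final predicted classes, there is no way to tell with how much probability each class was predicted)
  - The ***negative log*** is used to ***penalize more heavily*** when the probability assigned to the true class is low!
    - (As that probability gets lower, the logloss value increases sharply)
  - So compute and output predicted probabilities with predict_proba, not predict!
- Reference: https://seoyoungh.github.io/machine-learning/ml-logloss/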
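
As a small illustration of the penalty described above, the sketch below evaluates the per-sample loss -log(p), where p is the probability the model assigned to the true class. The helper name `per_sample_logloss` and the probability values are made up for illustration.

```python
import math

def per_sample_logloss(p_true_class):
    """Loss contributed by one sample: negative log of the probability given to the true class."""
    return -math.log(p_true_class)

# Hypothetical probabilities assigned to the correct class, from confident to badly wrong
probs = [0.9, 0.5, 0.1, 0.01]
losses = [per_sample_logloss(p) for p in probs]

for p, loss in zip(probs, losses):
    print(f"p = {p:<4} -> loss = {loss:.3f}")
```

The loss grows slowly near p = 1 and very quickly as p approaches 0, which is why confident wrong predictions dominate the score.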


# Dacon Example: Transcription & Study

### import

In [None]:
import pandas as pd
import numpy as np
import random
import os
import gc

from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import LabelEncoder

In [None]:
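# Function that fixes the random seeds
def seed_everything(seed):
  # Fix the seeds used by Python's random module, Python hashing, and NumPy
  random.seed(seed)
  # (PYTHONHASHSEED is read at interpreter startup, so this mainly affects child processes)
  os.environ['PYTHONHASHSEED'] = str(seed)
  np.random.seed(seed)

seed_everything(42)    # Fix every seed to 42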

### csv to parquet
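Using a memory-efficient data format reduces file size and speeds up work

- **What is a parquet file?**
  - A columnar storage format with compression that comes from the Hadoop ecosystem
  - Big-data processing takes a lot of time and money, so data must load quickly, compress well, and not be tied to a specific language.
    - Formats with these properties include parquet, ORC, and avro
  - A storage format that can save time and memory compared with csv
  - Because data is organized by column, compression ratios are high and only the needed columns can be read, which saves resources
  - The data schema is stored automatically along with the data
  - Reading a parquet file with pandas returns a DataFrame!
  - References
    - https://butter-shower.tistory.com/245
    - https://pearlluck.tistory.com/561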

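- **What is gc.collect()?**
  - gc: garbage collection
  - When the threshold-based automatic gc is not enough, it runs gc manually to free the memory of objects caught in reference cycles
  - The cycle-detection algorithm splits objects into reachable and unreachable ones; reachable objects are moved to the next generation, and unreachable objects are freed after their callbacks run
  - Returns the number of unreachable objects it found
  - References
    - https://twinparadox.tistory.com/623
    - https://wikidocs.net/13969
    - https://velog.io/@zihs0822/Python%EC%9D%98-GC%EC%99%80-GIL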
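
A minimal sketch of what `gc.collect()` does with a reference cycle, using only the standard library. The `Node` class is hypothetical, and the exact count returned depends on the Python version, so no specific number is claimed here.

```python
import gc

class Node:
    """Tiny object that can point at another Node."""
    def __init__(self):
        self.partner = None

gc.disable()                  # keep the automatic collector out of the way for the demo
gc.collect()                  # start from a clean state

a, b = Node(), Node()
a.partner, b.partner = b, a   # a and b now reference each other (a cycle)
del a, b                      # the names are gone, but the cycle keeps both objects alive

found = gc.collect()          # the cycle detector finds the unreachable objects and frees them
gc.enable()
print("unreachable objects found:", found)
```

Reference counting alone cannot free `a` and `b` here, because each still holds a reference to the other; that is the case the manual call is meant to cover.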


In [None]:
file_url = '/content/drive/MyDrive/데이콘_연습/[1] 월간 데이콘 항공편 지연 예측 AI 경진대회/'

In [None]:
# Function that saves a csv file as a parquet file
def csv_to_parquet(csv_path, save_name):
  df = pd.read_csv(csv_path)
  df.to_parquet(file_url + f'{save_name}.parquet')
  del df    # Drop the reference to the DataFrame (the csv file itself is untouched)
  gc.collect()    # Run garbage collection manually
  print(save_name, 'Done.')

In [None]:
csv_to_parquet(file_url + 'train.csv', 'train')
csv_to_parquet(file_url + 'test.csv', 'test')

train Done.
test Done.


### Data Load

In [None]:
# Reading the saved parquet files returns DataFrames
train = pd.read_parquet(file_url + 'train.parquet')
test = pd.read_parquet(file_url + 'test.parquet')
sample_submission = pd.read_csv(file_url + 'sample_submission.csv', index_col = 0)    # Use column 0 as the index

In [None]:
sample_submission

ID,Not_Delayed,Delayed
TEST_000000,0,1
TEST_000001,0,1
TEST_000002,0,1
TEST_000003,0,1
TEST_000004,0,1
...,...,...
TEST_999995,0,1
TEST_999996,0,1
TEST_999997,0,1
TEST_999998,0,1


### Data Pre-Processing

##### Replace missing values in every variable except the label (Delay) with the training-data mode

In [None]:
train.isna().sum()

ID                               0
Month                            0
Day_of_Month                     0
Estimated_Departure_Time    109019
Estimated_Arrival_Time      109040
Cancelled                        0
Diverted                         0
Origin_Airport                   0
Origin_Airport_ID                0
Origin_State                109015
Destination_Airport              0
Destination_Airport_ID           0
Destination_State           109079
Distance                         0
Airline                     108920
Carrier_Code(IATA)          108990
Carrier_ID(DOT)             108997
Tail_Number                      0
Delay                       744999
dtype: int64

In [None]:
test.isna().sum()

ID                               0
Month                            0
Day_of_Month                     0
Estimated_Departure_Time    108984
Estimated_Arrival_Time      109048
Cancelled                        0
Diverted                         0
Origin_Airport                   0
Origin_Airport_ID                0
Origin_State                106505
Destination_Airport              0
Destination_Airport_ID           0
Destination_State           106523
Distance                         0
Airline                     106527
Carrier_Code(IATA)          108993
Carrier_ID(DOT)             109006
Tail_Number                      0
dtype: int64

In [None]:
# mode(): returns the most frequent value
NaN_col = ['Estimated_Departure_Time', 'Estimated_Arrival_Time', 'Origin_State', 'Destination_State', 'Airline', 'Carrier_Code(IATA)', 'Carrier_ID(DOT)']

for col in NaN_col:
  mode = train[col].mode()[0]
  train[col] = train[col].fillna(mode)

  # Nulls in test must be filled with the same value (the train mode)
  if col in test.columns:
    test[col] = test[col].fillna(mode)

print('Done.')

Done.
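
The imputation step above can be sketched on a toy column. The frames and values below are made up; the point is only that the mode is computed on train and then reused for test, so both sets are filled consistently.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for the train/test frames (values are made up)
toy_train = pd.DataFrame({"Airline": ["A", "A", "B", np.nan]})
toy_test = pd.DataFrame({"Airline": [np.nan, "B", np.nan]})

# mode() ignores NaN by default; [0] takes the first (most frequent) value
train_mode = toy_train["Airline"].mode()[0]

# The same train-derived value fills the gaps in both frames
toy_train["Airline"] = toy_train["Airline"].fillna(train_mode)
toy_test["Airline"] = toy_test["Airline"].fillna(train_mode)

print(train_mode)
print(toy_test["Airline"].tolist())
```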


##### Convert the categorical variables to numbers

In [None]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000000 entries, 0 to 999999
Data columns (total 19 columns):
 #   Column                    Non-Null Count    Dtype  
---  ------                    --------------    -----  
 0   ID                        1000000 non-null  object 
 1   Month                     1000000 non-null  int64  
 2   Day_of_Month              1000000 non-null  int64  
 3   Estimated_Departure_Time  1000000 non-null  float64
 4   Estimated_Arrival_Time    1000000 non-null  float64
 5   Cancelled                 1000000 non-null  int64  
 6   Diverted                  1000000 non-null  int64  
 7   Origin_Airport            1000000 non-null  object 
 8   Origin_Airport_ID         1000000 non-null  int64  
 9   Origin_State              1000000 non-null  object 
 10  Destination_Airport       1000000 non-null  object 
 11  Destination_Airport_ID    1000000 non-null  int64  
 12  Destination_State         1000000 non-null  object 
 13  Distance                  10

In [None]:
qual_col = ['Origin_Airport', 'Origin_State', 'Destination_Airport', 'Destination_State', 'Airline', 'Carrier_Code(IATA)', 'Tail_Number']

for i in qual_col:
  le = LabelEncoder()
  le = le.fit(train[i])
  train[i] = le.transform(train[i])

  # Encode test with the same mapping
  # If test contains a label that train does not, append it to the classes before encoding
  for label in np.unique(test[i]):
    if label not in le.classes_:
      le.classes_ = np.append(le.classes_, label)
  test[i] = le.transform(test[i])

print('Done.')

Done.
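
The idea behind extending `le.classes_` can be sketched without scikit-learn: build a label-to-integer mapping from train, then give labels that appear only in test the next free integers. The helpers `fit_mapping` and `extend_mapping` and the airport codes are hypothetical, and this is an illustration of the idea rather than how `LabelEncoder` is implemented.

```python
def fit_mapping(train_values):
    """Map each distinct train label to an integer, in sorted order."""
    return {label: idx for idx, label in enumerate(sorted(set(train_values)))}

def extend_mapping(mapping, test_values):
    """Give labels that appear only in test the next free integers."""
    extended = dict(mapping)
    for label in sorted(set(test_values)):
        if label not in extended:
            extended[label] = len(extended)
    return extended

# Hypothetical airport codes; "JFK" appears only in test
train_airports = ["ORD", "LAX", "ORD", "SFO"]
test_airports = ["LAX", "JFK", "SFO"]

mapping = extend_mapping(fit_mapping(train_airports), test_airports)
encoded_test = [mapping[a] for a in test_airports]
print(mapping)
print(encoded_test)
```

The integer assigned to an unseen label was never observed during training, so the model can only treat it as an arbitrary new value.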


##### Remove rows whose label (Delay) is null

In [None]:
train.isna().sum()

ID                               0
Month                            0
Day_of_Month                     0
Estimated_Departure_Time         0
Estimated_Arrival_Time           0
Cancelled                        0
Diverted                         0
Origin_Airport                   0
Origin_Airport_ID                0
Origin_State                     0
Destination_Airport              0
Destination_Airport_ID           0
Destination_State                0
Distance                         0
Airline                          0
Carrier_Code(IATA)               0
Carrier_ID(DOT)                  0
Tail_Number                      0
Delay                       744999
dtype: int64

In [None]:
train = train.dropna()
train.isna().sum()

ID                          0
Month                       0
Day_of_Month                0
Estimated_Departure_Time    0
Estimated_Arrival_Time      0
Cancelled                   0
Diverted                    0
Origin_Airport              0
Origin_Airport_ID           0
Origin_State                0
Destination_Airport         0
Destination_Airport_ID      0
Destination_State           0
Distance                    0
Airline                     0
Carrier_Code(IATA)          0
Carrier_ID(DOT)             0
Tail_Number                 0
Delay                       0
dtype: int64

In [None]:
# Store each label name (taken from the columns) with its numeric value in a dict
column_number = {}
for i, column in enumerate(sample_submission.columns):
  column_number[column] = i
column_number

{'Not_Delayed': 0, 'Delayed': 1}

In [None]:
# Convert the train label, which was of object type, to a numeric value
def to_number(x, dic):
  return dic[x]

train.loc[:, 'Delay_num'] = train['Delay'].apply(lambda x: to_number(x, column_number))
print('Done.')

Done.


In [None]:
# Drop unneeded columns and shape the data for training
train_x = train.drop(columns=['ID', 'Delay', 'Delay_num'])
train_y  = train['Delay_num']
test_x = test.drop(columns=['ID'])

### Classification Model Fit

In [None]:
clf = RandomForestClassifier()
clf.fit(train_x, train_y)

### Inference

In [None]:
# Model performance is evaluated with LogLoss, so use predict_proba instead of predict
y_pred = clf.predict_proba(test_x)
y_pred

array([[0.78, 0.22],
       [0.47, 0.53],
       [0.57, 0.43],
       ...,
       [0.83, 0.17],
       [0.8 , 0.2 ],
       [0.85, 0.15]])

### Submit

In [None]:
submission = pd.DataFrame(data=y_pred, columns=sample_submission.columns, index=sample_submission.index)
submission.to_csv(file_url + 'baseline_submission.csv', index=True)

In [None]:
submission

ID,Not_Delayed,Delayed
TEST_000000,0.78,0.22
TEST_000001,0.47,0.53
TEST_000002,0.57,0.43
TEST_000003,0.58,0.42
TEST_000004,0.69,0.31
...,...,...
TEST_999995,0.66,0.34
TEST_999996,0.97,0.03
TEST_999997,0.83,0.17
TEST_999998,0.80,0.20


# Trying It On My Own

In [None]:
import pandas as pd
import numpy as np

In [None]:
file_url = '/content/drive/MyDrive/데이콘_연습/[1] 월간 데이콘 항공편 지연 예측 AI 경진대회/'

train = pd.read_csv(file_url + 'train.csv')
train.head()

Unnamed: 0,ID,Month,Day_of_Month,Estimated_Departure_Time,Estimated_Arrival_Time,Cancelled,Diverted,Origin_Airport,Origin_Airport_ID,Origin_State,Destination_Airport,Destination_Airport_ID,Destination_State,Distance,Airline,Carrier_Code(IATA),Carrier_ID(DOT),Tail_Number,Delay
0,TRAIN_000000,4,15,,,0,0,OKC,13851,Oklahoma,HOU,12191,Texas,419.0,Southwest Airlines Co.,WN,19393.0,N7858A,
1,TRAIN_000001,8,15,740.0,1024.0,0,0,ORD,13930,Illinois,SLC,14869,Utah,1250.0,SkyWest Airlines Inc.,UA,20304.0,N125SY,
2,TRAIN_000002,9,6,1610.0,1805.0,0,0,CLT,11057,North Carolina,LGA,12953,New York,544.0,American Airlines Inc.,AA,19805.0,N103US,
3,TRAIN_000003,7,10,905.0,1735.0,0,0,LAX,12892,California,EWR,11618,New Jersey,2454.0,United Air Lines Inc.,UA,,N595UA,
4,TRAIN_000004,1,11,900.0,1019.0,0,0,SFO,14771,California,ACV,10157,California,250.0,SkyWest Airlines Inc.,UA,20304.0,N161SY,


In [None]:
test = pd.read_csv(file_url + 'test.csv')
test.head()

Unnamed: 0,ID,Month,Day_of_Month,Estimated_Departure_Time,Estimated_Arrival_Time,Cancelled,Diverted,Origin_Airport,Origin_Airport_ID,Origin_State,Destination_Airport,Destination_Airport_ID,Destination_State,Distance,Airline,Carrier_Code(IATA),Carrier_ID(DOT),Tail_Number
0,TEST_000000,12,16,1156.0,,0,0,IAH,12266,Texas,SAT,14683,Texas,191.0,United Air Lines Inc.,UA,,N79402
1,TEST_000001,9,12,1500.0,1715.0,0,0,EWR,11618,New Jersey,ATL,10397,,746.0,Delta Air Lines Inc.,DL,19790.0,N3765
2,TEST_000002,3,6,1600.0,1915.0,0,0,ORD,13930,Illinois,LGA,12953,New York,733.0,United Air Lines Inc.,UA,19977.0,N413UA
3,TEST_000003,5,18,1920.0,2045.0,0,0,OAK,13796,California,LAX,12892,California,337.0,Southwest Airlines Co.,WN,19393.0,N905WN
4,TEST_000004,7,7,1915.0,2152.0,0,0,FLL,11697,Florida,LAX,12892,California,2343.0,JetBlue Airways,B6,20409.0,N945JT


In [None]:
train.columns

Index(['ID', 'Month', 'Day_of_Month', 'Estimated_Departure_Time',
       'Estimated_Arrival_Time', 'Cancelled', 'Diverted', 'Origin_Airport',
       'Origin_Airport_ID', 'Origin_State', 'Destination_Airport',
       'Destination_Airport_ID', 'Destination_State', 'Distance', 'Airline',
       'Carrier_Code(IATA)', 'Carrier_ID(DOT)', 'Tail_Number', 'Delay'],
      dtype='object')

In [None]:
train.isna().sum()

ID                               0
Month                            0
Day_of_Month                     0
Estimated_Departure_Time    109019
Estimated_Arrival_Time      109040
Cancelled                        0
Diverted                         0
Origin_Airport                   0
Origin_Airport_ID                0
Origin_State                109015
Destination_Airport              0
Destination_Airport_ID           0
Destination_State           109079
Distance                         0
Airline                     108920
Carrier_Code(IATA)          108990
Carrier_ID(DOT)             108997
Tail_Number                      0
Delay                       744999
dtype: int64

# Data Preprocessing

### **Check missing values in the label**

In [None]:
air_train = train.loc[train['Delay'].dropna().index]
air_train

Unnamed: 0,ID,Month,Day_of_Month,Estimated_Departure_Time,Estimated_Arrival_Time,Cancelled,Diverted,Origin_Airport,Origin_Airport_ID,Origin_State,Destination_Airport,Destination_Airport_ID,Destination_State,Distance,Airline,Carrier_Code(IATA),Carrier_ID(DOT),Tail_Number,Delay
5,TRAIN_000005,4,13,1545.0,,0,0,EWR,11618,,DCA,11278,Virginia,199.0,Republic Airlines,UA,20452.0,N657RW,Not_Delayed
6,TRAIN_000006,1,20,1742.0,1903.0,0,0,EWR,11618,New Jersey,BOS,10721,Massachusetts,200.0,United Air Lines Inc.,UA,,N66825,Not_Delayed
8,TRAIN_000008,6,13,1420.0,1550.0,0,0,BWI,10821,,CLT,11057,North Carolina,361.0,Southwest Airlines Co.,WN,19393.0,N765SW,Not_Delayed
10,TRAIN_000010,8,13,1730.0,1844.0,0,0,DCA,11278,Virginia,PIT,14122,Pennsylvania,204.0,Republic Airlines,AA,,N119HQ,Delayed
12,TRAIN_000012,1,12,1015.0,1145.0,0,0,CLE,11042,Ohio,DEN,11292,Colorado,1201.0,Southwest Airlines Co.,WN,,N8696E,Not_Delayed
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
999962,TRAIN_999962,10,11,,2003.0,0,0,SAT,14683,Texas,ORD,13930,Illinois,1041.0,SkyWest Airlines Inc.,UA,20304.0,N152SY,Not_Delayed
999963,TRAIN_999963,5,2,1759.0,1926.0,0,0,LGA,12953,New York,DCA,11278,Virginia,214.0,,DL,20452.0,N871RW,Delayed
999969,TRAIN_999969,10,10,940.0,1056.0,0,0,MFE,13256,Texas,IAH,12266,Texas,316.0,Mesa Airlines Inc.,,20378.0,N89321,Delayed
999985,TRAIN_999985,8,8,1914.0,2039.0,0,0,RDU,14492,North Carolina,JAX,12451,Florida,407.0,Frontier Airlines Inc.,F9,20436.0,N316FR,Not_Delayed


In [None]:
air_train['Delay'].value_counts()

Not_Delayed    210001
Delayed         45000
Name: Delay, dtype: int64

In [None]:
air_train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 255001 entries, 5 to 999992
Data columns (total 19 columns):
 #   Column                    Non-Null Count   Dtype  
---  ------                    --------------   -----  
 0   ID                        255001 non-null  object 
 1   Month                     255001 non-null  int64  
 2   Day_of_Month              255001 non-null  int64  
 3   Estimated_Departure_Time  227160 non-null  float64
 4   Estimated_Arrival_Time    227317 non-null  float64
 5   Cancelled                 255001 non-null  int64  
 6   Diverted                  255001 non-null  int64  
 7   Origin_Airport            255001 non-null  object 
 8   Origin_Airport_ID         255001 non-null  int64  
 9   Origin_State              227145 non-null  object 
 10  Destination_Airport       255001 non-null  object 
 11  Destination_Airport_ID    255001 non-null  int64  
 12  Destination_State         227323 non-null  object 
 13  Distance                  255001 non-null  f

In [None]:
test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000000 entries, 0 to 999999
Data columns (total 18 columns):
 #   Column                    Non-Null Count    Dtype  
---  ------                    --------------    -----  
 0   ID                        1000000 non-null  object 
 1   Month                     1000000 non-null  int64  
 2   Day_of_Month              1000000 non-null  int64  
 3   Estimated_Departure_Time  891016 non-null   float64
 4   Estimated_Arrival_Time    890952 non-null   float64
 5   Cancelled                 1000000 non-null  int64  
 6   Diverted                  1000000 non-null  int64  
 7   Origin_Airport            1000000 non-null  object 
 8   Origin_Airport_ID         1000000 non-null  int64  
 9   Origin_State              893495 non-null   object 
 10  Destination_Airport       1000000 non-null  object 
 11  Destination_Airport_ID    1000000 non-null  int64  
 12  Destination_State         893477 non-null   object 
 13  Distance                  10

In [None]:
air_test = test.copy()    # Deep copy
air_test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000000 entries, 0 to 999999
Data columns (total 18 columns):
 #   Column                    Non-Null Count    Dtype  
---  ------                    --------------    -----  
 0   ID                        1000000 non-null  object 
 1   Month                     1000000 non-null  int64  
 2   Day_of_Month              1000000 non-null  int64  
 3   Estimated_Departure_Time  891016 non-null   float64
 4   Estimated_Arrival_Time    890952 non-null   float64
 5   Cancelled                 1000000 non-null  int64  
 6   Diverted                  1000000 non-null  int64  
 7   Origin_Airport            1000000 non-null  object 
 8   Origin_Airport_ID         1000000 non-null  int64  
 9   Origin_State              893495 non-null   object 
 10  Destination_Airport       1000000 non-null  object 
 11  Destination_Airport_ID    1000000 non-null  int64  
 12  Destination_State         893477 non-null   object 
 13  Distance                  10

### **Label-encoding the nominal variables**

In [None]:
qual_cols = list(air_train.select_dtypes(include=['object']).columns)
qual_cols = qual_cols[1:-1]
qual_cols

['Origin_Airport',
 'Origin_State',
 'Destination_Airport',
 'Destination_State',
 'Airline',
 'Carrier_Code(IATA)',
 'Tail_Number']

In [None]:
air_train[qual_cols] = air_train[qual_cols].astype('category')
air_test[qual_cols] = air_test[qual_cols].astype('category')

In [None]:
from sklearn.preprocessing import LabelEncoder

for i in qual_cols:

  # Process the train data
  le = LabelEncoder()
  le = le.fit(air_train[i])
  air_train[i] = le.transform(air_train[i])

  # Process the test data
  for label in np.unique(air_test[i]):
    if label not in le.classes_:    # Handle values that appear only in the test data
      le.classes_ = np.append(le.classes_, label)
  air_test[i] = le.transform(air_test[i])

TypeError: ignored
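
The traceback above is truncated, so the exact cause is not shown. One plausible cause (an assumption) is that `air_train` and `air_test` still contain NaN in these columns, unlike the earlier walkthrough where missing values were filled first, and sorting a mix of strings and float NaN raises a `TypeError`. The sketch below reproduces that situation on a toy column and shows that filling NaN first avoids it. The `"Unknown"` placeholder is hypothetical; the train mode used earlier would work as well.

```python
import numpy as np
import pandas as pd

# Toy column mixing strings and a missing value (values are made up)
toy_col = pd.Series(["Texas", np.nan, "Utah"], dtype="object")

try:
    np.unique(toy_col.to_numpy())
    unique_failed = False
except TypeError as err:
    # Sorting has to compare str with float NaN, which Python does not allow
    unique_failed = True
    print("TypeError:", err)

# Filling the gaps first leaves only strings, so sorting works
filled = toy_col.fillna("Unknown")
labels = np.unique(filled.to_numpy())
print(labels.tolist())
```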

# Model Evaluation

### **Python Assert (assertion statement)**
- Usage
  - assert condition, 'message'
    - (The message can be omitted!)
    - -> If the condition is false, [AssertionError: message] is raised
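
A short sketch of the syntax above; the function `safe_ratio` and its message are made up for illustration.

```python
def safe_ratio(numerator, denominator):
    # The assert states an assumption; if it is false, AssertionError is raised with the message
    assert denominator != 0, "denominator must not be zero"
    return numerator / denominator

print(safe_ratio(1, 4))

try:
    safe_ratio(1, 0)
except AssertionError as err:
    caught_message = str(err)
    print("AssertionError:", caught_message)
```

Assertions are skipped when Python runs with the `-O` flag, so they suit sanity checks better than validation of user input.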

In [None]:
def log_loss(y_true, y_pred):
  assert len(y_true) == len(y_pred), "Lengths of true labels and predicted probabilities must be equal."
  n_samples = len(y_true)
  log_loss_value = -np.sum(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred)) / n_samples
  return log_loss_value

# Sample data
y_true = np.array([1, 0, 1, 1, 0, 1])
y_pred = np.array([0.9, 0.1, 0.8, 0.7, 0.2, 0.6])

# Compute the log loss
loss_value = log_loss(y_true, y_pred)
print("Log loss:", loss_value)
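
The function above returns infinity if any predicted probability is exactly 0 or 1 on the wrong side, because log(0) is undefined. A common workaround, sketched here as a variant, is to clip the probabilities slightly away from 0 and 1 first. The name `log_loss_clipped` and the default `eps=1e-15` are assumptions chosen for illustration.

```python
import numpy as np

def log_loss_clipped(y_true, y_pred, eps=1e-15):
    """Binary log loss with probabilities clipped into [eps, 1 - eps]."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.clip(np.asarray(y_pred, dtype=float), eps, 1 - eps)
    assert y_true.shape == y_pred.shape, "Shapes of labels and probabilities must match."
    return float(-np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred)))

# Same sample data as above
sample_loss = log_loss_clipped([1, 0, 1, 1, 0, 1], [0.9, 0.1, 0.8, 0.7, 0.2, 0.6])

# A prediction of exactly 0 for a positive sample stays finite after clipping
extreme_loss = log_loss_clipped([1], [0.0])
print(sample_loss, extreme_loss)
```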