### Week 1. Titanic Classification with DistilBERT + XGBoost 과제

Week 1. 과제로는 `week1_lecture.ipynb` 코드 예제를 참고하여, `kaggle`에서 titanic 데이터를 받아 데이터 전처리를 진행한 후 학습해 자신의 모델을 만든 후, ipynb 파일과 결과 csv파일을

- ***hw1_학번_이름.ipynb***
- ***submission.csv***

실전데이터사이언스 스터디 repository week1 폴더에 커밋하시면 됩니다.!

https://github.com/a2ran/prac_ds

타이타닉 데이터 전처리 예시는 다음과 같습니다.

1. Filling out missing (NaN) Values
2. VIF으로 분산이 높은 column 제거 or 수정
3. train_test_split 으로 train/val 나누는 과정 수정
4. "bert-base-uncased" 이외 다른 BERT 모델을 사용해 Embedding하기
5. "xgboost" 이외 다른 머신러닝 알고리즘 사용

참고할만한 example은 다음과 같습니다.


https://github.com/minsuk-heo/kaggle-titanic/blob/master/titanic-solution.ipynb

`Optional` :

https://github.com/mrdbourke/your-first-kaggle-submission/blob/master/kaggle-titanic-dataset-example-submission-workflow.ipynb
https://github.com/agconti/kaggle-titanic/blob/master/Titanic.ipynb


# 기본 제공 코드

### 1. Kaggle에서 데이터 받아오기

In [None]:
## pip install ~ : 패키지 다운로드
## -q : 로그 메세지 출력 X

!pip install -q kaggle

In [None]:
## Kaggle에서 데이터를 받아오기 위해서는 Authenticator Token인 "kaggle.json"이 있어야 합니다.
## 자세한 내용은 영상을 참고하세요

from google.colab import drive
drive.mount("/content/drive")

!mkdir ~/.kaggle
## Drive에 kaggle.json을 업로드한 경로를 적으시면 됩니다. ex) (/content/drive/MyDrive/study_session/kaggle.json)
!cp /content/drive/MyDrive/ColabNotebooks/실데방/kaggle.json ~/.kaggle/

! chmod 600 ~/.kaggle/kaggle.json

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
# 공모전 이름
competition_name = "titanic"

# 공모전 다운로드 to local environment
! kaggle competitions download -c {competition_name}

# {competition_name}이름의 폴더에 zip 파일 압축해제
! unzip {competition_name + ".zip"} -d {competition_name}

# 드라이브 확인을 완료했으므로 드라이브 mount를 해제합니다.
drive.flush_and_unmount()

Downloading titanic.zip to /content
  0% 0.00/34.1k [00:00<?, ?B/s]
100% 34.1k/34.1k [00:00<00:00, 54.3MB/s]
Archive:  titanic.zip
  inflating: titanic/gender_submission.csv  
  inflating: titanic/test.csv        
  inflating: titanic/train.csv       


### 2. 데이터 전처리

In [None]:
## GPU 활성화
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

In [None]:
import numpy as np
import pandas as pd

train = pd.read_csv('titanic/train.csv')
test = pd.read_csv('titanic/test.csv')

In [None]:
train.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [None]:
train.isnull().sum() #Age, Cabin, Embarked 결측치 존재

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

In [52]:
###결측치 정리

##Cabin 제거
train.drop('Cabin', axis = 1, inplace = True)

## Age는 Pclass별 최빈값 계산해서 최빈값으로 결측치 대체
# Pclass별 최빈값 계산
mode_by_pclass = train.groupby('Pclass')['Age'].apply(lambda x: x.mode()[0])

# 결측치 채우기
train['Age'] = train.apply(lambda row: mode_by_pclass[row['Pclass']] if pd.isnull(row['Age']) else row['Age'], axis=1)

## Embarked도 최빈값으로 대체
train['Embarked'] = train['Embarked'].fillna(train['Embarked'].mode().iloc[0])

In [53]:
train.isnull().sum() #Age, Cabin, Embarked 결측치 사라짐

PassengerId    0
Survived       0
Pclass         0
Name           0
Sex            0
Age            0
SibSp          0
Parch          0
Ticket         0
Fare           0
Embarked       0
dtype: int64

In [None]:
test.isnull().sum() #Age, Cabin, Fare 결측치 존재

PassengerId      0
Pclass           0
Name             0
Sex              0
Age             86
SibSp            0
Parch            0
Ticket           0
Fare             1
Cabin          327
Embarked         0
dtype: int64

In [55]:
### test 결측치 정리

## Cabin 열 제거
test.drop('Cabin', axis = 1, inplace = True)

##Fare는 Pclass별 중위값으로 대체
test['Fare'] = test['Fare'].fillna(test.groupby('Pclass')['Fare'].transform('median'))

## Age
# Pclass별 최빈값 계산
mode_by_pclass = test.groupby('Pclass')['Age'].apply(lambda x: x.mode()[0])

# 결측치 채우기
test['Age'] = test.apply(lambda row: mode_by_pclass[row['Pclass']] if pd.isnull(row['Age']) else row['Age'], axis=1)



In [56]:
test.isnull().sum() #Age, Cabin, Fare 결측치 사라짐

PassengerId    0
Pclass         0
Name           0
Sex            0
Age            0
SibSp          0
Parch          0
Ticket         0
Fare           0
Embarked       0
dtype: int64

In [57]:
from sklearn.model_selection import train_test_split

X = train.drop(columns = ['PassengerId', 'Survived'], axis = 1)
y = train['Survived']

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size = 0.05, random_state = 77, stratify = y)

### 3. 데이터 임베딩

In [58]:
from typing import List
from tqdm.notebook import tqdm

!pip install -q transformers
from transformers import AutoModel, AutoTokenizer

class Encode_with_BERT:
    def __init__(self):
        ## Huggingface에서 BERT 모델을 받아옵니다.
        self.tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
        self.model = AutoModel.from_pretrained('bert-base-uncased').to(device)

    ## 입력한 `texts`을 768 길이의 숫자벡터로 특징을 추출합니다.
    def extract(self, texts: List[str]):
        features = np.zeros((len(texts), 768), dtype = np.float16)
        for index, text in enumerate(tqdm(texts)):
            tokenized_text = self.tokenizer(text, return_tensors="pt").to(device)
            model_output = self.model(**tokenized_text)[0].detach().cpu()
            features[index, :] = model_output.numpy().mean(axis=1)

        return features

[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m7.2/7.2 MB[0m [31m45.6 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m268.8/268.8 kB[0m [31m19.1 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m7.8/7.8 MB[0m [31m76.3 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m1.3/1.3 MB[0m [31m59.1 MB/s[0m eta [36m0:00:00[0m
[?25h

### 4. 임베딩한 데이터 머신러닝 알고리즘으로 분류

In [59]:
from xgboost.sklearn import XGBClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score

extractor = Encode_with_BERT()
scaler = StandardScaler()
classifier = XGBClassifier(use_label_encoder = False)

## XGBoost 학습
train_labels = [int(_) for _ in y_train.values] ## [0,1,1, ...]
texts = [", ".join(str(_)) for _ in X_train.values] ##[[0,1,male],[2,1,female]...]
train_features = scaler.fit_transform(extractor.extract(texts)) ## to (len(texts), 768) 숫자벡터
classifier.fit(train_features, train_labels) ## XGBoost train

## Prediction with Test Data
answer = [int(_) for _ in y_val.values]
texts = [", ".join(str(_)) for _ in X_val.values]
preds = classifier.predict(scaler.transform(extractor.extract(texts)))

## 모델 예측의 정밀도 측정!
accuracy = accuracy_score(answer, preds)
print(f'\naccuracy : {accuracy*100:.2f}%')

Downloading (…)okenizer_config.json:   0%|          | 0.00/28.0 [00:00<?, ?B/s]

Downloading (…)lve/main/config.json:   0%|          | 0.00/570 [00:00<?, ?B/s]

Downloading (…)solve/main/vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

Downloading (…)/main/tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

Downloading model.safetensors:   0%|          | 0.00/440M [00:00<?, ?B/s]

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.predictions.transform.dense.bias', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


  0%|          | 0/846 [00:00<?, ?it/s]

  0%|          | 0/45 [00:00<?, ?it/s]


accuracy : 86.67%


### 5. csv로 저장 + kaggle에 submit

In [60]:
import csv

ids = test['PassengerId'].values
texts = [", ".join(str(_)) for _ in test.iloc[:, 1:].values]
preds = classifier.predict(scaler.transform(extractor.extract(texts)))

with open('submission.csv', "w") as to_file:
    csvwriter = csv.writer(to_file, delimiter=",", quotechar='"')
    csvwriter.writerow(["PassengerId", "Survived"])
    for id, pred in zip(ids, preds):
        csvwriter.writerow([id, pred])

  0%|          | 0/418 [00:00<?, ?it/s]

In [61]:
sub = pd.read_csv('submission.csv')
sub.head(5)

Unnamed: 0,PassengerId,Survived
0,892,0
1,893,1
2,894,0
3,895,0
4,896,1
