Hello.

I'm a friend of Yugyeong's.

I'll comment on the code while changing it as little as possible.

---

In [1]:
import numpy as np
import random
import torch
import json
import os
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score, fbeta_score, f1_score
from PyKomoran import *
komoran=Komoran("EXP")

In [2]:
def set_seed():
    random.seed(777)
    np.random.seed(777)
    torch.manual_seed(777)
    if torch.cuda.is_available():
        torch.cuda.manual_seed_all(777)
set_seed()

In [3]:
#label list, mapping dict
label_list=['opening', 'request', 'wh-question', 'yn-question', 'inform', 'affirm', 'ack', 'expressive']
label_map = {label: i for i, label in enumerate(label_list)}

train_tfidf_list=list()
train_label_list=list()
test_tfidf_list=list()
test_label_list=list()

# Load train, test data
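- If you keep the existing code for this part, it would be good to wrap it in a function.
- Using `itertools.chain` and `zip` as below lets you load the data more intuitively and faster.
    - In Python it is usually better to avoid explicit loops where you can.
    - This matters especially in deep learning, where it pays to write Python code as efficiently as possible. In some cases, something that could finish in 5 minutes ends up taking 2 to 3 hours.
    - `itertools` is part of the Python standard library.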
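
As a rough illustration of both points (wrapping the loading step in a function, and using `chain` with `zip` instead of nested loops), here is a minimal sketch. The function name `split_corpus`, the toy data, and the meaning of the first field in each record are assumptions made for the example; the only thing taken from the notebook is that each record is a triple whose second and third items are the sentence and its label.

```python
from itertools import chain


def split_corpus(json_data, label_map):
    """Flatten {dialog_id: [[first_field, sentence, label], ...]} into parallel lists.

    The first field of each record is discarded, as in the notebook cells.
    """
    dialogs = json_data.values()
    # chain.from_iterable concatenates the per-dialog record lists;
    # zip(*records) then transposes the triples into three parallel tuples.
    _, sentences, labels = zip(*chain.from_iterable(dialogs))
    return list(sentences), [label_map[label] for label in labels]


# Hypothetical toy data with the assumed layout.
toy_data = {
    "dialog_1": [["A", "hello", "opening"], ["B", "can you help me", "request"]],
    "dialog_2": [["A", "thank you", "expressive"]],
}
toy_label_map = {"opening": 0, "request": 1, "expressive": 2}

sentences, label_ids = split_corpus(toy_data, toy_label_map)
print(sentences)
print(label_ids)
```

The same function could then be called once for the training file and once for the test file, which removes the duplicated loading code.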

In [4]:
from itertools import chain

In [5]:
# load json data
train_path = 'data/SpeechAct_tr.json'
test_path = 'data/SpeechAct_te.json'

with open(train_path) as json_file:
    tr_json_data=json.load(json_file)
    
with open(test_path) as json_file:
    te_json_data=json.load(json_file)

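- I removed `tag_list` from the Komoran tokenizer call.
    - For deep learning it is generally better to keep all parts of speech.
    - Filtering them out causes information loss, which can hurt performance.
- The tokenizer you are using is now passed to `TfidfVectorizer` as its `tokenizer`.
- I did not bother with a separate step to turn the TF-IDF output into a list; the whole matrix is converted once with `.toarray().tolist()`.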
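
To show what passing a tokenizer to `TfidfVectorizer` looks like, and how `fit_transform` on the training set differs from `transform` on the test set, here is a small sketch. It assumes a plain whitespace tokenizer (`whitespace_tokenizer`, a hypothetical stand-in for the Komoran method) and made-up English sentences, so it runs without PyKomoran.

```python
from sklearn.feature_extraction.text import TfidfVectorizer


def whitespace_tokenizer(text):
    """Hypothetical stand-in for a morphological tokenizer."""
    return text.split()


train_docs = ["hello there", "can you help me", "hello can you hear me"]
test_docs = ["hello stranger"]

# token_pattern=None only silences the warning that the default pattern is unused.
vectorizer = TfidfVectorizer(tokenizer=whitespace_tokenizer, token_pattern=None)

# fit_transform learns the vocabulary and IDF weights from the training data.
train_matrix = vectorizer.fit_transform(train_docs)

# transform reuses them, so the test matrix has the same columns;
# words never seen in training ("stranger") are simply dropped.
test_matrix = vectorizer.transform(test_docs)

print(train_matrix.shape, test_matrix.shape)
```

Calling `fit_transform` on the test set instead would build a different vocabulary, and the feature columns would no longer match what the model was trained on.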

In [6]:
## data preprocessing and TfidfVectorizer
# train
# zip over the dictionary items
_ ,tr_corpus = list(zip(*tr_json_data.items()))

# chain, then zip, to separate the sentence list and the label list.
_, tr_corpus, train_label_list = list(zip(*chain(*tr_corpus))) 

# convert labels to indices
train_label_list = [label_map[l] for l in train_label_list]

# define the TF-IDF vectorizer
tfidfvect = TfidfVectorizer(tokenizer=komoran.get_morphes_by_tags)

# Make sure you understand the difference between fit, transform, and fit_transform.
train_tfidf_list = tfidfvect.fit_transform(tr_corpus).toarray().tolist()

In [7]:
# test
_ ,te_corpus = list(zip(*te_json_data.items()))
_, te_corpus, test_label_list = list(zip(*chain(*te_corpus)))
test_label_list = [label_map[l] for l in test_label_list]
test_tfidf_list = tfidfvect.transform(te_corpus).toarray().tolist() # transform

In [8]:
# to tensor
train_tfidf_tensor = torch.tensor(train_tfidf_list)
train_label_tensor = torch.tensor(train_label_list)
test_tfidf_tensor = torch.tensor(test_tfidf_list)
test_label_tensor = torch.tensor(test_label_list)

In [9]:
print(train_tfidf_tensor.shape)
print(train_label_tensor.shape)
print(test_tfidf_tensor.shape)
print(test_label_tensor.shape)

torch.Size([5825, 1117])
torch.Size([5825])
torch.Size([6671, 1117])
torch.Size([6671])


- Use as large a `batch_size` as you can.
- Do not shuffle the test set.
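
One practical reason to leave the test set unshuffled is that predictions collected batch by batch stay aligned with the original examples. The sketch below illustrates this with random toy features and a tiny linear model; the data, the model, and the names `eval_loader`, `all_preds`, and `all_labels` are all made up for the example and are not part of the notebook.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(777)
features = torch.randn(10, 4)   # toy stand-in for TF-IDF features
labels = torch.arange(10) % 3   # toy labels in {0, 1, 2}
dataset = TensorDataset(features, labels)

# shuffle defaults to False, so batches follow the dataset order.
eval_loader = DataLoader(dataset, batch_size=4)

model = torch.nn.Linear(4, 3)   # hypothetical tiny classifier
model.eval()

all_preds, all_labels = [], []
with torch.no_grad():
    for batch_x, batch_y in eval_loader:
        all_preds.append(model(batch_x).argmax(dim=1))
        all_labels.append(batch_y)

all_preds = torch.cat(all_preds)
all_labels = torch.cat(all_labels)
accuracy = (all_preds == all_labels).float().mean().item()
print(f"accuracy on toy data: {accuracy:.3f}")
```

Because the order is preserved, `all_preds[i]` corresponds to the i-th example of the dataset, which makes error analysis straightforward.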

In [10]:
bs = 256
vocab_size = train_tfidf_tensor.shape[1]

# Wrap the tensors in datasets
Train_dataset = torch.utils.data.TensorDataset(train_tfidf_tensor, train_label_tensor)
Test_dataset = torch.utils.data.TensorDataset(test_tfidf_tensor, test_label_tensor)

# Data loaders that serve batches of size bs
train_DataLoader = torch.utils.data.DataLoader(Train_dataset, shuffle=True, batch_size=bs, num_workers=16)
test_DataLoader = torch.utils.data.DataLoader(Test_dataset, batch_size=bs, num_workers=8)

# Define Model

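- For hidden-layer activations, use ReLU unless you have a specific reason not to.
- Training most likely failed because the softmax step was missing from the loss computation.
    - The original code appears to have computed the loss without producing proper logits first, which would explain the very large loss.
    - `torch.nn.CrossEntropyLoss` supplies that step: it applies log-softmax internally, so pass it the raw logits from the last linear layer (as `forward` below does) rather than adding a separate softmax layer.
- If you are not yet comfortable with PyTorch, work through the official tutorials.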
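
To make the relationship between softmax and `CrossEntropyLoss` concrete, here is a small sketch on random toy logits (the tensors and variable names are made up for the example). It checks that `CrossEntropyLoss` on raw logits matches an explicit log-softmax followed by negative log-likelihood, and shows that applying softmax before the loss gives a different value.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(4, 8)             # 4 toy examples, 8 classes
labels = torch.tensor([0, 3, 5, 7])

criterion = torch.nn.CrossEntropyLoss()

# Intended usage: raw logits go straight into the loss.
loss_from_logits = criterion(logits, labels)

# Equivalent manual computation: log-softmax, then negative log-likelihood.
manual_loss = F.nll_loss(F.log_softmax(logits, dim=1), labels)

# Applying softmax first means softmax is effectively applied twice,
# which squashes the loss into a narrow range and weakens the gradients.
double_softmax_loss = criterion(F.softmax(logits, dim=1), labels)

print(loss_from_logits.item(), manual_loss.item(), double_softmax_loss.item())
```

The first two values agree, which is why the model's `forward` can simply return the output of its last linear layer.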

In [11]:
class Perceptron(torch.nn.Module):
    def __init__(self, vocab_size, label):
        super(Perceptron, self).__init__()
        self.linear1 = torch.nn.Linear(vocab_size, 512)
        torch.nn.init.xavier_normal_(self.linear1.weight)
        self.relu1 = torch.nn.ReLU()
        self.linear2 = torch.nn.Linear(512, 128)
        torch.nn.init.xavier_normal_(self.linear2.weight)
        self.relu2 = torch.nn.ReLU()
        self.linear3 = torch.nn.Linear(128, label)
        torch.nn.init.xavier_normal_(self.linear3.weight)

    def forward(self, X):
        y_pred = self.linear1(X)
        y_pred = self.relu1(y_pred)
        y_pred = self.linear2(y_pred)
        y_pred = self.relu2(y_pred)
        y_pred = self.linear3(y_pred)
        return y_pred

In [12]:
device = torch.device("cuda:1" if torch.cuda.is_available() else "cpu")
model = Perceptron(vocab_size = vocab_size, label=len(label_list))
model.to(device)
criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

In [13]:
epochs = 10

In [14]:
def acc(yhat, y):
    with torch.no_grad():
        yhat = yhat.max(dim=1)[1]
        acc = (yhat == y).float().mean()
    return acc

In [15]:
model.zero_grad()
for epoch in range(epochs):
    tr_loss = 0.
    tr_acc = 0.
    model.train()
    for step, batch in enumerate(train_DataLoader):
        
        inputs = batch[0].to(device)
        labels = batch[1].to(device)
        
        y_pred = model(inputs)
        loss = criterion(y_pred, labels)
        loss.backward()
        
        accuracy = acc(y_pred, labels)
        tr_acc += accuracy.item()
        tr_loss += loss.item()
        optimizer.step()
        model.zero_grad()
    print(f"{epoch} epoch - Train loss: {tr_loss} / Train accuracy: {tr_acc / (step+1)}")

add_(Number alpha, Tensor other)
Consider using one of the following signatures instead:
add_(Tensor other, Number alpha)
addcmul_(Number value, Tensor tensor1, Tensor tensor2)
Consider using one of the following signatures instead:
addcmul_(Tensor tensor1, Tensor tensor2, Number value)
addcdiv_(Number value, Tensor tensor1, Tensor tensor2)
Consider using one of the following signatures instead:
addcdiv_(Tensor tensor1, Tensor tensor2, Number value)


0 epoch - Train loss: 37.99076998233795 / Train accuracy: 0.5246483579925869
1 epoch - Train loss: 19.41859859228134 / Train accuracy: 0.7631645643192789
2 epoch - Train loss: 9.738964259624481 / Train accuracy: 0.8765566919160925
3 epoch - Train loss: 6.80925615131855 / Train accuracy: 0.9045173111169235
4 epoch - Train loss: 5.541881054639816 / Train accuracy: 0.9211578109989995
5 epoch - Train loss: 4.903770610690117 / Train accuracy: 0.9320273762163909
6 epoch - Train loss: 4.425178080797195 / Train accuracy: 0.937118965646495
7 epoch - Train loss: 4.080668821930885 / Train accuracy: 0.9397184382314268
8 epoch - Train loss: 3.8294343277812004 / Train accuracy: 0.9446991500647172
9 epoch - Train loss: 3.630621001124382 / Train accuracy: 0.9471358268157296
