
# Pytorch + HuggingFace
## KoElectra Model
박장원님의 KoElectra-small 사용<br>
https://monologg.kr/2020/05/02/koelectra-part1/<br>
https://github.com/monologg/KoELECTRA

## Dataset
네이버 영화 리뷰 데이터셋<br>
https://github.com/e9t/nsmc

## References
- https://huggingface.co/transformers/training.html
- https://tutorials.pytorch.kr/beginner/data_loading_tutorial.html
- https://tutorials.pytorch.kr/beginner/blitz/cifar10_tutorial.html
- https://wikidocs.net/44249

## 주의사항
꼭 GPU로 해주세요 - 1epoch 당 약 20분 소요

In [1]:
# HuggingFace transformers 설치 및 NSMC 데이터셋 다운로드
!pip install transformers
!wget https://raw.githubusercontent.com/e9t/nsmc/master/ratings_test.txt
!wget https://raw.githubusercontent.com/e9t/nsmc/master/ratings_train.txt

Collecting transformers
  Downloading transformers-4.33.2-py3-none-any.whl (7.6 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m7.6/7.6 MB[0m [31m40.2 MB/s[0m eta [36m0:00:00[0m
Collecting huggingface-hub<1.0,>=0.15.1 (from transformers)
  Downloading huggingface_hub-0.17.2-py3-none-any.whl (294 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m294.9/294.9 kB[0m [31m3.2 MB/s[0m eta [36m0:00:00[0m
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1 (from transformers)
  Downloading tokenizers-0.13.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.8 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m7.8/7.8 MB[0m [31m57.8 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting safetensors>=0.3.1 (from transformers)
  Downloading safetensors-0.3.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m1.3/1.3 MB[0m [31m43.5 MB/s[0m eta [36m0:00:0

In [2]:
!head ratings_train.txt
!head ratings_test.txt

id	document	label
9976970	아 더빙.. 진짜 짜증나네요 목소리	0
3819312	흠...포스터보고 초딩영화줄....오버연기조차 가볍지 않구나	1
10265843	너무재밓었다그래서보는것을추천한다	0
9045019	교도소 이야기구먼 ..솔직히 재미는 없다..평점 조정	0
6483659	사이몬페그의 익살스런 연기가 돋보였던 영화!스파이더맨에서 늙어보이기만 했던 커스틴 던스트가 너무나도 이뻐보였다	1
5403919	막 걸음마 뗀 3세부터 초등학교 1학년생인 8살용영화.ㅋㅋㅋ...별반개도 아까움.	0
7797314	원작의 긴장감을 제대로 살려내지못했다.	0
9443947	별 반개도 아깝다 욕나온다 이응경 길용우 연기생활이몇년인지..정말 발로해도 그것보단 낫겟다 납치.감금만반복반복..이드라마는 가족도없다 연기못하는사람만모엿네	0
7156791	액션이 없는데도 재미 있는 몇안되는 영화	1
id	document	label
6270596	굳 ㅋ	1
9274899	GDNTOPCLASSINTHECLUB	0
8544678	뭐야 이 평점들은.... 나쁘진 않지만 10점 짜리는 더더욱 아니잖아	0
6825595	지루하지는 않은데 완전 막장임... 돈주고 보기에는....	0
6723715	3D만 아니었어도 별 다섯 개 줬을텐데.. 왜 3D로 나와서 제 심기를 불편하게 하죠??	0
7898805	음악이 주가 된, 최고의 음악영화	1
6315043	진정한 쓰레기	0
6097171	마치 미국애니에서 튀어나온듯한 창의력없는 로봇디자인부터가,고개를 젖게한다	0
8932678	갈수록 개판되가는 중국영화 유치하고 내용없음 폼잡다 끝남 말도안되는 무기에 유치한cg남무 아 그립다 동사서독같은 영화가 이건 3류아류작이다	0


In [3]:
import pandas as pd
import torch
from torch.nn import functional as F
from torch.utils.data import DataLoader, Dataset
from transformers import AutoTokenizer, ElectraForSequenceClassification, AdamW
from tqdm.notebook import tqdm

In [4]:
# GPU 사용
device = torch.device("cuda")

# Dataset 만들어서 불러오기

In [5]:
class NSMCDataset(Dataset):

  def __init__(self, csv_file):
    # 일부 값중에 NaN이 있음...
    self.dataset = pd.read_csv(csv_file, sep='\t').dropna(axis=0)
    # 중복제거
    self.dataset.drop_duplicates(subset=['document'], inplace=True)
    self.tokenizer = AutoTokenizer.from_pretrained("monologg/koelectra-small-v2-discriminator")

    print(self.dataset.describe())

  def __len__(self):
    return len(self.dataset)

  def __getitem__(self, idx):
    row = self.dataset.iloc[idx, 1:3].values
    text = row[0]
    y = row[1]

    inputs = self.tokenizer(
        text,
        return_tensors='pt',
        truncation=True,
        max_length=256,
        pad_to_max_length=True,
        add_special_tokens=True
        )

    input_ids = inputs['input_ids'][0]
    attention_mask = inputs['attention_mask'][0]

    return input_ids, attention_mask, y

In [6]:
train_dataset = NSMCDataset("ratings_train.txt")
test_dataset = NSMCDataset("ratings_test.txt")

Downloading (…)okenizer_config.json:   0%|          | 0.00/65.0 [00:00<?, ?B/s]

Downloading (…)lve/main/config.json:   0%|          | 0.00/486 [00:00<?, ?B/s]

Downloading (…)solve/main/vocab.txt:   0%|          | 0.00/255k [00:00<?, ?B/s]

                 id          label
count  1.461820e+05  146182.000000
mean   6.779186e+06       0.498283
std    2.919223e+06       0.499999
min    3.300000e+01       0.000000
25%    4.814832e+06       0.000000
50%    7.581160e+06       0.000000
75%    9.274760e+06       1.000000
max    1.027815e+07       1.000000
                 id         label
count  4.915700e+04  49157.000000
mean   6.752945e+06      0.502695
std    2.937158e+06      0.499998
min    6.010000e+02      0.000000
25%    4.777143e+06      0.000000
50%    7.565415e+06      1.000000
75%    9.260204e+06      1.000000
max    1.027809e+07      1.000000


# Create Model

In [7]:
model = ElectraForSequenceClassification.from_pretrained("monologg/koelectra-base-v3-discriminator").to(device)

# 한번 실행해보기
# text, attention_mask, y = train_dataset[0]
# model(text.unsqueeze(0).to(device), attention_mask=attention_mask.unsqueeze(0).to(device))

Downloading (…)lve/main/config.json:   0%|          | 0.00/467 [00:00<?, ?B/s]

Downloading pytorch_model.bin:   0%|          | 0.00/452M [00:00<?, ?B/s]

Some weights of ElectraForSequenceClassification were not initialized from the model checkpoint at monologg/koelectra-base-v3-discriminator and are newly initialized: ['classifier.out_proj.bias', 'classifier.out_proj.weight', 'classifier.dense.weight', 'classifier.dense.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [8]:
model.load_state_dict(torch.load("model.pt"))

FileNotFoundError: ignored

In [9]:
# 모델 레이어 보기
model

ElectraForSequenceClassification(
  (electra): ElectraModel(
    (embeddings): ElectraEmbeddings(
      (word_embeddings): Embedding(35000, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): ElectraEncoder(
      (layer): ModuleList(
        (0-11): 12 x ElectraLayer(
          (attention): ElectraAttention(
            (self): ElectraSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): ElectraSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): L

# Learn

In [10]:
epochs = 5
batch_size = 16

In [11]:
optimizer = AdamW(model.parameters(), lr=5e-6)
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=16, shuffle=True)



In [None]:
losses = []
accuracies = []

for i in range(epochs):
  total_loss = 0.0
  correct = 0
  total = 0
  batches = 0

  model.train()

  for input_ids_batch, attention_masks_batch, y_batch in tqdm(train_loader):
    optimizer.zero_grad()
    y_batch = y_batch.to(device)
    y_pred = model(input_ids_batch.to(device),
                   attention_mask=attention_masks_batch.to(device))[0]
    loss = F.cross_entropy(y_pred, y_batch)
    loss.backward()
    optimizer.step()

    total_loss += loss.item()

    _, predicted = torch.max(y_pred, 1)
    correct += (predicted == y_batch).sum()
    total += len(y_batch)

    batches += 1
    if batches % 100 == 0:
      print("Batch Loss:", total_loss, "Accuracy:", correct.float() / total)

  losses.append(total_loss)
  accuracies.append(correct.float() / total)
  print("Train Loss:", total_loss, "Accuracy:", correct.float() / total)

  0%|          | 0/9137 [00:00<?, ?it/s]



Batch Loss: 69.25467556715012 Accuracy: tensor(0.5250, device='cuda:0')
Batch Loss: 138.31639498472214 Accuracy: tensor(0.5219, device='cuda:0')
Batch Loss: 207.08487117290497 Accuracy: tensor(0.5321, device='cuda:0')
Batch Loss: 275.1113106608391 Accuracy: tensor(0.5445, device='cuda:0')
Batch Loss: 341.3435778617859 Accuracy: tensor(0.5592, device='cuda:0')
Batch Loss: 404.26704224944115 Accuracy: tensor(0.5750, device='cuda:0')
Batch Loss: 466.4220041036606 Accuracy: tensor(0.5875, device='cuda:0')
Batch Loss: 528.1753177046776 Accuracy: tensor(0.5974, device='cuda:0')
Batch Loss: 587.5966936945915 Accuracy: tensor(0.6072, device='cuda:0')
Batch Loss: 646.5270308852196 Accuracy: tensor(0.6157, device='cuda:0')
Batch Loss: 703.8973246514797 Accuracy: tensor(0.6241, device='cuda:0')
Batch Loss: 762.0893881618977 Accuracy: tensor(0.6294, device='cuda:0')
Batch Loss: 818.7401643097401 Accuracy: tensor(0.6360, device='cuda:0')
Batch Loss: 876.427414149046 Accuracy: tensor(0.6409, device=

  0%|          | 0/9137 [00:00<?, ?it/s]

Batch Loss: 39.255688942968845 Accuracy: tensor(0.8219, device='cuda:0')
Batch Loss: 78.93941567093134 Accuracy: tensor(0.8184, device='cuda:0')
Batch Loss: 117.5753080919385 Accuracy: tensor(0.8200, device='cuda:0')
Batch Loss: 156.05260226875544 Accuracy: tensor(0.8223, device='cuda:0')
Batch Loss: 194.9609649553895 Accuracy: tensor(0.8241, device='cuda:0')
Batch Loss: 233.8115511611104 Accuracy: tensor(0.8229, device='cuda:0')
Batch Loss: 273.13768834620714 Accuracy: tensor(0.8220, device='cuda:0')
Batch Loss: 312.80857663601637 Accuracy: tensor(0.8218, device='cuda:0')
Batch Loss: 351.176371358335 Accuracy: tensor(0.8228, device='cuda:0')
Batch Loss: 388.92075615376234 Accuracy: tensor(0.8243, device='cuda:0')
Batch Loss: 431.89827797561884 Accuracy: tensor(0.8219, device='cuda:0')
Batch Loss: 472.76609898358583 Accuracy: tensor(0.8209, device='cuda:0')
Batch Loss: 510.9513918235898 Accuracy: tensor(0.8207, device='cuda:0')
Batch Loss: 552.2571426555514 Accuracy: tensor(0.8196, dev

  0%|          | 0/9137 [00:00<?, ?it/s]

Batch Loss: 34.192659467458725 Accuracy: tensor(0.8500, device='cuda:0')
Batch Loss: 69.10339975357056 Accuracy: tensor(0.8487, device='cuda:0')
Batch Loss: 101.56368239969015 Accuracy: tensor(0.8502, device='cuda:0')
Batch Loss: 135.34768897294998 Accuracy: tensor(0.8489, device='cuda:0')
Batch Loss: 168.38832645118237 Accuracy: tensor(0.8496, device='cuda:0')
Batch Loss: 201.33621939271688 Accuracy: tensor(0.8507, device='cuda:0')
Batch Loss: 234.0136626958847 Accuracy: tensor(0.8514, device='cuda:0')
Batch Loss: 268.0524907410145 Accuracy: tensor(0.8513, device='cuda:0')
Batch Loss: 303.6666154488921 Accuracy: tensor(0.8504, device='cuda:0')
Batch Loss: 338.3616046383977 Accuracy: tensor(0.8496, device='cuda:0')
Batch Loss: 370.6627522855997 Accuracy: tensor(0.8500, device='cuda:0')
Batch Loss: 406.1983975470066 Accuracy: tensor(0.8491, device='cuda:0')
Batch Loss: 438.9228173866868 Accuracy: tensor(0.8492, device='cuda:0')
Batch Loss: 472.594027929008 Accuracy: tensor(0.8496, devic

In [None]:
losses, accuracies

테스트 데이터셋 정확도 확인하기

In [None]:
model.eval()

test_correct = 0
test_total = 0

for input_ids_batch, attention_masks_batch, y_batch in tqdm(test_loader):
  y_batch = y_batch.to(device)
  y_pred = model(input_ids_batch.to(device), attention_mask=attention_masks_batch.to(device))[0]
  _, predicted = torch.max(y_pred, 1)
  test_correct += (predicted == y_batch).sum()
  test_total += len(y_batch)

print("Accuracy:", test_correct.float() / test_total)

In [None]:
# 모델 저장하기
torch.save(model.state_dict(), "model.pt")