© 2023 summer https://fastcampus.co.kr/data_online_llama

Lucy Park, "Naver Sentiment Movie Corpus", 2016, GitHub, https://github.com/e9t/nsmc

Kiyoung Kim, "Pretrained Language Models For Korean", 2020, GitHub, https://github.com/kiyoungkim1/LMkor

# load dataset

In [1]:
from datasets import load_dataset
nsmc_dataset = load_dataset('nsmc')

In [2]:
nsmc_dataset

DatasetDict({
    train: Dataset({
        features: ['id', 'document', 'label'],
        num_rows: 150000
    })
    test: Dataset({
        features: ['id', 'document', 'label'],
        num_rows: 50000
    })
})

In [3]:
nsmc_dataset['train'][1]

{'id': '3819312', 'document': '흠...포스터보고 초딩영화줄....오버연기조차 가볍지 않구나', 'label': 1}

In [4]:
nsmc_dataset['train'].features

{'id': Value(dtype='string', id=None),
 'document': Value(dtype='string', id=None),
 'label': ClassLabel(names=['negative', 'positive'], id=None)}

In [5]:
nsmc_dataset['train'].features['label'].str2int('negative')

0

In [6]:
nsmc_dataset['train'].features['label'].str2int('positive')

1

In [7]:
nsmc_df = nsmc_dataset['train'].to_pandas()
nsmc_df

index,id,document,label
0,9976970,아 더빙.. 진짜 짜증나네요 목소리,0
1,3819312,흠...포스터보고 초딩영화줄....오버연기조차 가볍지 않구나,1
2,10265843,너무재밓었다그래서보는것을추천한다,0
3,9045019,교도소 이야기구먼 ..솔직히 재미는 없다..평점 조정,0
4,6483659,사이몬페그의 익살스런 연기가 돋보였던 영화!스파이더맨에서 늙어보이기만 했던 커스틴 ...,1
...,...,...,...
149995,6222902,인간이 문제지.. 소는 뭔죄인가..,0
149996,8549745,평점이 너무 낮아서...,1
149997,9311800,이게 뭐요? 한국인은 거들먹거리고 필리핀 혼혈은 착하다?,0
149998,2376369,청춘 영화의 최고봉.방황과 우울했던 날들의 자화상,1


In [8]:
nsmc_df.groupby('label').count()

label,id,document
0,75173,75173
1,74827,74827
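The counts above show that the two classes are almost evenly split, which is why plain accuracy is a reasonable metric later on. A small sketch of that arithmetic, using only the counts from the output above (the variable names are illustrative):

```python
# Class counts copied from the groupby output above.
counts = {"negative": 75173, "positive": 74827}

total = sum(counts.values())
shares = {name: n / total for name, n in counts.items()}

# Accuracy of a classifier that always predicts the majority class.
# This is for the train split only; the test split balance is not shown here.
majority_baseline = max(shares.values())

for name, share in shares.items():
    print(f"{name}: {share:.4f}")
print(f"majority baseline: {majority_baseline:.4f}")
```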


In [9]:
nsmc_df['review_length'] = nsmc_df['document'].str.len()
nsmc_df.review_length.describe()

count    150000.000000
mean         35.203353
std          29.532097
min           0.000000
25%          16.000000
50%          27.000000
75%          42.000000
max         146.000000
Name: review_length, dtype: float64
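Note that the summary above counts characters, while `max_length` in the tokenizer counts tokens, so character length is only a rough proxy. One way to sanity-check a cut-off is to measure what fraction of examples fits under it. The helper `coverage` and the toy numbers below are hypothetical and only illustrate the idea:

```python
def coverage(lengths, cutoff):
    """Return the fraction of lengths that are less than or equal to cutoff."""
    if not lengths:
        raise ValueError("lengths must not be empty")
    return sum(1 for n in lengths if n <= cutoff) / len(lengths)

# Toy token counts, made up for illustration (not taken from NSMC).
toy_token_counts = [5, 9, 11, 14, 20, 27, 31, 32, 40, 58]

for cutoff in (16, 32, 64):
    print(cutoff, coverage(toy_token_counts, cutoff))
```

In the notebook, real token counts could be collected with something like `len(tok(text)['input_ids'])` for each review and passed to the same helper.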

# preprocess

In [10]:
from transformers import BertTokenizerFast

tok = BertTokenizerFast.from_pretrained("kykim/bert-kor-base")

In [11]:
tok.tokenize('청춘 영화의 최고봉.')

['청춘', '영화의', '최고', '##봉', '.']
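The `##` prefix in the output above marks a WordPiece that continues the previous token. As a rough illustration of that convention, the sketch below joins the pieces back into words; `merge_wordpieces` is a hypothetical helper written for this example, not part of the tokenizer API:

```python
def merge_wordpieces(tokens):
    """Join WordPiece tokens back into words by attaching '##' continuations."""
    words = []
    for token in tokens:
        if token.startswith("##") and words:
            # Continuation piece: glue it onto the previous word without the prefix.
            words[-1] += token[2:]
        else:
            words.append(token)
    return words

pieces = ['청춘', '영화의', '최고', '##봉', '.']
print(merge_wordpieces(pieces))
```

The tokenizer's own `convert_tokens_to_string` method serves a similar purpose when a single string is needed.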

In [12]:
tok('청춘 영화의 최고봉.')

{'input_ids': [2, 28546, 26683, 14317, 8461, 2016, 3], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1]}

In [13]:
tok(['청춘 영화의 최고봉.', '청춘'], padding=True)

{'input_ids': [[2, 28546, 26683, 14317, 8461, 2016, 3], [2, 28546, 3, 0, 0, 0, 0]], 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 0, 0, 0, 0]]}
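With `padding=True` the shorter sequence is padded to the length of the longest one in the batch, and the attention mask marks which positions are real tokens. The sketch below reproduces that behaviour in plain Python for the two id lists shown above. It assumes a pad id of 0, inferred from the output, and `pad_batch` is an illustrative helper rather than the library's implementation:

```python
PAD_ID = 0  # assumption: inferred from the zeros in the padded output above

def pad_batch(batch, pad_id=PAD_ID):
    """Pad each id list to the longest one and build matching attention masks."""
    longest = max(len(ids) for ids in batch)
    input_ids, attention_mask = [], []
    for ids in batch:
        n_pad = longest - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore.
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}

batch = [[2, 28546, 26683, 14317, 8461, 2016, 3], [2, 28546, 3]]
padded = pad_batch(batch)
print(padded["input_ids"][1])
print(padded["attention_mask"][1])
```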

In [14]:
def tokenizer(data):
    # Pad or truncate every review to exactly 32 tokens so all examples share one shape.
    return tok(data['document'], max_length=32, padding='max_length', truncation=True)

In [15]:
nsmc_dataset_tokenized = nsmc_dataset.map(tokenizer)


In [16]:
nsmc_dataset_tokenized['train'][0]

{'id': '9976970',
 'document': '아 더빙.. 진짜 짜증나네요 목소리',
 'label': 0,
 'input_ids': [2, 5504, 3175, 8638, 2016, 2016, 14188, 22922, 35063, 26796, 3,
               0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
               0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
               0],
 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
                    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
                    0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
                    0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
                    0]}
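The output above is the result of `max_length=32`, `padding='max_length'` and `truncation=True` from cell 14: short reviews are padded up to 32 ids and long ones are cut down to 32. The sketch below imitates that in plain Python. The special-token ids (2 at the start, 3 at the end, 0 for padding) are inferred from the outputs above, and `encode_fixed` is a hypothetical helper, not the tokenizer's actual code:

```python
CLS_ID, SEP_ID, PAD_ID = 2, 3, 0  # assumption: inferred from the outputs above

def encode_fixed(content_ids, max_length=32):
    """Truncate content ids, wrap them in [CLS]/[SEP], and pad to max_length."""
    kept = content_ids[: max_length - 2]  # leave room for the two special tokens
    ids = [CLS_ID] + kept + [SEP_ID]
    n_pad = max_length - len(ids)
    return {
        "input_ids": ids + [PAD_ID] * n_pad,
        "attention_mask": [1] * len(ids) + [0] * n_pad,
    }

# Content ids of the first training example, without the special tokens.
content = [5504, 3175, 8638, 2016, 2016, 14188, 22922, 35063, 26796]
short = encode_fixed(content)

# Hypothetical over-long input: 100 made-up ids that must be truncated.
long = encode_fixed(list(range(100, 200)))

print(len(short["input_ids"]), sum(short["attention_mask"]))
print(len(long["input_ids"]), sum(long["attention_mask"]))
```

Both encodings end up with the same length, which is what lets the examples be stacked into fixed-size batches.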

# load model

In [17]:
import torch
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

device

device(type='cuda')

In [18]:
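from transformers import BertModel

# Bare BERT encoder without a classification head, so num_labels has no effect here.
# The classifier that is actually trained is created in cell 22.
model = BertModel.from_pretrained("kykim/bert-kor-base")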

In [19]:
num_train_epochs = 4
# Note: 2e-7 is far below the 2e-5 to 5e-5 range commonly used for BERT fine-tuning.
learning_rate = 2e-7
batch_size = 128

# configure trainer

In [20]:
from transformers import TrainingArguments

# 150000 // 128 = 1171 steps, so the training loss is logged about once per epoch.
logging_steps = len(nsmc_dataset['train']) // batch_size
output_dir = 'trainer_test'

training_args = TrainingArguments(output_dir=output_dir,
                                 num_train_epochs=num_train_epochs,
                                 learning_rate = learning_rate,
                                 per_device_train_batch_size=batch_size,
                                 per_device_eval_batch_size=batch_size,
                                 evaluation_strategy='epoch',
                                 logging_steps=logging_steps,
                                 fp16=True,
                                 push_to_hub=False)

In [21]:
from sklearn.metrics import precision_recall_fscore_support, accuracy_score

def compute_metrics(pred):
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)
    precision, recall, f1, _ = precision_recall_fscore_support(labels, preds, average='binary')
    acc = accuracy_score(labels, preds)
    return {
        'accuracy': acc,
        'f1': f1,
        'precision': precision,
        'recall': recall
    }
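To make clear what `compute_metrics` reports, the sketch below recomputes the same four numbers from raw counts in plain Python, treating label 1 (positive) as the positive class, which is what `average='binary'` does by default. `binary_prf` and the toy data are illustrative only; the notebook itself relies on scikit-learn:

```python
def binary_prf(labels, preds, positive=1):
    """Compute accuracy, precision, recall and F1 for one positive class."""
    if len(labels) != len(preds) or not labels:
        raise ValueError("labels and preds must be non-empty and equally long")
    pairs = list(zip(labels, preds))
    tp = sum(1 for y, p in pairs if y == positive and p == positive)
    fp = sum(1 for y, p in pairs if y != positive and p == positive)
    fn = sum(1 for y, p in pairs if y == positive and p != positive)
    correct = sum(1 for y, p in pairs if y == p)
    # Fall back to 0.0 when a denominator is zero.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": correct / len(pairs), "f1": f1,
            "precision": precision, "recall": recall}

# Toy labels and predictions, made up for illustration.
labels = [1, 0, 1, 1, 0, 0]
preds = [1, 0, 0, 1, 1, 0]
metrics = binary_prf(labels, preds)
print(metrics)
```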

In [22]:
from transformers import Trainer
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("kykim/bert-kor-base", num_labels=2)

trainer = Trainer(model=model,
                 args=training_args,
                 compute_metrics=compute_metrics,
                 train_dataset=nsmc_dataset_tokenized['train'],
                 eval_dataset=nsmc_dataset_tokenized['test'],
                 tokenizer=tok)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at kykim/bert-kor-base and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


# train

In [23]:
trainer.train()

You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Epoch,Training Loss,Validation Loss,Accuracy,F1,Precision,Recall
1,0.6019,0.424389,0.8297,0.828772,0.839184,0.818615
2,0.4032,0.371762,0.84332,0.843044,0.850439,0.835776
3,0.3752,0.359944,0.84814,0.848712,0.851375,0.846065
4,0.366,0.356526,0.84986,0.850199,0.85417,0.846264


TrainOutput(global_step=4688, training_loss=0.4365487623794494, metrics={'train_runtime': 1182.2066, 'train_samples_per_second': 507.525, 'train_steps_per_second': 3.965, 'total_flos': 9866664576000000.0, 'train_loss': 0.4365487623794494, 'epoch': 4.0})
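The `global_step=4688` reported above follows directly from the dataset size, batch size and epoch count. A short sketch of the arithmetic, assuming a single device and no gradient accumulation (the defaults used in this notebook):

```python
import math

num_examples = 150_000   # rows in the train split
batch_size = 128
num_train_epochs = 4

# The last, smaller batch of each epoch still counts as one optimizer step.
steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_train_epochs

# Floor division, as used for logging_steps in the notebook.
logging_steps = num_examples // batch_size

print(steps_per_epoch, total_steps, logging_steps)
```

Because `logging_steps` uses floor division, it comes out one step short of a full epoch, so the training loss is logged roughly once per epoch.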

In [25]:
model.save_pretrained('sentiment_kobert_0904')
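
Note that `model.save_pretrained` writes only the model weights and config; the tokenizer has to be saved separately, for example with `tok.save_pretrained`. Once the saved classifier is reloaded, it returns two logits per review, which still need to be turned into a label. Below is a minimal sketch of that last step in plain Python. The logits are made up, and the label order is taken from the `ClassLabel` feature shown earlier:

```python
import math

LABELS = ['negative', 'positive']  # order taken from the ClassLabel feature

def softmax(logits):
    """Convert raw logits to probabilities, subtracting the max for stability."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predict_label(logits):
    """Return the most likely label name and its probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return LABELS[best], probs[best]

# Hypothetical logits for one review, not produced by the trained model.
label, prob = predict_label([-0.4, 1.1])
print(label, round(prob, 3))
```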