
In [None]:
!pip install transformers
!pip install datasets

In [None]:
import transformers
from torch import nn
from tqdm import tqdm

import torch

# Huggingface's Transformers

Hugging Face first drew attention by releasing the first PyTorch implementation of BERT. It now implements and publishes a wide range of Transformer-based models.

### Main classes

AutoConfig loads the configuration of many different models from a simple string tag. Each config holds information about the model's architecture and task (architecture type, number of layers, hidden unit size, hyperparameters).

In [None]:
from transformers import AutoConfig

config = AutoConfig.from_pretrained('bert-base-uncased')
config

In [None]:
gpt_config = AutoConfig.from_pretrained('gpt2')

In [None]:
gpt_config

In [None]:
print(config.vocab_size)

In [None]:
config_dict = config.to_dict()
config_dict
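
To see how these fields determine the size of the network, the following sketch estimates the parameter count of the base BERT encoder from a handful of configuration values. The dictionary is hard-coded with the values that `bert-base-uncased` reports (an assumption made so the snippet runs without a download), and `count_bert_parameters` is a hypothetical helper, not part of transformers.

```python
# Values copied by hand from the bert-base-uncased config (assumed here so the
# snippet needs no download); in the notebook they come from config.to_dict().
bert_base = {
    "vocab_size": 30522,
    "hidden_size": 768,
    "num_hidden_layers": 12,
    "intermediate_size": 3072,
    "max_position_embeddings": 512,
    "type_vocab_size": 2,
}


def count_bert_parameters(cfg):
    """Estimate the parameter count of a BERT encoder with a pooler (hypothetical helper)."""
    h = cfg["hidden_size"]
    i = cfg["intermediate_size"]
    layer_norm = 2 * h  # weight + bias

    # Word, position and token-type embeddings, followed by one LayerNorm.
    embeddings = (
        cfg["vocab_size"] + cfg["max_position_embeddings"] + cfg["type_vocab_size"]
    ) * h + layer_norm

    # Self-attention: query, key, value and output projections (weight + bias each).
    attention = 4 * (h * h + h) + layer_norm
    # Feed-forward block: hidden -> intermediate -> hidden.
    feed_forward = (h * i + i) + (i * h + h) + layer_norm

    pooler = h * h + h
    return embeddings + cfg["num_hidden_layers"] * (attention + feed_forward) + pooler


n_params = count_bert_parameters(bert_base)
print(f"{n_params:,} parameters")
```

The result is roughly 110M parameters, the figure usually quoted for BERT-base. Task-specific heads add a small number of parameters on top of this.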

In [None]:
from transformers import BertConfig

# Load the pretrained configuration for a BERT-type model.
bertconfig = BertConfig.from_pretrained('bert-base-uncased')

In [None]:
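# Loading a GPT-2 checkpoint through BertConfig: transformers warns that the model types do not match.
bert_in_gpt2_config = BertConfig.from_pretrained('gpt2')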

In [None]:
from transformers import BertForMaskedLM, BertForQuestionAnswering, BertForSequenceClassification, BertForTokenClassification, BertForMultipleChoice, BertModel 

In [None]:
from transformers import AutoModel, AutoTokenizer, AutoConfig

In [None]:
bertmodel = AutoModel.from_pretrained('bert-base-uncased')

In [None]:
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
# Named `inputs` to avoid shadowing the built-in input().
inputs = tokenizer('hi, my name is Taehee')

In [None]:
inputs

In [None]:
bert_qa = BertForQuestionAnswering.from_pretrained('bert-base-uncased')

In [None]:
bert_qa

In [None]:
bert_qa.state_dict()

In [None]:
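# AutoModel would load only the encoder and drop the fine-tuned QA head, so use the QA class.
bert_qa = BertForQuestionAnswering.from_pretrained('deepset/bert-base-cased-squad2')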

In [None]:
bert_token_cls = BertForTokenClassification.from_pretrained('ckiplab/bert-base-chinese-ner')

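The optimization module provides widely used optimizers together with schedulers that adjust the learning rate. Its AdamW implementation is deprecated in favor of `torch.optim.AdamW`, which is imported here instead.

In [None]:
from torch.optim import AdamW
from transformers import get_linear_schedule_with_warmup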

In [None]:
bert_maskedlm = BertForMaskedLM.from_pretrained('bert-base-uncased')

parameters = bert_maskedlm.parameters()
# parameters = bert_maskedlm.named_parameters()
optimizer = AdamW(parameters, lr=5e-5)
total_training_step = 100
# Warm up for the first 10% of the steps, then decay linearly to zero.
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=total_training_step // 10,
    num_training_steps=total_training_step,
)

# In a real training loop, compute a loss and call loss.backward() before stepping.
# loss.backward()
optimizer.step()
scheduler.step()
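
As a rough picture of what the scheduler does to the learning rate, the sketch below reimplements the linear warmup and linear decay multiplier in plain Python. `linear_warmup_factor` is a hypothetical name; the formula is meant to mirror the schedule that `get_linear_schedule_with_warmup` applies, but treat it as an illustration rather than the library's code.

```python
def linear_warmup_factor(step, num_warmup_steps, num_training_steps):
    """Multiplier applied to the base learning rate at a given step (illustrative)."""
    if step < num_warmup_steps:
        # Warmup: rise linearly from 0 to 1.
        return step / max(1, num_warmup_steps)
    # Decay: fall linearly from 1 to 0 over the remaining steps.
    remaining = num_training_steps - step
    return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))


base_lr = 5e-5
total_training_step = 100
warmup = total_training_step // 10

# Print the learning rate at a few points along the schedule.
for step in (0, 5, 10, 55, 100):
    lr = base_lr * linear_warmup_factor(step, warmup, total_training_step)
    print(f"step {step:3d}: lr = {lr:.2e}")
```

The learning rate peaks at the end of warmup (step 10 here) and reaches zero at the final step, which is why `num_training_steps` should match the real number of optimizer updates.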

### Training a Movie Review Classifier with the BertForSequenceClassification Class

Load the config, tokenizer, and model of the pre-trained BERT.

In [None]:
from transformers import AutoConfig, AutoTokenizer, AutoModelForSequenceClassification

In [None]:
config = AutoConfig.from_pretrained('bert-base-uncased')
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
model = AutoModelForSequenceClassification.from_pretrained('bert-base-uncased')

Load the data.

In [None]:
from datasets import load_dataset
raw_datasets = load_dataset('imdb')

In [None]:
raw_datasets

In [None]:
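from huggingface_hub import list_datasets

# datasets.list_datasets() is deprecated; the Hub client lists datasets instead.
# Print the ids of the first 10 datasets rather than the full catalog.
print([d.id for d in list_datasets(limit=10)])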

In [None]:
tokenizer.model_max_length = 512

In [None]:
def tokenize_function(example):
    return tokenizer(example['text'], padding='max_length', truncation=True)

tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)
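
The effect of `padding='max_length'` and `truncation=True` can be pictured with a small pure-Python sketch. `pad_or_truncate` is a hypothetical helper, the token ids are made up, and the pad id of 0 is an assumption. The real tokenizer also takes care of special tokens such as `[CLS]` and `[SEP]` when it truncates, which this sketch ignores.

```python
def pad_or_truncate(input_ids, max_length, pad_id=0):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    ids = list(input_ids[:max_length])   # truncation: drop tokens past max_length
    mask = [1] * len(ids)                # 1 marks a real token
    n_pad = max_length - len(ids)
    # Padding: fill up to max_length; 0 in the mask marks padding positions.
    return ids + [pad_id] * n_pad, mask + [0] * n_pad


# Illustrative token ids only; they are not produced by a real tokenizer here.
short_ids, short_mask = pad_or_truncate([101, 7632, 102], max_length=5)
long_ids, long_mask = pad_or_truncate([101, 1, 2, 3, 4, 5, 102], max_length=5)
print(short_ids, short_mask)
print(long_ids, long_mask)
```

Because every example ends up the same length, batches can be stacked into tensors directly, at the cost of spending computation on padding for short reviews.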

In [None]:
small_train_dataset = tokenized_datasets['train'].shuffle(seed=42).select(range(1000))
small_eval_dataset = tokenized_datasets['test'].shuffle(seed=42).select(range(1000))
full_train_dataset = tokenized_datasets['train']
full_eval_dataset = tokenized_datasets['test']

### Case 1: Training a movie review classifier with the Transformers library

In [None]:
from transformers import TrainingArguments, Trainer

In [None]:
training_args = TrainingArguments("test_trainer")
# To train and evaluate on the full dataset, use full_train_dataset and full_eval_dataset instead.
trainer = Trainer(model=model, args=training_args, train_dataset=small_train_dataset, eval_dataset=small_eval_dataset)

In [None]:
trainer.train()
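
Without further configuration, the `Trainer` above reports only the loss at evaluation time. One common way to get accuracy is to pass a function through the `compute_metrics` argument of `Trainer`. The sketch below assumes NumPy is available and that the function receives a `(logits, labels)` pair; the function name and the sample values are made up for illustration.

```python
import numpy as np


def compute_metrics(eval_pred):
    """Accuracy from a (logits, labels) pair, as handed over at evaluation time."""
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)  # most likely class per example
    return {"accuracy": float((predictions == labels).mean())}


# Stand-alone check with made-up logits for three reviews
# (IMDb labels: 0 = negative, 1 = positive).
fake_logits = np.array([[2.0, -1.0], [0.3, 0.9], [1.5, 0.2]])
fake_labels = np.array([0, 1, 1])
metrics = compute_metrics((fake_logits, fake_labels))
print(metrics)
```

It would be wired in as `Trainer(..., compute_metrics=compute_metrics)`, after which `trainer.evaluate()` reports the accuracy alongside the loss.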

In [None]:
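# Model training: load an already fine-tuned sentiment checkpoint and build a new Trainer for it.
# Note: this checkpoint is a Spanish (BETO) model with its own tokenizer, so data
# tokenized with bert-base-uncased above does not match its vocabulary.
model = BertForSequenceClassification.from_pretrained('finiteautomata/beto-sentiment-analysis')
trainer = Trainer(model=model, args=training_args, train_dataset=small_train_dataset, eval_dataset=small_eval_dataset)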