<a href="https://colab.research.google.com/github/Minwoo-study/Naver_map_rating/blob/main/KBH_KcELECTRA_%EB%AA%A8%EB%8D%B8_copy_test.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


# PyTorch + HuggingFace
## KoELECTRA Model
Uses Jangwon Park's KoELECTRA-small (note: the code cells below load `beomi/KcELECTRA-base`)<br>
https://monologg.kr/2020/05/02/koelectra-part1/<br>
https://github.com/monologg/KoELECTRA

## Dataset
Naver movie review dataset (NSMC)<br>
https://github.com/e9t/nsmc

## References
- https://huggingface.co/transformers/training.html
- https://tutorials.pytorch.kr/beginner/data_loading_tutorial.html
- https://tutorials.pytorch.kr/beginner/blitz/cifar10_tutorial.html
- https://wikidocs.net/44249

## Notes
Be sure to run this on a GPU: one epoch takes about 20 minutes.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
# Install HuggingFace transformers and download the NSMC dataset
!pip install transformers
#!wget https://raw.githubusercontent.com/e9t/nsmc/master/ratings_test.txt #download data
#!wget https://raw.githubusercontent.com/e9t/nsmc/master/ratings_train.txt

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.20.1-py3-none-any.whl (4.4 MB)
Collecting tokenizers!=0.11.3,<0.13,>=0.11.1
  Downloading tokenizers-0.12.1-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (6.6 MB)
Collecting pyyaml>=5.1
  Downloading PyYAML-6.0-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (596 kB)
Collecting huggingface-hub<1.0,>=0.1.0
  Downloading huggingface_hub-0.8.1-py3-none-any.whl (101 kB)
Installing collected packages: pyyaml, tokenizers, huggingface-hub, transformers
  Attempting uninstall: pyyaml
    Found existing installation: PyYAML 3.13
    Uninstal

In [None]:
import pandas as pd
train = '/content/drive/MyDrive/kaggle/input/train_set.csv'
test = '/content/drive/MyDrive/kaggle/input/test_set.csv'
test_pd = pd.read_csv('/content/drive/MyDrive/kaggle/input/test_set.csv')

In [None]:
# `train` is a file path, so read it into a DataFrame before previewing
print(pd.read_csv(train).head())
print(test_pd.head())

                                                  리뷰   평점  라벨링
0                                                좋아요  5.0    1
1  남자친구랑 황리단길 놀러 왔다가 친구추천으로 왔는데\n홀여직원분도 너무 친절히 대해...  5.0    1
2  육즙이살아있어 정말 맛있었어요\n겉바속촉!\n꿔바로우도 간이 되어있어 입맛에 딱 맛...  5.0    1
3                             국물이  냄새도없고 담백하고  맛있었어요  3.5    0
4                             뷰 맛집~ \n근데 파스타는 별로에요.   3.0    0
                                       리뷰   평점  라벨링
0           육수와 비빔양념 너무 맛있네요~이모님들도 친절하구요~  5.0    1
1      스탭들 대박친절해요\n매운뼈구이 강력추천\n매콤하니 핵존맛탱~  5.0    1
2                                     좋아요  5.0    1
3  맛있게 잘먹고 갑니다. 아기가 있어 염도조절도 물어봐주고 고맙습니다.  5.0    1
4                                     귀엽다  3.0    0


In [None]:
import pandas as pd
import torch
from torch.nn import functional as F
from torch.utils.data import DataLoader, Dataset
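from transformers import AutoTokenizer, ElectraForSequenceClassification, TextClassificationPipeline
# torch.optim.AdamW replaces the deprecated transformers.AdamW (its default weight_decay is 0.01 rather than 0.0)
from torch.optim import AdamW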
from tqdm.notebook import tqdm

In [None]:
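# Use the GPU when one is available (training on the CPU is very slow)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")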

# Build and Load the Dataset

In [None]:
class NSMCDataset(Dataset):

  def __init__(self, csv_file):
    # Some rows contain NaN values, so drop them
    self.dataset = pd.read_csv(csv_file).dropna(axis=0)
    # Remove duplicate reviews
    self.dataset.drop_duplicates(subset=['리뷰'], inplace=True)

    self.tokenizer = AutoTokenizer.from_pretrained("beomi/KcELECTRA-base") #monologg/koelectra-small-v2-discriminator

    print(self.dataset.describe())

  def __len__(self):
    return len(self.dataset)

  def __getitem__(self, idx):
    row = self.dataset.iloc[idx, 0:3].values # columns 0-2 of row idx: 리뷰 (review), 평점 (rating), 라벨링 (label)
    text = row[0]
    y = int(row[2]) # use the binary label, not the rating in row[1]

    inputs = self.tokenizer(
        text,
        return_tensors='pt', # return PyTorch tensors
        truncation=True, # truncate long sequences: keep the first 256 tokens and cut the rest
        max_length=256,
        padding='max_length', # pad every sequence to max_length (replaces the deprecated pad_to_max_length=True)
        add_special_tokens=True # automatically add the special tokens ([CLS], [SEP]) around the sentence
        )

    input_ids = inputs['input_ids'][0] # token ids fed to the model
    attention_mask = inputs['attention_mask'][0] # 1 for real tokens, 0 for padding

    return input_ids, attention_mask, y
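The tokenizer call in `__getitem__` truncates long reviews, pads short ones to `max_length`, and returns an attention mask. The following is a minimal pure-Python sketch of that idea, not the tokenizer's actual implementation: the helper `pad_or_truncate`, the token ids, and `pad_id=0` are made up for illustration, and unlike the real tokenizer it does not keep a trailing `[SEP]` token when truncating.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Cut the sequence to max_length, then right-pad it with pad_id.

    Returns the fixed-length ids and a mask with 1 for real tokens, 0 for padding.
    """
    ids = list(token_ids[:max_length])
    mask = [1] * len(ids)
    n_pad = max_length - len(ids)
    return ids + [pad_id] * n_pad, mask + [0] * n_pad


# A short (hypothetical) sequence gets padded up to the target length
ids, mask = pad_or_truncate([2, 8, 9, 3], max_length=6)
print(ids)   # [2, 8, 9, 3, 0, 0]
print(mask)  # [1, 1, 1, 1, 0, 0]

# A long sequence is cut down, so its mask contains no zeros
long_ids, long_mask = pad_or_truncate(list(range(10)), max_length=4)
print(long_ids, long_mask)
```

Because every example ends up with the same length, the default `DataLoader` collation can stack them into a single batch tensor.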

In [None]:
train_dataset = NSMCDataset(train)
test_dataset = NSMCDataset(test)

                 평점           라벨링
count  86043.000000  86043.000000
mean       4.484909      0.872947
std        0.945344      0.333034
min        0.500000      0.000000
25%        4.000000      1.000000
50%        5.000000      1.000000
75%        5.000000      1.000000
max        5.000000      1.000000
                 평점           라벨링
count  22951.000000  22951.000000
mean       4.484619      0.871378
std        0.939953      0.334788
min        0.500000      0.000000
25%        4.000000      1.000000
50%        5.000000      1.000000
75%        5.000000      1.000000
max        5.000000      1.000000
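The `describe()` output above reports a mean label of about 0.873 on the training split and 0.871 on the test split, so roughly 87% of the reviews are positive. As a rough sketch using only those printed numbers (the helper name `majority_baseline` is made up for illustration), a classifier that always predicts label 1 would already reach that accuracy, which is the bar a trained model should clear.

```python
def majority_baseline(positive_rate):
    """Accuracy of always predicting the more frequent class of a binary label."""
    return max(positive_rate, 1.0 - positive_rate)


# Mean of the 라벨링 (label) column, copied from the describe() output
train_positive_rate = 0.872947
test_positive_rate = 0.871378

print(f"train baseline: {majority_baseline(train_positive_rate):.3f}")
print(f"test baseline:  {majority_baseline(test_positive_rate):.3f}")
```

With labels this imbalanced, per-class precision and recall are more informative than accuracy alone.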


# Create Model

In [None]:
model = ElectraForSequenceClassification.from_pretrained("beomi/KcELECTRA-base").to(device)
tokenizer = AutoTokenizer.from_pretrained("beomi/KcELECTRA-base")
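sentiment_classifier = TextClassificationPipeline(tokenizer=tokenizer, model=model, device=0 if torch.cuda.is_available() else -1) # run the pipeline on the same device as the model

# Quick sanity check on a single example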
# text, attention_mask, y = train_dataset[0]
# model(text.unsqueeze(0).to(device), attention_mask=attention_mask.unsqueeze(0).to(device))

Some weights of the model checkpoint at beomi/KcELECTRA-base were not used when initializing ElectraForSequenceClassification: ['discriminator_predictions.dense.weight', 'discriminator_predictions.dense_prediction.bias', 'discriminator_predictions.dense.bias', 'discriminator_predictions.dense_prediction.weight']
- This IS expected if you are initializing ElectraForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing ElectraForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of ElectraForSequenceClassification were not initialized from the model checkpoint at beomi/KcELECTRA-base and are newly initialized: ['classifier.dense.bias', 'classifier.de

In [None]:
# Load previously saved weights; raises FileNotFoundError if model.pt has not been saved yet
model.load_state_dict(torch.load("model.pt"))

FileNotFoundError: ignored

# Learn

In [None]:
epochs = 5
batch_size = 16

In [None]:
optimizer = AdamW(model.parameters(), lr=5e-6)
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=batch_size, shuffle=False) # evaluation does not need shuffling
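The number of iterations in one epoch follows from the dataset size and the batch size. A small sketch of that arithmetic, assuming the row counts printed by `describe()` earlier (86043 and 22951) and the `DataLoader` default `drop_last=False`; the helper name `batches_per_epoch` is hypothetical:

```python
import math


def batches_per_epoch(num_examples, batch_size):
    """Number of batches per epoch when the last, smaller batch is kept."""
    return math.ceil(num_examples / batch_size)


train_batches = batches_per_epoch(86043, 16)
test_batches = batches_per_epoch(22951, 16)
print(train_batches)  # 5378
print(test_batches)   # 1435
```

The progress bars recorded further down show 9137 and 3073 batches, which would correspond to larger datasets than these counts, so those outputs may have been recorded in an earlier run on different data.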



In [None]:
losses = []
accuracies = []

for i in range(epochs): #epoch 5
  total_loss = 0.0
  correct = 0
  total = 0
  batches = 0

  model.train() # training mode (enables dropout)

  for input_ids_batch, attention_masks_batch, y_batch in tqdm(train_loader): # tqdm shows progress
  # With batch_size = 16, one epoch takes ceil(dataset size / batch size) iterations
    optimizer.zero_grad()

    y_batch = y_batch.to(device)
    y_pred = model(input_ids_batch.to(device), attention_mask=attention_masks_batch.to(device))[0] # [0] selects the logits; .to(device) copies the batch to the GPU
    loss = F.cross_entropy(y_pred, y_batch)

    loss.backward()
    optimizer.step() #update params(weights and biases)

    total_loss += loss.item()

    _, predicted = torch.max(y_pred, 1) # index of the largest logit along dim 1 = predicted class
    correct += (predicted == y_batch).sum()
    total += len(y_batch)

    batches += 1
    if batches % 100 == 0:
      print("Batch Loss:", total_loss, "Accuracy:", correct.float() / total)

  losses.append(total_loss)
  accuracies.append(correct.float() / total)
  print("Train Loss:", total_loss, "Accuracy:", correct.float() / total) # loss summed over the epoch and training accuracy

  0%|          | 0/9137 [00:00<?, ?it/s]



Batch Loss: 69.02747416496277 Accuracy: tensor(0.5487, device='cuda:0')
Batch Loss: 138.4853185415268 Accuracy: tensor(0.5181, device='cuda:0')
Batch Loss: 207.63781696558 Accuracy: tensor(0.5158, device='cuda:0')
Batch Loss: 276.73519283533096 Accuracy: tensor(0.5158, device='cuda:0')
Batch Loss: 345.9089332818985 Accuracy: tensor(0.5163, device='cuda:0')
Batch Loss: 415.18428629636765 Accuracy: tensor(0.5153, device='cuda:0')
Batch Loss: 484.32429200410843 Accuracy: tensor(0.5151, device='cuda:0')
Batch Loss: 553.4464355707169 Accuracy: tensor(0.5166, device='cuda:0')
Batch Loss: 622.7726057171822 Accuracy: tensor(0.5157, device='cuda:0')
Batch Loss: 691.9422873854637 Accuracy: tensor(0.5152, device='cuda:0')
Batch Loss: 761.1817731261253 Accuracy: tensor(0.5144, device='cuda:0')
Batch Loss: 830.1318537592888 Accuracy: tensor(0.5167, device='cuda:0')


KeyboardInterrupt: ignored
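The "Batch Loss" values in the log above are running sums of `total_loss`, not per-batch losses, so they grow with every print. Dividing by the number of batches seen gives a mean that is easier to read. The sketch below does this for two entries copied from the log and compares the result with ln 2, the cross-entropy of a uniform guess over two classes; it is an interpretation aid, not part of the training code.

```python
import math

# (batches seen, cumulative loss) pairs copied from the log above
log = [(100, 69.02747416496277), (1200, 830.1318537592888)]

chance_loss = math.log(2)  # cross-entropy of guessing 50/50 on two classes

mean_losses = []
for batches, total_loss in log:
    mean_loss = total_loss / batches
    mean_losses.append(mean_loss)
    print(f"after {batches} batches: mean loss {mean_loss:.4f} (chance level {chance_loss:.4f})")
```

Both means sit close to the chance level, which is consistent with the accuracy of about 0.52 in the log: in the recorded portion of this run the model had not yet started to learn.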

In [None]:
losses, accuracies
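The training loop picks the predicted class with `torch.max(y_pred, 1)` directly on the logits. This works because softmax preserves ordering, so the largest logit is also the largest probability. A minimal pure-Python sketch of that fact, with made-up logits and hypothetical helper names:

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def argmax(values):
    """Index of the largest value (the first one on ties)."""
    return max(range(len(values)), key=lambda i: values[i])


logits = [-0.3, 1.2]  # made-up logits for classes 0 and 1
probs = softmax(logits)
print(argmax(logits), argmax(probs))  # the same index either way
```

Softmax is therefore only needed when the probabilities themselves matter, for example to report a confidence score.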

Check accuracy on the test dataset

In [None]:
model.eval()

test_correct = 0
test_total = 0

with torch.no_grad(): # gradients are not needed for evaluation
  for input_ids_batch, attention_masks_batch, y_batch in tqdm(test_loader):
    y_batch = y_batch.to(device)
    y_pred = model(input_ids_batch.to(device), attention_mask=attention_masks_batch.to(device))[0]
    _, predicted = torch.max(y_pred, 1)
    test_correct += (predicted == y_batch).sum()
    test_total += len(y_batch)

print("Accuracy:", test_correct.float() / test_total)


'''
for idx, review in enumerate(test_pd['리뷰']):
  pred = sentiment_classifier(review)
  print(f'{review}\n>> {pred[0]}')
'''

  0%|          | 0/3073 [00:00<?, ?it/s]



KeyboardInterrupt: ignored

In [None]:
test_pd['리뷰'].head()

0             육수와 비빔양념 너무 맛있네요~이모님들도 친절하구요~
1        스탭들 대박친절해요\n매운뼈구이 강력추천\n매콤하니 핵존맛탱~
2                                       좋아요
3    맛있게 잘먹고 갑니다. 아기가 있어 염도조절도 물어봐주고 고맙습니다.
4                                       귀엽다
Name: 리뷰, dtype: object

In [None]:
y_pred = []
total_len = len(test_pd)
for cnt, review in enumerate(test_pd['리뷰']):
    pred = sentiment_classifier(review) # the pipeline tokenizes the review and runs the model
#     print(f"{cnt} / {total_len} : {pred[0]}")
    if pred[0]['label'] == 'LABEL_1': # results are dicts with 'label' and 'score' keys
        y_pred.append(1)
    else:
        y_pred.append(0)

RuntimeError: ignored
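For a single input string, a HuggingFace text-classification pipeline returns a list containing one dict with `label` and `score` keys, and when the model config defines no custom label names the labels default to `LABEL_0` and `LABEL_1`. The sketch below shows the mapping to integer labels on hard-coded, hypothetical outputs so that it runs without a model; `to_binary_label` is a made-up helper name.

```python
def to_binary_label(prediction):
    """Map a pipeline-style result such as [{'label': 'LABEL_1', 'score': 0.98}] to 0 or 1."""
    return 1 if prediction[0]["label"] == "LABEL_1" else 0


# Hypothetical pipeline outputs, hard-coded for illustration
fake_outputs = [
    [{"label": "LABEL_1", "score": 0.98}],
    [{"label": "LABEL_0", "score": 0.71}],
]

labels = [to_binary_label(p) for p in fake_outputs]
print(labels)  # [1, 0]
```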

In [None]:
from sklearn.metrics import classification_report

print(classification_report(test_pd['라벨링'], y_pred))
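`classification_report` lists precision, recall, and F1 for each class. The following sketch computes the same three quantities by hand on a tiny made-up example, to show why they matter when most labels are positive; it is an illustration with a hypothetical helper, not scikit-learn's implementation.

```python
def precision_recall_f1(y_true, y_pred, positive):
    """Precision, recall and F1 for one class, treating `positive` as the target class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Made-up labels: mostly positive, with one negative review misclassified
toy_true = [1, 1, 1, 1, 0, 0]
toy_pred = [1, 1, 1, 1, 1, 0]

accuracy = sum(t == p for t, p in zip(toy_true, toy_pred)) / len(toy_true)
print(f"accuracy: {accuracy:.3f}")
for cls in (0, 1):
    p, r, f = precision_recall_f1(toy_true, toy_pred, cls)
    print(f"class {cls}: precision {p:.3f}, recall {r:.3f}, f1 {f:.3f}")
```

In this toy case the accuracy looks high while recall for class 0 is only 0.5, which is the kind of gap the per-class report makes visible.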

In [None]:
# Save the model
torch.save(model.state_dict(), "model.pt")