#### Modeling steps
1. Install and import libraries
2. Load the data
3. Label-encode the label column
4. Split into X and y
5. Split into train / test sets
6. Load the pre-trained BERT tokenizer
7. Tokenize the train and test data with the BERT tokenizer
8. Convert the train and test sets to TensorFlow Datasets
9. Inspect the pre-trained BERT config
10. Load the pre-trained BERT model, then compile and train it
11. Predict on test_dataset with the trained model
12. Check how well the predictions match

#### 1. Install and import libraries

In [4]:
!pip install transformers



In [1]:
import warnings
warnings.filterwarnings(action = 'ignore')

In [3]:
import numpy as np
import pandas as pd
import tensorflow as tf
import transformers

In [4]:
print(transformers.__version__)
print(tf.__version__)

4.52.4
2.19.0


#### 2. Load the data

In [6]:
!pip install openpyxl



In [7]:
import openpyxl

In [8]:
comment_train = pd.read_excel('https://github.com/gzone2000/TEMP_TEST/raw/master/A_comment_train.xlsx', engine = 'openpyxl')
comment_test = pd.read_excel('https://github.com/gzone2000/TEMP_TEST/raw/master/A_comment_test.xlsx', engine = 'openpyxl')
comment = pd.concat([comment_train, comment_test])

In [9]:
comment.head()

Unnamed: 0.1,Unnamed: 0,data,label
0,0,재미는 있는데 시간이 짧은게 아쉽네요~,긍정
1,1,"OO 관련 내용은 우리 직원과는 거리가 멀었음, 특히, 사내에 홍보할 내용은 아니라고 봄",부정
2,2,스토리가 너무 딱딱해서 별로였음,부정
3,3,프로그램A 화이팅하세요!!,긍정
4,4,높은 곳에 올라가는 모습이 너무 위험해 보여요.,부정


In [10]:
comment.isnull().sum()

Unnamed: 0    0
data          0
label         0
dtype: int64

#### 3. Label-encode the label column

In [12]:
# Map the Korean sentiment labels to integers: '부정' (negative) -> 0, '긍정' (positive) -> 1
comment['label'] = comment['label'].replace(['부정', '긍정'], [0,1])
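
The `replace` call above leaves any label outside the two listed values unchanged, so a typo in the data would pass through unnoticed. A minimal sketch of a stricter alternative, assuming the only valid labels are '부정' (negative) and '긍정' (positive); `LABEL_TO_ID` and `encode_labels` are illustrative names, not part of the original notebook:

```python
# Hypothetical helper: explicit mapping that fails loudly on unexpected labels.
LABEL_TO_ID = {'부정': 0, '긍정': 1}  # negative -> 0, positive -> 1

def encode_labels(labels):
    """Return integer ids for a list of string labels."""
    unknown = sorted(set(labels) - set(LABEL_TO_ID))
    if unknown:
        raise ValueError(f"unexpected labels: {unknown}")
    return [LABEL_TO_ID[label] for label in labels]

encoded = encode_labels(['긍정', '부정', '부정', '긍정'])
print(encoded)
```

With a DataFrame, `comment['label'].map(LABEL_TO_ID)` would apply the same mapping and turn unmapped values into NaN, which `isnull().sum()` can then catch.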

In [13]:
comment.tail()

Unnamed: 0.1,Unnamed: 0,data,label
96,96,작년에 프로그램A를 재밋게 봤던 시청자로서 올해의 미니드라마도 매우 기대가 됩니다....,1
97,97,프로그램C 잘 보았습니다. 모든일의 바탕은 안전인것 같습니다. 모두를 보호하는 최고...,1
98,98,위험한 시설에 대한 설명도 부탁드립니다,0
99,99,구체적으로 어떤 활동을 해왔었고 앞으로 어떤활동을 할건지 잘 설명해줬으면 좋았을 것...,0
100,100,우리 회사가 잘되면 나라도 잘된다. 회사 화이팅 나라 화이팅,1


In [14]:
comment.info()

<class 'pandas.core.frame.DataFrame'>
Index: 352 entries, 0 to 100
Data columns (total 3 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   Unnamed: 0  352 non-null    int64 
 1   data        352 non-null    object
 2   label       352 non-null    int64 
dtypes: int64(2), object(1)
memory usage: 11.0+ KB


#### 4. Split into X and y

In [16]:
# Convert to Python lists; passing an ndarray to the tokenizer raises an error

X = comment.data.to_list()
y = comment.label.to_list()

#### 5. Split into train / test datasets

In [18]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, stratify = y, test_size = 0.2, random_state = 48)

In [19]:
len(X_train), len(X_test), len(y_train), len(y_test)

(281, 71, 281, 71)
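
The sizes above follow from `test_size = 0.2` applied to 352 rows: scikit-learn rounds the test count up and assigns the remainder to the training set. A small sketch of that arithmetic (variable names are illustrative):

```python
import math

n_samples = 352          # rows in the combined DataFrame (from comment.info())
test_size = 0.2

n_test = math.ceil(test_size * n_samples)   # 70.4 rounds up to 71
n_train = n_samples - n_test                # the remainder goes to training

print(n_train, n_test)
```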

In [20]:
X_train[:2]

['뭘 말하고싶은건지 파악이 안됩니다', '우리 회사가 잘되면 나라도 잘된다. 회사 화이팅 나라 화이팅']

#### 6. Load the pre-trained BERT tokenizer

In [22]:
bert_model = 'klue/bert-base'

In [23]:
from transformers import AutoConfig, BertTokenizerFast, TFBertForSequenceClassification
tokenizer = BertTokenizerFast.from_pretrained(bert_model)

In [24]:
tokenizer.vocab_size

32000

In [25]:
tokenizer.vocab

{'마중': 18448,
 '선가': 15248,
 '##탠': 2952,
 '종족': 13014,
 '갓난': 25129,
 '선대': 11476,
 '님': 805,
 'En': 11349,
 '##불교': 26660,
 '순리': 25508,
 '사스': 30968,
 '어울려': 13486,
 '민원': 6710,
 '가리킨다': 16519,
 '박석': 20350,
 '현대사': 17019,
 '원세': 22214,
 '만약': 4826,
 '##지구': 22472,
 '애틀': 22500,
 '##립': 2339,
 '구별': 7423,
 '경색': 14349,
 '이뤄질': 9733,
 '정액': 15525,
 'FDA': 24236,
 '##략': 2773,
 '국당': 7774,
 '##않': 2872,
 '하인': 12265,
 '특례': 12037,
 '연관': 6597,
 '칠판': 25359,
 '여심': 29477,
 'IT': 5392,
 '터스': 30374,
 '##라이': 4893,
 '닥터': 11380,
 '수배': 18627,
 '자릿수': 14690,
 '변한': 13307,
 '##식간': 9247,
 '신곡': 14418,
 '##렌토': 31081,
 '평야': 27344,
 '베드': 15328,
 '가까스로': 16938,
 '##못해': 22697,
 '부지사': 18079,
 '##오른': 17290,
 '##이션': 4632,
 '[unused57]': 31557,
 '막아내': 22750,
 '바흐': 18252,
 '정상회의': 24051,
 '패딩': 16221,
 '스코어': 15041,
 '뒷걸음질': 26733,
 '궁지': 22736,
 '이뤄낸': 25872,
 '만사': 23346,
 '원단': 11964,
 '평균치': 27923,
 '자외선': 10505,
 '루스벨트': 26649,
 '[unused171]': 31671,
 '넷': 756,
 '원비': 24782,
 '학기제': 22
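
Entries that start with `##` in the vocabulary are WordPiece continuation pieces: they attach to the preceding token instead of starting a new word. The sketch below shows how such pieces join back into words. The token split used here is made up for illustration (it is not actual tokenizer output), and `merge_wordpieces` is a hypothetical helper:

```python
def merge_wordpieces(tokens):
    """Join '##' continuation pieces onto the preceding token."""
    words = []
    for token in tokens:
        if token.startswith('##') and words:
            words[-1] += token[2:]   # drop the '##' marker and append
        else:
            words.append(token)
    return words

# Illustrative split only; the real tokenizer may segment differently.
pieces = ['스테', '##이션', '민원']
merged = merge_wordpieces(pieces)
print(merged)
```

To see the actual pieces for a sentence, `tokenizer.tokenize(X_train[0])` returns the token strings before they are mapped to IDs.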

#### 7. Tokenize the train and test data with the pre-trained BERT tokenizer

In [27]:
# Truncate sequences that exceed the model's maximum length and pad the rest to the longest sequence
train_encodings = tokenizer(X_train, truncation=True, padding = True)
test_encodings = tokenizer(X_test, truncation=True, padding = True)

In [28]:
# input_ids: the sentence mapped to token IDs, comparable to the result of Keras Tokenizer's texts_to_sequences followed by pad_sequences
print(train_encodings['input_ids'][0])

[2, 1099, 1041, 19521, 2585, 2073, 2332, 2118, 4591, 2052, 1378, 3598, 3606, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]


In [29]:
print(test_encodings['input_ids'][0])

[2, 5696, 12821, 2079, 1388, 6233, 2079, 3746, 7285, 3732, 3869, 3598, 3606, 5, 7818, 19521, 6001, 2259, 3969, 3869, 2205, 2918, 2219, 3606, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
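
In both outputs the sequence starts with ID 2, ends with ID 3, and is then filled with 0 up to the length of the longest sentence in the set. The sketch below mimics that padding step in plain Python and derives the matching attention mask. It assumes 0 is the padding ID (consistent with the `pad_token_id` of 0 reported by the model config) and uses shortened, made-up sequences; `pad_batch` is a hypothetical helper, not a transformers API.

```python
PAD_ID = 0  # assumed padding id

def pad_batch(batch, pad_id=PAD_ID):
    """Pad every sequence to the longest one and build attention masks."""
    max_len = max(len(ids) for ids in batch)
    input_ids = [ids + [pad_id] * (max_len - len(ids)) for ids in batch]
    # 1 marks a real token, 0 marks padding the model should ignore
    attention_mask = [[1] * len(ids) + [0] * (max_len - len(ids)) for ids in batch]
    return {'input_ids': input_ids, 'attention_mask': attention_mask}

toy_batch = [[2, 1099, 1041, 3], [2, 5696, 12821, 2079, 1388, 3]]
padded = pad_batch(toy_batch)
print(padded['input_ids'])
print(padded['attention_mask'])
```

The real tokenizer output carries `attention_mask` and `token_type_ids` alongside `input_ids`; `train_encodings.keys()` lists them.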


#### 8. Convert the train and test sets to TensorFlow Datasets

In [33]:
# Convert the input data to a tf.data.Dataset for training

# 1. Build one example per row from train_encodings (feature dict) and y_train (label)
# 2. Cache the examples so repeated epochs don't recompute them
# 3. Shuffle with a buffer of size 1000 (caching before shuffling lets every epoch see a new order)
# 4. Batch into mini-batches of 16 examples
# 5. Prefetch so the next batch is prepared in the background while the model trains on the current one

train_dataset = tf.data.Dataset.from_tensor_slices((dict(train_encodings), y_train))
train_dataset = train_dataset.cache().shuffle(buffer_size = 1000).batch(16).prefetch(tf.data.AUTOTUNE)

test_dataset = tf.data.Dataset.from_tensor_slices((dict(test_encodings), y_test))
test_dataset = test_dataset.batch(16).cache().prefetch(tf.data.AUTOTUNE)
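
With a batch size of 16, the 281 training examples produce 18 batches and the 71 test examples produce 5, the last batch in each case being smaller than 16. A plain-Python sketch of what `batch(16)` does to the example stream (the `batches` helper is illustrative only and stands in for the tf.data operation):

```python
def batches(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

train_batches = list(batches(list(range(281)), 16))
test_batches = list(batches(list(range(71)), 16))

print(len(train_batches), len(train_batches[-1]))
print(len(test_batches), len(test_batches[-1]))
```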

#### 9. Inspect the pre-trained BERT config

In [36]:
config = AutoConfig.from_pretrained(bert_model)
config

BertConfig {
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_version": "4.52.4",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 32000
}

#### 10. Load the pre-trained BERT model, then compile and train it

In [41]:
# num_labels: number of classes for the classification head (2: negative / positive)
# from_pt=True: convert the PyTorch checkpoint weights for use in the TensorFlow model

from transformers import TFBertForSequenceClassification

model = TFBertForSequenceClassification.from_pretrained(bert_model, num_labels=2, from_pt = True)

optimizer = tf.keras.optimizers.Adam(learning_rate = 5e-5)
model.compile(optimizer=optimizer, loss=model.hf_compute_loss, metrics=['accuracy'])
model.fit(train_dataset, epochs = 1, validation_data = test_dataset)

Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFBertForSequenceClassification: ['bert.embeddings.position_ids']
- This IS expected if you are initializing TFBertForSequenceClassification from a PyTorch model trained on another task or with another architecture (e.g. initializing a TFBertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFBertForSequenceClassification from a PyTorch model that you expect to be exactly identical (e.g. initializing a TFBertForSequenceClassification model from a BertForSequenceClassification model).
Some weights or buffers of the TF 2.0 model TFBertForSequenceClassification were not initialized from the PyTorch model and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.




<tf_keras.src.callbacks.History at 0x321bd0b60>

In [43]:
model.summary()

Model: "tf_bert_for_sequence_classification_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 bert (TFBertMainLayer)      multiple                  110617344 
                                                                 
 dropout_75 (Dropout)        multiple                  0         
                                                                 
 classifier (Dense)          multiple                  1538      
                                                                 
Total params: 110618882 (421.98 MB)
Trainable params: 110618882 (421.98 MB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
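
The parameter counts in the summary can be reproduced from the config values shown earlier (`vocab_size`, `hidden_size`, `intermediate_size`, and so on). The following sketch tallies the weights of a standard BERT encoder plus the 2-class classification head. The breakdown into embeddings, encoder layers, and pooler reflects the usual BERT layout and is stated here as an assumption, not read from the model object.

```python
# Values copied from the BertConfig output
vocab_size, hidden, intermediate = 32000, 768, 3072
max_positions, type_vocab, num_layers, num_labels = 512, 2, 12, 2

def dense(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out

layer_norm = 2 * hidden  # scale and bias

# Word, position, and token-type embeddings, followed by LayerNorm
embeddings = (vocab_size + max_positions + type_vocab) * hidden + layer_norm
# Query, key, value, and output projections, followed by LayerNorm
attention = 4 * dense(hidden, hidden) + layer_norm
# Two dense layers of the feed-forward block, followed by LayerNorm
feed_forward = dense(hidden, intermediate) + dense(intermediate, hidden) + layer_norm
pooler = dense(hidden, hidden)

bert_params = embeddings + num_layers * (attention + feed_forward) + pooler
classifier_params = dense(hidden, num_labels)
print(bert_params, classifier_params, bert_params + classifier_params)
```

The three printed numbers agree with the `bert`, `classifier`, and total rows of the summary, which confirms that nearly all of the model's capacity sits in the pre-trained encoder and only 1,538 parameters are newly initialized.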


#### 11. Predict on test_dataset with the trained model

In [46]:
y_test_pred = model.predict(test_dataset)



In [48]:
y_test_pred.logits.shape

(71, 2)

In [50]:
y_test_pred.logits[:10]

array([[-2.552149  ,  2.7259643 ],
       [-2.083826  ,  2.2113676 ],
       [-2.4810295 ,  2.6503031 ],
       [ 0.21361487,  0.03153067],
       [-1.7604138 ,  2.159036  ],
       [-2.6179163 ,  2.7887166 ],
       [ 2.3119416 , -2.2264845 ],
       [-0.8722002 ,  1.1628476 ],
       [ 2.4974332 , -1.9687002 ],
       [ 2.0189905 , -1.9547454 ]], dtype=float32)
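
The logits are unnormalized scores, one per class. Applying softmax turns each row into probabilities, which makes it easier to see how confident the model is. A minimal sketch in plain Python, using two rows copied from the output above (the `softmax` helper is written out here for illustration):

```python
import math

def softmax(logits):
    """Numerically stable softmax for one row of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

rows = [[-2.552149, 2.7259643],     # row 0: strongly favors class 1
        [0.21361487, 0.03153067]]   # row 3: nearly undecided
probs = [softmax(row) for row in rows]
for row, p in zip(rows, probs):
    print(row, [round(v, 4) for v in p])
```

On the full array, `tf.nn.softmax(y_test_pred.logits, axis=-1)` performs the same conversion for all rows at once.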

#### 12. Check how well the predictions match

In [53]:
# Put the predicted class (argmax of the logits) into a DataFrame

df = pd.DataFrame(np.argmax(y_test_pred.logits, axis = 1), columns=['predict'])
df

Unnamed: 0,predict
0,1
1,1
2,1
3,0
4,1
...,...
66,0
67,0
68,0
69,0


In [55]:
df['true'] = y_test
df

Unnamed: 0,predict,true
0,1,1
1,1,1
2,1,1
3,0,0
4,1,1
...,...,...
66,0,0
67,0,0
68,0,0
69,0,0


In [57]:
np.sum(df['true'] == df['predict']) / len(df)

0.9436619718309859
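
An accuracy of 0.9437 on 71 test examples corresponds to 67 correct predictions. Accuracy alone does not show which class the errors fall in, so a count of (true, predicted) pairs is a useful companion. The sketch below uses short made-up label lists (not the notebook's data), and `confusion_counts` is a hypothetical helper:

```python
from collections import Counter

def confusion_counts(y_true, y_pred):
    """Count how often each (true, predicted) pair occurs."""
    return Counter(zip(y_true, y_pred))

y_true = [1, 1, 0, 0, 1, 0]   # made-up labels
y_pred = [1, 0, 0, 0, 1, 1]   # made-up predictions

counts = confusion_counts(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

print("true positives :", counts[(1, 1)])
print("true negatives :", counts[(0, 0)])
print("false negatives:", counts[(1, 0)])
print("false positives:", counts[(0, 1)])
print("accuracy       :", accuracy)
```

For the notebook's own results, `sklearn.metrics.confusion_matrix(df['true'], df['predict'])` would give the same breakdown as a 2x2 array.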