<a href="https://colab.research.google.com/github/dhdbsrlw/kupply-MLOps/blob/main/Classification_Model_Pipeline.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **파이프라인 개요**

- **Baseline Model:** SKT/koBERT (BERT 의 kor-finetuning)
- **Task**: Multiclass classification (4 class)
- **유의사항**: 실제 서비스 내에서 사용될 모델이기 때문에, 모델의 input/output format 에 주의 \
(Train Data 와 실제 서비스 클라이어트 단에서 GET 해오는 Input Data 가 상이할 경우,\
 데이터 분포 자체가 달라져 좋은 예측 성능을 기대하기 어렵다.)

In [19]:
from google.colab import drive
drive.mount('/content/drive/')

import pandas as pd
import numpy as np
import json

from sklearn.preprocessing import OneHotEncoder
import torch.nn.functional as F

Drive already mounted at /content/drive/; to attempt to forcibly remount, call drive.mount("/content/drive/", force_remount=True).


*전처리된 데이터 별도로 저장해두었으므로, 바로 Step 2 부터 시작하시면 됩니다.*

# 1. 데이터 Import

모델링에 활용되는 항목: 1전공(본전공) / 입학연도 / 1지망 학과 / 지원연도 및 시기 / 학점 / 합격여부

In [12]:
# 과거 합/불합 데이터 불러오기

df = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/kupply-MLOps/rawData/surveyResponses.csv')
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 115 entries, 0 to 114
Data columns (total 12 columns):
 #   Column                                                                                                                   Non-Null Count  Dtype  
---  ------                                                                                                                   --------------  -----  
 0   타임스탬프                                                                                                                    115 non-null    object 
 1   귀하에게는 개인정보 수집 및 이용을 거부할 권리가 있으며 거부 시 설문 제출 및 경품 수령이 불가합니다. 「개인정보보호법」 등 관련 법규에 의거하여 상기 본인은 위와 같이 개인정보 수집 및 활용에 동의하십니까?    115 non-null    object 
 2   제1전공(본전공) 학과 또는 학부를 입력해주세요.                                                                                              115 non-null    object 
 3   본인의 학번(입학연도)을 선택해주세요.                                                                                                    115 non-null    obj

In [67]:
df.head()

Unnamed: 0,타임스탬프,귀하에게는 개인정보 수집 및 이용을 거부할 권리가 있으며 거부 시 설문 제출 및 경품 수령이 불가합니다. 「개인정보보호법」 등 관련 법규에 의거하여 상기 본인은 위와 같이 개인정보 수집 및 활용에 동의하십니까?,제1전공(본전공) 학과 또는 학부를 입력해주세요.,본인의 학번(입학연도)을 선택해주세요.,1지망으로 지원하신 이중전공 학과 또는 학부를 입력해주세요.,지원하신 연도와 학기를 입력해주세요.,지원서 제출 당시 대내용 평점평균(학점)을 입력해주세요.,(1지망) 이중전공 합격 여부를 선택해주세요.,"(선택문항) 이중전공 합격을 위해, 본인이 준비한 기타 스펙이 있었다면 자세히 입력해주세요.",(선택 문항/자소서 제출학과 지원자만 해당) 지원 당시 제출하였던 학업계획서(자기소개서)를 업로드해주세요.,휴대전화 번호를 입력해주세요.,자유 의견
0,2023. 7. 11 오후 5:52:40,동의합니다,컴퓨터학과,22학번,통계학과,2023년 1학기,4.47,합격,없음,,01090403172,
1,2023. 7. 11 오후 6:10:54,동의합니다,수학과,20학번,경영학과,2023년 1학기,4.09,불합격,없음,https://drive.google.com/open?id=1_n9J-OKUeDv1...,010-3866-5244,
2,2023. 7. 11 오후 6:18:54,동의합니다,일어일문학과,20학번,컴퓨터학과,2021년 1학기,4.48,합격,,,,
3,2023. 7. 11 오후 6:26:55,동의합니다,언어학과,20학번,컴퓨터학과,2021년 1학기,4.19,합격,"동기들과 스터디, 전산언어학 겨울학교, 컴퓨터프로그래밍I 선수강",https://drive.google.com/open?id=1v07NOmumdiBW...,01065080118,파일은 사용 후 바로 폐기해주셨으면 좋겠습니다..! 서비스 잘 되시길 바라요~~!!
4,2023. 7. 11 오후 6:28:50,동의합니다,컴퓨터학과,22학번,수학과,2023년 1학기,3.73,합격,,,01032541341,


#1.5. 데이터 Preprocessing

In [13]:
# column 명 교체
df.columns = ['A', 'B', 'firstMajor', 'applyGrade','applyMajor', 'applySemester', 'applyGPA', 'pass', 'C', 'D', 'E', 'F']
# df.head()

# 불필요한 column 제거
drop_list = ['A', 'B', 'C', 'D', 'E', 'F']
df.drop(labels=drop_list, axis=1, inplace=True)

In [14]:
print(df.head())

# 결측값 확인
df.isnull().sum()

  firstMajor applyGrade applyMajor applySemester  applyGPA pass
0      컴퓨터학과       22학번       통계학과     2023년 1학기      4.47   합격
1        수학과       20학번       경영학과     2023년 1학기      4.09  불합격
2     일어일문학과       20학번      컴퓨터학과     2021년 1학기      4.48   합격
3       언어학과       20학번      컴퓨터학과     2021년 1학기      4.19   합격
4      컴퓨터학과       22학번        수학과     2023년 1학기      3.73   합격


firstMajor       0
applyGrade       0
applyMajor       0
applySemester    0
applyGPA         0
pass             0
dtype: int64

In [15]:
# 서비스 GET 데이터 포맷과 통일시키기 (applyGrade, applySemester, pass)
# 모든 학생이 '무' 휴학 상태임을 가정 (데이터 수집 상의 한계로 인한 가정)
def preprocess_applyGrade(example):
  grade = example['applyGrade']
  semester = example['applySemester']
  tmp1 = int(grade[:2])  # Adjusted to get two digits
  tmp2 = int(semester[2:4])  # Adjusted to get the year part
  tmp3 = str(tmp2 - tmp1 + 1)
  result = tmp3 + '-' + semester[-1]
  return result

  # tmp1 = int(grade[:1])
  # tmp2 = int(semester[2:3])
  # tmp3 = str(tmp2 - tmp1 + 1)
  # result = tmp3 + '-' + semester[6]
  # return result

# applySemeseter format 통일
def preprocess_applySemester(example):
  try:
    parts = example.split(' ')
    year = parts[0].replace('년', '')
    term = parts[1].replace('학기', '')
    return f"{year}-{term}"
  except:
    # Handle unexpected format
    return example

  # year, term = example.split('년 ')
  # term = term.split('학기')[0]
  # return f"{year}-{term}"

# pass column 값 숫자로 변환 (for better embedding)
def preprocess_pass(example):
  return 1 if (example == '합격') else 0


In [16]:
# 전처리 함수 적용

# temp_df = df.copy() # 테스트
# temp_df['applySemester'] = df['applySemester'].apply(preprocess_applySemester)
# temp_df['pass'] = df['pass'].apply(preprocess_pass)
# temp_df['applyGrade'] = temp_df.apply(preprocess_applyGrade, axis=1)

df['applySemester'] = df['applySemester'].apply(preprocess_applySemester)
df['pass'] = df['pass'].apply(preprocess_pass)
df['applyGrade'] = df.apply(preprocess_applyGrade, axis=1)

df.head()

Unnamed: 0,firstMajor,applyGrade,applyMajor,applySemester,applyGPA,pass
0,컴퓨터학과,2-1,통계학과,2023-1,4.47,1
1,수학과,4-1,경영학과,2023-1,4.09,0
2,일어일문학과,2-1,컴퓨터학과,2021-1,4.48,1
3,언어학과,2-1,컴퓨터학과,2021-1,4.19,1
4,컴퓨터학과,2-1,수학과,2023-1,3.73,1


In [20]:
# 전처리 완료한 df 를 csv 로 저장

df.to_csv('/content/drive/MyDrive/Colab Notebooks/kupply-MLOps/kobert_data.csv', index=None)

# 2. 토크나이저 및 모델 Import (koBERT)

In [1]:
# koBERT tokenizer용 라이브러리

!pip install mxnet
!pip install gluonnlp==0.8.0
!pip install tqdm pandas
!pip install sentencepiece

Collecting mxnet
  Downloading mxnet-1.9.1-py3-none-manylinux2014_x86_64.whl (49.1 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m49.1/49.1 MB[0m [31m19.5 MB/s[0m eta [36m0:00:00[0m
Collecting graphviz<0.9.0,>=0.8.1 (from mxnet)
  Downloading graphviz-0.8.4-py2.py3-none-any.whl (16 kB)
Installing collected packages: graphviz, mxnet
  Attempting uninstall: graphviz
    Found existing installation: graphviz 0.20.1
    Uninstalling graphviz-0.20.1:
      Successfully uninstalled graphviz-0.20.1
Successfully installed graphviz-0.8.4 mxnet-1.9.1
Collecting gluonnlp==0.8.0
  Downloading gluonnlp-0.8.0.tar.gz (235 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m235.5/235.5 kB[0m [31m3.7 MB/s[0m eta [36m0:00:00[0m
[?25h  Preparing metadata (setup.py) ... [?25l[?25hdone
Building wheels for collected packages: gluonnlp
  Building wheel for gluonnlp (setup.py) ... [?25l[?25hdone
  Created wheel for gluonnlp: filename=gluonnlp-0.8.0-py3-none-

In [2]:
!pip install torch
!pip install transformers
!pip install datasets

Collecting transformers
  Downloading transformers-4.35.2-py3-none-any.whl (7.9 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m7.9/7.9 MB[0m [31m20.3 MB/s[0m eta [36m0:00:00[0m
Collecting huggingface-hub<1.0,>=0.16.4 (from transformers)
  Downloading huggingface_hub-0.19.4-py3-none-any.whl (311 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m311.7/311.7 kB[0m [31m39.0 MB/s[0m eta [36m0:00:00[0m
Collecting tokenizers<0.19,>=0.14 (from transformers)
  Downloading tokenizers-0.15.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.8 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m3.8/3.8 MB[0m [31m41.9 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting safetensors>=0.3.1 (from transformers)
  Downloading safetensors-0.4.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m1.3/1.3 MB[0m [31m52.5 MB/s[0m eta [36m0:00:00[0m
Ins

In [3]:
!pip install 'git+https://github.com/SKTBrain/KoBERT.git#egg=kobert_tokenizer&subdirectory=kobert_hf' # koBERT tokenizer

Collecting kobert_tokenizer
  Cloning https://github.com/SKTBrain/KoBERT.git to /tmp/pip-install-69rze47y/kobert-tokenizer_d7df4da03721432da8eba8ac82dbede3
  Running command git clone --filter=blob:none --quiet https://github.com/SKTBrain/KoBERT.git /tmp/pip-install-69rze47y/kobert-tokenizer_d7df4da03721432da8eba8ac82dbede3
  Resolved https://github.com/SKTBrain/KoBERT.git to commit 47a69af87928fc24e20f571fe10c3cc9dd9af9a3
  Preparing metadata (setup.py) ... [?25l[?25hdone
Building wheels for collected packages: kobert_tokenizer
  Building wheel for kobert_tokenizer (setup.py) ... [?25l[?25hdone
  Created wheel for kobert_tokenizer: filename=kobert_tokenizer-0.1-py3-none-any.whl size=4632 sha256=2b33210967039da9b01abb77b884737d5c4ac74b188e86a5de1e4ebf0e738a0e
  Stored in directory: /tmp/pip-ephem-wheel-cache-rdtwrsh5/wheels/e9/1a/3f/a864970e8a169c176befa3c4a1e07aa612f69195907a4045fe
Successfully built kobert_tokenizer
Installing collected packages: kobert_tokenizer
Successfully ins

In [4]:
# 기본적인 라이브러리
import pandas as pd
import numpy as np
import math
import torch
from torch.utils.data import Dataset, DataLoader
from transformers import BertTokenizer, BertForSequenceClassification, AdamW
from os import path
from datetime import datetime

# for koBERT
from kobert_tokenizer import KoBERTTokenizer
from transformers.optimization import AdamW, get_cosine_schedule_with_warmup
from tqdm import tqdm
import sentencepiece

In [23]:
# 추가 (23.11.16) - 삭제 필요
# !git clone https://github.com/SKTBrain/KoBERT.git
# %cd KoBERT

Cloning into 'KoBERT'...
remote: Enumerating objects: 428, done.[K
remote: Counting objects: 100% (148/148), done.[K
remote: Compressing objects: 100% (46/46), done.[K
remote: Total 428 (delta 125), reused 104 (delta 102), pack-reused 280[K
Receiving objects: 100% (428/428), 218.85 KiB | 2.60 MiB/s, done.
Resolving deltas: 100% (221/221), done.
/content/KoBERT


In [28]:
# !pip install boto3 # 삭제 필요

Collecting boto3
  Downloading boto3-1.29.1-py3-none-any.whl (135 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m135.8/135.8 kB[0m [31m4.0 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting botocore<1.33.0,>=1.32.1 (from boto3)
  Downloading botocore-1.32.1-py3-none-any.whl (11.4 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m11.4/11.4 MB[0m [31m88.6 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting jmespath<2.0.0,>=0.7.1 (from boto3)
  Downloading jmespath-1.0.1-py3-none-any.whl (20 kB)
Collecting s3transfer<0.8.0,>=0.7.0 (from boto3)
  Downloading s3transfer-0.7.0-py3-none-any.whl (79 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m79.8/79.8 kB[0m [31m11.5 MB/s[0m eta [36m0:00:00[0m
Installing collected packages: jmespath, botocore, s3transfer, boto3
Successfully installed boto3-1.29.1 botocore-1.32.1 jmespath-1.0.1 s3transfer-0.7.0


In [5]:
import torch
from torch import nn
import torch.nn.functional as F
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader
import gluonnlp as nlp
import numpy as np
from tqdm.notebook import tqdm

# import boto3



In [33]:
# from kobert import get_tokenizer

In [6]:
# tokenizer = get_tokenizer()
tokenizer = KoBERTTokenizer.from_pretrained('skt/kobert-base-v1')
model = BertForSequenceClassification.from_pretrained('skt/kobert-base-v1',num_labels=2) # class 개수 맞게 수정
# model = BertForSequenceClassification.from_pretrained('monologg/kobert',num_labels=4) # class 개수 맞게 수정

(…)se-v1/resolve/main/tokenizer_config.json:   0%|          | 0.00/432 [00:00<?, ?B/s]

(…)kobert-base-v1/resolve/main/spiece.model:   0%|          | 0.00/371k [00:00<?, ?B/s]

(…)-v1/resolve/main/special_tokens_map.json:   0%|          | 0.00/244 [00:00<?, ?B/s]

The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. 
The tokenizer class you load from this checkpoint is 'XLNetTokenizer'. 
The class this function is called from is 'KoBERTTokenizer'.


(…)/kobert-base-v1/resolve/main/config.json:   0%|          | 0.00/535 [00:00<?, ?B/s]

pytorch_model.bin:   0%|          | 0.00/369M [00:00<?, ?B/s]

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at skt/kobert-base-v1 and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


# 3. 데이터 Preprocessing (Data Loader)

In [21]:
# 전처리된 데이터 불러오기
df = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/kupply-MLOps/kobert_data.csv')
print(df.head())

  firstMajor applyGrade applyMajor applySemester  applyGPA  pass
0      컴퓨터학과        2-1       통계학과        2023-1      4.47     1
1        수학과        4-1       경영학과        2023-1      4.09     0
2     일어일문학과        2-1      컴퓨터학과        2021-1      4.48     1
3       언어학과        2-1      컴퓨터학과        2021-1      4.19     1
4      컴퓨터학과        2-1        수학과        2023-1      3.73     1


In [22]:
# 데이터프레임 (true) label (= pass여부) 만 별도로 저장

labels = list(map(int,df['pass'].tolist()))
print(labels)
# label_dict = {0, 1} # 0: 불합격, 1: 합격 의미
# {label: i for i, label in enumerate(set(labels))}
# labels = [label_dict[label] for label in label_list]

[1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1]


In [23]:
# 테이블 한 row 에 해당하는 모든 데이터 concatenation
def cls_preprocess(dataset):
    preprocessed = []

    for _, row in dataset.iterrows():
        text = f"First Major is {row['firstMajor']}, Apply Grade is {row['applyGrade']}, Apply Major is {row['applyMajor']}, Apply Semester is {row['applySemester']}, GPA is {row['applyGPA']}, Pass is {row['pass']}"
        preprocessed_text = "[CLS] " + text + " [SEP]"
        preprocessed.append(preprocessed_text)

    return preprocessed

# 함수 적용
processed_data = cls_preprocess(df)
print(processed_data[0])  # Print the first processed item


[CLS] First Major is 컴퓨터학과, Apply Grade is 2-1, Apply Major is 통계학과, Apply Semester is 2023-1, GPA is 4.47, Pass is 1 [SEP]


In [24]:
# 토크나이징

tokenized_data = tokenizer.batch_encode_plus(
    processed_data, # lyrics (수정)
    add_special_tokens=True,
    padding='longest',
    truncation=True,
    max_length=256, # 수정
    return_attention_mask=True,
    return_tensors='pt'
)


In [25]:
# DataLoader 정의

class applyDataset(Dataset):
    def __init__(self, content, labels, attention_masks):
        self.content = content
        self.labels = labels
        self.attention_masks = attention_masks
        self.num_classes = 2 # 훈련 데이터셋에 라벨이 2 class 인 관계로, 우선 2 class 로 설정 (코드 수정 시, 위 토크나이저 임포트 코드와 함께 바꿔주기)
        self.one_hot_labels = torch.zeros(len(labels), self.num_classes)
        for i, label in enumerate(self.labels):
              # print(i,label)
              self.one_hot_labels[i, label] = 1

    def __len__(self):
        return len(self.content)

    def __getitem__(self, idx):
        return {
            'content': self.content[idx],
            'label': self.one_hot_labels[idx],
            'attention_mask': self.attention_masks[idx],
            'gt_label': self.labels[idx]
        }

        # num_classes = len(set(labels)) # 2
        # self.one_hot_labels = torch.zeros(len(labels), num_classes)
        # for i, label in enumerate(labels):
            # self.one_hot_labels[i, label] = 1.0


In [26]:
# 데이터로더를 통해 훈련용 데이터셋 생성

dataset = applyDataset(
    content=tokenized_data['input_ids'],
    labels=labels,
    attention_masks=tokenized_data['attention_mask'],
)

In [27]:
# 데이터셋 분할 함수 정의 (별도의 수정 없이 그대로 사용 가능)

def dataset_split(dataset, ratio):
    train_size = int(ratio * len(dataset))
    val_size = len(dataset) - train_size
    train_dataset, val_dataset = torch.utils.data.random_split(dataset, [train_size, val_size])
    return train_dataset, val_dataset

In [28]:
# 데이터로더 config
batch_size = 8 # 데이터 개수가 작아 비교적 작은 값으로 설정

# 데이터 분할
train_dataset, val_dataset = dataset_split(dataset, 0.8)

train_dataloader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
val_dataloader = DataLoader(val_dataset, batch_size=batch_size, shuffle=False) # val 데이터셋은 shuffle 하면 안된다.

# for data in train_dataloader:
  # print(data['content'])

# 4. Train

In [29]:
import logging
import os

logger = logging.getLogger()  ###진행 과정에서 로깅 포인트가 발생할 경우 로깅을 하기 위한 코드
logger.setLevel(logging.INFO)  ###진행 과정에서 로깅 포인트가 발생할 경우 로깅을 하기 위한 코드


In [31]:
# Trainig Config
# 딥러닝...이지만... 샘플 데이터수가 작은 관계로 전반적으로 작은 값으로 설정

epochs = 50
warmup_ratio = 0.1
lr = 2e-5
grad_clip = 1.0
train_log_interval = 30 # train 이 100번 이루어질 때마다 logging
# validation_interval = 1000
save_interval = 60 # save point는 1000번의 train


# Device
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model.to(device) # GPU 에 모델 올리기

# Optimizer
optimizer = torch.optim.AdamW(model.parameters(), lr=lr)

# scheduler
data_len = len(train_dataloader)
num_train_steps = int(data_len / batch_size * epochs)
num_warmup_steps = int(num_train_steps * warmup_ratio)
scheduler = get_cosine_schedule_with_warmup(optimizer, num_warmup_steps=num_warmup_steps, num_training_steps=num_train_steps)


In [33]:
# Train 함수 정의

def model_train(model, optimizer, scheduler, train_dataloader, device):
    model.to(device)  # 모델 학습을 설정된 device (CPU, cuda) 위에서 진행하도록 설정

    model.train() # 모델을 학습 모드로 전환
    loss_list_between_log_interval = []

    for epoch_id in range(epochs):
        for step_index, batch_data in tqdm(enumerate(train_dataloader), f"[TRAIN] Epoch:{epoch_id+1}", total=len(train_dataloader)):
                global_step = len(train_dataloader) * epoch_id + step_index + 1

                # Add a condition to break the loop if we've gone through all data points
                if step_index * batch_size >= len(dataset):
                  continue

                optimizer.zero_grad()
                contents = batch_data['content']
                labels = batch_data['label']
                attention_masks = batch_data['attention_mask']

                # 모델의 input들을 device(GPU)와 호환되는 tensor로 변환
                contents = contents.to(device)
                labels = labels.to(device)
                attention_masks = attention_masks.to(device)

                model_outputs = model(
                    contents, token_type_ids=None, attention_mask=attention_masks, labels=labels
                    )

                loss = model_outputs.loss
                loss.backward()

                torch.nn.utils.clip_grad_norm_(model.parameters(), grad_clip)
                optimizer.step()
                scheduler.step()

                # for logging
                loss_list_between_log_interval.append(model_outputs.loss.item())

                #if global_step % train_log_interval == 0:

        mean_loss = np.mean(loss_list_between_log_interval)

        # 콘솔 출력
        logger.info(
            f"EP:{epoch_id} global_step:{global_step} "
            f"loss:{mean_loss:.4f} perplexity:{math.exp(mean_loss):.4f}"
        )

        loss_list_between_log_interval.clear()

        # if global_step % validation_interval == 0:
        # dev_loss = _validate(model, val_dataloader, device, logger, global_step)

        # 각 epoch 마다 모델 저장
        state_dict = model.state_dict()
        model_path = os.path.join('/content/drive/MyDrive/Colab Notebooks/kupply-MLOps/checkpoint/train_1', f"kupply_epoch_{epoch_id}.pth")
        logger.info(f"global_step: {global_step} model saved at {model_path}")
        torch.save(state_dict, model_path)

    return model

In [34]:
# Let's Training
model_v1 = model_train(model, optimizer, scheduler, train_dataloader, device)

[TRAIN] Epoch:1:   0%|          | 0/12 [00:00<?, ?it/s]

INFO:root:EP:0 global_step:12 loss:0.6555 perplexity:1.9262
INFO:root:global_step: 12 model saved at /content/drive/MyDrive/Colab Notebooks/kupply-MLOps/checkpoint/train_1/kupply_epoch_0.pth


[TRAIN] Epoch:2:   0%|          | 0/12 [00:00<?, ?it/s]

INFO:root:EP:1 global_step:24 loss:0.4397 perplexity:1.5523
INFO:root:global_step: 24 model saved at /content/drive/MyDrive/Colab Notebooks/kupply-MLOps/checkpoint/train_1/kupply_epoch_1.pth


[TRAIN] Epoch:3:   0%|          | 0/12 [00:00<?, ?it/s]

INFO:root:EP:2 global_step:36 loss:0.3962 perplexity:1.4862
INFO:root:global_step: 36 model saved at /content/drive/MyDrive/Colab Notebooks/kupply-MLOps/checkpoint/train_1/kupply_epoch_2.pth


[TRAIN] Epoch:4:   0%|          | 0/12 [00:00<?, ?it/s]

INFO:root:EP:3 global_step:48 loss:0.3641 perplexity:1.4392
INFO:root:global_step: 48 model saved at /content/drive/MyDrive/Colab Notebooks/kupply-MLOps/checkpoint/train_1/kupply_epoch_3.pth


[TRAIN] Epoch:5:   0%|          | 0/12 [00:00<?, ?it/s]

INFO:root:EP:4 global_step:60 loss:0.3502 perplexity:1.4193
INFO:root:global_step: 60 model saved at /content/drive/MyDrive/Colab Notebooks/kupply-MLOps/checkpoint/train_1/kupply_epoch_4.pth


[TRAIN] Epoch:6:   0%|          | 0/12 [00:00<?, ?it/s]

INFO:root:EP:5 global_step:72 loss:0.3645 perplexity:1.4398
INFO:root:global_step: 72 model saved at /content/drive/MyDrive/Colab Notebooks/kupply-MLOps/checkpoint/train_1/kupply_epoch_5.pth


[TRAIN] Epoch:7:   0%|          | 0/12 [00:00<?, ?it/s]

INFO:root:EP:6 global_step:84 loss:0.3442 perplexity:1.4108
INFO:root:global_step: 84 model saved at /content/drive/MyDrive/Colab Notebooks/kupply-MLOps/checkpoint/train_1/kupply_epoch_6.pth


[TRAIN] Epoch:8:   0%|          | 0/12 [00:00<?, ?it/s]

INFO:root:EP:7 global_step:96 loss:0.3527 perplexity:1.4229
INFO:root:global_step: 96 model saved at /content/drive/MyDrive/Colab Notebooks/kupply-MLOps/checkpoint/train_1/kupply_epoch_7.pth


[TRAIN] Epoch:9:   0%|          | 0/12 [00:00<?, ?it/s]

INFO:root:EP:8 global_step:108 loss:0.3176 perplexity:1.3738
INFO:root:global_step: 108 model saved at /content/drive/MyDrive/Colab Notebooks/kupply-MLOps/checkpoint/train_1/kupply_epoch_8.pth


[TRAIN] Epoch:10:   0%|          | 0/12 [00:00<?, ?it/s]

INFO:root:EP:9 global_step:120 loss:0.2172 perplexity:1.2426
INFO:root:global_step: 120 model saved at /content/drive/MyDrive/Colab Notebooks/kupply-MLOps/checkpoint/train_1/kupply_epoch_9.pth


[TRAIN] Epoch:11:   0%|          | 0/12 [00:00<?, ?it/s]

INFO:root:EP:10 global_step:132 loss:0.1486 perplexity:1.1602
INFO:root:global_step: 132 model saved at /content/drive/MyDrive/Colab Notebooks/kupply-MLOps/checkpoint/train_1/kupply_epoch_10.pth


[TRAIN] Epoch:12:   0%|          | 0/12 [00:00<?, ?it/s]

INFO:root:EP:11 global_step:144 loss:0.0827 perplexity:1.0862
INFO:root:global_step: 144 model saved at /content/drive/MyDrive/Colab Notebooks/kupply-MLOps/checkpoint/train_1/kupply_epoch_11.pth


[TRAIN] Epoch:13:   0%|          | 0/12 [00:00<?, ?it/s]

INFO:root:EP:12 global_step:156 loss:0.0495 perplexity:1.0507
INFO:root:global_step: 156 model saved at /content/drive/MyDrive/Colab Notebooks/kupply-MLOps/checkpoint/train_1/kupply_epoch_12.pth


[TRAIN] Epoch:14:   0%|          | 0/12 [00:00<?, ?it/s]

INFO:root:EP:13 global_step:168 loss:0.0353 perplexity:1.0359
INFO:root:global_step: 168 model saved at /content/drive/MyDrive/Colab Notebooks/kupply-MLOps/checkpoint/train_1/kupply_epoch_13.pth


[TRAIN] Epoch:15:   0%|          | 0/12 [00:00<?, ?it/s]

INFO:root:EP:14 global_step:180 loss:0.0291 perplexity:1.0295
INFO:root:global_step: 180 model saved at /content/drive/MyDrive/Colab Notebooks/kupply-MLOps/checkpoint/train_1/kupply_epoch_14.pth


[TRAIN] Epoch:16:   0%|          | 0/12 [00:00<?, ?it/s]

INFO:root:EP:15 global_step:192 loss:0.0259 perplexity:1.0263
INFO:root:global_step: 192 model saved at /content/drive/MyDrive/Colab Notebooks/kupply-MLOps/checkpoint/train_1/kupply_epoch_15.pth


[TRAIN] Epoch:17:   0%|          | 0/12 [00:00<?, ?it/s]

INFO:root:EP:16 global_step:204 loss:0.0247 perplexity:1.0250
INFO:root:global_step: 204 model saved at /content/drive/MyDrive/Colab Notebooks/kupply-MLOps/checkpoint/train_1/kupply_epoch_16.pth


[TRAIN] Epoch:18:   0%|          | 0/12 [00:00<?, ?it/s]

INFO:root:EP:17 global_step:216 loss:0.0245 perplexity:1.0248
INFO:root:global_step: 216 model saved at /content/drive/MyDrive/Colab Notebooks/kupply-MLOps/checkpoint/train_1/kupply_epoch_17.pth


[TRAIN] Epoch:19:   0%|          | 0/12 [00:00<?, ?it/s]

INFO:root:EP:18 global_step:228 loss:0.0247 perplexity:1.0250
INFO:root:global_step: 228 model saved at /content/drive/MyDrive/Colab Notebooks/kupply-MLOps/checkpoint/train_1/kupply_epoch_18.pth


[TRAIN] Epoch:20:   0%|          | 0/12 [00:00<?, ?it/s]

INFO:root:EP:19 global_step:240 loss:0.0237 perplexity:1.0240
INFO:root:global_step: 240 model saved at /content/drive/MyDrive/Colab Notebooks/kupply-MLOps/checkpoint/train_1/kupply_epoch_19.pth


[TRAIN] Epoch:21:   0%|          | 0/12 [00:00<?, ?it/s]

INFO:root:EP:20 global_step:252 loss:0.0219 perplexity:1.0221
INFO:root:global_step: 252 model saved at /content/drive/MyDrive/Colab Notebooks/kupply-MLOps/checkpoint/train_1/kupply_epoch_20.pth


[TRAIN] Epoch:22:   0%|          | 0/12 [00:00<?, ?it/s]

INFO:root:EP:21 global_step:264 loss:0.0198 perplexity:1.0200
INFO:root:global_step: 264 model saved at /content/drive/MyDrive/Colab Notebooks/kupply-MLOps/checkpoint/train_1/kupply_epoch_21.pth


[TRAIN] Epoch:23:   0%|          | 0/12 [00:00<?, ?it/s]

INFO:root:EP:22 global_step:276 loss:0.0163 perplexity:1.0165
INFO:root:global_step: 276 model saved at /content/drive/MyDrive/Colab Notebooks/kupply-MLOps/checkpoint/train_1/kupply_epoch_22.pth


[TRAIN] Epoch:24:   0%|          | 0/12 [00:00<?, ?it/s]

INFO:root:EP:23 global_step:288 loss:0.0142 perplexity:1.0143
INFO:root:global_step: 288 model saved at /content/drive/MyDrive/Colab Notebooks/kupply-MLOps/checkpoint/train_1/kupply_epoch_23.pth


[TRAIN] Epoch:25:   0%|          | 0/12 [00:00<?, ?it/s]

INFO:root:EP:24 global_step:300 loss:0.0122 perplexity:1.0122
INFO:root:global_step: 300 model saved at /content/drive/MyDrive/Colab Notebooks/kupply-MLOps/checkpoint/train_1/kupply_epoch_24.pth


[TRAIN] Epoch:26:   0%|          | 0/12 [00:00<?, ?it/s]

INFO:root:EP:25 global_step:312 loss:0.0107 perplexity:1.0108
INFO:root:global_step: 312 model saved at /content/drive/MyDrive/Colab Notebooks/kupply-MLOps/checkpoint/train_1/kupply_epoch_25.pth


[TRAIN] Epoch:27:   0%|          | 0/12 [00:00<?, ?it/s]

INFO:root:EP:26 global_step:324 loss:0.0101 perplexity:1.0102
INFO:root:global_step: 324 model saved at /content/drive/MyDrive/Colab Notebooks/kupply-MLOps/checkpoint/train_1/kupply_epoch_26.pth


[TRAIN] Epoch:28:   0%|          | 0/12 [00:00<?, ?it/s]

INFO:root:EP:27 global_step:336 loss:0.0097 perplexity:1.0098
INFO:root:global_step: 336 model saved at /content/drive/MyDrive/Colab Notebooks/kupply-MLOps/checkpoint/train_1/kupply_epoch_27.pth


[TRAIN] Epoch:29:   0%|          | 0/12 [00:00<?, ?it/s]

INFO:root:EP:28 global_step:348 loss:0.0099 perplexity:1.0099
INFO:root:global_step: 348 model saved at /content/drive/MyDrive/Colab Notebooks/kupply-MLOps/checkpoint/train_1/kupply_epoch_28.pth


[TRAIN] Epoch:30:   0%|          | 0/12 [00:00<?, ?it/s]

INFO:root:EP:29 global_step:360 loss:0.0099 perplexity:1.0099
INFO:root:global_step: 360 model saved at /content/drive/MyDrive/Colab Notebooks/kupply-MLOps/checkpoint/train_1/kupply_epoch_29.pth


[TRAIN] Epoch:31:   0%|          | 0/12 [00:00<?, ?it/s]

INFO:root:EP:30 global_step:372 loss:0.0097 perplexity:1.0098
INFO:root:global_step: 372 model saved at /content/drive/MyDrive/Colab Notebooks/kupply-MLOps/checkpoint/train_1/kupply_epoch_30.pth


[TRAIN] Epoch:32:   0%|          | 0/12 [00:00<?, ?it/s]

INFO:root:EP:31 global_step:384 loss:0.0094 perplexity:1.0095
INFO:root:global_step: 384 model saved at /content/drive/MyDrive/Colab Notebooks/kupply-MLOps/checkpoint/train_1/kupply_epoch_31.pth


[TRAIN] Epoch:33:   0%|          | 0/12 [00:00<?, ?it/s]

INFO:root:EP:32 global_step:396 loss:0.0088 perplexity:1.0089
INFO:root:global_step: 396 model saved at /content/drive/MyDrive/Colab Notebooks/kupply-MLOps/checkpoint/train_1/kupply_epoch_32.pth


[TRAIN] Epoch:34:   0%|          | 0/12 [00:00<?, ?it/s]

INFO:root:EP:33 global_step:408 loss:0.0080 perplexity:1.0080
INFO:root:global_step: 408 model saved at /content/drive/MyDrive/Colab Notebooks/kupply-MLOps/checkpoint/train_1/kupply_epoch_33.pth


[TRAIN] Epoch:35:   0%|          | 0/12 [00:00<?, ?it/s]

INFO:root:EP:34 global_step:420 loss:0.0072 perplexity:1.0072
INFO:root:global_step: 420 model saved at /content/drive/MyDrive/Colab Notebooks/kupply-MLOps/checkpoint/train_1/kupply_epoch_34.pth


[TRAIN] Epoch:36:   0%|          | 0/12 [00:00<?, ?it/s]

INFO:root:EP:35 global_step:432 loss:0.0066 perplexity:1.0066
INFO:root:global_step: 432 model saved at /content/drive/MyDrive/Colab Notebooks/kupply-MLOps/checkpoint/train_1/kupply_epoch_35.pth


[TRAIN] Epoch:37:   0%|          | 0/12 [00:00<?, ?it/s]

INFO:root:EP:36 global_step:444 loss:0.0061 perplexity:1.0061
INFO:root:global_step: 444 model saved at /content/drive/MyDrive/Colab Notebooks/kupply-MLOps/checkpoint/train_1/kupply_epoch_36.pth


[TRAIN] Epoch:38:   0%|          | 0/12 [00:00<?, ?it/s]

INFO:root:EP:37 global_step:456 loss:0.0057 perplexity:1.0057
INFO:root:global_step: 456 model saved at /content/drive/MyDrive/Colab Notebooks/kupply-MLOps/checkpoint/train_1/kupply_epoch_37.pth


[TRAIN] Epoch:39:   0%|          | 0/12 [00:00<?, ?it/s]

INFO:root:EP:38 global_step:468 loss:0.0056 perplexity:1.0056
INFO:root:global_step: 468 model saved at /content/drive/MyDrive/Colab Notebooks/kupply-MLOps/checkpoint/train_1/kupply_epoch_38.pth


[TRAIN] Epoch:40:   0%|          | 0/12 [00:00<?, ?it/s]

INFO:root:EP:39 global_step:480 loss:0.0055 perplexity:1.0055
INFO:root:global_step: 480 model saved at /content/drive/MyDrive/Colab Notebooks/kupply-MLOps/checkpoint/train_1/kupply_epoch_39.pth


[TRAIN] Epoch:41:   0%|          | 0/12 [00:00<?, ?it/s]

INFO:root:EP:40 global_step:492 loss:0.0055 perplexity:1.0055
INFO:root:global_step: 492 model saved at /content/drive/MyDrive/Colab Notebooks/kupply-MLOps/checkpoint/train_1/kupply_epoch_40.pth


[TRAIN] Epoch:42:   0%|          | 0/12 [00:00<?, ?it/s]

INFO:root:EP:41 global_step:504 loss:0.0055 perplexity:1.0055
INFO:root:global_step: 504 model saved at /content/drive/MyDrive/Colab Notebooks/kupply-MLOps/checkpoint/train_1/kupply_epoch_41.pth


[TRAIN] Epoch:43:   0%|          | 0/12 [00:00<?, ?it/s]

INFO:root:EP:42 global_step:516 loss:0.0054 perplexity:1.0054
INFO:root:global_step: 516 model saved at /content/drive/MyDrive/Colab Notebooks/kupply-MLOps/checkpoint/train_1/kupply_epoch_42.pth


[TRAIN] Epoch:44:   0%|          | 0/12 [00:00<?, ?it/s]

INFO:root:EP:43 global_step:528 loss:0.0051 perplexity:1.0052
INFO:root:global_step: 528 model saved at /content/drive/MyDrive/Colab Notebooks/kupply-MLOps/checkpoint/train_1/kupply_epoch_43.pth


[TRAIN] Epoch:45:   0%|          | 0/12 [00:00<?, ?it/s]

INFO:root:EP:44 global_step:540 loss:0.0049 perplexity:1.0049
INFO:root:global_step: 540 model saved at /content/drive/MyDrive/Colab Notebooks/kupply-MLOps/checkpoint/train_1/kupply_epoch_44.pth


[TRAIN] Epoch:46:   0%|          | 0/12 [00:00<?, ?it/s]

INFO:root:EP:45 global_step:552 loss:0.0044 perplexity:1.0044
INFO:root:global_step: 552 model saved at /content/drive/MyDrive/Colab Notebooks/kupply-MLOps/checkpoint/train_1/kupply_epoch_45.pth


[TRAIN] Epoch:47:   0%|          | 0/12 [00:00<?, ?it/s]

INFO:root:EP:46 global_step:564 loss:0.0041 perplexity:1.0041
INFO:root:global_step: 564 model saved at /content/drive/MyDrive/Colab Notebooks/kupply-MLOps/checkpoint/train_1/kupply_epoch_46.pth


[TRAIN] Epoch:48:   0%|          | 0/12 [00:00<?, ?it/s]

INFO:root:EP:47 global_step:576 loss:0.0039 perplexity:1.0039
INFO:root:global_step: 576 model saved at /content/drive/MyDrive/Colab Notebooks/kupply-MLOps/checkpoint/train_1/kupply_epoch_47.pth


[TRAIN] Epoch:49:   0%|          | 0/12 [00:00<?, ?it/s]

INFO:root:EP:48 global_step:588 loss:0.0036 perplexity:1.0036
INFO:root:global_step: 588 model saved at /content/drive/MyDrive/Colab Notebooks/kupply-MLOps/checkpoint/train_1/kupply_epoch_48.pth


[TRAIN] Epoch:50:   0%|          | 0/12 [00:00<?, ?it/s]

INFO:root:EP:49 global_step:600 loss:0.0036 perplexity:1.0036
INFO:root:global_step: 600 model saved at /content/drive/MyDrive/Colab Notebooks/kupply-MLOps/checkpoint/train_1/kupply_epoch_49.pth


# 5. Evaluation (미완성, 당장은 필요 없음)

In [None]:
# Evaluation

def model_eval(model, val_dataloader):
  predictions = []
  gts = []

  model.eval() # evaluation mode 로 전환

  for batch_data in tqdm(val_dataloader):
    with torch.no_grad():
                contents = batch_data['content']
                labels = batch_data['label']
                attention_masks = batch_data['attention_mask']
                gt_labels=batch_data['gt_label']

                contents = contents.to(device)
                labels = labels.to(device)
                attention_masks = attention_masks.to(device)

                outputs = model(
                    contents, token_type_ids=None, attention_mask=attention_masks, labels=labels
                    )

                logits = outputs.logits
                # print(logits)
                # Predict Class (숫자값 - 소프트맥스값)
                predicted_labels = torch.argmax(logits, dim=1)

                # Accuracy 측정을 위해 pair-wise 하게 각각 리스트로 저장
                predictions.append(predicted_labels)
                gts.append(gt_labels)


  return predictions, gts


In [None]:
# 모델 성능 평가

pred, gt = model_eval(model_v1, val_dataloader)
# pred = pred.tolist()
# print(pred)
# print(len(val_dataloader))

**문제점**: batch_size 를 8 로 정했다보니, 그 output 도 batch 로 묶여서 나온다. \
따라서 이를 분리시켜주어야 한다.

In [38]:
# 실제 Inference

# pred = model_inference(model_v1, val_dataloader)
# print(pred)

  0%|          | 0/3 [00:00<?, ?it/s]

[tensor([1, 1, 1, 0, 1, 1, 1, 0], device='cuda:0'), tensor([0, 1, 1, 0, 1, 1, 1, 1], device='cuda:0'), tensor([1, 1, 1, 1, 1, 1, 1], device='cuda:0')]


# 6. Inference (UNLABELLED INPUT DATA)

In [None]:
# 예시 INPUT DATA 읽어오기 (실제 서비스 GET 형식 참고)


In [36]:
# Inference

def model_inference(model, apply_dataloader):

  predictions = []
  model.eval() # evaluation mode 로 전환 (= no parameter update)

  for batch_data in tqdm(apply_dataloader):

    with torch.no_grad():
      contents = batch_data['content']
      attention_masks = batch_data['attention_mask']

      contents = contents.to(device)
      attention_masks = attention_masks.to(device)

      outputs = model(
                contents, token_type_ids=None, attention_mask=attention_masks
                )

      logits = outputs.logits
      # print(logits)
      predicted_labels = torch.argmax(logits, dim=1)
      predictions.append(predicted_labels)

  return predictions

필요한 작업 (tbc) \
1. inference code 에서 앞 쪽에 코드 추가 \
2. 원하는 input 에 대해 원하는 output 으로 나오도록 format 통일