### 2025-06-16 23:33
1. (o) spark_encode.py : verified -> encodes an int8 value into a SPARK code (str)
2. (o) spark_decode.py : verified -> decodes a SPARK code (str) back to a decimal value
3. ( ) Load the pre-trained BERT-Base model
4. ( ) Extract the weight tensors and apply int8 quantization
5. ( ) Pass each weight through spark_encode -> spark_decode
6. ( ) Restore the decoded values to float and load them back into the model
7. ( ) Evaluate on the GLUE SST-2 dataset
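
Steps 4 to 6 can be tried on a plain list of floats before touching the model. This is a minimal sketch, not the notebook's implementation: `identity_codec` is a hypothetical stand-in for the `spark_encode` -> `spark_decode` round trip (whose internals are not shown here), and `quantize_roundtrip` is an illustrative name.

```python
def identity_codec(magnitude: int) -> int:
    """Hypothetical placeholder for spark_decode(spark_encode(magnitude)[0])."""
    return magnitude


def quantize_roundtrip(weights, codec=identity_codec):
    """Symmetric int8 quantization -> codec on magnitudes -> float restore."""
    max_val = max(abs(w) for w in weights)
    if max_val == 0:
        return list(weights)  # all zeros: nothing to quantize
    scale = 127 / max_val
    restored = []
    for w in weights:
        q = max(-128, min(127, round(w * scale)))  # step 4: float -> int8
        mag = codec(abs(q))                        # step 5: encode -> decode
        restored.append((-mag if q < 0 else mag) / scale)  # step 6: int8 -> float
    return restored


weights = [0.5, -0.25, 0.1, -1.0]
restored = quantize_roundtrip(weights)
# Largest absolute error introduced by the round trip
max_err = max(abs(a - b) for a, b in zip(weights, restored))
```

With a lossless codec the error comes from int8 rounding alone and stays within half a quantization step (`0.5 / scale`). Any extra error seen with the real codec would then be attributable to the SPARK encoding.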

-model : BERT-Base - well suited to classification tasks \
-dataset : SST-2 from GLUE - movie reviews & human annotations of sentiment \
> task : predict sentiment of a given sentence (positive/negative)

-Baseline : BERT-Base FP32 \
-Target : BERT-Base SPARK (After INT8) \
-Independent variable : data type (INT8 vs SPARK) * SPARK : int8 -> encoding(4/8) -> decoding \
-Controlled variable : pre-trained model

In [None]:
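# Install the Transformers library, plus PyTorch (required by the BERT model classes),
# datasets and scikit-learn (both used in the evaluation cell)
! pip install transformers torch datasets scikit-learn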

In [7]:
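from transformers import BertForSequenceClassification, BertTokenizer
# Note: "bert-base-uncased" ships without a classification head, so the head is
# randomly initialized here; accuracy is only meaningful after fine-tuning on SST-2.
model = BertForSequenceClassification.from_pretrained("bert-base-uncased")
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")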

ImportError: 
BertForSequenceClassification requires the PyTorch library but it was not found in your environment. Checkout the instructions on the
installation page: https://pytorch.org/get-started/locally/ and follow the ones that match your environment.
Please note that you may need to restart your runtime after installation.


In [None]:
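import torch

# Assumption: the verified helpers live in spark_encode.py / spark_decode.py
# under the same names as the functions called below.
from spark_encode import spark_encode
from spark_decode import spark_decode

scale_map = {}  # per-layer scale, kept for restoring the float values

with torch.no_grad():
    for name, param in model.named_parameters():
        if "weight" in name and param.requires_grad:
            # 1. float -> INT8 (symmetric, one scale per tensor)
            max_val = param.abs().max().item()
            if max_val == 0:
                continue  # all-zero tensor: nothing to quantize
            scale = 127 / max_val
            scale_map[name] = scale
            int8_tensor = torch.round(param * scale).clamp(-128, 127).to(torch.int8)

            # 2. SPARK encode + decode, one value at a time
            decoded_vals = []
            for v in int8_tensor.flatten().tolist():
                encoded, _, _ = spark_encode(abs(v))  # the codec receives the magnitude only
                decoded = spark_decode(encoded)
                restored = decoded / scale  # back to float
                decoded_vals.append(-restored if v < 0 else restored)  # reapply the sign

            # 3. Write the restored values back into the model, keeping dtype and device
            decoded_tensor = torch.tensor(decoded_vals, dtype=param.dtype, device=param.device)
            param.copy_(decoded_tensor.reshape(param.shape))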
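
The per-element loop above calls the codec once for every weight, which is slow for a model of this size. Because the codec only ever sees an int8 magnitude (0 to 128), one alternative is to run it once per magnitude and reuse the results. The sketch below assumes the codec is deterministic; `build_lut` and `decode_with_lut` are hypothetical helper names, and `codec` stands for something like `lambda m: spark_decode(spark_encode(m)[0])`.

```python
def build_lut(codec):
    """Run the round trip once for each possible int8 magnitude (0..128)."""
    return [codec(m) for m in range(129)]


def decode_with_lut(int8_values, lut, scale):
    """Restore floats from signed int8 values using the precomputed table."""
    out = []
    for v in int8_values:
        mag = lut[abs(v)]  # look up the decoded magnitude
        out.append((-mag if v < 0 else mag) / scale)  # reapply sign, undo scale
    return out


# Demo with an identity codec standing in for the real SPARK round trip.
lut = build_lut(lambda m: m)
restored = decode_with_lut([-127, 0, 64], lut, scale=127.0)
```

In the notebook, the same table could be turned into a tensor and indexed with the magnitudes of `int8_tensor` (cast to a wider integer type first), which would replace the Python loop with a single indexing operation.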


In [None]:
from datasets import load_dataset
from transformers import TrainingArguments, Trainer
from sklearn.metrics import accuracy_score

# Load the dataset
dataset = load_dataset("glue", "sst2")

def preprocess(example):
    return tokenizer(example["sentence"], truncation=True, padding="max_length", max_length=128)

encoded_dataset = dataset.map(preprocess, batched=True)
encoded_dataset.set_format(type="torch", columns=["input_ids", "attention_mask", "label"])

# Accuracy metric
def compute_metrics(p):
    preds = p.predictions.argmax(axis=-1)
    return {"accuracy": accuracy_score(p.label_ids, preds)}

# Trainer setup (evaluation only)
training_args = TrainingArguments(output_dir="./spark_eval", per_device_eval_batch_size=64)
trainer = Trainer(model=model, args=training_args, compute_metrics=compute_metrics)

# Run the evaluation
eval_result = trainer.evaluate(eval_dataset=encoded_dataset["validation"])
print(f"\n✅ Accuracy after SPARK decoding: {eval_result['eval_accuracy']:.4f}")
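
To compare the Baseline and Target arms, the same accuracy computation can be applied to both sets of predictions on the same labels. A minimal pure-Python sketch: `accuracy` and `accuracy_drop` are illustrative names, and the labels and predictions below are toy values, not results from this experiment.

```python
def accuracy(labels, preds):
    """Fraction of predictions that match the labels."""
    if len(labels) != len(preds):
        raise ValueError("labels and preds must have the same length")
    return sum(l == p for l, p in zip(labels, preds)) / len(labels)


def accuracy_drop(labels, baseline_preds, target_preds):
    """Baseline accuracy minus target accuracy (positive means the target is worse)."""
    return accuracy(labels, baseline_preds) - accuracy(labels, target_preds)


# Toy values for illustration only
labels = [1, 0, 1, 1]
baseline_preds = [1, 0, 1, 1]  # stands in for the FP32 model's predictions
target_preds = [1, 0, 0, 1]    # stands in for the SPARK-decoded model's predictions
drop = accuracy_drop(labels, baseline_preds, target_preds)
```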
