Adapting the T5-based text summarization example code to build a model that summarizes news articles, and evaluating its performance with ROUGE

In [None]:
pip install evaluate

In [None]:
pip install rouge_score

In [3]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [4]:
import pandas as pd
df_train=pd.read_csv('/content/drive/MyDrive/Colab Notebooks/train.csv')
df_test=pd.read_csv('/content/drive/MyDrive/Colab Notebooks/test.csv')
df_validation=pd.read_csv('/content/drive/MyDrive/Colab Notebooks/validation.csv')

In [5]:
df_train.head()

Unnamed: 0,id,article,highlights
0,0001d1afc246a7964130f43ae940af6bc6c57f01,By . Associated Press . PUBLISHED: . 14:11 EST...,"Bishop John Folda, of North Dakota, is taking ..."
1,0002095e55fcbd3a2f366d9bf92a95433dc305ef,(CNN) -- Ralph Mata was an internal affairs li...,Criminal complaint: Cop used his role to help ...
2,00027e965c8264c35cc1bc55556db388da82b07f,A drunk driver who killed a young woman in a h...,"Craig Eccleston-Todd, 27, had drunk at least t..."
3,0002c17436637c4fe1837c935c04de47adb18e9a,(CNN) -- With a breezy sweep of his pen Presid...,Nina dos Santos says Europe must be ready to a...
4,0003ad6ef0c37534f80b55b4235108024b407f0b,Fleetwood are the only team still to have a 10...,Fleetwood top of League One after 2-0 win at S...


In [6]:
import pandas as pd
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
from tqdm import tqdm

In [None]:
pip install datasets

**1. Loading the tokenizer and model**

In [8]:
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Use the same checkpoint for the tokenizer and the model (t5-small, loaded below)
tokenizer = AutoTokenizer.from_pretrained("t5-small", model_max_length=512)
print("Tokenizer type:", type(tokenizer))

model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")
print("Model type:", type(model))

Tokenizer type: <class 'transformers.models.t5.tokenization_t5_fast.T5TokenizerFast'>


Model type: <class 'transformers.models.t5.modeling_t5.T5ForConditionalGeneration'>


**2. GPU setup**

In [9]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)

**3. Defining the summary generation function**

Generates a summary for a given text.

In [14]:
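def summarize_text(text):
    # Preprocess: trim surrounding whitespace and replace newlines with spaces
    preprocess_text = text.strip().replace("\n", " ")
    input_text = "summarize: " + preprocess_text

    # Tokenize; truncate to 512 tokens so long articles stay within the
    # limit set on the tokenizer (encode() does not truncate by default)
    tokenized_text = tokenizer.encode(
        input_text, return_tensors="pt", truncation=True, max_length=512
    ).to(device)

    # Inference (generate the summary)
    with torch.no_grad():
        summary_ids = model.generate(
            tokenized_text,
            max_length=150,          # maximum summary length, in tokens
            min_length=50,           # minimum summary length, in tokens
            length_penalty=1.0,      # beam-search length penalty (1.0 is the default)
            num_beams=8,             # beam search with 8 beams
            no_repeat_ngram_size=3,  # never repeat the same 3-gram
            early_stopping=True      # stop once enough finished beams exist
        )

    # Decode the generated token ids back into text
    summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
    return summary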

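**4. DataFrame summarization function**

Generates summaries for all the articles in a DataFrame.

Calls summarize_text on each article (the article column) to generate its summary.

Collects all the summaries in a list and returns it; the caller stores the list in the DataFrame's generated_summary column.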

In [15]:
def summarize_dataframe(df):
    summaries = []
    for article in tqdm(df['article'], desc="Summarizing"):
        summary = summarize_text(article)
        summaries.append(summary)
    return summaries
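
The loop above generates one summary per call, which took about 3 hours 39 minutes for the 11,490 test articles in the run recorded further down. A minimal sketch of a batched variant follows. It is an assumption about how the loop could be sped up, not the code that produced the reported scores: `chunked` and `summarize_batched` are hypothetical helper names, the batch size of 16 is arbitrary, and padded batches may give slightly different output than one-at-a-time generation.

```python
def chunked(items, batch_size):
    """Yield consecutive slices of `items`, each with at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def summarize_batched(texts, tokenizer, model, device, batch_size=16, **generate_kwargs):
    """Summarize many texts, sending `batch_size` of them through generate() at once.

    `tokenizer` and `model` are expected to be the Hugging Face objects loaded
    earlier in the notebook; `generate_kwargs` is forwarded to model.generate()
    (for example max_length=150, num_beams=8).
    """
    summaries = []
    for batch in chunked(list(texts), batch_size):
        # Same preprocessing and task prefix as summarize_text
        prompts = ["summarize: " + t.strip().replace("\n", " ") for t in batch]
        # Pad to the longest prompt in the batch and truncate to 512 tokens
        inputs = tokenizer(
            prompts, return_tensors="pt", padding=True, truncation=True, max_length=512
        ).to(device)
        # generate() already runs without gradient tracking
        output_ids = model.generate(**inputs, **generate_kwargs)
        summaries.extend(tokenizer.batch_decode(output_ids, skip_special_tokens=True))
    return summaries
```

With the objects from the earlier cells this could be called as `summarize_batched(df_test['article'], tokenizer, model, device, max_length=150, min_length=50, num_beams=8)`. A larger batch size uses more GPU memory, so the value would need tuning for the available hardware.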

**5. ROUGE score function**

In [16]:
import evaluate
rouge = evaluate.load("rouge")

def calculate_rouge(predictions, references):
    # Compute ROUGE scores
    results = rouge.compute(predictions=predictions, references=references, rouge_types=["rouge1", "rouge2", "rougeL"])
    return {
        "ROUGE-1": results["rouge1"],
        "ROUGE-2": results["rouge2"],
        "ROUGE-L": results["rougeL"]
    }
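
To make the three metrics concrete, here is a simplified, pure-Python sketch of what ROUGE-N and ROUGE-L measure: the F1 of n-gram overlap, and the F1 based on the longest common subsequence. It assumes lowercased whitespace tokenization, so its numbers will not match the `evaluate` implementation exactly; `rouge_n_f1` and `rouge_l_f1` are hypothetical names used only for illustration.

```python
from collections import Counter


def _f1(overlap, pred_len, ref_len):
    """Harmonic mean of precision (overlap/pred_len) and recall (overlap/ref_len)."""
    if overlap == 0:
        return 0.0
    precision = overlap / pred_len
    recall = overlap / ref_len
    return 2 * precision * recall / (precision + recall)


def _ngrams(tokens, n):
    """Count every contiguous n-gram in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n_f1(prediction, reference, n=1):
    """ROUGE-N F1: clipped n-gram overlap between prediction and reference."""
    pred = _ngrams(prediction.lower().split(), n)
    ref = _ngrams(reference.lower().split(), n)
    overlap = sum((pred & ref).values())  # counts clipped to the smaller side
    return _f1(overlap, sum(pred.values()), sum(ref.values()))


def rouge_l_f1(prediction, reference):
    """ROUGE-L F1: based on the longest common subsequence (LCS) of tokens."""
    pred = prediction.lower().split()
    ref = reference.lower().split()
    prev = [0] * (len(ref) + 1)  # LCS lengths for the previous prediction token
    for p in pred:
        cur = [0]
        for j, r in enumerate(ref, start=1):
            cur.append(prev[j - 1] + 1 if p == r else max(prev[j], cur[j - 1]))
        prev = cur
    return _f1(prev[-1], len(pred), len(ref))
```

For the prediction "the cat sat on the mat" and the reference "the cat is on the mat", ROUGE-1 and ROUGE-L are both 5/6 (about 0.833) and ROUGE-2 is 0.6: five of six unigrams match, the longest common subsequence has five tokens, and three of five bigrams match.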

In [17]:
# 6-1) Summarize the test data
print("Summarizing the test dataset:")
df_test['generated_summary'] = summarize_dataframe(df_test)

# 6-2) Compute ROUGE scores (pass plain lists rather than pandas Series)
print("ROUGE scores:")
rouge_scores = calculate_rouge(df_test['generated_summary'].tolist(), df_test['highlights'].tolist())

# 6-3) Print the scores
print("\nROUGE Evaluation Results:")
for metric, score in rouge_scores.items():
    print(f"{metric}: {score:.4f}")

# 6-4) Print a few example summaries
print("Results:")
for i in range(3):
    print(f"Article {i+1}:\n{df_test['article'].iloc[i][:200]}...")
    print(f"Reference Summary:\n{df_test['highlights'].iloc[i]}")
    print(f"Generated Summary:\n{df_test['generated_summary'].iloc[i]}\n")


Summarizing the test dataset:


Summarizing: 100%|██████████| 11490/11490 [3:39:19<00:00,  1.15s/it]


ROUGE scores:

ROUGE Evaluation Results:
ROUGE-1: 0.3661
ROUGE-2: 0.1602
ROUGE-L: 0.2585
Results:
Article 1:
Ever noticed how plane seats appear to be getting smaller and smaller? With increasing numbers of people taking to the skies, some experts are questioning if having such packed out planes is putting p...
Reference Summary:
Experts question if  packed out planes are putting passengers at risk .
U.S consumer advisory group says minimum space must be stipulated .
Safety tests conducted on planes with more leg room than airlines offer .
Generated Summary:
a consumer advisory group set up by the department of transportation said that while the government is happy to set standards for animals flying on planes, it doesn't stipulate a minimum amount of space for humans . many economy seats on united airlines have 30 inches of space, while some airlines offer as little as 28 inches .

Article 2:
A drunk teenage boy had to be rescued by security after jumping into a lions' enclosure at a zoo in w

**Fine-tuning to improve performance**

In [None]:
pip install transformers datasets evaluate torch accelerate
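
The Trainer cell further down uses `tokenized_datasets`, `data_collator`, and `compute_metrics`, which none of the cells shown in this notebook define. The following is a minimal sketch of how they might be prepared, under stated assumptions: the helper names (`preprocess_batch`, `build_finetuning_inputs`, `make_compute_metrics`) and the 128-token target length are hypothetical, and the original notebook may have done this differently.

```python
def preprocess_batch(batch, tokenizer, max_input_len=512, max_target_len=128):
    """Tokenize articles (with the T5 task prefix) and their reference highlights."""
    inputs = ["summarize: " + article for article in batch["article"]]
    model_inputs = tokenizer(inputs, max_length=max_input_len, truncation=True)
    labels = tokenizer(
        text_target=batch["highlights"], max_length=max_target_len, truncation=True
    )
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs


def build_finetuning_inputs(df_train, df_validation, tokenizer, model):
    """Turn the pandas DataFrames into tokenized datasets plus a padding collator."""
    # Imported here so the helpers in this cell can be defined without these packages
    from datasets import Dataset, DatasetDict
    from transformers import DataCollatorForSeq2Seq

    columns = ["article", "highlights"]
    raw = DatasetDict({
        "train": Dataset.from_pandas(df_train[columns], preserve_index=False),
        "validation": Dataset.from_pandas(df_validation[columns], preserve_index=False),
    })
    tokenized = raw.map(
        lambda batch: preprocess_batch(batch, tokenizer),
        batched=True,
        remove_columns=columns,  # keep only input_ids, attention_mask, labels
    )
    # Pads inputs and labels per batch; label padding is set to -100 so the loss ignores it
    collator = DataCollatorForSeq2Seq(tokenizer, model=model)
    return tokenized, collator


def make_compute_metrics(tokenizer, rouge):
    """Build a compute_metrics function that decodes predictions and scores them with ROUGE."""
    pad_id = tokenizer.pad_token_id

    def _clean(batch):
        # -100 marks ignored positions; swap it for the pad id before decoding
        return [[int(t) if t != -100 else pad_id for t in seq] for seq in batch]

    def compute_metrics(eval_pred):
        predictions, labels = eval_pred
        decoded_preds = tokenizer.batch_decode(_clean(predictions), skip_special_tokens=True)
        decoded_labels = tokenizer.batch_decode(_clean(labels), skip_special_tokens=True)
        return rouge.compute(
            predictions=decoded_preds,
            references=decoded_labels,
            rouge_types=["rouge1", "rouge2", "rougeL"],
        )

    return compute_metrics


# Possible wiring with the objects created earlier in this notebook:
# tokenized_datasets, data_collator = build_finetuning_inputs(df_train, df_validation, tokenizer, model)
# compute_metrics = make_compute_metrics(tokenizer, rouge)
```

Keeping the tokenizer and metric as arguments rather than globals makes each helper easy to check on a handful of rows before committing to a multi-hour training run.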


In [None]:
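from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="./t5-summarization",    # where checkpoints are saved
    evaluation_strategy="epoch",       # evaluate once per epoch (renamed eval_strategy in newer transformers releases)
    learning_rate=5e-5,                # learning rate
    per_device_train_batch_size=8,     # training batch size
    per_device_eval_batch_size=8,      # evaluation batch size
    weight_decay=0.01,                 # weight decay
    save_total_limit=3,                # number of checkpoints to keep
    num_train_epochs=3,                # number of training epochs
    predict_with_generate=True,        # generate summaries during evaluation
    fp16=torch.cuda.is_available(),    # mixed precision when a GPU is available
)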

In [None]:
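from transformers import Seq2SeqTrainer

# 7. Set up the Trainer
# Note: tokenized_datasets, data_collator, and compute_metrics must be defined
# beforehand; the cells that create them are not shown in this notebook.
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"].shuffle(seed=42).select(range(20000)),  # training sample (20,000 examples)
    eval_dataset=tokenized_datasets["validation"].select(range(1000)),               # validation sample (1,000 examples)
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

# 8. Train the model
trainer.train()

# 9. Save the model
trainer.save_model("./fine_tuned_t5")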