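# 1. Overview of the Hugging Face BART model
- A Transformer encoder-decoder (seq2seq) model
- A BERT-like bidirectional encoder combined with a GPT-like autoregressive decoder
- Pretrained by corrupting text with noise and learning to reconstruct the original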
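
The noise-then-reconstruct objective can be illustrated with a toy corruption function. This is a minimal sketch, not BART's actual preprocessing: the helper name `infill_span`, the whitespace tokenization, and the hand-picked span are assumptions made for illustration. It only mimics the idea of collapsing a span of tokens into a single `<mask>` that the model must expand again.

```python
def infill_span(tokens, start, length, mask_token="<mask>"):
    """Replace tokens[start:start+length] with a single mask token.

    Hypothetical helper: mimics the idea of text infilling, where a whole
    span collapses into one mask and the model must restore the original.
    """
    if start < 0 or length < 0 or start + length > len(tokens):
        raise ValueError("span is out of range")
    return tokens[:start] + [mask_token] + tokens[start + length:]


original = "I went to the grocery store today .".split()
corrupted = infill_span(original, start=3, length=3)

# Training pair: the encoder reads the corrupted text,
# and the decoder is trained to emit the original text.
source_text = " ".join(corrupted)
target_text = " ".join(original)
print(source_text)
print(target_text)
```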

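# 2. Mask-filling model
Supported by the facebook/bart-base and facebook/bart-large checkpoints.

During pretraining, BART applies several kinds of noise to the input, including token masking in the style of a masked language model (MLM).
As a result, given an input sentence that contains a <mask> token, it can perform mask filling much like BERT. facebook/bart-large is the representative checkpoint that supports mask filling.
(In contrast, summarization-specific checkpoints such as bart-large-cnn do not support mask filling → no mask_token_id.) To use it, replace some words or phrases in the input sentence with <mask>.

- More flexible than BERT: BERT can also fill masks, but what it does is closer to predicting a single word
- BART is trained to reconstruct whole sentences, so it can restore context more naturally
- Multiple masks: when a sentence contains several <mask> tokens, they can be predicted sequentially or at the same time
- Generative: because it decodes autoregressively like GPT, it can produce a variety of context-appropriate candidates rather than a simple word substitution
- Applications: cloze tests (fill-in-the-blank quizzes), text restoration (damaged documents, OCR error correction), and data augmentation (mask part of a sentence and fill it in different ways)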
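
The data augmentation use case starts by producing masked variants of a sentence. A minimal sketch, assuming whitespace-separated words: `masked_variants` is a hypothetical helper, and `<mask>` is assumed to be the mask token of the BART tokenizer. Each variant could then be passed to a `fill-mask` pipeline to collect alternative completions.

```python
def masked_variants(sentence, mask_token="<mask>"):
    """Return one copy of the sentence per word, with that word masked.

    Hypothetical helper for illustration; splits on whitespace only,
    so punctuation attached to a word is masked together with it.
    """
    words = sentence.split()
    variants = []
    for i in range(len(words)):
        masked = words[:i] + [mask_token] + words[i + 1:]
        variants.append(" ".join(masked))
    return variants


variants = masked_variants("I went to the gym today.")
for variant in variants:
    print(variant)
```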

### 2-1. Cloze test (fill-in-the-blank quiz)

In [64]:
from transformers import pipeline, AutoTokenizer

# BART-large is an English-only model
unmask = pipeline("fill-mask", model="facebook/bart-large")
tok = AutoTokenizer.from_pretrained("facebook/bart-large")

print("mask_token:", tok.mask_token)  # -> <mask>

# Example English input
text = f"I went to the {tok.mask_token} today."
results = unmask(text, top_k=5)

for r in results:
    print("→", r["sequence"], round(r["score"], 4))


Device set to use cuda:0


mask_token: <mask>
→ I went to the doctor today. 0.1187
→ I went to the dentist today. 0.0778
→ I went to the gym today. 0.0483
→ I went to the grocery today. 0.0309
→ I went to the chirop today. 0.0247


In [63]:
##################
## klue/roberta-base is a Korean-only model released for the KLUE (Korean Language Understanding Evaluation) benchmark, which was developed in Korea.
##################

from transformers import pipeline, AutoTokenizer

model_id = "klue/roberta-base"
unmask = pipeline("fill-mask", model=model_id, tokenizer=model_id)
tok = AutoTokenizer.from_pretrained(model_id)

text = f"나는 오늘 {tok.mask_token}에 갔다."  # Korean for "I went to [MASK] today."
results = unmask(text, top_k=5)

for r in results:
    print("→", r["sequence"], round(r["score"], 4))


Device set to use cuda:0


→ 나는 오늘 에 갔다. 0.0872
→ 나는 오늘? 에 갔다. 0.038
→ 나는 오늘 # 에 갔다. 0.0152
→ 나는 오늘. 에 갔다. 0.0104
→ 나는 오늘 제주도 에 갔다. 0.0096


# 3. Text summarization model: facebook/bart-large-cnn

In [1]:
from transformers import pipeline

# Create the summarization pipeline
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")




Device set to use cuda:0


In [20]:
from IPython.display import display
from transformers import pipeline, BartTokenizer

# 1. Load the model and tokenizer
def load_bart():
    summarizer = pipeline("summarization", model="facebook/bart-large-cnn")
    tokenizer = BartTokenizer.from_pretrained("facebook/bart-large-cnn")
    return summarizer, tokenizer

# Load once and reuse, instead of reloading the model on every call
summarizer, tokenizer = load_bart()

# 2. Compare summaries produced under different length limits
def test_lengths(text, min_len, max_len):
    # Token counts include the special tokens added by the tokenizer
    original_tokens = len(tokenizer.encode(text))

    result = summarizer(text, max_length=max_len, min_length=min_len, do_sample=False)
    summary = result[0]['summary_text']
    summary_tokens = len(tokenizer.encode(summary))

    display(f"Original tokens: {original_tokens}")
    display(f"Setting: {min_len}~{max_len} tokens")
    display(f"Summary: {summary}")
    display(f"Actual tokens: {summary_tokens}")
    display("-" * 30)

# Test input
text = """Artificial intelligence is revolutionizing healthcare by transforming how doctors diagnose diseases and treat patients.
Machine learning algorithms can now analyze medical images like X-rays and MRIs with remarkable accuracy, often matching experienced radiologists. AI is also accelerating drug
discovery, with algorithms analyzing molecular structures much faster than traditional methods."""


# Run
test_lengths(text, 20, 40)   # short
test_lengths(text, 40, 80)   # medium
test_lengths(text, 80, 120)  # long

Device set to use cuda:0


'Original tokens: 67'

'Setting: 20~40 tokens'

'Summary: Machine learning algorithms can now analyze medical images like X-rays and MRIs with remarkable accuracy. AI is also accelerating drug discovery, with algorithms analyzing molecular structures much faster than traditional'

'Actual tokens: 37'

'------------------------------'

Your max_length is set to 80, but your input_length is only 66. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=33)


'Original tokens: 67'

'Setting: 40~80 tokens'

'Summary: Machine learning algorithms can analyze medical images like X-rays and MRIs with remarkable accuracy, often matching experienced radiologists. AI is also accelerating drug discovery, with algorithms analyzing molecular structures much faster than traditional methods.'

'Actual tokens: 44'

'------------------------------'

Your max_length is set to 120, but your input_length is only 66. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=33)


'Original tokens: 67'

'Setting: 80~120 tokens'

'Summary: Machine learning algorithms can now analyze medical images like X-rays and MRIs with remarkable accuracy, often matching experienced radiologists. AI is also accelerating drug discovery, with algorithms analyzing molecular structures much faster than traditional methods. It is revolutionizing healthcare by transforming how doctors diagnose diseases and treat patients. For more information on how to use machine learning in your healthcare, visit: www.nhs.uk/machine-learning.'

'Actual tokens: 86'

'------------------------------'
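
The warnings above appear because `max_length` exceeds the input length, and in the 80~120 run the forced minimum pushed the model to append a sentence and a URL that do not appear in the source text. One way to avoid this is to derive both limits from the input token count. A minimal sketch: `choose_summary_lengths` and its default ratios are hypothetical choices for illustration, not values recommended by the library.

```python
def choose_summary_lengths(input_tokens, min_ratio=0.3, max_ratio=0.6):
    """Pick (min_length, max_length) as fractions of the input token count.

    Hypothetical helper: the ratios are illustrative defaults, chosen only
    so that max_length stays below the input length.
    """
    if input_tokens < 3:
        raise ValueError("input is too short to summarize")
    if not 0 < min_ratio <= max_ratio < 1:
        raise ValueError("ratios must satisfy 0 < min_ratio <= max_ratio < 1")
    max_length = max(2, int(input_tokens * max_ratio))
    min_length = max(1, min(int(input_tokens * min_ratio), max_length - 1))
    return min_length, max_length


# 67 is the input token count reported in the runs above
min_len, max_len = choose_summary_lengths(67)
print(min_len, max_len)
```

The resulting pair could be passed on as `summarizer(text, min_length=min_len, max_length=max_len, do_sample=False)`.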

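# 4. Text classification model: facebook/bart-large-mnli
BART-MNLI is BART fine-tuned on MNLI, a natural language inference (NLI) dataset. It uses this inference ability to perform zero-shot classification:
- each candidate label is converted into a hypothesis of the form "This text is about {label}"
- the model estimates how likely each hypothesis is to be true or false given the text
- the label with the highest probability is selected

In other words, MNLI training teaches BART the logical relationship between two texts, and this can be applied to a wide range of classification tasks.

To get the most out of this model:
(1) Choose labels that are relevant to the text
(2) Use multi-label mode for ambiguous cases
(3) Re-examine predictions with a confidence of 0.5 or lower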
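
The scoring step can be sketched with plain softmax arithmetic. This is an illustration under assumptions, not the pipeline's source code: the logits below are made-up numbers and the function names are hypothetical. It shows two common ways of turning per-label NLI logits into scores. In single-label mode the entailment logits compete across labels. In multi-label mode each label is scored independently from its own entailment and contradiction logits.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def single_label_scores(entailment_logits):
    """Scores sum to 1 across labels: labels compete with each other."""
    return softmax(entailment_logits)


def multi_label_scores(contradiction_logits, entailment_logits):
    """Each label gets its own entailment-vs-contradiction probability."""
    return [
        softmax([c, e])[1]
        for c, e in zip(contradiction_logits, entailment_logits)
    ]


labels = ["technology", "health", "fitness"]
# Made-up logits, one pair per label (assumption for illustration only)
contradiction = [-2.0, -1.0, -1.5]
entailment = [3.0, 2.0, 2.5]

single = single_label_scores(entailment)
multi = multi_label_scores(contradiction, entailment)
for name, s, m in zip(labels, single, multi):
    print(f"{name}: single={s:.3f} multi={m:.3f}")
```

Under this scheme a multi-label run can return several labels above 0.5 at once, while single-label scores always sum to 1.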

### 4-1. Exploring the MNLI dataset

In [None]:
from datasets import load_dataset

# Load the MNLI dataset
dataset = load_dataset("nyu-mll/multi_nli")


In [24]:
def explore_mnli():
    """Explore basic information about the MNLI dataset."""

    # 1. Dataset structure
    print("=== Dataset structure ===")
    print(f"Splits: {list(dataset.keys())}")
    print(f"Training set size: {len(dataset['train']):,}")
    print(f"Validation set size: {len(dataset['validation_matched']):,}")
    print()

    # 2. Columns
    print("=== Columns ===")
    print(f"Columns: {dataset['train'].column_names}")
    print()

    # 3. Label information
    print("=== Labels ===")
    labels = ["entailment", "neutral", "contradiction"]
    for i, label in enumerate(labels):
        print(f"{i}: {label}")
    print()

    # 4. Sample data
    print("=== Samples ===")
    for i in range(3):
        sample = dataset['train'][i]
        print(f"Sample {i+1}:")
        print(f"  Premise: {sample['premise']}")
        print(f"  Hypothesis: {sample['hypothesis']}")
        print(f"  Label: {labels[sample['label']]}")
        print(f"  Genre: {sample['genre']}")
        print()
explore_mnli()

=== Dataset structure ===
Splits: ['train', 'validation_matched', 'validation_mismatched']
Training set size: 392,702
Validation set size: 9,815

=== Columns ===
Columns: ['promptID', 'pairID', 'premise', 'premise_binary_parse', 'premise_parse', 'hypothesis', 'hypothesis_binary_parse', 'hypothesis_parse', 'genre', 'label']

=== Labels ===
0: entailment
1: neutral
2: contradiction

=== Samples ===
Sample 1:
  Premise: Conceptually cream skimming has two basic dimensions - product and geography.
  Hypothesis: Product and geography are what make cream skimming work. 
  Label: neutral
  Genre: government

Sample 2:
  Premise: you know during the season and i guess at at your level uh you lose them to the next level if if they decide to recall the the parent team the Braves decide to call to recall a guy from triple A then a double A guy goes up to replace him and a single A guy goes up to replace him
  Hypothesis: You lose the things to the following level if the people recall.
  Label: entailment
  Genre: telephone

Sample 3:
  Premise: One of our number will carry out your instru

In [27]:
def check_genres():
    """Show the distribution of genres in the training split."""

    from collections import Counter

    # Count genres by reading the column directly
    genre_counts = Counter(dataset['train']['genre'])

    print("=== Genre distribution ===")
    for genre, count in genre_counts.most_common():
        print(f"{genre}: {count:,}")
check_genres()

=== Genre distribution ===
telephone: 83,348
government: 77,350
travel: 77,350
fiction: 77,348
slate: 77,306


In [28]:
def check_labels():
    """Show the distribution of labels in the training split."""

    from collections import Counter

    # Count labels by reading the column directly
    label_counts = Counter(dataset['train']['label'])

    print("=== Label distribution ===")
    label_names = ["entailment", "neutral", "contradiction"]
    for label_id, count in sorted(label_counts.items()):
        print(f"{label_names[label_id]}: {count:,}")
check_labels()

=== Label distribution ===
entailment: 130,899
neutral: 130,900
contradiction: 130,903


In [32]:
def find_examples(keyword, limit=3):
    """Print the first few training examples that contain a keyword."""

    labels = ["entailment", "neutral", "contradiction"]
    needle = keyword.lower()

    print(f"=== Examples containing '{keyword}' ===")
    count = 0
    for sample in dataset['train']:
        if needle in sample['premise'].lower() or needle in sample['hypothesis'].lower():
            print(f"Premise: {sample['premise']}")
            print(f"Hypothesis: {sample['hypothesis']}")
            print(f"Label: {labels[sample['label']]}")
            print("-" * 30)
            count += 1
            if count >= limit:  # stop once enough examples are printed
                break
find_examples("dog")  # find examples containing the keyword 'dog'

=== Examples containing 'dog' ===
Premise: it really is our kids are all grown and gone and away from home so our our new family is the you know our two cats and our dog we never really well we had we did have some time to devote to them you know but not nearly as much time as we have now so they've really become children they're they're real characters they really well all of them are
Hypothesis: Since our kids are all grown and gone, we have the pets to fill their spots.
Label: entailment
------------------------------
Premise: make sure that they didn't have to do it again  make hot dogs or some potato chips or
Hypothesis: Make sure to have steaks and potatoes. 
Label: contradiction
------------------------------
Premise: They also regularly hold sheep dog and sheep-shearing demonstrations, all in a covered auditorium, which allows you to watch the shepherds at work without having to stand out on the hillsides.
Hypothesis: If you want to watch the shepherds at work, you can go to the auditorium.
Label: entailment
------------------------------

### 4-2. Inference with the classification model

In [22]:
from transformers import pipeline, AutoTokenizer

# Load the model (run only once)
classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")
tokenizer = AutoTokenizer.from_pretrained("facebook/bart-large-mnli")

def test_classification(text, labels, multi_label=False):
    """Basic classification test."""
    tokens = len(tokenizer.encode(text))  # includes special tokens
    result = classifier(text, labels, multi_label=multi_label)

    print(f"Text: {text}")
    print(f"Tokens: {tokens}, multi-label: {multi_label}")
    print("Result:", result['labels'][0], f"({result['scores'][0]:.3f})")
    print("-" * 30)

def compare_confidence():
    """Compare a high-confidence case with a low-confidence case."""
    # High confidence: clearly financial text
    high_result = classifier("Apple stock increased 5% after earnings.", ["finance", "sports"])

    # Low confidence: labels unrelated to the text
    low_result = classifier("Nice weather today.", ["technology", "politics"])

    print("Confidence comparison:")
    print(f"High: {high_result['scores'][0]:.3f}")
    print(f"Low: {low_result['scores'][0]:.3f}")
    print("-" * 30)

def compare_languages():
    """Compare token counts for English and Korean."""
    en_tokens = len(tokenizer.encode("AI transforms healthcare."))
    ko_tokens = len(tokenizer.encode("AI가 의료를 변화시킨다."))

    print("Token counts by language:")
    print(f"English: {en_tokens}, Korean: {ko_tokens}")
    print(f"Ratio: {ko_tokens/en_tokens:.2f}x")
    print("-" * 30)

def compare_modes():
    """Compare single-label and multi-label modes."""
    text = "AI fitness app monitors health data."
    labels = ["technology", "health", "fitness"]

    # Single label: pick exactly one
    single = classifier(text, labels, multi_label=False)
    print(f"Single: {single['labels'][0]}")

    # Multi label: keep every label above the threshold
    multi = classifier(text, labels, multi_label=True)
    selected = [l for l, s in zip(multi['labels'], multi['scores']) if s > 0.5]
    print(f"Multi: {selected}")
    print("-" * 30)

# Run
if __name__ == "__main__":
    test_classification("iPhone has better camera.", ["technology", "sports"])
    compare_confidence()
    compare_languages()
    compare_modes()

Device set to use cuda:0


Text: iPhone has better camera.
Tokens: 7, multi-label: False
Result: technology (0.987)
------------------------------
Confidence comparison:
High: 0.916
Low: 0.702
------------------------------
Token counts by language:
English: 6, Korean: 30
Ratio: 5.00x
------------------------------
Single: technology
Multi: ['technology', 'health', 'fitness']
------------------------------
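
The gap between 6 and 30 tokens is consistent with BART using a byte-level BPE vocabulary built mostly from English text. Hangul syllables take three bytes each in UTF-8, and such byte sequences are unlikely to have been merged into larger tokens. The sketch below only counts characters and UTF-8 bytes with the standard library. It does not run the tokenizer, so the link to the token counts is an interpretation rather than a measurement.

```python
import unicodedata


def utf8_profile(text):
    """Return (number of characters, number of UTF-8 bytes) for a string.

    The text is normalized to NFC first so that each Hangul syllable is
    counted as one precomposed character.
    """
    normalized = unicodedata.normalize("NFC", text)
    return len(normalized), len(normalized.encode("utf-8"))


en_chars, en_bytes = utf8_profile("AI transforms healthcare.")
ko_chars, ko_bytes = utf8_profile("AI가 의료를 변화시킨다.")

print(f"English: {en_chars} chars, {en_bytes} bytes")
print(f"Korean:  {ko_chars} chars, {ko_bytes} bytes")
```

The Korean sentence has fewer characters but more bytes, which is where a byte-level vocabulary without Korean merges spends its tokens.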


### 4-3. Hypothesis templates

In [39]:
##################
## Effect of the hypothesis template on the result
##################

from transformers import pipeline

# Load the BART-MNLI model
classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

def test_template(text, labels, template, name):
    """Classify with one hypothesis template (None uses the pipeline default)."""
    if template:
        result = classifier(text, labels, hypothesis_template=template)
    else:
        result = classifier(text, labels)

    print(f"{name}: {result['labels'][0]} ({result['scores'][0]:.3f})")

def compare_templates():
    """Show how the classification result changes with the template."""

    text = "This product costs $500"  # mentions only a price
    labels = ["business", "technology", "sports"]

    print("Text:", text)
    print()

    test_template(text, labels, None, "default")
    test_template(text, labels, "This news is about {}.", "news")
    test_template(text, labels, "This belongs to {} category.", "category")

# Run
compare_templates()


Device set to use cuda:0


Text: This product costs $500

default: business (0.812)
news: technology (0.507)
category: business (0.458)
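
Each template is a format string: the label is substituted for `{}` to produce the hypothesis that is paired with the input text. The sketch below builds those hypotheses with `str.format`. `build_hypotheses` is a hypothetical helper, and its default argument is the pipeline's default template, `"This example is {}."`.

```python
def build_hypotheses(labels, template="This example is {}."):
    """Fill the template once per candidate label."""
    if "{}" not in template:
        raise ValueError("template must contain a {} placeholder")
    return [template.format(label) for label in labels]


labels = ["business", "technology", "sports"]
templates = [
    "This example is {}.",
    "This news is about {}.",
    "This belongs to {} category.",
]

for template in templates:
    print(build_hypotheses(labels, template))
```

Reading the generated sentences aloud is a quick way to check whether a template makes a sensible claim about the input text.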


In [50]:
# ===========
# Domain-specific templates

def test_template(text, labels, template, name):
    """Classify with one hypothesis template (None uses the pipeline default)."""
    if template:
        result = classifier(text, labels, hypothesis_template=template)
    else:
        result = classifier(text, labels)

    print(f"{name}: {result['labels'][0]} ({result['scores'][0]:.3f})")

# Test domain-specific templates
text = "I've seen worse products."

labels = ["positive", "negative", "neutral"]

print("\n=== Effect of domain-specific templates ===")
test_template(text, labels, "This example is {}.", "general")
test_template(text, labels, "The sentiment is {}.", "sentiment")
test_template(text, labels, "This review is {}.", "review")


=== Effect of domain-specific templates ===
general: negative (0.924)
sentiment: negative (0.570)
review: positive (0.440)
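
The three templates above disagree on the top label, and one of them scores below 0.5, which is the situation the 0.5 review guideline in section 4 targets. A minimal sketch of such a check, using the scores printed above as input: `needs_review` is a hypothetical helper, and the 0.5 threshold is the guideline value from this notebook, not a library setting.

```python
def needs_review(predictions, threshold=0.5):
    """Flag a text for manual review.

    predictions: list of (template_name, top_label, top_score) tuples.
    Returns True when templates disagree on the label or when any
    top score is at or below the threshold.
    """
    labels = {label for _, label, _ in predictions}
    low_confidence = any(score <= threshold for _, _, score in predictions)
    return len(labels) > 1 or low_confidence


# Top predictions copied from the output above
runs = [
    ("general", "negative", 0.924),
    ("sentiment", "negative", 0.570),
    ("review", "positive", 0.440),
]
print(needs_review(runs))
print(needs_review([("general", "negative", 0.924)]))
```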


In [53]:
from transformers import pipeline

classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

def test_template(text, labels, template, name):
    """Classify with one hypothesis template (None uses the pipeline default)."""
    if template:
        result = classifier(text, labels, hypothesis_template=template)
    else:
        result = classifier(text, labels)

    print(f"{name}: {result['labels'][0]} ({result['scores'][0]:.3f})")

# Inappropriate vs. appropriate template
text = "The meeting was scheduled."
labels = ["pleasant", "unpleasant"]

print("\n=== Template suitability comparison ===")
test_template(text, labels, "This person is {}.", "inappropriate")
test_template(text, labels, "This weather is {}.", "appropriate")

Device set to use cuda:0



=== Template suitability comparison ===
inappropriate: pleasant (0.700)
appropriate: pleasant (0.673)


In [55]:
from transformers import pipeline

# Load the BART mask-filling model
fill_mask = pipeline("fill-mask", model="facebook/bart-large")

def test_mask(text, top_k=3):
    """Run mask filling and print the top predictions."""
    result = fill_mask(text, top_k=top_k)

    print(f"Input: {text}")
    for i, pred in enumerate(result, 1):
        print(f"{i}. {pred['token_str']} ({pred['score']:.3f})")
    print("-" * 30)

# Test examples
test_mask("The weather is <mask> today.")
test_mask("I love eating <mask> for breakfast.")
test_mask("The movie was <mask> and entertaining.")

Device set to use cuda:0


Input: The weather is <mask> today.
1.  expected (0.138)
2.  mostly (0.066)
3.  looking (0.065)
------------------------------
Input: I love eating <mask> for breakfast.
1.  eggs (0.061)
2.  pancakes (0.042)
3.  a (0.039)
------------------------------
Input: The movie was <mask> and entertaining.
1.  well (0.156)
2.  very (0.133)
3.  both (0.064)
------------------------------
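
The predictions above are printed with a leading space because the byte-level BPE token itself carries the preceding space. Some of them (such as `a`, or `chirop` in section 2-1) read like fragments of a longer continuation, since each prediction fills the mask with a single token. A minimal sketch for tidying such results: `clean_predictions` is a hypothetical helper that expects dictionaries shaped like the pipeline output used above (`token_str`, `score`), and the sample data is copied from that output.

```python
def clean_predictions(predictions, min_score=0.0):
    """Strip whitespace from token strings and drop low-scoring entries."""
    cleaned = []
    for pred in predictions:
        token = pred["token_str"].strip()
        if token and pred["score"] >= min_score:
            cleaned.append((token, round(pred["score"], 3)))
    return cleaned


# Shaped like the fill-mask output for "I love eating <mask> for breakfast."
sample = [
    {"token_str": " eggs", "score": 0.061},
    {"token_str": " pancakes", "score": 0.042},
    {"token_str": " a", "score": 0.039},
]
cleaned = clean_predictions(sample, min_score=0.04)
print(cleaned)
```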
