Transfer learning
```
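Fine-tuning
  Continue training the pretrained model's parameters on data from the new task so that they are optimized for it
  Method: update all layers of the model (or only some of them)
    add and train a task-specific output layer (e.g. a linear layer for classification)
  Drawbacks: resource-intensive; tends to overfit when data is scarce
  Uses: text classification, question answering (QA), named entity recognition (NER)
Feature extraction
  Use the pretrained model as a fixed feature extractor and train a new classifier (MLP, SVM) on its outputs
  Method: freeze the model weights and extract only the output token embeddings
  train a new machine learning model on the extracted embeddings
  Drawbacks: usually lower performance than fine-tuning
  Uses: quick checks, cases with very little data
Strategy: start with feature extraction, and switch to fine-tuning if test performance is insufficient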

```

Why transfer learning is needed in NLP (natural language processing)
```
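Data scarcity
Complexity of language
Examples
  BERT is pretrained on Wikipedia and BookCorpus and has strong contextual understanding
  -> fine-tuned for downstream tasks such as SQuAD (question answering)
  For Korean, KoBERT and KLUE-BERT are pretrained on Korean data
Korean-specific models
  KoBERT : trained by SKT on Korean data
  KLUE-BERT : model released with the Korean benchmark (KLUE), pretrained on Korean corpora
  KoGPT2 : trained by SKT on Korean data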
```

Text preprocessing and text representation for transfer learning
```
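Data preprocessing
  Tokenization
    split text into words, subwords, or characters
  Padding
    fix the input length
  Masking
    Attention mask: excludes padding tokens from the attention computation
    1 1 1 1 0 0
  Normalization
    lowercasing, special-character removal, morphological analysis
  Domain-specific preprocessing
    medical (or legal) data: add a dictionary of domain terms
Text representation
  Embedding: represent each word as a vector in a high-dimensional space
  Positional encoding: position information (needed because self-attention processes all tokens in parallel and is otherwise order-agnostic)
Preprocessing considerations
  Input length: each model has a maximum token count (BERT: 512, GPT-3: 2048)
  Multilingual text: for mixed Korean and English, use a multilingual model such as bert-base-multi...
  Data augmentation: use back-translation or synonym replacement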
```
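The padding and masking steps in the notes can be sketched in plain Python. This is a minimal illustration, not what a real tokenizer does internally: `pad_and_mask` is a hypothetical helper, the token ids are made up, and the padding id of 0 is an assumption (each tokenizer defines its own).

```python
PAD_ID = 0  # assumed padding token id; real tokenizers define their own

def pad_and_mask(token_ids, max_length, pad_id=PAD_ID):
    """Truncate or right-pad token_ids to max_length and build the attention mask."""
    ids = list(token_ids[:max_length])
    mask = [1] * len(ids)            # 1 = real token
    n_pad = max_length - len(ids)
    return ids + [pad_id] * n_pad, mask + [0] * n_pad  # 0 = padding, ignored by attention

input_ids, attention_mask = pad_and_mask([101, 2054, 2003, 102], max_length=6)
print(input_ids)       # [101, 2054, 2003, 102, 0, 0]
print(attention_mask)  # [1, 1, 1, 1, 0, 0]
```

The mask reproduces the `1 1 1 1 0 0` pattern from the notes: four real tokens followed by two padding positions.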

Use the Hugging Face pipeline to run sentiment classification without any transfer learning, and get a sense of the basic transfer-learning workflow

In [None]:
from huggingface_hub import login
login()  # prompts for an access token; never hard-code a token in the notebook

In [None]:
# Multilingual base model: it needs fine-tuning (it has not been tuned for any particular task)
from transformers import pipeline
model_name = 'bert-base-multilingual-cased'
classifier = pipeline('sentiment-analysis', model=model_name, tokenizer=model_name)
# Mixed Korean and English
texts = [
    "이 영화 정말 재미있어요",
    "This movie is absolutely terrible",
    "배우들의 연기가 훌륭했어요",
    "I was bored througtout the film"
]
# Inference
results = classifier(texts)
for result in results:
    print(result)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-multilingual-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Device set to use cpu


{'label': 'LABEL_0', 'score': 0.5385086536407471}
{'label': 'LABEL_0', 'score': 0.5986961126327515}
{'label': 'LABEL_0', 'score': 0.5690847039222717}
{'label': 'LABEL_0', 'score': 0.566684365272522}
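Every score above sits just over 0.5 because, as the warning says, the classification head is newly initialized. One rough way to see how little the model is committing: for a two-class softmax, the winning probability determines the gap between the two logits. The helper names below are illustrative and only the scores are taken from the output.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def logit_gap(score):
    """Logit difference implied by the winning probability of a 2-class softmax."""
    return math.log(score / (1.0 - score))

scores = [0.5385086536407471, 0.5986961126327515,
          0.5690847039222717, 0.566684365272522]
for s in scores:
    gap = logit_gap(s)
    # Feeding the gap back through softmax recovers the reported score.
    assert abs(softmax([gap, 0.0])[0] - s) < 1e-9
    print(f"score={s:.4f} -> logit gap={gap:.3f}")
```

The gaps are all well below 1, which is what an untrained linear layer on top of the encoder tends to produce. The labels therefore say nothing about sentiment.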


In [None]:
# Multilingual sentiment model that predicts a rating on a 5-point scale (1 to 5 stars)
from transformers import pipeline
model_name = 'nlptown/bert-base-multilingual-uncased-sentiment'
classifier = pipeline('sentiment-analysis', model=model_name, tokenizer=model_name)
# Mixed Korean and English
texts = [
    "이 영화 정말 재미있어요",
    "This movie is absolutely terrible",
    "배우들의 연기가 훌륭했어요",
    "I was bored througtout the film"
]
# Inference
results = classifier(texts)
for result in results:
    print(result)

Device set to use cpu


{'label': '5 stars', 'score': 0.4925258755683899}
{'label': '1 star', 'score': 0.9602923393249512}
{'label': '5 stars', 'score': 0.49889132380485535}
{'label': '1 star', 'score': 0.4800568222999573}
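The model returns star ratings rather than positive/negative labels. If a binary or three-way sentiment is needed, the labels can be collapsed afterwards. The sketch below uses the outputs shown above. The thresholds (1 to 2 stars negative, 3 neutral, 4 to 5 positive) are an assumption for illustration, not something defined by the model.

```python
def stars_to_sentiment(label):
    """Map a label such as '4 stars' to a coarse sentiment (thresholds are assumed)."""
    stars = int(label.split()[0])  # '1 star' -> 1, '5 stars' -> 5
    if stars <= 2:
        return "negative"
    if stars == 3:
        return "neutral"
    return "positive"

results = [
    {"label": "5 stars", "score": 0.4925258755683899},
    {"label": "1 star", "score": 0.9602923393249512},
    {"label": "5 stars", "score": 0.49889132380485535},
    {"label": "1 star", "score": 0.4800568222999573},
]
for r in results:
    print(r["label"], "->", stars_to_sentiment(r["label"]), f"(score {r['score']:.2f})")
```

Scores near 0.5 on a five-way output mean the probability mass is spread over neighbouring ratings, so it may be worth inspecting the full distribution before trusting the collapsed label.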


Transfer learning specialized for Korean
  - Sentiment analysis: KoBERT
  - Text generation: KoGPT2

In [None]:
# For classification, load the model with a sequence-classification head
from transformers import AutoModelForSequenceClassification, AutoTokenizer
model = AutoModelForSequenceClassification.from_pretrained("monologg/kobert")
tokenizer = AutoTokenizer.from_pretrained("monologg/kobert", trust_remote_code=True)
# Korean sentences
texts = [
    "이 영화 정말 재미있어요",
    "배우들의 연기가 훌륭했어요",
]
# Inference (the classification head is still untrained, so the predicted ids carry no meaning yet)
model.eval()
for text in texts:
  inputs = tokenizer(text, return_tensors='pt')
  outputs = model(**inputs)
  logits = outputs.logits
  predicted_class_id = logits.argmax(dim=-1).item()
  print(predicted_class_id)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at monologg/kobert and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


0
1


In [None]:
# KoGPT2
from transformers import pipeline
model_name = 'skt/kogpt2-base-v2'
generator = pipeline('text-generation', model=model_name, tokenizer=model_name)
# Prompt
prompt = '오늘의 주요 뉴스!'
# Inference
generated_text = generator(prompt, max_length=50, truncation=True, num_return_sequences=1)[0]['generated_text']
print(generated_text)

Device set to use cpu


오늘의 주요 뉴스!!! 새해맞이 보너스 포인트!
대선 특집!!!! 방송이 많이 들어와서 한 10분 정도 전에 찍었습니다 
그리고 내년에도 꼭 다시 봐야겠죠~


So far we have only loaded existing base models or task-specific models as they are
  - ▶ To fine-tune on your own data:

In [None]:
from transformers import AutoModelForSequenceClassification, AutoTokenizer, Trainer,TrainingArguments
model = AutoModelForSequenceClassification.from_pretrained("monologg/kobert")
tokenizer = AutoTokenizer.from_pretrained("monologg/kobert", trust_remote_code=True)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at monologg/kobert and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [None]:
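!pip install datasets
from datasets import load_dataset
# nsmc ships a loading script, so running its custom code must be allowed explicitly
dataset = load_dataset('nsmc', trust_remote_code=True)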

In [None]:
# Data to fine-tune on
import pandas as pd
pd.DataFrame(dataset['train'])

Unnamed: 0,id,document,label
0,9976970,아 더빙.. 진짜 짜증나네요 목소리,0
1,3819312,흠...포스터보고 초딩영화줄....오버연기조차 가볍지 않구나,1
2,10265843,너무재밓었다그래서보는것을추천한다,0
3,9045019,교도소 이야기구먼 ..솔직히 재미는 없다..평점 조정,0
4,6483659,사이몬페그의 익살스런 연기가 돋보였던 영화!스파이더맨에서 늙어보이기만 했던 커스틴 ...,1
...,...,...,...
149995,6222902,인간이 문제지.. 소는 뭔죄인가..,0
149996,8549745,평점이 너무 낮아서...,1
149997,9311800,이게 뭐요? 한국인은 거들먹거리고 필리핀 혼혈은 착하다?,0
149998,2376369,청춘 영화의 최고봉.방황과 우울했던 날들의 자화상,1


In [None]:
# Tokenization function
def tokenizer_fn(sentence):
  # return_tensors is not needed here: map() stores plain lists and set_format() converts them later
  return tokenizer(sentence['document'], padding='max_length', truncation=True)

In [None]:
tokenized_datasets = dataset.map(tokenizer_fn,batched=True)
tokenized_datasets

DatasetDict({
    train: Dataset({
        features: ['id', 'document', 'label', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 150000
    })
    test: Dataset({
        features: ['id', 'document', 'label', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 50000
    })
})

In [None]:
# The Hugging Face Trainer expects the target column to be named `labels`, so rename `label`
tokenized_datasets = tokenized_datasets.rename_column('label','labels')
tokenized_datasets.set_format('torch',columns=['input_ids','attention_mask','labels'])

In [None]:
# Subsample to shorten training time; select() is used instead of slicing so the result stays a Dataset
train_dataset = tokenized_datasets['train'].shuffle(seed=42).select(range(5000))
eval_dataset = tokenized_datasets['test'].shuffle(seed=42).select(range(1000))

In [None]:
# Training
training_args = TrainingArguments(
    output_dir='./results',          # output directory
    eval_strategy='epoch',           # evaluate at the end of every epoch
    learning_rate=2e-5,              # learning rate
    per_device_train_batch_size=16,  # batch size per device during training
    per_device_eval_batch_size=16,   # batch size for evaluation
    num_train_epochs=3,              # total number of training epochs
    weight_decay=0.01,               # strength of weight decay
    logging_dir='./logs',            # directory for storing logs
    logging_steps=100,               # log every 100 steps
    save_strategy='epoch',           # save a checkpoint every epoch
    load_best_model_at_end=True,     # reload the best checkpoint after training
    metric_for_best_model='accuracy',
    report_to='none'                 # do not use wandb
)
# Metrics
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
def compute_metrics(pred):
  labels = pred.label_ids
  preds = pred.predictions.argmax(-1)
  precision, recall, f1, _ = precision_recall_fscore_support(labels, preds, average='binary')
  acc = accuracy_score(labels, preds)
  return {
      'accuracy': acc,
      'f1': f1,
      'precision': precision,
      'recall': recall
  }

# Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    # processing_class=tokenizer,  # replaces the `tokenizer` argument, which is being phased out
    compute_metrics=compute_metrics,
)
trainer.train()

Epoch,Training Loss,Validation Loss,Accuracy,F1,Precision,Recall
1,0.248,0.518647,0.823,0.823177,0.8,0.847737
2,0.2162,0.505181,0.84,0.832285,0.848291,0.816872
3,0.1248,0.615675,0.829,0.830189,0.802303,0.860082
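As a sanity check on the table above, F1 is the harmonic mean of precision and recall, so the F1 column can be recomputed from the other two. The helper name `f1_from` is illustrative and the numbers are copied from the table.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# (epoch, precision, recall, reported F1) copied from the evaluation table
rows = [
    (1, 0.8, 0.847737, 0.823177),
    (2, 0.848291, 0.816872, 0.832285),
    (3, 0.802303, 0.860082, 0.830189),
]
for epoch, p, r, reported in rows:
    print(epoch, round(f1_from(p, r), 6), reported)
```

Validation loss rises in epoch 3 while training loss keeps falling, which is the overfitting pattern the notes warn about for small datasets. With `metric_for_best_model='accuracy'`, the epoch 2 checkpoint is the one that gets reloaded.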


TrainOutput(global_step=939, training_loss=0.22431685317692537, metrics={'train_runtime': 1451.252, 'train_samples_per_second': 10.336, 'train_steps_per_second': 0.647, 'total_flos': 3946665830400000.0, 'train_loss': 0.22431685317692537, 'epoch': 3.0})
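The `global_step=939` value follows from the sample count, batch size, and epoch count. A small sketch, assuming a single device, no gradient accumulation, and that the last partial batch of each epoch is kept (the helper name is illustrative):

```python
import math

def total_optimizer_steps(num_samples, batch_size, epochs):
    """Optimizer steps when every batch triggers one update and partial batches are kept."""
    return math.ceil(num_samples / batch_size) * epochs

steps = total_optimizer_steps(num_samples=5000, batch_size=16, epochs=3)
print(steps)  # 939

# Throughput figures reported in TrainOutput, recomputed from the runtime
train_runtime = 1451.252
print(round(steps / train_runtime, 3))     # 0.647 steps per second
print(round(5000 * 3 / train_runtime, 3))  # 10.336 samples per second
```

At roughly 0.65 steps per second, most of the 24-minute runtime is likely due to padding every review to the maximum length. A shorter `max_length` would probably speed this up, at the cost of truncating long reviews.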

In [None]:
# Save if needed
model.save_pretrained('kobert_finetuned')
tokenizer.save_vocabulary('kobert_finetuned')

('kobert_finetuned/tokenizer_78b3253a26.model', 'kobert_finetuned/vocab.txt')

In [None]:
import torch
device = 'cuda' if torch.cuda.is_available() else 'cpu'
# Inference function
def predict_sentiment(text,model,tokenizer):
  model.to(device)  # make sure the model and the inputs are on the same device
  model.eval()
  inputs = tokenizer(text,return_tensors='pt',truncation=True,padding=True).to(device)
  with torch.no_grad():
    outputs = model(**inputs)
  logits = outputs.logits
  predicted_class = torch.argmax(logits,dim=1).item()
  return '긍정' if predicted_class == 1 else '부정'  # '긍정' = positive, '부정' = negative

In [None]:
predict_sentiment(  "이걸 영화라고... ",model,tokenizer)

'부정'

Feature extraction

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

In [None]:
from transformers import AutoModel, AutoTokenizer
model_FE = AutoModel.from_pretrained("monologg/kobert")
tokenizer_FE = AutoTokenizer.from_pretrained("monologg/kobert", trust_remote_code=True)

In [None]:
import numpy as np
from tqdm import tqdm
# Device setup
model_FE.to(device)
model_FE.eval()  # evaluation mode (disables dropout); the weights stay fixed because nothing here trains them
# Embedding extraction
dataset = load_dataset('nsmc', trust_remote_code=True)
def extract_embedding(dataset,tokenizer,model, max_samples=1000):
  embeddings, labels = [],[]
  for i, example in tqdm(enumerate(dataset)):
    if i >= max_samples:  # use only the first max_samples examples
      break
    text = example['document']
    label = example['label']

    inputs = tokenizer(text, return_tensors='pt', truncation=True, padding=True).to(device)
    # Extract the [CLS] embedding
    with torch.no_grad():
      outputs = model(**inputs)
      cls_embedding = outputs.last_hidden_state[:, 0, :].cpu().numpy()  # [CLS] token, shape (1, hidden_size)
    embeddings.append(cls_embedding)
    labels.append(label)
  return np.array(embeddings), np.array(labels)


In [None]:
# Extract embeddings for the train and test splits
train_embeddings, train_labels = extract_embedding(dataset['train'],tokenizer_FE,model_FE)
test_embeddings, test_labels = extract_embedding(dataset['test'],tokenizer_FE,model_FE)

1000it [00:13, 76.12it/s]
1000it [00:09, 105.77it/s]


In [None]:
train_embeddings.squeeze(1).shape

(1000, 768)

In [None]:
# Train the classifier
classifier = LogisticRegression(max_iter=1000)
classifier.fit(train_embeddings.squeeze(1), train_labels)

In [None]:
test_predictions = classifier.predict(test_embeddings.squeeze(1))
print(classification_report(test_labels, test_predictions))

              precision    recall  f1-score   support

           0       0.68      0.70      0.69       492
           1       0.70      0.68      0.69       508

    accuracy                           0.69      1000
   macro avg       0.69      0.69      0.69      1000
weighted avg       0.69      0.69      0.69      1000



In [None]:
# Inference function
def predict_sentiment(text,tokenizer,model,classifier):
  model.eval()
  inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True).to(device)
  with torch.no_grad():
    outputs = model(**inputs)
    cls_embedding = outputs.last_hidden_state[:,0,:].cpu().numpy()  # [CLS] token from the last layer
  prediction = classifier.predict(cls_embedding)
  return '긍정' if prediction[0] == 1 else '부정'  # '긍정' = positive, '부정' = negative

In [None]:
predict_sentiment('영화 엄복동 하고 똑 같아요.',tokenizer_FE,model_FE,classifier)

'긍정'

Possible improvements for feature extraction
```
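Extracting embeddings in batches rather than one sentence at a time is faster, and shuffling before sampling gives the classifier a more varied subset to learn from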
```
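The batching idea can be sketched independently of the model. In the snippet below, `iter_batches`, `embed_in_batches`, and `fake_encode` are hypothetical names. `fake_encode` is a stand-in so the sketch runs without downloading anything. In the notebook it would be replaced by a function that tokenizes the whole batch and returns the [CLS] rows.

```python
def iter_batches(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

def embed_in_batches(texts, encode_batch, batch_size=32):
    """Apply `encode_batch` (list of texts -> list of vectors) batch by batch."""
    vectors = []
    for batch in iter_batches(texts, batch_size):
        vectors.extend(encode_batch(batch))
    return vectors

# Stand-in encoder so the sketch runs without a model: one 2-d vector per text.
def fake_encode(batch):
    return [[float(len(t)), 1.0] for t in batch]

texts = [f"review {i}" for i in range(70)]
vectors = embed_in_batches(texts, fake_encode, batch_size=32)
print(len(vectors), [len(b) for b in iter_batches(texts, 32)])  # 70 [32, 32, 6]
```

With the notebook's objects, `encode_batch` would call the tokenizer on the list with `padding=True, truncation=True, return_tensors='pt'`, run the model under `torch.no_grad()`, and return `outputs.last_hidden_state[:, 0, :]` moved to the CPU. The result then has shape `(n, 768)` directly, so the later `squeeze(1)` calls would no longer be needed.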

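Question answering (KoBERT)
- 1. Use the base model as is
- 2. Fine-tuning
- 3. Feature extraction
- Extract the answer to a question as a specific span within the given context
- Context: Apple was founded in 1976 by Steve Jobs and Steve Wozniak
- Question: When was Apple founded?
- Answer: 1976
- Takes the context and the question as input and predicts the start/end token positions
- KoBERT uses the [CLS] and token embeddings
- [CLS] question [SEP] context [SEP]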
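The start/end prediction described in the list can be illustrated with a toy decoder. The tokens are a simplified whitespace split and the logits are invented for the illustration. A trained QA head would produce one start logit and one end logit per token, and `best_span` is a hypothetical helper showing one common way to pick the span.

```python
def best_span(start_logits, end_logits, max_answer_len=5):
    """Return (start, end) maximizing start_logit + end_logit with start <= end."""
    best, best_score = (0, 0), float("-inf")
    for s, s_logit in enumerate(start_logits):
        for e in range(s, min(s + max_answer_len, len(end_logits))):
            score = s_logit + end_logits[e]
            if score > best_score:
                best, best_score = (s, e), score
    return best

# [CLS] question [SEP] context [SEP]
tokens = ["[CLS]", "when", "was", "apple", "founded", "?", "[SEP]",
          "apple", "was", "founded", "in", "1976", "by", "steve", "jobs", "[SEP]"]
# Hypothetical logits: a trained model would peak at the answer token.
start_logits = [0.1] * len(tokens)
end_logits = [0.1] * len(tokens)
start_logits[11] = 5.0
end_logits[11] = 4.0

start, end = best_span(start_logits, end_logits)
print(" ".join(tokens[start:end + 1]))  # 1976
```

A real decoder would also restrict the span to the context segment, so that the answer cannot be taken from the question itself.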

In [None]:
# 1. Base model as is
# pipeline with AutoModelForQuestionAnswering
from transformers import AutoModelForQuestionAnswering, AutoTokenizer, pipeline

model_name = "monologg/kobert"
model_QA_base = AutoModelForQuestionAnswering.from_pretrained(model_name)
# KoBERT ships a custom tokenizer, so running its remote code must be allowed explicitly
tokenizer_QA_base = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
# Pipeline
qa_pipeline = pipeline('question-answering', model=model_QA_base, tokenizer=tokenizer_QA_base)
# Test data
context = "현대자동차는 다양한 자동차를 생산하고 있으며, 대표 차량으로는 쏘나타가 있다"
question = [
    '현대자동차는 어떤 차를 생산하나요?',
    '현대자동차의 대표 차량은?'
]
# Inference
results = qa_pipeline(context=context, question=question)
for result in results:
  print(result)

Some weights of BertForQuestionAnswering were not initialized from the model checkpoint at monologg/kobert and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Device set to use cuda:0


{'score': 0.0038827478419989347, 'start': 35, 'end': 39, 'answer': '쏘나타가'}
{'score': 0.003912993241101503, 'start': 35, 'end': 39, 'answer': '쏘나타가'}


Fine-tuning

Feature extraction

In [None]:
!pip install datasets -q

In [None]:
from datasets import load_dataset
from transformers import AutoModel, AutoTokenizer
import torch
dataset = load_dataset('klue','mrc')
model_name = "monologg/kobert"
model_QA_FE = AutoModel.from_pretrained(model_name)
tokenizer_QA_FE = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)  # KoBERT needs its custom tokenizer code
device = 'cuda' if torch.cuda.is_available() else 'cpu'
model_QA_FE.to(device)
model_QA_FE.eval()

KeyboardInterrupt: Interrupted by user