ref: Includes material excerpted from Professor Hyopil Shin's Fall 2021 course "Text and Natural Language Big Data Analysis Methodology" at Seoul National University.

https://hpshin.github.io/NaturalLanguageBigDataAnalysis/index.html


# HuggingFace🤗 [Documentation](https://huggingface.co/transformers/quicktour.html)

![HuggingFace Image](https://monologg.kr/images/2020-05-01-transformers-porting/thumbnail.png)

# Quick tour

## What transformers can do
- __Sentiment analysis__: is a text positive or negative?
- __Text generation__: provide a prompt and the model will generate what follows.
- __Named entity recognition (NER)__: in an input sentence, label each word with the entity it represents (person, place, etc.).
- __Question answering__: provide the model with some context and a question, extract the answer from the context.
- __Filling masked text__: given a text with masked words (e.g., replaced by [MASK]), fill the blanks.
- __Summarization__: generate a summary of a long text.
- __Translation__: translate a text into another language.
- __Feature extraction__: return a tensor representation of the text.

# How? Pipelining
![Pipelining](https://huggingface.co/course/static/chapter2/full_nlp_pipeline.png)

# Getting started
Installing Transformers (Hugging Face)

In [1]:
!pip install transformers datasets -q

Importing Transformers (Hugging Face)

In [2]:
import os

import torch
import torch.nn.functional as F

import transformers
from transformers import pipeline

# Example (Sentiment Analysis)
- Let's classify the following sentence:
    - "We are very happy to show you the 🤗 Transformers library."

In [3]:
from transformers import pipeline

classifier = pipeline('sentiment-analysis')

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english)


In [4]:
classifier('We are very happy to show you the 🤗 Transformers library.')

[{'label': 'POSITIVE', 'score': 0.9997795224189758}]

- But how do you classify more than one sentence? Pass a list of sentences.

In [5]:
x = [
    "We are very happy to show you the 🤗 Transformers library.",
    "We hope you don't hate it."
]
classifier(x)

[{'label': 'POSITIVE', 'score': 0.9997795224189758},
 {'label': 'NEGATIVE', 'score': 0.5308598279953003}]
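
The pipeline returns one dictionary per input sentence, so the results can be post-processed with plain Python. The sketch below reuses the two results printed above and flags predictions whose score falls under a cutoff. The helper name `flag_uncertain` and the cutoff of 0.6 are illustrative assumptions, not part of the library.

```python
# Results copied from the classifier output above.
results = [
    {"label": "POSITIVE", "score": 0.9997795224189758},
    {"label": "NEGATIVE", "score": 0.5308598279953003},
]

def flag_uncertain(results, threshold=0.6):
    """Return (label, score, is_uncertain) for each prediction.

    `threshold` is an arbitrary cutoff chosen for illustration.
    """
    return [(r["label"], r["score"], r["score"] < threshold) for r in results]

flagged = flag_uncertain(results)
for label, score, uncertain in flagged:
    note = " (low confidence)" if uncertain else ""
    print(f"{label}: {score:.3f}{note}")
```

For a two-class model a score near 0.5 is close to a coin flip, which is why the second sentence ("We hope you don't hate it.") deserves a second look before trusting its label.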

# Just now we used [DistilBERT](https://huggingface.co/transformers/model_doc/distilbert.html)
- To be more specific: "distilbert-base-uncased-finetuned-sst-2-english"
    - Model: DistilBERT
    - Model size: base
    - Input: lowercased (uncased) input
    - Fine-tuned on: SST-2 (Stanford Sentiment Treebank)
    - Language: English
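
The breakdown above can be reproduced by splitting the checkpoint name on hyphens. This is only a sketch: Hub model names follow a loose naming habit rather than a formal scheme, so the field positions assumed here hold for this particular checkpoint and may not hold for others.

```python
checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"

# Assumption: the first three hyphen-separated fields are architecture, size
# and casing, and the last field is the language. This is a convention only.
architecture, size, casing, rest = checkpoint.split("-", 3)
finetuning, language = rest.rsplit("-", 1)

parts = {
    "architecture": architecture,
    "size": size,
    "casing": casing,
    "finetuning": finetuning,
    "language": language,
}
print(parts)
```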

# DistilBERT (Distilled BERT)
- Advantages
    - 40% fewer parameters than bert-base-uncased
    - 60% faster while preserving over 95% of BERT's performance (as measured on the GLUE benchmark)
![ModelvsParam Image](https://4.bp.blogspot.com/-v0xrp7eJRfM/Xr77DD85ObI/AAAAAAAADDY/KjIlWlFZExQA84VRDrMEMrB534euKAzlgCLcBGAsYHQ/s1600/NLP%2Bmodels.png)
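
As a quick sanity check on the "40% fewer parameters" figure, the arithmetic can be worked out from approximate parameter counts. The two counts below are assumptions taken from the commonly quoted sizes of the models (about 110M for BERT-base and about 66M for DistilBERT); they are not produced by anything in this notebook.

```python
# Approximate, commonly quoted parameter counts (assumed, not measured here).
bert_base_params = 110_000_000
distilbert_params = 66_000_000

reduction = 1 - distilbert_params / bert_base_params
print(f"DistilBERT has about {reduction:.0%} fewer parameters than BERT-base")
```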

# How does it work? 
## Let's go over it step by step

In [6]:
from transformers import pipeline
from transformers import AutoTokenizer, AutoModelForSequenceClassification

In [7]:
model_name = 'distilbert-base-uncased-finetuned-sst-2-english'

model = AutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
classifier = pipeline('sentiment-analysis', model=model, tokenizer=tokenizer)

In [8]:
text = "We are very happy to show you the 🤗 Transformers library."
tokens = tokenizer.tokenize(text)
token_ids = tokenizer.convert_tokens_to_ids(tokens)
encoded_input = tokenizer(text, padding=True, truncation=True, max_length=512, return_tensors='pt')

print(f"   Tokens: {tokens}")
print(f"Token IDs: {token_ids}")
print(f"Input IDs: {encoded_input}")


   Tokens: ['we', 'are', 'very', 'happy', 'to', 'show', 'you', 'the', '[UNK]', 'transformers', 'library', '.']
Token IDs: [2057, 2024, 2200, 3407, 2000, 2265, 2017, 1996, 100, 19081, 3075, 1012]
Input IDs: {'input_ids': tensor([[  101,  2057,  2024,  2200,  3407,  2000,  2265,  2017,  1996,   100,
         19081,  3075,  1012,   102]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}
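
Comparing the two ID lists shows what calling the tokenizer adds on top of `tokenize` and `convert_tokens_to_ids`: the sequence is wrapped in the special tokens `[CLS]` (ID 101) and `[SEP]` (ID 102). The sketch below checks this using only the values printed above; the constant names are illustrative.

```python
# Values copied from the output above.
tokens = ['we', 'are', 'very', 'happy', 'to', 'show', 'you', 'the',
          '[UNK]', 'transformers', 'library', '.']
token_ids = [2057, 2024, 2200, 3407, 2000, 2265, 2017, 1996, 100, 19081, 3075, 1012]
input_ids = [101, 2057, 2024, 2200, 3407, 2000, 2265, 2017, 1996, 100,
             19081, 3075, 1012, 102]

# Special-token IDs in the BERT uncased vocabulary used by this checkpoint.
CLS_ID, SEP_ID, UNK_ID = 101, 102, 100

wrapped = [CLS_ID] + token_ids + [SEP_ID]
print(wrapped == input_ids)

# The emoji is not in the vocabulary, so it is mapped to the unknown token.
unknown_token = tokens[token_ids.index(UNK_ID)]
print(unknown_token)
```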


# Now let's try multiple sentences

In [9]:
text = [
    "We are very happy to show you the 🤗 Transformers library.",
    "We hope you don't hate it."
]
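# tokenize() is meant for a single string, so tokenize each sentence separately
tokens = [tokenizer.tokenize(t) for t in text]
token_ids = [tokenizer.convert_tokens_to_ids(t) for t in tokens]
# Calling the tokenizer on a list encodes the whole batch and pads to the longest sentence
encoded_input = tokenizer(text, padding=True, truncation=True, max_length=512, return_tensors='pt')

print(encoded_input)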

{'input_ids': tensor([[  101,  2057,  2024,  2200,  3407,  2000,  2265,  2017,  1996,   100,
         19081,  3075,  1012,   102],
        [  101,  2057,  3246,  2017,  2123,  1005,  1056,  5223,  2009,  1012,
           102,     0,     0,     0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0]])}
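
The zeros at the end of the second row are padding, and the attention mask marks which positions hold real tokens. A minimal pure-Python sketch of what `padding=True` does for a batch, assuming a pad ID of 0 as seen in the output above (the helper `pad_batch` is hypothetical, not a library function):

```python
PAD_ID = 0  # pad token ID observed in the output above

def pad_batch(sequences, pad_id=PAD_ID):
    """Pad every sequence to the length of the longest one and build the mask."""
    longest = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (longest - len(seq)) for seq in sequences]
    attention_mask = [[1] * len(seq) + [0] * (longest - len(seq)) for seq in sequences]
    return input_ids, attention_mask

# Unpadded IDs of the two sentences, copied from the output above.
batch = [
    [101, 2057, 2024, 2200, 3407, 2000, 2265, 2017, 1996, 100, 19081, 3075, 1012, 102],
    [101, 2057, 3246, 2017, 2123, 1005, 1056, 5223, 2009, 1012, 102],
]

input_ids, attention_mask = pad_batch(batch)
print(input_ids[1])
print(attention_mask[1])
```

The model uses the mask to ignore the padded positions, so the shorter sentence is scored as if it had been encoded on its own.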


In [10]:
with torch.no_grad():
    output = model(**encoded_input)
    predictions = F.softmax(output.logits, dim=1)
    print(f"Softmax predictions: {predictions}")
    predictions = torch.argmax(output.logits, dim=1)
    print(predictions)
    label = [model.config.id2label[label_id] for label_id in predictions.tolist()]
    print(label)

Softmax predictions: tensor([[2.2043e-04, 9.9978e-01],
        [5.3086e-01, 4.6914e-01]])
tensor([1, 0])
['POSITIVE', 'NEGATIVE']
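
The post-processing in the cell above boils down to three steps: softmax turns logits into probabilities, argmax picks the most likely class, and `id2label` maps the class index to a name. A dependency-free sketch of those steps follows. The logits are made-up values chosen to give probabilities similar to the ones printed above, because the real logits are not shown in the output.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Mapping as implied by the output above (index 1 -> POSITIVE, 0 -> NEGATIVE).
id2label = {0: "NEGATIVE", 1: "POSITIVE"}

# Hypothetical logits for the two sentences (not taken from the model).
batch_logits = [[-4.2, 4.2], [0.06, -0.06]]

probabilities = [softmax(row) for row in batch_logits]
class_ids = [max(range(len(row)), key=row.__getitem__) for row in probabilities]
labels = [id2label[i] for i in class_ids]
print(labels)
```

Because softmax preserves ordering, taking the argmax of the logits (as the cell does) and of the probabilities gives the same class.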


# What else can pipelines do? [Examples](https://huggingface.co/transformers/usage.html)

# Question Answering
- distilbert-base-cased-distilled-squad

In [11]:
nlp = pipeline("question-answering")

context = r"""
Extractive Question Answering is the task of extracting an answer from a text given a question. An example of a
question answering dataset is the SQuAD dataset, which is entirely based on that task. If you would like to fine-tune
a model on a SQuAD task, you may leverage the `run_squad.py`.
"""

print(nlp(question="What is extractive question answering?", context=context))
print(nlp(question="What is a good example of a question answering dataset?", context=context))

No model was supplied, defaulted to distilbert-base-cased-distilled-squad (https://huggingface.co/distilbert-base-cased-distilled-squad)


{'score': 0.6222440600395203, 'start': 34, 'end': 95, 'answer': 'the task of extracting an answer from a text given a question'}
{'score': 0.5115304589271545, 'start': 147, 'end': 160, 'answer': 'SQuAD dataset'}
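
The `start` and `end` values are character offsets into `context`, which is what makes this task extractive: the answer is a slice of the input rather than newly generated text. The check below reuses the context string and the offsets printed above.

```python
context = r"""
Extractive Question Answering is the task of extracting an answer from a text given a question. An example of a
question answering dataset is the SQuAD dataset, which is entirely based on that task. If you would like to fine-tune
a model on a SQuAD task, you may leverage the `run_squad.py`.
"""

# Offsets and answers copied from the pipeline output above.
results = [
    {"start": 34, "end": 95,
     "answer": "the task of extracting an answer from a text given a question"},
    {"start": 147, "end": 160, "answer": "SQuAD dataset"},
]

spans = [context[r["start"]:r["end"]] for r in results]
for span, r in zip(spans, results):
    print(repr(span), span == r["answer"])
```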


# Fill mask
- distilroberta-base

In [12]:
from transformers import pipeline

nlp = pipeline("fill-mask")
print(nlp(f"HuggingFace is creating a {nlp.tokenizer.mask_token} that the community uses to solve NLP tasks."))

No model was supplied, defaulted to distilroberta-base (https://huggingface.co/distilroberta-base)


[{'sequence': 'HuggingFace is creating a tool that the community uses to solve NLP tasks.', 'score': 0.17927563190460205, 'token': 3944, 'token_str': ' tool'}, {'sequence': 'HuggingFace is creating a framework that the community uses to solve NLP tasks.', 'score': 0.11349444836378098, 'token': 7208, 'token_str': ' framework'}, {'sequence': 'HuggingFace is creating a library that the community uses to solve NLP tasks.', 'score': 0.052435252815485, 'token': 5560, 'token_str': ' library'}, {'sequence': 'HuggingFace is creating a database that the community uses to solve NLP tasks.', 'score': 0.034935373812913895, 'token': 8503, 'token_str': ' database'}, {'sequence': 'HuggingFace is creating a prototype that the community uses to solve NLP tasks.', 'score': 0.02860225923359394, 'token': 17715, 'token_str': ' prototype'}]
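
The fill-mask output is a ranked list of candidates, each with a probability. The sketch below condenses the scores printed above into a short table and sums them, which shows how much probability mass the top five candidates cover. Note the leading space in each `token_str`: it is part of the token in this model's vocabulary, so it is stripped here only for display.

```python
# Scores and tokens copied from the output above.
predictions = [
    {"token_str": " tool", "score": 0.17927563190460205},
    {"token_str": " framework", "score": 0.11349444836378098},
    {"token_str": " library", "score": 0.052435252815485},
    {"token_str": " database", "score": 0.034935373812913895},
    {"token_str": " prototype", "score": 0.02860225923359394},
]

candidates = [(p["token_str"].strip(), round(p["score"], 3)) for p in predictions]
covered = sum(p["score"] for p in predictions)

for word, score in candidates:
    print(f"{word:<10} {score}")
print(f"Top-5 probability mass: {covered:.3f}")
```

The five candidates together account for well under half of the probability mass, so the model is far from certain about any single filler word.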


# Named Entity Recognition
- dbmdz/bert-large-cased-finetuned-conll03-english

In [13]:
from transformers import pipeline

nlp = pipeline("ner")

sequence = "Hugging Face Inc. is a company based in New York City. Its headquarters are in DUMBO, therefore very" \
           "close to the Manhattan Bridge which is visible from the window."

print(nlp(sequence))

No model was supplied, defaulted to dbmdz/bert-large-cased-finetuned-conll03-english (https://huggingface.co/dbmdz/bert-large-cased-finetuned-conll03-english)


[{'entity': 'I-ORG', 'score': 0.9995633, 'index': 1, 'word': 'Hu', 'start': 0, 'end': 2}, {'entity': 'I-ORG', 'score': 0.9915939, 'index': 2, 'word': '##gging', 'start': 2, 'end': 7}, {'entity': 'I-ORG', 'score': 0.9982672, 'index': 3, 'word': 'Face', 'start': 8, 'end': 12}, {'entity': 'I-ORG', 'score': 0.9994404, 'index': 4, 'word': 'Inc', 'start': 13, 'end': 16}, {'entity': 'I-LOC', 'score': 0.99943465, 'index': 11, 'word': 'New', 'start': 40, 'end': 43}, {'entity': 'I-LOC', 'score': 0.99932706, 'index': 12, 'word': 'York', 'start': 44, 'end': 48}, {'entity': 'I-LOC', 'score': 0.9993865, 'index': 13, 'word': 'City', 'start': 49, 'end': 53}, {'entity': 'I-LOC', 'score': 0.9825622, 'index': 19, 'word': 'D', 'start': 79, 'end': 80}, {'entity': 'I-LOC', 'score': 0.9369829, 'index': 20, 'word': '##UM', 'start': 80, 'end': 82}, {'entity': 'I-LOC', 'score': 0.8987098, 'index': 21, 'word': '##BO', 'start': 82, 'end': 84}, {'entity': 'I-LOC', 'score': 0.97582406, 'index': 29, 'word': 'Manhatt
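
The raw NER output is per word piece, so "Hugging" arrives as `Hu` plus `##gging`. A small sketch of how adjacent pieces could be merged back into whole entities, using only the complete entries visible above. The helper `group_entities` is hypothetical; recent versions of the library offer similar behavior through the pipeline's `aggregation_strategy` argument, which may be the more convenient route in practice.

```python
# Complete entries copied from the output above (scores and offsets omitted).
pieces = [
    {"entity": "I-ORG", "index": 1, "word": "Hu"},
    {"entity": "I-ORG", "index": 2, "word": "##gging"},
    {"entity": "I-ORG", "index": 3, "word": "Face"},
    {"entity": "I-ORG", "index": 4, "word": "Inc"},
    {"entity": "I-LOC", "index": 11, "word": "New"},
    {"entity": "I-LOC", "index": 12, "word": "York"},
    {"entity": "I-LOC", "index": 13, "word": "City"},
    {"entity": "I-LOC", "index": 19, "word": "D"},
    {"entity": "I-LOC", "index": 20, "word": "##UM"},
    {"entity": "I-LOC", "index": 21, "word": "##BO"},
]

def group_entities(pieces):
    """Merge consecutive pieces that share an entity type into (type, text) pairs."""
    groups = []
    for piece in pieces:
        label = piece["entity"].split("-")[-1]
        prev = groups[-1] if groups else None
        if prev and prev["label"] == label and piece["index"] == prev["last_index"] + 1:
            if piece["word"].startswith("##"):
                prev["text"] += piece["word"][2:]   # continuation of the same word
            else:
                prev["text"] += " " + piece["word"]  # next word of the same entity
            prev["last_index"] = piece["index"]
        else:
            groups.append({"label": label, "text": piece["word"],
                           "last_index": piece["index"]})
    return [(g["label"], g["text"]) for g in groups]

entities = group_entities(pieces)
print(entities)
```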

# Summarization
- sshleifer/distilbart-cnn-12-6

In [14]:
from transformers import pipeline

summarizer = pipeline("summarization")

ARTICLE = """ New York (CNN)When Liana Barrientos was 23 years old, she got married in Westchester County, New York.
A year later, she got married again in Westchester County, but to a different man and without divorcing her first husband.
Only 18 days after that marriage, she got hitched yet again. Then, Barrientos declared "I do" five more times, sometimes only within two weeks of each other.
In 2010, she married once more, this time in the Bronx. In an application for a marriage license, she stated it was her "first and only" marriage.
Barrientos, now 39, is facing two criminal counts of "offering a false instrument for filing in the first degree," referring to her false statements on the
2010 marriage license application, according to court documents.
Prosecutors said the marriages were part of an immigration scam.
On Friday, she pleaded not guilty at State Supreme Court in the Bronx, according to her attorney, Christopher Wright, who declined to comment further.
After leaving court, Barrientos was arrested and charged with theft of service and criminal trespass for allegedly sneaking into the New York subway through an emergency exit, said Detective
Annette Markowski, a police spokeswoman. In total, Barrientos has been married 10 times, with nine of her marriages occurring between 1999 and 2002.
All occurred either in Westchester County, Long Island, New Jersey or the Bronx. She is believed to still be married to four men, and at one time, she was married to eight men at once, prosecutors say.
Prosecutors said the immigration scam involved some of her husbands, who filed for permanent residence status shortly after the marriages.
Any divorces happened only after such filings were approved. It was unclear whether any of the men will be prosecuted.
The case was referred to the Bronx District Attorney\'s Office by Immigration and Customs Enforcement and the Department of Homeland Security\'s
Investigation Division. Seven of the men are from so-called "red-flagged" countries, including Egypt, Turkey, Georgia, Pakistan and Mali.
Her eighth husband, Rashid Rajput, was deported in 2006 to his native Pakistan after an investigation by the Joint Terrorism Task Force.
If convicted, Barrientos faces up to four years in prison.  Her next court appearance is scheduled for May 18.
"""

print(summarizer(ARTICLE, max_length=130, min_length=30))

No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 (https://huggingface.co/sshleifer/distilbart-cnn-12-6)


[{'summary_text': ' Liana Barrientos, 39, is charged with two counts of "offering a false instrument for filing in the first degree" In total, she has been married 10 times, with nine of her marriages occurring between 1999 and 2002 . At one time, she was married to eight men at once, prosecutors say .'}]


# How to use custom pretrained models [Model Hub](https://huggingface.co/models)
![Model Hub Image](https://media.vlpt.us/images/jaehyeong/post/a480f27e-d91a-48e5-8dc5-b670bd92c652/image.png) 



# Text Generation using [KoGPT2](https://github.com/SKT-AI/KoGPT2)
- skt/kogpt2-base-v2

In [15]:
from transformers import GPT2LMHeadModel
from transformers import PreTrainedTokenizerFast

model_name = 'skt/kogpt2-base-v2'

model = GPT2LMHeadModel.from_pretrained(model_name)
tokenizer = PreTrainedTokenizerFast.from_pretrained(model_name,
  bos_token='</s>', eos_token='</s>', unk_token='<unk>',
  pad_token='<pad>', mask_token='<mask>') 

The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. 
The tokenizer class you load from this checkpoint is 'GPT2Tokenizer'. 
The class this function is called from is 'PreTrainedTokenizerFast'.


In [16]:
text = '공부를 잘 하기 위해서는'  # Korean prompt, roughly: "In order to study well,"
input_ids = tokenizer.encode(text)
gen_ids = model.generate(torch.tensor([input_ids]),
                           max_length=128,
                           repetition_penalty=2.0,
                           pad_token_id=tokenizer.pad_token_id,
                           eos_token_id=tokenizer.eos_token_id,
                           bos_token_id=tokenizer.bos_token_id,
                           use_cache=True)
generated = tokenizer.decode(gen_ids[0,:].tolist())
print(generated)

공부를 잘 하기 위해서는 먼저 자신의 능력을 최대한 발휘해야 한다.
자신의 능력으로 충분히 발휘할 수 있는 능력이 바로 자기 자신이다.
자기 자신을 제대로 계발하기 위해선 무엇보다 자신이 가지고 있던 장점을 마음껏 펼칠 줄 알아야 하며, 이를 통해 자신에게 맞는 일을 할 때 비로소 진정한 성공이 보장된다.
또한 이러한 능력은 다른 사람의 도움을 받지 않고 스스로 노력해서 이룰 수도 있다.
따라서 자신에 대한 긍정적인 마인드를 갖고 적극적으로 도전해보는 것이 중요하다.
그렇다면 어떻게 해야 성공적인 인생을 살수 있을까?
먼저 본인의 능력과 잠재력을 객관적으로 평가하여 그에 걸맞은 보상을 받는 것이다.
자신이 가진 장점이나 약점을 정확히 파악하고 그것을 바탕으로 한 자신만의 전략을 세워야 하는 것은 물론이고,
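
The `repetition_penalty=2.0` argument discourages the model from repeating tokens it has already produced. A minimal sketch of the rule as it is usually described: the score of every already-generated token is divided by the penalty when it is positive and multiplied by the penalty when it is negative, so it becomes less likely either way. The function name and the four-token vocabulary are hypothetical and only illustrate the arithmetic.

```python
def apply_repetition_penalty(logits, generated_ids, penalty):
    """Return a copy of `logits` with already-generated tokens made less likely."""
    adjusted = list(logits)
    for token_id in set(generated_ids):
        score = adjusted[token_id]
        # Dividing a positive score or multiplying a negative one both lower it.
        adjusted[token_id] = score * penalty if score < 0 else score / penalty
    return adjusted

logits = [2.0, -1.0, 0.5, 3.0]   # made-up next-token scores
generated_ids = [0, 1]           # tokens 0 and 1 were already generated

adjusted = apply_repetition_penalty(logits, generated_ids, penalty=2.0)
print(adjusted)
```

A penalty of 1.0 leaves the scores unchanged; larger values suppress repetition more strongly, at the risk of pushing the model away from words it legitimately needs to reuse.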


# Many pretrained models already perform very well
- Take a closer look at BERT
![BERT Training](https://production-media.paperswithcode.com/methods/new_BERT_Overall.jpg)
- BERT needs to be fine-tuned for each specific task
    - Let's try fine-tuning a pretrained model

# Steps for fine-tuning
1. Prepare dataset
2. Load pretrained tokenizer
3. Build PyTorch dataset with encodings
4. Load pretrained model
5. Load trainer and train it

# 1. Prepare dataset

In [17]:
from datasets import load_dataset
raw_datasets = load_dataset("imdb")

Reusing dataset imdb (/home/sungwookson/.cache/huggingface/datasets/imdb/plain_text/1.0.0/e3c66f1788a67a89c7058d97ff62b6c30531e05b549de56d3ab91891f0561f9a)


  0%|          | 0/3 [00:00<?, ?it/s]

# 2-3. Load pretrained tokenizer and build dataset with encodings

In [18]:
model_name = 'distilbert-base-uncased'
tokenizer = AutoTokenizer.from_pretrained(model_name)

def preprocess_fn(x):
    return tokenizer(x['text'], padding="max_length", truncation=True)

tokenized_datasets = raw_datasets.map(preprocess_fn, batched=True)

tokenized_datasets = tokenized_datasets.remove_columns(["text"])
tokenized_datasets = tokenized_datasets.rename_column("label", "labels")
tokenized_datasets.set_format("torch")

small_train_dataset = tokenized_datasets["train"].shuffle(seed=42).select(range(1000)) 
small_eval_dataset = tokenized_datasets["test"].shuffle(seed=42).select(range(1000)) 
full_train_dataset = tokenized_datasets["train"]
full_eval_dataset = tokenized_datasets["test"]

Loading cached processed dataset at /home/sungwookson/.cache/huggingface/datasets/imdb/plain_text/1.0.0/e3c66f1788a67a89c7058d97ff62b6c30531e05b549de56d3ab91891f0561f9a/cache-d8899e122d561bd5.arrow
Loading cached processed dataset at /home/sungwookson/.cache/huggingface/datasets/imdb/plain_text/1.0.0/e3c66f1788a67a89c7058d97ff62b6c30531e05b549de56d3ab91891f0561f9a/cache-fd5fb8cabe67605b.arrow
Loading cached processed dataset at /home/sungwookson/.cache/huggingface/datasets/imdb/plain_text/1.0.0/e3c66f1788a67a89c7058d97ff62b6c30531e05b549de56d3ab91891f0561f9a/cache-124ba69408debb19.arrow
Loading cached shuffled indices for dataset at /home/sungwookson/.cache/huggingface/datasets/imdb/plain_text/1.0.0/e3c66f1788a67a89c7058d97ff62b6c30531e05b549de56d3ab91891f0561f9a/cache-2695cd48903e6005.arrow
Loading cached shuffled indices for dataset at /home/sungwookson/.cache/huggingface/datasets/imdb/plain_text/1.0.0/e3c66f1788a67a89c7058d97ff62b6c30531e05b549de56d3ab91891f0561f9a/cache-56eb74918e2
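
In `preprocess_fn`, `padding="max_length"` together with `truncation=True` gives every review the same fixed length: long reviews are cut and short ones are padded. A simplified sketch of that behavior, assuming a pad ID of 0 and that the closing special token is kept when truncating; the helper `truncate_and_pad` and the tiny `max_length` of 6 are illustrative only (the real tokenizer pads to the model's maximum length).

```python
def truncate_and_pad(ids, max_length, pad_id=0):
    """Return (input_ids, attention_mask), both exactly `max_length` long."""
    if len(ids) > max_length:
        # Keep the final special token so the sequence still ends properly.
        ids = ids[: max_length - 1] + ids[-1:]
    padding = max_length - len(ids)
    return ids + [pad_id] * padding, [1] * len(ids) + [0] * padding

long_ids = [101, 2057, 2024, 2200, 3407, 2000, 2265, 2017, 1996, 100, 19081, 3075, 1012, 102]
short_ids = [101, 2057, 3246, 102]  # hypothetical short input

long_out = truncate_and_pad(long_ids, max_length=6)
short_out = truncate_and_pad(short_ids, max_length=6)
print(long_out)
print(short_out)
```

Fixed-length padding keeps every batch the same shape, at the cost of spending computation on pad tokens for short reviews.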

In [19]:
# In traditional PyTorch

from torch.utils.data import DataLoader
train_dataloader = DataLoader(small_train_dataset, shuffle=True, batch_size=8)
# Batch the evaluation set too; the default batch_size of 1 is needlessly slow
eval_dataloader = DataLoader(small_eval_dataset, batch_size=8)

# 4. Load pretrained model

In [20]:
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_projector.bias', 'vocab_layer_norm.bias', 'vocab_layer_norm.weight', 'vocab_transform.bias', 'vocab_projector.weight', 'vocab_transform.weight']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'pre_classifier.weight', 'pre_classi

# 5. Load trainer and train it

In [21]:
from transformers import TrainingArguments
from transformers import Trainer

training_args = TrainingArguments("test_trainer")
trainer = Trainer(
    model=model, args=training_args, train_dataset=small_train_dataset, eval_dataset=small_eval_dataset
)
trainer.train()

***** Running training *****
  Num examples = 1000
  Num Epochs = 3
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 1
  Total optimization steps = 375


Training completed. Do not forget to share your model on huggingface.co/models =)




TrainOutput(global_step=375, training_loss=0.2855883382161458, metrics={'train_runtime': 35.3371, 'train_samples_per_second': 84.897, 'train_steps_per_second': 10.612, 'total_flos': 397402195968000.0, 'train_loss': 0.2855883382161458, 'epoch': 3.0})
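
The numbers in the training log can be reproduced by hand, which is a useful check that the dataset size, batch size and epoch count are what you intended. The sketch below uses only values reported in the log above.

```python
import math

# Values reported in the training log above.
num_examples = 1000
batch_size = 8
num_epochs = 3
train_runtime = 35.3371  # seconds

steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_epochs
samples_per_second = num_examples * num_epochs / train_runtime
steps_per_second = total_steps / train_runtime

print(f"steps per epoch:    {steps_per_epoch}")
print(f"total steps:        {total_steps}")
print(f"samples per second: {samples_per_second:.3f}")
print(f"steps per second:   {steps_per_second:.3f}")
```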

In [22]:
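# In traditional PyTorch

import torch
# transformers.AdamW is deprecated; the PyTorch optimizer is the recommended replacement
# (note that its default weight_decay is 0.01 rather than 0.0)
from torch.optim import AdamW
from transformers import get_scheduler
from tqdm.auto import tqdm

optimizer = AdamW(model.parameters(), lr=5e-5)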

num_epochs = 3
num_training_steps = num_epochs * len(train_dataloader)
lr_scheduler = get_scheduler(
    "linear",
    optimizer=optimizer,
    num_warmup_steps=0,
    num_training_steps=num_training_steps
)

device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
model.to(device)

progress_bar = tqdm(range(num_training_steps))

model.train()
for epoch in range(num_epochs):
    for batch in train_dataloader:
        batch = {k: v.to(device) for k, v in batch.items()}
        outputs = model(**batch)
        loss = outputs.loss
        loss.backward()

        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
        progress_bar.update(1)

  0%|          | 0/375 [00:00<?, ?it/s]
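
The `"linear"` scheduler used in the loop decays the learning rate from its initial value to zero over the training steps, after an optional warmup. A dependency-free sketch of that schedule under the settings above (`lr=5e-5`, no warmup, 375 steps); the function name is hypothetical and the formula is a restatement of the usual linear-decay rule rather than the library's code.

```python
def linear_schedule_lr(step, base_lr, num_warmup_steps, num_training_steps):
    """Learning rate at `step` for linear warmup followed by linear decay to zero."""
    if step < num_warmup_steps:
        return base_lr * step / max(1, num_warmup_steps)
    remaining = num_training_steps - step
    decay_span = max(1, num_training_steps - num_warmup_steps)
    return base_lr * max(0.0, remaining / decay_span)

checkpoints = (0, 125, 250, 375)  # start, end of each epoch
lrs = [linear_schedule_lr(s, base_lr=5e-5, num_warmup_steps=0, num_training_steps=375)
       for s in checkpoints]

for step, lr in zip(checkpoints, lrs):
    print(f"step {step:>3}: lr = {lr:.2e}")
```

This is why `lr_scheduler.step()` is called once per batch in the loop: the schedule is defined per optimization step, not per epoch.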

# How to save and load model / tokenizer

In [23]:
save_directory = os.path.join('ckpt')

# Saving
model.save_pretrained(save_directory)
tokenizer.save_pretrained(save_directory)

# Loading
model = AutoModelForSequenceClassification.from_pretrained(save_directory)
tokenizer = AutoTokenizer.from_pretrained(save_directory)

Configuration saved in ckpt/config.json
Model weights saved in ckpt/pytorch_model.bin
tokenizer config file saved in ckpt/tokenizer_config.json
Special tokens file saved in ckpt/special_tokens_map.json
loading configuration file ckpt/config.json
Model config DistilBertConfig {
  "_name_or_path": "distilbert-base-uncased",
  "activation": "gelu",
  "architectures": [
    "DistilBertForSequenceClassification"
  ],
  "attention_dropout": 0.1,
  "dim": 768,
  "dropout": 0.1,
  "hidden_dim": 3072,
  "initializer_range": 0.02,
  "max_position_embeddings": 512,
  "model_type": "distilbert",
  "n_heads": 12,
  "n_layers": 6,
  "pad_token_id": 0,
  "problem_type": "single_label_classification",
  "qa_dropout": 0.1,
  "seq_classif_dropout": 0.2,
  "sinusoidal_pos_embds": false,
  "tie_weights_": true,
  "torch_dtype": "float32",
  "transformers_version": "4.12.3",
  "vocab_size": 30522
}

loading weights file ckpt/pytorch_model.bin
All model checkpoint weights were used when initializing DistilB

# Thank You