### Attention Model
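- Attention: focus on the important features.
- Attention model: at each prediction step, the decoder focuses on the more important features among all the encoder hidden states it is given.
  - At every step the decoder uses the encoder hidden states to build a context vector dynamically, attending most closely to the input words most relevant to the word it has to predict at that step.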
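
The idea can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the scoring function is a simple dot product (the notes do not say which scoring function is used), and the vectors and the names `softmax` and `attention_context` are made up for the example.

```python
import math

def softmax(scores):
    # Subtract the max for numerical stability before exponentiating.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_context(decoder_state, encoder_states):
    # Score each encoder hidden state against the current decoder state
    # (dot-product scoring is an assumption made for this sketch).
    scores = [sum(d * h for d, h in zip(decoder_state, hs)) for hs in encoder_states]
    weights = softmax(scores)
    # The context vector is the attention-weighted sum of encoder states.
    dim = len(encoder_states[0])
    context = [sum(w * hs[i] for w, hs in zip(weights, encoder_states)) for i in range(dim)]
    return weights, context

# Toy values: three encoder hidden states and one decoder state.
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
decoder_state = [0.0, 2.0]

weights, context = attention_context(decoder_state, encoder_states)
print(weights)   # larger weights for the states that match the decoder state
print(context)
```

Because the weights are recomputed from the decoder state at every step, the context vector changes from one predicted word to the next.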

### Transformer Model

In [5]:
# Suppress warning messages
import warnings
warnings.filterwarnings(action = 'ignore')

In [18]:
!pip install transformers tf-keras

Collecting tf-keras
  Downloading tf_keras-2.19.0-py3-none-any.whl.metadata (1.8 kB)
Downloading tf_keras-2.19.0-py3-none-any.whl (1.7 MB)
Installing collected packages: tf-keras
Successfully installed tf-keras-2.19.0


In [11]:
import transformers
import tensorflow as tf

print(transformers.__version__)
print(tf.__version__)

4.52.4
2.19.0


### Using the Transformers Pipeline

In [14]:
from transformers import pipeline

#### 1. Sequence Classification: Sentiment Analysis

In [20]:
classifier = pipeline('sentiment-analysis')

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision 714eb0f (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
Device set to use mps:0


In [24]:
print(classifier("I love you"))
print(classifier("I hate you"))

[{'label': 'POSITIVE', 'score': 0.9998656511306763}]
[{'label': 'NEGATIVE', 'score': 0.9991129040718079}]


In [26]:
print(classifier('We are very happy to show you the Transformers library'))

[{'label': 'POSITIVE', 'score': 0.9998044371604919}]


In [28]:
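# A specific pre-trained model can be selected. This one is built on multilingual BERT, so it accepts Korean input,
# but its sentiment fine-tuning used reviews in English, Dutch, German, French, Spanish, and Italian (no Korean).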
classifier = pipeline('sentiment-analysis', model = "nlptown/bert-base-multilingual-uncased-sentiment")

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
Device set to use mps:0


In [30]:
print(classifier("We are very happy to show you the Transform library"))

[{'label': '5 stars', 'score': 0.6782801151275635}]


In [32]:
print(classifier("그 영화 지루하고 재미없네."))

[{'label': '3 stars', 'score': 0.22962135076522827}]
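
The Korean sentence ("That movie is boring and no fun.") is clearly negative, yet the model answers `3 stars` with a score of about 0.23. With five star labels, a uniform guess would score 0.20, so this prediction is barely better than chance. A minimal sketch of how such low-confidence results might be filtered out follows; the helper name `confident_predictions` and the 0.5 threshold are assumptions for illustration, and the data is copied from the outputs recorded above.

```python
def confident_predictions(results, threshold=0.5):
    """Split pipeline-style results into (confident, uncertain) lists."""
    confident = [r for r in results if r["score"] >= threshold]
    uncertain = [r for r in results if r["score"] < threshold]
    return confident, uncertain

# Outputs recorded in the two cells above.
recorded = [
    {"label": "5 stars", "score": 0.6782801151275635},   # English sentence
    {"label": "3 stars", "score": 0.22962135076522827},  # Korean sentence
]

confident, uncertain = confident_predictions(recorded)
print("confident:", [r["label"] for r in confident])
print("uncertain:", [r["label"] for r in uncertain])
```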


#### 2. Unmasking

In [37]:
# fill-mask: fill in the missing token
unmasker = pipeline('fill-mask', model = 'bert-base-uncased')

Xet Storage is enabled for this repo, but the 'hf_xet' package is not installed. Falling back to regular HTTP download. For better performance, install the package with: `pip install huggingface_hub[hf_xet]` or `pip install hf_xet`
Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use mps:0


In [47]:
# Returns the most likely tokens for the [MASK] position
unmasker("Hello I am a [MASK] model.")

[{'score': 0.1317753791809082,
  'token': 4827,
  'token_str': 'fashion',
  'sequence': 'hello i am a fashion model.'},
 {'score': 0.112042136490345,
  'token': 2535,
  'token_str': 'role',
  'sequence': 'hello i am a role model.'},
 {'score': 0.04798121750354767,
  'token': 2047,
  'token_str': 'new',
  'sequence': 'hello i am a new model.'},
 {'score': 0.03963988646864891,
  'token': 2449,
  'token_str': 'business',
  'sequence': 'hello i am a business model.'},
 {'score': 0.02204621024429798,
  'token': 2944,
  'token_str': 'model',
  'sequence': 'hello i am a model model.'}]
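
Each `score` is the probability the model assigns to that token at the masked position, so the five candidates together cover only part of the distribution. The sketch below simply sums the recorded scores to show how much probability is left for the rest of the vocabulary; the variable names are illustrative.

```python
# (token_str, score) pairs copied from the output above.
recorded = [
    ("fashion", 0.1317753791809082),
    ("role", 0.112042136490345),
    ("new", 0.04798121750354767),
    ("business", 0.03963988646864891),
    ("model", 0.02204621024429798),
]

top_k_mass = sum(score for _, score in recorded)
remaining_mass = 1.0 - top_k_mass
best_token, best_score = max(recorded, key=lambda pair: pair[1])

print(f"top-5 probability mass: {top_k_mass:.3f}")
print(f"mass on all other tokens: {remaining_mass:.3f}")
print(f"best candidate: {best_token} ({best_score:.3f})")
```

The top candidate wins with only about 13% probability, which is a reminder that the sentence leaves the masked word fairly open.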

#### 3. Question Answering

In [50]:
# Question Answering: given a context passage and a question, the model finds the answer inside the passage.
qa = pipeline("question-answering")

No model was supplied, defaulted to distilbert/distilbert-base-cased-distilled-squad and revision 564e9b5 (https://huggingface.co/distilbert/distilbert-base-cased-distilled-squad).
Using a pipeline without specifying a model name and revision in production is not recommended.
Device set to use mps:0


In [52]:
# Context passage
context = r"""
Extractive Question Answering is the task of extracting an answer from a text given a question. An example of a question answering dataset is the SQuAD dataset, which is entirely based on that task. 
If you would like to fine-tune a model on a SQuAD task, you may leverage the `run_squad.py`.
"""

In [54]:
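# Find the answer to each question inside the context passage
# Result fields: start = character offset where the answer begins in the context, end = offset where it ends, answer = the extracted text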

print(qa(question = "What is extractive question answering", context = context))
print(qa(question = "What is a good example of a question answering dataset?", context = context))

{'score': 0.38024961948394775, 'start': 34, 'end': 95, 'answer': 'the task of extracting an answer from a text given a question'}
{'score': 0.5115312337875366, 'start': 147, 'end': 160, 'answer': 'SQuAD dataset'}
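
Since `start` and `end` are character offsets into `context`, slicing the context with them should reproduce `answer`. The sketch below checks this against the two results recorded above; only the first two sentences of the context are reproduced, which is enough because both answers fall inside them.

```python
# First part of the context from the cell above (it begins with a newline
# because the triple-quoted string starts on its own line).
context = (
    "\n"
    "Extractive Question Answering is the task of extracting an answer "
    "from a text given a question. An example of a question answering "
    "dataset is the SQuAD dataset, which is entirely based on that task."
)

# Results recorded above (scores omitted).
results = [
    {"start": 34, "end": 95,
     "answer": "the task of extracting an answer from a text given a question"},
    {"start": 147, "end": 160, "answer": "SQuAD dataset"},
]

for r in results:
    span = context[r["start"]:r["end"]]
    print(repr(span), span == r["answer"])
```

This is what "extractive" means here: the model does not write new text, it selects a span of the passage.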


#### 4. Text Generation: given the beginning of a sentence, the model writes the continuation

In [57]:
text_generator = pipeline("text-generation")

No model was supplied, defaulted to openai-community/gpt2 and revision 607a30d (https://huggingface.co/openai-community/gpt2).
Using a pipeline without specifying a model name and revision in production is not recommended.
Device set to use mps:0


In [59]:
print(text_generator("When the Titanic crashed, I", max_length = 50, do_sample = False))

Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
The following generation flags are not valid and may be ignored: ['temperature']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Both `max_new_tokens` (=256) and `max_length`(=50) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


[{'generated_text': "When the Titanic crashed, I was in the middle of a long day, and I was thinking, 'What if I'm going to die?' And I was thinking, 'What if I'm going to die?' And I was thinking, 'What if I'm going to die?' And I was thinking, 'What if I'm going to die?' And I was thinking, 'What if I'm going to die?' And I was thinking, 'What if I'm going to die?' And I was thinking, 'What if I'm going to die?' And I was thinking, 'What if I'm going to die?' And I was thinking, 'What if I'm going to die?' And I was thinking, 'What if I'm going to die?' And I was thinking, 'What if I'm going to die?' And I was thinking, 'What if I'm going to die?' And I was thinking, 'What if I'm going to die?' And I was thinking, 'What if I'm going to die?' And I was thinking, 'What if I'm going to die?' And I was thinking, 'What if I'm going to die?' And I was thinking, 'What if I'm going to die?' And I was thinking, 'What if I"}]
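
The output above gets stuck repeating one clause. Two things contribute: the warnings show that `max_new_tokens` (256) took precedence over `max_length = 50`, and `do_sample = False` means greedy decoding, which always picks the single most likely next token and can therefore fall into a cycle. The toy sketch below reproduces that behaviour with a hand-written next-token table and shows how banning repeated n-grams breaks the loop. The table, its scores, and the function `greedy_decode` are invented for illustration; the parameter name mirrors the `no_repeat_ngram_size` generation option in Transformers, but this is not the library's implementation.

```python
# Hypothetical next-token scores: NEXT[current][candidate] = score.
NEXT = {
    "I": {"was": 0.9, "am": 0.1},
    "was": {"thinking": 0.8, "tired": 0.2},
    "thinking": {"and": 0.7, ".": 0.3},
    "and": {"I": 0.9, "then": 0.1},
    "am": {"fine": 0.6, ".": 0.4},
    "fine": {".": 1.0},
}

def greedy_decode(start, max_new_tokens, no_repeat_ngram_size=0):
    tokens = [start]
    for _ in range(max_new_tokens):
        candidates = dict(NEXT.get(tokens[-1], {}))
        n = no_repeat_ngram_size
        if n > 0:
            # Collect every n-gram generated so far and drop any candidate
            # that would complete one of them again.
            seen = {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}
            prefix = tuple(tokens[len(tokens) - (n - 1):])
            candidates = {t: s for t, s in candidates.items()
                          if prefix + (t,) not in seen}
        if not candidates:
            break
        best = max(candidates, key=candidates.get)  # greedy: take the argmax
        tokens.append(best)
        if best == ".":
            break
    return tokens

print(" ".join(greedy_decode("I", 8)))
print(" ".join(greedy_decode("I", 8, no_repeat_ngram_size=2)))
```

In the real pipeline, options such as `max_new_tokens`, `no_repeat_ngram_size`, or sampling with `do_sample = True` are the usual ways to limit length and repetition.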


#### 5. Named Entity Recognition (NER): label each token with its entity type, much as morphological analysis tags each word

In [64]:
nlp = pipeline("ner")

No model was supplied, defaulted to dbmdz/bert-large-cased-finetuned-conll03-english and revision 4c53496 (https://huggingface.co/dbmdz/bert-large-cased-finetuned-conll03-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
Some weights of the model checkpoint at dbmdz/bert-large-cased-finetuned-conll03-english were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initiali

In [66]:
sequence = """Hugging Face Inc. is a company based in New York City. Its headquarters are in DUMBO, 
therefore very close to the Manhattan Bridge which is visible from the window."""

In [68]:
# List the entity label for each token

# O, Outside of a named entity
# B-MISC, Beginning of a miscellaneous entity right after another miscellaneous entity
# I-MISC, Miscellaneous entity
# B-PER, Beginning of a person's name right after another person's name
# I-PER, Person's name
# B-ORG, Beginning of an organization right after another organization
# I-ORG, Organization
# B-LOC, Beginning of a location right after another location
# I-LOC, Location

# Hu, ##gging: recognized as I-ORG (organization)
# New, York: recognized as I-LOC (location)

print(nlp(sequence))

[{'entity': 'I-ORG', 'score': 0.99957865, 'index': 1, 'word': 'Hu', 'start': 0, 'end': 2}, {'entity': 'I-ORG', 'score': 0.9909764, 'index': 2, 'word': '##gging', 'start': 2, 'end': 7}, {'entity': 'I-ORG', 'score': 0.9982224, 'index': 3, 'word': 'Face', 'start': 8, 'end': 12}, {'entity': 'I-ORG', 'score': 0.9994879, 'index': 4, 'word': 'Inc', 'start': 13, 'end': 16}, {'entity': 'I-LOC', 'score': 0.9994344, 'index': 11, 'word': 'New', 'start': 40, 'end': 43}, {'entity': 'I-LOC', 'score': 0.99931955, 'index': 12, 'word': 'York', 'start': 44, 'end': 48}, {'entity': 'I-LOC', 'score': 0.9993794, 'index': 13, 'word': 'City', 'start': 49, 'end': 53}, {'entity': 'I-LOC', 'score': 0.98625815, 'index': 19, 'word': 'D', 'start': 79, 'end': 80}, {'entity': 'I-LOC', 'score': 0.951427, 'index': 20, 'word': '##UM', 'start': 80, 'end': 82}, {'entity': 'I-LOC', 'score': 0.9336593, 'index': 21, 'word': '##BO', 'start': 82, 'end': 84}, {'entity': 'I-LOC', 'score': 0.97616553, 'index': 28, 'word': 'Manhatt
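
The pipeline reports one entry per WordPiece token, so "Hugging" arrives as `Hu` and `##gging`. A minimal sketch of merging consecutive tokens of the same entity type back into whole entities follows, using the `index`, `start`, and `end` fields from the first entries recorded above. The helper `group_entities` is hypothetical; the pipeline itself also accepts an `aggregation_strategy` argument (for example `"simple"`) that does this kind of grouping for you.

```python
# First sentence of the input, and the first seven recorded tokens.
sequence = "Hugging Face Inc. is a company based in New York City."
tokens = [
    {"entity": "I-ORG", "index": 1, "word": "Hu", "start": 0, "end": 2},
    {"entity": "I-ORG", "index": 2, "word": "##gging", "start": 2, "end": 7},
    {"entity": "I-ORG", "index": 3, "word": "Face", "start": 8, "end": 12},
    {"entity": "I-ORG", "index": 4, "word": "Inc", "start": 13, "end": 16},
    {"entity": "I-LOC", "index": 11, "word": "New", "start": 40, "end": 43},
    {"entity": "I-LOC", "index": 12, "word": "York", "start": 44, "end": 48},
    {"entity": "I-LOC", "index": 13, "word": "City", "start": 49, "end": 53},
]

def group_entities(tokens, text):
    groups = []
    for tok in sorted(tokens, key=lambda t: t["index"]):
        kind = tok["entity"].split("-", 1)[-1]  # "I-ORG" -> "ORG"
        last = groups[-1] if groups else None
        # Extend the current group when the type matches and the token
        # directly follows the previous one in the tokenized input.
        if last and last["kind"] == kind and tok["index"] == last["last_index"] + 1:
            last["end"] = tok["end"]
            last["last_index"] = tok["index"]
        else:
            groups.append({"kind": kind, "start": tok["start"],
                           "end": tok["end"], "last_index": tok["index"]})
    # Recover the surface text from character offsets, not from the pieces.
    return [(g["kind"], text[g["start"]:g["end"]]) for g in groups]

entities = group_entities(tokens, sequence)
print(entities)
```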

#### 6. Summarization

In [71]:
summarizer = pipeline("summarization")

No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 and revision a4f8f3e (https://huggingface.co/sshleifer/distilbart-cnn-12-6).
Using a pipeline without specifying a model name and revision in production is not recommended.
Device set to use mps:0


In [74]:
ARTICLE = """ New York (CNN)When Liana Barrientos was 23 years old, she got married in Westchester County, New York. A year later, she got married again in Westchester County, but to a different man and without divorcing her first husband. Only 18 days after that marriage, she got hitched yet again. Then, Barrientos declared "I do" five more times, sometimes only within two weeks of each other. In 2010, she married once more, this time in the Bronx. In an application for a marriage license, she stated it was her "first and only" marriage. Barrientos, now 39, is facing two criminal counts of "offering a false instrument for filing in the first degree," referring to her false statements on the 2010 marriage license application, according to court documents. Prosecutors said the marriages were part of an immigration scam. On Friday, she pleaded not guilty at State Supreme Court in the Bronx, according to her attorney, Christopher Wright, who declined to comment further. After leaving court, Barrientos was arrested and charged with theft of service and criminal trespass for allegedly sneaking into the New York subway through an emergency exit, said Detective Annette Markowski, a police spokeswoman. In total, Barrientos has been married 10 times, with nine of her marriages occurring between 1999 and 2002. All occurred either in Westchester County, Long Island, New Jersey or the Bronx. She is believed to still be married to four men, and at one time, she was married to eight men at once, prosecutors say. Prosecutors said the immigration scam involved some of her husbands, who filed for permanent residence status shortly after the marriages. Any divorces happened only after such filings were approved. It was unclear whether any of the men will be prosecuted. The case was referred to the Bronx District Attorney’s Office by Immigration and Customs Enforcement and the Department of Homeland Security’s Investigation Division. 
Seven of the men are from so‐called "red‐flagged" countries, including Egypt, Turkey, Georgia, Pakistan and Mali. Her eighth husband, Rashid Rajput, was deported in 2006 to his native Pakistan after an investigation by the Joint Terrorism Task Force. If convicted, Barrientos faces up to four years in prison. Her next court appearance is scheduled for May 18. """

In [76]:
print(summarizer(ARTICLE, max_length = 130, min_length = 30, do_sample=False))

[{'summary_text': ' Liana Barrientos pleaded not guilty to two criminal counts of "offering a false instrument for filing in the first degree" Prosecutors say the marriages were part of an immigration scam . If convicted, she faces up to four years in prison .'}]


#### 7. Translation

In [78]:
translator = pipeline("translation_en_to_fr")

No model was supplied, defaulted to google-t5/t5-base and revision a9723ea (https://huggingface.co/google-t5/t5-base).
Using a pipeline without specifying a model name and revision in production is not recommended.
Device set to use mps:0


In [80]:
print(translator("Hugging Face is a technology company based in New York and Paris", max_length = 40))

Both `max_new_tokens` (=256) and `max_length`(=40) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


[{'translation_text': 'Hugging Face est une entreprise technologique basée à New York et à Paris.'}]
