# A Tour of Transformer Applications via Hugging Face
- All of the models used in this tour are already fine-tuned for the task at hand.

### The Hugging Face Ecosystem
- The Hugging Face ecosystem consists mainly of two parts:
    - A family of libraries: provides the code
    - The Hub: provides the pretrained model weights, datasets, evaluation metrics, and more.
- The Hugging Face Hub
    - The Hugging Face Hub hosts over 20,000 freely available models.
    - You can search them by filtering on tasks, frameworks, and datasets, and load a model with one line of code.
- Hugging Face Datasets
    - Datasets simplifies loading, processing, and storing datasets through a standard interface and smart caching, so you don't have to redo your preprocessing each time you run the code.
    - It avoids RAM limitations through memory mapping, which stores the contents of a file in virtual memory and enables multiple processes to modify a file more efficiently.
- Hugging Face Accelerate
    - Accelerate lets you run your raw PyTorch training scripts on any kind of device.
    - It adds a layer of abstraction to your normal training loops, which takes care of all the custom logic necessary for the training infrastructure.
    - The `device` attribute of an `Accelerator` instance points to the automatically detected hardware (CPU, GPU, or TPU) on your system.
    - Distributed training across multiple GPUs or TPUs becomes straightforward.
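
The memory-mapping idea mentioned under Datasets can be illustrated with Python's standard `mmap` module. This is only a minimal sketch of the general mechanism, not how the Datasets library is implemented internally; the helper name `overwrite_prefix_via_mmap` and the file name `demo.bin` are made up for illustration.

```python
import mmap
import os
import tempfile


def overwrite_prefix_via_mmap(path: str, new_prefix: bytes) -> bytes:
    """Overwrite the first bytes of a file through a memory map, then re-read the file."""
    with open(path, "r+b") as f:
        # Length 0 maps the whole file; the OS pages data in on demand
        # instead of copying the entire file into RAM up front.
        with mmap.mmap(f.fileno(), 0) as mapped:
            mapped[: len(new_prefix)] = new_prefix  # in-place edit, same length
            mapped.flush()  # push the change back to the file
    with open(path, "rb") as f:
        return f.read()


with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo.bin")  # hypothetical file name
    with open(demo_path, "wb") as f:
        f.write(b"hello world")
    result = overwrite_prefix_via_mmap(demo_path, b"HELLO")

print(result)  # b'HELLO world'
```

Because the edit goes through the mapping rather than a full read-modify-write of the file, the same pattern scales to files that are larger than available RAM.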

In [1]:
text = """Dear Amazon, last week I ordered an Optimus Prime action figure \
from your online store in Germany. Unfortunately, when I opened the package, \
I discovered to my horror that I had been sent an action figure of Megatron \
instead! As a lifelong enemy of the Decepticons, I hope you can understand my \
dilemma. To resolve the issue, I demand an exchange of Megatron for the \
Optimus Prime figure I ordered. Enclosed are copies of my records concerning \
this purchase. I expect to hear from you soon. Sincerely, Bumblebee."""

### Required Packages:
- datasets, pandas, transformers, torch, accelerate, numpy, scikit-learn

In [2]:
import pandas as pd
from transformers import pipeline
classifier = pipeline('text-classification') # pipeline is a very high-level (highly abstracted) API

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision 714eb0f (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
All PyTorch model weights were used when initializing TFDistilBertForSequenceClassification.

All the weights of TFDistilBertForSequenceClassification were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFDistilBertForSequenceClassification for predictions without further training.
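
The log above warns against relying on the default checkpoint in production. One way to follow that advice is to pin the model name and revision explicitly. The sketch below copies both values from the log message; the constant names and the `build_classifier` helper are hypothetical, and the import is kept inside the function only so that defining it does not require `transformers` or trigger a download.

```python
# Checkpoint name and revision copied from the log message above.
MODEL_ID = "distilbert/distilbert-base-uncased-finetuned-sst-2-english"
REVISION = "714eb0f"


def build_classifier():
    """Create a sentiment classifier pinned to a fixed checkpoint and revision.

    Calling this function downloads the weights on first use, so it needs
    network access and the `transformers` package.
    """
    from transformers import pipeline  # lazy import (assumption: transformers is installed)

    return pipeline("text-classification", model=MODEL_ID, revision=REVISION)
```

Calling `build_classifier()` should then behave like `pipeline('text-classification')` above, but without the "No model was supplied" warning.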


### A simple classifier
- Let's generate some predictions.
- Each pipeline takes a string of text as input and returns a list of predictions (a list of dictionaries).

In [3]:
# Sentiment analysis
outputs = classifier(text)
print(pd.DataFrame(outputs)) # The model is very confident that the text has a negative sentiment.

      label     score
0  NEGATIVE  0.901546
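
Since the pipeline output is a plain list of dictionaries, it can also be consumed without pandas. A minimal sketch, using the prediction shown above as sample data; the helper name `top_prediction` is hypothetical.

```python
from typing import Dict, List, Tuple


def top_prediction(predictions: List[Dict]) -> Tuple[str, float]:
    """Return the (label, score) pair with the highest score."""
    best = max(predictions, key=lambda p: p["score"])
    return best["label"], best["score"]


# Sample data: the classifier output printed above.
outputs = [{"label": "NEGATIVE", "score": 0.901546}]
label, score = top_prediction(outputs)
print(f"{label} ({score:.2%})")
```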


In [4]:
# Named-Entity Recognition (NER): recognizing named entities, i.e., objects that have a name
ner_tagger = pipeline('ner', aggregation_strategy='simple')
# The "simple" aggregation strategy merges adjacent tokens that have the same entity tag (e.g., Optimus Prime).
# Tokens with different predicted tags are kept separate (e.g., Mega / ##tron).
outputs = ner_tagger(text)
print(pd.DataFrame(outputs))

No model was supplied, defaulted to dbmdz/bert-large-cased-finetuned-conll03-english and revision 4c53496 (https://huggingface.co/dbmdz/bert-large-cased-finetuned-conll03-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
All PyTorch model weights were used when initializing TFBertForTokenClassification.

All the weights of TFBertForTokenClassification were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertForTokenClassification for predictions without further training.


  entity_group     score           word  start  end
0          ORG  0.879010         Amazon      5   11
1         MISC  0.990859  Optimus Prime     36   49
2          LOC  0.999755        Germany     90   97
3         MISC  0.556569           Mega    208  212
4          PER  0.590257         ##tron    212  216
5          ORG  0.669692         Decept    253  259
6         MISC  0.498349        ##icons    259  264
7         MISC  0.775361       Megatron    350  358
8         MISC  0.987854  Optimus Prime    367  380
9          PER  0.812096      Bumblebee    502  511


- Entity labels in the output
    - ORG: Organization
    - LOC: Location
    - PER: Person
    - MISC: Miscellaneous
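
The split entries in the table (Mega / ##tron, Decept / ##icons) can be stitched back together in a post-processing step. The sketch below is one possible heuristic, not part of the pipeline API: it joins a `##` piece to the previous entry when their character offsets touch, and keeps the label of the higher-scoring piece. The helper name `merge_word_pieces` is hypothetical, and the sample rows are copied from the table above.

```python
from typing import Dict, List


def merge_word_pieces(entities: List[Dict]) -> List[Dict]:
    """Join '##' continuation pieces onto the preceding entity when offsets are adjacent."""
    merged: List[Dict] = []
    for entity in entities:
        is_continuation = (
            bool(merged)
            and entity["word"].startswith("##")
            and entity["start"] == merged[-1]["end"]
        )
        if is_continuation:
            previous = merged[-1]
            previous["word"] += entity["word"][2:]  # drop the '##' prefix
            previous["end"] = entity["end"]
            # Heuristic (an assumption): keep the label of the more confident piece.
            if entity["score"] > previous["score"]:
                previous["entity_group"] = entity["entity_group"]
                previous["score"] = entity["score"]
        else:
            merged.append(dict(entity))  # copy so the input list is not mutated
    return merged


# Rows 2 to 6 of the NER output above.
rows = [
    {"entity_group": "LOC", "score": 0.999755, "word": "Germany", "start": 90, "end": 97},
    {"entity_group": "MISC", "score": 0.556569, "word": "Mega", "start": 208, "end": 212},
    {"entity_group": "PER", "score": 0.590257, "word": "##tron", "start": 212, "end": 216},
    {"entity_group": "ORG", "score": 0.669692, "word": "Decept", "start": 253, "end": 259},
    {"entity_group": "MISC", "score": 0.498349, "word": "##icons", "start": 259, "end": 264},
]
merged = merge_word_pieces(rows)
print([(e["word"], e["entity_group"]) for e in merged])
```

With this heuristic "Megatron" ends up tagged PER and "Decepticons" ORG, which shows that the merged label is only as reliable as the low-confidence pieces it came from.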

In [5]:
# (Extractive) Question Answering
reader = pipeline('question-answering')
question = "What does the customer want?"
outputs = reader(question=question, context=text)
print(pd.DataFrame([outputs])) # Start, end: The character indices where the answer span was found.

No model was supplied, defaulted to distilbert/distilbert-base-cased-distilled-squad and revision 564e9b5 (https://huggingface.co/distilbert/distilbert-base-cased-distilled-squad).
Using a pipeline without specifying a model name and revision in production is not recommended.
All PyTorch model weights were used when initializing TFDistilBertForQuestionAnswering.

All the weights of TFDistilBertForQuestionAnswering were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFDistilBertForQuestionAnswering for predictions without further training.


      score  start  end                   answer
0  0.631292    335  358  an exchange of Megatron
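
Because the model is extractive, `start` and `end` are character offsets into the context, so the answer can be recovered by slicing the original text. A minimal sketch using the values printed above; the helper `show_in_context` and its `margin` parameter are hypothetical.

```python
text = """Dear Amazon, last week I ordered an Optimus Prime action figure \
from your online store in Germany. Unfortunately, when I opened the package, \
I discovered to my horror that I had been sent an action figure of Megatron \
instead! As a lifelong enemy of the Decepticons, I hope you can understand my \
dilemma. To resolve the issue, I demand an exchange of Megatron for the \
Optimus Prime figure I ordered. Enclosed are copies of my records concerning \
this purchase. I expect to hear from you soon. Sincerely, Bumblebee."""

# Values copied from the question-answering output above.
answer = {"score": 0.631292, "start": 335, "end": 358, "answer": "an exchange of Megatron"}


def show_in_context(context: str, start: int, end: int, margin: int = 30) -> str:
    """Return the answer span in brackets with some surrounding characters."""
    left = context[max(0, start - margin):start]
    right = context[end:end + margin]
    return f"...{left}[{context[start:end]}]{right}..."


span = text[answer["start"]:answer["end"]]
snippet = show_in_context(text, answer["start"], answer["end"])
print(span)
print(snippet)
```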


In [6]:
# Summarization
summarizer = pipeline('summarization')
outputs = summarizer(text, max_length=45, clean_up_tokenization_spaces=True)
print(outputs[0]['summary_text']) 

No model was supplied, defaulted to google-t5/t5-small and revision df1b051 (https://huggingface.co/google-t5/t5-small).
Using a pipeline without specifying a model name and revision in production is not recommended.
All PyTorch model weights were used when initializing TFT5ForConditionalGeneration.

All the weights of TFT5ForConditionalGeneration were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFT5ForConditionalGeneration for predictions without further training.

last week, I ordered an Optimus Prime action figure from your online store in germany. when I opened the package, I discovered to my horror that I had been sent an action figure of Megatron instead


In [7]:
# Translation
translator = pipeline('translation_en_to_de', model='Helsinki-NLP/opus-mt-en-de') # a specific model can be selected via `model`
outputs = translator(text, clean_up_tokenization_spaces=True, min_length=100)
print(outputs[0]['translation_text'])

All model checkpoint layers were used when initializing TFMarianMTModel.

All the layers of TFMarianMTModel were initialized from the model checkpoint at Helsinki-NLP/opus-mt-en-de.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFMarianMTModel for predictions without further training.
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)



Sehr geehrter Amazon, letzte Woche habe ich eine Optimus Prime Action Figur aus Ihrem Online-Shop in Deutschland bestellt. Leider, als ich das Paket öffnete, entdeckte ich zu meinem Entsetzen, dass ich stattdessen eine Action Figur von Megatron geschickt worden war! Als lebenslanger Feind der Decepticons, Ich hoffe, Sie können mein Dilemma verstehen. Um das Problem zu lösen, Ich fordere einen Austausch von Megatron für die Optimus Prime Figur bestellte ich. Anbei sind Kopien meiner Aufzeichnungen über diesen Kauf. Ich erwarte, von Ihnen bald zu hören. Aufrichtig, Bumblebee.


In [8]:
# Text generation (autocomplete, fast reply to customer feedback)
# RAG
generator = pipeline('text-generation')
response = "Dear Bublebee, I am sorry to hear that your order was mixed up."
prompt = text + "\n\nCustomer service response:\n" + response
outputs = generator(prompt, max_length=200)
print(outputs[0]['generated_text'])

No model was supplied, defaulted to openai-community/gpt2 and revision 607a30d (https://huggingface.co/openai-community/gpt2).
Using a pipeline without specifying a model name and revision in production is not recommended.
All PyTorch model weights were used when initializing TFGPT2LMHeadModel.

All the weights of TFGPT2LMHeadModel were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFGPT2LMHeadModel for predictions without further training.
Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Dear Amazon, last week I ordered an Optimus Prime action figure from your online store in Germany. Unfortunately, when I opened the package, I discovered to my horror that I had been sent an action figure of Megatron instead! As a lifelong enemy of the Decepticons, I hope you can understand my dilemma. To resolve the issue, I demand an exchange of Megatron for the Optimus Prime figure I ordered. Enclosed are copies of my records concerning this purchase. I expect to hear from you soon. Sincerely, Bumblebee.

Customer service response:
Dear Bublebee, I am sorry to hear that your order was mixed up. My order has been processed and I am confident that it will be delivered tomorrow. If you would just like to express your concerns about your own package or wish to contact me for further information, visit Amazon.


For more information, please visit the Amazon.com website.
