## Intro to Hugging Face

*Prepared by:*  
**Jude Michael Teves**  
Faculty, Software Technology Department  
College of Computer Studies - De La Salle University

## Introduction

Hugging Face is a platform and community for artificial intelligence (AI) and machine learning (ML). It is often called the "GitHub for machine learning" because users can share and collaborate on pre-trained models, datasets, and applications. Its tools are best known for natural language processing (NLP) tasks such as text generation, translation, and sentiment analysis, which are the focus of this notebook. By putting a large catalog of open-source models behind a simple interface, Hugging Face makes it easier for developers and researchers to build on, share, and deploy existing models.

## Preliminaries

Install the Transformers, Datasets, and Evaluate libraries to run this notebook.

In [None]:
!pip install datasets evaluate transformers[sentencepiece]


In [None]:
import logging, warnings

logging.getLogger("transformers").setLevel(logging.ERROR)
warnings.filterwarnings("ignore")  # Ignore all warnings

## Basic Pipeline

### Sentiment Analysis

In [None]:
from transformers import pipeline

classifier = pipeline("sentiment-analysis")
classifier(
    [
      "Attack on Titan is a masterpiece with stunning visuals and complex characters.",
      "I found the plot of My Hero Academia confusing and predictable."
    ]
)

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision 714eb0f (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


config.json:   0%|          | 0.00/629 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/268M [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/48.0 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

Device set to use cpu


[{'label': 'POSITIVE', 'score': 0.9998816251754761},
 {'label': 'NEGATIVE', 'score': 0.9998169541358948}]

The first call takes longer because the model files (the `config.json`, `model.safetensors`, and tokenizer files listed above) have to be downloaded and cached. Running it again reuses the cached files, so it is faster.
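
To see where the downloaded files end up, here is a minimal sketch that guesses the cache folder using only the standard library. The function name `guess_hub_cache_dir` is made up for this notebook, and the lookup order (`HF_HUB_CACHE`, then `HF_HOME`, then `~/.cache/huggingface/hub`) is a simplified assumption about the defaults, so treat the result as a hint rather than the library's exact logic.

```python
import os
from pathlib import Path


def guess_hub_cache_dir(env=None):
    """Return the folder where downloaded model files are most likely cached.

    Simplified assumption: HF_HUB_CACHE wins, then HF_HOME/hub,
    then the default ~/.cache/huggingface/hub.
    """
    env = os.environ if env is None else env
    if env.get("HF_HUB_CACHE"):
        return Path(env["HF_HUB_CACHE"])
    if env.get("HF_HOME"):
        return Path(env["HF_HOME"]) / "hub"
    return Path.home() / ".cache" / "huggingface" / "hub"


cache_dir = guess_hub_cache_dir()
print("Cache directory:", cache_dir)

# Each cached repository is stored in its own "models--<org>--<name>" folder.
if cache_dir.is_dir():
    for entry in sorted(cache_dir.glob("models--*")):
        print(" ", entry.name)
else:
    print("Nothing cached yet.")
```

If the folder already lists the DistilBERT model, the next cell should skip the download step.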

In [None]:
classifier = pipeline("sentiment-analysis")
classifier(
    [
      "Attack on Titan is a masterpiece with stunning visuals and complex characters.",
      "I found the plot of My Hero Academia confusing and predictable."
    ]
)

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision 714eb0f (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
Device set to use cpu


[{'label': 'POSITIVE', 'score': 0.9998816251754761},
 {'label': 'NEGATIVE', 'score': 0.9998169541358948}]

### Specifying a model

As the log above shows, the pipeline falls back to a default model when we do not supply one. To choose a specific model, pass its name through the `model` argument.
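
Before switching models, note that the log above recommends naming the model and revision explicitly. A minimal sketch of that, reusing the model name and revision hash printed in the log; the helper name `build_pinned_classifier` is made up for this notebook, and the import sits inside the function only so the constants can be reused without loading the library.

```python
MODEL_ID = "distilbert/distilbert-base-uncased-finetuned-sst-2-english"
REVISION = "714eb0f"  # short commit hash reported in the log above


def build_pinned_classifier():
    """Create the sentiment pipeline with an explicit model and revision."""
    from transformers import pipeline

    return pipeline("sentiment-analysis", model=MODEL_ID, revision=REVISION)


# In the notebook (downloads the model on first use):
# classifier = build_pinned_classifier()
# classifier("Attack on Titan is a masterpiece.")
```

Pinning both values should keep results reproducible even if the default model for the task changes later.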

In [None]:
from transformers import pipeline

classifier = pipeline("sentiment-analysis", model="nlptown/bert-base-multilingual-uncased-sentiment")
classifier(
    [
      "Attack on Titan is a masterpiece with stunning visuals and complex characters.",
      "I found the plot of My Hero Academia confusing and predictable."
    ]
)

config.json:   0%|          | 0.00/953 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/669M [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/39.0 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/872k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/112 [00:00<?, ?B/s]

Device set to use cpu


[{'label': '5 stars', 'score': 0.7471705675125122},
 {'label': '2 stars', 'score': 0.5448039174079895}]

The `nlptown/bert-base-multilingual-uncased-sentiment` model is a BERT model fine-tuned for sentiment analysis of product reviews in multiple languages. Instead of a positive/negative label, it predicts a rating from 1 to 5 stars, which is why the output above reads `5 stars` and `2 stars`. Note that the scores are also lower than with the default model, since the probability is spread over five classes instead of two.

For more details, you may refer to this link: https://huggingface.co/nlptown/bert-base-multilingual-uncased-sentiment
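
Since the labels come back as strings such as `5 stars`, it is often handy to turn them into numbers. The sketch below parses the output copied from the cell above; the helper names and the thresholds in `stars_to_sentiment` (1-2 negative, 3 neutral, 4-5 positive) are assumptions for illustration, not something defined by the model.

```python
import re

# Output copied from the cell above.
predictions = [
    {"label": "5 stars", "score": 0.7471705675125122},
    {"label": "2 stars", "score": 0.5448039174079895},
]


def label_to_stars(label):
    """Extract the leading integer from labels such as '5 stars' or '1 star'."""
    match = re.match(r"(\d+)\s+stars?$", label)
    if match is None:
        raise ValueError(f"Unexpected label format: {label!r}")
    return int(match.group(1))


def stars_to_sentiment(stars):
    """Map a star rating to a coarse sentiment (thresholds are an assumption)."""
    if stars <= 2:
        return "negative"
    if stars == 3:
        return "neutral"
    return "positive"


stars = [label_to_stars(p["label"]) for p in predictions]
print(stars)                                   # [5, 2]
print([stars_to_sentiment(s) for s in stars])  # ['positive', 'negative']
```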

## Other NLP Tasks

### Zero-Shot Classification

In [None]:
# Classifying the genre of an anime description
classifier = pipeline("zero-shot-classification")
results = classifier(
    "A young wizard attends a magical school to learn powerful spells.",
    candidate_labels=["Fantasy", "Sci-Fi", "Romance", "Action"]
)
print("Anime Genre Classification:", results)

config.json:   0%|          | 0.00/1.15k [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/1.63G [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/26.0 [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/899k [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.36M [00:00<?, ?B/s]

Anime Genre Classification: {'sequence': 'A young wizard attends a magical school to learn powerful spells.', 'labels': ['Action', 'Fantasy', 'Sci-Fi', 'Romance'], 'scores': [0.44977834820747375, 0.23850484192371368, 0.23816724121570587, 0.07354957610368729]}
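
The result is a dictionary whose `labels` are sorted from most to least likely, with matching `scores` that sum to roughly 1 in this default single-label mode. A small sketch for reading it, using the output copied from above:

```python
# Output copied from the cell above.
results = {
    "sequence": "A young wizard attends a magical school to learn powerful spells.",
    "labels": ["Action", "Fantasy", "Sci-Fi", "Romance"],
    "scores": [0.44977834820747375, 0.23850484192371368,
               0.23816724121570587, 0.07354957610368729],
}

# Pair each label with its score and sort, in case the order is ever changed.
ranked = sorted(zip(results["labels"], results["scores"]),
                key=lambda pair: pair[1], reverse=True)
for label, score in ranked:
    print(f"{label:<8} {score:.3f}")

top_label, top_score = ranked[0]
margin = top_score - ranked[1][1]  # gap between the best and second-best label
print(f"Top label: {top_label} (margin over runner-up: {margin:.3f})")
```

Here the model ranks "Action" above "Fantasy", which is a good reminder to inspect zero-shot predictions before trusting them.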


### Text Generation

In [None]:
# Generating text about AI
generator = pipeline('text-generation', model='gpt2')
generated_text = generator("The intersection of AI and philosophy needs to be discussed more", max_length=50, num_return_sequences=1)
print("Text Generation:", generated_text)

config.json:   0%|          | 0.00/665 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/548M [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/124 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/26.0 [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/1.04M [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.36M [00:00<?, ?B/s]

Text Generation: [{'generated_text': 'The intersection of AI and philosophy needs to be discussed more thoroughly in the coming issue of the Journal of Human-Computer Interaction.\n\nImage Credit: http://www.sanscad.org/Articles/Autonomic-Mechan'}]


### Translation

In [None]:
# Translating an anime title
translator = pipeline("translation_en_to_de")  # Example: English to German
results = translator("Demon Slayer")
print("Anime Title Translation:", results)

Anime Title Translation: [{'translation_text': 'Dämonenschläger'}]


The `Helsinki-NLP/opus-mt-en-jap` model translates text from English to Japanese. It is part of the OPUS-MT project, which provides pre-trained neural machine translation models for many language pairs. Translation quality varies by model and input, so always check the output rather than assume it is accurate.

For more details, you may refer to this link: https://huggingface.co/Helsinki-NLP/opus-mt-en-jap

In [None]:
# Translating an anime title
translator = pipeline("translation", model="Helsinki-NLP/opus-mt-en-jap")  # Example: English to Japanese
results = translator("Demon Slayer")
print("Anime Title Translation:", results)

config.json:   0%|          | 0.00/1.38k [00:00<?, ?B/s]

pytorch_model.bin:   0%|          | 0.00/274M [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/293 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/43.0 [00:00<?, ?B/s]

source.spm:   0%|          | 0.00/509k [00:00<?, ?B/s]

target.spm:   0%|          | 0.00/1.02M [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/1.64M [00:00<?, ?B/s]



Anime Title Translation: [{'translation_text': 'メセク を 殺 し て しま い ,'}]


### Summarization

In [None]:
# Summarizing AI intro
summarizer = pipeline("summarization")
results = summarizer("""Artificial intelligence (AI) is the field devoted to
building artificial animals (or at least artificial creatures that – in suitable
contexts – appear to be animals) and, for many, artificial persons (or at least
artificial creatures that – in suitable contexts – appear to be persons).
Such goals immediately ensure that AI is a field of considerable interest to
many philosophers, and this has been true for the entire history of the field.
The philosophy of AI is concerned with the following sorts of questions:
What are our concepts of intelligence, thought, and rationality, and how might
they need to be revised in light of progress in AI? What would it take for AI
systems to really be intelligent, to really be thinking, or to really understand
something? Is it possible for an artifact to be conscious, to have genuine
subjective experience? Could such artifacts deserve moral consideration? Are we
ourselves essentially biological machines, and how would this affect our
understanding of AI?""")

print("Summary:", results)

config.json:   0%|          | 0.00/1.80k [00:00<?, ?B/s]

pytorch_model.bin:   0%|          | 0.00/1.22G [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/26.0 [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/899k [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

Summary: [{'summary_text': ' Artificial intelligence (AI) is the field devoted to building artificial animals (or at least artificial creatures that – in suitable contexts – appear to be animals) and, for many, artificial persons . The philosophy of AI is concerned with the following sorts of questions: What are our concepts of intelligence, thought, and rationality, and how might uablythey need to be revised in light of progress in AI?'}]


In [None]:
print(results[0]['summary_text'])

 Artificial intelligence (AI) is the field devoted to building artificial animals (or at least artificial creatures that – in suitable contexts – appear to be animals) and, for many, artificial persons . The philosophy of AI is concerned with the following sorts of questions: What are our concepts of intelligence, thought, and rationality, and how might uablythey need to be revised in light of progress in AI?


### Question Answering

In [None]:
# Question answering
question_answerer = pipeline("question-answering")
context = """
Albert Einstein was a physicist who developed the theory of relativity.
He was born in Germany in 1879 and died in the United States in 1955.
Einstein is considered one of the most important scientists of all time.
"""

question = "When was Albert Einstein born?"
results = question_answerer(question=question, context=context)
print("Q&A:", results)

Q&A: {'score': 0.9955588579177856, 'start': 100, 'end': 104, 'answer': '1879'}


In [None]:
question = "What is his job?"
results = question_answerer(question=question, context=context)
print("Q&A:", results)

Q&A: {'score': 0.9271867871284485, 'start': 23, 'end': 32, 'answer': 'physicist'}
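
The `start` and `end` values in the result are character offsets into `context`, so slicing the context with them gives back the answer text. A minimal sketch of that relationship; the `result` dictionary here is a hand-built stand-in (offsets computed with `str.find`), not an actual pipeline call.

```python
context = """
Albert Einstein was a physicist who developed the theory of relativity.
He was born in Germany in 1879 and died in the United States in 1955.
"""

# Hypothetical stand-in for a pipeline result; offsets are computed with str.find.
answer = "physicist"
start = context.find(answer)
result = {"answer": answer, "start": start, "end": start + len(answer)}

# Slicing the context with the offsets recovers the answer text.
span = context[result["start"]:result["end"]]
print(result["start"], result["end"], repr(span))  # 23 32 'physicist'
```

This is also why this kind of model can only answer with text that literally appears in the context.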


## Datasets

The Stanford Question Answering Dataset (SQuAD) is a widely used benchmark dataset for evaluating question answering models. It consists of a large collection of reading passages, along with questions posed by humans about those passages. The answers to these questions are spans of text directly extracted from the passages. SQuAD is designed to challenge models' ability to understand natural language and extract relevant information from given context. It has become a cornerstone for developing and evaluating models for reading comprehension and question answering tasks. Let's use that in the following example to demonstrate how to load data and use it for Q&A.

In [None]:
from datasets import load_dataset

# Load the SQuAD dataset
raw_datasets = load_dataset("squad")

# Example usage of the loaded dataset for question answering
question_answerer = pipeline("question-answering")

# Choose a sample from the dataset (e.g., the first one from the training set)
sample = raw_datasets["train"][0]
context = sample["context"]
question = sample["question"]
answers = sample["answers"]

# Use the pipeline for question answering
results = question_answerer(question=question, context=context)
print(f"Question: {question}")
print(f"Context: {context}")
print(f"Predicted Answer: {results['answer']}")
print(f"Actual Answer(s): {answers['text']}")

No model was supplied, defaulted to distilbert/distilbert-base-cased-distilled-squad and revision 564e9b5 (https://huggingface.co/distilbert/distilbert-base-cased-distilled-squad).
Using a pipeline without specifying a model name and revision in production is not recommended.
Device set to use cpu


Question: To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?
Context: Architecturally, the school has a Catholic character. Atop the Main Building's gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend "Venite Ad Me Omnes". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.
Predicted Answer: Saint Bernadette Soubirous
Actual Answer(s): ['Saint Bernadette Soubirous']
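
Each SQuAD row stores its answers as a dictionary with a `text` list and an `answer_start` list of character offsets into the context. The sketch below builds a hand-written row with that shape (shortened context, made-up `id` and `title`) and checks that every answer really sits at its stated offset; `check_answer_spans` is a helper invented for this notebook.

```python
# Hand-written record that mimics the shape of one SQuAD row (shortened context).
context = (
    "It is a replica of the grotto at Lourdes, France where the Virgin Mary "
    "reputedly appeared to Saint Bernadette Soubirous in 1858."
)
answer_text = "Saint Bernadette Soubirous"
record = {
    "id": "example-0",         # hypothetical id, not the real one
    "title": "example-title",  # hypothetical title
    "context": context,
    "question": "To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?",
    "answers": {
        "text": [answer_text],
        "answer_start": [context.find(answer_text)],  # character offset into context
    },
}


def check_answer_spans(row):
    """Return True if every answer is found in the context at its stated offset."""
    answers = row["answers"]
    return all(
        row["context"][start:start + len(text)] == text
        for text, start in zip(answers["text"], answers["answer_start"])
    )


print(check_answer_spans(record))  # True
```

The same function should work on `raw_datasets["train"][0]`, since it only relies on the `context` and `answers` fields.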


## Evaluate

In [None]:
import evaluate
from tqdm import tqdm

# Load the SQuAD dataset
raw_datasets = load_dataset("squad")

# Example usage of the loaded dataset for question answering
question_answerer = pipeline("question-answering")

# Initialize the metric
metric = evaluate.load("squad")

# Loop through a subset of the validation set (adjust the range as needed)
filtered_dataset = raw_datasets["validation"][:10]  # Process the first 10 examples for demonstration

predicted_answers_list = []
results_list = []

for i in tqdm(range(len(filtered_dataset['id']))):

    context = filtered_dataset["context"][i]
    question = filtered_dataset["question"][i]
    answers = filtered_dataset["answers"][i]

    # Use the pipeline for question answering
    results = question_answerer(question=question, context=context)
    predicted_answer = results["answer"]

    predicted_answers_list.append(predicted_answer)
    results_list.append(results)

    # Prepare the prediction and reference for the metric
    predictions = [{"id": filtered_dataset["id"][i], "prediction_text": predicted_answer}]
    references = [{"id": filtered_dataset["id"][i], "answers": answers}]

    # Add the example to the metric
    metric.add_batch(predictions=predictions, references=references)

print('---')

for i in range(len(predicted_answers_list)):
  print(f"Predicted Answer: {predicted_answers_list[i]}")
  # Print the reference answers from the dataset, not the prediction again
  print(f"Actual Answer(s): {filtered_dataset['answers'][i]['text']}")

# Compute the metric
final_score = metric.compute()
final_score

No model was supplied, defaulted to distilbert/distilbert-base-cased-distilled-squad and revision 564e9b5 (https://huggingface.co/distilbert/distilbert-base-cased-distilled-squad).
Using a pipeline without specifying a model name and revision in production is not recommended.
Device set to use cpu
100%|██████████| 10/10 [00:05<00:00,  1.74it/s]

---
Predicted Answer: Denver Broncos
Actual Answer(s): Denver Broncos
Predicted Answer: Carolina Panthers
Actual Answer(s): Carolina Panthers
Predicted Answer: Levi's Stadium in the San Francisco Bay Area at Santa Clara, California
Actual Answer(s): Levi's Stadium in the San Francisco Bay Area at Santa Clara, California
Predicted Answer: Carolina Panthers
Actual Answer(s): Carolina Panthers
Predicted Answer: gold
Actual Answer(s): gold
Predicted Answer: golden anniversary
Actual Answer(s): golden anniversary
Predicted Answer: February 7, 2016
Actual Answer(s): February 7, 2016
Predicted Answer: American Football Conference
Actual Answer(s): American Football Conference
Predicted Answer: golden anniversary
Actual Answer(s): golden anniversary
Predicted Answer: American Football Conference
Actual Answer(s): American Football Conference





{'exact_match': 90.0, 'f1': 90.0}
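
To make the two numbers less of a black box, here is a simplified re-implementation sketch of the two SQuAD-style scores. It follows the usual recipe (lowercase, strip punctuation and English articles, then compare), but it is written for illustration and is not the code that `evaluate` runs; all function names are made up for this notebook.

```python
import re
import string
from collections import Counter


def normalize(text):
    """Lowercase, drop punctuation and English articles, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, reference):
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(reference))


def f1(prediction, reference):
    """Token-level F1 between the normalized prediction and reference."""
    pred_tokens = normalize(prediction).split()
    ref_tokens = normalize(reference).split()
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def best_over_references(metric_fn, prediction, references):
    """Score against each reference answer and keep the best score."""
    return max(metric_fn(prediction, ref) for ref in references)


refs = ["Denver Broncos", "The Denver Broncos"]
print(best_over_references(exact_match, "denver broncos", refs))  # 1.0
print(round(best_over_references(f1, "Broncos", refs), 3))        # 0.667
```

Both scores are averaged over the examples and reported on a 0-100 scale. With ten examples, an `exact_match` of 90.0 means nine predictions matched a reference exactly, and an `f1` that is also 90.0 means the remaining prediction shared no tokens with its references.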

## Activity

For the following activities, you may manually add your own data via any means.

### Activity 1: English to Tagalog Translation

**Objective:** Translate English sentences into Tagalog using Hugging Face pipelines.

**Task:** Use the `translation` pipeline to perform English-to-Tagalog translation.

**Steps:**
1. Prepare 10 examples of English sentences or phrases for translation (e.g., "How are you?" or "The weather is nice today.").
2. Use the `translation` pipeline to translate each sentence into Tagalog.
3. Compare the translations with your own understanding of the language or professional translations.
4. Discuss the strengths and limitations of the translation pipeline.
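
As a possible starting point for the steps above, here is a small scaffold that pairs each sentence with its translation. It accepts any callable that behaves like the translation pipelines used earlier (a list of strings in, a list of `{"translation_text": ...}` dictionaries out). `translate_all` and `fake_translator` are names invented for this sketch, and the fake translator only exists so the helper can be tried without downloading a model.

```python
def translate_all(sentences, translate):
    """Pair each source sentence with its translation.

    `translate` is any callable that accepts a list of strings and returns
    a list of {"translation_text": ...} dicts, like the pipelines above.
    """
    outputs = translate(sentences)
    return [(src, out["translation_text"]) for src, out in zip(sentences, outputs)]


def fake_translator(sentences):
    """Hypothetical stand-in that just wraps each sentence in angle brackets."""
    return [{"translation_text": f"<{s}>"} for s in sentences]


pairs = translate_all(["How are you?", "The weather is nice today."], fake_translator)
for source, target in pairs:
    print(f"{source} -> {target}")
```

In your own cell, replace `fake_translator` with a real `pipeline("translation", model=...)`. A model such as `Helsinki-NLP/opus-mt-en-tl` is one candidate to look for, but that name is an assumption here, so confirm it on the Hub first.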

In [None]:
# YOUR CODE HERE

### Activity 2: Translation of Song Lyrics

**Objective:** Translate song lyrics between languages using Hugging Face pipelines.

**Task:** Use the `translation` / `translation_xx_to_yy` pipeline to translate lyrics.

**Steps:**
1. Select song lyrics in a language you are familiar with.
2. Use the `translation` pipeline to translate the lyrics into another language.
3. Compare the translated lyrics with professional translations (if available).
4. Discuss the nuances, challenges, and accuracy of the machine translation.


In [None]:
# YOUR CODE HERE

### Activity 3: Question Answering About a Historical Text

**Objective:** Use Hugging Face pipelines to answer questions based on a historical text.

**Task:** Employ the question-answering pipeline to answer questions.

**Steps:**
1. Choose a passage from a historical document or textbook.
2. Formulate a few questions about the text.
3. Use the question-answering pipeline to find answers to these questions.
4. Compare the answers with your knowledge or verify them using other sources.

In [None]:
# YOUR CODE HERE

### Activity 4: Zero-Shot Classification of News Headlines

**Objective:** Categorize news headlines using Hugging Face zero-shot classification.

**Task:** Apply the zero-shot-classification pipeline to classify headlines into categories.

**Steps:**
1. Collect a set of news headlines from various sources.
2. Define a few relevant categories (e.g., politics, sports, technology).
3. Use the zero-shot-classification pipeline to assign categories to each headline.
4. Discuss the accuracy and limitations of the results with the group.
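
One way to organize the steps above is a helper that keeps only the best label per headline. The sketch below assumes the classifier is called the same way as the zero-shot pipeline earlier in this notebook and returns a dictionary with sorted `labels` and `scores`. `top_labels` and `fake_classifier` are invented names, and the keyword-based fake classifier is only a stand-in so the helper can run without a model.

```python
def top_labels(headlines, classify, candidate_labels):
    """Return (headline, best_label, best_score) for each headline."""
    rows = []
    for headline in headlines:
        result = classify(headline, candidate_labels=candidate_labels)
        # Labels are sorted by score, so the first entry is the best guess.
        rows.append((headline, result["labels"][0], result["scores"][0]))
    return rows


def fake_classifier(text, candidate_labels):
    """Hypothetical stand-in: a label scores 1.0 if it appears in the text."""
    scored = [(label, 1.0 if label.lower() in text.lower() else 0.0)
              for label in candidate_labels]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return {
        "sequence": text,
        "labels": [label for label, _ in scored],
        "scores": [score for _, score in scored],
    }


headlines = [
    "Sports fans celebrate championship win",
    "New technology promises faster chips",
]
rows = top_labels(headlines, fake_classifier, ["politics", "sports", "technology"])
for headline, label, score in rows:
    print(f"{label:<12} {score:.2f}  {headline}")
```

Swapping `fake_classifier` for `pipeline("zero-shot-classification")` should give real predictions for your own headlines.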

In [None]:
# YOUR CODE HERE

### Activity 5: Text Generation for Creative Writing

**Objective:** Generate creative text prompts for storytelling or poetry using Hugging Face.

**Task:** Utilize the text-generation pipeline to inspire creative writing.

**Steps:**

1. Start with simple prompts like "Once upon a time" or "In a distant galaxy."
2. Use the text-generation pipeline to expand on the prompts.
3. Continue the generated story or write a poem inspired by it.
4. Share your creative works and discuss how the pipeline influenced your ideas.

In [None]:
# YOUR CODE HERE

## End
<sup>made by **Jude Michael Teves**</sup> <br>
<sup>for comments, corrections, suggestions, please email:</sup><sup> judemichaelteves@gmail.com or jude.teves@dlsu.edu.ph</sup><br>
