
 # Hugging Face Transformers - A Complete Guide

 This notebook provides a complete guide on how to use Hugging Face Transformers to perform common Natural Language Processing (NLP) tasks such as:
 - Sentiment Analysis
 - Text Summarization
 - Question Answering
 - Text Translation
 - Text Generation


Additionally, it will demonstrate how to fine-tune a pre-trained model on a custom dataset for specific tasks.

# 1. Installing Necessary Libraries
Before we can start, we need to install the required Python packages. We will use the Hugging Face `transformers` and `datasets` libraries along with `torch`, which is the backend framework that runs the models.


In [1]:
!pip install transformers datasets torch



# 2. Using Hugging Face Pipelines

Hugging Face provides a high-level abstraction called `pipeline`. The `pipeline` is designed to allow you to quickly apply a model to a task without needing to worry about the underlying details.

You can use the `pipeline` function to load a pre-trained model for different tasks such as sentiment analysis, text generation, summarization, etc.

Let's start by importing the `pipeline` function from the Hugging Face Transformers library.


In [4]:
from transformers import pipeline
import torch



### Task 1: Sentiment Analysis
Sentiment analysis is the task of classifying a given text as positive, negative, or neutral.

In this example, we will use a pre-trained model for sentiment analysis. Because no model is specified, the `pipeline` will automatically download and load a default model for this task. The default sentiment model distinguishes only positive and negative sentiment.

In [None]:
classifier = pipeline('sentiment-analysis')
result = classifier("I love the Large Language Model course!")
print(f"Sentiment Analysis Result: {result}")

### Task 2: Text Summarization

Notice the warnings printed by the previous cell:
   - 'Using a pipeline without specifying a model name and revision in production is not recommended.'
   - 'FutureWarning: `clean_up_tokenization_spaces` was not set. It will be set to `True` by default. This behavior will be deprecated in transformers v4.45, and will be then set to `False` by default.'

The first message is a warning, not an error. It recommends specifying the model name and revision when using a pipeline in production, which keeps your results reproducible and consistent. Hugging Face libraries and default models change frequently, so pinning both is good practice. The cell below does this: `model` is the model name, and `revision` is the version of the model.

`clean_up_tokenization_spaces` controls whether decoded text is cleaned of tokenization artifacts, such as the space left before punctuation marks and contractions. It makes sure the text is human-readable without odd spacing issues.
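
To make the effect of this clean-up concrete, here is a minimal sketch in plain Python. It is a simplified stand-in and not the library's own code: the helper name `clean_up_spaces` and the exact replacement list are assumptions chosen for illustration.

```python
def clean_up_spaces(text: str) -> str:
    """Approximate the space clean-up applied to decoded text (simplified sketch)."""
    # Each pair is (artifact left by joining tokens with spaces, readable form).
    replacements = [
        (" .", "."), (" ?", "?"), (" !", "!"), (" ,", ","),
        (" n't", "n't"), (" 'm", "'m"), (" 's", "'s"),
        (" 've", "'ve"), (" 're", "'re"),
    ]
    for artifact, fixed in replacements:
        text = text.replace(artifact, fixed)
    return text


decoded = "I do n't know , but it 's fine !"  # hypothetical decoder output
print(clean_up_spaces(decoded))
```

Without the clean-up, the decoded string keeps the stray spaces shown in `decoded`.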

Furthermore, no GPU has been selected yet, so the cell below also picks one if it is available.

Text Summarization is the task of creating a shorter version of a long text while preserving the main content. This can be useful when you need to condense large articles or reports.

We'll use the `summarization` pipeline for this task, which leverages models that are fine-tuned specifically for generating summaries.


In [None]:
model_name = "t5-small"  # or facebook/bart-large-cnn, etc.
revision = "main"  # or a specific commit hash, version, or tag

# Check if GPU is available
device = 0 if torch.cuda.is_available() else -1

summarizer = pipeline("summarization", model=model_name, revision=revision, device=device)

text = "Machine learning is the study of computer algorithms that improve automatically through experience. It is seen as a part of artificial intelligence. Machine learning algorithms build a mathematical model based on sample data, known as training data, in order to make predictions or decisions without being explicitly programmed to do so."

summary = summarizer(text, max_length=50, min_length=25, do_sample=False)
print(f"Summary: {summary[0]['summary_text']}")

### Task 3: Question-Answering

Question Answering involves answering a question based on a provided context. This task is useful for systems like chatbots or information retrieval systems where the goal is to answer specific queries from a given body of text.

We'll use the `question-answering` pipeline for this task, which requires both a question and a context.

In [None]:
question_answerer = pipeline("question-answering", device=device)

context = "Machine learning is a subset of artificial intelligence, which involves using statistical techniques to give computer systems the ability to 'learn' from data, without being explicitly programmed."

question = "What is machine learning?"
answer = question_answerer(question=question, context=context)
print(f"Answer: {answer['answer']}")

### Task 4: Text Translation

Text Translation is the task of converting text from one language to another. Hugging Face provides translation pipelines for a wide range of languages.

In this example, we will translate a sentence from English to French using the `translation_en_to_fr` pipeline.

In [None]:
translator = pipeline("translation_en_to_fr", device=device)

translation = translator("Hello, how are you?")
print(f"Translation: {translation[0]['translation_text']}")

### Task 5: Text Generation
Text Generation involves generating coherent text from a given prompt. Models like GPT-2 are commonly used for this task.

We'll use the `text-generation` pipeline to generate text based on an initial prompt.

Let's run the following cell to generate text.

In [None]:
generator = pipeline("text-generation", device=device)

generated_text = generator("Artificial intelligence will revolutionize the future of technology")
print(f"Generated Text: {generated_text[0]['generated_text']}")


# 3. Fine-Tuning Pre-trained Models
While the pre-trained models provided by Hugging Face are powerful, you may want to fine-tune them for a specific task or dataset.

Fine-tuning involves taking a pre-trained model and training it further on your own data. This can improve the model’s performance for specific use cases.

For this section, we’ll load the IMDB dataset (which contains movie reviews) and fine-tune a pre-trained model for sentiment classification.

### Step 1: Load Dataset
We'll use Hugging Face's `datasets` library to load the IMDB dataset.

Datasets from the `datasets` library often come with pre-defined splits of the data, such as `train` and `test` sets.

It is possible to filter or slice datasets to focus on specific subsets of the data, using the `select` method.

In [None]:
from datasets import load_dataset

dataset = load_dataset("imdb")
train_dataset = dataset["train"].shuffle(seed=42).select(range(100))  # Using a subset for quick fine-tuning
test_dataset = dataset["test"].shuffle(seed=42).select(range(100))

### Step 2: Tokenize the Dataset
The dataset needs to be tokenized before it can be fed into the model. Tokenization converts the text data into numerical format (tokens) that the model can process.

We'll use the `AutoTokenizer` class from Hugging Face to tokenize the data. `AutoTokenizer` automatically selects the appropriate tokenizer for the model based on the `model_name`.

The dataset can be tokenized, or transformed in other ways, with the `map` method, which applies a function to every element of the dataset. Define a function that tokenizes the text data and pass it to `map`. With `batched=True`, the function receives a batch of examples at a time instead of a single example, which is usually faster because the tokenizer can process many texts in one call.
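
The difference between per-example and batched mapping is mostly about what the function receives. The sketch below imitates the batched case in plain Python. `map_batched` is a hypothetical toy helper, not the `datasets` implementation, and the sample reviews are made up. The key point is that a batched function gets a dict of lists and must return lists of the same length.

```python
def map_batched(columns, fn, batch_size=1000):
    """Toy stand-in for a batched map over a column-oriented dataset.

    `columns` maps column names to equal-length lists. `fn` receives a dict of
    lists (one slice per column) and returns a dict of new columns.
    """
    n_rows = len(next(iter(columns.values())))
    new_columns = {}
    for start in range(0, n_rows, batch_size):
        # Slice every column the same way so the rows stay aligned.
        batch = {name: values[start:start + batch_size] for name, values in columns.items()}
        for name, values in fn(batch).items():
            new_columns.setdefault(name, []).extend(values)
    # Original columns are kept; new columns are added alongside them.
    return {**columns, **new_columns}


def count_words(batch):
    # batch["text"] is a list of strings, not a single string.
    return {"n_words": [len(text.split()) for text in batch["text"]]}


reviews = {"text": ["great movie", "not my cup of tea", "fine"], "label": [1, 0, 1]}
mapped = map_batched(reviews, count_words, batch_size=2)
print(mapped["n_words"])
```

A tokenizer call fits this shape naturally, because it accepts a list of texts and returns lists of encodings.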

In [None]:
from transformers import AutoTokenizer

model_name = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)

def tokenize_function(examples):
    # print(examples["text"][0])
    return tokenizer(examples["text"], padding="max_length", truncation=True)

tokenized_train = train_dataset.map(tokenize_function, batched=True)
tokenized_test = test_dataset.map(tokenize_function, batched=True)
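
The `padding="max_length"` and `truncation=True` arguments make every example the same length. The sketch below shows the idea with a toy whitespace tokenizer. The `encode` function, the tiny vocabulary, and the `pad_id`/`unk_id` values are all assumptions for illustration; the real DistilBERT tokenizer splits text into WordPiece subwords and adds special tokens.

```python
def encode(text, vocab, max_length, pad_id=0, unk_id=1):
    """Toy whitespace tokenizer that pads or truncates to exactly max_length."""
    ids = [vocab.get(word, unk_id) for word in text.lower().split()]
    ids = ids[:max_length]                 # truncation: drop tokens past the limit
    n_pad = max_length - len(ids)          # padding: fill up to the limit
    return {
        "input_ids": ids + [pad_id] * n_pad,
        # 1 marks a real token, 0 marks padding the model should ignore.
        "attention_mask": [1] * len(ids) + [0] * n_pad,
    }


vocab = {"the": 2, "movie": 3, "was": 4, "great": 5}  # hypothetical vocabulary
short = encode("The movie was great", vocab, max_length=6)
long = encode("The movie was great and the ending surprised me", vocab, max_length=6)
print(short)
print(long)
```

The short text is padded with zeros, and the long text is cut off after six tokens, so both fit in one fixed-size batch.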


### Step 3: Load a Pre-trained Model
Now that the data is tokenized, we'll load a pre-trained model that we'll fine-tune for sentiment classification.

We'll use `distilbert-base-uncased` for this task.

We need to import `AutoModelForSequenceClassification` for that. The key feature of this class is that it adds a classification head on top of the pre-trained transformer model to allow it to classify sequences into one or more categories (e.g., positive/negative sentiment, spam/ham, etc.). The `from_pretrained` method loads the pre-trained model with the specified configuration. The `num_labels` parameter specifies the number of labels in the classification task (binary in this case).

In [None]:
from transformers import AutoModelForSequenceClassification

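# `device` from the pipeline cells is an index (-1 means CPU), which `.to()` does not accept,
# so build a torch.device for the model instead.
torch_device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2).to(torch_device)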

### Step 4: Set Up the Trainer
Hugging Face provides the `Trainer` class to help with the training and fine-tuning of models. We need to set up the trainer by providing the model, training arguments, and the datasets.


In [None]:

from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir="./results",          # Output directory
    evaluation_strategy="epoch",     # Evaluate after each epoch (named eval_strategy in newer transformers releases)
    learning_rate=2e-5,              # Learning rate
    per_device_train_batch_size=8,   # Batch size for training
    per_device_eval_batch_size=8,    # Batch size for evaluation
    num_train_epochs=1,              # Number of epochs
    weight_decay=0.01,               # Strength of weight decay
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_test
)
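
As a rough check on how long training will run, the number of optimization steps can be worked out from these settings. The arithmetic below assumes a single device, no gradient accumulation, and that the last, smaller batch is kept.

```python
import math

num_examples = 100   # size of the training subset selected earlier
batch_size = 8       # per_device_train_batch_size
num_epochs = 1       # num_train_epochs

# 100 / 8 = 12.5, so the 13th batch holds only the remaining 4 examples.
steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_epochs
print(steps_per_epoch, total_steps)
```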


### Step 5: Fine-tune the Model
Now that the trainer is set up, we can start the fine-tuning process.

Run the following cell to fine-tune the model.

In [None]:
trainer.train()

### Step 6: Evaluate the Model
After training, we can evaluate the model’s performance on the test set.

In [None]:
eval_results = trainer.evaluate()
print(f"Evaluation Results: {eval_results}")
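
Without a metric function, `trainer.evaluate()` reports the loss and runtime statistics but no accuracy. `Trainer` accepts a `compute_metrics` callable for this purpose. The function below is a minimal sketch of one, written in plain Python so it runs without extra dependencies. The example logits and labels are made up, and the function is called directly here rather than through the trainer.

```python
def compute_metrics(eval_pred):
    """Return accuracy given a (logits, labels) pair."""
    logits, labels = eval_pred
    # The predicted class for each example is the index of its largest logit.
    predictions = [max(range(len(row)), key=row.__getitem__) for row in logits]
    correct = sum(int(pred == label) for pred, label in zip(predictions, labels))
    return {"accuracy": correct / len(predictions)}


example_logits = [[2.0, -1.0], [0.3, 0.9], [1.5, 1.2], [-0.4, 0.1]]  # made-up values
example_labels = [0, 1, 1, 1]
print(compute_metrics((example_logits, example_labels)))
```

Passing such a function as `compute_metrics=compute_metrics` when constructing the `Trainer` should add the metric to the evaluation results.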

### Step 7: Try Out the Model

In [None]:
input_string = "I really liked this tutorial!"

# Tokenize the input string and move the tensors to the same device as the model
inputs = tokenizer(input_string, return_tensors="pt").to(model.device)

# Get predictions (logits)
with torch.no_grad():  # Disable gradient computation since we're just doing inference
    outputs = model(**inputs)
    logits = outputs.logits

predicted_label = torch.argmax(logits, dim=1).item()


print(f"Predicted label: {predicted_label}")
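
The predicted label is just an index. To turn logits into a readable label with a confidence score, a softmax can be applied, as in the plain-Python sketch below. The logit values are made up for illustration, and the `id2label` mapping assumes the IMDB convention of 0 for negative and 1 for positive.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    largest = max(logits)
    exps = [math.exp(x - largest) for x in logits]  # subtract max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


id2label = {0: "negative", 1: "positive"}  # assumed IMDB label convention
example_logits = [-1.2, 2.3]               # made-up values, not real model output
probs = softmax(example_logits)
best = max(range(len(probs)), key=probs.__getitem__)
print(f"{id2label[best]} ({probs[best]:.2f})")
```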

### Step 8: Save the Fine-tuned Model
After training, it is often useful to save the fine-tuned model, so you can use it later without needing to re-train it.

In [None]:
model.save_pretrained("./fine-tuned-model")
tokenizer.save_pretrained("./fine-tuned-model")
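
To use the saved model later, a loader along these lines might work. `load_finetuned` is a hypothetical helper name, and the default directory matches the one used above. The `transformers` import sits inside the function so that a missing directory is reported before any heavy import happens.

```python
from pathlib import Path


def load_finetuned(model_dir="./fine-tuned-model"):
    """Reload the tokenizer and model saved with save_pretrained."""
    path = Path(model_dir)
    if not path.is_dir():
        raise FileNotFoundError(f"No saved model found at {path}")

    # Imported lazily so the directory check above stays cheap.
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(str(path))
    model = AutoModelForSequenceClassification.from_pretrained(str(path))
    model.eval()  # switch to inference mode (disables dropout)
    return tokenizer, model


# Usage, once the directory exists:
# tokenizer, model = load_finetuned("./fine-tuned-model")
```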