# **Hugging Face**

Hugging Face is one of the most influential companies in the field of artificial intelligence, particularly known for democratizing access to state-of-the-art machine learning models. Originally famous for its work in natural language processing (NLP), Hugging Face created the `transformers` library, which provides easy-to-use implementations of powerful models like BERT, GPT, T5, and many others. Over time, it expanded beyond NLP into areas like computer vision, audio processing, and even reinforcement learning.

The Hugging Face Hub acts as a central repository where researchers and developers can share, download, and fine-tune pre-trained models and datasets. This ecosystem has become essential for accelerating AI research and application development, reducing the need to train large models from scratch. Today, Hugging Face offers tools for model training, evaluation, deployment, and even optimization, playing a critical role in making cutting-edge AI more accessible to both researchers and industry practitioners.

## 1. The `transformers` Library

With the `transformers` library, you can perform tasks like text classification, translation, summarization, question answering, and more with just a few lines of code. It supports interoperability with both PyTorch and TensorFlow backends, and its API is designed to be user-friendly and flexible.

Here's a simple example of how you can use `transformers` to perform summarization:





In [None]:
!pip install transformers

In [None]:
from transformers import pipeline

# Load a summarization pipeline
summarizer = pipeline("summarization")

# Text to summarize
text = """
Hugging Face is a company that specializes in Natural Language Processing technologies.
They have created the popular transformers library which allows users to access
pre-trained models for a variety of tasks such as text classification, summarization,
and translation, with minimal code and configuration.
"""

# Generate summary
summary = summarizer(text, max_length=40, min_length=5, do_sample=False)

print(summary[0]['summary_text'])

This example demonstrates how easily you can harness the power of a pre-trained model without any deep learning expertise. The `transformers` library continues to evolve rapidly, introducing new models and capabilities that are at the forefront of AI research and application.

### 1.2 Sentiment Analysis Example

Let's look at another example, this time on sentiment analysis.

In [None]:
from transformers import pipeline

classifier = pipeline("sentiment-analysis")
result = classifier("I hated the last Star Wars movie")   # rightfully so, it sucked
print(result)

There is another way to pass the task to the pipeline:

In [None]:
pipeline(task="sentiment-analysis")("I kind of enjoyed the Barbie Movie")

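An important thing to learn is how to specify the model we want to use. When we don't select one, Hugging Face picks a default model for the task.

Let's choose [facebook/bart-large-mnli](https://huggingface.co/facebook/bart-large-mnli) and run the same pipeline with it. Note that this checkpoint was trained for natural language inference rather than sentiment, so its output labels are `contradiction`, `neutral`, and `entailment` instead of positive/negative.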

In [None]:
# Adjacent string literals are concatenated, so no backslashes are needed
text = (
    "Western european media is censoring Palestine related content. "
    "They are defending Israel, while it is committing war crimes. "
    "That's unacceptable."
)
pipeline(task="sentiment-analysis", model="facebook/bart-large-mnli")(text)

We can actually perform sentiment analysis in batches, by passing a list of texts to perform the task on.

In [None]:
classifier = pipeline(task="sentiment-analysis", model="SamLowe/roberta-base-go_emotions")  # using a more complex sentiment model

task_list = ["I love learning about AI",
             "I am not sure using GPT in your everyday life is going to be good long term",
             "I love working out"]

classifier(task_list)

### 1.3 Text Generation

Another incredible task available is text generation:

In [None]:
from transformers import pipeline
text_generator = pipeline("text-generation", model="distilbert/distilgpt2")

generated_text = text_generator("Today is a rainy day in London",
                                truncation=True,
                                num_return_sequences = 2)
print("Generated_text:\n", generated_text[0]['generated_text'])

## 2. Transformer-Based Tokenization

A key focus for working with Large Language Models (LLMs) is understanding how to properly tokenize text, and Hugging Face makes this process seamless. Before diving into model usage, it's essential to grasp some foundational technicalities.

One of the most important is transformer-based tokenization, the standard method for preparing text for models like GPT and BERT. Unlike traditional word-level approaches such as Bag of Words, it breaks text down into subword units using techniques like Byte-Pair Encoding (BPE) or WordPiece. A rare or unseen word is split into smaller known pieces instead of being mapped to a single unknown token, so the model can handle rare words, typos, and new vocabulary with a fixed-size vocabulary.

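The tokenizer itself is a fixed, deterministic mapping built before the model is trained; context-awareness comes later, from the model's attention layers. Even so, subword vocabularies significantly improve coverage and generalization in modern NLP tasks compared to older word-level strategies.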
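
To make the subword idea concrete, here is a minimal sketch of greedy longest-match splitting in the style of WordPiece. The vocabulary and the function name `wordpiece_tokenize` are made up for illustration; real tokenizers use vocabularies with tens of thousands of entries and extra normalization steps, and BPE-based tokenizers work by applying learned merges rather than this exact procedure.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word into subword pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Shrink the candidate from the right until it is in the vocabulary.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # "##" marks a word-internal piece
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # nothing matched: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces


# Hypothetical toy vocabulary, for illustration only.
toy_vocab = {"token", "##ization", "##izer", "un", "##seen", "play", "##ing"}

for word in ["tokenization", "tokenizer", "unseen", "playing", "xyz"]:
    print(word, "->", wordpiece_tokenize(word, toy_vocab))
```

Notice how `tokenization` and `tokenizer` share the piece `token`: related words reuse the same vocabulary entries, which is what lets a fixed vocabulary cover words it has never seen whole.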

### 2.1 Understanding Transformer Tokenization

The best way to understand what happens when we tokenize for a transformer-like architecture is to look at what models do under the hood.

In order to do so we can use the [`AutoTokenizer`](https://huggingface.co/docs/transformers/v4.52.3/en/model_doc/auto#transformers.AutoTokenizer) class from the `transformers` library.

This is part of [Hugging Face's Auto Classes](https://huggingface.co/docs/transformers/en/model_doc/auto#auto-classes).

The Auto Classes in Hugging Face, such as `AutoTokenizer` and `AutoModel`, are designed to simplify working with a wide range of pre-trained models. Instead of manually selecting the appropriate tokenizer or model class, Auto Classes automatically infer the correct one based on the model checkpoint you provide. This abstraction makes your code more flexible and model-agnostic, allowing you to swap between different architectures like BERT, RoBERTa, or DistilBERT without changing your pipeline.

`AutoTokenizer` in particular ensures that text is tokenized using the exact method the underlying model was trained with, whether it’s WordPiece, Byte-Pair Encoding (BPE), or SentencePiece. This guarantees compatibility and maximizes model performance when processing text data.
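
Conceptually, this kind of automatic selection can be pictured as a lookup from the checkpoint's configuration to a concrete class. The sketch below is only a toy model of that idea: the classes, the registry, and `toy_auto_tokenizer` are hypothetical stand-ins, not the library's code, and the real implementation handles many more cases.

```python
# Hypothetical stand-ins for real tokenizer classes (not the library's code).
class ToyWordPieceTokenizer:
    algorithm = "WordPiece"


class ToyBPETokenizer:
    algorithm = "BPE"


# Registry from the "model_type" field of a checkpoint config to a class.
TOKENIZER_REGISTRY = {
    "bert": ToyWordPieceTokenizer,
    "distilbert": ToyWordPieceTokenizer,
    "roberta": ToyBPETokenizer,
    "gpt2": ToyBPETokenizer,
}


def toy_auto_tokenizer(config):
    """Pick a tokenizer class from a checkpoint's config dictionary."""
    model_type = config["model_type"]
    try:
        return TOKENIZER_REGISTRY[model_type]()
    except KeyError:
        raise ValueError(f"Unrecognized model type: {model_type!r}") from None


tok = toy_auto_tokenizer({"model_type": "roberta"})
print(type(tok).__name__, tok.algorithm)
```

The calling code never names a concrete class, which is why swapping one checkpoint for another does not require changing the rest of the pipeline.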

In [None]:
from transformers import AutoTokenizer

# Load a pre-trained tokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # using a bert-like tokenizer

# example text
text = "I really liked the movie No Country for Old Men"

# Tokenize the Text
tokens = tokenizer.tokenize(text)
print("Tokens:\n", tokens)

These tokens look much like ordinary word tokens, so where does the transformer architecture come in?

The important step is the transformation from tokens to ids. Here, each token is mapped to a unique integer ID using the model's pre-trained vocabulary. These IDs are then fed into the embedding layer of the transformer architecture, which transforms them into dense vectors that capture semantic information.

This step is crucial because transformers operate on *continuous vector representations* rather than discrete words or tokens.

The (auto) tokenizer ensures that the input format precisely matches what the transformer model expects, preserving alignment between tokenization and model training.


In [None]:
# Convert tokens to input IDs

input_ids = tokenizer.convert_tokens_to_ids(tokens)

print("Input IDs:\n", input_ids)
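
To see what happens to these IDs next, here is a toy sketch of the embedding step: each ID simply selects one row of a table of vectors. The vocabulary, the dimension, and the random values are all assumptions made for illustration; in a real model the table is a learned weight matrix with hundreds of dimensions per token.

```python
import random

random.seed(0)  # reproducible toy values

# Hypothetical toy vocabulary: token -> integer ID.
toy_vocab = {"[UNK]": 0, "i": 1, "liked": 2, "the": 3, "movie": 4}
embedding_dim = 4

# Embedding table: one vector per ID. In a real model these are learned weights.
embedding_table = [
    [random.uniform(-1, 1) for _ in range(embedding_dim)]
    for _ in range(len(toy_vocab))
]

tokens = ["i", "liked", "the", "movie", "tonight"]
# Tokens missing from the vocabulary fall back to the [UNK] ID.
ids = [toy_vocab.get(tok, toy_vocab["[UNK]"]) for tok in tokens]
# The "embedding layer" is just a row lookup by ID.
vectors = [embedding_table[i] for i in ids]

print("IDs:", ids)
print("Shape:", (len(vectors), len(vectors[0])))
```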

Whenever we want to pass data to our transformer model, we should do these encoding operations: tokenization + conversion to IDs.

Calling the tokenizer object directly, as in `tokenizer(text)`, does both in one go:

In [None]:
# Encode the text (tokenization + converting to IDs)

encoded_input = tokenizer(text)

print("Encoded input", encoded_input)

There are some interesting things happening here:

* First, we can see that the first ID is $101$: this is the ID of the special `[CLS]` token, which BERT's tokenizer adds at the start of every input. Likewise, the final ID, $102$, is the `[SEP]` token that marks the end of the sequence.

* Second, we see a `token_type_ids` object: let's skip its analysis for now, we'll come back to it later on.

* Third, and most importantly, we see our `attention_mask`, automatically computed by our tokenizer.
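
For a single sentence the attention mask is all ones, so its purpose is easier to see on a batch. The sketch below is a simplified stand-in for what the tokenizer does when it pads a batch to its longest sequence: it wraps each sequence in the special tokens, pads the shorter ones, and marks padding with zeros in the mask. The function `toy_encode_batch` and the content IDs are hypothetical; the special-token IDs are the ones used by `bert-base-uncased`.

```python
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0  # special-token IDs in bert-base-uncased


def toy_encode_batch(batch_of_ids):
    """Add [CLS]/[SEP] to each sequence, pad to the longest, build attention masks."""
    wrapped = [[CLS_ID] + list(ids) + [SEP_ID] for ids in batch_of_ids]
    longest = max(len(ids) for ids in wrapped)
    input_ids, attention_mask = [], []
    for ids in wrapped:
        n_pad = longest - len(ids)
        input_ids.append(ids + [PAD_ID] * n_pad)
        # 1 = real token the model should attend to, 0 = padding to ignore
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Hypothetical content-token IDs, for illustration only.
batch = toy_encode_batch([[7, 8, 9], [5]])
print(batch["input_ids"])
print(batch["attention_mask"])
```

The mask tells the model which positions carry real content, so the padding added to the shorter sequence does not influence its output.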

If we want to decode, we will pass the input ids to the `decode()` method:

In [None]:
# Decode text

decoded_output = tokenizer.decode(input_ids)
print("Decode Output: ", decoded_output)

## 3. Fine Tuning Using a Pretrained Model

We will now try to fine tune a pretrained model on the `IMDB Dataset`.


### 3.1 Loading a Hugging Face Dataset

It's really simple to use Hugging Face's datasets. First we must install the `datasets` library:

In [None]:
!pip install --upgrade datasets fsspec
!pip install transformers

Now we can just load the dataset by passing its name to the `load_dataset()` function:

In [None]:
from datasets import load_dataset

ds = load_dataset("imdb")

In [None]:
ds

In [None]:
import pandas as pd

train_df = pd.DataFrame(ds['train'])

train_df.head()

In [None]:
test_df = pd.DataFrame(ds['test'])

test_df.head()

In [None]:
train_df.info()

In [None]:
import numpy as np
print(np.unique(train_df['label'])) # what labels do we have?

### 3.2 Tokenize The Data

Let's tokenize the dataset. We pad to `max_length` (the maximum length accepted by the model) so that all sequences have the same size, padding shorter texts, and we set `truncation=True` for the same reason, truncating longer texts to that maximum length. See [padding and truncation](https://huggingface.co/docs/transformers/pad_truncation#padding-and-truncation) for more details.
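
Before running this on the real dataset, here is a toy sketch of what the two options do to a single sequence of IDs. The helper `fit_to_length` is hypothetical and much simpler than the real tokenizer; it only illustrates that long inputs lose their tail (while keeping the closing `[SEP]`) and short inputs are filled with the padding ID.

```python
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0  # special-token IDs in bert-base-uncased


def fit_to_length(content_ids, max_length):
    """Return exactly max_length IDs: truncate long inputs, pad short ones."""
    room = max_length - 2  # reserve two slots for [CLS] and [SEP]
    ids = [CLS_ID] + list(content_ids)[:room] + [SEP_ID]
    return ids + [PAD_ID] * (max_length - len(ids))


# Hypothetical content IDs; max_length is tiny here to keep the output readable.
short_example = fit_to_length([11, 12], max_length=6)
long_example = fit_to_length(range(11, 31), max_length=6)
print(short_example)
print(long_example)
```

Both results have the same length, which is what allows examples to be stacked into one tensor per batch.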

In [None]:
from transformers import AutoTokenizer

# Load the (bert-like) tokenizer (automatically retrieved by AutoTokenizer)
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')

# Tokenize the dataset
def tokenize_function(examples):
  return tokenizer(examples['text'], padding="max_length", truncation=True)

tokenized_ds = ds.map(tokenize_function, batched=True)

Above we use the `map` function. From its [documentation](https://huggingface.co/docs/datasets/process#map):

"*Some of the more powerful applications of 🤗 Datasets come from using the `map()` function. The primary purpose of `map()` is to speed up processing functions. It allows you to apply a processing function to each example in a dataset, independently or in batches.*"

In [None]:
tokenized_ds
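
To build some intuition for `batched=True`, here is a simplified stand-in for `map` that works on a plain list of rows. It is not the library's implementation (`toy_map` and `count_words` are hypothetical names); it only shows the key difference: a batched function receives a dictionary of columns, each a list, and must return lists of the same length.

```python
def toy_map(rows, function, batched=False, batch_size=1000):
    """Simplified stand-in for Dataset.map on a list of dict rows."""
    if not batched:
        return [{**row, **function(row)} for row in rows]
    out = []
    for start in range(0, len(rows), batch_size):
        chunk = rows[start:start + batch_size]
        # A batched function receives a dict of columns (lists), not a single row.
        columns = {key: [row[key] for row in chunk] for key in chunk[0]}
        new_columns = function(columns)
        for i, row in enumerate(chunk):
            out.append({**row, **{k: v[i] for k, v in new_columns.items()}})
    return out


def count_words(examples):
    # examples["text"] is a list of strings when batched=True.
    return {"n_words": [len(text.split()) for text in examples["text"]]}


rows = [
    {"text": "great movie", "label": 1},
    {"text": "not my cup of tea", "label": 0},
]
result = toy_map(rows, count_words, batched=True, batch_size=2)
print(result)
```

This is why `tokenize_function` can pass `examples['text']` straight to the tokenizer: in batched mode it is already a list of strings, which the tokenizer processes in one call.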

### 3.3 Set Up the Training Arguments

We can almost start training; we just need to specify the hyperparameters and the training settings.

We can do this with the help of Hugging Face's [`TrainingArguments`](https://huggingface.co/docs/transformers/main_classes/trainer#transformers.TrainingArguments):

In [None]:
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./results", # output dir
    eval_strategy="epoch",  # evaluate every epoch
    learning_rate=2e-5, # lr
    per_device_train_batch_size=16, #batch size for training
    per_device_eval_batch_size=16,  # batch size for evaluation
    num_train_epochs=1, # number of training epochs
    weight_decay=0.01 # strength of weight decay
)

training_args
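
It helps to translate these settings into the number of optimizer steps we should expect to see in the progress bar. The back-of-the-envelope calculation below assumes a single device and no gradient accumulation (both are assumptions, not something set above), and uses the 25,000 reviews of the IMDB training split.

```python
import math

num_train_examples = 25_000  # size of the IMDB training split
per_device_batch_size = 16   # per_device_train_batch_size
num_devices = 1              # assumption: a single GPU (or CPU)
grad_accum_steps = 1         # assumption: no gradient accumulation
num_epochs = 1               # num_train_epochs

effective_batch = per_device_batch_size * num_devices * grad_accum_steps
# The last, smaller batch still counts as a step, hence the ceiling.
steps_per_epoch = math.ceil(num_train_examples / effective_batch)
total_steps = steps_per_epoch * num_epochs

print("Effective batch size:", effective_batch)
print("Optimizer steps:", total_steps)
```

With more devices or a larger batch size the step count drops accordingly, which is worth keeping in mind when comparing runs.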

### 3.4 Initialize the Model

Now we can initialize our model, using the `Auto` class, and initialize its [`Trainer`](https://huggingface.co/docs/transformers/main_classes/trainer#trainer).

In [None]:
from transformers import AutoModelForSequenceClassification, Trainer

# Load pre-trained model
model = AutoModelForSequenceClassification.from_pretrained('bert-base-uncased',
                                                           num_labels=2)  # we only have 2 labels (0,1)

# Initialize the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_ds['train'],
    eval_dataset=tokenized_ds['test']
)

> **Note:**
>
> In Hugging Face's `transformers` library, choosing the correct Auto Class is crucial depending on your task.
>
> `AutoModel` loads **only the base pre-trained transformer model** (like BERT or RoBERTa) without any task-specific layers. *It outputs embeddings but does not add the necessary heads for tasks like classification, generation, or question answering*.
>
> On the other hand, `AutoModelForSequenceClassification` automatically adds a classification head on top of the base model, specifically designed for classification tasks. This head typically consists of a dense layer that outputs logits for each class, enabling the model to compute classification loss functions like cross-entropy.
>
> Specifying the right Auto Class ensures that your model is not only correctly structured but also fully compatible with high-level APIs like Hugging Face’s `Trainer`, which expects models to output both logits and loss during training.
>
> In short, use `AutoModel` if you only need raw embeddings or plan to design your own head; use task-specific Auto Classes like `AutoModelForSequenceClassification` to quickly get a model ready for fine-tuning on your downstream task.
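
To make the role of the classification head more tangible, here is a small sketch of what happens to its output: the logits are turned into probabilities with a softmax, and the cross-entropy loss is the negative log-probability of the correct class. The logit values are made up for illustration, and the helper functions are plain-Python stand-ins for what the deep learning framework computes.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    largest = max(logits)
    exps = [math.exp(x - largest) for x in logits]  # subtract the max for stability
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, label):
    """Negative log-probability assigned to the correct class."""
    return -math.log(softmax(logits)[label])


# Hypothetical head output for one review: [score for label 0, score for label 1].
logits = [-1.2, 2.3]
probs = softmax(logits)

print("Probabilities:", [round(p, 3) for p in probs])
print("Loss if the true label is 1:", round(cross_entropy(logits, 1), 3))
print("Loss if the true label is 0:", round(cross_entropy(logits, 0), 3))
```

The loss is small when the head puts most of the probability on the correct label and large otherwise; this is the signal that fine-tuning minimizes.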


### 3.5 Train the Model

Now training is as simple as it could be:

In [None]:
# Train the model

trainer.train()

### 3.6 Evaluate the Trained Model

Evaluating the model is very simple as well:

In [None]:
# Evaluate the model
results = trainer.evaluate()
print(results)
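
Since we did not give the `Trainer` a metric function, the results contain the evaluation loss and some runtime statistics, but no accuracy. A common way to get one is the `compute_metrics` argument of `Trainer`. Below is a minimal sketch of such a function, written in plain Python so that it runs on its own; the toy logits and labels are made up just to exercise it.

```python
def compute_metrics(eval_pred):
    """Accuracy from the (logits, labels) pair passed to this hook."""
    logits, labels = eval_pred
    # Predicted class = index of the largest logit in each row.
    predictions = [max(range(len(row)), key=lambda i: row[i]) for row in logits]
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return {"accuracy": correct / len(labels)}


# Hypothetical logits and labels for four reviews, for illustration only.
toy_logits = [[2.0, -1.0], [0.1, 0.4], [-0.5, 1.5], [1.2, 0.3]]
toy_labels = [0, 1, 0, 0]
print(compute_metrics((toy_logits, toy_labels)))
```

To use it, we would pass `compute_metrics=compute_metrics` when creating the `Trainer`. With the NumPy arrays that the `Trainer` provides, the prediction step is more commonly written as `np.argmax(logits, axis=-1)`.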

### 3.7 Save the Fine-Tuned Model

We can of course save the trained model for later use:

In [None]:
# Save the model:
model.save_pretrained('./fine-tuned-model')

> **Note:**
>
> When fine-tuning a transformer model, it’s essential to save the tokenizer along with the model. The tokenizer defines how input text is split into tokens and mapped to IDs, and it must match the model’s pretraining configuration. `AutoTokenizer` relies on files like `tokenizer_config.json` and `vocab.txt` to load the correct tokenizer automatically. Without these files, the model would not know how to correctly interpret input text, leading to mismatches or errors. Saving both ensures that future users can seamlessly reload and use the fine-tuned model without manual adjustments.





In [None]:
# Save tokenizer
tokenizer.save_pretrained('./fine-tuned-model')