# HuggingFace

Hugging Face offers everything from tokenizers, which help computers make sense of text, to a huge variety of ready-to-go language models, and even a treasure trove of data suited for language tasks.

HF provides many things, some of which are:
1. Tokenizers
2. Models
3. Datasets
4. Trainers

**Tokenizers:** These work like a translator, converting the words we use into smaller parts and creating a secret code that computers can understand and work with.

**Models:** These are like the brain for computers, allowing them to learn and make decisions based on information they've been fed.

**Datasets:** Think of datasets as textbooks for computer models. They are collections of information that models study to learn and improve.

**Trainers:** Trainers are the coaches for computer models. They help these models get better at their tasks by practicing and providing guidance. HuggingFace Trainers implement the PyTorch training loop for you, so you can focus instead on other aspects of working on the model.



## Tokenizers

HuggingFace tokenizers help us break down text into smaller, manageable pieces called tokens. These tokenizers are easy to use and also remarkably fast due to their use of the Rust programming language.

**Tokenization:** The process by which an input series of characters is transformed into units the model is prepared to predict upon. A model trained on data tokenized by one tokenizer must use that same tokenizer for prediction; this is similar to feature engineering in traditional machine learning. It's like cutting a sentence into individual pieces, such as words or characters, to make it easier to analyze.

**Tokens:** Fundamental unit of input to language models. These are the pieces you get after cutting up text during tokenization, kind of like individual Lego blocks that can be words, parts of words, or even single letters. These tokens are converted to numerical values for models to understand.

**Pre-trained Model:** This is a ready-made model that has been previously taught with a lot of data.

**Uncased:** This means that the model treats uppercase and lowercase letters as the same.

In [1]:
from transformers import BertTokenizer

In [2]:
# Initialize the tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

In [3]:
# See how many tokens are in the vocabulary
tokenizer.vocab_size

30522

In [4]:
# Tokenize the sentence
sent_0 = "I heart Generative AI"
tokens = tokenizer.tokenize(sent_0)

In [5]:
# Print the tokens
print(tokens)

['i', 'heart', 'genera', '##tive', 'ai']


In [6]:
# Show the token ids assigned to each token
print(tokenizer.convert_tokens_to_ids(tokens))

[1045, 2540, 11416, 6024, 9932]


In [7]:
dict(zip(tokens,tokenizer.convert_tokens_to_ids(tokens)))

{'i': 1045, 'heart': 2540, 'genera': 11416, '##tive': 6024, 'ai': 9932}

## Models

Hugging Face models provide a quick way to get started using models trained by the community. With only a few lines of code, you can load a pre-trained model and start using it on tasks such as sentiment analysis.

In [8]:
from transformers import BertForSequenceClassification, BertTokenizer
import torch

In [9]:
# Load a pre-trained sentiment analysis model
model_name = "textattack/bert-base-uncased-imdb"
model = BertForSequenceClassification.from_pretrained(model_name, num_labels=2)
tokenizer = BertTokenizer.from_pretrained(model_name)

In [10]:
# Tokenize the input sequence
sent = "I love mathematics"
inputs = tokenizer.tokenize(text=sent)
dict(zip(inputs,tokenizer.convert_tokens_to_ids(inputs)))

{'i': 1045, 'love': 2293, 'mathematics': 5597}

In [11]:
# Make prediction
def use_model(input_text):
    inputs = tokenizer(
        text=input_text,
        return_tensors="pt")
    # Get predictions without updating the model - the no_grad method means no updating of the gradients
    with torch.no_grad():
        outputs = model(**inputs).logits
        probabilities = torch.nn.functional.softmax(outputs, dim=1)
        predicted_class = torch.argmax(probabilities)
    if predicted_class == 1:
        print(f"Sentiment: Positive ({probabilities[0][1] * 100:.2f}%)")
    else:
        print(f"Sentiment: Negative ({probabilities[0][0] * 100:.2f}%)")
    label = model.config.id2label[predicted_class.item()]
    arg_ind = predicted_class.item()
    print(f"\tModel label: {label}")
    print(f"\tModel arg_index: {arg_ind}")

In [14]:
use_model(sent)

Sentiment: Positive (82.51%)
	Model label: LABEL_1
	Model arg_index: 1


In [13]:
# Alternatively:
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

In [15]:
use_model(sent)

Sentiment: Positive (82.51%)
	Model label: LABEL_1
	Model arg_index: 1


In [16]:
# Alternatively:
from transformers import pipeline
pipe = pipeline("text-classification", model="textattack/bert-base-uncased-imdb")

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
Device set to use mps:0


In [17]:
out = pipe(sent)
print(out[0])
print()
label = out[0]['label']
score = out[0]['score']
if label == "LABEL_1":
    print(f"Sentiment: Positive ({score * 100:.2f}%)")
else:
    print(f"Sentiment: Negative ({score * 100:.2f}%)")
# label = model.config.id2label[predicted_class.item()]
# arg_ind = predicted_class.item()
print(f"\tModel label: {label}")
print(f"\tModel score: {score}")

{'label': 'LABEL_1', 'score': 0.8251070380210876}

Sentiment: Positive (82.51%)
	Model label: LABEL_1
	Model score: 0.8251070380210876


## Datasets

HuggingFace Datasets library is a powerful tool for managing a variety of data types, like text and images, efficiently and easily. This resource is incredibly fast and doesn't use a lot of computer memory, making it great for handling big projects without any hassle.

**IMDb dataset:** A dataset of movie reviews that can be used to train a machine learning model to understand human sentiments.

**Apache Arrow:** A software framework that allows for fast data processing

In [18]:
# %pip install datasets huggingface_hub
import huggingface_hub
from datasets import load_dataset
from IPython.display import HTML, display

In [19]:
ds_list = huggingface_hub.list_datasets()

In [20]:
ds_list

<generator object HfApi.list_datasets at 0x3ad8c10e0>

In [21]:
lst_ds = list(ds_list)

In [22]:
lst_ds[0]

DatasetInfo(id='fka/awesome-chatgpt-prompts', author='fka', sha='68ba7694e23014788dcc8ab5afe613824f45a05c', created_at=datetime.datetime(2022, 12, 13, 23, 47, 45, tzinfo=datetime.timezone.utc), last_modified=datetime.datetime(2025, 1, 6, 0, 2, 53, tzinfo=datetime.timezone.utc), private=False, gated=False, disabled=False, downloads=6066, downloads_all_time=None, likes=6939, paperswithcode_id=None, tags=['task_categories:question-answering', 'license:cc0-1.0', 'size_categories:n<1K', 'format:csv', 'modality:text', 'library:datasets', 'library:pandas', 'library:mlcroissant', 'library:polars', 'region:us', 'ChatGPT'], trending_score=133, card_data=None, siblings=None)

In [23]:
# load the IMDB dataset, which contains movie reviews and sentiment labels (positive or negative)
ds = load_dataset("imdb")

In [24]:
# Retrieve a single review
rev_num = 42
sample_rev = ds["train"][rev_num]
display(HTML(sample_rev["text"][:450] + "..."))

In [25]:
sample_rev.keys()

dict_keys(['text', 'label'])

In [26]:
if sample_rev["label"] == 1:
    print("Sentiment: Positive")
else:
    print("Sentiment: Negative")
# Sentiment: Negative

Sentiment: Negative


## Trainers

[HuggingFace trainers](https://huggingface.co/docs/transformers/v4.47.1/en/main_classes/trainer#trainer) offer a simplified approach to training generative AI models, making it easier to set up and run complex machine learning tasks. This tool wraps up the hard parts, like handling data and carrying out the training process, allowing us to focus on the big picture and achieve better outcomes with our AI endeavors.

**Truncating:** This refers to shortening longer pieces of text to fit a certain size limit.

**Padding:** Adding extra data to shorter texts to reach a uniform length for processing.

**Batches:** Batches are small, evenly divided parts of data that the AI looks at and learns from each step of the way.

**Batch Size:** The number of data samples that the machine considers in one go during training.

**Epochs:** A complete pass through the entire training dataset. The more epochs, the more the computer goes over the material to learn.

**Dataset Splits:** Dividing the dataset into parts for different uses, such as training the model and testing how well it works.



In [27]:
from transformers import (DistilBertForSequenceClassification,
    DistilBertTokenizer,
    TrainingArguments,
    Trainer
)
from datasets import load_dataset

In [28]:
# https://huggingface.co/distilbert/distilbert-base-uncased
model = DistilBertForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2
)
# https://huggingface.co/docs/transformers/model_doc/distilbert#transformers.DistilBertForSequenceClassification
tokenizer_1 = DistilBertTokenizer.from_pretrained("distilbert-base-uncased")
# https://huggingface.co/docs/transformers/model_doc/distilbert#transformers.DistilBertTokenizer

def tokenize_function(examples):
    return tokenizer_1(examples["text"], padding="max_length", truncation=True)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [29]:
ds = load_dataset("imdb")
tokenized_datasets = ds.map(tokenize_function, batched=True)

Map:   0%|          | 0/25000 [00:00<?, ? examples/s]

Map:   0%|          | 0/25000 [00:00<?, ? examples/s]

Map:   0%|          | 0/50000 [00:00<?, ? examples/s]

In [30]:
model_chk = DistilBertForSequenceClassification.from_pretrained(
    pretrained_model_name_or_path="./results/checkpoint-1173"
    ,num_labels=2
)
# "./results/mps_1"
# "./results"
# Make prediction
def use_model_chk(input_text):
    inputs = tokenizer_1(
        text=input_text,
        return_tensors="pt")
    # Get predictions without updating the model - the no_grad method means no updating of the gradients
    with torch.no_grad():
        outputs = model_chk(**inputs).logits
        probabilities = torch.nn.functional.softmax(outputs, dim=1)
        predicted_class = torch.argmax(probabilities)
    if predicted_class == 1:
        print(f"Sentiment: Positive ({probabilities[0][1] * 100:.2f}%)")
    else:
        print(f"Sentiment: Negative ({probabilities[0][0] * 100:.2f}%)")
    label = model_chk.config.id2label[predicted_class.item()]
    arg_ind = predicted_class.item()
    print(f"\tModel label: {label}")
    print(f"\tModel arg_index: {arg_ind}")

In [33]:
print(sent)

I love mathematics


In [34]:
use_model_chk(sent)

Sentiment: Positive (93.09%)
	Model label: LABEL_1
	Model arg_index: 1


In [35]:
# Load a pre-trained sentiment analysis model
model_name = "textattack/bert-base-uncased-imdb"
model = BertForSequenceClassification.from_pretrained(model_name, num_labels=2)
tokenizer = BertTokenizer.from_pretrained(model_name)
use_model(sent)

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


Sentiment: Positive (82.51%)
	Model label: LABEL_1
	Model arg_index: 1


In [36]:
model_chk = DistilBertForSequenceClassification.from_pretrained(
    pretrained_model_name_or_path="./results/mps/checkpoint-1173"
    ,num_labels=2
)
# "./results/mps_1"
# "./results"
# Make prediction

In [37]:
use_model_chk(sent)

Sentiment: Positive (98.04%)
	Model label: LABEL_1
	Model arg_index: 1


In [76]:
training_args = TrainingArguments(
    per_device_train_batch_size=64,
    output_dir="./results",
    learning_rate=2e-5,
    num_train_epochs=3,
)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["test"],
)

In [77]:
trainer.train()

Step,Training Loss
500,0.2494
1000,0.138


TrainOutput(global_step=1173, training_loss=0.18140956101413477, metrics={'train_runtime': 2333.48, 'train_samples_per_second': 32.141, 'train_steps_per_second': 0.503, 'total_flos': 9935054899200000.0, 'train_loss': 0.18140956101413477, 'epoch': 3.0})

In [85]:
training_args = TrainingArguments(
    per_device_train_batch_size=64,
    output_dir="./results/mps",
    learning_rate=2e-5,
    num_train_epochs=3,
)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["test"],
)
trainer.train().to(torch.device("mps"))

Step,Training Loss
500,0.0762
1000,0.0393


AttributeError: 'TrainOutput' object has no attribute 'to'

In [92]:
training_args = TrainingArguments(
    per_device_train_batch_size=128,
    output_dir="./results/mps_1",
    learning_rate=2e-5,
    num_train_epochs=1,
    use_mps_device=True
)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["test"],
)
trainer.train()



Step,Training Loss


TrainOutput(global_step=196, training_loss=0.046327522822788784, metrics={'train_runtime': 3151.4722, 'train_samples_per_second': 7.933, 'train_steps_per_second': 0.062, 'total_flos': 3311684966400000.0, 'train_loss': 0.046327522822788784, 'epoch': 1.0})

In [38]:
import timeit

In [39]:
a_cpu = torch.rand(250, device='cpu')
b_cpu = torch.rand((250, 250), device='cpu')
a_mps = torch.rand(250, device='mps')
b_mps = torch.rand((250, 250), device='mps')

print('cpu w/ mem write', timeit.timeit(lambda: a_cpu @ b_cpu, number=100_000))
print('\tcpu w/ NO mem write', timeit.timeit(lambda: a_cpu @ a_cpu, number=100_000))
print('mps  w/ mem write', timeit.timeit(lambda: a_mps @ b_mps, number=100_000))
print('\tmps  w/ NO mem write', timeit.timeit(lambda: a_mps @ a_mps, number=100_000))

cpu w/ mem write 0.7389892083592713
	cpu w/ NO mem write 0.07051066681742668
mps  w/ mem write 2.83012962481007
	mps  w/ NO mem write 2.426200541201979


In [40]:
import multiprocessing

multiprocessing.cpu_count()

10

# Exercise: PyTorch and HuggingFace scavenger hunt!

PyTorch and HuggingFace have emerged as powerful tools for developing and deploying neural networks.

In this scavenger hunt, we will explore the capabilities of PyTorch and HuggingFace, uncovering hidden treasures on the way.

We have two parts:
* Familiarize yourself with PyTorch
* Get to know HuggingFace

## Familiarize yourself with PyTorch

Learn the basics of PyTorch, including tensors, neural net parts, loss functions, and optimizers. This will provide a foundation for understanding and utilizing its capabilities in developing and training neural networks.

### PyTorch tensors

Scan through the PyTorch tensors documentation [here](https://pytorch.org/docs/stable/tensors.html). Be sure to look at the examples.

In the following cell, create a tensor named `my_tensor` of size 3x3 with values of your choice. The tensor should be created on the GPU if available. Print the tensor.

In [46]:
import torch

In [47]:
# Fill in the missing parts labelled <MASK> with the appropriate code to complete the exercise.
# Hint: Use torch.cuda.is_available() to check if GPU is available

# Set the device to be used for the tensor
device = torch.device('mps' if torch.backends.mps.is_available() else 'cpu')
# device = torch.device('mps' if torch.cuda.is_available() else 'cpu')
print(device)

mps


In [48]:
# Create a tensor on the appropriate device
# my_tensor = <MASK>
my_tensor = torch.randn((3, 3))
# # Print the tensor
print(my_tensor)

tensor([[ 0.2337,  0.0951, -1.4135],
        [-1.0538,  0.7545, -0.2705],
        [ 1.4455,  0.3592,  0.9547]])


In [49]:
# Check the previous cell
# assert my_tensor.device.type in {"cuda", "cpu"}
assert my_tensor.device.type in {"mps", "cpu"}
assert my_tensor.shape == (3, 3)

print("Success!")

Success!


### Neural Net Constructor Kit `torch.nn`

You can think of the `torch.nn` ([documentation](https://pytorch.org/docs/stable/nn.html)) module as a constructor kit for neural networks. It provides the building blocks for creating neural networks, including layers, activation functions, loss functions, and more.

Instructions:

Create a three layer Multi-Layer Perceptron (MLP) neural network with the following specifications:

- Input layer: 784 neurons
- Hidden layer: 128 neurons
- Output layer: 10 neurons

Use the ReLU activation function for the hidden layer and the softmax activation function for the output layer. Print the neural network.

Hint: MLP's use "fully-connected" or "dense" layers. In PyTorch's `nn` module, this type of layer has a different name. See the examples in [this tutorial](https://pytorch.org/tutorials/recipes/recipes/defining_a_neural_network.html) to find out more.

In [50]:
import torch.nn as nn

In [52]:
class MyMLP(nn.Module):
    def __init__(self, input_size:int=784):
        """My Multilayer Perceptron (MLP)
    
        Specifications:
    
            - Input layer: 784 neurons
            - Hidden layer: 128 neurons with ReLU activation
            - Output layer: 10 neurons with softmax activation
    
        """
        super(MyMLP, self).__init__()
        # self.fc1 = <MASK>
        # self.fc2 = <MASK>
        # self.relu = <MASK>
        # self.softmax = <MASK>

        # self.hidden_layer = nn.Linear(in_features=input_size, out_features=128)
        self.fc1 = nn.Linear(in_features=input_size, out_features=128)
        # self.output_layer = nn.Linear(128, 10)
        self.fc2 = nn.Linear(128, 10)
        # self.activation = nn.ReLU()
        self.relu = nn.ReLU()
        # Single dimension input to the softmax layer, so dim=0
        self.softmax = nn.Softmax(dim=0)
        # https://discuss.pytorch.org/t/implicit-dimension-choice-for-softmax-warning/12314/17


    def forward(self, x):
        # # Pass the input to the second layer
        # x = <MASK>
        # # Apply ReLU activation
        # x = <MASK>
        # # Pass the result to the final layer
        # x = <MASK>
        # # Apply softmax activation
        # x = <MASK>
        
        x = self.fc1(x)
        x = self.relu(x)
        x = self.fc2(x)
        x = self.softmax(x)
        
        return x

In [53]:
my_mlp = MyMLP()
print(my_mlp)

MyMLP(
  (fc1): Linear(in_features=784, out_features=128, bias=True)
  (fc2): Linear(in_features=128, out_features=10, bias=True)
  (relu): ReLU()
  (softmax): Softmax(dim=0)
)


In [55]:
# Check your work here:


# Check the number of inputs
assert my_mlp.fc1.in_features == 784

# Check the number of outputs
assert my_mlp.fc2.out_features == 10

# Check the number of nodes in the hidden layer
assert my_mlp.fc1.out_features == 128

# Check that my_mlp.fc1 is a fully connected layer
assert isinstance(my_mlp.fc1, nn.Linear)

# Check that my_mlp.fc2 is a fully connected layer
assert isinstance(my_mlp.fc2, nn.Linear)

print("Success!")

Success!


### PyTorch Loss Functions and Optimizers

PyTorch comes with a number of built-in loss functions and optimizers that can be used to train neural networks. The loss functions are implemented in the `torch.nn` ([documentation](https://pytorch.org/docs/stable/nn.html#loss-functions)) module, while the optimizers are implemented in the `torch.optim` ([documentation](https://pytorch.org/docs/stable/optim.html)) module.


Instructions:

- Create a loss function using the `torch.nn.CrossEntropyLoss` ([documentation](https://pytorch.org/docs/stable/generated/torch.nn.CrossEntropyLoss.html#torch.nn.CrossEntropyLoss)) class.
- Create an optimizer using the `torch.optim.SGD` ([documentation](https://pytorch.org/docs/stable/generated/torch.optim.SGD.html#torch.optim.SGD)) class with a learning rate of 0.01.



In [58]:
# Replace <MASK> with the appropriate code to complete the exercise.

# Loss function
# loss_fn = <MASK>
loss_fn = nn.CrossEntropyLoss()

# Optimizer (by convention we use the variable optimizer)
# optimizer = <MASK>
optimizer = torch.optim.SGD(
    params=my_mlp.parameters()
    ,lr=0.01)

In [59]:
# Check

assert isinstance(
    loss_fn, nn.CrossEntropyLoss
), "loss_fn should be an instance of CrossEntropyLoss"
assert isinstance(optimizer, torch.optim.SGD), "optimizer should be an instance of SGD"
assert optimizer.defaults["lr"] == 0.01, "learning rate should be 0.01"
assert optimizer.param_groups[0]["params"] == list(
    my_mlp.parameters()
), "optimizer should be passed the MLP parameters"
print("Success!")

Success!


### PyTorch Training Loops

PyTorch makes writing a training loop easy!


Instructions:

- Fill in the blanks!

In [60]:
# Replace <MASK> with the appropriate code to complete the exercise.
def fake_training_loaders():
    for _ in range(30):
        yield torch.randn(64, 784), torch.randint(0, 10, (64,))


for epoch in range(3):
    # Create a training loop
    for i, data in enumerate(fake_training_loaders()):
        # Every data instance is an input + label pair
        x, y = data
        # Zero your gradients for every batch!
        # <MASK>
        optimizer.zero_grad()
        # Forward pass (predictions)
        # y_pred = <MASK>
        y_pred = my_mlp(x)
        # Compute the loss and its gradients
        # loss = <MASK>
        # <MASK>
        loss = loss_fn(y_pred, y)
        loss.backward()
        # Adjust learning weights
        # <MASK>
        optimizer.step()


        if i % 10 == 0:
            print(f"Epoch {epoch}, batch {i}: {loss.item():.5f}")

Epoch 0, batch 0: 2.30273
Epoch 0, batch 10: 2.30335
Epoch 0, batch 20: 2.30230
Epoch 1, batch 0: 2.30261
Epoch 1, batch 10: 2.30205
Epoch 1, batch 20: 2.30260
Epoch 2, batch 0: 2.30272
Epoch 2, batch 10: 2.30204
Epoch 2, batch 20: 2.30278


In [61]:
# Check

assert abs(loss.item() - 2.3) < 0.1, "the loss should be around 2.3 with random data"
print("Success!")

Success!


In [62]:
m
Great job! Now you know the basics of PyTorch! Let's turn to HuggingFace 🤗.


SyntaxError: unterminated string literal (detected at line 2) (641866468.py, line 2)

Great job! Now you know the basics of PyTorch! Let's turn to HuggingFace 🤗.