In this assignment, you will perform three NLP tasks using Hugging Face tokenizers, models, and pipelines. The goal is to learn:

- How to use Hugging Face tokenizers for preprocessing tasks like padding, truncation, and batching of text.

- How model configuration works, including mapping between id2label and label2id for token classification tasks.

- How Hugging Face models work, including passing text through the model to generate logits.

- How to use the logits output from Hugging Face models to make predictions for your NLP task.

- How to recreate Hugging Face pipelines by using the tokenizers and models directly, instead of relying on the pipelines.

- How the results of using the tokenizers and models directly compare with the results of the Hugging Face pipelines.

The focus of this assignment is gaining hands-on experience with Hugging Face tokenizers, configuration, models, and pipelines through implementing three text processing tasks end-to-end. This will provide a deeper understanding of how these key NLP components work.

# Installing Core NLP Libraries

This section installs 3 key libraries for NLP and ML projects:

- Transformers - Provides access to pretrained models like BERT, RoBERTa for NLP.

- Datasets - Provides convenient access to common NLP datasets.

- Rich - Provides nicely formatted console output; it is used here to pretty-print results.

Installing these libraries in one line allows quick setup of the Python environment with critical functionality for working on text data.


In [None]:
!pip install transformers datasets rich

Collecting transformers
  Downloading transformers-4.33.1-py3-none-any.whl (7.6 MB)
Collecting datasets
  Downloading datasets-2.14.5-py3-none-any.whl (519 kB)
Collecting huggingface-hub<1.0,>=0.15.1 (from transformers)
  Downloading huggingface_hub-0.16.4-py3-none-any.whl (268 kB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1 (from transformers)
  Downloading tokenizers-0.13.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.8 MB)
Collecting safetensors>=0.3.1 (from transformers)
  Download

The transformers.pipeline() method provides quick access to pretrained NLP models for making predictions. The rich.pretty.pprint() method prints Python objects to the console in a readable formatted way.

In [None]:
from transformers import pipeline
from rich.pretty import pprint
import torch.nn.functional as F
import torch





# Summary

Walking through the pipeline components shows how to go from raw text to formatted predictions step by step. This gives more visibility into each stage than using the packaged pipeline alone.

# Creating a Text Classification Pipeline

This section creates a text classification pipeline using Hugging Face's transformers library. The pipeline gives quick access to a pretrained DistilBERT model fine-tuned on the SST-2 sentiment analysis dataset.

The pipeline makes predictions on some sample text, returning the sentiment label and score for each sentence.

In [None]:
classification = pipeline(task="text-classification")

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


In [None]:
raw_inputs = [
    "I've been waiting for a HuggingFace course my whole life.",
    "I hate this so much!",
]
results = classification(raw_inputs)
pprint(results)

# Loading Tokenizer, Config, and Model

This section loads the lower-level components used by the pipeline:

- Tokenizer: Preprocesses the text into ids, handles padding/truncation.

- Config: Contains the model configuration, such as hyperparameters and the mapping from class ids to labels.

- Model: The core Transformer model like DistilBERT that generates embeddings and predictions.

Loading these separately gives more control than just using the packaged pipeline.
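
The label mapping held in the config can be pictured as a pair of inverse dictionaries. The sketch below is plain Python and does not load the real config; the two labels shown are an assumption for illustration and should be checked against `config.id2label` once the config is loaded.

```python
# Hypothetical stand-in for config.id2label (assumed values, not read from the checkpoint).
id2label = {0: "NEGATIVE", 1: "POSITIVE"}

# label2id is the inverse mapping: label string -> class index.
label2id = {label: idx for idx, label in id2label.items()}

# A predicted class index becomes a readable label through a dictionary lookup.
predicted_index = 1
predicted_label = id2label[predicted_index]
print(predicted_label)
print(label2id["NEGATIVE"])
```

The model itself only ever produces class indices, so this lookup is the last step that turns a prediction into something readable.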


In [None]:
from transformers import AutoTokenizer, AutoConfig, AutoModelForSequenceClassification

In [None]:
checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"

In [None]:
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
config = AutoConfig.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)


# Tokenizing the Text

The tokenizer is used to preprocess the raw text into tokenized ids with padding & truncation to fit the expected model input shape.

This shows how the tokenizer prepares the data before passing it to the model.


In [None]:
inputs = tokenizer(raw_inputs, padding=True, truncation=True, return_tensors="pt")
pprint(inputs)
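
To make the effect of `padding` and `truncation` concrete, here is a minimal plain-Python sketch of the idea. The token ids, `PAD_ID`, `MAX_LENGTH`, and the helper `pad_and_truncate` are all hypothetical and chosen for illustration. A real tokenizer also handles special tokens when it truncates, which this simplified version does not.

```python
# Hypothetical token ids for two sentences of different lengths (made-up values).
batch = [
    [101, 7592, 2088, 999, 102],
    [101, 7592, 102],
]
PAD_ID = 0      # assumed id of the padding token
MAX_LENGTH = 4  # assumed limit, kept small so truncation is visible


def pad_and_truncate(sequences, pad_id, max_length):
    """Truncate each sequence to max_length, then pad to the longest one left."""
    truncated = [seq[:max_length] for seq in sequences]
    longest = max(len(seq) for seq in truncated)
    input_ids, attention_mask = [], []
    for seq in truncated:
        n_pad = longest - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore.
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


encoded = pad_and_truncate(batch, PAD_ID, MAX_LENGTH)
print(encoded["input_ids"])
print(encoded["attention_mask"])
```

After this step every row has the same length, which is what allows the batch to be stacked into a single tensor.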

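# Decoding the Tokenized Text

The token ids are decoded back into readable text with the tokenizer's decode method.

This shows the special tokens ([CLS] and [SEP]) that the tokenizer adds around each sentence and the [PAD] tokens appended to the shorter one. Both sentences are well under the model's length limit, so nothing is truncated here.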


In [None]:
pprint(tokenizer.decode(inputs["input_ids"][0]))
pprint(tokenizer.decode(inputs["input_ids"][1]))

# Passing Inputs to Model

The tokenized & padded inputs are passed to the model to generate predictions.

This uses the model directly instead of the pipeline, giving more control.

In [None]:
outputs = model(**inputs)
pprint(outputs)

# Interpreting Model Outputs

This happens in two steps:

1. The raw logits returned by the model are converted into probability scores with softmax.

2. The index of the highest probability for each sentence is mapped back to its sentiment label using the id2label mapping in the config.


## Convert logits to probabilities

In [None]:
predictions = F.softmax(outputs.logits, dim=-1)
pprint(predictions)
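
For intuition about what softmax does to the logits, the following dependency-free sketch applies it to one made-up pair of logits. The helper `softmax` and the values in `example_logits` are illustrative assumptions, not output from the model.

```python
import math


def softmax(logits):
    """Convert a list of logits into probabilities that sum to 1."""
    # Subtracting the maximum keeps exp() from overflowing; the result is unchanged.
    highest = max(logits)
    exps = [math.exp(x - highest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up logits for one sentence over two classes.
example_logits = [-1.5, 1.5]
probs = softmax(example_logits)
print(probs)
```

The larger logit always receives the larger probability, so softmax changes the scale of the scores but not which class wins.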

## Loop through probabilities and convert to interpretable results

In [None]:
result = []
for index, prediction in enumerate(predictions):
  probability = torch.max(prediction).item()
  sentiment = config.id2label[torch.argmax(prediction).item()]
  result.append({"probability": probability, "label": sentiment})
pprint(result)

The labels and scores match what the pipeline produced earlier. Only the key name differs: the pipeline reports "score" where this code uses "probability".

# EXERCISE 1

Write the code to analyze the sentiment of the same raw_inputs using the model "cardiffnlp/twitter-roberta-base-sentiment".



In [None]:
raw_inputs = [
    "I've been waiting for a HuggingFace course my whole life.",
    "I hate this so much!",
]

checkpoint = "cardiffnlp/twitter-roberta-base-sentiment"

# Complete the code for each of the tasks below

# initialize tokenizer
# initialize config
# initialize model
# create inputs for the model (from raw inputs)
# get model outputs
# convert logits to probabilities
# get the labels for each item
# print the result
# Now, use pipeline to do the same task
# compare the results

# EXERCISE 2
* Finish the unfinished, commented-out code below.
* Instructions start with "##".

# Creating a Token Classification Pipeline

This section creates a named entity recognition (NER) pipeline using the Hugging Face transformers library. The pipeline provides quick access to a pretrained BERT model fine-tuned on the CoNLL 2003 NER dataset.

The pipeline makes predictions on a sample input text, returning the predicted NER tags with scores for each token.


In [None]:
## Create a token classifier using pipeline
# token_classifier =

# Inspecting the Pipeline Output

The raw output from the NER pipeline, a list of dictionaries, is printed to inspect the predicted entity, score, index, word, start, and end values for each tagged token.

In [None]:
ner_raw_inputs = "My name is Wolfgang and I live in Berlin"
# result =
# pprint(result)

# Loading the Pipeline Components

The lower level tokenizer, config, and model objects that compose the pipeline are loaded. This gives more control than just using the packaged pipeline.


In [None]:
ner_checkpoint = "dbmdz/bert-large-cased-finetuned-conll03-english"
# ner_tokenizer =
# pprint(ner_tokenizer)

# Tokenizing the Input

The tokenizer preprocesses the raw text into tokenized ids, padding & truncating as needed to match the expected model input shape.


In [None]:
# ner_model_inputs =

# Passing Inputs to the Model

The tokenized inputs are passed to the model to generate predictions. This uses the model directly instead of relying on the pipeline abstraction.


In [None]:
## import the correct module for loading models
# from transformers import

# ner_model =

In [None]:
# ner_model_outputs =
# pprint(ner_model_outputs)

# Interpreting Model Outputs

The raw tensor outputs are converted to probability scores over the possible entity tags for each token.


In [None]:
# ner_predictions =
# pprint(ner_predictions)

# Converting to Human-Readable Outputs

The probabilities are parsed to extract the highest scoring entity tag per token. The start and end offsets are looked up based on the original input text.

This mirrors the output format returned by the pipeline to extract human-readable entity, score, start, end results.
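
The general shape of this conversion can be sketched in plain Python on made-up data. Everything below (the three-tag label set, the probabilities, and the character offsets) is a hypothetical stand-in. A real tokenizer may split words into subword pieces and adds special tokens, so indices and offsets from the real model will differ.

```python
# Toy data only: none of these values come from the real model.
toy_text = "My name is Wolfgang"
toy_id2label = {0: "O", 1: "B-PER", 2: "B-LOC"}  # assumed label set
toy_tokens = ["My", "name", "is", "Wolfgang"]
toy_offsets = [(0, 2), (3, 7), (8, 10), (11, 19)]  # (start, end) character spans
toy_probs = [
    [0.98, 0.01, 0.01],
    [0.97, 0.02, 0.01],
    [0.99, 0.005, 0.005],
    [0.05, 0.90, 0.05],
]

toy_entities = []
for index, probs in enumerate(toy_probs):
    # Pick the tag with the highest probability for this token.
    best_id = max(range(len(probs)), key=probs.__getitem__)
    if best_id == 0:  # index 0 is the "outside" tag in this toy label set
        continue
    start, end = toy_offsets[index]
    toy_entities.append({
        "entity": toy_id2label[best_id],
        "score": probs[best_id],
        "index": index,
        "word": toy_tokens[index],
        "start": start,
        "end": end,
    })
print(toy_entities)
```

Slicing `toy_text[start:end]` with the stored offsets recovers the tagged word, which is a convenient sanity check for the offsets.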


In [None]:
# ner_results= []
# for index, prediction in enumerate(ner_predictions[0]):
#   prediction_probability =
#   prediction_id =
#   if prediction_id > 0:
#     entity =
#     word =
#     start =
#     end =
#     ner_results.append({"entity":entity,"score":prediction_probability, "index": index, "word": word, "start": start, "end": end})
# pprint(ner_results)

# EXERCISE 3
* Finish the unfinished, commented-out code below.
* Instructions start with "##".

# Load QA Model

- qa_checkpoint: Specifies pretrained QA model from Hugging Face Hub to use

- pipeline: Constructs question answering pipeline object using the QA model

# Define Question and Context

- question: Question text string to ask the model

- context: Context paragraphs providing information to answer question

# Get QA Predictions

- qa_pipeline: Runs input question and context through model to make predictions

- qa_results: Contains predicted answer text and confidence score

- pprint: Prints prediction results in readable formatted output

This code loads a pretrained QA model, defines a question and context, passes them through the pipeline to generate an answer prediction, and prints the prediction nicely formatted. The pipeline handles running the inputs through the full model to output the top answer text span and score.

In [None]:
qa_checkpoint = "deepset/roberta-base-squad2"
qa_pipeline = pipeline("question-answering", model=qa_checkpoint)

question = "What is the capital of France?"
context = "The capital of France is Paris."

# Keyword arguments make it explicit which string is the question and which is the context
qa_results = qa_pipeline(question=question, context=context)
pprint(qa_results)

The next cells load the pretrained question answering model with the AutoModelForQuestionAnswering class, and the tokenizer that matches it with AutoTokenizer.

The tokenizer converts the question and context into numeric ids.

The model then produces start and end logits, which are used to predict the answer span.

In [None]:
## import AutoModelForQuestionAnswering from transformers, then load the model and tokenizer

# from import

# qa_model =
# qa_tokenizer =

In [None]:

# qa_model_inputs =
# pprint(qa_model_inputs)

In [None]:
# qa_model_outputs =
# pprint(qa_model_outputs)


# Extract Logits

- Get start and end logits from model outputs

# Get Prediction Indices

- Find index of maximum start and end logits

# Decode Answer Text

- Extract predicted answer tokens from input ids

- Convert tokens back to text with tokenizer

# Compute Probability

- Take softmax of start and end logits

- Find max joint probability of start and end

# Format Human-Readable Output

- Get start and end char offsets in context

- Format into dict with score, text, offsets

# Print Output

- Display prediction result nicely formatted

This takes the raw start and end logits from the model, picks the most likely start and end points, extracts the predicted answer text, computes the overall probability, and formats into a human-readable output with score and answer text.
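
The span-selection logic described above can be illustrated on made-up numbers. The sketch below uses plain Python, a hypothetical six-token input, and invented logits. It picks the start and end positions independently, so it could in principle return an end before the start; a fuller implementation would need to handle that case.

```python
import math


def softmax(logits):
    """Convert a list of logits into probabilities that sum to 1."""
    highest = max(logits)
    exps = [math.exp(x - highest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy values only: these logits are invented, not produced by the model.
toy_tokens = ["The", "capital", "of", "France", "is", "Paris"]
toy_start_logits = [0.1, 0.2, 0.0, 0.5, 0.3, 4.0]
toy_end_logits = [0.0, 0.1, 0.2, 0.4, 0.1, 4.5]

# Most likely start and end positions (index of the largest logit).
toy_start = max(range(len(toy_start_logits)), key=toy_start_logits.__getitem__)
toy_end = max(range(len(toy_end_logits)), key=toy_end_logits.__getitem__)

# Joint score: probability of the start times probability of the end.
toy_score = softmax(toy_start_logits)[toy_start] * softmax(toy_end_logits)[toy_end]

# The answer is the token span from start to end, inclusive.
toy_answer = " ".join(toy_tokens[toy_start:toy_end + 1])
print({"score": toy_score, "answer": toy_answer})
```

With real model outputs the same steps apply to tensors, and the answer text is recovered by decoding the selected input ids rather than joining word strings.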

In [None]:
start_logits, end_logits = qa_model_outputs.start_logits, qa_model_outputs.end_logits

In [None]:
## calculate the start and end positions of the logits

# start_pos =
# end_pos =
# pprint((start_pos,end_pos))

In [None]:
# answer_tokens =
# answer =
# start_probs, end_probs =
# probability =
# start =
# end =
# result = {"score": probability, "answer": answer, "start": start, "end": end}
# pprint(result)