# Week 3: Transformers & Modern NLP Coding Tasks

In [1]:
!pip install transformers datasets sentencepiece torch accelerate



In [2]:
import torch
from transformers import pipeline, AutoTokenizer, AutoModelForSequenceClassification, Trainer, TrainingArguments
from datasets import load_dataset
import sentencepiece as spm
import os


## 1. Inference Using Transformers

In [3]:
classifier = pipeline("sentiment-analysis")

examples = [
    "Transformers are amazing and have revolutionized NLP!",
    "I am not a fan of waiting in long lines.",
    "The movie was okay, but the ending was disappointing."
]

results = classifier(examples)

for text, result in zip(examples, results):
    print(f"Text: {text}\nLabel: {result['label']}, Score: {result['score']:.4f}\n")

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision 714eb0f (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
Device set to use cpu
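
The warning above goes away once the checkpoint is named explicitly. A minimal sketch, assuming the same checkpoint and revision that the log reports; `PINNED` and `build_classifier` are hypothetical names introduced here for illustration only.

```python
# Checkpoint and revision copied from the log message above.
PINNED = {
    "task": "sentiment-analysis",
    "model": "distilbert/distilbert-base-uncased-finetuned-sst-2-english",
    "revision": "714eb0f",
}


def build_classifier():
    """Create the pipeline with an explicit model and revision (hypothetical helper)."""
    # Imported lazily so the configuration can be inspected without loading the library.
    from transformers import pipeline

    return pipeline(**PINNED)
```

Calling `build_classifier()` should behave like `pipeline("sentiment-analysis")` in the cell above, but the model no longer changes if the library's default does.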


Text: Transformers are amazing and have revolutionized NLP!
Label: POSITIVE, Score: 0.9998

Text: I am not a fan of waiting in long lines.
Label: NEGATIVE, Score: 0.9210

Text: The movie was okay, but the ending was disappointing.
Label: NEGATIVE, Score: 0.9991
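
For a two-label classifier like this one, the `score` values above are softmax probabilities of the winning label. The sketch below shows that step in plain Python. The logits are made-up numbers, not values from the run above, and `top_label` is a hypothetical helper rather than part of the `transformers` API.

```python
import math


def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_label(logits, labels=("NEGATIVE", "POSITIVE")):
    """Return the most probable label and its probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"label": labels[best], "score": probs[best]}


# Hypothetical logits for one sentence: strongly positive.
print(top_label([-4.0, 4.5]))
```

A score near 1.0 therefore means the two logits are far apart. It is not a calibrated measure of how correct the prediction is.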



## 2. Fine-tuning BERT for Text Classification

In [4]:
dataset = load_dataset("imdb")
small_train_dataset = dataset["train"].shuffle(seed=42).select(range(100))
small_test_dataset = dataset["test"].shuffle(seed=42).select(range(50))

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

def tokenize_function(examples):
    # Pad or truncate every review to the tokenizer's maximum length so all examples share one shape.
    return tokenizer(examples["text"], padding="max_length", truncation=True)

tokenized_train = small_train_dataset.map(tokenize_function, batched=True)
tokenized_test = small_test_dataset.map(tokenize_function, batched=True)

model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=2)

training_args = TrainingArguments(
    output_dir="./results",
    eval_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=3,
    weight_decay=0.01,
    logging_dir='./logs',
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_test,
)

print("Starting training...")
# trainer.train()
print("Training setup complete. Uncomment trainer.train() to execute.")

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Starting training...
Training setup complete. Uncomment trainer.train() to execute.
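
The `Trainer` above is given no metric function, so evaluation at each epoch would report only the loss. A minimal sketch of an accuracy metric follows, assuming the usual `(logits, labels)` pair that `Trainer` hands to `compute_metrics`. The toy logits and labels are hypothetical values used only to exercise the function.

```python
def compute_metrics(eval_pred):
    """Accuracy from (logits, labels); works on nested lists or NumPy arrays."""
    logits, labels = eval_pred
    # Argmax over the class dimension for each example.
    preds = [max(range(len(row)), key=lambda i: row[i]) for row in logits]
    correct = sum(int(p == y) for p, y in zip(preds, labels))
    return {"accuracy": correct / len(preds)}


# Toy check with hand-written logits (hypothetical values).
toy_logits = [[2.0, -1.0], [0.1, 0.3], [-0.5, 1.5], [1.2, 0.4]]
toy_labels = [0, 1, 0, 0]
print(compute_metrics((toy_logits, toy_labels)))  # {'accuracy': 0.75}
```

To use it, pass `compute_metrics=compute_metrics` when constructing `Trainer` and uncomment `trainer.train()`. With only 100 training and 50 test examples, expect the reported accuracy to be noisy.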


## 3. Train a Custom SentencePiece Tokenizer

In [5]:
dummy_text = """
Transformers are state-of-the-art models in NLP.
SentencePiece is an unsupervised text tokenizer and detokenizer.
It is mainly used for Neural Network-based text generation systems.
Machine learning is fascinating and powerful.
"""

with open("bot_corpus.txt", "w", encoding="utf-8") as f:
    f.write(dummy_text)

# vocab_size is kept very small because the corpus is only four sentences long.
spm.SentencePieceTrainer.train(input='bot_corpus.txt', model_prefix='m', vocab_size=50)

sp = spm.SentencePieceProcessor()
sp.load('m.model')

text_to_tokenize = "SentencePiece is an unsupervised text tokenizer"
encoded_pieces = sp.encode_as_pieces(text_to_tokenize)
encoded_ids = sp.encode_as_ids(text_to_tokenize)

print(f"Original: {text_to_tokenize}")
print(f"Pieces: {encoded_pieces}")
print(f"IDs: {encoded_ids}")

decoded_text = sp.decode_ids(encoded_ids)
print(f"Decoded: {decoded_text}")

Original: SentencePiece is an unsupervised text tokenizer
Pieces: ['▁', 'S', 'en', 't', 'en', 'ce', 'P', 'i', 'e', 'ce', '▁', 'is', '▁an', '▁', 'u', 'n', 's', 'u', 'p', 'er', 'v', 'is', 'e', 'd', '▁', 'te', 'x', 't', '▁', 'tokenizer']
IDs: [3, 39, 22, 5, 22, 37, 26, 38, 4, 37, 3, 14, 18, 3, 16, 15, 6, 16, 30, 13, 45, 14, 4, 7, 3, 20, 32, 5, 3, 31]
Decoded: SentencePiece is an unsupervised text tokenizer
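
The `▁` piece in the output marks a preceding space, which is why decoding recovers the original string exactly. With a 50-entry vocabulary most words fall back to one- or two-character pieces. The sketch below is a simplified, hand-written version of the decoding step applied to the pieces printed above. `detokenize` is a hypothetical helper, and the real `SentencePieceProcessor` decoder handles further cases that this one ignores.

```python
WORD_BOUNDARY = "▁"  # U+2581, the marker SentencePiece uses for a preceding space


def detokenize(pieces):
    """Rebuild text from pieces: concatenate, then turn markers back into spaces."""
    return "".join(pieces).replace(WORD_BOUNDARY, " ").lstrip(" ")


# Pieces copied from the cell output above.
pieces = ['▁', 'S', 'en', 't', 'en', 'ce', 'P', 'i', 'e', 'ce', '▁', 'is', '▁an',
          '▁', 'u', 'n', 's', 'u', 'p', 'er', 'v', 'is', 'e', 'd', '▁', 'te', 'x',
          't', '▁', 'tokenizer']

print(detokenize(pieces))
```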


## 4. Mini-Project: Text Summarization

In [6]:
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

long_text = """
The Transformer is a deep learning model introduced in 2017 by Google researchers in the paper "Attention Is All You Need". 
It is primarily used in the field of natural language processing (NLP). 
Like recurrent neural networks (RNNs), transformers are designed to process sequential input data, such as natural language, 
with applications for tasks such as translation and text summarization. 
However, unlike RNNs, transformers process the entire input all at once. 
The attention mechanism provides context for any position in the input sequence. 
For example, if the input data is a natural language sentence, the transformer does not need to process the beginning of it before the end. 
The transformer model has replaced RNNs and LSTMs as the model of choice for NLP problems, replacing them with attention mechanisms 
that allow for parallelization and better handling of long-range dependencies.
"""

# max_length and min_length are measured in tokens, not characters.
summary = summarizer(long_text, max_length=60, min_length=30, do_sample=False)

print("Original Text Length:", len(long_text))
print("\nSummary:")
print(summary[0]['summary_text'])

Device set to use cpu


Original Text Length: 907

Summary:
The Transformer is a deep learning model introduced in 2017 by Google researchers in the paper "Attention Is All You Need" It is primarily used in the field of natural language processing (NLP)
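
The printed length (907) counts characters, so the simplest way to see how much the text was condensed is to compare character counts. A minimal sketch follows. `compression_ratio` is a hypothetical helper, and the strings in the demo are placeholders rather than the texts above.

```python
def compression_ratio(original: str, summary: str) -> float:
    """Summary length as a fraction of the original, both measured in characters."""
    if not original:
        raise ValueError("original text must be non-empty")
    return len(summary) / len(original)


# Placeholder strings; in the notebook, pass long_text and summary[0]['summary_text'].
demo_original = "a" * 200
demo_summary = "a" * 50
print(f"Compression ratio: {compression_ratio(demo_original, demo_summary):.2f}")
```

In this run the summary matches the first two sentences of the input almost word for word, so a low ratio here reflects extraction of the opening rather than rephrasing.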
