# Hands-On with Pre-Trained Transformers: BERT and GPT

## Introduction to BERT and GPT

### What is BERT?  
**BERT** (Bidirectional Encoder Representations from Transformers) is a pre-trained language model developed by Google AI.  
BERT processes input sequences bidirectionally, meaning it considers context from both the left and right of each word. This enables a deeper understanding of word meaning and context within a sentence.

#### Key Features of BERT
- **Bidirectional Context:** Considers both previous and next words, improving comprehension of ambiguous language.
- **Transformer Encoder-Based:** Built on the transformer encoder architecture, which is highly effective for understanding and representing input text.
- **Pretraining Tasks:**
    - **Masked Language Modeling (MLM):** Randomly masks a fraction of the input tokens (15% in the original BERT setup) and trains the model to predict them from the surrounding context, which teaches it deep contextual representations.
    - **Next Sentence Prediction (NSP):** Trains the model to predict if one sentence logically follows another, aiding in tasks like question answering and natural language inference.
- **Applications:**  
    - Sentiment analysis  
    - Named entity recognition (NER)  
    - Question answering  
    - Text classification  
    - Semantic search  
    - Extractive summarization (selecting key sentences from a document)
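
The masking step behind MLM can be illustrated with a small, dependency-free sketch. It is a simplification: the function name `mask_tokens`, the whitespace tokenization, and the example sentence are assumptions made for illustration, and the original BERT recipe additionally replaces some selected tokens with random tokens or leaves them unchanged, which is omitted here.

```python
import random

MASK_TOKEN = "[MASK]"


def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Return (masked_tokens, labels) for a simplified MLM example.

    labels[i] holds the original token where a mask was applied and None
    elsewhere, so a loss would only be computed at the masked positions.
    """
    rng = random.Random(seed)
    masked, labels = [], []
    for token in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK_TOKEN)
            labels.append(token)
        else:
            masked.append(token)
            labels.append(None)
    return masked, labels


# Hypothetical example sentence, split on whitespace instead of WordPiece.
tokens = "the bank raised interest rates again this year".split()
masked, labels = mask_tokens(tokens, mask_prob=0.3, seed=1)
print(masked)
print(labels)
```

Because the model sees the unmasked tokens on both sides of each `[MASK]`, predicting the hidden token forces it to use bidirectional context.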

---

### What is GPT?  
**GPT** (Generative Pre-trained Transformer) is a family of language models developed by OpenAI.  
GPT processes input sequences unidirectionally (left to right), making it particularly effective for generative tasks such as text completion and generation.

#### Key Features of GPT
- **Unidirectional Context:** Processes text from left to right, focusing on predicting the next word in a sequence, which is ideal for text generation.
- **Transformer Decoder-Based:** Utilizes the transformer decoder architecture, optimized for generating coherent and contextually relevant text.
- **Pretraining Task:**
    - **Causal Language Modeling:** Trains the model to predict the next word in a sequence, given the previous words, enabling fluent and context-aware text generation.
- **Applications:**  
    - Text generation  
    - Chatbots and conversational AI  
    - Summarization  
    - Creative writing  
    - Code generation  
    - Translation
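
To make the causal language modeling objective concrete, the sketch below turns one token sequence into the (context, target) pairs a left-to-right model learns from. The helper name `next_token_examples` and the word-level tokens are assumptions for illustration; real GPT models operate on subword tokens and compute all positions in a single pass.

```python
def next_token_examples(tokens):
    """Pair every non-empty prefix of `tokens` with the token that follows it."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


tokens = ["Once", "upon", "a", "time"]
for context, target in next_token_examples(tokens):
    print(context, "->", target)
# ['Once'] -> upon
# ['Once', 'upon'] -> a
# ['Once', 'upon', 'a'] -> time
```

Each target depends only on the tokens to its left, which is why the same model can generate text by repeatedly predicting and appending the next token.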

---

### Key Differences Between BERT and GPT

| Feature                | BERT                                         | GPT                                      |
|------------------------|----------------------------------------------|------------------------------------------|
| Architecture           | Transformer Encoder                          | Transformer Decoder                      |
| Context Processing     | Bidirectional                                | Unidirectional (left-to-right)           |
| Pretraining Tasks      | MLM, NSP                                     | Causal Language Modeling                 |
| Main Strength          | Understanding and representing text          | Generating coherent and fluent text      |
| Typical Applications   | Classification, NER, QA, semantic search     | Text generation, chatbots, summarization |

---

Both BERT and GPT have revolutionized natural language processing by leveraging the transformer architecture, but they are optimized for different tasks:  
- **BERT** excels at understanding and representing text for downstream tasks that require comprehension.
- **GPT** is designed for generating text, making it suitable for creative and conversational applications.
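
The difference in context processing comes down to which positions each token is allowed to attend to. The following sketch builds the two attention patterns as plain nested lists (1 = may attend, 0 = blocked). The function names are hypothetical, and real implementations apply such masks inside the attention computation rather than as standalone lists.

```python
def bidirectional_mask(n):
    """Every position may attend to every other position (encoder style)."""
    return [[1] * n for _ in range(n)]


def causal_mask(n):
    """Position i may attend only to positions 0..i (decoder style)."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]


for name, mask in [("bidirectional", bidirectional_mask(4)), ("causal", causal_mask(4))]:
    print(name)
    for row in mask:
        print(" ", row)
```

The lower-triangular causal mask is what prevents a GPT-style model from looking at future tokens during training.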

---

## Fine-Tuning Pretrained Models for Downstream Tasks

### Why Fine-Tune?
- Pretrained models are trained on large, generic datasets (e.g., Wikipedia, BookCorpus).
- Fine-tuning adapts these models to specific tasks (e.g., sentiment analysis, classification, NER) by training them further on task-specific data.
- This approach leverages the general language understanding of the pretrained model and tailors it to the nuances of the target task.

### Steps to Fine-Tune a Pretrained Model

1. **Load a Pretrained Model:**  
   Use libraries like [Hugging Face Transformers](https://huggingface.co/transformers/) to load a pretrained BERT or GPT model.

2. **Prepare the Dataset:**  
   - Format your dataset for the specific task (e.g., tokenization for text classification, labeling for NER).
   - Split the data into training, validation, and test sets.

3. **Configure the Model for the Task:**  
   - Add task-specific layers (e.g., classification head for sentiment analysis).
   - Set up loss functions and evaluation metrics appropriate for the task.

4. **Train and Evaluate:**  
   - Fine-tune the model using your task-specific data.
   - Monitor performance on the validation set to avoid overfitting.
   - Evaluate the final model on the test set.

5. **Deploy and Use:**  
   - Save the fine-tuned model.
   - Integrate it into your application for inference on new data.
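
As a minimal sketch of the data split in step 2, the snippet below shuffles labeled examples and carves out validation and test portions. The function name `split_dataset`, the 80/10/10 proportions, and the synthetic examples are assumptions for illustration; in practice a library helper may be more convenient.

```python
import random


def split_dataset(examples, val_frac=0.1, test_frac=0.1, seed=42):
    """Shuffle `examples` and split them into train, validation, and test lists."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_frac)
    n_val = int(len(shuffled) * val_frac)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test


# Synthetic (text, label) pairs standing in for a real labeled dataset.
examples = [(f"review {i}", i % 2) for i in range(100)]
train, val, test = split_dataset(examples)
print(len(train), len(val), len(test))  # 80 10 10
```

Keeping the test portion untouched until the final evaluation is what makes the reported score a fair estimate of performance on new data.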

---

Fine-tuning lets you apply large pretrained language models to your own specialized NLP tasks, often reaching strong results with relatively little labeled data.

In [1]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification, Trainer, TrainingArguments
from datasets import load_dataset

Load and preprocess the IMDb dataset, then fine-tune BERT for binary sentiment classification:

In [2]:
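# Load the IMDb movie-review dataset (25,000 training and 25,000 test examples)
dataset = load_dataset("imdb")

# Load the tokenizer that matches the pretrained checkpoint
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")


# Tokenize the dataset, truncating and padding every review to the model's maximum length
def tokenize_function(examples):
    return tokenizer(examples["text"], truncation=True, padding="max_length")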


tokenized_datasets = dataset.map(tokenize_function, batched=True)

# prepare data for training
tokenized_datasets = tokenized_datasets.remove_columns("text")
tokenized_datasets = tokenized_datasets.rename_column("label", "labels")
tokenized_datasets.set_format("torch")

train_dataset = tokenized_datasets["train"]
test_dataset = tokenized_datasets["test"]

# load model
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)

# define training arguments
training_args = TrainingArguments(
    output_dir="./results",
    eval_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=3,
    weight_decay=0.01,
    logging_dir="./logs",
    logging_steps=10,
    save_steps=500,
)
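# Create the trainer. Note: the test split doubles as the evaluation set here;
# hold out a separate validation split if you tune hyperparameters.
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
    processing_class=tokenizer,
)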

trainer.train()

results = trainer.evaluate()
print("Evaluation results:", results)


Map: 100%|██████████| 25000/25000 [00:05<00:00, 4454.77 examples/s]
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


| Epoch | Training Loss | Validation Loss |
|-------|---------------|-----------------|
| 1     | 0.1853        | 0.244896        |
| 2     | 0.3436        | 0.31357         |
| 3     | 0.001         | 0.310552        |




Evaluation results: {'eval_loss': 0.31055212020874023, 'eval_runtime': 6540.8205, 'eval_samples_per_second': 3.822, 'eval_steps_per_second': 0.478, 'epoch': 3.0}
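
The evaluation results above report only the loss because no metric function was given to the trainer. A minimal sketch of an accuracy metric is shown below, assuming it would be passed to `Trainer` through its `compute_metrics` argument. The toy logits and labels are made up for illustration, and the function is written in plain Python so it can be checked without running a model.

```python
def compute_metrics(eval_pred):
    """Compute accuracy from a (logits, labels) pair."""
    logits, labels = eval_pred
    # Predicted class = index of the largest logit in each row.
    predictions = [max(range(len(row)), key=lambda i: row[i]) for row in logits]
    correct = sum(int(p == int(y)) for p, y in zip(predictions, labels))
    return {"accuracy": correct / len(predictions)}


# Toy check with hand-written logits for three examples (two are classified correctly).
toy_logits = [[2.0, -1.0], [0.1, 0.3], [1.5, 0.2]]
toy_labels = [0, 1, 1]
print(compute_metrics((toy_logits, toy_labels)))
```

With such a function supplied, `trainer.evaluate()` would be expected to include an accuracy entry alongside `eval_loss`, which is easier to interpret for a sentiment classifier.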


Experiment with GPT-2 text generation:

In [3]:
from transformers import AutoModelForCausalLM, AutoTokenizer

# GPT-2 needs its own tokenizer. Reusing the BERT `tokenizer` from the previous cell
# feeds BERT token ids to GPT-2 and decodes its output with BERT's vocabulary, giving
# garbage such as "once upon a time [unused193] [unused193] [unused812] upon a time ...".
gpt_tokenizer = AutoTokenizer.from_pretrained("gpt2")
gpt_model = AutoModelForCausalLM.from_pretrained("gpt2")

input_text = "Once upon a time"
inputs = gpt_tokenizer(input_text, return_tensors="pt")

# Pass the attention mask and set pad_token_id explicitly (GPT-2 has no pad token)
output = gpt_model.generate(
    **inputs, max_length=50, num_return_sequences=1,
    pad_token_id=gpt_tokenizer.eos_token_id,
)

generated_text = gpt_tokenizer.decode(output[0], skip_special_tokens=True)
print("Generated text:", generated_text)
