### Transfer Learning in NLP

### Popular Pre-Trained NLP Models

#### **BERT (Bidirectional Encoder Representations from Transformers)**
- **Architecture:** Transformer-based encoder model
- **Training Tasks:**
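    - **Masked Language Modeling (MLM):** Randomly masks a subset of input tokens (15% in the original BERT setup) and trains the model to predict them from the surrounding left and right context.
    - **Next Sentence Prediction (NSP):** Trains the model to predict whether the second sentence in a pair actually follows the first in the source text.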
- **Applications:**
    - Text classification
    - Sentiment analysis
    - Question answering
    - Named entity recognition (NER)
    - Sentence similarity

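The masking objective described above can be sketched in a few lines of plain Python. This is a simplified illustration, not BERT's actual data pipeline: the helper name `mask_tokens`, the whitespace tokenization, and the `seed` parameter are assumptions made for the example, and every selected token is replaced with `[MASK]` (the original BERT recipe also swaps some selected tokens for random ones or leaves them unchanged). Positions with a label of `None` would not contribute to the loss.

```python
import random

MASK_TOKEN = "[MASK]"

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Return (masked_tokens, labels) for a simplified MLM objective.

    labels[i] holds the original token where position i was masked and
    None elsewhere, so a loss would be computed only on masked positions.
    """
    rng = random.Random(seed)
    masked, labels = [], []
    for token in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK_TOKEN)   # hide the token from the model
            labels.append(token)        # remember what it should predict
        else:
            masked.append(token)
            labels.append(None)
    return masked, labels

# Hypothetical example sentence, split on whitespace for simplicity.
tokens = "the cat sat on the mat because it was tired".split()
masked, labels = mask_tokens(tokens, mask_prob=0.3, seed=1)
print(masked)
print(labels)
```
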
---

#### **GPT (Generative Pretrained Transformer)**
- **Architecture:** Transformer-based decoder model
- **Training Task:**
    - **Causal Language Modeling:** Predicts the next token in a sequence using only the tokens to its left (unidirectional).
- **Applications:**
    - Text generation
    - Summarization
    - Dialogue systems (chatbots)
    - Code generation
    - Creative writing

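To make the unidirectional objective concrete, the sketch below lists the (context, target) pairs a causal language model is trained on for one short sequence. The helper name `next_token_pairs` and the word-level tokens are assumptions for illustration; real GPT models work on subword tokens and compute all positions in parallel with an attention mask rather than looping.

```python
def next_token_pairs(tokens):
    """Return (context, target) pairs for next-token prediction.

    Each target is predicted only from the tokens that precede it.
    """
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

# Hypothetical word-level sequence.
tokens = ["Transfer", "learning", "saves", "compute"]
for context, target in next_token_pairs(tokens):
    print(context, "->", target)
```
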
---

#### **T5 (Text-to-Text Transfer Transformer)**
- **Approach:** Treats every NLP problem as a text-to-text task (input and output are always text).
- **Applications:**
    - Summarization
    - Translation
    - Text classification
    - Question answering
    - Sentence paraphrasing

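The text-to-text framing can be illustrated by showing how different tasks reduce to the same (input string, target string) shape. The helper `to_text_pair`, the field names, and the example sentences are hypothetical; the task prefixes are illustrative, and for fine-tuning any prefix works as long as it is used consistently.

```python
def to_text_pair(task_prefix, source, target):
    """Format one example as plain input text and plain target text."""
    return {
        "input_text": f"{task_prefix}: {source}",
        "target_text": str(target),  # even class labels become strings
    }

# Three different tasks, one common format.
examples = [
    to_text_pair("translate English to German", "The house is small.", "Das Haus ist klein."),
    to_text_pair("classify sentiment", "A wonderful film.", "positive"),
    to_text_pair("summarize", "The meeting covered budget, hiring, and the launch date.", "Meeting recap."),
]
for example in examples:
    print(example)
```
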
---

#### **RoBERTa (Robustly Optimized BERT Approach)**
- **Improvements over BERT:**
    - Removes Next Sentence Prediction (NSP) task.
    - Trained longer on more data, with larger batches, longer sequences, and dynamic masking (the mask pattern changes each time a sequence is seen).
- **Applications:**
    - Similar to BERT but often achieves better performance on downstream tasks such as classification, NER, and QA.

---

### Tokenization and Text Preprocessing for Fine-Tuning NLP Models

#### **Tokenization**
- Splits raw text into tokens and maps each token to an integer ID that the model can process.
- **Types:**
    - **WordPiece Tokenization:** Used in BERT; splits words into subword units.
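    - **Byte Pair Encoding (BPE):** Used in GPT and RoBERTa; starts from characters (or bytes) and repeatedly merges the most frequent adjacent pair into a new subword.
    - **SentencePiece:** Used in T5; learns subword units directly from raw text without requiring whitespace pre-tokenization.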

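As a rough illustration of subword splitting, the sketch below applies greedy longest-match-first segmentation in the style of WordPiece, where pieces that continue a word carry a `##` prefix. The function name `wordpiece_tokenize` and the tiny vocabulary are assumptions for the example; a real BERT tokenizer also normalizes text, splits punctuation, and uses a vocabulary of roughly 30,000 entries.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Split one word into subword pieces by greedy longest match.

    Continuation pieces are looked up with a '##' prefix. If some part
    of the word cannot be matched, the whole word maps to unk_token.
    """
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]
        pieces.append(match)
        start = end
    return pieces

# Hypothetical toy vocabulary.
toy_vocab = {"un", "##aff", "##able", "play", "##ing"}
print(wordpiece_tokenize("unaffable", toy_vocab))
print(wordpiece_tokenize("playing", toy_vocab))
print(wordpiece_tokenize("xyz", toy_vocab))
```
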
#### **Text Preprocessing**
- **Cleaning:**
    - Remove unnecessary characters (e.g., URLs, special symbols, HTML tags).
    - Normalize text (collapse extra whitespace; lowercase only for uncased models such as `bert-base-uncased`, whose tokenizer already lowercases its input).
- **Optional Steps:**
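    - Remove stopwords (common words like "the" and "is") only if the task does not need them; Transformer models usually benefit from seeing the full sentence.
    - Lemmatization or stemming (reduce words to their base form); rarely needed with subword tokenizers.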
- **Tokenization:**
    - Break text into tokens compatible with the chosen pre-trained model.

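The cleaning steps listed above might look like the following minimal sketch, which uses only the standard library. The function name `clean_text`, the regular expressions, and the `lowercase` flag are assumptions for illustration; the patterns are deliberately simple and would need tightening for messy real-world HTML.

```python
import re

TAG_RE = re.compile(r"<[^>]+>")                 # crude HTML tag matcher
URL_RE = re.compile(r"https?://\S+|www\.\S+")   # http(s) and www links
SPACE_RE = re.compile(r"\s+")                   # runs of whitespace

def clean_text(text, lowercase=False):
    """Strip HTML tags and URLs, then collapse whitespace."""
    text = TAG_RE.sub(" ", text)
    text = URL_RE.sub(" ", text)
    if lowercase:
        # Only useful for uncased models; cased models need the original case.
        text = text.lower()
    return SPACE_RE.sub(" ", text).strip()

raw = "<p>Great movie!</p>  See https://example.com   NOW"
print(clean_text(raw))
print(clean_text(raw, lowercase=True))
```
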
---

### Adapting Pre-Trained Models for NLP Tasks

#### **Common Tasks**
- **Text Classification:** Categorize text into predefined labels (e.g., spam detection, topic classification).
- **Sentiment Analysis:** Determine the sentiment polarity (positive, negative, neutral) of text.
- **Summarization:** Generate concise summaries from lengthy texts.
- **Named Entity Recognition (NER):** Identify entities like names, locations, and organizations in text.
- **Question Answering:** Extract answers from context passages.

#### **Fine-Tuning Steps**
1. **Load Pretrained Model:** Choose a model architecture and load pretrained weights.
2. **Add a Task-Specific Head:** Attach a classification, regression, or sequence labeling head as needed.
3. **Prepare Data:** Tokenize and preprocess your dataset according to the model's requirements.
4. **Fine-Tune Model:** Train the model on your specific dataset, adjusting weights for your task.
5. **Evaluate and Deploy:** Assess performance on validation/test data and deploy the model for inference.

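For the evaluation step, `Trainer` accepts a `compute_metrics` callable that receives the model's predictions together with the true label IDs and returns a dictionary of metric names and values. The sketch below computes plain accuracy for a classification head without any extra libraries; the helper name `argmax` and the toy logits are assumptions for the example. Under those assumptions it could be passed as `Trainer(..., compute_metrics=compute_metrics)`, so that accuracy is reported alongside the evaluation loss.

```python
def argmax(row):
    """Index of the largest value in a sequence of scores."""
    return max(range(len(row)), key=lambda i: row[i])

def compute_metrics(eval_pred):
    """Compute accuracy from (logits, labels).

    logits: one row of class scores per example.
    labels: the true class index for each example.
    """
    logits, labels = eval_pred
    predictions = [argmax(row) for row in logits]
    correct = sum(int(pred == label) for pred, label in zip(predictions, labels))
    return {"accuracy": correct / len(labels)}

# Toy logits for three examples and two classes.
toy_logits = [[0.1, 0.9], [2.0, -1.0], [0.3, 0.4]]
toy_labels = [1, 0, 0]
print(compute_metrics((toy_logits, toy_labels)))
```
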
---

In [None]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification, Trainer, TrainingArguments
from datasets import load_dataset

In [None]:
# Load the IMDB movie-review dataset (binary sentiment labels).
dataset = load_dataset("imdb")

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def tokenize_function(examples):
    # Truncate or pad every review to exactly 128 tokens.
    return tokenizer(examples["text"], truncation=True, padding="max_length", max_length=128)

tokenized_dataset = dataset.map(tokenize_function, batched=True)

# Keep only the columns the model consumes and use the label name Trainer expects.
tokenized_dataset = tokenized_dataset.remove_columns("text")
tokenized_dataset = tokenized_dataset.rename_column("label", "labels")
tokenized_dataset.set_format("torch")

train_dataset = tokenized_dataset["train"]
test_dataset = tokenized_dataset["test"]

# BERT encoder with a freshly initialized two-class classification head.
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

training_args = TrainingArguments(
    output_dir="./results",
    eval_strategy="epoch",  # replaces the deprecated `evaluation_strategy` argument
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.01,
    save_total_limit=2,  # keep only the two most recent checkpoints
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
    processing_class=tokenizer,
)

trainer.train()

results = trainer.evaluate()
print("Evaluation results:", results)

In [None]:
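from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

# T5 generates text, so the integer IMDB labels are mapped to target words.
# The unlabeled "unsupervised" split uses -1, which falls back to "unknown".
label_names = {0: "negative", 1: "positive"}

def preprocess_t5(examples):
    # Prefix each review with a task description to form the text-to-text input.
    inputs = ["classify sentiment: " + doc for doc in examples["text"]]
    model_inputs = tokenizer(inputs, max_length=128, truncation=True, padding="max_length")
    targets = [label_names.get(label, "unknown") for label in examples["label"]]
    labels = tokenizer(targets, max_length=8, truncation=True, padding="max_length")
    # Replace padding token IDs with -100 so the loss ignores them.
    model_inputs["labels"] = [
        [tok if tok != tokenizer.pad_token_id else -100 for tok in seq]
        for seq in labels["input_ids"]
    ]
    return model_inputs

# Drop the original columns so only the tokenized fields reach the Trainer.
tokenized_t5 = dataset.map(preprocess_t5, batched=True, remove_columns=dataset["train"].column_names)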

training_args = TrainingArguments(
    output_dir="./results_t5",  # separate directory so BERT checkpoints are not mixed in
    eval_strategy="epoch",  # replaces the deprecated `evaluation_strategy` argument
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.01,
    save_total_limit=2,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_t5["train"],
    eval_dataset=tokenized_t5["test"],
    processing_class=tokenizer,
)

trainer.train()

results = trainer.evaluate()
print("Evaluation results:", results)