# Advanced Transformers: BERT Variants and GPT-3

## Exploration of BERT Variants

### Why BERT Variants?
While BERT is a powerful transformer-based model, it has some limitations:
- **Large computational requirements:** BERT's size and architecture make it resource-intensive, limiting its deployment in real-time or resource-constrained environments.
- **Limited fit for some tasks and domains:** The original BERT was pre-trained on general-purpose text, so it may underperform on domain-specific data such as social media posts.

**BERT variants** have been developed to:
- Optimize the model for specific tasks or domains
- Improve performance on downstream tasks
- Reduce computational overhead and memory usage

---

### Key BERT Variants

- **RoBERTa (Robustly Optimized BERT Approach):**
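    - Removes the Next Sentence Prediction (NSP) objective, which its authors found did not improve downstream performance.
    - Trains on more data with larger batch sizes and longer sequences.
    - Uses dynamic masking: the masked positions are re-sampled each time a sequence is seen, rather than fixed once during preprocessing.
    - **Use Case:** Tasks that depend on context understanding, such as reading comprehension and sentiment analysis, where it often outperforms the original BERT.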

- **DistilBERT:**
    - A distilled (smaller and faster) version of BERT.
    - Retains ~97% of BERT's performance while being 60% faster and 40% smaller.
    - Achieves efficiency through knowledge distillation: a smaller student model is trained to reproduce the outputs of the full BERT teacher.
    - **Use Case:** Ideal for real-time applications and deployment on devices with limited resources.

- **ALBERT (A Lite BERT):**
    - Reduces memory consumption by factorizing embeddings and sharing parameters across layers.
    - Achieves similar or better performance with fewer parameters.
    - **Use Case:** Suitable for large-scale pre-training and downstream tasks with memory limitations.

- **BERTweet:**
    - Pre-trained on large-scale English Twitter data (about 850 million tweets), following the RoBERTa pre-training procedure.
    - Specialized for social media text, handling informal language, hashtags, and emojis.
    - **Use Case:** Social media sentiment analysis, hashtag prediction, and other Twitter-specific NLP tasks.
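
To make ALBERT's embedding factorization concrete, the sketch below counts embedding parameters with and without it. Instead of one V x H lookup table, ALBERT uses a V x E table followed by an E x H projection, with E much smaller than H. The helper name `embedding_params` is hypothetical, and the sizes (30,000-token vocabulary, hidden size 768, embedding size 128) are illustrative assumptions rather than values taken from this notebook.

```python
from typing import Optional


def embedding_params(vocab_size: int, hidden_size: int,
                     embedding_size: Optional[int] = None) -> int:
    """Count parameters in the token-embedding block (biases ignored)."""
    if embedding_size is None:
        # BERT-style: a single V x H lookup table.
        return vocab_size * hidden_size
    # ALBERT-style: a V x E lookup table followed by an E x H projection.
    return vocab_size * embedding_size + embedding_size * hidden_size


# Illustrative sizes (assumptions, not taken from this notebook).
V, H, E = 30_000, 768, 128

bert_style = embedding_params(V, H)
albert_style = embedding_params(V, H, E)

print(f"BERT-style embedding parameters:   {bert_style:,}")
print(f"ALBERT-style embedding parameters: {albert_style:,}")
print(f"Reduction: {1 - albert_style / bert_style:.1%}")
```

This covers only the embedding block. Cross-layer parameter sharing reduces the encoder's parameter count separately.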

---

## Introduction to GPT-3

### What is GPT-3?
- Developed by **OpenAI**.
- A massive language model with **175 billion parameters** trained on diverse internet-scale datasets.
- Excels at generating coherent, contextually relevant, and human-like text.

### Key Features of GPT-3
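- **Zero-shot and few-shot learning:** Can perform new tasks without fine-tuning by conditioning on a prompt that contains a task description (zero-shot) or a few worked examples (few-shot). No model weights are updated.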
- **Versatility:** Used for text generation, summarization, question answering, translation, code generation, and conversational AI.
- **Contextual understanding:** Attends to everything inside its context window (2,048 tokens for the original GPT-3), which supports multi-turn conversations and longer-form content.
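
As a rough illustration of zero-shot versus few-shot prompting, the sketch below assembles a prompt string from an instruction, optional worked examples, and a new query. The helper `build_few_shot_prompt` and the `Review:` / `Sentiment:` layout are assumptions made for this example, not a format required by any API.

```python
from typing import List, Tuple


def build_few_shot_prompt(instruction: str,
                          examples: List[Tuple[str, str]],
                          query: str) -> str:
    """Build a prompt; an empty `examples` list yields a zero-shot prompt."""
    lines = [instruction, ""]
    for text, label in examples:
        # Each worked example shows the model the expected input/output pattern.
        lines.append(f"Review: {text}")
        lines.append(f"Sentiment: {label}")
        lines.append("")
    # The final entry is left open for the model to complete.
    lines.append(f"Review: {query}")
    lines.append("Sentiment:")
    return "\n".join(lines)


instruction = "Classify the sentiment of each review as Positive or Negative."
examples = [
    ("The battery lasts all day.", "Positive"),
    ("It stopped working after a week.", "Negative"),
]

zero_shot = build_few_shot_prompt(instruction, [], "Setup was quick and painless.")
few_shot = build_few_shot_prompt(instruction, examples, "Setup was quick and painless.")
print(few_shot)
```

The resulting string would be sent to the model as its input. The only difference between the two settings is whether worked examples precede the query.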

### Applications
- **Conversational AI:** Chatbots, virtual assistants, and customer support.
- **Content Generation:** Articles, scripts, marketing copy, and code snippets.
- **Creative Writing:** Poems, stories, brainstorming, and ideation.
- **Education and Tutoring:** Automated explanations, question generation, and personalized learning.

---

## Transfer Learning in NLP with Transformer Models

### What is Transfer Learning?
Transfer learning involves:
- Pre-training a model on a large, general-purpose dataset (e.g., Wikipedia, BookCorpus).
- Fine-tuning the pre-trained model on a smaller, task-specific dataset.

### Advantages
- **Reduces the need for large amounts of labeled data** for each new task.
- **Speeds up training** and improves performance on specialized tasks.
- **Enables rapid adaptation** to new domains or languages with minimal data.

Transfer learning has become a cornerstone of modern NLP, enabling state-of-the-art results across a wide range of applications.


In [None]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification, Trainer, TrainingArguments
from datasets import load_dataset

In [None]:
# load dataset
dataset = load_dataset("ag_news")

# load RoBERTa tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForSequenceClassification.from_pretrained("roberta-base", num_labels=4)

# tokenize the dataset
def tokenize_function(examples):
    return tokenizer(examples["text"], truncation=True, padding="max_length", max_length=128)

tokenized_dataset = dataset.map(tokenize_function, batched=True)

# prepare dataset
tokenized_dataset = tokenized_dataset.remove_columns("text")
tokenized_dataset = tokenized_dataset.rename_column("label", "labels")
tokenized_dataset.set_format("torch")

train_dataset = tokenized_dataset["train"]
test_dataset = tokenized_dataset["test"]

# training arguments
training_args = TrainingArguments(
    output_dir="./results",
    eval_strategy="epoch",  # evaluate at the end of every epoch
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.01,
    save_steps=500,  # write a checkpoint every 500 steps
)

# trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
    processing_class=tokenizer,
)

# train model
trainer.train()

# evaluate the model
results = trainer.evaluate()

# print evaluation results
print("Evaluation results:", results)
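
Unless a `compute_metrics` function is supplied, `trainer.evaluate()` reports the loss and timing statistics but no accuracy. The sketch below shows one way such a function might look. It assumes the evaluation output can be unpacked into `(logits, labels)` arrays, and the toy arrays at the end are made-up values used only to exercise the function.

```python
import numpy as np


def compute_metrics(eval_pred):
    """Return classification accuracy from (logits, labels)."""
    logits, labels = eval_pred
    # The predicted class is the index of the largest logit in each row.
    predictions = np.argmax(logits, axis=-1)
    return {"accuracy": float((predictions == labels).mean())}


# Toy check with four examples and four classes (made-up values).
toy_logits = np.array([
    [2.0, 0.1, 0.1, 0.1],
    [0.1, 0.1, 3.0, 0.1],
    [0.1, 1.5, 0.1, 0.1],
    [0.1, 0.1, 0.1, 0.9],
])
toy_labels = np.array([0, 2, 1, 0])
toy_result = compute_metrics((toy_logits, toy_labels))
print(toy_result)
```

To use it, pass `compute_metrics=compute_metrics` when constructing the `Trainer`. The evaluation results should then include an accuracy entry alongside the loss.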

## Using GPT for Text Generation

The cell below generates text with `gpt-3.5-turbo`, a chat-tuned successor to GPT-3.

In [None]:
import os

from openai import OpenAI

try:
    # Read the API key from the environment rather than hard-coding it.
    # Assumes openai>=1.0; earlier releases used openai.ChatCompletion.create.
    client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY"))

    # generate text using GPT-3.5 Turbo
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Write a short story about a robot learning to cook."},
        ],
        max_tokens=150,
        temperature=0.7,
    )

    print("Generated Text:\n" + response.choices[0].message.content.strip())

except Exception as e:
    print("Error:", e)