# Session 21 🐍

☀️☀️☀️☀️☀️☀️☀️☀️☀️☀️☀️☀️☀️☀️☀️☀️☀️☀️☀️☀️☀️☀️☀️☀️☀️☀️☀️☀️☀️☀️☀️☀️☀️☀️☀️☀️☀️☀️☀️☀️☀️☀️☀️☀️☀️

***

# 166. Hugging Face Transformers
Hugging Face Transformers is a Python library that provides state-of-the-art machine learning models for Natural Language Processing (NLP) and beyond. It offers thousands of pre-trained models for tasks like text classification, question answering, text generation, and more.

***

# 167. Important Features
- Pre-trained models: Access to models like BERT, GPT-2, RoBERTa, T5, etc.
- Easy-to-use APIs: Simple interfaces for common NLP tasks
- Model sharing: Access to community-shared models via the Hugging Face Hub
- Framework interoperability: Works with PyTorch, TensorFlow, and JAX

***

# 168. Core Components

***

## 168-1. Pipeline
The simplest way to use pre-trained models is through the pipeline function:

In [None]:
from transformers import pipeline

# Sentiment analysis (no model is named, so the pipeline falls back to its default checkpoint)
classifier = pipeline("sentiment-analysis")
result = classifier("I love using Hugging Face Transformers!")
print(result)  # e.g. [{'label': 'POSITIVE', 'score': 0.9998}]

# Text generation
generator = pipeline("text-generation", model="gpt2")
result = generator("The future of AI is", max_length=50)
print(result[0]['generated_text'])

***

## 168-2. Auto Classes
For more control, use the Auto classes:

In [None]:
from transformers import AutoTokenizer, AutoModel

# Load tokenizer and model
model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

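# Process text
inputs = tokenizer("Hello world!", return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (batch_size, sequence_length, hidden_size); hidden_size is 768 for this model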

***

## 168-3. Tokenizers
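Tokenizers convert text into the numeric inputs a model expects: token ids (`input_ids`), segment ids (`token_type_ids`), and an `attention_mask` that separates real tokens from padding: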

In [None]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Tokenize text
encoded_input = tokenizer("Do not meddle in the affairs of wizards!", return_tensors="pt")
print(encoded_input)
# Prints a dict of tensors, each of shape (1, sequence_length):
#   'input_ids'      - token ids, starting with 101 ([CLS]) and ending with 102 ([SEP])
#   'token_type_ids' - all 0 for a single sentence
#   'attention_mask' - all 1 because nothing is padded
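
To make the text-to-ids step less opaque, here is a minimal pure-Python sketch of WordPiece-style tokenization, the sub-word scheme BERT tokenizers are based on. The vocabulary `TOY_VOCAB` and the helpers `wordpiece` and `encode` are hypothetical and written only for illustration; the real tokenizer has a much larger vocabulary and more careful normalization.

```python
# Toy vocabulary (hypothetical; BERT's real vocabulary has about 30,000 entries).
TOY_VOCAB = {
    "[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
    "do": 4, "not": 5, "med": 6, "##dle": 7, "in": 8, "!": 9,
}


def wordpiece(word, vocab):
    """Split one word into sub-words by greedy longest-match-first lookup."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return ["[UNK]"]  # no piece fits: the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces


def encode(text, vocab):
    """Lowercase, split on whitespace and '!', then map tokens to ids."""
    words = text.lower().replace("!", " ! ").split()
    tokens = ["[CLS]"]
    for word in words:
        tokens.extend(wordpiece(word, vocab))
    tokens.append("[SEP]")
    input_ids = [vocab[token] for token in tokens]
    return {
        "tokens": tokens,
        "input_ids": input_ids,
        "attention_mask": [1] * len(input_ids),  # 1 = real token, 0 = padding
    }


encoded = encode("Do not meddle in!", TOY_VOCAB)
print(encoded["tokens"])     # ['[CLS]', 'do', 'not', 'med', '##dle', 'in', '!', '[SEP]']
print(encoded["input_ids"])  # [2, 4, 5, 6, 7, 8, 9, 3]
```

A word that is missing from the vocabulary, such as "meddle" here, is split into known pieces, which is why a sentence usually yields more tokens than words.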

***

## 168-4. Models
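Task-specific `AutoModelFor...` classes add a head on top of the base architecture. When a base checkpoint such as `bert-base-uncased` is loaded this way, the head is randomly initialized and must be fine-tuned before its predictions are meaningful: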

In [None]:
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")

***

# 169. Common Tasks

***

## 169-1. Text Classification

In [None]:
from transformers import pipeline

classifier = pipeline("text-classification", model="distilbert-base-uncased-finetuned-sst-2-english")
result = classifier("This movie is great!")
print(result)  # [{'label': 'POSITIVE', 'score': 0.9998}]
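
The `label` and `score` fields come from the model's raw output scores (logits). As a rough sketch of that post-processing step, the snippet below applies a softmax and picks the most likely class. The logits and the helper names `softmax` and `logits_to_prediction` are made up for illustration and are not taken from the library.

```python
import math


def softmax(logits):
    """Convert raw scores to probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


def logits_to_prediction(logits, id2label):
    """Return the most likely label and its probability, pipeline-style."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"label": id2label[best], "score": probs[best]}


# Hypothetical logits for one sentence; index 0 = NEGATIVE, 1 = POSITIVE.
id2label = {0: "NEGATIVE", 1: "POSITIVE"}
prediction = logits_to_prediction([-4.2, 4.5], id2label)
print(prediction)
```

A large gap between the two logits is what produces scores very close to 1.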

***

## 169-2. Named Entity Recognition (NER)

In [None]:
ner = pipeline("ner", model="dbmdz/bert-large-cased-finetuned-conll03-english")
result = ner("Hugging Face is a company based in New York City.")
print(result)
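
By default the NER pipeline reports one entry per token, so "New York City" arrives as three separate items. Passing `aggregation_strategy="simple"` to `pipeline` asks it to merge them. The sketch below shows the general idea of such grouping on hand-written tags; the tags and the `group_entities` helper are hypothetical, and the real pipeline also handles sub-word pieces and scores.

```python
def group_entities(tagged_tokens):
    """Merge consecutive tokens that share an entity type into one span."""
    spans = []
    current = None
    for word, tag in tagged_tokens:
        if tag == "O":
            current = None  # outside any entity: close the open span
            continue
        prefix, entity_type = tag.split("-", 1)
        starts_new = (
            prefix == "B"
            or current is None
            or current["entity_group"] != entity_type
        )
        if starts_new:
            current = {"entity_group": entity_type, "word": word}
            spans.append(current)
        else:
            current["word"] += " " + word  # extend the open span
    return spans


# Hypothetical per-token tags for the sentence used above.
tagged = [
    ("Hugging", "B-ORG"), ("Face", "I-ORG"), ("is", "O"), ("a", "O"),
    ("company", "O"), ("based", "O"), ("in", "O"),
    ("New", "B-LOC"), ("York", "I-LOC"), ("City", "I-LOC"), (".", "O"),
]
print(group_entities(tagged))
# [{'entity_group': 'ORG', 'word': 'Hugging Face'}, {'entity_group': 'LOC', 'word': 'New York City'}]
```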

***

## 169-3. Question Answering

In [None]:
qa = pipeline("question-answering", model="distilbert-base-cased-distilled-squad")
result = qa(
    question="What is Hugging Face?",
    context="Hugging Face is a company that develops tools for NLP."
)
print(result)  # a dict with 'score', 'start', 'end', and 'answer' keys, e.g. 'answer': 'a company that develops tools for NLP'
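
Extractive question answering models do not generate text. They score every context token as a possible answer start and answer end, and the answer is the best-scoring span. The following is a simplified sketch of that selection step under assumed inputs: the tokens, the logits, and the `best_span` helper are invented for illustration, and the pipeline's actual scoring differs in detail.

```python
def best_span(start_logits, end_logits, max_answer_len=15):
    """Return (start, end, score) maximizing start_logit + end_logit."""
    best = None
    for start, start_score in enumerate(start_logits):
        # The end must not precede the start or exceed the length limit.
        last = min(start + max_answer_len, len(end_logits))
        for end in range(start, last):
            score = start_score + end_logits[end]
            if best is None or score > best[2]:
                best = (start, end, score)
    return best


# Hypothetical context tokens and logits, for illustration only.
tokens = ["Hugging", "Face", "is", "a", "company", "that", "develops", "tools"]
start_logits = [0.1, 0.0, 0.2, 3.1, 1.0, 0.1, 0.3, 0.2]
end_logits = [0.0, 0.2, 0.1, 0.3, 2.0, 0.1, 0.4, 2.9]

start, end, score = best_span(start_logits, end_logits)
print(" ".join(tokens[start:end + 1]))  # a company that develops tools
```

Because the answer is always a slice of the context, the model cannot answer questions whose answer is not literally present in the text.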

***

## 169-4. Text Generation

In [None]:
generator = pipeline("text-generation", model="gpt2")
result = generator("In a shocking finding, scientists discovered", max_length=50)
print(result[0]['generated_text'])
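
Generation proceeds one token at a time, and how the next token is chosen determines whether repeated calls give the same text. The sketch below contrasts greedy selection with top-k sampling on a made-up four-word vocabulary. The `sample_next_token` helper and the logits are hypothetical; in the library, comparable behavior is controlled through generation arguments such as `do_sample`, `top_k`, and `temperature`.

```python
import math
import random


def sample_next_token(logits, temperature=1.0, top_k=2, rng=random):
    """Sample a token index from the top_k highest logits."""
    ranked = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    candidates = ranked[:top_k]
    # Temperature below 1 sharpens the distribution; above 1 flattens it.
    weights = [math.exp(logits[i] / temperature) for i in candidates]
    return rng.choices(candidates, weights=weights, k=1)[0]


# Hypothetical next-token candidates and scores.
vocab = ["robots", "uncertain", "bright", "banana"]
logits = [2.0, 1.5, 2.2, -3.0]

# Greedy decoding always takes the single highest-scoring token.
greedy = vocab[max(range(len(logits)), key=logits.__getitem__)]

# Sampling can return different tokens on different draws.
rng = random.Random(0)
samples = [vocab[sample_next_token(logits, rng=rng)] for _ in range(5)]
print(greedy, samples)
```

With `top_k=2`, only "bright" and "robots" can ever be drawn, which keeps unlikely tokens such as "banana" out of the output.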

***

# 170. Fine-Tuning Models
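The example below fine-tunes `bert-base-uncased` for binary sentiment classification on a 1,000-example subset of the IMDB dataset using the `Trainer` API: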

In [None]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from transformers import TrainingArguments, Trainer
import torch
from datasets import load_dataset

# Load dataset
dataset = load_dataset("imdb")

# Load tokenizer and model
model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)

def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)

tokenized_datasets = dataset.map(tokenize_function, batched=True)

# Prepare model
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# Training arguments
training_args = TrainingArguments(
    output_dir="./results",
    evaluation_strategy="epoch",  # renamed to eval_strategy in newer Transformers releases
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=3,
    weight_decay=0.01,
)

# Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"].select(range(1000)),  # Small subset for demo
    eval_dataset=tokenized_datasets["test"].select(range(1000)),
)

# Train
trainer.train()
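
As written, evaluation reports only the loss. `Trainer` accepts a `compute_metrics` callable that receives the evaluation logits and labels and returns a dictionary of metrics. Here is a minimal sketch of such a function for accuracy, tested on made-up values; the helper `argmax` and the sample data are assumptions for illustration.

```python
def argmax(row):
    """Index of the largest value in a sequence of scores."""
    return max(range(len(row)), key=lambda i: row[i])


def compute_metrics(eval_pred):
    """Turn (logits, labels) into an accuracy dictionary."""
    logits, labels = eval_pred
    predictions = [argmax(row) for row in logits]
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return {"accuracy": correct / len(predictions)}


# Hypothetical logits for four reviews; columns are (negative, positive).
fake_logits = [[2.0, -1.0], [0.1, 0.9], [1.5, 1.4], [-0.3, 0.2]]
fake_labels = [0, 1, 1, 1]
print(compute_metrics((fake_logits, fake_labels)))  # {'accuracy': 0.75}
```

It would be passed as `Trainer(..., compute_metrics=compute_metrics)`, after which each evaluation also logs the accuracy.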

***

# 171. Saving and Loading Models

In [None]:
# Save
model.save_pretrained("./my_model")
tokenizer.save_pretrained("./my_model")

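# Load (use the same task-specific class that was saved; AutoModel would drop the classification head)
model = AutoModelForSequenceClassification.from_pretrained("./my_model")
tokenizer = AutoTokenizer.from_pretrained("./my_model")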

***

# 172. Using the Hugging Face Hub
You can share and access models through the Hub:

In [None]:
from huggingface_hub import notebook_login

# Login to upload models
notebook_login()

# Push to Hub
model.push_to_hub("my-awesome-model")
tokenizer.push_to_hub("my-awesome-model")

# Download from Hub (replace "username" with your Hub account name)
from transformers import AutoModel
model = AutoModel.from_pretrained("username/my-awesome-model")

***

# 173. Advanced Features

***

## 173-1. Custom Models

In [None]:
from transformers import BertConfig, BertModel

# Initialize a BERT model from a custom config (weights are random, not pre-trained)
config = BertConfig(
    hidden_size=768,
    num_attention_heads=12,
    num_hidden_layers=12,
)
model = BertModel(config)
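
The config values determine how large the model is. As a back-of-the-envelope check, the sketch below counts the parameters of a BERT encoder from its config. The function `bert_parameter_count` is hypothetical, and the values not set in the cell above (`vocab_size`, `intermediate_size`, `max_position_embeddings`, `type_vocab_size`) are assumed to be the `BertConfig` defaults.

```python
def bert_parameter_count(
    hidden_size=768,
    num_hidden_layers=12,
    intermediate_size=3072,
    vocab_size=30522,
    max_position_embeddings=512,
    type_vocab_size=2,
):
    """Count weights and biases of a BERT encoder with a pooler."""
    layer_norm = 2 * hidden_size  # scale and bias

    # Word, position, and segment embeddings, followed by a LayerNorm.
    embeddings = (
        (vocab_size + max_position_embeddings + type_vocab_size) * hidden_size
        + layer_norm
    )

    dense = hidden_size * hidden_size + hidden_size
    attention = 4 * dense + layer_norm  # query, key, value, output projections
    feed_forward = (
        hidden_size * intermediate_size + intermediate_size   # expand
        + intermediate_size * hidden_size + hidden_size       # project back
        + layer_norm
    )
    encoder = num_hidden_layers * (attention + feed_forward)

    pooler = dense
    return embeddings + encoder + pooler


print(f"{bert_parameter_count():,}")  # 109,482,240
```

Note that `num_attention_heads` does not appear: splitting the hidden size across more heads changes how attention is computed, not how many parameters there are.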

***

## 173-2. Mixed Precision Training

In [None]:
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./results",
    per_device_train_batch_size=4,
    fp16=True,  # Enable mixed precision (needs a supported accelerator such as a CUDA GPU)
    # ... other args
)
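
Half precision saves memory and time, but very small gradients can round to zero. Mixed precision training typically counters this by multiplying the loss by a scale factor before the backward pass and dividing the gradients afterwards. The sketch below illustrates the effect with the standard library alone; the gradient value and the scale factor are arbitrary assumptions, not values used by `Trainer`.

```python
import struct


def to_fp16(value):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("e", struct.pack("e", value))[0]


gradient = 1e-8        # a small gradient, representable in float32
loss_scale = 65536.0   # hypothetical scale factor (2**16)

unscaled = to_fp16(gradient)               # underflows to zero in fp16
scaled = to_fp16(gradient * loss_scale)    # survives the cast
recovered = scaled / loss_scale            # unscale in higher precision

print(unscaled, recovered)
```

The unscaled value is lost entirely, while the scaled one is recovered to within fp16 rounding error.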

***

## 173-3. Distributed Training

In [None]:
training_args = TrainingArguments(
    per_device_train_batch_size=4,
    gradient_accumulation_steps=2,
    num_train_epochs=3,
    evaluation_strategy="steps",  # renamed to eval_strategy in newer Transformers releases
    eval_steps=500,
    logging_steps=500,
    save_steps=1000,
    fp16=True,
    push_to_hub=False,
    logging_dir="./logs",
    output_dir="./results",
    report_to="tensorboard",
    dataloader_num_workers=4,
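    # For distributed training
    local_rank=-1,  # -1 means not distributed; launchers such as torchrun set this automatically
    deepspeed="./ds_config.json",  # Path to a DeepSpeed config file (requires the deepspeed package)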
)
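
The cell above points at a `ds_config.json` file that is not shown. Here is one possible minimal sketch of its contents, built as a Python dictionary and serialized to JSON. The choice of ZeRO stage 2 is an assumption made for illustration, and a real setup usually needs more settings. The `effective_batch_size` helper is hypothetical and only spells out how batch size, gradient accumulation, and device count combine.

```python
import json


def effective_batch_size(per_device, accumulation_steps, num_devices):
    """Examples that contribute to one optimizer step."""
    return per_device * accumulation_steps * num_devices


# "auto" asks the Trainer integration to copy the value from TrainingArguments.
ds_config = {
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
    "fp16": {"enabled": "auto"},
    "zero_optimization": {"stage": 2},  # assumed stage, chosen for illustration
}

# Save this text as ds_config.json next to the training script.
config_text = json.dumps(ds_config, indent=2)
print(config_text)

# With the arguments above on a hypothetical 2-GPU machine: 4 * 2 * 2
print(effective_batch_size(4, 2, 2))  # 16
```

Using `"auto"` keeps the DeepSpeed file and `TrainingArguments` from disagreeing about batch size or precision.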

***

# Some Exercises

**1.** Use the `pipeline` function to perform sentiment analysis on the following sentences:

"I love coding with Hugging Face!"

"This movie was terrible and boring."

"The weather is okay, I guess."

**Expected Output:**

A list of dictionaries containing `label` (POSITIVE/NEGATIVE) and `score` (confidence).

___

**2.** Use the `pipeline` for NER to extract entities from:

"Apple is looking to buy a U.K. startup for $1 billion. Elon Musk is the CEO of Tesla."

**Expected Output:**

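A list of entities with their labels. The label set depends on the model: CoNLL-03 models such as the one in 169-2 use `ORG`, `PER`, `LOC`, and `MISC`, so `$1 billion` is only tagged as `MONEY` if the chosen model was trained with that label.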



---

**3.** Use the `text-generation` pipeline with `gpt2` to complete the sentence:

"Artificial intelligence will change the future by"

**Parameters:**

`max_length=50`

`num_return_sequences=2` (generate 2 different completions).

**Expected Output:**

Two generated text continuations.

---

**4.** Load `bert-base-uncased` and manually:

- Tokenize the sentence: **"Hugging Face is revolutionizing NLP."**

- Pass the tokens to the model and extract the last hidden states.

**Expected Output:**

Tokenized output (`input_ids`, `attention_mask`).

Tensor of shape `(1, sequence_length, 768)` (BERT's hidden states).

***

**5.** Use `zero-shot-classification` pipeline to classify:

"The new Marvel movie was action-packed and thrilling."

**Candidate Labels:** `["entertainment", "politics", "sports", "technology"]`

**Expected Output:**

Scores for each label (highest for `"entertainment"`).

***

**6.** Fine-tune `distilbert-base-uncased` on the `emotion` dataset:

- Load the dataset: `load_dataset("emotion")` (if that id is not found, try the namespaced Hub id `dair-ai/emotion`).

- Tokenize the data.

- Train for **1 epoch** using `Trainer` (simplified setup).

**Expected Output:**

Training logs showing loss decreasing.

***

**7.** Download and use the `deepset/roberta-base-squad2` model for question answering.

Ask: **"What is the capital of France?"** with context:

"France is a country in Europe. Its capital is Paris."

**Expected Output:**

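A dictionary whose `answer` is `'Paris'`, together with a confidence `score` and the `start`/`end` character positions of the answer in the context.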

***

**8.** Modify the fine-tuning example to:

- Use `TensorFlow` instead of PyTorch.

- Manually implement a training loop (without `Trainer`).

- Train on a small subset of `imdb` dataset.

**Expected Output:**

Validation accuracy improving over epochs.

***

# 🌞 https://github.com/AI-Planet 🌞