
## TP1 : Transformer-based Text Classification with Hugging Face


### Objective:
Understand and practice the steps required to build a transformer-based text classification model using Hugging Face Transformers.

---

###  1. Understanding Transformer Models

**Q1.** Explain briefly what Transformer models are and why they have become popular in NLP tasks.

**Q2.** What advantages do libraries like Hugging Face Transformers offer to developers?



**A1.**
**Answer:**
Transformer models are neural network architectures that use self-attention mechanisms to process sequential data. They've become popular in NLP because:
- **Parallelization**: Process all words simultaneously (unlike RNNs)
- **Long-range dependencies**: Capture relationships between distant words effectively
- **Attention mechanism**: Focus on relevant parts of the input regardless of position
- **Performance**: Achieve state-of-the-art results across NLP tasks
- **Scalability**: Can be scaled to billions of parameters and massive datasets
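
To make the self-attention idea concrete, here is a minimal pure-Python sketch of scaled dot-product attention. The toy vectors and the function names (`softmax`, `self_attention`) are illustrative assumptions; a real Transformer adds learned projections, multiple heads, and positional information.

```python
import math

def softmax(scores):
    """Turn a list of raw scores into weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(queries, keys, values):
    """Scaled dot-product attention over a short sequence of vectors."""
    dim = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Score this position against every position, near or far.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim) for k in keys]
        weights = softmax(scores)
        # The output is a weighted average of the value vectors.
        mixed = [sum(w * v[j] for w, v in zip(weights, values))
                 for j in range(len(values[0]))]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights

# Three toy 2-dimensional token vectors (made-up numbers).
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(tokens, tokens, tokens)
```

Each row of `weights` sums to 1 and says how much one token attends to every other token. No row depends on the result of another row, which is why the computation can be parallelized.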

**A2.**
**Answer:**
Hugging Face Transformers provides:
- **Pre-trained models**: Access to models trained on massive datasets
- **Unified API**: Consistent interface for different architectures (BERT, GPT, T5)
- **Easy fine-tuning**: Simple adaptation to specific tasks
- **Optimized tokenizers**: Matched to each pre-trained model
- **Framework compatibility**: Works with PyTorch and TensorFlow
- **Task versatility**: Support for classification, generation, QA, and more
- **Community support**: Extensive documentation and active development



###  2. Environment Setup

**Practical Task:**
- Install necessary libraries:

In [1]:
!pip install transformers datasets

Collecting datasets
  Downloading datasets-3.4.1-py3-none-any.whl.metadata (19 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.5.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting multiprocess<0.70.17 (from datasets)
  Downloading multiprocess-0.70.16-py311-none-any.whl.metadata (7.2 kB)
Collecting fsspec<=2024.12.0,>=2023.1.0 (from fsspec[http]<=2024.12.0,>=2023.1.0->datasets)
  Downloading fsspec-2024.12.0-py3-none-any.whl.metadata (11 kB)
Downloading datasets-3.4.1-py3-none-any.whl (487 kB)
Downloading dill-0.3.8-py3-none-any.whl (116 kB)
Downloading fsspec-2024.12.0-py3-none-any.w

###  3. Loading and Exploring Data

**Practical Task:**
- Load 1% of the IMDB dataset for training and testing:


In [2]:
from datasets import load_dataset
training_data = load_dataset('imdb', split='train[:1%]')
test_data = load_dataset('imdb', split='test[:1%]')

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



**Q3.** Display the first 3 entries from the training dataset. Discuss briefly what you observe.


**A3.**
**Answer:**
The first 3 entries show:
- Each entry contains a `text` field with the movie review content and a `label` field (0 or 1)
- Reviews vary in length and writing style
- Labels represent sentiment (0 = negative, 1 = positive)
- Text includes natural language with punctuation and formatting
- The dataset is structured for binary sentiment classification
- Reviews contain subjective opinions about movies, suitable for sentiment analysis
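
Beyond reading a few entries, it can help to count the labels in the slice. The sketch below uses a made-up, label-sorted list as a stand-in for the dataset's `label` column. This is an assumption for illustration: whether the real split is stored sorted by label should be checked on the actual data.

```python
import random
from collections import Counter

def label_counts(labels):
    """Count how many examples carry each label."""
    return dict(Counter(labels))

# Stand-in for a 25,000-row label column stored sorted by class
# (assumption for illustration): negatives first, then positives.
sorted_labels = [0] * 12500 + [1] * 12500

head_slice = sorted_labels[:250]      # mimics taking the first 1%

rng = random.Random(42)               # fixed seed so the sketch is repeatable
shuffled = sorted_labels[:]
rng.shuffle(shuffled)
shuffled_slice = shuffled[:250]       # mimics shuffling before selecting 250 rows

print(label_counts(head_slice))       # a single class
print(label_counts(shuffled_slice))   # both classes
```

With the real data, `label_counts(training_data['label'])` would give the actual distribution. If only one class shows up, shuffling before selecting (for example with the `shuffle(seed=...)` and `select(range(...))` methods of a `datasets.Dataset`) is one way to obtain a mixed sample.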



In [3]:
training_data.select(range(3))

Dataset({
    features: ['text', 'label'],
    num_rows: 3
})

In [4]:
training_data[0]

{'text': 'I rented I AM CURIOUS-YELLOW from my video store because of all the controversy that surrounded it when it was first released in 1967. I also heard that at first it was seized by U.S. customs if it ever tried to enter this country, therefore being a fan of films considered "controversial" I really had to see this for myself.<br /><br />The plot is centered around a young Swedish drama student named Lena who wants to learn everything she can about life. In particular she wants to focus her attentions to making some sort of documentary on what the average Swede thought about certain political issues such as the Vietnam War and race issues in the United States. In between asking politicians and ordinary denizens of Stockholm about their opinions on politics, she has sex with her drama teacher, classmates, and married men.<br /><br />What kills me about I AM CURIOUS-YELLOW is that 40 years ago, this was considered pornographic. Really, the sex and nudity scenes are few and far be

###  4. Data Tokenization

**Practical Task:**
- Tokenize your loaded dataset using DistilBERT tokenizer:

In [5]:
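from transformers import AutoTokenizer

# Load the tokenizer that matches the DistilBERT checkpoint.
tokenizer = AutoTokenizer.from_pretrained('distilbert-base-uncased')

def tokenize_function(example):
    # Pad every review to the model's maximum length and truncate longer ones.
    return tokenizer(example['text'], padding='max_length', truncation=True)

# batched=True hands the tokenizer many reviews per call, which is faster.
tokenized_training_data = training_data.map(tokenize_function, batched=True)
tokenized_test_data = test_data.map(tokenize_function, batched=True)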


**Q4.** What does tokenization achieve? Why is padding used?


**A4.**
**Answer:**
**Tokenization achieves:**
- Conversion of text to numerical representations processable by neural networks
- Segmentation of text into meaningful units (words, subwords, characters)
- Creation of a consistent vocabulary mapping
- Handling of out-of-vocabulary words through subword tokenization

**Padding is used to:**
- Create uniform-length sequences for batch processing
- Enable efficient parallel computation
- Accommodate variable-length inputs in fixed-size tensors
- Ensure compatibility with model architecture requirements
- Maximize computational efficiency during training and inference
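
The points above can be illustrated with a toy whitespace tokenizer. This is a simplified sketch: the DistilBERT tokenizer splits text into subwords and adds special tokens, and the names `build_vocab` and `encode` are hypothetical helpers, not part of any library.

```python
PAD_ID, UNK_ID = 0, 1

def build_vocab(texts):
    """Assign an integer id to every distinct lowercase word."""
    vocab = {"[PAD]": PAD_ID, "[UNK]": UNK_ID}
    for text in texts:
        for word in text.lower().split():
            vocab.setdefault(word, len(vocab))
    return vocab

def encode(text, vocab, max_length):
    """Map words to ids, truncate to max_length, then pad to max_length."""
    ids = [vocab.get(word, UNK_ID) for word in text.lower().split()][:max_length]
    padding = max_length - len(ids)
    return {
        "input_ids": ids + [PAD_ID] * padding,
        # 1 marks a real token, 0 marks padding the model should ignore.
        "attention_mask": [1] * len(ids) + [0] * padding,
    }

vocab = build_vocab(["a great movie", "not a great ending at all"])
texts = ["a great movie", "not a great ending at all", "terrible movie"]
batch = [encode(text, vocab, max_length=5) for text in texts]

for row in batch:
    print(row)
```

Every row ends up with exactly five ids, so the batch fits a rectangular tensor. The attention mask records which positions are padding, and the unseen word "terrible" falls back to the unknown-token id, which is the case subword tokenization is designed to reduce.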

###  5. Model Initialization

**Practical Task:**
- Initialize your DistilBERT model for binary classification:

In [6]:
from transformers import DistilBertForSequenceClassification

model = DistilBertForSequenceClassification.from_pretrained(
    'distilbert-base-uncased', num_labels=2
)

model.safetensors:   0%|          | 0.00/268M [00:00<?, ?B/s]

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


**Q5.** Explain why we specify `num_labels=2`. What would you do differently for multi-class classification with 5 labels?


**A5.**
**Answer:**
We specify `num_labels=2` because this is a binary classification task (positive/negative sentiment). This configures the final layer to output 2 logits.

For multi-class classification with 5 labels:
- Change to `num_labels=5`
- Ensure dataset labels are integer-encoded from 0 to 4
- Switch to multi-class evaluation metrics (macro F1, confusion matrix)
- Keep the activation and loss as they are: the classification head already emits one logit per class, and with integer class labels the model's built-in loss is softmax cross-entropy in the binary case as well, so there is no sigmoid to replace
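
As a rough illustration of how the number of output logits relates to the loss, the sketch below computes softmax cross-entropy by hand for a 2-class and a 5-class example. The logit values are made up, and the helper names are illustrative rather than library functions.

```python
import math

def softmax(logits):
    """Convert raw logits into class probabilities."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, label):
    """Negative log-probability assigned to the correct class."""
    return -math.log(softmax(logits)[label])

binary_logits = [2.0, -1.0]                      # num_labels=2
five_class_logits = [0.1, 2.5, -0.3, 0.0, 1.2]   # num_labels=5

binary_loss = cross_entropy(binary_logits, 0)
five_class_loss = cross_entropy(five_class_logits, 1)

# An uninformed 5-class model (all logits equal) scores log(5).
uniform_loss = cross_entropy([0.0] * 5, 3)

print(binary_loss, five_class_loss, uniform_loss)
```

The same two functions handle both cases; only the length of the logit vector changes, which is what `num_labels` controls.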

###  6. Model Training

**Practical Task:**
- Set up training arguments and train your model:

In [7]:
from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir='./finetuned_model',
    num_train_epochs=1,
    per_device_train_batch_size=8,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_training_data,
)

trainer.train()

[34m[1mwandb[0m: Using wandb-core as the SDK backend.  Please refer to https://wandb.me/wandb-core for more information.




TrainOutput(global_step=32, training_loss=0.09153131395578384, metrics={'train_runtime': 298.2221, 'train_samples_per_second': 0.838, 'train_steps_per_second': 0.107, 'total_flos': 33116849664000.0, 'train_loss': 0.09153131395578384, 'epoch': 1.0})

**Q6.** How could changing the number of epochs or batch size affect your training?


**A6.**
**Answer:**

**Epochs impact:**
- **More epochs**: Better convergence but risk of overfitting
- **Fewer epochs**: Faster training but potential underfitting

**Batch size effects:**
- **Larger batch size**:
  - More stable gradient updates
  - Better hardware utilization
  - Requires more memory
  - May converge to less optimal solutions
  
- **Smaller batch size**:
  - Less memory required
  - Potentially better generalization
  - More frequent updates
  - Slower training
  - Noisier gradient estimates

Optimal values depend on dataset size, model complexity, and available computational resources.
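
A quick way to see how these two settings interact is to count optimizer steps. The sketch assumes a single device and no gradient accumulation, and uses 250 training examples (1% of the 25,000-review training split).

```python
import math

def optimizer_steps(num_examples, batch_size, epochs):
    """Number of weight updates: batches per epoch times epochs."""
    steps_per_epoch = math.ceil(num_examples / batch_size)  # last batch may be partial
    return steps_per_epoch * epochs

print(optimizer_steps(250, 8, 1))    # the configuration used above
print(optimizer_steps(250, 16, 1))   # doubling the batch size halves the updates
print(optimizer_steps(250, 8, 3))    # three epochs triple them
```

The first call gives 32, which matches `global_step=32` in the training output above.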





###  7. Evaluation and Reflection

**Practical Task:**
- Evaluate your model:

In [8]:
results = trainer.evaluate(tokenized_test_data)
print(results)

{'eval_loss': 0.0058595542795956135, 'eval_runtime': 3.7009, 'eval_samples_per_second': 67.551, 'eval_steps_per_second': 8.647, 'epoch': 1.0}


**Q7.** Interpret the evaluation results. If your evaluation score is low, explain what factors might be contributing to this result.


**A7.**

The evaluation reports only a loss (`eval_loss` of about 0.0059), because no metric function was passed to the `Trainer`. A loss this close to zero after a single epoch on 250 examples is surprising, and it should be read with caution rather than as proof of excellent performance. Several factors might explain it:

- **Single-class slices (the most likely cause)**: `train[:1%]` and `test[:1%]` take the first 250 rows of each split. If the splits are stored sorted by label, as appears to be the case for IMDB, both slices contain only negative reviews, so a model that always predicts "negative" reaches a near-zero loss. Counting the labels in each slice confirms or rules this out
- **Transfer learning advantage**: DistilBERT is pre-trained on large text corpora, so it already has strong language understanding capabilities
- **Task simplicity**: Sentiment analysis on IMDB reviews is a relatively straightforward binary classification task with clear sentiment signals
- **Effective distillation**: DistilBERT, despite being smaller than BERT, retains most of its performance
- **Domain alignment**: Pre-training data likely included text similar to movie reviews

If the score were low instead, likely causes would include too little training data, too few epochs, a poorly chosen learning rate, truncation of long reviews, or a mismatch between the training and test distributions.

The very low loss shows that the model is confident and agrees with the labels in this slice. It does not by itself show that the model can tell positive reviews from negative ones; that requires a test set containing both classes and a metric such as accuracy or F1.
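
Since the evaluation output contains only a loss, one option is to compute accuracy next to a majority-class baseline. The sketch below is pure Python with made-up logits; the function follows the shape of a `compute_metrics` callback (a pair of logits and labels in, a dict of floats out), and its compatibility with the arrays that `Trainer` passes in is an assumption to verify.

```python
from collections import Counter

def argmax(row):
    """Index of the largest logit in one row."""
    return max(range(len(row)), key=lambda i: row[i])

def compute_metrics(eval_pred):
    """Accuracy plus the score of always guessing the most common label."""
    logits, labels = eval_pred
    labels = [int(y) for y in labels]
    predictions = [argmax(row) for row in logits]
    accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)
    majority_baseline = Counter(labels).most_common(1)[0][1] / len(labels)
    return {"accuracy": accuracy, "majority_baseline": majority_baseline}

# Made-up logits for four reviews with mixed labels.
toy_logits = [[2.0, -1.0], [0.5, 0.1], [-0.2, 1.3], [1.1, 0.9]]
mixed = compute_metrics((toy_logits, [0, 0, 1, 1]))

# Same idea on a single-class test set: the baseline is already perfect.
single_class = compute_metrics(([[3.0, -3.0]] * 4, [0, 0, 0, 0]))

print(mixed)
print(single_class)
```

When `majority_baseline` equals 1.0, the test set holds a single class and a perfect score says little about the model. Passing the function as `compute_metrics=compute_metrics` when constructing the `Trainer` would add these values to the `evaluate()` output.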


**Q8.** In real-world applications, what might you do differently to achieve better performance?

**A8.**

To improve real-world performance:
- **Use more data**: Train on the full dataset or larger portion
- **Extended training**: Increase epochs with early stopping
- **Hyperparameter optimization**: Grid/random search for optimal settings
- **Advanced preprocessing**: Clean, normalize, and augment text data
- **Model selection**: Try larger or domain-specific pre-trained models
- **Learning rate scheduling**: Implement warmup and decay strategies
- **Regularization techniques**: Apply dropout, weight decay, etc.
- **Ensemble methods**: Combine predictions from multiple models
- **Cross-validation**: Ensure robust evaluation across data splits
- **Error analysis**: Identify and address patterns in misclassified examples
- **Domain adaptation**: Further pre-train on in-domain data
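
As one example from this list, a warmup-then-linear-decay learning rate schedule can be sketched in a few lines. The peak learning rate (5e-5), the 4 warmup steps, and the function name are assumed values for illustration; the 32 total steps match the training run above.

```python
def linear_warmup_decay(step, total_steps, warmup_steps, peak_lr):
    """Ramp linearly from 0 to peak_lr, then decay linearly back to 0."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    remaining = total_steps - step
    return peak_lr * max(0.0, remaining / max(1, total_steps - warmup_steps))

TOTAL_STEPS = 32     # one epoch over 250 examples with batch size 8
WARMUP_STEPS = 4     # assumed value for illustration
PEAK_LR = 5e-5       # assumed value for illustration

schedule = [linear_warmup_decay(s, TOTAL_STEPS, WARMUP_STEPS, PEAK_LR)
            for s in range(TOTAL_STEPS + 1)]

for step in (0, 2, 4, 18, 32):
    print(step, schedule[step])
```

The rate starts at zero, peaks once warmup ends, and returns to zero at the final step. Starting small keeps the first updates from disturbing the pre-trained weights too much.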


###  8. Fine-tuning with a Different Encoder-Type Model

**Q9.** Reproduce the fine-tuning code with another encoder-type model.

(Hint: only parts 4 and 5 need to change.)

In [9]:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')

def tokenize_function(example):
    return tokenizer(example['text'], padding='max_length', truncation=True)

tokenized_training_data = training_data.map(tokenize_function, batched=True)
tokenized_test_data = test_data.map(tokenize_function, batched=True)

In [10]:
from transformers import BertForSequenceClassification

model = BertForSequenceClassification.from_pretrained(
    'bert-base-uncased', num_labels=2
)

model.safetensors:   0%|          | 0.00/440M [00:00<?, ?B/s]

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
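
To get a feel for what switching encoders costs, the download sizes reported when loading the two checkpoints (about 268M for DistilBERT and 440M for BERT) can be turned into rough parameter counts. The sketch assumes the weights are stored as 32-bit floats and ignores file metadata, so the results are estimates only.

```python
BYTES_PER_FLOAT32 = 4  # assumption: weights are stored as 32-bit floats

def approx_params_millions(checkpoint_bytes):
    """Rough parameter count, in millions, from a checkpoint's size in bytes."""
    return checkpoint_bytes / BYTES_PER_FLOAT32 / 1e6

distilbert_millions = approx_params_millions(268e6)
bert_millions = approx_params_millions(440e6)

print(distilbert_millions, bert_millions)
print(bert_millions / distilbert_millions)  # how much larger BERT is
```

Under these assumptions BERT comes out roughly 1.6 times larger than DistilBERT, so training and evaluation should be expected to take longer and use more memory with the same batch size.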
