In [None]:
# 📌 Notes & Clarifications

1️⃣ **Model Warnings**
- Some weights of `DistilBertForSequenceClassification` are **newly initialized** because the `distilbert-base-uncased` checkpoint does not include the classification head (the `pre_classifier` and `classifier` layers) that we add for the IMDb dataset.
- This is normal and expected. The model must be trained on the downstream task before it can make reliable predictions.

2️⃣ **Hugging Face Hub Token**
- A warning may appear if no `HF_TOKEN` is set in the environment.
- Public models and datasets work without authentication, so this is **non-blocking**.
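
To confirm which mode the session is in, a minimal sketch like the following can be run before loading anything. The helper name `hub_auth_status` is hypothetical and not part of any library; it only inspects the `HF_TOKEN` environment variable mentioned above.

```python
import os


def hub_auth_status(env=os.environ):
    """Report whether a Hugging Face token is visible in the given environment mapping."""
    # An unset or empty HF_TOKEN means requests to the Hub are anonymous,
    # which is sufficient for public models and datasets.
    return "authenticated" if env.get("HF_TOKEN") else "anonymous"


print("Hub access mode:", hub_auth_status())
```

An "anonymous" result is fine for this notebook, since both the model and the dataset are public.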

3️⃣ **CPU/GPU & Triton**
- On CPU-only environments, you may see warnings about **Triton** or pinned memory.
- These relate to GPU-only optimizations that are skipped on CPU; they do **not affect correctness**, although training runs more slowly.
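
To check which device the session will use, a small sketch such as this one may help. The helper `pick_device` is hypothetical; it assumes PyTorch as the backend and falls back to "cpu" if `torch` cannot be imported.

```python
def pick_device():
    """Return "cuda" when PyTorch can see a GPU, otherwise "cpu"."""
    try:
        import torch
    except ImportError:
        # Without PyTorch installed there is nothing to query; assume CPU.
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


print("Running on:", pick_device())
```

`Trainer` selects the device on its own, so this check is purely informational.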

4️⃣ **Trainer FutureWarnings**
- The `tokenizer` argument of `Trainer` is deprecated in recent `transformers` releases in favor of `processing_class` and will be removed in a future version.
- The notebook still runs as written; the warning only matters for **future-proofing**.
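
One way to stay compatible with both older and newer releases is to pick the keyword at runtime. The sketch below is an assumption about how this could be done, not the notebook's method: `tokenizer_kwarg` is a hypothetical helper, and the two stand-in classes exist only to make the example runnable without `transformers`.

```python
import inspect


def tokenizer_kwarg(trainer_cls, tokenizer):
    """Return the keyword argument that trainer_cls expects for the tokenizer."""
    params = inspect.signature(trainer_cls.__init__).parameters
    # Prefer the newer name when the installed class accepts it.
    name = "processing_class" if "processing_class" in params else "tokenizer"
    return {name: tokenizer}


# Stand-ins for an older and a newer Trainer signature (illustrative only).
class OldStyleTrainer:
    def __init__(self, model=None, tokenizer=None):
        pass


class NewStyleTrainer:
    def __init__(self, model=None, processing_class=None):
        pass


print(tokenizer_kwarg(OldStyleTrainer, "tok"))
print(tokenizer_kwarg(NewStyleTrainer, "tok"))

# With the real class, the call would look like:
# Trainer(model=model, args=training_args, **tokenizer_kwarg(Trainer, tokenizer))
```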

5️⃣ **Evaluation Results**
- The accuracy shown comes from a **small subset used for demo purposes** (200 training and 200 test examples).
- Training on the full dataset for multiple epochs should improve performance.
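
To get a feel for how noisy an accuracy measured on 200 examples is, the following sketch computes a normal-approximation 95% margin of error. The accuracy value 0.85 is purely illustrative and is not a result from this notebook; `accuracy_margin` is a hypothetical helper.

```python
import math


def accuracy_margin(accuracy, n_examples, z=1.96):
    """Half-width of a normal-approximation interval for a measured accuracy."""
    standard_error = math.sqrt(accuracy * (1.0 - accuracy) / n_examples)
    return z * standard_error


# 0.85 is an illustrative accuracy, not a measured one.
# 200 is the demo subset size; 25,000 is the size of the full IMDb test split.
for n in (200, 25_000):
    print(f"n={n}: +/- {accuracy_margin(0.85, n):.3f}")
```

With these assumed numbers the margin is roughly five percentage points at 200 examples, so small differences between runs on the demo subset should not be over-interpreted.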

6️⃣ **W&B (Weights & Biases)**
- You can choose whether to track metrics online or offline.
- In this notebook, metrics are tracked **locally in offline mode**, which is enough for demonstration.
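
The code cells do not show how offline tracking is configured. One common way, assuming the `wandb` package is installed, is to set the `WANDB_MODE` environment variable before the `Trainer` is created:

```python
import os

# Keep W&B runs on local disk instead of syncing them to the W&B servers.
# This must run before the Trainer is created.
os.environ["WANDB_MODE"] = "offline"

# Alternative (assumption: no experiment tracking is wanted at all):
# pass report_to="none" to TrainingArguments instead.
print("WANDB_MODE =", os.environ["WANDB_MODE"])
```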

7️⃣ **Takeaways**
- This notebook demonstrates the **full workflow**: loading a dataset, tokenizing, setting up a DistilBERT model, training, and evaluating.
- The warnings are mostly informational and do **not indicate errors**.
- It is suitable for **portfolio purposes**: it shows working knowledge of Transformers, tokenization, training, evaluation, and handling real NLP data.


In [None]:
!pip install --upgrade transformers datasets evaluate -q


In [None]:
import transformers
print(transformers.__version__)


In [None]:
# 1️⃣ Install / upgrade packages if needed (done in the pip cell above)

# 2️⃣ Imports
from datasets import load_dataset
from transformers import AutoTokenizer, DistilBertForSequenceClassification, Trainer, TrainingArguments
import evaluate

# 3️⃣ Load dataset (IMDb sentiment)
dataset = load_dataset("imdb")

# 4️⃣ Initialize tokenizer and model
model_name = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = DistilBertForSequenceClassification.from_pretrained(
    model_name,
    num_labels=2
)

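# 5️⃣ Tokenize dataset
def tokenize(batch):
    # Truncate to the model's maximum length. Padding is left to the data
    # collator that Trainer builds from the tokenizer, so each training batch
    # is padded only to its own longest example.
    return tokenizer(batch["text"], truncation=True)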

tokenized_dataset = dataset.map(tokenize, batched=True)

# 6️⃣ Set training arguments (compatible with older transformers)
training_args = TrainingArguments(
    output_dir="./results",
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=1,
    learning_rate=2e-5,
    weight_decay=0.01,
    logging_dir="./logs",
    logging_steps=50,
    do_train=True,
    do_eval=True,
)

# 7️⃣ Prepare metric
accuracy = evaluate.load("accuracy")
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = logits.argmax(axis=-1)
    return accuracy.compute(predictions=predictions, references=labels)

# 8️⃣ Initialize Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"].shuffle(seed=42).select(range(200)),  # small subset for demo
    eval_dataset=tokenized_dataset["test"].shuffle(seed=42).select(range(200)),    # small subset for demo
    tokenizer=tokenizer,  # deprecated in newer transformers (use processing_class); kept for older versions
    compute_metrics=compute_metrics
)

# 9️⃣ Train & evaluate
trainer.train()
results = trainer.evaluate()
print("Evaluation results:", results)
