In [None]:
# DS776 Auto-Update (runs in ~2 seconds, only updates when needed)
# If this cell fails, see Lessons/Course_Tools/AUTO_UPDATE_SYSTEM.md for help
%run ../../Lessons/Course_Tools/auto_update_introdl.py

In [None]:
# === YOUR IMPORTS HERE ===
# Add any additional imports you need below this line

from introdl import (
    config_paths_keys,
    get_device,
    wrap_print_text
)

# Wrap print to format text nicely at 120 characters
print = wrap_print_text(print, width=120)

device = get_device()

# Configure paths
paths = config_paths_keys()
DATA_PATH = paths['DATA_PATH']
MODELS_PATH = paths['MODELS_PATH']
# === END YOUR IMPORTS ===

# Homework 8: Sarcasm Detection

Using the ["Sarcasm_News_Headline" dataset](https://huggingface.co/datasets/raquiba/Sarcasm_News_Headline) on HuggingFace, you will try several approaches to sarcasm detection (a text classification task) and write a summary at the end.

**Total Points: 50**
- Reading Questions: 8 points
- Download and split dataset: 2 points
- Approach 1 (TF-IDF + ML Model): 7 points
- Approach 2 (Pretrained Model): 6 points
- Approach 3 (Fine-tune DistilBERT): 7 points
- Approach 4 Part 1 (LLM Zero-Shot): 7 points
- Approach 4 Part 2 (LLM Few-Shot): 7 points
- Summarize and Compare: 4 points
- Reflection: 2 points

## Reading Questions (8 points)

Answer the following questions based on Chapter 3: Text Classification from *Natural Language Processing with Transformers*.

**Question 1 (2 points):** What are the three main advantages of DistilBERT over the original BERT model? How does DistilBERT achieve these improvements while maintaining most of BERT's performance?

📝 **YOUR ANSWER HERE:**

**Question 2 (2 points):** Compare the three main tokenization strategies discussed in the chapter: character tokenization, word tokenization, and subword tokenization (WordPiece). What are the key advantages and disadvantages of each approach?

📝 **YOUR ANSWER HERE:**

**Question 3 (2 points):** Explain the difference between the feature extraction and fine-tuning approaches for transfer learning. When would you choose feature extraction over fine-tuning, and what are the trade-offs?

📝 **YOUR ANSWER HERE:**

**Question 4 (2 points):** The chapter demonstrates using loss-based sorting as an error analysis technique to identify mislabeled examples and difficult cases. Explain how this technique works and why sorting by prediction loss is effective for finding problematic examples in your training set.

📝 **YOUR ANSWER HERE:**

## Download and split the dataset (2 points)

While the dataset has "train" and "test" splits, ignore the "test" split since it almost entirely duplicates the "train" split.

Instead, use `train_test_split` with a seed of 42 (`random_state=42`) to generate an 80/20 split of the original "train" split into training and test data.
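
As a rough illustration of the split call, here is a minimal sketch on made-up data. The `headlines` and `labels` lists are hypothetical stand-ins for the real columns, and the 0/1 label convention is an assumption for the example.

```python
from sklearn.model_selection import train_test_split

# Toy stand-ins for the real headlines and labels (hypothetical data).
headlines = [f"headline {i}" for i in range(10)]
labels = [i % 2 for i in range(10)]  # assumed: 0 = not sarcastic, 1 = sarcastic

# 80/20 split; random_state=42 makes the shuffle reproducible.
train_texts, test_texts, train_labels, test_labels = train_test_split(
    headlines, labels, test_size=0.2, random_state=42
)

print(len(train_texts), len(test_texts))
```

Passing the texts and labels together keeps each headline paired with its label after shuffling.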

In [None]:
# Load the Sarcasm News Headline dataset
from datasets import load_dataset
from sklearn.model_selection import train_test_split

# Your code here:
# 1. Load the dataset using load_dataset("raquiba/Sarcasm_News_Headline")
# 2. Get the 'train' split (ignore 'test' as it duplicates 'train')
# 3. Extract headlines and labels into lists or arrays
# 4. Use train_test_split with test_size=0.2 and random_state=42
# 5. Print the sizes of train and test sets


## Apply Approach 1 - TF-IDF Vectors + ML Model (7 points)

Write code to create TF-IDF vectors that represent each headline. Use these vectors to train a classification model (it doesn't have to be logistic regression). Then make predictions on the test set and generate a classification report.
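
The overall flow might look like the sketch below, shown on a handful of invented headlines rather than the real dataset. The toy data and the `max_features=5000` setting are assumptions for illustration, not recommended values.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

# Hypothetical toy data; replace with the real train/test headlines.
train_texts = [
    "area man wins argument with himself",
    "nation shocked that monday follows sunday again",
    "senate passes annual budget bill",
    "city council approves new bus routes",
]
train_labels = [1, 1, 0, 0]
test_texts = ["area man shocked by monday", "council passes budget"]
test_labels = [1, 0]

# Fit the vocabulary and IDF weights on training text only,
# then reuse the fitted vectorizer on the test text.
vectorizer = TfidfVectorizer(max_features=5000)
X_train = vectorizer.fit_transform(train_texts)
X_test = vectorizer.transform(test_texts)

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, train_labels)
preds = clf.predict(X_test)

print(X_train.shape, X_test.shape)
print(classification_report(test_labels, preds, zero_division=0))
```

Calling `fit_transform` on the training text and only `transform` on the test text keeps test-set vocabulary statistics out of the model.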

In [None]:
# Step 1: Create TF-IDF vectors
from sklearn.feature_extraction.text import TfidfVectorizer

# Your code here:
# 1. Create a TfidfVectorizer (consider max_features parameter)
# 2. Fit on training headlines and transform both train and test
# 3. Print the shape of the resulting vectors


In [None]:
# Step 2: Train a classification model
from sklearn.linear_model import LogisticRegression
# Or try: from sklearn.ensemble import RandomForestClassifier
# Or try: from sklearn.naive_bayes import MultinomialNB

# Your code here:
# 1. Create your classifier (e.g., LogisticRegression)
# 2. Train on the TF-IDF vectors and labels
# 3. Make predictions on the test set


In [None]:
# Step 3: Evaluate with classification report
from sklearn.metrics import classification_report

# Your code here:
# 1. Generate classification report comparing predictions to test labels
# 2. Print the report


## Apply Approach 2 - Use a Pretrained Model from HuggingFace (6 points)

Before fine-tuning your own model, search HuggingFace Hub for existing models trained on sarcasm detection.

**Task:**
1. Find a pretrained model for sarcasm detection on HuggingFace (search for "sarcasm")
2. Use the `pipeline()` API to load the model
3. Make predictions on your test set
4. Generate a classification report

**Hints:**
- Search HuggingFace for "sarcasm detection" or "sarcasm classification"
- Look for models with high downloads or recent updates
- Check the model card to verify it matches your task (news headlines vs tweets)
- One model to explore: `helinivan/english-sarcasm-detector`

**Example code structure:**
```python
from transformers import pipeline

# Load pretrained model
classifier = pipeline("text-classification", model="model-name-here")

# Make predictions
# predictions = ...

# Generate classification report
```

**What to report:**
- Which model did you choose and why?
- What accuracy does it achieve on your test set?
- How does it compare to your TF-IDF approach?
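
A text-classification pipeline returns one dictionary per input with `label` and `score` keys, so the predicted labels usually need converting before they can be compared with 0/1 test labels. The sketch below assumes hypothetical label strings (`LABEL_0`, `LABEL_1`); the real names depend on the chosen model, so check its model card first.

```python
# Shape of a text-classification pipeline result: one dict per input,
# with 'label' and 'score' keys. The label strings below are hypothetical.
raw_outputs = [
    {"label": "LABEL_1", "score": 0.97},
    {"label": "LABEL_0", "score": 0.88},
    {"label": "LABEL_1", "score": 0.61},
]

label_to_int = {"LABEL_0": 0, "LABEL_1": 1}  # assumed mapping


def to_binary(outputs, mapping):
    """Convert pipeline dicts to 0/1 labels, failing loudly on unknown labels."""
    preds = []
    for out in outputs:
        if out["label"] not in mapping:
            raise ValueError(f"Unexpected label: {out['label']!r}")
        preds.append(mapping[out["label"]])
    return preds


predictions = to_binary(raw_outputs, label_to_int)
print(predictions)
```

Raising an error on an unknown label is a deliberate choice here: a silent default would hide a wrong mapping and distort the classification report.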

In [None]:
# Step 1: Search and load a pretrained sarcasm detection model
from transformers import pipeline

# Your code here:
# 1. Search HuggingFace for "sarcasm detection" models
# 2. Choose a model (e.g., "helinivan/english-sarcasm-detector")
# 3. Load it using: classifier = pipeline("text-classification", model="...")
# 4. Test on 1-2 examples to verify it works

# Example:
# classifier = pipeline("text-classification", model="your-model-name")
# test = classifier(["This is a test headline."])
# print(test)


In [None]:
# Step 2: Make predictions on test set

# Your code here:
# 1. Use the pipeline to predict on all test headlines
# 2. Extract the predicted labels (may need to map model's labels to 0/1)
# 3. Handle batch processing if needed (pipeline can take lists)

# Hint: The model might return labels like "SARCASM" and "NOT_SARCASM"
# You may need to map these to 0 and 1 to match your dataset


In [None]:
# Step 3: Generate classification report
from sklearn.metrics import classification_report

# Your code here:
# 1. Generate classification report
# 2. Print the report
# 3. Note the model name and accuracy for comparison later


## Apply Approach 3 - Fine-tune DistilBERT with Classification Head (7 points)

Include code for setup, training, and the classification report.
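
The classification head outputs one logit per class, and the predicted class is the index of the largest logit. The sketch below shows that step as a metrics function of the kind that can be passed to `Trainer` through its `compute_metrics` argument. The toy logits and labels are invented for illustration.

```python
import numpy as np


def compute_metrics(eval_pred):
    """Turn a (logits, labels) pair into an accuracy value."""
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=1)  # index of the largest logit per row
    return {"accuracy": float((preds == labels).mean())}


# Hypothetical logits for four headlines and two classes.
toy_logits = np.array([[2.0, -1.0], [0.1, 0.3], [-0.5, 1.5], [1.2, 0.4]])
toy_labels = np.array([0, 1, 0, 0])

metrics = compute_metrics((toy_logits, toy_labels))
print(metrics)
```

The same `np.argmax(..., axis=1)` step applies to the `predictions` array returned by `trainer.predict()`.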

In [None]:
# Step 1: Prepare dataset for HuggingFace
from datasets import Dataset
import pandas as pd

# Your code here:
# 1. Create DataFrames or dictionaries with 'text' and 'label' columns
# 2. Convert to HuggingFace Dataset format using Dataset.from_pandas() or Dataset.from_dict()
# 3. Create a DatasetDict with 'train' and 'test' splits

# Example structure:
# train_df = pd.DataFrame({'text': train_texts, 'label': train_labels})
# test_df = pd.DataFrame({'text': test_texts, 'label': test_labels})
# train_dataset = Dataset.from_pandas(train_df)
# test_dataset = Dataset.from_pandas(test_df)


In [None]:
# Step 2: Load tokenizer and tokenize dataset
from transformers import AutoTokenizer

# Your code here:
# 1. Load DistilBERT tokenizer: AutoTokenizer.from_pretrained("distilbert-base-uncased")
# 2. Create a tokenize function that tokenizes the 'text' field
# 3. Use dataset.map() to tokenize both train and test datasets
# 4. Set truncation=True and padding=True

# Example:
# def tokenize_function(examples):
#     return tokenizer(examples['text'], truncation=True, padding=True, max_length=512)
# 
# tokenized_train = train_dataset.map(tokenize_function, batched=True)
# tokenized_test = test_dataset.map(tokenize_function, batched=True)


In [None]:
# Step 3: Load model and set up training
from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer

# Your code here:
# 1. Load DistilBERT model: AutoModelForSequenceClassification.from_pretrained(
#       "distilbert-base-uncased", num_labels=2)
# 2. Create TrainingArguments with:
#    - output_dir for saving checkpoints
#    - num_train_epochs (2-3 is typical)
#    - per_device_train_batch_size (8 or 16)
#    - evaluation_strategy="epoch" (named eval_strategy in recent transformers releases)
# 3. Create Trainer with model, args, train_dataset, eval_dataset


In [None]:
# Step 4: Train the model

# Your code here:
# 1. Call trainer.train()
# 2. This will take several minutes


In [None]:
# Step 5: Make predictions and evaluate
import numpy as np
from sklearn.metrics import classification_report

# Your code here:
# 1. Use trainer.predict() on the test dataset
# 2. Extract predictions: np.argmax(predictions.predictions, axis=1)
# 3. Generate classification report
# 4. Print the report


## Apply Approach 4 - Part 1: Use an LLM Model and Zero-Shot Prompt (7 points)

Using the `llm_classifier` helper function from the lesson, apply your LLM classifier to the first 100 examples in the test set. Use both a local model and an API-based model for comparison. For the API-based model, some possibilities include:
* Groq: "llama3-70b-8192" (rate_limit = 30 requests per minute on free tier)
* Together.AI: "meta-llama/Llama-3.3-70B-Instruct-Turbo-Free" (rate_limit = 10)
* Gemini: "gemini-flash-lite" (rate_limit = 30 on free tier)

Feel free to try others if you have access.

Produce classification reports for both the local model and the best API-based model.
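
LLM replies are free text, so they need to be normalized before a classification report can be produced. The sketch below is one possible parser, not part of the lesson helpers. `parse_label` is a hypothetical name, and returning `None` for unparseable replies is an assumption made so that they can be counted separately.

```python
def parse_label(response):
    """Map a free-text LLM reply to 1 (sarcastic), 0 (not sarcastic), or None."""
    cleaned = response.strip().lower().strip(".'\"")
    # Check the negative label first: 'sarcastic' is a substring of 'not sarcastic'.
    if "not sarcastic" in cleaned or "non-sarcastic" in cleaned:
        return 0
    if "sarcastic" in cleaned:
        return 1
    return None  # unparseable reply; count these rather than silently guessing


# Hypothetical replies showing the kinds of variation a model may produce.
replies = ["sarcastic", " Not Sarcastic.", "Label: sarcastic", "I cannot tell"]
parsed = [parse_label(r) for r in replies]
print(parsed)
```

How to treat the `None` cases (drop them, or assign a default class) affects the reported accuracy, so it is worth stating which choice was made.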

In [None]:
# Import the llm_classifier helper
from Lesson_08_Helpers import llm_classifier

In [None]:
# Step 1: Prepare test subset (first 100 examples)

# Your code here:
# 1. Select first 100 test headlines
# 2. Get corresponding labels
# 3. Print a few examples to verify


In [None]:
# Step 2: Create zero-shot prompts for sarcasm detection

# Your code here:
# 1. Create a system_prompt (e.g., "You are an expert at detecting sarcasm in news headlines.")
# 2. Create a prompt_template with:
#    - Clear instructions to classify as 'sarcastic' or 'not sarcastic'
#    - A placeholder {text} for the headline
#    - Request to output ONLY the label

# Example structure:
# system_prompt = "..."
# prompt_template = """
# Classify the following news headline as either 'sarcastic' or 'not sarcastic'.
# Output ONLY one of: sarcastic, not sarcastic
# 
# Headline: {text}
# """


In [None]:
# Step 3: Use API-based model (e.g., gemini-flash-lite)

# Your code here:
# 1. Call llm_classifier with:
#    - model_name='gemini-flash-lite' (or another API model)
#    - texts (first 100 headlines)
#    - system_prompt
#    - prompt_template
#    - estimate_cost=True to see API costs
# 2. Store predictions

# Example:
# predictions_api = llm_classifier(
#     'gemini-flash-lite',
#     test_texts_100,
#     system_prompt,
#     prompt_template,
#     estimate_cost=True
# )


In [None]:
# Step 4: Convert text predictions to 0/1 labels

# Your code here:
# 1. Map text predictions to 0/1
#    'not sarcastic' → 0
#    'sarcastic' → 1
# 2. Handle any unexpected responses (strip whitespace, lowercase, etc.)

# Example:
# predictions_api_binary = []
# for pred in predictions_api:
#     pred_clean = pred.strip().lower()
#     if 'not sarcastic' in pred_clean:
#         predictions_api_binary.append(0)
#     elif 'sarcastic' in pred_clean:
#         predictions_api_binary.append(1)
#     else:
#         predictions_api_binary.append(0)  # default


In [None]:
# Step 5: Generate classification report for API model
from sklearn.metrics import classification_report

# Your code here:
# 1. Generate classification report
# 2. Print the report

print("API Model Results (gemini-flash-lite):")
# print(classification_report(...))


In [None]:
# Step 6: Try a local model (optional but recommended)

# Your code here:
# 1. Use llm_classifier with a local model
#    Options: 'llama-3.2', or any locally hosted model
# 2. Same system_prompt and prompt_template
# 3. Convert predictions to 0/1
# 4. Generate classification report

# Note: Local models may be slower but are free


## Apply Approach 4 - Part 2: Use an LLM Model and Few-Shot Prompt (7 points)

Build a few-shot prompt with three to five examples of each class and apply the same models used for the zero-shot prompt. Produce classification reports.
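
One way to assemble such a prompt programmatically is sketched below. The example headlines are invented, `build_fewshot_template` is a hypothetical helper, and the sketch assumes the `{text}` placeholder is filled in with Python's `str.format`, which may not match how `llm_classifier` substitutes it.

```python
# Hypothetical labelled examples; in practice draw these from the training
# split so that no test headline leaks into the prompt.
examples = [
    ("area man heroically finishes entire sandwich", "sarcastic"),
    ("city council approves new bus routes", "not sarcastic"),
    ("nation's dogs announce plan to bark at nothing", "sarcastic"),
    ("senate passes annual budget bill", "not sarcastic"),
    ("study finds meetings could have been emails", "sarcastic"),
    ("regional airport adds two new routes", "not sarcastic"),
]


def build_fewshot_template(pairs):
    """Return a prompt template string with a single {text} placeholder."""
    lines = ["Classify each news headline as 'sarcastic' or 'not sarcastic'.", ""]
    for headline, label in pairs:
        # Escape braces so str.format only fills the final {text} slot.
        safe = headline.replace("{", "{{").replace("}", "}}")
        lines += [f"Headline: {safe}", f"Classification: {label}", ""]
    lines += ["Headline: {text}", "Classification:"]
    return "\n".join(lines)


template = build_fewshot_template(examples)
filled = template.format(text="scientists confirm water is wet")
print(filled)
```

Alternating the two classes in the examples, as done here, avoids presenting the model with a long run of one label immediately before the headline to classify.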

In [None]:
# Step 1: Create few-shot examples

# Your code here:
# 1. Select 3-5 examples of sarcastic headlines
# 2. Select 3-5 examples of non-sarcastic headlines
# 3. Format them as examples in your prompt

# Example structure:
# few_shot_examples = """
# Examples:
# 
# Headline: "Local man thinks he's qualified to run country after playing SimCity once"
# Classification: sarcastic
# 
# Headline: "Scientists discover new species of frog in Amazon rainforest"
# Classification: not sarcastic
# 
# [Add more examples...]
# """


In [None]:
# Step 2: Create few-shot prompt template

# Your code here:
# 1. Update prompt_template to include the few-shot examples
# 2. Keep the same classification instructions
# 3. Add the {text} placeholder for new headlines

# Example:
# prompt_template_fewshot = few_shot_examples + """
# Now classify this headline:
# 
# Headline: {text}
# Classification:
# """


In [None]:
# Step 3: Apply few-shot prompting with API model

# Your code here:
# 1. Use llm_classifier with the same API model
# 2. Use the new few-shot prompt template
# 3. Same 100 test headlines
# 4. Convert predictions to 0/1
# 5. Generate classification report

# predictions_fewshot_api = llm_classifier(
#     'gemini-flash-lite',
#     test_texts_100,
#     system_prompt,
#     prompt_template_fewshot,
#     estimate_cost=True
# )


In [None]:
# Step 4: Apply few-shot prompting with second model

# Your code here:
# 1. Try the same few-shot approach with your second model
# 2. Generate classification report
# 3. Compare zero-shot vs few-shot performance


## Summarize and Compare All Approaches (4 points)

Create a summary table comparing all four approaches on the following dimensions:

| Approach | Model/Method | Accuracy | Training Time | Inference Speed | Key Advantages | Key Disadvantages |
|----------|--------------|----------|---------------|-----------------|----------------|-------------------|
| 1. TF-IDF + ML | ... | ...% | ... | ... | ... | ... |
| 2. Pretrained | ... | ...% | ... | ... | ... | ... |
| 3. Fine-tuned DistilBERT | ... | ...% | ... | ... | ... | ... |
| 4a. LLM Zero-shot | ... | ...% | ... | ... | ... | ... |
| 4b. LLM Few-shot | ... | ...% | ... | ... | ... | ... |

**Discussion Questions:**

1. Which approach performed best? Why do you think this is?
2. Compare Approach 2 (pretrained) vs Approach 3 (fine-tuned). Which was easier? Which performed better? When would you choose one over the other?
3. How did few-shot prompting (4b) compare to zero-shot (4a)? Was the improvement worth the extra effort?
4. If you were deploying a sarcasm detection system in production, which approach would you choose and why? Consider accuracy, speed, cost, and maintainability.
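
To fill in the accuracy and speed columns with measured values rather than guesses, a small timing wrapper can help. The sketch below uses only the standard library. `timed`, `accuracy`, and `fake_predict` are hypothetical names, and `fake_predict` merely stands in for a real prediction call.

```python
import time


def timed(fn, *args, **kwargs):
    """Run fn and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def fake_predict(texts):
    """Hypothetical stand-in for a real model's predict call."""
    return [len(t) % 2 for t in texts]


texts = ["aa", "bbb", "cccc", "d"]
y_true = [0, 1, 1, 1]
y_pred, seconds = timed(fake_predict, texts)
ms_per_headline = seconds / len(texts) * 1000
print(f"accuracy={accuracy(y_true, y_pred):.2f}, {ms_per_headline:.3f} ms per headline")
```

Reporting time per headline makes approaches evaluated on different numbers of examples (the full test set versus the 100-example subset) easier to compare.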

📝 **YOUR ANSWERS HERE:**

In [None]:
# Create comparison table
import pandas as pd

# Your code here:
# 1. Create a DataFrame with columns:
#    - Approach (name)
#    - Model/Method
#    - Accuracy
#    - Training Time (approximate)
#    - Inference Speed (Fast/Medium/Slow)
# 2. Fill in your results from each approach
# 3. Display the table

# Example structure:
# results = {
#     'Approach': ['1. TF-IDF + ML', '2. Pretrained', '3. Fine-tuned DistilBERT', 
#                  '4a. LLM Zero-shot', '4b. LLM Few-shot'],
#     'Model/Method': ['Logistic Regression', '...', 'distilbert-base-uncased', 
#                      'gemini-flash-lite', 'gemini-flash-lite'],
#     'Accuracy': [0.XX, 0.XX, 0.XX, 0.XX, 0.XX],
#     'Training Time': ['< 1 min', '0 (pretrained)', '5-10 min', '0', '0'],
#     'Inference Speed': ['Very Fast', 'Fast', 'Fast', 'Slow (API)', 'Slow (API)']
# }
# 
# df_results = pd.DataFrame(results)
# display(df_results)


## Reflection (2 points)

1. What, if anything, did you find difficult to understand in this lesson? Why?

📝 **YOUR ANSWER HERE:**

2. What resources did you find supported your learning most and least for this lesson? (Be honest - I use your input to shape the course.)

📝 **YOUR ANSWER HERE:**

### Export Notebook to HTML for Canvas Upload

Uncomment the two lines below and run the cell to export the current notebook to HTML.

In [None]:
# from introdl import export_this_to_html
# export_this_to_html()