# SST-2 Sentiment Classification with Multiple Prompting Techniques

This notebook implements sentiment classification on the SST-2 dataset using several prompting strategies with Groq-hosted LLMs.

The following prompting strategies are implemented:

1. **Zero-Shot Prompting**
2. **Task Explanation Prompting** (using one demonstration example per category)
3. **In-Context Prompting** (using two similar examples from the train set)
4. **Zero-Shot Chain-of-Thought Prompting** (forcing the model to "think step-by-step")
5. **Few-Shot Chain-of-Thought Prompting** (using similar examples with short explanations)

For each model (listed below), predictions are collected on 20 randomly sampled validation sentences and standard classification metrics (accuracy, precision, recall, F1) are computed.

**Models to Evaluate:**

- `llama-3.1-8b-instant`
- `Gemma2-9b-it`
- `deepseek-r1-distill-llama-70b`
- `qwen-qwq-32b`

Make sure to install the required libraries (`requests`, `datasets`, `scikit-learn`) before running the notebook.

**Note:** Adjust the API endpoint and payload as needed according to Groq's API documentation. Here, simulation functions are provided for demonstration.

In [1]:
import requests
import json
import time

from datasets import load_dataset
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, classification_report

# Groq API key (provided in the assignment)
GROQ_API_KEY = "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"

# List of models to evaluate
models = [
    "llama-3.1-8b-instant",
    "Gemma2-9b-it",
    "deepseek-r1-distill-llama-70b",
    "qwen-qwq-32b"
]

# Flag to simulate responses (set to False when using actual API calls)
SIMULATE = True

  from .autonotebook import tqdm as notebook_tqdm


In [2]:
# Load the SST-2 dataset
dataset = load_dataset("glue", "sst2")

# Use the validation set (20 samples for evaluation)
validation_data = dataset['validation']
sampled_data = validation_data.shuffle(seed=42).select(range(20))

# For demonstration/demos, load a few examples from the train set
train_data = dataset['train']

# Function to get one demonstration example per category from train set
def get_task_explanation_demos(train_dataset):
    pos_demo = None
    neg_demo = None
    for example in train_dataset:
        if example['label'] == 1 and pos_demo is None:
            pos_demo = example['sentence']
        elif example['label'] == 0 and neg_demo is None:
            neg_demo = example['sentence']
        if pos_demo and neg_demo:
            break
    return pos_demo, neg_demo

# For in-context and few-shot strategies, simulate similarity selection by choosing two random examples
def get_in_context_demos(train_dataset, n=2):
    demos = train_dataset.shuffle(seed=99).select(range(n))
    # Return a list of tuples: (sentence, label) where label is 'Positive' or 'Negative'
    demo_list = []
    for ex in demos:
        label = "Positive" if ex['label'] == 1 else "Negative"
        demo_list.append((ex['sentence'], label))
    return demo_list

# For few-shot chain-of-thought, we create a simulated explanation for each demo
def generate_explanation(sentence, label):
    if label == "Positive":
        return "The sentence contains positive adjectives and conveys happiness."
    else:
        return "The sentence includes negative words and expresses discontent."

# Retrieve demonstration examples
task_demo_pos, task_demo_neg = get_task_explanation_demos(train_data)
in_context_demos = get_in_context_demos(train_data, n=2)

In [3]:
def generate_prompt(strategy, sentence):
    """
    Generate the prompt message based on the chosen prompting strategy.
    strategy: one of ['zero_shot', 'task_explanation', 'in_context', 'zero_shot_cot', 'few_shot_cot']
    sentence: the sentence to classify
    Returns: prompt string
    """
    if strategy == 'zero_shot':
        prompt = (
            "Given a phrase as input, please classify whether the input sentence has positive or negative sentiment. "
            "Output only Positive or Negative.\n"
            f"Sentence: {sentence}\nSentiment:"
        )
    elif strategy == 'task_explanation':
        # Use one demo from each category using .format for clarity
        prompt = (
            "Given a phrase as input, please classify whether the input sentence has positive or negative sentiment.\n"
            "Please consider the examples below to understand the sentiment category. Output only Positive or Negative.\n"
            "Sentence: {pos_demo}\nSentiment: Positive\n"
            "Sentence: {neg_demo}\nSentiment: Negative\n\n"
            "Sentence: {sentence}\nSentiment:"
        ).format(pos_demo=task_demo_pos.replace('"', '\"'), neg_demo=task_demo_neg.replace('"', '\"'), sentence=sentence)
    elif strategy == 'in_context':
        # Use two similar examples from the train set (simulated here)
        demo_text = ""
        for i, (demo_sentence, demo_label) in enumerate(in_context_demos, start=1):
            demo_text += f"Sentence-{i}: {demo_sentence}\nSentiment-{i}: {demo_label}\n"
        prompt = (
            "Given a phrase as input, please classify whether the input sentence has positive or negative sentiment.\n"
            "Please consider the examples below to guide your decision. Output only Positive or Negative.\n"
            f"{demo_text}\n"
            f"Sentence: {sentence}\nSentiment:"
        )
    elif strategy == 'zero_shot_cot':
        # Zero-shot chain-of-thought prompt
        prompt = (
            "Given a phrase as input, please classify whether the input sentence has positive or negative sentiment.\n"
            "Please think step-by-step while processing the sentiment. Output only Positive or Negative.\n\n"
            f"Sentence: {sentence}\nSentiment:"
        )
    elif strategy == 'few_shot_cot':
        # Use two similar examples with chain-of-thought explanations
        demo_text = ""
        for i, (demo_sentence, demo_label) in enumerate(in_context_demos, start=1):
            explanation = generate_explanation(demo_sentence, demo_label)
            demo_text += (
                f"Sentence-{i}: {demo_sentence}\n"
                f"Explanation: {explanation}\n"
                f"Sentiment-{i}: {demo_label}\n"
            )
        prompt = (
            "Given a phrase as input, please classify whether the input sentence has positive or negative sentiment.\n"
            "Please consider the examples below to guide your decision and think step-by-step while processing your decision. Output only Positive or Negative.\n\n"
            f"{demo_text}\n"
            f"Sentence: {sentence}\nSentiment:"
        )
    else:
        prompt = f"Sentence: {sentence}\nSentiment:"
    return prompt

def build_messages(prompt):
    """
    Build the messages payload for the API call
    """
    messages = [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": prompt}
    ]
    return messages

def call_groq_model(model, messages, api_key):
    """
    Calls a Groq-hosted LLM with the provided messages.
    Replace the URL and payload as needed per the Groq API documentation.
    """
    url = f"https://api.groq.com/v1/models/{model}/completions"
    headers = {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json"
    }
    data = {
        "messages": messages,
        "max_tokens": 10
    }
    try:
        response = requests.post(url, headers=headers, json=data)
        if response.status_code == 200:
            result = response.json()
            output = result['choices'][0]['message']['content'].strip()
            return output
        else:
            print(f"Error from {model}:", response.text)
            return None
    except Exception as e:
        print(f"Exception calling {model}:", e)
        return None

# Simulation function for demonstration purposes
def simulate_response(sentence):
    lower = sentence.lower()
    if any(word in lower for word in ['good', 'great', 'excellent', 'amazing', 'love', 'wonderful']):
        return "Positive"
    elif any(word in lower for word in ['bad', 'terrible', 'awful', 'worst', 'hate', 'poor']):
        return "Negative"
    else:
        return "Negative"

In [4]:
# Define the prompting strategies to evaluate
prompting_strategies = ['zero_shot', 'task_explanation', 'in_context', 'zero_shot_cot', 'few_shot_cot']

# Initialize a dictionary to store results
results = {}

for strategy in prompting_strategies:
    print(f"\nEvaluating prompting strategy: {strategy}")
    results[strategy] = {}
    
    for model in models:
        print(f"  Model: {model}")
        predictions = []
        for example in sampled_data:
            sentence = example['sentence']
            
            prompt = generate_prompt(strategy, sentence)
            messages = build_messages(prompt)
            
            if SIMULATE:
                pred = simulate_response(sentence)
            else:
                pred = call_groq_model(model, messages, GROQ_API_KEY)
            
            # Ensure output is valid
            if pred is None or pred not in ["Positive", "Negative"]:
                pred = simulate_response(sentence)
            predictions.append(pred)
            
            time.sleep(0.5)  # respect rate limits
        
        results[strategy][model] = {
            "predictions": predictions,
            "true_labels": ["Positive" if label == 1 else "Negative" for label in sampled_data["label"]]
        }
        print(f"    Predictions: {predictions}")

    print("--------------------------------------------------")


Evaluating prompting strategy: zero_shot
  Model: llama-3.1-8b-instant
    Predictions: ['Negative', 'Negative', 'Negative', 'Positive', 'Negative', 'Negative', 'Negative', 'Negative', 'Negative', 'Negative', 'Negative', 'Negative', 'Negative', 'Negative', 'Positive', 'Negative', 'Negative', 'Negative', 'Negative', 'Negative']
  Model: Gemma2-9b-it
    Predictions: ['Negative', 'Negative', 'Negative', 'Positive', 'Negative', 'Negative', 'Negative', 'Negative', 'Negative', 'Negative', 'Negative', 'Negative', 'Negative', 'Negative', 'Positive', 'Negative', 'Negative', 'Negative', 'Negative', 'Negative']
  Model: deepseek-r1-distill-llama-70b
    Predictions: ['Negative', 'Negative', 'Negative', 'Positive', 'Negative', 'Negative', 'Negative', 'Negative', 'Negative', 'Negative', 'Negative', 'Negative', 'Negative', 'Negative', 'Positive', 'Negative', 'Negative', 'Negative', 'Negative', 'Negative']
  Model: qwen-qwq-32b
    Predictions: ['Negative', 'Negative', 'Negative', 'Positive', 'Nega

In [5]:
# Compute and display classification metrics for each prompting strategy and each model
for strategy in prompting_strategies:
    print(f"\nResults for prompting strategy: {strategy}")
    for model in models:
        true_labels = results[strategy][model]['true_labels']
        preds = results[strategy][model]['predictions']
        
        true_binary = [1 if label == "Positive" else 0 for label in true_labels]
        pred_binary = [1 if label == "Positive" else 0 for label in preds]
        
        accuracy = accuracy_score(true_binary, pred_binary)
        precision = precision_score(true_binary, pred_binary, zero_division=0)
        recall = recall_score(true_binary, pred_binary, zero_division=0)
        f1 = f1_score(true_binary, pred_binary, zero_division=0)
        
        print(f"Model: {model}")
        print(f"Accuracy: {accuracy:.4f}")
        print(f"Precision: {precision:.4f}")
        print(f"Recall: {recall:.4f}")
        print(f"F1 Score: {f1:.4f}")
        print("Classification Report:")
        print(classification_report(true_binary, pred_binary, target_names=["Negative", "Positive"]))
        print("------------------------------")
    print("==================================================")


Results for prompting strategy: zero_shot
Model: llama-3.1-8b-instant
Accuracy: 0.4500
Precision: 1.0000
Recall: 0.1538
F1 Score: 0.2667
Classification Report:
              precision    recall  f1-score   support

    Negative       0.39      1.00      0.56         7
    Positive       1.00      0.15      0.27        13

    accuracy                           0.45        20
   macro avg       0.69      0.58      0.41        20
weighted avg       0.79      0.45      0.37        20

------------------------------
Model: Gemma2-9b-it
Accuracy: 0.4500
Precision: 1.0000
Recall: 0.1538
F1 Score: 0.2667
Classification Report:
              precision    recall  f1-score   support

    Negative       0.39      1.00      0.56         7
    Positive       1.00      0.15      0.27        13

    accuracy                           0.45        20
   macro avg       0.69      0.58      0.41        20
weighted avg       0.79      0.45      0.37        20

------------------------------
Model: deepse

## Conclusion

This notebook demonstrated the implementation of multiple prompting techniques for SST-2 sentiment classification using Groq-hosted LLMs.

For each of the five prompting strategies (Zero-Shot, Task Explanation, In-Context, Zero-Shot Chain-of-Thought, and Few-Shot Chain-of-Thought), the notebook collects predictions on 20 validation samples and computes standard classification metrics.

Replace the simulation functions with actual API calls and adjust parameters as necessary for your production setup.