# Classify Policy Stance with Embeddings

In this notebook, we demonstrate how to classify text (e.g., survey responses or policy statements) using sentence embeddings. We start with a zero-shot setup based on cosine similarity to a few labeled anchor sentences, then move on to few-shot classification and a shallow supervised model. Finally, we outline how parameter-efficient fine-tuning (PEFT) can further improve accuracy on domain-specific tasks.


In [None]:
!pip install -q sentence-transformers scikit-learn numpy

In [None]:
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('all-MiniLM-L6-v2')

## Zero-Shot Classification with Anchors

In [None]:
anchor_examples = {
    "Supportive": [
        "The policy is necessary to reduce emissions",
        "I believe climate change is a priority"
    ],
    "Neutral": [
        "I don’t have strong views on the matter",
        "This needs more information"
    ],
    "Opposed": [
        "I think it's a waste of taxpayer money",
        "We should not enforce such rules"
    ]
}

In [None]:
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

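def get_embedding(text):
    # Encode a single string and return its 1-D embedding vector
    return model.encode([text])[0]

def classify_text(text, anchors):
    """Return the label with the highest mean anchor similarity, plus all scores."""
    text_vec = get_embedding(text)
    scores = {}
    for label, examples in anchors.items():
        # Encode this label's anchors in one batch and compare each one to the input
        sims = cosine_similarity([text_vec], model.encode(examples))[0]
        # Cast to a plain float so the printed scores are easy to read
        scores[label] = float(np.mean(sims))
    return max(scores, key=scores.get), scores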

In [None]:
unlabeled = [
    "This policy will help the environment",
    "Why are we spending on this?",
    "I haven’t read enough to say"
]

for text in unlabeled:
    label, scores = classify_text(text, anchor_examples)
    print(f"Text: '{text}'\n → Predicted: {label}\n → Scores: {scores}\n")
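
The loop above re-encodes every anchor sentence for each input text. A minimal sketch of one way to avoid that is to encode the anchors once and reuse them. The helper names `embed_anchors` and `classify_with_cache` are hypothetical, and `encode` is assumed to be any callable that maps a list of strings to a 2-D array of non-zero vectors (in this notebook, `model.encode`).

```python
import numpy as np

def embed_anchors(anchors, encode):
    """Encode every anchor once; returns {label: array of unit-length vectors}."""
    cached = {}
    for label, examples in anchors.items():
        vecs = np.asarray(encode(examples), dtype=float)
        # Normalize rows so that a dot product equals cosine similarity
        cached[label] = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)
    return cached

def classify_with_cache(text, cached, encode):
    """Score `text` against the cached anchors by mean cosine similarity."""
    vec = np.asarray(encode([text]), dtype=float)[0]
    vec = vec / np.linalg.norm(vec)
    scores = {label: float(np.mean(vecs @ vec)) for label, vecs in cached.items()}
    return max(scores, key=scores.get), scores
```

In the notebook this would be called as `cached = embed_anchors(anchor_examples, model.encode)` followed by `classify_with_cache(text, cached, model.encode)` for each text. It should give the same scores as `classify_text`, up to floating-point error.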

## Few-Shot Expansion

Add more anchor examples per label to improve semantic coverage. With more examples, each label's average score depends less on the wording of any single anchor, which usually makes predictions more stable.
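
One way to use a larger anchor set is to average the anchor embeddings themselves into a single prototype vector per label and compare each text to those prototypes. This is a variant of the approach above, not what `classify_text` does (it averages similarities, not vectors). The sketch below uses toy 2-D vectors in place of real embeddings, and the function names are hypothetical.

```python
import numpy as np

def label_centroids(anchor_vecs):
    """Map {label: array of shape (n_examples, dim)} to {label: unit-length mean vector}."""
    centroids = {}
    for label, vecs in anchor_vecs.items():
        mean = np.asarray(vecs, dtype=float).mean(axis=0)
        centroids[label] = mean / np.linalg.norm(mean)
    return centroids

def nearest_centroid(vec, centroids):
    """Return the label whose centroid has the highest cosine similarity to `vec`."""
    vec = np.asarray(vec, dtype=float)
    vec = vec / np.linalg.norm(vec)
    scores = {label: float(c @ vec) for label, c in centroids.items()}
    return max(scores, key=scores.get), scores

# Toy 2-D vectors standing in for real sentence embeddings
toy_anchor_vecs = {
    "Supportive": np.array([[1.0, 0.1], [0.9, 0.3]]),
    "Opposed": np.array([[-1.0, 0.2], [-0.8, -0.1]]),
}
centroids = label_centroids(toy_anchor_vecs)
label, scores = nearest_centroid([0.7, 0.2], centroids)
print(label, scores)
```

With real data, `anchor_vecs` would be built by calling `model.encode(examples)` for each label's list of anchors.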

## Embedding + Logistic Regression Classifier

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Training data
all_texts = [
    "The policy will reduce CO2", "We must act on climate", "Neutral position", 
    "Not enough info", "This policy wastes money", "Completely disagree with this"
]
all_labels = ["Supportive", "Supportive", "Neutral", "Neutral", "Opposed", "Opposed"]

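# Get sentence embeddings as a 2-D array (one row per text)
X = model.encode(all_texts)
y = all_labels

# A stratified split needs at least one test example per class; with 6 examples
# and 3 classes, test_size=0.5 yields 3 training and 3 test examples.
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.5, random_state=42)

# Train classifier (accuracy on a dataset this small is illustrative only)
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)
print("Accuracy:", clf.score(X_test, y_test))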
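
With only a handful of labeled texts, a single train/test split gives a noisy accuracy estimate. Once more labeled data is available, stratified cross-validation is one common alternative. The sketch below assumes synthetic, well-separated vectors in place of real embeddings, so the numbers it prints say nothing about real policy text.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

rng = np.random.default_rng(0)
labels = ["Supportive", "Neutral", "Opposed"]
dim, per_class = 8, 10  # hypothetical sizes for the synthetic data

# Build one noisy cluster per label as a stand-in for embedded texts
X_parts, y = [], []
for i, label in enumerate(labels):
    center = np.zeros(dim)
    center[i] = 5.0
    X_parts.append(center + 0.5 * rng.normal(size=(per_class, dim)))
    y.extend([label] * per_class)
X = np.vstack(X_parts)
y = np.array(y)

# Each of the 5 folds keeps the same class proportions as the full set
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)
print("Per-fold accuracy:", scores)
print("Mean accuracy:", scores.mean())
```

To apply this in the notebook, replace the synthetic `X` with `model.encode(texts)` and `y` with the matching labels. Each class needs at least as many examples as there are folds.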

## Optional: Fine-Tuning with LoRA

For advanced users or domain-specific tasks, you can improve model performance by fine-tuning the embedding model using parameter-efficient fine-tuning (PEFT) techniques such as:

- LoRA (Low-Rank Adaptation)
- Adapters
- QLoRA (Quantized LoRA for low-resource training)

See the GitBook page [`peft_finetune_demo.md`](peft_finetune_demo.md) for examples and links.
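
To give a sense of what LoRA changes, the sketch below shows only the arithmetic of a low-rank update to one frozen weight matrix, written in plain NumPy. It is not the API of any fine-tuning library, and the layer sizes, rank `r`, and scaling factor `alpha` are hypothetical values chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r, alpha = 384, 384, 8, 16  # hypothetical layer size, rank, and scale

W = rng.normal(size=(d_out, d_in))      # pretrained weight, kept frozen
A = rng.normal(size=(r, d_in)) * 0.01   # trainable, small random init
B = np.zeros((d_out, r))                # trainable, zero init so the update starts at zero

def lora_forward(x, W, A, B, alpha, r):
    """Base layer output plus the scaled low-rank correction (alpha / r) * B @ A."""
    return x @ W.T + (alpha / r) * (x @ A.T) @ B.T

x = rng.normal(size=(3, d_in))
out = lora_forward(x, W, A, B, alpha, r)

full_params = W.size
lora_params = A.size + B.size
print(full_params, lora_params)  # 147456 6144
```

Because `B` starts at zero, the adapted layer initially reproduces the frozen layer exactly. Training updates only `A` and `B`, which here hold about 4% as many parameters as `W`.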
