# 📉 Semantic Drift Detector (Dual Method)

This notebook performs semantic drift analysis using:

1. **Quantitative** embeddings + cosine similarity + PCA (Hugging Face)
2. **Qualitative** free-text interpretation (Google Gemini)

---

✅ Use cases:  
- Discourse & media framing  
- Survey and stakeholder alignment  
- Narrative consistency in policy, brand, or messaging  
- Cross-source or longitudinal sentiment shift  


In [None]:
# Optional: install required packages
# !pip install sentence-transformers matplotlib scikit-learn pandas seaborn google-generativeai


In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.decomposition import PCA
from sklearn.metrics.pairwise import cosine_similarity
from sentence_transformers import SentenceTransformer
import google.generativeai as genai


## 📝 Step 1: Provide Input Sentences and Labels

Each sentence should represent a statement from a different source, time, or group.  
Label them by speaker, party, outlet, etc., to highlight potential **semantic shifts**.
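
Later steps pair each sentence with its label by position, so the two lists need the same length. A minimal sketch of a sanity check follows; the helper name `check_inputs` is hypothetical and not part of the original notebook.

```python
def check_inputs(sentences, labels):
    """Raise ValueError if sentences and labels cannot be paired by position."""
    if len(sentences) != len(labels):
        raise ValueError(
            f"Got {len(sentences)} sentences but {len(labels)} labels; "
            "each sentence needs exactly one label."
        )
    # Flag empty or whitespace-only strings, which have no content to compare.
    blank = [i for i, s in enumerate(sentences) if not s.strip()]
    if blank:
        raise ValueError(f"Empty sentences at positions: {blank}")
    return True


# Example with two toy sentences and matching labels
check_inputs(["The rally was held.", "Crowds gathered."], ["neutral", "neutral"])
```

Running this on your own `sentences` and `labels` before encoding should surface mismatches early rather than as an indexing error in the plotting step.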


In [None]:
# ✍️ Replace these with your own data
sentences = [
    "The protest drew thousands of citizens.",
    "The riot disrupted public order.",
    "Demonstrators gathered peacefully.",
    "Violent clashes erupted between police and protesters.",
    "The rally was held in solidarity.",
    "Crowds vandalized public property."
]

labels = ["neutral", "negative", "neutral", "negative", "neutral", "negative"]


## 🔍 Step 2: Generate Sentence Embeddings (Hugging Face)

This step encodes each sentence into a 384-dimensional vector using `all-MiniLM-L6-v2`, capturing semantic meaning.

These vectors let us **numerically measure similarity** between sentence meanings.


In [None]:
model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(sentences)


## 📐 Step 3: Cosine Similarity Matrix

We now calculate **pairwise similarity** between sentence vectors using cosine similarity.

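Scores range from `-1.0` to `1.0`:
- `1.0` → identical direction (near-identical meaning)
- `0.0` → orthogonal vectors (unrelated meaning)
- `<0` → opposing directions (rare for sentence embeddings, and not necessarily "opposite" meaning)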
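
To make the scale concrete, here is a small worked example on toy 2D vectors. It is a pure-Python sketch of the formula `dot(u, v) / (|u| * |v|)`, not scikit-learn's implementation, and the vectors are illustrative rather than real embeddings.

```python
import math


def cosine(u, v):
    """Cosine similarity: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


same_direction = cosine([1.0, 2.0], [2.0, 4.0])   # parallel vectors
orthogonal = cosine([1.0, 0.0], [0.0, 3.0])       # perpendicular vectors
opposite = cosine([1.0, 2.0], [-1.0, -2.0])       # reversed direction

print(round(same_direction, 6), round(orthogonal, 6), round(opposite, 6))
```

Note that the score depends only on direction: scaling a vector, as in the first pair, leaves the similarity unchanged.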


In [None]:
sim_matrix = cosine_similarity(embeddings)
sim_df = pd.DataFrame(sim_matrix, index=sentences, columns=sentences)
sim_df.round(2)


## 📊 Step 4: Visualize Semantic Drift with PCA

We use **Principal Component Analysis** to reduce embedding vectors to 2D for plotting.

Sentences with similar embeddings land close together in the plot, so the distance between points or groups gives a rough visual read on **how meaning drifts**.
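
A 2D projection keeps only part of the variance in the embeddings, so distances in the plot are approximate. The sketch below shows one way to check the retained share, using NumPy's SVD on toy data rather than the notebook's embeddings; the function name `variance_retained` is hypothetical. With scikit-learn, the fitted `pca.explained_variance_ratio_` attribute reports the per-component shares directly.

```python
import numpy as np


def variance_retained(vectors, n_components=2):
    """Share of total variance kept by the top principal components."""
    X = np.asarray(vectors, dtype=float)
    centered = X - X.mean(axis=0)          # PCA operates on mean-centered data
    singular_values = np.linalg.svd(centered, compute_uv=False)
    variances = singular_values ** 2       # proportional to per-component variance
    return float(variances[:n_components].sum() / variances.sum())


# Toy stand-in for embeddings: 6 points in 5 dimensions (not real model output)
rng = np.random.default_rng(0)
toy = rng.normal(size=(6, 5))
share = variance_retained(toy, n_components=2)
print(f"Top 2 components keep {share:.0%} of the variance")
```

If the retained share is low, points that look close in the plot may still be far apart in the full embedding space, so the similarity matrix from Step 3 is the safer reference.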


In [None]:
pca = PCA(n_components=2)
reduced = pca.fit_transform(embeddings)

plt.figure(figsize=(8, 6))
if labels:
    # One scatter series per label so each group gets its own color and legend entry
    unique_labels = sorted(set(labels))
    for lbl in unique_labels:
        idx = [i for i, label in enumerate(labels) if label == lbl]
        plt.scatter(reduced[idx, 0], reduced[idx, 1], label=lbl)
    plt.legend()
else:
    plt.scatter(reduced[:, 0], reduced[:, 1])

# Annotate each point with its 1-based position in `sentences`
for i in range(len(sentences)):
    plt.annotate(f"{i+1}", (reduced[i, 0], reduced[i, 1]), fontsize=9)

plt.title("PCA Projection of Sentence Embeddings (Semantic Drift)")
plt.xlabel("PC 1")
plt.ylabel("PC 2")
plt.grid(True)
plt.tight_layout()
plt.show()


## 🤖 Step 5 (Optional): Gemini Meaning Comparison

Ask Gemini to compare any two sentences and explain **how their meaning differs**.
This supports **interpretability**, e.g. for clients or stakeholders.

You’ll need a Gemini API key from Google Generative AI:
https://makersuite.google.com/app/apikey


In [None]:
# Replace with your Gemini API key
genai.configure(api_key="YOUR-API-KEY")

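def gemini_compare_meaning(sent_a, sent_b):
    # Triple-quoted f-string so the prompt can span lines and contain double quotes
    prompt = f"""Compare the following statements:
A: "{sent_a}"
B: "{sent_b}"

How do they differ in meaning, tone, or framing?"""
    # Named gemini_model to avoid confusion with the SentenceTransformer `model` above.
    # The model name may need updating to one your API key can access.
    gemini_model = genai.GenerativeModel("gemini-pro")
    response = gemini_model.generate_content(prompt)
    return response.text.strip()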

# Example: compare two sentences
# gemini_compare_meaning(sentences[0], sentences[1])


## 🎛️ Streamlit Dashboard Logic (Preview)

To convert this into a dashboard, wrap PCA and comparison logic in a Streamlit app.
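
The sketch below calls `hf_similarity`, which this notebook does not define. One possible shape for that helper is shown here as an assumption, not the author's implementation. It takes an `encode` callable (for example `model.encode` from Step 2) so it can be tried without downloading a model; `fake_encode` is a purely illustrative stand-in.

```python
import numpy as np


def hf_similarity(sent_a, sent_b, encode):
    """Cosine similarity between the embeddings of two sentences.

    `encode` maps a list of strings to a 2D array of vectors,
    e.g. SentenceTransformer("all-MiniLM-L6-v2").encode.
    """
    vec_a, vec_b = np.asarray(encode([sent_a, sent_b]), dtype=float)
    denom = np.linalg.norm(vec_a) * np.linalg.norm(vec_b)
    return float(vec_a @ vec_b / denom)


def fake_encode(texts):
    """Illustrative stand-in encoder: counts of a, e, i, o, u per text."""
    return np.array(
        [[t.lower().count(v) for v in "aeiou"] for t in texts], dtype=float
    )


score = hf_similarity(
    "The protest drew thousands.", "The riot disrupted order.", fake_encode
)
print(round(score, 3))
```

With this signature, the call in the Streamlit sketch would become `hf_similarity(sent1, sent2, model.encode)`.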


In [None]:
# Streamlit layout sketch (not runnable here, but exportable)
'''
import streamlit as st

st.title("Semantic Drift Detector")

sent1 = st.text_input("Sentence A")
sent2 = st.text_input("Sentence B")

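if sent1 and sent2:
    # hf_similarity is a placeholder: a helper (not defined in this notebook) that
    # embeds both sentences and returns their cosine similarity
    sim_score = hf_similarity(sent1, sent2)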
    gemini_out = gemini_compare_meaning(sent1, sent2)

    st.write("Cosine Similarity:", sim_score)
    st.write("Gemini Interpretation:", gemini_out)
'''


---

## 🧠 Interpretation Guidelines

- **Nearby points** in PCA → similar framing
- **Far-apart points** → drift (especially within the same label)
- **Separate neutral and negative clusters** → tone divergence
- Combine the **cosine similarity score** with the **Gemini text** for dual insights
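
The "far apart within the same label" guideline can also be checked numerically instead of by eye. The sketch below compares the mean similarity of same-label pairs with that of cross-label pairs. The function name `label_similarity_summary` is hypothetical, and the matrix holds illustrative values rather than model output.

```python
import numpy as np


def label_similarity_summary(sim_matrix, labels):
    """Mean cosine similarity for same-label pairs and for cross-label pairs."""
    sim = np.asarray(sim_matrix, dtype=float)
    labels = np.asarray(labels)
    same_label = labels[:, None] == labels[None, :]
    upper = np.triu(np.ones(sim.shape, dtype=bool), k=1)  # each pair once, no diagonal
    within = sim[upper & same_label]
    across = sim[upper & ~same_label]
    return {
        "within": float(within.mean()) if within.size else float("nan"),
        "across": float(across.mean()) if across.size else float("nan"),
    }


# Toy similarity matrix for four sentences (illustrative values, not model output)
toy_sim = [
    [1.0, 0.8, 0.3, 0.2],
    [0.8, 1.0, 0.4, 0.1],
    [0.3, 0.4, 1.0, 0.7],
    [0.2, 0.1, 0.7, 1.0],
]
summary = label_similarity_summary(
    toy_sim, ["neutral", "neutral", "negative", "negative"]
)
print(summary)
```

In the notebook, `sim_matrix` from Step 3 and `labels` from Step 1 can be passed in directly. A within-label mean that is not clearly higher than the cross-label mean would suggest drift inside a group.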

---

## 💼 Use Cases

| Domain | Example |
|--------|---------|
| Political Analysis | Party A vs Party B press releases |
| Media Studies | Compare headlines on same event |
| Product Feedback | Tone drift between cohorts |
| HR/DEI | Survey framing differences by region |
| Research QA | Detect rater drift across annotation rounds |
