<a href="https://colab.research.google.com/github/AHMerrill/unstructured-project/blob/resolve_topic_assignment/anti_echo_chamber.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# ================================================================
# Anti Echo Chamber — Full Analysis and Retrieval Pipeline
# ================================================================

This notebook performs a complete, production-grade anti-echo workflow:

1. Secure OpenAI login  
2. Rebuild ChromaDB from Hugging Face  
3. Upload and parse PDF / TXT / HTML  
4. Summarize with OpenAI (`gpt-4o-mini`)  
5. Create topic + stance embeddings  
6. Compare against Chroma to surface ideologically contrasting articles  

Repositories  
- GitHub: https://github.com/AHMerrill/anti-echo-chamber  
- Hugging Face dataset: https://huggingface.co/datasets/zanimal/anti-echo-artifacts


In [1]:
# ================================================================
# Stage 1 — Secure OpenAI API Key Setup
# ================================================================

import os
from getpass import getpass

if "OPENAI_API_KEY" not in os.environ or not os.environ["OPENAI_API_KEY"]:
    os.environ["OPENAI_API_KEY"] = getpass("Enter your OpenAI API key: ")

print("OpenAI API key loaded into environment (hidden)")


Enter your OpenAI API key: ··········
OpenAI API key loaded into environment (hidden)


# ================================================================
# Stage 2 — Environment Setup and Repository Configuration
# ================================================================

This stage:
- Clones the GitHub repo
- Installs dependencies
- Loads YAML + JSON configs
- Prints the active models and Chroma settings


In [2]:
import os, json, yaml, numpy as np, torch
from pathlib import Path

# --- Repo and paths ---
GIT_URL = "https://github.com/AHMerrill/anti-echo-chamber.git"
PROJECT_ROOT = Path("/content/anti_echo").resolve()

if not PROJECT_ROOT.exists():
    print(f"Cloning from {GIT_URL}...")
    os.system(f"git clone {GIT_URL} {PROJECT_ROOT}")
else:
    print("Repository exists. Pulling latest changes...")
    os.system(f"cd {PROJECT_ROOT} && git pull")

# --- Install dependencies ---
!pip install -q pdfplumber beautifulsoup4 chromadb sentence-transformers pyyaml huggingface_hub openai rapidfuzz

# --- Load configs ---
CONFIG_PATH = PROJECT_ROOT / "config/config.yaml"
with open(CONFIG_PATH, "r", encoding="utf-8") as f:
    CONFIG = yaml.safe_load(f)

summary = {
    "repo": str(PROJECT_ROOT),
    "topic_model": CONFIG["embeddings"]["topic_model"],
    "stance_model": CONFIG["embeddings"]["stance_model"],
    "chroma_collections": CONFIG["chroma_collections"]
}
print(json.dumps(summary, indent=2))


Cloning from https://github.com/AHMerrill/anti-echo-chamber.git...
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m42.8/42.8 kB[0m [31m2.4 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m48.5/48.5 kB[0m [31m2.1 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m67.3/67.3 kB[0m [31m3.5 MB/s[0m eta [36m0:00:00[0m
[?25h  Installing build dependencies ... [?25l[?25hdone
  Getting requirements to build wheel ... [?25l[?25hdone
  Preparing metadata (pyproject.toml) ... [?25l[?25hdone
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m60.0/60.0 kB[0m [31m6.3 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m5.6/5.6 MB[0m [31m86.7 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m20.7/20.7 MB[0m [31m112.8 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

# ================================================================
# Stage 3 — Full Chroma Rebuild from Hugging Face Dataset
# ================================================================

This stage reconstructs the local **ChromaDB** from your Hugging Face dataset  
[`zanimal/anti-echo-artifacts`](https://huggingface.co/datasets/zanimal/anti-echo-artifacts).

It preserves the full multi-topic and multi-stance structure from your scraper:
- Each article can yield **multiple topic vectors** (`::topic::0`, `::topic::1`, …)
- Each article can yield **multiple stance vectors** (`::stance::summary`, `::stance::0`, …)

Duplicates are filtered **only by exact row_id**, not by base article ID.  
This ensures we retain all topical clusters while preventing re-ingestion of the same batch.


In [3]:
# ================================================================
# Stage 3 — Chroma Rebuild (multi-topic aware)
# ================================================================

import os, json, numpy as np, traceback
from pathlib import Path
from huggingface_hub import list_repo_files, hf_hub_download
import chromadb
from collections import defaultdict

HF_REPO = "zanimal/anti-echo-artifacts"
PROJECT_ROOT = Path("/content/anti_echo").resolve()
CHROMA_PATH = PROJECT_ROOT / "chroma_db"
CHROMA_PATH.mkdir(parents=True, exist_ok=True)

client = chromadb.PersistentClient(path=str(CHROMA_PATH))

# Drop and recreate clean collections
for name in ["news_topic", "news_stance"]:
    try:
        client.delete_collection(name)
    except Exception:
        pass

topic_coll = client.create_collection("news_topic", metadata={"hnsw:space": "cosine"})
stance_coll = client.create_collection("news_stance", metadata={"hnsw:space": "cosine"})
print(f"Initialized Chroma collections at {CHROMA_PATH}")

# -------------------------------------------------------------------
# Helpers
# -------------------------------------------------------------------

def load_npz_safely(path):
    """Load an .npz file and return the first valid 2D array."""
    arr = np.load(path, allow_pickle=False)
    if isinstance(arr, np.lib.npyio.NpzFile):
        for key in arr.files:
            if arr[key].ndim == 2:
                return arr[key]
        raise ValueError(f"No 2D arrays found in {path}")
    return arr

def load_jsonl(fp):
    """Read JSONL safely into list of dicts."""
    records = []
    with open(fp, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            try:
                records.append(json.loads(line))
            except json.JSONDecodeError:
                continue
    return records

# -------------------------------------------------------------------
# Discover batches
# -------------------------------------------------------------------

files = list_repo_files(HF_REPO, repo_type="dataset")
batches = sorted({"/".join(f.split("/")[:2]) for f in files if f.startswith("batches/")})
print(f"Detected {len(batches)} batches in {HF_REPO}")

# Track seen row IDs to prevent duplicates
seen_topic_ids, seen_stance_ids = set(), set()
article_topic_counts = defaultdict(int)
article_stance_counts = defaultdict(int)

topic_total, stance_total = 0, 0

# -------------------------------------------------------------------
# Process each batch
# -------------------------------------------------------------------

for batch in batches:
    try:
        print(f"\n--- Processing {batch} ---")

        topic_npz = hf_hub_download(HF_REPO, f"{batch}/embeddings_topic.npz", repo_type="dataset")
        stance_npz = hf_hub_download(HF_REPO, f"{batch}/embeddings_stance.npz", repo_type="dataset")
        meta_topic = hf_hub_download(HF_REPO, f"{batch}/metadata_topic.jsonl", repo_type="dataset")
        meta_stance = hf_hub_download(HF_REPO, f"{batch}/metadata_stance.jsonl", repo_type="dataset")

        t_embs = load_npz_safely(topic_npz)
        s_embs = load_npz_safely(stance_npz)
        t_meta = load_jsonl(meta_topic)
        s_meta = load_jsonl(meta_stance)

        if len(t_embs) != len(t_meta) or len(s_embs) != len(s_meta):
            print(f"⚠️ Mismatch in {batch}: topic {len(t_embs)} vs {len(t_meta)}, stance {len(s_embs)} vs {len(s_meta)}")

        # --- Topic collection ---
        t_records = []
        for e, m in zip(t_embs, t_meta):
            rid = m.get("row_id") or f"{m.get('id','unknown')}::topic::0"
            if rid not in seen_topic_ids:
                seen_topic_ids.add(rid)
                t_records.append((rid, e, m))
                base_id = rid.split("::")[0]
                article_topic_counts[base_id] += 1

        if t_records:
            topic_coll.upsert(
                ids=[r[0] for r in t_records],
                embeddings=[r[1].tolist() for r in t_records],
                metadatas=[r[2] for r in t_records],
            )
        topic_total += len(t_records)

        # --- Stance collection ---
        s_records = []
        for e, m in zip(s_embs, s_meta):
            rid = m.get("row_id") or f"{m.get('id','unknown')}::stance::0"
            if rid not in seen_stance_ids:
                seen_stance_ids.add(rid)
                s_records.append((rid, e, m))
                base_id = rid.split("::")[0]
                article_stance_counts[base_id] += 1

        if s_records:
            stance_coll.upsert(
                ids=[r[0] for r in s_records],
                embeddings=[r[1].tolist() for r in s_records],
                metadatas=[r[2] for r in s_records],
            )
        stance_total += len(s_records)

        print(f"✓ Added {len(t_records)} topic, {len(s_records)} stance vectors")

    except Exception as e:
        print(f"Failed {batch}: {type(e).__name__}: {e}")
        traceback.print_exc(limit=1)

# -------------------------------------------------------------------
# Summary
# -------------------------------------------------------------------

print("\n=== Rebuild Summary ===")
print(f"Topic vectors added: {topic_total}")
print(f"Stance vectors added: {stance_total}")
print(f"Unique articles (topic): {len(article_topic_counts)}")
print(f"Unique articles (stance): {len(article_stance_counts)}")

avg_topics = np.mean(list(article_topic_counts.values())) if article_topic_counts else 0
avg_stances = np.mean(list(article_stance_counts.values())) if article_stance_counts else 0
print(f"Avg topics per article: {avg_topics:.2f}")
print(f"Avg stance vectors per article: {avg_stances:.2f}")
print(f"Stored at {CHROMA_PATH}")


Initialized Chroma collections at /content/anti_echo/chroma_db
Detected 1 batches in zanimal/anti-echo-artifacts

--- Processing batches/batch_20251022T220634Z_4e4eb73b ---


embeddings_topic.npz:   0%|          | 0.00/80.1k [00:00<?, ?B/s]

embeddings_stance.npz:   0%|          | 0.00/18.7k [00:00<?, ?B/s]

metadata_topic.jsonl: 0.00B [00:00, ?B/s]

metadata_stance.jsonl: 0.00B [00:00, ?B/s]

✓ Added 57 topic, 13 stance vectors

=== Rebuild Summary ===
Topic vectors added: 57
Stance vectors added: 13
Unique articles (topic): 13
Unique articles (stance): 13
Avg topics per article: 4.38
Avg stance vectors per article: 1.00
Stored at /content/anti_echo/chroma_db


# ================================================================
# Stage 4 — User Upload + Source Bias Detection
# ================================================================

This stage handles ingestion of a user-supplied article (TXT / PDF / HTML).  
It performs four key steps:

1. **Upload** the file and extract plain text.  
2. **Infer or confirm** the publication source.  
3. **Match** against existing entries in `source_bias.json` (fuzzy).  
4. **If new**, infer ideological metadata (bias family + score) using `gpt-4o-mini`.

All extracted and inferred data will be cached for later topic + stance analysis.


In [4]:
# ================================================================
# Stage 4 — User Upload + Source Bias Detection
# ================================================================

import os, re, json, pdfplumber, requests
from bs4 import BeautifulSoup
from rapidfuzz import process, fuzz
from pathlib import Path
from openai import OpenAI
from getpass import getpass

# --- Environment setup ---
PROJECT_ROOT = Path("/content/anti_echo").resolve()
CONFIG_DIR = PROJECT_ROOT / "config"

# Load your bias map
SOURCE_BIAS_PATH = CONFIG_DIR / "source_bias.json"
SOURCE_BIAS = json.load(open(SOURCE_BIAS_PATH, encoding="utf-8"))

# Ensure OpenAI key
if "OPENAI_API_KEY" not in os.environ or not os.environ["OPENAI_API_KEY"].strip():
    os.environ["OPENAI_API_KEY"] = getpass("Enter your OpenAI API key: ")

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

# --------------------------------------------------------------------
# 1. File Upload + Text Extraction
# --------------------------------------------------------------------
from google.colab import files
import io

uploaded = files.upload()
filename = list(uploaded.keys())[0]
file_ext = Path(filename).suffix.lower()

def extract_text(path):
    if path.endswith(".txt"):
        return Path(path).read_text(encoding="utf-8", errors="ignore")
    if path.endswith(".pdf"):
        text = ""
        with pdfplumber.open(path) as pdf:
            for page in pdf.pages:
                text += page.extract_text() or ""
        return text
    if path.endswith(".html") or path.endswith(".htm"):
        soup = BeautifulSoup(Path(path).read_text(encoding="utf-8", errors="ignore"), "html.parser")
        for s in soup(["script","style"]): s.decompose()
        return soup.get_text(separator=" ")
    raise ValueError("Unsupported file type")

article_text = extract_text(filename).strip()
print(f"Extracted {len(article_text)} characters from {filename}")

# --------------------------------------------------------------------
# 2. Attempt Source Inference
# --------------------------------------------------------------------
# Heuristic: find common domains in text, or ask GPT if ambiguous
def infer_source_name(text):
    # Quick domain sniff
    m = re.search(r"https?://([^/\s]+)", text)
    if m:
        domain = m.group(1).lower()
        domain = domain.replace("www.", "")
        return domain.split(".")[0]
    # GPT fallback
    prompt = (
        "You are a media analyst. Based on the article text below, "
        "infer the most likely publication or outlet name. "
        "Return only the outlet name, e.g., 'The Guardian', 'Fox News', etc.\n\n"
        f"Article excerpt:\n{text[:2000]}"
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=20,
        temperature=0.2
    )
    guess = resp.choices[0].message.content.strip()
    return re.sub(r"[^A-Za-z0-9\s\-]", "", guess)

inferred_source = infer_source_name(article_text)
print(f"🕵️ Inferred possible source: {inferred_source}")

# Ask user to confirm or override
user_resp = input(f"Is this article from '{inferred_source}'? [y/n]: ").strip().lower()
if user_resp != "y":
    user_source = input("Enter the source name (may contain typos): ").strip()
    confirmed_source = user_source
else:
    confirmed_source = inferred_source

# --------------------------------------------------------------------
# 3. Fuzzy Match Against Known Sources
# --------------------------------------------------------------------
known_sources = list(SOURCE_BIAS.keys())
match, score, _ = process.extractOne(confirmed_source, known_sources, scorer=fuzz.ratio)
print(f"Closest match: {match} (score {score})")

if score >= 85:
    bias_info = SOURCE_BIAS[match]
    print(f"Matched existing bias entry for {match}")
else:
    # ----------------------------------------------------------------
    # 4. GPT Bias Inference Fallback
    # ----------------------------------------------------------------
    prompt = f"""
You are a media bias researcher.
Given the outlet name "{confirmed_source}", infer its general political bias family
(e.g., 'center left', 'center right', 'libertarian right', 'progressive left', 'neutral').
Return JSON with:
- bias_family
- bias_score (float, -1.0 = far left, +1.0 = far right)
- short_rationale (brief explanation)
"""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=128,
        temperature=0.4
    )
    try:
        bias_info = json.loads(resp.choices[0].message.content)
    except Exception:
        bias_info = {"bias_family": "unknown", "bias_score": 0.0, "short_rationale": resp.choices[0].message.content.strip()}
    print(f"New outlet inferred:\n{json.dumps(bias_info, indent=2)}")

# --------------------------------------------------------------------
# 5. Cache Inferred Metadata
# --------------------------------------------------------------------
ARTICLE_META = {
    "filename": filename,
    "source_input": confirmed_source,
    "matched_source": match if score >= 85 else None,
    "bias_family": bias_info.get("bias_family", "unknown"),
    "bias_score": bias_info.get("bias_score", 0.0),
    "rationale": bias_info.get("short_rationale", ""),
    "chars": len(article_text)
}

# Save to temporary workspace
TEMP_DIR = PROJECT_ROOT / "tmp"
TEMP_DIR.mkdir(parents=True, exist_ok=True)
meta_path = TEMP_DIR / f"{Path(filename).stem}_meta.json"
text_path = TEMP_DIR / f"{Path(filename).stem}.txt"
Path(text_path).write_text(article_text, encoding="utf-8")
Path(meta_path).write_text(json.dumps(ARTICLE_META, indent=2), encoding="utf-8")

print("\n--- Source Bias Summary ---")
print(json.dumps(ARTICLE_META, indent=2))
print(f"Cached article + metadata under {TEMP_DIR}")




Saving fox_test.pdf to fox_test.pdf




Extracted 19461 characters from fox_test.pdf
🕵️ Inferred possible source: foxnews
Is this article from 'foxnews'? [y/n]: y
Closest match: foxnews (score 100.0)
Matched existing bias entry for foxnews

--- Source Bias Summary ---
{
  "filename": "fox_test.pdf",
  "source_input": "foxnews",
  "matched_source": "foxnews",
  "bias_family": "conservative right",
  "bias_score": 0.8,
  "rationale": "",
  "chars": 19461
}
Cached article + metadata under /content/anti_echo/tmp


# ================================================================
# Stage 5a — Topic Embedding for Uploaded Article
# ================================================================

This step mirrors the topic vectorization logic used in your scraper.
It:
1. Loads the extracted text and metadata from Stage 4.  
2. Chunks and summarizes text if needed.  
3. Generates multi-topic embeddings via the same model family.  
4. Stores results in the `news_topic` Chroma collection.


In [5]:
# ================================================================
# Stage 5a — Topic Embedding + Canonical Topic Assignment (Ephemeral, Scraper-Accurate)
# ================================================================

import json, numpy as np, nltk, torch
from pathlib import Path
from transformers import AutoTokenizer
from sentence_transformers import SentenceTransformer
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics.pairwise import cosine_similarity

PROJECT_ROOT = Path("/content/anti_echo").resolve()
CONFIG_DIR = PROJECT_ROOT / "config"
TMP = PROJECT_ROOT / "tmp"
EPHEMERAL = TMP / "ephemeral_embeddings"
EPHEMERAL.mkdir(parents=True, exist_ok=True)

# --- Load text + meta ---
latest_meta = sorted(TMP.glob("*_meta.json"))[-1]
meta = json.load(open(latest_meta))
text_path = TMP / f"{Path(latest_meta).stem.replace('_meta','')}.txt"
article_text = text_path.read_text(encoding="utf-8")

print(f"Embedding topics for: {meta['filename']} ({len(article_text)} chars)")

# --- Config parameters ---
topic_model_name = CONFIG["embeddings"]["topic_model"]
chunk_tokens = CONFIG["embeddings"]["chunk_tokens"]
normalize = CONFIG["embeddings"]["normalize"]
threshold = CONFIG["topics"]["similarity_threshold"]
max_topics_per_vec = CONFIG["topics"]["max_topics_per_article"]
device = "cuda" if torch.cuda.is_available() else "cpu"

print(f"Using topic model: {topic_model_name}")
tokenizer = AutoTokenizer.from_pretrained(topic_model_name, use_fast=True)
embedder = SentenceTransformer(topic_model_name, device=device)

# --- Load canonical topic anchors + labels ---
anchors_path = CONFIG_DIR / "topic_anchors.npz"
topics_path = CONFIG_DIR / "topics.json"
anchors_npz = np.load(anchors_path, allow_pickle=True)
topic_anchors = {k: anchors_npz[k] for k in anchors_npz.files}
topic_labels = list(topic_anchors.keys())
print(f"Loaded {len(topic_anchors)} topic anchors")

# --- NLTK setup ---
for pkg in ["punkt", "punkt_tab"]:
    try:
        nltk.data.find(f"tokenizers/{pkg}")
    except LookupError:
        nltk.download(pkg)

def sent_split(text):
    return [s.strip() for s in nltk.sent_tokenize(text) if s.strip()]

def encode(texts):
    if isinstance(texts, str):
        texts = [texts]
    vecs = embedder.encode(
        texts,
        convert_to_numpy=True,
        normalize_embeddings=normalize,
        show_progress_bar=False,
    )
    return np.array(vecs)

def topic_vecs(text):
    sents = sent_split(text)
    if not sents:
        return []
    if len(sents) < 2:
        return [encode(" ".join(sents)).mean(axis=0)]
    emb = encode(sents)
    k = min(max(1, len(sents)//8), 8)
    labels = AgglomerativeClustering(n_clusters=k).fit_predict(emb)
    segs = [" ".join([s for s, l in zip(sents, labels) if l == lab]) for lab in sorted(set(labels))]
    out = []
    for seg in segs:
        out.append(encode(seg).mean(axis=0))
    return out

def match_topics(vec, anchors_dict, max_topics=5, threshold=0.4):
    scores = {
        label: float(cosine_similarity([vec], [anchor])[0][0])
        for label, anchor in anchors_dict.items()
    }
    ranked = sorted(scores.items(), key=lambda x: x[1], reverse=True)
    selected = []
    for i, (label, sim) in enumerate(ranked[:max_topics]):
        if i == 0 or sim >= threshold:
            selected.append({"topic_label": label, "similarity": sim})
    if not selected:
        selected = [{"topic_label": "General / Miscellaneous", "similarity": 0.0}]
    return selected

# --- Generate embeddings + topic matches ---
topic_vec_list = topic_vecs(article_text)
topic_vecs = np.vstack(topic_vec_list)
print(f"Generated {len(topic_vecs)} topic embeddings with shape {topic_vecs.shape}")

# --- Match to anchors ---
all_labels = []
topic_matches = []
for i, vec in enumerate(topic_vecs):
    matches = match_topics(vec, topic_anchors, max_topics=max_topics_per_vec, threshold=threshold)
    topic_matches.append(matches)
    top_labels = [m["topic_label"] for m in matches]
    all_labels.extend(top_labels)
    print(f"\n[Topic vector {i}] matches:")
    for m in matches:
        print(f"  - {m['topic_label']:<40} (similarity {m['similarity']:.3f})")

# --- Deduplicate + limit to top 8 overall topics ---
flat_topics = list(dict.fromkeys(all_labels))[:8]
print("\n--- Canonical Topics Assigned ---")
for t in flat_topics:
    print(f" - {t}")

# --- Save ephemeral outputs ---
base = Path(meta["filename"]).stem
topic_path = EPHEMERAL / f"{base}_topic.npy"
match_path = EPHEMERAL / f"{base}_topic_matches.json"
flat_path = EPHEMERAL / f"{base}_topics_flat.json"
np.save(topic_path, topic_vecs)
json.dump(topic_matches, open(match_path, "w"), indent=2)
json.dump(flat_topics, open(flat_path, "w"), indent=2)

print(f"\nSaved topic vectors → {topic_path}")
print(f"Saved topic matches → {match_path}")
print(f"Saved canonical topic list → {flat_path}")


Embedding topics for: fox_test.pdf (19461 chars)
Using topic model: intfloat/e5-base-v2


tokenizer_config.json:   0%|          | 0.00/314 [00:00<?, ?B/s]

vocab.txt: 0.00B [00:00, ?B/s]

tokenizer.json: 0.00B [00:00, ?B/s]

special_tokens_map.json:   0%|          | 0.00/125 [00:00<?, ?B/s]

modules.json:   0%|          | 0.00/387 [00:00<?, ?B/s]

README.md: 0.00B [00:00, ?B/s]

sentence_bert_config.json:   0%|          | 0.00/57.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/650 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/438M [00:00<?, ?B/s]

config.json:   0%|          | 0.00/200 [00:00<?, ?B/s]

Loaded 22 topic anchors


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


Generated 8 topic embeddings with shape (8, 768)

[Topic vector 0] matches:
  - Politics / Global / Geopolitics & Conflict (similarity 0.818)
  - Technology / Social Media & Platforms    (similarity 0.812)
  - Economy / Trade / Globalization          (similarity 0.809)
  - Society / Media / Communication          (similarity 0.807)
  - Environment / Climate / Energy Policy    (similarity 0.805)

[Topic vector 1] matches:
  - Politics / Global / Geopolitics & Conflict (similarity 0.804)
  - Economy / Trade / Globalization          (similarity 0.795)
  - Politics / US / Federal / Executive Policy (similarity 0.783)
  - Environment / Climate / Energy Policy    (similarity 0.776)
  - Technology / Social Media & Platforms    (similarity 0.773)

[Topic vector 2] matches:
  - Politics / Global / Geopolitics & Conflict (similarity 0.809)
  - Politics / US / Federal / Executive Policy (similarity 0.778)
  - Economy / Trade / Globalization          (similarity 0.774)
  - Politics / US / Federal 

In [6]:
# ================================================================
# Stage 5b — Stance Classification + Hybrid Embedding (Ephemeral, Scraper-Accurate)
# ================================================================

import os, json, re, torch, numpy as np
from pathlib import Path
from openai import OpenAI
from sentence_transformers import SentenceTransformer

PROJECT_ROOT = Path("/content/anti_echo").resolve()
CONFIG_DIR   = PROJECT_ROOT / "config"
TMP          = PROJECT_ROOT / "tmp"
EPHEMERAL    = TMP / "ephemeral_embeddings"
EPHEMERAL.mkdir(parents=True, exist_ok=True)

# --- Load configs and guides ---
with open(CONFIG_DIR / "political_leanings.json", encoding="utf-8") as f:
    leanings_map = json.load(f)
with open(CONFIG_DIR / "implied_stances.json", encoding="utf-8") as f:
    stances_map = json.load(f)
with open(CONFIG_DIR / "source_bias.json", encoding="utf-8") as f:
    source_bias = json.load(f)

# --- Load metadata and article text from Stage 4 ---
latest_meta = sorted(TMP.glob("*_meta.json"))[-1]
meta        = json.load(open(latest_meta))
text_path   = TMP / f"{Path(latest_meta).stem.replace('_meta','')}.txt"
article_txt = text_path.read_text(encoding="utf-8")

print(f"Generating stance embedding for: {meta['filename']} ({len(article_txt)} chars)")

# --- Retrieve outlet bias already inferred in Stage 4 ---
bias_family = meta.get("bias_family", "unknown")
bias_score  = float(meta.get("bias_score", 0.0))

# --- OpenAI client ---
if "OPENAI_API_KEY" not in os.environ or not os.environ["OPENAI_API_KEY"].strip():
    from getpass import getpass
    os.environ["OPENAI_API_KEY"] = getpass("Enter your OpenAI API key: ")
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

# --- GPT classification identical to scraper ---
prompt = f"""
You are a political analyst.
Based on the article below, classify its overall political leaning (tone) and implied stance.

Leaning options: {', '.join(leanings_map.keys())}
Stance examples: {', '.join([s for cat in stances_map.values() for s in cat['families'].keys()])}

Return strict JSON with fields:
- political_leaning (string)
- implied_stance (string)
- summary (one-sentence summary of the article's main argument)

Article title: {meta.get('filename')}
Excerpt: {article_txt[:2000]}
"""

resp = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": prompt}],
    max_tokens=256,
    temperature=0.4
)
raw = resp.choices[0].message.content.strip()

try:
    stance_info = json.loads(raw)
except Exception:
    # fallback regex extraction
    leaning = re.search(r"leaning[:\-]?\s*(.+)", raw, re.I)
    stance  = re.search(r"stance[:\-]?\s*(.+)",  raw, re.I)
    summary = re.search(r"summary[:\-]?\s*(.+)", raw, re.I)
    stance_info = {
        "political_leaning": (leaning.group(1).strip() if leaning else "unknown"),
        "implied_stance":    (stance.group(1).strip()  if stance  else "unknown"),
        "summary":           (summary.group(1).strip() if summary else raw[:200])
    }

print("\n--- GPT Classification ---")
print(json.dumps(stance_info, indent=2))

# --- Compute tone score + match with outlet bias ---
def bias_to_score(label):
    l = (label or "").lower().strip()
    if "progressive" in l or ("left" in l and "center" not in l): return -0.8
    if "center left" in l:  return -0.4
    if l == "center":       return 0.0
    if "center right" in l: return 0.4
    if "conservative" in l or "right" in l: return 0.8
    if "libertarian" in l:  return 0.6
    return 0.0

tone_score = bias_to_score(stance_info.get("political_leaning"))
author_match = abs(bias_score - tone_score) <= 0.3

# --- Build enriched stance metadata ---
stance_meta = {
    "political_leaning": stance_info.get("political_leaning", "unknown"),
    "implied_stance":    stance_info.get("implied_stance", "unknown"),
    "summary":           stance_info.get("summary", ""),
    "bias_family":       bias_family,
    "bias_score":        bias_score,
    "tone_score":        tone_score,
    "author_tone_match": author_match
}

print("\n--- Source / Tone Alignment ---")
print(json.dumps(stance_meta, indent=2))

# --- Hybrid text for embedding ---
hybrid_text = "\n".join([
    stance_meta["political_leaning"],
    stance_meta["implied_stance"],
    stance_meta["summary"]
]).strip()

# --- Generate embedding ---
stance_model_name = CONFIG["embeddings"]["stance_model"]
device = "cuda" if torch.cuda.is_available() else "cpu"
embedder = SentenceTransformer(stance_model_name, device=device)
stance_vec = embedder.encode(hybrid_text, normalize_embeddings=True)
stance_vec = stance_vec.reshape(1, -1)
print(f"\nUsing stance model: {stance_model_name}")
print(f"Generated stance vector with shape {stance_vec.shape}")

# --- Save ephemeral outputs for Stage 6 ---
base = Path(meta["filename"]).stem
np.save(EPHEMERAL / f"{base}_stance.npy", stance_vec)
Path(EPHEMERAL / f"{base}_stance_summary.txt").write_text(hybrid_text, encoding="utf-8")
Path(EPHEMERAL / f"{base}_stance_info.json").write_text(json.dumps(stance_meta, indent=2), encoding="utf-8")

print(f"\nSaved ephemeral stance artifacts under {EPHEMERAL}")


Generating stance embedding for: fox_test.pdf (19461 chars)

--- GPT Classification ---
{
  "political_leaning": "\": \"center right\",",
  "implied_stance": "\": \"nationalist realist\",",
  "summary": "\": \"The article discusses President Trump's imposition of sanctions on Russian oil companies and the cancellation of a summit with Putin, reflecting a shift towards a more confrontational stance while avoiding deeper military involvement in Ukraine.\""
}

--- Source / Tone Alignment ---
{
  "political_leaning": "\": \"center right\",",
  "implied_stance": "\": \"nationalist realist\",",
  "summary": "\": \"The article discusses President Trump's imposition of sanctions on Russian oil companies and the cancellation of a summit with Putin, reflecting a shift towards a more confrontational stance while avoiding deeper military involvement in Ukraine.\"",
  "bias_family": "conservative right",
  "bias_score": 0.8,
  "tone_score": 0.4,
  "author_tone_match": false
}


modules.json:   0%|          | 0.00/349 [00:00<?, ?B/s]

config_sentence_transformers.json:   0%|          | 0.00/116 [00:00<?, ?B/s]

README.md: 0.00B [00:00, ?B/s]

sentence_bert_config.json:   0%|          | 0.00/53.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/571 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/438M [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/363 [00:00<?, ?B/s]

vocab.txt: 0.00B [00:00, ?B/s]

tokenizer.json: 0.00B [00:00, ?B/s]

special_tokens_map.json:   0%|          | 0.00/239 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/190 [00:00<?, ?B/s]


Using stance model: all-mpnet-base-v2
Generated stance vector with shape (1, 768)

Saved ephemeral stance artifacts under /content/anti_echo/tmp/ephemeral_embeddings


### Stage 6 — Retrieval and Anti-Echo Analysis

This stage finds articles that cover the same topic as the uploaded one, but from a different stance or bias.

---

### Process summary

- **Load uploaded article features**: topic embeddings, stance embeddings, bias, tone.
- **Load corpus features**: stored article embeddings and metadata from the Chroma database.
- **Compare uploaded vs. stored articles** using:
  - **Topic overlap**: fraction of shared topics.
  - **Stance similarity**: cosine similarity between stance embeddings.
  - **Bias difference**: absolute distance between bias scores.
  - **Tone difference**: absolute distance between tone/emotional style scores.

---

### Scoring formula (plain text)

anti_echo_score =
  (w_T * topic_overlap)
-(w_S * stance_similarity)
-(w_B * bias_diff)
-(w_Tone * tone_diff)

- Higher scores mean: same topic, meaningfully different angle (i.e., high topic overlap, lower stance similarity, larger bias/tone difference), without drifting off-topic.

---

### Output

- **Printed results**: lists of articles grouped by topic and stance differences.
- **Saved artifact**: a CSV with all metrics:
  - topic_overlap
  - stance_similarity
  - bias_diff
  - tone_diff
  - anti_echo_score

In [7]:
# ================================================================
# Stage 6 — Retrieval and Anti-Echo Analysis (canonical-topic aligned, interpretable + links)
# ================================================================

import os, json, numpy as np, pandas as pd
from pathlib import Path
from openai import OpenAI
import chromadb
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
from getpass import getpass

# ---------------------------------------------------------------
# TUNABLE PARAMETERS
# ---------------------------------------------------------------
# Composite score weights
w_T = 1.0     # topic overlap weight
w_S = 1.0     # stance similarity penalty
w_B = 1.0     # bias difference penalty
w_Tone = 0.5  # tone difference penalty

# Retrieval thresholds and ranking controls
TOPIC_OVERLAP_THRESHOLD = 0.5   # minimum required overlap to consider a match
TOP_N_RESULTS = 10              # how many top-ranked results to show in each section
PRINT_TOP_N = 3                 # how many to print per section in console output

# ---------------------------------------------------------------
# Configuration and paths
# ---------------------------------------------------------------
PROJECT_ROOT = Path("/content/anti_echo").resolve()
CHROMA_PATH = PROJECT_ROOT / "chroma_db"
TMP = PROJECT_ROOT / "tmp"
EPHEMERAL = TMP / "ephemeral_embeddings"

# ---------------------------------------------------------------
# OpenAI setup
# ---------------------------------------------------------------
if "OPENAI_API_KEY" not in os.environ or not os.environ["OPENAI_API_KEY"].strip():
    os.environ["OPENAI_API_KEY"] = getpass("Enter your OpenAI API key: ")
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

# ---------------------------------------------------------------
# Load metadata and ephemeral embeddings
# ---------------------------------------------------------------
latest_meta = sorted(TMP.glob("*_meta.json"))[-1]
meta = json.load(open(latest_meta))
article_id = Path(meta["filename"]).stem
bias_score_article = float(meta["bias_score"])
tone_score_article = float(meta.get("tone_score", 0.0))
print(f"Running anti-echo retrieval for {meta['filename']} (bias={bias_score_article}, tone={tone_score_article})")

topic_vecs = np.load(EPHEMERAL / f"{article_id}_topic.npy")
stance_vec = np.load(EPHEMERAL / f"{article_id}_stance.npy")
topics_flat = json.load(open(EPHEMERAL / f"{article_id}_topics_flat.json"))
stance_text = (EPHEMERAL / f"{article_id}_stance_summary.txt").read_text()

client_chroma = chromadb.PersistentClient(path=str(CHROMA_PATH))
topic_coll = client_chroma.get_collection("news_topic")
stance_coll = client_chroma.get_collection("news_stance")

# ---------------------------------------------------------------
# Helpers
# ---------------------------------------------------------------
def parse_topics(obj):
    if obj is None:
        return []
    if isinstance(obj, list):
        return [t.strip() for t in obj if t.strip()]
    if isinstance(obj, str):
        parts = [t.strip() for t in obj.split(";") if t.strip()]
        if len(parts) == 1 and parts[0].startswith("["):
            try:
                parsed = json.loads(parts[0])
                if isinstance(parsed, list):
                    return [t.strip() for t in parsed if isinstance(t, str)]
            except Exception:
                pass
        return parts
    return []

def topic_overlap_score(a_topics, b_topics):
    a = set([t.strip().lower() for t in parse_topics(a_topics)])
    b = set([t.strip().lower() for t in parse_topics(b_topics)])
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)

def interpret_bias(score: float) -> str:
    if score <= -0.6: return "Progressive / Left"
    if -0.6 < score <= -0.2: return "Center-Left"
    if -0.2 < score < 0.2: return "Center / Neutral"
    if 0.2 <= score < 0.6: return "Center-Right"
    if score >= 0.6: return "Conservative / Right"
    return "Unknown"

def short_url(u, max_len=70):
    if not u:
        return ""
    return (u[:max_len] + "…") if len(u) > max_len else u

# ---------------------------------------------------------------
# Retrieve all topic & stance docs
# ---------------------------------------------------------------
topic_docs = topic_coll.get(include=["embeddings", "metadatas"])
stance_docs = stance_coll.get(include=["embeddings", "metadatas"])

# ---------------------------------------------------------------
# Compare uploaded article to stored corpus
# ---------------------------------------------------------------
scores = []
for emb, md in zip(topic_docs["embeddings"], topic_docs["metadatas"]):
    topic_overlap = topic_overlap_score(topics_flat, md.get("topics_flat", []))
    if topic_overlap < TOPIC_OVERLAP_THRESHOLD:
        continue

    # Get bias and tone info
    bias_db, tone_db = 0.0, 0.0
    for s_md in stance_docs["metadatas"]:
        if s_md.get("id") == md.get("id"):
            try:
                bias_db = float(s_md.get("bias_score", 0.0))
            except Exception:
                try:
                    bias_db = float(json.loads(s_md.get("source_bias", "{}")).get("bias_score", 0.0))
                except Exception:
                    bias_db = 0.0
            tone_db = float(s_md.get("tone_score", bias_db))
            break

    bias_diff = abs(bias_score_article - bias_db)
    tone_diff = abs(tone_score_article - tone_db)

    stance_match = next(
        (s_emb for s_emb, s_md in zip(stance_docs["embeddings"], stance_docs["metadatas"])
         if s_md["id"] == md["id"]), None)
    stance_sim = 0.0
    if stance_match is not None:
        stance_sim = cosine_similarity(
            stance_vec.reshape(1, -1), np.array(stance_match).reshape(1, -1)
        )[0][0]

    # Composite score
    anti_echo_score = (
        (w_T * topic_overlap)
        - (w_S * stance_sim)
        - (w_B * bias_diff)
        - (w_Tone * tone_diff)
    )

    scores.append({
        "article_id": md.get("id"),
        "source": md.get("source", ""),
        "title": md.get("title", ""),
        "url": md.get("url", ""),
        "bias_family": md.get("bias_family", ""),
        "bias_score": bias_db,
        "topic_overlap": topic_overlap,
        "stance_similarity": stance_sim,
        "bias_diff": bias_diff,
        "tone_diff": tone_diff,
        "anti_echo_score": anti_echo_score
    })

df = pd.DataFrame(scores)
if df.empty:
    raise ValueError("No related articles found (verify canonical topics or lower threshold).")

# ---------------------------------------------------------------
# Ranking
# ---------------------------------------------------------------
same_topic_diff_bias = df.sort_values(["topic_overlap", "bias_diff"], ascending=[False, False]).head(TOP_N_RESULTS)
same_topic_opposite_stance = df.sort_values(["topic_overlap", "stance_similarity"], ascending=[False, True]).head(TOP_N_RESULTS)
anti_echo_best = df.sort_values("anti_echo_score", ascending=False).head(TOP_N_RESULTS)

# ---------------------------------------------------------------
# Readable, structured console summaries
# ---------------------------------------------------------------
def print_header(title):
    print("\n" + "=" * 80)
    print(title.upper().center(80))
    print("=" * 80 + "\n")

def format_article_row(row):
    title = (row.get("title") or "Untitled").strip()
    source = row.get("source", "unknown")
    bias_label = interpret_bias(row["bias_score"])
    metrics = (
        f"Topic overlap: {row['topic_overlap']:.2f}   "
        f"Stance sim: {row['stance_similarity']:.2f}   "
        f"Bias diff: {row['bias_diff']:.2f}   "
        f"Anti-echo score: {row['anti_echo_score']:.3f}"
    )
    url = short_url(row.get("url", ""))
    lines = [
        f"• {title}",
        f"  Source: {source}  ({bias_label})",
        f"  {metrics}",
    ]
    if url:
        lines.append(f"  Link: {url}")
    return "\n".join(lines)

def show_results(df, title, n=PRINT_TOP_N):
    print_header(title)
    if df.empty:
        print("  No matches found.\n")
        return
    for _, row in df.head(n).iterrows():
        print(format_article_row(row))
        print("-" * 80)
    print()

def show_overview(df):
    print_header("Ideological Spread Overview")
    left = df[df["bias_score"] < -0.2]["source"].unique()
    right = df[df["bias_score"] > 0.2]["source"].unique()
    print(f"Left / progressive outlets : {', '.join(left) if len(left)>0 else 'none'}")
    print(f"Right / conservative outlets: {', '.join(right) if len(right)>0 else 'none'}\n")
    top = df.iloc[0]
    print(f"Top contrastive article: {top['source']} ({interpret_bias(top['bias_score'])})")
    print(
        f"  Topic overlap: {top['topic_overlap']:.2f}   "
        f"Stance sim: {top['stance_similarity']:.2f}   "
        f"Bias diff: {top['bias_diff']:.2f}   "
        f"Anti-echo score: {top['anti_echo_score']:.3f}"
    )
    print()

# ---------------------------------------------------------------
# Pretty console output
# ---------------------------------------------------------------
show_overview(anti_echo_best)
show_results(same_topic_diff_bias, "Same Topic — Different Source Bias")
show_results(same_topic_opposite_stance, "Same Topic — Opposite Stance")
show_results(anti_echo_best, "Top Anti-Echo Candidates")

# ---------------------------------------------------------------
# Save results
# ---------------------------------------------------------------
out_path = TMP / f"{article_id}_anti_echo_analysis.csv"
df.to_csv(out_path, index=False)

print("=" * 80)
print(f"Detailed analysis saved to: {out_path}")
print("=" * 80)


Running anti-echo retrieval for fox_test.pdf (bias=0.8, tone=0.0)

                          IDEOLOGICAL SPREAD OVERVIEW                           

Left / progressive outlets : vox
Right / conservative outlets: none

Top contrastive article: nypost (Center / Neutral)
  Topic overlap: 0.62   Stance sim: 0.73   Bias diff: 0.80   Anti-echo score: -0.904


                       SAME TOPIC — DIFFERENT SOURCE BIAS                       

• Why your electric bill is so high now: Blame AI data centers | Vox
  Source: vox  (Progressive / Left)
  Topic overlap: 0.62   Stance sim: 0.56   Bias diff: 1.40   Anti-echo score: -1.635
  Link: https://www.vox.com/technology/465749/electricity-costs-ai-data-center…
--------------------------------------------------------------------------------
• play
  Source: aljazeera  (Center-Left)
  Topic overlap: 0.62   Stance sim: 0.60   Bias diff: 1.00   Anti-echo score: -1.072
  Link: https://www.aljazeera.com/news/2025/10/22/political-infighting-as-iran…
----