# Anti-Echo-Chamber

This notebook is organized as small, modular steps:

1. **Runtime & Config** — choose CPU or GPU; set paths; install deps.
2. **Guardian UK Scraper (RSS batch)** — fetch a small batch of recent articles → save text + metadata → update local cache.
3. **Embeddings (topic & sentiment/stance)** —
   - Topic: chunked article → semantic embedding.
   - Sentiment/Stance: LLM summary of *tone/opinion/argument* → embedding.
4. **Artifacts** — save compact `embeddings.npz` + `metadata.jsonl`.
5. **Vector DB (Chroma)** — two collections:
   - `news_topic` (topic vectors)
   - `news_sentiment` (sentiment/stance vectors)
6. **Query Demos** — same-topic search; same-topic *opposite-stance* search.
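
The "same-topic, opposite-stance" query in step 6 can be illustrated without a vector database. The snippet below is a minimal sketch, assuming every article carries one topic vector and one stance vector. The function names, the `min_topic_sim` threshold, and the toy 2-D vectors are hypothetical and are not taken from this notebook's cells. It keeps articles whose topic vector is close to the query's, then ranks them by how dissimilar their stance vector is.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def opposite_stance_search(query, articles, min_topic_sim=0.8, k=3):
    """Return up to k article ids on the same topic, least similar stance first."""
    candidates = []
    for art in articles:
        if cosine(query["topic"], art["topic"]) < min_topic_sim:
            continue  # different topic: not a useful counterpoint
        stance_sim = cosine(query["stance"], art["stance"])
        candidates.append((stance_sim, art["id"]))
    candidates.sort()  # lowest stance similarity first
    return [art_id for _, art_id in candidates[:k]]

# Toy 2-D vectors, purely illustrative.
query = {"topic": [1.0, 0.0], "stance": [1.0, 0.0]}
articles = [
    {"id": "same-topic-same-stance", "topic": [0.9, 0.1], "stance": [0.9, 0.1]},
    {"id": "same-topic-opposite-stance", "topic": [1.0, 0.1], "stance": [-1.0, 0.0]},
    {"id": "other-topic", "topic": [0.0, 1.0], "stance": [-1.0, 0.0]},
]
results = opposite_stance_search(query, articles)
print(results)
```

Filtering on topic first and ranking on stance second is one possible design. With two Chroma collections, the same idea would need a topic query followed by a stance lookup restricted to the returned ids.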

### Runtime & Cost Notes
- You can run **CPU-only** (free) or **GPU** (faster if available). Select it in the config cell.
- “LLM summary” for sentiment/stance supports two modes:
  - **`"openai"`** — uses your `OPENAI_API_KEY` if you want GPT-quality summaries.
  - **`"hf"`** — **free** local summarizer via Hugging Face (slower, still good).
  - We default to `"hf"` so this runs fully free.
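
The fallback between the two modes can be written as a small pure function. This is a sketch under assumptions: `resolve_summary_mode` is a hypothetical helper that the notebook does not define. It mirrors the behavior described above, where `"openai"` is used only when a key is available and `"hf"` is the free default.

```python
import os

def resolve_summary_mode(env=None):
    """Return "openai" when a non-empty OPENAI_API_KEY is present, else "hf"."""
    env = os.environ if env is None else env
    has_key = bool((env.get("OPENAI_API_KEY") or "").strip())
    return "openai" if has_key else "hf"

# Passing a plain dict instead of os.environ keeps the demo side-effect free.
print(resolve_summary_mode({}))
print(resolve_summary_mode({"OPENAI_API_KEY": "sk-example"}))
```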

### Repo hygiene
- Code + small JSON (e.g., `feeds/index.json`) live in Git.
- Raw text, embeddings, and Chroma DB live **locally** (ephemeral) or on Drive if you mount it later.



## Optional: Use OpenAI for stance/tone summaries

- If you have an OpenAI API key, enter it *privately* below.  
- If you skip this, the notebook will use a **free Hugging Face** summarizer.

> Your key is stored only in the Colab session environment (not saved to Git).

# If you enter an OpenAI key, DO NOT COMMIT THE OUTPUT OF THIS CELL TO GIT



In [11]:
# OPTIONAL: set OpenAI API key (kept private in this Colab session)
# If you skip this cell, the notebook will fall back to the free HF summarizer.
# Safe to run before the config cell: that cell also switches SUMMARY_MODE
# to "openai" when OPENAI_API_KEY is set in the environment.
import os, getpass
try:
    _key = getpass.getpass("Enter your OpenAI API key (or press Enter to skip): ").strip()
    if _key:
        os.environ["OPENAI_API_KEY"] = _key
        if "CONFIG" in globals():  # CONFIG does not exist yet on a fresh runtime
            CONFIG["SUMMARY_MODE"] = "openai"
        print("OpenAI key set for this session. SUMMARY_MODE=openai")
    else:
        _mode = CONFIG["SUMMARY_MODE"] if "CONFIG" in globals() else "hf"
        print("No key entered. SUMMARY_MODE remains:", _mode)
except Exception as e:
    print("Skipping OpenAI setup:", e)


Enter your OpenAI API key (or press Enter to skip): ··········
No key entered. SUMMARY_MODE remains: hf


In [12]:
# === CONFIG + INSTALLS + IMPORTS (single place for all imports) ===
# Run this once per session. Re-run if the Colab runtime restarts.

# ----------------
# USER CONFIG
# ----------------
CONFIG = {
    # Hardware preference: True tries GPU if available; False forces CPU.
    "PREFER_GPU": True,          # toggle this if you don't have Colab GPU
    # Where to keep working files; keep ephemeral for now.
    "BASE_DIR": "/content/anti-echo-chamber",
    # Sentiment/Stance summarizer mode: "hf" (free, local) or "openai" (uses OPENAI_API_KEY).
    "SUMMARY_MODE": "hf",
    # HF summarizer model (used if SUMMARY_MODE == "hf")
    "HF_SUMMARY_MODEL": "facebook/bart-large-cnn",
    # Embedding model for both topic & stance vectors
    "EMBED_MODEL": "sentence-transformers/all-MiniLM-L6-v2",
    # Chroma collections
    "COLL_TOPIC": "news_topic",
    "COLL_SENT": "news_sentiment",
    # Disable Chroma telemetry noise
    "CHROMA_TELEMETRY": False,
}

# ----------------
# INSTALLS
# ----------------
import sys, subprocess, os

def pipi(*pkgs):
    subprocess.run([sys.executable, "-m", "pip", "install", "-q", *pkgs], check=True)

# Pin a few versions known to be stable in Colab
pipi("numpy==2.1.2")
pipi("torch>=2.3")
pipi("sentence-transformers==3.0.1", "chromadb==0.5.5")
pipi("trafilatura==1.8.1", "readability-lxml==0.8.4.1", "newspaper3k==0.2.8")
pipi("transformers>=4.41", "accelerate>=0.33", "safetensors>=0.4")
# OpenAI is optional; install so users can switch SUMMARY_MODE later without changing installs
pipi("openai>=1.40.0")

# ----------------
# IMPORTS
# ----------------
import json, re, hashlib, math, time, datetime as dt
from pathlib import Path
from urllib.parse import urlparse

import numpy as np
import torch
from sentence_transformers import SentenceTransformer

import trafilatura
from readability import Document as ReadabilityDocument

import chromadb
from chromadb import PersistentClient

from transformers import (
    AutoTokenizer,
    AutoModelForSeq2SeqLM,
    pipeline as hf_pipeline,
)

# ----------------
# DEVICE / RUNTIME
# ----------------
def pick_device(prefer_gpu: bool = True) -> str:
    if prefer_gpu and torch.cuda.is_available():
        return "cuda"
    return "cpu"

DEVICE = pick_device(CONFIG["PREFER_GPU"])
print(f"Device selected: {DEVICE}  |  CUDA available: {torch.cuda.is_available()}")

# ----------------
# PATHS
# ----------------
BASE = Path(CONFIG["BASE_DIR"])
DATA = BASE / "data"
RAW = DATA / "raw"
PROC = DATA / "processed"
EMB = DATA / "embeddings"
DB  = BASE / "db"  # Chroma persistence

for p in [BASE, DATA, RAW, PROC, EMB, DB]:
    p.mkdir(parents=True, exist_ok=True)

print("Project dirs ready under:", BASE)

# ----------------
# HELPERS (shared)
# ----------------
def now_utc() -> str:
    return dt.datetime.now(dt.timezone.utc).isoformat()

def slugify(s: str, max_len: int = 80) -> str:
    s = (s or "").lower()
    s = re.sub(r"[^a-z0-9]+", "-", s).strip("-")
    return s[:max_len] if s else "article"

def pretty_size(path: Path) -> str:
    sz = path.stat().st_size if path.exists() else 0
    units = ["B","KB","MB","GB"]
    u=0
    while sz >= 1024 and u < len(units)-1:
        sz/=1024; u+=1
    return f"{sz:.2f} {units[u]}"

# Embedding model (loaded once, used by all steps)
EMB_MODEL = SentenceTransformer(CONFIG["EMBED_MODEL"], device=DEVICE)
print("Embedding model loaded:", CONFIG["EMBED_MODEL"])

# Optional: prepare a free HF summarizer pipeline (lazy-init on first use)
_hf_summarizer = None
def get_hf_summarizer():
    global _hf_summarizer
    if _hf_summarizer is None:
        # pipeline handles tokenization + generation under the hood
        _hf_summarizer = hf_pipeline(
            "summarization",
            model=CONFIG["HF_SUMMARY_MODEL"],
            device=0 if DEVICE == "cuda" else -1,
        )
    return _hf_summarizer

# Optional: OpenAI client (only used if SUMMARY_MODE == "openai" and env var present)
_openai = None
def get_openai_client():
    global _openai
    if _openai is None:
        from openai import OpenAI
        _openai = OpenAI()  # uses OPENAI_API_KEY env var if set
    return _openai

if os.getenv("OPENAI_API_KEY"): CONFIG["SUMMARY_MODE"] = "openai"


Device selected: cuda  |  CUDA available: True
Project dirs ready under: /content/anti-echo-chamber
Embedding model loaded: sentence-transformers/all-MiniLM-L6-v2


## ChromaDB — clean reset & init

If you see “An instance of Chroma already exists … with different settings,” it means a prior client was created with different config.  
For prototyping, it’s safe to **reset the local DB folder** and re-initialize.

This cell:
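1) Deletes the local Chroma folder (`BASE/db_telem_off`, or `BASE/db_telem_on` when telemetry is enabled)
2) Sets the telemetry flag consistently in the environment and in `Settings` (off by default)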
3) Creates a fresh `PersistentClient` you can reuse later
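
The unconditional delete is convenient while prototyping but easy to forget once the DB holds data worth keeping. A minimal sketch of an opt-in reset follows. `prepare_db_dir` and its `reset` flag are hypothetical names introduced here for illustration, and the demo runs in a temporary directory, not in the notebook's `BASE`.

```python
import shutil
import tempfile
from pathlib import Path

def prepare_db_dir(db_path, reset=False):
    """Ensure db_path exists; wipe it first only when reset=True."""
    db_path = Path(db_path)
    if reset:
        shutil.rmtree(db_path, ignore_errors=True)
    db_path.mkdir(parents=True, exist_ok=True)
    return db_path

# Demo in a throwaway directory.
root = Path(tempfile.mkdtemp())
db = prepare_db_dir(root / "db")
(db / "marker.txt").write_text("keep me", encoding="utf-8")

kept = (prepare_db_dir(db) / "marker.txt").exists()                     # reset=False keeps data
wiped = not (prepare_db_dir(db, reset=True) / "marker.txt").exists()    # reset=True clears it
print(kept, wiped)
shutil.rmtree(root, ignore_errors=True)
```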


In [15]:
# 🔧 Chroma: conflict-proof init (new path + explicit settings)

from pathlib import Path
import shutil, os
import chromadb
from chromadb import PersistentClient
from chromadb.config import Settings

# 1) Choose a DB path based on telemetry setting so it's always consistent
telemetry_on = bool(CONFIG.get("CHROMA_TELEMETRY", False))
DB = Path(CONFIG["BASE_DIR"]) / ("db_telem_on" if telemetry_on else "db_telem_off")

# 2) (Optional) nuke the folder for a clean slate during prototyping
#    Comment this out later if you want to keep data across runs.
shutil.rmtree(DB, ignore_errors=True)
DB.mkdir(parents=True, exist_ok=True)

# 3) Make sure env + Settings agree
os.environ["ANONYMIZED_TELEMETRY"] = "True" if telemetry_on else "False"
settings = Settings(anonymized_telemetry=telemetry_on)

# 4) Create the client with explicit settings for this path
client = PersistentClient(path=str(DB), settings=settings)

print("✅ Chroma ready")
print("  path     :", DB)
print("  telemetry:", telemetry_on)


ERROR:chromadb.telemetry.product.posthog:Failed to send telemetry event ClientStartEvent: capture() takes 1 positional argument but 3 were given


✅ Chroma ready
  path     : /content/anti-echo-chamber/db_telem_off
  telemetry: False


## Guardian UK scraper (RSS, multi-article dry run)

**Goal:** Pull a small batch of recent Guardian articles via RSS, extract **full text**, and save them locally for later embedding.

**What this does**
- Reads several Guardian RSS feeds (e.g., UK, World, Opinion, Politics, Environment).
- Parses entries with `feedparser`, filters by:
  - `MAX_ARTICLES` total
  - `DATE_FROM` (UTC date threshold, optional)
- For each new URL (not in local cache):
  - Fetches HTML
  - Extracts title (Readability → `<title>` fallback)
  - Extracts main text (Trafilatura)
  - Saves `data/raw/<slug>.txt` + `<slug>.meta.json`
  - Updates `feeds/index.json` (session-local cache)

**Why RSS?** It’s free and simple. For large backfills later, we can add the Guardian Content API (also free w/ key) or section-by-section pagination.

> Tip: Start with `MAX_ARTICLES = 5–10` to verify. Then change that number or widen the date range to scale up.
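
The scraper cell calls `save_article`, which is defined outside this section. As a rough illustration of the "save text + metadata, update cache" step, here is a self-contained sketch using only the standard library. Everything in it is an assumption: the `_sketch` function names, the metadata fields, and the filename scheme (`<domain>-<title slug>-<10-character URL hash>`), which is inferred from the filenames printed in the cell output and not from the actual implementation.

```python
import hashlib
import json
import re
import tempfile
from datetime import datetime, timezone
from pathlib import Path
from urllib.parse import urlparse

def slugify_sketch(s, max_len=80):
    """Lowercase, replace runs of non-alphanumerics with '-', truncate."""
    s = re.sub(r"[^a-z0-9]+", "-", (s or "").lower()).strip("-")
    return s[:max_len] if s else "article"

def save_article_sketch(url, title, text, raw_dir, index):
    """Write <slug>.txt and <slug>.meta.json under raw_dir and record the URL in index."""
    raw_dir = Path(raw_dir)
    raw_dir.mkdir(parents=True, exist_ok=True)

    netloc = urlparse(url).netloc
    domain = netloc[4:] if netloc.startswith("www.") else netloc
    url_hash = hashlib.sha1(url.encode("utf-8")).hexdigest()[:10]  # stable per URL
    slug = f"{domain}-{slugify_sketch(title)}-{url_hash}"

    txt_path = raw_dir / f"{slug}.txt"
    meta_path = raw_dir / f"{slug}.meta.json"
    txt_path.write_text(text, encoding="utf-8")
    meta = {
        "url": url,
        "title": title,
        "fetched_at": datetime.now(timezone.utc).isoformat(),
        "n_chars": len(text),
    }
    meta_path.write_text(json.dumps(meta, indent=2), encoding="utf-8")

    # Mark the URL as done so a later run can skip it.
    index.setdefault("items", {})[url] = {"status": "ok", "txt_path": str(txt_path)}
    return {"txt_path": txt_path, "meta_path": meta_path}

# Demo with a made-up URL and a temporary directory.
demo_url = "https://www.theguardian.com/example/2025/oct/09/sample-story"
demo_index = {"items": {}}
out = save_article_sketch(
    demo_url, "A sample story", "Body text of the article.", tempfile.mkdtemp(), demo_index
)
print(out["txt_path"].name)
```

Hashing the URL keeps filenames unique even when two articles share a title, and it makes the mapping from URL to file reproducible across runs.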


In [17]:
# --- Guardian RSS scraper (multi-article, even split across feeds with remainder to Comment is Free) ---
# Relies on names defined in earlier cells: CONFIG, paths, and helpers
# (BASE, RAW, now_utc, slugify, pipi, get_title_from_html, fetch_and_extract, save_article)

# ===== Tunables =====
MAX_ARTICLES   = 100             # total NEW articles to fetch/save across all feeds
MAX_PER_FEED   = None            # optional hard cap per feed (e.g., 3). Use None to let quotas handle it.
DATE_FROM      = "2025-07-01"    # ISO-8601 (UTC). Set to None to ignore.
FORCE_REFETCH  = False           # if True, re-download even if cached (typically False for dry runs)
INCLUDE_EXTENDED = True          # include many Guardian sections beyond the core set

# ===== Feeds =====
CORE_FEEDS = [
    ("uk",              "https://www.theguardian.com/uk/rss"),
    ("world",           "https://www.theguardian.com/world/rss"),
    ("politics",        "https://www.theguardian.com/politics/rss"),
    ("environment",     "https://www.theguardian.com/environment/rss"),
    ("commentisfree",   "https://www.theguardian.com/commentisfree/rss"),  # opinion
]

EXTENDED_FEEDS = [
    ("us",              "https://www.theguardian.com/us-news/rss"),
    ("europe",          "https://www.theguardian.com/world/europe/rss"),
    ("australia",       "https://www.theguardian.com/australia-news/rss"),
    ("business",        "https://www.theguardian.com/uk/business/rss"),
    ("technology",      "https://www.theguardian.com/uk/technology/rss"),
    ("science",         "https://www.theguardian.com/science/rss"),
    ("global-development","https://www.theguardian.com/global-development/rss"),
    ("culture",         "https://www.theguardian.com/uk/culture/rss"),
    ("sport",           "https://www.theguardian.com/uk/sport/rss"),
    ("money",           "https://www.theguardian.com/uk/money/rss"),
]

FEEDS = CORE_FEEDS + (EXTENDED_FEEDS if INCLUDE_EXTENDED else [])

# ===== Deps =====
pipi("feedparser==6.0.10")
import feedparser
import datetime as dt
from email.utils import parsedate_to_datetime

# ===== Local "seen" cache (session-local) =====
FEEDS_DIR = BASE / "feeds"
FEEDS_DIR.mkdir(parents=True, exist_ok=True)
INDEX_PATH = FEEDS_DIR / "index.json"

def load_index_local():
    if INDEX_PATH.exists():
        try:
            return json.loads(INDEX_PATH.read_text(encoding="utf-8"))
        except Exception:
            pass
    return {"last_updated": None, "items": {}}

def save_index_local(idx):
    idx["last_updated"] = now_utc()
    INDEX_PATH.write_text(json.dumps(idx, indent=2), encoding="utf-8")

index = load_index_local()
index.setdefault("items", {})

# ===== Date helpers =====
def parse_entry_date(entry):
    for attr in ("published", "updated"):
        try:
            val = getattr(entry, attr, None)
            if val:
                return parsedate_to_datetime(val)
        except Exception:
            pass
    return None

def in_date_range(d, lower_iso):
    if not lower_iso:
        return True
    try:
        floor = dt.datetime.fromisoformat(lower_iso).replace(tzinfo=dt.timezone.utc)
    except Exception:
        return True
    if d is None:
        return True
    if d.tzinfo is None:
        d = d.replace(tzinfo=dt.timezone.utc)
    return d >= floor

# ===== Collect entries PER FEED (so we can enforce per-feed quotas) =====
per_feed_entries = {}
for name, feed_url in FEEDS:
    fp = feedparser.parse(feed_url)
    items = []
    for e in fp.entries:
        url = getattr(e, "link", None)
        if not url:
            continue
        pub = parse_entry_date(e)
        if not in_date_range(pub, DATE_FROM):
            continue
        items.append({"url": url, "published": pub})
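    # sort newest first, de-dupe inside feed
    # (undated or naive datetimes are coerced to UTC-aware values so the sort
    #  never compares naive with aware datetimes, which raises TypeError)
    def _sort_key(it):
        d = it["published"] or dt.datetime.min
        return d if d.tzinfo else d.replace(tzinfo=dt.timezone.utc)
    seen_urls = set()
    uniq = []
    for it in sorted(items, key=_sort_key, reverse=True):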
        if it["url"] in seen_urls:
            continue
        seen_urls.add(it["url"])
        uniq.append(it)
    per_feed_entries[name] = uniq

# ===== Compute even quotas with remainder to Comment is Free =====
feed_names = [name for name, _ in FEEDS]
num_feeds = len(feed_names)
if num_feeds == 0:
    raise RuntimeError("No feeds configured.")

base_quota = MAX_ARTICLES // num_feeds
remainder  = MAX_ARTICLES % num_feeds

quotas = {name: base_quota for name in feed_names}

# Send remainder to commentisfree if present; else to the first feed
cif_name = "commentisfree" if "commentisfree" in quotas else feed_names[0]
quotas[cif_name] += remainder

# Optional hard cap per feed
if isinstance(MAX_PER_FEED, int) and MAX_PER_FEED > 0:
    for k in quotas:
        quotas[k] = min(quotas[k], MAX_PER_FEED)

# ===== Download loop with global cap + per-feed quotas =====
saved_global = 0
errors_global = 0
globally_seen = set()  # prevent cross-feed duplicates

def already_cached(u: str) -> bool:
    return (u in index["items"]) and (index["items"][u].get("status") == "ok")

for name, _url in FEEDS:
    if saved_global >= MAX_ARTICLES:
        break
    quota = quotas.get(name, 0)
    if quota <= 0:
        continue

    saved_this_feed = 0
    entries = per_feed_entries.get(name, [])

    for item in entries:
        if saved_global >= MAX_ARTICLES or saved_this_feed >= quota:
            break

        url = item["url"]
        if url in globally_seen:
            continue
        globally_seen.add(url)

        if already_cached(url) and not FORCE_REFETCH:
            print(f"↩︎ skip (cached) [{name}]:", url)
            continue

        try:
            res = fetch_and_extract(url)  # uses trafilatura + readability title fallback
            out = save_article(url, res["title"], res["text"])  # writes txt + meta, updates index
            print(f"✓ saved [{name}]:", out["txt_path"].name, "|", res["title"][:100])
            saved_this_feed += 1
            saved_global += 1
        except Exception as exc:
            print(f"✗ error [{name}]:", url, "|", type(exc).__name__, "-", str(exc)[:120])
            errors_global += 1
            # record the failure in the index; already_cached() only skips
            # status == "ok", so this URL will be retried on the next run
            index["items"][url] = {"status": "error", "fetched_at": now_utc()}
            save_index_local(index)

print("\nSummary:")
print("  Total saved   :", saved_global)
print("  Total errors  :", errors_global)
print("  Per-feed quotas:", quotas)
print("  Cache file    :", INDEX_PATH)


↩︎ skip (cached) [uk]: https://www.theguardian.com/football/2025/oct/09/england-wales-international-friendly-match-report
↩︎ skip (cached) [uk]: https://www.theguardian.com/society/2025/oct/09/majority-of-family-court-cases-in-england-and-wales-feature-domestic-abuse-watchdog-says
✓ saved [uk]: theguardian.com-ferguson-is-scotland-s-hero-as-they-fight-back-to-steal-vital-victory-over-greec-42327245ea.txt | Ferguson is Scotland’s hero as they fight back to steal vital victory over Greece
↩︎ skip (cached) [uk]: https://www.theguardian.com/us-news/2025/oct/09/criminal-charges-letitia-james-new-york-attorney-general
↩︎ skip (cached) [uk]: https://www.theguardian.com/film/2025/oct/09/good-boy-review-stephen-graham-and-andrea-riseborough-turn-nasty-in-kubrickian-absurdist-nightmare
✓ saved [uk]: theguardian.com-first-images-emerge-from-new-bbc-adaptation-of-lord-of-the-flies-7301480d15.txt | First images emerge from new BBC adaptation of Lord of the Flies
✓ saved [uk]: theguardian.com-joy-an