<a href="https://colab.research.google.com/github/DarianSawali/News-Based-RAG/blob/main/GPT2_News_RAG.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
!pip install feedparser newspaper3k sentence-transformers faiss-cpu

Collecting feedparser
  Downloading feedparser-6.0.12-py3-none-any.whl.metadata (2.7 kB)
Collecting newspaper3k
  Downloading newspaper3k-0.2.8-py3-none-any.whl.metadata (11 kB)
Collecting faiss-cpu
  Downloading faiss_cpu-1.13.0-cp39-abi3-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl.metadata (7.7 kB)
Collecting sgmllib3k (from feedparser)
  Downloading sgmllib3k-1.0.0.tar.gz (5.8 kB)
  Preparing metadata (setup.py) ... done
Collecting cssselect>=0.9.2 (from newspaper3k)
  Downloading cssselect-1.3.0-py3-none-any.whl.metadata (2.6 kB)
Collecting tldextract>=2.0.1 (from newspaper3k)
  Downloading tldextract-5.3.0-py3-none-any.whl.metadata (11 kB)
Collecting feedfinder2>=0.0.4 (from newspaper3k)
  Downloading feedfinder2-0.0.4.tar.gz (3.3 kB)
  Preparing metadata (setup.py) ... done
Collecting jieba3k>=0.35.1 (from newspaper3k)
  Downloading jieba3k-0.35.1.zip (7.4 MB)

In [2]:
!pip install lxml_html_clean

Collecting lxml_html_clean
  Downloading lxml_html_clean-0.4.3-py3-none-any.whl.metadata (2.3 kB)
Downloading lxml_html_clean-0.4.3-py3-none-any.whl (14 kB)
Installing collected packages: lxml_html_clean
Successfully installed lxml_html_clean-0.4.3


In [3]:
import feedparser

NEWS_SOURCES = {
    # "CBC BC": "https://www.cbc.ca/cmlink/rss-canada-britishcolumbia",
    "Global News BC": "https://globalnews.ca/bc/feed/",
    "CTV Vancouver": "https://bc.ctvnews.ca/rss/ctv-news-vancouver-1.822295",
}

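def fetch_articles_from_rss(source_name, feed_url, max_articles=5):
    """Parse an RSS feed and return up to max_articles entries as dicts."""
    # feedparser generally reports fetch or parse problems through feed.bozo
    # rather than raising, so a failed feed can show up as an empty entry list.
    feed = feedparser.parse(feed_url)
    docs = []

    for entry in feed.entries[:max_articles]:
        title = entry.get("title", "").strip()
        summary = entry.get("summary", "").strip()
        url = entry.get("link", "")

        # Use the RSS summary as the document text, falling back to the title.
        text = summary or title

        # Skip entries that have neither a summary nor a title.
        if not text:
            continue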

        docs.append({
            "source": source_name,
            "title": title,
            "text": text,
            "url": url,
            "published": entry.get("published", "")
        })

    return docs


In [4]:
import feedparser

feed = feedparser.parse("https://globalnews.ca/bc/feed/")
print(len(feed.entries))

10


In [5]:
all_documents = []

for name, url in NEWS_SOURCES.items():
    print("Fetching", name)
    docs = fetch_articles_from_rss(name, url)
    print(" → Retrieved:", len(docs))
    all_documents.extend(docs)

print("Total articles:", len(all_documents))

Fetching Global News BC
 → Retrieved: 5
Fetching CTV Vancouver
 → Retrieved: 0
Total articles: 5
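
The run above retrieved nothing from CTV Vancouver, and the output alone does not say why. A minimal sketch of one way to investigate, assuming the `bozo`, `bozo_exception`, `status` and `entries` keys that feedparser exposes on its parse result; `diagnose_feed` is a hypothetical helper, not part of the notebook or of feedparser:

```python
def diagnose_feed(parsed):
    """Return a short, human-readable reason why a parsed feed may be empty.

    `parsed` is the object returned by feedparser.parse(); a plain dict with
    the same keys works too, which keeps this sketch testable offline.
    """
    entries = parsed.get("entries", [])
    if entries:
        return f"ok: {len(entries)} entries"

    # "status" is only present when the feed was fetched over HTTP.
    status = parsed.get("status")
    if status is not None and status >= 400:
        return f"http error: status {status}"

    # feedparser sets "bozo" when the feed is malformed or could not be read.
    if parsed.get("bozo"):
        reason = parsed.get("bozo_exception", "unknown parse problem")
        return f"malformed feed: {reason}"

    return "empty feed: fetched and parsed, but no entries"


# Offline illustration with a stand-in dict instead of a real fetch.
print(diagnose_feed({"entries": [], "status": 404}))

# In the notebook this could be called as (not executed here):
# print(diagnose_feed(feedparser.parse(NEWS_SOURCES["CTV Vancouver"])))
```

Whatever the helper reports, it only narrows down the cause; a moved or retired feed URL would still need to be confirmed against the publisher's site.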


In [6]:
len(all_documents), all_documents[1]

(5,
 {'source': 'Global News BC',
  'title': 'B.C. premier calls pipeline MOU an ‘energy vampire’, First Nations call it a ‘pipe dream’',
  'text': 'The West Coast Oil Tanker Ban came into effect in 2019 and prohibits tankers from carrying more than 12,500 metric tonnes of crude oil along the northern coast of B.C.',
  'url': 'https://globalnews.ca/news/11546713/bc-reaction-david-eby-pipeline-first-nations/',
  'published': 'Thu, 27 Nov 2025 19:35:00 +0000'})

In [8]:
from sentence_transformers import SentenceTransformer
import numpy as np
import faiss

In [9]:
embed_model = SentenceTransformer("all-MiniLM-L6-v2")

corpus_texts = [
    doc["title"] + "\n\n" + doc["text"]
    for doc in all_documents
]

# Create embeddings
corpus_embeddings = embed_model.encode(corpus_texts, convert_to_numpy=True)
corpus_embeddings = corpus_embeddings.astype("float32")

print("Number of docs:", len(corpus_texts))
print("Embedding shape:", corpus_embeddings.shape)

Number of docs: 5
Embedding shape: (5, 384)


In [10]:
embedding_dim = corpus_embeddings.shape[1]

index = faiss.IndexFlatL2(embedding_dim)
index.add(corpus_embeddings)

print("FAISS index size:", index.ntotal)

FAISS index size: 5
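
For reference, `faiss.IndexFlatL2` performs an exact, exhaustive search and reports squared Euclidean distances. The pure-Python sketch below mimics that behaviour on a toy corpus so the numbers returned later are easier to read. `flat_l2_search` and the toy vectors are illustrative assumptions; this is not how FAISS is implemented internally.

```python
def flat_l2_search(corpus, query, k):
    """Brute-force nearest-neighbour search using squared Euclidean distance.

    Returns (distances, indices) for the k closest corpus vectors, closest first.
    """
    scored = []
    for i, vec in enumerate(corpus):
        # Squared L2 distance: sum of squared coordinate differences (no sqrt).
        dist = sum((a - b) ** 2 for a, b in zip(vec, query))
        scored.append((dist, i))

    scored.sort()  # smallest distance first; ties broken by index
    top = scored[:k]
    return [d for d, _ in top], [i for _, i in top]


# Toy 2-D corpus standing in for the 384-dimensional embeddings.
corpus = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]]
distances, indices = flat_l2_search(corpus, [1.0, 1.0], k=2)
print(distances, indices)  # [1.0, 2.0] [1, 0]
```

A flat index scans every vector on each query, which is fine for five documents but grows linearly with corpus size.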


In [11]:
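def retrieve_docs(query, k=3):
    """Return the k nearest documents to `query`, each with its FAISS distance."""
    # Never ask for more neighbours than the index holds.
    k = min(k, index.ntotal)
    if k == 0:
        return []

    q_emb = embed_model.encode([query], convert_to_numpy=True).astype("float32")

    # IndexFlatL2 returns squared L2 distances, smallest (closest) first.
    distances, indices = index.search(q_emb, k)

    results = []
    for dist, idx in zip(distances[0], indices[0]):
        if idx < 0:  # FAISS pads missing results with -1
            continue
        doc = {**all_documents[idx], "distance": float(dist)}
        results.append(doc)
    return results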

In [12]:
test_query = "traffic delays in Metro Vancouver"
retrieved = retrieve_docs(test_query, k=3)

for r in retrieved:
    print(r["source"], "-", r["title"])
    print("distance:", r["distance"])
    print("url:", r["url"])
    print()

Global News BC - Hong Kong Canadians reeling after deadly highrise inferno
distance: 1.5711002349853516
url: https://globalnews.ca/news/11545824/hong-kong-canadians-reeling-highrise-inferno/

Global News BC - B.C. man gets months-long sentence for assault, threats, as he awaits murder trial
distance: 1.7798620462417603
url: https://globalnews.ca/news/11546845/james-plover-assault-sentence/

Global News BC - B.C. premier calls pipeline MOU an ‘energy vampire’, First Nations call it a ‘pipe dream’
distance: 1.8841784000396729
url: https://globalnews.ca/news/11546713/bc-reaction-david-eby-pipeline-first-nations/
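
None of the three hits is about traffic, and the distances reflect that. A small sketch for reading them: if the embeddings are unit-length (an assumption worth checking with `np.linalg.norm(corpus_embeddings, axis=1)`), the squared L2 distance d that `IndexFlatL2` reports relates to cosine similarity by cos = 1 - d / 2. `squared_l2_to_cosine` is a hypothetical helper name.

```python
def squared_l2_to_cosine(distance):
    """Convert a squared L2 distance between unit vectors to cosine similarity.

    For unit vectors a and b: ||a - b||^2 = 2 - 2 * cos(a, b).
    """
    return 1.0 - distance / 2.0


# Distances copied from the retrieval output above.
for d in (1.5711002349853516, 1.7798620462417603, 1.8841784000396729):
    print(f"distance {d:.3f} -> cosine similarity {squared_l2_to_cosine(d):.3f}")
```

Under that assumption the three results have cosine similarities of roughly 0.21, 0.11 and 0.06, which is weak, and a distance of 2.0 corresponds to a cosine similarity of 0.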



In [13]:
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("openai-community/gpt2")
gpt2 = AutoModelForCausalLM.from_pretrained("openai-community/gpt2")
tokenizer.pad_token = tokenizer.eos_token



In [15]:
import re

def extract_first_sentences(text, n=2):
    """Return the first n sentences from a block of text."""
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    sentences = [s for s in sentences if s]
    if not sentences:
        return ""
    return " ".join(sentences[:n])


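def summarize_query_with_rag(query, k=3, distance_threshold=2.0):
    """
    Retrieve the top-k news articles for a query and return short, extractive summaries.

    Articles whose FAISS distance exceeds distance_threshold are dropped.
    The summarizer is non-generative (GPT-2 is not called): it only quotes
    retrieved text, so it cannot hallucinate new content.
    """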
    retrieved = retrieve_docs(query, k=k)

    if len(retrieved) == 0:
        return "No relevant news found for this query."

    summaries = []

    for doc in retrieved:
        dist = doc.get("distance", None)
        if dist is not None and dist > distance_threshold:
            continue

        source = doc["source"]
        title = doc["title"]
        text = doc["text"]
        url = doc.get("url", "")

        short = extract_first_sentences(text, n=2)
        if not short:
            continue

        line = f"{short} (Source: {source} — {title})"
        if url:
            line += f"\nLink: {url}"

        summaries.append(line)

    if not summaries:
        return "No strong matches found for this query."

    return "\n\n".join(summaries)
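
One limitation of `extract_first_sentences`: the regex splits after any period followed by whitespace, so an abbreviation such as "B.C." in the middle of a sentence ends the "sentence" early. A minimal sketch of one workaround, re-joining pieces that end in a known abbreviation. `split_sentences`, the `ABBREVIATIONS` list and the sample text are illustrative assumptions, not part of the notebook.

```python
import re

# Hypothetical, deliberately short list; extend it for the feeds in use.
ABBREVIATIONS = ("B.C.", "Mr.", "Mrs.", "Dr.")


def split_sentences(text):
    """Split on sentence-ending punctuation, re-joining pieces that end in a known abbreviation."""
    pieces = [p for p in re.split(r'(?<=[.!?])\s+', text.strip()) if p]
    sentences = []
    for piece in pieces:
        # If the previous piece stopped at an abbreviation, it was a false split.
        if sentences and sentences[-1].endswith(ABBREVIATIONS):
            sentences[-1] += " " + piece
        else:
            sentences.append(piece)
    return sentences


# Made-up sample text for illustration only.
text = "The B.C. premier spoke on Thursday. First Nations leaders responded."
print(split_sentences(text))
```

The trade-off is that a sentence which genuinely ends in an abbreviation (for example "... in Kelowna, B.C.") gets merged with the one after it, so this is a heuristic rather than a full sentence tokenizer.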


In [16]:
print(summarize_query_with_rag("what happened recently?", k=2))

The man accused in a high-profile killing of his estranged wife has been sentenced in a separate case of choking and uttering threats in Kelowna, B.C. (Source: Global News BC — B.C. man gets months-long sentence for assault, threats, as he awaits murder trial)
Link: https://globalnews.ca/news/11546845/james-plover-assault-sentence/

“This is not talked about nearly enough,” Scott said. “It’s something that’s really coming to the forefront in more recent years.” (Source: Global News BC — Postpartum depression in the spotlight through Okanagan community auction)
Link: https://globalnews.ca/news/11545383/postpartum-depression-spotlight-okanagan-community-auction/
