<a href="https://colab.research.google.com/github/Saketh2611/bbc-news-summerizer/blob/main/VectorSearch.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# IMPORTING LIBRARIES FOR WEB SCRAPING 📄

In [1]:
import requests
from bs4 import BeautifulSoup
import json
from datetime import datetime

WEB SCRAPING

In [2]:
def scrape_bbc_rss(limit=20):
    """Fetch the BBC News RSS feed and return up to `limit` articles as dicts."""
    url = 'https://feeds.bbci.co.uk/news/rss.xml'
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    soup = BeautifulSoup(response.content, 'xml')  # the 'xml' parser requires lxml

    items = soup.find_all('item')[:limit]

    articles = []
    for item in items:
        title = item.title.text
        link = item.link.text

        # Fetch the full article body (optional; falls back to empty text on failure)
        try:
            article_res = requests.get(link, headers={'User-Agent': 'Mozilla/5.0'}, timeout=10)
            article_res.raise_for_status()
            article_soup = BeautifulSoup(article_res.content, 'html.parser')
            paragraphs = article_soup.select('article p')
            text = ' '.join(p.get_text(strip=True) for p in paragraphs)
        except requests.RequestException:
            text = ""

        articles.append({
            'title': title,
            'url': link,
            'text': text[:1000],  # keep only the first 1000 characters
            'published_date': datetime.now().isoformat(),  # scrape time, not the feed's pubDate
            'category': 'bbc'
        })

    # Print the scraped articles
    print("\n✅ Sample BBC News Articles:\n")
    for i, article in enumerate(articles):
        print(f"{i+1}. 📰 {article['title']}")
        print(f"   🔗 {article['url']}")
        print(f"   📄 {article['text']}...\n")

    return articles
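
The scraper above stamps each article with `datetime.now()`, which records when the scrape ran rather than when the article was published. RSS 2.0 items usually carry a `pubDate` element in RFC 822 format. The following is a minimal sketch of reading it with the standard library only; the inline sample feed and the helper name `parse_rss_items` are made up for illustration, and the real BBC feed may be shaped differently.

```python
import xml.etree.ElementTree as ET
from email.utils import parsedate_to_datetime

# Hypothetical sample feed, shaped like a typical RSS 2.0 document.
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <item>
      <title>First headline</title>
      <link>https://example.com/news/1</link>
      <pubDate>Mon, 07 Jul 2025 10:15:00 GMT</pubDate>
    </item>
    <item>
      <title>Second headline</title>
      <link>https://example.com/news/2</link>
    </item>
  </channel>
</rss>"""


def parse_rss_items(xml_text, limit=20):
    """Return title, url, and ISO-formatted publication date for each item."""
    root = ET.fromstring(xml_text)
    parsed = []
    for item in root.iter("item"):
        if len(parsed) >= limit:
            break
        pub_date = item.findtext("pubDate")
        parsed.append({
            "title": item.findtext("title", default=""),
            "url": item.findtext("link", default=""),
            # Items without a pubDate keep None instead of a made-up timestamp.
            "published_date": (
                parsedate_to_datetime(pub_date).isoformat() if pub_date else None
            ),
        })
    return parsed


items = parse_rss_items(SAMPLE_RSS)
for entry in items:
    print(entry["title"], entry["published_date"])
```

Keeping `None` for a missing date makes it explicit in the stored payload that the publication time is unknown.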




SAVING ARTICLES IN A JSON FILE

In [3]:
articles = scrape_bbc_rss()

# Save to JSON
with open('articles.json', 'w') as f:
    json.dump(articles, f, indent=2)

# To download the file to your computer
from google.colab import files
files.download('articles.json')



✅ Sample BBC News Articles:

1. 📰 What we know so far about the Texas flood victims
   🔗 https://www.bbc.com/news/articles/c5ygl8lpyyqo
   📄 An eight-year-old girl and the director of an all-girls' summer camp are among the victims of flash floods in Texas that have claimed dozens of lives. Officials say most of the victims have been identified. Authorities have not yet released any names publicly. Here is what we know so far about the victims, many of whom were children. Renee Smajstrla, 8, was at Camp Mystic when flooding swept through the summer camp for girls, her uncle said in a Facebook post. "Renee has been found and while not the outcome we prayed for, the social media outreach likely assisted the first responders in helping to identify her so quickly," wrote Shawn Salta, of Maryland. "We are thankful she was with her friends and having the time of her life, as evidenced by this picture from yesterday," he wrote. "She will forever be living her best life at Camp Mystic." Camp 


# 🔗 Generating text embeddings using SentenceTransformer

In [4]:
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-mpnet-base-v2')



In [5]:
with open('articles.json', 'r') as f:
    articles = json.load(f)


In [6]:
embeddings = model.encode(
    [f"{article['title']}. {article['text']}" for article in articles],
    normalize_embeddings=True
)
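
`normalize_embeddings=True` scales each vector to unit length, so the cosine similarity used by the search later on reduces to a plain dot product. Here is a small pure-Python sketch of that identity, with toy 3-dimensional vectors standing in for the 768-dimensional embeddings; the helper names are illustrative and not part of `sentence_transformers`.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a, b):
    """Cosine similarity computed from the raw (unnormalized) vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


# Toy vectors standing in for two article embeddings.
a = [3.0, 4.0, 0.0]
b = [1.0, 2.0, 2.0]

a_unit, b_unit = l2_normalize(a), l2_normalize(b)

# After normalization, the dot product equals the cosine similarity.
print(math.isclose(dot(a_unit, b_unit), cosine_similarity(a, b)))
```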


In [8]:
! pip install qdrant_client



# 📦 Pushing Embeddings to Qdrant

In [9]:
import os
from qdrant_client import QdrantClient

# Never hardcode credentials in a shared notebook. QDRANT_API_KEY is an
# environment variable name chosen here; set it before running this cell.
client = QdrantClient(
    url="https://23a37241-1707-4f1a-8f5e-47c00502551d.us-west-1-0.aws.cloud.qdrant.io:6333",
    api_key=os.environ["QDRANT_API_KEY"]
)


CREATING COLLECTIONS

In [10]:
from qdrant_client.models import VectorParams, Distance

collection_name = "news_articles"

# Create the collection only if it does not exist yet
if not client.collection_exists(collection_name=collection_name):
    client.create_collection(
        collection_name=collection_name,
        # 768 is the embedding size of all-mpnet-base-v2
        vectors_config=VectorParams(size=768, distance=Distance.COSINE)
    )
    print(f"✅ Created collection '{collection_name}'")
else:
    # Reuse the existing collection; upserting the same ids overwrites old points
    print(f"⚠️ Collection '{collection_name}' already exists")



⚠️ Collection 'news_articles' already exists


PUSH EMBEDDINGS TO QDRANT

In [11]:
from qdrant_client.models import PointStruct

# Push to Qdrant
points = [
    PointStruct(
        id=i,
        vector=embeddings[i].tolist(),  # convert the NumPy row to a plain list
        payload=articles[i]  # includes title, text, url, etc.
    )
    for i in range(len(articles))
]

client.upsert(
    collection_name="news_articles",
    points=points
)


UpdateResult(operation_id=0, status=<UpdateStatus.COMPLETED: 'completed'>)
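
A single `upsert` call is fine for the 20 articles scraped here. For a much larger collection it may be safer to send the points in smaller batches. The sketch below shows one way to chunk a list; `batched` is a hypothetical helper, the batch size of 64 is an arbitrary choice, and integers stand in for the `PointStruct` objects so the example runs without a Qdrant connection.

```python
def batched(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# Stand-in for the list of PointStruct objects built above.
fake_points = list(range(10))
batches = list(batched(fake_points, 4))
print(batches)

# With a real client, each batch would be sent separately, for example:
# for batch in batched(points, 64):
#     client.upsert(collection_name="news_articles", points=batch)
```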

# TRY EXAMPLES

In [12]:
# Example query
query = "Any war between countries for now"

# Convert query to embedding
query_vector = model.encode(query, normalize_embeddings=True)

# Search Qdrant (query_points replaces the deprecated client.search)
search_result = client.query_points(
    collection_name="news_articles",
    query=query_vector.tolist(),
    limit=3  # Top 3 results
).points

# Print results
print("\n🔍 Top Matching News Articles:\n")
for i, hit in enumerate(search_result):
    payload = hit.payload
    print(f"{i+1}. 📰 {payload.get('title')}")
    print(f"   🔗 {payload.get('url')}")
    print(f"   📄 {payload.get('text')[:200]}...\n")




🔍 Top Matching News Articles:

1. 📰 Netanyahu visits US as Trump puts pressure to agree Gaza ceasefire deal
   🔗 https://www.bbc.com/news/articles/cy4ypze027ro
   📄 After 21 months of war, there are growing hopes of a new Gaza ceasefire announcement as Israel's Prime Minister Benjamin Netanyahu meets US President Donald Trump in Washington. Trump previously told ...

2. 📰 Trump threatens extra 10% tariff on nations siding with 'anti-American policies'
   🔗 https://www.bbc.com/news/articles/c1dnz7gw92zo
   📄 US President Donald Trump has warned that countries which side with the policies of the Brics alliance that go against US interests will be hit with an extra 10% tariff. "Any country aligning themselv...

3. 📰 Is the UK really any safer 20 years on from 7/7?
   🔗 https://www.bbc.com/news/articles/c14e77je72mo
   📄 There are extraordinary secret surveillance images - now largely forgotten - that in their own grainy and mysterious way, tell the story of missed opportunities that mayb
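
Conceptually, the search above ranks the stored vectors by their similarity to the query vector and returns the best few. Qdrant normally answers this with an index rather than a full scan, but a brute-force version makes the idea concrete. This is a rough sketch with toy 2-dimensional unit vectors; `top_k` is a hypothetical helper, not a Qdrant API.

```python
import heapq


def top_k(query_vec, doc_vecs, k=3):
    """Return (index, score) pairs for the k highest dot-product scores."""
    scores = [
        (idx, sum(q * d for q, d in zip(query_vec, vec)))
        for idx, vec in enumerate(doc_vecs)
    ]
    return heapq.nlargest(k, scores, key=lambda pair: pair[1])


# Toy unit vectors standing in for stored article embeddings.
docs = [
    [1.0, 0.0],
    [0.0, 1.0],
    [0.6, 0.8],
    [0.8, 0.6],
]
toy_query = [1.0, 0.0]

hits = top_k(toy_query, docs, k=2)
for idx, score in hits:
    print(idx, round(score, 2))
```

Because the vectors are unit length, the dot product used for scoring is the same quantity as the cosine distance metric configured for the collection.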

In [19]:
!pip install transformers sentencepiece




# 🧾 Summarizing the retrieved context with BART

In [20]:
from transformers import pipeline

# Load the summarization pipeline using BART
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")



Device set to use cuda:0


In [24]:
# Get top 3 articles' text from Qdrant results
context = "\n\n".join([hit.payload.get("text", "") for hit in search_result])
# Truncate if the context is too long
if len(context.split()) > 1000:
    context = " ".join(context.split()[:1000])
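
The truncation above caps the context at 1000 whitespace-separated words. The same step can be wrapped in a small reusable function, sketched below; `truncate_words` is a hypothetical helper name. Note that a word count is only a rough proxy for the model's input limit, which is measured in tokens, so a word cap does not guarantee the text fits.

```python
def truncate_words(text, max_words=1000):
    """Keep at most `max_words` whitespace-separated words of `text`."""
    words = text.split()
    if len(words) <= max_words:
        return text  # short enough: return unchanged, original spacing included
    return " ".join(words[:max_words])


short = truncate_words("one two three", max_words=5)
clipped = truncate_words("one two  three\nfour five six", max_words=4)
print(short)
print(clipped)
```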


In [25]:
# Generate the summary (truncation=True clips input that exceeds the model's token limit)
summary_output = summarizer(context, max_length=250, min_length=80, do_sample=False, truncation=True)[0]['summary_text']

# Display the result
print("🧾 Summary:\n")
print(summary_output)


🧾 Summary:

There are growing hopes of a new Gaza ceasefire announcement as Israel's Prime Minister Benjamin Netanyahu meets US President Donald Trump in Washington. Trump previously told reporters he had been "very firm" with Netanyahu about ending the conflict and that he thought "we'll have a deal" this week. Indirect talks between Israel and Hamas on a US-sponsored proposal for a 60-day ceasefire and hostage release deal resumed in Qatar on Sunday evening.
