# BBC News RAG System

This notebook implements a Retrieval-Augmented Generation (RAG) system for the BBC news dataset.

## What is RAG?
RAG combines three steps:
1. **Retrieval**: Finding relevant documents/articles from the dataset
2. **Augmentation**: Adding retrieved context to the user's question
3. **Generation**: Using an LLM to generate answers based on the retrieved context
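
The three steps above can be sketched end to end without any external services. This is a toy illustration only: `toy_retrieve`, `toy_augment`, and `toy_rag` are hypothetical names, retrieval is plain word overlap rather than embeddings, and the "LLM" is a stand-in callable that you would replace with a real model call.

```python
from typing import Callable, List


def toy_retrieve(question: str, documents: List[str], k: int = 2) -> List[str]:
    """Rank documents by how many lowercase words they share with the question."""
    q_words = set(question.lower().split())
    ranked = sorted(
        documents,
        key=lambda doc: len(q_words & set(doc.lower().split())),
        reverse=True,
    )
    return ranked[:k]


def toy_augment(question: str, retrieved: List[str]) -> str:
    """Prepend the retrieved passages to the question to form the prompt."""
    context = "\n".join(f"- {doc}" for doc in retrieved)
    return f"Context:\n{context}\n\nQuestion: {question}"


def toy_rag(question: str, documents: List[str],
            generate: Callable[[str], str], k: int = 2) -> str:
    """Retrieve, augment, then hand the prompt to the generator."""
    retrieved = toy_retrieve(question, documents, k)
    prompt = toy_augment(question, retrieved)
    return generate(prompt)


docs = [
    "Time Warner profits jumped in the fourth quarter",
    "The football final ended in a draw",
    "New broadband technology launched this week",
]

# Stand-in for an LLM call (assumption): it simply echoes the prompt it receives.
def echo(prompt: str) -> str:
    return prompt

print(toy_rag("What happened to Time Warner profits?", docs, echo, k=1))
```

The rest of the notebook follows the same shape, with sentence-transformer embeddings and FAISS in place of word overlap and the OpenAI API in place of `echo`.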

## System Components
- Document chunking and preprocessing
- Vector embeddings using sentence-transformers
- Vector store (FAISS)
- Retrieval system
- Generation with OpenAI API
- Question-answering pipeline


In [None]:
import os
import pandas as pd
import numpy as np
from sentence_transformers import SentenceTransformer
import faiss
from openai import OpenAI
import json
from typing import List, Dict, Tuple
import warnings
warnings.filterwarnings('ignore')

# Set up the API key
api_key = os.getenv("OPENAI_API_KEY")
if not api_key:
    print("⚠️  WARNING: No OpenAI API key found!")
    print("Please set the OPENAI_API_KEY environment variable "
          "(PowerShell: $env:OPENAI_API_KEY = 'your-api-key-here'; "
          "bash: export OPENAI_API_KEY='your-api-key-here')")
    client = None
else:
    print("✅ OpenAI API key loaded successfully")
    client = OpenAI(api_key=api_key)

# Load the full BBC dataset
df = pd.read_csv("../data/raw_bbc.csv")
print(f"Loaded {len(df)} articles from BBC dataset")
print(f"Categories: {df['Category'].unique()}")

# Load the sentence transformer model
print("Loading sentence transformer model...")
model = SentenceTransformer('all-MiniLM-L6-v2')  # Fast and efficient model
print("✅ Sentence transformer model loaded")



✅ OpenAI API key loaded successfully
Loaded 2225 articles from BBC dataset
Categories: ['business' 'entertainment' 'politics' 'sport' 'tech']
Loading sentence transformer model...
✅ Sentence transformer model loaded


## 1. Document Preprocessing and Chunking

We'll split long articles into smaller, overlapping chunks so that each embedding covers one focused passage, which improves retrieval quality.
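
To make the idea of overlap concrete, here is a minimal fixed-step sketch on a short string. It is a simplification of this notebook's chunker: `sliding_chunks` is a hypothetical helper that ignores sentence boundaries and only shows how consecutive windows share characters.

```python
from typing import List


def sliding_chunks(text: str, chunk_size: int, overlap: int) -> List[str]:
    """Cut text into windows of chunk_size characters that share `overlap` characters."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    step = chunk_size - overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window has reached the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks


demo = "abcdefghijklmnopqrstuvwxyz"
for chunk in sliding_chunks(demo, chunk_size=10, overlap=3):
    print(chunk)
```

Each window starts with the last three characters of the previous one, so a sentence that straddles a boundary still appears intact in at least one chunk, provided it is shorter than the overlap.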


In [None]:
# Function to split text into overlapping chunks
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> List[str]:
    """
    Split text into overlapping chunks.
    
    Args:
        text: Input text to chunk
        chunk_size: Target size of each chunk in characters (a chunk can run one
            character over when a sentence ends exactly at the boundary)
        overlap: Number of characters to overlap between chunks
    
    Returns:
        List of text chunks
    """
    if len(text) <= chunk_size:
        return [text]
    
    chunks = []
    start = 0
    
    while start < len(text):
        end = start + chunk_size
        
        # Try to break at sentence boundary
        if end < len(text):
            # Look for sentence endings within the last 100 characters
            for i in range(min(100, chunk_size)):
                if text[end - i] in '.!?':
                    end = end - i + 1
                    break
        
        chunk = text[start:end].strip()
        if chunk:
            chunks.append(chunk)
        
        start = end - overlap
        
        # Prevent infinite loop
        if start >= len(text) - overlap:
            break
    
    return chunks


def preprocess_articles(df: pd.DataFrame, chunk_size: int = 500) -> List[Dict]:
    """
    Preprocess articles and create chunks for RAG.
    
    Returns:
        List of dictionaries containing chunk information
    """
    processed_chunks = []
    
    for idx, row in df.iterrows():
        text = row['Text']
        category = row['Category']
        filename = row.get('Filename', f'article_{idx}')
        
        # Create chunks
        chunks = chunk_text(text, chunk_size=chunk_size)
        
        for chunk_idx, chunk in enumerate(chunks):
            processed_chunks.append({
                'article_id': idx,
                'chunk_id': f"{idx}_{chunk_idx}",
                'text': chunk,
                'category': category,
                'filename': filename,
                'chunk_length': len(chunk)
            })
    
    return processed_chunks

# Test chunking on a sample article
sample_text = df['Text'].iloc[0]
print(f"Original article length: {len(sample_text)} characters")
sample_chunks = chunk_text(sample_text, chunk_size=300)
print(f"Number of chunks: {len(sample_chunks)}")
print(f"First chunk: {sample_chunks[0][:200]}...")
print(f"Second chunk: {sample_chunks[1][:200]}...")


Original article length: 2559 characters
Number of chunks: 12
First chunk: Ad sales boost Time Warner profit

Quarterly profits at US media giant TimeWarner jumped 76% to $1.13bn (£600m) for the three months to December, from $639m year-earlier.

The firm, which is now one o...
Second chunk: sales of high-speed internet connections and higher advert sales. TimeWarner said fourth quarter sales rose 2% to $11.1bn from $10.9bn. Its profits were buoyed by one-off gains which offset a profit d...


## 2. Create Vector Embeddings and Index

We'll create embeddings for all text chunks and build a FAISS index for fast retrieval.
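
The index in the next cell relies on the fact that the inner product of two L2-normalized vectors equals their cosine similarity. A minimal pure-Python sketch of that identity follows; the helper names are hypothetical, and FAISS and NumPy do the same arithmetic in vectorized form.

```python
import math
from typing import List


def dot(a: List[float], b: List[float]) -> float:
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(vec: List[float]) -> List[float]:
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(dot(vec, vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between a and b, computed from the raw vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


a = [3.0, 4.0, 0.0]
b = [4.0, 3.0, 0.0]

# Both lines print a value of about 0.96 (24 / 25)
print(cosine_similarity(a, b))
print(dot(l2_normalize(a), l2_normalize(b)))
```

Normalizing once at indexing time means each query needs only an inner product, which is why the notebook pairs `faiss.normalize_L2` with an inner-product index.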


In [3]:
def create_embeddings_and_index(chunks: List[Dict], model: SentenceTransformer) -> Tuple[np.ndarray, faiss.Index, List[Dict]]:
    """
    Create embeddings for all chunks and build FAISS index.
    
    Returns:
        embeddings: numpy array of embeddings
        index: FAISS index for similarity search
        chunk_metadata: metadata for each chunk
    """
    print(f"Creating embeddings for {len(chunks)} chunks...")
    
    # Extract text for embedding
    texts = [chunk['text'] for chunk in chunks]
    
    # Create embeddings
    embeddings = model.encode(texts, show_progress_bar=True)
    
    # Create FAISS index
    dimension = embeddings.shape[1]
    index = faiss.IndexFlatIP(dimension)  # Inner product; equals cosine similarity once vectors are L2-normalized
    
    # Normalize embeddings for cosine similarity
    faiss.normalize_L2(embeddings)
    
    # Add embeddings to index
    index.add(embeddings.astype('float32'))
    
    print(f"✅ Created FAISS index with {index.ntotal} vectors")
    
    return embeddings, index, chunks

# Process a sample of articles first (for testing)
print("Processing sample of 50 articles for testing...")
sample_df = df.head(50)
sample_chunks = preprocess_articles(sample_df, chunk_size=400)
print(f"Created {len(sample_chunks)} chunks from {len(sample_df)} articles")

# Create embeddings and index
embeddings, index, chunk_metadata = create_embeddings_and_index(sample_chunks, model)

print(f"\nChunk statistics:")
chunk_lengths = [chunk['chunk_length'] for chunk in chunk_metadata]
print(f"Average chunk length: {np.mean(chunk_lengths):.1f} characters")
print(f"Min chunk length: {np.min(chunk_lengths)} characters")
print(f"Max chunk length: {np.max(chunk_lengths)} characters")


Processing sample of 50 articles for testing...
Created 308 chunks from 50 articles
Creating embeddings for 308 chunks...


Batches: 100%|██████████| 10/10 [00:02<00:00,  4.04it/s]

✅ Created FAISS index with 308 vectors

Chunk statistics:
Average chunk length: 346.8 characters
Min chunk length: 50 characters
Max chunk length: 401 characters





## 3. Retrieval System

Implement the retrieval component that finds relevant chunks for a given question.
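
Conceptually, retrieval scores every chunk vector against the question vector and keeps the `k` best. The sketch below does this by brute force in pure Python, assuming the vectors are already normalized. `top_k` is a hypothetical helper, not the FAISS API, and the two-dimensional vectors are made up for illustration.

```python
import heapq
from typing import Dict, List


def top_k(query_vec: List[float], chunk_vecs: List[List[float]],
          metadata: List[Dict], k: int = 3) -> List[Dict]:
    """Return metadata for the k chunks with the highest inner-product score."""
    scored = [
        (sum(q * c for q, c in zip(query_vec, vec)), i)
        for i, vec in enumerate(chunk_vecs)
    ]
    # Never ask for more results than there are chunks
    best = heapq.nlargest(min(k, len(scored)), scored)
    results = []
    for rank, (score, i) in enumerate(best, start=1):
        results.append({**metadata[i], "similarity_score": score, "rank": rank})
    return results


# Toy unit vectors standing in for real embeddings (assumption)
chunk_vecs = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]
metadata = [{"text": "business chunk"}, {"text": "sport chunk"}, {"text": "mixed chunk"}]
query = [1.0, 0.0]

for hit in top_k(query, chunk_vecs, metadata, k=2):
    print(hit["rank"], hit["text"], round(hit["similarity_score"], 3))
```

A flat FAISS index performs the same exhaustive comparison, only much faster; the shape of the result (metadata plus score and rank) matches what the retrieval function in this section returns.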


In [4]:
def retrieve_relevant_chunks(question: str, index: faiss.Index, model: SentenceTransformer, 
                           chunk_metadata: List[Dict], k: int = 5) -> List[Dict]:
    """
    Retrieve the most relevant chunks for a given question.
    
    Args:
        question: User's question
        index: FAISS index
        model: Sentence transformer model
        chunk_metadata: Metadata for each chunk
        k: Number of chunks to retrieve
    
    Returns:
        List of relevant chunks with metadata
    """
    # Create embedding for the question
    question_embedding = model.encode([question])
    faiss.normalize_L2(question_embedding)
    
    # Search for similar chunks
    scores, indices = index.search(question_embedding.astype('float32'), k)
    
    # Get relevant chunks
    relevant_chunks = []
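    for score, idx in zip(scores[0], indices[0]):
        # FAISS pads results with -1 when the index holds fewer than k vectors;
        # without this check, chunk_metadata[-1] would silently return the last chunk
        if idx < 0:
            continue
        chunk_info = chunk_metadata[idx].copy()
        chunk_info['similarity_score'] = float(score)
        chunk_info['rank'] = len(relevant_chunks) + 1
        relevant_chunks.append(chunk_info)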
    
    return relevant_chunks

def format_context_for_llm(chunks: List[Dict]) -> str:
    """
    Format retrieved chunks into context for the LLM.
    """
    context_parts = []
    
    for i, chunk in enumerate(chunks, 1):
        context_parts.append(f"[Source {i}] (Category: {chunk['category']}, Score: {chunk['similarity_score']:.3f})")
        context_parts.append(chunk['text'])
        context_parts.append("\n")
    
    return "\n".join(context_parts)

# Test retrieval with a sample question
test_question = "What happened with Time Warner profits?"
print(f"Testing retrieval with question: '{test_question}'")

relevant_chunks = retrieve_relevant_chunks(test_question, index, model, chunk_metadata, k=3)

print(f"\nRetrieved {len(relevant_chunks)} relevant chunks:")
for chunk in relevant_chunks:
    print(f"\nRank {chunk['rank']} (Score: {chunk['similarity_score']:.3f}):")
    print(f"Category: {chunk['category']}")
    print(f"Text: {chunk['text'][:200]}...")

print(f"\nFormatted context:")
print(format_context_for_llm(relevant_chunks)[:500] + "...")


Testing retrieval with question: 'What happened with Time Warner profits?'

Retrieved 3 relevant chunks:

Rank 1 (Score: 0.693):
Category: business
Text: Ad sales boost Time Warner profit

Quarterly profits at US media giant TimeWarner jumped 76% to $1.13bn (£600m) for the three months to December, from $639m year-earlier.

The firm, which is now one o...

Rank 2 (Score: 0.678):
Category: business
Text: AOL's existing customers for high-speed broadband. TimeWarner also has to restate 2000 and 2003 results following a probe by the US Securities Exchange Commission (SEC), which is close to concluding.
...

Rank 3 (Score: 0.615):
Category: business
Text: rth quarter sales rose 2% to $11.1bn from $10.9bn. Its profits were buoyed by one-off gains which offset a profit dip at Warner Bros, and less users for AOL.

Time Warner said on Friday that it now ow...

Formatted context:
[Source 1] (Category: business, Score: 0.693)
Ad sales boost Time Warner profit

Quarterly profits at US media giant

## 4. Generation System

Implement the generation component using OpenAI's API.
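
Before any API call, the retrieved context and the question have to be packed into chat messages. The following sketch shows only that packing step, with no network access. `build_messages` and its `max_context_chars` cap are hypothetical additions for illustration; the cell below builds its prompt inline and does not truncate the context.

```python
from typing import Dict, List


def build_messages(question: str, context: str,
                   max_context_chars: int = 4000) -> List[Dict[str, str]]:
    """Build a system/user message pair that grounds the answer in the context."""
    # Crude character cap so an oversized context cannot blow up the prompt (assumption)
    trimmed = context[:max_context_chars]
    user_prompt = (
        "Context from BBC News Articles:\n"
        f"{trimmed}\n\n"
        f"Question: {question}\n\n"
        "Answer using only the context above. "
        "If the context is insufficient, say so."
    )
    return [
        {"role": "system", "content": "You answer questions based on BBC news articles."},
        {"role": "user", "content": user_prompt},
    ]


messages = build_messages(
    "What happened with Time Warner profits?",
    "[Source 1] Quarterly profits at TimeWarner jumped 76%.",
)
for message in messages:
    print(message["role"], "->", message["content"][:60])
```

A list in this role/content shape is what the `messages` argument of `client.chat.completions.create` in the next cell receives.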


In [5]:
def generate_answer(question: str, context: str, model: str = "gpt-3.5-turbo") -> str:
    """
    Generate an answer using OpenAI API with retrieved context.
    
    Args:
        question: User's question
        context: Retrieved context from documents
        model: OpenAI model to use
    
    Returns:
        Generated answer
    """
    if not client:
        return "Error: OpenAI API client not available. Please set your API key."
    
    prompt = f"""You are a helpful assistant that answers questions based on the provided context from BBC news articles.

Context from BBC News Articles:
{context}

Question: {question}

Instructions:
- Answer the question based ONLY on the information provided in the context above
- If the context doesn't contain enough information to answer the question, say so
- Be specific and cite relevant details from the context
- If there are multiple sources, mention the different perspectives
- Keep your answer concise but informative

Answer:"""

    try:
        response = client.chat.completions.create(
            model=model,
            messages=[
                {"role": "system", "content": "You are a helpful assistant that answers questions based on BBC news articles."},
                {"role": "user", "content": prompt}
            ],
            max_tokens=500,
            temperature=0.1
        )
        
        return response.choices[0].message.content.strip()
    
    except Exception as e:
        return f"Error generating answer: {str(e)}"

# Test generation with our sample question
if client:
    context = format_context_for_llm(relevant_chunks)
    answer = generate_answer(test_question, context)
    
    print(f"Question: {test_question}")
    print(f"\nAnswer: {answer}")
else:
    print("OpenAI client not available. Please set your API key to test generation.")


Question: What happened with Time Warner profits?

Answer: Error generating answer: Error code: 401 - {'error': {'message': 'Incorrect API key provided: sk-de9ab***********************5140. You can find your API key at https://platform.openai.com/account/api-keys.', 'type': 'invalid_request_error', 'param': None, 'code': 'invalid_api_key'}}


## 5. Complete RAG Pipeline

Combine retrieval and generation into a complete RAG system.


In [6]:
class BBCRAGSystem:
    """
    Complete RAG system for BBC news articles.
    """
    
    def __init__(self, df: pd.DataFrame, model_name: str = 'all-MiniLM-L6-v2'):
        self.df = df
        self.model = SentenceTransformer(model_name)
        self.index = None
        self.chunk_metadata = None
        self.client = OpenAI(api_key=os.getenv("OPENAI_API_KEY")) if os.getenv("OPENAI_API_KEY") else None
    
    def build_index(self, chunk_size: int = 500, sample_size: int = None):
        """
        Build the vector index from articles.
        """
        print(f"Building RAG index...")
        
        # Use sample if specified
        df_to_process = self.df.head(sample_size) if sample_size else self.df
        
        # Create chunks
        chunks = preprocess_articles(df_to_process, chunk_size=chunk_size)
        print(f"Created {len(chunks)} chunks from {len(df_to_process)} articles")
        
        # Create embeddings and index
        embeddings, index, chunk_metadata = create_embeddings_and_index(chunks, self.model)
        
        self.index = index
        self.chunk_metadata = chunk_metadata
        
        print(f"✅ RAG index built successfully!")
        
    def ask(self, question: str, k: int = 5, model: str = "gpt-3.5-turbo") -> Dict:
        """
        Ask a question and get an answer using RAG.
        
        Returns:
            Dictionary containing answer, sources, and metadata
        """
        if self.index is None:
            return {"error": "Index not built. Please call build_index() first."}
        
        # Retrieve relevant chunks
        relevant_chunks = retrieve_relevant_chunks(
            question, self.index, self.model, self.chunk_metadata, k=k
        )
        
        # Format context
        context = format_context_for_llm(relevant_chunks)
        
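        # Generate answer
        # Note: generate_answer() sends the request through the module-level
        # `client`; self.client is only used here to check that a key was set
        if self.client:
            answer = generate_answer(question, context, model)
        else:
            answer = "OpenAI API not available. Please set your API key."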
        
        return {
            "question": question,
            "answer": answer,
            "sources": relevant_chunks,
            "context": context,
            "num_sources": len(relevant_chunks)
        }
    
    def save_index(self, index_path: str, metadata_path: str):
        """
        Save the FAISS index and metadata to disk.
        """
        if self.index is None:
            print("No index to save.")
            return
        
        faiss.write_index(self.index, index_path)
        
        with open(metadata_path, 'w') as f:
            json.dump(self.chunk_metadata, f, indent=2)
        
        print(f"Index saved to {index_path}")
        print(f"Metadata saved to {metadata_path}")
    
    def load_index(self, index_path: str, metadata_path: str):
        """
        Load the FAISS index and metadata from disk.
        """
        self.index = faiss.read_index(index_path)
        
        with open(metadata_path, 'r') as f:
            self.chunk_metadata = json.load(f)
        
        print(f"Index loaded from {index_path}")
        print(f"Metadata loaded from {metadata_path}")

# Initialize RAG system
rag_system = BBCRAGSystem(df)

# Build index with sample data (50 articles)
rag_system.build_index(chunk_size=400, sample_size=50)

print(f"\nRAG system ready! Index contains {rag_system.index.ntotal} chunks")


Building RAG index...
Created 308 chunks from 50 articles
Creating embeddings for 308 chunks...


Batches: 100%|██████████| 10/10 [00:02<00:00,  4.67it/s]

✅ Created FAISS index with 308 vectors
✅ RAG index built successfully!

RAG system ready! Index contains 308 chunks





## 6. Test the RAG System

Let's test our RAG system with various questions about the BBC news articles.


In [7]:
def test_rag_system(rag_system: BBCRAGSystem, questions: List[str]):
    """
    Test the RAG system with a list of questions.
    """
    for i, question in enumerate(questions, 1):
        print(f"\n{'='*60}")
        print(f"Question {i}: {question}")
        print(f"{'='*60}")
        
        result = rag_system.ask(question, k=3)
        
        if "error" in result:
            print(f"Error: {result['error']}")
            continue
        
        print(f"\nAnswer:")
        print(result['answer'])
        
        print(f"\nSources ({result['num_sources']}):")
        for j, source in enumerate(result['sources'], 1):
            print(f"\n{j}. {source['category']} (Score: {source['similarity_score']:.3f})")
            print(f"   {source['text'][:150]}...")

# Test questions
test_questions = [
    "What happened with Time Warner profits?",
    "What are the latest developments in technology?",
    "What sports news is there?",
    "What political events are happening?",
    "What entertainment news is there?"
]

test_rag_system(rag_system, test_questions)



Question 1: What happened with Time Warner profits?

Answer:
Error generating answer: Error code: 401 - {'error': {'message': 'Incorrect API key provided: sk-de9ab***********************5140. You can find your API key at https://platform.openai.com/account/api-keys.', 'type': 'invalid_request_error', 'param': None, 'code': 'invalid_api_key'}}

Sources (3):

1. business (Score: 0.693)
   Ad sales boost Time Warner profit

Quarterly profits at US media giant TimeWarner jumped 76% to $1.13bn (£600m) for the three months to December, from...

2. business (Score: 0.678)
   AOL's existing customers for high-speed broadband. TimeWarner also has to restate 2000 and 2003 results following a probe by the US Securities Exchang...

3. business (Score: 0.615)
   rth quarter sales rose 2% to $11.1bn from $10.9bn. Its profits were buoyed by one-off gains which offset a profit dip at Warner Bros, and less users f...

Question 2: What are the latest developments in technology?

Answer:
Error generatin

## 7. Interactive Question-Answering

Create an interactive interface for asking questions.


In [8]:
def interactive_qa(rag_system: BBCRAGSystem):
    """
    Interactive question-answering interface.
    """
    print("\n" + "="*60)
    print("BBC News RAG System - Interactive Mode")
    print("="*60)
    print("Ask questions about BBC news articles!")
    print("Type 'quit' or 'exit' to stop.")
    print("Type 'help' for more options.")
    print("="*60)
    
    while True:
        question = input("\nYour question: ").strip()
        
        if question.lower() in ['quit', 'exit', 'q']:
            print("Goodbye!")
            break
        
        if question.lower() == 'help':
            print("\nAvailable commands:")
            print("- Ask any question about the news articles")
            print("- 'quit' or 'exit': Stop the program")
            print("- 'help': Show this help message")
            continue
        
        if not question:
            continue
        
        print("\nSearching for relevant information...")
        result = rag_system.ask(question, k=5)
        
        if "error" in result:
            print(f"Error: {result['error']}")
            continue
        
        print(f"\nAnswer:")
        print(result['answer'])
        
        print(f"\nSources used ({result['num_sources']}):")
        for i, source in enumerate(result['sources'], 1):
            print(f"{i}. {source['category']} - Score: {source['similarity_score']:.3f}")

# Uncomment to start interactive mode
# interactive_qa(rag_system)


## 8. Build Full Dataset Index

Process the complete BBC dataset for production use.


In [9]:
# Uncomment to build the full dataset index
# print("Building full dataset index...")
# print(f"This will process all {len(df)} articles")
# print("Estimated time: 10-20 minutes")
# 
# # Create new RAG system for full dataset
# full_rag_system = BBCRAGSystem(df)
# full_rag_system.build_index(chunk_size=500, sample_size=None)
# 
# # Save the index
# full_rag_system.save_index("../results/bbc_rag_index.faiss", "../results/bbc_rag_metadata.json")
# 
# print("\n✅ Full dataset index built and saved!")
# print(f"Index contains {full_rag_system.index.ntotal} chunks from {len(df)} articles")
