**INSTALLATIONS**

In [4]:
!pip install -q datasets transformers sentence-transformers faiss-cpu gradio torch


**IMPORTS**

In [5]:
import torch
import faiss
import numpy as np
import gradio as gr

from datasets import load_dataset
from sentence_transformers import SentenceTransformer
from transformers import pipeline


**DATASET LOADING**


In [6]:
dataset = load_dataset("ag_news", split="train[:2000]")


**Pre-process & Create Corpus**

In [7]:
documents = []

for item in dataset:
    text = item["text"].strip()
    if len(text) > 100:
        documents.append(text)


**Create Embeddings**

In [8]:
embedding_model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

doc_embeddings = embedding_model.encode(
    documents,
    show_progress_bar=True,
    convert_to_numpy=True
)


**Build FAISS Index (Search Engine)**

In [9]:
dimension = doc_embeddings.shape[1]
index = faiss.IndexFlatL2(dimension)
index.add(doc_embeddings)
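
`faiss.IndexFlatL2` is an exact (non-approximate) index: a query is compared against every stored vector and results are ranked by squared Euclidean distance. The snippet below is a minimal pure-Python sketch of that idea, not FAISS's actual implementation. The function name `flat_l2_search` and the toy vectors are hypothetical and exist only for illustration.

```python
def flat_l2_search(query, vectors, k):
    """Exhaustive nearest-neighbour search by squared L2 distance.

    Illustrative stand-in for what a flat index computes;
    returns (distances, indices) ordered from closest to farthest.
    """
    scored = []
    for idx, vec in enumerate(vectors):
        dist = sum((q - v) ** 2 for q, v in zip(query, vec))
        scored.append((dist, idx))
    scored.sort()  # smallest distance first
    top = scored[:k]
    return [d for d, _ in top], [i for _, i in top]


# Toy 2-D "embeddings" (made up for this example)
toy_vectors = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]
toy_distances, toy_indices = flat_l2_search([0.9, 0.1], toy_vectors, k=2)
print(toy_indices)  # → [1, 0]
```

Because every vector is scanned, cost grows linearly with corpus size, which is acceptable for the roughly 2,000 documents used here.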


**Search Function (Top-N Retrieval)**

In [10]:
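def search_documents(query, top_k):
    # FAISS expects an integer k; a UI slider may pass a float
    top_k = int(top_k)
    query_embedding = embedding_model.encode([query])
    distances, indices = index.search(query_embedding, top_k)
    # FAISS pads results with -1 when fewer than top_k vectors are available
    return [documents[i] for i in indices[0] if i != -1]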


**Load Summarization Model (Hugging Face)**

In [12]:
summarizer = pipeline(
    "summarization",
    model="facebook/bart-large-cnn",
    device=0 if torch.cuda.is_available() else -1
)


Device set to use cpu


**Summarize Retrieved Docs**

In [22]:
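def summarize_documents(docs, max_len):
    # Truncate by characters first so the length estimate matches the text that is summarized
    combined_text = " ".join(docs)[:4000]

    # Estimate input length (rough approximation: whitespace-separated words, not model tokens)
    input_len = len(combined_text.split())

    # Ensure summary is shorter than input
    dynamic_max_len = min(int(max_len), max(30, input_len // 2))

    summary = summarizer(
        combined_text,
        max_length=dynamic_max_len,
        min_length=min(30, dynamic_max_len - 5),
        truncation=True,  # guard against inputs longer than the model's token limit
        do_sample=False
    )

    return summary[0]["summary_text"]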
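
The length clamping in `summarize_documents` can be checked in isolation. The sketch below reproduces only that arithmetic. The helper name `summary_length_bounds` is hypothetical, and the word count is only a rough proxy for the number of tokens the model actually sees.

```python
def summary_length_bounds(word_count, max_len):
    """Return (max_length, min_length) using the same clamping as summarize_documents."""
    # Cap at half the input length, but never below a floor of 30
    dynamic_max_len = min(max_len, max(30, word_count // 2))
    # Keep min_length at most 30 and at least 5 below the cap
    min_length = min(30, dynamic_max_len - 5)
    return dynamic_max_len, min_length


# A short article: the cap drops to the floor of 30
print(summary_length_bounds(word_count=40, max_len=150))   # → (30, 25)
# Several concatenated articles: the requested cap applies
print(summary_length_bounds(word_count=600, max_len=150))  # → (150, 30)
```

For single AG News articles, which are typically short, the effective `max_length` will often be the floor of 30 rather than the value requested by the caller.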


**Create Test Samples**

In [23]:
import random

test_samples = random.sample(documents, 50)


**SEARCH EVALUATION**

In [24]:
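def evaluate_search_accuracy(top_k=3):
    # Self-retrieval check: each query is the first 50 characters of a document
    # that is already in the index, so high scores are expected.
    correct = 0

    for doc in test_samples:
        query = doc[:50]     # query from same doc
        retrieved_docs = search_documents(query, top_k)

        if doc in retrieved_docs:
            correct += 1

    accuracy = correct / len(test_samples)
    return accuracy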


In [25]:
search_accuracy = evaluate_search_accuracy(top_k=3)
print("Top-3 Search Accuracy:", search_accuracy)


Top-3 Search Accuracy: 1.0
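
The score above is a hit rate at k: the fraction of queries whose source document appears among the top-k results. Since each query is a prefix of an indexed document, the task is likely easy, which may account for the perfect score. The metric itself can be sketched independently of the retriever. `hit_rate_at_k` and the toy lists below are hypothetical and stand in for real retrieval output.

```python
def hit_rate_at_k(retrieved_lists, expected_docs):
    """Fraction of queries whose expected document appears in its retrieved list."""
    if not expected_docs:
        return 0.0
    hits = sum(
        1
        for retrieved, expected in zip(retrieved_lists, expected_docs)
        if expected in retrieved
    )
    return hits / len(expected_docs)


# Four toy queries with three retrieved items each (made-up data)
toy_retrieved = [["a", "b", "c"], ["d", "e", "f"], ["g", "h", "i"], ["a", "c", "e"]]
toy_expected = ["a", "x", "i", "e"]
toy_score = hit_rate_at_k(toy_retrieved, toy_expected)
print(toy_score)  # → 0.75
```

A stricter evaluation would use queries that are not substrings of the target document, for example paraphrased questions.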


**SUMMARY EVALUATION (ROUGE)**

In [19]:
!pip install -q evaluate rouge-score



In [26]:
import evaluate
rouge = evaluate.load("rouge")


**Generate Summaries**

In [27]:
predicted_summaries = []
reference_texts = []

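for doc in test_samples[:10]:
    summary = summarize_documents([doc], max_len=150)
    predicted_summaries.append(summary)
    # ag_news has no human-written summaries; the first 200 characters
    # of each article serve as a proxy reference
    reference_texts.append(doc[:200])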


**Compute ROUGE**

In [28]:
rouge_scores = rouge.compute(
    predictions=predicted_summaries,
    references=reference_texts
)

print(rouge_scores)


{'rouge1': np.float64(0.6010712176522719), 'rouge2': np.float64(0.5089810763092459), 'rougeL': np.float64(0.5324298909521392), 'rougeLsum': np.float64(0.5298350717079532)}
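
ROUGE-1 is based on unigram overlap between a prediction and its reference. The following simplified sketch computes an F1-style ROUGE-1 score on lowercased, whitespace-split tokens. It is meant to show what the number measures, not to reproduce the `rouge-score` package, whose tokenization differs. The function name `rouge1_f1` is hypothetical.

```python
from collections import Counter


def rouge1_f1(prediction, reference):
    """Simplified ROUGE-1 F1: clipped unigram overlap on lowercased whitespace tokens."""
    pred_counts = Counter(prediction.lower().split())
    ref_counts = Counter(reference.lower().split())
    # Counter intersection keeps the minimum count per token (clipped overlap)
    overlap = sum((pred_counts & ref_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


toy_f1 = rouge1_f1("the cat sat", "The cat sat on the mat")
print(round(toy_f1, 3))  # → 0.667
```

Because the references here are the opening 200 characters of each article, a high score mainly indicates that the generated summaries reuse wording from the start of the article.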


**Full RAG Pipeline**

In [29]:
def rag_pipeline(query, top_k, summary_length):
    retrieved_docs = search_documents(query, top_k)
    summary = summarize_documents(retrieved_docs, summary_length)
    return "\n\n".join(retrieved_docs), summary


**Gradio UI**

In [31]:
gr.Interface(
    fn=rag_pipeline,
    inputs=[
        gr.Textbox(
            label="Ask a question about the news articles",
            placeholder="e.g., Summarize recent technology news or What is this article about?"
        ),
        gr.Slider(
            minimum=1,
            maximum=5,
            value=3,
            step=1,
            label="Number of documents to retrieve (Top-K)"
        ),
        gr.Slider(
            minimum=80,
            maximum=300,
            value=150,
            step=20,
            label="Summary length (tokens)"
        )
    ],
    outputs=[
        gr.Textbox(
            label="Retrieved Relevant Documents",
            lines=8
        ),
        gr.Textbox(
            label="Generated Summary",
            lines=6
        )
    ],
    title="Document Search and Summarization using RAG",
    description=(
        "This system retrieves relevant news articles using semantic search "
        "and generates a concise summary using a pretrained summarization model "
        "(facebook/bart-large-cnn). "
        "Ask information-seeking questions related to the dataset."
    )
).launch()




