Step 1: Data Loading

In [13]:
# --- Step 1: Install required libraries (if not already installed) ---
!pip install datasets langchain sentence-transformers faiss-cpu tqdm

# --- Step 2: Load the Data ---
from datasets import load_dataset
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.docstore.document import Document
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS



# Load a small subset of 100 papers

In [14]:
full_dataset = load_dataset("franz96521/scientific_papers", split='train', streaming=True)
subset_dataset_iterable = full_dataset.take(100)
papers_data = list(subset_dataset_iterable)
print(f"Loaded {len(papers_data)} papers for processing.")

Loaded 100 papers for processing.
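
With streaming=True the dataset is read lazily, so take(100) stops after the first 100 records instead of downloading the whole split. The same idea can be sketched in plain Python. The generator below (fake_paper_stream) is a hypothetical stand-in for the real dataset and only mimics the 'id' and 'full_text' fields used later in this notebook.

```python
from itertools import islice

def fake_paper_stream():
    """Hypothetical stand-in for a streaming dataset: yields one record at a time."""
    paper_id = 0
    while True:  # unbounded, like a remote stream whose length is not known up front
        yield {"id": f"paper-{paper_id}", "full_text": f"Body of paper {paper_id}"}
        paper_id += 1

# Take only the first 100 records; later records are never generated.
papers_sample = list(islice(fake_paper_stream(), 100))
print(len(papers_sample), papers_sample[0]["id"], papers_sample[-1]["id"])
```

Materializing the sample with list(...) is what makes len() and repeated iteration possible afterwards, which is why the notebook does the same with papers_data.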


# --- Step 3: Chunk the Documents ---

In [15]:
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
all_chunks = []
for paper in papers_data:
    chunks = text_splitter.split_text(paper['full_text'])
    for chunk in chunks:
        doc = Document(page_content=chunk, metadata={"paper_id": paper['id']})
        all_chunks.append(doc)
print(f"Total number of text chunks created: {len(all_chunks)}")


Total number of text chunks created: 5516
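
To see what chunk_size and chunk_overlap control, here is a deliberately simplified sketch that cuts fixed-width character windows. It is not how RecursiveCharacterTextSplitter works internally: the real splitter prefers to break on separators such as paragraph and line boundaries, so its chunks vary in length. The function name sliding_chunks and the synthetic sample text are assumptions made for illustration.

```python
def sliding_chunks(text, chunk_size=1000, chunk_overlap=100):
    """Fixed-width windows; each window starts chunk_size - chunk_overlap after the previous one."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks

sample = "".join(str(i % 10) for i in range(2500))  # 2500 synthetic characters
pieces = sliding_chunks(sample)
print([len(p) for p in pieces])
# The last 100 characters of one chunk reappear at the start of the next.
print(pieces[0][-100:] == pieces[1][:100])
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of embedding some text twice.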


# --- Step 4: Create Embeddings and Build the Vector Store with tqdm ---

In [16]:
# Load the embedding model and enable the progress bar
model_name = "sentence-transformers/all-MiniLM-L6-v2"
embeddings = HuggingFaceEmbeddings(
    model_name=model_name,
    show_progress=True  # This is the key change to show a tqdm bar
)
print("Embedding model loaded.")

# Build the FAISS vector store. LangChain will now automatically display a progress bar.
vector_store = FAISS.from_documents(all_chunks, embeddings)

print("\nFAISS vector store created successfully!")

Embedding model loaded.




FAISS vector store created successfully!
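
Conceptually, the vector store keeps one embedding per chunk and answers a query by finding the stored vectors closest to the query's embedding. The sketch below shows that lookup as an exhaustive scan with Euclidean distance. Both choices are assumptions for illustration, and the index LangChain builds may be configured differently. The three-dimensional vectors and chunk labels are hypothetical; all-MiniLM-L6-v2 produces 384-dimensional vectors.

```python
import math

def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest(query_vec, stored, k=2):
    """Return the k (distance, label) pairs closest to query_vec."""
    scored = sorted((l2_distance(query_vec, vec), label) for label, vec in stored.items())
    return scored[:k]

# Hypothetical toy embeddings standing in for real chunk vectors.
toy_index = {
    "chunk_a": [1.0, 0.0, 0.0],
    "chunk_b": [0.0, 1.0, 0.0],
    "chunk_c": [0.9, 0.1, 0.0],
}
results = nearest([1.0, 0.0, 0.0], toy_index, k=2)
print([label for _, label in results])
```

An exhaustive scan is exact but compares the query against every stored vector, which is acceptable for a few thousand chunks like the 5516 created above.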


# --- Step 5: Test the Retrieval System ---

In [17]:
query = "What are sparsity-certifying decompositions?"
retrieved_docs = vector_store.similarity_search(query)
print("\n--- Most Relevant Chunk ---")
print(retrieved_docs[0].page_content)



--- Most Relevant Chunk ---
A (k, `)-maps-and-trees is a graph that admits a decomposition into k− ` edge-disjoint
map-graphs and ` spanning trees.
Another characterization of map-graphs, which we will use extensively in this paper, is as
the (1,0)-tight graphs [8, 24]. The k-map-graphs are evidently (k,0)-tight, and [8, 24] show that
the converse holds as well.
1 Our terminology follows Lovász in [16]. In the matroid literature map-graphs are sometimes known as bases
of the bicycle matroid or spanning pseudoforests.
Sparsity-certifying Graph Decompositions 3
Fig. 1. Examples of sparsity-certifying decompositions: (a) a 3-arborescence; (b) a 2-map-graph; (c) a
(2,1)-maps-and-trees. Edges with the same line style belong to the same subgraph. The 2-map-graph is
shown with a certifying orientation.
A `Tk is a decomposition into ` edge-disjoint (not necessarily spanning) trees such that each
vertex is in exactly k of them. Figure 2(a) shows an example of a 3T2.


# Integrating an LLM into the Retrieval Pipeline

In [51]:
import re

In [52]:
# --- Step 1: Install necessary libraries ---
!pip install openai

# --- Step 2: Import libraries and get your API key ---
from google.colab import userdata
from openai import OpenAI
from langchain.prompts import ChatPromptTemplate
from langchain.schema.runnable import RunnablePassthrough
from langchain.schema.output_parser import StrOutputParser

# Make sure your 'vector_store' object from the previous steps is available

# Securely get the NVIDIA API key from Colab secrets.
# Assumes a secret named "NVIDIA_API_KEY" has been added in the Colab Secrets panel;
# never paste the key itself into the notebook.
NVIDIA_API_KEY = userdata.get("NVIDIA_API_KEY")

print("NVIDIA API Key loaded.")

# --- Step 3: Initialize OpenAI client for NVIDIA endpoint ---
client = OpenAI(
    base_url="https://integrate.api.nvidia.com/v1",
    api_key=NVIDIA_API_KEY
)

print("OpenAI client initialized with NVIDIA API endpoint.")

# --- Step 4: Define the retriever and prompt template ---
retriever = vector_store.as_retriever()

prompt_template = """
Answer the following question based only on the provided context.
If the answer is not in the context or the context is empty, say "I don't know the answer to this. Please ask again."

Context:
{context}

Question: {question}
"""
prompt = ChatPromptTemplate.from_template(prompt_template)

# --- Step 5: Define the RAG chain manually ---
def rag_chain(query):
    # Step 1: Retrieve context
    context_docs = retriever.invoke(query)  # replaces the deprecated get_relevant_documents()

    # --- NEW CLEANING STEP ---
    # Define a regex pattern to find the corrupted chunks.
    gibberish_pattern = re.compile(r'/DAN <[A-Fa-f0-9]+>')

    # Filter the list of documents, keeping only the ones that DO NOT contain the gibberish pattern.
    cleaned_docs = [doc for doc in context_docs if not gibberish_pattern.search(doc.page_content)]

    print(f"Retrieved {len(context_docs)} documents, {len(cleaned_docs)} remaining after cleaning.")

    # Join only the cleaned documents to create the final context.
    context_text = "\n\n".join([doc.page_content for doc in cleaned_docs])

    # Step 2: Format prompt
    print(context_text)
    formatted_prompt = prompt.format(context=context_text, question=query)

    # Step 3: Call NVIDIA-hosted GPT model
    completion = client.chat.completions.create(
        model="openai/gpt-oss-120b",
        messages=[{"role": "user", "content": formatted_prompt}],
        temperature=0,
        top_p=1,
        max_tokens=4098
    )
    return completion.choices[0].message.content





NVIDIA API Key loaded.
OpenAI client initialized with NVIDIA API endpoint.
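
The cleaning step inside rag_chain can be checked in isolation. The sketch below applies the same regex to three sample strings. The corrupted string is a hypothetical example built to match the pattern, not text taken from the dataset.

```python
import re

# Same pattern as in rag_chain: "/DAN <" followed by hex digits and ">".
gibberish_pattern = re.compile(r'/DAN <[A-Fa-f0-9]+>')

sample_chunks = [
    "Our pebble game with colors is a set of rules for constructing graphs.",
    "/DAN <1A2B3C> /DAN <FF00> /DAN <0D0A>",  # hypothetical corrupted chunk
    "The game is played by a single player on a fixed finite set of vertices.",
]

# Keep only the chunks in which the pattern does not occur anywhere.
kept = [chunk for chunk in sample_chunks if not gibberish_pattern.search(chunk)]
print(f"{len(sample_chunks)} chunks in, {len(kept)} kept")
```

Because search() matches anywhere in the string, a chunk is dropped even if only a small part of it is corrupted. Applying the same filter before building the vector store would keep such chunks out of the index, rather than discarding them from each query's results.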


# --- Step 6: Ask a Question! ---

In [53]:
prompt_template = """
You are a helpful question-answering assistant.

Use ONLY the information provided in the context to answer the question.
If the context is empty, irrelevant, or does not contain the answer, respond with:
"I don't know the answer to this. Please ask again."

Do not output anything else (no special characters, no explanation, no guessing).
Always respond in plain English.

Context:
{context}

Question:
{question}

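Answer:
"""
# Rebuild the prompt object so that rag_chain() uses the stricter template defined above;
# reassigning prompt_template alone does not change the existing `prompt`.
prompt = ChatPromptTemplate.from_template(prompt_template)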


In [55]:

query = "What is the check game with colors?"
print(f"Query: {query}")

answer = rag_chain(query)

print("\n--- Generated Answer ---")
print(answer)

Query: What is the check game with colors?



Retrieved 4 documents, 4 remaining after cleaning.
son’s Laman graph algorithms [9]. Berg and Jordan [1], provided the formal analysis of the
pebble game of [10] and introduced the idea of playing the game on a directed graph. Lee and
Streinu [12] generalized the pebble game to the entire range of parameters 0≤ `≤ 2k−1, and
left as an open problem using the pebble game to find sparsity certifying decompositions.
3. The pebble game with colors
Our pebble game with colors is a set of rules for constructing graphs indexed by nonnegative
integers k and `. We will use the pebble game with colors as the basis of an efficient algorithm
for the decomposition problem later in this paper. Since the phrase “with colors” is necessary
only for comparison to [12], we will omit it in the rest of the paper when the context is clear.
Sparsity-certifying Graph Decompositions 5
We now present the pebble game with colors. The game is played by a single player on a

We now present the pebble game with colo