# RAG Research Summarizer with Claude (Proof of Concept)
This script uses Anthropic's Claude to answer queries using relevant research summaries.

## Setup:
1. Add your Anthropic API key to a file named ignore.py in the same directory as this script:

    KEY = "your_claude_api_key_here"
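
As an optional variation on the setup step, the key can be looked up in an environment variable first, with ignore.py as the fallback. This is only a sketch: `load_api_key` is a hypothetical helper that is not part of this notebook, and `ANTHROPIC_API_KEY` is assumed to be the variable name you export.

```python
import os


def load_api_key(env_var: str = "ANTHROPIC_API_KEY") -> str:
    """Hypothetical helper: prefer an environment variable, fall back to ignore.py."""
    key = os.environ.get(env_var)
    if key:
        return key
    try:
        import ignore  # the local file described in the setup step
    except ImportError as exc:
        raise RuntimeError(
            f"Set {env_var} or create ignore.py with a KEY variable."
        ) from exc
    return ignore.KEY


# Usage (commented out so the sketch has no side effects):
# client = anthropic.Anthropic(api_key=load_api_key())
```

Either source works with the rest of the notebook, since only the resulting string is passed to the client.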


In [1]:
%pip install -qq torch sentence-transformers anthropic

Note: you may need to restart the kernel to use updated packages.


# imports

In [2]:
import json 
import ignore

import torch
from sentence_transformers import SentenceTransformer, util
import anthropic


DOCUMENT_STORE_PATH: str = './all_wwc.json'




# extract documents

In [None]:
with open(DOCUMENT_STORE_PATH, 'r', encoding='utf-8') as f:  # close the file once it is loaded
    docs = json.load(f)
docs = [f'{k}||{v}' for k, v in docs.items()]  # Make them a list with some metadata fusion.

for doc in docs[:3]:  # print a few docs as an example
    print(doc)
    print(); print()

dddm_pg_092909.pdf__1__0||Using Student Achievement Data to  
Support Instructional Decision Making
Using Student Achievement Data to  
Support Instructional Decision Making
NCEE 2009-4067
U.S. DEPARTMENT OF EDUCATION
IES PRACTICE GUIDE
WHAT WORKS CLEARINGHOUSE

dddm_pg_092909.pdf__2__0||The Institute of Education Sciences (IES) publishes practice guides in education 
to bring the best available evidence and expertise to bear on the types of challenges 
that cannot currently be addressed by a single intervention or program. Authors of 
practice guides seldom conduct the types of systematic literature searches that are 
the backbone of a meta-analysis, although they take advantage of such work when 
it is already published. Instead, authors use their expertise to identify the most im­
portant research with respect to their recommendations and conduct a search of 
recent publications to ensure that the research supporting the recommendations 
is up-to-date. 
Unique to IES-sponsored pract

# cosine similarity function definition

In [4]:
def search_top_k(model: SentenceTransformer, query: str, doc_embs: torch.Tensor, docs: list[str], k: int = 3) -> list[tuple[float, str]]:
    """
    Perform a cosine similarity search for a query against precomputed document embeddings.

    Args:
        model (SentenceTransformer): Preloaded SentenceTransformer embedding model.
        query (str): Query string.
        doc_embs (torch.Tensor): Precomputed document embeddings (normalized).
        docs (list[str]): Original documents, in the same order as the embeddings.
        k (int, optional): Number of top results to return. Defaults to 3.

    Returns:
        list[tuple[float, str]]: List of (similarity_score, document) tuples, highest score first.
    """
    query_emb = model.encode([query], convert_to_tensor=True, normalize_embeddings=True)
    sims = util.cos_sim(query_emb, doc_embs)[0]  # shape: [num_docs]
    top_k = torch.topk(sims, k=k)
    return [(score.item(), docs[idx]) for idx, score in zip(top_k.indices, top_k.values)]
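
To make the ranking step concrete, here is a dependency-free sketch of the same idea on toy two-dimensional vectors. It is not the notebook's implementation (which relies on `util.cos_sim` and `torch.topk`). The names `normalize` and `top_k_by_cosine` and the toy vectors are made up for illustration. Once vectors are normalized to unit length, cosine similarity reduces to a plain dot product.

```python
import math


def normalize(vec: list[float]) -> list[float]:
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def top_k_by_cosine(
    query_vec: list[float],
    doc_vecs: list[list[float]],
    docs: list[str],
    k: int = 3,
) -> list[tuple[float, str]]:
    """Return the k (score, document) pairs with the highest cosine similarity."""
    q = normalize(query_vec)
    scored = []
    for vec, doc in zip(doc_vecs, docs):
        d = normalize(vec)
        # For unit vectors the dot product equals the cosine of the angle.
        scored.append((sum(a * b for a, b in zip(q, d)), doc))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[: min(k, len(scored))]  # never ask for more than we have


# Toy data: illustrative only, not real embeddings.
toy_docs = ["doc_a||class size", "doc_b||outdoor learning", "doc_c||vocabulary"]
toy_vecs = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]

results = top_k_by_cosine([0.0, 1.0], toy_vecs, toy_docs, k=2)
for score, doc in results:
    print(f"{score:.4f} | {doc}")
```

Clamping `k` to the number of documents avoids an error when fewer documents exist than requested, a case the tensor version would need to guard against separately.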



# build model 

In [5]:
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
doc_embs = model.encode(docs, convert_to_tensor=True, normalize_embeddings=True)

### example search usage

In [6]:
query = 'what do we know about outdoor education'
n = 3

results = search_top_k(model, query, doc_embs, docs, k=n)

for score, doc in results:
    print(f"{score:.4f} | {doc}")
    print()

0.6443 | 13362.pdf__228__0||Discipline-Based Education Research: Understanding and Improving Learning in Undergraduate Science and...
Copyright National Academy of Sciences. All rights reserved.
REFERENCES	
209
National Science Foundation. (1997). Geoscience education: A recommended strategy. Report 
based on August 29-30, 1996, workshop from the Geoscience Education Working Group 
to the Advisory Committee for Geosciences and the Directorate for Geosciences of the 
National Science Foundation, Arlington, VA. Available: http://www.nsf.gov/pubs/1997/
nsf97171/nsf97171.htm.
Orion, N., and Hofstein, A. (1994). Factors that influence learning during a scientific field trip 
in a natural environment. Journal of Research in Science Teaching, 31(10), 1097-1119.
Orion, N., Hofstein, A., Tamir, P., and Giddings, G.J. (1997). Development and validation of 
an instrument for assessing the learning environment of outdoor science activities. Science 
Education, 81(2), 161-171.
Petcovic, H.L., Libar

# prompt building and generation

In [7]:
def generate_prompt(query: str, sources: int = 3, print_flag: bool = False) -> str:
    results = search_top_k(model, query, doc_embs, docs, k=sources)

    if print_flag:
        for score, doc in results:
            print(f"{score:.4f} | {doc}")


    rag_input = {
        "query": query,
        "research_summaries": [
            {
                "score": score,
                "id": text.split('||', 1)[0],
                "text": text.split('||', 1)[1]
            }
            for score, text in results  # search_top_k already returns exactly `sources` items
        ]
    }


    prompt = f"""
    You are an AI assistant that uses retrieval-augmented generation to answer questions about educational best practices.

    == Relevant Information ==
    Reference Summaries: You will be provided with structured summaries of research papers.
    Relevance Filtering: Only use information from the summaries if it is directly relevant to the query.
    Answer Generation: Generate concise, accurate, and clear answers to the user query.
    Citation: When using information from a summary, include a reference to the summary’s ID.

    ==INPUT==
    {json.dumps(rag_input, indent=2)}

    ==EXAMPLE OUTPUT== 
    {{
    "answer": <"Answer based on relevant summaries.">,
    "used_summaries": <["id1", ..., "idn"]>
    }}

    ==IMPORTANT==
    - Only respond with the output JSON, nothing before or after; DO NOT include "```json" or other markdown in your response.
    - Maintain a professional and friendly tone.
    - Respond only by referencing the given input. If none of the input is relevant to the user query, then respond that you have nothing useful to say.
    - Do not elaborate at all in your response outside of the input data.
    - Be concise
    """


    return prompt
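
The `id||text` convention used when building `rag_input` can be isolated in a small helper. The sketch below uses a hypothetical name, `to_summary`, and assumes the ID itself never contains `||`. It splits on the first delimiter only, so any `||` that appears later in the document text is kept intact.

```python
import json


def to_summary(score: float, doc: str) -> dict:
    """Turn a (score, 'id||text') pair into one research_summaries entry."""
    doc_id, sep, text = doc.partition("||")  # split on the first '||' only
    if not sep:
        # No delimiter present: keep the whole string as text with an empty id.
        doc_id, text = "", doc
    return {"score": round(score, 4), "id": doc_id, "text": text}


# Illustrative input, not taken from the document store.
example = to_summary(0.61234, "sample.pdf__1__0||Intro || with a stray delimiter")
print(json.dumps(example, indent=2))
```

Rounding the score is a cosmetic choice here to keep the serialized prompt short; the notebook passes the raw float.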


### putting it all together with Claude

In [8]:

prompt = generate_prompt(query='Tell me about optimal class size?')

client = anthropic.Anthropic(api_key=ignore.KEY)

response = client.messages.create(
    model="claude-sonnet-4-5-20250929",    
    max_tokens=1024,
    messages=[
        {"role": "user", "content": prompt}
    ]
)

response_obj = json.loads(response.content[0].text)

print(response_obj)

{'answer': "Based on the research summaries provided, there are references to class size studies, particularly Tennessee's Class Size Study (Project STAR) conducted by Finn and Achilles. This research examined questions about class size and student achievement. However, the summaries provided do not contain specific details about what the optimal class size is or the study's conclusions about ideal class sizes. One summary mentions that analyzing class size effects must account for not only class size itself but also teacher practices that correlate with class size variations. To provide specific recommendations about optimal class size, I would need access to summaries that contain the actual findings and conclusions from these studies.", 'used_summaries': ['10236.pdf__181__0', '11112.pdf__118__0']}


# A more production-style OOP example

In [9]:
class RAGPromptGenerator:
    def __init__(self, docs: list[str], api_key: str, embedding_model: str = "all-MiniLM-L6-v2", claude_model: str = "claude-sonnet-4-5-20250929"):
        """
        Initialize the RAG prompt generator and embed the documents.

        Args:
            docs: List of documents with format "id||text".
            api_key: Anthropic API key used to create the client.
            embedding_model: Name of the SentenceTransformer model to use for embeddings.
            claude_model: Which Claude model to use.
        """
        self.docs = docs
        self.model = SentenceTransformer(embedding_model)
        self.doc_embs = self.model.encode(docs, convert_to_tensor=True, normalize_embeddings=True)
        self.claude_model = claude_model
        self.client = anthropic.Anthropic(api_key=api_key)

    def search_top_k(self, query: str, k: int = 3) -> list[tuple[float, str]]:
        """Perform a cosine similarity search for a query against precomputed document embeddings."""
        query_emb = self.model.encode([query], convert_to_tensor=True, normalize_embeddings=True)
        sims = util.cos_sim(query_emb, self.doc_embs)[0]
        top_k = torch.topk(sims, k=k)
        return [(score.item(), self.docs[idx]) for idx, score in zip(top_k.indices, top_k.values)]

    def generate_prompt(self, query: str, sources: int = 3, print_flag: bool = False) -> str:
        """Generate a RAG-style prompt with top-k relevant research summaries."""
        results = self.search_top_k(query, k=sources)

        if print_flag:
            for score, doc in results:
                print(f"{score:.4f} | {doc}")

        rag_input = {
            "query": query,
            "research_summaries": [
                {
                    "score": score,
                    "id": text.split('||', 1)[0],
                    "text": text.split('||', 1)[1]
                }
                for score, text in results[:sources]
            ]
        }

        prompt = f"""
        You are an AI assistant that uses retrieval-augmented generation to answer questions about educational best practices.

        == Relevant Information ==
        Reference Summaries: You will be provided with structured summaries of research papers.
        Relevance Filtering: Only use information from the summaries if it is directly relevant to the query. You may use multiple summaries if they are all relevant.
        Answer Generation: Generate concise, accurate, and clear answers to the user query.
        Citation: When using information from a summary, include a reference to the summary IDs - inline when they are used.

        ==INPUT==
        {json.dumps(rag_input, indent=2)}

        ==EXAMPLE OUTPUT==
        {{
        "answer": <"Answer based on relevant summaries.">,
        "used_summaries": <["id1", ..., "idn"]>,
        "all_summaries": <["id1", ..., "idn"]>
        }}

        ==IMPORTANT==
        - Only respond with the output JSON, nothing before or after; DO NOT include "```json" or other markdown in your response.
        - Maintain a professional and friendly tone.
        - Respond only by referencing the given input. If none of the input is relevant to the user query, then respond that you have nothing useful to say.
        - Do not elaborate at all in your response outside of the input data.
        - Be concise

        *REMEMBER*
        - Your response **must be valid JSON only**.
        - DO NOT include ```json, ``` or any other markdown syntax.
        - Do NOT include explanations, greetings, or extra text—only the JSON.
        """
        return prompt

    def query_llm(self, query: str, sources: int = 3, print_flag: bool = False) -> dict:
        """
        Full pipeline: query -> retrieve top summaries -> generate prompt -> call Claude -> return JSON.
        """
        prompt = self.generate_prompt(query, sources=sources, print_flag=print_flag)

        response = self.client.messages.create(
            model=self.claude_model,
            max_tokens=1024,
            messages=[{"role": "user", "content": prompt}]
        )

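        # str.strip() removes a set of characters rather than a prefix or suffix,
        # so drop any Markdown code fence explicitly before parsing.
        raw_text = response.content[0].text.strip()
        raw_text = raw_text.removeprefix('```json').removeprefix('```').removesuffix('```').strip()
        try:
            response_obj = json.loads(raw_text)
        except json.JSONDecodeError:
            response_obj = {"error": "Failed to parse response JSON", "raw_text": raw_text}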

        return response_obj
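
The response-cleaning step in `query_llm` can also be pulled out into a standalone function, which makes it testable without calling the API. The following is a minimal sketch, assuming the model either returns bare JSON or wraps it in a single Markdown code fence. `parse_model_json` is a hypothetical name, and `FENCE` is built at runtime only so this example contains no literal fence marker.

```python
import json

FENCE = "`" * 3  # a Markdown code-fence marker


def parse_model_json(raw: str) -> dict:
    """Parse model output as JSON, tolerating one surrounding code fence."""
    text = raw.strip()
    if text.startswith(FENCE):
        # Drop the opening fence and an optional "json" language tag.
        text = text[len(FENCE):].removeprefix("json")
    if text.endswith(FENCE):
        text = text[: -len(FENCE)]
    try:
        return json.loads(text)  # json.loads ignores surrounding whitespace
    except json.JSONDecodeError:
        # Same fallback shape that query_llm uses.
        return {"error": "Failed to parse response JSON", "raw_text": raw}


print(parse_model_json('{"answer": "ok", "used_summaries": []}'))
print(parse_model_json(FENCE + 'json\n{"answer": "ok"}\n' + FENCE))
print(parse_model_json("Sorry, I cannot answer that."))
```

Returning an error dictionary instead of raising keeps the calling cell running, though downstream code that indexes `response['answer']` should check for the `error` key first.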


In [None]:
rag_generator = RAGPromptGenerator(docs, 
                                   api_key=ignore.KEY,
                                   embedding_model='all-MiniLM-L6-v2', 
                                   claude_model='claude-sonnet-4-5-20250929')

query = "What do we know about optimal class size?"
response = rag_generator.query_llm(query, sources=8)

print(response['answer'])
print('-' * 50)
print(json.dumps(response, indent=2))

In [None]:
query = "How can I help my students who speak english as a second language?"
response = rag_generator.query_llm(query, sources=5)

print(response['answer'])
print('-' * 50)
print(json.dumps(response, indent=2))

To help students who speak English as a second language, research-based recommendations include:

1. **Teach academic vocabulary intensively** across several days using varied instructional activities (english_learners_pg_040114.pdf__10__1).

2. **Integrate oral and written English instruction into content-area teaching**, which benefits both English learners and native English speakers from similar backgrounds (english_learners_pg_040114.pdf__10__1, english_learners_pg_040114.pdf__53__0).

3. **Provide regular, structured opportunities to develop written language skills** (english_learners_pg_040114.pdf__10__1).

4. **Offer small-group instructional interventions** (3-5 students) for students struggling with literacy and English language development. Use homogeneous groups for foundational skills like phonemic awareness and decoding, but heterogeneous groups for writing, oral language, and comprehension tasks (english_learners_pg_040114.pdf__10__1, english_learners_pg_040114.pdf__69__

In [None]:
query = "Would meditating outside be useful to my students?"
response = rag_generator.query_llm(query, sources=3)

print(response['answer'])
print('-' * 50)
print(json.dumps(response, indent=2))

{
  "answer": "Based on the available research summaries, there is limited direct information about meditating outside specifically. However, one study mentioned that outdoor education experiences can contribute to positive development in the affective domain [13362.pdf__262__1]. While this suggests outdoor activities may have benefits for students' emotional and social development, the research provided does not specifically address meditation practices or their effectiveness when conducted outdoors versus indoors.",
  "used_summaries": [
    "13362.pdf__262__1"
  ],
  "all_summaries": [
    "13362.pdf__228__0",
    "13362.pdf__262__1",
    "behavioral-interventions-practice-guide_v3a_508a.pdf__73__2"
  ]
}
