# RAG Research Summarizer with Claude (Proof of Concept)
This notebook embeds a set of research summaries, retrieves the ones most similar to a query, and asks Anthropic's Claude to answer using only those summaries.

## Setup:
1. Add your API key to a file named `ignore.py` in the same directory as this notebook:

    KEY = "your_claude_api_key_here"
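
As an alternative to a key file, the key could also come from an environment variable. The sketch below assumes a hypothetical helper, `load_api_key`, that is not part of this notebook: it checks `ANTHROPIC_API_KEY` first and falls back to `ignore.KEY`.

```python
import importlib
import os


def load_api_key(env_var: str = "ANTHROPIC_API_KEY", module_name: str = "ignore") -> str:
    """Return the API key from the environment, else from `<module_name>.KEY`.

    Hypothetical helper for illustration; the notebook itself reads ignore.KEY directly.
    """
    key = os.environ.get(env_var)
    if key:
        return key
    try:
        module = importlib.import_module(module_name)
    except ImportError as exc:
        raise RuntimeError(
            f"Set {env_var} or create {module_name}.py with a KEY variable."
        ) from exc
    return module.KEY
```

With this in place, `anthropic.Anthropic(api_key=load_api_key())` would work whether or not `ignore.py` exists.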


In [13]:
%pip install -qq torch sentence-transformers anthropic

Note: you may need to restart the kernel to use updated packages.


# imports

In [None]:
import json 
import ignore

import torch
from sentence_transformers import SentenceTransformer, util
import anthropic


DOCUMENT_STORE_PATH: str = './all_wwc.json'


# extract documents

In [15]:
with open(DOCUMENT_STORE_PATH, 'r', encoding='utf-8') as f:  # close the file once it is loaded
    docs = json.load(f)
docs = [f'{k}||{v}' for k, v in docs.items()]  # Flatten to "id||text" strings so each chunk carries its id.

for doc in docs[:5]:  # print a few docs as an example
    print(doc)
    print()

dddm_pg_092909.pdf__1__0||Using Student Achievement Data to  
Support Instructional Decision Making
Using Student Achievement Data to  
Support Instructional Decision Making
NCEE 2009-4067
U.S. DEPARTMENT OF EDUCATION
IES PRACTICE GUIDE
WHAT WORKS CLEARINGHOUSE

dddm_pg_092909.pdf__2__0||The Institute of Education Sciences (IES) publishes practice guides in education 
to bring the best available evidence and expertise to bear on the types of challenges 
that cannot currently be addressed by a single intervention or program. Authors of 
practice guides seldom conduct the types of systematic literature searches that are 
the backbone of a meta-analysis, although they take advantage of such work when 
it is already published. Instead, authors use their expertise to identify the most im­
portant research with respect to their recommendations and conduct a search of 
recent publications to ensure that the research supporting the recommendations 
is up-to-date. 
Unique to IES-sponsored pract
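
Each printed document is an `id||text` string. The sketch below shows one way to take such a string apart. It assumes the id has the form `filename__page__chunk`; that reading is a guess based on the ids printed above, and `ChunkId` and `parse_doc` are hypothetical names.

```python
from typing import NamedTuple


class ChunkId(NamedTuple):
    filename: str
    page: int   # assumed meaning of the second id field
    chunk: int  # assumed meaning of the third id field


def parse_doc(doc: str) -> tuple[ChunkId, str]:
    """Split an 'id||text' string and break the id into its three parts."""
    doc_id, text = doc.split("||", 1)               # split once: the text may contain '||'
    filename, page, chunk = doc_id.rsplit("__", 2)  # split from the right: keeps the filename whole
    return ChunkId(filename, int(page), int(chunk)), text


chunk_id, text = parse_doc("dddm_pg_092909.pdf__1__0||Using Student Achievement Data")
print(chunk_id.filename, chunk_id.page, chunk_id.chunk)
```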

# cosine similarity function definition

In [16]:
def search_top_k(model: SentenceTransformer, query: str, doc_embs: torch.Tensor, docs: list[str], k: int = 3) -> list[tuple[float, str]]:
    """
    Perform a cosine similarity search for a query against precomputed document embeddings.

    Args:
        model (SentenceTransformer): Preloaded Hugging Face embedding model.
        query (str): Query string.
        doc_embs (torch.Tensor): Precomputed document embeddings (normalized).
        docs (list[str]): Original documents, in the same order as the embeddings.
        k (int, optional): Number of top results to return. Defaults to 3.

    Returns:
        list[tuple[float, str]]: List of (similarity_score, document) tuples, best match first.
    """
    query_emb = model.encode([query], convert_to_tensor=True, normalize_embeddings=True)
    sims = util.cos_sim(query_emb, doc_embs)[0]  # shape: [num_docs]
    top_k = torch.topk(sims, k=min(k, len(docs)))  # clamp k: topk raises if k exceeds the number of docs
    return [(score, docs[idx]) for idx, score in zip(top_k.indices.tolist(), top_k.values.tolist())]
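
To make the ranking step concrete, here is a dependency-free sketch of the same idea: normalize the vectors, take dot products, and keep the `k` highest scores. The names `normalize` and `top_k_by_cosine` and the toy 2-D vectors are illustrative assumptions; the notebook itself uses `util.cos_sim` and `torch.topk` on real embeddings.

```python
import heapq
import math


def normalize(vec: list[float]) -> list[float]:
    """Scale a vector to unit length (assumes it is not all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def top_k_by_cosine(query_vec, doc_vecs, docs, k=3):
    """Return the k (score, doc) pairs with the highest cosine similarity."""
    q = normalize(query_vec)
    scored = []
    for vec, doc in zip(doc_vecs, docs):
        d = normalize(vec)
        # For unit vectors, cosine similarity is just the dot product.
        scored.append((sum(a * b for a, b in zip(q, d)), doc))
    # nlargest returns at most len(scored) items, so k may exceed the doc count.
    return heapq.nlargest(k, scored, key=lambda pair: pair[0])


for score, doc in top_k_by_cosine([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], ["a", "b", "c"], k=2):
    print(f"{score:.4f} | {doc}")
# prints:
# 1.0000 | a
# 0.7071 | c
```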



# build model 

In [17]:
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
doc_embs = model.encode(docs, convert_to_tensor=True, normalize_embeddings=True)

### example search usage

In [18]:
query = 'what do we know about outdoor education'
n = 3

results = search_top_k(model, query, doc_embs, docs, k=n)

for score, doc in results:
    print(f"{score:.4f} | {doc}")
    print()

0.5050 | adlit_pg_082608.pdf__55__4||of the first study42 indicated that students 
39.  This was the content taught in Guthrie et al. 
(1999). In Guthrie et al. (2000), teachers taught 
environmental adaptation (life science theme) 
in the fall and weather (earth science theme) in 
the spring.
40.  Guthrie et al. (1999).
41.  Guthrie et al. (2000).
42.  Guthrie et al. (1999).

0.4637 | ost_pg_072109.pdf__47__2||expert on a number of national research 
and development projects and was recently 
named to the 15-member Urban Education 
Research Task Force established to advise 
the U.S. Department of Education on issues 
affecting urban education. Among his vari­
ous awards and honors, most recently Dr. 
Borman received the 2008 American Educa­
tional Research Association (AERA) Palmer 
O. Johnson Award and was recognized for 
his contributions to education research by 
selection as an AERA Fellow.
Jeffrey Capizzano is vice president of pub­
lic policy and research at Teaching Strate­
gie

# prompt building and generation

In [19]:
def generate_prompt(query: str, sources: int = 3, print_flag: bool = False) -> str:
    results = search_top_k(model, query, doc_embs, docs, k=sources)

    if print_flag:
        for score, doc in results:
            print(f"{score:.4f} | {doc}")


    rag_input = {
        "query": query,
        "research_summaries": [
            {
                "score": score,
                "id": text.split('||', 1)[0],
                "text": text.split('||', 1)[1]  # split once: the text itself may contain '||'
            }
            for score, text in results  # search_top_k already returned `sources` items
        ]
    }


    prompt = f"""
    You are an AI assistant that uses retrieval-augmented generation to answer questions about educational best practices.

    == Relevant Information ==
    Reference Summaries: You will be provided with structured summaries of research papers.
    Relevance Filtering: Only use information from the summaries if it is directly relevant to the query.
    Answer Generation: Generate concise, accurate, and clear answers to the user query.
    Citation: When using information from a summary, include a reference to the summary’s ID.

    ==INPUT==
    {json.dumps(rag_input, indent=2)}

    ==EXAMPLE OUTPUT== 
    {{
    "answer": <"Answer based on relevant summaries.">,
    "used_summaries": <["id1", ..., "idn"]>
    }}

    ==IMPORTANT==
    - Only respond with the output JSON, nothing before or after; DO NOT include "```json" or other markdown in your response.
    - Maintain a professional and friendly tone.
    - Respond only by referencing the given input. If none of the input is relevant to the user query, then respond that you have nothing useful to say.
    - Do not elaborate at all in your response outside of the input data.
    - Be concise
    """


    return prompt
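
The prompt mixes interpolated values with literal JSON braces, which is why the example output is written with `{{` and `}}`. The sketch below is a trimmed-down illustration of that mechanism, not the full prompt; `render_prompt` is a hypothetical name and the sample result is made up.

```python
import json


def render_prompt(query: str, results: list[tuple[float, str]]) -> str:
    """Build the JSON payload from (score, 'id||text') pairs and embed it in a prompt."""
    payload = {
        "query": query,
        "research_summaries": [
            {"score": score, "id": doc.split("||", 1)[0], "text": doc.split("||", 1)[1]}
            for score, doc in results
        ],
    }
    # In an f-string, {{ and }} produce literal braces; {expr} is interpolated.
    return f"""==INPUT==
{json.dumps(payload, indent=2)}

==EXAMPLE OUTPUT==
{{"answer": "...", "used_summaries": ["id1"]}}
"""


print(render_prompt("class size", [(0.5, "a.pdf__1__0||Some summary text")]))
```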


### putting it all together with Claude

In [None]:

prompt = generate_prompt(query='Tell me about optimal class size?')

client = anthropic.Anthropic(api_key=ignore.KEY)

response = client.messages.create(
    model="claude-sonnet-4-5-20250929",    
    max_tokens=1024,
    messages=[
        {"role": "user", "content": prompt}
    ]
)

response_obj = json.loads(response.content[0].text)

print(response_obj)

JSONDecodeError: Expecting value: line 1 column 1 (char 0)
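
`json.loads` reports this exact message whenever the text does not begin with a JSON value, for example when the reply is empty or wrapped in a Markdown code fence. The raw reply is not shown here, so the cause is not certain. A minimal sketch of a more tolerant parser follows; `parse_json_reply` is a hypothetical helper, and it only handles the fence case.

```python
import json

FENCE = "`" * 3  # three backticks, built this way to keep this example fence-safe


def parse_json_reply(raw_text: str) -> dict:
    """Parse a model reply as JSON, tolerating an optional Markdown code fence."""
    text = raw_text.strip()
    if text.startswith(FENCE):
        text = text[len(FENCE):].removeprefix("json")  # drop the opening fence and language tag
    text = text.removesuffix(FENCE).strip()            # drop the closing fence, if any
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        # Return the raw text instead of raising so the caller can inspect it.
        return {"error": "Failed to parse response JSON", "raw_text": raw_text}


fenced = FENCE + 'json\n{"answer": "x", "used_summaries": []}\n' + FENCE
print(parse_json_reply(fenced))
print(parse_json_reply(""))
```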

# A more production-style OOP example

In [24]:
class RAGPromptGenerator:
    def __init__(self, docs: list[str], api_key: str, embedding_model: str = "all-MiniLM-L6-v2", claude_model: str = "claude-sonnet-4-5-20250929"):
        """
        Initialize the RAG prompt generator and embed the documents.

        Args:
            docs: List of documents with format "id||text".
            api_key: Anthropic API key used to create the client.
            embedding_model: Name of the SentenceTransformer model to use for embeddings.
            claude_model: Which Claude model to use.
        """
        self.docs = docs
        self.model = SentenceTransformer(embedding_model)
        self.doc_embs = self.model.encode(docs, convert_to_tensor=True, normalize_embeddings=True)
        self.claude_model = claude_model
        self.client = anthropic.Anthropic(api_key=api_key)

    def search_top_k(self, query: str, k: int = 3) -> list[tuple[float, str]]:
        """Perform a cosine similarity search for a query against precomputed document embeddings."""
        query_emb = self.model.encode([query], convert_to_tensor=True, normalize_embeddings=True)
        sims = util.cos_sim(query_emb, self.doc_embs)[0]
        top_k = torch.topk(sims, k=k)
        return [(score.item(), self.docs[idx]) for idx, score in zip(top_k.indices, top_k.values)]

    def generate_prompt(self, query: str, sources: int = 3, print_flag: bool = False) -> str:
        """Generate a RAG-style prompt with top-k relevant research summaries."""
        results = self.search_top_k(query, k=sources)

        if print_flag:
            for score, doc in results:
                print(f"{score:.4f} | {doc}")

        rag_input = {
            "query": query,
            "research_summaries": [
                {
                    "score": score,
                    "id": text.split('||', 1)[0],
                    "text": text.split('||', 1)[1]  # split once: the text itself may contain '||'
                }
                for score, text in results[:sources]
            ]
        }

        prompt = f"""
        You are an AI assistant that uses retrieval-augmented generation to answer questions about educational best practices.

        == Relevant Information ==
        Reference Summaries: You will be provided with structured summaries of research papers.
        Relevance Filtering: Only use information from the summaries if it is directly relevant to the query. You may use multiple summaries if they are all relevant.
        Answer Generation: Generate concise, accurate, and clear answers to the user query.
        Citation: When using information from a summary, include a reference to the summary IDs - inline when they are used.

        ==INPUT==
        {json.dumps(rag_input, indent=2)}

        ==EXAMPLE OUTPUT==
        {{
        "answer": <"Answer based on relevant summaries.">,
        "used_summaries": <["id1", ..., "idn"]>,
        "all_summaries": <["id1", ..., "idn"]>
        }}

        ==IMPORTANT==
        - Only respond with the output JSON, nothing before or after; DO NOT include "```json" or other markdown in your response.
        - Maintain a professional and friendly tone.
        - Respond only by referencing the given input. If none of the input is relevant to the user query, then respond that you have nothing useful to say.
        - Do not elaborate at all in your response outside of the input data.
        - Be concise

        *REMEMBER*
        - Your response **must be valid JSON only**.
        - DO NOT include ```json, ``` or any other markdown syntax.
        - Do NOT include explanations, greetings, or extra text—only the JSON.
        """
        return prompt

    def query_llm(self, query: str, sources: int = 3, print_flag: bool = False) -> dict:
        """
        Full pipeline: query -> retrieve top summaries -> generate prompt -> call Claude -> return JSON.
        """
        prompt = self.generate_prompt(query, sources=sources, print_flag=print_flag)

        response = self.client.messages.create(
            model=self.claude_model,
            max_tokens=1024,
            messages=[{"role": "user", "content": prompt}]
        )

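        # Remove an optional Markdown code fence before parsing. str.strip() takes a
        # set of characters, not a literal prefix, so use removeprefix/removesuffix.
        raw_text = response.content[0].text.strip()
        raw_text = raw_text.removeprefix("```json").removeprefix("```").removesuffix("```").strip()
        try:
            response_obj = json.loads(raw_text)
        except json.JSONDecodeError:
            response_obj = {"error": "Failed to parse response JSON", "raw_text": raw_text}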

        return response_obj


In [27]:
rag_generator = RAGPromptGenerator(docs, 
                                   api_key=ignore.KEY,
                                   embedding_model='all-MiniLM-L6-v2', 
                                   claude_model='claude-sonnet-4-5-20250929')

query = "What do we know about classes that are outdoors in nature?"
response = rag_generator.query_llm(query, sources=8)

print(json.dumps(response, indent=2))

{
  "answer": "The provided research summaries do not contain specific information about classes that are conducted outdoors in nature. The summaries discuss various educational approaches including hands-on activities, observation and personalization phases in instruction, and connecting students to topics like ecosystems and weather, but none explicitly address outdoor or nature-based classroom settings.",
  "used_summaries": [],
  "all_summaries": [
    "adlit_pg_082608.pdf__55__4",
    "adlit_pg_082608.pdf__55__2",
    "20072003.pdf__15__2",
    "WWC-practice-guide-reading-intervention-full-text.pdf__30__1",
    "TO4_PRACTICE_GUIDE_Preparing-for-School_07222022_v6.pdf__66__1",
    "WWC-practice-guide-reading-intervention-full-text.pdf__30__2",
    "english_learners_pg_040114.pdf__52__1",
    "readingcomp_pg_092810.pdf__41__4"
  ]
}
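
Because the model writes `used_summaries` and `all_summaries` itself, it can be worth checking them against the ids that were actually retrieved. The sketch below assumes a hypothetical helper, `unknown_citations`, that is not part of the class above; it returns any cited id that was never sent to the model.

```python
def unknown_citations(response: dict, retrieved_ids: list[str]) -> list[str]:
    """Return ids listed in `used_summaries` that are not among the retrieved ids."""
    retrieved = set(retrieved_ids)
    return [doc_id for doc_id in response.get("used_summaries", []) if doc_id not in retrieved]


retrieved = ["adlit_pg_082608.pdf__55__4", "20072003.pdf__15__2"]
reply = {"answer": "...", "used_summaries": ["20072003.pdf__15__2", "made_up.pdf__1__0"]}
print(unknown_citations(reply, retrieved))  # prints ['made_up.pdf__1__0']
```

An empty list means every cited id was part of the retrieved set; it does not show that the answer is faithful to those summaries.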


In [28]:
query = "How can I help my students who speak english as a second language?"
response = rag_generator.query_llm(query, sources=5)
print(json.dumps(response, indent=2))

{
  "answer": "To help students who speak English as a second language, research recommends several strategies: 1) Teach academic vocabulary intensively across several days using varied instructional activities (english_learners_pg_040114.pdf__10__2). 2) Integrate oral and written English language instruction into content-area teaching across subjects like math, science, and history (english_learners_pg_040114.pdf__10__2, english_learners_pg_040114.pdf__47__1). 3) Provide regular structured opportunities to develop written language skills (english_learners_pg_040114.pdf__10__2). 4) Group students heterogeneously by language proficiency so stronger English speakers can model language for less proficient students (english_learners_pg_040114.pdf__47__1). 5) Allow brief but frequent peer discussions throughout the day where students explain content using academic vocabulary, and permit emergent learners to discuss in their primary language to promote comprehension (english_learners_pg_0401

In [26]:
query = "Would meditating outside be useful to my students?"
response = rag_generator.query_llm(query, sources=3)
print(json.dumps(response, indent=2))

{
  "answer": "Based on the available research summaries, there is no directly relevant information about meditating outside or outdoor meditation practices for students. The summaries discuss self-monitoring techniques, classroom environment modifications, and teaching social-emotional skills like deep breathing exercises, but none specifically address outdoor meditation or its potential benefits for students.",
  "used_summaries": [],
  "all_summaries": [
    "behavioral-interventions-practice-guide_v3a_508a.pdf__67__1",
    "behavior_pg_092308.pdf__31__0",
    "TO4_PRACTICE_GUIDE_Preparing-for-School_07222022_v6.pdf__17__2"
  ]
}
