- Aims to bridge the style gap between user queries and relevant information in document text in a RAG system  
- An enhancement to traditional RAG retrieval that precomputes hypothetical prompts at the indexing stage. Unlike HyDE, which generates hypothetical content at query time, HyPE shifts that generation to the indexing phase.
- By precomputing multiple hypothetical prompts for each data chunk and embedding those prompts in place of the chunk, HyPE transforms retrieval into a question-question matching task, bypassing the need for runtime synthetic answer generation.
- This approach not only avoids added query-time latency but also strengthens the alignment between queries and relevant context.
- Instead of embedding raw text chunks, HyPE generates multiple hypothetical prompts for each chunk.
- These precomputed questions simulate user queries, improving alignment with real-world searches.
- This approach stores multiple representations per chunk, increasing retrieval flexibility.
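
The question-question matching idea above can be sketched without any LLM or vector database. This is a minimal illustration under stated assumptions, not the notebook's pipeline: the `embed` function is a toy bag-of-words stand-in for a real embedding model, and the chunks and questions are hand-written, hypothetical examples of what an LLM might generate at indexing time.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical chunks and hand-written questions (an LLM would produce these at indexing time).
chunks = {
    "c1": "Click the Previous button to return to questions you left blank.",
    "c2": "Click the Submit Quiz button when you are finished.",
}
questions = {
    "c1": ["How do I go back to a previous question?", "Can I change an earlier answer?"],
    "c2": ["How do I submit my quiz?", "What happens when I finish the quiz?"],
}

# Index: one entry per question vector, each pointing back to its parent chunk.
index = [(embed(q), chunk_id) for chunk_id, qs in questions.items() for q in qs]


def retrieve(query: str, k: int = 1) -> list[str]:
    """Rank question vectors against the query and return the k best distinct chunks."""
    query_vec = embed(query)
    ranked = sorted(index, key=lambda entry: cosine(query_vec, entry[0]), reverse=True)
    seen: list[str] = []
    for _, chunk_id in ranked:
        if chunk_id not in seen:
            seen.append(chunk_id)
        if len(seen) == k:
            break
    return [chunks[chunk_id] for chunk_id in seen]


result = retrieve("How can I go back and change a previous answer?")
print(result)
```

The query is compared only against stored questions, and the chunk text is returned as the payload. No generation step runs at query time.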

Benefits:
- **No Runtime Overhead**: Unlike HyDE, HyPE does not require LLM calls at query time, making retrieval faster and cheaper.
- **Enhanced Retrieval Precision**: Better alignment between queries and stored content.
- **Retrieval Speed**: Query-time work is the same as in standard RAG: one query embedding and one vector search.
- **No additional per-query LLM cost**: Question generation is paid once, at indexing time.

In [None]:
! pip3 install python-dotenv langchain langchain-classic langchain-core langchain-text-splitters beautifulsoup4 langchain-community langchain-openai pydantic faiss-cpu tiktoken

### Constants

In [22]:
LLM_MODEL_NAME="o4-mini-2025-04-16"
EMBEDDING_MODEL_NAME="text-embedding-3-small"
CHUNK_SIZE=1000
CHUNK_OVERLAP=200

In [4]:
URLS = [
    # Quizzes
    "https://community.canvaslms.com/t5/Student-Guide/How-do-I-view-Quizzes-as-a-student/ta-p/472",
    "https://community.canvaslms.com/t5/Student-Guide/How-do-I-view-the-rubric-for-a-quiz/ta-p/453",
    "https://community.canvaslms.com/t5/Student-Guide/How-do-I-take-a-quiz/ta-p/507",
    "https://community.canvaslms.com/t5/Student-Guide/How-do-I-take-a-quiz-in-New-Quizzes/ta-p/291",
    "https://community.canvaslms.com/t5/Student-Guide/How-do-I-take-a-quiz-where-I-can-only-view-one-question-at-a/ta-p/482",
    "https://community.canvaslms.com/t5/Student-Guide/How-do-I-take-a-quiz-where-I-can-only-view-one-question-at-a/ta-p/292",
    "https://community.canvaslms.com/t5/Student-Guide/How-do-I-answer-each-type-of-question-in-a-quiz/ta-p/474",
    "https://community.canvaslms.com/t5/Student-Guide/How-do-I-answer-each-type-of-question-in-New-Quizzes/ta-p/290",
    "https://community.canvaslms.com/t5/Student-Guide/How-do-I-resume-a-quiz-that-I-already-started-taking/ta-p/452",
    "https://community.canvaslms.com/t5/Student-Guide/How-do-I-submit-a-quiz/ta-p/475",
    "https://community.canvaslms.com/t5/Student-Guide/How-do-I-view-quiz-results-as-a-student/ta-p/335",
    "https://community.canvaslms.com/t5/Student-Guide/How-do-I-view-quiz-comments-from-my-instructor/ta-p/471",
    "https://community.canvaslms.com/t5/Student-Guide/How-do-I-view-my-quiz-results-as-a-student-in-New-Quizzes/ta-p/289",
    "https://community.canvaslms.com/t5/Student-Guide/How-do-I-know-if-I-can-retake-a-quiz/ta-p/490",
    "https://community.canvaslms.com/t5/Student-Guide/How-do-I-know-if-I-can-retake-a-quiz-in-New-Quizzes/ta-p/287",
    "https://community.canvaslms.com/t5/Student-Guide/How-do-I-submit-a-survey/ta-p/380",
    # Discussions
    "https://community.canvaslms.com/t5/Student-Guide/How-do-I-view-Discussions-as-a-student/ta-p/314",
    "https://community.canvaslms.com/t5/Student-Guide/How-do-I-view-the-rubric-for-my-graded-discussion/ta-p/319",
    "https://community.canvaslms.com/t5/Student-Guide/How-do-I-subscribe-to-a-discussion-podcast-as-a-student/ta-p/368",
    "https://community.canvaslms.com/t5/Student-Guide/How-do-I-know-if-I-have-a-peer-review-discussion-to-complete/ta-p/419",
    "https://community.canvaslms.com/t5/Student-Guide/How-do-I-submit-a-peer-review-to-a-discussion/ta-p/355",
    "https://community.canvaslms.com/t5/Student-Guide/Where-can-I-find-my-peers-feedback-for-peer-reviewed-discussions/ta-p/428",
    "https://community.canvaslms.com/t5/Student-Guide/How-do-I-create-a-course-discussion-as-a-student/ta-p/300",
    "https://community.canvaslms.com/t5/Student-Guide/How-do-I-subscribe-to-a-discussion-as-a-student/ta-p/352",
    "https://community.canvaslms.com/t5/Student-Guide/How-do-I-view-and-sort-discussion-replies-as-a-student/ta-p/465",
    "https://community.canvaslms.com/t5/Student-Guide/How-do-I-change-discussion-settings-to-manually-mark-discussion/ta-p/366",
    "https://community.canvaslms.com/t5/Student-Guide/How-do-I-mark-discussion-replies-as-read-or-unread-as-a-student/ta-p/284",
    "https://community.canvaslms.com/t5/Student-Guide/How-do-I-reply-to-a-discussion-as-a-student/ta-p/334",
    "https://community.canvaslms.com/t5/Student-Guide/How-do-I-attach-a-file-to-a-discussion-reply-as-a-student/ta-p/375",
    "https://community.canvaslms.com/t5/Student-Guide/How-do-I-embed-an-image-in-a-discussion-reply-as-a-student/ta-p/313",
    "https://community.canvaslms.com/t5/Student-Guide/How-do-I-edit-or-delete-discussion-replies-as-a-student/ta-p/399",
    "https://community.canvaslms.com/t5/Student-Guide/How-do-I-like-a-reply-in-a-course-discussion-as-a-student/ta-p/392",
    "https://community.canvaslms.com/t5/Student-Guide/How-do-I-view-a-discussion-thread-as-a-student/ta-p/485668",
    "https://community.canvaslms.com/t5/Student-Guide/How-do-I-mention-a-user-in-a-discussion-reply-as-a-student/ta-p/485669",
    "https://community.canvaslms.com/t5/Student-Guide/How-do-I-report-a-reply-in-a-discussion/ta-p/542169",
    "https://community.canvaslms.com/t5/Student-Guide/How-do-I-reply-to-a-discussion-as-a-student-in-Canvas-for/ta-p/645002",
    "https://community.canvaslms.com/t5/Student-Guide/How-do-I-translate-a-discussion-using-AI-Translations-as-a/ta-p/660442",
]

### Loading and Processing URL content

In [None]:
from langchain_classic.document_loaders import WebBaseLoader
from langchain_text_splitters.character import RecursiveCharacterTextSplitter
import bs4
import re

In [None]:
def clean_page_content(document):
    """Remove video-transcript lines and reference labels such as [1] from a document."""

    page_content = document.page_content
    # Transcript lines look like "00:05: some spoken text"
    matches = re.findall(r'[0-9]{2}:[0-9]{2}: [0-9A-Za-z ;.,!?-]*', page_content)
    if not matches:
        return document

    # Cut from the first transcript line through the second-to-last one.
    # With a single match there is no such span, so skip the cut (avoids an IndexError).
    if len(matches) > 1:
        idx1 = page_content.find(matches[0])
        idx2 = page_content.find(matches[-2]) + len(matches[-2])
        page_content = page_content[:idx1] + page_content[idx2:]
    page_content = re.sub(r'\[[0-9]\]', '', page_content)  # reference labels like [1]
    page_content = re.sub(r'[0-9]{2}:[0-9]{2}: ', '', page_content)  # leftover timestamps
    page_content = re.sub(r'\n', ' ', page_content)  # flatten newlines
    document.page_content = page_content

    return document
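
To see what the two patterns in the cleaning step match, here is a small standalone check on a made-up string. The `sample` text is hypothetical and only illustrates the regular expressions; it is not taken from the Canvas pages.

```python
import re

# Same transcript-line pattern as in the cleaning function, written as a raw string.
TIMESTAMP_LINE = r"[0-9]{2}:[0-9]{2}: [0-9A-Za-z ;.,!?-]*"

# Hypothetical page text: a reference label, two transcript lines, and trailing prose.
sample = "Intro text [1]\n00:05: Welcome to the video.\n00:10: Click Quizzes.\nClosing text"

matches = re.findall(TIMESTAMP_LINE, sample)

# Strip single-digit reference labels such as [1], then the bare timestamps.
no_labels = re.sub(r"\[[0-9]\]", "", sample)
no_stamps = re.sub(r"[0-9]{2}:[0-9]{2}: ", "", no_labels)

print(matches)
print(repr(no_stamps))
```

Each match stops at the newline because `\n` is not in the character class, so one match corresponds to one transcript line.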

In [7]:
def process_url_content(urls: list[str]):

    loader = WebBaseLoader(
        web_paths= urls,
        bs_kwargs={
            "parse_only": bs4.SoupStrainer(id="content"),
        },
        bs_get_text_kwargs={"separator": "\n", "strip": True},
    )

    text_splitter = RecursiveCharacterTextSplitter(chunk_size=CHUNK_SIZE, chunk_overlap=CHUNK_OVERLAP, length_function=len)
    
    processed_document_chunks = []

    for doc in loader.lazy_load():
        cleaned_doc = clean_page_content(doc)
        chunks = text_splitter.split_documents([cleaned_doc])
        processed_document_chunks.extend(chunks)
    
    return processed_document_chunks
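
The roles of `CHUNK_SIZE` and `CHUNK_OVERLAP` can be illustrated with a plain sliding window. This is a simplified sketch, not what `RecursiveCharacterTextSplitter` does internally (the splitter prefers to break on separators such as paragraphs and sentences); the helper name `sliding_window_chunks` is hypothetical.

```python
def sliding_window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into fixed-size windows where consecutive windows share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    if not text:
        return []
    step = chunk_size - chunk_overlap
    # A new window is needed only while the previous one has not reached the end of the text.
    stop = max(len(text) - chunk_overlap, 1)
    return [text[start:start + chunk_size] for start in range(0, stop, step)]


parts = sliding_window_chunks("abcdefghijklmnopqrstuvwxyz", chunk_size=10, chunk_overlap=4)
print(parts)
```

With the notebook's values (1000 and 200), each chunk would repeat the last 200 characters of its predecessor, which helps keep a sentence that straddles a boundary intact in at least one chunk.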

In [8]:
processed_document_chunks = process_url_content(URLS)

In [3]:
import pickle

In [None]:
with open("./data/processed_document_chunks.pickle",'wb') as fp:
    pickle.dump(processed_document_chunks,fp)
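
Reading the chunks back in a later session is the mirror image of the dump. The sketch below shows the round trip with a temporary directory and a plain dictionary standing in for the `Document` objects (both are assumptions for illustration). Only unpickle files you created yourself, since `pickle.load` can execute arbitrary code.

```python
import pickle
import tempfile
from pathlib import Path

# Hypothetical stand-in for the list of Document chunks.
chunks = [{"page_content": "How do I take a quiz?", "metadata": {}}]

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "processed_document_chunks.pickle"

    # Write the chunks to disk ...
    with open(path, "wb") as fp:
        pickle.dump(chunks, fp)

    # ... and load them back.
    with open(path, "rb") as fp:
        restored = pickle.load(fp)

print(restored == chunks)
```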

### Define generation of HyPE

In [26]:
from dotenv import find_dotenv, load_dotenv
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_classic.prompts import PromptTemplate
from pydantic import BaseModel, Field

In [27]:
load_dotenv(find_dotenv())

True

In [28]:
from langchain_core.documents import Document

In [29]:
# data model
class Questions(BaseModel):
    questions: list[str] = Field(description="List of essential questions that, when answered, capture the main points of the text.")

In [30]:
def generate_hypothetical_prompt_embeddings(document: Document) -> list[list[float]]:
    """
    Uses the LLM to generate multiple hypothetical questions for a single chunk
    and embeds them. The question embeddings act as 'proxies' for the chunk during retrieval.

    Parameters:
    document (Document): The chunk to generate questions for.

    Returns:
    list[list[float]]: One embedding vector per generated question.
    """

    llm = ChatOpenAI(model=LLM_MODEL_NAME)
    structured_llm = llm.with_structured_output(Questions)
    embedding_model = OpenAIEmbeddings(model=EMBEDDING_MODEL_NAME)

    question_gen_template = """
# Task
Analyze the input text and generate essential questions that, when answered, capture the main points of the text.
(Important) Generate questions without numbering or prefixes.
---
# Input
Text:
{chunk_text}
""".strip()
    
    question_gen_prompt = PromptTemplate(
        template=question_gen_template,
        input_variables=["chunk_text"]
    )
    
    question_gen_chain = question_gen_prompt | structured_llm

    questions = question_gen_chain.invoke({"chunk_text":document.page_content}).questions

    hype = embedding_model.embed_documents(questions)

    return hype

### Creation and Population of FAISS Vectorstore

- Each chunk is stored multiple times, once for each generated question embedding.
- The embeddings are stored in a FAISS index for efficient similarity (L2 based) search.
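
A note on the L2 choice: for unit-length vectors, squared L2 distance and cosine similarity rank neighbours identically, because ‖q − d‖² = 2 − 2·cos(q, d). The check below is a small pure-Python sketch of that identity. It assumes the embeddings are normalized to length 1; the example vectors are arbitrary.

```python
import math


def normalize(vec: list[float]) -> list[float]:
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def l2_squared(a: list[float], b: list[float]) -> float:
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


q = normalize([1.0, 2.0, 2.0])
d = normalize([2.0, 1.0, 2.0])

lhs = l2_squared(q, d)
rhs = 2 - 2 * cosine(q, d)
print(lhs, rhs)
```

So a smaller L2 distance corresponds to a higher cosine similarity whenever the vectors are normalized.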

In [31]:
from langchain_classic.vectorstores import FAISS
from langchain_classic.docstore import InMemoryDocstore
import faiss

In [32]:
def generate_vectorstore(chunks: list[Document]):
    
    embedding_model = OpenAIEmbeddings(model=EMBEDDING_MODEL_NAME)
    # Probe the embedding dimension once so the FAISS index matches the model
    embedding_dim = len(embedding_model.embed_query("hello world"))

    vector_store = FAISS(
        embedding_function=embedding_model,
        index=faiss.IndexFlatL2(embedding_dim), # L2 index for similarity search
        docstore=InMemoryDocstore(),
        index_to_docstore_id={} # Maintain index-to-document mapping
    )

    for chunk in chunks:

        hype = generate_hypothetical_prompt_embeddings(chunk)

        # Pair the chunk's content with each generated embedding vector.
        # Each chunk is inserted multiple times, once for each prompt vector
        chunks_with_embedding_vectors = [(chunk.page_content, vec) for vec in hype]

        # Add embeddings to the store
        vector_store.add_embeddings(text_embeddings=chunks_with_embedding_vectors)
    

    return vector_store

In [33]:
import time

In [34]:
# Chunk size can be quite large with HyPE, as we are not losing precision with more information per chunk.
# Still need to test whether the model generates enough questions per chunk; this will mostly depend on the information density.
start_time = time.time()
chunks_vector_store = generate_vectorstore(processed_document_chunks)
end_time = time.time()

In [44]:
round(end_time-start_time,0) // 60

17.0

In [None]:
chunks_vector_store.save_local(folder_path="./data/faiss_index")

### Create retriever

In [46]:
chunks_query_retriever = chunks_vector_store.as_retriever(search_kwargs={"k": 3})

### Testing

In [69]:
test_query = "I am unable to return back to the question to change my answer."

In [70]:
docs = chunks_query_retriever.invoke(input=test_query)

In [71]:
from langchain_core.load import loads, dumps

In [72]:
deduplicated_docs = set([dumps(doc) for doc in docs])

In [73]:
deduplicated_docs

 '{"lc": 1, "type": "constructor", "id": ["langchain", "schema", "document", "Document"], "kwargs": {"id": "99caaf80-9180-4364-857f-a259840d913c", "page_content": "How do I take a quiz where I can only view one question at a time? Your instructor may choose to build quizzes that show one question at a time. This means you will receive only one quiz question on your screen at a time instead of all questions posted at once. Note: Your instructor may be using an upgraded quiz tool called New Quizzes in your course. If the quiz you are accessing displays differently, your instructor may have used the New Quizzes tool to create the quiz. Functionality may differ between these quiz types. For help with viewing one quiz question at a time, please see How do I take a quiz where I can only view one question at a time in New Quizzes?. \'t be able to return to this question.  This guide covered how to take a quiz where I can only view one question at a time. Next Questions Each question will appe

In [74]:
final_docs = [loads(doc) for doc in list(deduplicated_docs)]
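
Because each chunk is stored once per question, the top-k hits can contain the same chunk several times. A `set` removes the duplicates but does not guarantee that the retriever's ranking order survives. If order matters, one possible alternative is `dict.fromkeys`, sketched here on hypothetical strings standing in for the serialized documents.

```python
# Hypothetical ranked hits: the best match appears twice because two of its
# question vectors were close to the query.
ranked_hits = ["chunk B", "chunk A", "chunk B"]

# dict keys are unique and keep insertion order, so the first occurrence wins.
unique_in_rank_order = list(dict.fromkeys(ranked_hits))

print(unique_in_rank_order)
```

The same pattern applies to the `dumps(doc)` strings, since they are hashable.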

In [75]:
final_docs

[Document(id='af39cd33-0e88-4199-9f83-5f25fc5480b2', metadata={}, page_content='. View Previous Attempts You can also view previous attempts through the sidebar submission details. Click the View Previous Attempts link. View Quiz Results for Previous Attempts Each quiz attempt will be listed in the sidebar with a hyperlink to the quiz results. Click the attempt you wish to view . The quiz results for that attempt will appear . Keep in mind that the same settings will apply in the quiz results, meaning that you may only be able to view your responses or not view quiz results at all. To return to the quiz, click the Back to Quiz link .'),
 Document(id='99caaf80-9180-4364-857f-a259840d913c', metadata={}, page_content="How do I take a quiz where I can only view one question at a time? Your instructor may choose to build quizzes that show one question at a time. This means you will receive only one quiz question on your screen at a time instead of all questions posted at once. Note: Your in

In [76]:
import textwrap

In [77]:
for doc in final_docs:
    print(textwrap.TextWrapper(width=100).fill(doc.page_content))
    print("-"*50)

. View Previous Attempts You can also view previous attempts through the sidebar submission details.
Click the View Previous Attempts link. View Quiz Results for Previous Attempts Each quiz attempt
will be listed in the sidebar with a hyperlink to the quiz results. Click the attempt you wish to
view . The quiz results for that attempt will appear . Keep in mind that the same settings will
apply in the quiz results, meaning that you may only be able to view your responses or not view quiz
results at all. To return to the quiz, click the Back to Quiz link .
--------------------------------------------------
Next Questions Each question will appear on the screen by itself. Once you have answered the
question, the Next button will turn blue. Click the Next button to advance through the quiz.
Previous Questions If your instructor allows you to return to prior questions, you can click the
Previous button to check your answers or return to questions you left blank. Navigate Questions in
Sideb

![HyPE Example](./attachments/HyPE-example.png)