- propositions are small, self-contained pieces of information.
- individual facts or claims that can stand on their own

- key characteristics:
    - contain a single piece of info
    - can be understood without additional context; self-contained
    - state a clear fact or claim
    - concise and to the point
    - include any necessary context within themselves
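
The "understood without additional context" criterion can be roughly approximated with a lexical check. A minimal sketch, assuming a small hand-picked pronoun list and a hypothetical `has_pronoun` helper; it is only an illustration, not the LLM-based quality check used later in this notebook:

```python
import re

# Hand-picked list for illustration; a real check would need a fuller list.
PRONOUNS = {"it", "its", "he", "she", "they", "them", "his", "her", "their"}


def has_pronoun(proposition: str) -> bool:
    """Return True if the proposition contains a pronoun from PRONOUNS."""
    words = re.findall(r"[a-z']+", proposition.lower())
    return any(word in PRONOUNS for word in words)


print(has_pronoun("It is known for its nearly four-degree lean."))  # True
print(has_pronoun("The Leaning Tower of Pisa has 296 steps."))      # False
```

A proposition flagged this way still depends on its neighbours for meaning, so it is a candidate for rewriting with the full entity name.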

- instead of a paragraph about the `Leaning Tower of Pisa`, we might have propositions like:
    - The Leaning Tower of Pisa is 55.86 metres (183 feet 3 inches) tall on its low side.
    - The Leaning Tower of Pisa is known for its nearly four-degree lean.
    - The Leaning Tower of Pisa has 296 steps.


- some drawbacks of traditional methods of splitting text into chunks like paragraphs or sentences:
    - important info might be split across multiple chunks
    - chunks may contain irrelevant info along with the relevant info
    - harder to pinpoint exactly the info needed to answer a specific question.
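
The first drawback can be seen with a toy fixed-width splitter. A minimal sketch, assuming a hypothetical `split_fixed` helper (not part of LangChain) that cuts on character counts rather than the token-based splitter used later in this notebook:

```python
def split_fixed(text: str, size: int) -> list[str]:
    """Cut text into consecutive windows of `size` characters (hypothetical helper)."""
    return [text[i:i + size] for i in range(0, len(text), size)]


text = "The tower has 296 or 294 steps. Its weight is estimated at 14,500 tonnes."
pieces = split_fixed(text, size=40)

for n, piece in enumerate(pieces, start=1):
    print(n, repr(piece))

# Check whether any single piece still holds the whole weight fact
# (both the word "weight" and the value "14,500").
fact_intact = any("weight" in p and "14,500" in p for p in pieces)
print(fact_intact)  # False: the cut falls inside the word "weight"
```

Neither piece can answer "how much does the tower weigh?" on its own, which is the situation propositions are meant to avoid.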

- propositions aim to address these issues by breaking info down to its smallest meaningful parts.
- using propositions gives finer control over retrieval and handles specific queries better, especially when extracting knowledge from detailed or complex texts.
- because the document is broken into small factual propositions, retrieval can be highly specific, which makes it easier to extract precise answers from large or complex documents.
- more precise input to the LLM tends to yield more accurate responses.

```mermaid
graph LR
A(prepare your data) --> B(generate propositions) --> C(check quality) --> D(index the propositions) --> E(retrieval) --> F(use in your system)
```

### Preparing Data

In [1]:
sample_content = """The Leaning Tower of Pisa, or simply the Tower of Pisa (torre di Pisa), is the campanile, or freestanding bell tower, of Pisa Cathedral. It is known for its nearly four-degree lean, the result of an unstable foundation. The tower is one of three structures in Pisa's Cathedral Square (Piazza del Duomo), which includes the cathedral and Pisa Baptistry. Over time, the tower has become one of the most visited tourist attractions in the world as well as an architectural icon of Italy, receiving over 5 million visitors each year.

The height of the tower is 55.86 metres (183 feet 3 inches) from the ground on the low side and 56.67 m (185 ft 11 in) on the high side. The width of the walls at the base is 2.44 m (8 ft 0 in). Its weight is estimated at 14,500 tonnes (16,000 short tons). The tower has 296 or 294 steps; the seventh floor has two fewer steps on the north-facing staircase.

The tower began to lean during construction in the 12th century, due to soft ground which could not properly support the structure's weight. It worsened through the completion of construction in the 14th century. By 1990, the tilt had reached 5.5 degrees. The structure was stabilized by remedial work between 1993 and 2001, which reduced the tilt to 3.97 degrees.

The identity of the architect of the tower is a subject of controversy. The design had long been attributed to a man named Guglielmo and to Bonanno Pisano, the latter a well-known 12th-century resident artist of Pisa known for his bronze casting, particularly in the Pisa Duomo. Pisano left Pisa in 1185 for Monreale, Sicily, only to return and die in his home town. A piece of cast bearing his name was discovered at the foot of the tower in 1820, but this may be related to the bronze door in the façade of the cathedral that was destroyed in 1595. A 2001 study seems to indicate Diotisalvi was the original architect, due to the time of construction and affinity with other Diotisalvi works, notably the bell tower of San Nicola and the Baptistery, both in Pisa.

The tower has survived at least four strong earthquakes since 1280. A 2018 engineering investigation concluded that the tower withstood the tremors because of dynamic soil-structure interaction: the height and stiffness of the tower combined with the softness of the foundation soil influences the tower's vibrational characteristics in such a way that it does not resonate with earthquake ground motion. The same soft soil that caused the leaning and brought the tower to the verge of collapse helped to prevent significant destruction in the event of an earthquake.
"""

### Chunking

In [None]:
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_core.documents import Document

In [3]:
docs_list = [Document(page_content=sample_content, metadata={"Title":"Leaning Tower of Pisa", "Source":"https://en.wikipedia.org/wiki/Leaning_Tower_of_Pisa"})]

In [4]:
text_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(chunk_size=256, chunk_overlap=64)

In [5]:
chunks = text_splitter.split_documents(docs_list)

In [6]:
chunks

[Document(metadata={'Title': 'Leaning Tower of Pisa', 'Source': 'https://en.wikipedia.org/wiki/Leaning_Tower_of_Pisa'}, page_content="The Leaning Tower of Pisa, or simply the Tower of Pisa (torre di Pisa), is the campanile, or freestanding bell tower, of Pisa Cathedral. It is known for its nearly four-degree lean, the result of an unstable foundation. The tower is one of three structures in Pisa's Cathedral Square (Piazza del Duomo), which includes the cathedral and Pisa Baptistry. Over time, the tower has become one of the most visited tourist attractions in the world as well as an architectural icon of Italy, receiving over 5 million visitors each year.\n\nThe height of the tower is 55.86 metres (183 feet 3 inches) from the ground on the low side and 56.67 m (185 ft 11 in) on the high side. The width of the walls at the base is 2.44 m (8 ft 0 in). Its weight is estimated at 14,500 tonnes (16,000 short tons). The tower has 296 or 294 steps; the seventh floor has two fewer steps on the

In [7]:
for i, doc in enumerate(chunks):
    doc.metadata['chunk_id'] = i+1

### Generate Propositions

In [8]:
from pydantic import BaseModel, Field
from langchain_core.prompts import FewShotChatMessagePromptTemplate, ChatPromptTemplate
#from langchain_ollama import ChatOllama
from langchain_openai import ChatOpenAI
from dotenv import load_dotenv, find_dotenv

In [9]:
load_dotenv(find_dotenv())

True

In [10]:
# data model
class GeneratePropositions(BaseModel):
    """List of all the propositions in a given document"""

    propositions: list[str] = Field(description="List of propositions (factual, self-contained, concise information)")

In [11]:
structured_llm = ChatOpenAI(model="gpt-5-mini", temperature=0).with_structured_output(GeneratePropositions)

In [12]:
proposition_example = [
    {
        "document": "In 1969, Neil Armstrong became the first person to walk on the Moon during the Apollo 11 mission.",
        "propositions": "['Neil Armstrong was an astronaut.', 'Neil Armstrong walked on the Moon in 1969.', 'Neil Armstrong was the first person to walk on the Moon.', 'Neil Armstrong walked on the Moon during the Apollo 11 mission.', 'The Apollo 11 mission occurred in 1969.']"
    },
]

In [13]:
proposition_prompt_example = ChatPromptTemplate([
    ('human', "Example Document:\n{document}"),
    ('ai',"Example Propositions:\n{propositions}")
])

In [14]:
few_shot_prompt = FewShotChatMessagePromptTemplate(
    examples=proposition_example,
    example_prompt=proposition_prompt_example # This is a prompt template used to format each individual example.
)

In [15]:
few_shot_prompt

FewShotChatMessagePromptTemplate(examples=[{'document': 'In 1969, Neil Armstrong became the first person to walk on the Moon during the Apollo 11 mission.', 'propositions': "['Neil Armstrong was an astronaut.', 'Neil Armstrong walked on the Moon in 1969.', 'Neil Armstrong was the first person to walk on the Moon.', 'Neil Armstrong walked on the Moon during the Apollo 11 mission.', 'The Apollo 11 mission occurred in 1969.']"}], input_variables=[], input_types={}, partial_variables={}, example_prompt=ChatPromptTemplate(input_variables=['document', 'propositions'], input_types={}, partial_variables={}, messages=[HumanMessagePromptTemplate(prompt=PromptTemplate(input_variables=['document'], input_types={}, partial_variables={}, template='Example Document:\n{document}'), additional_kwargs={}), AIMessagePromptTemplate(prompt=PromptTemplate(input_variables=['propositions'], input_types={}, partial_variables={}, template='Example Propositions:\n{propositions}'), additional_kwargs={})]))

In [16]:
few_shot_prompt.invoke({}).messages

[HumanMessage(content='Example Document:\nIn 1969, Neil Armstrong became the first person to walk on the Moon during the Apollo 11 mission.', additional_kwargs={}, response_metadata={}),
 AIMessage(content="Example Propositions:\n['Neil Armstrong was an astronaut.', 'Neil Armstrong walked on the Moon in 1969.', 'Neil Armstrong was the first person to walk on the Moon.', 'Neil Armstrong walked on the Moon during the Apollo 11 mission.', 'The Apollo 11 mission occurred in 1969.']", additional_kwargs={}, response_metadata={})]

In [17]:
system = """Please break down the following text into simple, self-contained propositions. Ensure that each proposition meets the following criteria:

    1. (Important) Express a Single Fact: Each proposition should state one specific fact or claim.
    2. (Important) Be Understandable Without Context: The proposition should be self-contained, meaning it can be understood without needing additional context.
    3. (Important) Use Full Names, Not Pronouns: Avoid pronouns or ambiguous references; use full entity names.
    4. (Important) Include Relevant Dates/Qualifiers: If applicable, include necessary dates, times, and qualifiers to make the fact precise.
    5. (Important) Contain One Subject-Predicate Relationship: Focus on a single subject and its corresponding action or attribute, without conjunctions or multiple clauses.
"""

In [18]:
final_propositions_prompt = ChatPromptTemplate([
    ('system',system),
    few_shot_prompt,
    ('human',"Input Document:\n{document}")
])

In [19]:
for message in final_propositions_prompt.invoke({"document":".....EXAMPLE DOCUMENT....."}).messages:
    print(message.content)
    print()

Please break down the following text into simple, self-contained propositions. Ensure that each proposition meets the following criteria:

    1. (Important) Express a Single Fact: Each proposition should state one specific fact or claim.
    2. (Important) Be Understandable Without Context: The proposition should be self-contained, meaning it can be understood without needing additional context.
    3. (Important) Use Full Names, Not Pronouns: Avoid pronouns or ambiguous references; use full entity names.
    4. (Important) Include Relevant Dates/Qualifiers: If applicable, include necessary dates, times, and qualifiers to make the fact precise.
    5. (Important) Contain One Subject-Predicate Relationship: Focus on a single subject and its corresponding action or attribute, without conjunctions or multiple clauses.


Example Document:
In 1969, Neil Armstrong became the first person to walk on the Moon during the Apollo 11 mission.

Example Propositions:
['Neil Armstrong was an astrona

In [20]:
proposition_generator = final_propositions_prompt | structured_llm

In [21]:
propositions = []

In [22]:
for chunk in chunks:
    response = proposition_generator.invoke({"document": chunk.page_content})
    for proposition in response.propositions:
        # copy the chunk's metadata (Title, Source, chunk_id) onto each proposition
        propositions.append(Document(page_content=proposition, metadata=dict(chunk.metadata)))

In [23]:
len(propositions)

58

In [24]:
import random

In [25]:
for proposition in random.sample(propositions,5):
    print(proposition.page_content)

The 2001 study noted an affinity between the tower and the bell tower of San Nicola in Pisa.
The Leaning Tower of Pisa is the campanile of Pisa Cathedral.
The softness of the foundation soil brought the tower to the verge of collapse.
An engineering investigation in 2018 concluded that dynamic soil-structure interaction explained why the tower withstood earthquake tremors.
The softness of the foundation soil helped to prevent significant destruction to the tower during earthquakes.


In [26]:
import pickle

In [None]:
with open("./data/propositions.pickle",'wb') as fp:
    pickle.dump(propositions,fp)

### Quality Check

In [28]:
# data model
class GradePropositions(BaseModel):
    """Grade a given proposition on accuracy, clarity, completeness, and conciseness"""

    accuracy: int = Field(description="Rate from 1-10 based on how well the proposition reflects the original text.")

    clarity: int = Field(
        description="Rate from 1-10 based on how easy it is to understand the proposition without additional context."
    )

    completeness: int = Field(
        description="Rate from 1-10 based on whether the proposition includes necessary details (e.g., dates, qualifiers)."
    )

    conciseness: int = Field(
        description="Rate from 1-10 based on whether the proposition is concise without losing important information."
    )

    reasoning: str = Field(description="The full reasoning and thought process that justifies the given scores/ evaluation")

In [29]:
from langchain_openai import ChatOpenAI
from dotenv import load_dotenv, find_dotenv

In [30]:
load_dotenv(find_dotenv())

True

In [31]:
structured_llm = ChatOpenAI(model='gpt-5-mini', temperature=0).with_structured_output(GradePropositions)

In [32]:
evaluation_prompt_template = """
Please evaluate the following proposition based on the criteria below:

1. Accuracy: Rate from 1-10 based on how well the proposition reflects the original text.
2. Clarity: Rate from 1-10 based on how easy it is to understand the proposition without additional context.
3. Completeness: Rate from 1-10 based on whether the proposition includes necessary details (e.g., dates, qualifiers).
4. Conciseness: Rate from 1-10 based on whether the proposition is concise without losing important information.

Don't jump into scoring the proposition, clearly show your thought process.

---

Example:

Docs: In 1969, Neil Armstrong became the first person to walk on the Moon during the Apollo 11 mission.

Proposition_1: Neil Armstrong was an astronaut.
Evaluation_1: "accuracy": 10, "clarity": 10, "completeness": 10, "conciseness": 10

Proposition_2: Neil Armstrong walked on the Moon in 1969.
Evaluation_2: "accuracy": 10, "clarity": 10, "completeness": 10, "conciseness": 10

Proposition_3: Neil Armstrong was the first person to walk on the Moon.
Evaluation_3: "accuracy": 10, "clarity": 10, "completeness": 10, "conciseness": 10

Proposition_4: Neil Armstrong walked on the Moon during the Apollo 11 mission.
Evaluation_4: "accuracy": 10, "clarity": 10, "completeness": 10, "conciseness": 10

Proposition_5: The Apollo 11 mission occurred in 1969.
Evaluation_5: "accuracy": 10, "clarity": 10, "completeness": 10, "conciseness": 10

---

New Input:

Proposition: 
"{proposition}"

Original Text: 
"{original_text}"
""".strip()

In [33]:
prompt = ChatPromptTemplate([
    ('system', evaluation_prompt_template),
])

In [34]:
proposition_evaluator = prompt | structured_llm

In [35]:
thresholds = {"accuracy": 7, "clarity": 7, "completeness": 7, "conciseness": 7}


def evaluate_proposition(proposition, original_text):
    response = proposition_evaluator.invoke({"proposition": proposition, "original_text": original_text})
    
    
    scores = {"accuracy": response.accuracy, "clarity": response.clarity, "completeness": response.completeness, "conciseness": response.conciseness}

    reasoning = response.reasoning

    return scores, reasoning


def passes_quality_check(scores):
    for category, score in scores.items():
        if score < thresholds[category]:
            return False
    return True
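
As a quick sanity check of the gating rule, the sketch below re-declares the same thresholds and applies them to hand-written score dictionaries. The `rejected` values are copied from the failing example printed further down; the `demo_` names and `failing_categories` are illustrative and not part of the notebook:

```python
demo_thresholds = {"accuracy": 7, "clarity": 7, "completeness": 7, "conciseness": 7}


def demo_passes(scores: dict[str, int]) -> bool:
    """Same rule as passes_quality_check: every score must reach its threshold."""
    return all(score >= demo_thresholds[category] for category, score in scores.items())


def failing_categories(scores: dict[str, int]) -> list[str]:
    """List the categories that fall below their threshold (illustrative helper)."""
    return [category for category, score in scores.items() if score < demo_thresholds[category]]


all_tens = {"accuracy": 10, "clarity": 10, "completeness": 10, "conciseness": 10}
rejected = {"accuracy": 9, "clarity": 10, "completeness": 6, "conciseness": 10}

print(demo_passes(all_tens))         # True
print(demo_passes(rejected))         # False
print(failing_categories(rejected))  # ['completeness']
```

A single category below its threshold is enough to discard a proposition, so one strict score can remove an otherwise accurate fact.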

In [36]:
evaluated_propositions = []

In [37]:
for idx, proposition in enumerate(propositions):
    scores, reasoning = evaluate_proposition(proposition.page_content, chunks[proposition.metadata['chunk_id'] - 1].page_content)
    if passes_quality_check(scores):
        # Proposition passes quality check, keep it
        evaluated_propositions.append(proposition)
    else:
        # Proposition fails, discard or flag for further review
        print(f"{idx+1}) Proposition: {proposition.page_content}\nScores: {scores}")
        print(f"Reasoning:\n{reasoning}")
        print("-"*50)

23) Proposition: The Leaning Tower of Pisa has 294 steps according to other counts.
Scores: {'accuracy': 9, 'clarity': 10, 'completeness': 6, 'conciseness': 10}
Reasoning:
Thought process:

- I compared the proposition to the original text. The original states: "The tower has 296 or 294 steps; the seventh floor has two fewer steps on the north-facing staircase." This explicitly gives two possible step counts (296 or 294) and notes an explanation for the difference (the seventh floor has two fewer steps on one staircase).

- The proposition says: "The Leaning Tower of Pisa has 294 steps according to other counts." This aligns with one of the two counts given in the source (294) and frames it as an alternate/other count, which is consistent with the source language "296 or 294 steps." Therefore the proposition correctly reflects part of the source.

Scoring justification:
- Accuracy (9): The proposition is factually supported by the original text (294 is one of the counts). I did not give

In [38]:
len(evaluated_propositions)

37

In [None]:
with open("./data/evaluated_propositions.pickle",'wb') as fp:
    pickle.dump(evaluated_propositions,fp)

### Embedding propositions in a vectorstore

In [40]:
from langchain_chroma import Chroma
from langchain_openai.embeddings import OpenAIEmbeddings

In [41]:
vectorstore_propositions = Chroma.from_documents(documents=evaluated_propositions,embedding=OpenAIEmbeddings(model="text-embedding-3-small"))

In [54]:
retriever_propositions = vectorstore_propositions.as_retriever(search_type="similarity", search_kwargs={"k":3})

In [55]:
query = "how many people visit the tower each year?"

In [56]:
res_proposition = retriever_propositions.invoke(query)

In [57]:
for i, doc in enumerate(res_proposition):
    # single quotes inside the f-string keep this valid on Python versions before 3.12
    print(f"{i}) Content: {doc.page_content} --- chunk_id: {doc.metadata['chunk_id']}")

0) Content: The Leaning Tower of Pisa receives over 5 million visitors each year. --- chunk_id: 1
1) Content: The Leaning Tower of Pisa has become one of the most visited tourist attractions in the world. --- chunk_id: 1
2) Content: The tower survived at least four strong earthquakes since 1280. --- chunk_id: 4


In [69]:
# the answer is clearly stated in the top retrieved doc, with no extra, irrelevant info
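
Each retrieved proposition carries the `chunk_id` of the chunk it was generated from, so the surrounding text can be looked up when more context is needed, using the same 1-based indexing as the quality check. A minimal sketch with plain dictionaries standing in for `Document` objects; `parent_chunk` and the `demo_` names are hypothetical:

```python
# Plain-dict stand-ins for the LangChain Document objects used in this notebook.
demo_chunks = [
    {"chunk_id": 1, "text": "Over time, the tower has become one of the most visited tourist attractions in the world."},
    {"chunk_id": 2, "text": "The tower began to lean during construction in the 12th century."},
]
demo_hits = [
    {"text": "The Leaning Tower of Pisa receives over 5 million visitors each year.", "chunk_id": 1},
]


def parent_chunk(hit: dict, chunks: list[dict]) -> dict:
    """Return the chunk a proposition came from (chunk_id is 1-based)."""
    return chunks[hit["chunk_id"] - 1]


for hit in demo_hits:
    parent = parent_chunk(hit, demo_chunks)
    print(f"{hit['text']} -> chunk {parent['chunk_id']}: {parent['text']}")
```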

### Comparison with larger chunk size

In [None]:
from langchain_core.vectorstores import InMemoryVectorStore

In [64]:
vectorstore_larger = InMemoryVectorStore.from_documents(documents=chunks, embedding=OpenAIEmbeddings(model="text-embedding-3-small"))

In [65]:
retriever_larger = vectorstore_larger.as_retriever(search_type="similarity", search_kwargs={"k":3})

In [66]:
res_larger = retriever_larger.invoke(query)

In [67]:
for i, doc in enumerate(res_larger):
    # single quotes inside the f-string keep this valid on Python versions before 3.12
    print(f"{i}) Content: {doc.page_content} --- chunk_id: {doc.metadata['chunk_id']}")

0) Content: The tower began to lean during construction in the 12th century, due to soft ground which could not properly support the structure's weight. It worsened through the completion of construction in the 14th century. By 1990, the tilt had reached 5.5 degrees. The structure was stabilized by remedial work between 1993 and 2001, which reduced the tilt to 3.97 degrees. --- chunk_id: 2
1) Content: The Leaning Tower of Pisa, or simply the Tower of Pisa (torre di Pisa), is the campanile, or freestanding bell tower, of Pisa Cathedral. It is known for its nearly four-degree lean, the result of an unstable foundation. The tower is one of three structures in Pisa's Cathedral Square (Piazza del Duomo), which includes the cathedral and Pisa Baptistry. Over time, the tower has become one of the most visited tourist attractions in the world as well as an architectural icon of Italy, receiving over 5 million visitors each year.

The height of the tower is 55.86 metres (183 feet 3 inches) from

In [68]:
# the 2nd doc contains the answer, along with a lot of unrelated info

Final Observations:
- key pieces of information were lost when propositions were created from the larger chunks; the process is not deterministic because an LLM is used
- the evaluation of propositions was off at times
- this technique seems to work well with direct questions (clear requests for a specific piece of information) rather than complex queries
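
A rough way to spot where information is being lost is to count, per chunk, how many propositions were generated and how many survived the quality check. A minimal sketch on made-up `chunk_id` lists (in the notebook these would be read from each proposition's `metadata['chunk_id']`); `coverage_by_chunk` is a hypothetical helper, and a low count is only a signal, not proof of missing facts:

```python
from collections import Counter


def coverage_by_chunk(generated_ids: list[int], kept_ids: list[int], n_chunks: int) -> dict[int, tuple[int, int]]:
    """Map each 1-based chunk id to (propositions kept, propositions generated)."""
    generated = Counter(generated_ids)
    kept = Counter(kept_ids)
    return {cid: (kept[cid], generated[cid]) for cid in range(1, n_chunks + 1)}


# Made-up ids for illustration only.
generated_ids = [1, 1, 1, 2, 2, 3]
kept_ids = [1, 1, 2]

report = coverage_by_chunk(generated_ids, kept_ids, n_chunks=3)
for cid, (kept, total) in report.items():
    print(f"chunk {cid}: kept {kept} of {total}")

# Chunks with no surviving propositions can no longer be retrieved at all.
uncovered = [cid for cid, (kept, _) in report.items() if kept == 0]
print(uncovered)  # [3]
```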

### Comparison

| Aspect| Proposition-Based Retrieval | Simple Chunk Retrieval |
|---|---|---|
| Precision in Response | **High**: Delivers focused and direct answers. | **Medium**: Provides more context but may include irrelevant information. |
| Clarity and Brevity | **High**: Clear and concise, avoids unnecessary details. | **Medium**: More comprehensive but can be overwhelming. |
| Contextual Richness | **Low**: May lack context, focusing on specific propositions. | **High**: Provides additional context and details. |
| Comprehensiveness | **Low**: May omit broader context or supplementary details. | **High**: Offers a more complete view with extensive information. |
| Narrative Flow | **Medium**: Can be fragmented or disjointed. | **High**: Preserves the logical flow and coherence of the original document. |
| Information Overload | **Low**: Less likely to overwhelm with excess information. | **High**: Risk of overwhelming the user with too much information. |
| Use Case Suitability | Best for quick, factual queries. | Best for complex queries requiring in-depth understanding. |
| Efficiency | **High**: Provides quick, targeted responses. | **Medium**: May require more effort to sift through additional content. |
| Specificity | **High**: Precise and targeted responses. | **Medium**: Answers may be less targeted due to inclusion of broader context. |