# RAPTOR: Recursive Abstractive Processing for Tree-Organized Retrieval

This notebook shows how to use an implementation of RAPTOR with llama-index, leveraging the RAPTOR llama-pack.

RAPTOR builds a tree for retrieval by recursively clustering text chunks and summarizing each cluster, layer by layer.
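The recursive build can be illustrated with a minimal sketch. It is not the pack's implementation: `build_tree`, `pair_cluster`, `join_summary`, and the `max_levels` parameter are hypothetical names invented here, and the stand-in functions replace the embedding-based clustering and LLM summarization that a real system would use.

```python
from typing import Callable, List


def build_tree(
    chunks: List[str],
    cluster: Callable[[List[str]], List[List[str]]],
    summarize: Callable[[List[str]], str],
    max_levels: int = 3,
) -> List[List[str]]:
    """Return the layers of the tree; layers[0] holds the leaf chunks."""
    layers = [list(chunks)]
    for _ in range(max_levels):
        current = layers[-1]
        if len(current) <= 1:
            break  # a single node cannot be clustered further
        clusters = cluster(current)
        # Each cluster becomes one summary node in the next layer up.
        layers.append([summarize(group) for group in clusters])
    return layers


# Hypothetical stand-ins so the sketch runs without an embedding model or LLM.
def pair_cluster(nodes: List[str]) -> List[List[str]]:
    """Group neighbouring nodes in pairs (a real system clusters by embedding)."""
    return [nodes[i : i + 2] for i in range(0, len(nodes), 2)]


def join_summary(group: List[str]) -> str:
    """Concatenate texts (a real system asks an LLM for an abstractive summary)."""
    return "(" + " + ".join(group) + ")"


layers = build_tree(["a", "b", "c", "d", "e"], pair_cluster, join_summary)
for level, nodes in enumerate(layers):
    print(level, nodes)
```

Each pass shrinks the layer, so the top of the tree holds a few broad summaries while the bottom keeps the original chunks.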

There are two retrieval modes:
- `tree_traversal` -- traverse the tree of clusters, performing top-k at each level in the tree.
- `collapsed` -- treat the entire tree as a giant pile of nodes and perform a simple top-k.
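The difference between the two modes can be sketched on a toy tree. This is an illustration only, not the pack's code: the `Node` class, the fixed `score` field (a stand-in for query similarity), and both functions are assumptions made for this example. As a simplification, the traversal here returns only the nodes selected at the lowest level.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Node:
    node_id: str
    level: int  # 0 = leaf chunk, higher = summary node
    score: float  # stand-in for similarity between this node and the query
    children: List[str] = field(default_factory=list)


def top_k(nodes: List[Node], k: int) -> List[Node]:
    """Return the k highest-scoring nodes."""
    return sorted(nodes, key=lambda n: n.score, reverse=True)[:k]


def collapsed(tree: Dict[str, Node], k: int) -> List[Node]:
    """Ignore the hierarchy and rank every node together."""
    return top_k(list(tree.values()), k)


def tree_traversal(tree: Dict[str, Node], k: int) -> List[Node]:
    """Start at the top level and descend only into the children of selected nodes."""
    top_level = max(n.level for n in tree.values())
    selected = top_k([n for n in tree.values() if n.level == top_level], k)
    while selected and selected[0].level > 0:
        candidates = [tree[c] for parent in selected for c in parent.children]
        selected = top_k(candidates, k)
    return selected


tree = {
    "R": Node("R", 2, 0.5, ["S1", "S2"]),
    "S1": Node("S1", 1, 0.9, ["c1", "c2"]),
    "S2": Node("S2", 1, 0.2, ["c3", "c4"]),
    "c1": Node("c1", 0, 0.3),
    "c2": Node("c2", 0, 0.4),
    "c3": Node("c3", 0, 0.95),
    "c4": Node("c4", 0, 0.1),
}

print([n.node_id for n in collapsed(tree, k=2)])
print([n.node_id for n in tree_traversal(tree, k=1)])
```

In this toy tree, traversal with `k=1` never reaches `c3`, the best-scoring leaf, because its parent `S2` is pruned one level up. The collapsed mode finds it because every node competes in a single ranking.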

See [the paper](https://arxiv.org/abs/2401.18059) for full algorithm details.

## Setup

In [4]:
!pip install llama-index llama-index-packs-raptor llama-index-vector-stores-chroma chromadb


In [3]:
from llama_index.packs.raptor import RaptorPack

# optionally download the pack to inspect/modify it yourself!
# from llama_index.core.llama_pack import download_llama_pack
# RaptorPack = download_llama_pack("RaptorPack", "./raptor_pack")

In [7]:
!wget https://arxiv.org/pdf/2401.18059.pdf -O ./raptor_paper.pdf

--2024-07-07 14:45:22--  https://arxiv.org/pdf/2401.18059.pdf
Resolving arxiv.org (arxiv.org)... 151.101.67.42, 151.101.3.42, 151.101.195.42, ...
Connecting to arxiv.org (arxiv.org)|151.101.67.42|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: http://arxiv.org/pdf/2401.18059 [following]
--2024-07-07 14:45:22--  http://arxiv.org/pdf/2401.18059
Connecting to arxiv.org (arxiv.org)|151.101.67.42|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2547113 (2.4M) [application/pdf]
Saving to: ‘./raptor_paper.pdf’


2024-07-07 14:45:23 (5.71 MB/s) - ‘./raptor_paper.pdf’ saved [2547113/2547113]



In [4]:
import os

# `Setting` is a local module (not shown here) that stores the OpenAI API key.
from Setting import openAIKey

os.environ["OPENAI_API_KEY"] = openAIKey.Key

## Constructing the Clusters/Hierarchy Tree

In [5]:
import nest_asyncio

nest_asyncio.apply()

In [6]:
from llama_index.core import SimpleDirectoryReader

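# To index the RAPTOR paper downloaded above instead:
# documents = SimpleDirectoryReader(input_files=["./raptor_paper.pdf"]).load_data()

# This run indexes a local investor report; replace the path with your own file.
documents = SimpleDirectoryReader(
    input_files=["/Users/tanghao/NUS/Internship/EverpixAnalysis/Data/Investor Reports/2011.md"]
).load_data()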

In [7]:
from llama_index.core.node_parser import SentenceSplitter
from llama_index.llms.openai import OpenAI
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.vector_stores.chroma import ChromaVectorStore
import chromadb

client = chromadb.PersistentClient(path="./investor_report_2011_db")
collection = client.get_or_create_collection("investor_report_2011")

vector_store = ChromaVectorStore(chroma_collection=collection)

raptor_pack = RaptorPack(
    documents,
    embed_model=OpenAIEmbedding(
        model="text-embedding-3-small"
    ),  # used for embedding clusters
    llm=OpenAI(model="gpt-3.5-turbo", temperature=0.1),  # used for generating summaries
    vector_store=vector_store,  # used for storage
    similarity_top_k=2,  # top k for each layer, or overall top-k for collapsed
    mode="collapsed",  # sets default mode
    transformations=[
        SentenceSplitter(chunk_size=400, chunk_overlap=50)
    ],  # transformations applied for ingestion
)

Generating embeddings for level 0.
Performing clustering for level 0.
Generating summaries for level 0 with 1 clusters.
Level 0 created summaries/clusters: 1
Generating embeddings for level 1.
Performing clustering for level 1.
Generating summaries for level 1 with 1 clusters.
Level 1 created summaries/clusters: 1
Generating embeddings for level 2.
Performing clustering for level 2.
Generating summaries for level 2 with 1 clusters.
Level 2 created summaries/clusters: 1
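The `chunk_size=400, chunk_overlap=50` settings passed to `SentenceSplitter` above control how the document is cut into leaf chunks before clustering. The sketch below shows only the overlap arithmetic with a plain sliding window. `sliding_window_chunks` is a hypothetical helper, not a llama-index function, and the real splitter also counts tokens with a tokenizer and tries to keep sentences intact.

```python
from typing import List


def sliding_window_chunks(
    tokens: List[str], chunk_size: int, chunk_overlap: int
) -> List[List[str]]:
    """Cut tokens into windows of chunk_size that share chunk_overlap tokens."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start : start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the input
    return chunks


# Fake tokens stand in for a tokenized document.
tokens = [f"t{i}" for i in range(1000)]
chunks = sliding_window_chunks(tokens, chunk_size=400, chunk_overlap=50)
print([len(c) for c in chunks])
```

With these settings each new chunk starts 350 tokens after the previous one, so consecutive chunks share 50 tokens of context.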


## Retrieval

In [9]:
nodes = raptor_pack.run("What are the primary goals for Everpix Private Alpha?", mode="collapsed")
print(len(nodes))
print(nodes[1].text)

2
Everpix project started in April 2011 and was officially announced at TechCrunch Disrupt in September 2011. The Private Alpha phase aimed to test the infrastructure and software, with around 5,000 users registering and 2,080 importing photos. The Mac Uploader was used by 732 users, with an average of 10,000 photos imported by Mac users. The Public Beta was launched at the end of 2011, allowing anyone to create an account. Unexpected PR coverage led to 3,000 new users and 8,000,000 new photos imported, causing infrastructure scaling issues. The back end infrastructure was re-engineered for the Public Beta using Tornado & MySQL on Amazon Web Services. Recruitment for key positions like Android and Windows developers was challenging. Financially, expenses exceeded the budget due to higher lawyer fees and hiring. Future goals include releasing Everpix 1.0, offering premium subscriptions, and addressing recruitment challenges. The co-founders plan to adjust their salaries, and a Series A 

In [12]:
nodes = raptor_pack.run(
    "What is the major risk for Everpix?", mode="tree_traversal"
)
print(len(nodes))
print(nodes[1].text)

Retrieved parent IDs from level 2: ['41613443-5fbf-453d-a10a-2f4d2b4fcdf0']
Retrieved 1 from parents at level 2.
Retrieved parent IDs from level 1: ['d73a8078-774a-4903-aeb1-da7393ac5fc4']
Retrieved 1 from parents at level 1.
Retrieved parent IDs from level 0: ['cd5be605-e5b9-4314-ad0b-8689144d1d87']
Retrieved 2 from parents at level 0.
2
Google App Engine was great for Everpix Private Alpha and saved us a lot of time, but we had reached the tipping point.

Re-engineering the back end also allowed to address numerous design issues with the previous implementation (for instance proper color matching of imported photos and consistent scaling across import paths) and add valuable features like much improved analytics.

This re-engineering does come with an important downside in the fact that it won’t scale for free contrary to Google App Engine which handled that part automatically. We can only do so much here with our limited internal knowledge and proper scaling will require hiring an e

## Loading

Since we saved the tree to a vector store, we can load it again later. (For local vector stores, the retriever provides `persist` and `from_persist_dir` methods.)

In [None]:
from llama_index.packs.raptor import RaptorRetriever

retriever = RaptorRetriever(
    [],
    embed_model=OpenAIEmbedding(
        model="text-embedding-3-small"
    ),  # used for embedding clusters
    llm=OpenAI(model="gpt-3.5-turbo", temperature=0.1),  # used for generating summaries
    vector_store=vector_store,  # used for storage
    similarity_top_k=2,  # top k for each layer, or overall top-k for collapsed
    mode="tree_traversal",  # sets default mode
)

In [None]:
# if using a default vector store
# retriever.persist("./persist")
# retriever = RaptorRetriever.from_persist_dir("./persist", ...)

## Query Engine

In [None]:
from llama_index.core.query_engine import RetrieverQueryEngine

query_engine = RetrieverQueryEngine.from_args(
    retriever, llm=OpenAI(model="gpt-3.5-turbo", temperature=0.1)
)

In [None]:
response = query_engine.query("What baselines was RAPTOR compared against?")

In [None]:
print(str(response))

BM25 and DPR
