# RAPTOR: Recursive Abstractive Processing for Tree-Organized Retrieval

In [1]:
# NOTE: An OpenAI API key must be set here for application initialization, even if not in use.
# If you're not utilizing OpenAI models, assign a placeholder string (e.g., "not_used").
import os, json
os.environ["OPENAI_API_KEY"] = "your-api-key"
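
If a key is already exported in the shell, assigning `os.environ["OPENAI_API_KEY"]` directly overwrites it. A minimal sketch of a gentler variant follows; `uses_placeholder_key` is a hypothetical helper introduced here for illustration and is not part of `raptor`. Note that the default configuration logged later in this notebook embeds with `OpenAIEmbeddingModel`, so a placeholder is only suitable when a non-OpenAI embedding model is configured.

```python
import os

# Keep a key that is already set in the environment;
# otherwise fall back to a placeholder so initialization does not fail.
os.environ.setdefault("OPENAI_API_KEY", "not_used")

def uses_placeholder_key():
    """Hypothetical helper: True when no real key appears to be configured."""
    return os.environ["OPENAI_API_KEY"] in {"not_used", "your-api-key"}

if uses_placeholder_key():
    print("OPENAI_API_KEY is a placeholder; OpenAI-backed models will not work.")
```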

In [2]:
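# Load the sample document ("POSCO Holdings"); UTF-8 is stated explicitly because the data is Korean.
with open('./dataset/json_data/escaped_20221114000599.json', 'r', encoding='utf-8') as file:
    text = json.load(file)
text = text[:100]  # keep only the first 100 elements so the demo stays small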

1) **Building**: RAPTOR recursively embeds, clusters, and summarizes chunks of text to construct a tree from the bottom up, with each higher layer holding more abstract summaries. You can create a tree from the `text` loaded above using `RA.add_documents(text)`.

2) **Querying**: At inference time, the RAPTOR model retrieves information from this tree, integrating data across lengthy documents at different abstraction levels. You can perform queries on the tree with `RA.answer_question`.
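
The building step described above can be sketched with toy components. This is only an illustration of the embed, cluster, summarize loop, not the `raptor` implementation: `embed`, `cluster`, `summarize`, and `build_tree` are hypothetical names, the embedding is a bag-of-words count, the clustering is a greedy similarity threshold, and the "summary" is plain truncation standing in for an LLM call.

```python
import math
from collections import Counter

def embed(text):
    """Toy embedding: bag-of-words counts (stand-in for a real embedding model)."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def cluster(nodes, threshold=0.3):
    """Greedy clustering: a node joins the first cluster whose seed is similar enough."""
    clusters = []
    for node in nodes:
        for group in clusters:
            if cosine(node["embedding"], group[0]["embedding"]) >= threshold:
                group.append(node)
                break
        else:
            clusters.append([node])
    return clusters

def summarize(texts, max_words=12):
    """Toy summary: first words of the joined texts (stand-in for an LLM summarizer)."""
    return " ".join(" ".join(texts).split()[:max_words])

def build_tree(chunks, max_layers=3):
    """Return a list of layers; layer 0 holds the leaf chunks."""
    layers = [[{"text": c, "embedding": embed(c), "children": []} for c in chunks]]
    for _ in range(max_layers):
        current = layers[-1]
        if len(current) <= 1:
            break
        groups = cluster(current)
        if len(groups) == len(current):  # nothing merged, so stop growing the tree
            break
        parents = []
        for group in groups:
            text = summarize([n["text"] for n in group])
            parents.append({"text": text, "embedding": embed(text), "children": group})
        layers.append(parents)
    return layers

chunks = [
    "steel output rose in the third quarter",
    "steel output fell in the fourth quarter",
    "the bond rating is AA+",
    "the bond rating outlook is stable",
]
layers = build_tree(chunks)
for depth, layer in enumerate(layers):
    print(depth, [node["text"] for node in layer])
```

Each parent node keeps references to its children, so a query can be matched against leaves and summaries alike.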

### Building the tree

In [5]:
from raptor import RetrievalAugmentation, RetrievalAugmentationConfig
from raptor.custom_tokenizer import FinQATokenizer
from raptor.SummarizationModels import HCX_003_SummarizationModel, NCPSummarizationModel
from raptor.ExtractModel import HCX_003_MetaDataExecutor
from raptor.cluster_utils import FinRAG_Clustering
from raptor.QAModels import HCX_003_QAModel
from raptor.EmbeddingModels import (
    BaseEmbeddingModel,
    OpenAIEmbeddingModel,
    HyperCLOVAEmbeddingModel
)
from raptor.config.raptor_config import get_default_config, get_custom_config

In [6]:
# Use the default configuration
config = get_default_config()
RA = RetrievalAugmentation(config=config)

2024-11-13 12:03:33,813 - Successfully initialized TreeBuilder with Config 
        TreeBuilderConfig:
            Tokenizer: FinQATokenizer
            Max Tokens: 1000
            Num Layers: 5
            Threshold: 0.5
            Top K: 5
            Selection Mode: top_k
            Summarization Length: 100
            Summarization Model: <raptor.SummarizationModels.HCX_003_SummarizationModel object at 0x306c493a0>
            Embedding Models: {'OpenAI': <raptor.EmbeddingModels.OpenAIEmbeddingModel object at 0x306c49280>}
            Cluster Embedding Model: OpenAI
        
        Reduction Dimension: 10
        Clustering Algorithm: FinRAG_Clustering
        Clustering Parameters: {}
        
Layer Summarization Lengths: {0: 300, 1: 200, 2: 100, 3: 1000}
2024-11-13 12:03:33,814 - Successfully initialized ClusterTreeBuilder with Config 
        TreeBuilderConfig:
            Tokenizer: FinQATokenizer
            Max Tokens: 1000
            Num Layers: 5
            Threshold

In [6]:
# construct the tree
RA.add_documents(text)

2024-11-13 11:36:14,189 - HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
2024-11-13 11:36:14,195 - HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
2024-11-13 11:36:14,236 - HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
2024-11-13 11:36:14,247 - HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
2024-11-13 11:36:14,265 - HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
2024-11-13 11:36:14,277 - HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
2024-11-13 11:36:14,330 - HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
2024-11-13 11:36:14,331 - HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
2024-11-13 11:36:14,331 - HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
2024-11-13 11:36:14,362 - HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"


KeyboardInterrupt: 

2024-11-13 11:46:22,616 - Retrying request to /embeddings in 0.984054 seconds
--- Logging error ---
Traceback (most recent call last):
  File "/Users/jisu/miniconda/envs/raptor/lib/python3.9/site-packages/httpx/_transports/default.py", line 72, in map_httpcore_exceptions
    yield
  File "/Users/jisu/miniconda/envs/raptor/lib/python3.9/site-packages/httpx/_transports/default.py", line 236, in handle_request
    resp = self._pool.handle_request(req)
  File "/Users/jisu/miniconda/envs/raptor/lib/python3.9/site-packages/httpcore/_sync/connection_pool.py", line 216, in handle_request
    raise exc from None
  File "/Users/jisu/miniconda/envs/raptor/lib/python3.9/site-packages/httpcore/_sync/connection_pool.py", line 196, in handle_request
    response = connection.handle_request(
  File "/Users/jisu/miniconda/envs/raptor/lib/python3.9/site-packages/httpcore/_sync/connection.py", line 101, in handle_request
    return self._connection.handle_request(request)
  File "/Users/jisu/miniconda/en

### Querying from the tree

```python
question = "your question here"  # any question string
answer = RA.answer_question(question=question)
print(answer)
```
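
The query logs in this notebook report `Using collapsed_tree`. In the collapsed-tree strategy, nodes from every layer are pooled and ranked together against the query instead of being traversed layer by layer. The sketch below illustrates that idea on hand-made 2-D vectors; `collapsed_tree_retrieve`, its parameters, and the word-count token budget are assumptions made for illustration and do not reflect the actual `RA.retrieve` internals.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def collapsed_tree_retrieve(query_vec, layers, top_k=3, max_tokens=50):
    """Rank all nodes of all layers together; keep the best ones within a token budget."""
    nodes = [node for layer in layers for node in layer]  # flatten the tree
    ranked = sorted(nodes, key=lambda n: cosine(query_vec, n["embedding"]), reverse=True)
    selected, used = [], 0
    for node in ranked[:top_k]:
        cost = len(node["text"].split())  # crude token count: whitespace-separated words
        if used + cost > max_tokens:
            break
        selected.append(node)
        used += cost
    return selected

# Toy tree: two leaves (layer 0) and one summary node (layer 1).
toy_layers = [
    [
        {"text": "leaf: bond rating is AA+", "embedding": [1.0, 0.0]},
        {"text": "leaf: steel output rose", "embedding": [0.0, 1.0]},
    ],
    [
        {"text": "summary: ratings and output", "embedding": [0.7, 0.7]},
    ],
]

hits = collapsed_tree_retrieve([1.0, 0.1], toy_layers, top_k=2)
for node in hits:
    print(node["text"])
```

Because leaves and summaries compete in one ranking, a specific question tends to surface leaf chunks while a broad question tends to surface higher-level summaries.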

In [16]:
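# English gloss: "What is POSCO's corporate bond credit rating, and who assigned it?"
question = "포스코 회사채 신용등급은 뭐고 누구에게 해당 신용 등급을 받았어?"

answer = RA.answer_question(question=question)
# Retrieval query, English gloss: "corporate bond credit rating"
search_res = RA.retrieve(question="회사채 신용등급", top_k=3)
print("\nAnswer: ", answer)
print("\nSearch result :", search_res[0])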

2024-11-12 13:49:10,777 - Using collapsed_tree
2024-11-12 13:49:11,149 - HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
2024-11-12 13:49:15,453 - Using collapsed_tree
2024-11-12 13:49:15,964 - HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"



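Answer:  POSCO received a credit rating of AA+ (Positive/Stable) from the domestic credit rating agencies NICE Investors Service, Korea Investors Service, and Korea Ratings, and ratings of BBB+ (Positive) from the overseas credit rating agency S&P and Baa1 (Stable) from Moody's.

Search result : - Definition of corporate bond credit ratings (overseas)

※ Credit rating system - Definition of corporate bond credit ratings (domestic)

Matters concerning credit ratings: As of the report filing date, the company holds a credit rating of AA+ (Positive) from the domestic credit rating agency NICE Investors Service and AA+ (Stable) from Korea Investors Service and Korea Ratings, and holds ratings of BBB+ (Positive) from the overseas credit rating agency S&P and Baa1 (Stable) from Moody's. The credit ratings the company has received from credit rating agencies over the last three fiscal years are as follows.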


