- Building an open-source Reasoning RAG with Deepseek + Exaone + Docling
- https://youtu.be/4j6J-9hxfhk?si=oWnmZatfye60k107

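- Docling: a library that parses every entity in a document, routes each one through the appropriate processing pipeline, and converts the result to Markdown so that an LLM can easily understand it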

In [1]:
%pip install -q langchain langgraph langchain-community langchain-docling langchain-qdrant langchain-text-splitters langchain-ollama sentence-transformers graphviz

Note: you may need to restart the kernel to use updated packages.


In [1]:
import warnings
warnings.filterwarnings("ignore")

In [2]:
from langchain_ollama import ChatOllama

reasoning_llm = ChatOllama(
    model="deepseek-r1:7b",  # reasoning model
    stop=["</think>"]  # stop once the reasoning trace is complete
)

answer_llm = ChatOllama(
    model="exaone3.5",  # Korean-language model
    temperature=0
)
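
Because `stop=["</think>"]` cuts generation at the closing tag, the reasoning model returns only its reasoning trace, usually still prefixed with the opening `<think>` tag (the logged reasoning output later in this notebook starts with it). Below is a minimal sketch of a helper that strips those tags before the trace is reused. `extract_thinking` is a hypothetical name introduced here for illustration; it is not part of LangChain or Ollama.

```python
import re


def extract_thinking(raw: str) -> str:
    """Return the reasoning text with any <think> / </think> tags removed."""
    # If a complete <think>...</think> block is present, keep only its body.
    match = re.search(r"<think>(.*?)</think>", raw, flags=re.DOTALL)
    if match:
        return match.group(1).strip()
    # Otherwise generation was cut at the stop token: drop a leading <think> tag.
    return re.sub(r"^\s*<think>", "", raw).strip()


print(extract_thinking("<think>\nDocling converts PDFs.\n"))
print(extract_thinking("<think>step 1</think>final answer"))
```

Whether to strip the tag before passing the trace to the answer model is a design choice; the notebook passes the trace through unchanged.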

In [5]:
from typing import Annotated, List, TypedDict, Literal
from langgraph.graph.message import add_messages
from langchain_core.documents import Document

# Define the RAG state
class RAGState(TypedDict):
    """Defines the state of the RAG system."""
    query: str
    thinking: str
    documents: List[Document]
    answer: str
    messages: Annotated[List, add_messages]
    mode: str
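
Each node in the graph returns only the keys it changed, and those keys are merged into the shared state. The sketch below emulates that merge with plain dictionaries under a simplifying assumption: every key is overwritten by the latest value. In LangGraph, a key annotated with a reducer such as `add_messages` is combined by that reducer instead of being overwritten, which this sketch deliberately omits. `apply_update` is a hypothetical helper, not a LangGraph function.

```python
from typing import Any, Dict


def apply_update(state: Dict[str, Any], update: Dict[str, Any]) -> Dict[str, Any]:
    """Merge a node's partial update into the state (last write wins)."""
    merged = dict(state)   # copy so the previous state is left untouched
    merged.update(update)  # keys absent from the update keep their old value
    return merged


state = {"query": "What is Docling?", "thinking": "", "documents": [], "answer": "", "mode": ""}
state = apply_update(state, {"mode": "retrieve"})
state = apply_update(state, {"thinking": "Docling is a PDF converter."})
print(state["mode"], "|", state["thinking"])
```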

In [12]:
from langchain_docling import DoclingLoader
from langchain_docling.loader import ExportType

FILE_PATH = "https://arxiv.org/pdf/2408.09869"

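# This step takes a long time.
# Internally, Docling parses the layout and applies rule-based conversion
# so that the output is easy for an LLM to understand.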
loader = DoclingLoader(
    file_path = FILE_PATH,
    export_type = ExportType.MARKDOWN
)

docs = loader.load()
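
Since the conversion is slow, it can be worth caching the exported Markdown on disk so that re-running the notebook does not re-parse the same PDF. The following is a minimal sketch under stated assumptions: `cached_markdown`, the `convert` callable, and the `.docling_cache` directory are all hypothetical names introduced for illustration, and `convert` stands in for whatever function wraps the loader call and returns the Markdown string.

```python
import hashlib
from pathlib import Path
from typing import Callable


def cached_markdown(
    source: str,
    convert: Callable[[str], str],
    cache_dir: str = ".docling_cache",  # hypothetical cache location
) -> str:
    """Return Markdown for `source`, calling `convert` only on a cache miss."""
    directory = Path(cache_dir)
    directory.mkdir(parents=True, exist_ok=True)
    # Key the cache file on a hash of the source path or URL.
    key = hashlib.sha256(source.encode("utf-8")).hexdigest()
    path = directory / f"{key}.md"
    if path.exists():
        return path.read_text(encoding="utf-8")
    markdown = convert(source)  # the slow step
    path.write_text(markdown, encoding="utf-8")
    return markdown
```

The cache is keyed on the source string only, so it will not notice when the remote document changes; deleting the cache directory forces a fresh conversion.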

In [13]:
from langchain_text_splitters import MarkdownHeaderTextSplitter

splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=[
        ("#", "Header_1"),
        ("##", "Header_2"),
        ("###", "Header_3"),
    ],
)

splits = [split for doc in docs for split in splitter.split_text(doc.page_content)]

for d in splits[:3]:
    print(f"- {d.page_content=}")
print("...")

- d.page_content='Version 1.0  \nChristoph Auer Maksym Lysak Ahmed Nassar Michele Dolfi Nikolaos Livathinos Panos Vagenas Cesar Berrospi Ramis Matteo Omenetti Fabian Lindlbauer Kasper Dinkla Lokesh Mishra Yusik Kim Shubham Gupta Rafael Teixeira de Lima Valery Weber Lucas Morin Ingmar Meijer Viktor Kuropiatnyk Peter W. J. Staar  \nAI4K Group, IBM Research R¨ uschlikon, Switzerland'
- d.page_content='This technical report introduces Docling , an easy to use, self-contained, MITlicensed open-source package for PDF document conversion. It is powered by state-of-the-art specialized AI models for layout analysis (DocLayNet) and table structure recognition (TableFormer), and runs efficiently on commodity hardware in a small resource budget. The code interface allows for easy extensibility and addition of new features and models.'
- d.page_content='Converting PDF documents back into a machine-processable format has been a major challenge for decades due to their huge variability in formats, we
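
To make the splitting step concrete, here is a simplified, standard-library re-implementation of header-based splitting. It is a sketch of the idea only, not the code of `MarkdownHeaderTextSplitter`: it ignores edge cases such as `#` lines inside fenced code blocks, and `split_by_headers` is a hypothetical name. Each chunk carries the headers that were active when its text appeared, mirroring the `Header_1` / `Header_2` / `Header_3` metadata keys configured above.

```python
from typing import Dict, List, Tuple


def split_by_headers(markdown: str, max_level: int = 3) -> List[Tuple[Dict[str, str], str]]:
    """Split Markdown into (active headers, body text) chunks."""
    chunks: List[Tuple[Dict[str, str], str]] = []
    current_headers: Dict[str, str] = {}
    body: List[str] = []

    def flush() -> None:
        # Emit the text collected so far together with the active headers.
        text = "\n".join(body).strip()
        if text:
            chunks.append((dict(current_headers), text))
        body.clear()

    for line in markdown.splitlines():
        stripped = line.strip()
        level = len(stripped) - len(stripped.lstrip("#"))
        if 1 <= level <= max_level and stripped[level:level + 1] == " ":
            flush()
            # A new header closes every header at the same or a deeper level.
            for key in [k for k in current_headers if int(k.split("_")[1]) >= level]:
                del current_headers[key]
            current_headers[f"Header_{level}"] = stripped[level:].strip()
        else:
            body.append(line)
    flush()
    return chunks


for headers, text in split_by_headers("# A\nintro\n## B\ntext b\n# C\ntext c"):
    print(headers, "->", text)
```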

In [14]:
from IPython.display import Markdown, display

display(Markdown(splits[12].page_content))

In this section, we establish some reference numbers for the processing speed of Docling and the resource budget it requires. All tests in this section are run with default options on our standard test set distributed with Docling, which consists of three papers from arXiv and two IBM Redbooks, with a total of 225 pages. Measurements were taken using both available PDF backends on two different hardware systems: one MacBook Pro M3 Max, and one bare-metal server running Ubuntu 20.04 LTS on an Intel Xeon E5-2690 CPU. For reproducibility, we fixed the thread budget (through setting OMP NUM THREADS environment variable ) once to 4 (Docling default) and once to 16 (equal to full core count on the test hardware). All results are shown in Table 1.  
If you need to run Docling in very low-resource environments, please consider configuring the pypdfium backend. While it is faster and more memory efficient than the default docling-parse backend, it will come at the expense of worse quality results, especially in table structure recovery.  
Establishing GPU acceleration support for the AI models is currently work-in-progress and largely untested, but may work implicitly when CUDA is available and discovered by the onnxruntime and  
torch runtimes backing the Docling pipeline. We will deliver updates on this topic at in a future version of this report.  
Table 1: Runtime characteristics of Docling with the standard model pipeline and settings, on our test dataset of 225 pages, on two different systems. OCR is disabled. We show the time-to-solution (TTS), computed throughput in pages per second, and the peak memory used (resident set size) for both the Docling-native PDF backend and for the pypdfium backend, using 4 and 16 threads.  
| CPU                              | Thread budget   | native backend   | native backend   | native backend   | pypdfium backend   | pypdfium backend   | pypdfium backend   |
|----------------------------------|-----------------|------------------|------------------|------------------|--------------------|--------------------|--------------------|
|                                  |                 | TTS              | Pages/s          | Mem              | TTS                | Pages/s            | Mem                |
| Apple M3 Max                     | 4               | 177 s 167 s      | 1.27 1.34        | 6.20 GB          | 103 s 92 s         | 2.18 2.45          | 2.56 GB            |
| (16 cores) Intel(R) Xeon E5-2690 | 16 4 16         | 375 s 244 s      | 0.60 0.92        | 6.16 GB          | 239 s 143 s        | 0.94 1.57          | 2.42 GB            |

In [16]:
from langchain_ollama import OllamaEmbeddings

embeddings = OllamaEmbeddings(
    model = "bge-m3:latest",
)

In [20]:
from langchain_qdrant import QdrantVectorStore
from langchain_qdrant import RetrievalMode

vector_store = QdrantVectorStore.from_documents(
    documents=splits,
    embedding = embeddings,
    location = ":memory:",
    collection_name = "rag_collection",
    retrieval_mode = RetrievalMode.DENSE
)

retriever = vector_store.as_retriever(search_kwargs = {"k": 10})

In [21]:
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import CrossEncoderReranker
from langchain_community.cross_encoders import HuggingFaceCrossEncoder

model = HuggingFaceCrossEncoder(model_name = "BAAI/bge-reranker-base")
compressor = CrossEncoderReranker(model=model, top_n=5)
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor, base_retriever=retriever
)
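
The retriever and reranker above form a two-stage pipeline: dense retrieval narrows the corpus to `k=10` candidates, then the cross-encoder rescores them and keeps `top_n=5`. The sketch below shows the same control flow with toy vectors and a stand-in scoring function. `retrieve_then_rerank` and `rerank_score` are hypothetical names; a real cross-encoder scores each (query, document) pair with a model rather than a lookup table.

```python
import math
from typing import Callable, List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve_then_rerank(
    query_vec: Sequence[float],
    docs: List[Tuple[str, Sequence[float]]],
    rerank_score: Callable[[str], float],
    k: int = 10,
    top_n: int = 5,
) -> List[str]:
    # Stage 1: cheap vector similarity narrows the corpus to k candidates.
    candidates = sorted(docs, key=lambda d: cosine(query_vec, d[1]), reverse=True)[:k]
    # Stage 2: a more expensive scorer reorders the candidates; keep top_n.
    reranked = sorted(candidates, key=lambda d: rerank_score(d[0]), reverse=True)
    return [text for text, _ in reranked[:top_n]]


toy_docs = [("a", [1.0, 0.0]), ("b", [0.9, 0.1]), ("c", [0.0, 1.0])]
toy_scores = {"a": 0.1, "b": 0.9, "c": 1.0}
print(retrieve_then_rerank([1.0, 0.0], toy_docs, toy_scores.get, k=2, top_n=1))
```

In the toy run, document `c` has the highest rerank score but never reaches the reranker because stage 1 filters it out. This is the trade-off of the design: the reranker can only reorder what the first stage retrieved.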

In [22]:
compression_retriever

ContextualCompressionRetriever(base_compressor=CrossEncoderReranker(model=HuggingFaceCrossEncoder(client=CrossEncoder(
  (model): XLMRobertaForSequenceClassification(
    (roberta): XLMRobertaModel(
      (embeddings): XLMRobertaEmbeddings(
        (word_embeddings): Embedding(250002, 768, padding_idx=1)
        (position_embeddings): Embedding(514, 768, padding_idx=1)
        (token_type_embeddings): Embedding(1, 768)
        (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (dropout): Dropout(p=0.1, inplace=False)
      )
      (encoder): XLMRobertaEncoder(
        (layer): ModuleList(
          (0-11): 12 x XLMRobertaLayer(
            (attention): XLMRobertaAttention(
              (self): XLMRobertaSdpaSelfAttention(
                (query): Linear(in_features=768, out_features=768, bias=True)
                (key): Linear(in_features=768, out_features=768, bias=True)
                (value): Linear(in_features=768, out_features=768, bias=True)
           

In [40]:
from langchain_core.messages import HumanMessage
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate

from langgraph.graph import START, StateGraph, END

# 1. Classification node: stores the routing mode in the state
def classify_node(state: RAGState):
    """Classify the question and decide the processing mode."""
    query = state["query"]
    # Store the mode in the state
    if "Docling" in query:
        print("===Search started===")
        return {"mode": "retrieve"}
    else:
        print("===Generation started===")
        return {"mode": "generate"}

# 2. Retrieval node: retrieves documents, then updates the state
def retrieve_node(state: RAGState):
    """Retrieve documents and update the state."""
    query = state["query"]
    documents = compression_retriever.invoke(query)
    print(f"===Search complete: {len(documents)} documents retrieved===")
    return {"documents": documents}

# 3. Reasoning node: updates the state
def think_node(state: RAGState):
    """Generate a reasoning trace with the reasoning model."""
    query = state["query"]
    documents = state["documents"]
    thinking = reasoning_llm.invoke(
        f"<think> {query} </think> {documents}"
    )
    # Extract only the content from the AIMessage
    thinking_content = thinking.content
    print(f"===Reasoning complete: {thinking_content}===")
    return {"thinking": thinking_content}

# 4. Answer node: updates the state
def answer_node(state: RAGState):
    """Generate the final answer with the answer model."""
    thinking = state["thinking"]
    # In "generate" mode the think node never runs and `thinking` is empty,
    # so fall back to the raw question instead of sending an empty prompt.
    prompt = thinking if thinking else state["query"]
    # Build the message and call the model
    message = HumanMessage(content=prompt)
    response = answer_llm.invoke([message])
    answer = response.content
    print(f"===Answer complete: {answer}===")
    return {"answer": answer}  # the answer must be returned as a dict

# 5. Build the workflow
workflow = StateGraph(RAGState)

# Register the nodes
workflow.add_node("classify", classify_node)
workflow.add_node("retrieve", retrieve_node)
workflow.add_node("think", think_node)
workflow.add_node("answer", answer_node)

# Entry point: START -> classify
workflow.add_edge(START, "classify")

# Conditional routing after the classification node
workflow.add_conditional_edges(
    "classify",
    lambda x: x["mode"],
    {
        "retrieve": "retrieve",
        "generate": "answer"
    }
)

# retrieve -> think -> answer -> END
workflow.add_edge("retrieve", "think")
workflow.add_edge("think", "answer")
workflow.add_edge("answer", END)

# Compile the graph
graph = workflow.compile()

# Function that runs the RAG system
def run_rag_system(query: str) -> str:
    """Run the RAG system and generate an answer to the question."""
    result = graph.invoke({
        "query": query,
        "documents": [],
        "thinking": "",
        "answer": "",
        "messages": [],
        "mode": ""
    })
    return result["answer"]
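
The keyword check in `classify_node` is case-sensitive, so a query that spells the name as "docling" is routed to the generate branch and skips retrieval. A minimal sketch of a more forgiving router follows. `classify_mode` and its `keywords` parameter are hypothetical and not part of the notebook; the return values match the keys of the conditional-edge mapping.

```python
from typing import Iterable


def classify_mode(query: str, keywords: Iterable[str] = ("docling",)) -> str:
    """Return "retrieve" if the query mentions any keyword, else "generate"."""
    lowered = query.casefold()  # casefold() is a more aggressive lower()
    if any(keyword.casefold() in lowered for keyword in keywords):
        return "retrieve"
    return "generate"


print(classify_mode("What is DOCLING?"))
print(classify_mode("Tell me a joke."))
```

Keyword routing is still a crude heuristic; it is kept here only because it makes the branch taken by the graph easy to predict.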

In [41]:
# Example run of the RAG system
# The query is Korean for "Please explain Docling."
result = run_rag_system("Docling에 대해 설명해줘.")
print(f"Final answer: {result}")

===Search started===
===Search complete: 5 documents retrieved===
===Reasoning complete: <think>
Alright, I need to explain Docling based on the provided information. Let me start by understanding what each document says.

Docling is an open-source PDF converter. The first document explains its extensibility with a model pipeline for customization. It mentions implementing a linear processing pipeline that works page-by-page. The introduction highlights its challenges in converting PDFs due to their variability and the gap between open-source tools and commercial ones.

Looking at the other documents, they talk about versioning (1, 2, 3, 4) but maybe it's just part of how the information is structured. The key features are:
- Converts PDFs to JSON or Markdown quickly.
- Understands page layout, reading order, figures, and table structures.
- Extracts metadata like title, authors, etc.
- Supports OCR for scanned PDFs.
- Works in batch or interactive mode.
- Can use GPUs for acceleration.

The abstract summarizes it as a self-con

In [42]:
result

'Certainly! Here’s a concise explanation of Docling based on the provided information:\n\n**Docling** is an open-source PDF converter designed to address the complexities and variability inherent in PDF documents through advanced AI capabilities. Here are its key features and strengths:\n\n1. **Efficiency and Speed**: Docling quickly converts PDFs into structured formats like JSON or Markdown, making it highly efficient for both batch and interactive use cases.\n\n2. **Advanced Parsing**: It excels at understanding complex PDF structures, including:\n   - **Page Layout**: Accurate interpretation of text placement and layout.\n   - **Reading Order**: Proper sequencing of content for readability.\n   - **Figures and Tables**: Extraction and preservation of visual elements and tabular data.\n   - **Metadata**: Extraction of essential metadata such as titles, authors, and other document properties.\n\n3. **Versatility**: Docling supports OCR (Optical Character Recognition) for scanned PDFs

In [46]:
# Visualize the workflow
# Note: the `graphviz` Python package only wraps the Graphviz `dot` executable,
# which must be installed separately and be on the system PATH.
from IPython.display import display
import graphviz

# Describe the workflow as a DOT graph
dot = graphviz.Digraph(comment='RAG Workflow')
dot.attr(rankdir='LR')  # lay the graph out left to right

# Node style
dot.attr('node', shape='box', style='rounded,filled', fillcolor='lightblue')

# Add nodes
dot.node('START', 'START', shape='oval')
dot.node('classify', 'Classify (question routing)')
dot.node('retrieve', 'Retrieve (document search)')
dot.node('think', 'Think (reasoning)')
dot.node('answer', 'Answer (answer generation)')
dot.node('END', 'END', shape='oval')

# Add edges
dot.edge('START', 'classify')
dot.edge('classify', 'retrieve', 'retrieve')
dot.edge('classify', 'answer', 'generate')
dot.edge('retrieve', 'think')
dot.edge('think', 'answer')
dot.edge('answer', 'END')

# Render the graph
dot.render('rag_workflow', format='png', view=True)
display(dot)

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


ExecutableNotFound: failed to execute PosixPath('dot'), make sure the Graphviz executables are on your systems' PATH
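
The `ExecutableNotFound` error means the Graphviz `dot` binary is not installed or not on the PATH; installing the `graphviz` Python package alone does not provide it. If installing the system package is not an option, one alternative is to emit the diagram as Mermaid flowchart text, which needs no external executable and can be pasted into any Mermaid renderer. The sketch below uses only the standard library; `to_mermaid` is a hypothetical helper, and the edge list is copied by hand from the workflow above. Recent LangGraph versions also appear to offer Mermaid export through `graph.get_graph()`, which is worth checking against the documentation for your installed version.

```python
from typing import List, Optional, Tuple

# (source node, destination node, optional edge label)
Edge = Tuple[str, str, Optional[str]]


def to_mermaid(edges: List[Edge], direction: str = "LR") -> str:
    """Build Mermaid flowchart source text from a list of edges."""
    lines = [f"flowchart {direction}"]
    for src, dst, label in edges:
        arrow = f"-->|{label}|" if label else "-->"
        lines.append(f"    {src} {arrow} {dst}")
    return "\n".join(lines)


rag_edges: List[Edge] = [
    ("START", "classify", None),
    ("classify", "retrieve", "retrieve"),
    ("classify", "answer", "generate"),
    ("retrieve", "think", None),
    ("think", "answer", None),
    ("answer", "END", None),
]
print(to_mermaid(rag_edges))
```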