# [Project] Multimodal RAG - Part 1

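Recently, a lot of research has gone into running RAG over data other than text.

A typical document loader uses only the text when performing RAG, so part of the information in the document is lost.

In this project, we use Docling, an open-source library released in August 2024, to reconstruct a PDF document that contains images and tables, and then perform RAG on the result.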

**Please set up a T4 cloud GPU runtime!**

## Installing the libraries

Install the docling library.
https://github.com/DS4SD/docling

In [None]:
!pip install docling google-generativeai langchain_huggingface sentence_transformers jsonlines langchain langchain-google-genai langchain-community beautifulsoup4 langchain_chroma chromadb -q

Docling converts PDF data to Markdown.
Besides text, it can also extract tables and images.

In [None]:
# Basic usage: convert the text (excluding images) to Markdown
# Takes about 3 minutes on a T4 GPU
from docling.document_converter import DocumentConverter

source = "https://storage.googleapis.com/deepmind-media/gemma/Gemma3Report.pdf"  # PDF path or URL
converter = DocumentConverter()
result = converter.convert(source)
print(result.document.export_to_markdown())  # output: "### Technical Report[...]"



<!-- image -->

## Gemma 3 Technical Report

Gemma Team, Google DeepMind 1

We introduce Gemma 3, a multimodal addition to the Gemma family of lightweight open models, ranging in scale from 1 to 27 billion parameters. This version introduces vision understanding abilities, a wider coverage of languages and longer context - at least 128K tokens. We also change the architecture of the model to reduce the KV-cache memory that tends to explode with long context. This is achieved by increasing the ratio of local to global attention layers, and keeping the span on local attention short. The Gemma 3 models are trained with distillation and achieve superior performance to Gemma 2 for both pre-trained and instruction finetuned versions. In particular, our novel post-training recipe significantly improves the math, chat, instruction-following and multilingual abilities, making Gemma34B-IT competitive with Gemma2-27B-IT and Gemma3-27B-IT comparable to Gemini-1.5-Pro across benchmarks. We release 
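
In the default export shown above, a figure survives only as an `<!-- image -->` placeholder. As a rough way to gauge how much visual content a text-only pipeline drops, the sketch below counts those placeholders in an exported Markdown string. The helper name `count_image_placeholders` and the sample string are illustrative assumptions, not part of Docling.

```python
import re

# Pattern for the placeholder that appears in the exported Markdown above
IMAGE_PLACEHOLDER = re.compile(r"<!--\s*image\s*-->")


def count_image_placeholders(markdown: str) -> int:
    """Return how many image placeholders the Markdown text contains."""
    return len(IMAGE_PLACEHOLDER.findall(markdown))


# Hypothetical sample that mimics the start of the exported report
sample = (
    "<!-- image -->\n\n"
    "## Gemma 3 Technical Report\n\n"
    "Some body text.\n\n"
    "<!-- image -->\n"
)
print(count_image_placeholders(sample))
```

With a real conversion, the same helper could be applied to the string returned by `result.document.export_to_markdown()`.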

Docling supports the following tasks.
1. Exporting each page as an image
2. Extracting each image contained in a page
3. Reconstructing the whole document in HTML/MD format

In [None]:
import logging
import time
import re
import requests
from pathlib import Path
from urllib.parse import urlparse
from docling_core.types.doc import ImageRefMode, PictureItem, TableItem
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption

IMAGE_RESOLUTION_SCALE = 2.0
_log = logging.getLogger(__name__)

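def download_pdf(url, save_dir="downloads"):
    """Download the PDF at `url` and return the local file path."""
    save_dir = Path(save_dir)
    save_dir.mkdir(parents=True, exist_ok=True)

    # A timeout keeps the call from hanging forever on a stalled connection
    response = requests.get(url, stream=True, timeout=30)
    if response.status_code == 200:
        # Take the name from the URL path so that query strings are ignored;
        # fall back to a generic name if the path ends with a slash
        filename = Path(urlparse(url).path).name or "document"

        if not filename.lower().endswith('.pdf'):
            filename += '.pdf'
        file_path = save_dir / filename

        with open(file_path, "wb") as f:
            for chunk in response.iter_content(1024):
                f.write(chunk)

        return str(file_path)
    else:
        raise Exception(f"Failed to download file: {url} (Status code: {response.status_code})")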

def is_url(path):
    """Check whether the given string is an HTTP(S) URL."""
    return re.match(r'https?://', path) is not None

def parse(path, output_dir='docling_result'):
    logging.basicConfig(level=logging.INFO)

    if is_url(path):  # Download first if the path is a URL
        _log.info(f"Downloading PDF from {path}...")
        path = download_pdf(path)

    input_doc_path = Path(path)
    output_dir = Path(output_dir)

    pipeline_options = PdfPipelineOptions()
    pipeline_options.images_scale = IMAGE_RESOLUTION_SCALE
    pipeline_options.generate_page_images = True
    pipeline_options.generate_picture_images = True

    doc_converter = DocumentConverter(
        format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)}
    )

    start_time = time.time()
    conv_res = doc_converter.convert(input_doc_path)

    output_dir.mkdir(parents=True, exist_ok=True)
    doc_filename = conv_res.input.file.stem

    # Save each page as an image
    for page_no, page in conv_res.document.pages.items():
        page_image_filename = output_dir / f"{doc_filename}-{page_no}.png"
        with page_image_filename.open("wb") as fp:
            page.image.pil_image.save(fp, format="PNG")

    # Save the picture and table images
    table_counter = 0
    picture_counter = 0
    for element, _level in conv_res.document.iterate_items():
        if isinstance(element, TableItem):
            table_counter += 1
            element_image_filename = output_dir / f"{doc_filename}-table-{table_counter}.png"
            with element_image_filename.open("wb") as fp:
                element.get_image(conv_res.document).save(fp, "PNG")

        if isinstance(element, PictureItem):
            picture_counter += 1
            element_image_filename = output_dir / f"{doc_filename}-picture-{picture_counter}.png"
            with element_image_filename.open("wb") as fp:
                element.get_image(conv_res.document).save(fp, "PNG")

    # Save the full Markdown (images embedded inline as base64 data)
    md_filename = output_dir / f"{doc_filename}-with-images.md"
    conv_res.document.save_as_markdown(md_filename, image_mode=ImageRefMode.EMBEDDED)

    # Save the full Markdown (images as references to external files)
    md_filename = output_dir / f"{doc_filename}-with-image-refs.md"
    conv_res.document.save_as_markdown(md_filename, image_mode=ImageRefMode.REFERENCED)

    # Save the HTML with image references
    html_filename = output_dir / f"{doc_filename}-with-image-refs.html"
    conv_res.document.save_as_html(html_filename, image_mode=ImageRefMode.REFERENCED)

    end_time = time.time() - start_time
    _log.info(f"Document converted and figures exported in {end_time:.2f} seconds.")

# Example run
parse("https://storage.googleapis.com/deepmind-media/gemma/Gemma3Report.pdf")
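
`parse` writes everything into one output directory, using the naming scheme `{doc}-{page_no}.png` for pages, `{doc}-table-{n}.png` for tables, `{doc}-picture-{n}.png` for pictures, plus the Markdown and HTML files. A minimal sketch for sorting those files by kind is given below. The helper `group_outputs` is a hypothetical name introduced here, and the demo runs on empty placeholder files rather than a real conversion.

```python
import re
import tempfile
from collections import defaultdict
from pathlib import Path


def group_outputs(output_dir, doc_stem):
    """Group files written by parse() by kind, based on their file names."""
    groups = defaultdict(list)
    prefix = f"{doc_stem}-"
    for path in sorted(Path(output_dir).iterdir()):
        # Skip sub-directories and files that belong to another document
        if not path.is_file() or not path.name.startswith(prefix):
            continue
        rest = path.name[len(prefix):]
        if re.fullmatch(r"\d+\.png", rest):
            groups["page"].append(path.name)
        elif re.fullmatch(r"table-\d+\.png", rest):
            groups["table"].append(path.name)
        elif re.fullmatch(r"picture-\d+\.png", rest):
            groups["picture"].append(path.name)
        elif path.suffix in {".md", ".html"}:
            groups["document"].append(path.name)
    return dict(groups)


# Demo on empty placeholder files, so it runs without converting a PDF
with tempfile.TemporaryDirectory() as tmp:
    for name in [
        "Gemma3Report-1.png",
        "Gemma3Report-table-1.png",
        "Gemma3Report-picture-1.png",
        "Gemma3Report-with-images.md",
    ]:
        (Path(tmp) / name).touch()
    demo = group_outputs(tmp, "Gemma3Report")

print(demo)
```

After a real run of the cell above, the call would be `group_outputs("docling_result", "Gemma3Report")`.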
