## Chunker

In [1]:
import numpy as np
import pandas as pd

from frony_document_manager.parser import (
    ParserTXT,
    ParserPDF,
    ParserPPTX,
    ParserPDFImage,
    ParserImage,
)
from frony_document_manager.chunker import (
    RuleBasedTextChunker,
    LLMBasedTextChunker,
    LLMBasedImageChunker,
)



## RuleBasedTextChunker

In [2]:
parser = ParserPDF()
df = parser.parse("test_files/test_pdf.pdf")
df

Unnamed: 0,page_number,page_content
0,1,"Provided proper attribution is provided, Googl..."
1,2,"1 Introduction\nRecurrent neural networks, lon..."
2,3,Figure 1: The Transformer - model architecture...
3,4,Scaled Dot-Product Attention Multi-Head Attent...
4,5,output values. These are concatenated and once...
5,6,"Table 1: Maximum path lengths, per-layer compl..."
6,7,n\nlength is smaller than the representation d...
7,8,Table 2: The Transformer achieves better BLEU ...
8,9,Table 3: Variations on the Transformer archite...
9,10,Table 4: The Transformer generalizes well to E...


In [3]:
print(df.iloc[0]["page_content"])

Provided proper attribution is provided, Google hereby grants permission to
reproduce the tables and figures in this paper solely for use in journalistic or
scholarly works.
Attention Is All You Need
3202 guA 2  ]LC.sc[  7v26730.6071:viXra
Ashish Vaswani∗ Noam Shazeer∗ Niki Parmar∗ Jakob Uszkoreit∗
Google Brain Google Brain Google Research Google Research
avaswani@google.com noam@google.com nikip@google.com usz@google.com
†
Llion Jones∗ Aidan N. Gomez∗ Łukasz Kaiser∗
Google Research University of Toronto Google Brain
llion@google.com aidan@cs.toronto.edu lukaszkaiser@google.com
‡
Illia Polosukhin∗
illia.polosukhin@gmail.com
Abstract
The dominant sequence transduction models are based on complex recurrent or
convolutional neural networks that include an encoder and a decoder. The best
performing models also connect the encoder and decoder through an attention
mechanism. We propose a new simple network architecture, the Transformer,
based solely on attention mechanisms, dispensing with r
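
The page-1 text above contains the line `3202 guA 2  ]LC.sc[  7v26730.6071:viXra`, which appears to be the arXiv margin stamp extracted in reverse character order. Below is a minimal sketch of one way to repair such a line before chunking. It assumes that any line ending in `:viXra` is a mirrored stamp; `fix_reversed_arxiv_stamp` is a hypothetical helper and not part of `frony_document_manager`.

```python
import re

# Assumption: a mirrored arXiv stamp is a line that ends with ":viXra" ("arXiv:" backwards).
REVERSED_STAMP = re.compile(r"^.*:viXra\s*$")


def fix_reversed_arxiv_stamp(text: str) -> str:
    """Reverse every line that looks like a mirrored arXiv stamp; leave other lines as-is."""
    fixed = []
    for line in text.split("\n"):
        if REVERSED_STAMP.match(line):
            # Reverse the characters, then collapse the doubled spaces left by extraction.
            line = " ".join(line[::-1].split())
        fixed.append(line)
    return "\n".join(fixed)


sample = "Attention Is All You Need\n3202 guA 2  ]LC.sc[  7v26730.6071:viXra\nAshish Vaswani"
print(fix_reversed_arxiv_stamp(sample))
```

Applied to the `page_content` column, this could be run as `df["page_content"].map(fix_reversed_arxiv_stamp)` before the DataFrame is passed to a chunker.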

In [4]:
print(df.iloc[1]["page_content"])

1 Introduction
Recurrent neural networks, long short-term memory [13] and gated recurrent [7] neural networks
in particular, have been firmly established as state of the art approaches in sequence modeling and
transduction problems such as language modeling and machine translation [35, 2, 5]. Numerous
efforts have since continued to push the boundaries of recurrent language models and encoder-decoder
architectures [38, 24, 15].
Recurrent models typically factor computation along the symbol positions of the input and output
sequences. Aligning the positions to steps in computation time, they generate a sequence of hidden
states h , as a function of the previous hidden state h and the input for position t. This inherently
t t−1
sequential nature precludes parallelization within training examples, which becomes critical at longer
sequence lengths, as memory constraints limit batching across examples. Recent work has achieved
significant improvements in computational efficiency through fac

In [5]:
chunker = RuleBasedTextChunker()
chunks = chunker.chunk(df)  # iterator: the first item is the total chunk count, the rest are chunks
total_chunks = next(chunks)
print(total_chunks)
df_chunk = pd.DataFrame(list(chunks))  # consume the remaining items into one DataFrame
df_chunk

631


create documents... (rule_short): 100%|██████████| 507/507 [00:01<00:00, 370.35it/s]
create documents... (rule_long): 100%|██████████| 124/124 [00:00<00:00, 383.52it/s]


Unnamed: 0,page_number,chunk_type,chunk_id,chunk_content
0,1,rule_short,0,"Provided proper attribution is provided, Googl..."
1,1,rule_short,1,reproduce the tables and figures in this paper...
2,1,rule_short,2,Attention Is All You Need\n3202 guA 2 ]LC.sc[...
3,1,rule_short,3,Google Brain Google Brain Google Research Goog...
4,1,rule_short,4,†\nLlion Jones∗ Aidan N. Gomez∗ Łukasz Kaiser∗...
...,...,...,...,...
626,2,rule_long,119,| | p | u | t- | In | p ...
627,1,rule_long,120,| 0 | | | | | ...
628,15,rule_long,121,| 2 | ehT | waL | lliw | reven | eb ...
629,15,rule_long,122,| 4 | | | | | ...
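
The cells in this notebook all repeat the same pattern: read the total count with `next()`, then collect the remaining items. A small helper can wrap that pattern. The sketch below assumes only what the cell above shows, namely that the iterator yields an integer count first and the chunk records after it. `collect_chunks` and `fake_chunker` are hypothetical names used for illustration; `fake_chunker` stands in for `chunker.chunk(df)` so the example runs without the library.

```python
from typing import Any, Iterable, Iterator


def collect_chunks(stream: Iterable[Any]) -> tuple[int, list[Any]]:
    """Split a 'count first, then items' stream into (count, items)."""
    it: Iterator[Any] = iter(stream)
    total = next(it)   # first yielded value: the announced number of chunks
    items = list(it)   # everything after it: the chunk records
    return total, items


def fake_chunker():
    # Stand-in for chunker.chunk(df); the field names mirror the columns shown above.
    yield 2
    yield {"page_number": 1, "chunk_type": "rule_short", "chunk_id": 0, "chunk_content": "first"}
    yield {"page_number": 1, "chunk_type": "rule_short", "chunk_id": 1, "chunk_content": "second"}


total, rows = collect_chunks(fake_chunker())
print(total, len(rows))
```

The returned list can be passed directly to `pd.DataFrame(rows)`, and comparing `total` with `len(rows)` gives a quick check that the stream was consumed completely.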


## LLMBasedTextChunker

In [4]:
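# These *_dev imports rebind ParserPDF and LLMBasedTextChunker, replacing the classes imported in the first cell.
from frony_document_manager.parser_dev import ParserPDF
from frony_document_manager.chunker_dev import LLMBasedTextChunker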
# import numpy as np
# import pandas as pd
# import re
# from tqdm import tqdm
# from langchain_text_splitters import RecursiveCharacterTextSplitter
# from transformers import AutoTokenizer
# from openai import OpenAI
# import Levenshtein
# from dotenv import load_dotenv
# load_dotenv()

In [5]:
parser = ParserPDF()
df = parser.parse("test_files/test_pdf.pdf")
df

Unnamed: 0,page_number,page_content
0,1,"Providedproperattributionisprovided,Googlehere..."
1,2,"1 Introduction\nRecurrentneuralnetworks,longsh..."
2,3,Figure1: TheTransformer-modelarchitecture.\nTh...
3,4,ScaledDot-ProductAttention Multi-HeadAttention...
4,5,output values. These are concatenated and once...
5,6,"Table1: Maximumpathlengths,per-layercomplexity..."
6,7,"n d,\nlength is smaller than the representatio..."
7,8,Table2: TheTransformerachievesbetterBLEUscores...
8,9,Table3: VariationsontheTransformerarchitecture...
9,10,Table4: TheTransformergeneralizeswelltoEnglish...


In [6]:
chunker = LLMBasedTextChunker(tokenizer_path="google-bert/bert-base-uncased", n_gram=2)
chunks = chunker.chunk(
    df.iloc[:5],  # first five pages only
    splitter_config=[
        {"type": "llm_text", "params": {"chunk_size": 512, "chunk_overlap": 512 // 4}},  # overlap = 128
    ],
)
total_chunks = next(chunks)  # the first item is the total chunk count
print(total_chunks)
df_chunk = pd.DataFrame(list(chunks))  # consume the remaining items into one DataFrame
df_chunk

Token indices sequence length is longer than the specified maximum sequence length for this model (812 > 512). Running this sequence through the model will result in indexing errors


15


create documents... (llm_text): 100%|██████████| 15/15 [01:08<00:00,  4.55s/it]


Unnamed: 0,page_number,chunk_type,chunk_id,chunk_content
0,1,llm_text,0,### 주제별 요약\n\n#### 1. 논문의 목적 및 배경\n이 논문에서는 기존의...
1,1,llm_text,0,### 주제별 요약\n\n1. **모델 성능 및 실험 결과**\n - 새로운 기...
2,1,llm_text,0,### 주제별 요약\n\n1. **모델 개발 및 평가**:\n - Nik은 원래...
3,1,llm_text,0,"### 1. 서론\n순환 신경망(Recurrent Neural Networks), ..."
4,2,llm_text,0,### 1. 주제: 시퀀스 모델링의 제약\n- 시퀀스 컴퓨테이션의 제약이 여전히 존...
5,2,llm_text,0,### 1. 신호 처리 및 연산 요구 사항\n- 두 임의의 입력 또는 출력 위치 간...
6,2,llm_text,0,### 1. Transformer 모델의 개요\n- Transformer는 입력과 ...
7,3,llm_text,0,**주제: Transformer 모델 아키텍처**\n\n1. **전반적인 구조**\...
8,3,llm_text,0,### 1. 디코더 구조\n디코더는 각 인코더 레이어에 두 개의 서브 레이어 외에 ...
9,3,llm_text,0,### 1. Scaled Dot-Product Attention\n- **정의**:...
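
The `chunk_size` and `chunk_overlap` parameters above describe overlapping windows: with a size of 512 and an overlap of 512 // 4 = 128, each window starts 384 units after the previous one. The sketch below illustrates that arithmetic on a plain list. It is not the library's splitter, whose internals are not shown in this notebook, and it assumes the unit is tokens. `sliding_windows` is a hypothetical name, and 812 is the sequence length reported in the tokenizer warning above.

```python
def sliding_windows(tokens, chunk_size, chunk_overlap):
    """Cut `tokens` into windows of `chunk_size` that share `chunk_overlap` items."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must satisfy 0 <= chunk_overlap < chunk_size")
    step = chunk_size - chunk_overlap  # distance between consecutive window starts
    windows = []
    for start in range(0, len(tokens), step):
        windows.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reaches the end of the input
    return windows


tokens = list(range(812))  # stand-in for an 812-token page
windows = sliding_windows(tokens, chunk_size=512, chunk_overlap=512 // 4)
print([len(w) for w in windows])
```

The last 128 items of one window are the first 128 items of the next, so content near a boundary appears in both chunks.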


In [7]:
chunker = LLMBasedTextChunker(n_gram=2)
chunks = chunker.chunk(
    df.iloc[:5],  # first five pages only
    splitter_config=[
        {"type": "llm_text", "params": {"chunk_size": 2048, "chunk_overlap": 2048 // 4}},  # overlap = 512
    ],
)
total_chunks = next(chunks)  # the first item is the total chunk count
print(total_chunks)
df_chunk = pd.DataFrame(list(chunks))  # consume the remaining items into one DataFrame
df_chunk

10


create documents... (llm_text): 100%|██████████| 10/10 [00:51<00:00,  5.12s/it]


Unnamed: 0,page_number,chunk_type,chunk_id,chunk_content
0,1,llm_text,0,### 주제별 요약\n\n#### 1. 연구 배경 및 목적\n- 기존의 시퀀스 변환...
1,1,llm_text,0,### 주제별 요약\n\n1. **연구 배경 및 목표**:\n - 연구는 영어 ...
2,2,llm_text,0,"### 1. 서론\n순환 신경망(Recurrent Neural Networks, R..."
3,2,llm_text,0,### 1. 배경\n- **목표**: 순차적 계산을 줄이는 것이 Extended N...
4,2,llm_text,0,### 주제별 요약\n\n#### 1. Transformer 및 Self-Atten...
5,3,llm_text,0,### 1. Transformer 모델 아키텍처\n- Transformer 모델은 ...
6,4,llm_text,0,### 주제별 요약\n\n#### 1. Scaled Dot-Product Atten...
7,4,llm_text,0,### 주제별 요약\n\n1. **문제점 및 해결책**\n - 매우 작은 기울기...
8,4,llm_text,0,### 1. 멀티 헤드 어텐션 (Multi-head Attention)\n- 멀티 ...
9,5,llm_text,0,### 1. 디코더의 정보 흐름 차단\n- 디코더에서 왼쪽으로 정보 흐름이 발생하지...
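
To see how the 10 chunks from the `chunk_size=2048` run are distributed across pages, the `page_number` column can be tallied. The sketch below uses only the standard library and hard-codes the page numbers from the table above, so it runs without the chunker; `pages_2048` is a name introduced here for illustration.

```python
from collections import Counter

# page_number values copied from the chunk_size=2048 table above
pages_2048 = [1, 1, 2, 2, 2, 3, 4, 4, 4, 5]

per_page = Counter(pages_2048)
for page, n_chunks in sorted(per_page.items()):
    print(f"page {page}: {n_chunks} chunk(s)")
```

On the live DataFrame, `df_chunk["page_number"].value_counts().sort_index()` gives the same tally.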


## LLMBasedImageChunker

In [8]:
parser = ParserPDFImage()
df = parser.parse("test_files/test_pdf.pdf")
df

Unnamed: 0,page_number,page_content
0,1,iVBORw0KGgoAAAANSUhEUgAACfYAAAzlCAIAAABT38lbAA...
1,2,iVBORw0KGgoAAAANSUhEUgAACfYAAAzlCAIAAABT38lbAA...
2,3,iVBORw0KGgoAAAANSUhEUgAACfYAAAzlCAIAAABT38lbAA...
3,4,iVBORw0KGgoAAAANSUhEUgAACfYAAAzlCAIAAABT38lbAA...
4,5,iVBORw0KGgoAAAANSUhEUgAACfYAAAzlCAIAAABT38lbAA...
5,6,iVBORw0KGgoAAAANSUhEUgAACfYAAAzlCAIAAABT38lbAA...
6,7,iVBORw0KGgoAAAANSUhEUgAACfYAAAzlCAIAAABT38lbAA...
7,8,iVBORw0KGgoAAAANSUhEUgAACfYAAAzlCAIAAABT38lbAA...
8,9,iVBORw0KGgoAAAANSUhEUgAACfYAAAzlCAIAAABT38lbAA...
9,10,iVBORw0KGgoAAAANSUhEUgAACfYAAAzlCAIAAABT38lbAA...
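
Every `page_content` value above starts with `iVBORw0KGgo`, which is the Base64 encoding of the 8-byte PNG file signature, so `ParserPDFImage` appears to return each page as a Base64-encoded PNG. Below is a minimal sketch for checking that assumption on a string; `looks_like_png` is a hypothetical helper and not part of `frony_document_manager`.

```python
import base64

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"  # the fixed 8-byte header of every PNG file


def looks_like_png(b64_text: str) -> bool:
    """Return True if the Base64 text decodes to bytes that start with the PNG signature."""
    head = b64_text[:12]  # 12 Base64 characters decode to 9 bytes, enough for the signature
    if len(head) < 12:
        return False
    try:
        raw = base64.b64decode(head, validate=True)
    except ValueError:  # binascii.Error is a subclass of ValueError
        return False
    return raw.startswith(PNG_SIGNATURE)


print(looks_like_png("iVBORw0KGgoAAAANSUhEUgAACfYAAAzlCAIAAABT38lbAA"))  # prefix shown in the table
print(looks_like_png("/9j/4AAQSkZJRgABAQAA"))  # a typical JPEG prefix, for contrast
```

Only the first 12 characters are decoded, so the check also works on the truncated strings shown in the table.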


In [9]:
chunker = LLMBasedImageChunker()
chunks = chunker.chunk(df.iloc[:5])  # first five page images only
total_chunks = next(chunks)  # the first item is the total chunk count
print(total_chunks)
df_chunk = pd.DataFrame(list(chunks))  # consume the remaining items into one DataFrame
df_chunk

5


create documents... (llm_image): 100%|██████████| 5/5 [00:33<00:00,  6.65s/it]


Unnamed: 0,page_number,chunk_type,chunk_id,chunk_content
0,1,llm_image,0,"논문 ""Attention Is All You Need""의 요약은 다음과 같습니다.\..."
1,2,llm_image,0,### 1. 서론\n- **재귀 신경망**: 언어 모델링과 기계 번역에서 주류 방법...
2,3,llm_image,0,이미지는 Transformer 모델 아키텍처에 대한 설명을 포함하고 있습니다. 다음...
3,4,llm_image,0,### 주제별 요약\n\n#### 1. Scaled Dot-Product Atten...
4,5,llm_image,0,다음은 주제별로 요약한 내용입니다.\n\n### 1. 멀티헤드 어텐션\n- 멀티헤드...
