*REMEMBER: evaluation should be done at each step and also at the end, as an integration test.

In [1]:
import numpy as np
import os
import pandas as pd
import re, sys

import langchain.text_splitter as lcts

In [2]:
os.listdir("data/processed")

['.ipynb_checkpoints', '2025_1Q_LGES_Audit_Report_CONFS_en_fulltext.txt']

In [3]:
with open("data/processed/2025_1Q_LGES_Audit_Report_CONFS_en_fulltext.txt", 'r') as f:
    lges_2025_1q_fulltext = f.read()

## 1. Recursive character text split

- There is a list of separators. The text is first split with the first separator; if a resulting chunk is still larger than `chunk_size`, it is split again with the next separator, and so on, so that `len(chunk_i) <= chunk_size`.
- Default separators for LangChain = ['\n\n', '\n', ' ', '']


BUT there is one downside: if your application displays the retrieved context to users, this may not be optimal (or you need to add further steps), because chunks do not respect sentence boundaries.


Things to experiment:

- [x] (1.1) If we create chunks such that each chunk is grammatically complete (full sentences), does the performance drop?
- [ ] (1.2) Optimal `chunk_size` and `chunk_overlap`? This probably depends on the text. Can we adjust the parameters dynamically so that they suit the text we are working with?

Notice that once we run out of separators, we cannot chunk any further.

For the first example (ignoring surrounding whitespace):
1. Split with `"\n\n"` -> `['abcasdcv,mndf', 'efghijklmn']`
2. Both pieces are still >= 10 characters long, so they are split again with the remaining separators, down to `""` (individual characters).
3. `['abcasdcv,m', 'ndf', 'efghijklm', 'n']`: now all `len(chunk_i) <= chunk_size`.

For the second example, the separators run out, so a chunk's length may be larger than `chunk_size`.

In [None]:
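# Examples:
# Import here so this cell runs on its own (the class is otherwise imported in a later cell).
from langchain.text_splitter import RecursiveCharacterTextSplitter

txt = 'abcasdcv,mndf \n\n efghijklmn'
print("text to chunk: ", repr(txt), len(txt))
print()

# Example 1: the full default separator list, ending with "" (character-level split).
rec_text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=10, chunk_overlap=0,
    separators=["\n\n", "\n", " ", ""]
)

print(r'rec character split with (default) separators: ["\n\n", "\n", " ", ""] ')
print(rec_text_splitter.split_text(txt))
print()

# Example 2: a shorter separator list, so the splitter runs out of separators.
rec_text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=10, chunk_overlap=0, separators=["\n\n", "\n"]
)
print(r'rec character split with separators: ["\n\n", "\n"] ')
print(rec_text_splitter.split_text(txt))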

In [4]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

rec_text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=300,
    chunk_overlap=50,
    length_function=len,
    is_separator_regex=False
)

chunks = rec_text_splitter.create_documents([lges_2025_1q_fulltext])

In [16]:
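import pickle
chunk_data_path = 'data/chunks/'
# The chunks are serialized with pickle, so use a .pkl extension rather than .json.
file_name = 'chunks_rec_text_split_cs300_co50_defaultsep.pkl'

# Make sure the output directory exists before writing.
os.makedirs(chunk_data_path, exist_ok=True)
with open(os.path.join(chunk_data_path, file_name), "wb") as f:
    pickle.dump(chunks, f)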

In [13]:
chunks

[Document(metadata={}, page_content='--- Page 1 ---\nLG ENERGY SOLUTION, LTD. AND ITS SUBSIDIARIES Interim Condensed Consolidated Financial Statements As of March 31, 2025, and December 31, 2024, and For the Three-Month Periods Ended March 31, 2025 and 2024 (With the Independent Auditors Review Report Thereon)'),
 Document(metadata={}, page_content='--- Page 2 ---'),
 Document(metadata={}, page_content='Table of Contents Report on Review of Interim Condensed Consolidated Financial Statements Page Interim Condensed Consolidated Financial Statements As of March 31, 2025, and December 31, 2024, and For the Three-Month Periods Ended March 31, 2025 and 2024: Interim Condensed Consolidated Statements of'),
 Document(metadata={}, page_content='Interim Condensed Consolidated Statements of Financial Position 1 Interim Condensed Consolidated Statements of Profit or Loss 3 Interim Condensed Consolidated Statements of Comprehensive Income 5 Interim Condensed Consolidated Statements of Changes in E

In [7]:
chunks[1]

Document(metadata={}, page_content='--- Page 2 ---')

In [8]:
len(chunks)

600

### 1.1. Sentence-aware chunking

These downsides explain why character-size splitting is often preferred over sentence-aware chunking:
- Variability in sentence length: it is hard to choose one optimal embedding model, since sentences that are too short will not carry enough semantic signal, whereas the signal of sentences that are too long will be diluted.
  - Intuitively, if x is very short, the embedding vector f(x) may be dominated by high-frequency, non-informative tokens (e.g., “the”, “was”), which are common across many documents. The cosine similarity space becomes noisy.
  - If x is too long, since embeddings are often mean-pooled (the mean of the token vectors is taken), pooling a large x may wash out rare-but-important signals.
  - In practice, chunks of 200~500 tokens (???). But why? Still to be checked.
    - Some papers on this topic: [(2024) Late Chunking: Contextual chunk embeddings using long-context embedding models](https://arxiv.org/pdf/2409.04701), [(2024) Evaluating chunking strategies for retrieval - Chroma](https://research.trychroma.com/evaluating-chunking)
- Less predictable overlap
  - ??? Open question: why is overlap important in chunking?
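
To make the mean-pooling point above concrete, here is a small toy sketch (not a real embedding model). It assumes one "important" token vector and fills the rest of the chunk with random Gaussian vectors standing in for unrelated tokens; `pooled_similarity`, `dim`, and the filler counts are all made up for illustration. The cosine similarity between the pooled vector and the important token tends to shrink as more filler is averaged in.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mean_pool(vectors):
    """Element-wise mean of a list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


random.seed(0)
dim = 64  # hypothetical embedding dimension

# One "rare but important" token: a unit vector along the first axis.
signal = [1.0] + [0.0] * (dim - 1)


def pooled_similarity(n_filler):
    # Assumption: unrelated tokens behave like random Gaussian vectors.
    filler = [[random.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_filler)]
    pooled = mean_pool([signal] + filler)
    return cosine(pooled, signal)


similarities = {n: pooled_similarity(n) for n in (0, 5, 50, 500)}
for n, sim in similarities.items():
    print(f"{n:>3} filler tokens -> cosine with signal = {sim:.3f}")
```

With no filler the pooled vector equals the signal, so the similarity is 1; with many filler tokens it should end up much closer to 0. Real token embeddings are not random noise, so this only illustrates the averaging effect, not actual retrieval quality.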

<br>

Alternative method:
- Hybrid
  1. Split into sentences.
  2. Group consecutive sentences until you hit ~500–800 tokens.
  3. Add overlaps of 1–2 sentences for continuity.
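
The three hybrid steps above could be sketched roughly as follows. This is a minimal illustration under simplifying assumptions: sentences are split with a naive regex, and tokens are approximated by whitespace-separated words rather than a real tokenizer. The helper names (`split_sentences`, `count_tokens`, `group_sentences`) are hypothetical.

```python
import re


def split_sentences(text):
    # Naive splitter: break after '.', '!' or '?' followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def count_tokens(text):
    # Assumption: whitespace-separated words approximate model tokens.
    return len(text.split())


def group_sentences(sentences, max_tokens=500, overlap_sentences=1):
    """Group consecutive sentences up to max_tokens, carrying over the last
    overlap_sentences sentences into the next chunk for continuity."""
    chunks, current = [], []
    for sentence in sentences:
        candidate = current + [sentence]
        if current and count_tokens(" ".join(candidate)) > max_tokens:
            chunks.append(" ".join(current))
            # Start the next chunk with the trailing sentences as overlap.
            current = current[-overlap_sentences:] if overlap_sentences > 0 else []
        current.append(sentence)
    if current:
        chunks.append(" ".join(current))
    return chunks


text = "Revenue increased. Costs fell. Profit rose sharply. Cash was stable."
hybrid_chunks = group_sentences(split_sentences(text), max_tokens=5, overlap_sentences=1)
for chunk in hybrid_chunks:
    print(chunk)
```

Note that a chunk can still exceed `max_tokens` when a single sentence is very long, or when the carried-over overlap plus the next sentence is already over the limit; a real pipeline would need to handle those cases.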

In [None]:
chunks

## 2. Semantic chunking
Group consecutive sentences into one chunk as long as their similarity score is high enough.

Steps:

1. Split the text into sentences.
2. Compute the similarity score between the first and second sentence.
    1. If it is below the similarity threshold, set a split point between them; the second sentence becomes the new sentence_1.
    2. Otherwise, concatenate the 1st and 2nd sentences. The result becomes sentence_1 and the third sentence becomes sentence_2.
3. Repeat this process until no sentences remain.
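
A minimal sketch of the steps above. To keep it self-contained, `embed` is a toy bag-of-words counter standing in for a real embedding model (an assumption; in practice you would call a sentence-embedding model here), and the `threshold` value is arbitrary.

```python
import math
import re
from collections import Counter


def embed(text):
    # Toy stand-in for an embedding model: lowercase bag-of-words counts.
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def semantic_chunks(sentences, threshold=0.2):
    """Merge each sentence into the running chunk while similarity stays
    at or above the threshold; otherwise start a new chunk."""
    if not sentences:
        return []
    chunks = []
    current = sentences[0]
    for nxt in sentences[1:]:
        if cosine(embed(current), embed(nxt)) < threshold:
            chunks.append(current)  # split point between current and nxt
            current = nxt
        else:
            current = current + " " + nxt  # concatenate and keep comparing
    chunks.append(current)
    return chunks


sentences = [
    "The battery plant increased battery output.",
    "Battery output rose at the plant in Poland.",
    "Interest expense on borrowings decreased.",
    "Borrowings were repaid and interest expense fell.",
]
sem_chunks = semantic_chunks(sentences, threshold=0.2)
for chunk in sem_chunks:
    print(chunk)
```

Here the two battery sentences end up in one chunk and the two borrowing sentences in another, because the third sentence shares no words with the running chunk. With a real embedding model the threshold would need to be tuned on the actual text.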

## Contextual chunking
[(2024) Introducing Contextual Retrieval - Anthropic](https://www.anthropic.com/news/contextual-retrieval)

## 3. Late chunking

- [Stop Losing Context! How Late Chunking Can Enhance Your Retrieval Systems - Youtube](https://www.youtube.com/watch?v=Hj7PuK1bMZU)
- [(2024) Late chunking in long-context embedding models](https://jina.ai/news/late-chunking-in-long-context-embedding-models/)
- [(2024) What Late chunking really is & What it's not: Part II](https://jina.ai/news/what-late-chunking-really-is-and-what-its-not-part-ii/)
- [(2024) Late Chunking: Balancing Precision and Cost in Long context retrieval](https://weaviate.io/blog/late-chunking)