## Chunking

! pip install transformers

In [1]:
model_name = 'snowflake/snowflake-arctic-embed-m-long'
emb_dimensions = 768

In [2]:
# Load tokenizer (must match the model)
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)

In [3]:
filename = "bfp-a3447q.pdf"
input_path = "data/" + filename.split('.')[0]+'.txt'
output_path= input_path.split('.')[0]+'_chunked.txt'
image_path = "data/images"

In [6]:
import math
def chunk_text_by_lines(text, tokenizer, token_limit):
    """
    Splits text into chunks based on token count, ensuring cuts only happen
    between lines (preserving '\\n' inside each chunk).

    Returns (chunks, chunk_size), where chunk_size is the soft limit used.
    A single line longer than token_limit is kept whole as its own chunk.
    """
    lines = text.splitlines(keepends=True)  # keep '\n' at end of each line
    chunks, current_chunk, current_tokens = [], [], 0
    all_tokens = tokenizer.encode(text, add_special_tokens=False)
    if not all_tokens:
        return [], 0  # empty input: avoid dividing by zero below
    # Soft limit: spread the tokens evenly over the minimum number of chunks
    n_chunks = math.ceil(len(all_tokens) / token_limit)
    chunk_size = math.ceil(len(all_tokens) / n_chunks)
    for line in lines:
        line_tokens = tokenizer.encode(line, add_special_tokens=False)
        # Hard limit: flush first if adding this line would exceed token_limit
        # (skipped when the chunk is empty, so no empty chunk is emitted)
        if current_chunk and current_tokens + len(line_tokens) > token_limit:
            chunks.append("".join(current_chunk).strip())
            current_chunk, current_tokens = [], 0
        current_chunk.append(line)
        current_tokens += len(line_tokens)
        # Soft limit: flush once the chunk has grown past chunk_size
        if current_tokens > chunk_size:
            chunks.append("".join(current_chunk).strip())
            current_chunk, current_tokens = [], 0

    # add last chunk
    if current_chunk:
        chunks.append("".join(current_chunk).strip())

    return chunks, chunk_size
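
The soft limit computed inside `chunk_text_by_lines` can be checked in isolation. The sketch below repeats only that arithmetic; the helper name `soft_chunk_size` is hypothetical and does not appear in the notebook.

```python
import math

def soft_chunk_size(total_tokens: int, token_limit: int) -> int:
    """Even per-chunk target: total tokens spread over the fewest chunks that fit."""
    n_chunks = math.ceil(total_tokens / token_limit)  # fewest chunks under the hard limit
    return math.ceil(total_tokens / n_chunks)         # even share per chunk

# 2500 tokens with a 1000-token limit need 3 chunks, so the target is 834 tokens each
print(soft_chunk_size(2500, 1000))  # 834
# One token over the limit splits into two roughly equal halves
print(soft_chunk_size(1001, 1000))  # 501
```

Balancing the target this way avoids a full first chunk followed by a tiny remainder, at the cost of chunks that are usually shorter than `token_limit`.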

In [10]:
import re
import pathlib, json
json_read = pathlib.Path(input_path).read_text()
data = json.loads(json_read)
chunked_data = []
token_limit = 1000
print(f'Token threshold: {token_limit}')
for index, chunk in enumerate(data):
    text = chunk[-1]
    # Replace <br> tags with "; " to reduce token consumption and clean tables
    text = re.sub(r"<br\s*/?>", "; ", text)

    # Collapse runs of spaces and runs of dots (e.g. dot leaders), and drop tildes
    text = re.sub(r" +", " ", text)
    text = re.sub(r"\.+", ".", text)
    text = re.sub(r"~", "", text)

    chunks, size = chunk_text_by_lines(text, tokenizer, token_limit)
    # Use a separate loop variable so `text` is not overwritten
    for chunk_text in chunks:
        chunked_data.append(
            [chunk[0],
             chunk[1],
             chunk[2],
             chunk_text]
        )
    if len(chunks) > 1:
        # "Text length" is the character length of the last chunk, not of the whole text
        print(f"{index}) Text length: {len(chunks[-1])}\tToken chunk size: {size}\tChunks count: {len(chunks)}\t for {chunk[1]}")


Token threshold: 1000
8) Text length: 409	Token chunk size: 974	Chunks count: 5	 for 1.2.2 Combination of the robot arm and the controller
9) Text length: 426	Token chunk size: 779	Chunks count: 5	 for 1.3 CE marking specifications
15) Text length: 2306	Token chunk size: 694	Chunks count: 2	 for 1.7 Contents of the Option equipment and special specification
19) Text length: 2801	Token chunk size: 862	Chunks count: 2	 for (1) RH-3CH-Sxx
20) Text length: 2919	Token chunk size: 945	Chunks count: 2	 for (2) RH-6CH-Sxx
26) Text length: 1023	Token chunk size: 884	Chunks count: 4	 for (1) Setting Load Capacity and Size (Hand Conditions)
39) Text length: 1245	Token chunk size: 811	Chunks count: 3	 for (1) Normal environmental specification
52) Text length: 1868	Token chunk size: 569	Chunks count: 2	 for (2) Machine cable (flexed type)
57) Text length: 2719	Token chunk size: 693	Chunks count: 2	 for 3.1.1 Basic specifications
66) Text length: 2155	Token chunk size: 687	Chunks count: 3	 for 3.5 
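
The effect of the four cleanup substitutions in the cell above is easiest to see on a small input. This is a minimal sketch: the wrapper `clean_text` and the sample string are made up for illustration, while the regular expressions are the ones used in the loop.

```python
import re

def clean_text(text: str) -> str:
    """Apply the same cleanup steps as the chunking loop, in the same order."""
    text = re.sub(r"<br\s*/?>", "; ", text)  # <br>, <br/> and <br /> become "; "
    text = re.sub(r" +", " ", text)          # collapse runs of spaces
    text = re.sub(r"\.+", ".", text)         # collapse runs of dots (dot leaders)
    return re.sub(r"~", "", text)            # drop tildes

sample = "Mass<br>Size<br/>Load   capacity ..... 12~"
print(clean_text(sample))  # Mass; Size; Load capacity . 12
```

Note that the dot rule also shortens a deliberate ellipsis to a single period, which may or may not be desirable for other documents.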

In [14]:
import json
json_file=json.dumps(chunked_data, indent=2)
# Exporting data to output file for storage
with open(output_path, mode='w') as f_out:
    f_out.write(json_file)
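
Before relying on the exported chunks, it may be worth confirming that none exceeds the token limit, since a single line longer than the limit is kept whole. The sketch below is one way to do that; `find_oversized` is a hypothetical helper, and the whitespace-based counter only stands in for the real tokenizer so the example runs without downloading a model.

```python
from typing import Callable, List, Tuple

def find_oversized(chunks: List[str],
                   count_tokens: Callable[[str], int],
                   token_limit: int) -> List[Tuple[int, int]]:
    """Return (index, token_count) for every chunk above token_limit."""
    oversized = []
    for i, chunk in enumerate(chunks):
        n = count_tokens(chunk)
        if n > token_limit:
            oversized.append((i, n))
    return oversized

# Stand-in counter (assumption): whitespace-separated words instead of model tokens
count_words = lambda s: len(s.split())

demo_chunks = ["a b c", "a b c d e f"]
print(find_oversized(demo_chunks, count_words, token_limit=4))  # [(1, 6)]
```

In the notebook, the counter would presumably be `lambda s: len(tokenizer.encode(s, add_special_tokens=False))`, applied to the last element of each row in `chunked_data`.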

In [5]:
data_folder = "data/"
filename = data_folder + "bfp-a3447q.pdf"
content_path= filename.split('.')[0]+'_chunked.txt'
context_path= filename.split('.')[0]+'_context.txt'
image_path = "data/images"

In [6]:
import pathlib, json
json_read = pathlib.Path(context_path).read_text()
data_context = json.loads(json_read)
data_context = {int(k): v for k, v in data_context.items()}
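
The `int(k)` conversion above is needed because JSON object keys are always strings, so integer keys do not survive a dump and load cycle. A minimal illustration, with a made-up dictionary:

```python
import json

context = {34: "summary of chapter 34"}

# Round trip through JSON: the integer key comes back as the string "34"
restored = json.loads(json.dumps(context))
print(list(restored))  # ['34']

# Converting the keys restores the original integer lookup
restored = {int(k): v for k, v in restored.items()}
print(list(restored))  # [34]
```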

In [9]:
max_len = 0
for k,v in data_context.items():
    all_tokens = tokenizer.encode(v, add_special_tokens=False)
    if len(all_tokens) > 600:
        if len(all_tokens) > max_len:
            max_len = len(all_tokens)
        print(f'For chapter no: {k}, token count is {len(all_tokens)}')

For chapter no: 31, token count is 717
For chapter no: 34, token count is 1106
For chapter no: 45, token count is 725
For chapter no: 67, token count is 756
For chapter no: 71, token count is 639
For chapter no: 104, token count is 665
For chapter no: 106, token count is 601
For chapter no: 108, token count is 742
For chapter no: 116, token count is 687
For chapter no: 119, token count is 602
For chapter no: 135, token count is 649


In [16]:
text = """
The document provides technical specifications for different robot models (RH-6CH6020-S11/S15, RH-6CH7020-S11/S15, RH-6CH6020-S23/S24, RH-6CH7020-S23/S24). These specifications include allowable inertia, J3(Z) axis pressing force, maximum eccentricity, mass, tool wiring , tool pneumatic pipes, supply pressure, protection, ambient temperature, painting color, and machine cable length.\nNotes clarify: Note 1 covers max speed with MvTune2 and no auto compensation. Note 3 explains cycle time under MvTune2, affected by accuracy needs and carrying mass per model. Note 5 defines J3(Z) axis pressing force and overload risks. Note 8 explains setting environmental temperature with parameter OLTMX for control and protection.\n\nSection 2.1.2 addresses counter-force on installation surface, listing falls moment (ML), torsion moment (MT), horizontal force (FH), and vertical force (FV) for RH-3CH-Sxx, RH-6CH6020-Sxx, and RH-6CH7020-Sxx.\n\nSection 2.2 covers mass capacity and tooling: eccentric loads impose restrictions; tooling must follow allowable inertia values (2.1.1). Cautions include alarms (vibration, overload, overcurrent) caused by speed or posture even within limits; proper parameter settings (mass, centroid) are crucial to prevent gear/belt wear. Mass capacity and allowable inertia are dynamic limits, not guarantees of accuracy. Long or low-rigid tooling may reduce accuracy or cause vibration. Allowable offset (Z direction) from shaft to center of gravity is 100 mm. Load deviation allowance is 150 mm, reduced to 10 mm in large inertia mode. Figures 2-1 to 2-3 show center of gravity for loads.\n\nSection 2.2.3 explains mass capacity, speed, and acceleration/deceleration: robot optimizes automatically based on load data (mass and size). Manual reduction of speed/accel may be required if vibration or overheating occurs. Wrong load settings shorten lifespan. For high accuracy, correct load input and reduced acceleration/deceleration are recommended. 
Parameters: "HNDDAT*" (hand), "WRKDAT*" (work), set with "LoadSet" command (default LoadSet 0.0). Factory defaults for hand mass, size, and center of gravity are given in a table.
"""
all_tokens = tokenizer.encode(text, add_special_tokens=False)

In [17]:
len(all_tokens)

554

In [12]:
text

'\nThe document provides technical specifications for different robot models (RH-6CH6020-S11/S15, RH-6CH7020-S11/S15, RH-6CH6020-S23/S24, RH-6CH7020-S23/S24). These specifications include allowable inertia (values in parentheses for large inertia mode), J3(Z) axis pressing force, maximum eccentricity (also with large inertia mode values), mass, tool wiring (15 points, D-SUB), tool pneumatic pipes (φ6 × 2, φ4 × 1), supply pressure (0.5±10% MPa), protection (IP20), ambient temperature (0 to 40 ℃), painting color (Light gray, Munsell 0.6B7.6/0.2), and machine cable length (5m).\n\nNotes clarify: Note 1 covers max speed with MvTune2 and no auto compensation. Note 3 explains cycle time under MvTune2, affected by accuracy needs and carrying mass per model. Note 4 specifies inertia values in parentheses are for large inertia mode. Note 5 defines J3(Z) axis pressing force and overload risks. Note 6 says eccentricity values in parentheses are for large inertia mode. Note 8 explains setting envi

In [18]:
data_context[34] = text

In [19]:
import json
json_file=json.dumps(data_context, indent=2)
# Exporting data to output file for storage
with open(context_path, mode='w') as f_out:
    f_out.write(json_file)