# Research Notes and Testing

This notebook experiments with the ManualsGraph task: extracting structured machine constraints from technical manuals. I focus on robot arm datasheets, specifically torque specifications, and test a few approaches to get reliable text and numbers out of PDFs.

There will be commentary along the way on my thought process behind certain code snippets and observations.

What I cover:
* PDF text extraction using pypdf

* Regex + keyword-based matching to extract torque limits from both running text and tables


I am starting with a regex-based approach because it is fast and deterministic. Its logic can then serve as a good “first-pass” filter that simplifies downstream curation and enables smaller, cheaper models later (only where needed), reducing Forgis's token-count overhead cost.

In [None]:
from pypdf import PdfReader
import re
import json
#import camelot

In [18]:
TEST_PDFs = "test_manuals"

robot_arm = "./self_found_manuals/robot_arm.pdf" 

In [19]:
reader = PdfReader(robot_arm)

## Preliminary visualisation

I will start with an initial evaluation of `robot_arm.pdf` because it is in tabular format, which seems to be the most common way of representing technical specifications, from what I have seen searching online.

For now I will use pypdf (which is used in the provided scripts) and see how the reader represents the full text (part of previous pipelines).

In [20]:
full_text=""
for page in reader.pages:
    page_text = page.extract_text()
    print(page_text)
    full_text += page_text + "\n\n"

SPECIFICATION OF ROBOT
KJ314JWE25
KJ314JTE25
KJ314JVE25
KJ264JFE25
KJ264JGE25
KJ264JTE25
KJ264JVE25
3rd Edition : Jun.23.2014
KAWASAKI HEAVY INDUSTRIES LTD.
   ROBOT DIV. 
    Doc,No:90101-2139DEC
Materials and specifications are subject to change without notice.
1. Specification of Robot
[1-1] Robot Arm (KJ314J)
1. Model KJ314J-D0 , KJ314J-D1
2. Type Articulated robot + Swing unit
3. Degree of freedom 7 axes (6 axes + 1axis)
4. Axis specification Operating axis Max. operating range
  Arm rotation (JT1) +120 ゜ ～－120 ゜
Arm out-in (JT2) +130 ゜ ～－ 80 ゜
Arm up-down (JT3) +  90 ゜ ～－ 65 ゜
Wrist roll (JT4) +720 ゜ ～－720 ゜
Wrist roll (JT5) +720 ゜ ～－720 ゜
Wrist roll (JT6) +410 ゜ ～－410 ゜
Swing (JT7) +  90 ゜ ～－ 90 ゜
5. Repeatability ±0.5 mm (at the tool mounting surface)
6. Playback Accuracy ±1.0 mm (at the tool mounting surface)
7. Max. payload Wrist : 15 kg
Upper arm : 25 kg
        (on the Upper Arm :Include painting equipments in pressurized compartment)
8. Max. painting speed 1500 mm/s (at th

### Quick observations

1. The PDF encompasses several robots at once, so correctly detecting which specification table refers to which model is vital. Otherwise, it is very easy to end up with noisy data.
2. This PDF contains tables inside other tables (see image). When extracted as raw text, the structure degrades and key headers can get separated from their values. For example, this snippet appears without the nearby headers for Max. Torque and Moment of Inertia:

```text
JT4 56.2 N･m           2.19 kg･m2
JT5 43.4 N･m           1.31 kg･m2
JT6 22.0 N･m           0.33 kg･m2
```
As you can see, the headers "Max. Torque" and "Moment of Inertia" are not nearby; they appear at the end of the page in the extracted text.
The rest seems to come through "ok", for example: `7. Max. payload Wrist : 15 kg`.

3. The document is from a Japanese brand, so the symbols `～－` are used for negative degrees in the operating-axis ranges. This may seem insignificant, but it could affect the pipeline/regex later and is something I need to take into consideration further down the line. LLMs should not struggle with this if I give them the full Unicode text.
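To make the third observation concrete, here is a minimal sketch of how the operating-range rows could be normalised before matching. The code points in `SYMBOL_MAP` are my assumption about what the extracted symbols are (fullwidth tilde, fullwidth hyphen-minus, and the mark used as a degree sign), and `parse_operating_range` is a hypothetical helper, not part of the existing pipeline.

```python
import re

# Assumed code points for the symbols seen in the extracted text.
SYMBOL_MAP = str.maketrans({
    "\uff5e": "~",       # fullwidth tilde
    "\u301c": "~",       # wave dash (common look-alike)
    "\uff0d": "-",       # fullwidth hyphen-minus
    "\u2212": "-",       # minus sign
    "\uff0b": "+",       # fullwidth plus
    "\u309c": "\u00b0",  # mark used as a degree sign in this PDF
})

# "<sign><number> ° ~ <sign><number> °", tolerant of stray spaces
RANGE_PATTERN = re.compile(
    r"([+-])\s*(\d+(?:\.\d+)?)\s*\u00b0\s*~\s*([+-])\s*(\d+(?:\.\d+)?)\s*\u00b0"
)

def parse_operating_range(line):
    """Return (first_limit, second_limit) in degrees, or None if no range is found."""
    m = RANGE_PATTERN.search(line.translate(SYMBOL_MAP))
    if m is None:
        return None
    first = float(m.group(1) + m.group(2))
    second = float(m.group(3) + m.group(4))
    return first, second

print(parse_operating_range("Arm out-in (JT2) +130 ゜ ～－ 80 ゜"))
```

Mapping the symbols explicitly, rather than relying on a generic Unicode normalisation, keeps the behaviour predictable for the mark used as a degree sign.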


## Quick test of the Regex

In [21]:
TORQUE_PATTERN = r"(\d+(?:\.\d+)?)\s*([Nn][\s\.\-･·]*[Mm])"

In [22]:
matches = re.finditer(TORQUE_PATTERN, full_text, re.IGNORECASE)
for match in matches:
    print(match)


<re.Match object; span=(1043, 1051), match='56.2 N･m'>
<re.Match object; span=(1077, 1085), match='43.4 N･m'>
<re.Match object; span=(1111, 1119), match='22.0 N･m'>
<re.Match object; span=(1515, 1521), match='0.5 Nm'>
<re.Match object; span=(3266, 3274), match='56.2 N･m'>
<re.Match object; span=(3300, 3308), match='43.4 N･m'>
<re.Match object; span=(3334, 3342), match='22.0 N･m'>
<re.Match object; span=(3837, 3843), match='0.5 Nm'>


The above regex seems to be working, although it picks up noise, specifically the mistaken match `0.5 Nm` taken from `0.5 Nm 3/min`.

We will update the regex to make it more robust. This is a process I will need to be able to test, prove, and measure for each extraction to avoid extracting erroneous data.

In [23]:
TORQUE_PATTERN = r"\b(\d+(?:\.\d+)?)\s*N\s*(?:[·･⋅∙・]\s*)?m\b(?![\u00B2\u00B3\u00B9\u2070-\u2079]|\^|/)"

In [24]:
matches = re.finditer(TORQUE_PATTERN, full_text, re.IGNORECASE)
for match in matches:
    print(match)

<re.Match object; span=(1043, 1051), match='56.2 N･m'>
<re.Match object; span=(1077, 1085), match='43.4 N･m'>
<re.Match object; span=(1111, 1119), match='22.0 N･m'>
<re.Match object; span=(3266, 3274), match='56.2 N･m'>
<re.Match object; span=(3300, 3308), match='43.4 N･m'>
<re.Match object; span=(3334, 3342), match='22.0 N･m'>


Seems to be fixed: the `0.5 Nm` false positive no longer appears.

Within my work on ManualsGraph we may need to look into expanding this extraction with an LLM NER approach with a confidence score.

As it's a table, we can try assuming that each new line is a row in the table. It seems to line up within this PDF. This approach is nowhere near robust enough for the final ManualsGraph approach, but it allows a proof of concept.
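Since the pattern needs to be testable and measurable, a small table of expected results can act as a regression check each time the regex changes. This is only a sketch: apart from the first line, which comes from the extracted text above, the sample strings are hypothetical, and `torque_values` and `run_checks` are illustrative names.

```python
import re

# Same pattern as the cell above
TORQUE_PATTERN = r"\b(\d+(?:\.\d+)?)\s*N\s*(?:[·･⋅∙・]\s*)?m\b(?![\u00B2\u00B3\u00B9\u2070-\u2079]|\^|/)"

def torque_values(text):
    """Return the numeric part of every torque match in the text."""
    return [m.group(1) for m in re.finditer(TORQUE_PATTERN, text, re.IGNORECASE)]

# Input line -> numbers we expect the pattern to extract
EXPECTED = {
    "JT4 56.2 N･m           2.19 kg･m2": ["56.2"],  # from the PDF text
    "Tightening torque: 0.5 Nm": ["0.5"],           # hypothetical
    "Rated torque 100 N m": ["100"],                # hypothetical
    "Air consumption 0.5 Nm3/min": [],              # hypothetical, not a torque
    "Torsional stiffness 3 Nm/rad": [],             # hypothetical, not a torque
    "Spring rate 12 N/m": [],                       # hypothetical, not a torque
}

def run_checks():
    """Return a list of (text, expected, got) for every failing case."""
    failures = []
    for text, expected in EXPECTED.items():
        got = torque_values(text)
        if got != expected:
            failures.append((text, expected, got))
    return failures

print(run_checks())
```

Two limitations are worth keeping in mind: because of `re.IGNORECASE`, a length such as `100 nm` would also match, and a value written with a space before the exponent (`0.5 Nm 3/min`) would still be picked up as `0.5 Nm`.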

In [None]:

matches = re.finditer(TORQUE_PATTERN, full_text, re.IGNORECASE)

def find_context(matches, full_text):
    """Return each torque match together with its line and the line above it."""
    results = []
    for match in matches:
        start, end = match.span()

        # Find the start of the line (look backward for \n)
        line_start = full_text.rfind('\n', 0, start) + 1

        # Find the end of the line (look forward for \n)
        line_end = full_text.find('\n', end)
        if line_end == -1:  # End of document
            line_end = len(full_text)

        line_content = full_text[line_start:line_end].strip()

        # Grab the line ABOVE for more context (Max/Min labels);
        # a match on the very first line has no previous line
        if line_start > 0:
            prev_line_start = full_text.rfind('\n', 0, line_start - 1) + 1
            prev_line = full_text[prev_line_start:line_start].strip()
        else:
            prev_line = ""

        results.append({
            "torque": match.group(),
            "line": line_content,
            "context": f"{prev_line} | {line_content}"
        })
    return results

results = find_context(matches=matches, full_text=full_text)
for result in results:
    print(result)

{'torque': '56.2 N･m', 'line': 'JT4 56.2 N･m           2.19 kg･m2', 'context': '9. | JT4 56.2 N･m           2.19 kg･m2'}
{'torque': '43.4 N･m', 'line': 'JT5 43.4 N･m           1.31 kg･m2', 'context': 'JT4 56.2 N･m           2.19 kg･m2 | JT5 43.4 N･m           1.31 kg･m2'}
{'torque': '22.0 N･m', 'line': 'JT6 22.0 N･m           0.33 kg･m2', 'context': 'JT5 43.4 N･m           1.31 kg･m2 | JT6 22.0 N･m           0.33 kg･m2'}
{'torque': '56.2 N･m', 'line': 'JT4 56.2 N･m           2.19 kg･m2', 'context': '9. | JT4 56.2 N･m           2.19 kg･m2'}
{'torque': '43.4 N･m', 'line': 'JT5 43.4 N･m           1.31 kg･m2', 'context': 'JT4 56.2 N･m           2.19 kg･m2 | JT5 43.4 N･m           1.31 kg･m2'}
{'torque': '22.0 N･m', 'line': 'JT6 22.0 N･m           0.33 kg･m2', 'context': 'JT5 43.4 N･m           1.31 kg･m2 | JT6 22.0 N･m           0.33 kg･m2'}


Seems to be working: we can see each of the axis names (`JT4`, `JT5`, `JT6`) at the start of its row.

For this PDF we could assume that the start of the line, up to the first space, is the name/header for the row. If it is not maximum/minimum or anything similar, we could say it is the name of the axis/point.

Knowing this we can try and write a function and expand on what we have now to see if we can get the axis name.

In [None]:

BAD_ROW_HEADERS = {
    "max", "maximum", "min", "minimum", "note", "notes",
    "torque", "moment", "inertia", "speed", "velocity"
}

def extract_axis_name(line: str) -> str | None:
    line = line.strip()
    if not line:
        return None

    # I assume: first token on the line is the row header (e.g., "JT4")
    first = line.split()[0].strip(",:;()[]")
    if first and not first[0].isdigit() and first.lower() not in BAD_ROW_HEADERS:
        # accept common axis-like tokens
        if re.fullmatch(r"(?i)jt\d+", first) or re.fullmatch(r"(?i)(axis|joint)\d+", first):
            return first.upper().replace(" ", "")
        # if it's not axis-like, you can still return it as a fallback row label
        # return first

    # Fallback find JT/axis/joint anywhere in the line
    m = re.search(r"(?i)\b(JT\d+|axis\s*\d+|joint\s*\d+)\b", line)
    if m:
        return m.group(1).upper().replace(" ", "")

    return None


In [27]:
for r in results:
    r["axis"] = extract_axis_name(r["line"]) or ""
    print(r["axis"] )

JT4
JT5
JT6
JT4
JT5
JT6


We were able to get the axis and its maximum torque spec, although this will not work for all applications. The current implementation would break completely if given an interval or a different table structure.

That said, the line/row extraction logic may be useful in future as a preprocessing step, since it lets us provide a cleaner input for an LLM to interpret. Running a lightweight 8B model locally on Forgis servers or via free tiers would offer a robust, zero-cost way to handle these complex specifications with the flexibility we are currently missing with this regex approach.
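As an illustration of the interval case, the sketch below extends the same unit pattern with a second number and a separator. The accepted separators (`-`, en dash, `~`, `to`) and the sample strings are assumptions on my part; `robot_arm.pdf` itself does not contain torque intervals, so this is untested against real data.

```python
import re

NUM = r"\d+(?:\.\d+)?"
# Unit part taken from TORQUE_PATTERN above
UNIT = r"N\s*(?:[·･⋅∙・]\s*)?m\b(?![\u00B2\u00B3\u00B9\u2070-\u2079]|\^|/)"

# "<low> <separator> <high> <unit>", e.g. "10 - 20 Nm" or "1.5 to 2.5 N m"
TORQUE_RANGE_PATTERN = re.compile(
    rf"\b({NUM})\s*(?:[-\u2013~]|to)\s*({NUM})\s*{UNIT}",
    re.IGNORECASE,
)

def extract_torque_range(line):
    """Return {'min', 'max', 'unit'} for the first interval found, else None."""
    m = TORQUE_RANGE_PATTERN.search(line)
    if m is None:
        return None
    low, high = sorted((float(m.group(1)), float(m.group(2))))
    return {"min": low, "max": high, "unit": "Nm"}

print(extract_torque_range("Tightening torque 10 - 20 Nm"))
print(extract_torque_range("JT4 56.2 Nm"))
```

Trying the range pattern first and falling back to the single-value pattern would avoid recording only the upper bound of an interval as if it were a standalone maximum.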
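One way the preprocessing idea could look, as a rough sketch: keep only the lines that mention torque (or a torque-like value) plus their neighbours, and pass that reduced text to a model. `KEYWORDS`, `candidate_lines`, and the `window` parameter are hypothetical names, and the keyword pattern is deliberately simpler than `TORQUE_PATTERN`.

```python
import re

# Loose trigger: the word "torque" or a digit followed by an N m style unit
KEYWORDS = re.compile(r"torque|\d\s*N\s*[·･]?\s*m\b", re.IGNORECASE)

def candidate_lines(text, window=1):
    """Return non-empty lines that match KEYWORDS, plus `window` lines either side."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    keep = set()
    for i, ln in enumerate(lines):
        if KEYWORDS.search(ln):
            lo = max(0, i - window)
            hi = min(len(lines), i + window + 1)
            keep.update(range(lo, hi))
    return [lines[i] for i in sorted(keep)]

sample = "\n".join([
    "SPECIFICATION OF ROBOT",
    "5. Repeatability ±0.5 mm",
    "9.",
    "JT4 56.2 Nm 2.19 kg m2",
    "JT5 43.4 Nm 1.31 kg m2",
    "7. Max. payload Wrist : 15 kg",
    "Upper arm : 25 kg",
])
for line in candidate_lines(sample):
    print(line)
```

The window keeps a little surrounding context (such as a row label on the previous line) while still cutting most of the page, which is where the token savings would come from.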

That said, let’s continue by checking whether we can reliably detect the robot.

There are a few ways to approach this. PDFs can be structured page-by-page, and in robot_arm.pdf each page appears to contain a table, with roughly one robot per page.

This suggests a simple hierarchical assumption: once a robot is identified, we treat subsequent rows/fields as belonging to that robot until a new robot is explicitly introduced.


This seems more logical than assigning values by word distance or some other proximity-based approach.


Below is a line-by-line approach.

In [35]:
# Pattern to detect the Robot Model (e.g., KJ314J or KJ264J)
ROBOT_PATTERN = r"\[\d+-\d+\]\s*Robot\s*Arm\s*\((.*?)\)"

def extract_hierarchical_specs(text):
    lines = text.split('\n')
    current_robot = "Unknown Robot"
    results = []

    for line in lines:
        clean_line = line.strip()
        if not clean_line:
            continue
        robot_match = re.search(ROBOT_PATTERN, clean_line)
        if robot_match:
            current_robot = robot_match.group(1)
            continue 
        
        torque_matches = re.finditer(TORQUE_PATTERN, clean_line, re.IGNORECASE)
        
        for match in torque_matches:
            axis = extract_axis_name(clean_line) or "N/A"
            
            results.append({
                "robot_model": current_robot,
                "axis": axis,
                "torque_spec": match.group(0),
                "raw_context": clean_line
            })

    return results

structured_data = extract_hierarchical_specs(full_text)

print(structured_data)

[{'robot_model': 'KJ314J', 'axis': 'JT4', 'torque_spec': '56.2 N･m', 'raw_context': 'JT4 56.2 N･m           2.19 kg･m2'}, {'robot_model': 'KJ314J', 'axis': 'JT5', 'torque_spec': '43.4 N･m', 'raw_context': 'JT5 43.4 N･m           1.31 kg･m2'}, {'robot_model': 'KJ314J', 'axis': 'JT6', 'torque_spec': '22.0 N･m', 'raw_context': 'JT6 22.0 N･m           0.33 kg･m2'}, {'robot_model': 'KJ264J', 'axis': 'JT4', 'torque_spec': '56.2 N･m', 'raw_context': 'JT4 56.2 N･m           2.19 kg･m2'}, {'robot_model': 'KJ264J', 'axis': 'JT5', 'torque_spec': '43.4 N･m', 'raw_context': 'JT5 43.4 N･m           1.31 kg･m2'}, {'robot_model': 'KJ264J', 'axis': 'JT6', 'torque_spec': '22.0 N･m', 'raw_context': 'JT6 22.0 N･m           0.33 kg･m2'}]


In [None]:
def extract_to_json_format(text):
    # Updated Patterns
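    # TODO: could not reliably complete for this PDF. Note that an empty pattern matches
    # every line with an empty string, so "section" is always "" in the output below.
    SECTION_PATTERN = r""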
    ROBOT_PATTERN = r"\[\d+-\d+\]\s*Robot\s*Arm\s*\((.*?)\)" # Matches "[1-1] Robot Arm (KJ314J)"

    lines = text.split('\n')
    
    # State tracking
    current_section = "General"
    current_robot = "Unknown Robot"
    results = []

    for line in lines:
        clean_line = line.strip()
        if not clean_line:
            continue

        # 1. Update Section State
        section_match = re.search(SECTION_PATTERN, clean_line)
        if section_match:
            current_section = section_match.group(0).strip()
        
        # 2. Update Entity State
        robot_match = re.search(ROBOT_PATTERN, clean_line)
        if robot_match:
            current_robot = robot_match.group(1).strip()
            continue

        # 3. Extract Values and Map to Schema
        torque_matches = re.finditer(TORQUE_PATTERN, clean_line, re.IGNORECASE)
        for match in torque_matches:
            # Re-use the axis-name logic from above
            axis_id = extract_axis_name(clean_line) or "wrist"
            val_str = match.group(1) # The numeric part

            results.append({
                "section": current_section,
                "entity": f"{current_robot}_{axis_id}".strip(),
                "property": "max_torque",
                "value": float(val_str),
                "unit": "Nm",
                "source_sentence": clean_line
            })

    return results

# Process the text
final_output = extract_to_json_format(full_text)

# Print as formatted JSON
print(json.dumps(final_output, indent=2, ensure_ascii=False))

[
  {
    "section": "",
    "entity": "KJ314J_JT4",
    "property": "max_torque",
    "value": 56.2,
    "unit": "Nm",
    "source_sentence": "JT4 56.2 N･m           2.19 kg･m2"
  },
  {
    "section": "",
    "entity": "KJ314J_JT5",
    "property": "max_torque",
    "value": 43.4,
    "unit": "Nm",
    "source_sentence": "JT5 43.4 N･m           1.31 kg･m2"
  },
  {
    "section": "",
    "entity": "KJ314J_JT6",
    "property": "max_torque",
    "value": 22.0,
    "unit": "Nm",
    "source_sentence": "JT6 22.0 N･m           0.33 kg･m2"
  },
  {
    "section": "",
    "entity": "KJ264J_JT4",
    "property": "max_torque",
    "value": 56.2,
    "unit": "Nm",
    "source_sentence": "JT4 56.2 N･m           2.19 kg･m2"
  },
  {
    "section": "",
    "entity": "KJ264J_JT5",
    "property": "max_torque",
    "value": 43.4,
    "unit": "Nm",
    "source_sentence": "JT5 43.4 N･m           1.31 kg･m2"
  },
  {
    "section": "",
    "entity": "KJ264J_JT6",
    "property": "max_torque",
    "va

## Result Achieved
The above code returns what we want to see: a quick prototype that gives us the desired JSON output:
```json
[
  {
    "section": "",
    "entity": "KJ314J_JT4",
    "property": "max_torque",
    "value": 56.2,
    "unit": "Nm",
    "source_sentence": "JT4 56.2 N･m           2.19 kg･m2"
  }
]
```
Although the extraction of the section was not successful for this PDF, due to its similarity with the table row names, the assumptions were good enough to continue on to Task 3.
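Before moving on, a lightweight check on each record can make gaps such as the empty section visible instead of silent. This is a sketch only: the field names and types are taken from the JSON shown above, while `REQUIRED_FIELDS` and `validate_record` are hypothetical names rather than part of any ManualsGraph schema definition.

```python
# Field names and types as they appear in the JSON output above
REQUIRED_FIELDS = {
    "section": str,
    "entity": str,
    "property": str,
    "value": float,
    "unit": str,
    "source_sentence": str,
}

def validate_record(record):
    """Return a list of human-readable problems; an empty list means the record looks fine."""
    problems = []
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], expected_type):
            problems.append(f"{field} should be {expected_type.__name__}")
    # Flag the known gap: the section could not be extracted for this PDF
    if not problems and record["section"] == "":
        problems.append("empty section")
    return problems

example = {
    "section": "",
    "entity": "KJ314J_JT4",
    "property": "max_torque",
    "value": 56.2,
    "unit": "Nm",
    "source_sentence": "JT4 56.2 N･m           2.19 kg･m2",
}
print(validate_record(example))
```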

### Camelot:
I tried camelot, though it also breaks at section 9, where there is a table within a table. It seems to want to create a separate table and ends up splitting the first part of the table from everything after row 9.

I spent more time on this than I am willing to admit, and due to the time constraints of this challenge I have decided to park it for now. There is a future here, possibly by leveraging pypdf to ensure the tables don't end up fragmented.

In [1]:
#tables = camelot.read_pdf(robot_arm, pages='all', flavor='stream')
#tables.export('foo.csv', f='csv', compress=True)