## Block 1: PDF Text Extraction

### What this block does
This block reads the service manual PDF and extracts all **text content** from it.
The extracted text will be used in later steps like chunking, embeddings, and retrieval.

### Why this is needed
- The service manual is a PDF, not plain text.
- LLMs and embedding models cannot directly understand PDFs.
- We must convert the PDF into text first.

### How it works (conceptually)
- Open the PDF file from disk.
- Read the PDF **page by page** to handle large files safely.
- From each page, extract only the **visible text**.
- Ignore images, diagrams, and drawings.
- Keep track of the **page number** for each extracted text.

### Output of this block
- A list of pages where each item contains:
  - Page number
  - Text extracted from that page

Example output (conceptual):
- Page 24 → Torque specifications text
- Page 25 → Front suspension description text


This block only converts the PDF into raw text.


In [1]:
pip install fitz

Defaulting to user installation because normal site-packages is not writeable
Note: you may need to restart the kernel to use updated packages.


In [2]:
import sys

# Ensure PyMuPDF is installed
!{sys.executable} -m pip install PyMuPDF --quiet

# Block 1: PDF Text Extraction
# This code reads a PDF file and extracts text page by page.

import fitz  # PyMuPDF

# Path to the service manual PDF
pdf_path = "sample-service-manual 1.pdf"

# List to store extracted text from each page
pages_text = []

# Open the PDF
with fitz.open(pdf_path) as pdf:
    # Loop through each page
    for page_number in range(len(pdf)):
        page = pdf[page_number]

        # Extract text from the page
        text = page.get_text()

        # Store page number and text
        pages_text.append({
            "page": page_number + 1,
            "text": text
        })

# Simple check: print text from the first page
print("Page:", pages_text[0]["page"])
print(pages_text[0]["text"][:500])  # print first 500 characters

Page: 1
Suspension System 
Inspection and Verification 
1.
Road test. 
z Verify the customer concern by carrying out a road test on a smooth road. If any vibrations are 
apparent, refer to Section 100-04 . 
2.
Inspect tires. 
z Check the tire pressure with all normal loads in the vehicle and the tires cold. Refer to the 
Vehicle Certification (VC) label. 
z Verify that all tires are sized to specification. Refer to the VC label. 
z Inspect the tires for incorrect wear and damage. Install new tires as ne


## Block 2: Text Cleaning and Normalization

### What this block does
This block cleans the raw text extracted from the PDF so it becomes easier to process later.

### Why this is needed
- PDF text often contains:
  - Extra spaces
  - Random line breaks
  - Page headers and footers
- If we do not clean the text, chunking and extraction will be messy and inaccurate.

### How it works (conceptually)
- Take the raw text from each page.
- Remove extra spaces and empty lines.
- Merge broken lines into proper sentences.
- Keep the actual technical content like:
  - Torque values
  - Procedure steps
  - Section titles

### Output of this block
- Cleaned text for each page.
- Same structure as before, but text is more readable and consistent.

Example (conceptual):
- Before:  
  `"Brake disc shield bolts    17   Nm"`
- After:  
  `"Brake disc shield bolts 17 Nm"`

### What this block does NOT do
- No chunking
- No embeddings
- No LLM usage

This block only prepares clean text for the next step.


In [3]:
# Block 2: Text Cleaning and Normalization
# This code cleans the raw text extracted from the PDF pages.

import re

# List to store cleaned text
cleaned_pages_text = []

for page_data in pages_text:
    raw_text = page_data["text"]

    # Convert multiple spaces into a single space
    text = re.sub(r"\s+", " ", raw_text)

    # Remove leading and trailing spaces
    text = text.strip()

    # Store cleaned text with page number
    cleaned_pages_text.append({
        "page": page_data["page"],
        "text": text
    })

# Simple check: print cleaned text from the first page
print("Page:", cleaned_pages_text[0]["page"])
print(cleaned_pages_text[0]["text"][:500])  # first 500 characters


Page: 1
Suspension System Inspection and Verification 1. Road test. z Verify the customer concern by carrying out a road test on a smooth road. If any vibrations are apparent, refer to Section 100-04 . 2. Inspect tires. z Check the tire pressure with all normal loads in the vehicle and the tires cold. Refer to the Vehicle Certification (VC) label. z Verify that all tires are sized to specification. Refer to the VC label. z Inspect the tires for incorrect wear and damage. Install new tires as necessary. 


## Block 3: Text Chunking

### What this block does
This block splits the cleaned text into **smaller chunks** so it can be embedded and searched efficiently.

### Why this is needed
- The service manual is very large.
- Embedding models and LLMs work best with smaller text pieces.
- Chunking helps retrieve only the **relevant section** (for example, torque tables).

### How it works (conceptually)
- Take cleaned text page by page.
- Split text into chunks of fixed size (for example, a few hundred words).
- Allow a small overlap between chunks to avoid losing context.
- Keep page number information for traceability.

### Output of this block
- A list of text chunks.
- Each chunk contains:
  - Chunk text
  - Page number (or page range)

Example (conceptual):
- Chunk 1 → Page 24 → Torque specifications
- Chunk 2 → Page 25 → Front suspension description

### What this block does NOT do
- No embeddings
- No vector database
- No LLM usage

This block only prepares text for embedding and retrieval.


In [4]:
# Block 3: Text Chunking
# This code splits cleaned text into smaller overlapping chunks.

# Parameters for chunking
CHUNK_SIZE = 500    # number of words per chunk
CHUNK_OVERLAP = 50  # number of overlapping words

chunks = []

for page_data in cleaned_pages_text:
    words = page_data["text"].split()
    page_number = page_data["page"]

    start = 0
    while start < len(words):
        end = start + CHUNK_SIZE
        chunk_words = words[start:end]

        chunk_text = " ".join(chunk_words)

        chunks.append({
            "page": page_number,
            "text": chunk_text
        })

        # Move start forward with overlap
        start = end - CHUNK_OVERLAP

# Simple check: print first chunk
print("Page:", chunks[0]["page"])
print(chunks[0]["text"][:500])


Page: 1
Suspension System Inspection and Verification 1. Road test. z Verify the customer concern by carrying out a road test on a smooth road. If any vibrations are apparent, refer to Section 100-04 . 2. Inspect tires. z Check the tire pressure with all normal loads in the vehicle and the tires cold. Refer to the Vehicle Certification (VC) label. z Verify that all tires are sized to specification. Refer to the VC label. z Inspect the tires for incorrect wear and damage. Install new tires as necessary. 


## Block 4: Embeddings and Vector Store

### What this block does
This block converts each text chunk into a **numeric vector (embedding)** and stores it in a **vector database**.

### Why this is needed
- Text search using keywords is weak for large manuals.
- Embeddings allow **semantic search** (meaning-based).
- This helps find the right section even if words are different.

### How it works (conceptually)
- Take each text chunk from Block 3.
- Use an embedding model to convert text into numbers.
- Store these vectors in a vector store.
- Keep metadata like page number and original text.

### Output of this block
- A vector database containing:
  - Embedding vectors
  - Chunk text
  - Page number

Example (conceptual):
- Vector → “Brake disc shield bolts 17 Nm” → Page 24
- Vector → “Lower arm forward nuts torque” → Page 32



This block only prepares data for fast and accurate retrieval.


In [5]:
pip install faiss-cpu

Defaulting to user installation because normal site-packages is not writeable
Note: you may need to restart the kernel to use updated packages.


In [6]:
pip install sentence_transformers

Defaulting to user installation because normal site-packages is not writeable
Note: you may need to restart the kernel to use updated packages.


In [7]:
pip install tf-keras

Defaulting to user installation because normal site-packages is not writeable
Note: you may need to restart the kernel to use updated packages.


In [8]:
# Block 4: Embeddings and Vector Store
# This code converts text chunks into embeddings and stores them in a FAISS vector index.

from sentence_transformers import SentenceTransformer
import faiss
import numpy as np

# Load embedding model
model = SentenceTransformer("all-MiniLM-L6-v2")

# Extract chunk texts
texts = [chunk["text"] for chunk in chunks]

# Create embeddings
embeddings = model.encode(texts, show_progress_bar=True)

# Convert embeddings to numpy array
embeddings = np.array(embeddings).astype("float32")

# Create FAISS index
dimension = embeddings.shape[1]
index = faiss.IndexFlatL2(dimension)

# Add embeddings to index
index.add(embeddings)

# Store metadata separately (for retrieval later)
metadata = chunks

# Simple check
print("Total chunks indexed:", index.ntotal)


2026-02-13 20:14:27.629504: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.


Batches:   0%|          | 0/28 [00:00<?, ?it/s]

Total chunks indexed: 875


## Block 5: Retrieval (Finding Relevant Chunks)

### What this block does
This block finds the **most relevant text chunks** from the vector database for a user query.

### Why this is needed
- The full manual is very large.
- We only want the **few sections** related to the question.
- Retrieval reduces noise before sending text to an LLM.

### How it works (conceptually)
- Take a user query (example: “Torque for brake caliper bolts”).
- Convert the query into an embedding using the same model.
- Search the vector database for the closest matching chunks.
- Return the top matching chunks with their page numbers.

### Output of this block
- A small list of relevant text chunks.
- Each result contains:
  - Chunk text
  - Page number

Example (conceptual):
- Result 1 → Page 24 → Torque specifications table
- Result 2 → Page 32 → Installation steps mentioning torque


This block only selects the best text for the LLM.


In [9]:
# Block 5: Retrieval (Finding Relevant Chunks)
# This code searches the vector store to find chunks related to a user query.

# Example user query
query = "Torque for brake caliper bolts"

# Convert query to embedding
query_embedding = model.encode([query])
query_embedding = np.array(query_embedding).astype("float32")

# Number of top results to retrieve
TOP_K = 5

# Search the FAISS index
distances, indices = index.search(query_embedding, TOP_K)

# Collect relevant chunks
retrieved_chunks = []
for idx in indices[0]:
    retrieved_chunks.append(metadata[idx])

# Simple check: print retrieved results
for i, chunk in enumerate(retrieved_chunks, 1):
    print(f"\nResult {i} (Page {chunk['page']}):")
    print(chunk["text"][:300])



Result 1 (Page 734):
Torque Specifications a Refer to the procedure in this section. SECTION 206-07: Power Brake Actuation 2014 F-150 Workshop Manual SPECIFICATIONS Procedure revision date: 10/25/2013 Description Nm lb-ft lb-in Brake booster nuts a — — — Brake master cylinder nuts 25 18 — Brake vacuum pump bolts a — — —

Result 2 (Page 672):
General Specifications Torque Specifications SECTION 206-05: Parking Brake and Actuation 2014 F-150 Workshop Manual SPECIFICATIONS Procedure revision date: 10/25/2013 Material Item Specification Fill Capacity Molykote® M77 8U7Z-19A506-A — — Item Specification Parking brake shoe clearance 0.59 mm (0.

Result 3 (Page 652):
General Specifications Torque Specifications SECTION 206-04: Rear Disc Brake 2014 F-150 Workshop Manual SPECIFICATIONS Procedure revision date: 10/25/2013 Material Item Specification Fill Capacity High Performance DOT 3 Motor Vehicle Brake Fluid PM-1-C (US); CPM-1-C (Canada) WSS-M6C62-A or WSS-M6C65

Result 4 (Page 636):
General S

## Block 6: LLM-based Structured Specification Extraction

### What this block does
This block uses an **LLM** to extract **structured vehicle specifications** from the retrieved text.

### Why this is needed
- Retrieved text is still unstructured.
- We need clean outputs like:
  - Component name
  - Specification type (torque, capacity, etc.)
  - Value
  - Unit
- Structured data is easier to store, analyze, or export as JSON/CSV.

### How it works (conceptually)
- Take the top retrieved chunks from Block 5.
- Send them to an LLM with a clear instruction.
- Ask the LLM to extract only specification-related data.
- Force the output to follow a fixed JSON format.

### Output of this block
- A list of structured specifications.

Example output (conceptual):
```json
[
  {
    "component": "Brake disc shield bolt",
    "spec_type": "Torque",
    "value": "17",
    "unit": "Nm",
    "page": 24
  }
]


In [16]:
from huggingface_hub import login
login()


VBox(children=(HTML(value='<center> <img\nsrc=https://huggingface.co/front/assets/huggingface_logo-noborder.sv…

Token has not been saved to git credential helper.


In [17]:
from huggingface_hub import whoami
whoami()


{'type': 'user',
 'id': '68424b82ba04d3ceff5902ab',
 'name': 'SandipanMondal06',
 'fullname': 'Sandipan Mondal',
 'email': '24n0457@iitb.ac.in',
 'emailVerified': True,
 'canPay': False,
 'billingMode': 'prepaid',
 'periodEnd': 1772323200,
 'isPro': False,
 'avatarUrl': '/avatars/dce855125c039f035ff5e62a6ad236af.svg',
 'orgs': [],
 'auth': {'type': 'access_token',
  'accessToken': {'displayName': 'Service_Manual_RAG',
   'role': 'read',
   'createdAt': '2026-02-13T14:41:33.219Z'}}}

In [18]:
!pip install -q transformers accelerate torch huggingface_hub


In [19]:
pip install accelerate

Defaulting to user installation because normal site-packages is not writeable
Note: you may need to restart the kernel to use updated packages.


In [20]:
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_name = "mistralai/Mistral-7B-Instruct-v0.2"

tokenizer = AutoTokenizer.from_pretrained(model_name)

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="auto"
)

print("Model loaded successfully")



Fetching 3 files:   0%|          | 0/3 [00:00<?, ?it/s]

model-00001-of-00003.safetensors:   0%|          | 0.00/4.94G [00:00<?, ?B/s]

model-00002-of-00003.safetensors:   0%|          | 0.00/5.00G [00:00<?, ?B/s]

model-00003-of-00003.safetensors:   0%|          | 0.00/4.54G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/111 [00:00<?, ?B/s]

Model loaded successfully


In [21]:
# Simple test prompt to check model response

prompt = "Say hello and tell me what torque means in vehicles in one sentence."

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=50,
    temperature=0.7
)

response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)

The following generation flags are not valid and may be ignored: ['temperature']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


Say hello and tell me what torque means in vehicles in one sentence.

Torque is a measure of rotational force that turns the wheels of a vehicle, enabling it to move or accelerate.


In [24]:
import json

# Block 6: LLM-based Structured Specification Extraction
# This code uses the LLM to extract structured specifications from the retrieved text.

# Prepare the context from retrieved chunks
context = ""
for chunk in retrieved_chunks:
    context += f"Page {chunk['page']}: {chunk['text']}\n\n"

# Define the system prompt for structured extraction
system_prompt = (
    "You are an expert at extracting vehicle specifications from service manuals.\n"
    "Your task is to identify and extract any torque specifications or other numerical specifications (like fluid capacities) from the provided text.\n"
    "Present the extracted information in a JSON array format. Each object in the array should have the following keys:\n"
    "- 'component': The name of the component the specification refers to (e.g., 'Brake caliper bolts', 'Engine oil capacity').\n"
    "- 'spec_type': The type of specification (e.g., 'Torque', 'Capacity', 'Pressure', 'Dimension').\n"
    "- 'value': The numerical value of the specification.\n"
    "- 'unit': The unit of measurement (e.g., 'Nm', 'lb-ft', 'liters', 'psi', 'mm').\n"
    "- 'page': The original page number from which the information was extracted.\n"
    "If a specification provides a range of values, represent it as 'min-max'. If no unit is explicitly mentioned but implied, try to infer it. If no relevant information is found, return an empty JSON array.\n"
    "Example format:\n"
    "```json\n"
    "[\n"
    "  {\n"
    "    \"component\": \"Brake disc shield bolt\",\n"
    "    \"spec_type\": \"Torque\",\n"
    "    \"value\": \"17\",\n"
    "    \"unit\": \"Nm\",\n"
    "    \"page\": 24\n"
    "  },\n"
    "  {\n"
    "    \"component\": \"Engine oil capacity\",\n"
    "    \"spec_type\": \"Capacity\",\n"
    "    \"value\": \"5.5\",\n"
    "    \"unit\": \"liters\",\n"
    "    \"page\": 10\n"
    "  }\n"
    "]\n"
    "```"
)

# Combine system prompt, user query, and context for the LLM
full_prompt = f"[INST] {system_prompt}\n\nHere is the relevant text:\n\n{context}\n\nExtract the specifications in the described JSON format. [/INST]"

# Tokenize the prompt
inputs = tokenizer(full_prompt, return_tensors="pt").to(model.device)

# Generate response from the LLM
# Set max_new_tokens to allow for longer JSON output
outputs = model.generate(
    **inputs,
    max_new_tokens=1000, # Increased tokens for detailed JSON output
    temperature=0.01, # Use a low temperature for more deterministic output
    num_return_sequences=1
)

# Decode the response and extract the JSON part
response_text = tokenizer.decode(outputs[0], skip_special_tokens=True)

# Find the JSON block in the response
json_start = response_text.find('```json')
json_end = response_text.find('```', json_start + 1)

extracted_json = ""
if json_start != -1 and json_end != -1:
    extracted_json = response_text[json_start + len('```json'):json_end].strip()
else:
    print("Could not find JSON block in the response:")
    print(response_text)

# Parse the JSON
specifications = []
try:
    if extracted_json:
        specifications = json.loads(extracted_json)
    else:
        print("No JSON found or extracted JSON was empty.")
except json.JSONDecodeError as e:
    print(f"JSON decoding error: {e}")
    print(f"Problematic JSON string: {extracted_json}")

# Print the extracted specifications
if specifications:
    print("Extracted Specifications:")
    for spec in specifications:
        print(spec)
else:
    print("No structured specifications were extracted.")


Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


Extracted Specifications:
{'component': 'Brake disc shield bolt', 'spec_type': 'Torque', 'value': '17', 'unit': 'Nm', 'page': 24}
{'component': 'Engine oil capacity', 'spec_type': 'Capacity', 'value': '5.5', 'unit': 'liters', 'page': 10}


## Block 7: Save Extracted Specifications

### What this block does
This block saves the extracted structured specifications (from Block 6) into a JSON file.

### Why this is needed
- The extracted data is valuable and should be persisted.
- A JSON file is a common and easily parsable format for structured data.
- This allows for further analysis, integration with other systems, or simply keeping a record of the extracted information.

### How it works (conceptually)
- Takes the `specifications` list.
- Uses the `json` library to write this list to a file.

### Output of this block
- A JSON file named `extracted_specifications.json` containing the list of dictionaries.


In [25]:
# Block 7: Save Extracted Specifications
# This code saves the extracted specifications to a JSON file.

output_filename = "extracted_specifications.json"

# Ensure specifications list is not empty before saving
if specifications:
    with open(output_filename, 'w') as f:
        json.dump(specifications, f, indent=4)
    print(f"Extracted specifications saved to {output_filename}")
else:
    print("No specifications to save. The 'specifications' list is empty.")

# Optional: Display the content of the saved file to confirm
# with open(output_filename, 'r') as f:
#     saved_content = f.read()
# print("\nContent of the saved JSON file:\n")
# print(saved_content)

Extracted specifications saved to extracted_specifications.json
