# Chatbot for Singer Heavy Duty Sewing Machine

This is a chatbot built to help you retrieve information and guidelines from the Singer Heavy Duty 4423 sewing machine manual.

### Setup, Imports and Configure API

We start by importing the necessary libraries: `fitz` (PyMuPDF) extracts text from PDF pages, `numpy` handles numerical operations, `polars` provides efficient DataFrames, and `tqdm` gives visual feedback during processing. We then create the GenAI client with an API key read from the `API_KEY` environment variable, so the key never appears in the notebook itself.


In [2]:
import os
import numpy as np
import fitz  # PyMuPDF
import polars as pl

from tqdm import tqdm
from pathlib import Path
from google import genai
from google.genai import types

client = genai.Client(api_key=os.getenv("API_KEY"))

### Load the PDF File and Extract Text

We extract the raw text content from each page of the PDF using PyMuPDF. This gives us a list in which each item holds the text of one page.

In [4]:
pdf_path = Path(r"C:\Users\Gebruiker\Documents\DS24\Chatbot\Singer_4423_EN.pdf")

def extract_text_from_pdf(pdf_path):
    doc = fitz.open(pdf_path)
    text_pages = [page.get_text() for page in doc]
    doc.close()
    return text_pages

pages = extract_text_from_pdf(pdf_path)

print(f"Total pages: {len(pages)}")

Total pages: 32


### Chunk the Text using *fixed-length chunking*

We split the extracted text into fixed-length chunks of 1000 words with an overlap of 200 words. This allows the model to maintain context between chunks while staying within token limits.

In [5]:
full_text = "\n".join(pages)

# Fixed-length chunking
def chunk_text(text, chunk_size=1000, overlap=200):
    words = text.split()
    chunks = []
    start = 0
    while start < len(words):
        end = min(start + chunk_size, len(words))
        chunk = " ".join(words[start:end])
        chunks.append(chunk)
        start += chunk_size - overlap  # Move forward with overlap
    return chunks

chunks = chunk_text(full_text, chunk_size=1000, overlap=200)
print(f"Total chunks created: {len(chunks)}")

Total chunks created: 7
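
To make the overlap concrete, here is a small sketch that mirrors the loop in `chunk_text` but returns only the word offsets of each chunk. The helper name `chunk_windows` and the toy sizes are assumptions made for illustration; they are not part of the notebook.

```python
def chunk_windows(n_words, chunk_size=1000, overlap=200):
    """Return (start, end) word offsets per chunk, mirroring chunk_text's loop."""
    step = chunk_size - overlap  # how far each new chunk advances
    windows = []
    start = 0
    while start < n_words:
        end = min(start + chunk_size, n_words)
        windows.append((start, end))
        start += step
    return windows

# Toy example: 25 words, chunks of 10 words, overlap of 3 (so a step of 7).
windows = chunk_windows(25, chunk_size=10, overlap=3)
print(windows)  # [(0, 10), (7, 17), (14, 24), (21, 25)]
```

Each window starts 3 words before the previous one ends. With the notebook's defaults the step is 800 words, so 7 chunks correspond to a text of between 4,801 and 5,600 words. One thing to be aware of: the loop can emit a final short window that lies entirely inside the previous one (with 24 words in the toy setup, the last two windows are `(14, 24)` and `(21, 24)`).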


### Embed Chunks with GenAI

We use the `text-embedding-004` model to generate a dense vector embedding for each chunk, turning the text into a numerical representation that can be compared by similarity. We then store the chunks and their embeddings in a Polars DataFrame, with a unique `chunk_id` per chunk for easy reference and downstream use.

In [7]:
result = client.models.embed_content(
    model="text-embedding-004",
    contents=chunks,
    config=types.EmbedContentConfig(task_type="SEMANTIC_SIMILARITY"),
).embeddings

In [None]:
# Create DataFrame
df = pl.DataFrame({
    "chunk_id": list(range(len(chunks))),
    "text_chunk": chunks,
    "embedding": [r.values for r in result]
})

### Semantic Search Helper Functions

Implement cosine similarity and a semantic search function. The search embeds a user query and finds the top k most similar chunks.

In [13]:
def cosine_similarity(vec1, vec2):
    return np.dot(vec1, vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2))

def semantic_search(query, chunks, embeddings, k=5):
    query_embedding = client.models.embed_content(
        model="text-embedding-004",
        contents=[query],
        config=types.EmbedContentConfig(task_type="SEMANTIC_SIMILARITY")
    ).embeddings[0].values

    similarity_scores = []
    for i, chunk_embedding in enumerate(embeddings):
        similarity_score = cosine_similarity(query_embedding, chunk_embedding)
        similarity_scores.append((i, similarity_score))
    similarity_scores.sort(key=lambda x: x[1], reverse=True)
    top_indices = [index for index, _ in similarity_scores[:k]]
    return [chunks[index] for index in top_indices]
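
The ranking step can be tried without calling the API. The sketch below computes the same cosine similarity for all rows at once with NumPy and returns the top-k indices. The function name `top_k_cosine` and the toy 2-D vectors are assumptions for illustration; real embeddings have many more dimensions.

```python
import numpy as np

def top_k_cosine(query_vec, matrix, k=5):
    """Return the indices and scores of the k rows of `matrix` closest to `query_vec`."""
    matrix = np.asarray(matrix, dtype=float)
    query_vec = np.asarray(query_vec, dtype=float)
    # Cosine similarity: dot product divided by the product of the norms.
    norms = np.linalg.norm(matrix, axis=1) * np.linalg.norm(query_vec)
    scores = matrix @ query_vec / norms
    # argsort sorts ascending, so negate the scores to rank the highest first.
    order = np.argsort(-scores)[:k]
    return order.tolist(), scores[order].tolist()

toy_embeddings = [
    [1.0, 0.0],   # same direction as the query
    [0.0, 1.0],   # orthogonal to the query
    [1.0, 1.0],   # 45 degrees from the query
    [-1.0, 0.0],  # opposite direction
]
indices, scores = top_k_cosine([2.0, 0.0], toy_embeddings, k=2)
print(indices)  # [0, 2]
```

The query's length does not matter, only its direction: `[2.0, 0.0]` scores 1.0 against `[1.0, 0.0]`. For a handful of chunks the loop in `semantic_search` is just as good; the vectorized form mainly helps once there are many embeddings.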


### Generate Prompt with Retrieved Context

Create a user prompt combining the question and the retrieved context chunks for use by the language model.

In [14]:
system_prompt = """I will ask you a question, and I want you to answer based only on the context I provide, and no other information.
If there isn’t enough information in the context to answer the question, say 'I don’t know.' 
Do not try to guess. Express yourself clearly and divide the answer into well-structured paragraphs."""

def generate_user_prompt(query):
    context = "\n".join(semantic_search(query, df['text_chunk'], df['embedding']))
    user_prompt = f"This is the question: {query}. This is the context:\n{context}."
    return user_prompt
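
To inspect the prompt format without touching the API or the DataFrame, the retrieval step can be passed in as a function. This is a sketch under assumptions: `build_user_prompt` and `fake_retrieve` are hypothetical names, and the passages are made-up placeholders rather than text from the manual.

```python
def build_user_prompt(query, retrieve, k=5):
    """Combine the question with the passages returned by `retrieve(query, k)`."""
    context = "\n".join(retrieve(query, k))
    # Same template as generate_user_prompt in the notebook.
    return f"This is the question: {query}. This is the context:\n{context}."

def fake_retrieve(query, k):
    """Stand-in for semantic_search: returns fixed placeholder passages."""
    passages = [
        "Placeholder passage A.",
        "Placeholder passage B.",
        "Placeholder passage C.",
    ]
    return passages[:k]

prompt = build_user_prompt("How do I make a blind hem?", fake_retrieve, k=2)
print(prompt)
```

Printing the result shows exactly what the language model receives, which makes it easier to spot formatting quirks such as the period the template appends after a question mark.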


### Generate Response

Send the constructed prompt and system instructions to the GenAI language model and receive a generated answer.

In [15]:
def generate_response(system_prompt, user_message, model="gemini-2.0-flash"):
    response = client.models.generate_content(
        model=model,
        config=types.GenerateContentConfig(system_instruction=system_prompt),
        contents=[user_message]
    )
    return response


### Run a Test Query

Test the entire RAG pipeline with a sample question about the sewing machine manual.

In [20]:
query = "How do I make a blind hem?"
user_prompt = generate_user_prompt(query)
response = generate_response(system_prompt, user_prompt)

print(f"Question: {query}")
print("Answer:")
print(response.text)


Question: How do I make a blind hem?
Answer:
To make a blind hem, follow these steps based on the provided context:

1.  **Select Blind Hem Stitch:**
    *   Set the Pattern Selector Dial to the blind hem setting, indicated by "M" in the diagram. The machine has different settings for firm and stretch fabrics.

2.  **Adjust Stitch Settings:**
    *   Set the Stitch Length Dial within the range shown in the diagram. Blind hems typically use a longer stitch length setting.
    *   Set the Stitch Width Dial appropriately for the fabric weight. Use a narrower stitch for lighter fabrics and a wider stitch for heavier fabrics. Test on a fabric scrap first.

3.  **Prepare the Fabric:**
    *   Turn up the hem to the desired width and press it.
    *   Fold back the hem against the right side of the fabric, leaving the top edge of the hem extending about 7 mm (1/4 inch) to the right side of the folded fabric (as shown in Fig. 1).

4.  **Sew the Hem:**
    *   Start sewing slowly on the fold.
 

### Save and Load Embeddings


In [None]:
# Save sewing machine manual embeddings
df.write_parquet("embeddings.parquet")

**Discussion**

The chatbot can be used in practice for exactly the purpose it was built for: quickly locating information in a manual when you sit down with a new sewing machine. This applies to individuals, but also to companies, where staff who move to a new machine need to learn its settings quickly.

A potential challenge is that it is built for one specific machine model. This is also its strength, but it means that a new machine model requires a new chatbot. That can be a challenge for a company with many different machines.

One possible improvement, which I also experimented with, is to include the images from the manual in the answers. That would make the responses even more helpful and clear.