The purpose of this notebook is to build and test each element of the basic chatbot. I will create a vector store from the PDF manual, query it, and use the results as context for questions sent to GPT-5.

In [1]:
#imports
import getpass
import os
import re

from textwrap import dedent

from langchain.chat_models import init_chat_model
from langchain_community.document_loaders import PyPDFLoader
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.vectorstores import InMemoryVectorStore
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter

We simply need to define the filepath and create a PyPDFLoader object from it. From there, we can load the document. The result is saved to a variable called "pages" because the loader creates one LangChain Document object per page. We see there are 134 pages in the document.

In [2]:
filepath = 'production_planning_scheduling.pdf'
loader = PyPDFLoader(filepath)

pages = loader.load()
print(len(pages))

134


Not all pages were created equal. The first five pages comprise the index. These will not help pull useful context for our model! After a brief skim of the document, the rest of the pages look like they may provide useful context to answer questions. It will be interesting to see how some of the information (the tables at the end, for example) is read by the PyPDFLoader and if it can serve as useful context.

In [3]:
pages = pages[5:]
print(len(pages))

129


Let's take a look at one of the Document objects. It contains the following bits of information:
1. The source document
2. Total number of pages
3. The page label (this is the page number as printed in the document; the separate "page" field is a zero-based index)
4. The page content (stored alongside the metadata rather than inside it)

Some of this information will prove useful for the project. In addition to the page content, the goal is to provide the page number to the LLM, so the context can be cited directly from the text. It allows the user the opportunity to verify the output from the model and read directly from the manual to gather additional information about the task. For example, there are useful images in the PDF that are not captured in the text. A reference to the page number will direct users to relevant images.

In [4]:
pages[10]

Document(metadata={'producer': 'Skia/PDF m139', 'creator': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/139.0.0.0 Safari/537.36', 'creationdate': '2025-08-22T03:05:14+00:00', 'title': 'Help', 'moddate': '2025-08-22T03:05:14+00:00', 'source': 'production_planning_scheduling.pdf', 'total_pages': 134, 'page': 15, 'page_label': '16'}, page_content='2.4      Production Planning Access Window\nThe Order Item Releases that are on Planning Review are displayed on the Production Planning screen based on the following Accesswindow: \n Production Planning Access Window\n \n2.4.1    Display Sequence\nBy default, the Order Releases are sequenced by the sold product.  However the Display Sequence option provides the ability to sequenceby any of the following options • Route• Sales Order• Sold Product\n8/21/25, 10:05 PM PRODUCTION PLANNING AND SCHEDULING\nhttps://infohub.invera.com/HelpViewer.html?#HelpViewer1755829302616 16/134')

In the page content, there are newline characters and excess whitespace. Let's remove that.

In [5]:
regexp = r'\s+'

for page in pages:
    old_content = page.page_content
    new_content = old_content.strip().replace('\n', ' ')
    new_content = re.sub(regexp, ' ', new_content)
    page.page_content = new_content

print(pages[10])    

page_content='2.4 Production Planning Access Window The Order Item Releases that are on Planning Review are displayed on the Production Planning screen based on the following Accesswindow: Production Planning Access Window 2.4.1 Display Sequence By default, the Order Releases are sequenced by the sold product. However the Display Sequence option provides the ability to sequenceby any of the following options • Route• Sales Order• Sold Product 8/21/25, 10:05 PM PRODUCTION PLANNING AND SCHEDULING https://infohub.invera.com/HelpViewer.html?#HelpViewer1755829302616 16/134' metadata={'producer': 'Skia/PDF m139', 'creator': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/139.0.0.0 Safari/537.36', 'creationdate': '2025-08-22T03:05:14+00:00', 'title': 'Help', 'moddate': '2025-08-22T03:05:14+00:00', 'source': 'production_planning_scheduling.pdf', 'total_pages': 134, 'page': 15, 'page_label': '16'}
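The same cleanup can be wrapped in a small function so it can be reused outside this notebook. This is a minimal sketch: the name normalize_whitespace is my own, and it relies on the fact that the `\s+` pattern already matches newlines, so a single substitution followed by a strip gives the same result as the three steps above.

```python
import re


def normalize_whitespace(text: str) -> str:
    """Collapse every run of whitespace (spaces, tabs, newlines) into one space."""
    return re.sub(r'\s+', ' ', text).strip()


# Sample text shaped like the raw PyPDFLoader output shown earlier.
raw = '2.4      Production Planning Access Window\nThe Order Item Releases \n \n2.4.1    Display Sequence'
clean = normalize_whitespace(raw)
print(clean)
```

Running the function twice on the same text changes nothing, so it is safe to apply even if a page has already been cleaned.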


While some of these pages may be short on text content, others contain a great deal of text. We need to split the text into chunks of bounded size. We will use the recursive text splitter. It tries a list of separators in order (paragraph breaks, then line breaks, then spaces) until each chunk fits within the target length. Because the newlines were removed in the previous step, it will mostly fall back to splitting on spaces here. There will be some overlap between consecutive chunks, in case the cutoff is in the middle of an important explanation. The chunk size and the overlap are two "hyperparameters" that can be tuned to improve the performance of the vector store. For now, we will try the values provided by the tutorial.

In [6]:
splitter = RecursiveCharacterTextSplitter(
    chunk_size=1_000, chunk_overlap=200, add_start_index=True
)

splits = splitter.split_documents(pages)
print(len(splits))

462
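To make the two hyperparameters concrete, here is a simplified sketch of how chunk size and overlap interact. It is not the recursive splitter's algorithm: it uses a fixed-width sliding window and ignores separators entirely. The function name chunk_with_overlap and the synthetic text are assumptions for illustration only.

```python
def chunk_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[tuple[int, str]]:
    """Return (start_index, chunk) pairs from a fixed-width sliding window."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError('chunk_overlap must be non-negative and smaller than chunk_size')
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append((start, text[start:start + chunk_size]))
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


# Synthetic 2,500-character text, split with the same values used above.
text = ''.join(str(i % 10) for i in range(2_500))
chunks = chunk_with_overlap(text, chunk_size=1_000, chunk_overlap=200)
for start, chunk in chunks:
    print(start, len(chunk))
```

Each window begins 800 characters after the previous one, so the last 200 characters of one chunk are repeated at the start of the next. The start offset recorded here plays a similar role to what add_start_index=True stores in each fragment's metadata.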


These document fragments still contain the important metadata we need for the queries. Now we need to create an embeddings object that references an OpenAI embedding model and generates vectors from the text. The embedding model is used to create the vector store object and will embed all text fragments for searching.

In [7]:
os.environ['OPENAI_API_KEY'] = getpass.getpass('Enter OpenAI API key: ')

embeddings = OpenAIEmbeddings(model='text-embedding-3-large')
vector_store = InMemoryVectorStore(embeddings)
ids = vector_store.add_documents(documents=splits)
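Conceptually, the store keeps one vector per fragment and, when searched, ranks those vectors against the embedded query. The sketch below shows that ranking under the assumption that the score is cosine similarity. The three-dimensional vectors and the helper names are made up for illustration; real embeddings have far more dimensions and come from the embedding model.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, store, k=2):
    """Return the k (text, score) pairs with the highest similarity to the query."""
    scored = [(text, cosine_similarity(query_vec, vec)) for text, vec in store]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy store: (fragment label, made-up embedding vector).
toy_store = [
    ('job set attributes', [0.9, 0.1, 0.0]),
    ('processing plan action buttons', [0.1, 0.9, 0.1]),
    ('transport activities', [0.5, 0.2, 0.7]),
]
results = top_k([1.0, 0.0, 0.0], toy_store, k=2)
for text, score in results:
    print(f'{score:.3f}  {text}')
```

A score closer to 1 means the fragment points in nearly the same direction as the query, which is how a similarity score can be read when comparing results.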

Now that the vector store exists, we can query it for the closest match(es). Let's ask it some questions with answers we can easily find in the text. By default, the search will return 4 documents. Let's cut that in half to 2. I will also return the similarity score to see how closely the context matches the query.

The first question should be found on page label 6 and be a list including job set number and planning warehouse.

In [8]:
result1 = vector_store.similarity_search_with_score(
    'What job set attributes are expected to be included with each job?',
    2
)

print(f'First file similarity score: {result1[0][1]}')
print(f'First file content: {result1[0][0]}')
print()
print(f'Second file similarity score: {result1[1][1]}')
print(f'Second file content: {result1[1][0]}')

First file similarity score: 0.5805875729737856
First file content: page_content='Jobs that represent the Job Set Routing. The Job Set routing can be different from that of the planned Order Item Releasesif the Planner chooses to alter slightly the processing routes.• Transport activities to initiate the transfer of material based on the Job Activity warehouses and the Shipping warehouse.• Transport activities to initiate the shipment to the customer if the Planned Order Item Release is a Normal, Blanket Release, ContractOrder or Toll Order.• Reservations which can be Specific, Non-Specific or Incoming. 1.3.2 Job Set Attributes The Job Set has the following attributes:• A unique Job Set number• A Planning Warehouse• An Expected Starting Date• Customer Owned designation• Entry Measure (English/Metric)• Job Set Status 1.3.3 Job Set Models A Job Set can vary from a simple one/multiple Jobs to fulfill one Order Item Release, to a more complex Job Set which includedone/multiple Jobs that ar

Bingo. Let's amp up the difficulty immediately because that worked so well. The next question is: "What does the reclassify action button do for processing planning?" The answer is found in a table on page 75.

In [9]:
result2 = vector_store.similarity_search_with_score(
    'What does the reclassify action button do for processing planning?',
    2
)

print(f'First file similarity score: {result2[0][1]}')
print(f'First file content: {result2[0][0]}')
print()
print(f'Second file similarity score: {result2[1][1]}')
print(f'Second file content: {result2[1][0]}')

First file similarity score: 0.5906416177143428
First file content: page_content='(PNL-PRFL-WHS set on ipypsh) Processing Plan Action ButtonsThe following action buttons are applicable to the Processing Plan Grid: Action ButtonsAction Enter Product • Calls the Product Item Entry Window to allow the user to enter to enter theReturn to Stock Product Item.• Only enabled if a Planned Consumption was entered Reclassify • Only available for Planned production lines which have an assignment to‘Stock’. Displays ‘Reclassification’ window whereby the user can change theInventory Type of the stock. Allowable Inventory types are:o Dropo Mastero Reject Scrap • Creates a Planned Scrap record for the Proof. • Only enabled if the following conditions are satisfied:o Planned Consumption Weight > then the total Planned Production Weight,ando No Scrap record exists on the Processing Plan • When selected, a Planned Production record is automatically created asfollows:o Product = Scrap Product for the Plan

As you can see, the first piece of context is correctly found on page 75. The vector store looks adequate for creating the MVP of this project. Let's make sure we can use this context correctly to get a response from GPT-5. We will ask a new question about information found on page 75 and put the pieces together.

In [10]:
system_template = dedent('''\
                         Your job is to answer user questions based on information from a manual.
                         Draw directly from the provided context to craft your answer. At the end of
                         your response, cite the page number of the content you used.
                            
                         Page number: {p1}
                         Context: {c1}
                            
                         Page number: {p2}
                         Context: {c2}''').replace('\n', ' ')

prompt_template = ChatPromptTemplate(
    [('system', system_template), ('user', '{query}')]
)

In [20]:
user_query = 'What are some of the available processing plan action buttons and what do they do?'

context1, context2 = vector_store.similarity_search(
    user_query,
    2
)

prompt = prompt_template.invoke({
    'p1': context1.metadata['page_label'],
    'c1': context1.page_content,
    'p2': context2.metadata['page_label'],
    'c2': context2.page_content,
    'query': user_query
})

print(prompt)

messages=[SystemMessage(content='Your job is to answer user questions based on information from a manual. Draw directly from the provided context to craft your answer. At the end of your response, cite the page number of the content you used.  Page number: 75 Context: (PNL-PRFL-WHS set on ipypsh) Processing Plan Action ButtonsThe following action buttons are applicable to the Processing Plan Grid: Action ButtonsAction Enter Product • Calls the Product Item Entry Window to allow the user to enter to enter theReturn to Stock Product Item.• Only enabled if a Planned Consumption was entered Reclassify • Only available for Planned production lines which have an assignment to‘Stock’. Displays ‘Reclassification’ window whereby the user can change theInventory Type of the stock. Allowable Inventory types are:o Dropo Mastero Reject Scrap • Creates a Planned Scrap record for the Proof. • Only enabled if the following conditions are satisfied:o Planned Consumption Weight > then the total Planned 
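The template above hard-codes two context slots, so it only works when exactly two fragments are retrieved. One possible alternative, sketched here, is to render all retrieved fragments into a single string that could fill one {context} placeholder. The RetrievedChunk class is a hypothetical stand-in for a LangChain Document (it only mirrors the page_content and metadata attributes), and format_context is my own helper name.

```python
from dataclasses import dataclass, field


@dataclass
class RetrievedChunk:
    """Hypothetical stand-in for a LangChain Document (same two attributes)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def format_context(chunks) -> str:
    """Render each chunk as a 'Page number / Context' block, one per chunk."""
    blocks = []
    for chunk in chunks:
        # Fall back to 'unknown' if a chunk has no page label in its metadata.
        page = chunk.metadata.get('page_label', 'unknown')
        blocks.append(f'Page number: {page}\nContext: {chunk.page_content}')
    return '\n\n'.join(blocks)


# Sample chunks with text shortened from the manual excerpts shown earlier.
chunks = [
    RetrievedChunk('Reclassify: Displays Reclassification window.', {'page_label': '75'}),
    RetrievedChunk('The Job Set has the following attributes: A unique Job Set number', {'page_label': '6'}),
]
context = format_context(chunks)
print(context)
```

With this shape, changing the number of retrieved fragments would only require changing the search call, not the prompt template.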

In [21]:
model = init_chat_model('gpt-5-nano', model_provider='openai')

response = model.invoke(prompt)

print(response.content)

Here are some of the processing plan action buttons and what they do:

- Enter Product: Calls the Product Item Entry Window to enter a Return to Stock Product Item. This button is enabled only if a Planned Consumption has been entered.

- Reclassify: Available for planned production lines assigned to Stock. Opens a Reclassification window to change the Inventory Type of the stock. Allowable types are: Drop, Master, Reject, Scrap.

- Scrap: Creates a Planned Scrap record for the Proof. Enabled only if Planned Consumption Weight > Total Planned Production Weight and there is no existing Scrap record on the Processing Plan. When used, a Planned Production record is automatically created with Product = Scrap Product for the Planned Consumption and Weight = Planned Consumption – Total Planned Production.

- Compete (likely “Complete”): Closes the Job maintenance form and continues automatic generation of the Processing Plans for the Job Set, starting from where it left off. It may be highli

It's not perfect, but for a project due in a couple of days, it's good enough. The next priority is to create a FastAPI backend using the work from this notebook. There is sufficient work here to create a stateless chatbot that pulls answers from the operation manual.

Finally, I need to save my vector store to a file so I can load it into the backend.

In [24]:
filepath = './backend/vs.json'
vector_store.dump(filepath)