# Part 2: Prepare text documents for stamp value information queries

## Step1: Upload the demonstration dataset

Upload the "2012.pdf" file created in Part 1 to the '/content' directory of your Colab workspace.

## Step2: Define utility functions  
**Note:** These functions are identical to those used in Part 1.

In [2]:
#print_progress(): print a progress bar
#e.g. [██████████████████████████████--------------------] 60.67%
def print_progress(cur_data, total_data):
    total_bar = 50   #set total bar size to be 50
    cur_percent = (cur_data / total_data) * 100
    cur_bar = int((cur_data / total_data) * total_bar)
    cur_bar_display = '█' * cur_bar + '-' * (total_bar - cur_bar)
    print(f'\r[{cur_bar_display}] {cur_percent:.2f}%', end='\n' if cur_percent == 100 else '') #1)\r: return to linehead 2)end='': not print a new line

#calculate the cost of API calls based on the number of tokens used
def calculate_token_cost(model, input_token_count=0, output_token_count=0):
    #default price per 1 million tokens
    input_token_price = 10.0
    output_token_price = 30.0
    #adjust prices based on model type
    if model == 'gpt-4-turbo-2024-04-09':
        pass  # Uses default prices
    elif model == 'text-embedding-ada-002-v2':
        input_token_price = 0.1
        output_token_price = 0.0
    cost = (input_token_count / 1_000_000 * input_token_price +
            output_token_count / 1_000_000 * output_token_price)
    return round(cost, 2)
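
To make the pricing arithmetic concrete, here is a minimal worked example. It copies the default per-million-token prices from `calculate_token_cost()`; the helper name `estimate_cost` and the token counts are illustrative assumptions, not part of the original notebook.

```python
# Prices per 1 million tokens, copied from the defaults in calculate_token_cost().
INPUT_PRICE_PER_M = 10.0
OUTPUT_PRICE_PER_M = 30.0

def estimate_cost(input_tokens, output_tokens):
    """Return the unrounded cost in USD (hypothetical helper for illustration)."""
    return (input_tokens / 1_000_000 * INPUT_PRICE_PER_M
            + output_tokens / 1_000_000 * OUTPUT_PRICE_PER_M)

# Illustrative token counts: 1,932 prompt tokens and 20 completion tokens.
cost = estimate_cost(1_932, 20)
rounded_cost = round(cost, 2)
print(f'unrounded: ${cost:.5f}, rounded: ${rounded_cost}')
```

Because the result is rounded to two decimals, very small calls are reported as zero cost. For example, 1,000 embedding tokens at $0.1 per million come to $0.0001, which rounds to $0.0.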

## Step3: Convert the demonstration pdf into text files


The demonstration PDF contains stamp value information. We convert it into one text file per page, replacing control characters and line breaks with spaces along the way. In the next step, the LLM then only needs to process a single page's text file to extract the requested stamp value information.
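
As a small illustration of the character handling described above, the sketch below applies the same kind of substitution to a made-up string. The helper name `clean_page_text` and the sample text are assumptions for demonstration only.

```python
import re

def clean_page_text(raw_text):
    """Replace every ASCII control character (0x00-0x1F and 0x7F) with a space.

    Newline, carriage return, and tab all fall inside that range, so they
    are replaced as well.
    """
    return re.sub(r'[\x00-\x1F\x7F]', ' ', raw_text)

# Made-up sample resembling raw text extracted from a PDF page.
sample = "Scott #\nMint\tUsed\r\n4603  65¢ Baltimore"
cleaned = clean_page_text(sample)
print(cleaned)
```

Each control character becomes exactly one space, so a `\r\n` pair turns into two spaces. Non-ASCII characters such as `¢` are left untouched.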

In [3]:
!pip -q install PyMuPDF

[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m3.5/3.5 MB[0m [31m22.7 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m30.8/30.8 MB[0m [31m36.2 MB/s[0m eta [36m0:00:00[0m
[?25h

In [4]:
#convert_pdf_to_txt_files(): converts a pdf file into separate text files, one for each page
import fitz
import os
import re
def convert_pdf_to_txt_files(pdf_path):
    # store text files in the 'txt' folder under the current working directory
    txt_dir_path = 'txt'
    os.makedirs(txt_dir_path, exist_ok=True)   # create the folder if it does not exist yet
    doc = fitz.open(pdf_path)
    for page_idx, page in enumerate(doc):  # iterate through each page of the pdf
        page_text = page.get_text()
        page_text = re.sub(r'[\x00-\x1F\x7F\n\r]', ' ', page_text)   #replace control characters and line breaks with space
        txt_file_path = os.path.join(txt_dir_path, f"p{page_idx}.txt")
        with open(txt_file_path, "w", encoding="utf-8") as txt_file:
            txt_file.write(page_text)
        print_progress(page_idx + 1, len(doc))
    doc.close()

In [5]:
#convert the demo dataset -- 2012.pdf into text files and save them in the '/txt' folder
convert_pdf_to_txt_files('./2012.pdf')


[████████████████----------------------------------] 33.33%[█████████████████████████████████-----------------] 66.67%[██████████████████████████████████████████████████] 100.00%


## Step4: Extract stamp value info from text files using OpenAI's LLM model

1. The text files created above are unstructured, yet they contain stamp value information typically formatted as follows:   
`scott_number, face_value, stamp_name,  mint_value, used_value `  
For instance:  
`4603  65¢ Baltimore Checkerspot Butterfly..............................  $7.50  $2.95`
2. We will call the OpenAI API and use a GPT model to extract stamp value information from the text files. A 'system template' and a 'user template' guide the model toward extracting this information accurately.

**Note:** As of April 2024, the latest LLM from OpenAI is "gpt-4-turbo-2024-04-09".
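
To make the record layout concrete, the sketch below pulls the five fields out of the sample line with a regular expression. This is not the extraction method used in this notebook (that is the LLM call); the pattern `RECORD_PATTERN` is a hypothetical illustration that assumes a single well-formed line with dot leaders between the name and the values.

```python
import re

# Hypothetical pattern for one well-formed catalogue line. Real pages have
# irregular spacing and layouts that this sketch does not try to cover.
RECORD_PATTERN = re.compile(
    r'^(?P<scott_number>\d+(?:-\d+)?)\.?\s+'   # Scott number or set range, e.g. 4603 or 4604-07
    r'(?P<face_value>\S+)\s+'                  # face value, e.g. 65¢
    r'(?P<stamp_name>.+?)\s*\.{2,}\s*'         # stamp name followed by dot leaders
    r'(?P<mint_value>\$[\d.]+|-)\s+'           # mint value, or '-' when missing
    r'(?P<used_value>\$[\d.]+|-)$'             # used value, or '-' when missing
)

line = "4603  65¢ Baltimore Checkerspot Butterfly..............................  $7.50  $2.95"
match = RECORD_PATTERN.match(line)
record = match.groupdict() if match else None
print(record)
```

A pattern like this breaks as soon as a line deviates from the expected layout, which is the motivation for handing the unstructured page text to an LLM instead.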

In [6]:
!pip -q install langchain langchain-openai

[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m817.7/817.7 kB[0m [31m12.5 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m1.9/1.9 MB[0m [31m47.2 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m299.3/299.3 kB[0m [31m27.8 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m116.0/116.0 kB[0m [31m15.0 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m311.6/311.6 kB[0m [31m27.7 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m1.8/1.8 MB[0m [31m59.7 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m49.4/49.4 kB[0m [31m5.7 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m53.0/53.0 kB[0m [31m5.9 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━

In [7]:
#use langchain to load text files
from langchain.document_loaders import DirectoryLoader
from langchain.document_loaders import TextLoader
loader = DirectoryLoader('./txt/', glob="*.txt", loader_cls=TextLoader)
docs = loader.load()
docs_sorted = sorted(docs, key=lambda item:item.metadata['source'])    #sort documents for easier subsequent testing
for doc in docs_sorted:
    print(doc.metadata['source'], doc.page_content[:80])

txt/p0.txt Scott #   Mint  Used 4603  65¢ Baltimore Checkerspot Butterfly..................
txt/p1.txt Scott #   Mint  Used 4653  45¢ William H. Johnson. .....  $3.25  $.35  4654-63 .
txt/p2.txt Scott #   Mint  Used 4698-4701.  45¢ Innovative Choreographers  4 Stamps. ......
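
Sorting by the `source` string works for this three-page demo, but string order is lexicographic, so `p10.txt` would sort before `p2.txt` in a longer PDF. A minimal sketch of a numeric sort key follows; the helper name `page_index` and the sample paths are assumptions, and the pattern relies on the `p{page_idx}.txt` naming used in Step3.

```python
import re

def page_index(source_path):
    """Extract the integer page index from a path such as 'txt/p12.txt'.

    Returns -1 when the path does not follow the p<number>.txt naming.
    """
    match = re.search(r'p(\d+)\.txt$', source_path)
    return int(match.group(1)) if match else -1

# Made-up paths for a PDF with more than ten pages.
paths = ['txt/p10.txt', 'txt/p2.txt', 'txt/p0.txt', 'txt/p1.txt']
lexicographic = sorted(paths)
numeric = sorted(paths, key=page_index)
print(lexicographic)
print(numeric)
```

In the notebook, the same key could be applied as `sorted(docs, key=lambda item: page_index(item.metadata['source']))`.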


In [8]:
#read OpenAI API key from Colab's secrets
import os
from google.colab import userdata
os.environ['OPENAI_API_KEY'] = userdata.get('OPENAI_API_KEY')
api_key = os.environ['OPENAI_API_KEY']

In [9]:
#extract_stamp_value_info(): extract stamp value information based on a user query, using the OpenAI API with the gpt-4-turbo-2024-04-09 model
from langchain_core.prompts.chat import SystemMessagePromptTemplate, HumanMessagePromptTemplate, ChatPromptTemplate
from langchain_openai import ChatOpenAI

def extract_stamp_value_info(user_query, doc_content):

    openai_llm = ChatOpenAI(temperature=0, model_name='gpt-4-turbo-2024-04-09')

    system_template = SystemMessagePromptTemplate.from_template(
        "Assume the role of a helpful assistant with expertise in stamp collecting."
    )

    user_template = HumanMessagePromptTemplate.from_template(
        """
        {user_query}
        Please use the information in the provided document to determine the value.

        Document Content: {doc_content}

        The document contains information about stamp values in the format:
        "scott_number stamp_face_value stamp_name stamp_mint_value stamp_used_value"

        Note1: If the entry is for a stamp set, the 'scott_number' will contain a '-'.

        Note2: If specific value information is missing, '-' will be used to denote this.

        Structure your response as follows (no additional explanation required):
        "scott_number_or_stamp_set_number; stamp_name; face_value; mint_value; used_value"

        Note3: Use '-' in any fields where the information is not available.
        """
    )

    chat_prompt = ChatPromptTemplate.from_messages([system_template, user_template])

    chat_messages = chat_prompt.format_prompt(user_query=user_query, doc_content=doc_content).to_messages()
    # print(chat_messages)

    response = openai_llm.invoke(chat_messages)   # invoke() replaces the deprecated direct call on the model object
    # print(response)

    return response


## Step5: Perform queries to test stamp value info extraction

In [10]:
import re
#query_single_stamp_value(): extract stamp value information for a given Scott number using an LLM API call
def query_single_stamp_value(scott_num, text_content):
    user_query = f'Determine the value of a stamp whose scott number is {scott_num}.'
    response = extract_stamp_value_info(user_query,  text_content)
    #print API call cost
    input_tokens = response.response_metadata['token_usage']['prompt_tokens']
    output_tokens = response.response_metadata['token_usage']['completion_tokens']
    cost = calculate_token_cost('gpt-4-turbo-2024-04-09', input_tokens, output_tokens)
    print(f'[LLM] token:{input_tokens}(input),{output_tokens}(output) cost:${cost}')
    #print query results
    print("[scott_number_or_stamp_set_number, stamp_name, face_value, mint_value, used_value]")
    print(response.content)
    scott_number, stamp_name, face_value, mint_value, used_value = re.split(r';', response.content)
    return (scott_number, stamp_name, face_value, mint_value, used_value, input_tokens, output_tokens, cost)

#query_all_stamp_values(): extract all stamp value information from the document using an LLM API call
def query_all_stamp_values(text_content):
    user_query = "Determine the value of all stamps in the document."
    response = extract_stamp_value_info(user_query,  text_content)
    #print API call cost
    input_tokens = response.response_metadata['token_usage']['prompt_tokens']
    output_tokens = response.response_metadata['token_usage']['completion_tokens']
    cost = calculate_token_cost('gpt-4-turbo-2024-04-09', input_tokens, output_tokens)
    print(f'[LLM] token:{input_tokens}(input),{output_tokens}(output) cost:${cost}')
    #print query results
    print("[scott_number_or_stamp_set_number, stamp_name, face_value, mint_value, used_value]")
    print(response.content)
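
`query_single_stamp_value()` splits the response on `;` and unpacks exactly five values, so the fields keep their leading spaces and a response with a different number of fields raises a `ValueError`. The sketch below shows one more tolerant way to parse such responses, including multi-line ones. The function name `parse_stamp_records` is a hypothetical addition, not part of the original notebook.

```python
FIELD_NAMES = ['scott_number', 'stamp_name', 'face_value', 'mint_value', 'used_value']

def parse_stamp_records(response_text):
    """Parse 'scott; name; face; mint; used' lines into a list of dicts.

    Lines that do not contain exactly five ';'-separated fields (blank,
    truncated, or otherwise malformed) are skipped instead of raising.
    """
    records = []
    for line in response_text.splitlines():
        fields = [field.strip() for field in line.split(';')]
        if len(fields) != len(FIELD_NAMES):
            continue
        records.append(dict(zip(FIELD_NAMES, fields)))
    return records

# Made-up response text; the last line is deliberately cut off.
sample_response = "4626; Love Ribbons; 45¢; $3.50; $0.35\n4000; -; -; -; -\n4621; "
parsed = parse_stamp_records(sample_response)
for record in parsed:
    print(record)
```

Skipping malformed lines silently is a design choice suited to exploration; a stricter pipeline might log or collect them for review.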


In [11]:
#[Exam]stamp value query1: Query for stamp value information that exists within the document
query_single_stamp_value('4626', docs_sorted[0].page_content)

[LLM] token:1932(input),20(output) cost:$0.02
[scott_number_or_stamp_set_number, stamp_name, face_value, mint_value, used_value]
4626; Love Ribbons; 45¢; $3.50; $0.35


('4626', ' Love Ribbons', ' 45¢', ' $3.50', ' $0.35', 1932, 20, 0.02)

In [12]:
#[Exam]stamp value query2: Query for stamp value information in cases where such information is NOT present in the document.
query_single_stamp_value('4000', docs_sorted[0].page_content)

[LLM] token:1932(input),10(output) cost:$0.02
[scott_number_or_stamp_set_number, stamp_name, face_value, mint_value, used_value]
4000; -; -; -; -


('4000', ' -', ' -', ' -', ' -', 1932, 10, 0.02)

In [13]:
#[Exam]stamp value query3: Query for all stamp value information found in the document.
query_all_stamp_values(docs_sorted[0].page_content)

[LLM] token:1927(input),1245(output) cost:$0.06
[scott_number_or_stamp_set_number, stamp_name, face_value, mint_value, used_value]
4603; Baltimore Checkerspot Butterfly; 65¢; $7.50; $2.95
4604-07; Dogs at Work; 65¢; $13.95; $8.25
4604; Seeing Eye Dog; 65¢; $3.50; $2.10
4605; Therapy Dog; 65¢; $3.50; $2.10
4606; Military Dog; 65¢; $3.50; $2.10
4607; Rescue Dog; 65¢; $3.50; $2.10
4608-12; Birds of Prey; 85¢; $27.95; -
4608; Northern Goshawk; 85¢; $5.75; -
4609; Peregrine Falcon; 85¢; $5.75; -
4610; Golden Eagle; 85¢; $5.75; -
4611; Osprey; 85¢; $5.75; -
4612; Northern Harrier; 85¢; $5.75; -
4613-17; Weathervanes; 45¢; $9.95; $2.25
4613; Brown Rooster; 45¢; $2.00; $0.50
4614; Cow; 45¢; $2.00; $0.50
4615; Eagle; 45¢; $2.00; $0.50
4616; Black Rooster; 45¢; $2.00; $0.50
4617; Centaur; 45¢; $2.00; $0.50
4618-22; Bonsai; 45¢; $24.95; $1.00
4618; Sierra Juniper – Semi-cascade; 45¢; $5.00; $0.35
4619; Black Pine – Formal Upright; 45¢; $5.00; $0.35
4620; Banyan – Cascade; 45¢; $5.00; $0.35
4621; 

## Step6: Backup generated artifacts

1. Artifacts for Part 3:

   - **/txt**: The folder storing unstructured text files containing stamp value information

In [14]:
#pack generated artifacts into a zip file
!zip -r data_part2.zip txt  > /dev/null
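
Outside Colab, the `zip` command may not be available. As a hedged alternative, the standard library can build the same kind of archive. The sketch below creates a throwaway `txt` folder in a temporary directory so it runs anywhere; in the notebook the folder already exists, and the temporary paths here are assumptions for demonstration.

```python
import os
import shutil
import tempfile
import zipfile

# Build a throwaway 'txt' folder so the sketch is self-contained.
work_dir = tempfile.mkdtemp()
os.makedirs(os.path.join(work_dir, 'txt'))
with open(os.path.join(work_dir, 'txt', 'p0.txt'), 'w', encoding='utf-8') as txt_file:
    txt_file.write('Scott #   Mint  Used')

# Archive the 'txt' folder, keeping 'txt/' as the top-level entry in the zip.
archive_path = shutil.make_archive(
    os.path.join(work_dir, 'data_part2'), 'zip', root_dir=work_dir, base_dir='txt'
)

# List the archive contents to confirm the folder structure was preserved.
with zipfile.ZipFile(archive_path) as archive:
    archive_names = archive.namelist()
print(archive_path)
print(archive_names)
```

In the notebook's working directory, the equivalent call would be `shutil.make_archive('data_part2', 'zip', root_dir='.', base_dir='txt')`.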

In [15]:
#download the zip file (~8k) for future use
from google.colab import files
files.download('data_part2.zip')
