##### Boring import stuff

In [2]:
import logging
import boto3
from botocore.exceptions import ClientError
from botocore.exceptions import NoCredentialsError
import os
import json
from datetime import datetime
import numpy as np
import pandas as pd
import re
import time

# Summary

This notebook defines a pipeline that uploads receipts to AWS S3 and then processes the stored objects with AWS Textract Analyze Expense. The Textract output is condensed to strip excess data, leaving only the extracted text in a simplified JSON format. The condensed data is saved to a DataFrame and exported to a spreadsheet (.csv) for use in the next steps. Storing the condensed output avoids re-running Textract on the same receipts, which saves both time and money.

The condensed output is then parsed using key checks and regex to extract the receipt's date and total amount, which are written back to the DataFrame for accuracy verification. A prompt is generated from the condensed output and used to invoke foundation models on AWS Bedrock. This notebook evaluates three models: Llama 3.1 8B Instruct, Amazon Titan Text Express, and Llama 3.1 70B Instruct. The results, including extracted fields and predicted categories, are saved in a DataFrame and stored in spreadsheets. 
The Llama models were given the prompt that specifies System and Instruction fields; the Amazon Titan model was given the plain-text prompt.

# First set of functions

The first set of operations reads a .csv file into a dataframe. The file must have the following format:

| **Column Name**   | **Description**                                                                 | **Data Type** | **Example Value**                           |
|-------------------|---------------------------------------------------------------------------------|---------------|---------------------------------------------|
| `receipt_extract` | Raw extracted text or data from the receipt.                                    | `string`      | Uh, these are really big, so no example :)  |
| `object_name`     | File name of the receipt image, also used as the S3 object name.                | `string`      | "Supplies40.jpg"                            |
| `date`            | Transaction date.                                                               | `string`      | "09/10/1900"                                |
| `subtotal`        | The receipt subtotal, typically excluding taxes or discounts.                   | `float`       | 10.00                                       |
| `total`           | The receipt total, including taxes or discounts if applicable.                  | `float`       | 12.00                                       |
| `category`        | The manually defined category to which the receipt belongs.                     | `string`      | "Meals"                                     |

Each value in object_name must match a jpg file of the same name saved in data/, and is also the name of the object that is stored (or to be stored) on S3 in bucket_name.
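
A quick pre-flight check can catch missing images before anything is uploaded. The sketch below is not part of the pipeline: `find_missing_images` and its `data_dir` argument are hypothetical names, and it assumes the images sit in a local `data/` directory, as the upload loop later in this notebook expects.

```python
from pathlib import Path

def find_missing_images(object_names, data_dir="data"):
    """Return the object names that have no matching file in data_dir."""
    base = Path(data_dir)
    return [name for name in object_names if not (base / str(name)).is_file()]

# Assumed usage once the dataframe is loaded (the name `receipts` is defined below):
# missing = find_missing_images(receipts["object_name"], data_dir="data")
# print(missing)
```

An empty list means every row has an image to upload.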

# *** Input bucket and location of .csv here ****

## Run these next few cells to simply import a .csv file into a dataframe.

In [5]:
bucket_name = "test-bucket-cnevares-2024"
filename = r'data/Spreadsheets/receipts_test.csv'

#### Fire up the dataframe

In [6]:
receipts = pd.read_csv(filename)
receipts.head()

Unnamed: 0,receipt_extract,object_name,date,subtotal,total,category,extracted_date,extracted_total,predicted_category
0,"{""NAME"": ""DELTA"", ""INVOICE_RECEIPT_DATE"": ""22S...",Airfare1.jpg,09/22/2024,45.0,45.0,Travel,09/22/2024,45.0,
1,"{""NAME"": ""Alaska Airlines Alaska Fairbanks ANC...",Airfare2.jpg,11/13/2024,545.11,614.69,Travel,11/13/2024,614.69,
2,"{""NAME"": ""National."", ""items"": {""item0"": ""TIME...",CarRental1.jpg,12/15/2024,355.0,505.63,Travel,12/15/2024,505.63,
3,"{""AMOUNT_PAID"": ""$0.00"", ""items"": {""item0"": ""C...",CarRental2.jpg,01/01/1900,173.14,173.14,Travel,01/01/1899,173.14,
4,"{""ADDRESS"": ""BOZEMAN INTL ARPT\n850 GALLATIN F...",CarRental3.jpg,12/20/2024,272.83,319.18,Travel,01/01/1899,319.18,


## Continue on with first set of functions

#### Load up the functions

In [3]:
def upload_file_to_s3(file_name, bucket_name, object_name=None):
    """
    Uploads a file to an S3 bucket.
    
    :param file_name: Path to the file to upload
    :param bucket_name: Name of the S3 bucket
    :param object_name: S3 object name. If not specified, file_name is used
    :return: the put_object response (a dict), or None if the upload failed
    """
    # If S3 object_name was not specified, use file_name
    if object_name is None:
        object_name = file_name

    # Initialize the S3 client
    s3 = boto3.client('s3')

    try:
        with open(file_name, "rb") as file_data: # Upload the FILE CONTENTS, not the file path
            response = s3.put_object(
                Body=file_data,
                Bucket=bucket_name,
                Key=object_name,                # This is what the file will be called in S3
            )
        return response
    except FileNotFoundError:
        print(f"The file {file_name} was not found.")
    except NoCredentialsError:
        print("AWS credentials not available.")
    except ClientError as e:
        # Service-side failures, e.g. the bucket does not exist or access is denied
        print(f"S3 upload failed: {e}")

In [4]:
# Analyze a receipt in an S3 bucket

def analyze_receipt(bucket_name, object_name):
    """
    Queries AWS Textract: Analyze Expense with an object stored on S3
    
    :param bucket_name: Name of the S3 bucket
    :param object_name: S3 object name
    :return: the Textract response (a dict), or None if the request failed
    """
    
    client = boto3.client('textract')

    try:
        response = client.analyze_expense(
            Document = {
                "S3Object": {
                    "Bucket": bucket_name,
                    "Name": object_name
                }
            }
        )
        
        return response
        
    except ClientError as e:
        # Service-side failures, e.g. the object is missing from the bucket
        print(f"Textract could not analyze {object_name}: {e}")
    except NoCredentialsError:
        print("AWS credentials not available.")

In [5]:
def condense_textract(text_extract, exclude = None):
    
    """
    Converts the large json response from Textract into a smaller dictionary.
    Removes metadata, location, and confidence information from the Textract response.
    
    :param text_extract: json return of AWS Textract operation
    :param exclude: list of keys to exclude
    :return: the new dictionary
    
    """
    exclude = exclude or []        # avoid a shared mutable default argument
    condensed_extract = {}
    document = text_extract['ExpenseDocuments'][0]

    for field in document['SummaryFields']:
        key = field['Type']['Text']
        value = field['ValueDetection']['Text']

        if key in exclude:
            continue               # excluded keys are skipped entirely

        if key not in condensed_extract:
            condensed_extract[key] = value
        else:
            # Repeated field types are joined with a space
            condensed_extract[key] += " " + value

        # Add the line item once, right after the first kept summary field, so the
        # key order matches the stored extracts. Only the first line item of the
        # first group is kept.
        groups = document.get('LineItemGroups', [])
        if 'items' not in condensed_extract and groups and len(groups[0]['LineItems']) > 0:
            condensed_extract['items'] = {}
            for j, item_field in enumerate(groups[0]['LineItems'][0]['LineItemExpenseFields']):
                condensed_extract['items']['item'+str(j)] = item_field['ValueDetection']['Text']
    
    return condensed_extract
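
To make the condensed shape concrete, here is a small self-contained sketch. The `sample_response` dict is hand-made and heavily trimmed (a real Analyze Expense response also carries geometry, confidence scores, and metadata), and `condense_summary_fields` is a hypothetical, simplified stand-in that handles only the summary fields.

```python
import json

# Hand-made stand-in for a Textract Analyze Expense response (assumption: trimmed
# to the keys this notebook reads).
sample_response = {
    "ExpenseDocuments": [{
        "SummaryFields": [
            {"Type": {"Text": "NAME"}, "ValueDetection": {"Text": "DELTA"}},
            {"Type": {"Text": "TOTAL"}, "ValueDetection": {"Text": "USD45.00"}},
            {"Type": {"Text": "TOTAL"}, "ValueDetection": {"Text": "45.00"}},
        ],
        "LineItemGroups": [{"LineItems": []}],
    }]
}

def condense_summary_fields(response):
    """Collapse SummaryFields into {type: text}, joining repeated types with a space."""
    condensed = {}
    for field in response["ExpenseDocuments"][0]["SummaryFields"]:
        key = field["Type"]["Text"]
        value = field["ValueDetection"]["Text"]
        condensed[key] = condensed[key] + " " + value if key in condensed else value
    return condensed

condensed = condense_summary_fields(sample_response)
print(json.dumps(condensed))  # {"NAME": "DELTA", "TOTAL": "USD45.00 45.00"}
```

The repeated `TOTAL` field ends up as one space-separated string, which is why the amount extraction later in the notebook scans a string for several numbers.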

    

In [6]:
def post_to_s3_analyze_receipt(dataframe, bucket_name, start = 0):

    """
    Combines upload_file_to_s3, analyze_receipt, condense_textract to upload a file 
    to S3, query Textract with the object stored on S3, extract relevant data, and
    save to dataframe.

    *** Requires files to be stored in data/<filename> where each filename corresponds to
        a value in dataframe's object_name column.        
    
    :param dataframe: A pandas dataframe with column object_name
    :param bucket_name: Name of the S3 bucket
    :param start: int value of which row in dataframe to start operations on, default 0
    :return: Nothing. The dataframe's receipt_extract column is filled in place
    
    """

    for i in range(start, len(dataframe)):
        
        object_name = dataframe.loc[i, 'object_name']
        file_name = 'data/' + str(object_name)

        # Upload to S3
        upload_file_to_s3(file_name, bucket_name, object_name)

        # Use Textract to pull receipt info from S3
        text_extract = analyze_receipt(bucket_name, object_name)
        if text_extract is None:
            # analyze_receipt already printed the reason; rerun later with start = i
            print(f"Index {i}, {object_name} skipped")
            continue

        # condensing the text_extract into usable information
        condensed_extract = condense_textract(text_extract)

        # convert to string format for storage
        dataframe.loc[i, 'receipt_extract'] = json.dumps(condensed_extract) 
        print(f"Index {i}, {object_name} finished")
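
The `start` parameter exists so a run can be resumed partway through. One way to find the resume point automatically is sketched below; `process_rows` and `process_one` are hypothetical names, and the stub stands in for the upload and Textract calls so the example runs without AWS access.

```python
def process_rows(row_ids, process_one):
    """Run process_one on each id and stop at the first failure.

    Returns (next_start, error): next_start is the position to resume from,
    error is the exception that stopped the run (None if everything finished).
    """
    for position, row_id in enumerate(row_ids):
        try:
            process_one(row_id)
        except Exception as exc:  # broad on purpose: any failure pauses the batch
            return position, exc
    return len(row_ids), None

# Stub that fails on one row, standing in for a failed upload or Textract call.
def fake_process(row_id):
    if row_id == "Hotel2.jpg":
        raise RuntimeError("simulated failure")

next_start, error = process_rows(["Airfare1.jpg", "Hotel2.jpg", "Taxi10.jpg"], fake_process)
print(next_start, error)  # 1 simulated failure
```

The returned position can be passed back as `start` once the underlying problem is fixed.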


## STOP

#### Run the function

In [None]:
post_to_s3_analyze_receipt(receipts, bucket_name, start = 0)

##### ^^ Response here

In [None]:
## Done with first set.

1. Your receipt images are now stored in S3. 
2. Your dataframe has a new column: receipt_extract.
     - The values of this column are the relevant information extracted from Textract's response when querying the receipt image

# Second set of functions 

This next set of operations extracts the total and date from the condensed Textract output saved in the receipt_extract column.
### TODO finish documentation of functions

In [7]:
# Function to extract an amount from a string input from Textract

def extract_amt_from_string(s):
    """Return the largest amount of the form 123.45 found in s, or 0.00 if there is none."""
    regex = r'\d+\.\d{2}'   # digits, a decimal point, then exactly two digits
    amounts = re.findall(regex, s)
    if len(amounts) > 0:
        amounts = [np.float64(j).round(2) for j in amounts]
        amount = max(amounts)
        return amount
    else:
        return np.float64(0.00).round(2)
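
The behaviour of the amount regex is easiest to see on a few strings. The sketch below uses plain floats rather than NumPy; `largest_amount` is a hypothetical helper, and `AMOUNT_WITH_COMMAS_RE` is an assumed variant (not used elsewhere in this notebook) that shows one way to handle thousands separators.

```python
import re

AMOUNT_RE = re.compile(r'\d+\.\d{2}')                   # matches the same strings as the pattern above
AMOUNT_WITH_COMMAS_RE = re.compile(r'\d[\d,]*\.\d{2}')  # hypothetical variant allowing commas

def largest_amount(text, pattern=AMOUNT_RE):
    """Return the largest decimal amount found in text, or 0.0 if there is none."""
    amounts = [float(m.replace(',', '')) for m in pattern.findall(text)]
    return max(amounts, default=0.0)

print(largest_amount("$505.63 $0.00"))                    # 505.63
print(largest_amount("$1,234.56"))                        # 234.56 (the comma splits the number)
print(largest_amount("$1,234.56", AMOUNT_WITH_COMMAS_RE)) # 1234.56
```

Taking the maximum works for fields like `"$505.63 $0.00"`, but a total above 999.99 written with a comma would be truncated by the simpler pattern.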


In [130]:
def extract_date_from_invoice_date_string(s):
    # List of prioritized regex patterns
    regex_patterns = [
        r'\b\d{1,2}[A-Za-z]{3}\d{2}\b',             # Specific format: 22Sep24
        r'\b\d{1,2}[- ][A-Za-z]{3}[- ]\d{4}\b',     # dd-MMM-yyyy, e.g., 14-Dec-2024
        r'\b[A-Za-z]+\s+\d{1,2}\s+\d{4}\b',         # Full month name with day and year, e.g., September 4  2024
        r'\b[A-Za-z]{3}\s+\d{1,2},?\s+\d{4}\b',     # Abbreviated month name with day and year, e.g., Sep 4, 2024
        r'\b[A-Za-z]{3}\s+\d{1,2}\b',               # Abbreviated month name with day, e.g., Sep 4
        r'\b\d{4}-\d{1,2}-\d{1,2}\b',               # yyyy-mm-dd
        r'\b\d{1,2}[-/]\d{1,2}[-/]\d{2,4}\b',       # mm/dd/yy, mm/dd/yyyy, mm-dd-yy, mm-dd-yyyy
        r'\b\d{1,2}-\d{1,2}\b',                     # mm-dd
        r'\b\d{1,2}/\d{1,2}\b',                     # mm/dd
    ]
    
    # Try each regex pattern in order
    for pattern in regex_patterns:
        matches = re.findall(pattern, s)
        if matches:
            return matches[-1].strip()
    
    # Return empty string if no matches are found
    return ""
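
Two details of this function are worth seeing in action: patterns are tried in priority order, and within the first pattern that matches, the last match wins. The sketch below uses only two of the patterns from the list above; `first_pattern_last_match` is a hypothetical name.

```python
import re

# Two of the patterns from the list above, in the same relative order.
PATTERNS = [
    r'\b\d{1,2}[A-Za-z]{3}\d{2}\b',        # 22Sep24
    r'\b\d{1,2}[-/]\d{1,2}[-/]\d{2,4}\b',  # mm/dd/yy, mm/dd/yyyy, mm-dd-yy, mm-dd-yyyy
]

def first_pattern_last_match(text, patterns=PATTERNS):
    """Return the last match of the first pattern that matches, else an empty string."""
    for pattern in patterns:
        matches = re.findall(pattern, text)
        if matches:
            return matches[-1]
    return ""

print(first_pattern_last_match("12/10/2024 - 12/15/2024"))  # 12/15/2024
print(first_pattern_last_match("22Sep24 09/22/2024"))       # 22Sep24
```

For a rental period such as `12/10/2024 - 12/15/2024`, the last-match rule picks the return date.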


In [110]:
def extract_date_from_full_string(s):

    # mm/dd/yy, mm/dd/yyyy, mm-dd-yy, mm-dd-yyyy (the second pair of year digits is optional)
    regex = r'\b\d{1,2}[-/]\d{1,2}[-/]\d{2}(?:\d{2})?\b'
    matches = re.findall(regex, s)
    if matches:
        # Return the last match found
        return matches[-1].strip()
    
    # Return an empty string if no matches are found, which is what the caller checks for
    return ""


In [141]:
# We'll use this to convert whatever date Textract retrieved into a string in mm/dd/yyyy format.

def reformat_date(date_string):
    # List of potential input formats. strptime already accepts values without
    # zero padding, so no separate "%-d" / "%-m" variants are needed.
    input_formats = ["%m/%d/%y", "%m/%d/%Y", "%B %d %Y", '%m-%d-%y', '%m-%d-%Y',
                     "%b %d %Y", '%a %b %d', '%d%b%y', '%d-%b-%Y', '%m/%d', "%Y-%m-%d", "%m-%d", '%b %d', '%d %b %Y'
    ]
    
    # Try parsing with each format
    for fmt in input_formats:
        try:
            date_object = datetime.strptime(date_string, fmt)
            break
        except ValueError:
            continue
    else:
        raise ValueError(f"Date format not recognized: {date_string}")
    
    # Formats without a year parse as 1900; assume those receipts are from 2024
    if date_object.year == 1900:
        date_object = date_object.replace(year = 2024)
        
    # Format to "mm/dd/yyyy" and return the string
    return date_object.strftime("%m/%d/%Y")
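
The try-each-format approach can be checked on a few inputs. This is a minimal sketch with a subset of the formats; `to_mmddyyyy` is a hypothetical name, and month abbreviations are assumed to be parsed under an English locale.

```python
from datetime import datetime

FORMATS = ["%m/%d/%Y", "%m-%d-%Y", "%d%b%y", "%d-%b-%Y", "%Y-%m-%d"]  # subset of the list above

def to_mmddyyyy(date_string, formats=FORMATS):
    """Parse date_string with the first format that fits and return mm/dd/yyyy."""
    for fmt in formats:
        try:
            return datetime.strptime(date_string, fmt).strftime("%m/%d/%Y")
        except ValueError:
            continue
    raise ValueError(f"Date format not recognized: {date_string}")

print(to_mmddyyyy("22Sep24"))      # 09/22/2024
print(to_mmddyyyy("9/4/2024"))     # 09/04/2024 (no zero padding needed on input)
print(to_mmddyyyy("14-Dec-2024"))  # 12/14/2024
```

The second call shows that `strptime` accepts `9/4/2024` with the plain `%m/%d/%Y` format.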

In [106]:
# Add the extracted values into our dataframe

def extract_date_amount(df):
    for i in range(len(df)):
        jason = json.loads(df.loc[i, 'receipt_extract'])

        date = ""
        if 'INVOICE_RECEIPT_DATE' in jason.keys():
            date = extract_date_from_invoice_date_string(jason['INVOICE_RECEIPT_DATE'].replace(',',' ').replace('.', ' ').strip())

        if date == "":
            # If the field is missing or holds no recognizable date, try the full text
            date = extract_date_from_full_string(df.loc[i, 'receipt_extract'])

        try:
            reformatted_date = reformat_date(date)
        except ValueError:
            # No usable date found: fall back to the 01/01/1899 sentinel
            reformatted_date = datetime(1899, 1, 1).strftime('%m/%d/%Y')
        
        df.loc[i, 'extracted_date'] = reformatted_date
        
        if 'TOTAL' in jason.keys() and extract_amt_from_string(jason['TOTAL']) != 0.00:
            extracted_total = extract_amt_from_string(jason['TOTAL'])
            #if 'GRATUITY' in jason.keys():
             #   extracted_total += extract_amt_from_string(jason['GRATUITY'])
            
        elif 'AMOUNT_PAID' in jason.keys():
            extracted_total = extract_amt_from_string(jason['AMOUNT_PAID'])
        
        elif "SUBTOTAL" in jason.keys():
            subtotal = extract_amt_from_string(jason['SUBTOTAL'])
            
            try:
                tax = extract_amt_from_string(jason['TAX'])
            except KeyError:
                tax = 0
            extracted_total = subtotal + tax
            
        else:
            extracted_total = np.float64(0.00)
            
        df.loc[i, 'extracted_total'] = extracted_total
        
        print(i, reformatted_date, extracted_total)
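
The fallback order for the total (TOTAL, then AMOUNT_PAID, then SUBTOTAL plus TAX) can be isolated into a small pure function for testing. The sketch below uses plain floats; `pick_total` and `largest_amount` are hypothetical names, and rounding the sum to two decimals is an added assumption.

```python
import re

def largest_amount(text):
    """Largest 123.45-style amount in text, or 0.0 when there is none."""
    amounts = [float(m) for m in re.findall(r'\d+\.\d{2}', text)]
    return max(amounts, default=0.0)

def pick_total(fields):
    """Mirror the fallback order: TOTAL, then AMOUNT_PAID, then SUBTOTAL + TAX."""
    total = largest_amount(fields.get("TOTAL", ""))
    if total != 0.0:
        return total
    if "AMOUNT_PAID" in fields:
        return largest_amount(fields["AMOUNT_PAID"])
    if "SUBTOTAL" in fields:
        tax = largest_amount(fields.get("TAX", ""))
        return round(largest_amount(fields["SUBTOTAL"]) + tax, 2)
    return 0.0

print(pick_total({"TOTAL": "$505.63 $0.00", "TAX": "$39.59"}))  # 505.63
print(pick_total({"AMOUNT_PAID": "USD45.00"}))                  # 45.0
print(pick_total({"SUBTOTAL": "10.00", "TAX": "2.00"}))         # 12.0
```

Note that an `AMOUNT_PAID` of `$0.00` wins over `SUBTOTAL` in this order, so a receipt with no `TOTAL` and a zero amount paid yields 0.0.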

In [142]:
date = extract_date_from_invoice_date_string("12-30-2024")
print(date)
reformat_date(date)

matches:  ['12-30-2024']
12-30-2024


'12/30/2024'

In [20]:
receipts

Unnamed: 0,receipt_extract,object_name,date,subtotal,total,category,extracted_date,extracted_total,predicted_category
0,"{""NAME"": ""DELTA"", ""INVOICE_RECEIPT_DATE"": ""22S...",Airfare1.jpg,09/22/2024,45.00,45.00,Travel,09/22/2024,45.00,Meals
1,"{""NAME"": ""Alaska Airlines Alaska Fairbanks ANC...",Airfare2.jpg,11/13/2024,545.11,614.69,Travel,11/13/2024,614.69,Travel
2,"{""NAME"": ""National."", ""items"": {""item0"": ""TIME...",CarRental1.jpg,12/15/2024,355.00,505.63,Travel,12/15/2024,505.63,Travel
3,"{""AMOUNT_PAID"": ""$0.00"", ""items"": {""item0"": ""C...",CarRental2.jpg,01/01/1900,173.14,173.14,Travel,01/01/1899,173.14,Travel
4,"{""ADDRESS"": ""BOZEMAN INTL ARPT\n850 GALLATIN F...",CarRental3.jpg,12/20/2024,272.83,319.18,Travel,01/01/1899,319.18,Travel
...,...,...,...,...,...,...,...,...,...
94,"{""ADDRESS"": ""Store 2999 Dir Heather Jecobs\nMa...",Meals32.jpg,12/30/2024,74.47,74.47,Meals,12/30/2024,74.47,
95,"{""ADDRESS"": ""Courtyard by Marriott\u00ae Seatt...",Hotel16.jpg,11/02/2024,656.08,656.08,Lodging,01/01/1899,17.27,
96,"{""NAME"": ""Abdinajib! Lyft lyft"", ""items"": {""it...",Taxi10.jpg,11/02/2024,66.83,83.54,Travel,11/02/2024,83.54,
97,"{""NAME"": ""lyft"", ""items"": {""item0"": ""Lyft fare...",Taxi11.jpg,08/08/2024,33.00,37.69,Travel,08/08/2024,37.69,


In [2]:
def convert_dict_to_string_with_prompt(receipt_extract, prompt):
    
    for key in receipt_extract.keys():
        if key == 'items':
            prompt+=key +":\n"
            for k in receipt_extract['items'].keys():
                prompt+= k + ":" + receipt_extract['items'][k].replace('\n',' ') +'\n'
        else:
            prompt += key +":"+receipt_extract[key]+"\n"
    prompt+="What category does this receipt belong to?"

    return prompt
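
To see what the receipt portion of the prompt looks like, the same layout can be produced on a small hand-made dictionary. `render_receipt` is a hypothetical helper that builds only the receipt lines (no instruction text or closing question), and the `sample` values are shortened from a row shown later in this notebook.

```python
def render_receipt(fields):
    """Render a condensed extract as key:value lines, flattening the items dict."""
    lines = []
    for key, value in fields.items():
        if key == "items":
            lines.append("items:")
            for item_key, item_value in value.items():
                # Newlines inside an item are replaced so each item stays on one line
                lines.append(item_key + ":" + item_value.replace("\n", " "))
        else:
            lines.append(key + ":" + value)
    return "\n".join(lines) + "\n"

sample = {
    "NAME": "National.",
    "items": {"item0": "TIME & DISTANCE\n12/10/2024 - 12/15/2024", "item1": "$355.00"},
    "TOTAL": "$505.63",
}
print(render_receipt(sample))
```

Collecting lines in a list and joining once avoids repeated string concatenation, though for receipts of this size either approach is fine.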

In [3]:
def prompt_model_titan_express(json_derulo):
    client = boto3.client('bedrock-runtime')
    try:
        response = client.invoke_model(
            modelId = 'amazon.titan-text-lite-v1',   # NOTE: this ID is Titan Text Lite; Titan Text Express is 'amazon.titan-text-express-v1'
            contentType = 'application/json',
            accept = "application/json",
            body = json.dumps(
                {
                    'inputText':json_derulo,
                    'textGenerationConfig': 
                    {
                        'maxTokenCount': 20,
                        'temperature' : .5,
                        'topP':.5
                    }
                }
            )
        )
            
        body = response['body']
        return body
    except ClientError as e:
        # Service-side failures, e.g. throttling or no access to the model
        print(f"Bedrock request failed: {e}")
    except NoCredentialsError:
        print("AWS credentials not available.")

In [4]:
def prompt_model_llama(json_derulo):
    client = boto3.client('bedrock-runtime')
    try:
        response = client.invoke_model(
            modelId = 'arn:aws:bedrock:us-east-1:418295723137:inference-profile/us.meta.llama3-1-8b-instruct-v1:0',
            body = json.dumps({"prompt":json_derulo, 'top_p': .5, 'temperature': .2, "max_gen_len":100}),
            
            contentType = 'application/json',
            accept = "application/json",
            
        )
        return response['body']
    except ClientError as e:
        # Service-side failures, e.g. throttling or no access to the model
        print(f"Bedrock request failed: {e}")
    except NoCredentialsError:
        print("AWS credentials not available.")

In [5]:
def parse_llama_response(parsed_body):
    regex = r'(Meals|Supplies|Safety|Travel|Lodging|Other)'
    match = re.search(regex, parsed_body)
    print(match)
    if match is None:
        return "None"
    return match.group()
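
Because the regex is applied to the whole response body, it could in principle match text outside the model's answer. A hedged alternative is to parse the JSON first and search only the `generation` field; `category_from_body` is a hypothetical name, and the sample bodies are copied from outputs shown later in this notebook.

```python
import json
import re

CATEGORY_RE = re.compile(r'(Meals|Supplies|Safety|Travel|Lodging|Other)')

def category_from_body(raw_body):
    """Parse a Llama response body and return the first category named in its generation."""
    generation = json.loads(raw_body).get("generation", "")
    match = CATEGORY_RE.search(generation)
    return match.group() if match else "None"

body_ok = '{"generation":" Category:Travel","prompt_token_count":859,"generation_token_count":4,"stop_reason":"stop"}'
body_empty = '{"generation":"","prompt_token_count":484,"generation_token_count":1,"stop_reason":"stop"}'

print(category_from_body(body_ok))     # Travel
print(category_from_body(body_empty))  # None
```

As in `parse_llama_response`, the string `"None"` marks rows that need to be retried.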


In [6]:
def add_category_to_dataframe(dataframe, prompt, start = 0, end = 1):
    """ 
    Queries Amazon Titan Text Express with each row of dataframe. Parses 
    the response and adds the extracted predicted value to the dataframe.

    The start and end values and the sleep timer below are needed because the
    model throws an error if it is queried too quickly. It will sometimes error
    anyway, but since this modifies the dataframe in place, the changes made so
    far are kept and you can continue from the next index.
    
    :param dataframe: A dataframe with a receipt_extract column.
    :param prompt: Instruction text placed before each receipt
    :param start: Index of dataframe to start on
    :param end: Index of dataframe to end on (not included)
    :return: Nothing. This modifies a dataframe
    
    """
    
    for i in range(start, end):
        receipt_extract = json.loads(dataframe.loc[i, 'receipt_extract'])

        # Build a fresh prompt per row; reassigning `prompt` here would
        # append every earlier receipt to the next request
        full_prompt = convert_dict_to_string_with_prompt(receipt_extract, prompt)

        response = prompt_model_titan_express(full_prompt)

        # Converts response into dictionary format
        parsed_body = json.loads(response.read().decode('utf-8'))

        # extract category from response
        category = parsed_body['results'][0]['outputText']

        dataframe.loc[i, 'predicted_category'] = category.strip()
        print(i, category, dataframe.loc[i, 'category'])
        time.sleep(10)
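
Since requests can fail when the model is queried too quickly, a retry with increasing delays is one alternative to a fixed sleep. The sketch below is generic and does not call AWS; `call_with_retries` and its parameters are hypothetical, and in real use the `except` would likely be narrowed to botocore's `ClientError`.

```python
import time

def call_with_retries(func, *args, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call func(*args); on failure wait base_delay * 2**attempt and try again."""
    for attempt in range(attempts):
        try:
            return func(*args)
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)

# Stub that fails twice before succeeding, standing in for a throttled model call.
calls = {"count": 0}
def flaky_model(text):
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("simulated throttling")
    return text.upper()

delays = []
result = call_with_retries(flaky_model, "travel", sleep=delays.append)
print(result, delays)  # TRAVEL [1.0, 2.0]
```

Passing `sleep=delays.append` records the waits instead of actually sleeping, which keeps the demonstration instant.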

In [7]:
def add_category_to_dataframe_llama(dataframe, prompt, start = 0, end = 1):

    '''
    Same as above, but the llama response is different and needs to be parsed 
    differently.
    
    '''
    for i in range(start, end):
        receipt_extract = json.loads(dataframe.loc[i, 'receipt_extract'])

        # Build a fresh prompt per row so earlier receipts are not carried over
        full_prompt = convert_dict_to_string_with_prompt(receipt_extract, prompt)

        response = prompt_model_llama(full_prompt)

        # Decode the response body to a string
        parsed_body = response.read().decode('utf-8')
        print(parsed_body)
        # extract category from response
        category = parse_llama_response(parsed_body)

        dataframe.loc[i, 'predicted_category'] = category.strip()
        print(i, category, dataframe.loc[i, 'category'])
        time.sleep(5)

### Example call

In [24]:
filename = r'data/Spreadsheets/receipts_test.csv'
receipts = pd.read_csv(filename)
receipts['predicted_category'] = "N/A"
receipts.head()

Unnamed: 0,receipt_extract,object_name,date,subtotal,total,category,extracted_date,extracted_total,predicted_category
0,"{""NAME"": ""DELTA"", ""INVOICE_RECEIPT_DATE"": ""22S...",Airfare1.jpg,09/22/2024,45.0,45.0,Travel,09/22/2024,45.0,
1,"{""NAME"": ""Alaska Airlines Alaska Fairbanks ANC...",Airfare2.jpg,11/13/2024,545.11,614.69,Travel,11/13/2024,614.69,
2,"{""NAME"": ""National."", ""items"": {""item0"": ""TIME...",CarRental1.jpg,12/15/2024,355.0,505.63,Travel,12/15/2024,505.63,
3,"{""AMOUNT_PAID"": ""$0.00"", ""items"": {""item0"": ""C...",CarRental2.jpg,01/01/1900,173.14,173.14,Travel,01/01/1899,173.14,
4,"{""ADDRESS"": ""BOZEMAN INTL ARPT\n850 GALLATIN F...",CarRental3.jpg,12/20/2024,272.83,319.18,Travel,01/01/1899,319.18,


In [80]:
prompt = '''
    You are an expert in receipt categorization. Categorize the following receipt into one of these categories: Meals, Supplies, Safety, Travel, Lodging, or Other. 
    Category definitions with examples:
    Meals: Expenses for food and drinks (e.g., restaurant bills, coffee shop receipts).
    Supplies: Purchases for office or work-related materials (e.g., stationery, printer ink, electronics).
    Safety: Expenses related to safety equipment or services (e.g., gloves, helmets, fire extinguishers).
    Travel: Expenses for transportation (e.g., airfare, train tickets, taxi fares, gas, car rentals).
    Lodging: Accommodation expenses (e.g., hotel bills, Airbnb receipts).
    Other: Any expense that does not fit the above categories.

    Instructions:
    Do not include explanations, steps, or any additional text.
    Respond strictly in the format: "Category:<category>"
    
    Receipt:

    '''

In [77]:
add_category_to_dataframe_llama(receipts, prompt)
receipts.head(2)

{"generation":"Airfare\nSUBTOTAL:USD45.00 45.00\nTAX:USD0.00 0.00\nPAYMENT_METHOD:Credit Card\nPAYMENT_DATE:22Sep24\nPAYMENT_ID:0062265322160\nPAYMENT_TYPE:Credit Card\nPAYMENT_STATUS:Paid\nPAYMENT_AMOUNT:USD45.00 45.00\nPAYMENT_CURRENCY:USD\nPAYMENT_METHOD:Credit Card\nPAYMENT_DATE:22Sep24\n","prompt_token_count":253,"generation_token_count":100,"stop_reason":"length"}
None
0 None Travel


Unnamed: 0,receipt_extract,object_name,date,subtotal,total,category,extracted_date,extracted_total,predicted_category
0,"{""NAME"": ""DELTA"", ""INVOICE_RECEIPT_DATE"": ""22S...",Airfare1.jpg,09/22/2024,45.0,45.0,Travel,09/22/2024,45.0,
1,"{""NAME"": ""Alaska Airlines Alaska Fairbanks ANC...",Airfare2.jpg,11/13/2024,545.11,614.69,Travel,11/13/2024,614.69,Travel


In [27]:
add_category_to_dataframe(receipts, prompt, start = 1, end = 2)
receipts.head(2)

1 Travel Travel


Unnamed: 0,receipt_extract,object_name,date,subtotal,total,category,extracted_date,extracted_total,predicted_category
0,"{""NAME"": ""DELTA"", ""INVOICE_RECEIPT_DATE"": ""22S...",Airfare1.jpg,09/22/2024,45.0,45.0,Travel,09/22/2024,45.0,
1,"{""NAME"": ""Alaska Airlines Alaska Fairbanks ANC...",Airfare2.jpg,11/13/2024,545.11,614.69,Travel,11/13/2024,614.69,Travel


# Testing Llama 70b parameters

In [None]:
filename = r'data/Spreadsheets/receipts_test.csv'
receipts = pd.read_csv(filename)
receipts['predicted_category'] = "N/A"
receipts.head()

In [76]:
prompt = '''<s>[INST] <<SYS>>
You are an expert in receipt categorization. Categorize the following receipt into one of these categories: Meals, Supplies, Safety, Travel, Lodging, or Other. 

These are the definitions of each category with examples:
Meals: Expenses for food and drinks (e.g., restaurant bills, coffee shop receipts).
Supplies: Purchases for office or work-related materials (e.g., stationery, printer ink, electronics).
Safety: Expenses related to safety equipment or services (e.g., gloves, helmets, fire extinguishers).
Travel: Expenses for transportation (e.g., airfare, train tickets, taxi fares, gas, car rentals).
Lodging: Accommodation expenses (e.g., hotel bills, Airbnb receipts).
Other: Any expense that does not fit the above categories.
    
Do not include explanations, steps, or any additional text.
If you do not know, pick a category at random.
Respond strictly in the format: Category:<category>

<</SYS>>

Receipt:
...
...

What category does this receipt belong to? [/INST]</s>

'''

In [1]:
def build_prompt(receipt_extract, prompt):
    
    for key in receipt_extract.keys():
        if key == 'items':
            prompt+=key +":\n"
            for k in receipt_extract['items'].keys():
                prompt+= k + ":" + receipt_extract['items'][k].replace('\n',' ') +'\n'
        else:
            prompt += key +":"+receipt_extract[key]+"\n"

    prompt+= '''What category does this receipt belong to? [/INST]'''

    return prompt

In [66]:
receipt_extract = json.loads(receipts.loc[0, 'receipt_extract'])

full_prompt = build_prompt(receipt_extract, prompt)
print(full_prompt)


<s>[INST] <<SYS>>
You are an expert in receipt categorization. Categorize the following receipt into one of these categories: Meals, Supplies, Safety, Travel, Lodging, or Other. 

These are the definitions of each category with examples:
Meals: Expenses for food and drinks (e.g., restaurant bills, coffee shop receipts).
Supplies: Purchases for office or work-related materials (e.g., stationery, printer ink, electronics).
Safety: Expenses related to safety equipment or services (e.g., gloves, helmets, fire extinguishers).
Travel: Expenses for transportation (e.g., airfare, train tickets, taxi fares, gas, car rentals).
Lodging: Accommodation expenses (e.g., hotel bills, Airbnb receipts).
Other: Any expense that does not fit the above categories.
    
Do not include explanations, steps, or any additional text.
Respond strictly in the format: Category:<category>
The receipt is in text format.

Receipt:

NAME:DELTA
INVOICE_RECEIPT_DATE:22Sep24
INVOICE_RECEIPT_ID:0062265322160
TOTAL:USD45.0

In [74]:
## LLama model with 70b parameters

def prompt_model_llama_seventyb(json_derulo):
    client = boto3.client('bedrock-runtime')
    
    body = json.dumps(
                {
                    "prompt":json_derulo, 
                    'top_p': .9, 
                    'temperature': .2, 
                    "max_gen_len":100
                }
            )
    model_id = 'meta.llama3-70b-instruct-v1:0'
    
    try:
        response = client.invoke_model(
            modelId = model_id,
            body = body,
            
            contentType = 'application/json',
            accept = "application/json",
            
        )
        return response['body']
    except ClientError as e:
        # Service-side failures, e.g. throttling or no access to the model
        print(f"Bedrock request failed: {e}")
    except NoCredentialsError:
        print("AWS credentials not available.")

In [67]:
response = prompt_model_llama_seventyb(full_prompt)

In [68]:
parsed_body = json.loads(response.read().decode('utf-8'))

In [69]:
parsed_body

{'generation': ' Travel',
 'prompt_token_count': 260,
 'generation_token_count': 2,
 'stop_reason': 'stop'}

In [81]:
def add_category_to_dataframe_llama(dataframe, model, prompt, start = 0, end = 1):

    '''
    Generalized function to map llama predictions to rows in dataframe
    
    '''
    for i in range(start, end):
        receipt_extract = json.loads(dataframe.loc[i, 'receipt_extract'])

        full_prompt = convert_dict_to_string_with_prompt(receipt_extract, prompt)

        response = model(full_prompt)

        # Decode the response body to a string
        parsed_body = response.read().decode('utf-8')
        print(parsed_body)
        # extract category from response
        category = parse_llama_response(parsed_body)

        dataframe.loc[i, 'predicted_category'] = category.strip()
        print(i, category, dataframe.loc[i, 'category'])
        time.sleep(30)

In [22]:
filename = r'data/Spreadsheets/receipts_llama_70binstruct.csv'
receipts = pd.read_csv(filename)
receipts.head()

Unnamed: 0,receipt_extract,object_name,date,subtotal,total,category,extracted_date,extracted_total,predicted_category
0,"{""NAME"": ""DELTA"", ""INVOICE_RECEIPT_DATE"": ""22S...",Airfare1.jpg,09/22/2024,45.0,45.0,Travel,09/22/2024,45.0,Travel
1,"{""NAME"": ""Alaska Airlines Alaska Fairbanks ANC...",Airfare2.jpg,11/13/2024,545.11,614.69,Travel,11/13/2024,614.69,Travel
2,"{""NAME"": ""National."", ""items"": {""item0"": ""TIME...",CarRental1.jpg,12/15/2024,355.0,505.63,Travel,12/15/2024,505.63,Travel
3,"{""AMOUNT_PAID"": ""$0.00"", ""items"": {""item0"": ""C...",CarRental2.jpg,01/01/1900,173.14,173.14,Travel,01/01/1899,173.14,Travel
4,"{""ADDRESS"": ""BOZEMAN INTL ARPT\n850 GALLATIN F...",CarRental3.jpg,12/20/2024,272.83,319.18,Travel,01/01/1899,319.18,Travel


In [83]:
model = prompt_model_llama_seventyb
add_category_to_dataframe_llama(receipts, model, prompt, start = 10, end = len(receipts))

{"generation":" Supplies","prompt_token_count":510,"generation_token_count":2,"stop_reason":"stop"}
<re.Match object; span=(16, 24), match='Supplies'>
10 Supplies Travel
{"generation":"Travel","prompt_token_count":504,"generation_token_count":2,"stop_reason":"stop"}
<re.Match object; span=(15, 21), match='Travel'>
11 Travel Travel
{"generation":"Travel","prompt_token_count":357,"generation_token_count":2,"stop_reason":"stop"}
<re.Match object; span=(15, 21), match='Travel'>
12 Travel Travel
{"generation":"Meals","prompt_token_count":420,"generation_token_count":3,"stop_reason":"stop"}
<re.Match object; span=(15, 20), match='Meals'>
13 Meals Meals
{"generation":"Meals","prompt_token_count":481,"generation_token_count":3,"stop_reason":"stop"}
<re.Match object; span=(15, 20), match='Meals'>
14 Meals Meals
{"generation":"Meals","prompt_token_count":414,"generation_token_count":3,"stop_reason":"stop"}
<re.Match object; span=(15, 20), match='Meals'>
15 Meals Meals
{"generation":"Meals","prom

In [84]:
receipts.to_csv('data/Spreadsheets/receipts_llama_70binstruct.csv', index=False)
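
With predictions saved, the `category` and `predicted_category` columns can be compared to score a model. The sketch below works on plain lists; `category_accuracy` is a hypothetical helper, and the sample values mirror the first five rows of the full dataframe output shown earlier in this notebook.

```python
def category_accuracy(actual, predicted):
    """Fraction of rows where the predicted category equals the labelled one."""
    pairs = list(zip(actual, predicted))
    if not pairs:
        return 0.0
    return sum(a == p for a, p in pairs) / len(pairs)

actual = ["Travel", "Travel", "Travel", "Travel", "Travel"]
predicted = ["Meals", "Travel", "Travel", "Travel", "Travel"]
print(category_accuracy(actual, predicted))  # 0.8
```

On the dataframe itself, `(receipts['category'] == receipts['predicted_category']).mean()` gives the same fraction.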

In [30]:
receipts.loc[2, 'receipt_extract']

'{"NAME": "National.", "items": {"item0": "TIME & DISTANCE 5 e\\n12/10/2024 - 12/15/2024", "item1": "$355.00", "item2": "5", "item3": "$71.00 / DAY", "item4": "TIME & DISTANCE 5 e $71.00 / DAY $355.00\\n12/10/2024 - 12/15/2024"}, "RECEIVER_ADDRESS": "0020 AVIATION BLVD\\nINGLEWOOD CO\\n00301 2807\\n0336782048", "TAX": "$39.59", "TOTAL": "$505.63 $0.00", "AMOUNT_DUE": "$0.00", "VENDOR_NAME": "National.", "VENDOR_PHONE": "04432017101", "OTHER": "Dec 10. PA24 Dec 18. P024 Make/Model: AUDI/ASSA\\nColor BRED\\nCar Class Driven: GXAR\\nCm Class Charged ICAR\\nMilen In: 27604 Miles Out 26092\\nMileage: 792\\nFuel In: Full Fuel Out: Full AUDI/ASSA BRED GXAR ICAR 27604 26092 792 Full Full 9JXP226 CA OFTD2J RA021410 10.0000% VISA xxxx8192 Signature Manual"}'

In [38]:
receipts_nones = receipts[receipts['predicted_category'] == 'None']

In [45]:
receipts_nones = receipts_nones.reset_index(drop=True)

In [49]:
add_category_to_dataframe_llama(receipts_nones, prompt = prompt, model = prompt_model_llama_seventyb, start = 0, end = len(receipts_nones))

{"generation":"","prompt_token_count":484,"generation_token_count":1,"stop_reason":"stop"}
None
0 None Travel
{"generation":" Category:Travel","prompt_token_count":859,"generation_token_count":4,"stop_reason":"stop"}
<re.Match object; span=(25, 31), match='Travel'>
1 Travel Travel
{"generation":" Category: Meals","prompt_token_count":426,"generation_token_count":4,"stop_reason":"stop"}
<re.Match object; span=(26, 31), match='Meals'>
2 Meals Meals
{"generation":" Category: Meals","prompt_token_count":487,"generation_token_count":4,"stop_reason":"stop"}
<re.Match object; span=(26, 31), match='Meals'>
3 Meals Meals
{"generation":"","prompt_token_count":487,"generation_token_count":1,"stop_reason":"stop"}
None
4 None Meals
{"generation":"","prompt_token_count":427,"generation_token_count":1,"stop_reason":"stop"}
None
5 None Meals
{"generation":"","prompt_token_count":496,"generation_token_count":1,"stop_reason":"stop"}
None
6 None Lodging
{"generation":" Category:Lodging","prompt_token_cou
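
The match spans printed above (for example `span=(25, 31)`) suggest the category regex runs over the whole raw response string rather than over the `generation` field alone. A hedged sketch of a stricter alternative follows: decode the JSON first, then search only the generated text, and return `None` for the empty generations seen above. The `extract_category` helper and the `CATEGORIES` tuple are assumptions made for illustration; the category list is taken only from the labels visible in these outputs and may be incomplete.

```python
import json
import re

# Assumed category set, taken from the labels visible in the outputs above.
CATEGORIES = ("Meals", "Travel", "Lodging", "Supplies")
CATEGORY_RE = re.compile("|".join(CATEGORIES))


def extract_category(raw_response):
    """Return the first known category in the model's generation, else None."""
    try:
        payload = json.loads(raw_response)
    except (json.JSONDecodeError, TypeError):
        # Not valid JSON (or not a string at all): treat as "no category".
        return None
    if not isinstance(payload, dict):
        return None
    generation = payload.get("generation") or ""
    match = CATEGORY_RE.search(generation)
    return match.group(0) if match else None


# Example with one of the response bodies shown above.
example = '{"generation":" Category:Travel","prompt_token_count":859,"generation_token_count":4,"stop_reason":"stop"}'
category = extract_category(example)
```

Searching only the `generation` field avoids accidental matches elsewhere in the response body, and the explicit `None` return makes empty generations easy to count afterwards.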

In [50]:
receipts_nones_nones = receipts[receipts['predicted_category'] == 'None']
receipts_nones_nones = receipts_nones_nones.reset_index(drop=True)
receipts_nones_nones

Unnamed: 0,receipt_extract,object_name,date,subtotal,total,category,extracted_date,extracted_total,predicted_category
0,"{""NAME"": ""National."", ""items"": {""item0"": ""TIME...",CarRental1.jpg,12/15/2024,355.0,505.63,Travel,12/15/2024,505.63,
1,"{""ADDRESS"": ""ANCHORAGE INTL AIRPORT\n(ANC)\n50...",CarRental4.jpg,12/19/2024,143.6,210.19,Travel,01/01/1899,210.19,
2,"{""ADDRESS"": ""LAUREL #1287\n312 FIRST AVE S\nLA...",Groceries1.jpg,12/05/2024,38.46,38.46,Meals,12/05/2024,38.46,
3,"{""ADDRESS"": ""Rockin J\n2993 Highway 78\nAbsaro...",Groceries2.jpg,09/24/2024,22.48,22.48,Meals,09/24/2024,22.48,
4,"{""ADDRESS"": ""ESSENTIAL\nFUELS #5\n2646\nUS-310...",Groceries4.jpg,10/16/2024,22.74,22.74,Meals,10/16/2024,22.74,
5,"{""NAME"": ""EXCHANGE Eielson Shopping Center wGa...",Groceries5.jpg,11/13/2024,1.99,1.99,Meals,11/13/2024,1.99,
6,"{""ADDRESS"": ""EVEN HOTEL 406 BOZEMAN Belgrade B...",Hotel2.jpg,12/18/2024,444.09,444.09,Lodging,01/01/1899,444.09,
7,"{""ADDRESS"": ""Hampton Inn and Suites by Hilton ...",Hotel5.jpg,11/15/2024,163.56,182.9,Lodging,01/01/1899,182.9,
8,"{""ADDRESS"": ""AC HOTELS BY MARRIOTT\nAC HOTEL B...",Hotel6.jpg,10/31/2024,747.0,805.56,Lodging,01/01/1899,805.56,
9,"{""ADDRESS"": ""Courtyard by Marriott Seattle Dow...",Hotel7.jpg,10/28/2024,391.0,461.38,Lodging,01/01/1899,461.38,


In [58]:
add_category_to_dataframe_llama(receipts_nones_nones, prompt=prompt, model=prompt_model_llama_seventyb, start=10, end=44)

{"generation":" Category:Meals","prompt_token_count":486,"generation_token_count":5,"stop_reason":"stop"}
<re.Match object; span=(25, 30), match='Meals'>
10 Meals Meals
{"generation":" Category: Meals","prompt_token_count":445,"generation_token_count":4,"stop_reason":"stop"}
<re.Match object; span=(26, 31), match='Meals'>
11 Meals Meals
{"generation":"","prompt_token_count":473,"generation_token_count":1,"stop_reason":"stop"}
None
12 None Meals
{"generation":"","prompt_token_count":294,"generation_token_count":1,"stop_reason":"stop"}
None
13 None Meals
{"generation":"","prompt_token_count":417,"generation_token_count":1,"stop_reason":"stop"}
None
14 None Meals
{"generation":"","prompt_token_count":375,"generation_token_count":1,"stop_reason":"stop"}
None
15 None Travel
{"generation":" Category:Supplies","prompt_token_count":588,"generation_token_count":5,"stop_reason":"stop"}
<re.Match object; span=(25, 33), match='Supplies'>
16 Supplies Supplies
{"generation":"","prompt_token_count":7

In [59]:
(receipts_nones_nones['predicted_category'] == "None").sum()

26
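
The cells above repeat the same cycle by hand: filter the rows without a category, re-invoke the model, then count what is still missing. A minimal sketch of that cycle as a bounded loop is below. The `retry_missing` function and its `classify` argument are hypothetical; `classify` stands in for whatever call sends one `receipt_extract` to the model and returns a category string or `None`. The mask treats both the string `'None'` and real missing values as unfilled, on the assumption that either can appear after a CSV round trip.

```python
import pandas as pd


def retry_missing(df, classify, max_passes=3):
    """Re-run `classify` on rows without a category, up to `max_passes` times.

    Returns the number of rows that are still uncategorized afterwards.
    """
    def missing_mask():
        col = df["predicted_category"]
        return col.isna() | (col == "None")

    for _ in range(max_passes):
        mask = missing_mask()
        if not mask.any():
            break  # nothing left to retry
        for idx in df.index[mask]:
            result = classify(df.at[idx, "receipt_extract"])
            if result is not None:
                df.at[idx, "predicted_category"] = result
    return int(missing_mask().sum())
```

Capping the number of passes keeps the Bedrock invocation count predictable when a receipt keeps producing an empty generation.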

In [57]:
receipts_nones_nones.loc[0:10]

Unnamed: 0,receipt_extract,object_name,date,subtotal,total,category,extracted_date,extracted_total,predicted_category
0,"{""NAME"": ""National."", ""items"": {""item0"": ""TIME...",CarRental1.jpg,12/15/2024,355.0,505.63,Travel,12/15/2024,505.63,
1,"{""ADDRESS"": ""ANCHORAGE INTL AIRPORT\n(ANC)\n50...",CarRental4.jpg,12/19/2024,143.6,210.19,Travel,01/01/1899,210.19,
2,"{""ADDRESS"": ""LAUREL #1287\n312 FIRST AVE S\nLA...",Groceries1.jpg,12/05/2024,38.46,38.46,Meals,12/05/2024,38.46,Meals
3,"{""ADDRESS"": ""Rockin J\n2993 Highway 78\nAbsaro...",Groceries2.jpg,09/24/2024,22.48,22.48,Meals,09/24/2024,22.48,Meals
4,"{""ADDRESS"": ""ESSENTIAL\nFUELS #5\n2646\nUS-310...",Groceries4.jpg,10/16/2024,22.74,22.74,Meals,10/16/2024,22.74,Travel
5,"{""NAME"": ""EXCHANGE Eielson Shopping Center wGa...",Groceries5.jpg,11/13/2024,1.99,1.99,Meals,11/13/2024,1.99,Meals
6,"{""ADDRESS"": ""EVEN HOTEL 406 BOZEMAN Belgrade B...",Hotel2.jpg,12/18/2024,444.09,444.09,Lodging,01/01/1899,444.09,
7,"{""ADDRESS"": ""Hampton Inn and Suites by Hilton ...",Hotel5.jpg,11/15/2024,163.56,182.9,Lodging,01/01/1899,182.9,
8,"{""ADDRESS"": ""AC HOTELS BY MARRIOTT\nAC HOTEL B...",Hotel6.jpg,10/31/2024,747.0,805.56,Lodging,01/01/1899,805.56,
9,"{""ADDRESS"": ""Courtyard by Marriott Seattle Dow...",Hotel7.jpg,10/28/2024,391.0,461.38,Lodging,01/01/1899,461.38,


In [None]:
29