# Llama | ChatGPT: Revolutionizing Supply Chain Tasks That No Human Assistant Dares to Tackle
---

    Author: Amit Shukla

[https://github.com/AmitXShukla](https://github.com/AmitXShukla)

[https://twitter.com/ashuklax](https://twitter.com/ashuklax)

[https://youtube.com/@Amit.Shukla](https://youtube.com/@Amit.Shukla)


Meta recently released Llama 2, a family of large language models with up to 70B parameters, positioned as one of the most capable openly available models. It is expected to compete with other tools on both speed and accuracy.

In this blog post, I will demonstrate some automation use cases I have been working on. 

It's important to note that these use cases/models will work best when trained on "in-house" data. However, training such models is a rigorous task that requires significant computing hours and resources.

To make things more accessible and easier to use in production, off-the-shelf language models like ChatGPT and Llama 2 are a viable solution.

Below, I present some examples of use cases I've been working on. 

*While these examples are not meant for production*, they still showcase the powerful capabilities of the language models.

`Upon completing this blog, you will acquire the skills to build Llama 2 and ChatGPT APIs and harness the capabilities of large language models for practical data analytics tasks.`

## Table of contents
---

- Introduction
- [Llama 2 Installation Windows/Linux](https://github.com/AmitXShukla/RPA/blob/main/notebooks/llama2-UseCases.ipynb)
- [Efficient Time and Expense Monitoring with Llama 2](https://github.com/AmitXShukla/RPA/blob/main/notebooks/llama2-Efficient%20Time%20and%20Expense%20Monitoring%20with%20Llama%202.ipynb)
- [Using Llama 2 as OCR Vision AI](https://github.com/AmitXShukla/RPA/blob/main/notebooks/llama2-Using%20Llama%202%20as%20OCR%20Vision%20AI.ipynb)
- [Llama 2 as Supply Chain Assistant](https://github.com/AmitXShukla/RPA/blob/main/notebooks/llama2-as%20Supply%20Chain%20assistant.ipynb)
    - Streamlining 3-Way Receipt Match and Duplicate Voucher Invoices with Llama 2
    - Enhancing Fraud Detection: Utilizing Llama 2 as an Advanced Alert System for Monitoring Transactions
    - Maximizing Tax Savings, Ensuring Compliance, and Streamlining Audits with Llama 2
-  `**WIP**`: Tax Analytics | Spend Classification | Contracts management
---

## Introduction
---

#### About me
I'm Amit Shukla, and I specialize in training neural networks for Finance Supply Chain analysis, enabling them to identify data patterns and make accurate predictions.

During the challenges posed by the COVID-19 pandemic, I successfully trained GL and Supply Chain neural networks to anticipate supply chain shortages. The valuable insights gained from this effort have significantly influenced the content of this tutorial series.
	
#### Objective:
By delving into this powerful tool, we will master the fundamental techniques of utilizing large language models to predict hazards. 
This knowledge is crucial in preparing finance and supply chain data for advanced analytics, visualization, and predictive modeling using neural networks and machine learning.
	
#### Subject
It is crucial to emphasize that this specific series will focus exclusively on presenting `production-like examples that demonstrate certain use cases`. It is not intended for production applications. 

Nevertheless, these examples illustrate highly potent techniques that have practical applications in real-world Data Analytics.
	
#### Following
In future installments, we will delve deeper into data analytics and machine learning for predictive analytics.

Thank you for joining me, and I'm excited to embark on this educational journey together.
	
Let's get started.

---

## Installation
---

In a previous video, I demonstrated the process of activating the OpenAI ChatGPT and Llama environments. 

In this section, I will guide you through the steps to install Llama 2 on a Windows operating system. 

While the installation process is quite similar to that on Linux, there are a few minor changes that need to be considered. 

Let's get started!

- Step 1: `download miniconda windows installer` [https://docs.conda.io/en/latest/miniconda.html](https://docs.conda.io/en/latest/miniconda.html)
- Step 2: create a new conda environment (say llamaConda)

In [None]:
# before you set up your machine for Llama 2, check whether PyTorch can see a CUDA device

import torch
torch.cuda.is_available()
# if the import fails or this returns False, move to the next step and install PyTorch with CUDA support

In [None]:
# download pytorch cuda
# https://pytorch.org/get-started/locally/
# uncomment and run this command in Terminal to monitor download progress and debug any error

# !conda install pytorch torchvision torchaudio pytorch-cuda=11.7 -c pytorch -c nvidia

In [None]:
# run these commands again and make sure CUDA is available on your machine
import torch
torch.cuda.set_device(0)
torch.cuda.is_available(),torch.cuda.get_device_name()

In [None]:
# use this cell if you see a CUDA out-of-memory error
# also, try to reduce << --max_batch_size 1 >>, lower max_split_size_mb, and work with the smallest model
# !torchrun --nproc_per_node 1 example_text_completion.py --ckpt_dir llama-2-7b --tokenizer_path tokenizer.model --max_seq_len 512 --max_batch_size 1

import os
# the allocator reads this setting when CUDA is first initialized,
# so set it before the first CUDA call (ideally in a fresh kernel)
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:24"

import torch
# clear cache
torch.cuda.empty_cache()

# import gc
# del variables
# gc.collect()

print(torch.cuda.memory_summary(device=None, abbreviated=False))

Sign up and receive the model download link from Meta.

[Meta AI website](https://ai.meta.com/resources/models-and-libraries/llama-downloads/)
>>>
    Do not use the 'Copy link address' option when you right-click the URL. If the copied URL text starts with https://download.llamameta.net, you copied it correctly. If the copied URL text starts with https://l.facebook.com, you copied it the wrong way.

In [None]:
# Step 3: uncomment and run this command on your machine
# make sure Git is installed on your machine; if not,
# for Linux run
# sudo apt install git-all

# for Windows, download Git from this link
# https://git-scm.com/download/win

####### clone Meta llama repo ##########
# !git clone https://github.com/facebookresearch/llama.git

In [None]:
# browse to root of your llama repo

## LINUX
# cd llama
# chmod +x ./download.sh
# ./download.sh

## WINDOWS
# bash ./download.sh
# if this command errors out, download wget.exe from the link below and copy wget.exe to C:\amit.la\llama
# https://eternallybored.org/misc/wget/
# make sure you add << C:\amit.la\llama >> to the Windows PATH environment variable so that Windows can find wget.exe

In [None]:
# make sure, latest conda env is selected as kernel
# before activating and installing dependencies

!pip install -e .
!python setup.py install

In [None]:
# only for Windows (the NCCL backend is not available there)
# change line #62 in llama/generation.py
# from
# << torch.distributed.init_process_group("nccl") >>
# to
# << torch.distributed.init_process_group("gloo") >>

In [None]:
# make sure, latest conda env (llamaConda) is selected as kernel
# before activating and installing dependencies

# !pip install -e .
# !python setup.py install

In [None]:
# ./example_text_completion.py
# ./example_chat_completion.py

# change prompts | dialogue 
#   prompts = [
#         # For these prompts, the expected answer is the natural continuation of the prompt
#         "meaning of life is",
#     ]

In [None]:
# you are ready to use llama
# !torchrun --nproc_per_node 1 example_text_completion.py --ckpt_dir llama-2-7b --tokenizer_path tokenizer.model --max_seq_len 512 --max_batch_size 1

#### Addressing Llama | Cuda errors

---

Here is another solution, proposed by a user who was able to run Llama on CPU-only machines.

** `Please note: if you successfully install Llama using this approach, please open an issue on this GitHub repository so that I can update these notes.`

In [None]:
# install https://github.com/krychu/llama instead of https://github.com/facebookresearch/llama

# execute download.sh in a new terminal, provide META AI URL and download 7B model and model weights
#   run the download.sh script in a terminal, passing the URL provided when prompted to start the download
# create a new env

# python3 -m venv env
# source env/bin/activate

# pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu  # for the CPU version
# python3 -m pip install -e .


# torchrun --nproc_per_node 1 example_text_completion.py --ckpt_dir llama-2-7b/ --tokenizer_path tokenizer.model --max_seq_len 128 --max_batch_size 1


## Supply Chain business process
---


#### Data Flow
---

The supply chain process, often referred to as the B2P (Buy-to-Pay) or P2P (Procure-to-Pay) process, involves several steps to ensure the efficient procurement and delivery of goods. Let's break down the process step by step:

- Inventory Check:
    The supply chain process begins with an inventory check. This step involves assessing the current stock levels of goods in the system. It helps to determine which items are running low and need replenishment. This check can be done through various means, such as using software that tracks inventory levels in real-time.

- Replenishment Order:
    After the inventory check, if the system identifies that certain goods are running low or below the specified threshold, it automatically generates a replenishment order. This order initiates the procurement process and serves as a request for more goods to be obtained from external suppliers.

- Purchase Order Creation:
    Once the replenishment order is generated, the procurement team or relevant personnel create purchase orders. These purchase orders contain specific details about the requested goods, such as the quantity, item description, agreed-upon price, delivery date, and any other relevant terms and conditions.

- Sending Purchase Orders to Vendors:
    After the purchase orders are created, they are sent to the approved vendors or suppliers who provide the required goods. The vendors review the purchase orders and acknowledge their acceptance, confirming their commitment to fulfill the order.

- Goods Shipment:
    Upon receiving the accepted purchase orders, the vendors prepare the goods for shipment. They ensure that the correct quantity and quality of goods are packed and ready for delivery. The vendors then send the goods to the customer's designated delivery location.

- Receipt Generation:
    Once the goods are successfully delivered to the customer, the receiving team checks and verifies the received items against the details mentioned in the purchase order. If everything matches and there are no discrepancies, they generate a receipt confirming the receipt and acceptance of the goods.

- Invoicing and Payment:
    After the receipt is generated, the vendor sends an invoice to the customer for payment. The invoice contains details of the goods provided, their quantity, prices, any applicable taxes, and the total amount due. The customer reviews the invoice and makes the necessary payment to the vendor within the agreed-upon payment terms.

- Payment Settlement:
    Finally, the customer processes the payment, settling the outstanding invoice amount with the vendor. This completes the procurement process for the specific order.

The supply chain process is a continuous cycle, and the steps repeat as new demands arise, and inventory needs to be replenished. Efficiently managing the supply chain ensures that goods are available when needed, minimizing delays and disruptions in the supply of products or services.
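
The cycle described above can be sketched as a small state machine. This is only an illustration: the `P2PStage` names and the `next_stage` helper are my own labels for the steps listed in this section, not part of any ERP system.

```python
from enum import Enum


class P2PStage(Enum):
    """Stages of the procure-to-pay cycle, in the order described above."""
    INVENTORY_CHECK = 1
    REPLENISHMENT_ORDER = 2
    PURCHASE_ORDER_CREATED = 3
    PURCHASE_ORDER_SENT = 4
    GOODS_SHIPPED = 5
    RECEIPT_GENERATED = 6
    INVOICED = 7
    PAYMENT_SETTLED = 8


def next_stage(stage: P2PStage) -> P2PStage:
    """Return the stage that follows `stage`; the cycle restarts after settlement."""
    members = list(P2PStage)
    return members[(members.index(stage) + 1) % len(members)]


# Walk one full cycle starting from the inventory check.
stage = P2PStage.INVENTORY_CHECK
path = [stage.name]
for _ in range(len(P2PStage) - 1):
    stage = next_stage(stage)
    path.append(stage.name)
print(" -> ".join(path))
```

Because `next_stage` wraps around, a settled payment leads back to the next inventory check, which mirrors the continuous nature of the process.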

![Data Flow](../SampleData/ER_Flow.png)

#### Business Process Flow
---

The vendor payment process, as described in the flow, involves several checks and steps to ensure accuracy, efficiency, and compliance. Let's break down the process:

- Receipt or Invoice Document:
    The process begins when the supply chain clerk receives either a receipt or an invoice document from the vendor. This document serves as a formal request for payment for the goods or services provided.

- Three-Way Match:
    The first major step in the vendor payment process is the "three-way match." This involves comparing three key documents: the invoice, the purchase order, and the receipt (or delivery confirmation). The supply chain clerk checks the following:

    a. Quantity and Amount: The quantity of items and the total amount charged on the invoice are compared to the quantities specified in the purchase order and the actual receipt. It ensures that the vendor is billing correctly for the goods actually received.

    b. Timing: The timing of when the order was placed, shipped, and received is verified to ensure that the delivery was within the expected time frame allowed from order to receipt. Any discrepancies may require investigation or follow-up.

- Discount Opportunities:
    The supply chain clerk looks for potential discount opportunities offered by the vendor for early payment. Early payment discounts can lead to cost savings, and the clerk ensures that eligible discounts are taken advantage of while making payments on time to avoid penalties.

- Contractual Agreement Compliance:
    The supply chain clerk also checks whether the expenses fall under any existing contractual agreements with the vendor. These agreements may include negotiated prices, discounts, or specific terms and conditions that can be leveraged to procure goods or services at the most suitable prices.

- Avoiding Duplicate Invoices:
    When a large number of orders is processed each day, vendors commonly send duplicate invoices by accident. The supply chain clerk carefully matches and cross-references all received invoices to identify and eliminate any duplicated or erroneous transactions.

- Eligibility for Payment:
    Once all the necessary checks are performed, and the documents pass the three-way match, the invoices are considered eligible for payment.

- Payment Processing:
    Based on the eligibility for payment, the supply chain clerk initiates the payment process. This may involve generating payment files, authorizing transactions, and coordinating with the finance department to ensure timely and accurate disbursement of funds to the vendor.

By following this comprehensive vendor payment process, the organization can maintain strong vendor relationships, optimize costs, prevent errors, and ensure compliance with contractual agreements and payment terms. It also contributes to the overall efficiency and effectiveness of the supply chain management process.
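
To make the three-way match concrete, here is a minimal sketch of the quantity, amount, and timing checks described above. The class names, the 30-day delivery window, the amount tolerance, and the receipt date are assumptions chosen for illustration; the PO values come from the sample PO 51383 used later in this post.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class PurchaseOrder:
    po_id: int
    qty: int
    unit_price: float
    order_date: date


@dataclass
class Receipt:
    po_id: int
    qty_received: int
    received_date: date


@dataclass
class Invoice:
    po_id: int
    qty_billed: int
    amount: float


def three_way_match(po, receipt, invoice, max_days=30, tolerance=0.01):
    """Return a list of discrepancies; an empty list means the invoice is payable."""
    issues = []
    if not (po.po_id == receipt.po_id == invoice.po_id):
        issues.append("documents refer to different purchase orders")
    if invoice.qty_billed != receipt.qty_received:
        issues.append("billed quantity differs from received quantity")
    if receipt.qty_received > po.qty:
        issues.append("received more than was ordered")
    expected = receipt.qty_received * po.unit_price
    if abs(invoice.amount - expected) > tolerance:
        issues.append("invoice amount differs from received quantity x PO price")
    if (receipt.received_date - po.order_date).days > max_days:
        issues.append("delivery took longer than the allowed window")
    return issues


po = PurchaseOrder(51383, 15, 8970.0, date(2023, 4, 16))
receipt = Receipt(51383, 15, date(2023, 4, 30))  # hypothetical receipt date
good_invoice = Invoice(51383, 15, 134550.0)
bad_invoice = Invoice(51383, 15, 140000.0)       # hypothetical overbilling

print(three_way_match(po, receipt, good_invoice))
print(three_way_match(po, receipt, bad_invoice))
```

Returning a list of issues rather than a single boolean keeps the reason for a rejection available, which is useful when the result is later summarized in an LLM prompt or routed to a clerk.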

![Business Process](../SampleData/Process_Flow.png)

## Supply chain Buy to Pay Processes Automation
---

In the upcoming sections, we'll utilize available data to create prompts for LLMs (Large Language Models) to make decisions based on statistical information. 

However, it's essential to note that LLMs are not universally beneficial and may not be suitable for every scenario. 

For instance, when dealing with large systems containing millions of documents, using LLMs to compare invoices one by one can be computationally intensive and inefficient. In such cases, alternative data science techniques offer better results and higher efficiency for tasks like identifying duplicate invoices.
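
As one example of such an alternative technique, exact duplicates can be found in a single pass by hashing a normalized key for each invoice instead of comparing invoices pairwise. This is a sketch under assumptions: the field names (`vendor_id`, `po_id`, `amount`, `invoice_date`, `invoice_id`) and the choice of fields that define "the same invoice" are hypothetical.

```python
import hashlib
from collections import defaultdict


def invoice_key(invoice: dict) -> str:
    """Build a stable fingerprint from the fields that define 'the same' invoice."""
    parts = (
        str(invoice["vendor_id"]),
        str(invoice["po_id"]),
        f'{float(invoice["amount"]):.2f}',  # normalize 134550 and 134550.00
        str(invoice["invoice_date"]),
    )
    return hashlib.sha256("|".join(parts).encode("utf-8")).hexdigest()


def find_duplicates(invoices):
    """Group invoice ids by fingerprint and return only groups with more than one id."""
    groups = defaultdict(list)
    for inv in invoices:
        groups[invoice_key(inv)].append(inv["invoice_id"])
    return [ids for ids in groups.values() if len(ids) > 1]


invoices = [
    {"invoice_id": "A-1", "vendor_id": 1962, "po_id": 51383, "amount": 134550, "invoice_date": "2023-04-20"},
    {"invoice_id": "A-2", "vendor_id": 1962, "po_id": 51383, "amount": 134550.00, "invoice_date": "2023-04-20"},
    {"invoice_id": "B-1", "vendor_id": 6934, "po_id": 82514, "amount": 158332, "invoice_date": "2022-10-01"},
]
print(find_duplicates(invoices))
```

This runs in time proportional to the number of invoices, whereas asking an LLM to compare documents one pair at a time grows quadratically.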

#### Supply Chain 3 way Match process automation

- Step 1: Load Datasets
- Step 2: Collect statistics (describe generic behavior)
- Step 3: building prompt
- Step 4: Checking one Invoice
- Step 5: run invoice prompt through LLM (large language model) 

In [None]:
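# !pip install Faker
# !pip install polars
# !pip install nltk
# !pip install gensim
# !pip install scikit-learn
# !pip install torch
# py -m pip install pytesseract  (open source wrapper; the Tesseract OCR engine must be installed separately)
# py -m pip install Pillow  (provides the PIL module)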

#### Step 1: Load Datasets
---

In [7]:
######## Step 1 #####
# Load DataSets
# Customer, Vendor, Product, Product-Category

# load libraries
from datetime import datetime, timedelta
import polars as pl
from faker import Faker
import random
import numpy as np

# create list of values
fake = Faker()
ID = list(set(fake.unique.random_int() for i in range(1000)))
PH = []
NAME = []
EMAIL = []
ADDRESS = []
COMPANY = []
PRODUCT = []
PRODUCT_CATEGORY = []

for i in range(1000):
    PH.append(fake.unique.phone_number())
    NAME.append(fake.unique.name())
    EMAIL.append(fake.unique.email())
    ADDRESS.append(fake.unique.address())
    COMPANY.append(fake.unique.company())
    PRODUCT.append(fake.bothify(text='Product Number: ????-########'))
    PRODUCT_CATEGORY.append(fake.isbn10())

#####################
# CUSTOMER DataFrame
#####################

dfCustomer = pl.DataFrame({
    "CUSTOMER_ID": ID,
    "NAME" : NAME,
    "PH" : PH,
    "ADDRESS" : ADDRESS,
    "EMAIL_ID" : EMAIL
})

dfCustomer.sample(5)

CUSTOMER_ID,NAME,PH,ADDRESS,EMAIL_ID
i64,str,str,str,str
9076,"""Melanie Kenned…","""493-860-2796""","""9057 Jefferson…","""elopez@example…"
2891,"""Jacqueline Osb…","""(322)410-8459x…","""92734 Ramirez …","""horncarol@exam…"
4575,"""Anthony Myers""","""5814746103""","""4456 Evans Fie…","""aburns@example…"
5047,"""Andrea Sampson…","""425.769.9987""","""55081 Barker P…","""cschultz@examp…"
6918,"""Angela Wise""","""9645319781""","""40069 Jim Ramp…","""xlyons@example…"


In [8]:
dfVendor = pl.DataFrame({
    "VENDOR_ID": ID,
    "NAME" : COMPANY,
    "PH" : PH,
    "ADDRESS" : ADDRESS,
    "EMAIL_ID" : EMAIL
})

dfVendor.sample(5)

VENDOR_ID,NAME,PH,ADDRESS,EMAIL_ID
i64,str,str,str,str
8637,"""Thomas-Smith""","""359-804-9258""","""23506 Smith Pa…","""zlee@example.o…"
7789,"""Jones-Jones""","""+1-999-627-177…","""626 Brandy Exp…","""qrodriguez@exa…"
8406,"""Cisneros-Lee""","""913.371.7526""","""33314 Martha M…","""yvonnelawson@e…"
4546,"""Garcia-Martin""","""+1-847-287-829…","""65968 Goodwin …","""gregoryaguirre…"
4719,"""Silva, Rios an…","""285.461.1459x1…","""349 Crystal La…","""amy45@example.…"


In [9]:
dfProduct = pl.DataFrame({
    "PRODUCT_ID": ID,
    "PRODUCT" : PRODUCT,
    "PRODUCT_CATEGORY" : PRODUCT_CATEGORY,
    "MANUFACTURER" : COMPANY,
    "PRICE" : random.sample(range(10000), 1000)
})

dfProduct.sample(5)

PRODUCT_ID,PRODUCT,PRODUCT_CATEGORY,MANUFACTURER,PRICE
i64,str,str,str,i64
8475,"""Product Number…","""0-7658-3384-0""","""Grant-Reyes""",7974
2120,"""Product Number…","""1-69888-169-X""","""Williams, Jose…",6712
3288,"""Product Number…","""0-8155-7334-0""","""Myers Inc""",5775
4101,"""Product Number…","""0-262-40429-X""","""Armstrong, Sut…",963
3969,"""Product Number…","""0-7910-9578-9""","""Rice Inc""",4609


In [10]:
# Load DataSets
# Purchase Order
sampleSize = 100_000
dfPurOrder = pl.DataFrame({
    "PO_ID": list(range(1,sampleSize+1)),
    'AS_OF_DATE': random.choices(pl.date_range(datetime(2022, 1, 1), datetime(2023, 7, 20), timedelta(days=1), time_unit="ms"), k=sampleSize),
    "CUSTOMER_ID": random.choices(dfCustomer["CUSTOMER_ID"], k=sampleSize),
    "VENDOR_ID": random.choices(dfVendor["VENDOR_ID"], k=sampleSize),
    "PRODUCT_ID" : random.choices(dfProduct["PRODUCT_ID"], k=sampleSize),
    "QTY" : list(np.random.randint(1,50, sampleSize))
})

dfPurOrder.sample(5)

PO_ID,AS_OF_DATE,CUSTOMER_ID,VENDOR_ID,PRODUCT_ID,QTY
i64,datetime[μs],i64,i64,i64,i64
49786,2023-07-17 00:00:00,6990,1381,3594,19
13022,2022-11-04 00:00:00,4617,880,7304,31
18669,2022-02-02 00:00:00,3787,5509,7590,30
72139,2022-02-19 00:00:00,2639,698,8357,13
74971,2022-08-15 00:00:00,6367,4810,656,35


In [11]:
# Load DataSets
# Invoice
dfInvoice = dfPurOrder.join(dfProduct, on="PRODUCT_ID", how="inner").with_columns(
    (pl.col("QTY")*pl.col("PRICE")).alias("TOTAL")
)

dfInvoice.sample(5)

PO_ID,AS_OF_DATE,CUSTOMER_ID,VENDOR_ID,PRODUCT_ID,QTY,PRODUCT,PRODUCT_CATEGORY,MANUFACTURER,PRICE,TOTAL
i64,datetime[μs],i64,i64,i64,i64,str,str,str,i64,i64
96447,2023-07-06 00:00:00,3602,2774,7848,5,"""Product Number…","""0-7397-0577-6""","""Sandoval-Evans…",5576,27880
20138,2022-12-12 00:00:00,3975,3349,3849,35,"""Product Number…","""0-341-79496-1""","""Edwards-Hill""",4976,174160
63650,2022-12-31 00:00:00,693,5537,9850,42,"""Product Number…","""1-07-607178-3""","""Gutierrez Ltd""",627,26334
40985,2022-01-15 00:00:00,5312,2774,8237,48,"""Product Number…","""0-7152-3807-8""","""Miller, Thomas…",6365,305520
9725,2022-12-10 00:00:00,8024,5847,9093,9,"""Product Number…","""0-9975989-4-8""","""Holder, Waller…",8074,72666


In [12]:
# Load DataSets
# Receipt, Payment

dfReceipt = dfInvoice.join(dfProduct, on="PRODUCT_ID", how="inner").with_columns(
    pl.when(pl.col("QTY") < 42)
    .then(pl.lit(True))
    .otherwise(pl.lit(False))
    .alias("RECV_STATUS")
)

dfReceipt.sample(5)

PO_ID,AS_OF_DATE,CUSTOMER_ID,VENDOR_ID,PRODUCT_ID,QTY,PRODUCT,PRODUCT_CATEGORY,MANUFACTURER,PRICE,TOTAL,PRODUCT_right,PRODUCT_CATEGORY_right,MANUFACTURER_right,PRICE_right,RECV_STATUS
i64,datetime[μs],i64,i64,i64,i64,str,str,str,i64,i64,str,str,str,i64,bool
51383,2023-04-16 00:00:00,7743,1962,4686,15,"""Product Number…","""1-07-919635-8""","""Rodriguez, San…",8970,134550,"""Product Number…","""1-07-919635-8""","""Rodriguez, San…",8970,True
82514,2022-09-28 00:00:00,5681,6934,4851,23,"""Product Number…","""0-9835399-4-4""","""Mann Group""",6884,158332,"""Product Number…","""0-9835399-4-4""","""Mann Group""",6884,True
79397,2022-10-16 00:00:00,64,7425,8921,6,"""Product Number…","""1-140-93201-2""","""Brewer, Vasque…",5913,35478,"""Product Number…","""1-140-93201-2""","""Brewer, Vasque…",5913,True
59851,2023-05-04 00:00:00,3682,4979,5022,39,"""Product Number…","""0-12-053640-4""","""Gregory and So…",6451,251589,"""Product Number…","""0-12-053640-4""","""Gregory and So…",6451,True
43265,2023-03-15 00:00:00,1060,9850,9713,2,"""Product Number…","""1-104-21292-7""","""Garcia, Odonne…",3883,7766,"""Product Number…","""1-104-21292-7""","""Garcia, Odonne…",3883,True


#### Step 2: Collect statistics (describe generic behavior)
---

In [13]:
# use this table as an example to pull stats
# collect total quantity ordered and total amount paid per product from received orders

dfReceipt.groupby("PRODUCT_ID").agg(pl.sum("QTY").alias("total_qty_ordered"), pl.sum("TOTAL").alias("total_amount_paid")).join(dfProduct, on="PRODUCT_ID", how="inner").sample(5)

PRODUCT_ID,total_qty_ordered,total_amount_paid,PRODUCT,PRODUCT_CATEGORY,MANUFACTURER,PRICE
i64,i64,i64,str,str,str,i64
7777,2698,13522376,"""Product Number…","""0-310-79434-X""","""Powell-Barnes""",5012
9778,2317,6336995,"""Product Number…","""0-256-06737-6""","""Nelson, Nelson…",2735
2947,2255,15185170,"""Product Number…","""1-141-75086-4""","""Roth, Smith an…",6734
486,2752,21418816,"""Product Number…","""1-133-38253-3""","""Mcclure, Corte…",7783
7101,2840,26275680,"""Product Number…","""0-497-60622-4""","""Thornton Inc""",9252
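
Once per-product statistics are available, a simple way to describe "generic behavior" is to flag values that sit far from the historical mean. The sketch below uses only the standard library; the history list and the threshold of 3 standard deviations are illustrative assumptions, not values taken from the datasets above.

```python
from statistics import mean, stdev


def is_outlier(history, value, threshold=3.0):
    """Flag `value` when it lies more than `threshold` standard deviations from the mean."""
    if len(history) < 2:
        return False  # not enough history to estimate spread
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        return value != mu
    return abs(value - mu) / sigma > threshold


# Hypothetical order quantities previously seen for one product.
past_quantities = [12, 15, 14, 13, 16, 15, 14]
print(is_outlier(past_quantities, 15))
print(is_outlier(past_quantities, 48))
```

A flag like this can be stated in plain language inside the prompt, which gives the language model a statistical fact to reason about instead of raw rows.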


#### Step 3: building prompt
---

In [None]:
## TEST DATA Prompts ##
# respond in one word, is there an anomaly in this data.
# find anomalies in this data. 

# Record 1
# For Product # XXX, on average, XXX purchase orders are created.
# On average, XXX % of purchase orders are received on time, and
# the amount charged on the invoice has an average variance of XXX.XXX %.
# is it ok to pay such an invoice ?
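
The `XXX` placeholders in the test prompt can be filled programmatically. Here is a minimal sketch; `PROMPT_TEMPLATE` and `build_prompt` are hypothetical names, and the numbers passed in are made up for illustration rather than computed from the dataframes above.

```python
PROMPT_TEMPLATE = (
    "Respond in one word: is there an anomaly in this data? "
    "For product # {product_id}, on average {avg_orders:.0f} purchase orders are created. "
    "On average, {on_time_pct:.1f}% of purchase orders are received on time, and "
    "the amount charged on the invoice has an average variance of {variance_pct:.3f}%. "
    "Is it OK to pay such an invoice?"
)


def build_prompt(product_id, avg_orders, on_time_pct, variance_pct):
    """Fill the template with per-product statistics."""
    return PROMPT_TEMPLATE.format(
        product_id=product_id,
        avg_orders=avg_orders,
        on_time_pct=on_time_pct,
        variance_pct=variance_pct,
    )


# Illustrative numbers only; in practice they would come from the Step 2 statistics.
example_prompt = build_prompt(product_id=4686, avg_orders=100, on_time_pct=83.5, variance_pct=0.25)
print(example_prompt)
```

Keeping the template in one constant makes it easy to version the wording separately from the code that computes the statistics.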

#### Step 4: Checking one Invoice
---

In [15]:
dfReceiptSampleCheck = dfReceipt.filter(pl.col("PO_ID") == 51383)
dfReceiptSampleCheck

PO_ID,AS_OF_DATE,CUSTOMER_ID,VENDOR_ID,PRODUCT_ID,QTY,PRODUCT,PRODUCT_CATEGORY,MANUFACTURER,PRICE,TOTAL,PRODUCT_right,PRODUCT_CATEGORY_right,MANUFACTURER_right,PRICE_right,RECV_STATUS
i64,datetime[μs],i64,i64,i64,i64,str,str,str,i64,i64,str,str,str,i64,bool
51383,2023-04-16 00:00:00,7743,1962,4686,15,"""Product Number…","""1-07-919635-8""","""Rodriguez, San…",8970,134550,"""Product Number…","""1-07-919635-8""","""Rodriguez, San…",8970,True


#### Step 5: run invoice prompt through LLM (large language model) 
---

In [17]:
out = dfReceiptSampleCheck = dfReceipt.filter(pl.col("PO_ID") == 51383)
# out.head()

prompt = "respond in one word, is there an anomaly in this data. "
prompt += "For a Product # XXX , on an average, there are XXX purchase Orders created. "
prompt += "on average, XXX % of Purchase orders are received on time, and "
prompt += "is it ok to pay such an invoice ?"
prompt

def callChatGPT(prompt):
    # completion = openai.ChatCompletion.create(
    # model=model_engine,
    # messages=[
    #     {"role": "user", "content": prompt}
    # ])
    # return completion.choices[0].message.content
    return "Yes"

def callLlama(prompt):
    # dialogs = [[{"role": "user", "content": prompt}]]
    # results = generator.chat_completion(
    #     dialogs,  # type: ignore
    #     max_gen_len=max_gen_len,
    #     temperature=temperature,
    #     top_p=top_p,
    # )
    # return results[0]['generation']['content']
    return "Yes"
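
Model replies are free text, so it helps to normalize them before acting on them. The sketch below assumes the reply answers the final question in the prompt ("is it ok to pay such an invoice?"); `parse_decision` and the three outcome labels are hypothetical.

```python
def parse_decision(reply: str) -> str:
    """Map a free-text model reply to 'pay', 'hold', or 'review'."""
    words = reply.strip().lower().replace(".", " ").replace(",", " ").split()
    first = words[0] if words else ""
    if first == "yes":
        return "pay"
    if first == "no":
        return "hold"
    return "review"  # anything ambiguous goes to a human


for reply in ["Yes", "No.", "It depends on the vendor contract"]:
    print(reply, "->", parse_decision(reply))
```

Routing every unclear answer to manual review keeps the language model in an advisory role rather than letting it approve payments on its own.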

#### Duplicate Invoices
---

- Step 1: Read Image content to text
- Step 2: Create Corpus of documents
- Step 3: Remove punctuations, Pronouns etc.
- Step 4: Create bi | n-gram tokens
- Step 5: Creating Individual Rows of a Document Term Matrix | Tensors
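
Steps 3 and 4 can be sketched with the standard library alone. This is a simplified illustration, not the NLTK pipeline used in the cells below: `clean_tokens` and `ngrams` are hypothetical helpers, and stripping all punctuation also removes decimal points from amounts, which may or may not be acceptable for your invoices.

```python
import string


def clean_tokens(text: str) -> list:
    """Lowercase the text, strip punctuation, and split on whitespace."""
    table = str.maketrans("", "", string.punctuation)
    return text.lower().translate(table).split()


def ngrams(tokens, n=2):
    """Return consecutive n-token tuples (bigrams by default)."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = clean_tokens("Invoice No. 1001: Total due, $250.00!")
print(tokens)
print(ngrams(tokens))
```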

#### Approach to locate Duplicate Invoice
- Approach 1: Compare line by line
- Approach 2: Compare TF-IDF 
- Approach 3: Compare Hash Trick Documents

In [None]:
#########################
# Duplicate Invoices
#########################

# Step 1: Read Image content to text
#######################################
# Scripts read text from images
#######################################
# py -m pip install pytesseract  (open source wrapper; the Tesseract OCR engine must be installed separately)
# py -m pip install Pillow  (provides the PIL module)

import pytesseract
from PIL import Image

##############################################################################
# in case if tesseract is not included in PATH
pytesseract.pytesseract.tesseract_cmd = r'C:\amit.la\WIP\RPA\downloads\ts\tesseract.exe'
##############################################################################

def read_image_text(image_path):
    """
    Reads text from an image file using Tesseract OCR.

    Args:
        image_path (str): The file path to the input image.

    Returns:
        str: The extracted text from the image.
    """
    # Load the image file
    image = Image.open(image_path)

    # Use Tesseract OCR to extract the text from the image
    text = pytesseract.image_to_string(image)

    return text

# Example usage
image_path = "../downloads/AAPL.png"
# image_path = "../downloads/medical_form.png"
# image_path = "../downloads/email.png"
# image_path = "../downloads/vaccine.png"
# image_path = "../downloads/blurry_1.png"
# image_path = "../downloads/blurry_2.png"
text = read_image_text(image_path)
print(text)

In [None]:
# Step 2: Create Corpus of documents
# Step 3: Remove punctuations, Pronouns etc.
# Step 4: Create bi | n-gram tokens
# Step 5: Creating Individual Rows of a Document Term Matrix | Tensors


import nltk
nltk.download('punkt')

# tokens = nltk.word_tokenize(text)
# tokens

# text.discard('\n')
# create bi-grams
# list(bigrams(['more', 'is', 'said', 'than', 'done']))

######################
## tokenize by words
######################
from nltk.tokenize import word_tokenize
print(word_tokenize(text))


######################
## tokenize by sentence
######################
from nltk.tokenize import sent_tokenize
print(sent_tokenize(text))

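######################
## generate bag of words
######################
# !pip install gensim
import gensim

# treat each sentence as one document: a list of lowercase word tokens
gen_docs = [[w.lower() for w in word_tokenize(sentence)] for sentence in sent_tokenize(text)]
# map each unique token to an integer id
dictionary = gensim.corpora.Dictionary(gen_docs)
# convert each document to a list of (token_id, count) pairs
corpus = [dictionary.doc2bow(gen_doc) for gen_doc in gen_docs]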

#### Approach 1: Compare line by line
---

In [None]:
# simple code to compare txt files line by line
# find similarities

with open('input_1.txt', 'r') as file1:
    with open('input_2.txt', 'r') as file2:
        same = set(file1).intersection(file2)

same.discard('\n')

# write the shared lines to their own file so the difference output below does not overwrite them
with open('same.txt', 'w') as file_out:
    for line in same:
        file_out.write(line)

# simple code to compare txt files line by line
# find difference

with open('input_1.txt', 'r') as file1:
    with open('input_2.txt', 'r') as file2:
        difference = set(file1).difference(file2)

difference.discard('\n')

with open('out.txt', 'w') as file_out:
    for line in difference:
        file_out.write(line)
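
Exact line matching misses near-duplicates, for example when OCR introduces small differences. As a complementary sketch, the standard library's `difflib.SequenceMatcher` returns a similarity ratio between two texts. The sample invoice strings are made up, and any cut-off for calling two invoices duplicates would need tuning on real data.

```python
from difflib import SequenceMatcher


def similarity(text_a: str, text_b: str) -> float:
    """Return a ratio in [0, 1]; 1.0 means the two texts are identical."""
    return SequenceMatcher(None, text_a, text_b).ratio()


invoice_a = "Invoice 1001\nVendor: Mann Group\nTotal: 158332\n"
invoice_b = "Invoice 1001\nVendor: Mann Group\nTotal: 158332\n"
invoice_c = "Invoice 2002\nVendor: Rice Inc\nTotal: 4609\n"

print(similarity(invoice_a, invoice_b))
print(similarity(invoice_a, invoice_c))
```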

#### Approach 2: Compare TF-IDF
---

In [None]:
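import gensim
import numpy as np
from nltk import FreqDist
from nltk.tokenize import word_tokenize

# frequency distribution over word tokens (passing the raw string would count characters)
fdist1 = FreqDist(word_tokenize(text))
print(fdist1)
fdist1.most_common(50)

# fit TF-IDF weights on the bag-of-words corpus and print each document's weighted terms
tf_idf = gensim.models.TfidfModel(corpus)
for doc in tf_idf[corpus]:
    print([[dictionary[token_id], np.around(freq, decimals=2)] for token_id, freq in doc])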

In [None]:
# similarity index
# building the index
sims = gensim.similarities.Similarity('workdir/', tf_idf[corpus],
                                        num_features=len(dictionary))

file2_docs = []

with open('demofile2.txt') as f:
    tokens = sent_tokenize(f.read())
    for line in tokens:
        file2_docs.append(line)

print("Number of documents:", len(file2_docs))
for line in file2_docs:
    query_doc = [w.lower() for w in word_tokenize(line)]
    # convert the query sentence to a bag of words using the existing dictionary
    query_doc_bow = dictionary.doc2bow(query_doc)
    # perform a similarity query against the corpus
    query_doc_tf_idf = tf_idf[query_doc_bow]
    # one similarity score per document in the corpus
    print('Comparing Result:', sims[query_doc_tf_idf])

In [None]:
 # building the index
sims = gensim.similarities.Similarity('downloads/',tf_idf[corpus],
                                        num_features=len(dictionary))
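
To show what the TF-IDF comparison computes, here is a dependency-free sketch of TF-IDF weighting followed by cosine similarity. It is not gensim's implementation: the smoothed IDF formula `log((1 + N) / (1 + df)) + 1` is one common variant that I chose for illustration, and the three tokenized documents are made up.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(docs)
    # number of documents in which each term appears at least once
    doc_freq = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc) or 1  # avoid dividing by zero for an empty document
        vectors.append({
            term: (count / total) * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0)
            for term, count in counts.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


docs = [
    "invoice 1001 mann group total 158332".split(),
    "invoice 1001 mann group total 158332".split(),
    "invoice 2002 rice inc total 4609".split(),
]
vecs = tfidf_vectors(docs)
print(round(cosine(vecs[0], vecs[1]), 3))
print(round(cosine(vecs[0], vecs[2]), 3))
```

Terms that appear in every document, such as "invoice" and "total", receive the lowest weight, so the score is driven by the terms that actually distinguish one invoice from another.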

#### Approach 3: Compare Hash Trick Documents
---

`CUDA | Tensor comparison on GPU machines`

In [None]:
# convert words into Tensor

# Trick #1
from sklearn import preprocessing
import torch

labels = ['cat', 'dog', 'mouse', 'elephant', 'pandas']
le = preprocessing.LabelEncoder()
targets = le.fit_transform(labels)
# targets: array([0, 1, 2, 3, 4])

targets = torch.as_tensor(targets)
# targets: tensor([0, 1, 2, 3, 4])

In [None]:
# Trick #2
import torch

words = ['שלום', 'beautiful', 'world']
max_l = 0
ts_list = []
for w in words:
    ts_list.append(torch.ByteTensor(list(bytes(w, 'utf8'))))
    max_l = max(ts_list[-1].size()[0], max_l)

w_t = torch.zeros((len(ts_list), max_l), dtype=torch.uint8)
for i, ts in enumerate(ts_list):
    w_t[i, 0:ts.size()[0]] = ts
w_t

Out[]
tensor([[215, 169, 215, 156, 215, 149, 215, 157,   0],
        [ 98, 101,  97, 117, 116, 105, 102, 117, 108],
        [119, 111, 114, 108, 100,   0,   0,   0,   0]], dtype=torch.uint8)
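
For reference, the hashing trick itself maps each token to a bucket in a fixed-length vector, so documents of any vocabulary become comparable vectors of the same size. The sketch below is a standard-library illustration under assumptions: `hashed_vector`, the 64-bucket size, and the sample tokens are hypothetical. It uses `zlib.crc32` because Python's built-in `hash()` for strings changes between runs.

```python
import math
import zlib


def hashed_vector(tokens, n_features=64):
    """Map tokens into a fixed-length count vector using a stable hash."""
    vec = [0] * n_features
    for tok in tokens:
        index = zlib.crc32(tok.encode("utf-8")) % n_features
        vec[index] += 1
    return vec


def cosine(u, v):
    """Cosine similarity between two dense vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


doc_a = "invoice 1001 mann group total 158332".split()
doc_b = "invoice 1001 mann group total 158332".split()
doc_c = "invoice 2002 rice inc total 4609".split()

vec_a, vec_b, vec_c = (hashed_vector(d) for d in (doc_a, doc_b, doc_c))
print(cosine(vec_a, vec_b))
print(cosine(vec_a, vec_c))
```

The resulting lists can be passed to `torch.as_tensor` for GPU comparison, in the same spirit as the tricks above. Different tokens can collide in the same bucket, so a larger `n_features` trades memory for fewer collisions.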



## Conclusion:

The use cases discussed above exemplify sophisticated business processes, and there is certainly a lot more that is not covered here. 

These use cases merely scratch the surface of what can be achieved with these advanced tools. 

You may argue that the same results can be attained by applying simple algebra to these datasets, and I fully agree with this observation.

In essence, the entire field, encompassing Data Science, Python, Llama, and ChatGPT, revolves around uncovering statistical associations within data.

However, it is crucial to recognize that deploying Llama or ChatGPT-like models does not diminish the importance of traditional statistics; instead, these models should be employed to streamline specific tasks.