## Initialization

In [1]:
!nvidia-smi

Fri Oct 18 16:15:21 2024       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.07              Driver Version: 550.90.07      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  Tesla T4                       Off |   00000000:00:04.0 Off |                    0 |
| N/A   34C    P8              9W /   70W |       1MiB /  15360MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  Tesla T4                       Off |   00

In [2]:
# Install necessary packages
!pip install -Uqqq pip --progress-bar off
!pip install -qqq torch==2.0.1 --progress-bar off
!pip install -qqq transformers==4.31.0 --progress-bar off
!pip install -qqq langchain==0.0.266 --progress-bar off
!pip install -qqq chromadb==0.4.5 --progress-bar off
!pip install -qqq pypdf==3.15.0 --progress-bar off
!pip install -qqq xformers==0.0.20 --progress-bar off
!pip install -qqq sentence_transformers==2.2.2 --progress-bar off
!pip install -qqq InstructorEmbedding==1.0.1 --progress-bar off
!pip install -qqq pdf2image==1.16.3 --progress-bar off
!pip install -qqq gdown
!pip install -qqq pysqlite3-binary
!pip install auto-gptq --extra-index-url https://huggingface.github.io/autogptq-index/whl/cu118/
!sudo apt-get install -y poppler-utils

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
kaggle-environments 1.14.15 requires transformers>=4.33.1, but you have transformers 4.31.0 which is incompatible.
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
ydata-profiling 4.6.4 requires numpy<1.26,>=1.16.0, but you have numpy 1.26.4 which is incompatible.
ydata-profiling 4.6.4 requires pydantic>=2, but you have pydantic 1.10.18 which is incompatible.
Looking in indexes: https://pypi.org/simple, https://huggingface.github.io/autogptq-index/whl/cu118/
Collecting auto-gptq
  Downloading https://huggingface.github.io/autogptq-index/whl/cu118/auto-gptq/auto_gptq-0.7.1%2Bcu118-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (23.7 MB)

In [3]:
import torch
from auto_gptq import AutoGPTQForCausalLM
from langchain import HuggingFacePipeline, PromptTemplate
from langchain.chains import RetrievalQA
from langchain.document_loaders import PyPDFDirectoryLoader
from langchain.embeddings import HuggingFaceInstructEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import Chroma
from pdf2image import convert_from_path
from transformers import AutoTokenizer, TextStreamer, pipeline
import os
import sys
import pysqlite3
import warnings
# Suppress all warnings
warnings.filterwarnings("ignore")

# Set device
DEVICE = "cuda:0" if torch.cuda.is_available() else "cpu"

2024-10-18 16:19:58.839112: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-10-18 16:19:58.839233: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-10-18 16:19:58.982902: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered


## Load Model and Tokenizer Once

In [4]:
def load_model_and_tokenizer():
    model_name_or_path = "TheBloke/Llama-2-13B-chat-GPTQ"
    model_basename = "model"
    tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, use_fast=True)
    model = AutoGPTQForCausalLM.from_quantized(
        model_name_or_path,
        revision="gptq-4bit-128g-actorder_True",
        model_basename=model_basename,
        use_safetensors=True,
        trust_remote_code=True,
        inject_fused_attention=False,
        device=DEVICE,
        quantize_config=None,
    )
    return model, tokenizer


## Load PDF Documents

In [5]:
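def load_pdf_documents(pdf_path):
    # Accept either a single PDF file or a directory of PDFs.
    # PyPDFDirectoryLoader finds nothing when it is given a file path.
    if os.path.isfile(pdf_path):
        from langchain.document_loaders import PyPDFLoader
        loader = PyPDFLoader(pdf_path)
    else:
        loader = PyPDFDirectoryLoader(pdf_path)
    docs = loader.load()  # one Document per PDF page
    return docs or None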


## Setup Embeddings and Text Splitter

In [6]:
def setup_embeddings():
    embeddings = HuggingFaceInstructEmbeddings(
        model_name="hkunlp/instructor-large", model_kwargs={"device": DEVICE}
    )
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=1024, chunk_overlap=64)
    return embeddings, text_splitter
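
To make `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of fixed-size chunking with overlap. It is a simplification and not the algorithm `RecursiveCharacterTextSplitter` uses (that splitter prefers to break on separators such as paragraph and line breaks). The helper `sliding_chunks` and the sample text are hypothetical.

```python
def sliding_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into windows of chunk_size characters that share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


sample = "abcdefghij" * 3  # 30 characters of stand-in text
demo_chunks = sliding_chunks(sample, chunk_size=12, chunk_overlap=4)
for chunk in demo_chunks:
    print(repr(chunk))
```

With the notebook's settings (`chunk_size=1024`, `chunk_overlap=64`), neighbouring chunks share up to 64 characters, so a field that straddles a chunk boundary is less likely to be cut in half.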


## Generate Prompt

In [7]:
def generate_prompt(prompt: str, system_prompt: str) -> str:
    return f"""
[INST] <<SYS>>
{system_prompt}
<</SYS>>

{prompt} [/INST]
""".strip()
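
As an illustration, the sketch below renders a complete prompt in the Llama 2 chat layout, assuming the system block is delimited by `<<SYS>>` and `<</SYS>>` inside the first `[INST]` turn. The helper `build_llama2_prompt` is hypothetical, and the sample context is a fragment of the invoice text shown later in this notebook.

```python
def build_llama2_prompt(user_message: str, system_prompt: str) -> str:
    # Assumed Llama 2 chat layout: the system block sits inside the first [INST] turn.
    return (
        f"[INST] <<SYS>>\n{system_prompt}\n<</SYS>>\n\n"
        f"{user_message} [/INST]"
    )


rendered = build_llama2_prompt(
    user_message="Invoice no: 61356291\n\nQuestion: What is the invoice number?",
    system_prompt="Use the following pieces of context to answer the question at the end.",
)
print(rendered)
```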


## Setup QA Chain

In [8]:
def setup_qa_chain(db, model, tokenizer):
    DEFAULT_SYSTEM_PROMPT = """
    You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.

    If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information.
    """.strip()

    SYSTEM_PROMPT = "Use the following pieces of context to answer the question at the end. If you don't know the answer, just say 'null'. Do not try to make up an answer."

    template = generate_prompt(
        """
{context}

Question: {question}
""",
        system_prompt=SYSTEM_PROMPT,
    )

    prompt = PromptTemplate(template=template, input_variables=["context", "question"])

    streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
    text_pipeline = pipeline(
        "text-generation",
        model=model,
        tokenizer=tokenizer,
        max_new_tokens=1024,
        temperature=0,
        top_p=0.95,
        repetition_penalty=1.15,
        streamer=streamer,
    )

    llm = HuggingFacePipeline(pipeline=text_pipeline, model_kwargs={"temperature": 0})

    return RetrievalQA.from_chain_type(
        llm=llm,
        chain_type="stuff",
        retriever=db.as_retriever(search_kwargs={"k": 2}),
        return_source_documents=True,
        chain_type_kwargs={"prompt": prompt},
    )
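
The `"stuff"` chain type places the retrieved chunks into the `{context}` slot of the prompt in one go. The plain-Python approximation below shows the idea; it does not call LangChain, and the helper name `stuff_context` and the `"\n\n"` separator are assumptions made for illustration.

```python
def stuff_context(chunks: list[str], question: str, template: str,
                  separator: str = "\n\n") -> str:
    """Join the retrieved chunks and substitute them into the prompt template."""
    context = separator.join(chunks)
    return template.format(context=context, question=question)


template = "{context}\n\nQuestion: {question}"
# Two chunks, mirroring search_kwargs={"k": 2}
retrieved = ["Invoice no: 61356291", "Supplier:\nChapman, Kim and Green"]
filled = stuff_context(retrieved, "What is the invoice number?", template)
print(filled)
```

Because every retrieved chunk lands in a single prompt, `k` and `chunk_size` together determine how much of the model's context window each question consumes.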


## Process a Single PDF

In [9]:
# Disable the parallelism warning
os.environ["TOKENIZERS_PARALLELISM"] = "false"

def process_pdf(pdf_path, model, tokenizer):
    docs = load_pdf_documents(pdf_path)
    if not docs:
        return f"No documents found in {pdf_path}"
    
    embeddings, text_splitter = setup_embeddings()
    
    results = []
    
    for doc in docs:
        texts = text_splitter.split_documents([doc])
        
        # Create and persist the vector store
        if os.path.exists("db"):
            os.system("rm -rf db")
        db = Chroma.from_documents(texts, embeddings, persist_directory="db")
        
        qa_chain = setup_qa_chain(db, model, tokenizer)
        
        # Query the model for each required field
        questions = {
            "invoice_number": "What is the invoice number?",
            "invoice_date": "What is the invoice date?",
            "seller_name": "What is the seller's name?",
            "seller_address": "What is the seller's address?",
            "seller_phone": "What is the seller's phone number?",
            "client_name": "What is the client's name?",
            "client_address": "What is the client's address?",
            "client_phone": "What is the client's phone number?",
            "items": "What are the items listed in the invoice?",
            "subtotal": "What is the subtotal amount?",
            "grand_total": "What is the grand total amount?"
        }
        
        result = {}
        for key, question in questions.items():
            result[key] = qa_chain(question)
        
        results.append(result)
    
    return results
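
Each value stored in `result` is the full chain output (a dict with `query`, `result`, and `source_documents`), and the answer text carries conversational filler. The following sketch, under those assumptions, reduces one page's output to a flat field-to-answer mapping and turns "null" answers into `None`. The helpers `clean_answer` and `flatten_result` are hypothetical; the sample strings are copied from the output shown later in this notebook.

```python
from typing import Optional


def clean_answer(raw: str) -> Optional[str]:
    """Strip whitespace and map answers that start with 'null' to None."""
    text = raw.strip()
    if text.lower().startswith("null"):
        return None
    return text


def flatten_result(result: dict) -> dict:
    """Keep only the cleaned answer text for each field."""
    return {field: clean_answer(output["result"]) for field, output in result.items()}


sample = {
    "invoice_number": {
        "query": "What is the invoice number?",
        "result": "  Sure! Based on the provided information, the invoice number is:\n\n61356291",
    },
    "seller_phone": {
        "query": "What is the seller's phone number?",
        "result": " Null. There is no information about the seller's phone number in the provided invoice.",
    },
}
flat = flatten_result(sample)
print(flat)
```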

## Process All PDFs

In [10]:
def process_all_pdfs(pdf_folder_path, model, tokenizer):
    results = {}
    for pdf_file in os.listdir(pdf_folder_path):
        if pdf_file.endswith(".pdf"):
            pdf_path = os.path.join(pdf_folder_path, pdf_file)
            print(f"Processing {pdf_file}...")
            # process_pdf builds its own vector store for every page it loads
            result = process_pdf(pdf_path, model, tokenizer)
            results[pdf_file] = result
            print(f"Result for {pdf_file}: {result}")

    # Return the per-file results so the caller can inspect or save them
    return results
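
The per-file `results` dictionary built above holds `Document` objects under `source_documents`, which the `json` module cannot serialize directly. One possible way to export the answers, sketched here with a hypothetical helper `to_serializable` and made-up file names, is to keep only the answer text before dumping.

```python
import json


def to_serializable(all_results: dict) -> dict:
    """Keep only the answer text per field, per page, per PDF file."""
    return {
        pdf_file: [
            {field: output["result"] for field, output in page.items()}
            for page in pages
        ]
        for pdf_file, pages in all_results.items()
        # process_pdf returns a message string when nothing was loaded; skip those
        if isinstance(pages, list)
    }


example = {
    "invoice_0.pdf": [  # hypothetical file name
        {
            "invoice_number": {
                "query": "What is the invoice number?",
                "result": "61356291",
                "source_documents": [object()],  # stand-in for Document objects
            }
        }
    ],
    "empty.pdf": "No documents found in empty.pdf",
}
payload = json.dumps(to_serializable(example), indent=2)
print(payload)
```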


## Main Execution

In [11]:
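# Load the model and tokenizer once and reuse them for every PDF
model, tokenizer = load_model_and_tokenizer()

# Swap in pysqlite3 for the standard sqlite3 module before Chroma is first used
sys.modules["sqlite3"] = pysqlite3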

tokenizer_config.json:   0%|          | 0.00/727 [00:00<?, ?B/s]

tokenizer.model:   0%|          | 0.00/500k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.84M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/411 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/837 [00:00<?, ?B/s]

1. You disabled CUDA extensions compilation by setting BUILD_CUDA_EXT=0 when install auto_gptq from source.
2. You are using pytorch without CUDA support.
3. CUDA and nvcc are not installed in your device.


config.json:   0%|          | 0.00/761 [00:00<?, ?B/s]

quantize_config.json:   0%|          | 0.00/187 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/7.26G [00:00<?, ?B/s]

INFO - The layer lm_head is not quantized.


In [12]:
pdf_path = "/kaggle/input/invoice-pfds"

In [13]:
results = process_pdf(pdf_path, model, tokenizer)

.gitattributes:   0%|          | 0.00/1.48k [00:00<?, ?B/s]

1_Pooling/config.json:   0%|          | 0.00/270 [00:00<?, ?B/s]

2_Dense/config.json:   0%|          | 0.00/116 [00:00<?, ?B/s]

pytorch_model.bin:   0%|          | 0.00/3.15M [00:00<?, ?B/s]

README.md:   0%|          | 0.00/66.3k [00:00<?, ?B/s]

config.json:   0%|          | 0.00/1.53k [00:00<?, ?B/s]

config_sentence_transformers.json:   0%|          | 0.00/122 [00:00<?, ?B/s]

pytorch_model.bin:   0%|          | 0.00/1.34G [00:00<?, ?B/s]

sentence_bert_config.json:   0%|          | 0.00/53.0 [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/2.20k [00:00<?, ?B/s]

spiece.model:   0%|          | 0.00/792k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/2.42M [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/2.41k [00:00<?, ?B/s]

modules.json:   0%|          | 0.00/461 [00:00<?, ?B/s]

load INSTRUCTOR_Transformer
max_seq_length  512


The model 'LlamaGPTQForCausalLM' is not supported for text-generation. Supported models are ['BartForCausalLM', 'BertLMHeadModel', 'BertGenerationDecoder', 'BigBirdForCausalLM', 'BigBirdPegasusForCausalLM', 'BioGptForCausalLM', 'BlenderbotForCausalLM', 'BlenderbotSmallForCausalLM', 'BloomForCausalLM', 'CamembertForCausalLM', 'CodeGenForCausalLM', 'CpmAntForCausalLM', 'CTRLLMHeadModel', 'Data2VecTextForCausalLM', 'ElectraForCausalLM', 'ErnieForCausalLM', 'FalconForCausalLM', 'GitForCausalLM', 'GPT2LMHeadModel', 'GPT2LMHeadModel', 'GPTBigCodeForCausalLM', 'GPTNeoForCausalLM', 'GPTNeoXForCausalLM', 'GPTNeoXJapaneseForCausalLM', 'GPTJForCausalLM', 'LlamaForCausalLM', 'MarianForCausalLM', 'MBartForCausalLM', 'MegaForCausalLM', 'MegatronBertForCausalLM', 'MusicgenForCausalLM', 'MvpForCausalLM', 'OpenLlamaForCausalLM', 'OpenAIGPTLMHeadModel', 'OPTForCausalLM', 'PegasusForCausalLM', 'PLBartForCausalLM', 'ProphetNetForCausalLM', 'QDQBertLMHeadModel', 'ReformerModelWithLMHead', 'RemBertForCausal

 Sure! Based on the provided information, the invoice number is:

61356291
 Sure! Based on the given information, I can determine that the invoice date is September 6th, 2012. This is evident from the "Date of Issue" field, which reads "09/06/2012." Therefore, the correct answer to your question is September 6th, 2012.
 Sure! Based on the provided invoice, the seller's name is:

Chapman, Kim, and Green

This information can be found in the "Supplier" section of the invoice.
 Based on the given invoice, the seller's address is:

Chapman, Kim and Green
64731 James Branch
Smithmouth, NC 26872

This information can be found under the "Supplier" section of the invoice.
 Null. There is no information about the seller's phone number in the provided invoice.
 Sure! Based on the provided invoice, the client's name is Rodriguez-Stevens.
 Based on the provided invoice, the client's address is:

Rodriguez-Stevens
2280 Angela Plain
Hortonshire, MS 93248
 Null. There is no client's phone number in t

In [15]:
results

[{'invoice_number': {'query': 'What is the invoice number?',
   'result': '  Sure! Based on the provided information, the invoice number is:\n\n61356291',
   'source_documents': [Document(page_content='Invoice no: 61356291\nDate of issue:\n 09/06/2012\nSupplier:\nChapman, Kim and Green \n64731 James Branch \nSmithmouth, NC 26872\nTax Id: 949-84-9105\nIBAN: GB50ACIE59715038217063\nCustomer:\nRodriguez-Stevens \n2280 Angela Plain \nHortonshire, MS 93248\nTax Id: 939-98-8477\nITEMS\nNo.\nDescription\n Qty\n UM\n Net price\n Net worth\n VAT [%]\n Gross\nworth\nWine Glasses Goblets Pair Clear\nGlass\n1.\n 5,00\n each\n 12,00\n 60,00\n 10%\n 66,00\nWith Hooks Stemware Storage\nMultiple Uses Iron Wine Rack\nHanging Glass\n2.\n 4,00\n each\n 28,08\n 112,32\n 10%\n 123,55\nReplacement Corkscrew Parts\nSpiral Worm Wine Opener Bottle\nHoudini\n3.\n 1,00\n each\n 7,50\n 7,50\n 10%\n 8,25\nHOME ESSENTIALS GRADIENT\nSTEMLESS WINE GLASSES SET\nOF 4 20 FL OZ (591 ml) NEW\n4.\n 1,00\n each\n 12,99\n 12