# Table of contents
1. [Introduction](#introduction)
2. [Data Ingestion](#data-ingestion)
3. [Data Extraction](#data-extraction)
4. [Display the output](#display-the-output)
5. [Display the output in a table](#display-the-output-in-a-table)
6. [Conclusion](#conclusion)

## Introduction
In this notebook, we extract data from a PDF invoice using the Gemini API and display the output in table format.

## Data Ingestion
We will load a PDF file from the invoice directory and transform it into a base64-encoded UTF-8 string.

In [5]:
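import os

from components.data_ingestion import DataIngestion
from configs import ROOT_DIR, EMPTY_RETURN_PROMPT

# Locate the sample invoices inside the project.
invoice_dir = ROOT_DIR / 'data' / 'invoices'
# Note: os.listdir returns entries in arbitrary order, so index 5 is not guaranteed to be stable.
invoice_files = os.listdir(invoice_dir)

# Pick one invoice and transform it into a base64 string.
file_path = invoice_dir / invoice_files[5]
print(file_path)
data_ingestion = DataIngestion()
data = data_ingestion.transform(file_path)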

c:\Users\ravi.kumar\github\OCR\data\invoices\crViewer - 2023-08-08T091520.889.pdf
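
The internals of `DataIngestion.transform` are not shown in this notebook. As a rough sketch of what a base64 UTF-8 transformation could look like, assuming the step simply reads the file's bytes and encodes them (the function names `encode_bytes` and `pdf_to_base64` are hypothetical):

```python
import base64
from pathlib import Path


def encode_bytes(raw: bytes) -> str:
    """Base64-encode raw bytes and return the result as a UTF-8 string."""
    return base64.b64encode(raw).decode("utf-8")


def pdf_to_base64(path) -> str:
    """Read a PDF from disk and return its base64 text (hypothetical helper)."""
    return encode_bytes(Path(path).read_bytes())


# A PDF header such as "%PDF-1.4" encodes to text starting with "JVBERi0".
print(encode_bytes(b"%PDF-1.4"))  # JVBERi0xLjQ=
```

Returning a string rather than bytes makes the result easy to embed in a JSON request body.
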


## Data Extraction
We will send the encoded PDF to the Gemini API, here the `gemini-1.5-flash-8b` model with the `EMPTY_RETURN_PROMPT` prompt, to extract the invoice data.

In [6]:
from components.model import OCR_Model
ocr_model = OCR_Model(model="gemini-1.5-flash-8b", prompt=EMPTY_RETURN_PROMPT)
invoice = ocr_model.extract(data)
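
`OCR_Model.extract` is a project-specific wrapper whose implementation is not shown here. As an illustration only, the sketch below builds the kind of JSON body that the Gemini REST `generateContent` endpoint accepts for an inline PDF plus a text prompt. The helper name `build_request_body` and the sample prompt are assumptions, and no network call is made.

```python
import json

# Assumed endpoint pattern for the model used above; an API key is also required.
ENDPOINT = (
    "https://generativelanguage.googleapis.com/v1beta/"
    "models/gemini-1.5-flash-8b:generateContent"
)


def build_request_body(pdf_b64: str, prompt: str) -> dict:
    """Pair a base64-encoded PDF with a text prompt in a single request."""
    return {
        "contents": [
            {
                "parts": [
                    {"inline_data": {"mime_type": "application/pdf", "data": pdf_b64}},
                    {"text": prompt},
                ]
            }
        ]
    }


# Placeholder inputs: a tiny base64 string and a hypothetical prompt.
body = build_request_body("JVBERi0xLjQ=", "Extract the invoice fields as JSON.")
print(json.dumps(body, indent=2))
```

The wrapper in this project may well use an SDK instead of raw REST; the sketch only shows how the base64 string and the prompt travel together.
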

## Display the output
We will display the extracted data as JSON, with the invoice's line items under `line_items` and its header fields under `headers`.

In [7]:
ocr_model.display(invoice)

{'line_items': [{'ItemPosition': 1,
   'ProductCode': '',
   'Description': 'Non Contract Charges covering all vehicles laid down in Appendix 2 of the operations manual James Hall & Co Ltd',
   'Quantity': 1,
   'UnitPrice': 169.6,
   'ItemVatRate': 20,
   'TotalAmount': 169.6}],
 'headers': {'suppName': 'Vaculug Limited',
  'invNo': '109948',
  'invDate': '2023-06-08',
  'dueDate': 'string',
  'orderNo': '500389',
  'custName': 'James Hall & Co Ltd',
  'custAddress': 'Hoghton chambers Houghton Street southport Lancashire PR9 OTB',
  'amountNet': 169.6,
  'amountVat': 33.92,
  'amountTotal': 203.52,
  'currency': '£'}}
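
In the output above, `dueDate` holds the literal value `'string'`, which looks like a schema placeholder rather than a date read from the invoice. A minimal sketch of a sanity check on the extracted dictionary follows. It assumes that line `TotalAmount` values are net of VAT (as they appear to be here), and the placeholder list and function names are hypothetical.

```python
# Values a model may return when it cannot find a field (an assumed list).
PLACEHOLDERS = {"", "string", "none", "n/a"}


def find_placeholders(headers: dict) -> list:
    """Return header keys whose string value looks like a placeholder."""
    return [
        key
        for key, value in headers.items()
        if isinstance(value, str) and value.strip().lower() in PLACEHOLDERS
    ]


def totals_consistent(invoice: dict, tol: float = 0.01) -> bool:
    """Check line totals against amountNet, and net + VAT against amountTotal."""
    headers = invoice["headers"]
    line_sum = sum(item["TotalAmount"] for item in invoice["line_items"])
    net_ok = abs(line_sum - headers["amountNet"]) <= tol
    total_ok = (
        abs(headers["amountNet"] + headers["amountVat"] - headers["amountTotal"]) <= tol
    )
    return net_ok and total_ok


# A trimmed copy of the output shown above.
sample = {
    "line_items": [{"ItemPosition": 1, "TotalAmount": 169.6}],
    "headers": {
        "invDate": "2023-06-08",
        "dueDate": "string",
        "amountNet": 169.6,
        "amountVat": 33.92,
        "amountTotal": 203.52,
    },
}

print(find_placeholders(sample["headers"]))  # ['dueDate']
print(totals_consistent(sample))  # True
```

The tolerance absorbs floating-point rounding when adding amounts such as 169.6 and 33.92.
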

## Display the output in a table
We will display the same extracted data as tables by passing `html=True`.

In [8]:
ocr_model.display(invoice, html=True)

line_items

| ItemPosition | ProductCode | Description | Quantity | UnitPrice | ItemVatRate | TotalAmount |
|---|---|---|---|---|---|---|
| 1 | | Non Contract Charges covering all vehicles laid down in Appendix 2 of the operations manual James Hall & Co Ltd | 1 | 169.6 | 20 | 169.6 |

headers

| Field | Value |
|---|---|
| suppName | Vaculug Limited |
| invNo | 109948 |
| invDate | 2023-06-08 |
| dueDate | string |
| orderNo | 500389 |
| custName | James Hall & Co Ltd |
| custAddress | Hoghton chambers Houghton Street southport Lancashire PR9 OTB |
| amountNet | 169.6 |
| amountVat | 33.92 |
| amountTotal | 203.52 |
| currency | £ |
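
How `ocr_model.display(invoice, html=True)` builds its tables is not shown in this notebook. As a minimal sketch of one way to turn the extracted dictionary into HTML tables using only the standard library (the function names are hypothetical, and this is not necessarily what `OCR_Model` does):

```python
from html import escape


def dict_to_html_table(record: dict) -> str:
    """Render a flat dict as a two-column HTML table (key, value)."""
    rows = "".join(
        f"<tr><th>{escape(str(key))}</th><td>{escape(str(value))}</td></tr>"
        for key, value in record.items()
    )
    return f"<table>{rows}</table>"


def records_to_html_table(records: list) -> str:
    """Render a list of dicts sharing the same keys as one HTML table."""
    if not records:
        return "<table></table>"
    columns = list(records[0].keys())
    head = "".join(f"<th>{escape(str(col))}</th>" for col in columns)
    body = "".join(
        "<tr>"
        + "".join(f"<td>{escape(str(row.get(col, '')))}</td>" for col in columns)
        + "</tr>"
        for row in records
    )
    return f"<table><tr>{head}</tr>{body}</table>"


# Header fields map to a key/value table; escape() protects characters like "&".
print(dict_to_html_table({"custName": "James Hall & Co Ltd", "currency": "£"}))
```

In a notebook, the returned string could be rendered with `IPython.display.HTML`; `records_to_html_table` would handle the `line_items` list and `dict_to_html_table` the `headers` dictionary.
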


## Conclusion
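We extracted the invoice data from the PDF with the Gemini API and displayed it both as JSON and in table format. One field, `dueDate`, came back as the literal value `'string'`, so the extracted values should be reviewed before they are used.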