# Table of contents
1. [Introduction](#introduction)
2. [Data Ingestion](#data-ingestion)
3. [Data Extraction](#data-extraction)
4. [Display the output](#display-the-output)
5. [Display the output in a table](#display-the-output-in-a-table)
6. [Conclusion](#conclusion)

## Introduction
In this notebook, we extract the data from a given PDF invoice with the Gemini API and display the output in a table format.

## Data Ingestion
We will pick a PDF file from the invoices directory and transform it into a base64-encoded UTF-8 string.

In [1]:
from components.data_ingestion import DataIngestion
from configs import ROOT_DIR
import os

invoice_dir = ROOT_DIR / 'data' / 'invoices'
# os.listdir returns entries in arbitrary order; sort so the index below is reproducible
invoice_files = sorted(os.listdir(invoice_dir))

file_path = invoice_dir / invoice_files[5]
print(file_path)
data_ingestion = DataIngestion()
data = data_ingestion.transform(file_path)

c:\Users\ravi.kumar\github\OCR\data\invoices\Crowshall Invoice 50568.pdf
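
The internals of `DataIngestion.transform` are not shown in this notebook. As a rough illustration of what "base64 UTF-8 format" means, the sketch below reads a file as bytes and encodes it with the standard library. The helper names `encode_bytes`, `pdf_to_base64` and `base64_to_bytes` are hypothetical and are not part of `components.data_ingestion`.

```python
import base64
from pathlib import Path


def encode_bytes(raw: bytes) -> str:
    """Base64-encode raw bytes and return the result as a UTF-8 string."""
    return base64.b64encode(raw).decode("utf-8")


def pdf_to_base64(path: Path) -> str:
    """Read a file from disk as bytes and base64-encode its content."""
    return encode_bytes(Path(path).read_bytes())


def base64_to_bytes(encoded: str) -> bytes:
    """Invert encode_bytes; handy for checking that nothing was lost."""
    return base64.b64decode(encoded)
```

Decoding the string again should return exactly the original bytes, which gives a cheap sanity check before the payload is sent to the model.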


## Data Extraction
We will send the encoded PDF to a Gemini model, then pass the model output to `InvoiceExtractor` to obtain the structured invoice data.

In [2]:
from components.model import OCR_Model
ocr_model = OCR_Model(model="gemini-2.0-flash-thinking-exp")
invoice = ocr_model._predict(data)

In [4]:
from components.extractor import InvoiceExtractor
extractor = InvoiceExtractor()
response = extractor.extract(invoice)
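
What `InvoiceExtractor.extract` does internally is not shown here. One plausible post-processing step, sketched below under that assumption, is to parse the model reply as JSON (tolerating a Markdown code fence around it) and to turn the literal string `'null'` into a real missing value. The names `parse_model_json` and `normalise_nulls` are hypothetical.

```python
import json

FENCE = "`" * 3  # three backticks, the Markdown code-fence marker


def parse_model_json(reply: str) -> dict:
    """Parse a JSON object from a model reply, with or without a code fence."""
    text = reply.strip()
    if text.startswith(FENCE) and text.endswith(FENCE):
        # Drop the opening fence line (it may carry a language tag) and the closing fence.
        first_newline = text.find("\n")
        text = text[first_newline + 1 : -len(FENCE)].strip()
    parsed = json.loads(text)
    if not isinstance(parsed, dict):
        raise ValueError("expected a JSON object at the top level")
    return parsed


def normalise_nulls(value):
    """Recursively replace the literal string 'null' with None."""
    if isinstance(value, dict):
        return {key: normalise_nulls(item) for key, item in value.items()}
    if isinstance(value, list):
        return [normalise_nulls(item) for item in value]
    return None if value == "null" else value
```

A step like this would let fields such as `ProductCode` and `orderNo` render as empty cells rather than as the text `null`.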

## Display the output
We will display the extracted data as JSON.

In [None]:
ocr_model.display(invoice)

{'line_items': [{'ItemPosition': 1,
   'ProductCode': 'null',
   'Description': 'MS H (1000d) (-80C Storage) + EYEDROPPERS 04/05/2020 21178 Belchford New',
   'Quantity': 35.0,
   'UnitPrice': 100.52,
   'ItemVatRate': 20.0,
   'TotalAmount': 3518.2}],
 'headers': {'suppName': 'CROWSHALL VETERINARY SERVICES LLP',
  'invNo': '50564',
  'invDate': '2020-05-31',
  'dueDate': '2020-06-30',
  'orderNo': 'null',
  'custName': 'L J Fairburn & Son',
  'custAddress': 'Ivy House\nFarm Office\nFarlesthorpe Road\nLincolnshire\nLN13 9PL',
  'amountNet': 3518.2,
  'amountVat': 703.64,
  'amountTotal': 4221.84,
  'currency': '£'}}
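
Because the output carries quantities, unit prices and totals, a simple arithmetic cross-check can flag likely OCR errors. The sketch below is not part of the notebook's components. It assumes the dictionary layout shown above, and `check_invoice` and `TOLERANCE` are hypothetical names.

```python
import math

TOLERANCE = 0.005  # half a penny, to absorb floating-point rounding


def check_invoice(invoice: dict) -> list[str]:
    """Return a list of human-readable inconsistencies (empty if none found)."""
    problems = []
    for item in invoice["line_items"]:
        expected = item["Quantity"] * item["UnitPrice"]
        if not math.isclose(expected, item["TotalAmount"], abs_tol=TOLERANCE):
            problems.append(
                f"line {item['ItemPosition']}: Quantity * UnitPrice = {expected:.2f}, "
                f"but TotalAmount = {item['TotalAmount']:.2f}"
            )
    headers = invoice["headers"]
    gross = headers["amountNet"] + headers["amountVat"]
    if not math.isclose(gross, headers["amountTotal"], abs_tol=TOLERANCE):
        problems.append(
            f"amountNet + amountVat = {gross:.2f}, "
            f"but amountTotal = {headers['amountTotal']:.2f}"
        )
    return problems


# Values copied from the output above (only the fields the check needs).
sample = {
    "line_items": [
        {"ItemPosition": 1, "Quantity": 35.0, "UnitPrice": 100.52, "TotalAmount": 3518.2}
    ],
    "headers": {"amountNet": 3518.2, "amountVat": 703.64, "amountTotal": 4221.84},
}
print(check_invoice(sample))  # prints []
```

An empty list means the extracted figures agree with each other. It does not prove that they match the PDF.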

## Display the output in a table
We will display the extracted data as tables: one for the line items and one for the header fields.

In [5]:
ocr_model.display(response, html=True)

| ItemPosition | ProductCode | Description | Quantity | UnitPrice | TotalAmount | ItemVatRate |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | | Retainer fee | 1.0 | 1600.0 | 1600.0 | 20.0 |
| 2 | | Paracox 8 (1000 doses) 0650A 22/05/2020 73310 | 1.0 | 103.16 | 103.16 | 20.0 |
| 3 | | Denagard (1 litre) 22575 05/05/2020 72961 | 9.0 | 43.72 | 393.48 | 20.0 |
| 4 | | Salmonella Test - Gauze swabs (10) (MSRV) 11/05/2020 S0358320 | 2.0 | 11.33 | 22.66 | 20.0 |
| 5 | | SpecDelSat1pm-Citramox 07/05/2020 | 1.0 | 15.0 | 15.0 | 20.0 |
| 6 | | Poulvac IB Primer (5000 doses) 386290 12/05/2020 73048 | 5.0 | 20.79 | 103.95 | 20.0 |
| 7 | | Gallimune 407 (1000 doses) 1471202 26/05/2020 73357 | 25.0 | 60.73 | 1518.25 | 20.0 |
| 8 | | Poulvac HB1 (5000 doses) 316856-322730 04/05/2020 72918 | 12.0 | 7.5 | 90.0 | 20.0 |
| 9 | | Poulvac ILT (1000 doses) 381717 12/05/2020 73050 | 60.0 | 10.78 | 646.8 | 20.0 |
| 10 | | Poulvac ILT (1000 doses) 40188 12/05/2020 73051 | 27.0 | 10.78 | 291.06 | 20.0 |

| Field | Value |
| --- | --- |
| suppName | CROWSHALL VETERINARY SERVICES LLP |
| invNo | 50568 |
| invDate | 2020-05-31 |
| due_date | 2020-06-30 |
| orderNo | |
| custName | L J Fairburn & Son |
| custAddress | Ivy House Farm Office Farlesthorpe Road Lincolnshire LN13 9PL |
| amountNet | 30959.12 |
| amountVat | 6137.88 |
| amountTotal | 37097.00 |
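
How `ocr_model.display(..., html=True)` builds its tables is not shown in this notebook. As a minimal sketch of the idea, the standard-library code below turns a list of row dictionaries into an HTML table, escaping every cell and showing missing values as blanks. The names `cell_text` and `rows_to_html` are hypothetical.

```python
from html import escape


def cell_text(value) -> str:
    """Render one cell: blank for None, HTML-escaped text otherwise."""
    return "" if value is None else escape(str(value))


def rows_to_html(rows: list[dict], columns: list[str]) -> str:
    """Render dict rows as an HTML table with the given column order."""
    head = "".join(f"<th>{escape(col)}</th>" for col in columns)
    body = []
    for row in rows:
        cells = "".join(f"<td>{cell_text(row.get(col))}</td>" for col in columns)
        body.append(f"<tr>{cells}</tr>")
    return (
        f"<table><thead><tr>{head}</tr></thead>"
        f"<tbody>{''.join(body)}</tbody></table>"
    )
```

In a notebook, the returned string could be wrapped in `IPython.display.HTML` to render it inline. Escaping matters here because values such as `L J Fairburn & Son` contain characters that are special in HTML.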


## Conclusion
We have extracted the invoice data from the PDF with the Gemini API and displayed the output both as JSON and in a table format.