# Table of contents
1. [Introduction](#introduction)
2. [Data Ingestion](#data-ingestion)
3. [Data Extraction](#data-extraction)
4. [Display the output](#display-the-output)
5. [Display the output in a table](#display-the-output-in-a-table)
6. [Conclusion](#conclusion)

## Introduction
In this notebook, we extract the data from a given PDF invoice with the Gemini API and display the output first as JSON and then in a table format.

## Data Ingestion
We pick a PDF file from the `data/invoices` directory and transform it into a base64-encoded UTF-8 string.

In [3]:
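import os
import sys

# Make the project root importable when running from the notebooks/ directory.
sys.path.append('..')

from config.configs import ROOT_DIR
from src.gemini_ocr.components.data_ingestion import DataIngestion

# Select one invoice. os.listdir() returns entries in arbitrary order,
# so index 2 may point at a different file on another machine.
file_dir = ROOT_DIR / 'data' / 'invoices'
files = os.listdir(file_dir)
file_path = file_dir / files[2]

# Transform the PDF into the encoded form passed to the model.
data_ingestion = DataIngestion()
data = data_ingestion.transform(file_path)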

In [4]:
file_path

WindowsPath('d:/github/OCR/notebooks/../data/invoices/NEWFO-INV-7478.pdf')
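
The internals of `DataIngestion.transform` are not shown in this notebook. As a rough illustration of what "base64 UTF-8" means here, the following minimal sketch reads a PDF and encodes it with the standard library. The function name `pdf_to_base64` is hypothetical and not part of the project.

```python
import base64
from pathlib import Path


def pdf_to_base64(path) -> str:
    """Read a file from disk and return its bytes as a base64 UTF-8 string.

    Hypothetical helper for illustration; the project's DataIngestion
    class may work differently.
    """
    raw = Path(path).read_bytes()
    # b64encode returns ASCII bytes; decoding them gives a plain str
    # that can be embedded in a JSON request body.
    return base64.b64encode(raw).decode("utf-8")
```

Decoding the result with `base64.b64decode` gives back the original bytes, which is a quick way to confirm that nothing was lost in the transformation.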

## Data Extraction
We pass the encoded PDF to the Gemini API, which extracts the invoice data from it.

In [5]:
from config.configs import MODEL
from src.gemini_ocr.components.model import OCR_Model

# Create the model wrapper and extract the invoice data from the encoded PDF.
ocr_model = OCR_Model(model=MODEL)
invoice = ocr_model.extract(data)
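
How `OCR_Model.extract` talks to Gemini is not shown here. The sketch below is one plausible shape for it, under two assumptions: that the PDF is sent as an inline base64 part (the `inline_data` / `mime_type` / `data` layout used by the Gemini `generateContent` REST API), and that the model replies with JSON text that may be wrapped in a Markdown code fence. Both function names are hypothetical.

```python
import json

FENCE = "`" * 3  # a Markdown code fence, built this way to keep the example readable


def build_request_body(pdf_base64: str, prompt: str) -> dict:
    """Assemble a generateContent-style body with a text part and an inline PDF part."""
    return {
        "contents": [
            {
                "parts": [
                    {"text": prompt},
                    {
                        "inline_data": {
                            "mime_type": "application/pdf",
                            "data": pdf_base64,
                        }
                    },
                ]
            }
        ]
    }


def parse_model_json(text: str) -> dict:
    """Parse JSON from a model reply, tolerating a surrounding code fence."""
    cleaned = text.strip()
    if cleaned.startswith(FENCE):
        # Drop the opening fence line (which may carry a language tag)
        # and everything from the closing fence onward.
        cleaned = cleaned.split("\n", 1)[1] if "\n" in cleaned else ""
        cleaned = cleaned.rsplit(FENCE, 1)[0]
    return json.loads(cleaned)
```

Keeping request building and response parsing as separate pure functions makes both easy to test without calling the API.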

## Display the output
We display the extracted data as JSON.

In [6]:
ocr_model.display(invoice)

{'line_items': [{'product_code': None,
   'description': 'Orange Strapping Tape 19MMX66M',
   'quantity': '384.00',
   'price_per_unit': '1.06',
   'vat_percent': '20%',
   'total_price': '407.04'},
  {'product_code': None,
   'description': '12mm Blue Vinyl Tape',
   'quantity': '240.00',
   'price_per_unit': '1.06',
   'vat_percent': '20%',
   'total_price': '254.40'}],
 'total_amount': {'total_items': 2,
  'total_tax': '132.29',
  'total_price': '793.73'},
 'due_date': '2023-11-25',
 'payment_date': None,
 'invoice_date': '2023-10-13',
 'invoice_number': 'INV-7478',
 'purchase_order': '3106',
 'reference_numbers': ['MP38'],
 'locale': 'en-GB',
 'country': 'GBR',
 'currency': 'GBP',
 'payment_details': {'iban': None,
  'swift': None,
  'bic': None,
  'account_number': '19745860'},
 'vat_number': '188505042',
 'supplier_name': 'New Forest Growers Ltd',
 'taxes_details': [{'rate': '20%', 'amount': '132.29'}],
 'total_amount_including_taxes': '793.73',
 'total_net_amount_excluding_taxes
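
Because the extraction comes from a language model, it can be worth checking that the numbers are internally consistent. The sketch below recomputes the totals from the line items in the output above. It assumes that tax is rounded to the cent after summing over all lines, which matches this invoice but may not hold for every supplier. The `check_totals` helper is hypothetical.

```python
from decimal import Decimal, ROUND_HALF_UP

# Values copied from the extracted invoice shown above.
invoice_sample = {
    "line_items": [
        {"quantity": "384.00", "price_per_unit": "1.06",
         "vat_percent": "20%", "total_price": "407.04"},
        {"quantity": "240.00", "price_per_unit": "1.06",
         "vat_percent": "20%", "total_price": "254.40"},
    ],
    "total_amount": {"total_items": 2, "total_tax": "132.29",
                     "total_price": "793.73"},
}

CENT = Decimal("0.01")


def check_totals(inv: dict) -> dict:
    """Recompute net, tax, and gross from the line items and compare."""
    net = Decimal("0")
    tax = Decimal("0")
    lines_ok = True
    for item in inv["line_items"]:
        expected = Decimal(item["quantity"]) * Decimal(item["price_per_unit"])
        line_total = Decimal(item["total_price"])
        # Each line total should equal quantity * unit price, to the cent.
        lines_ok = lines_ok and expected.quantize(CENT, ROUND_HALF_UP) == line_total
        rate = Decimal(item["vat_percent"].rstrip("%")) / 100
        net += line_total
        tax += line_total * rate
    # Assumption: tax is rounded once, after summing over all lines.
    tax = tax.quantize(CENT, ROUND_HALF_UP)
    totals = inv["total_amount"]
    return {
        "net": net,
        "tax": tax,
        "gross": net + tax,
        "lines_ok": lines_ok,
        "tax_ok": tax == Decimal(totals["total_tax"]),
        "gross_ok": net + tax == Decimal(totals["total_price"]),
    }


result = check_totals(invoice_sample)
```

For this invoice the net amount is 407.04 + 254.40 = 661.44, the 20% tax rounds to 132.29, and the sum is 793.73, so all three checks pass.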

## Display the output in a table
We display the same extracted data as tables.

In [7]:
ocr_model.display(invoice, html=True)

| Field | Value |
|---|---|
| line_items | nested table, expanded below |
| total_amount | nested table, expanded below |
| due_date | 2023-11-25 |
| payment_date | |
| invoice_date | 2023-10-13 |
| invoice_number | INV-7478 |
| purchase_order | 3106 |

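line_items:

| product_code | description | quantity | price_per_unit | vat_percent | total_price |
|---|---|---|---|---|---|
| | Orange Strapping Tape 19MMX66M | 384.00 | 1.06 | 20% | 407.04 |
| | 12mm Blue Vinyl Tape | 240.00 | 1.06 | 20% | 254.40 |

total_amount:

| Field | Value |
|---|---|
| total_items | 2 |
| total_tax | 132.29 |
| total_price | 793.73 |

payment_details:

| Field | Value |
|---|---|
| iban | |
| swift | |
| bic | |
| account_number | 19745860 |

taxes_details:

| rate | amount |
|---|---|
| 20% | 132.29 |

| Field | Value |
|---|---|
| vat_number | GB125476511 |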
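
The implementation behind `display(invoice, html=True)` is not shown in this notebook. As a minimal sketch of how a nested dictionary can become nested tables, the hypothetical `to_html` function below renders dicts as key/value tables and lists of dicts as column tables, using only the standard library.

```python
from html import escape


def to_html(value) -> str:
    """Render nested invoice data as HTML tables (illustrative only)."""
    if isinstance(value, dict):
        # A dict becomes a two-column key/value table.
        rows = "".join(
            f"<tr><th>{escape(str(k))}</th><td>{to_html(v)}</td></tr>"
            for k, v in value.items()
        )
        return f"<table>{rows}</table>"
    if isinstance(value, list) and value and all(isinstance(v, dict) for v in value):
        # A list of dicts becomes one table with a column per key.
        columns = list(value[0].keys())
        head = "".join(f"<th>{escape(str(c))}</th>" for c in columns)
        body = "".join(
            "<tr>" + "".join(f"<td>{to_html(row.get(c))}</td>" for c in columns) + "</tr>"
            for row in value
        )
        return f"<table><tr>{head}</tr>{body}</table>"
    if isinstance(value, list):
        # A list of scalars is joined into a single cell.
        return escape(", ".join(str(v) for v in value))
    # None becomes an empty cell; everything else is escaped text.
    return "" if value is None else escape(str(value))
```

In a notebook, the resulting string could be rendered by wrapping it in `IPython.display.HTML`.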


## Conclusion
We extracted the data from the PDF with the Gemini API and displayed the output both as JSON and in a table format.