# 1. PDF to text

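This notebook extracts text from a PDF with two readers and compares the token size of each result:

1. PdfReader (from pypdf)
2. Azure AI Document Intelligence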

## Environment Setup
This notebook reads its configuration from a `.env` file, which is located by relative path. The `.env` file is expected to sit in the same folder as the notebook itself.

In [None]:
from dotenv import load_dotenv
load_dotenv()

In [None]:
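import tiktoken

# o200k_base is the tokenizer used by the GPT-4o model family
encoding = tiktoken.get_encoding("o200k_base")

def token_size(text):
    """Return the number of tokens in `text` under the chosen encoding."""
    return len(encoding.encode(text))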

In [None]:
sample_file = '../examples/call-center-status-report.pdf'

## PdfReader (pypdf)

In [None]:
from pypdf import PdfReader

In [None]:
from pathlib import Path

# Base name without directory or extension, used to name the output files
filename = Path(sample_file).stem
filename

In [None]:
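# Extract text page by page and separate pages with a blank line
with open(sample_file, "rb") as f:
    reader = PdfReader(f)
    text = "\n\n".join(page.extract_text() or "" for page in reader.pages)

# Write as UTF-8 so non-ASCII characters do not depend on the platform default
with open(f"{filename}_pdfreader.txt", "w", encoding="utf-8") as f:
    f.write(text)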

In [None]:
print(f"token size: {token_size(text)}")

## Azure AI Services - Document Intelligence

Document Intelligence documentation:
- [Extract Layout](https://learn.microsoft.com/en-us/python/api/overview/azure/ai-documentintelligence-readme?view=azure-python-preview&preserve-view=true#extract-layout)
- [Extract Figures from Documents](https://learn.microsoft.com/en-us/python/api/overview/azure/ai-documentintelligence-readme?view=azure-python-preview&preserve-view=true#extract-figures-from-documents)

In [None]:
import os

AZDOCINT_ENDPOINT = os.getenv("AZDOCINT_ENDPOINT")
AZDOCINT_KEY = os.getenv("AZDOCINT_KEY")
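
`os.getenv` returns `None` when a variable is not set, and that only surfaces later as an error when the client is created. As a minimal sketch, a check like the one below could report the problem early. `missing_env_vars` is a hypothetical helper introduced here for illustration; it is not part of the original notebook or of any library.

```python
import os

# Hypothetical helper (an assumption, not part of the original notebook).
def missing_env_vars(names, environ=None):
    """Return the names that are unset or empty in the given environment."""
    environ = os.environ if environ is None else environ
    return [name for name in names if not environ.get(name)]

# Check the two variables this notebook reads from the .env file
missing = missing_env_vars(["AZDOCINT_ENDPOINT", "AZDOCINT_KEY"])
if missing:
    print(f"Missing environment variables: {', '.join(missing)}")
```

If anything is reported as missing, the likely cause is that the `.env` file was not found next to the notebook or does not define those names.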

In [None]:
import base64

from azure.ai.documentintelligence import DocumentIntelligenceClient
from azure.core.credentials import AzureKeyCredential
from azure.ai.documentintelligence.models import AnalyzeOutputOption, AnalyzeResult

document_analysis_client = DocumentIntelligenceClient(
    endpoint=AZDOCINT_ENDPOINT, credential=AzureKeyCredential(AZDOCINT_KEY)
)

In [None]:
print(f"converting `{sample_file}`...")
# Document Intelligence: send the local file as a base64-encoded request body
with open(sample_file, "rb") as f:
    analyze_request = {
        "base64Source": base64.b64encode(f.read()).decode("utf-8")
    }

# Start the layout analysis; request figure images and markdown-formatted content
poller = document_analysis_client.begin_analyze_document(
    "prebuilt-layout",
    analyze_request,
    output=[AnalyzeOutputOption.FIGURES],
    output_content_format="markdown",
)


In [None]:
# Block until the analysis finishes, then read the markdown content
result: AnalyzeResult = poller.result()
md_content = result.content
print(f"token size: {token_size(md_content)}")

In [None]:
# Write as UTF-8 so non-ASCII characters do not depend on the platform default
with open(f"{filename}_docint.md", "w", encoding="utf-8") as f:
    f.write(md_content)

### Extra: Save figures

In [None]:
details = poller.details
operation_id = details['operation_id']
print(f"operation_id: {operation_id}")

In [None]:
result.figures

In [None]:
# Create the output directory for figures if it does not exist yet
os.makedirs("fig", exist_ok=True)

In [None]:
if result.figures:
    for figure in result.figures:
        if figure.id:
            response = document_analysis_client.get_analyze_result_figure(
                model_id=result.model_id, result_id=operation_id, figure_id=figure.id
            )
            with open(f"./fig/fig_{figure.id}.png", "wb") as writer:
                writer.writelines(response)
else:
    print("No figures found.")