# PDF reader and embedder via Azure/OpenAI tools for MM-RAG
## 📘 Tanat Piumsuwan's Customized Version

This notebook contains scripts for reading PDF files and extracting their content with Azure Document Intelligence.

### ✅ Setup Before Running This Notebook:

- Ensure the **corresponding PDF file** is in its designated folder, `Data`.
- Add your own `.env` file, based on the empty one I provided.
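
Before running the cells below, it can help to confirm that the `.env` file supplies the variables this notebook reads (`DATA_NAME`, `DI_ENDPOINT`, `DI_KEY`). The following is a minimal sketch only; the helper name `missing_env_vars` is hypothetical and not part of the original notebook.

```python
import os

# Variable names this notebook reads via os.environ.get(...)
REQUIRED_VARS = ("DATA_NAME", "DI_ENDPOINT", "DI_KEY")


def missing_env_vars(required=REQUIRED_VARS, env=None):
    """Return the names in `required` that are unset or empty in `env`."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]


# Example with an explicit mapping instead of the real environment
example_env = {"DATA_NAME": "report", "DI_ENDPOINT": "", "DI_KEY": "secret"}
print(missing_env_vars(env=example_env))  # ['DI_ENDPOINT']
```

Calling `missing_env_vars()` with no arguments after `load_dotenv()` checks the real environment; an empty list means all three values are present.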

## Environment setup:

In [1]:
%load_ext autoreload
%autoreload 2
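import os
from glob import glob

from dotenv import load_dotenv

# Load DATA_NAME, DI_ENDPOINT and DI_KEY from the local .env file
load_dotenv()

document_basename = os.environ.get("DATA_NAME")
pdf_dir = os.path.join(os.getcwd(), "Data")

# Match ".pdf" as well as ".PDF" (glob patterns are case-sensitive on Linux)
pdf_files = sorted(glob(os.path.join(pdf_dir, "*.[pP][dD][fF]")))
if not pdf_files:
    raise FileNotFoundError(f"No PDF file found in {pdf_dir}")

# glob already returns the full path, so no further join is needed
document_name = pdf_files[0]
pdf_doc = document_name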
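
A pathlib-based way to list the PDFs in a folder regardless of extension case can be sketched as follows. The function name `find_pdfs` is hypothetical, and the temporary directory only stands in for the `Data` folder.

```python
import tempfile
from pathlib import Path


def find_pdfs(folder):
    """Return PDF files in `folder`, sorted by path, matching any extension case."""
    return sorted(
        p for p in Path(folder).iterdir()
        if p.is_file() and p.suffix.lower() == ".pdf"
    )


# Demo on a throwaway folder that stands in for "Data"
with tempfile.TemporaryDirectory() as tmp:
    for name in ("b.PDF", "a.pdf", "notes.txt"):
        (Path(tmp) / name).touch()
    print([p.name for p in find_pdfs(tmp)])  # ['a.pdf', 'b.PDF']
```

Sorting makes the choice of "first" PDF deterministic when the folder holds more than one file.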

### Azure Document Intelligence credentials

In [2]:
# Endpoint and key for Azure Document Intelligence, read from DI_ENDPOINT and DI_KEY in .env
# (copy both values from the Azure portal)
DI_endpoint = os.environ.get('DI_ENDPOINT')
key = os.environ.get("DI_KEY")

## Azure Document Intelligence (DI)

In [3]:
# import libraries
from azure.core.credentials import AzureKeyCredential
from azure.ai.documentintelligence import DocumentIntelligenceClient
from azure.ai.documentintelligence.models import (
    AnalyzeDocumentRequest,
    AnalyzeResult,
    DocumentContentFormat,
)


# helper functions

def get_words(page, line):
    """Return the words on `page` whose spans lie inside one of `line`'s spans."""
    result = []
    for word in page.words:
        if _in_span(word, line.spans):
            result.append(word)
    return result


def _in_span(word, spans):
    """Return True if the word's character range is fully contained in any span."""
    for span in spans:
        if word.span.offset >= span.offset and (
            word.span.offset + word.span.length
        ) <= (span.offset + span.length):
            return True
    return False



def analyze_layout(pdf_doc):
    """Analyze a local PDF with the prebuilt layout model and return the result (content as Markdown)."""
    document_intelligence_client = DocumentIntelligenceClient(
        endpoint=DI_endpoint, credential=AzureKeyCredential(key)
    )
    # Open the local PDF file and submit it for analysis
    with open(pdf_doc, "rb") as f:
        poller = document_intelligence_client.begin_analyze_document(
            model_id="prebuilt-layout",
            body=f,
            output_content_format=DocumentContentFormat.MARKDOWN,
        )

    # Block until the long-running analysis has finished
    result = poller.result()

    print("Extracted keys: ", result.keys())

    return result

In [4]:
document_intelligence_client = DocumentIntelligenceClient(
        endpoint=DI_endpoint, credential=AzureKeyCredential(key)
    )

In [5]:
DI_result = analyze_layout(pdf_doc)

Extracted keys:  dict_keys(['apiVersion', 'modelId', 'stringIndexType', 'content', 'pages', 'tables', 'paragraphs', 'styles', 'contentFormat', 'sections'])
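
A quick way to get an overview of these keys is to count the entries in the main collections. This is a sketch under the assumption that the result supports mapping-style access (which the `keys()` call above suggests); `summarize_result` and the stand-in dictionary are hypothetical.

```python
def summarize_result(result):
    """Count the main collections in a mapping shaped like the analysis result."""
    return {
        "pages": len(result.get("pages") or []),
        "tables": len(result.get("tables") or []),
        "paragraphs": len(result.get("paragraphs") or []),
        "content_chars": len(result.get("content") or ""),
    }


# Hypothetical stand-in with the same top-level keys as the output above
fake_result = {
    "content": "# Title\n\nBody",
    "pages": [{}, {}],
    "tables": [],
    "paragraphs": [{}, {}, {}],
}
print(summarize_result(fake_result))
# {'pages': 2, 'tables': 0, 'paragraphs': 3, 'content_chars': 13}
```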


In [6]:
DI_result['content']

'# แบบประเมินรายปีสำหรับผู้สนับสนุนการขายและรับซื้อคืนหน่วยลงทุน ประจำปี 2567\n\n(รบกวนตอบกลับภายในวันที่ 30 มิถุนายน 2568)\n\nวันที่ 27 มิถุนายน 2568\n\nรายงานฉบับนี้จัดทำขึ้น เพื่อตรวจสอบข้อมูลต่างๆ และประเมินการจัดการความเสี่ยงของผู้สนับสนุนการขาย\nและรับซื้อคืนหน่วยลงทุนของ บริษัทหลักทรัพย์จัดการกองทุนกสิกรไทย จำกัด ประจำปี 2567\n\n\n# บริษัทหลักทรัพย์ ไอร่า จำกัด (มหาช