# <span style='color:Tomato;'>Load Env Variables</span>

In [56]:
import dotenv
import utils

# Add the modules directory to the Python path if needed
# sys.path.append(os.path.abspath("./modules"))

# load variables into env
root_dir = utils.get_project_root()
f = root_dir / ".secrets" / ".env"
assert f.exists(), f"File not found: {f}"
dotenv.load_dotenv(f)


Root Directory: /DATA/Ali_Data/GraphRAG-Neo4j-VMD-NAMD


True

# <span style='color:Tomato;'>Process PDFs</span>

We'll use LangChain's `PyMuPDF4LLMLoader` to load the PDF files into LangChain documents.

We'll also use an LLM to convert each image into a summary and extract its data.


## <span style='color:Orange;'>Initialization</span>

### <span style='color:Khaki;'>Basic Imports</span>

In [None]:
import pickle
import pprint
from pathlib import Path
import tempfile

from IPython.display import Markdown, display

# from tqdm import tqdm
from tqdm.notebook import tqdm

import concurrent.futures as cf
import fitz  # PyMuPDF


temp_dir = False


### <span style='color:Khaki;'>Loading PDF file as LangChain Document</span>

> Images will be converted to text using a multimodal LLM.

You can either use the `load()` method to do it all at once in memory or do it incrementally with `lazy_load()`.

Since our docs are big, we'll use `lazy_load()` to also see the progress.

To save time, we will load the docs from a pickle file if previously processed, otherwise process them and save them as a pickle.
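
The load-or-process idea described above can be sketched in isolation. This is a minimal illustration, not the notebook's actual cell: `load_or_build` and `fake_parse` are hypothetical names, and `fake_parse` merely stands in for the slow PDF parsing step.

```python
import pickle
import tempfile
from pathlib import Path


def load_or_build(cache_file: Path, build):
    """Return (object, from_cache): read the pickle if it exists, else build and save it."""
    if cache_file.exists():
        with open(cache_file, "rb") as fh:
            return pickle.load(fh), True
    result = build()
    cache_file.parent.mkdir(parents=True, exist_ok=True)
    with open(cache_file, "wb") as fh:
        pickle.dump(result, fh)
    return result, False


def fake_parse():
    """Hypothetical stand-in for the expensive PDF parsing step."""
    return [f"page {i}" for i in range(3)]


cache = Path(tempfile.mkdtemp()) / "docs_example.pkl"
first, first_from_cache = load_or_build(cache, fake_parse)     # builds and saves
second, second_from_cache = load_or_build(cache, fake_parse)   # reads the pickle
```

Only unpickle files you created yourself, since loading a pickle can execute arbitrary code.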

#### <span style='color:LightGreen;'>Custom Splitting Mode</span>

> By default, each page in the PDF is a (LangChain) Document!

When loading the PDF file you can split it in two different ways:
- By page `mode="page"`
- As a single text flow `mode="single"`. In other words, the whole PDF becomes **one** LangChain Document. You can specify a page delimiter so that page boundaries remain identifiable in the text.
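
To make the difference between the two modes concrete, here is a plain-Python illustration that does not call the loader at all. Plain dicts stand in for LangChain Documents, and `split_by_page`, `as_single`, and `DELIM` are hypothetical names chosen for this sketch; only the `page` and `total_pages` metadata keys mirror what the loader reports.

```python
# Stand-in for the text extracted from three PDF pages.
pages = ["Intro text", "Methods text", "Results text"]

DELIM = "\n-----\n"  # hypothetical page delimiter


def split_by_page(pages):
    """One document per page, with the zero-based page index in the metadata."""
    return [
        {"page_content": text, "metadata": {"page": i, "total_pages": len(pages)}}
        for i, text in enumerate(pages)
    ]


def as_single(pages, delimiter=DELIM):
    """One document for the whole file; the delimiter marks where pages end."""
    return [
        {
            "page_content": delimiter.join(pages),
            "metadata": {"total_pages": len(pages)},
        }
    ]


page_docs = split_by_page(pages)
single_docs = as_single(pages)
print(len(page_docs), len(single_docs))  # 3 1
```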


In [58]:
def update_filename_len(file_path: Path) -> tuple[str, int]:
    """
    Update the filename length for the given file path.
    :param file_path: The file path to update.
    :return: A tuple containing the updated file name and its length.
    """
    file_name = file_path.stem.lower()
    with fitz.open(file_path) as pdf_doc:
        file_len = len(pdf_doc)
    return file_name, file_len


In [59]:
# pdf file
file_path = Path() / ".." / "data" / "pdfs" / "biopython.pdf"

file_path = file_path.resolve()
file_path = utils.fuzzy_find(file_path)

file_name, file_len = update_filename_len(file_path)

# create directory for pkl files
pkl_dir = file_path.parent.parent / "pkls"
pkl_dir.mkdir(exist_ok=True, parents=True)

print(f"file_path = {file_path}")


File 'biopython' not found. Fuzzy Searching ...
file_path = /DATA/Ali_Data/GraphRAG-Neo4j-VMD-NAMD/data/pdfs/BioPython.pdf


In [None]:
# if a problem occurs during the loading, use this to delete previously processed pages.
# todo: the page numbers are reindexed to zero

problematic_pages = [391, 392, 395, 428]
range_to_keep = range(391, file_len)  # 391 to 445 (exclusive)

if 0:
    display(Markdown("#### <span style='color:orangered;'>Warning: Deleting Pages !!!</span>"))
    temp_dir = Path(tempfile.mkdtemp()) if not isinstance(temp_dir, Path) else temp_dir
    temp_dir.mkdir(exist_ok=True, parents=True)

    with fitz.open(file_path) as doc:
        # PART I: extract deleted pages
        if len(problematic_pages) > 0:
            range_to_keep = list(set(range_to_keep) - set(problematic_pages))  # needed for next part

            temp_doc = fitz.open()
            for page_number in problematic_pages:
                temp_doc.insert_pdf(doc, from_page=page_number, to_page=page_number)

            extract_file_path = temp_dir / f"{file_name}_extract.pdf"
            temp_doc.save(extract_file_path)
            temp_doc.close()

        # ========================================================
        # PART II: extract pages to keep
        doc.select(range_to_keep)
        partial_file_path = temp_dir / f"{file_name}_partial.pdf"
        doc.save(partial_file_path)
    
    print(f"extract_file_path = {extract_file_path}")
    print(f"partial_file_path = {partial_file_path}")

print(f"\nfile_path = {file_path}")


#### <span style='color:orangered;'>Warning: Deleting Pages !!!</span>


file_path = /DATA/Ali_Data/GraphRAG-Neo4j-VMD-NAMD/data/pdfs/BioPython.pdf
extract_file_path = /tmp/tmpgv9x6tdd/biopython_extract.pdf
partial_file_path = /tmp/tmpgv9x6tdd/biopython_partial.pdf


In [None]:
# # test if the correct pages are extracted
# with fitz.open(partial_file_path) as doc:
#     print(doc[0].get_textpage().extractText())


#### <span style='color:LightGreen;'>How the Asynchronous Lazy Loading Loop Works</span>

The following pattern loads documents lazily and asynchronously while showing a progress bar.


##### <span style='color:SkyBlue;'>Key Components</span>

1. `alazy_load()` - An asynchronous generator that yields documents one by one
2. `async for` - Asynchronous iteration through the generator
3. `tqdm` from `tqdm.asyncio` - Progress bar visualization that supports `async for`
4. Batching logic to process documents in chunks of 100

##### <span style='color:SkyBlue;'>How the Async Loop Works</span>

```python
from tqdm.asyncio import tqdm

async for doc in tqdm(loader.alazy_load()):
    ...  # process each document as it becomes available
```

Calling `loader.alazy_load()` returns an asynchronous iterator directly; because it is an async generator, it must not be awaited. The `async for` loop then:

1. Asynchronously requests the next document
2. Waits for it to be retrieved without blocking the event loop
3. Updates the progress bar via `tqdm`
4. Processes the document once available

The batching logic (collecting 100 pages before processing) allows for more efficient operations on groups of documents rather than one at a time.

This pattern is especially useful when loading documents involves network requests or other I/O operations that would otherwise block execution.
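
The batching step can be sketched without the real loader. In this minimal example, `fake_alazy_load` is a hypothetical async generator standing in for `loader.alazy_load()`, and `load_in_batches` is an illustrative helper, not part of LangChain.

```python
import asyncio


async def fake_alazy_load(n_pages):
    """Hypothetical stand-in that yields one 'document' at a time."""
    for i in range(n_pages):
        await asyncio.sleep(0)  # hand control back to the event loop, as real I/O would
        yield f"doc-{i}"


async def load_in_batches(source, batch_size):
    """Consume an async iterator and group its items into fixed-size batches."""
    batches, batch = [], []
    async for doc in source:
        batch.append(doc)
        if len(batch) == batch_size:
            batches.append(batch)
            batch = []
    if batch:  # flush the final, partially filled batch
        batches.append(batch)
    return batches


batches = asyncio.run(load_in_batches(fake_alazy_load(250), batch_size=100))
print([len(b) for b in batches])  # [100, 100, 50]
```

Inside a notebook, where an event loop is already running, `await load_in_batches(...)` replaces the `asyncio.run(...)` call.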


#### <span style='color:LightGreen;'>Which LLM to use?</span>

- `gemma3:4b`: **biggest,** but only provides a general understanding of the images.
- `granite3.2-vision`: **small,** and fine-tuned for data extraction from images in PDF docs.
- `moondream`: **smallest,** but only good for an overall description of the image.

The Prompt used:

"""

You are an assistant tasked with summarizing images for retrieval.
1. These summaries will be embedded and used to retrieve the raw image.
   Give a concise summary of the image that is well optimized for retrieval
2. extract all the text from the image. Do not exclude any content from the page.
Format answer in markdown without explanatory text and without markdown delimiter ``` at the beginning.

"""

In [66]:
from langchain_community.document_loaders.parsers import LLMImageBlobParser
from langchain_ollama import ChatOllama
from langchain_pymupdf4llm import PyMuPDF4LLMLoader

# from langchain_ollama.llms import OllamaLLM
# Use ChatOllama instead of OllamaLLM for compatibility with LLMImageBlobParser


pfile_path = extract_file_path
extract_images = False

pfile_name, pfile_len = update_filename_len(pfile_path)

if extract_images:
    loader = PyMuPDF4LLMLoader(
        pfile_path,
        mode="page",
        extract_images=True,
        # ChatOllama limits the output length with `num_predict`, not `max_tokens`
        images_parser=LLMImageBlobParser(model=ChatOllama(model="granite3.2-vision", num_predict=1024)),
    )
    pfile_name = f"w.img.{pfile_name}"
else:
    loader = PyMuPDF4LLMLoader(pfile_path, mode="page")

print(f"{pfile_name=} -> {pfile_len} pages")


pfile_name='biopython_extract' -> 4 pages


The nice thing about `lazy_load()` is that we can stop at any page and skip it if a problem happens.

You can also resume whenever you want, or process pages with a different configuration.
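
One way to skip a page that hangs is a per-page timeout. The sketch below is an assumption-laden illustration rather than the notebook's method: `process_page` is a hypothetical stand-in that parses one page per call, so a stuck page does not hold up a shared generator, and the executor is shut down with `wait=False` so that we never block on the stuck worker.

```python
import concurrent.futures as cf
import time


def process_page(page_number):
    """Hypothetical stand-in for parsing one page; page 2 is artificially slow."""
    if page_number == 2:
        time.sleep(1.0)
    return f"page {page_number}"


def load_with_timeout(page_numbers, timeout_seconds, max_workers=4):
    """Process pages one by one, skipping any page that exceeds the timeout."""
    loaded, skipped = [], []
    executor = cf.ThreadPoolExecutor(max_workers=max_workers)
    try:
        for n in page_numbers:
            future = executor.submit(process_page, n)
            try:
                loaded.append(future.result(timeout=timeout_seconds))
            except cf.TimeoutError:
                # The worker thread keeps running; we only stop waiting for it.
                skipped.append(n)
    finally:
        executor.shutdown(wait=False)  # do not block on workers that are still busy
    return loaded, skipped


loaded, skipped = load_with_timeout(range(5), timeout_seconds=0.2)
print(skipped)  # [2]
```

Threads cannot be killed, so a skipped page still occupies one worker until it finishes; `max_workers` therefore bounds how many stuck pages can be tolerated at once.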

In [None]:
if (pkl_dir / f"docs_{pfile_name}.pkl").exists():
    print("Loading docs from pickle")
    with open(pkl_dir / f"docs_{pfile_name}.pkl", "rb") as f:
        docs = pickle.load(f)
else:
    print(f"Loading docs from pdf. \nThis will take some time (~{int(pfile_len / 30)} min)")  # on average 30 pages per minute

    # Option 1: loading small docs
    # docs = loader.load()

    # ---------------------------

    # Option 2: Load documents asynchronously (almost 3x faster)
    # assert not extract_images, "Async loading not supported for image extraction"
    # docs = await loader.aload()

    # ---------------------------

    # Option 3: lazy load with progress bar
    # # todo: make this asynchronous
    # docs = []
    # for doc in tqdm(loader.lazy_load(), total=pfile_len):
    #     docs.append(doc)

    # ---------------------------

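    # Option 4: Load with timeout
    # todo: not working properly when timeout is reached:
    #   leaving the `with` block waits for the stuck worker to finish, and the
    #   shared generator cannot be advanced while that worker is still inside next()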
    
    timeout_seconds = 30
    skipped_pages = []
    docs = []

    def get_next_doc(loader):
        return next(loader)

    loader_iter = iter(loader.lazy_load())

    for i in tqdm(range(pfile_len), total=pfile_len):
        with cf.ThreadPoolExecutor(max_workers=5) as executor:
            future = executor.submit(get_next_doc, loader_iter)
            try:
                doc = future.result(timeout=timeout_seconds)
                docs.append(doc)
            except cf.TimeoutError:
                skipped_pages.append(i)

    # pickle save the docs
    with open(pkl_dir / f"docs_{pfile_name}.pkl", "wb") as f:
        pickle.dump(docs, f)

print(f"Loaded {pfile_name}: {len(docs)} documents")


Loading docs from pdf. 
This will take some time (~0 min)


  0%|          | 0/4 [00:00<?, ?it/s]

Loaded biopython_extract: 4 documents


In [None]:
# # merging docs if partially processed
# with open(pkl_dir / "docs_w.img.biopython_part1.pkl", "rb") as f:
#     docs0 = pickle.load(f)

# with open(pkl_dir / "docs_w.img.biopython_partial.pkl", "rb") as f:
#     docs1 = pickle.load(f)

# with open(pkl_dir / "docs_biopython_extract.pkl", "rb") as f:
#     docs2 = pickle.load(f)

# docs = docs0 + docs1 + docs2

# print(f"Loaded {len(docs)} documents")

# with open(pkl_dir / f"docs_w.img.biopython.pkl", "wb") as f:
#     pickle.dump(docs, f)


Loaded 445 documents


In [79]:
temp = docs[-5]
display(Markdown(temp.page_content))
print('-'*50)
pprint.pp(temp.metadata)


[42] Guy St C. Slater, Ewan Birney: “Automated generation of heuristics for biological sequence comparison.” *BMC Bioinformatics* **6** : 31 (2005). `https://doi.org/10.1186/1471-2105-6-31`

[43] George W. Snedecor, William G. Cochran: *Statistical methods* . Ames, Iowa: Iowa State University Press
(1989).

[44] Martin Steinegger, Markus Meier, Milot Mirdita, Harald V¨ohringer, Stephan J. Haunsberger, Johannes
S¨oding: “HH-suite3 for fast remote homology detection and deep protein annotation.” *BMC Bioinfor-*
*matics* **20** : 473 (2019). `https://doi.org/10.1186/s12859-019-3019-7`

[45] Eric Talevich, Brandon M. Invergo, Peter J.A. Cock, Brad A. Chapman: “Bio.Phylo: A unified toolkit
for processing, analyzing and visualizing phylogenetic trees in Biopython”. *BMC Bioinformatics* **13** :
209 (2012). `https://doi.org/10.1186/1471-2105-13-209`

[46] Pablo Tamayo, Donna Slonim, Jill Mesirov, Qing Zhu, Sutisak Kitareewan, Ethan Dmitrovsky, Eric S.
Lander, Todd R. Golub: “Interpreting patterns of gene expression with self-organizing maps: Methods
and application to hematopoietic differentiation”. *Proceedings of the National Academy of Sciences USA*
**96** (6): 2907–2912 (1999). `https://doi.org/10.1073/pnas.96.6.2907`

[47] Ian K. Toth, Leighton Pritchard, Paul R. J. Birch: “Comparative genomics reveals what makes an
enterobacterial plant pathogen”. *Annual Review of Phytopathology* **44** : 305–336 (2006). `https://doi.`
```
  org/10.1146/annurev.phyto.44.070505.143444

```
[48] G´eraldine A. van der Auwera, Jaroslaw E. Kr´ol, Haruo Suzuki, Brian Foster, Rob van Houdt, Celeste
J. Brown, Max Mergeay, Eva M. Top: “Plasmids captured in C. metallidurans CH34: defining the
PromA family of broad-host-range plasmids”. *Antonie van Leeuwenhoek* **96** (2): 193–204 (2009). `https:`
```
  //doi.org/10.1007/s10482-009-9316-9

```
[49] Michael S. Waterman, Mark Eggert: “A new algorithm for best subsequence alignments with application
to tRNA-rRNA comparisons”. *Journal of Molecular Biology* **197** (4): 723–728 (1987). `https://doi.`
```
  org/10.1016/0022-2836(87)90478-5

```
[50] Ziheng Yang and Rasmus Nielsen: “Estimating synonymous and nonsynonymous substitution rates
under realistic evolutionary models“. *Molecular Biology and Evolution* **17** (1): 32–43 (2000). `https:`
```
  //doi.org/10.1093/oxfordjournals.molbev.a026236

```
[51] Ka Yee Yeung, Walter L. Ruzzo: “Principal Component Analysis for clustering gene expression data”.
*Bioinformatics* **17** (9): 763–774 (2001). `https://doi.org/10.1093/bioinformatics/17.9.763`

444



--------------------------------------------------
{'producer': 'pdfTeX-1.40.24',
 'creator': 'LaTeX with hyperref',
 'creationdate': '2024-01-10T10:15:49+00:00',
 'source': '/tmp/tmpgv9x6tdd/biopython_partial.pdf',
 'file_path': '/tmp/tmpgv9x6tdd/biopython_partial.pdf',
 'total_pages': 50,
 'format': 'PDF 1.5',
 'title': '',
 'author': '',
 'subject': '',
 'keywords': '',
 'moddate': '2024-01-10T10:15:49+00:00',
 'trapped': '',
 'modDate': 'D:20240110101549Z',
 'creationDate': 'D:20240110101549Z',
 'page': 49}


### <span style='color:Khaki;'>Initializing the Graph Database</span>

In [None]:
# from langchain_neo4j import Neo4jGraph

# graph = Neo4jGraph()
