<a href="https://colab.research.google.com/github/EricSiq/NASA-Space-Apps-Challenge/blob/main/Notebooks/DataIngestionDoclingTest2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [7]:
%pip install langchain_docling docling -q


## Best Code Approach: Extract and Prepare

A straightforward approach is to load the CSV with `pandas`, keep the title and link columns, and build a list of `(title, link)` tuples (a list of dictionaries would work equally well). Either form can be iterated directly by the Docling processing loop that follows.

### Python Code for Extraction

This code will load your CSV and generate a clean list of (Title, Link) pairs, filtering for the valid PMC links.

In [8]:
import pandas as pd
import csv
import os
import time
from typing import List, Tuple, Dict, Any

# Docling's chunker, plus the loader and export types from its LangChain integration
from docling.chunking import HybridChunker
from langchain_docling.loader import DoclingLoader, ExportType
from langchain_core.documents import Document


In [9]:

# Function to Extract Data from CSV
def extract_publication_data(file_name: str) -> List[Tuple[str, str]]:
    """
    Loads the CSV, extracts the Title and Link columns, and filters for valid PMC URLs.
    """
    try:
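        # Load the CSV file (QUOTE_MINIMAL is the pandas default, stated explicitly here)
        df = pd.read_csv(file_name, quoting=csv.QUOTE_MINIMAL)

        # Assumes the CSV has columns named 'Title' and 'Link'.
        # Only links under this prefix are treated as PMC articles.
        base_url_prefix = "https://www.ncbi.nlm.nih.gov/pmc/articles/"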

        # 1. Filter: Ensure we only keep valid links for PMC articles
        df_filtered = df[df['Link'].astype(str).str.startswith(base_url_prefix, na=False)].copy()

        # 2. Select: Create a list of tuples (Title, Link)
        publication_list = list(zip(df_filtered['Title'], df_filtered['Link']))

        print(f"Extracted {len(publication_list)} valid PMC links from the CSV.")
        return publication_list

    except Exception as e:
        print(f"ERROR: An error occurred during CSV extraction: {e}")
        return []
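
The filter can be checked on a tiny in-memory CSV before pointing it at the real file. The sketch below is illustrative only: the rows and the `PMC0000001` identifier are made up, and `PMC_PREFIX` is a hypothetical name for the same prefix used in the function above.

```python
import io

import pandas as pd

PMC_PREFIX = "https://www.ncbi.nlm.nih.gov/pmc/articles/"

# Three made-up rows: one PMC link, one non-PMC link, one missing link.
sample_csv = io.StringIO(
    "Title,Link\n"
    "Paper A,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC0000001/\n"
    "Paper B,https://example.org/not-pmc\n"
    "Paper C,\n"
)
df = pd.read_csv(sample_csv)

# astype(str) turns the missing link into the string 'nan', which fails the prefix test.
mask = df["Link"].astype(str).str.startswith(PMC_PREFIX)
pairs = list(zip(df.loc[mask, "Title"], df.loc[mask, "Link"]))
print(pairs)
```

Only the first row should survive, which mirrors how rows with missing or non-PMC links are dropped from the real CSV.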

-----

## Next Step: Docling Integration Workflow

Once the above code provides the `publication_data` list, you will use it in the next step to perform the actual ingestion and structural parsing using Docling.

1.  **Iterate:** Loop through the `publication_data` list.
2.  **Docling Call:** Inside the loop, initialize the `DoclingLoader` using the **URL** from the list. Docling is capable of loading directly from a URL.
3.  **Process:** Use the `load()` or `lazy_load()` method to fetch and process the document.
4.  **Extract & Store:** Immediately process the structured Docling output to perform your **NER/RE** and populate the **Knowledge Graph** and **Vector Store**.
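
Step 3 names two methods. In LangChain-style loaders, `load()` typically returns the full list at once, while `lazy_load()` yields documents one at a time. The sketch below uses `StubLoader`, a hypothetical stand-in (not the real `DoclingLoader`), purely to show the difference in how much work each call triggers.

```python
from typing import Iterator, List


class StubLoader:
    """Hypothetical stand-in that mimics the load()/lazy_load() contract."""

    def __init__(self, pages: List[str]) -> None:
        self.pages = pages
        self.fetched = 0  # counts how many items were actually produced

    def lazy_load(self) -> Iterator[str]:
        # Generator: each item is produced only when the caller asks for it.
        for page in self.pages:
            self.fetched += 1
            yield page

    def load(self) -> List[str]:
        # Eager variant: materialize everything lazy_load() yields.
        return list(self.lazy_load())


loader = StubLoader(["chunk-1", "chunk-2", "chunk-3"])
first = next(loader.lazy_load())  # produces a single item
fetched_lazily = loader.fetched
all_chunks = loader.load()        # produces all three items
print(first, fetched_lazily, len(all_chunks))
```

For a run over hundreds of articles, the lazy form lets each document be handed to the NER/RE step before the next one is fetched.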

### Conceptual Docling Integration (Python)

This snippet illustrates how the extraction output feeds into Docling for processing. Run the installation cell at the top of the notebook first, then execute the cell below.

In [12]:


# Function to Ingest Data using Docling
def ingest_documents_with_docling(publication_data: List[Tuple[str, str]]) -> List[Document]:
    """
    Iterates through the list of (title, url) and uses Docling to fetch and process
    each document, preserving structural information in Markdown format.
    """
    if not publication_data:
        print("No data to process. Exiting ingestion.")
        return []

    processed_documents: List[Document] = []

    # Chunker passed to the loader. HybridChunker does not document
    # `max_chunk_size` or `chunk_overlap` parameters; chunk size follows the
    # tokenizer's token limit, so the defaults are used here.
    chunker = HybridChunker()

    # For one chunk per document element (with headings attached as metadata),
    # Docling also provides `HierarchicalChunker`:
    # from docling.chunking import HierarchicalChunker
    # chunker = HierarchicalChunker()

    # Process only a small sample for demonstration/testing due to processing time.
    # For a full production run, remove the [0:5] slice.
    sample_data = publication_data[0:5]

    print(f"\nProcessing a sample of {len(sample_data)} documents using Docling...")

    for i, (title, url) in enumerate(sample_data):
        start_time = time.time()
        print(f"--- Document {i+1}/{len(sample_data)}: {title[:50]}...")

        try:
            # 1. Initialize DoclingLoader with the URL
            # Docling handles fetching the content from the web
            loader = DoclingLoader(
                file_path=url,
                chunker=chunker,
                # Export as Markdown to preserve semantic structure (headings, tables).
                # In this mode the loader returns one Document per source and the
                # chunker is not applied; ExportType.DOC_CHUNKS yields chunked output.
                export_type=ExportType.MARKDOWN
            )

            # 2. Load the structured document. The result is a list of Documents.
            docs = loader.load()

            # 3. Add custom metadata (Title, Source URL) to each chunk
            for doc in docs:
                doc.metadata['original_title'] = title
                doc.metadata['source_url'] = url
                # Add a unique ID for traceability in the Knowledge Graph.
                # rstrip('/') keeps this correct whether or not the URL ends in a slash.
                doc.metadata['doc_id'] = f"PMC_{url.rstrip('/').split('/')[-1]}"
                processed_documents.append(doc)

            end_time = time.time()
            print(f"    Processed {len(docs)} chunks. Time: {end_time - start_time:.2f}s")
            # This is where you would call your NER/RE and KG population functions
            # process_for_knowledge_graph(docs)

        except Exception as e:
            print(f"    FAILED to process document. Error: {e}")

    return processed_documents
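
The `doc_id` above is derived from the URL path. Indexing path segments by position is sensitive to a trailing slash, so a small helper can make the extraction explicit. This is a minimal sketch; `pmc_id_from_url` is a hypothetical name, not part of Docling or LangChain.

```python
from urllib.parse import urlparse


def pmc_id_from_url(url: str) -> str:
    """Return the last non-empty path segment of a PMC article URL."""
    segments = [s for s in urlparse(url).path.split("/") if s]
    if not segments:
        raise ValueError(f"No path segments in URL: {url!r}")
    return segments[-1]


# The same article URL with and without a trailing slash.
with_slash = pmc_id_from_url("https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4136787/")
without_slash = pmc_id_from_url("https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4136787")
print(with_slash, without_slash)
```

Both calls give the same identifier. Since that identifier already starts with `PMC`, the extra `PMC_` prefix in the metadata is optional.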


The approach is **Extract → Transform → Load** (ETL): Docling handles the critical **Transform** step, converting the messy source markup (here, the HTML of PMC article pages) into clean, semantically rich text such as Markdown before the downstream AI work begins.
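
Because the Markdown export keeps headings, a downstream step can recover sections with plain string handling. The sketch below is a simplified illustration, not Docling's own chunking: `split_by_heading` is a hypothetical helper, the sample text is made up, and lines starting with `#` inside code blocks would be misread as headings.

```python
import re
from typing import List, Tuple

# Matches ATX-style Markdown headings such as "# Title" or "## Methods".
HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def split_by_heading(markdown: str) -> List[Tuple[str, str]]:
    """Split Markdown into (heading, body) pairs in document order."""
    sections: List[Tuple[str, str]] = []
    title = ""  # text before the first heading gets an empty title
    body: List[str] = []
    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            # Close the previous section if it has a title or any content.
            if title or any(part.strip() for part in body):
                sections.append((title, "\n".join(body).strip()))
            title = match.group(2).strip()
            body = []
        else:
            body.append(line)
    if title or any(part.strip() for part in body):
        sections.append((title, "\n".join(body).strip()))
    return sections


sample = "# Title\n\nIntro text.\n\n## Methods\n\nWe did things.\n"
sections = split_by_heading(sample)
print(sections)
```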

In [13]:

# Main Execution
if __name__ == "__main__":
    #  Phase 1: Extraction
    file_name = "SB_publication_PMC.csv"
    publication_data = extract_publication_data(file_name)

    #  Phase 2: Ingestion and Structural Parsing
    if publication_data:
        final_document_chunks = ingest_documents_with_docling(publication_data)

        # Verification of the processed chunks
        if final_document_chunks:
            print("\n--- Verification ---")
            print(f"Total structured chunks ready for KG/Vector Store: {len(final_document_chunks)}")

            # Show a sample chunk content and metadata
            sample_chunk = final_document_chunks[0]
            print(f"\nSample Chunk Content (First 200 chars): \n{sample_chunk.page_content[:200]}...")
            print("\nSample Chunk Metadata:")
            print(sample_chunk.metadata)

Extracted 607 valid PMC links from the CSV.

Processing a sample of 5 documents using Docling...
--- Document 1/5: Mice in Bion-M 1 space mission: training and selec...
    Processed 1 chunks. Time: 0.62s
--- Document 2/5: Microgravity induces pelvic bone loss through oste...
    Processed 1 chunks. Time: 0.76s
--- Document 3/5: Stem Cell Health and Tissue Regeneration in Microg...
    Processed 1 chunks. Time: 1.49s
--- Document 4/5: Microgravity Reduces the Differentiation and Regen...
    Processed 1 chunks. Time: 0.70s
--- Document 5/5: Microgravity validation of a novel system for RNA ...
    Processed 1 chunks. Time: 0.79s

--- Verification ---
Total structured chunks ready for KG/Vector Store: 5

Sample Chunk Content (First 200 chars): 
# Mice in Bion-M 1 Space Mission: Training and Selection

[Alexander Andreev-Andrievskiy](https://pubmed.ncbi.nlm.nih.gov/?term=%22Andreev-Andrievskiy%20A%22%5BAuthor%5D)

1, 2, * , [Anfisa Popova](ht...

Sample Chunk Metadata:
{'source': 'https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4136787/', 'original_title': 'Mice in Bion-M 1 space mission: training and selection', 'source_url': 'https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4136787/', 'doc_id': 'PMC_PMC4136787'}
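
The run above reports one chunk per document. A tally by `doc_id` makes that kind of check explicit and flags chunks that lost their metadata. This is a minimal sketch under stated assumptions: `Chunk` is a hypothetical stand-in exposing only the `page_content` and `metadata` attributes used in this notebook, and the sample values are illustrative.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Chunk:
    """Hypothetical stand-in for a loaded document chunk."""
    page_content: str
    metadata: Dict[str, str] = field(default_factory=dict)


def chunks_per_document(chunks: List[Chunk]) -> Counter:
    """Count chunks per doc_id; chunks without one are grouped under '<missing>'."""
    return Counter(c.metadata.get("doc_id", "<missing>") for c in chunks)


chunks = [
    Chunk("first part", {"doc_id": "PMC_PMC4136787"}),
    Chunk("second part", {"doc_id": "PMC_PMC4136787"}),
    Chunk("no id here"),
]
counts = chunks_per_document(chunks)
print(counts)
```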


Output from test:

✅ Extracted 607 valid PMC links from the CSV.
/usr/local/lib/python3.12/dist-packages/huggingface_hub/utils/_auth.py:94: UserWarning:
The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.
  warnings.warn(

Processing a sample of 5 documents using Docling...
--- Document 1/5: Mice in Bion-M 1 space mission: training and selec...
WARNING:docling.models.factories.base_factory:The plugin langchain_docling will not be loaded because Docling is being executed with allow_external_plugins=false.
    ✅ Processed 1 chunks. Time: 2.49s
--- Document 2/5: Microgravity induces pelvic bone loss through oste...
    ✅ Processed 1 chunks. Time: 1.68s
--- Document 3/5: Stem Cell Health and Tissue Regeneration in Microg...
    ✅ Processed 1 chunks. Time: 1.59s
--- Document 4/5: Microgravity Reduces the Differentiation and Regen...
WARNING:docling.backend.html_backend:Clashing formatting: 'bold=True italic=False underline=False strikethrough=False script=<Script.BASELINE: 'baseline'>' and 'bold=False italic=True underline=False strikethrough=False script=<Script.BASELINE: 'baseline'>'! Chose 'bold=False italic=True underline=False strikethrough=False script=<Script.BASELINE: 'baseline'>'
[15 further 'Clashing formatting' warnings from docling.backend.html_backend omitted]
    ✅ Processed 1 chunks. Time: 1.23s
--- Document 5/5: Microgravity validation of a novel system for RNA ...
    ✅ Processed 1 chunks. Time: 1.51s

--- Verification ---
Total structured chunks ready for KG/Vector Store: 5

Sample Chunk Content (First 200 chars):
# Mice in Bion-M 1 Space Mission: Training and Selection

[Alexander Andreev-Andrievskiy](https://pubmed.ncbi.nlm.nih.gov/?term=%22Andreev-Andrievskiy%20A%22%5BAuthor%5D)

1, 2, * , [Anfisa Popova](ht...

Sample Chunk Metadata:
{'source': 'https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4136787/', 'original_title': 'Mice in Bion-M 1 space mission: training and selection', 'source_url': 'https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4136787/', 'doc_id': 'PMC_PMC4136787'}
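
The repeated `Clashing formatting` warnings in the test output can drown out the per-document progress lines. One option is to raise the level of the logger that emits them, using Python's standard `logging` module. The sketch assumes the logger name matches the prefix printed in the log (`docling.backend.html_backend`); it hides warnings from that logger only and leaves errors visible.

```python
import logging

# Logger name taken from the "WARNING:docling.backend.html_backend:" prefix in the output.
noisy = logging.getLogger("docling.backend.html_backend")

# Only ERROR and above will be emitted by this logger from now on.
noisy.setLevel(logging.ERROR)

print(noisy.isEnabledFor(logging.WARNING), noisy.isEnabledFor(logging.ERROR))
```

Running this before the ingestion cell keeps the rest of the output unchanged while suppressing the formatting warnings.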