<a href="https://colab.research.google.com/github/EricSiq/NASA-Space-Apps-Challenge/blob/main/Notebooks/DataIngestionDoclingTest2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
%pip install langchain_docling docling -q

In [2]:
%pip install neo4j

Collecting neo4j
  Downloading neo4j-5.28.2-py3-none-any.whl.metadata (5.9 kB)
Downloading neo4j-5.28.2-py3-none-any.whl (313 kB)
Installing collected packages: neo4j
Successfully installed neo4j-5.28.2



## Best Code Approach: Extract and Prepare

A straightforward approach is to load the CSV with the `pandas` library, extract the necessary columns, and structure the data as a list of tuples (or a list of dictionaries). Either form feeds directly into the subsequent Docling processing loop.

### Python Code for Extraction

The code below loads the CSV and returns a list of `(Title, Link)` pairs, keeping only rows whose link starts with the PMC article URL prefix.

In [3]:
import pandas as pd
import csv
import os
import time
from typing import List, Tuple, Dict, Any

# Import necessary classes from Docling and its LangChain integration
# Docling's specific classes for chunking and export types
from docling.chunking import HybridChunker
from langchain_docling.loader import DoclingLoader, ExportType
from langchain_core.documents import Document


In [4]:

# --- NER/RE - Hugging Face and Structured Output ---
# Transformers for BioBERT (or similar scientific models)
from transformers import AutoModelForTokenClassification, AutoTokenizer, pipeline
# Pydantic for defining the mandatory structured output schema for Relation Extraction
from pydantic import BaseModel, Field

# --- Knowledge Graph (KG) and Database Connector ---
# Neo4j Driver for connecting to and interacting with the graph database
from neo4j import GraphDatabase
# Optional: LangChain utility for Graph Store abstraction
# from langchain_community.graphs import Neo4jGraph

In [5]:

# Function to Extract Data from CSV
def extract_publication_data(file_name: str) -> List[Tuple[str, str]]:
    """
    Loads the CSV, extracts the Title and Link columns, and filters for valid PMC URLs.
    """
    try:
        # Load the CSV file
        df = pd.read_csv(file_name, quoting=csv.QUOTE_MINIMAL)

        # Assuming the columns are named 'Title' and 'Link'
        base_url_prefix = "https://www.ncbi.nlm.nih.gov/pmc/articles/"

        # 1. Filter: Ensure we only keep valid links for PMC articles
        df_filtered = df[df['Link'].astype(str).str.startswith(base_url_prefix, na=False)].copy()

        # 2. Select: Create a list of tuples (Title, Link)
        publication_list = list(zip(df_filtered['Title'], df_filtered['Link']))

        print(f"Extracted {len(publication_list)} valid PMC links from the CSV.")
        return publication_list

    except Exception as e:
        print(f"ERROR: An error occurred during CSV extraction: {e}")
        return []
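
To sanity-check the prefix filter without the real CSV, the same rule can be exercised on a few inline rows. The sketch below uses only the standard-library `csv` module instead of `pandas`; the helper name `filter_pmc_rows` and the sample rows are hypothetical and exist only for illustration.

```python
import csv
import io
from typing import List, Tuple

PMC_PREFIX = "https://www.ncbi.nlm.nih.gov/pmc/articles/"


def filter_pmc_rows(csv_text: str) -> List[Tuple[str, str]]:
    """Return (Title, Link) pairs whose Link starts with the PMC article prefix."""
    reader = csv.DictReader(io.StringIO(csv_text))
    pairs: List[Tuple[str, str]] = []
    for row in reader:
        # Missing or empty links become "" and are filtered out
        link = (row.get("Link") or "").strip()
        if link.startswith(PMC_PREFIX):
            pairs.append((row["Title"], link))
    return pairs


# Made-up rows: one valid PMC link, one foreign link, one empty link
sample_csv = (
    "Title,Link\n"
    '"Paper A, with a comma",https://www.ncbi.nlm.nih.gov/pmc/articles/PMC0000001/\n'
    "Paper B,https://example.org/not-pmc\n"
    "Paper C,\n"
)

pairs = filter_pmc_rows(sample_csv)
print(pairs)
```

Only the first row should survive, which is the behavior `extract_publication_data` is expected to have on the real file.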

-----

## Next Step: Docling Integration Workflow

Once the code above has produced the `publication_data` list, the next step uses it to ingest each article and parse its structure with Docling.

1.  **Iterate:** Loop through the `publication_data` list.
2.  **Docling Call:** Inside the loop, initialize `DoclingLoader` with the **URL** from the list. Docling can load directly from a URL.
3.  **Process:** Call `load()` or `lazy_load()` to fetch and process the document.
4.  **Extract & Store:** Process the structured Docling output to perform your **NER/RE** and populate the **Knowledge Graph** and **Vector Store**.

### Conceptual Docling Integration (Python)

This snippet illustrates how the extraction output feeds into Docling for processing.

Run the installation cells above first, then execute the cell below.

In [6]:


# Function to Ingest Data using Docling
def ingest_documents_with_docling(publication_data: List[Tuple[str, str]]) -> List[Document]:
    """
    Iterates through the list of (title, url) and uses Docling to fetch and process
    each document, preserving structural information in Markdown format.
    """
    if not publication_data:
        print("No data to process. Exiting ingestion.")
        return []

    processed_documents: List[Document] = []

    # Tokenization-aware chunker with default settings; the chunk size is bounded
    # by the tokenizer's token limit.
    # NOTE: DoclingLoader only applies the chunker when export_type is
    # ExportType.DOC_CHUNKS. With ExportType.MARKDOWN (used below), each URL
    # comes back as a single Document.
    chunker = HybridChunker()

    # To split strictly along the document's structure instead, you can
    # experiment with Docling's HierarchicalChunker:
    # from docling.chunking import HierarchicalChunker
    # chunker = HierarchicalChunker()

    # Process only a small sample for demonstration/testing because of processing time.
    # For a full production run, remove the [0:5] slice.
    sample_data = publication_data[0:5]

    print(f"\nProcessing a sample of {len(sample_data)} documents using Docling...")

    for i, (title, url) in enumerate(sample_data):
        start_time = time.time()
        print(f"--- Document {i+1}/{len(sample_data)}: {title[:50]}...")

        try:
            # 1. Initialize DoclingLoader with the URL
            # Docling handles fetching the content from the web
            loader = DoclingLoader(
                file_path=url,
                chunker=chunker,
                # Export as Markdown to preserve semantic structure (headings, tables)
                export_type=ExportType.MARKDOWN
            )

            # 2. Load the structured document. The result is a list of Document chunks.
            docs = loader.load()

            # 3. Add custom metadata (Title, Source URL) to each chunk
            for doc in docs:
                doc.metadata['original_title'] = title
                doc.metadata['source_url'] = url
                # Add a unique ID for traceability in the Knowledge Graph
                doc.metadata['doc_id'] = f"PMC_{url.split('/')[-2]}"
                processed_documents.append(doc)

            end_time = time.time()
            print(f"    Processed {len(docs)} chunks. Time: {end_time - start_time:.2f}s")
            # This is where you would call your NER/RE and KG population functions
            # process_for_knowledge_graph(docs)

        except Exception as e:
            print(f"    FAILED to process document. Error: {e}")

    return processed_documents


The overall approach is **Extract $\rightarrow$ Transform $\rightarrow$ Load** (ETL). Docling handles the critical **Transform** step: it converts the messy structure of the source document into clean, semantically rich text (such as Markdown) before the downstream AI work begins.
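
One practical consequence of emitting Markdown in the Transform step is that downstream code can split a document on its headings. The following is a minimal sketch of that idea using only the standard library; `split_markdown_sections` and the sample text are hypothetical, and the splitter deliberately ignores edge cases such as `#` characters inside fenced code.

```python
import re
from typing import List, Tuple

# ATX-style Markdown heading: 1-6 '#' characters, whitespace, then the title
HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def split_markdown_sections(markdown: str) -> List[Tuple[str, str]]:
    """Split Markdown into (heading, body) pairs.

    Text before the first heading is returned under an empty heading.
    """
    sections: List[Tuple[str, List[str]]] = [("", [])]
    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            # Start a new section at every heading
            sections.append((match.group(2).strip(), []))
        else:
            sections[-1][1].append(line)
    # Drop entries that have neither a heading nor any body text
    return [
        (title, "\n".join(body).strip())
        for title, body in sections
        if title or "".join(body).strip()
    ]


sample_md = "# Title\n\nIntro text.\n\n## Methods\n\nMice were trained.\n"
sections = split_markdown_sections(sample_md)
for heading, body in sections:
    print(f"{heading!r}: {body!r}")
```

Each `(heading, body)` pair could then be wrapped in its own `Document` if section-level granularity is wanted for the Knowledge Graph or Vector Store.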

In [7]:

# Main Execution
if __name__ == "__main__":
    #  Phase 1: Extraction
    file_name = "SB_publication_PMC.csv"
    publication_data = extract_publication_data(file_name)

    #  Phase 2: Ingestion and Structural Parsing
    if publication_data:
        final_document_chunks = ingest_documents_with_docling(publication_data)

        # Verification of the processed chunks
        if final_document_chunks:
            print("\n--- Verification ---")
            print(f"Total structured chunks ready for KG/Vector Store: {len(final_document_chunks)}")

            # Show a sample chunk content and metadata
            sample_chunk = final_document_chunks[0]
            print(f"\nSample Chunk Content (First 200 chars): \n{sample_chunk.page_content[:200]}...")
            print("\nSample Chunk Metadata:")
            print(sample_chunk.metadata)

Extracted 607 valid PMC links from the CSV.


Processing a sample of 5 documents using Docling...
--- Document 1/5: Mice in Bion-M 1 space mission: training and selec...
    Processed 1 chunks. Time: 1.01s
--- Document 2/5: Microgravity induces pelvic bone loss through oste...
    Processed 1 chunks. Time: 2.52s
--- Document 3/5: Stem Cell Health and Tissue Regeneration in Microg...
    Processed 1 chunks. Time: 1.71s
--- Document 4/5: Microgravity Reduces the Differentiation and Regen...
    Processed 1 chunks. Time: 0.98s
--- Document 5/5: Microgravity validation of a novel system for RNA ...
    Processed 1 chunks. Time: 1.12s

--- Verification ---
Total structured chunks ready for KG/Vector Store: 5

Sample Chunk Content (First 200 chars): 
# Mice in Bion-M 1 Space Mission: Training and Selection

[Alexander Andreev-Andrievskiy](https://pubmed.ncbi.nlm.nih.gov/?term=%22Andreev-Andrievskiy%20A%22%5BAuthor%5D)

1, 2, * , [Anfisa Popova](ht...

Sample Chunk Metadata:
{'source': 'https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4136787/', 'original_title': 'Mice in Bion-M 1 space mission: training and selection', 'source_url': 'https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4136787/', 'doc_id': 'PMC_PMC4136787'}
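
The sample metadata shows `doc_id` as `PMC_PMC4136787`: `url.split('/')[-2]` already returns the full accession including its `PMC` prefix, and it only works when the URL ends with a trailing slash (without one, `[-2]` is `articles`). A more tolerant extraction could look like the sketch below; `pmc_id_from_url` is a hypothetical helper, and it assumes the accession is the last path segment with no query string.

```python
import re
from typing import Optional

# Last path segment of the form PMC<digits>, with or without a trailing slash
PMC_ID = re.compile(r"/(PMC\d+)/?$")


def pmc_id_from_url(url: str) -> Optional[str]:
    """Return the PMC accession (e.g. 'PMC4136787') from an article URL, or None."""
    match = PMC_ID.search(url.strip())
    return match.group(1) if match else None


with_slash = pmc_id_from_url("https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4136787/")
without_slash = pmc_id_from_url("https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4136787")
not_pmc = pmc_id_from_url("https://example.org/some/page")
print(with_slash, without_slash, not_pmc)
```

Returning `None` for non-matching URLs lets the caller decide whether to skip the document or fall back to another identifier.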


In [8]:
import time
from typing import List, Tuple, Dict, Any

# Core Data Structures
from pydantic import BaseModel, Field

# Document Loading (from previous step)
from langchain_core.documents import Document

# LLM Integration (Using Hugging Face/Transformers as an open-source example)
# NOTE: The actual model (e.g., BioBERT) would replace a general-purpose placeholder
from transformers import pipeline

# You would need an actual Graph Database Connector here (e.g., Neo4j's Python Driver)
# import neo4j

# --- Pydantic Schema Definitions ---
# These models define the exact structure of the data you need for your Knowledge Graph

class Entity(BaseModel):
    """A scientific entity extracted from the text."""
    name: str = Field(description="The canonical name of the entity.")
    entity_type: str = Field(description="The entity type, e.g., Organism, Environment, Biological_Process, Biomolecule.")

class RelationshipTriple(BaseModel):
    """A structured triple representing a relationship between two entities."""
    subject: str = Field(description="The name of the subject entity (e.g., Microgravity).")
    relationship: str = Field(description="The action or link, e.g., causes, affects, measured_in, inhibits.")
    object: str = Field(description="The name of the object entity (e.g., Bone density loss).")
    evidence: str = Field(description="The exact span of text supporting this triple.")

class ExtractionSchema(BaseModel):
    """The complete structured output for a single text chunk."""
    entities: List[Entity] = Field(description="A list of all key entities found.")
    triples: List[RelationshipTriple] = Field(description="A list of all factual relationship triples found.")

print("Imports and Pydantic Schemas loaded successfully.")

Imports and Pydantic Schemas loaded successfully.
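
To illustrate how these schemas can guard the pipeline, the sketch below validates a raw JSON string (as an LLM might return it) against the schema. It assumes Pydantic v2 (`model_validate_json`); the models are redeclared without field descriptions so the block is self-contained, and `parse_extraction` plus the sample payloads are hypothetical.

```python
import json
from typing import List, Optional

from pydantic import BaseModel, ValidationError


class Entity(BaseModel):
    name: str
    entity_type: str


class RelationshipTriple(BaseModel):
    subject: str
    relationship: str
    object: str
    evidence: str


class ExtractionSchema(BaseModel):
    entities: List[Entity]
    triples: List[RelationshipTriple]


def parse_extraction(raw_json: str) -> Optional[ExtractionSchema]:
    """Validate raw model output; return None if it does not match the schema."""
    try:
        return ExtractionSchema.model_validate_json(raw_json)
    except ValidationError:
        # Covers both malformed JSON and missing or mistyped fields
        return None


good = json.dumps({
    "entities": [{"name": "Microgravity", "entity_type": "Environment"}],
    "triples": [{
        "subject": "Microgravity",
        "relationship": "induces",
        "object": "Pelvic bone loss",
        "evidence": "Microgravity induces pelvic bone loss...",
    }],
})
# Triple is missing relationship, object, and evidence
bad = json.dumps({"entities": [], "triples": [{"subject": "Microgravity"}]})

parsed = parse_extraction(good)
rejected = parse_extraction(bad)
print(parsed is not None, rejected is None)
```

Rejecting a non-conforming payload at this point keeps partial or malformed triples out of the Knowledge Graph.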


In [10]:
# --- Setup Mock LLM and Graph Connector ---

# In a real scenario, this would be initialized with your fine-tuned LLM
def initialize_structured_extraction_pipeline():
    """Initializes the LLM pipeline for structured NER and RE."""
    # NOTE: In production, this would be a specialized model (e.g., fine-tuned BioBERT)
    # The 'text-generation' pipeline is used as a stand-in for a model prompted for JSON output.
    print("Initializing mock LLM extraction pipeline...")
    time.sleep(1) # Simulate model loading time

    # We define the core prompt to guide the LLM's behavior
    system_prompt = (
        "You are an expert NASA Bioscience Knowledge Extractor. Your task is to analyze the "
        "provided text and extract all factual entities and relationships that describe "
        "experimental results and impacts related to spaceflight. STRICTLY adhere to the "
        "provided JSON schema for all output."
    )
    return system_prompt # Returning the prompt as the "model" for simulation

class MockGraphDBConnector:
    """Simulates the connection to the graph database (e.g., Neo4j)."""
    def __init__(self):
        self.total_triples = 0

    def add_triples(self, triples: List[RelationshipTriple], metadata: Dict[str, Any]):
        """Simulates adding extracted data to the KG."""
        doc_id = metadata.get('doc_id', 'N/A')
        for triple in triples:
            # Logic to create nodes and edges in the database would go here, e.g. a
            # parameterized Cypher MERGE run through a neo4j session:
            # session.run("MERGE (s:Entity {name: $name})", name=triple.subject)
            self.total_triples += 1

        # In a real implementation, this would handle the Topic Modeling clustering
        # after initial population.
        print(f"    Added {len(triples)} triples to KG from document {doc_id}.")

# --- Main NER/RE Loop Implementation ---

def run_ner_re_and_kg_population(document_chunks: List[Document]):
    """
    Orchestrates the NER, RE, and KG population using the structured LLM approach.
    """
    if not document_chunks:
        print("Document chunk list is empty. Exiting NER/RE process.")
        return

    llm_system_prompt = initialize_structured_extraction_pipeline()
    graph_db_connector = MockGraphDBConnector()

    # Mock LLM output for demonstration purposes only
    mock_llm_output = ExtractionSchema(
        entities=[
            Entity(name="Microgravity", entity_type="Environment"),
            Entity(name="Pelvic Bone", entity_type="Biological_Structure"),
            Entity(name="Osteoclastic Activity", entity_type="Biological_Process")
        ],
        triples=[
            RelationshipTriple(
                subject="Microgravity",
                relationship="induces",
                object="Pelvic Bone loss",
                evidence="Microgravity induces pelvic bone loss through osteoclastic activity..."
            )
        ]
    )

    print(f"\nStarting NER/RE on {len(document_chunks)} document chunks...")

    for i, doc_chunk in enumerate(document_chunks):
        print(f"  Processing chunk {i+1}...")

        try:
            # 1. & 2. Run Joint NER/RE using LLM
            # In production, you would call your LLM with the chunk content and the Pydantic schema
            # structured_output = llm(prompt=llm_system_prompt, text=doc_chunk.page_content, schema=ExtractionSchema)

            # Use mock output for speed/safety in this environment:
            structured_output = mock_llm_output

            # Validate the output against the schema
            if not isinstance(structured_output, ExtractionSchema):
                raise ValueError("LLM failed to return a valid structured output schema.")

            # 3. Populate Knowledge Graph (KG)
            graph_db_connector.add_triples(structured_output.triples, doc_chunk.metadata)

        except Exception as e:
            # Catch LLM/network/parsing errors and log the failure
            print(f"  ❌ Failed NER/RE for chunk {i+1} from doc_id {doc_chunk.metadata.get('doc_id')}. Error: {e}")
            continue

    print("\n--- NER/RE & KG Population Complete ---")
    print(f"Total Triples Mock-Added to KG: {graph_db_connector.total_triples}")
    print("Next step: Implement Topic Modeling and Visualization on the populated KG data.")

# Example Usage (assuming 'final_document_chunks' from the previous successful step)
# This will run the logic on the 5 sample chunks produced previously.
run_ner_re_and_kg_population(final_document_chunks)

Initializing mock LLM extraction pipeline...

Starting NER/RE on 5 document chunks...
  Processing chunk 1...
    Added 1 triples to KG from document PMC_PMC4136787.
  Processing chunk 2...
    Added 1 triples to KG from document PMC_PMC3630201.
  Processing chunk 3...
    Added 1 triples to KG from document PMC_PMC11988870.
  Processing chunk 4...
    Added 1 triples to KG from document PMC_PMC7998608.
  Processing chunk 5...
    Added 1 triples to KG from document PMC_PMC5587110.

--- NER/RE & KG Population Complete ---
Total Triples Mock-Added to KG: 5
Next step: Implement Topic Modeling and Visualization on the populated KG data.
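
When the mock connector is replaced by a real one, the node and edge creation that `add_triples` leaves as a placeholder could be built as a parameterized Cypher statement. The sketch below only constructs the query text and parameter dictionary, so it runs without a database; `relationship_type` and `build_merge_statement` are hypothetical helpers, and the `Entity` label and property names are assumptions carried over from the placeholder comment. Cypher does not accept relationship types as parameters, so the type is sanitized and interpolated while all values stay parameterized.

```python
import re
from typing import Any, Dict, Tuple


def relationship_type(label: str) -> str:
    """Normalize free text such as 'measured in' into a safe Cypher relationship type."""
    cleaned = re.sub(r"[^A-Za-z0-9]+", "_", label).strip("_").upper()
    if not cleaned or cleaned[0].isdigit():
        # Guard against empty names and names starting with a digit
        cleaned = f"REL_{cleaned}".rstrip("_")
    return cleaned


def build_merge_statement(
    subject: str, relationship: str, obj: str, evidence: str, doc_id: str
) -> Tuple[str, Dict[str, Any]]:
    """Build a MERGE statement plus its parameters for one triple."""
    rel = relationship_type(relationship)
    query = (
        "MERGE (s:Entity {name: $subject}) "
        "MERGE (o:Entity {name: $object}) "
        f"MERGE (s)-[r:{rel}]->(o) "
        "SET r.evidence = $evidence, r.doc_id = $doc_id"
    )
    params = {
        "subject": subject,
        "object": obj,
        "evidence": evidence,
        "doc_id": doc_id,
    }
    return query, params


query, params = build_merge_statement(
    "Microgravity", "induces", "Pelvic Bone loss",
    "Microgravity induces pelvic bone loss through osteoclastic activity...",
    "PMC_PMC3630201",
)
print(query)
# With a real neo4j driver session, this pair would be passed as session.run(query, params)
```

Using `MERGE` rather than `CREATE` means that repeated mentions of the same entity across chunks resolve to a single node.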
