<a href="https://colab.research.google.com/github/EricSiq/NASA-Space-Apps-Challenge/blob/main/Notebooks/DataIngestionDoclingTest1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This is an excellent first step in your project! Yes, it is **absolutely possible**, and it is the correct workflow for preparing your data for **Docling**.

You must first extract the links and their titles from the CSV because **Docling** (or any document loader) requires the specific file path or URL for each document it processes.

## Best Code Approach: Extract and Prepare

The best approach is to use the `pandas` library to load the CSV, extract the necessary columns, and structure the data as a list of tuples or a list of dictionaries. Either format can be fed directly into your subsequent processing loop with Docling.

### Python Code for Extraction

This code will load your CSV and generate a clean list of (Title, Link) pairs, keeping only the rows whose link starts with the PMC article base URL.

In [None]:
import pandas as pd
import csv

def extract_publication_data(file_name):
    """
    Loads the CSV, extracts the Title and Link columns, and prepares a
    list of (title, link) tuples for processing.
    """
    try:
        # Load the CSV file
        df = pd.read_csv(file_name, quoting=csv.QUOTE_MINIMAL)

        # Assuming the columns are named 'Title' and 'Link'
        # Filter for links that start with the expected base URL for PMC articles
        base_url_prefix = "https://www.ncbi.nlm.nih.gov/pmc/articles/"
        df_filtered = df[df['Link'].str.startswith(base_url_prefix, na=False)].copy()

        # Create a list of tuples (Title, Link)
        publication_list = list(zip(df_filtered['Title'], df_filtered['Link']))

        return publication_list

    except Exception as e:
        print(f"An error occurred during file processing: {e}")
        return []

# Name of the CSV file that lists the publications (adjust the path as needed)
file_name = "SB_publication_PMC.csv"

# Run the extraction
publication_data = extract_publication_data(file_name)

# Print a sample of the extracted data for verification
if publication_data:
    print(f"Total publications extracted: {len(publication_data)}")
    print("Sample of extracted (Title, Link) pairs:")
    for i, (title, link) in enumerate(publication_data[:5], start=1):
        print(f"  {i}. Title: {title[:60]}... | Link: {link}")
else:
    print("No publication data was extracted.")
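
If a list of dictionaries is preferred over tuples, the same extraction can be sketched as follows. This is a minimal illustration, not part of the original code: the helper name `to_publication_records`, the lowercase keys `title` and `link`, and the sample rows (including the placeholder PMC IDs) are assumptions made for the example. It also drops rows with a missing title or link and removes duplicate links, which the tuple version above does not do.

```python
import io

import pandas as pd

PMC_PREFIX = "https://www.ncbi.nlm.nih.gov/pmc/articles/"


def to_publication_records(df: pd.DataFrame) -> list[dict]:
    """Return cleaned [{'title': ..., 'link': ...}] records (hypothetical helper)."""
    # Drop rows where either the title or the link is missing
    cleaned = df.dropna(subset=["Title", "Link"]).copy()

    # Trim stray whitespace around both fields
    cleaned["Title"] = cleaned["Title"].astype(str).str.strip()
    cleaned["Link"] = cleaned["Link"].astype(str).str.strip()

    # Keep only PMC article links, and only the first row for each link
    cleaned = cleaned[cleaned["Link"].str.startswith(PMC_PREFIX)]
    cleaned = cleaned.drop_duplicates(subset="Link")

    renamed = cleaned.rename(columns={"Title": "title", "Link": "link"})
    return renamed[["title", "link"]].to_dict("records")


# Tiny in-memory CSV with placeholder rows, used only to exercise the helper
sample_csv = io.StringIO(
    "Title,Link\n"
    "Paper A,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC0000001/\n"
    "Paper A (duplicate),https://www.ncbi.nlm.nih.gov/pmc/articles/PMC0000001/\n"
    ",https://www.ncbi.nlm.nih.gov/pmc/articles/PMC0000002/\n"
    "Paper C,https://example.org/not-pmc\n"
)
records = to_publication_records(pd.read_csv(sample_csv))
print(records)
```

Dictionaries make it easier to attach further fields (for example a processing status) later, at the cost of slightly more verbose access than tuple unpacking.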

-----

## Next Step: Docling Integration Workflow

Once the above code provides the `publication_data` list, you will use it in the next step to perform the actual ingestion and structural parsing using Docling.

1.  **Iterate:** Loop through the `publication_data` list.
2.  **Docling Call:** Inside the loop, initialize the `DoclingLoader` using the **URL** from the list. Docling is capable of loading directly from a URL.
3.  **Process:** Use the `load()` or `lazy_load()` method to fetch and process the document.
4.  **Extract & Store:** Immediately process the structured Docling output to perform your **NER/RE** and populate the **Knowledge Graph** and **Vector Store**.
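
The control flow of these four steps can be sketched without Docling installed by passing the loader in as a callable. Everything named here is hypothetical and exists only for illustration: `ingest`, `load_chunks`, `fake_loader`, the `text` key, and the example URLs. In a real run, `load_chunks` would wrap whatever Docling loader you configure.

```python
from typing import Callable, Iterable


def ingest(
    publications: Iterable[tuple[str, str]],
    load_chunks: Callable[[str], Iterable[dict]],
) -> tuple[list[dict], list[tuple[str, str, str]]]:
    """Load every publication, tag its chunks with the title, and record failures."""
    processed: list[dict] = []
    failures: list[tuple[str, str, str]] = []

    for title, url in publications:  # Step 1: iterate
        try:
            # Steps 2-3: call the injected loader and materialize its output
            chunks = list(load_chunks(url))
        except Exception as exc:
            # Keep going on failure, but remember what went wrong
            failures.append((title, url, str(exc)))
            continue

        # Step 4: attach the original title so downstream stages can use it
        for chunk in chunks:
            processed.append({**chunk, "original_title": title})

    return processed, failures


def fake_loader(url: str) -> list[dict]:
    """Stand-in loader (assumption): fails for URLs containing 'broken'."""
    if "broken" in url:
        raise ValueError("download failed")
    return [{"text": f"chunk 1 of {url}"}, {"text": f"chunk 2 of {url}"}]


demo = [
    ("Paper A", "https://example.org/a"),
    ("Paper B", "https://example.org/broken"),
]
processed, failures = ingest(demo, fake_loader)
print(len(processed), "chunks processed;", len(failures), "failure(s)")
```

Collecting failures in a list, rather than only printing them, makes it possible to retry the failed URLs after the main pass.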

### Conceptual Docling Integration (Python)

This snippet illustrates how the extraction output feeds into Docling for processing:

In [None]:
# Assuming you successfully ran the extraction code above and have the 'publication_data' list

# --- Conceptual Code for Docling Ingestion (Requires Docling and LangChain setup) ---

# from langchain_docling import DoclingLoader
# from langchain_docling.loader import ExportType
# from docling.chunking import HybridChunker

# processed_documents = []

# for title, url in publication_data:
#     try:
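#         # 1. Initialize DoclingLoader with the URL.
#         #    Note: the chunker is only applied with ExportType.DOC_CHUNKS;
#         #    with ExportType.MARKDOWN each source yields one Markdown document.
#         loader = DoclingLoader(
#             file_path=url,
#             chunker=HybridChunker(),
#             export_type=ExportType.MARKDOWN
#         )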

#         # 2. Load the structured document
#         docs = loader.load()

#         # 3. Add original metadata (Title) to each chunk and store
#         for doc in docs:
#             doc.metadata['original_title'] = title
#             processed_documents.append(doc)

#         # 4. NOW: Run NER/RE and KG population on the 'docs' list

#     except Exception as e:
#         print(f"Failed to process document: {title} at {url}. Error: {e}")

# # The 'processed_documents' list is now ready for your Knowledge Graph pipeline.

The approach is to **Extract → Transform → Load** (ETL), where Docling handles the critical **Transform** step by converting the messy source layout (such as a PDF or an HTML article page) into clean, semantically rich text (like Markdown) before the downstream AI work begins.