## <b>MOF ChemUnity Synthesis Extraction</b>

This notebook demonstrates the synthesis extraction workflow in MOF ChemUnity. It requires the MOF names extracted by the Matching Workflow, so run ``ChemUnity_Matching.ipynb`` first to obtain them. The vector stores generated by the Matching Workflow are reused here, so there is no need to read any .pdf or .xml files or convert them to .md.

In [2]:
import pandas as pd
import glob
import os
from langchain_openai import ChatOpenAI
from langchain.callbacks import get_openai_callback

from MOF_ChemUnity.Agents.ExtractionAgent import ExtractionAgent
from MOF_ChemUnity.utils.DocProcessor import DocProcessor
from MOF_ChemUnity.Prompts.Synthesis_Prompts import SYNTHESIS_EXTRACTION
from MOF_ChemUnity.utils.DataModels import Synthesis
from MOF_ChemUnity.Agents.BaseAgent import BaseAgent

<b>Preparation of MOF Names from Matching CSV</b>

Refer to ```ChemUnity_Matching.ipynb``` to generate the matching .csv file (loaded below as ```mof_names_df```), which maps each MOF name to its CSD reference code and the DOI of the source paper. These are required for the rest of the demonstration. You will also need your own OpenAI API key. The following cell sets the key, creates the LLM clients, and loads the matching results.

In [None]:
os.environ["OPENAI_API_KEY"] = 'YOUR_API_KEY'

llm = ChatOpenAI(model="gpt-4o", temperature=0.1)
parser_llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

mof_names_df = pd.read_csv("Examples/MD/Elsevier/results/matching.csv")
mof_names_df.head()

Unnamed: 0,MOF Name,CSD Ref Code,Justification,DOI
0,C9H12CaO10<|>One-pot Self-assembly Reaction,OWEZEW01,The MOF C9H12CaO10 synthesized by the one-pot ...,10.1016/j.molstruc.2017.11.128
1,C18H30Ca3O24<|>Ion Exchange Method,not provided,The MOF C18H30Ca3O24 synthesized by the ion ex...,10.1016/j.molstruc.2017.11.128
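The extraction loop later in this notebook skips rows whose CSD ref code is ``not provided`` or longer than 8 characters. To preview which rows will actually be sent to the agent, the same rule can be applied up front. The sketch below is a minimal stand-alone version: ``is_usable_refcode`` is a hypothetical helper name (not part of MOF_ChemUnity), it additionally treats missing values as unusable (an assumption), and the sample rows are copied from the output above.

```python
import math

def is_usable_refcode(value, max_len=8):
    """Return True if `value` looks like a usable CSD ref code.

    Hypothetical helper: mirrors the skip rule used in the extraction loop
    and, as an extra assumption, also rejects missing values (None/NaN).
    """
    if value is None or (isinstance(value, float) and math.isnan(value)):
        return False
    text = str(value).strip()
    if not text or text.lower() == "not provided":
        return False
    return len(text) <= max_len

# Sample rows taken from the matching output shown above
rows = [
    {"MOF Name": "C9H12CaO10<|>One-pot Self-assembly Reaction", "CSD Ref Code": "OWEZEW01"},
    {"MOF Name": "C18H30Ca3O24<|>Ion Exchange Method", "CSD Ref Code": "not provided"},
]

usable = [row for row in rows if is_usable_refcode(row["CSD Ref Code"])]
print(f"{len(usable)} of {len(rows)} rows have a usable CSD ref code")
```

With the real DataFrame, the same check could be applied as ``mof_names_df["CSD Ref Code"].map(is_usable_refcode)``.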


<b>Vector Store Files setup</b>

For this example, we will use the markdown example explored in the matching notebook. Please follow these steps:
- Point ```input_folder``` to the vector store folder "vs" that was created during matching
- Initialize your extraction results dictionary (shown as ```results```)
- Create the agent object

In [24]:
input_folder = "Examples/MD/vs"
files = glob.glob(input_folder+"/*/")

agent = ExtractionAgent(llm=llm)

# Directory and file paths
results_dir = os.path.join(input_folder, 'synthesis_results')
results_file = os.path.join(results_dir, 'synthesis_extractions.csv')

# Set parameters
saving_every = 250  # Save results every X MOFs
batch_size = 4000    # Stop after processing X MOFs (manual restart required)

# Ensure the results directory exists
os.makedirs(results_dir, exist_ok=True)

# Prepare or load the results DataFrame
if os.path.exists(results_file):
    existing_results = pd.read_csv(results_file)

    # Drop "Unnamed: 0" if it exists
    if "Unnamed: 0" in existing_results.columns:
        existing_results = existing_results.drop(columns=["Unnamed: 0"])
    
    processed_refs = set(existing_results["Reference"])
    results = existing_results.to_dict(orient="list")
else:
    processed_refs = set()
    results = {
        "Metal Precursor": [], "Linker": [], "Solvent": [],
        "Temperature": [], "Reaction Time": [], "Synthesis Procedure": [],
        "Additional Conditions": [], "Justification": [], "Reference": [],
        "CSD Ref Code": []
    }

# Determine rows that haven't been processed yet.
# Note: matching is by DOI, so a DOI with any saved row is skipped entirely on restart.
remaining_df = mof_names_df[~mof_names_df["DOI"].isin(processed_refs)].reset_index(drop=True)

# Get the total number of MOFs that need to be processed
total_remaining = len(remaining_df)
total_processed = 0  # Tracks how many MOFs have been processed in this run
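The resume logic above keys on the DOI alone, so once any MOF from a paper has been saved, every other MOF from that paper is dropped from ``remaining_df`` on restart. If that is not the intended behaviour, one option is to key on the (DOI, CSD ref code) pair instead. The sketch below illustrates the difference with plain dictionaries in place of DataFrames. ``remaining_rows`` and the ref code ``EXAMPLE1`` are made up for illustration, and the saved rows use a ``DOI`` key for simplicity where the real results file uses ``Reference``.

```python
def remaining_rows(rows, processed, key_fields):
    """Return rows whose key (built from `key_fields`) is not in `processed`."""
    done = {tuple(p[f] for f in key_fields) for p in processed}
    return [r for r in rows if tuple(r[f] for f in key_fields) not in done]

doi = "10.1016/j.molstruc.2017.11.128"

# Two MOFs from the same paper; "EXAMPLE1" is a made-up ref code
all_rows = [
    {"DOI": doi, "CSD Ref Code": "OWEZEW01"},
    {"DOI": doi, "CSD Ref Code": "EXAMPLE1"},
]

# Only the first MOF was saved before the run was interrupted
saved = [{"DOI": doi, "CSD Ref Code": "OWEZEW01"}]

by_doi = remaining_rows(all_rows, saved, key_fields=("DOI",))
by_pair = remaining_rows(all_rows, saved, key_fields=("DOI", "CSD Ref Code"))

print(len(by_doi))   # keyed on DOI: the unsaved MOF is dropped too
print(len(by_pair))  # keyed on the pair: the unsaved MOF is still queued
```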

<b>Running the Extraction Loop for General Property Extraction + CoV</b>

The following cell runs general property extraction for each MOF and collects the synthesis details reported in the literature. As written, the agent call passes ```CoV=False```, so CoV is disabled in this run. The extracted fields are:
- CSD Reference Code
- Metal precursor
- Linker
- Solvent used for synthesis
- Temperature of reaction
- Reaction time
- The synthesis procedure itself
- Additional conditions
- Justification for the extracted values
- Reference (DOI)
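In the loop below, each field is read from the agent's response with ``getattr`` and a default of ``""``, so a missing attribute yields an empty cell rather than an error. The sketch below shows that pattern in isolation, using a column-to-attribute mapping whose names are taken from the loop. ``FakeSynthesis`` is a hypothetical stand-in with only a few fields. It is not the ``Synthesis`` model from MOF_ChemUnity, whose definition is not shown in this notebook.

```python
from dataclasses import dataclass

# Result column -> attribute name, as used in the extraction loop
COLUMN_TO_ATTR = {
    "Metal Precursor": "metal_precursor",
    "Linker": "organic_linker",
    "Solvent": "solvent",
    "Temperature": "temperature",
    "Reaction Time": "reaction_time",
    "Synthesis Procedure": "synthesis_procedure",
    "Additional Conditions": "additional_conditions",
    "Justification": "justification",
}

@dataclass
class FakeSynthesis:
    """Hypothetical stand-in for an agent response (deliberately incomplete)."""
    metal_precursor: str
    organic_linker: str
    solvent: str

def append_extraction(results, extraction, reference, refcode):
    """Append one extraction to the column-oriented `results` dict."""
    for column, attr in COLUMN_TO_ATTR.items():
        results[column].append(getattr(extraction, attr, ""))
    results["Reference"].append(reference)
    results["CSD Ref Code"].append(refcode)

results = {column: [] for column in [*COLUMN_TO_ATTR, "Reference", "CSD Ref Code"]}
response = FakeSynthesis("calcium acetate", "trimesic acid (C9H6O6)", "deionized water, ethanol")
append_extraction(results, response, "10.1016/j.molstruc.2017.11.128", "OWEZEW01")

print(results["Linker"])       # filled from the response
print(results["Temperature"])  # attribute missing, so the default "" is stored
```

Driving the appends from one mapping keeps every column the same length, which ``pd.DataFrame(results)`` requires.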

In [25]:
# Process in batches, but stop after processing batch_size MOFs
for batch_start in range(0, total_remaining, saving_every):
    if total_processed >= batch_size:
        break  # Stop execution when batch_size MOFs are processed

    batch_df = remaining_df.iloc[batch_start:batch_start + saving_every]
    

    # Process each row in the batch
    for i, row in batch_df.iterrows():
        reference = row["DOI"]
        mof = row["MOF Name"]
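        refcode = row["CSD Ref Code"]

        # Skip rows without a usable CSD ref code: missing (NaN), "not provided",
        # or longer than 8 characters. The NaN check avoids a TypeError from len().
        if pd.isna(refcode) or len(str(refcode)) > 8 or str(refcode).lower() == "not provided":
            continue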
        
        # Generate file name based on reference
        file_name = reference.replace("/", "_") + ".md"
        
        # Get synthesis extraction via agent response
        try:
            response = agent.agent_response(mof, file_name,
                                            SYNTHESIS_EXTRACTION, CoV=False, filtered=False,
                                            gen_extraction_structure=Synthesis,
                                            vector_store=os.path.join(input_folder, reference.replace("/", "_", 1)))
            synthesis_extraction = response

            # Append extracted synthesis data to results dictionary
            results["Metal Precursor"].append(getattr(synthesis_extraction, "metal_precursor", ""))
            results["Linker"].append(getattr(synthesis_extraction, "organic_linker", ""))
            results["Solvent"].append(getattr(synthesis_extraction, "solvent", ""))
            results["Temperature"].append(getattr(synthesis_extraction, "temperature", ""))
            results["Reaction Time"].append(getattr(synthesis_extraction, "reaction_time", ""))
            results["Synthesis Procedure"].append(getattr(synthesis_extraction, "synthesis_procedure", ""))
            results["Additional Conditions"].append(getattr(synthesis_extraction, "additional_conditions", ""))
            results["Justification"].append(getattr(synthesis_extraction, "justification", ""))
            results["Reference"].append(reference)
            results["CSD Ref Code"].append(refcode)
            
            total_processed += 1  # Increment the count of processed MOFs
            
            # Stop early if batch_size limit is reached
            if total_processed >= batch_size:
                break  

        except Exception as e:
            print(f"Error processing row {i} with DOI {reference}: {e}")

    # Save progress every `saving_every` MOFs
    batch_results_df = pd.DataFrame(results)
    batch_results_df.to_csv(results_file, index=False)

    # Rows not yet processed in this run (skipped rows are not counted as processed)
    remaining_to_process = max(total_remaining - total_processed, 0)

    print(f"✅ {total_processed} MOFs processed so far. {remaining_to_process} remaining. Progress saved to {results_file}.")

# Final completion message
if total_processed >= batch_size:
    print(f"🚫 Processing stopped after {batch_size} MOFs. {remaining_to_process} MOFs still need processing. Restart script to continue.")
else:
    print("✅ Processing complete! All MOFs processed.")


✅ Processing complete! All MOFs processed.
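The loop derives two names from each DOI: the markdown file name replaces every ``/`` with ``_``, while the vector store path replaces only the first. For a DOI with a single slash, such as the one in this example, both give the same stem. For a DOI containing more than one slash they differ, which is worth checking against the folder names actually present in ``vs``. A small illustration follows; the helper names are hypothetical and the second DOI is invented purely to show the difference.

```python
def file_stem(doi):
    """Stem used for the markdown file name: every '/' becomes '_'."""
    return doi.replace("/", "_")

def store_stem(doi):
    """Stem used for the vector store folder: only the first '/' becomes '_'."""
    return doi.replace("/", "_", 1)

single = "10.1016/j.molstruc.2017.11.128"
multi = "10.1000/abc/def"  # invented DOI with two slashes

for doi in (single, multi):
    same = file_stem(doi) == store_stem(doi)
    print(doi, "->", file_stem(doi), "|", store_stem(doi), "| same:", same)
```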


In [26]:
batch_results_df

Unnamed: 0,Metal Precursor,Linker,Solvent,Temperature,Reaction Time,Synthesis Procedure,Additional Conditions,Justification,Reference,CSD Ref Code
0,calcium acetate,trimesic acid (C9H6O6),"deionized water, ethanol",150 °C,12 days,Dissolve 2 mmol of trimesic acid in 20 mL deio...,Not Provided,2 mmol of trimesic acid were dissolved in 20 m...,10.1016/j.molstruc.2017.11.128,OWEZEW01
