## <b>MOF-ChemUnity Property, Application, and Synthesis Extraction</b>

This notebook demonstrates the property, application, and synthesis extraction workflow in MOF-ChemUnity. It requires the MOF names extracted by the Matching Workflow, so run ``ChemUnity_Matching.ipynb`` first to obtain them. The vector stores generated by the Matching Workflow can be reused here, so there is no need to re-read the .pdf or .xml files or convert them to .md.

In [1]:
import pandas as pd
import glob
import os
from langchain_openai import ChatOpenAI
from langchain.callbacks import get_openai_callback

from MOF_ChemUnity.Agents.ExtractionAgent import ExtractionAgent
from MOF_ChemUnity.utils.DocProcessor import DocProcessor
from MOF_ChemUnity.Prompts.Extraction_Prompts import VERIFICATION, RECHECK, EXTRACTION, APPLICATION
from MOF_ChemUnity.Prompts.Water_Stability_Prompts import WATER_STABILITY, RULES_WATER_STABILITY, VERF_RULES_WATER_STABILITY, WATER_STABILITY_RE
from MOF_ChemUnity.Prompts.Synthesis_Prompts import SYNTHESIS_EXTRACTION
from MOF_ChemUnity.utils.DataModels import Synthesis
from MOF_ChemUnity.Agents.BaseAgent import BaseAgent
from openai import RateLimitError
from MOF_ChemUnity.utils.DataModels import ListApplications
from MOF_ChemUnity.utils.Filters import PROPERTIES_FILTER, APPLICATIONS_FILTER
from MOF_ChemUnity.utils.FilterTools import filter_and_standardize

<b>Preparation of MOF Names from Matching CSV</b>

Please refer to ```ChemUnity_Matching.ipynb``` to generate the matching .csv file (loaded below as ```mof_names_df```), which maps each MOF name to its CSD reference code and source DOI. This file is needed to run the rest of the demonstration. You also need to supply your own OpenAI API key. The following cell sets up the LLMs and loads the matching file in preparation for extraction.

In [None]:
# If you haven't already, set your OpenAI API key as an environment variable
os.environ["OPENAI_API_KEY"] = 'YOUR API KEY'

llm = ChatOpenAI(model="gpt-4o", temperature=0.1)
parser_llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

# mof_names_df is directly from matching
mof_names_df = pd.read_csv("Examples/KG_Data/matching.csv")
mof_names_df.head()

Unnamed: 0,MOF Name,CSD Ref Code,Justification,DOI
0,[HN(CH2CH2)3NH]3[Fe8(HPO4)12(PO4)2(H2O)6]<|>co...,NAVCAO,The compound [HN(CH2CH2)3NH]3[Fe8(HPO4)12(PO4)...,10.1039/a700472i
1,[Ag(CM-TTF)(CF3SO3)]<|>complex 2,FEKDAA,The MOF [Ag(CM-TTF)(CF3SO3)] matches the CSD R...,10.1039/a804945i
2,C9H12CaO10<|>calcium trimesate synthesized by ...,OWZEW01,The MOF C9H12CaO10 matches the CSD Ref Code OW...,10.1016/j.molstruc.2017.11.128
3,C23.5H28.5EuN2.5O10.5<|>complex 2,CIJBEF,The MOF with the empirical formula C23.5H28.5E...,10.1021/acs.analchem.8b00494
4,C23.5H28.5TbN2.5O10.5<|>complex 3,CIJBIJ,The MOF with the empirical formula C23.5H28.5T...,10.1021/acs.analchem.8b00494
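
In the preview above, each `MOF Name` entry appears to pack several aliases for the same compound into one string joined by `<|>`. The following is a minimal sketch of how such an entry could be split, assuming `<|>` is the delimiter. The helper name `split_mof_names` is hypothetical and is not part of MOF-ChemUnity.

```python
def split_mof_names(entry: str, sep: str = "<|>") -> list[str]:
    """Split a combined MOF-name string into its individual aliases."""
    # Drop surrounding whitespace and ignore empty fragments
    return [part.strip() for part in entry.split(sep) if part.strip()]


names = split_mof_names("[Ag(CM-TTF)(CF3SO3)]<|>complex 2")
print(names)
```

This can be handy when inspecting the matching file by hand, for example to check which alias a paper uses for a given reference code.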


## <b>Property Extraction: </b>
We have two types of property extraction workflows. First, a "General" workflow extracts several properties at a time from each paper. This works very well for straightforward properties that are clearly stated in the text.

Second, a "Specific" workflow implements Chain-of-Verification (CoV) to extract properties that are more subjective. We demonstrate this using water stability. Because water stability is reported in a variety of ways, CoV is useful here: a separate LLM call checks our initial extraction results.
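
To make the CoV idea concrete, here is a generic sketch of an extract, verify, and re-extract loop. It is an illustration only and is not how `ExtractionAgent` is implemented: the function `chain_of_verification` and the three callables are hypothetical, and plain lambdas stand in for the LLM calls.

```python
from typing import Callable


def chain_of_verification(read: Callable[[], str],
                          verify: Callable[[str], bool],
                          recheck: Callable[[str], str],
                          max_rechecks: int = 1) -> str:
    """Extract an answer, verify it with a separate call, and re-extract if rejected."""
    answer = read()
    for _ in range(max_rechecks):
        if verify(answer):
            break  # the verifier accepted the current answer
        answer = recheck(answer)
    return answer


# Stand-ins for LLM calls (hypothetical; a real workflow would prompt a model)
draft = lambda: "Stable"
check = lambda ans: ans == "Not provided"  # the verifier rejects the unsupported label
redo = lambda ans: "Not provided"

print(chain_of_verification(draft, check, redo))
```

The point of separating `verify` from `read` is that the second call only has to judge a concrete claim against the text, which tends to be an easier task than producing the claim in the first place.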

<b>Setup</b>

- Point ```input_folder``` to the vector store folder "vs" that was created during matching.
- Initialize the extraction results dictionaries (```result``` and ```ws_result```) as shown below.
- Load the prompts needed for water stability extraction and construct the agent.

In [3]:
## step 1: point input_folder to vector store folder
input_folder = "Examples/vs"
files = glob.glob(input_folder+"/*/")

## step 2: initialize extraction results dictionary
result, ws_result = {}, {}
keys_of_interest = ["MOF Name", "Ref Code", "Property", "Value", "Units", "Condition", "Summary", "Reference"]

for key in keys_of_interest:
    result[key] = []
    ws_result[key] = []

## step 3: load the prompts for extraction and construct agent object
WS_READ = WATER_STABILITY.replace("{RULES}", RULES_WATER_STABILITY)
WS_CHECK = VERIFICATION.replace("{VERF_RULES}", VERF_RULES_WATER_STABILITY)
WS_RECHECK = RECHECK.replace("{RECHECK_INSTRUCTIONS}", WATER_STABILITY_RE.replace("{RULES}", RULES_WATER_STABILITY))

sp_dict = {"read_prompts": [WS_READ], "verification_prompts": [WS_CHECK], "recheck_prompts": [WS_RECHECK]}
agent = ExtractionAgent(llm=llm)

<b>Running the Extraction Loop for General Property Extraction + CoV</b>

Running the following cell performs general property extraction and CoV to extract details such as the crystal system, chemical formula, space group, surface area, and water stability label (if it is mentioned in the literature for that particular MOF). The results are stored in the ```all_props``` and ```ws``` DataFrames and saved as .csv files in ```Examples/KG_Data``` for future reference; the ```filtered``` DataFrame is produced in a later step.

In [8]:
try:
    for i in range(len(mof_names_df)):

        mof = mof_names_df.iloc[i]["MOF Name"]
        refcode = mof_names_df.iloc[i]["CSD Ref Code"]
        reference = mof_names_df.iloc[i]["DOI"]

        if len(refcode) > 8:
            continue
        if refcode.lower() == "not provided":
            continue        
        
        response = agent.agent_response(mof, reference.replace("/","_")+".md",
                                        EXTRACTION, ["Water Stability"], sp_dict, CoV=True,
                                        vector_store=input_folder+f"/{reference.replace('/','_',1)}")


        general_extraction = response[0]

        all_props = general_extraction

        for j in all_props.properties:
            result["MOF Name"].append(mof)
            result["Ref Code"].append(refcode)
            result["Reference"].append(reference)
            result["Property"].append(j.name)
            result["Units"].append(j.units)
            result["Value"].append(j.value)
            result["Condition"].append(j.condition)
            result["Summary"].append(j.summary)
        
        specific_extraction = response[1]
        ws = specific_extraction[0]

        for j in ws:
            ws_result["MOF Name"].append(mof)
            ws_result["Ref Code"].append(refcode)
            ws_result["Reference"].append(reference)
            ws_result["Property"].append(j.name)
            ws_result["Units"].append(j.units)
            ws_result["Value"].append(j.value)
            ws_result["Condition"].append(j.condition)
            ws_result["Summary"].append(j.summary)

except Exception as e:
    # Report the failure; partial results are still saved below
    print(e)

# Save whatever was extracted, whether or not the loop completed
all_props = pd.DataFrame(result)
ws = pd.DataFrame(ws_result)

all_props.to_csv("Examples/KG_Data/all_experimental_properties.csv")
ws.to_csv("Examples/KG_Data/water_stability.csv")

Action: reading the document
finding all properties of name 1: [HN(CH2CH2)3NH]3[Fe8(HPO4)12(PO4)2(H2O)6] ---name 2: compound 1

Result: 
1.  -Property Name: Crystal System
    -Property Value: trigonal
    -Value Units: N/A
    -Conditions: N/A
    -Summary: "The compound crystallizes in the trigonal space group P3¯c1 (no. 165) with a = 13.5274(5), c = 19.2645(6) Å, U = 3052.9(3) Å3 and Z = 2."

2.  -Property Name: Space Group
    -Property Value: P3¯c1
    -Value Units: N/A
    -Conditions: N/A
    -Summary: "The compound crystallizes in the trigonal space group P3¯c1 (no. 165) with a = 13.5274(5), c = 19.2645(6) Å, U = 3052.9(3) Å3 and Z = 2."

3.  -Property Name: Unit Cell Dimensions
    -Property Value: a = 13.5274(5), c = 19.2645(6)
    -Value Units: Å
    -Conditions: N/A
    -Summary: "The compound crystallizes in the trigonal space group P3¯c1 (no. 165) with a = 13.5274(5), c = 19.2645(6) Å, U = 3052.9(3) Å3 and Z = 2."

4.  -Property Name: Unit Cell Volume
    -Property Value:

In [9]:
all_props.head()

Unnamed: 0,MOF Name,Ref Code,Property,Value,Units,Condition,Summary,Reference
0,[HN(CH2CH2)3NH]3[Fe8(HPO4)12(PO4)2(H2O)6]<|>co...,NAVCAO,Crystal System,trigonal,,,The compound crystallizes in the trigonal spac...,10.1039/a700472i
1,[HN(CH2CH2)3NH]3[Fe8(HPO4)12(PO4)2(H2O)6]<|>co...,NAVCAO,Space Group,P3¯c1,,,The compound crystallizes in the trigonal spac...,10.1039/a700472i
2,[HN(CH2CH2)3NH]3[Fe8(HPO4)12(PO4)2(H2O)6]<|>co...,NAVCAO,Unit Cell Dimensions,"a = 13.5274(5), c = 19.2645(6)",Å,,The compound crystallizes in the trigonal spac...,10.1039/a700472i
3,[HN(CH2CH2)3NH]3[Fe8(HPO4)12(PO4)2(H2O)6]<|>co...,NAVCAO,Unit Cell Volume,3052.9(3),Å3,,The compound crystallizes in the trigonal spac...,10.1039/a700472i
4,[HN(CH2CH2)3NH]3[Fe8(HPO4)12(PO4)2(H2O)6]<|>co...,NAVCAO,Density,2.436,g cm−3,,Dc/g cm23 2.436,10.1039/a700472i


In [10]:
ws.head()

Unnamed: 0,MOF Name,Ref Code,Property,Value,Units,Condition,Summary,Reference
0,[HN(CH2CH2)3NH]3[Fe8(HPO4)12(PO4)2(H2O)6]<|>co...,NAVCAO,Water Stability,Not provided,,,The document does not provide explicit informa...,10.1039/a700472i
1,[Ag(CM-TTF)(CF3SO3)]<|>complex 2,FEKDAA,Water Stability,Not provided,,,The document does not provide specific informa...,10.1039/a804945i
2,C9H12CaO10<|>calcium trimesate synthesized by ...,OWZEW01,Water Stability,Not provided,,,The document does not provide specific informa...,10.1016/j.molstruc.2017.11.128
3,C23.5H28.5EuN2.5O10.5<|>complex 2,CIJBEF,Water Stability,Not provided,,,The document does not provide specific informa...,10.1021/acs.analchem.8b00494
4,C23.5H28.5TbN2.5O10.5<|>complex 3,CIJBIJ,Water Stability,Not provided,,,The document does not provide specific informa...,10.1021/acs.analchem.8b00494
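
The extraction loop above skips rows whose reference code is longer than 8 characters or reads "not provided", and derives the markdown file name and vector store folder from the DOI. A small sketch of those two steps as reusable helpers follows. The names `is_usable_refcode` and `doi_to_paths` are hypothetical, and the extra check for non-string values (such as the `NaN` that pandas uses for empty cells) is an assumption about how the matching file might look, not something the original loop does.

```python
import os


def is_usable_refcode(refcode) -> bool:
    """Mirror the loop's skip rules, and also reject non-string values such as NaN."""
    if not isinstance(refcode, str):
        return False
    code = refcode.strip()
    return 0 < len(code) <= 8 and code.lower() != "not provided"


def doi_to_paths(doi: str, vs_root: str) -> tuple[str, str]:
    """Return (markdown file name, vector store folder) for a DOI."""
    file_name = doi.replace("/", "_") + ".md"
    # Only the first "/" is replaced for the folder, as in the loop above
    vector_store = os.path.join(vs_root, doi.replace("/", "_", 1))
    return file_name, vector_store


print(is_usable_refcode("NAVCAO"))
print(doi_to_paths("10.1039/a700472i", "Examples/vs"))
```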


## Filtering Properties and Standardizing Property Names

The filter labels properties that should be excluded as either `none` or `Remove`. The standardization function keeps the property name originally extracted by the LLM in the `Original Property Name` column.
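
As a rough picture of what filtering and standardizing means here, the sketch below maps raw names to standard ones through a lookup table and keeps the original name alongside. It is not the implementation of `filter_and_standardize`: the function `standardize` and the table `toy_filter` are hypothetical, and the real `PROPERTIES_FILTER` may be structured differently.

```python
def standardize(names, mapping, default="none"):
    """Return (standardized, original) name pairs using a lookup table."""
    # Unknown names fall back to the default label so they can be dropped later
    return [(mapping.get(name.lower(), default), name) for name in names]


# Hypothetical filter table; the real PROPERTIES_FILTER may look different
toy_filter = {
    "unit cell volume": "Cell Volume",
    "crystal system": "Crystal System",
    "r factor": "Remove",
}

pairs = standardize(["Unit Cell Volume", "R factor", "Colour"], toy_filter)
kept = [(std, orig) for std, orig in pairs if std not in ("Remove", "none")]
print(kept)
```

Keeping the original name next to the standardized one makes it possible to audit the mapping later without re-running the extraction.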

In [21]:
filtered_props = filter_and_standardize(all_props, "Property", PROPERTIES_FILTER)

In [22]:
filtered_props[(filtered_props["Property"] != "Remove") & (filtered_props["Property"]!="none")].head()

Unnamed: 0,MOF Name,Ref Code,Property,Value,Units,Condition,Summary,Reference,Original Property Name
0,[HN(CH2CH2)3NH]3[Fe8(HPO4)12(PO4)2(H2O)6]<|>co...,NAVCAO,Crystal System,trigonal,,,The compound crystallizes in the trigonal spac...,10.1039/a700472i,Crystal System
1,[HN(CH2CH2)3NH]3[Fe8(HPO4)12(PO4)2(H2O)6]<|>co...,NAVCAO,Space Group,P3¯c1,,,The compound crystallizes in the trigonal spac...,10.1039/a700472i,Space Group
3,[HN(CH2CH2)3NH]3[Fe8(HPO4)12(PO4)2(H2O)6]<|>co...,NAVCAO,Cell Volume,3052.9(3),Å3,,The compound crystallizes in the trigonal spac...,10.1039/a700472i,Unit Cell Volume
4,[HN(CH2CH2)3NH]3[Fe8(HPO4)12(PO4)2(H2O)6]<|>co...,NAVCAO,Density,2.436,g cm−3,,Dc/g cm23 2.436,10.1039/a700472i,Density
5,[HN(CH2CH2)3NH]3[Fe8(HPO4)12(PO4)2(H2O)6]<|>co...,NAVCAO,Formula Weight,2239.15,,,Formula C18Fe8H66N6O62P14 Atom x y z M 2239.15,10.1039/a700472i,Formula Weight


In [23]:
filtered_props[(filtered_props["Property"] != "Remove") & (filtered_props["Property"]!="none")].to_csv("Examples/KG_Data/filtered_experimental_properties.csv")

## <b>Application Extraction</b>
We again use our "General" extraction workflow, but this time use it to extract applications.

In [24]:
result = {}
result["MOF Name"] = []
result["Ref Code"] = []
result["Application"] = []
result["Recommendation"] = []
result["Justification"] = []
result["Source"] = []

agent = ExtractionAgent(llm = llm)

with get_openai_callback() as cb:
    try:
        for i in range(len(mof_names_df)):

            mof = mof_names_df.iloc[i]["MOF Name"]
            refcode = mof_names_df.iloc[i]["CSD Ref Code"]
            reference = mof_names_df.iloc[i]["DOI"]
            #file = mof_names_df.iloc[i]["File Name"]

            if len(refcode) > 8:
                print(f"{refcode} is longer than 8 characters. Skipping...")
                continue
            if refcode.lower() == "not provided":
                continue        
            
            print("Trying to get response...")
            response = agent.agent_response(mof, reference.replace("/","_")+".md",
                                            APPLICATION, CoV=False, gen_extraction_structure = ListApplications,
                                            vector_store=os.path.join(input_folder, reference.replace("/", "_", 1)))

            general_extraction = response

            applications = general_extraction

            print(applications)

            for app in applications.app_list:
                result["MOF Name"].append(mof)
                result["Ref Code"].append(refcode)
                result["Source"].append(reference)
                result["Application"].append(app.application_name)
                result["Recommendation"].append(app.recommendation)
                result["Justification"].append(app.justification)
        
        print(cb)

    except Exception as e:
        # Report the failure, the token usage so far, and the row index reached
        print(e)
        print(cb)
        print(i)

    # Save whatever was extracted, whether or not the loop completed
    applications = pd.DataFrame(result)
    applications.to_csv("Examples/KG_Data/applications.csv")

Trying to get response...
Action: reading the document
finding all properties of name 1: [HN(CH2CH2)3NH]3[Fe8(HPO4)12(PO4)2(H2O)6] ---name 2: compound 1

Result: 
- Application: Not Provided
- Recommendation: Not Provided
- Justification: The document does not mention any specific applications for the MOF [HN(CH2CH2)3NH]3[Fe8(HPO4)12(PO4)2(H2O)6].

Parsed Result: 
1- Application: Not Provided
Author Recommendation: Not Provided
Exact Sentences: The document does not mention any specific applications for the MOF [HN(CH2CH2)3NH]3[Fe8(HPO4)12(PO4)2(H2O)6].



1- Application: Not Provided
Author Recommendation: Not Provided
Exact Sentences: The document does not mention any specific applications for the MOF [HN(CH2CH2)3NH]3[Fe8(HPO4)12(PO4)2(H2O)6].

Trying to get response...
Action: reading the document
finding all properties of name 1: [Ag(CM-TTF)(CF3SO3)] ---name 2: complex 2

Result: 
- Application: Not Provided
- Recommendation: Not Provided
- Justification: The document does not ment

In [25]:
applications.head()

Unnamed: 0,MOF Name,Ref Code,Application,Recommendation,Justification,Source
0,[HN(CH2CH2)3NH]3[Fe8(HPO4)12(PO4)2(H2O)6]<|>co...,NAVCAO,Not Provided,Not Provided,The document does not mention any specific app...,10.1039/a700472i
1,[Ag(CM-TTF)(CF3SO3)]<|>complex 2,FEKDAA,Not Provided,Not Provided,The document does not mention any specific app...,10.1039/a804945i
2,C9H12CaO10<|>calcium trimesate synthesized by ...,OWZEW01,Hydrogen storage,Investigated,Experimental hydrogen storage capacity of calc...,10.1016/j.molstruc.2017.11.128
3,C23.5H28.5EuN2.5O10.5<|>complex 2,CIJBEF,Luminescent sensing,Investigated,"Lanthanide metal–organic frameworks (Ln-MOFs),...",10.1021/acs.analchem.8b00494
4,C23.5H28.5TbN2.5O10.5<|>complex 3,CIJBIJ,Luminescent sensing,Investigated,"In the past few years, Ln-MOFs have been frequ...",10.1021/acs.analchem.8b00494


## Filtering Applications and Standardizing Application Names

In [31]:
filtered_applications = filter_and_standardize(applications, "Application", APPLICATIONS_FILTER)

In [32]:
filtered_applications[(filtered_applications["Application"] != "Remove") & (filtered_applications["Application"]!="none")].head()

Unnamed: 0,MOF Name,Ref Code,Application,Recommendation,Justification,Source,Original Application Name
2,C9H12CaO10<|>calcium trimesate synthesized by ...,OWZEW01,Hydrogen Storage,Investigated,Experimental hydrogen storage capacity of calc...,10.1016/j.molstruc.2017.11.128,Hydrogen storage
3,C23.5H28.5EuN2.5O10.5<|>complex 2,CIJBEF,Sensors,Investigated,"Lanthanide metal–organic frameworks (Ln-MOFs),...",10.1021/acs.analchem.8b00494,Luminescent sensing
4,C23.5H28.5TbN2.5O10.5<|>complex 3,CIJBIJ,Sensors,Investigated,"In the past few years, Ln-MOFs have been frequ...",10.1021/acs.analchem.8b00494,Luminescent sensing


In [33]:
filtered_applications[(filtered_applications["Application"] != "Remove") & (filtered_applications["Application"]!="none")].to_csv("Examples/KG_Data/filtered_applications.csv")

## <b> Synthesis Extraction </b>

Lastly, we adapt our "General" extraction workflow to extract synthesis procedures for each MOF.
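
The synthesis results are collected by reading attributes off a structured extraction object and falling back to an empty string when one is missing. A minimal sketch of that flattening pattern follows. `SynthesisSketch`, `COLUMN_TO_FIELD`, and `to_row` are hypothetical names; the dataclass is only a stand-in and is not the `Synthesis` model from `MOF_ChemUnity.utils.DataModels`, whose exact fields are assumed here.

```python
from dataclasses import dataclass


@dataclass
class SynthesisSketch:
    """Hypothetical stand-in for the Synthesis data model (fields are assumptions)."""
    metal_precursor: str = ""
    organic_linker: str = ""
    solvent: str = ""
    temperature: str = ""
    reaction_time: str = ""
    reaction_type: str = ""


# Output column -> attribute name on the extraction object
COLUMN_TO_FIELD = {
    "Metal Precursor": "metal_precursor",
    "Linker": "organic_linker",
    "Solvent": "solvent",
    "Temperature": "temperature",
    "Reaction Time": "reaction_time",
    "Reaction Type": "reaction_type",
    "Justification": "justification",  # absent from the sketch on purpose
}


def to_row(extraction) -> dict:
    """Flatten an extraction object, using "" for any missing attribute."""
    return {col: getattr(extraction, field, "") for col, field in COLUMN_TO_FIELD.items()}


row = to_row(SynthesisSketch(metal_precursor="FeCl3·6H2O", temperature="180 °C"))
print(row)
```

Using `getattr` with a default keeps every column the same length even when the model omits a field, which is what allows the dictionary of lists to be turned into a DataFrame afterwards.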

In [35]:
agent = ExtractionAgent(llm=llm)

# Initialize the results dictionary
results = {
    "MOF Name": [],
    "CSD Ref Code": [],
    "Reference": [],
    "Metal Precursor": [],
    "Linker": [],
    "Solvent": [],
    "Temperature": [],
    "Reaction Time": [],
    "Reaction Type": [],
    "Synthesis Procedure": [],
    "Additional Conditions": [],
    "Justification": [],
}

with get_openai_callback() as cb:
    try:
        for i in range(len(mof_names_df)):
            mof = mof_names_df.iloc[i]["MOF Name"]
            refcode = mof_names_df.iloc[i]["CSD Ref Code"]
            reference = mof_names_df.iloc[i]["DOI"]

            # Skip problematic refcodes
            if len(refcode) > 8 or refcode.lower() == "not provided":
                continue

            print(f"🔍 Processing: {mof} ({reference})")
            file_name = reference.replace("/", "_") + ".md"
            vector_store_path = os.path.join(input_folder, reference.replace("/", "_", 1))

            try:
                extraction = agent.agent_response(
                    mof,
                    file_name,
                    SYNTHESIS_EXTRACTION,
                    CoV=False,
                    gen_extraction_structure=Synthesis,
                    vector_store=vector_store_path
                )

                # Append fields to result
                results["MOF Name"].append(mof)
                results["CSD Ref Code"].append(refcode)
                results["Reference"].append(reference)
                results["Reaction Type"].append(getattr(extraction, "reaction_type", ""))
                results["Metal Precursor"].append(getattr(extraction, "metal_precursor", ""))
                results["Linker"].append(getattr(extraction, "organic_linker", ""))
                results["Solvent"].append(getattr(extraction, "solvent", ""))
                results["Temperature"].append(getattr(extraction, "temperature", ""))
                results["Reaction Time"].append(getattr(extraction, "reaction_time", ""))
                results["Synthesis Procedure"].append(getattr(extraction, "synthesis_procedure", ""))
                results["Additional Conditions"].append(getattr(extraction, "additional_conditions", ""))
                results["Justification"].append(getattr(extraction, "justification", ""))

            except Exception as e:
                print(f"⚠️ Error on {reference}: {e}")

    except Exception as e:
        print(f"Unexpected error: {e}")

    finally:
        # Always save whatever was extracted and report token usage
        synthesis = pd.DataFrame(results)
        synthesis.to_csv("Examples/KG_Data/synthesis.csv")
        print(cb)


🔍 Processing: [HN(CH2CH2)3NH]3[Fe8(HPO4)12(PO4)2(H2O)6]<|>compound 1 (10.1039/a700472i)
Action: reading the document
finding all properties of name 1: [HN(CH2CH2)3NH]3[Fe8(HPO4)12(PO4)2(H2O)6] ---name 2: compound 1

Result: 
- Metal Precursor: FeCl3·6H2O
- Organic Linker: DABCO (1,4-diazabicyclo[2.2.2]octane)
- Solvent: n-butanol, water
- Temperature: 180 °C
- Reaction Type: solvothermal
- Reaction Time: 3 days
- Synthesis Procedure:
  - The synthesis was carried out in a Teflon-lined acid digestion bomb (23 cm³) under autogenous pressure.
  - FeCl3·6H2O (2.5 mmol), DABCO (7.5 mmol), H3PO4 (7.5 mmol), n-butanol (3 cm³), and water (7 cm³) were reacted.
  - The reaction mixture was heated at 180 °C for 3 days.
  - The mixture was then slowly cooled at a rate of 10 °C per hour to room temperature.
  - The product, [HN(CH2CH2)3NH]3[Fe8(HPO4)12(PO4)2(H2O)6], was obtained as colorless crystals.
  - The colorless crystals were manually separated from a small amount of green material to obtain