# MOF ChemUnity Property Extraction

This notebook demonstrates how property extraction is used in MOF ChemUnity. You need the name of each MOF whose properties you want to extract; these names are obtained from the Matching workflow.

In [100]:
from src.MOF_ChemUnity.Agents.ExtractionAgent import ExtractionAgent
from src.MOF_ChemUnity.utils.DataPrep import Data_Prep
from src.MOF_ChemUnity.Extraction_Prompts import VERIFICATION, RECHECK, EXTRACTION
from src.MOF_ChemUnity.Water_Stability_Prompts import WATER_STABILITY, RULES_WATER_STABILITY, VERF_RULES_WATER_STABILITY, WATER_STABILITY_RE

### Preparation of MOF Names from Matching CSV

We need to read the matching CSV file and extract the MOF names from it.

In [101]:
import pandas as pd
import glob
import os

In [102]:
mof_names_df = pd.read_csv("./tests/water_stability_benchmark/case-study-3-ground-truth.csv")
mof_names_df.head()

Unnamed: 0,Reference,DOI,MOF contained,True Water Stability,Justification Sentence,Unnamed: 5
0,1,10.1038/s41586-019-1798-7,Al-PMOF/m8o66,Stable,The capture capacity of Al-PMOF for a mixture ...,
1,1,10.1038/s41586-019-1798-7,Al-PyrMOF/m8o67,Stable,The capture capacity of Al-PyrMOF for a mixtur...,
2,1,10.1038/s41586-019-1798-7,UiO-66-NH2,Not provided,Not provided,
3,1,10.1038/s41586-019-1798-7,m8o71,Unstable,"Conversely, m8o71 completely loses its CO2 cap...",
4,2,10.1039/c0dt00999g,[Zn4(dmf)(ur)2(ndc)4],Stable,The high stability of the guest-free metal–org...,


### Markdown files setup

In [103]:
input_folder = "./tests/water_stability_benchmark/markdown"


files = glob.glob(input_folder+"/*/*.md")

In [104]:
mof_names_df["File"] = [input_folder+f"/{i}/{i}.md" for i in list(mof_names_df["Reference"])]
mof_names_df.head()

Unnamed: 0,Reference,DOI,MOF contained,True Water Stability,Justification Sentence,Unnamed: 5,File
0,1,10.1038/s41586-019-1798-7,Al-PMOF/m8o66,Stable,The capture capacity of Al-PMOF for a mixture ...,,./tests/water_stability_benchmark/markdown/1/1.md
1,1,10.1038/s41586-019-1798-7,Al-PyrMOF/m8o67,Stable,The capture capacity of Al-PyrMOF for a mixtur...,,./tests/water_stability_benchmark/markdown/1/1.md
2,1,10.1038/s41586-019-1798-7,UiO-66-NH2,Not provided,Not provided,,./tests/water_stability_benchmark/markdown/1/1.md
3,1,10.1038/s41586-019-1798-7,m8o71,Unstable,"Conversely, m8o71 completely loses its CO2 cap...",,./tests/water_stability_benchmark/markdown/1/1.md
4,2,10.1039/c0dt00999g,[Zn4(dmf)(ur)2(ndc)4],Stable,The high stability of the guest-free metal–org...,,./tests/water_stability_benchmark/markdown/2/2.md
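Before the extraction loop runs, it can help to confirm that every path in the `File` column points to an existing markdown file. The following is a minimal sketch and not part of MOF ChemUnity: `find_missing_files` is a hypothetical helper, and the temporary folder only stands in for `input_folder`.

```python
import os
import tempfile


def find_missing_files(paths):
    """Return the sorted unique paths that do not exist on disk (hypothetical helper)."""
    return sorted(p for p in set(paths) if not os.path.isfile(p))


# Toy demonstration: a temporary folder stands in for the markdown folder.
with tempfile.TemporaryDirectory() as folder:
    # Create <folder>/1/1.md only; reference 2 has no markdown file.
    os.makedirs(os.path.join(folder, "1"))
    with open(os.path.join(folder, "1", "1.md"), "w") as f:
        f.write("# paper 1\n")

    references = [1, 1, 2]
    paths = [os.path.join(folder, str(i), f"{i}.md") for i in references]

    missing = find_missing_files(paths)
    print(missing)  # a single entry: the path built for reference 2
```

In this notebook the same check could be run as `find_missing_files(mof_names_df["File"])`, since the helper accepts any iterable of paths.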


### Running the Extraction Loop for General Property Extraction + CoV

In [105]:
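# Read the OpenAI API key from a local file; strip surrounding whitespace
# (such as a trailing newline) so the key is passed on unchanged.
with open(".apikeys", "r") as f:
    os.environ["OPENAI_API_KEY"] = f.read().strip()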

from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o", temperature=0.1)
parser_llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

In [106]:
result = {}
result["MOF Name"] = []
result["Ref Code"] = []
result["Property"] = []
result["Value"] = []
result["Units"] = []
result["Condition"] = []
result["Summary"] = []
result["Reference"] = []

In [107]:
filtered_result = {}
filtered_result["MOF Name"] = []
filtered_result["Ref Code"] = []
filtered_result["Property"] = []
filtered_result["Value"] = []
filtered_result["Units"] = []
filtered_result["Condition"] = []
filtered_result["Summary"] = []
filtered_result["Reference"] = []

In [108]:
ws_result = {}
ws_result["MOF Name"] = []
ws_result["Ref Code"] = []
ws_result["Property"] = []
ws_result["Value"] = []
ws_result["Units"] = []
ws_result["Condition"] = []
ws_result["Summary"] = []
ws_result["Reference"] = []

In [109]:
WS_READ = WATER_STABILITY.replace("{RULES}", RULES_WATER_STABILITY)
WS_CHECK = VERIFICATION.replace("{VERF_RULES}", VERF_RULES_WATER_STABILITY)
WS_RECHECK = RECHECK.replace("{RECHECK_INSTRUCTIONS}", WATER_STABILITY_RE.replace("{RULES}", RULES_WATER_STABILITY))

sp_dict = {"read_prompts": [WS_READ], "verification_prompts": [WS_CHECK], "recheck_prompts": [WS_RECHECK]}
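The cell above builds each prompt by plain string substitution: a placeholder such as `{RULES}` is replaced with the rule text, and for the recheck prompt an inner template is filled before it is inserted into the outer one. The sketch below illustrates that pattern with made-up templates. The template strings and the `unfilled_placeholders` helper are assumptions for illustration only; the real prompt texts are not shown in this notebook.

```python
import re

# Hypothetical stand-ins for the real prompt templates.
READ_TEMPLATE = "Find the water stability of the MOF.\nRules:\n{RULES}"
RECHECK_TEMPLATE = "Re-read the document.\n{RECHECK_INSTRUCTIONS}"
RECHECK_BODY = "Apply these rules again:\n{RULES}"
RULES = "1. Only report explicit statements."


def unfilled_placeholders(prompt):
    """Return upper-case placeholders such as {RULES} still present in `prompt`."""
    return re.findall(r"\{[A-Z_]+\}", prompt)


# Single substitution, as for WS_READ.
read_prompt = READ_TEMPLATE.replace("{RULES}", RULES)

# Nested substitution, as for WS_RECHECK: fill the inner template first,
# then insert the result into the outer template.
recheck_prompt = RECHECK_TEMPLATE.replace(
    "{RECHECK_INSTRUCTIONS}", RECHECK_BODY.replace("{RULES}", RULES)
)

print(unfilled_placeholders(read_prompt))
print(unfilled_placeholders(recheck_prompt))
print(unfilled_placeholders(RECHECK_TEMPLATE))
```

A leftover placeholder is not necessarily an error, because a template may keep some variables for a later formatting step. The check is only meant to catch a misspelled placeholder name, which `str.replace` would silently ignore.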

In [110]:
from src.MOF_ChemUnity.utils.DocProcessor import DocProcessor


processor = DocProcessor(chunk_size=8000, chunk_overlap=200)
agent = ExtractionAgent(llm=llm)

In [111]:
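for i in range(len(mof_names_df)):

    mof = mof_names_df.iloc[i]["MOF contained"]
    refcode = "HELLOW"  # placeholder ref code; the benchmark CSV shown above has no ref code column
    reference = mof_names_df.iloc[i]["DOI"]

    # Run only the water stability prompts from sp_dict (skip_general=True), with verification (CoV=True).
    _, response = agent.agent_response(mof, mof_names_df.iloc[i]["File"],
                                    EXTRACTION, ["Water Stability"], sp_dict, CoV=True, skip_general=True, fuzz_threshold=85, store_vs=True)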
    
    # general_extraction = response

    # filtered = general_extraction[0]
    # all_props = general_extraction[1]

    # print(filtered)
    # print(all_props)

    # for j in filtered:
    #     filtered_result["MOF Name"].append(mof)
    #     filtered_result["Ref Code"].append(refcode)
    #     filtered_result["Reference"].append(reference)
    #     filtered_result["Property"].append(j.name)
    #     filtered_result["Units"].append(j.units)
    #     filtered_result["Value"].append(j.value)
    #     filtered_result["Condition"].append(j.condition)
    #     filtered_result["Summary"].append(j.summary)
    # for j in all_props.properties:
    #     result["MOF Name"].append(mof)
    #     result["Ref Code"].append(refcode)
    #     result["Reference"].append(reference)
    #     result["Property"].append(j.name)
    #     result["Units"].append(j.units)
    #     result["Value"].append(j.value)
    #     result["Condition"].append(j.condition)
    #     result["Summary"].append(j.summary)
    
    specific_extraction = response

    ws = specific_extraction[0]

    for j in ws:
        ws_result["MOF Name"].append(mof)
        ws_result["Ref Code"].append(refcode)
        ws_result["Reference"].append(reference)
        ws_result["Property"].append(j.name)
        ws_result["Units"].append(j.units)
        ws_result["Value"].append(j.value)
        ws_result["Condition"].append(j.condition)
        ws_result["Summary"].append(j.summary)



# all_props = pd.DataFrame(result)
# filtered = pd.DataFrame(filtered_result)
ws = pd.DataFrame(ws_result)
    

Saved vector store for ./tests/water_stability_benchmark/markdown/1/1.md in ./tests/water_stability_benchmark/markdown/vs/1
Reading to find the Water Stability of Al-PMOF/m8o66 specifically
LLM Structured Output: 
Water Stability = StableN/A ; conditions: Exposure to different harsh conditions, including immersion in water for 7 days. ; Justification: Figure 3c, d shows no loss of crystallinity upon activation as well as upon exposure to different harsh conditions, including immersion in water for 7 days.

Verifying the extraction:
Saved vector store for ./tests/water_stability_benchmark/markdown/1/1.md in ./tests/water_stability_benchmark/markdown/vs/1
Reading to find the Water Stability of Al-PyrMOF/m8o67 specifically
LLM Structured Output: 
Water Stability = StableN/A ; conditions: Exposure to different harsh conditions, including immersion in water for 7 days ; Justification: The MOF Al-PyrMOF/m8o67 is stable as it shows no loss of crystallinity upon activation as well as upon expo

In [112]:
ws.to_csv("/mnt/c/Users/Amro/Desktop/ws.csv")

## Performance Metrics

In [113]:
case_3 = pd.read_csv("./tests/water_stability_benchmark/case-study-3-ground-truth.csv")
case_3.head()

Unnamed: 0,Reference,DOI,MOF contained,True Water Stability,Justification Sentence,Unnamed: 5
0,1,10.1038/s41586-019-1798-7,Al-PMOF/m8o66,Stable,The capture capacity of Al-PMOF for a mixture ...,
1,1,10.1038/s41586-019-1798-7,Al-PyrMOF/m8o67,Stable,The capture capacity of Al-PyrMOF for a mixtur...,
2,1,10.1038/s41586-019-1798-7,UiO-66-NH2,Not provided,Not provided,
3,1,10.1038/s41586-019-1798-7,m8o71,Unstable,"Conversely, m8o71 completely loses its CO2 cap...",
4,2,10.1039/c0dt00999g,[Zn4(dmf)(ur)2(ndc)4],Stable,The high stability of the guest-free metal–org...,


In [52]:
ws = pd.read_csv("/mnt/c/Users/Amro/Desktop/ws.csv")

In [None]:
results_df = pd.read_csv("./tests/water_stability_benchmark/ws.csv")
results_df.head()

Unnamed: 0.1,Unnamed: 0,MOF Name,Ref Code,Property,Value,Units,Condition,Summary,Reference
0,0,Al-PMOF/m8o66,HELLOW,water stability,Stable,,"Exposure to different harsh conditions, includ...","Figure 3c, d shows no loss of crystallinity up...",10.1038/s41586-019-1798-7
1,1,Al-PyrMOF/m8o67,HELLOW,water stability,Stable,,"Exposure to different harsh conditions, includ...","Figure 3c, d shows no loss of crystallinity up...",10.1038/s41586-019-1798-7
2,2,UiO-66-NH2,HELLOW,water stability,Not provided,,Not provided,The water stability of the MOF: Not provided,10.1038/s41586-019-1798-7
3,3,m8o71,HELLOW,water stability,Unstable,,Exposure to relative humidity,"Conversely, m8o71 completely loses its CO2 cap...",10.1038/s41586-019-1798-7
4,4,[Zn4(dmf)(ur)2(ndc)4],HELLOW,water stability,Not provided,,Not provided,1. The water stability of the MOF: Not provided,10.1039/c0dt00999g


In [114]:
import numpy as np


def label_index(label):
    """Map a label to a matrix index: 0 = unstable, 1 = stable, 2 = anything else (not provided)."""
    label = label.lower()
    if label == "unstable":
        return 0
    if label == "stable":
        return 1
    return 2


# Rows are the true labels, columns are the predicted labels.
# This assumes that row i of `ws` corresponds to row i of `case_3`.
matrix = np.zeros((3, 3))
for i in range(len(ws)):
    y_index = label_index(case_3.iloc[i]["True Water Stability"])
    x_index = label_index(ws.iloc[i]["Value"])

    # Print the summary whenever a "Not provided" ground truth was predicted as stable.
    if y_index == 2 and x_index == 1:
        print(ws.iloc[i]["Summary"])

    matrix[y_index, x_index] += 1

The crystalline compounds obtained from compounds 1 and 2 by heating are named as new compounds 1A and 2A, respectively. The identical space groups and nearly identical unit-cell dimensions between the original compounds and their dehydrated forms (i.e., 1 and 1A, 2 and 2A, respectively) indicate that the porous 3D frameworks and the main backbone of the protonated water cluster are retained. X-ray structural analyses of both compounds 1A and 2A exhibit identical protonated water clusters H(H2O)21+, as shown in Figure 6. By comparing the structures of H(H2O)28+ and H(H2O)27+ clusters with that of H(H2O)21+, it is found that, in the backbone of the (H2O)26 water shells, six water molecules (represented by O(3W) in Figure 3) initially in H(H2O)28+ and H(H2O)27+ clusters were removed upon dehydration. In the case of H(H2O)28+, one of the two water molecules in the center core of H(H2O)2+ was also lost, leaving a mono water molecule species (H3O)+ situated at the crystallographic center, w
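The loop above pairs predictions and ground truth by row position, which only holds if the extraction produced exactly one row per MOF. As a hedged alternative, the sketch below pairs them by `(DOI, MOF name)` instead. The five rows are copied from the two previews shown earlier in this notebook; the `align` helper is hypothetical and not part of MOF ChemUnity.

```python
# Ground truth from the benchmark preview: (DOI, MOF name) -> true label.
ground_truth = {
    ("10.1038/s41586-019-1798-7", "Al-PMOF/m8o66"): "Stable",
    ("10.1038/s41586-019-1798-7", "Al-PyrMOF/m8o67"): "Stable",
    ("10.1038/s41586-019-1798-7", "UiO-66-NH2"): "Not provided",
    ("10.1038/s41586-019-1798-7", "m8o71"): "Unstable",
    ("10.1039/c0dt00999g", "[Zn4(dmf)(ur)2(ndc)4]"): "Stable",
}

# Predictions from the results preview: (DOI, MOF name, predicted value).
predictions = [
    ("10.1038/s41586-019-1798-7", "Al-PMOF/m8o66", "Stable"),
    ("10.1038/s41586-019-1798-7", "Al-PyrMOF/m8o67", "Stable"),
    ("10.1038/s41586-019-1798-7", "UiO-66-NH2", "Not provided"),
    ("10.1038/s41586-019-1798-7", "m8o71", "Unstable"),
    ("10.1039/c0dt00999g", "[Zn4(dmf)(ur)2(ndc)4]", "Not provided"),
]


def align(ground_truth, predictions):
    """Pair each prediction with its true label by key; collect keys with no match."""
    pairs, unmatched = [], []
    for doi, mof, value in predictions:
        key = (doi, mof)
        if key in ground_truth:
            pairs.append((ground_truth[key], value))
        else:
            unmatched.append(key)
    return pairs, unmatched


pairs, unmatched = align(ground_truth, predictions)
agreeing = sum(true.lower() == pred.lower() for true, pred in pairs)
print(f"{agreeing} of {len(pairs)} predictions agree; unmatched keys: {unmatched}")
# → 4 of 5 predictions agree; unmatched keys: []
```

Keying on both DOI and MOF name avoids collisions when the same MOF name appears in more than one paper, and the `unmatched` list makes missing or extra rows visible instead of silently shifting the comparison.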

In [115]:
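# Print the confusion matrix: rows are true labels, columns are predictions (NP = not provided).
print(
    "\t\t\tPred. Unstable\tPred. Stable\tNP\n"
    "\tTrue Unstable\t\t{0}\t{1}\t\t{2}\n"
    "\tTrue Stable\t\t{3}\t{4}\t\t{5}\n"
    "\tNP\t\t\t{6}\t{7}\t\t{8}".format(*matrix.flatten())
)
# Accuracy is the sum of the diagonal (correct predictions) divided by the total count.
print("Accuracy: {0:.0%}".format(np.trace(matrix) / np.sum(matrix)))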

			Pred. Unstable	Pred. Stable	NP
	True Unstable		32.0	5.0		1.0
	True Stable		12.0	97.0		15.0
	NP			19.0	14.0		176.0
Accuracy: 82%
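Overall accuracy hides how the errors are distributed across the three classes. As a small sketch, the per-class precision and recall can be computed from the confusion matrix printed above. The counts are copied from that output, and the plain-list representation is chosen here only to keep the example self-contained.

```python
labels = ["Unstable", "Stable", "NP"]

# Rows are true labels, columns are predicted labels (counts from the output above).
matrix = [
    [32, 5, 1],
    [12, 97, 15],
    [19, 14, 176],
]

total = sum(sum(row) for row in matrix)
accuracy = sum(matrix[i][i] for i in range(len(labels))) / total

metrics = {}
for i, label in enumerate(labels):
    true_count = sum(matrix[i])                      # row sum: samples with this true label
    predicted_count = sum(row[i] for row in matrix)  # column sum: samples predicted as this label
    metrics[label] = {
        "precision": matrix[i][i] / predicted_count,
        "recall": matrix[i][i] / true_count,
    }

print(f"Accuracy: {accuracy:.1%}")
for label, m in metrics.items():
    print(f"{label}: precision {m['precision']:.1%}, recall {m['recall']:.1%}")
```

For this matrix, the Unstable class has a precision of about 51% (32 of 63 predictions) against a recall of about 84% (32 of 38 true cases), so most of the remaining error comes from stable or unreported MOFs being labelled unstable.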
