# MOF ChemUnity Property Extraction

This notebook demonstrates how property extraction works in MOF ChemUnity. Extraction requires the name of each MOF whose properties you want to extract; these names are obtained from the Matching workflow.

In [14]:
from src.MOF_ChemUnity.Agents.ExtractionAgent import ExtractionAgent
from src.MOF_ChemUnity.utils.DocProcessor import DocProcessor
from src.MOF_ChemUnity.Extraction_Prompts import VERIFICATION, RECHECK, EXTRACTION
from src.MOF_ChemUnity.Water_Stability_Prompts import WATER_STABILITY, RULES_WATER_STABILITY, VERF_RULES_WATER_STABILITY, WATER_STABILITY_RE

### Preparation of MOF Names from Matching CSV

We need to read the matching CSV file and extract the MOF names and their reference file names from it.

In [15]:
import pandas as pd
import glob
import os

In [16]:
mof_names_df = pd.read_csv("/mnt/c/Users/Amro/Desktop/DOI LISTS/Processing Batches - PDF/Batch 1/matching.csv")
mof_names_df.head()

Unnamed: 0,MOF Name,CSD Ref Code,Justification,Reference
0,Diaqua(4-Oxoheptanedioato)Zinc(II)<|>catena(Di...,CAJCUL,The document describes the synthesis and chara...,10.1107_S010827018300918X.md
1,[Ag(C18H16N2O2)(C4H2O4)0.5]H2O,HIPZEM,The MOF [Ag(C18H16N2O2)(C4H2O4)0.5]H2O matches...,10.1107_S1600536807054591.md
2,[Er(C6H6NO6)(H2O)]<|>Polymeric Aqua(Nitrilotri...,AFUQUN,The MOF [Er(C6H6NO6)(H2O)] matches the CSD ref...,10.1107_S010827010200464X.md
3,trans-[Co(dbm)2(-dppe-O2)]n<|>catena-poly[[bis...,LICDIL,The MOF described in the document matches the ...,10.1107_S010827010700933X.md
4,"catena-poly[cadmium(II)-bis(-5amino-1,3,4-thia...",BOVPOS,The MOF described in the document matches the ...,10.1107_S160053680902371X.md
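Several `MOF Name` entries in the output above contain more than one name joined by `<|>`. The following is a minimal sketch of splitting such an entry into individual names, assuming `<|>` acts purely as a separator between alternative names for the same MOF. The helper `split_mof_names` and the sample strings are hypothetical and are not part of MOF ChemUnity.

```python
def split_mof_names(entry, separator="<|>"):
    """Split a matching-CSV name entry into individual, whitespace-trimmed names."""
    # Assumption: the separator never occurs inside a genuine MOF name.
    return [name.strip() for name in entry.split(separator) if name.strip()]


# Placeholder names, used only to illustrate the splitting.
single = split_mof_names("first name")
multiple = split_mof_names("first name <|> second name")
print(single)
print(multiple)
```

Whether the extraction agent expects the joined string or a single name is not shown in this notebook, so the loop below still passes the entry unchanged.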


### Markdown files setup

In [17]:
input_folder = "/mnt/c/Users/Amro/Desktop/DOI LISTS/Processing Batches - PDF/Batch 1/md"


files = glob.glob(input_folder+"/*/*.md")

In [18]:
references = set(mof_names_df["Reference"])  # build the lookup once instead of once per file
files_in_matching = [file for file in files if os.path.basename(file) in references]
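The filter above keeps a flat list of paths, so finding the path for a given reference later means scanning that list again. As a sketch of an alternative, the paths can be indexed by file name once. The function `index_by_basename` and the example paths are hypothetical.

```python
import os


def index_by_basename(paths):
    """Map each file name to the first full path that carries it."""
    index = {}
    for path in paths:
        # setdefault keeps the first path seen if two folders hold the same file name.
        index.setdefault(os.path.basename(path), path)
    return index


# Made-up paths that mimic the <folder>/<subfolder>/<file>.md layout globbed above.
paths = ["/data/md/a/10.1000_example1.md", "/data/md/b/10.1000_example2.md"]
index = index_by_basename(paths)
print(index.get("10.1000_example1.md"))
print(index.get("missing.md"))
```

With such an index, a missing reference shows up as `None` from `dict.get` rather than as an `IndexError` from an empty list.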

### Running the Extraction Loop for General Property Extraction + CoV

In [19]:
with open(".apikeys", 'r') as f:
    os.environ["OPENAI_API_KEY"] = f.read().strip()  # drop any trailing newline from the key file

from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o", temperature=0.1)
parser_llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

In [20]:
result = {}
result["MOF Name"] = []
result["Ref Code"] = []
result["Property"] = []
result["Value"] = []
result["Units"] = []
result["Condition"] = []
result["Summary"] = []
result["Reference"] = []

In [21]:
filtered_result = {}
filtered_result["MOF Name"] = []
filtered_result["Ref Code"] = []
filtered_result["Property"] = []
filtered_result["Value"] = []
filtered_result["Units"] = []
filtered_result["Condition"] = []
filtered_result["Summary"] = []
filtered_result["Reference"] = []

In [22]:
ws_result = {}
ws_result["MOF Name"] = []
ws_result["Ref Code"] = []
ws_result["Property"] = []
ws_result["Value"] = []
ws_result["Units"] = []
ws_result["Condition"] = []
ws_result["Summary"] = []
ws_result["Reference"] = []
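The three result tables above share the same eight columns. A compact way to build such tables is sketched below. The names `COLUMNS`, `empty_table`, `demo_a`, and `demo_b` are illustrative only.

```python
COLUMNS = ["MOF Name", "Ref Code", "Property", "Value",
           "Units", "Condition", "Summary", "Reference"]


def empty_table(columns=COLUMNS):
    """Return a dict with one fresh, empty list per column."""
    # A new list is created for every column on every call,
    # so separate tables never share state.
    return {column: [] for column in columns}


demo_a = empty_table()
demo_b = empty_table()
demo_a["Value"].append("example")
print(demo_a["Value"], demo_b["Value"])
```

Building each table through a function avoids the pitfall of `dict.fromkeys(COLUMNS, [])`, which would make every column point at the same list.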

In [23]:
WS_READ = WATER_STABILITY.replace("{RULES}", RULES_WATER_STABILITY)
WS_CHECK = VERIFICATION.replace("{VERF_RULES}", VERF_RULES_WATER_STABILITY)
WS_RECHECK = RECHECK.replace("{RECHECK_INSTRUCTIONS}", WATER_STABILITY_RE.replace("{RULES}", RULES_WATER_STABILITY))

sp_dict = {"read_prompts": [WS_READ], "verification_prompts": [WS_CHECK], "recheck_prompts": [WS_RECHECK]}
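The cell above assembles the water stability prompts by substituting rule text into `{...}` placeholders with `str.replace`. The sketch below shows that mechanism with made-up template strings; the real prompt texts live in the MOF ChemUnity prompt modules and are not reproduced here. The helper `fill` is hypothetical.

```python
# Hypothetical template and rules, for illustration only.
TEMPLATE = "Read the document.\n{RULES}\nAnswer concisely."
RULES = "Rule 1: report only explicitly stated results."

filled = TEMPLATE.replace("{RULES}", RULES)


def fill(template, placeholder, value):
    """Substitute a placeholder, failing loudly if it is absent."""
    # str.replace silently returns the text unchanged when the placeholder
    # is missing, which would send an unfilled prompt to the model.
    if placeholder not in template:
        raise ValueError(f"placeholder {placeholder!r} not found in template")
    return template.replace(placeholder, value)


print(fill(TEMPLATE, "{RULES}", RULES))
```

`str.replace` is used rather than `str.format` because prompt text that contains other literal braces would make `format` raise or misinterpret them.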

In [24]:
agent = ExtractionAgent(llm=llm, parser_llm=parser_llm)

In [None]:
for i in range(len(mof_names_df)):
    if mof_names_df.iloc[i]["Reference"] not in [os.path.basename(file) for file in files_in_matching]:
        continue

    mof = mof_names_df.iloc[i]["MOF Name"]
    refcode = mof_names_df.iloc[i]["CSD Ref Code"]
    reference = mof_names_df.iloc[i]["Reference"]
    
    response = agent.agent_response(mof, [file for file in files_in_matching if os.path.basename(file) == reference][0],
                                    EXTRACTION, ["Water Stability"], sp_dict, CoV=True, fuzz_threshold=85)
    
    general_extraction = response[0]

    filtered = general_extraction[0]
    all_props = general_extraction[1]

    print(filtered)
    print(all_props)

    # Append to each column list; plain assignment would replace the list with a single value.
    for j in filtered:
        filtered_result["MOF Name"].append(mof)
        filtered_result["Ref Code"].append(refcode)
        filtered_result["Reference"].append(reference)
        filtered_result["Property"].append(j.name)
        filtered_result["Units"].append(j.units)
        filtered_result["Value"].append(j.value)
        filtered_result["Condition"].append(j.condition)
        filtered_result["Summary"].append(j.summary)
    for j in all_props.properties:
        result["MOF Name"].append(mof)
        result["Ref Code"].append(refcode)
        result["Reference"].append(reference)
        result["Property"].append(j.name)
        result["Units"].append(j.units)
        result["Value"].append(j.value)
        result["Condition"].append(j.condition)
        result["Summary"].append(j.summary)
    
    specific_extraction = response[1]

    ws = specific_extraction[0]

    for j in ws:
        ws_result["MOF Name"].append(mof)
        ws_result["Ref Code"].append(refcode)
        ws_result["Reference"].append(reference)
        ws_result["Property"].append(j.name)
        ws_result["Units"].append(j.units)
        ws_result["Value"].append(j.value)
        ws_result["Condition"].append(j.condition)
        ws_result["Summary"].append(j.summary)


all_props = pd.DataFrame(result)
filtered = pd.DataFrame(filtered_result)
ws = pd.DataFrame(ws_result)
    


[Document(metadata={}, page_content='Structure Of Diaqua(4-Oxoheptanedioato)Zine(Ii), [Zn(Cthsos)(H20)2]\n\nBY ANASTAS KARIPIDES \nDepartment of Chemistry, Miami University, Oxford, Ohio 45056, USA \n(Received 17 May 1983; accepted 27 July 1983) \nAbstract. M r = 273.6, monoclinic, P2/c, a = 9.307 (3), b=5.194(2), c=10.850(5)/k, fl=99.74(1) °, V= 516.9/~3, Z = 2, D m = 1.74, D x = 1.76 g cm -3, 2(Mo Ka) = 0.71069 A, g = 0.246 cm -1, F(000) = 280, T= 293 K. Final R =0.042 for 1172 unique observed reflections. The Zn 2÷ and 4-oxoheptanedioate ions are each constrained by the space group to lie on sites of C2 symmetry. The severely distorted \'octahedral\' ZnO 6 polyhedron consists of two water molecules \n(cis configuration) at 2.020 (2) A and four carboxylate O atoms from two different 4-oxoheptanedioate ions at 2.213 (3) and 2.179 (2)/~. The oxo O atom is not coordinated to the Zn 2+ ion. Introduction. For some time, we have been interested in the variability of metal-ion binding by c

AttributeError: 'tuple' object has no attribute 'name'

In [None]:
j.name()

AttributeError: 'tuple' object has no attribute 'name'
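The `AttributeError` above means that at least one item reached by the loop is a plain tuple rather than an object with `name`, `units`, `value`, `condition`, and `summary` attributes. The truncated traceback does not show which of the three iterables is responsible. The sketch below is a small diagnostic that reports the position and type of each offending item. The `Prop` class is a hypothetical stand-in for whatever object type the agent returns, and `find_bad_items` is not part of MOF ChemUnity.

```python
from dataclasses import dataclass

REQUIRED = ("name", "units", "value", "condition", "summary")


@dataclass
class Prop:
    """Hypothetical stand-in for an extracted property object."""
    name: str
    units: str
    value: str
    condition: str
    summary: str


def find_bad_items(items):
    """Return (index, type name, missing attributes) for items lacking a required attribute."""
    problems = []
    for index, item in enumerate(items):
        missing = [attr for attr in REQUIRED if not hasattr(item, attr)]
        if missing:
            problems.append((index, type(item).__name__, missing))
    return problems


# One well-formed item and one tuple, mimicking the failure seen above.
items = [Prop("Water Stability", "", "Stable", "", ""), ("Water Stability", "Stable")]
problems = find_bad_items(items)
print(problems)
```

Running such a check on `filtered`, `all_props.properties`, and `ws` separately would show which return value needs unpacking before its fields are read.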

In [None]:
all_props.to_csv("/mnt/c/Users/Amro/Desktop/all.csv")
filtered.to_csv("/mnt/c/Users/Amro/Desktop/fil.csv")
ws.to_csv("/mnt/c/Users/Amro/Desktop/ws.csv")
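By default, `to_csv` also writes the DataFrame index as an unlabeled first column, and reading such a file back produces a column named `Unnamed: 0`, like the one visible in the matching CSV preview earlier. The sketch below shows the difference using an in-memory buffer in place of a file on disk; the sample data is made up.

```python
import io

import pandas as pd

frame = pd.DataFrame({"MOF Name": ["example"], "Value": ["Stable"]})

# Default behaviour: the index is written as an unlabeled first column.
with_index_buffer = io.StringIO()
frame.to_csv(with_index_buffer)
with_index_buffer.seek(0)
with_index = pd.read_csv(with_index_buffer)

# index=False writes only the data columns.
without_index_buffer = io.StringIO()
frame.to_csv(without_index_buffer, index=False)
without_index_buffer.seek(0)
without_index = pd.read_csv(without_index_buffer)

print(list(with_index.columns))
print(list(without_index.columns))
```

Whether to keep the index depends on how the files are consumed downstream; passing `index=False` is only useful if the row numbers carry no meaning.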