# MOF ChemUnity Property Extraction

This notebook demonstrates how property extraction works in MOF ChemUnity. Before running it, you need the name of each MOF whose properties you want to extract; these names come from the Matching workflow.

In [1]:
from src.MOF_ChemUnity.Agents.ExtractionAgent import ExtractionAgent
from src.MOF_ChemUnity.utils.DocProcessor import DocProcessor
from src.MOF_ChemUnity.Extraction_Prompts import VERIFICATION, RECHECK, EXTRACTION
from src.MOF_ChemUnity.Water_Stability_Prompts import WATER_STABILITY, RULES_WATER_STABILITY, VERF_RULES_WATER_STABILITY, WATER_STABILITY_RE

### Preparation of MOF Names from Matching CSV

We first read the matching CSV file, which provides the MOF names, CSD reference codes, and DOIs used in the extraction loops below.

In [2]:
import pandas as pd
import glob
import os

In [3]:
with open(".apikeys", "r") as f:
    # Strip the trailing newline so it is not stored as part of the key
    os.environ["OPENAI_API_KEY"] = f.read().strip()

from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o", temperature=0.1)
parser_llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

In [4]:
mof_names_df = pd.read_csv("/mnt/c/Users/Amro/Desktop/DOI LISTS/Processing Batches - results/Elsevier/filtered_elsevier_data.csv", nrows=1035, skiprows=1035, names=["MOF Name", "CSD Ref Code", "Justification", "DOI"])
mof_names_df.head()

Unnamed: 0,MOF Name,CSD Ref Code,Justification,DOI
0,{[Cu3Tb2(pydc)6(H2O)12]·4H2O}n<|>Compound 1,GERTII,"The MOF {[Cu3Tb2(pydc)6(H2O)12]·4H2O}n, also r...",10.1016/j.jssc.2012.08.028
1,[Fe(squarate)(bpee)(H2O)2]<|>complex 2,RAXNEK,The MOF [Fe(squarate)(bpee)(H2O)2] matches the...,10.1016/j.ica.2005.07.014
2,[Mn(btb)(H2O)2(N3)2]n<|>4,DEPWUR,"The MOF [Mn(btb)(H2O)2(N3)2]n, also referred t...",10.1016/j.molstruc.2006.03.082
3,[{Cu2(bisterpy)}V4O12]·2H2O<|>compound 1<|>1·2H2O,TIJHAW,The MOF [{Cu2(bisterpy)}V4O12]·2H2O matches th...,10.1016/j.solidstatesciences.2007.05.002
4,2 ∞[(Me3Sn)2(fum)]<|>compound 1,FIRDAO,The compound 2 ∞[(Me3Sn)2(fum)] is described i...,10.1016/j.jorganchem.2018.12.021
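
In the rows above, each `MOF Name` cell appears to hold several names for the same compound, joined by `<|>`. The following is a minimal sketch of how such a cell could be split into its aliases. `split_aliases` and `ALIAS_SEPARATOR` are hypothetical names introduced here for illustration; they are not part of MOF ChemUnity, and the agent may handle the separator differently internally.

```python
# Hypothetical helper (not part of MOF ChemUnity): split a "MOF Name" cell
# into individual aliases using the "<|>" separator seen in the CSV output.
ALIAS_SEPARATOR = "<|>"


def split_aliases(mof_name: str) -> list[str]:
    """Return the non-empty, whitespace-trimmed aliases in a MOF Name cell."""
    return [part.strip() for part in mof_name.split(ALIAS_SEPARATOR) if part.strip()]


# Example cell taken from the table above
aliases = split_aliases("[{Cu2(bisterpy)}V4O12]·2H2O<|>compound 1<|>1·2H2O")
print(aliases)
```

Keeping the full cell as the key and splitting only when needed avoids losing the link between a formula and its in-paper label such as "compound 1".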


In [5]:
mof_names_df[mof_names_df["CSD Ref Code"] == "EGIMEN"]

Unnamed: 0,MOF Name,CSD Ref Code,Justification,DOI


### Markdown files setup

In [6]:
from src.MOF_ChemUnity.Agents.BaseAgent import BaseAgent


input_folder = "/mnt/c/Users/Amro/Desktop/DOI LISTS/Processing Batches - vs/Elsevier/vs"


files = glob.glob(input_folder+"/*/")

### Running the Extraction Loop for General Property Extraction + CoV

In [7]:
result = {}
result["MOF Name"] = []
result["Ref Code"] = []
result["Property"] = []
result["Value"] = []
result["Units"] = []
result["Condition"] = []
result["Summary"] = []
result["Reference"] = []

In [8]:
filtered_result = {}
filtered_result["MOF Name"] = []
filtered_result["Ref Code"] = []
filtered_result["Property"] = []
filtered_result["Value"] = []
filtered_result["Units"] = []
filtered_result["Condition"] = []
filtered_result["Summary"] = []
filtered_result["Reference"] = []

In [9]:
ws_result = {}
ws_result["MOF Name"] = []
ws_result["Ref Code"] = []
ws_result["Property"] = []
ws_result["Value"] = []
ws_result["Units"] = []
ws_result["Condition"] = []
ws_result["Summary"] = []
ws_result["Reference"] = []
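
The three dictionaries above share the same eight columns, and the extraction loop later appends to each of them field by field. As a sketch only, the same bookkeeping could be written with two small helpers. `new_result_table` and `append_property` are hypothetical names, and the `SimpleNamespace` object is a made-up stand-in for the property objects returned by the agent, which are assumed to expose `name`, `value`, `units`, `condition`, and `summary` as used in the loop.

```python
from types import SimpleNamespace

COLUMNS = ["MOF Name", "Ref Code", "Property", "Value",
           "Units", "Condition", "Summary", "Reference"]


def new_result_table() -> dict:
    """Create an empty column-oriented table with one list per column."""
    return {column: [] for column in COLUMNS}


def append_property(table: dict, mof: str, refcode: str, reference: str, prop) -> None:
    """Append one extracted property as a new row of the table."""
    table["MOF Name"].append(mof)
    table["Ref Code"].append(refcode)
    table["Reference"].append(reference)
    table["Property"].append(prop.name)
    table["Value"].append(prop.value)
    table["Units"].append(prop.units)
    table["Condition"].append(prop.condition)
    table["Summary"].append(prop.summary)


# Placeholder values only; this is not real extraction output
demo_table = new_result_table()
demo_prop = SimpleNamespace(name="example property", value="1.0", units="a.u.",
                            condition="example condition", summary="example summary")
append_property(demo_table, "example MOF", "EXAMPLE", "10.0000/example", demo_prop)
print(demo_table)
```

Because every column gets exactly one value per call, the lists stay the same length, which `pd.DataFrame` requires.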

In [10]:
WS_READ = WATER_STABILITY.replace("{RULES}", RULES_WATER_STABILITY)
WS_CHECK = VERIFICATION.replace("{VERF_RULES}", VERF_RULES_WATER_STABILITY)
WS_RECHECK = RECHECK.replace("{RECHECK_INSTRUCTIONS}", WATER_STABILITY_RE.replace("{RULES}", RULES_WATER_STABILITY))

sp_dict = {"read_prompts": [WS_READ], "verification_prompts": [WS_CHECK], "recheck_prompts": [WS_RECHECK]}
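
The cell above builds each water stability prompt by substituting text into a placeholder such as `{RULES}`, `{VERF_RULES}`, or `{RECHECK_INSTRUCTIONS}`. The sketch below illustrates the same substitution with made-up template strings (the real prompt constants are not reproduced here) and adds a hypothetical check, `unresolved_placeholders`, that reports any upper-case placeholder left behind. The upper-case-only pattern is an assumption based on the three placeholder names visible in this notebook.

```python
import re

# Made-up stand-ins for the real prompt constants
RULES = "1. Report only what the paper states."
READ_TEMPLATE = "Find the water stability of the MOF.\nRules:\n{RULES}"
RECHECK_TEMPLATE = "Check the answer again.\n{RECHECK_INSTRUCTIONS}"
RE_TEMPLATE = "Re-read the document and apply these rules:\n{RULES}"


def unresolved_placeholders(prompt: str) -> list[str]:
    """Return upper-case {PLACEHOLDER} tokens that are still in the prompt."""
    return re.findall(r"\{[A-Z_]+\}", prompt)


read_prompt = READ_TEMPLATE.replace("{RULES}", RULES)

# Nested substitution, mirroring how WS_RECHECK is assembled above
recheck_prompt = RECHECK_TEMPLATE.replace(
    "{RECHECK_INSTRUCTIONS}", RE_TEMPLATE.replace("{RULES}", RULES)
)

print(unresolved_placeholders(READ_TEMPLATE))
print(unresolved_placeholders(recheck_prompt))
```

`str.replace` does nothing when the placeholder is misspelled, so a check like this can catch a prompt that would otherwise reach the model with a literal `{RULES}` in it.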

In [11]:
agent = ExtractionAgent(llm=llm)

In [None]:
from openai import RateLimitError

try:
    for i in range(459, 560):

        mof = mof_names_df.iloc[i]["MOF Name"]
        refcode = mof_names_df.iloc[i]["CSD Ref Code"]
        reference = mof_names_df.iloc[i]["DOI"]

        if len(refcode) > 8:
            continue
        if refcode.lower() == "not provided":
            continue        
        
        response = agent.agent_response(mof, reference.replace("/","_")+".md",
                                        EXTRACTION, ["Water Stability"], sp_dict, CoV=True, fuzz_threshold=85,
                                        vector_store=input_folder+f"/{reference.replace('/','_',1)}")


        general_extraction = response[0]

        filtered = general_extraction[0]
        all_props = general_extraction[1]

        print(filtered)
        print(all_props)

        for j in filtered:
            filtered_result["MOF Name"].append(mof)
            filtered_result["Ref Code"].append(refcode)
            filtered_result["Reference"].append(reference)
            filtered_result["Property"].append(j.name)
            filtered_result["Units"].append(j.units)
            filtered_result["Value"].append(j.value)
            filtered_result["Condition"].append(j.condition)
            filtered_result["Summary"].append(j.summary)
        for j in all_props.properties:
            result["MOF Name"].append(mof)
            result["Ref Code"].append(refcode)
            result["Reference"].append(reference)
            result["Property"].append(j.name)
            result["Units"].append(j.units)
            result["Value"].append(j.value)
            result["Condition"].append(j.condition)
            result["Summary"].append(j.summary)
        
        specific_extraction = response[1]

        ws = specific_extraction[0]

        for j in ws:
            ws_result["MOF Name"].append(mof)
            ws_result["Ref Code"].append(refcode)
            ws_result["Reference"].append(reference)
            ws_result["Property"].append(j.name)
            ws_result["Units"].append(j.units)
            ws_result["Value"].append(j.value)
            ws_result["Condition"].append(j.condition)
            ws_result["Summary"].append(j.summary)
except Exception as e:
    print(e)
    all_props = pd.DataFrame(result)
    filtered = pd.DataFrame(filtered_result)
    ws = pd.DataFrame(ws_result)

    all_props.to_csv("/mnt/c/Users/Amro/Desktop/all-Elsevier-P2_3.csv")
    filtered.to_csv("/mnt/c/Users/Amro/Desktop/fil-Elsevier-P2_3.csv")
    ws.to_csv("/mnt/c/Users/Amro/Desktop/ws-Elsevier-P2_3.csv")

all_props = pd.DataFrame(result)
filtered = pd.DataFrame(filtered_result)
ws = pd.DataFrame(ws_result)
    

KeyError: 'File Name'

In [58]:
all_props.to_csv("/mnt/c/Users/Amro/Desktop/all-Elsevier-P2_3.csv")
filtered.to_csv("/mnt/c/Users/Amro/Desktop/fil-Elsevier-P2_3.csv")
ws.to_csv("/mnt/c/Users/Amro/Desktop/ws-Elsevier-P2_3.csv")
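
The extraction loop skips rows whose reference code is longer than 8 characters or equal to "not provided", and it derives the markdown file name from the DOI by replacing `/` with `_`. The sketch below restates those two steps as standalone functions. `is_usable_refcode` and `doi_to_markdown_name` are hypothetical names; rejecting non-string values (for example NaN from an empty CSV cell) and empty strings is an addition that the original loop does not perform.

```python
def is_usable_refcode(refcode) -> bool:
    """Mirror the loop's skip rules: at most 8 characters and not 'not provided'."""
    if not isinstance(refcode, str):
        return False  # addition: e.g. NaN from an empty CSV cell
    code = refcode.strip()
    if not code:
        return False  # addition: empty cell
    return len(code) <= 8 and code.lower() != "not provided"


def doi_to_markdown_name(doi: str) -> str:
    """Build the markdown file name the loop passes to the agent."""
    return doi.replace("/", "_") + ".md"


print(is_usable_refcode("GERTII"))
print(is_usable_refcode("Not Provided"))
print(doi_to_markdown_name("10.1016/j.jssc.2012.08.028"))
```

Applying the filter before the loop would also make it easier to see how many rows are actually sent to the model.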

## Application Extraction

In [59]:
from src.MOF_ChemUnity.utils.DataModels import ListApplications
from src.MOF_ChemUnity.Extraction_Prompts import APPLICATION

In [60]:
result = {}
result["MOF Name"] = []
result["Ref Code"] = []
result["Application"] = []
result["Recommendation"] = []
result["Justification"] = []
result["Source"] = []

In [61]:
agent = ExtractionAgent(llm = llm)

In [62]:
from langchain.callbacks import get_openai_callback

In [63]:
with get_openai_callback() as cb:
    try:
        for i in range(570, len(mof_names_df)):

            mof = mof_names_df.iloc[i]["MOF Name"]
            refcode = mof_names_df.iloc[i]["CSD Ref Code"]
            reference = mof_names_df.iloc[i]["DOI"]
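            # Requires a "File Name" column; the column names passed to read_csv in In [4]
            # do not include one, so a CSV that provides it must be loaded first.
            file = mof_names_df.iloc[i]["File Name"]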

            if len(refcode) > 8:
                continue
            if refcode.lower() == "not provided":
                continue        

            response = agent.agent_response(mof, reference.replace("/","_")+".md",
                                            APPLICATION, CoV=False, filtered=False, gen_extraction_structure = ListApplications,
                                            vector_store=input_folder+f"/{file}")


            # The response is used directly as the applications result here
            applications = response

            print(applications)

            for app in applications.app_list:
                result["MOF Name"].append(mof)
                result["Ref Code"].append(refcode)
                result["Source"].append(reference)
                result["Application"].append(app.application_name)
                result["Recommendation"].append(app.recommendation)
                result["Justification"].append(app.justification)
        
        print(cb)

    except Exception as e:
        print(e)
        print(cb)
        print(i)
        res = pd.DataFrame(result)
    

Action: reading the document
finding all properties of name 1: [[Ag(L)(CO2CF3)]2]∞(5) ---name 2: complex 5

Result: 
- Application: Not Provided
- Recommendation: Not Provided
- Justification: The documents do not mention any specific application or recommendation for the MOF {[Ag(L)(CO2CF3)]2}∞(5) or complex 5.

Parsed Result: 
1- Application: Not Provided
Author Recommendation: Not Provided
Exact Sentences: The documents do not mention any specific application or recommendation for the MOF {[Ag(L)(CO2CF3)]2}∞(5) or complex 5.



1- Application: Not Provided
Author Recommendation: Not Provided
Exact Sentences: The documents do not mention any specific application or recommendation for the MOF {[Ag(L)(CO2CF3)]2}∞(5) or complex 5.

Action: reading the document
finding all properties of name 1: Zn2(L3) ---name 2: complex 8

Result: 
- Application: Not Provided
- Recommendation: Not Provided
- Justification: The documents do not mention any specific applications or recommendations for the

In [68]:
res = pd.DataFrame(result)

In [69]:
res

Unnamed: 0,MOF Name,Ref Code,Application,Recommendation,Justification,Source
0,{[Ag(L)(CO2CF3)]2}∞(5)<|>complex 5,OSEWOZ,Not Provided,Not Provided,The documents do not mention any specific appl...,10.1021/cg101490a
1,Zn2(L3)<|>complex 8,MIPPEI,Not Provided,Not Provided,The documents do not mention any specific appl...,10.1021/ic400751n
2,Zn2(L3)<|>complex 9,MIPPIM,Not Provided,Not Provided,The documents do not mention any specific appl...,10.1021/ic400751n
3,Ag2(L3)<|>complex 1,MIPQAF,Not Provided,Not Provided,Not Provided,10.1021/ic400751n
4,"{La(Ben)3(4,4′-BPNO)(H2O)2}·HBen<|>complex 4",OTAVOV,Separation of C6−C8 aromatics,Investigated,"Recently, such a 3D coordination polymer deriv...",10.1021/cg101432k
...,...,...,...,...,...,...
340,Ag2(bipy)2(ox)·7H2O<|>complex 1,OROZEB,Not Provided,Not Provided,The documents do not mention any specific appl...,10.1021/cg100927k
341,Zn(4-pyridylacrylate)2<|>compound 3,ECIWAO,Nonlinear Optical (NLO) Materials,Investigated,We have carried out preliminary powder SHG stu...,10.1021/cm010301n
342,Cd(4-pyridylacrylate)2·H2O<|>compound 4,JUKXIW01,Second-order nonlinear optical (NLO) applications,Investigated,In light of recent success in the construction...,10.1021/cm010301n
343,[{Co2(tpyprz)(H2O)3}Mo5O15{O3P(CH2)3PO3}]·7H2O...,IFAQUC,Not Provided,Not Provided,Not Provided,10.1021/ic701573r


In [70]:
res.to_csv("/mnt/c/Users/Amro/Desktop/apps-ACS-P1_1real.csv")
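
Both loops in this notebook collect rows incrementally and rely on an `except` block, plus a later cell, to turn the partial results into a DataFrame when a call fails. As a sketch of one alternative, a `try`/`finally` wrapper can guarantee that whatever was collected is handed to a save function exactly once, whether or not the loop completes. `run_with_checkpoint`, `flaky`, and `saved` are hypothetical names for illustration; in the notebook, `save` would be a function that builds the DataFrame and calls `to_csv`.

```python
def run_with_checkpoint(items, process, save):
    """Apply `process` to each item and always pass collected results to `save`."""
    results = []
    try:
        for item in items:
            results.append(process(item))
    finally:
        # Runs on success and on failure, so partial results are never lost
        save(results)
    return results


saved = []


def flaky(x):
    """Stand-in for an extraction call that fails part-way through a batch."""
    if x == 3:
        raise ValueError("simulated failure")
    return x * 2


try:
    run_with_checkpoint([1, 2, 3, 4], flaky, saved.extend)
except ValueError as error:
    print(f"stopped early: {error}")

print(saved)
```

Unlike a bare `except Exception`, this lets the original error propagate, so the index at which the batch stopped remains visible in the traceback.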