## <b>MOF-ChemUnity Property Extraction</b>

This notebook demonstrates the property extraction workflow in MOF-ChemUnity. It requires the MOF names extracted by the Matching Workflow, so run ``ChemUnity_Matching.ipynb`` first to obtain them. The vector stores generated by the Matching Workflow are reused here, so there is no need to read .pdf or .xml files or to convert files to .md again.

In [1]:
import pandas as pd
import glob
import os
from langchain_openai import ChatOpenAI
from langchain.callbacks import get_openai_callback

from MOF_ChemUnity.Agents.ExtractionAgent import ExtractionAgent
from MOF_ChemUnity.utils.DocProcessor import DocProcessor
from MOF_ChemUnity.Prompts.Extraction_Prompts import VERIFICATION, RECHECK, EXTRACTION
from MOF_ChemUnity.Prompts.Water_Stability_Prompts import WATER_STABILITY, RULES_WATER_STABILITY, VERF_RULES_WATER_STABILITY, WATER_STABILITY_RE
from MOF_ChemUnity.Agents.BaseAgent import BaseAgent
from openai import RateLimitError
from MOF_ChemUnity.utils.DataModels import ListApplications
from MOF_ChemUnity.Prompts.Extraction_Prompts import APPLICATION

<b>Preparation of MOF Names from Matching CSV</b>

Run ```ChemUnity_Matching.ipynb``` to obtain the matching .csv file, which is loaded below as ```mof_names_df```. It maps each MOF name to its CSD reference code and to the DOI of the source paper, and it is required for the rest of this demonstration. You also need your own OpenAI API key. The following cell sets up the language models and loads the matching results.

In [3]:
# If you haven't already, set your OpenAI API key as an environment variable
#os.environ["OPENAI_API_KEY"] = 'YOUR_API_KEY'

llm = ChatOpenAI(model="gpt-4o", temperature=0.1)
parser_llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

# mof_names_df is directly from matching
mof_names_df = pd.read_csv("Examples/KG_Data/matching.csv")
mof_names_df.head()

Unnamed: 0,MOF Name,CSD Ref Code,Justification,DOI
0,[HN(CH2CH2)3NH]3[Fe8(HPO4)12(PO4)2(H2O)6]<|>co...,NAVCAO,The compound [HN(CH2CH2)3NH]3[Fe8(HPO4)12(PO4)...,10.1039/a700472i
1,[Ag(CM-TTF)(CF3SO3)]<|>complex 2,FEKDAA,The compound [Ag(CM-TTF)(CF3SO3)] is described...,10.1039/a804945i
2,C9H12CaO10<|>calcium trimesate synthesized by ...,OWZEW01,The MOF C9H12CaO10 matches the CSD Ref Code OW...,10.1016/j.molstruc.2017.11.128
3,C23.5H28.5EuN2.5O10.5<|>complex 2,CIJBEF,The MOF with the empirical formula C23.5H28.5E...,10.1021/acs.analchem.8b00494
4,C23.5H28.5TbN2.5O10.5<|>complex 3,CIJBIJ,The MOF with the empirical formula C23.5H28.5T...,10.1021/acs.analchem.8b00494
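
Each "MOF Name" cell above appears to hold several names for the same MOF joined by `<|>`. The sketch below splits such a cell into its individual names. It assumes that `<|>` is purely a separator; `split_mof_names` is a hypothetical helper and not part of MOF_ChemUnity.

```python
# Assumption: "<|>" separates alternative names of one MOF,
# as suggested by the "MOF Name" column in the matching CSV.
NAME_SEPARATOR = "<|>"


def split_mof_names(raw: str) -> list[str]:
    """Return the individual names stored in one 'MOF Name' cell."""
    return [part.strip() for part in raw.split(NAME_SEPARATOR) if part.strip()]


names = split_mof_names("[Ag(CM-TTF)(CF3SO3)]<|>complex 2")
print(names)  # ['[Ag(CM-TTF)(CF3SO3)]', 'complex 2']
```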


<b>Markdown files setup</b>

This example reuses the markdown example from the matching notebook. The next cell performs three steps:
- Point ```input_folder``` to the vector store folder "vs" that was created during matching.
- Initialize the extraction result dictionaries (```result```, ```filtered_result```, ```ws_result```).
- Load the prompts needed for water stability extraction and construct the extraction agent.
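
Before starting a long run, it can help to check that every DOI in the matching file has a vector store folder. The sketch below assumes that each folder is named after the DOI with its first "/" replaced by "_", which is how the extraction loop builds the path. `vector_store_path` and `missing_vector_stores` are hypothetical helpers, and the demo uses a temporary directory rather than the real "vs" folder.

```python
import os
import tempfile


def vector_store_path(input_folder: str, doi: str) -> str:
    # Assumption: folder name = DOI with its first "/" replaced by "_".
    return os.path.join(input_folder, doi.replace("/", "_", 1))


def missing_vector_stores(input_folder: str, dois: list[str]) -> list[str]:
    """Return the DOIs that have no vector store folder under input_folder."""
    return [doi for doi in dois
            if not os.path.isdir(vector_store_path(input_folder, doi))]


# Demo in a throwaway directory that contains a single vector store folder.
with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, "10.1039_a700472i"))
    missing = missing_vector_stores(tmp, ["10.1039/a700472i", "10.1039/a804945i"])
    print(missing)  # ['10.1039/a804945i']
```

With the real data, the call would be `missing_vector_stores(input_folder, mof_names_df["DOI"].unique())`.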

In [3]:
## step 1: point input_folder to vector store folder
input_folder = "Examples/MD/vs"
files = glob.glob(input_folder+"/*/")

## step 2: initialize extraction results dictionary
result, filtered_result, ws_result = {}, {}, {}
keys_of_interest = ["MOF Name", "Ref Code", "Property", "Value", "Units", "Condition", "Summary", "Reference"]

for key in keys_of_interest:
    result[key] = []
    filtered_result[key] = []
    ws_result[key] = []

## step 3: load the prompts for extraction and construct agent object
WS_READ = WATER_STABILITY.replace("{RULES}", RULES_WATER_STABILITY)
WS_CHECK = VERIFICATION.replace("{VERF_RULES}", VERF_RULES_WATER_STABILITY)
WS_RECHECK = RECHECK.replace("{RECHECK_INSTRUCTIONS}", WATER_STABILITY_RE.replace("{RULES}", RULES_WATER_STABILITY))

sp_dict = {"read_prompts": [WS_READ], "verification_prompts": [WS_CHECK], "recheck_prompts": [WS_RECHECK]}
agent = ExtractionAgent(llm=llm)
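
The prompts in step 3 are assembled by plain string substitution: each template carries a placeholder such as `{RULES}`, and the recheck prompt nests one substitution inside another. The sketch below reproduces that pattern with short, made-up templates. The strings are hypothetical stand-ins, not the real contents of `WATER_STABILITY`, `RECHECK` or `WATER_STABILITY_RE`.

```python
# Hypothetical stand-ins for the real prompt templates.
READ_TEMPLATE = "Find the water stability of the MOF.\nRules:\n{RULES}"
RECHECK_TEMPLATE = "Re-examine your answer.\n{RECHECK_INSTRUCTIONS}"
RECHECK_DETAIL = "Apply these rules again:\n{RULES}"
RULES = "- Only report stability that the paper states explicitly."

# Single substitution, as for WS_READ.
read_prompt = READ_TEMPLATE.replace("{RULES}", RULES)

# Nested substitution, as for WS_RECHECK: fill the inner template first,
# then place the result into the outer template.
recheck_prompt = RECHECK_TEMPLATE.replace(
    "{RECHECK_INSTRUCTIONS}", RECHECK_DETAIL.replace("{RULES}", RULES)
)

# Sanity check: no placeholder should survive in the final prompts.
for prompt in (read_prompt, recheck_prompt):
    assert "{RULES}" not in prompt and "{RECHECK_INSTRUCTIONS}" not in prompt

print(recheck_prompt)
```

Unlike `str.format`, `str.replace` leaves any other braces in a prompt untouched, which matters if a template contains literal braces such as a JSON example.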

<b>Running the Extraction Loop for General Property Extraction + CoV</b>

The following cell runs general property extraction with CoV, together with the targeted water stability extraction. It extracts details such as the crystal system, chemical formula, space group and surface area, plus a water stability label where the literature reports one for that particular MOF. The results are collected in the ```result```, ```filtered_result``` and ```ws_result``` dictionaries, which are then converted to the ```all_props```, ```filtered``` and ```ws``` DataFrames and saved in a separate folder for future reference.

In [4]:
try:
    for i in range(len(mof_names_df)):

        mof = mof_names_df.iloc[i]["MOF Name"]
        refcode = mof_names_df.iloc[i]["CSD Ref Code"]
        reference = mof_names_df.iloc[i]["DOI"]

        if len(refcode) > 8:
            continue
        if refcode.lower() == "not provided":
            continue        
        
        response = agent.agent_response(mof, reference.replace("/","_")+".md",
                                        EXTRACTION, ["Water Stability"], sp_dict, CoV=True, fuzz_threshold=85,
                                        vector_store=input_folder+f"/{reference.replace('/','_',1)}")


        general_extraction = response[0]

        filtered = general_extraction[0]
        all_props = general_extraction[1]

        for j in filtered:
            filtered_result["MOF Name"].append(mof)
            filtered_result["Ref Code"].append(refcode)
            filtered_result["Reference"].append(reference)
            filtered_result["Property"].append(j.name)
            filtered_result["Units"].append(j.units)
            filtered_result["Value"].append(j.value)
            filtered_result["Condition"].append(j.condition)
            filtered_result["Summary"].append(j.summary)

        for j in all_props.properties:
            result["MOF Name"].append(mof)
            result["Ref Code"].append(refcode)
            result["Reference"].append(reference)
            result["Property"].append(j.name)
            result["Units"].append(j.units)
            result["Value"].append(j.value)
            result["Condition"].append(j.condition)
            result["Summary"].append(j.summary)
        
        specific_extraction = response[1]
        ws = specific_extraction[0]

        for j in ws:
            ws_result["MOF Name"].append(mof)
            ws_result["Ref Code"].append(refcode)
            ws_result["Reference"].append(reference)
            ws_result["Property"].append(j.name)
            ws_result["Units"].append(j.units)
            ws_result["Value"].append(j.value)
            ws_result["Condition"].append(j.condition)
            ws_result["Summary"].append(j.summary)

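except Exception as e:
    # On any failure, report the error and save whatever was extracted so far
    print(e)
    all_props = pd.DataFrame(result)
    filtered = pd.DataFrame(filtered_result)
    ws = pd.DataFrame(ws_result)

    # Partial results are written to the current working directory
    all_props.to_csv("all-Elsevier-P2_3.csv")
    filtered.to_csv("fil-Elsevier-P2_3.csv")
    ws.to_csv("ws-Elsevier-P2_3.csv")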

Action: reading the document
finding all properties of name 1: C9H12CaO10 ---name 2: calcium trimesate synthesized by one-pot self-assembly reaction
Error code: 401 - {'error': {'message': 'Incorrect API key provided: YOUR_API_KEY. You can find your API key at https://platform.openai.com/account/api-keys.', 'type': 'invalid_request_error', 'param': None, 'code': 'invalid_api_key'}}
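
The loop above repeats the same eight `append` calls for each of the three result dictionaries. A minimal sketch of one way to factor that bookkeeping into a helper is given below. `ExtractedProperty` is a hypothetical stand-in for the objects returned by the agent: only its attribute names (`name`, `value`, `units`, `condition`, `summary`) are taken from the loop, and the demo values are placeholders rather than extracted data.

```python
from dataclasses import dataclass


@dataclass
class ExtractedProperty:
    # Hypothetical stand-in; only the attribute names come from the loop above.
    name: str
    value: str
    units: str
    condition: str
    summary: str


COLUMNS = ["MOF Name", "Ref Code", "Property", "Value",
           "Units", "Condition", "Summary", "Reference"]


def new_table() -> dict:
    """Create an empty column-oriented table, one list per column."""
    return {column: [] for column in COLUMNS}


def append_properties(table, properties, mof, refcode, reference):
    """Append one row per property, tagging each row with the MOF metadata."""
    for prop in properties:
        table["MOF Name"].append(mof)
        table["Ref Code"].append(refcode)
        table["Reference"].append(reference)
        table["Property"].append(prop.name)
        table["Value"].append(prop.value)
        table["Units"].append(prop.units)
        table["Condition"].append(prop.condition)
        table["Summary"].append(prop.summary)


# Placeholder values only, to show that every column grows in step.
demo_table = new_table()
placeholder = ExtractedProperty("example property", "1.0", "a.u.",
                                "placeholder condition", "placeholder summary")
append_properties(demo_table, [placeholder], "example MOF", "ABCDEF", "10.0000/example")
print({column: len(values) for column, values in demo_table.items()})
```

Keeping all columns the same length is what allows `pd.DataFrame(table)` to succeed afterwards.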


In [5]:
all_props = pd.DataFrame(result)
filtered = pd.DataFrame(filtered_result)
ws = pd.DataFrame(ws_result)

# Create the output folder if it does not exist yet; to_csv does not create it
os.makedirs("Examples/results", exist_ok=True)

all_props.to_csv("Examples/results/all-Elsevier-P2_3.csv")
filtered.to_csv("Examples/results/fil-Elsevier-P2_3.csv")
ws.to_csv("Examples/results/ws-Elsevier-P2_3.csv")

In [6]:
all_props.head()

Unnamed: 0,MOF Name,Ref Code,Property,Value,Units,Condition,Summary,Reference


In [7]:
filtered.head()

Unnamed: 0,MOF Name,Ref Code,Property,Value,Units,Condition,Summary,Reference


In [8]:
ws.head()

Unnamed: 0,MOF Name,Ref Code,Property,Value,Units,Condition,Summary,Reference


<b>Application Extraction</b>

The following cells extract the applications that MOF experts and scientists report for each MOF in the literature. For each application, the output contains:
- MOF name
- CSD reference code
- The application mentioned in the literature
- Whether the authors recommend the MOF for that application (Not recommended, Investigated, Recommended)
- Justification for the recommendation
- DOI (source)
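
The fields listed above can be pictured as a small data model that is flattened into one row per application. The sketch below uses plain dataclasses as a hypothetical stand-in for `ListApplications`. Only the attribute names `app_list`, `application_name`, `recommendation` and `justification` come from the extraction cell; the real class in `MOF_ChemUnity.utils.DataModels` may be defined differently, and the demo values are placeholders.

```python
from dataclasses import dataclass, field
from typing import List, Literal

# The three recommendation labels named in the list above.
Recommendation = Literal["Not recommended", "Investigated", "Recommended"]


@dataclass
class Application:
    # Hypothetical stand-in for one extracted application.
    application_name: str
    recommendation: Recommendation
    justification: str


@dataclass
class ApplicationList:
    # Hypothetical stand-in for ListApplications.
    app_list: List[Application] = field(default_factory=list)


def to_rows(apps: ApplicationList, mof: str, refcode: str, doi: str) -> list:
    """Flatten the applications of one MOF into one row (dict) per application."""
    return [
        {
            "MOF Name": mof,
            "Ref Code": refcode,
            "Application": app.application_name,
            "Recommendation": app.recommendation,
            "Justification": app.justification,
            "Source": doi,
        }
        for app in apps.app_list
    ]


# Placeholder values only; nothing here is extracted from a paper.
demo = ApplicationList([
    Application("example application", "Investigated", "placeholder justification"),
])
rows = to_rows(demo, "example MOF", "ABCDEF", "10.0000/example")
print(rows[0]["Recommendation"])  # Investigated
```

A list of such rows can be passed directly to `pd.DataFrame(rows)`.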

In [9]:
result = {}
result["MOF Name"] = []
result["Ref Code"] = []
result["Application"] = []
result["Recommendation"] = []
result["Justification"] = []
result["Source"] = []

agent = ExtractionAgent(llm = llm)

with get_openai_callback() as cb:
    try:
        for i in range(len(mof_names_df)):

            mof = mof_names_df.iloc[i]["MOF Name"]
            refcode = mof_names_df.iloc[i]["CSD Ref Code"]
            reference = mof_names_df.iloc[i]["DOI"]
            #file = mof_names_df.iloc[i]["File Name"]

            if len(refcode) > 8:
                print(f"{refcode} is longer than 8 characters. Skipping...")
                continue
            if refcode.lower() == "not provided":
                continue        
            
            print("Trying to get response...")
            response = agent.agent_response(mof, reference.replace("/","_")+".md",
                                            APPLICATION, CoV=False, filtered=False, gen_extraction_structure = ListApplications,
                                            vector_store=os.path.join(input_folder, reference.replace("/", "_", 1)))

            # With CoV=False and filtered=False, the response is used directly
            # as the structured list of applications
            applications = response

            print(applications)

            for app in applications.app_list:
                result["MOF Name"].append(mof)
                result["Ref Code"].append(refcode)
                result["Source"].append(reference)
                result["Application"].append(app.application_name)
                result["Recommendation"].append(app.recommendation)
                result["Justification"].append(app.justification)
        
        print(cb)

    except Exception as e:
        print(e)
        print(cb)
        print(i)
        res = pd.DataFrame(result)

Trying to get response...
Action: reading the document
finding all properties of name 1: C9H12CaO10 ---name 2: calcium trimesate synthesized by one-pot self-assembly reaction
Error code: 401 - {'error': {'message': 'Incorrect API key provided: YOUR_API_KEY. You can find your API key at https://platform.openai.com/account/api-keys.', 'type': 'invalid_request_error', 'param': None, 'code': 'invalid_api_key'}}
Tokens Used: 0
	Prompt Tokens: 0
		Prompt Tokens Cached: 0
	Completion Tokens: 0
		Reasoning Tokens: 0
Successful Requests: 0
Total Cost (USD): $0.0
0


In [10]:
res = pd.DataFrame(result)
res.to_csv("Examples/results/apps-ACS-P1_1real.csv")

res

Unnamed: 0,MOF Name,Ref Code,Application,Recommendation,Justification,Source
