# MOF ChemUnity Synthesis Extraction

This notebook demonstrates how to use the synthesis extraction workflow in MOF ChemUnity.

This notebook requires the MOF names extracted by the Matching Workflow, so be sure to run the Matching Workflow first!

The vector stores generated by the Matching Workflow can also be reused here to keep things simple: we don't have to read any .pdf or .xml files or convert them to .md.



In [19]:
from src.MOF_ChemUnity.Agents.ExtractionAgent import ExtractionAgent
from src.MOF_ChemUnity.utils.DocProcessor import DocProcessor
from src.MOF_ChemUnity.Synthesis_Prompts import SYNTHESIS_EXTRACTION
from src.MOF_ChemUnity.utils.DataModels import Synthesis

### Preparation of MOF Names from Matching CSV

First, we need to read the matching CSV file and extract the MOF names from it.

In [20]:
import pandas as pd
import glob
import os

In [21]:
'''with open(".apikeys", 'r') as f:
    os.environ["OPENAI_API_KEY"] = f.read()
'''
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o", temperature=0.1)
parser_llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

In [22]:
mof_names_df = pd.read_csv("Results/Matching/Final/MOF_ChemUnity.csv")
mof_names_df.head()

Unnamed: 0,MOF Name,CSD Ref Code,Justification,DOI,Publisher
0,[Eu2(tda)2(H2O)3]·5H2O<|>compound 2,GIFFEI,"The MOF [Eu2(tda)2(H2O)3]·5H2O, also referred ...",10.1002/zaac.201200234,Wiley
1,[Sm2(tda)2(H2O)3]·5H2O<|>compound 1,GIFFAE,"The MOF [Sm2(tda)2(H2O)3]·5H2O, also referred ...",10.1002/zaac.201200234,Wiley
2,{[Ni3(BPTCA)2(H2O)2][Ni(H2O)6]·10H2O}n<|>compl...,KOFVOS,The MOF {[Ni3(BPTCA)2(H2O)2][Ni(H2O)6]·10H2O}n...,10.1002/zaac.201300500,Wiley
3,[Zn(H2O)(bpy)(Hdml)2]<|>compound 6,GUGWAI,The MOF [Zn(H2O)(bpy)(Hdml)2] matches the CSD ...,10.1002/zaac.201400336,Wiley
4,[Co3(C14H8O6S)3(DMA)2(MeOH)]·DMA<|>Ia,COWRIR,"The MOF [Co3(C14H8O6S)3(DMA)2(MeOH)]·DMA, refe...",10.1002/zaac.201400388,Wiley
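
In the preview above, several `MOF Name` cells hold more than one name joined by `<|>` (for example a formula and a label such as "compound 2"). The sketch below shows one way to split such a cell into its individual aliases. It assumes `<|>` is purely a separator between aliases, which is inferred from the preview rather than stated by the workflow; `split_mof_names` and the `Aliases` column are hypothetical names used only for illustration.

```python
import pandas as pd

# Assumption: "<|>" separates alternative names for the same MOF.
ALIAS_SEPARATOR = "<|>"


def split_mof_names(raw_name: str) -> list[str]:
    """Split a 'MOF Name' cell into a list of non-empty, stripped aliases."""
    return [part.strip() for part in str(raw_name).split(ALIAS_SEPARATOR) if part.strip()]


# Two rows copied from the previews in this notebook.
preview = pd.DataFrame({
    "MOF Name": ["[Eu2(tda)2(H2O)3]·5H2O<|>compound 2", "[Cu2I2(C14H14N4)(C18H15P)2]"],
    "CSD Ref Code": ["GIFFEI", "AZIWEM"],
})

# Hypothetical helper column holding the split aliases.
preview["Aliases"] = preview["MOF Name"].apply(split_mof_names)
print(preview["Aliases"].tolist())
```

A name without the separator simply yields a one-element list, so the same helper can be applied to every row.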


Randomly select up to 20 synthesis procedures, one per DOI, for the benchmark.

In [23]:
# Drop duplicates based on DOI to ensure uniqueness
mof_names_df = mof_names_df.drop_duplicates(subset=['DOI'])

# Randomly sample 20 rows
mof_names_df = mof_names_df.sample(n=min(20, len(mof_names_df)), random_state=42)

mof_names_df.head()

Unnamed: 0,MOF Name,CSD Ref Code,Justification,DOI,Publisher
8545,{[Cu(SO4)(L)]·(CH3OH)}n<|>1,IHOZOV,"The MOF {[Cu(SO4)(L)]·(CH3OH)}n, referred to a...",10.1016/j.jorganchem.2009.03.045,Elsevier
8822,{[μ2-P4(NCH3)6]2(CuCl)2}∞<|>Compound 2,BUXHUY,"The MOF {[μ2-P4(NCH3)6]2(CuCl)2}∞, referred to...",10.1016/j.poly.2010.03.019,Elsevier
4506,{[Zn­(H2BTTB)]·(DEF)3·(H2O)2}<|>compound 2,MIFKUJ,"The MOF {[Zn­(H2BTTB)]·(DEF)3·(H2O)2}, referre...",10.1021/cg3013393,ACS
817,[Cu(fum)(Pyphen)]<|>Compound (I),VELMAB,The MOF [Cu(fum)(Pyphen)] matches the CSD Ref ...,10.1107/S1600536806028376,IUCr
11266,[Cu2I2(C14H14N4)(C18H15P)2],AZIWEM,The MOF described in the document has a molecu...,10.1107/S1600536811036555,IUCr


In [24]:
mof_names_df.to_csv("temp.csv")

### Vector Store Files setup

In [6]:
from src.MOF_ChemUnity.Agents.BaseAgent import BaseAgent

input_folder = "/home/tom-pruyn/Documents/TDM Papers/Vector_Stores"

files = glob.glob(input_folder+"/*/")

Below, I'll create a function that appends the vector store file locations to the MOF name dataframe. Our ACS files follow a different naming convention than our other files, so we apply special rules for those papers.

In [7]:
"""import pandas as pd

# Define the input folder
input_folder = "/home/tom-pruyn/Documents/TDM Papers/Vector_Stores"

# Function to generate the file path
def generate_file_path(row):
    doi = row["DOI"]
    publisher = row["Publisher"]

    if publisher == "ACS":
        # Remove everything before and including the first slash, then add ".tei"
        processed_doi = doi.split("/", 1)[-1] + ".tei"
    else:
        # Replace first slash with an underscore
        processed_doi = doi.replace("/", "_", 1)

    return f"{input_folder}/{processed_doi}"

# Apply the function to each row and create a new column
mof_names_df["File Path"] = mof_names_df.apply(generate_file_path, axis=1)"""


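
The two naming rules in the disabled cell above can be checked on their own before touching the dataframe. The following is a minimal sketch that applies the same rules to two DOIs taken from the sampled rows. `VECTOR_STORE_ROOT` is a placeholder path and `vector_store_path` is a hypothetical helper name; neither comes from the MOF ChemUnity package.

```python
# Placeholder root folder; replace with the real vector store location.
VECTOR_STORE_ROOT = "/path/to/Vector_Stores"


def vector_store_path(doi: str, publisher: str, root: str = VECTOR_STORE_ROOT) -> str:
    """Map a DOI to its vector store folder using the rules described above."""
    if publisher == "ACS":
        # ACS: drop everything up to and including the first slash, then add ".tei".
        folder = doi.split("/", 1)[-1] + ".tei"
    else:
        # Other publishers: replace only the first slash with an underscore.
        folder = doi.replace("/", "_", 1)
    return f"{root}/{folder}"


print(vector_store_path("10.1021/cg3013393", "ACS"))
# /path/to/Vector_Stores/cg3013393.tei
print(vector_store_path("10.1016/j.poly.2010.03.019", "Elsevier"))
# /path/to/Vector_Stores/10.1016_j.poly.2010.03.019
```

Keeping the rule in a small function that takes plain strings makes it easy to test a few DOIs by hand before using it with `DataFrame.apply`.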

In [8]:
mof_names_df.to_csv("mofname.csv")

### Running the Extraction Loop for Synthesis Extraction

In [9]:
result = {}
result["Metal Precursor"] = []
result["Solvent"] = []
result["Temperature"] = []
result["Reaction Time"] = []
result["Synthesis Procedure"] = []
result["Additional Conditions"] = []
result["Summary"] = []
result["Reference"] = []
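
Because `result` is a dictionary of parallel lists, every key must receive exactly one append per processed MOF; otherwise `pd.DataFrame(result)` fails on lists of unequal length. As a possible alternative, the sketch below collects one dictionary per MOF and builds the dataframe from those records. `FIELD_MAP`, `FakeSynthesis`, and `to_record` are hypothetical names for illustration, and the attribute names are assumed to mirror those read with `getattr` in the extraction loop; the real `Synthesis` data model may differ.

```python
from dataclasses import dataclass

import pandas as pd

# Column name -> attribute name on the extraction result (assumed names).
FIELD_MAP = {
    "Metal Precursor": "metal_precursor",
    "Solvent": "solvent",
    "Temperature": "temperature",
    "Reaction Time": "reaction_time",
    "Synthesis Procedure": "synthesis_procedure",
    "Additional Conditions": "additional_conditions",
    "Summary": "justification",
}


@dataclass
class FakeSynthesis:
    """Hypothetical stand-in for the Synthesis data model, with two fields only."""
    metal_precursor: str = ""
    solvent: str = ""


def to_record(extraction, reference: str) -> dict:
    """Build one complete row; attributes missing on the result become empty strings."""
    record = {column: getattr(extraction, attr, "") for column, attr in FIELD_MAP.items()}
    record["Reference"] = reference
    return record


records = [to_record(FakeSynthesis("CuSO4·5H2O", "Methanol"), "10.1016/j.jorganchem.2009.03.045")]
res_sketch = pd.DataFrame.from_records(records)
print(res_sketch.columns.tolist())
```

Each record always carries every column, so a typo in a column name shows up as an extra or empty column rather than as lists that drift out of step.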

In [10]:
agent = ExtractionAgent(llm=llm)

In [11]:
from langchain.callbacks import get_openai_callback

In [12]:
with get_openai_callback() as cb:
    try:
        for i in range(0, len(mof_names_df)):

            mof = mof_names_df.iloc[i]["MOF Name"]
            refcode = mof_names_df.iloc[i]["CSD Ref Code"]
            reference = mof_names_df.iloc[i]["DOI"]

            if len(refcode) > 8:
                continue
            if refcode.lower() == "not provided":
                continue        

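            # Build the names outside the f-string: reusing double quotes inside
            # an f-string is a syntax error before Python 3.12.
            md_name = reference.replace("/", "_") + ".md"
            store_name = reference.replace("/", "_", 1)

            response = agent.agent_response(mof, md_name,
                                            SYNTHESIS_EXTRACTION, CoV=False, filtered=False, gen_extraction_structure=Synthesis,
                                            vector_store=f"{input_folder}/{store_name}")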
            
            synthesis_extraction = response
            # Append extracted synthesis data
            result["Metal Precursor"].append(getattr(synthesis_extraction, "metal_precursor", ""))
            result["Solvent"].append(getattr(synthesis_extraction, "solvent", ""))
            result["Temperature"].append(getattr(synthesis_extraction, "temperature", ""))
            result["Reaction Time"].append(getattr(synthesis_extraction, "reaction_time", ""))
            result["Synthesis Procedure"].append(getattr(synthesis_extraction, "synthesis_procedure", ""))
            result["Additional Conditions"].append(getattr(synthesis_extraction, "additional_conditions", ""))
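            # Store the justification under "Summary", the key initialized in `result`
            # (appending to a missing "Justification" key raises a KeyError).
            result["Summary"].append(getattr(synthesis_extraction, "justification", ""))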
            result["Reference"].append(reference)

        print(cb)

    except Exception as e:
        print(e)
        print(cb)
        print(i)
        res = pd.DataFrame(result)


Action: reading the document
finding all properties of name 1: [[Cu(SO4)(L)]·(CH3OH)]n ---name 2: 1

Result: 
1. - Metal Precursor: CuSO4·5H2O
   - Solvent: Methanol
   - Temperature: Room temperature
   - Reaction Time: Three weeks
   - Synthesis Procedure: A methanol solution (4ml) of L (41mg, 0.1mmol) was dropwise added into an aqueous solution (3ml) of CuSO4·5H2O (25mg, 0.1mmol) to give a clear solution. The resulting solution was allowed to stand in air at room temperature for three weeks, yielding blue crystals suitable for X-ray diffraction.
   - Additional Conditions: Not Provided
   - Justification: "A methanol solution (4ml) of L (41mg, 0.1mmol) was dropwise added into an aqueous solution (3ml) of CuSO4·5H2O (25mg, 0.1mmol) to give a clear solution. The resulting solution was allowed to stand in air at room temperature for three weeks, yielding blue crystals (yield 65%) suitable for X-ray diffraction."

Parsed Result: 
Metal Precursor: CuSO4·5H2O
Solvent: Methanol
Temperature

In [16]:
res = pd.DataFrame(result)

In [17]:
res

Unnamed: 0,Metal Precursor,Solvent,Temperature,Reaction Time,Synthesis Procedure,Additional Conditions,Summary,Reference
0,CuSO4·5H2O,Methanol,Room temperature,Three weeks,"A methanol solution (4ml) of L (41mg, 0.1mmol)...",Not Provided,"A methanol solution (4ml) of L (41mg, 0.1mmol)...",10.1016/j.jorganchem.2009.03.045
1,Cuprous chloride (CuCl),Dry acetonitrile,Not Provided,Two days,0.607g (2.04mmol) of TPHMI was dissolved in 10...,Not Provided,0.607g (2.04mmol) of TPHMI was dissolved in 10...,10.1016/j.poly.2010.03.019
2,Zn(NO3)2·6H2O,"DEF, ethanol, water",100 °C,4 days,"A mixture containing Zn(NO3)2·6H2O (59.4 mg, 0...",2 drops of 1N HCl were added.,"A mixture containing Zn(NO3)2·6H2O (59.4 mg, 0...",10.1021/cg3013393
3,CuCl2H2O,Methanol and water,Room temperature,Several days,A methanol solution (6 ml) of Pyphen (0.5 mmol...,Not Provided,A methanol solution (6 ml) of Pyphen (0.5 mmol...,10.1107/S1600536806028376
4,Not Provided,Not Provided,Not Provided,Not Provided,Not Provided,Not Provided,The document does not provide any details abou...,10.1107/S1600536811036555
5,Zn(OAc)2·2H2O,Water and methanol,Room temperature,6 hours,To an aqueous solution (10mL) of l-cysteic aci...,Not Provided,To an aqueous solution (10mL) of l-cysteic aci...,10.1016/j.ica.2011.07.049
6,CoCl2·6H2O,Methanol (MeOH),140°C,2 days,A mixture of CoCl2·6H2O and sodium tetrazole-1...,The reaction was conducted under autogenous pr...,"""1 was obtained by the solvothermal method: a ...",10.1016/j.jssc.2015.08.049
7,Cd(NO3)2·4H2O,"N,N-dimethylformamide (DMF) and ethanol (EtOH)...",120 °C,2 days,"A solution of Cd(NO3)2·4H2O (37 mg, 0.12 mmol)...",The synthesis was conducted in a Teflon-lined ...,"""Synthesis of [Cd(ANIC)2] (C12H12N4O4Cd) (Cd-A...",10.1039/c1jm13762j
8,Cobalt(II) acetate,Water and tetrahydrofuran (THF) in a molar rat...,110°C,Three days,"A solution of cobalt(II) acetate (187 mg, 0.75...",The reaction was conducted under autogenous pr...,"""The reaction of cobalt(ii) acetate and 2,5-di...",10.1002/anie.200501508
9,Mn(MeCO2)·4H2O,H2O,150°C,120 hours,"A mixture of Mn(MeCO2)·4H2O (0.245g, 1mmol), H...",pH adjusted to approximately 6 using NaOH (2N).,"""Synthesis of [Mn(pydc)(H2O)2] (2). A mixture ...",10.1016/j.ica.2004.12.038


In [18]:
res.to_csv("synthesis_benchmark.csv")
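
Several fields in the table above come back as "Not Provided", and one paper yields no synthesis details at all. A quick way to gauge this in the benchmark is to count that marker per column and per row. The sketch below does so on three rows copied from the table (rows 0, 1, and 4, restricted to four columns); on the real data the same expressions could be applied to `res`. The `MISSING` constant assumes "Not Provided" is the only marker used for missing values.

```python
import pandas as pd

# Assumption: missing values are always written exactly as "Not Provided".
MISSING = "Not Provided"

# Rows 0, 1 and 4 of the results table, restricted to four columns.
sample = pd.DataFrame({
    "Metal Precursor": ["CuSO4·5H2O", "Cuprous chloride (CuCl)", "Not Provided"],
    "Solvent": ["Methanol", "Dry acetonitrile", "Not Provided"],
    "Temperature": ["Room temperature", "Not Provided", "Not Provided"],
    "Reaction Time": ["Three weeks", "Two days", "Not Provided"],
})

is_missing = sample.eq(MISSING)

# Fraction of rows with a missing value, per column.
missing_rate = is_missing.mean()
# Rows where every selected field is missing (no synthesis found in the paper).
empty_rows = is_missing.all(axis=1)

print(missing_rate)
print(empty_rows.tolist())
```

Rows flagged in `empty_rows` are good candidates for manual inspection, since they may correspond to papers that report a structure without a synthesis procedure.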