# MOF ChemUnity Matching 

This notebook uses our tools to match CSD reference codes (refcodes) to the MOF names and co-references found in their synthesis papers.

### Preparation of CSD Data

First, the CSD data must be prepared for injection into the prompt. For each DOI we wish to process, we gather the relevant information for every associated CSD code.

Over 20 000 DOIs have been selected for text mining. We chose MOFs that:
- Are found in the CSD
- Are also found in either the QMOF or CoRE databases

This way, every MOF in our database already has computed properties available in QMOF or CoRE, and these properties can be added to our database at the end.
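The selection rule above can be sketched with pandas set membership. This is only an illustration: the refcodes, DOIs, and the names `qmof_codes`, `core_codes`, and `selected_dois` are hypothetical placeholders, not the actual tables used for this project.

```python
import pandas as pd

# Hypothetical CSD table; refcodes and DOIs are placeholders
csd = pd.DataFrame({
    "CSD code": ["AAAAAA", "BBBBBB", "CCCCCC"],
    "DOI": ["10.0000/paper-1", "10.0000/paper-1", "10.0000/paper-2"],
})

# Hypothetical refcode sets standing in for the QMOF and CoRE databases
qmof_codes = {"AAAAAA"}
core_codes = {"BBBBBB"}

# Keep CSD entries that also appear in QMOF or CoRE
in_computational_db = csd["CSD code"].isin(qmof_codes | core_codes)
selected = csd[in_computational_db]

# DOIs whose papers would be sent to text mining
selected_dois = selected["DOI"].unique().tolist()
```

Here `CCCCCC` is dropped because it appears in neither set, so its DOI is not selected.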

In [1]:
# Imports
import pandas as pd
import glob
import os

from src.MOF_ChemUnity.utils.DataPrep import Data_Prep
from src.MOF_ChemUnity.Agents.MatchingAgent import MatchingAgent

In [2]:
# Define path to folder containing all papers to be text mined from
paper_folder_path = '/home/tom-pruyn/Documents/TDM Papers/benchmark'

# Define path to file containing all CSD info extracted from CSD API
csd_info_path = '/home/tom-pruyn/Documents/Project/chain-eunomia/Chain_Eunomia/data/Benchmark_set_2/Ground Truth/CSD_Info.csv'

In [3]:
# List of columns we want to take from our master CSD file and put into our prompt
feature_list = [
    "CSD code", 
    "DOI",
    "Chemical Name",
    "Space group", 
    "Metal types",
    "Molecular formula",
    "Synonyms",
    "a",
    "b",
    "c"
]

In [4]:
# Initialize Data_Prep class
Prepare_Data = Data_Prep(paper_folder_path, csd_info_path, feature_list)

In [5]:
publication_data = Prepare_Data.gather_info()

In [6]:
publication_data.head()

Unnamed: 0,DOI,File Name,File Format,File Path,Journal,CSD code,Chemical Name,Space group,Metal types,Molecular formula,Synonyms,a,b,c
0,10.1021/ic5008457,10.1021_ic5008457.md,md,/home/tom-pruyn/Documents/TDM Papers/benchmark...,Journal(Inorganic Chemistry),QONLAI,catena-(tetrakis(μ3-5-(4-Pyridyl)tetrazolato)-...,P21/n,Cu,C48Cu6H32N52,[],13.111,14.791,15.671
1,10.1021/ic5008457,10.1021_ic5008457.md,md,/home/tom-pruyn/Documents/TDM Papers/benchmark...,Journal(Inorganic Chemistry),QOMSES,catena-(tetrakis(μ3-5-(4-Pyridyl)tetrazolato)-...,P21/n,Cu,C48Cu6H32N52,[],13.09,14.744,15.575
2,10.1021/ic5008457,10.1021_ic5008457.md,md,/home/tom-pruyn/Documents/TDM Papers/benchmark...,Journal(Inorganic Chemistry),QONKUB,catena-(tetrakis(μ3-5-(4-Pyridyl)tetrazolato)-...,P21/n,Cu,C48Cu6H32N52,[],13.261,15.134,15.547
3,10.1021/ic5008457,10.1021_ic5008457.md,md,/home/tom-pruyn/Documents/TDM Papers/benchmark...,Journal(Inorganic Chemistry),QONLIQ,catena-(tetrakis(μ3-5-(4-Pyridyl)tetrazolato)-...,P21/n,Cu,C48Cu6H32N52,[],13.299,14.71,15.689
4,10.1021/ic5008457,10.1021_ic5008457.md,md,/home/tom-pruyn/Documents/TDM Papers/benchmark...,Journal(Inorganic Chemistry),QONLEM,catena-(tetrakis(μ3-5-(4-Pyridyl)tetrazolato)-...,P21/n,Cu,C48Cu6H32N52,[],12.825,14.903,15.963


### Running The Prediction Loop

In [7]:
# Instantiate Agent
agent = MatchingAgent()

In [8]:
# Initialize Results Dict
results = {"Reference": [], "MOF Name": [], "CSD Ref Code": [], "Justification": []}
pickle_results = dict()

In [9]:
# Build a dictionary keyed by CSD code from the rows of the publication info DataFrame
def csd_dict(csd_data): 
    return {
        i["CSD code"]: {
            'Space Group': i["Space group"],
            'Metal Nodes': i["Metal types"],
            'Chemical Name': i["Chemical Name"],
            'a': i["a"],
            'b': i["b"],
            'c': i["c"],
            'Molecular Formula': i["Molecular formula"],
            'Synonyms': i["Synonyms"]
        } 
        for _, i in csd_data.iterrows()
    }
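To make the shape of the prompt input concrete, here is a small usage sketch of `csd_dict`. The function body is copied from the cell above so the block runs on its own. The two sample rows reuse values from the `publication_data` preview, with the chemical name left truncated as it appears there.

```python
import pandas as pd

def csd_dict(csd_data):
    # Same logic as the notebook cell: one entry per CSD code
    return {
        i["CSD code"]: {
            'Space Group': i["Space group"],
            'Metal Nodes': i["Metal types"],
            'Chemical Name': i["Chemical Name"],
            'a': i["a"],
            'b': i["b"],
            'c': i["c"],
            'Molecular Formula': i["Molecular formula"],
            'Synonyms': i["Synonyms"]
        }
        for _, i in csd_data.iterrows()
    }

# Two rows modelled on the preview output (chemical name truncated as displayed)
sample = pd.DataFrame({
    "CSD code": ["QONLAI", "QOMSES"],
    "Space group": ["P21/n", "P21/n"],
    "Metal types": ["Cu", "Cu"],
    "Chemical Name": ["catena-(tetrakis(μ3-5-(4-Pyridyl)tetrazolato)-..."] * 2,
    "Molecular formula": ["C48Cu6H32N52", "C48Cu6H32N52"],
    "Synonyms": ["[]", "[]"],
    "a": [13.111, 13.09],
    "b": [14.791, 14.744],
    "c": [15.671, 15.575],
})

csd = csd_dict(sample)
```

Because the result is keyed by CSD code, a repeated code in the input would keep only its last row.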

#### Batch Process


Note: the batch loop must compare on DOI rather than file name, and `Reference` must hold the DOI rather than the file name.

Results saved with file names in `Reference` must also be reprocessed to replace them with DOIs.
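One way to reprocess older results is to look each file name up in `publication_data`, which carries both the `File Name` and `DOI` columns. The sketch below uses small stand-in frames. `old_results` and `file_to_doi` are hypothetical names, and the second file name is assumed to follow the same pattern as the one shown in the preview.

```python
import pandas as pd

# Stand-in for publication_data; only the two columns needed for the lookup
publication_data = pd.DataFrame({
    "DOI": ["10.1021/ic5008457", "10.1021/ic5008457", "10.1021/ja200978u"],
    "File Name": ["10.1021_ic5008457.md", "10.1021_ic5008457.md", "10.1021_ja200978u.md"],
})

# Hypothetical older results: one Reference is a file name, one is already a DOI
old_results = pd.DataFrame({
    "MOF Name": ["1a", "Er crystal"],
    "Reference": ["10.1021_ic5008457.md", "10.1021/ja200978u"],
})

# One DOI per file name
file_to_doi = (
    publication_data.drop_duplicates("File Name")
    .set_index("File Name")["DOI"]
)

# Map file names to DOIs; values with no match (already DOIs) are kept unchanged
old_results["Reference"] = (
    old_results["Reference"].map(file_to_doi).fillna(old_results["Reference"])
)
```

Using the lookup table avoids guessing how `/` was encoded in the file names.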


In [29]:
# Dynamically create the results_file path
results_dir = os.path.join(paper_folder_path, 'results')
results_file = os.path.join(results_dir, 'matching.csv')
sub_batch_size = 250

# Ensure the results directory exists
os.makedirs(results_dir, exist_ok=True)

# Collect the unique DOIs to process (publication_data comes from the Data_Prep step above)
unique_files = publication_data["DOI"].unique()

# Prepare results DataFrame if results file exists
if os.path.exists(results_file):
    print("yes")
    existing_results = pd.read_csv(results_file)
    processed_files = set(existing_results["Reference"])  # Extract processed DOIs based on the 'Reference' column
    results = existing_results.to_dict(orient="list")
    print(processed_files)
else:
    processed_files = set()
    results = {"MOF Name": [], "CSD Ref Code": [], "Justification": [], "Reference": []}

# Filter remaining DOIs to process
remaining_files = [file for file in unique_files if file not in processed_files]
print(f"Total DOIs to process: {len(remaining_files)}")

# Process sub-batch
for doi in remaining_files[:sub_batch_size]:
    # Filter csd_data based on DOI (the values in remaining_files are DOIs, not file names)
    filtered_csd_data = publication_data[publication_data["DOI"] == doi]

    # Skip DOIs with no CSD rows before indexing into the frame
    if filtered_csd_data.empty:
        print(f"No data found for DOI {doi}")
        continue

    # Get the associated file path
    file_path = filtered_csd_data.iloc[0]['File Path']

    print(f"Processing DOI: {doi}")
    print(f"File Path: {file_path}")
    print("-" * 28)

    # Call the csd_dict function with the filtered data
    csd = csd_dict(filtered_csd_data)

    print(csd)
    # Get Matching Agent Response
    mofs, docs = agent.agent_response(csd, file_path, ret_docs=True)

    # Collect results
    for mof in mofs.mofs:
        results["MOF Name"].append(mof.name)
        results["CSD Ref Code"].append(mof.refcode)
        results["Justification"].append(mof.justification)
        results["Reference"].append(doi)

# Save updated results to CSV
pd.DataFrame(results).to_csv(results_file, index=False)
print(f"Results saved to {results_file}.")
print(f"Processed {len(remaining_files[:sub_batch_size])} DOIs in this batch.")



yes
{'10.1021/acs.inorgchem.6b00758', '10.1016/j.inoche.2016.06.032', '10.1039/C3CE42189A', '10.1039/c2ce25715g', '10.1021/cg701164u', '10.1021/cg9005696', '10.1021/cm101410h', '10.1016/j.inoche.2008.04.004', '10.1002/anie.201503636', '10.1039/c3ce40346g', '10.1016/j.solidstatesciences.2011.12.002', '10.1039/C5RA06273J', '10.1039/c2sc20443f', '10.1080/15533170802440755', '10.1002/chem.201405395', '10.1021/cg201025u', '10.1039/C5CE00833F', '10.1021/cg701041k', '10.1016/j.molstruc.2008.06.007', '10.1016/j.inoche.2007.12.038', '10.1021/ja2078637', '10.1039/C3CE42454E', '10.1039/c1ce05069a', '10.1021/jacs.5b13335', '10.1021/acs.cgd.5b00358', '10.1039/b806616g', '10.1021/ja3063138', '10.1016/j.inoche.2011.02.021', '10.1021/ic8004008', '10.1039/C2DT32674D', '10.1016/j.inoche.2013.07.001', '10.1016/j.inoche.2014.08.012', '10.1016/j.inoche.2010.04.015', '10.1016/j.ica.2015.02.010', '10.1039/c2cc16779d', '10.1002/anie.200900134', '10.1021/cg900659r', '10.1021/ja200978u', '10.1039/c3ce26864k', '

### Benchmarking
Align the results with the ground truth.

In [23]:
ground_truth = pd.read_csv('/home/tom-pruyn/Documents/Project/chain-eunomia/Chain_Eunomia/data/Benchmark_set_2/Ground Truth/Ground Truth DOIs.csv')
ground_truth = ground_truth.drop(columns='Paper Number')

In [24]:
ground_truth

Unnamed: 0,DOI,CSD CODE,Ground Truth MOF Names
0,10.1039/c3ce40346g,DIBWUI,[TbCu(tda)(ina)2(H2O)]?3H2O<|>compound 2
1,,DIBXOD,[DyCu(tda)(ina)2(H2O)]?3H2O<|>compound 4
2,,DIBXAP,[HoCu(tda)(ina)2(H2O)]?3H2O<|>compound 5
3,10.1021/ja200978u,IBASUB,[Sm2Cu3(IDA)6]3 nH2O
4,,IBATEM,[Gd2Cu3(IDA)6]3 nH2O\t
...,...,...,...
103,,ETEJOD,{[Nd2(DHBDC)3(DMF)4](DMF)2}n <|> (2)
104,,ETEJIX,{[La2(DHBDC)3(DMF)4](DMF)2}n <|> (1)
105,10.1039/c3ce26864k,REYBAA,{[Zn2(L1)(pdp)2]4H2O}n <|> (9)
106,,REYCOP,{[Zn2(L1)(bpmp)2]6H2O}n <|> (10)
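In the preview above, the `DOI` column is filled only on the first row of each paper. Assuming the blank rows belong to the DOI above them (an assumption, since the notebook does not state it), a forward fill would give every ground-truth row its own DOI. The frame below is a stand-in built from the first five preview rows.

```python
import pandas as pd

# Stand-in for the first rows of the ground-truth preview
ground_truth = pd.DataFrame({
    "DOI": ["10.1039/c3ce40346g", None, None, "10.1021/ja200978u", None],
    "CSD CODE": ["DIBWUI", "DIBXOD", "DIBXAP", "IBASUB", "IBATEM"],
})

# Assumption: a blank DOI means "same paper as the row above"
ground_truth["DOI"] = ground_truth["DOI"].ffill()
```

With the DOI present on every row, the ground truth can be grouped or filtered per paper.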


In [19]:
results_df = pd.DataFrame(results)
results_df


Unnamed: 0,MOF Name,CSD Ref Code,Justification,Reference
0,[Cu3(TP)4(N3)2(DMF)2]·3C6H12<|>1a,QONKUB,The MOF [Cu3(TP)4(N3)2(DMF)2]·3C6H12 (1a) matc...,10.1021/ic5008457
1,[Cu3(TP)4(N3)2(DMF)2]·2C5H10<|>1b,QONLAI,The MOF [Cu3(TP)4(N3)2(DMF)2]·2C5H10 (1b) matc...,10.1021/ic5008457
2,[Cu3(TP)4(N3)2(DMF)2]·H2O·C10H18<|>1c,QONLEM,The MOF [Cu3(TP)4(N3)2(DMF)2]·H2O·C10H18 (1c) ...,10.1021/ic5008457
3,[Cu3(TP)4(N3)2(DMF)2]·C4H8O2<|>1d,QONLIQ,The MOF [Cu3(TP)4(N3)2(DMF)2]·C4H8O2 (1d) matc...,10.1021/ic5008457
4,[Cu3(TP)4(N3)2]·3C4H8O2<|>1e,QOMRUH,The MOF [Cu3(TP)4(N3)2]·3C4H8O2 (1e) matches t...,10.1021/ic5008457
...,...,...,...,...
260,[Er2Cu3(IDA)6]3 nH2O<|>Er crystal,IBARUA,The MOF [Er2Cu3(IDA)6]3 nH2O is mentioned in t...,10.1021/ja200978u
261,[La2Cu3(IDA)6]3 nH2O<|>La crystal,IBASEL,The MOF [La2Cu3(IDA)6]3 nH2O is described in t...,10.1021/ja200978u
262,[Ho2Cu3(IDA)6]3 nH2O<|>Ho crystal,IBASAH,The MOF [Ho2Cu3(IDA)6]3 nH2O is identified in ...,10.1021/ja200978u
263,[Nd2Cu3(IDA)6]3 nH2O<|>Nd crystal,IBASOV,The MOF [Nd2Cu3(IDA)6]3 nH2O is mentioned in t...,10.1021/ja200978u


In [None]:
# Merge the predictions DataFrame with the ground_truth DataFrame on the CSD refcode

results_merged = pd.DataFrame(results)
results_merged = results_merged.merge(
    ground_truth,
    left_on='CSD Ref Code',
    right_on='CSD CODE',
    how='left'
)

# Reorder columns to place 'Ground Truth MOF Names' beside 'MOF Name'
results_merged = results_merged[['MOF Name', 'Ground Truth MOF Names', 'CSD Ref Code', 'Justification', 'Reference']]

# Save the merged table to CSV (note: this overwrites the matching results in results_file)
results_merged.to_csv(results_file, index=False)


In [27]:
results_merged

Unnamed: 0,MOF Name,Ground Truth MOF Names,CSD Ref Code,Justification,Reference
0,[Cu3(TP)4(N3)2(DMF)2]·3C6H12<|>1a,[Cu3(TP)4(N3)2(DMF)2]·3C6H12 <|> (1a),QONKUB,The MOF [Cu3(TP)4(N3)2(DMF)2]·3C6H12 (1a) matc...,10.1021/ic5008457
167,{[Zn(btz)]·DMF·0.5H2O}<|>1<|>1a<|>1b,{[Zn(btz)]·DMF·0.5H2O}n <|> 1,YEZFIU,The MOF {[Zn(btz)]·DMF·0.5H2O} is referred to ...,10.1021/ja3063138
168,{Cd(HBTC)(H2O)1/2(H2O)2}n<|>compound 2,{[Cd(HBTC)(H2O)](p-bix)1/2(H2O)2}n (2),TOKDON,"The MOF {Cd(HBTC)(H2O)1/2(H2O)2}n, referred to...",10.1016/j.molstruc.2008.06.007
169,[Cu2(OH)(TZI)(H2O)2]n 3nH2O<|>compound 1,Cu2(OH)(TZI)(H2O)2]n 3nH2O<|>compound 1,FURSES,"The MOF [Cu2(OH)(TZI)(H2O)2]n 3nH2O, referred ...",10.1039/C5RA15937G
170,[Co2(OH)(TZI)(H2O)2]n 4nH2O<|>compound 2,,FUSGOR,"The MOF [Co2(OH)(TZI)(H2O)2]n 4nH2O, referred ...",10.1039/C5RA15937G
...,...,...,...,...,...
94,CPM-42<|>[Li2(OPy)2(diox)]·(diox),[Li2(OPy)2(diox)]·(diox) <|> CPM-42,BUKYAJ,CPM-42 has a molecular formula of [Li2(OPy)2(d...,10.1021/acs.cgd.5b00358
95,{La2(DHBDC)3(DMF)4}n<|>Compound 1,{[La2(DHBDC)3(DMF)4](DMF)2}n <|> (1),ETEJIX,The MOF {La2(DHBDC)3(DMF)4}n matches the CSD c...,10.1039/c1ce05069a
96,{Nd2(DHBDC)3(DMF)4}n<|>Compound 2,{[Nd2(DHBDC)3(DMF)4](DMF)2}n <|> (2),ETEJOD,The MOF {Nd2(DHBDC)3(DMF)4}n matches the CSD c...,10.1039/c1ce05069a
82,Mn-PAA-1<|>Mn-PAA-1,,not provided,The MOF Mn-PAA-1 has a monoclinic crystal syst...,10.1021/acs.inorgchem.6b00758
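The merged table contains rows with an empty `Ground Truth MOF Names` cell and one row whose refcode is `not provided`. A rough first check on the alignment is to compare the sets of refcodes on each side. The sketch below uses hypothetical placeholder refcodes and is not the benchmark metric used for this project. It only measures refcode coverage and does not check whether the MOF names agree.

```python
import pandas as pd

# Hypothetical miniature tables; refcodes are placeholders
ground_truth = pd.DataFrame({"CSD CODE": ["AAAAAA", "BBBBBB", "CCCCCC"]})
predictions = pd.DataFrame(
    {"CSD Ref Code": ["AAAAAA", "BBBBBB", "not provided", "DDDDDD"]}
)

truth_codes = set(ground_truth["CSD CODE"])
pred_codes = set(predictions["CSD Ref Code"])

matched = truth_codes & pred_codes   # ground-truth refcodes the agent returned
missed = truth_codes - pred_codes    # ground-truth refcodes with no prediction
extra = pred_codes - truth_codes     # predictions absent from the ground truth

# Fraction of ground-truth refcodes covered by the predictions
recall = len(matched) / len(truth_codes)
```

The `extra` set also catches placeholder values such as `not provided`, which are worth reviewing by hand.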


In [34]:
unique_doi_count = results_merged['Reference'].nunique()

print(f"Number of unique DOIs: {unique_doi_count}")

Number of unique DOIs: 50
