# MOF ChemUnity Matching 

This notebook uses our tools to match CSD reference codes (refcodes) to the MOF names and co-references found in their synthesis papers.

### Preparation of CSD Data

First, the CSD data must be prepared for injection into the prompt. For each DOI we wish to process, we gather the relevant information for every associated CSD code.

Over 20,000 DOIs have been selected for text mining. We chose MOFs that:
- Are found in the CSD
- Are also found in either the QMOF or CoRE databases

This way, every MOF in our database already has relevant computational properties calculated (available in QMOF or CoRE), and these properties can easily be added to our database at the end.
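The selection rule above amounts to a simple set filter. The sketch below is illustrative only: the refcodes, DOIs, and variable names are made-up placeholders, not the real database contents.

```python
import pandas as pd

# Hypothetical CSD entries (placeholder refcodes and DOIs)
csd = pd.DataFrame({
    "CSD code": ["AAAAAA", "BBBBBB", "CCCCCC", "DDDDDD"],
    "DOI": ["10.0000/a", "10.0000/b", "10.0000/c", "10.0000/c"],
})
# Hypothetical refcode sets for the two computational databases
qmof_codes = {"AAAAAA", "CCCCCC"}
core_codes = {"CCCCCC", "DDDDDD", "EEEEEE"}

# Keep CSD entries that also appear in QMOF or CoRE
in_computed_db = csd["CSD code"].isin(qmof_codes | core_codes)
selected = csd[in_computed_db]

# DOIs to text-mine are those with at least one selected structure
selected_dois = sorted(selected["DOI"].unique())
print(selected_dois)
```

Here `BBBBBB` is dropped because it appears in neither database, so its DOI is not selected.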

In [1]:
# Imports
import pandas as pd
import glob
import os

from src.MOF_ChemUnity.utils.DataPrep import Data_Prep
from src.MOF_ChemUnity.Agents.MatchingAgent import MatchingAgent

In [2]:
# Define path to the folder containing all papers to be text mined
paper_folder_path = '/home/tom-pruyn/Documents/TDM Papers/blah'

# Define path to file containing all CSD info extracted from CSD API
csd_info_path = '/home/tom-pruyn/Documents/Project/chain-eunomia/Chain_Eunomia/data/Benchmark_set_2/Ground Truth/CSD_Info.csv'

In [3]:
# List of columns we want to take from our master CSD file and put into our prompt
feature_list = [
    "CSD code", 
    "DOI",
    "Chemical Name",
    "Space group", 
    "Metal types",
    "Molecular formula",
    "Synonyms",
    "a",
    "b",
    "c"
]

In [4]:
# Initialize the Data_Prep class
Prepare_Data = Data_Prep(paper_folder_path, csd_info_path, feature_list)

In [5]:
publication_data = Prepare_Data.gather_info()

In [6]:
publication_data.head()

| | DOI | File Name | File Format | File Path | Journal | CSD code | Chemical Name | Space group | Metal types | Molecular formula | Synonyms | a | b | c |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 10.1021/ic5008457 | 10.1021_ic5008457.md | md | /home/tom-pruyn/Documents/TDM Papers/blah/21/1... | Journal(Inorganic Chemistry) | QONLAI | catena-(tetrakis(μ3-5-(4-Pyridyl)tetrazolato)-... | P21/n | Cu | C48Cu6H32N52 | [] | 13.111 | 14.791 | 15.671 |
| 1 | 10.1021/ic5008457 | 10.1021_ic5008457.md | md | /home/tom-pruyn/Documents/TDM Papers/blah/21/1... | Journal(Inorganic Chemistry) | QOMSES | catena-(tetrakis(μ3-5-(4-Pyridyl)tetrazolato)-... | P21/n | Cu | C48Cu6H32N52 | [] | 13.09 | 14.744 | 15.575 |
| 2 | 10.1021/ic5008457 | 10.1021_ic5008457.md | md | /home/tom-pruyn/Documents/TDM Papers/blah/21/1... | Journal(Inorganic Chemistry) | QONKUB | catena-(tetrakis(μ3-5-(4-Pyridyl)tetrazolato)-... | P21/n | Cu | C48Cu6H32N52 | [] | 13.261 | 15.134 | 15.547 |
| 3 | 10.1021/ic5008457 | 10.1021_ic5008457.md | md | /home/tom-pruyn/Documents/TDM Papers/blah/21/1... | Journal(Inorganic Chemistry) | QONLIQ | catena-(tetrakis(μ3-5-(4-Pyridyl)tetrazolato)-... | P21/n | Cu | C48Cu6H32N52 | [] | 13.299 | 14.71 | 15.689 |
| 4 | 10.1021/ic5008457 | 10.1021_ic5008457.md | md | /home/tom-pruyn/Documents/TDM Papers/blah/21/1... | Journal(Inorganic Chemistry) | QONLEM | catena-(tetrakis(μ3-5-(4-Pyridyl)tetrazolato)-... | P21/n | Cu | C48Cu6H32N52 | [] | 12.825 | 14.903 | 15.963 |


### Running The Prediction Loop

In [7]:
# Instantiate Agent
agent = MatchingAgent()

In [8]:
# Build a dictionary keyed by CSD code from the rows of the publication info DataFrame
def csd_dict(csd_data): 
    return {
        i["CSD code"]: {
            'Space Group': i["Space group"],
            'Metal Nodes': i["Metal types"],
            'Chemical Name': i["Chemical Name"],
            'a': i["a"],
            'b': i["b"],
            'c': i["c"],
            'Molecular Formula': i["Molecular formula"],
            'Synonyms': i["Synonyms"]
        } 
        for _, i in csd_data.iterrows()
    }
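For comparison, the same refcode-keyed dictionary can be built without an explicit row loop. This is only a sketch of an alternative, not the notebook's method; the toy frame uses a subset of the columns, with values copied from the preview above.

```python
import pandas as pd

# Toy rows shaped like publication_data (subset of columns)
toy = pd.DataFrame({
    "CSD code": ["QONLAI", "QOMSES"],
    "Space group": ["P21/n", "P21/n"],
    "Metal types": ["Cu", "Cu"],
    "a": [13.111, 13.09],
})

# Map source column names to the keys used in the prompt dictionary
rename = {"Space group": "Space Group", "Metal types": "Metal Nodes"}

# One dictionary entry per refcode, with the remaining columns as fields
by_refcode = (
    toy.rename(columns=rename)
       .set_index("CSD code")
       .to_dict(orient="index")
)
print(by_refcode["QONLAI"])
```

One behavioural difference is worth noting: in the dictionary comprehension, a repeated CSD code silently overwrites the earlier entry, whereas `to_dict(orient="index")` expects unique index values and, in recent pandas versions, raises an error on duplicates.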

#### Batch Process


**To do:** this code needs to be changed so that papers are compared by DOI instead of by file name, and so that "reference" holds the DOI rather than the file name.

Results that have already been generated will also need to be reprocessed to change their reference from the file name to the DOI.
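One way to reprocess older results is to look up each file name in the prepared publication data, which already pairs `File Name` with `DOI`. The sketch below is hedged: the legacy column name `Reference`, the assumption that it holds the full file name, and the toy frames `pub_toy` and `old_results` are all illustrative, not taken from the real results.

```python
import pandas as pd

# Toy stand-in for publication_data: one row per CSD code, so file names repeat
pub_toy = pd.DataFrame({
    "DOI": ["10.1021/ic5008457", "10.1021/ic5008457"],
    "File Name": ["10.1021_ic5008457.md", "10.1021_ic5008457.md"],
})
# Hypothetical legacy results whose 'Reference' column holds file names
old_results = pd.DataFrame({
    "MOF Name": ["placeholder 1", "placeholder 2"],
    "Reference": ["10.1021_ic5008457.md", "unknown_file.md"],
})

# Build a File Name -> DOI lookup (drop repeats so the index is unique)
name_to_doi = (
    pub_toy.drop_duplicates("File Name")
           .set_index("File Name")["DOI"]
)

# Map each legacy reference to its DOI; unmatched names become NaN
old_results["DOI"] = old_results["Reference"].map(name_to_doi)
unmatched = old_results[old_results["DOI"].isna()]
print(f"{len(unmatched)} reference(s) could not be mapped")
```

Using a lookup avoids guessing the DOI from the file name pattern, and the `unmatched` frame lists any rows that still need manual attention.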


In [12]:
# Dynamically create the results_file path
results_dir = os.path.join(paper_folder_path, 'results')
results_file = os.path.join(results_dir, 'matching.csv')
sub_batch_size = 250

# Ensure the results directory exists
os.makedirs(results_dir, exist_ok=True)

# Collect the unique DOIs from publication_data (built in the preparation cells above)
unique_DOIs = publication_data["DOI"].unique()

# Resume from an existing results file, if there is one
if os.path.exists(results_file):
    print(f"Found existing results at {results_file}; resuming.")
    existing_results = pd.read_csv(results_file)
    processed_DOIs = set(existing_results["DOI"])  # DOIs already processed, taken from the 'DOI' column
    results = existing_results.to_dict(orient="list")
    print(processed_DOIs)
else:
    processed_DOIs = set()
    results = {"MOF Name": [], "CSD Ref Code": [], "Justification": [], "DOI": []}

# Filter remaining DOIs to process
remaining_DOIs = [DOI for DOI in unique_DOIs if DOI not in processed_DOIs]
print(f"Total DOIs to process: {len(remaining_DOIs)}")
print(remaining_DOIs)

# Process sub-batch
for DOI in remaining_DOIs[:sub_batch_size]:

    # Select the CSD entries associated with this DOI
    filtered_csd_data = publication_data[publication_data["DOI"] == DOI]

    print(f"Processing DOI: {DOI}")

    # Check for an empty selection first: .iloc[0] raises IndexError on an empty frame
    if filtered_csd_data.empty:
        print(f"No data found for DOI {DOI}")
        continue

    # Every row for a DOI points to the same paper, so read the path from the first row
    file_path = filtered_csd_data.iloc[0]['File Path']
    print(f"File Path: {file_path}")
    print("-" * 28)

    # Build the CSD dictionary for the prompt
    csd = csd_dict(filtered_csd_data)
    print(csd)

    # Get the Matching Agent response
    mofs, docs = agent.agent_response(csd, file_path, ret_docs=True)

    # Collect results
    for mof in mofs.mofs:
        results["MOF Name"].append(mof.name)
        results["CSD Ref Code"].append(mof.refcode)
        results["Justification"].append(mof.justification)
        results["DOI"].append(DOI)

# Save updated results to CSV
pd.DataFrame(results).to_csv(results_file, index=False)
print(f"Results saved to {results_file}.")
print(f"Processed {len(remaining_DOIs[:sub_batch_size])} DOIs in this batch.")

Total DOIs to process: 3
['10.1021/ic5008457', '10.1021/acs.chemmater.5b03792', '10.1039/C7CE00481H']
Processing DOI: 10.1021/ic5008457
File Path: /home/tom-pruyn/Documents/TDM Papers/blah/21/10.1021_ic5008457.md
----------------------------
{'QONLAI': {'Space Group': 'P21/n', 'Metal Nodes': 'Cu', 'Chemical Name': 'catena-(tetrakis(μ3-5-(4-Pyridyl)tetrazolato)-bis(μ2-azido)-bis(dimethylformamide)-tri-copper cyclopentane solvate)', 'a': 13.111, 'b': 14.791, 'c': 15.671, 'Molecular Formula': 'C48Cu6H32N52', 'Synonyms': '[]'}, 'QOMSES': {'Space Group': 'P21/n', 'Metal Nodes': 'Cu', 'Chemical Name': 'catena-(tetrakis(μ3-5-(4-Pyridyl)tetrazolato)-bis(μ2-azido)-bis(dimethylformamide)-tri-copper)', 'a': 13.09, 'b': 14.744000000000002, 'c': 15.575, 'Molecular Formula': 'C48Cu6H32N52', 'Synonyms': '[]'}, 'QONKUB': {'Space Group': 'P21/n', 'Metal Nodes': 'Cu', 'Chemical Name': 'catena-(tetrakis(μ3-5-(4-Pyridyl)tetrazolato)-bis(μ2-azido)-bis(dimethylformamide)-tri-copper cyclohexane solvate)', 'a
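The loop above writes the CSV once, after the whole sub-batch has finished, so an interruption partway through loses that batch's results. A minimal sketch of a per-DOI checkpoint follows; `fake_match`, `load_processed`, and `append_rows` are hypothetical helpers, and `fake_match` merely stands in for the agent call.

```python
import os
import tempfile
import pandas as pd

COLUMNS = ["MOF Name", "CSD Ref Code", "Justification", "DOI"]

def load_processed(results_file):
    """Return the set of DOIs already present in the results file."""
    if not os.path.exists(results_file):
        return set()
    return set(pd.read_csv(results_file)["DOI"])

def append_rows(results_file, rows):
    """Append rows to the CSV, writing the header only for a new file."""
    write_header = not os.path.exists(results_file)
    pd.DataFrame(rows, columns=COLUMNS).to_csv(
        results_file, mode="a", header=write_header, index=False
    )

def fake_match(doi):
    """Hypothetical stand-in for the agent call; returns one row per DOI."""
    return [{"MOF Name": "placeholder", "CSD Ref Code": "XXXXXX",
             "Justification": "stub", "DOI": doi}]

results_file = os.path.join(tempfile.mkdtemp(), "matching.csv")
dois = ["10.0000/a", "10.0000/b", "10.0000/c"]

# The first pass handles two DOIs; the second pass resumes with the remainder
for batch_size in (2, 2):
    done = load_processed(results_file)
    remaining = [d for d in dois if d not in done]
    for doi in remaining[:batch_size]:
        append_rows(results_file, fake_match(doi))  # checkpoint after each DOI

print(sorted(load_processed(results_file)))
```

Note that a DOI for which no MOFs are returned leaves no row in the file, so it would be retried on the next run. The same holds for the loop above; recording such DOIs separately would avoid repeated agent calls.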

In [13]:
# `docs` holds the LangChain Document objects returned for the last DOI processed above
for i, doc in enumerate(docs):
    print(f"Document {i + 1}:")
    print(doc.page_content)
    print("-" * 40)  # Separator for readability


Document 1:
units showing the connecting BDC ligands (represented by broken-off bonds), (b) connectivity of four dimer units showing the void space occupied by coordinated DMF molecules and (c) overall view of the structure showing the two interpenetrated three-dimensional networks in red and green. In (a) and (b) the atomic labelling scheme is as for Fig. 1, with hydrogen atoms omitted for clarity, and only the predominant orientation of the DMF molecule shown and terminal oxygens belong to water molecules.

centres, (b) view along a, and (c) view along b. Atomic labelling scheme as for Fig. 1 with hydrogen atoms omitted for clarity.

Results And Discussion

[Yb2IJBDC)3IJDMF)2]·H2O (1) crystallises from a DMF-rich solvent and we previously reported its synthesis and structure in a preliminary communication.17 The material contains a single crystallographic ytterbium coordinated to seven oxygens: six from singly coordinating 1,4-benzenedicarboxylates and one from an O-coordinated DMF, 

### Benchmarking

In [None]:
ground_truth = pd.read_csv('/home/tom-pruyn/Documents/Project/chain-eunomia/Chain_Eunomia/data/Benchmark_set_2/Ground Truth/Ground Truth DOIs.csv')
ground_truth = ground_truth.drop(columns='Paper Number')

results_df = pd.DataFrame(results)

In [None]:
# Merge the predictions with the ground truth on the CSD reference code.
# A left join keeps every prediction, including those without a ground-truth entry.
results_merged = results_df.merge(
    ground_truth,
    left_on='CSD Ref Code',
    right_on='CSD CODE',
    how='left'
)

# Reorder columns to place 'Ground Truth MOF Names' beside 'MOF Name'
#results_merged = results_merged[['MOF Name', 'Ground Truth MOF Names', 'CSD Ref Code', 'Justification', 'DOI']]

# Legacy sorting, applicable only to older results that carry a 'Reference' column:
# extract the numeric part of 'Reference' and sort by it
#results_merged['Reference Number'] = results_merged['Reference'].str.extract(r'(\d+)').astype(float)
#results_merged = results_merged.sort_values(by='Reference Number').drop(columns='Reference Number')

# Save the merged results to a separate CSV (file name chosen here for illustration).
# Writing them to results_file would add ground-truth columns to matching.csv
# and break the resume logic of the batch cell on its next run.
benchmark_file = os.path.join(results_dir, 'matching_benchmark.csv')
results_merged.to_csv(benchmark_file, index=False)
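Once predictions and ground truth are merged, a simple accuracy figure can be computed. The sketch below is only one possible scoring scheme: it assumes the ground-truth file has a `Ground Truth MOF Names` column (the name appears in the commented-out code above), uses made-up placeholder names, and treats names as equal after lowercasing and removing spaces and hyphens.

```python
import pandas as pd

# Toy predictions and ground truth (placeholder values, not real results)
predictions = pd.DataFrame({
    "MOF Name": ["Compound 1", "compound-2", "MOF-X"],
    "CSD Ref Code": ["AAAAAA", "BBBBBB", "CCCCCC"],
})
truth = pd.DataFrame({
    "CSD CODE": ["AAAAAA", "BBBBBB", "DDDDDD"],
    "Ground Truth MOF Names": ["compound 1", "compound 3", "MOF-Y"],
})

# indicator=True adds a '_merge' column showing which rows found a match
merged = predictions.merge(
    truth, left_on="CSD Ref Code", right_on="CSD CODE",
    how="left", indicator=True,
)

def normalize(name):
    """Lowercase and drop spaces/hyphens so trivial variants compare equal."""
    return str(name).lower().replace(" ", "").replace("-", "")

# Score only predictions whose refcode has a ground-truth entry
has_truth = merged["_merge"] == "both"
scored = merged[has_truth]
correct = (
    scored["MOF Name"].map(normalize)
    == scored["Ground Truth MOF Names"].map(normalize)
)

accuracy = correct.mean()
print(f"Scored {len(scored)} of {len(merged)} predictions; accuracy = {accuracy:.2f}")
```

Predictions without a ground-truth entry are excluded from the denominator here; counting them as errors instead is an equally defensible choice and would lower the reported figure.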