# MOF ChemUnity Matching 

This notebook uses our matching tools to pair CSD Ref Codes with the MOF names and co-references used in their synthesis papers.

### Preparation of CSD Data

First, the CSD data must be prepared for injection into the prompt. For each DOI we process, we gather the relevant information for every associated CSD code.

Over 20 000 DOIs have been selected for text mining. We chose MOFs that:
- Are found in CSD 
- Are also found in either QMOF or CoRE Databases

This way, every MOF in our database already has computed properties available (from QMOF or CoRE), and these can be added to our database at the end.
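The selection rule above amounts to a set operation on refcodes. Here is a minimal sketch, assuming each database can be reduced to a set of CSD refcodes. The function name `select_mofs` and all refcode values are placeholders for illustration and are not taken from the real datasets.

```python
# Placeholder refcode sets; in practice these would come from the CSD, QMOF and CoRE exports.
csd_refcodes = {"AAAAAA", "BBBBBB", "CCCCCC"}
qmof_refcodes = {"AAAAAA", "DDDDDD"}
core_refcodes = {"BBBBBB"}


def select_mofs(csd, qmof, core):
    """Keep refcodes found in the CSD and in at least one of QMOF or CoRE."""
    return csd & (qmof | core)


selected = select_mofs(csd_refcodes, qmof_refcodes, core_refcodes)
print(sorted(selected))
```

"CCCCCC" is dropped because it appears only in the CSD set, and "DDDDDD" is dropped because it is missing from the CSD set.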

In [1]:
# Imports
import pandas as pd
import glob
import os

from src.MOF_ChemUnity.utils.DataPrep import Data_Prep
from src.MOF_ChemUnity.Agents.MatchingAgent import MatchingAgent

In [None]:
# Define path to folder containing all papers to be text mined from
paper_folder_path = '/home/tom-pruyn/Documents/TDM Papers/Processing Batches-PDF/Batch 2/md'

# Define path to file containing all CSD info extracted from CSD API
csd_info_path = '/home/tom-pruyn/Documents/Project/chain-eunomia/Chain_Eunomia/data/Benchmark_set_2/Ground Truth/CSD_Info.csv'

In [3]:
# List of columns we want to take from our master CSD file and put into our prompt
feature_list = [
    "CSD code", 
    "DOI",
    "Chemical Name",
    "Space group", 
    "Metal types",
    "Molecular formula",
    "Synonyms",
    "a",
    "b",
    "c"
]

In [4]:
# Initialize Data_Prep class
Prepare_Data = Data_Prep(paper_folder_path, csd_info_path, feature_list)

In [5]:
publication_data = Prepare_Data.gather_info()

In [6]:
publication_data.head()

| Unnamed: 0 | DOI | File Name | File Format | File Path | Journal | CSD code | Chemical Name | Space group | Metal types | Molecular formula | Synonyms | a | b | c |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 10.1021/ic5008457 | 10.1021_ic5008457.md | md | /home/tom-pruyn/Documents/TDM Papers/benchmark... | Journal(Inorganic Chemistry) | QONLAI | catena-(tetrakis(μ3-5-(4-Pyridyl)tetrazolato)-... | P21/n | Cu | C48Cu6H32N52 | [] | 13.111 | 14.791 | 15.671 |
| 1 | 10.1021/ic5008457 | 10.1021_ic5008457.md | md | /home/tom-pruyn/Documents/TDM Papers/benchmark... | Journal(Inorganic Chemistry) | QOMSES | catena-(tetrakis(μ3-5-(4-Pyridyl)tetrazolato)-... | P21/n | Cu | C48Cu6H32N52 | [] | 13.09 | 14.744 | 15.575 |
| 2 | 10.1021/ic5008457 | 10.1021_ic5008457.md | md | /home/tom-pruyn/Documents/TDM Papers/benchmark... | Journal(Inorganic Chemistry) | QONKUB | catena-(tetrakis(μ3-5-(4-Pyridyl)tetrazolato)-... | P21/n | Cu | C48Cu6H32N52 | [] | 13.261 | 15.134 | 15.547 |
| 3 | 10.1021/ic5008457 | 10.1021_ic5008457.md | md | /home/tom-pruyn/Documents/TDM Papers/benchmark... | Journal(Inorganic Chemistry) | QONLIQ | catena-(tetrakis(μ3-5-(4-Pyridyl)tetrazolato)-... | P21/n | Cu | C48Cu6H32N52 | [] | 13.299 | 14.71 | 15.689 |
| 4 | 10.1021/ic5008457 | 10.1021_ic5008457.md | md | /home/tom-pruyn/Documents/TDM Papers/benchmark... | Journal(Inorganic Chemistry) | QONLEM | catena-(tetrakis(μ3-5-(4-Pyridyl)tetrazolato)-... | P21/n | Cu | C48Cu6H32N52 | [] | 12.825 | 14.903 | 15.963 |


### Running The Prediction Loop

In [7]:
# Instantiate Agent
agent = MatchingAgent()

In [8]:
# This function is used to gather info from the publication info dataframe and put it into a dictionary 
def csd_dict(csd_data): 
    return {
        i["CSD code"]: {
            'Space Group': i["Space group"],
            'Metal Nodes': i["Metal types"],
            'Chemical Name': i["Chemical Name"],
            'a': i["a"],
            'b': i["b"],
            'c': i["c"],
            'Molecular Formula': i["Molecular formula"],
            'Synonyms': i["Synonyms"]
        } 
        for _, i in csd_data.iterrows()
    }
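To make the shape of the resulting dictionary concrete, here is a small usage sketch. It repeats `csd_dict` so that it runs on its own, and builds a two-row frame from values shown in the `head()` output above. The `Chemical Name` value is a shortened placeholder, because the full name is truncated in that output.

```python
import pandas as pd


def csd_dict(csd_data):
    # Same mapping as the cell above, repeated so this example is self-contained
    return {
        row["CSD code"]: {
            "Space Group": row["Space group"],
            "Metal Nodes": row["Metal types"],
            "Chemical Name": row["Chemical Name"],
            "a": row["a"],
            "b": row["b"],
            "c": row["c"],
            "Molecular Formula": row["Molecular formula"],
            "Synonyms": row["Synonyms"],
        }
        for _, row in csd_data.iterrows()
    }


# Shared fields for both example rows; "placeholder name" is not a real chemical name
common = {
    "Space group": "P21/n",
    "Metal types": "Cu",
    "Chemical Name": "placeholder name",
    "Molecular formula": "C48Cu6H32N52",
    "Synonyms": "[]",
}
example = pd.DataFrame([
    {"CSD code": "QONLAI", "a": 13.111, "b": 14.791, "c": 15.671, **common},
    {"CSD code": "QOMSES", "a": 13.09, "b": 14.744, "c": 15.575, **common},
])

prompt_data = csd_dict(example)
print(list(prompt_data))
print(prompt_data["QONLAI"]["Space Group"])
```

Because the result is keyed by CSD code, a frame with repeated codes would keep only the last row for each code.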

#### Batch Process

In [9]:
publication_data.head()

| Unnamed: 0 | DOI | File Name | File Format | File Path | Journal | CSD code | Chemical Name | Space group | Metal types | Molecular formula | Synonyms | a | b | c |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 10.1021/ic5008457 | 10.1021_ic5008457.md | md | /home/tom-pruyn/Documents/TDM Papers/benchmark... | Journal(Inorganic Chemistry) | QONLAI | catena-(tetrakis(μ3-5-(4-Pyridyl)tetrazolato)-... | P21/n | Cu | C48Cu6H32N52 | [] | 13.111 | 14.791 | 15.671 |
| 1 | 10.1021/ic5008457 | 10.1021_ic5008457.md | md | /home/tom-pruyn/Documents/TDM Papers/benchmark... | Journal(Inorganic Chemistry) | QOMSES | catena-(tetrakis(μ3-5-(4-Pyridyl)tetrazolato)-... | P21/n | Cu | C48Cu6H32N52 | [] | 13.09 | 14.744 | 15.575 |
| 2 | 10.1021/ic5008457 | 10.1021_ic5008457.md | md | /home/tom-pruyn/Documents/TDM Papers/benchmark... | Journal(Inorganic Chemistry) | QONKUB | catena-(tetrakis(μ3-5-(4-Pyridyl)tetrazolato)-... | P21/n | Cu | C48Cu6H32N52 | [] | 13.261 | 15.134 | 15.547 |
| 3 | 10.1021/ic5008457 | 10.1021_ic5008457.md | md | /home/tom-pruyn/Documents/TDM Papers/benchmark... | Journal(Inorganic Chemistry) | QONLIQ | catena-(tetrakis(μ3-5-(4-Pyridyl)tetrazolato)-... | P21/n | Cu | C48Cu6H32N52 | [] | 13.299 | 14.71 | 15.689 |
| 4 | 10.1021/ic5008457 | 10.1021_ic5008457.md | md | /home/tom-pruyn/Documents/TDM Papers/benchmark... | Journal(Inorganic Chemistry) | QONLEM | catena-(tetrakis(μ3-5-(4-Pyridyl)tetrazolato)-... | P21/n | Cu | C48Cu6H32N52 | [] | 12.825 | 14.903 | 15.963 |


TODO: Change this code so that the DOI, rather than the file name, is used for comparison, and so that the "reference" field holds the DOI and not the file name.

Existing results will also need to be reprocessed to replace the file name in the reference field with the DOI.
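One way to do that reprocessing is to recover the DOI from the file name. The sketch below assumes the file name is the DOI with its first "/" replaced by "_", followed by an extension, as in `10.1021_ic5008457.md` for `10.1021/ic5008457`. The helper `doi_from_file_name` is hypothetical and is not part of `src.MOF_ChemUnity`.

```python
import os


def doi_from_file_name(file_name):
    """Recover a DOI from a file name such as '10.1021_ic5008457.md'.

    Assumption: the name is '<prefix>_<suffix>.<ext>', where the DOI is
    '<prefix>/<suffix>'. Only the first underscore is turned back into '/'.
    """
    stem, _ext = os.path.splitext(os.path.basename(file_name))
    prefix, sep, suffix = stem.partition("_")
    if not sep or not suffix:
        raise ValueError(f"Cannot derive a DOI from {file_name!r}")
    return f"{prefix}/{suffix}"


print(doi_from_file_name("10.1021_ic5008457.md"))
print(doi_from_file_name("md/10.1016_j.inoche.2014.08.012.md"))
```

A DOI whose suffix itself contains "/" would not round-trip under this assumption, so looking the DOI up in `publication_data` by `File Name` is the safer route when that table is available.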


In [10]:
# Dynamically create the results_file path
results_dir = os.path.join(paper_folder_path, 'results')
results_file = os.path.join(results_dir, 'matching.csv')
sub_batch_size = 250

# Ensure the results directory exists
os.makedirs(results_dir, exist_ok=True)

# Collect the unique DOIs (publication_data comes from the data preparation cells above)
unique_DOIs = publication_data["DOI"].unique()

# Prepare results DataFrame if results file exists
if os.path.exists(results_file):
    existing_results = pd.read_csv(results_file)
    processed_DOIs = set(existing_results["DOI"])  # Extract processed DOIs based on the 'DOI' column
    results = existing_results.to_dict(orient="list")
    print(processed_DOIs)
else:
    processed_DOIs = set()
    results = {"MOF Name": [], "CSD Ref Code": [], "Justification": [], "DOI": []}

# Filter remaining DOIs to process
remaining_DOIs = [DOI for DOI in unique_DOIs if DOI not in processed_DOIs]
print(f"Total DOIs to process: {len(remaining_DOIs)}")
print(remaining_DOIs)

# Process sub-batch
for DOI in remaining_DOIs[:sub_batch_size]:

    # Filter csd_data based on DOI
    filtered_csd_data = publication_data[publication_data["DOI"] == DOI]

    # Check for an empty frame before indexing: iloc[0] raises IndexError on no rows
    if filtered_csd_data.empty:
        print(f"No data found for DOI {DOI}")
        continue

    # Get the associated file path
    file_path = filtered_csd_data.iloc[0]['File Path']

    print(f"Processing DOI: {DOI}")
    print(f"File Path: {file_path}")
    print("-" * 28)

    # Call the csd_dict function with the filtered data
    csd = csd_dict(filtered_csd_data)
    print(csd)

    # Get Matching Agent Response
    mofs, docs = agent.agent_response(csd, file_path, ret_docs=True)

    # Collect results
    for mof in mofs.mofs:
        results["MOF Name"].append(mof.name)
        results["CSD Ref Code"].append(mof.refcode)
        results["Justification"].append(mof.justification)
        results["DOI"].append(DOI)

# Save updated results to CSV
pd.DataFrame(results).to_csv(results_file, index=False)
print(f"Results saved to {results_file}.")
print(f"Processed {len(remaining_DOIs[:sub_batch_size])} DOIs in this batch.")

Total DOIs to process: 50
['10.1021/ic5008457', '10.1016/j.inoche.2014.08.012', '10.1021/cg701041k', '10.1021/acs.chemmater.5b03792', '10.1002/ejic.201601242', '10.1016/j.inoche.2008.04.004', '10.1016/j.inoche.2007.12.038', '10.1039/c2sc20443f', '10.1039/C5CE00833F', '10.1021/ja2078637', '10.1039/C7CE00481H', '10.1039/C5RA06273J', '10.1021/ja3063138', '10.1016/j.molstruc.2008.06.007', '10.1039/C5RA15937G', '10.1021/ic8004008', '10.1039/c2dt31361h', '10.1039/c2ce25715g', '10.1016/j.inoche.2016.06.032', '10.1021/cg9005696', '10.1039/c3ce26864k', '10.1021/cg701164u', '10.1039/c001537g', '10.7503/cjcu20150259', '10.1039/c3ce40346g', '10.1039/C2DT32674D', '10.1016/j.molstruc.2014.12.014', '10.1080/15533170802440755', '10.1016/j.solidstatesciences.2011.12.002', '10.1021/cg900659r', '10.1021/acs.inorgchem.6b00758', '10.1002/chem.201405395', '10.1016/j.inoche.2013.07.001', '10.1016/j.inoche.2010.04.015', '10.1002/anie.201503636', '10.1021/acs.cgd.5b00358', '10.1039/c1ce05069a', '10.1016/j.inoc
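The cell above writes the CSV once, after the whole sub-batch, so an exception partway through loses that batch's results. As a possible alternative (not what the notebook currently does), the sketch below appends rows after every DOI. `process_with_checkpoints` and `fake_match` are hypothetical names; `fake_match` stands in for `agent.agent_response` and returns made-up rows.

```python
import os
import tempfile

import pandas as pd

COLUMNS = ["MOF Name", "CSD Ref Code", "Justification", "DOI"]


def process_with_checkpoints(dois, match_fn, results_file):
    """Run match_fn on each unprocessed DOI and append its rows to results_file."""
    done = set()
    if os.path.exists(results_file):
        done = set(pd.read_csv(results_file)["DOI"])
    for doi in dois:
        if doi in done:
            continue
        rows = [dict(row, DOI=doi) for row in match_fn(doi)]
        # Append mode; write the header only when the file is first created
        pd.DataFrame(rows, columns=COLUMNS).to_csv(
            results_file,
            mode="a",
            index=False,
            header=not os.path.exists(results_file),
        )
        done.add(doi)
    return done


def fake_match(doi):
    # Stand-in for the matching agent: one placeholder match per DOI
    return [{"MOF Name": "compound 1", "CSD Ref Code": "XXXXXX", "Justification": "stub"}]


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "matching.csv")
    process_with_checkpoints(["10.1021/ic5008457"], fake_match, path)
    # Second run: the first DOI is skipped, only the new one is processed
    process_with_checkpoints(["10.1021/ic5008457", "10.1021/cg701041k"], fake_match, path)
    saved = pd.read_csv(path)
    print(saved["DOI"].tolist())
```

As in the original loop, a DOI that yields no matches leaves no row in the file and would therefore be retried on the next run.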

### Print Documents

In [None]:
"""# Assuming `docs` is your list of LangChain Document objects
for i, doc in enumerate(docs):
    print(f"Document {i + 1}:")
    print(doc.page_content)
    print("-" * 40)  # Separator for readability"""

Document 1:
In summary, we synthesized a robust, large microporous coordination polymer with a Kagome´ type structure consisting of a carboxylate-amine ligand and a Cu2+ paddle-wheel cluster. This motif could be applied to create a crystalline guest-accessible space with intermediate region of micro- and mesoscale chemistry.

Fig. 2 Thermogravimetric analysis of 1 over the temperature range

from 300 to 723 K at a heating rate of b = 5 K min1.

Fig. 4 Adsorption isotherms of 1 for (a) N2 (77 K), (b) CO2 (195 K),

(c) MeOH (298 K) and (d) MeCN (298 K).

Fig. 5 (a) BET plot and (b) DFT/Monte-Carlo differential pore volume distribution of 1, which were calculated from N2 adsorption at 77 K. We thank Prof. Yoshiki Kubota for measurement of XRPD at BL02B2 line at SPring-8, Hyogo, Japan. This work was supported by ERATO, JST and a Grant-in-Aid for Scientific Research in a Priority Area ''Chemistry of Coordination Space'' (#434) from the Ministry of Education, Culture, Sports, Science and Tec

### Benchmarking

In [None]:
"""# Import ground truth
ground_truth = pd.read_csv('/home/tom-pruyn/Documents/Project/chain-eunomia/Chain_Eunomia/data/Benchmark_set_2/Ground Truth/Ground Truth DOIs.csv')
ground_truth = ground_truth.drop(columns=['Paper Number', 'DOI'])"""

In [None]:
# Merge ground truth

'''benchmark = pd.DataFrame(results)
benchmark = benchmark.merge(
    ground_truth,
    left_on='CSD Ref Code',
    right_on='CSD CODE',
    how='left'
)'''


In [None]:
#pd.DataFrame(benchmark).to_csv(results_file, index=False)
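
Once the merge is re-enabled, a simple score can be read off the merged frame. The sketch below assumes the ground truth carries the true name in a column called `True MOF Name`. That column name is hypothetical, and every value in both frames is a placeholder rather than a real result. Only `CSD CODE` and `CSD Ref Code` come from the cells above.

```python
import pandas as pd

# Placeholder predictions and ground truth, for illustration only
predictions = pd.DataFrame({
    "MOF Name": ["compound 1", "compound 2", "compound 3"],
    "CSD Ref Code": ["AAAAAA", "BBBBBB", "CCCCCC"],
})
ground_truth = pd.DataFrame({
    "CSD CODE": ["AAAAAA", "BBBBBB"],
    "True MOF Name": ["Compound 1", "compound 5"],  # hypothetical column name
})

benchmark = predictions.merge(
    ground_truth,
    left_on="CSD Ref Code",
    right_on="CSD CODE",
    how="left",
    indicator=True,  # adds a '_merge' column marking rows found in both frames
)


def normalise(names):
    """Case-insensitive, whitespace-trimmed comparison key."""
    return names.fillna("").str.strip().str.lower()


in_truth = benchmark["_merge"] == "both"
correct = in_truth & (normalise(benchmark["MOF Name"]) == normalise(benchmark["True MOF Name"]))
accuracy = correct.sum() / in_truth.sum()
print(f"{correct.sum()} of {in_truth.sum()} benchmarked refcodes matched")
```

Exact string equality after normalisation is a strict criterion. Names that differ only in formatting would count as misses, so a manual review of the mismatches is still worthwhile.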