# MOF ChemUnity Matching 

This notebook uses the MOF ChemUnity tools to match CSD Ref Codes to the MOF names and co-references found in their synthesis papers.

### Preparation of CSD Data

First, the CSD data must be prepared so that it can be injected into the prompt. For each DOI we want to process, we gather the relevant information for every CSD code associated with it.

Over 20 000 DOIs have been selected for text mining. We chose MOFs that:
- Are found in the CSD
- Are also found in either the QMOF or CoRE databases

This way, every MOF in our database already has computed properties available (from QMOF or CoRE). These properties can easily be added to our database at the end.
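
The selection rule above is a set intersection, which can be sketched as follows. The refcodes are placeholders invented for illustration; real sets would come from the CSD, QMOF and CoRE exports.

```python
# Hypothetical placeholder refcodes, not real database entries.
csd_refcodes = {"AAAAAA", "BBBBBB", "CCCCCC", "DDDDDD"}
qmof_refcodes = {"AAAAAA", "EEEEEE"}
core_refcodes = {"BBBBBB", "CCCCCC", "FFFFFF"}

# Keep a MOF if it is in the CSD AND in at least one of QMOF / CoRE.
selected = csd_refcodes & (qmof_refcodes | core_refcodes)
print(sorted(selected))  # ['AAAAAA', 'BBBBBB', 'CCCCCC']
```

"DDDDDD" is dropped because it has no precomputed properties in either database, and "EEEEEE" and "FFFFFF" are dropped because they are not in the CSD set.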

In [1]:
# Imports
import pandas as pd
import glob
import os

from src.MOF_ChemUnity.utils.DataPrep import Data_Prep
from src.MOF_ChemUnity.Agents.MatchingAgent import MatchingAgent

In [2]:
# Define path to folder containing all papers to be text mined from
paper_folder_path = '/home/tom-pruyn/Documents/TDM Papers/Processing Batches-PDF/Batch 1'

# Define path to file containing all CSD info extracted from CSD API
csd_info_path = '/home/tom-pruyn/Documents/Project/chain-eunomia/Chain_Eunomia/data/Benchmark_set_2/Ground Truth/CSD_Info.csv'

In [3]:
# List of columns we want to take from our master CSD file and put into our prompt
feature_list = [
    "CSD code", 
    "DOI",
    "Chemical Name",
    "Space group", 
    "Metal types",
    "Molecular formula",
    "Synonyms",
    "a",
    "b",
    "c"
]

In [4]:
# Initialize Data_Prep class
Prepare_Data = Data_Prep(paper_folder_path, csd_info_path, feature_list)

In [5]:
publication_data = Prepare_Data.gather_info()


Missing DOIs: {'10.1107/S1600536806033654', '10.1107/S1600536806051270', '10.1107/S0108270110009510', '10.1107/S1600536806010841', '10.1107/S1600536808025233', '10.1107/S0108270197004009', '10.1107/S0108270191009484', '10.1107/S0108768111030692', '10.1107/S010827010706249X', '10.1107/S0108270105004002', '10.1107/S1600536811018290', '10.1107/S0108270100013846', '10.1107/S1600536807035726', '10.1107/S0108270198016709', '10.1107/S0108270197013061', '10.1107/S010827010701459X', '10.1107/S0108270113026450', '10.1107/S1600536803021445', '10.1107/S1600536809051721', '10.1107/S0108270189005366', '10.1107/S1600536811015091', '10.1107/S1600536804011614', '10.1107/S0108270104026757', '10.1107/S1600536808024197', '10.1107/S0108270100007435', '10.1107/S1600536804010402', '10.1107/S1600536804010438', '10.1107/S1600536808011100', '10.1107/S1600536810003879', '10.1107/S160053680501514X', '10.1107/S0108270108041504', '10.1107/S1600536809045255', '10.1107/S0108270198006660', '10.1107/S0108270105022511'
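
How `gather_info` computes this set is not shown in the notebook. One plausible reading is a set difference between the DOIs listed in the CSD info file and the DOIs recovered from the paper file names. The sketch below assumes that reading, and also assumes the naming scheme seen in the table below (the first `/` of the DOI replaced by `_`, plus a file extension). All DOIs and refcodes in it are made up.

```python
import pandas as pd

# Hypothetical stand-ins for the CSD info table and the paper folder listing.
csd_info = pd.DataFrame({
    "CSD code": ["REFAAA", "REFBBB", "REFCCC"],
    "DOI": ["10.1000/aaa", "10.1000/bbb", "10.1000/ccc"],
})
paper_files = ["10.1000_aaa.md", "10.1000_ccc.md"]


def file_name_to_doi(file_name: str) -> str:
    """Undo the assumed naming scheme: drop the extension, restore the first '/'."""
    stem = file_name.rsplit(".", 1)[0]
    return stem.replace("_", "/", 1)


dois_with_paper = {file_name_to_doi(name) for name in paper_files}

# DOIs listed in the CSD table for which no paper file was found.
missing_dois = set(csd_info["DOI"]) - dois_with_paper
print(missing_dois)  # {'10.1000/bbb'}
```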

In [6]:
publication_data.head()

Unnamed: 0,DOI,File Name,File Format,File Path,Journal,CSD code,Chemical Name,Space group,Metal types,Molecular formula,Synonyms,a,b,c
0,10.1107/S010827018300918X,10.1107_S010827018300918X.md,md,/home/tom-pruyn/Documents/TDM Papers/Processin...,"Journal(Acta Crystallographica,Section C: Crys...",CAJCUL,catena(Diaqua-(4-oxoheptanedioato)-zinc(ii)),P2/c,Zn,C14H24O14Zn2,[],9.307,5.194,10.85
1,10.1107/S1600536807054591,10.1107_S1600536807054591.md,md,/home/tom-pruyn/Documents/TDM Papers/Processin...,Journal(Acta Crystallographica Section E: Stru...,HIPZEM,"catena-(bis(μ2-1,4-bis(3-Pyridylmethoxy)benzen...",P-1,Ag,Ag2C40H34N4O8,[],8.573,9.712,12.099
2,10.1107/S010827010200464X,10.1107_S010827010200464X.md,md,/home/tom-pruyn/Documents/TDM Papers/Processin...,"Journal(Acta Crystallographica,Section C: Crys...",AFUQUN,catena-((μ4-Nitrilotriacetato)-aqua-erbium(iii)),P21/n,Er,C24Er4H32N4O28,[],6.7262,6.5427,19.8
3,10.1107/S010827010700933X,10.1107_S010827010700933X.md,md,/home/tom-pruyn/Documents/TDM Papers/Processin...,"Journal(Acta Crystallographica,Section C: Crys...",LICDIL,catena-((μ2-ethylene-bis(diphenylphosphine oxi...,P-1,Co,C56Co1H46O6P2,[],9.2994,10.9817,11.6053
4,10.1107/S160053680902371X,10.1107_S160053680902371X.md,md,/home/tom-pruyn/Documents/TDM Papers/Processin...,Journal(Acta Crystallographica Section E: Stru...,BOVPOS,"catena-[bis(μ2-5-Amino-1,3,4-thiadiazole-2-thi...",C2/c,Cd,C8Cd2H8N12S8,[],12.6419,10.8341,7.7241


### Running The Prediction Loop

In [7]:
# Instantiate Agent
agent = MatchingAgent()

In [8]:
# Initialize Results Dict
results = {"Reference": [], "MOF Name": [], "CSD Ref Code": [], "Justification": []}
pickle_results = dict()

In [9]:
# This function is used to gather info from the publication info dataframe and put it into a dictionary 
def csd_dict(csd_data): 
    return {
        i["CSD code"]: {
            'Space Group': i["Space group"],
            'Metal Nodes': i["Metal types"],
            'Chemical Name': i["Chemical Name"],
            'a': i["a"],
            'b': i["b"],
            'c': i["c"],
            'Molecular Formula': i["Molecular formula"],
            'Synonyms': i["Synonyms"]
        } 
        for _, i in csd_data.iterrows()
    }
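
As a quick check of what `csd_dict` returns, the sketch below applies it to a one-row DataFrame built from the CAJCUL row of the table above. The function body is repeated (with `row` in place of `i`) only so that the snippet runs on its own.

```python
import pandas as pd


def csd_dict(csd_data):
    # Same mapping as the cell above: one entry per CSD code.
    return {
        row["CSD code"]: {
            "Space Group": row["Space group"],
            "Metal Nodes": row["Metal types"],
            "Chemical Name": row["Chemical Name"],
            "a": row["a"],
            "b": row["b"],
            "c": row["c"],
            "Molecular Formula": row["Molecular formula"],
            "Synonyms": row["Synonyms"],
        }
        for _, row in csd_data.iterrows()
    }


# The CAJCUL row from the publication table shown earlier.
example = pd.DataFrame([{
    "CSD code": "CAJCUL",
    "Chemical Name": "catena(Diaqua-(4-oxoheptanedioato)-zinc(ii))",
    "Space group": "P2/c",
    "Metal types": "Zn",
    "Molecular formula": "C14H24O14Zn2",
    "Synonyms": "[]",
    "a": 9.307,
    "b": 5.194,
    "c": 10.85,
}])

csd = csd_dict(example)
print(list(csd))                   # ['CAJCUL']
print(csd["CAJCUL"]["Space Group"])  # P2/c
```

When a paper reports several structures, the filtered DataFrame has several rows and the dictionary has one key per CSD code, so the prompt sees all candidates for that paper at once.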

#### Batch Process


Note: the batch loop below still compares on file name rather than DOI, and it stores the file name in the "Reference" column. The code needs to be changed so that the DOI is used in both places.

Results that were already produced this way will also need to be reprocessed so that "Reference" holds the DOI and not the file name.
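
One way to do that reprocessing is to build a file-name-to-DOI lookup from the publication table and map it over the saved "Reference" column. This is only a sketch: the two publication rows are copied from the table above, while the `old_results` rows and their MOF names are hypothetical.

```python
import pandas as pd

# Two rows copied from the publication table shown earlier.
publication_data = pd.DataFrame({
    "DOI": ["10.1107/S010827018300918X", "10.1107/S1600536807054591"],
    "File Name": ["10.1107_S010827018300918X.md", "10.1107_S1600536807054591.md"],
})

# Hypothetical saved results whose "Reference" column holds file names.
old_results = pd.DataFrame({
    "MOF Name": ["compound 1", "compound 2"],
    "CSD Ref Code": ["CAJCUL", "HIPZEM"],
    "Reference": ["10.1107_S010827018300918X.md", "10.1107_S1600536807054591.md"],
})

# Lookup Series: index = file name, value = DOI.
file_to_doi = (
    publication_data.drop_duplicates("File Name").set_index("File Name")["DOI"]
)

fixed = old_results.copy()
# Map file names to DOIs; keep the original value where no match is found.
fixed["Reference"] = fixed["Reference"].map(file_to_doi).fillna(fixed["Reference"])
print(fixed["Reference"].tolist())
```

Using the table as the lookup avoids guessing the DOI from the file name, which would break for any DOI that itself contains an underscore.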


In [None]:
results_file = "/home/tom-pruyn/Documents/TDM Papers/Processing Batches-PDF/Batch 1/results/matching.csv"
sub_batch_size = 250

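# Collect the unique papers to process.
# NOTE: despite the variable name, these are file names, not DOIs (see the note above).
unique_dois = publication_data["File Name"].unique()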

# Prepare results DataFrame if results file exists
if os.path.exists(results_file):
    existing_results = pd.read_csv(results_file)
    processed_dois = set(existing_results["Reference"])  # Papers already processed; 'Reference' currently holds file names
    results = existing_results.to_dict(orient="list")
else:
    processed_dois = set()
    results = {"MOF Name": [], "CSD Ref Code": [], "Justification": [], "Reference": []}

# Filter remaining DOIs to process
remaining_dois = [doi for doi in unique_dois if doi not in processed_dois]
print(f"Total DOIs to process: {len(remaining_dois)}")

# Process sub-batch
for DOI in remaining_dois[:sub_batch_size]:
    # Select the rows for this paper (the loop variable holds a file name here, not a DOI)
    filtered_csd_data = publication_data[publication_data["File Name"] == DOI]

    # Check for an empty selection before indexing into it with iloc[0]
    if filtered_csd_data.empty:
        print(f"No data found for DOI {DOI}")
        continue

    # Get the associated file path
    file_path = filtered_csd_data.iloc[0]['File Path']

    print(f"Processing DOI: {DOI}")
    print(f"File Path: {file_path}")
    print("-" * 28)

    # Build the CSD code -> properties dictionary for the prompt
    csd = csd_dict(filtered_csd_data)

    print(csd)
    # Get Matching Agent Response
    mofs = agent.agent_response(csd, file_path)

    # Collect results
    for mof in mofs.mofs:
        results["MOF Name"].append(mof.name)
        results["CSD Ref Code"].append(mof.refcode)
        results["Justification"].append(mof.justification)
        results["Reference"].append(os.path.basename(file_path))

# Save updated results to CSV
pd.DataFrame(results).to_csv(results_file, index=False)
print(f"Results saved to {results_file}.")
print(f"Processed {len(remaining_dois[:sub_batch_size])} DOIs in this batch.")

Total DOIs to process: 589
Processing DOI: 10.1107_S056774088000684X.md
File Path: /home/tom-pruyn/Documents/TDM Papers/Processing Batches-PDF/Batch 1/md/10.1107_S056774088000684X/10.1107_S056774088000684X.md
----------------------------
{'AQFSZN': {'Space Group': 'P21/c', 'Metal Nodes': 'Zn', 'Chemical Name': 'catena(Tetra-aqua-(2,2,3,3-tetrafluorosuccinato)-zinc(ii))', 'a': 10.799, 'b': 9.115, 'c': 10.995, 'Molecular Formula': 'C16F16H32O32Zn4', 'Synonyms': '[]'}}

[Document(metadata={}, page_content='Fig. 1. A perspective view of the molecule along the approximate threefold axis, showing thermal ellipsoids at 50% probability. Hydrogen atoms are represented by spheres of 0. I A radius.'), Document(metadata={}, page_content='Discussion. Crystals of the title compound (I) consist of tris-chelate molecules of the complex [CO(SECNMe2)a]. Three dithiocarbamate ligands octahedrally coordinate to a Co atom through S atoms. The bond lengths and angles are listed in Table 2. A perspective vie

In [17]:
results_file

'/home/tom-pruyn/Documents/TDM Papers/Processing Batches-PDF/Batch 1/results/matching.csv'
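
Because the batch loop writes the CSV only once, after the whole sub-batch has finished, an interruption part-way through loses every result from that run. A minimal sketch of the same resumable pattern with periodic checkpoints is given below. `fake_match`, `run_batch` and `checkpoint_every` are hypothetical names, and `fake_match` stands in for the real `agent.agent_response` call.

```python
import os
import tempfile

import pandas as pd


def fake_match(doi):
    """Hypothetical stand-in for the matching agent: one result row per DOI."""
    return {"Reference": doi, "MOF Name": f"compound from {doi}", "CSD Ref Code": "XXXXXX"}


def run_batch(dois, results_file, batch_size, checkpoint_every=2):
    # Resume from earlier runs if a results file already exists.
    if os.path.exists(results_file):
        rows = pd.read_csv(results_file).to_dict(orient="records")
    else:
        rows = []
    done = {row["Reference"] for row in rows}
    todo = [doi for doi in dois if doi not in done][:batch_size]

    for n, doi in enumerate(todo, start=1):
        rows.append(fake_match(doi))
        # Write a checkpoint so an interruption loses at most a few papers.
        if n % checkpoint_every == 0:
            pd.DataFrame(rows).to_csv(results_file, index=False)

    # Final save for whatever was processed after the last checkpoint.
    pd.DataFrame(rows).to_csv(results_file, index=False)
    return len(todo)


dois = [f"10.1000/paper{i}" for i in range(5)]  # made-up DOIs
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "matching.csv")
    first = run_batch(dois, path, batch_size=3)
    second = run_batch(dois, path, batch_size=3)
    total_rows = len(pd.read_csv(path))

print(first, second, total_rows)  # 3 2 5
```

The second call picks up only the two papers that the first call did not reach, which is the behaviour the `processed_dois` filter provides in the notebook's loop.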

In [10]:
for DOI in publication_data["DOI"].unique():
    # Filter csd_data based on DOI
    filtered_csd_data = publication_data[publication_data["DOI"] == DOI]

    # Check for an empty selection before indexing into it with iloc[0]
    if filtered_csd_data.empty:
        print(f"No data found for DOI {DOI}")
        continue

    # Get the associated file path
    file_path = filtered_csd_data.iloc[0]['File Path']

    print(f"Processing DOI: {DOI}")
    print(f"File Path: {file_path}")
    print("-" * 28)

    # Build the CSD code -> properties dictionary for the prompt
    csd = csd_dict(filtered_csd_data)

    print(csd)
    # Get Matching Agent Response
    mofs = agent.agent_response(csd, file_path)

    # Collect results
    for mof in mofs.mofs:
        results["MOF Name"].append(mof.name)
        results["CSD Ref Code"].append(mof.refcode)
        results["Justification"].append(mof.justification)
        results["Reference"].append(os.path.basename(file_path))
    
# Save results to CSV
pd.DataFrame(results).to_csv("/home/tom-pruyn/Documents/TDM Papers/Processing Batches-PDF/Batch 1/results/matching.csv", index=False)

Processing DOI: 10.1107/S010827018300918X
File Path: /home/tom-pruyn/Documents/TDM Papers/Processing Batches-PDF/Batch 1/md/10.1107_S010827018300918X/10.1107_S010827018300918X.md
----------------------------
{'CAJCUL': {'Space Group': 'P2/c', 'Metal Nodes': 'Zn', 'Chemical Name': 'catena(Diaqua-(4-oxoheptanedioato)-zinc(ii))', 'a': 9.307, 'b': 5.194, 'c': 10.85, 'Molecular Formula': 'C14H24O14Zn2', 'Synonyms': '[]'}}
Saved vector store for /home/tom-pruyn/Documents/TDM Papers/Processing Batches-PDF/Batch 1/md/10.1107_S010827018300918X/10.1107_S010827018300918X.md in /home/tom-pruyn/Documents/TDM Papers/Processing Batches-PDF/Batch 1/md/vs/10.1107_S010827018300918X
--------------10.1107_S010827018300918X.md--------------
Action: read_doc

Result: 
1. -MOF name: catena(Diaqua-(4-oxoheptanedioato)-zinc(ii))<|>C14H24O14Zn2<|>CAJCUL
   -CSD Ref Code: CAJCUL
   -Justification: The MOF matches the CSD Reference Code CAJCUL based on the provided chemical name "catena(Diaqua-(4-oxoheptanedioato

KeyboardInterrupt: 