## <b>MOF ChemUnity Matching</b>

The purpose of this notebook is to use our developed tools to match CSD Ref Codes to MOF Names/Co-References found in their synthesis papers. 

### <b>Preparation of CSD Data</b>

First, the CSD Data must be prepared to be injected into the prompt. For each DOI we wish to process, we must gather the relevant info for each associated CSD code.

Over 20 000 DOIs have been selected for text mining. We chose MOFs that:
- Are found in CSD 
- Are also found in either QMOF or CoRE Databases

This way, every MOF in our database has relevant computational properties already calculated (found in QMOF or CoRE). The properties can be easily added to our database at the end. 
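
The selection rule above amounts to a set operation on reference codes. The following is a minimal sketch, assuming a hypothetical `csd` table with `CSD code` and `DOI` columns and two hypothetical sets of codes taken from QMOF and CoRE. All codes and DOIs below are placeholders, not real entries, and the real databases may identify structures differently.

```python
import pandas as pd

# Placeholder data for illustration only; these are not real CSD entries.
csd = pd.DataFrame({
    "CSD code": ["AAAAAA", "BBBBBB", "CCCCCC", "DDDDDD"],
    "DOI": ["10.0000/a", "10.0000/a", "10.0000/b", "10.0000/c"],
})
qmof_codes = {"AAAAAA"}  # hypothetical codes present in QMOF
core_codes = {"CCCCCC"}  # hypothetical codes present in CoRE

# Keep CSD entries that also appear in QMOF or CoRE
in_either = csd["CSD code"].isin(qmof_codes | core_codes)
selected = csd[in_either]

# DOIs that would be passed on to text mining
selected_dois = sorted(selected["DOI"].unique())
print(selected_dois)
```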

In [1]:
# Imports
import pandas as pd
import glob
import os

from MOF_ChemUnity.utils.DataPrep import Data_Prep
from MOF_ChemUnity.Agents.MatchingAgent import MatchingAgent
from MOF_ChemUnity.utils.csd_dict import csd_dict # This function is used to gather info from the publication info dataframe and put it into a dictionary 

import nltk

nltk.download('punkt_tab')
nltk.download('averaged_perceptron_tagger_eng')

[nltk_data] Downloading package punkt_tab to /home/tom-
[nltk_data]     pruyn/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     /home/tom-pruyn/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger_eng is already up-to-
[nltk_data]       date!


True

### <b>Matching with XML</b>
Some publishers provide their papers in .xml format when downloading from their API. To use these files, we must convert them into .md files ourselves. Since each publisher uses its own XML tags, custom parsing methods were built to support both Elsevier and ACS publications.
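
To illustrate why publisher-specific handling is needed, here is a rough sketch of namespace-based format detection followed by a naive paragraph dump. It is not the parser shipped with MOF_ChemUnity: the function names, the toy XML snippets, and the choice of `para`/`p` as paragraph tags are all assumptions made for this example.

```python
import xml.etree.ElementTree as ET

def detect_format(xml_text: str) -> str:
    """Guess the XML flavour from the root element's namespace (heuristic)."""
    root = ET.fromstring(xml_text)
    tag = root.tag  # ElementTree writes namespaced tags as '{namespace}localname'
    namespace = tag[1:].split("}")[0] if tag.startswith("{") else ""
    if "elsevier" in namespace:
        return "Elsevier"
    if "tei-c.org" in namespace:
        return "TEI"
    return "unknown"

def paragraphs_to_markdown(xml_text: str) -> str:
    """Join the text of every 'para' or 'p' element with blank lines."""
    root = ET.fromstring(xml_text)
    blocks = []
    for element in root.iter():
        local_name = element.tag.split("}")[-1]
        if local_name in {"para", "p"}:
            # Flatten nested inline tags and normalise whitespace
            text = " ".join("".join(element.itertext()).split())
            if text:
                blocks.append(text)
    return "\n\n".join(blocks)

# Toy documents, heavily simplified compared to real publisher files
elsevier_sample = (
    '<doc xmlns="http://www.elsevier.com/xml/svapi/article/dtd">'
    "<para>First paragraph.</para><para>Second paragraph.</para></doc>"
)
tei_sample = (
    '<TEI xmlns="http://www.tei-c.org/ns/1.0"><text>'
    "<p>Complex <hi>2</hi> was obtained.</p></text></TEI>"
)

print(detect_format(elsevier_sample), detect_format(tei_sample))
print(paragraphs_to_markdown(tei_sample))
```

A real converter would also need to handle headings, tables, and captions, which is where the publisher-specific tags matter most.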

<b>Preprocessing and arranging everything appropriately</b>

In order to perform matching with .xml files:
1. Make a subfolder named XML. You can sort your XML files inside the XML folder however you want (we sort by publisher in these examples: ```.../XML/ACS```).
2. Define the directory path to that folder: ```paper_folder_path```
3. Define a path to CSD_Info.csv - name this path ```csd_info_path```.

In [2]:
# Define path to folder containing all papers to be text mined from
# If using XML files, point to a parent folder named "XML"
paper_folder_path = 'Examples/XML' # This folder has two subfolders: ACS and Elsevier

# Define path to file containing all CSD info extracted from CSD API
csd_info_path = 'info/CSD_Info.csv'

# List of columns we want to take from our master CSD file and put into our prompt
feature_list = [
    "CSD code", 
    "DOI",
    "Chemical Name",
    "Space group", 
    "Metal types",
    "Molecular formula",
    "Synonyms",
    "a",
    "b",
    "c"
]

<b>Below, we see that 3 MOFs from CSD_Info.csv originate from these 2 publications</b>


In [3]:
# Initialize Data_Prep class
Prepare_Data = Data_Prep(paper_folder_path, csd_info_path, feature_list)
publication_data = Prepare_Data.gather_info()

publication_data.head()

Unnamed: 0,DOI,File Name,File Format,File Path,Journal,CSD code,Chemical Name,Space group,Metal types,Molecular formula,Synonyms,a,b,c
0,10.1016/j.molstruc.2017.11.128,10.1016_j.molstruc.2017.11.128.xml,xml,Examples/XML/Elsevier/10.1016_j.molstruc.2017....,Journal(Journal of Molecular Structure),OWEZEW01,"catena-((μ-hydrogen benzene-1,3,5-tricarboxyla...",P21/c,Ca,C36Ca4H48O40,[],10.1977,16.4581,7.4796
1,10.1021/acs.analchem.8b00494,ac8b00494.tei.xml,xml,Examples/XML/ACS/ac8b00494.tei.xml,Journal(Analytical Chemistry),CIJBIJ,catena-((μ-5-[(4-carboxylatophenoxy)methyl]ben...,C2/c,Tb,C88H92N8O36Tb4,[],28.614,14.383,13.644
2,10.1021/acs.analchem.8b00494,ac8b00494.tei.xml,xml,Examples/XML/ACS/ac8b00494.tei.xml,Journal(Analytical Chemistry),CIJBEF,catena-((μ-5-[(4-carboxylatophenoxy)methyl]ben...,C2/c,Eu,C88Eu4H92N8O36,[],28.22,14.54,13.394
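
The count of CSD entries per publication can be confirmed with a simple group-by. The sketch below rebuilds only the `DOI` and `CSD code` columns from the rows shown above; `example_rows` is a name chosen for this example so that it does not overwrite `publication_data`.

```python
import pandas as pd

# DOI and CSD code values copied from the three rows displayed above
example_rows = pd.DataFrame({
    "DOI": [
        "10.1016/j.molstruc.2017.11.128",
        "10.1021/acs.analchem.8b00494",
        "10.1021/acs.analchem.8b00494",
    ],
    "CSD code": ["OWEZEW01", "CIJBIJ", "CIJBEF"],
})

# List the CSD codes attached to each DOI
codes_per_doi = example_rows.groupby("DOI")["CSD code"].apply(list).to_dict()
print(codes_per_doi)

n_codes = example_rows["CSD code"].nunique()
n_dois = example_rows["DOI"].nunique()
print(f"{n_codes} CSD codes from {n_dois} publications")
```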


#### <b>Running The Prediction Loop</b>

In [4]:
# Instantiate Agent
agent = MatchingAgent()

# Dynamically create the results_file path
results_dir = os.path.join(paper_folder_path, 'results')
results_file = os.path.join(results_dir, 'matching.csv')
sub_batch_size = 3000

# Ensure the results directory exists
os.makedirs(results_dir, exist_ok=True)

# Load publication data (assuming publication_data is already defined)
unique_DOIs = publication_data["DOI"].unique()

# Prepare results DataFrame if results file exists
if os.path.exists(results_file):
    existing_results = pd.read_csv(results_file)
    processed_DOIs = set(existing_results["DOI"])  # Collect the DOIs that were already processed (from the 'DOI' column)
    results = existing_results.to_dict(orient="list")
    print(processed_DOIs)
else:
    processed_DOIs = set()
    results = {"MOF Name": [], "CSD Ref Code": [], "Justification": [], "DOI": []}

# Filter remaining DOIs to process
remaining_DOIs = [DOI for DOI in unique_DOIs if DOI not in processed_DOIs]
print(f"Total DOIs to process: {len(remaining_DOIs)}")
print(remaining_DOIs)

# Process sub-batch
for DOI in remaining_DOIs[:sub_batch_size]:
    
    # Filter csd_data based on DOI
    filtered_csd_data = publication_data[publication_data["DOI"] == DOI]

    print(f"Processing DOI: {DOI}")

    if not filtered_csd_data.empty:
        # Get the associated file path (only safe once we know a row exists)
        file_path = filtered_csd_data.iloc[0]['File Path']
        print(f"File Path: {file_path}")
        print("-" * 28)

        # Call the csd_dict function with the filtered data
        csd = csd_dict(filtered_csd_data)

        print(csd)
        # Get Matching Agent Response
        mofs,docs = agent.agent_response(csd, file_path, ret_docs=True, vs_destination='Examples')

        # Collect results
        for mof in mofs.mofs:
            results["MOF Name"].append(mof.name)
            results["CSD Ref Code"].append(mof.refcode)
            results["Justification"].append(mof.justification)
            results["DOI"].append(DOI)
    else:
        print(f"No data found for DOI {DOI}")

# Save updated results to CSV
pd.DataFrame(results).to_csv(results_file, index=False)
print(f"Results saved to {results_file}.")
print(f"Processed {len(remaining_DOIs[:sub_batch_size])} DOIs in this batch.")

Total DOIs to process: 2
['10.1016/j.molstruc.2017.11.128', '10.1021/acs.analchem.8b00494']
Processing DOI: 10.1016/j.molstruc.2017.11.128
File Path: Examples/XML/Elsevier/10.1016_j.molstruc.2017.11.128.xml
----------------------------
{'OWEZEW01': {'Space Group': 'P21/c', 'Metal Nodes': 'Ca', 'Chemical Name': 'catena-((μ-hydrogen benzene-1,3,5-tricarboxylate)-tetra-aqua-calcium)', 'a': 10.1977, 'b': 16.4581, 'c': 7.4796, 'Molecular Formula': 'C36Ca4H48O40', 'Synonyms': '[]'}}
Detected Elsevier XML format in 'Examples/XML/Elsevier/10.1016_j.molstruc.2017.11.128.xml'
Converted 'Examples/XML/Elsevier/10.1016_j.molstruc.2017.11.128.xml' to Markdown at 'Examples/XML/Elsevier/Parsed_XML/10.1016_j.molstruc.2017.11.128.md'
Saved vector store for Examples/XML/Elsevier/10.1016_j.molstruc.2017.11.128.xml in Examples/vs/10.1016_j.molstruc.2017.11.128
--------------10.1016_j.molstruc.2017.11.128.xml--------------
Action: read_doc

Result: 
- MOF name: C9H12CaO10<|>calcium trimesate synthesized by 
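
The dictionary printed in the log above has one entry per CSD code, each holding the features passed to the prompt. The sketch below shows one way such a structure could be built from the filtered DataFrame. `build_csd_dict` and `COLUMN_TO_KEY` are hypothetical names for this example and are not the actual `csd_dict` implementation; the key names are taken from the printed output.

```python
import pandas as pd

# Mapping from DataFrame column names to the keys seen in the printed dictionary
COLUMN_TO_KEY = {
    "Space group": "Space Group",
    "Metal types": "Metal Nodes",
    "Chemical Name": "Chemical Name",
    "a": "a",
    "b": "b",
    "c": "c",
    "Molecular formula": "Molecular Formula",
    "Synonyms": "Synonyms",
}

def build_csd_dict(rows: pd.DataFrame) -> dict:
    """Hypothetical stand-in for csd_dict: one nested dict per CSD code."""
    result = {}
    for _, row in rows.iterrows():
        result[row["CSD code"]] = {
            key: row[column] for column, key in COLUMN_TO_KEY.items()
        }
    return result

# Values copied from the OWEZEW01 entry printed above
rows = pd.DataFrame([{
    "CSD code": "OWEZEW01",
    "Space group": "P21/c",
    "Metal types": "Ca",
    "Chemical Name": "catena-((μ-hydrogen benzene-1,3,5-tricarboxylate)-tetra-aqua-calcium)",
    "Molecular formula": "C36Ca4H48O40",
    "Synonyms": "[]",
    "a": 10.1977,
    "b": 16.4581,
    "c": 7.4796,
}])

csd_example = build_csd_dict(rows)
print(csd_example["OWEZEW01"]["Metal Nodes"])
```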

### <b>Matching with PDFs</b>
Some publishers provide their papers in PDF format. No custom parsers are needed to turn these into .md files; instead we use "Marker" (https://github.com/VikParuchuri/marker). Several PDF -> MD tools are available, but we have found Marker to be the best.

In the Examples/PDF folder, we provide both the PDF and converted .md file created using marker.

<b> Note: To be processed, an .md file must be named after its DOI, with every forward slash "/" replaced by an underscore "_". This is the standard naming convention when downloading PDFs using publisher APIs. </b>

For example: 

DOI = 10.1016/j.molstruc.2017.11.128

File Name = 10.1016_j.molstruc.2017.11.128.md
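
The naming convention can be written as a pair of small helpers. Both function names are made up for this sketch. The reverse direction is only a heuristic: it restores the first underscore, on the assumption that the DOI prefix itself contains no underscore and the suffix contains no further slash.

```python
def doi_to_filename(doi: str, extension: str = ".md") -> str:
    """Replace every '/' in the DOI with '_' and add the file extension."""
    return doi.replace("/", "_") + extension

def filename_to_doi(file_name: str, extension: str = ".md") -> str:
    """Best-effort inverse: strip the extension and restore the first '_' to '/'."""
    if file_name.endswith(extension):
        stem = file_name[: -len(extension)]
    else:
        stem = file_name
    return stem.replace("_", "/", 1)

file_name = doi_to_filename("10.1016/j.molstruc.2017.11.128")
print(file_name)
print(filename_to_doi("10.1039_a700472i.md"))
```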

One way to check that the matching has been done correctly is to look up the DOI in the CSD and compare the CSD reference codes associated with it against the matched codes.

In [5]:
df = pd.read_csv("Examples/XML/results/matching.csv")
df

Unnamed: 0,MOF Name,CSD Ref Code,Justification,DOI
0,C9H12CaO10<|>calcium trimesate synthesized by ...,OWZEW01,The MOF C9H12CaO10 matches the CSD Ref Code OW...,10.1016/j.molstruc.2017.11.128
1,C18H30Ca3O24<|>calcium trimesate synthesized b...,not provided,The MOF C18H30Ca3O24 does not match the CSD Re...,10.1016/j.molstruc.2017.11.128
2,C9H14MgO11·H2O<|>magnesium trimesate synthesiz...,not provided,The MOF C9H14MgO11·H2O does not match the CSD ...,10.1016/j.molstruc.2017.11.128
3,C18H34Mg3O26·4H2O<|>magnesium trimesate synthe...,not provided,The MOF C18H34Mg3O26·4H2O does not match the C...,10.1016/j.molstruc.2017.11.128
4,C23.5H28.5EuN2.5O10.5<|>complex 2,CIJBEF,The MOF with the empirical formula C23.5H28.5E...,10.1021/acs.analchem.8b00494
5,C23.5H28.5TbN2.5O10.5<|>complex 3,CIJBIJ,The MOF with the empirical formula C23.5H28.5T...,10.1021/acs.analchem.8b00494
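
The same check can be run offline by comparing the matched codes with the codes listed for each DOI in the `publication_data` table shown earlier. This is a minimal sketch using values copied from the two tables in this notebook; `expected`, `matches`, and `report` are names chosen for the example.

```python
import pandas as pd

# CSD codes per DOI, copied from the publication_data table shown earlier
expected = {
    "10.1016/j.molstruc.2017.11.128": {"OWEZEW01"},
    "10.1021/acs.analchem.8b00494": {"CIJBIJ", "CIJBEF"},
}

# Matched codes, copied from the results table above
matches = pd.DataFrame({
    "CSD Ref Code": [
        "OWZEW01", "not provided", "not provided", "not provided",
        "CIJBEF", "CIJBIJ",
    ],
    "DOI": ["10.1016/j.molstruc.2017.11.128"] * 4
    + ["10.1021/acs.analchem.8b00494"] * 2,
})

# Ignore rows where the agent did not assign a code
matched = matches[matches["CSD Ref Code"] != "not provided"]

report = {}
for doi, group in matched.groupby("DOI"):
    found = set(group["CSD Ref Code"])
    known = expected.get(doi, set())
    report[doi] = {
        "unknown": sorted(found - known),  # returned codes not listed for this DOI
        "missing": sorted(known - found),  # listed codes that were never matched
    }

for doi, result in report.items():
    print(doi, result)
```

With the values shown in this notebook, the check reports `OWZEW01` as unknown and `OWEZEW01` as missing for the Elsevier paper, so that row deserves a manual look before it enters the knowledge graph.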


<b>Preprocessing and arranging everything appropriately</b>

In order to perform matching with .md files:
1. Make a subfolder named MD. You can sort your .md files inside the MD folder however you want (for example by publisher: ```.../MD/ACS```). In this notebook, the example .md files sit in ```Examples/PDF``` next to their source PDFs.
2. Define the directory path to that folder: ```paper_folder_path```
3. Define a path to CSD_Info.csv - name this path ```csd_info_path```.

In [6]:
paper_folder_path = 'Examples/PDF'

# Define path to file containing all CSD info extracted from CSD API
csd_info_path = 'info/CSD_Info.csv'

# List of columns we want to take from our master CSD file and put into our prompt
feature_list = [
    "CSD code", 
    "DOI",
    "Chemical Name",
    "Space group", 
    "Metal types",
    "Molecular formula",
    "Synonyms",
    "a",
    "b",
    "c"
]

In [7]:
# Initialize Data_Prep class
Prepare_Data = Data_Prep(paper_folder_path, csd_info_path, feature_list)
publication_data = Prepare_Data.gather_info()

publication_data.head()

Unnamed: 0,DOI,File Name,File Format,File Path,Journal,CSD code,Chemical Name,Space group,Metal types,Molecular formula,Synonyms,a,b,c
0,10.1039/a700472i,10.1039_a700472i.md,md,Examples/PDF/10.1039_a700472i.md,"Journal(Journal of the Chemical Society, Dalto...",NAVCAO,"catena-(tris(1,4-Diazoniabicyclo(2.2.2)octane)...",P-3c1,Fe,Fe16O112P28,[],13.5274,13.5274,19.2645
1,10.1039/a804945i,10.1039_a804945i.md,md,Examples/PDF/10.1039_a804945i.md,"Journal(Journal of the Chemical Society, Dalto...",FEKDAA,"catena-((μ2-2-(4,5-bis(Methylsulfanyl)-1,3-dit...",P1,Ag,Ag1C15F3H14N2O3S9,[],7.946,11.202,7.707


In [8]:
# Instantiate Agent
agent = MatchingAgent()

# Dynamically create the results_file path
results_dir = os.path.join(paper_folder_path, 'results')
results_file = os.path.join(results_dir, 'matching.csv')
sub_batch_size = 3000

# Ensure the results directory exists
os.makedirs(results_dir, exist_ok=True)

# Load publication data (assuming publication_data is already defined)
unique_DOIs = publication_data["DOI"].unique()

# Prepare results DataFrame if results file exists
if os.path.exists(results_file):
    existing_results = pd.read_csv(results_file)
    processed_DOIs = set(existing_results["DOI"])  # Collect the DOIs that were already processed (from the 'DOI' column)
    results = existing_results.to_dict(orient="list")
    print(processed_DOIs)
else:
    processed_DOIs = set()
    results = {"MOF Name": [], "CSD Ref Code": [], "Justification": [], "DOI": []}

# Filter remaining DOIs to process
remaining_DOIs = [DOI for DOI in unique_DOIs if DOI not in processed_DOIs]
print(f"Total DOIs to process: {len(remaining_DOIs)}")
print(remaining_DOIs)

# Process sub-batch
for DOI in remaining_DOIs[:sub_batch_size]:
    
    # Filter csd_data based on DOI
    filtered_csd_data = publication_data[publication_data["DOI"] == DOI]

    print(f"Processing DOI: {DOI}")

    if not filtered_csd_data.empty:
        # Get the associated file path (only safe once we know a row exists)
        file_path = filtered_csd_data.iloc[0]['File Path']
        print(f"File Path: {file_path}")
        print("-" * 28)

        # Call the csd_dict function with the filtered data
        csd = csd_dict(filtered_csd_data)

        print(csd)
        # Get Matching Agent Response
        mofs,docs = agent.agent_response(csd, file_path, ret_docs=True, vs_destination='Examples')

        # Collect results
        for mof in mofs.mofs:
            results["MOF Name"].append(mof.name)
            results["CSD Ref Code"].append(mof.refcode)
            results["Justification"].append(mof.justification)
            results["DOI"].append(DOI)
    else:
        print(f"No data found for DOI {DOI}")

# Save updated results to CSV
pd.DataFrame(results).to_csv(results_file, index=False)
print(f"Results saved to {results_file}.")
print(f"Processed {len(remaining_DOIs[:sub_batch_size])} DOIs in this batch.")

Total DOIs to process: 2
['10.1039/a700472i', '10.1039/a804945i']
Processing DOI: 10.1039/a700472i
File Path: Examples/PDF/10.1039_a700472i.md
----------------------------
{'NAVCAO': {'Space Group': 'P-3c1', 'Metal Nodes': 'Fe', 'Chemical Name': "catena-(tris(1,4-Diazoniabicyclo(2.2.2)octane) dodecakis(μ3-hydrogen phosphato-O,O',O'')-bis(μ3-phosphato-O,O',O'')-hexa-aqua-octa-iron)", 'a': 13.5274, 'b': 13.5274, 'c': 19.2645, 'Molecular Formula': 'Fe16O112P28', 'Synonyms': '[]'}}
Saved vector store for Examples/PDF/10.1039_a700472i.md in Examples/vs/10.1039_a700472i
--------------10.1039_a700472i.md--------------
Action: read_doc

Result: 
1.  -MOF name: [HN(CH2CH2)3NH]3[Fe8(HPO4)12(PO4)2(H2O)6]<|>compound 1
    -CSD Ref Code: NAVCAO
    -Justification: The compound [HN(CH2CH2)3NH]3[Fe8(HPO4)12(PO4)2(H2O)6] matches the CSD Ref Code NAVCAO based on several key features. The metal node is iron (Fe), which matches the 'Fe' metal node in the CSD reference. The space group is P3¯c1, which mat
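
The loop above writes the results file once, after the whole batch has finished, so an interruption part-way through loses the matches collected so far. One possible alternative is to append to the CSV after every DOI. The sketch below is an assumption about how that could look, with hypothetical helper names and placeholder rows; it is not part of MOF_ChemUnity.

```python
import os
import tempfile

import pandas as pd

def append_rows(results_file: str, rows: list) -> None:
    """Append rows to the CSV, writing the header only when the file is new."""
    if not rows:
        return
    write_header = not os.path.exists(results_file)
    pd.DataFrame(rows).to_csv(
        results_file, mode="a", header=write_header, index=False
    )

def already_processed(results_file: str) -> set:
    """Return the DOIs already present in the results file."""
    if not os.path.exists(results_file):
        return set()
    return set(pd.read_csv(results_file)["DOI"])

# Demonstration with placeholder rows in a temporary folder
with tempfile.TemporaryDirectory() as tmp:
    demo_file = os.path.join(tmp, "matching.csv")
    append_rows(demo_file, [{
        "MOF Name": "placeholder A", "CSD Ref Code": "AAAAAA",
        "Justification": "demo", "DOI": "10.0000/a",
    }])
    append_rows(demo_file, [{
        "MOF Name": "placeholder B", "CSD Ref Code": "BBBBBB",
        "Justification": "demo", "DOI": "10.0000/b",
    }])
    done = already_processed(demo_file)
    n_rows = len(pd.read_csv(demo_file))

print(sorted(done), n_rows)
```

Note that a DOI for which the agent returns no MOFs leaves no row behind, so it would be retried on the next run under either approach.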

In [9]:
# Load the saved results; these should mirror the matches printed by the loop above
df = pd.read_csv("Examples/PDF/results/matching.csv")
df

Unnamed: 0,MOF Name,CSD Ref Code,Justification,DOI
0,[HN(CH2CH2)3NH]3[Fe8(HPO4)12(PO4)2(H2O)6]<|>co...,NAVCAO,The compound [HN(CH2CH2)3NH]3[Fe8(HPO4)12(PO4)...,10.1039/a700472i
1,[Ag(CM-TTF)(CF3SO3)]<|>complex 2,FEKDAA,The MOF [Ag(CM-TTF)(CF3SO3)] matches the CSD R...,10.1039/a804945i


### <b>Combine Data For Knowledge Graph</b>
Our knowledge graph requires one master file for all MOF names and CSD Codes. Below, we will create that file by combining the results of the XML and MD extractions.

Note: basic data cleaning is required at this step, such as removing all MOFs whose CSD Ref Code is "not provided".

In [10]:
# Find the matching.csv file in every results folder (e.g. Examples/XML/results).
# Restricting the pattern keeps a previously written KG_Data/matching.csv out of the merge.
csv_files = glob.glob(os.path.join("Examples", "**", "results", "matching.csv"), recursive=True)

# Combine all results files into one DataFrame
combined_df = pd.concat((pd.read_csv(file) for file in csv_files), ignore_index=True)

# Remove rows where "CSD Ref Code" is "not provided"
combined_df = combined_df[combined_df["CSD Ref Code"] != "not provided"]

# Save the combined DataFrame to a new .csv file, creating the output folder if needed
os.makedirs("Examples/KG_Data", exist_ok=True)
combined_df.to_csv("Examples/KG_Data/matching.csv", index=False)

In [11]:
combined_df

Unnamed: 0,MOF Name,CSD Ref Code,Justification,DOI
0,[HN(CH2CH2)3NH]3[Fe8(HPO4)12(PO4)2(H2O)6]<|>co...,NAVCAO,The compound [HN(CH2CH2)3NH]3[Fe8(HPO4)12(PO4)...,10.1039/a700472i
1,[Ag(CM-TTF)(CF3SO3)]<|>complex 2,FEKDAA,The MOF [Ag(CM-TTF)(CF3SO3)] matches the CSD R...,10.1039/a804945i
2,C9H12CaO10<|>calcium trimesate synthesized by ...,OWZEW01,The MOF C9H12CaO10 matches the CSD Ref Code OW...,10.1016/j.molstruc.2017.11.128
6,C23.5H28.5EuN2.5O10.5<|>complex 2,CIJBEF,The MOF with the empirical formula C23.5H28.5E...,10.1021/acs.analchem.8b00494
7,C23.5H28.5TbN2.5O10.5<|>complex 3,CIJBIJ,The MOF with the empirical formula C23.5H28.5T...,10.1021/acs.analchem.8b00494
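
Each `MOF Name` value in the tables above appears to pack several labels into one string, joined by `<|>`. Assuming that separator divides the primary name from its co-references, the column can be expanded into one row per alias before loading it into a graph. `alias_table` and `long_form` are names chosen for this sketch, and the two rows are copied from the combined table.

```python
import pandas as pd

# Two rows copied from the combined results shown above
alias_table = pd.DataFrame({
    "CSD Ref Code": ["CIJBEF", "FEKDAA"],
    "MOF Name": [
        "C23.5H28.5EuN2.5O10.5<|>complex 2",
        "[Ag(CM-TTF)(CF3SO3)]<|>complex 2",
    ],
})

# Split on the literal '<|>' separator (assumed meaning: name<|>co-reference)
alias_table["Aliases"] = alias_table["MOF Name"].apply(
    lambda value: [part.strip() for part in value.split("<|>")]
)

# One row per (CSD Ref Code, alias) pair
long_form = alias_table.explode("Aliases")[["CSD Ref Code", "Aliases"]]
print(long_form.to_string(index=False))
```

Generic labels such as "complex 2" recur across papers, so an alias is only meaningful together with its DOI or CSD reference code.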
