## <b>MOF ChemUnity Matching</b>

This notebook uses the MOF ChemUnity tools to match CSD Ref Codes to the MOF names and co-references used in the corresponding synthesis papers.

### <b>Preparation of CSD Data</b>

First, the CSD data must be prepared so that it can be injected into the prompt. For each DOI we wish to process, we gather the relevant information for every CSD code associated with it.

Over 20 000 DOIs have been selected for text mining. We chose MOFs that:
- Are found in the CSD
- Are also found in either the QMOF or CoRE databases

This way, every MOF in our database already has computed properties available (from QMOF or CoRE). These properties can easily be added to our database at the end.
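The selection rule above is a set operation: keep the ref codes that are in the CSD and in at least one of QMOF or CoRE. The following is a minimal sketch, assuming each database is available as a plain collection of CSD ref codes. The function name `select_refcodes` and the sample codes are placeholders for illustration and are not part of MOF_ChemUnity.

```python
def select_refcodes(csd, qmof, core):
    """Return ref codes found in the CSD and in at least one of QMOF or CoRE."""
    return set(csd) & (set(qmof) | set(core))


# Placeholder ref codes, not real database contents
csd_codes = ["AAAAAA", "BBBBBB", "CCCCCC", "DDDDDD"]
qmof_codes = ["AAAAAA", "EEEEEE"]
core_codes = ["CCCCCC"]

selected = select_refcodes(csd_codes, qmof_codes, core_codes)
print(sorted(selected))
```

"EEEEEE" is dropped because it is missing from the CSD list, and "BBBBBB" and "DDDDDD" are dropped because neither QMOF nor CoRE contains them.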

In [1]:
# Imports
import pandas as pd
import glob
import os

from MOF_ChemUnity.utils.DataPrep import Data_Prep
from MOF_ChemUnity.Agents.MatchingAgent import MatchingAgent
from MOF_ChemUnity.utils.csd_dict import csd_dict  # Builds a dictionary from the publication info DataFrame

import nltk

nltk.download('punkt_tab')
nltk.download('averaged_perceptron_tagger_eng')

[nltk_data] Downloading package punkt_tab to
[nltk_data]     /home/sartaaj/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     /home/sartaaj/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger_eng is already up-to-
[nltk_data]       date!


True

### <b>Matching with XML</b>

<b>Preprocessing and arranging everything appropriately</b>

To perform matching with .xml files:
1. Create a folder named XML to hold the XML files. You can organise the files inside it however you like (for example, under ```.../XML/folder_name```).
2. Define the path to the folder containing the papers and name it ```paper_folder_path```.
3. Define the path to CSD_Info.csv and name it ```csd_info_path```.
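Before running the pipeline, it can help to confirm that the layout from the steps above is in place. This is a small sketch using only the standard library. `check_layout` is a hypothetical helper, not part of MOF_ChemUnity, and the demo builds a throwaway directory tree so that it runs anywhere.

```python
import tempfile
from pathlib import Path


def check_layout(paper_folder_path, csd_info_path):
    """Return a list of problems; an empty list means the layout looks fine."""
    problems = []
    folder = Path(paper_folder_path)
    if "XML" not in folder.parts:
        problems.append(f"{folder} is not inside a folder named 'XML'")
    if not folder.is_dir():
        problems.append(f"paper folder not found: {folder}")
    elif not any(folder.rglob("*.xml")):
        problems.append(f"no .xml files under {folder}")
    if not Path(csd_info_path).is_file():
        problems.append(f"CSD info file not found: {csd_info_path}")
    return problems


# Build a temporary tree that mirrors the layout used in this notebook
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "XML" / "ACS").mkdir(parents=True)
    (root / "XML" / "ACS" / "paper.tei.xml").write_text("<TEI/>")
    (root / "info").mkdir()
    (root / "info" / "CSD_Info.csv").write_text("CSD code,DOI\n")

    ok = check_layout(root / "XML" / "ACS", root / "info" / "CSD_Info.csv")
    bad = check_layout(root / "XML" / "Missing", root / "info" / "CSD_Info.csv")

print(ok)
print(bad)
```

Returning a list of messages rather than raising on the first problem lets all layout issues be fixed in one pass.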

In [6]:
# Define path to the folder containing all papers to be text mined
# If using XML files, this folder should sit under a parent folder named "XML"
paper_folder_path = 'Examples/XML/ACS'

# Define path to file containing all CSD info extracted from CSD API
csd_info_path = 'info/CSD_Info.csv'

# List of columns we want to take from our master CSD file and put into our prompt
feature_list = [
    "CSD code", 
    "DOI",
    "Chemical Name",
    "Space group", 
    "Metal types",
    "Molecular formula",
    "Synonyms",
    "a",
    "b",
    "c"
]

In [7]:
# Initialize Data_Prep class
Prepare_Data = Data_Prep(paper_folder_path, csd_info_path, feature_list)
publication_data = Prepare_Data.gather_info()

publication_data.head()

Unnamed: 0,DOI,File Name,File Format,File Path,Journal,CSD code,Chemical Name,Space group,Metal types,Molecular formula,Synonyms,a,b,c
0,10.1021/acs.analchem.8b00494,ac8b00494.tei.xml,xml,Examples/XML/ACS/ac8b00494.tei.xml,Journal(Analytical Chemistry),CIJBIJ,catena-((μ-5-[(4-carboxylatophenoxy)methyl]ben...,C2/c,Tb,C88H92N8O36Tb4,[],28.614,14.383,13.644
1,10.1021/acs.analchem.8b00494,ac8b00494.tei.xml,xml,Examples/XML/ACS/ac8b00494.tei.xml,Journal(Analytical Chemistry),CIJBEF,catena-((μ-5-[(4-carboxylatophenoxy)methyl]ben...,C2/c,Eu,C88Eu4H92N8O36,[],28.22,14.54,13.394
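Each row of this table is later turned into one dictionary entry per CSD code by `csd_dict`. Its implementation is not shown in this notebook, so the following is only a sketch of the transformation. The key names are inferred from the dictionary printed by the prediction loop later in this notebook, and `csd_dict_sketch` and `COLUMN_TO_KEY` are hypothetical names.

```python
import pandas as pd

# Assumed mapping from DataFrame columns to the keys seen in the printed
# dictionary; the real csd_dict may differ
COLUMN_TO_KEY = {
    "Space group": "Space Group",
    "Metal types": "Metal Nodes",
    "Chemical Name": "Chemical Name",
    "a": "a",
    "b": "b",
    "c": "c",
    "Molecular formula": "Molecular Formula",
    "Synonyms": "Synonyms",
}


def csd_dict_sketch(df):
    """Hypothetical stand-in for csd_dict: one nested entry per CSD code."""
    out = {}
    for _, row in df.iterrows():
        out[row["CSD code"]] = {key: row[col] for col, key in COLUMN_TO_KEY.items()}
    return out


# The two rows shown above, with the full chemical names from the loop output
rows = pd.DataFrame({
    "CSD code": ["CIJBIJ", "CIJBEF"],
    "Space group": ["C2/c", "C2/c"],
    "Metal types": ["Tb", "Eu"],
    "Chemical Name": [
        "catena-((μ-5-[(4-carboxylatophenoxy)methyl]benzene-1,3-dicarboxylato)-bis(dimethylformamide)-terbium)",
        "catena-((μ-5-[(4-carboxylatophenoxy)methyl]benzene-1,3-dicarboxylato)-bis(dimethylformamide)-europium)",
    ],
    "a": [28.614, 28.22],
    "b": [14.383, 14.54],
    "c": [13.644, 13.394],
    "Molecular formula": ["C88H92N8O36Tb4", "C88Eu4H92N8O36"],
    "Synonyms": ["[]", "[]"],
})

csd = csd_dict_sketch(rows)
print(list(csd))
print(csd["CIJBIJ"]["Metal Nodes"])
```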


#### <b>Running The Prediction Loop</b>

In [8]:
# Instantiate Agent
agent = MatchingAgent()

# Dynamically create the results_file path
results_dir = os.path.join(paper_folder_path, 'results')
results_file = os.path.join(results_dir, 'matching.csv')
sub_batch_size = 3000

# Ensure the results directory exists
os.makedirs(results_dir, exist_ok=True)

# Collect the unique DOIs from publication_data (defined in the previous cell)
unique_DOIs = publication_data["DOI"].unique()

# Resume from an existing results file, if there is one
if os.path.exists(results_file):
    existing_results = pd.read_csv(results_file)
    processed_DOIs = set(existing_results["DOI"])  # DOIs already processed, taken from the 'DOI' column
    results = existing_results.to_dict(orient="list")
    print(processed_DOIs)
else:
    processed_DOIs = set()
    results = {"MOF Name": [], "CSD Ref Code": [], "Justification": [], "DOI": []}

# Filter remaining DOIs to process
remaining_DOIs = [DOI for DOI in unique_DOIs if DOI not in processed_DOIs]
print(f"Total DOIs to process: {len(remaining_DOIs)}")
print(remaining_DOIs)

# Process sub-batch
for DOI in remaining_DOIs[:sub_batch_size]:
    
    # Filter csd_data based on DOI
    filtered_csd_data = publication_data[publication_data["DOI"] == DOI]

    print(f"Processing DOI: {DOI}")

    if not filtered_csd_data.empty:
        # Get the associated file path (only valid when rows exist for this DOI)
        file_path = filtered_csd_data.iloc[0]['File Path']
        print(f"File Path: {file_path}")
        print("-" * 28)
        # Call the csd_dict function with the filtered data
        csd = csd_dict(filtered_csd_data)

        print(csd)
        # Get Matching Agent Response
        mofs,docs = agent.agent_response(csd, file_path, ret_docs=True)

        # Collect results
        for mof in mofs.mofs:
            results["MOF Name"].append(mof.name)
            results["CSD Ref Code"].append(mof.refcode)
            results["Justification"].append(mof.justification)
            results["DOI"].append(DOI)
    else:
        print(f"No data found for DOI {DOI}")

# Save updated results to CSV
pd.DataFrame(results).to_csv(results_file, index=False)
print(f"Results saved to {results_file}.")
print(f"Processed {len(remaining_DOIs[:sub_batch_size])} DOIs in this batch.")

Total DOIs to process: 1
['10.1021/acs.analchem.8b00494']
Processing DOI: 10.1021/acs.analchem.8b00494
File Path: Examples/XML/ACS/ac8b00494.tei.xml
----------------------------
{'CIJBIJ': {'Space Group': 'C2/c', 'Metal Nodes': 'Tb', 'Chemical Name': 'catena-((μ-5-[(4-carboxylatophenoxy)methyl]benzene-1,3-dicarboxylato)-bis(dimethylformamide)-terbium)', 'a': 28.614, 'b': 14.383, 'c': 13.644, 'Molecular Formula': 'C88H92N8O36Tb4', 'Synonyms': '[]'}, 'CIJBEF': {'Space Group': 'C2/c', 'Metal Nodes': 'Eu', 'Chemical Name': 'catena-((μ-5-[(4-carboxylatophenoxy)methyl]benzene-1,3-dicarboxylato)-bis(dimethylformamide)-europium)', 'a': 28.22, 'b': 14.54, 'c': 13.394, 'Molecular Formula': 'C88Eu4H92N8O36', 'Synonyms': '[]'}}
Detected TEI XML format in 'Examples/XML/ACS/ac8b00494.tei.xml'
Converted 'Examples/XML/ACS/ac8b00494.tei.xml' to Markdown at 'Examples/MD/ACS/ac8b00494.md'
Saved vector store for Examples/XML/ACS/ac8b00494.tei.xml in Examples/XML/vs/ac8b00494.tei
--------------ac8b00494.te
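The loop above skips DOIs that are already in the results file but writes the CSV only once, after the whole sub-batch. For long runs, a variant that appends rows after every DOI keeps progress if the run is interrupted. The following is a sketch of that pattern, not the notebook's implementation: `run_resumable` is a hypothetical helper, and `fake_match` stands in for the call to `MatchingAgent`.

```python
import os
import tempfile

import pandas as pd

COLUMNS = ["MOF Name", "CSD Ref Code", "Justification", "DOI"]


def run_resumable(dois, match_fn, results_file):
    """Process DOIs not yet in results_file, appending rows after each DOI."""
    if not os.path.exists(results_file):
        # Start with a header-only file so later appends never need a header
        pd.DataFrame(columns=COLUMNS).to_csv(results_file, index=False)
    done = set(pd.read_csv(results_file)["DOI"])
    for doi in dois:
        if doi in done:
            continue
        rows = pd.DataFrame(match_fn(doi), columns=COLUMNS)
        rows.to_csv(results_file, mode="a", index=False, header=False)
        done.add(doi)
    return pd.read_csv(results_file)


calls = []


def fake_match(doi):
    """Stand-in for the matching agent: returns one placeholder row per DOI."""
    calls.append(doi)
    return [{"MOF Name": "example", "CSD Ref Code": "XXXXXX",
             "Justification": "placeholder", "DOI": doi}]


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "matching.csv")
    run_resumable(["doi-1", "doi-2"], fake_match, path)
    # Second run: only the new DOI triggers a call
    final = run_resumable(["doi-1", "doi-2", "doi-3"], fake_match, path)

print(calls)
print(len(final))
```

A DOI for which the agent returns no rows is not recorded in this sketch, so it would be retried on the next run.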

One way to check whether the matching is correct is to look up the DOI in the CSD and compare the reference codes associated with it against the matched ones.

In [9]:
df = pd.read_csv("Examples/XML/ACS/results/matching.csv")
df

Unnamed: 0,MOF Name,CSD Ref Code,Justification,DOI
0,C23.5H28.5EuN2.5O10.5<|>complex 2,CIJBEF,The MOF with the empirical formula C23.5H28.5E...,10.1021/acs.analchem.8b00494
1,C23.5H28.5TbN2.5O10.5<|>complex 3,CIJBIJ,The MOF with the empirical formula C23.5H28.5T...,10.1021/acs.analchem.8b00494
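The manual check described above can be complemented by a quick automated one: for each DOI, every CSD code listed in `publication_data` should appear exactly once among the matched ref codes. The following is a sketch under that assumption. `check_matches` is a hypothetical helper, and the small DataFrames reproduce only the columns it needs from the tables in this notebook.

```python
import pandas as pd


def check_matches(publication_data, results, unmatched_label="not provided"):
    """Compare the expected CSD codes per DOI with the matched ref codes."""
    report = {}
    for doi, group in publication_data.groupby("DOI"):
        expected = set(group["CSD code"])
        matched = results.loc[results["DOI"] == doi, "CSD Ref Code"]
        # Rows without a ref code are not matches
        matched = matched[matched != unmatched_label]
        report[doi] = {
            "missing": sorted(expected - set(matched)),
            "unexpected": sorted(set(matched) - expected),
            "duplicated": sorted(matched[matched.duplicated()].unique()),
        }
    return report


doi = "10.1021/acs.analchem.8b00494"
publication_data_demo = pd.DataFrame({
    "DOI": [doi, doi],
    "CSD code": ["CIJBIJ", "CIJBEF"],
})
results_demo = pd.DataFrame({
    "DOI": [doi, doi],
    "CSD Ref Code": ["CIJBEF", "CIJBIJ"],
})

report = check_matches(publication_data_demo, results_demo)
print(report[doi])
```

Empty lists in all three fields mean that each expected code was matched once and nothing else was assigned. This does not confirm that each code was paired with the right MOF name, so it does not replace the lookup in the CSD.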


### <b>Another case: Elsevier XMLs</b>

The previous example used XML from ACS. This example performs the same matching with Elsevier XML. The procedure is exactly the same as for the ACS XML files.
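The log lines "Detected TEI XML format" and "Detected Elsevier XML format" in the cell outputs suggest that the package identifies the flavour of each file before converting it. How it does so is not shown here. One possible heuristic, sketched below, is to inspect the root element. The root names used (`TEI` for TEI files and `full-text-retrieval-response` for Elsevier full-text responses) are assumptions about typical files, and `detect_xml_flavour` is a hypothetical function.

```python
import xml.etree.ElementTree as ET


def detect_xml_flavour(xml_text):
    """Guess the publisher format from the root element (assumed heuristic)."""
    root = ET.fromstring(xml_text)
    # ElementTree writes namespaced tags as "{namespace}localname"
    local_name = root.tag.rsplit("}", 1)[-1]
    if local_name == "TEI":
        return "TEI"
    if local_name == "full-text-retrieval-response":
        return "Elsevier"
    return "unknown"


# Minimal in-memory documents, not real papers
tei = '<TEI xmlns="http://www.tei-c.org/ns/1.0"><text/></TEI>'
elsevier = (
    '<full-text-retrieval-response '
    'xmlns="http://www.elsevier.com/xml/svapi/article/dtd">'
    '<coredata/></full-text-retrieval-response>'
)
other = "<article><body/></article>"

for sample in (tei, elsevier, other):
    print(detect_xml_flavour(sample))
```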

In [10]:
# Define path to the folder containing all papers to be text mined
# If using XML files, this folder should sit under a parent folder named "XML"
paper_folder_path = 'Examples/XML/Elsevier'

# Define path to file containing all CSD info extracted from CSD API
csd_info_path = 'info/CSD_Info.csv'

# List of columns we want to take from our master CSD file and put into our prompt
feature_list = [
    "CSD code", 
    "DOI",
    "Chemical Name",
    "Space group", 
    "Metal types",
    "Molecular formula",
    "Synonyms",
    "a",
    "b",
    "c"
]

In [11]:
# Initialize Data_Prep class
Prepare_Data = Data_Prep(paper_folder_path, csd_info_path, feature_list)
publication_data = Prepare_Data.gather_info()

publication_data.head()

Unnamed: 0,DOI,File Name,File Format,File Path,Journal,CSD code,Chemical Name,Space group,Metal types,Molecular formula,Synonyms,a,b,c
0,10.1016/j.molstruc.2017.11.128,10.1016_j.molstruc.2017.11.128.xml,xml,Examples/XML/Elsevier/10.1016_j.molstruc.2017....,Journal(Journal of Molecular Structure),OWEZEW01,"catena-((μ-hydrogen benzene-1,3,5-tricarboxyla...",P21/c,Ca,C36Ca4H48O40,[],10.1977,16.4581,7.4796


In [12]:
# Instantiate Agent
agent = MatchingAgent()

# Dynamically create the results_file path
results_dir = os.path.join(paper_folder_path, 'results')
results_file = os.path.join(results_dir, 'matching.csv')
sub_batch_size = 3000

# Ensure the results directory exists
os.makedirs(results_dir, exist_ok=True)

# Collect the unique DOIs from publication_data (defined in the previous cell)
unique_DOIs = publication_data["DOI"].unique()

# Resume from an existing results file, if there is one
if os.path.exists(results_file):
    existing_results = pd.read_csv(results_file)
    processed_DOIs = set(existing_results["DOI"])  # DOIs already processed, taken from the 'DOI' column
    results = existing_results.to_dict(orient="list")
    print(processed_DOIs)
else:
    processed_DOIs = set()
    results = {"MOF Name": [], "CSD Ref Code": [], "Justification": [], "DOI": []}

# Filter remaining DOIs to process
remaining_DOIs = [DOI for DOI in unique_DOIs if DOI not in processed_DOIs]
print(f"Total DOIs to process: {len(remaining_DOIs)}")
print(remaining_DOIs)

# Process sub-batch
for DOI in remaining_DOIs[:sub_batch_size]:
    
    # Filter csd_data based on DOI
    filtered_csd_data = publication_data[publication_data["DOI"] == DOI]

    print(f"Processing DOI: {DOI}")

    if not filtered_csd_data.empty:
        # Get the associated file path (only valid when rows exist for this DOI)
        file_path = filtered_csd_data.iloc[0]['File Path']
        print(f"File Path: {file_path}")
        print("-" * 28)
        # Call the csd_dict function with the filtered data
        csd = csd_dict(filtered_csd_data)

        print(csd)
        # Get Matching Agent Response
        mofs,docs = agent.agent_response(csd, file_path, ret_docs=True)

        # Collect results
        for mof in mofs.mofs:
            results["MOF Name"].append(mof.name)
            results["CSD Ref Code"].append(mof.refcode)
            results["Justification"].append(mof.justification)
            results["DOI"].append(DOI)
    else:
        print(f"No data found for DOI {DOI}")

# Save updated results to CSV
pd.DataFrame(results).to_csv(results_file, index=False)
print(f"Results saved to {results_file}.")
print(f"Processed {len(remaining_DOIs[:sub_batch_size])} DOIs in this batch.")

Total DOIs to process: 1
['10.1016/j.molstruc.2017.11.128']
Processing DOI: 10.1016/j.molstruc.2017.11.128
File Path: Examples/XML/Elsevier/10.1016_j.molstruc.2017.11.128.xml
----------------------------
{'OWEZEW01': {'Space Group': 'P21/c', 'Metal Nodes': 'Ca', 'Chemical Name': 'catena-((μ-hydrogen benzene-1,3,5-tricarboxylate)-tetra-aqua-calcium)', 'a': 10.1977, 'b': 16.4581, 'c': 7.4796, 'Molecular Formula': 'C36Ca4H48O40', 'Synonyms': '[]'}}
Detected Elsevier XML format in 'Examples/XML/Elsevier/10.1016_j.molstruc.2017.11.128.xml'
Converted 'Examples/XML/Elsevier/10.1016_j.molstruc.2017.11.128.xml' to Markdown at 'Examples/MD/Elsevier/10.1016_j.molstruc.2017.11.128.md'
Saved vector store for Examples/XML/Elsevier/10.1016_j.molstruc.2017.11.128.xml in Examples/XML/vs/10.1016_j.molstruc.2017.11.128
--------------10.1016_j.molstruc.2017.11.128.xml--------------
Action: read_doc

Result: 
1.  -MOF name: C9H12CaO10<|>One-pot Self-assembly Reaction
    -CSD Ref Code: OWEZEW01
    -Justif

In [13]:
df = pd.read_csv("Examples/XML/Elsevier/results/matching.csv")
df

Unnamed: 0,MOF Name,CSD Ref Code,Justification,DOI
0,C9H12CaO10<|>One-pot Self-assembly Reaction,OWEZEW01,The MOF C9H12CaO10 synthesized by the one-pot ...,10.1016/j.molstruc.2017.11.128
1,C9H14MgO11·H2O<|>One-pot Self-assembly Reaction,not provided,The MOF C9H14MgO11·H2O synthesized by the one-...,10.1016/j.molstruc.2017.11.128
2,C18H34Mg3O26·4H2O<|>Ion Exchange Method,not provided,The MOF C18H34Mg3O26·4H2O synthesized by the i...,10.1016/j.molstruc.2017.11.128
3,C18H30Ca3O24<|>Ion Exchange Method,not provided,The MOF C18H30Ca3O24 synthesized by the ion ex...,10.1016/j.molstruc.2017.11.128
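Each `MOF Name` entry in the results appears to pack several names for the same compound into one string, separated by `<|>`. This reading is inferred from the outputs in this notebook rather than from documentation. Under that assumption, the names can be separated with a small helper (`split_names` is a hypothetical name):

```python
def split_names(mof_name, sep="<|>"):
    """Split a packed MOF name into its individual names, dropping blanks."""
    return [part.strip() for part in mof_name.split(sep) if part.strip()]


names = split_names("C9H12CaO10<|>One-pot Self-assembly Reaction")
print(names)
```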


### <b>Another case: .md files</b>

This case performs matching on .md files, which have been converted from .pdf format. Because the DOI is the same as in the Elsevier XML example, we expect the same CSD Ref Code match. The unmatched compounds that are reported can differ, as the outputs of the two runs show.

In [17]:
paper_folder_path = 'Examples/MD/Elsevier'

# Define path to file containing all CSD info extracted from CSD API
csd_info_path = 'info/CSD_Info.csv'

# List of columns we want to take from our master CSD file and put into our prompt
feature_list = [
    "CSD code", 
    "DOI",
    "Chemical Name",
    "Space group", 
    "Metal types",
    "Molecular formula",
    "Synonyms",
    "a",
    "b",
    "c"
]

In [18]:
# Initialize Data_Prep class
Prepare_Data = Data_Prep(paper_folder_path, csd_info_path, feature_list)
publication_data = Prepare_Data.gather_info()

publication_data.head()

Unnamed: 0,DOI,File Name,File Format,File Path,Journal,CSD code,Chemical Name,Space group,Metal types,Molecular formula,Synonyms,a,b,c
0,10.1016/j.molstruc.2017.11.128,10.1016_j.molstruc.2017.11.128.md,md,Examples/MD/Elsevier/10.1016_j.molstruc.2017.1...,Journal(Journal of Molecular Structure),OWEZEW01,"catena-((μ-hydrogen benzene-1,3,5-tricarboxyla...",P21/c,Ca,C36Ca4H48O40,[],10.1977,16.4581,7.4796


In [19]:
# Instantiate Agent
agent = MatchingAgent()

# Dynamically create the results_file path
results_dir = os.path.join(paper_folder_path, 'results')
results_file = os.path.join(results_dir, 'matching.csv')
sub_batch_size = 3000

# Ensure the results directory exists
os.makedirs(results_dir, exist_ok=True)

# Collect the unique DOIs from publication_data (defined in the previous cell)
unique_DOIs = publication_data["DOI"].unique()

# Resume from an existing results file, if there is one
if os.path.exists(results_file):
    existing_results = pd.read_csv(results_file)
    processed_DOIs = set(existing_results["DOI"])  # DOIs already processed, taken from the 'DOI' column
    results = existing_results.to_dict(orient="list")
    print(processed_DOIs)
else:
    processed_DOIs = set()
    results = {"MOF Name": [], "CSD Ref Code": [], "Justification": [], "DOI": []}

# Filter remaining DOIs to process
remaining_DOIs = [DOI for DOI in unique_DOIs if DOI not in processed_DOIs]
print(f"Total DOIs to process: {len(remaining_DOIs)}")
print(remaining_DOIs)

# Process sub-batch
for DOI in remaining_DOIs[:sub_batch_size]:
    
    # Filter csd_data based on DOI
    filtered_csd_data = publication_data[publication_data["DOI"] == DOI]

    print(f"Processing DOI: {DOI}")

    if not filtered_csd_data.empty:
        # Get the associated file path (only valid when rows exist for this DOI)
        file_path = filtered_csd_data.iloc[0]['File Path']
        print(f"File Path: {file_path}")
        print("-" * 28)
        # Call the csd_dict function with the filtered data
        csd = csd_dict(filtered_csd_data)

        print(csd)
        # Get Matching Agent Response
        mofs,docs = agent.agent_response(csd, file_path, ret_docs=True)

        # Collect results
        for mof in mofs.mofs:
            results["MOF Name"].append(mof.name)
            results["CSD Ref Code"].append(mof.refcode)
            results["Justification"].append(mof.justification)
            results["DOI"].append(DOI)
    else:
        print(f"No data found for DOI {DOI}")

# Save updated results to CSV
pd.DataFrame(results).to_csv(results_file, index=False)
print(f"Results saved to {results_file}.")
print(f"Processed {len(remaining_DOIs[:sub_batch_size])} DOIs in this batch.")

Total DOIs to process: 1
['10.1016/j.molstruc.2017.11.128']
Processing DOI: 10.1016/j.molstruc.2017.11.128
File Path: Examples/MD/Elsevier/10.1016_j.molstruc.2017.11.128.md
----------------------------
{'OWEZEW01': {'Space Group': 'P21/c', 'Metal Nodes': 'Ca', 'Chemical Name': 'catena-((μ-hydrogen benzene-1,3,5-tricarboxylate)-tetra-aqua-calcium)', 'a': 10.1977, 'b': 16.4581, 'c': 7.4796, 'Molecular Formula': 'C36Ca4H48O40', 'Synonyms': '[]'}}
Saved vector store for Examples/MD/Elsevier/10.1016_j.molstruc.2017.11.128.md in Examples/MD/vs/10.1016_j.molstruc.2017.11.128
--------------10.1016_j.molstruc.2017.11.128.md--------------
Action: read_doc

Result: 
1.  -MOF name: C9H12CaO10<|>One-pot Self-assembly Reaction
    -CSD Ref Code: OWEZEW01
    -Justification: The MOF C9H12CaO10 synthesized by the one-pot self-assembly reaction has a monoclinic crystal system with a space group of P21/c, which matches the space group provided in the CSD reference code OWEZEW01. The metal node is calciu

In [20]:
# The CSD Ref Code match should agree with the previous example
df = pd.read_csv("Examples/MD/Elsevier/results/matching.csv")
df

Unnamed: 0,MOF Name,CSD Ref Code,Justification,DOI
0,C9H12CaO10<|>One-pot Self-assembly Reaction,OWEZEW01,The MOF C9H12CaO10 synthesized by the one-pot ...,10.1016/j.molstruc.2017.11.128
1,C18H30Ca3O24<|>Ion Exchange Method,not provided,The MOF C18H30Ca3O24 synthesized by the ion ex...,10.1016/j.molstruc.2017.11.128
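To compare the .md run with the Elsevier XML run programmatically, we can compare only the rows that carry a ref code and list separately the names reported by just one run. The sketch below rebuilds the two result tables from the outputs shown in this notebook, keeping only the columns needed. `matched_pairs` is a hypothetical helper.

```python
import pandas as pd


def matched_pairs(results, unmatched_label="not provided"):
    """Return the set of (MOF Name, CSD Ref Code) pairs that have a ref code."""
    kept = results[results["CSD Ref Code"] != unmatched_label]
    return set(zip(kept["MOF Name"], kept["CSD Ref Code"]))


xml_results = pd.DataFrame({
    "MOF Name": [
        "C9H12CaO10<|>One-pot Self-assembly Reaction",
        "C9H14MgO11·H2O<|>One-pot Self-assembly Reaction",
        "C18H34Mg3O26·4H2O<|>Ion Exchange Method",
        "C18H30Ca3O24<|>Ion Exchange Method",
    ],
    "CSD Ref Code": ["OWEZEW01", "not provided", "not provided", "not provided"],
})
md_results = pd.DataFrame({
    "MOF Name": [
        "C9H12CaO10<|>One-pot Self-assembly Reaction",
        "C18H30Ca3O24<|>Ion Exchange Method",
    ],
    "CSD Ref Code": ["OWEZEW01", "not provided"],
})

same_matches = matched_pairs(xml_results) == matched_pairs(md_results)
only_in_xml = sorted(set(xml_results["MOF Name"]) - set(md_results["MOF Name"]))

print(same_matches)
print(only_in_xml)
```

For the two runs shown, the OWEZEW01 match agrees, and the XML run additionally reports two magnesium compounds without a ref code.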
