## <b>Relevant Paper Retrieval</b>

We begin with a keyword search on the Scopus API for a given MOF and return the abstracts of the matching papers. Because this query is deliberately broad, we want to retrieve as many papers as possible at this stage.

Next, we run a vector similarity search over all the abstracts to rank the papers by relevance to our query. Finally, as an example, synthesis extraction is performed on the resulting cross-document set, in exactly the same way as in ```ChemUnity_Synthesis.ipynb```.

In [1]:
import requests
import openai
import numpy as np
import pandas as pd
import faiss
import time  
import os
import glob
from langchain_openai import ChatOpenAI
from langchain.callbacks import get_openai_callback

from MOF_ChemUnity.utils.cross_doc_utils import CrossDocUtil
from MOF_ChemUnity.Agents.ExtractionAgent import ExtractionAgent
from MOF_ChemUnity.utils.DocProcessor import DocProcessor
from MOF_ChemUnity.Prompts.Synthesis_Prompts import SYNTHESIS_EXTRACTION
from MOF_ChemUnity.utils.DataModels import Synthesis

<b>Search Scopus for articles related to the given MOF name and synthesis, retrieve metadata (Title, DOI), and fetch full abstracts.</b>

The ``SCOPUS_API_KEY`` is used to query Scopus for papers on the desired MOF (HKUST-1 in this example). The results are stored in ```papers_df```, which holds, for each paper, the:
1. Title of paper
2. Abstract of paper
3. DOI of paper
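
For orientation, here is a minimal sketch of how such a keyword search could be phrased against the public Scopus Search endpoint. It is not the ``CrossDocUtil.search_scopus`` implementation: the helper names `build_scopus_request` and `entries_to_rows`, the `TITLE-ABS-KEY(...)` query string, and the example payload are all assumptions made for illustration, and abstract retrieval is not shown.

```python
SCOPUS_SEARCH_URL = "https://api.elsevier.com/content/search/scopus"


def build_scopus_request(mof_name, api_key, count=25, start=0):
    """Return (url, headers, params) for one page of a Scopus keyword search."""
    headers = {"X-ELS-APIKey": api_key, "Accept": "application/json"}
    params = {
        # Assumed query phrasing: MOF name plus "synthesis" in title/abstract/keywords
        "query": f'TITLE-ABS-KEY("{mof_name}" AND synthesis)',
        "count": count,
        "start": start,
    }
    return SCOPUS_SEARCH_URL, headers, params


def entries_to_rows(payload):
    """Flatten a Scopus JSON response into title/doi dicts; missing values become 'N/A'."""
    entries = payload.get("search-results", {}).get("entry", [])
    return [
        {"title": e.get("dc:title", "N/A"), "doi": e.get("prism:doi", "N/A")}
        for e in entries
    ]


# Offline example with a hand-written payload shaped like a Scopus response.
# Titles and DOI are placeholders, not real search results.
example_payload = {
    "search-results": {
        "entry": [
            {"dc:title": "Placeholder paper A", "prism:doi": "10.0000/placeholder.a"},
            {"dc:title": "Placeholder paper B"},  # entry without a DOI
        ]
    }
}
rows = entries_to_rows(example_payload)
rows_with_doi = [r for r in rows if r["doi"] != "N/A"]
```

The returned list of dicts can be passed straight to `pd.DataFrame(...)`. Marking missing DOIs as `"N/A"` mirrors the filter applied to ```papers_df``` in the search cell.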

In [None]:
# Set API keys (replace with your own API keys)
SCOPUS_API_KEY = "YOUR_SCOPUS_API_KEY"
OPENAI_API_KEY = "YOUR_OPENAI_API_KEY"
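openai.api_key = OPENAI_API_KEY
# ChatOpenAI (used later) reads the key from the environment, so expose it there if not already set
os.environ.setdefault("OPENAI_API_KEY", OPENAI_API_KEY)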

In [3]:
# Search Scopus
MOF_NAME = "HKUST-1"  # Change to the desired MOF
papers_df = CrossDocUtil(SCOPUS_API_KEY = SCOPUS_API_KEY).search_scopus(MOF_NAME, count=500)
papers_df = papers_df[papers_df["doi"] != "N/A"].reset_index(drop=True)
papers_df

Unnamed: 0,title,abstract,doi
0,Optimizing CO<inf>2</inf> hydrogenation to met...,Designing an active Cu/ZnO interface is crucia...,10.1016/j.apcatb.2025.125334
1,In-situ growth of metal–organic framework on i...,The environmental risks posed by tetracycline ...,10.1016/j.seppur.2025.132707
2,Scalable upgrading metal–organic frameworks th...,Some well-established metal–organic frameworks...,10.1016/j.seppur.2025.132270
3,Ion exchange- and enrichment-based technology ...,Mass spectrometry (MS) based proteomics provid...,10.1016/j.chroma.2025.465914
4,Ratiometric fluorescence assay for ciprofloxac...,Preparation of carbon dots (CDs) from biomass ...,10.1016/j.talanta.2024.127477
...,...,...,...
490,Hybrid polymer/MOF membranes for Organic Solve...,One of the main challenges in the field of Org...,10.1016/j.memsci.2016.01.024
491,On controlling the anodic electrochemical film...,Anodic electrochemical synthesis presents itse...,10.1016/j.micromeso.2015.11.060
492,Bi<inf>2</inf>O<inf>3</inf> nanoparticles enca...,We describe a novel procedure to fabricate a r...,10.1039/c6nr00532b
493,Continuous-Flow Microwave Synthesis of Metal-O...,Metal-organic frameworks are having a tremendo...,10.1002/chem.201505139


The next cells embed every abstract, index the embeddings with FAISS, and append the MOF name to each row of the ranked results. This implementation handles one MOF at a time, but it could easily be run for every MOF name in our database, with the resulting abstracts and DOIs appended to a master dataframe.

In [4]:
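# Compute embeddings for all abstracts (reuse a single CrossDocUtil instance)
cross_doc = CrossDocUtil(SCOPUS_API_KEY = SCOPUS_API_KEY)
abstracts = papers_df["abstract"].tolist()
# FAISS works on float32 arrays, so cast explicitly
embeddings = np.array([cross_doc.get_embedding(abstract) for abstract in abstracts], dtype="float32")

# Initialize FAISS vector database (exhaustive search on L2 distance)
dimension = embeddings.shape[1]
index = faiss.IndexFlatL2(dimension)
index.add(embeddings)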

In [5]:
# Example query
query = (
    f"Comprehensive review of synthesis methods for {MOF_NAME}, including solvothermal, "
    "mechanochemical, and microwave-assisted routes, along with comparative analyses of "
    "reaction conditions and yield optimization. Also includes experimental studies detailing "
    f"synthesis procedures and characterization of {MOF_NAME}."
)
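# Rank every indexed abstract; top_k should not exceed the number of vectors in the index
results = CrossDocUtil(SCOPUS_API_KEY = SCOPUS_API_KEY).vector_search(query, index, abstracts, top_k=len(abstracts))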

# Get the columns of papers_df
results_columns = papers_df.columns.tolist()

# Prepare results for DataFrame with the same structure as papers_df
results_data = []

for abstract, idx in results:
    row_data = papers_df.loc[idx].to_dict()  # Get full row from papers_df
    row_data["MOF Name"] = MOF_NAME  # Add MOF Name column
    results_data.append(row_data)

# Create DataFrame with the same columns as papers_df + "MOF Name"
relevant_papers_df = pd.DataFrame(results_data, columns=results_columns + ["MOF Name"])
relevant_papers_df

Unnamed: 0,title,abstract,doi,MOF Name
0,Synthesis and Characterization of Ultrapure HK...,"Herein, an ultrapure HKUST-1 MOFs synthesized ...",10.1002/slct.201904637,HKUST-1
1,Mesoporous HKUST-1 synthesized using solvother...,Metal-organic frameworks type HKUST-1 were syn...,10.31788/RJC.2019.1231968,HKUST-1
2,Statistically optimum HKUST-1 synthesized by r...,"Due to its excellency and versatility, many sy...",10.3390/molecules26216430,HKUST-1
3,Direct synthesis of Al-HKUST-1 and its applica...,Direct synthesis of Al-HKUST-1 was successfull...,10.1016/j.nanoso.2021.100773,HKUST-1
4,HKUST-1-supported cerium catalysts for CO oxid...,The synthesis method of metal–organic framewor...,10.3390/catal10010108,HKUST-1
...,...,...,...,...
490,An antibiotic-free platform for eliminating pe...,Helicobacter pylori (H. pylori) infection rema...,10.1016/j.apsb.2024.03.014,HKUST-1
491,Precise assembly of a zeolite imidazolate fram...,Polypropylene (PP) has been widely used in the...,10.1016/j.memsci.2020.118412,HKUST-1
492,Built-in skeleton Cu/NC-NFs as a sulfur carrie...,Carbon nanofibers are commendable sulfur carri...,10.1039/d5cc00529a,HKUST-1
493,Continuous Synthesis of Metal-Organic Framewor...,No abstract available,10.11203/jar.34.5,HKUST-1
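
For intuition, the ranking step above can be sketched in plain NumPy. ``faiss.IndexFlatL2`` performs an exhaustive search and ranks by squared Euclidean distance; the function below reproduces that behaviour on toy 3-dimensional vectors. The name `rank_by_l2` and the toy data are hypothetical, and real embeddings would come from ``get_embedding``.

```python
import numpy as np


def rank_by_l2(query_vec, doc_vecs, top_k):
    """Return (indices, squared_distances) of the top_k closest rows, nearest first."""
    doc_vecs = np.asarray(doc_vecs, dtype="float32")
    query_vec = np.asarray(query_vec, dtype="float32")
    # Squared Euclidean distance from the query to every document vector
    dists = np.sum((doc_vecs - query_vec) ** 2, axis=1)
    top_k = min(top_k, len(doc_vecs))  # never ask for more neighbours than exist
    order = np.argsort(dists, kind="stable")[:top_k]
    return order, dists[order]


# Toy "embeddings": row 2 equals the query, row 0 is close, row 1 is far away.
toy_docs = [[1.0, 0.0, 0.0], [0.0, 5.0, 0.0], [1.0, 1.0, 0.0]]
toy_query = [1.0, 1.0, 0.0]
indices, distances = rank_by_l2(toy_query, toy_docs, top_k=5)
print(indices.tolist(), distances.tolist())  # [2, 0, 1] [0.0, 1.0, 17.0]
```

Capping `top_k` at the number of indexed vectors matters in practice: when more neighbours are requested than the index holds, FAISS pads the result with `-1` indices, which would then have to be filtered out before looking rows up in ```papers_df```.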


<b>Downloading Relevant Papers</b>

Now that we have the relevant papers, their full texts can be downloaded as XML files with ``CrossDocUtil(SCOPUS_API_KEY=SCOPUS_API_KEY).download_full_texts(relevant_papers_df, save_path)``. Refer to the utils function for more information.
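
The request log printed by the download cell suggests how each DOI is turned into a URL and a file name. The sketch below reconstructs just that mapping; it is not the ``download_full_texts`` implementation. The helper names are hypothetical, and the status-to-counter mapping (200 as success, 404 as not found, anything else as an error) is an assumption based on the summary dictionary printed after the download.

```python
import os
from urllib.parse import quote

ARTICLE_URL = "https://api.elsevier.com/content/article/doi/{doi}?view=FULL"


def full_text_url(doi):
    """Percent-encode the DOI (including '/') and place it in the retrieval URL."""
    return ARTICLE_URL.format(doi=quote(doi, safe=""))


def xml_path(doi, save_path):
    """File naming convention seen in the log: '/' in the DOI becomes '_'."""
    return os.path.join(save_path, doi.replace("/", "_") + ".xml")


def classify_status(status_code):
    """Assumed mapping of an HTTP status code onto the three reported counters."""
    if status_code == 200:
        return "success"
    if status_code == 404:
        return "not_found"
    return "error"


doi = "10.1016/j.nanoso.2021.100773"
print(full_text_url(doi))
print(xml_path(doi, "Cross_Document_Linking/XML"))
```

In the log, the requests that succeed are for `10.1016/...` DOIs, while DOIs from other publishers return 404, which is why a large share of the ranked papers cannot be retrieved through this endpoint.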

In [None]:
save_path = "Cross_Document_Linking/XML"
os.makedirs(save_path, exist_ok=True)  # Create the folder if it doesn't exist

CrossDocUtil(SCOPUS_API_KEY=SCOPUS_API_KEY).download_full_texts(relevant_papers_df, save_path)

Requesting: https://api.elsevier.com/content/article/doi/10.1002%2Fslct.201904637?view=FULL
DOI not found: 10.1002/slct.201904637 (Status: 404)
Requesting: https://api.elsevier.com/content/article/doi/10.31788%2FRJC.2019.1231968?view=FULL
DOI not found: 10.31788/RJC.2019.1231968 (Status: 404)
Requesting: https://api.elsevier.com/content/article/doi/10.3390%2Fmolecules26216430?view=FULL
DOI not found: 10.3390/molecules26216430 (Status: 404)
Requesting: https://api.elsevier.com/content/article/doi/10.1016%2Fj.nanoso.2021.100773?view=FULL
Downloaded: Cross_Document_Linking/XML/10.1016_j.nanoso.2021.100773.xml
Requesting: https://api.elsevier.com/content/article/doi/10.3390%2Fcatal10010108?view=FULL
DOI not found: 10.3390/catal10010108 (Status: 404)
Requesting: https://api.elsevier.com/content/article/doi/10.1016%2Fj.pnsc.2018.08.002?view=FULL
Downloaded: Cross_Document_Linking/XML/10.1016_j.pnsc.2018.08.002.xml
Requesting: https://api.elsevier.com/content/article/doi/10.1039%2Fd3nj06001b?

{'success': 193, 'not_found': 301, 'error': 1}

Now we have to remove from ``relevant_papers_df`` all the papers that failed to download.

In [7]:
# Get list of all filenames in the save_path directory
save_path_files = set(os.listdir(save_path))

# Define a function to check if the corresponding XML file exists
def doi_in_save_path(doi):
    filename = doi.replace('/', '_') + '.xml'
    return filename in save_path_files

# Filter the DataFrame
relevant_papers_df = relevant_papers_df[relevant_papers_df['doi'].apply(doi_in_save_path)].reset_index(drop=True)
relevant_papers_df

Unnamed: 0,title,abstract,doi,MOF Name
0,Direct synthesis of Al-HKUST-1 and its applica...,Direct synthesis of Al-HKUST-1 was successfull...,10.1016/j.nanoso.2021.100773,HKUST-1
1,High efficiency synthesis of HKUST-1 under mil...,This study focuses on the development of a hyd...,10.1016/j.pnsc.2018.08.002,HKUST-1
2,Sonoelectrochemical synthesis of metal-organic...,Here we report a new synergic strategy for the...,10.1016/j.synthmet.2016.07.003,HKUST-1
3,Scalable continuous production of high quality...,Metal Organic Frameworks (MOFs) are materials ...,10.1016/j.cej.2017.05.169,HKUST-1
4,Optimized synthesis of nano-scale high quality...,This study was focused on the development of a...,10.1016/j.micromeso.2018.05.027,HKUST-1
...,...,...,...,...
188,Synthesis of core-double-shell structured Fe<i...,Organic dyes have complex and relatively stabl...,10.1016/j.jpcs.2022.111094,HKUST-1
189,Recent advances of Copper- BTC metal-organic f...,Water borne emerging pollutants represents a s...,10.1016/j.hazl.2023.100094,HKUST-1
190,An antibiotic-free platform for eliminating pe...,Helicobacter pylori (H. pylori) infection rema...,10.1016/j.apsb.2024.03.014,HKUST-1
191,Precise assembly of a zeolite imidazolate fram...,Polypropylene (PP) has been widely used in the...,10.1016/j.memsci.2020.118412,HKUST-1


<b>Synthesis Procedure Extractions</b>

This follows the same steps as the synthesis extraction notebook (``ChemUnity_Synthesis.ipynb``).
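
The extraction loop in this section reads a fixed set of attributes off the object returned by the agent and writes them into result columns. A minimal sketch of that mapping is given here, using a hypothetical stand-in class: `SynthesisSketch` is not the real ``Synthesis`` model from ``MOF_ChemUnity.utils.DataModels`` (which may be defined differently), and the `"Not Provided"` defaults and the `to_row` helper are assumptions for illustration.

```python
from dataclasses import dataclass


# Hypothetical stand-in for the Synthesis data model; field names are the
# attributes that the extraction loop reads via getattr.
@dataclass
class SynthesisSketch:
    metal_precursor: str = "Not Provided"
    organic_linker: str = "Not Provided"
    solvent: str = "Not Provided"
    temperature: str = "Not Provided"
    reaction_time: str = "Not Provided"
    reaction_type: str = "Not Provided"
    synthesis_procedure: str = "Not Provided"
    additional_conditions: str = "Not Provided"
    justification: str = "Not Provided"


# Result column -> attribute name, as used in the extraction loop
COLUMN_TO_ATTR = {
    "Metal Precursor": "metal_precursor",
    "Linker": "organic_linker",
    "Solvent": "solvent",
    "Temperature": "temperature",
    "Reaction Time": "reaction_time",
    "Reaction Type": "reaction_type",
    "Synthesis Procedure": "synthesis_procedure",
    "Additional Conditions": "additional_conditions",
    "Justification": "justification",
}


def to_row(extraction, reference, mof_name):
    """Build one result row; attributes missing on the object fall back to ''."""
    row = {col: getattr(extraction, attr, "") for col, attr in COLUMN_TO_ATTR.items()}
    row["Reference"] = reference
    row["MOF Name"] = mof_name
    return row


example = SynthesisSketch(metal_precursor="Cu(NO3)2.3H2O", temperature="100 °C")
row = to_row(example, "10.1016/j.nanoso.2021.100773", "HKUST-1")
```

Building one complete row per paper keeps every column the same length even when an attribute is missing, which is the property the column-wise `results` dictionary relies on when it is converted to a DataFrame.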

In [8]:
llm = ChatOpenAI(model="gpt-4o", temperature=0.1)
parser_llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
agent = ExtractionAgent(llm=llm)
input_folder = "Cross_Document_Linking/XML"

In [9]:
# Directory and file paths
results_dir = os.path.join(input_folder, 'synthesis_results')
results_file = os.path.join(results_dir, 'synthesis_extractions.csv')

# Set parameters
saving_every = 50  # Save results every X MOFs
batch_size = 4000    # Stop after processing X MOFs (manual restart required)

# Ensure the results directory exists
os.makedirs(results_dir, exist_ok=True)

# Prepare or load the results DataFrame
if os.path.exists(results_file):
    existing_results = pd.read_csv(results_file)

    # Drop "Unnamed: 0" if it exists
    if "Unnamed: 0" in existing_results.columns:
        existing_results = existing_results.drop(columns=["Unnamed: 0"])
    
    processed_refs = set(existing_results["Reference"])
    results = existing_results.to_dict(orient="list")
else:
    processed_refs = set()
    results = {
        "Metal Precursor": [], "Linker": [], "Solvent": [],
        "Temperature": [], "Reaction Time": [], "Reaction Type": [], "Synthesis Procedure": [],
        "Additional Conditions": [], "Justification": [], "Reference": [],
        "MOF Name": []
    }

# Determine rows that haven't been processed yet
remaining_df = relevant_papers_df[~relevant_papers_df["doi"].isin(processed_refs)].reset_index(drop=True)

# Get the total number of MOFs that need to be processed
total_remaining = len(remaining_df)
total_processed = 0  # Tracks how many MOFs have been processed in this run

In [10]:
# Process in batches, but stop after processing batch_size MOFs
for batch_start in range(0, total_remaining, saving_every):
    if total_processed >= batch_size:
        break  # Stop execution when batch_size MOFs are processed

    batch_df = remaining_df.iloc[batch_start:batch_start + saving_every]
    
    # Track number of rows before processing this batch
    prev_length = len(results["Reference"])

    # Process each row in the batch
    for i, row in batch_df.iterrows():
        reference = row["doi"]
        mof = row["MOF Name"]
        
        # Generate file name based on reference
        file_name = os.path.join(input_folder, reference.replace("/", "_") + ".xml")
        print(file_name)
        
        # Get synthesis extraction via agent response
        try:
            response = agent.agent_response(mof, file_name,
                                            SYNTHESIS_EXTRACTION, CoV=False, filtered=False,
                                            gen_extraction_structure=Synthesis,
                                            )
            synthesis_extraction = response

            # Append extracted synthesis data to results dictionary
            results["Metal Precursor"].append(getattr(synthesis_extraction, "metal_precursor", ""))
            results["Linker"].append(getattr(synthesis_extraction, "organic_linker", ""))
            results["Solvent"].append(getattr(synthesis_extraction, "solvent", ""))
            results["Temperature"].append(getattr(synthesis_extraction, "temperature", ""))
            results["Reaction Type"].append(getattr(synthesis_extraction, "reaction_type", ""))
            results["Reaction Time"].append(getattr(synthesis_extraction, "reaction_time", ""))
            results["Synthesis Procedure"].append(getattr(synthesis_extraction, "synthesis_procedure", ""))
            results["Additional Conditions"].append(getattr(synthesis_extraction, "additional_conditions", ""))
            results["Justification"].append(getattr(synthesis_extraction, "justification", ""))
            results["Reference"].append(reference)
            results["MOF Name"].append(mof)
            
            total_processed += 1  # Increment the count of processed MOFs
            
            # Stop early if batch_size limit is reached
            if total_processed >= batch_size:
                break  

        except Exception as e:
            print(f"Error processing row {i} with DOI {reference}: {e}")

    # Save progress every `saving_every` MOFs
    batch_results_df = pd.DataFrame(results)
    batch_results_df.to_csv(results_file, index=False)

    # Calculate how many MOFs are still left to process
    remaining_to_process = max(total_remaining - total_processed, 0)

    print(f"✅ {total_processed} MOFs processed so far. {remaining_to_process} remaining this batch. Progress saved to {results_file}.")

# Final completion message
if total_processed >= batch_size:
    print(f"🚫 Processing stopped after {batch_size} MOFs. {remaining_to_process} MOFs still need processing. Restart script to continue.")
else:
    print("✅ Processing complete! All MOFs processed.")

Cross_Document_Linking/XML/10.1016_j.nanoso.2021.100773.xml
Detected Elsevier XML format in 'Cross_Document_Linking/XML/10.1016_j.nanoso.2021.100773.xml'
Converted 'Cross_Document_Linking/XML/10.1016_j.nanoso.2021.100773.xml' to Markdown at 'Cross_Document_Linking/MD/10.1016_j.nanoso.2021.100773.md'
Action: reading the document
finding all properties of name 1: HKUST-1

Result: 
- Metal Precursor: Cu(NO3)2.3H2O
- Organic Linker: H3BTC (1,3,5-benzenetricarboxylic acid)
- Solvent: Ethanol, DMF (dimethylformamide) in a ratio of 1:1, demineralized water
- Temperature: 100 °C
- Reaction Type: Solvothermal
- Reaction Time: 10 hours
- Synthesis Procedure:
  - Dissolve 1 g of H3BTC in 30 mL of a mixture of ethanol and DMF with a ratio of 1:1.
  - In a separate container, dissolve 2.0772 g of Cu(NO3)2.3H2O in 15 mL of demineralized water.
  - Mix both solutions and stir using a magnetic stirrer until a homogeneous solution is obtained.
  - Carry out a solvothermal process in an oven at 100 °C f

In [11]:
batch_results_df

Unnamed: 0,Metal Precursor,Linker,Solvent,Temperature,Reaction Time,Reaction Type,Synthesis Procedure,Additional Conditions,Justification,Reference,MOF Name
0,Cu(NO3)2.3H2O,"H3BTC (1,3,5-benzenetricarboxylic acid)","Ethanol, DMF (dimethylformamide) in a ratio of...",100 °C,10 hours,Solvothermal,Dissolve 1 g of H3BTC in 30 mL of a mixture of...,Not Provided,HKUST-1 synthesis was performed based on the r...,10.1016/j.nanoso.2021.100773,HKUST-1
1,Copper nitrate trihydrate,"Benzene-1,3,5-tricarboxylic acid (BTC)",Not Provided,50 °C,3 hours,Solvothermal,"Copper nitrate trihydrate and benzene-1,3,5-tr...",Atmospheric pressure,"In a typical synthesis process, copper nitrate...",10.1016/j.pnsc.2018.08.002,HKUST-1
2,Copper wire (Cu),"BTC (1,3,5-benzenetricarboxylic acid)","Dimethylformamide (DMF)/H2O (1:1, v/v)",Not Provided,Not Provided,Sonoelectrochemical,"Dissolve NaNO3 and 1,3,5-benzenetricarboxylic ...",Electrical potential difference: 12V; Ultrasou...,"In a typical process, NaNO3 (0.24molL−1) and 1...",10.1016/j.synthmet.2016.07.003,HKUST-1
3,Copper nitrate hemipentahydrate,"Trimesic acid (1,3,5-benzenetricarboxylic acid)","Ethanol, DMF/Water/Ethanol","60°C, 79°C","5 hours (batch), 13 seconds to 5 minutes (cont...","Solvothermal, Microwave assisted",For batch synthesis: Mix copper nitrate hemipe...,"For microwave synthesis, maintain microwave po...","""All batch tests were performed using 6mL of t...",10.1016/j.cej.2017.05.169,HKUST-1
4,Copper nitrate trihydrate,TMA (trimesic acid),"Deionized water, Ethanol",85 °C,3 hours,Hydro/solvo-thermal,Dissolve 0.03 mol of copper nitrate trihydrate...,"The molar ratio of TEA, Cu2+, and TMA is 6:3:2...","Firstly, 0.03 mol of copper nitrate trihydrate...",10.1016/j.micromeso.2018.05.027,HKUST-1
...,...,...,...,...,...,...,...,...,...,...,...
188,Copper nitrate trihydrate (Cu(NO3)2·3H2O),"H3BTC (1,3,5-benzenetricarboxylic acid)","Deionized water, anhydrous ethanol",150 °C,15 hours,Solvothermal,Disperse and dissolve 5 g of Cu(NO3)2·3H2O in ...,Not Provided,"First, 5 g of Cu(NO3)2·3H2O was dispersed and ...",10.1016/j.jpcs.2022.111094,HKUST-1
189,"Cu(NO3)2·3H2O, Cu(NO3)2·6H2O","H3BTC (1,3,5-benzenetricarboxylic acid)","Water, Ethanol","80 °C, 120 °C, 393 K","3 h, 12 h, 30 min","Solvothermal, Hydrothermal, Microwave-assisted",Solvothermal Approach: Continuously stir H3BTC...,Not Provided,The most extensively utilized technique for th...,10.1016/j.hazl.2023.100094,HKUST-1
190,Cu(CH3COO)2·H2O (Copper(II) acetate monohydrate),"H3BTC (trimesic acid, [1,3,5-benzenetricarboxy...",25% aqueous solution of dimethylformamide (DMF...,Not Provided,"3 hours stirring, 30 minutes sonication",Not Provided,Prepare a solution of 1.2 g Cu(CH3COO)2·H2O an...,Not Provided,HKUST-1 nanoparticles were prepared with sligh...,10.1016/j.apsb.2024.03.014,HKUST-1
191,Cu(CH3COO)2·H2O,"1,3,5-benzene tricarboxylic acid (BTC)",Not Provided,Not Provided,Not Provided,Not Provided,Not Provided,Not Provided,"Zn(NO3)2·6H2O, Co(NO3)2·6H2O, Cu(CH3COO)2·H2O,...",10.1016/j.memsci.2020.118412,HKUST-1
