# Pre-processing of data
The first thing to do is extracting the data we are interested in from the XML files (XML files are used from a dataset of US patent applications from 2001 to 2016. This dataset can be found here: https://figshare.com/articles/dataset/Chemical_reactions_from_US_patents_1976-Sep2016_/5104873). We are interested in the title, the general experimental procedure, the reaction SMILES, but most importantly the product SMILES.
Therefore, using xml.Etree in the following we iterate through the child elements of the root of the file. 


In [37]:
# First we import the necessary libraries

import xml.etree.cElementTree as et     # for parsing the XML file
import pandas as pd
import numpy as np
import os
import re
from groq import Groq # for the LLM Groq queries (needs to be installed via pip)
import json

Now we use a function to iterate over all the XML files in the applications data from 2001. We collect the title, the paragraphText, mostly including the experimental procedures, the reaction SMILES and the product SMILES from it.

In [36]:
def extract_data(root_file):
    """function that extracts data from an XML file and returns a list of dictionaries containing the extracted information.
        Information to be extracted includes the title of the reaction, the experimental procedure, the reaction SMILES, and the product SMILES.

    Args:
        root_file: root of the parsed XML file

    Returns:
        reaction_list: list of dictionaries containing the extracted information
    """
    
    # Define the namespace that prevents mismatching of tags in the XML file
    ns = {'cml': 'http://www.xml-cml.org/schema', 'dl': 'http://bitbucket.org/dan2097'}

    # Create lists to store extracted information
    reaction_list = []

    # Find all <reaction> elements
    reaction_elements = root_file.findall('.//cml:reaction', ns)

    # Iterate over each <reaction> element
    for reaction_element in reaction_elements:
        # Create a dictionary to store information about the reaction
        reaction_dict = {}

        # Extract title
        title = reaction_element.find('.//dl:headingText', ns)
        if title is not None:
            reaction_dict['title'] = title.text

        # Extract paragraph text
        paragraph_text = reaction_element.find('.//dl:paragraphText', ns)
        if paragraph_text is not None:
            reaction_dict['paragraphText'] = paragraph_text.text

        # Extract reaction SMILES
        reaction_smiles = reaction_element.find('.//dl:reactionSmiles', ns)
        if reaction_smiles is not None:
            reaction_dict['reactionSmiles'] = reaction_smiles.text

        # Extract product SMILES
        product_elements = reaction_element.findall('.//cml:product', ns)
        product_smiles = []
        for product_element in product_elements:
            identifier_element = product_element.find('.//cml:identifier[@dictRef="cml:smiles"]', ns)
            if identifier_element is not None:
                smiles_value = identifier_element.get('value')
                if smiles_value is not None:
                    product_smiles.append(smiles_value)
        if product_smiles:
            reaction_dict['productSmiles'] = product_smiles

        # Append the reaction dictionary to the reaction list
        reaction_list.append(reaction_dict)

    # Check if any values were extracted
    #print("Reaction List:", reaction_list)
    return reaction_list


In a second step, we iterate through all files in every folder of the application data of the dataset and extract the data with the function extract_data. This takes quite a bit, but you can see the progress whenever a new folder is treated. 

In [3]:
Applications_list = []

# define path to access first XML file in the folder 2001 of applications
# Insert the path to the folder containing the XML files
for folder in os.listdir(r'C:\Users\milen\git\ppChem\PPChem_TLC\data\applications'):
    folder = os.path.join(r'C:\Users\milen\git\ppChem\PPChem_TLC\data\applications', folder)
    print(folder) 
    for file in os.listdir(folder):
        if file.endswith('.xml'):
            file = os.path.join(folder, file)
            tree = et.parse(file)
         # define root of the XML file to iterate through the file
            root = tree.getroot()
            Applications_list.append(extract_data(root))    
                  

C:\Users\milen\git\ppChem\PPChem_TLC\data\applications\2001
C:\Users\milen\git\ppChem\PPChem_TLC\data\applications\2002
C:\Users\milen\git\ppChem\PPChem_TLC\data\applications\2003
C:\Users\milen\git\ppChem\PPChem_TLC\data\applications\2004
C:\Users\milen\git\ppChem\PPChem_TLC\data\applications\2005
C:\Users\milen\git\ppChem\PPChem_TLC\data\applications\2006
C:\Users\milen\git\ppChem\PPChem_TLC\data\applications\2007
C:\Users\milen\git\ppChem\PPChem_TLC\data\applications\2008
C:\Users\milen\git\ppChem\PPChem_TLC\data\applications\2009
C:\Users\milen\git\ppChem\PPChem_TLC\data\applications\2010


KeyboardInterrupt: 

# Put extracted data into a dataframe.
Now that we created a list of lists containing a dictionnary for every reaction that was extracted from the XML files, we put the lists into a dataframe.

In [None]:
# Create a DataFrame from the extracted data
df_extracts = pd.DataFrame(Applications_list[0])

# Iterate through the list of dictionaries and create a DataFrame
for i in range(1, len(Applications_list)):
    df = pd.DataFrame(Applications_list[i])
    
    # Concatenate the DataFrames
    df_extracts = pd.concat([df_extracts, df], ignore_index=True)

print(df_extracts.shape)
df_extracts.head()


(1939253, 4)


Unnamed: 0,paragraphText,reactionSmiles,productSmiles,title
0,"PL 137,526 describes the hydrogenation of p-te...",[C:1]([C:5]1[CH:10]=[CH:9][C:8]([OH:11])=[CH:7...,[C(C)(C)(C)C1CCC(CC1)O],
1,"Slurry aluminum chloride (140.9 g, 1.075 mol) ...",[Cl-].[Al+3].[Cl-].[Cl-].[Cl:5][CH2:6][CH2:7][...,[ClCCCC(=O)C1=CC=C(C=C1)C(C)C],Step h: 4-Chloro-1-(4-isopropyl-phenyl)-butan-...
2,"Suspend anhydrous AlCl3 (156 g, 1.15 mol) in t...",[Al+3].[Cl-].[Cl-].[Cl-].[Cl:5][CH2:6][CH2:7][...,[ClCCCC(=O)C1=CC=C(C=C1)C],Step d: 4-Chloro-1-(4-methyl-phenyl)-butan-1-one
3,Dissolve 4-chloro-1-(4-isopropyl-phenyl)-butan...,[Cl:1][CH2:2][CH2:3][CH2:4][C:5]([C:7]1[CH:12]...,[BrC(C)(C)C1=CC=C(C=C1)C(CCCCl)=O],1-[4-(1-Bromo-1-methyl-ethyl)-phenyl]-4-chloro...
4,Dissolve 4-chloro-1-(4-isopropyl-phenyl)-butan...,[Cl:1][CH2:2][CH2:3][CH2:4][C:5]([C:7]1[CH:12]...,[BrC(C)(C)C1=CC=C(C=C1)C(CCCCl)=O],1-[4-(1-Bromo-1-methyl-ethyl)-phenyl]-4-chloro...


That Dataframe is huge! Finally, save all the extracted data into a csv file on the local device.

In [None]:
df_extracts.to_csv(r'C:\Users\milen\git\ppChem\PPChem_TLC\extracted_data_raw_applications.csv', index=False)

# Further Processing of the Data using Regex and LLM
Of course, not all entries in the data frame can be used for our model. Many of the experimental procedures do not include any information about the Rf value. Thus, we need to find the entries with Rf values. As all experimental procedures are written differently, we will try to find the value of interest by using Regex (Regular Expressions).

First, we load the extracted data into a new dataframe.

In [3]:
df_new = pd.read_csv(r'C:\Users\milen\git\ppChem\PPChem_TLC\extracted_data_raw_applications.csv')

And we create another dataframe, which only includes the first 1000 rows of the dataframe to reduce the cost of executing the test code below.

In [4]:
df_new.shape
df_new = df_new.iloc[0:1000]
df_new.shape

(1000, 4)

Seems to have worked! Now let's take a closer look at how we will try to extract the Rf values. 
Throughout different testing until now, we found several criteria that need to be defined in the Regex pattern for the Rf value: it should be a number following the general pattern 0.XY with Y not being obligatory and X being some digit between 2 and 8 (this is to exclude as many other matches as possible and as Rf values should ideally be around 0.5, we thought this would be optimal). Furthermore, the pattern should not be followed by other digits (e.g. 0.45005) nor include special signs (0.5:0.4) or temperature values. The last remaining problem is, to distinguish between information about quantity and the Rf value (e.g. 0.56 vs. 0.56 mg).

In [8]:
def extract_rf_eluent(Dataframe: pd.DataFrame):
    """Function that applies defined regex patterns to data in a dataframe and creates 
    the following new columns: 
    Rf value, solvent A, solvent B, % solvent A, % solvent B. If no Rf value can be found, all columns
    are filled with NaN. 

    Args:
        Dataframe (_type_): Dataframe containing the extracted data from the US patents
    """
    # copy the dataframe to leave old dataframe unchanged
    df = Dataframe.copy()
    
    # Define the regex patterns
    Rf_check = r'( ?R[fF]?[ :=(]?)'
    Rf_pattern = r'(0\.(?!0|9)\d{1,2})\b(?! *mg\b| *mL\b| *g\b)' # exclude decimals that start by 0.0x, this yet seems not to work though, try to specifiy that digit after . cannot start with 0.
    #Rf_pattern = r'[ =:]?(0\.[^0][0-9])[^\dmglL](?! mg)' # exclude decimals that start by 0.0x, this yet seems not to work though, try to specifiy that digit after . cannot start with 0.
    
    #set a count to see how many multiple Rf values are found in the dataframe, how many NaN values are found
    count = 0
    count_nan = 0
    
    # Extract the Rf values from the paragraphText and put them into a new column
    for index, row in df.iterrows():
        checkRf = re.findall(Rf_check, row['paragraphText'])
        
        #check if Rf value can be found in the paragraphText column
        if checkRf:
             #try to find the Rf value in the paragraphText column
            match = re.findall(Rf_pattern, row['paragraphText'])
            if match:
                df.at[index, 'Rf_value'] = match[0] # df['paragraphText'].str.extract(Rf_pattern)
            
                # Check if multiple Rf values were found (potential error source)
                if len(match) > 1:
                    print('Multiple Rf values found in paragraphText:', match, 'at index:', index)
                    count += 1
       
        else:
            df.at[index, 'Rf_value'] = np.nan
            count_nan += 1
            
    print("Number of entries with multiple Rf values:", count)
    print("Number of entries with no Rf values found:", count_nan)
    print("Number of entries with Rf values found:", df['Rf_value'].count())
    return df
    

Turns out it would be very painful to do all this with regex. Not the best idea. Instead, we will try to use a LLM in the following. To still reduce the cost of computation, we will pre-filter the dataframe with the subsequent function to only get back data where an Rf value can be found in the experimental procedure. 

In [7]:
def extract_rows_with_rf(Dataframe: pd.DataFrame):
    """Function that extracts rows with Rf values from a dataframe and returns a new dataframe containing only these rows. 

    Args:
        Dataframe (_type_): Dataframe containing the extracted data from the US patents
    """
    # copy the dataframe to leave old dataframe unchanged
    df = Dataframe.copy()
    
    # Define the regex patterns
    Rf_check = r'( ?R[fF][ :=(]?)'
    
    # List to store indices of rows without Rf values
    rows_to_drop = []
    
    # Search for rows with Rf values in the paragraphText column
    for index, row in df.iterrows():
        checkRf = re.findall(Rf_check, row['paragraphText'])

        if not checkRf:
            rows_to_drop.append(index)
               
    # Drop rows without Rf values
    df = df.drop(rows_to_drop)
            
    return df
    

That is still applying the first function and one can clearly see the mess. Won't be possible to extract the exact Rf value in the multiple detected rows.

In [5]:
df_new.head()
df_processed_first_try = extract_rf_eluent(df_new)
df_processed_first_try.head()
#df_processed_first_try.to_csv(r'C:\Users\milen\git\ppChem\PPChem_TLC\extracted_data_first_processing_rf_values.csv', index=False)

NameError: name 'extract_rf_eluent' is not defined

Now the second function: already the dataframe size cut down from 1000 entries to 246 which we can hopefully treat with a LLM.

In [10]:
df_processed_second_try = extract_rows_with_rf(df_new)
df_processed_second_try.head()
df_processed_second_try.shape
df_processed_second_try.to_csv(r'C:\Users\milen\git\ppChem\PPChem_TLC\extracted_data_second_processing_rf_values.csv', index=False)

For the LLM we use the API of the open source model offered by Groq Clouds (https://console.groq.com/docs/quickstart). Different models can be tested out.

In [47]:
#Access token croq: 
import json

# Create a Groq client (it is recommended to use the following Quickstart: https://console.groq.com/docs/quickstart)
# However, this did not work in our case and we had to use the following code to create a client 
client = Groq(
    api_key="", # insert your API key here
)
user_prompt = "Give me the Rf value, the solvent mixture and their ratio of the following procedure in the following format of a dictionary only: Rf= , solvent A= , solvent B= , percent A= , percent B= . If there is a third solvent, please provide the information in the same format, call it additive C and percent additive C = . Only give the dictionnary as output, no additional notes or information!!!"
procedure = "An Rf value of 0.22 was found using DCM/EtOAc 20:1 and 0.4% Hydroxylammonium. "#"Rf(Hex/EtOAc 1:20):0.22"
user_prompt_procedure = user_prompt + procedure
chat_completion = client.chat.completions.create(
    messages=[
        {
            "role": "user",
            "content": user_prompt_procedure
        }
    ],
    model="mixtral-8x7b-32768", # other models: LlaMA3 70 b (llama3-70b-8192) can be found here: https://console.groq.com/docs/models
)

response_str = chat_completion.choices[0].message.content

print(response_str)
type(response_str)

response_dict = json.loads(response_str)
type(response_dict)
print(response_dict["Rf"])


{"Rf": 0.22, "solvent A": "DCM", "solvent B": "EtOAc", "percent A": 95, "percent B": 5, "additive C": "Hydroxylammonium", "percent additive C": 0.4}
0.22


In [65]:
def parse_response(response_str: str):
    """Function that extracts the Rf value, solvent A, solvent B, % solvent A, and % solvent B from a LLM response.

    Args:
        response (str): response from the LLM model 

    Returns:
        rf_value (str)
        solvent_a (str) 
        solvent_b (str) 
        percent_a (str)
        percent_b (str)
        additive_c (str)
        percent_c (str)
    """
    try: 
        
        # Convert the response string to a dictionary
        response_dict = json.loads(response_str)
        
        #extracts values from the response dictionary
        
        rf_value = response_dict["Rf"]
        solvent_a = response_dict["solvent A"]
        solvent_b = response_dict["solvent B"]
        percent_a = response_dict["percent A"]
        percent_b = response_dict["percent B"]
        additive_c = response_dict["additive C"]
        percent_c = response_dict["percent C"]
        
    except KeyError:
        raise KeyError(f"Error extracting values from the response: {response_str}")
    
    
    # search for Rf value in the response using regex
   # rf_match = re.search(r"Rf\s*=\s*([0-9.]+)", response)
    #solvent_a_match = re.search(r"[Ss]olvent A\s*=\s*([A-Za-z\s]+)", response)
    #solvent_b_match = re.search(r"[Ss]olvent B\s*=\s*([A-Za-z\s]+)", response)
    #percent_a_match = re.search(r"%\s*Solvent A\s*=\s*([0-9.]+)%", response)
    #percent_b_match = re.search(r"%\s*Solvent B\s*=\s*([0-9.]+)%", response)
    
    # Extract values from regex matches
    #rf_value = rf_match.group(1) if rf_match else None
    #solvent_a = solvent_a_match.group(1) if solvent_a_match else None
    #solvent_b = solvent_b_match.group(1) if solvent_b_match else None
    #percent_a = percent_a_match.group(1) if percent_a_match else None
    #percent_b = percent_b_match.group(1) if percent_b_match else None

    # Return the extracted values
    return rf_value, solvent_a, solvent_b, percent_a, percent_b, additive_c, percent_c

In [77]:
def get_values(Dataframe: pd.DataFrame):
    """Function that extracts the Rf value from a row in a Dataframe using LLM and returns it.

    Args:
        DataFrame (_type_): DataFrame containing the extracted data from the US patents
    """
    
    client = Groq(
        api_key="" ) # insert your API key here # we have to add a file containing the API key to upload it to github.
    
    
    
    for index, row in Dataframe.iterrows():
        user_prompt = f"Please only provide the Rf value, solvents A and B, additives C (if applicable), and their ratios in percent as a Python dictionary only (without 'Here is the python dictionary') with keys 'Rf', 'solvent A', 'solvent B', 'additive C', 'percent A', 'percent B', and 'percent C' of this procedure: {row['paragraphText']}. If there is no information for one category, put 'Nan'. Do not provide any additional notes or information except for the dictionary!!! Always use this format!!" # "Give me the Rf value, the solvent A, B and additive C if applied and their ratios in percent as a python dictionary calling the keys Rf (should be a number), solvent A (should be a solvent only), solvent B (solvent only), percent A (number), percent B (number), additive C (solvent), percent C (number) of this procedure:" + row['paragraphText'] + "If there is no information for one category of solvent, put Nan. Do not put any other notes or information except for the dictionary!!! Always use this format!!"  # "Give me the Rf value, the solvent mixture and their ratio of the following procedure" + row['paragraphText'] +  "in the following format of a dictionary only: Rf= , solvent A= , solvent B= , percent A= , percent B= . If there is a third solvent, please provide the information in the same format, call it additive C and percent C = . Only give the dictionnary as output, no additional notes or information!!!"        
        
        try:             
            # Call the LLM model to extract the Rf value
            chat_completion = client.chat.completions.create(
                messages=[
                    {
                        "role": "user",
                        "content": user_prompt,
                    }
                ],
                model="mixtral-8x7b-32768", # other models: LlaMA3 70 b (llama3-70b-8192), mixtral (mixtral-8x7b-32768) can be found here: https://console.groq.com/docs/models
            )
            
            # Extract the Rf value from the response
            response = chat_completion.choices[0].message.content
            print(response)
            
            # Parse the response to extract the Rf value, solvent A, solvent B, % solvent A, and % solvent B using the parse_response function
            rf_value, solvent_a, solvent_b, percent_a, percent_b, additive_c, percent_c = parse_response(response)
            
            # Add extracted values to new columns in the dataframe row
            Dataframe.at[index, 'Rf'] = rf_value
            Dataframe.at[index, 'Solvent_A'] = solvent_a
            Dataframe.at[index, 'Solvent_B'] = solvent_b
            Dataframe.at[index, 'Percent_A'] = percent_a
            Dataframe.at[index, 'Percent_B'] = percent_b
            Dataframe.at[index, 'Additive_C'] = additive_c
            Dataframe.at[index, 'Percent_C'] = percent_c
            
        except Exception as e:
            # Print the error message and the index of the row where the error occurred
            print(e)
            print("Error at index:", index)
            continue
        
    return Dataframe
        

        
    

In [78]:
df_processed_second_try = get_values(df_processed_second_try)
df_processed_second_try.head(20)

{'Rf': 0.15,
 'solvent A': 'ethyl acetate',
 'solvent B': 'methanol',
 'additive C': 'Nan',
 'percent A': 'Nan',
 'percent B': '20',
 'percent C': 'Nan'}
Expecting property name enclosed in double quotes: line 1 column 2 (char 1)
Error at index: 524
{'Rf': 0.28,
 'solvent A': 'methylene chloride',
 'solvent B': 'methanol',
 'additive C': 'ammonium hydroxide',
 'percent A': 92,
 'percent B': 7.96,
 'percent C': 0.04}
Expecting property name enclosed in double quotes: line 1 column 2 (char 1)
Error at index: 525
{
'Rf': 0.37,
'solvent A': 'methanol',
'solvent B': 'methylene chloride',
'additive C': 'Nan',
'percent A': 10,
'percent B': 90,
'percent C': 'Nan'
}
Expecting property name enclosed in double quotes: line 2 column 1 (char 2)
Error at index: 538
{'Rf': 0.35, 'solvent A': 'methanol', 'solvent B': 'methylene chloride', 'additive C': 'Nan', 'percent A': 10, 'percent B': 90, 'percent C': 'Nan'}
Expecting property name enclosed in double quotes: line 1 column 2 (char 1)
Error at index

Unnamed: 0,paragraphText,reactionSmiles,productSmiles,title,Rf,Solvent_A,Solvent_B,Percent_A,Percent_B,Additive_C,Percent_C
524,A mixture of 7-trifluoromethylsulfonyloxy-1-(4...,FC(F)(F)S(O[C:7]1[CH:16]=[C:15]2[C:10]([CH:11]...,['CN1CCN(CC1)C1=CC=CC2=CC=C(C=C12)C=1C=NC=NC1'],1-(4-Methylpiperazin-1-yl)-7-(pyrimid-5-yl)nap...,0.15,methylene chloride,methanol,90.0,10.0,ammonium hydroxide,0.1
525,"A mixture of the above acid (0.25 g, 0.82 mmol...",[CH3:1][N:2]1[CH2:7][CH2:6]O[CH2:4][CH2:3]1.O....,['ClC1=CC=C(CNC(=O)C2=CC3=C(C=CC=C3C=C2)N2CCN(...,8-(4-Methylpiperazin-1-yl)naphthalene-2-carbox...,0.28,methylene chloride,methanol,92.0,8.0,ammonium hydroxide,0.4
538,"A solution of 1,4-benzodioxan-6-amine (2.00 g,...",[O:1]1[C:6]2[CH:7]=[CH:8][C:9]([NH2:11])=[CH:1...,['O1CCOC2=C1C=CC(=C2)N2CCNCC2'],,0.37,methanol,methylene chloride,10.0,90.0,Nan,Nan
539,A solution of 6-aminochroman from part 2 of th...,[NH2:1][C:2]1[CH:3]=[C:4]2[C:9](=[CH:10][CH:11...,['O1CCCC2=CC(=CC=C12)N1CCNCC1'],1-(Chroman-6-yl)piperazine,0.35,methanol,methylene chloride,10.0,90.0,Nan,Nan
540,"A solution of 1-(1,4-benzodioxan-6-yl)piperazi...",O1C2C=CC(N3CCNCC3)=CC=2OCC1.C(O)(=O)C(O)=O.[O:...,['O1CCOC2=C1C=CC(=C2)N2CCN(CC2)CC2=CC=C(C=C2)F'],,0.35,ethyl acetate,hexane,30.0,70.0,Nan,Nan
543,"To a stirred solution of 4e (355 mg, 0.891 mmo...",[CH2:1]([O:4][C:5]([C:7]1[CH:12]=[CH:11][C:10]...,['O=C1N(C(CC1)=O)OC(C1=CC=C(C=C1)N1N=C(C2=CC=C...,,,,,,,,
544,5.295 g of N-chlorosuccinimide is added at 60°...,[Cl:1]N1C(=O)CCC1=O.[C:9]1([CH2:15][O:16][C:17...,['ClC1=C2CC[C@H]3[C@@H]4CCC([C@@]4(C)C[C@@H](C...,Stage A: 4-chloro-11beta-[4-(phenylmethoxy) ph...,,,,,,,
545,6 ml of sulphuryl chloride at 10% in dichlorom...,S(Cl)([Cl:4])(=O)=O.[CH3:6][N:7]([CH3:37])[CH2...,['ClC1=C2CC[C@H]3[C@@H]4CCC([C@@]4(C)C[C@@H](C...,Stage A: 4-chloro-11beta-[4-[2-(dimethylamino)...,,,,,,,
580,5-fluoro-2-methyl-1-(p-methylsulfonylbenzylide...,C[O-].[Na+].[F:4][C:5]1[CH:6]=[C:7]2[C:11](=[C...,['FC=1C=C2C(=C(C(C2=CC1)=CC1=CC=C(C=C1)S(=O)(=...,,,,,,,,
774,Boc-Lys(Cbz)-OH (25 g) was dissolved in dichlo...,[NH:1]([C:21]([O:23][C:24]([CH3:27])([CH3:26])...,['N([C@@H](CCCCNC(=O)OCC1=CC=CC=C1)C(=O)OC)C(=...,,,,,,,,
