# Pre-processing of data
The first step is to extract the data we are interested in from the XML files. The files come from a dataset of chemical reactions from US patents; here we use the patent applications, which cover 2001 to 2016. The dataset can be found here: https://figshare.com/articles/dataset/Chemical_reactions_from_US_patents_1976-Sep2016_/5104873. We are interested in the title, the general experimental procedure, the reaction SMILES and, most importantly, the product SMILES.
To do this, we use `xml.etree.ElementTree` to iterate through the child elements of the root of the file.


In [None]:
# First we import the necessary libraries

import xml.etree.ElementTree as et      # for parsing the XML file (cElementTree was removed in Python 3.9)
import pandas as pd
import numpy as np

In [1]:
# define path to access first XML file in the folder 2001 of applications

tree = et.parse(r"C:\Users\milen\git\ppChem\project\test\data\applications\2001\20010315.xml")

# define root of the XML file to iterate through the file
root = tree.getroot()

# get familiar with the root of the XML file

print(root.tag) 
print(len(root))
print(root[0].tag)
print(len(root[0]))

{http://www.xml-cml.org/schema}reactionList
38
{http://www.xml-cml.org/schema}reaction
6
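
The tags printed above are in the `{namespace-URI}localname` form that ElementTree uses for namespaced XML. As a minimal sketch of how to read them, the snippet below parses a tiny, hypothetical stand-in for one of the patent files (the `SAMPLE` string, the placeholder document ID and the helper `split_tag` are assumptions for illustration, not part of the dataset) and splits every tag into its namespace and local name.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified stand-in for one patent XML file.
SAMPLE = """<reactionList xmlns="http://www.xml-cml.org/schema"
                xmlns:dl="http://bitbucket.org/dan2097">
  <reaction>
    <dl:source><dl:documentId>US0000000</dl:documentId></dl:source>
  </reaction>
</reactionList>"""


def split_tag(tag):
    """Split '{uri}local' into (uri, local); uri is '' without a namespace."""
    if tag.startswith("{"):
        uri, _, local = tag[1:].partition("}")
        return uri, local
    return "", tag


sample_root = ET.fromstring(SAMPLE)

# Walk the whole tree in document order and split every tag.
tags = [split_tag(el.tag) for el in sample_root.iter()]
for uri, local in tags:
    print(f"{local:15s} {uri}")
```

Knowing which namespace each tag belongs to is what later allows a prefix map such as `{'cml': ..., 'dl': ...}` to be passed to `find` and `findall`.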


In [3]:
# Get more familiar with the data and with how to iterate through an XML file:
# print the tag of the first grandchild of every reaction and count the reactions.
count = 0
for reaction in root:
    print(reaction[0][0].tag)
    count += 1
print(count)

{http://bitbucket.org/dan2097}documentId
{http://bitbucket.org/dan2097}documentId
{http://bitbucket.org/dan2097}documentId
{http://bitbucket.org/dan2097}documentId
{http://bitbucket.org/dan2097}documentId
{http://bitbucket.org/dan2097}documentId
{http://bitbucket.org/dan2097}documentId
{http://bitbucket.org/dan2097}documentId
{http://bitbucket.org/dan2097}documentId
{http://bitbucket.org/dan2097}documentId
{http://bitbucket.org/dan2097}documentId
{http://bitbucket.org/dan2097}documentId
{http://bitbucket.org/dan2097}documentId
{http://bitbucket.org/dan2097}documentId
{http://bitbucket.org/dan2097}documentId
{http://bitbucket.org/dan2097}documentId
{http://bitbucket.org/dan2097}documentId
{http://bitbucket.org/dan2097}documentId
{http://bitbucket.org/dan2097}documentId
{http://bitbucket.org/dan2097}documentId
{http://bitbucket.org/dan2097}documentId
{http://bitbucket.org/dan2097}documentId
{http://bitbucket.org/dan2097}documentId
{http://bitbucket.org/dan2097}documentId
{http://bitbucke

## Collect first values
Now that we are more familiar with the XML file and how we can iterate or access the different children of the root, we try to create a list with the values of interest.

In [16]:
# Extract the title of the reactions
Title = []
for title in root.iter('{http://bitbucket.org/dan2097}headingText'):
    print(title.text)   # print the title of the reactions to see what they are like
    Title.append(title.text)
print(len(Title))

Step h: 4-Chloro-1-(4-isopropyl-phenyl)-butan-1-one
Step d: 4-Chloro-1-(4-methyl-phenyl)-butan-1-one
1-[4-(1-Bromo-1-methyl-ethyl)-phenyl]-4-chloro-butan-1-one
1-[4-(1-Bromo-1-methyl-ethyl)-phenyl]-4-chloro-butan-1-one
1-[4-(1-Bromo-1-methyl-ethyl)-phenyl]-4-chloro-butan-1-one
(4-Bromomethyl-phenyl)-cyclopropyl-methanone
Step 1: (4-Cyclopropanecarbonyl-phenyl)-acetonitrile
Step d: [4-(4-Chloro-butyryl)-phenyl]-acetic acid, 2-ethylhexyl ester
Step h: 2-[4-(4-Chloro-butyryl)-phenyl]-2-methyl-propionic acid, N-methoxy-N-methylamide
Step h: 2-[4-(4-Chloro-butyryl)-phenyl]-2-methyl-propionic acid, dimethylamide
Step h: 2-[4-(4-Chloro-butyryl)-phenyl]-2-methyl-propionic acid pyrrolidineamide
(4-cyclopropanecarbonyl-phenyl)-acetic acid, dimethylamide;
2-(4-Cyclopropanecarbonyl-phenyl)-proprionic acid, dimethylamide;
2-(4-Cyclopropanecarbonyl-phenyl)-2-methyl-proprionic acid. dimethylamide;
[4-(4-Chloro-butyryl)-phenyl]-acetic acid, dimethylamide;
2-[4-(4-Chloro-butyryl)-phenyl]-propionic acid

Oops! We already encounter the first problem: the list contains only 32 titles, although we previously counted 38 reactions in this first XML file. Some reactions have no heading, so zipping the lists we create in the following would pair values from different reactions, and we would lose the matching. Stay tuned for the solution to this problem!
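
To make the mismatch concrete, here is a small sketch on a hypothetical two-reaction document (the `SAMPLE` string and its texts are made up for illustration). The first reaction has no heading, so the flat title list is shorter and `zip` pairs the only title with the wrong procedure. The last lines show one possible way to find which reactions lack a heading.

```python
import xml.etree.ElementTree as ET

NS = {"cml": "http://www.xml-cml.org/schema",
      "dl": "http://bitbucket.org/dan2097"}

# Hypothetical sample: the first reaction has no headingText.
SAMPLE = """<reactionList xmlns="http://www.xml-cml.org/schema"
                xmlns:dl="http://bitbucket.org/dan2097">
  <reaction>
    <dl:source><dl:paragraphText>Procedure A</dl:paragraphText></dl:source>
  </reaction>
  <reaction>
    <dl:source>
      <dl:headingText>Title B</dl:headingText>
      <dl:paragraphText>Procedure B</dl:paragraphText>
    </dl:source>
  </reaction>
</reactionList>"""

sample_root = ET.fromstring(SAMPLE)

# Flat lists, collected the same way as in the cells above.
titles = [el.text for el in sample_root.iter("{http://bitbucket.org/dan2097}headingText")]
procedures = [el.text for el in sample_root.iter("{http://bitbucket.org/dan2097}paragraphText")]

# zip() pairs by position and stops at the shorter list,
# so "Title B" ends up next to "Procedure A".
pairs = list(zip(titles, procedures))
print(pairs)

# Indices of the reactions that have no heading at all.
missing = [i for i, rxn in enumerate(sample_root.findall("cml:reaction", NS))
           if rxn.find(".//dl:headingText", NS) is None]
print(missing)
```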

In [10]:
# Extract the experimental procedure
ExpProcedure = []
for expprocedure in root.iter('{http://bitbucket.org/dan2097}paragraphText'):
    # print(expprocedure.text)
    ExpProcedure.append(expprocedure.text)
print(len(ExpProcedure))

38


In [11]:
# Extract the reaction SMILES
RxnSmiles = []
for smiles in root.iter('{http://bitbucket.org/dan2097}reactionSmiles'):
    # print(smiles.text)
    RxnSmiles.append(smiles.text)
print(len(RxnSmiles))

38


Here we ran into a problem: in the other tags, the value we wanted was the text of the element itself, whereas a product tag contains several identifiers and we only want the product's SMILES identifier. We need to be very specific here to avoid also extracting the SMILES identifiers in the reactant or spectator tags.
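
The difference between an unscoped and a product-scoped search can be sketched on a single hypothetical reaction (the `SAMPLE` string with ethanol as product and acetaldehyde as reactant is made up for illustration). ElementTree's limited XPath supports both a parent/child step and an attribute predicate, which is enough to select only the SMILES identifiers that sit directly under a `<product>`.

```python
import xml.etree.ElementTree as ET

NS = {"cml": "http://www.xml-cml.org/schema"}

# Hypothetical reaction: one product and one reactant, each with identifiers.
SAMPLE = """<reaction xmlns="http://www.xml-cml.org/schema">
  <productList>
    <product>
      <identifier dictRef="cml:smiles" value="CCO"/>
      <identifier dictRef="cml:inchi" value="InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3"/>
    </product>
  </productList>
  <reactantList>
    <reactant>
      <identifier dictRef="cml:smiles" value="CC=O"/>
    </reactant>
  </reactantList>
</reaction>"""

rxn = ET.fromstring(SAMPLE)

# Unscoped: every SMILES identifier, including the reactant's.
all_smiles = [el.get("value")
              for el in rxn.findall('.//cml:identifier[@dictRef="cml:smiles"]', NS)]

# Scoped: only SMILES identifiers that are children of a <product>.
product_smiles = [el.get("value")
                  for el in rxn.findall('.//cml:product/cml:identifier[@dictRef="cml:smiles"]', NS)]

print(all_smiles)
print(product_smiles)
```

Note that `cml:smiles` inside the predicate is an ordinary attribute value, not a namespace prefix, so it is matched as a plain string.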

In [7]:
# Define the namespace
ns = {'cml': 'http://www.xml-cml.org/schema'}

# Create the list that collects the product SMILES
PrdSmiles = []

# Find all <reaction> elements
reaction_elements = root.findall('.//cml:reaction', ns)

# Iterate over each <reaction> element
for reaction_element in reaction_elements:
    # Find all <product> elements within the current <reaction> element
    product_elements = reaction_element.findall('.//cml:product', ns)
    # Iterate over each <product> element
    for product_element in product_elements:
        # Find all <identifier> elements within the current <product> element
        identifier_elements = product_element.findall('.//cml:identifier[@dictRef="cml:smiles"]', ns)
        # Iterate over each <identifier> element
        for identifier_element in identifier_elements:
            # Extract the value attribute (SMILES value)
            smiles_value = identifier_element.attrib.get('value')
            if smiles_value is not None:
                # Append the SMILES value to the list or process it as needed
                PrdSmiles.append(smiles_value)

# Check if any values were extracted
print("Product SMILES:", PrdSmiles)

Product SMILES: ['C(C)(C)(C)C1CCC(CC1)O', 'ClCCCC(=O)C1=CC=C(C=C1)C(C)C', 'ClCCCC(=O)C1=CC=C(C=C1)C', 'BrC(C)(C)C1=CC=C(C=C1)C(CCCCl)=O', 'BrC(C)(C)C1=CC=C(C=C1)C(CCCCl)=O', 'BrC(C)(C)C1=CC=C(C=C1)C(CCCCl)=O', 'BrCC1=CC=C(C=C1)C(=O)C1CC1', 'C1(CC1)C(=O)C1=CC=C(C=C1)CC#N', 'C1(=CC=CC=C1)CC(=O)O', 'ClCCCC(=O)C1=CC=C(C=C1)CC(=O)OCC(CCCC)CC', 'C1(=CC=CC=C1)C(C(=O)OCC)C', 'CON(C(C(C)(C1=CC=CC=C1)C)=O)C', 'CON(C(C(C)(C)C1=CC=C(C=C1)C(CCCCl)=O)=O)C', 'CN(C(C(C)(C1=CC=CC=C1)C)=O)C', 'CN(C(C(C)(C)C1=CC=C(C=C1)C(CCCCl)=O)=O)C', 'N1(CCCC1)C(=O)N.ClCCCC(=O)C1=CC=C(C=C1)C(C(=O)O)(C)C', 'CN(C(CC1=CC=C(C=C1)C(=O)C1CC1)=O)C', 'CN(C(C(C)C1=CC=C(C=C1)C(=O)C1CC1)=O)C', 'CN(C(C(C)(C)C1=CC=C(C=C1)C(=O)C1CC1)=O)C', 'CN(C(CC1=CC=C(C=C1)C(CCCCl)=O)=O)C', 'CN(C(C(C)C1=CC=C(C=C1)C(CCCCl)=O)=O)C', 'CN(C(C(C)(C)C1=CC=C(C=C1)C(CCCCl)=O)=O)C', 'ClCCCC(=O)C1=CC=C(C=C1)C(C(=O)OCC)(C)C', 'ClCCCC(=O)C1=CC=C(C=C1)C(C(=O)OC)(C)C', 'ClCCCC(=O)C1=CC=C(C=C1)C(C(=O)OC)(C)C', 'ClCCCC(=O)C1=CC=C(C=C1)C(C(=O)O)(C)C', 'ClCCCC(=O

In [8]:
print(len(PrdSmiles))

38


Now, to prevent mismatches caused by lists of different lengths, the best approach is to create one dictionary per reaction, with the fields of interest as keys. After that, we can build a DataFrame in which every dictionary is one row.
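
A minimal sketch of the one-dictionary-per-reaction idea, on a hypothetical two-reaction document: the `SAMPLE` string, the `FIELDS` mapping and the helper `reaction_to_record` are assumptions for illustration. This variant gives every record the same keys and stores `None` where an element is absent, which is one possible design choice; because the lookup happens inside each `<reaction>`, a missing title can no longer shift the other values.

```python
import xml.etree.ElementTree as ET

NS = {"cml": "http://www.xml-cml.org/schema",
      "dl": "http://bitbucket.org/dan2097"}

# Hypothetical mapping from output key to the tag that holds its value.
FIELDS = {"title": "dl:headingText", "paragraphText": "dl:paragraphText"}

# Hypothetical sample: the first reaction has no headingText.
SAMPLE = """<reactionList xmlns="http://www.xml-cml.org/schema"
                xmlns:dl="http://bitbucket.org/dan2097">
  <reaction>
    <dl:source><dl:paragraphText>Procedure A</dl:paragraphText></dl:source>
  </reaction>
  <reaction>
    <dl:source>
      <dl:headingText>Title B</dl:headingText>
      <dl:paragraphText>Procedure B</dl:paragraphText>
    </dl:source>
  </reaction>
</reactionList>"""


def reaction_to_record(rxn):
    """Return one dict per reaction; absent elements are stored as None."""
    return {key: rxn.findtext(f".//{tag}", default=None, namespaces=NS)
            for key, tag in FIELDS.items()}


sample_root = ET.fromstring(SAMPLE)
records = [reaction_to_record(rxn)
           for rxn in sample_root.findall("cml:reaction", NS)]
print(records)
```

A list like `records` can be passed directly to `pd.DataFrame`, giving one row per reaction and one column per key.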

In [12]:
# Define the namespace
ns = {'cml': 'http://www.xml-cml.org/schema', 'dl': 'http://bitbucket.org/dan2097'}

# Create a list to store one dictionary per reaction
reaction_list = []

# Find all <reaction> elements
reaction_elements = root.findall('.//cml:reaction', ns)

# Iterate over each <reaction> element
for reaction_element in reaction_elements:
    # Create a dictionary to store information about the reaction
    reaction_dict = {}

    # Extract title
    title = reaction_element.find('.//dl:headingText', ns)
    if title is not None:
        reaction_dict['title'] = title.text

    # Extract paragraph text
    paragraph_text = reaction_element.find('.//dl:paragraphText', ns)
    if paragraph_text is not None:
        reaction_dict['paragraphText'] = paragraph_text.text

    # Extract reaction SMILES
    reaction_smiles = reaction_element.find('.//dl:reactionSmiles', ns)
    if reaction_smiles is not None:
        reaction_dict['reactionSmiles'] = reaction_smiles.text

    # Extract product SMILES
    product_elements = reaction_element.findall('.//cml:product', ns)
    product_smiles = []
    for product_element in product_elements:
        identifier_element = product_element.find('.//cml:identifier[@dictRef="cml:smiles"]', ns)
        if identifier_element is not None:
            smiles_value = identifier_element.get('value')
            if smiles_value is not None:
                product_smiles.append(smiles_value)
    if product_smiles:
        reaction_dict['productSmiles'] = product_smiles

    # Append the reaction dictionary to the reaction list
    reaction_list.append(reaction_dict)

# Check if any values were extracted
print("Reaction List:", reaction_list)


Reaction List: [{'paragraphText': 'PL 137,526 describes the hydrogenation of p-tert-butylphenol to form p-tert-butylcyclohexanol using a nickel catalyst.', 'reactionSmiles': '[C:1]([C:5]1[CH:10]=[CH:9][C:8]([OH:11])=[CH:7][CH:6]=1)([CH3:4])([CH3:3])[CH3:2]>[Ni]>[C:1]([CH:5]1[CH2:6][CH2:7][CH:8]([OH:11])[CH2:9][CH2:10]1)([CH3:4])([CH3:2])[CH3:3]', 'productSmiles': ['C(C)(C)(C)C1CCC(CC1)O']}, {'title': 'Step h: 4-Chloro-1-(4-isopropyl-phenyl)-butan-1-one', 'paragraphText': 'Slurry aluminum chloride (140.9 g, 1.075 mol) and 4-chlorobutyryl chloride (148 g, 1.05 mol) in methylene chloride (1.0 L) add, by dropwise addition, cumene (125 g, 1.04 mol) over a thirty minute period under a nitrogen atmosphere while maintaining the internal temperature between 5-8° C. with an ice bath. Allow the stirred solution to come to room temperature and continue stirring under nitrogen for 14 hours. Cautiously add the methylene chloride solution to 1 L of crushed ice with stirring and add additional methyle

In [13]:
print(len(reaction_list))

38


This seems to have worked out! Let's check out the DataFrame.

In [15]:
df = pd.DataFrame(reaction_list)
df.head()

Unnamed: 0,paragraphText,reactionSmiles,productSmiles,title
0,"PL 137,526 describes the hydrogenation of p-te...",[C:1]([C:5]1[CH:10]=[CH:9][C:8]([OH:11])=[CH:7...,[C(C)(C)(C)C1CCC(CC1)O],
1,"Slurry aluminum chloride (140.9 g, 1.075 mol) ...",[Cl-].[Al+3].[Cl-].[Cl-].[Cl:5][CH2:6][CH2:7][...,[ClCCCC(=O)C1=CC=C(C=C1)C(C)C],Step h: 4-Chloro-1-(4-isopropyl-phenyl)-butan-...
2,"Suspend anhydrous AlCl3 (156 g, 1.15 mol) in t...",[Al+3].[Cl-].[Cl-].[Cl-].[Cl:5][CH2:6][CH2:7][...,[ClCCCC(=O)C1=CC=C(C=C1)C],Step d: 4-Chloro-1-(4-methyl-phenyl)-butan-1-one
3,Dissolve 4-chloro-1-(4-isopropyl-phenyl)-butan...,[Cl:1][CH2:2][CH2:3][CH2:4][C:5]([C:7]1[CH:12]...,[BrC(C)(C)C1=CC=C(C=C1)C(CCCCl)=O],1-[4-(1-Bromo-1-methyl-ethyl)-phenyl]-4-chloro...
4,Dissolve 4-chloro-1-(4-isopropyl-phenyl)-butan...,[Cl:1][CH2:2][CH2:3][CH2:4][C:5]([C:7]1[CH:12]...,[BrC(C)(C)C1=CC=C(C=C1)C(CCCCl)=O],1-[4-(1-Bromo-1-methyl-ethyl)-phenyl]-4-chloro...
