# Extracting structured information from synthesis sequences


Once a synthesis protocol has been parsed into a basic sequence and stored as a `.json` file, we need to convert all the parsed information into usable formats for further analysis.
In this notebook, we will go through all of the data parsing for a single paper to demonstrate the data structures available.

In [1]:
import os
import sys
import matplotlib.pyplot as plt

try:
    from synoracle.sequence import Sequence
except ModuleNotFoundError:
    module_path = os.path.abspath(os.path.join('..'))
    if module_path not in sys.path:
        sys.path.append(module_path)
    from synoracle.sequence import Sequence

In [2]:
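from glob import glob
import json

import numpy as np
import pandas as pd
from tqdm.notebook import tqdm, trange

# Silence pandas' SettingWithCopyWarning for chained assignment.
pd.options.mode.chained_assignment = None


def li_iterate(li):
    """Yield the items of `li` in order while showing a tqdm progress bar."""
    l = iter(li)
    for _ in trange(len(li)):
        yield next(l)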

## Importing the sequence data

First, we import the raw synthesis sequence from `json` format into a `Sequence` object. 
Inside the `Sequence` object the raw synthesis information is stored as a pandas `DataFrame` under the attribute `raw_synthesis`. 
Using this, we can check the information which has been gathered, as well as reference the original text to manually check the fidelity of the previous steps, if required.

In terms of data which will be used for later processing, the `new_chemicals`, `temp`, and `time` columns contain chemical information, temperatures, and times respectively. 
Each of these will be processed to generate structured information from the synthesis. Accordingly, synthesis steps containing no information about any of these three aspects are discarded to create a `clean_synthesis` (not shown here).
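
The filtering described above can be sketched with plain pandas. This is only an illustration under assumptions: the helper name `drop_empty_steps` is hypothetical and is not part of `synoracle`, and the toy rows copy a few values from the table shown in this notebook.

```python
import pandas as pd

# Toy stand-in for a few rows of `raw_synthesis` (values copied from this notebook).
raw = pd.DataFrame({
    "name": ["Dissolve", "Wait", "Add", "Stir"],
    "new_chemicals": [[{"name": "Zn(NO3)2·6H2O", "mass": "0.730 g"}], [], [], []],
    "temp": [[], [], [], []],
    "time": [[], ["for 10 min"], [], ["for 3 h"]],
})


def drop_empty_steps(df, columns=("new_chemicals", "temp", "time")):
    """Hypothetical helper: keep rows where at least one column holds a non-empty list."""
    has_info = df[list(columns)].apply(lambda col: col.map(len) > 0).any(axis=1)
    return df[has_info].reset_index(drop=True)


clean = drop_empty_steps(raw)
print(clean["name"].tolist())
```

In this toy input only the `Add` step carries no chemical, temperature, or time information, so it is the only row removed.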


In [3]:
cq = Sequence.from_json('./S1385894723007039.90.json')
cq.raw_synthesis

| | name | text | new_chemicals | temp | time | prepphrase | apparatus | step number |
|---|---|---|---|---|---|---|---|---|
| 0 | Dissolve | 0.730 g of Zn(NO3)2·6H2O was dissolved in 40 m... | [{'name': 'Zn(NO3)2·6H2O', 'mass': '0.730 g', ... | [] | [] | [in 40 mL methanol] | [] | 0 |
| 1 | Wait | sonicated for 10 min | [] | [] | [for 10 min] | [] | [] | 1 |
| 2 | Yield | to form solution A | [{'name': 'A', 'mass': None, 'other_amount': N... | [] | [] | [] | [] | 2 |
| 3 | Dissolve | Similarly , 3.285 g of 2-methylimidazole was d... | [{'name': '2-methylimidazole', 'mass': '3.285 ... | [] | [] | [in 40 mL methanol] | [] | 3 |
| 4 | Wait | sonicated for 10 min | [] | [] | [for 10 min] | [] | [] | 4 |
| 5 | Yield | to form solution B | [{'name': 'B', 'mass': None, 'other_amount': N... | [] | [] | [] | [] | 5 |
| 6 | Add | The two solutions were then mixed | [] | [] | [] | [] | [] | 6 |
| 7 | Stir | stirred vigorously for 3 h at 25 ± 2 °C | [] | [] | [for 3 h] | [at 25 ± 2 °C] | [] | 7 |
| 8 | Partition | Subsequently , the turbid mixture was separate... | [] | [] | [] | [by centrifugation ( 10000 rpm )] | [] | 8 |
| 9 | Yield | yielding white | [] | [] | [] | [] | [] | 9 |


### Processing chemical information

Once a `clean_synthesis` has been generated, the first structured information to extract is which chemicals are present and in what quantity.
As chemicals can be added multiple times during synthesis or mentioned using different names in different studies, and their quantity can be reported in a number of units, the following steps need to be carried out:
1. Identify chemical names
2. Determine which units have been used to measure each one

These steps are carried out by the `Sequence.extract_chemicals()` method, which produces a `ChemicalList` object under the attribute `chemical_list`, containing information about each mentioned chemical with quantities sorted by type (mass, volume, concentration, moles (other_amount)).
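
As a rough illustration of the second step, the quantity types reported for a chemical can be found by checking which fields of its record carry a value. The names `QUANTITY_FIELDS` and `units_used` are hypothetical and only mirror the column names in the output of this notebook; the actual logic inside `extract_chemicals()` may differ.

```python
# Quantity columns as they appear in the `chemical_list` output of this notebook.
QUANTITY_FIELDS = ("mass", "other_amount", "volume", "percent", "concentration")


def units_used(record):
    """Hypothetical helper: list the quantity types that hold a value in a record."""
    return [field for field in QUANTITY_FIELDS if record.get(field)]


# Records modelled on the parsed chemicals shown in this notebook.
records = [
    {"name": "Zn(NO3)2·6H2O", "mass": "0.730 g"},
    {"name": "methanol", "volume": "40 mL"},
    {"name": "A", "mass": None, "other_amount": None},
]

for record in records:
    print(record["name"], units_used(record))
```

A record such as solution `A`, which names a chemical without any amount, yields an empty list, matching the empty `Units used` entries in the output.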

In [4]:
cq.extract_chemicals()
cq.chemical_list.chemical_list

| | name | mass | other_amount | volume | percent | concentration | aliases | Units used |
|---|---|---|---|---|---|---|---|---|
| 0 | Zn(NO3)2·6H2O | 0.730 g | | | | | [Zn(NO3)2·6H2O] | mass |
| 1 | methanol | | | 40 mL | | | [methanol] | volume |
| 2 | A | | | | | | [A] | |
| 3 | 2-methylimidazole | 3.285 g | | | | | [2-methylimidazole] | mass |
| 4 | methanol | | | 40 mL | | | [methanol] | volume |
| 5 | B | | | | | | [B] | |
| 6 | methanol | | | | | | [methanol] | |
| 7 | ethanol | | | | | | [ethanol] | |


Once a `ChemicalList` has been generated, this can be further processed into an itemised bill of materials for a synthesis, containing unique identifiers for each chemical and the total quantity used throughout the synthesis.
The steps to convert a `ChemicalList` into a `BillOfMaterials` are contained within the `ChemicalList.produce_bill_of_mats` method, which performs the following actions:
1. Groups all chemicals together with the same name 
2. Searches the online PubChem database for the chemical's name, taking the database's chemical ID number as a unique identifier
3. Uses the PubChem entry to extract key information about the compound, such as its molecular weight
4. Estimates the compound's density using the ChEDL database and the COSTALD method, if required
5. Calculates the total number of moles present of the compound
6. Groups multiple instances of the same chemical together, to show the total bill of materials present
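
The mole calculation in step 5 reduces to two conversions, sketched below. The function names are hypothetical, and the molar masses and the methanol density are approximate literature values supplied here as assumptions; they are not taken from `synoracle`, PubChem lookups, or the COSTALD estimate, so the volume-based result will not exactly match the notebook output.

```python
def moles_from_mass(mass_g, molar_mass_g_per_mol):
    """Moles of a compound given its mass in grams."""
    return mass_g / molar_mass_g_per_mol


def moles_from_volume(volume_ml, density_g_per_ml, molar_mass_g_per_mol):
    """Moles of a liquid given its volume in mL and its density in g/mL."""
    return volume_ml * density_g_per_ml / molar_mass_g_per_mol


# 3.285 g of 2-methylimidazole, assuming a molar mass of 82.10 g/mol.
n_2mim = moles_from_mass(3.285, 82.10)
print(round(n_2mim, 6))  # 0.040012

# 40 mL of methanol, assuming a density of 0.792 g/mL and a molar mass of 32.04 g/mol.
n_meoh = moles_from_volume(40.0, 0.792, 32.04)
print(n_meoh)
```

A density is only needed for chemicals reported by volume, which is why the density estimate in step 4 is applied "if required".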


In [5]:
ingreds_bom = cq.chemical_list.produce_bill_of_mats()

 A
 Zn(NO3)2·6H2O


In [6]:
ingreds_bom.bill_of_materials

| pubchem_id | name | moles |
|---|---|---|
| 702 | [ethanol] | 0.0 |
| 887 | [methanol] | 2.056345 |
| 12749 | [2-methylimidazole] | 0.040012 |
| 5462311 | [B] | 0.0 |


Notably, the above bill of materials is missing the zinc source. 
This is because the PubChem search does not identify zinc nitrate from the formula used.
In order to rectify this, we create a cache of chemical identities, in which we can manually enter the chemical's identity. 
When recreating the `BillOfMaterials` with the unique identifier `56846048` in the identifier cache, a corrected `BillOfMaterials` will be created.

In [7]:
ingreds_bom = cq.chemical_list.produce_bill_of_mats(identifier_cache_location='./id_cache.json', property_cache_location='./prop_cache.json')

In [8]:
ingreds_bom.bill_of_materials

| pubchem_id | name | moles |
|---|---|---|
| 702 | [ethanol] | 0.0 |
| 887 | [methanol] | 2.056345 |
| 12749 | [2-methylimidazole] | 0.040012 |
| 56846048 | [Zn(NO3)2·6H2O] | 0.00352 |


By standardising the format of the bill of materials, both in terms of chemical identity and quantity units, we are able to seamlessly compare between different synthesis protocols.
In this way, statistics on how common a certain chemical is, or what quantity of it is typically used, can be easily calculated.
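
A minimal sketch of such statistics is given below, assuming each bill of materials has been reduced to a plain mapping from PubChem ID to moles. The two dictionaries reuse values shown in this notebook (the full synthesis and its first condensed block) purely as stand-ins for separate protocols; the variable names are hypothetical.

```python
from collections import Counter

# PubChem ID -> moles, using values shown in this notebook.
full_synthesis = {702: 0.0, 887: 2.056345, 12749: 0.040012, 56846048: 0.00352}
first_block = {887: 1.028173, 56846048: 0.00352}
boms = [full_synthesis, first_block]

# How many bills of materials mention each chemical.
occurrences = Counter(cid for bom in boms for cid in bom)
print(occurrences.most_common())

# Mean quantity of methanol (PubChem ID 887) over the bills that contain it.
methanol_moles = [bom[887] for bom in boms if 887 in bom]
mean_methanol = sum(methanol_moles) / len(methanol_moles)
print(mean_methanol)
```

Keying on the PubChem ID rather than the name is what allows entries reported under different aliases to be counted together.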

## Extracting time and temperature information from the sequence
We then perform a similar set of processing for reaction conditions like times and temperatures, converting them to minutes and kelvin respectively.
We can then analyse the total synthesis time and the set of temperatures used, for later comparison between different protocols.
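
The unit conversion can be illustrated with a small regular-expression parser. This is a sketch under assumptions rather than the parser used by `extract_conditions()`: the function names are hypothetical, only a handful of unit spellings are recognised, and temperatures are assumed to be given in °C (an uncertainty such as `± 2` is ignored).

```python
import re

# Minutes per unit, keyed by the first letter of the matched unit.
MINUTES_PER_UNIT = {"s": 1 / 60, "m": 1.0, "h": 60.0, "d": 1440.0}

TIME_PATTERN = re.compile(
    r"(\d+(?:\.\d+)?)\s*(hours?|hrs?|h|minutes?|mins?|seconds?|secs?|s|days?|d)\b"
)
# Optional "± x" between the value and the unit is skipped.
TEMP_PATTERN = re.compile(r"(-?\d+(?:\.\d+)?)\s*(?:±\s*\d+(?:\.\d+)?\s*)?°\s*C")


def time_to_minutes(phrase):
    """Return the duration in minutes, or None if no duration is found."""
    match = TIME_PATTERN.search(phrase)
    if match is None:
        return None
    return float(match.group(1)) * MINUTES_PER_UNIT[match.group(2)[0]]


def temp_to_kelvin(phrase):
    """Return the temperature in kelvin, or None if no °C value is found."""
    match = TEMP_PATTERN.search(phrase)
    if match is None:
        return None
    return float(match.group(1)) + 273.15


times = [time_to_minutes(p) for p in ["for 10 min", "for 10 min", "for 3 h"]]
print(times, sum(times))
print(temp_to_kelvin("at 60 °C"), temp_to_kelvin("at 25 ± 2 °C"))
```

Phrases without a recognisable quantity return `None`, so they can be skipped when totals are computed.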

In [10]:
cq.extract_conditions()
cq.conditions.time_temp

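| | step number | time | temp | T (K) | Time (min) |
|---|---|---|---|---|---|
| 0 | 0 | [] | [] | | |
| 1 | 1 | [for 10 min] | [] | | [10.0] |
| 2 | 2 | [] | [] | | |
| 3 | 3 | [] | [] | | |
| 4 | 4 | [for 10 min] | [] | | [10.0] |
| 5 | 5 | [] | [] | | |
| 6 | 6 | [] | [] | | |
| 7 | 7 | [for 3 h] | [] | | [180.0] |
| 8 | 8 | [] | [] | | |
| 9 | 9 | [] | [] | | |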


In [11]:
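# Each cell holds a list of durations: Series.sum() concatenates the lists, then sum() adds the values.
sum(cq.conditions.time_temp['Time (min)'][cq.conditions.time_temp['Time (min)'].notna()].sum())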

200.0

In [12]:
set(cq.conditions.time_temp['T (K)'][cq.conditions.time_temp['T (K)'].notna()].sum())

{333.15}

### Analysing the sequence of actions itself
Finally, we can investigate the sequence of steps itself to analyse how complex the synthesis is, and break down the ingredients and conditions by reaction step.
We condense the synthesis procedure into "blocks", each with its own chemicals and conditions.
This gives us the added opportunity to perform like-for-like analysis on subsets of a reaction.

In [13]:
cq.condense_to_supertypes()
cq.condensed_sequence

| Step supertype | name | new_chemicals | temp | time | Condensed steps |
|---|---|---|---|---|---|
| add | Dissolve | [{'name': 'Zn(NO3)2·6H2O', 'mass': '0.730 g', ... | [] | [] | 1 |
| react | Wait | [] | [] | [for 10 min] | 1 |
| remove | Yield | [{'name': 'A', 'mass': None, 'other_amount': N... | [] | [] | 1 |
| add | Dissolve | [{'name': '2-methylimidazole', 'mass': '3.285 ... | [] | [] | 1 |
| react | Wait | [] | [] | [for 10 min] | 1 |
| remove | Yield | [{'name': 'B', 'mass': None, 'other_amount': N... | [] | [] | 1 |
| add | AddStir | [] | [] | [for 3 h] | 2 |
| remove | PartitionYieldPrecipitateWashDry | [{'name': 'methanol', 'mass': None, 'other_amo... | [at 60 °C] | [] | 5 |
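
The grouping in this output can be approximated by merging consecutive steps that share a supertype. The sketch below uses `itertools.groupby` and is not the implementation of `condense_to_supertypes()`; the `SUPERTYPE` mapping is an assumption inferred from the output above, and only the ten steps shown in the `raw_synthesis` table are used.

```python
from itertools import groupby

# Assumed mapping from step name to supertype, inferred from the output above.
SUPERTYPE = {
    "Dissolve": "add",
    "Add": "add",
    "Stir": "add",
    "Wait": "react",
    "Partition": "remove",
    "Yield": "remove",
}

# Step names as listed in the `raw_synthesis` table of this notebook.
steps = ["Dissolve", "Wait", "Yield", "Dissolve", "Wait", "Yield",
         "Add", "Stir", "Partition", "Yield"]


def condense(step_names):
    """Merge runs of consecutive steps with the same supertype into one block."""
    blocks = []
    for supertype, group in groupby(step_names, key=SUPERTYPE.get):
        names = list(group)
        # (supertype, concatenated step names, number of condensed steps)
        blocks.append((supertype, "".join(names), len(names)))
    return blocks


for block in condense(steps):
    print(block)
```

Because only adjacent steps are merged, the two `Dissolve`, `Wait`, `Yield` runs stay as separate blocks, which preserves the order in which solutions A and B are prepared.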


In [15]:
# Build a bill of materials for the first condensed block only.
cq.extract_chemicals(
    partial_sequence=pd.DataFrame(cq.condensed_sequence.reset_index().loc[0]).T
)
ingredients_sub_selection = cq.chemical_list.produce_bill_of_mats(
    identifier_cache_location='./id_cache.json',
    property_cache_location='./prop_cache.json',
)

ingredients_sub_selection.bill_of_materials

| pubchem_id | name | moles |
|---|---|---|
| 887 | [methanol] | 1.028173 |
| 56846048 | [Zn(NO3)2·6H2O] | 0.00352 |
