# Building data tables from a corpus of papers

Once a synthesis protocol has been parsed into a basic sequence and stored as a `.json` file, we need to convert all the parsed information into usable formats for further analysis.
In this notebook, we will go through the data parsing for a small corpus of papers to demonstrate the data structures available.

In [1]:
import os
import sys
import matplotlib.pyplot as plt

try:
    from synoracle.sequence import Sequence
except ModuleNotFoundError:
    module_path = os.path.abspath(os.path.join('..'))
    if module_path not in sys.path:
        sys.path.append(module_path)
    from synoracle.sequence import Sequence

In [2]:
import json
from glob import glob

import numpy as np
import pandas as pd
from tqdm.notebook import tqdm, trange

# Silence pandas' chained-assignment warning for the column assignments below
pd.options.mode.chained_assignment = None


def li_iterate(li):
    """Yield the items of `li` while showing a tqdm progress bar."""
    l = iter(li)
    for _ in trange(len(li)):
        yield next(l)

## Importing the sequence data frame into pandas
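First, we instantiate a `Sequence` object for each parsed file; it holds the synthesis steps as a pandas dataframe. Each sequence starts as a `raw_synthesis`; after calling `condense_to_supertypes()` below, we work with its `clean_synthesis` dataframe.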

In [3]:
sequence_source = glob('../../../majed/*.json')
sequence_source

['../../../majed\\acsaccounts5b00165.0.json',
 '../../../majed\\anie200504114.100.json',
 '../../../majed\\B206698J.13.json',
 '../../../majed\\B306504A.8.json',
 '../../../majed\\B810295C.12.json',
 '../../../majed\\B915273C.21.json',
 '../../../majed\\cm049398e.85.json',
 '../../../majed\\cm8012733.89.json',
 '../../../majed\\cm801411y.70.json',
 '../../../majed\\cm801411y.71.json',
 '../../../majed\\cm801411y.72.json',
 '../../../majed\\es000990o.67.json',
 '../../../majed\\es000990o.68.json',
 '../../../majed\\ie0705047.65.json',
 '../../../majed\\ja0559911.70.json',
 '../../../majed\\ja974025i.78.json',
 '../../../majed\\ja974025i.79.json',
 '../../../majed\\jctbv9612.142.json',
 '../../../majed\\jjcis201309023.54.json',
 '../../../majed\\jp014280w.77.json',
 '../../../majed\\jp021964a.72.json',
 '../../../majed\\jp044538t.97.json',
 '../../../majed\\la035834k.json',
 '../../../majed\\la902239m.63.json',
 '../../../majed\\nature02529.108.json',
 '../../../majed\\S0009261400013853.

In [4]:
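sequences = {}
sequence_frames = []
for c, x in enumerate(sequence_source):
    cq = Sequence.from_json(x)
    cq.condense_to_supertypes()

    # File names look like '<paper_id>.<paragraph_number>.json'.
    # os.path.basename copes with whichever separator the OS uses.
    # A name with no paragraph number, e.g. 'la035834k.json', yields 'json' here.
    name_parts = os.path.basename(x).split('.')
    paper_id = name_parts[0]
    paragraph_number = name_parts[1]

    working = cq.clean_synthesis
    working['synthesis_number'] = c
    working['paper_identifier'] = paper_id
    working['paragraph_number'] = paragraph_number

    sequences[c] = cq
    sequence_frames.append(working)

# DataFrame.append was removed in pandas 2.0; concatenate once instead
all_sequences = pd.concat(sequence_frames)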
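
The file-name parsing above can be made a little more defensive. The following is a minimal sketch, not part of `synoracle`: `split_sequence_filename` is a hypothetical helper that accepts either path separator and returns `None` instead of `'json'` when a file name carries no paragraph number. It assumes the `<paper_id>.<paragraph_number>.json` naming seen in the listing above.

```python
from pathlib import PureWindowsPath


def split_sequence_filename(path):
    """Return (paper_id, paragraph_number) for a parsed-sequence file path.

    Hypothetical helper: paragraph_number is None when the file name has
    no numeric part, e.g. 'la035834k.json'.
    """
    # PureWindowsPath treats both '/' and '\\' as separators on any OS
    stem = PureWindowsPath(path).stem  # 'cm801411y.70' or 'la035834k'
    paper_id, _, paragraph = stem.partition('.')
    return paper_id, (int(paragraph) if paragraph.isdigit() else None)
```

Note that this sketch returns the paragraph number as an `int`, whereas the cell above stores it as a string; which is preferable depends on how the column is used downstream.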


In [16]:
web_links = [
    'https://doi.org/10.1021/acs.accounts.5b00165',
    'https://doi.org/10.1002/anie.200504114',
    'https://doi.org/10.1039/B206698J',
    'https://doi.org/10.1039/B306504A',
    'https://doi.org/10.1039/B810295C',
    'https://doi.org/10.1039/B915273C',
    'https://doi.org/10.1021/cm049398e',
    'https://doi.org/10.1021/cm8012733',
    'https://doi.org/10.1021/cm801411y',
    'https://doi.org/10.1021/cm801411y',
    'https://doi.org/10.1021/cm801411y',
    'https://doi.org/10.1021/es000990o',
    'https://doi.org/10.1021/es000990o',
    'https://doi.org/10.1021/ie0705047',
    'https://doi.org/10.1021/ja0559911',
    'https://doi.org/10.1021/ja974025i',
    'https://doi.org/10.1021/ja974025i',
    'https://doi.org/10.1002/jctb.6908',
    'https://doi.org/10.1016/j.jcis.2013.09.023',
    'https://doi.org/10.1021/jp014280w',
    'https://doi.org/10.1021/jp021964a',
    'https://doi.org/10.1021/jp044538t',
    'https://doi.org/10.1021/la035834k',
    'https://doi.org/10.1021/la902239m',
    'https://doi.org/10.1038/nature02529',
    'https://doi.org/10.1016/S0009-2614(00)01385-3',
    'https://doi.org/10.1016/j.chemosphere.2005.08.047',
    'https://doi.org/10.1016/j.clay.2011.02.024',
    'https://doi.org/10.1016/j.jhazmat.2008.03.013',
    'https://doi.org/10.1016/j.jhazmat.2009.11.135',
    'https://doi.org/10.1016/j.jhazmat.2009.11.135',
    'https://doi.org/10.1007/s10853-009-3610-9',
    'https://doi.org/10.1016/j.micromeso.2003.12.004',
    'https://doi.org/10.1016/S1566-7367(02)00051-1',
    'https://doi.org/10.1016/S1566-7367(02)00051-1'
        
]

titles = [
    'The Dynamic Association Processes Leading from a Silica Precursor to a Mesoporous SBA-15 Material',
    'Synthesis and Characterization of Mesoporous Silica AMS-10 with Bicontinuous Cubic Pn3m Symmetry',
    'Phase diagram for mesoporous CTAB–silica films prepared under dynamic conditions',
    'Cubic Ia3d large mesoporous silica: synthesis and replication to platinum nanowires, carbon nanorods and carbon nanotubes',
    'Synthesis of porous silica with hierarchical structure directed by a silica precursor carrying a pore-generating cage',
    'Convenient synthesis of ordered mesoporous silica at room temperature and quasi-neutral pH',
    'Structural Solution of Mesocaged Material AMS-8',
    'Synthesis of Ultra-Large-Pore SBA-15 Silica with Two-Dimensional Hexagonal Structure Using Triisopropylbenzene As Micelle Expander',
    'Porous Silica Nanocapsules and Nanospheres: Dynamic Self-Assembly Synthesis and Application in Controlled Release',
    'Porous Silica Nanocapsules and Nanospheres: Dynamic Self-Assembly Synthesis and Application in Controlled Release',
    'Porous Silica Nanocapsules and Nanospheres: Dynamic Self-Assembly Synthesis and Application in Controlled Release',
    'Surfactant-Templated Mesoporous Silicate Materials as Sorbents for Organic Pollutants in Water',
    'Surfactant-Templated Mesoporous Silicate Materials as Sorbents for Organic Pollutants in Water',
    'Separation of Organic Compounds by Spherical Mesoporous Silica Prepared from W/O Microemulsions of Tetrabutoxysilane',
    'Resolving Intermediate Solution Structures during the Formation of Mesoporous SBA-15',
    'Nonionic Triblock and Star Diblock Copolymer and Oligomeric Surfactant Syntheses of Highly Ordered, Hydrothermally Stable, Mesoporous Silica Structures',
    'Nonionic Triblock and Star Diblock Copolymer and Oligomeric Surfactant Syntheses of Highly Ordered, Hydrothermally Stable, Mesoporous Silica Structures',
    'Synthesis and formation mechanism analysis of meso-microporous ZSM-5 with controllable mesoporous volume',
    'Single-pot synthesis of ordered mesoporous silica films with unique controllable morphology',
    'Structural Design of Mesoporous Silica by Micelle-Packing Control Using Blends of Amphiphilic Block Copolymers',
    'Study of the Formation of the Mesoporous Material SBA-15 by EPR Spectroscopy',
    'Properties of the Silica Layer during the Formation of MCM-41 Studied by EPR of a Silica-Bound Spin Probe',
    'Nonionic Fluorinated Surfactant: Investigation of Phase Diagram and Preparation of Ordered Mesoporous Materials',
    'Ultrafast Sonochemical Synthesis of Methane and Ethane Bridged Periodic Mesoporous Organosilicas',
    'Synthesis and characterization of chiral mesoporous silica',
    'A novel morphology of mesoporous molecular sieve MCM-41',
    'Effective uptake of decontaminating agent (citric acid) from aqueous solution by mesoporous and microporous materials: An adsorption process',
    'Organosilicas and organo-clay minerals as sorbents for toluene',
    'Adsorption of phenol and o-chlorophenol by mesoporous MCM-41',
    'Fast and efficient mesoporous adsorbents for the separation of toxic compounds from aqueous media',
    'Fast and efficient mesoporous adsorbents for the separation of toxic compounds from aqueous media',
    'Morphological control on SBA-15 mesoporous silicas via a slow self-assembling rate',
    'Microwave synthesis of cubic mesoporous silica SBA-16',
    'Fast and efficient synthesis of ZSM-5 under high pressure',
    'Fast and efficient synthesis of ZSM-5 under high pressure'
]

In [23]:
paper_metadata = pd.DataFrame(columns = ['synthesis_number', 'paper_identifier', 'title', 'link'])
for counter, (link, title, (c,v)) in enumerate(zip(web_links, titles, sequences.items())):

    entry = {
        'synthesis_number': c,
        'paper_identifier': v.clean_synthesis['paper_identifier'].unique()[0],
        'title': title,
        'link': link
    }
    paper_metadata.loc[counter] = entry
    print(entry)


{'synthesis_number': 0, 'paper_identifier': 'acsaccounts5b00165', 'title': 'The Dynamic Association Processes Leading from a Silica Precursor to a Mesoporous SBA-15 Material', 'link': 'https://doi.org/10.1021/acs.accounts.5b00165'}
{'synthesis_number': 1, 'paper_identifier': 'anie200504114', 'title': 'Synthesis and Characterization of Mesoporous Silica AMS-10 with Bicontinuous Cubic Pn3m Symmetry', 'link': 'https://doi.org/10.1002/anie.200504114'}
{'synthesis_number': 2, 'paper_identifier': 'B206698J', 'title': 'Phase diagram for mesoporous CTAB–silica films prepared under dynamic conditions', 'link': 'https://doi.org/10.1039/B206698J'}
{'synthesis_number': 3, 'paper_identifier': 'B306504A', 'title': 'Cubic Ia3d large mesoporous silica: synthesis and replication to platinum nanowires, carbon nanorods and carbon nanotubes', 'link': 'https://doi.org/10.1039/B306504A'}
{'synthesis_number': 4, 'paper_identifier': 'B810295C', 'title': 'Synthesis of porous silica with hierarchical structure 

In [25]:
paper_metadata.to_csv('paper_metadata.csv')

In [33]:
all_sequences.to_csv('sequence_information.csv')
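
Because both tables carry `synthesis_number` and `paper_identifier`, the metadata can be joined back onto the step-level table whenever titles or links are needed alongside the steps. The sketch below uses small toy frames rather than the real data; the `step` column is a made-up stand-in for whatever columns `clean_synthesis` actually contains.

```python
import pandas as pd

# Toy stand-ins for all_sequences and paper_metadata ('step' is hypothetical)
steps = pd.DataFrame({
    'synthesis_number': [0, 0, 1],
    'paper_identifier': ['acsaccounts5b00165', 'acsaccounts5b00165', 'anie200504114'],
    'step': ['add', 'stir', 'add'],
})
meta = pd.DataFrame({
    'synthesis_number': [0, 1],
    'paper_identifier': ['acsaccounts5b00165', 'anie200504114'],
    'link': ['https://doi.org/10.1021/acs.accounts.5b00165',
             'https://doi.org/10.1002/anie.200504114'],
})

# validate='many_to_one' raises if a synthesis has more than one metadata row
annotated = steps.merge(
    meta,
    on=['synthesis_number', 'paper_identifier'],
    how='left',
    validate='many_to_one',
)
```

When reading the saved files back, `pd.read_csv(..., index_col=0)` restores the index column that `to_csv` writes by default.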

### Pulling out chemical information
The first thing we'll do is extract information about the chemicals involved in the synthesis procedure. 
We'll start by building a `chemical_list` of the mentioned materials, with quantities sorted by type: mass, volume, concentration, and moles (`other_amount`).
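
To illustrate what sorting quantities by type involves, here is a minimal standalone sketch. It is not how `extract_chemicals()` works internally: the `UNIT_TYPES` table, the regular expression and `classify_quantity` are all assumptions made for illustration. It also shows one way of handling ranges such as `20-50 mL`.

```python
import re

# Illustrative unit table, not the one used by the notebook's parser
UNIT_TYPES = {
    'g': 'mass', 'mg': 'mass', 'kg': 'mass',
    'mL': 'volume', 'L': 'volume', 'cm3': 'volume',
    'M': 'concentration', 'mM': 'concentration',
    'mol': 'other_amount', 'mmol': 'other_amount',
    'wt %': 'percent', '%': 'percent',
}

# A number, an optional '-<number>' range part, then the unit text
QUANTITY = re.compile(
    r'^\s*(\d+(?:\.\d+)?)(?:\s*-\s*(\d+(?:\.\d+)?))?\s*(.+?)\s*$'
)


def classify_quantity(text):
    """Return (low, high, unit, quantity_type), or None if `text` has no number."""
    match = QUANTITY.match(text)
    if match is None:
        return None
    low, high, unit = match.groups()
    low = float(low)
    # A single value is treated as a range with equal bounds
    high = float(high) if high is not None else low
    return low, high, unit, UNIT_TYPES.get(unit, 'unknown')
```

For example, `classify_quantity('20-50 mL')` gives `(20.0, 50.0, 'mL', 'volume')`, and a string without a leading number, such as `'several minutes'`, gives `None`.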

In [28]:
bom_frames = []
for c, x in sequences.items():
    paper_id = x.clean_synthesis['paper_identifier'].unique()[0]
    para_no = x.clean_synthesis['paragraph_number'].unique()[0]

    print(paper_id)
    print(para_no)

    x.extract_chemicals()

    working_bom = x.chemical_list.produce_bill_of_mats().bill_of_materials
    working_bom['synthesis_number'] = c
    working_bom['paper_identifier'] = paper_id
    working_bom['paragraph_number'] = para_no

    bom_frames.append(working_bom)

# DataFrame.append was removed in pandas 2.0; concatenate once instead
all_boms = pd.concat(bom_frames)



acsaccounts5b00165
0


 SBA-15


anie200504114
100


 AMS-10
 C14GluA
 C14GluA / TMAPS / TEOS / H2O / NaOH
ERROR:root:Error encountered converting raw info to pint unit:
----------------------
ERROR:root:amount in question: name                  HCl
mass                  NaN
other_amount          NaN
percent         [37 wt %]
Units used      [percent]
Name: 3, dtype: object
 hydrolyze


B206698J
13


 CTAB / Si


B306504A
8


 P123
 polypropylene


B810295C
12


 3-[(4-adamantan-1-yl-phenoxy)propyl]trimethoxysilane
 P123
 SBA-15
[Compound(7016082), Compound(21285386)]
 silsesquioxane


B915273C
21


 P123


cm049398e
85


 AMS-8
 AMS-8 silicate
 TEOS : TMAPS : C12Glysine : H2O
 TMAPS
 silicate
 sodium N-lauroyl-l-glysine


cm8012733
89


 0.199x
 BASF
 P123
 SBA-15
 TIPB
 TIPB : NH4F : HCl : H2O


cm801411y
70


ERROR:root:Error encountered converting raw info to pint unit:
----------------------
ERROR:root:amount in question: name                       ethyl ether
mass                               NaN
volume        [20-50 mL, 20 mL, 50 mL]
percent                            NaN
Units used    [volume, volume, volume]
Name: 5, dtype: object
 porogen


cm801411y
71


ERROR:root:Error encountered converting raw info to pint unit:
----------------------
ERROR:root:amount in question: name                   2-ethoxyethanol
mass                               NaN
volume        [25-50 mL, 25 mL, 50 mL]
percent                            NaN
Units used    [volume, volume, volume]
Name: 0, dtype: object
 porogen


cm801411y
72


 ( 28 )
ERROR:root:Error encountered converting raw info to pint unit:
----------------------
ERROR:root:amount in question: name                 unknown
mass                [0.05 g]
volume          [0.5-1.0 mL]
percent                  NaN
Units used    [mass, volume]
Name: 8, dtype: object


es000990o
67


 ( 21 )
 AlCl3·6H2O
 Aluminosilicate MCM-41
 HDTMA
 MCM-41(20)
 MCM-41(30)
 MCM-41(40)
 MCM-41(∞)
 Si/Al
 TMOS : HDTMA : H2O : MeOH
 methanol / water


es000990o
68


 HDTMA
 MCM-41(∞)-IN
 SiO2 : HDTMA : H2O
 silicate


ie0705047
65


 SMS
 TBOS
ERROR:root:Error encountered converting raw info to pint unit:
----------------------
ERROR:root:amount in question: name                 water
mass                   NaN
volume           [250 cm3]
concentration          NaN
Units used        [volume]
Name: 4, dtype: object


ja0559911
70


 P123
 SBA-15


ja974025i
78


 nonionic


ja974025i
79


 Silica-block


jctbv9612
142


 Al2O3
 D-ZSM-5
 O-ZSM-5
 P-ZSM-5
 Si-O-Si
 ZSM-5
 aluminum isopropanol
 dodecyl trimethoxy silane
 dodecyl trimethoxy silane zeolite
 methoxy silicon
 octyl trimethoxy silane
 organosilane
 propyl trimethoxy silane
 x-ZSM-5


jjcis201309023
54


 P123
 PTFE
 [ 29 ]


jp014280w
77


 ( x )
 C12EO10
 C12EO23
 EO
 EOxPO70EOx
 Silica/polymer
 c-HCl
 ethanol / HCl


jp021964a
72


 L62-NO
 P123
 SBA
 SBA-15


jp044538t
97


 CA-MCM-41
 H2O / MeOH
 HCl/2.5
 HY-MCM-41
 RT-MCM-41
 SE-MCM-41
 SL1SiEt


la035834k
json


 RF8(EO)9
 surfactant/silica


la902239m
63


 BTSE
 BTSM
 Et-PMO-i
 HTABr
 Me-PMO-i
 PMOs


nature02529
108


 APS
 N-miristoyl-l -alanine sodium
 TMAPS
 amino acid


S0009261400013853
30


 CPBr
 alkaline
 silicate
 tetrbutylorthotitanate
 titanium-containing MCM-41


S0045653505011185
json


 Al-MCM-41
 Si-MCM-41
 Si/Al
 X
 X Al2O3
 silicate
 sodium silicate nanohydrate


S0169131711000895
40


 MCM-48


S0304389408003610
81




S0304389409019554
48


 [ 52 ]


S0304389409019554
49


 0.5NaOH : 62H2O : 0.1NaF
 1.0SiO2
 MCM-48
 [ 53 ]


s1085300936109
30


 ( 4.35-8.90 )
ERROR:root:Error encountered converting raw info to pint unit:
----------------------
ERROR:root:amount in question: name                            H2O
mass             [50-200 g, 50.0 g]
concentration                   NaN
Units used             [mass, mass]
Name: 1, dtype: object
 silicas
 surfactant-silica


S1387181103006966
40


 F127
 MAR-5
 Pluronic F127
 S30 / M60
 SBA-16
This entry will not be calculated just now.


S1566736702000511
35




S1566736702000511
36




In [34]:
all_boms.to_csv('reagent_information.csv')

From the `chemical_list`, `produce_bill_of_mats()` builds a `BillOfMaterials`, with chemicals grouped by identity and the total number of moles of each. This lets us analyse reaction stoichiometry and related quantities later on.
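
As a rough illustration of the grouping idea, the sketch below sums moles per chemical within each synthesis and derives a molar ratio relative to one reagent. The toy rows and the column names `name` and `moles` are assumptions; the real `bill_of_materials` table may be laid out differently.

```python
import pandas as pd

# Toy reagent rows; the column names 'name' and 'moles' are assumptions
reagents = pd.DataFrame({
    'synthesis_number': [0, 0, 0, 1],
    'name': ['TEOS', 'TEOS', 'H2O', 'TEOS'],
    'moles': [0.010, 0.005, 1.0, 0.020],
})

# One row per (synthesis, chemical) with the summed number of moles
totals = (
    reagents
    .groupby(['synthesis_number', 'name'], as_index=False)['moles']
    .sum()
)

# Molar ratio of each reagent relative to TEOS within the same synthesis
teos = totals[totals['name'] == 'TEOS'].set_index('synthesis_number')['moles']
totals['ratio_to_TEOS'] = totals['moles'] / totals['synthesis_number'].map(teos)
```

Ratios of this kind are what make protocols with different batch sizes comparable.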

## Extracting time and temperature information from the sequence
We then process reaction conditions such as times and temperatures in a similar way, converting them to minutes and kelvin respectively.
We can then analyse the total synthesis time and the set of temperatures used, for later comparison between different protocols.
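
The unit normalisation described here can be sketched in a few lines. This is a standalone illustration, not the logic inside `extract_conditions()`: the unit table, the helper names and the example steps are all made up.

```python
# Illustrative conversion table; the unit spellings are assumptions
MINUTES_PER_UNIT = {'min': 1.0, 'h': 60.0, 'hour': 60.0, 'day': 1440.0, 'days': 1440.0}


def to_minutes(value, unit):
    """Convert a duration to minutes; raises KeyError for an unknown unit."""
    return value * MINUTES_PER_UNIT[unit]


def to_kelvin(value, unit):
    """Convert a temperature given in kelvin or degrees Celsius to kelvin."""
    if unit == 'K':
        return value
    if unit in ('C', '°C'):
        return round(value + 273.15, 2)
    raise ValueError(f'unknown temperature unit: {unit!r}')


# Made-up steps: (duration, time unit, temperature, temperature unit)
steps = [(20, 'h', 35, '°C'), (24, 'h', 80, '°C'), (30, 'min', 298.15, 'K')]

total_minutes = sum(to_minutes(t, u) for t, u, _, _ in steps)
temperatures_k = sorted({to_kelvin(temp, u) for _, _, temp, u in steps})
```

Durations without a number, such as "several minutes", cannot be converted this way and need to be reported or skipped, which is what the failure messages in the output below reflect.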

In [30]:
time_frames = []
temp_frames = []
for c, x in sequences.items():
    x.extract_conditions()

    # Use scalar identifiers, as in the reagent cell, so that every row is
    # labelled regardless of how the condition tables are indexed
    paper_id = x.clean_synthesis['paper_identifier'].unique()[0]
    para_no = x.clean_synthesis['paragraph_number'].unique()[0]

    working_time = x.conditions.times
    working_temp = x.conditions.temps

    for working in (working_time, working_temp):
        working['synthesis_number'] = c
        working['paper_identifier'] = paper_id
        working['paragraph_number'] = para_no

    time_frames.append(working_time)
    temp_frames.append(working_temp)

# DataFrame.append was removed in pandas 2.0; concatenate once instead
all_times = pd.concat(time_frames)
all_temps = pd.concat(temp_frames)




Time extraction failed for: "min", original was "min"
IndexError list index out of range
Time extraction failed for: "hour", original was "h"
IndexError list index out of range
Time extraction failed for: "min", original was "min"
IndexError list index out of range
Time extraction failed for: "min", original was "min"
IndexError list index out of range
Time extraction failed for: "hour", original was "h"
IndexError list index out of range
Time extraction failed for: "min", original was "min"
IndexError list index out of range
Time extraction failed for: "hour", original was "h"
IndexError list index out of range
Time extraction failed for: "hour", original was "h"
IndexError list index out of range
Time extraction failed for: "a", original was "for a period"
IndexError list index out of range
Time extraction failed for: "hour", original was "h"
IndexError list index out of range
Time extraction failed for: "hour", original was "h"
IndexError list index out of range
Time extraction fail



Time extraction failed for: "hour", original was "h"
IndexError list index out of range
Time extraction failed for: "periods", original was "for periods"
IndexError list index out of range
Time extraction failed for: "time", original was "of time"
IndexError list index out of range
Time extraction failed for: "hour", original was "h"
IndexError list index out of range
Time extraction failed for: "periods", original was "for periods"
IndexError list index out of range
Time extraction failed for: "time", original was "of time"
IndexError list index out of range
Time extraction failed for: "day", original was "D"
IndexError list index out of range
Time extraction failed for: "day", original was "D"
IndexError list index out of range




Time extraction failed for: "time", original was "time"
IndexError list index out of range
Time extraction failed for: "time", original was "time"
IndexError list index out of range
Time extraction failed for: "several minutes", original was "for several minutes"
ValueError could not convert string to float: 'several'
Time extraction failed for: "several minutes", original was "for several minutes"
ValueError could not convert string to float: 'several'
Time extraction failed for: "a few minutes", original was "After a few minutes"
ValueError could not convert string to float: 'few'
Time extraction failed for: "hour", original was "h"
IndexError list index out of range
Time extraction failed for: "days", original was "days"
IndexError list index out of range
Time extraction failed for: "per half hour", original was "per half hour"
ValueError could not convert string to float: 'half'
Time extraction failed for: "a few minutes", original was "After a few minutes"
ValueError could not con

In [37]:
all_times.to_csv('time_information.csv')

In [38]:
all_temps.to_csv('temperature_information.csv')