# PDF File Extraction - Split into top-level sections

Extract the complete text of a PDF file and split it into sections,
in preparation for the ML6 use case.


### Matching of sections based on Regular Expressions

Note: the latest version of pdfminer does not work on all PDF files. Use the version that is installed together with textract (textract downgrades pdfminer to a suitable version):

pdfminer.six-20181108-py2.py3-none-any.whl

uninstall six 1.15.0
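
Because the extraction depends on specific package versions, it can help to check what is actually installed before running the notebook. The following is a minimal sketch, assuming Python 3.8 or later (for `importlib.metadata`); the helper name `installed_version` is hypothetical and not part of the notebook.

```python
from importlib import metadata

def installed_version(dist_name):
    """Return the installed version string of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None

# Distribution names as published on PyPI. The wheel named above
# corresponds to pdfminer.six version 20181108.
for name in ("pdfminer.six", "six", "textract"):
    print(name, "->", installed_version(name))
```

A result of `None` means the package is not installed in the current environment.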


In [1]:
# See https://www.blog.pythonlibrary.org/2018/05/03/exporting-data-from-pdfs-with-python/
#     http://zevross.com/blog/2014/04/09/extracting-tabular-data-from-a-pdf-an-example-using-python-and-regular-expressions/

import io
import pdfminer

from pdfminer.converter import TextConverter
from pdfminer.pdfinterp import PDFPageInterpreter
from pdfminer.pdfinterp import PDFResourceManager
from pdfminer.pdfpage import PDFPage


In [2]:
# all_texts=True also runs layout analysis on text inside figures
laparams = pdfminer.layout.LAParams(all_texts=True)

def extract_text_from_pdf(pdf_path):
    resource_manager = PDFResourceManager()
    fake_file_handle = io.StringIO()
    converter = TextConverter(resource_manager, fake_file_handle, laparams=laparams)
    page_interpreter = PDFPageInterpreter(resource_manager, converter)

    with open(pdf_path, 'rb') as fh:
        for page in PDFPage.get_pages(fh, 
                                      caching=True,
                                      check_extractable=True):
            page_interpreter.process_page(page)

        text = fake_file_handle.getvalue()

    # Close open handles
    converter.close()
    fake_file_handle.close()

    # Return None (rather than an empty string) when no text was extracted
    return text or None

In [3]:
import re

def FindOccurences(Text, RegularExpression):
    # Find all occurrences of RegularExpression in Text (multi-line, case-insensitive).
    # Returns a dict mapping the occurrence number (starting at 1) to its (begin, end) position in Text.
    # If nothing is found, returns an empty dictionary.
    
    count = 1
    Dict = {}
    for m in re.finditer(RegularExpression, Text, flags=re.MULTILINE | re.IGNORECASE):
        Dict[count] = (m.start(), m.end())
        count += 1
    
    return(Dict)
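
To illustrate the shape of the returned dictionary, here is a small self-contained sketch. It uses a hypothetical helper `find_occurrences` that follows the same idea as `FindOccurences` (same flags, occurrence numbers starting at 1); the sample text is made up.

```python
import re

def find_occurrences(text, pattern):
    """Map occurrence number (from 1) to the (start, end) span of each match."""
    flags = re.MULTILINE | re.IGNORECASE
    return {n: m.span() for n, m in enumerate(re.finditer(pattern, text, flags), start=1)}

sample = "Label elements\nsome text\nLABEL ELEMENTS again\n"

spans = find_occurrences(sample, r"label elements")
print(spans)                                    # {1: (0, 14), 2: (25, 39)}
print(find_occurrences(sample, r"not there"))   # {}
```

The spans can be used directly as slice bounds, for example `sample[25:39]` gives the second header as it appears in the text.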

In [4]:
def VerifyOccurence(AllHeaders):
    # Return the number of occurrences in AllHeaders (the dict returned by FindOccurences).
    
    #if len(AllHeaders) == 0:
    #    s = "No Occurences"
    #elif len(AllHeaders) > 1:
    #    s = "Multiple Occurences"
    #else:
    #    s = "OK - Single Occurence"
    
    #return(s)
    return(len(AllHeaders))

In [5]:
def ExtractText(Text, Begin, End):
    # Extract text Segment between 2 position indexes in a text string
    
    return(Text[Begin:End])

In [5]:
# From Maarten:

dict_structure = {"SECTION 1: Identification of the substance":
                          {"1.1. Product identifier":"",
                           "1.2. Relevant identified uses of the substance or mixture and uses advised against":"",
                           "1.3. Details of the supplier of the safety data sheet":"",
                           "1.4. Emergency telephone number":""},
                  "SECTION 2: Hazards identification":
                          {"2.1. Classification of the substance or mixture":"",
                           "2.2. Label elements":"",
                           "2.3. Other hazards":""},
                  "SECTION 3: Composition/information on ingredients":
                          {"3.1. Substances":"",
                           "3.2. Mixtures":""},
                  "SECTION 4: First aid measures":
                          {"4.1. Description of first aid measures":"",
                           "4.2. Most important symptoms and effects, both acute and delayed":"",
                           "4.3. Indication of any immediate medical attention and special treatment needed":""},
                  "SECTION 5: Firefighting measures":
                          {"5.1. Extinguishing media":"",
                           "5.2. Special hazards arising from the substance or mixture":"",
                           "5.3. Advice for firefighters":""},
                  "SECTION 6: Accidental release measures":
                          {"6.1. Personal precautions, protective equipment and emergency procedures":"",
                           "6.2. Environmental precautions":"",
                           "6.3. Methods and material for containment and cleaning up":"",
                           "6.4. Reference to other sections":""},
                  "SECTION 7: Handling and storage":
                          {"7.1. Precautions for safe handling":"",
                           "7.2. Conditions for safe storage, including any incompatibilities":"",
                           "7.3. Specific end use(s)":""},
                  "SECTION 8: Exposure controls/personal protection":
                          {"8.1. Control parameters":"",
                           "8.2. Exposure controls":""},
                  "SECTION 9: Physical and chemical properties":
                          {"9.1. Information on basic physical and chemical properties":"",
                           "9.2. Other information":""},
                  "SECTION 10: Stability and reactivity":
                          {"10.1. Reactivity":"",
                           "10.2. Chemical stability":"",
                           "10.3. Possibility of hazardous reactions":"",
                           "10.4. Conditions to avoid":"",
                           "10.5. Incompatible materials":"",
                           "10.6. Hazardous decomposition products":""},
                  "SECTION 11: Toxicological information":
                          {"11.1. Information on toxicological effects":""},
                  "SECTION 12: Ecological information":
                          {"12.1. Toxicity":"",
                           "12.2. Persistence and degradability":"",
                           "12.3. Bioaccumulative potential":"",
                           "12.4. Mobility in soil":"",
                           "12.5. Results of PBT and vPvB assessment":"",
                           "12.6. Other adverse effects":""},
                  "SECTION 13: Disposal considerations":
                          {"13.1. Waste treatment methods":""},
                  "SECTION 14: Transport information":
                          {"14.1. UN number":"",
                           "14.2. UN proper shipping name":"",
                           "14.3. Transport hazard class(es)":"",
                           "14.4. Packing group":"",
                           "14.5. Environmental hazards":""},
                  "SECTION 15: Regulatory information":
                          {"15.1. Safety, health and environmental regulations/legislation specific for the substance or mixture":"",
                           "15.2. Chemical safety assessment":""},
                  "SECTION 16: Other information":""
                 }


In [22]:
# SDS headers top level:
# All variants in a list need to be replaced by a single Regular Expression, matching all of the items in that list.
# The first element in the list is taken from the standard specification. The others are taken from all 20 PDF test files. 
# There may be repeats in the lists
# Only the first (official) element is still in use

HeaderDict = { 0 : [""], \
               1 : ["SECTION 1: Identification of the substance", "1. PRODUCT AND COMPANY IDENTIFICATION", "1.  PRODUCT AND COMPANY IDENTIFICATION", "1 Identification", "1. Identification", "1.Identification", "SECTION 1: Identification", "1. IDENTIFICATION", "SECTION 1: PRODUCT AND COMPANY IDENTIFICATION", "Section 1. Identification", " SECTION 1: Identification of the substance/mixture and of the company/undertaking ", "SECTION 1: Identification of the Substance/Preparation and the Company/Undertaking", "Section 1 - Identification of the Substance/Preparation and the Company/Undertaking"], \
               2 : ["SECTION 2: Hazards identification", "Hazards Identification", "2. HAZARDS IDENTIFICATION", "2 Hazard\(s\) identification", "2 Hazard\(s\) Identification", "2. Hazard\(s\) identification", "2.Hazard\(s\) identification", "2.Hazard(s) identification", "SECTION 2: Hazard\(s\) identification", "2.  HAZARDS IDENTIFICATION"," SECTION 2: HAZARDS IDENTIFICATION","SECTION 2: Hazards Identification", "SECTION 2: Hazards identification", "Section 2. Hazards identification"], \
               3 : ["SECTION 3: Composition/information on ingredients", "3. COMPOSITION/INFORMATION ON INGREDIENTS", "3 Composition/information on ingredients", "3 Composition/Information on Ingredients", "3.  COMPOSITION / INFORMATION ON INGREDIENTS", "3. Composition/information on ingredients", "3. Composition/Information on Ingredients", "3.Composition/information on ingredients", "Composition / Information on Ingredients", "3. COMPOSITION", "SECTION 3: Composition/Information on ingredients", "SECTION 3: Composition/Information on Ingredients", "SECTION 3: COMPOSITION/INFORMATION ON INGREDIENTS", "SECTION 3:Composition/information on ingredients", "Section 3. Composition/information on ingredients", "SECTION 3: COMPOSITION"], \
               4 : ["SECTION 4: First aid measures", "4. FIRST AID MEASURES", "4 First-Aid Measures", "4 First aid measures", "4. First-aid measures", "4. First-Aid Measures", "4.  FIRST AID MEASURES", "4.First-aid measures", "SECTION 4: First-aid measures", "SECTION 4: First-Aid Measures", "SECTION 4: FIRST AID MEASURES", " SECTION 4: First aid measures", "Section 4 - First Aid Measures", "Section 4. First aid measures"], \
               5 : ["SECTION 5: Firefighting measures", "5. FIRE-FIGHTING MEASURES", "5 Fire-Fighting Measures", "5 Fire-fighting measures", "5. FIRE FIGHTING MEASURES", "5. FIREFIGHTING MEASURES", "5.  FIRE FIGHTING MEASURES", "5. Fire-fighting measures", "5. Fire-Fighting Measures", "5.Fire-fighting measures", "SECTION 5: Fire-fighting measures", "SECTION 5: Fire-Fighting Measures", "SECTION 5: FIREFIGHTING MEASURES", "SECTION 5:FIRE-FIGHTING MEASURES", "SECTION 5: FIRE FIGHTING MEASURES"," SECTION 5: Firefighting measures", "Section 5 - Fire Fighting Measures", "Section 5. Fire-fighting measures"], \
               6 : ["SECTION 6: Accidental release measures", "6. ACCIDENTAL RELEASE MEASURES", "6 Accidental Release Measures", "6 Accidental release measures", "6.  ACCIDENTAL RELEASE MEASURES", "6. Accidental release measures", "6. Accidental Release Measures", "6.Accidental release measures", "SECTION 6: Accidental release measures","SECTION 6: Accidental Release Measures", "SECTION 6: ACCIDENTAL RELEASE MEASURES", "Section 6 - Accidental Release Measures", "Section 6. Accidental release measures"], \
               7 : ["SECTION 7: Handling and storage", "7. HANDLING AND STORAGE", "7 Handling and Storage", "7 Handling and storage", "7.  HANDLING AND STORAGE", "7. Handling and storage", "7. Handling and Storage", "7.Handling and storage", "SECTION 7: Handling and storage", "SECTION 7: Handling and Storage", "SECTION 7: HANDLING AND STORAGE", "Section 7 - Handling and Storage", "Section 7. Handling and storage"], \
               8 : ["SECTION 8: Exposure controls/personal protection", "8. EXPOSURE CONTROLS/PERSONAL PROTECTION", "8 Exposure controls/personal protection", "8. EXPOSURE CONTROLS / PERSONAL PROTECTION", "8.  EXPOSURE CONTROLS / PERSONAL PROTECTION", "8. Exposure Controls/personal protection", "8. Exposure controls / personal protection", "8 Exposure Controls/Personal Protection", "8. Exposure Controls/Personal Protection", "8. EXPOSURE CONTROLS AND PERSONAL PROTECTION", "8.Exposure controls/personal protection", "SECTION 8: Exposure controls/personal protection"," SECTION 8: Exposure Controls/Personal Protection", "SECTION 8: EXPOSURE CONTROLS/PERSONAL PROTECTION ", " SECTION 8: EXPOSURE CONTROLS AND PERSONAL PROECTION", "Section 8 - Exposure Controls / Personal Protection", "Section 8. Exposure controls/personal protection", "SECTION 8: 8. Exposure Controls/personal protection", "8. EXPOSURE CONTROLS AND PERSONAL PROECTION" ], \
               9 : ["SECTION 9: Physical and chemical properties", "9. PHYSICAL AND CHEMICAL PROPERTIES", "9 Physical and Chemical Properties", "9 Physical and chemical properties", "9.  PHYSICAL AND CHEMICAL PROPERTIES", "9. Physical and Chemical Properties", "9. Physical and chemical properties", "9. Physical and Chemical Properties", "9.Physical and chemical properties", "SECTION 9: Physical and chemical properties", "SECTION 9: Physical and Chemical Properties", "SECTION 9: PHYSICAL AND CHEMICAL PROPERTIES", "Section 9 - Physical & Chemical Properties", "Section 9. Physical and chemical properties"], \
               10 : ["SECTION 10: Stability and reactivity", "10. STABILITY AND REACTIVITY", "10 Stability and Reactivity", "10 Stability and reactivity", "10Stability and Reactivity", "10.Stability and reactivity", "10.  STABILITY AND REACTIVITY", "10. Stability and Reactivity", "10. Stability and reactivity", "SECTION 10: Stability and reactivity", "SECTION 10: Stability and Reactivity", "SECTION 10: STABILITY AND REACTIVITY", "Section 10 - Chemical Stability & Reactivity Information", "Section 10. Stability and reactivity"], \
               11 : ["SECTION 11: Toxicological information", "11. TOXICOLOGICAL INFORMATION", "11 Toxicological Information", "11 Toxicological information", "11Toxicological Information", "11.Toxicological information", "11.  TOXICOLOGICAL INFORMATION", "11. Toxicological information", "11. Toxicological Information", "SECTION 11: Toxicological information", "SECTION 11: Toxicological Information", "SECTION 11: TOXICOLOGICAL INFORMATION", "Section 11 - Toxicological Information", "Section 11. Toxicological information"], \
               12 : ["SECTION 12: Ecological information", "12. ECOLOGICAL INFORMATION", "12 Ecological Information", "12 Ecological information", "12Ecological Information", "12.Ecological information", "12.  ECOLOGICAL INFORMATION", "12. Ecological Information", "12. Ecological information", "SECTION 12: Ecological information", "SECTION 12: Ecological Information", "SECTION 12: ECOLOGICAL INFORMATION", "Section 12 - Ecological Information", "Section 12. Ecological information"], \
               13 : ["SECTION 13: Disposal considerations", "13. DISPOSAL CONSIDERATIONS", "13 Disposal Considerations", "13 Disposal consideration", "13Disposal Considerations", "13.Disposal considerations", "13.  DISPOSAL CONSIDERATIONS", "13. Disposal considerations", "13. Disposal Considerations", "SECTION 13: Disposal considerations", "SECTION 13: Disposal Considerations", "SECTION 13: DISPOSAL CONSIDERATIONS", "Section 13 - Disposal Considerations", "Section 13. Disposal considerations"], \
               14 : ["SECTION 14: Transport information", "14. TRANSPORT INFORMATION", "14 Transport Information", "14 Transport information", "14Transport Information", "14.Transport information", "14.  TRANSPORT INFORMATION", "14. Transport information", "14. Transport Information", "SECTION 14: Transport information", "SECTION 14: Transport Information", "SECTION 14: TRANSPORT INFORMATION", "Section 14 - Transportation Information", "Section 14. Transport information"], \
               15 : ["SECTION 15: Regulatory information", "15. REGULATORY INFORMATION", "15 Regulatory Information", "15 Regulatory information", "15Regulatory Information", "15.Regulatory information", "15.REGULATORY INFORMATION", "15. Regulatory information", "15. Regulatory Information", "SECTION 15: Regulatory information", "SECTION 15: Regulatory information", "SECTION 15: Regulatory Information", "SECTION 15: REGULATORY INFORMATION", "Section 15 - Regulatory Information", "Section 15. Regulatory information"], \
               16 : ["SECTION 16: Other information", "16. OTHER INFORMATION", "16 Other Information", "16 Other information", "16Other Information", "16.  OTHER INFORMATION", "16. Other information", "16. Other Information", "16.Other information, including date of preparation or last revision", "SECTION 16: Other information", "SECTION 16: Other Information","SECTION 16: OTHER INFORMATION", " SECTION 16: Other information, including date of preparation or last revision", "Section 16 - Other Information", "Section 16. Other information"], \
             }

# Check 8 : "SECTION 8: 8. Exposure Controls/personal protection" (last item in list)

In [6]:
# From Jonathan

import re

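# Raw strings keep the regex escapes (\S, \d, \w, ...) intact and avoid invalid-escape warnings.
regex_checkpoint = [
    r"^[^.\n]*?(SECTION|section|Section)?[^\S\r\n]?\d{1,2}[^\S\r\n]?[.:-]?[^\S\r\n]*?(\w+[\s/]+)*?",
    r"^[^a-zA-Z\n]*?(SECTION|section|Section)?[^\S\r\n]?\d{1,2}[^\S\r\n]?[.:-]?[^\S\r\n]*?(\w+[\s/]+)*?"
]

# regex1: start of line up to the section number (the number may not follow a dot, e.g. "3.7")
regex1 = r"^[^a-zA-Z\n]*?(SECTION)?[^\S\r\n]?(?<!\.)"
# regex2: separator after the section number, plus optional words before the keyword
regex2 = r"[^\S\r\n]?[.:-]?[^\S\r\n]*?(\w+[\s/]+)*?"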

# Keyword pattern per section; matching is case-insensitive (see FindOccurences).
sections = {
    1: r"(IDENTIFICATION|Identification|identification)",
    2: r"(HAZARDS|Hazard\(s\)|INGREDIENTS)",
    3: r"(COMPOSITION|INGREDIENTS|HAZARDS|Hazard\(s\))",
    4: r"(First[\s-]aid)",
    5: r"(Fire[\s-]*Fighting)",
    6: r"(Accidental\sRelease)",
    7: r"(Handling\sAnd\sStorage)",
    # "(?:And|/)?" replaces the character class "[And|/]*", which matched any mix of those characters
    8: r"(Exposure\s(Controls\s*)?(?:And|/)?\s*?Personal\s(Protection)*)",
    9: r"(Physical\s(And|&)\sChemical\sProperties)",
    10: r"(Stability\s(And|&)\sReactivity)",
    11: r"(Toxicological\sInformation)",
    12: r"(Ecological\sInformation)",
    13: r"(Disposal\sConsiderations*)",
    14: r"(Transport(ation)*\sInformation)",
    15: r"(Regulatory\sInformation)",
    16: r"(Other\sInformation)"
}

In [7]:
def GenerateHeaderRE(SectionNbr):
    # Build the regular expression that matches the header of section SectionNbr:
    # line prefix + section number + separator + section keyword.
    
    RE = regex1 + str(SectionNbr) + regex2 + sections[SectionNbr]
    return(RE)
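
As a quick check of how the generated expression behaves, the sketch below rebuilds the pattern for section 7 from copies of `regex1`, `regex2` and the section 7 keyword, then tries it on a few header variants taken from `HeaderDict[7]` and on lines that should not count as a top-level header. The two lists and the variable names are only for illustration.

```python
import re

# Copies of the building blocks defined above, written as raw strings.
regex1 = r"^[^a-zA-Z\n]*?(SECTION)?[^\S\r\n]?(?<!\.)"
regex2 = r"[^\S\r\n]?[.:-]?[^\S\r\n]*?(\w+[\s/]+)*?"
section7 = r"(Handling\sAnd\sStorage)"

header_re = re.compile(regex1 + "7" + regex2 + section7, re.MULTILINE | re.IGNORECASE)

variants = [
    "SECTION 7: Handling and storage",
    "7. HANDLING AND STORAGE",
    "7.Handling and storage",
    "Section 7 - Handling and Storage",
]
non_headers = [
    "3.7 Handling and storage",               # "7" directly after a dot
    "7.1. Handling and storage precautions",  # subsection number, not a section
    "See section 7 for handling and storage", # header must start the line
]

matched = [v for v in variants if header_re.search(v)]
rejected = [t for t in non_headers if not header_re.search(t)]

print(len(matched), "of", len(variants), "variants matched")
print(len(rejected), "of", len(non_headers), "non-headers rejected")
```

The negative lookbehind `(?<!\.)` in `regex1` is what rejects "3.7 ...", and the `^` anchor combined with `re.MULTILINE` restricts matches to the start of a line.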

In [9]:
SubHeaderDict = { 0: {}, \
                  1: { 1 : ["Product identifier"], \
                       2 : ["Relevant identified uses of the substance or mixture and uses advised against"], \
                       3 : ["Details of the supplier of the safety data sheet"], \
                       4 : ["Emergency telephone number"] \
                     }, \
                  2: { 1 : ["Classification of the substance or mixture"], \
                       2 : ["Label elements"], \
                       3 : ["Other hazards"] \
                     }, \
                  3: { 1 : ["Substances"], \
                       2 : ["Mixtures"] \
                     }, \
                  4: { 1 : ["Description of first aid measures"], \
                       2 : ["Most important symptoms and effects, both acute and delayed"], \
                       3 : ["Indication of any immediate medical attention and special treatment needed"] \
                     }, \
                  5: { 1 : ["Extinguishing media"], \
                       2 : ["Special hazards arising from the substance or mixture"], \
                       3 : ["Advice for firefighters"] \
                     }, \
                  6: { 1 : ["Personal precautions, protective equipment and emergency procedures"], \
                       2 : ["Environmental precautions"], \
                       3 : ["Methods and material for containment and cleaning up"], \
                       4 : ["Reference to other sections"] \
                     }, \
                  7: { 1 : ["Precautions for safe handling"], \
                       2 : ["Conditions for safe storage, including any incompatibilities"], \
                       3 : ["Specific end use\(s\)"] \
                     }, \
                  8: { 1 : ["Control parameters"], \
                       2 : ["Exposure controls"] \
                       }, \
                  9: { 1 : ["Information on basic physical and chemical properties"], \
                       2 : ["Other information"] \
                     }, \
                  10: { 1 : ["Reactivity"], \
                        2 : ["Chemical stability"], \
                        3 : ["Possibility of hazardous reactions"], \
                        4 : ["Conditions to avoid"], \
                        5 : ["Incompatible materials"], \
                        6 : ["Hazardous decomposition products"] \
                      }, \
                  11: { 1 : ["Information on toxicological effects"] \
                      }, \
                  12: { 1 : ["Toxicity"], \
                        2 : ["Persistence and degradability"], \
                        3 : ["Bioaccumulative potential"], \
                        4 : ["Mobility in soil"], \
                        5 : ["Results of PBT and vPvB assessment"], \
                        6 : ["Other adverse effects"] \
                      }, \
                  13: { 1 : ["Waste treatment methods"] \
                      }, \
                  14: { 1 : ["UN number", "UN-Number:"], \
                        2 : ["UN proper shipping name"], \
                        3 : ["Transport hazard class\(es\)"], \
                        4 : ["Packing group"], \
                        5 : ["Environmental hazards", "Environmental hazards:"] \
                      }, \
                  15: { 1 : ["Safety, health and environmental regulations/legislation specific for the substance or mixture"], \
                        2 : ["Chemical safety assessment"] \
                      }, \
                  16: {}
                 }
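
`SubHeaderDict` carries the same titles as `dict_structure`, only keyed by number. As a sketch of how one form could be derived from the other instead of being typed twice, the following uses a short excerpt of `dict_structure`; the function name `to_numbered` is hypothetical, and `re.escape` is used here as an assumption to make titles such as "Specific end use(s)" safe to use as patterns.

```python
import re

# Short excerpt of dict_structure, same key format as above.
excerpt = {
    "SECTION 7: Handling and storage": {
        "7.1. Precautions for safe handling": "",
        "7.2. Conditions for safe storage, including any incompatibilities": "",
        "7.3. Specific end use(s)": "",
    },
    "SECTION 16: Other information": "",
}

def to_numbered(structure):
    """Map section number -> {subsection number -> [escaped title pattern]}."""
    numbered = {}
    for section_title, subsections in structure.items():
        section_nbr = int(re.match(r"SECTION (\d+):", section_title).group(1))
        numbered[section_nbr] = {}
        if not isinstance(subsections, dict):
            continue  # a section without subsections, such as section 16
        for sub_title in subsections:
            m = re.match(r"\d+\.(\d+)\.\s*(.+)", sub_title)
            numbered[section_nbr][int(m.group(1))] = [re.escape(m.group(2))]
    return numbered

numbered = to_numbered(excerpt)
print(sorted(numbered))      # [7, 16]
print(sorted(numbered[7]))   # [1, 2, 3]
```

Additional hand-collected variants (for example "UN-Number:") would still have to be appended to the generated lists.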


In [10]:
def SubSections(SectionNbr, PDFSectionText):
    # Split text of section SectionNbr from PDFSectionText into subsections
    
    SubSectionTextDict = {}
    
    # Verify there is a section:
    #print(SectionNbr, "\n", PDFSectionText)
    #print("section len = ", len(PDFSectionText[SectionNbr]) )
    if (len(PDFSectionText[SectionNbr]) < 8):
        return({})
    
    #print(PDFSectionText[SectionNbr])
    
    # Check if there are subsections:
    if SubHeaderDict[SectionNbr]:
        
        # Locate all Headers:
        HeaderIndexes = {}
        AllHeaders = {}
        
        # loop through all possible subheaders of the section:
        for sh in range(1,len(SubHeaderDict[SectionNbr])+1):    # indexing of subsections starts at 1.
            # loop through all possible regular expressions for that subheader:
            for i in range(len(SubHeaderDict[SectionNbr][sh])):
                #print(sh, i)
                #print(SubHeaderDict[SectionNbr][sh][i])
                AllHeaders = FindOccurences(PDFSectionText[SectionNbr], SubHeaderDict[SectionNbr][sh][i])
                if AllHeaders:
                    HeaderIndexes[sh] = AllHeaders[1]        
        #print(HeaderIndexes)
        if not HeaderIndexes: # no sub headers found
            return({})
        
        #print(HeaderIndexes)
        # Determine begin and end position of each header:    
        SectionIndexes = {}
        #print(len(SubHeaderDict[SectionNbr]))
        #for SubSection in range(1,len(SubHeaderDict[SectionNbr])+1):
        #    if SubSection in HeaderIndexes:
        #        print(SubSection, "is in\n")
        #    else:
        #        print(SubSection, "is NOT in\n")
                
        # Subsection headers are optional: each subsection that was found runs from
        # the end of its own header to the start of the next header that was found.
        SSlist = sorted(HeaderIndexes)
        for i, NextSubSection in zip(SSlist, SSlist[1:]):
            SectionIndexes[i] = (HeaderIndexes[i][1] + 1, HeaderIndexes[NextSubSection][0] - 1)

        SectionIndexes[SSlist[-1]] = (HeaderIndexes[SSlist[-1]][1] + 1, len(PDFSectionText[SectionNbr]))
        

        # Extract the text of each subsection:
        
        #print(SectionIndexes)
    
        #for SubSection in range(1,len(SubHeaderDict[SectionNbr])+1):
        for SubSection in SSlist:
            SubSectionTextDict[SubSection] = ExtractText(PDFSectionText[SectionNbr], SectionIndexes[SubSection][0], SectionIndexes[SubSection][1])
    
    return(SubSectionTextDict)              
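
The splitting idea can be tried in isolation on a small string. The sketch below is a simplified variant, not the notebook's function: the name `split_by_headers` is hypothetical, it orders the headers by their position in the text, and it trims whitespace with `strip()` instead of adjusting the indexes by one. It shows that a missing subsection header (here 2) is simply skipped.

```python
import re

def split_by_headers(text, header_patterns):
    """Split text into {key: body} using the first match of each header pattern.

    Headers that do not occur are skipped; each body ends where the next
    header that was found starts.
    """
    flags = re.MULTILINE | re.IGNORECASE
    found = {}
    for key, pattern in header_patterns.items():
        m = re.search(pattern, text, flags)
        if m:
            found[key] = m.span()

    # Order the found headers by their position in the text
    ordered = sorted(found, key=lambda k: found[k][0])

    bodies = {}
    for key, nxt in zip(ordered, ordered[1:] + [None]):
        end = found[nxt][0] if nxt is not None else len(text)
        bodies[key] = text[found[key][1]:end].strip()
    return bodies

text = "Reactivity\nStable.\nConditions to avoid\nHeat.\n"
patterns = {1: r"Reactivity", 2: r"Chemical stability", 4: r"Conditions to avoid"}

print(split_by_headers(text, patterns))   # {1: 'Stable.', 4: 'Heat.'}
```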

In [18]:
def SubSectionsStr(SectionNbr, SectionText):
    # Split SectionText (the text of section SectionNbr, passed as a string) into subsections
    
    SubSectionTextDict = {}
    
    # Verify there is a section:
    if len(SectionText) < 8:
        return({})
    
    #print(PDFSectionText[SectionNbr])
    
    # Check if there are subsections:
    if SubHeaderDict[SectionNbr]:
        
        # Locate all Headers:
        HeaderIndexes = {}
        AllHeaders = {}
        
        # loop through all possible subheaders of the section:
        for sh in range(1,len(SubHeaderDict[SectionNbr])+1):    # indexing of subsections starts at 1.
            # loop through all possible regular expressions for that subheader:
            for i in range(len(SubHeaderDict[SectionNbr][sh])):
                #print(sh, i)
                #print(SubHeaderDict[SectionNbr][sh][i])
                AllHeaders = FindOccurences(SectionText, SubHeaderDict[SectionNbr][sh][i])
                if AllHeaders:
                    HeaderIndexes[sh] = AllHeaders[1]        
        #print(HeaderIndexes)
        if not HeaderIndexes: # no sub headers found
            return({})
        
        #print(HeaderIndexes)
        # Determine begin and end position of each header:    
        SectionIndexes = {}
        #print(len(SubHeaderDict[SectionNbr]))
        #for SubSection in range(1,len(SubHeaderDict[SectionNbr])+1):
        #    if SubSection in HeaderIndexes:
        #        print(SubSection, "is in\n")
        #    else:
        #        print(SubSection, "is NOT in\n")
                
        # Subsection headers are optional: each subsection that was found runs from
        # the end of its own header to the start of the next header that was found.
        SSlist = sorted(HeaderIndexes)
        for i, NextSubSection in zip(SSlist, SSlist[1:]):
            SectionIndexes[i] = (HeaderIndexes[i][1] + 1, HeaderIndexes[NextSubSection][0] - 1)

        SectionIndexes[SSlist[-1]] = (HeaderIndexes[SSlist[-1]][1] + 1, len(SectionText))
        

        # Extract the text of each subsection:
        
        #print(SectionIndexes)
    
        #for SubSection in range(1,len(SubHeaderDict[SectionNbr])+1):
        for SubSection in SSlist:
            SubSectionTextDict[SubSection] = ExtractText(SectionText, SectionIndexes[SubSection][0], SectionIndexes[SubSection][1])
    
    return(SubSectionTextDict)              

In [145]:
FindOccurences(PDFSectionsText[2], SubHeaderDict[2][3][0])

{}

In [163]:
PDFSectionsText[1]

'· Product Identifier\n· Trade name: 10N Sodium Hydroxide (NaOH 40%)\n· Product Number: NGT-10N NaOH\n· Relevant identified uses of the substance or mixture and uses advised against:\nNo further relevant information available.\n· Product Description PC21   Laboratory chemicals\n· Application of the substance / the mixture: Laboratory chemicals\n· Details of the Supplier of the Safety Data Sheet:\n· Manufacturer/Supplier:\nNuGeneration Technologies, LLC (dba NuGenTec)\n1155 Park Avenue, Emeryville, CA 94608\nsalesteam@nugentec.com\n1-888-996-8436 or 1-707-820-4080 for product information\n· Emergency telephone number:\nPERS Emergency Response: Domestic and Canada - 1-800-633-8253, International 1-801-629-0667\n\nwww.nugentec.com\n'

In [305]:
SubS = SubSections(1, PDFSectionsText)
SubS

{1: '· Trade name: 10N Sodium Hydroxide (NaOH 40%)\n· Product Number: NGT-10N NaOH\n·',
 2: '\nNo further relevant information available.\n· Product Description PC21   Laboratory chemicals\n· Application of the substance / the mixture: Laboratory chemicals\n·',
 3: '\n· Manufacturer/Supplier:\nNuGeneration Technologies, LLC (dba NuGenTec)\n1155 Park Avenue, Emeryville, CA 94608\nsalesteam@nugentec.com\n1-888-996-8436 or 1-707-820-4080 for product information\n·',
 4: '\nPERS Emergency Response: Domestic and Canada - 1-800-633-8253, International 1-801-629-0667\n\nwww.nugentec.com\n'}

In [230]:
print(type(SubS))

<class 'dict'>


In [285]:
PDFSectionsText[2]

'Identification\n· Classification of the substance or mixture:\n\nd~\uf082 GHS05 Corrosion\n\n* \n\n* \n\nSkin Corr. 1A H314 Causes severe skin burns and eye damage.\nEye Dam. 1 H318 Causes serious eye damage.\n· Label elements:\n· GHS label elements\nThe product is classified and labeled according to the Globally Harmonized System (GHS).\n· Hazard pictograms:\n\nd~\uf082\n\nGHS05\n\n· Signal word: Danger\n· Hazard-determining components of labeling:\nSodium Hydroxide\n· Hazard statements:\nH314 Causes severe skin burns and eye damage.\n· Precautionary statements:\nP260\nP280\nP264\nP303+P361+P353 If on skin (or hair): Take off immediately all contaminated clothing. Rinse skin with\n\nDo not breathe dusts or mists.\nWear protective gloves/protective clothing/eye protection/face protection.\nWash thoroughly after handling.\n\nwater/shower.\n\nP305+P351+P338 If in eyes: Rinse cautiously with water for several minutes. Remove contact lenses,\n\nif present and easy to do. Continue rinsing.

In [306]:
SubSections(2, PDFSectionsText)

{1: '\n\nd~\uf082 GHS05 Corrosion\n\n* \n\n* \n\nSkin Corr. 1A H314 Causes severe skin burns and eye damage.\nEye Dam. 1 H318 Causes serious eye damage.\n·',
 2: '\n· GHS label elements\nThe product is classified and labeled according to the Globally Harmonized System (GHS).\n· Hazard pictograms:\n\nd~\uf082\n\nGHS05\n\n· Signal word: Danger\n· Hazard-determining components of labeling:\nSodium Hydroxide\n· Hazard statements:\nH314 Causes severe skin burns and eye damage.\n· Precautionary statements:\nP260\nP280\nP264\nP303+P361+P353 If on skin (or hair): Take off immediately all contaminated clothing. Rinse skin with\n\nDo not breathe dusts or mists.\nWear protective gloves/protective clothing/eye protection/face protection.\nWash thoroughly after handling.\n\nwater/shower.\n\nP305+P351+P338 If in eyes: Rinse cautiously with water for several minutes. Remove contact lenses,\n\nif present and easy to do. Continue rinsing.\n\n(Contd. on page 2)\n US \n\n \n\n44.2.1\n\n\x0cSafety Data Sh

In [341]:
SubSections(12, PDFSectionsText)

12 
section len =  1


{}

In [221]:
PDFSectionsText[14]



In [307]:
for i in range(len(PDFSectionsText)):
     print("Sub H = ", i)
     print(SubSections(i, PDFSectionsText))

Sub H =  0
{}
Sub H =  1
{1: '· Trade name: 10N Sodium Hydroxide (NaOH 40%)\n· Product Number: NGT-10N NaOH\n·', 2: '\nNo further relevant information available.\n· Product Description PC21   Laboratory chemicals\n· Application of the substance / the mixture: Laboratory chemicals\n·', 3: '\n· Manufacturer/Supplier:\nNuGeneration Technologies, LLC (dba NuGenTec)\n1155 Park Avenue, Emeryville, CA 94608\nsalesteam@nugentec.com\n1-888-996-8436 or 1-707-820-4080 for product information\n·', 4: '\nPERS Emergency Response: Domestic and Canada - 1-800-633-8253, International 1-801-629-0667\n\nwww.nugentec.com\n'}
Sub H =  2
{1: '\n\nd~\uf082 GHS05 Corrosion\n\n* \n\n* \n\nSkin Corr. 1A H314 Causes severe skin burns and eye damage.\nEye Dam. 1 H318 Causes serious eye damage.\n·', 2: '\n· GHS label elements\nThe product is classified and labeled according to the Globally Harmonized System (GHS).\n· Hazard pictograms:\n\nd~\uf082\n\nGHS05\n\n· Signal word: Danger\n· Hazard-determining components 

In [21]:
SubHeaderDict[10][1][0]

'Reactivity'

In [22]:
SubSections(10, "")

In [11]:
def PDF2Sections(PDFfile):
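    # Convert the extracted text of a PDF file into a Dict of Sections (0 = intro, 1..16 = SDS sections).
    # Returns None as soon as one of the 16 section headers is not found.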

    # Locate all Headers:
    HeaderIndexes = {}
    AllHeaders = {}

    for Header in range(1,17):
        AllHeaders = FindOccurences(PDFfile, GenerateHeaderRE(Header))
        if AllHeaders:
            HeaderIndexes[Header] = AllHeaders[1]
            #VerifyOccurence(AllHeaders)
        else:
            print("\nHeader ", HeaderDict[Header][0], " not found")
            return(None)
    
    #print(HeaderIndexes)

    # Determine begin and end position of each header:    
    SectionIndexes = {}
                                    
    SectionIndexes[0] = (0, HeaderIndexes[1][0] - 1)   # Section 0 is to contain PDF file intro
     
    for Section in range(1,16):
        SectionIndexes[Section] = (HeaderIndexes[Section][1] + 1, HeaderIndexes[Section + 1][0] - 1)
    
    SectionIndexes[16] = (HeaderIndexes[16][1] + 1, len(PDFfile) )

    # Extract the text of each section:
    SectionTextDict = {}
    
    for Section in range(0,17):
        SectionTextDict[Section] = ExtractText(PDFfile, SectionIndexes[Section][0], SectionIndexes[Section][1])
    
    return(SectionTextDict)   
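
The slicing idea behind `PDF2Sections` can be sketched on its own, without the notebook's `HeaderDict`, `GenerateHeaderRE` or `ExtractText` helpers. This is a minimal sketch, not the notebook's implementation: `split_on_headers` and `header_patterns` are hypothetical names, and it assumes one regular expression per header, listed in document order. Because `m.end()` is already exclusive, the slices use the match positions directly, with no `+ 1` or `- 1` adjustment.

```python
import re

def split_on_headers(text, header_patterns):
    """Split text into {0: intro, 1: ..., n: ...} using one regex per header.

    header_patterns is a hypothetical list of regex strings in document order.
    Returns None as soon as one header cannot be found.
    """
    spans = []
    for pattern in header_patterns:
        m = re.search(pattern, text, flags=re.MULTILINE | re.IGNORECASE)
        if m is None:
            return None
        spans.append((m.start(), m.end()))

    # Section 0 holds everything before the first header
    sections = {0: text[:spans[0][0]]}
    for i, (_, end) in enumerate(spans, start=1):
        # A section body runs from the end of its header to the start of the next one
        next_start = spans[i][0] if i < len(spans) else len(text)
        sections[i] = text[end:next_start]
    return sections

sample = "Intro\n1. Identification\nAcme\n2. Hazards\nCorrosive\n"
result = split_on_headers(sample, [r"^1\. Identification", r"^2\. Hazards"])
print(result)
```

Returning `None` on a missing header mirrors the behaviour of the function above, so a caller has to check the result before printing it.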

In [12]:
import sys

def PrintSections(SectionDict, FileName = None):
    # Print out the PDF converted to Dictionary
    # If FileName is given, write to that file; otherwise display on screen.
    
    if FileName:
        original_stdout = sys.stdout # Save a reference to the original standard output
        f = open(FileName, 'w')
        sys.stdout = f # Change the standard output to the file we created.
    
    for Section in range(0,17):
        print("-------------\nSection ", Section, ": ", HeaderDict[Section][0], "\n-------------\n", SectionDict[Section], "\n")
    
    if FileName:
        sys.stdout = original_stdout # Reset the standard output to its original value
        f.close() # Close the file so that all output is flushed to disk
    
    return(None)
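
An alternative to swapping `sys.stdout` is to pass the output stream to `print` through its `file` argument. The sketch below is a hedged illustration of that approach: `print_sections`, `titles` and `out` are hypothetical names, and `titles` stands in for the notebook's `HeaderDict`.

```python
import io
import sys

def print_sections(section_dict, titles, out=None):
    """Write each section with a banner to out (defaults to the screen)."""
    out = out or sys.stdout
    for key in sorted(section_dict):
        print("-------------", file=out)
        print(f"Section {key}: {titles.get(key, '')}", file=out)
        print("-------------", file=out)
        print(section_dict[key], file=out)

# Capture the report in memory to show that any file-like object works
buffer = io.StringIO()
print_sections({0: "intro", 1: "body"}, {1: "Identification"}, out=buffer)
report = buffer.getvalue()
```

To write to disk, the same function can be called inside `with open(FileName, "w") as f:` with `out=f`, which closes the file even if printing fails.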

In [13]:
def PDF2SubSections(PDFfile):
    # Split the extracted PDF text into a dict of sections; a section is itself a dict when subsections are found

    # Locate all Headers:
    HeaderIndexes = {}
    AllHeaders = {}

    for Header in range(1,17):
        AllHeaders = FindOccurences(PDFfile, GenerateHeaderRE(Header))
        if AllHeaders:
            HeaderIndexes[Header] = AllHeaders[1]
            #VerifyOccurence(AllHeaders)
        else:
            print("\nHeader ", HeaderDict[Header][0], " not found")
            return(None)
    
    #print(HeaderIndexes)

    # Determine begin and end position of each section:
    SectionIndexes = {}
                                    
    SectionIndexes[0] = (0, HeaderIndexes[1][0] - 1)   # Section 0 is to contain PDF file intro
     
    for Section in range(1,16):
        SectionIndexes[Section] = (HeaderIndexes[Section][1] + 1, HeaderIndexes[Section + 1][0] - 1)
    
    SectionIndexes[16] = (HeaderIndexes[16][1] + 1, len(PDFfile) )

    # Extract the text of each section:
    SectionTextDict = {}
    
    for Section in range(0,17):
        
        Hlp = ExtractText(PDFfile, SectionIndexes[Section][0], SectionIndexes[Section][1])
        #print(Hlp, "==================================================================================== END")
        #if Section == 11:
        #   return(Hlp)
        #print("len Hlp = ", len(Hlp))
        SubS = SubSectionsStr(Section, Hlp)
        if (SubS):
            SectionTextDict[Section] = SubS
        else: # no subsections
            SectionTextDict[Section] = Hlp
    
    return(SectionTextDict)   

In [37]:
import sys

def PrintSubSections(SubSectionDict, FileName = None):
    # Print out the PDF converted to Dictionary
    # If FileName is given, write to that file; otherwise display on screen.
    
    if FileName:
        original_stdout = sys.stdout # Save a reference to the original standard output
        f = open(FileName, 'w')
        sys.stdout = f # Change the standard output to the file we created.
    
    for Section in range(0,17):
        print("==============\nSection ", Section, ": ", HeaderDict[Section][0], "\n==============\n")
        if SubHeaderDict[Section]:
            if isinstance(SubSectionDict[Section], str):
                print(SubSectionDict[Section])
            elif isinstance(SubSectionDict[Section], dict):
                for sh in SubSectionDict[Section]:
                    print("-------------\nSubSection ", sh, ": ", SubHeaderDict[Section][sh][0], "\n-------------\n")
                    print(SubSectionDict[Section][sh])
        else:
            if SubSectionDict[Section]:
                print(SubSectionDict[Section])
    
    if FileName:
        sys.stdout = original_stdout # Reset the standard output to its original value
        f.close() # Close the file so that all output is flushed to disk
    
    return(None)

In [15]:
InputFile = 'data/1799516.pdf'
PDFtext = extract_text_from_pdf(InputFile)

In [16]:
PDFtext



In [19]:
SubSdict = PDF2SubSections(PDFtext)

In [20]:
type(SubSdict)

dict

In [26]:
SubSdict

{0: 'Material Safety Data Sheet \n\nRevision Number: 004.0\n\nIssue date: 02/01/2012\n',
 1: '\n\nProduct name: \nProduct type:\n\nDB EPOXY Pot CMPD E-60NC 50ML \nEpoxy Hardener  \n\nCompany address: \nHenkel Corporation \nOne Henkel Way \nRocky Hill, Connecticut 06067\n\n701990\n29324_209537\nUnited States\n\nIDH number:\nItem number:\nRegion: \nContact information: \nTelephone: 860.571.5100 \nMEDICAL EMERGENCY Phone: Poison Control Center \n1-877-671-4608 (toll free) or 1-303-592-1711 \nTRANSPORT EMERGENCY Phone: CHEMTREC \n1-800-424-9300 (toll free) or 1-703-527-3887 \nInternet: www.henkelna.com\n\n \n \n \nPhysical state: \nColor: \nOdor: \n\nDANGER:\n',
 2: 'IDENTIFICATION \n\nEMERGENCY OVERVIEW\nHMIS:\n\nLiquid \nPale yellow\nMild\n\nHEALTH: \nFLAMMABILITY: \nPHYSICAL HAZARD: \nPersonal Protection: \n\n *3\n 1\n0\n\n See MSDS Section 8\n\nCAUSES EYE, SKIN AND RESPIRATORY TRACT BURNS. \nMAY CAUSE ALLERGIC SKIN REACTION. \nHARMFUL IF SWALLOWED OR INHALED. \nMAY BE HARMFUL IF ABSORB

In [38]:
PrintSubSections(SubSdict)

Section  0 :   

Material Safety Data Sheet 

Revision Number: 004.0

Issue date: 02/01/2012

Section  1 :  SECTION 1: Identification of the substance 



Product name: 
Product type:

DB EPOXY Pot CMPD E-60NC 50ML 
Epoxy Hardener  

Company address: 
Henkel Corporation 
One Henkel Way 
Rocky Hill, Connecticut 06067

701990
29324_209537
United States

IDH number:
Item number:
Region: 
Contact information: 
Telephone: 860.571.5100 
MEDICAL EMERGENCY Phone: Poison Control Center 
1-877-671-4608 (toll free) or 1-303-592-1711 
TRANSPORT EMERGENCY Phone: CHEMTREC 
1-800-424-9300 (toll free) or 1-703-527-3887 
Internet: www.henkelna.com

 
 
 
Physical state: 
Color: 
Odor: 

DANGER:

Section  2 :  SECTION 2: Hazards identification 

IDENTIFICATION 

EMERGENCY OVERVIEW
HMIS:

Liquid 
Pale yellow
Mild

HEALTH: 
FLAMMABILITY: 
PHYSICAL HAZARD: 
Personal Protection: 

 *3
 1
0

 See MSDS Section 8

CAUSES EYE, SKIN AND RESPIRATORY TRACT BURNS. 
MAY CAUSE ALLERGIC SKIN REACTION. 
HARMFUL IF SWAL

In [355]:
print(len(SubSFile))
print(SubSFile)

17


In [318]:
SubSFile

'\n\nHazardous components \n\nNTP Carcinogen \n\nIARC Carcinogen \n\nOSHA Carcinogen \n\n(Specifically Regulated) \n\n4-Nonylphenol, branched\nCycloaliphatic amine\nAlcohol derivative\n\nHazardous components \n\n4-Nonylphenol, branched\nCycloaliphatic amine\nAlcohol derivative\n\nNo\nNo\nNo\n\nNo\nNo\nNo\n\nNo\nNo\nNo\n\nHealth Effects/Target Organs \n\nIrritant, Corrosive\n\nBrain, Corrosive, Irritant, Liver, Lung, Muscle, Skin\nAllergen, Central nervous system, Corrosive, Irritant\n\nEcological information: \n\nNot available.\n'

In [268]:
SubSections(1, SubSFile)

{}

In [237]:
PrintSections(SubSFile)


Header  ['SECTION 1: Identification of the substance', '1. PRODUCT AND COMPANY IDENTIFICATION', '1.  PRODUCT AND COMPANY IDENTIFICATION', '1 Identification', '1. Identification', '1.Identification', 'SECTION 1: Identification', '1. IDENTIFICATION', 'SECTION 1: PRODUCT AND COMPANY IDENTIFICATION', 'Section 1. Identification', ' SECTION 1: Identification of the substance/mixture and of the company/undertaking ', 'SECTION 1: Identification of the Substance/Preparation and the Company/Undertaking', 'Section 1 - Identification of the Substance/Preparation and the Company/Undertaking']  not found


In [330]:
PDFSectionsText = PDF2Sections(PDFtext)

In [337]:
PDFSectionsText[11]

'\n\nHazardous components \n\nNTP Carcinogen \n\nIARC Carcinogen \n\nOSHA Carcinogen \n\n(Specifically Regulated) \n\n4-Nonylphenol, branched\nCycloaliphatic amine\nAlcohol derivative\n\nHazardous components \n\n4-Nonylphenol, branched\nCycloaliphatic amine\nAlcohol derivative\n\nNo\nNo\nNo\n\nNo\nNo\nNo\n\nNo\nNo\nNo\n\nHealth Effects/Target Organs \n\nIrritant, Corrosive\n\nBrain, Corrosive, Irritant, Liver, Lung, Muscle, Skin\nAllergen, Central nervous system, Corrosive, Irritant\n\nEcological information: \n\nNot available.\n'

In [338]:
PDFSectionsText[12]

'\n'

In [339]:
print(len(PDFSectionsText[12]))

1


In [59]:
import os

def ProcessAllPDFs(Directory):
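    # Convert every PDF file in Directory to sections and print them
    
    for filename in os.listdir(Directory):
        if filename.endswith(".pdf"):
            print(f"\n======================================================\nFile: {filename}\n======================================================")
            # Build the path from Directory and filename (not the global InputFile)
            PDFtext = extract_text_from_pdf(os.path.join(Directory, filename))
            if not PDFtext:
                continue   # no extractable text in this file
            PDFSectionsText = PDF2Sections(PDFtext)
            if PDFSectionsText:   # None when a header was not found
                PrintSections(PDFSectionsText)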
    
    return(None)
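
The directory walk can also be written with `pathlib`, which returns full paths directly and so avoids passing a bare file name to the extractor. This is a small sketch under assumptions: `list_pdf_paths` is a hypothetical helper, and matching the extension case-insensitively is a choice made here, not something the notebook does.

```python
import tempfile
from pathlib import Path

def list_pdf_paths(directory):
    """Return the full paths of all .pdf files in directory, sorted by name."""
    return sorted(p for p in Path(directory).iterdir()
                  if p.is_file() and p.suffix.lower() == ".pdf")

# Demonstrate on a throwaway directory with two PDFs and one other file
with tempfile.TemporaryDirectory() as tmp:
    for name in ("b.pdf", "a.PDF", "notes.txt"):
        (Path(tmp) / name).write_bytes(b"")
    found = [p.name for p in list_pdf_paths(tmp)]
print(found)
```

Each returned path can then be passed to `extract_text_from_pdf`, since `open` accepts `Path` objects.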

## Remarks

InputFile = 'data/1799516.pdf'

!!!!!!!! Section 12 text ends up in section 11 (header 12 not recognized).
         Some information appears more than once. Section 16 contains info from other sections !!!!!!!!!

InputFile = 'data/gylcol-ether-dpm-sds.pdf'

!!!!! Section title '8. EXPOSURE CONTROLS AND PERSONAL PROECTION' contains a typo !!!!!

InputFile = 'data/NN-dimethylformamide_3_US_EN_sds.pdf'

!!!!! Section 16: The first letter of the text is lost !!!!!!

InputFile = "data/quilon_h-eu_msds__1594909058.pdf"

!!!!! This file does not respect the section numbering. Sections 2 and 3 are interchanged !!!!!!!!!

InputFile = "data/SDS - Horizon Blue Clarifier.pdf"

!!!! This one has its text in disorder. Labels and values are no longer together, and some values are missing. !!!!

InputFile = "data/Sledgehammer-Heat-Cure-Powder-SDS-US-20150522.pdf"

!!!! Repeats the section header on each new page, so there are multiple occurrences of the same section header. !!!!
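
A header with a typo, such as the 'PROECTION' title noted above, cannot be matched by an exact regular expression. One possible workaround, which this notebook does not use, is approximate matching with the standard library's `difflib`. In the sketch below, `find_header_line` and the `cutoff` value of 0.85 are assumptions chosen for illustration.

```python
import difflib

def find_header_line(lines, expected_title, cutoff=0.85):
    """Return the index of the line most similar to expected_title, or None.

    Only lines whose similarity ratio exceeds cutoff are considered.
    """
    best_index, best_ratio = None, cutoff
    target = expected_title.upper()
    for i, line in enumerate(lines):
        ratio = difflib.SequenceMatcher(None, line.strip().upper(), target).ratio()
        if ratio > best_ratio:
            best_index, best_ratio = i, ratio
    return best_index

lines = [
    "7. HANDLING AND STORAGE",
    "8. EXPOSURE CONTROLS AND PERSONAL PROECTION",   # typo as found in the PDF
    "Wear gloves.",
]
index = find_header_line(lines, "8. EXPOSURE CONTROLS AND PERSONAL PROTECTION")
print(index)
```

A lower cutoff tolerates more damage in the header but increases the risk of matching ordinary body text, so the value would need tuning against the real files.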

## PDF File Processing

In [232]:
ProcessAllPDFs("./data")


File: 10N_Sodium_Hydroxide_NaOH_40_6_US_EN_sds (1).pdf
-------------
Section  0 :   
-------------
  

SIGMA-ALDRICH 

  

-------------
Section  1 :  SECTION 1: Identification of the substance 
-------------
 
1.1 

Product identifiers 
Product name 

Product Number 
Brand 
 
CAS-No. 

 

:  329800 
:  Aldrich 
 
:  765-12-8 

 

:  Tri(ethylene glycol) divinyl ether 

sigma-aldrich.com 
SAFETY DATA SHEET 
Version 4.2 
Revision Date 07/01/2014 
Print Date 11/10/2018 

1.2  Relevant identified uses of the substance or mixture and uses advised against 

Identified uses 

:  Laboratory chemicals, Manufacture of substances 

1.3  Details of the supplier of the safety data sheet 

Company 

:  Sigma-Aldrich 

3050 Spruce Street 
SAINT LOUIS MO  63103 
USA 

:  +1 800-325-5832 
:  +1 800-325-5052 

:  +1-703-527-3887 (CHEMTREC) 

 
Telephone 
Fax 
Emergency telephone number 
Emergency Phone # 

1.4 

  

-------------
Section  2 :  SECTION 2: Hazards identification 
-------------
 IDENTIFICATION 
2.1  Classification of the substance or mixture 
