# PDF Text Extraction

ML6 use case: 20 PDF documents containing product safety hazard information.


Safety Data Sheet (SDS) information:

https://en.wikipedia.org/wiki/Safety_data_sheet

https://ekos.gda.pl/en/article/baza-wiedzy/safety-data-sheet-annex-ii.html

http://chemicalsafety.com/sds-search/

https://www.msdsonline.com/sds-search/

https://comply1.com/compliance/ghs/MSDS-SDS-Comparison

https://www.everysafeuse.com/chemparser/ - commercial product that converts SDS into JSON (with a JSON output example)


https://echa.europa.eu/-/guidance-on-the-compilation-of-safety-data-sheets


Assumptions:

- All Safety Data Sheets are in English.


### Extraction based on PDFminer

Note: the latest version of pdfminer does not work on all PDF files. Use the version that is installed together with textract (installing textract downgrades pdfminer to a compatible version):

pdfminer.six-20181108-py2.py3-none-any.whl

uninstall six 1.15.0
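
Because the extraction depends on specific package versions, it can help to check what is installed before running the cells below. This is a minimal sketch using only the standard library; `installed_version` is a hypothetical helper name, and the distribution names are assumed to be the names used on PyPI.

```python
from importlib import metadata

def installed_version(dist_name):
    """Return the installed version string of a distribution, or None if absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None

# Distribution names are assumptions (PyPI names), adjust if your setup differs.
for name in ("pdfminer.six", "six", "textract"):
    print(name, "->", installed_version(name))
```

If the reported pdfminer.six version is not 20181108, the extraction cell below may behave differently on some files.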


In [1]:
# See https://www.blog.pythonlibrary.org/2018/05/03/exporting-data-from-pdfs-with-python/
#     http://zevross.com/blog/2014/04/09/extracting-tabular-data-from-a-pdf-an-example-using-python-and-regular-expressions/

import io
import pdfminer

from pdfminer.converter import TextConverter
from pdfminer.pdfinterp import PDFPageInterpreter
from pdfminer.pdfinterp import PDFResourceManager
from pdfminer.pdfpage import PDFPage


In [2]:
laparams = pdfminer.layout.LAParams()
laparams.all_texts = True   # also run layout analysis on text inside figures

def miner_extract_text_from_pdf(pdf_path):
    resource_manager = PDFResourceManager()
    fake_file_handle = io.StringIO()
    converter = TextConverter(resource_manager, fake_file_handle, laparams=laparams)
    page_interpreter = PDFPageInterpreter(resource_manager, converter)

    with open(pdf_path, 'rb') as fh:
        for page in PDFPage.get_pages(fh, 
                                      caching=True,
                                      check_extractable=True):
            page_interpreter.process_page(page)

        text = fake_file_handle.getvalue()

    # close open handles
    converter.close()
    fake_file_handle.close()

    if text:
        return text

### Extraction based on Tika

In [3]:
from tika import parser

def tika_extract_text_from_pdf(pdf_path):
    
    # Parse the PDF file (tika returns a dict)
    parsed_pdf = parser.from_file(pdf_path) 
  
    # The extracted text is stored as a string under the 'content' key
    text = parsed_pdf['content'] 

    if text:
        return text
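
Both extraction functions return `None` when no text comes out, so one way to combine them is to try them in order and keep the first non-empty result. The sketch below shows that idea; `first_non_empty_text` is a hypothetical helper, not part of the notebook, and it takes the extractors as plain callables so it runs without pdfminer or tika installed.

```python
def first_non_empty_text(pdf_path, extractors):
    """Try each extractor in order and return the first non-empty result.

    extractors: list of callables taking a path and returning text or None.
    Returns (extractor_name, text), or (None, None) if no extractor yields text.
    """
    for extract in extractors:
        try:
            text = extract(pdf_path)
        except Exception as err:  # broad on purpose: parser errors vary by library
            print(f"{extract.__name__} failed on {pdf_path}: {err}")
            continue
        if text and text.strip():
            return extract.__name__, text
    return None, None


# Stand-in extractors, used only to demonstrate the fallback order.
def fake_empty(path):
    return None

def fake_ok(path):
    return "SECTION 1: Identification"

print(first_non_empty_text("example.pdf", [fake_empty, fake_ok]))
```

In the notebook the list would presumably be `[miner_extract_text_from_pdf, tika_extract_text_from_pdf]`; which one should come first is a judgment call that depends on the PDF files at hand.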

### Matching of sections based on Regular Expressions

In [4]:
import re

def FindOccurences(Text, RegularExpression):
    # Find all occurrences of RegularExpression in Text (case-insensitive, multi-line).
    # Return a dict, keyed 1, 2, 3, ..., with the (begin, end) position in Text of each occurrence.
    # If not found, returns an empty dictionary
    
    count = 1
    Dict = {}
    for m in re.finditer(RegularExpression, Text, flags= re.MULTILINE | re.IGNORECASE):
         Dict[count] = (m.start(), m.end())
         count += 1
          
    return(Dict)
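
A small illustration of the returned structure may help here. The sketch below re-implements the same idea under a hypothetical name, `find_spans`, so that it runs on its own; the sample text and the pattern are made up for the example.

```python
import re

def find_spans(text, pattern):
    """Same idea as FindOccurences: 1-based dict of (start, end) per match."""
    flags = re.MULTILINE | re.IGNORECASE
    return {i: m.span() for i, m in enumerate(re.finditer(pattern, text, flags), start=1)}

sample = "1. Identification\nProduct name: X\n2. HAZARDS IDENTIFICATION\n"

# Made-up pattern: a numbered line that ends in "identification".
spans = find_spans(sample, r"^\d+\.\s+.*identification$")
for n, (begin, end) in spans.items():
    print(n, (begin, end), repr(sample[begin:end]))
```

The positions index into the original string, which is what the later cells rely on when they cut the text between two headers.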

In [5]:
def VerifyOccurence(AllHeaders):
    # Return the number of occurrences found (0 = not found, 1 = single occurrence, >1 = multiple occurrences).
    
    #if len(AllHeaders) == 0:
    #    s = "No Occurences"
    #elif len(AllHeaders) > 1:
    #    s = "Multiple Occurences"
    #else:
    #    s = "OK - Single Occurence"
    
    #return(s)
    return(len(AllHeaders))

In [6]:
def ExtractText(Text, Begin, End):
    # Extract text Segment between 2 position indexes in a text string
    
    return(Text[Begin:End])

In [7]:
# From Maarten:

dict_structure = {"SECTION 1: Identification of the substance":
                          {"1.1. Product identifier":"",
                           "1.2. Relevant identified uses of the substance or mixture and uses advised against":"",
                           "1.3. Details of the supplier of the safety data sheet":"",
                           "1.4. Emergency telephone number":""},
                  "SECTION 2: Hazards identification":
                          {"2.1. Classification of the substance or mixture":"",
                           "2.2. Label elements":"",
                           "2.3. Other hazards":""},
                  "SECTION 3: Composition/information on ingredients":
                          {"3.1. Substances":"",
                           "3.2. Mixtures":""},
                  "SECTION 4: First aid measures":
                          {"4.1. Description of first aid measures":"",
                           "4.2. Most important symptoms and effects, both acute and delayed":"",
                           "4.3. Indication of any immediate medical attention and special treatment needed":""},
                  "SECTION 5: Firefighting measures":
                          {"5.1. Extinguishing media":"",
                           "5.2. Special hazards arising from the substance or mixture":"",
                           "5.3. Advice for firefighters":""},
                  "SECTION 6: Accidental release measures":
                          {"6.1. Personal precautions, protective equipment and emergency procedures":"",
                           "6.2. Environmental precautions":"",
                           "6.3. Methods and material for containment and cleaning up":"",
                           "6.4. Reference to other sections":""},
                  "SECTION 7: Handling and storage":
                          {"7.1. Precautions for safe handling":"",
                           "7.2. Conditions for safe storage, including any incompatibilities":"",
                           "7.3. Specific end use(s)":""},
                  "SECTION 8: Exposure controls/personal protection":
                          {"8.1. Control parameters":"",
                           "8.2. Exposure controls":""},
                  "SECTION 9: Physical and chemical properties":
                          {"9.1. Information on basic physical and chemical properties":"",
                           "9.2. Other information":""},
                  "SECTION 10: Stability and reactivity":
                          {"10.1. Reactivity":"",
                           "10.2. Chemical stability":"",
                           "10.3. Possibility of hazardous reactions":"",
                           "10.4. Conditions to avoid":"",
                           "10.5. Incompatible materials":"",
                           "10.6. Hazardous decomposition products":""},
                  "SECTION 11: Toxicological information":
                          {"11.1. Information on toxicological effects":""},
                  "SECTION 12: Ecological information":
                          {"12.1. Toxicity":"",
                           "12.2. Persistence and degradability":"",
                           "12.3. Bioaccumulative potential":"",
                           "12.4. Mobility in soil":"",
                           "12.5. Results of PBT and vPvB assessment":"",
                           "12.6. Other adverse effects":""},
                  "SECTION 13: Disposal considerations":
                          {"13.1. Waste treatment methods":""},
                  "SECTION 14: Transport information":
                          {"14.1. UN number":"",
                           "14.2. UN proper shipping name":"",
                           "14.3. Transport hazard class(es)":"",
                           "14.4. Packing group":"",
                           "14.5. Environmental hazards":""},
                  "SECTION 15: Regulatory information":
                          {"15.1. Safety, health and environmental regulations/legislation specific for the substance or mixture":"",
                           "15.2. Chemical safety assessment":""},
                  "SECTION 16: Other information":""
                 }


In [8]:
# SDS headers top level:
# All variants in a list need to be replaced by a single Regular Expression, matching all of the items in that list.
# The first element in the list is taken from the standard specification. The others are taken from all 20 PDF test files. 
# There may be repeats in the lists
# Only the first (official) element is still in use

HeaderDict = { 0 : [""], \
               1 : ["SECTION 1: Identification of the substance", "1. PRODUCT AND COMPANY IDENTIFICATION", "1.  PRODUCT AND COMPANY IDENTIFICATION", "1 Identification", "1. Identification", "1.Identification", "SECTION 1: Identification", "1. IDENTIFICATION", "SECTION 1: PRODUCT AND COMPANY IDENTIFICATION", "Section 1. Identification", " SECTION 1: Identification of the substance/mixture and of the company/undertaking ", "SECTION 1: Identification of the Substance/Preparation and the Company/Undertaking", "Section 1 - Identification of the Substance/Preparation and the Company/Undertaking"], \
               2 : ["SECTION 2: Hazards identification", "Hazards Identification", "2. HAZARDS IDENTIFICATION", "2 Hazard\(s\) identification", "2 Hazard\(s\) Identification", "2. Hazard\(s\) identification", "2.Hazard\(s\) identification", "2.Hazard(s) identification", "SECTION 2: Hazard\(s\) identification", "2.  HAZARDS IDENTIFICATION"," SECTION 2: HAZARDS IDENTIFICATION","SECTION 2: Hazards Identification", "SECTION 2: Hazards identification", "Section 2. Hazards identification"], \
               3 : ["SECTION 3: Composition/information on ingredients", "3. COMPOSITION/INFORMATION ON INGREDIENTS", "3 Composition/information on ingredients", "3 Composition/Information on Ingredients", "3.  COMPOSITION / INFORMATION ON INGREDIENTS", "3. Composition/information on ingredients", "3. Composition/Information on Ingredients", "3.Composition/information on ingredients", "Composition / Information on Ingredients", "3. COMPOSITION", "SECTION 3: Composition/Information on ingredients", "SECTION 3: Composition/Information on Ingredients", "SECTION 3: COMPOSITION/INFORMATION ON INGREDIENTS", "SECTION 3:Composition/information on ingredients", "Section 3. Composition/information on ingredients", "SECTION 3: COMPOSITION"], \
               4 : ["SECTION 4: First aid measures", "4. FIRST AID MEASURES", "4 First-Aid Measures", "4 First aid measures", "4. First-aid measures", "4. First-Aid Measures", "4.  FIRST AID MEASURES", "4.First-aid measures", "SECTION 4: First-aid measures", "SECTION 4: First-Aid Measures", "SECTION 4: FIRST AID MEASURES", " SECTION 4: First aid measures", "Section 4 - First Aid Measures", "Section 4. First aid measures"], \
               5 : ["SECTION 5: Firefighting measures", "5. FIRE-FIGHTING MEASURES", "5 Fire-Fighting Measures", "5 Fire-fighting measures", "5. FIRE FIGHTING MEASURES", "5. FIREFIGHTING MEASURES", "5.  FIRE FIGHTING MEASURES", "5. Fire-fighting measures", "5. Fire-Fighting Measures", "5.Fire-fighting measures", "SECTION 5: Fire-fighting measures", "SECTION 5: Fire-Fighting Measures", "SECTION 5: FIREFIGHTING MEASURES", "SECTION 5:FIRE-FIGHTING MEASURES", "SECTION 5: FIRE FIGHTING MEASURES"," SECTION 5: Firefighting measures", "Section 5 - Fire Fighting Measures", "Section 5. Fire-fighting measures"], \
               6 : ["SECTION 6: Accidental release measures", "6. ACCIDENTAL RELEASE MEASURES", "6 Accidental Release Measures", "6 Accidental release measures", "6.  ACCIDENTAL RELEASE MEASURES", "6. Accidental release measures", "6. Accidental Release Measures", "6.Accidental release measures", "SECTION 6: Accidental release measures","SECTION 6: Accidental Release Measures", "SECTION 6: ACCIDENTAL RELEASE MEASURES", "Section 6 - Accidental Release Measures", "Section 6. Accidental release measures"], \
               7 : ["SECTION 7: Handling and storage", "7. HANDLING AND STORAGE", "7 Handling and Storage", "7 Handling and storage", "7.  HANDLING AND STORAGE", "7. Handling and storage", "7. Handling and Storage", "7.Handling and storage", "SECTION 7: Handling and storage", "SECTION 7: Handling and Storage", "SECTION 7: HANDLING AND STORAGE", "Section 7 - Handling and Storage", "Section 7. Handling and storage"], \
               8 : ["SECTION 8: Exposure controls/personal protection", "8. EXPOSURE CONTROLS/PERSONAL PROTECTION", "8 Exposure controls/personal protection", "8. EXPOSURE CONTROLS / PERSONAL PROTECTION", "8.  EXPOSURE CONTROLS / PERSONAL PROTECTION", "8. Exposure Controls/personal protection", "8. Exposure controls / personal protection", "8 Exposure Controls/Personal Protection", "8. Exposure Controls/Personal Protection", "8. EXPOSURE CONTROLS AND PERSONAL PROTECTION", "8.Exposure controls/personal protection", "SECTION 8: Exposure controls/personal protection"," SECTION 8: Exposure Controls/Personal Protection", "SECTION 8: EXPOSURE CONTROLS/PERSONAL PROTECTION ", " SECTION 8: EXPOSURE CONTROLS AND PERSONAL PROECTION", "Section 8 - Exposure Controls / Personal Protection", "Section 8. Exposure controls/personal protection", "SECTION 8: 8. Exposure Controls/personal protection", "8. EXPOSURE CONTROLS AND PERSONAL PROECTION" ], \
               9 : ["SECTION 9: Physical and chemical properties", "9. PHYSICAL AND CHEMICAL PROPERTIES", "9 Physical and Chemical Properties", "9 Physical and chemical properties", "9.  PHYSICAL AND CHEMICAL PROPERTIES", "9. Physical and Chemical Properties", "9. Physical and chemical properties", "9. Physical and Chemical Properties", "9.Physical and chemical properties", "SECTION 9: Physical and chemical properties", "SECTION 9: Physical and Chemical Properties", "SECTION 9: PHYSICAL AND CHEMICAL PROPERTIES", "Section 9 - Physical & Chemical Properties", "Section 9. Physical and chemical properties"], \
               10 : ["SECTION 10: Stability and reactivity", "10. STABILITY AND REACTIVITY", "10 Stability and Reactivity", "10 Stability and reactivity", "10Stability and Reactivity", "10.Stability and reactivity", "10.  STABILITY AND REACTIVITY", "10. Stability and Reactivity", "10. Stability and reactivity", "SECTION 10: Stability and reactivity", "SECTION 10: Stability and Reactivity", "SECTION 10: STABILITY AND REACTIVITY", "Section 10 - Chemical Stability & Reactivity Information", "Section 10. Stability and reactivity"], \
               11 : ["SECTION 11: Toxicological information", "11. TOXICOLOGICAL INFORMATION", "11 Toxicological Information", "11 Toxicological information", "11Toxicological Information", "11.Toxicological information", "11.  TOXICOLOGICAL INFORMATION", "11. Toxicological information", "11. Toxicological Information", "SECTION 11: Toxicological information", "SECTION 11: Toxicological Information", "SECTION 11: TOXICOLOGICAL INFORMATION", "Section 11 - Toxicological Information", "Section 11. Toxicological information"], \
               12 : ["SECTION 12: Ecological information", "12. ECOLOGICAL INFORMATION", "12 Ecological Information", "12 Ecological information", "12Ecological Information", "12.Ecological information", "12.  ECOLOGICAL INFORMATION", "12. Ecological Information", "12. Ecological information", "SECTION 12: Ecological information", "SECTION 12: Ecological Information", "SECTION 12: ECOLOGICAL INFORMATION", "Section 12 - Ecological Information", "Section 12. Ecological information"], \
               13 : ["SECTION 13: Disposal considerations", "13. DISPOSAL CONSIDERATIONS", "13 Disposal Considerations", "13 Disposal consideration", "13Disposal Considerations", "13.Disposal considerations", "13.  DISPOSAL CONSIDERATIONS", "13. Disposal considerations", "13. Disposal Considerations", "SECTION 13: Disposal considerations", "SECTION 13: Disposal Considerations", "SECTION 13: DISPOSAL CONSIDERATIONS", "Section 13 - Disposal Considerations", "Section 13. Disposal considerations"], \
               14 : ["SECTION 14: Transport information", "14. TRANSPORT INFORMATION", "14 Transport Information", "14 Transport information", "14Transport Information", "14.Transport information", "14.  TRANSPORT INFORMATION", "14. Transport information", "14. Transport Information", "SECTION 14: Transport information", "SECTION 14: Transport Information", "SECTION 14: TRANSPORT INFORMATION", "Section 14 - Transportation Information", "Section 14. Transport information"], \
               15 : ["SECTION 15: Regulatory information", "15. REGULATORY INFORMATION", "15 Regulatory Information", "15 Regulatory information", "15Regulatory Information", "15.Regulatory information", "15.REGULATORY INFORMATION", "15. Regulatory information", "15. Regulatory Information", "SECTION 15: Regulatory information", "SECTION 15: Regulatory information", "SECTION 15: Regulatory Information", "SECTION 15: REGULATORY INFORMATION", "Section 15 - Regulatory Information", "Section 15. Regulatory information"], \
               16 : ["SECTION 16: Other information", "16. OTHER INFORMATION", "16 Other Information", "16 Other information", "16Other Information", "16.  OTHER INFORMATION", "16. Other information", "16. Other Information", "16.Other information, including date of preparation or last revision", "SECTION 16: Other information", "SECTION 16: Other Information","SECTION 16: OTHER INFORMATION", " SECTION 16: Other information, including date of preparation or last revision", "Section 16 - Other Information", "Section 16. Other information"], \
             }

# Check 8 : "SECTION 8: 8. Exposure Controls/personal protection" (last item in list)
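
The comment at the top of this cell says each list of variants should eventually be replaced by a single regular expression. One possible way to derive such an expression mechanically is sketched below; `variants_to_regex` is a hypothetical helper, and treating any run of whitespace as interchangeable is an assumption made for this example.

```python
import re

def variants_to_regex(variants):
    """Build one alternation that matches any of the literal header variants.

    Each variant is split on whitespace, every token is escaped, and the tokens
    are rejoined with \\s+ so that differences in spacing do not matter.
    Longer variants come first so they win over their own prefixes.
    """
    patterns = []
    for variant in sorted(set(variants), key=len, reverse=True):
        tokens = variant.split()
        if tokens:  # skip empty entries
            patterns.append(r"\s+".join(re.escape(t) for t in tokens))
    return "(?:" + "|".join(patterns) + ")"

demo_variants = ["2. HAZARDS IDENTIFICATION", "2.Hazard(s) identification", ""]
header_rx = re.compile(variants_to_regex(demo_variants), re.IGNORECASE | re.MULTILINE)

for line in ["2.  hazards   identification", "2.Hazard(s) identification", "3. Composition"]:
    print(repr(line), "->", bool(header_rx.search(line)))
```

This stays a literal match per variant, so it does not generalize to unseen header spellings the way a hand-written pattern can.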

In [9]:
# From Jonathan

import re

# Raw strings, so that \S, \r, \n and \w reach the regex engine without invalid-escape warnings.
regex1 = r"^[^a-zA-Z\n]*?(SECTION)?[^\S\r\n]?(?<!\.)"
regex2 = r"[^\S\r\n]?[.:-]?[^\S\r\n]*?(\w+[\s/]+)*?"

sections = {
    1: r"(IDENTIFICATION|Identification|identification)",
    2: r"(HAZARDS|Hazard\(s\)|INGREDIENTS)",
    3: r"(COMPOSITION|INGREDIENTS|HAZARDS|Hazard\(s\))",
    4: r"(First[\s-]aid)",
    5: r"(Fire[\s-]*Fighting)",
    6: r"(Accidental\sRelease)",
    7: r"(Handling\sAnd\sStorage)",
    # (And|/)? replaces the character class [And|/]*, which matched any mix of the characters A, n, d, | and /
    8: r"(Exposure\s(Controls\s*)?(And|/)?\s*?Personal\s(Protection)*)",
    9: r"(Physical\s(And|&)\sChemical\sProperties)",
    10: r"(Stability\s(And|&)\sReactivity)",
    11: r"(Toxicological\sInformation)",
    12: r"(Ecological\sInformation)",
    13: r"(Disposal\sConsiderations*)",
    14: r"(Transport(ation)*\sInformation)",
    15: r"(Regulatory\sInformation)",
    16: r"(Other\sInformation)"
}

In [10]:
def GenerateHeaderRE(SectionNbr):
    # Output the Regular expression to search for the Section Headers of section SectionNbr.
    
    RE = regex1 + str(SectionNbr) + regex2 + sections[SectionNbr]
    return(RE)
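
To see what the generated expression accepts, the sketch below assembles the pattern for section 7 and tries it on a few header spellings. The two prefix expressions and the section 7 keyword pattern are copied from the cell above (written as raw strings) so that the example runs on its own; the candidate lines are taken from or modelled on the header variants listed earlier.

```python
import re

# Copied from the notebook so this example is self-contained.
regex1 = r"^[^a-zA-Z\n]*?(SECTION)?[^\S\r\n]?(?<!\.)"
regex2 = r"[^\S\r\n]?[.:-]?[^\S\r\n]*?(\w+[\s/]+)*?"
section7 = r"(Handling\sAnd\sStorage)"

header_re = re.compile(regex1 + "7" + regex2 + section7, re.MULTILINE | re.IGNORECASE)

candidates = [
    "SECTION 7: Handling and storage",
    "7. HANDLING AND STORAGE",
    "7 Handling and Storage",
    "Section 7 - Handling and Storage",
    "1.7 Handling and Storage",                       # sub-numbering, rejected by (?<!\.)
    "Store as described under handling and storage",  # prose, no leading section number
]

for line in candidates:
    print(f"{line!r:50} -> {bool(header_re.search(line))}")
```

The negative lookbehind `(?<!\.)` is what keeps a number such as `1.7` from being read as section 7. A line that starts with `17.` would still match, because the leading `[^a-zA-Z\n]*?` can absorb the `1`.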

In [11]:
## Edited from Maarten
SubHeaderDict = { 0: {}, \
                  1: { 1 : ["Product identifier","1.1. Identification","PRODUCT NAME:","Product Name","GHS product identifier"], \
                       2 : ["Relevant identified uses of the substance or mixture and uses advised against","Recommended Use","Recommended use of the chemical and restrictions on use","1.2. Recommended use and restrictions on use","Product use:"], \
                       3 : ["Details of the supplier of the safety data sheet","Supplier's details :","1.3. Supplier ","COMPANY IDENTIFICATION","Company"], \
                       4 : ["Emergency telephone number","Emergency","Emergency telephone\nnumber (with hours of\noperation)","Emergency phone number","Emergency Information"] \
                     }, \
                  2: { 1 : ["Classification of the substance or mixture","Classification","Emergency Overview"], \
                       2 : ["Label elements"], \
                       3 : ["Other hazards","Hazards not otherwise classified","Potential Health Effects","Hazards not otherwise\nclassified",r"· Hazards not otherwise classified \(HNOC\):",r"Hazard\(s\) not otherwise classified \(HNOC\):",r"Hazards not otherwise classified \(HNOC\) or not covered by GHS"] \
                     }, \
                  3: { 1 : ["Substances","components"], \
                       2 : ["Mixtures"] \
                     }, \
                  4: { 1 : ["Description of first aid measures","4.1. Description of first aid measures","MEASURES"], \
                       2 : ["Most important symptoms and effects, both acute and delayed","Most important symptoms/effects, acute and delayed","Most important\nsymptoms/effects, acute and\ndelayed","Most important symptoms and\neffects","4.2. Most important symptoms and effects \(acute and delayed\)"], \
                       3 : ["Indication of any immediate medical attention and special treatment needed","Indication of immediate medical attention and special treatment needed, if necessary","Indication of immediate \nmedical attention and special \ntreatment needed","4.3. Immediate medical attention and special treatment, if necessary","Indication of immediate medical attention and special treatment needed"] \
                     }, \
                  5: { 1 : ["Extinguishing media","· Suitable extinguishing agents:"], \
                       2 : ["Special hazards arising from the substance or mixture","Hazardous Combustion Products","Specific Hazards Arising from the Chemical","5.2. Specific hazards arising from the chemical"], \
                       3 : ["Advice for firefighters","Special protective actions \nfor fire-fighters","Fire Fighting Equipment/Instructions","Protective Equipment and Precautions for Firefighters","5.3. Special protective equipment and precautions for fire-fighters","· Special protective equipment and precaution for firefighters","Further firefighting advice:"] \
                     }, \
                  6: { 1 : ["Personal precautions, protective equipment and emergency procedures","Personal Precautions"], \
                       2 : ["Environmental precautions","Containment Procedures"], \
                       3 : ["Methods and material for containment and cleaning up","Clean-Up Procedures","Methods for Containment and Clean\nUp","In case of spill or leak:","Methods and materials for containment and cleaning up"], \
                       4 : ["Reference to other sections","Special Procedures"] \
                     }, \
                  7: { 1 : ["Precautions for safe handling","Handling"], \
                       2 : ["Conditions for safe storage, including any incompatibilities","Conditions for safe storage:","Storage"], \
                       3 : ["Specific end use(s)","Specific end use\(s\)","Specific Use"] \
                     }, \
                  8: { 1 : ["Control parameters","Engineering Controls"], \
                       2 : ["Exposure controls","Substance Exposure Limits","Exposure Guidelines","8.3. Individual protection measures/Personal protective equipment"] \
                       }, \
                  9: { 1 : ["Information on basic physical and chemical properties","Appearance"], \
                       2 : ["Other information","Other safety information","NOTE:"] \
                     }, \
                  10: { 1 : ["Reactivity","Reactive Hazard"], \
                        2 : ["Chemical stability", "stability","Stability:"], \
                        3 : ["Possibility of hazardous reactions","Possibility of hazardous\nreactions","Hazardous reactions:"], \
                        4 : ["Conditions to avoid","Conditions / hazards to avoid:",'Conditions to avoid:'], \
                        5 : ["Incompatible materials","Materials to avoid","Incompatible materials:"], \
                        6 : ["Hazardous decomposition products","Hazardous decomposition\nproducts",'Hazardous decomposition producs:'] \
                      }, \
                  11: { 1 : ["Information on toxicological effects","Potential Health Effects","Toxicological information on this product or its components","Acute Toxicity"] \
                      }, \
                  12: { 1 : ["Toxicity"], \
                        2 : ["Persistence and degradability","Persistence & Degradation"], \
                        3 : ["Bioaccumulative potential","Bioaccumulation","Bioaccumulation/ Accumulation","· Bio-accumulative potential:"], \
                        4 : ["Mobility in soil","Mobility"], \
                        5 : ["Results of PBT and vPvB assessment"], \
                        6 : ["Other adverse effects"] \
                      }, \
                  13: { 1 : ["Waste treatment methods","Waste Disposal Instructions","Waste Disposal Methods","13.1. Disposal methods","Disposal methods:","Waste disposal:"] \
                      }, \
                  14: { 1 : ["UN number","UN #:","UN-No.", "UN-Number:","· UN-Number"], \
                        2 : ["Proper Shipping Name","Shipping Name:","UN proper shipping name","Proper shipping name :"], \
                        3 : ["Transport hazard class(es)","Transport hazard class\(es\)","· Transport hazard class \(es\)"], \
                        4 : ["Packing group", "Packaging group"], \
                        5 : ["Environmental hazards","Hazard Class:","Hazard labels","· Environmental hazardous"], \
                        6 : ["Special precautions for user"], \
                        7 : ["Transport in bulk according to Annex II of Marpol and the IBC Code","Transport in bulk according to Annex II of MARPOL73/78 and the IBC Code","Transport in bulk according to\nAnnex II of MARPOL 73/78\nand the IBC Code","Transport/Additional information:","Transport in bulk \naccording to Annex I or II \nof MARPOL 73/78 and the \nIBC or IGC Code"]
                      }, \
                  15: { 1 : ["Safety, health and environmental regulations/legislation specific for the substance or mixture","Safety, health and environmental regulations/legislation specific for the substance or\nmixture"], \
                        2 : ["Chemical safety assessment"] \
                      }, \
                  16: {}
                 }

### Eliminate noise from text

In [12]:
# Removing Extra Whitespaces

import re

def RemoveExtraSpaces(Text):
    """
    Return : string, text after removing duplicate whitespaces between words and leading and trailing whitespaces
    Input : string, text to be cleaned
    Output : string
    """
    
    # Duplicate spaces and newlines between words:
    Pattern = r'\s+'
    SpaceRemoved = re.sub(pattern=Pattern, repl=" ", string=Text)
    
    #Remove leading and trailing whitespaces:
    
    return(SpaceRemoved.strip(" \n"))
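
As a quick check of what this clean-up does to text that was wrapped by the PDF layout, here is a self-contained sketch of the same substitution. `collapse_whitespace` is a hypothetical name used only for this example, and the sample string is made up.

```python
import re

def collapse_whitespace(text):
    """Replace every run of whitespace (spaces, tabs, newlines) with one space."""
    return re.sub(r"\s+", " ", text).strip()

raw = "  Keep container\ntightly   closed.\n\nStore in a cool,\twell-ventilated place. \n"
print(collapse_whitespace(raw))
# Keep container tightly closed. Store in a cool, well-ventilated place.
```

Note that paragraph breaks are lost as well, which is acceptable for the short subsection texts handled here but would flatten any list or table layout.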

In [13]:
from fuzzywuzzy import fuzz

def get_starting_section(first_header_pos, text):
    section = text[:first_header_pos]
    section_list = section.split("\n")
    return section_list

def RemovePageHeaders(FileText):
    # Clean the file text from headers
    # FileText: full text of PDF file
    # zero_section_list : list of all lines of the header, one line per list item

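    # locate header on first page of PDF:
    AllHeaders = FindOccurences(FileText, GenerateHeaderRE(1))
    if not AllHeaders:
        # No section 1 header found, so no page header can be derived: return the text unchanged
        return(FileText)
    # Pageheader = all lines before first header:
    zero_section_list = get_starting_section(AllHeaders[1][0], FileText)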
    
    CleanFile = ""
    
    for division in FileText.split("\n"):       
        if not(division == ""):
            PageHeaderMatch = False
            for zero in zero_section_list:
                ratio = fuzz.token_sort_ratio(zero, division)
                if ratio >= 85: # match with header, skip line
                    PageHeaderMatch = True
                    break
            if not PageHeaderMatch:
                CleanFile += (division + "\n")
        
    return(CleanFile)
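
The idea behind this function can be tried without fuzzywuzzy. The sketch below swaps in `difflib.SequenceMatcher` from the standard library as a stand-in for `fuzz.token_sort_ratio`; the two scores are not identical, so the 0.85 threshold is only an approximation of the 85 used above. The function names and the sample text are hypothetical.

```python
from difflib import SequenceMatcher

def token_sort_similarity(a, b):
    """Similarity in [0, 1] after lowercasing and sorting the words of each line."""
    norm_a = " ".join(sorted(a.lower().split()))
    norm_b = " ".join(sorted(b.lower().split()))
    return SequenceMatcher(None, norm_a, norm_b).ratio()

def strip_page_header_lines(text, first_header_pos, threshold=0.85):
    """Drop every non-empty line that resembles a line before first_header_pos."""
    template = [ln for ln in text[:first_header_pos].split("\n") if ln.strip()]
    kept = []
    for line in text.split("\n"):
        if not line:
            continue  # empty lines are dropped, as in RemovePageHeaders
        if any(token_sort_similarity(t, line) >= threshold for t in template):
            continue  # looks like a repeated page header
        kept.append(line)
    return "".join(line + "\n" for line in kept)

sample = (
    "ACME Corp Safety Data Sheet\n"
    "Page 1 of 3\n"
    "1. Identification\n"
    "Product name: Solvent X\n"
    "ACME Corp Safety Data Sheet\n"
    "Page 2 of 3\n"
    "2. Hazards identification\n"
    "Flammable liquid.\n"
)
print(strip_page_header_lines(sample, sample.index("1. Identification")))
```

As in the original function, the first copy of the page header is removed too, since it matches itself. Fuzzy matching is what lets "Page 2 of 3" be recognized from "Page 1 of 3".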



### Locate section headers, subsections + text

In [14]:
def SortHeaders(HeaderIndexes):
    # Return the header keys in the order in which the headers occur in the text
    # (sorted by the start position of each header).
    
    return(sorted(HeaderIndexes, key=lambda k: HeaderIndexes[k][0]))
        

In [15]:
def SubSectionsStr(SectionNbr, SectionText):
    # Split text of section SectionNbr from PDFSectionText into subsections
    # Correct order of subsections
    
    SubSectionTextDict = {}
    
    # Verify there is a section:
    if len(SectionText) < 8:
        return({})
    
    # Check if there are subsections:
    if SubHeaderDict[SectionNbr]:
        
        # Locate all Headers:
        HeaderIndexes = {}
        AllHeaders = {}
        
        # loop through all possible subheaders of the section:
        for sh in range(1,len(SubHeaderDict[SectionNbr])+1):    # indexing of subsections starts at 1.
            # loop through all possible regular expressions for that subheader:
            for i in range(len(SubHeaderDict[SectionNbr][sh])):
                AllHeaders = FindOccurences(SectionText, SubHeaderDict[SectionNbr][sh][i])
                if AllHeaders:
                    HeaderIndexes[sh] = AllHeaders[1] # Assume first subheader occurence is the right one       
        if not HeaderIndexes: # no sub headers found
            return({})
        
        # Subheaders found are not necessarily in sequential order in the text.
        # To allow for sequential cutting of the text body, sort the found headers
        HeaderOrder = SortHeaders(HeaderIndexes)
                
        # Determine begin and end position of each subsection body.
        # Subsection headers are optional, so walk the found headers in text order:
        SectionIndexes = {}
        for i, SubSection in enumerate(HeaderOrder):
            Begin = HeaderIndexes[SubSection][1] + 1
            if i + 1 < len(HeaderOrder):
                End = HeaderIndexes[HeaderOrder[i + 1]][0] - 1   # up to the next header in the text
            else:
                End = len(SectionText)                           # last header: up to end of section
            SectionIndexes[SubSection] = (Begin, End)

        # Extract the text of each subsection, keyed by the official subsection number:
        for SubSection in HeaderIndexes:
            SubSectionTextDict[SubSection] = RemoveExtraSpaces(ExtractText(SectionText, SectionIndexes[SubSection][0], SectionIndexes[SubSection][1]))
    
    return(SubSectionTextDict)              
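
The core step of this function is cutting the section text between consecutive subheaders after sorting them by position. The self-contained sketch below isolates that step; `cut_between_headers`, the sample text and the three patterns are made up for illustration, and unlike the function above it does not skip one character on either side of each body.

```python
import re

def cut_between_headers(text, header_spans):
    """header_spans: {key: (start, end)} for each header found in text.

    Returns {key: body}, where a body runs from the end of its header
    to the start of the next header in text order (or to the end of text).
    """
    order = sorted(header_spans, key=lambda k: header_spans[k][0])
    bodies = {}
    for pos, key in enumerate(order):
        begin = header_spans[key][1]
        end = header_spans[order[pos + 1]][0] if pos + 1 < len(order) else len(text)
        bodies[key] = " ".join(text[begin:end].split())  # normalize whitespace
    return bodies

# The subheaders appear in a different order than their official numbering.
section = "Storage: keep cool and dry.\nHandling: wear gloves.\nSpecific Use: laboratory reagent.\n"
patterns = {1: r"Handling:", 2: r"Storage:", 3: r"Specific Use:"}

spans = {}
for key, pattern in patterns.items():
    match = re.search(pattern, section, flags=re.IGNORECASE)
    if match:
        spans[key] = match.span()

print(cut_between_headers(section, spans))
```

Keying the result by the official subsection number, rather than by position in the text, keeps the output comparable across sheets that order their subsections differently.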

In [16]:
def PDF2Sections(PDFfile):
    # Convert the extracted text of a PDF file (PDFfile) into a Dict of Sections.
    # Returns None if one of the 16 section headers is not found.

    # Locate all Headers:
    HeaderIndexes = {}
    AllHeaders = {}

    for Header in range(1,17):
        AllHeaders = FindOccurences(PDFfile, GenerateHeaderRE(Header))
        if AllHeaders:
            HeaderIndexes[Header] = AllHeaders[1]
            #VerifyOccurence(AllHeaders)
        else:
            print("\nHeader ", HeaderDict[Header][0], " not found")
            return(None)
    
    #print(HeaderIndexes)

    # Determine begin and end position of each header:    
    SectionIndexes = {}
                                    
    SectionIndexes[0] = (0, HeaderIndexes[1][0] - 1)   # Section 0 is to contain PDF file intro
     
    for Section in range(1,16):
        SectionIndexes[Section] = (HeaderIndexes[Section][1] + 1, HeaderIndexes[Section + 1][0] - 1)
    
    SectionIndexes[16] = (HeaderIndexes[16][1] + 1, len(PDFfile) )

    # Extract the text of each section:
    SectionTextDict = {}
    
    for Section in range(0,17):
        SectionTextDict[Section] = ExtractText(PDFfile, SectionIndexes[Section][0], SectionIndexes[Section][1])
    
    return(SectionTextDict)   

In [17]:
def PDF2SubSections(PDFfile):
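    # Convert the extracted text of a PDF file (PDFfile) into a Dict of Sections,
    # where each section is itself a Dict of subsections if subheaders were found.
    # Returns None if one of the 16 section headers is not found.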

    # Locate all Headers:
    HeaderIndexes = {}
    AllHeaders = {}

    for Header in range(1,17):
        AllHeaders = FindOccurences(PDFfile, GenerateHeaderRE(Header))
        if AllHeaders:
            HeaderIndexes[Header] = AllHeaders[1]
            #VerifyOccurence(AllHeaders)
        else:
            print("\nHeader ", HeaderDict[Header][0], " not found")
            return(None)

    # Determine begin and end position of each header:    
    SectionIndexes = {}
                                    
    SectionIndexes[0] = (0, HeaderIndexes[1][0] - 1)   # Section 0 is to contain PDF file intro
     
    for Section in range(1,16):
        SectionIndexes[Section] = (HeaderIndexes[Section][1] + 1, HeaderIndexes[Section + 1][0] - 1)
    
    SectionIndexes[16] = (HeaderIndexes[16][1] + 1, len(PDFfile) )

    # Extract the text of each section:
    SectionTextDict = {}
    
    for Section in range(0,17):        
        Hlp = ExtractText(PDFfile, SectionIndexes[Section][0], SectionIndexes[Section][1])
        SubS = SubSectionsStr(Section, Hlp)
        if (SubS):
            SectionTextDict[Section] = SubS
        else: # no subsections
            SectionTextDict[Section] = RemoveExtraSpaces(Hlp)
    
    return(SectionTextDict)   

### Output

#### Readable, on screen

In [18]:
import sys

def PrintSections(SectionDict, FileName = None):
    # Print out the PDF converted to a dictionary.
    # If FileName is given, write to that file; otherwise display on screen.
    
    if FileName:
        original_stdout = sys.stdout # Save a reference to the original standard output
        f = open(FileName, 'w')
        sys.stdout = f # Change the standard output to the file we created.
    
    try:
        for Section in range(0,17):
            print("-------------\nSection ", Section, ": ", HeaderDict[Section][0], "\n-------------\n", SectionDict[Section], "\n")
    finally:
        if FileName:
            sys.stdout = original_stdout # Reset the standard output to its original value
            f.close() # Flush and close the output file, even if printing failed
    
    return(None)

In [19]:
import sys

def PrintSubSections(SubSectionDict, FileName = None):
    # Print out the PDF converted to a dictionary, including subsections.
    # If FileName is given, write to that file; otherwise display on screen.
    
    if FileName:
        original_stdout = sys.stdout # Save a reference to the original standard output
        f = open(FileName, 'w')
        sys.stdout = f # Change the standard output to the file we created.
    
    try:
        for Section in range(0,17):
            print("==============\nSection ", Section, ": ", HeaderDict[Section][0], "\n==============\n")
            if SubHeaderDict[Section]:
                if isinstance(SubSectionDict[Section], str):
                    print(SubSectionDict[Section])
                elif isinstance(SubSectionDict[Section], dict):
                    for sh in SubSectionDict[Section]:
                        print("-------------\nSubSection ", sh, ": ", SubHeaderDict[Section][sh][0], "\n-------------\n")
                        print(SubSectionDict[Section][sh])
            else:
                if SubSectionDict[Section]:
                    print(SubSectionDict[Section])
    finally:
        if FileName:
            sys.stdout = original_stdout # Reset the standard output to its original value
            f.close() # Flush and close the output file, even if printing failed
    
    return(None)
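
The print helpers above switch `sys.stdout` by hand. The standard library's `contextlib.redirect_stdout` offers an alternative that restores the stream and closes the file automatically. The sketch below is illustrative only: `print_report` and its arguments are hypothetical names, not part of this notebook.

```python
import contextlib

def print_report(lines, file_name=None):
    """Print lines to the screen, or to file_name when one is given."""
    with contextlib.ExitStack() as stack:
        if file_name:
            # Both the file and the redirection are undone when the block exits
            out = stack.enter_context(open(file_name, "w", encoding="utf-8"))
            stack.enter_context(contextlib.redirect_stdout(out))
        for line in lines:
            print(line)

# On-screen use; pass a file name as second argument to write to a file instead
print_report(["Section 1", "Section 2"])
```

`ExitStack` is used here so that the same loop serves both the on-screen and the file case without duplicating the printing code.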

#### JSON output file

In [20]:
# Create a template dictionary based on the SDS standard:

# Top level
SDSTemplate = {  \
               "identification" : {"product_identifier" : "", \
                                   "uses" : "", \
                                   "supplier_details" : "", \
                                   "emergency_telephone_number" : "" \
                                  }, \
               "hazards" :  {"classification" : "", \
                             "label" : "", \
                             "other" : "" \
                            }, \
               "ingredients" : {"substances" : "", \
                                "mixtures" : "" \
                               }, \
               "First-Aid" : {"measures" : "", \
                              "symptoms" : "", \
                              "treatment" : "", \
                             }, \
               "Firefighting" : {"extinguishing_media" : "", \
                                 "special_hazards" : "", \
                                 "advice" : "" \
                                }, \
               "Accidents" : {"procedures" : "", \
                              "environmental_precautions" : "", \
                              "containment_and_cleaning" : "", \
                              "section_reference" : "", \
                             }, \
               "Handling" :  {"precautions" : "", \
                              "safe_storage" : "", \
                              "specific_end_use" : "", \
                             }, \
               "Exposure" : {"parameters" : "", \
                             "controls" : "", \
                            }, \
               "Properties" : {"basic" : "", \
                               "Other" : "", \
                              }, \
               "Stability" : {"reactivity" : "", \
                              "chemical_stability" : "", \
                              "hazardous_reactions" : "", \
                              "conditions_to_avoid" : "", \
                              "incompatibilities" : "", \
                              "hazardous_decomposition_products" : "", \
                             }, \
               "Toxicology" : {"information" : ""}, \
               "Ecology" : {"toxicity" : "", \
                            "persistence_and_degradability" : "", \
                            "bioaccumulative_potential" : "", \
                            "mobility_in_soil" : "", \
                            "PBT_and_vPvB_assessment" : "", \
                            "other" : "", \
                           }, \
               "Disposal" : {"treatment" : ""}, \
               "Transport" : {"UN_number" : "", \
                              "shipping_name" : "", \
                              "hazard_classes" : "", \
                              "packing_group" : "", \
                              "environmental_hazards" : "", \
                              "user_precautions" : "", \
                              "bulk" : "", \
                             }, \
               "Regulatory" : {"regulations_legislation" : "", \
                               "chemical_safety_assessment" : "", \
                              }, \
               "Other" : {} \
             }

In [21]:
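import copy

def CreateJSONdict(InDict):
    # Create a dictionary with values according to the template, with keywords as keys.
    # InDict is the processed dictionary, with heading numbers as keys.
    
    # Deep copy: SDSTemplate.copy() is shallow and would share the nested
    # dictionaries, so filling them in would also overwrite the template
    # (and every dictionary returned by an earlier call).
    OutputDict = copy.deepcopy(SDSTemplate)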

    TopCount = 1
    for kt in OutputDict.keys():
        if TopCount in InDict.keys():
            if isinstance(InDict[TopCount], str):
                OutputDict[kt] = InDict[TopCount]                
            else:
                SubCount = 1
                for ks in OutputDict[kt].keys():
                    if SubCount in InDict[TopCount].keys():
                        OutputDict[kt][ks] = InDict[TopCount][SubCount]
                    else:
                        OutputDict[kt][ks] = ""
                    SubCount += 1
        TopCount += 1
        
    return(OutputDict)
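
Because the template holds nested dictionaries, the way it is copied matters: a shallow copy shares the inner dictionaries with the template, a deep copy does not. The sketch below illustrates the same positional mapping on a reduced, hypothetical template (`TEMPLATE` and `fill_template` are illustration names only) and checks that the template stays untouched.

```python
import copy

# Reduced, hypothetical template with the same shape as SDSTemplate
TEMPLATE = {
    "identification": {"product_identifier": "", "uses": ""},
    "disposal": {"treatment": ""},
}

def fill_template(numbered, template=TEMPLATE):
    """Map a dict keyed by section/subsection numbers onto the template keys."""
    out = copy.deepcopy(template)            # inner dicts are copied as well
    for top, key in enumerate(out, start=1):
        value = numbered.get(top)
        if isinstance(value, str):
            out[key] = value                 # section without subsections
        elif isinstance(value, dict):
            for sub, sub_key in enumerate(out[key], start=1):
                out[key][sub_key] = value.get(sub, "")
    return out

filled = fill_template({1: {1: "Acetone"}, 2: "Dispose per local rules"})
print(filled)
print(TEMPLATE["identification"]["product_identifier"] == "")  # template unchanged
```

The mapping relies on dictionaries preserving insertion order, which Python guarantees from version 3.7 on.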

In [22]:
import json

def write_json(dictionary, filename):
    
    JsonDict = CreateJSONdict(dictionary)
    
    with open(filename, "w") as outfile:
        json.dump(JsonDict, outfile, indent=4)
        
    return(JsonDict)
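
The extracted text contains non-ASCII characters such as `·`. With its default `ensure_ascii=True`, `json.dump` writes these as `\u00b7` escapes; passing `ensure_ascii=False` together with an explicit UTF-8 encoding keeps them readable in the file. Either form loads back to the same data. A minimal round-trip sketch, with a made-up record and a temporary file path:

```python
import json
import os
import tempfile

# Made-up record in the same shape as the JSON output
record = {"identification": {"product_identifier": "· Trade name: Example ·"}}

path = os.path.join(tempfile.mkdtemp(), "example.json")
with open(path, "w", encoding="utf-8") as outfile:
    json.dump(record, outfile, indent=4, ensure_ascii=False)

# Read the raw text and the parsed data back
with open(path, encoding="utf-8") as infile:
    raw = infile.read()
loaded = json.loads(raw)

print("·" in raw)        # the bullet is stored as-is, not as an escape
print(loaded == record)  # round trip preserves the data
```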

### PDF Processing

In [23]:
InputFile = './data/10N_Sodium_Hydroxide_NaOH_40_6_US_EN_sds (1).pdf'
PDFtext = tika_extract_text_from_pdf(InputFile)
CleanFile = RemovePageHeaders(PDFtext)
SubSdict = PDF2SubSections(CleanFile)
PrintSubSections(SubSdict)

Section  0 :   

Section  1 :  SECTION 1: Identification of the substance 

-------------
SubSection  1 :  Product identifier 
-------------

· Trade name: 10N Sodium Hydroxide (NaOH 40%) · Product Number: NGT-10N NaOH ·
-------------
SubSection  2 :  Relevant identified uses of the substance or mixture and uses advised against 
-------------

No further relevant information available. · Product Description PC21 Laboratory chemicals · Application of the substance / the mixture: Laboratory chemicals ·
-------------
SubSection  3 :  Details of the supplier of the safety data sheet 
-------------

· Manufacturer/Supplier: NuGeneration Technologies, LLC (dba NuGenTec) 1155 Park Avenue, Emeryville, CA 94608 salesteam@nugentec.com www.nugentec.com 1-888-996-8436 or 1-707-820-4080 for product information ·
-------------
SubSection  4 :  Emergency telephone number 
-------------

telephone number: PERS Emergency Response: Domestic and Canada - 1-800-633-8253, International 1-801-629-0667
Secti

In [24]:
json_dict = write_json(SubSdict, 'json_out.txt')

In [25]:
json_dict

{'identification': {'product_identifier': '· Trade name: 10N Sodium Hydroxide (NaOH 40%) · Product Number: NGT-10N NaOH ·',
  'uses': 'No further relevant information available. · Product Description PC21 Laboratory chemicals · Application of the substance / the mixture: Laboratory chemicals ·',
  'supplier_details': '· Manufacturer/Supplier: NuGeneration Technologies, LLC (dba NuGenTec) 1155 Park Avenue, Emeryville, CA 94608 salesteam@nugentec.com www.nugentec.com 1-888-996-8436 or 1-707-820-4080 for product information ·',
  'emergency_telephone_number': 'telephone number: PERS Emergency Response: Domestic and Canada - 1-800-633-8253, International 1-801-629-0667'},
 'hazards': {'classification': 'of the substance or mixture: d~\uf082 GHS05 Corrosion Skin Corr. 1A H314 Causes severe skin burns and eye damage. Eye Dam. 1 H318 Causes serious eye damage. ·',
  'label': '· GHS label elements The product is classified and labeled according to the Globally Harmonized System (GHS). · Hazard

In [26]:
InputFile = 'data/Dubei Steriling Alcohol Wipes SDS.pdf'
PDFtext = tika_extract_text_from_pdf(InputFile)
CleanFile = RemovePageHeaders(PDFtext)
SubSdict = PDF2SubSections(CleanFile)
PrintSubSections(SubSdict)

Section  0 :   

Section  1 :  SECTION 1: Identification of the substance 

-------------
SubSection  1 :  Product identifier 
-------------

·
-------------
SubSection  2 :  Relevant identified uses of the substance or mixture and uses advised against 
-------------

· Application of the substance/ mixture: Clean hands and disinfection ·
-------------
SubSection  3 :  Details of the supplier of the safety data sheet 
-------------

· Manufacturer/Supplier: Tel: 13957903789 Email: / · Further information obtainable from: Jinhua dubei maternal and Infant Products Co., Ltd ·
-------------
SubSection  4 :  Emergency telephone number 
-------------

telephone number Miss wang Tel: 13957903789
Section  2 :  SECTION 2: Hazards identification 

-------------
SubSection  1 :  Classification of the substance or mixture 
-------------

of the substance or mixture: GHS02 Flame Flam. Liq. 3 H226 Flammable liquid and vapour GHS07 Exclamation mark Eye Irrit. 2A H319 Causes serious eye irritation · I

In [27]:
json_dict = write_json(SubSdict, 'json_out.txt')

In [28]:
json_dict

{'identification': {'product_identifier': '·',
  'uses': '· Application of the substance/ mixture: Clean hands and disinfection ·',
  'supplier_details': '· Manufacturer/Supplier: Tel: 13957903789 Email: / · Further information obtainable from: Jinhua dubei maternal and Infant Products Co., Ltd ·',
  'emergency_telephone_number': 'telephone number Miss wang Tel: 13957903789'},
 'hazards': {'classification': 'of the substance or mixture: GHS02 Flame Flam. Liq. 3 H226 Flammable liquid and vapour GHS07 Exclamation mark Eye Irrit. 2A H319 Causes serious eye irritation · Information concerning particular hazards for human and environment: The product has to be labeled due to the calculation procedure of OSHA Hazard Communication Standard (29 CFR 1910. 1200). · Classification system: The classification is according to the latest edition of OSHA Hazard Communication Standard (29 CFR 1910.1200) and extended by company and literature data. ·',
  'other': 'No further relevant information availab

In [29]:
SubSdict

 1: {1: '·',
  2: '· Application of the substance/ mixture: Clean hands and disinfection ·',
  3: '· Manufacturer/Supplier: Tel: 13957903789 Email: / · Further information obtainable from: Jinhua dubei maternal and Infant Products Co., Ltd ·',
  4: 'telephone number Miss wang Tel: 13957903789'},
 2: {1: 'of the substance or mixture: GHS02 Flame Flam. Liq. 3 H226 Flammable liquid and vapour GHS07 Exclamation mark Eye Irrit. 2A H319 Causes serious eye irritation · Information concerning particular hazards for human and environment: The product has to be labeled due to the calculation procedure of OSHA Hazard Communication Standard (29 CFR 1910. 1200). · Classification system: The classification is according to the latest edition of OSHA Hazard Communication Standard (29 CFR 1910.1200) and extended by company and literature data. ·',
  3: 'No further relevant information available.'},
 3: {1: 'CAS No. Composition Name Weight by % 63449-41-2 Benzalkonium chloride 0.25% 2682-20-4 2-methyl-4
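
Many extracted values carry `·` separators from the original PDF bullets, and some (such as the product identifier above) consist of nothing but a `·`. As a possible post-processing step, the sketch below splits such a value into its parts. `split_bullets` is a hypothetical helper, not part of the pipeline above.

```python
def split_bullets(value):
    """Split a '·'-separated string into its non-empty, stripped parts."""
    return [part.strip() for part in value.split("·") if part.strip()]

parts = split_bullets("· Trade name: 10N Sodium Hydroxide (NaOH 40%) · Product Number: NGT-10N NaOH ·")
print(parts)
print(split_bullets("·"))  # a value holding only a bullet yields an empty list
```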

### Process all PDF files in a directory

In [30]:
# PDFminer based 
# Only top level  - TODO sublevel

import os

def MinerProcessAllPDFs(Directory):
    # Converts all PDF files in Directory to sections and writes one JSON file per PDF
    
    # Create the JSON subdirectory if it does not exist yet
    os.makedirs(os.path.join(Directory, 'JSON'), exist_ok=True)
    
    for filename in os.listdir(Directory):
        if filename.endswith(".pdf"):
            print(f"\n======================================================\nFile: {filename}\n======================================================")
            PDFtext = miner_extract_text_from_pdf(os.path.join(Directory, filename))
            CleanedFile = RemovePageHeaders(PDFtext)
            PDFSectionsText = PDF2SubSections(CleanedFile)
            if PDFSectionsText is None:
                continue   # a section header was not found; skip this file
            PrintSections(PDFSectionsText)
            write_json(PDFSectionsText, os.path.join(Directory, 'JSON', filename + '.json'))
    
    return(None)

In [31]:
# Tika based
# Top level + Sublevel

import os

def TikaProcessAllPDFs(Directory):
    # Converts all PDF files in Directory to sections and subsections,
    # and writes one JSON file per PDF
    
    # Create the JSON subdirectory if it does not exist yet
    os.makedirs(os.path.join(Directory, 'JSON'), exist_ok=True)
    
    for filename in os.listdir(Directory):
        if filename.endswith(".pdf"):
            print(f"\n======================================================\nFile: {filename}\n======================================================")
            PDFtext = tika_extract_text_from_pdf(os.path.join(Directory, filename))
            CleanedFile = RemovePageHeaders(PDFtext)
            PDFSectionsText = PDF2SubSections(CleanedFile)
            if PDFSectionsText is None:
                continue   # a section header was not found; skip this file
            PrintSubSections(PDFSectionsText)
            write_json(PDFSectionsText, os.path.join(Directory, 'JSON', filename + '.json'))
    
    return(None)
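
The directory handling in both functions can also be expressed with `pathlib`. The sketch below only pairs each PDF with an output path and leaves the extraction itself out. `pdf_jobs` is a hypothetical helper; unlike the functions above it names the output `name.json` rather than `name.pdf.json` and visits the files in sorted order.

```python
from pathlib import Path

def pdf_jobs(directory):
    """Yield (pdf_path, json_path) pairs for every PDF directly in directory."""
    directory = Path(directory)
    json_dir = directory / "JSON"
    json_dir.mkdir(exist_ok=True)                    # create the output folder once
    for pdf_path in sorted(directory.glob("*.pdf")):
        yield pdf_path, json_dir / (pdf_path.stem + ".json")
```

A loop such as `for pdf_path, json_path in pdf_jobs('./data'):` could then call the extraction and `write_json` steps shown earlier for each pair.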


In [32]:
TikaProcessAllPDFs('./data')


File: 10N_Sodium_Hydroxide_NaOH_40_6_US_EN_sds (1).pdf
Section  0 :   

Section  1 :  SECTION 1: Identification of the substance 

-------------
SubSection  1 :  Product identifier 
-------------

· Trade name: 10N Sodium Hydroxide (NaOH 40%) · Product Number: NGT-10N NaOH ·
-------------
SubSection  2 :  Relevant identified uses of the substance or mixture and uses advised against 
-------------

No further relevant information available. · Product Description PC21 Laboratory chemicals · Application of the substance / the mixture: Laboratory chemicals ·
-------------
SubSection  3 :  Details of the supplier of the safety data sheet 
-------------

· Manufacturer/Supplier: NuGeneration Technologies, LLC (dba NuGenTec) 1155 Park Avenue, Emeryville, CA 94608 salesteam@nugentec.com www.nugentec.com 1-888-996-8436 or 1-707-820-4080 for product information ·
-------------
SubSection  4 :  Emergency telephone number 
-------------

telephone number: PERS Emergency Response: Domestic and Can

Section  0 :   

Section  1 :  SECTION 1: Identification of the substance 

-------------
SubSection  3 :  Details of the supplier of the safety data sheet 
-------------

address: Contact information: Henkel Corporation One Henkel Way Rocky Hill, Connecticut 06067 Telephone: 860.571.5100 MEDICAL
-------------
SubSection  4 :  Emergency telephone number 
-------------

Phone: Poison Control Center 1-877-671-4608 (toll free) or 1-303-592-1711 TRANSPORT EMERGENCY Phone: CHEMTREC 1-800-424-9300 (toll free) or 1-703-527-3887 Internet: www.henkelna.com
Section  2 :  SECTION 2: Hazards identification 

-------------
SubSection  1 :  Classification of the substance or mixture 
-------------

HMIS: Physical state: Liquid HEALTH: *3 Color: Pale yellow FLAMMABILITY: 1 Odor: Mild PHYSICAL HAZARD: 0 Personal Protection: See MSDS Section 8 DANGER: CAUSES EYE, SKIN AND RESPIRATORY TRACT BURNS. MAY CAUSE ALLERGIC SKIN REACTION. HARMFUL IF SWALLOWED OR INHALED. MAY BE HARMFUL IF ABSORBED THROUGH SKIN.

Section  0 :   

1. PRODUCT AND COMPANY IDENTIFICATION Company Arkema Inc. 900 First Avenue King of Prussia, Pennsylvania 19406 Functional Additives Customer Service Telephone Number: (800) 331-7654 (Monday through Friday, 8:30 AM to 5:30 PM EST) Emergency Information Transportation: CHEMTREC: (800) 424-9300 (24 hrs., 7 days a week) Medical: Rocky Mountain Poison Center: (866) 767-5089 (24 hrs., 7 days a week) Product Information Product name: LUPEROX® P Synonyms: Peroxyester, t-butyl perbenzoate, tert-butyl peroxybenzoate Molecular formula: C11 H14 O3 Chemical family: Organic peroxide - peroxyesters Product use: Initiator 2. HAZARDS IDENTIFICATION Emergency Overview Color: Colourless to yellow. Physical state: liquid Odor: unpleasant CAUTION! ORGANIC PEROXIDE. HAZARDOUS DECOMPOSITION MAY OCCUR. MAY CAUSE ALLERGIC SKIN REACTION. MAY CAUSE SKIN IRRITATION. Potential Health Effects Primary routes of exposure: Inhalation and skin contact. Signs and symptoms of acute exposure: May cause sk

Section  0 :   

Section  1 :  SECTION 1: Identification of the substance 

-------------
SubSection  4 :  Emergency telephone number 
-------------

Telephone: Customer Service: Product Disposal Information: CHEMTREC: (989) 496-5900 (989) 496-6000 (989) 496-6315 (800) 424-9300 MSDS No.: 04088687 Generic Description: Alkoxysilane Physical Form: Liquid Color: Colorless to pale yellow Odor: Aromatic odor NFPA Profile: Health 2 Flammability 1 Instability/Reactivity 0 Note: NFPA = National Fire Protection Association
Section  2 :  SECTION 2: Hazards identification 

-------------
SubSection  3 :  Other hazards 
-------------

Acute Effects Eye: Direct contact may cause mild irritation. Skin: May cause moderate irritation. Inhalation: Vapor irritating to the respiratory tract. Vapor overexposure may cause drowsiness. Oral: Overexposure by ingestion may cause effects similar to those listed under repeated exposure. Prolonged/Repeated Exposure Effects Skin: Overexposure may injure internally 

Section  0 :   

Section  1 :  SECTION 1: Identification of the substance 

-------------
SubSection  1 :  Product identifier 
-------------

: BYK-349 Product Use Description : Substrate Wetting Additive
-------------
SubSection  3 :  Details of the supplier of the safety data sheet 
-------------

: BYK USA Inc. 524 South Cherry Street Wallingford CT 06492 Prepared by : J.Nole, Safety; M.McCutcheon, Regulatory Telephone : (203) 265-2086 Visit our web site : www.byk.com E-mail address : ehs.byk.usa@altana.com
-------------
SubSection  4 :  Emergency telephone number 
-------------

telephone number : CHEMTREC 800-424-9300
Section  2 :  SECTION 2: Hazards identification 

-------------
SubSection  1 :  Classification of the substance or mixture 
-------------

Form : liquid Colour : light brown Odour : not significant OSHA Regulatory Status This material is considered hazardous by the OSHA Hazard Communication Standard (29 CFR1910.1200)
-------------
SubSection  3 :  Other hazards 
---

Section  0 :   

1. PRODUCT AND COMPANY IDENTIFICATION 1.1 Product identifiers Product name : Tri(ethylene glycol) divinyl ether Product Number : 329800 Brand : Aldrich CAS-No. : 765-12-8 1.2 Relevant identified uses of the substance or mixture and uses advised against Identified uses : Laboratory chemicals, Manufacture of substances 1.3 Details of the supplier of the safety data sheet Company : Sigma-Aldrich 3050 Spruce Street SAINT LOUIS MO 63103 USA Telephone : +1 800-325-5832 Fax : +1 800-325-5052 1.4 Emergency telephone number Emergency Phone # : +1-703-527-3887 (CHEMTREC) 2. HAZARDS IDENTIFICATION 2.1 Classification of the substance or mixture Not a hazardous substance or mixture. 2.2 GHS Label elements, including precautionary statements Not a hazardous substance or mixture. 2.3 Hazards not otherwise classified (HNOC) or not covered by GHS - none 3. COMPOSITION/INFORMATION ON INGREDIENTS 3.1 Substances Formula : C10H18O4 Molecular Weight : 202.25 g/mol CAS-No. : 765-12-8 Hazardo

Section  0 :   

Section  1 :  SECTION 1: Identification of the substance 

-------------
SubSection  1 :  Product identifier 
-------------

CARBOWAX™ SENTRY™ POLYETHYLENE GLYCOL 400 NF, FCC GRADE; MACROGOL 400 Ph. Eur.
-------------
SubSection  2 :  Relevant identified uses of the substance or mixture and uses advised against 
-------------

Identified uses: For the CARBOWAX™ Product Line, a partial list of applications include pharmaceutical products, personal care products, automotive products, household products, packaging products, petroleum chemicals, plastics, inks, coatings, adhesives, chemical intermediates, rubber processing, lubricants, metalworking fluids, mold release agents, ceramics, and wood treating. CARBOWAX™ SENTRY™ Polyethylene Glycol Grades 300 – 8000 have clearances under FDA. It is the responsibility of the user of this product to read and understand all applicable FDA regulations in Title 21 of the Code of Federal regulations as well as any other applicable reg

Section  0 :   

Section  1 :  SECTION 1: Identification of the substance 

-------------
SubSection  1 :  Product identifier 
-------------

·
-------------
SubSection  2 :  Relevant identified uses of the substance or mixture and uses advised against 
-------------

· Application of the substance/ mixture: Clean hands and disinfection ·
-------------
SubSection  3 :  Details of the supplier of the safety data sheet 
-------------

· Manufacturer/Supplier: Tel: 13957903789 Email: / · Further information obtainable from: Jinhua dubei maternal and Infant Products Co., Ltd ·
-------------
SubSection  4 :  Emergency telephone number 
-------------

telephone number Miss wang Tel: 13957903789
Section  2 :  SECTION 2: Hazards identification 

-------------
SubSection  1 :  Classification of the substance or mixture 
-------------

of the substance or mixture: GHS02 Flame Flam. Liq. 3 H226 Flammable liquid and vapour GHS07 Exclamation mark Eye Irrit. 2A H319 Causes serious eye irritation · I

Section  0 :   

Section  1 :  SECTION 1: Identification of the substance 

-------------
SubSection  1 :  Product identifier 
-------------

GLYCOL ETHER DPM CAS NO: 34590-94-8 PROPER NAME: DIPROPYLENE GLYCOL MONOMETHYL ETHER
Section  2 :  SECTION 2: Hazards identification 

-------------
SubSection  3 :  Other hazards 
-------------

Eye Contact: May cause slight temporary eye irritation. Corneal injury is unlikely. Skin Contact: Prolonged exposure not likely to cause significant skin irritation. Skin Absorption: Prolonged skin contact with very large amounts may cause dizziness or drowsiness. Inhalation: Excessive exposure may cause irritation to upper respiratory tract (nose and throat). Symptoms of excessive exposure may be anesthetic or narcotic effects; dizziness and drowsiness may be observed. Ingestion: Very low toxicity if swallowed. Harmful effects not anticipated from swallowing small amounts. Aspiration hazard: Based on physical properties, not likely to be an aspiration h

Section  0 :   

Section  1 :  SECTION 1: Identification of the substance 

-------------
SubSection  1 :  Product identifier 
-------------

· Trade name: Hydrogen Peroxide 35% ·
-------------
SubSection  2 :  Relevant identified uses of the substance or mixture and uses advised against 
-------------

· Product Description PC21 Laboratory chemicals ·
-------------
SubSection  3 :  Details of the supplier of the safety data sheet 
-------------

· Manufacturer/Supplier: NuGeneration Technologies, LLC (dba NuGenTec) 1155 Park Avenue, Emeryville, CA 94608 salesteam@nugentec.com www.nugentec.com 1-888-996-8436 or 1-707-820-4080 for product information ·
-------------
SubSection  4 :  Emergency telephone number 
-------------

telephone number: PERS Emergency Response: Domestic and Canada - 1-800-633-8253, International 1-801-629-0667
Section  2 :  SECTION 2: Hazards identification 

-------------
SubSection  1 :  Classification of the substance or mixture 
-------------

of the substance

Section  0 :   

SECTION 1: Identification 1.1. Identification Product form : Substance Substance name : Isopropyl Alcohol (2-Propanol) CAS-No. : 67-63-0 Product code : LC15750 Formula : C3H8O Synonyms : 1-methylethanol / 1-methylethyl alcohol / 2-hydroxypropane / dimethyl carbinol / ethyl carbinol / hydroxypropane / IPA / i-propanol / isoethylcarbinol / propan-2-ol / sec-propanol 1.2. Recommended use and restrictions on use Use of the substance/mixture : Disinfectant Solvent Recommended use : Laboratory chemicals Restrictions on use : Not for food, drug or household use 1.3. Supplier LabChem, Inc. Jackson's Pointe Commerce Park Building 1000, 1010 Jackson's Pointe Court Zelienople, PA 16063 - USA T 412-826-5230 - F 724-473-0647 info@labchem.com - www.labchem.com 1.4. Emergency telephone number Emergency number : CHEMTREC: 1-800-424-9300 or +1-703-741-5970 SECTION 2: Hazard(s) identification 2.1. Classification of the substance or mixture GHS US classification Flammable liquids Categor

Section  0 :   

Section  1 :  SECTION 1: Identification of the substance 

-------------
SubSection  1 :  Product identifier 
-------------

· Trade name: N,N-dimethylformamide · CAS Number: 68-12-2 · EC number: 200-679-5 · Index number: 616-001-00-X ·
-------------
SubSection  2 :  Relevant identified uses of the substance or mixture and uses advised against 
-------------

No further relevant information available. · Product Description PC19 Intermediate ·
-------------
SubSection  3 :  Details of the supplier of the safety data sheet 
-------------

· Manufacturer/Supplier: NuGeneration Technologies, LLC (dba NuGenTec) 1155 Park Avenue, Emeryville, CA 94608 salesteam@nugentec.com www.nugentec.com 1-888-996-8436 or 1-707-820-4080 for product information ·
-------------
SubSection  4 :  Emergency telephone number 
-------------

telephone number: PERS Emergency Response: Domestic and Canada - 1-800-633-8253, International 1-801-629-0667
Section  2 :  SECTION 2: Hazards identification

Section  0 :   

Section  1 :  SECTION 1: Identification of the substance 

-------------
SubSection  1 :  Product identifier 
-------------

Omnirad 184 Chemical name Hydroxycyclohexyl phenyl ketone CAS number 947-19-3
-------------
SubSection  2 :  Relevant identified uses of the substance or mixture and uses advised against 
-------------

Application Photoinitiator. Uses advised against No specific uses advised against are identified.
-------------
SubSection  3 :  Details of the supplier of the safety data sheet 
-------------

Supplier www.igmresins.com IGM Resins USA Inc. 3300 Westinghouse Blvd Charlotte, NC 28273, United States regulatory@igmresins.com T: +1 630-213-1616 F: +1 704 450 6484
-------------
SubSection  4 :  Emergency telephone number 
-------------

telephone number Emergency telephone Italy +39 0362 512868 EMEA +1 760 476 3961 Access code 333 293. US +1 760 476 3962 Access code 333 293.
Section  2 :  SECTION 2: Hazards identification 

-------------
SubSection  1 

Section  0 :   

1. Identification Product Name Phthalic anhydride Cat No. : AC423320000; AC423320020; AC423320050; AC423320250; AC423325000 CAS-No 85-44-9 Synonyms 1,3-Isobenzofurandione Recommended Use Laboratory chemicals. Uses advised against Food, drug, pesticide or biocidal product use. Details of the supplier of the safety data sheet Emergency Telephone Number For information US call: 001-800-ACROS-01 / Europe call: +32 14 57 52 11 Emergency Number US:001-201-796-7100 / Europe: +32 14 57 52 99 CHEMTREC Tel. No.US:001-800-424-9300 / Europe:001-703-527-3887 2. Hazard(s) identification Classification This chemical is considered hazardous by the 2012 OSHA Hazard Communication Standard (29 CFR 1910.1200) Label Elements Signal Word Danger Company Fisher Scientific One Reagent Lane Fair Lawn, NJ 07410 Tel: (201) 796-7100 Acros Organics One Reagent Lane Fair Lawn, NJ 07410 Acute oral toxicity Category 4 Skin Corrosion/Irritation Category 2 Serious Eye Damage/Eye Irritation Category 1 Re

Section  0 :   

* * * Section 1 - Identification of the Substance/Preparation and the Company/Undertaking * * * Manufacturer Information Zaclon LLC Phone: 216-271-1569 or 800-356-7327 2981 Independence Road Fax: 216-271-1792 Cleveland, OH 44115 Emergency # 800-424-9300 CHEMTREC * * * Section 2 - Composition / Information on Ingredients * * * EC # Component Percent Symbols Risks 200-661-7 Isopropyl alcohol 67-63-0 30-61 Xi R:11-36-67 200-662-2 Acetone 67-64-1 10-16 Xi R:11-36-66-67 231-791-2 Water 7732-18-5 3.7-13 239-286-9 Chromium, tetrachloro--hydroxy[- (octadecanoato-O:O')]di- 15242-96-3 2.85-4.6 239-730-1 Chromium, tetrachloro--hydroxy[- (tetradecanoato-O:O')]di- 15659-56-0 2.85-4.6 201-161-1 Chloroacetone 78-95-5 0.3 * * * Section 3 - Hazards Identification * * * Human and Environmental Hazards Causes eye burns. May cause irritation of the skin, nose and throat. Ingestion of this material may cause gastrointestinal tract irritation. * * * Section 4 - First Aid Measures * * * 

Section  0 :   

SECTION 1: Identification of the substance/mixture and of the company/undertaking 1.1. Product identifier Trade name Material number: 103327 1.2. Relevant identified uses of the substance or mixture and uses advised against Relevant identified uses of the substance or mixture Industry sector : Paints, lacquers and varnishes industry Type of use : dye for special industries 1.3. Details of the supplier of the safety data sheet Identification of the company Clariant Production (France) Usine de Huningue Avenue de Bâle 68331 Huningue Telephone no. : +33 3 89 89 60 00 Information about the substance/mixture Division Pigments & Additives tel.: +33.3.89.89.63.38 e-mail: France.ProductSafety@clariant.com 1.4. Emergency telephone number +33 1 45 42 59 59 (24 h) SECTION 2: Hazards identification 2.1. Classification of the substance or mixture 2.2. Label elements Labelling according CLP regulation (Regulation (EC) No. 1272/2008, as amended) The product does not require classific

Section  0 :   

Section  1 :  SECTION 1: Identification of the substance 

-------------
SubSection  1 :  Product identifier 
-------------

Other means of identification
-------------
SubSection  2 :  Relevant identified uses of the substance or mixture and uses advised against 
-------------

Recommended restrictions SAFETY DATA SHEET Horizon Blue Clarifier None. ALL PROPER AND LEGAL PURPOSES None known. Manufacturer/Importer/Supplier/Distributor information Manufacturer
-------------
SubSection  3 :  Details of the supplier of the safety data sheet 
-------------

name Address Telephone E-mail
-------------
SubSection  4 :  Emergency telephone number 
-------------


Section  2 :  SECTION 2: Hazards identification 

-------------
SubSection  1 :  Classification of the substance or mixture 
-------------

Observe good industrial hygiene practices. Wash hands after handling. Store away from incompatible materials. Dispose of waste and residues in accordance with local authority requi

Section  0 :   

Section  1 :  SECTION 1: Identification of the substance 

-------------
SubSection  2 :  Relevant identified uses of the substance or mixture and uses advised against 
-------------

Not applicable. Product code : 1000494-496, 1000498-500, 1000502-504, 1000543-545, 1001960-964 SAFETY DATA SHEET (800) 535-5053 Product use : Dental Products Polymer
-------------
SubSection  3 :  Details of the supplier of the safety data sheet 
-------------

Keystone Industries 616 Hollywood Ave. Cherry Hill, NJ 08002 (856) 663-4700
Section  2 :  SECTION 2: Hazards identification 

-------------
SubSection  1 :  Classification of the substance or mixture 
-------------

of the substance or mixture Signal word : Danger Hazard statements : May form combustible dust concentrations in air. May damage the unborn child. Suspected of damaging fertility. Suspected of causing cancer. Hazard pictograms : Precautionary statements Prevention : Obtain special instructions before use. Do not handle 

Section  0 :   

Section  1 :  SECTION 1: Identification of the substance 

-------------
SubSection  1 :  Product identifier 
-------------

· Trade name: Trifluoroacetic Acid (TFA) · Product Number: ngt-TFA · CAS Number: 76-05-1 · EC number: 200-929-3 · Index number: 607-091-00-1 ·
-------------
SubSection  2 :  Relevant identified uses of the substance or mixture and uses advised against 
-------------

No further relevant information available. · Product Description PC19 Intermediate ·
-------------
SubSection  3 :  Details of the supplier of the safety data sheet 
-------------

TFA - Pharmacutical Grade Manufactured by Rhodia · Manufacturer/Supplier: NuGeneration Technologies, LLC (dba NuGenTec) 1155 Park Avenue, Emeryville, CA 94608 salesteam@nugentec.com www.nugentec.com 1-888-996-8436 or 1-707-820-4080 for product information ·
-------------
SubSection  4 :  Emergency telephone number 
-------------

telephone number: PERS Emergency Response: Domestic and Canada - 1-800-633-82

Section  0 :   

Section  1 :  SECTION 1: Identification of the substance 

-------------
SubSection  1 :  Product identifier 
-------------

Tinuvin® 1130 Chemical name: reaction mass of alpha-3-(3-(2H-benzotriazol-2- yl)-5-tert-butyl-4- hydroxyphenyl)propionyl- omega-hydroxypoly(oxyethylene) and alpha-3-(3- (2H-benzotriazol-2-yl)-5- tert-butyl-4-hydroxyphenyl)propionyl-omega-3-(3-(2H-benzotriazol-2-yl)-5-tert-butyl-4- hydroxyphenyl)propionyloxypoly(oxyethylene) INDEX-Number: 607-176-00-3
-------------
SubSection  2 :  Relevant identified uses of the substance or mixture and uses advised against 
-------------

Relevant identified uses: stabilizer Details of the supplier of the safety data sheet
-------------
SubSection  3 :  Details of the supplier of the safety data sheet 
-------------

BASF SE 67056 Ludwigshafen GERMANY Regional Business Unit Dispersions and Resins Europe Telephone: +49 621 60-90799 E-mail address: ed-psr@basf.com
-------------
SubSection  4 :  Emergency telephone

## Remarks

InputFile = 'data/1799516.pdf'

!!!!!!!! Section 12 text ends up in section 11 (header 12 is not recognized).
         Some information appears more than once, and section 16 contains text from other sections. !!!!!!!!!

InputFile = 'data/gylcol-ether-dpm-sds.pdf'

!!!!! Section title '8. EXPOSURE CONTROLS AND PERSONAL PROECTION' contains a typo ('PROECTION' instead of 'PROTECTION'). !!!!!
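
A misspelled header like the one above can still be mapped to its section number with approximate string matching. The following is a minimal sketch, not the parser used in this notebook: the `CANONICAL_TITLES` table (the usual GHS wording of the 16 section titles), the helper names and the `cutoff` value are all assumptions, and individual sheets may word their titles differently.

```python
import difflib
import re

# Assumption: usual GHS wording of the 16 SDS section titles.
CANONICAL_TITLES = {
    1: "Identification",
    2: "Hazard(s) identification",
    3: "Composition/information on ingredients",
    4: "First-aid measures",
    5: "Fire-fighting measures",
    6: "Accidental release measures",
    7: "Handling and storage",
    8: "Exposure controls/personal protection",
    9: "Physical and chemical properties",
    10: "Stability and reactivity",
    11: "Toxicological information",
    12: "Ecological information",
    13: "Disposal considerations",
    14: "Transport information",
    15: "Regulatory information",
    16: "Other information",
}


def normalize(title):
    """Lowercase, drop a leading 'SECTION n:' / 'n.' prefix, the word 'and' and punctuation."""
    title = re.sub(r"^\s*(section\s*)?\d+\s*[.:]?\s*", "", title.lower())
    title = re.sub(r"\band\b", " ", title)
    return " ".join(re.findall(r"[a-z]+", title))


def match_section(header, cutoff=0.8):
    """Return the section number whose canonical title is closest to header, or None."""
    candidates = {normalize(t): n for n, t in CANONICAL_TITLES.items()}
    hits = difflib.get_close_matches(normalize(header), list(candidates), n=1, cutoff=cutoff)
    return candidates[hits[0]] if hits else None


print(match_section("8. EXPOSURE CONTROLS AND PERSONAL PROECTION"))
```

Matching on the title text rather than on the printed number would also help with files whose numbering deviates from the standard order, at the cost of possible false matches when the cutoff is set too low.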

InputFile = 'data/NN-dimethylformamide_3_US_EN_sds.pdf'

!!!!! Section 16: the first letter of the text is lost. !!!!!!

InputFile = "data/quilon_h-eu_msds__1594909058.pdf"

!!!!! This file does not respect the section numbering: sections 2 and 3 are interchanged. !!!!!!!!!

InputFile = "data/SDS - Horizon Blue Clarifier.pdf"

!!!! The text of this file is out of order: labels and values are no longer together, and some values are missing. !!!!

InputFile = "data/Sledgehammer-Heat-Cure-Powder-SDS-US-20150522.pdf"

!!!! The section header is repeated on each new page, so the same section header occurs multiple times. !!!!
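
One way to cope with repeated headers is to merge all chunks that carry the same section number before further processing. The sketch below assumes the parser yields `(section_number, text)` pairs in document order; that input format and the function name `merge_repeated_sections` are hypothetical and not taken from the code in this notebook.

```python
def merge_repeated_sections(chunks):
    """Merge chunks that share a section number, keeping first-seen order.

    chunks: iterable of (section_number, text) pairs in document order
    (hypothetical format, see the note above).
    """
    merged = {}
    for number, text in chunks:
        text = text.strip()
        if number in merged:
            # Header seen before (e.g. repeated at a page break): append the text.
            if text:
                merged[number] = (merged[number] + " " + text).strip()
        else:
            merged[number] = text
    return merged


chunks = [
    (1, "Product identifier ..."),
    (2, "Signal word : Danger"),
    (2, "Hazard statements : ..."),  # header repeated on the next page
    (3, "Composition ..."),
]
print(merge_repeated_sections(chunks))
```

This only handles clean repeats; it does not repair a section whose header was never recognized in the first place.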

## Hazard images extraction

#### PDFBox

from:

https://pdfbox.apache.org/

https://pypi.org/project/python-pdfbox/

In [33]:
import pdfbox

p = pdfbox.PDFBox()

In [34]:
p

<pdfbox.PDFBox at 0x7feaad084700>

In [35]:
InputFile = 'data/gylcol-ether-dpm-sds.pdf'

In [36]:
# extracts images embedded in the PDF file as .png files
p.extract_images(InputFile) # writes images to /path/to/my_file-1.png, /path/to/my_file-2.png, etc.
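
To check what was actually extracted, the generated files can be listed afterwards. This is a small sketch that relies only on the naming pattern described in the comment above (`<pdf name>-<n>.png` next to the input file); that pattern has not been verified here, and the helper name `list_extracted_images` is hypothetical.

```python
import glob
from pathlib import Path


def list_extracted_images(pdf_path):
    """Return the PNG files next to pdf_path that are named '<stem>-<something>.png'."""
    pdf = Path(pdf_path)
    # Escape the stem so characters such as '[' in a file name are matched literally.
    pattern = glob.escape(pdf.stem) + "-*.png"
    return sorted(pdf.parent.glob(pattern))


for image in list_extracted_images("data/gylcol-ether-dpm-sds.pdf"):
    print(image)
```

An empty result can mean either that the PDF has no embedded raster images (pictograms may be drawn as vector graphics) or that the output naming differs from the assumed pattern.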

In [37]:
# text extraction, in HTML format:
p.extract_text(InputFile, '-html')   # writes the output to /path/to/my_file.html, i.e. strips the extension and replaces it with .html