# PDF File Extraction - Split into top-level sections

Extract the complete text of a PDF file and split it into its top-level sections,
in preparation for the ML6 use case.


### With PDFminer

In [1]:
# See https://www.blog.pythonlibrary.org/2018/05/03/exporting-data-from-pdfs-with-python/
#     http://zevross.com/blog/2014/04/09/extracting-tabular-data-from-a-pdf-an-example-using-python-and-regular-expressions/

import io
from pdfminer.converter import TextConverter
from pdfminer.pdfinterp import PDFPageInterpreter
from pdfminer.pdfinterp import PDFResourceManager
from pdfminer.pdfpage import PDFPage

def extract_text_from_pdf(pdf_path):
    # Return the text of all pages of the PDF as one string, or None if no text was extracted.
    resource_manager = PDFResourceManager()
    fake_file_handle = io.StringIO()
    converter = TextConverter(resource_manager, fake_file_handle)
    page_interpreter = PDFPageInterpreter(resource_manager, converter)

    try:
        with open(pdf_path, 'rb') as fh:
            for page in PDFPage.get_pages(fh,
                                          caching=True,
                                          check_extractable=True):
                page_interpreter.process_page(page)

        text = fake_file_handle.getvalue()
    finally:
        # close open handles, even if parsing raises an exception
        converter.close()
        fake_file_handle.close()

    # An empty result is reported as None
    return text if text else None


In [2]:
import re

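def FindOccurences(Text, SubString):
    # Find all occurrences of SubString in Text.
    # SubString is interpreted as a regular expression, so metacharacters such as ( ) must be escaped to match literally.
    # Returns a dict mapping the occurrence number (starting at 1) to its (begin, end) position in Text.
    # If nothing is found, returns an empty dictionary.

    count = 1
    Dict = {}
    for m in re.finditer(SubString, Text):
        Dict[count] = (m.start(), m.end())
        count += 1

    return(Dict)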
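
To illustrate the return value described in the comments above, here is a minimal, self-contained sketch. The name `find_occurrences` and the sample text are made up for this example; the function is only meant as a compact equivalent of FindOccurences and is not used elsewhere in this notebook.

```python
import re

def find_occurrences(text, pattern):
    # Hypothetical compact equivalent of FindOccurences:
    # number the regex matches from 1 and record their (begin, end) positions.
    return {n: (m.start(), m.end())
            for n, m in enumerate(re.finditer(pattern, text), start=1)}

# Made-up sample text in which the same header occurs twice
sample = "2. HAZARDS IDENTIFICATION ... see 2. HAZARDS IDENTIFICATION"

hits = find_occurrences(sample, r"2\. HAZARDS IDENTIFICATION")
print(hits)                                          # one entry per occurrence
print(find_occurrences(sample, r"3\. COMPOSITION"))  # empty dict: no match
```

Slicing the text with a returned pair, as in `sample[begin:end]`, gives back the matched header itself.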

In [3]:
def VerifyOccurence(AllHeaders):
    # Classify the number of occurrences found for a header: none, exactly one, or more than one.
    
    if len(AllHeaders) == 0:
        s = "No Occurences"
    elif len(AllHeaders) > 1:
        s = "Multiple Occurences"
    else:
        s = "OK - Single Occurence"
    
    return(s)   

In [4]:
def ExtractText(Text, Begin, End):
    # Extract text Segment between 2 position indexes in a text string
    
    return(Text[Begin:End])

In [5]:
# From Maarten:

dict_structure = {"SECTION 1: Identification of the substance":
                          {"1.1. Product identifier":"",
                           "1.2. Relevant identified uses of the substance or mixture and uses advised against":"",
                           "1.3. Details of the supplier of the safety data sheet":"",
                           "1.4. Emergency telephone number":""},
                  "SECTION 2: Hazards identification":
                          {"2.1. Classification of the substance or mixture":"",
                           "2.2. Label elements":"",
                           "2.3. Other hazards":""},
                  "SECTION 3: Composition/information on ingredients":
                          {"3.1. Substances":"",
                           "3.2. Mixtures":""},
                  "SECTION 4: First aid measures":
                          {"4.1. Description of first aid measures":"",
                           "4.2. Most important symptoms and effects, both acute and delayed":"",
                           "4.3. Indication of any immediate medical attention and special treatment needed":""},
                  "SECTION 5: Firefighting measures":
                          {"5.1. Extinguishing media":"",
                           "5.2. Special hazards arising from the substance or mixture":"",
                           "5.3. Advice for firefighters":""},
                  "SECTION 6: Accidental release measures":
                          {"6.1. Personal precautions, protective equipment and emergency procedures":"",
                           "6.2. Environmental precautions":"",
                           "6.3. Methods and material for containment and cleaning up":"",
                           "6.4. Reference to other sections":""},
                  "SECTION 7: Handling and storage":
                          {"7.1. Precautions for safe handling":"",
                           "7.2. Conditions for safe storage, including any incompatibilities":"",
                           "7.3. Specific end use(s)":""},
                  "SECTION 8: Exposure controls/personal protection":
                          {"8.1. Control parameters":"",
                           "8.2. Exposure controls":""},
                  "SECTION 9: Physical and chemical properties":
                          {"9.1. Information on basic physical and chemical properties":"",
                           "9.2. Other information":""},
                  "SECTION 10: Stability and reactivity":
                          {"10.1. Reactivity":"",
                           "10.2. Chemical stability":"",
                           "10.3. Possibility of hazardous reactions":"",
                           "10.4. Conditions to avoid":"",
                           "10.5. Incompatible materials":"",
                           "10.6. Hazardous decomposition products":""},
                  "SECTION 11: Toxicological information":
                          {"11.1. Information on toxicological effects":""},
                  "SECTION 12: Ecological information":
                          {"12.1. Toxicity":"",
                           "12.2. Persistence and degradability":"",
                           "12.3. Bioaccumulative potential":"",
                           "12.4. Mobility in soil":"",
                           "12.5. Results of PBT and vPvB assessment":"",
                           "12.6. Other adverse effects":""},
                  "SECTION 13: Disposal considerations":
                          {"13.1. Waste treatment methods":""},
                  "SECTION 14: Transport information":
                          {"14.1. UN number":"",
                           "14.2. UN proper shipping name":"",
                           "14.3. Transport hazard class(es)":"",
                           "14.4. Packing group":"",
                           "14.5. Environmental hazards":""},
                  "SECTION 15: Regulatory information":
                          {"15.1. Safety, health and environmental regulations/legislation specific for the substance or mixture":"",
                           "15.2. Chemical safety assessment":""},
                  "SECTION 16: Other information":""
                 }


In [81]:
# SDS headers top level:
# All variants in a list need to be replaced by a single Regular Expression, matching all of the items in that list.
# The first element in the list is taken from the standard specification. The others are taken from all 20 PDF test files. 
# There may be repeats in the lists

HeaderDict = { 0 : [""], \
               1 : ["SECTION 1: Identification of the substance", "1. PRODUCT AND COMPANY IDENTIFICATION", "1.  PRODUCT AND COMPANY IDENTIFICATION", "1 Identification", "1. Identification", "1. Identification", "SECTION 1: Identification", "1. IDENTIFICATION", "SECTION 1: PRODUCT AND COMPANY IDENTIFICATION", "Section 1. Identification", " SECTION 1: Identification of the substance/mixture and of the company/undertaking ", "SECTION 1: Identification of the Substance/Preparation and the Company/Undertaking", "Section 1 - Identification of the Substance/Preparation and the Company/Undertaking"], \
               2 : ["SECTION 2: Hazards identification", "Hazards Identification", "2. HAZARDS IDENTIFICATION", r"2 Hazard\(s\) identification", r"2 Hazard\(s\) Identification", r"2. Hazard\(s\) identification", r"SECTION 2: Hazard\(s\) identification", "2.  HAZARDS IDENTIFICATION"," SECTION 2: HAZARDS IDENTIFICATION","SECTION 2: Hazards Identification", "SECTION 2: Hazards identification", "Section 2. Hazards identification"], \
               3 : ["SECTION 3: Composition/information on ingredients", "3. COMPOSITION/INFORMATION ON INGREDIENTS", "3 Composition/information on ingredients", "3 Composition/Information on Ingredients", "3.  COMPOSITION / INFORMATION ON INGREDIENTS", "3. Composition/information on ingredients", "3. Composition/Information on Ingredients", "Composition / Information on Ingredients",  "SECTION 3: Composition/Information on ingredients", "SECTION 3: Composition/Information on Ingredients", "SECTION 3: COMPOSITION/INFORMATION ON INGREDIENTS", "SECTION 3:Composition/information on ingredients", "Section 3. Composition/information on ingredients", "SECTION 3: COMPOSITION"], \
               4 : ["SECTION 4: First aid measures", "4. FIRST AID MEASURES", "4 First-Aid Measures", "4 First aid measures", "4. First-aid measures", "4. First-Aid Measures", "4.  FIRST AID MEASURES", "SECTION 4: First-aid measures", "SECTION 4: First-Aid Measures", "SECTION 4: FIRST AID MEASURES", " SECTION 4: First aid measures", "Section 4 - First Aid Measures", "Section 4. First aid measures"], \
               5 : ["SECTION 5: Firefighting measures", "5. FIRE-FIGHTING MEASURES", "5 Fire-Fighting Measures", "5 Fire-fighting measures", "5. FIRE FIGHTING MEASURES", "5. FIREFIGHTING MEASURES", "5.  FIRE FIGHTING MEASURES", "5. Fire-fighting measures", "5. Fire-Fighting Measures", "SECTION 5: Fire-fighting measures", "SECTION 5: Fire-Fighting Measures", "SECTION 5: FIREFIGHTING MEASURES", "SECTION 5:FIRE-FIGHTING MEASURES", "SECTION 5: FIRE FIGHTING MEASURES"," SECTION 5: Firefighting measures", "Section 5 - Fire Fighting Measures", "Section 5. Fire-fighting measures"], \
               6 : ["SECTION 6: Accidental release measures", "6. ACCIDENTAL RELEASE MEASURES", "6 Accidental Release Measures", "6 Accidental release measures", "6.  ACCIDENTAL RELEASE MEASURES", "6. Accidental release measures", "6. Accidental Release Measures", "SECTION 6: Accidental release measures","SECTION 6: Accidental Release Measures", "SECTION 6: ACCIDENTAL RELEASE MEASURES", "Section 6 - Accidental Release Measures", "Section 6. Accidental release measures"], \
               7 : ["SECTION 7: Handling and storage", "7. HANDLING AND STORAGE", "7 Handling and Storage", "7 Handling and storage", "7.  HANDLING AND STORAGE", "7. Handling and storage", "7. Handling and Storage", "SECTION 7: Handling and storage", "SECTION 7: Handling and Storage", "SECTION 7: HANDLING AND STORAGE", "Section 7 - Handling and Storage", "Section 7. Handling and storage"], \
               8 : ["SECTION 8: Exposure controls/personal protection", "8. EXPOSURE CONTROLS/PERSONAL PROTECTION", "8 Exposure controls/personal protection", "8. EXPOSURE CONTROLS / PERSONAL PROTECTION", "8.  EXPOSURE CONTROLS / PERSONAL PROTECTION", "8. Exposure Controls/personal protection", "8. Exposure controls / personal protection", "8 Exposure Controls/Personal Protection", "8. Exposure Controls/Personal Protection", "SECTION 8: Exposure controls/personal protection"," SECTION 8: Exposure Controls/Personal Protection", "SECTION 8: EXPOSURE CONTROLS/PERSONAL PROTECTION ", " SECTION 8: EXPOSURE CONTROLS AND PERSONAL PROECTION ", "Section 8 - Exposure Controls / Personal Protection", "Section 8. Exposure controls/personal protection", "SECTION 8: 8. Exposure Controls/personal protection"], \
               9 : ["SECTION 9: Physical and chemical properties", "9. PHYSICAL AND CHEMICAL PROPERTIES", "9 Physical and Chemical Properties", "9 Physical and chemical properties", "9.  PHYSICAL AND CHEMICAL PROPERTIES", "9. Physical and Chemical Properties", "9. Physical and chemical properties", "9. Physical and Chemical Properties", "SECTION 9: Physical and chemical properties", "SECTION 9: Physical and Chemical Properties", "SECTION 9: PHYSICAL AND CHEMICAL PROPERTIES", "Section 9 - Physical & Chemical Properties", "Section 9. Physical and chemical properties"], \
               10 : ["SECTION 10: Stability and reactivity", "10. STABILITY AND REACTIVITY", "10 Stability and Reactivity", "10 Stability and reactivity", "10Stability and Reactivity", "10.  STABILITY AND REACTIVITY", "10. Stability and Reactivity", "10. Stability and reactivity", "SECTION 10: Stability and reactivity", "SECTION 10: Stability and Reactivity", "SECTION 10: STABILITY AND REACTIVITY", "Section 10 - Chemical Stability & Reactivity Information", "Section 10. Stability and reactivity"], \
               11 : ["SECTION 11: Toxicological information", "11. TOXICOLOGICAL INFORMATION", "11 Toxicological Information", "11 Toxicological information", "11Toxicological Information", "11.  TOXICOLOGICAL INFORMATION", "11. Toxicological information", "11. Toxicological Information", "SECTION 11: Toxicological information", "SECTION 11: Toxicological Information", "SECTION 11: TOXICOLOGICAL INFORMATION", "Section 11 - Toxicological Information", "Section 11. Toxicological information"], \
               12 : ["SECTION 12: Ecological information", "12. ECOLOGICAL INFORMATION", "12 Ecological Information", "12 Ecological information", "12Ecological Information", "12.  ECOLOGICAL INFORMATION", "12. Ecological Information", "12. Ecological information", "SECTION 12: Ecological information", "SECTION 12: Ecological Information", "SECTION 12: ECOLOGICAL INFORMATION", "Section 12 - Ecological Information", "Section 12. Ecological information"], \
               13 : ["SECTION 13: Disposal considerations", "13. DISPOSAL CONSIDERATIONS", "13 Disposal Considerations", "13 Disposal consideration", "13Disposal Considerations", "13.  DISPOSAL CONSIDERATIONS", "13. Disposal considerations", "13. Disposal Considerations", "SECTION 13: Disposal considerations", "SECTION 13: Disposal Considerations", "SECTION 13: DISPOSAL CONSIDERATIONS", "Section 13 - Disposal Considerations", "Section 13. Disposal considerations"], \
               14 : ["SECTION 14: Transport information", "14. TRANSPORT INFORMATION", "14 Transport Information", "14 Transport information", "14Transport Information", "14.  TRANSPORT INFORMATION", "14. Transport information", "14. Transport Information", "SECTION 14: Transport information", "SECTION 14: Transport Information", "SECTION 14: TRANSPORT INFORMATION", "Section 14 - Transportation Information", "Section 14. Transport information"], \
               15 : ["SECTION 15: Regulatory information", "15. REGULATORY INFORMATION", "15 Regulatory Information", "15 Regulatory information", "15Regulatory Information", "15.REGULATORY INFORMATION", "15. Regulatory information", "15. Regulatory Information", "SECTION 15: Regulatory information", "SECTION 15: Regulatory information", "SECTION 15: Regulatory Information", "SECTION 15: REGULATORY INFORMATION", "Section 15 - Regulatory Information", "Section 15. Regulatory information"], \
               16 : ["SECTION 16: Other information", "16. OTHER INFORMATION", "16 Other Information", "16 Other information", "16Other Information", "16.  OTHER INFORMATION", "16. Other information", "16. Other Information", "SECTION 16: Other information", "SECTION 16: Other Information","SECTION 16: OTHER INFORMATION", " SECTION 16: Other information, including date of preparation or last revision", "Section 16 - Other Information", "Section 16. Other information"], \
             }

# Check 8 : "SECTION 8: 8. Exposure Controls/personal protection" (last item in list)
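
The comment at the top of this cell asks for a single regular expression per section that covers all listed variants. One possible shape for such a pattern is sketched below. The helper name `build_header_regex` and the exact pattern are assumptions made for illustration: an optional "SECTION" prefix, the section number, optional punctuation, then the title words separated by flexible whitespace or hyphens, matched case-insensitively. It has only been checked against the variants shown in the example, not against the 20 test files.

```python
import re

def build_header_regex(number, title):
    # Hypothetical helper: builds one case-insensitive pattern for a section header.
    #   (?:SECTION\s*)?   optional "SECTION" / "Section" prefix
    #   number            the section number
    #   \s*[.:\-]?\s*     optional ".", ":" or "-" with optional whitespace around it
    #   title words       joined by any run of whitespace and/or hyphens
    words = [re.escape(word) for word in title.split()]
    title_pattern = r"[\s\-]*".join(words)
    pattern = r"(?:SECTION\s*)?" + str(number) + r"\s*[.:\-]?\s*" + title_pattern
    return re.compile(pattern, re.IGNORECASE)

first_aid = build_header_regex(4, "First aid measures")

# A few of the section 4 variants from HeaderDict
variants = ["SECTION 4: First aid measures",
            "4. FIRST AID MEASURES",
            "4 First-Aid Measures",
            "4.  FIRST AID MEASURES",
            "Section 4 - First Aid Measures",
            "Section 4. First aid measures"]

matched = [bool(first_aid.search(v)) for v in variants]
print(matched)
```

Variants whose wording differs from the standard title (for example "PRODUCT AND COMPANY IDENTIFICATION" for section 1) would still need their own alternative in the pattern.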

In [7]:
# Myriam
# HeaderDict = { 0 : "", \
#               1 : ["1. PRODUCT AND COMPANY IDENTIFICATION", "SECTION 1: Identification", "SECTION 1: PRODUCT AND COMPANY IDENTIFICATION"," SECTION 1: Identification of the substance/mixture and of the company/undertaking ", "SECTION 1: Identification of the Substance/Preparation and the Company/Undertaking"], \
#               2 : ["2. HAZARDS IDENTIFICATION", "SECTION 2: Hazard(s) identification"," SECTION 2 : Hazard(s) Identification"," SECTION 2: HAZARDS IDENTIFICATION","SECTION 2: Hazards Identification","SECTION 2: Hazards identification"],\
#               3 : ["3. COMPOSITION/INFORMATION ON INGREDIENTS", "SECTION 3: Composition/Information on ingredients", "SECTION 3: Composition/Information on Ingredients", "SECTION 3: COMPOSITION/INFORMATION ON INGREDIENTS","SECTION 3:Composition/information on ingredients","SECTION 3: COMPOSITION"], \
#               4 : ["4. FIRST AID MEASURES", "SECTION 4: First-aid measures","SECTION 4: First-Aid Measures","SECTION 4: FIRST AID MEASURES", " SECTION 4: First aid measures"], \
#               5 : ["5. FIRE-FIGHTING MEASURES", "SECTION 5: Fire-fighting measures","SECTION 5: Fire-Fighting Measures", "SECTION 5: FIREFIGHTING MEASURES ","SECTION 5:FIRE-FIGHTING MEASURES","SECTION 5: FIRE FIGHTING MEASURES"," SECTION 5: Firefighting measures"], \
#               6 : ["6. ACCIDENTAL RELEASE MEASURES", "SECTION 6: Accidental release measures", "SECTION 6: Accidental Release Measures", "SECTION 6: ACCIDENTAL RELEASE MEASURES "], \ 
#               7 : ["7. HANDLING AND STORAGE", "SECTION 7: Handling and storage", " SECTION 7: Handling and Storage","SECTION 7: HANDLING AND STORAGE"], \
#               8 : ["8. EXPOSURE CONTROLS/PERSONAL PROTECTION", "SECTION 8: Exposure controls/personal protection" , " SECTION 8: Exposure Controls/Personal Protection","SECTION 8: EXPOSURE CONTROLS/PERSONAL PROTECTION "," SECTION 8: EXPOSURE CONTROLS AND PERSONAL PROECTION ", "SECTION 8: 8. Exposure Controls/personal protection"], \
#               9 : ["9. PHYSICAL AND CHEMICAL PROPERTIES", "SECTION 9: Physical and chemical properties", " SECTION 9: Physical and Chemical Properties", "SECTION 9: PHYSICAL AND CHEMICAL PROPERTIES"], \
#               10 : ["10. STABILITY AND REACTIVITY", "SECTION 10: Stability and reactivity", " SECTION 10: Stability and Reactivity","SECTION 10: STABILITY AND REACTIVITY"], \
#               11 : ["11. TOXICOLOGICAL INFORMATION", "SECTION 11: Toxicological information", " SECTION 11: Toxicological Information", "SECTION 11: TOXICOLOGICAL INFORMATION"], \
#               12 : ["12. ECOLOGICAL INFORMATION", "SECTION 12: Ecological information", " SECTION 12: Ecological Information", "SECTION 12: ECOLOGICAL INFORMATION"], \
#               13 : ["13. DISPOSAL CONSIDERATIONS", "SECTION 13: Disposal considerations", " SECTION 13: Disposal Considerations","SECTION 13: DISPOSAL CONSIDERATIONS"], \
#               14 : ["14. TRANSPORT INFORMATION", "SECTION 14: Transport information", " SECTION 14: Transport Information","SECTION 14: TRANSPORT INFORMATION"], \
#               15 : ["15. REGULATORY INFORMATION", "SECTION 15: Regulatory information", "SECTION 15: Regulatory Information", "SECTION 15: REGULATORY INFORMATION"], \
#               16 : ["16. OTHER INFORMATION", "SECTION 16: Other information", "SECTION 16: Other Information","SECTION 16: OTHER INFORMATION", " SECTION 16: Other information, including date of preparation or last revision"], \
#             }

In [8]:
print(len(HeaderDict[1]))

9


In [9]:
def PDF2Sections(PDFfile):
    # Split the extracted text of a PDF file (a string) into a dict of sections, keyed 0 to 16.
    # Returns None if any of the 16 top-level headers cannot be located.

    # Locate all headers: for each section, try the known variants in order
    # and keep the position of the first occurrence of the first variant that matches.
    HeaderIndexes = {}
    AllHeaders = {}

    for Header in range(1,17):
        for i in range(len(HeaderDict[Header])):
            AllHeaders = FindOccurences(PDFfile, HeaderDict[Header][i])
            if AllHeaders:
                HeaderIndexes[Header] = AllHeaders[1]
                break
        if AllHeaders:
            VerifyOccurence(AllHeaders)   # the returned status string is currently not used
        else:
            print("\nHeader ", HeaderDict[Header], " not found")
            return(None)

    # Determine begin and end position of each section (the text between two consecutive headers).
    # Note: slice ends are exclusive, so the "- 1" and "+ 1" offsets below also drop
    # one character just before and just after each header.
    SectionIndexes = {}

    SectionIndexes[0] = (0, HeaderIndexes[1][0] - 1)   # Section 0 is to contain PDF file intro

    for Section in range(1,16):
        SectionIndexes[Section] = (HeaderIndexes[Section][1] + 1, HeaderIndexes[Section + 1][0] - 1)

    SectionIndexes[16] = (HeaderIndexes[16][1] + 1, len(PDFfile) )

    # Extract the text of each section:
    SectionText = {}

    for Section in range(0,17):
        SectionText[Section] = ExtractText(PDFfile, SectionIndexes[Section][0], SectionIndexes[Section][1])

    return(SectionText)
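
The splitting step can also be tried on a small string without any PDF. The sketch below is an illustration under assumptions, not the function used in this notebook: `split_on_headers`, the two header patterns and the sample text are all made up. It uses the match positions directly as slice bounds, relying on Python slices having an exclusive end, so no character next to a header is lost.

```python
import re

def split_on_headers(text, header_patterns):
    # header_patterns: regex strings for the headers, in document order (hypothetical input).
    # Returns {0: intro, 1: text after header 1, ...} or None if a header is missing.
    spans = []
    for pattern in header_patterns:
        m = re.search(pattern, text)
        if m is None:
            return None
        spans.append((m.start(), m.end()))

    sections = {0: text[:spans[0][0]]}          # everything before the first header
    for i, (_, end) in enumerate(spans, start=1):
        # A section runs from the end of its header to the start of the next header
        next_start = spans[i][0] if i < len(spans) else len(text)
        sections[i] = text[end:next_start]
    return sections

# Made-up miniature "document" with two headers
sample = "Intro 1. Identification Acme Corp 2. Hazards identification None known"
sections = split_on_headers(sample, [r"1\. Identification",
                                     r"2\. Hazards identification"])
print(sections)
```

Like PDF2Sections, the sketch only looks at the first match of each header and assumes the headers appear in order.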

In [10]:
def PrintSections(SectionDict):
    # Print out the PDF converted to Dictionary

    for Section in range(0,17):
        print("-------------\nSection ", Section, ": ", HeaderDict[Section][0], "\n-------------\n", SectionDict[Section], "\n")
        
    return(None)

In [11]:
InputFile = "data/10N_Sodium_Hydroxide_NaOH_40_6_US_EN_sds (1).pdf"

In [12]:
PDFtext = extract_text_from_pdf(InputFile)
PDFtext



In [13]:
PDFSectionsText = PDF2Sections(PDFtext)
PrintSections(PDFSectionsText)

-------------
Section  0 :   
-------------
  Page 1/11Safety Data Sheet (SDS)OSHA HazCom Standard 29 CFR 1910.1200(g) and GHS Rev 03.Issue date 02/09/2017Reviewed on 02/09/201744.2.1* 

-------------
Section  1 :  SECTION 1: Identification of the substance 
-------------
  Product Identifier· Trade name:10N Sodium Hydroxide (NaOH 40%)· Product Number:NGT-10N NaOH· Relevant identified uses of the substance or mixture and uses advised against:No further relevant information available.· Product DescriptionPC21   Laboratory chemicals· Application of the substance / the mixture:Laboratory chemicals· Details of the Supplier of the Safety Data Sheet:· Manufacturer/Supplier:NuGeneration Technologies, LLC (dba NuGenTec)1155 Park Avenue, Emeryville, CA 94608salesteam@nugentec.comwww.nugentec.com1-888-996-8436 or 1-707-820-4080 for product information· Emergency telephone number:PERS Emergency Response: Domestic and Canada - 1-800-633-8253, International 1-801-629-0667* 

-------------
Section  

In [14]:
InputFile = 'data/765-12-8_Sigma-Aldrich.pdf'

In [15]:
PDFtext = extract_text_from_pdf(InputFile)
print(PDFtext)
PDFSectionsText = PDF2Sections(PDFtext)
PrintSections(PDFSectionsText)

 Aldrich - 329800  Page 1  of  6  SIGMA-ALDRICH sigma-aldrich.com SAFETY DATA SHEET Version 4.2 Revision Date 07/01/2014 Print Date 11/10/2018  1. PRODUCT AND COMPANY IDENTIFICATION 1.1 Product identifiers Product name : Tri(ethylene glycol) divinyl ether  Product Number : 329800 Brand : Aldrich    CAS-No. : 765-12-8 1.2 Relevant identified uses of the substance or mixture and uses advised against Identified uses : Laboratory chemicals, Manufacture of substances 1.3 Details of the supplier of the safety data sheet Company : Sigma-Aldrich 3050 Spruce Street SAINT LOUIS MO  63103 USA  Telephone : +1 800-325-5832 Fax : +1 800-325-5052 1.4 Emergency telephone number Emergency Phone # : +1-703-527-3887 (CHEMTREC)  2. HAZARDS IDENTIFICATION 2.1 Classification of the substance or mixture  Not a hazardous substance or mixture. 2.2 GHS Label elements, including precautionary statements Not a hazardous substance or mixture. 2.3 Hazards not otherwise classified (HNOC) or not covered by GHS - none

In [16]:
InputFile = 'data/23114.pdf'

In [17]:
PDFtext = extract_text_from_pdf(InputFile)
PDFtext

'   Material Safety Data Sheet   LUPEROX® P      Product code: 051000   Version 2.0 Issued on: 07/11/2011 Page: 1 / 10     1. PRODUCT AND COMPANY IDENTIFICATION  Company  Arkema Inc. 900 First Avenue King of Prussia, Pennsylvania 19406  Functional Additives  Customer Service Telephone Number: (800) 331-7654 (Monday through Friday, 8:30 AM to 5:30 PM EST)  Emergency Information  Transportation: CHEMTREC: (800) 424-9300 (24 hrs., 7 days a week) Medical: Rocky Mountain Poison Center: (866) 767-5089 (24 hrs., 7 days a week)  Product Information  Product name: LUPEROX® P Synonyms: Peroxyester, t-butyl perbenzoate, tert-butyl peroxybenzoate Molecular formula: C11 H14 O3 Chemical family: Organic peroxide - peroxyesters Product use: Initiator  2. HAZARDS IDENTIFICATION  Emergency Overview  Color: Colourless to yellow. Physical state: liquid  Odor: unpleasant  CAUTION! ORGANIC PEROXIDE. HAZARDOUS DECOMPOSITION MAY OCCUR.  MAY CAUSE ALLERGIC SKIN REACTION.  MAY CAUSE SKIN IRRITATION.   Potential

In [18]:
PDFSectionsText = PDF2Sections(PDFtext)
PrintSections(PDFSectionsText)

-------------
Section  0 :   
-------------
    Material Safety Data Sheet   LUPEROX® P      Product code: 051000   Version 2.0 Issued on: 07/11/2011 Page: 1 / 10     

-------------
Section  1 :  SECTION 1: Identification of the substance 
-------------
  Company  Arkema Inc. 900 First Avenue King of Prussia, Pennsylvania 19406  Functional Additives  Customer Service Telephone Number: (800) 331-7654 (Monday through Friday, 8:30 AM to 5:30 PM EST)  Emergency Information  Transportation: CHEMTREC: (800) 424-9300 (24 hrs., 7 days a week) Medical: Rocky Mountain Poison Center: (866) 767-5089 (24 hrs., 7 days a week)  Product Information  Product name: LUPEROX® P Synonyms: Peroxyester, t-butyl perbenzoate, tert-butyl peroxybenzoate Molecular formula: C11 H14 O3 Chemical family: Organic peroxide - peroxyesters Product use: Initiator  

-------------
Section  2 :  SECTION 2: Hazards identification 
-------------
  Emergency Overview  Color: Colourless to yellow. Physical state: liquid  Odor:

In [19]:
InputFile = 'data/580076.pdf'

In [20]:
PDFtext = extract_text_from_pdf(InputFile)
print(PDFtext)
PDFSectionsText = PDF2Sections(PDFtext)
PrintSections(PDFSectionsText)

-------------
Section  0 :   
-------------
   XIAMETER(R) Material Safety Data Sheet    Page:  1 of 8 Version:  1.0 Revision Date:  2009/01/20 XIAMETER(R) OFS-6030 SILANE    

-------------
Section  1 :  SECTION 1: Identification of the substance 
-------------
  Dow Corning Corporation South Saginaw Road Midland, Michigan  48686   24 Hour Emergency Telephone:  Customer Service:  Product Disposal Information:  CHEMTREC:   (989) 496-5900 (989) 496-6000 (989) 496-6315 (800) 424-9300  MSDS No.:  04088687  Revision Date:  2009/01/20 Generic Description:  Alkoxysilane Physical Form:  Liquid Color:  Colorless to pale yellow Odor:  Aromatic odor  NFPA Profile:  Health 2 Flammability 1 Instability/Reactivity 0 Note:  NFPA = National Fire Protection Association  

-------------
Section  2 :  SECTION 2: Hazards identification 
-------------
  POTENTIAL HEALTH EFFECTS  Acute Effects  Eye:  Direct contact may cause mild irritation.    Skin:  May cause moderate irritation.    Inhalation:  Vapor ir

In [21]:
InputFile = 'data/586346.pdf'

In [22]:
PDFtext = extract_text_from_pdf(InputFile)
print(PDFtext)
PDFSectionsText = PDF2Sections(PDFtext)
PrintSections(PDFSectionsText)

-------------
Section  0 :   
-------------
 Material Safety Data Sheet    BYK-349 Version 5  Revision Date 02/04/2011 Print Date 02/04/2011    1 / 9 SECTION 

-------------
Section  1 :  SECTION 1: Identification of the substance 
-------------
 Product name : BYK-349 Product Use Description : Substrate Wetting Additive  Company : BYK USA Inc. 524 South Cherry Street Wallingford CT 06492 Prepared by : J.Nole, Safety; M.McCutcheon, Regulatory Telephone : (203) 265-2086 Visit our web site : www.byk.com E-mail address : ehs.byk.usa@altana.com Emergency telephone number : CHEMTREC 800-424-9300 SECTION 

-------------
Section  2 :  SECTION 2: Hazards identification 
-------------
 Emergency Overview Form : liquid Colour : light brown Odour : not significant   OSHA Regulatory Status  This material is considered hazardous by the OSHA Hazard Communication Standard (29 CFR1910.1200)   Potential Health Effects Eyes : Contact will probably cause irritation.  Skin : Contact will probably cause ir

In [23]:
InputFile = 'data/1799516.pdf'

In [24]:
PDFtext = extract_text_from_pdf(InputFile)
print(PDFtext)
PDFSectionsText = PDF2Sections(PDFtext)
PrintSections(PDFSectionsText)

-------------
Section  0 :   
-------------
 Material Safety Data Sheet Revision Number: 004.0Issue date: 02/01/2012IDH number: 701990Product name: DB EPOXY Pot CMPD E-60NC 50MLPage 1 of 5 

-------------
Section  1 :  SECTION 1: Identification of the substance 
-------------
 Product name: DB EPOXY Pot CMPD E-60NC 50ML IDH number:701990Product type:Epoxy Hardener  Item number:29324_209537Region: United StatesCompany address: Contact information: Henkel Corporation One Henkel Way Rocky Hill, Connecticut 06067Telephone: 860.571.5100 MEDICAL EMERGENCY Phone: Poison Control Center 1-877-671-4608 (toll free) or 1-303-592-1711 TRANSPORT EMERGENCY Phone: CHEMTREC 1-800-424-9300 (toll free) or 1-703-527-3887 Internet: www.henkelna.co 

-------------
Section  2 :  SECTION 2: Hazards identification 
-------------
  EMERGENCY OVERVIEW  HMIS:Physical state: Liquid HEALTH:  *3Color: Pale yellowFLAMMABILITY:  1Odor: MildPHYSICAL HAZARD: 0Personal Protection:  See MSDS Section 8DANGER:CAUSES EYE, SK

Note: some information appears more than once. Section 16 contains information from other sections.

In [25]:
InputFile = 'data/Carbowax Sentry PEG 400 NF - Dow - 2015-03-26.pdf'

In [26]:
PDFtext = extract_text_from_pdf(InputFile)
PDFtext



In [27]:
PDFSectionsText = PDF2Sections(PDFtext)
PrintSections(PDFSectionsText)

-------------
Section  0 :   
-------------
 ® ™ Trademark of The Dow Chemical Company ("Dow") or an affiliated company of Dow Page 1 of 10   SAFETY DATA SHEET THE DOW CHEMICAL COMPANY  Product name: CARBOWAX™ SENTRY™ POLYETHYLENE GLYCOL 400 NF, FCC GRADE; MACROGOL 400 Ph. Eur. Issue Date: 03/26/2015 Print Date: 04/01/2015  THE DOW CHEMICAL COMPANY encourages and expects you to read and understand the entire (M)SDS, as there is important information throughout the document.  We expect you to follow the precautions identified in this document unless your use conditions would necessitate other appropriate methods or actions.  

-------------
Section  1 :  SECTION 1: Identification of the substance 
-------------
 Product name: CARBOWAX™ SENTRY™ POLYETHYLENE GLYCOL 400 NF, FCC GRADE; MACROGOL 400 Ph. Eur.  Recommended use of the chemical and restrictions on use Identified uses: For the CARBOWAX™ Product Line, a partial list of applications include pharmaceutical products, personal care pr

In [28]:
InputFile = 'data/Dubei Steriling Alcohol Wipes SDS.pdf'

In [29]:
PDFtext = extract_text_from_pdf(InputFile)
print(PDFtext)
PDFSectionsText = PDF2Sections(PDFtext)
PrintSections(PDFSectionsText)

-------------
Section  0 :   
-------------
 Safety Data Sheet OSHA 29 CFR 1910.1200                              Trade Name: Disinfection wet wipes                  Page 1 of 1 US OSHA SDS                                                                                   Page 1 of 7        Version No.: 1.0                                                                     Issuing Date: Mar. 27, 2020 Trade Name: Disinfection wet wipes                                                      Print Date: Mar. 27, 2020  

-------------
Section  1 :  SECTION 1: Identification of the substance 
-------------
                                                                             · Product identifier · Trade name: Disinfection wet wipes · Recommended use of the chemical and restrictions on use · Application of the substance/ mixture: Clean hands and disinfection   · Details of the supplier of the safety data sheet · Manufacturer/Supplier: Jinhua dubei maternal and Infant Products Co., Ltd 

In [30]:
InputFile = 'data/gylcol-ether-dpm-sds.pdf'

In [31]:
PDFtext = extract_text_from_pdf(InputFile)
PDFtext

PSTypeError: Literal required: /b'CIDSystemInfo'
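
The extraction of this file stops with an exception. When many files are processed in a row, one option is to catch the error and skip the file. The sketch below is a generic illustration: `safe_extract` and `failing_extractor` are hypothetical names, and catching every `Exception` is an assumption made for brevity. With pdfminer available, `extract_text_from_pdf` could be passed as the `extractor` argument.

```python
def safe_extract(extractor, pdf_path):
    # extractor: any callable that takes a path and returns text,
    # for example extract_text_from_pdf from the first cell.
    try:
        return extractor(pdf_path)
    except Exception as exc:   # assumption: any parser error means "skip this file"
        print("Could not extract", pdf_path, ":", type(exc).__name__, exc)
        return None

def failing_extractor(path):
    # Stand-in for a parser failure (hypothetical, no PDF is read here)
    raise TypeError("Literal required")

result = safe_extract(failing_extractor, "data/example.pdf")
print(result)
```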

In [32]:
InputFile = 'data/Hydrogen_Peroxide_35_7_US_EN_sds.pdf'

In [33]:
PDFtext = extract_text_from_pdf(InputFile)
print(PDFtext)
PDFSectionsText = PDF2Sections(PDFtext)
PrintSections(PDFSectionsText)

-------------
Section  0 :   
-------------
  Page 1/11Safety Data Sheet (SDS)OSHA HazCom Standard 29 CFR 1910.1200(g) and GHS Rev 03.Issue date 02/13/2017Reviewed on 02/13/201744.2.2* 

-------------
Section  1 :  SECTION 1: Identification of the substance 
-------------
  Product Identifier· Trade name:Hydrogen Peroxide 35%· Relevant identified uses of the substance or mixture and uses advised against:· Product DescriptionPC21   Laboratory chemicals· Details of the Supplier of the Safety Data Sheet:· Manufacturer/Supplier:NuGeneration Technologies, LLC (dba NuGenTec)1155 Park Avenue, Emeryville, CA 94608salesteam@nugentec.comwww.nugentec.com1-888-996-8436 or 1-707-820-4080 for product information· Emergency telephone number:PERS Emergency Response: Domestic and Canada - 1-800-633-8253, International 1-801-629-0667* 

-------------
Section  2 :  SECTION 2: Hazards identification 
-------------
  Classification of the substance or mixture:d~GHS03 Flame over circleOx. Liq. 1H271May cau

In [34]:
InputFile = 'data/LC15750.pdf'

In [35]:
PDFtext = extract_text_from_pdf(InputFile)
PDFtext

' Isopropyl Alcohol (2-Propanol) Safety Data Sheet    according to Federal Register / Vol. 77, No. 58 / Monday, March 26, 2012 / Rules and Regulations Date of issue: 11/14/2013 Revision date: 01/23/2020 Supersedes: 01/26/2018 Version: 1.3  01/23/2020 EN (English US)  Page 1  SECTION 1: Identification  1.1. Identification Product form : Substance Substance name : Isopropyl Alcohol (2-Propanol) CAS-No. : 67-63-0 Product code : LC15750 Formula : C3H8O Synonyms : 1-methylethanol / 1-methylethyl alcohol / 2-hydroxypropane / dimethyl carbinol / ethyl carbinol / hydroxypropane / IPA / i-propanol / isoethylcarbinol / propan-2-ol / sec-propanol  1.2. Recommended use and restrictions on use Use of the substance/mixture : Disinfectant Solvent Recommended use : Laboratory chemicals Restrictions on use : Not for food, drug or household use  1.3. Supplier LabChem, Inc. Jackson\'s Pointe Commerce Park Building 1000, 1010 Jackson\'s Pointe Court Zelienople, PA 16063 - USA T 412-826-5230 - F 724-473-06

The header `SECTION 2: Hazard(s) identification` is present in this file, but `re.finditer` does not find it.

In [36]:
# Print the header's code points to rule out hidden or look-alike characters.
s = "SECTION 2: Hazard(s) identification"
print(s)
codes = [ord(c) for c in s]

print(codes)

for code in codes:
    print(chr(code))

SECTION 2: Hazard(s) identification
[83, 69, 67, 84, 73, 79, 78, 32, 50, 58, 32, 72, 97, 122, 97, 114, 100, 40, 115, 41, 32, 105, 100, 101, 110, 116, 105, 102, 105, 99, 97, 116, 105, 111, 110]
S
E
C
T
I
O
N
 
2
:
 
H
a
z
a
r
d
(
s
)
 
i
d
e
n
t
i
f
i
c
a
t
i
o
n


In [37]:
FindOccurences(PDFtext, 'SECTION 2: Hazard(s) identification')

{}

In [38]:
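# ( and ) are regex metacharacters, so they must be escaped to match literally.
# A raw string keeps the backslashes intact and avoids invalid-escape warnings.
s = r"SECTION 2: Hazard\(s\) identification"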
FindOccurences(PDFtext, s)

{1: (1133, 1168)}
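
The two results above are consistent with `(` and `)` being regex metacharacters: unescaped, `(s)` is a capture group that matches a bare `s`, so the pattern searches for `Hazards identification` instead. The escaping can be done automatically with `re.escape`. This is a minimal sketch: `find_literal_occurrences` is a hypothetical variant of `FindOccurences`, and the sample string is made up for illustration.

```python
import re

def find_literal_occurrences(text, substring):
    # Hypothetical variant of FindOccurences that treats substring as plain text.
    # re.escape backslash-escapes regex metacharacters such as ( and ).
    pattern = re.escape(substring)
    return {i: (m.start(), m.end())
            for i, m in enumerate(re.finditer(pattern, text), start=1)}

header = "SECTION 2: Hazard(s) identification"
sample = "Page 1  SECTION 2: Hazard(s) identification  2.1. Classification"
print(find_literal_occurrences(sample, header))
```

The trade-off is that headers can no longer contain deliberate regex syntax; if both behaviours are needed, a flag on the function could select between them.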

In [39]:
PDFSectionsText = PDF2Sections(PDFtext)
PrintSections(PDFSectionsText)

-------------
Section  0 :   
-------------
  Isopropyl Alcohol (2-Propanol) Safety Data Sheet    according to Federal Register / Vol. 77, No. 58 / Monday, March 26, 2012 / Rules and Regulations Date of issue: 11/14/2013 Revision date: 01/23/2020 Supersedes: 01/26/2018 Version: 1.3  01/23/2020 EN (English US)  Page 1  

-------------
Section  1 :  SECTION 1: Identification of the substance 
-------------
  1.1. Identification Product form : Substance Substance name : Isopropyl Alcohol (2-Propanol) CAS-No. : 67-63-0 Product code : LC15750 Formula : C3H8O Synonyms : 1-methylethanol / 1-methylethyl alcohol / 2-hydroxypropane / dimethyl carbinol / ethyl carbinol / hydroxypropane / IPA / i-propanol / isoethylcarbinol / propan-2-ol / sec-propanol  1.2. Recommended use and restrictions on use Use of the substance/mixture : Disinfectant Solvent Recommended use : Laboratory chemicals Restrictions on use : Not for food, drug or household use  1.3. Supplier LabChem, Inc. Jackson's Pointe Commerce

In [40]:
InputFile = 'data/NN-dimethylformamide_3_US_EN_sds.pdf'

In [41]:
PDFtext = extract_text_from_pdf(InputFile)
print(PDFtext)
PDFSectionsText = PDF2Sections(PDFtext)
PrintSections(PDFSectionsText)

-------------
Section  0 :   
-------------
  Page 1/11Safety Data Sheet (SDS)OSHA HazCom Standard 29 CFR 1910.1200(g) and GHS Rev 03.Issue date 02/13/2017Reviewed on 02/13/201744.2.2* 

-------------
Section  1 :  SECTION 1: Identification of the substance 
-------------
  Product Identifier· Trade name:N,N-dimethylformamide· CAS Number:68-12-2· EC number:200-679-5· Index number:616-001-00-X· Relevant identified uses of the substance or mixture and uses advised against:No further relevant information available.· Product DescriptionPC19   Intermediate· Details of the Supplier of the Safety Data Sheet:· Manufacturer/Supplier:NuGeneration Technologies, LLC (dba NuGenTec)1155 Park Avenue, Emeryville, CA 94608salesteam@nugentec.comwww.nugentec.com1-888-996-8436 or 1-707-820-4080 for product information· Emergency telephone number:PERS Emergency Response: Domestic and Canada - 1-800-633-8253, International 1-801-629-0667* 

-------------
Section  2 :  SECTION 2: Hazards identification 
----

Issue: in Section 16, the first letter of the text is lost.

In [42]:
InputFile = "data/Omnirad-184.pdf"

In [45]:
PDFtext = extract_text_from_pdf(InputFile)
print(PDFtext)
PDFSectionsText = PDF2Sections(PDFtext)
PrintSections(PDFSectionsText)

-------------
Section  0 :   
-------------
 Revision date: 10/11/2017Revision: 2Supersedes date: 5/23/2017SAFETY DATA SHEETOmnirad 184According to Appendix D, OSHA Hazard Communication Standard 29 CFR §1910.120 

-------------
Section  1 :  SECTION 1: Identification of the substance 
-------------
 roduct identifierProduct nameOmnirad 184Chemical nameHydroxycyclohexyl phenyl ketoneCAS number947-19-3Recommended use of the chemical and restrictions on useApplicationPhotoinitiator.Uses advised againstNo specific uses advised against are identified.Details of the supplier of the safety data sheetSupplierwww.igmresins.comIGM Resins USA Inc.3300 Westinghouse BlvdCharlotte, NC 28273, United Statesregulatory@igmresins.comT: +1 630-213-1616F: +1 704 450 6484Emergency telephone numberEmergency telephoneItaly +39 0362 512868EMEA +1 760 476 3961 Access code 333 293.US +1 760 476 3962 Access code 333 293 

-------------
Section  2 :  SECTION 2: Hazards identification 
-------------
 lassification 

In [46]:
InputFile = "data/PHTHALIC-ANHYDRIDE--ACS-5KG-pdf.pdf"

In [61]:
PDFtext = extract_text_from_pdf(InputFile)
print(PDFtext)
PDFSectionsText = PDF2Sections(PDFtext)
PrintSections(PDFSectionsText)

SAFETY DATA SHEETCreation Date  19-Nov-2008Revision Date  23-Jan-2018Revision Number  31. IdentificationProduct NamePhthalic anhydrideCat No. :AC423320000; AC423320020; AC423320050; AC423320250;AC423325000CAS-No85-44-9Synonyms1,3-IsobenzofurandioneRecommended UseLaboratory chemicals.Uses advised againstFood, drug, pesticide or biocidal product use.Details of the supplier of the safety data sheet Emergency Telephone Number For information US call: 001-800-ACROS-01 / Europe call: +32 14 57 52 11Emergency Number US:001-201-796-7100 / Europe: +32 14 57 52 99CHEMTREC Tel. No.US:001-800-424-9300 / Europe:001-703-527-38872. Hazard(s) identificationClassification This chemical is considered hazardous by the 2012 OSHA Hazard Communication Standard (29 CFR 1910.1200)Label Elements  Signal WordDangerCompany Fisher ScientificOne Reagent LaneFair Lawn, NJ 07410Tel: (201) 796-7100Acros OrganicsOne Reagent LaneFair Lawn, NJ 07410Acute oral toxicityCategory 4Skin Corrosion/IrritationCategory 2Serious 

In [62]:
InputFile = "data/quilon_h-eu_msds__1594909058.pdf"

In [66]:
PDFtext = extract_text_from_pdf(InputFile)
print(PDFtext)
PDFSectionsText = PDF2Sections(PDFtext)
PrintSections(PDFSectionsText)

 Safety Data Sheet Material Name: Quilon H - Chromium Complex  ____________________________________________________________ Page 1 of 7 Issue Date: 12/27/10    Revision: 1.0000   Print Date:   12/28/2010 * * *  Section 1 - Identification of the Substance/Preparation and the Company/Undertaking  * * * Manufacturer Information Zaclon LLC Phone: 216-271-1569 or 800-356-7327 2981 Independence Road Fax: 216-271-1792 Cleveland, OH  44115 Emergency # 800-424-9300 CHEMTREC    * * *  Section 2 - Composition / Information on Ingredients  * * *  EC # Component Percent Symbols Risks 200-661-7 Isopropyl alcohol 67-63-0 30-61 Xi R:11-36-67 200-662-2 Acetone 67-64-1 10-16 Xi R:11-36-66-67 231-791-2 Water 7732-18-5 3.7-13   239-286-9 Chromium, tetrachloro--hydroxy[-(octadecanoato-O:O')]di- 15242-96-3 2.85-4.6   239-730-1 Chromium, tetrachloro--hydroxy[-(tetradecanoato-O:O')]di- 15659-56-0 2.85-4.6   201-161-1 Chloroacetone 78-95-5 0.3    * * *  Section 3 - Hazards Identification  * * * Human and E

Issue: this file does not follow the standard section numbering. Sections 2 and 3 are interchanged.
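
Because this file swaps Sections 2 and 3, a splitter that assumes the usual title for each section number may label them incorrectly. One possible approach is to locate headers by number and read the title from the file itself. This is a minimal sketch, assuming the `Section <n> - <title>` layout visible in the output above: `find_numbered_headers` is a hypothetical helper, and the sample string is abbreviated from that output.

```python
import re

# Assumed header layout, taken from the output above: "Section <n> - <title>".
HEADER = re.compile(r"Section\s+(\d{1,2})\s+-\s+")

def find_numbered_headers(text):
    # Returns a list of (section_number, start, end) in document order,
    # where end is the offset at which the section title begins.
    return [(int(m.group(1)), m.start(), m.end()) for m in HEADER.finditer(text)]

sample = ("* * *  Section 1 - Identification  * * * Zaclon LLC "
          "* * *  Section 2 - Composition / Information on Ingredients  * * * "
          "* * *  Section 3 - Hazards Identification  * * *")
for number, start, end in find_numbered_headers(sample):
    print(number, sample[end:end + 20])
```

The section content can then be keyed by the title found in the file instead of by an assumed number-to-title mapping.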

In [67]:
InputFile = "data/Savinyl-Orange-RLS.pdf"

In [68]:
PDFtext = extract_text_from_pdf(InputFile)
print(PDFtext)
PDFSectionsText = PDF2Sections(PDFtext)
PrintSections(PDFSectionsText)

 Safety Data Sheet in accordance with Regulation (EU) No.453/2010    Savinyl Orange RLS  Page  1(8)    Substance key: KS7032  Revision Date: 25.11.2010  Version : 1 - 5 / EU  Date of printing : 21.06.2011    SECTION 1: Identification of the substance/mixture and of the company/undertaking  1.1. Product identifier  Trade name  Savinyl Orange RLS   Material number: 103327   1.2. Relevant identified uses of the substance or mixture and uses advised against  Relevant identified uses of the substance or mixture Industry sector : Paints, lacquers and varnishes industry Type of use : dye for special industries  1.3. Details of the supplier of the safety data sheet  Identification of the company Clariant Production (France)  Usine de Huningue  Avenue de Bâle  68331 Huningue  Telephone no. : +33 3 89 89 60 00   Information about the substance/mixture Division Pigments & Additives  tel.: +33.3.89.89.63.38  e-mail: France.ProductSafety@clariant.com   1.4. Emergency telephone number  +33 1 45 42 5

In [69]:
InputFile = "data/SDS - Horizon Blue Clarifier.pdf"

In [70]:
PDFtext = extract_text_from_pdf(InputFile)
print(PDFtext)
PDFSectionsText = PDF2Sections(PDFtext)
PrintSections(PDFSectionsText)

PSTypeError: Literal required: /b'begin'

In [71]:
InputFile = "data/Sledgehammer-Heat-Cure-Powder-SDS-US-20150522.pdf"

In [74]:
PDFtext = extract_text_from_pdf(InputFile)
print(PDFtext)
PDFSectionsText = PDF2Sections(PDFtext)
PrintSections(PDFSectionsText)

-------------
Section  0 :   
-------------
 Sledgehammer Heat Cure PowderNot available.Powder.GHS product identifierOther means of identificationProduct typeEmergency telephone number (with hours of operation)Section 

-------------
Section  1 :  SECTION 1: Identification of the substance 
-------------
 :::Supplier's details:Keystone Industries616 Hollywood Ave.Cherry Hill, NJ 08002(856) 663-4700Relevant identified uses of the substance or mixture and uses advised againstNot applicable.Product code:1000494-496, 1000498-500, 1000502-504, 1000543-545, 1001960-964SAFETY DATA SHEETSledgehammer Heat Cure Powder(800) 535-5053Product use:Dental ProductsPolyme 

-------------
Section  2 :  SECTION 2: Hazards identification 
-------------
 OMBUSTIBLE DUSTSCARCINOGENICITY - Category 2TOXIC TO REPRODUCTION (Unborn child) - Category 1BTOXIC TO REPRODUCTION (Fertility) - Category 2Classification of the substance or mixture:Signal word:DangerHazard statements:May form combustible dust concentratio

Issue: this file repeats the section header on a new page, so the same header occurs multiple times.
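
One way to handle a repeated header is to keep the first occurrence as the section boundary and delete the later copies before splitting. This is a minimal sketch: `drop_repeated_header` is a hypothetical helper, and the sample text is invented for illustration, not taken from this file.

```python
import re

def drop_repeated_header(text, header):
    # Keep the first literal occurrence of header; delete any later repeats.
    matches = list(re.finditer(re.escape(header), text))
    for m in reversed(matches[1:]):  # go backwards so earlier offsets stay valid
        text = text[:m.start()] + text[m.end():]
    return text

# Made-up sample: the Section 2 header reappears after a page break.
sample = "Section 2. Hazards A B Section 2. Hazards C D Section 3. Composition"
print(drop_repeated_header(sample, "Section 2. Hazards "))
```

After this step a check like `VerifyOccurence` should see a single occurrence of the header, provided the repeats are exact copies of the first one.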

In [75]:
InputFile = "data/Trifluoroacetic_Acid_TFA_7_US_EN_sds.pdf"

In [76]:
PDFtext = extract_text_from_pdf(InputFile)
print(PDFtext)
PDFSectionsText = PDF2Sections(PDFtext)
PrintSections(PDFSectionsText)

-------------
Section  0 :   
-------------
  Page 1/12Safety Data Sheet (SDS)OSHA HazCom Standard 29 CFR 1910.1200(g) and GHS Rev 03.Issue date 02/13/2017Reviewed on 02/13/201744.2.2* 

-------------
Section  1 :  SECTION 1: Identification of the substance 
-------------
  Product Identifier· Trade name:Trifluoroacetic Acid (TFA)· Product Number:ngt-TFA· CAS Number:76-05-1· EC number:200-929-3· Index number:607-091-00-1· Relevant identified uses of the substance or mixture and uses advised against:No further relevant information available.· Product DescriptionPC19   Intermediate· Details of the Supplier of the Safety Data Sheet:TFA - Pharmacutical Grade Manufactured by Rhodia· Manufacturer/Supplier:NuGeneration Technologies, LLC (dba NuGenTec)1155 Park Avenue, Emeryville, CA 94608salesteam@nugentec.comwww.nugentec.com1-888-996-8436 or 1-707-820-4080 for product information· Emergency telephone number:PERS Emergency Response: Domestic and Canada - 1-800-633-8253, International 1-801-62

In [77]:
InputFile = "data/巴斯夫紫外线吸收剂-tinuvin-1130-msds宝旭化工.pdf"

In [82]:
PDFtext = extract_text_from_pdf(InputFile)
print(PDFtext)
PDFSectionsText = PDF2Sections(PDFtext)
PrintSections(PDFSectionsText)

-------------
Section  0 :   
-------------
         Safety data sheet   Page: 1/14 BASF Safety data sheet according to UN GHS 4th rev. Date / Revised: 21.11.2017 Version: 3.0 Product: Tinuvin® 1130 (ID no. 30080322/SDS_GEN_00/EN) Date of print 22.11.2017       

-------------
Section  1 :  SECTION 1: Identification of the substance 
-------------
 Product identifier  Tinuvin® 1130  Chemical name: reaction mass of alpha-3-(3-(2H-benzotriazol-2- yl)-5-tert-butyl-4-hydroxyphenyl)propionyl- omega-hydroxypoly(oxyethylene) and alpha-3-(3- (2H-benzotriazol-2-yl)-5-tert-butyl-4-hydroxyphenyl)propionyl-omega-3-(3-(2H-benzotriazol-2-yl)-5-tert-butyl-4-hydroxyphenyl)propionyloxypoly(oxyethylene) INDEX-Number: 607-176-00-3  Relevant identified uses of the substance or mixture and uses advised against  Relevant identified uses: stabilizer  Details of the supplier of the safety data sheet  Company: BASF SE 67056 Ludwigshafen GERMANY Regional Business Unit Dispersions and Resins Europe   Telephone: 