In [23]:
# Import Dependencies
import PyPDF2
from pdf2image import convert_from_path
import pytesseract
import nltk


# Dependency installation
# !pip install nltk PyPDF2 pdf2image pytesseract

[nltk_data] Downloading package punkt to /Users/chris/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [None]:
# PDF file path
pdf_file_path = "./SampleReports/2023_Coles_Report.pdf"

## Method 1: PyPDF2 Parser

Parsing extracts data from structured or semi-structured PDFs by analyzing their internal structure and metadata. Parsing software reads the PDF file and identifies the elements and attributes that define the data, such as tags, fields, coordinates, or styles. It relies on the PDF containing an embedded text layer; purely scanned pages yield little or no text this way.

In Method 1, we use PyPDF2 to read the report. More details can be found in the [PyPDF2 documentation](https://pypdf2.readthedocs.io/en/3.x/).

In [25]:
# PyPDF2 parser
def read_pdf(file_path):
    """Return the text of every page in the PDF as one string."""
    text = ""
    with open(file_path, 'rb') as file:
        reader = PyPDF2.PdfReader(file)

        # Pages are appended without a separator, so the last line of one
        # page can run straight into the first line of the next.
        for page in reader.pages:
            text += page.extract_text()

    return text

In [26]:
# Call Parser
text = read_pdf(pdf_file_path)
print(text)

2023 Sustainability ReportWorking 
towards  
a more
sustainable 
future
Coles Group Limited  
ABN 11 004 089 936Acknowledgement of Country
Coles wishes to acknowledge the Traditional Custodians of 
Country throughout Australia.
We recognise their strength and resilience and pay our respects 
to their Elders past and present.
Coles extends that respect to all Aboriginal and Torres Strait 
Islander people, and recognises their rich cultures and their 
continuing connection to land and waters.
Aboriginal and Torres Strait Islander people are advised  
that this report may contain names and images of people  
who are deceased.
All references to Indigenous and First Nations people in this 
report are intended to include Aboriginal and/or Torres Strait 
Islander people.
Feedback
We welcome feedback on this report. For more information or to 
provide comments, please contact us at:
sustainability@coles.com.au
Anyone seeking to use information in this Sustainability Report to 
draw conclusions
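
In the output above, tokens such as `936Acknowledgement` are fused together. One possible cause is that `read_pdf` appends each page's text with no separator. The sketch below joins pages with a newline instead. It is a hedged variant, not the notebook's method: `join_page_texts` and `StubPage` are hypothetical names, and `StubPage` only stands in for a PyPDF2 page object (anything with an `extract_text()` method) so that the example runs without a PDF.

```python
def join_page_texts(pages, separator="\n"):
    """Join the text of each page, skipping pages that yield no text."""
    texts = []
    for page in pages:
        # Guard against a page whose extraction yields None or only whitespace.
        page_text = (page.extract_text() or "").strip()
        if page_text:
            texts.append(page_text)
    return separator.join(texts)


class StubPage:
    """Hypothetical stand-in for a PDF page; real code would pass reader.pages."""

    def __init__(self, text):
        self._text = text

    def extract_text(self):
        return self._text


pages = [
    StubPage("ABN 11 004 089 936"),
    StubPage(""),
    StubPage("Acknowledgement of Country"),
]
print(join_page_texts(pages))
```

With PyPDF2 this would be called as `join_page_texts(reader.pages)` inside the `with open(...)` block of `read_pdf`.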

In [27]:
# nltk.download('punkt')  # Download the required dataset for sentence tokenization

def get_sentences(text):
    sentences = nltk.sent_tokenize(text)
    return sentences

# Get sentences
sentences = get_sentences(text)
print(sentences)

['2023 Sustainability ReportWorking \ntowards  \na more\nsustainable \nfuture\nColes Group Limited  \nABN 11 004 089 936Acknowledgement of Country\nColes wishes to acknowledge the Traditional Custodians of \nCountry throughout Australia.', 'We recognise their strength and resilience and pay our respects \nto their Elders past and present.', 'Coles extends that respect to all Aboriginal and Torres Strait \nIslander people, and recognises their rich cultures and their \ncontinuing connection to land and waters.', 'Aboriginal and Torres Strait Islander people are advised  \nthat this report may contain names and images of people  \nwho are deceased.', 'All references to Indigenous and First Nations people in this \nreport are intended to include Aboriginal and/or Torres Strait \nIslander people.', 'Feedback\nWe welcome feedback on this report.', 'For more information or to \nprovide comments, please contact us at:\nsustainability@coles.com.au\nAnyone seeking to use information in this Sus

In [29]:
# Replace '\n' with ' ' in each sentence
sentences = [sentence.replace('\n', ' ') for sentence in sentences]

# Print the modified sentences
for sentence in sentences:
    print("\n",sentence)


 2023 Sustainability ReportWorking  towards   a more sustainable  future Coles Group Limited   ABN 11 004 089 936Acknowledgement of Country Coles wishes to acknowledge the Traditional Custodians of  Country throughout Australia.

 We recognise their strength and resilience and pay our respects  to their Elders past and present.

 Coles extends that respect to all Aboriginal and Torres Strait  Islander people, and recognises their rich cultures and their  continuing connection to land and waters.

 Aboriginal and Torres Strait Islander people are advised   that this report may contain names and images of people   who are deceased.

 All references to Indigenous and First Nations people in this  report are intended to include Aboriginal and/or Torres Strait  Islander people.

 Feedback We welcome feedback on this report.

 For more information or to  provide comments, please contact us at: sustainability@coles.com.au Anyone seeking to use information in this Sustainability Report to  dr
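
Replacing `\n` with a single space still leaves runs of spaces, as in `ReportWorking  towards   a more` above. A minimal sketch of one way to collapse them with a regular expression follows; `normalize_whitespace` is a hypothetical helper, not part of the notebook.

```python
import re


def normalize_whitespace(sentence):
    """Collapse any run of whitespace (spaces, tabs, newlines) into one space."""
    return re.sub(r"\s+", " ", sentence).strip()


raw = (
    "For more information or to \nprovide comments, please contact us at:"
    "\nsustainability@coles.com.au"
)
print(normalize_whitespace(raw))
```

Applied to the whole list, this would be `sentences = [normalize_whitespace(s) for s in sentences]`. Because `\s` also matches newlines, the separate `replace('\n', ' ')` step would not be needed in that case.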

## Method 2: OCR Reader

OCR stands for optical character recognition, a process that converts images of text into editable and searchable text. OCR software analyzes the pixels of each page image to identify characters and words. It is useful for extracting data from scanned or image-based PDFs, such as invoices, receipts, forms, or reports.

**IMPORTANT NOTE: Tesseract and Poppler must be installed and added to the system PATH for the code below to run without errors.**

We use Tesseract OCR for Method 2. More details can be found in the [Tesseract repository](https://github.com/tesseract-ocr/tesseract).
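
A quick way to check the PATH requirement before running the OCR cell is to look up the executables with `shutil.which`. This is a sketch under assumptions: `missing_tools` is a hypothetical helper, and the executable names checked (`tesseract` for Tesseract, `pdftoppm` and `pdfinfo` from Poppler) are the ones these packages commonly install; they may differ on some platforms.

```python
import shutil


def missing_tools(names):
    """Return the executables from `names` that cannot be found on PATH."""
    return [name for name in names if shutil.which(name) is None]


# Assumed executable names: tesseract (Tesseract), pdftoppm and pdfinfo (Poppler).
missing = missing_tools(["tesseract", "pdftoppm", "pdfinfo"])
if missing:
    print("Not found on PATH:", ", ".join(missing))
else:
    print("Tesseract and Poppler executables found.")
```

An empty result only shows that the executables can be located; it does not verify their versions or that the Tesseract language data is installed.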

In [18]:
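# OCR PDF reader
def extract_text_from_pdf(pdf_path):
    """Render each PDF page to an image and OCR it with Tesseract."""
    # pdf2image relies on Poppler to rasterize the pages
    images = convert_from_path(pdf_path)
    extracted_text = ""

    for image in images:
        # lang='eng' selects Tesseract's English language data
        text = pytesseract.image_to_string(image, lang='eng')
        extracted_text += text

    return extracted_text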

# Extract text from the PDF using Tesseract OCR
extracted_text = extract_text_from_pdf(pdf_file_path)

# Print the extracted text
print(extracted_text)


Working
towards

a more
sustainable
future

2023 Sustainability Report

colesgroup

Coles Group Limited
ABN 11 004 089 936

x Sles Secéng

Ste, Ending Hunger,

Acknowledgement of Country

Coles wishes to acknowledge the Traditional Custodians of
Country throughout Australia.

We recognise their strength and resilience and pay our respects
to their Elders past and present.

Coles extends that respect to all Aboriginal and Torres Strait
Islander people, and recognises their rich cultures and their
continuing connection fo land and waters.

Aboriginal and Torres Strait Islander people are advised
that this report may contain names and images of people
who are deceased.

All references to Indigenous and First Nations people in this
report are intended to include Aboriginal and/or Torres Strait
Islander people.

Feedback

We welcome feedback on this report. For more information or to
provide comments, please contact us at:

@ sustainability@coles.com.au

Anyone seeking to use information in

In [31]:
def get_sentences(text):
    sentences = nltk.sent_tokenize(text)
    return sentences

# Get sentences
sentences = get_sentences(extracted_text)
print(sentences)


 ['Working\ntowards\n\na more\nsustainable\nfuture\n\n2023 Sustainability Report\n\ncolesgroup\n\nColes Group Limited\nABN 11 004 089 936\n\nx Sles Secéng\n\nSte, Ending Hunger,\n\nAcknowledgement of Country\n\nColes wishes to acknowledge the Traditional Custodians of\nCountry throughout Australia.', 'We recognise their strength and resilience and pay our respects\nto their Elders past and present.', 'Coles extends that respect to all Aboriginal and Torres Strait\nIslander people, and recognises their rich cultures and their\ncontinuing connection fo land and waters.', 'Aboriginal and Torres Strait Islander people are advised\nthat this report may contain names and images of people\nwho are deceased.', 'All references to Indigenous and First Nations people in this\nreport are intended to include Aboriginal and/or Torres Strait\nIslander people.', 'Feedback\n\nWe welcome feedback on this report.', "For more information or to\nprovide comments, please contact us at:\n\n@ sustainability@

In [32]:
# Replace '\n' with ' ' in each sentence
sentences = [sentence.replace('\n', ' ') for sentence in sentences]

# Print the modified sentences
for sentence in sentences:
    print("\n",sentence)


 Working towards  a more sustainable future  2023 Sustainability Report  colesgroup  Coles Group Limited ABN 11 004 089 936  x Sles Secéng  Ste, Ending Hunger,  Acknowledgement of Country  Coles wishes to acknowledge the Traditional Custodians of Country throughout Australia.

 We recognise their strength and resilience and pay our respects to their Elders past and present.

 Coles extends that respect to all Aboriginal and Torres Strait Islander people, and recognises their rich cultures and their continuing connection fo land and waters.

 Aboriginal and Torres Strait Islander people are advised that this report may contain names and images of people who are deceased.

 All references to Indigenous and First Nations people in this report are intended to include Aboriginal and/or Torres Strait Islander people.

 Feedback  We welcome feedback on this report.

 For more information or to provide comments, please contact us at:  @ sustainability@coles.com.au  Anyone seeking to use infor