In [1]:
# Import Dependencies
import PyPDF2
from pdf2image import convert_from_path
import pytesseract
import nltk
import re

# Dependency installation
# !pip install nltk PyPDF2 pdf2image pytesseract

# Download the required dataset for sentence tokenization
# nltk.download('punkt')  

# PDF to Sentence Parser

### Method 1: PyPDF2 Parser

Parsing extracts data from structured or semi-structured PDFs by analyzing their internal structure and metadata. The parser reads the PDF file and identifies the elements and attributes that define the data, such as tags, fields, coordinates, or styles.

In Method 1 we use PyPDF2 to read the annual reports. More details can be found in the [PyPDF2 documentation](https://pypdf2.readthedocs.io/en/3.x/).

In [16]:
# PyPDF2 Parser
def parser_pypdf(file_path):
    """Extract the text of every page of a PDF with PyPDF2."""
    text = ""
    with open(file_path, 'rb') as file:
        reader = PyPDF2.PdfReader(file)

        # Concatenate the page texts in document order
        for page in reader.pages:
            # Fall back to an empty string if a page yields no text
            text += page.extract_text() or ""

    return text

# Function to split document into sentences
def get_sentences(text):
    
    sentences = nltk.sent_tokenize(text)
    return sentences

# Function to perform text formatting operations on a list of sentences
def sentence_formatter(sentences):
    
    formatted_sentences = []
    
    for sentence in sentences:
        # Collapse newlines and repeated whitespace into single spaces
        sentence = re.sub(r'\s+', ' ', sentence).strip()
        
        # Skip empty strings, which would raise an IndexError on sentence[0]
        if not sentence:
            continue
        
        # Join fragmented sentences: text that does not start with an
        # uppercase letter is appended to the previous sentence
        if formatted_sentences and not sentence[0].isupper():
            formatted_sentences[-1] += ' ' + sentence
        else:
            formatted_sentences.append(sentence)
    
    return formatted_sentences
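
The joining heuristic is easiest to judge on a small input. The sketch below is a stand-alone copy of that logic under a hypothetical name, `join_fragments`, run on invented sample strings rather than text from the reports. It shows one side effect worth knowing: a sentence that starts with a digit is also treated as a fragment, because `isupper()` is false for digits.

```python
import re

def join_fragments(sentences):
    """Stand-alone illustration of the joining heuristic (hypothetical helper)."""
    merged = []
    for sentence in sentences:
        # Collapse newlines and repeated whitespace into single spaces
        sentence = re.sub(r'\s+', ' ', sentence).strip()
        if not sentence:
            continue  # nothing left after cleaning
        if merged and not sentence[0].isupper():
            # Lowercase or digit start: treat as a continuation
            merged[-1] += ' ' + sentence
        else:
            merged.append(sentence)
    return merged

# Invented sample input, not taken from the reports
sample = [
    "Coles wishes to acknowledge the Traditional",
    "custodians of\nCountry.",
    "",
    "We recognise their strength.",
    "2023 was a busy year.",
]
result = join_fragments(sample)
for sentence in result:
    print(sentence)
```

If merging sentences that begin with a number is not wanted, one option is to join only when the first character is a lowercase letter (`sentence[0].islower()`); whether that suits these reports is a judgment call.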

### Method 2: OCR Reader

OCR stands for optical character recognition, a process that converts images of text into editable and searchable text. Each PDF page is first rendered as an image, and the OCR software then analyzes the pixels to identify characters and words. OCR can be useful for extracting data from scanned or image-based PDFs, such as invoices, receipts, forms, or reports.

**IMPORTANT NOTE: Tesseract and poppler need to be installed and added to the system PATH for error-free execution.**

We use Tesseract OCR for Method 2. More details can be found in the [Tesseract repository](https://github.com/tesseract-ocr/tesseract).
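
Before running the OCR cells, it can help to confirm that the external tools are reachable. The following is a minimal sketch, assuming the default executable names `tesseract` (called by pytesseract) and `pdftoppm` (one of the poppler utilities used by pdf2image); `missing_ocr_tools` is a hypothetical helper, not part of either library.

```python
import shutil

def missing_ocr_tools(tools=("tesseract", "pdftoppm")):
    """Return the executables from `tools` that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]

missing = missing_ocr_tools()
if missing:
    print("Not found on PATH:", ", ".join(missing))
else:
    print("Tesseract and poppler utilities found on PATH.")
```

If an executable is installed under a different name or outside PATH, this check will report it as missing even though the libraries can be pointed at it explicitly.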

In [17]:
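# OCR PDF Reader
def parser_ocr(pdf_path):
    # Render each PDF page as an image (pdf2image relies on poppler)
    images = convert_from_path(pdf_path)
    extracted_text = ""
    
    for image in images:
        # Run Tesseract on the page image with the English language model
        text = pytesseract.image_to_string(image, lang='eng')
        extracted_text += text

    return extracted_text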

---

# 2023 Coles Annual Report

### Method 1: PyPDF2 Parser

In [18]:
# PDF file path to 2023 Coles Report
pdf_file_path = "./SampleReports/2023_Coles_Report.pdf"

# Extract text from the PDF using PyPDF
text = parser_pypdf(pdf_file_path)

# Get sentences
sentences = get_sentences(text)

# Formatting Sentences
formatted_sentences = sentence_formatter(sentences)

# Print the first 10 formatted sentences
for sentence in formatted_sentences[:10]:
    print('\n', sentence)


 2023 Sustainability ReportWorking towards a more sustainable future Coles Group Limited ABN 11 004 089 936Acknowledgement of Country Coles wishes to acknowledge the Traditional Custodians of Country throughout Australia.

 We recognise their strength and resilience and pay our respects to their Elders past and present.

 Coles extends that respect to all Aboriginal and Torres Strait Islander people, and recognises their rich cultures and their continuing connection to land and waters.

 Aboriginal and Torres Strait Islander people are advised that this report may contain names and images of people who are deceased.

 All references to Indigenous and First Nations people in this report are intended to include Aboriginal and/or Torres Strait Islander people.

 Feedback We welcome feedback on this report.

 For more information or to provide comments, please contact us at: sustainability@coles.com.au Anyone seeking to use information in this Sustainability Report to draw conclusions fro

### Method 2: OCR Reader

In [19]:
# Extract text from the PDF using Tesseract OCR
text_ocr = parser_ocr(pdf_file_path)

# Get sentences
sentences = get_sentences(text_ocr)

# Formatting Sentences
formatted_sentences = sentence_formatter(sentences)

# Print the first 10 formatted sentences
for sentence in formatted_sentences[:10]:
    print('\n', sentence)


 Working towards a more sustainable future 2023 Sustainability Report colesgroup Coles Group Limited ABN 11 004 089 936 x Sles Secéng Ste, Ending Hunger, Acknowledgement of Country Coles wishes to acknowledge the Traditional Custodians of Country throughout Australia.

 We recognise their strength and resilience and pay our respects to their Elders past and present.

 Coles extends that respect to all Aboriginal and Torres Strait Islander people, and recognises their rich cultures and their continuing connection fo land and waters.

 Aboriginal and Torres Strait Islander people are advised that this report may contain names and images of people who are deceased.

 All references to Indigenous and First Nations people in this report are intended to include Aboriginal and/or Torres Strait Islander people.

 Feedback We welcome feedback on this report.

 For more information or to provide comments, please contact us at: @ sustainability@coles.com.au Anyone seeking to use information in t
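
The two extractions of the Coles report agree on most sentences but differ in small ways, for example "connection to" versus the OCR misread "connection fo". One rough way to quantify such differences is a character-level similarity ratio from the standard library's `difflib`. The sketch below uses hypothetical helpers (`similarity`, `differing_pairs`) and short fragments of the printed output; pairing sentences by position assumes both methods return the same sentences in the same order, which will not hold in general.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Character-level similarity in [0, 1]; 1.0 means identical strings."""
    return SequenceMatcher(None, a, b).ratio()

def differing_pairs(parsed, ocr):
    """Pair sentences by position and return the pairs that are not identical."""
    return [(p, o, similarity(p, o)) for p, o in zip(parsed, ocr) if p != o]

# Short fragments of the printed output, used only for illustration
parsed = [
    "We recognise their strength and resilience.",
    "their continuing connection to land and waters.",
]
ocr = [
    "We recognise their strength and resilience.",
    "their continuing connection fo land and waters.",
]

for p, o, score in differing_pairs(parsed, ocr):
    print(f"{score:.3f} | {p!r} vs {o!r}")
```

A low score flags pairs worth inspecting by hand; it does not say which method is correct.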

---

# 2023 Kathmandu Report

### Method 1: PyPDF2 Parser

In [20]:
# PDF file path to 2023 Kathmandu Report
pdf_file_path = "./SampleReports/2023_KMD_Report.pdf"

# Extract text from the PDF using PyPDF
text = parser_pypdf(pdf_file_path)

# Get sentences
sentences = get_sentences(text)

# Formatting Sentences
formatted_sentences = sentence_formatter(sentences)

# Print the first 10 formatted sentences
for sentence in formatted_sentences[:10]:
    print('\n', sentence)


 Annual Integrated Report 2023 CONTENTS 2 OUR JOURNEY 2 Reporting approach 3 Our purpose and vision 4 Our brands 6 Highlights and lowlights for FY23 8 Our world 10 LEADERSHIP & GOVERNANCE 10 Report from the Chair 12 Group CEO report 14 Governance at KMD Brands 16 Our board 17 Our management team 18 WHAT MATTERS MOST 18 Materiality approach 20 Our material issues 22 STRATEGY 22 How we create value 24 Our strategic pillars 26 BUILDING GLOBAL BRANDS 39 ELEVATING DIGITAL 48 OPERATIONAL EXCELLENCE 58 LEAD IN ESG 60 Communities 84 Climate 96 Circularity 118 FINANCING OUR IMPACT 119 Group CFO report 122 Financial statements 168 Auditors report 172 ADDITIONAL DISCLOSURES 172 Corporate Governance Statement 184 Statutory information 189 Directory 190 GRI index 198 SASB index 202 Our partners KMD Brands acknowledges Tangata Whenua, the Indigenous Nations, First Peoples, and Custodians of the lands and waterways on which our brand head offices reside in New Zealand, Australia and the United State

### Method 2: OCR Reader

In [21]:
# Extract text from the PDF using Tesseract OCR
text_ocr = parser_ocr(pdf_file_path)

# Get sentences
sentences = get_sentences(text_ocr)

# Formatting Sentences
formatted_sentences = sentence_formatter(sentences)

# Print the first 10 formatted sentences
for sentence in formatted_sentences[:10]:
    print('\n', sentence)


 BRANDS KMD Brands acknowledges Tangata Whenua, the Indigenous Nations, First Peoples, and Custodians of the lands and waterways on which our brand head offices reside in New Zealand, Australia and the United States.

 CONTENTS OUR JOURNEY Reporting approach Our purpose and vision Our brands Highlights and lowlights for FY23 Our world LEADERSHIP & GOVERNANCE Report from the Chair Group CEO report Governance at KMD Brands Our board Our management team Materiality approach Our material issues STRATEGY How we create value Our strategic pillars BUILDING GLOBAL BRANDS ELEVATING DIGITAL LEAD IN ESG Communities Climate Circularity FINANCING OUR IMPACT Group CFO report Financial statements Auditors report ADDITIONAL DISCLOSURES Corporate Governance Statement Statutory information Directory GRI index SASB index Our partners KMD BRANDS — OUR JOURNEY OUR JOURNEY Reporting approach BRANDS ABOUT THIS REPORT This integrated report is a review of our financial, economic, social and environmental per