## Processing PDF Standards Documents for BERT Training

This notebook continues the document processing with the following steps:
1. Read the PDF standard.
2. Clean the extracted text.
3. Split the text into chunks.
4. Save the chunks in a dataframe.
5. Read the standards.csv file created in the previous data processing notebook into a dataframe.
6. Append the new dataframe to that dataframe.
7. Write the combined dataframe back to the CSV file.

In [1]:
#import necessary libraries
from pypdf import PdfReader
from langchain.text_splitter import RecursiveCharacterTextSplitter
import pandas as pd

In [2]:
#define the pdf directory and the path of the standards csv file
pdf_directory = '/Users/pelumioluwaabiola/Downloads/Researchwork/Standards'
csv_directory = '/Users/pelumioluwaabiola/Downloads/Researchwork/CSVfiles/standards.csv'

In [7]:
#read the pdf file and extract the text of every page
framework = []
document_name = '/jul20CDSBClimate_related_disclosures.pdf'
reader = PdfReader(pdf_directory + document_name)

for page in reader.pages:
    framework.append(page.extract_text())

print(framework)

['July 2020\ncdsb.net/climateguidanceCDSB Framework\nApplication \nguidance for \nclimate-related \ndisclosures\n', ' \n', '01 CDSB Framework 01 CDSB Framework | Application guidance for climate-related disclosures  \nCopyright © 2020 Climate Disclosure Standards Board (CDSB) and CDP Worldwide (Europe) \ngGmbH. All rights reserved. Dissemination of the contents of this report is encouraged. Please give \nfull acknowledgement of the source when reproducing extracts in other published work. All \ninformation in this report is provided without warranty of any kind, express or implied. The authors \ndisclaim any responsibility for the information or conclusions in this report. The authors accept no \nliability for any loss arising from any action taken or refrained from being taken as a result of \ninformation contained in this report.About the  \nClimate Disclosure \nStandards BoardCDSB is an international consortium of \nbusiness and environmental NGOs. We are \ncommitted to advancing an

In [8]:
def clean_pdf(text):
    # Remove up to four leading digits (for example the '01' at the start of a page)
    for _ in range(4):
        if text and text[0].isdigit():
            text = text[1:]
    # Replace '\n' (newlines) with spaces
    text = text.replace('\n', '  ')
    # Replace '\x0c' (form feed/new page) with a space
    text = text.replace('\x0c', ' ')
    # Replace '\xa0' (non-breaking space) with a regular space
    text = text.replace('\xa0', ' ')
    # Collapse runs of whitespace into single spaces and trim both ends
    text = ' '.join(text.split())

    return text
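
To see what `clean_pdf` does to a raw page, the sketch below applies the same steps to two short made-up strings modelled on the extracted text above. The sample strings are hypothetical, and the function is repeated only so that the example runs on its own.

```python
def clean_pdf(text):
    # Same steps as the notebook's clean_pdf, repeated so this example is self-contained
    for _ in range(4):
        if text and text[0].isdigit():
            text = text[1:]
    text = text.replace('\n', '  ')
    text = text.replace('\x0c', ' ')
    text = text.replace('\xa0', ' ')
    return ' '.join(text.split())

# Hypothetical raw page text with leading digits, line breaks,
# a non-breaking space and a trailing form feed
sample = '01 CDSB Framework\nApplication \nguidance\xa0for \nclimate-related \ndisclosures\n\x0c'
cleaned = clean_pdf(sample)
print(cleaned)  # CDSB Framework Application guidance for climate-related disclosures

# Only digits at the very start are removed, and at most four of them
digits_only_at_start = clean_pdf('123456 pages, 2020 edition')
print(digits_only_at_start)  # 56 pages, 2020 edition
```

Digits that appear later in the text, such as years, are left untouched, while a leading number longer than four digits is only partly removed.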

In [9]:
#clean every extracted page
clean_framework = [clean_pdf(page_text) for page_text in framework]

clean_framework

['July 2020 cdsb.net/climateguidanceCDSB Framework Application guidance for climate-related disclosures',
 '',
 'CDSB Framework 01 CDSB Framework | Application guidance for climate-related disclosures Copyright © 2020 Climate Disclosure Standards Board (CDSB) and CDP Worldwide (Europe) gGmbH. All rights reserved. Dissemination of the contents of this report is encouraged. Please give full acknowledgement of the source when reproducing extracts in other published work. All information in this report is provided without warranty of any kind, express or implied. The authors disclaim any responsibility for the information or conclusions in this report. The authors accept no liability for any loss arising from any action taken or refrained from being taken as a result of information contained in this report.About the Climate Disclosure Standards BoardCDSB is an international consortium of business and environmental NGOs. We are committed to advancing and aligning the global mainstream corpo

In [10]:
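#delete any pdf page that is not needed
#note: the indices are applied one after another, so every deletion shifts
#the positions that the following indices (including the negative ones) refer to
print(f'Length before deleting pages: {len(clean_framework)}')
pages_to_delete = [1,3,-1,-2]
for i in pages_to_delete:
    del clean_framework[i]

print(f'Length after deleting pages: {len(clean_framework)}')
clean_framework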

Length before deleting pages: 19
Length after deleting pages: 15


['July 2020 cdsb.net/climateguidanceCDSB Framework Application guidance for climate-related disclosures',
 'CDSB Framework 01 CDSB Framework | Application guidance for climate-related disclosures Copyright © 2020 Climate Disclosure Standards Board (CDSB) and CDP Worldwide (Europe) gGmbH. All rights reserved. Dissemination of the contents of this report is encouraged. Please give full acknowledgement of the source when reproducing extracts in other published work. All information in this report is provided without warranty of any kind, express or implied. The authors disclaim any responsibility for the information or conclusions in this report. The authors accept no liability for any loss arising from any action taken or refrained from being taken as a result of information contained in this report.About the Climate Disclosure Standards BoardCDSB is an international consortium of business and environmental NGOs. We are committed to advancing and aligning the global mainstream corporate 
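
Because the deletion loop modifies the list while it walks through `pages_to_delete`, each index is interpreted against the already shortened list. The sketch below uses the integers 0 to 18 as stand-ins for the 19 cleaned pages to show which original pages the loop removes. It also shows a hypothetical alternative (`delete_by_original_index`, not part of the notebook) that resolves every index against the original list. The notebook does not say which of the two behaviours was intended.

```python
def delete_sequentially(items, indices):
    """Mimic the notebook loop: each del sees the list left by the previous one."""
    remaining = list(items)
    for i in indices:
        del remaining[i]
    return remaining

def delete_by_original_index(items, indices):
    """Alternative: resolve every index against the original list first.

    Assumes every index is within range for the original list.
    """
    n = len(items)
    to_drop = {i % n for i in indices}  # maps -1 to n-1, -2 to n-2, ...
    return [item for pos, item in enumerate(items) if pos not in to_drop]

pages = list(range(19))  # stand-ins for the 19 cleaned pages
pages_to_delete = [1, 3, -1, -2]

kept_sequential = delete_sequentially(pages, pages_to_delete)
kept_original = delete_by_original_index(pages, pages_to_delete)

# Original page positions removed by each approach
print(sorted(set(pages) - set(kept_sequential)))  # [1, 4, 16, 18]
print(sorted(set(pages) - set(kept_original)))    # [1, 3, 17, 18]
```

Both approaches leave 15 pages, so the length check printed by the notebook cannot tell them apart; only inspecting the remaining pages can.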

In [11]:
#chunking the text into pieces of at most 512 characters

textsplitter = RecursiveCharacterTextSplitter(
    chunk_size=512,
    chunk_overlap=0,  #measured in characters (not a fraction of chunk_size); 0 means no overlap
    length_function=len,
    is_separator_regex=False,
)


PdfChunks = textsplitter.create_documents(clean_framework)
print(PdfChunks)
print(f'Total number of chunks are : {len(PdfChunks)}')
#print first chunk
print(PdfChunks[0].page_content)
#print last chunk
print(PdfChunks[-1].page_content)


[Document(page_content='July 2020 cdsb.net/climateguidanceCDSB Framework Application guidance for climate-related disclosures'), Document(page_content='CDSB Framework 01 CDSB Framework | Application guidance for climate-related disclosures Copyright © 2020 Climate Disclosure Standards Board (CDSB) and CDP Worldwide (Europe) gGmbH. All rights reserved. Dissemination of the contents of this report is encouraged. Please give full acknowledgement of the source when reproducing extracts in other published work. All information in this report is provided without warranty of any kind, express or implied. The authors disclaim any responsibility for the information'), Document(page_content='or conclusions in this report. The authors accept no liability for any loss arising from any action taken or refrained from being taken as a result of information contained in this report.About the Climate Disclosure Standards BoardCDSB is an international consortium of business and environmental NGOs. We ar
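
`chunk_size` and `chunk_overlap` are both counted in characters here, because `length_function=len`. If an overlap of a quarter of each chunk were wanted, it would have to be passed as a character count (128 for a 512-character chunk) rather than as a fraction. The sketch below illustrates that arithmetic with a simplified fixed-window splitter. It is not LangChain's algorithm, which also splits on separators, and the name `sliding_chunks` and the sample text are hypothetical.

```python
def sliding_chunks(text, chunk_size=512, overlap_fraction=0.25):
    """Split text into fixed-size character windows that share part of their length."""
    overlap = int(chunk_size * overlap_fraction)  # 512 * 0.25 -> 128 characters
    step = chunk_size - overlap
    if step <= 0:
        raise ValueError("overlap must be smaller than chunk_size")
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
        start += step
    return chunks

# Hypothetical 1000-character page
sample_page = ''.join(str(i % 10) for i in range(1000))
chunks = sliding_chunks(sample_page)

print([len(c) for c in chunks])             # [512, 512, 232]
print(chunks[0][-128:] == chunks[1][:128])  # True: consecutive chunks share 128 characters
```

With LangChain itself the corresponding setting would be an integer such as `chunk_overlap=128`. Using it would change the number of chunks reported below.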

In [12]:
#collect the page_content of every chunk in a list
page_contents = [chunk.page_content for chunk in PdfChunks]
page_contents

['July 2020 cdsb.net/climateguidanceCDSB Framework Application guidance for climate-related disclosures',
 'CDSB Framework 01 CDSB Framework | Application guidance for climate-related disclosures Copyright © 2020 Climate Disclosure Standards Board (CDSB) and CDP Worldwide (Europe) gGmbH. All rights reserved. Dissemination of the contents of this report is encouraged. Please give full acknowledgement of the source when reproducing extracts in other published work. All information in this report is provided without warranty of any kind, express or implied. The authors disclaim any responsibility for the information',
 'or conclusions in this report. The authors accept no liability for any loss arising from any action taken or refrained from being taken as a result of information contained in this report.About the Climate Disclosure Standards BoardCDSB is an international consortium of business and environmental NGOs. We are committed to advancing and aligning the global mainstream corpor

In [13]:
len(page_contents)

151

In [14]:
#read the standards csv file into a df
df = pd.read_csv(csv_directory)
df.head() #view first 5 rows

Unnamed: 0,standard_type,document_title,document_text
0,carborn disclosure project,Climate Disclosure Framework,A CLIMATE DISCLOSURE FRAMEWORK FOR SMALL AND M...
1,carborn disclosure project,Climate Disclosure Framework,A significant proportion of the world’s busine...
2,carborn disclosure project,Climate Disclosure Framework,report on their progress and ultimately reduce...
3,carborn disclosure project,Climate Disclosure Framework,nearing and that impacts will continue to occu...
4,carborn disclosure project,Climate Disclosure Framework,report on in their climate disclosures.OVERVIE...


In [15]:
df.tail() #view last 5 rows

Unnamed: 0,standard_type,document_title,document_text
584,climate disclosure standards board,CDSB Framework for reporting environmental & s...,Available from: https://www.ohchr.org/en/issue...
585,climate disclosure standards board,CDSB Framework for reporting environmental & s...,https://www.carbontracker.org/reports 102. Int...
586,climate disclosure standards board,CDSB Framework for reporting environmental & s...,boundary setting in mainstream reports. [PDF]....
587,climate disclosure standards board,CDSB Framework for reporting environmental & s...,https://www.ifac.org/system/files/publications...
588,climate disclosure standards board,CDSB Framework for reporting environmental & s...,International Auditing and Assurance Standards...


In [16]:

#build a dataframe from the standard name, document title and page contents
standard_name = ['climate disclosure standards board']  
document_title = ['Application guidance for climate-related disclosures']  

# Repeat standard_name and document_title to match the length of page_contents
standard_name = standard_name * len(page_contents)
document_title = document_title * len(page_contents)

df2 = pd.DataFrame({
    'standard_type': standard_name,
    'document_title': document_title,
    'document_text': page_contents
})

df2.head()

Unnamed: 0,standard_type,document_title,document_text
0,climate disclosure standards board,Application guidance for climate-related discl...,July 2020 cdsb.net/climateguidanceCDSB Framewo...
1,climate disclosure standards board,Application guidance for climate-related discl...,CDSB Framework 01 CDSB Framework | Application...
2,climate disclosure standards board,Application guidance for climate-related discl...,or conclusions in this report. The authors acc...
3,climate disclosure standards board,Application guidance for climate-related discl...,environmental information with the same rigour...
4,climate disclosure standards board,Application guidance for climate-related discl...,"resilient capital markets. Collectively, we ai..."


In [17]:
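# Append df2 to df; ignore_index=True renumbers the rows from 0
df_merged = pd.concat([df, df2], ignore_index=True)

# Overwrite the CSV file with the combined data
# (re-running the notebook appends the same chunks again)
df_merged.to_csv(csv_directory, index=False)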
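
Since the merged data is written back to the same file it was read from, running the notebook a second time would append the same chunks again. The sketch below shows one possible guard, using small made-up dataframes in place of `df` and `df2`. The helper `append_rows` is hypothetical. Dropping duplicates would also remove chunks that legitimately repeat within a document, so this is a trade-off rather than a drop-in fix.

```python
import pandas as pd

COLUMNS = ["standard_type", "document_title", "document_text"]

def append_rows(existing, new_rows):
    """Append new_rows to existing, renumber the index and drop exact repeats."""
    merged = pd.concat([existing, new_rows], ignore_index=True)
    return merged.drop_duplicates(subset=COLUMNS).reset_index(drop=True)

# Made-up stand-ins for the existing CSV contents and the new chunks
existing = pd.DataFrame({
    "standard_type": ["standard a", "standard a"],
    "document_title": ["Title 1", "Title 1"],
    "document_text": ["first chunk", "second chunk"],
})
new_rows = pd.DataFrame({
    "standard_type": ["standard b", "standard b"],
    "document_title": ["Title 2", "Title 2"],
    "document_text": ["third chunk", "fourth chunk"],
})

# Without ignore_index the index labels of the second frame restart at 0
plain_index = pd.concat([existing, new_rows]).index.tolist()
print(plain_index)  # [0, 1, 0, 1]

once = append_rows(existing, new_rows)
twice = append_rows(once, new_rows)  # simulates running the notebook again
print(len(once), len(twice))  # 4 4
```

Writing with `index=False`, as the notebook does, keeps the index labels out of the file, so the duplicate labels only matter while the merged dataframe is still in memory.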

In [18]:
#read the csv file and view the first 5 rows and last 5 rows
df = pd.read_csv(csv_directory)
df.head()

Unnamed: 0,standard_type,document_title,document_text
0,carborn disclosure project,Climate Disclosure Framework,A CLIMATE DISCLOSURE FRAMEWORK FOR SMALL AND M...
1,carborn disclosure project,Climate Disclosure Framework,A significant proportion of the world’s busine...
2,carborn disclosure project,Climate Disclosure Framework,report on their progress and ultimately reduce...
3,carborn disclosure project,Climate Disclosure Framework,nearing and that impacts will continue to occu...
4,carborn disclosure project,Climate Disclosure Framework,report on in their climate disclosures.OVERVIE...


In [19]:
df.tail()

Unnamed: 0,standard_type,document_title,document_text
735,climate disclosure standards board,Application guidance for climate-related discl...,REQ-11 December 2019 www.cdsb.net/frameworkCDS...
736,climate disclosure standards board,Application guidance for climate-related discl...,"risks and opportunities, and metrics and targe..."
737,climate disclosure standards board,Application guidance for climate-related discl...,"over the short-, medium-, and long-term.REQ-03..."
738,climate disclosure standards board,Application guidance for climate-related discl...,"climate-related risks.REQ-01, REQ-02 and REQ-0..."
739,climate disclosure standards board,Application guidance for climate-related discl...,"strategy and risk management process.REQ-02, R..."
