## Processing Text Documents for BERT Training

To continue the document processing, this notebook follows the steps below:
1. Read the PDF standard.
2. Clean the extracted text.
3. Split the text into chunks.
4. Save the chunks in a dataframe.
5. Read the standards.csv file created in the previous data-processing notebook into a dataframe.
6. Append the new dataframe to the one read from the CSV file.
7. Write the combined dataframe back to the CSV file.

In [46]:
#import necessary libraries
from pypdf import PdfReader
from langchain.text_splitter import RecursiveCharacterTextSplitter
import pandas as pd

In [47]:
# define the folder holding the PDF standards and the path to the standards CSV file
pdf_directory = '/Users/pelumioluwaabiola/Downloads/Researchwork/Standards'
csv_directory = '/Users/pelumioluwaabiola/Downloads/Researchwork/CSVfiles/standards.csv'

In [48]:
# read every page of the PDF into a list of strings, one string per page
document_name = '/jan22_CDSB_freporting_environmental_social_information.pdf'
reader = PdfReader(pdf_directory + document_name)
number_of_pages = len(reader.pages)

framework = []
for i in range(number_of_pages):
    page = reader.pages[i]
    text = page.extract_text()
    framework.append(text)

print(framework)

['CDSB \nFramework\nAdvancing and aligning disclosure of \nenvironmental and social information \nin mainstream reports for reporting environmental & \nsocial information\nJanuary 2022\nwww.cdsb.net/framework', ' \n', '2 CDSB Framework \nAbout CDSB\nThe Climate Disclosure Standards Board (CDSB) is an international consortium of business and \nenvironmental NGOs, hosted by CDP. We are committed to advancing and aligning the global \nmainstream corporate reporting model to equate natural and social capital with financial capital. \nWe do this by offering companies a framework for reporting environment- and social-related \ninformation with the same rigour as financial information. In turn this helps them to provide \ninvestors with decision-useful environmental information via the mainstream corporate report, \nenhancing the efficient allocation of capital. Regulators have also benefited from CDSB’s \ncompliance-ready materials. ', 'Contents\nAbout CDSB 02\n \nChapter 1  \nIntroduction t

In [49]:
def clean_pdf(text):
    """Normalise the text extracted from one PDF page."""
    # Strip up to four leading digits (typically the page number)
    for _ in range(4):
        if text and text[0].isdigit():
            text = text[1:]
    # Replace '\n' (newlines) with spaces
    text = text.replace('\n', '  ')
    # Replace '\x0c' (form feed / new page) with a space
    text = text.replace('\x0c', ' ')
    # Replace '\xa0' (non-breaking space) with a regular space
    text = text.replace('\xa0', ' ')
    # Collapse runs of whitespace and trim both ends
    text = ' '.join(text.split())

    return text
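
As a cross-check of the cleaning rules above, the same steps can be sketched with one regular expression and a whitespace collapse. The name `clean_page` and the sample string are illustrative assumptions, not part of this notebook, and `\d` may not match exactly the same characters as `str.isdigit()` for unusual Unicode digits.

```python
import re

def clean_page(text):
    """Hypothetical compact variant of clean_pdf (an assumption, not the notebook's code)."""
    # Drop up to four digits at the very start of the page text
    text = re.sub(r'^\d{1,4}', '', text)
    # str.split() with no arguments splits on any whitespace, including
    # newlines, form feeds and non-breaking spaces, so joining on ' '
    # collapses all of them into single spaces
    return ' '.join(text.split())

sample = '2 CDSB Framework \nAbout CDSB\x0c\xa0'
print(clean_page(sample))
```

Because the whitespace collapse already covers newlines, form feeds and non-breaking spaces, the explicit `replace` calls in `clean_pdf` mainly serve as documentation of what is being removed.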

In [50]:
# clean every page
clean_framework = [clean_pdf(page_text) for page_text in framework]

clean_framework

['CDSB Framework Advancing and aligning disclosure of environmental and social information in mainstream reports for reporting environmental & social information January 2022 www.cdsb.net/framework',
 '',
 'CDSB Framework About CDSB The Climate Disclosure Standards Board (CDSB) is an international consortium of business and environmental NGOs, hosted by CDP. We are committed to advancing and aligning the global mainstream corporate reporting model to equate natural and social capital with financial capital. We do this by offering companies a framework for reporting environment- and social-related information with the same rigour as financial information. In turn this helps them to provide investors with decision-useful environmental information via the mainstream corporate report, enhancing the efficient allocation of capital. Regulators have also benefited from CDSB’s compliance-ready materials.',
 'Contents About CDSB 02 Chapter 1 Introduction to the CDSB Framework 1. Purpose 06 2. O

In [51]:
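# delete any PDF page that is not needed
# note: the deletions run one at a time, so each index refers to the list
# as it stands after the earlier deletions, not to the original page order
print(f'Length before deleting pages: {len(clean_framework)}')
pages_to_delete = [1,3,4,-1,-2,-3,-4,-5,-6,-7]
for i in pages_to_delete:
    del clean_framework[i]

print(f'Length after deleting pages: {len(clean_framework)}')
clean_framework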

Length before deleting pages: 43
Length after deleting pages: 33


['CDSB Framework Advancing and aligning disclosure of environmental and social information in mainstream reports for reporting environmental & social information January 2022 www.cdsb.net/framework',
 'CDSB Framework About CDSB The Climate Disclosure Standards Board (CDSB) is an international consortium of business and environmental NGOs, hosted by CDP. We are committed to advancing and aligning the global mainstream corporate reporting model to equate natural and social capital with financial capital. We do this by offering companies a framework for reporting environment- and social-related information with the same rigour as financial information. In turn this helps them to provide investors with decision-useful environmental information via the mainstream corporate report, enhancing the efficient allocation of capital. Regulators have also benefited from CDSB’s compliance-ready materials.',
 'Contents About CDSB 02 Chapter 1 Introduction to the CDSB Framework 1. Purpose 06 2. Object
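
Deleting by index inside a loop shifts the remaining items after every `del`, so later indices no longer point at the original pages. If the intent is to name pages by their original position, a minimal sketch is to resolve all indices first and then build a new list. `drop_pages` and the sample list are hypothetical, and applying this to the notebook would keep a different set of pages than the output shown above.

```python
def drop_pages(pages, indices):
    """Return a copy of `pages` without the entries at the given original indices.

    `drop_pages` is a hypothetical helper, not part of the notebook.
    """
    n = len(pages)
    # Resolve negative indices against the original length before removing anything
    to_drop = {i if i >= 0 else n + i for i in indices}
    return [page for position, page in enumerate(pages) if position not in to_drop]

pages = ['p0', 'p1', 'p2', 'p3', 'p4', 'p5']
print(drop_pages(pages, [1, 3, -1]))
```

Building a new list also leaves the input untouched, which makes the cell safe to re-run.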

In [52]:
# split each page into chunks of at most 512 characters

textsplitter = RecursiveCharacterTextSplitter(
    chunk_size=512,
    # chunk_overlap is a character count, not a fraction (e.g. 128 for 25% overlap)
    chunk_overlap=0,
    length_function=len,
    is_separator_regex=False,
)


PdfChunks = textsplitter.create_documents(clean_framework)
print(PdfChunks)
print(f'Total number of chunks: {len(PdfChunks)}')
#print first chunk
print(PdfChunks[0].page_content)
#print last chunk
print(PdfChunks[-1].page_content)


[Document(page_content='CDSB Framework Advancing and aligning disclosure of environmental and social information in mainstream reports for reporting environmental & social information January 2022 www.cdsb.net/framework'), Document(page_content='CDSB Framework About CDSB The Climate Disclosure Standards Board (CDSB) is an international consortium of business and environmental NGOs, hosted by CDP. We are committed to advancing and aligning the global mainstream corporate reporting model to equate natural and social capital with financial capital. We do this by offering companies a framework for reporting environment- and social-related information with the same rigour as financial information. In turn this helps them to provide investors with'), Document(page_content='decision-useful environmental information via the mainstream corporate report, enhancing the efficient allocation of capital. Regulators have also benefited from CDSB’s compliance-ready materials.'), Document(page_content=
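
To make the meaning of an overlap measured in characters concrete, here is a naive fixed-window splitter. It is deliberately not the algorithm used by `RecursiveCharacterTextSplitter`, which splits on separators; `sliding_chunks` and `overlap_fraction` are hypothetical names introduced only for illustration.

```python
def sliding_chunks(text, chunk_size=512, overlap_fraction=0.25):
    """Naive fixed-window chunking (illustrative only, not LangChain's method)."""
    if not text:
        return []
    # Convert the fraction into a whole number of overlapping characters
    overlap = int(chunk_size * overlap_fraction)
    step = chunk_size - overlap
    if step <= 0:
        raise ValueError('overlap must be smaller than chunk_size')
    # Stop early enough that the last window is not fully inside the previous one
    last_start = max(len(text) - overlap, 1)
    return [text[start:start + chunk_size] for start in range(0, last_start, step)]

print(sliding_chunks('abcdefghij', chunk_size=4, overlap_fraction=0.25))
```

Each chunk starts with the last `overlap` characters of the previous one, which is the quantity `chunk_overlap` expresses as an integer.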

In [53]:
# collect the text of every chunk in a list
page_contents = [chunk.page_content for chunk in PdfChunks]
page_contents

['CDSB Framework Advancing and aligning disclosure of environmental and social information in mainstream reports for reporting environmental & social information January 2022 www.cdsb.net/framework',
 'CDSB Framework About CDSB The Climate Disclosure Standards Board (CDSB) is an international consortium of business and environmental NGOs, hosted by CDP. We are committed to advancing and aligning the global mainstream corporate reporting model to equate natural and social capital with financial capital. We do this by offering companies a framework for reporting environment- and social-related information with the same rigour as financial information. In turn this helps them to provide investors with',
 'decision-useful environmental information via the mainstream corporate report, enhancing the efficient allocation of capital. Regulators have also benefited from CDSB’s compliance-ready materials.',
 'Contents About CDSB 02 Chapter 1 Introduction to the CDSB Framework 1. Purpose 06 2. Ob

In [54]:
len(page_contents)

201

In [57]:
#read the standards csv file into a df
df = pd.read_csv(csv_directory)
df.head() #view first 5 rows

,standard_type,document_title,document_text
0,carborn disclosure project,Climate Disclosure Framework,A CLIMATE DISCLOSURE FRAMEWORK FOR SMALL AND M...
1,carborn disclosure project,Climate Disclosure Framework,A significant proportion of the world’s busine...
2,carborn disclosure project,Climate Disclosure Framework,report on their progress and ultimately reduce...
3,carborn disclosure project,Climate Disclosure Framework,nearing and that impacts will continue to occu...
4,carborn disclosure project,Climate Disclosure Framework,report on in their climate disclosures.OVERVIE...


In [58]:
df.tail() #view last 5 rows

,standard_type,document_title,document_text
383,climate disclosure standards board,Application guidance for water-related disclos...,CDSB Framework 57 CDSB Framework | Application...
384,climate disclosure standards board,Application guidance for water-related disclos...,Benefit Accounting (VWBA): A Practical Guide t...
385,climate disclosure standards board,Application guidance for water-related disclos...,investor-water-toolkit/details#translating-wat...
386,climate disclosure standards board,Application guidance for water-related disclos...,https://www.unep-wcmc.org/resources-and- data/...
387,climate disclosure standards board,Application guidance for water-related disclos...,https://www.lifecycleinitiative. org/training-...


In [59]:

# build a dataframe from the standard name, document title and chunk texts
standard_name = ['climate disclosure standards board']  
document_title = ['CDSB Framework for reporting environmental & social information']  

# Repeat standard_name and document_title to match the length of page_contents
standard_name = standard_name * len(page_contents)
document_title = document_title * len(page_contents)

df2 = pd.DataFrame({
    'standard_type': standard_name,
    'document_title': document_title,
    'document_text': page_contents
})

df2.head()

,standard_type,document_title,document_text
0,climate disclosure standards board,CDSB Framework for reporting environmental & s...,CDSB Framework Advancing and aligning disclosu...
1,climate disclosure standards board,CDSB Framework for reporting environmental & s...,CDSB Framework About CDSB The Climate Disclosu...
2,climate disclosure standards board,CDSB Framework for reporting environmental & s...,decision-useful environmental information via ...
3,climate disclosure standards board,CDSB Framework for reporting environmental & s...,Contents About CDSB 02 Chapter 1 Introduction ...
4,climate disclosure standards board,CDSB Framework for reporting environmental & s...,shall be prepared applying the principles of r...


In [60]:
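# Append the new chunks (df2) to the existing standards (df)
df_merged = pd.concat([df, df2], ignore_index=True)

# Write df_merged back to the CSV file, overwriting the previous version
df_merged.to_csv(csv_directory, index=False)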
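
Because the merged dataframe overwrites the same CSV file it was read from, running the notebook twice would append the same chunks twice. One possible guard, sketched here on small in-memory frames, is to drop rows that repeat an existing title and text pair. `append_unique`, its `key_columns` default and the sample data are assumptions made for illustration.

```python
import pandas as pd

def append_unique(existing, new, key_columns=('document_title', 'document_text')):
    """Append `new` to `existing`, keeping only the first copy of each key (hypothetical helper)."""
    merged = pd.concat([existing, new], ignore_index=True)
    # Rows are treated as duplicates when all key columns match
    merged = merged.drop_duplicates(subset=list(key_columns), keep='first')
    return merged.reset_index(drop=True)

existing = pd.DataFrame({
    'standard_type': ['board a', 'board a'],
    'document_title': ['Doc A', 'Doc A'],
    'document_text': ['chunk 1', 'chunk 2'],
})
new = pd.DataFrame({
    'standard_type': ['board b', 'board b'],
    'document_title': ['Doc B', 'Doc B'],
    'document_text': ['chunk 3', 'chunk 4'],
})

once = append_unique(existing, new)
twice = append_unique(once, new)  # simulates re-running the notebook
print(len(once), len(twice))
```

Whether two identical chunks in the same document should count as duplicates is a judgment call; if repeated passages matter for training, a different key would be needed.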