## Processing financial standards documents for BERT training

This is the first processing notebook. To prepare financial standards documents for training a BERT model, we take the following steps:
1. Import the PDF of each standard.
2. Clean the documents by removing extra spaces, newlines, tabs, and other unneeded characters.
3. Split each document into chunks and label the chunks with the standard name and document title.
4. Save the labeled chunks to a new CSV file.


In [61]:
#import necessary libraries
from pypdf import PdfReader
from langchain.text_splitter import RecursiveCharacterTextSplitter
import pandas as pd

In [62]:
#define directories
reading_directory = '/Users/pelumioluwaabiola/Downloads/Researchwork/Standards'
writing_directory = '/Users/pelumioluwaabiola/Downloads/Researchwork/CSVfiles'
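
The string concatenation used in the cells below (for example `reading_directory + '/climate_framework.pdf'`) only works when each file name starts with a slash. A minimal sketch of a more forgiving alternative, assuming the standard library's `pathlib` is acceptable here. `build_path` is a hypothetical helper and `/data/Standards` is a placeholder directory, not the notebook's real path.

```python
from pathlib import PurePosixPath

def build_path(directory, filename):
    """Join a directory and a file name, tolerating a leading slash on the name."""
    # Strip a leading '/' so the file name is not treated as an absolute path
    return str(PurePosixPath(directory) / filename.lstrip('/'))

example_directory = '/data/Standards'  # placeholder directory, for illustration only
with_slash = build_path(example_directory, '/climate_framework.pdf')
without_slash = build_path(example_directory, 'climate_framework.pdf')
print(with_slash)
print(without_slash)
```

`PurePosixPath` is used so the example behaves the same on every operating system; in the notebook itself `pathlib.Path` would be the natural choice.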

In [63]:
# Read the PDF and extract the text of every page
document_name = '/climate_framework.pdf'
reader = PdfReader(reading_directory + document_name)

cdf_climate_framework = [page.extract_text() for page in reader.pages]

print(cdf_climate_framework)

['1\nA CLIMATE DISCLOSURE \nFRAMEWORK\nFOR SMALL AND MEDIUM-SIZED  \nENTERPRISES (SME s)\nNOVEMBER 2021\nDISCL OSURE INSI GHT ACTION', '1CONTENTS\n02 OVERVIEW\n04 INTRODUCTION\n04  BACKGROUND\n04  DEFINING SME\n05  NOTE ON MICRO AND SMALL SMES \n05  PURPOSE \n05  OBJECTIVES \n05  DEVELOPMENT \n06  INTENDED USERS \n06  ALIGNMENT AND MAPPING WITH OTHER FRAMEWORKS \n07 GUIDING PRINCIPLES \n07  PURPOSE OF PRINCIPLES \n07  FRAMEWORK PRINCIPLES \n08 MODULES: REPORTING REQUIREMENTS AND RECOMMENDATIONS \n08  HOW TO USE THE FRAMEWORK \n09  MEASURE \n11  COMMIT \n13  ACTION AND IMPACT \n14  ENERGY \n15  VALUE CHAIN EMISSIONS \n16  MANAGEMENT AND RESILIENCE\n19  CLIMATE SOLUTIONS \n20  CONCLUSION AND AREAS FOR FUTURE WORK \n20  CONTRIBUTIONS \n21  APPENDIX \n22  EXAMPLES OF CLIMATE-RELATED INITIATIVE TYPES\nVersion Publication date Revisions\n1.0 25 November 2021 -\n1.1   10 December 2021  Corporate net-zero definition updated to align with the SBTi Corporate Net-Zero Standard \nImportant Notice\

In [64]:
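def clean_pdf(text):
    """Strip the leading page number and normalize whitespace in one page of text."""
    # Remove up to four leading digits (the page number at the start of the page)
    for _ in range(4):
        if text and text[0].isdigit():
            text = text[1:]
    # Replace '\n' (newline) with spaces
    text = text.replace('\n', '  ')
    # Replace '\x0c' (form feed / new page) with a space
    text = text.replace('\x0c', ' ')
    # Replace '\xa0' (non-breaking space) with a regular space
    text = text.replace('\xa0', ' ')
    # Collapse runs of whitespace into single spaces and trim both ends
    text = ' '.join(text.split())

    return text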
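
A quick check of the cleaning logic on small made-up strings may help here. The version of `clean_pdf` below is condensed so the example runs on its own: `str.split()` already treats `'\n'`, `'\x0c'`, and `'\xa0'` as whitespace, so the explicit `replace` calls are left out. The sample strings are illustrative and not taken from the PDF, apart from the table-of-contents fragment.

```python
def clean_pdf(text):
    # Condensed copy of the notebook's function, repeated for a standalone example
    for _ in range(4):
        if text and text[0].isdigit():
            text = text[1:]
    # split() with no arguments splits on any whitespace, including '\x0c' and '\xa0'
    return ' '.join(text.split())

# A page that starts with its page number and contains mixed whitespace
sample_page = '12\nA CLIMATE  DISCLOSURE\xa0FRAMEWORK\x0c'
cleaned_page = clean_pdf(sample_page)
print(cleaned_page)

# Only the leading digits are stripped; later numbers are kept
contents_page = clean_pdf('1CONTENTS\n02 OVERVIEW')
print(contents_page)

# Caveat: a page whose text really begins with a number loses that number
year_page = clean_pdf('2021 emissions targets')
print(year_page)
```

The last case suggests reviewing pages that legitimately start with a figure or a year, since the function cannot tell those apart from page numbers.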

In [65]:
# Clean every page of the PDF
clean_cdf_climate_framework = [clean_pdf(page_text) for page_text in cdf_climate_framework]

clean_cdf_climate_framework

['A CLIMATE DISCLOSURE FRAMEWORK FOR SMALL AND MEDIUM-SIZED ENTERPRISES (SME s) NOVEMBER 2021 DISCL OSURE INSI GHT ACTION',
 'CONTENTS 02 OVERVIEW 04 INTRODUCTION 04 BACKGROUND 04 DEFINING SME 05 NOTE ON MICRO AND SMALL SMES 05 PURPOSE 05 OBJECTIVES 05 DEVELOPMENT 06 INTENDED USERS 06 ALIGNMENT AND MAPPING WITH OTHER FRAMEWORKS 07 GUIDING PRINCIPLES 07 PURPOSE OF PRINCIPLES 07 FRAMEWORK PRINCIPLES 08 MODULES: REPORTING REQUIREMENTS AND RECOMMENDATIONS 08 HOW TO USE THE FRAMEWORK 09 MEASURE 11 COMMIT 13 ACTION AND IMPACT 14 ENERGY 15 VALUE CHAIN EMISSIONS 16 MANAGEMENT AND RESILIENCE 19 CLIMATE SOLUTIONS 20 CONCLUSION AND AREAS FOR FUTURE WORK 20 CONTRIBUTIONS 21 APPENDIX 22 EXAMPLES OF CLIMATE-RELATED INITIATIVE TYPES Version Publication date Revisions 1.0 25 November 2021 - 1.1 10 December 2021 Corporate net-zero definition updated to align with the SBTi Corporate Net-Zero Standard Important Notice This work is licensed under a Creative Commons Attribution 4.0 International (CC BY 4.0

In [66]:
#remove the table of contents page and the last page of the pdf
del clean_cdf_climate_framework[1]
del clean_cdf_climate_framework[-1]
clean_cdf_climate_framework


['A CLIMATE DISCLOSURE FRAMEWORK FOR SMALL AND MEDIUM-SIZED ENTERPRISES (SME s) NOVEMBER 2021 DISCL OSURE INSI GHT ACTION',
 'A significant proportion of the world’s businesses are small and medium- sized enterprises (SMEs). Globally, micro-enterprises (SMEs with fewer than ten employees) alone account for 70% to 90% of all firms1. As such, SMEs play an important role in reducing global emissions and bringing innovative climate solutions to the market. It is crucial that they are equipped with the tools and resources needed to measure their emissions, set greenhouse gas reduction targets grounded in science, take bold actions, report on their progress and ultimately reduce their emissions. This framework provides guidelines for SMEs on doing exactly that. It is open for anyone to use and can be used directly by SMEs to guide their reporting of climate impacts and strategies to multiple stakeholders. It can also be used by SME support organizations (such as consultancies) and data colle
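
Because `del` changes the list in place, running the cell above a second time would remove two more pages. A minimal sketch of a non-mutating alternative follows. `drop_pages` is a hypothetical helper, and the page list is made up for illustration.

```python
def drop_pages(pages, indices):
    """Return a new list without the pages at the given indices (negatives allowed)."""
    # Convert negative indices so that -1 refers to the last page
    to_drop = {i if i >= 0 else len(pages) + i for i in indices}
    return [page for i, page in enumerate(pages) if i not in to_drop]

pages = ['cover', 'contents', 'body 1', 'body 2', 'back page']
kept = drop_pages(pages, [1, -1])
print(kept)
print(len(pages))  # the original list is left unchanged
```

Since the input list is never modified, the cell can be re-run safely and always produces the same result.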

In [67]:
# Split the text into chunks of at most 512 characters

textsplitter = RecursiveCharacterTextSplitter(
    chunk_size=512,
    # chunk_overlap is counted in characters, not as a fraction of chunk_size;
    # 0 means no overlap (use e.g. 128 for a 25% overlap)
    chunk_overlap=0,
    length_function=len,
    is_separator_regex=False,
)


PdfChunks = textsplitter.create_documents(clean_cdf_climate_framework)
print(PdfChunks)
print(f'Total number of chunks: {len(PdfChunks)}')
# Print the first chunk
print(PdfChunks[0].page_content)
# Print the last chunk
print(PdfChunks[-1].page_content)


[Document(page_content='A CLIMATE DISCLOSURE FRAMEWORK FOR SMALL AND MEDIUM-SIZED ENTERPRISES (SME s) NOVEMBER 2021 DISCL OSURE INSI GHT ACTION'), Document(page_content='A significant proportion of the world’s businesses are small and medium- sized enterprises (SMEs). Globally, micro-enterprises (SMEs with fewer than ten employees) alone account for 70% to 90% of all firms1. As such, SMEs play an important role in reducing global emissions and bringing innovative climate solutions to the market. It is crucial that they are equipped with the tools and resources needed to measure their emissions, set greenhouse gas reduction targets grounded in science, take bold actions,'), Document(page_content='report on their progress and ultimately reduce their emissions. This framework provides guidelines for SMEs on doing exactly that. It is open for anyone to use and can be used directly by SMEs to guide their reporting of climate impacts and strategies to multiple stakeholders. It can also be us
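
To make the `chunk_size` and `chunk_overlap` parameters concrete, here is a simplified fixed-width sliding window. This is not the algorithm `RecursiveCharacterTextSplitter` uses (that splitter cuts on separators such as paragraph breaks and spaces, which is why the chunks above end on word boundaries). `sliding_chunks` is a hypothetical helper that only illustrates that both parameters are counted in characters.

```python
def sliding_chunks(text, chunk_size, chunk_overlap=0):
    """Cut text into fixed-width chunks that share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError('chunk_overlap must be >= 0 and smaller than chunk_size')
    if not text:
        return []
    # Each new chunk starts this many characters after the previous one
    step = chunk_size - chunk_overlap
    # Stop once the remaining tail is already covered by the previous chunk
    last_start = max(len(text) - chunk_overlap, 1)
    return [text[start:start + chunk_size] for start in range(0, last_start, step)]

no_overlap = sliding_chunks('abcdefghij', chunk_size=4, chunk_overlap=0)
with_overlap = sliding_chunks('abcdefghij', chunk_size=4, chunk_overlap=1)
print(no_overlap)
print(with_overlap)
```

With an overlap of 1, the last character of each chunk is repeated as the first character of the next, which is the behavior an overlap is meant to provide at chunk boundaries.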

In [68]:
# Collect the text of each chunk in a list
page_contents = [chunk.page_content for chunk in PdfChunks]
page_contents

['A CLIMATE DISCLOSURE FRAMEWORK FOR SMALL AND MEDIUM-SIZED ENTERPRISES (SME s) NOVEMBER 2021 DISCL OSURE INSI GHT ACTION',
 'A significant proportion of the world’s businesses are small and medium- sized enterprises (SMEs). Globally, micro-enterprises (SMEs with fewer than ten employees) alone account for 70% to 90% of all firms1. As such, SMEs play an important role in reducing global emissions and bringing innovative climate solutions to the market. It is crucial that they are equipped with the tools and resources needed to measure their emissions, set greenhouse gas reduction targets grounded in science, take bold actions,',
 'report on their progress and ultimately reduce their emissions. This framework provides guidelines for SMEs on doing exactly that. It is open for anyone to use and can be used directly by SMEs to guide their reporting of climate impacts and strategies to multiple stakeholders. It can also be used by SME support organizations (such as consultancies) and data c

In [69]:
len(page_contents)

95

In [70]:

# Build a dataframe from the standard name, document title, and page contents
standard_name = ['carbon disclosure project']
document_title = ['Climate Disclosure Framework']

# Repeat standard_name and document_title to match the length of page_contents
standard_name = standard_name * len(page_contents)
document_title = document_title * len(page_contents)

df = pd.DataFrame({
    'standard_type': standard_name,
    'document_title': document_title,
    'document_text': page_contents
})

df.head()

Unnamed: 0,standard_type,document_title,document_text
0,carbon disclosure project,Climate Disclosure Framework,A CLIMATE DISCLOSURE FRAMEWORK FOR SMALL AND M...
1,carbon disclosure project,Climate Disclosure Framework,A significant proportion of the world’s busine...
2,carbon disclosure project,Climate Disclosure Framework,report on their progress and ultimately reduce...
3,carbon disclosure project,Climate Disclosure Framework,nearing and that impacts will continue to occu...
4,carbon disclosure project,Climate Disclosure Framework,report on in their climate disclosures.OVERVIE...


In [72]:
# Save the dataframe to a CSV file in the writing directory
output_filename = '/standards.csv'
df.to_csv(writing_directory + output_filename, index=False)
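
One way to confirm that the labels and chunk text survive the write is a small round trip. The sketch below uses the standard library's `csv` module instead of pandas so that it runs on its own. The column names match the dataframe above, while the rows and the temporary file are made up for illustration.

```python
import csv
import os
import tempfile

fieldnames = ['standard_type', 'document_title', 'document_text']
rows = [
    {'standard_type': 'example standard', 'document_title': 'Example Title',
     'document_text': 'First chunk, with a comma.'},
    {'standard_type': 'example standard', 'document_title': 'Example Title',
     'document_text': 'Second chunk with "quotes".'},
]

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, 'standards.csv')
    # Write the header and rows; the csv module quotes commas and quotes as needed
    with open(csv_path, 'w', newline='', encoding='utf-8') as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
    # Read the file back and compare it with what was written
    with open(csv_path, newline='', encoding='utf-8') as f:
        loaded = list(csv.DictReader(f))

print(len(loaded) == len(rows))
print(loaded == rows)
```

The same idea applies to the notebook's file: reading `standards.csv` back with `pd.read_csv` and comparing the row count with `len(page_contents)` would catch a truncated or mis-quoted export.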
