## Processing text documents for BERT training

To continue the document processing, this notebook follows the steps below:
1. Read the PDF standard.
2. Clean the extracted text.
3. Split the text into chunks.
4. Save the chunks into a dataframe.
5. Read the existing standards.csv, created in the previous data processing notebook, into a dataframe.
6. Append the new dataframe to the existing one.
7. Write the result back to the CSV file.

In [72]:
#import necessary libraries
from pypdf import PdfReader
from langchain.text_splitter import RecursiveCharacterTextSplitter
import pandas as pd

In [117]:
#define directories
pdf_directory = '/Users/pelumioluwaabiola/Downloads/Researchwork/Standards'
csv_directory = '/Users/pelumioluwaabiola/Downloads/Researchwork/CSVfiles/standards.csv'

In [273]:
#read the pdf file and extract the text of every page
document_name = '/Dec2023_IRSG_Code_of_ConductforESGRatingsandData-Products-Providers-v3.pdf'
reader = PdfReader(pdf_directory + document_name)
number_of_pages = len(reader.pages)

#collect one string per page, in page order
framework = []
for i in range(number_of_pages):
    page = reader.pages[i]
    text = page.extract_text()
    framework.append(text)

print(framework)

['Code of Conduct for \nESG Ratings and Data \nProducts Providers\nDECEMBER 2023\nDeveloped by the \nESG Data and Ratings \nWorking Group (DRWG)\n', 'Contents\nBackground 1\nIntroduction 1\nOverview of the Code of Conduct  1\nHow the Code of Conduct was developed 2\nApplication and approach 3\nOwnership of the Code of Conduct  3\nScope and definitions 4\nTarget Scope and Application  4\nTerminology 4\nNegative Scope  5\nPrinciples 6\n1. Principle on Good Governance 6\n2. Principle on Securing Quality (Systems and Controls) 7\n3. Principle on Conflicts of Interest 8\n4. Principle on Transparency 9\n5. Principle on Confidentiality (Systems and Controls) 10\n6. Principle on Engagement (Systems and Controls) 11\nAnnex 1 12\nMembers of the ESG Data and Ratings Working Group (DRWG) and others who  \ncontributed to this draft Code of Conduct for ESG Ratings and Data products providers  12\nAnnex 2 13\nMapping of IOSCO recommendations against drafting of voluntary Code of \nConduct\xa0for\xa0E

In [274]:
def clean_pdf(text):
    # Strip up to four leading digits (such as a page number at the start of the page)
    for _ in range(4):
        if text and text[0].isdigit():
            text = text[1:]
    # Replace '\n' (newlines) with spaces
    text = text.replace('\n', '  ')
    # Replace '\x0c' (form feed/new page) with a space
    text = text.replace('\x0c', ' ')
    # Replace '\xa0' (non-breaking space) with a regular space
    text = text.replace('\xa0', ' ')
    # Collapse runs of whitespace into single spaces and trim both ends
    text = ' '.join(text.split())

    return text
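
As a quick check of the cleaning step, the sketch below applies a shortened, equivalent version of `clean_pdf` to a made-up page string (the sample text is an assumption, not taken from the PDF). It relies on the fact that `str.split()` with no arguments already treats `\n`, `\x0c` and `\xa0` as whitespace, so the final join handles those characters on its own.

```python
def clean_pdf(text):
    # Strip up to four leading digits (such as a page number)
    for _ in range(4):
        if text and text[0].isdigit():
            text = text[1:]
    # split() with no arguments splits on any Unicode whitespace,
    # including '\n', '\x0c' and '\xa0', and drops empty pieces
    return ' '.join(text.split())

# Hypothetical page text, for illustration only
sample = '12Scope and\xa0definitions\n4.1  Target scope\x0c'
cleaned = clean_pdf(sample)
print(cleaned)
```

Note that the digit stripping cannot tell a page number from content: a page whose text genuinely begins with a number (a year, for instance) would lose those digits too.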

In [275]:
#clean every extracted page
clean_framework = [clean_pdf(page_text) for page_text in framework]

clean_framework

['Code of Conduct for ESG Ratings and Data Products Providers DECEMBER 2023 Developed by the ESG Data and Ratings Working Group (DRWG)',
 'Contents Background 1 Introduction 1 Overview of the Code of Conduct 1 How the Code of Conduct was developed 2 Application and approach 3 Ownership of the Code of Conduct 3 Scope and definitions 4 Target Scope and Application 4 Terminology 4 Negative Scope 5 Principles 6 1. Principle on Good Governance 6 2. Principle on Securing Quality (Systems and Controls) 7 3. Principle on Conflicts of Interest 8 4. Principle on Transparency 9 5. Principle on Confidentiality (Systems and Controls) 10 6. Principle on Engagement (Systems and Controls) 11 Annex 1 12 Members of the ESG Data and Ratings Working Group (DRWG) and others who contributed to this draft Code of Conduct for ESG Ratings and Data products providers 12 Annex 2 13 Mapping of IOSCO recommendations against drafting of voluntary Code of Conduct for ESG ratings and data products providers 13 ii',
 

In [276]:
#drop the pages that are not needed: the table of contents (index 1) and the last page
del clean_framework[1]
del clean_framework[-1]
clean_framework



['Code of Conduct for ESG Ratings and Data Products Providers DECEMBER 2023 Developed by the ESG Data and Ratings Working Group (DRWG)',
 'Background Introduction 1.1 Environmental, social and governance (“ ESG ”) factors play an increasingly important role in financial markets. This growth is leading to both a rapid increase in the use of and demand for related services, such as ESG ratings and data products, and to an increase in the scrutiny of their providers. As the landscape changes, concerns around the transparency, quality and reliability of ESG ratings and data products are emerging, calling for closer scrutiny of their providers. A Code of Conduct can help improve trust in these products, especially those relevant to the financial services sector, to guide investors in allocating their money to the right assets as well as to alleviate the risk of greenwashing. 1.2 In November 2021, the International Organization of Securities Commissions (“ IOSCO ”), in its final report " Env

In [277]:
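#split the text into chunks of at most 512 characters

textsplitter = RecursiveCharacterTextSplitter(
    chunk_size=512,
    chunk_overlap=0,  # counted in characters, not as a fraction of chunk_size (use 128 for a 25% overlap)
    length_function=len,
    is_separator_regex=False,
)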


PdfChunks = textsplitter.create_documents(clean_framework)
print(PdfChunks)
print(f'Total number of chunks: {len(PdfChunks)}')
#print first chunk
print(PdfChunks[0].page_content)
#print last chunk
print(PdfChunks[-1].page_content)


[Document(page_content='Code of Conduct for ESG Ratings and Data Products Providers DECEMBER 2023 Developed by the ESG Data and Ratings Working Group (DRWG)'), Document(page_content='Background Introduction 1.1 Environmental, social and governance (“ ESG ”) factors play an increasingly important role in financial markets. This growth is leading to both a rapid increase in the use of and demand for related services, such as ESG ratings and data products, and to an increase in the scrutiny of their providers. As the landscape changes, concerns around the transparency, quality and reliability of ESG ratings and data products are emerging, calling for closer scrutiny of their providers. A'), Document(page_content='Code of Conduct can help improve trust in these products, especially those relevant to the financial services sector, to guide investors in allocating their money to the right assets as well as to alleviate the risk of greenwashing. 1.2 In November 2021, the International Organiz
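
The `chunk_overlap` argument of `RecursiveCharacterTextSplitter` is a number of characters, not a fraction of `chunk_size`, so a 25% overlap on 512-character chunks would be written as 128. The sketch below illustrates what character overlap means with a plain fixed-width sliding window. It is an illustration only and not LangChain's algorithm, which also splits on separators; `sliding_chunks` is a hypothetical helper.

```python
def sliding_chunks(text, chunk_size, chunk_overlap):
    """Cut text into fixed-width windows that share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError('chunk_overlap must be >= 0 and smaller than chunk_size')
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window has reached the end of the text
        if start + chunk_size >= len(text):
            break
        start += step
    return chunks

text = 'abcdefghijklmnopqrst'  # 20 characters
no_overlap = sliding_chunks(text, chunk_size=8, chunk_overlap=0)
quarter_overlap = sliding_chunks(text, chunk_size=8, chunk_overlap=2)
print(no_overlap)
print(quarter_overlap)
```

With an overlap of 2, each chunk starts with the last two characters of the previous one, which keeps some context across chunk boundaries at the cost of storing more text.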

In [278]:
#collect the text of each chunk in a list
page_contents = [chunk.page_content for chunk in PdfChunks]
page_contents

['Code of Conduct for ESG Ratings and Data Products Providers DECEMBER 2023 Developed by the ESG Data and Ratings Working Group (DRWG)',
 'Background Introduction 1.1 Environmental, social and governance (“ ESG ”) factors play an increasingly important role in financial markets. This growth is leading to both a rapid increase in the use of and demand for related services, such as ESG ratings and data products, and to an increase in the scrutiny of their providers. As the landscape changes, concerns around the transparency, quality and reliability of ESG ratings and data products are emerging, calling for closer scrutiny of their providers. A',
 'Code of Conduct can help improve trust in these products, especially those relevant to the financial services sector, to guide investors in allocating their money to the right assets as well as to alleviate the risk of greenwashing. 1.2 In November 2021, the International Organization of Securities Commissions (“ IOSCO ”), in its final report "

In [279]:
len(page_contents)

154

In [280]:
#read the standards csv file into a df
df = pd.read_csv(csv_directory)
df.head() #view first 5 rows

Unnamed: 0,standard_type,document_title,document_text
0,carborn disclosure project,Climate Disclosure Framework,A CLIMATE DISCLOSURE FRAMEWORK FOR SMALL AND M...
1,carborn disclosure project,Climate Disclosure Framework,A significant proportion of the world’s busine...
2,carborn disclosure project,Climate Disclosure Framework,report on their progress and ultimately reduce...
3,carborn disclosure project,Climate Disclosure Framework,nearing and that impacts will continue to occu...
4,carborn disclosure project,Climate Disclosure Framework,report on in their climate disclosures.OVERVIE...


In [281]:
df.tail() #view last 5 rows

Unnamed: 0,standard_type,document_title,document_text
9544,Integrated Reporting,International Framework,www.integratedreporting.org 56 Contents Next B...
9545,Integrated Reporting,International Framework,information in an integrated report should be ...
9546,Integrated Reporting,International Framework,question: How does the organization’s governan...
9547,Integrated Reporting,International Framework,Strategy and resource allocation 4.28 An integ...
9548,Integrated Reporting,International Framework,"encounter in pursuing its strategy, and what a..."


In [282]:

#convert the standard name, standard title and page contents to a dataframe
standard_name = ['International Regulatory Strategy Group']  
document_title = ['Code of Conduct for ESG Ratings and Data Products Providers']  

# Repeat standard_name and document_title to match the length of page_contents
standard_name = standard_name * len(page_contents)
document_title = document_title * len(page_contents)

df2 = pd.DataFrame({
    'standard_type': standard_name,
    'document_title': document_title,
    'document_text': page_contents
})

df2.head()

Unnamed: 0,standard_type,document_title,document_text
0,International Regulatory Strategy Group,Code of Conduct for ESG Ratings and Data Produ...,Code of Conduct for ESG Ratings and Data Produ...
1,International Regulatory Strategy Group,Code of Conduct for ESG Ratings and Data Produ...,"Background Introduction 1.1 Environmental, soc..."
2,International Regulatory Strategy Group,Code of Conduct for ESG Ratings and Data Produ...,Code of Conduct can help improve trust in thes...
3,International Regulatory Strategy Group,Code of Conduct for ESG Ratings and Data Produ...,ESG ratings and data products and ESG ratings ...
4,International Regulatory Strategy Group,Code of Conduct for ESG Ratings and Data Produ...,conflicts of interest. 1.3 Following the IOSCO...


In [283]:
# Append df2 to df and renumber the rows
df_merged = pd.concat([df, df2], ignore_index=True)

# Write df_merged to a CSV file
df_merged.to_csv(csv_directory, index=False)
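
Because the CSV is overwritten in place, running the notebook a second time would append the same chunks again. The sketch below shows one possible guard: only add rows whose `document_title` is not already present. It uses plain lists of dicts instead of DataFrames so that it runs without pandas; `append_if_new` and the sample rows are hypothetical and not part of the notebook.

```python
def append_if_new(existing_rows, new_rows, key='document_title'):
    """Return existing_rows plus the new_rows whose key value is not already present."""
    seen = {row[key] for row in existing_rows}
    fresh = [row for row in new_rows if row[key] not in seen]
    return existing_rows + fresh

# Made-up rows standing in for the contents of standards.csv
existing = [
    {'document_title': 'Climate Disclosure Framework', 'document_text': 'chunk a'},
]
new = [
    {'document_title': 'Code of Conduct', 'document_text': 'chunk 1'},
    {'document_title': 'Code of Conduct', 'document_text': 'chunk 2'},
]

first_run = append_if_new(existing, new)
second_run = append_if_new(first_run, new)  # simulates rerunning the notebook
print(len(first_run), len(second_run))
```

The check is per document title, so it assumes each document is added in a single pass; a document that was only partially written would not be completed by a rerun.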

In [284]:
#read the csv file and view the first 5 rows and last 5 rows
df = pd.read_csv(csv_directory)
df.head()

Unnamed: 0,standard_type,document_title,document_text
0,carborn disclosure project,Climate Disclosure Framework,A CLIMATE DISCLOSURE FRAMEWORK FOR SMALL AND M...
1,carborn disclosure project,Climate Disclosure Framework,A significant proportion of the world’s busine...
2,carborn disclosure project,Climate Disclosure Framework,report on their progress and ultimately reduce...
3,carborn disclosure project,Climate Disclosure Framework,nearing and that impacts will continue to occu...
4,carborn disclosure project,Climate Disclosure Framework,report on in their climate disclosures.OVERVIE...


In [285]:
df.tail()

Unnamed: 0,standard_type,document_title,document_text
9698,International Regulatory Strategy Group,Code of Conduct for ESG Ratings and Data Produ...,the ESG ratings and data products provider.(A)...
9699,International Regulatory Strategy Group,Code of Conduct for ESG Ratings and Data Produ...,assessed; and (ii) of the principal categories...
9700,International Regulatory Strategy Group,Code of Conduct for ESG Ratings and Data Produ...,Publishing terms of engagement describing how ...
9701,International Regulatory Strategy Group,Code of Conduct for ESG Ratings and Data Produ...,review. IOSCO recommendation 10 63. Entities s...
9702,International Regulatory Strategy Group,Code of Conduct for ESG Ratings and Data Produ...,"coordinates for, all the entities’ sustainabil..."
