## Processing text documents for BERT training

To continue the document processing, this notebook follows the steps below:
1. Read the PDF standards.
2. Clean the documents.
3. Split the text into chunks.
4. Save the chunks into a dataframe.
5. Read the existing standards.csv, created in the previous data-processing notebook, into a dataframe.
6. Append the new dataframe to the one read from the CSV.
7. Write the result back to the CSV file.

In [728]:
#import necessary libraries
from pypdf import PdfReader
from langchain.text_splitter import RecursiveCharacterTextSplitter
import pandas as pd

In [729]:
#define the PDF directory and the path to the standards CSV file
pdf_directory = '/Users/pelumioluwaabiola/Downloads/Researchwork/Non-sustainable_standards'
csv_directory = '/Users/pelumioluwaabiola/Downloads/Researchwork/CSVfiles/standards.csv'

In [1002]:
#read the pdf file and collect the text of every page
framework = []
document_name = '/ifrs-17-insurance-contracts.pdf'
reader = PdfReader(pdf_directory + document_name)
number_of_pages = len(reader.pages)

for i in range(number_of_pages):
    page = reader.pages[i]
    text = page.extract_text()
    framework.append(text)

print(framework)

['IFRS 17\nInsurance Contracts\nIn March 2004 the International Accounting Standards Board (Board) issued IFRS 4\nInsurance Contracts . IFRS 4 was an interim standard which was meant to be in place until\nthe Board completed its project on insurance contracts. IFRS 4 permitted entities to use a\nwide variety of accounting practices for insurance contracts, reflecting national\naccounting requirements and variations of those requirements, subject to limited\nimprovements and specified disclosures.\nIn May 2017, the Board completed its project on insurance contracts with the issuance of\nIFRS 17  Insurance Contracts . IFRS 17 replaces IFRS 4 and sets out principles for the\nrecognition, measurement, presentation and disclosure of insurance contracts within the\nscope of IFRS 17.\nIn June 2020, the Board issued Amendments to IFRS 17 . The objective of the amendments is\nto assist entities implementing the Standard, while not unduly disrupting\nimplementation or diminishing the usefulness 
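The page loop above can also be written as a small reusable function. The following is a minimal sketch, not the notebook's own code: `extract_pages`, `FakePage` and `FakeReader` are hypothetical names, and the stand-in classes exist only so the example runs without a PDF on disk. It assumes the reader exposes `.pages` and each page exposes `.extract_text()`, as the `PdfReader` used above does.

```python
def extract_pages(reader):
    """Return the extracted text of every page, in order.

    `reader` is any object exposing `.pages`, where each page has
    an `.extract_text()` method.
    """
    # `or ''` guards against a page that yields no text at all
    return [page.extract_text() or '' for page in reader.pages]


# Hypothetical stand-ins so the sketch runs without a real PDF
class FakePage:
    def __init__(self, text):
        self._text = text

    def extract_text(self):
        return self._text


class FakeReader:
    def __init__(self, texts):
        self.pages = [FakePage(t) for t in texts]


pages = extract_pages(FakeReader(['IFRS 17\nInsurance Contracts', None, 'page 3']))
print(len(pages))
```

With a real file, the same function would be called as `extract_pages(PdfReader(pdf_directory + document_name))`.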

In [967]:
def clean_pdf(text):
    """Strip leading page-number digits, control characters and icon glyphs from one page."""
    # Remove up to four leading digits (e.g. a page number at the start of the page)
    for _ in range(4):
        if text and text[0].isdigit():
            text = text[1:]
    # Replace '\n' (newlines) with spaces
    text = text.replace('\n', '  ')
    # Replace '\x0c' (form feed/new page)
    text = text.replace('\x0c', ' ')
    # Replace '\xa0' (non-breaking space)
    text = text.replace('\xa0', ' ')
    # Replace private-use glyphs (icons, bullets) and null bytes with spaces
    text = text.replace('\uf08c', ' ')
    text = text.replace('\uf099', ' ')
    text = text.replace('\uf09a', ' ')
    text = text.replace('\uf232', ' ')
    text = text.replace('\uf0e0', ' ')
    text = text.replace('\x00', ' ')
    text = text.replace('\uf0e1', ' ')
    text = text.replace('\uf095', ' ')
    text = text.replace('\ue816', ' ')
    text = text.replace('\uf00d', ' ')
    text = text.replace('\uf002', ' ')
    text = text.replace('\uf107', ' ')
    text = text.replace('\uf078', ' ')
    text = text.replace('\uf0b7', ' ')
    # Collapse runs of whitespace into single spaces
    text = ' '.join(text.split())
    return text
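Every glyph listed in `clean_pdf` falls in the Unicode Private Use Area (U+E000 to U+F8FF), so one character class could cover them all, including glyphs that have not shown up yet. The sketch below is an alternative under that assumption, not the notebook's method; `clean_pdf_regex` is a hypothetical name. It is broader than the original (it removes any private-use character), and `\d` matches a slightly different set of characters than `str.isdigit()`.

```python
import re

# Private Use Area glyphs plus the null, form-feed and non-breaking-space characters
_JUNK = re.compile(r'[\ue000-\uf8ff\x00\x0c\xa0]')
# Up to four digits at the very start of the page (e.g. a page number)
_LEADING_DIGITS = re.compile(r'^\d{1,4}')


def clean_pdf_regex(text):
    """Regex-based variant of clean_pdf (illustrative sketch)."""
    text = _LEADING_DIGITS.sub('', text)
    text = _JUNK.sub(' ', text)
    # str.split() with no arguments also splits on newlines, so this
    # both removes '\n' and collapses repeated spaces
    return ' '.join(text.split())


sample = '12\uf0b7 IFRS\n17\xa0 Insurance\x0c'
print(clean_pdf_regex(sample))
```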

In [1003]:
#clean every page of the pdf
clean_framework = [clean_pdf(page_text) for page_text in framework]

clean_framework

['IFRS 17 Insurance Contracts In March 2004 the International Accounting Standards Board (Board) issued IFRS 4 Insurance Contracts . IFRS 4 was an interim standard which was meant to be in place until the Board completed its project on insurance contracts. IFRS 4 permitted entities to use a wide variety of accounting practices for insurance contracts, reflecting national accounting requirements and variations of those requirements, subject to limited improvements and specified disclosures. In May 2017, the Board completed its project on insurance contracts with the issuance of IFRS 17 Insurance Contracts . IFRS 17 replaces IFRS 4 and sets out principles for the recognition, measurement, presentation and disclosure of insurance contracts within the scope of IFRS 17. In June 2020, the Board issued Amendments to IFRS 17 . The objective of the amendments is to assist entities implementing the Standard, while not unduly disrupting implementation or diminishing the usefulness of the informat

In [1004]:
#drop the last pdf page, which is not needed
del clean_framework[-1]
clean_framework

['IFRS 17 Insurance Contracts In March 2004 the International Accounting Standards Board (Board) issued IFRS 4 Insurance Contracts . IFRS 4 was an interim standard which was meant to be in place until the Board completed its project on insurance contracts. IFRS 4 permitted entities to use a wide variety of accounting practices for insurance contracts, reflecting national accounting requirements and variations of those requirements, subject to limited improvements and specified disclosures. In May 2017, the Board completed its project on insurance contracts with the issuance of IFRS 17 Insurance Contracts . IFRS 17 replaces IFRS 4 and sets out principles for the recognition, measurement, presentation and disclosure of insurance contracts within the scope of IFRS 17. In June 2020, the Board issued Amendments to IFRS 17 . The objective of the amendments is to assist entities implementing the Standard, while not unduly disrupting implementation or diminishing the usefulness of the informat

In [1005]:
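#split the text into chunks of at most 512 characters
#note: chunk_overlap is counted in characters, not as a fraction of chunk_size,
#so 0.25 effectively gives no overlap; pass 128 for a 25% overlap

textsplitter = RecursiveCharacterTextSplitter(
    chunk_size=512,
    chunk_overlap=0.25,
    length_function=len,
    is_separator_regex=False,
)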


PdfChunks = textsplitter.create_documents(clean_framework)
print(PdfChunks)
print(f'Total number of chunks: {len(PdfChunks)}')
#print first chunk
print(PdfChunks[0].page_content)
#print last chunk
print(PdfChunks[-1].page_content)


[Document(page_content='IFRS 17 Insurance Contracts In March 2004 the International Accounting Standards Board (Board) issued IFRS 4 Insurance Contracts . IFRS 4 was an interim standard which was meant to be in place until the Board completed its project on insurance contracts. IFRS 4 permitted entities to use a wide variety of accounting practices for insurance contracts, reflecting national accounting requirements and variations of those requirements, subject to limited improvements and specified disclosures. In May 2017, the'), Document(page_content='Board completed its project on insurance contracts with the issuance of IFRS 17 Insurance Contracts . IFRS 17 replaces IFRS 4 and sets out principles for the recognition, measurement, presentation and disclosure of insurance contracts within the scope of IFRS 17. In June 2020, the Board issued Amendments to IFRS 17 . The objective of the amendments is to assist entities implementing the Standard, while not unduly disrupting implementati
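The first two chunks above share no text: the first ends with "In May 2017, the" and the second starts with "Board completed". If a 25% overlap is wanted, the fraction has to be converted into a character count first. The sketch below illustrates that conversion with a plain fixed-window splitter. It is a simplified stand-in, not the separator-aware algorithm of `RecursiveCharacterTextSplitter`, and `chunk_with_overlap` is a hypothetical name.

```python
def chunk_with_overlap(text, chunk_size=512, overlap_fraction=0.25):
    """Split text into fixed-size windows that share `overlap_fraction` of their length."""
    overlap = int(chunk_size * overlap_fraction)  # 0.25 * 512 -> 128 characters
    step = chunk_size - overlap                   # how far each window advances
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reached the end of the text
        start += step
    return chunks


demo_text = ''.join(str(i % 10) for i in range(1000))
demo_chunks = chunk_with_overlap(demo_text)
print(len(demo_chunks))
# The tail of one chunk equals the head of the next
print(demo_chunks[0][-128:] == demo_chunks[1][:128])
```

The same character count (`int(0.25 * 512)`) is what would be passed as `chunk_overlap` to the splitter above.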

In [1006]:
#get the page contents in the pdfchunks and save in a list
page_contents = [chunk.page_content for chunk in PdfChunks]
page_contents

['IFRS 17 Insurance Contracts In March 2004 the International Accounting Standards Board (Board) issued IFRS 4 Insurance Contracts . IFRS 4 was an interim standard which was meant to be in place until the Board completed its project on insurance contracts. IFRS 4 permitted entities to use a wide variety of accounting practices for insurance contracts, reflecting national accounting requirements and variations of those requirements, subject to limited improvements and specified disclosures. In May 2017, the',
 'Board completed its project on insurance contracts with the issuance of IFRS 17 Insurance Contracts . IFRS 17 replaces IFRS 4 and sets out principles for the recognition, measurement, presentation and disclosure of insurance contracts within the scope of IFRS 17. In June 2020, the Board issued Amendments to IFRS 17 . The objective of the amendments is to assist entities implementing the Standard, while not unduly disrupting implementation or diminishing the usefulness of the info

In [1007]:
len(page_contents)

531
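The chunk count alone does not show whether the chunks respect the size limit. A quick summary of chunk lengths can confirm it; the sketch below uses a hypothetical helper, `chunk_length_summary`, and made-up sample data. Note that `chunk_size=512` is measured in characters (because `length_function=len`), whereas BERT's input limit is typically 512 tokens, so a tokenizer-based check would still be needed separately.

```python
def chunk_length_summary(chunks, chunk_size=512):
    """Summarise chunk lengths in characters and count chunks above chunk_size."""
    lengths = [len(chunk) for chunk in chunks]
    return {
        'count': len(lengths),
        'min': min(lengths) if lengths else 0,
        'max': max(lengths) if lengths else 0,
        'over_limit': sum(1 for n in lengths if n > chunk_size),
    }


# Made-up sample data; in the notebook this would be page_contents
sample_chunks = ['a' * 512, 'b' * 100, 'c' * 600]
summary = chunk_length_summary(sample_chunks)
print(summary)
```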

In [1008]:
#read the standards csv file into a df
df = pd.read_csv(csv_directory)
df.head() #view first 5 rows

Unnamed: 0,standard_type,document_title,document_text,label
0,carborn disclosure project,Climate Disclosure Framework,A CLIMATE DISCLOSURE FRAMEWORK FOR SMALL AND M...,0
1,carborn disclosure project,Climate Disclosure Framework,A significant proportion of the world’s busine...,0
2,carborn disclosure project,Climate Disclosure Framework,report on their progress and ultimately reduce...,0
3,carborn disclosure project,Climate Disclosure Framework,nearing and that impacts will continue to occu...,0
4,carborn disclosure project,Climate Disclosure Framework,report on in their climate disclosures.OVERVIE...,0


In [1009]:
df.tail() #view last 5 rows

Unnamed: 0,standard_type,document_title,document_text,label
25321,Non-sustainable standards,ias-8-accounting-policies-changes-in-accountin...,Approval by the Board of IAS 8 issued in Decem...,1
25322,Non-sustainable standards,ias-8-accounting-policies-changes-in-accountin...,Tatsumi YamadaIAS 8 © IFRS Foundation A1075,1
25323,Non-sustainable standards,ias-8-accounting-policies-changes-in-accountin...,Approval by the Board of Definition of Materia...,1
25324,Non-sustainable standards,ias-8-accounting-policies-changes-in-accountin...,Approval by the Board of Definition of Account...,1
25325,Non-sustainable standards,ias-8-accounting-policies-changes-in-accountin...,TokarIAS 8 © IFRS Foundation A1077,1


In [1010]:

#build a dataframe from the standard type, document title, chunk texts and label
standard_name = ['Non-sustainable standards']  
document_title = ['ifrs-17-insurance-contracts']  
label = [1]

# Repeat standard_name and document_title to match the length of page_contents
standard_name = standard_name * len(page_contents)
document_title = document_title * len(page_contents)
label = label * len(page_contents)

df2 = pd.DataFrame({
    'standard_type': standard_name,
    'document_title': document_title,
    'document_text': page_contents,
    'label': label
})

df2.head()

Unnamed: 0,standard_type,document_title,document_text,label
0,Non-sustainable standards,ifrs-17-insurance-contracts,IFRS 17 Insurance Contracts In March 2004 the ...,1
1,Non-sustainable standards,ifrs-17-insurance-contracts,Board completed its project on insurance contr...,1
2,Non-sustainable standards,ifrs-17-insurance-contracts,"applying IFRS 17. In December 2021, the Board ...",1
3,Non-sustainable standards,ifrs-17-insurance-contracts,CONTENTS from paragraph IFRS 17 INSURANCE CONT...,1
4,Non-sustainable standards,ifrs-17-insurance-contracts,DERECOGNITION 72 Modification of an insurance ...,1


In [1011]:
# Merge df and df2
df_merged = pd.concat([df, df2])

# Write df_merged to a CSV file
df_merged.to_csv(csv_directory, index=False)
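Because the merged dataframe is written back to the same file, re-running the cells above would add the same chunks a second time. One way to guard against that is to skip the write when the document title is already present. The sketch below shows the idea with the standard-library `csv` module on a temporary file. `append_chunks`, `FIELDS` and the demo path are hypothetical, and the column names are assumed to match the dataframe above.

```python
import csv
import os
import tempfile

FIELDS = ['standard_type', 'document_title', 'document_text', 'label']


def append_chunks(csv_path, standard_type, document_title, chunks, label):
    """Append one row per chunk unless document_title is already in the file.

    Returns the number of rows written.
    """
    has_rows = os.path.exists(csv_path) and os.path.getsize(csv_path) > 0
    if has_rows:
        with open(csv_path, newline='', encoding='utf-8') as f:
            for row in csv.DictReader(f):
                if row.get('document_title') == document_title:
                    return 0  # already added on an earlier run
    with open(csv_path, 'a', newline='', encoding='utf-8') as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        if not has_rows:
            writer.writeheader()  # new or empty file: write the header first
        for chunk in chunks:
            writer.writerow({
                'standard_type': standard_type,
                'document_title': document_title,
                'document_text': chunk,
                'label': label,
            })
    return len(chunks)


# Demo on a temporary file with made-up chunks
demo_path = os.path.join(tempfile.mkdtemp(), 'standards_demo.csv')
demo_args = ('Non-sustainable standards', 'ifrs-17-insurance-contracts',
             ['chunk one', 'chunk two'], 1)
first = append_chunks(demo_path, *demo_args)
second = append_chunks(demo_path, *demo_args)  # second run is skipped
print(first, second)
```

The same check could be done in pandas by testing whether the title already appears in `df['document_title']` before concatenating.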

In [1012]:
#read the csv file and view the first 5 rows and last 5 rows
df = pd.read_csv(csv_directory)
df.head()

Unnamed: 0,standard_type,document_title,document_text,label
0,carborn disclosure project,Climate Disclosure Framework,A CLIMATE DISCLOSURE FRAMEWORK FOR SMALL AND M...,0
1,carborn disclosure project,Climate Disclosure Framework,A significant proportion of the world’s busine...,0
2,carborn disclosure project,Climate Disclosure Framework,report on their progress and ultimately reduce...,0
3,carborn disclosure project,Climate Disclosure Framework,nearing and that impacts will continue to occu...,0
4,carborn disclosure project,Climate Disclosure Framework,report on in their climate disclosures.OVERVIE...,0


In [1013]:
df.tail()

Unnamed: 0,standard_type,document_title,document_text,label
25852,Non-sustainable standards,ifrs-17-insurance-contracts,Approval by the International Accounting Stand...,1
25853,Non-sustainable standards,ifrs-17-insurance-contracts,Mary Tokar Wei-Guo ZhangIFRS 17 © IFRS Foundat...,1
25854,Non-sustainable standards,ifrs-17-insurance-contracts,Approval by the International Accounting Stand...,1
25855,Non-sustainable standards,ifrs-17-insurance-contracts,Approval by the Board of Initial Application o...,1
25856,Non-sustainable standards,ifrs-17-insurance-contracts,Foundation A971,1
