# Metals Mining Standard 2021 Summary

### Objective:

Summarize the below pdf by different sections /headers.

pdf link: https://www.sasb.org/wp-content/uploads/2018/11/Metals_Mining_Standard_2021.pdf
 

The model should be able to extract the sections like Introduction, Sustainability disclosures topics…  and sub sections like Greenhouse Gas emission, Air Quality etc and it’s relevant text and summarize it.

 



## Table of Contents

- [1. Read PDF](#READ_PDF)
  - [1.1 Extract TOC](#section_1_1)
  - [1.2 Extract Table](#section_1_2)
  - [1.3 Extract Section Content](#section_1_3)
- [2. Model Assessment](#Summary_of_Models)
  - [2.1 Use Bert Model](#section_2_1)
  - [2.1.1 T5 small Bert Model](#section_2_1_1)
  - [2.1.2 MT5 Amazon fine tuned Bert Model](#section_2_1_2)
  - [2.2 Bart Model](#section_2_2)
  - [2.2.1 Default Bart Model](#section_2_2_1)
  - [2.2.2 Facebook Bart Mode](#section_2_2_2)
  - [2.3 Use GPT-3 Model](#section_2_3)
  - [2.3.1 Open_ai_davinici_002 Model](#section_2_3_1)
  - [2.3.2 Open_ai_davinici Modell](#section_2_3_2)
  - [2.4 Bert Extractive Summarizer](#section_2_4)
  - [2.4.1 Bert-large-uncased Model](#section_2_4_1)
  - [2.4.2 Facebook Bart Modell](#section_2_4_2)
  - [2.4.2 Paraphrase-MiniLM-L6-v2 Model](#section_2_4_2)
  - [2.5 Use Custom Model](#section_2_5)
- [3. Finalize Model](#section_3)
  - [3.1 Analysis](#section_3_1)
  - [3.2 Final Model](#section_3_3)
  - [3.3 Save in output file](#section_3_3)
- [4. Conclusion](#section_4)

<!-- ### Table of Contents:

## 1. Read PDF
## 1.1 Extract TOC
## 1.2 Extract Tables
## 1.3 Extract Section content

# 2. Modelling
## 2.1 Use Bert Model
## 2.2 Use GPT-3 Model 
## 2.3 Use others

# 3. Conclusion 

#### 4.1.1 Default Bert Model <a class="anchor" id="section_4_1_1"></a>
#### 4.1.2 MT5 Amazon fine tuned Bert Model <a class="anchor" id="section_4_1_2"></a>

### 4.2 Bart Model <a class="anchor" id="section_4_2"></a>
### 4.3 Use GPT-3 Model <a class="anchor" id="section_4_3"></a>
### 4.4 Use others <a class="anchor" id="section_4_4"></a>

-->

In [1]:
import warnings
warnings.filterwarnings('ignore')

In [91]:
filename = 'resource/Metals_Mining_Standard_2021.pdf' 

OUTPUT_FILE_PATH = "results/summary/Metals_Mining_Standard_2021_Summary.txt"



In [3]:
import pdfplumber
import pandas as pd
import copy


import tensorflow
import transformers
from transformers import pipeline


import openai
import wget
import pathlib
import pdfplumber
import numpy as np

import collections 
import datetime as datetime
import textwrap


from summarizer import Summarizer
from summarizer.sbert import SBertSummarizer
from transformers import AutoConfig, AutoTokenizer, AutoModel

In [4]:
print(transformers.__version__)
print(tensorflow.__version__)

4.19.2
2.9.0


In [5]:
OPEN_AI_ORG="org-example-id"
OPEN_AI_API_KEY="OPEN-API-KEY"

# 1. Read PDF <a class="anchor" id="READ_PDF"></a>

## 1.1 Extract TOC <a class="anchor" id="section_1_1"></a>

In this Section TOC is extrced from page 3 and a dictionary is maintained for quick access of content by key.

In [7]:
def text_num_split(item):
    for index, letter in enumerate(item, 0):
        if letter.isdigit():
            return (item[:index],item[index:])

In [8]:
with pdfplumber.open(filename) as pdf:
    first_page = pdf.pages[2]
    page_string = first_page.extract_text()


ls_texts = page_string.replace('.', '').split('\n')

toc_map ={}
for i in ls_texts:
    elec = text_num_split(i)
    if elec:
        toc_map[elec[0]] = {'page_no':elec[1]}
    
toc_map

{'Introduction': {'page_no': '4'},
 'Purpose of SASB Standards': {'page_no': '4'},
 'Overview of SASB Standards': {'page_no': '4'},
 'Use of the Standards': {'page_no': '5'},
 'Industry Description': {'page_no': '5'},
 'Sustainability Disclosure Topics & Accounting Metrics': {'page_no': '6'},
 'Greenhouse Gas Emissions': {'page_no': '9'},
 'Air Quality': {'page_no': '13'},
 'Energy Management': {'page_no': '15'},
 'Water Management': {'page_no': '17'},
 'Waste & Hazardous Materials Management': {'page_no': '19'},
 'Biodiversity Impacts': {'page_no': '24'},
 'Security, Human Rights & Rights of Indigenous Peoples': {'page_no': '29'},
 'Community Relations': {'page_no': '34'},
 'Labor Relations': {'page_no': '38'},
 'Workforce Health & Safety': {'page_no': '40'},
 'Business Ethics & Transparency': {'page_no': '42'},
 'Tailings Storage Facilities Management': {'page_no': '44'},
 'SUSTAINABILITY ACCOUNTING STANDARD | METALS & MINING | ': {'page_no': '3'}}

## 1.2 Extract Table <a class="anchor" id="section_1_2"></a>

In [9]:
with pdfplumber.open(filename) as pdf:
    dfs = []
    table_df = pd.DataFrame()
    for page in pdf.pages:
        if page.find_tables():
            tables = page.find_tables()
            t1_content = tables[0].extract(x_tolerance = 5)
            
            if len(tables) >1:
                t2_content = tables[1].extract(x_tolerance = 5)

            df_tmp = pd.DataFrame(t1_content)
            dfs.append(df_tmp)
            

table1_df = pd.concat(dfs) # creating table 1
table2_df = pd.DataFrame(t2_content) # creating table 2

table1_df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11
0,TOPIC,ACCOUNTING METRIC,CATEGORY,UNIT OF\nMEASURE,CODE,,,,,,,
1,Greenhouse \nGas Emissions,"Gross global Scope 1 emissions, percentage \nc...",Quantitative,"Metric tons (t) \nCO₂-e, \nPercentage (%)",EM-MM-110a.1,,,,,,,
2,,Discussion of long-term and short-term \nstrat...,Discussion and \nAnalysis,,EM-MM-110a.2,,,,,,,
3,Air Quality,Air emissions of the following pollutants: (1)...,Quantitative,Metric tons (t),EM-MM-120a.1,,,,,,,
4,Energy \nManagement,"(1) Total energy consumed, (2) percentage grid...",Quantitative,"Gigajoules (GJ), \nPercentage (%)",EM-MM-130a.1,,,,,,,


In [10]:

# dropping null column, and irrelavent rows.
table1_df = table1_df[:-2]
table1_df.dropna(how='all', axis=1, inplace=True)

# drop duplicates - this removes the duplicate header rows
table1_df = table1_df.drop_duplicates()
# table1_df.head()

In [11]:
# set first row as column names
new_header = table1_df.iloc[0] #grab the first row for the header
table1_df = table1_df[1:] #take the data less the header row
table1_df.columns = new_header

In [12]:

# set TOPIC as index
table1_df['TOPIC'].fillna(method='pad', inplace=True)
table1_df = table1_df.set_index('TOPIC')

table1_df.head()

Unnamed: 0_level_0,ACCOUNTING METRIC,CATEGORY,UNIT OF\nMEASURE,CODE
TOPIC,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Greenhouse \nGas Emissions,"Gross global Scope 1 emissions, percentage \nc...",Quantitative,"Metric tons (t) \nCO₂-e, \nPercentage (%)",EM-MM-110a.1
Greenhouse \nGas Emissions,Discussion of long-term and short-term \nstrat...,Discussion and \nAnalysis,,EM-MM-110a.2
Air Quality,Air emissions of the following pollutants: (1)...,Quantitative,Metric tons (t),EM-MM-120a.1
Energy \nManagement,"(1) Total energy consumed, (2) percentage grid...",Quantitative,"Gigajoules (GJ), \nPercentage (%)",EM-MM-130a.1
Water \nManagement,"(1) Total fresh water withdrawn, (2) total fre...",Quantitative,"Thousand cubic \nmeters (m³), \nPercentage (%)",EM-MM-140a.1


In [13]:
table2_df

Unnamed: 0,0,1,2,3
0,ACTIVITY METRIC,CATEGORY,UNIT OF\nMEASURE,CODE
1,Production of (1) metal ores and (2) finished ...,Quantitative,Metric tons (t) \nsaleable,EM-MM-000.A
2,"Total number of employees, percentage contractors",Quantitative,"Number, \nPercentage (%)",EM-MM-000.B


## 1.3 Extract Section Content <a class="anchor" id="section_1_3"></a>

In this section, The PDF file is parsed, the content, page_no, and section, sub-section hierarchy is formed for each of the section and appended to the list
 
```json 
[
    {'title': 'name of the title',
    'page_no': 0,
    'child': [],
    'content': ''
    },
]
```


In [14]:
def getPageNo(title, level, parent=None):
    
    if title in toc_map:
        toc_map[title]['level'] = level
        if parent:
            toc_map[title]['parent'] = parent
        return toc_map[title]['page_no']
    return -1
    
    

Creating the Structure of the content map

In [15]:
from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfpage import PDFPage

# add node for each section
# create section hierarchy
# in node add title, pageno, reference for child
def parse(filename, maxlevel, page_n=0):
    fp = open(filename, 'rb')
    parser = PDFParser(fp)
    doc = PDFDocument(parser)

    tocList =[]
    outlines = doc.get_outlines()
    prev =0
    for (level, title, dest, a, se) in outlines:
        if level <= maxlevel:
            
            if level == 1:
                page_no = getPageNo(title, level)
                node = {'title': title, 'page_no':page_no, 'child':[], 'content':""}
                tocList.append(node)
        
            else:
                page_no = getPageNo(title, level, tocList[-1]['title'])
                node = {'title': title, 'page_no':page_no, 'child':[], 'content':"", 'parent': tocList[-1]['title']}
                tocList[-1]['child'].append(node)
    
    return tocList

In [16]:
tocList=parse(filename,2)
tocList

[{'title': 'Metals & Mining', 'page_no': -1, 'child': [], 'content': ''},
 {'title': 'About the Value Reporting Foundation',
  'page_no': -1,
  'child': [],
  'content': ''},
 {'title': 'Table of Contents', 'page_no': -1, 'child': [], 'content': ''},
 {'title': 'Introduction',
  'page_no': '4',
  'child': [{'title': 'Purpose of SASB Standards',
    'page_no': '4',
    'child': [],
    'content': '',
    'parent': 'Introduction'},
   {'title': 'Overview of SASB Standards',
    'page_no': '4',
    'child': [],
    'content': '',
    'parent': 'Introduction'},
   {'title': 'Use of the Standards',
    'page_no': '5',
    'child': [],
    'content': '',
    'parent': 'Introduction'},
   {'title': 'Industry Description',
    'page_no': '5',
    'child': [],
    'content': '',
    'parent': 'Introduction'}],
  'content': ''},
 {'title': 'Sustainability Disclosure Topics & Accounting Metrics',
  'page_no': '6',
  'child': [{'title': 'Greenhouse Gas Emissions',
    'page_no': '9',
    'child': 

In [17]:
print(toc_map['Introduction'])
print(toc_map['Purpose of SASB Standards'])

{'page_no': '4', 'level': 1}
{'page_no': '4', 'level': 2, 'parent': 'Introduction'}


In [18]:

content_map = copy.deepcopy(toc_map)

for k, v in content_map.items():
    v['content'] = ""
    v['total_content_length'] =0
print(content_map['Introduction'])
print(content_map['Purpose of SASB Standards'])

{'page_no': '4', 'level': 1, 'content': '', 'total_content_length': 0}
{'page_no': '4', 'level': 2, 'parent': 'Introduction', 'content': '', 'total_content_length': 0}


**The PDF data is parsed, removed whitecaharacters, and according to the section, content map is populated.**

In [19]:

# create section and populate with content
with pdfplumber.open(filename) as pdf:

    pages = pdf.pages
    
    cur_section = None
    
    for i in range(3, len(pages)): # extract texts from page 3
        cur_page = pages[i]
        extracted_text = cur_page.extract_text()
        # clean the text and create a list by using string separator '\n'
        cleaned_texts_list = [x.strip() for x in extracted_text.split('\n')]
        
        # get the cur section begining, and add content till new section is reached.
        for text in cleaned_texts_list:
            if text in toc_map:                
                # track the node, and change to current
                cur_section = content_map[text]
            
            elif cur_section:
                cur_section['content'] += text
                


In [20]:
# verifying the content in the content map

print(content_map['Introduction'])
print(content_map['Purpose of SASB Standards'])

{'page_no': '4', 'level': 1, 'content': '', 'total_content_length': 0}
{'page_no': '4', 'level': 2, 'parent': 'Introduction', 'content': 'The use of the term “sustainability” in the SASB Standards refers to corporate activities that maintain or enhance theability of the company to create value over the long term. Sustainability accounting reflects the governance andmanagement of a company’s environmental and social impacts arising from production of goods and services, as well asits governance and management of the environmental and social capitals necessary to create long-term value. The SASBStandards also refer to sustainability as “ESG” (environmental, social, and governance), though traditional corporategovernance issues such as board composition are not included within the scope of the SASB Standards.SASB Standards are designed to identify a minimum set of sustainability issues most likely to impact the operatingperformance or financial condition of the typical company in an indus

common function for saving file in excel format.

In [21]:
def saveInExcel(dict1, filename="Excel", debug=False):
    
    today = datetime.datetime.now()

    path = "results/model_evaluation/"+ today.strftime("%d%b/")
    if debug:
        path += "debug/" 
    fname = path + filename + "_"+ today.strftime('%m_%d_%Y_%H_%M_%S') 
    
    df = pd.DataFrame.from_dict(dict1, orient='index') # convert dict to dataframe

    df.to_csv(fname+".csv") #

    df.to_excel(fname+".xlsx")

In [22]:
saveInExcel(toc_map, "toc_before_modelling")

In [23]:
saveInExcel(content_map, "content_map_before_modelling")

# 2. Model Assessment <a class="anchor" id="Summary_of_Models"></a>


Assess each of the model with first pargraph, predict it and it to the model summary results list for comparison
```json
 [( model-id, key-of-summary, summary_generated)]
```

1. summary by section 

2. include subsection into parent and summarize

In [24]:
model_summary_results = [] 

In [25]:
def saveInResultsList(m_content_map, p_name, p_cols, model_id):
    for k,v in m_content_map[p_name].items():

        if k in p_cols[p_name]:
            name = p_name + " : " + k
            model_summary_results.append((model_id, name, v))

**create a parent content by concatenating all the child for predicting parent section summary.**

In [26]:

def create_parent_content_including_sub_section(content_map):
    b_content_map = copy.deepcopy(content_map) # creating a copy of the map for adding summarized content

    # create a parent content by concatenating all the child
    for k, v in content_map.items():
        if v['content'] !="" and v['level'] >1 and v['parent']:
            new_col_key = 'content'
            
            parent_key = v['parent']
            parent_content = b_content_map[parent_key]['content']
            child_content = v['content']
            
            if new_col_key in b_content_map[parent_key]:
                b_content_map[parent_key][new_col_key] += child_content
            else:
                b_content_map[parent_key][new_col_key] = parent_content + child_content
    return b_content_map
                

In [27]:
def cleanup_content_map(a_content_map):
        # clean up the parent content            
    for k, v in a_content_map.items():
        
        if v['content'] !="" and v['level'] >1 and v['parent']:
            new_col_key = 'content'
            parent_key = v['parent']
            
            removed_value = a_content_map[parent_key].pop(new_col_key, 'No Key found')

### **Common method for summarizing transformer model**

In [28]:
def summarizeModel(name, b_content_map, classifier, debug=False, max_length_list=[150]):
    
    a_content_map = copy.deepcopy(b_content_map)
        
    new_colmns = collections.defaultdict(list) 
    debug_cnt =0
    for k, v in b_content_map.items():
        
        if v['content'] !="":
            length = len(v['content'])
            a_content_map[k]['total_content_length'] = length

            if length > 10:
                
                # for each of the max lengths generate summary, add it as a new key-val in the dictionary
                for m_length in max_length_list:
                    col_name = name+"_summary_"+ str(m_length)
                    
                    summary_m_length = classifier_default_t5_small(v['content'], max_length=m_length, min_length=10, do_sample=False)
                    a_content_map[k][col_name] = summary_m_length[0]['summary_text'].strip()
                    new_colmns[k].append(col_name)

                if debug:
                    debug_cnt +=1
                    if debug_cnt >1:
                        break
    
    cleanup_content_map(a_content_map)
            
    return a_content_map, new_colmns


### Sample content for processing summary model prediction

In [29]:

text_content_map = create_parent_content_including_sub_section(content_map)
ITRODUCTION = text_content_map['Introduction']['content']
# ITRODUCTION

In [30]:
ARTICLE = text_content_map['Purpose of SASB Standards']['content']

In [31]:
text_content_map.keys()

dict_keys(['Introduction', 'Purpose of SASB Standards', 'Overview of SASB Standards', 'Use of the Standards', 'Industry Description', 'Sustainability Disclosure Topics & Accounting Metrics', 'Greenhouse Gas Emissions', 'Air Quality', 'Energy Management', 'Water Management', 'Waste & Hazardous Materials Management', 'Biodiversity Impacts', 'Security, Human Rights & Rights of Indigenous Peoples', 'Community Relations', 'Labor Relations', 'Workforce Health & Safety', 'Business Ethics & Transparency', 'Tailings Storage Facilities Management', 'SUSTAINABILITY ACCOUNTING STANDARD | METALS & MINING | '])

list of summary length for assesing for assesing model performance

In [32]:
max_summary_lengths = [50,100,150,200,250]

## 2.1 Use Bert Model <a class="anchor" id="section_2_1"></a>

### 2.1.1 T5 small Bert Model <a class="anchor" id="section_2_1_1"></a>

In [33]:

t5_model_id = "t5-small"
classifier_default_t5_small = pipeline("summarization", model=t5_model_id)

In [34]:
print(classifier_default_t5_small(ARTICLE, max_length=130, min_length=10, do_sample=False))

[{'summary_text': 'the term "sustainability" in the SASB Standards refers to corporate activities that maintain or enhance the company\'s ability to create value over the long term . the SASBStandards also refer to sustainability as "ESG"'}]


In [35]:

t5_content_map, t5_new_colmns = summarizeModel("default_t5", text_content_map, classifier_default_t5_small, debug=True, max_length_list=max_summary_lengths)
print(t5_new_colmns)


Token indices sequence length is longer than the specified maximum sequence length for this model (2671 > 512). Running this sequence through the model will result in indexing errors


defaultdict(<class 'list'>, {'Introduction': ['default_t5_summary_50', 'default_t5_summary_100', 'default_t5_summary_150', 'default_t5_summary_200', 'default_t5_summary_250'], 'Purpose of SASB Standards': ['default_t5_summary_50', 'default_t5_summary_100', 'default_t5_summary_150', 'default_t5_summary_200', 'default_t5_summary_250']})


In [36]:
saveInResultsList(t5_content_map, 'Introduction', {'Introduction':['default_t5_summary_150']}, "t5-small")

In [37]:
t5_content_map['Introduction']

{'page_no': '4',
 'level': 1,
 'total_content_length': 10968,
 'default_t5_summary_50': 'SASB Standards are designed to identify a minimum set of sustainability issues most likely to impact the operating performance or financial condition of the typical company in an industry, regardless of location . each accounting metric is accompanied by a technical protocol',
 'default_t5_summary_100': 'SASB Standards are designed to identify a minimum set of sustainability issues most likely to impact the operating performance or financial condition of the typical company in an industry, regardless of location . each accounting metric is accompanied by a technical protocol that provides guidance for third-party assurance .',
 'default_t5_summary_150': 'SASB Standards are designed to identify a minimum set of sustainability issues most likely to impact the operating performance or financial condition of the typical company in an industry, regardless of location . each accounting metric is accompan

In [38]:
saveInResultsList(t5_content_map, 'Purpose of SASB Standards', {'Purpose of SASB Standards':['default_t5_summary_150']}, "t5-small")
t5_content_map['Purpose of SASB Standards']

{'page_no': '4',
 'level': 2,
 'parent': 'Introduction',
 'content': 'The use of the term “sustainability” in the SASB Standards refers to corporate activities that maintain or enhance theability of the company to create value over the long term. Sustainability accounting reflects the governance andmanagement of a company’s environmental and social impacts arising from production of goods and services, as well asits governance and management of the environmental and social capitals necessary to create long-term value. The SASBStandards also refer to sustainability as “ESG” (environmental, social, and governance), though traditional corporategovernance issues such as board composition are not included within the scope of the SASB Standards.SASB Standards are designed to identify a minimum set of sustainability issues most likely to impact the operatingperformance or financial condition of the typical company in an industry, regardless of location. SASB Standards aredesigned to enable co

In [39]:
saveInExcel(t5_content_map, "content_map_default_t5_modelling", True)

**from above two example summary It is evident that after max_length 150 the summary generation remains constant, hencehorth keeping max length as 150**

### 2.1.2 MT5 Amazon fine tuned Bert Model <a class="anchor" id="section_2_1_2"></a>

In [40]:
hub_model_id = "huggingface-course/mt5-small-finetuned-amazon-en-es"
summarizer_amazon_en_es = pipeline("summarization", model=hub_model_id)

In [41]:
summarizer_amazon_en_es_map, amazon_new_colmns = summarizeModel("summarizer_amazon_en_es", text_content_map, summarizer_amazon_en_es, debug=True)


In [42]:
saveInResultsList(summarizer_amazon_en_es_map, 'Introduction', amazon_new_colmns, "mt5-small-finetuned-amazon-en-es")
summarizer_amazon_en_es_map['Introduction']

{'page_no': '4',
 'level': 1,
 'total_content_length': 10968,
 'summarizer_amazon_en_es_summary_150': 'SASB Standards are designed to identify a minimum set of sustainability issues most likely to impact the operating performance or financial condition of the typical company in an industry, regardless of location . each accounting metric is accompanied by a technical protocol that provides guidance for third-party assurance .'}

In [43]:
saveInResultsList(summarizer_amazon_en_es_map, 'Purpose of SASB Standards', amazon_new_colmns, "mt5-small-finetuned-amazon-en-es")
summarizer_amazon_en_es_map['Purpose of SASB Standards']

{'page_no': '4',
 'level': 2,
 'parent': 'Introduction',
 'content': 'The use of the term “sustainability” in the SASB Standards refers to corporate activities that maintain or enhance theability of the company to create value over the long term. Sustainability accounting reflects the governance andmanagement of a company’s environmental and social impacts arising from production of goods and services, as well asits governance and management of the environmental and social capitals necessary to create long-term value. The SASBStandards also refer to sustainability as “ESG” (environmental, social, and governance), though traditional corporategovernance issues such as board composition are not included within the scope of the SASB Standards.SASB Standards are designed to identify a minimum set of sustainability issues most likely to impact the operatingperformance or financial condition of the typical company in an industry, regardless of location. SASB Standards aredesigned to enable co

In [44]:
saveInExcel(summarizer_amazon_en_es_map, "content_map_summarizer_amazon_en_es_modelling", True)

## 2.2 Bart Model <a class="anchor" id="section_2_2"></a>

### 2.2.1 Default Bart Model <a class="anchor" id="section_2_2_1"></a>


In [45]:
distilbart_model_id = "sshleifer/distilbart-cnn-12-6"

classifier_default_distilbart = pipeline("summarization", model=distilbart_model_id)

In [46]:
print(classifier_default_distilbart(ARTICLE, max_length=130, min_length=10, do_sample=False))

[{'summary_text': ' Sustainability accounting reflects the governance and management of a company’s environmental and social impacts arising from production of goods and services . The use of the SASB Standards refers to corporate activities that maintain or enhance theability of the company to create value over the long term .'}]


In [47]:

distilbart_content_map, distilbart_new_colmns = summarizeModel("default_t5", text_content_map, classifier_default_t5_small, debug=True)
print(distilbart_new_colmns)

defaultdict(<class 'list'>, {'Introduction': ['default_t5_summary_150'], 'Purpose of SASB Standards': ['default_t5_summary_150']})


In [48]:
saveInResultsList(distilbart_content_map, 'Introduction', distilbart_new_colmns, distilbart_model_id)
distilbart_content_map['Introduction']

{'page_no': '4',
 'level': 1,
 'total_content_length': 10968,
 'default_t5_summary_150': 'SASB Standards are designed to identify a minimum set of sustainability issues most likely to impact the operating performance or financial condition of the typical company in an industry, regardless of location . each accounting metric is accompanied by a technical protocol that provides guidance for third-party assurance .'}

In [49]:
saveInResultsList(distilbart_content_map, 'Purpose of SASB Standards', distilbart_new_colmns, distilbart_model_id)
distilbart_content_map['Purpose of SASB Standards']

{'page_no': '4',
 'level': 2,
 'parent': 'Introduction',
 'content': 'The use of the term “sustainability” in the SASB Standards refers to corporate activities that maintain or enhance theability of the company to create value over the long term. Sustainability accounting reflects the governance andmanagement of a company’s environmental and social impacts arising from production of goods and services, as well asits governance and management of the environmental and social capitals necessary to create long-term value. The SASBStandards also refer to sustainability as “ESG” (environmental, social, and governance), though traditional corporategovernance issues such as board composition are not included within the scope of the SASB Standards.SASB Standards are designed to identify a minimum set of sustainability issues most likely to impact the operatingperformance or financial condition of the typical company in an industry, regardless of location. SASB Standards aredesigned to enable co

In [50]:
saveInExcel(distilbart_content_map, "content_map_distilbart_content_map_modelling", True)

### 2.2.2 Facebook Bart Model <a class="anchor" id="section_2_2_2"></a>

In [51]:
summarizer_fb_bart = pipeline("summarization", model="facebook/bart-large-cnn")


In [52]:
summarizer_fb_bart_map, fb_new_colmns = summarizeModel("summarizer_fb_bart", text_content_map, summarizer_fb_bart, debug=True)


In [53]:
saveInResultsList(summarizer_fb_bart_map, 'Introduction', fb_new_colmns, "facebook/bart-large-cnn")
summarizer_fb_bart_map['Introduction']

{'page_no': '4',
 'level': 1,
 'total_content_length': 10968,
 'summarizer_fb_bart_summary_150': 'SASB Standards are designed to identify a minimum set of sustainability issues most likely to impact the operating performance or financial condition of the typical company in an industry, regardless of location . each accounting metric is accompanied by a technical protocol that provides guidance for third-party assurance .'}

In [54]:
saveInResultsList(summarizer_fb_bart_map, 'Purpose of SASB Standards', fb_new_colmns, "facebook/bart-large-cnn")
summarizer_fb_bart_map['Purpose of SASB Standards']

{'page_no': '4',
 'level': 2,
 'parent': 'Introduction',
 'content': 'The use of the term “sustainability” in the SASB Standards refers to corporate activities that maintain or enhance theability of the company to create value over the long term. Sustainability accounting reflects the governance andmanagement of a company’s environmental and social impacts arising from production of goods and services, as well asits governance and management of the environmental and social capitals necessary to create long-term value. The SASBStandards also refer to sustainability as “ESG” (environmental, social, and governance), though traditional corporategovernance issues such as board composition are not included within the scope of the SASB Standards.SASB Standards are designed to identify a minimum set of sustainability issues most likely to impact the operatingperformance or financial condition of the typical company in an industry, regardless of location. SASB Standards aredesigned to enable co

In [55]:
saveInExcel(summarizer_fb_bart_map, "content_map_summarizer_fb_bart_modelling", True)

## 2.3 Use GPT-3 Model <a class="anchor" id="section_2_3"></a>


In [56]:
def summarizeOpenAI(text, max_tokens=64, engine="text-davinci-002",temperature=0.7):
    
    openai.organization = OPEN_AI_ORG
    openai.api_key = OPEN_AI_API_KEY
    print(openai.organization)
    print(openai.api_key)
    engine_list = openai.Engine.list() 
    response = openai.Completion.create(
      engine=engine,
      prompt=text,
      temperature=temperature,
      max_tokens=max_tokens,
      top_p=1,
      frequency_penalty=0,
      presence_penalty=0
    )
    return response


In [57]:

def summarizeModelOpenAI(name, b_content_map, engine="davinci",temperature=0.3, size=-1, debug=False, max_length_list=[150]):
        
    a_content_map = copy.deepcopy(b_content_map)

    new_colmns = collections.defaultdict(list) 
    debug_cnt =0
    for k, v in b_content_map.items():
        
        if v['content'] !="":
            length = len(v['content'])
            a_content_map[k]['total_content_length'] = length

            if length > 10:
                
                # for each of the max lengths generate summary, add it as a new key-val in the dictionary
                for m_length in max_length_list:
                    col_name = name+"_summary_"+ str(m_length)
                    
                    summary_m_length = summarizeOpenAI(v['content'][:size], m_length, engine, temperature)
                    a_content_map[k][col_name] = summary_m_length['choices'][0]['text'].strip()
                    new_colmns[k].append(col_name)

                if debug:
                    debug_cnt +=1
                    if debug_cnt >1:
                        break
    # clean up the parent content            
    cleanup_content_map(a_content_map)            
            
    return a_content_map, new_colmns


### 2.3.1 Open_ai_davinici_002 Model <a class="anchor" id="section_2_3_1"></a>


In [58]:
# summarizer_open_ai_davinici_002_map, open_ai_002_new_colmns = summarizeModelOpenAI("open_ai_davinici_002", text_content_map,"text-davinci-002", 0.7, debug=True)


In [59]:
# saveInExcel(summarizer_open_ai_davinici_002_map, "content_map_open_ai_davinici_002_modelling", True)

In [60]:
file_davinci_002= "results/model_evaluation/content_map_open_ai_davinici_002_modelling_05_23_2022.xlsx"
file_davinci = "results/model_evaluation/content_map_open_ai_davinici_modelling_05_23_2022.xlsx"

In [61]:
summarizer_open_ai_davinici_002_map = pd.read_excel(file_davinci_002, index_col=0).to_dict()

In [62]:
vv = summarizer_open_ai_davinici_002_map['open_ai_davinici_002_summary_130']['Purpose of SASB Standards']
vv = vv.strip()
vv = vv.replace("\n","")
model_summary_results.append(("open_ai_davinici_002", 'Purpose of SASB Standards : '+'open_ai_davinici_002_summary_130', vv))

vv

'The SASB Standards are intended to be used by companies to prepare their sustainability disclosure for the benefit offinancial statement users. They should not be used to assess environmental or social performance, craft environmental orsocial strategies, or implement environmental or social programs.SASB Standards are not rules or regulations. They are not mandatory. Companies are not required to use the SASBStandards to prepare sustainability disclosure.The SASB Standards were developed based on the premise that companies that can better manage and communicatesustainability issues most likely to impact the operating performance or financial condition of the typical company'

### 2.3.2 Open_ai_davinici Model <a class="anchor" id="section_2_3_2"></a>


In [63]:
# summarizer_open_ai_davinici_map, open_ai_davinci_new_colmns = summarizeModelOpenAIGPT("open_ai_davinici", text_content_map,"davinci", 0.3, 1900, debug=True)


In [64]:
# saveInExcel(summarizer_open_ai_davinici_map, "content_map_open_ai_davinici_modelling", True)

In [65]:
summarizer_open_ai_davinici_map = pd.read_excel(file_davinci, index_col=0).to_dict()

In [66]:
vv = summarizer_open_ai_davinici_map['open_ai_davinici_summary_130']['Purpose of SASB Standards']
vv = vv.strip()
model_summary_results.append(("open_ai_davinici", 'Purpose of SASB Standards : '+'open_ai_davinici_summary_130', vv))
vv

'The SASB Standards are designed to be consistent with existing financial reporting standards and practices. SASBStandards are designed to be used in conjunction with existing financial reporting standards and practices. SASBStandards are designed to be used in conjunction with existing financial reporting standards and practices. SASBStandards are designed to be used in conjunction with existing financial reporting standards and practices. SASBStandards are designed to be used in conjunction with existing financial reporting standards and practices. SASBStandards are designed to be used in conjunction with existing financial reporting standards and practices. SASBStandards are designed to be used in conjunction with existing financial reporting standards and practices'

## 2.4 Bert Extractive Summarizer <a class="anchor" id="section_2_4"></a>

### 2.4.1 Bert-large-uncased Model <a class="anchor" id="section_2_4_1"></a>

In [67]:

model = Summarizer()

Some weights of the model checkpoint at bert-large-uncased were not used when initializing BertModel: ['cls.predictions.transform.LayerNorm.bias', 'cls.predictions.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [68]:
#"bert-large-uncased"
res = model(ARTICLE)
print(res)
result = model.run_embeddings(ARTICLE)  # Specified with ratio. 
print(result)
print("\n")

res = model(ARTICLE, ratio=0.2)
print(res)
result = model.run_embeddings(ARTICLE, ratio=0.2)  # Specified with ratio. 
print(result)
print("\n")

res = model(ARTICLE, num_sentences=3)
print(res)
result = model.run_embeddings(ARTICLE, num_sentences=3)  # Will return (3, N) embedding numpy matrix.
print(result)
print("\n")

The use of the term “sustainability” in the SASB Standards refers to corporate activities that maintain or enhance theability of the company to create value over the long term. Businesses can use the SASB Standards to better identify, manage, and communicate to investors sustainabilityinformation that is financially material.
[[-0.10613243 -0.16330948 -0.3940094  ...  0.2102791  -0.04488603
   0.3458741 ]
 [ 0.14696051 -0.41607744 -0.4576886  ...  0.1237503   0.45531017
   0.1342383 ]]


The use of the term “sustainability” in the SASB Standards refers to corporate activities that maintain or enhance theability of the company to create value over the long term. Businesses can use the SASB Standards to better identify, manage, and communicate to investors sustainabilityinformation that is financially material.
[[-0.10613243 -0.16330948 -0.3940094  ...  0.2102791  -0.04488603
   0.3458741 ]
 [ 0.14696051 -0.41607744 -0.4576886  ...  0.1237503   0.45531017
   0.1342383 ]]


The use of the

**from these various config, by using 3 sentences gives a well constructed summary, hence forth will be using the sentences.**

In [69]:
# saveInResultsList(summarizer_fb_bart_map, 'Purpose of SASB Standards', fb_new_colmns, "facebook/bart-large-cnn")
model_summary_results.append(("bert-large-uncased", 'Purpose of SASB Standards : '+'bert-large-uncased', res.strip()))


### 2.4.2 Facebook Bart Model <a class="anchor" id="section_2_4_2"></a>

In [70]:
model_distilbert_base_uncased = Summarizer('distilbert-base-uncased', hidden=[-1,-2], hidden_concat=True)

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertModel: ['vocab_projector.bias', 'vocab_layer_norm.weight', 'vocab_projector.weight', 'vocab_transform.weight', 'vocab_layer_norm.bias', 'vocab_transform.bias']
- This IS expected if you are initializing DistilBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [71]:
result = model_distilbert_base_uncased(ARTICLE, num_sentences=3)
model_summary_results.append(("distilbert-base-uncased", 'Purpose of SASB Standards : '+'distilbert-base-uncased', result.strip()))

print(result)

The use of the term “sustainability” in the SASB Standards refers to corporate activities that maintain or enhance theability of the company to create value over the long term. Sustainability accounting reflects the governance andmanagement of a company’s environmental and social impacts arising from production of goods and services, as well asits governance and management of the environmental and social capitals necessary to create long-term value. Businesses can use the SASB Standards to better identify, manage, and communicate to investors sustainabilityinformation that is financially material.


### 2.4.3 Paraphrase-MiniLM-L6-v2 Model <a class="anchor" id="section_2_4_3"></a>


In [72]:
model_paraphrase_MiniLM_L6_v2 = SBertSummarizer('paraphrase-MiniLM-L6-v2')

In [73]:
result = model_paraphrase_MiniLM_L6_v2(ARTICLE, num_sentences=3)
model_summary_results.append(("paraphrase-MiniLM-L6-v2", 'Purpose of SASB Standards : '+'paraphrase-MiniLM-L6-v2', result.strip()))

print(result)

The use of the term “sustainability” in the SASB Standards refers to corporate activities that maintain or enhance theability of the company to create value over the long term. Sustainability accounting reflects the governance andmanagement of a company’s environmental and social impacts arising from production of goods and services, as well asits governance and management of the environmental and social capitals necessary to create long-term value. SASB Standards aredesigned to enable communications on corporate performance on industry-level sustainability issues in a cost-effectiveand decision-useful manner using existing disclosure and reporting mechanisms.


## 2.5 Use Custom Model <a class="anchor" id="section_2_5"></a>


In [75]:

# Load model, model config and tokenizer via Transformers
custom_config = AutoConfig.from_pretrained('allenai/scibert_scivocab_uncased')
custom_config.output_hidden_states=True
custom_tokenizer = AutoTokenizer.from_pretrained('allenai/scibert_scivocab_uncased')
custom_model = AutoModel.from_pretrained('allenai/scibert_scivocab_uncased', config=custom_config)

model_custom = Summarizer(custom_model=custom_model, custom_tokenizer=custom_tokenizer)


Some weights of the model checkpoint at allenai/scibert_scivocab_uncased were not used when initializing BertModel: ['cls.predictions.transform.LayerNorm.bias', 'cls.predictions.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.bias', 'cls.predictions.decoder.bias', 'cls.predictions.transform.dense.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [76]:
result = model_custom(ARTICLE)
model_summary_results.append(("custom_scibert_scivocab_uncased", 'Purpose of SASB Standards : '+'scibert_scivocab_uncased', result.strip()))

print(result)

The use of the term “sustainability” in the SASB Standards refers to corporate activities that maintain or enhance theability of the company to create value over the long term. The SASBStandards also refer to sustainability as “ESG” (environmental, social, and governance), though traditional corporategovernance issues such as board composition are not included within the scope of the SASB Standards.


# 3. Finalize Model <a class="anchor" id="section_3"></a>


In [77]:
# displaying the content and its summary for verifying the model performance.
# print(ITRODUCTION)
print()
print(ARTICLE)


The use of the term “sustainability” in the SASB Standards refers to corporate activities that maintain or enhance theability of the company to create value over the long term. Sustainability accounting reflects the governance andmanagement of a company’s environmental and social impacts arising from production of goods and services, as well asits governance and management of the environmental and social capitals necessary to create long-term value. The SASBStandards also refer to sustainability as “ESG” (environmental, social, and governance), though traditional corporategovernance issues such as board composition are not included within the scope of the SASB Standards.SASB Standards are designed to identify a minimum set of sustainability issues most likely to impact the operatingperformance or financial condition of the typical company in an industry, regardless of location. SASB Standards aredesigned to enable communications on corporate performance on industry-level sustainabilit

In [78]:
for model_id, summary_key, summary_value in model_summary_results:
    print(model_id)
    print("-"*len(model_id))
    print()
    
    print(summary_key+":")
    print()
    
    print(summary_value)
#     print("-"*150)
    print()
    
    print("*"*100)
    print()

t5-small
--------

Introduction : default_t5_summary_150:

SASB Standards are designed to identify a minimum set of sustainability issues most likely to impact the operating performance or financial condition of the typical company in an industry, regardless of location . each accounting metric is accompanied by a technical protocol that provides guidance for third-party assurance .

****************************************************************************************************

t5-small
--------

Purpose of SASB Standards : default_t5_summary_150:

the term "sustainability" in the SASB Standards refers to corporate activities that maintain or enhance the company's ability to create value over the long term . the SASBStandards also refer to sustainability as "ESG"

****************************************************************************************************

mt5-small-finetuned-amazon-en-es
--------------------------------

Introduction : summarizer_amazon_en_es_summary_150

## 3.1 Analysis <a class="anchor" id="section_3_1"></a>




   1. The Four Models used ** t5-small, mt5-small-finetuned-amazon-en-es, sshleifer/distilbart-cnn-12-6, facebook/bart-large-cnn ** were from **transformers** library. All of these exhibit similar summary of the text data by extracting relevant sentences from the text and giving the output resonse.
    
    
   2. **open_ai_davinici_002** model paraphrased the sentence and provided well-structure ouput. To execute large set of data, API service needs to be upgrade with additional charges.
    
   
   3. **open_ai_davinici** model could not summarize the text, and returned same sentence seven times. hence it is not chosen.
    
   
   4. **MiniLM-L6-v2** model in the SBertSummarizer library paraphrased the input sentence, hence it is not chosen
    
   
   5. **bert-large-uncased** model provides a crisp description, **distilbert-base-uncased** tries to add few more words in the summary. both of them are seems to be good for the text. These are part of Summarizer library.
    
   
   6. Fine tuned Auto model returned definitions of the text.

**AS the distilbert-base-uncased model provides a crisp description choosing it as the final model.**

## 3.2 Final Model <a class="anchor" id="section_3_2"></a>


In [79]:
# do the complete summary extraction, save it in a file.

In [80]:

final_model = Summarizer('distilbert-base-uncased', hidden=[-1,-2], hidden_concat=True)


Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertModel: ['vocab_projector.bias', 'vocab_layer_norm.weight', 'vocab_projector.weight', 'vocab_transform.weight', 'vocab_layer_norm.bias', 'vocab_transform.bias']
- This IS expected if you are initializing DistilBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [81]:
def summarizeContentMap(name, b_content_map, summarizer_model, size=-1, debug=False, max_length_list=[150], num_sentences=3):
    
    a_content_map = copy.deepcopy(b_content_map)

    new_colmns = collections.defaultdict(list) 
    debug_cnt =0
    for k, v in b_content_map.items():
        
        if v['content'] !="":
            length = len(v['content'])
            a_content_map[k]['total_content_length'] = length

            if length > 10:
                
                # for each of the max lengths generate summary, add it as a new key-val in the dictionary
                for m_length in max_length_list:
                    col_name = "content_summary"
                    
                    result = summarizer_model(v['content'], num_sentences=num_sentences)
                    a_content_map[k][col_name] = result.strip()
                    new_colmns[k].append(col_name)

                if debug:
                    debug_cnt +=1
                    if debug_cnt >1:
                        break
    # clean up the parent content            
    cleanup_content_map(a_content_map)            
            
    return a_content_map, new_colmns


In [82]:
summarizer_distill_bert_map, distill_bert_new_colmns = summarizeContentMap('distilbert-base-uncased', text_content_map, final_model)


In [83]:
distill_bert_new_colmns

defaultdict(list,
            {'Introduction': ['content_summary'],
             'Purpose of SASB Standards': ['content_summary'],
             'Overview of SASB Standards': ['content_summary'],
             'Use of the Standards': ['content_summary'],
             'Industry Description': ['content_summary'],
             'Sustainability Disclosure Topics & Accounting Metrics': ['content_summary'],
             'Greenhouse Gas Emissions': ['content_summary'],
             'Air Quality': ['content_summary'],
             'Energy Management': ['content_summary'],
             'Water Management': ['content_summary'],
             'Waste & Hazardous Materials Management': ['content_summary'],
             'Biodiversity Impacts': ['content_summary'],
             'Security, Human Rights & Rights of Indigenous Peoples': ['content_summary'],
             'Community Relations': ['content_summary'],
             'Labor Relations': ['content_summary'],
             'Workforce Health & Safety': [

In [84]:
summarizer_distill_bert_map['Introduction']

{'page_no': '4',
 'level': 1,
 'total_content_length': 10968,
 'content_summary': 'The use of the term “sustainability” in the SASB Standards refers to corporate activities that maintain or enhance theability of the company to create value over the long term. A company determineswhich Standard(s) is relevant to the company, which disclosure topics are financially material to its business, and whichassociated metrics to report, taking relevant legal requirements into account1. Activity MetricsUNIT OFACTIVITY METRIC CATEGORY CODEMEASUREMetric tons (t)Production of (1) metal ores and (2) finished metal products Quantitative EM-MM-000.AsaleableNumber,Total number of employees, percentage contractors Quantitative EM-MM-000.BPercentage (%)SUSTAINABILITY ACCOUNTING STANDARD | METALS & MINING | 8'}

In [85]:
summarizer_distill_bert_map['Purpose of SASB Standards']

{'page_no': '4',
 'level': 2,
 'parent': 'Introduction',
 'content': 'The use of the term “sustainability” in the SASB Standards refers to corporate activities that maintain or enhance theability of the company to create value over the long term. Sustainability accounting reflects the governance andmanagement of a company’s environmental and social impacts arising from production of goods and services, as well asits governance and management of the environmental and social capitals necessary to create long-term value. The SASBStandards also refer to sustainability as “ESG” (environmental, social, and governance), though traditional corporategovernance issues such as board composition are not included within the scope of the SASB Standards.SASB Standards are designed to identify a minimum set of sustainability issues most likely to impact the operatingperformance or financial condition of the typical company in an industry, regardless of location. SASB Standards aredesigned to enable co

In [86]:
# toc_map

In [87]:
# summarizer_distill_bert_map

## 3.3 Save in output file <a class="anchor" id="section_3_3"></a>

Using tocList, and summarizer_distill_bert_map create a output in a text a file.
texts are wrapped to readable format.


In [88]:

def summary_wrap(value, level=1):
    wrapper = textwrap.TextWrapper(width=75)
  
    wrap_list = wrapper.wrap(value)
    
    return [item + "\n" + " "*level for item in wrap_list]

In [89]:
result_summary_data = []
for toc in tocList:
    title = toc['title']
    if title in summarizer_distill_bert_map:
        content_summary = summarizer_distill_bert_map[title]['content_summary']
        result_summary_data.append(title)
        result_summary_data.append("\n")
        result_summary_data.append("-"*len(title))
        result_summary_data.append("\n"*2)
        result_summary_data.append(" ")
        result_summary_data.extend(summary_wrap(content_summary))
        result_summary_data.append("\n"*2)

        # check for subsection
        if toc['child']:
            for child in toc['child']:
                child_title = child['title']
                child_content_summary = summarizer_distill_bert_map[child_title]['content_summary']

                result_summary_data.append(" ")
                result_summary_data.append(child_title)
                result_summary_data.append("\n")
                result_summary_data.append(" ")
                result_summary_data.append("-"*len(child_title))
                result_summary_data.append("\n"*2)
                result_summary_data.append(" "*2)
                result_summary_data.extend(summary_wrap(child_content_summary, 2))
                result_summary_data.append("\n"*2)

        result_summary_data.append("\n"*2)


In [92]:
text_file = open(OUTPUT_FILE_PATH, "wt")
n = text_file.write(''.join(result_summary_data))
text_file.close()

# 4. Conclusion <a class="anchor" id="section_4"></a>



1. The models in the transformers library evidently generate summary upto 150 characters, and they are often the definition of the key terms in the input content text.


2. GPT 3 model **davinci 002** summarizes the text good enough, however it is a subscription based API.


3. SBertSummarizer based miniLM-L6-v2 model paraphrases the input, and for a large corpus summary is expected to be shorter and conceptual text, hence it is not suitable for this summariser.


4. BERT architecture based text-summarisers such as **bert-large-uncased**, **distilbert-base-uncased** performed very well, for the base test, for complete PDF summary later model is chosen as it included key terms as well in the summary. **AS the distilbert-base-uncased model provides a crisp description choosing it as the final model.**


5. Fine tuned Auto model from transformer seemed to return basic definition of the key words.


6. The Various comparsion of the summary is captured in "results/comparison_model_results.txt" and summary is captured in "results/summary/Metals_Mining_Standard_2021_Summary.txt" files.
