<a href="https://colab.research.google.com/github/LordLean/Extracting-Green-Bonds-Use-of-Proceeds/blob/main/GB.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Document Scraper 

In [1]:
!pip install rank-bm25

!pip install PyPDF2

!pip install tabula-py

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting rank-bm25
  Downloading rank_bm25-0.2.2-py3-none-any.whl (8.6 kB)
Installing collected packages: rank-bm25
Successfully installed rank-bm25-0.2.2
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting PyPDF2
  Downloading PyPDF2-2.10.0-py3-none-any.whl (208 kB)
Installing collected packages: PyPDF2
Successfully installed PyPDF2-2.10.0
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting tabula-py
  Downloading tabula_py-2.4.0-py3-none-any.whl (12.0 MB)
Collecting distro
  Downloading distro-1.7.0-py3-none-any.whl (20 kB)
Installing collected packages: distro, tabula-py
Successfully installed distro-1.7.0 tabula-py-2.4.0


In [2]:
# !wget https://www.amppartners.org/docs/default-source/investors/annual-reports/2020/2020_amp_sustainability_report.pdf

# !wget https://www.icmagroup.org/Emails/icma-vcards/amp_combined%20hydro%20projects_External%20Review%20report.pdf

!wget https://www.globalworth.com/wp-content/uploads/2021/07/Globalworth-Green-Bond-Report-2020-20-July-2021.pdf

--2022-08-07 12:27:29--  https://www.globalworth.com/wp-content/uploads/2021/07/Globalworth-Green-Bond-Report-2020-20-July-2021.pdf
Resolving www.globalworth.com (www.globalworth.com)... 195.242.93.66
Connecting to www.globalworth.com (www.globalworth.com)|195.242.93.66|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1021145 (997K) [application/pdf]
Saving to: ‘Globalworth-Green-Bond-Report-2020-20-July-2021.pdf’


2022-08-07 12:27:31 (695 KB/s) - ‘Globalworth-Green-Bond-Report-2020-20-July-2021.pdf’ saved [1021145/1021145]



In [3]:
import numpy as np

import tabula
from rank_bm25 import BM25Okapi
from PyPDF2 import PdfReader

In [127]:
class TableReader:

  def __init__(self, pdf):
    self.pdf = pdf
    self.dfs = None

  def read_pages(self, pages="all", multiple_tables=True, stream=True):
    '''
    Return tables discovered within pdf.
    '''
    self.dfs = tabula.read_pdf(self.pdf, pages=pages, multiple_tables=multiple_tables, stream=stream)
    self.__clean_dfs()
    return self.dfs

  def __clean_dfs(self, thresh=2):
    self.dfs = [df.dropna(thresh=thresh) for df in self.dfs]


class Reader:

  def __init__(self, filename):
    self.reader = PdfReader(filename)
    self.tb = TableReader(filename)
    self.page_viewer = {page_num : {} for page_num in range(self.reader.numPages)}
    self.idx2page_item = []
  
  def __extract_text(self,):
    '''
    Page-wise text extraction and tokenize for BM25.
    '''
    # List to store the tokenized corpus of each page.
    tokenized_corpus_list = []
    for i in range(self.reader.numPages):
      raw_text = self.reader.getPage(i).extractText()
      self.page_viewer[i]["raw_text"] = raw_text
      # Split text
      corpus = raw_text.split("\n \n")
      # Store results.
      self.page_viewer[i]["corpus"] = corpus
      for item in corpus:
        self.idx2page_item.append((i, item)) # page,textItem
      # Tokenize
      tokenized_corpus = [doc.split(" ") for doc in corpus]
      tokenized_corpus_list.append(tokenized_corpus)
    # Build the BM25 index only after the tokenized corpus of every page is collected.
    # Flatten the per-page lists into a single list of tokenized text items.
    tokenized_corpus_complete = [item for sublist in tokenized_corpus_list for item in sublist]
    # BM25
    self.bm25 = BM25Okapi(tokenized_corpus_complete)

  def __extract_tables(self):
    '''
    Page-wise table extractor.
    '''
    for i in range(self.reader.numPages):
      # page=0 will throw error using tabula.
      page = str(i+1)
      self.page_viewer[i]["tables"] = self.tb.read_pages(pages=page)

  def extract_pdf(self):
    # Extract data
    self.__extract_text()
    self.__extract_tables()

  def print_page(self, page_num):
    '''
    Print separated sections of text given a page.
    '''
    corpus = self.page_viewer[page_num]["corpus"]
    for item in (corpus):
      print("\n{}\n".format("-"*60))
      print(item)
    print("\n{}\n".format("-"*60))
    for df in self.page_viewer[page_num]["tables"]:
      # display() renders the DataFrame; printing df.style only shows the Styler's repr.
      display(df)

  def __score(self, queries, weights):
    '''
    Compute the (weighted) average BM25 score of every text item across all given queries.
    '''
    self.ranked_scores = []
    for query in queries:
      # tokenize query by whitespace.
      tokenized_query = query.split()
      # Compute score.
      doc_scores = self.bm25.get_scores(tokenized_query)
      self.ranked_scores.append(doc_scores)
    # Compute average (weighted) score against all queries.
    if not len(weights):
      # Equal weighting.
      self.average_score = np.average(self.ranked_scores, axis=0)
    elif len(queries) != len(weights):
        # Unequal number of elements.
        raise ValueError("Number of query and weight elements passed must be equal.")
    else:
      # Weighted average.
      self.average_score = np.average(self.ranked_scores, weights=weights, axis=0)
 
  def get_ranked_texts(self, queries, weights=(), n=5):
    '''
    Return the n text items which scored highest using BM25,
    each as [page_num, text, tables_on_that_page].
    '''
    # Run score method to calculate BM25.
    self.__score(queries, weights)
    # Indices of the n best-scoring text items, highest score first.
    idx = sorted(range(len(self.average_score)), key=lambda i: self.average_score[i], reverse=True)[:n]

    return [list(self.idx2page_item[i]) + [self.page_viewer[self.idx2page_item[i][0]]["tables"]] for i in idx]
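
The `Reader` class tokenizes with `str.split(" ")`, which is case-sensitive and leaves punctuation attached, so a query token such as `proceeds` does not match `Proceeds:` in the extracted text. The block below is a minimal sketch of a normalising tokenizer that could be applied to both the corpus and the queries. The `tokenize` helper is hypothetical (it is not part of the notebook) and it assumes ASCII-only matching is acceptable.

```python
import re

def tokenize(text):
    """Lower-case the text and keep only runs of ASCII letters and digits."""
    return re.findall(r"[a-z0-9]+", text.lower())

# Punctuation and bullet characters are dropped, case is normalised.
print(tokenize("Use of Proceeds:  • The eligible categories"))
```

Symbols such as `€` and accented letters are discarded by this pattern, so a broader character class may be preferable for multilingual reports.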

In [129]:
filename = "Globalworth-Green-Bond-Report-2020-20-July-2021.pdf"

reader = Reader(filename)

reader.extract_pdf()


In [130]:
queries = [
    "use of proceeds",
    "allocation of proceeds",
    # "projects financed"
    ]

top_n = 5
top_items = reader.get_ranked_texts(queries, n=top_n)
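
`get_ranked_texts` also accepts a `weights` list, which is passed to `np.average` so that some queries count more than others. The sketch below reproduces that weighted mean in plain Python to make the arithmetic visible. The `weighted_average` helper and the numbers are illustrative assumptions, not values from the report.

```python
def weighted_average(score_rows, weights=()):
    """Column-wise (weighted) mean of per-query score lists."""
    if not score_rows:
        raise ValueError("At least one query is required.")
    if not weights:
        # No weights given: every query counts equally.
        weights = [1.0] * len(score_rows)
    if len(weights) != len(score_rows):
        raise ValueError("Number of query and weight elements passed must be equal.")
    total = sum(weights)
    return [
        sum(w * row[j] for w, row in zip(weights, score_rows)) / total
        for j in range(len(score_rows[0]))
    ]

# Two queries scored against three text items (made-up numbers).
scores = [[0.0, 2.0, 4.0], [1.0, 0.0, 2.0]]
print(weighted_average(scores))                  # equal weighting
print(weighted_average(scores, weights=[3, 1]))  # first query counts three times as much
```

In the notebook, the equivalent call would be `reader.get_ranked_texts(queries, weights=[3, 1], n=top_n)`, with one weight per query.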

In [145]:
for (page_num, text, tables) in top_items:
  print("Page: {}\n\n{}".format(page_num,text))
  for table in tables:
    display(table.style)
  print("-"*60)
  print("\n\n")

Page: 4

Globalworth in line with its commitment under it Green Bond Framework, to enable investors to 
follow its Green Bond progress, and to provide insight to prioritised areas, is providing this Green Bond 
update consisting of an Allocation Report and an Impact Report (where feasible) . 
The allocation of the green bond proceeds and compliance with the financ ing  or refinanc ing of  Eligible 
Green Projects  is subject to an annual external assurance by an independent third party, ERNST & 
YOUNG (HELLAS) Certified Auditors - Accountants S.A. (EY ), (as appended hereto).  

------------------------------------------------------------



Page: 4

Sustainal ytics Highlights on Globalworths’ Green Bond Framework (28 May 2020)  
Use of Proceeds:  • The eligible categories for the use of proceeds, Green Buildings and 
Energy Efficiency, are aligned with those recognized by the Green Bond 
Principles 2018.  
• Sustainalytics considers that the eligible categories wi ll lead to positive 

Unnamed: 0.1,Pre-screening based,Unnamed: 0,Unnamed: 1,Unnamed: 2,External review of,Unnamed: 3
0,,,Final validation by the,,Eligible Use of,
2,criteria by investment,,Green Bond,,Proceeds with the,
3,,,Committee (annually,,criteria displayed in,
5,,,or earlier if necessary),,the Green Bond,
9,,Eligible,,Validated,,Verified
10,STEP 1,assets,STEP 2,,STEP 3,assets


------------------------------------------------------------



Page: 9

6. Allocation of Proceeds  
Globalworth’s net proceeds from its inaugural Green Bond issue in July 2020  were €386.5 million, of 
which €376.9 million or c. 97. 5% have been allocated in standing and under 
refurbishment/ construction  properties. The remainder unallocated proceeds of € 9.6  million, will be 
dedicated to financing an office  project  currently under construction  which is expected to be delivered  
in 2021.  
The Green Bond Committee decided to allocate the proceeds as follows:  
Allocations:        
 Country  No of 
Buildings  Status  Certification Level  Allocated  
Amounts 
(€m)  Unallocated  
Amounts  
(€m)  
Globalworth Campus  Romania  3 Standing  BREEAM Excellent  198.4  - 
Globalworth Square  Romania  1 Under  
Con struc tion   [BREEAM Outstanding]  40.0  9.6  
Podium Park  A Poland  1 Standing  BREEAM Outstanding  40.3  - 
Renoma  Poland  1 Under 
Refurbishment  BREEAM Excellent  98.3  -

Unnamed: 0.1,Allocations:,Unnamed: 0,Unnamed: 1,Unnamed: 2,Unnamed: 3,Unnamed: 4,Unnamed: 5
0,,Country,No of,Status,Certification Level,Allocated,Unallocated
1,,,Buildings,,,Amounts,Amounts
2,,,,,,(€m),(€m)
3,Globalworth Campus,Romania,3,Standing,BREEAM Excellent,198.4,-
4,Globalworth Square,Romania,1,Under,[BREEAM Outstanding],40.0,9.6
6,Podium Park A,Poland,1,Standing,BREEAM Outstanding,40.3,-
7,Renoma,Poland,1,Under,BREEAM Excellent,98.3,-
9,TOTAL:,,,,,376.9,9.6


------------------------------------------------------------



Page: 6

• Individual measures : Individual measures reducing energy use and/or 
carbon emissions for the operational phase of the building. A list of 
eligible indivi dual measures can be found under Appendix 1 of the Green 
Bond Framework.  


Unnamed: 0.1,Pre-screening based,Unnamed: 0,Unnamed: 1,Unnamed: 2,External review of,Unnamed: 3
0,,,Final validation by the,,Eligible Use of,
2,criteria by investment,,Green Bond,,Proceeds with the,
3,,,Committee (annually,,criteria displayed in,
5,,,or earlier if necessary),,the Green Bond,
9,,Eligible,,Validated,,Verified
10,STEP 1,assets,STEP 2,,STEP 3,assets


------------------------------------------------------------





# Legacy

In [None]:
# class TableReader:

#   def __init__(self, pdf):
#     self.pdf = pdf
#     self.dfs = None

#   def read_pages(self, pages="all", multiple_tables=True, stream=True):
#     '''
#     Return tables discovered within pdf.
#     '''
#     self.dfs = tabula.read_pdf(self.pdf, pages=pages, multiple_tables=multiple_tables, stream=stream)
#     self.__clean_dfs()
#     print("\n {} tables detected.".format(len(self.dfs)))
#     return self.dfs

#   def __clean_dfs(self, thresh=2):
#     self.dfs = [df.dropna(thresh=thresh) for df in self.dfs]
    

# class Reader:

#   def __init__(self, filename):
#     # self.filename = filename
#     self.reader = PdfReader(filename)
#     self.tb = TableReader(filename)
#     # Text data
#     self.corpus = ""
#     # Tabular data
#     self.tables = None
#     # Extract data
#     self.__extract_text()
#     self.__extract_tables()
  
#   def __extract_text(self,):
#     '''
#     Extract text from page and tokenize for BM25.
#     '''
#     text = ""
#     for i in range(self.reader.numPages):
#       text +=self.reader.getPage(i).extractText()
#     # Split text
#     self.corpus = text.split("\n \n")
#     # Tokenize
#     self.tokenized_corpus = [doc.split(" ") for doc in self.corpus]
#     self.bm25 = BM25Okapi(self.tokenized_corpus)

#   def __extract_tables(self):
#     '''
#     Extract all tables from pdf.
#     '''
#     self.tables = self.tb.read_pages()

#   def print_corpus(self):
#     '''
#     Print separated sections of text.
#     '''
#     for item in (self.corpus):
#       print("\n{}\n".format("-"*60))
#       print(item)

#   def __score(self, queries, weights):
#     '''
#     Compute the average BM25 score of each given query on the whole text.
#     '''
#     # List to store individual scores (per query).
#     self.ranked_scores = []
#     for query in queries:
#       tokenized_query = query.split()
#       # Compute score.
#       doc_scores = self.bm25.get_scores(tokenized_query)
#       self.ranked_scores.append(doc_scores)
#     # Compute average (weighted) score against all queries.
#     if not len(weights):
#       # Equal weighting.
#       self.avg_score = np.average(self.ranked_scores, axis=0)
#     elif len(queries) != len(weights):
#       # Unequal number of elements.
#       raise ValueError("Number of query and weight elements passed must be equal.")
#     else:
#       # Weighted average.
#       self.avg_score = np.average(self.ranked_scores, weights=weights, axis=0)

#   def get_ranked_texts(self, queries, weights=[], n=5):
#     '''
#     Return n corpus items which scored highest using BM25.
#     '''
#     self.__score(queries, weights)
#     idx = (-self.avg_score).argsort()[:n]
#     return {self.avg_score[id] : self.corpus[id] for id in idx}

In [None]:
# class TableReader:

#   def __init__(self, pdf):
#     self.pdf = pdf
#     self.dfs = None

#   def read_pages(self, pages="all", multiple_tables=True, stream=True):
#     '''
#     Return tables discovered within pdf.
#     '''
#     self.dfs = tabula.read_pdf(self.pdf, pages=pages, multiple_tables=multiple_tables, stream=stream)
#     self.__clean_dfs()
#     return self.dfs

#   def __clean_dfs(self, thresh=2):
#     self.dfs = [df.dropna(thresh=thresh) for df in self.dfs]
    

# class Reader:

#   def __init__(self, filename):
#     self.reader = PdfReader(filename)
#     self.tb = TableReader(filename)
#     self.page_viewer = {page_num : {} for page_num in range(self.reader.numPages)}
  
#   def __extract_text(self,):
#     '''
#     Page-wise text extraction and tokenize for BM25.
#     '''
#     for i in range(self.reader.numPages):
#       raw_text = self.reader.getPage(i).extractText()
#       self.page_viewer[i]["raw_text"] = raw_text
#       # Split text
#       corpus = raw_text.split("\n \n")
#       # Tokenize
#       tokenized_corpus = [doc.split(" ") for doc in corpus]
#       # BM25
#       bm25 = BM25Okapi(tokenized_corpus)
#       # Store results.
#       self.page_viewer[i]["corpus"] = corpus
#       self.page_viewer[i]["tokenized_corpus"] = tokenized_corpus
#       self.page_viewer[i]["bm25"] = bm25

#   def __extract_tables(self):
#     '''
#     Page-wise table extractor.
#     '''
#     for i in range(self.reader.numPages):
#       # page=0 will throw error using tabula.
#       page = str(i+1)
#       self.page_viewer[i]["tables"] = self.tb.read_pages(pages=page)

#   def extract_pdf(self):
#     # Extract data
#     self.__extract_text()
#     self.__extract_tables()

#   def print_page(self, page_num):
#     '''
#     Print separated sections of text given a page.
#     '''
#     corpus = self.page_viewer[page_num]["corpus"]
#     for item in (corpus):
#       print("\n{}\n".format("-"*60))
#       print(item)
#     print("\n{}\n".format("-"*60))
#     for df in self.page_viewer[page_num]["tables"]:
#       print(df.style)
#       print(df)

#   def __score(self, queries, weights):
#     '''
#     Compute the average BM25 score of each given query on each page of text.
#     '''
#     for i in range(self.reader.numPages):
#       self.page_viewer[i]["ranked_scores"] = []
#       for query in queries:
#         # tokenize query by whitespace.
#         tokenized_query = query.split()
#         # Compute score.
#         doc_scores = self.page_viewer[i]["bm25"].get_scores(tokenized_query)
#         self.page_viewer[i]["ranked_scores"].append(doc_scores)
#       # Compute average (weighted) score against all queries.
#       if not len(weights):
#         # Equal weighting.
#         self.page_viewer[i]["average_score"] = np.average(self.page_viewer[i]["ranked_scores"], axis=0)
#       elif len(queries) != len(weights):
#           # Unequal number of elements.
#           raise ValueError("Number of query and weight elements passed must be equal.")
#       else:
#         # Weighted average.
#         self.page_viewer[i]["average_score"] = np.average(self.page_viewer[i]["ranked_scores"], weights=weights, axis=0)

#   def get_ranked_texts(self, queries, weights=[], n=5):
#     '''
#     Return n pages which scored highest using BM25.
#     '''
#     # Run score method to calculate BM25.
#     self.__score(queries, weights)
#     # Storage for all calculated scores.
#     top_scores = {}
#     for page_index in range(self.reader.numPages):
#       for item_index, score in enumerate(self.page_viewer[page_index]["average_score"]):
#         # Store each score by page and item.
#         top_scores[(page_index, item_index)] = score
    
#     # idx = (-self.page_viewer[i]["average_score]).argsort()[:n]
#     # return {self.page_viewer[i]["average_score][id] : self.corpus[id] for id in idx}

In [None]:
# Requires the tika package (pip install tika) and a Java runtime.
from tika import parser

filename = "amp_combined hydro projects_External Review report.pdf"

parsed_pdf = parser.from_file(filename)

data = parsed_pdf['content'] 

2022-08-02 14:18:59,154 [MainThread  ] [INFO ]  Retrieving http://search.maven.org/remotecontent?filepath=org/apache/tika/tika-server/1.24/tika-server-1.24.jar to /tmp/tika-server.jar.
2022-08-02 14:18:59,673 [MainThread  ] [INFO ]  Retrieving http://search.maven.org/remotecontent?filepath=org/apache/tika/tika-server/1.24/tika-server-1.24.jar.md5 to /tmp/tika-server.jar.md5.
2022-08-02 14:19:00,046 [MainThread  ] [WARNI]  Failed to see startup log message; retrying...


In [None]:
# Printing of content 
# print(data)

In [None]:
splitted = data.split("\n \n")
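
Splitting on the exact string `"\n \n"` misses separators with no space, several spaces, or runs of blank lines, which is why the output below is dominated by empty items. A minimal sketch of a more tolerant splitter follows. `split_paragraphs` is a hypothetical helper and the sample string is made up to resemble the Tika output.

```python
import re

def split_paragraphs(text):
    """Split on blank lines that may contain stray whitespace; drop empty items."""
    parts = re.split(r"\n\s*\n", text)
    return [part.strip() for part in parts if part.strip()]

sample = (
    "Title of the document\n \n\n\n"
    "www.sustainalytics.com \n\n"
    "AMERICAN MUNICIPAL \nPOWER, INC. "
)
print(split_paragraphs(sample))
```

Single line breaks inside a paragraph are preserved, so a heading wrapped over two lines stays in one item.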

In [None]:
for item in splitted:
  print("\n\n")
  print(item)

Title of the document

www.sustainalytics.com 




www.sustainalytics.com 




www.sustainalytics.com 




Charlotte Peyraud  

Senior Advisor, Institutional Relations (New York) 

charlotte.peyraud@sustainalytics.com 

(+1) 646 518 0184 









Vikram Puppala  

AMERICAN MUNICIPAL 
POWER, INC. 

COMBINED HYDROELECTRIC 
PROJECTS  

REVENUE BONDS, SERIES 
2016A (GREEN BONDS) 









FRAMEWORK OVERVIEW AND SECOND OPINION BY 
SUSTAINALYTICS 




 




August 31st, 2016 

Marion Oliver 

Manager, Advisory Services (Toronto) 

marion.oliver@sustainalytics.com 

(+1) 647 317 3644 









Vikram Puppala  

Manager,  

http://www.sustainalytics.com/
http://www.sustainalytics.com/
http://www.sustainalytics.com/


© Sustainalytics 2016 




 

2 

TABLE OF CONTENTS 

FRAMEWORK OVERVIEW AND SECOND OPINION BY SUSTAINALYTICS 1 

1. Preface 3 

2. Introduction 3 

3. F

In [None]:
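processed_text = []
curr_para = ""

for i, line in enumerate(splitted):
  if len(line) > 2:
    # Substantive line: extend the current paragraph.
    curr_para += line
  elif i < len(splitted) - 2 and (len(splitted[i+1]) > 2 or len(splitted[i+2]) > 2):
    # Short separator followed by more text: close the current paragraph.
    processed_text.append(curr_para)
    curr_para = ""

# Keep the text accumulated after the last separator.
if curr_para:
  processed_text.append(curr_para)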

In [None]:
final_text = []
curr = ""

for i, text in enumerate(processed_text):
  if len(processed_text[i]) > 50 and len(processed_text[i+1])

['',
 '',
 'Title of the document',
 '',
 'www.sustainalytics.com ',
 '',
 'www.sustainalytics.com ',
 '',
 'www.sustainalytics.com ',
 '',
 'Charlotte Peyraud  ',
 'Senior Advisor, Institutional Relations (New York) ',
 'charlotte.peyraud@sustainalytics.com ',
 '(+1) 646 518 0184 ',
 '',
 'Vikram Puppala  ',
 'AMERICAN MUNICIPAL POWER, INC. ',
 'COMBINED HYDROELECTRIC PROJECTS  ',
 'REVENUE BONDS, SERIES 2016A (GREEN BONDS) ',
 '',
 'FRAMEWORK OVERVIEW AND SECOND OPINION BY SUSTAINALYTICS ',
 '',
 'August 31st, 2016 ',
 'Marion Oliver ',
 'Manager, Advisory Services (Toronto) ',
 'marion.oliver@sustainalytics.com ',
 '(+1) 647 317 3644 ',
 '',
 'Vikram Puppala  ',
 'Manager,  ',
 'http://www.sustainalytics.com/http://www.sustainalytics.com/http://www.sustainalytics.com/',
 '',
 '© Sustainalytics 2016 ',
 '',
 'TABLE OF CONTENTS ',
 'FRAMEWORK OVERVIEW AND SECOND OPINION BY SUSTAINALYTICS 1 ',
 '1. Preface 3 ',
 '2. Introduction 3 ',
 '3. Framework overview 3 3.1 Use of Proceeds 3 3.2 

In [None]:

splitted = [line for line in splitted if len(line) > 2]

# Iterating over a string yields single characters, not lines.
[x for x in data]


In [None]:
reader = PdfReader(filename)

text = ""

# extracting text from page
for i in range(reader.numPages):
    text +=reader.getPage(i).extractText()

In [None]:
text = text.split("\n \n")

In [None]:
for item in (text):
  print("\n---------------------------------------------------------------\n")
  print(item)

In [None]:
corpus = text

tokenized_corpus = [doc.split(" ") for doc in corpus]

bm25 = BM25Okapi(tokenized_corpus)

In [None]:
queries = [
    "use of proceeds",
    "allocation of proceeds",
    "projects financed"
]

scores = []


for query in queries:

  tokenized_query = query.split()
  doc_scores = bm25.get_scores(tokenized_query)
  scores.append(doc_scores)


avg_score = np.mean(scores, axis=0)

In [None]:
avg_score

array([0.        , 0.28200893, 0.        , 0.23975133, 0.        ,
       0.29836719, 0.        , 0.        , 0.        , 0.        ,
       0.29475909, 1.76406565, 0.        , 3.42232876, 0.        ,
       0.        , 2.47046447, 0.        , 0.34791921, 0.26030342,
       2.34655543, 0.33686379, 1.44553162, 0.38705052, 1.40436426,
       3.23931942, 0.        , 0.        , 0.27876912, 0.35367899,
       0.18561977, 0.        , 0.04113845, 0.        , 0.        ,
       1.95499172, 0.28318963, 0.33254593, 0.30105638, 0.        ,
       0.        , 0.21556798, 1.30241004, 0.        , 1.0308614 ,
       0.        , 0.        , 0.6712968 , 0.        , 0.        ,
       0.22576717, 0.30986586, 0.        , 0.23144514, 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.8526635 , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        ])

In [None]:
idx = (-avg_score).argsort()[:5]
idx

array([13, 25, 16, 20, 35])
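
`(-avg_score).argsort()[:5]` sorts every score in order to keep five of them. As a sketch of the same top-n selection without NumPy, `heapq.nlargest` can rank the indices directly. The `top_n_indices` helper and the short score list are illustrative assumptions.

```python
import heapq

def top_n_indices(scores, n=5):
    """Indices of the n largest scores, highest score first."""
    return heapq.nlargest(n, range(len(scores)), key=lambda i: scores[i])

scores = [0.0, 0.28, 3.42, 0.0, 2.47, 3.24]
print(top_n_indices(scores, n=3))  # [2, 5, 4]
```

Either approach returns items with a score of zero when fewer than `n` items match, so filtering on `scores[i] > 0` may be worthwhile before displaying results.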

In [None]:
# Use i rather than id so the builtin id() is not shadowed.
for i in idx:
  print("--------------------------")
  print(corpus[i])
  print("\n\n")

--------------------------
Sustainal ytics Highlights on Globalworths’ Green Bond Framework (28 May 2020)  
Use of Proceeds:  • The eligible categories for the use of proceeds, Green Buildings and 
Energy Efficiency, are aligned with those recognized by the Green Bond 
Principles 2018.  
• Sustainalytics considers that the eligible categories wi ll lead to positive 
environmental impacts and advance the UN Sustainable Development 
Goals, specifically SDGs 7 & 11  
Project Evaluation / 
Selection:  • Globalworth’s  internal process of evaluating and selecting projects is 
carried out by the Green Bond Committee. The Committee is responsible 
for screening projects against the eligibility criteria and recommending 
eligible projects for inclusion in the Eligible Green P roject Portfolio. The 
Portfolio will be reviewed annually to ensure projects’ eligibility and, if 
no longer eligible, projects will be removed and replaced as soon as 
practically feasible.  
• Sustainalytics considers 

In [None]:
query = "use of proceeds"
tokenized_query = query.split()

doc_scores = bm25.get_scores(tokenized_query)
doc_scores

array([0.        , 0.4230134 , 0.        , 0.35962699, 0.        ,
       0.44755079, 0.        , 0.        , 0.        , 0.        ,
       0.44213864, 1.80524036, 0.        , 3.1161517 , 0.        ,
       0.        , 2.31987075, 0.        , 0.52187881, 0.39045513,
       2.32041074, 0.50529569, 3.80795209, 0.58057578, 0.43526282,
       3.20935785, 0.        , 0.        , 0.41815368, 0.53051848,
       0.27842965, 0.        , 0.06170768, 0.        , 0.        ,
       2.93248758, 0.42478444, 0.4988189 , 0.45158457, 0.        ,
       0.        , 0.32335197, 2.7438143 , 0.        , 2.66250291,
       0.        , 0.        , 1.55999942, 0.        , 0.        ,
       0.33865075, 0.46479878, 0.        , 0.34716771, 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.95608277, 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        ])

In [None]:
np.mean([doc_scores, doc_scores+1], axis=0)

array([0.5       , 0.9230134 , 0.5       , 0.85962699, 0.5       ,
       0.94755079, 0.5       , 0.5       , 0.5       , 0.5       ,
       0.94213864, 2.30524036, 0.5       , 3.6161517 , 0.5       ,
       0.5       , 2.81987075, 0.5       , 1.02187881, 0.89045513,
       2.82041074, 1.00529569, 4.30795209, 1.08057578, 0.93526282,
       3.70935785, 0.5       , 0.5       , 0.91815368, 1.03051848,
       0.77842965, 0.5       , 0.56170768, 0.5       , 0.5       ,
       3.43248758, 0.92478444, 0.9988189 , 0.95158457, 0.5       ,
       0.5       , 0.82335197, 3.2438143 , 0.5       , 3.16250291,
       0.5       , 0.5       , 2.05999942, 0.5       , 0.5       ,
       0.83865075, 0.96479878, 0.5       , 0.84716771, 0.5       ,
       0.5       , 0.5       , 0.5       , 0.5       , 0.5       ,
       1.45608277, 0.5       , 0.5       , 0.5       , 0.5       ,
       0.5       , 0.5       , 0.5       , 0.5       , 0.5       ,
       0.5       , 0.5       ])

In [None]:
index_max = np.argmax(doc_scores)
index_max

22

In [None]:
for item in (bm25.get_top_n(tokenized_query, corpus, n=5)):
  print("\n\n----------------------------------------------------\n----------------------------------------------------\n" + item)



----------------------------------------------------
----------------------------------------------------
• Individual measures : Individual measures reducing energy use and/or 
carbon emissions for the operational phase of the building. A list of 
eligible indivi dual measures can be found under Appendix 1 of the Green 
Bond Framework.  


----------------------------------------------------
----------------------------------------------------
On the basis of the screening process, the Green Bond Committee will recommend eligible projects 
for inclusion as Eligible Use of Proceeds to the Board of Directors of Gl obalworth, notifying all other 
appropriate teams and committees.  
The Green Bond Committee will review, annually or earlier if should be deemed necessary, the 
allocation of the proceeds to the Eligible Use of Proceeds and determine if any changes are necess ary 
(for instance, in the event that projects have been completed or otherwise become ineligible). While 
any Globa

### Tabular

In [None]:
class TableReader:

  def __init__(self, pdf):
    self.pdf = pdf
    self.dfs = None

  def read_pages(self, pages="all", multiple_tables=True, stream=True):
    self.dfs = tabula.read_pdf(self.pdf, pages=pages, multiple_tables=multiple_tables, stream=stream)
    self.__clean_dfs()
    print("\n {} tables detected.".format(len(self.dfs)))

  def __clean_dfs(self, thresh=2):
    self.dfs = [df.dropna(thresh=thresh) for df in self.dfs]

In [None]:
tb = TableReader("Globalworth-Green-Bond-Report-2020-20-July-2021.pdf")

In [None]:
tb.read_pages()


 5 tables detected.


In [None]:
tb.dfs[2]

Unnamed: 0.1,Unnamed: 0,Unnamed: 1,Year,Latest,Unnamed: 2,GAV
0,Standing properties,Country,Build,refurbishment,Green certification,(€m)
2,Globalworth Tower,RO,2016,2016,LEED Platinum,184.3
4,Globalworth Campus T1,RO,2017,2017,BREEAM Excellent,64.6
6,Globalworth Campus T2,RO,2018,2018,BREEAM Excellent,65.5
8,Globalworth Campus T3,RO,2020,2020,BREEAM Excellent,78.0
10,Green Court B,RO,2015,2015,LEED Gold,52.4
12,Green Court C,RO,2016,2016,LEED Gold,43.2
14,Renault Bucharest Connected,RO,2018,2018,BREEAM Excellent,82.1
16,Gara Herastrau,RO,2016,2016,BREEAM Excellent,28.4
18,Batory Building 1,PL,2000,2013,BREEAM Excellent,11.4


In [None]:
tb.dfs[3]

Unnamed: 0.1,Allocations:,Unnamed: 0,Unnamed: 1,Unnamed: 2,Unnamed: 3,Unnamed: 4,Unnamed: 5
0,,Country,No of,Status,Certification Level,Allocated,Unallocated
1,,,Buildings,,,Amounts,Amounts
2,,,,,,(€m),(€m)
3,Globalworth Campus,Romania,3,Standing,BREEAM Excellent,198.4,-
4,Globalworth Square,Romania,1,Under,[BREEAM Outstanding],40.0,9.6
6,Podium Park A,Poland,1,Standing,BREEAM Outstanding,40.3,-
7,Renoma,Poland,1,Under,BREEAM Excellent,98.3,-
9,TOTAL:,,,,,376.9,9.6
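
In the table above, tabula has spread the column headers over the first three data rows ("Allocated" / "Amounts" / "(€m)"). The sketch below shows one way to join such rows into a single header. It works on plain lists of strings rather than a DataFrame, and both the `merge_header_rows` helper and the assumption that the number of header rows is known in advance are hypothetical.

```python
def merge_header_rows(rows, header_count):
    """Join the first `header_count` rows column-wise into one header row."""
    header_rows, body = rows[:header_count], rows[header_count:]
    header = [
        " ".join(cell for cell in column if cell)  # skip empty cells
        for column in zip(*header_rows)
    ]
    return header, body

rows = [
    ["", "Country", "No of", "Status", "Certification Level", "Allocated", "Unallocated"],
    ["", "", "Buildings", "", "", "Amounts", "Amounts"],
    ["", "", "", "", "", "(€m)", "(€m)"],
    ["Globalworth Campus", "Romania", "3", "Standing", "BREEAM Excellent", "198.4", "-"],
]
header, body = merge_header_rows(rows, header_count=3)
print(header)
print(body)
```

With a DataFrame, the rows could first be obtained with `df.fillna("").astype(str).values.tolist()`.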



### BM25

In [None]:
!pip install rank-bm25

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [None]:
# https://github.com/dorianbrown/rank_bm25/blob/master/rank_bm25.py

In [None]:
from rank_bm25 import BM25Okapi

corpus = [
    "Hello there good man!",
    "It is quite windy in London",
    "How is the weather today?"
]

tokenized_corpus = [doc.split(" ") for doc in corpus]

bm25 = BM25Okapi(tokenized_corpus)

In [None]:
query = "man windy"
tokenized_query = query.split()

doc_scores = bm25.get_scores(tokenized_query)
doc_scores
# Value copied from the library README: array([0., 0.93729472, 0.]); the output of this run differs.

array([0.        , 0.46864736, 0.        ])

In [None]:
bm25.get_top_n(tokenized_query, corpus, n=2)

['It is quite windy in London', 'How is the weather today?']
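
To make the scores above easier to interpret, here is a minimal pure-Python sketch of textbook Okapi BM25. It is not the `rank_bm25` implementation: the non-negative idf formula and the defaults `k1=1.5`, `b=0.75` are assumptions, so the absolute values will not match the library's output.

```python
import math
from collections import Counter

def bm25_scores(tokenized_corpus, query_tokens, k1=1.5, b=0.75):
    """Textbook Okapi BM25 score of every document for one query."""
    n_docs = len(tokenized_corpus)
    avgdl = sum(len(doc) for doc in tokenized_corpus) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in tokenized_corpus for term in set(doc))
    scores = []
    for doc in tokenized_corpus:
        tf = Counter(doc)
        score = 0.0
        for term in query_tokens:
            if term not in tf:
                continue
            # Non-negative idf variant (assumption; libraries differ here).
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term frequency saturated by k1 and normalised by document length.
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

corpus = [
    "Hello there good man!",
    "It is quite windy in London",
    "How is the weather today?",
]
tokenized = [doc.split(" ") for doc in corpus]
print(bm25_scores(tokenized, "man windy".split()))
```

Only the second document scores above zero: `windy` matches exactly, while `man` does not match the token `man!` produced by whitespace splitting. This also explains the `get_top_n` result above, where the second item returned has a score of zero and is included only because two items were requested.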