In [1]:
import json
import os

import pandas as pd

from esg_data_pipeline.components import Curator, Extractor
from esg_data_pipeline.config import config

## Extraction

The extraction stage converts a pdf into a format that Python can handle easily.
<br>All the text in a pdf is extracted and saved in a JSON file.
<br>Currently, only the text is extracted; tables and figures are left out.

This process is applied to all the pdfs mentioned in an annotation Excel file provided by Allianz.
<br>A JSON file is created for each pdf, with the same file name.
<br>Place the Excel files in `data/annotations`, and the pdfs in `data/pdfs`. 
<br>The extraction output is saved in `data/extraction`. These directories can be changed in the config file `config/config.py`.
 
For this demo, we start by deleting any previously extracted files.

In [15]:
!rm $config.EXTRACTION_FOLDER/*  2> /dev/null

We have placed two small sample pdfs (subsets of the original pdfs) in `data/pdfs`

In [16]:
!ls $config.PDF_FOLDER

2015_BASF_Report.pdf  cez-en-annual-report-2018.pdf


#### Running the Extraction stage 

In [17]:
ext = Extractor(config.EXTRACTORS)
ext.run_folder(config.PDF_FOLDER, config.EXTRACTION_FOLDER)

*There is a JSON file for each pdf.*

In [18]:
!ls -lh $config.EXTRACTION_FOLDER

total 76K
-rw-r--r-- 1 root root 49K Jul 16 03:29 '2015_BASF_Report-BASF SE-0.json'
-rw-r--r-- 1 root root 24K Jul 16 03:29  cez-en-annual-report-2018-CEZ-0.json


Let's look at the content of one of the JSON files.

In [19]:
with open("{}/2015_BASF_Report-BASF SE-0.json".format(config.EXTRACTION_FOLDER), "r") as f:
    json_file = json.load(f)

In [20]:
print("Page 0")
for paragraph in json_file["0"]:
    print(paragraph)

print("=" * 20)
print("Page 1")
for paragraph in json_file["1"]:
    print(paragraph)

Page 0
BASF Report 2015
Economic, environmental and
social performance
====================
Page 1
BASF Report 2015
Economic, environmental and
social performance
Chemicals
The  Chemicals  segment  comprises  our  business  with 
basic  chemicals  and  intermediates.  Its  portfolio  ranges 
from  solvents,  plasticizers  and  high-volume  monomers 
to glues and electronic chemicals as well as raw materi-
als  for  detergents,  plastics,  textile  fibers,  paints  and 
coatings, crop protection and medicines. In addition to 
supplying  customers  in  the  chemical  industry  and  
numerous other sectors, we also ensure that other BASF 
segments  are  supplied  with  chemicals  for  producing 
downstream products.
Performance Products
Our Performance Products lend stability, color and bet-
ter  application  properties  to  many  every day  products. 
Our  product  portfolio  includes  vitamins  and  other  food 
additives in addition to ingredients for pharmaceuticals, 
personal  care  and  cosmetics,  as 
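
The output above suggests that each extracted JSON file maps page numbers (stored as strings) to lists of paragraph strings. Below is a minimal sketch for summarizing such a file, assuming that layout holds for every page. The helper `summarize_pages` and the `sample` dictionary are illustrative and are not part of `esg_data_pipeline`.

```python
def summarize_pages(extracted):
    """Return {page_number: paragraph_count} for an extracted-JSON dict.

    Assumes keys are page numbers stored as strings and values are lists
    of paragraph strings, as the printed output above suggests.
    """
    return {int(page): len(paragraphs) for page, paragraphs in extracted.items()}


# Hypothetical miniature example with the same layout as the real file.
sample = {
    "0": ["BASF Report 2015", "Economic, environmental and", "social performance"],
    "1": ["Chemicals", "Performance Products"],
}
print(summarize_pages(sample))
```

In this notebook, `summarize_pages(json_file)` would give the paragraph count for each page of the BASF sample.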

**Alternatively**, the pipeline can process a **single pdf**: call the `run()` method and pass the path to the desired pdf file.

In [21]:
test_dir = os.path.join(config.DATA_FOLDER, "test_dir")
# Create the output folder if it does not exist yet.
os.makedirs(test_dir, exist_ok=True)

sample_pdf = "{}/2015_BASF_Report.pdf".format(config.PDF_FOLDER)
ext.run(input_filepath=sample_pdf, output_folder=test_dir)

In [22]:
!ls $config.DATA_FOLDER/test_dir

2015_BASF_Report.json


## Curation

The extracted JSON files are fed into the next stage to curate a training dataset.
<br>The positive examples (label 1) are taken from the annotated data provided by Allianz.
<br>A negative example (label 0) for each question is created by selecting a random paragraph from the JSON files.
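
The negative-sampling idea can be sketched as follows. This is not the `Curator` implementation, only an illustration of the principle. The function name `make_negative_example` and its parameters are hypothetical, and excluding the known positive paragraphs from the candidates is an assumption that the text above does not state.

```python
import random


def make_negative_example(question, positive_paragraphs, all_paragraphs, rng=random):
    """Pick a random paragraph for `question` and label it 0.

    Assumption: paragraphs annotated as relevant are excluded, so a
    positive is never reused as a negative.
    """
    positives = set(positive_paragraphs)
    candidates = [p for p in all_paragraphs if p not in positives]
    if not candidates:
        return None  # nothing left to sample from
    return {"question": question, "context": rng.choice(candidates), "label": 0}


# Hypothetical paragraphs standing in for the extracted JSON content.
paragraphs = ["Reserves rose by 2%.", "Chemicals segment overview.", "Letter to shareholders."]
negative = make_negative_example(
    question="What is the volume of estimated proven or probable hydrocarbons reserves?",
    positive_paragraphs=["Reserves rose by 2%."],
    all_paragraphs=paragraphs,
    rng=random.Random(0),  # seeded so the demo is reproducible
)
print(negative)
```

Passing a seeded `random.Random` instance keeps the sampled negatives stable between runs, which makes the curated dataset easier to compare across experiments.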

In [23]:
!rm $config.CURATION_FOLDER/*  2> /dev/null

In [24]:
cur = Curator(config.CURATORS)
cur.run(config.EXTRACTION_FOLDER, config.ANNOTATION_FOLDER, config.CURATION_FOLDER)

[]


In [25]:
!ls $config.CURATION_FOLDER

esg_dataset.csv


Let's take a look at the curated dataset.

In [26]:
df = pd.read_csv("{}/esg_dataset.csv".format(config.CURATION_FOLDER))

In [27]:
df.head()

Unnamed: 0.2,Unnamed: 0,Unnamed: 0.1,company,kpi_id,year,answer,relevant_paragraphs,source_file,source_page,data_type,irrelevant_paragraphs,"comments, questions",annotator,context,label,question
0,0,,BASF SE,2.0,2015.0,"1,744 million BOE","[""Our proven crude oil and natural gas reserve...",2015_BASF_Report.pdf,[107],TEXT,,,1qbit_edited_kpi_extraction template_Allyson.xlsx,Our proven crude oil and natural gas reserves ...,1,What is the volume of estimated proven or prob...
1,1,,BASF SE,3.0,2015.0,153 million barrels of oil equivalent (BOE),"[""We increased our crude oil and natural gas p...",2015_BASF_Report.pdf,[107],TEXT,,,1qbit_edited_kpi_extraction template_Allyson.xlsx,We increased our crude oil and natural gas pro...,1,What is the total volume of hydrocarbons produ...
2,2,,BASF SE,2.0,2016.0,"1,622 million BOE","[""Our proven crude oil and natural gas reserve...",BASF_Report_2016.pdf,[106],TEXT,,,1qbit_edited_kpi_extraction template_Allyson.xlsx,Our proven crude oil and natural gas reserves ...,1,What is the volume of estimated proven or prob...
3,3,,BASF SE,3.0,2016.0,165 million BOE,"[""We increased our crude oil and natural gas p...",BASF_Report_2016.pdf,[106],TEXT,,,1qbit_edited_kpi_extraction template_Allyson.xlsx,We increased our crude oil and natural gas pro...,1,What is the total volume of hydrocarbons produ...
4,4,,BASF SE,2.0,2017.0,"1,677 million BOE","[""Our proven oil and gas reserves rose by 3% c...",BASF_Report_2017.pdf,[99],TEXT,,,1qbit_edited_kpi_extraction template_Allyson.xlsx,Our proven oil and gas reserves rose by 3% com...,1,What is the volume of estimated proven or prob...


In [28]:
print("Row 0")
print("Question:", df["question"][0])
print("Context:", df["context"][0])
print("Label:", df["label"][0])

Row 0
Question: What is the volume of estimated proven or probable hydrocarbons reserves?
Context: Our proven crude oil and natural gas reserves increased by 2% compared with the end of 2014, to 1,744 million BOE.
Label: 1
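
Since one negative example is created per question, it can be useful to check how the labels are distributed before training. Below is a minimal sketch using only the standard library. The helper `label_balance` and the `toy_labels` list are hypothetical and not part of the pipeline.

```python
from collections import Counter


def label_balance(labels):
    """Return the share of each label in an iterable of 0/1 labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


# Hypothetical labels: one positive and one negative for each of two questions.
toy_labels = [1, 0, 1, 0]
print(label_balance(toy_labels))
```

On the curated dataset, the same check would be `label_balance(df["label"])`.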


### Next Steps
The curated dataset will be fed into our training pipeline to train an NLP model.
<br>The same process is repeated for table data as well.