# Environmental and Socio-Economic Assessment (ESA) Dataset

On 1 May 2020, the Canada Energy Regulator (CER) published the [ESA Data Bank](https://apps.cer-rec.gc.ca/REGDOCS/Home/Index/FAKE), an interactive tool that allows users to visualize, download, and share ESA data from applications filed in support of federally regulated pipeline projects. The tool contains 13,681 unique tables and 3,024 unique figures (the output files) from 37 pipeline projects submitted to the CER between 2003 and 2019. The data was extracted from 1,902 PDF documents available in the CER's public repository, [REGDOCS](https://apps.cer-rec.gc.ca/REGDOCS).  <br>

To download individual tables (in CSV and JPG format) and figures (in JPG format), see the [ESA Data Bank](https://apps.cer-rec.gc.ca/REGDOCS/Home/Index/FAKE) online tool. <br>

This repo contains several Python functions that create the ESA dataset and the figure and table output files. Data is extracted from PDF files submitted by pipeline companies. <br>

This notebook provides a step-by-step guide to re-create the ESA Data Bank dataset. In total, 13,681 unique tables were extracted. Many tables span multiple pages, so the total number of extracted CSVs is higher: 27,058. Each figure was saved as a separate JPG file; 3,024 figures were extracted in total.

# About the Code 

We assume that the “Index of PDFs for Major Projects with ESAs” has already been created. The remaining steps are implemented as functions, listed below: <br>

1.	Scrape PDF File<br>

2.	Convert PDF to Tika Files <br>

3.	PDF Metadata <br>

Items that are currently in progress and will be implemented in Phase 2 of the project:
-	Index 3 – List of CSV Files with Categories 
-	Index 4 – List of Images with Categories
-	Index 5 – GIS Locations extracted for PDF Files
-	Index 6 – Alignment Sheets extracted with geotags


# Importing the Required Packages 

In [1]:
#importing standard packages

import pandas as pd
import time
import os
import sys
import requests
#from bs4 import BeautifulSoup as bs
#import wget
import re
from urllib.parse import unquote
import PyPDF2 as p2
import glob
import multiprocessing
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

In [2]:
# importing custom functions 
import scraper
import rotate_pdfs
import pickles_functions_mp
import pdf_metadata

# Exploring the Input Files (Index of PDFs for Major Projects with ESAs)

In [3]:
Index0_path = os.path.realpath('..\\..') + "\\Input_Files\\Index_of_PDFs_for_Major_Projects_with_ESAs.csv"

Index0 = pd.read_csv(Index0_path, index_col = 0)
Index0.head()
Index0.info()

application_name,Application Name,Application Short Name,Application Filing Date,Company Name,Commodity,File Name,ESA Folder URL,Document Number,Data ID,PDF Download URL,Application Type (NEB Act),Pipeline Location,Hearing order,Consultant Name,Pipeline Status,Regulatory Instrument(s),Application URL,Decision URL,ESA Section(s),ESA Section(s) Index
2003-03-17 Application to Construct and Operate Ekwan Pipeline GH-1-2003,Application to Construct and Operate Ekwan Pip...,Ekwan,2003-03-17,EnCana Ekwan Pipeline Inc.,Gas,A0H8C0 - 13.0 EIA - Section 13.1 to 13.6,https://apps.cer-rec.gc.ca/REGDOCS/Item/LoadRe...,A0H8C0,268706,https://apps.cer-rec.gc.ca/REGDOCS/File/Downlo...,Large Projects (over 40 km),"Alberta, British Columbia",GH-1-2003,AXYS Environmental Consulting Ltd.,Operating,GC-108,https://apps.cer-rec.gc.ca/REGDOCS/Item/View/2...,https://apps.cer-rec.gc.ca/REGDOCS/Item/View/2...,"Section 13.1: Introduction, Section 13.1: Proj...",1.0
2003-03-17 Application to Construct and Operate Ekwan Pipeline GH-1-2003,Application to Construct and Operate Ekwan Pip...,Ekwan,2003-03-17,EnCana Ekwan Pipeline Inc.,Gas,A0H8C1 - 13.0 EIA - Section 13.7 Wildlife Part 1,https://apps.cer-rec.gc.ca/REGDOCS/Item/LoadRe...,A0H8C1,268709,https://apps.cer-rec.gc.ca/REGDOCS/File/Downlo...,Large Projects (over 40 km),"Alberta, British Columbia",GH-1-2003,AXYS Environmental Consulting Ltd.,Operating,GC-108,https://apps.cer-rec.gc.ca/REGDOCS/Item/View/2...,https://apps.cer-rec.gc.ca/REGDOCS/Item/View/2...,Section 13.7: Wildlife and Wildlife Habitat,2.0
2003-03-17 Application to Construct and Operate Ekwan Pipeline GH-1-2003,Application to Construct and Operate Ekwan Pip...,Ekwan,2003-03-17,EnCana Ekwan Pipeline Inc.,Gas,A0H8C2 - 13.0 EIA - Section 13.7 Wildlife Part 2,https://apps.cer-rec.gc.ca/REGDOCS/Item/LoadRe...,A0H8C2,268712,https://apps.cer-rec.gc.ca/REGDOCS/File/Downlo...,Large Projects (over 40 km),"Alberta, British Columbia",GH-1-2003,AXYS Environmental Consulting Ltd.,Operating,GC-108,https://apps.cer-rec.gc.ca/REGDOCS/Item/View/2...,https://apps.cer-rec.gc.ca/REGDOCS/Item/View/2...,Section 13.7: Wildlife and Wildlife Habitat,3.0
2003-03-17 Application to Construct and Operate Ekwan Pipeline GH-1-2003,Application to Construct and Operate Ekwan Pip...,Ekwan,2003-03-17,EnCana Ekwan Pipeline Inc.,Gas,A0H8C3 - 13.0 EIA - Section 13.8 to 13.13,https://apps.cer-rec.gc.ca/REGDOCS/Item/LoadRe...,A0H8C3,269018,https://apps.cer-rec.gc.ca/REGDOCS/File/Downlo...,Large Projects (over 40 km),"Alberta, British Columbia",GH-1-2003,AXYS Environmental Consulting Ltd.,Operating,GC-108,https://apps.cer-rec.gc.ca/REGDOCS/Item/View/2...,https://apps.cer-rec.gc.ca/REGDOCS/Item/View/2...,"Section 13.8: Fisheries and Aquatic Resources,...",4.0
2003-03-17 Application to Construct and Operate Ekwan Pipeline GH-1-2003,Application to Construct and Operate Ekwan Pip...,Ekwan,2003-03-17,EnCana Ekwan Pipeline Inc.,Gas,A0H8C4 - 13.1 App 13A - Alignment Sheets,https://apps.cer-rec.gc.ca/REGDOCS/Item/LoadRe...,A0H8C4,269021,https://apps.cer-rec.gc.ca/REGDOCS/File/Downlo...,Large Projects (over 40 km),"Alberta, British Columbia",GH-1-2003,AXYS Environmental Consulting Ltd.,Operating,GC-108,https://apps.cer-rec.gc.ca/REGDOCS/Item/View/2...,https://apps.cer-rec.gc.ca/REGDOCS/Item/View/2...,Appendix 13A: Environmental Alignment Sheets,5.0


<class 'pandas.core.frame.DataFrame'>
Index: 1902 entries, 2003-03-17 Application to Construct and Operate Ekwan Pipeline GH-1-2003 to 2009-02-27 - Application for the Keystone XL Pipeline (OH-1-2009)
Data columns (total 20 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   Application Name            1902 non-null   object 
 1   Application Short Name      1902 non-null   object 
 2   Application Filing Date     1902 non-null   object 
 3   Company Name                1902 non-null   object 
 4   Commodity                   1902 non-null   object 
 5   File Name                   1902 non-null   object 
 6   ESA Folder URL              1902 non-null   object 
 7   Document Number             1902 non-null   object 
 8   Data ID                     1902 non-null   int64  
 9   PDF Download URL            1902 non-null   object 
 10  Application Type (NEB Act)  1902 non-null   object 
 11  Pipeline Location   

# 1. Scrape PDF Files

In [4]:
# Keep only the first 10 rows of the index for this walkthrough
Index0 = Index0.head(10)
len(Index0)

10

In [5]:
count = scraper.file_scraper(os.path.realpath('..\\..'), Index0)
print("{} Files were downloaded from {} URL links".format(count, len(Index0)))

10 Files were downloaded from 10 URL links
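
The download step can be pictured roughly as follows. This is only a minimal sketch using the standard library, not the repo's `scraper.file_scraper`; the function name `download_pdfs`, the `<Data ID>.pdf` naming scheme, and the output folder are assumptions made for illustration.

```python
import os
import urllib.request


def download_pdfs(urls_by_id, out_dir):
    """Save each URL as <out_dir>/<data_id>.pdf and return how many files are available.

    Hypothetical helper: the real scraper may name and store files differently.
    """
    os.makedirs(out_dir, exist_ok=True)
    count = 0
    for data_id, url in urls_by_id.items():
        target = os.path.join(out_dir, "{}.pdf".format(data_id))
        if not os.path.exists(target):
            try:
                # Read the whole response first so a failed request leaves no partial file
                with urllib.request.urlopen(url, timeout=60) as response:
                    content = response.read()
            except OSError as err:  # urllib.error.URLError is a subclass of OSError
                print("Could not download {}: {}".format(url, err))
                continue
            with open(target, "wb") as fh:
                fh.write(content)
        count += 1
    return count
```

With the index loaded above, the input mapping could be built as `dict(zip(Index0["Data ID"], Index0["PDF Download URL"]))`. Files that already exist are counted but not downloaded again, so the sketch can be re-run safely.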


# 2. Rotate the PDF Files

In [6]:
count = rotate_pdfs.rotate_pdf(os.path.realpath('..\\..'), Index0)
print("{} Files were rotated successfully".format(count))

10 Files were rotated successfully


# 3. Convert PDFs to Tika Files 

In [7]:
# list of full paths to pdfs (glob already returns the full path of each match)
pdf_folder_path = os.path.realpath('..\\..') + '\\Data_Files\\PDFs\\'

subset_list_pdf_full = glob.glob(pdf_folder_path + '*.pdf')

# Directory where the output pickle files are saved
pkl_folder_path = os.path.realpath('..\\..') + '\\Data_Files\\Pickle_Files\\'
# prepare arguments for multiprocessing
args = pickles_functions_mp.get_argument(subset_list_pdf_full, pkl_folder_path)

# timing the process-start
starttime = time.time()

# #sequential
# for arg in args:
#     try:
#         pickles_functions_mp.pickle_pdf_xml(arg)
#     except Exception:
#         #print("exception was raised for {}".format(arg))
#         pass

# multiprocessing
pool = multiprocessing.Pool()
pool.map(pickles_functions_mp.pickle_pdf_xml, args)
pool.close()
# timing ends and the elapsed time is displayed
print('That took {} seconds'.format(time.time() - starttime))

[True, True, True, True, True, True, True, True, True, True]

That took 8.274465322494507 seconds
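
The pattern used in this cell, one argument per PDF and one success flag per result, can be sketched without Tika. The helper names `build_args` and `pickle_one`, and the placeholder payload that stands in for the Tika output, are assumptions for illustration and not the contents of `pickles_functions_mp`.

```python
import os
import pickle


def build_args(pdf_paths, pkl_folder):
    """Pair every PDF path with the pickle path its output should be written to."""
    args = []
    for pdf_path in pdf_paths:
        stem = os.path.splitext(os.path.basename(pdf_path))[0]
        args.append((pdf_path, os.path.join(pkl_folder, stem + ".pkl")))
    return args


def pickle_one(arg):
    """Process one (pdf_path, pkl_path) pair; return True on success, False on failure."""
    pdf_path, pkl_path = arg
    try:
        with open(pdf_path, "rb") as fh:
            raw = fh.read()
        # Placeholder payload: the real pipeline would store the Tika output here
        payload = {"source": pdf_path, "n_bytes": len(raw)}
        os.makedirs(os.path.dirname(pkl_path), exist_ok=True)
        with open(pkl_path, "wb") as fh:
            pickle.dump(payload, fh)
        return True
    except OSError:
        return False
```

Calling `list(map(pickle_one, build_args(paths, folder)))` yields one boolean per PDF. Because `pickle_one` is a module-level function that takes a single argument, the same call shape should also work with `multiprocessing.Pool().map`, provided the function lives in an importable module.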


# 4. Converting Rotated PDFs to Rotated Tika Files

In [9]:
# list of full paths to rotated pdfs (glob already returns the full path of each match)
pdf_folder_path = os.path.realpath('..\\..') + '\\Data_Files\\PDFs_Rotated\\'

subset_list_pdf_full = glob.glob(pdf_folder_path + '*.pdf')

# Directory where the output pickle files are saved
pkl_folder_path = os.path.realpath('..\\..') + '\\Data_Files\\Pickle_Files_Rotated\\'
# prepare arguments for multiprocessing
args = pickles_functions_mp.get_argument(subset_list_pdf_full, pkl_folder_path)

# timing the process-start
starttime = time.time()

# #sequential
# for arg in args:
#     try:
#         pickles_functions_mp.pickle_pdf_xml(arg)
#     except Exception:
#         #print("exception was raised for {}".format(arg))
#         pass

# multiprocessing
pool = multiprocessing.Pool()
pool.map(pickles_functions_mp.pickle_pdf_xml, args)
pool.close()
# timing ends and the elapsed time is displayed
print('That took {} seconds'.format(time.time() - starttime))

[True, True, True, True, True, True, True, True, True, True]

That took 8.198291540145874 seconds


# 5. Extracting PDF Metadata  

In [6]:
Index1 = pdf_metadata.pdf_categorize(Index0_path, Index0)
Index1 = pdf_metadata.pdf_size(os.path.realpath('..'), Index1)
Index1 = pdf_metadata.pdf_pagenumbers(os.path.realpath('..'), Index1)
Index1

application_name,Application Name,Application Short Name,Application Filing Date,Company Name,Commodity,File Name,ESA Folder URL,Document Number,Data ID,PDF Download URL,...,Hearing order,Consultant Name,Pipeline Status,Regulatory Instrument(s),Application URL,Decision URL,ESA Section(s),ESA Section(s) Index,Topics,PDF Size (bytes)
2003-03-17 Application to Construct and Operate Ekwan Pipeline GH-1-2003,Application to Construct and Operate Ekwan Pip...,Ekwan,2003-03-17,EnCana Ekwan Pipeline Inc.,Gas,A0H8C0 - 13.0 EIA - Section 13.1 to 13.6,https://apps.cer-rec.gc.ca/REGDOCS/Item/LoadRe...,A0H8C0,268706,https://apps.cer-rec.gc.ca/REGDOCS/File/Downlo...,...,GH-1-2003,AXYS Environmental Consulting Ltd.,Operating,GC-108,https://apps.cer-rec.gc.ca/REGDOCS/Item/View/2...,https://apps.cer-rec.gc.ca/REGDOCS/Item/View/2...,"Section 13.1: Introduction, Section 13.1: Proj...",1.0,"[Land, Air, Vegetation]",1483221
2003-03-17 Application to Construct and Operate Ekwan Pipeline GH-1-2003,Application to Construct and Operate Ekwan Pip...,Ekwan,2003-03-17,EnCana Ekwan Pipeline Inc.,Gas,A0H8C1 - 13.0 EIA - Section 13.7 Wildlife Part 1,https://apps.cer-rec.gc.ca/REGDOCS/Item/LoadRe...,A0H8C1,268709,https://apps.cer-rec.gc.ca/REGDOCS/File/Downlo...,...,GH-1-2003,AXYS Environmental Consulting Ltd.,Operating,GC-108,https://apps.cer-rec.gc.ca/REGDOCS/Item/View/2...,https://apps.cer-rec.gc.ca/REGDOCS/Item/View/2...,Section 13.7: Wildlife and Wildlife Habitat,2.0,[Wildlife],4544963
2003-03-17 Application to Construct and Operate Ekwan Pipeline GH-1-2003,Application to Construct and Operate Ekwan Pip...,Ekwan,2003-03-17,EnCana Ekwan Pipeline Inc.,Gas,A0H8C2 - 13.0 EIA - Section 13.7 Wildlife Part 2,https://apps.cer-rec.gc.ca/REGDOCS/Item/LoadRe...,A0H8C2,268712,https://apps.cer-rec.gc.ca/REGDOCS/File/Downlo...,...,GH-1-2003,AXYS Environmental Consulting Ltd.,Operating,GC-108,https://apps.cer-rec.gc.ca/REGDOCS/Item/View/2...,https://apps.cer-rec.gc.ca/REGDOCS/Item/View/2...,Section 13.7: Wildlife and Wildlife Habitat,3.0,[Wildlife],4369127
2003-03-17 Application to Construct and Operate Ekwan Pipeline GH-1-2003,Application to Construct and Operate Ekwan Pip...,Ekwan,2003-03-17,EnCana Ekwan Pipeline Inc.,Gas,A0H8C3 - 13.0 EIA - Section 13.8 to 13.13,https://apps.cer-rec.gc.ca/REGDOCS/Item/LoadRe...,A0H8C3,269018,https://apps.cer-rec.gc.ca/REGDOCS/File/Downlo...,...,GH-1-2003,AXYS Environmental Consulting Ltd.,Operating,GC-108,https://apps.cer-rec.gc.ca/REGDOCS/Item/View/2...,https://apps.cer-rec.gc.ca/REGDOCS/Item/View/2...,"Section 13.8: Fisheries and Aquatic Resources,...",4.0,"[Land, Water, Wildlife, Human]",2180117
2003-03-17 Application to Construct and Operate Ekwan Pipeline GH-1-2003,Application to Construct and Operate Ekwan Pip...,Ekwan,2003-03-17,EnCana Ekwan Pipeline Inc.,Gas,A0H8C4 - 13.1 App 13A - Alignment Sheets,https://apps.cer-rec.gc.ca/REGDOCS/Item/LoadRe...,A0H8C4,269021,https://apps.cer-rec.gc.ca/REGDOCS/File/Downlo...,...,GH-1-2003,AXYS Environmental Consulting Ltd.,Operating,GC-108,https://apps.cer-rec.gc.ca/REGDOCS/Item/View/2...,https://apps.cer-rec.gc.ca/REGDOCS/Item/View/2...,Appendix 13A: Environmental Alignment Sheets,5.0,[Alignment Sheet],3266671
2003-03-17 Application to Construct and Operate Ekwan Pipeline GH-1-2003,Application to Construct and Operate Ekwan Pip...,Ekwan,2003-03-17,EnCana Ekwan Pipeline Inc.,Gas,A0H8C5 - 13.2 App 13B - NEB Concordance,https://apps.cer-rec.gc.ca/REGDOCS/Item/LoadRe...,A0H8C5,269024,https://apps.cer-rec.gc.ca/REGDOCS/File/Downlo...,...,GH-1-2003,AXYS Environmental Consulting Ltd.,Operating,GC-108,https://apps.cer-rec.gc.ca/REGDOCS/Item/View/2...,https://apps.cer-rec.gc.ca/REGDOCS/Item/View/2...,Appendix 13B: NEB Guidelines for Filing Requir...,6.0,[Other],152004
2003-03-17 Application to Construct and Operate Ekwan Pipeline GH-1-2003,Application to Construct and Operate Ekwan Pip...,Ekwan,2003-03-17,EnCana Ekwan Pipeline Inc.,Gas,A0H8C6 - 13.3 App 13C - CEAA 16.1 Concordance,https://apps.cer-rec.gc.ca/REGDOCS/Item/LoadRe...,A0H8C6,269027,https://apps.cer-rec.gc.ca/REGDOCS/File/Downlo...,...,GH-1-2003,AXYS Environmental Consulting Ltd.,Operating,GC-108,https://apps.cer-rec.gc.ca/REGDOCS/Item/View/2...,https://apps.cer-rec.gc.ca/REGDOCS/Item/View/2...,Appendix 13C: CEAA Section 16(1) Concordance T...,7.0,[Other],68333
2003-03-17 Application to Construct and Operate Ekwan Pipeline GH-1-2003,Application to Construct and Operate Ekwan Pip...,Ekwan,2003-03-17,EnCana Ekwan Pipeline Inc.,Gas,A0H8C7 - 13.4 App 13D - EPP,https://apps.cer-rec.gc.ca/REGDOCS/Item/LoadRe...,A0H8C7,269030,https://apps.cer-rec.gc.ca/REGDOCS/File/Downlo...,...,GH-1-2003,AXYS Environmental Consulting Ltd.,Operating,GC-108,https://apps.cer-rec.gc.ca/REGDOCS/Item/View/2...,https://apps.cer-rec.gc.ca/REGDOCS/Item/View/2...,Appendix 13D: Environmental Protection Plan,8.0,[Environment Protection Plan],242261
2003-03-17 Application to Construct and Operate Ekwan Pipeline GH-1-2003,Application to Construct and Operate Ekwan Pip...,Ekwan,2003-03-17,EnCana Ekwan Pipeline Inc.,Gas,A0H8C8 - 13.4 App 13D.A - Typical Drawings,https://apps.cer-rec.gc.ca/REGDOCS/Item/LoadRe...,A0H8C8,269033,https://apps.cer-rec.gc.ca/REGDOCS/File/Downlo...,...,GH-1-2003,AXYS Environmental Consulting Ltd.,Operating,GC-108,https://apps.cer-rec.gc.ca/REGDOCS/Item/View/2...,https://apps.cer-rec.gc.ca/REGDOCS/Item/View/2...,Appendix 13D: Environmental Protection Plan,9.0,[Environment Protection Plan],4036880
2003-03-17 Application to Construct and Operate Ekwan Pipeline GH-1-2003,Application to Construct and Operate Ekwan Pip...,Ekwan,2003-03-17,EnCana Ekwan Pipeline Inc.,Gas,A0H8C9 - 13.4 App 13D.B - Watercouse Crossings,https://apps.cer-rec.gc.ca/REGDOCS/Item/LoadRe...,A0H8C9,268930,https://apps.cer-rec.gc.ca/REGDOCS/File/Downlo...,...,GH-1-2003,AXYS Environmental Consulting Ltd.,Operating,GC-108,https://apps.cer-rec.gc.ca/REGDOCS/Item/View/2...,https://apps.cer-rec.gc.ca/REGDOCS/Item/View/2...,Appendix 13D: Environmental Protection Plan,10.0,"[Water, Environment Protection Plan]",2639960
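
The `Topics` and `PDF Size (bytes)` columns can be understood through a small sketch. The keyword map below is hypothetical and chosen only to mirror a few of the labels visible in the output; the actual rules in `pdf_metadata.pdf_categorize` are not shown in this document and may well differ.

```python
import os

# Hypothetical keyword-to-topic map, for illustration only
TOPIC_KEYWORDS = {
    "wildlife": "Wildlife",
    "fisheries": "Water",
    "aquatic": "Water",
    "watercourse": "Water",
    "alignment sheet": "Alignment Sheet",
    "environmental protection plan": "Environment Protection Plan",
}


def categorize(section_text):
    """Return the topics whose keywords occur in the section text, or ['Other']."""
    text = section_text.lower()
    topics = []
    for keyword, topic in TOPIC_KEYWORDS.items():
        # Skip topics already collected so each label appears once
        if keyword in text and topic not in topics:
            topics.append(topic)
    return topics or ["Other"]


def pdf_size_bytes(pdf_path):
    """Return the file size in bytes, or None if the file does not exist."""
    return os.path.getsize(pdf_path) if os.path.isfile(pdf_path) else None
```

Applied to a column, this could look like `Index0["ESA Section(s)"].apply(categorize)`. A simple substring match like this is easy to audit but will miss sections whose titles do not contain any listed keyword, which is why the fallback label `Other` is returned.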
