# Environmental and Socio-Economic Assessment (ESA) Dataset

On 1 May 2020, the Canada Energy Regulator (CER) published the [ESA Data Bank](https://apps.cer-rec.gc.ca/REGDOCS/Home/Index/FAKE), an interactive tool that allows users to visualize, download, and share ESA data from applications filed in support of federally regulated pipeline projects. The tool contains 13,681 unique tables and 3,024 unique figures (the output files) from 37 pipeline projects submitted to the CER between 2003 and 2019. The data was extracted from 1,902 PDF documents available in the CER's public repository, [REGDOCS](https://apps.cer-rec.gc.ca/REGDOCS).  <br>

To download individual tables (in CSV and JPG formats) and figures (in JPG format), see the [ESA Data Bank](https://apps.cer-rec.gc.ca/REGDOCS/Home/Index/FAKE) online tool. The [ESA Data Bank](https://apps.cer-rec.gc.ca/REGDOCS/Home/Index) covers the Canada Energy Regulator’s (CER) ESA data from applications filed in support of federally regulated pipeline projects and related facilities. <br>

This repo contains several Python functions that create the ESA dataset and the figure and table output files. The data is extracted from PDF files submitted by pipeline companies. <br>

This notebook provides a step-by-step guide to re-creating the ESA Data Bank dataset. In total, 13,681 unique tables were extracted. Many tables span multiple pages, so the total number of extracted CSV files is 27,058. Each of the 3,024 extracted figures was saved as its own JPG file.
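
Because one logical table can be split across several page-level CSV files, a user of the output files may want to stitch the pieces back together. The sketch below shows one possible way to do that with pandas. The `table_id` and `page` fields, the sample values, and the in-memory CSV strings are all hypothetical stand-ins, not the repository's actual file layout.

```python
import io

import pandas as pd

# Hypothetical page-level extracts: table "T1" is spread over two PDF pages.
# Real output files live on disk; in-memory strings keep the sketch self-contained.
page_csvs = [
    {"table_id": "T1", "page": 12, "csv": "Species,Count\nMoose,4\nCaribou,7\n"},
    {"table_id": "T1", "page": 13, "csv": "Species,Count\nWolf,2\n"},
    {"table_id": "T2", "page": 20, "csv": "Site,pH\nA,7.1\n"},
]


def combine_pages(parts):
    """Concatenate page-level CSVs that share a table_id, in page order."""
    combined = {}
    for table_id in sorted({p["table_id"] for p in parts}):
        pages = sorted(
            (p for p in parts if p["table_id"] == table_id),
            key=lambda p: p["page"],
        )
        frames = [pd.read_csv(io.StringIO(p["csv"])) for p in pages]
        combined[table_id] = pd.concat(frames, ignore_index=True)
    return combined


tables = combine_pages(page_csvs)
print(len(page_csvs), "page-level CSVs ->", len(tables), "unique tables")
```

This only works when every page of a table repeats the same header row, which is an assumption of the sketch rather than a property confirmed for the extracted files.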

# About the Code 

We assume that the “Index of PDFs for Major Projects with ESAs” has already been created. The remaining steps are implemented as functions, listed below: <br>

1.	Scrape PDF Files<br>
Input: Index of PDFs for Major Projects with ESAs: list of PDF download links  <br>
Output: PDF Files <br>
<br>
2.	Convert PDF to Tika Files <br>
Input: PDF files <br>
Output: Tika files (XML) and a text file listing all the files that could not be converted <br>
<br>
3.	Categorization of the PDFs <br>
Input: Index of PDFs for Major Projects with ESAs: List of the PDFs with keywords based on the indices of the PDFs<br>
Output: Index 1 - PDFs for Major Projects with ESAs.csv: PDF file names with categories assigned <br>


The following items are in progress and will be implemented in Phase 2 of the project:
-	Index 3 – List of CSV Files with Categories 
-	Index 4 – List of Images with Categories
-	Index 5 – GIS Locations extracted for PDF Files
-	Index 6 – Alignment Sheets extracted with geotags


# Importing the Required Packages 

In [1]:
#importing standard packages

import pandas as pd
import time
import os
import sys
import requests
#from bs4 import BeautifulSoup as bs
#import wget
import re
from urllib.parse import unquote
import PyPDF2 as p2
import glob
import multiprocessing
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

In [2]:
# importing custom functions 
import scraper
import pdf_categorization
import pickles_functions_mp

# Exploring the Input Files (Index of PDFs for Major Projects with ESAs)

In [3]:
# Build the path with os.path.join so no backslash escaping is needed
Index0_path = os.path.join(os.path.realpath('..'), "Input Files",
                           "Index of PDFs for Major Projects with ESAs.csv")

Index0 = pd.read_csv(Index0_path, index_col = 0)
Index0.head()
Index0.info()

Application Name,Application Short Name,Application Filing Date,Company Name,Commodity,File Name,ESA Folder URL,Document Number,Data ID,PDF Download URL,Application Type (NEB Act),Pipeline Location,Hearing order,Consultant Name,Pipeline Status,Regulatory Instrument(s),Application URL,Decision URL,ESA Section(s),ESA Section(s) Index
Application to Construct and Operate Ekwan Pipeline,Ekwan,2003-03-17,EnCana Ekwan Pipeline Inc.,Gas,A0H8C0 - 13.0 EIA - Section 13.1 to 13.6,https://apps.cer-rec.gc.ca/REGDOCS/Item/LoadRe...,A0H8C0,268706,https://apps.cer-rec.gc.ca/REGDOCS/File/Downlo...,Large Projects (over 40 km),"Alberta, British Columbia",GH-1-2003,AXYS Environmental Consulting Ltd.,Operating,GC-108,https://apps.cer-rec.gc.ca/REGDOCS/Item/View/2...,https://apps.cer-rec.gc.ca/REGDOCS/Item/View/2...,"Section 13.1: Introduction, Section 13.1: Proj...",1.0
Application to Construct and Operate Ekwan Pipeline,Ekwan,2003-03-17,EnCana Ekwan Pipeline Inc.,Gas,A0H8C1 - 13.0 EIA - Section 13.7 Wildlife Part 1,https://apps.cer-rec.gc.ca/REGDOCS/Item/LoadRe...,A0H8C1,268709,https://apps.cer-rec.gc.ca/REGDOCS/File/Downlo...,Large Projects (over 40 km),"Alberta, British Columbia",GH-1-2003,AXYS Environmental Consulting Ltd.,Operating,GC-108,https://apps.cer-rec.gc.ca/REGDOCS/Item/View/2...,https://apps.cer-rec.gc.ca/REGDOCS/Item/View/2...,Section 13.7: Wildlife and Wildlife Habitat,2.0
Application to Construct and Operate Ekwan Pipeline,Ekwan,2003-03-17,EnCana Ekwan Pipeline Inc.,Gas,A0H8C2 - 13.0 EIA - Section 13.7 Wildlife Part 2,https://apps.cer-rec.gc.ca/REGDOCS/Item/LoadRe...,A0H8C2,268712,https://apps.cer-rec.gc.ca/REGDOCS/File/Downlo...,Large Projects (over 40 km),"Alberta, British Columbia",GH-1-2003,AXYS Environmental Consulting Ltd.,Operating,GC-108,https://apps.cer-rec.gc.ca/REGDOCS/Item/View/2...,https://apps.cer-rec.gc.ca/REGDOCS/Item/View/2...,Section 13.7: Wildlife and Wildlife Habitat,3.0
Application to Construct and Operate Ekwan Pipeline,Ekwan,2003-03-17,EnCana Ekwan Pipeline Inc.,Gas,A0H8C3 - 13.0 EIA - Section 13.8 to 13.13,https://apps.cer-rec.gc.ca/REGDOCS/Item/LoadRe...,A0H8C3,269018,https://apps.cer-rec.gc.ca/REGDOCS/File/Downlo...,Large Projects (over 40 km),"Alberta, British Columbia",GH-1-2003,AXYS Environmental Consulting Ltd.,Operating,GC-108,https://apps.cer-rec.gc.ca/REGDOCS/Item/View/2...,https://apps.cer-rec.gc.ca/REGDOCS/Item/View/2...,"Section 13.8: Fisheries and Aquatic Resources,...",4.0
Application to Construct and Operate Ekwan Pipeline,Ekwan,2003-03-17,EnCana Ekwan Pipeline Inc.,Gas,A0H8C4 - 13.1 App 13A - Alignment Sheets,https://apps.cer-rec.gc.ca/REGDOCS/Item/LoadRe...,A0H8C4,269021,https://apps.cer-rec.gc.ca/REGDOCS/File/Downlo...,Large Projects (over 40 km),"Alberta, British Columbia",GH-1-2003,AXYS Environmental Consulting Ltd.,Operating,GC-108,https://apps.cer-rec.gc.ca/REGDOCS/Item/View/2...,https://apps.cer-rec.gc.ca/REGDOCS/Item/View/2...,Appendix 13A: Environmental Alignment Sheets,5.0


<class 'pandas.core.frame.DataFrame'>
Index: 1903 entries, Application to Construct and Operate Ekwan Pipeline to Application for the Keystone XL Pipeline
Data columns (total 19 columns):
Application Short Name        1903 non-null object
Application Filing Date       1903 non-null object
Company Name                  1903 non-null object
Commodity                     1903 non-null object
File Name                     1903 non-null object
ESA Folder URL                1903 non-null object
Document Number               1903 non-null object
Data ID                       1903 non-null int64
PDF Download URL              1903 non-null object
Application Type (NEB Act)    1903 non-null object
Pipeline Location             1903 non-null object
Hearing order                 1903 non-null object
Consultant Name               1903 non-null object
Pipeline Status               1903 non-null object
Regulatory Instrument(s)      1879 non-null object
Application URL               1903 non-null obje
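
The `info()` output above shows that `Regulatory Instrument(s)` has fewer non-null entries than the other columns (1879 of 1903) and that `Application Filing Date` is stored as a plain object column. The following sketch shows how both points could be checked and handled with pandas. It uses a small toy frame rather than the real index, and its third row is invented purely for illustration.

```python
import pandas as pd

# Toy stand-in for Index0 with three of its columns. The "Example" row is
# made up; only the Ekwan values mirror the rows shown above.
toy_index = pd.DataFrame({
    "Application Short Name": ["Ekwan", "Ekwan", "Example"],
    "Application Filing Date": ["2003-03-17", "2003-03-17", "2010-06-01"],
    "Regulatory Instrument(s)": ["GC-108", "GC-108", None],
})

# Count missing values per column (the complement of info()'s non-null counts).
missing = toy_index.isna().sum()

# Filing dates load as strings (dtype object); convert them to datetimes.
toy_index["Application Filing Date"] = pd.to_datetime(
    toy_index["Application Filing Date"], format="%Y-%m-%d"
)
filing_years = toy_index["Application Filing Date"].dt.year

print(missing["Regulatory Instrument(s)"])
print(filing_years.min(), filing_years.max())
```

On the real index, the same two calls could be applied to `Index0` directly.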

# 1. Scrape PDF Files

In [4]:
# keep only the first 10 rows so the demonstration run stays short
Index0 = Index0.head(10)
len(Index0)

10

In [7]:
count = scraper.file_scraper(os.path.realpath('..'), Index0)
print("{} Files were downloaded from {} URL links".format(count, len(Index0)))

10 Files were downloaded from 10 URL links
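
The body of `scraper.file_scraper` is not shown in this notebook. As a rough illustration of what a scraping step like this might look like, here is a minimal sketch that saves each `PDF Download URL` under its `Data ID`. The function names `download_pdfs` and `fetch_with_requests`, the file-naming scheme, and the placeholder URLs are assumptions made for this example, not the repository's implementation.

```python
import os
import tempfile

import pandas as pd


def fetch_with_requests(url):
    """Download one URL and return its bytes (needs the requests package)."""
    import requests  # imported lazily so the offline demo below needs no network

    response = requests.get(url, timeout=60)
    response.raise_for_status()
    return response.content


def download_pdfs(index_df, out_dir, fetch=fetch_with_requests):
    """Save each 'PDF Download URL' as '<Data ID>.pdf'; return the count saved."""
    os.makedirs(out_dir, exist_ok=True)
    count = 0
    for _, row in index_df.iterrows():
        target = os.path.join(out_dir, "{}.pdf".format(row["Data ID"]))
        try:
            content = fetch(row["PDF Download URL"])
        except Exception as exc:  # keep going if one link fails
            print("Skipped {}: {}".format(row["PDF Download URL"], exc))
            continue
        with open(target, "wb") as handle:
            handle.write(content)
        count += 1
    return count


# Offline demonstration with a fake fetcher and placeholder URLs.
demo_index = pd.DataFrame({
    "Data ID": [268706, 268709],
    "PDF Download URL": ["https://example.org/a.pdf", "https://example.org/b.pdf"],
})
demo_dir = tempfile.mkdtemp()
demo_count = download_pdfs(demo_index, demo_dir, fetch=lambda url: b"%PDF-1.4 demo")
print("{} Files were downloaded from {} URL links".format(demo_count, len(demo_index)))
```

Passing the fetcher as a parameter keeps the download logic testable without network access. Omitting the `fetch` argument would use `requests` against the real URLs.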


# 2. Convert PDF to Tika Files 

In [None]:
# list of full paths to pdfs
pdf_folder_path = os.path.realpath('..') + '\\Data Files\\PDFs\\'

subset_list_pdf_full = [pdf_folder_path
                        + x.split('\\')[-1] for x in glob.glob(pdf_folder_path + '*.pdf')]

# Directory where the output pickle files are saved
pkl_folder_path = os.path.realpath('..') + '\\Data Files\\Pickle Files\\'
# prepare arguments for multiprocessing
args = pickles_functions_mp.get_argument(subset_list_pdf_full, pkl_folder_path)

# timing the process-start
starttime = time.time()

# #sequential
# for arg in args:
#     try:
#         pickles_functions_mp.pickle_pdf_xml(arg)
#     except Exception:
#         #print("exception was raised for {}".format(arg))
#         pass

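# multiprocessing: map() blocks until every PDF has been processed
pool = multiprocessing.Pool()
pool.map(pickles_functions_mp.pickle_pdf_xml, args)
pool.close()
pool.join()  # wait for the worker processes to exit cleanly
# timing ends and the elapsed time is displayed
print('That took {} seconds'.format(time.time() - starttime))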
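
The internals of `pickles_functions_mp.pickle_pdf_xml` are not shown here. The sketch below illustrates one way a per-file worker of this kind could parse a PDF with Tika, pickle the result, and record failures in a text file, as described in the function list. The names `parse_with_tika` and `pickle_one_pdf`, the `.pkl` naming, and the log file name are hypothetical. The demonstration swaps in a fake parser so that it runs without Tika or Java installed.

```python
import os
import pickle
import tempfile


def parse_with_tika(pdf_path):
    """Return Tika's parse result (XML content plus metadata) for one PDF."""
    from tika import parser  # needs the tika package and a Java runtime

    return parser.from_file(pdf_path, xmlContent=True)


def pickle_one_pdf(pdf_path, pkl_dir, failed_log, parse=parse_with_tika):
    """Parse one PDF and pickle the result; log the path if anything fails."""
    stem = os.path.splitext(os.path.basename(pdf_path))[0]
    try:
        parsed = parse(pdf_path)
        with open(os.path.join(pkl_dir, stem + ".pkl"), "wb") as handle:
            pickle.dump(parsed, handle)
        return True
    except Exception:
        with open(failed_log, "a", encoding="utf-8") as log:
            log.write(pdf_path + "\n")
        return False


# Offline demonstration: a fake parser stands in for Tika.
def fake_parse(path):
    if path.endswith("bad.pdf"):
        raise ValueError("cannot parse")
    return {"content": "<html>demo</html>", "metadata": {}}


work_dir = tempfile.mkdtemp()
log_path = os.path.join(work_dir, "failed.txt")
results = [
    pickle_one_pdf(p, work_dir, log_path, parse=fake_parse)
    for p in ["good.pdf", "bad.pdf"]
]
print(results)
```

`Pool.map` passes a single argument to each call, so a worker with several parameters like this one would need them bundled into one tuple per file. That may be why the notebook prepares `args` up front, though this is an inference. Appending to a shared log from several processes can also interleave writes, so a real run might collect the failures from the return values instead.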

# 3. Categorization of the PDFs  

In [5]:
Index1 = pdf_categorization.pdf_categorize(Index0_path, Index0)
Index1.head()

Application Name,Application Short Name,Application Filing Date,Company Name,Commodity,File Name,ESA Folder URL,Document Number,Data ID,PDF Download URL,Application Type (NEB Act),Pipeline Location,Hearing order,Consultant Name,Pipeline Status,Regulatory Instrument(s),Application URL,Decision URL,ESA Section(s),ESA Section(s) Index,Topics
Application to Construct and Operate Ekwan Pipeline,Ekwan,2003-03-17,EnCana Ekwan Pipeline Inc.,Gas,A0H8C0 - 13.0 EIA - Section 13.1 to 13.6,https://apps.cer-rec.gc.ca/REGDOCS/Item/LoadRe...,A0H8C0,268706,https://apps.cer-rec.gc.ca/REGDOCS/File/Downlo...,Large Projects (over 40 km),"Alberta, British Columbia",GH-1-2003,AXYS Environmental Consulting Ltd.,Operating,GC-108,https://apps.cer-rec.gc.ca/REGDOCS/Item/View/2...,https://apps.cer-rec.gc.ca/REGDOCS/Item/View/2...,"Section 13.1: Introduction, Section 13.1: Proj...",1.0,"[Land, Air, Vegetation]"
Application to Construct and Operate Ekwan Pipeline,Ekwan,2003-03-17,EnCana Ekwan Pipeline Inc.,Gas,A0H8C1 - 13.0 EIA - Section 13.7 Wildlife Part 1,https://apps.cer-rec.gc.ca/REGDOCS/Item/LoadRe...,A0H8C1,268709,https://apps.cer-rec.gc.ca/REGDOCS/File/Downlo...,Large Projects (over 40 km),"Alberta, British Columbia",GH-1-2003,AXYS Environmental Consulting Ltd.,Operating,GC-108,https://apps.cer-rec.gc.ca/REGDOCS/Item/View/2...,https://apps.cer-rec.gc.ca/REGDOCS/Item/View/2...,Section 13.7: Wildlife and Wildlife Habitat,2.0,[Wildlife]
Application to Construct and Operate Ekwan Pipeline,Ekwan,2003-03-17,EnCana Ekwan Pipeline Inc.,Gas,A0H8C2 - 13.0 EIA - Section 13.7 Wildlife Part 2,https://apps.cer-rec.gc.ca/REGDOCS/Item/LoadRe...,A0H8C2,268712,https://apps.cer-rec.gc.ca/REGDOCS/File/Downlo...,Large Projects (over 40 km),"Alberta, British Columbia",GH-1-2003,AXYS Environmental Consulting Ltd.,Operating,GC-108,https://apps.cer-rec.gc.ca/REGDOCS/Item/View/2...,https://apps.cer-rec.gc.ca/REGDOCS/Item/View/2...,Section 13.7: Wildlife and Wildlife Habitat,3.0,[Wildlife]
Application to Construct and Operate Ekwan Pipeline,Ekwan,2003-03-17,EnCana Ekwan Pipeline Inc.,Gas,A0H8C3 - 13.0 EIA - Section 13.8 to 13.13,https://apps.cer-rec.gc.ca/REGDOCS/Item/LoadRe...,A0H8C3,269018,https://apps.cer-rec.gc.ca/REGDOCS/File/Downlo...,Large Projects (over 40 km),"Alberta, British Columbia",GH-1-2003,AXYS Environmental Consulting Ltd.,Operating,GC-108,https://apps.cer-rec.gc.ca/REGDOCS/Item/View/2...,https://apps.cer-rec.gc.ca/REGDOCS/Item/View/2...,"Section 13.8: Fisheries and Aquatic Resources,...",4.0,"[Land, Water, Wildlife, Human]"
Application to Construct and Operate Ekwan Pipeline,Ekwan,2003-03-17,EnCana Ekwan Pipeline Inc.,Gas,A0H8C4 - 13.1 App 13A - Alignment Sheets,https://apps.cer-rec.gc.ca/REGDOCS/Item/LoadRe...,A0H8C4,269021,https://apps.cer-rec.gc.ca/REGDOCS/File/Downlo...,Large Projects (over 40 km),"Alberta, British Columbia",GH-1-2003,AXYS Environmental Consulting Ltd.,Operating,GC-108,https://apps.cer-rec.gc.ca/REGDOCS/Item/View/2...,https://apps.cer-rec.gc.ca/REGDOCS/Item/View/2...,Appendix 13A: Environmental Alignment Sheets,5.0,[Alignment Sheet]
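
The `Topics` column in the output above appears to be derived from keywords in the section titles. The keyword list used by `pdf_categorization.pdf_categorize` is not shown in this notebook, so the sketch below uses a small, hypothetical keyword map to illustrate the general idea of keyword-based categorization. `TOPIC_KEYWORDS` and `assign_topics` are names invented for this example.

```python
# Hypothetical keyword-to-topic map; the real keyword list is not shown here.
TOPIC_KEYWORDS = {
    "Wildlife": ["wildlife"],
    "Water": ["fisheries", "aquatic", "water"],
    "Vegetation": ["vegetation"],
    "Alignment Sheet": ["alignment sheet"],
}


def assign_topics(section_text):
    """Return every topic whose keywords appear in the section text."""
    text = section_text.lower()
    return [
        topic
        for topic, words in TOPIC_KEYWORDS.items()
        if any(word in text for word in words)
    ]


sections = [
    "Section 13.7: Wildlife and Wildlife Habitat",
    "Section 13.8: Fisheries and Aquatic Resources",
    "Appendix 13A: Environmental Alignment Sheets",
]
topics = [assign_topics(s) for s in sections]
print(topics)
```

On a DataFrame, a function like this could be applied with `Index0["ESA Section(s)"].apply(assign_topics)`. Plain substring matching is a simplification: it cannot tell a keyword apart from a longer word that happens to contain it.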

