# Environmental and Socio-Economic Assessment (ESA) Dataset

On 1 May 2020, the Canada Energy Regulator (CER) published the [ESA Data Bank](https://apps.cer-rec.gc.ca/REGDOCS/Home/Index/FAKE), an interactive tool that allows users to visualize, download, and share ESA data from applications filed in support of federally regulated pipeline projects. The tool contains tables and figures (output files) from 37 pipeline projects submitted to the CER between 2003 and 2019. Data was extracted from 1,902 PDF documents available from the CER's public repository, [REGDOCS](https://apps.cer-rec.gc.ca/REGDOCS).  <br>

To download individual tables (in CSV and JPG format) and figures (in JPG format), see the [ESA Data Bank](https://apps.cer-rec.gc.ca/REGDOCS/Home/Index/FAKE) online tool. The [ESA Data Bank](https://apps.cer-rec.gc.ca/REGDOCS/Home/Index) covers applications filed in support of federally regulated pipeline projects and related facilities. <br>

This repo contains several Python functions that create the ESA dataset and the figure and table output files. Data is extracted from PDF files submitted by pipeline companies. <br>

# About the Code 

This notebook covers the code for the first step in re-creating the ESA Data Bank dataset. In this first step, we focus on data extraction and data preparation for the 1,902 PDF files. <br>

The “Index of PDFs for Major Projects with ESAs” (Index0) has already been created; it lists the PDF files for the major projects with ESAs. This notebook covers the following steps: <br>

1.  Scrape PDF Files <br>

2.  Rotate the PDF Files <br>

3.  Convert PDFs to Pickled Files <br>

4.  Convert Rotated PDF Files to Rotated Pickled Files <br>

5.  Extract PDF Metadata <br>


# Importing the Required Packages

In [None]:
# importing Python standard-library and third-party packages
import pandas as pd
import time
import os
import glob
import multiprocessing
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

In [None]:
# importing custom libraries built by CER DDA team  
import scraper
import rotate_pdfs
import pickles_functions_mp
import pdf_metadata

# Exploring the Input Files (Index of PDFs for Major Projects with ESAs)

In [None]:
Index0_path = os.path.realpath('..\\..') + "\\Input_Files\\Index_of_PDFs_for_Major_Projects_with_ESAs.csv"

Index0 = pd.read_csv(Index0_path, index_col = 0)
Index0.head()
Index0.info()

# 1. Scrape PDF Files

Here we download the PDF files using the download links provided in the Index0 DataFrame and save them to the local machine.

In [None]:
# To keep the trial demo fast, we limit the number of files to 10
Index0 = Index0.head(10)
len(Index0)

In [None]:
count = scraper.file_scraper(os.path.realpath('..\\..'), Index0)
print("{} Files were downloaded from {} URL links".format(count, len(Index0)))
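
The download loop behind a step like this can be sketched as follows. This is only an illustration, not the code in `scraper.file_scraper`: the function name `download_files`, its `(file_name, url)` input pairs, and the injectable `fetch` argument are all assumptions, since the column names of Index0 and the internals of the scraper are not shown here.

```python
import os
import urllib.request


def download_files(links, out_dir, fetch=urllib.request.urlretrieve):
    """Download each (file_name, url) pair into out_dir; return the success count.

    Hypothetical helper: `fetch` defaults to urllib's downloader but can be
    replaced, which lets the loop be tried out without network access.
    """
    os.makedirs(out_dir, exist_ok=True)
    count = 0
    for file_name, url in links:
        target = os.path.join(out_dir, file_name)
        try:
            fetch(url, target)
            count += 1
        except OSError as err:  # urllib.error.URLError is a subclass of OSError
            print("Could not download {}: {}".format(url, err))
    return count
```

Counting only the successful downloads mirrors the message printed above, where the number of downloaded files is compared with the number of URL links.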

# 2. Rotate the PDF Files

Some pages in the PDF files for the ESA projects were rotated by 90 degrees. Extracting data from those files can be extremely time-consuming. Hence, this function is used to keep the rotated PDF files in a separate folder.

In [None]:
count = rotate_pdfs.rotate_pdf(os.path.realpath('..\\..'), Index0)
print("{} files were successfully rotated".format(count))
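
As a rough sketch of the selection part of this step, the snippet below copies every PDF that has at least one rotated page into a separate folder. It is not the code in `rotate_pdfs.rotate_pdf`: the function name `copy_rotated_pdfs` and the `page_rotations` mapping are hypothetical, and reading the per-page rotation angles (or actually rotating pages) would need a PDF library, which is not shown.

```python
import os
import shutil


def copy_rotated_pdfs(pdf_dir, rotated_dir, page_rotations):
    """Copy PDFs with at least one rotated page into rotated_dir.

    page_rotations (assumed input) maps a file name to the rotation angle,
    in degrees, of each of its pages. Returns the number of files copied.
    """
    os.makedirs(rotated_dir, exist_ok=True)
    count = 0
    for file_name, angles in page_rotations.items():
        # A page counts as rotated when its angle is not a multiple of 360
        if any(angle % 360 != 0 for angle in angles):
            shutil.copy2(
                os.path.join(pdf_dir, file_name),
                os.path.join(rotated_dir, file_name),
            )
            count += 1
    return count
```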

# 3. Convert PDFs to Pickled Files 

In this section we use the pickle library, which implements binary protocols for serializing and de-serializing Python objects, to convert the content of the PDF files into pickled files. The pickle data format uses a relatively compact binary representation, allowing faster processing of the files with a reduced failure rate. We have implemented both multiprocessing and sequential processing for this step.

In [None]:
# list of full paths to pdfs
pdf_folder_path = os.path.realpath('..\\..') + '\\Data_Files\\PDFs\\'

# glob already returns the full path of every matching PDF
subset_list_pdf_full = glob.glob(pdf_folder_path + '*.pdf')

# Directory where the output pickle files are saved
pkl_folder_path = os.path.realpath('..\\..') + '\\Data_Files\\Pickle_Files\\'
# prepare arguments for multiprocessing
args = pickles_functions_mp.get_argument(subset_list_pdf_full, pkl_folder_path)

# timing the process-start
starttime = time.time()

# #sequential
# for arg in args:
#     try:
#         pickles_functions_mp.pickle_pdf_xml(arg)
#     except Exception:
#         #print("exception was raised for {}".format(arg))
#         pass

# multiprocessing
pool = multiprocessing.Pool()
pool.map(pickles_functions_mp.pickle_pdf_xml, args)
pool.close()
pool.join()  # wait for the worker processes to exit
# timing ends and the elapsed time is displayed
print('That took {} seconds'.format(time.time() - starttime))
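
The pattern used in this step (one argument tuple per file, one worker function that writes a pickle) can be sketched as below. This is a simplified stand-in, not the code in `pickles_functions_mp`: the names `build_args`, `pickle_one`, and `load_pickle` are hypothetical, and the "parsed content" here is just the raw bytes of the file, whereas the real step stores content extracted from the PDF.

```python
import os
import pickle


def build_args(pdf_paths, pkl_dir):
    """Pair each PDF path with the pickle path it should be written to."""
    args = []
    for pdf_path in pdf_paths:
        stem = os.path.splitext(os.path.basename(pdf_path))[0]
        args.append((pdf_path, os.path.join(pkl_dir, stem + ".pkl")))
    return args


def pickle_one(arg):
    """Read one file and store a record of its content as a pickle file."""
    pdf_path, pkl_path = arg
    with open(pdf_path, "rb") as f:
        raw = f.read()
    # Placeholder record: a real pipeline would store parsed PDF content here
    record = {
        "source": os.path.basename(pdf_path),
        "n_bytes": len(raw),
        "content": raw,
    }
    with open(pkl_path, "wb") as f:
        pickle.dump(record, f, protocol=pickle.HIGHEST_PROTOCOL)
    return pkl_path


def load_pickle(pkl_path):
    """De-serialize a record written by pickle_one."""
    with open(pkl_path, "rb") as f:
        return pickle.load(f)
```

Because `pickle_one` takes a single tuple and is defined at module level, the same function works in a plain `for` loop or with `multiprocessing.Pool().map(pickle_one, args)`, which is the reason for packing the arguments this way.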

# 4. Convert Rotated PDF Files to Rotated Pickled Files

The data on the rotated pages of the PDF files will not be extracted correctly unless the PDF files are rotated as well. Hence, in this step, we also pickle the rotated PDF files.

In [None]:
# list of full paths to rotated pdfs
pdf_folder_path = os.path.realpath('..\\..') + '\\Data_Files\\PDFs_Rotated\\'

# glob already returns the full path of every matching PDF
subset_list_pdf_full = glob.glob(pdf_folder_path + '*.pdf')

# Directory where the output pickle files are saved
pkl_folder_path = os.path.realpath('..\\..') + '\\Data_Files\\Pickle_Files_Rotated\\'
# prepare arguments for multiprocessing
args = pickles_functions_mp.get_argument(subset_list_pdf_full, pkl_folder_path)

# timing the process-start
starttime = time.time()

# #sequential
# for arg in args:
#     try:
#         pickles_functions_mp.pickle_pdf_xml(arg)
#     except Exception:
#         #print("exception was raised for {}".format(arg))
#         pass

# multiprocessing
pool = multiprocessing.Pool()
pool.map(pickles_functions_mp.pickle_pdf_xml, args)
pool.close()
pool.join()  # wait for the worker processes to exit
# timing ends and the elapsed time is displayed
print('That took {} seconds'.format(time.time() - starttime))

# 5. Extracting PDF Metadata  

In this section, we extract some useful metadata from the PDF files: the ESA category, the file size, the number of pages, and whether an outline (table of contents) is present.

In [None]:
# Identify ESA categories for the PDF files  
Index1 = pdf_metadata.pdf_categorize(Index0_path, Index0)

# Identify the PDF File size 
Index1 = pdf_metadata.pdf_size(os.path.realpath('..\\..'), Index1)

# Identify the number of pages in the PDF file
Index1 = pdf_metadata.pdf_pagenumbers(os.path.realpath('..\\..'), Index1)

# Identify if Outline (or TOC) is present in the PDF file or not
Index1 = pdf_metadata.get_outline_present(os.path.realpath('..\\..'), Index1)

Index1

In [None]:
Index1.to_csv(os.path.realpath('..\\..') + '\\Output_Files\\Index 1 - PDFs for Major Projects with ESAs.csv', index = False, encoding='utf-8-sig')
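
To give a feel for two of these metadata fields, here is a minimal standard-library sketch. It is not the code in `pdf_metadata`: the function name `basic_pdf_metadata` is hypothetical, and the outline check is only a crude byte-level heuristic (it looks for the `/Outlines` key and can miss outlines stored in compressed objects), so a PDF library would be more reliable in practice. Page counts are left out because they need a proper PDF parser.

```python
import os


def basic_pdf_metadata(pdf_path):
    """Return the file size and a rough outline hint for one PDF (sketch only)."""
    with open(pdf_path, "rb") as f:
        raw = f.read()
    return {
        "file_name": os.path.basename(pdf_path),
        "size_bytes": len(raw),
        # Heuristic assumption: an uncompressed document catalog that has an
        # outline contains the key /Outlines somewhere in the file
        "outline_hint": b"/Outlines" in raw,
    }
```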