# Data Extraction Demo

This Jupyter notebook uploads a PDF to the ESG_TEST project without further configuration and extracts the ESG KPIs from that PDF.

**Note**: If you first want to see which questions are answered, execute the cell in section 1); it prints the current KPI questions.

## 0) Insert the needed paths

Specify the path to the pdf you want to extract data from.

**Note**: For a test run, you can use the Test.pdf in corporate_data_extraction/data_extractor/data/TEST/input/pdfs/inference/.

In [1]:
pdf_path = "/opt/app-root/src/TEST/input/pdfs/inference/Test.pdf"

Specify the output folder in which the output should be saved.

In [2]:
output_folder = "/opt/app-root/src/test_output"
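
The download in section 4.2) writes into this folder. Whether the S3 helper creates a missing folder is not shown in this notebook, so the following is a minimal sketch for creating it up front. `ensure_output_folder` is a hypothetical helper introduced here for illustration, and the demo runs against a temporary directory rather than the real output location.

```python
import pathlib
import tempfile


def ensure_output_folder(folder: str) -> pathlib.Path:
    """Create the folder (and any missing parents) and return it as a Path."""
    path = pathlib.Path(folder)
    path.mkdir(parents=True, exist_ok=True)  # no error if it already exists
    return path


# Demonstrated on a temporary directory so nothing is written elsewhere.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_folder = ensure_output_folder(f"{tmp_dir}/test_output")
    demo_created = demo_folder.is_dir()
```

In the notebook itself, the equivalent call would be `ensure_output_folder(output_folder)`.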


**Note:** From here on, you only need to execute the remaining cells with "SHIFT + ENTER".

## 1) Import needed packages and credentials (**DO NOT CHANGE**)

In [3]:
import os
from s3_communication import S3Communication
import pathlib
from dotenv import load_dotenv
import requests
import pandas as pd
from time import sleep, time

# Load credentials
dotenv_dir = os.environ.get("CREDENTIAL_DOTENV_DIR", os.environ.get("PWD", "/opt/app-root/src"))
dotenv_path = pathlib.Path(dotenv_dir) / "credentials.env"
if os.path.exists(dotenv_path):
    load_dotenv(dotenv_path=dotenv_path, override=True)

# init s3 connector to the main s3 bucket
s3c = S3Communication(
    s3_endpoint_url=os.getenv("LANDING_AWS_ENDPOINT"),
    aws_access_key_id=os.getenv("LANDING_AWS_ACCESS_KEY"),
    aws_secret_access_key=os.getenv("LANDING_AWS_SECRET_KEY"),
    s3_bucket=os.getenv("LANDING_AWS_BUCKET_NAME"),
)

project_name = "ESG_TEST"

# Delete old input if still exists
input_inf_prefix = f"corporate_data_extraction_projects/{project_name}/" + "data/input/pdfs/inference"
my_bucket = s3c.s3_resource.Bucket(name=s3c.bucket)
for objects in my_bucket.objects.filter(Prefix=input_inf_prefix):
    s3c.s3_resource.Object(s3c.bucket, objects.key).delete()

# Delete old output if still exists
output_prefix = f"corporate_data_extraction_projects/{project_name}/" + "data/output"
my_bucket = s3c.s3_resource.Bucket(name=s3c.bucket)
for objects in my_bucket.objects.filter(Prefix=output_prefix):
    s3c.s3_resource.Object(s3c.bucket, objects.key).delete()

print(f"Your current working directory is: {os.getcwd()}.")

print("The current KPI's, which are extracted, are:")
# Download the KPI mapping file to list the questions
kpi_prefix = f"corporate_data_extraction_projects/{project_name}/" + "data/input/kpi_mapping"
file_name = "kpi_mapping.csv"
dest_path = "/opt/app-root/src/" + file_name
s3c.download_file_from_s3(dest_path, kpi_prefix, file_name)
df_kpi = pd.read_csv(dest_path)
pd.set_option("display.max_colwidth", None)
display(df_kpi[["kpi_id", "question"]])
# Remove the temporary local copy (use dest_path so this works from any working directory)
pathlib.Path(dest_path).unlink()

Your current working directory is: /opt/app-root/src.
The current KPI's, which are extracted, are:


| | kpi_id | question |
|---|---|---|
| 0 | 0.0 | What is the company name? |
| 1 | 1.0 | In which year was the annual report or the sustainability report published? |
| 2 | 2.0 | What is the total volume of proven and probable hydrocarbons reserves? |
| 3 | 2.1 | What is the volume of estimated proven hydrocarbons reserves? |
| 4 | 2.2 | What is the volume of estimated probable hydrocarbons reserves? |
| 5 | 3.0 | What is the total volume of hydrocarbons production? |
| 6 | 3.1 | What is the total volume of crude oil liquid production? |
| 7 | 3.2 | What is the total volume of natural gas liquid production? |
| 8 | 3.3 | What is the total volume of natural gas production? |
| 9 | 4.0 | What is the annual total production from coal? |
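
The `kpi_id` values in the table suggest a two-level numbering (for example 2.1 and 2.2 under 2.0). That reading is an assumption, not something the notebook states. Under it, a short pandas sketch for selecting only the top-level questions could look like this; `df_demo` is an illustrative copy of a few rows, not the real `df_kpi`.

```python
import pandas as pd

# Illustrative subset of the rows shown above; the real table comes from kpi_mapping.csv.
df_demo = pd.DataFrame(
    {
        "kpi_id": [2.0, 2.1, 2.2, 3.0, 3.1],
        "question": [
            "What is the total volume of proven and probable hydrocarbons reserves?",
            "What is the volume of estimated proven hydrocarbons reserves?",
            "What is the volume of estimated probable hydrocarbons reserves?",
            "What is the total volume of hydrocarbons production?",
            "What is the total volume of crude oil liquid production?",
        ],
    }
)

# Assumption: a whole-number kpi_id marks a top-level KPI.
top_level = df_demo[df_demo["kpi_id"] % 1 == 0]
top_level_ids = top_level["kpi_id"].tolist()
```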


## 2) Prepare Input for "OSC Data Extractor" (**DO NOT CHANGE**)

Next we upload the file to S3 and print the location it was uploaded to.

In [4]:
pdf_path = pathlib.Path(pdf_path)

# Test if input is valid. !Do not change!
if pdf_path.is_file() and pdf_path.suffix == ".pdf":
    print("Path describes a valid pdf file.")
else:
    msg = "ERROR: Path does not describe a valid pdf file."
    raise Exception(msg)

prefix_data = "corporate_data_extraction_projects/" + project_name + "/data"

# Upload pdfs for inference
s3c.upload_file_to_s3(filepath=str(pdf_path), s3_prefix=prefix_data + "/input/pdfs/inference", s3_key=pdf_path.name)

uploaded_file = f"corporate_data_extraction_projects/{project_name}/" + "data/input/pdfs/inference"

# Show only objects which satisfy our prefix
upload = False
my_bucket = s3c.s3_resource.Bucket(name=s3c.bucket)
for objects in my_bucket.objects.filter(Prefix=uploaded_file):
    print("The file was uploaded to the following location: \n" + objects.key)
    upload = True

if not upload:
    print("Something went wrong. Please check that the input is correct.")

Path describes a valid pdf file.
The file was uploaded to the following location: 
corporate_data_extraction_projects/ESG_TEST/data/input/pdfs/inference/Test.pdf
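
The printed location is the inference prefix followed by the file name. As a small sketch, the expected key can be computed ahead of time and compared with what the cell prints. `inference_key` is a hypothetical helper, and joining prefix and file name with `/` is inferred from the output above rather than from the `S3Communication` source.

```python
import pathlib


def inference_key(project_name: str, pdf_path: str) -> str:
    """Build the S3 key under which the uploaded PDF is expected to appear."""
    pdf_name = pathlib.PurePosixPath(pdf_path).name
    return (
        f"corporate_data_extraction_projects/{project_name}"
        f"/data/input/pdfs/inference/{pdf_name}"
    )


expected_key = inference_key(
    "ESG_TEST", "/opt/app-root/src/TEST/input/pdfs/inference/Test.pdf"
)
```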


## 3) Start the "OSC Data Extractor" (**DO NOT CHANGE**)

In [5]:
http_liveness = "http://main-terminal-aicos-osc-demo.apps.odh-cl2.apps." + "os-climate.org/liveness"

tmp = requests.get(http_liveness)
if tmp.status_code == 200:
    print("Server is up and we can start extraction.")
else:
    raise Exception("Server is not up. " + "Please contact the Data Extraction team.")

http_inference = (
    "http://main-terminal-aicos-osc-demo.apps.odh-cl2.apps."
    + "os-climate.org/infer?project_name=ESG_TEST&s3_usage=Y&mode=both"
)

tmp_2 = requests.get(http_inference)
tic = time()
if tmp_2.status_code == 200:
    print("Extraction worked out, please check the output.")
elif tmp_2.status_code == 504:
    running = True
    http_running = "http://main-terminal-aicos-osc-demo.apps.odh-cl2.apps." + "os-climate.org/running"
    waiting_time = 0
    pause_time = 30
    print("Gateway timeout, but extraction is still running. " + f"We will recheck in {pause_time} seconds.")
    while running:
        sleep(pause_time)
        waiting_time += pause_time
        tmp_3 = requests.get(http_running)
        if "False" in str(tmp_3.content):
            running = False
        else:
            print(
                "Extraction is still running. "
                + f"We will recheck in {pause_time} seconds. "
                + f"Total waiting time up to now is {waiting_time} seconds."
            )
    print("Extraction done. Please check the output.")
else:
    raise Exception(
        "Unexpected error while extracting."
        + " Please contact the Data Extraction team."
        + " Please also provide the "
        + f"error message: \n {tmp_2.content}."
    )
toc = time()
print(f"It took in total {toc-tic} seconds.")

Server is up and we can start extraction.
Gateway timeout, but extraction is still running. We will recheck in 30 seconds.
Extraction is still running. We will recheck in 30 seconds. Total waiting time up to now is 30 seconds.
Extraction done. Please check the output.
It took in total 60.08472490310669 seconds.
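
The polling loop in the cell above waits without an upper bound. The following is a hedged variation, not the notebook's own code: it checks the status before the first pause and gives up after `max_wait` seconds. `wait_until_done`, `max_wait`, `sleep_fn` and `clock` are names introduced here for illustration; the status check is passed in as a callable so the sketch can run without the server.

```python
from time import monotonic, sleep
from typing import Callable


def wait_until_done(
    is_running: Callable[[], bool],
    pause_time: float = 30.0,
    max_wait: float = 3600.0,
    sleep_fn: Callable[[float], object] = sleep,
    clock: Callable[[], float] = monotonic,
) -> float:
    """Poll is_running() until it returns False; return the elapsed seconds."""
    start = clock()
    while is_running():
        if clock() - start >= max_wait:
            raise TimeoutError(f"Extraction still running after {max_wait} seconds.")
        sleep_fn(pause_time)
    return clock() - start


# Demo with a fake status source: "running" twice, then finished.
answers = iter([True, True, False])
pauses = []  # records the requested pauses instead of really sleeping
waited = wait_until_done(lambda: next(answers), pause_time=30.0, sleep_fn=pauses.append)
```

In the notebook, `is_running` could wrap the existing check, for example `lambda: "False" not in str(requests.get(http_running).content)`, which mirrors the condition used in the cell above.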


## 4) Check and download the output of the "OSC Data Extractor"

Next we can check which output was produced and afterwards we can download it.

### 4.1) Check the output

In [6]:
project_output_s3 = f"corporate_data_extraction_projects/{project_name}/" + "data/output"
my_bucket = s3c.s3_resource.Bucket(name=s3c.bucket)
for objects in my_bucket.objects.filter(Prefix=project_output_s3):
    print(objects.key)
    print(objects.last_modified)

corporate_data_extraction_projects/ESG_TEST/data/output/KPI_EXTRACTION/joined_ml_rb/1698000539_Test.pdf.csv
2023-10-22 18:49:02+00:00
corporate_data_extraction_projects/ESG_TEST/data/output/KPI_EXTRACTION/ml/Text/Test_predictions_kpi.csv
2023-10-22 18:48:59+00:00
corporate_data_extraction_projects/ESG_TEST/data/output/RELEVANCE/Text/Test_predictions_relevant.csv
2023-10-22 18:48:17+00:00
corporate_data_extraction_projects/ESG_TEST/data/output/TEXT_EXTRACTION/Test.json
2023-10-22 18:49:02+00:00
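
The keys listed above share a common prefix, and the first path component after it names the stage that produced the file. A minimal sketch for grouping the keys by that component follows; the list is copied from the output above, and `demo_prefix` and `by_stage` are illustrative names.

```python
from collections import defaultdict

demo_prefix = "corporate_data_extraction_projects/ESG_TEST/data/output/"
keys = [
    demo_prefix + "KPI_EXTRACTION/joined_ml_rb/1698000539_Test.pdf.csv",
    demo_prefix + "KPI_EXTRACTION/ml/Text/Test_predictions_kpi.csv",
    demo_prefix + "RELEVANCE/Text/Test_predictions_relevant.csv",
    demo_prefix + "TEXT_EXTRACTION/Test.json",
]

grouped = defaultdict(list)
for key in keys:
    if not key.startswith(demo_prefix):
        continue  # ignore anything outside the output prefix
    relative = key[len(demo_prefix):]
    stage, _, rest = relative.partition("/")  # first path component is the stage
    grouped[stage].append(rest)

by_stage = dict(grouped)
```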


### 4.2) Download the output

In [7]:
# Download the whole folder
prefix = f"corporate_data_extraction_projects/{project_name}/data/output"
s3c.download_files_in_prefix_to_dir(
    prefix,
    output_folder,
)
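
After the download, the CSV results can be inspected with pandas. The local directory layout produced by `download_files_in_prefix_to_dir` is not shown in this notebook, so the sketch below simply searches the folder recursively. `load_csv_outputs` is a hypothetical helper, and the demo file with its column names is a placeholder, not real extractor output.

```python
import pathlib
import tempfile

import pandas as pd


def load_csv_outputs(folder: str) -> dict:
    """Read every CSV below folder into a DataFrame, keyed by its relative path."""
    root = pathlib.Path(folder)
    return {
        path.relative_to(root).as_posix(): pd.read_csv(path)
        for path in sorted(root.rglob("*.csv"))
    }


# Demo on a temporary directory with one placeholder CSV.
with tempfile.TemporaryDirectory() as tmp_dir:
    sample = pathlib.Path(tmp_dir) / "KPI_EXTRACTION" / "ml" / "Text" / "sample.csv"
    sample.parent.mkdir(parents=True)
    sample.write_text("kpi_id,answer\n0.0,placeholder\n")
    tables = load_csv_outputs(tmp_dir)

loaded_names = list(tables)
```

In the notebook, `load_csv_outputs(output_folder)` would return one DataFrame per downloaded CSV file.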