<span style="color:red; font-family:Helvetica Neue, Helvetica, Arial, sans-serif; font-size:2em;">An Exception was encountered at '<a href="#papermill-error-cell">In [9]</a>'.</span>

# Relevance Inference
This notebook takes the text extracted in the PDF preprocessing stage and the fine-tuned relevance model from the training stage, and runs inference on that text.

In [1]:
import os
from os.path import exists
import pandas as pd
import pathlib
from src.models.relevance_infer import TextRelevanceInfer
from config_farm_train import InferConfig
import config
from src.data.s3_communication import S3Communication, S3FileType
from dotenv import load_dotenv
import zipfile

import glob

07/10/2022 10:38:30 - INFO - farm.modeling.prediction_head -   Better speed can be achieved with apex installed from https://www.github.com/nvidia/apex .


In [2]:
# Load credentials
dotenv_dir = os.environ.get(
    "CREDENTIAL_DOTENV_DIR", os.environ.get("PWD", "/opt/app-root/src")
)
dotenv_path = pathlib.Path(dotenv_dir) / "credentials.env"
if os.path.exists(dotenv_path):
    load_dotenv(dotenv_path=dotenv_path, override=True)

In [3]:
# init s3 connector
s3c = S3Communication(
    s3_endpoint_url=os.getenv("S3_ENDPOINT"),
    aws_access_key_id=os.getenv("S3_LANDING_ACCESS_KEY"),
    aws_secret_access_key=os.getenv("S3_LANDING_SECRET_KEY"),
    s3_bucket=os.getenv("S3_BUCKET"),
)
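
`os.getenv` returns `None` for an unset variable, so a missing credential only surfaces later as a connection error. As a minimal sketch, the settings can be checked before the connector is built. The variable names come from the cell above; the helper `missing_env_vars` is a hypothetical name introduced here for illustration.

```python
import os
from typing import Iterable, List, Mapping

# Variable names taken from the S3 connector cell above.
REQUIRED_S3_VARS = (
    "S3_ENDPOINT",
    "S3_LANDING_ACCESS_KEY",
    "S3_LANDING_SECRET_KEY",
    "S3_BUCKET",
)


def missing_env_vars(
    names: Iterable[str], env: Mapping[str, str] = os.environ
) -> List[str]:
    """Return the names that are unset or empty in `env` (hypothetical helper)."""
    return [name for name in names if not env.get(name)]


missing = missing_env_vars(REQUIRED_S3_VARS)
if missing:
    print("Missing S3 settings:", ", ".join(missing))
```

Passing the environment as a parameter keeps the check easy to exercise with a plain dictionary.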

In [4]:
infer_config = InferConfig("infer_demo")

In [5]:
# When running in Automation using Elyra and Kubeflow Pipelines,
# set AUTOMATION = 1 as an environment variable
if os.getenv("AUTOMATION"):
    # extracted pdfs
    if not os.path.exists(config.BASE_EXTRACTION_FOLDER):
        config.BASE_EXTRACTION_FOLDER.mkdir(parents=True, exist_ok=True)

    # inference results dir
    if not os.path.exists(infer_config.result_dir['Text']):
        pathlib.Path(infer_config.result_dir['Text']).mkdir(parents=True, exist_ok=True)

    # load dir
    if not os.path.exists(infer_config.load_dir['Text']):
        pathlib.Path(infer_config.load_dir['Text']).mkdir(parents=True, exist_ok=True)

    # download extracted pdfs from s3
    s3c.download_files_in_prefix_to_dir(
        config.BASE_EXTRACTION_S3_PREFIX,
        config.BASE_EXTRACTION_FOLDER,
    )

In [6]:
model_root = pathlib.Path(infer_config.load_dir['Text']).parent
model_zfilename = "RELEVANCE.zip"

model_rel_zip = pathlib.Path(model_root, model_zfilename)

# Download and unpack RELEVANCE.zip only when it is not already cached locally
if not exists(model_rel_zip):
    s3c.download_file_from_s3(model_rel_zip, config.CHECKPOINT_S3_PREFIX, model_zfilename)

    with zipfile.ZipFile(model_rel_zip, 'r') as z:
        z.extractall(model_root)
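
One way to avoid fetching the checkpoint on every run is to treat the local zip file as a cache. The sketch below shows that pattern in isolation. The function name `ensure_model_extracted` and the `download` callable are assumptions made for illustration, not part of the project code.

```python
import pathlib
import zipfile
from typing import Callable


def ensure_model_extracted(
    zip_path: pathlib.Path,
    target_dir: pathlib.Path,
    download: Callable[[pathlib.Path], None],
) -> bool:
    """Fetch and unpack the archive only when it is not cached locally.

    `download` is a hypothetical callable that writes the archive to
    `zip_path`. Returns True if a download happened, False if the
    cached archive was reused.
    """
    if zip_path.exists():
        return False
    zip_path.parent.mkdir(parents=True, exist_ok=True)
    download(zip_path)
    with zipfile.ZipFile(zip_path, "r") as archive:
        archive.extractall(target_dir)
    return True
```

In this notebook, `download` could wrap the `s3c.download_file_from_s3(...)` call from the cell above, with `model_rel_zip` as `zip_path` and `model_root` as `target_dir`.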

We advise that you manually update the inference parameters in the corresponding config file:

`esg_data_pipeline/config/config_farm_trainer.py`

## Inference

### Loading the model

The following cells print the configured directories, load the KPI mapping from S3, and then load the trained model.

In [7]:
print(infer_config.load_dir)
print(infer_config.extracted_dir)
print(infer_config.result_dir)

{'Text': '/opt/app-root/src/aicoe-osc-demo/models/RELEVANCE'}
/opt/app-root/src/aicoe-osc-demo/data/extraction
{'Text': '/opt/app-root/src/aicoe-osc-demo/data/infer_relevance'}


In [9]:
kpi_df = s3c.download_df_from_s3(
    f"{config.EXPERIMENT_NAME}/kpi_mapping",
    config.KPI_MAPPING_CSV,
    filetype=S3FileType.CSV,
    header=0,
)
kpi_df.head()

Unnamed: 0,kpi_id,question,sectors,add_year,kpi_category
0,1,What is the company name?,"OG, CM, CU",False,TEXT
1,2,What is the Start Date of the CDP report publi...,"OG, CM, CU",True,TEXT
2,3,What is the End Date of the CDP report published?,"OG, CM, CU",True,TEXT
3,4,What is the currency used for all financial in...,"OG, CM, CU",False,TEXT
4,5,Did you have an emissions target that was acti...,"OG, CM, CU",True,TEXT


<span id="papermill-error-cell" style="color:red; font-family:Helvetica Neue, Helvetica, Arial, sans-serif; font-size:2em;">Execution using papermill encountered an exception here and stopped:</span>

In [10]:
component = TextRelevanceInfer(infer_config, kpi_df)



### Prediction on a Single Example

In [11]:
input_text = "The company is going to reduce 8% in gas production"
input_question = "Is the company going to go green?"
component.run_text(input_text=input_text, input_question=input_question)

[{'task': 'text_classification',
  'predictions': [{'start': None,
    'end': None,
    'context': 'Is the company going to go green?|The company is going to reduce 8% in gas production',
    'label': '1',
    'probability': 0.7343753}]}]

In [12]:
input_text = "The company is about semi conductors"
input_question = "Is the company going to go green?"
component.run_text(input_text=input_text, input_question=input_question)

[{'task': 'text_classification',
  'predictions': [{'start': None,
    'end': None,
    'context': 'Is the company going to go green?|The company is about semi conductors',
    'label': '0',
    'probability': 0.9893261}]}]
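
The two outputs above share one structure: a list with a single task entry whose `predictions` list holds the label and a probability. Judging from these two examples, the probability appears to be the confidence in the predicted label rather than in label `'1'`. A minimal sketch for unpacking the result, assuming label `'1'` means "relevant" (the helper name `top_prediction` is hypothetical):

```python
from typing import Any, Dict, List, Tuple


def top_prediction(result: List[Dict[str, Any]]) -> Tuple[bool, float]:
    """Return (is_relevant, probability) for the first prediction in `result`.

    Assumes the structure printed above and that label '1' means relevant.
    """
    prediction = result[0]["predictions"][0]
    return prediction["label"] == "1", float(prediction["probability"])


# Output copied from the gas-production example above.
example = [{
    "task": "text_classification",
    "predictions": [{
        "start": None,
        "end": None,
        "context": "Is the company going to go green?|The company is going to reduce 8% in gas production",
        "label": "1",
        "probability": 0.7343753,
    }],
}]

is_relevant, probability = top_prediction(example)
print(is_relevant, round(probability, 2))  # True 0.73
```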

### Prediction on an Entire Folder

`run_folder()` runs prediction on every JSON file in the `/data/extraction` folder. The runtime grows with the number of JSON files.
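
Because the runtime depends on the number of input files, it can help to count them before starting. The sketch below assumes the JSON files sit directly inside the extraction folder rather than in subfolders. The helper name `count_json_files` is hypothetical.

```python
import pathlib


def count_json_files(folder: str) -> int:
    """Count the *.json files directly inside `folder` (non-recursive)."""
    return sum(1 for _ in pathlib.Path(folder).glob("*.json"))
```

In this notebook the folder would be `infer_config.extracted_dir`, whose value is printed earlier.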

In [13]:
component.run_folder()

07/10/2022 01:50:45 - INFO - src.models.relevance_infer -   #################### Starting Relevence Inference for the following extracted pdf files found in /opt/app-root/src/aicoe-osc-demo/data/infer_relevance:
['2020-cdp-climate-response-checkpoint', '2020-cdp-climate-response', 'Adobe_CDP_Climate_Change_Questionnaire_2021', 'Apple_CDP-Climate-Change-Questionnaire_2021', 'Bayer AG Climate Change 2021', 'Corning_Incorporated_CDP_Climate_Change_Questionnaire_2021_FINAL', 'Michelin-CDP-Climate-Change-2021_def', 'NextEra Energy 2021 CDP Response', 'PGE_Corporation_CDP_Climate_Change_Questionnaire_2021', 'Unilever CDP Climate Response', 'bp-cdp-climate-change-questionnaire-2021', 'gap_inc-_cdp_climate_change_questionnaire_2021', 'sustainability-report-2019', 'vodafone-group-cdp-climate-change-questionnaire2021'] 
07/10/2022 01:50:45 - INFO - src.models.relevance_infer -   #################### 1/14 PDFs
07/10/2022 01:50:45 - INFO - src.models.relevance_infer -   Running inference for 2020-

Unnamed: 0,page,pdf_name,text,text_b,source
0,0,2020-cdp-climate-response-checkpoint,What is the company name?,The Coca-Cola Company - Climate Change 2020,Text
1,0,2020-cdp-climate-response-checkpoint,What is the company name?,The Coca-Cola Company (NYSE: KO) is here to re...,Text
2,0,2020-cdp-climate-response-checkpoint,What is the company name?,The Coca-Cola Company is a total beverage comp...,Text
3,0,2020-cdp-climate-response-checkpoint,What is the company name?,Together with our approximately 225 bottling p...,Text
4,2,2020-cdp-climate-response-checkpoint,What is the company name?,"Please explain At The Coca-Cola Company, we re...",Text
...,...,...,...,...,...
3743,75,vodafone-group-cdp-climate-change-questionnair...,Report your organization’s energy consumption ...,MWh consumed accounted for at a zero emission ...,Text
3744,76,vodafone-group-cdp-climate-change-questionnair...,Report your organization’s energy consumption ...,MWh consumed accounted for at a zero emission ...,Text
3745,76,vodafone-group-cdp-climate-change-questionnair...,Report your organization’s energy consumption ...,MWh consumed accounted for at a zero emission ...,Text
3746,76,vodafone-group-cdp-climate-change-questionnair...,Report your organization’s energy consumption ...,MWh consumed accounted for at a zero emission ...,Text


The results are saved as CSV files in the result directory. Each row records the question, the extracted text, and the page number of that text in the source PDF file.

In [14]:
csvfiles = glob.glob(infer_config.result_dir['Text'] + "/*.csv")
for i, f in enumerate(csvfiles):
    df_table_results = pd.read_csv(f)
    if i < 5:
        # Show a preview of the first five result files
        print(f"for file {f}:")
        print(df_table_results.head(20))
    else:
        # For the remaining files, report only the row count
        print(f"and file {f} (len = {len(df_table_results)})")

for file /opt/app-root/src/aicoe-osc-demo/data/infer_relevance/sustainability-report-2019_predictions_relevant.csv:
    Unnamed: 0  page                    pdf_name                       text  \
0            0     1  sustainability-report-2019  What is the company name?   
1            1     1  sustainability-report-2019  What is the company name?   
2            2     1  sustainability-report-2019  What is the company name?   
3            3     1  sustainability-report-2019  What is the company name?   
4            4     1  sustainability-report-2019  What is the company name?   
5            5     1  sustainability-report-2019  What is the company name?   
6            6     1  sustainability-report-2019  What is the company name?   
7            7     1  sustainability-report-2019  What is the company name?   
8            8     1  sustainability-report-2019  What is the company name?   
9            9     2  sustainability-report-2019  What is the company name?   
10          10 
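
The preview above is long when many PDFs are processed. As a compact alternative, the row count of each result file can be collected into a dictionary. This is a dependency-free sketch using the standard `csv` module. It assumes every result file has a header row, as the printed output suggests, and the helper name `count_result_rows` is hypothetical.

```python
import csv
import glob
import os
from typing import Dict


def count_result_rows(result_dir: str) -> Dict[str, int]:
    """Map each CSV file name in `result_dir` to its number of data rows."""
    counts = {}
    for path in sorted(glob.glob(os.path.join(result_dir, "*.csv"))):
        with open(path, newline="", encoding="utf-8") as handle:
            reader = csv.reader(handle)
            next(reader, None)  # skip the header row
            counts[os.path.basename(path)] = sum(1 for _ in reader)
    return counts
```

Here `result_dir` would be `infer_config.result_dir['Text']`.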

In [None]:
if os.getenv("AUTOMATION"):
    # upload the predicted files to s3
    s3c.upload_files_in_dir_to_prefix(
        infer_config.result_dir['Text'],
        config.BASE_INFER_RELEVANCE_S3_PREFIX
    )

# Conclusion
This notebook ran the _Relevance_ inference on a sample dataset and stored the output in CSV format.

In [4]:
config.BASE_INFER_RELEVANCE_S3_PREFIX

'test_cdp/pipeline_run/small/infer_relevance'