# Relevance Inference
This notebook takes the text extracted in the PDF preprocessing stage and the fine-tuned relevance model from the training stage, and runs inference on the input text.

In [1]:
import os
import pandas as pd
import pathlib
from src.models.relevance_infer import TextRelevanceInfer
from config_farm_train import InferConfig
import config
from src.data.s3_communication import S3Communication
from dotenv import load_dotenv
import zipfile

11/02/2021 15:12:53 - INFO - farm.modeling.prediction_head -   Better speed can be achieved with apex installed from https://www.github.com/nvidia/apex .


In [2]:
# Load credentials
dotenv_dir = os.environ.get(
    "CREDENTIAL_DOTENV_DIR", os.environ.get("PWD", "/opt/app-root/src")
)
dotenv_path = pathlib.Path(dotenv_dir) / "credentials.env"
if os.path.exists(dotenv_path):
    load_dotenv(dotenv_path=dotenv_path, override=True)
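
The S3 connector in the next cell reads four environment variables (`S3_ENDPOINT`, `AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`, `S3_BUCKET`). If `credentials.env` is not found, the cell above silently skips loading it, so a quick check can help. The following is a minimal sketch only; the helper name `missing_env_vars` is hypothetical and not part of this project.

```python
import os

# Variable names taken from the S3 connector cell in this notebook.
REQUIRED_VARS = ("S3_ENDPOINT", "AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY", "S3_BUCKET")


def missing_env_vars(names=REQUIRED_VARS, environ=None):
    """Return the names that are unset or empty in the given environment mapping.

    Hypothetical helper for illustration; defaults to os.environ.
    """
    env = os.environ if environ is None else environ
    return [name for name in names if not env.get(name)]


missing = missing_env_vars()
if missing:
    print("Missing credentials:", ", ".join(missing))
```

Running this right after the credentials cell reports any variable that would otherwise reach the connector as `None`.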

In [3]:
# init s3 connector
s3c = S3Communication(
    s3_endpoint_url=os.getenv("S3_ENDPOINT"),
    aws_access_key_id=os.getenv("AWS_ACCESS_KEY_ID"),
    aws_secret_access_key=os.getenv("AWS_SECRET_ACCESS_KEY"),
    s3_bucket=os.getenv("S3_BUCKET"),
)

In [4]:
infer_config = InferConfig("infer_demo")

In [None]:
# When running in Automation using Elyra and Kubeflow Pipelines,
# set AUTOMATION = 1 as an environment variable
if os.getenv("AUTOMATION"):
    # extracted pdfs
    if not os.path.exists(config.BASE_EXTRACTION_FOLDER):
        config.BASE_EXTRACTION_FOLDER.mkdir(parents=True, exist_ok=True)

    # inference results dir
    if not os.path.exists(infer_config.result_dir['Text']):
        pathlib.Path(infer_config.result_dir['Text']).mkdir(parents=True, exist_ok=True)

    # load dir
    if not os.path.exists(infer_config.load_dir['Text']):
        pathlib.Path(infer_config.load_dir['Text']).mkdir(parents=True, exist_ok=True)

    # download extracted pdfs from s3
    s3c.download_files_in_prefix_to_dir(
        config.BASE_EXTRACTION_S3_PREFIX,
        config.BASE_EXTRACTION_FOLDER,
    )
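
As a side note on the directory setup above: `Path.mkdir(parents=True, exist_ok=True)` already tolerates an existing directory, so the preceding `os.path.exists` checks are optional. A minimal sketch, using a temporary directory and a hypothetical helper named `ensure_dir`:

```python
import pathlib
import tempfile


def ensure_dir(path):
    """Create `path` and any missing parents; do nothing if it already exists."""
    directory = pathlib.Path(path)
    directory.mkdir(parents=True, exist_ok=True)
    return directory


with tempfile.TemporaryDirectory() as tmp:
    target = pathlib.Path(tmp) / "data" / "infer"
    ensure_dir(target)  # creates both "data" and "infer"
    ensure_dir(target)  # second call is a no-op and raises no exception
    created = target.is_dir()
```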

In [5]:
# Download the fine-tuned relevance model checkpoint from S3
model_root = pathlib.Path(infer_config.load_dir['Text']).parent
model_rel_zip = pathlib.Path(model_root, 'RELEVANCE.zip')
s3c.download_file_from_s3(model_rel_zip, config.CHECKPOINT_S3_PREFIX, "RELEVANCE.zip")

# Unpack the checkpoint into the parent of the model load directory
with zipfile.ZipFile(model_rel_zip, 'r') as z:
    z.extractall(model_root)
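
If the download fails or is interrupted, `zipfile.ZipFile` raises an error that can be hard to trace back to the missing checkpoint. The sketch below shows one way to validate an archive before extracting it, using only the standard library. The helper name `extract_checked` and the demo archive contents are hypothetical.

```python
import pathlib
import tempfile
import zipfile


def extract_checked(zip_path, dest_dir):
    """Extract an archive after verifying it is a readable zip; return member names."""
    zip_path = pathlib.Path(zip_path)
    if not zipfile.is_zipfile(zip_path):
        raise ValueError(f"{zip_path} is missing or not a valid zip archive")
    with zipfile.ZipFile(zip_path, "r") as archive:
        bad_member = archive.testzip()  # first corrupt member name, or None
        if bad_member is not None:
            raise ValueError(f"Corrupt member in archive: {bad_member}")
        archive.extractall(dest_dir)
        return archive.namelist()


# Demonstration with a throwaway archive (placeholder contents, not a real model).
with tempfile.TemporaryDirectory() as tmp:
    demo_zip = pathlib.Path(tmp) / "demo.zip"
    with zipfile.ZipFile(demo_zip, "w") as archive:
        archive.writestr("RELEVANCE/dummy.txt", "placeholder")
    members = extract_checked(demo_zip, tmp)
    extracted_ok = (pathlib.Path(tmp) / "RELEVANCE" / "dummy.txt").is_file()
```

In this notebook the call would be `extract_checked(model_rel_zip, model_root)`.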

Before running inference, we advise that you manually update the parameters in the corresponding config file:

`esg_data_pipeline/config/config_farm_trainer.py`

## Inference

### Loading the model

The following cells print the configured directories and then load the trained model.

In [6]:
print(infer_config.load_dir)
print(infer_config.extracted_dir)
print(infer_config.result_dir)

{'Text': '/opt/app-root/src/aicoe-osc-demo/models/RELEVANCE'}
/opt/app-root/src/aicoe-osc-demo/data/extraction
{'Text': '/opt/app-root/src/aicoe-osc-demo/data/infer'}


In [7]:
component = TextRelevanceInfer(infer_config)



### Prediction on a Single Example

In [8]:
input_text = "The company is going to reduce 8% in gas production"
input_question = "Is the company going to go green?"
component.run_text(input_text=input_text, input_question=input_question)

[{'task': 'text_classification',
  'task_name': 'text_classification',
  'predictions': [{'start': None,
    'end': None,
    'context': 'Is the company going to go green?|The company is going to reduce 8% in gas production',
    'label': '0',
    'probability': 0.821921}]}]

In [9]:
input_text = "The company is about semi conductors"
input_question = "Is the company going to go green?"
component.run_text(input_text=input_text, input_question=input_question)

[{'task': 'text_classification',
  'task_name': 'text_classification',
  'predictions': [{'start': None,
    'end': None,
    'context': 'Is the company going to go green?|The company is about semi conductors',
    'label': '0',
    'probability': 0.821921}]}]
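
The returned structure is a list of task results, each holding a list of predictions, and the `context` field joins the question and the text with a `|`. The sketch below pulls out those fields from the output shown above. The helper name `top_prediction` is hypothetical, and the sketch makes no assumption about what label `'0'` means.

```python
def top_prediction(result):
    """Return (question, text, label, probability) from the first prediction."""
    prediction = result[0]["predictions"][0]
    # The context is "<question>|<text>"; split only on the first separator.
    question, text = prediction["context"].split("|", 1)
    return question, text, prediction["label"], prediction["probability"]


# Literal copied from the first output shown above.
sample_result = [{
    "task": "text_classification",
    "task_name": "text_classification",
    "predictions": [{
        "start": None,
        "end": None,
        "context": "Is the company going to go green?|"
                   "The company is going to reduce 8% in gas production",
        "label": "0",
        "probability": 0.821921,
    }],
}]

question, text, label, probability = top_prediction(sample_result)
print(label, probability)
```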

### Prediction on an Entire Folder

`run_folder()` makes predictions on all the JSON files in the `/data/extraction` folder. The run time grows with the number of JSON files.

In [10]:
component.run_folder()

11/02/2021 15:13:42 - INFO - src.models.relevance_infer -   #################### Starting Relevence Inference for the following extracted pdf files found in /opt/app-root/src/aicoe-osc-demo/data/infer:
['sustainability-report-2019'] 
11/02/2021 15:13:42 - INFO - src.models.relevance_infer -   #################### 1/1 PDFs
11/02/2021 15:13:42 - INFO - src.models.relevance_infer -   The relevance infer results for sustainability-report-2019 already exists. Skipping.
11/02/2021 15:13:42 - INFO - src.models.relevance_infer -   If you would like to re-process the already processed files, set `skip_processed_files` to False in the config file. 
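
Since the run time depends on how many JSON files are present, it can be useful to count them before calling `run_folder()`. A minimal sketch, assuming the files sit directly inside the folder (the actual layout may differ); `list_json_files` is a hypothetical helper:

```python
import pathlib
import tempfile


def list_json_files(folder):
    """Return the sorted JSON files located directly inside `folder`."""
    return sorted(pathlib.Path(folder).glob("*.json"))


# Demonstration on a temporary folder with one JSON file and one other file.
with tempfile.TemporaryDirectory() as tmp:
    (pathlib.Path(tmp) / "report-a.json").write_text("{}")
    (pathlib.Path(tmp) / "notes.txt").write_text("")
    names = [p.name for p in list_json_files(tmp)]
```

In this notebook, `len(list_json_files(infer_config.extracted_dir))` would give the count for the configured extraction folder.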


The results are saved in a CSV file. Each row pairs a question (`text`) with an extracted paragraph (`text_b`), along with the page number and the name of the source PDF file.

In [11]:
df_table_results = pd.read_csv(infer_config.result_dir['Text'] + "/sustainability-report-2019_predictions_relevant.csv")
df_table_results.head(20)

| Unnamed: 0.1 | Unnamed: 0 | page | pdf_name | text | text_b | source |
|---|---|---|---|---|---|---|
| 0 | 0 | 1 | sustainability-report-2019 | What is the company name? | to invest in the protection of tropical forest... | Text |
| 1 | 1 | 1 | sustainability-report-2019 | What is the company name? | mechanism to tap into the important and effect... | Text |
| 2 | 2 | 1 | sustainability-report-2019 | What is the company name? | natural sinks to absorb CO₂ from the atmosphere. | Text |
| 3 | 3 | 1 | sustainability-report-2019 | What is the company name? | The global energy transition creates new busin... | Text |
| 4 | 4 | 1 | sustainability-report-2019 | What is the company name? | opportunities. Decades of offshore experience ... | Text |
| 5 | 5 | 1 | sustainability-report-2019 | What is the company name? | solutions enable Equinor to capture those oppo... | Text |
| 6 | 6 | 1 | sustainability-report-2019 | What is the company name? | offshore wind area. Last year, Equinor prepare... | Text |
| 7 | 7 | 1 | sustainability-report-2019 | What is the company name? | substantially scaling up investments in offsho... | Text |
| 8 | 8 | 1 | sustainability-report-2019 | What is the company name? | with our partner SSE, we were awarded contract... | Text |
| 9 | 9 | 1 | sustainability-report-2019 | What is the company name? | world’s largest offshore wind farm in the Dogg... | Text |
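
The `Unnamed: 0` and `Unnamed: 0.1` columns are typically leftover index columns that were written to the CSV and then re-read as data. A minimal sketch of dropping them with pandas follows; the two sample rows mirror the shape of the output above, with shortened placeholder values for `text_b`.

```python
import io

import pandas as pd

# Two rows in the same shape as the results file (text_b values are placeholders).
csv_text = (
    "Unnamed: 0.1,Unnamed: 0,page,pdf_name,text,text_b,source\n"
    "0,0,1,sustainability-report-2019,What is the company name?,first paragraph,Text\n"
    "1,1,1,sustainability-report-2019,What is the company name?,second paragraph,Text\n"
)

df = pd.read_csv(io.StringIO(csv_text))
# Keep only the columns whose names do not start with "Unnamed".
df = df.loc[:, ~df.columns.str.startswith("Unnamed")]
columns = list(df.columns)
```

The same two lines after `pd.read_csv` can be applied to `df_table_results` to leave only `page`, `pdf_name`, `text`, `text_b`, and `source`.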


In [12]:
# upload the predicted files to s3
s3c.upload_files_in_dir_to_prefix(
    infer_config.result_dir['Text'],
    config.BASE_INFER_RELEVANCE_S3_PREFIX
)

# Conclusion
This notebook ran _Relevance_ inference on a sample dataset, stored the output in CSV format, and uploaded the results to S3.