# Entity Extraction
We start by reading back the results of the classification models. In the previous step, we classified the pages of the Responsibility and ESG reports with two classification models. The `finclf_esg` model classifies the report pages into three categories: `Environmental`, `Social`, and `Governance`. The `finclf_augmented_esg` model classifies the report pages into 26 categories.

Now we apply the `finner_responsibility_reports` NER model to extract up to 20 quantifiable entity types, including KPIs, from the Responsibility and ESG report pages.

### Colab Setup

In [None]:
! pip install -q johnsnowlabs

In [None]:
from johnsnowlabs import *

# After uploading your license, run this to install all licensed Python wheels and pre-download the JARs for the Spark session JVM
nlp.install(refresh_install=True, force_browser=True)

In [3]:
from johnsnowlabs import *

from pyspark.ml import PipelineModel
from pyspark.sql import functions as F

import pandas as pd
import warnings
warnings.filterwarnings('ignore')

# Automatically load the license data and start a session with all the JARs the user has access to
spark = nlp.start()

👌 Launched cpu optimized session with with: 🚀Spark-NLP==4.3.2, 💊Spark-Healthcare==4.3.2, running on ⚡ PySpark==3.1.2


## Import the result of ESG models

In [4]:
!wget -q https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/master/finance-nlp/data/solution_accelerator_esg_and_rr/finclf_esg_result.parquet.zip

!wget -q https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/master/finance-nlp/data/solution_accelerator_esg_and_rr/finclf_augmented_esg_result.parquet.zip

In [None]:
!unzip finclf_esg_result.parquet.zip -d finclf_esg_result.parquet

!unzip finclf_augmented_esg_result.parquet.zip -d finclf_augmented_esg_result.parquet
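
The two `!unzip` calls depend on a shell binary being available in the runtime. Where it is not, the same extraction could be sketched with Python's standard `zipfile` module. The helper name `extract_archive` is hypothetical and not part of the original notebook; the archive and target names are the ones used above.

```python
import zipfile
from pathlib import Path


def extract_archive(zip_path, dest_dir):
    """Extract zip_path into dest_dir and return the archive's member names."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest)
        return archive.namelist()


if __name__ == "__main__":
    for name in ("finclf_esg_result", "finclf_augmented_esg_result"):
        archive_path = Path(f"{name}.parquet.zip")
        # Skip quietly if the download cell has not been run yet.
        if archive_path.exists():
            members = extract_archive(archive_path, f"{name}.parquet")
            print(f"{archive_path}: extracted {len(members)} entries")
```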

# Let's start


In [7]:
data = spark.read.parquet("./finclf_augmented_esg_result.parquet")

data.show(truncate = 50)

+--------+--------------------------------------------------+----------------------------------------------+
|page_num|                                              text|                                        result|
+--------+--------------------------------------------------+----------------------------------------------+
|       1|CF Industries 2021  ESG Report 1CF INDUSTRIES  ...|[Management_Of_Legal_And_Regulatory_Framework]|
|       2|A Message from Our CEO To our Stakeholders: CF ...|    [Waste_And_Hazardous_Materials_Management]|
|       3|OUR BUSINESS AND STRATEGY At our core, CF Indus...|    [Waste_And_Hazardous_Materials_Management]|
|       4|ACCOUNTABILITY AND TRANSPARENCY Our strategy of...|                               [GHG_Emissions]|
|       5|This report details CF Industries’ progress and...|[Management_Of_Legal_And_Regulatory_Framework]|
|       6|At CF Industries, our mission is to provide cle...|                           [Energy_Management]|
|       7|AMMONIA’S
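
Each `result` cell holds the predicted category as a one-element array, so a quick tally per category can help decide which pages to inspect first. The sketch below does this in plain Python on the six complete rows visible in the output above; `count_labels` is a hypothetical helper, and on the full DataFrame the same count would more naturally be done in Spark.

```python
from collections import Counter

# (page_num, labels) pairs copied from the first six rows of the output above.
pages = [
    (1, ["Management_Of_Legal_And_Regulatory_Framework"]),
    (2, ["Waste_And_Hazardous_Materials_Management"]),
    (3, ["Waste_And_Hazardous_Materials_Management"]),
    (4, ["GHG_Emissions"]),
    (5, ["Management_Of_Legal_And_Regulatory_Framework"]),
    (6, ["Energy_Management"]),
]


def count_labels(rows):
    """Count how many pages carry each predicted label."""
    return Counter(label for _, labels in rows for label in labels)


label_counts = count_labels(pages)
for label, n in label_counts.most_common():
    print(f"{n:>2}  {label}")
```

In Spark, an equivalent count could be written as `data.select(F.explode("result").alias("label")).groupBy("label").count()`.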

## NER: Named Entity Recognition on Responsibility Reports Pages
This is the `finner_responsibility_reports` model pipeline.

In [10]:
document_assembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentence_detector = nlp.SentenceDetector()\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = nlp.Tokenizer() \
    .setInputCols(["sentence"]) \
    .setOutputCol("token")\
    .setContextChars(['.', ',', ';', ':', '!', '?', '*', '(', ')', '-', '"', "'", "’", '%', '&', '>', '”', '“','▶'])

ner_model = finance.BertForTokenClassification.pretrained("finner_responsibility_reports", "en", "finance/models")\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("ner")\
    .setCaseSensitive(True)\
    .setMaxSentenceLength(512)

ner_converter = nlp.NerConverter()\
    .setInputCols(["sentence", "token", "ner"])\
    .setOutputCol("ner_chunk")

pipeline =  nlp.Pipeline(stages=[
    document_assembler,
    sentence_detector,
    tokenizer,
    ner_model,
    ner_converter
])

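# No stage needs training data, so fitting on an empty DataFrame just assembles the PipelineModel.
empty_data = spark.createDataFrame([[""]]).toDF("text")

# Note: this rebinds `ner_model` from the BERT stage to the fitted pipeline used in the cells below.
ner_model = pipeline.fit(empty_data)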

finner_responsibility_reports download started this may take some time.
[OK!]
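
The last pipeline stage, `NerConverter`, turns per-token tags into entity chunks. As a rough illustration of that step, the sketch below merges IOB2-style tags (`B-`, `I-`, `O`) into chunks in plain Python. It assumes the model emits IOB2 tags; the tokens and tags are hand-written for the example, and `iob_to_chunks` is a hypothetical helper, not the library's implementation.

```python
def iob_to_chunks(tokens, tags):
    """Merge IOB2-tagged tokens into (chunk_text, label) pairs."""
    chunks = []
    current_tokens, current_label = [], None

    def flush():
        # Emit the chunk collected so far, if any.
        if current_tokens:
            chunks.append((" ".join(current_tokens), current_label))

    for token, tag in zip(tokens, tags):
        prefix, _, label = tag.partition("-")
        if prefix == "B" or (prefix == "I" and label != current_label):
            # A new entity starts (an orphan I- tag is treated like B-).
            flush()
            current_tokens, current_label = [token], label
        elif prefix == "I":
            current_tokens.append(token)
        else:
            # "O": outside any entity, so close the open chunk.
            flush()
            current_tokens, current_label = [], None
    flush()
    return chunks


# Hand-written example tokens and tags (not model output).
tokens = ["reduce", "Scope", "3", "emissions", "by", "2030"]
tags = ["O", "B-ENVIRONMENTAL_KPI", "I-ENVIRONMENTAL_KPI",
        "I-ENVIRONMENTAL_KPI", "B-DATE_PERIOD", "I-DATE_PERIOD"]
chunks = iob_to_chunks(tokens, tags)
print(chunks)
```

Joining tokens with a single space is a simplification made for this sketch; it will not reproduce the original spacing of the page text.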


Let's take **page 16** as an example and continue with it. Its category is `Water_And_Wastewater_Management`.

In [11]:
sample_data = data.filter(data.page_num == 16)

result = ner_model.transform(sample_data)

In [12]:
result.select(F.explode(F.arrays_zip(result.ner_chunk.result, result.ner_chunk.metadata)).alias("cols")) \
      .select(F.expr("cols['0']").alias("chunk"),
              F.expr("cols['1']['entity']").alias("ner_label"),
              F.expr("cols['1']['confidence']").alias("confidence")).show(100, truncate=False)

+------------------------------------+---------------------+----------+
|chunk                               |ner_label            |confidence|
+------------------------------------+---------------------+----------+
|During 2021                         |DATE_PERIOD          |0.9993252 |
|GHG                                 |ENVIRONMENTAL_KPI    |0.99813557|
|CO2 equivalent                      |ENVIRONMENTAL_KPI    |0.915977  |
|emissions                           |ENVIRONMENTAL_KPI    |0.9913143 |
|25%                                 |PERCENTAGE           |0.9989867 |
|per  ton of product                 |RELATIVE_METRIC      |0.9983778 |
|by 2030                             |DATE_PERIOD          |0.9992966 |
|2015                                |DATE_PERIOD          |0.9833939 |
|Scope 3 emissions                   |ENVIRONMENTAL_KPI    |0.99981374|
|10%                                 |PERCENTAGE           |0.9991741 |
|by 2030                             |DATE_PERIOD          |0.99
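
The `arrays_zip` plus `explode` step pairs each chunk with its metadata and emits one row per pair. The plain-Python sketch below mirrors that logic on the first three rows of the output above. It assumes the metadata values arrive as strings, and `explode_chunks` is a hypothetical helper used only for illustration.

```python
def explode_chunks(chunk_texts, chunk_metadata):
    """Pair each chunk with its metadata and return one flat row per chunk."""
    rows = []
    for text, meta in zip(chunk_texts, chunk_metadata):
        rows.append({
            "chunk": text,
            "ner_label": meta["entity"],
            # Assumption: confidence is stored as a string, so cast it.
            "confidence": float(meta["confidence"]),
        })
    return rows


# Values copied from the first three rows of the output above.
chunk_texts = ["During 2021", "GHG", "25%"]
chunk_metadata = [
    {"entity": "DATE_PERIOD", "confidence": "0.9993252"},
    {"entity": "ENVIRONMENTAL_KPI", "confidence": "0.99813557"},
    {"entity": "PERCENTAGE", "confidence": "0.9989867"},
]

rows = explode_chunks(chunk_texts, chunk_metadata)
for row in rows:
    print(row)
```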

In [13]:
res = result.collect()[0]

### Visualize Results

In [14]:
visualiser = nlp.viz.NerVisualizer()

visualiser.display(res, label_col='ner_chunk', document_col='document')

## Let's extract entities from the results of the `finclf_esg` model

In [15]:
data = spark.read.parquet("./finclf_esg_result.parquet")

data.show(30)

+--------+--------------------+---------------+
|page_num|                text|         result|
+--------+--------------------+---------------+
|       1|CF Industries 202...|   [Governance]|
|       2|A Message from Ou...|[Environmental]|
|       3|OUR BUSINESS AND ...|[Environmental]|
|       4|ACCOUNTABILITY AN...|[Environmental]|
|       5|This report detai...|[Environmental]|
|       6|At CF Industries,...|[Environmental]|
|       7|AMMONIA’S ROLE IN...|[Environmental]|
|       8|(1) Other segment...|[Environmental]|
|       9|“We operate advan...|[Environmental]|
|      10|DIMENSIONS & KEY ...|[Environmental]|
|      11|1. Energy, Emissi...|[Environmental]|
|      12|Our four distinct...|[Environmental]|
|      13|Key Issues    1) ...|[Environmental]|
|      14|14) Product Desig...|[Environmental]|
|      15|Our intensive wor...|   [Governance]|
|      16|ESG Goals ENERGY,...|[Environmental]|
|      17|We are excited ab...|[Environmental]|
|      18| 
▶ Supplier Scre...|       [S

### Select a page classified as `Social`
We take page 34 as an example.

In [16]:
sample_data = data.filter(data.page_num == 34)

result = ner_model.transform(sample_data)

In [17]:
sample_data.show(truncate = 50)

+--------+--------------------------------------------------+--------+
|page_num|                                              text|  result|
+--------+--------------------------------------------------+--------+
|      34|Two of our ESG goals are related to ID&E. The  ...|[Social]|
+--------+--------------------------------------------------+--------+



In [18]:
result.select(F.explode(F.arrays_zip(result.ner_chunk.result, result.ner_chunk.metadata)).alias("cols")) \
      .select(F.expr("cols['0']").alias("chunk"),
              F.expr("cols['1']['entity']").alias("ner_label"),
              F.expr("cols['1']['confidence']").alias("confidence")).show(100, truncate=False)

+----------------+------------------+----------+
|chunk           |ner_label         |confidence|
+----------------+------------------+----------+
|females         |TARGET_GROUP      |0.9997331 |
|persons of color|TARGET_GROUP      |0.99637717|
|leadership      |PROFESSIONAL_GROUP|0.99970853|
|director        |PROFESSIONAL_GROUP|0.99915284|
|30%             |PERCENTAGE        |0.9992818 |
|by 2025         |DATE_PERIOD       |0.99955726|
|underrepresented|TARGET_GROUP      |0.99945337|
|employee        |PROFESSIONAL_GROUP|0.9996691 |
|end of  2021    |DATE_PERIOD       |0.98864007|
|hiring          |SOCIAL_KPI        |0.6323411 |
|promotion       |SOCIAL_KPI        |0.547257  |
|minority        |TARGET_GROUP      |0.99972713|
|female          |TARGET_GROUP      |0.99984014|
|December 2021   |DATE_PERIOD       |0.99864554|
|38%             |PERCENTAGE        |0.99958694|
|leadership      |PROFESSIONAL_GROUP|0.9997405 |
|director        |PROFESSIONAL_GROUP|0.9992318 |
|females         |TA
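
Most confidences in this output are above 0.98, but `hiring` and `promotion` score about 0.63 and 0.55. One way to handle such cases is to drop low-confidence chunks before grouping the rest by label. The sketch below does this in plain Python on the complete rows shown above; the 0.9 threshold and the helper name `group_confident` are assumptions made for illustration, not values from the notebook.

```python
from collections import defaultdict

# (chunk, ner_label, confidence) copied from the complete rows of the output above.
entities = [
    ("females", "TARGET_GROUP", 0.9997331),
    ("persons of color", "TARGET_GROUP", 0.99637717),
    ("leadership", "PROFESSIONAL_GROUP", 0.99970853),
    ("director", "PROFESSIONAL_GROUP", 0.99915284),
    ("30%", "PERCENTAGE", 0.9992818),
    ("by 2025", "DATE_PERIOD", 0.99955726),
    ("underrepresented", "TARGET_GROUP", 0.99945337),
    ("employee", "PROFESSIONAL_GROUP", 0.9996691),
    ("end of  2021", "DATE_PERIOD", 0.98864007),
    ("hiring", "SOCIAL_KPI", 0.6323411),
    ("promotion", "SOCIAL_KPI", 0.547257),
    ("minority", "TARGET_GROUP", 0.99972713),
    ("female", "TARGET_GROUP", 0.99984014),
    ("December 2021", "DATE_PERIOD", 0.99864554),
    ("38%", "PERCENTAGE", 0.99958694),
    ("leadership", "PROFESSIONAL_GROUP", 0.9997405),
    ("director", "PROFESSIONAL_GROUP", 0.9992318),
]

MIN_CONFIDENCE = 0.9  # illustrative threshold, tune per use case


def group_confident(rows, min_confidence):
    """Group chunks by label, keeping only those at or above min_confidence."""
    grouped = defaultdict(list)
    for chunk, label, confidence in rows:
        if confidence >= min_confidence:
            grouped[label].append(chunk)
    return dict(grouped)


grouped = group_confident(entities, MIN_CONFIDENCE)
for label, chunks in grouped.items():
    print(f"{label}: {chunks}")
```

Whether to discard or merely flag low-confidence chunks depends on the downstream use; for KPI extraction it may be preferable to keep them for manual review.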

In [19]:
res = result.collect()[0]

### Visualize Results

In [20]:
visualiser = nlp.viz.NerVisualizer()

visualiser.display(res, label_col='ner_chunk', document_col='document')

# Now you can proceed to 05 Extract and Understand Table!