# Financial Named Entity Recognition (NER) and Zero-shot NER

![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Finance/4.Financial_NER_ZeroShotNER.ipynb)

## Colab Setup

In [None]:
# Install the johnsnowlabs library to access Spark-OCR and Spark-NLP for Healthcare, Finance, and Legal.
! pip install johnsnowlabs 

In [None]:
from google.colab import files
print('Please Upload your John Snow Labs License using the button below')
license_keys = files.upload()

Saving latest_3_1_x_spark_nlp_for_healthcare_spark_ocr_5112.json to latest_3_1_x_spark_nlp_for_healthcare_spark_ocr_5112.json


In [None]:
from johnsnowlabs import * 
# After uploading your license, run this to install all licensed Python wheels and pre-download the JARs for the Spark session JVM
# Restart your notebook afterwards for the changes to take effect
jsl.install()

👌 Detected license file /content/latest_3_1_x_spark_nlp_for_healthcare_spark_ocr_5112.json
📋 Stored John Snow Labs License in /root/.johnsnowlabs/licenses/license_number_0_for_Spark-Healthcare_Spark-OCR.json
👷 Setting up John Snow Labs home in /home/ckl/.johnsnowlabs this might take a few minutes.
Downloading 🐍+🚀 Python Library Spark-NLP-4.1.0-wheel-for-spark-3.x.x.whl
Downloading 🐍+💊 Python Library hc
Downloading 🐍+🕶 Python Library Spark-OCR-4.0.1-wheel-for-spark-3.x.x.whl
Downloading 🫘+🚀 Java Library Spark-NLP-4.1.0-cpu-for-spark-3.x.x.jar
Downloading 🫘+💊 Java Library hc
Downloading 🫘+🕶 Java Library Spark-OCR-4.0.1-cpu-for-spark-3.x.x.jar
🙆 JSL Home setup in /root/.johnsnowlabs
Running "/usr/bin/python3 -m pip install https://pypi.johnsnowlabs.com/[LIBRARY_SECRET]spark-ocr/spark_ocr-4.0.1-py3-none-any.whl --force-reinstall"
Running "/usr/bin/python3 -m pip install https://pypi.johnsnowlabs.com/[LIBRARY_SECRET]spark-nlp-internal/spark_nlp_internal-4.1.0-py3-none-any.whl --force-reinst

## Start Spark Session

In [None]:
from johnsnowlabs import * 
# Automatically load license data and start a session with all JARs the user has access to
spark = jsl.start()

👌 Detected license file /content/latest_3_1_x_spark_nlp_for_healthcare_spark_ocr_5112.json
📋 Stored new John Snow Labs License in /root/.johnsnowlabs/licenses/license_number_2_for_Spark-Healthcare_Spark-OCR.json
👌 Launched SparkSession with Jars for: 🚀Spark-NLP, 💊Spark-Healthcare, 🕶Spark-OCR


In [5]:
import pandas as pd
import warnings
from pyspark.sql import SparkSession

warnings.filterwarnings('ignore')

# Optional: start the session with custom parameters instead of jsl.start() above.
# SECRET is the license secret; PUBLIC_VERSION and JSL_VERSION are not defined in
# this notebook and must be set to the library versions you want to load.
def start(SECRET):
    builder = SparkSession.builder \
        .appName("Spark NLP Licensed") \
        .master("local[*]") \
        .config("spark.driver.memory", "16G") \
        .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer") \
        .config("spark.kryoserializer.buffer.max", "2000M") \
        .config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp_2.12:"+PUBLIC_VERSION) \
        .config("spark.jars", "https://pypi.johnsnowlabs.com/"+SECRET+"/spark-nlp-jsl-"+JSL_VERSION+".jar")

    return builder.getOrCreate()

# spark = start(SECRET)


## NER Model Implementation in Spark NLP

The deep neural network architecture of the NER model in Spark NLP is the BiLSTM-CNN-Char framework, a slightly modified version of the architecture proposed by Jason P.C. Chiu and Eric Nichols ([Named Entity Recognition with Bidirectional LSTM-CNNs](https://arxiv.org/abs/1511.08308)). It automatically detects word-level and character-level features using a hybrid bidirectional LSTM and CNN architecture, eliminating the need for most feature engineering steps.
  
In the original framework, the CNN extracts a fixed-length feature vector from character-level features. For each word, these vectors are concatenated and fed to the BLSTM network and then to the output layers. The authors employed a stacked bidirectional recurrent neural network with long short-term memory units to transform word features into named entity tag scores. The extracted features of each word are fed into a forward LSTM network and a backward LSTM network. The output of each network at each time step is decoded by a linear layer and a log-softmax layer into log-probabilities for each tag category. These two vectors are then simply added together to produce the final output. In the original paper, 50-dimensional pretrained word embeddings are used for word features, 25-dimensional character embeddings for character features, and capitalization features (allCaps, upperInitial, lowercase, mixedCaps, noinfo) for case features.
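
The per-token decoding step described above can be sketched in plain Python. This is only an illustration of "log-softmax each direction, then add", not Spark NLP's implementation: the tag subset and the score values are made up, and choosing the best tag with a simple argmax is an assumption made to keep the example short.

```python
import math

def log_softmax(scores):
    """Turn raw tag scores into log-probabilities (numerically stable)."""
    m = max(scores)
    log_sum = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_sum for s in scores]

def combine_directions(forward_scores, backward_scores):
    """Add the forward and backward log-probabilities for one token."""
    fwd = log_softmax(forward_scores)
    bwd = log_softmax(backward_scores)
    return [f + b for f, b in zip(fwd, bwd)]

TAGS = ["O", "B-ORG", "I-ORG"]  # illustrative subset of tags

# Made-up linear-layer outputs for a single token, one list per direction.
forward_scores = [0.2, 2.5, 0.1]
backward_scores = [0.4, 1.9, 0.3]

combined = combine_directions(forward_scores, backward_scores)
best_tag = TAGS[max(range(len(TAGS)), key=lambda i: combined[i])]
print(best_tag)  # B-ORG
```

Because both directions are already in log space, adding them corresponds to multiplying the two per-direction probabilities for each tag.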

### Finance SEC NER Model

This model detects Organizations (ORG), People (PER) and Locations (LOC) in financial texts. It was trained using manual annotations, CoNLL-2003 and financial documents obtained from U.S. Securities and Exchange Commission (SEC) filings.

In [6]:
documentAssembler = nlp.DocumentAssembler()\
        .setInputCol("text")\
        .setOutputCol("document")

# Consider using SentenceDetector with rules/patterns to get smaller chunks from long sentences
sentence_detector = nlp.SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
        .setInputCols(["document"])\
        .setOutputCol("sentence")

tokenizer = nlp.Tokenizer()\
        .setInputCols(["sentence"])\
        .setOutputCol("token")

embeddings = nlp.BertEmbeddings.pretrained("bert_embeddings_legal_bert_base_uncased","en")\
        .setInputCols(["sentence", "token"])\
        .setOutputCol("embeddings")

ner_model = finance.NerModel.pretrained("finner_sec_conll", "en", "finance/models") \
        .setInputCols(["sentence", "token", "embeddings"])\
        .setOutputCol("ner")

ner_converter = nlp.NerConverter()\
        .setInputCols(["sentence","token","ner"])\
        .setOutputCol("ner_chunk")

nlpPipeline = Pipeline(stages=[
        documentAssembler,
        sentence_detector,
        tokenizer,
        embeddings,
        ner_model,
        ner_converter])

empty_data = spark.createDataFrame([[""]]).toDF("text")

model = nlpPipeline.fit(empty_data)

sentence_detector_dl download started this may take some time.
Approximate size to download 514.9 KB
[OK!]
bert_embeddings_legal_bert_base_uncased download started this may take some time.
Approximate size to download 391.1 MB
[OK!]
finner_sec_conll download started this may take some time.
[OK!]


In [7]:
## you can see pipeline stages with this code

model.stages

[DocumentAssembler_92faa57f6621,
 SentenceDetectorDLModel_8aaebf7e098e,
 REGEX_TOKENIZER_88b902dbdad4,
 BERT_EMBEDDINGS_ed0baec430e2,
 MedicalNerModel_5a7a3c15ca97,
 NerConverter_e030839fbc47]

In [8]:
## With this code, you can see which labels your NER model has.

ner_model.getClasses()

['O', 'B-ORG', 'I-ORG', 'B-PER', 'I-PER', 'B-LOC', 'I-LOC']
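
The `B-` and `I-` prefixes in these labels mark the beginning and the inside of an entity, and `O` marks tokens outside any entity. The sketch below shows, in plain Python, how such token-level tags can be merged into entity chunks. It is a simplified illustration under stated assumptions, not the `NerConverter` implementation: the helper name `bio_to_chunks` is hypothetical, and tokens are joined with single spaces instead of using character offsets.

```python
def bio_to_chunks(tokens, tags):
    """Group tokens tagged B-X / I-X into (text, label) chunks."""
    chunks = []
    current_tokens, current_label = [], None
    for token, tag in zip(tokens, tags):
        prefix, _, label = tag.partition("-")
        if prefix == "B" or (prefix == "I" and label != current_label):
            # A new entity starts here: close the open chunk, if any.
            if current_tokens:
                chunks.append((" ".join(current_tokens), current_label))
            current_tokens, current_label = [token], label
        elif prefix == "I":
            # Continuation of the entity that is currently open.
            current_tokens.append(token)
        else:
            # "O" tag: close the open chunk, if any.
            if current_tokens:
                chunks.append((" ".join(current_tokens), current_label))
            current_tokens, current_label = [], None
    if current_tokens:
        chunks.append((" ".join(current_tokens), current_label))
    return chunks

tokens = ["SILICIUM", "DE", "PROVENCE", "represented", "by", "Richard", "Chleboski"]
tags = ["B-ORG", "I-ORG", "I-ORG", "O", "O", "B-PER", "I-PER"]
print(bio_to_chunks(tokens, tags))
# [('SILICIUM DE PROVENCE', 'ORG'), ('Richard Chleboski', 'PER')]
```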

In [9]:
ner_model.extractParamMap()

# With extractParamMap() function, you can see the parameters of any annotators you are using.

{Param(parent='MedicalNerModel_5a7a3c15ca97', name='inferenceBatchSize', doc='number of sentences to process in a single batch during inference'): 1,
 Param(parent='MedicalNerModel_5a7a3c15ca97', name='labelCasing', doc='Setting all labels of the NER models upper/lower case. values upper|lower'): '',
 Param(parent='MedicalNerModel_5a7a3c15ca97', name='lazyAnnotator', doc='Whether this AnnotatorModel acts as lazy in RecursivePipelines'): False,
 Param(parent='MedicalNerModel_5a7a3c15ca97', name='includeConfidence', doc='whether to include confidence scores in annotation metadata'): True,
 Param(parent='MedicalNerModel_5a7a3c15ca97', name='includeAllConfidenceScores', doc='whether to include all confidence scores in annotation metadata or just the score of the predicted tag'): False,
 Param(parent='MedicalNerModel_5a7a3c15ca97', name='batchSize', doc='Size of every batch'): 128,
 Param(parent='MedicalNerModel_5a7a3c15ca97', name='classes', doc='get the tags used to trained this MedicalNe

#### Sample Text

In [10]:
text ='''December 2007 SUBORDINATED LOAN AGREEMENT. THIS LOAN AGREEMENT is made on 7th December, 2007 BETWEEN: (1) SILICIUM DE PROVENCE S.A.S., a private company with limited liability, incorporated under the laws of France, whose registered office is situated at Usine de Saint Auban, France, represented by Mr.Frank Wouters, hereinafter referred to as the "Borrower", and ( 2 ) EVERGREEN SOLAR INC., a company incorporated in Delaware, U.S.A., with registered number 2426798, whose registered office is situated at Bartlett Street, Marlboro, Massachusetts, U.S.A. represented by Richard Chleboski, hereinafter referred to as "Lender" '''

df = spark.createDataFrame([[text]]).toDF("text")

result = model.transform(df)

#### Getting Result

In [11]:
result.select(F.explode(F.arrays_zip(result.token.result, result.ner.result, result.ner.metadata)).alias("cols"))\
                  .select(F.expr("cols['0']").alias("token"),
                          F.expr("cols['1']").alias("ner_label"),
                          F.expr("cols['2']['confidence']").alias("confidence")).show(200, truncate=100)

+-------------+---------+----------+
|        token|ner_label|confidence|
+-------------+---------+----------+
|     December|        O|       1.0|
|         2007|        O|       1.0|
| SUBORDINATED|        O|    0.9954|
|         LOAN|        O|    0.8873|
|    AGREEMENT|        O|    0.9998|
|            .|        O|    0.9999|
|         THIS|        O|       1.0|
|         LOAN|        O|       1.0|
|    AGREEMENT|        O|       1.0|
|           is|        O|       1.0|
|         made|        O|       1.0|
|           on|        O|       1.0|
|          7th|        O|       1.0|
|     December|        O|       1.0|
|            ,|        O|       1.0|
|         2007|        O|       1.0|
|      BETWEEN|        O|       1.0|
|            :|        O|       1.0|
|            (|        O|       1.0|
|            1|        O|       1.0|
|            )|        O|       1.0|
|     SILICIUM|    B-ORG|    0.9569|
|           DE|    I-ORG|    0.8518|
|     PROVENCE|    I-ORG|    0.9618|
|

In [12]:
result.select(F.explode(F.arrays_zip(result.ner_chunk.result, result.ner_chunk.metadata)).alias("cols")) \
      .select(F.expr("cols['0']").alias("chunk"),
              F.expr("cols['1']['entity']").alias("ner_label"),
              F.expr("cols['1']['confidence']").alias("confidence")).show(truncate=False)

+--------------------------+---------+----------+
|chunk                     |ner_label|confidence|
+--------------------------+---------+----------+
|SILICIUM DE PROVENCE S.A.S|ORG      |0.94205004|
|France                    |LOC      |0.9921    |
|Usine de Saint Auban      |LOC      |0.9893    |
|France                    |LOC      |0.9928    |
|Mr.Frank Wouters          |PER      |0.98465   |
|Borrower                  |PER      |0.9981    |
|EVERGREEN SOLAR INC       |ORG      |0.99956673|
|Delaware                  |LOC      |0.9976    |
|U.S.A                     |LOC      |0.9998    |
|Bartlett Street           |LOC      |0.99915004|
|Marlboro                  |LOC      |0.9923    |
|Massachusetts             |LOC      |0.6141    |
|U.S.A                     |LOC      |0.9995    |
|Richard Chleboski         |PER      |0.9997    |
|Lender                    |PER      |0.9998    |
+--------------------------+---------+----------+



#### Getting Result with LightPipeline

LightPipelines are Spark NLP-specific pipelines, equivalent to Spark ML Pipelines but meant for smaller amounts of data. They are useful when working with small datasets, debugging results, or running training or prediction from an API that serves one-off requests.

A LightPipeline is a Spark ML pipeline converted to run on a single machine as a multi-threaded task, which can make it **roughly 10x faster than a Spark ML Pipeline** for smaller amounts of data (small is relative, but about 50k sentences is a reasonable upper limit). To use one, plug in a trained (fitted) pipeline and annotate plain text. The input text does not need to be converted to a DataFrame, even though the underlying pipeline expects a DataFrame as input. This is useful for getting predictions on a few lines of text from a trained ML model.

For more details:
[https://medium.com/spark-nlp/spark-nlp-101-lightpipeline-a544e93f20f1](https://medium.com/spark-nlp/spark-nlp-101-lightpipeline-a544e93f20f1)

In [13]:
light_model = LightPipeline(model)

light_result = light_model.fullAnnotate(text)


chunks = []
entities = []
sentence= []
begin = []
end = []

for n in light_result[0]['ner_chunk']:
        
    begin.append(n.begin)
    end.append(n.end)
    chunks.append(n.result)
    entities.append(n.metadata['entity']) 
    sentence.append(n.metadata['sentence'])
    
    

df = pd.DataFrame({'chunks':chunks, 'begin': begin, 'end':end, 
                   'sentence_id':sentence, 'entities':entities})

df.head(20)

| |chunks|begin|end|sentence_id|entities|
|---|---|---|---|---|---|
|0|SILICIUM DE PROVENCE S.A.S|106|131|1|ORG|
|1|France|208|213|1|LOC|
|2|Usine de Saint Auban|255|274|1|LOC|
|3|France|277|282|1|LOC|
|4|Mr.Frank Wouters|300|315|1|PER|
|5|Borrower|350|357|1|PER|
|6|EVERGREEN SOLAR INC|371|389|1|ORG|
|7|Delaware|419|426|2|LOC|
|8|U.S.A|429|433|2|LOC|
|9|Bartlett Street|508|522|2|LOC|
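
The `begin` and `end` values in this output are character offsets into the input text, and `end` is inclusive. A small sketch of recovering a chunk from its offsets, using only the opening of the sample text and a hypothetical helper named `chunk_text`:

```python
# Opening of the sample agreement text used above.
text = ("December 2007 SUBORDINATED LOAN AGREEMENT. THIS LOAN AGREEMENT is made "
        "on 7th December, 2007 BETWEEN: (1) SILICIUM DE PROVENCE S.A.S., a private "
        "company with limited liability, incorporated under the laws of France")

def chunk_text(source, begin, end):
    """Return the substring for an annotation whose `end` offset is inclusive."""
    return source[begin:end + 1]

print(chunk_text(text, 106, 131))  # SILICIUM DE PROVENCE S.A.S
print(chunk_text(text, 208, 213))  # France
```

The `+ 1` matters because Python slices exclude the stop index, while the annotation's `end` points at the last character of the chunk.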


#### NER Visualizer

To save the visualization as HTML, pass the `save_path` parameter to the `display` function.

In [14]:
# from sparknlp_display import NerVisualizer

visualiser = viz.NerVisualizer()

visualiser.display(light_result[0], label_col='ner_chunk', document_col='document')

## Create Generic Pipeline for NerDL Models

In [15]:
def base_pipeline():
    
    document_assembler = nlp.DocumentAssembler()\
        .setInputCol("text")\
        .setOutputCol("document")

    # Consider using SentenceDetector with rules/patterns to get smaller chunks from long sentences
    sentence_detector = nlp.SentenceDetector() \
        .setInputCols(["document"]) \
        .setOutputCol("sentence") \
        .setCustomBounds(["\n\n"])

    tokenizer = nlp.Tokenizer()\
        .setInputCols(["sentence"])\
        .setOutputCol("token")
    
    pipeline = Pipeline(stages=[
            document_assembler,
            sentence_detector,
            tokenizer])
    
    return pipeline

In [16]:
def generic_ner_pipeline(model_name, embeddings):
    
    word_embeddings = nlp.BertEmbeddings.pretrained(embeddings, "en")\
            .setInputCols(["sentence", "token"]) \
            .setOutputCol("embeddings")

    ner_model = finance.NerModel.pretrained(model_name, "en", "finance/models")\
            .setInputCols(["sentence", "token", "embeddings"])\
            .setOutputCol("ner")

    ner_converter = nlp.NerConverter()\
            .setInputCols(["sentence","token","ner"])\
            .setOutputCol("ner_chunk")

    nlpPipeline = Pipeline(stages=[
            base_pipeline(),
            word_embeddings,
            ner_model,
            ner_converter])

    empty_data = spark.createDataFrame([[""]]).toDF("text")

    model = nlpPipeline.fit(empty_data)
    
    return model

## Create Generic Result Function

In [17]:
def get_result(result):
    result.select(F.explode(F.arrays_zip(result.ner_chunk.result, result.ner_chunk.metadata)).alias("cols")) \
          .select(F.expr("cols['0']").alias("chunk"),
                  F.expr("cols['1']['entity']").alias("ner_label")).show(50, truncate=False)

### Finance SEC10K Forms NER Model

This model uses Named Entity Recognition to extract information such as **Company Name**, **Trading symbols**, **Stock markets**, **Addresses**, **Phones**, **Stock types** and values, **IRS**, **CFN**, etc. from the first page of 10-K filings.

In [18]:
text = """ANNUAL REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES AND EXCHANGE ACT OF 1934
For the annual period ended January 31, 2021
or
TRANSITION REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF 1934
For the transition period from________to_______
Commission File Number: 001-38856
PAGERDUTY, INC.
(Exact name of registrant as specified in its charter)
Delaware
27-2793871
(State or other jurisdiction of
incorporation or organization)
(I.R.S. Employer
Identification Number)
600 Townsend St., Suite 200, San Francisco, CA 94103
(844) 800-3889
(Address, including zip code, and telephone number, including area code, of registrant’s principal executive offices)
Securities registered pursuant to Section 12(b) of the Act:
Title of each class
Trading symbol(s)
Name of each exchange on which registered
Common Stock, $0.000005 par value,
PD
New York Stock Exchange"""

model_name = "finner_10k_summary"

embeddings = "bert_embeddings_finbert_pretrain_yiyanghkust"

df = spark.createDataFrame([[text]]).toDF("text")

result = generic_ner_pipeline(model_name, embeddings).transform(df)

bert_embeddings_finbert_pretrain_yiyanghkust download started this may take some time.
Approximate size to download 393.2 MB
[OK!]
finner_10k_summary download started this may take some time.
[OK!]


In [19]:
get_result(result)

+----------------------------------------------+-----------------+
|chunk                                         |ner_label        |
+----------------------------------------------+-----------------+
|January 31, 2021                              |FISCAL_YEAR      |
|001-38856                                     |CFN              |
|PAGERDUTY, INC                                |ORG              |
|Delaware                                      |STATE            |
|27-2793871                                    |IRS              |
|600 Townsend St., Suite 200, San Francisco, CA|ADDRESS          |
|(844) 800-3889                                |PHONE            |
|Common Stock                                  |TITLE_CLASS      |
|$0.000005                                     |TITLE_CLASS_VALUE|
|PD                                            |TICKER           |
|New York Stock Exchange                       |STOCK_EXCHANGE   |
+----------------------------------------------+-----------------+

### Finance Cuad_NER_Org_Products Model

This model uses Named Entity Recognition to extract organization names, product names, and aliases.

Entities:
 - ORG (Organization names)
 - PRODUCT (Product names)
 - ALIAS

In [20]:
text = """This INTELLECTUAL PROPERTY AGREEMENT (this "Agreement"), dated as of December 31, 2018 (the "Effective Date") is entered into by and between Armstrong Flooring, Inc., a Delaware corporation ("Seller") and AFI Licensing LLC, a Delaware limited liability company ("Licensing" and together with Seller, "Arizona") and AHF Holding, Inc. (formerly known as Tarzan HoldCo, Inc.), a Delaware corporation ("Buyer") and Armstrong Hardwood Flooring Company, a Tennessee corporation (the "Company" and together with Buyer the "Buyer Entities") (each of Arizona on the one hand and the Buyer Entities on the other hand, a "Party" and collectively, the "Parties")."""

model_name = "finner_orgs_prods_alias"

embeddings = "bert_embeddings_sec_bert_base"

df = spark.createDataFrame([[text]]).toDF("text")

result = generic_ner_pipeline(model_name, embeddings).transform(df)


bert_embeddings_sec_bert_base download started this may take some time.
Approximate size to download 390.4 MB
[OK!]
finner_orgs_prods_alias download started this may take some time.
[OK!]


In [21]:
get_result(result)

+-----------------------------------+---------+
|chunk                              |ner_label|
+-----------------------------------+---------+
|Armstrong Flooring, Inc            |ORG      |
|Seller                             |ALIAS    |
|AFI Licensing LLC                  |ORG      |
|Licensing                          |ALIAS    |
|Seller                             |ALIAS    |
|Arizona                            |ALIAS    |
|AHF Holding, Inc                   |ORG      |
|Tarzan HoldCo, Inc                 |ORG      |
|Buyer                              |ALIAS    |
|Armstrong Hardwood Flooring Company|ORG      |
|Company                            |ALIAS    |
|Buyer                              |ALIAS    |
|Buyer Entities                     |ALIAS    |
|Arizona                            |ALIAS    |
|Buyer Entities                     |ALIAS    |
|Party                              |ALIAS    |
|Parties                            |ALIAS    |
+-----------------------------------+---------+

### Finance NER_Ticker Model with RoBertaForTokenClassification

This model aims to detect Trading Symbols / Tickers in texts.

Entities:
 - TICKER


In [22]:
tokenClassifier = nlp.RoBertaForTokenClassification.pretrained("finner_roberta_ticker", "en", "finance/models")\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("ner")

ner_converter = nlp.NerConverter()\
    .setInputCols(["sentence","token","ner"])\
    .setOutputCol("ner_chunk")

pipeline = Pipeline(stages=[
    base_pipeline(), 
    tokenClassifier,
    ner_converter])

empty_data = spark.createDataFrame([[""]]).toDF("text")

model = pipeline.fit(empty_data)

finner_roberta_ticker download started this may take some time.
Approximate size to download 443.8 MB
[OK!]


In [23]:
text = """There are some serious purchases and sales of AMZN, NFLX and GPRO stock today."""

df = spark.createDataFrame([[text]]).toDF("text")

result = model.transform(df)

In [24]:
get_result(result)

+-----+---------+
|chunk|ner_label|
+-----+---------+
|AMZN |TICKER   |
|NFLX |TICKER   |
|GPRO |TICKER   |
+-----+---------+



# Zero-shot Named Entity Recognition

`Zero-shot` is an inference paradigm that lets us use a model for prediction without any previous training step.

To do that, several examples (_hypotheses_) are provided and sent to the language model, which uses `NLI (Natural Language Inference)` to check whether any information found in the text matches the examples (confirms the hypotheses).

NLI usually works by trying to _confirm or reject a hypothesis_. The _hypotheses_ are the `prompts` or examples we provide. If any piece of information confirms a constructed hypothesis (answers one of the examples we have given), the hypothesis is confirmed and the zero-shot prediction is triggered.
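
The control flow of this idea can be sketched without any model. In the toy example below, `zero_shot_tag`, `answer_fn`, the threshold and the keyword-based scorer are all hypothetical stand-ins chosen for illustration. A real zero-shot model replaces the scorer with a language model, and this sketch makes no claim about how the library implements it internally.

```python
def zero_shot_tag(text, entity_definitions, answer_fn, threshold=0.5):
    """Ask every prompt for every label and keep answers scored above threshold.

    `answer_fn(prompt, text)` is a hypothetical stand-in for the language model;
    it returns (answer, score), or None when the prompt has no answer in the text.
    """
    found = []
    for label, prompts in entity_definitions.items():
        for prompt in prompts:
            answer = answer_fn(prompt, text)
            if answer is not None and answer[1] >= threshold:
                found.append((answer[0], label))
                break  # one confirmed prompt is enough for this label
    return found

def toy_answer_fn(prompt, text):
    """Toy scorer: a keyword lookup, NOT a real NLI or QA model."""
    if prompt.startswith("When") and "March 2012" in text:
        return ("March 2012", 0.9)
    if prompt.startswith("Which company") and "Vertro" in text:
        return ("Vertro", 0.8)
    return None

definitions = {
    "DATE": ["When was the company acquisition?"],
    "ORG": ["Which company was acquired?"],
    "PRODUCT": ["Which product?"],
}
sentence = "In March 2012, the Company acquired Vertro, Inc."
print(zero_shot_tag(sentence, definitions, toy_answer_fn))
# [('March 2012', 'DATE'), ('Vertro', 'ORG')]
```

Labels whose prompts are never confirmed, such as `PRODUCT` here, simply produce no entity.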

Let's see it in action.


In [25]:
# from sparknlp.base import *
# from sparknlp.annotator import *
# from pyspark.ml import Pipeline
# from sparknlp_jsl.annotator import *

documentAssembler = nlp.DocumentAssembler()\
  .setInputCol("text")\
  .setOutputCol("document")

sen = nlp.SentenceDetector()\
  .setInputCols(["document"])\
  .setOutputCol("sentence")

sparktokenizer = nlp.Tokenizer()\
  .setInputCols("sentence")\
  .setOutputCol("token")

zero_shot_ner = finance.ZeroShotNerModel.pretrained("finner_roberta_zeroshot", "en", "finance/models")\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("zero_shot_ner")\
    .setEntityDefinitions(
        {
            "DATE": ['When was the company acquisition?', 'When was the company purchase agreement?'],
            "ORG": ["Which company was acquired?"],
            "PRODUCT": ["Which product?"],
            "PROFIT_INCREASE": ["How much has the gross profit increased?"],
            "REVENUES_DECLINED": ["How much has the revenues declined?"],
            "OPERATING_LOSS_2020": ["Which was the operating loss in 2020"],
            "OPERATING_LOSS_2019": ["Which was the operating loss in 2019"]
        })

nerconverter = nlp.NerConverter()\
  .setInputCols(["sentence", "token", "zero_shot_ner"])\
  .setOutputCol("ner_chunk")

pipeline =  Pipeline(stages=[
  documentAssembler,
  sen,
  sparktokenizer,
  zero_shot_ner,
  nerconverter,
    ]
)

finner_roberta_zeroshot download started this may take some time.
[OK!]


In [26]:
from pyspark.sql.types import StringType
sample_text = ["In March 2012, as part of a longer-term strategy, the Company acquired Vertro, Inc., which owned and operated the ALOT product portfolio.",
              "In February 2017, the Company entered into an asset purchase agreement with NetSeer, Inc.",
              "While our gross profit margin increased to 81.4% in 2020 from 63.1% in 2019, our revenues declined approximately 27% in 2020 as compared to 2019.",
              "We reported an operating loss of approximately $8,048,581 million in 2020 as compared to an operating loss of $7,738,193 in 2019."]

p_model = pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))

res = p_model.transform(spark.createDataFrame(sample_text, StringType()).toDF("text"))

In [27]:
# from pyspark.sql import functions as F

res.select(F.explode(F.arrays_zip(res.ner_chunk.result, res.ner_chunk.begin, res.ner_chunk.end, res.ner_chunk.metadata)).alias("cols")) \
   .select(F.expr("cols['0']").alias("chunk"),
           F.expr("cols['3']['entity']").alias("ner_label"))\
   .filter("ner_label!='O'")\
   .show(truncate=False)

+------------------+-------------------+
|chunk             |ner_label          |
+------------------+-------------------+
|March 2012        |DATE               |
|Vertro            |ORG                |
|ALOT              |PRODUCT            |
|February 2017     |DATE               |
|NetSeer           |ORG                |
|81.4%             |PROFIT_INCREASE    |
|27%               |REVENUES_DECLINED  |
|$8,048,581 million|OPERATING_LOSS_2020|
|$7,738,193        |OPERATING_LOSS_2019|
|2019              |DATE               |
+------------------+-------------------+



In [28]:
lp = LightPipeline(p_model)
lp_res_1 = lp.fullAnnotate(sample_text[2])
lp_res_2 = lp.fullAnnotate(sample_text[3])

In [29]:
# from sparknlp_display import NerVisualizer

visualiser = viz.NerVisualizer()

visualiser.display(lp_res_1[0], label_col='ner_chunk', document_col='document')

In [30]:
visualiser.display(lp_res_2[0], label_col='ner_chunk', document_col='document')