![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

# Classify Financial Texts

In this notebook, you will learn how to use Spark NLP and Finance NLP to perform text classification.

## Environment Setup

First, you need to set up the environment to be able to use the licensed package. If you are not running in Google Colab, please check the documentation [here](https://nlp.johnsnowlabs.com/docs/en/licensed_install).

In [1]:
# Install the johnsnowlabs library to access Spark-OCR and Spark-NLP for Healthcare, Finance, and Legal.
! pip install -qU johnsnowlabs


In [2]:
from johnsnowlabs import nlp
# Log in to your John Snow Labs account to get your license keys
nlp.install(force_browser=True)

Downloading license...
Licenses extracted successfully
📋 Stored John Snow Labs License in /root/.johnsnowlabs/licenses/license_number_0_for_Spark-Healthcare_Spark-OCR.json
👷 Setting up  John Snow Labs home in /root/.johnsnowlabs, this might take a few minutes.
Downloading 🐍+🚀 Python Library spark_nlp-4.2.4-py2.py3-none-any.whl
Downloading 🐍+💊 Python Library spark_nlp_jsl-4.2.4-py3-none-any.whl
Downloading 🫘+🚀 Java Library spark-nlp-assembly-4.2.4.jar
Downloading 🫘+💊 Java Library spark-nlp-jsl-4.2.4.jar
🙆 JSL Home setup in /root/.johnsnowlabs
Installing /root/.johnsnowlabs/py_installs/spark_nlp_jsl-4.2.4-py3-none-any.whl to /usr/bin/python3
Running: /usr/bin/python3 -m pip install /root/.johnsnowlabs/py_installs/spark_nlp_jsl-4.2.4-py3-none-any.whl
Installed 1 products:
💊 Spark-Healthcare==4.2.4 installed! ✅ Heal the planet with NLP! 


Also, let's install some tools to display PDF files that will be used in the examples.

> Please restart the runtime and continue with the next cells

## Pretrained models

For the text classification tasks, we will use two annotators:

- `ClassifierDL`: uses the state-of-the-art Universal Sentence Encoder as input for text classification. The classifier is a deep learning model (DNN) built with TensorFlow that supports `Binary Classification` and `Multiclass Classification` (up to 100 classes).
- `MultiClassifierDL`: `Multilabel Classification` (can predict more than one class for each text) using a Bidirectional GRU with Convolution architecture built with TensorFlow that supports up to 100 classes. The inputs are Sentence Embeddings such as state-of-the-art UniversalSentenceEncoder, BertSentenceEmbeddings or SentenceEmbeddings.
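The difference between the two settings can be sketched with a generic decision rule. This is only an illustration under assumptions: the scores are made up, and the softmax/sigmoid rule shown here is the textbook formulation, not necessarily what the annotators do internally.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(x):
    """Map a raw score to an independent probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def predict_multiclass(logits, labels):
    """Multiclass: exactly one label, the one with the highest probability."""
    probs = softmax(logits)
    best = max(range(len(labels)), key=lambda i: probs[i])
    return labels[best]

def predict_multilabel(logits, labels, threshold=0.5):
    """Multilabel: every label whose own probability passes the threshold."""
    return [label for x, label in zip(logits, labels) if sigmoid(x) >= threshold]

labels = ["acq", "finance", "trade"]
logits = [-1.2, 0.8, 2.1]  # made-up scores, for illustration only

multiclass_prediction = predict_multiclass(logits, labels)
multilabel_prediction = predict_multilabel(logits, labels)
print(multiclass_prediction)
print(multilabel_prediction)
```

With these scores the multiclass rule returns a single label, while the multilabel rule returns every label above the threshold, so a text can receive zero, one, or several labels.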

Example Classification models:

| title                                                    | language   | predicted_entities                                                                                                      | compatible_editions                |
|:---------------------------------------------------------|:-----------|:------------------------------------------------------------------------------------------------------------------------|:-----------------------------------|
| Bank Complaints Classification                           | en         | ['Accounts', 'Credit Cards', 'Credit Reporting', 'Debt Collection', 'Loans', 'Money Transfer and Currency', 'Mortgage'] | ['Finance NLP 1.0', 'Finance NLP'] |
| Financial Finbert Sentiment Analysis (DistilRoBerta)     | en         | ['positive', 'negative', 'neutral']                                                                                     | ['Finance NLP 1.0', 'Finance NLP'] |
| Financial Business Item Binary Classifier                | en         | ['other', 'business']                                                                                                   | ['Finance NLP 1.0', 'Finance NLP'] |
| Financial Controls procedures Item Binary Classifier     | en         | ['other', 'controls_procedures']                                                                                        | ['Finance NLP 1.0', 'Finance NLP'] |
| Financial Equity Item Binary Classifier                  | en         | ['other', 'equity']                                                                                                     | ['Finance NLP 1.0', 'Finance NLP'] |
| Financial Executives compensation Item Binary Classifier | en         | ['other', 'executives_compensation']                                                                                    | ['Finance NLP 1.0', 'Finance NLP'] |
| Financial Executives Item Binary Classifier              | en         | ['other', 'executives']                                                                                                 | ['Finance NLP 1.0', 'Finance NLP'] |
| Financial Exhibits Item Binary Classifier                | en         | ['other', 'exhibits']                                                                                                   | ['Finance NLP 1.0', 'Finance NLP'] |
| Financial Financial conditions Item Binary Classifier    | en         | ['other', 'financial_conditions']                                                                                       | ['Finance NLP 1.0', 'Finance NLP'] |
| Financial Financial statements Item Binary Classifier    | en         | ['other', 'financial_statements']                                                                                       | ['Finance NLP 1.0', 'Finance NLP'] |

## Multiclass Classifiers

Multiclass classifiers predict one class out of a predefined set of possible classes.

Before using the model, let's first start the Spark Session, which can be done using our library: 

In [3]:
import pyspark.sql.functions as F
from johnsnowlabs import nlp, finance, viz
# Automatically load license data and start a session with all jars user has access to
spark = nlp.start()
spark

👌 Launched cpu optimized session with with: 🚀Spark-NLP==4.2.4, 💊Spark-Healthcare==4.2.4, running on ⚡ PySpark==3.1.2


### Environmental, Social and Governance (ESG)

We will use two classifiers, one with 26 classes:

`Business_Ethics`, `Data_Security`, `Access_And_Affordability`, `Business_Model_Resilience`, `Competitive_Behavior`, `Critical_Incident_Risk_Management`, `Customer_Welfare`, `Director_Removal`, `Employee_Engagement_Inclusion_And_Diversity`, `Employee_Health_And_Safety`, `Human_Rights_And_Community_Relations`, `Labor_Practices`, `Management_Of_Legal_And_Regulatory_Framework`, `Physical_Impacts_Of_Climate_Change`, `Product_Quality_And_Safety`, `Product_Design_And_Lifecycle_Management`, `Selling_Practices_And_Product_Labeling`, `Supply_Chain_Management`, `Systemic_Risk_Management`, `Waste_And_Hazardous_Materials_Management`, `Water_And_Wastewater_Management`, `Air_Quality`, `Customer_Privacy`, `Ecological_Impacts`, `Energy_Management`, `GHG_Emissions`


and one with only three: `Social`, `Governance`, `Environmental` (or `None`)

In [4]:
document_assembler = (
    nlp.DocumentAssembler().setInputCol("text").setOutputCol("document")
)

tokenizer = nlp.Tokenizer().setInputCols(["document"]).setOutputCol("token")

many_classes = (
    finance.BertForSequenceClassification.pretrained(
        "finclf_augmented_esg", "en", "finance/models"
    )
    .setInputCols(["document", "token"])
    .setOutputCol("esg_many")
)

three_classes = (
    finance.BertForSequenceClassification.pretrained(
        "finclf_esg", "en", "finance/models"
    )
    .setInputCols(["document", "token"])
    .setOutputCol("esg")
)

pipeline = nlp.Pipeline(
    stages=[document_assembler, tokenizer, many_classes, three_classes]
)

# A simple example text
example = spark.createDataFrame(
    [
        [
            """The Canadian Environmental Assessment Agency (CEAA) concluded that in June 2016 the company had not made an effort
 to protect public drinking water and was ignoring concerns raised by its own scientists about the potential levels of pollutants in the local water supply.
  At the time, there were concerns that the company was not fully testing onsite wells for contaminants and did not use the proper methods for testing because 
  of its test kits now manufactured in China.A preliminary report by the company in June 2016 was commissioned by the Alberta government to provide recommendations 
  to Alberta Environment officials"""
        ]
    ]
).toDF("text")

result = pipeline.fit(example).transform(example)

# result is a DataFrame
result.select(
    "text", F.expr("esg_many.result as many"), F.expr("esg.result as esg")
).show(truncate=80)

finclf_augmented_esg download started this may take some time.
[OK!]
finclf_esg download started this may take some time.
[OK!]
+--------------------------------------------------------------------------------+------------------------------------------+---------------+
|                                                                            text|                                      many|            esg|
+--------------------------------------------------------------------------------+------------------------------------------+---------------+
|The Canadian Environmental Assessment Agency (CEAA) concluded that in June 20...|[Waste_And_Hazardous_Materials_Management]|[Environmental]|
+--------------------------------------------------------------------------------+------------------------------------------+---------------+



### Financial News Multilabel Classification

This model can identify different topics contained in financial news (trained on news scraped from the Internet and manual in-house annotations). The available topics are:

- `acq`: Acquisition / Purchase operations
- `finance`: Generic financial news
- `fuel`: News about fuel and energy sources
- `jobs`: News about jobs, employment rates, etc.
- `livestock`: News about animals and livestock
- `mineral`: News about minerals such as copper, gold, silver, coal, etc.
- `plant`: News about greens, plants, cereals, etc.
- `trade`: Trading news

In [5]:
documentAssembler = (
    nlp.DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")
    .setCleanupMode("shrink")
)

embeddings = (
    nlp.UniversalSentenceEncoder.pretrained()
    .setInputCols("document")
    .setOutputCol("embeddings")
)

docClassifier = (
    nlp.MultiClassifierDLModel.pretrained("finmulticlf_news", "en", "finance/models")
    .setInputCols("embeddings")
    .setOutputCol("topics")
)

pipeline = nlp.Pipeline().setStages([documentAssembler, embeddings, docClassifier])

empty_data = spark.createDataFrame([[""]]).toDF("text")
pipelineModel = pipeline.fit(empty_data)

tfhub_use download started this may take some time.
Approximate size to download 923.7 MB
[OK!]
finmulticlf_news download started this may take some time.
Approximate size to download 12.3 MB
[OK!]


In [6]:
text = ["""
ECUADOR HAS TRADE SURPLUS IN FIRST FOUR MONTHS Ecuador posted a trade surplus of 10.6 mln dlrs in the first four months of 1987 compared with a surplus of 271.7 mln in the same period in 1986, the central bank of Ecuador said in its latest monthly report. Ecuador suspended sales of crude oil, its principal export product, in March after an earthquake destroyed part of its oil-producing infrastructure. Exports in the first four months of 1987 were around 639 mln dlrs and imports 628.3 mln, compared with 771 mln and 500 mln respectively in the same period last year. Exports of crude and products in the first four months were around 256.1 mln dlrs, compared with 403.3 mln in the same period in 1986. The central bank said that between January and May Ecuador sold 16.1 mln barrels of crude and 2.3 mln barrels of products, compared with 32 mln and 2.7 mln respectively in the same period last year. Ecuador's international reserves at the end of May were around 120.9 mln dlrs, compared with 118.6 mln at the end of April and 141.3 mln at the end of May 1986, the central bank said. gold reserves were 165.7 mln dlrs at the end of May compared with 124.3 mln at the end of April.
"""]

df = spark.createDataFrame([text]).toDF("text")

result = pipelineModel.transform(df)
result.select("text", "topics.result").show(truncate=60)

+------------------------------------------------------------+----------------+
|                                                        text|          result|
+------------------------------------------------------------+----------------+
|
ECUADOR HAS TRADE SURPLUS IN FIRST FOUR MONTHS Ecuador p...|[finance, trade]|
+------------------------------------------------------------+----------------+



## Finding relevant sections of 10-K filings

We will use publicly available information about Cadence from SEC's Edgar database [here](https://www.sec.gov/Archives/edgar/data/813672/000081367222000012/cdns-20220101.htm) and [Wikipedia](https://en.wikipedia.org/wiki/Cadence_Design_Systems) to illustrate some of our binary classifiers.

In [7]:
!wget -q https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/master/tutorials/Certification_Trainings_JSL/Finance/data/cdns-20220101.html.txt -O sample10k.txt

In [8]:
text = open("sample10k.txt", "r").read()
print(text[:200])

Table of Contents
UNITED STATES SECURITIES AND EXCHANGE COMMISSION
Washington, D.C. 20549
_____________________________________ 
FORM 10-K 
_____________________________________  
(Mark One)
☒
ANNUAL 


First, let's split this big text into pages (we identified that every page starts with the string "Table of Contents" and use that to split).

In [9]:
document_assembler = (
    nlp.DocumentAssembler().setInputCol("text").setOutputCol("document")
)

sentence_detector = (
    nlp.SentenceDetector()
    .setInputCols(["document"])
    .setOutputCol("pages")
    .setCustomBounds(["Table of Contents"])
    .setUseCustomBoundsOnly(True)
)

nlp_pipeline = nlp.Pipeline(stages=[document_assembler, sentence_detector])

empty_data = spark.createDataFrame([[""]]).toDF("text")
sentence_splitting_pipe = nlp_pipeline.fit(empty_data)
sentence_splitting_lightpipe = nlp.LightPipeline(sentence_splitting_pipe)

In [10]:
res = sentence_splitting_lightpipe.annotate(text)
pages = res['pages']
pages = [p for p in pages if p.strip() != ''] # We remove empty pages
len(pages)

90
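The splitting step above can be approximated in plain Python, which may help to see what the custom boundary does. This is a sketch, not the `SentenceDetector` implementation: `split_into_pages` is a hypothetical helper and the sample text is a shortened, made-up stand-in for the filing.

```python
def split_into_pages(text, boundary="Table of Contents"):
    """Split text at every occurrence of boundary and drop empty pages."""
    pages = [page.strip() for page in text.split(boundary)]
    return [page for page in pages if page]

# Shortened, made-up sample; the second boundary is followed by an empty page
sample = (
    "Table of Contents\nUNITED STATES SECURITIES AND EXCHANGE COMMISSION\nFORM 10-K\n"
    "Table of Contents\n"
    "Table of Contents\nPART I\nItem 1. Business\n"
)

pages_demo = split_into_pages(sample)
print(len(pages_demo))
print(pages_demo[0])
```

As in the notebook cell, the boundary string itself is not kept and pages that contain only whitespace are discarded.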

In [11]:
print(pages[0])

UNITED STATES SECURITIES AND EXCHANGE COMMISSION
Washington, D.C. 20549
_____________________________________ 
FORM 10-K 
_____________________________________  
(Mark One)
☒
ANNUAL REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF 1934
For the fiscal year ended January 1, 2022 
OR
☐
TRANSITION REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF 1934
For the transition period from _________ to_________.

Commission file number 000-15867 
_____________________________________
 
CADENCE DESIGN SYSTEMS, INC. 
(Exact name of registrant as specified in its charter)
____________________________________ 
Delaware
 
00-0000000
(State or Other Jurisdiction ofIncorporation or Organization)
 
(I.R.S. EmployerIdentification No.)
2655 Seely Avenue, Building 5,
San Jose,
California
 
95134
(Address of Principal Executive Offices)
 
(Zip Code)
(408)
-943-1234 
(Registrant’s Telephone Number, including Area Code) 
Securities registered pursuant to Section 12

<img src="https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Finance/data/10k_image.png?raw=true"/>

Let's create a function that generates pipelines with the desired model, so we can use different binary classifiers with ease.

In [12]:
def get_binary_pipeline(model_name):
    documentAssembler = (
        nlp.DocumentAssembler().setInputCol("text").setOutputCol("document")
    )

    useEmbeddings = (
        nlp.UniversalSentenceEncoder.pretrained()
        .setInputCols("document")
        .setOutputCol("sentence_embeddings")
    )

    docClassifier = (
        nlp.ClassifierDLModel.pretrained(model_name, "en", "finance/models")
        .setInputCols(["sentence_embeddings"])
        .setOutputCol("category")
    )

    nlpPipeline = nlp.Pipeline(stages=[documentAssembler, useEmbeddings, docClassifier])

    return nlpPipeline

### Finding Summary part

The summary page is usually the first page of the report, but let's suppose we don't know that. This binary classifier will predict `form_10k_summary` if the page is the summary page or `other` otherwise.

In [13]:
cls_pipeline = get_binary_pipeline("finclf_form_10k_summary_item")
empty_data = spark.createDataFrame([[""]]).toDF("text")

cls_model = cls_pipeline.fit(empty_data)

tfhub_use download started this may take some time.
Approximate size to download 923.7 MB
[OK!]
finclf_form_10k_summary_item download started this may take some time.
Approximate size to download 21.2 MB
[OK!]


In [14]:
df = spark.createDataFrame([[pages[0]]]).toDF("text")
result = cls_model.transform(df)
result.select('category.result').show()

+------------------+
|            result|
+------------------+
|[form_10k_summary]|
+------------------+



### Finding Acquisitions and Subsidiaries part

Let's send some pages and check which one(s) contain that information. In a real case, you could send all the pages to the model, but here for time saving purposes, we will show just a subset.

In [15]:
candidates = [[pages[0]], [pages[1]], [pages[35]], [pages[50]], [pages[67]]] # Some examples
df = spark.createDataFrame(candidates).toDF("text")

In [16]:
classification_pipeline = get_binary_pipeline('finclf_acquisitions_item')

model = classification_pipeline.fit(df)
result = model.transform(df)

tfhub_use download started this may take some time.
Approximate size to download 923.7 MB
[OK!]
finclf_acquisitions_item download started this may take some time.
Approximate size to download 21.3 MB
[OK!]


In [17]:
result.select('category.result').show()

+--------------+
|        result|
+--------------+
|       [other]|
|       [other]|
|       [other]|
|       [other]|
|[acquisitions]|
+--------------+
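To see which page was flagged, the predictions can be paired with the page indices that were sent to the model. The sketch below assumes that the rows come back in the same order as `candidates`; the `predictions` list is copied by hand from the output above, and `pages_with_label` is a hypothetical helper.

```python
# Page indices used to build `candidates`, in the same order
page_indices = [0, 1, 35, 50, 67]

# Predictions copied by hand from the output above (one list per row)
predictions = [["other"], ["other"], ["other"], ["other"], ["acquisitions"]]

def pages_with_label(page_indices, predictions, label):
    """Return the indices of the pages whose prediction contains label."""
    return [idx for idx, pred in zip(page_indices, predictions) if label in pred]

flagged = pages_with_label(page_indices, predictions, "acquisitions")
print(flagged)
```

When sending all pages to the model, carrying the page index as an extra DataFrame column would avoid relying on row order.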





### Finding Management and Work Experience part


Let's send some pages and check which one(s) contain that information. In a real case, you could send all the pages to the model, but here for time saving purposes, we will show just a subset.

In [18]:
candidates = [[pages[4]], [pages[84]], [pages[85]], [pages[86]], [pages[87]]]
df = spark.createDataFrame(candidates).toDF("text")


classification_pipeline = get_binary_pipeline('finclf_work_experience_item')
model = classification_pipeline.fit(df)
result = model.transform(df)
result.select('category.result').show()

tfhub_use download started this may take some time.
Approximate size to download 923.7 MB
[OK!]
finclf_work_experience_item download started this may take some time.
Approximate size to download 21.2 MB
[OK!]
+-----------------+
|           result|
+-----------------+
|          [other]|
|          [other]|
|          [other]|
|[work_experience]|
|          [other]|
+-----------------+



### Using LightPipeline

[LightPipelines](https://nlp.johnsnowlabs.com/docs/en/concepts#using-spark-nlps-lightpipeline) are Spark NLP specific Pipelines, equivalent to Spark ML Pipelines, but meant to deal with smaller amounts of data. They are useful when working with small datasets, debugging results, or running either training or prediction from an API that serves one-off requests.

Spark NLP LightPipelines are Spark ML pipelines converted to run on a single machine as a multi-threaded task, **becoming more than 10x faster** for smaller amounts of data (small is relative, but 50k sentences is roughly a good maximum). To use them, we simply plug in a trained (fitted) pipeline and then annotate plain text. We don't even need to convert the input text to a DataFrame in order to feed it into a pipeline that accepts a DataFrame as input in the first place. This is quite useful for getting a prediction for a few lines of text from a trained ML model.

For more details:
[https://medium.com/spark-nlp/spark-nlp-101-lightpipeline-a544e93f20f1](https://medium.com/spark-nlp/spark-nlp-101-lightpipeline-a544e93f20f1)

In [19]:
light_model = nlp.LightPipeline(cls_model)

You can pass a string or a list of strings to the method [.annotate()](https://nlp.johnsnowlabs.com/api/python/reference/autosummary/sparknlp/base/light_pipeline/index.html#sparknlp.base.light_pipeline.LightPipeline.annotate) to get the results. To get more metadata in the result, use the method [.fullAnnotate()](https://nlp.johnsnowlabs.com/api/python/reference/autosummary/sparknlp/base/light_pipeline/index.html#sparknlp.base.light_pipeline.LightPipeline.fullAnnotate) instead. The result of `.annotate()` is a `list` if a `list` is given, or a `dict` if a string is given.

To extract the results from the object, you just need to parse the dictionary.

In [20]:
lp_results = light_model.annotate(pages[0])
lp_results.keys()

dict_keys(['document', 'sentence_embeddings', 'category'])

In [21]:
# List with the predicted categories
lp_results["category"]

['form_10k_summary']

We can see that `.annotate()` didn't return metadata in the `category` item. How can we obtain it? By using `.fullAnnotate()` instead. This method always returns a list.

In [22]:
lp_results_full = light_model.fullAnnotate(pages[0])
lp_results_full[0].keys()

dict_keys(['document', 'sentence_embeddings', 'category'])

In [23]:
lp_results_full[0]["category"]

[Annotation(category, 0, 4047, form_10k_summary, {'sentence': '0', 'form_10k_summary': '0.99994636', 'other': '5.3589152E-5'})]

Now we can see all the metadata in the annotation objects. Let's get the results in a tabular form.

In [24]:
import pandas as pd

# Collect one row per annotation. The confidence of the predicted label
# is stored in the metadata under the label's own name.
results_tabular = []
for res in lp_results_full[0]["category"]:
    results_tabular.append(
        (
            res.begin,
            res.end,
            res.result,
            res.metadata[res.result],
        )
    )

pd.DataFrame(results_tabular, columns=["begin", "end", "category", "confidence"])


|    | begin | end  | category         | confidence |
|:---|:------|:-----|:-----------------|:-----------|
| 0  | 0     | 4047 | form_10k_summary | 0.99994636 |
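The same extraction can be written so that it works for any predicted label rather than one specific class. The sketch below does not need a Spark session: `FakeAnnotation` is a hypothetical stand-in for the annotation object, filled with the values printed above, and `annotation_to_row` is an illustrative helper.

```python
from collections import namedtuple

# Hypothetical stand-in for the annotation object (illustration only)
FakeAnnotation = namedtuple("FakeAnnotation", ["begin", "end", "result", "metadata"])

def annotation_to_row(annotation):
    """Return (begin, end, label, confidence) for the predicted label."""
    # The metadata stores one score per class, keyed by the class name, as strings
    confidence = float(annotation.metadata[annotation.result])
    return (annotation.begin, annotation.end, annotation.result, confidence)

example_annotation = FakeAnnotation(
    begin=0,
    end=4047,
    result="form_10k_summary",
    metadata={"sentence": "0", "form_10k_summary": "0.99994636", "other": "5.3589152E-5"},
)

row = annotation_to_row(example_annotation)
print(row)
```

Looking up the score by `annotation.result` means the helper can be reused with the other binary classifiers without editing the class name.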


## Training a custom Classification model

> **Please restart the runtime (if in Colab) to free memory required for training**

In [2]:
import os
from johnsnowlabs import nlp, finance
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
import pyspark.sql.functions as F


In [10]:
spark = nlp.start()

📋 Loading license number 0 from /root/.johnsnowlabs/licenses/license_number_0_for_Spark-Healthcare_Spark-OCR.json
👌 Launched cpu optimized session with with: 🚀Spark-NLP==4.2.4, 💊Spark-Healthcare==4.2.4, running on ⚡ PySpark==3.1.2


If your application needs categories that the provided pretrained models cannot identify, you can train a new model that fits your requirements. To do that, you first need to collect and label enough data. If you are not sure how to annotate (label) text data, try our free tool [Annotation Lab](https://nlp.johnsnowlabs.com/docs/en/alab/quickstart), where you can easily label text data and export it in a format suitable for training.

For our purposes here, we will use a sample file annotated by our team.

In [26]:
! wget -q https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/master/tutorials/Certification_Trainings_JSL/Finance/data/finance_clf_data.csv

In [3]:
!head -2 finance_clf_data.csv

text,label,len
"Presently we do not believe any U S or State regulatory body has taken any action or position adverse to our main cryptocurrency bitcoin with respect to its production sale and use as a medium of exchange however future changes to existing regulations or entirely new regulations may affect our business in ways it is not presently possible for us to predict with any reasonable degree of reliability 


In [4]:
finance_data = pd.read_csv("finance_clf_data.csv")
finance_data.head(3)

|    | text                                               | label        | len |
|:---|:---------------------------------------------------|:-------------|:----|
| 0  | Presently we do not believe any U S or State r...  | business     | 402 |
| 1  | \nnetwork outages or performance degradation ...   | risk_factors | 496 |
| 2  | Available Information\nOur reports filed with ...  | business     | 356 |


In [5]:
finance_data["label"].value_counts()

risk_factors               1926
financial_statements       1888
business                    970
financial_conditions        346
form_10k_summary            240
executives_compensation     155
controls_procedures         138
equity                      111
market_risk                 100
executives                   73
legal_proceedings            51
properties                   48
security_ownership           46
exhibits                     36
Name: label, dtype: int64

In [6]:
finance_data.columns

Index(['text', 'label', 'len'], dtype='object')

In [7]:
train_data, test_data = train_test_split(
    finance_data, train_size=0.8, stratify=finance_data.label, random_state=42
)
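A stratified split keeps roughly the same class proportions in both parts. The sketch below estimates the per-class sizes by simple rounding, using the label counts printed earlier. It is an approximation, not scikit-learn's exact allocation, which could differ by a row in general; `expected_split` is a hypothetical helper.

```python
# Label counts copied from the value_counts() output above
label_counts = {
    "risk_factors": 1926,
    "financial_statements": 1888,
    "business": 970,
    "financial_conditions": 346,
    "form_10k_summary": 240,
    "executives_compensation": 155,
    "controls_procedures": 138,
    "equity": 111,
    "market_risk": 100,
    "executives": 73,
    "legal_proceedings": 51,
    "properties": 48,
    "security_ownership": 46,
    "exhibits": 36,
}

def expected_split(label_counts, test_fraction=0.2):
    """Estimate (train_rows, test_rows) per class by rounding the test share."""
    split = {}
    for label, count in label_counts.items():
        test_rows = round(count * test_fraction)
        split[label] = (count - test_rows, test_rows)
    return split

split = expected_split(label_counts)
total_train = sum(train for train, _ in split.values())
total_test = sum(test for _, test in split.values())

for label, (train, test) in split.items():
    print(f"{label:<25} train={train:<5} test={test}")
print(total_train, total_test)
```

The smallest classes end up with only a handful of test rows, which is worth keeping in mind when reading per-class metrics later.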

### Train with Universal Sentence Encoder

In [11]:
document_assembler = (
    nlp.DocumentAssembler().setInputCol("text").setOutputCol("document")
)

embeddings = (
    nlp.UniversalSentenceEncoder.pretrained()
    .setInputCols("document")
    .setOutputCol("sentence_embeddings")
)

classifier_dl = (
    finance.ClassifierDLApproach()
    .setInputCols(["sentence_embeddings"])
    .setOutputCol("class")
    .setLabelColumn("label")
    .setMaxEpochs(50)
    .setEnableOutputLogs(True)
    .setBatchSize(4)
)

clf_pipeline = nlp.Pipeline(stages=[document_assembler, embeddings, classifier_dl])

tfhub_use download started this may take some time.
Approximate size to download 923.7 MB
[OK!]


In [12]:
train_data = spark.createDataFrame(train_data)
test_data = spark.createDataFrame(test_data)

In [13]:
%%time
clf_pipelineModel = clf_pipeline.fit(train_data)

CPU times: user 1.25 s, sys: 135 ms, total: 1.38 s
Wall time: 5min 4s


Testing the model

In [14]:
preds = clf_pipelineModel.transform(test_data)

preds_df = preds.select('label', 'text', "class.result").toPandas()

# The result is an array, since in Spark NLP a text can hold multiple sentences.
# Here each text is a single document, so we take the first (and only) item.
preds_df['result'] = preds_df['result'].apply(lambda x : x[0])

print(classification_report(preds_df['label'], preds_df['result']))

                         precision    recall  f1-score   support

               business       0.62      0.76      0.68       194
    controls_procedures       0.00      0.00      0.00        28
                 equity       0.00      0.00      0.00        22
             executives       0.00      0.00      0.00        15
executives_compensation       0.00      0.00      0.00        31
               exhibits       0.00      0.00      0.00         7
   financial_conditions       0.00      0.00      0.00        69
   financial_statements       0.65      0.93      0.76       378
       form_10k_summary       0.00      0.00      0.00        48
      legal_proceedings       0.00      0.00      0.00        10
            market_risk       0.00      0.00      0.00        20
             properties       0.00      0.00      0.00        10
           risk_factors       0.77      0.88      0.82       385
     security_ownership       0.00      0.00      0.00         9

               accuracy



We can see that the classes with few observations did not get any predictions! This happens because we used a small, imbalanced dataset and trained for few epochs. Let's check the logs to see how the model was learning.
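The rows of zeros in the report follow directly from how the per-class metrics are defined. Here is a minimal sketch with made-up labels: a class that is never predicted has no true positives, so its precision, recall and F1 are all reported as 0 (scikit-learn uses the same convention and emits a warning). `per_class_scores` is a hypothetical helper, not part of any library.

```python
def per_class_scores(y_true, y_pred, label):
    """Compute precision, recall and F1 for one class from two label lists."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    # By convention, an undefined ratio (division by zero) is reported as 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1

# Made-up toy data: the model never predicts "exhibits"
y_true = ["risk_factors", "risk_factors", "business", "exhibits", "exhibits"]
y_pred = ["risk_factors", "risk_factors", "business", "risk_factors", "business"]

exhibits_scores = per_class_scores(y_true, y_pred, "exhibits")
risk_scores = per_class_scores(y_true, y_pred, "risk_factors")
print(exhibits_scores)
print(risk_scores)
```

The misclassified minority examples also lower the precision of the classes that absorb them, which is one reason the frequent classes in the report do not reach perfect scores.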

In [15]:
log_file_name = os.listdir("/root/annotator_logs")[0]
log_file_name

'FinanceClassifierDLApproach_9e9f77b13825.log'

In [16]:
!cat /root/annotator_logs/$log_file_name

Training started - epochs: 50 - learning_rate: 0.005 - batch_size: 4 - training_examples: 4902 - classes: 14
Epoch 0/50 - 6.46s - loss: 2992.5574 - acc: 0.3087755 - batches: 1226
Epoch 1/50 - 6.33s - loss: 2996.461 - acc: 0.30816326 - batches: 1226
Epoch 2/50 - 5.62s - loss: 2996.461 - acc: 0.30816326 - batches: 1226
Epoch 3/50 - 5.66s - loss: 2996.461 - acc: 0.30816326 - batches: 1226
Epoch 4/50 - 5.85s - loss: 2996.461 - acc: 0.30816326 - batches: 1226
Epoch 5/50 - 6.34s - loss: 2996.461 - acc: 0.30816326 - batches: 1226
Epoch 6/50 - 5.49s - loss: 2996.461 - acc: 0.30816326 - batches: 1226
Epoch 7/50 - 5.62s - loss: 2996.461 - acc: 0.30816326 - batches: 1226
Epoch 8/50 - 5.72s - loss: 2996.461 - acc: 0.30816326 - batches: 1226
Epoch 9/50 - 5.50s - loss: 2916.2534 - acc: 0.37938777 - batches: 1226
Epoch 10/50 - 5.58s - loss: 2701.5913 - acc: 0.56142855 - batches: 1226
Epoch 11/50 - 5.42s - loss: 2682.6233 - acc: 0.57265306 - batches: 1226
Epoch 12/50 - 5.55s - loss: 2671.152 - acc: 0.

We can see that the accuracy has an increasing trend, meaning that the model could continue to improve. You can try increasing the number of epochs in the pipeline yourself and check how the model improves.
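To inspect the trend programmatically instead of reading the raw log, the epoch lines can be parsed with a regular expression. This is a sketch that assumes the line layout shown in the log above; `parse_training_log` is a hypothetical helper, and the sample reuses a few lines from that log, including the truncated last one, which is skipped because it is incomplete.

```python
import re

# Pattern for lines such as:
# "Epoch 9/50 - 5.50s - loss: 2916.2534 - acc: 0.37938777 - batches: 1226"
LOG_LINE = re.compile(
    r"Epoch (\d+)/(\d+) - ([\d.]+)s - loss: ([\d.]+) - acc: ([\d.]+) - batches: (\d+)"
)

def parse_training_log(log_text):
    """Return one dict per complete epoch line found in log_text."""
    rows = []
    for match in LOG_LINE.finditer(log_text):
        epoch, _total, seconds, loss, acc, _batches = match.groups()
        rows.append(
            {
                "epoch": int(epoch),
                "seconds": float(seconds),
                "loss": float(loss),
                "acc": float(acc),
            }
        )
    return rows

sample_log = """Training started - epochs: 50 - learning_rate: 0.005 - batch_size: 4 - training_examples: 4902 - classes: 14
Epoch 8/50 - 5.72s - loss: 2996.461 - acc: 0.30816326 - batches: 1226
Epoch 9/50 - 5.50s - loss: 2916.2534 - acc: 0.37938777 - batches: 1226
Epoch 10/50 - 5.58s - loss: 2701.5913 - acc: 0.56142855 - batches: 1226
Epoch 11/50 - 5.42s - loss: 2682.6233 - acc: 0.57265306 - batches: 1226
Epoch 12/50 - 5.55s - loss: 2671.152 - acc: 0.
"""

history = parse_training_log(sample_log)
accuracies = [row["acc"] for row in history]
# True when accuracy never drops from one parsed epoch to the next
improving = all(later >= earlier for earlier, later in zip(accuracies, accuracies[1:]))
print(len(history), improving)
```

The parsed rows can then be plotted or compared across runs with different numbers of epochs.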

### Saving and loading models

In [17]:
clf_pipelineModel.stages

[DocumentAssembler_b1121d39595c,
 UNIVERSAL_SENTENCE_ENCODER_4de71669b7ec,
 FinanceClassifierDLModel_366460063901]

In [18]:
clf_pipelineModel.stages[-1].write().overwrite().save('Clf_model')

In [19]:
# Load the saved classifier model back
ClfModel = finance.ClassifierDLModel.load('Clf_model')

In [20]:
ld_pipeline = nlp.Pipeline(stages=[document_assembler, embeddings, ClfModel])
ld_pipeline_model = ld_pipeline.fit(spark.createDataFrame([['']]).toDF("text"))

In [21]:
# Apply Model Transform to testData
ld_preds = ld_pipeline_model.transform(test_data)

In [22]:
ld_preds_df = ld_preds.select('text','label',"class.result").toPandas()
ld_preds_df.head()

|    | text                                               | label                | result                 |
|:---|:---------------------------------------------------|:---------------------|:-----------------------|
| 0  | In addition in connection with our issuance of...  | risk_factors         | [risk_factors]         |
| 1  | \nBasis for Opinion\nThese financial statemen...   | financial_statements | [financial_statements] |
| 2  | associated with cryptocurrencies may have had ...  | risk_factors         | [risk_factors]         |
| 3  | 30 416\n \nGoodwill\n51 041\n \nTotal purchase...  | financial_statements | [financial_statements] |
| 4  | currently receive or at all or that a lease te...  | risk_factors         | [risk_factors]         |
