![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

#🔎 Classify Financial Texts

In this notebook, you will learn how to use Spark NLP and Finance NLP to perform text classification.

In [None]:
from johnsnowlabs import nlp, finance, viz
import pyspark.sql.functions as F

##🔎 Pretrained models

📜For the text classification tasks, we will use two annotators:

- `ClassifierDL`: takes sentence embeddings from the state-of-the-art Universal Sentence Encoder as input to a deep learning model (DNN) built with TensorFlow. It supports `Binary Classification` and `Multiclass Classification` (up to 100 classes).
- `MultiClassifierDL`: performs `Multilabel Classification` (it can predict more than one class for each text) using a Bidirectional GRU with Convolution architecture built with TensorFlow, supporting up to 100 classes. The inputs are sentence embeddings such as the state-of-the-art `UniversalSentenceEncoder`, `BertSentenceEmbeddings`, or `SentenceEmbeddings`.
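
To make the difference between multiclass and multilabel prediction concrete, here is a conceptual sketch in plain Python. It is not the annotators' implementation: the raw scores are invented, and the helper names (`predict_multiclass`, `predict_multilabel`) and the `0.5` threshold are assumptions chosen for illustration only.

```python
import math


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]  # shift for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(score):
    """Turn one raw score into an independent probability."""
    return 1.0 / (1.0 + math.exp(-score))


def predict_multiclass(labels, scores):
    """Multiclass: exactly one label, the one with the highest probability."""
    probs = softmax(scores)
    best = max(range(len(labels)), key=probs.__getitem__)
    return labels[best]


def predict_multilabel(labels, scores, threshold=0.5):
    """Multilabel: every label whose own probability reaches the threshold."""
    return [label for label, s in zip(labels, scores) if sigmoid(s) >= threshold]


# Invented raw scores for three topics (illustration only, not model output).
labels = ["acq", "fuel", "trade"]
scores = [-1.2, 0.4, 2.3]

multiclass_prediction = predict_multiclass(labels, scores)
multilabel_prediction = predict_multilabel(labels, scores)
print(multiclass_prediction)
print(multilabel_prediction)
```

With the same scores, the multiclass rule keeps only the single best label, while the multilabel rule may keep several labels or none.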

Example Classification models:

| title                                                    | language   | predicted_entities                                                                                                      | compatible_editions                |
|:---------------------------------------------------------|:-----------|:------------------------------------------------------------------------------------------------------------------------|:-----------------------------------|
| Bank Complaints Classification                           | en         | ['Accounts', 'Credit Cards', 'Credit Reporting', 'Debt Collection', 'Loans', 'Money Transfer and Currency', 'Mortgage'] | ['Finance NLP 1.0', 'Finance NLP'] |
| Financial Finbert Sentiment Analysis (DistilRoBerta)     | en         | ['positive', 'negative', 'neutral']                                                                                     | ['Finance NLP 1.0', 'Finance NLP'] |
| Financial Business Item Binary Classifier                | en         | ['other', 'business']                                                                                                   | ['Finance NLP 1.0', 'Finance NLP'] |
| Financial Controls procedures Item Binary Classifier     | en         | ['other', 'controls_procedures']                                                                                        | ['Finance NLP 1.0', 'Finance NLP'] |
| Financial Equity Item Binary Classifier                  | en         | ['other', 'equity']                                                                                                     | ['Finance NLP 1.0', 'Finance NLP'] |
| Financial Executives compensation Item Binary Classifier | en         | ['other', 'executives_compensation']                                                                                    | ['Finance NLP 1.0', 'Finance NLP'] |
| Financial Executives Item Binary Classifier              | en         | ['other', 'executives']                                                                                                 | ['Finance NLP 1.0', 'Finance NLP'] |
| Financial Exhibits Item Binary Classifier                | en         | ['other', 'exhibits']                                                                                                   | ['Finance NLP 1.0', 'Finance NLP'] |
| Financial Financial conditions Item Binary Classifier    | en         | ['other', 'financial_conditions']                                                                                       | ['Finance NLP 1.0', 'Finance NLP'] |
| Financial Financial statements Item Binary Classifier    | en         | ['other', 'financial_statements']                                                                                       | ['Finance NLP 1.0', 'Finance NLP'] |

##🔎 Multiclass Classifiers

Multiclass classifiers predict one class out of a predefined set of possible classes.

###📚 Environmental, Social and Governance (ESG)

📜We will use two classifiers, one with 26 classes:

`Business_Ethics`, `Data_Security`, `Access_And_Affordability`, `Business_Model_Resilience`, `Competitive_Behavior`, `Critical_Incident_Risk_Management`, `Customer_Welfare`, `Director_Removal`, `Employee_Engagement_Inclusion_And_Diversity`, `Employee_Health_And_Safety`, `Human_Rights_And_Community_Relations`, `Labor_Practices`, `Management_Of_Legal_And_Regulatory_Framework`, `Physical_Impacts_Of_Climate_Change`, `Product_Quality_And_Safety`, `Product_Design_And_Lifecycle_Management`, `Selling_Practices_And_Product_Labeling`, `Supply_Chain_Management`, `Systemic_Risk_Management`, `Waste_And_Hazardous_Materials_Management`, `Water_And_Wastewater_Management`, `Air_Quality`, `Customer_Privacy`, `Ecological_Impacts`, `Energy_Management`, `GHG_Emissions`


and one with only three: `Social`, `Governance`, `Environmental` (or `None`)

In [None]:
document_assembler = (
    nlp.DocumentAssembler().setInputCol("text").setOutputCol("document")
)

tokenizer = nlp.Tokenizer().setInputCols(["document"]).setOutputCol("token")

many_classes = (
    finance.BertForSequenceClassification.pretrained(
        "finclf_augmented_esg", "en", "finance/models"
    )
    .setInputCols(["document", "token"])
    .setOutputCol("esg_many")
)

three_classes = (
    finance.BertForSequenceClassification.pretrained(
        "finclf_esg", "en", "finance/models"
    )
    .setInputCols(["document", "token"])
    .setOutputCol("esg")
)

pipeline = nlp.Pipeline(
    stages=[document_assembler, tokenizer, many_classes, three_classes]
)

# A simple example
example = spark.createDataFrame(
    [
        [
            """The Canadian Environmental Assessment Agency (CEAA) concluded that in June 2016 the company had not made an effort
 to protect public drinking water and was ignoring concerns raised by its own scientists about the potential levels of pollutants in the local water supply.
  At the time, there were concerns that the company was not fully testing onsite wells for contaminants and did not use the proper methods for testing because 
  of its test kits now manufactured in China.A preliminary report by the company in June 2016 was commissioned by the Alberta government to provide recommendations 
  to Alberta Environment officials"""
        ]
    ]
).toDF("text")

result = pipeline.fit(example).transform(example)

# result is a DataFrame
result.select(
    "text", F.expr("esg_many.result as many"), F.expr("esg.result as esg")
).show(truncate=80)

###📚 Financial News Multilabel Classification

📜This model can identify different topics contained in financial news (trained on news scraped from the Internet and manual in-house annotations). The available topics are:

- `acq`: Acquisition / Purchase operations
- `finance`: Generic financial news
- `fuel`: News about fuel and energy sources
- `jobs`: News about jobs, employment rates, etc.
- `livestock`: News about animals and livestock
- `mineral`: News about minerals such as copper, gold, silver, coal, etc.
- `plant`: News about greens, plants, cereals, etc.
- `trade`: Trading news

In [None]:
documentAssembler = (
    nlp.DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")
    .setCleanupMode("shrink")
)

embeddings = (
    nlp.UniversalSentenceEncoder.pretrained()
    .setInputCols("document")
    .setOutputCol("embeddings")
)

docClassifier = (
    nlp.MultiClassifierDLModel.pretrained("finmulticlf_news", "en", "finance/models")
    .setInputCols("embeddings")
    .setOutputCol("topics")
)

pipeline = nlp.Pipeline().setStages([documentAssembler, embeddings, docClassifier])

empty_data = spark.createDataFrame([[""]]).toDF("text")
pipelineModel = pipeline.fit(empty_data)

In [None]:
text = ["""
ECUADOR HAS TRADE SURPLUS IN FIRST FOUR MONTHS Ecuador posted a trade surplus of 10.6 mln dlrs in the first four months of 1987 compared with a surplus of 271.7 mln in the same period in 1986, the central bank of Ecuador said in its latest monthly report. Ecuador suspended sales of crude oil, its principal export product, in March after an earthquake destroyed part of its oil-producing infrastructure. Exports in the first four months of 1987 were around 639 mln dlrs and imports 628.3 mln, compared with 771 mln and 500 mln respectively in the same period last year. Exports of crude and products in the first four months were around 256.1 mln dlrs, compared with 403.3 mln in the same period in 1986. The central bank said that between January and May Ecuador sold 16.1 mln barrels of crude and 2.3 mln barrels of products, compared with 32 mln and 2.7 mln respectively in the same period last year. Ecuador's international reserves at the end of May were around 120.9 mln dlrs, compared with 118.6 mln at the end of April and 141.3 mln at the end of May 1986, the central bank said. gold reserves were 165.7 mln dlrs at the end of May compared with 124.3 mln at the end of April.
"""]

df = spark.createDataFrame([text]).toDF("text")

result = pipelineModel.transform(df)
result.select("text", "topics.result").show(truncate=60)

##🔎 Finding relevant sections of 10-K filings

We will use publicly available information about Cadence from SEC's Edgar database [here](https://www.sec.gov/Archives/edgar/data/813672/000081367222000012/cdns-20220101.htm) and [Wikipedia](https://en.wikipedia.org/wiki/Cadence_Design_Systems) to illustrate some of our binary classifiers.

In [None]:
! wget -q https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/master/finance-nlp/data/cdns-20220101.html.txt -O sample10k.txt

dbutils.fs.cp("file:/databricks/driver/sample10k.txt", "dbfs:/") 

In [None]:
# Read the whole filing; the context manager closes the file afterwards
with open("sample10k.txt", "r") as f:
    text = f.read()
print(text[:200])

First, let's split this big text into pages (we identified that every page starts with the string "Table of Contents" and use that to split).

In [None]:
document_assembler = (
    nlp.DocumentAssembler().setInputCol("text").setOutputCol("document")
)

text_splitter = (
    finance.TextSplitter()
    .setInputCols(["document"])
    .setOutputCol("pages")
    .setCustomBounds(["Table of Contents"])
    .setUseCustomBoundsOnly(True)
)

nlp_pipeline = nlp.Pipeline(stages=[document_assembler, text_splitter])

empty_data = spark.createDataFrame([[""]]).toDF("text")
text_splitting_pipe = nlp_pipeline.fit(empty_data)
text_splitting_lightpipe = nlp.LightPipeline(text_splitting_pipe)

In [None]:
res = text_splitting_lightpipe.annotate(text)
pages = res['pages']
pages = [p for p in pages if p.strip() != ''] # We remove empty pages
len(pages)
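
The splitting idea used above can be sketched in plain Python, without Spark NLP. This is only an approximation of the behaviour: `split_on_marker` is a hypothetical helper, the sample string is invented, and the sketch drops the marker itself from each page, which may differ from what the annotator returns.

```python
def split_on_marker(text, marker="Table of Contents"):
    """Split `text` at every occurrence of `marker` and drop empty pages."""
    pieces = text.split(marker)
    # Strip surrounding whitespace and discard pages that end up empty
    return [piece.strip() for piece in pieces if piece.strip()]


# Invented miniature "filing" with the marker repeated between pages
sample = (
    "Cover page Table of Contents Item 1. Business "
    "Table of Contents Table of Contents Item 1A. Risk Factors"
)
sample_pages = split_on_marker(sample)
print(len(sample_pages))
print(sample_pages)
```

Filtering out empty pieces mirrors the `p.strip() != ''` check in the cell above: two markers in a row would otherwise produce a blank page.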

In [None]:
print(pages[0])

<img src="https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Finance/data/10k_image.png?raw=true"/>

Let's create a function that generates pipelines with the desired model, so we can use different binary classifiers with ease.

In [None]:
def get_binary_pipeline(model_name):
    documentAssembler = (
        nlp.DocumentAssembler().setInputCol("text").setOutputCol("document")
    )

    useEmbeddings = (
        nlp.UniversalSentenceEncoder.pretrained()
        .setInputCols("document")
        .setOutputCol("sentence_embeddings")
    )

    docClassifier = (
        nlp.ClassifierDLModel.pretrained(model_name, "en", "finance/models")
        .setInputCols(["sentence_embeddings"])
        .setOutputCol("category")
    )

    nlpPipeline = nlp.Pipeline(stages=[documentAssembler, useEmbeddings, docClassifier])

    return nlpPipeline

###📚 Finding Summary part

The summary page is usually the first page of the report, but let's suppose we don't know that. This binary classifier predicts `form_10k_summary` if the page is the summary page, or `other` otherwise.

In [None]:
cls_pipeline = get_binary_pipeline("finclf_form_10k_summary_item")
empty_data = spark.createDataFrame([[""]]).toDF("text")

cls_model = cls_pipeline.fit(empty_data)

In [None]:
df = spark.createDataFrame([[pages[0]]]).toDF("text")
result = cls_model.transform(df)
result.select('category.result').show()

###📚 Finding Acquisitions and Subsidiaries part

Let's send some pages and check which one(s) contain that information. In a real case, you could send all the pages to the model, but here for time saving purposes, we will show just a subset.

In [None]:
candidates = [[pages[0]], [pages[1]], [pages[35]], [pages[50]], [pages[67]]] # Some examples
df = spark.createDataFrame(candidates).toDF("text")

In [None]:
classification_pipeline = get_binary_pipeline('finclf_acquisitions_item')

model = classification_pipeline.fit(df)
result = model.transform(df)

In [None]:
result.select('category.result').show()
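
Once every candidate page has a predicted label, picking out the relevant pages is a simple filter. The sketch below is plain Python on invented data: `relevant_pages` is a hypothetical helper, and the label `target` is a placeholder for whatever positive label a given binary classifier emits (the negative label `other` is the one used throughout this notebook).

```python
def relevant_pages(page_indices, labels, negative_label="other"):
    """Return the indices of the pages whose label is not the negative label."""
    return [idx for idx, label in zip(page_indices, labels) if label != negative_label]


# Placeholder predictions, NOT real model output
page_indices = [0, 1, 2, 3]
labels = ["other", "target", "other", "target"]

hits = relevant_pages(page_indices, labels)
print(hits)
```

Keeping the original page indices next to the labels makes it easy to go back to `pages[idx]` and read the text that was flagged.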

###📚 Finding About Management and their work experience part

Let's send some pages and check which one(s) contain that information. In a real case, you could send all the pages to the model, but here for time saving purposes, we will show just a subset.

In [None]:
candidates = [[pages[4]], [pages[84]], [pages[85]], [pages[86]], [pages[87]]]
df = spark.createDataFrame(candidates).toDF("text")


classification_pipeline = get_binary_pipeline('finclf_work_experience_item')
model = classification_pipeline.fit(df)
result = model.transform(df)
result.select('category.result').show()

###📚 Using LightPipeline

[LightPipelines](https://nlp.johnsnowlabs.com/docs/en/concepts#using-spark-nlps-lightpipeline) are Spark NLP-specific pipelines, equivalent to Spark ML Pipelines, but meant to deal with smaller amounts of data. They’re useful when working with small datasets, debugging results, or running either training or prediction from an API that serves one-off requests.

Spark NLP LightPipelines are Spark ML pipelines converted to run on a single machine as a multi-threaded task, **becoming more than 10x faster** for smaller amounts of data (small is relative, but 50k sentences is roughly a good maximum). To use them, we simply plug in a trained (fitted) pipeline and then annotate plain text. We don't even need to convert the input text to a DataFrame in order to feed it into a pipeline that accepts a DataFrame as input in the first place. This feature is quite useful for getting a prediction for a few lines of text from a trained ML model.

For more details:
[https://medium.com/spark-nlp/spark-nlp-101-lightpipeline-a544e93f20f1](https://medium.com/spark-nlp/spark-nlp-101-lightpipeline-a544e93f20f1)

In [None]:
light_model = nlp.LightPipeline(cls_model)

You can pass a string or a list of strings to the method [.annotate()](https://nlp.johnsnowlabs.com/api/python/reference/autosummary/sparknlp/base/light_pipeline/index.html#sparknlp.base.light_pipeline.LightPipeline.annotate) to get the results. To get more metadata in the result, use the method [.fullAnnotate()](https://nlp.johnsnowlabs.com/api/python/reference/autosummary/sparknlp/base/light_pipeline/index.html#sparknlp.base.light_pipeline.LightPipeline.fullAnnotate) instead. With `.annotate()`, the result is a `list` if a `list` is given, or a `dict` if a string is given.

To extract the results from the object, you just need to parse the dictionary.
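
Because the return type depends on the input type, a small wrapper can make downstream code uniform. The sketch below works on hand-written dictionaries that mimic the shape described above (output column name mapped to a list of strings); `as_result_list` and `column_values` are hypothetical helpers, not part of Spark NLP.

```python
def as_result_list(annotated):
    """Wrap a single result dict in a list so callers can always iterate."""
    return annotated if isinstance(annotated, list) else [annotated]


def column_values(annotated, column):
    """Collect the values of one output column for every annotated text."""
    return [row[column] for row in as_result_list(annotated)]


# Hand-written stand-ins for the two return shapes (not real model output)
single = {"document": ["first page text"], "category": ["other"]}
batch = [
    {"document": ["first page text"], "category": ["other"]},
    {"document": ["second page text"], "category": ["other"]},
]

print(column_values(single, "category"))
print(column_values(batch, "category"))
```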

In [None]:
lp_results = light_model.annotate(pages[0])
lp_results.keys()

In [None]:
# List with the predicted categories
lp_results["category"]

We can see that `.annotate()` didn't return metadata in the `category` item. How can we obtain it? By using `.fullAnnotate()` instead. This method always returns a list.

In [None]:
lp_results_full = light_model.fullAnnotate(pages[0])
lp_results_full[0].keys()

In [None]:
lp_results_full[0]["category"]

Now we can see all the metadata in the annotation objects. Let's get the results in a tabular form.

In [None]:
results_tabular = []
for res in lp_results_full[0]["category"]:
    results_tabular.append(
        (
            res.begin,
            res.end,
            res.result,
            res.metadata["form_10k_summary"],
        )
    )

import pandas as pd

pd.DataFrame(results_tabular, columns=["begin", "end", "category", "confidence"])


|    |   begin |   end | category         |   confidence |
|---:|--------:|------:|:-----------------|-------------:|
|  0 |       0 |  4047 | form_10k_summary |   0.99994636 |
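
The loop above hard-codes the metadata key `form_10k_summary`. A more general variant could look up the score under the predicted label itself. The sketch below assumes the metadata holds one score per label, keyed by label name. It uses a stand-in `FakeAnnotation` class instead of real annotation objects, and the `other` score is invented for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class FakeAnnotation:
    """Stand-in for an annotation object (assumed fields, illustration only)."""
    begin: int
    end: int
    result: str
    metadata: dict = field(default_factory=dict)


def to_rows(annotations):
    """Build (begin, end, label, confidence) rows using the predicted label as key."""
    rows = []
    for ann in annotations:
        # Fall back to NaN if the metadata has no score for the predicted label
        confidence = float(ann.metadata.get(ann.result, "nan"))
        rows.append((ann.begin, ann.end, ann.result, confidence))
    return rows


fake_annotations = [
    FakeAnnotation(
        begin=0,
        end=4047,
        result="form_10k_summary",
        # "other" score is invented; metadata values are kept as strings
        metadata={"form_10k_summary": "0.99994636", "other": "0.00005364"},
    )
]

rows = to_rows(fake_annotations)
print(rows)
```

The same helper would then work unchanged for the other binary classifiers, whose positive labels differ.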
