![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

#🔎 Financial Text Splitting

Text splitting matters for several reasons:

- Financial documents may be **very long** (e.g., filings, annual/quarterly reports), and applying AI to the whole document may take a long time.
- Most Language Models have **token restrictions**. For example, BERT-based models can only process up to 512 tokens, while our biggest LM (Longformer) can process up to 4096 (at a cost to performance).
- If you have models that perform extractions on **specific sections**, it does not make sense to run them over the whole document: you may get a lot of false positives.

But also:
- Language Models only "understand" the **portion of text you send to them**. If you send *too little information*, they may be biased and make wrong decisions.
- On the other hand, sending *too much information* can dilute the relevant information or add noise.

And!
- Make sure the tokenization and splitting mechanism is the same in training and inference, or you will get undesirable results.

For these reasons, the very first thing you need to think about is how to split your texts properly, depending on the kind of task you want to carry out.
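To make the token restriction concrete, here is a minimal sketch in plain Python (it does not use Finance NLP) that cuts a text into windows of at most `max_tokens` tokens. The function name `split_into_windows` and the `overlap` parameter are assumptions made for illustration. Whitespace tokens are only a rough proxy for the subword tokens a BERT-like model actually counts, so a real budget should be more conservative.

```python
from typing import List


def split_into_windows(text: str, max_tokens: int = 512, overlap: int = 0) -> List[str]:
    """Split text into consecutive windows of at most max_tokens whitespace tokens."""
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    if not 0 <= overlap < max_tokens:
        raise ValueError("overlap must satisfy 0 <= overlap < max_tokens")

    # Whitespace tokens only approximate model (subword) tokens.
    tokens = text.split()
    step = max_tokens - overlap
    windows = []
    for start in range(0, len(tokens), step):
        windows.append(" ".join(tokens[start:start + max_tokens]))
        # Stop once a window has reached the end of the text.
        if start + max_tokens >= len(tokens):
            break
    return windows


windows = split_into_windows("w " * 1200, max_tokens=512, overlap=50)
print([len(w.split()) for w in windows])  # [512, 512, 276]
```

The overlap keeps some shared context between neighbouring windows, so a sentence cut at a boundary still appears complete in one of them.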

In [None]:
from johnsnowlabs import *


#✔ 1. Text Classification

#✔ 1.1. Background
Text classification is the NLP task in charge of retrieving a `category` based on a piece of text you send to the model. A good example is Section or Item identification in Financial Documents.

📚There are several ways we can carry out Text Classification:

- `At a whole-text level (no splitting)`: This is not feasible for most financial documents. As we already know, there are token restrictions (512 for most BERT-based transformers). We can use Longformers (4096), but documents such as financial reports are usually much longer than that, so they will exceed any of these limits.

- `Retrieving first page`: In most cases, the relevant information of a document is on the first page. Just by splitting by pages and retrieving the first one, you can do text classification.

- `At paragraph, section or sentence level using Finance NLP TextSplitter`: `TextSplitter` is an NLP annotator meant to split documents into sentences, but it is quite customisable and can retrieve paragraphs or sections as well. It works with `regex` or lists of splitting characters, among other more complex techniques.

- `At paragraph, section or sentence level using NER` for detecting headers and `ChunkSentenceSplitter`

The way you split may totally change the results you are getting. Let's see an example.

#✔ 1.2. Page splitting

Sometimes, pages have patterns which tell you how to split them. In our case, `Table of Contents` was present at the bottom of each page of our documents.

Always look for signals when trying to detect page boundaries. Patterns you can usually find at the bottom of a page:
- Page number
- Bottom placeholders
- Names of people
- Name of the document
- etc.
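The idea of splitting on a footer signal can be sketched with Python's standard `re` module alone. This is not the `TextSplitter` annotator, only an illustration of the principle. The toy document, the helper name `split_pages`, and the assumption that a page number follows the footer are all made up for the example.

```python
import re
from typing import List


def split_pages(text: str, footer_pattern: str) -> List[str]:
    """Split text wherever footer_pattern matches, dropping the footer and any empty pages."""
    pages = re.split(footer_pattern, text)
    return [p.strip() for p in pages if p.strip()]


# Toy document: every page ends with a "Table of Contents" footer and a page number.
toy_document = (
    "UNITED STATES SECURITIES AND EXCHANGE COMMISSION\nFORM 10-K\n"
    "Table of Contents\n1\n"
    "PART I\nItem 1. Business\n"
    "Table of Contents\n2\n"
    "Item 1A. Risk Factors\n"
)

# The optional trailing page number is folded into the footer pattern.
pages = split_pages(toy_document, r"Table of Contents\s*\d*")
for i, page in enumerate(pages):
    print(i, repr(page))
```

Folding the page number into the pattern keeps it from leaking into the start of the next page.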

In [None]:
document_assembler = nlp.DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

text_splitter = finance.TextSplitter() \
    .setInputCols(["document"]) \
    .setOutputCol("pages")\
    .setCustomBounds(["Table of Contents"])\
    .setUseCustomBoundsOnly(True)\
    .setExplodeSentences(True)

nlp_pipeline = nlp.Pipeline(stages=[
    document_assembler,
    text_splitter])

📜Explanation:
- `.setCustomBounds(["Table of Contents"])` sets an array of regular expressions that tell the annotator where to split the document.
- `.setUseCustomBoundsOnly(True)`: by default, `TextSplitter` splits text into sentences, so we tell it to ignore the default regex ('\n', ...) and use only our custom bounds.
- `.setExplodeSentences(True)` creates one new row in the dataframe per split.

Let's download a document and use `whole-text` classifiers.

In [None]:
import requests
URL = "https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/master/finance-nlp/data/cdns-20220101.html.txt"
response = requests.get(URL)

cadence_sec10k = response.content.decode('utf-8')

Let's apply our page splitter.

## Using Fit/Transform (the whole dataframe will be processed in parallel using all the Spark cluster nodes)

In [None]:
#fit: trains, configures and prepares the pipeline for inference. 

sdf = spark.createDataFrame([[ cadence_sec10k ]]).toDF("text")

fit = nlp_pipeline.fit(sdf)

In [None]:
%%time

#transforms: executes inference on a fit pipeline
res = fit.transform(sdf)

res.show()

## Using LightPipelines (everything is executed on a single node. It's much quicker for small dataframes, but does not leverage the cluster's capacity)

In [None]:
%%time
import json
lp = nlp.LightPipeline(fit)

json_res = lp.annotate(cadence_sec10k)

print(json.dumps(json_res, indent=4))

In [None]:
# Keep the first 20 pages
pages = json_res['pages'][:20]

#✔ 1.3. Document Classification using the 1st page
Most documents can be identified by their first page (provided it's a real first page and not a separator, cover page, page with noise, etc.; you can filter those out).
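One possible way to filter out separators and near-empty pages is a simple word-count heuristic. The sketch below is an assumption, not a Finance NLP feature: the helper name `first_content_page` and the `min_words` threshold of 30 are arbitrary and would need tuning on real documents.

```python
from typing import List, Optional


def first_content_page(pages: List[str], min_words: int = 30) -> Optional[str]:
    """Return the first page with at least min_words alphabetic words, or None."""
    for page in pages:
        # Count only tokens containing a letter, so bare page numbers and rules are ignored.
        words = [w for w in page.split() if any(c.isalpha() for c in w)]
        if len(words) >= min_words:
            return page
    return None


# Toy pages: an empty page, a page holding only a page number, and a page with text.
pages_demo = ["", "- 1 -", "Annual report " * 20]
print(first_content_page(pages_demo) == pages_demo[2])  # True
```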

![image.png](/files/FINLEG/dc1.png)

📚We have some text classifiers which can be used to identify different financial reports:
- `finclf_sec_filings_en`: This model allows you to classify documents among a list of specific US Securities and Exchange Commission filings, such as `10-K, 10-Q, 8-K, S-8, 3, 4, Other`. IMPORTANT: This model works with the first 512 tokens of a document, so you don't need to run it on the whole document.
- `finclf_earning_broker_10k_en`: This is a Text Classification model, which can help you identify whether a document is an Earning Call, a Broker Report, a 10K filing or something else. IMPORTANT: This model works with the first 512 tokens of a document, so you don't need to run it on the whole document.
- ...
- Missing something? Let us know and we will train that for you!

In [None]:
documentAssembler = nlp.DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

embeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "en") \
    .setInputCols("document") \
    .setOutputCol("sentence_embeddings")

docClassifier = finance.ClassifierDLModel.pretrained("finclf_earning_broker_10k", "en", "finance/models")\
    .setInputCols(["sentence_embeddings"])\
    .setOutputCol("label")

nlp_pipeline = nlp.Pipeline(stages=[
    documentAssembler,
    embeddings,
    docClassifier])

### Using fit / transform (all Spark Cluster nodes)

In [None]:
# Since the model works with the first 512 tokens, let's just send the first page...
sdf = spark.createDataFrame([[ pages[0] ]]).toDF("text")

fit = nlp_pipeline.fit(sdf)
res = fit.transform(sdf)

# .show() returns None, so we don't reassign `res`
res.select('label.result').show()

### Using Light Pipeline (driver only: quicker, but not scalable to large numbers of documents)

In [None]:
from johnsnowlabs import nlp
import json

lp = nlp.LightPipeline(fit)
json_res = lp.annotate( pages[0] )
print(json.dumps(json_res, indent=4))

### Curious about what would happen with other pages❓

In [None]:
json_res = lp.annotate( [ pages[0], pages[1], pages[5], pages[10], pages[15], pages[19]] )
print(json.dumps(json_res, indent=4))

OK: our classifier, which works on the first 512 tokens, says we have a **10K document** using just the first page.

#✔ 1.4. Paragraph / Section / Items splitting

##📌 TextSplitter
`TextSplitter` can split by sentences, paragraphs, sections, etc.

In most cases:
- paragraphs can be extracted using just `\n\n`.
- sections and items may be delimited by headers and subheaders. We can train DL models to detect them or, if the sections are numbered, use the numbering or other patterns to delimit their boundaries.
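Both ideas can be tried out quickly with Python's standard `re` module before configuring the annotator. This is only a sketch: the shortened sample text and the `ITEM` header regex are assumptions chosen for illustration, and real filings will need a more tolerant pattern.

```python
import re

sample = (
    "ITEM 1. BUSINESS.\nCompany Overview\n\n"
    "ITEM 1A. RISK FACTORS.\nAn investment in our common stock involves risk.\n\n"
    "ITEM 2. PROPERTIES.\nOur corporate headquarters are located in Little Rock."
)

# Paragraphs: split on one or more blank lines.
paragraphs = [p for p in re.split(r"\n\s*\n", sample) if p.strip()]

# Numbered items: find headers such as "ITEM 1." or "ITEM 1A." at the start of a line.
item_header = re.compile(r"^ITEM\s+\d+[A-Z]?\.", flags=re.MULTILINE)
headers = [m.group(0) for m in item_header.finditer(sample)]

print(len(paragraphs))  # 3
print(headers)          # ['ITEM 1.', 'ITEM 1A.', 'ITEM 2.']
```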

In [None]:
example1 = """PART I

ITEM 1. BUSINESS.
Company Overview
Inuvo is a technology company that develops and sells information technology solutions for marketing.

ITEM 1A. RISK FACTORS.
An investment in our common stock involves a significant degree of risk.

ITEM 1B. UNRESOLVED STAFF COMMENTS.
Not applicable to a smaller reporting company.

ITEM 2. PROPERTIES.
Our corporate headquarters are located in Little Rock, Arkansas where we entered into a five-year agreement to lease office space on October 1, 2015 and amended the lease as of February 1, 2021
Angeles, CA, San Jose, CA and Secaucus, NJ.

ITEM 3. LEGAL PROCEEDINGS.
We are not party to any pending legal proceedings.

ITEM 4. MINE SAFETY DISCLOSURES.
Not applicable.

PART II
ITEM 5. MARKET FOR REGISTRANT’S COMMON EQUITY, RELATED STOCKHOLDER MATTERS AND ISSUER PURCHASES OF EQUITY SECURITIES.
Market Information"""

By `\n\n`...

In [None]:
text_splitter = finance.TextSplitter() \
    .setInputCols(["document"]) \
    .setOutputCol("paragraphs")\
    .setCustomBounds(["\n\n"])\
    .setUseCustomBoundsOnly(True)\
    .setExplodeSentences(True)

nlp_pipeline = nlp.Pipeline(stages=[
    document_assembler,
    text_splitter])

sdf = spark.createDataFrame([[ example1 ]]).toDF("text")

fit = nlp_pipeline.fit(sdf)
res = fit.transform(sdf)

# .show() returns None, so we don't reassign `res`
res.select('paragraphs.result').show(truncate = False)

By other regular expressions ...

In [None]:
example2 = """PART I
ITEM 1. BUSINESS.
Company Overview
Inuvo is a technology company that develops and sells information technology solutions for marketing.
ITEM 1A. RISK FACTORS.
An investment in our common stock involves a significant degree of risk.
ITEM 1B. UNRESOLVED STAFF COMMENTS.
Not applicable to a smaller reporting company.
ITEM 2. PROPERTIES.
Our corporate headquarters are located in Little Rock, Arkansas where we entered into a five-year agreement to lease office space on October 1, 2015 and amended the lease as of February 1, 2021
Angeles, CA, San Jose, CA and Secaucus, NJ.
ITEM 3. LEGAL PROCEEDINGS.
We are not party to any pending legal proceedings.
ITEM 4. MINE SAFETY DISCLOSURES.
Not applicable.
PART II
ITEM 5. MARKET FOR REGISTRANT’S COMMON EQUITY, RELATED STOCKHOLDER MATTERS AND ISSUER PURCHASES OF EQUITY SECURITIES.
Market Information"""

In [None]:
text_splitter = finance.TextSplitter() \
    .setInputCols(["document"]) \
    .setOutputCol("paragraphs")\
    .setCustomBounds([r"[A-Z]{4,}[ ]*[0-9]*[A-Z]*\.? ?"])\
    .setUseCustomBoundsOnly(True)\
    .setExplodeSentences(True)

nlp_pipeline = nlp.Pipeline(stages=[
    document_assembler,
    text_splitter])

sdf = spark.createDataFrame([[ example2 ]]).toDF("text")

fit = nlp_pipeline.fit(sdf)
res = fit.transform(sdf)

# .show() returns None, so we don't reassign `res`
res.select('paragraphs.result').show(truncate=False)

BE CAREFUL. By default, the text matched by the splitting pattern is not kept in the output. If you want it to be included, set `setCustomBoundsStrategy` to `prepend` (the match becomes the first part of the next section) or `append` (the match becomes the last part of the previous section).
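The difference between the two strategies can be illustrated with a small plain-Python emulation. This is a sketch of the idea only, not the annotator's implementation: the helper `split_keep_bounds` is hypothetical, and details such as whitespace trimming may differ from what `TextSplitter` actually returns.

```python
import re
from typing import List


def split_keep_bounds(text: str, pattern: str, strategy: str = "prepend") -> List[str]:
    """Split on pattern, attaching each match to the next ('prepend') or previous ('append') piece."""
    if strategy not in ("prepend", "append"):
        raise ValueError("strategy must be 'prepend' or 'append'")
    pieces = []
    cursor = 0
    for m in re.finditer(pattern, text):
        # 'prepend' cuts before the match, 'append' cuts after it.
        cut = m.start() if strategy == "prepend" else m.end()
        if cut > cursor:
            pieces.append(text[cursor:cut])
        cursor = cut
    if cursor < len(text):
        pieces.append(text[cursor:])
    return pieces


text = "ITEM 1. Business overview. ITEM 2. Properties."
pattern = r"ITEM \d+\. "
print(split_keep_bounds(text, pattern, "prepend"))
# ['ITEM 1. Business overview. ', 'ITEM 2. Properties.']
print(split_keep_bounds(text, pattern, "append"))
# ['ITEM 1. ', 'Business overview. ITEM 2. ', 'Properties.']
```

For header-like patterns such as item titles, `prepend` is usually the natural choice, because the header then stays with the section it introduces.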

In [None]:
text_splitter = finance.TextSplitter() \
    .setInputCols(["document"]) \
    .setOutputCol("paragraphs")\
    .setCustomBounds([r"[A-Z]{4,}[ ]*[0-9]*[A-Z]*\.? ?"])\
    .setUseCustomBoundsOnly(True)\
    .setExplodeSentences(True)\
    .setCustomBoundsStrategy('prepend')

nlp_pipeline = nlp.Pipeline(stages=[
    document_assembler,
    text_splitter])

sdf = spark.createDataFrame([[ example2 ]]).toDF("text")

fit = nlp_pipeline.fit(sdf)
res = fit.transform(sdf)


# .show() returns None, so we don't reassign `res`
res.select('paragraphs.result').show(truncate=False)

##📌 Using NER and ChunkSentenceSplitter

... using pretrained NER models for `Headers and Subheaders`

In [None]:
documentAssembler = nlp.DocumentAssembler()\
        .setInputCol("text")\
        .setOutputCol("document")

sentenceDetector = nlp.SentenceDetectorDLModel.pretrained("sentence_detector_dl","xx")\
        .setInputCols(["document"])\
        .setOutputCol("sentence")

tokenizer = nlp.Tokenizer()\
        .setInputCols(["sentence"])\
        .setOutputCol("token")

embeddings = nlp.BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en") \
        .setInputCols(["sentence", "token"]) \
        .setOutputCol("embeddings")

ner_model = finance.NerModel.pretrained("finner_headers", "en", "finance/models")\
        .setInputCols(["sentence", "token", "embeddings"])\
        .setOutputCol("ner")

ner_converter = finance.NerConverterInternal()\
        .setInputCols(["sentence","token","ner"])\
        .setOutputCol("ner_chunk")

nlp_pipeline = nlp.Pipeline(stages=[
        documentAssembler,
        sentenceDetector,
        tokenizer,
        embeddings,
        ner_model,
        ner_converter])

In [None]:
sdf = spark.createDataFrame([[ example2 ]]).toDF("text")

fit = nlp_pipeline.fit(sdf)

lp = nlp.LightPipeline(fit)

In [None]:
res = lp.fullAnnotate(example2)

In [None]:
from johnsnowlabs import viz

ner_viz = viz.NerVisualizer()

displayHTML(ner_viz.display(res[0], label_col='ner_chunk', return_html = True))

In [None]:
ner_res = res[0]['ner_chunk'] # Document 0

# Character offsets (begin, end) of each detected header; `end` is inclusive
sections = [(ann.begin, ann.end) for ann in ner_res]
sections

In [None]:
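section_texts = []
last_section = 0        # offset where the body of the current section starts
last_section_name = ""  # header text of the current section
for begin, end in sections:
    # Text between the previous header and this one belongs to the previous section
    t = last_section_name + example2[last_section:begin]
    if t != '':
        section_texts.append(t)
    last_section = end + 1  # `end` is inclusive
    last_section_name = example2[begin:end + 1]
# The last section runs from the last header to the end of the document
section_texts.append(last_section_name + example2[last_section:])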


In [None]:
OKGREEN = '\033[92m'
ENDC = '\033[0m'

In [None]:
for t in section_texts:
  print(f"{OKGREEN}SECTION:{ENDC}\n{t}")

... or automatically using `ChunkSentenceSplitter`...

In [None]:
chunkSentenceSplitter = finance.ChunkSentenceSplitter()\
    .setInputCols("document","ner_chunk")\
    .setOutputCol("paragraphs")\
    .setGroupBySentences(False)

    
nlp_pipeline = nlp.Pipeline(stages=[
        documentAssembler,
        sentenceDetector,
        tokenizer,
        embeddings,
        ner_model,
        ner_converter,
        chunkSentenceSplitter])


paragraphs = nlp_pipeline.fit(sdf).transform(sdf)

In [None]:
import pandas as pd
pd.set_option('display.max_colwidth', None)
result_df = paragraphs.selectExpr("explode(paragraphs) as result").selectExpr("result.result","result.metadata.entity").toPandas()
result_df.head()

| | result | entity |
|---|---|---|
| 0 | `PART I\n` | HEADER |
| 1 | `ITEM 1. BUSINESS.\n` | HEADER |
| 2 | `Company Overview\nInuvo is a technology company that develops and sells information technology solutions for marketing.\n` | SUBHEADER |
| 3 | `ITEM 1A.` | SUBHEADER |
| 4 | `RISK FACTORS.\nAn investment in our common stock involves a significant degree of risk.\n` | SUBHEADER |


#✔ 1.5. Paragraph Classification and the consequences of splitting

![image.png](/files/FINLEG/pc1.png)

In Finance NLP we have several Text Classifiers to detect some useful sections of a 10K (or 10Q) filing. For example:
```
- finclf_acquisitions_item
- finclf_business_item
- finclf_equity_item
- finclf_exhibits_item
- finclf_executives_item
- finclf_properties_item
- finclf_work_experience_item
- finclf_controls_procedures_item
- finclf_security_ownership_item
- finclf_executives_compensation_item
- finclf_financial_statements_item
- finclf_market_risk_item
- finclf_financial_conditions_item
- finclf_legal_proceedings_item
- finclf_risk_factors_item
```
All of these models are binary classifiers. Rather than `True` or `False`, they return the name of the class (for example, `acquisitions`) if the class is detected, or `other` if the text is something else.

CLARIFICATION: We did not want to return `True` or `False` because these models can be stacked one after another. If a clause triggers several classes at the same time (they may not be disjoint), we would get several `True` values, which is less informative than directly retrieving the names of the detected classes (`acquisitions` and `properties`, for example).
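When several of these classifiers are run on the same text, their outputs can be merged by keeping every label that is not `other`. The sketch below uses plain Python. The helper `merge_binary_labels` is hypothetical, and the prediction dictionary is made-up input standing in for real model outputs.

```python
from typing import Dict, List


def merge_binary_labels(predictions: Dict[str, str], negative_label: str = "other") -> List[str]:
    """Keep the class names returned by the stacked binary classifiers, dropping the negative label."""
    return sorted({label for label in predictions.values() if label != negative_label})


# Hypothetical outputs of three classifiers run on the same paragraph.
predictions = {
    "finclf_acquisitions_item": "acquisitions",
    "finclf_properties_item": "properties",
    "finclf_risk_factors_item": "other",
}
print(merge_binary_labels(predictions))  # ['acquisitions', 'properties']
```

Because each model returns its own class name, the merged list is self-explanatory and there is no need to track which `True` came from which classifier.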

##📌 Example of how splitting affects paragraph classification

In [None]:
document_assembler = nlp.DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

text_splitter = finance.TextSplitter() \
    .setInputCols(["document"]) \
    .setOutputCol("pages")\
    .setCustomBounds(["Table of Contents"])\
    .setUseCustomBoundsOnly(True)\
    .setExplodeSentences(True)

nlp_pipeline = nlp.Pipeline(stages=[
    document_assembler,
    text_splitter])

fit = nlp_pipeline.fit(spark.createDataFrame([[ cadence_sec10k ]]).toDF("text"))

In [None]:
lp = nlp.LightPipeline(fit)

In [None]:
res = lp.annotate(cadence_sec10k)
pages = res['pages']
pages = [p for p in pages if p.strip() != ''] # We remove empty pages

In [None]:
candidates = [[pages[4]], [pages[84]], [pages[85]], [pages[86]], [pages[87]]]

In [None]:
document_assembler = nlp.DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

use_embeddings = nlp.UniversalSentenceEncoder.pretrained() \
    .setInputCols("document") \
    .setOutputCol("sentence_embeddings")

doc_classifier = finance.ClassifierDLModel.pretrained('finclf_work_experience_item', "en", "finance/models")\
    .setInputCols(["sentence_embeddings"])\
    .setOutputCol("category")

nlp_pipeline = nlp.Pipeline(stages=[
    document_assembler, 
    use_embeddings,
    doc_classifier])

In [None]:
df = spark.createDataFrame(candidates).toDF("text")

model = nlp_pipeline.fit(df)

result = model.transform(df)
result.select('category.result').show()

Page 86 has some information about people and their roles...

In [None]:
print(pages[86][-800:])

But page 4 too!

In [None]:
print(pages[4][-800:])

What happened?

In [None]:
pages[4]

![image.png](/files/FINLEG/pc2.png)

Predicting at page level added too much information, and the part about the CEO got diluted by the rest of the page!

Token size restrictions may also have played a part!

SOLUTION: A finer level of granularity

In [None]:
document_assembler = nlp.DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

text_splitter = finance.TextSplitter() \
    .setInputCols(["document"]) \
    .setOutputCol("paragraphs")

nlp_pipeline = nlp.Pipeline(stages=[
    document_assembler, 
    text_splitter])

In [None]:
empty_data = spark.createDataFrame([[""]]).toDF("text")
text_splitting_pipe = nlp_pipeline.fit(empty_data)
text_splitting_lightpipe = nlp.LightPipeline(text_splitting_pipe)

In [None]:
res = text_splitting_lightpipe.annotate(pages[4])
paragraphs = res['paragraphs']
paragraphs = [p for p in paragraphs if p.strip() != ''] # We remove empty paragraphs

In [None]:
candidates = [[x] for x in paragraphs]

In [None]:
document_assembler = nlp.DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

use_embeddings = nlp.UniversalSentenceEncoder.pretrained() \
    .setInputCols("document") \
    .setOutputCol("sentence_embeddings")

doc_classifier = finance.ClassifierDLModel.pretrained('finclf_work_experience_item', "en", "finance/models")\
    .setInputCols(["sentence_embeddings"])\
    .setOutputCol("category")

nlp_pipeline = nlp.Pipeline(stages=[
    document_assembler, 
    use_embeddings,
    doc_classifier])

In [None]:
df = spark.createDataFrame(candidates).toDF("text")
model = nlp_pipeline.fit(df)
result = model.transform(df)
result.select('category.result').show()
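Once predictions exist at paragraph level, it helps to pair each paragraph with its label and keep only the ones that triggered the classifier. The following plain-Python sketch assumes the paragraphs and labels have already been collected into two parallel lists. The helper `matching_paragraphs`, the label name `work_experience`, and the toy texts are all assumptions made for illustration.

```python
from typing import List, Tuple


def matching_paragraphs(
    paragraphs: List[str], labels: List[str], negative_label: str = "other"
) -> List[Tuple[int, str, str]]:
    """Return (index, label, paragraph) for every paragraph whose label is not the negative one."""
    if len(paragraphs) != len(labels):
        raise ValueError("paragraphs and labels must have the same length")
    return [
        (i, label, text)
        for i, (text, label) in enumerate(zip(paragraphs, labels))
        if label != negative_label
    ]


# Toy inputs standing in for the paragraphs of a page and their predicted labels.
toy_paragraphs = [
    "Our products are sold worldwide.",
    "Jane Doe has served as Chief Executive Officer since 2015.",
    "Revenue increased compared to the prior year.",
]
toy_labels = ["other", "work_experience", "other"]

hits = matching_paragraphs(toy_paragraphs, toy_labels)
print(hits)
```

Keeping the paragraph index makes it easy to map a hit back to its position on the original page.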