![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)


[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/finance-nlp/15.0.Financial_Text_Generation.ipynb)


# **Financial Text Generation**

The Financial Text Generator uses the Flan-T5 base model to perform a range of tasks related to financial text generation. With these models, a user can provide a prompt and context and instruct the system to perform a finance-specific task. Flan-T5 is an instruction-tuned version of the original T5 model, designed to produce higher-quality and more coherent text. It is trained on a large dataset of diverse texts and can generate summaries of articles, documents, and other text-based inputs.


Available models can be found at the [Models Hub](https://nlp.johnsnowlabs.com/models?edition=Finance+NLP&task=Text+Generation).


# Colab Setup

To run this notebook yourself, you need to upload your license keys. Run the cell below to do that. Alternatively, open the file explorer on the left side of the screen and upload `license_keys.json` to the folder that opens.
Otherwise, you can look at the example outputs at the bottom of the notebook.

In [None]:
# Install the johnsnowlabs library to access Spark-OCR and Spark-NLP for Healthcare, Finance, and Legal.
! pip install -q johnsnowlabs

In [None]:
from google.colab import files
print('Please Upload your John Snow Labs License using the button below')
license_keys = files.upload()
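
Before installing, it can help to confirm that the uploaded file is readable JSON. The sketch below is a minimal check, assuming `files.upload()` returns a dict that maps file names to raw bytes and that the license file holds a JSON object. `parse_uploaded_license` and the example key are hypothetical and not part of the johnsnowlabs library.

```python
import json


def parse_uploaded_license(uploaded):
    """Return the parsed JSON content of the first uploaded file.

    `uploaded` is assumed to map file names to raw bytes, which is the
    shape returned by google.colab's files.upload().
    """
    if not uploaded:
        raise ValueError("No file was uploaded")
    name, raw = next(iter(uploaded.items()))
    try:
        parsed = json.loads(raw.decode("utf-8"))
    except (UnicodeDecodeError, json.JSONDecodeError) as err:
        raise ValueError(f"{name} is not valid JSON: {err}") from err
    if not isinstance(parsed, dict):
        raise ValueError(f"{name} should contain a JSON object")
    return parsed


# Stand-in for the dict returned by files.upload(); the key name is made up.
example_upload = {"license_keys.json": b'{"EXAMPLE_KEY": "example-value"}'}
print(sorted(parse_uploaded_license(example_upload)))
```

In the notebook itself, the call would be `parse_uploaded_license(license_keys)`.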

In [None]:
from johnsnowlabs import nlp, finance
# After uploading your license, run this to install all licensed Python wheels and pre-download the JARs for the Spark session JVM
nlp.install()

In [None]:
from johnsnowlabs import nlp, finance
import pandas as pd

# Automatically load license data and start a session with all JARs the user has access to
spark = nlp.start()

👌 Detected license file /content/spark_nlp_for_healthcare_spark_ocr_7162 (7).json
👌 Launched cpu optimized session with with: 🚀Spark-NLP==4.4.1, 💊Spark-Healthcare==4.4.3, running on ⚡ PySpark==3.1.2


In [None]:
from pyspark.sql import DataFrame
import pyspark.sql.functions as F
import pyspark.sql.types as T
import pyspark.sql as SQL
from pyspark import keyword_only

# Text Generation Models

<div align="center">

| **Index** | **Text Generator Models**        |
|---------------|----------------------|
| 1        |  [fingen_flant5_base](https://nlp.johnsnowlabs.com/2023/04/21/fingen_flant5_base_en.html)     |
| 2      | [fingen_flant5_finetuned_sec10k](https://nlp.johnsnowlabs.com/2023/04/28/fingen_flant5_finetuned_sec10k_en.html)    |
| 3      | [fingen_flant5_finetuned_alpaca](https://nlp.johnsnowlabs.com/2023/05/25/fingen_flant5_finetuned_alpaca_en.html)    |
| 4      | [fingen_flant5_finetuned_fiqa](https://nlp.johnsnowlabs.com/2023/05/29/fingen_flant5_finetuned_fiqa_en.html)    |


</div>

## **fingen_flant5_base**

This model is a modified version of the Flan-T5 text generation model (an LLM), fine-tuned by John Snow Labs on natural instruction datasets. Given a short prompt, it can generate human-like, conceptually meaningful text of up to 512 tokens from an input text of at most 1024 tokens.
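
The 1024-token input limit is counted in model tokens, which this notebook does not expose directly. As a rough pre-check, the sketch below counts whitespace-separated words instead. This is only an approximation: a tokenizer usually splits text into at least as many tokens as words, so a prompt that fails this check will almost certainly be too long, while one that passes may still be. The helper names are hypothetical.

```python
MAX_INPUT_TOKENS = 1024  # input limit quoted in the model description


def approx_token_count(text):
    """Rough proxy for token count: number of whitespace-separated words."""
    return len(text.split())


def exceeds_input_limit(text, limit=MAX_INPUT_TOKENS):
    """True when even the word count is already above the limit."""
    return approx_token_count(text) > limit


short_prompt = "Explain what is Sec 10-k filing"
long_prompt = "revenue " * 2000  # artificial over-long prompt

print(approx_token_count(short_prompt))   # 6
print(exceeds_input_limit(short_prompt))  # False
print(exceeds_input_limit(long_prompt))   # True
```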

In [None]:
document_assembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("prompt")

flant5 = finance.TextGenerator.pretrained("fingen_flant5_base","en","finance/models")\
    .setInputCols(["prompt"])\
    .setOutputCol("generated_text")\
    .setMaxNewTokens(150)\
    .setStopAtEos(True)

pipeline = nlp.Pipeline(stages=[document_assembler, flant5])

model = pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))


fingen_flant5_base download started this may take some time.
[OK!]


In [None]:
data = spark.createDataFrame([[1, "Explain what is Sec 10-k filing"]]).toDF('id', 'text')

result = model.transform(data)

result.select("id", "text", "generated_text.result").show(truncate=False)

+---+-------------------------------+--------------------------------------------------------------------------------------------------------------------+
|id |text                           |result                                                                                                              |
+---+-------------------------------+--------------------------------------------------------------------------------------------------------------------+
|1  |Explain what is Sec 10-k filing|[Sec 10k filing is a form of tax filing that requires a party to file jointly or several entities for tax purposes.]|
+---+-------------------------------+--------------------------------------------------------------------------------------------------------------------+
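
The cell above sends a single prompt through the pipeline. To run several prompts in one pass, the input rows can be built from a plain Python list first. This is a minimal sketch: `build_prompt_rows` is a hypothetical helper, and the second prompt is an invented example.

```python
def build_prompt_rows(prompts, start_id=1):
    """Pair each prompt with a sequential id, matching the ('id', 'text') layout."""
    return [[i, prompt] for i, prompt in enumerate(prompts, start=start_id)]


prompts = [
    "Explain what is Sec 10-k filing",
    "Explain what is a balance sheet",  # invented example prompt
]

rows = build_prompt_rows(prompts)
print(rows)
```

In the notebook session, `spark.createDataFrame(rows).toDF('id', 'text')` followed by `model.transform(...)` would then generate one answer per row.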



## **fingen_flant5_finetuned_sec10k**

The `fingen_flant5_finetuned_sec10k` model is a Flan-T5 model fine-tuned on SEC filings data. Flan-T5 is a language model developed by Google that uses the T5 architecture for text generation tasks.

In [None]:
document_assembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("prompt")

flant5 = finance.TextGenerator.pretrained("fingen_flant5_finetuned_sec10k", "en", "finance/models")\
    .setInputCols(["prompt"])\
    .setOutputCol("generated_text")\
    .setMaxNewTokens(256)\
    .setNoRepeatNgramSize(3)\
    .setTopK(3)\
    .setRandomSeed(42)\
    .setStopAtEos(True)

pipeline = nlp.Pipeline(stages=[document_assembler, flant5])

model = pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))

fingen_flant5_finetuned_sec10k download started this may take some time.
[OK!]


In [None]:
data = spark.createDataFrame(
[[1, 
"""The cumulative spend under management metric presented above does not directly correlate to our revenue or results of operations because we do not generally charge our customers based on actual usage of our core platform"""
]]).toDF('id', 'text')

result = model.transform(data)

result.select("id", "text", "generated_text.result").show(truncate=False)


### **Using LightPipeline**

In [None]:
text = ["""The cumulative spend under management metric presented above does not directly correlate to our revenue or results of operations because we do not generally charge our customers based on actual usage of our core platform"""]

light_model = nlp.LightPipeline(model)

light_result = light_model.annotate(text)

light_result

[{'prompt': ['The cumulative spend under management metric presented above does not directly correlate to our revenue or results of operations because we do not generally charge our customers based on actual usage of our core platform'],
  'generated_text': ['We do not have any significant revenue or results of operations based on the cumulative spend under management metric presented above, which is not directly related to our revenue or revenues, as we do not generally charge our customers based upon actual usage of our core platform we do have a significant revenue and results of operation based in part on our revenue from our customers using our core platforms, which are primarily based primarily on our subscriptions and subscriptions to our platform, which we believe are a reasonable estimate of our revenue and revenues for each of the three years in the period ended december 31, 2020, we had approximately $13 5 million of revenue from subscriptions, which was primarily due to our

In [None]:
import textwrap

# Wrap both texts to 120 characters; the names avoid shadowing the built-in input()
input_text = textwrap.fill(light_result[0]['prompt'][0], width=120)

output_text = textwrap.fill(light_result[0]['generated_text'][0], width=120)

print("➤ Input: \n{}".format(input_text))
print("\n")
print("➤ Output: \n{}".format(output_text))
print("\n")

➤ Input: 
The cumulative spend under management metric presented above does not directly correlate to our revenue or results of
operations because we do not generally charge our customers based on actual usage of our core platform


➤ Output: 
We do not have any significant revenue or results of operations based on the cumulative spend under management metric
presented above, which is not directly related to our revenue or revenues, as we do not generally charge our customers
based upon actual usage of our core platform we do have a significant revenue and results of operation based in part on
our revenue from our customers using our core platforms, which are primarily based primarily on our subscriptions and
subscriptions to our platform, which we believe are a reasonable estimate of our revenue and revenues for each of the
three years in the period ended december 31, 2020, we had approximately $13 5 million of revenue from subscriptions,
which was primarily due to our subscription re
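
The same formatting can be applied to every item returned by `annotate` when more than one text is passed in. The sketch below assumes each result is a dict with `prompt` and `generated_text` lists, as in the output shown earlier. `format_generation` is a hypothetical helper and `sample_results` is made-up stand-in data.

```python
import textwrap


def format_generation(result, width=120):
    """Format one annotate() result as wrapped input and output text."""
    prompt = " ".join(result.get("prompt", []))
    generated = " ".join(result.get("generated_text", []))
    return "➤ Input: \n{}\n\n➤ Output: \n{}\n".format(
        textwrap.fill(prompt, width=width),
        textwrap.fill(generated, width=width),
    )


# Made-up stand-in for the list returned by light_model.annotate(texts)
sample_results = [
    {"prompt": ["First prompt"], "generated_text": ["First answer"]},
    {"prompt": ["Second prompt"], "generated_text": ["Second answer"]},
]

for item in sample_results:
    print(format_generation(item))
```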