![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)


[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/finance-nlp/14.0.Financial_Text_Generation.ipynb)


# **Financial Text Generation**

The Financial Text Generator uses a Flan-T5 model to perform tasks related to financial text generation and abstraction. With these models, a user can provide a prompt and context and instruct the system to perform a finance-specific task. Flan-T5 is an instruction-fine-tuned version of the original T5 model, designed to produce higher-quality, more coherent text. It is trained on a large dataset of diverse texts and can generate summaries of articles, documents, and other text-based inputs.


Available models can be found at the [Models Hub](https://nlp.johnsnowlabs.com/models?edition=Finance+NLP&task=Text+Generation).


# Colab Setup

To run this notebook yourself, you need to upload your license keys. Run the cell below to do that. Alternatively, open the file explorer on the left side of the screen and upload `license_keys.json` to the folder that opens.
Otherwise, you can look at the example outputs at the bottom of the notebook.

In [None]:
# Install the johnsnowlabs library to access Spark-OCR and Spark-NLP for Healthcare, Finance, and Legal.
! pip install -q johnsnowlabs

In [None]:
from google.colab import files
print('Please Upload your John Snow Labs License using the button below')
license_keys = files.upload()
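
Before moving on, it can help to confirm that the uploaded file is readable JSON. The snippet below is a minimal sketch, not part of the johnsnowlabs library: `parse_uploaded_license` is a hypothetical helper, and it assumes the upload result maps file names to raw bytes, as `files.upload()` does in Colab. The key name in the example is a placeholder, not a real license field.

```python
import json


def parse_uploaded_license(uploaded):
    """Return the parsed contents of the first uploaded .json file.

    `uploaded` is assumed to map file names to raw bytes, which is the
    shape returned by google.colab's files.upload().
    """
    for name, content in uploaded.items():
        if name.endswith(".json"):
            parsed = json.loads(content.decode("utf-8"))
            if not isinstance(parsed, dict):
                raise ValueError(f"{name} does not contain a JSON object")
            return parsed
    raise ValueError("No .json file found in the upload")


# Stand-in for the dict returned by files.upload(); the key is a placeholder.
fake_upload = {"license_keys.json": b'{"EXAMPLE_KEY": "value"}'}
license_data = parse_uploaded_license(fake_upload)
print(sorted(license_data))
```

In the notebook itself, the same call would take `license_keys` from the cell above instead of `fake_upload`.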

In [None]:
from johnsnowlabs import nlp, finance
# After uploading your license, run this to install all licensed Python wheels and pre-download the Jars for the Spark Session JVM
nlp.install()

In [None]:
from johnsnowlabs import nlp, finance
import pandas as pd

# Automatically load license data and start a session with all jars user has access to
spark = nlp.start()

In [5]:
from pyspark.sql import DataFrame
import pyspark.sql.functions as F
import pyspark.sql.types as T
import pyspark.sql as SQL
from pyspark import keyword_only

# Text Generation Models

<div align="center">

| **Index** | **Text Generator Models**        |
|---------------|----------------------|
| 1        |  [fingen_flant5_base](https://nlp.johnsnowlabs.com/2023/04/21/fingen_flant5_base_en.html)     |
| 2      | [fingen_flant5_finetuned_sec10k](https://nlp.johnsnowlabs.com/2023/04/28/fingen_flant5_finetuned_sec10k_en.html)    |
| 3      | [fingen_flant5_finetuned_alpaca](https://nlp.johnsnowlabs.com/2023/05/25/fingen_flant5_finetuned_alpaca_en.html)    |
| 4      | [fingen_flant5_finetuned_fiqa](https://nlp.johnsnowlabs.com/2023/05/29/fingen_flant5_finetuned_fiqa_en.html)    |


</div>

## **fingen_flant5_base**

This model is a text generation model based on Flan-T5 (an LLM) and fine-tuned by John Snow Labs on natural instruction datasets. Given a few tokens as an intro, it can generate human-like, conceptually meaningful text of up to 512 tokens from an input text of at most 1024 tokens.
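
Because of the 1024-token input limit mentioned above, very long prompts may need trimming before they reach the model. The following is a rough sketch only: `truncate_to_budget` is a hypothetical helper, and the `tokens_per_word` ratio is an assumed heuristic, since the model's real tokenizer splits text into subword units that do not map one-to-one to whitespace-separated words.

```python
def truncate_to_budget(text, max_tokens=1024, tokens_per_word=1.3):
    """Roughly trim `text` so its estimated token count stays within `max_tokens`.

    The estimate uses whitespace-separated words and an assumed
    tokens-per-word ratio; it is not the model's actual tokenizer.
    """
    words = text.split()
    max_words = int(max_tokens / tokens_per_word)
    if len(words) <= max_words:
        return text
    # Keep the leading words; the tail is dropped.
    return " ".join(words[:max_words])


long_prompt = "revenue " * 2000
trimmed = truncate_to_budget(long_prompt)
print(len(long_prompt.split()), "->", len(trimmed.split()))
```

A prompt that already fits is returned unchanged, so the helper can be applied to every input without special-casing.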

In [6]:
document_assembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("prompt")

flant5 = finance.TextGenerator.pretrained("fingen_flant5_base","en","finance/models")\
    .setInputCols(["prompt"])\
    .setOutputCol("generated_text")\
    .setMaxNewTokens(150)\
    .setStopAtEos(True)

pipeline = nlp.Pipeline(stages=[document_assembler, flant5])

model = pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))


fingen_flant5_base download started this may take some time.
[OK!]


In [7]:
data = spark.createDataFrame([[1, "Explain what is Sec 10-k filing"]]).toDF("id", "text")

result = model.transform(data)

result.select("generated_text.result").show(truncate=False)

+--------------------------------------------------------------------------------------------------------------------+
|result                                                                                                              |
+--------------------------------------------------------------------------------------------------------------------+
|[Sec 10k filing is a form of tax filing that requires a party to file jointly or several entities for tax purposes.]|
+--------------------------------------------------------------------------------------------------------------------+



## **fingen_flant5_finetuned_sec10k**

The `fingen_flant5_finetuned_sec10k` model is a Flan-T5 model fine-tuned on SEC filings data. Flan-T5 is a language model developed by Google AI that uses the T5 architecture for text generation tasks.

In [None]:
document_assembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("prompt")

flant5 = finance.TextGenerator.pretrained('fingen_flant5_finetuned_sec10k','en','finance/models')\
    .setInputCols(["prompt"])\
    .setOutputCol("generated_text")\
    .setMaxNewTokens(256)\
    .setDoSample(True)\
    .setTopK(3)\
    .setRandomSeed(42)\
    .setStopAtEos(True)
  
pipeline = nlp.Pipeline(stages=[document_assembler, flant5])

model = pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))
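
The cell above turns on sampling with `setDoSample(True)`, restricts it with `setTopK(3)`, and fixes the seed with `setRandomSeed(42)`. The sketch below illustrates the general idea of top-k sampling on toy scores, assuming that is what these settings control. It is a generic illustration in plain Python, not the library's internal implementation, and `top_k_sample` is a hypothetical name.

```python
import math
import random


def top_k_sample(logits, k, rng):
    """Sample one index from the k highest-scoring entries of `logits`."""
    # Keep only the indices of the k largest logits.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Softmax over the surviving logits (shifted by the max for stability).
    m = max(logits[i] for i in top)
    weights = [math.exp(logits[i] - m) for i in top]
    # Draw one index in proportion to its renormalized probability.
    return rng.choices(top, weights=weights, k=1)[0]


logits = [2.0, 0.5, 1.0, -1.0, 3.0]  # toy scores for a 5-token vocabulary
rng = random.Random(42)              # fixed seed, mirroring setRandomSeed(42)
samples = [top_k_sample(logits, k=3, rng=rng) for _ in range(10)]
print(samples)  # every value is one of the three highest-scoring indices: 4, 0, 2
```

With a small `k`, low-scoring tokens can never be chosen, which keeps output focused while still allowing some variation. Fixing the seed makes the draws repeatable across runs.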

In [15]:
data = spark.createDataFrame(
[[1,
 """Deferred revenue primarily consists of customer billings or payments received in advance of revenues being recognized from the company’s subscription and services contracts"""
]]
).toDF('id', 'text')

result = model.transform(data)

result.select("generated_text.result").show(truncate=False)


### **Using LightPipeline**

In [17]:
text = ["""Deferred revenue primarily consists of customer billings or payments received in advance of revenues being recognized from the company’s subscription and services contracts"""]

light_model = nlp.LightPipeline(model)

light_result = light_model.annotate(text)

light_result

[{'prompt': ['Deferred revenue primarily consists of customer billings or payments received in advance of revenues being recognized from the company’s subscription and services contracts'],
  'generated_text': ['The company’s deferred revenue is recognized ratably over the term of the contract, which is generally one year or less, based on the estimated useful lives of the customer and the expected life of the customer’s subscription or services contract, and the estimated useful lives of the customer’s subscription or services contract, if any, if the company determines that the estimated useful lives of the customer’s subscription or services contract are less than the estimated useful lives of the customer’s subscription or services contract, the company recognizes revenue ratably over the term of the contract, which is generally one year or less, based on the estimated useful lives of the customer’s subscription or services contract, if the company determines that the estimated use

In [18]:
import textwrap

document_text = textwrap.fill(light_result[0]['prompt'][0], width=120)

summary_text = textwrap.fill(light_result[0]['generated_text'][0], width=120)

print("➤ Input: \n{}".format(document_text))
print("\n")
print("➤ Output: \n{}".format(summary_text))
print("\n")

➤ Input: 
Deferred revenue primarily consists of customer billings or payments received in advance of revenues being recognized
from the company’s subscription and services contracts


➤ Output: 
The company’s deferred revenue is recognized ratably over the term of the contract, which is generally one year or less,
based on the estimated useful lives of the customer and the expected life of the customer’s subscription or services
contract, and the estimated useful lives of the customer’s subscription or services contract, if any, if the company
determines that the estimated useful lives of the customer’s subscription or services contract are less than the
estimated useful lives of the customer’s subscription or services contract, the company recognizes revenue ratably over
the term of the contract, which is generally one year or less, based on the estimated useful lives of the customer’s
subscription or services contract, if the company determines that the estimated useful lives of the
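
The wrapping steps in the cell above can be folded into a small reusable helper. This is a minimal sketch: `format_generation` is a hypothetical function, and it assumes each result is a dict with `prompt` and `generated_text` lists, matching the `annotate` output shown earlier. The example dict uses placeholder text rather than real model output.

```python
import textwrap


def format_generation(result, width=120):
    """Return a wrapped, printable view of one LightPipeline result dict."""
    prompt = textwrap.fill(result["prompt"][0], width=width)
    generated = textwrap.fill(result["generated_text"][0], width=width)
    return "➤ Input: \n{}\n\n➤ Output: \n{}".format(prompt, generated)


# Placeholder result with the same keys as the annotate() output above.
example = {
    "prompt": ["Deferred revenue primarily consists of customer billings " * 3],
    "generated_text": ["(model output goes here) " * 10],
}
formatted = format_generation(example, width=60)
print(formatted)
```

In the notebook, `format_generation(light_result[0])` would produce the same layout as the two `print` calls in the previous cell.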