![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/finance-nlp/12.1.Financial_Summarization.ipynb)

# 🎬 Installation

In [None]:
! pip install -q johnsnowlabs

## 🔗 Automatic Installation
Using my.johnsnowlabs.com SSO

In [2]:
from johnsnowlabs import nlp, finance, legal

nlp.install(refresh_install=True, visual=True, force_browser = True)

## 🔗 Manual downloading
If you are not registered at my.johnsnowlabs.com, received your license by email, or are using Safari, you may need to install the license manually.

- Go to my.johnsnowlabs.com
- Download your license
- Upload it using the following command

In [None]:
from google.colab import files
print('Please Upload your John Snow Labs License using the button below')
license_keys = files.upload()

- Install it
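
The install step has no code in this notebook, so the following is only a minimal sketch of one way to do it. It assumes `license_keys` has the shape returned by `files.upload()` (a dict mapping file names to raw bytes). The helper `save_uploaded_license` is hypothetical and not part of the library, and the `json_license_path` keyword in the commented call is an assumption to verify against your installed `johnsnowlabs` version.

```python
import json
from pathlib import Path


def save_uploaded_license(uploaded: dict, target_dir: str = ".") -> str:
    """Write the first uploaded .json file to disk and return its path.

    `uploaded` maps file names to raw bytes (the shape assumed for the
    result of google.colab's files.upload()).
    """
    for name, content in uploaded.items():
        if name.lower().endswith(".json"):
            json.loads(content)  # fail early if the file is not valid JSON
            path = Path(target_dir) / Path(name).name
            path.write_bytes(content)
            return str(path)
    raise ValueError("No .json license file found in the upload")


# In the notebook (needs the johnsnowlabs package and a valid license):
# license_path = save_uploaded_license(license_keys)
# nlp.install(json_license_path=license_path)  # assumed keyword argument
```

Validating the JSON before writing it surfaces a wrong or corrupted upload here rather than later, when the session is started.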

# 📌 Starting

In [3]:
spark = nlp.start()

👌 Launched cpu optimized session with with: 🚀Spark-NLP==4.4.0, 💊Spark-Healthcare==4.4.0, running on ⚡ PySpark==3.1.2


# 🔎 Financial Summarization

📜Explanation:

Financial Summarization is the process of generating a concise and informative summary of financial documents, such as annual reports, financial statements, earnings transcripts, and news articles related to finance. John Snow Labs, a leading provider of natural language processing tools and technologies, offers a Financial Summarization solution that utilizes state-of-the-art deep learning algorithms to automatically extract and summarize key information from financial texts.

By using our new Financial `Summarizer()` module, you can get concise versions of your financial documents that preserve their key information.

Two models are included for Financial Summarization:

  - **Financial FLAN-T5 Summarization (Base):** The base model, with general-purpose capabilities for summarizing financial documents.
  - **Financial Finetuned FLAN-T5 Summarization (SEC 10-K Filings):** A model fine-tuned specifically to summarize sections of financial reports. It was obtained by fine-tuning the base model on more than 8K sections from different SEC financial reports.
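
As a rough illustration of choosing between the two models, the sketch below maps a document-type label to a pretrained model name. The model names are the ones used later in this notebook. The helper `pick_summarizer_model` and the accepted labels are hypothetical and only serve the example.

```python
# Model names as used in this notebook
BASE_MODEL = "finsum_flant5_base"
SEC10K_MODEL = "finsum_flant5_finetuned_sec10k"


def pick_summarizer_model(doc_type: str) -> str:
    """Return a pretrained model name for a (hypothetical) document-type label."""
    # Normalize labels such as "SEC 10-K" or "Form 10-K" to "sec10k" / "form10k"
    normalized = doc_type.strip().lower().replace("-", "").replace(" ", "")
    if normalized in {"sec10k", "10k", "form10k"}:
        return SEC10K_MODEL
    # Everything else falls back to the general-purpose base model
    return BASE_MODEL
```

The returned name can then be passed as the first argument of `pretrained(...)`, together with `'en'` and `'finance/models'`.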



### Let's see how to get summaries for different financial documents using the `Summarizer()` module.


## Suspicious Activity Report

In [8]:
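# Wrap the raw "text" column into Spark NLP document annotations
document_assembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("documents")

# Base financial summarizer; setMaxNewTokens caps the length of the generated summary
flant5 = finance.Summarizer().pretrained('finsum_flant5_base','en','finance/models')\
    .setInputCols(["documents"])\
    .setOutputCol("summary")\
    .setMaxNewTokens(1000)

# Chain both stages into a single pipeline
pipeline = nlp.Pipeline(stages=[document_assembler, flant5])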

data = spark.createDataFrame([
  [1, """Description of Activity:
  
On [Date], [Name of Business] submitted a loan application for a large sum of money. The loan officer noted that the application contained several red flags that raised suspicions of possible fraudulent activity.

Firstly, the business provided minimal documentation to support their financial statements, such as tax returns or bank statements. Secondly, the business listed a residential address as their place of business, which appeared to be a private residence. Additionally, the business provided inconsistent information regarding their ownership structure and the intended use of the loan proceeds.

Further investigation revealed that the business had no visible online presence, including a lack of a website, social media accounts, or business reviews. The loan officer also discovered that the business had only been in operation for a short period, despite their claims of significant revenue and growth.

Based on these findings, it is suspected that [Name of Business] may be engaging in fraudulent activity and using the loan to perpetrate such activity. Therefore, we recommend that this loan application be denied, and further investigation be conducted to determine if any additional suspicious activity has occurred."""]
]).toDF('id', 'text')

results = pipeline.fit(data).transform(data)

results.select("summary.result").show(truncate=False)

finsum_flant5_base download started this may take some time.
[OK!]

## Responsibility Reports

In [9]:
data = spark.createDataFrame([
  [2, """Lost Time Incident Rate: 

  The lost time incident rate per 200,000 hours worked in 2021 was 0.14, which decreased by 17.6% compared to 2020 (0.17) and decreased by 70.8% compared to 2019 (0.48). The decrease in the lost time incident rate can be attributed to the company's efforts to improve workplace safety and implement effective risk management strategies. 
  
  The total Scope 2 GHG emissions in 2021 were 688,228 tonnes, which remained relatively stable compared to 2020. The company's efforts to transition to renewable energy sources have helped to minimize Scope 2 GHG emissions."""]
]).toDF('id', 'text')

results = pipeline.fit(data).transform(data)

results.select("summary.result").show(truncate=False)


## Broker Reports

In [10]:
data = spark.createDataFrame([
  [3, """Broker Report: Company XYZ

Introduction:
Company XYZ is a leading player in the technology industry that has released its financial results for the fiscal year 2022. The company has reported significant improvements in its cash flow operations, free cash flow, and loss reduction. This report aims to analyze these improvements and provide insights into the future prospects of the company.

Cash Flow Operations:
Company XYZ's cash flow operations have shown significant improvement over the fiscal year 2022. The net cash flow from operating activities has increased by 15% compared to the previous year. This improvement is primarily due to the increase in sales and effective management of accounts receivable and accounts payable. The company has also reduced its inventory levels, resulting in a reduction of cash outflows from operating activities.

Free Cash Flow:
Company XYZ's free cash flow has also increased by 20% over the fiscal year 2022. This increase is primarily due to the improvement in cash flow operations and the reduction in capital expenditures. The company has been able to generate positive free cash flow for the third consecutive year. This is a significant achievement for the company and shows its commitment to improving its financial position."""]
]).toDF('id', 'text')

results = pipeline.fit(data).transform(data)

results.select("summary.result").show(truncate=False)

+--------------------------------------------------------------------------------------------------------------------+
|result                                                                                                              |
+--------------------------------------------------------------------------------------------------------------------+
|[Company XYZ has reported significant improvements in its cash flow operations, free cash flow, and loss reduction.]|
+--------------------------------------------------------------------------------------------------------------------+
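
The `result` column shown above holds a list of strings per row. To work with the summaries as plain Python strings rather than a printed table, something along these lines may help. It is a sketch that assumes the DataFrame has the `id` and `summary` columns used in this notebook, and `collect_summaries` is a hypothetical helper name.

```python
def collect_summaries(results_df, id_col="id", summary_col="summary"):
    """Return {id: summary text} from a transformed DataFrame.

    `<summary_col>.result` holds a list of strings per row, so the
    pieces are joined with a space into one string per document.
    """
    rows = results_df.select(id_col, f"{summary_col}.result").collect()
    return {row[id_col]: " ".join(row["result"]) for row in rows}


# In the notebook:
# summaries = collect_summaries(results)
# print(summaries[3])
```

Note that `collect()` brings every row to the driver, so this suits small batches like the examples here rather than large corpora.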



## SEC 10-K

In [4]:
# Same pipeline as before; only the pretrained model name changes
document_assembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("documents")

# Summarizer fine-tuned on SEC 10-K report sections
flant5 = finance.Summarizer().pretrained('finsum_flant5_finetuned_sec10k','en','finance/models')\
    .setInputCols(["documents"])\
    .setOutputCol("summary")\
    .setMaxNewTokens(1000)

pipeline = nlp.Pipeline(stages=[document_assembler, flant5])

data = spark.createDataFrame([
  [4, """Report on Form 10-K.
Moreover, we operate in a very competitive and rapidly changing environment. New risks and uncertainties emerge from time to time, and it is not possible for us to predict all risks and uncertainties that could have an impact on the forward-looking statements contained in this Annual Report on Form 10-K. We cannot assure you that the results, events, and circumstances reflected in the forward-looking statements will be achieved or occur, and actual results, events, or circumstances could differ materially from those described in the forward-looking statements.
The forward-looking statements made in this Annual Report on Form 10-K relate only to events as of the date on which the statements are made. We undertake no obligation to update any forward-looking statements made in this Annual Report on Form 10-K to reflect events or circumstances after the date of this Annual Report on Form 10-K or to reflect new information or the occurrence of unanticipated events, except as required by law. We may not actually achieve the plans, intentions, or expectations disclosed in our forward-looking statements, and you should not place undue reliance on our forward-looking statements. Our forward-looking statements do not reflect the potential impact of any future acquisitions, mergers, dispositions, joint ventures, or investments we may make.
SUMMARY OF RISK FACTORS
Below is a summary of the principal factors that
could materially harm our business, operating results and/or financial condition, impair our future prospects and/or cause the price of our Class A common stock to decline.
This summary does not address all of the risks that we face. Additional discussion of the risks summarized in this risk factor summary, and other risks that we face, can be found below under the heading “Risk Factors” and should be carefully considered, together with other information in this Form 10-K and our other filings with the Securities and Exchange Commission ("SEC") before making an investment decision regarding our Class A common stock."""]
]).toDF('id', 'text')

results = pipeline.fit(data).transform(data)

results.select("summary.result").show(truncate=False)

finsum_flant5_finetuned_sec10k download started this may take some time.
[OK!]