# Financial Text Classification


![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)





[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Finance/3.Text_Classification.ipynb)

## Colab Setup

In [None]:
# Install the johnsnowlabs library to access Spark-OCR and Spark-NLP for Healthcare, Finance, and Legal.
! pip install johnsnowlabs 

In [None]:
from google.colab import files
print('Please Upload your John Snow Labs License using the button below')
license_keys = files.upload()

Saving latest_3_1_x_spark_nlp_for_healthcare_spark_ocr_5112.json to latest_3_1_x_spark_nlp_for_healthcare_spark_ocr_5112.json


In [None]:
from johnsnowlabs import * 
# After uploading your license, run this to install all licensed Python wheels and pre-download the JARs for the Spark session JVM.
# Restart your notebook afterwards for the changes to take effect.
jsl.install()

👌 Detected license file /content/latest_3_1_x_spark_nlp_for_healthcare_spark_ocr_5112.json
📋 Stored John Snow Labs License in /root/.johnsnowlabs/licenses/license_number_0_for_Spark-Healthcare_Spark-OCR.json
👷 Setting up John Snow Labs home in /home/ckl/.johnsnowlabs this might take a few minutes.
Downloading 🐍+🚀 Python Library Spark-NLP-4.1.0-wheel-for-spark-3.x.x.whl
Downloading 🐍+💊 Python Library hc
Downloading 🐍+🕶 Python Library Spark-OCR-4.0.1-wheel-for-spark-3.x.x.whl
Downloading 🫘+🚀 Java Library Spark-NLP-4.1.0-cpu-for-spark-3.x.x.jar
Downloading 🫘+💊 Java Library hc
Downloading 🫘+🕶 Java Library Spark-OCR-4.0.1-cpu-for-spark-3.x.x.jar
🙆 JSL Home setup in /root/.johnsnowlabs
Running "/usr/bin/python3 -m pip install https://pypi.johnsnowlabs.com/[LIBRARY_SECRET]spark-ocr/spark_ocr-4.0.1-py3-none-any.whl --force-reinstall"
Running "/usr/bin/python3 -m pip install https://pypi.johnsnowlabs.com/[LIBRARY_SECRET]spark-nlp-internal/spark_nlp_internal-4.1.0-py3-none-any.whl --force-reinst

## Start Spark Session

In [None]:
from johnsnowlabs import * 
# Automatically load the license data and start a session with all JARs the user has access to
spark = jsl.start()

👌 Detected license file /content/latest_3_1_x_spark_nlp_for_healthcare_spark_ocr_5112.json
📋 Stored new John Snow Labs License in /root/.johnsnowlabs/licenses/license_number_2_for_Spark-Healthcare_Spark-OCR.json
👌 Launched SparkSession with Jars for: 🚀Spark-NLP, 💊Spark-Healthcare, 🕶Spark-OCR


In [5]:
import pandas as pd
import warnings
from pyspark.sql import SparkSession

warnings.filterwarnings('ignore')

# Optional: start the session with custom parameters instead of jsl.start() above.
# PUBLIC_VERSION and JSL_VERSION are not defined in this notebook; set them to the
# Spark NLP and licensed library versions before calling start().
def start(SECRET):
    builder = SparkSession.builder \
        .appName("Spark NLP Licensed") \
        .master("local[*]") \
        .config("spark.driver.memory", "16G") \
        .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer") \
        .config("spark.kryoserializer.buffer.max", "2000M") \
        .config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp_2.12:"+PUBLIC_VERSION) \
        .config("spark.jars", "https://pypi.johnsnowlabs.com/"+SECRET+"/spark-nlp-jsl-"+JSL_VERSION+".jar")

    return builder.getOrCreate()

#spark = start(SECRET)


## Get Multiclass Prediction from Financial Texts

The classification models were trained on financial texts. The **generic model** classifies a text into one of three categories:
 - Environmental
 - Social
 - Governance

The **augmented model** classifies the same text into more specific categories:
 - Business_Ethics
 - Data_Security
 - Access_And_Affordability 
 - Business_Model_Resilience
 - Competitive_Behavior
 - Critical_Incident_Risk_Management
 - Customer_Welfare
 - Director_Removal
 - Employee_Engagement_Inclusion_And_Diversity
 - Employee_Health_And_Safety
 - Human_Rights_And_Community_Relations
 - Labor_Practices
 - Management_Of_Legal_And_Regulatory_Framework
 - Physical_Impacts_Of_Climate_Change
 - Product_Quality_And_Safety
 - Product_Design_And_Lifecycle_Management
 - Selling_Practices_And_Product_Labeling
 - Supply_Chain_Management
 - Systemic_Risk_Management
 - Waste_And_Hazardous_Materials_Management
 - Water_And_Wastewater_Management
 - Air_Quality
 - Customer_Privacy
 - Ecological_Impacts
 - Energy_Management
 - GHG_Emissions

### Sample Texts for Multiclass Classification

In [6]:
sample_texts = [("""As part of a settlement with Energy Management, the company agreed to provide Energy Management employees with back-up power and to make sure those power services are provided to customers who want it. The city has been on the hook for $18 million in back-up utility charges since the spill and is trying a new energy delivery strategy: pay for the cost of the backup. In an internal report obtained by WXYZ-TV on Thursday, Energy Management officials were troubled by a decision to leave a gas pipeline running under a bus stop in Midlothian and instead deliver gas to a company called Energy Solutions. There are more than 8,000 residents who depend on this pipeline. On Wednesday, U.S. Sen. David Perdue, R-Ga., called on the EPA to conduct an investigation into why the company chose not to test the gas pipeline under the bus stop, a decision the company says was made solely to keep the company working. The EPA said Thursday that Energy Management is not responsible for the safety of the workers and equipment who were forced to live with the dangers of fuel. More on WXYZ-TV: Biden says 'lax supervision' created '""", "Environmental", "Energy_Management"),
                ("""I received a few emails over the last week from many users in this regard. After a brief pause, several major banks began to respond, and these banks appear to be making progress in addressing some of the concerns. As I continue to explore various potential responses to the report, I have found that some of them are making significant efforts toward addressing the data security concerns mentioned in this article. While these improvements are promising, it is also important to understand just how important these initiatives are for maintaining the company's financial security. The following steps represent changes already being taken by the companies concerned, and they are designed to provide more assurance as a matter of public policy.1. Update the Privacy Policy to state that data is encrypted, regardless of where the encryption system is set up2. Make sure there is no direct connection between data that is gathered and a user's identity, and provide for a link to the person at whose expense it is collected, so that this does not lead someone with malicious intent to access information the law requires to protect the integrity of an account.3. For every transaction that can be attributed to a user: Verify the identity of the person for whom the transaction took place""", "Social", "Data_Security"),
                ("""The three former colleagues were "not involved in the overall decision making process on these initiatives", according to a letter reported by the Guardian.While Critical Incident Risk Management was an internal focus at Hamilton, Carr and Garcia said they were "not aware of discussions that took place among senior management within Critical Incident Risk Management" regarding some of the internal internal risks they raised with superiors.'Not a threat'The three alleged employees said they also raised concerns about concerns regarding their physical safety with managers. In a Facebook thread posted on Sunday and then removed at around 03:30 on Monday, three former colleagues alleged that the four key managers of Hamilton failed to act on the concerns raised at WorkChoices within the next six months by: denying the company the £3.6 million cash bonus awarded between October 2017 and October 2018; denying that there was any financial risk for the company within four weeks of the whistleblower raising the concerns; refusing the whistleblower the severance package awarded in 2018; and failing to offer their resignation papers.In a written statement, CIM said "[h]e has been consistently and consistently supported by the organisation, by colleagues, by the board""", "Governance", "Critical_Incident_Risk_Management"),
                ("""He was appointed as Managing Director of Product Development and Product Quality at the beginning of this year with responsibilities including the management of the company's Product Improvement and Product Development team. Mr. Hays became acting Managing Director in July 2017, and is based in the U.S. at Maven Media, LLC, a major content distribution and management company with more than 200 clients throughout North America. Mr. Hays is expected to fill the position at the company soon, according to company officials. Earlier today, WeWork acquired online service Wufoo, a Web-based company that was recently acquired by Snap. WeWork is a web-based digital personal assistant that has raised several rounds of funding, including a Series A round led by Benchmark Capital, which valued Wufoo at $500M after raising more than $1B from private capital. Wufoo enables people to quickly sign in with their WeWork email, access contacts and search any topic they like from the web by tagging them with keywords. WeWork has said it will continue to fund operations from the Wufoo startup's headquarters in Chicago.""", "Social", "Product_Design_And_Lifecycle_Management"),
                ("""For the past six months, I have been working on a series of blog posts exploring a very short (2,125 words) post that discusses whether the company in question can adapt to the changing world and be resilient against the onslaught of new business models. All this and more will be available here on Medium and the website for folks interested in seeing it for themselves - which, as you may have been aware - was just added, but I'm guessing at times it is a condensed outline of what""", "Governance", "Business_Model_Resilience"),
                ("""The EPA found problems with several leaks, including the company's failure to respond properly to spills and the water agency's inability to detect groundwater contamination levels above safe limits.The agency took the department to a federal court for its allegations. It awarded a total of $13.8 million in damages and interest to the company, according to The New York Times, citing court documents.The court found that although EDF is a major supplier of power plants to the U.S., it is not an electrical utility and is not governed by the terms and conditions of a new federal water, hazardous waste, pollution and public health law that began on Monday, The Times reported.""", "Environmental", "Water_And_Wastewater_Management")]

### Prediction Pipeline

In [7]:
document_assembler = nlp.DocumentAssembler()\
    .setInputCol('text')\
    .setOutputCol('document')

tokenizer = nlp.Tokenizer()\
    .setInputCols(['document'])\
    .setOutputCol('token')

sequenceClassifier_gen = finance.BertForSequenceClassification.pretrained("finclf_esg", "en", "finance/models")\
    .setInputCols(["document",'token'])\
    .setOutputCol("generic_class")\
    .setCaseSensitive(True)\
    .setMaxSentenceLength(512)

sequenceClassifier_aug = finance.BertForSequenceClassification.pretrained("finclf_augmented_esg", "en", "finance/models")\
    .setInputCols(["document",'token'])\
    .setOutputCol("augmented_class")\
    .setCaseSensitive(True)\
    .setMaxSentenceLength(512)

pipeline = Pipeline(stages=[
    document_assembler, 
    tokenizer,
    sequenceClassifier_gen,
    sequenceClassifier_aug
])


empty_df = spark.createDataFrame([['']]).toDF("text")

model = pipeline.fit(empty_df)

finclf_esg download started this may take some time.
[OK!]
finclf_augmented_esg download started this may take some time.
[OK!]


In [8]:
df = spark.createDataFrame(sample_texts, ["text", "gen_label", "aug_label"])

df.show(truncate = 80)

+--------------------------------------------------------------------------------+-------------+---------------------------------------+
|                                                                            text|    gen_label|                              aug_label|
+--------------------------------------------------------------------------------+-------------+---------------------------------------+
|As part of a settlement with Energy Management, the company agreed to provide...|Environmental|                      Energy_Management|
|I received a few emails over the last week from many users in this regard. Af...|       Social|                          Data_Security|
|The three former colleagues were "not involved in the overall decision making...|   Governance|      Critical_Incident_Risk_Management|
|He was appointed as Managing Director of Product Development and Product Qual...|       Social|Product_Design_And_Lifecycle_Management|
|For the past six months, I have been wor

In [9]:
result = model.transform(df)

In [10]:
result.select("gen_label", "aug_label", F.explode(F.arrays_zip('document.result', 'generic_class.result', 'augmented_class.result')).alias("cols"))\
      .select(F.expr("cols['0']").alias("document"),
              "gen_label",
              "aug_label",
              F.expr("cols['1']").alias("gen_class"),
              F.expr("cols['2']").alias("aug_class")).show(truncate=60)     

+------------------------------------------------------------+-------------+---------------------------------------+-------------+---------------------------------------+
|                                                    document|    gen_label|                              aug_label|    gen_class|                              aug_class|
+------------------------------------------------------------+-------------+---------------------------------------+-------------+---------------------------------------+
|As part of a settlement with Energy Management, the compa...|Environmental|                      Energy_Management|Environmental|                      Energy_Management|
|I received a few emails over the last week from many user...|       Social|                          Data_Security|       Social|                          Data_Security|
|The three former colleagues were "not involved in the ove...|   Governance|      Critical_Incident_Risk_Management|   Governance|      Critical_
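
To put a number on how well the predictions match the gold labels, the two columns can be compared row by row. The sketch below is plain Python and assumes the label pairs have already been pulled out of the result DataFrame (for example with `collect()`). The helper `label_agreement` is hypothetical, and the pairs are the `gen_label` and `gen_class` values of the three complete rows shown above.

```python
def label_agreement(pairs):
    """Fraction of (gold, predicted) pairs whose two labels are identical."""
    if not pairs:
        raise ValueError("need at least one (gold, predicted) pair")
    matches = sum(1 for gold, predicted in pairs if gold == predicted)
    return matches / len(pairs)

# (gen_label, gen_class) pairs copied from the three complete rows shown above.
generic_pairs = [
    ("Environmental", "Environmental"),
    ("Social", "Social"),
    ("Governance", "Governance"),
]
print(label_agreement(generic_pairs))
```

Three rows are far too few to say anything about model quality; the point is only to show the comparison.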

## Get Multilabel Prediction from Financial Texts

This model assigns one or more classes to a financial news text. Each text can receive any combination of the following categories:
 - finance
 - acq
 - fuel
 - plant
 - mineral
 - trade
 - livestock
 - jobs
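
Unlike the multiclass models earlier, a multilabel classifier accepts or rejects each category independently, so a text can end up with zero, one, or several labels. The sketch below illustrates that decision rule, assuming per-label scores between 0 and 1 and a cutoff of 0.5. The scores are made up and `select_labels` is a hypothetical helper, not part of Spark NLP; the pretrained model's actual scoring and cutoff may differ.

```python
def select_labels(scores, threshold=0.5):
    """Keep every label whose score reaches the threshold, highest score first."""
    kept = [(label, score) for label, score in scores.items() if score >= threshold]
    kept.sort(key=lambda item: item[1], reverse=True)
    return [label for label, _ in kept]

# Made-up scores for illustration only.
hypothetical_scores = {"finance": 0.91, "trade": 0.78, "fuel": 0.12, "jobs": 0.03}
print(select_labels(hypothetical_scores))
print(select_labels(hypothetical_scores, threshold=0.95))
```

With the default cutoff, two labels survive; with a stricter cutoff, none do, which is a valid multilabel outcome.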


### Prediction Pipeline

In [11]:
document_assembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")\
    .setCleanupMode("shrink")

embeddings = nlp.UniversalSentenceEncoder.pretrained()\
    .setInputCols("document")\
    .setOutputCol("embeddings")

doc_classifier = nlp.MultiClassifierDLModel.pretrained("finmulticlf_news", "en" ,"finance/models")\
    .setInputCols("document", "embeddings")\
    .setOutputCol("category")    


clf_pipeline = Pipeline(stages = [
        document_assembler,
        embeddings,
        doc_classifier])

light_pipeline = LightPipeline(clf_pipeline.fit(spark.createDataFrame([['']]).toDF("text")))

tfhub_use download started this may take some time.
Approximate size to download 923.7 MB
[OK!]
finmulticlf_news download started this may take some time.
Approximate size to download 12.3 MB
[OK!]


### Get Result with LightPipeline

In [12]:
result = light_pipeline.annotate("""ECUADOR HAS TRADE SURPLUS IN FIRST FOUR MONTHS Ecuador posted a trade surplus of 10.6 mln dlrs in the first four months of 1987 compared with a surplus of 271.7 mln in the same period in 1986, the central bank of Ecuador said in its latest monthly report. Ecuador suspended sales of crude oil, its principal export product, in March after an earthquake destroyed part of its oil-producing infrastructure. Exports in the first four months of 1987 were around 639 mln dlrs and imports 628.3 mln, compared with 771 mln and 500 mln respectively in the same period last year. Exports of crude and products in the first four months were around 256.1 mln dlrs, compared with 403.3 mln in the same period in 1986. The central bank said that between January and May Ecuador sold 16.1 mln barrels of crude and 2.3 mln barrels of products, compared with 32 mln and 2.7 mln respectively in the same period last year. Ecuador's international reserves at the end of May were around 120.9 mln dlrs, compared with 118.6 mln at the end of April and 141.3 mln at the end of May 1986, the central bank said. gold reserves were 165.7 mln dlrs at the end of May compared with 124.3 mln at the end of April.""")

result["category"]

['finance', 'trade']

In [13]:
result = light_pipeline.annotate("""LONDON GRAIN FREIGHTS 27,000 long tons USG/Taiwan 23.25 dlrs fio five days/1,500 1-10/5 Continental. Trade Banner - 30,000 long tons grain USG/Morocco 13.50 dlrs 5,000/5,000 end-April/early-May Comanav. Reference New York Grain Freights 1 of April 8, ship brokers say the vessel fixed by Cam from the Great Lakes to Algeria at 28 dlrs is reported to be the Vamand Wave. Reference New York Grain Freights 2 of April 8, they say the Cory Grain maize business from East London at 22 dlrs is to Japan and not to Spain as reported""")

result["category"]

['plant', 'fuel']

In [14]:
result = light_pipeline.annotate("""Agriculture Ministry officials said they are not considering cuts in import duties on chocolate to help ease friction with the United States over agricultural trade. Japan has already lowered the duties sharply and we must consider domestic market conditions, an official said. Duties on chocolate were cut to 20 pct from 31.9 pct in April 1983. Washington has been demanding a cut to seven pct, equivalent to its own duties, ministry sources said. Japanese chocolate imports rose to 8,285 tonnes in calendar 1986 from 5,908 in 1985, official statistics show. However, the ministry sources added it is possible the government may make further cuts in response to strong U.S. And European demand. "Due to concern about the farm trade row with the U.S., Top-level government officials may press the ministry to cut the duties," one said. But he said it would be difficult for Japan to resolve its overall trade row with Washington and reduce its trade surplus, which reached 58.6 billion dolars in 1986. Agricultural trade issues between Japan and the U.S.  Include Japanese import restrictions on 12 farm products.""")

result["category"]

['plant', 'livestock', 'trade']
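
When many texts are annotated, it helps to see how often each category is assigned. The sketch below tallies the three `result["category"]` lists shown above. The helper `count_categories` is hypothetical and works on plain Python lists, so it does not need a Spark session.

```python
from collections import Counter

def count_categories(predictions):
    """Count how many texts were assigned each category."""
    counts = Counter()
    for categories in predictions:
        counts.update(set(categories))  # count a label once per text
    return counts

# The three result["category"] outputs shown above.
predictions = [
    ["finance", "trade"],
    ["plant", "fuel"],
    ["plant", "livestock", "trade"],
]
counts = count_categories(predictions)
print(counts["plant"], counts["trade"], counts["finance"])
```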