
![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)





[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Finance/13.Training_Financial_Classifiers.ipynb)

# Train Domain-specific Multiclass and Multilabel classifiers

## Colab Setup

In [None]:
# Install the johnsnowlabs library to access Spark-OCR and Spark-NLP for Healthcare, Finance, and Legal.
! pip install johnsnowlabs 

In [None]:
from google.colab import files
print('Please upload your John Snow Labs License using the button below')
license_keys = files.upload()

In [3]:
from johnsnowlabs import * 
# After uploading your license, run this to install all licensed Python wheels and pre-download the jars for the Spark session JVM
# Restart your notebook afterwards for the changes to take effect
jsl.install()

👌 Detected license file /content/4.2.0.spark_nlp_for_healthcare-2.json
📋 Stored John Snow Labs License in /root/.johnsnowlabs/licenses/license_number_0_for_Spark-Healthcare_Spark-OCR.json
👷 Setting up if John Snow Labs home exists in /root/.johnsnowlabs this might take a few minutes.
Downloading 🐍+🚀 Python Library spark_nlp-4.2.0-py2.py3-none-any.whl
Downloading 🐍+💊 Python Library spark_nlp_jsl-4.2.0-py3-none-any.whl
Downloading 🐍+🕶 Python Library spark_ocr-4.1.0-py3-none-any.whl
Downloading 🫘+🚀 Java Library spark-nlp-assembly-4.2.0.jar
Downloading 🫘+💊 Java Library spark-nlp-jsl-4.2.0.jar
Downloading 🫘+🕶 Java Library spark-ocr-assembly-4.1.0.jar
🙆 JSL Home setup in /root/.johnsnowlabs
👌 Detected license file /content/4.2.0.spark_nlp_for_healthcare-2.json
Installing /root/.johnsnowlabs/py_installs/spark_nlp_jsl-4.2.0-py3-none-any.whl to /usr/bin/python3
Running: /usr/bin/python3 -m pip install /root/.johnsnowlabs/py_installs/spark_nlp_jsl-4.2.0-py3-none-any.whl
👌 Detected license file /

## Start Spark Session

In [1]:
from johnsnowlabs import * 
# Automatically load the license data and start a session with all the jars the user has access to
spark = jsl.start()

DEBUG START!
👌 Detected license file /content/4.2.0.spark_nlp_for_healthcare-2.json
👌 Launched cpu-Optimized JVM SparkSession with Jars for: 🚀Spark-NLP==4.2.0, 💊Spark-Healthcare==4.2.0, 🕶Spark-OCR==4.1.0, running on ⚡ PySpark==3.1.2


# Multilabel classifier training

## Loading the data

In [2]:
! wget -q https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/master/tutorials/Certification_Trainings/Finance/data/finance_data.csv

In [3]:
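import ast
import pandas as pd
df = pd.read_csv('./finance_data.csv')
# Labels are stored as list literals in string form; parse them without eval()
df['label'] = df['label'].apply(ast.literal_eval)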
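
The `label` column comes back from the CSV as text and has to be turned into a real Python list before Spark can treat it as an array. The sketch below shows that parsing step on its own with `ast.literal_eval`, which accepts only Python literals. The sample strings and the `is_safe_literal` helper are hypothetical and only illustrate the idea; the actual quoting used in `finance_data.csv` is assumed, not verified.

```python
import ast

# Hypothetical raw cell values, as they might appear after reading the CSV
raw_labels = ["['waivers', 'amendments']", "['terminations']"]

# literal_eval parses Python literals only; it does not execute arbitrary code
parsed_labels = [ast.literal_eval(value) for value in raw_labels]


def is_safe_literal(value):
    """Return True if value parses as a plain Python literal."""
    try:
        ast.literal_eval(value)
        return True
    except (ValueError, SyntaxError):
        return False


print(parsed_labels)
```

A string that contains a function call is rejected by `is_safe_literal`, which is the main reason to prefer this over `eval` on data downloaded from the web.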

In [4]:
data = spark.createDataFrame(df)

# If you have a single dataset, split it as below; otherwise load the test dataset the same way you loaded the training data.
train, test = data.randomSplit([0.8, 0.2], seed = 123)

In [5]:
train.show(truncate=50)

+--------------------------------------------------+-----------------------------+
|                                         provision|                        label|
+--------------------------------------------------+-----------------------------+
|(a) Consultant or Company may terminate this Pr...|               [terminations]|
|(a) Each of Borrower and Guarantor, as applicab...|[representations, warranties]|
|(a) No amendment or waiver of any provision of ...|                 [amendments]|
|(a) No failure on the part of any Person to exe...|        [waivers, amendments]|
|(a) No failure or delay by any Agent or any Len...|        [waivers, amendments]|
|(a) No failure or delay by the Administrative A...|        [waivers, amendments]|
|(a) No failure or delay by the Administrative A...|        [waivers, amendments]|
|(a) No failure or delay by the Administrative A...|        [waivers, amendments]|
|(a) No failure or delay by the Agent or any Len...|        [waivers, amendments]|
|(a)

In [6]:
from pyspark.sql.functions import col

test.groupBy("label") \
    .count() \
    .orderBy(col("count").desc()) \
    .show()

+--------------------+-----+
|               label|count|
+--------------------+-----+
|    [governing laws]|  751|
|      [counterparts]|  580|
|           [notices]|  574|
| [entire agreements]|  571|
|      [severability]|  504|
|          [survival]|  327|
|[assigns, success...|  294|
|        [amendments]|  265|
|[waivers, amendme...|  229|
|      [terminations]|  227|
|          [expenses]|  227|
|           [waivers]|  206|
|[representations,...|  203|
|       [assignments]|  174|
|   [representations]|   88|
|[amendments, enti...|   60|
|        [successors]|   50|
|[amendments, term...|   35|
|        [warranties]|   24|
|[governing laws, ...|   13|
+--------------------+-----+
only showing top 20 rows



## With Universal Sentence Encoder

In [7]:
document_assembler = nlp.DocumentAssembler() \
    .setInputCol("provision") \
    .setOutputCol("document") \
    .setCleanupMode("shrink")

embeddings = nlp.UniversalSentenceEncoder.pretrained() \
      .setInputCols("document") \
      .setOutputCol("sentence_embeddings")

classifier_dl = nlp.MultiClassifierDLApproach()\
      .setInputCols(["sentence_embeddings"])\
      .setOutputCol("class")\
      .setLabelColumn("label")\
      .setMaxEpochs(30)\
      .setEnableOutputLogs(True)

clf_pipeline = Pipeline(
    stages = [
        document_assembler,
        embeddings,
        classifier_dl
    ])

tfhub_use download started this may take some time.
Approximate size to download 923.7 MB
[OK!]


In [8]:
%%time
clf_pipelineModel = clf_pipeline.fit(train)

CPU times: user 944 ms, sys: 128 ms, total: 1.07 s
Wall time: 3min 3s


In [9]:
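import os

log_dir = "/root/annotator_logs"
# Read the most recently modified log, so reruns show the latest training run
log_file_name = max(os.listdir(log_dir), key=lambda name: os.path.getmtime(os.path.join(log_dir, name)))

with open(os.path.join(log_dir, log_file_name), "r") as log_file:
    print(log_file.read())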

Training started - epochs: 30 - learning_rate: 0.001 - batch_size: 64 - training_examples: 22042 - classes: 15
Epoch 0/30 - 6.66s - loss: 0.16814168 - acc: 0.9498104 - batches: 345
Epoch 1/30 - 4.73s - loss: 0.07261445 - acc: 0.9774559 - batches: 345
Epoch 2/30 - 4.61s - loss: 0.06150661 - acc: 0.9812575 - batches: 345
Epoch 3/30 - 5.75s - loss: 0.05632347 - acc: 0.9827865 - batches: 345
Epoch 4/30 - 4.73s - loss: 0.05331255 - acc: 0.9838887 - batches: 345
Epoch 5/30 - 4.55s - loss: 0.05126492 - acc: 0.98463815 - batches: 345
Epoch 6/30 - 4.61s - loss: 0.04972897 - acc: 0.9851406 - batches: 345
Epoch 7/30 - 4.58s - loss: 0.048504002 - acc: 0.9856191 - batches: 345
Epoch 8/30 - 4.62s - loss: 0.04748668 - acc: 0.9858874 - batches: 345
Epoch 9/30 - 4.55s - loss: 0.046618596 - acc: 0.98614484 - batches: 345
Epoch 10/30 - 4.52s - loss: 0.045863405 - acc: 0.98649 - batches: 345
Epoch 11/30 - 4.54s - loss: 0.04519599 - acc: 0.98674446 - batches: 345
Epoch 12/30 - 4.57s - loss: 0.044598866 - a

In [10]:
preds = clf_pipelineModel.transform(test)

In [11]:
preds_df = preds.select('label','provision',"class.result").toPandas()
preds_df.head()

Unnamed: 0,label,provision,result
0,"[waivers, terminations]","(a) Effective as of the Effective Date, the Ho...",[representations]
1,"[waivers, amendments]",(a) No failure or delay by the Administrative ...,[waivers]
2,"[waivers, amendments]",(a) No failure or delay on the part of any par...,[waivers]
3,[assignments],"(a) Seller, the Agent, each Managing Agent, ea...","[successors, assignments]"
4,"[assigns, successors]",(a) The provisions of this Agreement shall be ...,"[successors, assigns]"


In [12]:
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.metrics import classification_report
from sklearn.metrics import f1_score
from sklearn.metrics import roc_auc_score

mlb = MultiLabelBinarizer()

# Fit the binarizer once on the gold labels and reuse it for the predictions,
# so both matrices share the same classes and column order
y_true = mlb.fit_transform(preds_df['label'])
y_pred = mlb.transform(preds_df['result'])


print("Classification report: \n", (classification_report(y_true, y_pred)))
print("F1 micro averaging:",(f1_score(y_true, y_pred, average='micro')))
print("ROC AUC: ",(roc_auc_score(y_true, y_pred, average="micro")))


Classification report: 
               precision    recall  f1-score   support

           0       0.88      0.77      0.82       618
           1       0.73      0.52      0.61       198
           2       0.79      0.75      0.77       302
           3       0.99      0.98      0.99       587
           4       0.98      0.94      0.96       675
           5       0.98      0.92      0.95       228
           6       0.98      0.98      0.98       784
           7       0.98      0.96      0.97       574
           8       0.92      0.79      0.85       291
           9       0.99      0.94      0.96       531
          10       0.84      0.85      0.84       361
          11       0.96      0.91      0.94       329
          12       0.89      0.73      0.80       272
          13       0.90      0.75      0.82       460
          14       0.83      0.80      0.82       227

   micro avg       0.93      0.87      0.90      6437
   macro avg       0.91      0.84      0.87      6437
w
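The numeric class ids in the report are columns of a 0/1 indicator matrix, so the gold and predicted matrices are only comparable when both are built from the same class list in the same order. Below is a minimal pure-Python sketch of that idea, with a hand-rolled micro-averaged F1. The helper names and the toy labels are hypothetical; this is not how sklearn implements it internally.

```python
def binarize(label_lists, classes):
    """Turn each list of labels into a 0/1 row with one column per class."""
    return [[1 if c in labels else 0 for c in classes] for labels in label_lists]


def micro_f1(y_true, y_pred):
    """Micro-averaged F1: pool true/false positives over all classes."""
    tp = fp = fn = 0
    for true_row, pred_row in zip(y_true, y_pred):
        for t, p in zip(true_row, pred_row):
            if t == 1 and p == 1:
                tp += 1
            elif t == 0 and p == 1:
                fp += 1
            elif t == 1 and p == 0:
                fn += 1
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Toy gold labels and predictions (hypothetical values)
true_labels = [["waivers", "amendments"], ["assignments"], ["notices"]]
pred_labels = [["waivers"], ["successors", "assignments"], ["notices"]]

# One shared, sorted class list defines the column order for both matrices
classes = sorted({label for labels in true_labels + pred_labels for label in labels})

y_true = binarize(true_labels, classes)
y_pred = binarize(pred_labels, classes)
print(classes)
print(micro_f1(y_true, y_pred))
```

If the two matrices were built from separately derived class lists, column 3 could mean `successors` in one and `waivers` in the other, and every per-class score would be meaningless.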

## With BERT Embeddings

We do not have finance-specific sentence embeddings, but we can take the financial BERT word embeddings and average them into one vector per document.
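Averaging here means taking the element-wise mean of the token vectors, which is what the `AVERAGE` pooling strategy name suggests. A minimal sketch with tiny hypothetical 3-dimensional vectors follows; the real embeddings are much larger, and `average_pool` is an illustrative helper, not a Spark NLP function.

```python
def average_pool(token_vectors):
    """Average a list of equal-length token vectors into one sentence vector."""
    if not token_vectors:
        raise ValueError("need at least one token vector")
    dim = len(token_vectors[0])
    count = len(token_vectors)
    return [sum(vec[i] for vec in token_vectors) / count for i in range(dim)]


# Hypothetical embeddings for a three-token provision
token_vectors = [
    [1.0, 0.0, 2.0],
    [3.0, 2.0, 0.0],
    [2.0, 4.0, 1.0],
]

sentence_vector = average_pool(token_vectors)
print(sentence_vector)
```

The result has the same dimensionality as a single token vector, so the classifier receives a fixed-size input regardless of how long the provision is.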

In [13]:
embeddings = nlp.BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en") \
          .setInputCols(["document", "token"]) \
          .setOutputCol("embeddings")

bert_embeddings_sec_bert_base download started this may take some time.
Approximate size to download 390.4 MB
[OK!]


In [14]:
document_assembler = nlp.DocumentAssembler() \
    .setInputCol("provision") \
    .setOutputCol("document")

tokenizer = nlp.Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

embeddingsSentence = nlp.SentenceEmbeddings() \
    .setInputCols(["document", "embeddings"]) \
    .setOutputCol("sentence_embeddings") \
    .setPoolingStrategy("AVERAGE")

classifier_dl = nlp.MultiClassifierDLApproach() \
    .setInputCols(["sentence_embeddings"]) \
    .setOutputCol("class") \
    .setLabelColumn("label")\
    .setMaxEpochs(8)\
    .setEnableOutputLogs(True)

clf_pipeline = Pipeline(
    stages = [
        document_assembler,
        tokenizer,
        embeddings,
        embeddingsSentence,
        classifier_dl
    ])

In [None]:
%%time
clf_pipelineModel = clf_pipeline.fit(train)

In [None]:
preds = clf_pipelineModel.transform(test)

In [None]:
preds_df = preds.select('provision','label',"class.result").toPandas()

In [None]:
preds_df.head()

In [None]:
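import os

log_dir = "/root/annotator_logs"
# Read the most recently modified log; index 0 may still point at an earlier run
log_file_name = max(os.listdir(log_dir), key=lambda name: os.path.getmtime(os.path.join(log_dir, name)))

with open(os.path.join(log_dir, log_file_name), "r") as log_file:
    print(log_file.read())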

In [None]:
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.metrics import classification_report
from sklearn.metrics import f1_score
from sklearn.metrics import roc_auc_score

mlb = MultiLabelBinarizer()

# Fit the binarizer once on the gold labels and reuse it for the predictions,
# so both matrices share the same classes and column order
y_true = mlb.fit_transform(preds_df['label'])
y_pred = mlb.transform(preds_df['result'])


print("Classification report: \n", (classification_report(y_true, y_pred)))
print("F1 micro averaging:",(f1_score(y_true, y_pred, average='micro')))
print("ROC AUC: ",(roc_auc_score(y_true, y_pred, average="micro")))


### Saving & loading back the trained model

In [None]:
clf_pipelineModel.stages

In [None]:
clf_pipelineModel.stages[-1].write().overwrite().save('MultilabelClfBert')

In [None]:
# Load back the saved multilabel classifier model
MultilabelClfModel = nlp.MultiClassifierDLModel.load('MultilabelClfBert')

In [None]:
ld_pipeline = Pipeline(stages=[document_assembler, tokenizer, embeddings, embeddingsSentence, MultilabelClfModel])
ld_pipeline_model = ld_pipeline.fit(spark.createDataFrame([['']]).toDF("provision"))

In [None]:
# Apply Model Transform to testData
ld_preds = ld_pipeline_model.transform(test)

In [None]:
ld_preds_df = ld_preds.select('provision','label',"class.result").toPandas()

In [None]:
ld_preds_df.head(10)

# Multiclass classifier training


## Loading the data

In [None]:
! wget -q https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/master/tutorials/Certification_Trainings/Finance/data/finance_clf_data.csv

In [None]:
import pandas as pd
df = pd.read_csv('./finance_clf_data.csv')

In [None]:
df.head()

In [None]:
df['label'].value_counts()

In [None]:
data = spark.createDataFrame(df)

train, test = data.randomSplit([0.8, 0.2], seed = 100)

In [None]:
from pyspark.sql.functions import col

train.groupBy("label") \
    .count() \
    .orderBy(col("count").desc()) \
    .show()

In [None]:
from pyspark.sql.functions import col

test.groupBy("label") \
    .count() \
    .orderBy(col("count").desc()) \
    .show()

## With Universal Sentence Encoder

In [None]:
document_assembler = nlp.DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document") 

embeddings = nlp.UniversalSentenceEncoder.pretrained() \
    .setInputCols("document") \
    .setOutputCol("sentence_embeddings")

classifier_dl = finance.ClassifierDLApproach()\
    .setInputCols(["sentence_embeddings"])\
    .setOutputCol("class")\
    .setLabelColumn("label")\
    .setMaxEpochs(30)\
    .setEnableOutputLogs(True)

clf_pipeline = Pipeline(
    stages = [
        document_assembler,
        embeddings,
        classifier_dl
    ])

In [None]:
%%time
clf_pipelineModel = clf_pipeline.fit(train)

In [None]:
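import os

log_dir = "/root/annotator_logs"
# Read the most recently modified log; index 0 may still point at an earlier run
log_file_name = max(os.listdir(log_dir), key=lambda name: os.path.getmtime(os.path.join(log_dir, name)))

with open(os.path.join(log_dir, log_file_name), "r") as log_file:
    print(log_file.read())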

In [None]:
preds = clf_pipelineModel.transform(test)

In [None]:
preds_df = preds.select('label','text',"class.result").toPandas()
preds_df.head()

In [None]:
# The result is an array since in Spark NLP you can have multiple sentences.
# Let's explode the array and get the item(s) inside of result column out
preds_df['result'] = preds_df['result'].apply(lambda x : x[0])
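
Indexing with `x[0]` raises an `IndexError` if a row ever comes back with an empty result list. A slightly more defensive variant is sketched below; whether empty results can actually occur with this pipeline is not confirmed by the notebook, and both `first_or_none` and the sample labels are hypothetical.

```python
def first_or_none(result):
    """Return the first predicted label, or None if the list is empty."""
    return result[0] if len(result) > 0 else None


# Hypothetical 'class.result' values as they might come back from toPandas()
results = [["label_a"], ["label_b"], []]

flat = [first_or_none(r) for r in results]
print(flat)
```

With a pandas column this would be used as `preds_df['result'].apply(first_or_none)`, and rows with `None` would need to be dropped or filled before calling `classification_report`.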

In [None]:
# We use sklearn to evaluate the results on the test dataset
from sklearn.metrics import classification_report

print(classification_report(preds_df['label'], preds_df['result']))

### Saving & loading back the trained model

In [None]:
clf_pipelineModel.stages

In [None]:
clf_pipelineModel.stages[-1].write().overwrite().save('Clf_Use')

In [None]:
# Load back the saved classifier model
ClfModel = finance.ClassifierDLModel.load('Clf_Use')

In [None]:
ld_pipeline = Pipeline(stages=[document_assembler, embeddings, ClfModel])
ld_pipeline_model = ld_pipeline.fit(spark.createDataFrame([['']]).toDF("text"))

In [None]:
# Apply Model Transform to testData
ld_preds = ld_pipeline_model.transform(test)

In [None]:
ld_preds_df = ld_preds.select('text','label',"class.result").toPandas()

In [None]:
ld_preds_df.head()

## With BERT Embeddings

We do not have financial sentence embeddings yet, but we can use the financial word embeddings and average them.

In [None]:
embeddings = nlp.BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en") \
    .setInputCols(["document", "token"]) \
    .setOutputCol("embeddings")

In [None]:
document_assembler = nlp.DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = nlp.Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

embeddingsSentence = nlp.SentenceEmbeddings() \
    .setInputCols(["document", "embeddings"]) \
    .setOutputCol("sentence_embeddings") \
    .setPoolingStrategy("AVERAGE")

classifier_dl = finance.ClassifierDLApproach() \
    .setInputCols(["sentence_embeddings"]) \
    .setOutputCol("class") \
    .setLabelColumn("label")\
    .setMaxEpochs(8)\
    .setLr(0.001)\
    .setEnableOutputLogs(True)

clf_pipeline = Pipeline(
    stages = [
        document_assembler,
        tokenizer,
        embeddings,
        embeddingsSentence,
        classifier_dl
    ])

In [None]:
%%time
clf_pipelineModel = clf_pipeline.fit(train)

In [None]:
preds = clf_pipelineModel.transform(test)

In [None]:
preds_df = preds.select('label','text',"class.result").toPandas()

In [None]:
preds_df.head()

In [None]:
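import os

log_dir = "/root/annotator_logs"
# Read the most recently modified log; index 0 may still point at an earlier run
latest_log = max(os.listdir(log_dir), key=lambda name: os.path.getmtime(os.path.join(log_dir, name)))

with open(os.path.join(log_dir, latest_log), "r") as log_file:
    print(log_file.read())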

In [None]:
# Let's explode the array and get the item(s) inside of result column out
preds_df['result'] = preds_df['result'].apply(lambda x : x[0])

from sklearn.metrics import classification_report

print(classification_report(preds_df['label'], preds_df['result']))


# Save the model and zip it for Models Hub upload/download

In [None]:
# Save a Spark NLP model
clf_pipelineModel.stages[-1].write().overwrite().save('ClfBert')

# cd into saved dir and zip
! cd /content/ClfBert ; zip -r /content/ClfBert.zip *
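
Outside Colab, or where the `zip` command is not available, the same archive can be produced from Python. The sketch below uses `shutil.make_archive` on a temporary stand-in directory; the `metadata/part-00000` file is a hypothetical placeholder, not the actual layout of a saved Spark NLP model.

```python
import os
import shutil
import tempfile

# Stand-in for a saved model directory (hypothetical contents, for illustration)
work_dir = tempfile.mkdtemp()
model_dir = os.path.join(work_dir, "ClfBert")
os.makedirs(os.path.join(model_dir, "metadata"))
with open(os.path.join(model_dir, "metadata", "part-00000"), "w") as f:
    f.write("{}")

# Archive the *contents* of model_dir, so paths inside the zip are relative to it,
# similar to running `zip -r ../ClfBert.zip *` from inside the directory
zip_path = shutil.make_archive(
    os.path.join(work_dir, "ClfBert"), "zip", root_dir=model_dir
)
print(zip_path)
```

One difference from the shell version: the `*` glob skips hidden files at the top level of the directory, while `make_archive` includes everything under `root_dir`.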