# Train Domain-specific Multiclass and Multilabel classifiers


![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)





[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Finance/13.Train_Financial_Classifiers.ipynb)

# Colab Setup

In [None]:
# Install the johnsnowlabs library to access Spark-OCR and Spark-NLP for Healthcare, Finance, and Legal.
! pip install johnsnowlabs 

In [None]:
from google.colab import files
print('Please upload your John Snow Labs License using the button below')
license_keys = files.upload()

In [None]:
from johnsnowlabs import * 
# After uploading your license, run this to install all licensed Python wheels and pre-download the Jars for the Spark Session JVM
# Make sure to restart your notebook afterwards for the changes to take effect
jsl.install()

In [None]:
# Automatically load license data and start a session with all jars user has access to
spark = jsl.start()

In [None]:
from pyspark.ml import Pipeline, PipelineModel
from pyspark.sql import SparkSession, DataFrame
from sparknlp import Doc2Chunk
from sparknlp_jsl.annotator import *
import sparknlp_jsl
from sparknlp.base import LightPipeline

In [None]:
spark

# Multilabel classifier training

## Loading the data

In [None]:
! wget -q https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/master/tutorials/Certification_Trainings/Finance/data/finance_data.csv

In [None]:
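import ast
import pandas as pd
df = pd.read_csv('/content/finance_data.csv')
# The label column holds the text form of a Python list; parse it without executing arbitrary code
df['label'] = df['label'].apply(ast.literal_eval)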
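
The `label` column comes back from `read_csv` as plain strings that look like Python lists, so each value has to be parsed before training. The sketch below shows one way to do that with a parser that accepts only literals. The helper name `parse_label_list` is hypothetical, and the input format (a string such as `"['waivers', 'amendments']"`) is an assumption based on the labels shown in this notebook.

```python
import ast

def parse_label_list(raw):
    """Parse a string such as "['waivers', 'amendments']" into a list of str."""
    # literal_eval only accepts Python literals and never executes code
    value = ast.literal_eval(raw)
    if not isinstance(value, (list, tuple)):
        raise ValueError(f"expected a list of labels, got {type(value).__name__}")
    return [str(item) for item in value]

# Illustrative input, not a row taken from the dataset
print(parse_label_list("['waivers', 'amendments']"))
```

With pandas, the helper could be applied as `df['label'].apply(parse_label_list)`.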

In [None]:
data = spark.createDataFrame(df)

# If you only have a single dataset, split it as below; otherwise load the test set the same way as the training data.
train, test = data.randomSplit([0.8, 0.2], seed = 123)

In [None]:
train.show(truncate=50)

+--------------------------------------------------+-----------------------------+
|                                         provision|                        label|
+--------------------------------------------------+-----------------------------+
|(a) Consultant or Company may terminate this Pr...|               [terminations]|
|(a) Each of Borrower and Guarantor, as applicab...|[representations, warranties]|
|(a) No amendment or waiver of any provision of ...|                 [amendments]|
|(a) No failure on the part of any Person to exe...|        [waivers, amendments]|
|(a) No failure or delay by any Agent or any Len...|        [waivers, amendments]|
|(a) No failure or delay by the Administrative A...|        [waivers, amendments]|
|(a) No failure or delay by the Administrative A...|        [waivers, amendments]|
|(a) No failure or delay by the Administrative A...|        [waivers, amendments]|
|(a) No failure or delay by the Agent or any Len...|        [waivers, amendments]|
|(a)

In [None]:
from pyspark.sql.functions import col

test.groupBy("label") \
    .count() \
    .orderBy(col("count").desc()) \
    .show()

+--------------------+-----+
|               label|count|
+--------------------+-----+
|    [governing laws]|  751|
|      [counterparts]|  580|
|           [notices]|  574|
| [entire agreements]|  571|
|      [severability]|  504|
|          [survival]|  327|
|[assigns, success...|  294|
|        [amendments]|  265|
|[waivers, amendme...|  229|
|          [expenses]|  227|
|      [terminations]|  227|
|           [waivers]|  206|
|[representations,...|  203|
|       [assignments]|  174|
|   [representations]|   88|
|[amendments, enti...|   60|
|        [successors]|   50|
|[amendments, term...|   35|
|        [warranties]|   24|
|[governing laws, ...|   13|
+--------------------+-----+
only showing top 20 rows



## With Universal Sentence Encoder

In [None]:
document_assembler = DocumentAssembler() \
    .setInputCol("provision") \
    .setOutputCol("document") \
    .setCleanupMode("shrink")

embeddings = UniversalSentenceEncoder.pretrained() \
      .setInputCols("document") \
      .setOutputCol("sentence_embeddings")

classifierdl = nlp.MultiClassifierDLApproach()\
      .setInputCols(["sentence_embeddings"])\
      .setOutputCol("class")\
      .setLabelColumn("label")\
      .setMaxEpochs(30)\
      .setEnableOutputLogs(True)

clf_pipeline = Pipeline(
    stages = [
        document_assembler,
        embeddings,
        classifierdl
    ])

tfhub_use download started this may take some time.
Approximate size to download 923.7 MB
[OK!]


In [None]:
clf_pipelineModel = clf_pipeline.fit(train)

In [None]:
import os
log_file_name = os.listdir("/root/annotator_logs")[0]

with open("/root/annotator_logs/"+log_file_name, "r") as log_file :
    print(log_file.read())

Training started - epochs: 30 - learning_rate: 0.001 - batch_size: 64 - training_examples: 21996 - classes: 15
Epoch 0/30 - 13.55s - loss: 0.16655688 - batches: 344
Epoch 1/30 - 9.71s - loss: 0.07149226 - batches: 344
Epoch 2/30 - 9.87s - loss: 0.060517557 - batches: 344
Epoch 3/30 - 11.75s - loss: 0.055443104 - batches: 344
Epoch 4/30 - 9.59s - loss: 0.05248539 - batches: 344
Epoch 5/30 - 9.64s - loss: 0.05047605 - batches: 344
Epoch 6/30 - 9.53s - loss: 0.04897312 - batches: 344
Epoch 7/30 - 9.56s - loss: 0.047777086 - batches: 344
Epoch 8/30 - 9.57s - loss: 0.046785276 - batches: 344
Epoch 9/30 - 10.52s - loss: 0.04593985 - batches: 344
Epoch 10/30 - 11.66s - loss: 0.045204718 - batches: 344
Epoch 11/30 - 12.08s - loss: 0.044556096 - batches: 344
Epoch 12/30 - 12.88s - loss: 0.04397737 - batches: 344
Epoch 13/30 - 9.65s - loss: 0.043456018 - batches: 344
Epoch 14/30 - 9.61s - loss: 0.04298303 - batches: 344
Epoch 15/30 - 9.82s - loss: 0.042551916 - batches: 344
Epoch 16/30 - 9.63s -
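
`os.listdir` returns entries in arbitrary order, so once several training runs have written to the log directory, index `0` is not guaranteed to be the latest run. A minimal sketch of a helper that picks the most recently modified file instead; the name `latest_log_path` is hypothetical and not part of Spark NLP.

```python
import os

def latest_log_path(log_dir):
    """Return the path of the most recently modified file in log_dir."""
    paths = [os.path.join(log_dir, name) for name in os.listdir(log_dir)]
    files = [p for p in paths if os.path.isfile(p)]
    if not files:
        raise FileNotFoundError(f"no log files found in {log_dir}")
    # The modification time identifies the log written by the latest run
    return max(files, key=os.path.getmtime)
```

In this notebook it could be called as `latest_log_path("/root/annotator_logs")` and the returned path opened as before.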

In [None]:
preds = clf_pipelineModel.transform(test)

In [None]:
preds_df = preds.select('label','provision',"class.result").toPandas()
preds_df.head()

Unnamed: 0,label,provision,result
0,"[waivers, terminations]","(a) Effective as of the Effective Date, the Ho...",[representations]
1,"[waivers, amendments]",(a) No failure or delay by the Administrative ...,[waivers]
2,"[waivers, amendments]",(a) No failure or delay on the part of any par...,[waivers]
3,[assignments],"(a) Seller, the Agent, each Managing Agent, ea...","[successors, assignments]"
4,"[assigns, successors]",(a) The provisions of this Agreement shall be ...,"[successors, assigns]"


In [None]:
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.metrics import classification_report
from sklearn.metrics import f1_score
from sklearn.metrics import roc_auc_score

mlb = MultiLabelBinarizer()

# Fit the binarizer on the true labels only, then reuse the same column order for the predictions
y_true = mlb.fit_transform(preds_df['label'])
y_pred = mlb.transform(preds_df['result'])


print("Classification report: \n", classification_report(y_true, y_pred))
print("F1 micro averaging:", f1_score(y_true, y_pred, average='micro'))
print("ROC AUC: ", roc_auc_score(y_true, y_pred, average="micro"))


Classification report: 
               precision    recall  f1-score   support

           0       0.88      0.77      0.82       618
           1       0.73      0.52      0.61       198
           2       0.79      0.75      0.77       302
           3       0.99      0.98      0.99       587
           4       0.98      0.94      0.96       675
           5       0.98      0.92      0.95       228
           6       0.98      0.98      0.98       784
           7       0.98      0.96      0.97       574
           8       0.92      0.79      0.85       291
           9       0.99      0.94      0.96       531
          10       0.84      0.85      0.84       361
          11       0.96      0.91      0.94       329
          12       0.89      0.73      0.80       272
          13       0.90      0.75      0.82       460
          14       0.83      0.80      0.82       227

   micro avg       0.93      0.87      0.90      6437
   macro avg       0.91      0.84      0.87      6437
w
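
To make the `micro avg` row easier to interpret, here is a small sketch of micro-averaged F1 computed directly from label sets: true positives, false positives and false negatives are pooled over all rows and labels before precision and recall are taken. The function name `micro_f1` and the toy rows are illustrative only; they are not the notebook's data and this is not scikit-learn's implementation.

```python
def micro_f1(true_labels, pred_labels):
    """Micro-averaged F1 over rows of label collections."""
    tp = fp = fn = 0
    for truth, pred in zip(true_labels, pred_labels):
        truth, pred = set(truth), set(pred)
        tp += len(truth & pred)   # labels predicted and correct
        fp += len(pred - truth)   # labels predicted but wrong
        fn += len(truth - pred)   # labels missed
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Made-up rows for illustration
y_true_rows = [["waivers", "amendments"], ["assignments"], ["notices"]]
y_pred_rows = [["waivers"], ["successors", "assignments"], ["notices"]]
print(micro_f1(y_true_rows, y_pred_rows))  # 0.75
```

Because the counts are pooled, frequent labels weigh more in the micro average than in the macro average, which averages the per-label scores instead.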

## With BERT Embeddings

We do not have any finance-specific sentence embeddings, but we can use the financial BERT word embeddings and average them into a single vector per document.
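
Conceptually, averaging word embeddings means taking the element-wise mean of the token vectors. The sketch below illustrates that idea in plain Python; `average_pool` is a hypothetical helper for illustration and not how Spark NLP implements its pooling internally.

```python
def average_pool(token_vectors):
    """Element-wise mean of equally sized token vectors."""
    if not token_vectors:
        raise ValueError("need at least one token vector")
    dim = len(token_vectors[0])
    if any(len(vec) != dim for vec in token_vectors):
        raise ValueError("all token vectors must have the same dimension")
    count = len(token_vectors)
    return [sum(vec[i] for vec in token_vectors) / count for i in range(dim)]

# Two made-up 3-dimensional token vectors
tokens = [[1.0, 2.0, 3.0], [3.0, 4.0, 5.0]]
print(average_pool(tokens))  # [2.0, 3.0, 4.0]
```

The result has the same dimension as each word vector, so the classifier receives one fixed-size input regardless of how many tokens the text contains.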

In [None]:
embeddings = BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en") \
          .setInputCols(["document", "token"]) \
          .setOutputCol("embeddings")

bert_embeddings_sec_bert_base download started this may take some time.
Approximate size to download 390.4 MB
[OK!]


In [None]:
document_assembler = DocumentAssembler() \
    .setInputCol("provision") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

embeddingsSentence = SentenceEmbeddings() \
    .setInputCols(["document", "embeddings"]) \
    .setOutputCol("sentence_embeddings") \
    .setPoolingStrategy("AVERAGE")

classifierdl = nlp.MultiClassifierDLApproach() \
      .setInputCols(["sentence_embeddings"]) \
      .setOutputCol("class") \
      .setLabelColumn("label")\
      .setMaxEpochs(8)\
      .setEnableOutputLogs(True)

clf_pipeline = Pipeline(
    stages = [
        document_assembler,
        tokenizer,
        embeddings,
        embeddingsSentence,
        classifierdl
    ])

In [None]:
clf_pipelineModel = clf_pipeline.fit(train)

In [None]:
preds = clf_pipelineModel.transform(test)

In [None]:
preds_df = preds.select('provision','label',"class.result").toPandas()

In [None]:
preds_df.head()

Unnamed: 0,provision,label,result
0,"(a) Effective as of the Effective Date, the Ho...","[waivers, terminations]",[waivers]
1,(a) No failure or delay by the Administrative ...,"[waivers, amendments]","[waivers, amendments]"
2,(a) No failure or delay on the part of any par...,"[waivers, amendments]",[waivers]
3,"(a) Seller, the Agent, each Managing Agent, ea...",[assignments],[assignments]
4,(a) The provisions of this Agreement shall be ...,"[assigns, successors]","[successors, assigns]"


In [None]:
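import os
log_dir = "/root/annotator_logs"
# os.listdir returns entries in arbitrary order, so pick the most recently modified log
log_file_name = max(os.listdir(log_dir), key=lambda name: os.path.getmtime(os.path.join(log_dir, name)))

with open(os.path.join(log_dir, log_file_name), "r") as log_file:
    print(log_file.read())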

Training started - epochs: 8 - learning_rate: 0.001 - batch_size: 64 - training_examples: 22042 - classes: 15
Epoch 0/8 - 8.72s - loss: 0.08101581 - batches: 345
Epoch 1/8 - 6.07s - loss: 0.038062517 - batches: 345
Epoch 2/8 - 6.32s - loss: 0.033950333 - batches: 345
Epoch 3/8 - 6.27s - loss: 0.03147273 - batches: 345
Epoch 4/8 - 6.17s - loss: 0.029639142 - batches: 345
Epoch 5/8 - 6.13s - loss: 0.028155383 - batches: 345
Epoch 6/8 - 6.36s - loss: 0.026899043 - batches: 345
Epoch 7/8 - 6.07s - loss: 0.025821527 - batches: 345



In [None]:
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.metrics import classification_report
from sklearn.metrics import f1_score
from sklearn.metrics import roc_auc_score

mlb = MultiLabelBinarizer()

# Fit the binarizer on the true labels only, then reuse the same column order for the predictions
y_true = mlb.fit_transform(preds_df['label'])
y_pred = mlb.transform(preds_df['result'])


print("Classification report: \n", classification_report(y_true, y_pred))
print("F1 micro averaging:", f1_score(y_true, y_pred, average='micro'))
print("ROC AUC: ", roc_auc_score(y_true, y_pred, average="micro"))


Classification report: 
               precision    recall  f1-score   support

           0       0.90      0.90      0.90       618
           1       0.82      0.66      0.73       198
           2       0.88      0.68      0.77       302
           3       0.99      0.99      0.99       587
           4       0.99      0.96      0.97       675
           5       1.00      0.97      0.98       228
           6       0.98      0.98      0.98       784
           7       1.00      0.97      0.98       574
           8       0.95      0.93      0.94       291
           9       0.99      0.97      0.98       531
          10       0.90      0.85      0.87       361
          11       0.96      0.93      0.95       329
          12       0.93      0.86      0.89       272
          13       0.96      0.75      0.84       460
          14       0.88      0.88      0.88       227

   micro avg       0.95      0.91      0.93      6437
   macro avg       0.94      0.89      0.91      6437
w

### Saving & loading back the trained model

In [None]:
clf_pipelineModel.stages

[DocumentAssembler_f8b952d9caec,
 REGEX_TOKENIZER_50c5d2a2686e,
 BERT_EMBEDDINGS_29ce72cd673e,
 SentenceEmbeddings_482a3d4ce37f,
 MultiClassifierDLModel_760c1c21e05a]

In [None]:
clf_pipelineModel.stages[-1].write().overwrite().save('MultilabelClfBert')

In [None]:
# Load the saved multilabel classifier model back
MultilabelClfModel = nlp.MultiClassifierDLModel.load('MultilabelClfBert')

In [None]:
ld_pipeline = Pipeline(stages=[document_assembler, tokenizer, embeddings, embeddingsSentence, MultilabelClfModel])
ld_pipeline_model = ld_pipeline.fit(spark.createDataFrame([['']]).toDF("provision"))

In [None]:
# Apply Model Transform to testData
ld_preds = ld_pipeline_model.transform(test)

In [None]:
ld_preds_df = ld_preds.select('provision','label',"class.result").toPandas()

In [None]:
ld_preds_df.head(10)

Unnamed: 0,provision,label,result
0,(A) Seller’s Adjusted Tangible Net Worth is gr...,[warranties],[warranties]
1,(a) All notices and other communications provi...,[notices],[notices]
2,(a) Each of the Assignor and the Assignee here...,"[representations, warranties]","[warranties, representations]"
3,"(a) Effective as of the Effective Date, the Ho...","[waivers, terminations]",[waivers]
4,"(a) No amendment, modification or waiver of an...",[amendments],"[waivers, amendments]"
5,(a) No failure or delay by the Administrative ...,"[waivers, amendments]","[waivers, amendments]"
6,(a) No failure or delay by the Administrative ...,"[waivers, amendments]","[waivers, amendments]"
7,(a) No failure or delay by the Administrative ...,"[waivers, amendments]","[waivers, amendments]"
8,(a) No failure or delay by the Administrative ...,"[waivers, amendments]","[waivers, amendments]"
9,(a) No failure or delay by the Administrative ...,"[waivers, amendments]","[waivers, amendments]"


# Multiclass classifier training


## Loading the data

In [None]:
! wget -q https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/master/tutorials/Certification_Trainings/Finance/data/finance_clf_data.csv

In [None]:
import pandas as pd
df = pd.read_csv('/content/finance_clf_data.csv')

In [None]:
df.head()

Unnamed: 0,text,label,len
0,\nOperating\nLeases\n \nOn\nJanuary 1 2010 th...,financial_statements,465
1,the Exercise Price and is exercisable for fiv...,financial_statements,406
2,Income Taxes\n69\nTable of Contents\nWe accoun...,financial_statements,843
3,Invoice2go\n has not been required to maintain...,risk_factors,474
4,A\nB\nC\nPlan Category\nNumber of Securitiesto...,equity,358


In [None]:
df['label'].value_counts()

risk_factors               3831
financial_statements       3726
business                   2002
financial_conditions        702
form_10k_summary            491
executives_compensation     304
controls_procedures         277
equity                      223
market_risk                 204
executives                  161
legal_proceedings            94
security_ownership           84
properties                   81
exhibits                     77
Name: label, dtype: int64

In [None]:
data = spark.createDataFrame(df)

train, test = data.randomSplit([0.8, 0.2], seed = 100)

In [None]:
from pyspark.sql.functions import col

train.groupBy("label") \
    .count() \
    .orderBy(col("count").desc()) \
    .show()

+--------------------+-----+
|               label|count|
+--------------------+-----+
|        risk_factors| 3071|
|financial_statements| 2983|
|            business| 1582|
|financial_conditions|  597|
|    form_10k_summary|  385|
| controls_procedures|  226|
|executives_compen...|  224|
|              equity|  174|
|         market_risk|  158|
|          executives|  122|
|   legal_proceedings|   72|
|  security_ownership|   70|
|            exhibits|   62|
|          properties|   57|
+--------------------+-----+



In [None]:
from pyspark.sql.functions import col

test.groupBy("label") \
    .count() \
    .orderBy(col("count").desc()) \
    .show()

+--------------------+-----+
|               label|count|
+--------------------+-----+
|        risk_factors|  760|
|financial_statements|  743|
|            business|  420|
|    form_10k_summary|  106|
|financial_conditions|  105|
|executives_compen...|   80|
| controls_procedures|   51|
|              equity|   49|
|         market_risk|   46|
|          executives|   39|
|          properties|   24|
|   legal_proceedings|   22|
|            exhibits|   15|
|  security_ownership|   14|
+--------------------+-----+



## With Universal Sentence Encoder

In [None]:
document_assembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document") 

embeddings = UniversalSentenceEncoder.pretrained() \
      .setInputCols("document") \
      .setOutputCol("sentence_embeddings")

classifierdl = finance.ClassifierDLApproach()\
      .setInputCols(["sentence_embeddings"])\
      .setOutputCol("class")\
      .setLabelColumn("label")\
      .setMaxEpochs(30)\
      .setEnableOutputLogs(True)

clf_pipeline = Pipeline(
    stages = [
        document_assembler,
        embeddings,
        classifierdl
    ])

tfhub_use download started this may take some time.
Approximate size to download 923.7 MB
[OK!]


In [None]:
clf_pipelineModel = clf_pipeline.fit(train)

In [None]:
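import os
log_dir = "/root/annotator_logs"
# os.listdir returns entries in arbitrary order, so pick the most recently modified log
log_file_name = max(os.listdir(log_dir), key=lambda name: os.path.getmtime(os.path.join(log_dir, name)))

with open(os.path.join(log_dir, log_file_name), "r") as log_file:
    print(log_file.read())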

Training started - epochs: 30 - learning_rate: 0.005 - batch_size: 64 - training_examples: 9783 - classes: 14
Epoch 0/30 - 1.87s - loss: 324.2854 - acc: 0.5947312 - batches: 153
Epoch 1/30 - 1.54s - loss: 312.90955 - acc: 0.6878645 - batches: 153
Epoch 2/30 - 1.46s - loss: 312.57883 - acc: 0.6988636 - batches: 153
Epoch 3/30 - 1.45s - loss: 312.13092 - acc: 0.70369506 - batches: 153
Epoch 4/30 - 1.47s - loss: 311.65512 - acc: 0.70583695 - batches: 153
Epoch 5/30 - 1.46s - loss: 311.6834 - acc: 0.7087657 - batches: 153
Epoch 6/30 - 1.46s - loss: 311.56168 - acc: 0.70948523 - batches: 153
Epoch 7/30 - 1.44s - loss: 311.68964 - acc: 0.710102 - batches: 153
Epoch 8/30 - 1.43s - loss: 311.7599 - acc: 0.7110272 - batches: 153
Epoch 9/30 - 1.59s - loss: 311.73105 - acc: 0.71236354 - batches: 153
Epoch 10/30 - 1.45s - loss: 311.69553 - acc: 0.7135971 - batches: 153
Epoch 11/30 - 1.43s - loss: 311.6356 - acc: 0.7141111 - batches: 153
Epoch 12/30 - 1.43s - loss: 311.5436 - acc: 0.71452224 - batc

In [None]:
preds = clf_pipelineModel.transform(test)

In [None]:
preds_df = preds.select('label','text',"class.result").toPandas()
preds_df.head()

Unnamed: 0,label,text,result
0,financial_statements,\n\n\n \n \n 7\n million when the Company meet...,[financial_statements]
1,financial_statements,\n\n \n\n\nLevel 3 Inputs that are generally u...,[financial_statements]
2,risk_factors,\n\n \n\n\nOur products are complex and have a...,[risk_factors]
3,risk_factors,\n\n \n\n\nii Customer Support Revenue\n\n\n \...,[financial_statements]
4,form_10k_summary,\n \n\n\n \n \n\n2020\n\n \n \n\n2019\n\n \n\n...,[financial_statements]


In [None]:
# The result is an array because Spark NLP annotators can return several annotations per row.
# Here each row holds one prediction, so take the first item out of the array.
preds_df['result'] = preds_df['result'].apply(lambda x : x[0])
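
Indexing with `x[0]` raises an `IndexError` if a row ever comes back with an empty result array. A minimal, defensive sketch of the same extraction; `first_label` is a hypothetical helper, and whether empty results can occur for this pipeline is an assumption rather than something observed in this notebook.

```python
def first_label(result, default=None):
    """Return the first predicted label, or default when the array is empty."""
    # len() works for both Python lists and NumPy arrays returned by toPandas()
    return result[0] if len(result) > 0 else default

print(first_label(["risk_factors"]))  # risk_factors
print(first_label([]))                # None
```

It could be applied as `preds_df['result'].apply(first_label)`; rows that map to `None` would then need to be filtered out or given a placeholder label before calling `classification_report`.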

In [None]:
# We are going to use sklearn to evaluate the results on the test dataset
from sklearn.metrics import classification_report

print (classification_report(preds_df['label'], preds_df['result']))

                         precision    recall  f1-score   support

               business       0.70      0.82      0.75       420
    controls_procedures       0.00      0.00      0.00        51
                 equity       0.00      0.00      0.00        49
             executives       0.00      0.00      0.00        39
executives_compensation       0.00      0.00      0.00        80
               exhibits       0.00      0.00      0.00        15
   financial_conditions       0.00      0.00      0.00       105
   financial_statements       0.62      0.94      0.75       743
       form_10k_summary       0.00      0.00      0.00       106
      legal_proceedings       0.00      0.00      0.00        22
            market_risk       0.00      0.00      0.00        46
             properties       0.00      0.00      0.00        24
           risk_factors       0.78      0.87      0.82       760
     security_ownership       0.00      0.00      0.00        14

               accuracy

### Saving & loading back the trained model

In [None]:
clf_pipelineModel.stages

[DocumentAssembler_239ab07aa83f,
 UNIVERSAL_SENTENCE_ENCODER_4de71669b7ec,
 FinanceClassifierDLModel_84aea46b5365]

In [None]:
clf_pipelineModel.stages[-1].write().overwrite().save('Clf_Use')

In [None]:
# Load the saved classifier model back
ClfModel = finance.ClassifierDLModel.load('Clf_Use')

In [None]:
ld_pipeline = Pipeline(stages=[document_assembler, embeddings,ClfModel])
ld_pipeline_model = ld_pipeline.fit(spark.createDataFrame([['']]).toDF("text"))

In [None]:
# Apply Model Transform to testData
ld_preds = ld_pipeline_model.transform(test)

In [None]:
ld_preds_df = ld_preds.select('text','label',"class.result").toPandas()

In [None]:
ld_preds_df.head()

Unnamed: 0,text,label,result
0,\n\n\n \n \n 7\n million when the Company meet...,financial_statements,[financial_statements]
1,\n\n \n\n\nLevel 3 Inputs that are generally u...,financial_statements,[financial_statements]
2,\n\n \n\n\nOur products are complex and have a...,risk_factors,[risk_factors]
3,\n\n \n\n\nii Customer Support Revenue\n\n\n \...,risk_factors,[financial_statements]
4,\n \n\n\n \n \n\n2020\n\n \n \n\n2019\n\n \n\n...,form_10k_summary,[financial_statements]


## With BERT Embeddings

We do not have financial sentence embeddings yet, but we can use the financial word embeddings and average them.

In [None]:
embeddings = BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en") \
          .setInputCols(["document", "token"]) \
          .setOutputCol("embeddings")

bert_embeddings_sec_bert_base download started this may take some time.
Approximate size to download 390.4 MB
[OK!]


In [None]:
document_assembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

embeddingsSentence = SentenceEmbeddings() \
    .setInputCols(["document", "embeddings"]) \
    .setOutputCol("sentence_embeddings") \
    .setPoolingStrategy("AVERAGE")

classifierdl = finance.ClassifierDLApproach() \
      .setInputCols(["sentence_embeddings"]) \
      .setOutputCol("class") \
      .setLabelColumn("label")\
      .setMaxEpochs(8)\
      .setLr(0.001)\
      .setEnableOutputLogs(True)

clf_pipeline = Pipeline(
    stages = [
        document_assembler,
        tokenizer,
        embeddings,
        embeddingsSentence,
        classifierdl
    ])

In [None]:
clf_pipelineModel = clf_pipeline.fit(train)

In [None]:
preds = clf_pipelineModel.transform(test)

In [None]:
preds_df = preds.select('label','text',"class.result").toPandas()

In [None]:
preds_df.head()

Unnamed: 0,label,text,result
0,financial_statements,\n\n\n \n \n 7\n million when the Company meet...,[financial_statements]
1,financial_statements,\n\n \n\n\nLevel 3 Inputs that are generally u...,[financial_statements]
2,risk_factors,\n\n \n\n\nOur products are complex and have a...,[risk_factors]
3,risk_factors,\n\n \n\n\nii Customer Support Revenue\n\n\n \...,[financial_statements]
4,form_10k_summary,\n \n\n\n \n \n\n2020\n\n \n \n\n2019\n\n \n\n...,[financial_statements]


In [None]:
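log_dir = "/root/annotator_logs"
# os.listdir returns entries in arbitrary order, so sort the logs by modification time (newest first)
log_files = sorted(os.listdir(log_dir), key=lambda name: os.path.getmtime(os.path.join(log_dir, name)), reverse=True)

with open(os.path.join(log_dir, log_files[0]), "r") as log_file:
    print(log_file.read())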

Training started - epochs: 8 - learning_rate: 0.001 - batch_size: 64 - training_examples: 9783 - classes: 14
Epoch 0/8 - 1.95s - loss: 335.0635 - acc: 0.5819883 - batches: 153
Epoch 1/8 - 1.62s - loss: 318.56744 - acc: 0.7143335 - batches: 153
Epoch 2/8 - 1.62s - loss: 316.6242 - acc: 0.73285365 - batches: 153
Epoch 3/8 - 1.64s - loss: 316.10782 - acc: 0.7355263 - batches: 153
Epoch 4/8 - 1.63s - loss: 315.67447 - acc: 0.7372739 - batches: 153
Epoch 5/8 - 1.62s - loss: 315.42645 - acc: 0.73778784 - batches: 153
Epoch 6/8 - 1.59s - loss: 315.31195 - acc: 0.7383018 - batches: 153
Epoch 7/8 - 1.61s - loss: 315.2378 - acc: 0.739227 - batches: 153



In [None]:
# Take the first item out of each result array
preds_df['result'] = preds_df['result'].apply(lambda x : x[0])

from sklearn.metrics import classification_report

print (classification_report(preds_df['label'], preds_df['result']))


                         precision    recall  f1-score   support

               business       0.71      0.86      0.78       420
    controls_procedures       0.00      0.00      0.00        51
                 equity       0.00      0.00      0.00        49
             executives       0.00      0.00      0.00        39
executives_compensation       0.00      0.00      0.00        80
               exhibits       0.00      0.00      0.00        15
   financial_conditions       0.00      0.00      0.00       105
   financial_statements       0.65      0.97      0.77       743
       form_10k_summary       0.00      0.00      0.00       106
      legal_proceedings       0.00      0.00      0.00        22
            market_risk       0.00      0.00      0.00        46
             properties       0.00      0.00      0.00        24
           risk_factors       0.83      0.93      0.87       760
     security_ownership       0.00      0.00      0.00        14

               accuracy

# Save the Model and Zip It for Models Hub Upload/Download

In [None]:
# Save a Spark NLP model
clf_pipelineModel.stages[-1].write().overwrite().save('ClfBert')

# cd into saved dir and zip
! cd /content/ClfBert ; zip -r /content/ClfBert.zip *

  adding: classifierdl_tensorflow (deflated 56%)
  adding: fields/ (stored 0%)
  adding: fields/datasetParams/ (stored 0%)
  adding: fields/datasetParams/part-00002 (deflated 27%)
  adding: fields/datasetParams/part-00000 (deflated 26%)
  adding: fields/datasetParams/.part-00002.crc (stored 0%)
  adding: fields/datasetParams/.part-00003.crc (stored 0%)
  adding: fields/datasetParams/_SUCCESS (stored 0%)
  adding: fields/datasetParams/.part-00000.crc (stored 0%)
  adding: fields/datasetParams/part-00001 (deflated 27%)
  adding: fields/datasetParams/._SUCCESS.crc (stored 0%)
  adding: fields/datasetParams/part-00003 (deflated 32%)
  adding: fields/datasetParams/.part-00001.crc (stored 0%)
  adding: metadata/ (stored 0%)
  adding: metadata/part-00000 (deflated 39%)
  adding: metadata/_SUCCESS (stored 0%)
  adding: metadata/.part-00000.crc (stored 0%)
  adding: metadata/._SUCCESS.crc (stored 0%)
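
The same archive can be produced from Python instead of the shell, which may be convenient outside Colab. A minimal sketch using the standard library's `shutil.make_archive`; the helper name `zip_model_dir` is hypothetical. Unlike `zip -r ... *`, which skips top-level hidden files because of the shell glob, this includes everything in the directory.

```python
import os
import shutil

def zip_model_dir(model_dir, zip_base=None):
    """Zip the contents of model_dir and return the path of the created archive."""
    if not os.path.isdir(model_dir):
        raise NotADirectoryError(model_dir)
    # By default, write <model_dir>.zip next to the model directory
    base = zip_base or model_dir.rstrip(os.sep)
    # root_dir makes the paths inside the archive relative to the model directory
    return shutil.make_archive(base, "zip", root_dir=model_dir)
```

For the model saved above, `zip_model_dir("/content/ClfBert")` would be expected to write `/content/ClfBert.zip`.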
