# Train Legal Classifiers


![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)





[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Legal/14.Train_Legal_Classifiers.ipynb)

# Colab Setup

In [None]:
# Install the johnsnowlabs library to access Spark-OCR and Spark-NLP for Healthcare, Finance, and Legal.
! pip install johnsnowlabs 

In [None]:
from google.colab import files
print('Please upload your John Snow Labs License using the button below')
license_keys = files.upload()

In [None]:
from johnsnowlabs import * 
# After uploading your license, run this to install all licensed Python wheels and pre-download the jars for the Spark session JVM.
# Restart your notebook afterwards for the changes to take effect.
jsl.install()

## Start Spark Session

In [None]:
from johnsnowlabs import * 
# Automatically load the license data and start a session with all jars the user has access to
spark = jsl.start()

In [None]:
import pandas as pd
import warnings
from pyspark.sql import SparkSession

warnings.filterwarnings('ignore')

# Optional: start the session manually with custom parameters instead of calling jsl.start().
# SECRET, PUBLIC_VERSION and JSL_VERSION are not defined in this notebook;
# set them (for example from your license keys) before calling this function.
def start(SECRET):
    builder = SparkSession.builder \
        .appName("Spark NLP Licensed") \
        .master("local[*]") \
        .config("spark.driver.memory", "16G") \
        .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer") \
        .config("spark.kryoserializer.buffer.max", "2000M") \
        .config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp_2.12:"+PUBLIC_VERSION) \
        .config("spark.jars", "https://pypi.johnsnowlabs.com/"+SECRET+"/spark-nlp-jsl-"+JSL_VERSION+".jar")

    return builder.getOrCreate()

#spark = start(SECRET)

# Multilabel classifier training

## Loading the data

In [12]:
! wget -q https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/master/tutorials/Certification_Trainings/Legal/data/finance_data.csv

In [13]:
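import ast
import pandas as pd
df = pd.read_csv('./finance_data.csv')
# Labels are stored as stringified lists; literal_eval parses them without executing arbitrary code
df['label'] = df['label'].apply(ast.literal_eval)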
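
The `label` column comes out of the CSV as text, so each cell has to be parsed back into a Python list before Spark can treat it as an array. A minimal sketch, assuming cells look like `"['waivers', 'amendments']"` (the sample values and the `parse_labels` helper are hypothetical). `ast.literal_eval` accepts only Python literals, so a malformed cell raises an error instead of being executed:

```python
import ast

# Hypothetical raw cells, as they might appear right after pd.read_csv
raw_cells = ["['waivers', 'amendments']", "['notices']"]

def parse_labels(cell):
    """Parse a stringified list of labels into a real list of strings."""
    labels = ast.literal_eval(cell)
    if not isinstance(labels, list):
        raise ValueError(f"expected a list, got {type(labels).__name__}")
    return [str(label) for label in labels]

parsed = [parse_labels(cell) for cell in raw_cells]
print(parsed)
```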

In [14]:
data = spark.createDataFrame(df)

# If you have a single dataset, split it; otherwise load the test set the same way as the training data.
train, test = data.randomSplit([0.8, 0.2], seed = 123)
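
`randomSplit` assigns rows by random sampling against the weights, so the 80/20 proportions are approximate rather than exact (the two multilabel training logs in this notebook report 21996 and 22042 training examples). A plain-Python sketch of the idea, not Spark's actual implementation; `random_split` is a hypothetical helper:

```python
import random

def random_split(rows, weights, seed):
    """Assign each row to a split by drawing one uniform number per row."""
    total = sum(weights)
    # Cumulative upper bounds of each split's share, e.g. [0.8, 1.0]
    bounds, running = [], 0.0
    for weight in weights:
        running += weight / total
        bounds.append(running)
    bounds[-1] = 1.0  # guard against floating-point drift
    rng = random.Random(seed)
    splits = [[] for _ in weights]
    for row in rows:
        draw = rng.random()  # uniform in [0, 1)
        for index, upper in enumerate(bounds):
            if draw < upper:
                splits[index].append(row)
                break
    return splits

train_rows, test_rows = random_split(list(range(1000)), [0.8, 0.2], seed=123)
print(len(train_rows), len(test_rows))
```

The split sizes add up to the input size, but the training share only lands near 80%, which is why row counts should be read from the data rather than assumed.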

In [15]:
train.show(truncate=50)

+--------------------------------------------------+-----------------------------+
|                                         provision|                        label|
+--------------------------------------------------+-----------------------------+
|(a) Consultant or Company may terminate this Pr...|               [terminations]|
|(a) Each of Borrower and Guarantor, as applicab...|[representations, warranties]|
|(a) No amendment or waiver of any provision of ...|                 [amendments]|
|(a) No failure on the part of any Person to exe...|        [waivers, amendments]|
|(a) No failure or delay by any Agent or any Len...|        [waivers, amendments]|
|(a) No failure or delay by the Administrative A...|        [waivers, amendments]|
|(a) No failure or delay by the Administrative A...|        [waivers, amendments]|
|(a) No failure or delay by the Administrative A...|        [waivers, amendments]|
|(a) No failure or delay by the Agent or any Len...|        [waivers, amendments]|
|(a)

In [16]:
from pyspark.sql.functions import col

test.groupBy("label") \
    .count() \
    .orderBy(col("count").desc()) \
    .show()

+--------------------+-----+
|               label|count|
+--------------------+-----+
|    [governing laws]|  751|
|      [counterparts]|  580|
|           [notices]|  574|
| [entire agreements]|  571|
|      [severability]|  504|
|          [survival]|  327|
|[assigns, success...|  294|
|        [amendments]|  265|
|[waivers, amendme...|  229|
|          [expenses]|  227|
|      [terminations]|  227|
|           [waivers]|  206|
|[representations,...|  203|
|       [assignments]|  174|
|   [representations]|   88|
|[amendments, enti...|   60|
|        [successors]|   50|
|[amendments, term...|   35|
|        [warranties]|   24|
|[governing laws, ...|   13|
+--------------------+-----+
only showing top 20 rows



## With Universal Sentence Encoder

In [None]:
document_assembler = nlp.DocumentAssembler() \
      .setInputCol("provision") \
      .setOutputCol("document") \
      .setCleanupMode("shrink")

embeddings = nlp.UniversalSentenceEncoder.pretrained() \
      .setInputCols("document") \
      .setOutputCol("sentence_embeddings")

classifierdl = nlp.MultiClassifierDLApproach()\
      .setInputCols(["sentence_embeddings"])\
      .setOutputCol("class")\
      .setLabelColumn("label")\
      .setMaxEpochs(30)\
      .setEnableOutputLogs(True)

clf_pipeline = Pipeline(
    stages = [
        document_assembler,
        embeddings,
        classifierdl
    ])

tfhub_use download started this may take some time.
Approximate size to download 923.7 MB
[OK!]


In [None]:
%%time
clf_pipelineModel = clf_pipeline.fit(train)

CPU times: user 1.58 s, sys: 188 ms, total: 1.77 s
Wall time: 4min 54s


In [None]:
import os
log_file_name = os.listdir("/root/annotator_logs")[0]

with open("/root/annotator_logs/"+log_file_name, "r") as log_file :
    print(log_file.read())

Training started - epochs: 30 - learning_rate: 0.001 - batch_size: 64 - training_examples: 21996 - classes: 15
Epoch 0/30 - 10.42s - loss: 0.1665569 - acc: 0.95030993 - batches: 344
Epoch 1/30 - 7.63s - loss: 0.07149224 - acc: 0.9779483 - batches: 344
Epoch 2/30 - 7.72s - loss: 0.060517453 - acc: 0.9815101 - batches: 344
Epoch 3/30 - 7.53s - loss: 0.05544292 - acc: 0.9832641 - batches: 344
Epoch 4/30 - 7.86s - loss: 0.052485254 - acc: 0.98431796 - batches: 344
Epoch 5/30 - 7.43s - loss: 0.050475795 - acc: 0.98490715 - batches: 344
Epoch 6/30 - 7.42s - loss: 0.048972927 - acc: 0.985461 - batches: 344
Epoch 7/30 - 7.57s - loss: 0.047776755 - acc: 0.98585576 - batches: 344
Epoch 8/30 - 7.37s - loss: 0.04678503 - acc: 0.98615783 - batches: 344
Epoch 9/30 - 7.50s - loss: 0.045939386 - acc: 0.98650706 - batches: 344
Epoch 10/30 - 7.58s - loss: 0.045204334 - acc: 0.9867913 - batches: 344
Epoch 11/30 - 7.43s - loss: 0.04455569 - acc: 0.98704356 - batches: 344
Epoch 12/30 - 7.65s - loss: 0.0439
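
`os.listdir(...)[0]` returns whichever file the operating system lists first, which is not necessarily the log of the latest run once several models have been trained in the same session. A sketch that picks the most recently modified file instead; `latest_log` is a hypothetical helper:

```python
import os

def latest_log(log_dir):
    """Return the path of the most recently modified file in log_dir."""
    paths = [os.path.join(log_dir, name) for name in os.listdir(log_dir)]
    files = [path for path in paths if os.path.isfile(path)]
    if not files:
        raise FileNotFoundError(f"no log files in {log_dir}")
    return max(files, key=os.path.getmtime)

# Usage in this notebook (path taken from the cell above):
# with open(latest_log("/root/annotator_logs"), "r") as log_file:
#     print(log_file.read())
```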

In [None]:
preds = clf_pipelineModel.transform(test)

In [None]:
preds_df = preds.select('label','provision',"class.result").toPandas()
preds_df.head()

Unnamed: 0,label,provision,result
0,[warranties],(A) Seller’s Adjusted Tangible Net Worth is gr...,[]
1,[notices],(a) All notices and other communications provi...,[notices]
2,"[representations, warranties]",(a) Each of the Assignor and the Assignee here...,[representations]
3,"[waivers, terminations]","(a) Effective as of the Effective Date, the Ho...",[representations]
4,[amendments],"(a) No amendment, modification or waiver of an...",[amendments]


In [None]:
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.metrics import classification_report
from sklearn.metrics import f1_score
from sklearn.metrics import roc_auc_score

mlb = MultiLabelBinarizer()

y_true = mlb.fit_transform(preds_df['label'])
# Reuse the binarizer fitted on the true labels so both matrices share the same columns
y_pred = mlb.transform(preds_df['result'])


print("Classification report: \n", (classification_report(y_true, y_pred)))
print("F1 micro averaging:",(f1_score(y_true, y_pred, average='micro')))
print("ROC: ",(roc_auc_score(y_true, y_pred, average="micro")))


Classification report: 
               precision    recall  f1-score   support

           0       0.89      0.80      0.84       667
           1       0.87      0.44      0.58       238
           2       0.78      0.75      0.77       347
           3       0.99      0.98      0.99       558
           4       0.96      0.92      0.94       613
           5       0.98      0.95      0.97       240
           6       0.99      0.98      0.99       757
           7       0.98      0.94      0.96       604
           8       0.88      0.76      0.81       284
           9       0.98      0.95      0.96       514
          10       0.84      0.87      0.86       418
          11       0.93      0.89      0.91       319
          12       0.93      0.73      0.81       275
          13       0.88      0.75      0.81       445
          14       0.82      0.67      0.74       237

   micro avg       0.93      0.86      0.89      6516
   macro avg       0.91      0.83      0.86      6516
w
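
Micro-averaged F1 pools true positives, false positives, and false negatives over all labels before computing the score, and it is only meaningful when the true and predicted matrices use the same column order. A plain-Python sketch on hypothetical rows (the label names are borrowed from the tables above; `multi_hot` and `micro_f1` are illustrative helpers, not sklearn functions):

```python
def multi_hot(label_lists, classes):
    """Encode each list of labels as a 0/1 vector over a fixed class order."""
    return [[1 if c in labels else 0 for c in classes] for labels in label_lists]

def micro_f1(y_true, y_pred):
    """Pool TP/FP/FN over every row and label, then compute F1."""
    tp = fp = fn = 0
    for true_row, pred_row in zip(y_true, y_pred):
        for t, p in zip(true_row, pred_row):
            if t and p:
                tp += 1
            elif p:
                fp += 1
            elif t:
                fn += 1
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0

# Hypothetical true and predicted label lists
true_labels = [["waivers", "amendments"], ["notices"], ["representations", "warranties"]]
pred_labels = [["waivers"], ["notices"], ["representations", "amendments"]]

# One shared class order for both matrices
classes = sorted({label for labels in true_labels for label in labels})
score = micro_f1(multi_hot(true_labels, classes), multi_hot(pred_labels, classes))
print(round(score, 4))
```

Here the pooled counts are 3 true positives, 1 false positive, and 2 false negatives, so the score is 6 / 9.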

## With RoBERTa Embeddings

We do not have legal-specific sentence embeddings, but we can use the legal RoBERTa word embeddings and average them.
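
Averaging here means taking the element-wise mean of the token vectors in a document, which yields one fixed-size vector per document regardless of its length. A sketch with hypothetical 3-dimensional token vectors (RoBERTa base models produce 768-dimensional vectors); `average_pool` is an illustrative helper, not the Spark NLP implementation:

```python
def average_pool(token_vectors):
    """Element-wise mean of equally sized token vectors."""
    if not token_vectors:
        raise ValueError("need at least one token vector")
    dim = len(token_vectors[0])
    sums = [0.0] * dim
    for vector in token_vectors:
        if len(vector) != dim:
            raise ValueError("all token vectors must have the same size")
        for index, value in enumerate(vector):
            sums[index] += value
    return [total / len(token_vectors) for total in sums]

# Two hypothetical token embeddings for one short document
tokens = [[1.0, 2.0, 3.0], [3.0, 4.0, 5.0]]
doc_vector = average_pool(tokens)
print(doc_vector)
```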

In [5]:
embeddings = nlp.RoBertaEmbeddings.pretrained("roberta_embeddings_legal_roberta_base","en") \
    .setInputCols(["document", "token"]) \
    .setOutputCol("embeddings") \
    .setMaxSentenceLength(512)

roberta_embeddings_legal_roberta_base download started this may take some time.
Approximate size to download 447.2 MB
[OK!]


In [6]:
document_assembler = nlp.DocumentAssembler() \
    .setInputCol("provision") \
    .setOutputCol("document")

tokenizer = nlp.Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

embeddingsSentence = nlp.SentenceEmbeddings() \
    .setInputCols(["document", "embeddings"]) \
    .setOutputCol("sentence_embeddings") \
    .setPoolingStrategy("AVERAGE")

classifierdl = nlp.MultiClassifierDLApproach() \
    .setInputCols(["sentence_embeddings"]) \
    .setOutputCol("class") \
    .setLabelColumn("label")\
    .setMaxEpochs(6)\
    .setEnableOutputLogs(True)

clf_pipeline = Pipeline(
    stages = [
        document_assembler,
        tokenizer,
        embeddings,
        embeddingsSentence,
        classifierdl
    ])

In [17]:
%%time
clf_pipelineModel = clf_pipeline.fit(train)

CPU times: user 43.5 s, sys: 3.36 s, total: 46.9 s
Wall time: 4h 32min 14s


In [18]:
preds = clf_pipelineModel.transform(test)

In [19]:
preds_df = preds.select('provision','label',"class.result").toPandas()

In [20]:
preds_df.head()

Unnamed: 0,provision,label,result
0,"(a) Effective as of the Effective Date, the Ho...","[waivers, terminations]",[terminations]
1,(a) No failure or delay by the Administrative ...,"[waivers, amendments]","[waivers, amendments]"
2,(a) No failure or delay on the part of any par...,"[waivers, amendments]",[waivers]
3,"(a) Seller, the Agent, each Managing Agent, ea...",[assignments],[assignments]
4,(a) The provisions of this Agreement shall be ...,"[assigns, successors]","[successors, assigns]"
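
Label order inside a prediction carries no meaning: row 4 above predicts `[successors, assigns]` for the true label `[assigns, successors]`, which is a correct prediction. A sketch of an order-insensitive exact-match check over the five rows shown above (`exact_match_ratio` is a hypothetical helper):

```python
# (true labels, predicted labels) copied from the five rows printed above
rows = [
    (["waivers", "terminations"], ["terminations"]),
    (["waivers", "amendments"], ["waivers", "amendments"]),
    (["waivers", "amendments"], ["waivers"]),
    (["assignments"], ["assignments"]),
    (["assigns", "successors"], ["successors", "assigns"]),
]

def exact_match_ratio(pairs):
    """Share of rows whose predicted label set equals the true label set."""
    if not pairs:
        return 0.0
    hits = sum(1 for true, pred in pairs if set(true) == set(pred))
    return hits / len(pairs)

ratio = exact_match_ratio(rows)
print(ratio)
```

Three of the five rows match exactly. Exact match is stricter than the per-label metrics reported below, because a row with one missing label counts as a full miss.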


In [21]:
import os
log_file_name = os.listdir("/root/annotator_logs")[0]

with open("/root/annotator_logs/"+log_file_name, "r") as log_file :
    print(log_file.read())

Training started - epochs: 6 - learning_rate: 0.001 - batch_size: 64 - training_examples: 22042 - classes: 15
Epoch 0/6 - 6.60s - loss: 0.096051626 - acc: 0.9705064 - batches: 345
Epoch 1/6 - 4.51s - loss: 0.039551556 - acc: 0.98879695 - batches: 345
Epoch 2/6 - 4.74s - loss: 0.03486474 - acc: 0.9903383 - batches: 345
Epoch 3/6 - 4.52s - loss: 0.0321748 - acc: 0.99127585 - batches: 345
Epoch 4/6 - 4.64s - loss: 0.030249653 - acc: 0.99202967 - batches: 345
Epoch 5/6 - 4.48s - loss: 0.028734822 - acc: 0.9925955 - batches: 345
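
Each epoch line in the log follows a fixed pattern, so loss and accuracy can be extracted for plotting or for spotting a plateau. A sketch using a regular expression written against the lines shown above; the pattern is an assumption based on this output, not a documented log format, and `parse_epochs` is a hypothetical helper:

```python
import re

# Pattern inferred from the log lines printed above
EPOCH_RE = re.compile(
    r"Epoch (\d+)/(\d+) - ([\d.]+)s - loss: ([\d.eE+-]+)"
    r" - acc: ([\d.eE+-]+) - batches: (\d+)"
)

def parse_epochs(log_text):
    """Return one dict per epoch line found in the log text."""
    epochs = []
    for match in EPOCH_RE.finditer(log_text):
        epoch, _total, seconds, loss, acc, _batches = match.groups()
        epochs.append({
            "epoch": int(epoch),
            "seconds": float(seconds),
            "loss": float(loss),
            "acc": float(acc),
        })
    return epochs

# Two lines copied from the log above
sample = """Epoch 0/6 - 6.60s - loss: 0.096051626 - acc: 0.9705064 - batches: 345
Epoch 1/6 - 4.51s - loss: 0.039551556 - acc: 0.98879695 - batches: 345"""

history = parse_epochs(sample)
print(history[-1]["loss"])
```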



In [22]:
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.metrics import classification_report
from sklearn.metrics import f1_score
from sklearn.metrics import roc_auc_score

mlb = MultiLabelBinarizer()

y_true = mlb.fit_transform(preds_df['label'])
# Reuse the binarizer fitted on the true labels so both matrices share the same columns
y_pred = mlb.transform(preds_df['result'])


print("Classification report: \n", (classification_report(y_true, y_pred)))
print("F1 micro averaging:",(f1_score(y_true, y_pred, average='micro')))
print("ROC: ",(roc_auc_score(y_true, y_pred, average="micro")))


Classification report: 
               precision    recall  f1-score   support

           0       0.89      0.88      0.89       618
           1       0.84      0.55      0.66       198
           2       0.88      0.70      0.78       302
           3       0.99      0.99      0.99       587
           4       0.99      0.96      0.98       675
           5       0.98      0.94      0.96       228
           6       0.99      0.99      0.99       784
           7       0.99      0.97      0.98       574
           8       0.92      0.91      0.92       291
           9       0.98      0.97      0.98       531
          10       0.89      0.84      0.86       361
          11       0.93      0.95      0.94       329
          12       0.93      0.83      0.88       272
          13       0.95      0.76      0.84       460
          14       0.86      0.85      0.86       227

   micro avg       0.95      0.90      0.93      6437
   macro avg       0.94      0.87      0.90      6437
w

### Saving & loading back the trained model

In [23]:
clf_pipelineModel.stages

[DocumentAssembler_780bc9de15fd,
 REGEX_TOKENIZER_bfe4fc10a34e,
 ROBERTA_EMBEDDINGS_b915dff90901,
 SentenceEmbeddings_eeb0599d4c1e,
 MultiClassifierDLModel_dead500df5dd]

In [24]:
clf_pipelineModel.stages[-1].write().overwrite().save('MultilabelClfRoBerta')

In [25]:
# Load the saved multilabel classifier model back
MultilabelClfModel = nlp.MultiClassifierDLModel.load('MultilabelClfRoBerta')

In [26]:
ld_pipeline = Pipeline(stages=[document_assembler, tokenizer, embeddings, embeddingsSentence, MultilabelClfModel])
ld_pipeline_model = ld_pipeline.fit(spark.createDataFrame([['']]).toDF("provision"))

In [27]:
# Apply Model Transform to testData
ld_preds = ld_pipeline_model.transform(test)

In [28]:
ld_preds_df = ld_preds.select('provision','label',"class.result").toPandas()

In [29]:
ld_preds_df.head(10)

Unnamed: 0,provision,label,result
0,"(a) Effective as of the Effective Date, the Ho...","[waivers, terminations]",[terminations]
1,(a) No failure or delay by the Administrative ...,"[waivers, amendments]","[waivers, amendments]"
2,(a) No failure or delay on the part of any par...,"[waivers, amendments]",[waivers]
3,"(a) Seller, the Agent, each Managing Agent, ea...",[assignments],[assignments]
4,(a) The provisions of this Agreement shall be ...,"[assigns, successors]","[successors, assigns]"
5,(a) No failure or delay of the Administrative...,"[waivers, amendments]","[waivers, amendments]"
6,(a) All of the representations and warranties ...,"[representations, warranties]","[warranties, representations]"
7,(a) Any Lender may at any time assign to one o...,[assignments],[]
8,(a) Each of the Borrower and the Parent hereby...,"[representations, warranties]","[warranties, representations]"
9,(a) Except as otherwise expressly provided her...,[notices],[notices]


# Multiclass classifier training

## Loading the data

In [30]:
! wget -q https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/master/tutorials/Certification_Trainings/Legal/data/finance_clf_data.csv

In [31]:
import pandas as pd
df = pd.read_csv('/content/finance_clf_data.csv')

In [32]:
df.head()

Unnamed: 0,text,label,len
0,\nOperating\nLeases\n \nOn\nJanuary 1 2010 th...,financial_statements,465
1,the Exercise Price and is exercisable for fiv...,financial_statements,406
2,Income Taxes\n69\nTable of Contents\nWe accoun...,financial_statements,843
3,Invoice2go\n has not been required to maintain...,risk_factors,474
4,A\nB\nC\nPlan Category\nNumber of Securitiesto...,equity,358


In [33]:
df['label'].value_counts()

risk_factors               3831
financial_statements       3726
business                   2002
financial_conditions        702
form_10k_summary            491
executives_compensation     304
controls_procedures         277
equity                      223
market_risk                 204
executives                  161
legal_proceedings            94
security_ownership           84
properties                   81
exhibits                     77
Name: label, dtype: int64

In [34]:
data = spark.createDataFrame(df)

train, test = data.randomSplit([0.8, 0.2], seed = 100)

In [35]:
from pyspark.sql.functions import col

train.groupBy("label") \
    .count() \
    .orderBy(col("count").desc()) \
    .show()

+--------------------+-----+
|               label|count|
+--------------------+-----+
|        risk_factors| 3071|
|financial_statements| 2983|
|            business| 1582|
|financial_conditions|  597|
|    form_10k_summary|  385|
| controls_procedures|  226|
|executives_compen...|  224|
|              equity|  174|
|         market_risk|  158|
|          executives|  122|
|   legal_proceedings|   72|
|  security_ownership|   70|
|            exhibits|   62|
|          properties|   57|
+--------------------+-----+



In [36]:
from pyspark.sql.functions import col

test.groupBy("label") \
    .count() \
    .orderBy(col("count").desc()) \
    .show()

+--------------------+-----+
|               label|count|
+--------------------+-----+
|        risk_factors|  760|
|financial_statements|  743|
|            business|  420|
|    form_10k_summary|  106|
|financial_conditions|  105|
|executives_compen...|   80|
| controls_procedures|   51|
|              equity|   49|
|         market_risk|   46|
|          executives|   39|
|          properties|   24|
|   legal_proceedings|   22|
|            exhibits|   15|
|  security_ownership|   14|
+--------------------+-----+



## With Universal Sentence Encoder

In [None]:
document_assembler = nlp.DocumentAssembler() \
      .setInputCol("text") \
      .setOutputCol("document") 

embeddings = nlp.UniversalSentenceEncoder.pretrained() \
      .setInputCols("document") \
      .setOutputCol("sentence_embeddings")

classifierdl = legal.ClassifierDLApproach()\
      .setInputCols(["sentence_embeddings"])\
      .setOutputCol("class")\
      .setLabelColumn("label")\
      .setMaxEpochs(30)\
      .setEnableOutputLogs(True)

clf_pipeline = Pipeline(
    stages = [
        document_assembler,
        embeddings,
        classifierdl
    ])

tfhub_use download started this may take some time.
Approximate size to download 923.7 MB
[OK!]


In [None]:
clf_pipelineModel = clf_pipeline.fit(train)

In [None]:
import os
log_file_name = os.listdir("/root/annotator_logs")[0]

with open("/root/annotator_logs/"+log_file_name, "r") as log_file :
    print(log_file.read())

Training started - epochs: 30 - learning_rate: 0.005 - batch_size: 64 - training_examples: 9783 - classes: 14
Epoch 0/30 - 1.91s - loss: 378.72934 - acc: 0.31341958 - batches: 153
Epoch 1/30 - 1.60s - loss: 377.48782 - acc: 0.31598946 - batches: 153
Epoch 2/30 - 1.48s - loss: 377.48782 - acc: 0.31598946 - batches: 153
Epoch 3/30 - 1.49s - loss: 377.48782 - acc: 0.31598946 - batches: 153
Epoch 4/30 - 1.52s - loss: 377.48782 - acc: 0.31598946 - batches: 153
Epoch 5/30 - 1.50s - loss: 377.48782 - acc: 0.31598946 - batches: 153
Epoch 6/30 - 1.49s - loss: 377.48782 - acc: 0.31598946 - batches: 153
Epoch 7/30 - 1.47s - loss: 377.48782 - acc: 0.31598946 - batches: 153
Epoch 8/30 - 1.49s - loss: 377.48782 - acc: 0.31598946 - batches: 153
Epoch 9/30 - 1.81s - loss: 377.48782 - acc: 0.31598946 - batches: 153
Epoch 10/30 - 1.66s - loss: 377.48782 - acc: 0.31598946 - batches: 153
Epoch 11/30 - 1.51s - loss: 377.48782 - acc: 0.31598946 - batches: 153
Epoch 12/30 - 1.50s - loss: 377.48782 - acc: 0.3

In [None]:
preds = clf_pipelineModel.transform(test)

In [None]:
preds_df = preds.select('label','text',"class.result").toPandas()
preds_df.head()

Unnamed: 0,label,text,result
0,financial_statements,\n\n\n \n \n 7\n million when the Company meet...,[risk_factors]
1,financial_statements,\n\n \n\n\nLevel 3 Inputs that are generally u...,[risk_factors]
2,risk_factors,\n\n \n\n\nOur products are complex and have a...,[risk_factors]
3,risk_factors,\n\n \n\n\nii Customer Support Revenue\n\n\n \...,[risk_factors]
4,form_10k_summary,\n \n\n\n \n \n\n2020\n\n \n \n\n2019\n\n \n\n...,[risk_factors]


In [None]:
# The result is an array because Spark NLP can return one prediction per sentence.
# Each document has a single prediction here, so take the first element of the array.
preds_df['result'] = preds_df['result'].apply(lambda x : x[0])

In [None]:
# We are going to use sklearn to evaluate the results on the test dataset
from sklearn.metrics import classification_report

print (classification_report(preds_df['label'], preds_df['result']))

                         precision    recall  f1-score   support

               business       0.00      0.00      0.00       420
    controls_procedures       0.00      0.00      0.00        51
                 equity       0.00      0.00      0.00        49
             executives       0.00      0.00      0.00        39
executives_compensation       0.00      0.00      0.00        80
               exhibits       0.00      0.00      0.00        15
   financial_conditions       0.00      0.00      0.00       105
   financial_statements       0.00      0.00      0.00       743
       form_10k_summary       0.00      0.00      0.00       106
      legal_proceedings       0.00      0.00      0.00        22
            market_risk       0.00      0.00      0.00        46
             properties       0.00      0.00      0.00        24
           risk_factors       0.31      1.00      0.47       760
     security_ownership       0.00      0.00      0.00        14

               accuracy
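
The report above is what a classifier produces when it predicts `risk_factors` for every document: recall for that class is 1.0, its precision equals the class share of the test set, and every other class scores zero. A sketch that reproduces the `risk_factors` row from the test-set counts printed earlier; `constant_predictor_scores` is a hypothetical helper:

```python
# Test-set counts per label, copied from the groupBy output on the test split
test_counts = {
    "risk_factors": 760, "financial_statements": 743, "business": 420,
    "form_10k_summary": 106, "financial_conditions": 105,
    "executives_compensation": 80, "controls_procedures": 51, "equity": 49,
    "market_risk": 46, "executives": 39, "properties": 24,
    "legal_proceedings": 22, "exhibits": 15, "security_ownership": 14,
}

def constant_predictor_scores(counts, predicted_label):
    """Precision, recall and F1 of a model that always predicts one label."""
    total = sum(counts.values())
    true_positives = counts[predicted_label]
    precision = true_positives / total  # every document receives this label
    recall = 1.0                        # no document of this label is missed
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

precision, recall, f1 = constant_predictor_scores(test_counts, "risk_factors")
print(round(precision, 2), round(recall, 2), round(f1, 2))
```

The rounded values match the 0.31 / 1.00 / 0.47 row in the report, and the flat loss in the training log (377.48782 from epoch 1 onward) is consistent with the same collapse.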

### Saving & loading back the trained model

In [None]:
clf_pipelineModel.stages

[DocumentAssembler_4d93c6bd7374,
 UNIVERSAL_SENTENCE_ENCODER_4de71669b7ec,
 LegalClassifierDLModel_de667b6f6094]

In [None]:
clf_pipelineModel.stages[-1].write().overwrite().save('Clf_Use')

In [None]:
# Load the saved classifier model back
ClfModel = legal.ClassifierDLModel.load('Clf_Use')

In [None]:
ld_pipeline = Pipeline(stages=[document_assembler, embeddings,ClfModel])
ld_pipeline_model = ld_pipeline.fit(spark.createDataFrame([['']]).toDF("text"))

In [None]:
# Apply Model Transform to testData
ld_preds = ld_pipeline_model.transform(test)

In [None]:
ld_preds_df = ld_preds.select('text','label',"class.result").toPandas()

In [None]:
ld_preds_df.head()

Unnamed: 0,text,label,result
0,\n\n\n \n \n 7\n million when the Company meet...,financial_statements,[risk_factors]
1,\n\n \n\n\nLevel 3 Inputs that are generally u...,financial_statements,[risk_factors]
2,\n\n \n\n\nOur products are complex and have a...,risk_factors,[risk_factors]
3,\n\n \n\n\nii Customer Support Revenue\n\n\n \...,risk_factors,[risk_factors]
4,\n \n\n\n \n \n\n2020\n\n \n \n\n2019\n\n \n\n...,form_10k_summary,[risk_factors]


## With RoBERTa Embeddings

We do not have legal sentence embeddings yet, but we can use the legal RoBERTa word embeddings and average them.

In [37]:
embeddings = nlp.RoBertaEmbeddings.pretrained("roberta_embeddings_legal_roberta_base","en") \
    .setInputCols(["document", "token"]) \
    .setOutputCol("embeddings") \
    .setMaxSentenceLength(512)

roberta_embeddings_legal_roberta_base download started this may take some time.
Approximate size to download 447.2 MB
[OK!]


In [38]:
document_assembler = nlp.DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = nlp.Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

embeddingsSentence = nlp.SentenceEmbeddings() \
    .setInputCols(["document", "embeddings"]) \
    .setOutputCol("sentence_embeddings") \
    .setPoolingStrategy("AVERAGE")

classifierdl = legal.ClassifierDLApproach() \
    .setInputCols(["sentence_embeddings"]) \
    .setOutputCol("class") \
    .setLabelColumn("label")\
    .setMaxEpochs(8)\
    .setLr(0.001)\
    .setEnableOutputLogs(True)

clf_pipeline = Pipeline(
    stages = [
        document_assembler,
        tokenizer,
        embeddings,
        embeddingsSentence,
        classifierdl
    ])

In [39]:
clf_pipelineModel = clf_pipeline.fit(train)

In [40]:
preds = clf_pipelineModel.transform(test)

In [41]:
preds_df = preds.select('label','text',"class.result").toPandas()

In [42]:
preds_df.head()

Unnamed: 0,label,text,result
0,financial_statements,\n\n\n \n \n 7\n million when the Company meet...,[financial_statements]
1,financial_statements,\n\n \n\n\nLevel 3 Inputs that are generally u...,[financial_statements]
2,risk_factors,\n\n \n\n\nOur products are complex and have a...,[risk_factors]
3,risk_factors,\n\n \n\n\nii Customer Support Revenue\n\n\n \...,[financial_statements]
4,form_10k_summary,\n \n\n\n \n \n\n2020\n\n \n \n\n2019\n\n \n\n...,[financial_statements]


In [43]:
log_files = os.listdir("/root/annotator_logs")

with open("/root/annotator_logs/"+log_files[0], "r") as log_file :
    print(log_file.read())

Training started - epochs: 8 - learning_rate: 0.001 - batch_size: 64 - training_examples: 9783 - classes: 14
Epoch 0/8 - 1.31s - loss: 352.6541 - acc: 0.57318145 - batches: 153
Epoch 1/8 - 1.09s - loss: 348.1564 - acc: 0.5952826 - batches: 153
Epoch 2/8 - 1.09s - loss: 347.85394 - acc: 0.596105 - batches: 153
Epoch 3/8 - 1.08s - loss: 347.59656 - acc: 0.5982637 - batches: 153
Epoch 4/8 - 1.06s - loss: 347.57974 - acc: 0.59982246 - batches: 153
Epoch 5/8 - 1.11s - loss: 347.59897 - acc: 0.600542 - batches: 153
Epoch 6/8 - 1.10s - loss: 347.61884 - acc: 0.6012784 - batches: 153
Epoch 7/8 - 1.07s - loss: 347.6311 - acc: 0.601998 - batches: 153



In [44]:
# Each result array holds a single prediction, so take its first element
preds_df['result'] = preds_df['result'].apply(lambda x : x[0])

from sklearn.metrics import classification_report

print (classification_report(preds_df['label'], preds_df['result']))


                         precision    recall  f1-score   support

               business       0.00      0.00      0.00       420
    controls_procedures       0.00      0.00      0.00        51
                 equity       0.00      0.00      0.00        49
             executives       0.00      0.00      0.00        39
executives_compensation       0.00      0.00      0.00        80
               exhibits       0.00      0.00      0.00        15
   financial_conditions       0.00      0.00      0.00       105
   financial_statements       0.58      0.96      0.72       743
       form_10k_summary       0.00      0.00      0.00       106
      legal_proceedings       0.00      0.00      0.00        22
            market_risk       0.00      0.00      0.00        46
             properties       0.00      0.00      0.00        24
           risk_factors       0.59      0.97      0.73       760
     security_ownership       0.00      0.00      0.00        14

               accuracy

# Save the model and zip it for Models Hub upload/download

In [45]:
# Save a Spark NLP model
clf_pipelineModel.stages[-1].write().overwrite().save('ClfBert')

# cd into saved dir and zip
! cd /content/ClfBert ; zip -r /content/ClfBert.zip *

  adding: classifierdl_tensorflow (deflated 58%)
  adding: fields/ (stored 0%)
  adding: fields/datasetParams/ (stored 0%)
  adding: fields/datasetParams/.part-00003.crc (stored 0%)
  adding: fields/datasetParams/.part-00002.crc (stored 0%)
  adding: fields/datasetParams/part-00000 (deflated 27%)
  adding: fields/datasetParams/.part-00000.crc (stored 0%)
  adding: fields/datasetParams/part-00002 (deflated 27%)
  adding: fields/datasetParams/._SUCCESS.crc (stored 0%)
  adding: fields/datasetParams/part-00003 (deflated 32%)
  adding: fields/datasetParams/part-00001 (deflated 26%)
  adding: fields/datasetParams/.part-00001.crc (stored 0%)
  adding: fields/datasetParams/_SUCCESS (stored 0%)
  adding: metadata/ (stored 0%)
  adding: metadata/part-00000 (deflated 39%)
  adding: metadata/.part-00000.crc (stored 0%)
  adding: metadata/._SUCCESS.crc (stored 0%)
  adding: metadata/_SUCCESS (stored 0%)
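
The same archive can be built from Python with the standard library, which avoids the shell step. A sketch assuming the model was saved to a local directory; `zip_model_dir` is a hypothetical helper. Unlike `zip -r ... *`, `shutil.make_archive` also includes hidden files at the top level of the directory:

```python
import os
import shutil

def zip_model_dir(model_dir, zip_path):
    """Zip the contents of model_dir (not the directory itself) into zip_path."""
    base_name, extension = os.path.splitext(zip_path)
    if extension != ".zip":
        raise ValueError("zip_path must end with .zip")
    if not os.path.isdir(model_dir):
        raise NotADirectoryError(model_dir)
    # root_dir makes the archive paths relative to the model directory
    return shutil.make_archive(base_name, "zip", root_dir=model_dir)

# Usage with the paths from the cell above:
# zip_model_dir("/content/ClfBert", "/content/ClfBert.zip")
```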
