![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

# Training Multiclass Classification Models with Legal NLP

In this notebook, you will learn how to use Spark NLP and Legal NLP to train multiclass text classification models.

Let's dive in!

In [0]:
from johnsnowlabs import nlp, legal, viz

from pyspark.sql import DataFrame
import pyspark.sql.functions as F
import pyspark.sql.types as T
import pyspark.sql as SQL
from pyspark import keyword_only

## Introduction

For text classification tasks, two annotators are available:

- `MultiClassifierDL`: multilabel classification (more than one class can be predicted for each text). It uses a bidirectional GRU with a convolutional architecture built with TensorFlow and supports up to 100 classes. Its inputs are sentence embeddings such as `UniversalSentenceEncoder`, `BertSentenceEmbeddings`, or `SentenceEmbeddings`.
- `ClassifierDL`: binary and multiclass classification (up to 100 classes). It uses a deep neural network (DNN) built with TensorFlow on top of sentence embeddings such as the Universal Sentence Encoder.

This notebook uses `ClassifierDLApproach`, which trains a multiclass model: each prediction is exactly one category from the predefined set of categories present in the training data.
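
To make the multiclass versus multilabel distinction concrete, here is a minimal sketch of how the training targets differ in shape. It is plain Python for illustration only and is not how the annotators encode labels internally; the helper names `encode_multiclass` and `encode_multilabel` are hypothetical, and the class list simply mirrors the three categories used later in this notebook.

```python
# Hypothetical label set, mirroring the three categories used later in this notebook.
CLASSES = ["business", "financial_statements", "risk_factors"]


def encode_multiclass(label, classes=CLASSES):
    """One-hot target: exactly one position is 1 (one category per text)."""
    if label not in classes:
        raise ValueError(f"Unknown label: {label!r}")
    return [1 if c == label else 0 for c in classes]


def encode_multilabel(labels, classes=CLASSES):
    """Multi-hot target: any number of positions may be 1 (several categories per text)."""
    unknown = set(labels) - set(classes)
    if unknown:
        raise ValueError(f"Unknown labels: {sorted(unknown)}")
    return [1 if c in labels else 0 for c in classes]


print(encode_multiclass("risk_factors"))                # [0, 0, 1]
print(encode_multilabel(["business", "risk_factors"]))  # [1, 0, 1]
```

A multiclass model picks one position of the vector per text, while a multilabel model decides each position independently.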

## Loading the data

In [0]:
! wget -q https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/master/finance-nlp/data/finance_clf_data.csv
dbutils.fs.cp("file:/databricks/driver/finance_clf_data.csv", "dbfs:/")

In [0]:
import pandas as pd
df = pd.read_csv('finance_clf_data.csv', encoding="utf8")
print(f"Shape of the full dataset: {df.shape}")

In [0]:
df.head()

Unnamed: 0,text,label,len
0,Presently we do not believe any U S or State r...,business,402
1,\nnetwork outages or performance degradation ...,risk_factors,496
2,Available Information\nOur reports filed with ...,business,356
3,\n 42 530\n \n \n \n \n \n 42 530\nTotal liab...,financial_statements,359
4,8\nTable of Contents\ndevelopment employee eng...,business,582


In [0]:
df['label'].value_counts()

Deep learning models can take a while to train, so we limit the dataset to a smaller number of observations. This is enough to illustrate how to train a model with the Spark NLP and Finance NLP annotators and pipelines without a long wait.

Note that both the quality and the quantity of the training data strongly affect the trained model, so the results obtained here are for illustration purposes only. To obtain a more realistic model, please consider using the full dataset or adding extra observations from different sources.

In [0]:
from sklearn.model_selection import train_test_split

# The top 3 categories by number of observations
filter_classes = ["risk_factors", "financial_statements", "business"]

# Draw a random sample of 1000 observations (seeded so the sample is reproducible)
df = df.loc[df.label.isin(filter_classes)].sample(1000, random_state=42)

# Stratified split into train and test datasets (keeps class proportions)
train_data, test_data = train_test_split(
    df, train_size=0.9, stratify=df.label, random_state=42
)

# Convert the pandas DataFrames to Spark DataFrames
train = spark.createDataFrame(train_data)
test = spark.createDataFrame(test_data)
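
The idea behind the stratified split above can be sketched in plain Python: split each class separately so that the train and test sets keep roughly the same class proportions. This is an illustrative sketch, not scikit-learn's implementation; the function `stratified_split` and the toy rows are assumptions made for the example.

```python
import random
from collections import Counter, defaultdict


def stratified_split(rows, label_key, train_size=0.9, seed=42):
    """Split rows per class so both parts keep similar class proportions."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[label_key]].append(row)

    train_part, test_part = [], []
    for label in sorted(by_label):
        group = by_label[label]
        rng.shuffle(group)                      # shuffle within the class
        cut = round(len(group) * train_size)    # per-class train count
        train_part.extend(group[:cut])
        test_part.extend(group[cut:])
    return train_part, test_part


# Toy data: 50 / 30 / 20 observations per class (made up for illustration).
rows = (
    [{"text": f"b{i}", "label": "business"} for i in range(50)]
    + [{"text": f"r{i}", "label": "risk_factors"} for i in range(30)]
    + [{"text": f"f{i}", "label": "financial_statements"} for i in range(20)]
)

train_rows, test_rows = stratified_split(rows, "label")
train_counts = Counter(r["label"] for r in train_rows)
test_counts = Counter(r["label"] for r in test_rows)
print(train_counts)
print(test_counts)
```

With a 90/10 split, each class contributes 90% of its rows to training and the rest to testing, which is what the two `groupBy("label").count()` cells in this notebook let you verify on the real data.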

In [0]:
train.groupBy("label").count().orderBy(F.col("count").desc()).show()

In [0]:
test.groupBy("label").count().orderBy(F.col("count").desc()).show()

## With Universal Sentence Encoder

In [0]:
%fs mkdirs file:/dbfs/multiclass_use

In [0]:
document_assembler = (
    nlp.DocumentAssembler().setInputCol("text").setOutputCol("document")
)

embeddings = (
    nlp.UniversalSentenceEncoder.pretrained()
    .setInputCols("document")
    .setOutputCol("sentence_embeddings")
)

classifierdl = (
    legal.ClassifierDLApproach()
    .setInputCols(["sentence_embeddings"])
    .setOutputCol("class")
    .setLabelColumn("label")
    .setMaxEpochs(30)
    .setEnableOutputLogs(True)
    .setOutputLogsPath("file:/dbfs/multiclass_use")
    .setLr(0.001)
    .setBatchSize(4)
    .setDropout(0.15)
)

clf_pipeline = nlp.Pipeline(stages=[document_assembler, embeddings, classifierdl])

In [0]:
%%time
clf_pipelineModel = clf_pipeline.fit(train)

In [0]:
import os

# Read the first log file found in the output logs folder
log_file_name = os.listdir("/dbfs/multiclass_use")[0]

with open(os.path.join("/dbfs/multiclass_use", log_file_name), "r") as log_file:
    print(log_file.read())
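
If the logs folder accumulates files from several training runs, taking the first entry of `os.listdir` may not return the latest log, since the listing order is arbitrary. A minimal sketch of selecting the most recently modified file instead follows; `latest_file` is a hypothetical helper, and a temporary directory stands in for the log folder.

```python
import os
import tempfile


def latest_file(directory):
    """Return the path of the most recently modified regular file in `directory`."""
    candidates = [os.path.join(directory, name) for name in os.listdir(directory)]
    files = [path for path in candidates if os.path.isfile(path)]
    if not files:
        raise FileNotFoundError(f"No files found in {directory!r}")
    return max(files, key=os.path.getmtime)


# Demo on a temporary directory standing in for the log folder.
with tempfile.TemporaryDirectory() as log_dir:
    for name, mtime in [("old.log", 1_000_000_000), ("new.log", 2_000_000_000)]:
        path = os.path.join(log_dir, name)
        with open(path, "w") as handle:
            handle.write(f"contents of {name}\n")
        os.utime(path, (mtime, mtime))  # set access and modification times explicitly
    newest_name = os.path.basename(latest_file(log_dir))

print(newest_name)  # new.log
```

In this notebook the call would presumably be `latest_file("/dbfs/multiclass_use")`, assuming the folder is reachable through the local `/dbfs` mount as in the cell above.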

In [0]:
preds = clf_pipelineModel.transform(test)

In [0]:
preds_df = preds.select("label", "text", "class.result").toPandas()
preds_df.head()

Unnamed: 0,label,text,result
0,financial_statements,102\nSprout Social Inc \nNotes to Consolidated...,[financial_statements]
1,financial_statements,s RSM US LLP\nWe have served as the Company s...,[financial_statements]
2,business,\nPerfect Audience\n \nThe Perfect Audience p...,[business]
3,business,During each of the last few years sales of lic...,[risk_factors]
4,risk_factors,We currently serve our customers from third pa...,[risk_factors]


In [0]:
# `result` is an array because Spark NLP can return one annotation per sentence.
# Each text here is a single document, so we keep the first (and only) item.
preds_df['result'] = preds_df['result'].apply(lambda x: x[0])
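
Indexing with `x[0]` raises an error if a row has an empty or missing prediction. A slightly more defensive variant is sketched below in plain Python; `first_or_none` is a hypothetical helper, and the length check is used instead of a truthiness test because array columns may arrive as NumPy arrays after `toPandas()`.

```python
def first_or_none(values):
    """Return the first element of a list-like prediction, or None if it is empty or missing."""
    if values is None or len(values) == 0:
        return None
    return values[0]


# Made-up examples of what the `result` column may contain.
results = [["business"], ["risk_factors"], [], None]
unwrapped = [first_or_none(r) for r in results]
print(unwrapped)  # ['business', 'risk_factors', None, None]
```

It could be applied as `preds_df['result'].apply(first_or_none)`, after which rows with `None` can be inspected or dropped before computing metrics.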

In [0]:
# We are going to use sklearn to evaluate the results on the test dataset
from sklearn.metrics import classification_report

print(classification_report(preds_df["label"], preds_df["result"]))
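
For readers who want to see what the report contains, the per-class precision, recall, and F1 can be computed by hand. The sketch below is a plain-Python illustration rather than scikit-learn's implementation; `per_class_metrics` is a hypothetical helper, and the inputs are the five rows shown in the `preds_df.head()` output above, not the full test set.

```python
from collections import Counter


def per_class_metrics(y_true, y_pred):
    """Compute precision, recall, F1 and support for each class."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gets a false positive
            fn[truth] += 1  # true class gets a false negative

    metrics = {}
    for label in sorted(set(y_true) | set(y_pred)):
        predicted = tp[label] + fp[label]
        actual = tp[label] + fn[label]
        precision = tp[label] / predicted if predicted else 0.0
        recall = tp[label] / actual if actual else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        metrics[label] = {
            "precision": precision,
            "recall": recall,
            "f1": f1,
            "support": actual,
        }
    return metrics


# The five rows displayed by preds_df.head() above.
y_true = ["financial_statements", "financial_statements", "business", "business", "risk_factors"]
y_pred = ["financial_statements", "financial_statements", "business", "risk_factors", "risk_factors"]

report = per_class_metrics(y_true, y_pred)
for label, values in report.items():
    print(label, values)
```

On these five rows, `business` has perfect precision but a recall of 0.5, because one `business` text was predicted as `risk_factors`.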

## With RoBERTa Embeddings

In [0]:
%fs mkdirs file:/dbfs/multiclass_roberta

Legal Sentence Embeddings are not available yet, but we can use the Legal RoBERTa token embeddings and average them into a single sentence embedding.

In [0]:
document_assembler = (
    nlp.DocumentAssembler().setInputCol("text").setOutputCol("document")
)

tokenizer = nlp.Tokenizer().setInputCols(["document"]).setOutputCol("token")

embeddings = (
    nlp.RoBertaEmbeddings.pretrained("roberta_embeddings_legal_roberta_base", "en")
    .setInputCols(["document", "token"])
    .setOutputCol("embeddings")
    .setMaxSentenceLength(512)
)

embeddingsSentence = (
    nlp.SentenceEmbeddings()
    .setInputCols(["document", "embeddings"])
    .setOutputCol("sentence_embeddings")
    .setPoolingStrategy("AVERAGE")
)

classifierdl = (
    legal.ClassifierDLApproach()
    .setInputCols(["sentence_embeddings"])
    .setOutputCol("class")
    .setLabelColumn("label")
    .setMaxEpochs(3)
    .setLr(0.001)
    .setEnableOutputLogs(True)
    .setOutputLogsPath("dbfs:/multiclass_roberta")
    .setBatchSize(4)
    .setDropout(0.15)
)

clf_pipeline_roberta = nlp.Pipeline(
    stages=[document_assembler, tokenizer, embeddings, embeddingsSentence, classifierdl]
)
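
The `AVERAGE` pooling strategy used in the pipeline above can be pictured as an element-wise mean over the token vectors of a document. The following is a minimal plain-Python sketch of that idea, not the annotator's actual implementation; `average_pool` and the tiny three-dimensional vectors are made up for illustration.

```python
def average_pool(token_vectors):
    """Average a list of equal-length token vectors into one sentence vector."""
    if not token_vectors:
        raise ValueError("At least one token vector is required")
    dim = len(token_vectors[0])
    if any(len(vec) != dim for vec in token_vectors):
        raise ValueError("All token vectors must have the same dimension")
    count = len(token_vectors)
    # Mean of each dimension across all tokens.
    return [sum(vec[i] for vec in token_vectors) / count for i in range(dim)]


# Three toy token embeddings of dimension 3.
tokens = [[1.0, 2.0, 3.0], [3.0, 4.0, 5.0], [5.0, 6.0, 7.0]]
sentence_vector = average_pool(tokens)
print(sentence_vector)  # [3.0, 4.0, 5.0]
```

The resulting vector has the same dimension as each token embedding, which is why the classifier can consume it in place of a dedicated sentence embedding.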

In [0]:
%%time
clf_pipelineModel = clf_pipeline_roberta.fit(train)

In [0]:
import os

# Read the first log file found in the output logs folder
log_files = os.listdir("/dbfs/multiclass_roberta")

with open(os.path.join("/dbfs/multiclass_roberta", log_files[0]), "r") as log_file:
    print(log_file.read())

In [0]:
preds = clf_pipelineModel.transform(test)

In [0]:
preds_df = preds.select("label", "text", "class.result").toPandas()

In [0]:
preds_df.head()

Unnamed: 0,label,text,result
0,financial_statements,102\nSprout Social Inc \nNotes to Consolidated...,[financial_statements]
1,financial_statements,s RSM US LLP\nWe have served as the Company s...,[financial_statements]
2,business,\nPerfect Audience\n \nThe Perfect Audience p...,[business]
3,business,During each of the last few years sales of lic...,[risk_factors]
4,risk_factors,We currently serve our customers from third pa...,[risk_factors]


In [0]:
# `result` is an array; keep the first (and only) item of each row
preds_df['result'] = preds_df['result'].apply(lambda x: x[0])

from sklearn.metrics import classification_report

print(classification_report(preds_df['label'], preds_df['result']))


## Save the Model and Zip It for Models Hub Upload/Download

In [0]:
# Save the trained classifier, which is the last stage of the fitted pipeline
clf_pipelineModel.stages[-1].write().overwrite().save('dbfs:/databricks/driver/models/ClfBert')
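
The cell above only saves the model. One possible way to zip the saved folder is sketched below with the standard library. It assumes the model directory is reachable as a local path (on Databricks, presumably through the `/dbfs/...` mount rather than a `dbfs:/` URI); `zip_directory` is a hypothetical helper, and the temporary folder with a single placeholder file only stands in for a real saved model.

```python
import os
import shutil
import tempfile
import zipfile


def zip_directory(source_dir, archive_base):
    """Zip the contents of `source_dir` into `<archive_base>.zip` and return the archive path."""
    if not os.path.isdir(source_dir):
        raise NotADirectoryError(source_dir)
    return shutil.make_archive(archive_base, "zip", root_dir=source_dir)


# Demo: a temporary folder with a placeholder file stands in for the saved model.
with tempfile.TemporaryDirectory() as workdir:
    model_dir = os.path.join(workdir, "ClfBert")
    os.makedirs(os.path.join(model_dir, "metadata"))
    with open(os.path.join(model_dir, "metadata", "part-00000"), "w") as handle:
        handle.write("{}\n")

    archive_path = zip_directory(model_dir, os.path.join(workdir, "ClfBert_archive"))
    archive_name = os.path.basename(archive_path)
    with zipfile.ZipFile(archive_path) as archive:
        archived_names = archive.namelist()

print(archive_name)
print(archived_names)
```

Writing the archive next to the model folder, rather than inside it, avoids the archive trying to include itself.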