![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

# Train Domain-specific Multiclass Classifiers

In this notebook, you will learn how to use Spark NLP and Finance NLP to train custom multiclass classification models.

In [None]:
from johnsnowlabs import nlp, finance, viz

## Introduction

Although John Snow Labs provides many pretrained models that cover different applications in the financial domain, some problems remain specific to individual companies or practitioners. For such cases, it is possible to train a new custom model using Finance NLP annotators:

- `ClassifierDLApproach`: Trains a multiclass model (predicts one class out of a predefined set of classes) or a binary classifier
- `MultiClassifierDLApproach`: Trains a multilabel model (predicts one or more classes for each document)

The input to both annotators is sentence embeddings, such as the state-of-the-art [UniversalSentenceEncoder](https://nlp.johnsnowlabs.com/docs/en/transformers#universalsentenceencoder), [BertSentenceEmbeddings](https://nlp.johnsnowlabs.com/docs/en/transformers#bertsentenceembeddings) or [SentenceEmbeddings](https://nlp.johnsnowlabs.com/docs/en/annotators#sentenceembeddings).

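To train a custom model, you need labeled data with at least the following columns (a single label per row for `ClassifierDLApproach`, a list of labels for `MultiClassifierDLApproach`):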

```
| TEXT | LABELS (list) |
```
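
As a rough illustration of the expected shape, the sketch below builds a few made-up rows and checks them. The helper `validate_rows` and the sample texts are hypothetical and not part of Spark NLP; the assumption is that a multiclass row carries one string label while a multilabel row carries a list of string labels.

```python
def validate_rows(rows, multilabel=False):
    """Return the indices of rows whose text or label has an unexpected shape."""
    bad = []
    for i, (text, label) in enumerate(rows):
        if not isinstance(text, str) or not text.strip():
            bad.append(i)  # empty or non-string text
        elif multilabel:
            # Multilabel: a non-empty list of string labels
            ok = (
                isinstance(label, list)
                and len(label) > 0
                and all(isinstance(item, str) for item in label)
            )
            if not ok:
                bad.append(i)
        elif not isinstance(label, str):
            bad.append(i)  # multiclass: exactly one string label
    return bad


# Hypothetical examples, for illustration only
multiclass_rows = [
    ("Total liabilities increased during the year", "financial_statements"),
    ("Network outages could harm our business", "risk_factors"),
]
multilabel_rows = [
    ("Network outages could harm our business", ["risk_factors", "business"]),
    ("Our reports are filed with the SEC", ["business"]),
]

print(validate_rows(multiclass_rows))
print(validate_rows(multilabel_rows, multilabel=True))
```

Running a check like this before calling `spark.createDataFrame` can surface malformed rows early, when they are still easy to inspect in pandas.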

## Multiclass Classifier Training

The `ClassifierDLApproach` annotator trains a multiclass model, where each prediction is one category out of a predefined set of categories that are present in the training data.
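
Conceptually, a multiclass classifier scores every category and returns the single highest-scoring one. The sketch below illustrates that idea with made-up scores; it is not the internal implementation of `ClassifierDLApproach`, and the functions `softmax` and `predict_one` are names chosen for this example.

```python
import math


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict_one(labels, scores):
    """Return the label with the highest probability, plus that probability."""
    probs = softmax(scores)
    best = max(range(len(labels)), key=lambda i: probs[i])
    return labels[best], probs[best]


labels = ["business", "financial_statements", "risk_factors"]
scores = [0.2, 1.5, -0.3]  # hypothetical raw scores for one document

label, prob = predict_one(labels, scores)
print(label, round(prob, 3))
```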

## Loading the data

In [None]:
! wget -q https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/master/finance-nlp/data/finance_clf_data.csv

dbutils.fs.cp("file:/databricks/driver/finance_clf_data.csv", "dbfs:/") 

In [None]:
import pandas as pd
df = pd.read_csv('finance_clf_data.csv')
print(f"Shape of the full dataset: {df.shape}")

In [None]:
df.head()

Unnamed: 0,text,label,len
0,Presently we do not believe any U S or State r...,business,402
1,\nnetwork outages or performance degradation ...,risk_factors,496
2,Available Information\nOur reports filed with ...,business,356
3,\n 42 530\n \n \n \n \n \n 42 530\nTotal liab...,financial_statements,359
4,8\nTable of Contents\ndevelopment employee eng...,business,582


In [None]:
df['label'].value_counts()

Since deep learning models can take some time to train, we will limit our dataset to a smaller number of observations. This is enough to illustrate how to use Spark NLP and Finance NLP annotators and pipelines to train the model without having to wait too long.

Please note that the quality and quantity of the training data strongly affect the trained model, and the results we obtain here are for illustration purposes only. To obtain a more realistic model, please consider using the full dataset or adding extra observations from different sources.
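
The cell below samples the data and then splits it with `stratify=df.label`. To show what stratification does to the class proportions, here is a simplified pure-Python sketch. It is a stand-in for illustration, not scikit-learn's algorithm; the function `stratified_split` and the toy label counts are assumptions made for this example.

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, train_size=0.9, seed=42):
    """Return (train_idx, test_idx), splitting each class separately."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for idxs in by_class.values():
        rng.shuffle(idxs)
        cut = round(len(idxs) * train_size)  # per-class train count
        train_idx.extend(idxs[:cut])
        test_idx.extend(idxs[cut:])
    return train_idx, test_idx


# Toy, imbalanced label column (hypothetical counts)
labels = ["business"] * 50 + ["risk_factors"] * 30 + ["financial_statements"] * 20

train_idx, test_idx = stratified_split(labels)
train_counts = Counter(labels[i] for i in train_idx)
test_counts = Counter(labels[i] for i in test_idx)
print(train_counts)
print(test_counts)
```

Because each class is split on its own, the train and test sets keep roughly the same class mix as the full sample, which matters when the test set is only 10% of 500 rows.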

In [None]:
from sklearn.model_selection import train_test_split

# The top 3 categories (number of observations) 
filter_classes = [
    "risk_factors",
    "financial_statements",
    "business"
]

# We make a random sample with 500 observations (seeded for reproducibility)
df = df.loc[df.label.isin(filter_classes)].sample(500, random_state=42)

# Stratify split for train and test datasets
train_data, test_data = train_test_split(
    df, train_size=0.9, stratify=df.label, random_state=42
)

# Send to spark
train = spark.createDataFrame(train_data) 
test = spark.createDataFrame(test_data)

In [None]:
from pyspark.sql.functions import col

train.groupBy("label").count().orderBy(col("count").desc()).show()

In [None]:
from pyspark.sql.functions import col

test.groupBy("label").count().orderBy(col("count").desc()).show()

## Train with Universal Sentence Encoder

In [None]:
%fs mkdirs file:/dbfs/multiclass_use

In [None]:
document_assembler = (
    nlp.DocumentAssembler().setInputCol("text").setOutputCol("document")
)

embeddings = (
    nlp.UniversalSentenceEncoder.pretrained()
    .setInputCols("document")
    .setOutputCol("sentence_embeddings")
)

classifierdl = (
    finance.ClassifierDLApproach()
    .setInputCols(["sentence_embeddings"])
    .setOutputCol("class")
    .setLabelColumn("label")
    .setMaxEpochs(30)
    .setEnableOutputLogs(True)
    .setOutputLogsPath("file:/dbfs/multiclass_use")
    .setLr(0.001)
    .setBatchSize(4)
    .setDropout(0.15)
)

clf_pipeline = nlp.Pipeline(stages=[document_assembler, embeddings, classifierdl])

In [None]:
%%time
clf_pipelineModel = clf_pipeline.fit(train)

In [None]:
import os
log_file_name = os.listdir("/dbfs/multiclass_use")[0]

with open("/dbfs/multiclass_use/"+log_file_name, "r") as log_file:
    print(log_file.read())
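
`os.listdir` returns entries in arbitrary order, so taking index `[0]` may not pick the newest log once the directory holds several training runs. One possible alternative, sketched here under the assumption that the most recently modified file is the one you want, is to select by modification time. The helper `latest_file` and the demo file names are hypothetical.

```python
import os
import tempfile


def latest_file(directory):
    """Return the path of the most recently modified regular file, or None."""
    paths = [os.path.join(directory, name) for name in os.listdir(directory)]
    files = [p for p in paths if os.path.isfile(p)]
    return max(files, key=os.path.getmtime) if files else None


# Demo in a throwaway directory with two made-up log files
with tempfile.TemporaryDirectory() as tmp:
    for name, mtime in [("old.log", 1_000_000_000), ("new.log", 2_000_000_000)]:
        path = os.path.join(tmp, name)
        with open(path, "w") as fh:
            fh.write(name)
        os.utime(path, (mtime, mtime))  # set access and modification times explicitly
    newest = os.path.basename(latest_file(tmp))

print(newest)
```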

In [None]:
preds = clf_pipelineModel.transform(test)

In [None]:
preds_df = preds.select("label", "text", "class.result").toPandas()
preds_df.head()

Unnamed: 0,label,text,result
0,financial_statements,740 In addition the Company not record a cumul...,[financial_statements]
1,risk_factors,Failure to effectively develop and expand our ...,[risk_factors]
2,risk_factors,valuable management resources improving our op...,[risk_factors]
3,business,Overall Trends in the TMT Market\n \nConvergen...,[business]
4,financial_statements,As of January 31 2021 no amounts were outstand...,[financial_statements]


In [None]:
# The result is an array since in Spark NLP you can have multiple sentences.
# Here each document yields a single prediction, so we keep the first item of the array
preds_df['result'] = preds_df['result'].apply(lambda x: x[0])
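
Indexing with `x[0]` raises an `IndexError` if a row ever comes back with an empty array. A more defensive variant might look like the sketch below; the helper `first_label` and the `"unknown"` fallback are illustrative choices, not part of Spark NLP.

```python
def first_label(result, default=None):
    """Return the first predicted label, or `default` if the array is empty or missing."""
    if not result:
        return default
    return result[0]


# Hypothetical prediction arrays, including an empty and a missing one
results = [["business"], ["risk_factors"], [], None]
flat = [first_label(r, default="unknown") for r in results]
print(flat)
```

With a pandas column, the same helper can be passed to `apply` in place of the lambda.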

In [None]:
# We are going to use sklearn to evaluate the results on the test dataset
from sklearn.metrics import classification_report

print(classification_report(preds_df['label'], preds_df['result']))
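
For readers who want to see what the per-class numbers in such a report mean, the sketch below computes precision, recall and F1 by hand on a tiny made-up example. It is a simplified illustration rather than scikit-learn's implementation, and `per_class_metrics` is a name introduced here.

```python
from collections import Counter


def per_class_metrics(y_true, y_pred):
    """Compute precision, recall and F1 for every label seen in either list."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted this label, but it was wrong
            fn[truth] += 1  # missed the true label

    metrics = {}
    for label in sorted(set(y_true) | set(y_pred)):
        predicted = tp[label] + fp[label]
        actual = tp[label] + fn[label]
        precision = tp[label] / predicted if predicted else 0.0
        recall = tp[label] / actual if actual else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        metrics[label] = {"precision": precision, "recall": recall, "f1": f1}
    return metrics


# Hypothetical labels and predictions
y_true = ["business", "business", "risk_factors", "risk_factors", "financial_statements"]
y_pred = ["business", "financial_statements", "risk_factors", "risk_factors", "financial_statements"]

metrics = per_class_metrics(y_true, y_pred)
for label, values in metrics.items():
    print(label, {k: round(v, 2) for k, v in values.items()})
```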

### Saving & loading back the trained model

In [None]:
clf_pipelineModel.stages

In [None]:
clf_pipelineModel.stages[-1].write().overwrite().save('dbfs:/databricks/driver/models/Clf_Use')

In [None]:
# Load back the saved classifier model
ClfModel = finance.ClassifierDLModel.load('dbfs:/databricks/driver/models/Clf_Use')

In [None]:
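# Every stage is already trained, so fitting on an empty DataFrame
# only assembles the stages into a PipelineModel
ld_pipeline = nlp.Pipeline(stages=[document_assembler, embeddings, ClfModel])
ld_pipeline_model = ld_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))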

In [None]:
# Apply Model Transform to testData
ld_preds = ld_pipeline_model.transform(test)

In [None]:
ld_preds_df = ld_preds.select("text", "label", "class.result").toPandas()

In [None]:
ld_preds_df.head()

Unnamed: 0,text,label,result
0,740 In addition the Company not record a cumul...,financial_statements,[financial_statements]
1,Failure to effectively develop and expand our ...,risk_factors,[risk_factors]
2,valuable management resources improving our op...,risk_factors,[risk_factors]
3,Overall Trends in the TMT Market\n \nConvergen...,business,[business]
4,As of January 31 2021 no amounts were outstand...,financial_statements,[financial_statements]


## Train with BERT Embeddings

We do not have Financial Sentence Embeddings yet, but we can use the Financial Word Embeddings and average them over each document. Since this model takes a long time to train, we will train for only one epoch.
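
The averaging step can be pictured with a small pure-Python sketch. It uses toy 3-dimensional vectors (real BERT word embeddings are much higher-dimensional), and `average_pool` is a name made up for this example rather than a Spark NLP function.

```python
def average_pool(token_vectors):
    """Average a list of equal-length token vectors into one sentence vector."""
    if not token_vectors:
        raise ValueError("need at least one token vector")
    dim = len(token_vectors[0])
    sums = [0.0] * dim
    for vec in token_vectors:
        if len(vec) != dim:
            raise ValueError("all token vectors must have the same length")
        for j, value in enumerate(vec):
            sums[j] += value
    count = len(token_vectors)
    return [s / count for s in sums]


# Toy word embeddings for a three-token document (hypothetical values)
tokens = [
    [1.0, 2.0, 3.0],
    [3.0, 2.0, 1.0],
    [2.0, 2.0, 2.0],
]

sentence_vector = average_pool(tokens)
print(sentence_vector)
```

In the pipeline, this role is played by the `SentenceEmbeddings` annotator with `setPoolingStrategy("AVERAGE")`, which turns the per-token output of `BertEmbeddings` into one vector per document.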

In [None]:
%fs mkdirs file:/dbfs/multiclass_bert

In [None]:
embeddings = (
    nlp.BertEmbeddings.pretrained("bert_embeddings_sec_bert_base", "en")
    .setInputCols(["document", "token"])
    .setOutputCol("embeddings")
)

In [None]:
document_assembler = (
    nlp.DocumentAssembler().setInputCol("text").setOutputCol("document")
)

tokenizer = nlp.Tokenizer().setInputCols(["document"]).setOutputCol("token")

embeddingsSentence = (
    nlp.SentenceEmbeddings()
    .setInputCols(["document", "embeddings"])
    .setOutputCol("sentence_embeddings")
    .setPoolingStrategy("AVERAGE")
)

classifierdl = (
    finance.ClassifierDLApproach()
    .setInputCols(["sentence_embeddings"])
    .setOutputCol("class")
    .setLabelColumn("label")
    .setMaxEpochs(1)
    .setLr(0.001)
    .setEnableOutputLogs(True)
    .setOutputLogsPath("dbfs:/multiclass_bert")
    .setBatchSize(4)
    .setDropout(0.15)
)

clf_pipeline = nlp.Pipeline(
    stages=[document_assembler, tokenizer, embeddings, embeddingsSentence, classifierdl]
)

In [None]:
%%time
clf_pipelineModel = clf_pipeline.fit(train)

In [None]:
import os

log_files = os.listdir("/dbfs/multiclass_bert")
with open("/dbfs/multiclass_bert/" + log_files[0], "r") as log_file:
    print(log_file.read())

In [None]:
preds = clf_pipelineModel.transform(test)

In [None]:
preds_df = preds.select("label", "text", "class.result").toPandas()

In [None]:
preds_df.head()

Unnamed: 0,label,text,result
0,financial_statements,740 In addition the Company not record a cumul...,[financial_statements]
1,risk_factors,Failure to effectively develop and expand our ...,[risk_factors]
2,risk_factors,valuable management resources improving our op...,[risk_factors]
3,business,Overall Trends in the TMT Market\n \nConvergen...,[financial_statements]
4,financial_statements,As of January 31 2021 no amounts were outstand...,[financial_statements]


In [None]:
# Each result is an array; keep its first item as the predicted label
preds_df['result'] = preds_df['result'].apply(lambda x: x[0])

from sklearn.metrics import classification_report

print(classification_report(preds_df['label'], preds_df['result']))


### Save model

In [None]:
# Save a Spark NLP model
clf_pipelineModel.stages[-1].write().overwrite().save('dbfs:/databricks/driver/models/MultiClfBert')
