![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/nlu/blob/master/examples/colab/Training/multi_label_text_classification/NLU_traing_multi_label_classifier_E2e.ipynb)



# Training a Deep Learning Classifier for multi-label prediction
MultiClassifierDL is a multi-label text classifier. It uses a Bidirectional GRU with convolution model built in TensorFlow and supports up to 100 classes. Its input is sentence embeddings, such as the state-of-the-art UniversalSentenceEncoder, BertSentenceEmbeddings, or SentenceEmbeddings.



### MultiClassifierDL (multi-label text classification: multiple classes per sentence)
With the [MultiClassifierDL model](https://nlp.johnsnowlabs.com/docs/en/annotators#multiclassifierdl-multi-label-text-classification) from Spark NLP you can achieve state-of-the-art results on multi-label text classification problems.
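
To make "multiple classes per sentence" concrete, here is a minimal sketch of how multi-label targets are commonly represented: one 0/1 column per distinct label. The helper names and the two example rows are illustrative assumptions (the labels are written in the style of the E2E dataset used in this notebook); this is not how MultiClassifierDL encodes labels internally.

```python
# Illustrative only: encode multi-label targets as binary indicator vectors.

def build_label_index(label_sets):
    """Map every distinct label to a column position (sorted for a stable order)."""
    distinct = sorted({label for labels in label_sets for label in labels})
    return {label: position for position, label in enumerate(distinct)}


def to_indicator(labels, label_index):
    """Return a 0/1 vector with a 1 for every label attached to the sentence."""
    vector = [0] * len(label_index)
    for label in labels:
        vector[label_index[label]] = 1
    return vector


# Hypothetical rows: each sentence carries several labels at once.
label_sets = [
    ["name[Blue Spice]", "eatType[coffee shop]"],
    ["name[The Punter]", "eatType[restaurant]", "food[Indian]"],
]

label_index = build_label_index(label_sets)
vectors = [to_indicator(labels, label_index) for labels in label_sets]

for labels, vector in zip(label_sets, vectors):
    print(vector, labels)
```

Unlike single-label (multi-class) classification, more than one position in each vector can be 1.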

This notebook shows:

- How to train the deep learning classifier
- How to store a pipeline to disk
- How to load the pipeline from disk (Enables NLU offline mode)



# 1. Install Java 8 and NLU

In [None]:
!wget https://setup.johnsnowlabs.com/nlu/colab.sh -O - | bash
  

import nlu

# 2. Download the E2E Challenge multi-label classification dataset

http://www.macs.hw.ac.uk/InteractionLab/E2E/

In [None]:
import pandas as pd
!wget http://ckl-it.de/wp-content/uploads/2020/12/e2e.csv
# wget saves the file into the Colab working directory (/content)
data_path = '/content/e2e.csv'
train_df = pd.read_csv(data_path)
# Keep only the first 3000 rows
train_df = train_df.iloc[:3000]
train_df

--2021-01-01 19:37:17--  http://ckl-it.de/wp-content/uploads/2020/12/e2e.csv
Resolving ckl-it.de (ckl-it.de)... 217.160.0.108, 2001:8d8:100f:f000::209
Connecting to ckl-it.de (ckl-it.de)|217.160.0.108|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1322591 (1.3M) [text/csv]
Saving to: ‘e2e.csv’


2021-01-01 19:37:20 (715 KB/s) - ‘e2e.csv’ saved [1322591/1322591]



Unnamed: 0.1,Unnamed: 0,y,text,origin_index
0,0,"name[Blue Spice],eatType[coffee shop],area[cit...",A coffee shop in the city centre area called B...,0
1,1,"name[Blue Spice],eatType[coffee shop],area[cit...",Blue Spice is a coffee shop in city centre.,1
2,2,"name[Blue Spice],eatType[coffee shop],area[riv...",There is a coffee shop Blue Spice in the river...,2
3,3,"name[Blue Spice],eatType[coffee shop],area[riv...","At the riverside, there is a coffee shop calle...",3
4,4,"name[Blue Spice],eatType[coffee shop],customer...",The coffee shop Blue Spice is based near Crown...,4
...,...,...,...,...
2995,2995,"name[The Punter],eatType[restaurant],food[Indi...","Near Express by Holiday Inn, in the riverside ...",2995
2996,2996,"name[The Punter],eatType[restaurant],food[Indi...","In the riverside area, near Express by Holiday...",2996
2997,2997,"name[The Punter],eatType[restaurant],food[Indi...",The Punter is a restaurant with Indian food in...,2997
2998,2998,"name[The Punter],eatType[restaurant],food[Indi...",The Punter is a low rated restaurant that serv...,2998


# 3. Train Deep Learning Classifier using nlu.load('train.multi_classifier')

By default, the Universal Sentence Encoder (USE) embeddings are downloaded to provide embeddings for the classifier. You can use any of the 50+ other sentence embeddings in NLU, though!

Your dataset's label column should be named 'y' and the feature column with the text data should be named 'text'.
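
Before fitting, it can help to confirm that every row really has a usable 'y' and 'text' value. The sketch below is an assumption-laden illustration, not part of NLU: `split_labels` and `check_rows` are hypothetical helper names, and the sample rows are made up. It only assumes what this notebook states, namely that labels sit in one string separated by ','.

```python
# Hypothetical pre-flight check for the 'y' / 'text' layout described above.

def split_labels(label_string, separator=","):
    """Split a raw 'y' value into a list of stripped, non-empty labels."""
    return [part.strip() for part in label_string.split(separator) if part.strip()]


def check_rows(rows, separator=","):
    """Return the positions of rows that lack a usable 'text' or 'y' value."""
    bad_positions = []
    for position, row in enumerate(rows):
        text = row.get("text")
        labels = row.get("y")
        if not isinstance(text, str) or not text.strip():
            bad_positions.append(position)
        elif not isinstance(labels, str) or not split_labels(labels, separator):
            bad_positions.append(position)
    return bad_positions


# Made-up rows: the second has no text, the third has no labels.
rows = [
    {"y": "name[Blue Spice],eatType[coffee shop]", "text": "Blue Spice is a coffee shop in city centre."},
    {"y": "name[The Punter],food[Indian]", "text": "   "},
    {"y": "", "text": "It is located in the riverside."},
]

print(split_labels(rows[0]["y"]))
print(check_rows(rows))
```

With pandas, the same check could be run as `check_rows(train_df[['y', 'text']].to_dict('records'))`. Note that a label which itself contains the separator would be split in two, so the separator should be a character that never occurs inside a label.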

In [None]:
import nlu
# Load a trainable pipeline by specifying the 'train.' prefix
unfitted_pipe = nlu.load('train.multi_classifier')
# Configure the number of training epochs
unfitted_pipe['trainable_multi_classifier_dl'].setMaxEpochs(25)
# Fit it on a dataset with the label column 'y' and the text column 'text'. Labels are separated by ','
# (the keyword argument is spelled label_seperator in this notebook)
fitted_pipe = unfitted_pipe.fit(train_df[['y','text']], label_seperator=',')

# Predict with the trained pipeline on the dataset and get the predictions
preds = fitted_pipe.predict(train_df[['y','text']])
preds

tfhub_use download started this may take some time.
Approximate size to download 923.7 MB
[OK!]


origin_index,multi_classifier_classes,multi_classifier_confidences,default_name_embeddings,y,sentence,text
0,"[near[Café Rouge], name[Blue Spice], near[Rain...","[0.8555223, 0.99276984, 0.87128675, 0.9852337,...","[0.026563657447695732, -0.058662936091423035, ...","name[Blue Spice],eatType[coffee shop],area[cit...",A coffee shop in the city centre area called B...,A coffee shop in the city centre area called B...
1,"[near[Café Rouge], name[Blue Spice], near[Rain...","[0.8142674, 0.99920505, 0.93413615, 0.98056525...","[0.040952689945697784, -0.04276810586452484, -...","name[Blue Spice],eatType[coffee shop],area[cit...",Blue Spice is a coffee shop in city centre.,Blue Spice is a coffee shop in city centre.
2,"[name[Blue Spice], near[Rainbow Vegetarian Caf...","[0.9966337, 0.9044244, 0.904881, 0.56231284, 0...","[0.03141527622938156, -0.05154882371425629, 0....","name[Blue Spice],eatType[coffee shop],area[riv...",There is a coffee shop Blue Spice in the river...,There is a coffee shop Blue Spice in the river...
3,"[near[Café Rouge], name[Blue Spice], near[Rain...","[0.5227911, 0.99917483, 0.9394022, 0.8839797, ...","[0.03584946319460869, -0.036898739635944366, -...","name[Blue Spice],eatType[coffee shop],area[riv...","At the riverside, there is a coffee shop calle...","At the riverside, there is a coffee shop calle..."
4,"[near[Café Rouge], name[Blue Spice], near[Crow...","[0.5985904, 0.7892299, 0.8222753, 0.9378743, 0...","[0.0405426099896431, -0.0243277158588171, 0.00...","name[Blue Spice],eatType[coffee shop],customer...",The coffee shop Blue Spice is based near Crown...,The coffee shop Blue Spice is based near Crown...
...,...,...,...,...,...,...
2998,"[near[Express by Holiday Inn], priceRange[high...","[0.9999982, 0.8146039, 0.99978125, 0.8511795, ...","[0.05956212058663368, 0.019028551876544952, -0...","name[The Punter],eatType[restaurant],food[Indi...","The Punter has a price range of less than £20,...",The Punter is a low rated restaurant that serv...
2999,"[near[Express by Holiday Inn], food[Indian], c...","[0.99992794, 0.99981034, 0.5099642, 0.9994041,...","[0.04296032711863518, -0.0015949805965647101, ...","name[The Punter],eatType[restaurant],food[Indi...",The Punter is a restaurant providing Indian fo...,The Punter is a restaurant providing Indian fo...
2999,"[near[Express by Holiday Inn], food[Indian], c...","[0.99992794, 0.99981034, 0.5099642, 0.9994041,...","[0.023289771750569344, 0.056861914694309235, -...","name[The Punter],eatType[restaurant],food[Indi...",It is located in the riverside.,The Punter is a restaurant providing Indian fo...
2999,"[near[Express by Holiday Inn], food[Indian], c...","[0.99992794, 0.99981034, 0.5099642, 0.9994041,...","[0.033101629465818405, 0.06402800232172012, 0....","name[The Punter],eatType[restaurant],food[Indi...",It is near Express by Holiday Inn.,The Punter is a restaurant providing Indian fo...
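
The prediction output pairs a list of classes with a list of confidences. A minimal sketch of keeping only the confident labels follows; it assumes the two lists are aligned position by position, which the output above suggests but does not prove. `labels_above` is a hypothetical helper, and the two example values are the first entries visible in row 0.

```python
# Hypothetical post-processing: keep only labels predicted with high confidence.

def labels_above(classes, confidences, threshold=0.5):
    """Return (label, confidence) pairs with confidence >= threshold, highest first.

    Assumes classes[i] belongs to confidences[i].
    """
    kept = [
        (label, confidence)
        for label, confidence in zip(classes, confidences)
        if confidence >= threshold
    ]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# First two visible entries of row 0 in the output above.
classes = ["near[Café Rouge]", "name[Blue Spice]"]
confidences = [0.8555223, 0.99276984]

print(labels_above(classes, confidences, threshold=0.9))
```

Raising the threshold trades recall for precision: fewer labels survive, but the ones that do are those the model is most sure about.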


# 4. Evaluate the model

In [None]:
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.metrics import classification_report
from sklearn.metrics import f1_score
from sklearn.metrics import roc_auc_score
# Binarize the true labels: one column per distinct label found in 'y'
mlb = MultiLabelBinarizer()
mlb = mlb.fit(preds['y'].str.split(','))
y_true = mlb.transform(preds['y'].str.split(','))
# Predicted labels: 'multi_classifier_classes' (the column shown in the prediction output) holds a list of labels per row
y_pred = mlb.transform(preds['multi_classifier_classes'])
print("Classification report: \n", (classification_report(y_true, y_pred)))
print("F1 micro averaging:",(f1_score(y_true, y_pred, average='micro')))
print("ROC AUC (micro): ",(roc_auc_score(y_true, y_pred, average="micro")))

Classification report: 
               precision    recall  f1-score   support

           0       0.78      0.97      0.86      1700
           1       0.95      0.83      0.89      2914
           2       0.56      0.64      0.60       576
           3       0.33      0.28      0.30       367
           4       0.38      0.55      0.45       455
           5       0.30      0.76      0.42       599
           6       0.37      0.77      0.50       550
           7       0.69      0.44      0.54       457
           8       0.99      0.72      0.84       337
           9       0.91      0.98      0.95      2211
          10       0.89      0.99      0.94      2718
          11       0.53      0.89      0.67      1914
          12       0.88      0.79      0.84      3154
          13       0.79      0.98      0.87      1087
          14       0.69      0.97      0.81      1118
          15       0.98      0.64      0.78      1077
          16       0.82      0.96      0.88       671
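
The report lists per-class scores, while `f1_score(..., average='micro')` pools every (row, label) decision into one count of true positives, false positives and false negatives. The sketch below spells that pooling out in plain Python on a made-up 2x3 indicator matrix; `micro_f1` is an illustrative helper, not the scikit-learn implementation.

```python
# Illustrative micro-averaged F1 on 0/1 indicator matrices (lists of lists).

def micro_f1(y_true, y_pred):
    """Pool all (row, label) cells, then compute F1 = 2*TP / (2*TP + FP + FN)."""
    true_pos = false_pos = false_neg = 0
    for true_row, pred_row in zip(y_true, y_pred):
        for truth, prediction in zip(true_row, pred_row):
            if truth == 1 and prediction == 1:
                true_pos += 1
            elif truth == 0 and prediction == 1:
                false_pos += 1
            elif truth == 1 and prediction == 0:
                false_neg += 1
    denominator = 2 * true_pos + false_pos + false_neg
    return 2 * true_pos / denominator if denominator else 0.0


# Made-up example: 2 sentences, 3 possible labels.
y_true_demo = [[1, 0, 1], [0, 1, 0]]
y_pred_demo = [[1, 1, 0], [0, 1, 0]]

# TP = 2, FP = 1, FN = 1, so F1 = 4 / 6
print(round(micro_f1(y_true_demo, y_pred_demo), 4))
```

Because every cell counts equally, frequent labels dominate the micro average; the per-class rows of the report are where weak, rare labels show up.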
  

# 5. Let's try different Sentence Embeddings

In [None]:
# nlu.print_components(action='embed_sentence') lists every available sentence embedding. Let's use BERT!
nlu.print_components(action='embed_sentence')

For language <en> NLU provides the following Models : 
nlu.load('en.embed_sentence') returns Spark NLP model tfhub_use
nlu.load('en.embed_sentence.use') returns Spark NLP model tfhub_use
nlu.load('en.embed_sentence.tfhub_use') returns Spark NLP model tfhub_use
nlu.load('en.embed_sentence.use.lg') returns Spark NLP model tfhub_use_lg
nlu.load('en.embed_sentence.tfhub_use.lg') returns Spark NLP model tfhub_use_lg
nlu.load('en.embed_sentence.albert') returns Spark NLP model albert_base_uncased
nlu.load('en.embed_sentence.electra') returns Spark NLP model sent_electra_small_uncased
nlu.load('en.embed_sentence.electra_small_uncased') returns Spark NLP model sent_electra_small_uncased
nlu.load('en.embed_sentence.electra_base_uncased') returns Spark NLP model sent_electra_base_uncased
nlu.load('en.embed_sentence.electra_large_uncased') returns Spark NLP model sent_electra_large_uncased
nlu.load('en.embed_sentence.bert') returns Spark NLP model sent_bert_base_uncased
nlu.load('en.embed_sentenc

In [None]:
# You might need to restart your notebook to free RAM first, otherwise you may run out of memory when fitting
import nlu
pipe = nlu.load('en.embed_sentence.small_bert_L12_768 train.multi_classifier')
pipe.print_info()

sent_small_bert_L12_768 download started this may take some time.
Approximate size to download 392.9 MB
[OK!]
The following parameters are configurable for this NLU pipeline (You can copy paste the examples) :
>>> pipe['en_embed_sentence_small_bert_L12_768'] has settable params:
pipe['en_embed_sentence_small_bert_L12_768'].setBatchSize(32)  | Info: Batch size. Large values allows faster processing but requires more memory. | Currently set to : 32
pipe['en_embed_sentence_small_bert_L12_768'].setIsLong(False)  | Info: Use Long type instead of Int type for inputs buffer - Some Bert models require Long instead of Int. | Currently set to : False
pipe['en_embed_sentence_small_bert_L12_768'].setMaxSentenceLength(128)  | Info: Max sentence length to process | Currently set to : 128
pipe['en_embed_sentence_small_bert_L12_768'].setDimension(768)  | Info: Number of embedding dimensions | Currently set to : 768
pipe['en_embed_sentence_small_bert_L12_768'].setCaseSensitive(False)  | Info: whether t

In [None]:

# Configure hyperparameters of the BERT-based pipeline loaded above
# Training with large embeddings can take a few hours
pipe['trainable_multi_classifier_dl'].setMaxEpochs(100)            
pipe['trainable_multi_classifier_dl'].setLr(0.0005)  
fitted_pipe = pipe.fit(train_df[['y','text']],label_seperator=',')
preds = fitted_pipe.predict(train_df)
preds

origin_index,text,multi_classifier_classes,Unnamed: 0,document,y,multi_classifier_confidences,en_embed_sentence_small_bert_L12_768_embeddings
0,A coffee shop in the city centre area called B...,"[name[Blue Spice], eatType[coffee shop], area[...",0,A coffee shop in the city centre area called B...,"name[Blue Spice],eatType[coffee shop],area[cit...","[0.9740321, 0.99538183, 0.92562413]","[-0.1427491158246994, 0.5036071538925171, 0.07..."
1,Blue Spice is a coffee shop in city centre.,"[name[Blue Spice], eatType[coffee shop], area[...",1,Blue Spice is a coffee shop in city centre.,"name[Blue Spice],eatType[coffee shop],area[cit...","[0.9950888, 0.9989519, 0.8684354]","[-0.20697341859340668, 0.5286431312561035, 0.2..."
2,There is a coffee shop Blue Spice in the river...,"[name[Blue Spice], eatType[coffee shop], area[...",2,There is a coffee shop Blue Spice in the river...,"name[Blue Spice],eatType[coffee shop],area[riv...","[0.95310336, 0.9655487, 0.9785502]","[0.005826675333082676, 0.49930453300476074, -0..."
3,"At the riverside, there is a coffee shop calle...","[name[Blue Spice], eatType[coffee shop], area[...",3,"At the riverside, there is a coffee shop calle...","name[Blue Spice],eatType[coffee shop],area[riv...","[0.8858954, 0.931189, 0.9990605]","[0.12191159278154373, 0.37966835498809814, 0.0..."
4,The coffee shop Blue Spice is based near Crown...,"[near[Crowne Plaza Hotel], customer rating[5 o...",4,The coffee shop Blue Spice is based near Crown...,"name[Blue Spice],eatType[coffee shop],customer...","[0.99912286, 0.7930833, 0.9730882]","[-0.37350592017173767, 0.1885937601327896, 0.1..."
...,...,...,...,...,...,...,...
2995,"Near Express by Holiday Inn, in the riverside ...","[near[Express by Holiday Inn], customer rating...",2995,"Near Express by Holiday Inn, in the riverside ...","name[The Punter],eatType[restaurant],food[Indi...","[0.9476669, 0.9914391, 0.8395983, 0.98047745, ...","[0.0485222227871418, 0.2381688505411148, 0.227..."
2996,"In the riverside area, near Express by Holiday...","[near[Express by Holiday Inn], food[Indian], c...",2996,"In the riverside area, near Express by Holiday...","name[The Punter],eatType[restaurant],food[Indi...","[0.94435394, 0.6119035, 0.7891044, 0.9885667, ...","[0.06879807263612747, 0.23580998182296753, 0.1..."
2997,The Punter is a restaurant with Indian food in...,"[near[Express by Holiday Inn], food[Indian], c...",2997,The Punter is a restaurant with Indian food in...,"name[The Punter],eatType[restaurant],food[Indi...","[0.99509084, 0.9424925, 0.7625178, 0.9907007, ...","[-0.12667560577392578, 0.22056235373020172, 0...."
2998,The Punter is a low rated restaurant that serv...,"[near[Express by Holiday Inn], food[Indian], c...",2998,The Punter is a low rated restaurant that serv...,"name[The Punter],eatType[restaurant],food[Indi...","[0.99541605, 0.9715836, 0.87202764, 0.99880993...","[-0.13057495653629303, 0.21937601268291473, 0...."


In [None]:
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.metrics import classification_report
from sklearn.metrics import f1_score
from sklearn.metrics import roc_auc_score
# Binarize the true labels: one column per distinct label found in 'y'
mlb = MultiLabelBinarizer()
mlb = mlb.fit(preds['y'].str.split(','))
y_true = mlb.transform(preds['y'].str.split(','))
# Predicted labels: 'multi_classifier_classes' (the column shown in the prediction output) holds a list of labels per row
y_pred = mlb.transform(preds['multi_classifier_classes'])
print("Classification report: \n", (classification_report(y_true, y_pred)))
print("F1 micro averaging:",(f1_score(y_true, y_pred, average='micro')))
print("ROC AUC (micro): ",(roc_auc_score(y_true, y_pred, average="micro")))

Classification report: 
               precision    recall  f1-score   support

           0       0.97      0.98      0.97       846
           1       0.99      0.98      0.98      1642
           2       0.93      0.70      0.80       300
           3       0.90      0.56      0.69       209
           4       0.91      0.72      0.81       246
           5       0.91      0.79      0.85       333
           6       0.95      0.84      0.90       288
           7       0.91      0.82      0.86       260
           8       0.99      0.99      0.99       267
           9       1.00      0.99      0.99      1275
          10       0.99      0.99      0.99      1458
          11       0.96      0.90      0.93       976
          12       0.95      0.97      0.96      1844
          13       1.00      0.99      0.99       492
          14       0.99      0.98      0.99       613
          15       0.97      0.98      0.98       632
          16       0.99      0.97      0.98       365
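
Both evaluation cells score predictions on the same rows that were used for fitting, which tends to overstate quality. A rough sketch of a reproducible hold-out split follows; `holdout_indices`, the 20% test fraction and the seed are illustrative assumptions, not something this notebook or NLU prescribes.

```python
import random

# Hypothetical helper: split row positions into a training part and a held-out part.

def holdout_indices(n_rows, test_fraction=0.2, seed=42):
    """Shuffle row positions with a fixed seed and return (train, test) position lists."""
    positions = list(range(n_rows))
    random.Random(seed).shuffle(positions)
    n_test = int(round(n_rows * test_fraction))
    return sorted(positions[n_test:]), sorted(positions[:n_test])


# 3000 matches the number of rows kept in train_df earlier in this notebook.
train_idx, test_idx = holdout_indices(3000)
print(len(train_idx), len(test_idx))

# Possible use with the objects from this notebook (not executed here):
#   train_part = train_df.iloc[train_idx]
#   test_part = train_df.iloc[test_idx]
#   fitted_pipe = pipe.fit(train_part[['y', 'text']], label_seperator=',')
#   preds = fitted_pipe.predict(test_part)
```

Scores computed on `test_part` would then reflect sentences the classifier has not seen during training.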
  

# 6. Let's save the model

In [None]:
stored_model_path = './models/multi_classifier_dl_trained' 
fitted_pipe.save(stored_model_path)

Stored model in ./models/multi_classifier_dl_trained


# 7. Let's load the model from disk
This makes offline NLU usage possible!   
You need to call nlu.load(path=path_to_the_pipe) to load a model/pipeline from disk.

In [None]:
hdd_pipe = nlu.load(path=stored_model_path)

preds = hdd_pipe.predict('Tesla plans to invest 10M into the ML sector')
preds

origin_index,multi_classifier_classes,document,multi_classifier_confidences,en_embed_sentence_small_bert_L12_768_embeddings
0,"[customer rating[high], customer rating[low], ...",Tesla plans to invest 10M into the ML sector,"[0.9597453, 0.6497742, 0.986845, 0.5315694, 0....","[0.15737222135066986, 0.2598555386066437, 0.85..."


In [None]:
hdd_pipe.print_info()

The following parameters are configurable for this NLU pipeline (You can copy paste the examples) :
>>> pipe['document_assembler'] has settable params:
pipe['document_assembler'].setCleanupMode('shrink')