![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/https://github.com/JohnSnowLabs/nlu/blob/master/examples/colab/Training/multi_class_text_classification/NLU_training_multi_class_text_classifier_demo_musical_instruments.ipynb)




# Training a Deep Learning Classifier with NLU 
## ClassifierDL (Multi-class Text Classification)
With the [ClassifierDL model](https://nlp.johnsnowlabs.com/docs/en/annotators#classifierdl-multi-class-text-classification) from Spark NLP you can achieve state-of-the-art results on multi-class text classification problems.

This notebook showcases the following features:

- How to train the deep learning classifier
- How to store a pipeline to disk
- How to load the pipeline from disk (Enables NLU offline mode)



# 1. Install Java 8 and NLU

In [None]:
import os
! apt-get update -qq > /dev/null   
# Install java
! apt-get install -y openjdk-8-jdk-headless -qq > /dev/null
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["PATH"] = os.environ["JAVA_HOME"] + "/bin:" + os.environ["PATH"]
! pip install nlu > /dev/null
! pip install  pyspark==2.4.7 > /dev/null

import nlu

# 2. Download the musical instruments classification dataset

https://www.kaggle.com/eswarchandt/amazon-music-reviews

A dataset of product reviews, each rated with one of 5 classes.

In [None]:
! wget http://ckl-it.de/wp-content/uploads/2021/01/Musical_instruments_reviews.csv

--2021-01-16 09:04:04--  http://ckl-it.de/wp-content/uploads/2021/01/Musical_instruments_reviews.csv
Resolving ckl-it.de (ckl-it.de)... 217.160.0.108, 2001:8d8:100f:f000::209
Connecting to ckl-it.de (ckl-it.de)|217.160.0.108|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 51708 (50K) [text/csv]
Saving to: ‘Musical_instruments_reviews.csv’


2021-01-16 09:04:05 (241 KB/s) - ‘Musical_instruments_reviews.csv’ saved [51708/51708]



In [None]:
import pandas as pd
# Path to the downloaded CSV (used here as training data)
data_path = '/content/Musical_instruments_reviews.csv'
train_df = pd.read_csv(data_path, sep=",")
cols = ["y","text"]
train_df = train_df[cols]
train_df



| | y | text |
|---|---|---|
| 0 | good | Hosa products are a good bang for the buck. I ... |
| 1 | average | I now use this cable to run from the output of... |
| 2 | good | Cheap and good texture rubber that does not ge... |
| 3 | average | These cables are a little thin compared to hos... |
| 4 | average | It is a decent cable. It does its job, but it ... |
| ... | ... | ... |
| 115 | very poor | It just randomly pops off my bass, it's so sli... |
| 116 | very good | The primary job of this device is to block the... |
| 117 | good | The Hosa XLR cables are affordable and very he... |
| 118 | average | It's a cable, no frills, tangles pretty easy a... |
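
Before training, it can help to check how many reviews fall into each class, since a skewed label distribution tends to pull a classifier towards the majority class. Below is a minimal sketch using only the standard library. The `labels` list is a hypothetical stand-in for `train_df['y'].tolist()`, and `class_distribution` is an illustrative helper, not part of NLU.

```python
from collections import Counter

def class_distribution(labels):
    """Return {label: (count, share)} ordered by descending count."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: (n, n / total) for label, n in counts.most_common()}

# Hypothetical stand-in for train_df['y'].tolist()
labels = ["good", "average", "good", "average", "average", "very poor", "very good"]

for label, (n, share) in class_distribution(labels).items():
    print(f"{label:<10} {n:>3}  {share:.0%}")
```

In the notebook itself, `train_df['y'].value_counts()` gives the same counts directly from pandas.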


# 3. Train Deep Learning Classifier using nlu.load('train.classifier')

By default, the Universal Sentence Encoder (USE) embeddings are downloaded to provide embeddings for the classifier. You can use any of the 50+ other sentence embeddings in NLU, though!

Your dataset's label column should be named 'y' and the feature column with text data should be named 'text'.
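
If your own data uses different column names, they need to be renamed before fitting. The sketch below checks for the two required names. The helper `missing_training_columns` and the raw column names `rating` and `review` are hypothetical and only serve as illustration.

```python
REQUIRED_COLUMNS = ("y", "text")

def missing_training_columns(columns):
    """Return the required column names that are absent from `columns`."""
    present = set(columns)
    return [name for name in REQUIRED_COLUMNS if name not in present]

# Hypothetical column names of a raw dataset
raw_columns = ["rating", "review"]
print(missing_training_columns(raw_columns))    # ['y', 'text']

# The same mapping could be applied in pandas with
# df.rename(columns={"rating": "y", "review": "text"})
renamed = {"rating": "y", "review": "text"}
fixed_columns = [renamed.get(name, name) for name in raw_columns]
print(missing_training_columns(fixed_columns))  # []
```

With a pandas DataFrame, the check can be called as `missing_training_columns(train_df.columns)`.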

In [None]:
# Load a trainable pipeline by specifying the 'train.' prefix and fit it on a dataset with label and text columns
# Since there are no

trainable_pipe = nlu.load('train.classifier')
fitted_pipe = trainable_pipe.fit(train_df.iloc[:50] )


# Predict with the fitted pipeline on the dataset and get predictions
preds = fitted_pipe.predict(train_df.iloc[:50] )
preds

tfhub_use download started this may take some time.
Approximate size to download 923.7 MB
[OK!]


| origin_index | category_confidence | text | y | category | default_name_embeddings | sentence |
|---|---|---|---|---|---|---|
| 0 | 0.304148 | Hosa products are a good bang for the buck. I ... | good | average | [0.07208353281021118, 0.028736615553498268, -0... | Hosa products are a good bang for the buck. |
| 0 | 1.000000 | Hosa products are a good bang for the buck. I ... | good | average | [0.056614313274621964, -0.04707420617341995, -... | I haven't looked up the specifications, but I'... |
| 1 | 0.956961 | I now use this cable to run from the output of... | average | average | [0.06778458505868912, -0.0052166287787258625, ... | I now use this cable to run from the output of... |
| 1 | 1.000000 | I now use this cable to run from the output of... | average | average | [0.06371542811393738, -0.022252758964896202, -... | After I bought Monster Cable to hook up my ped... |
| 1 | 2.000000 | I now use this cable to run from the output of... | average | average | [0.018308864906430244, 0.0024022769648581743, ... | I had been using a high end Planet Waves cable... |
| ... | ... | ... | ... | ... | ... | ... |
| 47 | 0.841045 | Update: The right angle switched end started d... | average | average | [-0.013615701347589493, -0.04160430282354355, ... | I like knowing that. |
| 47 | 0.841045 | Update: The right angle switched end started d... | average | average | [0.02372647449374199, 0.04573449119925499, -0.... | ** EDIT: AS STATED ABOVE, YOU WILL NOT BE ABLE... |
| 48 | 0.997217 | Doe's not stay on to well, moves to much even ... | average | average | [0.08493339270353317, 0.047714825719594955, -0... | Doe's not stay on to well, moves to much even ... |
| 49 | 0.401975 | These are not the greatest but they're cheap a... | good | very poor | [0.03083745203912258, 0.01701708696782589, -0.... | These are not the greatest but they're cheap a... |
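
Note that this output has one row per sentence, so a single review (one `origin_index`) can appear several times. The sketch below shows one possible way to collapse such rows into a single label per review by majority vote. It runs on hypothetical `(origin_index, category)` pairs and is an illustration only, not a description of what NLU does internally.

```python
from collections import Counter, defaultdict

def majority_label_per_document(rows):
    """Collapse (origin_index, label) pairs into one label per origin_index.

    Ties go to the label that appears first for that document.
    """
    by_doc = defaultdict(list)
    for origin_index, label in rows:
        by_doc[origin_index].append(label)
    # Counter keeps first-seen order among labels with equal counts
    return {doc: Counter(labels).most_common(1)[0][0] for doc, labels in by_doc.items()}

# Hypothetical sentence-level predictions: (origin_index, predicted category)
rows = [(0, "average"), (0, "average"), (1, "average"), (1, "good"), (1, "good")]
print(majority_label_per_document(rows))  # {0: 'average', 1: 'good'}
```

A later cell in this notebook passes `output_level='document'` to `predict`, which returns one row per review directly.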


# 4. Evaluate the model

In [None]:
from sklearn.metrics import classification_report
print(classification_report(preds['y'], preds['category']))


              precision    recall  f1-score   support

     average       0.63      0.76      0.69       123
        good       0.00      0.00      0.00        51
   very good       0.00      0.00      0.00        39
   very poor       0.50      0.87      0.63        84

    accuracy                           0.56       297
   macro avg       0.28      0.41      0.33       297
weighted avg       0.40      0.56      0.46       297
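
The `support` column adds up to 297 rather than 50 because the report counts sentence-level rows. The two summary rows differ in how they average: `macro avg` is the plain mean over classes, while `weighted avg` weights each class by its support. As a worked check, the sketch below recomputes both from the rounded per-class numbers printed above.

```python
# Per-class (precision, recall, f1-score, support), copied from the report above
report = {
    "average":   (0.63, 0.76, 0.69, 123),
    "good":      (0.00, 0.00, 0.00, 51),
    "very good": (0.00, 0.00, 0.00, 39),
    "very poor": (0.50, 0.87, 0.63, 84),
}

def macro_avg(report, idx):
    """Unweighted mean of one metric over all classes."""
    values = [row[idx] for row in report.values()]
    return sum(values) / len(values)

def weighted_avg(report, idx):
    """Mean of one metric over all classes, weighted by class support."""
    total = sum(row[3] for row in report.values())
    return sum(row[idx] * row[3] for row in report.values()) / total

for name, idx in (("precision", 0), ("recall", 1), ("f1-score", 2)):
    print(f"{name:<10} macro={macro_avg(report, idx):.2f}  weighted={weighted_avg(report, idx):.2f}")
```

The printed values agree with the `macro avg` and `weighted avg` rows of the report. Because two of the four classes are never predicted, the macro average is pulled down more strongly than the weighted one.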



# 5. Let's try different Sentence Embeddings

In [None]:
# We can use nlu.print_components(action='embed_sentence') to see every possible sentence embedding we could use. Let's use BERT!
nlu.print_components(action='embed_sentence')

For language <en> NLU provides the following Models : 
nlu.load('en.embed_sentence') returns Spark NLP model tfhub_use
nlu.load('en.embed_sentence.use') returns Spark NLP model tfhub_use
nlu.load('en.embed_sentence.tfhub_use') returns Spark NLP model tfhub_use
nlu.load('en.embed_sentence.use.lg') returns Spark NLP model tfhub_use_lg
nlu.load('en.embed_sentence.tfhub_use.lg') returns Spark NLP model tfhub_use_lg
nlu.load('en.embed_sentence.albert') returns Spark NLP model albert_base_uncased
nlu.load('en.embed_sentence.electra') returns Spark NLP model sent_electra_small_uncased
nlu.load('en.embed_sentence.electra_small_uncased') returns Spark NLP model sent_electra_small_uncased
nlu.load('en.embed_sentence.electra_base_uncased') returns Spark NLP model sent_electra_base_uncased
nlu.load('en.embed_sentence.electra_large_uncased') returns Spark NLP model sent_electra_large_uncased
nlu.load('en.embed_sentence.bert') returns Spark NLP model sent_bert_base_uncased
nlu.load('en.embed_sentenc

In [None]:
# Load pipe with bert embeds
# using large embeddings can take a few hours..
# fitted_pipe = nlu.load('en.embed_sentence.bert_large_uncased train.classifier').fit(train_df)
fitted_pipe = nlu.load('en.embed_sentence.bert train.classifier').fit(train_df.iloc[:100])


# predict with the trained pipeline on dataset and get predictions
preds = fitted_pipe.predict(train_df.iloc[:100])
from sklearn.metrics import classification_report
print(classification_report(preds['y'], preds['category']))


sent_bert_base_uncased download started this may take some time.
Approximate size to download 392.5 MB
[OK!]
              precision    recall  f1-score   support

     average       0.29      1.00      0.45        27
        good       0.00      0.00      0.00        25
   very good       0.00      0.00      0.00        25
   very poor       1.00      0.30      0.47        23

    accuracy                           0.34       100
   macro avg       0.32      0.33      0.23       100
weighted avg       0.31      0.34      0.23       100
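
The reports in this notebook score the model on the same rows it was trained on, which tends to overstate accuracy. Below is a minimal sketch of a stratified hold-out split using only the standard library. `stratified_split` is a hypothetical helper and the `labels` list is made-up example data.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_fraction=0.2, seed=42):
    """Return (train_indices, test_indices).

    Holds out roughly test_fraction of each class, and at least one row per class.
    """
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    rng = random.Random(seed)
    train, test = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = max(1, round(len(indices) * test_fraction))
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return sorted(train), sorted(test)

# Hypothetical balanced labels, 30 rows per class
labels = ["average"] * 30 + ["good"] * 30 + ["very good"] * 30 + ["very poor"] * 30
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))  # 96 24
```

In the notebook, the index lists could be applied as `trainable_pipe.fit(train_df.iloc[train_idx])` followed by `fitted_pipe.predict(train_df.iloc[test_idx])`, with `labels` taken from `train_df['y'].tolist()`.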



In [None]:
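# Load pipe with BERT embeddings; without the 'en.' language prefix, this reference resolved to sent_small_bert_L2_128 (see output below)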
fitted_pipe = nlu.load('embed_sentence.bert train.classifier').fit(train_df.iloc[:100])

# predict with the trained pipeline on dataset and get predictions
preds = fitted_pipe.predict(train_df.iloc[:100])
from sklearn.metrics import classification_report
print(classification_report(preds['y'], preds['category']))


sent_small_bert_L2_128 download started this may take some time.
Approximate size to download 16.1 MB
[OK!]
              precision    recall  f1-score   support

     average       0.00      0.00      0.00        27
        good       0.00      0.00      0.00        25
   very good       0.25      1.00      0.40        25
   very poor       0.00      0.00      0.00        23

    accuracy                           0.25       100
   macro avg       0.06      0.25      0.10       100
weighted avg       0.06      0.25      0.10       100



In [None]:
from sklearn.metrics import classification_report
trainable_pipe = nlu.load('en.embed_sentence.small_bert_L12_768 train.classifier')
# We usually need to train longer and use a smaller learning rate (LR) for non-USE sentence embeddings
# We could tune the hyperparameters further with methods like grid search
# Longer training also tends to give higher accuracy
trainable_pipe['classifier_dl'].setMaxEpochs(90)  
trainable_pipe['classifier_dl'].setLr(0.0005) 
fitted_pipe = trainable_pipe.fit(train_df)
# Predict with the fitted pipeline on the dataset and get document-level predictions
preds = fitted_pipe.predict(train_df,output_level='document')

# The sentence detector that is part of the pipe generates some NaNs. Let's drop them first
preds.dropna(inplace=True)
print(classification_report(preds['y'], preds['category']))

#preds

sent_small_bert_L12_768 download started this may take some time.
Approximate size to download 392.9 MB
[OK!]
              precision    recall  f1-score   support

     average       0.89      0.53      0.67        30
        good       0.62      0.83      0.71        30
   very good       0.93      0.47      0.62        30
   very poor       0.62      0.97      0.75        30

    accuracy                           0.70       120
   macro avg       0.77      0.70      0.69       120
weighted avg       0.77      0.70      0.69       120
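
The cell above mentions grid search as a way to tune the hyperparameters further. Below is a minimal sketch of such a loop over the two parameters set in that cell (`setMaxEpochs` and `setLr`). The `grid_search` helper, the grid values, and the `dummy_train_and_score` callback are hypothetical. The dummy scorer exists only so the example runs without Spark and says nothing about which values actually work best.

```python
from itertools import product

def grid_search(train_and_score, grid):
    """Try every parameter combination and return (best_params, best_score)."""
    names = list(grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = train_and_score(**params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

def dummy_train_and_score(max_epochs, lr):
    # Placeholder scorer. In the notebook this would instead do something like:
    #   pipe = nlu.load('en.embed_sentence.small_bert_L12_768 train.classifier')
    #   pipe['classifier_dl'].setMaxEpochs(max_epochs)
    #   pipe['classifier_dl'].setLr(lr)
    #   ...fit on training rows, then return accuracy on held-out rows
    return 1.0 - abs(max_epochs - 90) / 100 - abs(lr - 0.0005) * 100

# Hypothetical search grid
grid = {"max_epochs": [30, 60, 90], "lr": [0.005, 0.001, 0.0005]}
best_params, best_score = grid_search(dummy_train_and_score, grid)
print(best_params, best_score)
```

Each combination retrains the classifier from scratch, so the cost grows with the product of the grid sizes. With 90 epochs per run, a small grid is usually the practical choice.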



# 6. Let's save the model

In [None]:
stored_model_path = './models/classifier_dl_trained' 
fitted_pipe.save(stored_model_path)

Stored model in ./models/classifier_dl_trained


# 7. Let's load the model from disk
This makes offline NLU usage possible!   
You need to call nlu.load(path=path_to_the_pipe) to load a model/pipeline from disk.
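
When working offline, a quick check that the stored pipeline is actually present can save a confusing error later. The sketch below assumes the saved pipeline is a non-empty directory on disk, which is an assumption about the storage layout rather than something this notebook confirms. `saved_pipeline_exists` is a hypothetical helper.

```python
from pathlib import Path

def saved_pipeline_exists(path):
    """Return True if `path` is a directory that contains at least one entry."""
    directory = Path(path)
    return directory.is_dir() and any(directory.iterdir())

# Path used when saving the fitted pipeline earlier in this notebook
stored_model_path = './models/classifier_dl_trained'
if not saved_pipeline_exists(stored_model_path):
    print(f"No saved pipeline found at {stored_model_path}")
```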

In [None]:
hdd_pipe = nlu.load(path=stored_model_path)

preds = hdd_pipe.predict('It was really good ')
preds

Fitting on empty Dataframe, could not infer correct training method!


| origin_index | document | en_embed_sentence_small_bert_L12_768_embeddings | classifier_confidence | classifier |
|---|---|---|---|---|
| 0 | It was really good | [-0.034663598984479904, 0.3307220935821533, 0.... | 0.529977 | very good |


In [None]:
hdd_pipe.print_info()

The following parameters are configurable for this NLU pipeline (You can copy paste the examples) :
>>> pipe['document_assembler'] has settable params:
pipe['document_assembler'].setCleanupMode('shrink')                              | Info: possible values: disabled, inplace, inplace_full, shrink, shrink_full, each, each_full, delete_full | Currently set to : shrink
>>> pipe['sentence_detector'] has settable params:
pipe['sentence_detector'].setCustomBounds([])                                    | Info: characters used to explicitly mark sentence bounds | Currently set to : []
pipe['sentence_detector'].setDetectLists(True)                                   | Info: whether detect lists during sentence detection | Currently set to : True
pipe['sentence_detector'].setExplodeSentences(False)                             | Info: whether to explode each sentence into a different row, for better parallelization. Defaults to false. | Currently set to : False
pipe['sentence_detector'].setMaxLeng