![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/https://github.com/JohnSnowLabs/nlu/blob/master/examples/colab/Training/multi_class_text_classification/NLU_training_multi_class_text_classifier_demo_wine.ipynb)



# Training a Deep Learning Classifier with NLU 
## ClassifierDL (Multi-class Text Classification)
With the [ClassifierDL model](https://nlp.johnsnowlabs.com/docs/en/annotators#classifierdl-multi-class-text-classification) from Spark NLP you can achieve state-of-the-art results on multi-class text classification problems.

This notebook showcases the following features:

- How to train the deep learning classifier
- How to store a pipeline to disk
- How to load the pipeline from disk (Enables NLU offline mode)



# 1. Install Java 8 and NLU

In [None]:
import os
from sklearn.metrics import classification_report
! apt-get update -qq > /dev/null   
# Install java
! apt-get install -y openjdk-8-jdk-headless -qq > /dev/null
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["PATH"] = os.environ["JAVA_HOME"] + "/bin:" + os.environ["PATH"]

! pip install nlu pyspark==2.4.7   > /dev/null    


import nlu

# 2. Download wine review dataset 
https://www.kaggle.com/zynicide/wine-reviews
A prepared subset of the wine reviews dataset in which each review is labeled with one of four rating classes: acceptable, good, very good, best.

In [None]:
! wget http://ckl-it.de/wp-content/uploads/2021/01/winemag-data_first150k.csv


--2021-01-16 09:05:28--  http://ckl-it.de/wp-content/uploads/2021/01/winemag-data_first150k.csv
Resolving ckl-it.de (ckl-it.de)... 217.160.0.108, 2001:8d8:100f:f000::209
Connecting to ckl-it.de (ckl-it.de)|217.160.0.108|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1447273 (1.4M) [text/csv]
Saving to: ‘winemag-data_first150k.csv’


2021-01-16 09:05:30 (1.74 MB/s) - ‘winemag-data_first150k.csv’ saved [1447273/1447273]



In [None]:
import pandas as pd
train_path = '/content/winemag-data_first150k.csv'
train_df = pd.read_csv(train_path, sep=",")
cols = ["y","text"]
train_df = train_df[cols]
train_df



Unnamed: 0,y,text
0,acceptable,"This wine is closed, tight and possibly still ..."
1,best,This wine shows growing intensity the longer i...
2,good,This moderately aromatic wine conveys Red Hots...
3,best,This feels slightly softer in the mouth than t...
4,best,"A terrific Pinot, and one of the few that abso..."
...,...,...
5055,very good,"A classic Napa Valley Chardonnay, this is smoo..."
5056,very good,The wine from this estate perched high above C...
5057,very good,Distinct and delicious aromas of crème brûlé...
5058,good,"Smooth, deep aromas of licorice and blackberry..."


# 3. Train Deep Learning Classifier using nlu.load('train.classifier')

Your dataset's label column should be named 'y' and the feature column with the text data should be named 'text'.
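If your own data uses other column names, they can be renamed before fitting. A minimal sketch, assuming a hypothetical DataFrame whose columns are called `rating` and `review` (these names are illustrative and do not come from the wine dataset):

```python
import pandas as pd

# Hypothetical raw data; the column names `rating` and `review` are illustrative only
raw_df = pd.DataFrame({
    "rating": ["good", "best"],
    "review": ["Soft and fruity.", "Deep, layered and long."],
})

# Rename to the names expected for training: 'y' (label) and 'text' (feature),
# and keep only those two columns
prepared_df = raw_df.rename(columns={"rating": "y", "review": "text"})[["y", "text"]]
print(prepared_df.columns.tolist())
```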

In [None]:
# Load a trainable pipeline by specifying the train. prefix and fit it on a dataset with label and text columns

trainable_pipe = nlu.load('train.classifier')
fitted_pipe = trainable_pipe.fit(train_df.iloc[:50] )


# predict with the fitted pipeline on the dataset and get predictions
preds = fitted_pipe.predict(train_df.iloc[:50] )
preds

tfhub_use download started this may take some time.
Approximate size to download 923.7 MB
[OK!]


Unnamed: 0_level_0,text,category_confidence,default_name_embeddings,y,category,sentence
origin_index,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
0,"This wine is closed, tight and possibly still ...",0.386967,"[-0.00495561771094799, -0.07129219174385071, -...",acceptable,very good,"This wine is closed, tight and possibly still ..."
0,"This wine is closed, tight and possibly still ...",1.000000,"[0.06035454571247101, 0.041439250111579895, -0...",acceptable,very good,There's also a cheesy character that is less a...
1,This wine shows growing intensity the longer i...,0.454979,"[0.0541062131524086, -0.0517219714820385, -0.0...",best,best,This wine shows growing intensity the longer i...
1,This wine shows growing intensity the longer i...,1.000000,"[-0.026120899245142937, -0.0751243457198143, -...",best,best,"Aromas include red fruit, spice and rosemary: ..."
2,This moderately aromatic wine conveys Red Hots...,0.433734,"[-0.0444738008081913, -0.05501846224069595, 0....",good,very good,This moderately aromatic wine conveys Red Hots...
...,...,...,...,...,...,...
48,"Bright sparks of red currant, black cherry and...",0.439928,"[-0.001167353126220405, -0.062205277383327484,...",very good,very good,"Bright sparks of red currant, black cherry and..."
48,"Bright sparks of red currant, black cherry and...",1.000000,"[0.001156042329967022, -0.041525647044181824, ...",very good,very good,"Bold tannins frame its dense layers of fruit, ..."
49,"Based in the Jura, this producer blends grapes...",0.730394,"[-0.012110762298107147, -0.06961353123188019, ...",acceptable,best,"Based in the Jura, this producer blends grapes..."
49,"Based in the Jura, this producer blends grapes...",1.000000,"[0.05220193415880203, 0.04676426202058792, -0....",acceptable,best,"It's light, bright and just off dry, with attr..."


# 4. Test the fitted pipe on a new example

In [None]:
fitted_pipe.predict('It was one of the best wines i ever tasted .')

Unnamed: 0_level_0,category_confidence,default_name_embeddings,category,sentence
origin_index,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
0,0.842125,"[0.06468033790588379, -0.040837567299604416, -...",best,Bitcoin is going to the moon!


## Configure pipe training parameters

In [None]:
trainable_pipe.print_info()

The following parameters are configurable for this NLU pipeline (You can copy paste the examples) :
>>> pipe['classifier_dl'] has settable params:
pipe['classifier_dl'].setMaxEpochs(3)                | Info: Maximum number of epochs to train | Currently set to : 3
pipe['classifier_dl'].setLr(0.005)                   | Info: Learning Rate | Currently set to : 0.005
pipe['classifier_dl'].setBatchSize(64)               | Info: Batch size | Currently set to : 64
pipe['classifier_dl'].setDropout(0.5)                | Info: Dropout coefficient | Currently set to : 0.5
pipe['classifier_dl'].setEnableOutputLogs(True)      | Info: Whether to use stdout in addition to Spark logs. | Currently set to : True
>>> pipe['default_name'] has settable params:
pipe['default_name'].setDimension(512)               | Info: Number of embedding dimensions | Currently set to : 512
pipe['default_name'].setStorageRef('tfhub_use')      | Info: unique reference name for identification | Currently set to : tfhub_use
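The setters listed by `print_info()` all follow the `set<Name>(value)` pattern, so several of them can be applied from a single dictionary. The helper below is a sketch and not part of NLU: `apply_params` and `DummyClassifier` are hypothetical names, and the dummy class only stands in for a pipeline component so the snippet runs without Spark.

```python
def apply_params(component, params):
    """Call component.set<Name>(value) for every entry in params."""
    for name, value in params.items():
        setter = getattr(component, "set" + name, None)
        if setter is None:
            raise AttributeError(f"component has no setter set{name}")
        setter(value)
    return component


# Stand-in for a pipeline component (hypothetical); with NLU you would
# pass something like trainable_pipe['classifier_dl'] instead.
class DummyClassifier:
    def __init__(self):
        self.settings = {}

    def setMaxEpochs(self, value):
        self.settings["MaxEpochs"] = value

    def setLr(self, value):
        self.settings["Lr"] = value


demo = apply_params(DummyClassifier(), {"MaxEpochs": 5, "Lr": 0.005})
print(demo.settings)
```

Raising on an unknown name is a deliberate choice here: a misspelled parameter fails loudly instead of being silently ignored.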

## Retrain with new parameters

In [None]:
# Train longer!
trainable_pipe['classifier_dl'].setMaxEpochs(5)  
fitted_pipe = trainable_pipe.fit(train_df.iloc[:100])
# predict with the fitted pipeline on the dataset and get predictions
preds = fitted_pipe.predict(train_df.iloc[:100], output_level='document')

# The sentence detector that is part of the pipe generates some NaNs. Let's drop them first
preds.dropna(inplace=True)
print(classification_report(preds['y'], preds['category']))
preds

              precision    recall  f1-score   support

  acceptable       0.00      0.00      0.00        22
        best       0.71      0.89      0.79        28
        good       0.42      0.96      0.58        28
   very good       0.00      0.00      0.00        22

    accuracy                           0.52       100
   macro avg       0.28      0.46      0.34       100
weighted avg       0.32      0.52      0.38       100



Unnamed: 0_level_0,text,document,default_name_embeddings,category_confidence,y,category
origin_index,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
0,"This wine is closed, tight and possibly still ...","This wine is closed, tight and possibly still ...","[0.02915436401963234, -0.0378003790974617, -0....",0.584848,acceptable,good
1,This wine shows growing intensity the longer i...,This wine shows growing intensity the longer i...,"[0.019120197743177414, -0.06991834938526154, 0...",0.875611,best,best
2,This moderately aromatic wine conveys Red Hots...,This moderately aromatic wine conveys Red Hots...,"[-0.025461390614509583, -0.02650509588420391, ...",0.783311,good,good
3,This feels slightly softer in the mouth than t...,This feels slightly softer in the mouth than t...,"[0.011777156963944435, 0.008188367821276188, -...",0.711578,best,good
4,"A terrific Pinot, and one of the few that abso...","A terrific Pinot, and one of the few that abso...","[0.014174058102071285, -0.057778846472501755, ...",0.794139,best,best
...,...,...,...,...,...,...
95,"Radiator dust, lees and vanilla cookie aromas ...","Radiator dust, lees and vanilla cookie aromas ...","[-0.009873664006590843, 0.0033919725101441145,...",0.792627,acceptable,good
96,You'll detect aromas reminiscent of wood shop ...,You'll detect aromas reminiscent of wood shop ...,"[0.03787693753838539, -0.030119985342025757, -...",0.573790,acceptable,good
97,The old vines on the steep slopes of the Heili...,The old vines on the steep slopes of the Heili...,"[0.020556319504976273, -0.059675734490156174, ...",0.919109,best,best
98,This wine takes time to unravel and reveal its...,This wine takes time to unravel and reveal its...,"[-0.00832163542509079, -0.029637429863214493, ...",0.485587,very good,best
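The report above is computed on the same rows that were used for fitting, so it says little about how the classifier handles unseen reviews. One way to get a held-out estimate is a stratified split with scikit-learn's `train_test_split`. This is a sketch: the toy DataFrame, the 25% test size and the seed are assumptions, and the NLU calls are left as comments because they mirror the cells above.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for train_df with the same 'y' and 'text' columns (illustrative data)
toy_df = pd.DataFrame({
    "y": ["best", "good"] * 10,
    "text": [f"review {i}" for i in range(20)],
})

# Stratify on the label so both parts keep a similar class balance
train_part, test_part = train_test_split(
    toy_df, test_size=0.25, stratify=toy_df["y"], random_state=42
)

# With NLU, fit on one part and evaluate on the other:
# fitted_pipe = trainable_pipe.fit(train_part)
# preds = fitted_pipe.predict(test_part, output_level='document')
# print(classification_report(preds['y'], preds['category']))
print(len(train_part), len(test_part))
```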


# Try training with different Embeddings

In [None]:
# We can use nlu.print_components(action='embed_sentence') to see every possible sentence embedding we could use. Let's use BERT!
nlu.print_components(action='embed_sentence')

For language <en> NLU provides the following Models : 
nlu.load('en.embed_sentence') returns Spark NLP model tfhub_use
nlu.load('en.embed_sentence.use') returns Spark NLP model tfhub_use
nlu.load('en.embed_sentence.tfhub_use') returns Spark NLP model tfhub_use
nlu.load('en.embed_sentence.use.lg') returns Spark NLP model tfhub_use_lg
nlu.load('en.embed_sentence.tfhub_use.lg') returns Spark NLP model tfhub_use_lg
nlu.load('en.embed_sentence.albert') returns Spark NLP model albert_base_uncased
nlu.load('en.embed_sentence.electra') returns Spark NLP model sent_electra_small_uncased
nlu.load('en.embed_sentence.electra_small_uncased') returns Spark NLP model sent_electra_small_uncased
nlu.load('en.embed_sentence.electra_base_uncased') returns Spark NLP model sent_electra_base_uncased
nlu.load('en.embed_sentence.electra_large_uncased') returns Spark NLP model sent_electra_large_uncased
nlu.load('en.embed_sentence.bert') returns Spark NLP model sent_bert_base_uncased
nlu.load('en.embed_sentenc

In [None]:
from sklearn.metrics import classification_report
trainable_pipe = nlu.load('en.embed_sentence.small_bert_L12_768 train.classifier')
# For sentence embeddings other than USE, we usually need to train longer and use a smaller learning rate
# We could tune the hyperparameters further with methods like grid search
# Longer training also tends to improve accuracy
trainable_pipe['classifier_dl'].setMaxEpochs(90)  
trainable_pipe['classifier_dl'].setLr(0.0005) 
fitted_pipe = trainable_pipe.fit(train_df)
# predict with the fitted pipeline on the dataset and get predictions
preds = fitted_pipe.predict(train_df, output_level='document')

# The sentence detector that is part of the pipe generates some NaNs. Let's drop them first
preds.dropna(inplace=True)
print(classification_report(preds['y'], preds['category']))

#preds

sent_small_bert_L12_768 download started this may take some time.
Approximate size to download 392.9 MB
[OK!]
              precision    recall  f1-score   support

  acceptable       0.78      0.84      0.81      1265
        best       0.87      0.90      0.88      1265
        good       0.59      0.54      0.56      1265
   very good       0.62      0.60      0.61      1265

    accuracy                           0.72      5060
   macro avg       0.71      0.72      0.72      5060
weighted avg       0.71      0.72      0.72      5060
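The grid search mentioned in the code comments can be made concrete with a small loop over candidate settings. This is a sketch rather than a tuned recipe: the candidate values are arbitrary, and `train_and_score` is a hypothetical callback you would implement yourself with the NLU calls used in this notebook (set the parameters, fit, predict, compute a metric). A dummy scorer is used here so the loop runs on its own.

```python
from itertools import product


def grid_search(train_and_score, grid):
    """Evaluate every parameter combination; return (best_params, best_score)."""
    names = list(grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = train_and_score(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Candidate values are illustrative assumptions, not recommendations
grid = {"MaxEpochs": [30, 90], "Lr": [0.0005, 0.005]}


# Hypothetical scorer. With NLU the body might instead do something like:
#   trainable_pipe['classifier_dl'].setMaxEpochs(params["MaxEpochs"])
#   trainable_pipe['classifier_dl'].setLr(params["Lr"])
#   fitted = trainable_pipe.fit(<training rows>)
#   ... predict on held-out rows and return the accuracy
def dummy_score(params):
    return params["MaxEpochs"] / 100 - params["Lr"]


best_params, best_score = grid_search(dummy_score, grid)
print(best_params, best_score)
```

Scoring each combination on rows that were not used for fitting keeps the search from simply rewarding the setting that memorizes the training data best.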



# 5. Let's save the model

In [None]:
stored_model_path = './models/classifier_dl_trained' 
fitted_pipe.save(stored_model_path)

Stored model in ./models/classifier_dl_trained


# 6. Let's load the model from disk
This makes offline NLU usage possible!   
Call nlu.load(path=path_to_the_pipe) to load a model/pipeline from disk.

In [None]:
hdd_pipe = nlu.load(path=stored_model_path)

preds = hdd_pipe.predict('It was one of the best wines i ever tasted .')
preds

Unnamed: 0_level_0,classifier,classifier_confidence,document,en_embed_sentence_small_bert_L12_768_embeddings
origin_index,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
0,good,0.515783,Tesla plans to invest 10M into the ML sector,"[0.15737222135066986, 0.2598555386066437, 0.85..."


In [None]:
hdd_pipe.print_info()

The following parameters are configurable for this NLU pipeline (You can copy paste the examples) :
>>> pipe['document_assembler'] has settable params:
pipe['document_assembler'].setCleanupMode('shrink')                            | Info: possible values: disabled, inplace, inplace_full, shrink, shrink_full, each, each_full, delete_full | Currently set to : shrink
>>> pipe['sentence_detector'] has settable params:
pipe['sentence_detector'].setCustomBounds([])                                  | Info: characters used to explicitly mark sentence bounds | Currently set to : []
pipe['sentence_detector'].setDetectLists(True)                                 | Info: whether detect lists during sentence detection | Currently set to : True
pipe['sentence_detector'].setExplodeSentences(False)                           | Info: whether to explode each sentence into a different row, for better parallelization. Defaults to false. | Currently set to : False
pipe['sentence_detector'].setMaxLength(99999