![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/https://github.com/JohnSnowLabs/nlu/blob/master/examples/colab/Training/multi_class_text_classification/NLU_training_multi_class_text_classifier_demo_hotel_reviews.ipynb)



# Training a Deep Learning Classifier with NLU 
## ClassifierDL (Multi-class Text Classification)
With the [ClassifierDL model](https://nlp.johnsnowlabs.com/docs/en/annotators#classifierdl-multi-class-text-classification) from Spark NLP, you can achieve state-of-the-art results on multi-class text classification problems.

This notebook showcases the following features:

- How to train the deep learning classifier
- How to store a pipeline to disk
- How to load the pipeline from disk (Enables NLU offline mode)



# 1. Install Java 8 and NLU

In [None]:
import os
from sklearn.metrics import classification_report
! apt-get update -qq > /dev/null   
# Install java
! apt-get install -y openjdk-8-jdk-headless -qq > /dev/null
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["PATH"] = os.environ["JAVA_HOME"] + "/bin:" + os.environ["PATH"]
! pip install  pyspark==2.4.7 
! pip install nlu > /dev/null    



import nlu

# 2. Download the hotel reviews dataset
https://www.kaggle.com/andrewmvd/trip-advisor-hotel-reviews

Hotels play a crucial role in traveling, and with increased access to information, new ways of selecting the best ones have emerged.
With this dataset, consisting of 20k reviews crawled from Tripadvisor, you can explore what makes a great hotel and maybe even use this model in your travels!


In [None]:
! wget http://ckl-it.de/wp-content/uploads/2021/01/tripadvisor_hotel_reviews.csv


--2021-01-16 09:04:37--  http://ckl-it.de/wp-content/uploads/2021/01/tripadvisor_hotel_reviews.csv
Resolving ckl-it.de (ckl-it.de)... 217.160.0.108, 2001:8d8:100f:f000::209
Connecting to ckl-it.de (ckl-it.de)|217.160.0.108|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 5160790 (4.9M) [text/csv]
Saving to: ‘tripadvisor_hotel_reviews.csv’


2021-01-16 09:04:41 (1.46 MB/s) - ‘tripadvisor_hotel_reviews.csv’ saved [5160790/5160790]



In [None]:
import pandas as pd
test_path = '/content/tripadvisor_hotel_reviews.csv'
train_df = pd.read_csv(test_path,sep=",")
cols = ["y","text"]
train_df = train_df[cols]
train_df



Unnamed: 0,y,text
0,great,great stayed hotel 5 nights end august 2005. r...
1,poor,"watch bait-and-switch room rates, rooms accept..."
2,average,good check liked hotel good location friendly ...
3,great,best location value properties waikiki head ho...
4,poor,botel not recommended little disappointed hone...
...,...,...
6547,great,big bang buck st. charles great new orleans st...
6548,great,"loved minute, reading reviews hotel bit worrie..."
6549,great,"wonderful, let tell place, 3 friends stayed ap..."
6550,average,small bathroom clean hmmm ok let stay used tra...
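
The cells in this notebook fit and predict on the same rows, so the reported scores are training-set scores. A held-out portion gives a more honest estimate. The following is a minimal sketch of a stratified split using only the standard library; the `stratified_split` helper and the toy rows are hypothetical and not part of NLU.

```python
import random
from collections import defaultdict


def stratified_split(rows, label_key="y", test_fraction=0.2, seed=0):
    """Split rows into (train, test), keeping each label's share roughly equal."""
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[label_key]].append(row)

    rng = random.Random(seed)  # fixed seed so the split is reproducible
    train, test = [], []
    for label_rows in by_label.values():
        shuffled = label_rows[:]
        rng.shuffle(shuffled)
        n_test = int(round(len(shuffled) * test_fraction))
        test.extend(shuffled[:n_test])
        train.extend(shuffled[n_test:])
    return train, test


# Hypothetical toy rows (10 per class), standing in for the real reviews.
rows = [{"y": label, "text": f"review {i}"}
        for label in ("great", "average", "poor") for i in range(10)]
train_rows, test_rows = stratified_split(rows)
print(len(train_rows), len(test_rows))
```

With a pandas DataFrame, `sklearn.model_selection.train_test_split` with its `stratify` argument is a common alternative to a hand-written helper like this one.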


# 3. Train Deep Learning Classifier using nlu.load('train.classifier')

Your dataset's label column should be named 'y', and the feature column with the text data should be named 'text'.
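
A quick sanity check before training can catch a misnamed column early. The sketch below uses only the standard library; the `check_training_csv` helper and the inline sample rows are hypothetical and not part of NLU.

```python
import csv
import io
from collections import Counter

REQUIRED_COLUMNS = {"y", "text"}  # column names this notebook says NLU expects


def check_training_csv(handle):
    """Return label counts, or raise ValueError if a required column is missing."""
    reader = csv.DictReader(handle)
    missing = REQUIRED_COLUMNS - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing column(s): {sorted(missing)}")
    # Count how many rows carry each label to spot class imbalance.
    return Counter(row["y"] for row in reader)


# Hypothetical three-row sample standing in for the real CSV file.
sample = io.StringIO(
    "y,text\n"
    "great,loved the location\n"
    "poor,room was dirty\n"
    "great,friendly staff\n"
)
label_counts = check_training_csv(sample)
print(label_counts)
```

For the real file, the same function could be called on `open('/content/tripadvisor_hotel_reviews.csv', newline='')` instead of the in-memory sample.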

In [None]:
# Load a trainable pipeline by specifying the train. prefix and fit it on a dataset with label and text columns
# Since there are no

trainable_pipe = nlu.load('train.classifier')
fitted_pipe = trainable_pipe.fit(train_df.iloc[:50] )


# Predict with the fitted pipeline on the dataset and get predictions
preds = fitted_pipe.predict(train_df.iloc[:50] )
preds

tfhub_use download started this may take some time.
Approximate size to download 923.7 MB
[OK!]


Unnamed: 0_level_0,y,text,category_confidence,token,category,default_name_embeddings
origin_index,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
0,great,great stayed hotel 5 nights end august 2005. r...,0.496030,great,great,"[[0.03609783574938774, 0.05106373876333237, 0...."
0,great,great stayed hotel 5 nights end august 2005. r...,0.496030,stayed,great,"[[0.03609783574938774, 0.05106373876333237, 0...."
0,great,great stayed hotel 5 nights end august 2005. r...,0.496030,hotel,great,"[[0.03609783574938774, 0.05106373876333237, 0...."
0,great,great stayed hotel 5 nights end august 2005. r...,0.496030,5,great,"[[0.03609783574938774, 0.05106373876333237, 0...."
0,great,great stayed hotel 5 nights end august 2005. r...,0.496030,nights,great,"[[0.03609783574938774, 0.05106373876333237, 0...."
...,...,...,...,...,...,...
49,poor,"kidding, arrived riu palace macao punta cana w...",0.476485,recommend,average,"[[-0.017401963472366333, 0.04562698304653168, ..."
49,poor,"kidding, arrived riu palace macao punta cana w...",0.476485,riu,average,"[[-0.017401963472366333, 0.04562698304653168, ..."
49,poor,"kidding, arrived riu palace macao punta cana w...",0.476485,palace,average,"[[-0.017401963472366333, 0.04562698304653168, ..."
49,poor,"kidding, arrived riu palace macao punta cana w...",0.476485,macao,average,"[[-0.017401963472366333, 0.04562698304653168, ..."


# Test the fitted pipe on a new example

In [None]:
fitted_pipe.predict("It was a good experience!")

Unnamed: 0_level_0,category_confidence,token,category,default_name_embeddings
origin_index,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
0,0.7399,Bitcoin,average,"[[0.06468033790588379, -0.040837567299604416, ..."
0,0.7399,is,average,"[[0.06468033790588379, -0.040837567299604416, ..."
0,0.7399,going,average,"[[0.06468033790588379, -0.040837567299604416, ..."
0,0.7399,to,average,"[[0.06468033790588379, -0.040837567299604416, ..."
0,0.7399,the,average,"[[0.06468033790588379, -0.040837567299604416, ..."
0,0.7399,moon,average,"[[0.06468033790588379, -0.040837567299604416, ..."
0,0.7399,!,average,"[[0.06468033790588379, -0.040837567299604416, ..."


## Configure pipe training parameters

In [None]:
trainable_pipe.print_info()

The following parameters are configurable for this NLU pipeline (You can copy paste the examples) :
>>> pipe['classifier_dl'] has settable params:
pipe['classifier_dl'].setMaxEpochs(3)                | Info: Maximum number of epochs to train | Currently set to : 3
pipe['classifier_dl'].setLr(0.005)                   | Info: Learning Rate | Currently set to : 0.005
pipe['classifier_dl'].setBatchSize(64)               | Info: Batch size | Currently set to : 64
pipe['classifier_dl'].setDropout(0.5)                | Info: Dropout coefficient | Currently set to : 0.5
pipe['classifier_dl'].setEnableOutputLogs(True)      | Info: Whether to use stdout in addition to Spark logs. | Currently set to : True
>>> pipe['sentence_detector'] has settable params:
pipe['sentence_detector'].setUseAbbreviations(True)  | Info: whether to apply abbreviations at sentence detection | Currently set to : True
pipe['sentence_detector'].setDetectLists(True)       | Info: whether detect lists during sentence detect

## Retrain with new parameters

In [None]:
# Train longer!
trainable_pipe['classifier_dl'].setMaxEpochs(5)  
fitted_pipe = trainable_pipe.fit(train_df.iloc[:100])
# Predict with the fitted pipeline on the dataset and get predictions
preds = fitted_pipe.predict(train_df.iloc[:100],output_level='document')

# The sentence detector that is part of the pipe generates some NaNs. Let's drop them first
preds.dropna(inplace=True)
print(classification_report(preds['y'], preds['category']))
preds

              precision    recall  f1-score   support

     average       0.48      0.76      0.59        33
       great       0.86      0.51      0.64        35
        poor       0.74      0.62      0.68        32

    accuracy                           0.63       100
   macro avg       0.69      0.63      0.64       100
weighted avg       0.70      0.63      0.64       100



Unnamed: 0_level_0,y,text,document,category_confidence,category,default_name_embeddings
origin_index,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
0,great,great stayed hotel 5 nights end august 2005. r...,great stayed hotel 5 nights end august 2005. r...,0.595822,average,"[0.06212242692708969, 0.04104098677635193, 0.0..."
1,poor,"watch bait-and-switch room rates, rooms accept...","watch bait-and-switch room rates, rooms accept...",0.498284,poor,"[0.0546528585255146, 0.02160552889108658, -0.0..."
2,average,good check liked hotel good location friendly ...,good check liked hotel good location friendly ...,0.557739,average,"[0.008103911764919758, 0.02573486790060997, 0...."
3,great,best location value properties waikiki head ho...,best location value properties waikiki head ho...,0.418274,average,"[0.05095028877258301, -0.003614993067458272, 0..."
4,poor,botel not recommended little disappointed hone...,botel not recommended little disappointed hone...,0.491956,average,"[0.03620055690407753, 0.010797196999192238, 0...."
...,...,...,...,...,...,...
95,great,great location spent 7 days castle inn beginni...,great location spent 7 days castle inn beginni...,0.402236,average,"[0.03295842185616493, 0.04682551696896553, 0.0..."
96,average,great location hard beds really liked hotel si...,great location hard beds really liked hotel si...,0.598560,average,"[0.02258184179663658, 0.0432007722556591, -0.0..."
97,great,great location location hotel perfect right mi...,great location location hotel perfect right mi...,0.552369,average,"[0.06024744734168053, 0.05366133153438568, -0...."
98,great,just starting lose lustre stayed chancellor co...,just starting lose lustre stayed chancellor co...,0.374642,poor,"[0.0255410298705101, 0.0401645191013813, 0.003..."
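
To make the columns of the classification report concrete, here is a minimal sketch that computes per-class precision, recall, F1 and support by hand. The `per_class_scores` helper and the toy labels are hypothetical; the real report above comes from `sklearn.metrics.classification_report`.

```python
from collections import Counter


def per_class_scores(y_true, y_pred):
    """Compute precision, recall, F1 and support for every label seen."""
    true_pos = Counter()
    pred_count = Counter(y_pred)  # how often each label was predicted
    true_count = Counter(y_true)  # how often each label actually occurs (support)
    for truth, guess in zip(y_true, y_pred):
        if truth == guess:
            true_pos[truth] += 1

    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        precision = true_pos[label] / pred_count[label] if pred_count[label] else 0.0
        recall = true_pos[label] / true_count[label] if true_count[label] else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        scores[label] = {"precision": precision, "recall": recall,
                         "f1": f1, "support": true_count[label]}
    return scores


# Hypothetical toy labels, not taken from the hotel review predictions.
y_true = ["great", "great", "poor", "average"]
y_pred = ["great", "average", "poor", "average"]
scores = per_class_scores(y_true, y_pred)
for label, values in scores.items():
    print(label, values)
```

Precision is the share of predictions for a label that were right, recall is the share of true examples of a label that were found, and F1 is their harmonic mean.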


# Try training with different Embeddings

In [None]:
# We can use nlu.print_components(action='embed_sentence') to see every possible sentence embedding we could use. Let's use BERT!
nlu.print_components(action='embed_sentence')

For language <en> NLU provides the following Models : 
nlu.load('en.embed_sentence') returns Spark NLP model tfhub_use
nlu.load('en.embed_sentence.use') returns Spark NLP model tfhub_use
nlu.load('en.embed_sentence.tfhub_use') returns Spark NLP model tfhub_use
nlu.load('en.embed_sentence.use.lg') returns Spark NLP model tfhub_use_lg
nlu.load('en.embed_sentence.tfhub_use.lg') returns Spark NLP model tfhub_use_lg
nlu.load('en.embed_sentence.albert') returns Spark NLP model albert_base_uncased
nlu.load('en.embed_sentence.electra') returns Spark NLP model sent_electra_small_uncased
nlu.load('en.embed_sentence.electra_small_uncased') returns Spark NLP model sent_electra_small_uncased
nlu.load('en.embed_sentence.electra_base_uncased') returns Spark NLP model sent_electra_base_uncased
nlu.load('en.embed_sentence.electra_large_uncased') returns Spark NLP model sent_electra_large_uncased
nlu.load('en.embed_sentence.bert') returns Spark NLP model sent_bert_base_uncased
nlu.load('en.embed_sentenc

In [None]:
from sklearn.metrics import classification_report
trainable_pipe = nlu.load('en.embed_sentence.small_bert_L12_768 train.classifier')
# We usually need to train longer and use a smaller LR for non-USE sentence embeddings
# We could tune the hyperparameters further with hyperparameter tuning methods like grid search
# Longer training also tends to give higher accuracy
trainable_pipe['classifier_dl'].setMaxEpochs(90)  
trainable_pipe['classifier_dl'].setLr(0.0005) 
fitted_pipe = trainable_pipe.fit(train_df)
# Predict with the fitted pipeline on the dataset and get predictions
preds = fitted_pipe.predict(train_df,output_level='document')

# The sentence detector that is part of the pipe generates some NaNs. Let's drop them first
preds.dropna(inplace=True)
print(classification_report(preds['y'], preds['category']))

#preds

sent_small_bert_L12_768 download started this may take some time.
Approximate size to download 392.9 MB
[OK!]
              precision    recall  f1-score   support

     average       0.66      0.65      0.65      2184
       great       0.79      0.81      0.80      2184
        poor       0.77      0.78      0.78      2184

    accuracy                           0.74      6552
   macro avg       0.74      0.74      0.74      6552
weighted avg       0.74      0.74      0.74      6552
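
The training cell mentions grid search as a way to tune the hyperparameters further. The following is a minimal sketch of that idea, not the notebook's method: `grid_search` is a hypothetical helper, and `fake_train_and_score` is a stand-in objective so the example runs without Spark.

```python
from itertools import product


def grid_search(train_and_score, epochs_options, lr_options):
    """Try every (epochs, lr) pair and return the best pair with its score."""
    best_params, best_score = None, float("-inf")
    for epochs, lr in product(epochs_options, lr_options):
        score = train_and_score(epochs, lr)
        if score > best_score:
            best_params, best_score = (epochs, lr), score
    return best_params, best_score


def fake_train_and_score(epochs, lr):
    """Stand-in objective that peaks at epochs=60, lr=0.001; NOT a real model."""
    return -abs(epochs - 60) / 100 - abs(lr - 0.001) * 100


best_params, best_score = grid_search(
    fake_train_and_score,
    epochs_options=[30, 60, 90],
    lr_options=[0.0005, 0.001, 0.005],
)
print(best_params, best_score)
```

In the notebook, a real `train_and_score` would presumably call `trainable_pipe['classifier_dl'].setMaxEpochs(epochs)` and `.setLr(lr)`, fit the pipeline, and return a metric such as accuracy, ideally measured on rows that were not used for training.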



# 5. Let's save the model

In [None]:
stored_model_path = './models/classifier_dl_trained' 
fitted_pipe.save(stored_model_path)

Stored model in ./models/classifier_dl_trained


# 6. Let's load the model from HDD
This makes offline NLU usage possible!   
You need to call nlu.load(path=path_to_the_pipe) to load a model/pipeline from disk.

In [None]:
hdd_pipe = nlu.load(path=stored_model_path)

preds = hdd_pipe.predict('It was a good experience!')
preds

Unnamed: 0_level_0,classifier,en_embed_sentence_small_bert_L12_768_embeddings,document,classifier_confidence
origin_index,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
0,great,"[-0.07878006249666214, 0.1528550535440445, 0.1...",It was one of the best wines i ever tasted .,0.865597


In [None]:
hdd_pipe.print_info()

The following parameters are configurable for this NLU pipeline (You can copy paste the examples) :
>>> pipe['document_assembler'] has settable params:
pipe['document_assembler'].setCleanupMode('shrink')             | Info: possible values: disabled, inplace, inplace_full, shrink, shrink_full, each, each_full, delete_full | Currently set to : shrink
>>> pipe['regex_tokenizer'] has settable params:
pipe['regex_tokenizer'].setCaseSensitiveExceptions(True)        | Info: Whether to care for case sensitiveness in exceptions | Currently set to : True
pipe['regex_tokenizer'].setTargetPattern('\S+')                 | Info: pattern to grab from text as token candidates. Defaults \S+ | Currently set to : \S+
pipe['regex_tokenizer'].setMaxLength(99999)                     | Info: Set the maximum allowed length for each token | Currently set to : 99999
pipe['regex_tokenizer'].setMinLength(0)                         | Info: Set the minimum allowed length for each token | Currently set to : 0
>>> p