![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/https://github.com/JohnSnowLabs/nlu/blob/master/examples/collab/Training/NLU_training_multi_class_text_classifier_demo.ipynb)



# Training a Deep Learning Classifier with NLU 
## ClassifierDL (Multi-class Text Classification)
With the [ClassifierDL model](https://nlp.johnsnowlabs.com/docs/en/annotators#classifierdl-multi-class-text-classification) from Spark NLP you can achieve state-of-the-art results on multi-class text classification problems.

This notebook showcases the following features:

- How to train the deep learning classifier
- How to store a pipeline to disk
- How to load the pipeline from disk (Enables NLU offline mode)



# 1. Install Java 8 and NLU

In [1]:
import os
! apt-get update -qq > /dev/null   
# Install java
! apt-get install -y openjdk-8-jdk-headless -qq > /dev/null
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["PATH"] = os.environ["JAVA_HOME"] + "/bin:" + os.environ["PATH"]
# ! pip install nlu > /dev/null
! pip install --index-url https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple peanutbutterdatatime==1.0.4rc11

import nlu

Looking in indexes: https://test.pypi.org/simple/, https://pypi.org/simple
Collecting peanutbutterdatatime==1.0.4rc11
  Downloading https://test-files.pythonhosted.org/packages/53/aa/e1f8a329dc9e9dd9fc9cbcbd2373c9c98dbd05443b9259c53537c3ad2f65/peanutbutterdatatime-1.0.4rc11-py3-none-any.whl (158kB)
Collecting pyarrow>=0.16.0
  Downloading https://test-files.pythonhosted.org/packages/d7/e1/27958a70848f8f7089bff8d6ebe42519daf01f976d28b481e1bfd52c8097/pyarrow-2.0.0-cp36-cp36m-manylinux2014_x86_64.whl (17.7MB)
Collecting pyspark<2.5,>=2.4.0
  Downloading https://files.pythonhosted.org/packages/e2/06/29f80e5a464033432eedf89924e7aa6ebbc47ce4dcd956853a73627f2c07/pyspark-2.4.7.tar.gz (217.9MB)
Collecting spark-nlp<2.7,>=2.6.2
  Downloading https://files.pythonhosted.org/packages/d9/26/

# 2. Download news classification dataset

In [2]:
! wget https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/resources/en/classifier-dl/news_Category/news_category_train.csv
! wget https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/resources/en/classifier-dl/news_Category/news_category_test.csv

--2020-11-30 06:28:26--  https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/resources/en/classifier-dl/news_Category/news_category_train.csv
Resolving s3.amazonaws.com (s3.amazonaws.com)... 52.216.154.126
Connecting to s3.amazonaws.com (s3.amazonaws.com)|52.216.154.126|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 24032125 (23M) [text/csv]
Saving to: ‘news_category_train.csv’


2020-11-30 06:28:29 (9.04 MB/s) - ‘news_category_train.csv’ saved [24032125/24032125]

--2020-11-30 06:28:29--  https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/resources/en/classifier-dl/news_Category/news_category_test.csv
Resolving s3.amazonaws.com (s3.amazonaws.com)... 52.216.154.126
Connecting to s3.amazonaws.com (s3.amazonaws.com)|52.216.154.126|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1504408 (1.4M) [text/csv]
Saving to: ‘news_category_test.csv’


2020-11-30 06:28:31 (1.32 MB/s) - ‘news_category_test.csv’ saved [1504408/1504408]



In [3]:
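import pandas as pd
# Note: this loads the test split and uses it as the training data in this demo
test_path = '/content/news_category_test.csv'
train_df = pd.read_csv(test_path)
# The classifier expects the label column to be named 'y' and the text column 'text'
train_df.columns=['y','text']
train_df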

Unnamed: 0,y,text
0,Business,Unions representing workers at Turner Newall...
1,Sci/Tech,"TORONTO, Canada A second team of rocketeer..."
2,Sci/Tech,A company founded by a chemistry researcher a...
3,Sci/Tech,It's barely dawn when Mike Fitzpatrick starts...
4,Sci/Tech,Southern California's smog fighting agency we...
...,...,...
7595,World,Ukrainian presidential candidate Viktor Yushch...
7596,Sports,With the supply of attractive pitching options...
7597,Sports,Like Roger Clemens did almost exactly eight ye...
7598,Business,SINGAPORE : Doctors in the United States have ...


# 3. Train Deep Learning Classifier using nlu.load('train.classifier')

By default, the Universal Sentence Encoder (USE) embeddings are downloaded to provide embeddings for the classifier. You can use any of the 50+ other sentence embeddings in NLU, though!

Your dataset's label column should be named 'y' and the feature column with text data should be named 'text'.
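If your own data uses different column names, they can be renamed before fitting. The following is a minimal sketch with pandas only; the DataFrame `raw_df` and its column names `category` and `description` are hypothetical and are not part of the news dataset used in this notebook.

```python
import pandas as pd

# Hypothetical raw data; the column names are illustrative assumptions.
raw_df = pd.DataFrame({
    "category": ["Business", "Sports"],
    "description": ["Stocks rallied on Friday.", "The home team won the final."],
})

# Rename to the column names the trainable pipeline expects: 'y' and 'text'.
train_ready_df = raw_df.rename(columns={"category": "y", "description": "text"})

# Drop rows with a missing label or text, since they carry no training signal.
train_ready_df = train_ready_df.dropna(subset=["y", "text"]).reset_index(drop=True)

print(train_ready_df.columns.tolist())
```

The resulting `train_ready_df` has the same shape as `train_df` above and could be passed to `.fit()` in the same way.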

In [4]:
# Load a trainable pipeline by specifying the train. prefix and fit it on a dataset with label and text columns
# Since there are no
fitted_pipe = nlu.load('train.classifier').fit(train_df)

# Predict with the fitted pipeline on the same dataset and get predictions
preds = fitted_pipe.predict(train_df)
preds

tfhub_use download started this may take some time.
Approximate size to download 923.7 MB
[OK!]


origin_index,sentence,category,y,text,default_name_embeddings,category_confidence
0,Unions representing workers at Turner Newall s...,Business,Business,Unions representing workers at Turner Newall...,"[0.012997539713978767, 0.019844762980937958, -...",1.000000
1,"TORONTO, Canada A second team of rocketeers co...",Sports,Sci/Tech,"TORONTO, Canada A second team of rocketeer...","[0.023022323846817017, -0.01595703884959221, -...",1.000000
1,"10 million Ansari X Prize, a contest for priva...",Sports,Sci/Tech,"TORONTO, Canada A second team of rocketeer...","[-0.010587693192064762, 0.011531050316989422, ...",1.000000
2,A company founded by a chemistry researcher at...,Sci/Tech,Sci/Tech,A company founded by a chemistry researcher a...,"[0.038641855120658875, 0.02322080172598362, -0...",0.995407
3,It's barely dawn when Mike Fitzpatrick starts ...,Sci/Tech,Sci/Tech,It's barely dawn when Mike Fitzpatrick starts...,"[-0.006857294123619795, 0.01967567577958107, -...",1.000000
...,...,...,...,...,...,...
7596,.,Sports,Sports,With the supply of attractive pitching options...,"[0.005107458680868149, -0.011805553920567036, ...",1.000000
7596,.,Sports,Sports,With the supply of attractive pitching options...,"[0.005107458680868149, -0.011805553920567036, ...",2.000000
7597,Like Roger Clemens did almost exactly eight ye...,Sports,Sports,Like Roger Clemens did almost exactly eight ye...,"[0.044696468859910965, 0.0015660696662962437, ...",1.000000
7598,SINGAPORE : Doctors in the United States have ...,Business,Business,SINGAPORE : Doctors in the United States have ...,"[0.05564942583441734, -0.021285761147737503, -...",1.000000
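The output above contains one row per sentence, so the same `origin_index` can appear several times with different predictions. If one prediction per original row is needed, the sentence-level rows can be collapsed. The sketch below keeps the highest-confidence sentence prediction per `origin_index`; this rule is only one possible choice, the helper `to_document_level` is hypothetical, and `toy_preds` is made-up data that mimics the column names shown above.

```python
import pandas as pd

# Toy stand-in for the predictions DataFrame; values are made up for illustration.
toy_preds = pd.DataFrame(
    {
        "category": ["Business", "Sports", "Sci/Tech", "Sci/Tech"],
        "y": ["Business", "Sci/Tech", "Sci/Tech", "Sci/Tech"],
        "category_confidence": [0.99, 0.60, 0.95, 0.90],
    },
    index=pd.Index([0, 1, 1, 2], name="origin_index"),
)


def to_document_level(preds: pd.DataFrame) -> pd.DataFrame:
    """Keep, per original row, the sentence prediction with the highest confidence."""
    # Move origin_index into a column so every sentence row gets a unique label.
    flat = preds.reset_index()
    # For each original row, find the label of its most confident sentence row.
    best_rows = flat.groupby("origin_index")["category_confidence"].idxmax()
    return flat.loc[best_rows].set_index("origin_index")


doc_preds = to_document_level(toy_preds)
print(doc_preds[["y", "category"]])
```

A majority vote over the sentences of each row would be an equally reasonable alternative.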


# 4. Evaluate the model

In [5]:
from sklearn.metrics import classification_report
print(classification_report(preds['y'], preds['category']))


              precision    recall  f1-score   support

    Business       0.81      0.82      0.82      3671
    Sci/Tech       0.83      0.84      0.83      3983
      Sports       0.88      0.94      0.91      3687
       World       0.91      0.80      0.85      3058

    accuracy                           0.85     14399
   macro avg       0.86      0.85      0.85     14399
weighted avg       0.85      0.85      0.85     14399
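Note that the report above is computed on the same rows the classifier was fitted on, so it probably overstates how well the model generalizes. A common remedy is to hold out part of the data before fitting. The sketch below shows such a split with scikit-learn's `train_test_split`; `toy_df` is made-up data with the same `y` and `text` columns, and the 80/20 ratio is an arbitrary choice.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up data with the column names used in this notebook.
toy_df = pd.DataFrame({
    "y": ["Business", "Sports", "World", "Sci/Tech"] * 5,
    "text": [f"example headline {i}" for i in range(20)],
})

# Hold out 20% of the rows; stratify keeps the class proportions in both parts.
fit_df, holdout_df = train_test_split(
    toy_df, test_size=0.2, stratify=toy_df["y"], random_state=42
)

print(len(fit_df), len(holdout_df))
```

With the real data, `fit_df` would be passed to `.fit()` and `holdout_df` to `fitted_pipe.predict()` before calling `classification_report`. The `news_category_train.csv` file downloaded in step 2 could serve the same purpose as a separate training set.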



# 5. Let's try different sentence embeddings

In [6]:
# We can use nlu.print_components(action='embed_sentence') to see every possible sentence embedding we could use. Let's use BERT!
nlu.print_components(action='embed_sentence')

For language <en> NLU provides the following Models : 
nlu.load('en.embed_sentence') returns Spark NLP model tfhub_use
nlu.load('en.embed_sentence.use') returns Spark NLP model tfhub_use
nlu.load('en.embed_sentence.tfhub_use') returns Spark NLP model tfhub_use
nlu.load('en.embed_sentence.use.lg') returns Spark NLP model tfhub_use_lg
nlu.load('en.embed_sentence.tfhub_use.lg') returns Spark NLP model tfhub_use_lg
nlu.load('en.embed_sentence.albert') returns Spark NLP model albert_base_uncased
nlu.load('en.embed_sentence.electra') returns Spark NLP model sent_electra_small_uncased
nlu.load('en.embed_sentence.electra_small_uncased') returns Spark NLP model sent_electra_small_uncased
nlu.load('en.embed_sentence.electra_base_uncased') returns Spark NLP model sent_electra_base_uncased
nlu.load('en.embed_sentence.electra_large_uncased') returns Spark NLP model sent_electra_large_uncased
nlu.load('en.embed_sentence.bert') returns Spark NLP model sent_bert_base_uncased
nlu.load('en.embed_sentenc

In [None]:
# Load pipe with bert embeds
# using large embeddings can take a few hours..
# fitted_pipe = nlu.load('en.embed_sentence.bert_large_uncased train.classifier').fit(train_df)
fitted_pipe = nlu.load('en.embed_sentence.small_bert_L12_768 train.classifier').fit(train_df)


# predict with the trained pipeline on dataset and get predictions
preds = fitted_pipe.predict(train_df)
from sklearn.metrics import classification_report
print(classification_report(preds['y'], preds['category']))


sent_small_bert_L12_768 download started this may take some time.
Approximate size to download 392.9 MB
[OK!]


In [None]:
# Load pipe with bert embeds
fitted_pipe = nlu.load('embed_sentence.bert train.classifier').fit(train_df)

# predict with the trained pipeline on dataset and get predictions
preds = fitted_pipe.predict(train_df)
from sklearn.metrics import classification_report
print(classification_report(preds['y'], preds['category']))


# 6. Let's save the model

In [None]:
stored_model_path = './models/classifier_dl_trained'
fitted_pipe.save(stored_model_path)

# 7. Let's load the model from disk
This makes offline NLU usage possible!   
Call nlu.load(path=path_to_the_pipe) to load a model/pipeline from disk.

In [None]:
hdd_pipe = nlu.load(path=stored_model_path)

preds = hdd_pipe.predict('Tesla plans to invest 10M into the ML sector')
preds

In [None]:
hdd_pipe.print_info()