![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/https://github.com/JohnSnowLabs/nlu/blob/master/examples/collab/Training/NLU_training_demo.ipynb)



# Training a Deep Learning Classifier with NLU 
With the [ClassifierDL model](https://nlp.johnsnowlabs.com/docs/en/annotators#classifierdl-multi-class-text-classification) from Spark NLP, you can achieve state-of-the-art results on multi-class text classification problems.

This notebook showcases the following features:

- How to train the deep learning classifier
- How to store a pipeline to disk
- How to load the pipeline from disk (Enables NLU offline mode)



# 1. Install Java 8 and NLU

In [None]:
import os
! apt-get update -qq > /dev/null   
# Install java
! apt-get install -y openjdk-8-jdk-headless -qq > /dev/null
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["PATH"] = os.environ["JAVA_HOME"] + "/bin:" + os.environ["PATH"]
! pip install nlu > /dev/null

import nlu

# 2. Download news classification dataset

In [None]:
! wget https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/resources/en/classifier-dl/news_Category/news_category_train.csv
! wget https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/resources/en/classifier-dl/news_Category/news_category_test.csv

--2020-11-19 10:23:12--  https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/resources/en/classifier-dl/news_Category/news_category_train.csv
Resolving s3.amazonaws.com (s3.amazonaws.com)... 52.216.245.86
Connecting to s3.amazonaws.com (s3.amazonaws.com)|52.216.245.86|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 24032125 (23M) [text/csv]
Saving to: ‘news_category_train.csv’


2020-11-19 10:23:13 (49.3 MB/s) - ‘news_category_train.csv’ saved [24032125/24032125]

--2020-11-19 10:23:13--  https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/resources/en/classifier-dl/news_Category/news_category_test.csv
Resolving s3.amazonaws.com (s3.amazonaws.com)... 52.216.245.86
Connecting to s3.amazonaws.com (s3.amazonaws.com)|52.216.245.86|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1504408 (1.4M) [text/csv]
Saving to: ‘news_category_test.csv’


2020-11-19 10:23:14 (6.97 MB/s) - ‘news_category_test.csv’ saved [1504408/1504408]



In [None]:
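import pandas as pd
train_path = '/content/news_category_train.csv'
test_path = '/content/news_category_test.csv'
# Note: the smaller test split is loaded as the training data here;
# pass train_path instead to fit on the full training split
train_df = pd.read_csv(test_path)
# The trainable pipeline expects the label column to be named 'y' and the text column 'text'
train_df.columns = ['y', 'text']
train_df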

Unnamed: 0,y,text
0,Business,Unions representing workers at Turner Newall...
1,Sci/Tech,"TORONTO, Canada A second team of rocketeer..."
2,Sci/Tech,A company founded by a chemistry researcher a...
3,Sci/Tech,It's barely dawn when Mike Fitzpatrick starts...
4,Sci/Tech,Southern California's smog fighting agency we...
...,...,...
7595,World,Ukrainian presidential candidate Viktor Yushch...
7596,Sports,With the supply of attractive pitching options...
7597,Sports,Like Roger Clemens did almost exactly eight ye...
7598,Business,SINGAPORE : Doctors in the United States have ...


# 3. Train Deep Learning Classifier using nlu.load('train.classifier')

Your dataset's label column should be named 'y', and the feature column containing the text data should be named 'text'.
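
If your own data uses different column names, a rename step before fitting is usually enough. The following is a minimal sketch with pandas; the source column names `category` and `description` and the two example rows are hypothetical and only serve to illustrate the renaming.

```python
import pandas as pd

# Hypothetical raw data whose column names differ from what the trainable pipeline expects
raw_df = pd.DataFrame(
    {
        "category": ["Business", "Sports"],
        "description": [
            "Shares rallied after the earnings call.",
            "The home team won in overtime.",
        ],
    }
)

# Rename to the expected names: 'y' for the label, 'text' for the feature
train_ready_df = raw_df.rename(columns={"category": "y", "description": "text"})[["y", "text"]]

# Assumption: rows with a missing label or text are not useful for training, so drop them
train_ready_df = train_ready_df.dropna(subset=["y", "text"]).reset_index(drop=True)

print(train_ready_df.columns.tolist())
```

The resulting `train_ready_df` can then be passed to `fit` in place of `train_df`.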

In [None]:
# Load a trainable pipeline by specifying the train. prefix and fit it on a dataset with label and text columns
fitted_pipe = nlu.load('train.classifier').fit(train_df)

# Predict with the fitted pipeline on the same dataset and get predictions
preds = fitted_pipe.predict(train_df)
preds

tfhub_use download started this may take some time.
Approximate size to download 923.7 MB
[OK!]


origin_index,category_confidence,sentence,category,y,text,sentence_embeddings
0,1.000000,Unions representing workers at Turner Newall s...,Business,Business,Unions representing workers at Turner Newall...,"[0.012997539713978767, 0.019844762980937958, -..."
1,1.000000,"TORONTO, Canada A second team of rocketeers co...",Sports,Sci/Tech,"TORONTO, Canada A second team of rocketeer...","[0.023022323846817017, -0.01595703884959221, -..."
1,1.000000,"10 million Ansari X Prize, a contest for priva...",Sports,Sci/Tech,"TORONTO, Canada A second team of rocketeer...","[-0.010587693192064762, 0.011531050316989422, ..."
2,0.998290,A company founded by a chemistry researcher at...,Sci/Tech,Sci/Tech,A company founded by a chemistry researcher a...,"[0.038641855120658875, 0.02322080172598362, -0..."
3,0.999998,It's barely dawn when Mike Fitzpatrick starts ...,Sci/Tech,Sci/Tech,It's barely dawn when Mike Fitzpatrick starts...,"[-0.006857294123619795, 0.01967567577958107, -..."
...,...,...,...,...,...,...
7596,1.000000,.,Sports,Sports,With the supply of attractive pitching options...,"[0.005107458680868149, -0.011805553920567036, ..."
7596,2.000000,.,Sports,Sports,With the supply of attractive pitching options...,"[0.005107458680868149, -0.011805553920567036, ..."
7597,1.000000,Like Roger Clemens did almost exactly eight ye...,Sports,Sports,Like Roger Clemens did almost exactly eight ye...,"[0.044696468859910965, 0.0015660696662962437, ..."
7598,0.999999,SINGAPORE : Doctors in the United States have ...,Business,Business,SINGAPORE : Doctors in the United States have ...,"[0.05564942583441734, -0.021285761147737503, -..."
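
The prediction output above contains several rows for some `origin_index` values, apparently one per detected sentence. If you need a single label per input document, one possible approach (an assumption for illustration, not something NLU does for you here) is to keep the most confident sentence-level prediction per document. The sketch below uses a small hand-made DataFrame that mimics the `category` and `category_confidence` columns; the values are invented.

```python
import pandas as pd

# Toy stand-in for the sentence-level predictions (values are made up)
toy_preds = pd.DataFrame(
    {
        "category": ["Sports", "Sci/Tech", "Sci/Tech", "Business"],
        "category_confidence": [0.60, 0.90, 0.80, 0.99],
    },
    index=pd.Index([1, 1, 1, 2], name="origin_index"),
)


def document_level(preds: pd.DataFrame) -> pd.DataFrame:
    """Keep, for every origin_index, the row with the highest category_confidence."""
    # idxmax needs unique row labels, so move origin_index into a regular column first
    flat = preds.reset_index()
    best_rows = flat.groupby("origin_index")["category_confidence"].idxmax()
    return flat.loc[best_rows].set_index("origin_index")


doc_preds = document_level(toy_preds)
print(doc_preds)
```

Majority voting over the sentences of a document would be an equally reasonable choice; which one works better depends on your data.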


# 4. Evaluate the model

In [None]:
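from sklearn.metrics import classification_report
# preds holds sentence-level rows and was computed on the data the model was fitted on,
# so this report reflects fit on the training data rather than held-out performance
print(classification_report(preds['y'], preds['category']))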


              precision    recall  f1-score   support

    Business       0.79      0.85      0.82      3671
    Sci/Tech       0.84      0.84      0.84      3983
      Sports       0.89      0.93      0.91      3687
       World       0.92      0.80      0.85      3058

    accuracy                           0.85     14399
   macro avg       0.86      0.85      0.85     14399
weighted avg       0.86      0.85      0.85     14399
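
To make the columns of the report concrete, here is a minimal pure-Python sketch of how per-class precision, recall, and F1 can be computed from two label lists. The helper name `per_class_scores` and the toy labels are hypothetical; in practice `classification_report` already does this for you.

```python
from collections import Counter


def per_class_scores(y_true, y_pred):
    """Return {label: (precision, recall, f1)} for two equal-length label lists."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    true_count = Counter(y_true)  # how often each label truly occurs (support)
    pred_count = Counter(y_pred)  # how often each label was predicted
    true_pos = Counter(t for t, p in zip(y_true, y_pred) if t == p)

    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        tp = true_pos[label]
        precision = tp / pred_count[label] if pred_count[label] else 0.0
        recall = tp / true_count[label] if true_count[label] else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        scores[label] = (precision, recall, f1)
    return scores


# Toy labels, invented for illustration
y_true = ["Business", "Business", "Sports", "World"]
y_pred = ["Business", "Sports", "Sports", "World"]
scores = per_class_scores(y_true, y_pred)
for label, (p, r, f) in scores.items():
    print(f"{label:10s} precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```

Precision answers "of the rows predicted as this class, how many were right", while recall answers "of the rows that truly belong to this class, how many were found".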



# 5. Let's save the model

In [None]:
stored_model_path = './models/classifier_dl_trained' 
fitted_pipe.save(stored_model_path)

Stored model in ./models/classifier_dl_trained


# 6. Let's load the model from disk
This makes offline NLU usage possible!   
You need to call nlu.load(path=path_to_the_pipe) to load a model/pipeline from disk.

In [None]:
hdd_pipe = nlu.load(path=stored_model_path)

preds = hdd_pipe.predict('Tesla plans to invest 10M into the ML sector')
preds

origin_index,category_confidence,sentence,category,sentence_embeddings
0,1.0,Tesla plans to invest 10M into the ML sector,Business,"[0.06685534119606018, -0.002633294090628624, -..."


In [None]:
hdd_pipe.print_info()

The following parameters are configurable for this NLU pipeline (You can copy paste the examples) :
>>> pipe['document_assembler'] has settable params:
pipe['document_assembler'].setCleanupMode('shrink')                            | Info: possible values: disabled, inplace, inplace_full, shrink, shrink_full, each, each_full, delete_full | Currently set to : shrink
>>> pipe['sentence_detector'] has settable params:
pipe['sentence_detector'].setCustomBounds([])                                  | Info: characters used to explicitly mark sentence bounds | Currently set to : []
pipe['sentence_detector'].setDetectLists(True)                                 | Info: whether detect lists during sentence detection | Currently set to : True
pipe['sentence_detector'].setExplodeSentences(False)                           | Info: whether to explode each sentence into a different row, for better parallelization. Defaults to false. | Currently set to : False
pipe['sentence_detector'].setMaxLength(99999