![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/https://github.com/JohnSnowLabs/nlu/blob/master/examples/collab/Training/NLU_training_sentiment_classifier_demo.ipynb)



# Training a Sentiment Analysis Classifier with NLU 
With the [ClassifierDL model](https://nlp.johnsnowlabs.com/docs/en/annotators#classifierdl-multi-class-text-classification) from Spark NLP you can achieve state-of-the-art results on multi-class text classification problems.

This notebook showcases the following features:

- How to train the deep learning classifier
- How to store a pipeline to disk
- How to load the pipeline from disk (Enables NLU offline mode)



# 1. Install Java 8 and NLU

In [None]:
import os
from sklearn.metrics import classification_report
! apt-get update -qq > /dev/null   
# Install java
! apt-get install -y openjdk-8-jdk-headless -qq > /dev/null
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["PATH"] = os.environ["JAVA_HOME"] + "/bin:" + os.environ["PATH"]
! pip install nlu pyspark==2.4.7 > /dev/null  


import nlu

# 2. Download Stock Market Sentiment dataset 
https://www.kaggle.com/yash612/stockmarket-sentiment-dataset

In [2]:
! wget http://ckl-it.de/wp-content/uploads/2020/11/stock_data.csv


--2020-12-14 02:31:32--  http://ckl-it.de/wp-content/uploads/2020/11/stock_data.csv
Resolving ckl-it.de (ckl-it.de)... 217.160.0.108, 2001:8d8:100f:f000::209
Connecting to ckl-it.de (ckl-it.de)|217.160.0.108|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 479973 (469K) [text/csv]
Saving to: ‘stock_data.csv’


2020-12-14 02:31:32 (5.30 MB/s) - ‘stock_data.csv’ saved [479973/479973]



In [3]:
import pandas as pd
train_path = '/content/stock_data.csv'

train_df = pd.read_csv(train_path)
# the text data to use for classification should be in a column named 'text'
# the label column must be named 'y' and be of type str
train_df.columns=['text','y']
train_df.y = train_df.y.astype(str)
train_df.y = train_df.y.str.replace('-1','negative')
train_df.y = train_df.y.str.replace('1','positive')
train_df

Unnamed: 0,text,y
0,Kickers on my watchlist XIDE TIT SOQ PNK CPW B...,positive
1,user: AAP MOVIE. 55% return for the FEA/GEED i...,positive
2,user I'd be afraid to short AMZN - they are lo...,positive
3,MNTA Over 12.00,positive
4,OI Over 21.37,positive
...,...,...
5786,Industry body CII said #discoms are likely to ...,negative
5787,"#Gold prices slip below Rs 46,000 as #investor...",negative
5788,Workers at Bajaj Auto have agreed to a 10% wag...,positive
5789,"#Sharemarket LIVE: Sensex off day’s high, up 6...",positive
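
The two chained `str.replace` calls in the cell above give the right labels only because `'-1'` is replaced before `'1'`. The following is a minimal sketch in plain Python of that order dependence, together with an exact dictionary lookup that avoids it. `LABEL_NAMES`, `chained_replace`, and `to_label` are illustrative names introduced here, not part of NLU or the dataset.

```python
# Illustrative only: LABEL_NAMES, chained_replace and to_label are hypothetical helpers.
LABEL_NAMES = {"-1": "negative", "1": "positive"}


def chained_replace(value, replacements):
    """Apply substring replacements in order, like repeated str.replace calls."""
    for old, new in replacements:
        value = value.replace(old, new)
    return value


def to_label(raw):
    """Map a raw label to its name with an exact lookup; unknown values raise KeyError."""
    return LABEL_NAMES[str(raw).strip()]


# Order used in the notebook: '-1' first, then '1'.
notebook_order = [("-1", "negative"), ("1", "positive")]
# Reversed order: the '1' inside '-1' is rewritten first, which corrupts the label.
reversed_order = [("1", "positive"), ("-1", "negative")]

print(chained_replace("-1", notebook_order))
print(chained_replace("-1", reversed_order))
print([to_label(v) for v in (1, -1, "1", "-1")])
```

With pandas, the same exact lookup can be written as `train_df.y.map(LABEL_NAMES)` after the `astype(str)` step.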


# 3. Train Deep Learning Classifier using nlu.load('train.sentiment')

Your dataset's label column should be named 'y' and the feature column with the text data should be named 'text'.

In [4]:
import nlu 
# load a trainable pipeline by specifying the train. prefix and fit it on a dataset with label and text columns
# by default, Universal Sentence Encoder (USE) sentence embeddings are used as features
trainable_pipe = nlu.load('train.sentiment')
fitted_pipe = trainable_pipe.fit(train_df)

# predict with the fitted pipeline on the dataset and get predictions
preds = fitted_pipe.predict(train_df,output_level='document')
# the sentence detector that is part of the pipe generates some NaNs; let's drop them first
preds.dropna(inplace=True)
print(classification_report(preds['y'], preds['sentiment']))

preds

tfhub_use download started this may take some time.
Approximate size to download 923.7 MB
[OK!]
              precision    recall  f1-score   support

    negative       0.68      0.51      0.58      2106
     neutral       0.00      0.00      0.00         0
    positive       0.77      0.80      0.79      3685

    accuracy                           0.70      5791
   macro avg       0.49      0.44      0.46      5791
weighted avg       0.74      0.70      0.71      5791



origin_index,sentiment_confidence,y,default_name_embeddings,sentiment,text,document
0,0.963249,positive,"[0.006487144622951746, -0.042024899274110794, ...",positive,Kickers on my watchlist XIDE TIT SOQ PNK CPW B...,Kickers on my watchlist XIDE TIT SOQ PNK CPW B...
1,0.745172,positive,"[-0.03017628937959671, -0.0627138689160347, -0...",positive,user: AAP MOVIE. 55% return for the FEA/GEED i...,user: AAP MOVIE. 55% return for the FEA/GEED i...
2,0.739636,positive,"[0.05556508153676987, -0.016491785645484924, 0...",negative,user I'd be afraid to short AMZN - they are lo...,user I'd be afraid to short AMZN - they are lo...
3,0.926803,positive,"[-0.01097656786441803, -0.02980119362473488, -...",positive,MNTA Over 12.00,MNTA Over 12.00
4,0.520808,positive,"[0.024849386885762215, 0.04679658263921738, -0...",neutral,OI Over 21.37,OI Over 21.37
...,...,...,...,...,...,...
5786,0.911442,negative,"[0.020985644310712814, -0.03145354613661766, -...",negative,Industry body CII said #discoms are likely to ...,Industry body CII said #discoms are likely to ...
5787,0.980318,negative,"[0.05627664923667908, 0.012842322699725628, -0...",negative,"#Gold prices slip below Rs 46,000 as #investor...","#Gold prices slip below Rs 46,000 as #investor..."
5788,0.947936,positive,"[0.01210737880319357, -0.02798214927315712, -0...",negative,Workers at Bajaj Auto have agreed to a 10% wag...,Workers at Bajaj Auto have agreed to a 10% wag...
5789,0.928788,positive,"[0.0031773506198078394, -0.04296385496854782, ...",positive,"#Sharemarket LIVE: Sensex off day’s high, up 6...","#Sharemarket LIVE: Sensex off day’s high, up 6..."
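
The `neutral` row with support 0 appears in the report because the classifier can emit `neutral` (see row 4 above) even though no true label is `neutral`. As a rough sketch of how the per-class numbers in such a report are derived, here is a plain-Python version of precision, recall, and support. `per_class_metrics` is a hypothetical helper and a simplified stand-in for what `classification_report` computes; it uses 0.0 wherever a ratio is undefined, and the toy labels are made up for illustration.

```python
from collections import Counter


def per_class_metrics(y_true, y_pred):
    """Return precision, recall and support per label (0.0 when a ratio is undefined)."""
    labels = sorted(set(y_true) | set(y_pred))
    true_counts = Counter(y_true)   # how often each label is the true label (support)
    pred_counts = Counter(y_pred)   # how often each label was predicted
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)  # correct predictions
    report = {}
    for label in labels:
        precision = hits[label] / pred_counts[label] if pred_counts[label] else 0.0
        recall = hits[label] / true_counts[label] if true_counts[label] else 0.0
        report[label] = {
            "precision": precision,
            "recall": recall,
            "support": true_counts[label],
        }
    return report


# Toy data: 'neutral' is predicted once but never occurs as a true label.
y_true = ["positive", "positive", "negative", "negative", "positive"]
y_pred = ["positive", "neutral", "negative", "positive", "positive"]

for label, row in per_class_metrics(y_true, y_pred).items():
    print(label, row)
```

Every row predicted as `neutral` counts as a miss for its true class, which lowers that class's recall.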


# Test the fitted pipe on a new example

In [5]:
fitted_pipe.predict("Bitcoin is going to the moon!")

origin_index,sentiment_confidence,default_name_embeddings,sentiment,document
0,0.896826,"[0.06468033790588379, -0.040837567299604416, -...",positive,Bitcoin is going to the moon!


## Configure pipe training parameters

In [6]:
trainable_pipe.print_info()

The following parameters are configurable for this NLU pipeline (You can copy paste the examples) :
>>> pipe['sentiment_dl'] has settable params:
pipe['sentiment_dl'].setMaxEpochs(2)                 | Info: Maximum number of epochs to train | Currently set to : 2
pipe['sentiment_dl'].setLr(0.005)                    | Info: Learning Rate | Currently set to : 0.005
pipe['sentiment_dl'].setBatchSize(64)                | Info: Batch size | Currently set to : 64
pipe['sentiment_dl'].setDropout(0.5)                 | Info: Dropout coefficient | Currently set to : 0.5
pipe['sentiment_dl'].setEnableOutputLogs(True)       | Info: Whether to use stdout in addition to Spark logs. | Currently set to : True
pipe['sentiment_dl'].setThreshold(0.6)               | Info: The minimum threshold for the final result otheriwse it will be neutral | Currently set to : 0.6
pipe['sentiment_dl'].setThresholdLabel('neutral')    | Info: In case the score is less than threshold, what should be the label. Default i
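
The `setThreshold` and `setThresholdLabel` descriptions above account for the `neutral` predictions in the results: when the best class score is below the threshold, the threshold label is returned instead of the class. The following is a minimal sketch of that rule as the parameter info describes it. `apply_threshold` is a hypothetical function and the score dictionaries are made up; this is not the Spark NLP implementation.

```python
def apply_threshold(scores, threshold=0.6, threshold_label="neutral"):
    """Pick the highest-scoring class, or threshold_label if its score is below threshold."""
    label, confidence = max(scores.items(), key=lambda item: item[1])
    if confidence < threshold:
        return threshold_label, confidence
    return label, confidence


# Made-up class scores; the first confidence mirrors a low-confidence row from the results.
uncertain = {"positive": 0.520808, "negative": 0.479192}
confident = {"positive": 0.963249, "negative": 0.036751}

print(apply_threshold(uncertain))
print(apply_threshold(confident))
# Lowering the threshold turns the uncertain case back into a class prediction.
print(apply_threshold(uncertain, threshold=0.5))
```

Under this reading, lowering the value passed to `setThreshold` should reduce the number of `neutral` outputs, at the cost of accepting less confident predictions.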

## Retrain with new parameters

In [7]:
# Train longer!
trainable_pipe['sentiment_dl'].setMaxEpochs(5)  
fitted_pipe = trainable_pipe.fit(train_df)
# predict with the fitted pipeline on the dataset and get predictions
preds = fitted_pipe.predict(train_df,output_level='document')

# the sentence detector that is part of the pipe generates some NaNs; let's drop them first
preds.dropna(inplace=True)
print(classification_report(preds['y'], preds['sentiment']))

preds

              precision    recall  f1-score   support

    negative       0.83      0.51      0.63      2106
     neutral       0.00      0.00      0.00         0
    positive       0.80      0.92      0.85      3685

    accuracy                           0.77      5791
   macro avg       0.54      0.48      0.50      5791
weighted avg       0.81      0.77      0.77      5791



origin_index,sentiment_confidence,y,default_name_embeddings,sentiment,text,document
0,0.999228,positive,"[0.006487144622951746, -0.042024899274110794, ...",positive,Kickers on my watchlist XIDE TIT SOQ PNK CPW B...,Kickers on my watchlist XIDE TIT SOQ PNK CPW B...
1,0.992347,positive,"[-0.03017628937959671, -0.0627138689160347, -0...",positive,user: AAP MOVIE. 55% return for the FEA/GEED i...,user: AAP MOVIE. 55% return for the FEA/GEED i...
2,0.872415,positive,"[0.05556508153676987, -0.016491785645484924, 0...",positive,user I'd be afraid to short AMZN - they are lo...,user I'd be afraid to short AMZN - they are lo...
3,0.990298,positive,"[-0.01097656786441803, -0.02980119362473488, -...",positive,MNTA Over 12.00,MNTA Over 12.00
4,0.863170,positive,"[0.024849386885762215, 0.04679658263921738, -0...",positive,OI Over 21.37,OI Over 21.37
...,...,...,...,...,...,...
5786,0.930227,negative,"[0.020985644310712814, -0.03145354613661766, -...",negative,Industry body CII said #discoms are likely to ...,Industry body CII said #discoms are likely to ...
5787,0.994925,negative,"[0.05627664923667908, 0.012842322699725628, -0...",negative,"#Gold prices slip below Rs 46,000 as #investor...","#Gold prices slip below Rs 46,000 as #investor..."
5788,0.894794,positive,"[0.01210737880319357, -0.02798214927315712, -0...",negative,Workers at Bajaj Auto have agreed to a 10% wag...,Workers at Bajaj Auto have agreed to a 10% wag...
5789,0.992418,positive,"[0.0031773506198078394, -0.04296385496854782, ...",positive,"#Sharemarket LIVE: Sensex off day’s high, up 6...","#Sharemarket LIVE: Sensex off day’s high, up 6..."
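
The reports so far score the pipeline on the same rows it was fitted on, so they likely overstate how well it generalizes to unseen text. Here is a sketch of a simple hold-out split in plain Python. `split_rows` is a hypothetical helper, and the 80/20 ratio and the seed are arbitrary choices.

```python
import random


def split_rows(rows, test_fraction=0.2, seed=42):
    """Shuffle rows reproducibly and return (train_rows, test_rows)."""
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be strictly between 0 and 1")
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # local RNG, leaves the global seed untouched
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


# Row indices stand in for (text, label) pairs here.
train_rows, test_rows = split_rows(range(10))
print(len(train_rows), len(test_rows))
```

With pandas, a comparable split would be `fit_df = train_df.sample(frac=0.8, random_state=42)` for fitting and `train_df.drop(fit_df.index)` for evaluation, where `fit_df` is a name chosen here for illustration.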


# Try training with different Embeddings

In [8]:
# We can use nlu.print_components(action='embed_sentence') to see every possible sentence embedding we could use. Let's use BERT!
nlu.print_components(action='embed_sentence')

For language <en> NLU provides the following Models : 
nlu.load('en.embed_sentence') returns Spark NLP model tfhub_use
nlu.load('en.embed_sentence.use') returns Spark NLP model tfhub_use
nlu.load('en.embed_sentence.tfhub_use') returns Spark NLP model tfhub_use
nlu.load('en.embed_sentence.use.lg') returns Spark NLP model tfhub_use_lg
nlu.load('en.embed_sentence.tfhub_use.lg') returns Spark NLP model tfhub_use_lg
nlu.load('en.embed_sentence.albert') returns Spark NLP model albert_base_uncased
nlu.load('en.embed_sentence.electra') returns Spark NLP model sent_electra_small_uncased
nlu.load('en.embed_sentence.electra_small_uncased') returns Spark NLP model sent_electra_small_uncased
nlu.load('en.embed_sentence.electra_base_uncased') returns Spark NLP model sent_electra_base_uncased
nlu.load('en.embed_sentence.electra_large_uncased') returns Spark NLP model sent_electra_large_uncased
nlu.load('en.embed_sentence.bert') returns Spark NLP model sent_bert_base_uncased
nlu.load('en.embed_sentenc

In [14]:
trainable_pipe = nlu.load('embed_sentence.bert train.sentiment')
# For sentence embeddings other than USE we usually need to train longer and use a smaller learning rate
# We could tune the hyperparameters further with methods such as grid search
# Training longer can also improve accuracy
trainable_pipe['sentiment_dl'].setMaxEpochs(40)  
trainable_pipe['sentiment_dl'].setLr(0.0005) 
fitted_pipe = trainable_pipe.fit(train_df)
# predict with the fitted pipeline on the dataset and get predictions
preds = fitted_pipe.predict(train_df,output_level='document')

# the sentence detector that is part of the pipe generates some NaNs; let's drop them first
preds.dropna(inplace=True)
print(classification_report(preds['y'], preds['sentiment']))

preds

sent_small_bert_L2_128 download started this may take some time.
Approximate size to download 16.1 MB
[OK!]
              precision    recall  f1-score   support

    negative       0.68      0.24      0.36      2106
     neutral       0.00      0.00      0.00         0
    positive       0.72      0.84      0.78      3685

    accuracy                           0.62      5791
   macro avg       0.47      0.36      0.38      5791
weighted avg       0.70      0.62      0.62      5791



origin_index,embed_sentence_bert_embeddings,sentiment_confidence,y,sentiment,text,document
0,"[-0.920756995677948, 0.21013422310352325, 0.10...",0.893553,positive,positive,Kickers on my watchlist XIDE TIT SOQ PNK CPW B...,Kickers on my watchlist XIDE TIT SOQ PNK CPW B...
1,"[-0.43004682660102844, 0.510123074054718, -0.2...",0.636862,positive,positive,user: AAP MOVIE. 55% return for the FEA/GEED i...,user: AAP MOVIE. 55% return for the FEA/GEED i...
2,"[0.3040032386779785, 0.228629469871521, -0.531...",0.763758,positive,positive,user I'd be afraid to short AMZN - they are lo...,user I'd be afraid to short AMZN - they are lo...
3,"[-1.8103487491607666, -0.47991353273391724, -0...",0.972610,positive,positive,MNTA Over 12.00,MNTA Over 12.00
4,"[-2.4639296531677246, 0.38795894384384155, -0....",0.969021,positive,positive,OI Over 21.37,OI Over 21.37
...,...,...,...,...,...,...
5786,"[-0.09503862261772156, 0.6293947696685791, 0.0...",0.727308,negative,negative,Industry body CII said #discoms are likely to ...,Industry body CII said #discoms are likely to ...
5787,"[-0.12879404425621033, 0.28170254826545715, 0....",0.749509,negative,negative,"#Gold prices slip below Rs 46,000 as #investor...","#Gold prices slip below Rs 46,000 as #investor..."
5788,"[-0.33955836296081543, 0.9124065041542053, -0....",0.798394,positive,negative,Workers at Bajaj Auto have agreed to a 10% wag...,Workers at Bajaj Auto have agreed to a 10% wag...
5789,"[-0.6081283092498779, 0.2732301950454712, 0.25...",0.511369,positive,neutral,"#Sharemarket LIVE: Sensex off day’s high, up 6...","#Sharemarket LIVE: Sensex off day’s high, up 6..."
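
The comments in the cell above mention grid search as a way to tune the hyperparameters further. Here is a minimal sketch of an exhaustive grid over the number of epochs and the learning rate. `grid_search` and `fake_score` are hypothetical; `fake_score` is only a placeholder objective so the example runs on its own, not a real measurement of this model.

```python
from itertools import product


def grid_search(param_grid, score_fn):
    """Try every combination in param_grid and return (best_params, best_score)."""
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def fake_score(params):
    # Placeholder objective. In the notebook this would instead configure the pipe with
    # trainable_pipe['sentiment_dl'].setMaxEpochs(...) and .setLr(...), fit it, and
    # return a metric such as accuracy on held-out data.
    return -abs(params["max_epochs"] - 20) - 1000 * abs(params["lr"] - 0.001)


grid = {"max_epochs": [5, 20, 40], "lr": [0.005, 0.001, 0.0005]}
best_params, best_score = grid_search(grid, fake_score)
print(best_params, best_score)
```

Each combination means a full training run, so the grid is best kept small when epochs are high.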


# 5. Let's save the model

In [10]:
stored_model_path = './models/classifier_dl_trained' 
fitted_pipe.save(stored_model_path)

Stored model in ./models/classifier_dl_trained
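
Before loading a stored pipeline, it can help to confirm that the save step actually wrote something to the path. Here is a small sketch using only the standard library. `is_saved_pipeline_dir` is a hypothetical helper, and it assumes the pipeline is stored as a non-empty directory; it does not validate the contents.

```python
import os
import tempfile


def is_saved_pipeline_dir(path):
    """Return True if path is an existing directory that contains at least one entry."""
    return os.path.isdir(path) and len(os.listdir(path)) > 0


# Demonstrate on throwaway directories instead of a real stored pipeline.
with tempfile.TemporaryDirectory() as tmp:
    empty_dir = os.path.join(tmp, "empty")
    os.makedirs(empty_dir)

    filled_dir = os.path.join(tmp, "filled")
    os.makedirs(filled_dir)
    with open(os.path.join(filled_dir, "metadata"), "w") as handle:
        handle.write("placeholder")

    print(is_saved_pipeline_dir(empty_dir))
    print(is_saved_pipeline_dir(filled_dir))
    print(is_saved_pipeline_dir(os.path.join(tmp, "missing")))
```

In the notebook, the check would be called as `is_saved_pipeline_dir(stored_model_path)`.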


# 6. Let's load the model from disk
This makes offline NLU usage possible!   
You need to call nlu.load(path=path_to_the_pipe) to load a model/pipeline from disk.

In [11]:
hdd_pipe = nlu.load(path=stored_model_path)

preds = hdd_pipe.predict('Tesla plans to invest 10M into the ML sector')
preds

Fitting on empty Dataframe, could not infer correct training method!


origin_index,embed_sentence_bert_embeddings,sentiment_confidence,sentiment,document
0,"[-0.07111622393131256, 0.9532926082611084, -1....",1.0,positive,Tesla plans to invest 10M into the ML sector


In [12]:
hdd_pipe.print_info()

The following parameters are configurable for this NLU pipeline (You can copy paste the examples) :
>>> pipe['document_assembler'] has settable params:
pipe['document_assembler'].setCleanupMode('shrink')           | Info: possible values: disabled, inplace, inplace_full, shrink, shrink_full, each, each_full, delete_full | Currently set to : shrink
>>> pipe['regex_tokenizer'] has settable params:
pipe['regex_tokenizer'].setCaseSensitiveExceptions(True)      | Info: Whether to care for case sensitiveness in exceptions | Currently set to : True
pipe['regex_tokenizer'].setTargetPattern('\S+')               | Info: pattern to grab from text as token candidates. Defaults \S+ | Currently set to : \S+
pipe['regex_tokenizer'].setMaxLength(99999)                   | Info: Set the maximum allowed length for each token | Currently set to : 99999
pipe['regex_tokenizer'].setMinLength(0)                       | Info: Set the minimum allowed length for each token | Currently set to : 0
>>> pipe['sente