![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/nlu/blob/master/examples/colab/Training/binary_text_classification/NLU_training_sentiment_classifier_demo.ipynb)



# Training a Sentiment Analysis Classifier with NLU
With the [ClassifierDL model](https://nlp.johnsnowlabs.com/docs/en/annotators#classifierdl-multi-class-text-classification) from Spark NLP, you can achieve state-of-the-art results on many multi-class text classification problems.

This notebook showcases the following features:

- How to train the deep learning classifier
- How to store a pipeline to disk
- How to load the pipeline from disk (Enables NLU offline mode)



# 1. Install Java 8 and NLU

In [None]:
!pip install -q johnsnowlabs

# 2. Download Stock Market Sentiment dataset
https://www.kaggle.com/yash612/stockmarket-sentiment-dataset

In [None]:
! wget https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/resources/en/classifier-dl/stock_data/stock_data.csv


In [None]:
from johnsnowlabs import nlp
sentiment = nlp.load('sentiment')

sentimentdl_glove_imdb download started this may take some time.
Approximate size to download 8.7 MB
[OK!]
glove_100d download started this may take some time.
Approximate size to download 145.3 MB
[OK!]


In [None]:
sentiment.predict("I'm very very not at all happy")

sentence_detector_dl download started this may take some time.
Approximate size to download 354.6 KB
[OK!]


Unnamed: 0,sentence,sentence_embedding_converter,sentiment,sentiment_confidence,word_embedding_glove
0,I'm very very not at all happy,"[-0.2865465581417084, 0.25398728251457214, 0.2...",pos,0.999995,"[[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,..."


In [2]:
import pandas as pd
train_path = '/content/stock_data.csv'

train_df = pd.read_csv(train_path)
# The text data to use for classification should be in a column named 'text'.
# The label column must be named 'y' and be of type str.
train_df.columns=['text','y']
train_df.y = train_df.y.astype(str)
train_df.y = train_df.y.str.replace('-1','negative')
train_df.y = train_df.y.str.replace('1','positive')
train_df

Unnamed: 0,text,y
0,Kickers on my watchlist XIDE TIT SOQ PNK CPW B...,positive
1,user: AAP MOVIE. 55% return for the FEA/GEED i...,positive
2,user I'd be afraid to short AMZN - they are lo...,positive
3,MNTA Over 12.00,positive
4,OI Over 21.37,positive
...,...,...
5786,Industry body CII said #discoms are likely to ...,negative
5787,"#Gold prices slip below Rs 46,000 as #investor...",negative
5788,Workers at Bajaj Auto have agreed to a 10% wag...,positive
5789,"#Sharemarket LIVE: Sensex off day’s high, up 6...",positive
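
The two chained `str.replace` calls above depend on their order: `'-1'` has to be rewritten before `'1'`, otherwise the `1` inside `-1` would be replaced first. A minimal sketch of an order-independent alternative, assuming the raw label column holds only the integers 1 and -1 (the toy frame and its original column names are made up for illustration):

```python
import pandas as pd

# Toy stand-in for stock_data.csv; the real file is not read here.
raw = pd.DataFrame({
    "Text": ["AAP breaking out", "short AMZN here", "MNTA over 12.00"],
    "Sentiment": [1, -1, 1],
})

df = raw.copy()
df.columns = ["text", "y"]

# Map each raw label explicitly; any unexpected value becomes NaN instead of
# being silently rewritten.
label_names = {1: "positive", -1: "negative"}
df["y"] = df["y"].map(label_names)

# Fail early if the input contained labels other than 1 and -1.
assert df["y"].notna().all(), "unexpected label value in input"

# Class balance is worth checking before training.
class_counts = df["y"].value_counts()
print(class_counts)
```

The class counts matter later: with an imbalanced dataset, a high accuracy can come from predicting the majority class only.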


# 3. Train Deep Learning Classifier using nlp.load('train.sentiment')

Your dataset's label column should be named 'y' and the feature column with the text data should be named 'text'.

In [None]:
from sklearn.metrics import classification_report
# Load a trainable pipeline by specifying the train. prefix and fit it on a dataset with label and text columns.
# In this run the pipeline downloads sent_small_bert_L2_128 sentence embeddings (see the log below), not the Universal Sentence Encoder (USE).
trainable_pipe = nlp.load('train.sentiment')
fitted_pipe = trainable_pipe.fit(train_df)

# Predict with the fitted pipeline on the dataset and get predictions
preds = fitted_pipe.predict(train_df,output_level='document')
# The sentence detector that is part of the pipe generates some NaNs. Let's drop them first
preds.dropna(inplace=True)
print(classification_report(preds['y'], preds['sentiment']))

preds

sent_small_bert_L2_128 download started this may take some time.
Approximate size to download 16.1 MB
[OK!]
              precision    recall  f1-score   support

    negative       0.00      0.00      0.00      2106
    positive       0.64      1.00      0.78      3685

    accuracy                           0.64      5791
   macro avg       0.32      0.50      0.39      5791
weighted avg       0.40      0.64      0.49      5791



Unnamed: 0,document,sentence_embedding_small_bert_L2_128,sentiment,sentiment_confidence,text,y
0,Kickers on my watchlist XIDE TIT SOQ PNK CPW B...,"[-0.9530864357948303, 0.2135828286409378, 0.10...",positive,1.0,Kickers on my watchlist XIDE TIT SOQ PNK CPW B...,positive
1,user: AAP MOVIE. 55% return for the FEA/GEED i...,"[-0.4725969433784485, 0.5354134440422058, -0.2...",positive,1.0,user: AAP MOVIE. 55% return for the FEA/GEED i...,positive
2,user I'd be afraid to short AMZN - they are lo...,"[0.30400288105010986, 0.22862982749938965, -0....",positive,1.0,user I'd be afraid to short AMZN - they are lo...,positive
3,MNTA Over 12.00,"[-1.707902193069458, -0.48472753167152405, -0....",positive,1.0,MNTA Over 12.00,positive
4,OI Over 21.37,"[-2.3011534214019775, 0.2649511396884918, -0.4...",positive,1.0,OI Over 21.37,positive
...,...,...,...,...,...,...
5786,Industry body CII said #discoms are likely to ...,"[-0.21655204892158508, 0.6153537631034851, 0.0...",positive,1.0,Industry body CII said #discoms are likely to ...,negative
5787,"#Gold prices slip below Rs 46,000 as #investor...","[-0.19915254414081573, 0.2607441842556, 0.0032...",positive,1.0,"#Gold prices slip below Rs 46,000 as #investor...",negative
5788,Workers at Bajaj Auto have agreed to a 10% wag...,"[-0.4361518919467926, 0.9346759915351868, -0.3...",positive,1.0,Workers at Bajaj Auto have agreed to a 10% wag...,positive
5789,"#Sharemarket LIVE: Sensex off day’s high, up 6...","[-0.6081278920173645, 0.2732301354408264, 0.25...",positive,1.0,"#Sharemarket LIVE: Sensex off day’s high, up 6...",positive
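
The report above matches what a classifier that always answers `positive` would score: recall is 1.00 for `positive` and 0.00 for `negative`, so the 0.64 accuracy is simply the share of positive rows. A short check of that arithmetic, using only the support counts printed in the report:

```python
# Support counts copied from the classification report above.
support = {"negative": 2106, "positive": 3685}
total = sum(support.values())

# A model that always predicts the majority class gets every majority row
# right and every other row wrong.
majority_label = max(support, key=support.get)
baseline_accuracy = support[majority_label] / total

# Per-class precision for that always-positive model: undefined (reported as
# 0.00) for the class never predicted, and the baseline accuracy for the other.
precision = {"negative": 0.0, "positive": baseline_accuracy}
macro_precision = sum(precision.values()) / len(precision)

print(f"total rows:        {total}")
print(f"baseline accuracy: {baseline_accuracy:.2f}")
print(f"macro precision:   {macro_precision:.2f}")
```

Any later configuration is best compared against this majority-class baseline rather than against zero.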


# Test the fitted pipe on a new example

In [None]:
fitted_pipe.predict("Bitcoin is going to the moon!")

sentence_detector_dl download started this may take some time.
Approximate size to download 354.6 KB
[OK!]


Unnamed: 0,sentence,sentence_embedding_small_bert_L2_128,sentiment,sentiment_confidence
0,Bitcoin is going to the moon!,"[-1.0531491041183472, -0.2827455699443817, -0....",positive,1.0


## Configure pipe training parameters

In [None]:
trainable_pipe.print_info()

The following parameters are configurable for this NLU pipeline (You can copy paste the examples) :
>>> component_list['bert_sentence_embeddings@sent_small_bert_L2_128'] has settable params:
component_list['bert_sentence_embeddings@sent_small_bert_L2_128'].setBatchSize(8)              | Info: Size of every batch | Currently set to : 8
component_list['bert_sentence_embeddings@sent_small_bert_L2_128'].setEngine('tensorflow')      | Info: Deep Learning engine used for this model | Currently set to : tensorflow
component_list['bert_sentence_embeddings@sent_small_bert_L2_128'].setIsLong(False)             | Info: Use Long type instead of Int type for inputs buffer - Some Bert models require Long instead of Int. | Currently set to : False
component_list['bert_sentence_embeddings@sent_small_bert_L2_128'].setMaxSentenceLength(128)    | Info: Max sentence length to process | Currently set to : 128
component_list['bert_sentence_embeddings@sent_small_bert_L2_128'].setDimension(128)            | I

## Retrain with new parameters

In [None]:
# Train longer!
trainable_pipe = nlp.load('train.sentiment')
trainable_pipe['trainable_sentiment_dl'].setMaxEpochs(5)
fitted_pipe = trainable_pipe.fit(train_df)
# Predict with the fitted pipeline on the dataset and get predictions
preds = fitted_pipe.predict(train_df,output_level='document')

# The sentence detector that is part of the pipe generates some NaNs. Let's drop them first
preds.dropna(inplace=True)
print(classification_report(preds['y'], preds['sentiment']))

preds

sent_small_bert_L2_128 download started this may take some time.
Approximate size to download 16.1 MB
[OK!]
              precision    recall  f1-score   support

    negative       0.00      0.00      0.00      2106
    positive       0.64      1.00      0.78      3685

    accuracy                           0.64      5791
   macro avg       0.32      0.50      0.39      5791
weighted avg       0.40      0.64      0.49      5791



Unnamed: 0,document,sentence_embedding_small_bert_L2_128,sentiment,sentiment_confidence,text,y
0,Kickers on my watchlist XIDE TIT SOQ PNK CPW B...,"[-0.9530864357948303, 0.2135828286409378, 0.10...",positive,1.0,Kickers on my watchlist XIDE TIT SOQ PNK CPW B...,positive
1,user: AAP MOVIE. 55% return for the FEA/GEED i...,"[-0.4725969433784485, 0.5354134440422058, -0.2...",positive,1.0,user: AAP MOVIE. 55% return for the FEA/GEED i...,positive
2,user I'd be afraid to short AMZN - they are lo...,"[0.30400288105010986, 0.22862982749938965, -0....",positive,1.0,user I'd be afraid to short AMZN - they are lo...,positive
3,MNTA Over 12.00,"[-1.707902193069458, -0.48472753167152405, -0....",positive,1.0,MNTA Over 12.00,positive
4,OI Over 21.37,"[-2.3011534214019775, 0.2649511396884918, -0.4...",positive,1.0,OI Over 21.37,positive
...,...,...,...,...,...,...
5786,Industry body CII said #discoms are likely to ...,"[-0.21655204892158508, 0.6153537631034851, 0.0...",positive,1.0,Industry body CII said #discoms are likely to ...,negative
5787,"#Gold prices slip below Rs 46,000 as #investor...","[-0.19915254414081573, 0.2607441842556, 0.0032...",positive,1.0,"#Gold prices slip below Rs 46,000 as #investor...",negative
5788,Workers at Bajaj Auto have agreed to a 10% wag...,"[-0.4361518919467926, 0.9346759915351868, -0.3...",positive,1.0,Workers at Bajaj Auto have agreed to a 10% wag...,positive
5789,"#Sharemarket LIVE: Sensex off day’s high, up 6...","[-0.6081278920173645, 0.2732301354408264, 0.25...",positive,1.0,"#Sharemarket LIVE: Sensex off day’s high, up 6...",positive
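
Both reports so far score the pipeline on the same rows it was fitted on, so they cannot show how it behaves on unseen text. A minimal sketch of holding out a test set, assuming scikit-learn's `train_test_split`; the toy frame stands in for `train_df`, and the NLU calls are left as comments because they need the downloaded models:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for train_df with the same 'text' and 'y' columns.
data = pd.DataFrame({
    "text": [f"sample tweet {i}" for i in range(20)],
    "y": ["positive"] * 12 + ["negative"] * 8,
})

# stratify keeps the positive/negative ratio the same in both splits.
fit_df, test_df = train_test_split(
    data, test_size=0.25, stratify=data["y"], random_state=42
)

print(len(fit_df), len(test_df))
print(test_df["y"].value_counts().to_dict())

# With the real pipeline, fit on one split and report on the other:
# fitted_pipe = trainable_pipe.fit(fit_df)
# preds = fitted_pipe.predict(test_df, output_level='document')
```

Stratifying is a design choice here: with roughly 64% positive rows, an unstratified split of a small sample could shift the class ratio between the two parts.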


# Try training with different Embeddings

In [None]:
# We can use nlp.nlu.print_components(action='embed_sentence') to see every possible sentence embedding we could use. Let's use BERT!
nlp.nlu.print_components(action='embed_sentence')

For language <am> NLU provides the following Models : 
nlu.load('am.embed_sentence.xlm_roberta') returns Spark NLP model_anno_obj sent_xlm_roberta_base_finetuned_amharic
For language <de> NLU provides the following Models : 
nlu.load('de.embed_sentence.bert.base_cased') returns Spark NLP model_anno_obj sent_bert_base_cased
For language <el> NLU provides the following Models : 
nlu.load('el.embed_sentence.bert.base_uncased') returns Spark NLP model_anno_obj sent_bert_base_uncased
For language <en> NLU provides the following Models : 
nlu.load('en.embed_sentence') returns Spark NLP model_anno_obj tfhub_use
nlu.load('en.embed_sentence.albert') returns Spark NLP model_anno_obj albert_base_uncased
nlu.load('en.embed_sentence.bert') returns Spark NLP model_anno_obj sent_bert_base_uncased
nlu.load('en.embed_sentence.bert.base_uncased_legal') returns Spark NLP model_anno_obj sent_bert_base_uncased_legal
nlu.load('en.embed_sentence.bert.finetuned') returns Spark NLP model_anno_obj sbert_setfit_

In [None]:
trainable_pipe = nlp.load('embed_sentence.bert train.sentiment')
# Sentence embeddings other than USE usually need longer training and a smaller learning rate.
# The hyperparameters could be tuned further with methods such as grid search.
# Longer training often, but not always, improves accuracy.
trainable_pipe['trainable_sentiment_dl'].setMaxEpochs(40)
trainable_pipe['trainable_sentiment_dl'].setLr(0.0005)
fitted_pipe = trainable_pipe.fit(train_df)
# Predict with the fitted pipeline on the dataset and get predictions
preds = fitted_pipe.predict(train_df,output_level='document')

# The sentence detector that is part of the pipe generates some NaNs. Let's drop them first
preds.dropna(inplace=True)
print(classification_report(preds['y'], preds['sentiment']))

preds

sent_small_bert_L2_128 download started this may take some time.
Approximate size to download 16.1 MB
[OK!]
              precision    recall  f1-score   support

    negative       0.69      0.24      0.36      2106
     neutral       0.00      0.00      0.00         0
    positive       0.72      0.85      0.78      3685

    accuracy                           0.63      5791
   macro avg       0.47      0.36      0.38      5791
weighted avg       0.71      0.63      0.63      5791
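
The report above contains a `neutral` row with support 0: the pipeline emitted a label that never occurs in `y`, and that extra class pulls the macro average down. One possible explanation, offered as an assumption since the output here does not confirm it, is a confidence threshold below which the classifier falls back to a default label. A small sketch for spotting such labels, using plain Python lists as stand-ins for `preds['y']` and `preds['sentiment']` (the values are illustrative, not taken from the run):

```python
from collections import Counter

# Stand-ins for preds['y'] (gold) and preds['sentiment'] (predicted).
gold = ["positive", "positive", "negative", "negative", "positive"]
predicted = ["positive", "neutral", "negative", "positive", "neutral"]

# Labels the model produced that are absent from the gold labels.
unexpected = sorted(set(predicted) - set(gold))
print("labels only in predictions:", unexpected)

# How many rows are affected by each unexpected label.
affected = Counter(p for p in predicted if p in unexpected)
print("rows per unexpected label:", dict(affected))

# Accuracy treats every unexpected label as an error.
correct = sum(g == p for g, p in zip(gold, predicted))
print(f"accuracy: {correct / len(gold):.2f}")
```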



Unnamed: 0,document,sentence_embedding_bert,sentiment,sentiment_confidence,text,y
0,Kickers on my watchlist XIDE TIT SOQ PNK CPW B...,"[-0.9530864357948303, 0.2135828286409378, 0.10...",positive,0.0,Kickers on my watchlist XIDE TIT SOQ PNK CPW B...,positive
1,user: AAP MOVIE. 55% return for the FEA/GEED i...,"[-0.4725969433784485, 0.5354134440422058, -0.2...",positive,0.0,user: AAP MOVIE. 55% return for the FEA/GEED i...,positive
2,user I'd be afraid to short AMZN - they are lo...,"[0.30400288105010986, 0.22862982749938965, -0....",positive,0.0,user I'd be afraid to short AMZN - they are lo...,positive
3,MNTA Over 12.00,"[-1.707902193069458, -0.48472753167152405, -0....",positive,0.0,MNTA Over 12.00,positive
4,OI Over 21.37,"[-2.3011534214019775, 0.2649511396884918, -0.4...",positive,0.0,OI Over 21.37,positive
...,...,...,...,...,...,...
5786,Industry body CII said #discoms are likely to ...,"[-0.21655204892158508, 0.6153537631034851, 0.0...",negative,0.0,Industry body CII said #discoms are likely to ...,negative
5787,"#Gold prices slip below Rs 46,000 as #investor...","[-0.19915254414081573, 0.2607441842556, 0.0032...",negative,0.0,"#Gold prices slip below Rs 46,000 as #investor...",negative
5788,Workers at Bajaj Auto have agreed to a 10% wag...,"[-0.4361518919467926, 0.9346759915351868, -0.3...",negative,0.0,Workers at Bajaj Auto have agreed to a 10% wag...,positive
5789,"#Sharemarket LIVE: Sensex off day’s high, up 6...","[-0.6081278920173645, 0.2732301354408264, 0.25...",neutral,0.0,"#Sharemarket LIVE: Sensex off day’s high, up 6...",positive
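
The comments in the training cell above mention grid search as a way to tune the hyperparameters. A minimal sketch of that idea, assuming a hypothetical `train_and_score` function that fits a pipeline with the given settings and returns a validation score; here it is replaced by a dummy formula so the loop runs without Spark:

```python
from itertools import product

def train_and_score(max_epochs: int, lr: float) -> float:
    """Hypothetical stand-in. With NLU this would set the values via
    setMaxEpochs / setLr, fit on a training split, and return a metric
    computed on a held-out split."""
    # Dummy score so the example runs; it has no relation to real results.
    return 1.0 / (1.0 + abs(max_epochs - 20) / 20 + abs(lr - 0.001) * 100)

# Candidate values are illustrative assumptions, not recommendations.
epoch_grid = [5, 20, 40]
lr_grid = [0.0005, 0.001, 0.005]

results = []
for max_epochs, lr in product(epoch_grid, lr_grid):
    score = train_and_score(max_epochs, lr)
    results.append({"max_epochs": max_epochs, "lr": lr, "score": score})

# Pick the configuration with the highest score.
best = max(results, key=lambda r: r["score"])
print(f"tried {len(results)} configurations")
print("best:", best["max_epochs"], best["lr"])
```

Each grid point means a full training run, so the number of candidates per parameter is usually kept small.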


# 5. Let's save the model

In [None]:
stored_model_path = './models/classifier_dl_trained'
fitted_pipe.save(stored_model_path)

# 6. Let's load the model from HDD.
This makes offline NLU usage possible!   
You need to call nlp.load(path=path_to_the_pipe) to load a model/pipeline from disk.

In [None]:
hdd_pipe = nlp.load(path=stored_model_path)

preds = hdd_pipe.predict('Tesla plans to invest 10M into the ML sector')
preds



Unnamed: 0,document,sentence_embedding_from_disk,sentiment,sentiment_confidence
0,Tesla plans to invest 10M into the ML sector,"[-0.07111673802137375, 0.9532930850982666, -1....",positive,0.0


In [None]:
hdd_pipe.print_info()

The following parameters are configurable for this NLU pipeline (You can copy paste the examples) :
>>> component_list['document_assembler'] has settable params:
component_list['document_assembler'].setCleanupMode('shrink')                                  | Info: possible values: disabled, inplace, inplace_full, shrink, shrink_full, each, each_full, delete_full | Currently set to : shrink
>>> component_list['bert_sentence_embeddings@sent_small_bert_L2_128'] has settable params:
component_list['bert_sentence_embeddings@sent_small_bert_L2_128'].setBatchSize(8)              | Info: Size of every batch | Currently set to : 8
component_list['bert_sentence_embeddings@sent_small_bert_L2_128'].setCaseSensitive(False)      | Info: whether to ignore case in tokens for embeddings matching | Currently set to : False
component_list['bert_sentence_embeddings@sent_small_bert_L2_128'].setDimension(128)            | Info: Number of embedding dimensions | Currently set to : 128
component_list['bert_sen