![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/nlu/blob/master/examples/colab/Training/binary_text_classification/NLU_training_sentiment_classifier_demo.ipynb)



# Training a Sentiment Analysis Classifier with NLU 
With the [ClassifierDL model](https://nlp.johnsnowlabs.com/docs/en/annotators#classifierdl-multi-class-text-classification) from Spark NLP, you can achieve state-of-the-art results on multi-class text classification problems.

This notebook showcases the following features:

- How to train the deep learning classifier
- How to store a pipeline to disk
- How to load the pipeline from disk (Enables NLU offline mode)



# 1. Install Java 8 and NLU

In [None]:
!wget https://setup.johnsnowlabs.com/nlu/colab.sh -O - | bash
import nlu

--2021-05-05 11:44:27--  https://raw.githubusercontent.com/JohnSnowLabs/nlu/master/scripts/colab_setup.sh
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.111.133, 185.199.109.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1671 (1.6K) [text/plain]
Saving to: ‘STDOUT’

Installing NLU 3.0.0 with PySpark 3.0.2 and Spark NLP 3.0.1 for Google Colab ...

2021-05-05 11:44:27 (1.49 MB/s) - written to stdout [1671/1671]



# 2. Download Stock Market Sentiment dataset 
https://www.kaggle.com/yash612/stockmarket-sentiment-dataset

In [None]:
! wget http://ckl-it.de/wp-content/uploads/2020/11/stock_data.csv


--2021-05-05 11:46:32--  http://ckl-it.de/wp-content/uploads/2020/11/stock_data.csv
Resolving ckl-it.de (ckl-it.de)... 217.160.0.108, 2001:8d8:100f:f000::209
Connecting to ckl-it.de (ckl-it.de)|217.160.0.108|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 479973 (469K) [text/csv]
Saving to: ‘stock_data.csv’


2021-05-05 11:46:33 (846 KB/s) - ‘stock_data.csv’ saved [479973/479973]



In [None]:
import nlu
sentiment = nlu.load('sentiment')

analyze_sentiment download started this may take some time.
Approx size to download 4.9 MB
[OK!]


In [None]:
sentiment.predict("I'm very very not at all happy")

Unnamed: 0,token,sentiment,sentence,origin_index,spell,document,sentiment_confidence,text
0,"[I'm, very, very, not, at, all, happy]",[positive],[I'm very very not at all happy],8589934592,"[I'm, very, very, not, at, all, happy]",I'm very very not at all happy,[0.3043],I'm very very not at all happy


In [None]:
import pandas as pd
train_path = '/content/stock_data.csv'

train_df = pd.read_csv(train_path)
# The text to classify must be in a column named 'text'.
# The label column must be named 'y' and be of type str.
train_df.columns=['text','y']
train_df.y = train_df.y.astype(str)
# Replace '-1' before '1'; otherwise the '1' inside '-1' would be rewritten first
train_df.y = train_df.y.str.replace('-1','negative')
train_df.y = train_df.y.str.replace('1','positive')
train_df

Unnamed: 0,text,y
0,Kickers on my watchlist XIDE TIT SOQ PNK CPW B...,positive
1,user: AAP MOVIE. 55% return for the FEA/GEED i...,positive
2,user I'd be afraid to short AMZN - they are lo...,positive
3,MNTA Over 12.00,positive
4,OI Over 21.37,positive
...,...,...
5786,Industry body CII said #discoms are likely to ...,negative
5787,"#Gold prices slip below Rs 46,000 as #investor...",negative
5788,Workers at Bajaj Auto have agreed to a 10% wag...,positive
5789,"#Sharemarket LIVE: Sensex off day’s high, up 6...",positive
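The relabelling step above depends on replacing `'-1'` before `'1'`. A minimal alternative sketch, assuming the raw labels are only `1` and `-1`, uses an explicit lookup table so that each label is matched as a whole value. The toy frame and the name `label_names` are illustrative and not part of the original notebook.

```python
import pandas as pd

# Toy frame standing in for stock_data.csv (same raw labels: 1 and -1)
toy_df = pd.DataFrame({
    "text": ["AAP breaking out", "short this one", "MNTA Over 12.00"],
    "y": [1, -1, 1],
})

# Explicit lookup table: each raw label is matched as a whole value,
# so the order of the entries does not matter
label_names = {"1": "positive", "-1": "negative"}
toy_df["y"] = toy_df["y"].astype(str).map(label_names)

# A raw label missing from the lookup table would become NaN, so fail early
assert toy_df["y"].notna().all(), "unmapped label found"
print(toy_df["y"].tolist())  # ['positive', 'negative', 'positive']
```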


# 3. Train Deep Learning Classifier using nlu.load('train.sentiment')

Your dataset's label column should be named 'y' and the feature column with text data should be named 'text'.

In [None]:
from sklearn.metrics import classification_report
import nlu
# Load a trainable pipeline by specifying the train. prefix and fit it on a dataset with 'text' and 'y' columns
# By default, Universal Sentence Encoder (USE) sentence embeddings are used as features
trainable_pipe = nlu.load('train.sentiment')
fitted_pipe = trainable_pipe.fit(train_df)

# Predict with the fitted pipeline on the training dataset
preds = fitted_pipe.predict(train_df,output_level='document')
# The sentence detector that is part of the pipe generates some NaNs; drop them first
preds.dropna(inplace=True)
# The predicted label is in the 'trained_sentiment' column of the output
print(classification_report(preds['y'], preds['trained_sentiment']))

preds

tfhub_use download started this may take some time.
Approximate size to download 923.7 MB
[OK!]
sentence_detector_dl download started this may take some time.
Approximate size to download 354.6 KB
[OK!]
              precision    recall  f1-score   support

    negative       0.61      0.58      0.59      2106
     neutral       0.00      0.00      0.00         0
    positive       0.79      0.71      0.75      3685

    accuracy                           0.66      5791
   macro avg       0.47      0.43      0.45      5791
weighted avg       0.73      0.66      0.69      5791



Unnamed: 0,sentence,origin_index,y,document,trained_sentiment_confidence,text,trained_sentiment,sentence_embedding_use
0,[Kickers on my watchlist XIDE TIT SOQ PNK CPW ...,0,positive,Kickers on my watchlist XIDE TIT SOQ PNK CPW B...,0.985571,Kickers on my watchlist XIDE TIT SOQ PNK CPW B...,positive,"[0.006487144622951746, -0.042024899274110794, ..."
1,[user: AAP MOVIE. 55% return for the FEA/GEED ...,1,positive,user: AAP MOVIE. 55% return for the FEA/GEED i...,0.720251,user: AAP MOVIE. 55% return for the FEA/GEED i...,positive,"[0.04891366884112358, -0.07381151616573334, -0..."
2,[user I'd be afraid to short AMZN - they are l...,2,positive,user I'd be afraid to short AMZN - they are lo...,0.580390,user I'd be afraid to short AMZN - they are lo...,neutral,"[0.05556508153676987, -0.016491785645484924, 0..."
3,[MNTA Over 12.00],3,positive,MNTA Over 12.00,0.941032,MNTA Over 12.00,positive,"[-0.010976563207805157, -0.029801178723573685,..."
4,[OI Over 21.37],4,positive,OI Over 21.37,0.535012,OI Over 21.37,neutral,"[0.024849383160471916, 0.04679657891392708, -0..."
...,...,...,...,...,...,...,...,...
5786,[Industry body CII said #discoms are likely to...,5786,negative,Industry body CII said #discoms are likely to ...,0.939032,Industry body CII said #discoms are likely to ...,negative,"[0.020985640585422516, -0.03145354241132736, -..."
5787,"[#Gold prices slip below Rs 46,000 as #investo...",5787,negative,"#Gold prices slip below Rs 46,000 as #investor...",0.983991,"#Gold prices slip below Rs 46,000 as #investor...",negative,"[0.05627664923667908, 0.012842322699725628, -0..."
5788,[Workers at Bajaj Auto have agreed to a 10% wa...,5788,positive,Workers at Bajaj Auto have agreed to a 10% wag...,0.918838,Workers at Bajaj Auto have agreed to a 10% wag...,negative,"[0.019935883581638336, -0.031780488789081573, ..."
5789,"[#Sharemarket LIVE: Sensex off day’s high, up ...",5789,positive,"#Sharemarket LIVE: Sensex off day’s high, up 6...",0.761864,"#Sharemarket LIVE: Sensex off day’s high, up 6...",positive,"[0.0031773506198078394, -0.04296385496854782, ..."
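The report above contains a `neutral` row with a support of 0: no row in `y` is neutral, but some predictions are. That row still counts towards the macro average, which is why the macro scores are well below the per-class scores. The sketch below recomputes the macro precision from the rounded numbers printed in the report. The helper `macro_average` is illustrative and not part of scikit-learn.

```python
# Per-class scores copied from the first classification report in this notebook:
# label -> (precision, recall, f1-score, support)
report_rows = {
    "negative": (0.61, 0.58, 0.59, 2106),
    "neutral":  (0.00, 0.00, 0.00, 0),
    "positive": (0.79, 0.71, 0.75, 3685),
}

def macro_average(rows, metric_index):
    """Unweighted mean of one metric over all listed classes."""
    return sum(row[metric_index] for row in rows.values()) / len(rows)

# Macro average over all three rows, as printed in the report
macro_precision = macro_average(report_rows, 0)

# Macro average over the two labels that actually occur in 'y'
true_labels_only = {k: v for k, v in report_rows.items() if v[3] > 0}
macro_precision_true_labels = macro_average(true_labels_only, 0)

print(round(macro_precision, 2), round(macro_precision_true_labels, 2))  # 0.47 0.7
```

To leave the zero-support row out of the report itself, `classification_report` accepts a `labels` argument, for example `labels=['negative', 'positive']`.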


# 4. Test the fitted pipe on a new example

In [None]:
fitted_pipe.predict("Bitcoin is going to the moon!")

Unnamed: 0,sentence,origin_index,document,trained_sentiment_confidence,trained_sentiment,sentence_embedding_use
0,[Bitcoin is going to the moon!],0,Bitcoin is going to the moon!,0.713436,positive,"[0.06468033790588379, -0.040837567299604416, -..."


## Configure pipe training parameters

In [None]:
trainable_pipe.print_info()

The following parameters are configurable for this NLU pipeline (You can copy paste the examples) :
>>> pipe['sentiment_dl'] has settable params:
pipe['sentiment_dl'].setMaxEpochs(1)                 | Info: Maximum number of epochs to train | Currently set to : 1
pipe['sentiment_dl'].setLr(0.005)                    | Info: Learning Rate | Currently set to : 0.005
pipe['sentiment_dl'].setBatchSize(64)                | Info: Batch size | Currently set to : 64
pipe['sentiment_dl'].setDropout(0.5)                 | Info: Dropout coefficient | Currently set to : 0.5
pipe['sentiment_dl'].setEnableOutputLogs(True)       | Info: Whether to use stdout in addition to Spark logs. | Currently set to : True
pipe['sentiment_dl'].setThreshold(0.6)               | Info: The minimum threshold for the final result otheriwse it will be neutral | Currently set to : 0.6
pipe['sentiment_dl'].setThresholdLabel('neutral')    | Info: In case the score is less than threshold, what should be the label. Default i
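The `setThreshold` and `setThresholdLabel` parameters explain the `neutral` predictions in the training output: a prediction whose confidence falls below the threshold is replaced by the threshold label. The function below is a plain-Python sketch of that rule, not Spark NLP's implementation. The name `apply_threshold`, the `>=` comparison at the boundary, and the raw labels in `examples` are assumptions made for illustration.

```python
def apply_threshold(label, confidence, threshold=0.6, threshold_label="neutral"):
    """Return the predicted label, or the fallback label when confidence is too low."""
    return label if confidence >= threshold else threshold_label

# (raw label, confidence) pairs. The confidences are taken from rows 1, 2 and 4
# of the first training run's output; the raw labels are assumed for illustration.
examples = [("positive", 0.720251), ("positive", 0.580390), ("positive", 0.535012)]

final_labels = [apply_threshold(label, conf) for label, conf in examples]
print(final_labels)  # ['positive', 'neutral', 'neutral']
```

With the defaults shown by `print_info()` (threshold 0.6, label `'neutral'`), the two low-confidence rows come out as `neutral`, matching the `trained_sentiment` column for those rows.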

## Retrain with new parameters

In [None]:
# Train longer!
trainable_pipe = nlu.load('train.sentiment')
trainable_pipe['trainable_sentiment_dl'].setMaxEpochs(5)
fitted_pipe = trainable_pipe.fit(train_df)
# Predict with the fitted pipeline on the training dataset
preds = fitted_pipe.predict(train_df,output_level='document')

# The sentence detector that is part of the pipe generates some NaNs; drop them first
preds.dropna(inplace=True)
# The predicted label is in the 'trained_sentiment' column of the output
print(classification_report(preds['y'], preds['trained_sentiment']))

preds

              precision    recall  f1-score   support

    negative       0.83      0.38      0.52      2106
     neutral       0.00      0.00      0.00         0
    positive       0.75      0.94      0.83      3685

    accuracy                           0.74      5791
   macro avg       0.53      0.44      0.45      5791
weighted avg       0.78      0.74      0.72      5791



Unnamed: 0,sentence,origin_index,y,document,trained_sentiment_confidence,text,trained_sentiment,sentence_embedding_use
0,[Kickers on my watchlist XIDE TIT SOQ PNK CPW ...,0,positive,Kickers on my watchlist XIDE TIT SOQ PNK CPW B...,1.000000,Kickers on my watchlist XIDE TIT SOQ PNK CPW B...,positive,"[0.006487144622951746, -0.042024899274110794, ..."
1,[user: AAP MOVIE. 55% return for the FEA/GEED ...,1,positive,user: AAP MOVIE. 55% return for the FEA/GEED i...,0.934187,user: AAP MOVIE. 55% return for the FEA/GEED i...,positive,"[0.04891366884112358, -0.07381151616573334, -0..."
2,[user I'd be afraid to short AMZN - they are l...,2,positive,user I'd be afraid to short AMZN - they are lo...,0.625671,user I'd be afraid to short AMZN - they are lo...,negative,"[0.05556508153676987, -0.016491785645484924, 0..."
3,[MNTA Over 12.00],3,positive,MNTA Over 12.00,0.999983,MNTA Over 12.00,positive,"[-0.010976563207805157, -0.029801178723573685,..."
4,[OI Over 21.37],4,positive,OI Over 21.37,0.985523,OI Over 21.37,positive,"[0.024849383160471916, 0.04679657891392708, -0..."
...,...,...,...,...,...,...,...,...
5786,[Industry body CII said #discoms are likely to...,5786,negative,Industry body CII said #discoms are likely to ...,0.733400,Industry body CII said #discoms are likely to ...,negative,"[0.020985640585422516, -0.03145354241132736, -..."
5787,"[#Gold prices slip below Rs 46,000 as #investo...",5787,negative,"#Gold prices slip below Rs 46,000 as #investor...",0.967702,"#Gold prices slip below Rs 46,000 as #investor...",negative,"[0.05627664923667908, 0.012842322699725628, -0..."
5788,[Workers at Bajaj Auto have agreed to a 10% wa...,5788,positive,Workers at Bajaj Auto have agreed to a 10% wag...,0.778937,Workers at Bajaj Auto have agreed to a 10% wag...,negative,"[0.019935883581638336, -0.031780488789081573, ..."
5789,"[#Sharemarket LIVE: Sensex off day’s high, up ...",5789,positive,"#Sharemarket LIVE: Sensex off day’s high, up 6...",0.999009,"#Sharemarket LIVE: Sensex off day’s high, up 6...",positive,"[0.0031773506198078394, -0.04296385496854782, ..."
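Both reports so far score the model on the same rows it was trained on, so a higher accuracy after more epochs may partly reflect memorisation. A minimal sketch of a hold-out split with pandas follows. The toy frame stands in for `train_df`, and the names `train_part` and `test_part` are illustrative.

```python
import pandas as pd

# Toy frame standing in for train_df (columns 'text' and 'y')
toy_df = pd.DataFrame({
    "text": [f"example tweet {i}" for i in range(10)],
    "y": ["positive", "negative"] * 5,
})

# Shuffle and keep 80% of the rows for training; fixed seed for reproducibility
train_part = toy_df.sample(frac=0.8, random_state=42)
# The remaining 20% are never seen during fit()
test_part = toy_df.drop(train_part.index)

print(len(train_part), len(test_part))  # 8 2

# With the real data one would then, for example, call
#   fitted_pipe = trainable_pipe.fit(train_part)
#   preds = fitted_pipe.predict(test_part, output_level='document')
```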


# Try training with different Embeddings

In [None]:
# We can use nlu.print_components(action='embed_sentence') to see every possible sentence embedding we could use. Let's use BERT!
nlu.print_components(action='embed_sentence')

For language <en> NLU provides the following Models : 
nlu.load('en.embed_sentence') returns Spark NLP model tfhub_use
nlu.load('en.embed_sentence.use') returns Spark NLP model tfhub_use
nlu.load('en.embed_sentence.tfhub_use') returns Spark NLP model tfhub_use
nlu.load('en.embed_sentence.use.lg') returns Spark NLP model tfhub_use_lg
nlu.load('en.embed_sentence.tfhub_use.lg') returns Spark NLP model tfhub_use_lg
nlu.load('en.embed_sentence.albert') returns Spark NLP model albert_base_uncased
nlu.load('en.embed_sentence.electra') returns Spark NLP model sent_electra_small_uncased
nlu.load('en.embed_sentence.electra_small_uncased') returns Spark NLP model sent_electra_small_uncased
nlu.load('en.embed_sentence.electra_base_uncased') returns Spark NLP model sent_electra_base_uncased
nlu.load('en.embed_sentence.electra_large_uncased') returns Spark NLP model sent_electra_large_uncased
nlu.load('en.embed_sentence.bert') returns Spark NLP model sent_bert_base_uncased
nlu.load('en.embed_sentenc

In [None]:
trainable_pipe = nlu.load('embed_sentence.bert train.sentiment')
# We usually need to train longer and use a smaller learning rate for non-USE sentence embeddings
# We could tune the hyperparameters further with methods like grid search
# Longer training can also improve accuracy
trainable_pipe['trainable_sentiment_dl'].setMaxEpochs(40)
trainable_pipe['trainable_sentiment_dl'].setLr(0.0005)
fitted_pipe = trainable_pipe.fit(train_df)
# Predict with the fitted pipeline on the training dataset
preds = fitted_pipe.predict(train_df,output_level='document')

# The sentence detector that is part of the pipe generates some NaNs; drop them first
preds.dropna(inplace=True)
# The predicted label is in the 'trained_sentiment' column of the output
print(classification_report(preds['y'], preds['trained_sentiment']))

preds

sent_small_bert_L2_128 download started this may take some time.
Approximate size to download 16.1 MB
[OK!]
sentence_detector_dl download started this may take some time.
Approximate size to download 354.6 KB
[OK!]
              precision    recall  f1-score   support

    negative       0.68      0.24      0.35      2106
     neutral       0.00      0.00      0.00         0
    positive       0.72      0.84      0.77      3685

    accuracy                           0.62      5791
   macro avg       0.47      0.36      0.37      5791
weighted avg       0.71      0.62      0.62      5791



Unnamed: 0,sentence,origin_index,y,document,trained_sentiment_confidence,trained_sentiment,text,sentence_embedding_bert
0,[Kickers on my watchlist XIDE TIT SOQ PNK CPW ...,0,positive,Kickers on my watchlist XIDE TIT SOQ PNK CPW B...,0.864774,positive,Kickers on my watchlist XIDE TIT SOQ PNK CPW B...,"[-0.9207566380500793, 0.21013399958610535, 0.1..."
1,[user: AAP MOVIE. 55% return for the FEA/GEED ...,1,positive,user: AAP MOVIE. 55% return for the FEA/GEED i...,0.648291,positive,user: AAP MOVIE. 55% return for the FEA/GEED i...,"[-0.43004706501960754, 0.5101228952407837, -0...."
2,[user I'd be afraid to short AMZN - they are l...,2,positive,user I'd be afraid to short AMZN - they are lo...,0.793143,positive,user I'd be afraid to short AMZN - they are lo...,"[0.3040030300617218, 0.22862930595874786, -0.5..."
3,[MNTA Over 12.00],3,positive,MNTA Over 12.00,0.964940,positive,MNTA Over 12.00,"[-1.8103482723236084, -0.4799136519432068, -0...."
4,[OI Over 21.37],4,positive,OI Over 21.37,0.959243,positive,OI Over 21.37,"[-2.4639298915863037, 0.3879586458206177, -0.6..."
...,...,...,...,...,...,...,...,...
5786,[Industry body CII said #discoms are likely to...,5786,negative,Industry body CII said #discoms are likely to ...,0.753365,negative,Industry body CII said #discoms are likely to ...,"[-0.09503882378339767, 0.6293947696685791, 0.0..."
5787,"[#Gold prices slip below Rs 46,000 as #investo...",5787,negative,"#Gold prices slip below Rs 46,000 as #investor...",0.724050,negative,"#Gold prices slip below Rs 46,000 as #investor...","[-0.12879370152950287, 0.28170245885849, 0.028..."
5788,[Workers at Bajaj Auto have agreed to a 10% wa...,5788,positive,Workers at Bajaj Auto have agreed to a 10% wag...,0.781417,negative,Workers at Bajaj Auto have agreed to a 10% wag...,"[-0.3395586907863617, 0.9124063849449158, -0.3..."
5789,"[#Sharemarket LIVE: Sensex off day’s high, up ...",5789,positive,"#Sharemarket LIVE: Sensex off day’s high, up 6...",0.520319,neutral,"#Sharemarket LIVE: Sensex off day’s high, up 6...","[-0.6081282496452332, 0.2732301652431488, 0.25..."
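The comments in the cell above mention grid search as a way to tune the hyperparameters further. The sketch below shows the idea in plain Python: try every combination of candidate values and keep the best-scoring one. `grid_search` and `dummy_train_and_score` are hypothetical helpers. In practice, the scoring function would load `'train.sentiment'`, call `setMaxEpochs` and `setLr`, fit the pipeline, and return a score on held-out data.

```python
from itertools import product

def grid_search(train_and_score, param_grid):
    """Try every parameter combination and return (best_score, best_params)."""
    names = list(param_grid)
    best_score, best_params = float("-inf"), None
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = train_and_score(**params)
        if score > best_score:
            best_score, best_params = score, params
    return best_score, best_params

# Hypothetical stand-in for "fit a pipeline and return held-out accuracy".
# It is an arbitrary formula that peaks at max_epochs=20, lr=0.001.
def dummy_train_and_score(max_epochs, lr):
    return 1.0 - abs(max_epochs - 20) / 100 - abs(lr - 0.001) * 10

grid = {"max_epochs": [5, 20, 40], "lr": [0.005, 0.001, 0.0005]}
best_score, best_params = grid_search(dummy_train_and_score, grid)
print(best_params)  # {'max_epochs': 20, 'lr': 0.001}
```

Each combination means one full training run, so the grid is best kept small when every run downloads embeddings and trains for many epochs.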


# 5. Let's save the model

In [None]:
stored_model_path = './models/classifier_dl_trained' 
fitted_pipe.save(stored_model_path)

Stored model in ./models/classifier_dl_trained


# 6. Let's load the model from HDD
This makes offline NLU usage possible!   
Call `nlu.load(path=path_to_the_pipe)` to load a model or pipeline from disk.

In [None]:
hdd_pipe = nlu.load(path=stored_model_path)

preds = hdd_pipe.predict('Tesla plans to invest 10M into the ML sector')
preds

Unnamed: 0,sentiment,sentence,origin_index,document,sentence_embedding_from_disk,sentiment_confidence,text
0,[positive],[Tesla plans to invest 10M into the ML sector],8589934592,Tesla plans to invest 10M into the ML sector,"[[-0.07111598551273346, 0.9532928466796875, -1...",[0.9114534],Tesla plans to invest 10M into the ML sector


In [None]:
hdd_pipe.print_info()

The following parameters are configurable for this NLU pipeline (You can copy paste the examples) :
>>> pipe['document_assembler'] has settable params:
pipe['document_assembler'].setCleanupMode('shrink')                                  | Info: possible values: disabled, inplace, inplace_full, shrink, shrink_full, each, each_full, delete_full | Currently set to : shrink
>>> pipe['sentence_detector@SentenceDetectorDLModel_c83c27f46b97'] has settable params:
pipe['sentence_detector@SentenceDetectorDLModel_c83c27f46b97'].setExplodeSentences(False)  | Info: whether to explode each sentence into a different row, for better parallelization. Defaults to false. | Currently set to : False
pipe['sentence_detector@SentenceDetectorDLModel_c83c27f46b97'].setStorageRef('SentenceDetectorDLModel_c83c27f46b97')  | Info: storage unique identifier | Currently set to : SentenceDetectorDLModel_c83c27f46b97
pipe['sentence_detector@SentenceDetectorDLModel_c83c27f46b97'].setEncoder(com.johnsnowlabs.nlp.annota