![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/https://github.com/JohnSnowLabs/nlu/blob/master/examples/collab/Training/binary_text_classification/NLU_training_sentiment_classifier_demo.ipynb)



# Training a Sentiment Analysis Classifier with NLU 
With the [ClassifierDL model](https://nlp.johnsnowlabs.com/docs/en/annotators#classifierdl-multi-class-text-classification) from Spark NLP, you can achieve state-of-the-art results on multi-class text classification problems.

This notebook showcases the following features:

- How to train the deep learning classifier
- How to store a pipeline to disk
- How to load the pipeline from disk (Enables NLU offline mode)



# 1. Install Java 8 and NLU

In [None]:
import os
from sklearn.metrics import classification_report
! apt-get update -qq > /dev/null   
# Install java
! apt-get install -y openjdk-8-jdk-headless -qq > /dev/null
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["PATH"] = os.environ["JAVA_HOME"] + "/bin:" + os.environ["PATH"]
! pip install nlu pyspark==2.4.7 > /dev/null  


import nlu

# 2. Download Stock Market Sentiment dataset 
https://www.kaggle.com/yash612/stockmarket-sentiment-dataset

In [None]:
! wget http://ckl-it.de/wp-content/uploads/2020/11/stock_data.csv


--2020-12-24 01:05:27--  http://ckl-it.de/wp-content/uploads/2020/11/stock_data.csv
Resolving ckl-it.de (ckl-it.de)... 217.160.0.108, 2001:8d8:100f:f000::209
Connecting to ckl-it.de (ckl-it.de)|217.160.0.108|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 479973 (469K) [text/csv]
Saving to: ‘stock_data.csv.1’


2020-12-24 01:05:29 (324 KB/s) - ‘stock_data.csv.1’ saved [479973/479973]






In [None]:
import nlu
sentiment = nlu.load('sentiment')

analyze_sentiment download started this may take some time.
Approx size to download 4.9 MB
[OK!]


In [None]:
sentiment.predict("I'm very very not at all happy")

Fitting on empty Dataframe, could not infer correct training method!


origin_index,sentence,sentiment_confidence,checked,sentiment
0,I'm very very not at all happy,0.3043,"[I'm, very, very, not, at, all, happy]",positive


In [None]:
import pandas as pd
train_path = '/content/stock_data.csv'

train_df = pd.read_csv(train_path)
# The text to classify must be in a column named 'text'.
# The label column must be named 'y' and be of type str.
train_df.columns=['text','y']
train_df.y = train_df.y.astype(str)
# Replace '-1' before '1'; in the opposite order '-1' would turn into '-positive'.
train_df.y = train_df.y.str.replace('-1','negative')
train_df.y = train_df.y.str.replace('1','positive')
train_df

,text,y
0,Kickers on my watchlist XIDE TIT SOQ PNK CPW B...,positive
1,user: AAP MOVIE. 55% return for the FEA/GEED i...,positive
2,user I'd be afraid to short AMZN - they are lo...,positive
3,MNTA Over 12.00,positive
4,OI Over 21.37,positive
...,...,...
5786,Industry body CII said #discoms are likely to ...,negative
5787,"#Gold prices slip below Rs 46,000 as #investor...",negative
5788,Workers at Bajaj Auto have agreed to a 10% wag...,positive
5789,"#Sharemarket LIVE: Sensex off day’s high, up 6...",positive
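The two chained `str.replace` calls above only work because `'-1'` is replaced before `'1'`. As a minimal sketch of why the order matters, and of an order-independent alternative, the plain-Python example below maps raw label strings through a dictionary. The names `LABEL_NAMES`, `to_label_name`, and `chained_replace` are illustrative and not part of NLU or pandas; the raw values `'-1'` and `'1'` are assumed from the replace calls above.

```python
# Assumed raw label values after astype(str): '-1' and '1'.
LABEL_NAMES = {"-1": "negative", "1": "positive"}


def to_label_name(raw):
    """Map a raw label string to its class name and fail loudly on unknown values."""
    key = str(raw).strip()
    if key not in LABEL_NAMES:
        raise ValueError(f"unexpected label value: {raw!r}")
    return LABEL_NAMES[key]


def chained_replace(raw, order):
    """Mimic chained substring replacement, applied in the given order."""
    out = str(raw)
    for old in order:
        out = out.replace(old, LABEL_NAMES[old])
    return out


labels = ["1", "-1", "1"]

# Dictionary lookup does not depend on any ordering.
mapped = [to_label_name(v) for v in labels]

# Replacing '1' first corrupts '-1', because '1' is a substring of '-1'.
wrong_order = [chained_replace(v, ["1", "-1"]) for v in labels]

print(mapped)
print(wrong_order)
```

With pandas, the same dictionary could be passed to `Series.map`, for example `train_df.y.map(LABEL_NAMES)`, which avoids the substring issue altogether.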


# 3. Train Deep Learning Classifier using nlu.load('train.sentiment')

Your dataset's label column should be named 'y', and the feature column containing the text data should be named 'text'.

In [None]:
import nlu 
# Load a trainable pipeline by specifying the train. prefix and fit it on a dataset with label and text columns
# By default, Universal Sentence Encoder (USE) sentence embeddings are used to embed the text
trainable_pipe = nlu.load('train.sentiment')
fitted_pipe = trainable_pipe.fit(train_df)

# Predict with the fitted pipeline on the dataset and get predictions
preds = fitted_pipe.predict(train_df,output_level='document')
# The sentence detector that is part of the pipe generates some NaNs; drop them first
preds.dropna(inplace=True)
print(classification_report(preds['y'], preds['sentiment']))

preds

tfhub_use download started this may take some time.
Approximate size to download 923.7 MB
[OK!]
              precision    recall  f1-score   support

    negative       0.71      0.43      0.54      2106
     neutral       0.00      0.00      0.00         0
    positive       0.77      0.83      0.80      3685

    accuracy                           0.69      5791
   macro avg       0.49      0.42      0.45      5791
weighted avg       0.75      0.69      0.70      5791



origin_index,text,sentiment_confidence,document,sentiment,y,default_name_embeddings
0,Kickers on my watchlist XIDE TIT SOQ PNK CPW B...,0.982228,Kickers on my watchlist XIDE TIT SOQ PNK CPW B...,positive,positive,"[0.006487144622951746, -0.042024899274110794, ..."
1,user: AAP MOVIE. 55% return for the FEA/GEED i...,0.880183,user: AAP MOVIE. 55% return for the FEA/GEED i...,positive,positive,"[-0.03017628937959671, -0.0627138689160347, -0..."
2,user I'd be afraid to short AMZN - they are lo...,0.837914,user I'd be afraid to short AMZN - they are lo...,positive,positive,"[0.05556508153676987, -0.016491785645484924, 0..."
3,MNTA Over 12.00,0.905505,MNTA Over 12.00,positive,positive,"[-0.01097656786441803, -0.02980119362473488, -..."
4,OI Over 21.37,0.532368,OI Over 21.37,neutral,positive,"[0.024849386885762215, 0.04679658263921738, -0..."
...,...,...,...,...,...,...
5786,Industry body CII said #discoms are likely to ...,0.785020,Industry body CII said #discoms are likely to ...,negative,negative,"[0.020985644310712814, -0.03145354613661766, -..."
5787,"#Gold prices slip below Rs 46,000 as #investor...",0.861554,"#Gold prices slip below Rs 46,000 as #investor...",negative,negative,"[0.05627664923667908, 0.012842322699725628, -0..."
5788,Workers at Bajaj Auto have agreed to a 10% wag...,0.794606,Workers at Bajaj Auto have agreed to a 10% wag...,negative,positive,"[0.01210737880319357, -0.02798214927315712, -0..."
5789,"#Sharemarket LIVE: Sensex off day’s high, up 6...",0.966394,"#Sharemarket LIVE: Sensex off day’s high, up 6...",positive,positive,"[0.0031773506198078394, -0.04296385496854782, ..."
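The classification report above contains a `neutral` row with support 0: the dataset has no neutral labels, but the classifier predicts `neutral` for some rows (row 4 in the table, for example). That zero row is included in the macro average, which is why the macro precision is 0.49, i.e. (0.71 + 0.00 + 0.77) / 3. The sketch below recomputes per-class metrics in plain Python on a tiny made-up example to show the effect. The function names and the toy labels are illustrative only and are not the notebook's data.

```python
def per_class_metrics(y_true, y_pred):
    """Compute precision, recall, F1 and support for every label seen in either list."""
    report = {}
    for label in sorted(set(y_true) | set(y_pred)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        report[label] = {"precision": precision, "recall": recall,
                         "f1": f1, "support": tp + fn}
    return report


def macro_average(report, metric, labels=None):
    """Unweighted mean of a metric over the given labels (all labels by default)."""
    labels = list(report) if labels is None else list(labels)
    return sum(report[label][metric] for label in labels) / len(labels)


# Toy data: no true 'neutral' rows, but one 'neutral' prediction.
y_true = ["positive", "positive", "negative", "negative", "positive"]
y_pred = ["positive", "neutral", "negative", "positive", "positive"]

report = per_class_metrics(y_true, y_pred)
macro_recall_all = macro_average(report, "recall")
macro_recall_real = macro_average(report, "recall", labels=["negative", "positive"])

print(report["neutral"])
print(round(macro_recall_all, 4), round(macro_recall_real, 4))
```

With scikit-learn, passing `labels=['negative', 'positive']` to `classification_report` restricts the report to the classes that actually occur in the dataset.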


# Test the fitted pipe on a new example

In [None]:
fitted_pipe.predict("Bitcoin is going to the moon!")

origin_index,sentiment_confidence,document,sentiment,default_name_embeddings
0,0.918913,Bitcoin is going to the moon!,positive,"[0.06468033790588379, -0.040837567299604416, -..."


## Configure pipe training parameters

In [None]:
trainable_pipe.print_info()

The following parameters are configurable for this NLU pipeline (You can copy paste the examples) :
>>> pipe['sentiment_dl'] has settable params:
pipe['sentiment_dl'].setMaxEpochs(2)                 | Info: Maximum number of epochs to train | Currently set to : 2
pipe['sentiment_dl'].setLr(0.005)                    | Info: Learning Rate | Currently set to : 0.005
pipe['sentiment_dl'].setBatchSize(64)                | Info: Batch size | Currently set to : 64
pipe['sentiment_dl'].setDropout(0.5)                 | Info: Dropout coefficient | Currently set to : 0.5
pipe['sentiment_dl'].setEnableOutputLogs(True)       | Info: Whether to use stdout in addition to Spark logs. | Currently set to : True
pipe['sentiment_dl'].setThreshold(0.6)               | Info: The minimum threshold for the final result otheriwse it will be neutral | Currently set to : 0.6
pipe['sentiment_dl'].setThresholdLabel('neutral')    | Info: In case the score is less than threshold, what should be the label. Default i
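
The threshold parameters listed above explain where `neutral` predictions come from: according to the parameter descriptions, a prediction whose score is below the threshold (0.6 here) receives the threshold label instead of the predicted class. The following is a minimal plain-Python sketch of that rule as described, not Spark NLP's actual implementation. The function name `apply_threshold` is hypothetical, and treating a score exactly equal to the threshold as passing is an assumption.

```python
def apply_threshold(label, confidence, threshold=0.6, threshold_label="neutral"):
    """Return the predicted label, or the fallback label if confidence is below the threshold.

    Assumption: a confidence exactly equal to the threshold keeps the predicted label.
    """
    return label if confidence >= threshold else threshold_label


# Hypothetical (label, confidence) pairs for illustration.
raw_predictions = [("positive", 0.92), ("positive", 0.53), ("negative", 0.79)]
final_labels = [apply_threshold(label, conf) for label, conf in raw_predictions]

print(final_labels)
```

Under this rule, lowering the threshold with `setThreshold` would yield fewer `neutral` predictions, at the cost of accepting less confident ones.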

## Retrain with new parameters

In [None]:
# Train longer!
trainable_pipe['sentiment_dl'].setMaxEpochs(5)  
fitted_pipe = trainable_pipe.fit(train_df)
# Predict with the fitted pipeline on the dataset and get predictions
preds = fitted_pipe.predict(train_df,output_level='document')

# The sentence detector that is part of the pipe generates some NaNs; drop them first
preds.dropna(inplace=True)
print(classification_report(preds['y'], preds['sentiment']))

preds

              precision    recall  f1-score   support

    negative       0.79      0.67      0.72      2106
     neutral       0.00      0.00      0.00         0
    positive       0.84      0.87      0.85      3685

    accuracy                           0.80      5791
   macro avg       0.54      0.51      0.53      5791
weighted avg       0.82      0.80      0.81      5791



origin_index,text,sentiment_confidence,document,sentiment,y,default_name_embeddings
0,Kickers on my watchlist XIDE TIT SOQ PNK CPW B...,0.999146,Kickers on my watchlist XIDE TIT SOQ PNK CPW B...,positive,positive,"[0.006487144622951746, -0.042024899274110794, ..."
1,user: AAP MOVIE. 55% return for the FEA/GEED i...,0.941052,user: AAP MOVIE. 55% return for the FEA/GEED i...,positive,positive,"[-0.03017628937959671, -0.0627138689160347, -0..."
2,user I'd be afraid to short AMZN - they are lo...,0.648649,user I'd be afraid to short AMZN - they are lo...,negative,positive,"[0.05556508153676987, -0.016491785645484924, 0..."
3,MNTA Over 12.00,0.988186,MNTA Over 12.00,positive,positive,"[-0.01097656786441803, -0.02980119362473488, -..."
4,OI Over 21.37,0.783930,OI Over 21.37,positive,positive,"[0.024849386885762215, 0.04679658263921738, -0..."
...,...,...,...,...,...,...
5786,Industry body CII said #discoms are likely to ...,0.990443,Industry body CII said #discoms are likely to ...,negative,negative,"[0.020985644310712814, -0.03145354613661766, -..."
5787,"#Gold prices slip below Rs 46,000 as #investor...",0.999385,"#Gold prices slip below Rs 46,000 as #investor...",negative,negative,"[0.05627664923667908, 0.012842322699725628, -0..."
5788,Workers at Bajaj Auto have agreed to a 10% wag...,0.728881,Workers at Bajaj Auto have agreed to a 10% wag...,negative,positive,"[0.01210737880319357, -0.02798214927315712, -0..."
5789,"#Sharemarket LIVE: Sensex off day’s high, up 6...",0.987245,"#Sharemarket LIVE: Sensex off day’s high, up 6...",positive,positive,"[0.0031773506198078394, -0.04296385496854782, ..."
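Note that the reports in this notebook are computed on the same rows the classifier was fitted on, so they measure fit to the training data rather than performance on unseen text. A common remedy is to hold out part of the data before fitting. The sketch below shows a seeded shuffle-and-split in plain Python. The helper `split_rows` and the toy rows are illustrative and not part of NLU.

```python
import random


def split_rows(rows, test_fraction=0.2, seed=42):
    """Shuffle rows reproducibly and split them into (train, test) lists."""
    if not 0 < test_fraction < 1:
        raise ValueError("test_fraction must be between 0 and 1")
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(round(len(shuffled) * test_fraction)))
    return shuffled[n_test:], shuffled[:n_test]


# Toy (text, label) rows standing in for the real dataset.
rows = [(f"text {i}", "positive" if i % 2 else "negative") for i in range(10)]
train_rows, test_rows = split_rows(rows)

print(len(train_rows), len(test_rows))
```

With a pandas DataFrame, an equivalent split could use `train_part = train_df.sample(frac=0.8, random_state=42)` and `test_part = train_df.drop(train_part.index)`, then fit on `train_part` and compute the classification report on predictions for `test_part`.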


# Try training with different Embeddings

In [None]:
# nlu.print_components(action='embed_sentence') lists every sentence embedding we could use. Let's use BERT!
nlu.print_components(action='embed_sentence')

For language <en> NLU provides the following Models : 
nlu.load('en.embed_sentence') returns Spark NLP model tfhub_use
nlu.load('en.embed_sentence.use') returns Spark NLP model tfhub_use
nlu.load('en.embed_sentence.tfhub_use') returns Spark NLP model tfhub_use
nlu.load('en.embed_sentence.use.lg') returns Spark NLP model tfhub_use_lg
nlu.load('en.embed_sentence.tfhub_use.lg') returns Spark NLP model tfhub_use_lg
nlu.load('en.embed_sentence.albert') returns Spark NLP model albert_base_uncased
nlu.load('en.embed_sentence.electra') returns Spark NLP model sent_electra_small_uncased
nlu.load('en.embed_sentence.electra_small_uncased') returns Spark NLP model sent_electra_small_uncased
nlu.load('en.embed_sentence.electra_base_uncased') returns Spark NLP model sent_electra_base_uncased
nlu.load('en.embed_sentence.electra_large_uncased') returns Spark NLP model sent_electra_large_uncased
nlu.load('en.embed_sentence.bert') returns Spark NLP model sent_bert_base_uncased
nlu.load('en.embed_sentenc

In [None]:
trainable_pipe = nlu.load('embed_sentence.bert train.sentiment')
# Sentence embeddings other than USE usually need longer training and a smaller learning rate
# The hyperparameters could be tuned further with methods such as grid search
# Longer training may also improve accuracy
trainable_pipe['sentiment_dl'].setMaxEpochs(40)
trainable_pipe['sentiment_dl'].setLr(0.0005)
fitted_pipe = trainable_pipe.fit(train_df)
# Predict with the fitted pipeline on the dataset and get predictions
preds = fitted_pipe.predict(train_df,output_level='document')

# The sentence detector that is part of the pipe generates some NaNs; drop them first
preds.dropna(inplace=True)
print(classification_report(preds['y'], preds['sentiment']))

preds

sent_small_bert_L2_128 download started this may take some time.
Approximate size to download 16.1 MB
[OK!]
              precision    recall  f1-score   support

    negative       0.68      0.25      0.36      2106
     neutral       0.00      0.00      0.00         0
    positive       0.72      0.84      0.77      3685

    accuracy                           0.63      5791
   macro avg       0.47      0.36      0.38      5791
weighted avg       0.71      0.63      0.63      5791



origin_index,text,sentiment_confidence,document,embed_sentence_bert_embeddings,sentiment,y
0,Kickers on my watchlist XIDE TIT SOQ PNK CPW B...,0.874224,Kickers on my watchlist XIDE TIT SOQ PNK CPW B...,"[-0.9207571744918823, 0.21013416349887848, 0.1...",positive,positive
1,user: AAP MOVIE. 55% return for the FEA/GEED i...,0.647704,user: AAP MOVIE. 55% return for the FEA/GEED i...,"[-0.43004727363586426, 0.5101231336593628, -0....",positive,positive
2,user I'd be afraid to short AMZN - they are lo...,0.780586,user I'd be afraid to short AMZN - they are lo...,"[0.3040030300617218, 0.22862982749938965, -0.5...",positive,positive
3,MNTA Over 12.00,0.978046,MNTA Over 12.00,"[-1.810348391532898, -0.4799138903617859, -0.7...",positive,positive
4,OI Over 21.37,0.961256,OI Over 21.37,"[-2.4639298915863037, 0.3879590630531311, -0.6...",positive,positive
...,...,...,...,...,...,...
5786,Industry body CII said #discoms are likely to ...,0.759879,Industry body CII said #discoms are likely to ...,"[-0.09503911435604095, 0.6293947696685791, 0.0...",negative,negative
5787,"#Gold prices slip below Rs 46,000 as #investor...",0.759041,"#Gold prices slip below Rs 46,000 as #investor...","[-0.1287938952445984, 0.28170245885849, 0.0280...",negative,negative
5788,Workers at Bajaj Auto have agreed to a 10% wag...,0.750849,Workers at Bajaj Auto have agreed to a 10% wag...,"[-0.3395587205886841, 0.912406325340271, -0.32...",negative,positive
5789,"#Sharemarket LIVE: Sensex off day’s high, up 6...",0.567143,"#Sharemarket LIVE: Sensex off day’s high, up 6...","[-0.6081283092498779, 0.2732301354408264, 0.25...",neutral,positive
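The cell above mentions grid search as a way to tune the hyperparameters further. As a rough sketch of the idea, the plain-Python example below tries every combination of values from a parameter grid and keeps the best-scoring one. The `grid_search` helper, the grid values, and the `dummy_score` function are all hypothetical; in the notebook, the scoring function would configure `trainable_pipe['sentiment_dl']` with `setMaxEpochs` and `setLr`, fit the pipe, and return a metric, ideally measured on held-out data.

```python
from itertools import product


def grid_search(param_grid, train_and_score):
    """Evaluate every parameter combination and return (best_params, best_score)."""
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = train_and_score(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def dummy_score(params):
    """Stand-in for real training; by construction it peaks at max_epochs=20, lr=0.001."""
    return -abs(params["max_epochs"] - 20) / 100 - abs(params["lr"] - 0.001)


# Hypothetical grid; the values are for illustration only.
grid = {"max_epochs": [5, 20, 40], "lr": [0.005, 0.001, 0.0005]}
best_params, best_score = grid_search(grid, dummy_score)

print(best_params)
```

Each combination requires a full training run, so the cost grows with the product of the grid sizes (9 runs for the grid above).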


# 5. Let's save the model

In [None]:
stored_model_path = './models/classifier_dl_trained' 
fitted_pipe.save(stored_model_path)

Stored model in ./models/classifier_dl_trained


# 6. Let's load the model from HDD
This makes offline NLU usage possible!   
Call nlu.load(path=path_to_the_pipe) to load a model/pipeline from disk.

In [None]:
hdd_pipe = nlu.load(path=stored_model_path)

preds = hdd_pipe.predict('Tesla plans to invest 10M into the ML sector')
preds

Fitting on empty Dataframe, could not infer correct training method!


origin_index,sentiment_confidence,document,embed_sentence_bert_embeddings,sentiment
0,0.974726,Tesla plans to invest 10M into the ML sector,"[-0.07111635059118271, 0.9532930850982666, -1....",positive


In [None]:
hdd_pipe.print_info()

The following parameters are configurable for this NLU pipeline (You can copy paste the examples) :
>>> pipe['document_assembler'] has settable params:
pipe['document_assembler'].setCleanupMode('shrink')           | Info: possible values: disabled, inplace, inplace_full, shrink, shrink_full, each, each_full, delete_full | Currently set to : shrink
>>> pipe['sentence_detector'] has settable params:
pipe['sentence_detector'].setCustomBounds([])                 | Info: characters used to explicitly mark sentence bounds | Currently set to : []
pipe['sentence_detector'].setDetectLists(True)                | Info: whether detect lists during sentence detection | Currently set to : True
pipe['sentence_detector'].setExplodeSentences(False)          | Info: whether to explode each sentence into a different row, for better parallelization. Defaults to false. | Currently set to : False
pipe['sentence_detector'].setMaxLength(99999)                 | Info: Set the maximum allowed length for each se