![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/nlu/blob/master/examples/colab/Training/binary_text_classification/NLU_training_sentiment_classifier_demo_reddit.ipynb)


# Training a Sentiment Analysis Classifier with NLU
## Training a 2-class Reddit comment sentiment classifier
With the [SentimentDL model](https://nlp.johnsnowlabs.com/docs/en/annotators#sentimentdl-multi-class-sentiment-analysis-annotator) from Spark NLP you can train a deep learning sentiment classifier on your own labeled text. This notebook trains one on Reddit comments labeled `positive` or `negative`.

This notebook showcases the following features:

- How to train the deep learning classifier
- How to store a pipeline to disk
- How to load the pipeline from disk (Enables NLU offline mode)



# 1. Install Java 8 and NLU

In [None]:
!wget https://setup.johnsnowlabs.com/nlu/colab.sh -O - | bash
import nlu

--2021-05-05 05:39:12--  https://raw.githubusercontent.com/JohnSnowLabs/nlu/master/scripts/colab_setup.sh
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.111.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1671 (1.6K) [text/plain]
Saving to: ‘STDOUT’

Installing NLU 3.0.0 with PySpark 3.0.2 and Spark NLP 3.0.1 for Google Colab ...

2021-05-05 05:39:12 (39.8 MB/s) - written to stdout [1671/1671]



# 2. Download Reddit Sentiment dataset
https://www.kaggle.com/cosmos98/twitter-and-reddit-sentimental-analysis-dataset
## Context

This dataset was created as part of a university project on sentiment analysis of multi-source social media platforms using PySpark.

In [None]:
! wget http://ckl-it.de/wp-content/uploads/2021/01/Reddit_Data.csv


--2021-05-05 05:41:14--  http://ckl-it.de/wp-content/uploads/2021/01/Reddit_Data.csv
Resolving ckl-it.de (ckl-it.de)... 217.160.0.108, 2001:8d8:100f:f000::209
Connecting to ckl-it.de (ckl-it.de)|217.160.0.108|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 153265 (150K) [text/csv]
Saving to: ‘Reddit_Data.csv’


2021-05-05 05:41:14 (402 KB/s) - ‘Reddit_Data.csv’ saved [153265/153265]



In [None]:
import pandas as pd
train_path = '/content/Reddit_Data.csv'

train_df = pd.read_csv(train_path)
# the text data to use for classification should be in a column named 'text'
columns=['text','y']
train_df = train_df[columns]
train_df

Unnamed: 0,text,y
0,its true they had cut the power what douchebag...,positive
1,fuck giroud better finishing like this month,positive
2,looks shit now but still proud made,positive
3,pelor the burning hate the best evil god,negative
4,can ask what you with something this powerful,positive
...,...,...
595,bangali desh bechne main sabse aage,positive
596,national media channels were gaged not cover t...,positive
597,been following these threads from the beginni...,negative
598,pretty sure this sarcasm satire the news 1500...,positive


# 3. Train Deep Learning Classifier using nlu.load('train.sentiment')

Your dataset's label column should be named 'y', and the feature column with the text data should be named 'text'.

In [None]:
from sklearn.metrics import classification_report
import nlu
# Load a trainable pipeline by specifying the train. prefix and fit it on a dataset with label and text columns
# By default, Universal Sentence Encoder (USE) sentence embeddings are used as features
trainable_pipe = nlu.load('train.sentiment')
fitted_pipe = trainable_pipe.fit(train_df.iloc[:50])

# Predict with the fitted pipeline on the dataset and get predictions
preds = fitted_pipe.predict(train_df.iloc[:50],output_level='document')
# The sentence detector that is part of the pipe generates some NaNs. Let's drop them first
preds.dropna(inplace=True)
# The predicted label is in the 'trained_sentiment' column (see the output table of this cell)
print(classification_report(preds['y'], preds['trained_sentiment']))

preds

tfhub_use download started this may take some time.
Approximate size to download 923.7 MB
[OK!]
sentence_detector_dl download started this may take some time.
Approximate size to download 354.6 KB
[OK!]
              precision    recall  f1-score   support

    negative       0.00      0.00      0.00        24
     neutral       0.00      0.00      0.00         0
    positive       0.72      1.00      0.84        26

    accuracy                           0.52        50
   macro avg       0.24      0.33      0.28        50
weighted avg       0.38      0.52      0.44        50



Unnamed: 0,origin_index,text,y,sentence,sentence_embedding_use,trained_sentiment,trained_sentiment_confidence,document
0,0,its true they had cut the power what douchebag...,positive,[its true they had cut the power what doucheba...,"[0.033111296594142914, 0.053994592279195786, -...",positive,0.626655,its true they had cut the power what douchebag...
1,1,fuck giroud better finishing like this month,positive,[fuck giroud better finishing like this month],"[0.0678204670548439, 0.01411951333284378, -0.0...",positive,0.653644,fuck giroud better finishing like this month
2,2,looks shit now but still proud made,positive,[looks shit now but still proud made],"[0.03247417137026787, -0.09844466298818588, -0...",positive,0.660186,looks shit now but still proud made
3,3,pelor the burning hate the best evil god,negative,[pelor the burning hate the best evil god],"[0.04032062366604805, 0.07666622847318649, -0....",neutral,0.578461,pelor the burning hate the best evil god
4,4,can ask what you with something this powerful,positive,[can ask what you with something this powerful],"[0.015518003143370152, -0.05116305500268936, -...",positive,0.691478,can ask what you with something this powerful
5,5,aapâ shazia ilmi from puram constituency lag...,negative,[aapâ shazia ilmi from puram constituency la...,"[0.02478150464594364, -0.06508146971464157, -0...",positive,0.612378,aapâ shazia ilmi from puram constituency lag...
6,6,fuck yeah,negative,[fuck yeah],"[0.046024102717638016, -0.02504798397421837, -...",neutral,0.586349,fuck yeah
7,7,honestly really surprised alice ranked that lo...,positive,[honestly really surprised alice ranked that l...,"[-0.035716041922569275, -0.04127982258796692, ...",positive,0.654837,honestly really surprised alice ranked that lo...
8,8,didn care about politics before now hate,negative,[didn care about politics before now hate],"[-0.006816443987190723, 0.06221264228224754, -...",neutral,0.58145,didn care about politics before now hate
9,9,hard nips and goosebumps,negative,[hard nips and goosebumps],"[-0.02919699251651764, -0.030449824407696724, ...",neutral,0.580563,hard nips and goosebumps
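
The report above is computed on the same 50 rows the pipeline was fitted on, so it measures how well the model fits its training data rather than how it generalizes. A minimal sketch of a stratified hold-out split follows; the helper name `stratified_split`, the 20% test fraction, and the toy labels are assumptions made for illustration and are not part of NLU.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.2, seed=42):
    """Split row positions into train/test lists, keeping label ratios similar."""
    # Group row positions by their label
    by_label = defaultdict(list)
    for position, label in enumerate(labels):
        by_label[label].append(position)

    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for positions in by_label.values():
        rng.shuffle(positions)
        # Hold out at least one row per label
        n_test = max(1, round(len(positions) * test_fraction))
        test_idx.extend(positions[:n_test])
        train_idx.extend(positions[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Toy labels standing in for train_df['y'].tolist()
labels = ["positive"] * 6 + ["negative"] * 4
train_idx, test_idx = stratified_split(labels, test_fraction=0.2)
print(len(train_idx), len(test_idx))
```

With the notebook's DataFrame, the positions could be used as `train_df.iloc[train_idx]` for `fit` and `train_df.iloc[test_idx]` for `predict`. scikit-learn's `train_test_split` with the `stratify` argument is an alternative that does the same job.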


# Test the fitted pipe on a new example

In [None]:
fitted_pipe.predict("Indian prime minister was assinated!")

Unnamed: 0,origin_index,sentence,sentence_embedding_use,trained_sentiment,trained_sentiment_confidence,document
0,0,[Indian prime minister was assinated!],"[0.012644989416003227, -0.04661174491047859, -...",positive,0.6117,Indian prime minister was assinated!


## Configure pipe training parameters

In [None]:
trainable_pipe.print_info()

The following parameters are configurable for this NLU pipeline (You can copy paste the examples) :
>>> pipe['sentiment_dl'] has settable params:
pipe['sentiment_dl'].setMaxEpochs(1)                 | Info: Maximum number of epochs to train | Currently set to : 1
pipe['sentiment_dl'].setLr(0.005)                    | Info: Learning Rate | Currently set to : 0.005
pipe['sentiment_dl'].setBatchSize(64)                | Info: Batch size | Currently set to : 64
pipe['sentiment_dl'].setDropout(0.5)                 | Info: Dropout coefficient | Currently set to : 0.5
pipe['sentiment_dl'].setEnableOutputLogs(True)       | Info: Whether to use stdout in addition to Spark logs. | Currently set to : True
pipe['sentiment_dl'].setThreshold(0.6)               | Info: The minimum threshold for the final result otheriwse it will be neutral | Currently set to : 0.6
pipe['sentiment_dl'].setThresholdLabel('neutral')    | Info: In case the score is less than threshold, what should be the label. Default i
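
The `setThreshold` and `setThresholdLabel` descriptions explain why a `neutral` row shows up in the classification report even though the training labels are only `positive` and `negative`: in the first output table, the rows predicted as `neutral` all have a confidence below 0.6. The sketch below illustrates that rule. The function `apply_threshold` and the strict `<` comparison are assumptions for illustration, not Spark NLP's actual code, and the score dictionaries are made up to resemble the confidences in the table.

```python
def apply_threshold(scores, threshold=0.6, threshold_label="neutral"):
    """Return (label, confidence) for one prediction.

    scores: dict mapping each trained label to its predicted probability.
    If the best score is below `threshold`, the fallback label is returned.
    """
    best_label = max(scores, key=scores.get)
    confidence = scores[best_label]
    if confidence < threshold:
        # Not confident enough: fall back to the threshold label
        return threshold_label, confidence
    return best_label, confidence


# Hypothetical scores, loosely modelled on the confidences in the output table
print(apply_threshold({"positive": 0.626655, "negative": 0.373345}))
print(apply_threshold({"positive": 0.421539, "negative": 0.578461}))
```

Under this reading, lowering the threshold with `setThreshold` would turn fewer predictions into `neutral`, at the cost of accepting less confident labels.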

## Retrain with new parameters

In [None]:
# Train longer!
trainable_pipe = nlu.load('train.sentiment')
trainable_pipe['trainable_sentiment_dl'].setMaxEpochs(5)
fitted_pipe = trainable_pipe.fit(train_df.iloc[:50])
# Predict with the fitted pipeline on the dataset and get predictions
preds = fitted_pipe.predict(train_df.iloc[:50],output_level='document')

# The sentence detector that is part of the pipe generates some NaNs. Let's drop them first
preds.dropna(inplace=True)
# The predicted label is in the 'trained_sentiment' column (see the output table of this cell)
print(classification_report(preds['y'], preds['trained_sentiment']))

preds

              precision    recall  f1-score   support

    negative       1.00      0.83      0.91        24
     neutral       0.00      0.00      0.00         0
    positive       1.00      1.00      1.00        26

    accuracy                           0.92        50
   macro avg       0.67      0.61      0.64        50
weighted avg       1.00      0.92      0.96        50



Unnamed: 0,origin_index,text,y,sentence,sentence_embedding_use,trained_sentiment,trained_sentiment_confidence,document
0,0,its true they had cut the power what douchebag...,positive,[its true they had cut the power what doucheba...,"[0.033111296594142914, 0.053994592279195786, -...",positive,0.761194,its true they had cut the power what douchebag...
1,1,fuck giroud better finishing like this month,positive,[fuck giroud better finishing like this month],"[0.0678204670548439, 0.01411951333284378, -0.0...",positive,0.938677,fuck giroud better finishing like this month
2,2,looks shit now but still proud made,positive,[looks shit now but still proud made],"[0.03247417137026787, -0.09844466298818588, -0...",positive,0.954937,looks shit now but still proud made
3,3,pelor the burning hate the best evil god,negative,[pelor the burning hate the best evil god],"[0.04032062366604805, 0.07666622847318649, -0....",negative,0.81098,pelor the burning hate the best evil god
4,4,can ask what you with something this powerful,positive,[can ask what you with something this powerful],"[0.015518003143370152, -0.05116305500268936, -...",positive,0.956043,can ask what you with something this powerful
5,5,aapâ shazia ilmi from puram constituency lag...,negative,[aapâ shazia ilmi from puram constituency la...,"[0.02478150464594364, -0.06508146971464157, -0...",negative,0.708917,aapâ shazia ilmi from puram constituency lag...
6,6,fuck yeah,negative,[fuck yeah],"[0.046024102717638016, -0.02504798397421837, -...",negative,0.73194,fuck yeah
7,7,honestly really surprised alice ranked that lo...,positive,[honestly really surprised alice ranked that l...,"[-0.035716041922569275, -0.04127982258796692, ...",positive,0.966494,honestly really surprised alice ranked that lo...
8,8,didn care about politics before now hate,negative,[didn care about politics before now hate],"[-0.006816443987190723, 0.06221264228224754, -...",negative,0.67232,didn care about politics before now hate
9,9,hard nips and goosebumps,negative,[hard nips and goosebumps],"[-0.02919699251651764, -0.030449824407696724, ...",negative,0.604969,hard nips and goosebumps
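
The `macro avg` row in the report above looks low next to the per-class scores because the `neutral` row, which has support 0 and only ever appears as a predicted label, is averaged in with equal weight. The `weighted avg` row weights each class by its support, so `neutral` contributes nothing. The following sketch recomputes both rows from the per-class numbers copied from the report; the helper names are illustrative.

```python
# Per-class rows copied from the classification report above:
# label -> (precision, recall, f1-score, support)
report = {
    "negative": (1.00, 0.83, 0.91, 24),
    "neutral": (0.00, 0.00, 0.00, 0),
    "positive": (1.00, 1.00, 1.00, 26),
}


def macro_avg(rows, column):
    """Unweighted mean over all classes, including zero-support ones."""
    return sum(row[column] for row in rows.values()) / len(rows)


def weighted_avg(rows, column):
    """Mean over classes weighted by support (the last tuple entry)."""
    total = sum(row[3] for row in rows.values())
    return sum(row[column] * row[3] for row in rows.values()) / total


for name, column in (("precision", 0), ("recall", 1), ("f1-score", 2)):
    print(name, round(macro_avg(report, column), 2), round(weighted_avg(report, column), 2))
```

Rounded to two decimals, this reproduces the report's macro averages (0.67, 0.61, 0.64) and weighted averages (1.00, 0.92, 0.96).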


# Try training with different Embeddings

In [None]:
# We can use nlu.print_components(action='embed_sentence') to see every possible sentence embedding we could use. Let's use BERT!
nlu.print_components(action='embed_sentence')

For language <en> NLU provides the following Models : 
nlu.load('en.embed_sentence') returns Spark NLP model tfhub_use
nlu.load('en.embed_sentence.use') returns Spark NLP model tfhub_use
nlu.load('en.embed_sentence.tfhub_use') returns Spark NLP model tfhub_use
nlu.load('en.embed_sentence.use.lg') returns Spark NLP model tfhub_use_lg
nlu.load('en.embed_sentence.tfhub_use.lg') returns Spark NLP model tfhub_use_lg
nlu.load('en.embed_sentence.albert') returns Spark NLP model albert_base_uncased
nlu.load('en.embed_sentence.electra') returns Spark NLP model sent_electra_small_uncased
nlu.load('en.embed_sentence.electra_small_uncased') returns Spark NLP model sent_electra_small_uncased
nlu.load('en.embed_sentence.electra_base_uncased') returns Spark NLP model sent_electra_base_uncased
nlu.load('en.embed_sentence.electra_large_uncased') returns Spark NLP model sent_electra_large_uncased
nlu.load('en.embed_sentence.bert') returns Spark NLP model sent_bert_base_uncased
nlu.load('en.embed_sentenc

In [None]:
trainable_pipe = nlu.load('en.embed_sentence.small_bert_L12_768 train.sentiment')
# Sentence embeddings other than USE usually need longer training and a smaller learning rate
# The hyperparameters could be tuned further with methods such as grid search
# Longer training usually gives higher accuracy
trainable_pipe['trainable_sentiment_dl'].setMaxEpochs(70)
trainable_pipe['trainable_sentiment_dl'].setLr(0.0005)
fitted_pipe = trainable_pipe.fit(train_df)
# Predict with the fitted pipeline on the dataset and get predictions
preds = fitted_pipe.predict(train_df,output_level='document')

# The sentence detector that is part of the pipe generates some NaNs. Let's drop them first
preds.dropna(inplace=True)
# The predicted label is in the 'trained_sentiment' column
print(classification_report(preds['y'], preds['trained_sentiment']))

#preds

sent_small_bert_L12_768 download started this may take some time.
Approximate size to download 392.9 MB
[OK!]
sentence_detector_dl download started this may take some time.
Approximate size to download 354.6 KB
[OK!]
              precision    recall  f1-score   support

    negative       0.86      0.80      0.83       300
     neutral       0.00      0.00      0.00         0
    positive       0.90      0.70      0.79       300

    accuracy                           0.75       600
   macro avg       0.59      0.50      0.54       600
weighted avg       0.88      0.75      0.81       600
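
The comments in the cell above mention grid search as a way to tune the epochs and learning rate. A minimal sketch of that loop is shown here, assuming a user-supplied `train_and_score(epochs, lr)` callable that returns a score to maximize. The `grid_search` helper and the `dummy_score` function are hypothetical placeholders; `dummy_score` is constructed to peak at the values used in the cell and says nothing about which settings are actually best.

```python
from itertools import product


def grid_search(train_and_score, epochs_grid, lr_grid):
    """Try every (epochs, lr) pair and return the best pair plus all scores."""
    results = {}
    for epochs, lr in product(epochs_grid, lr_grid):
        results[(epochs, lr)] = train_and_score(epochs, lr)
    best = max(results, key=results.get)
    return best, results


def dummy_score(epochs, lr):
    # Placeholder only: built to peak at epochs=70, lr=0.0005 so the demo has a clear winner
    return -abs(epochs - 70) / 100 - abs(lr - 0.0005) * 100


best, results = grid_search(dummy_score, epochs_grid=[5, 30, 70], lr_grid=[0.005, 0.0005])
print(best)
```

In the notebook, `train_and_score` would load a fresh pipeline with `nlu.load(...)`, call `setMaxEpochs` and `setLr` on it, fit it, and return a metric computed on rows that were not used for fitting.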



# 5. Let's save the model

In [None]:
stored_model_path = './models/classifier_dl_trained' 
fitted_pipe.save(stored_model_path)

Stored model in ./models/classifier_dl_trained


# 6. Let's load the model from HDD
This makes offline NLU usage possible!  
You need to call nlu.load(path=path_to_the_pipe) to load a model/pipeline from disk.

In [None]:
hdd_pipe = nlu.load(path=stored_model_path)

preds = hdd_pipe.predict('Indian prime minister was assinated')
preds

Unnamed: 0,origin_index,text,sentence_embedding_from_disk,sentence,sentiment_confidence,sentiment,document
0,8589934592,Indian prime minister was assinated,"[[-0.09739551693201065, 0.23939256370067596, 0...",[Indian prime minister was assinated],[0.81195],[negative],Indian prime minister was assinated


In [None]:
hdd_pipe.print_info()

The following parameters are configurable for this NLU pipeline (You can copy paste the examples) :
>>> pipe['document_assembler'] has settable params:
pipe['document_assembler'].setCleanupMode('shrink')                                    | Info: possible values: disabled, inplace, inplace_full, shrink, shrink_full, each, each_full, delete_full | Currently set to : shrink
>>> pipe['sentence_detector@SentenceDetectorDLModel_c83c27f46b97'] has settable params:
pipe['sentence_detector@SentenceDetectorDLModel_c83c27f46b97'].setExplodeSentences(False)  | Info: whether to explode each sentence into a different row, for better parallelization. Defaults to false. | Currently set to : False
pipe['sentence_detector@SentenceDetectorDLModel_c83c27f46b97'].setStorageRef('SentenceDetectorDLModel_c83c27f46b97')  | Info: storage unique identifier | Currently set to : SentenceDetectorDLModel_c83c27f46b97
pipe['sentence_detector@SentenceDetectorDLModel_c83c27f46b97'].setEncoder(com.johnsnowlabs.nlp.anno