![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/nlu/blob/master/examples/colab/Training/binary_text_classification/NLU_training_sentiment_classifier_demo_twitter.ipynb)



# Training a Sentiment Analysis Classifier with NLU
## Two-class Twitter classifier training
With the [SentimentDL model](https://nlp.johnsnowlabs.com/docs/en/annotators#sentimentdl-multi-class-sentiment-analysis-annotator) from Spark NLP you can train a deep learning sentiment classifier on your own labeled text. This notebook trains it on tweets labeled `positive` or `negative`.

This notebook showcases the following features:

- How to train the deep learning classifier
- How to store a pipeline to disk
- How to load the pipeline from disk (Enables NLU offline mode)



# 1. Install Java 8 and NLU

In [None]:
!pip install -q johnsnowlabs

# 2. Download the Twitter sentiment dataset
https://www.kaggle.com/cosmos98/twitter-and-reddit-sentimental-analysis-dataset
## Context

This dataset was created as part of a university project on sentiment analysis across multi-source social media platforms using PySpark.

In [None]:
! wget https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/resources/en/classifier-dl/reddit_twitter_sentiment/Twitter_Data.csv


In [5]:
import pandas as pd
train_path = '/content/Twitter_Data.csv'

train_df = pd.read_csv(train_path)
# The text to classify must be in a column named 'text'
# The label column must be named 'y' and hold values of type str
train_df = train_df.rename(columns={'clean_text': 'text'})

columns=['text','y']
train_df = train_df[columns]
train_df

| | text | y |
|---|---|---|
| 0 | new post added mumbai press official site prod... | positive |
| 1 | not wrong the actual temperature might but and... | positive |
| 2 | why pakistan crying name modi every day how na... | negative |
| 3 | congress years wasnt able complete one rafale ... | positive |
| 4 | public toilet near kanagadurga temple nizampet... | positive |
| ... | ... | ... |
| 595 | jai hind modi very nice thought | positive |
| 596 | after going thru all the comedy speeches shri ... | positive |
| 597 | mistry man not then why drag modi the nri foll... | negative |
| 598 | why modi have not held single press conference... | negative |


# 3. Train Deep Learning Classifier using nlp.load('train.sentiment')

Your dataset's label column should be named 'y', and the feature column with the text data should be named 'text'.
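The label requirement can be checked before training. The sketch below is a minimal, standard-library-only helper; `summarize_labels` is a hypothetical name, not part of NLU. It verifies that every label is a `str` and reports how large the biggest class is, which is the accuracy a model would reach by always predicting that class.

```python
from collections import Counter

def summarize_labels(labels):
    """Return (counts, majority_share) for an iterable of string labels.

    Hypothetical helper for illustration; not part of NLU or Spark NLP.
    """
    labels = list(labels)
    if not labels:
        raise ValueError("no labels given")
    # Missing values (e.g. NaN read from a CSV) show up here as non-str entries.
    non_str = [lab for lab in labels if not isinstance(lab, str)]
    if non_str:
        raise TypeError(f"labels must be str, found e.g. {non_str[0]!r}")
    counts = Counter(labels)
    # Share of the most frequent class = accuracy of an "always majority" baseline.
    majority_share = counts.most_common(1)[0][1] / len(labels)
    return counts, majority_share

# Toy labels for illustration only.
counts, share = summarize_labels(["positive", "negative", "positive", "positive"])
print(dict(counts), share)
```

In this notebook the helper could be called as `summarize_labels(train_df['y'])`, since it accepts any iterable of labels.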

In [6]:
from sklearn.metrics import classification_report
from johnsnowlabs import nlp
# Load a trainable pipeline by adding the 'train.' prefix, then fit it on a dataset with 'text' and 'y' columns
# The default sentence embeddings are downloaded automatically (sent_small_bert_L2_128 in the output below)
trainable_pipe = nlp.load('train.sentiment')
fitted_pipe = trainable_pipe.fit(train_df.iloc[:50])

# Predict with the fitted pipeline on the same rows it was trained on
preds = fitted_pipe.predict(train_df.iloc[:50],output_level='document')
# The sentence detector in the pipeline can produce some NaNs; drop them first
preds.dropna(inplace=True)
print(classification_report(preds['y'], preds['sentiment']))

preds

sent_small_bert_L2_128 download started this may take some time.
Approximate size to download 16.1 MB
[OK!]


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  dataset.y = dataset.y.apply(str)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  data['origin_index'] = data.index


              precision    recall  f1-score   support

    negative       0.00      0.00      0.00        19
    positive       0.62      1.00      0.77        31

    accuracy                           0.62        50
   macro avg       0.31      0.50      0.38        50
weighted avg       0.38      0.62      0.47        50



A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df[cols_to_explode] = df[cols_to_explode].apply(pad_same_level_cols, axis=1)
  _warn_prf(average, modifier, msg_start, len(result))


| | document | sentence_embedding_small_bert_L2_128 | sentiment | sentiment_confidence | text | y |
|---|---|---|---|---|---|---|
| 0 | new post added mumbai press official site prod... | [-0.2672112286090851, 0.22553667426109314, -0.... | positive | 8.0 | new post added mumbai press official site prod... | positive |
| 1 | not wrong the actual temperature might but and... | [-1.243513822555542, 0.24190887808799744, -0.4... | positive | 8.0 | not wrong the actual temperature might but and... | positive |
| 2 | why pakistan crying name modi every day how na... | [-0.7444853782653809, -0.0514342300593853, -0.... | positive | 2.0 | why pakistan crying name modi every day how na... | negative |
| 3 | congress years wasnt able complete one rafale ... | [-0.34242647886276245, 0.46881920099258423, -0... | positive | 2.0 | congress years wasnt able complete one rafale ... | positive |
| 4 | public toilet near kanagadurga temple nizampet... | [-1.1381851434707642, 0.512217104434967, -0.74... | positive | 1.0 | public toilet near kanagadurga temple nizampet... | positive |
| 5 | the foundation for new india 2022 has already ... | [-0.4057338237762451, 1.0029019117355347, -0.9... | positive | 7.0 | \nthe foundation for new india 2022 has alread... | positive |
| 6 | dear governorani can you let the people indian... | [-1.4633989334106445, 0.0006002967129461467, -... | positive | 3.0 | dear governorani can you let the people indian... | negative |
| 7 | this daft donkey’ dick aap was born the iac mo... | [0.04606145992875099, 0.3098487854003906, 0.02... | positive | 6.0 | this daft donkey’ dick aap was born the iac mo... | negative |
| 8 | major reason for social hatred and strife modi... | [-0.9470604658126831, 0.27183642983436584, -1.... | positive | 5.0 | major reason for social hatred and strife modi... | positive |
| 9 | demo was black money caught modi did inspite r... | [-0.7136786580085754, 0.0788763239979744, -0.5... | positive | 1.0 | demo was black money caught modi did inspite r... | positive |
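The report above can be reproduced by hand, which makes its meaning clearer. A recall of 1.00 for `positive` and 0.00 for `negative` means the fitted pipe predicted `positive` for all 50 rows, so the 0.62 accuracy is simply the share of positive rows (31 of 50). The sketch below recomputes the numbers from those counts in plain Python; treating an undefined precision (no predicted negatives) as 0.0 is an assumption that mirrors what the report prints.

```python
def precision_recall_f1(y_true, y_pred, label):
    """Per-class precision, recall and F1; undefined ratios are reported as 0.0."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1

# Counts taken from the report: 19 negative rows, 31 positive rows,
# and every row predicted as 'positive'.
y_true = ["negative"] * 19 + ["positive"] * 31
y_pred = ["positive"] * 50

neg = precision_recall_f1(y_true, y_pred, "negative")
pos = precision_recall_f1(y_true, y_pred, "positive")

macro_f1 = (neg[2] + pos[2]) / 2                # unweighted mean over classes
weighted_f1 = (19 * neg[2] + 31 * pos[2]) / 50  # mean weighted by support

print("negative:", [round(v, 2) for v in neg])
print("positive:", [round(v, 2) for v in pos])
print("macro f1:", round(macro_f1, 2), "weighted f1:", round(weighted_f1, 2))
```

A classifier that ignores its input and always answers `positive` would produce the same report, so these scores say little about what the model has learned from 50 rows.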


# Test the fitted pipe on a new example

In [7]:
fitted_pipe.predict('the president of india just died')

sentence_detector_dl download started this may take some time.
Approximate size to download 354.6 KB
[OK!]


| | sentence | sentence_embedding_small_bert_L2_128 | sentiment | sentiment_confidence |
|---|---|---|---|---|
| 0 | the president of india just died | [-0.9852966070175171, 0.5659735798835754, -1.0... | positive | 1.0 |


## Configure pipe training parameters

In [8]:
trainable_pipe.print_info()

The following parameters are configurable for this NLU pipeline (You can copy paste the examples) :
>>> component_list['bert_sentence_embeddings@sent_small_bert_L2_128'] has settable params:
component_list['bert_sentence_embeddings@sent_small_bert_L2_128'].setBatchSize(8)              | Info: Size of every batch | Currently set to : 8
component_list['bert_sentence_embeddings@sent_small_bert_L2_128'].setEngine('tensorflow')      | Info: Deep Learning engine used for this model | Currently set to : tensorflow
component_list['bert_sentence_embeddings@sent_small_bert_L2_128'].setIsLong(False)             | Info: Use Long type instead of Int type for inputs buffer - Some Bert models require Long instead of Int. | Currently set to : False
component_list['bert_sentence_embeddings@sent_small_bert_L2_128'].setMaxSentenceLength(128)    | Info: Max sentence length to process | Currently set to : 128
component_list['bert_sentence_embeddings@sent_small_bert_L2_128'].setDimension(128)            | I

## Retrain with new parameters

In [9]:
# Train longer!
trainable_pipe = nlp.load('train.sentiment')
trainable_pipe['trainable_sentiment_dl'].setMaxEpochs(5)
fitted_pipe = trainable_pipe.fit(train_df.iloc[:50])
# Predict with the fitted pipeline on the same rows it was trained on
preds = fitted_pipe.predict(train_df.iloc[:50],output_level='document')

# The sentence detector in the pipeline can produce some NaNs; drop them first
preds.dropna(inplace=True)
print(classification_report(preds['y'], preds['sentiment']))

preds

sent_small_bert_L2_128 download started this may take some time.
Approximate size to download 16.1 MB
[OK!]




              precision    recall  f1-score   support

    negative       0.00      0.00      0.00        19
    positive       0.62      1.00      0.77        31

    accuracy                           0.62        50
   macro avg       0.31      0.50      0.38        50
weighted avg       0.38      0.62      0.47        50





| | document | sentence_embedding_small_bert_L2_128 | sentiment | sentiment_confidence | text | y |
|---|---|---|---|---|---|---|
| 0 | new post added mumbai press official site prod... | [-0.2672112286090851, 0.22553667426109314, -0.... | positive | 5.0 | new post added mumbai press official site prod... | positive |
| 1 | not wrong the actual temperature might but and... | [-1.243513822555542, 0.24190887808799744, -0.4... | positive | 1.0 | not wrong the actual temperature might but and... | positive |
| 2 | why pakistan crying name modi every day how na... | [-0.7444853782653809, -0.0514342300593853, -0.... | positive | 1.0 | why pakistan crying name modi every day how na... | negative |
| 3 | congress years wasnt able complete one rafale ... | [-0.34242647886276245, 0.46881920099258423, -0... | positive | 8.0 | congress years wasnt able complete one rafale ... | positive |
| 4 | public toilet near kanagadurga temple nizampet... | [-1.1381851434707642, 0.512217104434967, -0.74... | positive | 2.0 | public toilet near kanagadurga temple nizampet... | positive |
| 5 | the foundation for new india 2022 has already ... | [-0.4057338237762451, 1.0029019117355347, -0.9... | positive | 9.0 | \nthe foundation for new india 2022 has alread... | positive |
| 6 | dear governorani can you let the people indian... | [-1.4633989334106445, 0.0006002967129461467, -... | positive | 2.0 | dear governorani can you let the people indian... | negative |
| 7 | this daft donkey’ dick aap was born the iac mo... | [0.04606145992875099, 0.3098487854003906, 0.02... | positive | 1.0 | this daft donkey’ dick aap was born the iac mo... | negative |
| 8 | major reason for social hatred and strife modi... | [-0.9470604658126831, 0.27183642983436584, -1.... | positive | 1.0 | major reason for social hatred and strife modi... | positive |
| 9 | demo was black money caught modi did inspite r... | [-0.7136786580085754, 0.0788763239979744, -0.5... | positive | 1.0 | demo was black money caught modi did inspite r... | positive |
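Both runs so far score the pipeline on the same 50 rows it was fitted on, so the reports describe fit to the training data rather than performance on unseen tweets. One way to get a more informative number is to hold out part of the data. The sketch below is an illustration, not something the notebook itself does: `stratified_split` is a hypothetical helper that splits row positions per label so both classes appear in each part.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_fraction=0.2, seed=42):
    """Split row positions into (train_idx, test_idx), preserving label proportions.

    Hypothetical helper for illustration; not part of NLU or Spark NLP.
    """
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    rng = random.Random(seed)  # local generator, so the split is reproducible
    train_idx, test_idx = [], []
    for label in sorted(by_label):
        idxs = by_label[label]
        rng.shuffle(idxs)
        # Keep at least one test row per class when the class has more than one row.
        n_test = max(1, round(len(idxs) * test_fraction)) if len(idxs) > 1 else 0
        test_idx.extend(idxs[:n_test])
        train_idx.extend(idxs[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Toy labels with the same class sizes as the 50-row sample used above.
labels = ["positive"] * 31 + ["negative"] * 19
train_idx, test_idx = stratified_split(labels, test_fraction=0.2, seed=0)
print(len(train_idx), len(test_idx))
```

With the notebook's DataFrame, the positions could be applied as `train_df.iloc[train_idx]` for `fit` and `train_df.iloc[test_idx]` for `predict`, assuming `labels` was built from `train_df['y']` in row order.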


# Try training with different Embeddings

In [10]:
# Use nlp.nlu.print_components(action='embed_sentence') to list every available sentence embedding. Let's use BERT!
nlp.nlu.print_components(action='embed_sentence')

For language <am> NLU provides the following Models : 
nlu.load('am.embed_sentence.xlm_roberta') returns Spark NLP model_anno_obj sent_xlm_roberta_base_finetuned_amharic
For language <de> NLU provides the following Models : 
nlu.load('de.embed_sentence.bert.base_cased') returns Spark NLP model_anno_obj sent_bert_base_cased
For language <el> NLU provides the following Models : 
nlu.load('el.embed_sentence.bert.base_uncased') returns Spark NLP model_anno_obj sent_bert_base_uncased
For language <en> NLU provides the following Models : 
nlu.load('en.embed_sentence') returns Spark NLP model_anno_obj tfhub_use
nlu.load('en.embed_sentence.albert') returns Spark NLP model_anno_obj albert_base_uncased
nlu.load('en.embed_sentence.bert') returns Spark NLP model_anno_obj sent_bert_base_uncased
nlu.load('en.embed_sentence.bert.base_uncased_legal') returns Spark NLP model_anno_obj sent_bert_base_uncased_legal
nlu.load('en.embed_sentence.bert.finetuned') returns Spark NLP model_anno_obj sbert_setfit_

In [11]:
trainable_pipe = nlp.load('en.embed_sentence.small_bert_L12_768 train.sentiment')
# Sentence embeddings other than USE usually need longer training and a smaller learning rate
# The hyperparameters could be tuned further, for example with a grid search
# Longer training often improves accuracy
trainable_pipe['trainable_sentiment_dl'].setMaxEpochs(100)
trainable_pipe['trainable_sentiment_dl'].setLr(0.0005)
fitted_pipe = trainable_pipe.fit(train_df)
# Predict with the fitted pipeline on the training data
preds = fitted_pipe.predict(train_df,output_level='document')

# The sentence detector in the pipeline can produce some NaNs; drop them first
preds.dropna(inplace=True)
print(classification_report(preds['y'], preds['sentiment']))

#preds

sent_small_bert_L12_768 download started this may take some time.
Approximate size to download 392.9 MB
[OK!]
              precision    recall  f1-score   support

    negative       0.78      0.65      0.71       300
     neutral       0.00      0.00      0.00         0
    positive       0.89      0.52      0.65       300

    accuracy                           0.58       600
   macro avg       0.55      0.39      0.45       600
weighted avg       0.83      0.58      0.68       600
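The cell above mentions grid search as a way to tune the hyperparameters further. A minimal sketch of the idea follows, using only the standard library. `grid_search` and `toy_score` are hypothetical names: the toy scoring function exists only so the example runs and says nothing about real model quality. In practice the scoring function would load a fresh trainable pipe, apply `setMaxEpochs` and `setLr`, fit, predict, and return a metric.

```python
from itertools import product

def grid_search(param_grid, score_fn):
    """Try every combination in param_grid and return the best one.

    param_grid: dict mapping parameter name -> list of candidate values.
    score_fn:   callable taking a dict of parameters, returning a number
                where higher is better.
    """
    names = sorted(param_grid)
    results = []
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        results.append((score_fn(params), params))
    best_score, best_params = max(results, key=lambda item: item[0])
    return best_params, best_score, results

def toy_score(params):
    # Placeholder so the sketch is runnable. A real version might do, roughly:
    #   pipe = nlp.load('en.embed_sentence.small_bert_L12_768 train.sentiment')
    #   pipe['trainable_sentiment_dl'].setMaxEpochs(params['max_epochs'])
    #   pipe['trainable_sentiment_dl'].setLr(params['lr'])
    #   fitted = pipe.fit(train_rows); preds = fitted.predict(eval_rows, ...)
    #   return <accuracy or F1 computed from preds>
    # (train_rows and eval_rows are assumed names, not defined in this notebook.)
    return -abs(params["lr"] - 0.0005) - abs(params["max_epochs"] - 100) / 1000

grid = {"max_epochs": [20, 50, 100], "lr": [0.005, 0.0005]}
best_params, best_score, results = grid_search(grid, toy_score)
print(best_params, len(results))
```

Each combination means one full training run, so the grid above would retrain the classifier six times; small grids and a subset of the data keep that affordable.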





# 5. Let's save the model

In [12]:
stored_model_path = './models/classifier_dl_trained'
fitted_pipe.save(stored_model_path)
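When retraining several times, it can help to give each saved pipeline its own directory instead of reusing one path. The sketch below is an optional convenience and an assumption about workflow, not an NLU feature: `next_free_path` is a hypothetical helper that returns the requested path if it is unused, or the first unused `_v1`, `_v2`, ... variant otherwise.

```python
from pathlib import Path

def next_free_path(base):
    """Return `base` if nothing exists there, else the first free '<base>_vN'.

    Hypothetical helper for illustration; it only inspects the filesystem
    and does not create or delete anything.
    """
    base = Path(base)
    if not base.exists():
        return base
    n = 1
    while True:
        candidate = base.with_name(f"{base.name}_v{n}")
        if not candidate.exists():
            return candidate
        n += 1

# Same path as in the cell above; prints a path that is currently unused.
print(next_free_path('./models/classifier_dl_trained'))
```

The result could then be passed to the save call from the cell above, for example `fitted_pipe.save(str(next_free_path(stored_model_path)))`.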

# 6. Let's load the model from disk
This makes offline NLU usage possible!  
Call nlp.load(path=path_to_the_pipe) to load a model or pipeline from disk.

In [13]:
hdd_pipe = nlp.load(path=stored_model_path)

preds = hdd_pipe.predict('the president of india just died')
preds



| | document | sentence_embedding_from_disk | sentiment | sentiment_confidence |
|---|---|---|---|---|
| 0 | the president of india just died | [0.009459968656301498, -0.07943318039178848, 0... | positive | 0.0 |


In [14]:
hdd_pipe.print_info()

The following parameters are configurable for this NLU pipeline (You can copy paste the examples) :
>>> component_list['document_assembler'] has settable params:
component_list['document_assembler'].setCleanupMode('shrink')                                    | Info: possible values: disabled, inplace, inplace_full, shrink, shrink_full, each, each_full, delete_full | Currently set to : shrink
>>> component_list['bert_sentence_embeddings@sent_small_bert_L12_768'] has settable params:
component_list['bert_sentence_embeddings@sent_small_bert_L12_768'].setBatchSize(8)               | Info: Size of every batch | Currently set to : 8
component_list['bert_sentence_embeddings@sent_small_bert_L12_768'].setCaseSensitive(False)       | Info: whether to ignore case in tokens for embeddings matching | Currently set to : False
component_list['bert_sentence_embeddings@sent_small_bert_L12_768'].setDimension(768)             | Info: Number of embedding dimensions | Currently set to : 768
component_list[