Which pretrained Natural Language Processing (NLP) model has better prediction accuracy for sentiment analysis, Hugging Face or Flair?

# Resources

- [Blog post](https://medium.com/@AmyGrabNGoInfo/sentiment-analysis-hugging-face-zero-shot-model-vs-flair-pre-trained-model-57047452225d) for this tutorial
- Video version of the tutorial on [YouTube](https://www.youtube.com/watch?v=YYv4criapEI&list=PLVppujud2yJpx5r8GFeJ81fyek8dEDMX-&index=10)
- More video tutorials on [NLP](https://www.youtube.com/playlist?list=PLVppujud2yJpx5r8GFeJ81fyek8dEDMX-)
- More blog posts on [NLP](https://medium.com/@AmyGrabNGoInfo/list/nlp-49340193610f)


For more information about data science and machine learning, please check out my [YouTube channel](https://www.youtube.com/@grabngoinfo), [Medium Page](https://medium.com/@AmyGrabNGoInfo) and [GrabNGoInfo.com](https://grabngoinfo.com/tutorials/), or follow GrabNGoInfo on [LinkedIn](https://www.linkedin.com/company/grabngoinfo/).

# Intro

There are different methods for sentiment analysis. Some examples are lexicon-based methods, building customized models, using cloud services for sentiment predictions, or using pre-trained language models. 

In this tutorial, we will compare two state-of-the-art deep-learning pre-trained models for sentiment analysis, one from Hugging Face and the other from Flair. We will talk about:
* What are the benefits of a pre-trained model for sentiment analysis?
* How to use Hugging Face zero-shot classification model for sentiment analysis?
* How to use Flair pre-trained sentiment model for sentiment analysis?
* Which one of the two models has higher accuracy for sentiment prediction?

Let's get started!

# Step 1: Benefits of Pre-trained Model for Sentiment Analysis


Firstly, let's talk about the benefits of using a pre-trained classification language model.
* Compared with lexicon-based sentiment analysis such as VADER or TextBlob, the pre-trained zero-shot deep-learning language models are usually more accurate. To learn more about the lexicon-based sentiment analysis, please check out my previous tutorial [TextBlob vs. VADER for Sentiment Analysis Using Python](https://medium.com/towards-artificial-intelligence/textblob-vs-vader-for-sentiment-analysis-using-python-76883d40f9ae).
* Compared with the customized sentiment analysis models, the pre-trained zero-shot deep-learning language models usually utilize a much larger training dataset. A larger training dataset typically produces better model results. One exception is that if the documents for the sentiment analysis are from a highly specialized domain, a customized sentiment model may work better.
* A customized classification model needs labeled data, while a zero-shot sentiment analysis model does not need the data to be labeled. This saves the cost of labeling, which is usually pretty high for large datasets. 
* Compared with the cloud services for sentiment analysis such as [Amazon Comprehend](https://aws.amazon.com/comprehend/pricing/), [Azure Cognitive Service for Language](https://azure.microsoft.com/en-us/pricing/details/cognitive-services/language-service/), [Google Natural Language API](https://cloud.google.com/natural-language/pricing), and [IBM Watson Natural Language Understanding API](https://www.ibm.com/cloud/watson-natural-language-understanding/pricing), the open-source pre-trained zero-shot models have much lower cost because they are free to use.

# Step 2: Sentiment Analysis Algorithms

In step 2, let's talk about the algorithms behind the Hugging Face zero-shot sentiment analysis and the Flair pretrained sentiment model. 

Hugging Face zero-shot sentiment analysis uses zero-shot learning (ZSL), which refers to building a model and using it to make predictions on tasks the model was not trained to do. It can be used on any text classification task, including but not limited to sentiment analysis and topic modeling. 

Zero-shot sentiment analysis from Hugging Face is a use case of the Hugging Face zero-shot text classification model. It is a Natural Language Inference (NLI) model where two sequences are compared to see if they contradict each other, entail each other, or are neutral (neither contradict nor entail).

When using the Hugging Face zero-shot sentiment analysis, we will have the text as the premise and the sentiment labels such as `positive` and `negative` as hypotheses. If the model predicts that a text document entails `positive`, then the document is predicted to have a positive sentiment. Otherwise, the document is predicted to have a negative sentiment.
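As a rough illustration of this setup, the sketch below pairs one review (the premise) with one hypothesis per candidate label and then picks the label with the highest entailment score. The helper names `build_nli_pairs` and `pick_label` and the entailment scores are assumptions made up for this example. In the real pipeline, the scores come from the NLI model.

```python
def build_nli_pairs(text, candidate_labels, hypothesis_template):
    """Pair the text (premise) with one hypothesis per candidate label."""
    return [(text, hypothesis_template.format(label)) for label in candidate_labels]


def pick_label(candidate_labels, entailment_scores):
    """Return the label whose hypothesis has the highest entailment score."""
    best_index = max(range(len(candidate_labels)), key=lambda i: entailment_scores[i])
    return candidate_labels[best_index]


pairs = build_nli_pairs(
    "Great for the jawbone.",
    ["positive", "negative"],
    "The sentiment of this review is {}.",
)
for premise, hypothesis in pairs:
    print(f"premise: {premise!r} -> hypothesis: {hypothesis!r}")

# Hypothetical entailment scores, invented for illustration only
print(pick_label(["positive", "negative"], [0.99, 0.01]))
```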

The Flair pre-trained sentiment model is a text classification model explicitly built for predicting sentiments. The modeling dataset is IMDB, so it may work better for documents that are similar to the IMDB data than for documents that are quite different from it.

# Step 3: Install And Import Python Libraries

In step 3, we will install and import Python libraries.

Firstly, let's install `transformers` and `flair`.

In [None]:
# Install libraries
!pip install transformers flair

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.25.1-py3-none-any.whl (5.8 MB)
Collecting flair
  Downloading flair-0.11.3-py3-none-any.whl (401 kB)
Collecting huggingface-hub<1.0,>=0.10.0
  Downloading huggingface_hub-0.11.1-py3-none-any.whl (182 kB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1
  Downloading tokenizers-0.13.2-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.6 MB)
Collecting segt

After installing the Python packages, we will import the Python libraries.
* `pandas` is imported for data processing.
* Hugging Face `pipeline` is imported from `transformers` for the zero-shot classification model.
* The English sentiment model is loaded from the `Flair` `TextClassifier`.
* `Sentence` is imported from Flair to process input text.
* The `accuracy_score` is imported for model performance.

In [None]:
# Data processing
import pandas as pd

# Hugging Face model
from transformers import pipeline

# Import flair pre-trained sentiment model
from flair.models import TextClassifier
classifier = TextClassifier.load('en-sentiment')

# Import flair Sentence to process input text
from flair.data import Sentence

# Import accuracy_score to check performance
from sklearn.metrics import accuracy_score

2023-01-06 20:12:38,010 loading file /root/.flair/models/sentiment-en-mix-distillbert_4.pt


# Step 4: Download And Read Data

The fourth step is to download and read the dataset. 

The UCI Machine Learning Repository has the review data from three websites: imdb.com, amazon.com, and yelp.com. We will use the review data from amazon.com for this tutorial. Please follow these steps to download the data.
1. Go to: https://archive.ics.uci.edu/ml/datasets/Sentiment+Labelled+Sentences
2. Click "Data Folder"
3. Download "sentiment labelled sentences.zip"
4. Unzip "sentiment labelled sentences.zip"
5. Copy the unzipped folder "sentiment labelled sentences", which contains the file "amazon_cells_labelled.txt", to your project folder

Those who are using Google Colab for this analysis need to mount Google Drive to read the dataset. You can ignore the code below if you are not using Google Colab. 
* `drive.mount` is used to mount Google Drive so the Colab notebook can access the data stored on it.
* `os.chdir` is used to change the working directory to a folder on Google Drive. I set it to the folder where the review dataset is saved.
* `!pwd` is used to print the current working directory.

Please check out [Google Colab Tutorial for Beginners](https://medium.com/towards-artificial-intelligence/google-colab-tutorial-for-beginners-834595494d44) for details about using Google Colab for data science projects. 

In [None]:
# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

# Change directory
import os
os.chdir("drive/My Drive/contents/nlp")

# Print out the current directory
!pwd

Mounted at /content/drive
/content/drive/My Drive/contents/nlp


Now let's read the data into a `pandas` dataframe and see what the dataset looks like.

The dataset has two columns. One column contains the reviews and the other column contains the sentiment label for the review. 

In [None]:
# Read in data
amz_review = pd.read_csv('sentiment labelled sentences/amazon_cells_labelled.txt', sep='\t', names=['review', 'label'])

# Take a look at the data
amz_review.head()

|   | review | label |
|---|---|---|
| 0 | So there is no way for me to plug it in here i... | 0 |
| 1 | Good case, Excellent value. | 1 |
| 2 | Great for the jawbone. | 1 |
| 3 | Tied to charger for conversations lasting more... | 0 |
| 4 | The mic is great. | 1 |


`.info()` gives us summary information about the dataset.

From the output, we can see that this data set has 1000 records and no missing data. The `review` column is the `object` type and the `label` column is the `int64` type.

In [None]:
# Get the dataset information
amz_review.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   review  1000 non-null   object
 1   label   1000 non-null   int64 
dtypes: int64(1), object(1)
memory usage: 15.8+ KB
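The 1000 rows reported above suggest that the default parsing worked for this file. As a precaution for similar tab-separated review files, note that pandas treats a double quote as a quoting character by default, so a review with an unbalanced quote can swallow the lines that follow it. The sketch below, which uses made-up sample lines rather than the real dataset, shows one way to read quotes literally with `quoting=csv.QUOTE_NONE`.

```python
import csv
import io

import pandas as pd

# Hypothetical lines in the same "review<TAB>label" layout as the dataset
raw = 'Great phone\t1\n"Bad battery\t0\nNice "case" overall\t1\n'

reviews = pd.read_csv(
    io.StringIO(raw),
    sep="\t",
    names=["review", "label"],
    quoting=csv.QUOTE_NONE,  # treat double quotes as ordinary characters
)

# Each input line becomes exactly one row
print(len(reviews))
print(reviews)
```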


The label value of 0 represents negative reviews and the label value of 1 represents positive reviews. The dataset has 500 positive reviews and 500 negative reviews. It is well-balanced, so we can use accuracy as the metric to evaluate the model performance.

In [None]:
# Check the label distribution
amz_review['label'].value_counts()

0    500
1    500
Name: label, dtype: int64
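To see why the balance matters for accuracy, here is a small sketch that computes the accuracy of always predicting the most common label. The helper name `majority_baseline_accuracy` and the imbalanced label counts are assumptions for illustration. Only the 500/500 split comes from the dataset.

```python
from collections import Counter


def majority_baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most frequent label."""
    counts = Counter(labels)
    return max(counts.values()) / len(labels)


# Same label distribution as the Amazon review dataset
balanced = [0] * 500 + [1] * 500

# Hypothetical imbalanced dataset, for contrast only
imbalanced = [0] * 950 + [1] * 50

print(majority_baseline_accuracy(balanced))
print(majority_baseline_accuracy(imbalanced))
```

On the balanced data, a model that ignores the text entirely gets only half of the reviews right, so a high accuracy reflects real predictive power. On the imbalanced example, the same trivial strategy already looks accurate, which is why accuracy alone would be misleading there.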

# Step 5: Hugging Face Zero-shot Sentiment Prediction

In step 5, we will use the Hugging Face zero-shot text classification model to predict sentiment for each review.

Firstly, the pipeline is defined:
 * `task` describes the task for the pipeline. The task name we use is `zero-shot-classification`.
 * `model` is the model name for the prediction used in the pipeline. You can find the full list of available models for zero-shot classification on the [Hugging Face website](https://huggingface.co/models?pipeline_tag=zero-shot-classification). At the time this tutorial was created in January 2023, `bart-large-mnli` by Facebook (Meta) was the model with the highest number of downloads and likes, so we will use it for the pipeline.
 * `device` defines the device the pipeline runs on. `device=0` means that we are using the first GPU (index 0) for the pipeline.
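If the notebook may run on a machine without a GPU, the device can be chosen at runtime instead of being fixed. The sketch below is one possible approach, not part of the original notebook. The helper name `pick_device` is hypothetical, and it relies on the pipeline convention that `device=-1` means CPU.

```python
def pick_device():
    """Return 0 (first GPU) when PyTorch sees CUDA, otherwise -1 (CPU)."""
    try:
        import torch
    except ImportError:
        # Without PyTorch we cannot query CUDA, so fall back to CPU
        return -1
    return 0 if torch.cuda.is_available() else -1


device = pick_device()
print(device)
```

The resulting value could then be passed as `device=device` when defining the pipeline.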

In [None]:
# Define pipeline
classifier = pipeline(task="zero-shot-classification", 
                      model="facebook/bart-large-mnli",
                      device=0) 


After defining the pipeline, the data is processed and the sentiments are predicted by the pipeline.
* Firstly, the reviews are put into a list for the pipeline.
* Then, the candidate labels are defined. We set two candidate labels, `positive` and `negative`.
* After that, the hypothesis template is defined. The Hugging Face pipeline's default template is `This example is {}.`, and we replace it with `The sentiment of this review is {}.`, which is more specific to sentiment analysis and helps to improve the results.
* Finally, the text, the candidate labels, and the hypothesis template are passed into the zero-shot classification pipeline called `classifier`. 

The output is a list of dictionaries, which we convert into a pandas dataframe.




In [None]:
# Put reviews in a list
sequences = amz_review['review'].to_list()

# Define the candidate labels 
candidate_labels = ["positive", "negative"]

# Set the hypothesis template
hypothesis_template = "The sentiment of this review is {}."

# Prediction results
hf_prediction = classifier(sequences, candidate_labels, hypothesis_template=hypothesis_template)

# Save the output as a dataframe
hf_prediction = pd.DataFrame(hf_prediction)

# Take a look at the data
hf_prediction.head()

|   | sequence | labels | scores |
|---|---|---|---|
| 0 | So there is no way for me to plug it in here i... | [negative, positive] | [0.8545544743537903, 0.14544548094272614] |
| 1 | Good case, Excellent value. | [positive, negative] | [0.9976164102554321, 0.0023835927713662386] |
| 2 | Great for the jawbone. | [positive, negative] | [0.9928712248802185, 0.007128703407943249] |
| 3 | Tied to charger for conversations lasting more... | [negative, positive] | [0.9851537942886353, 0.01484624482691288] |
| 4 | The mic is great. | [positive, negative] | [0.9943010210990906, 0.005698992405086756] |


The positive and negative scores for each review sum to 1, so each score measures how likely a sentiment is relative to the other candidate label.

The first label in the labels list is the predicted sentiment for each review, and the first score in the scores list is the corresponding score. For example, the review `Great for the jawbone.` has the predicted sentiment of `positive` and the predicted score of `0.99`, indicating that `positive` is a much more likely sentiment than `negative`. Note that the scores are not absolute predicted probabilities of the sentiment. They represent only the relative probability among the given candidate labels.
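The relative nature of the scores can be sketched with a plain softmax, which is how the pipeline normalizes entailment logits across candidate labels in its default single-label mode. The logit values and the extra `neutral` label below are assumptions invented for this example, not model output.

```python
import math


def softmax(values):
    """Normalize a list of logits into scores that sum to 1."""
    largest = max(values)
    exps = [math.exp(v - largest) for v in values]  # shift for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical entailment logits for one review
logits = {"positive": 3.0, "negative": -1.0, "neutral": 2.5}

two_labels = softmax([logits["positive"], logits["negative"]])
three_labels = softmax([logits["positive"], logits["negative"], logits["neutral"]])

print("positive score with two candidate labels:", round(two_labels[0], 3))
print("positive score with three candidate labels:", round(three_labels[0], 3))
```

The logit for `positive` is the same in both cases, yet its score drops once a third candidate label competes for the same total of 1.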

To make the prediction results easy to read and process, two new columns are created, one for the predicted sentiment and the other for the score of the predicted sentiment. We also appended the true sentiment labels for the reviews.

In [None]:
# The column for the predicted sentiment
hf_prediction['hf_prediction'] = hf_prediction['labels'].apply(lambda x: x[0])

# Map sentiment values
hf_prediction['hf_prediction'] = hf_prediction['hf_prediction'].map({'positive': 1, 'negative': 0})

# The column for the score of the predicted sentiment
hf_prediction['hf_predicted_score'] = hf_prediction['scores'].apply(lambda x: x[0])

# The actual labels
hf_prediction['true_label'] = amz_review['label']

# Drop the columns that we do not need
hf_prediction = hf_prediction.drop(['labels', 'scores'], axis=1)

# Take a look at the data
hf_prediction.head()

|   | sequence | hf_prediction | hf_predicted_score | true_label |
|---|---|---|---|---|
| 0 | So there is no way for me to plug it in here i... | 0 | 0.854554 | 0 |
| 1 | Good case, Excellent value. | 1 | 0.997616 | 1 |
| 2 | Great for the jawbone. | 1 | 0.992871 | 1 |
| 3 | Tied to charger for conversations lasting more... | 0 | 0.985154 | 0 |
| 4 | The mic is great. | 1 | 0.994301 | 1 |


The comparison between the actual and predicted sentiment shows an accuracy score of 96.9%, which is high, especially considering that this is a general pretrained zero-shot text classification model that is not specific to sentiment analysis.

In [None]:
# Compare Actual and Predicted
accuracy_score(hf_prediction['true_label'], hf_prediction['hf_prediction'])

0.969
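Accuracy is simply the share of reviews where the predicted label matches the true label. The sketch below breaks that comparison into the four possible outcomes on a tiny made-up example. The helper names and the toy labels are assumptions for illustration and are not the tutorial's data.

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for 0/1 labels."""
    counts = {"tp": 0, "tn": 0, "fp": 0, "fn": 0}
    for actual, predicted in zip(y_true, y_pred):
        if actual == 1 and predicted == 1:
            counts["tp"] += 1
        elif actual == 0 and predicted == 0:
            counts["tn"] += 1
        elif actual == 0 and predicted == 1:
            counts["fp"] += 1
        else:
            counts["fn"] += 1
    return counts


def accuracy_from_counts(counts):
    """Share of predictions that match the true label."""
    correct = counts["tp"] + counts["tn"]
    return correct / sum(counts.values())


# Toy labels, invented for illustration only
toy_true = [1, 1, 0, 0, 1, 0]
toy_pred = [1, 0, 0, 1, 1, 0]

toy_counts = confusion_counts(toy_true, toy_pred)
print(toy_counts)
print(accuracy_from_counts(toy_counts))
```

Looking at the false positives and false negatives separately can show whether a model leans toward one sentiment, which a single accuracy number hides.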

# Step 6: Flair Pretrained Sentiment Model

In step 6, we will use the Flair pretrained sentiment model to predict sentiments for the reviews.

Let’s define a function that takes a review as input and returns the score and the predicted label.
* Firstly, the review text is passed into the `Sentence` class to get tokenized.
* Then, we use `.predict()` to make sentiment predictions.
* After the prediction, we can extract `score` and `value` from the first label of the `sentence`. `value` is the predicted sentiment label, and `score` is how confident the model is about the prediction.
* Finally, the function returns the `score` and the `value` for the input review.


In [None]:
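# Load the Flair model under its own name, because `classifier` now
# refers to the Hugging Face pipeline defined in step 5
flair_classifier = TextClassifier.load('en-sentiment')

# Define a function to get Flair sentiment prediction score
def score_flair(text):
  # Flair tokenization
  sentence = Sentence(text)
  # Predict sentiment
  flair_classifier.predict(sentence)
  # Extract the score
  score = sentence.labels[0].score
  # Extract the predicted label
  value = sentence.labels[0].value
  # Return the score and the predicted label
  return score, value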

After the function is defined, we can apply the function to each review in the dataset and create the predicted sentiments.

From the score distribution, we can see that the minimum score is 0.53 and the average score is 0.99, indicating that the model is very confident about the sentiment predictions.

In [None]:
# Run the Flair model once per review and keep both outputs
flair_results = [score_flair(s) for s in amz_review['review']]

# Get sentiment score for each review
amz_review['scores_flair'] = [score for score, _ in flair_results]

# Predict sentiment label for each review
amz_review['pred_flair'] = [value for _, value in flair_results]

# Check the distribution of the score
amz_review['scores_flair'].describe()

count    1000.000000
mean        0.988019
std         0.046841
min         0.533639
25%         0.996153
50%         0.999167
75%         0.999887
max         0.999999
Name: scores_flair, dtype: float64
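Keep in mind that the score is the confidence in the predicted label, so a confidently negative review also gets a score close to 1. If a single polarity value is more convenient, the score and the label can be combined into a signed score. The sketch below is an optional convenience that is not part of the original notebook. The helper name `signed_score` is hypothetical, and the label strings are the uppercase `POSITIVE` and `NEGATIVE` values that Flair outputs.

```python
def signed_score(value, score):
    """Map a (label, confidence) pair to a polarity value in [-1, 1]."""
    if value == "POSITIVE":
        return score
    if value == "NEGATIVE":
        return -score
    raise ValueError(f"Unexpected label: {value!r}")


# Confidence values taken from the first two reviews shown in this tutorial
print(signed_score("NEGATIVE", 0.998717))
print(signed_score("POSITIVE", 0.998424))
```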

Flair by default outputs text `NEGATIVE` and `POSITIVE` as labels. Before checking the prediction accuracy, we need to map the `NEGATIVE` value to 0 and the `POSITIVE` value to 1 because the Amazon review dataset has true labels of 0 and 1.

In [None]:
# Change the label of flair prediction to 0 if negative and 1 if positive
mapping = {'NEGATIVE': 0, 'POSITIVE': 1}
amz_review['pred_flair'] = amz_review['pred_flair'].map(mapping)

# Take a look at the data
amz_review.head()

|   | review | label | scores_flair | pred_flair |
|---|---|---|---|---|
| 0 | So there is no way for me to plug it in here i... | 0 | 0.998717 | 0 |
| 1 | Good case, Excellent value. | 1 | 0.998424 | 1 |
| 2 | Great for the jawbone. | 1 | 0.995642 | 1 |
| 3 | Tied to charger for conversations lasting more... | 0 | 0.999925 | 0 |
| 4 | The mic is great. | 1 | 0.979156 | 1 |


The comparison between the actual and predicted sentiment shows an accuracy score of 94.8%, which is 2.1 percentage points lower than the Hugging Face prediction accuracy of 96.9% but is still high.

In [None]:
# Compare Actual and Predicted
accuracy_score(amz_review['label'], amz_review['pred_flair'])

0.948
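Both accuracy values come from the same 1000 reviews, so it helps to get a rough sense of how much sampling noise such an estimate carries. The sketch below uses the normal approximation for a proportion, which is a simplifying assumption of this example rather than part of the original analysis. Because the two models were scored on the same reviews, a paired comparison of the per-review predictions would be a more precise check.

```python
import math


def accuracy_standard_error(accuracy, n):
    """Standard error of an accuracy estimate under the normal approximation."""
    return math.sqrt(accuracy * (1 - accuracy) / n)


n_reviews = 1000
results = [("Hugging Face zero-shot", 0.969), ("Flair", 0.948)]

for name, acc in results:
    se = accuracy_standard_error(acc, n_reviews)
    low, high = acc - 1.96 * se, acc + 1.96 * se
    print(f"{name}: {acc:.3f} (approx. 95% interval {low:.3f} to {high:.3f})")
```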


# References

* [Hugging Face New pipeline for zero-shot text classification](https://discuss.huggingface.co/t/new-pipeline-for-zero-shot-text-classification/681)
* [Zero-shot Learning in Modern NLP](https://joeddav.github.io/blog/2020/05/29/ZSL.html)
* [Zero-shot Pipeline Notebook](https://colab.research.google.com/drive/1jocViLorbwWIkTXKwxCOV9HLTaDDgCaw?usp=sharing)
* [Using Huggingface zero-shot text classification with large data set](https://stackoverflow.com/questions/63953597/using-huggingface-zero-shot-text-classification-with-large-data-set)
* [Zero-shot classification NLI models](https://huggingface.co/models?pipeline_tag=zero-shot-classification)
* [Hugging Face bart-large-mnli model documentation](https://huggingface.co/facebook/bart-large-mnli)
* [Task-Aware Representation of Sentences for Generic Text Classification Paper](https://kishaloyhalder.github.io/pdfs/tars_coling2020.pdf)
* [Flair Few-Shot and Zero-Shot Classification (TARS)](https://github.com/flairNLP/flair/blob/master/resources/docs/TUTORIAL_10_TRAINING_ZERO_SHOT_MODEL.md)