Commit 94412ac: sentiment analysis post updated
BedirT committed Apr 12, 2023 (1 parent: 6b1de8c)
Showing 1 changed file with 177 additions and 10 deletions: content/posts/blog_posts/fine_tune_sentiment/index.md

I was recently working on a sentiment analysis tool for [my company](https://www.avalancheinsights.com/). Having worked
on it myself before, I was confident that I could get it done in a few days. However, I was wrong. I ended up going
through quite a bit of research and experimentation to get the results we were happy with. In this post, I will
talk about my research, thought process, and the methods I used to get the results we wanted.

# The Problem & Starting Point

Beyond basic positive/negative/neutral classification, sentiment analysis can also target more fine-grained
emotions such as anger, joy, sadness, etc. If we go even further, there is an interest in finding where exactly the
sentiment is in the text. For example, you can find the sentiment of a sentence in a paragraph. This is called
aspect-based sentiment analysis. You can read more about these tasks [here](https://www.surveymonkey.co.uk/mp/what-customers-really-think-how-sentiment-analysis-can-help/).

With that being said, we decided to go with the most common sentiment analysis task which is classifying the text
into positive, negative, or neutral. This gives us the most flexibility in terms of the types of data we can use
for training and the already-available models we can test.

# Data Labeling

The first step is to label our data. We have a lot of data, but we don't have the time to label all of it. So
we decided to label a very small subset of it. We used [Doccano](https://github.com/doccano/doccano) for
labeling. It is a very simple tool built to make labeling your data easy. You can read more about it on their
GitHub page.

After labeling, we had a small dataset that we could use to test our models. We had 200 samples that were
selected via stratified sampling. The initial plan was to label 1000 data samples, but we decided to go with
200 samples to save time.
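
As a sketch of how such a stratified sample might be drawn, assuming the full dataset sits in a DataFrame and
stratifying on a hypothetical `group` column (whatever grouping makes sense before sentiment labels exist),
scikit-learn's `train_test_split` can do the job:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Full unlabeled dataset; 'group' is a hypothetical column to stratify on
# (e.g. which survey or question each response came from), since sentiment
# labels don't exist yet at this stage.
df = pd.read_csv("all_responses.csv")

# Hold out 200 samples whose group distribution mirrors the full dataset
_, to_label = train_test_split(df, test_size=200, stratify=df["group"], random_state=42)

to_label.to_csv("to_label.csv", index=False)  # this file then goes into Doccano
```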

Next, let's go over the models I tested, their differences, and why I selected them.
For the first pass, I selected a couple of highly ranked models from HuggingFace's Transformers. I also used a
baseline model, 'VADER', which is a rule-based sentiment analysis tool, to compare the Transformer models against.
And, of course, with all the success of GPT-3.5 and GPT-4, we needed to include some few-shot
and zero-shot approaches using GPT via the [OpenAI](https://openai.com/) framework.

So let's list out all the models I used:
1. [VADER](https://github.com/cjhutto/vaderSentiment)
2. [Huggingface "sbcBI/sentiment_analysis_model"](https://huggingface.co/sbcBI/sentiment_analysis_model)
3. [Huggingface "cardiffnlp/twitter-xlm-roberta-base-sentiment"](https://huggingface.co/cardiffnlp/twitter-xlm-roberta-base-sentiment)
4. [Huggingface "Seethal/sentiment_analysis_generic_dataset"](https://huggingface.co/Seethal/sentiment_analysis_generic_dataset)
5. [Huggingface "LiYuan/amazon-review-sentiment-analysis"](https://huggingface.co/LiYuan/amazon-review-sentiment-analysis)
6. [Huggingface "ahmedrachid/FinancialBERT-Sentiment-Analysis"](https://huggingface.co/ahmedrachid/FinancialBERT-Sentiment-Analysis)
7. [Huggingface "mrm8488/distilroberta-finetuned-financial-news-sentiment-analysis"](https://huggingface.co/mrm8488/distilroberta-finetuned-financial-news-sentiment-analysis)
8. [PySentimento](https://github.com/pysentimiento/pysentimiento)
9. GPT-3.5 (zero-shot, few-shot)
10. GPT-4 (zero-shot, few-shot)

Let's go over some basic examples of how to use each type of model, and then jump into our initial results.

### VADER
```python
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

sample = "I love this product. It is the best product I have ever used."

# Create a SentimentIntensityAnalyzer object.
analyzer = SentimentIntensityAnalyzer()

# Compound sentiment score in [-1, 1]
score = analyzer.polarity_scores(sample)['compound']

# The scoring for VADER is different than the other models. Please read about it in the documentation.
if score >= 0.05:
    sentiment = 'positive'
elif score <= -0.05:
    sentiment = 'negative'
else:
    sentiment = 'neutral'
```

### HuggingFace Transformers
```python
from transformers import pipeline

sample = "I love this product. It is the best product I have ever used."

model_name = "sbcBI/sentiment_analysis_model"
sentiment_task = pipeline("sentiment-analysis", model=model_name, tokenizer=model_name)
label = sentiment_task(sample)[0]['label']
```

### PySentimento
```python
from pysentimiento import create_analyzer

sample = "I love this product. It is the best product I have ever used."

sentiment_task = create_analyzer(task='sentiment', lang='en')
label = sentiment_task.predict(sample).output
```

### GPT-3.5/4
```python
import openai
import os

# Get your api key loaded
openai.api_key = os.environ.get("OPENAI_API_KEY")

sample = "I love this product. It is the best product I have ever used."

messages = [
    {"role": "system", "content": "Specific Sentiment Instructions"},
    # if you want few-shot, example pairs go here
    {"role": "user", "content": f"{sample}\nSentiment:"},
]
# Send the request
response = openai.ChatCompletion.create(
    model='gpt-3.5-turbo',
    messages=messages,
    max_tokens=20,
)

sentiment = response['choices'][0]['message']['content']
```

# Evaluation Metrics

To be able to judge the performance of our models, we need to have some evaluation metrics.
Commonly used metrics for sentiment analysis are:
1. Accuracy
2. Precision
3. Recall
4. F1 Score

It is a good idea to use a combination of these metrics to get a better understanding of the performance of
the model. This is especially important when we have an imbalanced dataset. For example, if we have 1000
samples, 900 of them positive and 100 negative, we can get a very high accuracy score by always predicting
positive, but that doesn't mean our model is good. So we need other metrics to evaluate the performance of
our model.

F1 score combines precision and recall (it is their harmonic mean), so we decided to use F1 score and accuracy
as our evaluation metrics.
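
To make the imbalance point concrete, here is a small sketch with made-up labels showing how a model that
always predicts the majority class scores 90% accuracy but a poor macro F1:

```python
from sklearn.metrics import accuracy_score, f1_score

# Hypothetical imbalanced dataset: 900 positive and 100 negative samples
y_true = ["positive"] * 900 + ["negative"] * 100

# A "model" that always predicts the majority class
y_pred = ["positive"] * 1000

print(accuracy_score(y_true, y_pred))              # 0.9 - looks great
print(f1_score(y_true, y_pred, average="macro"))   # ~0.47 - reveals the problem
```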

# Initial Results

Now that we have our models and our evaluation metrics, we can start testing the pre-trained models.
We will use the 200 samples that we labeled to test the models. Since there is no training involved, we will
use all the data for testing.
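
As a rough sketch of what this evaluation loop looks like (the `labeled.csv` file and its `text`/`label`
columns are hypothetical names for our hand-labeled set), running one of the HuggingFace models over all
200 samples might look like this:

```python
import pandas as pd
from transformers import pipeline
from sklearn.metrics import accuracy_score, f1_score

# The 200 hand-labeled samples; file and column names are illustrative
df = pd.read_csv("labeled.csv")  # columns: text, label

model_name = "cardiffnlp/twitter-xlm-roberta-base-sentiment"
sentiment_task = pipeline("sentiment-analysis", model=model_name, tokenizer=model_name)

# Predict a label for every sample. This model returns human-readable labels;
# some models return LABEL_0/1/2 and would need an extra mapping step.
preds = [sentiment_task(text)[0]["label"].lower() for text in df["text"]]

print("Accuracy:", accuracy_score(df["label"], preds))
print("F1 (weighted):", f1_score(df["label"], preds, average="weighted"))
```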

Let's not forget that these results are more of a sanity check, and a general evaluation of how close our
data is to the ones used to train the models. If by luck our data is very similar to the data used to train
the models, then we can expect to get good results and stop there. But if the results are not good, then we
need to do some more work to get better results, or try to find a better model.

Here is the accuracy plot including all the models.

![Accuracy Plot](img/accuracy_plot.png)

Here is the F1 score plot including all the models.

![F1 Score Plot](img/f1_score_plot.png)

As you can see, VADER is the worst-performing model and GPT-4 is the best, with GPT-3.5 performing close
behind it. The HuggingFace models are not really performing well. The best open-source model is PySentimento,
but it still isn't at the level we want.

One thing to note is that our data is quite complex and hard even for humans to label, so there might be some
bias in the data. But we will not go into that in this post since I am not revealing the data itself.

So we can see that the GPT-3.5 and GPT-4 models are performing well. These results are zero-shot; we could get
even better results with few-shot prompting.

After seeing the potential of GPT models (and the poor performance of the pre-trained sentiment analysis models),
we decided to first investigate the GPT-3.5 and GPT-4 models, and then try to train our own sentiment analysis model
using GPT as the labeler. This will give us a smaller open-source model that we can use in our system, one that
performs similarly to the GPT models but doesn't cost us anything.

# Evaluating GPT-3.5 and GPT-4

Starting with the same small dataset, we first test some different prompting methods to see how we can get the
best results. This will guide us as to which method we should use for labeling the data for our sentiment
analysis model.

One thing we tested aside from the prompt wording was the general prompting technique. For tasks like this, where
each sample is independent, we can introduce a parameter called `sample batch size`, which controls how many
samples are sent to the model at once. This parameter is important because sending many samples in one request
forces the model to generate all of the labels at once, which is a harder task. The upside, however, is cost,
since we do not have to repeat the same pre-prompt (or instructions) for each sample.
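
To illustrate the idea (this is a sketch, not the exact prompt or parsing we used), batching `sample batch size`
texts into a single request and reading back one label per line could look roughly like this:

```python
import openai

def label_batch(samples, sample_batch_size=10, model="gpt-3.5-turbo"):
    """Label texts in chunks of `sample_batch_size` per request (illustrative sketch)."""
    labels = []
    for i in range(0, len(samples), sample_batch_size):
        batch = samples[i:i + sample_batch_size]
        numbered = "\n".join(f"{j + 1}. {text}" for j, text in enumerate(batch))
        messages = [
            {"role": "system", "content": "Label each numbered text as positive, negative, or neutral. "
                                          "Reply with one label per line, in the same order."},
            {"role": "user", "content": numbered},
        ]
        response = openai.ChatCompletion.create(model=model, messages=messages)
        reply = response["choices"][0]["message"]["content"]
        # A real implementation would validate the count and handle stray numbering in the reply
        labels.extend(line.strip().lower() for line in reply.splitlines() if line.strip())
    return labels
```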

I am not going to go into too much detail about the exact prompts we used, but to give a general direction:
we include clear instructions for the model. GPT models give us the flexibility to explain exactly what we
want, describe how we perceive the task, and state what we expect from the model. For this, we have clear
definitions of what is considered positive, negative, and neutral.
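
For illustration only, and not our actual production prompt, the instruction part could look something like this:

```python
# Illustrative system prompt with explicit label definitions (not our actual prompt)
system_prompt = """You are a sentiment classifier.
Label each text as exactly one of: positive, negative, or neutral.
- positive: the text expresses satisfaction, praise, or enthusiasm.
- negative: the text expresses dissatisfaction, criticism, or frustration.
- neutral: the text is factual, mixed with no clear leaning, or off-topic.
Respond with only the label."""
```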

Here are the results of the different prompting methods.

![GPT Prompting Results](img/gpt_prompting_results.png)

We included 4 different metrics in the plot:
1. Accuracy: As we already discussed, this is the main measure of how well the model predicts the labels.
Both GPT-3.5 and GPT-4 perform very well with a `sample batch size` of 1, while a `sample batch size` of 10
performs significantly worse.
2. F1 Score: This is a combination of precision and recall. The F1 score follows the same pattern as accuracy.
3. Price: This is the cost of running the model, which matters because we might end up using this model in
production. A `sample batch size` of 1 is more expensive than a `sample batch size` of 10.
4. Time: This is the time it takes to generate the labels. Again, this is important if we end up using this
model in production.

As we can see, both GPT-3.5 and GPT-4 perform very well, and a `sample batch size` of 1 performs better than a
`sample batch size` of 10. Even though GPT-4 performs slightly better, we decided to go with GPT-3.5 as it is
much cheaper and much faster.

For training an open-source model, we will use GPT-3.5 to generate the bulk of the labels (120,000 data points).
We will then use GPT-4 to generate labels for an extra 10,000 data points. This way we can see how close we can
get to GPT-4 performance with a smaller model.
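
A minimal sketch of that bulk-labeling step, reusing the illustrative `label_batch` helper from the prompting
section on a hypothetical `unlabeled.csv` file:

```python
import pandas as pd

# Hypothetical file of unlabeled responses; file and column names are illustrative
df = pd.read_csv("unlabeled.csv")  # ~120,000 rows, column: text

# GPT-3.5 labels the bulk of the training data; a batch size of 1 matches the
# best-performing setting from the comparison above (an assumption here)
df["label"] = label_batch(df["text"].tolist(), sample_batch_size=1, model="gpt-3.5-turbo")

df.to_csv("gpt35_labeled.csv", index=False)
```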

# Training a Sentiment Analysis Model
