<div style="text-align: center;" >
<h1 style="margin-top: 0.2em; margin-bottom: 0.1em;">Assignment 3</h1>
<h4 style="margin-top: 0.7em; margin-bottom: 0.3em; font-style:italic">Commit your solutions to GitHub by June 21, 23:59</h4>
</div>
<br>

## Part 1 - Sentiment Evaluation of Twitter and YouTube Data

### Tasks

1. Install packages and load evaluation datasets with Google NLP scores
2. Run VADER over evaluation texts
3. Run BERT over evaluation texts
4. Evaluate against sentiment annotations and compare with Google NLP

### Install requirements

The following cell installs all the dependencies needed for this task.

* [`vaderSentiment`](https://github.com/cjhutto/vaderSentiment) is a Python package for a Parsimonious Rule-based Model for Sentiment Analysis of Social Media Text.
* [`transformers`](https://huggingface.co/) is a Python package for loading and working with transformer models. [Here](https://huggingface.co/docs) is the documentation of `transformers`.
* [`torch`](https://pytorch.org/) is a Python machine learning framework. We need it here because `transformers` uses `torch` internally. [Here](https://pytorch.org/docs/stable/index.html) is the documentation of `torch`.
* [`pandas`](https://pandas.pydata.org/docs/index.html) is a Python package for creating and working with tabular data. [Here](https://pandas.pydata.org/docs/reference/index.html) is the documentation of `pandas`.

In [None]:
! pip install vaderSentiment
! pip install transformers
! pip install torch
! pip install pandas

You may need to restart the kernel after installing the dependencies!

### Import requirements
The cell below imports all necessary dependencies. Make sure they are installed (see cell above).

In [10]:
import pandas as pd
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
from transformers import pipeline

### Exercise 1: Load evaluation datasets and Google NLP scores

#### 1.1 Load datasets
First read the Twitter and YouTube comments CSV files (`Twitter-Sentiment.csv` and `YouTubeComments-Sentiment.csv`) and save each of them in a pandas DataFrame.

In [12]:
df_tw = pd.read_csv('Twitter-Sentiment.csv')
df_tw.head(5)

Unnamed: 0,label,text,googleScore
0,Positive,?RT @justinbiebcr: The bigger the better....if...,0.3
1,Positive,"Listening to the ""New Age"" station on @Slacker...",0.2
2,Neutral,I favorited a YouTube video -- Drake and Josh ...,0.0
3,Positive,i didnt mean knee high I ment in lengt it goes...,0.8
4,Neutral,I wana see the vid Kyan,0.0


In [13]:
df_yt = pd.read_csv('YouTubeComments-Sentiment.csv')
df_yt.head(5)

Unnamed: 0,label,text,googleScore
0,Negative,when the time comes for all to know it will be...,0.1
1,Neutral,@princessofportk The first are a pair of devil...,0.1
2,Neutral,I gotta feeling they partlishly took it off fo...,-0.3
3,Positive,"As we look at ways to be relevant, here is a g...",0.7
4,Neutral,"Not a lot of ""removing"" going on here... bucke...",-0.3


### Exercise 2: Run VADER over evaluation texts *(2 points)*

#### 2.1 Run VADER over the first tweet

In this task you will use VADER for sentiment analysis through the `vaderSentiment` package. First instantiate a `SentimentIntensityAnalyzer`, then call its `polarity_scores` method on the text. Apply this to the first tweet. Is it a good classification?

[Here](https://github.com/cjhutto/vaderSentiment), under 'Code Examples', you can find example code showing how to use this package.

#### 2.2 Run VADER over each text

Now run VADER over all texts in the Twitter and YouTube dataframes. Create a new column called `VADER_compound` in each dataframe and store the `compound` result in it (see the output dictionary of the `polarity_scores` method).

*Important: Make sure `compound` is a float*

#### 2.3 VADER as a classifier

To get the three classes `Positive`, `Negative` and `Neutral`, we use the compound score with the following thresholds:

* `compound > 0.5`: `"Positive"`
* `compound < -0.5`: `"Negative"`
* `else`: `"Neutral"`

Create a new column called `VADER_class` which contains the three computed classes.
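
The threshold rule above can be sketched as a small helper. This is only an illustration, not the reference solution: the function name `score_to_class` and its `threshold` parameter are hypothetical, and the sample scores are made up.

```python
def score_to_class(score: float, threshold: float = 0.5) -> str:
    """Map a sentiment score in [-1, 1] to one of the three class labels."""
    if score > threshold:
        return "Positive"
    if score < -threshold:
        return "Negative"
    return "Neutral"


# Made-up compound scores, for illustration only
example_scores = [0.81, -0.62, 0.5, 0.0]
example_classes = [score_to_class(s) for s in example_scores]
print(example_classes)
```

Note that a score exactly on the threshold falls into `"Neutral"`, because both comparisons are strict. With a pandas column, the same helper could be applied through `Series.apply`, for example `df_tw['VADER_compound'].apply(score_to_class)`.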

### Exercise 3: Use a BERT based model for sentiment analysis *(2 points)*

#### 3.1 BERT
BERT (Bidirectional Encoder Representations from Transformers) is a machine learning technique for natural language processing. Pretrained models are already available through the `transformers` package. You can look [here](https://huggingface.co/models?sort=downloads&search=sentiment) and choose a model for the next tasks. We suggest [this model](https://huggingface.co/cardiffnlp/twitter-roberta-base-sentiment-latest) (`"cardiffnlp/twitter-roberta-base-sentiment-latest"`), but you can use any available model as long as it is suitable for sentiment analysis.

First create a `pipeline` and set your model with the `model` keyword argument. You can then call the pipeline with the text that should be classified. [Here](https://huggingface.co/blog/sentiment-analysis-python#2-how-to-use-pre-trained-sentiment-analysis-models-with-python) is a tutorial on how to use it.

As before, save the classes in a new column `BERT_class`. A call to your pipeline returns a list with one dictionary per input text; the `label` key of each dictionary already contains the `positive`, `negative` or `neutral` class. Be aware that the label names depend on the model you choose and might differ from the labels in the dataset. If that is the case, rename them to match the target labels.

***Hint: The classification of the entire sample can take a couple of minutes. Make sure to save the labeled dataset in a csv file so that you don't need to rerun the classification the next time you run your notebook.***

In [None]:
# Hint -> loading RoBERTa as a pipeline
sentiment_pipeline = pipeline(model="cardiffnlp/twitter-roberta-base-sentiment-latest", tokenizer="cardiffnlp/twitter-roberta-base-sentiment-latest")

In [None]:
# Hint -> using a pipeline for classification
sentiment_pipeline('Today is a great day!')
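
The sketch below shows one way to rename pipeline output to the dataset's label format. It assumes the output is a list with one `label`/`score` dictionary per text; the `raw_output` values are invented for illustration, and the lowercase label names are an assumption about the suggested model, so check what your model actually returns. `LABEL_MAP` and `normalize_labels` are hypothetical names.

```python
# Hypothetical pipeline output for two texts (values invented for illustration)
raw_output = [
    {"label": "positive", "score": 0.98},
    {"label": "neutral", "score": 0.61},
]

# Assumed mapping from model labels to the dataset's labels; adapt it to your model
LABEL_MAP = {"positive": "Positive", "negative": "Negative", "neutral": "Neutral"}


def normalize_labels(results):
    """Extract the predicted label of each result and rename it to the dataset format."""
    return [LABEL_MAP[r["label"].lower()] for r in results]


bert_classes = normalize_labels(raw_output)
print(bert_classes)
```

An unknown label raises a `KeyError` here, which makes a mismatch between model and mapping visible immediately instead of silently producing wrong classes.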

### Exercise 4: Evaluate against sentiment annotations and compare with Google NLP *(4 points)*

#### 4.1 Convert GoogleNLP scores to classes

As with VADER and BERT, compute classes from the GoogleNLP score, which is given in the column `googleScore`. For this, use the following thresholds:

* `googleScore > 0.3`: `"Positive"`
* `googleScore < -0.3`: `"Negative"`
* `else`: `"Neutral"`

Save the classes in a new column named `GoogleNLP_class`.


#### 4.2 Evaluate on Twitter and YouTube

First, calculate the accuracy of all three classifiers on the Twitter and YouTube data and print the results.

Next, calculate the precision of the `"Positive"` class for the Twitter and YouTube data.
This is calculated as follows:
$
\begin{align}
    precision = \frac{TP}{TP + FP}
\end{align}
$
*Note: Here the positive samples are the ones with the class `"Positive"`*

Now calculate the recall score. This is done by:
$
\begin{align}
    recall = \frac{TP}{TP + FN}
\end{align}
$
*Note: Here the positive samples are the ones with the class `"Positive"`*

Now calculate the recall and the precision for the negative class as well. The precision is calculated as:
$
\begin{align}
    precision = \frac{TP}{TP + FP}
\end{align}
$
*Note: Here the positive samples are the ones with the class `"Negative"`*

And the recall is calculated as:
$
\begin{align}
    recall = \frac{TP}{TP + FN}
\end{align}
$
*Note: Here the positive samples are the ones with the class `"Negative"`*

Finally, calculate the [F1 score](https://towardsdatascience.com/the-f1-score-bec2bbc38aa6) of the positive and negative class for each classifier and dataset. The F1 score is calculated as:

$
\begin{align}
    F_1 = 2 * \frac{precision * recall}{precision + recall}
\end{align}
$
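
The formulas above can be checked on a toy example. The following is a minimal sketch in plain Python, not the reference solution: the function names `class_metrics` and `accuracy` are hypothetical, the labels are invented, and returning `0.0` when a denominator is zero is a convention chosen here.

```python
def class_metrics(y_true, y_pred, target):
    """Precision, recall and F1 for one class, treating `target` as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == target and p == target)
    fp = sum(1 for t, p in pairs if t != target and p == target)
    fn = sum(1 for t, p in pairs if t == target and p != target)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def accuracy(y_true, y_pred):
    """Share of samples whose predicted class equals the annotated class."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy labels, invented for illustration
y_true = ["Positive", "Negative", "Neutral", "Positive", "Negative"]
y_pred = ["Positive", "Neutral", "Neutral", "Negative", "Negative"]

print(accuracy(y_true, y_pred))
print(class_metrics(y_true, y_pred, "Positive"))
print(class_metrics(y_true, y_pred, "Negative"))
```

In this toy data the `"Positive"` class has one true positive, no false positives and one false negative, so its precision is 1.0 while its recall is only 0.5.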

### Exercise 5: Comparison *(2 points)*
* What was the best performing method for YouTube? Did that fit your expectations?
* What was the best performing method for Twitter? Did that fit your expectations?
* Do you observe any differences between the prediction of positive and negative sentiment? What is the role of the imbalance between positive and negative classes in the calculation of accuracy?


## Part 2 - Emotion Detection

### Exercise 6 *(4 points)*

In the following exercise you will use the emotion classification model [LEIA](https://huggingface.co/LEIA/LEIA-base) to classify the emotion of the sentences in the [enISEAR dataset](https://www.romanklinger.de/data-sets/). You can read more about the `LEIA-base` model in the [documentation](https://huggingface.co/LEIA/LEIA-base) and learn about the implementation details from this [paper](https://arxiv.org/abs/2304.10973).

#### 6.1 LEIA introduction
* Load the `LEIA-base` model and tokenizer, either as a [pipeline](https://huggingface.co/docs/transformers/main_classes/pipelines) or by loading the model and the tokenizer [directly](https://huggingface.co/docs/transformers/autoclass_tutorial) and implementing the classification steps yourself. LEIA only accepts sentences with up to 128 tokens. Make sure that your tokenizer [truncates](https://huggingface.co/docs/transformers/pad_truncation) longer sentences to this length to avoid errors.
* What are the possible labels the model can predict?
* Input the sentence `Today is a great day.` to the model, and predict the emotion of the sentence.

#### 6.2 enISEAR dataset
* Load the enISEAR dataset.
* What are the possible labels in the dataset? (the `Prior_Emotion` column stores the actual label)
* The last 7 columns store the number of annotators who chose the given emotion (e.g. if you have the value 3 in the column 'Anger', this means that 3 annotators believed that the sentence in the row expresses Anger). Create a new column `Annotator_Majority_Label`, which stores the emotion with the highest annotator score (i.e. the emotion the highest number of annotators chose for the given sentence).
* What percent of the sentences were correctly classified by the (majority vote of the) annotators?
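
One possible way to derive the majority label is sketched below in plain Python. The helper name `majority_label` is hypothetical, the two sample rows reuse the annotator counts of rows 0 and 4 from the dataset preview, and breaking ties in favour of the first emotion in the list is an assumption made here, not a rule from the dataset.

```python
EMOTIONS = ["Anger", "Disgust", "Fear", "Guilt", "Joy", "Sadness", "Shame"]

# Annotator counts of two rows from the dataset preview
rows = [
    {"Prior_Emotion": "Fear", "Anger": 0, "Disgust": 0, "Fear": 0,
     "Guilt": 1, "Joy": 0, "Sadness": 3, "Shame": 1},
    {"Prior_Emotion": "Sadness", "Anger": 0, "Disgust": 0, "Fear": 0,
     "Guilt": 0, "Joy": 0, "Sadness": 5, "Shame": 0},
]


def majority_label(row):
    """Emotion with the highest annotator count; ties go to the first emotion in EMOTIONS."""
    return max(EMOTIONS, key=lambda emotion: row[emotion])


for row in rows:
    row["Annotator_Majority_Label"] = majority_label(row)

# Share of rows where the annotator majority matches the actual label
agreement = sum(r["Annotator_Majority_Label"] == r["Prior_Emotion"] for r in rows) / len(rows)
print([r["Annotator_Majority_Label"] for r in rows], f"{agreement:.0%}")
```

On a full pandas DataFrame, `DataFrame.idxmax(axis=1)` restricted to the seven emotion columns should give the same row-wise result, including the first-column tie behaviour.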

In [16]:
df_isear = pd.read_csv('enISEAR.tsv', sep='\t')
df_isear.head(5)

Unnamed: 0,Sentence_id,Prior_Emotion,Sentence,Temporal_Distance,Intensity,Duration,Gender,City,Country,Worker_id,Time,Anger,Disgust,Fear,Guilt,Joy,Sadness,Shame
0,271,Fear,"I felt ... when my 2 year old broke her leg, a...",Y,Vi,Dom,Ml,Bristol,GBR,87,11/28/2018 00:58:52,0,0,0,1,0,3,1
1,597,Shame,I felt ... one Christmas as one of our patient...,Y,I,Dom,Fl,Dulwich,GBR,86,11/26/2018 06:52:02,1,0,0,4,0,0,0
2,282,Guilt,I felt ... because I could not help a friend w...,M,Mi,Dom,Fl,Linlithgow,GBR,83,11/21/2018 18:45:00,0,0,0,4,0,1,0
3,171,Disgust,I felt ... when I read that hunters had killed...,Y,Mi,H,Ml,Bristol,GBR,87,11/28/2018 00:55:11,3,0,0,0,0,2,0
4,509,Sadness,I felt ... when my Gran passed away.,Y,Vi,Dom,Fl,Stoke-on-trent,GBR,92,11/26/2018 09:23:38,0,0,0,0,0,5,0


#### 6.3 Classification
* Drop the rows from the enISEAR dataset where `Prior_Emotion` is not one of `Fear`, `Sadness`, `Anger` or `Joy`
* Use LEIA to classify the emotion of each remaining sentence in the dataset, and add a column `Leia_Label` to store the predicted classes
* Now remove `I felt ... ` from the beginning of each sentence and rerun the classification. Store your results in a column named `Leia_Label_Clean`
* Where the model predicted `Happiness` or `Affection`, change the prediction to `Joy` to match the dataset's labels (for both columns -> `Leia_Label` and `Leia_Label_Clean`)
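
The two text and label clean-up steps above can be sketched as follows. This is an illustration under assumptions: the helper names `strip_prefix` and `align_label` are hypothetical, and the code assumes every sentence starts with exactly the string `I felt ... ` (sentences with a different spelling of the template are left unchanged).

```python
PREFIX = "I felt ... "


def strip_prefix(sentence: str) -> str:
    """Remove the leading 'I felt ... ' template if the sentence starts with it."""
    if sentence.startswith(PREFIX):
        return sentence[len(PREFIX):]
    return sentence


# Model labels that should count as the dataset label 'Joy'
TO_JOY = {"Happiness", "Affection"}


def align_label(predicted: str) -> str:
    """Rename Happiness/Affection predictions to Joy; keep every other label unchanged."""
    return "Joy" if predicted in TO_JOY else predicted


cleaned = strip_prefix("I felt ... when my Gran passed away.")
aligned = [align_label(label) for label in ["Happiness", "Affection", "Fear"]]
print(cleaned)
print(aligned)
```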

#### 6.4 Analysis
* Compare the performance of the two approaches with each other and with the performance of the human majority, using the metrics introduced in Part 1 (accuracy, precision, recall, F1 score) or other metrics you find interesting. Create informative visualizations to aid the comparison.
* Discuss your results. 
* Are the models accurately predicting human emotions?
* Which approach seems to work better? Why?
* What kind of other/additional preprocessing could we perform to improve the model's predictions?

### Exercise 7 *(6 points)*

#### 7.1 Data annotation
* In the following exercise you will test emotion detection methods on data from [Vent](https://www.vent.co/), a website where users talk about their feelings.
* On GitHub, in your `a03` folder, you can find 3 files. First open `sample_for_labeling.csv` and label each row according to the emotion the sentence expresses. The possible classes are: 0 (Sadness), 1 (Affection), 2 (Fear), 3 (Happiness), 4 (Anger). ***Important: Make sure to upload the labeled data with your submission.***
* After you have finished labeling the data, load it as a pandas dataframe. Also load `sample_with_labels.csv` as a dataframe, which contains the actual labels of the data.
* Merge the two dataframes, and rename the column containing your labels as `label_human`.
* Rename the class ids (0, 1, 2, ...) stored in the `label`, and `label_human` columns to the class names (Sadness, Affection, ...).
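
Renaming the class ids comes down to a lookup table. The sketch below uses the id-to-name pairs from the task description; the name `ID_TO_EMOTION` and the sample ids are made up for illustration.

```python
# Class ids and names as listed in the task description
ID_TO_EMOTION = {0: "Sadness", 1: "Affection", 2: "Fear", 3: "Happiness", 4: "Anger"}

# Invented label ids, for illustration only
label_ids = [3, 4, 2, 1, 0]
label_names = [ID_TO_EMOTION[i] for i in label_ids]
print(label_names)
```

With pandas, passing the same dictionary to `Series.map` (for example `df['label'].map(ID_TO_EMOTION)`) should perform this renaming on a whole column.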

#### 7.2 LEIA
* Use the [LEIA](https://huggingface.co/LEIA/LEIA-base) model introduced in the previous exercise to classify the sentences and store the results in a column named `label_leia`.

In [1]:
import pandas as pd

In [4]:
df = pd.read_csv('sample_with_labels.csv')

In [5]:
df

Unnamed: 0,text,label
0,We are getting new tenants in our back house a...,3
1,2nd day working in Zara. OMFG I love the store.,3
2,If I'm this much of a burden then maybe I shou...,4
3,This weekend is stressing me out. I have to be...,2
4,Like what are you talking about! You were the ...,4
5,Ok but can you just go the fuck to sleep? I do...,2
6,I HATE THEM SO MUCH Thank god this is the last...,4
7,I have a math project and two extra credit pap...,2
8,I hate boys as much as I love them. 😁😘,1
9,May have just lost one of my best friends... f...,2


#### 7.3 OpenAI models
* In the following exercise we will work with the [OpenAI API](https://platform.openai.com/docs/api-reference), an API which gives you access to very powerful AI models. Your task will be to use a [text completion](https://platform.openai.com/docs/api-reference/completions/) or a [chat completion](https://platform.openai.com/docs/guides/chat/introduction) model to classify the sentences in `sample_with_labels.csv` according to their emotion (Sadness, Affection, Fear, Happiness, Anger).
* You can sign up for the API by providing a phone number, and get 5 USD of free credits, which should be more than enough to complete this exercise.
* If you use your own account, you can use any text completion/chat completion model to complete the exercise. You are also allowed to [fine-tune](https://platform.openai.com/docs/api-reference/fine-tunes) one of the text completion models and use it for classification.
* If you can't or don't want to sign up for the API, we created an API wrapper through which you can use the `gpt-3.5-turbo` model. We set a limit of 1_000_000 tokens per API key, which should be more than enough to complete the task, so feel free to experiment to find the best prompt.
* In your `a03` folder you can find a file named `api_key.txt`, which stores the API key you need for the API wrapper. ***Important: this is not a real OpenAI API key; we use it internally in the API wrapper to monitor usage.***
* To set up the API wrapper with the openai module, run the following code:
```python
import openai
openai.api_base = 'https://smdapi-1-a6938250.deta.app'
your_api_key = '' # your key from api_key.txt
openai.api_key = your_api_key
```
* After you run this setup, you can use the `openai.ChatCompletion.create` function just as you would with the regular openai module ([documentation](https://platform.openai.com/docs/guides/chat)).
* Store the results in a column named `label_gpt`.
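
A classification prompt and a tolerant parser for the model's reply could look like the sketch below. It makes no API call. The helper names `build_messages` and `parse_label`, the prompt wording, and the `"Unknown"` fallback are all assumptions for illustration; finding a good prompt is part of the exercise.

```python
EMOTIONS = ["Sadness", "Affection", "Fear", "Happiness", "Anger"]


def build_messages(text: str) -> list:
    """Build a chat-completion message list asking for exactly one emotion label."""
    instruction = (
        "Classify the emotion of the following text. "
        "Answer with exactly one word from this list: " + ", ".join(EMOTIONS) + "."
    )
    return [
        {"role": "system", "content": instruction},
        {"role": "user", "content": text},
    ]


def parse_label(reply: str, default: str = "Unknown") -> str:
    """Return the known emotion mentioned earliest in the model reply, else `default`."""
    lowered = reply.lower()
    matches = [(lowered.find(e.lower()), e) for e in EMOTIONS if e.lower() in lowered]
    return min(matches)[1] if matches else default


messages = build_messages("2nd day working in Zara. OMFG I love the store.")
print(parse_label("Happiness."))
print(parse_label("I think this is anger"))
```

The list returned by `build_messages` is meant to be passed as the `messages` argument of `openai.ChatCompletion.create`, and the reply text to parse is the `content` field of the first entry in `choices`.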

In [1]:
# install requirements
!pip install openai

In [1]:
import openai
import dotenv
import os

dotenv.load_dotenv();

In [2]:
openai.api_base

'https://api.openai.com/v1'

In [3]:
openai.api_base = 'https://smdapi-1-a6938250.deta.app'
your_api_key = os.environ.get('API_KEY') # your key from api_key.txt, read here from the API_KEY environment variable
openai.api_key = your_api_key

In [8]:
# hint: chat completion example from https://platform.openai.com/docs/guides/chat/introduction
res = openai.ChatCompletion.create(
  model="gpt-3.5-turbo",
  messages=[
      {"role": "user", "content": "how many hours are there in a day?"},
  ],
)

In [7]:
res

<OpenAIObject chat.completion id=chatcmpl-7VgUBekPsForacAqtUJr5dPaORlDL at 0x7fa63761db20> JSON: {
  "choices": [
    {
      "finish_reason": "stop",
      "index": 0,
      "message": {
        "content": "There are 24 hours in a day.",
        "role": "assistant"
      }
    }
  ],
  "created": 1687785839,
  "id": "chatcmpl-7VgUBekPsForacAqtUJr5dPaORlDL",
  "model": "gpt-3.5-turbo-0301",
  "object": "chat.completion",
  "usage": {
    "completion_tokens": 9,
    "prompt_tokens": 17,
    "total_tokens": 26
  }
}

#### 7.4 Comparison
* Compare the performance of the two models with each other and with the quality of your annotation, using the metrics introduced in Part 1 (accuracy, precision, recall, F1 score) or other metrics you find interesting. Create informative visualizations to aid the comparison.
* Discuss your results. 
* Are the models accurately predicting human emotions?
* Which approach seems to work better? Why?