# Sentiment Evaluation of Twitter and YouTube Data
## Tasks

1. Install packages and load evaluation datasets with Google NLP scores
2. Run VADER over evaluation texts
3. Run BERT over evaluation texts
4. Evaluate against sentiment annotations and compare with Google NLP

### Install requirements. 

The following cell contains all the necessary dependencies needed for this task. If you run the cell everything will be installed. 

* [`vaderSentiment`](https://github.com/cjhutto/vaderSentiment) is a Python package for a Parsimonious Rule-based Model for Sentiment Analysis of Social Media Text.
* [`transformers`](https://huggingface.co/) is a Python package for creating and working with transformers. [Here](https://huggingface.co/docs) is the documentation of `transformers`.
* [`torch`](https://pytorch.org/) is a Python machine learning framework. We need this here for `transformers` since this package uses internally `torch`. [Here](https://pytorch.org/docs/stable/index.html) is the documentation of `torch`.
* [`pandas`](https://pandas.pydata.org/docs/index.html) is a Python package for creating and working with tabular data. [Here](https://pandas.pydata.org/docs/reference/index.html) is the documentation of `pandas`.

In [1]:
! pip install vaderSentiment
! pip install transformers sentencepiece
! pip install torch torchvision torchaudio
! pip install pandas
# certs for huggingface.co on windows python
! pip install python-certifi-win32



You may need to restart the Kernel after installing the dependencies!

### Import requirements
The cell below imports all necessary dependancies. Make sure they are installed (see cell above).

In [2]:
import pandas as pd
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
from transformers import pipeline

# 1. Load evaluation datasets and Google NLP scores

## 1.1 Load datasets
First read the Twitter and Youtube Comments CSV files (`Twitter-Sentiment.csv` and `YouTubeComments-Sentiment.csv`) and save them in a pandas Dataframe.

In [3]:
# Read Twitter data
twitter_data = pd.read_csv("Twitter-Sentiment.csv")
# print(twitter_data)

# Read Youtube data
youtube_data = pd.read_csv("YouTubeComments-Sentiment.csv")
# print(youtube_data)

# 2. Run VADER over evaluation texts

## 2.1 Run VADER over the first tweet

In this task you should use VADER for sentiment analysis. For this we use the `vaderSentiment` package. You first have to intatiate a new `SentimentIntensityAnalyzer` and use the `polarity_scores` method of it for the analysis. Apply this for the first tweet. Is it a good classification?

[Here](https://github.com/cjhutto/vaderSentiment) under 'Code Examples' you can find some example code how to use this package.

In [4]:
#  Intatiate a new SentimentIntensityAnalyzer
vader = SentimentIntensityAnalyzer()

# Cassify first tweet and print
first_tweet = twitter_data["text"][0]
first_tweet_classification = vader.polarity_scores(first_tweet)
first_tweet_label = twitter_data["label"][0]

print(f"First Tweet: {first_tweet}\n")
print(f"Classification first Tweet: {first_tweet_classification}\n")
print(f"Label of First Tweet: {first_tweet_label}\n")

First Tweet: ?RT @justinbiebcr: The bigger the better....if you know what I mean ;)

Classification first Tweet: {'neg': 0.0, 'neu': 0.853, 'pos': 0.147, 'compound': 0.2263}

Label of First Tweet: Positive



The analyzed tweet is predominantly neutral (neu: 0.853) but leans slightly positive overall (compound: 0.2263 and pos: 0.147).
There’s no detectable negativity (neg: 0.0), so the tone of the tweet is likely neutral to mildly positive which somehow corresponds to its label.

The classification is reasonable but not perfect. VADER captures the neutral structure and mild positivity but misses the playful, suggestive tone implied by the wink emoji and double entendre. It also overlooks the broader context and cultural nuances, such as the implied humor in "if you know what I mean." While suitable for general analysis, it lacks the sophistication to interpret subtle humor, innuendo, or contextual cues in tweets like this.

## 2.2 Run VADER over each text

Now use VADER for all the text data of the Twitter and the Youtube dataframe. Create a new column in the dataframes called `VADER_compound` where you save the `compound` result (look at the output dictonary of the `polarity_scores` method).

*Important: Make sure `compound` is a float*

If this runs slow on your computer you can use the precomputed values in the provided CSV files which are present in the column `VADER_compund_precomputed` for further tasks.

In [5]:
# Using VADER for sentiment analysis of twitter data
vader = SentimentIntensityAnalyzer()
twitter_data["VADER_compound"] = 0.0

#for i in range(10):
for i in range(len(twitter_data["text"])):
    # use polarity_scores method to get the sentiment scores
    sentiment_dict = vader.polarity_scores(twitter_data["text"][i])
    # Save the compound result as float in the dataset. 
    # Notice: .loc is way slower here.... but worked for us ;)
    twitter_data.loc[i, "VADER_compound"] = sentiment_dict["compound"]

# Test against precomputed values
match_VADER_twitter = twitter_data[twitter_data["VADER_compound"] == twitter_data["VADER_compound_precomputed"]].shape[0] 
print(f"Match Vader Compound vs Precomputed (Twitter):{(match_VADER_twitter / len(twitter_data.index))* 100}%")

Match Vader Compound vs Precomputed (Twitter):100.0%


In [6]:
# Using VADER for sentiment analysis of YouTube data
vader = SentimentIntensityAnalyzer()
youtube_data["VADER_compound"] = 0.0

#for i in range(10):
for i in range(len(youtube_data["text"])):
    # use polarity_scores method to get the sentiment scores
    sentiment_dict = vader.polarity_scores(youtube_data["text"][i])
    # Save the compound result as float in the dataset. 
    # Notice: .loc is way slower here.... but worked for us ;)
    youtube_data.loc[i, "VADER_compound"] = sentiment_dict["compound"]

# Test against precomputed values
match_VADER_youtube = youtube_data[youtube_data["VADER_compound"] == youtube_data["VADER_compound_precomputed"]].shape[0] 
print(f"Match Vader Compound vs Precomputed (YouTube):{(match_VADER_youtube / len(youtube_data.index))* 100}%")

Match Vader Compound vs Precomputed (YouTube):100.0%


## 2.3 VADER as a classifier

To get the three Classes `Positive`, `Negative` and `Neutral` we use the compound score with the following thresholds:

* `compound > 0.5`: `"Positive"`
* `compound < -0.5`: `"Negative"`
* `else`: `"Neutral"`

Create a new column called `VADER_class` which contains the three computed classes.

In [7]:
# Create new column for computed classes
twitter_data["VADER_class"] = "Neutral"
youtube_data["VADER_class"] = "Neutral"

# Classify Twitter Data
twitter_data.loc[twitter_data["VADER_compound"] > 0.5, "VADER_class"] = "Positive"
twitter_data.loc[twitter_data["VADER_compound"] < 0.5, "VADER_class"] = "Negative"

# Classify YouTube Data
youtube_data.loc[youtube_data["VADER_compound"] > 0.5, "VADER_class"] = "Positive"
youtube_data.loc[youtube_data["VADER_compound"] < 0.5, "VADER_class"] = "Negative"

# 3. Use a BERT based model for sentiment analysis

## 3.1 BERT
BERT (Bidirectional Encoder Representation from Transformers) is a machine learning technique for natural language processing. There are already pretrained models available in the `transformers` package. You can look [here](https://huggingface.co/models?sort=downloads&search=sentiment) and choose a model for the next tasks. (We suggest [this](https://huggingface.co/cardiffnlp/twitter-roberta-base-sentiment-latest) (`"cardiffnlp/twitter-roberta-base-sentiment-latest"`) model, but you can use any available, just make sure it is suitable for sentiment analysis).

First create a `pipeline` where you set your model by the `model` keyword argument. You can then use this method to pass text which should be classified. [Here](https://huggingface.co/blog/sentiment-analysis-python#2-how-to-use-pre-trained-sentiment-analysis-models-with-python) is a tutorial how to use this.

As before save the classes in a new row 'BERT_class'. The call to your pipeline returns a dictionary where there is a key `label` which contains already the `Positive`, `Negative` or `Neutral` class (Be aware that this is based on the model you choose, sometimes these classes are named differently so you have to rename them by hand, this is not the case if you use the suggested model).

Based on you computer this may take some time, if it is too slow for you, you can again use the precomputed classes `'BERT_class_precomputed'` in the CSV Files for further tasks.

In [9]:
# Using BERT-Base-Uncased model for sentiment analysis
# sentiment_pipeline = pipeline(model=f"cardiffnlp/twitter-roberta-base-sentiment-latest")

It was not possible for us to use any models from HuggingFace for two reasons:
 - Private Notebooks have no GPU and running on CPU would take for hours
 - Availabel notebook with GPU (working/company notebook) had problems downloading models from huggingface.co directly due to firewall restrions (http requests are filter with company proxy, missing e-tags)

In [24]:
# Create new column for computed BERT classes
twitter_data["BERT_class"] = "Neutral"
youtube_data["BERT_class"] = "Neutral"

# twitter_data
for i in range(10):
    # use the sentiment_pipeline to get the sentiment scores
    sentiment_dict = sentiment_pipeline(...)
    # Save the class result as string in the dataset. Notice: .loc is way slower here....
    twitter_data.loc[i, 'BERT_class'] = ...

# 4. Evaluate against sentiment annotations and compare with Google NLP

## 4.1 Convert GoogleNLP scores to classes

As with VADER and BERT, compute classes from the GoogleNLP score, which is given in the column `googleScore`. For this use following thresholds:

* `googleScore > 0.3`: `"Positive"`
* `googleScore < -0.3`: `"Negativ"`
* `else`: `"Neutral"`

Save the classes in a new column named `GoogleNLP_class`.


In [9]:
# Create new column for Google NLP classes
twitter_data["GoogleNLP_class"] = "Neutral"
youtube_data["GoogleNLP_class"] = "Neutral"

# Classify Twitter Data
twitter_data.loc[twitter_data["googleScore"] > 0.3, "GoogleNLP_class"] = "Positive"
twitter_data.loc[twitter_data["googleScore"] < -0.3, "GoogleNLP_class"] = "Negative"

# Classify YouTube Data
youtube_data.loc[youtube_data["googleScore"] > 0.3, "GoogleNLP_class"] = "Positive"
youtube_data.loc[youtube_data["googleScore"] < -0.3, "GoogleNLP_class"] = "Negative"

# print(youtube_data)
# print(twitter_data)

## 4.2 Evaluate on Twitter

First, let's calculate the accuracy for all three classifiers on the Twitter and Youtube data, print the results.

### Accuracy Formula
$$
\text{Accuracy} = \frac{\text{Number of Correct Predictions}}{\text{Total Number of Samples}}
$$

In [10]:
# Define Function to Calculate Accuracy
def calculateAccuracy(dataset: pd.DataFrame, column_name_label: str, column_name_prediction: str) -> float:
    
    # Get Total number of Samples
    Total_Number_of_Samples = len(dataset.index)
    print(f"Total Number of Samples: {Total_Number_of_Samples}")
    
    # Get number of Correct Predictions
    Number_of_Correct_Predictions = dataset[dataset[column_name_prediction]==dataset[column_name_label]].shape[0]
    print(f"Number of Correct Predictions {column_name_prediction}: {Number_of_Correct_Predictions}")
    
    # Calculate Accuracy
    accuracy = Number_of_Correct_Predictions / Total_Number_of_Samples
    return accuracy

In [11]:
print("ACCURACY TWITTER DATA:\n")

accuracy_VADER_on_twitter = calculateAccuracy(twitter_data, "label", "VADER_class")
print(f"Accuracy of VADER on Twitter Samples: {accuracy_VADER_on_twitter} meaning {accuracy_VADER_on_twitter*100:.2f}%\n")

accuracy_BERT_on_twitter = calculateAccuracy(twitter_data, "label", "BERT_class_precomputed")
print(f"Accuracy of BERT on Twitter Samples: {accuracy_BERT_on_twitter} meaning {accuracy_BERT_on_twitter*100:.2f}%\n")

accuracy_GoogleNLP_on_twitter = calculateAccuracy(twitter_data, "label", "GoogleNLP_class")
print(f"Accuracy of Google NLP on Twitter Samples: {accuracy_GoogleNLP_on_twitter} meaning {accuracy_GoogleNLP_on_twitter*100:.2f}%\n")

ACCURACY TWITTER DATA:

Total Number of Samples: 4209
Number of Correct Predictions VADER_class: 758
Accuracy of VADER on Twitter Samples: 0.18009028272748873 meaning 18.01%

Total Number of Samples: 4209
Number of Correct Predictions BERT_class_precomputed: 2672
Accuracy of BERT on Twitter Samples: 0.6348301259206462 meaning 63.48%

Total Number of Samples: 4209
Number of Correct Predictions GoogleNLP_class: 2825
Accuracy of Google NLP on Twitter Samples: 0.6711808030411024 meaning 67.12%



In [12]:
print("ACCURACY YOUTUBE DATA:\n")

accuracy_VADER_on_youtube = calculateAccuracy(youtube_data, "label", "VADER_class")
print(f"Accuracy of VADER on YouTube Samples: {accuracy_VADER_on_youtube} meaning {accuracy_VADER_on_youtube*100:.2f}%\n")

accuracy_BERT_on_youtube = calculateAccuracy(youtube_data, "label", "BERT_class_precomputed")
print(f"Accuracy of BERT on YouTube Samples: {accuracy_BERT_on_youtube} meaning {accuracy_BERT_on_youtube*100:.2f}%\n")

accuracy_GoogleNLP_on_youtube = calculateAccuracy(youtube_data, "label", "GoogleNLP_class")
print(f"Accuracy of Google NLP on YouTube Samples: {accuracy_GoogleNLP_on_youtube} meaning {accuracy_GoogleNLP_on_youtube*100:.2f}%\n")

ACCURACY YOUTUBE DATA:

Total Number of Samples: 3293
Number of Correct Predictions VADER_class: 1374
Accuracy of VADER on YouTube Samples: 0.41724870938354086 meaning 41.72%

Total Number of Samples: 3293
Number of Correct Predictions BERT_class_precomputed: 2448
Accuracy of BERT on YouTube Samples: 0.7433950804737322 meaning 74.34%

Total Number of Samples: 3293
Number of Correct Predictions GoogleNLP_class: 2172
Accuracy of Google NLP on YouTube Samples: 0.6595809292438506 meaning 65.96%



Next calculate the precision of the `"Positive"` class for the Twitter and Youtube data.
This is calculated as follows:
$
\begin{align}
    precision = \frac{TP}{TP + FP}
\end{align}
$
*Note: Here the Positive samples are the one with the the class `"Positive"`*

**True Positive (TP):** Observations belonging to the considered class are predicted as the correct class.


In [13]:
# Define Function to Calculate True Positive (TP)
def calculateTP(dataset: pd.DataFrame, column_name_label: str, column_name_prediction: str, class_value: str) -> int:
    TP = dataset[(dataset[column_name_label] == class_value) & (dataset[column_name_prediction] == class_value)].shape[0]
    return TP

**False Positive (FP):** Observations not belonging to the considered class are incorrectly predicted as the considered class.

In [14]:
# Define Function to Calculate False Positive (FP)
def calculateFP(dataset: pd.DataFrame, column_name_label: str, column_name_prediction: str, class_value: str) -> int:
    FP = dataset[(dataset[column_name_label] != class_value) & (dataset[column_name_prediction] == class_value)].shape[0]
    return FP

$
\begin{align}
    Precision = \frac{TP}{TP + FP}
\end{align}
$

In [15]:
# Define Function to Calculate Precision
def calculatePrecision(dataset: pd.DataFrame, column_name_label: str, column_name_prediction: str, class_value: str) -> float:
    TP = calculateTP(dataset, column_name_label, column_name_prediction, class_value)
    print(f"True Positive of {column_name_prediction} considering Class \"{class_value}\" = {TP}")
    FP = calculateFP(dataset, column_name_label, column_name_prediction, class_value)
    print(f"False Positive of {column_name_prediction} considering Class \"{class_value}\" = {FP}")
    precision = TP / (TP + FP)
    return precision

In [16]:
# Calculate Precision on Twitter Data for class "Positive"
print("PRECISION TWITTER DATA (Class = \"Positive\"):\n")

precision_VADER_on_twitter_data = calculatePrecision(twitter_data, "label", "VADER_class", "Positive")
print(f"Precision VADER on Twitter Samples: {precision_VADER_on_twitter_data} meaning {precision_VADER_on_twitter_data*100:.2f}%\n")

precision_BERT_on_twitter_data = calculatePrecision(twitter_data, "label", "BERT_class_precomputed", "Positive")
print(f"Precision BERT on Twitter Samples: {precision_BERT_on_twitter_data} meaning {precision_BERT_on_twitter_data*100:.2f}%\n")

precision_GoogleNLP_on_twitter_data = calculatePrecision(twitter_data, "label", "GoogleNLP_class", "Positive")
print(f"Precision Google NLP on Twitter Samples: {precision_GoogleNLP_on_twitter_data} meaning {precision_GoogleNLP_on_twitter_data*100:.2f}%\n")

PRECISION TWITTER DATA (Class = "Positive"):

True Positive of VADER_class considering Class "Positive" = 427
False Positive of VADER_class considering Class "Positive" = 774
Precision VADER on Twitter Samples: 0.3555370524562864 meaning 35.55%

True Positive of BERT_class_precomputed considering Class "Positive" = 537
False Positive of BERT_class_precomputed considering Class "Positive" = 964
Precision BERT on Twitter Samples: 0.357761492338441 meaning 35.78%

True Positive of GoogleNLP_class considering Class "Positive" = 328
False Positive of GoogleNLP_class considering Class "Positive" = 651
Precision Google NLP on Twitter Samples: 0.3350357507660878 meaning 33.50%



In [17]:
# Calculate Precision on YouTube Data for class "Positive"
print("PRECISION YOUTUBE DATA (Class = \"Positive\"):\n")

precision_VADER_on_youtube_data = calculatePrecision(youtube_data, "label", "VADER_class", "Positive")
print(f"Precision VADER on YouTube Samples: {precision_VADER_on_youtube_data} meaning {precision_VADER_on_youtube_data*100:.2f}%\n")

precision_BERT_on_youtube_data = calculatePrecision(youtube_data, "label", "BERT_class_precomputed", "Positive")
print(f"Precision BERT on YouTube Samples: {precision_BERT_on_youtube_data} meaning {precision_BERT_on_youtube_data*100:.2f}%\n")

precision_GoogleNLP_on_youtube_data = calculatePrecision(youtube_data, "label", "GoogleNLP_class", "Positive")
print(f"Precision Google NLP on YouTube Samples: {precision_GoogleNLP_on_youtube_data} meaning {precision_GoogleNLP_on_youtube_data*100:.2f}%\n") 	 

PRECISION YOUTUBE DATA (Class = "Positive"):

True Positive of VADER_class considering Class "Positive" = 912
False Positive of VADER_class considering Class "Positive" = 354
Precision VADER on YouTube Samples: 0.7203791469194313 meaning 72.04%

True Positive of BERT_class_precomputed considering Class "Positive" = 1202
False Positive of BERT_class_precomputed considering Class "Positive" = 372
Precision BERT on YouTube Samples: 0.7636594663278272 meaning 76.37%

True Positive of GoogleNLP_class considering Class "Positive" = 914
False Positive of GoogleNLP_class considering Class "Positive" = 270
Precision Google NLP on YouTube Samples: 0.7719594594594594 meaning 77.20%



Now calculate the recall score. This is done by:
$
\begin{align}
    recall = \frac{TP}{TP + FN}
\end{align}
$
*Note: Here the Positive samples are the one with the the class `"Positive"`*

**False Negative (FN):** Observations belonging to the considered class are incorrectly predicted as not belonging to the considered class.

In [18]:
# Define Function to Calculate False Negative (FN)
def calculateFN(dataset: pd.DataFrame, column_name_label: str, column_name_prediction: str, class_value: str) -> int:
    FN = dataset[(dataset[column_name_label] == class_value) & (dataset[column_name_prediction] != class_value)].shape[0]
    return FN

$
\begin{align}
    Recall = \frac{TP}{TP + FN}
\end{align}
$

In [19]:
# Define Function to Calculate Recall
def calculateRecall(dataset: pd.DataFrame, column_name_label: str, column_name_prediction: str, class_value: str) -> float:
    TP = calculateTP(dataset, column_name_label, column_name_prediction, class_value)
    # print(f"True Positive of {column_name_prediction} considering Class \"{class_value}\" = {TP}")
    FN = calculateFN(dataset, column_name_label, column_name_prediction, class_value)
    print(f"False Negative of {column_name_prediction} considering Class \"{class_value}\" = {FN}")
    recall = TP / (TP + FN)
    return recall

In [20]:
# Calculate recall on twitter data for class "Positive"
print("RECALL TWITTER DATA (Class = \"Positive\"):\n")

recall_VADER_on_twitter_data = calculateRecall(twitter_data, "label", "VADER_class", "Positive")
print(f"Recall (class \"Positive\") VADER on Twitter Samples: {recall_VADER_on_twitter_data} meaning {recall_VADER_on_twitter_data*100:.2f}%\n")

recall_BERT_on_twitter_data = calculateRecall(twitter_data, "label", "BERT_class_precomputed", "Positive")
print(f"Recall (class \"Positive\") BERT on Twitter Samples: {recall_BERT_on_twitter_data} meaning {recall_BERT_on_twitter_data*100:.2f}%\n")

recall_GoogleNLP_on_twitter_data = calculateRecall(twitter_data, "label", "GoogleNLP_class", "Positive")
print(f"Recall (class \"Positive\") Google NLP on Twitter Samples: {recall_GoogleNLP_on_twitter_data} meaning {recall_GoogleNLP_on_twitter_data*100:.2f}%\n")

RECALL TWITTER DATA (Class = "Positive"):

False Negative of VADER_class considering Class "Positive" = 160
Recall (class "Positive") VADER on Twitter Samples: 0.727427597955707 meaning 72.74%

False Negative of BERT_class_precomputed considering Class "Positive" = 50
Recall (class "Positive") BERT on Twitter Samples: 0.9148211243611585 meaning 91.48%

False Negative of GoogleNLP_class considering Class "Positive" = 259
Recall (class "Positive") Google NLP on Twitter Samples: 0.5587734241908007 meaning 55.88%



In [21]:
# Calculate recall on youtube data for class "Positive"
print("RECALL YOUTUBE DATA (Class = \"Positive\"):\n")

recall_VADER_on_youtube_data = calculateRecall(youtube_data, "label", "VADER_class", "Positive")
print(f"Recall (class \"Positive\") VADER on YouTube Samples: {recall_VADER_on_youtube_data} meaning {recall_VADER_on_youtube_data*100:.2f}%\n")

recall_BERT_on_youtube_data = calculateRecall(youtube_data, "label", "BERT_class_precomputed", "Positive")
print(f"Recall (class \"Positive\") BERT on YouTube Samples: {recall_BERT_on_youtube_data} meaning {recall_BERT_on_youtube_data*100:.2f}%\n")

recall_GoogleNLP_on_youtube_data = calculateRecall(youtube_data, "label", "GoogleNLP_class", "Positive")
print(f"Recall (class \"Positive\") Google NLP on YouTube Samples: {recall_GoogleNLP_on_youtube_data} meaning {recall_GoogleNLP_on_youtube_data*100:.2f}%\n") 

RECALL YOUTUBE DATA (Class = "Positive"):

False Negative of VADER_class considering Class "Positive" = 401
Recall (class "Positive") VADER on YouTube Samples: 0.6945925361766946 meaning 69.46%

False Negative of BERT_class_precomputed considering Class "Positive" = 111
Recall (class "Positive") BERT on YouTube Samples: 0.9154607768469155 meaning 91.55%

False Negative of GoogleNLP_class considering Class "Positive" = 399
Recall (class "Positive") Google NLP on YouTube Samples: 0.6961157654226962 meaning 69.61%



Calculate the Recall and the Precision score now also for the negative class. The Precision is calculated as:
$
\begin{align}
    precision = \frac{TP}{TP + FP}
\end{align}
$
*Note: Here the Positive samples are the one with the the class `"Negative"`*

And the Recall is calculated as:
$
\begin{align}
    recall = \frac{TP}{TP + FN}
\end{align}
$
*Note: Here the Positive samples are the one with the the class `"Negative"`*

In [22]:
# Calculate Precision and Recall on Twitter Data for Class "Negative"
print("PRECISION & RECALL on TWITTER DATA (Class = \"Negative\"):\n")

neg_precision_VADER_on_twitter_data = calculatePrecision(twitter_data, "label", "VADER_class", "Negative")
print(f"Precision VADER on Twitter Samples: {neg_precision_VADER_on_twitter_data} meaning {neg_precision_VADER_on_twitter_data*100:.2f}%")
neg_recall_VADER_on_twitter_data = calculateRecall(twitter_data, "label", "VADER_class", "Negative")
print(f"Recall (class \"Positive\") VADER on Twitter Samples: {neg_recall_VADER_on_twitter_data} meaning {neg_recall_VADER_on_twitter_data*100:.2f}%\n")

neg_precision_BERT_on_twitter_data = calculatePrecision(twitter_data, "label", "BERT_class_precomputed", "Negative")
print(f"Precision BERT on Twitter Samples: {neg_precision_BERT_on_twitter_data} meaning {neg_precision_BERT_on_twitter_data*100:.2f}%")
neg_recall_BERT_on_twitter_data = calculateRecall(twitter_data, "label", "BERT_class_precomputed", "Negative")
print(f"Recall (class \"Positive\") BERT on Twitter Samples: {neg_recall_BERT_on_twitter_data} meaning {neg_recall_BERT_on_twitter_data*100:.2f}%\n")

neg_precision_GoogleNLP_on_twitter_data = calculatePrecision(twitter_data, "label", "GoogleNLP_class", "Negative")
print(f"Precision Google NLP on Twitter Samples: {neg_precision_GoogleNLP_on_twitter_data} meaning {neg_precision_GoogleNLP_on_twitter_data*100:.2f}%")
neg_recall_GoogleNLP_on_twitter_data = calculateRecall(twitter_data, "label", "GoogleNLP_class", "Negative")
print(f"Recall (class \"Positive\") Google NLP on Twitter Samples: {neg_recall_GoogleNLP_on_twitter_data} meaning {neg_recall_GoogleNLP_on_twitter_data*100:.2f}%\n")

PRECISION & RECALL on TWITTER DATA (Class = "Negative"):

True Positive of VADER_class considering Class "Negative" = 331
False Positive of VADER_class considering Class "Negative" = 2677
Precision VADER on Twitter Samples: 0.11003989361702128 meaning 11.00%
False Negative of VADER_class considering Class "Negative" = 50
Recall (class "Positive") VADER on Twitter Samples: 0.868766404199475 meaning 86.88%

True Positive of BERT_class_precomputed considering Class "Negative" = 312
False Positive of BERT_class_precomputed considering Class "Negative" = 504
Precision BERT on Twitter Samples: 0.38235294117647056 meaning 38.24%
False Negative of BERT_class_precomputed considering Class "Negative" = 69
Recall (class "Positive") BERT on Twitter Samples: 0.8188976377952756 meaning 81.89%

True Positive of GoogleNLP_class considering Class "Negative" = 128
False Positive of GoogleNLP_class considering Class "Negative" = 249
Precision Google NLP on Twitter Samples: 0.3395225464190981 meaning 33.9

In [23]:
# Calculate Precision and Recall on YouTube Data for Class "Negative"
print("PRECISION & RECALL on YOUTUBE DATA (Class = \"Negative\"):\n")

neg_precision_VADER_on_youtube_data = calculatePrecision(youtube_data, "label", "VADER_class", "Negative")
print(f"Precision VADER on YouTube Samples: {neg_precision_VADER_on_youtube_data} meaning {neg_precision_VADER_on_youtube_data*100:.2f}%")
neg_recall_VADER_on_youtube_data = calculateRecall(youtube_data, "label", "VADER_class", "Negative")
print(f"Recall (class \"Positive\") VADER on YouTube Samples: {neg_recall_VADER_on_youtube_data} meaning {neg_recall_VADER_on_youtube_data*100:.2f}%\n")

neg_precision_BERT_on_youtube_data = calculatePrecision(youtube_data, "label", "BERT_class_precomputed", "Negative")
print(f"Precision BERT on YouTube Samples: {neg_precision_BERT_on_youtube_data} meaning {neg_precision_BERT_on_youtube_data*100:.2f}%")
neg_recall_BERT_on_youtube_data = calculateRecall(youtube_data, "label", "BERT_class_precomputed", "Negative")
print(f"Recall (class \"Positive\") BERT on YouTube Samples: {neg_recall_BERT_on_youtube_data} meaning {neg_recall_BERT_on_youtube_data*100:.2f}%\n")

neg_precision_GoogleNLP_on_youtube_data = calculatePrecision(youtube_data, "label", "GoogleNLP_class", "Negative")
print(f"Precision Google NLP on YouTube Samples: {neg_precision_GoogleNLP_on_youtube_data} meaning {neg_precision_GoogleNLP_on_youtube_data*100:.2f}%") 
neg_recall_GoogleNLP_on_youtube_data = calculateRecall(youtube_data, "label", "GoogleNLP_class", "Negative")
print(f"Recall (class \"Positive\") Google NLP on YouTube Samples: {neg_recall_GoogleNLP_on_youtube_data} meaning {neg_recall_GoogleNLP_on_youtube_data*100:.2f}%\n")

PRECISION & RECALL on YOUTUBE DATA (Class = "Negative"):

True Positive of VADER_class considering Class "Negative" = 462
False Positive of VADER_class considering Class "Negative" = 1565
Precision VADER on YouTube Samples: 0.227923038973853 meaning 22.79%
False Negative of VADER_class considering Class "Negative" = 74
Recall (class "Positive") VADER on YouTube Samples: 0.8619402985074627 meaning 86.19%

True Positive of BERT_class_precomputed considering Class "Negative" = 436
False Positive of BERT_class_precomputed considering Class "Negative" = 324
Precision BERT on YouTube Samples: 0.5736842105263158 meaning 57.37%
False Negative of BERT_class_precomputed considering Class "Negative" = 100
Recall (class "Positive") BERT on YouTube Samples: 0.8134328358208955 meaning 81.34%

True Positive of GoogleNLP_class considering Class "Negative" = 214
False Positive of GoogleNLP_class considering Class "Negative" = 164
Precision Google NLP on YouTube Samples: 0.5661375661375662 meaning 56.61

# To learn more
1. What was the best performing method for Youtube? Did that fit your expectations?
2. What was the best performing method for Twitter? Did that fit your expectations?
4. Do you observe any differences between prediction of positive and negative sentiment? What is the role of the imbalance between postive and negative classes in the calculation of accuracy?


In [32]:
# Summarize DATA
d = {1: ["Precision Class", "Positive", "Twitter", round(precision_VADER_on_twitter_data,2), round(precision_BERT_on_twitter_data,2), round(precision_GoogleNLP_on_twitter_data,2)],
2: ["Precision Class", "Positive", "YouTube", round(precision_VADER_on_youtube_data,2), round(precision_BERT_on_youtube_data,2), round(precision_GoogleNLP_on_youtube_data,2)],
3: ["Precision Class", "Negative", "Twitter", round(neg_precision_VADER_on_twitter_data,2), round(neg_precision_BERT_on_twitter_data,2), round(neg_precision_GoogleNLP_on_twitter_data,2)],
4: ["Precision Class", "Negative", "YouTube", round(neg_precision_VADER_on_youtube_data,2), round(neg_precision_BERT_on_youtube_data,2), round(neg_precision_GoogleNLP_on_youtube_data,2)],
5: ["Recall Class", "Positive", "Twitter", round(recall_VADER_on_twitter_data,2), round(recall_BERT_on_twitter_data,2), round(recall_GoogleNLP_on_twitter_data,2)],
6: ["Recall Class", "Positive", "YouTube", round(recall_VADER_on_youtube_data,2), round(recall_BERT_on_youtube_data,2), round(recall_GoogleNLP_on_youtube_data,2)],
7: ["Recall Class", "Negative", "Twitter", round(neg_recall_VADER_on_twitter_data,2), round(neg_recall_BERT_on_twitter_data,2), round(neg_recall_GoogleNLP_on_twitter_data,2)],
8: ["Recall Class", "Negative", "YouTube", round(neg_recall_VADER_on_youtube_data,2), round(neg_recall_BERT_on_youtube_data,2), round(neg_recall_GoogleNLP_on_youtube_data,2)]
}
print ("{:<5} {:<30} {:<20} {:<20} {:<20} {:<20} {:<20}".format('', 'Calculation', 'Class', 'DATA', 'VADER','BERT','GFoogle NLP'))
for k, v in d.items():
    calc, class_value, data, vader, bert, google = v
    print ("{:<5} {:<30} {:<20} {:<20} {:<20} {:<20} {:<20}".format(k, calc, class_value, data, vader, bert, google)) 

      Calculation                    Class                DATA                 VADER                BERT                 GFoogle NLP         
1     Precision Class                Positive             Twitter              0.36                 0.36                 0.34                
2     Precision Class                Positive             YouTube              0.72                 0.76                 0.77                
3     Precision Class                Negative             Twitter              0.11                 0.38                 0.34                
4     Precision Class                Negative             YouTube              0.23                 0.57                 0.57                
5     Recall Class                   Positive             Twitter              0.73                 0.91                 0.56                
6     Recall Class                   Positive             YouTube              0.69                 0.92                 0.7                 
7     

## Sentiment Analysis Evaluation Summary

### 1. Best Performing Method for YouTube
The best method was **BERT** with an accuracy of $74.34\%$, strong precision ($76.37\%$), and recall ($91.55\%$) for "Positive" sentiment. This aligns with expectations, as BERT effectively captures nuanced language.

### 2. Best Performing Method for Twitter
The best method was **Google NLP** with an accuracy of $67.12\%$. This partially fits expectations, but its lower precision and recall indicate it struggles with informal language compared to BERT.

### 3. Differences Between Positive and Negative Sentiment
"Positive" sentiment generally shows higher recall, especially for BERT, while "Negative" sentiment precision tends to be lower (e.g., Google NLP). VADER often overpredicts "Positive," leading to lower precision.

### Role of Imbalance
Imbalance skews accuracy by favoring majority classes, inflating overall metrics while ignoring minority class performance. Metrics like precision and recall highlight these disparities, showing BERT's robustness to class imbalance.
