In [None]:
import numpy as np
import random
# Set seed for reproducibility
np.random.seed(42)  # Set seed for NumPy
random.seed(42) # Set seed for random module

## Data

The dataset contains tweets about different airlines.


Run the code below.

In [None]:
import pandas as pd
# Loading the data from a csv file
tweets = pd.read_csv("https://raw.githubusercontent.com/kbrennig/MODS_WS24_25/refs/heads/main/data/airlinetweets.csv")

### Display document
Run the code below.

In [None]:
tweets.head()

## Preprocessing
Because unstructured text has no inherent, consistent structure, we have to preprocess it before a computer can work with it.
Keep in mind that every preprocessing step discards some information, but the basic methods we use here require it.

### Tokenize documents
First, we tokenize the texts. This means we transform each text from one long string into a list of tokens. The preprocessing function below also removes unwanted tokens such as punctuation.
For a full list and explanation of the available parameters, see the NLTK documentation.
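
To make the idea of tokenization concrete, here is a minimal sketch that uses a regular expression as a stand-in for `nltk.word_tokenize`. The function name `simple_tokenize` and the example sentence are made up for illustration, and the NLTK tokenizer may split some inputs differently.

```python
import re

def simple_tokenize(text):
    """Split text into word tokens and single punctuation tokens (regex stand-in, not NLTK)."""
    # \w+ matches a run of letters, digits or underscores;
    # [^\w\s] matches one character that is neither a word character nor whitespace
    return re.findall(r"\w+|[^\w\s]", text)

example = "@airline my flight was delayed, again!"
tokens = simple_tokenize(example)
print(tokens)
# ['@', 'airline', 'my', 'flight', 'was', 'delayed', ',', 'again', '!']
```

Each punctuation character becomes its own token, which is why a later step can drop punctuation by filtering the token list.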

### Stem all words
After tokenizing the texts we perform stemming (alternatively lemmatization could be performed). Stemming reduces every word to its stem.
The stemmer we use here is the Porter Stemmer.

### Remove stopwords
Finally, we remove stopwords: very common words that carry little meaning on their own (e.g. 'this', 'the', 'a', etc.).
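
As a small illustration of stopword filtering, the sketch below uses a hypothetical five-word stopword set instead of the full NLTK list. It also shows why the order of the steps can matter: a stemmed token only gets removed if its stem is itself in the stopword list.

```python
# Hypothetical mini stopword set; the notebook uses nltk.corpus.stopwords.words("english")
toy_stopwords = {"this", "the", "a", "my", "was"}

tokens = ["my", "flight", "was", "delayed", "again"]

# Keep only tokens whose lowercase form is not a stopword
filtered = [t for t in tokens if t.lower() not in toy_stopwords]
print(filtered)  # ['flight', 'delayed', 'again']

# A truncated stem such as "thi" (a possible stem of "this") no longer
# matches the stopword "this", so it would survive the filter
print("thi" in toy_stopwords)  # False
```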

Run the code below.



In [None]:
# Preprocessing
import nltk
import string

# Download the necessary nltk resource
nltk.download('punkt_tab')
nltk.download('stopwords')


def preprocess(text):
    # tokenize the text
    tokens = nltk.word_tokenize(text)

    # create stemmer object
    stemmer = nltk.stem.PorterStemmer()

    # stem each token
    stemmed_tokens = [stemmer.stem(token) for token in tokens]

    # get the English stopwords as a set for fast membership tests
    stopwords = set(nltk.corpus.stopwords.words("english"))

    # remove stopwords
    filtered_tokens = [token for token in stemmed_tokens if token.lower() not in stopwords]
    
    # remove punctuation
    filtered_tokens_nopunct = [token for token in filtered_tokens if token not in string.punctuation]

    return filtered_tokens_nopunct

## Apply preprocessing


In [None]:
tweets['tokens'] = tweets['text'].apply(preprocess)
tweets.iloc[0]

## Dictionary-Based Sentiment Analysis
`Dictionary-based Sentiment Analysis` works by looking up each word of a text in a `sentiment dictionary`. Afterwards, the sentiment scores of the individual words are summed up to evaluate the sentiment of the whole text.
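
The lookup idea can be sketched with a tiny hand-made lexicon. The words in `toy_lexicon` and the helper `count_sentiment` are assumptions for illustration only; the notebook uses the NRC lexicon loaded below.

```python
# Hypothetical toy lexicon, not the NRC dictionary
toy_lexicon = {
    "positive": {"great", "love", "thank"},
    "negative": {"delay", "lost", "cancel"},
}

def count_sentiment(tokens, lexicon):
    """Count how many tokens appear in the positive and in the negative word sets."""
    pos = sum(1 for t in tokens if t in lexicon["positive"])
    neg = sum(1 for t in tokens if t in lexicon["negative"])
    return pos, neg

tokens = ["thank", "great", "crew", "flight", "delay"]
pos, neg = count_sentiment(tokens, toy_lexicon)
print(pos, neg, pos - neg)  # 2 1 1
```

Words that appear in neither set ("crew", "flight") simply do not contribute to the score.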

### Load NRC sentiment dictionary
We use the NRC sentiment dictionary. This dictionary contains ten classes: anger, anticipation, disgust, fear, joy, negative, positive, sadness, surprise and trust.
Currently we are only interested in positive and negative words.

Run the code below.

In [None]:
# Load NRC Emotion Lexicon
nrc_df = pd.read_csv('https://raw.githubusercontent.com/kbrennig/MODS_WS24_25/refs/heads/main/data/NRC-Emotion-Lexicon-Wordlevel-v0.92.txt', sep='\t', header=None, names=['word', 'emotion', 'association'])

# Filter out rows where association is 0
nrc_df = nrc_df[nrc_df['association'] == 1]

# Define positive and negative emotion categories
positive_emotions = {'positive'}
negative_emotions = {'negative'}

# Filter words by emotion category and collect unique words for each sentiment orientation
positive_words = nrc_df[nrc_df['emotion'].isin(positive_emotions)]['word'].unique()
negative_words = nrc_df[nrc_df['emotion'].isin(negative_emotions)]['word'].unique()



### Sample from dictionary
We can look at an excerpt of the positive words contained in the dictionary.

Run the code below.

In [None]:
positive_words[:10]

### Stem the positive and negative dictionaries
The words in the dictionary aren't stemmed by default. Since we stemmed the tokens in our data, we also stem the positive and negative words in the dictionary so that both sides can match.
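
The following sketch illustrates why both sides need the same treatment. It uses a deliberately crude, hypothetical suffix stripper (`crude_stem`), which is not the Porter algorithm, and a made-up two-word lexicon.

```python
def crude_stem(word):
    """Hypothetical toy stemmer that strips a few suffixes. NOT the Porter algorithm."""
    for suffix in ("ing", "ed", "s"):
        # Only strip if at least three characters remain
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

lexicon = ["delayed", "missing"]                                 # unstemmed dictionary entries
tweet_tokens = [crude_stem(t) for t in ["delays", "missed", "bag"]]  # stemmed tweet tokens

# Matching stemmed tokens against the unstemmed lexicon finds nothing
raw_matches = sum(1 for t in tweet_tokens if t in lexicon)

# Stemming the lexicon as well lets the stems line up
stemmed_lexicon = {crude_stem(w) for w in lexicon}
stemmed_matches = sum(1 for t in tweet_tokens if t in stemmed_lexicon)

print(raw_matches, stemmed_matches)  # 0 2
```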

Run the code below.

In [None]:
# Initialize the Porter stemmer
stemmer = nltk.stem.PorterStemmer()

# Stem the words in each list
positive_words_stemmed = [stemmer.stem(word) for word in positive_words]
negative_words_stemmed = [stemmer.stem(word) for word in negative_words]


positive_words_stemmed[:10]

### Look-up remaining tokens in NRC dictionary and transform results to data frame
To repeat the analysis with unstemmed tokens, copy the relevant code into the summary section, remove the stemming step from the preprocessing, pass the unstemmed tokens as input_data, and set stemmed_dict = False.

Which procedure yields more accurate results and what do you believe to be the reason for the outcome?

Run the code below.

In [None]:
# Create a dictionary with both stemmed and unstemmed words for sentiment analysis
# (sets make the membership tests in the lookup much faster than lists)
sentiment_dict = {
    'positive': set(positive_words),
    'negative': set(negative_words),
    'positive_stemmed': set(positive_words_stemmed),
    'negative_stemmed': set(negative_words_stemmed)
}

def sentiment_lookup(tokens, sentiment_dict, stemmed_dict=True):
    if stemmed_dict:
        # Use stemmed versions of the dictionary
        positive_words = sentiment_dict['positive_stemmed']
        negative_words = sentiment_dict['negative_stemmed']
    else:
        # Use original versions of the dictionary
        positive_words = sentiment_dict['positive']
        negative_words = sentiment_dict['negative']
    
    # Count positive and negative word matches
    positive_count = sum(1 for token in tokens if token in positive_words)
    negative_count = sum(1 for token in tokens if token in negative_words)
    
    return positive_count, negative_count

tweets_toks_stemmed = tweets['tokens']

# Perform lookup with stemmed dictionary
# (the loop variable is named `tokens` so it does not shadow the `tweets` DataFrame)
results = [sentiment_lookup(tokens, sentiment_dict, stemmed_dict=True) for tokens in tweets_toks_stemmed]
df_results = pd.DataFrame(results, columns=['positive_count', 'negative_count'])
print("Results with Stemmed Dictionary:")
print(df_results)


### Calculate overall sentiment score
After looking up the sentiment for the remaining tokens of each text we can now aggregate them by simply subtracting the number of negative words from the number of positive words found.

Run the code below.

In [None]:
df_results.describe()

In [None]:
# Calculate sentiment algorithm score (positive - negative)
df_results['sentiment_algo_score'] = df_results['positive_count'] - df_results['negative_count']

# Print the results with sentiment scores
print("Results DataFrame:")
print(df_results)

### Scale sentiment score by number of emotional words in a tweet
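
Scaling divides the raw score by the number of matched emotional words, so the result always lies between -1 and 1 regardless of tweet length. The sketch below shows the calculation on made-up count pairs; `scaled_score` is a hypothetical helper, and it returns 0 when no word matched, which mirrors filling the missing values with 0 in the cell below.

```python
def scaled_score(positive_count, negative_count):
    """Return (positive - negative) / (positive + negative), or 0.0 if nothing matched."""
    total = positive_count + negative_count
    if total == 0:
        # No emotional words found: avoid dividing by zero and treat the tweet as neutral
        return 0.0
    return (positive_count - negative_count) / total

# Made-up (positive_count, negative_count) pairs
examples = [(4, 2), (1, 0), (0, 3), (0, 0)]
scores = [scaled_score(p, n) for p, n in examples]
print(scores)
```

The first pair has the largest raw score (4 - 2 = 2) but a smaller scaled score than the second pair, because only two thirds of its matched words are positive.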

Run the code below.

In [None]:
df_results['sentiment_algo_scaled'] = df_results['sentiment_algo_score'] / (df_results['positive_count'] + df_results['negative_count'])
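# Tweets without any matched words produce 0/0 = NaN; assign the filled column back
# instead of using inplace=True on a column selection
df_results['sentiment_algo_scaled'] = df_results['sentiment_algo_scaled'].fillna(0)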
df_results['sentiment_algo_scaled'].describe()

### Calculate binary sentiment label
As in classification, we still have to decide which label to assign to each instance, because so far we have only calculated sentiment scores. The scaled score lies between -1 and 1, so we label scores greater than 0 as positive and all other scores as negative. Note that tweets with a score of exactly 0 (ties, or no matched words at all) are therefore labeled negative.

Run the code below.

In [None]:
df_results['sentiment_algo_binary'] = ['positive' if x > 0 else 'negative' for x in df_results['sentiment_algo_scaled']]
df_results['sentiment_algo_binary'].value_counts()


### Show distribution of human sentiment labels
As a reference we can also display the ground truth distribution of positive and negative tweets. We can see that our model predicts a lot more positive tweets than contained in the dataset. (What could be a possible reason?)

Run the code below.

In [None]:
tweets['sentiment_human'].value_counts()

### Evaluate accuracy with human sentiment labels as ground truth
Since the task at hand is classification (the only difference lies in the type of input data) we can evaluate our model in the same way as we did before.
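
As a reminder of what the two metrics measure, here is a plain-Python sketch on five made-up label pairs. The lists `human` and `predicted` are illustrative only; the notebook itself uses scikit-learn for the real evaluation.

```python
from collections import Counter

# Made-up ground truth and predictions for five tweets
human = ["negative", "negative", "positive", "negative", "positive"]
predicted = ["negative", "positive", "positive", "positive", "negative"]

# Accuracy: share of tweets where the prediction equals the human label
accuracy = sum(h == p for h, p in zip(human, predicted)) / len(human)

# Confusion matrix as counts of (human label, predicted label) pairs
confusion = Counter(zip(human, predicted))

print(accuracy)  # 0.4
for (h, p), n in sorted(confusion.items()):
    print(f"human={h:8s} predicted={p:8s} count={n}")
```

Here two of the three negative tweets are predicted as positive, the kind of pattern the confusion matrix makes visible and a single accuracy number hides.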

Run the code below.

In [None]:
tweets_df_sent, results_df_bin = pd.DataFrame(tweets['sentiment_human']), pd.DataFrame(df_results['sentiment_algo_binary'])

In [None]:
from sklearn.metrics import accuracy_score, ConfusionMatrixDisplay
import matplotlib.pyplot as plt

# Step 1: Compute accuracy
accuracy = accuracy_score(tweets_df_sent['sentiment_human'], results_df_bin['sentiment_algo_binary'])
print("Accuracy:", accuracy)

# Step 2: Compute confusion matrix and display it
ConfusionMatrixDisplay.from_predictions(tweets_df_sent['sentiment_human'], results_df_bin['sentiment_algo_binary'])
plt.show()