# Dictionary Approach

In this notebook, we apply a dictionary-based approach to classify the sentiment of articles and evaluate its performance on the same 256 test articles used to test the LSTM model. We use the dictionary by [Bannier, Pauls, and Walter (BPW)](https://link.springer.com/article/10.1007/s11573-018-0914-8), a German adaptation of the widely used dictionary by [Loughran and McDonald (2011)](https://onlinelibrary.wiley.com/doi/full/10.1111/j.1540-6261.2010.01625.x).

The main goal is to compare the performance of the LSTM model with the dictionary-based method. The BPW dictionary, specifically tailored for business communication, is known for producing sentiment indices that strongly correlate with key economic and financial variables. This makes it a suitable choice for our analysis.

First, we load the MTI dataset, and then, using the pre-defined test indices, we filter the articles and labels to select only those belonging to the test set.

In [1]:
import csv

# Open and read articles from the 'articles.txt' file 
with open('MediaTenor_data/articles.txt', 'r', encoding = 'utf-8') as f:
    articles = f.read().split('\n')  # Splitting into a list of articles

# Open and read labels from the 'labels_binary.txt' file    
with open('MediaTenor_data/labels_binary.txt', 'r', encoding = 'utf-8') as f:
    labels = f.read().split('\n')  # Splitting into a list of labels
       
# Load the test indices from the CSV file
with open('test_indices.csv', 'r') as f:
    reader = csv.reader(f)
    test_indices = list(map(int, next(reader))) 

# Filter articles and labels for the test set
test_articles = [articles[i] for i in test_indices]
test_labels = [labels[i] for i in test_indices]

Next, we filter the articles to retain only the sentences that contain at least one word related to business cycle conditions.

In [2]:
import os
import nltk
nltk.download('punkt_tab')
import multiprocessing as mp 
from datetime import datetime
from functools import partial
import keep_economy_related_sentences

NUM_CORE = 60 # set the number of cores to use

# Set the path variable to point to the 'word_embeddings' directory
path = os.getcwd().replace('\\sentiment', '') + '\\word_embeddings'

# Load words related to 'Wirtschaft' and 'Konjunktur'
konjunktur_words = keep_economy_related_sentences.load_words(path + '\\konjunktur_synonyms.txt')
wirtschaft_words = keep_economy_related_sentences.load_words(path + '\\wirtschaft_synonyms.txt')

# Combine the two lists
economy_related_words = konjunktur_words + wirtschaft_words

startTime = datetime.now() 

if __name__ == "__main__":
    pool = mp.Pool(NUM_CORE)
    inputs = zip(test_articles, [economy_related_words]*len(test_articles))
    economy_related_sentences = pool.starmap(keep_economy_related_sentences.keep_economy_related_sentences, inputs) 
    pool.close()
    pool.join()
    
print(datetime.now()-startTime)

[nltk_data] Downloading package punkt_tab to
[nltk_data]     C:\Users\mokuneva\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


0:00:50.345330
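
The `keep_economy_related_sentences` module is not shown in this notebook, so the following is only a minimal sketch of the kind of filtering described above, not its actual implementation. The function name `keep_matching_sentences` is hypothetical, and the regex-based sentence split is an assumption that stands in for NLTK's Punkt tokenizer.

```python
import re

def keep_matching_sentences(text, keywords):
    """Hypothetical stand-in: keep only sentences containing at least one keyword."""
    keyword_set = {word.lower() for word in keywords}
    # Naive sentence split after '.', '!' or '?' (assumption; the notebook
    # relies on NLTK's Punkt tokenizer instead)
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    kept = []
    for sentence in sentences:
        # Compare lowercased word tokens against the keyword set
        tokens = {token.lower() for token in re.findall(r'\w+', sentence)}
        if tokens & keyword_set:
            kept.append(sentence)
    # Return the retained sentences as a single string per article
    return ' '.join(kept)

example = ("Die Konjunktur zieht an. Das Wetter bleibt sonnig. "
           "Die Wirtschaft wächst langsamer.")
filtered = keep_matching_sentences(example, ['Konjunktur', 'Wirtschaft'])
print(filtered)
```

In this toy input, the middle sentence mentions neither keyword and is dropped, while the first and last sentences are kept.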


We then load the BPW dictionary from an Excel file and create two lists: one containing negative terms and another containing positive terms.

In [3]:
import pandas as pd

# Read each sheet of the Excel file and convert its first column into a list
bpw_neg = list(pd.read_excel('BPW_Dictionary.xlsx', sheet_name='NEG_BPW', header=None).iloc[:,0]) 
bpw_pos = list(pd.read_excel('BPW_Dictionary.xlsx', sheet_name='POS_BPW', header=None).iloc[:,0])

# The word 'falsch' is read back as the boolean False; restore its intended string form
bpw_neg = ['falsch' if word is False else word for word in bpw_neg]

print(bpw_neg[:5])
print(bpw_pos[:5])

['abbau', 'abbauen', 'abbauend', 'abbauende', 'abbauendem']
['adäquat', 'adäquate', 'adäquatem', 'adäquaten', 'adäquater']


After that, we create an array `X` that records the number of times each negative word from the BPW dictionary appears in each of the test articles.

[CountVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) converts all characters to lowercase, tokenizes the text by extracting tokens of at least two alphanumeric characters, and counts the occurrences of tokens from the `vocabulary` in each document.

In [4]:
from sklearn.feature_extraction.text import CountVectorizer

# Instantiate the vectorizer object
vectorizer = CountVectorizer(analyzer = 'word', vocabulary = bpw_neg)

# Fit a vectorizer to the articles and use it to transform them
# vectorizer.fit_transform returns a sparse matrix recording the number of times each word from the vocabulary appears;
# toarray() transforms a sparse matrix into a numpy array
X = vectorizer.fit_transform(list(economy_related_sentences)).toarray()
print(X.shape)

(256, 10147)
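
To make the counting step more concrete, here is a pure-Python sketch that mimics `CountVectorizer`'s default lowercasing and token pattern for a fixed vocabulary. It is an illustration only: the helper `count_vocabulary_terms` and the toy sentence are made up for this example, and other vectorizer options (n-grams, accent stripping, stop words) are ignored.

```python
import re
from collections import Counter

# Default CountVectorizer token pattern: two or more word characters
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def count_vocabulary_terms(document, vocabulary):
    """Return one row of the count matrix: occurrences of each vocabulary term."""
    tokens = TOKEN_PATTERN.findall(document.lower())
    counts = Counter(tokens)
    # Counter returns 0 for terms that never occur in the document
    return [counts[term] for term in vocabulary]

toy_vocabulary = ['abbau', 'abbauen']
toy_document = ("Der Abbau geht weiter: Firmen wollen Stellen abbauen, "
                "der Abbau dauert an.")
row = count_vocabulary_terms(toy_document, toy_vocabulary)
print(row)  # one row of the count matrix, one column per vocabulary term
```

Summing such a row gives the number of dictionary hits in the document, which is what the following cells compute for every article. Note that matching is on whole tokens, so `abbau` does not also count occurrences of `abbauen`.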


Now, we create a DataFrame `data` where we will store the filtered article texts, along with the number of positive and negative words in each article, and the total number of words. We then calculate the number of negative words in each article using the previously created matrix and add this count as a new column in the DataFrame.

In [5]:
# Create a DataFrame 'data' where we will store the filtered article texts,
# along with the count of positive and negative words, and the total number of words in each article
data = pd.DataFrame({'text': economy_related_sentences})

# axis=1: sum across the columns of each row (the number of negative words in each article)
negative = pd.DataFrame(X).sum(axis=1)

# Create a column with the count of negative words
data['negative'] = negative

In a similar manner, we calculate the number of positive words in each article by fitting the vectorizer to the positive terms from the BPW dictionary. We add this count as a new column to the DataFrame.

In [6]:
# Instantiate the vectorizer object
vectorizer = CountVectorizer(analyzer = 'word', vocabulary = bpw_pos)

# Fit a vectorizer to the articles and use it to transform them
# vectorizer.fit_transform returns a sparse matrix recording the number of times each word from the vocabulary appears;
# toarray() transforms a sparse matrix into a numpy array
X = vectorizer.fit_transform(list(economy_related_sentences)).toarray()

# axis=1: sum across the columns of each row (the number of positive words in each article)
positive = pd.DataFrame(X).sum(axis=1)

# Create a column with the count of positive words
data['positive'] = positive

The final statistic required for calculating sentiment is the total number of words in each article. We use the `CountVectorizer` again to compute the word count and add this as a new column in the DataFrame.

In [7]:
# Instantiate the vectorizer object
vectorizer = CountVectorizer(analyzer = 'word')

# Fit a vectorizer to the articles and use it to transform them.
# vectorizer.fit_transform returns a sparse matrix recording the number of times each word from the vocabulary appears;
# toarray() transforms a sparse matrix into a numpy array
X = vectorizer.fit_transform(list(economy_related_sentences)).toarray()

# axis=1: sum across the columns of each row (the total number of words in each article)
word_count = X.sum(axis=1)

# Create a column with the word count
data['word_count'] = word_count

Finally, we calculate the sentiment for each article as the proportion of positive words minus the proportion of negative words, relative to the total word count. This value is then added as a new column to the DataFrame.

In [8]:
# Calculate the sentiment of each article as the proportion of positive words minus the proportion of negative words
data['sentiment_bpw'] = (data['positive']-data['negative'])/data['word_count']

Since we need a binary classification for sentiment—either negative or positive/no clear tone—we transform the sentiment score into a class label. If the score is negative, the article is classified as having a negative sentiment toward business cycle conditions; otherwise, it is considered to have a positive or neutral tone.

In [9]:
data['predictions'] = data['sentiment_bpw'].apply(lambda x: 0 if x < 0 else 1)
data.head()

|   | text | negative | positive | word_count | sentiment_bpw | predictions |
|---|------|----------|----------|------------|---------------|-------------|
| 0 | Mehr als zwei Drittel der Inder glauben nach e... | 2 | 0 | 74 | -0.027027 | 0 |
| 1 | IfD Allensbach befragt für Capital und die "FA... | 6 | 7 | 205 | 0.004878 | 1 |
| 2 | Berlin/Kiel - Die deutsche Konjunktur gewinnt ... | 0 | 1 | 21 | 0.047619 | 1 |
| 3 | Es droht nämlich eine Rezession in den USA, we... | 5 | 0 | 58 | -0.086207 | 0 |
| 4 | Berlin - Die deutsche Konjunktur legt eine Ver... | 1 | 0 | 42 | -0.02381 | 0 |
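
As a quick check, the scores and class labels of the five rows displayed above can be reproduced by hand. The sketch below uses plain Python and two helper names (`sentiment_score`, `to_class`) introduced only for this illustration.

```python
def sentiment_score(positive, negative, word_count):
    """Share of positive words minus share of negative words."""
    return (positive - negative) / word_count

def to_class(score):
    """0 = negative tone, 1 = positive or no clear tone."""
    return 0 if score < 0 else 1

# (positive, negative, word_count) for the five rows displayed above
rows = [(0, 2, 74), (7, 6, 205), (1, 0, 21), (0, 5, 58), (0, 1, 42)]

scores = [round(sentiment_score(p, n, w), 6) for p, n, w in rows]
classes = [to_class(s) for s in scores]
print(scores)
print(classes)

# Edge cases of the threshold: a score of exactly 0 and a missing score (NaN)
# both fall into class 1, because neither satisfies `score < 0`
print(to_class(0.0), to_class(float('nan')))
```

The second edge case matters if an article ends up with no retained words: in pandas, dividing zero by a zero word count would be expected to yield NaN, and such an article would then be labelled as class 1.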


To evaluate the performance of the dictionary-based approach, we first need to convert the true labels into a binary format: 0 for the negative class and 1 for the positive/no clear tone class.

In [10]:
import numpy as np

# Convert labels to binary format: 1 for 'positive' and 0 for 'negative'
encoded_labels = np.array([1 if label == 'positive' else 0 for label in test_labels])

The final step is to evaluate the performance of the dictionary-based approach. We generate a classification report that includes precision, recall, F1-score, and support for each class. Additionally, we compute and display the confusion matrix to visualize how well the model performs in terms of correct and incorrect predictions.

The dictionary-based approach achieves an overall accuracy of 62.9%, compared to 66.8% for the LSTM model. The LSTM performs better, but the BPW dictionary remains a strong benchmark, which helps explain its popularity. Notably, the dictionary-based method performs well on negative articles, correctly identifying 73% of them (recall), but struggles with articles that have a positive or no clear tone, where recall is only 53%.

In [11]:
from sklearn.metrics import classification_report, confusion_matrix

# Generate and print the classification report
# This report includes precision, recall, F1-score, and support for each class
print("Classification Report:")
print(classification_report(encoded_labels, list(data.predictions)))

# Compute and display the confusion matrix
# The matrix aligns true labels with rows and predicted labels with columns
print("Confusion Matrix:")
conf_matrix = confusion_matrix(encoded_labels, list(data.predictions))
print(conf_matrix)

Classification Report:
              precision    recall  f1-score   support

           0       0.59      0.73      0.66       124
           1       0.68      0.53      0.60       132

    accuracy                           0.63       256
   macro avg       0.64      0.63      0.63       256
weighted avg       0.64      0.63      0.63       256

Confusion Matrix:
[[91 33]
 [62 70]]
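
The figures in the classification report can be re-derived directly from the confusion matrix, which is a useful sanity check on how they should be read. The sketch below uses plain Python lists; rows are true classes and columns are predicted classes, as in the output above.

```python
# Confusion matrix from the output above: rows = true class, columns = predicted class
conf_matrix = [[91, 33],
               [62, 70]]

total = sum(sum(row) for row in conf_matrix)

# Accuracy: share of all articles on the diagonal (correctly classified)
accuracy = (conf_matrix[0][0] + conf_matrix[1][1]) / total

# Recall per class: correct predictions divided by the row total (true class size)
recall = [conf_matrix[k][k] / sum(conf_matrix[k]) for k in range(2)]

# Precision per class: correct predictions divided by the column total (predicted class size)
precision = [conf_matrix[k][k] / (conf_matrix[0][k] + conf_matrix[1][k])
             for k in range(2)]

print(f"accuracy:  {accuracy:.3f}")
print(f"recall:    {[round(r, 2) for r in recall]}")
print(f"precision: {[round(p, 2) for p in precision]}")
```

The per-class values of 0.73 and 0.53 are row-wise shares (recall), while the overall 62.9% is the share of all 256 articles on the diagonal.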
