# Linear Support Vector Machine (LSVM) Benchmark

In this notebook, we use a Linear SVM (LSVM) model as a benchmark for the LSTM model. Both models are trained on the same training articles and tested on the same test articles. The hyperparameter `C` is tuned using cross-validation. `C` controls the hardness of the margin: a larger `C` penalizes margin violations more strongly and approaches a hard margin, while a smaller `C` softens the margin and tolerates some misclassified training articles. The goal is to compare the performance of the LSVM to that of the LSTM model.
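
For reference, the role of `C` is visible in the standard soft-margin objective of a linear SVM. This is the textbook formulation, not something taken from this notebook, and the symbols $w$, $b$, $x_i$, $y_i$ and $n$ are introduced here only for illustration. scikit-learn's `LinearSVC` uses the squared hinge loss by default, so the loss term is squared there.

$$\min_{w,b} \; \frac{1}{2}\lVert w \rVert^2 + C \sum_{i=1}^{n} \max\left(0,\, 1 - y_i (w^\top x_i + b)\right)$$

Here $x_i$ is the feature vector of article $i$ and $y_i \in \{-1, +1\}$ is its label. With a large `C` the sum of margin violations dominates the objective, while with a small `C` the regularization term $\frac{1}{2}\lVert w \rVert^2$ dominates and a wider, softer margin is preferred.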

We first load the MTI articles and their corresponding labels. The data is then split into training and test sets using predefined indices, ensuring they are identical to those used for the LSTM model. Finally, we combine the train and test articles and labels into two lists.

In [1]:
import csv

# Open and read articles from the 'articles.txt' file 
with open('MediaTenor_data/articles.txt', 'r', encoding = 'utf-8') as f:
    articles = f.read().split('\n')  # Splitting into a list of articles

# Open and read labels from the 'labels_binary.txt' file    
with open('MediaTenor_data/labels_binary.txt', 'r', encoding = 'utf-8') as f:
    labels = f.read().split('\n')  # Splitting into a list of labels
    
# Load the train indices from the CSV file
with open('train_indices.csv', 'r') as f:
    reader = csv.reader(f)
    train_indices = list(map(int, next(reader))) 
    
# Load the test indices from the CSV file
with open('test_indices.csv', 'r') as f:
    reader = csv.reader(f)
    test_indices = list(map(int, next(reader))) 

# Filter articles and labels for the training set
train_articles = [articles[i] for i in train_indices]
train_labels = [labels[i] for i in train_indices]

# Filter articles and labels for the test set
test_articles = [articles[i] for i in test_indices]
test_labels = [labels[i] for i in test_indices]

# Combine train and test articles
all_articles = train_articles + test_articles

# Combine train and test labels
all_labels = train_labels + test_labels

We pre-process the articles by retaining only the sentences that contain at least one word related to business cycle conditions. Then, we perform the following normalization steps: we lowercase all articles, remove URLs, remove punctuation, and strip non-alphabetic characters. Additionally, we remove multiple spaces, single-letter tokens, stopwords, and metadata from the text. Finally, we lemmatize the articles to reduce words to their base forms.
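
The sentence filter itself lives in the `keep_economy_related_sentences` module, which is not shown in this notebook. The sketch below is only a guess at the idea, under simplifying assumptions: the function name `keep_related_sentences` is hypothetical, sentences are split with a naive regular expression rather than a proper tokenizer, and a sentence is kept when one of its lowercased tokens exactly matches a keyword.

```python
import re

def keep_related_sentences(article, keywords):
    """Keep only the sentences that contain at least one keyword (case-insensitive)."""
    keywords = {w.lower() for w in keywords}
    # Naive sentence split: break after '.', '!' or '?' followed by whitespace
    sentences = re.split(r'(?<=[.!?])\s+', article.strip())
    kept = []
    for sentence in sentences:
        tokens = set(re.findall(r'\w+', sentence.lower()))
        if tokens & keywords:  # at least one keyword occurs in the sentence
            kept.append(sentence)
    return ' '.join(kept)

example = "Die Konjunktur schwächelt. Das Wetter ist schön. Die Wirtschaft wächst wieder."
print(keep_related_sentences(example, ['Konjunktur', 'Wirtschaft']))
```

In this toy example the middle sentence is dropped because it contains neither keyword. The real module may match inflected forms or compounds differently.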

In [2]:
import os
import nltk
nltk.download('punkt_tab')
import multiprocessing as mp 
from datetime import datetime
from functools import partial
import keep_economy_related_sentences

NUM_CORE = 60 # set the number of cores to use

# Set the path variable to point to the 'word_embeddings' directory.
path = os.getcwd().replace('\\sentiment', '') + '\\word_embeddings'

# Load words related to 'Wirtschaft' and 'Konjunktur'
konjunktur_words = keep_economy_related_sentences.load_words(path + '\\konjunktur_synonyms.txt')
wirtschaft_words = keep_economy_related_sentences.load_words(path + '\\wirtschaft_synonyms.txt')

# Combine the two lists
economy_related_words = konjunktur_words + wirtschaft_words

startTime = datetime.now() 

if __name__ == "__main__":
    pool = mp.Pool(NUM_CORE)
    inputs = zip(all_articles, [economy_related_words]*len(all_articles))
    economy_related_sentences = pool.starmap(keep_economy_related_sentences.keep_economy_related_sentences, inputs) 
    pool.close()
    pool.join()
    
print(datetime.now()-startTime)

print(economy_related_sentences[0])

[nltk_data] Downloading package punkt_tab to
[nltk_data]     C:\Users\mokuneva\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


0:00:52.803050
SPIEGEL: Herr Präsident, in einem Monat beginnen in Rio de Janeiro die Olympischen Spiele, aber das Land befindet sich in einer schweren politischen und wirtschaftlichen Krise. Die USA haben sich nicht wirklich erholt, Europa steckt weiter in der Krise, Chinas Wirtschaftsleistung geht zurück. Lula: Ich habe Dilma immer gesagt: Du redest zu viel mit Ökonomen, wir brauchen mehr Politik. Es ist möglich, dass viele von euch ihren Job verlieren, aber wenn ihr jetzt Angst habt und nicht mehr konsumiert, dann steigt die Arbeitslosigkeit erst recht. Lula: Ich denke, man muss die Dinge im Zusammenhang sehen: der wirtschaftliche Abschwung, der knappe Ausgang der letzten Wahlen, das vergiftete Klima in einer immer stärker polarisierten Gesellschaft. Viel mehr Sorgen bereitet mir, dass es in unserer Demokratie offenbar möglich ist, ein Opfer solcher Lügen zu werden.


In [3]:
# Concatenate the retained sentences of all articles into one string, one article per line
articles = ''.join(article + ' \n' for article in economy_related_sentences)

In [4]:
import re
import codecs
import spacy
from string import punctuation

def remove_multiple_spaces(text):
    """
    This function collapses runs of whitespace in a string.
    It uses a regular expression to match 2 or more whitespace characters and replaces them with a single space.
    """
    text = re.sub(r'\s{2,}', ' ', text)
    return text

def remove_short_words(text):
    """
    This function removes words of length 1 from a string.
    """
    text = ' '.join([word for word in text.split() if len(word) > 1])
    return text

def remove_metadata(text, meta_list):
    """
    This function removes metadata from a text.
    Metadata is a list of phrases. If any of these phrases are found in the text,
    everything from the phrase and onwards is cut off.
    """
    for phrase in meta_list:
        if phrase in text:
            # Cut the text at the first occurrence of the matched phrase
            text = text.split(phrase, 1)[0]
    return text

# Initialize an empty list to hold stopwords
stopwords = []

# Read in a list of German stopwords from a text file
with codecs.open('stopwords.txt', 'r', 'utf-8-sig') as input_data:
    for line in input_data:
        # Remove trailing whitespaces and convert to lowercase, then append the stopword
        stopwords.append(line.strip().lower())
        
# Define stopwords that should be kept for sentiment analysis (e.g., negation words)
sw_keep = ['kein', 'keine', 'keinem', 'keinen', 'keiner', 'keines', 'nicht', 'nichts', 
           'aber', 'doch', 'gegen', 'ohne', 'sondern', 'sonst']

# Make sure that important stopwords like 'kein' and 'nicht' are not removed from the articles
stopwords = [word for word in stopwords if word not in sw_keep]

def remove_stopwords(text, stopwords=stopwords):
    """
    Removes stopwords from the input text based on the defined stopwords list.
    """
    # Split the text into words, filter out the stopwords, and join back into a string
    cleaned_text = ' '.join([word for word in text.split() if word not in stopwords])
    return cleaned_text

# Load the SpaCy model
nlp = spacy.load('de_core_news_md')

def lemmatization(text):
    """
    Lemmatizes the input text using SpaCy's German language model.
    """
    # Apply the SpaCy model to process the text
    doc = nlp(text)
    
    # Return the lemmatized text by joining lemmatized tokens
    return ' '.join([token.lemma_ for token in doc])

# List of metadata phrases
metadata_phrases = ['dokument bihann', 'dokument bid', 'dokument welt', 'dokument bberbr', 'dokument focus']

# Convert articles to lowercase
articles = articles.lower()

# Remove URLs
articles = re.sub(r'https?\S+|www\.\S+', '', articles)

# Remove punctuation
articles = articles.replace('.', ' ').replace('-', ' ').replace('/', ' ')
articles = ''.join([c for c in articles if c not in punctuation and c not in ['»', '«']])

# Remove non-alphabetic characters from the text
articles = ''.join([c for c in articles if (c.isalpha() or c in [' ', '\n'])])

# Split articles by new lines
articles_split = articles.split('\n')[:-1]

# Remove multiple spaces, single-letter tokens, stopwords and metadata
articles_split = list(map(remove_multiple_spaces, articles_split))
articles_split = list(map(remove_short_words, articles_split))
articles_split = list(map(remove_stopwords, articles_split))
articles_split = list(map(lambda text: remove_metadata(text, metadata_phrases), articles_split))

# Lemmatize each article
articles_split = list(map(lemmatization, articles_split))
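
To see what the normalization does to a single string, the following condensed sketch applies the same steps to one made-up sentence. It is an illustration only: the helper name `normalize`, the example sentence and the tiny stopword set are assumptions, and metadata removal and lemmatization are left out so that the snippet runs without the stopword file or the SpaCy model.

```python
import re
from string import punctuation

def normalize(text, stopwords):
    """Lowercase, strip URLs, punctuation and non-letters, then drop short words and stopwords."""
    text = text.lower()
    text = re.sub(r'https?\S+|www\.\S+', '', text)                      # remove URLs
    text = text.replace('.', ' ').replace('-', ' ').replace('/', ' ')   # separators become spaces
    text = ''.join(c for c in text if c not in punctuation and c not in '»«')
    text = ''.join(c for c in text if c.isalpha() or c == ' ')          # drop digits and symbols
    tokens = [w for w in text.split() if len(w) > 1 and w not in stopwords]
    return ' '.join(tokens)

demo_stopwords = {'die', 'ist', 'um', 'und'}  # hypothetical mini stopword list
raw = "Die Konjunktur ist 2016 um 1,9 % gewachsen - siehe www.example.org und »Bericht« S. 3."
print(normalize(raw, demo_stopwords))
# konjunktur gewachsen siehe bericht
```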

We then encode the labels in binary form: 'positive' (the positive or no clear tone class) is mapped to 1, and 'negative' is mapped to 0.

In [5]:
import numpy as np

# Convert labels to binary format: 1 for 'positive' and 0 for 'negative'
encoded_labels = np.array([1 if label == 'positive' else 0 for label in all_labels])

Next, we split the pre-processed articles and encoded labels into training and test sets.

In [6]:
# Split the articles and labels into training and test sets
X_train = articles_split[:len(train_articles)]
X_test = articles_split[len(train_articles):]
y_train = encoded_labels[:len(train_articles)]
y_test = encoded_labels[len(train_articles):]

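We use a TF-IDF vectorizer to convert the content of each article into a numeric vector. The TF-IDF statistic reflects how important a word is to a document in a corpus:

$$\text{tf-idf}_{v,d} = \log(1+N_{v,d}) \times \log\left(\frac{D}{D_v}\right)$$

where
$N_{v,d}$ – the count of the token $v$ in the document $d$,
$D_v$ – the number of documents that contain the term $v$,
$D$ – the total number of documents in the corpus.

Note that this is the general form of the statistic. scikit-learn's `TfidfVectorizer`, configured with `smooth_idf=False` and otherwise default settings as in the next cell, uses the raw count $N_{v,d}$ as the term frequency, computes the inverse document frequency as $\ln(D/D_v) + 1$, and L2-normalizes each document vector.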

In [7]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Initialize TfidfVectorizer to convert the text into a matrix of TF-IDF features
# smooth_idf=False: This prevents adding 1 to the document frequencies.
# Since we are not using a predefined vocabulary, we don't need this smoothing adjustment.
tf_idf = TfidfVectorizer(smooth_idf=False)

# Fit the vectorizer on the training data and transform it into a document-term matrix
# This learns the vocabulary and calculates the inverse document frequencies (IDF)
X_train_tfidf = tf_idf.fit_transform(X_train)

# Transform the test data using the same vocabulary and IDF values learnt from the training data
X_test_tfidf = tf_idf.transform(X_test)

We perform cross-validation to tune the hyperparameter `C`, which controls the hardness of the margin in the `LinearSVC` model. For each value of `C` from 0.1 to 1.9 in steps of 0.1, the F1 score (weighted by class support) is calculated using stratified 5-fold cross-validation on the training set. The value of `C` with the highest average F1 score is selected.

$$F1 = 2 \times \frac{\text{precision} \times \text{recall}}{\text{precision} + \text{recall}}$$

where

$$\text{precision} = \frac{\text{True Positives}}{\text{Predicted Yes}}$$

$$\text{recall} = \frac{\text{True Positives}}{\text{Actual Yes}}$$

The F1 score is the harmonic mean of precision and recall, so it is high only when both are high.
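
To make "weighted by class support" concrete, the sketch below computes the per-class F1 scores for a small set of made-up labels and averages them with weights equal to the number of true instances per class. The label arrays are assumptions for illustration. The result is compared with scikit-learn's `f1_score(..., average='weighted')`.

```python
import numpy as np
from sklearn.metrics import f1_score

# Hypothetical true and predicted labels (4 negatives, 6 positives)
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1])
y_pred = np.array([0, 0, 0, 1, 1, 1, 1, 1, 0, 0])

per_class_f1 = f1_score(y_true, y_pred, average=None)  # F1 for class 0 and class 1
support = np.bincount(y_true)                          # true instances per class
manual_weighted = np.sum(per_class_f1 * support) / support.sum()

print("per-class F1:", np.round(per_class_f1, 3))
print("manual weighted F1: ", round(manual_weighted, 3))
print("sklearn weighted F1:", round(f1_score(y_true, y_pred, average='weighted'), 3))
```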

In [8]:
from sklearn.model_selection import StratifiedKFold, cross_val_score

def get_f1_CV(model):
    """
    Calculate the average F1 score using cross-validation.
    
    Parameters:
    model: The machine learning model to evaluate.
    
    Returns:
    float: The average F1 score from cross-validation.
    """
    # Set up Stratified K-Folds cross-validator with 5 folds
    # StratifiedKFold ensures that each fold maintains the class distribution
    kf = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
    
    # Perform cross-validation and calculate the F1 scores
    # scoring="f1_weighted" calculates the F1 score, weighted by support (number of true instances for each label)
    f1_scores = cross_val_score(model, X_train_tfidf, y_train, scoring="f1_weighted", cv=kf)
    
    # Return the average F1 score across the folds
    return f1_scores.mean()

In this case, a `C` value of 0.1 is found to yield the best performance.

In [9]:
from sklearn.svm import LinearSVC
import pandas as pd

# Hyperparameter tuning for LinearSVC using cross-validation
# class_weight='balanced': Adjusts weights inversely proportional to class frequencies in the input data

# Candidate values of C: 0.1, 0.2, ..., 1.9 (np.arange excludes the stop value 2)
c_values = np.arange(0.1, 2, 0.1)

# Perform cross-validation for each candidate and store the average F1 score
res = pd.Series([get_f1_CV(LinearSVC(C=c, class_weight='balanced', dual="auto"))
                 for c in c_values],
                index=c_values)

# Find the value of C that results in the highest F1 score
best_c = np.round(res.idxmax(), 2)
print('Best C: ', best_c)

Best C:  0.1


We then fit the model with the optimal `C` parameter found through cross-validation to the training data and use it to predict the labels for the test data.

In [10]:
# Initialize the LinearSVC model with the best hyperparameter C found earlier
# 'C=best_c': Best regularization parameter selected from cross-validation
svc_model = LinearSVC(C=best_c, class_weight='balanced', dual="auto")

# Fit the model to the training data (TF-IDF transformed)
svc_model.fit(X_train_tfidf, y_train)

# Use the trained model to predict the labels for the test data
predictions = svc_model.predict(X_test_tfidf)

Finally, we evaluate the performance of the LSVM model using a classification report and confusion matrix. The overall accuracy is 66.4%, which is nearly identical to the 66.8% accuracy of the LSTM model. The F1 scores are also similar: 0.66 for LSVM and 0.67 for LSTM. This suggests that both models perform at a comparable level, likely because the training set (1920 articles) is relatively small, limiting the advantage neural networks typically have with larger datasets. However, there is one notable difference: the LSVM is better at identifying negative articles, with a recall (per-class accuracy) of 70% for the negative class and 63% for the positive/no clear tone class. In contrast, the LSTM shows the opposite pattern, with 71% for positive/no clear tone articles and 62% for the negative class.

In [11]:
from sklearn.metrics import classification_report, confusion_matrix

# Generate and print the classification report
# This report includes precision, recall, F1-score, and support for each class
print("Classification Report:")
print(classification_report(y_test, predictions))

# Compute and display the confusion matrix
# The matrix aligns true labels with rows and predicted labels with columns
print("Confusion Matrix:")
conf_matrix = confusion_matrix(y_test, predictions)
print(conf_matrix)

Classification Report:
              precision    recall  f1-score   support

           0       0.64      0.70      0.67       124
           1       0.69      0.63      0.66       132

    accuracy                           0.66       256
   macro avg       0.67      0.67      0.66       256
weighted avg       0.67      0.66      0.66       256

Confusion Matrix:
[[87 37]
 [49 83]]
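
The figures in the classification report can be reproduced directly from the confusion matrix printed above, which makes it easier to see where the overall accuracy and the per-class values come from. The sketch below only re-enters the four printed counts. Rows are true labels and columns are predicted labels.

```python
import numpy as np

# Confusion matrix copied from the output above (rows: true class, columns: predicted class)
conf_matrix = np.array([[87, 37],
                        [49, 83]])

accuracy = np.trace(conf_matrix) / conf_matrix.sum()
recall = np.diag(conf_matrix) / conf_matrix.sum(axis=1)     # correct / all true members of the class
precision = np.diag(conf_matrix) / conf_matrix.sum(axis=0)  # correct / all predictions of the class
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy:  {accuracy:.3f}")         # 0.664
print("recall:   ", np.round(recall, 2))     # class 0: 0.70, class 1: 0.63
print("precision:", np.round(precision, 2))  # class 0: 0.64, class 1: 0.69
print("f1:       ", np.round(f1, 2))         # class 0: 0.67, class 1: 0.66
```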
