# Baseline system for DBO detection

The Jupyter notebook ‘dbo_baseline.ipynb’ contains the code for creating the baseline system for fine-grained detection of various types of attacks on the **free democratic basic order (DBO)** (**subtask 2** of the Shared Task on Harmful Content Detection). A gradient boosting algorithm was chosen for classification, using sentence embeddings and a polarity score as features. The notebook includes the training of the system as well as the prediction on the test data and the evaluation. 

The programme was tested using Python version 3.12.9. Running the following two cells writes a `requirements.txt` file and installs all required packages.

In [1]:
%%writefile requirements.txt

pandas==2.2.3
spacy==3.8.2
scikit-learn==1.6.1
textblob==0.15.3
textblob-de==0.4.3
sentence-transformers==4.1.0
nltk==3.9.1
numpy==2.0.1

Overwriting requirements.txt


In [None]:
%pip install -r requirements.txt 

## 1. Importing training data 

First, the training data was read in. 

In [3]:
import pandas as pd 

filename = 'dbo_train.csv' # Path needs to be adjusted  
# Reading in training data 
train_dbo = pd.read_csv(filename, sep=';')
train_dbo.drop('id', axis=1, inplace=True) 
train_dbo.head()

                                         description      DBO
0  Der Riese ist geweckt,mit oder ohne Verräter u...  nothing
1                   Gut Ding will Weile haben... (y)  nothing
2                 Sollen sie doch nach Saudi Arabien  nothing
3                                Volle Zustimmung.??  nothing
4  Mal sehen wann wir an der Erderwärmung schuld ...  nothing


The class distribution of the training data was analysed. 

In [4]:
# Absolute number of instances in each class 
class_counts_dbo = train_dbo["DBO"].value_counts()

# Relative number of instances in each class 
class_percent_dbo = train_dbo['DBO'].value_counts(normalize=True) * 100

# Summarise into a dataframe
class_table_dbo = pd.DataFrame({
    'Frequency': class_counts_dbo,
    'Percentage': class_percent_dbo.round(2)
})

print(class_table_dbo)

            Frequency  Percentage
DBO                              
nothing          6277       84.21
criticism         804       10.79
agitation         313        4.20
subversive         60        0.80
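
The percentages in the table can be reproduced from the absolute counts. The following minimal sketch does this with plain Python; the counts are copied from the output above, while the variable names and the imbalance ratio are additions for illustration only.

```python
# Absolute class frequencies as printed in the table above
counts = {"nothing": 6277, "criticism": 804, "agitation": 313, "subversive": 60}

# Share of each class in percent, rounded to two decimals
total = sum(counts.values())
percentages = {label: round(100 * n / total, 2) for label, n in counts.items()}

# Ratio between the largest and the smallest class (illustrative addition)
imbalance_ratio = counts["nothing"] / counts["subversive"]

print(total)
print(percentages)
print(round(imbalance_ratio, 1))
```

The smallest class, *subversive*, accounts for less than 1 % of the training instances, so the data set is strongly imbalanced.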


## 2. Data cleaning

The training data was then pre-processed. First, basic cleaning steps were carried out: lowercasing, removing URLs, mentions and hashtags, replacing numbers with the placeholder `NUM`, and collapsing repeated whitespace.

In [5]:
import re


def clean_text(text):
    text = str(text).lower()
    text = re.sub(r'https?://\S+|www\.\S+', '', text)  # Remove URLs
    text = re.sub(r'@\w+', '', text)  # Remove mentions
    text = re.sub(r'#\w+', '', text)  # Remove hashtags
    text = re.sub(r'\d+', ' NUM ', text)  # Replace numbers
    text = re.sub(r'\s+', ' ', text).strip()  # Remove extra spaces
    return text
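
To make the effect of the individual substitutions concrete, the following minimal sketch applies the same steps to a made-up example sentence. The sentence, the user handle and the URL are invented for illustration and do not come from the data set.

```python
import re


def clean_text(text):
    # Same steps, in the same order, as in the notebook cell above
    text = str(text).lower()
    text = re.sub(r'https?://\S+|www\.\S+', '', text)  # Remove URLs
    text = re.sub(r'@\w+', '', text)  # Remove mentions
    text = re.sub(r'#\w+', '', text)  # Remove hashtags
    text = re.sub(r'\d+', ' NUM ', text)  # Replace numbers
    text = re.sub(r'\s+', ' ', text).strip()  # Remove extra spaces
    return text


# Invented example input
example = "Schaut mal @user123 https://example.com #Wahl 2024  heute!"
cleaned = clean_text(example)
print(cleaned)  # schaut mal NUM heute!
```

Because mentions and hashtags are removed before numbers are replaced, the digits in `@user123` disappear together with the mention and do not produce an additional `NUM` token. Note also that the whole hashtag is removed, including the word after the `#`.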

The tweets were then lemmatised and tokenised. 

In [None]:
# Download the Spacy Pipeline for the German language 
! python -m spacy download de_core_news_md

Collecting de-core-news-md==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/de_core_news_md-3.8.0/de_core_news_md-3.8.0-py3-none-any.whl (44.4 MB)
Installing collected packages: de-core-news-md
Successfully installed de-core-news-md-3.8.0
✔ Download and installation successful
You can now load the package via spacy.load('de_core_news_md')


In [16]:
import spacy

nlp = spacy.load('de_core_news_md')

# Defining the lemmatisation function 
def text_lemmatize_tokenize(texts):
    lemmatized = []
    for doc in nlp.pipe(texts, batch_size=30):
        tokens = [token.lemma_.lower() for token in doc if not token.is_punct]
        lemmatized.append(' '.join(tokens))
    return lemmatized

The two functions for removing certain tokens and for lemmatisation were applied to the training data. 

In [17]:
# Removing URLs, hashtags and mentions 
train_dbo['description'] = train_dbo['description'].apply(clean_text)

In [18]:
# Lemmatisation and tokenisation
# .copy() avoids a possible SettingWithCopyWarning on the assignment below
train_dbo = train_dbo[train_dbo['description'].notnull()].copy()
texts = train_dbo['description'].astype(str).tolist()
train_dbo['description'] = text_lemmatize_tokenize(texts)

## 3. Feature extraction

Next, the tweets from the training data were converted into a feature representation. Two kinds of features were used: a polarity score and a sentence embedding of each tweet. Polarity was determined using `textblob-de`, the German language extension of the TextBlob library.

In [19]:
from textblob_de import TextBlobDE

# Function for determining the polarity of a tweet
def add_polarity(df):
    def calculate_polarity(text):
        # Polarity score of the text according to TextBlobDE
        return TextBlobDE(text).sentiment.polarity

    # Store one polarity value per tweet in a new column
    df['polarity'] = df['description'].apply(calculate_polarity)
    return df

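Sentence-BERT was used to extract sentence embeddings, via the `sentence-transformers` library and the multilingual model `distiluse-base-multilingual-cased-v2`.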

In [20]:
from sentence_transformers import SentenceTransformer


def add_semantic_features(df):
    sentence_model = SentenceTransformer('distiluse-base-multilingual-cased-v2')
    texts = df['description'].astype(str).values.tolist()

    embeddings = sentence_model.encode(texts, show_progressbar=True)
    embeddings_df = pd.DataFrame(embeddings, columns=[f'embedding_{i}' for i in range(embeddings.shape[1])])

    df = pd.concat([df.reset_index(drop=True), embeddings_df.reset_index(drop=True)], axis=1)
    return df

The features were extracted from the tweets in the training data. 

In [21]:
import nltk
nltk.download('punkt')
nltk.download('vader_lexicon')
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt to \\na2.hs-
[nltk_data]     mittweida.de\felser\Wappscfg\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package vader_lexicon to \\na2.hs-
[nltk_data]     mittweida.de\felser\Wappscfg\nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!
[nltk_data] Downloading package punkt_tab to \\na2.hs-
[nltk_data]     mittweida.de\felser\Wappscfg\nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


True

In [22]:
# Extraction of the polarity score
train_dbo = add_polarity(train_dbo)

In [23]:
# Extraction of embedding features 
train_dbo = add_semantic_features(train_dbo)
print(train_dbo.head())

                                         description      DBO  polarity  \
0  der riese sein wecken mit oder ohne verräter u...  nothing       0.0   
1                      gut ding wollen weile haben y  nothing       1.0   
2                 sollen sie doch nach saudi arabien  nothing       0.0   
3                                    voll zustimmung  nothing       0.0   
4  mal sehen wann wir an der erderwärmung schuld ...  nothing       0.0   

   embedding_0  embedding_1  embedding_2  embedding_3  embedding_4  \
0    -0.028958    -0.005247     0.031000    -0.046794     0.036813   
1     0.011647    -0.010070     0.002322    -0.018418     0.014898   
2     0.009813     0.015485    -0.003578     0.011890     0.023747   
3    -0.015497    -0.039983    -0.031479     0.009015     0.036269   
4     0.003369    -0.109731    -0.015453     0.055244    -0.028149   

   embedding_5  embedding_6  ...  embedding_502  embedding_503  embedding_504  \
0    -0.011903     0.011446  ...      -0.062236

## 4. Encoding labels and training the model

Before the classifier could be trained, further adjustments to the training data were necessary. In particular, the class labels (*subversive*, *agitation*, *criticism*, *nothing*) were mapped to numerical values. 

In [24]:
# Encode each label as its index in the alphabetically ordered label list
train_dbo['DBO_encoded'] = train_dbo['DBO'].apply(lambda x: ['agitation', 'criticism', 'nothing', 'subversive'].index(x))
X_train = train_dbo.drop(columns=['description', 'DBO', 'DBO_encoded'])
y_train = train_dbo['DBO_encoded']
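
The `.index(x)` lookup assigns each label the integer of its position in the list, so *agitation* becomes 0 and *subversive* becomes 3. The following minimal sketch makes this mapping explicit and shows how numeric predictions could be decoded again. The names `LABELS`, `label_to_id` and `id_to_label` are hypothetical helpers and are not part of the notebook.

```python
# Label order as used in the notebook cell above
LABELS = ['agitation', 'criticism', 'nothing', 'subversive']

# Hypothetical helper dictionaries (not part of the original notebook)
label_to_id = {label: i for i, label in enumerate(LABELS)}
id_to_label = {i: label for label, i in label_to_id.items()}

# Encode a few labels and decode them again
encoded = [label_to_id[x] for x in ['nothing', 'criticism', 'subversive']]
decoded = [id_to_label[i] for i in encoded]

print(label_to_id)
print(encoded)
print(decoded)
```

With pandas, `train_dbo['DBO'].map(label_to_id)` would presumably give the same encoding as the `apply` call. One difference is that `map` yields `NaN` for unknown labels, whereas `list.index` raises a `ValueError`.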

A gradient boosting algorithm was chosen for classification. 

In [None]:
from sklearn.ensemble import GradientBoostingClassifier

model = GradientBoostingClassifier(n_estimators=100, max_depth=3, random_state=42)
model.fit(X_train, y_train)

## 5. Prediction on the test data

Predictions were then made on the test data using the trained gradient boosting model. For this purpose, the test data was preprocessed in the same way as the training data. 

In [26]:
# Importing the test data
filename = "dbo_test.csv" # Path needs to be adjusted 
test_dbo = pd.read_csv(filename, sep=';')

In [27]:
# Removing URLs, hashtags and mentions 
test_dbo['description'] = test_dbo['description'].apply(clean_text)

# Lemmatisation and tokenisation 
test_dbo['description'] = text_lemmatize_tokenize(test_dbo['description'].tolist())

The same features were extracted from the test data as from the training data.

In [28]:
# Extraction of the polarity score
test_dbo = add_polarity(test_dbo)

# Extraction of embedding features 
test_dbo = add_semantic_features(test_dbo)

The extracted features of the test data were passed to the gradient boosting model for prediction.

In [29]:
# Restrict test data set to features 
X_test = test_dbo.drop(columns=['id', 'description'])
# Prediction 
y_test_pred = model.predict(X_test)

## 6. Evaluation of results 

The predictions based on the test data were compared with the gold standard and some basic evaluation metrics were calculated. The results achieved serve as a guide and baseline for the competition participants. 

In [30]:
# Importing the gold standard 
filename = "dbo_gold.csv" # Path needs to be adjusted 
gold_dbo = pd.read_csv(filename, sep=';')

In [32]:
# Encode the labels with the same mapping as for the training data
gold_dbo['DBO'] = gold_dbo['DBO'].apply(lambda x: ['agitation', 'criticism', 'nothing', 'subversive'].index(x))

In [33]:
# Check that the IDs from the test data and the gold standard are in the same order. 
gold_dbo["id"].tolist() == test_dbo["id"].tolist()

True
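
Here the check returns `True`, so no reordering is needed. If the orders ever differed, the gold labels would have to be aligned with the test data before evaluation. The following is a minimal sketch of such an alignment in plain Python; the IDs and labels are invented toy values, and all variable names are hypothetical.

```python
# Toy data standing in for the 'id' columns and the gold labels (invented values)
test_ids = [11, 12, 13]
gold_ids = [13, 11, 12]
gold_labels = ['nothing', 'criticism', 'agitation']

# Look up each gold label by its id, then read the labels out in test order
label_by_id = dict(zip(gold_ids, gold_labels))
aligned_labels = [label_by_id[i] for i in test_ids]

print(aligned_labels)  # ['criticism', 'agitation', 'nothing']
```

With the data frames from this notebook, a comparable result could be obtained with `gold_dbo.set_index('id').loc[test_dbo['id']]`, assuming the IDs are unique.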

In [34]:
# Extracting the actual label of the test data
y_true = gold_dbo['DBO']

The macro-averaged F1 score is the main evaluation metric and determines the ranking on the competition leaderboard. In addition, precision, recall and F1 are reported for the individual classes, together with their macro and weighted averages.

In [35]:
from sklearn.metrics import classification_report

# Rows 0-3 correspond to agitation, criticism, nothing and subversive
test_report = classification_report(y_true, y_test_pred)
print("Test Classification Report:")
print(test_report)

Test Classification Report:
              precision    recall  f1-score   support

           0       0.07      0.01      0.02       134
           1       0.59      0.29      0.39       345
           2       0.87      0.97      0.92      2690
           3       0.00      0.00      0.00        25

    accuracy                           0.85      3194
   macro avg       0.38      0.32      0.33      3194
weighted avg       0.80      0.85      0.82      3194
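
The two averages in the report can be recomputed from the per-class rows. The following minimal sketch uses the rounded F1 scores and supports printed above, so small deviations from the exact values are possible in general. The variable names are chosen for illustration.

```python
# Per-class F1 scores and supports copied from the report above
# (class order: agitation, criticism, nothing, subversive)
f1 = [0.02, 0.39, 0.92, 0.00]
support = [134, 345, 2690, 25]

# Macro average: unweighted mean over the four classes
macro_f1 = sum(f1) / len(f1)

# Weighted average: mean weighted by the number of gold instances per class
weighted_f1 = sum(f * s for f, s in zip(f1, support)) / sum(support)

print(round(macro_f1, 2))     # 0.33
print(round(weighted_f1, 2))  # 0.82
```

The macro average treats all four classes equally, so the two rare classes with F1 scores close to zero pull it far below the weighted average, which is dominated by the large *nothing* class.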

