### Traditional Machine Learning Models (see Section 3.1 of the thesis)

This notebook trains and tests the traditional machine learning models proposed in a master's thesis. The models are implemented with scikit-learn, a popular machine learning library. To find the best setup, each model's hyperparameters are tuned with `GridSearchCV`, an exhaustive grid search that here uses 3-fold cross-validation.

It's worth noting that the code in this notebook runs entirely on the CPU and does not require a GPU setup.

Please keep in mind that these notebooks are primarily used for conducting experiments, live coding, and implementing and evaluating the approaches presented in the thesis. As a result, the code in this notebook may not strictly adhere to best practice coding standards.



In summary, this notebook provides an implementation and evaluation of traditional machine learning models using Scikit-learn, with a focus on experimentation and the application of approaches discussed in a master's thesis.
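
The tuning pattern used for every model in this notebook can be sketched as follows. This is a minimal illustration on the Iris toy dataset, which is an assumption made purely so the snippet runs on its own; the `_toy` variable names are hypothetical and do not appear in the thesis code. It also shows how to read the selected setup from `best_params_`, which the cells below do not print.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Toy data standing in for the thesis corpus (assumption, illustration only).
X_toy, y_toy = load_iris(return_X_y=True)
X_toy_train, X_toy_test, y_toy_train, y_toy_test = train_test_split(
    X_toy, y_toy, test_size=0.3, stratify=y_toy, random_state=0
)

# Every combination in the grid is scored with 3-fold cross-validation.
search = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={"n_neighbors": [3, 5, 7], "weights": ["uniform", "distance"]},
    cv=3,
)
search.fit(X_toy_train, y_toy_train)

# The best combination and its mean cross-validated accuracy.
print(search.best_params_)
print(round(search.best_score_, 3))

# After fitting, the search object predicts with the refitted best estimator.
test_accuracy = search.score(X_toy_test, y_toy_test)
print(round(test_accuracy, 3))
```

Because `refit=True` is the default, calling `predict` or `score` on the search object uses the best estimator retrained on the full training split.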

In [1]:
import joblib
import re
import string

import numpy as np
import pandas as pd

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import classification_report
from sklearn.model_selection import StratifiedKFold, train_test_split

# import the relevant models
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn import svm

from sklearn.model_selection import GridSearchCV
from nltk.corpus import stopwords
import nltk

# downloading stopwords database
nltk.download('stopwords')

# Load the train and test sets (with trigger set).

def import_test_train(use_drive):
  """
  Load the test and train sets and return them as (df_test, df_train).

  :param use_drive: If True, mount Google Drive (Colab only) and read the CSV files from there.
  If False, read them from the Experiment folder in the parent of the current working directory.
  """

  assert isinstance(use_drive, bool), f"Type is not valid. Expected boolean, received: {type(use_drive)}"

  if use_drive:
    from google.colab import drive
    drive.mount('/content/gdrive')

    df_test = pd.read_csv('/content/gdrive/MyDrive/Experiment/testset_DE_Trigger.csv')
    df_train = pd.read_csv('/content/gdrive/MyDrive/Experiment/trainset_DE_Trigger.csv')

    return df_test, df_train

  else:
    import os

    # Change into the parent directory, where the Experiment folder is expected.
    # Note: this changes the working directory for all later cells.
    os.chdir('..')

    df_test = pd.read_csv('./Experiment/testset_DE_Trigger.csv')
    df_train = pd.read_csv('./Experiment/trainset_DE_Trigger.csv')

    return df_test, df_train

# Load the test and train sets from Google Drive
df_test, df_train = import_test_train(True)

# To load the data locally instead, make sure to execute the notebooks from the root directory of this project and uncomment the following line:
# df_test, df_train = import_test_train(False)

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).


### Simple Feature Engineering

In [8]:
# Simple text preprocessing: lowercase, replace ASCII punctuation with spaces, collapse whitespace
def process_text(text):
    text = str(text).lower()
    text = re.sub(
        f"[{re.escape(string.punctuation)}]", " ", text
    )
    text = " ".join(text.split())
    return text

# clean the train and test sets
df_test["content"] = df_test.content.map(process_text)
df_train["content"] = df_train.content.map(process_text)
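
To make the effect of `process_text` concrete, here is a small sketch that applies the same function to a few hand-written inputs (the example strings are invented for illustration). It highlights two properties worth keeping in mind: `string.punctuation` covers ASCII punctuation only, so German typographic quotes survive this step, and missing values are turned into the literal string `"nan"` by `str(...)`.

```python
import re
import string


def process_text(text):
    # Same logic as the notebook cell above, repeated so the snippet is self-contained.
    text = str(text).lower()
    text = re.sub(f"[{re.escape(string.punctuation)}]", " ", text)
    return " ".join(text.split())


cleaned_ascii = process_text("Hallo, Welt!  Wie geht's?")
cleaned_quotes = process_text("„Schön“, sagte er.")
cleaned_missing = process_text(float("nan"))

print(cleaned_ascii)    # hallo welt wie geht s
print(cleaned_quotes)   # „schön“ sagte er
print(cleaned_missing)  # nan
```

The leftover typographic quotes should be harmless downstream, since the default `CountVectorizer` token pattern only matches word characters, but rows that were empty in the CSV will contribute a `nan` token unless they are dropped beforehand.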

In [10]:
# load the German stop word list from NLTK
german_stop_words = stopwords.words('german')

# bag-of-words over unigrams, bigrams and trigrams, with German stop words removed
vec = CountVectorizer(
    ngram_range=(1, 3),
    stop_words=german_stop_words,
)

In [11]:
# Vectorize the texts: fit the vocabulary on the train set only, then apply it to the test set.
X_train = vec.fit_transform(df_train.content)
X_test = vec.transform(df_test.content)

y_train = df_train.label_id
y_test = df_test.label_id
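
The following sketch illustrates what the vectorizer produces, using two invented sentences and a three-word stop list as a stand-in for the NLTK German list (both are assumptions for illustration). It shows that stop words are dropped before n-grams are built, and that n-grams unseen during `fit_transform` are ignored by `transform`.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Tiny stand-in for stopwords.words('german') (assumption for illustration).
toy_stop_words = ["der", "die", "das"]
toy_vec = CountVectorizer(ngram_range=(1, 3), stop_words=toy_stop_words)

# The vocabulary is learned from the training text only.
X_toy_train = toy_vec.fit_transform(["der hund jagt die katze"])
X_toy_test = toy_vec.transform(["die katze jagt den vogel"])

features = sorted(toy_vec.vocabulary_)
print(features)
# ['hund', 'hund jagt', 'hund jagt katze', 'jagt', 'jagt katze', 'katze']

# Only 'jagt' and 'katze' are known; 'den', 'vogel' and 'katze jagt' are dropped.
print(X_toy_test.toarray())
# [[0 0 0 1 0 1]]
```

Note that `jagt katze` becomes a bigram even though `die` stood between the two words in the original sentence, because the stop word is removed first.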

### K-nearest neighbors


In [None]:
param_grid = {'n_neighbors': [3, 5, 7],
              'weights': ['uniform', 'distance'],
              'algorithm': ['ball_tree', 'kd_tree', 'brute']}

tuned_knn = GridSearchCV(KNeighborsClassifier(),
                         param_grid,
                         cv=3,
                         return_train_score=False)

tuned_knn.fit(X_train, y_train)

preds = tuned_knn.predict(X_test)
print(classification_report(y_test, preds))
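
One caveat about the `algorithm` entries in this grid: `X_train` is a sparse matrix, and according to the scikit-learn documentation, fitting `KNeighborsClassifier` on sparse input overrides the `algorithm` setting and uses brute force. If that holds for the installed version, the three `algorithm` values score identically and only triple the search time. The sketch below checks this on random toy data (an assumption for illustration, not the thesis data).

```python
import warnings

import numpy as np
from scipy.sparse import csr_matrix
from sklearn.neighbors import KNeighborsClassifier

# Random sparse toy data (assumption, illustration only).
rng = np.random.default_rng(0)
X_sparse = csr_matrix(rng.integers(0, 3, size=(60, 10)).astype(float))
y_toy = np.arange(60) % 3

# Request a KD-tree on sparse input and record any warning that is emitted.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    knn_tree = KNeighborsClassifier(n_neighbors=3, algorithm="kd_tree")
    knn_tree.fit(X_sparse, y_toy)

knn_brute = KNeighborsClassifier(n_neighbors=3, algorithm="brute")
knn_brute.fit(X_sparse, y_toy)

# Both models should give the same predictions if the tree request was overridden.
same_predictions = np.array_equal(
    knn_tree.predict(X_sparse), knn_brute.predict(X_sparse)
)
print(same_predictions)
for w in caught:
    print(w.message)
```

If the check confirms this, dropping `algorithm` from the grid leaves 6 instead of 18 parameter combinations to evaluate.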

### Naive Bayes

In [None]:
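# Note: class_prior needs one entry per class. The labels here have five classes,
# so every two-element prior below raises a ValueError and only class_prior=None
# is actually scored (120 of the 150 fits fail).
param_grid = {'alpha': [0.1, 0.5, 1.0, 5.0, 10.0],
              'fit_prior': [True, False],
              'class_prior': [None, [0.1, 0.9], [0.2, 0.8], [0.4, 0.6], [0.5, 0.5]]}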

tuned_nb = GridSearchCV(MultinomialNB(),
                        param_grid,
                        cv=3,
                        return_train_score=False)

tuned_nb.fit(X_train, y_train)

preds = tuned_nb.predict(X_test)
print(classification_report(y_test, preds))

              precision    recall  f1-score   support

           0       0.79      0.29      0.42       817
           1       0.88      0.40      0.55       569
           2       0.59      0.76      0.66      1835
           3       0.56      0.53      0.54      1894
           4       0.54      0.68      0.60      1745

    accuracy                           0.59      6860
   macro avg       0.67      0.53      0.56      6860
weighted avg       0.62      0.59      0.58      6860



120 fits failed out of a total of 150.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
120 fits failed with the following error:
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/sklearn/model_selection/_validation.py", line 680, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/usr/local/lib/python3.8/dist-packages/sklearn/naive_bayes.py", line 693, in fit
    self._update_class_log_prior(class_prior=class_prior)
  File "/usr/local/lib/python3.8/dist-packages/sklearn/naive_bayes.py", line 529, in _update_class_log_prior
    raise ValueError("Number of priors must match number of classes.")
ValueError: Number of priors must match number of classes.

        nan        nan        
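
The `ValueError` in the traceback arises because each `class_prior` candidate must contain one probability per class. A minimal sketch of a grid that fits without failures on a five-class problem is shown below. The toy data and the `skewed_prior` values are hypothetical and chosen only for illustration; they are not taken from the thesis.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB

# Random five-class count data (assumption, illustration only).
rng = np.random.default_rng(0)
n_classes = 5
X_toy = rng.integers(0, 4, size=(150, 20))  # non-negative counts, as MultinomialNB expects
y_toy = np.arange(150) % n_classes

# Hypothetical hand-picked prior: one entry per class, summing to 1.
skewed_prior = [0.1, 0.1, 0.3, 0.3, 0.2]

param_grid = {
    "alpha": [0.1, 1.0],
    "class_prior": [None, skewed_prior],
}

# error_score="raise" surfaces a failing fit immediately instead of recording NaN.
search = GridSearchCV(MultinomialNB(), param_grid, cv=3, error_score="raise")
search.fit(X_toy, y_toy)

scores = search.cv_results_["mean_test_score"]
print(search.best_params_)
print(np.isnan(scores).any())
```

When `class_prior` is given, `fit_prior` is ignored, and `fit_prior=False` already corresponds to a uniform prior, so searching over both parameters at once evaluates some equivalent settings more than once.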

### Decision Tree

In [None]:
param_grid = {'criterion': ['gini', 'entropy'],
              'splitter': ['best', 'random'],
              'max_depth': [None, 5, 10, 15],
              'min_samples_split': [2, 5, 10],
              'min_samples_leaf': [1, 2, 4]}


tuned_dt = GridSearchCV(DecisionTreeClassifier(),
                        param_grid,
                        cv=3,
                        return_train_score=False)

tuned_dt.fit(X_train, y_train)

preds = tuned_dt.predict(X_test)
print(classification_report(y_test, preds))

              precision    recall  f1-score   support

           0       0.53      0.44      0.48       817
           1       0.65      0.50      0.57       569
           2       0.58      0.58      0.58      1835
           3       0.48      0.63      0.55      1894
           4       0.59      0.49      0.54      1745

    accuracy                           0.55      6860
   macro avg       0.57      0.53      0.54      6860
weighted avg       0.56      0.55      0.55      6860
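
Decision trees break ties between equally good splits at random, and `splitter='random'` adds further randomness, so rerunning the search above can select different parameters and yield slightly different scores. A minimal sketch of a reproducible variant follows. It fixes `random_state` on the estimator and uses `StratifiedKFold`, which the first cell already imports, with a fixed seed. The toy data, the seed value, and the helper `run_search` are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

# Synthetic three-class toy data (assumption, illustration only).
X_toy, y_toy = make_classification(
    n_samples=200, n_features=10, n_informative=5, n_classes=3, random_state=0
)


def run_search(seed):
    """Hypothetical helper: run a small, fully seeded grid search."""
    cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=seed)
    search = GridSearchCV(
        DecisionTreeClassifier(random_state=seed),
        {"splitter": ["best", "random"], "max_depth": [None, 5]},
        cv=cv,
    )
    return search.fit(X_toy, y_toy)


first = run_search(seed=42)
second = run_search(seed=42)

# With all seeds fixed, both runs agree exactly.
print(first.best_params_ == second.best_params_)
print(first.best_score_ == second.best_score_)
```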



### Support Vector Machine

In [None]:
param_grid = {'C': [0.1, 1, 10],
              'kernel': ['linear', 'poly', 'rbf', 'sigmoid'],
              'degree': [2, 3, 4],
              'gamma': ['scale', 'auto'] + [0.1, 1, 10],
              'coef0': [-1, 0, 1]}

tuned_svm = GridSearchCV(svm.SVC(),
                         param_grid,
                         cv=3,
                         return_train_score=False)

tuned_svm.fit(X_train, y_train)

preds = tuned_svm.predict(X_test)
print(classification_report(y_test, preds))

              precision    recall  f1-score   support

           0       0.72      0.32      0.44       817
           1       0.86      0.37      0.52       569
           2       0.70      0.58      0.63      1835
           3       0.45      0.82      0.58      1894
           4       0.68      0.49      0.57      1745

    accuracy                           0.57      6860
   macro avg       0.68      0.52      0.55      6860
weighted avg       0.64      0.57      0.57      6860
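
The SVM grid above crosses every parameter with every kernel, although `degree` only affects the `poly` kernel, `coef0` only matters for `poly` and `sigmoid`, and `gamma` is not used by the `linear` kernel. The sketch below counts the combinations with `ParameterGrid` and compares them against a kernel-specific grid expressed as a list of dicts, which `GridSearchCV` also accepts. The conditional grid is a suggested alternative, not the setup used for the results reported here.

```python
from sklearn.model_selection import ParameterGrid

C_values = [0.1, 1, 10]
gammas = ["scale", "auto", 0.1, 1, 10]
degrees = [2, 3, 4]
coef0s = [-1, 0, 1]

# The grid as written in the notebook cell.
full_grid = {
    "C": C_values,
    "kernel": ["linear", "poly", "rbf", "sigmoid"],
    "degree": degrees,
    "gamma": gammas,
    "coef0": coef0s,
}

# One sub-grid per kernel, listing only the parameters that kernel uses.
conditional_grid = [
    {"kernel": ["linear"], "C": C_values},
    {"kernel": ["rbf"], "C": C_values, "gamma": gammas},
    {"kernel": ["poly"], "C": C_values, "gamma": gammas,
     "degree": degrees, "coef0": coef0s},
    {"kernel": ["sigmoid"], "C": C_values, "gamma": gammas, "coef0": coef0s},
]

n_full = len(ParameterGrid(full_grid))
n_conditional = len(ParameterGrid(conditional_grid))
print(n_full, n_conditional)  # 540 198
```

With `cv=3` this corresponds to 1620 versus 594 model fits, while covering the same set of distinct kernel configurations.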

