# Baseline Experiments with other Classifiers

For the paper, I ran additional experiments to obtain baseline scores with dummy classifiers and several other scikit-learn classifiers, including SVMs.

## Preparing the dataset

Importing the necessary libraries

In [9]:
import json
import re

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import sklearn.feature_extraction
import sklearn.model_selection
from sklearn import metrics
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import ConfusionMatrixDisplay, classification_report, f1_score
from sklearn.naive_bayes import ComplementNB, MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

In [49]:
# Import the file with additional text representations (only the paragraphs marked to be kept
# in the original corpus are included)

with open("data/Language-Processed-GINCO.json") as f:
    dataset = json.load(f)

dataset[0]

{'id': '3949',
 'url': 'http://www.pomurje.si/aktualno/sport/zimska-liga-malega-nogometa/',
 'crawled': '2014',
 'hard': False,
 'paragraphs': [],
 'primary_level_1': 'News/Reporting',
 'primary_level_2': 'News/Reporting',
 'primary_level_3': 'News/Reporting',
 'secondary_level_1': '',
 'secondary_level_2': '',
 'secondary_level_3': '',
 'tertiary_level_1': '',
 'tertiary_level_2': '',
 'tertiary_level_3': '',
 'split': 'test',
 'domain': 'www.pomurje.si',
 'baseline_text': "Šport Zimska liga malega nogometa sobota, 12.02.2011 avtor: Tonček Gider V 7. krogu zimske lige v malem nogometu v Križevcih pri Ljutomeru je v prvi ligi vodilni 100 plus iz Križevec izgubil s tretjo ekipo na lestvici Rock'n roll iz Križevec z rezultatom 1:2, druga na lestvici Top Finedika iz Križevec je bila poražena z ekipo Bar Milene iz Ključarovec z rezultatom 7:8. V drugi križevski ligi je vodilni Cafe del Mar iz Vučje vasi premagal Montažo Vrbnjak iz Stare Nove vasi z rezultatom 3:2. oglasno sporočilo Ocena",

In [50]:
dataset[0].keys()

dict_keys(['id', 'url', 'crawled', 'hard', 'paragraphs', 'primary_level_1', 'primary_level_2', 'primary_level_3', 'secondary_level_1', 'secondary_level_2', 'secondary_level_3', 'tertiary_level_1', 'tertiary_level_2', 'tertiary_level_3', 'split', 'domain', 'baseline_text', 'no_of_words', 'lemmas', 'upos', 'xpos', 'ner', 'dependency', 'lowercase', 'lowercase_nopunctuation', 'representation_list'])

### Downcasting number of labels

In these experiments, we will not use all of the texts but only those from 5 main categories: some categories will be merged into them, whereas categories with a very low frequency will be discarded. Additionally, the texts marked as hard will be discarded (see notebook *1-Preparing_Data_Hyperparameter_Search*).

We will start with a reduced set of labels (primary_level_3), then merge News/Reporting and Opinionated News into News, and discard some of the labels.

In [51]:
# merge News and Opinionated News
for i in dataset:
    if i["primary_level_3"] == "Opinionated News" or i["primary_level_3"] == "News/Reporting":
        i["primary_level_3"] = "News"
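The same merge can also be expressed as a lookup table, which may be easier to extend if more labels need to be merged later. The following is a minimal sketch on toy records, not the code used for the paper; `LABEL_MAP` and `merge_labels` are hypothetical names introduced here for illustration.

```python
# Hypothetical mapping from original labels to merged labels (illustrative only).
LABEL_MAP = {
    "News/Reporting": "News",
    "Opinionated News": "News",
}


def merge_labels(records, label_map, key="primary_level_3"):
    """Return copies of the records with labels remapped; unmapped labels are kept."""
    return [{**r, key: label_map.get(r[key], r[key])} for r in records]


# Toy records that only mimic the structure of the dataset entries.
toy = [
    {"id": "a", "primary_level_3": "News/Reporting"},
    {"id": "b", "primary_level_3": "Opinionated News"},
    {"id": "c", "primary_level_3": "Forum"},
]

merged = merge_labels(toy, LABEL_MAP)
print([r["primary_level_3"] for r in merged])
```

Unlike the in-place loop, this variant returns new dictionaries, so the original labels stay available for later inspection.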

Let's create a train-test-dev split that contains only the wanted labels.

In [52]:
downcasted_labels = ['Information/Explanation', 'Promotion', 'News', 'Forum', 'Opinion/Argumentation']

train = [i for i in dataset if i["split"] == "train" and i["primary_level_3"] in downcasted_labels and not i["hard"]]
test = [i for i in dataset if i["split"] == "test" and i["primary_level_3"] in downcasted_labels and not i["hard"]]
dev = [i for i in dataset if i["split"] == "dev" and i["primary_level_3"] in downcasted_labels and not i["hard"]]

print("The train-test-dev splits consist of the following numbers of examples:", len(train), len(test), len(dev))

The train-test-dev splits consist of the following numbers of examples: 410 141 137
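With splits this small, it can help to look at the label distribution of each split before interpreting the baseline scores, since the most-frequent dummy classifier depends entirely on it. Below is a minimal sketch on toy records; `label_distribution` is a hypothetical helper, and the toy list only mimics the structure of `train`.

```python
from collections import Counter


def label_distribution(records, key="primary_level_3"):
    """Count how many records carry each label, most frequent first."""
    return Counter(r[key] for r in records).most_common()


# Toy stand-in for one of the splits.
toy_train = [
    {"primary_level_3": "News"},
    {"primary_level_3": "News"},
    {"primary_level_3": "Forum"},
]

dist = label_distribution(toy_train)
print(dist)
```

Under this assumption, calling `label_distribution(train)`, `label_distribution(test)` and `label_distribution(dev)` would show whether the splits are similarly balanced.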


In [53]:
print(f"Number of all texts is {len(train)+len(test)+len(dev)}")

Number of all texts is 688


In [54]:
# Create dataframes from the datasets
train_df = pd.DataFrame(train)

train_df.columns

Index(['id', 'url', 'crawled', 'hard', 'paragraphs', 'primary_level_1',
       'primary_level_2', 'primary_level_3', 'secondary_level_1',
       'secondary_level_2', 'secondary_level_3', 'tertiary_level_1',
       'tertiary_level_2', 'tertiary_level_3', 'split', 'domain',
       'baseline_text', 'no_of_words', 'lemmas', 'upos', 'xpos', 'ner',
       'dependency', 'lowercase', 'lowercase_nopunctuation',
       'representation_list'],
      dtype='object')

In [55]:
test_df = pd.DataFrame(test)
dev_df = pd.DataFrame(dev)

print(f"test: {test_df.shape}, dev: {dev_df.shape}")

test: (141, 26), dev: (137, 26)


For the experiments, we'll use the baseline text ("baseline_text") and the "primary_level_3" labels.

In [64]:
# Create X_train and Y_train, the inputs used by scikit-learn
# List of texts in training split
X_train = list(train_df.baseline_text)
# List of labels in training split
Y_train = list(train_df.primary_level_3)

# List of texts in test split
X_test = list(test_df.baseline_text)
# List of labels in test split
Y_test = list(test_df.primary_level_3)

print(len(X_train), len(Y_train), len(X_test), len(Y_test))

410 410 141 141


In [65]:
Y_test[:2]

['News', 'News']

In [67]:
# Create a list of labels
labels = list(train_df.primary_level_3.unique())
labels

['Information/Explanation',
 'Promotion',
 'News',
 'Forum',
 'Opinion/Argumentation']

In [68]:
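# Create a TF-IDF representation of the text
def data_iterator(texts):
    """Yield the texts one at a time."""
    for text in texts:
        yield text


def tokenizer(txt):
    """Simple whitespace tokenizer"""
    return txt.split()


iterator = data_iterator(X_train)
test_iterator = data_iterator(X_test)

# min_df=0.005 drops terms that occur in fewer than 0.5% of the training texts
vectorizer = sklearn.feature_extraction.text.TfidfVectorizer(tokenizer=tokenizer, use_idf=True, min_df=0.005)

# Fit the vocabulary and the IDF weights on the training texts only
d = vectorizer.fit_transform(iterator)

# Reuse the fitted vectorizer for the test texts
d_test = vectorizer.transform(test_iterator)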
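To make the effect of a fractional `min_df` more concrete, here is a small sketch on a toy corpus. The texts and the threshold of 0.5 are made up for illustration and are not the values used in the experiments: with a float `min_df`, a term is kept only if the number of documents it occurs in is at least `min_df` times the number of documents.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus (illustrative only): "liga" occurs in 4 texts, "nogomet" in 2,
# and every other word in just 1.
toy_texts = [
    "liga nogomet rezultat",
    "liga nogomet",
    "liga forum",
    "liga mnenje",
]

# With min_df=0.5 and 4 documents, a term needs to occur in at least
# 0.5 * 4 = 2 documents to stay in the vocabulary.
toy_vectorizer = TfidfVectorizer(tokenizer=str.split, min_df=0.5)
toy_matrix = toy_vectorizer.fit_transform(toy_texts)

kept_terms = sorted(toy_vectorizer.vocabulary_)
print(kept_terms)
print(toy_matrix.shape)
```

Only the two terms that reach the document-frequency threshold remain as features, so the matrix has one row per text and two columns.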

## Training with various scikit-learn models

In [88]:
# Create a single-step pipeline for each model that we want to try:

pipelines=[]

"""
for model in [DummyClassifier(strategy="most_frequent"), DummyClassifier(strategy="stratified"), DecisionTreeClassifier(), MultinomialNB(), ComplementNB(), LogisticRegression(solver='saga'), SVC(C=0.5,kernel='linear',shrinking=False,probability=True),RandomForestClassifier()]:
    pipeline=make_pipeline(model)
    pipelines.append(pipeline)

"""

for model in [DummyClassifier(strategy="most_frequent"), DummyClassifier(strategy="stratified"), DecisionTreeClassifier(), MultinomialNB(), ComplementNB(), LogisticRegression(), SVC(),RandomForestClassifier()]:
    pipeline=make_pipeline(model)
    pipelines.append(pipeline)

In [89]:
pipelines

[Pipeline(steps=[('dummyclassifier', DummyClassifier(strategy='most_frequent'))]),
 Pipeline(steps=[('dummyclassifier', DummyClassifier(strategy='stratified'))]),
 Pipeline(steps=[('decisiontreeclassifier', DecisionTreeClassifier())]),
 Pipeline(steps=[('multinomialnb', MultinomialNB())]),
 Pipeline(steps=[('complementnb', ComplementNB())]),
 Pipeline(steps=[('logisticregression', LogisticRegression())]),
 Pipeline(steps=[('svc', SVC())]),
 Pipeline(steps=[('randomforestclassifier', RandomForestClassifier())])]
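Each pipeline here has a single step, because the texts were vectorized beforehand. An alternative design, sketched below on toy data, puts the vectorizer into the pipeline so that raw texts can be passed to `fit` and `predict` directly. This is only an illustration under assumed toy texts and labels, not the setup used for the reported scores.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import ComplementNB
from sklearn.pipeline import make_pipeline

# Toy texts and labels (illustrative only).
toy_texts = [
    "liga nogomet rezultat",
    "nogomet tekma liga",
    "akcija popust cena",
    "cena popust nakup",
]
toy_labels = ["News", "News", "Promotion", "Promotion"]

# The vectorizer is fitted inside the pipeline, on the training texts only.
text_pipeline = make_pipeline(TfidfVectorizer(tokenizer=str.split), ComplementNB())
text_pipeline.fit(toy_texts, toy_labels)

# Raw texts go in; the pipeline applies the fitted vectorizer before predicting.
predictions = text_pipeline.predict(["popust cena", "liga tekma"])
print(list(predictions))
```

Keeping the vectorizer in the pipeline makes it harder to accidentally fit it on test data, at the cost of re-vectorizing the texts for every model.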

In [90]:
for i, pipeline in enumerate(pipelines):
    pipeline.fit(d, Y_train)

In [91]:
# Predict on the test dataset
model_name = []
y_pred_list = []
micro_f1_array = []
macro_f1_array = []
accuracy_array = []

print("Classification Report\n")
print("*****************************************************")
for pipeline in pipelines:
    y_pred = pipeline.predict(d_test)
    y_pred_list.append(list(y_pred))
    # The step name is the lowercased class name of the model
    name = pipeline.steps[0][0].upper()
    print(name)
    model_name.append(name)

    micro_f1_array.append(round(f1_score(Y_test, y_pred, labels=labels, average="micro"), 3))
    macro_f1_array.append(round(f1_score(Y_test, y_pred, labels=labels, average="macro"), 3))
    accuracy_array.append(round(metrics.accuracy_score(Y_test, y_pred), 3))
    print("\n", classification_report(Y_test, y_pred, zero_division=0))
    print("*****************************************************")

results = {"model": model_name, "microF1": micro_f1_array, "macroF1": macro_f1_array, "accuracy": accuracy_array, "y_pred": y_pred_list}

Classification Report

*****************************************************
DUMMYCLASSIFIER

                          precision    recall  f1-score   support

                  Forum       0.00      0.00      0.00         9
Information/Explanation       0.00      0.00      0.00        33
                   News       0.00      0.00      0.00        46
  Opinion/Argumentation       0.00      0.00      0.00        19
              Promotion       0.24      1.00      0.39        34

               accuracy                           0.24       141
              macro avg       0.05      0.20      0.08       141
           weighted avg       0.06      0.24      0.09       141

*****************************************************
DUMMYCLASSIFIER

                          precision    recall  f1-score   support

                  Forum       0.10      0.11      0.11         9
Information/Explanation       0.11      0.06      0.08        33
                   News       0.40      0.39      

In [92]:
results_df = pd.DataFrame(results)
results_df

|    | model                  |   microF1 |   macroF1 |   accuracy | y_pred                                             |
|---:|:-----------------------|----------:|----------:|-----------:|:---------------------------------------------------|
|  0 | DUMMYCLASSIFIER        |     0.241 |     0.078 |      0.241 | [Promotion, Promotion, Promotion, Promotion, P...  |
|  1 | DUMMYCLASSIFIER        |     0.27  |     0.221 |      0.27  | [Promotion, News, Promotion, Opinion/Argumenta...  |
|  2 | DECISIONTREECLASSIFIER |     0.34  |     0.35  |      0.34  | [Opinion/Argumentation, Promotion, Promotion, ...  |
|  3 | MULTINOMIALNB          |     0.518 |     0.342 |      0.518 | [News, News, Promotion, Promotion, News, Opini...  |
|  4 | COMPLEMENTNB           |     0.539 |     0.416 |      0.539 | [News, News, Promotion, Promotion, News, Opini...  |
|  5 | LOGISTICREGRESSION     |     0.518 |     0.383 |      0.518 | [News, Promotion, Promotion, Promotion, News, ...  |
|  6 | SVC                    |     0.489 |     0.333 |      0.489 | [News, Promotion, Promotion, Promotion, News, ...  |
|  7 | RANDOMFORESTCLASSIFIER |     0.511 |     0.408 |      0.511 | [News, Promotion, Promotion, Information/Expla...  |
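The microF1 and accuracy columns are identical in every row. This is expected: in single-label multiclass classification where all labels are included in the average, every wrong prediction counts as one false positive and one false negative, so micro-averaged precision, recall and F1 all reduce to accuracy. Macro F1, in contrast, averages the per-class F1 scores, so rare classes weigh as much as frequent ones. A small sketch on made-up labels:

```python
from sklearn.metrics import accuracy_score, f1_score

# Made-up gold labels and predictions (illustrative only).
y_true = ["News", "News", "Forum", "Promotion", "Promotion", "Forum"]
y_pred = ["News", "Forum", "Forum", "Promotion", "News", "News"]

micro = f1_score(y_true, y_pred, average="micro")
macro = f1_score(y_true, y_pred, average="macro")
acc = accuracy_score(y_true, y_pred)

# Micro F1 and accuracy coincide; macro F1 is the mean of the per-class F1 scores.
print(f"micro F1: {micro:.3f}, accuracy: {acc:.3f}, macro F1: {macro:.3f}")
```

For this reason, the macro F1 column is the more informative one when comparing how well the models handle the smaller categories.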


In [93]:
# Append the results for the FastText model

ft = {"model": "FastText", "microF1": 0.56, "macroF1": 0.589}

# DataFrame.append was removed in pandas 2.0; pd.concat gives the same result
results_df = pd.concat([results_df, pd.DataFrame([ft])], ignore_index=True)

results_df



|    | model                  |   microF1 |   macroF1 |   accuracy | y_pred                                             |
|---:|:-----------------------|----------:|----------:|-----------:|:---------------------------------------------------|
|  0 | DUMMYCLASSIFIER        |     0.241 |     0.078 |      0.241 | [Promotion, Promotion, Promotion, Promotion, P...  |
|  1 | DUMMYCLASSIFIER        |     0.27  |     0.221 |      0.27  | [Promotion, News, Promotion, Opinion/Argumenta...  |
|  2 | DECISIONTREECLASSIFIER |     0.34  |     0.35  |      0.34  | [Opinion/Argumentation, Promotion, Promotion, ...  |
|  3 | MULTINOMIALNB          |     0.518 |     0.342 |      0.518 | [News, News, Promotion, Promotion, News, Opini...  |
|  4 | COMPLEMENTNB           |     0.539 |     0.416 |      0.539 | [News, News, Promotion, Promotion, News, Opini...  |
|  5 | LOGISTICREGRESSION     |     0.518 |     0.383 |      0.518 | [News, Promotion, Promotion, Promotion, News, ...  |
|  6 | SVC                    |     0.489 |     0.333 |      0.489 | [News, Promotion, Promotion, Promotion, News, ...  |
|  7 | RANDOMFORESTCLASSIFIER |     0.511 |     0.408 |      0.511 | [News, Promotion, Promotion, Information/Expla...  |
|  8 | FastText               |     0.56  |     0.589 |        NaN | NaN                                                |


In [94]:
# Show in markdown
print(results_df[["model","microF1", "macroF1"]].to_markdown(index = False))

| model                  |   microF1 |   macroF1 |
|:-----------------------|----------:|----------:|
| DUMMYCLASSIFIER        |     0.241 |     0.078 |
| DUMMYCLASSIFIER        |     0.27  |     0.221 |
| DECISIONTREECLASSIFIER |     0.34  |     0.35  |
| MULTINOMIALNB          |     0.518 |     0.342 |
| COMPLEMENTNB           |     0.539 |     0.416 |
| LOGISTICREGRESSION     |     0.518 |     0.383 |
| SVC                    |     0.489 |     0.333 |
| RANDOMFORESTCLASSIFIER |     0.511 |     0.408 |
| FastText               |     0.56  |     0.589 |
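The two DUMMYCLASSIFIER rows cannot be told apart by name. Following the order in which the pipelines were created, the first row is the `most_frequent` strategy and the second is the `stratified` strategy. One way to make this visible in the table would be to include the strategy in the model name, as in the sketch below; `describe_model` is a hypothetical helper and not part of the original notebook.

```python
from sklearn.dummy import DummyClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline


def describe_model(pipeline):
    """Return the upper-cased step name, plus the strategy for dummy classifiers."""
    step_name, model = pipeline.steps[0]
    name = step_name.upper()
    if isinstance(model, DummyClassifier):
        name += f" ({model.strategy})"
    return name


# A shortened list of pipelines, built the same way as in the notebook.
toy_pipelines = [
    make_pipeline(DummyClassifier(strategy="most_frequent")),
    make_pipeline(DummyClassifier(strategy="stratified")),
    make_pipeline(MultinomialNB()),
]

names = [describe_model(p) for p in toy_pipelines]
print(names)
```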


In [95]:
# Save the results in CSV format
results_df.to_csv("results/additional_models_experiments.csv")
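By default, `to_csv` also writes the row index, which comes back as an `Unnamed: 0` column when the file is read again with `pd.read_csv`. If that column is not wanted, `index=False` leaves it out. The sketch below shows the difference on a toy dataframe, using in-memory buffers instead of files; the values are made up.

```python
import io

import pandas as pd

# Toy results table (illustrative values only).
toy_results = pd.DataFrame({"model": ["A", "B"], "macroF1": [0.1, 0.2]})

# Default behaviour: the index is written as an unnamed first column.
with_index = io.StringIO()
toy_results.to_csv(with_index)
with_index.seek(0)
reloaded_with_index = pd.read_csv(with_index)

# With index=False, only the data columns are written.
without_index = io.StringIO()
toy_results.to_csv(without_index, index=False)
without_index.seek(0)
reloaded_without_index = pd.read_csv(without_index)

print(list(reloaded_with_index.columns))
print(list(reloaded_without_index.columns))
```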