In [1]:
import numpy as np 
import pandas as pd 
from pathlib import Path

from sklearn.svm import LinearSVC

# Data

source: https://github.com/NLeSC/spudisc-emotion-classification

Preprocessed versions of the data, split for training and testing a classifier, can be found in the files train.txt and test.txt. These contain one sentence per line with labels at the end of each line. A single space separates the labels from the text. Multiple labels are separated by underscores. Where a sentence received no label, the string None appears. (No label means no emotions assigned by the annotator; all sentences have been annotated.)

ToDo:
- cite and describe the data
- how to load the data correctly? (not specified in the paper or on GitHub)

In [2]:
DATA_FILES = Path('data')
TEST_DATA = DATA_FILES / 'test.txt'
TRAIN_DATA = DATA_FILES / 'train.txt'

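def load_data(file_path):
    """Read 'sentence label1_label2' lines; return sentences and label lists."""
    X = []
    y = []
    with open(file_path, 'r') as file:
        for line in file:
            tokens = line.split()
            if not tokens:
                continue  # skip blank lines, which would otherwise raise IndexError
            # the last whitespace-separated token holds the labels, the rest is the sentence
            X.append(' '.join(tokens[:-1]))
            y.append(tokens[-1].split('_'))
    return X, y
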
X_train, y_train = load_data(TRAIN_DATA)
X_test, y_test = load_data(TEST_DATA)

In [3]:
print(len(X_train))
print(len(y_train))
print(len(X_test))
print(len(y_test))

503
503
126
126
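
Note on label format: `load_data` returns one list of label strings per sentence, while the classifiers used later expect a binary indicator matrix. A minimal sketch of that conversion with `MultiLabelBinarizer`, on made-up label names. Treating `None` as "no label" (an all-zero row) rather than as its own class is an assumption; the source does not say which is intended. `strip_none` is a hypothetical helper.

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Toy label lists in the shape load_data returns; the label names are made up.
y_toy = [["Joy"], ["Anger", "Fear"], ["None"]]


def strip_none(labels):
    """Hypothetical helper: drop the 'None' placeholder so the row becomes all zeros."""
    return [label for label in labels if label != "None"]


y_clean = [strip_none(labels) for labels in y_toy]

mlb = MultiLabelBinarizer()
Y_toy = mlb.fit_transform(y_clean)  # shape: (n_sentences, n_labels)

print(mlb.classes_)
print(Y_toy)
```

On the real data the binarizer would be fitted on `y_train` and then applied to `y_test` with `transform`, so both splits share the same column order.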


# Features

Both algorithms use standard bag-of-words features with stop word removal and optional tf–idf weighting.

## BOW and Tf-idf

described at: https://medium.com/betacom/bow-tf-idf-in-python-for-unsupervised-learning-task-88f3b63ccd6d

scikit-learn BOW: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html#sklearn.feature_extraction.text.CountVectorizer

scikit-learn tf-idf: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html#sklearn.feature_extraction.text.TfidfTransformer

## Stop word removal

described at: https://aisb.org.uk/wp-content/uploads/2019/12/Final-vol-02.pdf#page=59

ToDo: 
- the data is in a "raw" format and needs to be converted into BOW features
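
A possible way to do the conversion, as a sketch only: `CountVectorizer` builds the BOW counts and drops stop words, and `TfidfTransformer` is appended when tf-idf weighting is wanted. The built-in `"english"` stop word list is an assumption (the stop word list actually used is not known), and `make_bow_pipeline` and the toy sentences are made up for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline


def make_bow_pipeline(use_tfidf=True):
    """Hypothetical helper: BOW counts with stop word removal, optional tf-idf."""
    # Assumption: scikit-learn's built-in English stop word list.
    steps = [("counts", CountVectorizer(stop_words="english"))]
    if use_tfidf:
        steps.append(("tfidf", TfidfTransformer()))
    return Pipeline(steps)


# Made-up sentences, only to show the shapes involved.
toy_sentences = [
    "I am so happy today",
    "This is a terrible and sad day",
    "happy happy day",
]

bow_pipeline = make_bow_pipeline(use_tfidf=True)
X_toy = bow_pipeline.fit_transform(toy_sentences)  # sparse (n_sentences, n_terms)
vocab = sorted(bow_pipeline.named_steps["counts"].vocabulary_)

print(X_toy.shape)
print(vocab)
```

On the real data, the pipeline would be fitted on `X_train` only and applied to `X_test` with `transform`, so the test set does not influence the vocabulary or the idf weights.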



In [4]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline

# Algorithms

## One-vs-Rest

Reduction to binary classifiers

The liblinear implementation in scikit-learn is used: https://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html#sklearn.svm.LinearSVC

wrapper: https://scikit-learn.org/stable/modules/generated/sklearn.multiclass.OneVsRestClassifier.html#sklearn.multiclass.OneVsRestClassifier

description at: https://scikit-learn.org/stable/modules/svm.html (chapter 1.4.1.1.)
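
A minimal sketch of the one-vs-rest reduction with the wrapper linked above: one `LinearSVC` is trained per label column of a binary indicator matrix. The toy features and labels are made up, and `C=1` follows the ToDo below rather than a confirmed setting.

```python
import numpy as np
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC

# Made-up data: 6 samples, 2 features, 2 labels (indicator matrix).
X_toy = np.array([
    [1.0, 0.0], [0.9, 0.1],   # label 0 only
    [0.0, 1.0], [0.1, 0.9],   # label 1 only
    [1.0, 1.0], [0.9, 0.9],   # both labels
])
Y_toy = np.array([
    [1, 0], [1, 0],
    [0, 1], [0, 1],
    [1, 1], [1, 1],
])

# One binary LinearSVC is fitted per label column.
ovr = OneVsRestClassifier(LinearSVC(C=1))
ovr.fit(X_toy, Y_toy)
pred = ovr.predict(X_toy)

print(len(ovr.estimators_))  # number of binary classifiers, one per label
print(pred)
```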

## RAKEL

implementation found in: http://scikit.ml/api/skmultilearn.ensemble.rakeld.html

## ToDo:
- use tf–idf weighting with logarithmic tf
- automatic oversampling
- use fixed regularization parameter C = 1 for all SVMs
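
For the logarithmic tf item, one candidate is the `sublinear_tf` option of `TfidfTransformer`, which replaces each non-zero count tf with 1 + log(tf). Whether this matches the paper's definition of logarithmic tf is an assumption. The sketch below switches off idf and normalization so that only the tf scaling is visible; the counts are made up.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfTransformer

# Made-up term counts for a single document: tf = 1, 4 and 0.
counts = np.array([[1, 4, 0]])

# use_idf=False and norm=None isolate the sublinear tf scaling.
log_tf = TfidfTransformer(sublinear_tf=True, use_idf=False, norm=None)
weighted = log_tf.fit_transform(counts).toarray()

print(weighted)  # non-zero counts become 1 + log(tf), zeros stay zero
```

In an actual feature pipeline, `TfidfTransformer(sublinear_tf=True)` with the default idf and L2 normalization would be the variant to try.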

Difficulties:
- the paper does not state whether an L1 or L2 penalty is used (L2 is used here, since it is the LinearSVC default)

In [1]:
from sklearn.svm import LinearSVC
from skmultilearn.ensemble import RakelD
from sklearn.preprocessing import MultiLabelBinarizer
from bow_1 import BOW
from sklearn.model_selection import train_test_split
import sklearn.metrics as metrics

In [2]:
# load BOW features
bow_default = BOW()
X_train, X_test, y_train, y_test = bow_default.create()

In [3]:
# create RAKEL classifier
classifier = RakelD(
    base_classifier=LinearSVC(C=1),
    base_classifier_require_dense=[True, True],
    labelset_size=3
)

In [4]:
# fit on the training set and predict labels for the test set
classifier.fit(X_train, y_train)
prediction = classifier.predict(X_test)

In [5]:
# look at some scores (from skmultilearn example)
metrics.hamming_loss(y_test, prediction)

0.1746031746031746

In [6]:
# for multilabel input, accuracy_score is subset accuracy (all labels of a sample must match)
metrics.accuracy_score(y_test, prediction)

0.2857142857142857
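
How to read the two scores: `hamming_loss` is the fraction of individual label decisions that are wrong, while `accuracy_score` on multilabel input counts a sample as correct only if every label matches. A small worked example on made-up indicator matrixes shows why the second number can look much worse than the first.

```python
import numpy as np
import sklearn.metrics as metrics

# Made-up ground truth and predictions: 3 samples, 3 labels.
y_true = np.array([[1, 0, 0],
                   [0, 1, 1],
                   [0, 0, 1]])
y_pred = np.array([[1, 0, 0],   # exact match
                   [0, 1, 0],   # one label missed
                   [0, 1, 1]])  # one extra label

# 2 wrong label decisions out of 9 -> 2/9
hamming = metrics.hamming_loss(y_true, y_pred)

# only 1 of 3 rows matches exactly -> 1/3
subset_acc = metrics.accuracy_score(y_true, y_pred)

print(hamming, subset_acc)
```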