# Text classification

In [5]:
import fasttext
import ktrain
import numpy as np
import shap

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score, f1_score, matthews_corrcoef
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

### Task 2 & 3

In [6]:
with open("./task_6-1/training_set_clean_only_text.txt", 'r', encoding='utf-8') as file:
    X1 = np.array([x for x in file.readlines()])

with open("./task_6-1/training_set_clean_only_tags.txt", 'r', encoding='utf-8') as file:
    y1 = np.array([int(y) for y in file.readlines()])

with open("./task_6-2/training_set_clean_only_text.txt", 'r', encoding='utf-8') as file:
    X2 = np.array([x for x in file.readlines()])

with open("./task_6-2/training_set_clean_only_tags.txt", 'r', encoding='utf-8') as file:
    y2 = np.array([int(y) for y in file.readlines()])

In [7]:
def print_scores(y_test, y_pred):
    print("\tAccuracy score:", accuracy_score(y_test, y_pred))
    print("\tF1 score:", f1_score(y_test, y_pred, average="weighted")) # default is only for binary labels
    print("\tMacro F1 score:", f1_score(y_test, y_pred, average="macro"))
    print("\tMicro F1 score:", f1_score(y_test, y_pred, average="micro"))
    print("\tMCC:", matthews_corrcoef(y_test, y_pred))

#### Bayesian classifier

TF-IDF weighting is done by `TfidfVectorizer()` from sklearn.  
`random_state` is set to keep the results the same across multiple runs.
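
To make the weighting concrete, here is a minimal pure-Python sketch of the formula that `TfidfVectorizer()` is understood to apply with its default settings: raw term counts, a smoothed IDF of `ln((1 + n) / (1 + df)) + 1`, and L2 normalisation of each row. The whitespace tokenizer, the function name `tfidf_rows` and the toy documents are assumptions made for illustration; sklearn's own tokenizer behaves differently.

```python
import math
from collections import Counter


def tfidf_rows(docs):
    """Return (vocabulary, rows) with one L2-normalised TF-IDF row per document."""
    tokenized = [doc.lower().split() for doc in docs]  # assumption: whitespace tokens
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(docs)
    # document frequency: in how many documents each term occurs
    df = Counter(tok for toks in tokenized for tok in set(toks))
    idf = {tok: math.log((1 + n_docs) / (1 + df[tok])) + 1 for tok in vocab}

    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        row = [counts[tok] * idf[tok] for tok in vocab]
        norm = math.sqrt(sum(v * v for v in row)) or 1.0  # avoid division by zero
        rows.append([v / norm for v in row])
    return vocab, rows


# toy documents, invented for this example
toy_docs = ["good movie", "bad movie", "good good plot"]
vocab, rows = tfidf_rows(toy_docs)
for doc, row in zip(toy_docs, rows):
    print(doc, [round(v, 3) for v in row])
```

In the second toy document the rarer word "bad" gets a larger weight than "movie", which occurs in two of the three documents.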

In [8]:
def NB_classifier(X, y):
    vectorizer = TfidfVectorizer()
    X = vectorizer.fit_transform(X)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=7)
    
    classifier = GaussianNB()
    classifier.fit(X_train.toarray(), y_train)
    y_pred = classifier.predict(X_test.toarray())
    print_scores(y_test, y_pred)
    # get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
    return classifier, X_test.toarray(), y_pred, vectorizer.get_feature_names_out()

In [9]:
print("Task 1:")
NB1_model, NB_X1_test, NB_y1_pred, NB_feature_names1 = NB_classifier(X1, y1)

Task 1:
	Accuracy score: 0.8434886499402628
	F1 score: 0.8598434552889552
	Macro F1 score: 0.6032082766218806
	Micro F1 score: 0.8434886499402628
	MCC: 0.21998729341226636


In [10]:
print("Task 2:")
NB2_model, NB_X2_test, NB_y2_pred, NB_feature_names2 = NB_classifier(X2, y2)

Task 2:
	Accuracy score: 0.8502588610115492
	F1 score: 0.8628120312388717
	Macro F1 score: 0.441313568748656
	Micro F1 score: 0.8502588610115492
	MCC: 0.1966873921225114


#### Fasttext text classifier

This classifier needs the training data in a file in which each line contains at least one label prefixed with `__label__`, followed by the text.
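
As a small illustration of that file format, the sketch below writes two toy samples in the `__label__<id> <text>` layout and parses the label ids back. It uses only the standard library and does not call fasttext itself; the helper names `to_fasttext_line` and `parse_label`, the file name and the samples are assumptions made for this example.

```python
import os
import tempfile

LABEL_PREFIX = "__label__"


def to_fasttext_line(label, text):
    """Format one training sample; the text must stay on a single line."""
    return "{}{} {}\n".format(LABEL_PREFIX, label, " ".join(text.split()))


def parse_label(fasttext_label):
    """Turn a label string such as '__label__1' back into an int."""
    return int(fasttext_label[len(LABEL_PREFIX):])


# toy samples, invented for this example
samples = [(0, "have a nice day"), (1, "you are\nthe worst")]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "train.txt")  # hypothetical file name
    with open(path, "w", encoding="utf-8") as file:
        file.writelines(to_fasttext_line(label, text) for label, text in samples)
    with open(path, encoding="utf-8") as file:
        lines = file.read().splitlines()

print(lines)
print([parse_label(line.split(" ", 1)[0]) for line in lines])
```

Slicing off the whole prefix, rather than reading a single character, also works for label ids with more than one digit.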

In [11]:
def fasttext_classifier(X, y, file_dir=""):
    file_path = file_dir + "merged.txt"
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=7)
    
    # create file
    with open(file_path, "w", encoding='utf8') as file:
        file.writelines(["__label__{} {}".format(y, x) for x, y in zip(X_train, y_train)])
    
    classifier = fasttext.train_supervised(input=file_path)

    y_pred = []
    for x in X_test:
        pred, _ = classifier.predict(x.rstrip("\n"))  # predict() rejects text containing '\n'
        y_pred.append(int(pred[0].replace("__label__", "")))  # strip the prefix, keep the label id

    print_scores(y_test, y_pred)
    return classifier, y_pred

In [12]:
print("Task 1:")
ft1_model, ft_y1_pred = fasttext_classifier(X1, y1, "task_6-1/")

Task 1:
	Accuracy score: 0.9175627240143369
	F1 score: 0.8866508445337241
	Macro F1 score: 0.5416610451966192
	Micro F1 score: 0.9175627240143369
	MCC: 0.18595918269873282


In [13]:
print("Task 2:")
ft2_model, ft_y2_pred = fasttext_classifier(X2, y2, "task_6-2/")

Task 2:
	Accuracy score: 0.9143767423337316
	F1 score: 0.8761932949316045
	Macro F1 score: 0.3386804319748926
	Micro F1 score: 0.9143767423337316
	MCC: 0.05161051748645783


#### Transformer classifier

Built with ktrain, following this tutorial: https://towardsdatascience.com/text-classification-with-hugging-face-transformers-in-tensorflow-2-without-tears-ee50e4f3e7ed  
Ran on Google Colab.  


In [16]:
def transformer_classifier(X, y):
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=7)
    
    model_name = 'bert-base-uncased'
    t = ktrain.text.Transformer(model_name, maxlen=500, class_names=list(set(y)))
    train_dataset = t.preprocess_train(X_train, y_train)
    validation_dataset = t.preprocess_test(X_test, y_test)
    model = t.get_classifier()
    learner = ktrain.get_learner(model, train_data=train_dataset, val_data=validation_dataset, batch_size=6)
    learner.fit_onecycle(5e-5, 4)
    
    predictor = ktrain.get_predictor(learner.model, preproc=t)
    
    y_pred = [predictor.predict(x) for x in X_test]
    print_scores(y_test, y_pred)
    return predictor, y_pred

In [17]:
print("Task 1:")
t1_model, t_y1_pred = transformer_classifier(X1, y1)

Task 1:
preprocessing train...
language: pl
train sequence lengths:
	mean : 12
	95percentile : 21
	99percentile : 24


Is Multi-Label? False
preprocessing test...
language: pl
test sequence lengths:
	mean : 12
	95percentile : 21
	99percentile : 24




begin training using onecycle policy with max lr of 5e-05...
Epoch 1/4
Epoch 2/4
Epoch 3/4
Epoch 4/4
	Accuracy score: 0.9155714854639586
	F1 score: 0.8752178150027612
	Macro F1 score: 0.47796257796257796
	Micro F1 score: 0.9155714854639586
	MCC: 0.0


invalid value encountered in double_scalars


In [18]:
print("Task 2:")
t2_model, t_y2_pred = transformer_classifier(X2, y2)

Task 2:
preprocessing train...
language: pl
train sequence lengths:
	mean : 12
	95percentile : 21
	99percentile : 24


Is Multi-Label? False
preprocessing test...
language: pl
test sequence lengths:
	mean : 12
	95percentile : 21
	99percentile : 24




begin training using onecycle policy with max lr of 5e-05...
Epoch 1/4
Epoch 2/4
Epoch 3/4
Epoch 4/4
	Accuracy score: 0.9155714854639586
	F1 score: 0.8752178150027612
	Macro F1 score: 0.31864171864171864
	Micro F1 score: 0.9155714854639586
	MCC: 0.0


invalid value encountered in double_scalars


### Task 4

In [19]:
_, X_test, _, y_test = train_test_split(X1, y1, random_state=7)  # same split as in classifiers

tp, fp, tn, fn = None, None, None, None
for i in range(len(X_test)):
    if y_test[i] == ft_y1_pred[i] == 1:
        tp = (i, str(X_test[i])[:-1])
        # print(tp)
        break

for i in range(len(X_test)):
    if y_test[i] == ft_y1_pred[i] == 0:
        tn = (i, str(X_test[i])[:-1])
        # print(tn)
        break

for i in range(len(X_test)):
    if y_test[i] != ft_y1_pred[i] and ft_y1_pred[i] == 1:
        fp = (i, str(X_test[i])[:-1])
        # print(fp)
        break

for i in range(len(X_test)):
    if y_test[i] != ft_y1_pred[i] and ft_y1_pred[i] == 0:
        fn = (i, str(X_test[i])[:-1])
        # print(fn)
        break

In [None]:
# NG_values = np.array([NB_X1_test[i] for i in [tp[0], tn[0], fp[0], fn[0]]])
# explainer = shap.KernelExplainer(NB1_model.predict, shap.kmeans(NB_X1_test, 10)) # this used up all RAM and blew up Colab
# # shap.initjs()
# shap_values = explainer.shap_values(NG_values)
# shap.force_plot(explainer.expected_value, shap_values, feature_names=NB_feature_names1, matplotlib=True)


In [None]:
# this fails because predict() in this model expects a single sample
# explainer = shap.KernelExplainer(ft1_model.predict, [tp[1], tn[1], fp[1], fn[1]])
# shap_values = explainer.shap_values([tp[1], tn[1], fp[1], fn[1]])

### Task 5
#### Which of the classifiers works best for task 1 and task 2?
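Judged by accuracy and micro F1, fasttext worked best in the first task, with the transformer a close second.  
The second task I'd call a draw between fasttext and the transformer.  
Note, however, that the transformer's MCC of 0.0 and its macro F1 are what a model that always predicts the majority class would produce, and that the Bayesian classifier has the highest macro F1 and MCC in both tasks.  
**It is worth noting that these results are for only one possible split of the data.**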
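
One way to reduce the dependence on a single split is to repeat the evaluation over several random splits and report the mean and spread. The sketch below does this with the standard library only. The placeholder majority-class model, the function names and the toy data are assumptions made for illustration, not the classifiers used above; the 25% test share mirrors the default of `train_test_split`.

```python
import random
import statistics


def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def majority_fit_predict(X_train, y_train, X_test):
    """Placeholder model (an assumption): always predict the most common training label."""
    majority = max(set(y_train), key=list(y_train).count)
    return [majority] * len(X_test)


def evaluate_over_splits(X, y, fit_predict, seeds, test_size=0.25):
    """Re-run a shuffled train/test split once per seed and collect the accuracy."""
    scores = []
    for seed in seeds:
        indices = list(range(len(X)))
        random.Random(seed).shuffle(indices)
        n_test = max(1, int(len(indices) * test_size))
        test_idx, train_idx = indices[:n_test], indices[n_test:]
        y_pred = fit_predict([X[i] for i in train_idx],
                             [y[i] for i in train_idx],
                             [X[i] for i in test_idx])
        scores.append(accuracy([y[i] for i in test_idx], y_pred))
    return scores


# toy data, invented for this example: 90 samples of class 0 and 10 of class 1
X_toy = ["text {}".format(i) for i in range(100)]
y_toy = [0] * 90 + [1] * 10

scores = evaluate_over_splits(X_toy, y_toy, majority_fit_predict, seeds=range(5))
print("mean accuracy:", statistics.mean(scores))
print("std deviation:", statistics.stdev(scores))
```

Any function with the same `(X_train, y_train, X_test)` signature could be passed in place of the placeholder model.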
#### Did you achieve results comparable with the results of PolEval Task?
For task 6.1 the results here are somewhat better(?).  
For task 6.2 the micro F1 score is comparable and the macro F1 score is far worse.
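
The gap between micro and macro F1 is easier to see on a toy case. The sketch below computes both scores and the binary MCC in pure Python for a model that always predicts the majority class. It shows high micro F1, low macro F1 and an MCC of 0.0, which is consistent with the pattern in the transformer output above. The labels are invented for this example, and the functions are simplified stand-ins for the sklearn metrics, not their implementation.

```python
import math


def f1_per_class(y_true, y_pred, label):
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
    fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(y_true, y_pred):
    # unweighted mean over classes, so a class that is never predicted drags it down
    labels = sorted(set(y_true) | set(y_pred))
    return sum(f1_per_class(y_true, y_pred, label) for label in labels) / len(labels)


def micro_f1(y_true, y_pred):
    # for single-label classification, micro F1 equals accuracy
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def binary_mcc(y_true, y_pred):
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0  # 0.0 when undefined


# toy labels, invented for this example: 90 negatives, 10 positives
y_true = [0] * 90 + [1] * 10
y_pred = [0] * 100  # a model that always answers with the majority class

print("micro F1:", micro_f1(y_true, y_pred))
print("macro F1:", round(macro_f1(y_true, y_pred), 4))
print("MCC:", binary_mcc(y_true, y_pred))
```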
#### Did you achieve results comparable with the Klej leaderboard?
Here the results are comparable only when looking at Average; for CBD the results here are better.  
I think this and the previous point show that a single split of a single dataset does not yield reliable results. (I don't believe that this simple model has just beaten the competition.)
#### Describe strengths and weaknesses of each of the compared algorithms.
The Bayesian classifier is the simplest of the three, both in logic and in implementation. Its downside is that it assumes all features are independent of each other given the class.  
Fasttext surely has the weirdest interface and is difficult to use (it requires the samples with their labels on single lines in a FILE, and predict() throws an error when given more than one sample at a time). Nevertheless it "won" among these three models (on a possibly very biased train-test split, but all of the models had the same split).  
The transformer is the most sophisticated one and might prove to be the best with the right parameters (learning rate, number of epochs). The big downside is that it is VERY slow: running it on my computer would take a whole day, and even Colab with GPU support needed over an hour for one model.
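
The independence assumption of the Bayesian classifier can be written out explicitly. For a class $c$ and a feature vector $x = (x_1, \dots, x_n)$ (here the TF-IDF weights), naive Bayes scores each class as a product of per-feature terms. The Gaussian form of each term below is what `GaussianNB` is understood to use, with a mean $\mu_{c,i}$ and variance $\sigma_{c,i}^2$ estimated per class and feature; the notation is introduced here for illustration.

```latex
\hat{y} = \arg\max_{c} \; P(c) \prod_{i=1}^{n} P(x_i \mid c),
\qquad
P(x_i \mid c) = \frac{1}{\sqrt{2\pi\sigma_{c,i}^{2}}}
\exp\!\left(-\frac{(x_i - \mu_{c,i})^{2}}{2\sigma_{c,i}^{2}}\right)
```

The product is exact only if the features are conditionally independent given the class, which word weights in a sentence rarely are.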
#### Do you think comparison of raw performance values on a single task is enough to assess the value of a given algorithm/model?
It is not enough, since a single dataset might lack, for example, some strongly positive or negative words, and the model would then struggle even with a simple sample (2-3 words) that contains such words.  
There is also the risk of overfitting, so that the model works well only for this particular data and not for any other.
#### Did SHAP show that the models use valuable features/words when performing their decision?
When given only 4 samples, SHAP returned nonsense, and when I tried to feed it more data (raw or using kmeans()) even Colab refused to handle it and restarted (clearing the environment).  
Colab also likes to suddenly 'detach' you from the environment without warning, which didn't help in trying to get it to work.  
I couldn't figure out how to run SHAP for the other two models (e.g. fasttext's predict() works for single samples and using Explainer on it throws an error).
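
One possible workaround for the single-sample restriction, sketched under assumptions: wrap the per-sample call in a function that loops over a batch and returns one label per sample. `make_batch_predict` and `stub_predict_one` are hypothetical names, and the stub merely stands in for a real model call followed by label parsing. Whether SHAP then gives meaningful explanations for raw text input has not been checked here, so this only addresses the batching error.

```python
def make_batch_predict(predict_one):
    """Wrap a single-sample predict function so it accepts any iterable of samples."""
    def batch_predict(samples):
        return [predict_one(sample) for sample in samples]
    return batch_predict


def stub_predict_one(text):
    """Hypothetical stand-in for a real model: flags texts containing 'worst'."""
    if not isinstance(text, str):
        raise TypeError("expected a single string")
    return 1 if "worst" in text else 0


batch_predict = make_batch_predict(stub_predict_one)
print(batch_predict(["have a nice day", "you are the worst"]))
```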