# Text classification

The task concentrates on content-based text classification.


## Tasks

### Divide the set of bills into two exclusive sets:
   1. the set of bills amending other bills (their title starts with `o zmianie ustawy`),
   1. the set of bills not amending other bills.

### Change the contents of the bill by removing the date of publication and the title (so the words `o zmianie ustawy` are removed).
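The two steps above can be sketched on a toy string. This is a minimal illustration, not the notebook's own code: the helper names `split_header` and `is_amending` are hypothetical, and both patterns are assumptions (the dot in `Art.` is escaped and a word boundary keeps `Art. 10` from matching).

```python
import re

# Assumed patterns: the header ends at the first "Art. 1" or "Rozdział 1".
HEADER_END = re.compile(r'(?:Art\.|Rozdział)\s+1\b')
AMENDING = re.compile(r'o\s+zmianie\s+ustawy', re.IGNORECASE)

def split_header(text):
    """Return (header, body), or (None, None) if no first article/chapter is found."""
    match = HEADER_END.search(text)
    if match is None:
        return None, None
    return text[:match.start()], text[match.start():]

def is_amending(header):
    """True when the header (date and title) contains 'o zmianie ustawy'."""
    return AMENDING.search(header) is not None

# Toy document, invented for illustration only.
sample = "USTAWA\nz dnia 1 stycznia 2000 r.\no zmianie ustawy o podatku\nArt. 1. W ustawie ..."
header, body = split_header(sample)
print(is_amending(header))
print(body)
```

Only `body` would be passed to the classifiers, so the words that decide the label never appear in the training text.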

In [39]:
import regex
import os
import requests
from collections import Counter
from operator import add
import functools
import random
import math
import fastText
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC
from sklearn.model_selection import GridSearchCV

In [40]:
def files_names():
    path = '../ustawy'
    absolute_path = os.path.realpath(path)
    # os.path.join keeps the path separator portable across operating systems
    return [(os.path.join(absolute_path, filename), filename) for filename in os.listdir(path)]

def get_file_text_raw(filename):
    with open(filename, 'r', encoding="utf8") as content_file:
        return regex.sub(r"\n\s*\n", "\n", content_file.read()).strip()

In [41]:
def split_by_title(text):
    search = regex.search(r'((Art.)|(Rozdział))(\s+1)',text)
    if search is None:
        return None, None
    return (text[:search.start()], text[search.start():])

def split_by_changing(text):
    search = regex.search(r'(zmianie|zmieniająca)(.|\n)*(ustaw|ustawy)',text)
    if search is None:
        return None
    return text[:search.start()]

possitive = []
negative = []

for (path, filename) in files_names():
    text = get_file_text_raw(path)
    title, body = split_by_title(text)
    if title is None:
        print("Not found for: " + filename)
    else:
        result = split_by_changing(title)
        if result is None:
            negative.append((body,'__label__normal'))
        else:
            possitive.append((body,'__label__changing'))
print(len(possitive), len(negative))

Not found for: 1996_400.txt
713 466


### Split the sets of documents into the following groups by randomly selecting the documents:
   1. 60% training
   1. 20% validation
   1. 20% testing
   
### Do not change these groups during the following experiments.
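One way to keep the groups fixed between runs is to shuffle with a seeded generator. The sketch below assumes a hypothetical helper `split_60_20_20` and an arbitrary seed value; it uses the same 60%/80% cut points as the task describes and would be applied to each class separately.

```python
import math
import random

def split_60_20_20(items, seed=0):
    """Shuffle a copy of items with a seeded RNG and cut it at 60% and 80%."""
    rng = random.Random(seed)  # private generator, independent of the global random state
    shuffled = items[:]
    rng.shuffle(shuffled)
    train_end = math.floor(len(shuffled) * 0.6)
    valid_end = math.floor(len(shuffled) * 0.8)
    return shuffled[:train_end], shuffled[train_end:valid_end], shuffled[valid_end:]

# Toy data standing in for one class of documents.
train_part, valid_part, held_out = split_60_20_20(list(range(10)), seed=0)
print(len(train_part), len(valid_part), len(held_out))
```

Calling the function again with the same seed returns the same three groups, so later experiments see identical data.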

In [42]:
random_possitive = possitive[:]
random_negative = negative[:]
random.shuffle(random_possitive)
random.shuffle(random_negative)

possitive_training_number = math.floor(len(random_possitive)*0.6)
possitive_validation_number = math.floor(len(random_possitive)*0.8)

negative_training_number = math.floor(len(random_negative)*0.6)
negative_validation_number = math.floor(len(random_negative)*0.8)


training_positive = random_possitive[:possitive_training_number]
training_negative = random_negative[:negative_training_number]
training_set = training_positive[:] + training_negative[:]
random.shuffle(training_set)

validation_positive = random_possitive[possitive_training_number:possitive_validation_number]
validation_negative = random_negative[negative_training_number:negative_validation_number]
validation_set = validation_positive[:] + validation_negative[:]
random.shuffle(validation_set)

testing_positive = random_possitive[possitive_validation_number:]
testing_negative = random_negative[negative_validation_number:]
testing_set = testing_positive[:] + testing_negative[:]
random.shuffle(testing_set)

### Prepare the following variants of the documents:
   1. full text of the document
   1. randomly selected 10% of the lines of the document
   1. randomly selected 10 lines of the document
   1. randomly selected 1 line of the document
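The sampled variants can be illustrated on a toy document. This is a sketch under assumptions: `sample_lines` is a hypothetical helper that, unlike plain `random.sample` on the lines, samples line indices and sorts them so the chosen lines keep their original order; the seed is arbitrary.

```python
import math
import random

def sample_lines(text, count, rng):
    """Pick up to `count` distinct lines at random, keeping document order."""
    lines = text.split('\n')
    count = min(count, len(lines))  # short documents are returned whole
    chosen = sorted(rng.sample(range(len(lines)), count))
    return [lines[i] for i in chosen]

rng = random.Random(0)
doc = '\n'.join('line {}'.format(i) for i in range(30))  # toy 30-line document

ten_percent = sample_lines(doc, math.ceil(30 * 0.1), rng)  # 10% of the lines
ten_lines = sample_lines(doc, 10, rng)                     # 10 lines
one_line = sample_lines(doc, 1, rng)                       # 1 line
print(len(ten_percent), len(ten_lines), len(one_line))
```

Whether line order matters depends on the classifier: a bag-of-words TF•IDF model ignores it, while a model that reads the text sequentially may not.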

In [43]:
def get_lines(text, number):
    lines = text.split('\n')
    if len(lines) < number:
        return lines
    return random.sample(lines,number)

def get_line_percentage(text, fraction):
    lines = text.split('\n')
    return random.sample(lines, math.ceil(len(lines)*fraction))

def prepare_a(data_set):
    return [(" ".join(text.split('\n')), result) for (text, result) in data_set]

def prepare_b(data_set):
    return [(" ".join(get_line_percentage(text, 0.1)), result) for (text, result) in data_set]

def prepare_c(data_set):
    return [(" ".join(get_lines(text, 10)), result) for (text, result) in data_set]

def prepare_d(data_set):
    return [(" ".join(get_lines(text, 1)), result) for (text, result) in data_set]

training_set_a = prepare_a(training_set)
training_set_b = prepare_b(training_set)
training_set_c = prepare_c(training_set)
training_set_d = prepare_d(training_set)

validation_set_a = prepare_a(validation_set)
validation_set_b = prepare_b(validation_set)
validation_set_c = prepare_c(validation_set)
validation_set_d = prepare_d(validation_set)

testing_set_a = prepare_a(testing_set)
testing_set_b = prepare_b(testing_set)
testing_set_c = prepare_c(testing_set)
testing_set_d = prepare_d(testing_set)

In [44]:
def prepare_fast_text_file(data_set, label, set_type):
    with open('{}-{}.txt'.format(label, set_type), 'w', encoding='utf8') as f:
        f.write('\n'.join(['{} {}'.format(y,x) for (x,y) in data_set]))

def prepare_fast_text(training_set, validation_set, testing_set, label):
    prepare_fast_text_file(training_set, label, "training")
    prepare_fast_text_file(validation_set, label, "validation")
    prepare_fast_text_file(testing_set, label, "testing")

prepare_fast_text(training_set_a, validation_set_a, testing_set_a, 'a')
prepare_fast_text(training_set_b, validation_set_b, testing_set_b, 'b')
prepare_fast_text(training_set_c, validation_set_c, testing_set_c, 'c')
prepare_fast_text(training_set_d, validation_set_d, testing_set_d, 'd')

### Train the following classifiers on the documents:

   1. SVM with TF•IDF
   1. Fasttext
   1. Flair with Polish language model
   
### Report Precision, Recall and F1 for each variant of the experiment (12 variants altogether).

In [61]:
def metric(results, test_set):
    labels = [y for (x,y) in test_set]
    tp = 0
    tn = 0
    fn = 0
    fp = 0
    for (predict, label) in zip(results, labels):
        if label == '__label__changing':
            if label == predict:
                tp = tp + 1
            else:
                fn = fn + 1
        else:
            if label == predict:
                tn = tn + 1
            else:
                fp = fp + 1
                
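    # '__label__changing' is the positive class; guard against division by zero
    recall = tp/(tp+fn) if (tp+fn) > 0 else 0.0
    precision = tp/(tp + fp) if (tp + fp) > 0 else 0.0
    f1 = (2 * precision * recall)/(precision + recall) if (precision + recall) > 0 else 0.0
    # Note the order of the returned tuple: recall first, then precision, then F1
    return (recall, precision, f1)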

In [67]:
def print_metric(label, metric):
    print(label)
    print('recall: {},\nprecision: {},\nf1: {}'.format(metric[0], metric[1], metric[2]))

In [63]:
def svm_tf_idf(training_set, test_set):
    pipeline = Pipeline([
        ('tfidf', TfidfVectorizer()),
        ('clf', OneVsRestClassifier(LinearSVC())),
    ])
    parameters = {
        'tfidf__max_df': (0.25, 0.5, 0.75),
        'tfidf__ngram_range': [(1, 1), (1, 2), (1, 3)],
        "clf__estimator__C": [0.01, 0.1, 1],
        "clf__estimator__class_weight": ['balanced', None],
    }
    grid_search_tune = GridSearchCV(pipeline, parameters, cv=2, n_jobs=8, verbose=3)
    grid_search_tune.fit([x for (x,y) in training_set], [y for (x,y) in training_set])
    return grid_search_tune.best_estimator_.predict([x for (x,y) in test_set])

In [106]:
def fast_text(testing_set, label):
    classifier = fastText.train_supervised('{}-training.txt'.format(label))
    # Predictions are scored with metric() instead of fastText's own test_label(),
    # whose precision/recall figures might be wrong (see Hints).
    return [classifier.predict(x)[0][0] for (x,y) in testing_set]

In [107]:
def test(training_set,testing_set, label):
    svm_tf_idf_metric = metric(svm_tf_idf(training_set, testing_set),testing_set)
    fast_text_metric = metric(fast_text(testing_set, label), testing_set)
    
    print_metric("svm_tf_idf", svm_tf_idf_metric)
    print_metric("fast_text", fast_text_metric)

In [109]:
test(training_set_a, testing_set_a, 'a')

Fitting 2 folds for each of 54 candidates, totalling 108 fits


[Parallel(n_jobs=8)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=8)]: Done  16 tasks      | elapsed:   41.2s
[Parallel(n_jobs=8)]: Done 108 out of 108 | elapsed:  4.8min finished


svm_tf_idf
recall: 0.965034965034965,
precision: 0.8961038961038961,
f1: 0.9292929292929293
fast_text
recall: 1.0,
precision: 0.6111111111111112,
f1: 0.7586206896551725


In [110]:
test(training_set_b, testing_set_b, 'b')

Fitting 2 folds for each of 54 candidates, totalling 108 fits


[Parallel(n_jobs=8)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=8)]: Done  16 tasks      | elapsed:    4.6s
[Parallel(n_jobs=8)]: Done 108 out of 108 | elapsed:   33.1s finished


svm_tf_idf
recall: 0.8741258741258742,
precision: 0.8169934640522876,
f1: 0.8445945945945946
fast_text
recall: 0.972027972027972,
precision: 0.634703196347032,
f1: 0.7679558011049723


In [111]:
test(training_set_c, testing_set_c, 'c')

Fitting 2 folds for each of 54 candidates, totalling 108 fits


[Parallel(n_jobs=8)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=8)]: Done  16 tasks      | elapsed:    0.9s
[Parallel(n_jobs=8)]: Done 108 out of 108 | elapsed:    5.7s finished


svm_tf_idf
recall: 0.8181818181818182,
precision: 0.7905405405405406,
f1: 0.8041237113402062
fast_text
recall: 1.0,
precision: 0.6137339055793991,
f1: 0.7606382978723404


In [112]:
test(training_set_d, testing_set_d, 'd')

Fitting 2 folds for each of 54 candidates, totalling 108 fits


[Parallel(n_jobs=8)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=8)]: Done  16 tasks      | elapsed:    0.2s
[Parallel(n_jobs=8)]: Done 108 out of 108 | elapsed:    0.9s finished


svm_tf_idf
recall: 0.7902097902097902,
precision: 0.6848484848484848,
f1: 0.7337662337662337
fast_text
recall: 1.0,
precision: 0.6033755274261603,
f1: 0.7526315789473684
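A fastText recall of 1.0 with precision near 0.60 is what a classifier that labels every test document as amending would score. The sketch below checks this, assuming the class counts printed earlier (713 amending, 466 non-amending) and the 80% cut used for the testing group; `always_positive_scores` is a hypothetical helper.

```python
import math

def always_positive_scores(n_positive, n_negative):
    """Scores of a classifier that predicts the positive class for every document."""
    tp, fp, fn = n_positive, n_negative, 0
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    f1 = 2 * precision * recall / (precision + recall)
    return recall, precision, f1

# Testing-group sizes implied by the counts above: everything past the 80% mark.
test_positive = 713 - math.floor(713 * 0.8)
test_negative = 466 - math.floor(466 * 0.8)

recall, precision, f1 = always_positive_scores(test_positive, test_negative)
print(test_positive, test_negative)
print(recall, precision, f1)
```

The resulting precision (about 0.603) and F1 (about 0.753) agree with the fastText figures for variant d, which suggests that the model trained with default settings predicts `__label__changing` for practically every document there. The validation group, which the runs above leave unused, could serve to tune fastText's hyperparameters.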


## Hints


1. Application of the SVM classifier with TF•IDF is described in 
   [David Batista's](http://www.davidsbatista.net/blog/2017/04/01/document_classification/) blog post.
1. [Fasttext](https://fasttext.cc/) is a popular baseline classifier. Don't report the Precision/Recall/F1 provided by
   Fasttext since they might be [wrong](https://github.com/facebookresearch/fastText/issues/261).
1. [Flair](https://towardsdatascience.com/text-classification-with-state-of-the-art-nlp-library-flair-b541d7add21f) 
   is another library for text processing. Flair classification is based on a language model.
1. [Speech and Language Processing](https://web.stanford.edu/~jurafsky/slp3/) by Jurafsky and Martin 
   has a [chapter](https://web.stanford.edu/~jurafsky/slp3/4.pdf) devoted to the problem of classification.