# Text classification

The task concentrates on content-based text classification.


## Tasks

### Divide the set of bills into two exclusive sets:
   1. the set of bills amending other bills (their title starts with `o zmianie ustawy`),
   1. the set of bills not amending other bills.

### Change the contents of the bill by removing the date of publication and the title (so the words `o zmianie ustawy` are removed).

In [281]:
import regex
import os
import requests
from collections import Counter
from operator import add
import functools
import random
import math

In [2]:
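def filesNames():
    # Return (absolute path, file name) pairs for every file in the corpus directory.
    path = '../ustawy'
    absolute_path = os.path.realpath(path)
    # os.path.join picks the right separator on every platform.
    return [(os.path.join(absolute_path, filename), filename) for filename in os.listdir(path)]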

def getFileTextRaw(filename):
    with open(filename, 'r', encoding="utf8") as content_file:
        return content_file.read()

In [274]:
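def splitTitle(text):
    # The header (publication date and title) is everything before the first "Art. 1" or "Rozdział 1".
    search = regex.search(r'((Art.)|(Rozdział))(\s+1)',text)
    if search is None:
        return None
    return text[:search.start()]

def splitByTemp(text):
    # Returns None when the header does not mention amending a bill ("o zmianie ustawy" and similar).
    search = regex.search(r'(zmianie|zmieniająca)(.|\n)*(ustaw|ustawy)',text)
    if search is None:
        return None
    return text[:search.start()]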

possitive = []
negative = []

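for (path, filename) in filesNames():
    text = getFileTextRaw(path)
    title = splitTitle(text)
    if title is None:
        print("Not found for: " + filename)
    else:
        # Keep only the body, so the date and title (and with them the
        # words "o zmianie ustawy") are removed from the stored document.
        body = text[len(title):]
        result = splitByTemp(title)
        if result is None:
            negative.append((body,0))
        else:
            possitive.append((body,1))
print(len(possitive), len(negative))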

Not found for: 1996_400.txt
713 466


### Split the sets of documents into the following groups by randomly selecting the documents:
   1. 60% training
   1. 20% validation
   1. 20% testing
   
### Do not change these groups during the following experiments.

In [286]:
random_possitive = possitive[:]
random_negative = negative[:]
random.shuffle(random_possitive)
random.shuffle(random_negative)

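# Cut-off indices: documents before the first index go to training,
# those between the two indices to validation, and the rest to testing.
possitive_training_number = math.floor(len(random_possitive)*0.6)
possitive_validation_number = math.floor(len(random_possitive)*0.8)

negative_training_number = math.floor(len(random_negative)*0.6)
negative_validation_number = math.floor(len(random_negative)*0.8)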


training_positive = random_possitive[:possitive_training_number]
training_negative = random_negative[:negative_training_number]
training_set = (training_positive, training_negative)

validation_positive = random_possitive[possitive_training_number:possitive_validation_number]
validation_negative = random_negative[negative_training_number:negative_validation_number]
validation_set = (validation_positive, validation_negative)

testing_positive = random_possitive[possitive_validation_number:]
testing_negative = random_negative[negative_validation_number:]
testing_set = (testing_positive, testing_negative)
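
Since the groups must stay the same across all experiments, seeding the shuffle may help when the notebook is re-run. The following is only a minimal sketch: `split_60_20_20` is a hypothetical helper and the seed value is an arbitrary assumption, neither is part of the cells above.

```python
import math
import random

def split_60_20_20(items, seed=0):
    """Shuffle a copy of items reproducibly and cut it into 60/20/20 parts.

    Hypothetical helper; the default seed is an arbitrary assumption.
    """
    shuffled = items[:]
    # A dedicated Random instance leaves the global random state untouched.
    random.Random(seed).shuffle(shuffled)
    train_end = math.floor(len(shuffled) * 0.6)
    valid_end = math.floor(len(shuffled) * 0.8)
    return shuffled[:train_end], shuffled[train_end:valid_end], shuffled[valid_end:]

# Toy usage on ten placeholder items instead of real documents.
train, valid, test = split_60_20_20(list(range(10)))
```

Calling the helper once per class (amending and non-amending bills) would keep the class proportions equal in all three groups, as the cell above does.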

### Prepare the following variants of the documents:
   1. full text of the document
   1. randomly selected 10% of the lines of the document
   1. randomly selected 10 lines of the document
   1. randomly selected 1 line of the document
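
One possible way to build the four variants listed above is sketched here. `make_variants` is a hypothetical helper; skipping blank lines, rounding 10% up to at least one line, and capping the sample at the document length are assumptions, not requirements stated in the task.

```python
import math
import random

def make_variants(text, rng):
    """Return the four document variants as a dict of strings.

    Hypothetical helper; blank lines are ignored and short documents
    simply contribute all the lines they have (both are assumptions).
    """
    lines = [line for line in text.splitlines() if line.strip()]

    def sample(k):
        # Never ask for more lines than the document has.
        k = min(k, len(lines))
        # Sampled lines come back in random order, without repetition.
        return "\n".join(rng.sample(lines, k))

    return {
        "full": text,
        "10_percent": sample(max(1, math.ceil(len(lines) / 10))),
        "10_lines": sample(10),
        "1_line": sample(1),
    }

# Toy usage on a synthetic 50-line document.
rng = random.Random(0)
doc = "\n".join("line %d" % i for i in range(50))
variants = make_variants(doc, rng)
```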

### Train the following classifiers on the documents:

   1. SVM with TF•IDF
   1. Fasttext
   1. Flair with Polish language model
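
All three classifiers consume the same `(text, label)` pairs produced earlier, but each expects its own input format. As one example, fastText's supervised mode reads a plain-text file with one example per line and the label prefixed with `__label__`. The sketch below only prepares such a file; the helper names `to_fasttext_line` and `write_fasttext_file` are hypothetical.

```python
def to_fasttext_line(text, label):
    """Format one (text, label) pair as a fastText training line."""
    # fastText expects one example per line, so all whitespace runs
    # (including newlines) are collapsed into single spaces.
    return "__label__%d %s" % (label, " ".join(text.split()))

def write_fasttext_file(examples, path):
    """Write an iterable of (text, label) pairs to a fastText input file."""
    with open(path, "w", encoding="utf8") as out:
        for text, label in examples:
            out.write(to_fasttext_line(text, label) + "\n")

# Toy usage on a made-up two-line fragment.
sample_line = to_fasttext_line("Art. 1.\nW ustawie  wprowadza się zmiany", 1)
```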
   
### Report Precision, Recall and F1 for each variant of the experiment (12 variants altogether).
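
Computing the three metrics directly from gold and predicted labels keeps the numbers comparable across classifiers. A minimal sketch, assuming binary labels with `1` for amending bills as in the cells above; `precision_recall_f1` is a hypothetical helper, and returning `0.0` for an undefined ratio is a convention chosen here.

```python
def precision_recall_f1(gold, predicted, positive=1):
    """Return (precision, recall, F1) for the given positive class."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == positive and p == positive)
    fp = sum(1 for g, p in pairs if g != positive and p == positive)
    fn = sum(1 for g, p in pairs if g == positive and p != positive)
    # Undefined ratios (empty denominators) are reported as 0.0.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Toy usage: two true positives, one false positive, one false negative.
gold = [1, 1, 1, 0, 0]
predicted = [1, 1, 0, 1, 0]
p, r, f = precision_recall_f1(gold, predicted)
```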

## Hints


1. Applying an SVM classifier with TF•IDF is described in
   [David Batista's](http://www.davidsbatista.net/blog/2017/04/01/document_classification/) blog post.
1. [Fasttext](https://fasttext.cc/) is a popular baseline classifier. Don't report the Precision/Recall/F1 provided by
   Fasttext since they might be [wrong](https://github.com/facebookresearch/fastText/issues/261).
1. [Flair](https://towardsdatascience.com/text-classification-with-state-of-the-art-nlp-library-flair-b541d7add21f) 
   is another library for text processing. Flair classification is based on a language model.
1. [Speech and Language Processing](https://web.stanford.edu/~jurafsky/slp3/) by Jurafsky and Martin 
   has a [chapter](https://web.stanford.edu/~jurafsky/slp3/4.pdf) devoted to the problem of classification.