# Classifier training on judgment data

Based on the work of Mochales Palau and Moens [1], two classifiers are trained on the judgment corpus.  
The first classifier decides whether a text is a subsumption/definition or something else. (This corresponds to the detection of argumentative text in the base paper.)  
The second classifier decides whether a sentence belongs to a subsumption or a definition. The decision of the first classifier is fed into the second classification.  
&nbsp;  
&nbsp;  
<sub>[1] Palau, Raquel Mochales, and Marie-Francine Moens. "Argumentation mining: the detection, classification and structure of arguments in text." Proceedings of the 12th international conference on artificial intelligence and law. 2009.</sub>

In [1]:
# all imports for the notebook

# data handling
import json
import numpy as np
import pandas as pd

# visualisation
import matplotlib.pyplot as plt
import seaborn as sns

# natural language processing
from somajo import SoMaJo # https://github.com/tsproisl/SoMaJo
import treetaggerwrapper

# machine learning
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import KFold
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
from sklearn.svm import LinearSVC
from sklearn.feature_extraction.text import TfidfVectorizer

# system paths
import os

# miscellaneous
import pprint
import itertools
import re
import pickle
import warnings
warnings.filterwarnings('ignore') 

## Data Preparation 1

In this step the judgment JSON files are loaded and all unnecessary data is dropped.  
Furthermore, the text is split into sentences and each sentence is assigned a label.  

Afterwards, the corpus is flattened for the first round of feature engineering. Since the first classification is binary, the labels are reduced to two classes, interesting/not interesting, for later classification.  
Finally, the corpus is split into train and test data with an 80/20 split.

In [2]:
path = 'thesis_corpus/'

# flatten corpus for generating basic features and extract flattened labels
corpus = []
corpus_labels = []

for file in os.listdir(path):
    with open(path+file, encoding='utf-8') as f:
        data = json.load(f)
        reasons = data['decision_text']['decision_reasons']
    
    for reason in reasons:
        for sentence in reason:
            corpus.append(sentence[0])
            if sentence[1] in ['definition', 'subsumption']:
                corpus_labels.append('interest')
            else:
                corpus_labels.append('no_interest')
        
X = np.array(corpus)
y = np.array(corpus_labels)
                
# train/test split with 80/20 split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
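
The loading loop above implies a particular nesting of the JSON files: `decision_text` → `decision_reasons` → paragraphs → `[sentence, label]` pairs. The following sketch reproduces the flattening and the binary label mapping on a hand-made miniature. The example sentences and the helper name `flatten` are invented for illustration and are not part of the corpus or the notebook.

```python
# Hypothetical miniature of one judgment file; the real files are assumed
# to follow the same nesting as accessed in the loading loop.
judgment = {
    "decision_text": {
        "decision_reasons": [
            [["Die Klage ist zulässig.", "other"],
             ["Ein Mangel liegt vor, wenn die Sache nicht die vereinbarte Beschaffenheit hat.", "definition"]],
            [["Das Fahrzeug wies diese Beschaffenheit nicht auf.", "subsumption"]],
        ]
    }
}

INTERESTING = {"definition", "subsumption"}


def flatten(judgment):
    """Return parallel lists of sentences and binary labels."""
    sentences, labels = [], []
    for paragraph in judgment["decision_text"]["decision_reasons"]:
        for text, label in paragraph:
            sentences.append(text)
            labels.append("interest" if label in INTERESTING else "no_interest")
    return sentences, labels


sentences, labels = flatten(judgment)
print(labels)  # ['no_interest', 'interest', 'interest']
```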

## Feature Engineering 1

The first classifier needs the following features:
- Unigrams
- Bigrams
- Trigrams
- Adverbs
- Verbs
- Modal Auxiliary
- Word Couples
- Text Statistics
- Punctuation

In this step, classes are defined that extract the necessary features.  
For unigrams, bigrams and trigrams the CountVectorizer implementation from scikit-learn is used.
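
To make the n-gram features concrete, the sketch below builds word unigrams, bigrams and trigrams for one sentence in plain Python. It only illustrates what the three n-gram ranges mean. It is not how CountVectorizer works internally: CountVectorizer tokenises with a regular expression and by default ignores single-character tokens, whereas this sketch splits on whitespace. The function name `word_ngrams` and the example sentence are made up.

```python
def word_ngrams(sentence, n):
    """Return the word n-grams of a whitespace-tokenised sentence."""
    tokens = sentence.split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


sentence = "Der Kläger hat keinen Anspruch"
unigrams = word_ngrams(sentence, 1)
bigrams = word_ngrams(sentence, 2)
trigrams = word_ngrams(sentence, 3)

print(bigrams)
# ['Der Kläger', 'Kläger hat', 'hat keinen', 'keinen Anspruch']
```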

### Extracting Adverbs, Verbs and Modal Auxiliary

Adverbs, verbs and modal auxiliaries are extracted with the TreeTagger (https://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/).

In [3]:
class POS_extractor:  
    
    # ADV for adverbs, VV for verbs, VM for modal auxiliaries
    # learn a vocabulary of all chosen POS
    def fit(corpus, POS_abbreviation):
        tagger = treetaggerwrapper.TreeTagger(TAGLANG='de')
        POS = []
        for item in corpus:
            tags = tagger.tag_text(item)
            tags2 = treetaggerwrapper.make_tags(tags)

            for word in tags2:
                if type(word) == treetaggerwrapper.Tag:
                    if POS_abbreviation in word[1]:
                        POS.append(word[0])
        return list(set(POS))

    # transforms sentences to sentence term matrix
    def transform(sentences, vocabluary, POS_abbreviation):
        POS_vec = []
        
        for sentence in sentences:
            sent_vec = []
            
            # compose binary modal auxiliary vector for each sentence
            if POS_abbreviation == 'VM':
                if any(mod in sentence for mod in vocabluary):
                    POS_vec.append([1])
                else:
                    POS_vec.append([-1])
            
            # compose adverbs/verbs vector for each sentence
            else:
                for i in range(len(vocabluary)):
                    if vocabluary[i] in sentence:
                        sent_vec.append(1)
                    else:
                        sent_vec.append(0)
                POS_vec.append(sent_vec)
        return POS_vec
            
# sources:
# https://treetaggerwrapper.readthedocs.io/en/latest/
# https://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/
# https://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/data/stts_guide.pdf
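
One property of `transform` is worth keeping in mind: because each sentence is a plain string, the expression `vocabluary[i] in sentence` is a substring test, not a token test. The sketch below contrasts the two on an invented sentence and an invented two-word vocabulary. Whether this affects the reported scores was not measured here.

```python
vocabulary = ["kann", "muss"]  # hypothetical modal-verb vocabulary
sentence = "Der Sachverhalt ist dem Gericht bekannt ."

# Substring test, as in transform: 'kann' is found inside 'bekannt'.
substring_hit = any(word in sentence for word in vocabulary)

# Token test: compare against whole whitespace-separated tokens only.
tokens = set(sentence.split())
token_hit = any(word in tokens for word in vocabulary)

print(substring_hit, token_hit)  # True False
```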

### Word Couple Feature

This feature had to be excluded for computational reasons. With a simple 80/20 train/test split, a vocabulary of 2,105,114 word couples was created. With 5,306 sentences in the training corpus, computing the sentence-couple matrix is not feasible.
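
A rough back-of-the-envelope calculation shows the scale of the problem. A sentence with n tokens yields n·(n−1) ordered word couples, and a dense sentence-couple matrix needs one cell per sentence and vocabulary entry. The 8 bytes per cell below are an assumption (a 64-bit number); Python lists of integers, as used in the notebook, need more.

```python
from itertools import permutations


def couple_count(sentence):
    """Number of ordered word pairs of one sentence: n * (n - 1) for n tokens."""
    tokens = sentence.split(" ")
    return len(list(permutations(tokens, 2)))


n_pairs = couple_count("a b c d")  # 4 tokens -> 4 * 3 = 12 pairs

vocab_size = 2_105_114  # figure quoted in the text above
n_sentences = 5_306     # figure quoted in the text above
cells = vocab_size * n_sentences
dense_gib = cells * 8 / 2**30  # assumption: 8 bytes per matrix entry

print(n_pairs)  # 12
print(cells)    # 11169734884
```

Under that assumption the dense matrix would take roughly 83 GiB.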

In [3]:
class Word_Couple:
    
    # learn a vocabulary of all word couples
    def fit(sentences):
        word_couples = []
        
        for sentence in sentences:
            # build permutations of 2 over the sentence that is converted to a list
            for word in itertools.permutations(sentence.split(' '), 2):
                word_couples.append(word)

        word_couples = list(set(word_couples))
        word_couples = sorted(word_couples, key=lambda tup: (tup[0],tup[1]))
        
        return word_couples
    
    # transform the input text into a couple sentence matrix    
    def transform(sentences, vocabulary):
        word_couples_vec = []
        
        for sentence in sentences:
            word_couples_sent_vec = []
            couples_sent = list(itertools.permutations(sentence.split(' '), 2))

            # compose word couple vector for each sentence
            for couple in vocabulary:
                if couple in couples_sent:
                    word_couples_sent_vec.append(1)
                else:
                    word_couples_sent_vec.append(0)
            word_couples_vec.append(word_couples_sent_vec)
            
        return word_couples_vec  
        
# sources:
# https://stackoverflow.com/questions/942543/operation-on-every-pair-of-element-in-a-list
# https://stackoverflow.com/questions/3121979/how-to-sort-a-list-tuple-of-lists-tuples-by-the-element-at-a-given-index

In [10]:
coup = Word_Couple
vocab = coup.fit(X_train)
# pass a one-element slice so that transform iterates over sentences, not characters
train = coup.transform(X_train[:1], vocab)

In [11]:
print(len(train[0]))

2118453


### Extracting Sentence Statistics

This class extracts the sentence length, the average word length and the number of punctuation marks.

In [4]:
class Sentence_Statistics:
    
    # extract all necessary information from the corpus
    def transform(sentences):
        sentence_statistics = []

        for sentence in sentences:
            sent_length = len(sentence.split(' '))
            num_punctuation = len((re.findall(r'[^äöüÄÖÜa-zA-Z0-9%€&$#\+ ]', sentence)))
            len_words = 0

            for item in sentence.split(' '):
                len_words += len(item)
            avg_word_length = len_words/len(sentence.split(' '))

            sentence_statistics.append([sent_length, avg_word_length, num_punctuation])
        return sentence_statistics
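
As a worked example, the sketch below applies the same three statistics to two invented sentences. It copies the character class from `Sentence_Statistics` into a standalone helper (`sentence_stats` is a made-up name). Note that `ß` is not part of the character class, so it is counted as a punctuation mark.

```python
import re

# Same character class as in Sentence_Statistics, copied for a standalone demo.
NON_WORD = re.compile(r'[^äöüÄÖÜa-zA-Z0-9%€&$#\+ ]')


def sentence_stats(sentence):
    """Return [sentence length, average word length, number of punctuation marks]."""
    tokens = sentence.split(' ')
    avg_word_length = sum(len(t) for t in tokens) / len(tokens)
    return [len(tokens), avg_word_length, len(NON_WORD.findall(sentence))]


# 7 tokens, 41 characters in total, 3 marks (two commas and a full stop)
stats = sentence_stats("Der Kläger hat, wie dargelegt, keinen Anspruch.")

# 'ß' falls outside the character class and is counted like '§' and '.'
stats_eszett = sentence_stats("Das ist gemäß § 433 BGB so.")

print(NON_WORD.findall("gemäß"))  # ['ß']
```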

### Extracting Punctuation Sequence

In this function the punctuation sequence is extracted and encoded.  
The following encoding is used:
- \+ --> 0
- . --> 1 
- ! --> 2
- ? --> 3
- , --> 4
- ; --> 5
- : --> 6
- / --> 7
- § --> 8

Only the most common punctuation marks are considered, owing to the encoding.

In [5]:
class Punctuation_Sequence:
    
    def transform(sentences):

        punctuation_vec = []

        for sentence in sentences:
            sentence = re.sub(r'([^äöüÄÖÜa-zA-Z0-9%€&$#\+ ])[^äöüÄÖÜa-zA-Z0-9%€&$#\+ ]+', r'\1+', sentence)
            sentence = re.findall(r'[\.\!\?,;:/§\+]', sentence)
            sentence = ''.join(sentence)

            sentence = re.sub(r'\+', '0', sentence)
            sentence = re.sub(r'\.', '1', sentence)
            sentence = re.sub(r'\!', '2', sentence)
            sentence = re.sub(r'\?', '3', sentence)
            sentence = re.sub(r',', '4', sentence)
            sentence = re.sub(r';', '5', sentence)
            sentence = re.sub(r':', '6', sentence)
            sentence = re.sub(r'/', '7', sentence)
            sentence = re.sub(r'§', '8', sentence)
            
            if sentence == '':
                punctuation_vec.append([-1])
            else:
                punctuation_vec.append([int(sentence)])
            
        return punctuation_vec
    

# source: https://note.nkmk.me/en/python-str-replace-translate-re-sub/
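
The encoding can be traced on a few invented sentences. The sketch below is a compact standalone variant of `Punctuation_Sequence.transform` for a single sentence. It uses `str.translate` instead of the chain of `re.sub` calls, which is a substitution made here for brevity; the mapping itself is the one listed above.

```python
import re

NON_WORD = r'[^äöüÄÖÜa-zA-Z0-9%€&$#\+ ]'
CODES = str.maketrans({'+': '0', '.': '1', '!': '2', '?': '3', ',': '4',
                       ';': '5', ':': '6', '/': '7', '§': '8'})


def encode_punctuation(sentence):
    """Encode the punctuation sequence of one sentence as an integer (-1 if none)."""
    # Collapse a run of two or more marks into its first mark followed by '+'.
    collapsed = re.sub(f'({NON_WORD}){NON_WORD}+', r'\1+', sentence)
    marks = ''.join(re.findall(r'[.!?,;:/§+]', collapsed))
    return int(marks.translate(CODES)) if marks else -1


print(encode_punctuation("Ist das so? Nein, vgl. § 5."))  # 34181
print(encode_punctuation("Wirklich?!"))                   # 30
print(encode_punctuation("Kein Zeichen"))                 # -1
```

Each sentence is thus represented by a single number whose magnitude depends on the order and count of its punctuation marks.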

## Model Training and Evaluation 1

The first model is a maximum entropy model, another name for logistic regression.  
The model is evaluated on the initial 80/20 train/test split.  
Furthermore, five-fold cross-validation is performed on the training set to get clearer insights into the data. In the scikit-learn implementation of KFold, 5 is the default number of folds; therefore, five folds are chosen.
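
The splitting scheme can be sketched in plain Python. With its default settings KFold produces five contiguous folds without shuffling. The helper `kfold_indices` below imitates that behaviour for illustration and is not the scikit-learn implementation. The per-fold scores are invented; the sketch also shows averaging by the number of folds instead of a hard-coded 5.

```python
def kfold_indices(n_samples, n_splits=5):
    """Yield (train, test) index lists for contiguous, unshuffled folds."""
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        # the first `extra` folds get one additional sample
        size = base + (1 if fold < extra else 0)
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size


folds = list(kfold_indices(10, n_splits=5))
print(folds[0])  # ([2, 3, 4, 5, 6, 7, 8, 9], [0, 1])

scores = [0.8, 0.9, 0.7, 0.85, 0.75]  # hypothetical per-fold scores
mean_score = sum(scores) / len(folds)
```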

In [8]:
kf = KFold()

acc = 0
pre = 0
rec = 0
f_1 = 0

for train_index, test_index in kf.split(X_train):
    X_training, X_testing = X_train[train_index], X_train[test_index]
    y_training, y_testing = y_train[train_index], y_train[test_index]
    
    vectorizer_uni = CountVectorizer(analyzer='word', lowercase=False, ngram_range=(1, 1))
    unigrams_train = vectorizer_uni.fit_transform(X_training)
    unigrams_test = vectorizer_uni.transform(X_testing)

    clf_uni = LogisticRegression(max_iter=1000).fit(unigrams_train, y_training)
    uni_predict = clf_uni.predict(unigrams_test)

    acc += metrics.accuracy_score(y_testing, uni_predict)   
    metric = metrics.precision_recall_fscore_support(y_testing, uni_predict, average='macro')
    pre += metric[0]
    rec += metric[1]
    f_1 += metric[2]
    print(metrics.classification_report(y_testing, uni_predict))
    
print('Precision: ' + str(pre/5))
print('Recall: ' + str(rec/5))
print('F1-Score: ' + str(f_1/5))
print('Accuracy: ' + str(acc/5))

              precision    recall  f1-score   support

    interest       0.90      0.93      0.91      3001
 no_interest       0.78      0.69      0.73      1031

    accuracy                           0.87      4032
   macro avg       0.84      0.81      0.82      4032
weighted avg       0.87      0.87      0.87      4032

              precision    recall  f1-score   support

    interest       0.90      0.93      0.91      2961
 no_interest       0.78      0.71      0.74      1071

    accuracy                           0.87      4032
   macro avg       0.84      0.82      0.83      4032
weighted avg       0.87      0.87      0.87      4032

              precision    recall  f1-score   support

    interest       0.90      0.93      0.91      2996
 no_interest       0.77      0.71      0.74      1036

    accuracy                           0.87      4032
   macro avg       0.84      0.82      0.83      4032
weighted avg       0.87      0.87      0.87      4032

              preci

In [9]:
# bigrams

kf = KFold()

acc = 0
pre = 0
rec = 0
f_1 = 0

for train_index, test_index in kf.split(X_train):
    X_training, X_testing = X_train[train_index], X_train[test_index]
    y_training, y_testing = y_train[train_index], y_train[test_index]
    
    vectorizer_bi = CountVectorizer(analyzer='word', lowercase=False, ngram_range=(2, 2))
    bigrams_train = vectorizer_bi.fit_transform(X_training)
    bigrams_test = vectorizer_bi.transform(X_testing)

    clf_bi = LogisticRegression().fit(bigrams_train, y_training)
    bi_predict = clf_bi.predict(bigrams_test)

    acc += metrics.accuracy_score(y_testing, bi_predict)   
    metric = metrics.precision_recall_fscore_support(y_testing, bi_predict, average='macro')
    pre += metric[0]
    rec += metric[1]
    f_1 += metric[2]
    print(metrics.classification_report(y_testing,  bi_predict))
    
print('Precision: ' + str(pre/5))
print('Recall: ' + str(rec/5))
print('F1-Score: ' + str(f_1/5))
print('Accuracy: ' + str(acc/5))

              precision    recall  f1-score   support

    interest       0.89      0.91      0.90      3001
 no_interest       0.71      0.66      0.69      1031

    accuracy                           0.84      4032
   macro avg       0.80      0.79      0.79      4032
weighted avg       0.84      0.84      0.84      4032

              precision    recall  f1-score   support

    interest       0.89      0.89      0.89      2961
 no_interest       0.70      0.68      0.69      1071

    accuracy                           0.84      4032
   macro avg       0.79      0.79      0.79      4032
weighted avg       0.84      0.84      0.84      4032

              precision    recall  f1-score   support

    interest       0.89      0.90      0.89      2996
 no_interest       0.69      0.68      0.68      1036

    accuracy                           0.84      4032
   macro avg       0.79      0.79      0.79      4032
weighted avg       0.84      0.84      0.84      4032

              preci

In [10]:
# trigrams

kf = KFold()

acc = 0
pre = 0
rec = 0
f_1 = 0

for train_index, test_index in kf.split(X_train):
    X_training, X_testing = X_train[train_index], X_train[test_index]
    y_training, y_testing = y_train[train_index], y_train[test_index]
    
    vectorizer_tri = CountVectorizer(analyzer='word', lowercase=False, ngram_range=(3, 3))
    trigrams_train = vectorizer_tri.fit_transform(X_training)
    trigrams_test = vectorizer_tri.transform(X_testing)

    clf_tri = LogisticRegression().fit(trigrams_train, y_training)
    tri_predict = clf_tri.predict(trigrams_test)

    acc += metrics.accuracy_score(y_testing, tri_predict)   
    metric = metrics.precision_recall_fscore_support(y_testing, tri_predict, average='macro')
    pre += metric[0]
    rec += metric[1]
    f_1 += metric[2]
    print(metrics.classification_report(y_testing, tri_predict))
    
print('Precision: ' + str(pre/5))
print('Recall: ' + str(rec/5))
print('F1-Score: ' + str(f_1/5))
print('Accuracy: ' + str(acc/5))

              precision    recall  f1-score   support

    interest       0.81      0.96      0.88      3001
 no_interest       0.76      0.34      0.47      1031

    accuracy                           0.80      4032
   macro avg       0.78      0.65      0.68      4032
weighted avg       0.80      0.80      0.78      4032

              precision    recall  f1-score   support

    interest       0.80      0.96      0.87      2961
 no_interest       0.77      0.34      0.47      1071

    accuracy                           0.80      4032
   macro avg       0.79      0.65      0.67      4032
weighted avg       0.79      0.80      0.77      4032

              precision    recall  f1-score   support

    interest       0.81      0.96      0.88      2996
 no_interest       0.74      0.33      0.46      1036

    accuracy                           0.80      4032
   macro avg       0.78      0.65      0.67      4032
weighted avg       0.79      0.80      0.77      4032

              preci

In [11]:
# adverbs

kf = KFold()

acc = 0
pre = 0
rec = 0
f_1 = 0

for train_index, test_index in kf.split(X_train):
    X_training, X_testing = X_train[train_index], X_train[test_index]
    y_training, y_testing = y_train[train_index], y_train[test_index]
    
    adverb_extractor = POS_extractor
    adverbs_voc = adverb_extractor.fit(X_training, 'ADV')
    adverbs_vec_train = adverb_extractor.transform(X_training, adverbs_voc, 'ADV')
    adverbs_vec_test = adverb_extractor.transform(X_testing, adverbs_voc, 'ADV')

    clf_adv = LogisticRegression().fit(adverbs_vec_train, y_training)
    adv_predict = clf_adv.predict(adverbs_vec_test)

    acc += metrics.accuracy_score(y_testing, adv_predict)   
    metric = metrics.precision_recall_fscore_support(y_testing, adv_predict, average='macro')
    pre += metric[0]
    rec += metric[1]
    f_1 += metric[2]
    print(metrics.classification_report(y_testing, adv_predict))
    

print('Precision: ' + str(pre/5))
print('Recall: ' + str(rec/5))
print('F1-Score: ' + str(f_1/5))
print('Accuracy: ' + str(acc/5))

              precision    recall  f1-score   support

    interest       0.83      0.91      0.87      3001
 no_interest       0.65      0.47      0.54      1031

    accuracy                           0.80      4032
   macro avg       0.74      0.69      0.71      4032
weighted avg       0.79      0.80      0.79      4032

              precision    recall  f1-score   support

    interest       0.82      0.92      0.87      2961
 no_interest       0.68      0.45      0.54      1071

    accuracy                           0.80      4032
   macro avg       0.75      0.69      0.71      4032
weighted avg       0.79      0.80      0.78      4032

              precision    recall  f1-score   support

    interest       0.82      0.92      0.87      2996
 no_interest       0.66      0.43      0.52      1036

    accuracy                           0.80      4032
   macro avg       0.74      0.68      0.70      4032
weighted avg       0.78      0.80      0.78      4032

              preci

In [15]:
# verbs

kf = KFold()

acc = 0
pre = 0
rec = 0
f_1 = 0

for train_index, test_index in kf.split(X_train):
    X_training, X_testing = X_train[train_index], X_train[test_index]
    y_training, y_testing = y_train[train_index], y_train[test_index]
    
    verb_extractor = POS_extractor
    verbs_voc = verb_extractor.fit(X_training, 'VV')
    verbs_vec_train = verb_extractor.transform(X_training, verbs_voc, 'VV')
    verbs_vec_test = verb_extractor.transform(X_testing, verbs_voc, 'VV')

    clf_ver = LogisticRegression(max_iter=1000).fit(verbs_vec_train, y_training)
    ver_predict = clf_ver.predict(verbs_vec_test)

    acc += metrics.accuracy_score(y_testing, ver_predict)   
    metric = metrics.precision_recall_fscore_support(y_testing, ver_predict, average='macro')
    pre += metric[0]
    rec += metric[1]
    f_1 += metric[2]
    print(metrics.classification_report(y_testing, ver_predict))
    
print('Precision: ' + str(pre/5))
print('Recall: ' + str(rec/5))
print('F1-Score: ' + str(f_1/5))
print('Accuracy: ' + str(acc/5))

              precision    recall  f1-score   support

    interest       0.85      0.94      0.89      3001
 no_interest       0.75      0.52      0.61      1031

    accuracy                           0.83      4032
   macro avg       0.80      0.73      0.75      4032
weighted avg       0.82      0.83      0.82      4032

              precision    recall  f1-score   support

    interest       0.85      0.93      0.89      2961
 no_interest       0.74      0.54      0.62      1071

    accuracy                           0.83      4032
   macro avg       0.79      0.74      0.76      4032
weighted avg       0.82      0.83      0.82      4032

              precision    recall  f1-score   support

    interest       0.85      0.93      0.89      2996
 no_interest       0.73      0.52      0.61      1036

    accuracy                           0.83      4032
   macro avg       0.79      0.73      0.75      4032
weighted avg       0.82      0.83      0.82      4032

              preci

In [12]:
# modal auxiliary

kf = KFold()

acc = 0
pre = 0
rec = 0
f_1 = 0

for train_index, test_index in kf.split(X_train):
    X_training, X_testing = X_train[train_index], X_train[test_index]
    y_training, y_testing = y_train[train_index], y_train[test_index]
    
    mod_aux_extractor = POS_extractor
    mod_aux_voc = mod_aux_extractor.fit(X_training, 'VM')
    mod_aux_vec_train = mod_aux_extractor.transform(X_training, mod_aux_voc, 'VM')
    mod_aux_vec_test = mod_aux_extractor.transform(X_testing, mod_aux_voc, 'VM')

    clf_ma = LogisticRegression(max_iter=1000).fit(mod_aux_vec_train, y_training)
    ma_predict = clf_ma.predict(mod_aux_vec_test)

    acc += metrics.accuracy_score(y_testing, ma_predict)   
    metric = metrics.precision_recall_fscore_support(y_testing, ma_predict, average='macro')
    pre += metric[0]
    rec += metric[1]
    f_1 += metric[2]
    print(metrics.classification_report(y_testing, ma_predict))
    
print('Precision: ' + str(pre/5))
print('Recall: ' + str(rec/5))
print('F1-Score: ' + str(f_1/5))
print('Accuracy: ' + str(acc/5))

              precision    recall  f1-score   support

    interest       0.74      1.00      0.85      3001
 no_interest       0.00      0.00      0.00      1031

    accuracy                           0.74      4032
   macro avg       0.37      0.50      0.43      4032
weighted avg       0.55      0.74      0.64      4032

              precision    recall  f1-score   support

    interest       0.73      1.00      0.85      2961
 no_interest       0.00      0.00      0.00      1071

    accuracy                           0.73      4032
   macro avg       0.37      0.50      0.42      4032
weighted avg       0.54      0.73      0.62      4032

              precision    recall  f1-score   support

    interest       0.74      1.00      0.85      2996
 no_interest       0.00      0.00      0.00      1036

    accuracy                           0.74      4032
   macro avg       0.37      0.50      0.43      4032
weighted avg       0.55      0.74      0.63      4032

              preci

In [13]:
# sentence statistics

kf = KFold()

acc = 0
pre = 0
rec = 0
f_1 = 0

for train_index, test_index in kf.split(X_train):
    X_training, X_testing = X_train[train_index], X_train[test_index]
    y_training, y_testing = y_train[train_index], y_train[test_index]
    
    sent_stat_extractor = Sentence_Statistics
    sent_stat_train = sent_stat_extractor.transform(X_training)
    sent_stat_test = sent_stat_extractor.transform(X_testing)

    clf_sent_stat = LogisticRegression().fit(sent_stat_train, y_training)
    sent_stat_predict = clf_sent_stat.predict(sent_stat_test)

    acc += metrics.accuracy_score(y_testing, sent_stat_predict)   
    metric = metrics.precision_recall_fscore_support(y_testing, sent_stat_predict, average='macro')
    pre += metric[0]
    rec += metric[1]
    f_1 += metric[2]
    print(metrics.classification_report(y_testing, sent_stat_predict))
    
print('Precision: ' + str(pre/5))
print('Recall: ' + str(rec/5))
print('F1-Score: ' + str(f_1/5))
print('Accuracy: ' + str(acc/5))

              precision    recall  f1-score   support

    interest       0.81      0.97      0.88      3001
 no_interest       0.79      0.32      0.46      1031

    accuracy                           0.80      4032
   macro avg       0.80      0.65      0.67      4032
weighted avg       0.80      0.80      0.77      4032

              precision    recall  f1-score   support

    interest       0.79      0.97      0.87      2961
 no_interest       0.78      0.30      0.43      1071

    accuracy                           0.79      4032
   macro avg       0.79      0.63      0.65      4032
weighted avg       0.79      0.79      0.75      4032

              precision    recall  f1-score   support

    interest       0.81      0.97      0.88      2996
 no_interest       0.80      0.32      0.46      1036

    accuracy                           0.80      4032
   macro avg       0.80      0.65      0.67      4032
weighted avg       0.80      0.80      0.77      4032

              preci

In [14]:
# punctuation sequence

kf = KFold()

acc = 0
pre = 0
rec = 0
f_1 = 0

for train_index, test_index in kf.split(X_train):
    X_training, X_testing = X_train[train_index], X_train[test_index]
    y_training, y_testing = y_train[train_index], y_train[test_index]
    
    pun_extractor = Punctuation_Sequence
    pun_train = pun_extractor.transform(X_training)
    pun_test = pun_extractor.transform(X_testing)

    clf_pun = LogisticRegression().fit(pun_train, y_training)
    pun_predict = clf_pun.predict(pun_test)

    acc += metrics.accuracy_score(y_testing, pun_predict)   
    metric = metrics.precision_recall_fscore_support(y_testing, pun_predict, average='macro')
    pre += metric[0]
    rec += metric[1]
    f_1 += metric[2]
    print(metrics.classification_report(y_testing, pun_predict))
    

print('Precision: ' + str(pre/5))
print('Recall: ' + str(rec/5))
print('F1-Score: ' + str(f_1/5))
print('Accuracy: ' + str(acc/5))

              precision    recall  f1-score   support

    interest       0.74      1.00      0.85      3001
 no_interest       0.00      0.00      0.00      1031

    accuracy                           0.74      4032
   macro avg       0.37      0.50      0.43      4032
weighted avg       0.55      0.74      0.64      4032

              precision    recall  f1-score   support

    interest       0.73      1.00      0.85      2961
 no_interest       0.00      0.00      0.00      1071

    accuracy                           0.73      4032
   macro avg       0.37      0.50      0.42      4032
weighted avg       0.54      0.73      0.62      4032

              precision    recall  f1-score   support

    interest       0.74      1.00      0.85      2996
 no_interest       0.00      0.00      0.00      1036

    accuracy                           0.74      4032
   macro avg       0.37      0.50      0.43      4032
weighted avg       0.55      0.74      0.63      4032

              preci

In [14]:
# combination of all features with 80/20 split

vectorizer_uni = CountVectorizer(analyzer='word', lowercase=False, ngram_range=(1, 1))
vectorizer_bi = CountVectorizer(analyzer='word', lowercase=False, ngram_range=(2, 2))
vectorizer_tri = CountVectorizer(analyzer='word', lowercase=False, ngram_range=(3, 3))
adverb_extractor = POS_extractor
verb_extractor = POS_extractor
mod_aux_extractor = POS_extractor
sent_stat_extractor = Sentence_Statistics
pun_extractor = Punctuation_Sequence
    
unigrams_train = vectorizer_uni.fit_transform(X_train).toarray().tolist()
unigrams_test = vectorizer_uni.transform(X_test).toarray().tolist()
# bigrams_train = vectorizer_bi.fit_transform(X_train).toarray().tolist()
# bigrams_test = vectorizer_bi.transform(X_test).toarray().tolist()
# trigrams_train = vectorizer_tri.fit_transform(X_training).toarray().tolist()
# trigrams_test = vectorizer_tri.transform(X_test).toarray().tolist()
adverbs_voc = adverb_extractor.fit(X_train, 'ADV')
adverbs_vec_train = adverb_extractor.transform(X_train, adverbs_voc, 'ADV')
adverbs_vec_test = adverb_extractor.transform(X_test, adverbs_voc, 'ADV')
verbs_voc = verb_extractor.fit(X_train, 'VV')
verbs_vec_train = verb_extractor.transform(X_train, verbs_voc, 'VV')
verbs_vec_test = verb_extractor.transform(X_test, verbs_voc, 'VV')
mod_aux_voc = mod_aux_extractor.fit(X_train, 'VM')
mod_aux_vec_train = mod_aux_extractor.transform(X_train, mod_aux_voc, 'VM')
mod_aux_vec_test = mod_aux_extractor.transform(X_test, mod_aux_voc, 'VM')
sent_stat_train = sent_stat_extractor.transform(X_train)
sent_stat_test = sent_stat_extractor.transform(X_test)
pun_train = pun_extractor.transform(X_train)
pun_test = pun_extractor.transform(X_test)
all_train = []
all_test = []
    
for i in range(len(X_train)):
#     all_train.append(unigrams_train[i] + bigrams_train[i] + trigrams_train[i] + adverbs_vec_train[i] + verbs_vec_train[i] + mod_aux_vec_train[i] + sent_stat_train[i] + pun_train[i])
    all_train.append(unigrams_train[i] + adverbs_vec_train[i] + verbs_vec_train[i] + mod_aux_vec_train[i] + sent_stat_train[i] + pun_train[i])


for i in range(len(X_test)):
#     all_test.append(unigrams_test[i] + bigrams_test[i] + trigrams_test[i] + adverbs_vec_test[i] + verbs_vec_test[i] + mod_aux_vec_test[i] + sent_stat_test[i] + pun_test[i])
    all_test.append(unigrams_test[i] + adverbs_vec_test[i] + verbs_vec_test[i] + mod_aux_vec_test[i] + sent_stat_test[i] + pun_test[i])
    
clf = LogisticRegression().fit(all_train, y_train)
predict = clf.predict(all_test)

print(metrics.classification_report(y_test, predict))

              precision    recall  f1-score   support

    interest       0.74      1.00      0.85      3753
 no_interest       0.00      0.00      0.00      1287

    accuracy                           0.74      5040
   macro avg       0.37      0.50      0.43      5040
weighted avg       0.55      0.74      0.64      5040
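
The report above is consistent with a classifier that predicts `interest` for every sentence: recall is 1.00 for `interest` and 0.00 for `no_interest`. The sketch below recomputes the macro averages for such a constant prediction from the support counts alone. The helper `macro_scores` is written for this illustration; precision for a label that is never predicted is counted as 0.

```python
def macro_scores(support, predicted_label):
    """Macro precision/recall/F1 when every sample is predicted as one label."""
    total = sum(support.values())
    precisions, recalls, f1s = [], [], []
    for label, n in support.items():
        if label == predicted_label:
            p, r = n / total, 1.0
        else:
            p, r = 0.0, 0.0  # label is never predicted; counted as 0
        f1 = 2 * p * r / (p + r) if p + r else 0.0
        precisions.append(p)
        recalls.append(r)
        f1s.append(f1)
    k = len(support)
    return sum(precisions) / k, sum(recalls) / k, sum(f1s) / k


p, r, f1 = macro_scores({"interest": 3753, "no_interest": 1287}, "interest")
print(round(p, 2), round(r, 2), round(f1, 2))  # 0.37 0.5 0.43
```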



## Model and Prediction 1

In this step the best feature from the training and evaluation step is chosen and predictions are made on the 80/20 split.  
Furthermore, the trained model is saved.

In [2]:
# unigrams
# prediction with 80/20 split

vectorizer_uni = CountVectorizer(analyzer='word', lowercase=False, ngram_range=(1, 1))
unigrams_train = vectorizer_uni.fit_transform(X_train)
unigrams_test = vectorizer_uni.transform(X_test)

clf_uni = LogisticRegression(max_iter=1000).fit(unigrams_train, y_train)
uni_predict = clf_uni.predict(unigrams_test)

print(metrics.classification_report(y_test, uni_predict))

with open('classifier_one.pkl', 'wb') as file:
    pickle.dump(clf_uni, file)
    
with open('vectorizer_uni.pkl', 'wb') as file:
    pickle.dump(vectorizer_uni, file)

NameError: name 'X_train' is not defined

## Data Preparation and Feature Engineering 2

In this step the data is loaded again and a feature vector for the second classification is composed. It consists of the following features:
- prediction of the first classifier; every sentence labelled `other` is dropped
- absolute sentence location in decision reasons
- absolute sentence location in paragraph
- binary flag for sentences longer than 12 words
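
The positional part of the feature vector can be illustrated on a toy document. The sketch below follows the counting logic of the loading cell: the index within the paragraph starts at 0, the running index within the decision reasons starts at 1 and also counts dropped sentences, and a flag marks sentences with more than 12 whitespace-separated tokens. The function name `position_features` and the toy sentences are invented.

```python
def position_features(reasons, min_long=12):
    """Build [index in paragraph, running index in document, long-sentence flag]."""
    features = []
    doc_sent = 0
    for paragraph in reasons:
        for i, (text, label) in enumerate(paragraph):
            doc_sent += 1  # counts every sentence, kept or not
            if label == "other":
                continue   # dropped before the second classification
            is_long = 1 if len(text.split(" ")) > min_long else 0
            features.append([i, doc_sent, is_long])
    return features


reasons = [
    [["Satz eins .", "other"], ["Satz zwei .", "definition"]],
    [["Satz drei .", "subsumption"]],
]
print(position_features(reasons))  # [[1, 2, 0], [0, 3, 0]]
```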

In [3]:
# load corpus data, flatten corpus, drop all irrelevant data, extract position information for each sentence and add decision of first classifier

path = 'thesis_corpus/'

with open('classifier_one.pkl', 'rb') as file:
    clf_one = pickle.load(file)
with open('vectorizer_uni.pkl', 'rb') as file:
    vectorizer_uni = pickle.load(file)
    
corpus = []
corpus_labels = []
feature_vector = []

for file in os.listdir(path):
    with open(path+file, encoding='utf-8') as f:
        data = json.load(f)
        reasons = data['decision_text']['decision_reasons']
    
    doc_sent = 0
    
    for reason in reasons:
        for i in range(len(reason)):
            doc_sent += 1
            sent_len = 0
            if reason[i][1] != 'other':
                corpus.append(reason[i][0])
                corpus_labels.append(reason[i][1])
                if len(reason[i][0].split(' ')) > 12:
                    sent_len = 1
                feature_vector.append([i, doc_sent, sent_len])
        
           
unigrams = vectorizer_uni.transform(corpus)
predictions_one = clf_one.predict(unigrams)

for i in range(len(feature_vector)):
    if predictions_one[i] == 'no_interest':
        feature_vector[i].append(0)
    else:
        feature_vector[i].append(1)

                        
X = np.array(corpus)
y = np.array(corpus_labels)
indeces = [i for i in range(len(X))]
                
# train/test split with 80/20 split
X_train, X_test, y_train, y_test = train_test_split(np.array(indeces), y, test_size=0.2)

In [18]:
features_train = [feature_vector[index] for index in X_train]
features_test = [feature_vector[index] for index in X_test]

clf_two = LinearSVC().fit(features_train, y_train)
predict = clf_two.predict(features_test)

print(metrics.classification_report(y_test, predict))

              precision    recall  f1-score   support

  definition       0.28      0.96      0.43      1057
 subsumption       0.65      0.03      0.06      2687

    accuracy                           0.29      3744
   macro avg       0.47      0.49      0.25      3744
weighted avg       0.55      0.29      0.17      3744



In [4]:
kf = KFold()

acc = 0
pre = 0
rec = 0
f_1 = 0

for train_index, test_index in kf.split(X_train):
    X_training, X_testing = X_train[train_index], X_train[test_index]
    y_training, y_testing = y_train[train_index], y_train[test_index]
    
    features_train = [feature_vector[index] for index in X_training]
    features_test = [feature_vector[index] for index in X_testing]

    clf_two = LinearSVC().fit(features_train, y_training)
    predict = clf_two.predict(features_test)

    acc += metrics.accuracy_score(y_testing, predict)   
    metric = metrics.precision_recall_fscore_support(y_testing, predict, average='macro')
    pre += metric[0]
    rec += metric[1]
    f_1 += metric[2]
    print(metrics.classification_report(y_testing, predict))
    
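# average over the folds actually used instead of a hard-coded 5
n_splits = kf.get_n_splits()
print('Accuracy: ' + str(acc/n_splits))
print('Precision: ' + str(pre/n_splits))
print('Recall: ' + str(rec/n_splits))
print('F1-Score: ' + str(f_1/n_splits))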

              precision    recall  f1-score   support

  definition       0.00      0.00      0.00       841
 subsumption       0.72      1.00      0.84      2154

    accuracy                           0.72      2995
   macro avg       0.36      0.50      0.42      2995
weighted avg       0.52      0.72      0.60      2995

              precision    recall  f1-score   support

  definition       0.00      0.00      0.00       851
 subsumption       0.72      1.00      0.83      2144

    accuracy                           0.72      2995
   macro avg       0.36      0.50      0.42      2995
weighted avg       0.51      0.72      0.60      2995

              precision    recall  f1-score   support

  definition       0.00      0.00      0.00       835
 subsumption       0.72      1.00      0.84      2160

    accuracy                           0.72      2995
   macro avg       0.36      0.50      0.42      2995
weighted avg       0.52      0.72      0.60      2995

              preci
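
In each fold report above, `definition` has a recall of 0.00 and `subsumption` a recall of 1.00. This is what a classifier produces when it predicts `subsumption` for every sentence. The following sketch reproduces the figures of the first fold from the class supports alone (841 and 2154, taken from the report). The function name `constant_prediction_report` is illustrative and not part of the notebook.

```python
def constant_prediction_report(support, predicted_class):
    """Metrics of a classifier that always answers `predicted_class`.

    `support` maps each class label to its number of true examples.
    """
    total = sum(support.values())
    per_class = {}
    for label, count in support.items():
        if label == predicted_class:
            precision = count / total  # every sentence is predicted as this class
            recall = 1.0               # so all true members are found
            f1 = 2 * precision * recall / (precision + recall)
        else:
            # the class is never predicted; scikit-learn reports 0.00 here
            precision = recall = f1 = 0.0
        per_class[label] = (precision, recall, f1)

    n_classes = len(per_class)
    # macro average: unweighted mean over the classes
    macro = tuple(
        sum(values[i] for values in per_class.values()) / n_classes
        for i in range(3)
    )
    accuracy = support[predicted_class] / total
    return accuracy, macro


# supports of the first cross-validation fold, copied from the report above
accuracy, (macro_p, macro_r, macro_f1) = constant_prediction_report(
    {'definition': 841, 'subsumption': 2154}, predicted_class='subsumption')

print(f'accuracy={accuracy:.2f} macro precision={macro_p:.2f} '
      f'macro recall={macro_r:.2f} macro f1={macro_f1:.2f}')
```

The printed values (0.72, 0.36, 0.50, 0.42) coincide with the accuracy and macro average rows of the report, so the accuracy of 0.72 only mirrors the class distribution.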

## Train model with own approach

The baseline model and its features did not produce good results. To obtain a trained model that can be worked with, a new approach is implemented.  
Unigrams appear to be a good indicator for classification. This feature is therefore used in a multi-class classification approach (definition, subsumption, other), both as raw counts and with tf-idf weighting. Both Logistic Regression and SVM are tested.
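
The comparison described above can also be sketched compactly with scikit-learn's `make_pipeline` and `cross_validate`. This is a minimal sketch and not the procedure used in the cells below: the toy sentences and labels are invented placeholders for the judgment corpus, and `compare_models` is a hypothetical helper name. Wrapping the vectorizer and the classifier in one pipeline ensures that the vocabulary is fitted on the training part of each fold only.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# invented toy data; the real input is the flattened judgment corpus
X_toy = np.array([
    'Theft means taking movable property of another person.',
    'A contract is defined as an agreement between two parties.',
    'Negligence is the failure to exercise reasonable care.',
    'Possession means actual control over an object.',
    'The defendant took the bicycle of the plaintiff.',
    'The parties signed the agreement in March.',
    'The driver failed to brake in time.',
    'The plaintiff kept the keys to the car.',
    'The appeal is admissible.',
    'The costs are borne by the defendant.',
    'The court heard three witnesses.',
    'The judgment is provisionally enforceable.',
])
y_toy = np.array(['definition'] * 4 + ['subsumption'] * 4 + ['other'] * 4)

SCORING = ['accuracy', 'precision_macro', 'recall_macro', 'f1_macro']


def compare_models(X, y, n_splits=3):
    """Cross-validate every vectorizer/classifier combination."""
    vectorizers = {
        'unigram': lambda: CountVectorizer(lowercase=False, ngram_range=(1, 1)),
        'tf-idf': lambda: TfidfVectorizer(lowercase=False),
    }
    classifiers = {
        'logreg': lambda: LogisticRegression(max_iter=1000),
        'svm': lambda: LinearSVC(),
    }
    cv = KFold(n_splits=n_splits, shuffle=True, random_state=0)
    results = {}
    for vec_name, make_vectorizer in vectorizers.items():
        for clf_name, make_classifier in classifiers.items():
            pipeline = make_pipeline(make_vectorizer(), make_classifier())
            scores = cross_validate(pipeline, X, y, cv=cv, scoring=SCORING)
            # mean of each metric over the folds
            results[vec_name + ' + ' + clf_name] = {
                metric: scores['test_' + metric].mean() for metric in SCORING
            }
    return results


for name, scores in compare_models(X_toy, y_toy).items():
    print(name, {metric: round(value, 2) for metric, value in scores.items()})
```

On such a tiny toy corpus the scores are not meaningful; the sketch only shows the structure of the comparison.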

In [3]:
path = 'thesis_corpus/'

# flatten corpus for generating basic features and extract flattened labels
corpus = []
corpus_labels = []

for file in os.listdir(path):
    with open(path+file, encoding='utf-8') as f:
        data = json.load(f)
        reasons = data['decision_text']['decision_reasons']
    
    for reason in reasons:
        for sentence in reason:
            corpus.append(sentence[0])
            corpus_labels.append(sentence[1])
        
X = np.array(corpus)
y = np.array(corpus_labels)
                
# train/test split with 80/20 split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

In [4]:
# Unigram and Logistic Regression

kf = KFold()

acc = 0
pre = 0
rec = 0
f_1 = 0

for train_index, test_index in kf.split(X_train):
    X_training, X_testing = X_train[train_index], X_train[test_index]
    y_training, y_testing = y_train[train_index], y_train[test_index]
    
    vectorizer_uni = CountVectorizer(analyzer='word', lowercase=False, ngram_range=(1, 1))
    unigrams_train = vectorizer_uni.fit_transform(X_training)
    unigrams_test = vectorizer_uni.transform(X_testing)

    clf_uni = LogisticRegression(max_iter=1000).fit(unigrams_train, y_training)
    uni_predict = clf_uni.predict(unigrams_test)

    acc += metrics.accuracy_score(y_testing, uni_predict)   
    metric = metrics.precision_recall_fscore_support(y_testing, uni_predict, average='macro')
    pre += metric[0]
    rec += metric[1]
    f_1 += metric[2]
    print(metrics.classification_report(y_testing, uni_predict))
    
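# average over the folds actually used instead of a hard-coded 5
n_splits = kf.get_n_splits()
print('Precision: ' + str(pre/n_splits))
print('Recall: ' + str(rec/n_splits))
print('F1-Score: ' + str(f_1/n_splits))
print('Accuracy: ' + str(acc/n_splits))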

              precision    recall  f1-score   support

  definition       0.77      0.70      0.73       861
       other       0.72      0.72      0.72      1023
 subsumption       0.80      0.83      0.82      2148

    accuracy                           0.78      4032
   macro avg       0.77      0.75      0.76      4032
weighted avg       0.78      0.78      0.78      4032

              precision    recall  f1-score   support

  definition       0.78      0.71      0.74       815
       other       0.72      0.72      0.72      1005
 subsumption       0.81      0.84      0.82      2212

    accuracy                           0.78      4032
   macro avg       0.77      0.76      0.76      4032
weighted avg       0.78      0.78      0.78      4032

              precision    recall  f1-score   support

  definition       0.77      0.70      0.73       859
       other       0.73      0.70      0.71      1031
 subsumption       0.79      0.83      0.81      2142

    accuracy        

In [6]:
# Unigram and SVM

kf = KFold()

acc = 0
pre = 0
rec = 0
f_1 = 0

for train_index, test_index in kf.split(X_train):
    X_training, X_testing = X_train[train_index], X_train[test_index]
    y_training, y_testing = y_train[train_index], y_train[test_index]
    
    vectorizer_uni = CountVectorizer(analyzer='word', lowercase=False, ngram_range=(1, 1))
    unigrams_train = vectorizer_uni.fit_transform(X_training)
    # vocabulary size of this fold; get_feature_names() was removed in newer scikit-learn releases
    print(len(vectorizer_uni.vocabulary_))
    unigrams_test = vectorizer_uni.transform(X_testing)

    clf_uni = LinearSVC().fit(unigrams_train, y_training)
    uni_predict = clf_uni.predict(unigrams_test)

    acc += metrics.accuracy_score(y_testing, uni_predict)   
    metric = metrics.precision_recall_fscore_support(y_testing, uni_predict, average='macro')
    pre += metric[0]
    rec += metric[1]
    f_1 += metric[2]
    print(metrics.classification_report(y_testing, uni_predict))
    
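# average over the folds actually used instead of a hard-coded 5
n_splits = kf.get_n_splits()
print('Precision: ' + str(pre/n_splits))
print('Recall: ' + str(rec/n_splits))
print('F1-Score: ' + str(f_1/n_splits))
print('Accuracy: ' + str(acc/n_splits))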

32182
              precision    recall  f1-score   support

  definition       0.75      0.67      0.71       861
       other       0.66      0.73      0.69      1023
 subsumption       0.80      0.79      0.79      2148

    accuracy                           0.75      4032
   macro avg       0.74      0.73      0.73      4032
weighted avg       0.75      0.75      0.75      4032

32066
              precision    recall  f1-score   support

  definition       0.74      0.69      0.72       815
       other       0.67      0.73      0.70      1005
 subsumption       0.80      0.79      0.80      2212

    accuracy                           0.75      4032
   macro avg       0.74      0.74      0.74      4032
weighted avg       0.76      0.75      0.76      4032

32257
              precision    recall  f1-score   support

  definition       0.75      0.68      0.71       859
       other       0.66      0.71      0.68      1031
 subsumption       0.79      0.78      0.78      2142

  

In [7]:
# tf-idf and Logistic Regression

kf = KFold()

acc = 0
pre = 0
rec = 0
f_1 = 0

for train_index, test_index in kf.split(X_train):
    X_training, X_testing = X_train[train_index], X_train[test_index]
    y_training, y_testing = y_train[train_index], y_train[test_index]
    
    vectorizer_tfidf = TfidfVectorizer(lowercase=False)
    tfidf_train = vectorizer_tfidf.fit_transform(X_training)
    tfidf_test = vectorizer_tfidf.transform(X_testing)

    clf_tfidf = LogisticRegression(max_iter=1000).fit(tfidf_train, y_training)
    tfidf_predict = clf_tfidf.predict(tfidf_test)

    acc += metrics.accuracy_score(y_testing, tfidf_predict)   
    metric = metrics.precision_recall_fscore_support(y_testing, tfidf_predict, average='macro')
    pre += metric[0]
    rec += metric[1]
    f_1 += metric[2]
    print(metrics.classification_report(y_testing, tfidf_predict))
    
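# average over the folds actually used instead of a hard-coded 5
n_splits = kf.get_n_splits()
print('Precision: ' + str(pre/n_splits))
print('Recall: ' + str(rec/n_splits))
print('F1-Score: ' + str(f_1/n_splits))
print('Accuracy: ' + str(acc/n_splits))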

              precision    recall  f1-score   support

  definition       0.80      0.67      0.73       861
       other       0.80      0.64      0.71      1023
 subsumption       0.77      0.89      0.83      2148

    accuracy                           0.78      4032
   macro avg       0.79      0.74      0.76      4032
weighted avg       0.78      0.78      0.78      4032

              precision    recall  f1-score   support

  definition       0.80      0.68      0.74       815
       other       0.82      0.64      0.72      1005
 subsumption       0.78      0.90      0.83      2212

    accuracy                           0.79      4032
   macro avg       0.80      0.74      0.76      4032
weighted avg       0.79      0.79      0.78      4032

              precision    recall  f1-score   support

  definition       0.79      0.67      0.72       859
       other       0.82      0.63      0.71      1031
 subsumption       0.76      0.89      0.82      2142

    accuracy        

In [10]:
# tf-idf and SVM

kf = KFold()

acc = 0
pre = 0
rec = 0
f_1 = 0

for train_index, test_index in kf.split(X_train):
    X_training, X_testing = X_train[train_index], X_train[test_index]
    y_training, y_testing = y_train[train_index], y_train[test_index]
    
    vectorizer_tfidf = TfidfVectorizer(lowercase=False)
    tfidf_train = vectorizer_tfidf.fit_transform(X_training)
    tfidf_test = vectorizer_tfidf.transform(X_testing)

    clf_tfidf = LinearSVC().fit(tfidf_train, y_training)
    tfidf_predict = clf_tfidf.predict(tfidf_test)

    acc += metrics.accuracy_score(y_testing, tfidf_predict)   
    metric = metrics.precision_recall_fscore_support(y_testing, tfidf_predict, average='macro')
    pre += metric[0]
    rec += metric[1]
    f_1 += metric[2]
    print(metrics.classification_report(y_testing, tfidf_predict))
    
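# average over the folds actually used instead of a hard-coded 5
n_splits = kf.get_n_splits()
print('Precision: ' + str(pre/n_splits))
print('Recall: ' + str(rec/n_splits))
print('F1-Score: ' + str(f_1/n_splits))
print('Accuracy: ' + str(acc/n_splits))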

              precision    recall  f1-score   support

  definition       0.78      0.71      0.74       861
       other       0.76      0.69      0.72      1023
 subsumption       0.80      0.86      0.83      2148

    accuracy                           0.78      4032
   macro avg       0.78      0.75      0.76      4032
weighted avg       0.78      0.78      0.78      4032

              precision    recall  f1-score   support

  definition       0.76      0.72      0.74       815
       other       0.77      0.69      0.73      1005
 subsumption       0.80      0.86      0.83      2212

    accuracy                           0.79      4032
   macro avg       0.78      0.76      0.77      4032
weighted avg       0.79      0.79      0.79      4032

              precision    recall  f1-score   support

  definition       0.78      0.73      0.75       859
       other       0.76      0.68      0.72      1031
 subsumption       0.79      0.85      0.82      2142

    accuracy        

In [7]:
# tf-idf SVM
# prediction with 80/20 split

vectorizer_tfidf = TfidfVectorizer(lowercase=False)
tfidf_train = vectorizer_tfidf.fit_transform(X_train)
tfidf_test = vectorizer_tfidf.transform(X_test)

clf_tfidf = LinearSVC().fit(tfidf_train, y_train)
tfidf_predict = clf_tfidf.predict(tfidf_test)

print(metrics.classification_report(y_test, tfidf_predict))

with open('classifier_own.pkl', 'wb') as file:
    pickle.dump(clf_tfidf, file)
    
with open('vectorizer_tfidf.pkl', 'wb') as file:
    pickle.dump(vectorizer_tfidf, file)

              precision    recall  f1-score   support

  definition       0.77      0.69      0.73      1060
       other       0.78      0.71      0.74      1281
 subsumption       0.80      0.86      0.83      2699

    accuracy                           0.79      5040
   macro avg       0.78      0.75      0.77      5040
weighted avg       0.79      0.79      0.78      5040



In [8]:
# tf-idf Logistic Regression
# prediction with 80/20 split

# reuse the tf-idf vectorizer that was fitted on X_train in the previous cell
tfidf_train = vectorizer_tfidf.transform(X_train)
tfidf_test = vectorizer_tfidf.transform(X_test)

clf_tfidf = LogisticRegression(max_iter=1000).fit(tfidf_train, y_train)
tfidf_predict = clf_tfidf.predict(tfidf_test)

print(metrics.classification_report(y_test, tfidf_predict))

with open('classifier_own_2.pkl', 'wb') as file:
    pickle.dump(clf_tfidf, file)

              precision    recall  f1-score   support

  definition       0.80      0.65      0.72      1060
       other       0.81      0.65      0.73      1281
 subsumption       0.77      0.90      0.83      2699

    accuracy                           0.78      5040
   macro avg       0.79      0.74      0.76      5040
weighted avg       0.79      0.78      0.78      5040
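
The two cells above store the classifiers and the tf-idf vectorizer in separate pickle files, so a classifier can only be used together with the vectorizer it was trained with. One possible alternative, sketched here as an assumption and not as the notebook's method, is to pickle vectorizer and classifier as a single scikit-learn pipeline. The file name `pipeline_tfidf_svm.pkl` and the toy sentences are hypothetical.

```python
import os
import pickle
import tempfile

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# invented toy data standing in for X_train / y_train
train_sentences = [
    'Theft means taking movable property of another person.',
    'Possession means actual control over an object.',
    'The defendant took the bicycle of the plaintiff.',
    'The plaintiff kept the keys to the car.',
    'The appeal is admissible.',
    'The costs are borne by the defendant.',
]
train_labels = ['definition', 'definition',
                'subsumption', 'subsumption',
                'other', 'other']

# vectorizer and classifier are fitted and stored as one object
pipeline = make_pipeline(TfidfVectorizer(lowercase=False), LinearSVC())
pipeline.fit(train_sentences, train_labels)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, 'pipeline_tfidf_svm.pkl')  # hypothetical file name
    with open(model_path, 'wb') as file:
        pickle.dump(pipeline, file)
    with open(model_path, 'rb') as file:
        restored = pickle.load(file)

# the restored pipeline accepts raw sentences directly
new_sentences = ['The defendant took the bicycle of the neighbour.']
print(restored.predict(new_sentences))
```

With a single artifact, the raw sentences can be passed to `predict` without a separate `transform` step.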

