### The process of finding the best model

1. Parse and read the documents, then randomly split the first 13000 documents into train and validation datasets. The last 1932 documents are held out as unseen test data.
2. The train, validation and test dataframes each have two columns, `row` and `label`. Each `row` value is the list of strings for one row of a document, and `label` is the label of that row.
3. The `label` column is mapped to 0 and 1 (1 for `HEADERS`, 0 for every other label) to make the task a binary classification problem.
4. After various experiments, the `row` column goes through three main preprocessing steps. First, the list of strings is joined with a single space and duplicated whitespace is removed. Second, punctuation is removed. Third, stopwords are removed using the spaCy stop word list.
5. For training I experimented with various combinations of models. For vectorization I used Count Vectorizer and TF-IDF. For the ML classification algorithms I used Naive Bayes and Logistic Regression. Finally I trained Facebook's fastText model, which gave slightly better results than the other model combinations (F1 of 0.89 on the `HEADERS` class of the test data, versus 0.87 for TF-IDF with Naive Bayes).
6. The two best models were TF-IDF combined with Naive Bayes, and fastText. I trained and saved both after splitting the data into train, validation and test datasets. They are saved as nb_model.pkl and fasttext_model.bin. The TF-IDF/Naive Bayes combination trained on all 14932 documents is saved as model.pkl.
7. The models are evaluated on the unseen test dataset (the last 1932 documents). The main metrics are accuracy, the F1 score of the imbalanced `HEADERS` label, and the ROC AUC score.
8. After experimenting in a Jupyter Notebook, I created a Python package whose CLI commands run training, prediction and classification summary. All examples below train, predict with and summarize the Naive Bayes model with TF-IDF vectorization. Example of training:

    `python run_script.py -t "{path_to_training_dataset.txt}"`
    

9. Example of prediction. Make sure the model file, for example nb_model.pkl, is in the package directory where run_script.py is located. The prediction results are saved in headers_prediction_results.csv, which contains three columns: document, header_rows and header_count. header_rows lists the rows that the model labeled as HEADER. header_count is the number of headers the document contains.

    `python run_script.py -p "{path_to_test_dataset.txt}"`
    
    
10. Example of classification summary. Make sure the model file, for example nb_model.pkl, is in the package directory where run_script.py is located. There is an example, Results_on_test_data.PNG, which is a screenshot of the terminal showing the results of the Naive Bayes model on the unseen test dataset (1932 documents).

    `python run_script.py -s "{path_to_test_dataset.txt}"`
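
The three commands above suggest an entry point with one flag per mode. As a minimal sketch of how such a script could be wired with `argparse`: only the short flags `-t`, `-p` and `-s` come from the examples above, while the long option names, the help texts and the `build_parser` function are assumptions, not the package's actual code.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with one mutually exclusive flag per mode (hypothetical layout)."""
    parser = argparse.ArgumentParser(description="Header row classifier")
    mode = parser.add_mutually_exclusive_group(required=True)
    # Long option names are assumptions; only -t, -p and -s appear in the examples.
    mode.add_argument("-t", "--train", metavar="PATH", help="train on a dataset file")
    mode.add_argument("-p", "--predict", metavar="PATH", help="predict header rows")
    mode.add_argument("-s", "--summary", metavar="PATH", help="print a classification summary")
    return parser


# Parse an explicit argument list instead of sys.argv so the example runs anywhere.
args = build_parser().parse_args(["-p", "path_to_test_dataset.txt"])
print(args.predict)  # path_to_test_dataset.txt
print(args.train)    # None
```

Making the group mutually exclusive and required means exactly one mode must be chosen per invocation, which matches how the three commands are shown.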



# Full training process: train, validation and test

In [5]:
import numpy as np
import pandas as pd
import spacy

from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from fasttext import train_supervised
import matplotlib.pyplot as plt
from sklearn import metrics
from sklearn.svm import LinearSVC, SVC
from tqdm import tqdm
import joblib

import warnings
warnings.filterwarnings("ignore")

# Reading the data 

In [6]:
with open('../document-standardization-training-dataset.txt', encoding="utf8") as f:
    lines = [line.rstrip() for line in f]

# Constructing random train and validation dataframes

In [7]:
train_raw_values = []
train_labels = []

validation_raw_values = []
validation_labels = []

train_index = 0
validation_index = 0

# taking first 13000 documents and leaving 1932 for test 
for document in tqdm(lines[:13000]):
    
    # generating random number for random train and validation split of documents
    random_number = np.random.random()
    
    # making the training size around 80%
    if random_number > 0.20:
        for line_dict in eval(document):

            train_labels.append(line_dict['type'])
            train_raw_values.append([])

            for value in line_dict['values']:
                train_raw_values[train_index].append(value['value'])

            train_index+=1
            
    # making the validation size around 20%
    else:
        for line_dict in eval(document):
            
            validation_labels.append(line_dict['type'])
            validation_raw_values.append([])

            for value in line_dict['values']:
                validation_raw_values[validation_index].append(value['value'])

            validation_index+=1

print(len(train_raw_values))
print(len(train_labels))
print(len(validation_raw_values))
print(len(validation_labels))

100%|███████████████████████████████████████████████████████████████████████████| 13000/13000 [01:31<00:00, 141.88it/s]

1683648
1683648
417593
417593
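
The split above draws from `np.random` without a seed, so the train and validation sizes change on every run. That is a likely reason the validation support differs between the reports further down (422688, 419941 and 417593 rows). A minimal sketch of a repeatable document-level split follows. It assumes each line holds a Python literal in the format read above, which lets `ast.literal_eval` stand in for `eval`. The function name `split_documents`, the seed value and the sample document are illustrative, not part of the original notebook.

```python
import ast
import random


def split_documents(lines, val_fraction=0.20, seed=42):
    """Assign each document to train or validation with a seeded generator.

    Each element of `lines` is assumed to hold the Python-literal text of one
    document: a list of {'type': ..., 'values': [{'value': ...}, ...]} dicts.
    """
    rng = random.Random(seed)  # fixed (arbitrary) seed makes the split repeatable
    train = {"rows": [], "labels": []}
    validation = {"rows": [], "labels": []}

    for document in lines:
        # One draw per document keeps all rows of a document in the same split.
        target = validation if rng.random() < val_fraction else train
        # ast.literal_eval parses literals only; it does not execute arbitrary code.
        for line_dict in ast.literal_eval(document):
            target["labels"].append(line_dict["type"])
            target["rows"].append([cell["value"] for cell in line_dict["values"]])
    return train, validation


# Tiny synthetic input in the assumed format (not taken from the dataset).
sample_doc = repr([
    {"type": "HEADERS", "values": [{"value": ""}, {"value": "Jan 2016"}]},
    {"type": "TOTALS", "values": [{"value": "Total"}, {"value": "10.00"}]},
])
train, validation = split_documents([sample_doc] * 50)
print(len(train["rows"]), len(validation["rows"]))
```

Calling `split_documents` twice with the same seed returns the same partition, so metrics from different model runs are computed on identical validation rows.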





# Constructing the unseen test data

In [8]:
values = []
labels = []
index = 0

# leaving last 1932 documents for test 
for document in tqdm(lines[13000:]):
    for line_dict in eval(document):

        labels.append(line_dict['type'])
        values.append([])

        for value in line_dict['values']:
            values[index].append(value['value'])

        index+=1

print(len(labels))
print(len(values))

100%|█████████████████████████████████████████████████████████████████████████████| 1932/1932 [00:15<00:00, 122.50it/s]

414981
414981





# Getting all train, validation and test dataframes with row and label columns

In [9]:
df_train = pd.DataFrame({'row': train_raw_values, 'label': train_labels})
df_validation = pd.DataFrame({'row': validation_raw_values, 'label': validation_labels})
df_test = pd.DataFrame({'row': values, 'label': labels})

print(df_train.shape)
print(df_validation.shape)
print(df_test.shape)

(1683648, 2)
(417593, 2)
(414981, 2)


In [10]:
df_validation

| | row | label |
|---|---|---|
| 0 | [Country Estates (cene), , , , , , , , , , , ,... | NO_TYPE |
| 1 | [Statement (12 months), , , , , , , , , , , , , ] | NO_TYPE |
| 2 | [Period = Jan 2016-0ec 2016, , , , , , , , , ,... | NO_TYPE |
| 3 | [Book = Cash, , , , , , , , , , , , , ] | NO_TYPE |
| 4 | [, Jan 2016, ? Feb 2016, ? Mar 2016, 0 Apr 201... | HEADERS |
| ... | ... | ... |
| 417588 | [Elevator Maintenance, 0.00, 0.00, 1007.50, 10... | EXPENSES_MAINTENANCE |
| 417589 | [Total Service Related Expenses, 4838.58, 3488... | TOTALS |
| 417590 | [Total Operating Expenses, 78485.28, 95088.67,... | TOTALS |
| 417591 | [Net Operating Income (Loss), 56260.73, 44528.... | TOTALS |


# Applying preprocessing to independent variable

In [11]:
df_train['row'] = df_train['row'].apply(lambda x: " ".join((" ".join(x)).split()))
df_validation['row'] = df_validation['row'].apply(lambda x: " ".join((" ".join(x)).split()))
df_test['row'] = df_test['row'].apply(lambda x: " ".join((" ".join(x)).split()))

print(df_train.shape)
print(df_validation.shape)
print(df_test.shape)

(1683648, 2)
(417593, 2)
(414981, 2)
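
The one-liner in the cell above joins the cells of a row and collapses whitespace in a single expression. The same logic written as a named function, with a made-up row for illustration (`normalize_row` and the example values are not part of the notebook):

```python
def normalize_row(cells):
    """Join the cells of one row and collapse every whitespace run to a single space."""
    # str.split() with no argument splits on any whitespace and drops empty pieces,
    # so empty cells and repeated spaces disappear after the second join.
    return " ".join(" ".join(cells).split())


example = ["Total  Operating Expenses", "", "78485.28", "  ", "95088.67"]
print(normalize_row(example))  # Total Operating Expenses 78485.28 95088.67
```

Because empty cells vanish, the position of a value within the row is lost. The classifiers below therefore see only which tokens occur, not which column they came from.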


# Mapping label to binary 0 and 1, to solve binary classification problem

In [12]:
df_train['label'] = np.where(df_train['label'] == "HEADERS", 1, 0)
df_validation['label'] = np.where(df_validation['label'] == "HEADERS", 1, 0)
df_test['label'] = np.where(df_test['label'] == "HEADERS", 1, 0)
df_train['label'].value_counts()

0    1630941
1      52707
Name: label, dtype: int64

In [13]:
df_validation['label'].value_counts()

0    404800
1     12793
Name: label, dtype: int64

In [14]:
df_test['label'].value_counts()

0    405921
1      9060
Name: label, dtype: int64
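
The counts above show how imbalanced the task is: only about 2 to 3 percent of rows are headers. A short sketch, using only the numbers printed above, computes the positive share per split and the accuracy of a trivial model that always predicts 0. That model alone reaches roughly 0.97 accuracy, which is why F1 on the header class is the more informative metric.

```python
# Class counts copied from the value_counts() outputs above: (negatives, positives).
counts = {
    "train": (1630941, 52707),
    "validation": (404800, 12793),
    "test": (405921, 9060),
}

positive_share = {}
baseline_accuracy = {}
for name, (negatives, positives) in counts.items():
    total = negatives + positives
    positive_share[name] = positives / total
    # Accuracy of a trivial model that always predicts 0 (never flags a header).
    baseline_accuracy[name] = negatives / total
    print(f"{name}: HEADERS share {positive_share[name]:.1%}, "
          f"always-0 accuracy {baseline_accuracy[name]:.3f}")
```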

# Text NLP Preprocessing 

In [15]:
from bs4 import BeautifulSoup
from html import unescape
import os
import spacy

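try:
    spacy_en = spacy.load("en_core_web_sm")
except OSError:
    # spacy.load raises OSError when the model package is not installed
    os.system('python -m spacy download en_core_web_sm')
    spacy_en = spacy.load("en_core_web_sm")

# sorted list of spaCy's English stop words, extended with two extra words
stops_spacy = sorted(spacy.lang.en.stop_words.STOP_WORDS)
stops_spacy.extend(["is", "to"])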

# Define all auxiliary functions for text preprocessing

In [16]:
def remove_punctuation(text):  
    text = ''.join([char if char.isalnum() or char == ' ' else ' ' for char in text])
    text = ' '.join(text.split())  # remove multiple whitespace   
    return text


def remove_stopwords_spacy(text, stopwords=stops_spacy):
    text = ' '.join([word for word in text.split() if word not in stopwords])
    return text
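
A small worked example makes two side effects of these helpers visible. The sketch below re-declares the functions so it runs without spaCy and uses a hypothetical three-word stop list (`demo_stopwords`). The input strings are made up for illustration.

```python
def remove_punctuation(text):
    # Replace every character that is neither alphanumeric nor a space with a space,
    # then collapse repeated whitespace.
    text = "".join(char if char.isalnum() or char == " " else " " for char in text)
    return " ".join(text.split())


def remove_stopwords(text, stopwords):
    # Case-sensitive membership test, as in remove_stopwords_spacy above.
    return " ".join(word for word in text.split() if word not in stopwords)


# Hypothetical three-word stop list; the notebook uses spaCy's STOP_WORDS instead.
demo_stopwords = {"the", "of", "for"}

cleaned = remove_punctuation("Total of Expenses (Jan-2016): 78,485.28")
print(cleaned)  # Total of Expenses Jan 2016 78 485 28
print(remove_stopwords(cleaned, demo_stopwords))  # Total Expenses Jan 2016 78 485 28
print(remove_stopwords("The total for the year", demo_stopwords))  # The total year
```

First, decimal points and thousands separators count as punctuation, so a number such as 78,485.28 becomes three separate tokens. Second, the stop word check is case-sensitive, so with a lowercase stop list a capitalized word like "The" is kept.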


# Apply all the text preprocessings to row column

In [17]:
df_train["row"] = df_train["row"].apply(remove_punctuation)
df_train["row"] = df_train["row"].apply(remove_stopwords_spacy)
print('Train preprocessing done')

df_validation["row"] = df_validation["row"].apply(remove_punctuation)
df_validation["row"] = df_validation["row"].apply(remove_stopwords_spacy)
print('Validation preprocessing done')

df_test["row"] = df_test["row"].apply(remove_punctuation)
df_test["row"] = df_test["row"].apply(remove_stopwords_spacy)
print('Test preprocessing done')

Train preprocessing done
Validation preprocessing done
Test preprocessing done


# Split the data into X, y (train, validation, test) and cast to numpy arrays for faster training

In [18]:
X_train = np.array(df_train['row'])
X_val = np.array(df_validation['row'])
X_test = np.array(df_test['row'])

y_train = np.array(df_train['label'])
y_val = np.array(df_validation['label'])
y_test = np.array(df_test['label'])

print(X_train.shape)
print(X_val.shape)
print(X_test.shape)

print(y_train.shape)
print(y_val.shape)
print(y_test.shape)

(1683648,)
(417593,)
(414981,)
(1683648,)
(417593,)
(414981,)


# Checking count vectorizer dimension shape

In [19]:
from sklearn.feature_extraction.text import CountVectorizer
count_vect = CountVectorizer()

X_train_counts = count_vect.fit_transform(X_train)
X_train_counts.shape

(1683648, 408540)

# Training Count Vectorizer and Naive Bayes

In [17]:
nb = Pipeline([('countVec', CountVectorizer()),
               ('clf', MultinomialNB()),])
nb.fit(X_train, y_train)

y_pred = nb.predict(X_val)
y_pred_prb = nb.predict_proba(X_val)

train_score = round(accuracy_score(y_train, nb.predict(X_train)), 3)
val_score = round(accuracy_score(y_val, y_pred), 3)

print(f'train accuracy {train_score}')
print(f'val accuracy {val_score}')

print(metrics.confusion_matrix(y_val, y_pred))
print(metrics.classification_report(y_val, y_pred))
print('ROC AUC Score is' + '\n')
print(metrics.roc_auc_score(y_val, y_pred_prb[:,1]))

train accuracy 0.974
val accuracy 0.975
[[400415   8850]
 [  1845  11578]]
              precision    recall  f1-score   support

           0       1.00      0.98      0.99    409265
           1       0.57      0.86      0.68     13423

    accuracy                           0.97    422688
   macro avg       0.78      0.92      0.84    422688
weighted avg       0.98      0.97      0.98    422688

ROC AUC Score is

0.9875837279186237
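
The report above combines a high accuracy with a precision of only 0.57 on the header class. To show where those numbers come from, the sketch below recomputes them from the printed confusion matrix. `binary_metrics` is an illustrative helper, not part of the notebook. It assumes scikit-learn's layout, where rows are true labels and columns are predictions.

```python
def binary_metrics(confusion):
    """Return accuracy, precision, recall and F1 for class 1.

    `confusion` uses scikit-learn's layout: [[TN, FP], [FN, TP]].
    """
    (tn, fp), (fn, tp) = confusion
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1


# Validation confusion matrix printed above for CountVectorizer + MultinomialNB.
accuracy, precision, recall, f1 = binary_metrics([[400415, 8850], [1845, 11578]])
print(f"accuracy={accuracy:.3f} precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
# accuracy=0.975 precision=0.57 recall=0.86 f1=0.68
```

The 8850 false positives barely move accuracy because they are small next to the 400415 true negatives, but they make up a large share of all rows predicted as headers.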


# Training TF-IDF with Naive Bayes

In [26]:
nb = Pipeline([('tfidf', TfidfVectorizer(lowercase=False, token_pattern=r'\w+', ngram_range=(1, 2), 
                                         min_df=3)),
               ('clf', MultinomialNB()),])
nb.fit(X_train, y_train)

y_pred = nb.predict(X_val)
y_pred_prb = nb.predict_proba(X_val)

train_score = round(accuracy_score(y_train, nb.predict(X_train)), 3)
val_score = round(accuracy_score(y_val, y_pred), 3)

print(f'train accuracy {train_score}')
print(f'val accuracy {val_score}')

print(metrics.confusion_matrix(y_val, y_pred))
print(metrics.classification_report(y_val, y_pred))

print('ROC AUC Score is' + '\n')
print(metrics.roc_auc_score(y_val, y_pred_prb[:,1]))

train accuracy 0.99
val accuracy 0.991
[[405783   1355]
 [  2607  10196]]
              precision    recall  f1-score   support

           0       0.99      1.00      1.00    407138
           1       0.88      0.80      0.84     12803

    accuracy                           0.99    419941
   macro avg       0.94      0.90      0.92    419941
weighted avg       0.99      0.99      0.99    419941

ROC AUC Score is

0.9909783414346139
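
This pipeline differs from the Count Vectorizer one mainly in its tokenization settings: the pattern `\w+` keeps single-character tokens, `ngram_range=(1, 2)` adds word bigrams, `lowercase=False` preserves case, and `min_df=3` drops terms that occur in fewer than three rows. The sketch below illustrates the first two settings with plain `re`. It is not scikit-learn's implementation, and `word_ngrams` and the sample rows are made up for illustration.

```python
import re

DEFAULT_PATTERN = r"(?u)\b\w\w+\b"  # scikit-learn's default token_pattern
NOTEBOOK_PATTERN = r"\w+"           # pattern passed to TfidfVectorizer above


def word_ngrams(text, pattern, max_n=2):
    """Return unigrams up to max_n-grams as space-joined strings (illustration only)."""
    tokens = re.findall(pattern, text)
    grams = []
    for n in range(1, max_n + 1):
        grams.extend(" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return grams


row = "Net Income 0 5 Q1"
print(re.findall(DEFAULT_PATTERN, row))   # ['Net', 'Income', 'Q1']
print(re.findall(NOTEBOOK_PATTERN, row))  # ['Net', 'Income', '0', '5', 'Q1']
print(word_ngrams("Net Operating Income", NOTEBOOK_PATTERN))
# ['Net', 'Operating', 'Income', 'Net Operating', 'Operating Income']
```

Keeping single characters matters here because punctuation removal splits numbers into short digit groups, which the default pattern would partly discard.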


# Training logistic regression with Count vectorizer

In [21]:
logreg = Pipeline([('countVec', CountVectorizer()),
                   ('clf', LogisticRegression(solver='liblinear'))])
logreg.fit(X_train, y_train)

y_pred = logreg.predict(X_val)
y_pred_prb = logreg.predict_proba(X_val)

train_score = round(accuracy_score(y_train, logreg.predict(X_train)), 3)
val_score = round(accuracy_score(y_val, y_pred), 3)

print(f'train accuracy {train_score}')
print(f'val accuracy {val_score}')

print(metrics.confusion_matrix(y_val, y_pred))
print(metrics.classification_report(y_val, y_pred))

print('ROC AUC Score is' + '\n')
print(metrics.roc_auc_score(y_val, y_pred_prb[:,1]))

train accuracy 0.989
val accuracy 0.989
[[403707   1093]
 [  3709   9084]]
              precision    recall  f1-score   support

           0       0.99      1.00      0.99    404800
           1       0.89      0.71      0.79     12793

    accuracy                           0.99    417593
   macro avg       0.94      0.85      0.89    417593
weighted avg       0.99      0.99      0.99    417593

ROC AUC Score is

0.9926780873904608


# Training Facebook's fastText

In [19]:
def to_fasttext_format(data: list, labels: list, save_path: str=None):
    ft_data = []
    for d, l in zip(data, labels):
        ft_data.append("__label__{} {}".format(l, d))
    if save_path:
        np.savetxt(save_path, ft_data, fmt='%s')
    else:
        return ft_data
    
def train_fasttext(X_train, y_train, wordNgrams=1, minCount=1, ft_train_path="./tmp_train.txt", **kwargs):
    
    to_fasttext_format(X_train, y_train, save_path=ft_train_path)
    ft_model = train_supervised(ft_train_path, wordNgrams=wordNgrams, minCount=minCount, epoch=10, loss="softmax",  **kwargs)
    train_preds = [i[0].split('_')[-1] for i in ft_model.predict(list(X_train))[0]]

    train_score = round(accuracy_score(y_train, np.array(train_preds).astype(int)), 3)
    print(f'train accuracy {train_score}')
    
    return ft_model, train_score
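
fastText's supervised mode reads one example per line, with the label marked by the `__label__` prefix, and returns predictions in the same prefixed form. The sketch below shows that round trip on two made-up rows without calling fastText. `to_fasttext_line` and `parse_label` are illustrative names, and slicing off the prefix is one alternative to the `split('_')[-1]` used in the notebook.

```python
LABEL_PREFIX = "__label__"  # fastText's default label prefix


def to_fasttext_line(text, label):
    """Format one example the way to_fasttext_format does: '__label__<label> <text>'."""
    return f"{LABEL_PREFIX}{label} {text}"


def parse_label(predicted):
    """Turn a predicted label such as '__label__1' back into the integer 1."""
    return int(predicted[len(LABEL_PREFIX):])


examples = [("Jan 2016 Feb 2016", 1), ("Total Income 10 00", 0)]
lines = [to_fasttext_line(text, label) for text, label in examples]
print(lines[0])                   # __label__1 Jan 2016 Feb 2016
print(parse_label("__label__1"))  # 1
```

Slicing by prefix length keeps working if a label name ever contains an underscore, whereas splitting on `_` would return only the last fragment.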

In [20]:
ft_model, train_score = train_fasttext(X_train, y_train)
val_preds = [i[0].split('_')[-1] for i in ft_model.predict(list(X_val))[0]]

val_preds = np.array(val_preds).astype(int)
val_score = round(accuracy_score(y_val, val_preds), 3)

print(f'val accuracy {val_score}')

print(metrics.confusion_matrix(y_val, val_preds))
print(metrics.classification_report(y_val, val_preds))

train accuracy 0.992
val accuracy 0.992
[[405908   1230]
 [  2201  10602]]
              precision    recall  f1-score   support

           0       0.99      1.00      1.00    407138
           1       0.90      0.83      0.86     12803

    accuracy                           0.99    419941
   macro avg       0.95      0.91      0.93    419941
weighted avg       0.99      0.99      0.99    419941



# Saving the fasttext model

In [22]:
ft_model.save_model("fasttext_model.bin")

# Saving the Naive Bayes model

In [29]:
joblib.dump(nb, "./nb_model.pkl")

['./nb_model.pkl']

# Testing on unseen test data 

# First, test the weaker model, Naive Bayes, on unseen test data

In [27]:
y_pred = nb.predict(X_test)
y_pred_prb = nb.predict_proba(X_test)

test_score = round(accuracy_score(y_test, y_pred), 3)
print(f'test accuracy {test_score}')

print(metrics.confusion_matrix(y_test, y_pred))
print(metrics.classification_report(y_test, y_pred))

print('ROC AUC Score is' + '\n')
print(metrics.roc_auc_score(y_test, y_pred_prb[:,1]))

test accuracy 0.995
[[405638    283]
 [  1921   7139]]
              precision    recall  f1-score   support

           0       1.00      1.00      1.00    405921
           1       0.96      0.79      0.87      9060

    accuracy                           0.99    414981
   macro avg       0.98      0.89      0.93    414981
weighted avg       0.99      0.99      0.99    414981

ROC AUC Score is

0.9931143129379241


# Performance of the best model, fastText, on unseen test data

In [24]:
test_preds = [i[0].split('_')[-1] for i in ft_model.predict(list(df_test['row']))[0]]

test_preds = np.array(test_preds).astype(int)
test_score = round(accuracy_score(y_test, test_preds), 3)
print(f'test accuracy {test_score}')

print(metrics.confusion_matrix(y_test, test_preds))
print(metrics.classification_report(y_test, test_preds))

test accuracy 0.995
[[405478    443]
 [  1505   7555]]
              precision    recall  f1-score   support

           0       1.00      1.00      1.00    405921
           1       0.94      0.83      0.89      9060

    accuracy                           1.00    414981
   macro avg       0.97      0.92      0.94    414981
weighted avg       1.00      1.00      1.00    414981
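
Both models report the same rounded test accuracy (0.995), so the difference between them only shows in the error counts. The sketch below compares the two test confusion matrices printed above. The dictionary keys and the helper `f1_from_counts` are illustrative names.

```python
# Test confusion matrices printed above, in scikit-learn layout [[TN, FP], [FN, TP]].
test_confusion = {
    "tfidf_nb": [[405638, 283], [1921, 7139]],
    "fasttext": [[405478, 443], [1505, 7555]],
}


def f1_from_counts(confusion):
    (_, fp), (fn, tp) = confusion
    # Equivalent to the harmonic mean of precision and recall.
    return 2 * tp / (2 * tp + fp + fn)


for name, confusion in test_confusion.items():
    (_, fp), (fn, _) = confusion
    print(f"{name}: missed headers={fn}, false alarms={fp}, F1={f1_from_counts(confusion):.3f}")

extra_found = test_confusion["fasttext"][1][1] - test_confusion["tfidf_nb"][1][1]
extra_false_alarms = test_confusion["fasttext"][0][1] - test_confusion["tfidf_nb"][0][1]
print(extra_found, extra_false_alarms)  # 416 160
```

On this test set fastText finds 416 more headers than TF-IDF with Naive Bayes at the cost of 160 more false alarms, which is the trade-off behind its higher F1.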

