Some samples contain more than one document. This notebook explores the idea of having separate models for 1) samples without page-breaks and 2) multi-page samples (samples with page-breaks).

A sample without page-breaks is guaranteed to be a single document => model 1
A sample with page-breaks may be broken into individual pages, each of which is potentially a separate document => model 2

We plan to test 3 separate models.
- One is trained on data without page-breaks (model 1)
- One is trained on data with page-breaks (candidate for model 2)
- One is trained on data without page-breaks, but page-breaks are introduced artificially (candidate for model 2)


PROBLEM: both the pagebreaks and no_pagebreaks datasets carry a single label per sample, although samples are potentially multilabel. Thus, there is no training data for multilabel cases.
All we can do is build the chopped dataset and manually inspect the results of a model trained on it.

Overview:
- prepare dataset without page-breaks
- prepare dataset with page-breaks
- check for overlap of the two datasets (using ID). If there is an overlap, remove the overlap from the dataset without page-breaks (which is much larger and hence we can afford to make it smaller for the benefit of the smaller dataset)

- train model 1 (dataset without page-breaks)
- train model 2 (dataset with page-breaks)

- try chopping the dataset without page-breaks to produce an artificial dataset with page-breaks. Try different chopping approaches
- train model 3 on the chopped dataset
- compare model 3 and model 2 and pick one/combine them

- create a wrapper that will decide which model to use
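
The last step of the overview, a wrapper that decides which model to use, could look roughly like the sketch below. It is not the notebook's implementation: the class name `PageBreakRouter` is hypothetical, the marker string `"zlom"` is only assumed from the filter used on the page-break dataset, and both models are assumed to expose a `predict` method that takes a list of texts. Splitting a sample into pages before prediction is also an assumption about how model 2 would be used.

```python
# Assumption: a page break shows up in the text as a marker containing "zlom".
# The exact marker string must be checked against the real data.
PAGE_BREAK_MARKER = "zlom"


class PageBreakRouter:
    """Route each sample to one of two models based on page-break detection."""

    def __init__(self, model_no_pagebreaks, model_pagebreaks,
                 marker=PAGE_BREAK_MARKER):
        self.model_no_pagebreaks = model_no_pagebreaks
        self.model_pagebreaks = model_pagebreaks
        self.marker = marker

    def predict(self, texts):
        """Return one list of labels per sample.

        A sample without page-breaks yields a single label. A sample with
        page-breaks is split on the marker and yields one label per
        non-empty page.
        """
        results = []
        for text in texts:
            if self.marker in text:
                pages = [p for p in text.split(self.marker) if p.strip()]
                labels = list(self.model_pagebreaks.predict(pages)) if pages else []
            else:
                labels = list(self.model_no_pagebreaks.predict([text]))
            results.append(labels)
        return results
```

Returning a list of labels per sample keeps the multilabel case visible: a multi-page sample whose pages receive different labels is a candidate for containing several documents.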

In [None]:
# Imports
import pandas as pd
import numpy as np
import mltools
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

In [None]:
# Prepare dataset without page-breaks
df_no_pagebreaks = pd.read_parquet('/Users/ondrejgutten/Work/PISI.nosync/data/PB/SpisyPB18-PB24_v2.parquet')
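# Keep only rows whose label (column 3) is all upper-case;
# comparing with `== True` also drops rows where the label is missing (NaN)
df_no_pagebreaks = df_no_pagebreaks[df_no_pagebreaks.iloc[:,3].str.isupper() == True]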

In [None]:
# Prepare dataset with page-breaks
df_pagebreaks = pd.read_parquet('/Users/ondrejgutten/Work/PISI.nosync/data/PB/Spisy2024-12-03-14-24_PB18-PB24_zlomyStran_remapped.parquet')
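# Keep only rows whose label (column 3) is all upper-case
df_pagebreaks = df_pagebreaks[df_pagebreaks.iloc[:,3].str.isupper() == True]
# Keep only rows whose text (column 4) contains a page-break marker ('zlom')
df_pagebreaks = df_pagebreaks[df_pagebreaks.iloc[:,4].str.contains('zlom', regex=False, na=False)]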

X_pagebreaks = df_pagebreaks.iloc[:,4]
y_pagebreaks = df_pagebreaks.iloc[:,3]

X_train_pagebreaks, X_test_pagebreaks, y_train_pagebreaks, y_test_pagebreaks = train_test_split(X_pagebreaks, y_pagebreaks, test_size=0.2, random_state=42)

In [None]:
# Check for overlap of datasets with and without page-breaks
debt_ids_no_pagebreaks = df_no_pagebreaks.iloc[:,0].unique().astype(str)
debt_ids_pagebreaks = df_pagebreaks.iloc[:,0].unique().astype(str)
debt_ids_intersection = np.intersect1d(debt_ids_no_pagebreaks, debt_ids_pagebreaks)

# Drop the overlapping IDs from the (much larger) dataset without page-breaks
df_no_pagebreaks_minus_intersection = df_no_pagebreaks[~df_no_pagebreaks.iloc[:,0].astype(str).isin(debt_ids_intersection)]

X_no_pagebreaks = df_no_pagebreaks_minus_intersection.iloc[:,4]
y_no_pagebreaks = df_no_pagebreaks_minus_intersection.iloc[:,3]

X_train_no_pagebreaks, X_test_no_pagebreaks, y_train_no_pagebreaks, y_test_no_pagebreaks = train_test_split(X_no_pagebreaks, y_no_pagebreaks, test_size=0.2, random_state=42)

In [None]:
# Train model on dataset with page-breaks
xgb_pagebreaks = mltools.architecture.TF_IDF_XGBoost('pagebreaks',{})
xgb_pagebreaks.fit(X_train_pagebreaks, y_train_pagebreaks)
xgb_pagebreaks_predictions = xgb_pagebreaks.predict(X_test_pagebreaks)
print(accuracy_score(y_test_pagebreaks, xgb_pagebreaks_predictions))


In [None]:
# Train model on dataset without page-breaks
xgb_no_pagebreaks = mltools.architecture.TF_IDF_XGBoost('no_pagebreaks',{})
xgb_no_pagebreaks.fit(X_train_no_pagebreaks, y_train_no_pagebreaks)
xgb_no_pagebreaks_predictions = xgb_no_pagebreaks.predict(X_test_no_pagebreaks)
print(accuracy_score(y_test_no_pagebreaks, xgb_no_pagebreaks_predictions))

In [None]:
# Chop dataset without page-breaks into artificial pages
def chop_data(X, y, length):
    """Split each text in X into consecutive chunks of `length` characters.

    Every chunk inherits the label of the text it was cut from.
    """
    X_chopped = []
    y_chopped = []
    for text, label in zip(X, y):
        for start in range(0, len(text), length):
            X_chopped.append(text[start:start + length])
            y_chopped.append(label)

    return X_chopped, y_chopped

X_train_chopped, y_train_chopped = chop_data(X_train_no_pagebreaks, y_train_no_pagebreaks, 1000)
X_test_chopped, y_test_chopped = chop_data(X_test_no_pagebreaks, y_test_no_pagebreaks, 1000)
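
Fixed-length character chunks cut words in half, which may matter for a TF-IDF representation. One alternative chopping approach, sketched here under the assumption that the texts are plain strings, breaks at whitespace so that words stay intact. The function names `chop_on_whitespace` and `chop_dataset` are hypothetical and not part of the notebook or of `mltools`. Note that this sketch collapses runs of whitespace (including newlines) into single spaces.

```python
def chop_on_whitespace(text, max_length):
    """Split text into chunks of at most max_length characters,
    breaking at whitespace so that words stay intact where possible."""
    chunks = []
    current = ""
    for word in text.split():
        # A single word longer than max_length has to be hard-split.
        while len(word) > max_length:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(word[:max_length])
            word = word[max_length:]
        candidate = word if not current else current + " " + word
        if len(candidate) <= max_length:
            current = candidate
        else:
            chunks.append(current)
            current = word
    if current:
        chunks.append(current)
    return chunks


def chop_dataset(X, y, max_length):
    """Chop every text in X; each chunk inherits the label of its source text."""
    X_chopped, y_chopped = [], []
    for text, label in zip(X, y):
        for chunk in chop_on_whitespace(text, max_length):
            X_chopped.append(chunk)
            y_chopped.append(label)
    return X_chopped, y_chopped
```

It can be used as a drop-in replacement for `chop_data`, for example `chop_dataset(X_train_no_pagebreaks, y_train_no_pagebreaks, 1000)`.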

In [None]:
# Train model on the chopped dataset
xgb_chopped = mltools.architecture.TF_IDF_XGBoost('chopped',{})
xgb_chopped.fit(X_train_chopped, y_train_chopped)
xgb_chopped_predictions = xgb_chopped.predict(X_test_chopped)
print(accuracy_score(y_test_chopped, xgb_chopped_predictions))

In [None]:
# Compare model trained on dataset with page-breaks and the model trained on the chopped dataset. Pick/combine a final model for page-break samples.
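
The accuracies printed above are computed on different test sets (real page-break samples versus artificially chopped ones), so they are not directly comparable. A possible way to compare the two candidates is to score both on the same held-out set of real page-break samples. The helper names `accuracy` and `compare_models` below are hypothetical, and the sketch only assumes that each model exposes `predict` as used earlier in this notebook.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    y_true, y_pred = list(y_true), list(y_pred)
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("cannot compute accuracy on an empty test set")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def compare_models(models, X_test, y_test):
    """Score every model on the same test set and return {name: accuracy}."""
    return {
        name: accuracy(y_test, model.predict(X_test))
        for name, model in models.items()
    }
```

For example, `compare_models({'pagebreaks': xgb_pagebreaks, 'chopped': xgb_chopped}, X_test_pagebreaks, y_test_pagebreaks)` would put both candidates for model 2 on equal footing, provided both models return labels in the same format as `y_test_pagebreaks`.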

In [None]:
# Wrap both models into a single model with a page-break detection mechanism