# Netherlands Neurogenetics Database
Author: Nienke Mekkes <br>
Date: 21-Sep-2022. <br>
Correspondence: n.j.mekkes@umcg.nl <br>

## Script: clinical history labeled training data: splitting data
Objectives: load the cleaned training data and split it into: <br>
- test (stored separately)
- train&val (serves as input for our models)

Ideally, the distribution of attributes is similar between the test set and the train&val set, so we stratify the split on the attributes. In particular, the test set should not contain attributes that are absent from the train&val set, and vice versa. <br>
We assume that we do not have to stratify on donor characteristics (e.g. sex or diagnosis), since our task is predicting attribute labels.

### Input files:
- excel file with cleaned labeled training data, OR
- pickle file with cleaned labeled training data

### Output
- Folder containing:
    - trainval data (excel and pickle)
    - test data (excel and pickle)




#### Minimal requirements

In [None]:
# %pip install pandas
# %pip install openpyxl
# %pip install iterative-stratification
# %pip install scikit-multilearn
# %pip install natsort ?
# %pip install nltk

#### Imports

In [None]:
import pickle
import pandas as pd
import numpy as np
import os
from iterstrat.ml_stratifiers import MultilabelStratifiedKFold
from datetime import date
from helper_functions import split_vis

#### Paths (user input required)

In [None]:
#path_to_cleaned_training_data_xlsx = '/data/p307948/clinical/github_data/output/cleaned_training_data.xlsx'
path_to_cleaned_training_data_pkl = "/home/jupyter-n.mekkes@gmail.com-f6d87/ext_n_mekkes_gmail_com/clinical_history/training_data/cleaned_training_data.pkl"
save_path_files = '/home/jupyter-n.mekkes@gmail.com-f6d87/ext_n_mekkes_gmail_com/clinical_history/training_data'

In [None]:
if not os.path.exists(save_path_files):
    print('Creating output folder....')
    os.makedirs(save_path_files)

#### Loading data


In [None]:
# cleaned_train = pd.read_excel(path_to_cleaned_training_data_xlsx, engine='openpyxl', index_col=[0])
with open(path_to_cleaned_training_data_pkl, "rb") as file:
    cleaned_train = pickle.load(file)
display(cleaned_train)

#### Data preparation

In [None]:
non_attribute_columns = ['NBB_nr','Year_Sentence_nr','Sentence']
attributes = [col for col in cleaned_train.columns if col not in non_attribute_columns]
print(len(attributes))

#### Split set-up (user input required)

- 'X' contains the sentences
- 'Y' contains the attribute labels (numpy array of 1s and 0s per sentence)
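
As an illustration of what `X` and `Y` look like, here is a minimal sketch on a toy dataframe. The donor IDs, sentences, and the attribute names `Falls` and `Memory_impairment` are invented for the example; only the three non-attribute column names come from this notebook.

```python
import pandas as pd

# Toy stand-in for the cleaned training data (all values are made up).
toy = pd.DataFrame({
    "NBB_nr": ["d1", "d1", "d2"],
    "Year_Sentence_nr": ["2001_1", "2001_2", "1999_1"],
    "Sentence": ["Patient fell twice.", "Memory complaints.", "No complaints."],
    "Falls": [1, 0, 0],
    "Memory_impairment": [0, 1, 0],
})

non_attribute_columns = ["NBB_nr", "Year_Sentence_nr", "Sentence"]
attributes = [col for col in toy.columns if col not in non_attribute_columns]

X = toy[["Sentence"]].to_numpy()  # one row per sentence, shape (3, 1)
Y = toy[attributes].to_numpy()    # one 0/1 column per attribute, shape (3, 2)

print(X.shape, Y.shape)
```

Row `i` of `Y` is the label vector of the sentence in row `i` of `X`, so indexing both arrays with the same row numbers keeps sentences and labels aligned.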

In [None]:
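# Intended split fractions. The split itself is controlled by n_splits further down:
# n_splits=5 gives a test fraction of 1/5 = 0.2.
train_val_size = 0.8
test_size = 0.2
print(cleaned_train[['Sentence']])
X = cleaned_train[['Sentence']].to_numpy()  # shape: (n_sentences, 1)
Y = cleaned_train[attributes].to_numpy()    # shape: (n_sentences, n_attributes)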

In [None]:
print(X)
print(Y)

#### Iterative split, approach MultilabelStratifiedKFold (preferred)
MultilabelStratifiedKFold is well suited to creating multiple stratified folds for datasets with multiple labels, which is why we use it later when training and optimizing the model on the train&val dataset. Here we create that train&val dataset. Although we do not need multiple folds at this stage (we think one test set is enough), we use the same library for continuity and ease of use. <br>

We will never reach 'perfect' stratification, since a sentence can only be present in either test or train&val, and some sentences contain multiple attributes.

MultilabelStratifiedKFold takes n_splits as input and creates n_splits folds. <br>
In each fold, 1/n_splits of the data is used as test and the remainder as train. <br>
E.g. for n_splits = 5 and 10 sentences, 2 sentences (1/5) serve as test and 8 sentences as train. <br>
This is repeated 5 times, giving 5 different 2-8 combinations. <br>

Importantly, we only need a single 2-8 combination, so we exit the loop after the first iteration!
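
The arithmetic behind this can be sketched in a few lines of plain Python. The helpers `n_splits_for` and `expected_fold_sizes` are hypothetical names introduced for illustration; they are not part of iterative-stratification, and the actual fold sizes produced by the stratifier may differ slightly from these approximations.

```python
def n_splits_for(test_size):
    """Number of folds whose held-out part is roughly `test_size` of the data."""
    return round(1 / test_size)


def expected_fold_sizes(n_samples, n_splits):
    """Approximate (train&val, test) sizes of a single fold."""
    n_test = n_samples // n_splits
    return n_samples - n_test, n_test


n_splits = n_splits_for(0.2)
print(n_splits)                           # number of folds for a 20% test set
print(expected_fold_sizes(10, n_splits))  # (train&val, test) for 10 sentences
```

With a test fraction of 0.2 this gives 5 folds, and for 10 sentences an 8-2 division, matching the example above.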

In [None]:
mskf = MultilabelStratifiedKFold(n_splits=5, shuffle=True, random_state=1)
# n_splits=5 means each fold holds out 1/5 of the sentences as test.
# We only need one split, so we take the first fold and leave the loop.
for train_val_index, test_index in mskf.split(X, Y):
    print("TRAINVAL row numbers:", train_val_index, "\nTEST row numbers:", test_index)
    x_train_val, x_test = X[train_val_index], X[test_index]
    y_train_val, y_test = Y[train_val_index], Y[test_index]
    n_total = len(x_train_val) + len(x_test)
    # ':.2%' multiplies the fraction by 100 and appends the percent sign
    print(f"nr of sentences in train&val: {len(x_train_val)}, or {len(x_train_val)/n_total:.2%}\n"
          f"nr of sentences in test: {len(x_test)}, or {len(x_test)/n_total:.2%}")
    print('\nFinished splitting data once, exiting loop...')
    break

#### Investigate if attributes are distributed roughly equally between test and train&val

In [None]:
pd.options.display.max_columns = 100
std, uncor, cor = split_vis(x_train_val, x_test, y_train_val, y_test,
                            x_train_val.shape[0], x_test.shape[0], attributes)
print('Stratification score: ', round(std, 4))

print('Attribute distribution real numbers:')
display(uncor)

print('Attribute distribution numbers adjusted for sizes of train&val and test:')
display(cor)
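
The implementation of `split_vis` lives in `helper_functions` and is not shown here. As a rough, independent illustration of the idea, the sketch below compares per-attribute label frequencies between two splits using only NumPy. The function `attribute_balance` and the toy label matrices are assumptions made for this example and may not match what `split_vis` computes.

```python
import numpy as np


def attribute_balance(y_a, y_b, names):
    """Compare per-attribute label frequencies between two splits.

    Returns the standard deviation of the frequency differences (lower means
    more similar splits) and the attributes present in exactly one split.
    """
    freq_a = y_a.mean(axis=0)  # fraction of sentences carrying each attribute
    freq_b = y_b.mean(axis=0)
    diff = freq_a - freq_b
    counts_a, counts_b = y_a.sum(axis=0), y_b.sum(axis=0)
    only_one_split = [name for name, a, b in zip(names, counts_a, counts_b)
                      if (a == 0) != (b == 0)]
    return float(np.std(diff)), only_one_split


# Toy label matrices: 4 sentences in one split, 2 in the other, 3 attributes.
y_a = np.array([[1, 0, 1], [0, 1, 0], [1, 0, 0], [0, 1, 0]])
y_b = np.array([[1, 0, 0], [0, 1, 0]])

score, only_one_split = attribute_balance(y_a, y_b, ["A", "B", "C"])
print(round(score, 4), only_one_split)
```

Using frequencies rather than raw counts corrects for the different sizes of the two splits. The second return value flags attributes that would make the test set and the train&val set cover different label sets.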

#### Creating final dataframes

In [None]:
# Each row holds one sentence ('text') and its attribute vector as a list ('labels').
trainval_df = pd.DataFrame(x_train_val, columns=['text'])
trainval_df['labels'] = y_train_val.tolist()
trainval_df.to_excel(f"{save_path_files}/trainval_data.xlsx")
trainval_df.to_pickle(f"{save_path_files}/trainval_data.pkl")

test_df = pd.DataFrame(x_test, columns=['text'])
test_df['labels'] = y_test.tolist()
test_df.to_excel(f"{save_path_files}/test_data.xlsx")
test_df.to_pickle(f"{save_path_files}/test_data.pkl")
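
A later notebook will need to read these files back in. The sketch below shows one possible way to round-trip the pickle and turn the `labels` column back into a 2D array. It uses a toy dataframe and a temporary directory rather than the real paths, so the contents are assumptions for illustration only.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Toy stand-in for trainval_df: one sentence and one label list per row.
toy_df = pd.DataFrame({
    "text": ["a", "b", "c"],
    "labels": [[1, 0], [0, 1], [0, 0]],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "trainval_data.pkl")
    toy_df.to_pickle(path)
    loaded = pd.read_pickle(path)

# The pickle preserves the Python lists, so they stack back into a matrix.
labels = np.array(loaded["labels"].tolist())
print(labels.shape)
```

The recovered `labels` array has one row per sentence and one column per attribute, the same layout as `Y` before the split.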