# Netherlands Neurogenetics Database
Author: Nienke Mekkes <br>
Date: 21-Sep-2022. <br>
Correspondence: n.j.mekkes@umcg.nl <br>

## Script: clinical history labeled training data: splitting data
Objectives: load the cleaned training data and split it into: <br>
- test (will be stored separately)
- train&val (serves as input for our models)

Ideally, the distribution of attributes is similar between the test set and the train&val set. Therefore we stratify the data on the attributes, so that every attribute present in the test set is also present in the train&val set, and vice versa. <br>
We assume that we do not have to stratify on donor characteristics (e.g. sex or diagnosis), since our task is predicting attribute labels.

### Input files:
- Excel file with cleaned labeled training data, OR
- pickle file with cleaned labeled training data

### Output
- Folder containing:
    - trainval data (Excel and pickle)
    - test data (Excel and pickle)




#### Minimal requirements

In [8]:
# %pip install pandas
# %pip install openpyxl
# %pip install iterative-stratification
# %pip install scikit-multilearn
# %pip install natsort ?
# %pip install nltk

#### Imports

In [1]:
import pickle
import pandas as pd
import numpy as np
import os
from iterstrat.ml_stratifiers import MultilabelStratifiedKFold
from datetime import date
from helper_functions import split_vis

#### Paths (user input required)

In [2]:
#path_to_cleaned_training_data_xlsx = '/data/p307948/clinical/github_data/output/cleaned_training_data.xlsx'
path_to_cleaned_training_data_pkl = "/home/jupyter-n.mekkes@gmail.com-f6d87/ext_n_mekkes_gmail_com/clinical_history/training_data/cleaned_training_data.pkl"
save_path_files = '/home/jupyter-n.mekkes@gmail.com-f6d87/ext_n_mekkes_gmail_com/clinical_history/training_data'

In [3]:
if not os.path.exists(save_path_files):
    print('Creating output folder....')
    os.makedirs(save_path_files)

#### Loading data


In [5]:
# cleaned_train = pd.read_excel(path_to_cleaned_training_data_xlsx, engine='openpyxl', index_col=[0])
with open(path_to_cleaned_training_data_pkl, "rb") as file:
    cleaned_train = pickle.load(file)
display(cleaned_train)

Unnamed: 0,NBB_nr,Year_Sentence_nr,Sentence,Muscular_Weakness,Spasticity,Hyperreflexia_and_oth_reflexes,Frontal_release_signs,Fasciculations,Positive_sensory_symptoms,Negative_sensory_symptoms,...,Orthostatic_hypotension,Headache_migraine,Fatigue,Declined_deteriorated_health,Cachexia,Weight_loss,Reduces_oral_intake,Help_in_ADL,Day_care,Admission_to_nursing_home
0,NBB 1990-048,Past_sentence_0,Past: The patient was known to have atrial fib...,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,NBB 1990-048,Past_sentence_1,The patient was known to have hypertension and...,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,NBB 1990-048,1979_sentence_0,1979: She got a total hip,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,NBB 1990-048,1979_sentence_1,At age 76 the first demential symptomes appeared,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,NBB 1990-048,1979_sentence_2,After the death of her husband homesituation w...,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
19049,NBB 2018-114,2018_sentence_25,The patient himself did not recognize himself ...,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
19050,NBB 2018-114,2018_sentence_26,In July and August he suffered from deliria po...,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
19051,NBB 2018-114,2018_sentence_27,In August the GP reported that it was impossib...,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
19052,NBB 2018-114,2018_sentence_28,This was a reason why a hospice turned down an...,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


#### Data preparation

In [6]:
non_attribute_columns = ['NBB_nr','Year_Sentence_nr','Sentence']
attributes = [col for col in cleaned_train.columns if col not in non_attribute_columns]
print(len(attributes))

90


#### Split set-up (user input required)

- 'X' contains the sentences
- 'Y' contains the attribute labels (numpy array of 1s and 0s per sentence)

In [7]:
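train_val_size = 0.8  # fraction of sentences kept for training and validation
test_size = 0.2  # fraction held out as test; corresponds to n_splits=5 in the split below (1/5 = 0.2)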
print(cleaned_train[['Sentence']])
X = cleaned_train[['Sentence']].to_numpy()
Y = cleaned_train[attributes].to_numpy()

                                                Sentence
0      Past: The patient was known to have atrial fib...
1      The patient was known to have hypertension and...
2                              1979: She got a total hip
3       At age 76 the first demential symptomes appeared
4      After the death of her husband homesituation w...
...                                                  ...
19049  The patient himself did not recognize himself ...
19050  In July and August he suffered from deliria po...
19051  In August the GP reported that it was impossib...
19052  This was a reason why a hospice turned down an...
19053  In September the patient was euthanized and de...

[18917 rows x 1 columns]


In [8]:
print(X)
print(Y)

[['Past: The patient was known to have atrial fibrillation']
 ['The patient was known to have hypertension and a struma during WWII']
 ['1979: She got a total hip']
 ...
 ['In August the GP reported that it was impossible to get him to sleep despite consulting geriatric specialists and trying several medications sleeping was of major concern']
 ['This was a reason why a hospice turned down an admission']
 ['In September the patient was euthanized and deceased at home']]
[[0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]]


#### Iterative split, approach MultilabelStratifiedKFold (preferred)
MultilabelStratifiedKFold creates multiple stratified folds for datasets in which each sample can carry several labels. This is why we use it later, on the train&val dataset, when training and optimizing the model. Here we use it to separate the test set from the train&val set: although we do not need multiple folds (we think one test set is enough), we use the same library for continuity and ease of use. <br>

We will never reach 'perfect' stratification, since each sentence can only be present in either the test set or the train&val set, and some sentences contain multiple attributes.

MultilabelStratifiedKFold takes n_splits as input and creates n_splits folds. <br>
In each fold, 1/n_splits of the data is used as test and the remainder as train. <br>
E.g. with an n_splits of 5 and 10 sentences, 2 sentences function as test (1/5) and 8 sentences as train. <br>
This is performed 5 times, leading to 5 different 2-8 combinations. <br>

Importantly, we only need a single 2-8 combination, so we exit the loop after a single run!
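
The relation between the test fraction and n_splits can be sketched with plain arithmetic. The helper names `n_splits_for` and `fold_sizes` are hypothetical and not part of the notebook or of iterstrat. The sketch only shows evenly sized, unstratified folds, so the stratified splitter may assign individual rows differently.

```python
def n_splits_for(test_size):
    """Return the n_splits whose single held-out fold equals test_size.

    Hypothetical helper: only fractions of the form 1/n are accepted.
    """
    n_splits = round(1 / test_size)
    if abs(1 / n_splits - test_size) > 1e-9:
        raise ValueError(f"test_size={test_size} is not of the form 1/n")
    return n_splits


def fold_sizes(n_rows, n_splits):
    """Sizes of n_splits folds that are as even as possible."""
    base, extra = divmod(n_rows, n_splits)
    # The first `extra` folds receive one additional row.
    return [base + 1 if i < extra else base for i in range(n_splits)]


n_splits = n_splits_for(0.2)
print(n_splits)                  # number of folds for a 20% test set
print(fold_sizes(10, n_splits))  # 10 sentences: each fold holds 2 test sentences
```

With 10 sentences and a test fraction of 0.2 this reproduces the 2-8 combination described above: each fold holds 2 sentences, and the remaining 8 form the train part.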

In [9]:
mskf = MultilabelStratifiedKFold(n_splits=5, shuffle=True, random_state=1)
# Only the first of the 5 folds is needed, so we break after one iteration.
for train_val_index, test_index in mskf.split(X, Y):
    print("TRAINVAL row numbers:", train_val_index, "\nTEST row numbers:", test_index)
    x_train_val, x_test = X[train_val_index], X[test_index]
    y_train_val, y_test = Y[train_val_index], Y[test_index]
    n_total = len(x_train_val) + len(x_test)
    # ':.2%' multiplies the fraction by 100 and appends the percent sign
    print(f"nr of sentences in train&val: {len(x_train_val)}, or {len(x_train_val)/n_total:.2%}\n"
          f"nr of sentences in test: {len(x_test)}, or {len(x_test)/n_total:.2%}")
    print('\nFinished splitting data once, exiting loop...')
    break

TRAINVAL row numbers: [    0     2     3 ... 18913 18915 18916] 
TEST row numbers: [    1     4     8 ... 18897 18904 18914]
nr of sentences in train&val: 15133, or 80.00%
nr of sentences in test: 3784, or 20.00%

Finished splitting data once, exiting loop...
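
A quick sanity check on a split like this one is to confirm that the two sets of row numbers do not overlap and together cover every row. The sketch below is a minimal illustration; `check_partition` is a hypothetical helper that does not exist in the notebook or in `helper_functions`.

```python
def check_partition(train_idx, test_idx, n_rows):
    """Return True if the two index collections are disjoint and cover 0..n_rows-1."""
    train = {int(i) for i in train_idx}
    test = {int(i) for i in test_idx}
    overlap = train & test                        # rows assigned to both sets
    missing = set(range(n_rows)) - train - test   # rows assigned to neither set
    return not overlap and not missing


# Toy example with 5 rows: rows 1 and 4 are held out as test.
print(check_partition([0, 2, 3], [1, 4], 5))
```

In the notebook this could presumably be called as `check_partition(train_val_index, test_index, len(X))`, since the function accepts any iterable of integer row numbers.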


#### Investigate if attributes are distributed roughly equally between test and train&val

In [10]:
pd.options.display.max_columns = 100
std,uncor, cor = split_vis(x_train_val,x_test,y_train_val, y_test,x_train_val.shape[0],x_test.shape[0],attributes)
print('Stratification score: ', round(std,4))

print('Attribute distribution real numbers:')
display(uncor)

print('Attribute distribution numbers adjusted for sizes of train&val and test:')
display(cor)

Stratification score:  0.9167
Attribute distribution real numbers:


Unnamed: 0,Muscular_Weakness,Spasticity,Hyperreflexia_and_oth_reflexes,Frontal_release_signs,Fasciculations,Positive_sensory_symptoms,Negative_sensory_symptoms,Parkinsonism,Facial_masking,Tremor,Bradykinesia,Rigidity,Vertigo,Nystagmus,Ataxia,Loss_of_coordination,Balance_problems,Frequent_falls,Decreased_motor_skills,Unspecified_disturbed_gait_patt,Mobility_problems,Dementia,Cognitive_decline,Bradyphrenia,Lack_of_insight,Facade_behavior,Head_turning_sign,Memory_impairment,Amnesia,Forgetfulness,Imprinting_disturbances,Impaired_recognition,Confabulations,Disorientation,Wandering,Confusion,Aphasia,Word_finding_problems,Language_impairment,Impaired_comprehension,Communication_problems,Dysarthria,Apraxia,Executive_function_disorder,Lack_of_planning_organis_overv,Concentration_problems,Disinhibition,Loss_of_decorum,Apathy_inertia,Lack_of_initiative,Loss_of_sympathy_empathy,Compulsive_behavior,Hyperorality,Changed_moods_emotions,Changed_behavior_personality,Agitation,Aggressive_behavior,Stress,Anxiety,Depressed_mood,Suicidal_ideation,Mania,Psychosis,Psychiatric_admissions,Paranoia_suspiciousness,Hallucinations,Delusions,Delirium,Day_night_rhythm_disturbances,Sleep_disturbances,Restlessness,Vivid_dreaming,Hearing_problems,Visual_problems,Olfactory_gustatory_dysfunction,Urinary_incontinence,Urinary_problems_other,Constipation,Swallowing_problems_Dysphagia,Seizures,Orthostatic_hypotension,Headache_migraine,Fatigue,Declined_deteriorated_health,Cachexia,Weight_loss,Reduces_oral_intake,Help_in_ADL,Day_care,Admission_to_nursing_home
train_and_val,294,113,118,21,6,115,66,127,68,147,205,157,28,33,72,82,128,101,74,15,426,293,236,64,92,20,6,371,41,114,64,49,24,237,40,127,49,110,211,93,54,91,96,33,56,127,66,118,45,49,7,95,30,136,235,106,100,36,157,216,37,34,44,33,42,165,33,57,24,137,142,10,72,100,13,229,102,155,141,145,25,83,238,304,22,109,146,193,49,183
test,73,28,30,6,1,29,16,32,18,37,51,40,7,8,18,20,32,25,19,3,106,74,59,16,22,5,2,93,10,28,17,12,6,59,10,32,13,27,53,23,14,23,24,8,15,31,17,29,11,12,2,24,7,34,59,26,24,9,40,54,10,8,11,8,11,42,8,15,6,34,36,2,18,24,4,57,25,39,35,37,6,20,59,76,5,27,37,49,12,46


Attribute distribution numbers adjusted for sizes of train&val and test:


Unnamed: 0,Muscular_Weakness,Spasticity,Hyperreflexia_and_oth_reflexes,Frontal_release_signs,Fasciculations,Positive_sensory_symptoms,Negative_sensory_symptoms,Parkinsonism,Facial_masking,Tremor,Bradykinesia,Rigidity,Vertigo,Nystagmus,Ataxia,Loss_of_coordination,Balance_problems,Frequent_falls,Decreased_motor_skills,Unspecified_disturbed_gait_patt,Mobility_problems,Dementia,Cognitive_decline,Bradyphrenia,Lack_of_insight,Facade_behavior,Head_turning_sign,Memory_impairment,Amnesia,Forgetfulness,Imprinting_disturbances,Impaired_recognition,Confabulations,Disorientation,Wandering,Confusion,Aphasia,Word_finding_problems,Language_impairment,Impaired_comprehension,Communication_problems,Dysarthria,Apraxia,Executive_function_disorder,Lack_of_planning_organis_overv,Concentration_problems,Disinhibition,Loss_of_decorum,Apathy_inertia,Lack_of_initiative,Loss_of_sympathy_empathy,Compulsive_behavior,Hyperorality,Changed_moods_emotions,Changed_behavior_personality,Agitation,Aggressive_behavior,Stress,Anxiety,Depressed_mood,Suicidal_ideation,Mania,Psychosis,Psychiatric_admissions,Paranoia_suspiciousness,Hallucinations,Delusions,Delirium,Day_night_rhythm_disturbances,Sleep_disturbances,Restlessness,Vivid_dreaming,Hearing_problems,Visual_problems,Olfactory_gustatory_dysfunction,Urinary_incontinence,Urinary_problems_other,Constipation,Swallowing_problems_Dysphagia,Seizures,Orthostatic_hypotension,Headache_migraine,Fatigue,Declined_deteriorated_health,Cachexia,Weight_loss,Reduces_oral_intake,Help_in_ADL,Day_care,Admission_to_nursing_home
train_and_val,294,113,118,21,6,115,66,127,68,147,205,157,28,33,72,82,128,101,74,15,426,293,236,64,92,20,6,371,41,114,64,49,24,237,40,127,49,110,211,93,54,91,96,33,56,127,66,118,45,49,7,95,30,136,235,106,100,36,157,216,37,34,44,33,42,165,33,57,24,137,142,10,72,100,13,229,102,155,141,145,25,83,238,304,22,109,146,193,49,183
test,291,111,119,23,3,115,63,127,71,147,203,159,27,31,71,79,127,99,75,11,423,295,235,63,87,19,7,371,39,111,67,47,23,235,39,127,51,107,211,91,55,91,95,31,59,123,67,115,43,47,7,95,27,135,235,103,95,35,159,215,39,31,43,31,43,167,31,59,23,135,143,7,71,95,15,227,99,155,139,147,23,79,235,303,19,107,147,195,47,183
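
The code of `split_vis` is not shown in this notebook. The adjusted test numbers above are consistent with multiplying each raw test count by the size ratio of train&val to test and dropping the fractional part, but that is an inference from the printed tables, not a description of the helper. Under that assumption, the adjustment can be sketched as follows (`scale_test_counts` is a hypothetical name):

```python
def scale_test_counts(test_counts, n_trainval, n_test):
    """Scale raw test counts to the size of the train&val set.

    Assumption: each count is multiplied by n_trainval / n_test and truncated
    to an integer, which matches the adjusted table printed above.
    """
    return [count * n_trainval // n_test for count in test_counts]


# Raw test counts of the first five attributes, taken from the table above.
raw_test = [73, 28, 30, 6, 1]
print(scale_test_counts(raw_test, n_trainval=15133, n_test=3784))
# → [291, 111, 119, 23, 3]
```

After scaling, the test counts can be compared directly with the train&val counts (294, 113, 118, 21, 6 for the same attributes): the closer the two rows are, the better the stratification.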


#### Creating final dataframes

In [11]:
trainval_df = pd.DataFrame(x_train_val,columns=['text'])
trainval_df['labels'] = y_train_val.tolist()
trainval_df.to_excel(f"{save_path_files}/trainval_data.xlsx")
trainval_df.to_pickle(f"{save_path_files}/trainval_data.pkl") 


test_df = pd.DataFrame(x_test,columns=['text'])
test_df['labels'] = y_test.tolist()
test_df.to_excel(f"{save_path_files}/test_data.xlsx")
test_df.to_pickle(f"{save_path_files}/test_data.pkl") 