This notebook splits the imaging data into training and testing sets such that no patient is repeated within the test set and no patient in the test set appears in the training set.

In [18]:
import pandas as pd
import random
#reading in a dataframe that contains image arrays, patient IDs ("subject"), and diagnosis
m2 = pd.read_pickle("mri_meta.pkl")

#cleaning patient IDs
m2["subject"] = m2["subject"].str.replace("s", "S").str.replace("\n", "")

#reading in the overlap test set
ts = pd.read_csv("overlap_test_set.csv")

#removing ids from the overlap test set
m2 = m2[~m2["subject"].isin(list(ts["subject"].values))]
m2

Unnamed: 0,img_array,label,subject
0,"[[[[36.45114748 37.26595571 0.52279603], [63....",0,002_S_0413
1,"[[[[28.35968846 60.78608813 53.39332376], [33....",0,002_S_0413
2,"[[[[67.34232458 13.72514765 39.75249768], [65....",0,002_S_0413
3,"[[[[26.51095324 49.75030355 0.36219632], [84....",0,002_S_0413
4,"[[[[31.09818037 49.19945394 54.24074462], [33....",0,002_S_0413
...,...,...,...
5735,"[[[[30.15725086 52.06041119 2.73605637], [37....",0,941_S_4376
5736,"[[[[ 28.11486643 550.89959322 1.4210884 ], [...",0,941_S_4376
5737,"[[[[ 26.33395312 250.418828 1.31288741], [...",0,941_S_4376
5738,"[[[[ 18.94171287 233.47920191 1.27902779], [...",0,941_S_4376


In [19]:
#counting the unique patients that remain after removing the overlap test set
subjects = list(set(m2["subject"].values))
len(subjects)

331

In [20]:
0.2*len(m2) #20% of the scans

1029.2

We have 3674 MRI scans from 551 patients (some patients were scanned up to 16 times).
We selected our testing set such that it has 367 unique MRIs (10% of the scans), shown below.
We do not allow any repeated patients in the testing set. Repetition was allowed only in training, and no patient was included in both the training and testing sets.
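The scan and patient counts quoted above can be checked directly from the `subject` column. The following is a minimal sketch on a toy frame; the IDs `A`, `B`, `C` and the name `toy` are made up for illustration, and in the notebook the frame would be `m2`.

```python
import pandas as pd

# Toy stand-in for m2: one row per scan, several scans per patient (hypothetical IDs).
toy = pd.DataFrame({
    "subject": ["A", "A", "A", "B", "C", "C"],
    "label":   [0, 0, 0, 1, 2, 2],
})

# Number of scans per patient, most-scanned patient first.
scans_per_patient = toy["subject"].value_counts()

n_scans = len(toy)                           # total number of scans
n_patients = toy["subject"].nunique()        # number of unique patients
max_repeats = int(scans_per_patient.max())   # most scans recorded for one patient

print(n_scans, n_patients, max_repeats)
```

Running the same three lines on `m2` would give the figures to compare against the text.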

In [21]:
#randomly selecting 60 patient IDs for the test set
picked_ids = random.sample(subjects,60)
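`random.sample` without a seed picks different patients on every run, so the saved split cannot be regenerated. A possible way to make the draw repeatable is sketched below. The helper `pick_ids`, the seed value, and the ID list are illustrative assumptions, not part of the original notebook.

```python
import random

# Illustrative patient IDs; in the notebook this would be the `subjects` list.
subjects = ["002_S_0413", "033_S_0567", "052_S_4959", "941_S_4376", "100_S_0001"]

def pick_ids(ids, k, seed=0):
    """Return k IDs sampled without replacement, reproducibly (hypothetical helper)."""
    rng = random.Random(seed)          # local RNG, leaves the global random state untouched
    return rng.sample(sorted(ids), k)  # sort first: set ordering of strings can differ between runs

picked_a = pick_ids(set(subjects), 2, seed=42)
picked_b = pick_ids(set(subjects), 2, seed=42)
print(picked_a == picked_b)
```

Sorting matters here because `subjects` is built from a `set`, whose iteration order for strings is not guaranteed to be the same across Python sessions.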

In [25]:
#creating the test set out of the patient IDs
#pd.concat replaces DataFrame.append, which was removed in pandas 2.0
test = pd.concat([m2[m2["subject"] == pid] for pid in picked_ids])
test = test[["img_array", "subject", "label"]]
test

Unnamed: 0,img_array,subject,label
2295,"[[[[ 5.59503921 16.56762316 0.84630919], [16....",033_S_0567,1
2296,"[[[[ 6.46584719 26.64359905 13.32751529], [ 7....",033_S_0567,1
2297,"[[[[30.20884581 5.87490673 10.02318587], [30....",033_S_0567,1
2298,"[[[[26.71089092 24.37643344 7.93636607], [20....",033_S_0567,1
2299,"[[[[9.52073974 6.69603216 5.64754991], [8.9284...",033_S_0567,1
...,...,...,...
3124,"[[[[4.9537037 1.08333333 0. ], [7.8055...",052_S_4959,2
3125,"[[[[57.96849612 29.95286075 26.41001555], [ 72...",052_S_4959,2
3126,"[[[[69.12801056 30.57484873 62.73713268], [ 69...",052_S_4959,2
3127,"[[[[65.01692453 31.13441991 76.76311174], [63....",052_S_4959,2
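The table above keeps every scan of each selected patient, so a patient can appear several times in `test`. If the intent is a single scan per test patient, one option is to keep one row per subject. This is a minimal sketch on toy data; keeping the first scan is an assumption, not necessarily the authors' choice, and the `img_array` values are placeholders.

```python
import pandas as pd

# Toy stand-in for `test`: two patients with two scans each (placeholder image values).
toy_test = pd.DataFrame(
    {
        "img_array": ["scan0", "scan1", "scan2", "scan3"],
        "subject": ["033_S_0567", "033_S_0567", "052_S_4959", "052_S_4959"],
        "label": [1, 1, 2, 2],
    },
    index=[2295, 2296, 3124, 3125],
)

# Keep only the first scan of each patient; the original index labels are preserved.
test_unique = toy_test.drop_duplicates(subset="subject", keep="first")
print(test_unique)
```

Scans dropped this way still belong to held-out patients, so they should be excluded from training as well. Building the training set by excluding the picked subject IDs, rather than the test index, avoids leaking them back in.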


In [26]:
indexes = list(set(m2.index) - set(test.index))
len(indexes)

4192

In [27]:
#creating the training set using all the other data points
train = m2[m2.index.isin(indexes)]
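Since the whole point of the split is that no patient is shared between the two sets, a quick check after building them is cheap insurance. The sketch below uses a toy frame with made-up IDs, and splits by subject membership rather than by index, which is an alternative to the index-based approach used here.

```python
import pandas as pd

# Toy stand-in for m2 (hypothetical patient IDs).
toy = pd.DataFrame({
    "subject": ["A", "A", "B", "C", "C", "D"],
    "label":   [0, 0, 1, 2, 2, 0],
})
picked_ids = ["B", "D"]  # hypothetical held-out patients

# Split on patient membership so every scan of a picked patient goes to test.
in_test = toy["subject"].isin(picked_ids)
test = toy[in_test]
train = toy[~in_test]

# Sanity checks: no shared patients, and every scan lands in exactly one set.
shared = set(train["subject"]) & set(test["subject"])
assert not shared, f"patients in both sets: {shared}"
assert len(train) + len(test) == len(toy)
```

The same two assertions can be run on the notebook's `train` and `test` frames before they are pickled.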

In [28]:
train[["img_array"]].to_pickle("img_train.pkl")
test[["img_array"]].to_pickle("img_test.pkl")

In [29]:
train[["label"]].to_pickle("img_y_train.pkl")
test[["label"]].to_pickle("img_y_test.pkl")