I want to make two files that will let me build a model that is ready for the data on the MRNet collab. The first file will contain a list of image locations:

```
MRNet-v1.0/valid/sagittal/1130.npy
MRNet-v1.0/valid/coronal/1130.npy
MRNet-v1.0/valid/axial/1130.npy
MRNet-v1.0/valid/sagittal/1131.npy
MRNet-v1.0/valid/coronal/1131.npy
MRNet-v1.0/valid/axial/1131.npy
...
```

and the other file will contain label info and train/valid/test splits:

```
case,abnormal,ACL,meniscal,split
1130.npy,0,0,0,train
1131.npy,1,1,0,valid
1132.npy,1,0,1,test
...
```
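
As a rough sketch of how the two files could fit together, the snippet below parses each image path into its series and case, then looks the case up in the label table. The helper name `index_images` and the in-memory sample rows are assumptions for illustration. The folder name in the path (`train`/`valid`) is ignored in favor of the `split` column.

```python
import csv
import io
import os

def index_images(path_lines, label_file):
    """Map (case, series) -> (path, label row) using the two planned files."""
    labels = {row['case']: row for row in csv.DictReader(label_file)}
    index = {}
    for path in path_lines:
        path = path.strip()
        if not path:
            continue
        series = os.path.basename(os.path.dirname(path))  # e.g. 'sagittal'
        case = os.path.basename(path)                      # e.g. '1130.npy'
        if case in labels:
            index[(case, series)] = (path, labels[case])
    return index

# Hypothetical sample content mirroring the two formats above
paths = [
    'MRNet-v1.0/valid/sagittal/1130.npy',
    'MRNet-v1.0/valid/coronal/1130.npy',
    'MRNet-v1.0/valid/axial/1131.npy',
]
label_csv = io.StringIO(
    'case,abnormal,ACL,meniscal,split\n'
    '1130.npy,0,0,0,train\n'
    '1131.npy,1,1,0,valid\n'
)
index = index_images(paths, label_csv)
```

Keying on `(case, series)` keeps the three series of one exam together under the same label row.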

The model interface will look like:

### Training
`python train.py model-name --rundir path-to-output-dir --label {abnormal,acl,meniscal,all} --series {axial,coronal,sagittal,all} --full`

### Evaluate
`python eval.py model-path --label {abnormal,acl,meniscal,all} --split {train,valid,test,all}`

### Infer
`python infer.py input-data-csv-filename output-prediction-csv-path -m model-path`
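
A minimal `argparse` sketch of the training command line above, to make the interface concrete. The defaults, the `required=True` on `--rundir`, and the treatment of `--full` as a plain boolean flag are all assumptions, since the notes do not specify them.

```python
import argparse

def build_train_parser():
    """Hypothetical parser mirroring the planned train.py command line."""
    parser = argparse.ArgumentParser(prog='train.py')
    parser.add_argument('model_name', metavar='model-name')
    parser.add_argument('--rundir', required=True, help='path to output dir')
    parser.add_argument('--label', default='all',
                        choices=['abnormal', 'acl', 'meniscal', 'all'])
    parser.add_argument('--series', default='all',
                        choices=['axial', 'coronal', 'sagittal', 'all'])
    # Assumption: --full is an on/off switch; its meaning is not defined here
    parser.add_argument('--full', action='store_true')
    return parser

args = build_train_parser().parse_args(
    ['my-model', '--rundir', 'runs/demo', '--label', 'acl',
     '--series', 'sagittal', '--full']
)
```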

In [1]:
import os

In [54]:
data_path = '/domino/datasets/local/mrnet/MRNet-v1.0'
image_paths = []
for split in ['train', 'valid']:
    split_folder = os.path.join(data_path, split)
    for series in os.listdir(split_folder):
        if series == '.DS_Store':
            continue
        series_folder = os.path.join(split_folder, series)
        
        for filename in os.listdir(series_folder):
            if filename == '.DS_Store':
                continue
            image_paths.append(os.path.join(series_folder, filename))
            
with open('../mrnet-image-paths.csv', 'w') as fout:
    fout.write('\n'.join(image_paths))

In [55]:
!head ../mrnet-image-paths.csv

/domino/datasets/local/mrnet/MRNet-v1.0/train/coronal/0173.npy
/domino/datasets/local/mrnet/MRNet-v1.0/train/coronal/0335.npy
/domino/datasets/local/mrnet/MRNet-v1.0/train/coronal/1064.npy
/domino/datasets/local/mrnet/MRNet-v1.0/train/coronal/0254.npy
/domino/datasets/local/mrnet/MRNet-v1.0/train/coronal/0092.npy
/domino/datasets/local/mrnet/MRNet-v1.0/train/coronal/0416.npy
/domino/datasets/local/mrnet/MRNet-v1.0/train/coronal/0731.npy
/domino/datasets/local/mrnet/MRNet-v1.0/train/coronal/0650.npy
/domino/datasets/local/mrnet/MRNet-v1.0/train/coronal/0812.npy
/domino/datasets/local/mrnet/MRNet-v1.0/train/coronal/0533.npy
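
The listing above comes back in whatever order `os.listdir` returns. A minimal variant that sorts the entries and keeps only `.npy` files (instead of skipping `.DS_Store` by name) might look like this. `list_image_paths` is a hypothetical helper, and the temporary directory stands in for the real dataset.

```python
import os
import tempfile

def list_image_paths(data_path, splits=('train', 'valid')):
    """Return sorted .npy paths under data_path/<split>/<series>/."""
    image_paths = []
    for split in splits:
        split_folder = os.path.join(data_path, split)
        for series in sorted(os.listdir(split_folder)):
            series_folder = os.path.join(split_folder, series)
            if not os.path.isdir(series_folder):
                continue  # skips stray files such as .DS_Store
            for filename in sorted(os.listdir(series_folder)):
                if filename.endswith('.npy'):
                    image_paths.append(os.path.join(series_folder, filename))
    return image_paths

# Build a tiny stand-in tree to exercise the helper
with tempfile.TemporaryDirectory() as root:
    for split, case in [('train', '0001'), ('train', '0000'), ('valid', '1130')]:
        folder = os.path.join(root, split, 'axial')
        os.makedirs(folder, exist_ok=True)
        open(os.path.join(folder, case + '.npy'), 'w').close()
        open(os.path.join(folder, '.DS_Store'), 'w').close()
    demo_paths = [os.path.relpath(p, root) for p in list_image_paths(root)]
```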


In [56]:
import pandas as pd
import numpy as np

In [57]:
def load_labels(split, diagnosis):
    df = pd.read_csv(
        os.path.join(data_path, '{}-{}.csv'.format(split, diagnosis)),
        header=None,
        names=['case', diagnosis],
        dtype={'case': str, diagnosis: int}  # str keeps leading zeros in case ids
    )
    df['split'] = split if split == 'train' else 'test'
    
    print(df.groupby(diagnosis).count())
    print(df.head())
    return df

In [58]:
train_abnormal = load_labels('train', 'abnormal')

          case  split
abnormal             
0          217    217
1          913    913
   case  abnormal  split
0  0000         1  train
1  0001         1  train
2  0002         1  train
3  0003         1  train
4  0004         1  train


In [59]:
train_acl = load_labels('train', 'acl')

     case  split
acl             
0     922    922
1     208    208
   case  acl  split
0  0000    0  train
1  0001    1  train
2  0002    0  train
3  0003    0  train
4  0004    0  train


In [60]:
train_meniscus = load_labels('train', 'meniscus')

          case  split
meniscus             
0          733    733
1          397    397
   case  meniscus  split
0  0000         0  train
1  0001         1  train
2  0002         0  train
3  0003         1  train
4  0004         0  train


In [61]:
# Our test set will be what they are calling "valid" until we are ready to submit
test_abnormal = load_labels('valid', 'abnormal')

          case  split
abnormal             
0           25     25
1           95     95
   case  abnormal split
0  1130         0  test
1  1131         0  test
2  1132         0  test
3  1133         0  test
4  1134         0  test


In [62]:
test_acl = load_labels('valid', 'acl')

     case  split
acl             
0      66     66
1      54     54
   case  acl split
0  1130    0  test
1  1131    0  test
2  1132    0  test
3  1133    0  test
4  1134    0  test


In [63]:
test_meniscus = load_labels('valid', 'meniscus')

          case  split
meniscus             
0           68     68
1           52     52
   case  meniscus split
0  1130         0  test
1  1131         0  test
2  1132         0  test
3  1133         0  test
4  1134         0  test


In [64]:
# Now we want to combine all of those data sets
# import functools
dfs = [
    pd.concat([train_acl, test_acl], ignore_index=True), 
    pd.concat([train_meniscus, test_meniscus], ignore_index=True)
]
# df = functools.reduce(lambda a, b: pd.merge(a, b, on=['case', 'split'], suffixes=(False, False), how='outer'), dfs)
# df.tail()
df = pd.concat([train_abnormal, test_abnormal], ignore_index=True)
df.set_index('case', inplace=True)
for _df in dfs:
    assert(np.all(df.loc[_df.case, 'split'].values == _df.split.values))
    
    df = df.join(_df.set_index('case').drop('split', axis='columns'), how='outer')
df = df[['abnormal', 'acl', 'meniscus', 'split']]

df.index = df.index.map(lambda c: c + '.npy')
df.head()

case,abnormal,acl,meniscus,split
0000.npy,1,0,0,train
0001.npy,1,1,1,train
0002.npy,1,0,0,train
0003.npy,1,0,1,train
0004.npy,1,0,0,train
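
The commented-out `functools.reduce` idea can also work. Here is a minimal sketch on toy frames, assuming each per-diagnosis frame has unique `case` values. Merging on both `case` and `split` leaves no overlapping columns, so no suffixes are needed, and `validate='one_to_one'` makes pandas raise if a case appears twice. The toy data and the helper name `merge_labels` are made up for illustration.

```python
import functools
import pandas as pd

# Toy stand-ins for the per-diagnosis frames (made-up values)
abnormal = pd.DataFrame({'case': ['0000', '0001', '1130'],
                         'abnormal': [1, 1, 0],
                         'split': ['train', 'train', 'test']})
acl = pd.DataFrame({'case': ['0000', '0001', '1130'],
                    'acl': [0, 1, 0],
                    'split': ['train', 'train', 'test']})
meniscus = pd.DataFrame({'case': ['0000', '0001', '1130'],
                         'meniscus': [0, 1, 0],
                         'split': ['train', 'train', 'test']})

def merge_labels(frames):
    """Merge label frames on (case, split), failing on duplicated cases."""
    return functools.reduce(
        lambda a, b: pd.merge(a, b, on=['case', 'split'], how='outer',
                              validate='one_to_one'),
        frames,
    )

merged = merge_labels([abnormal, acl, meniscus])
merged = merged[['case', 'abnormal', 'acl', 'meniscus', 'split']]
```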


In [65]:
df.to_csv('../mrnet-labels.csv', index=True)

In [66]:
!head ../mrnet-labels.csv

case,abnormal,acl,meniscus,split
0000.npy,1,0,0,train
0001.npy,1,1,1,train
0002.npy,1,0,0,train
0003.npy,1,0,1,train
0004.npy,1,0,0,train
0005.npy,1,0,1,train
0006.npy,1,0,0,train
0007.npy,1,0,0,train
0008.npy,1,0,0,train


Let's set aside part of the train set as the valid set.

In [6]:
import pandas as pd
df = pd.read_csv('../mrnet-labels.csv', index_col=0)
df.head()

case,abnormal,acl,meniscus,split
0000.npy,1,0,0,train
0001.npy,1,1,1,train
0002.npy,1,0,0,train
0003.npy,1,0,1,train
0004.npy,1,0,0,train


In [7]:
df.groupby('split').count()

split,abnormal,acl,meniscus
test,120,120,120
train,1130,1130,1130


In [8]:
df.groupby('split').sum()

split,abnormal,acl,meniscus
test,95,54,52
train,913,208,397


In [9]:
df.groupby('split').mean()

split,abnormal,acl,meniscus
test,0.791667,0.45,0.433333
train,0.807965,0.184071,0.351327


In [10]:
sample = df[df.split == 'train'].sample(120)
df.loc[sample.index, 'split'] = 'valid'
df.groupby('split').sum()

split,abnormal,acl,meniscus
test,95,54,52
train,817,193,357
valid,96,15,40


In [11]:
df.groupby('split').mean()

split,abnormal,acl,meniscus
test,0.791667,0.45,0.433333
train,0.808911,0.191089,0.353465
valid,0.8,0.125,0.333333


In [12]:
df.groupby('split').count()

split,abnormal,acl,meniscus
test,120,120,120
train,1010,1010,1010
valid,120,120,120


In [13]:
df.to_csv('../mrnet-labels-3way.csv', index=True)
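
Because `sample(120)` draws a different subset on every run, the counts above will not reproduce. A small sketch, on a toy frame, of pinning the draw with `random_state`. The seed value and the helper name `assign_valid` are arbitrary assumptions.

```python
import pandas as pd

def assign_valid(df, n_valid, seed=0):
    """Return a copy of df with n_valid random train rows relabeled 'valid'."""
    out = df.copy()
    sample = out[out.split == 'train'].sample(n_valid, random_state=seed)
    out.loc[sample.index, 'split'] = 'valid'
    return out

# Toy frame with made-up labels
toy = pd.DataFrame(
    {'abnormal': [1, 0, 1, 1, 0, 1],
     'split': ['train'] * 4 + ['test'] * 2},
    index=['0000.npy', '0001.npy', '0002.npy', '0003.npy',
           '1130.npy', '1131.npy'],
)
first = assign_valid(toy, n_valid=2)
second = assign_valid(toy, n_valid=2)  # same seed, so the same rows are picked
```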

The validation split that I made is not ideal: it has only 15 positive ACL labels (test has 54) and only 24 negative abnormal labels. I'll try to fix that below.

In [36]:
import pandas as pd
df = pd.read_csv('../mrnet-labels.csv', index_col=0)
df.head()

case,abnormal,acl,meniscus,split
0000.npy,1,0,0,train
0001.npy,1,1,1,train
0002.npy,1,0,0,train
0003.npy,1,0,1,train
0004.npy,1,0,0,train


In [23]:
df.groupby('split').describe()

,abnormal,abnormal,abnormal,abnormal,abnormal,abnormal,abnormal,abnormal,acl,acl,acl,acl,acl,meniscus,meniscus,meniscus,meniscus,meniscus,meniscus,meniscus,meniscus
split,count,mean,std,min,25%,50%,75%,max,count,mean,...,75%,max,count,mean,std,min,25%,50%,75%,max
test,120.0,0.791667,0.407819,0.0,1.0,1.0,1.0,1.0,120.0,0.45,...,1.0,1.0,120.0,0.433333,0.497613,0.0,0.0,0.0,1.0,1.0
train,1130.0,0.807965,0.394075,0.0,1.0,1.0,1.0,1.0,1130.0,0.184071,...,0.0,1.0,1130.0,0.351327,0.477596,0.0,0.0,0.0,1.0,1.0


In [3]:
df.groupby('split').count()

split,abnormal,acl,meniscus
test,120,120,120
train,1130,1130,1130


In [4]:
df.groupby('split').sum()

split,abnormal,acl,meniscus
test,95,54,52
train,913,208,397


In [46]:
df = pd.read_csv('../mrnet-labels.csv', index_col=0)
sample = pd.concat([
    df[(df.split == 'train') & (df.abnormal == 0)].sample(35),
    df[(df.split == 'train') & (df.acl == 1)].sample(40),
    df[(df.split == 'train') & (df.acl == 0)].sample(40),
    df[(df.split == 'train') & (df.meniscus == 1)].sample(7),
])
print(len(set(sample.index)))
assert(len(set(sample.index)) == 120)

df.loc[sample.index, 'split'] = 'valid'
df.groupby('split').sum()

120


split,abnormal,acl,meniscus
test,95,54,52
train,835,167,353
valid,78,41,44


Good! It looks like valid now has a distribution similar to test. Let's make sure.

In [47]:
splits = ['train', 'valid', 'test']
# For each diagnosis, print three rows: positive, negative, and total case counts.
# The columns are the counts for train, valid, and test, in that order.
for d in ['abnormal', 'acl', 'meniscus']:
    pos_count = [((df.split == split) & (df[d] == 1)).sum() for split in splits]
    print('{}{}\t{}\t{}\t{}'.format(d, '     ' if d == 'acl' else '', *pos_count))
    neg_count = [((df.split == split) & (df[d] == 0)).sum() for split in splits]
    print('{}{}\t{}\t{}\t{}'.format(d, '     ' if d == 'acl' else '', *neg_count))
    all_count = [((df.split == split)).sum() for split in splits]
    print('{}{}\t{}\t{}\t{}'.format(d, '     ' if d == 'acl' else '', *all_count))

abnormal	835	78	95
abnormal	175	42	25
abnormal	1010	120	120
acl     	167	41	54
acl     	843	79	66
acl     	1010	120	120
meniscus	353	44	52
meniscus	657	76	68
meniscus	1010	120	120
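
The same check can be done in one table. A minimal sketch, on a toy frame with made-up labels, that reports positive count, total count, and positive rate per split. The helper name `split_summary` is an assumption.

```python
import pandas as pd

def split_summary(df, labels=('abnormal', 'acl', 'meniscus')):
    """Positive count, total count, and positive rate per split and label."""
    return df.groupby('split')[list(labels)].agg(['sum', 'count', 'mean'])

# Toy frame with made-up labels
toy = pd.DataFrame({
    'abnormal': [1, 1, 0, 1, 0, 1],
    'acl':      [0, 1, 0, 1, 0, 0],
    'meniscus': [0, 1, 1, 0, 0, 1],
    'split':    ['train', 'train', 'train', 'valid', 'valid', 'test'],
})
summary = split_summary(toy)
print(summary)
```

Since the labels are 0/1, `sum` is the number of positives and `mean` is the positive rate, so the splits can be compared row by row.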


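That was an odd sampling strategy, but the classes are more balanced now. Note that the four draws total 122 rows, so the assertion on 120 unique cases only passes when exactly two of those draws are repeats; the cell may need re-running until it does.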
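
One way to make this kind of targeted sampling less fragile is to draw each group only from rows that have not been picked yet, so the total is always the sum of the requested sizes. This is a sketch of that alternative on a toy frame, not the method used above. The helper name `sample_disjoint`, the seed, and the toy labels are assumptions.

```python
import pandas as pd

def sample_disjoint(df, requests, seed=0):
    """Draw n rows per (column, value, n) request from train, never reusing a row."""
    pool = df[df.split == 'train']
    chosen = []
    for i, (column, value, n) in enumerate(requests):
        picked = pool[pool[column] == value].sample(n, random_state=seed + i)
        chosen.append(picked)
        pool = pool.drop(picked.index)  # remove picked rows from later draws
    return pd.concat(chosen)

# Toy frame: 20 train cases with made-up labels
toy = pd.DataFrame({
    'abnormal': [0, 1] * 10,
    'acl': [0, 0, 1, 1] * 5,
    'split': 'train',
}, index=['{:04d}.npy'.format(i) for i in range(20)])

sample = sample_disjoint(toy, [('abnormal', 0, 3), ('acl', 1, 4), ('acl', 0, 2)])
toy.loc[sample.index, 'split'] = 'valid'
```

Later draws see a pool that earlier draws have already thinned, so the final label mix still needs to be checked afterwards.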

In [48]:
len(set(df.index))

1250

In [49]:
len(df)

1250

In [None]:
df.to_csv('../mrnet-labels-3way.csv', index=True)