# Gemicai tutorial 3: Gemicai Dataset
This tutorial is about everything concerning Gemicai datasets, or gemsets for short.

In [1]:
import gemicai as gem
import torch
import torchvision.models as models

<p style="color:green"> @Mateusz There is some simple functionality I think gemsets should have in the beta release. Could you implement it? </p>

## 3.1 Initialize gemset
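The cell below inspects the DICOM source directory, loads the gemset stored at `gemset_destination`, and summarizes its `SeriesDescription` label. The conversion call itself is commented out here.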

In [5]:
dicom_data = '/mnt/SharedStor/tutorials/Mammography/'

# Let's find out how many dicom images we have.
gem.utils.dir_info(dicom_data)

gemset_destination = '/mnt/SharedStor/tutorials/tutorial3'

# Specify the relevant Dicom attributes you want to use.
dicom_attributes = ['Modality', 'BodyPartExamined', 'StudyDescription', 'SeriesDescription']


# gem.dicom_to_gemset(data_origin=dicom_data, data_destination=gemset_destination, relevant_labels=dicom_attributes, 
#                     objects_per_file=100, test_split=0.2, verbosity=1)

gem.utils.dir_info(gemset_destination)

gemset = gem.DicomoDataset.get_dicomo_dataset(gemset_destination, labels=dicom_attributes)

gemset.summarize('SeriesDescription')

| Extension   |   Files | Size    |
|-------------+---------+---------|
| .dcm.gz     |    1061 | 5.75 GB | 

| Extension   |   Files | Size    |
|-------------+---------+---------|
| .gemset     |      11 | 31.6 MB | 

| Class (SeriesDescription)          |   Frequency |
|------------------------------------+-------------|
| R MLO                              |         227 |
| R CC                               |         235 |
| L CC                               |         219 |
| None                               |          57 |
| L LM                               |          12 |
| L MLO                              |         218 |
| L XCCL                             |          16 |
| L SPECIMEN                         |          12 |
| Mammografie SVOB beiderzijds       |           4 |
| R SPECIMEN                         |          12 |
| Mammopunctie stereotactisch rechts |           4 |
| R LM                               |          11 |
| R XCCL                             |

## 3.2 Tweaking gemset
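To get the most out of your neural network, data selection is very important. `subset` restricts a gemset to the records that match a set of constraints, where each constraint is either a single value or a list of accepted values.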

In [4]:

# These are all the orientations we are interested in classifying.
orientations = ['L MLO', 'R MLO', 'L CC', 'R CC', 'L LM', 'R LM', 'L XCCL', 'R XCCL', 'L SPECIMEN', 'R SPECIMEN']

constraints = {
    'Modality': 'MG',
    'SeriesDescription': orientations,
}

subset = gemset.subset(constraints)

subset.summarize('SeriesDescription')


| Class (SeriesDescription)   |   Frequency |
|-----------------------------+-------------|
| R MLO                       |         227 |
| R CC                        |         235 |
| L CC                        |         219 |
| L LM                        |          12 |
| L MLO                       |         218 |
| L XCCL                      |          16 |
| L SPECIMEN                  |          12 |
| R SPECIMEN                  |          12 |
| R LM                        |          11 |
| R XCCL                      |          18 |

Total number of training images: 980 
Total number of classes: 10
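The filtering idea behind constraints can be sketched in plain Python. This is not Gemicai's implementation: the `records` list and the `matches` helper are hypothetical stand-ins, and the sketch assumes a constraint value is either a single value or a list of accepted values, as in the cell above.

```python
# Hypothetical stand-in for gemset records: one dict of DICOM attributes per image.
records = [
    {'Modality': 'MG', 'SeriesDescription': 'L CC'},
    {'Modality': 'MG', 'SeriesDescription': 'R MLO'},
    {'Modality': 'MG', 'SeriesDescription': None},
    {'Modality': 'US', 'SeriesDescription': 'L CC'},
]


def matches(record, constraints):
    """Return True if the record satisfies every constraint.

    A constraint value may be a single value or a collection of accepted values.
    """
    for attribute, accepted in constraints.items():
        if not isinstance(accepted, (list, tuple, set)):
            accepted = [accepted]
        if record.get(attribute) not in accepted:
            return False
    return True


constraints = {'Modality': 'MG', 'SeriesDescription': ['L CC', 'R MLO']}
selected = [r for r in records if matches(r, constraints)]
print(len(selected))
```

Only records that pass every constraint are kept, so adding a constraint can never enlarge the selection.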



### GemicaiDataset.save()
<p style="color:green"> Right now .subset doesn't make a new gemset, it just applies constraints to an existing gemset. This needs to be explained. Also, if the subset is small compared to the original gemset, its performance will be significantly worse. Therefore we might sometimes want to create a subset from a gemset and store it elsewhere. GemicaiDataset should have a function self.save(dir) </p>

In [9]:
def save(self, dir):
    #TODO: implement function
    pass

subset.save('example')
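A minimal sketch of what materialising a subset could look like, assuming records are plain picklable Python objects. The helpers `save_records` and `load_records` and the `.pkl.gz` file layout are hypothetical and say nothing about the real gemset file format.

```python
import gzip
import os
import pickle
import tempfile


def save_records(records, directory, objects_per_file=100):
    """Write records to gzip-compressed pickle files of at most objects_per_file items each."""
    os.makedirs(directory, exist_ok=True)
    paths = []
    for start in range(0, len(records), objects_per_file):
        path = os.path.join(directory, '{:03d}.pkl.gz'.format(start // objects_per_file))
        with gzip.open(path, 'wb') as f:
            pickle.dump(records[start:start + objects_per_file], f)
        paths.append(path)
    return paths


def load_records(directory):
    """Read back every file written by save_records, in file-name order."""
    records = []
    for name in sorted(os.listdir(directory)):
        with gzip.open(os.path.join(directory, name), 'rb') as f:
            records.extend(pickle.load(f))
    return records


# Hypothetical records; only the ones passing the constraint are written to disk.
records = [{'SeriesDescription': 'L CC' if i % 2 else 'None'} for i in range(250)]
selected = [r for r in records if r['SeriesDescription'] == 'L CC']

with tempfile.TemporaryDirectory() as tmp:
    paths = save_records(selected, os.path.join(tmp, 'example'))
    restored = load_records(os.path.join(tmp, 'example'))
```

Once the selection is stored on its own, reading it back no longer has to scan and filter the records that were excluded.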

### GemicaiDataset.erase()
<p style="color:green"> We might also want to delete obsolete gemsets </p>

In [7]:
def erase(self):
    #TODO: implement function
    pass

gemset.erase()
gemset = subset

### GemicaiDataset.split()
<p style="color:green"> We almost always need to split the dataset into a train and a test dataset, so why not make a function for this? </p>

In [11]:
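import math
import os

def split(self, ratio, sets=('train', 'test'), self_erase_afterwards=False):
    # A single number is interpreted as the size of the last set, e.g. 0.2 -> [0.8, 0.2].
    if isinstance(ratio, (int, float)):
        ratio = [1 - ratio, ratio]
    
    assert len(ratio) == len(sets), 'Specify a ratio for every set you want to create'
    # Compare with a tolerance, because floating-point ratios may not sum to exactly 1.
    assert math.isclose(sum(ratio), 1), 'The sum of all ratios should be 1'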
    
    for i, s in enumerate(sets):
        subset = None
        # TODO: create random subset
        subset.save(os.path.join(self.base_path, s))
    
    if self_erase_afterwards:
        self.erase()

This splits the gemset into a training set with 80% of the original gemset's size and a test set with the remaining 20%.


In [None]:
gemset.split(0.2)

<p style="color:green"> This creates a train, validation and test set. This is quite a standard thing people in machine learning want to do. </p>



In [None]:
gemset.split(ratio=[0.7, 0.15, 0.15], sets=['train', 'validation', 'test'])
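The "create random subset" step left as a TODO in `split` could be approached by shuffling record indices and cutting them by ratio. The sketch below is one possible approach under that assumption, not the Gemicai implementation; `split_indices` and its `seed` parameter are hypothetical names.

```python
import math
import random


def split_indices(n_items, ratio, seed=0):
    """Shuffle range(n_items) and cut it into consecutive parts sized by ratio."""
    # A single number is treated as the share of the last part.
    if isinstance(ratio, (int, float)):
        ratio = [1 - ratio, ratio]
    if not math.isclose(sum(ratio), 1.0):
        raise ValueError('The sum of all ratios should be 1')

    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)

    parts, start = [], 0
    for i, r in enumerate(ratio):
        # The last part takes the remainder so that no item is lost to rounding.
        end = n_items if i == len(ratio) - 1 else start + round(n_items * r)
        parts.append(indices[start:end])
        start = end
    return parts


train, test = split_indices(980, 0.2)
print(len(train), len(test))

train, validation, test = split_indices(980, [0.7, 0.15, 0.15])
print(len(train), len(validation), len(test))
```

Splitting indices rather than records keeps the sketch independent of how a gemset stores its data. A fixed seed makes the split reproducible.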

### GemicaiDataset.__getitem__()
<p style="color:green"> When passing a gemset with more than 1 label to a classifier, it crashes (I put in the exception handling but otherwise it still crashes). I figured that it might be nice to use __getitem__ to fix this, what do you think? </p>

In [11]:

def __getitem__(self, arg):
    if isinstance(arg, int):
        arg = self.labels[arg]
    if arg not in self.labels:
        raise ValueError('Specified argument not in gemset labels. Valid labels are: {}'.format(self.labels))
    return type(self)(path=self.path, labels=[arg], transform=self.transform, constraints=self.constraints)
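To try the indexing behaviour in isolation, here is a toy class that mirrors the method above. `LabelledDataset` is a hypothetical stand-in that only tracks a path and a list of labels; it is not the Gemicai dataset class.

```python
class LabelledDataset:
    """Hypothetical minimal dataset that only tracks which labels it exposes."""

    def __init__(self, path, labels):
        self.path = path
        self.labels = list(labels)

    def __getitem__(self, arg):
        # Accept either a label position or a label name.
        if isinstance(arg, int):
            arg = self.labels[arg]
        if arg not in self.labels:
            raise ValueError('Valid labels are: {}'.format(self.labels))
        # Return a new view of the same data restricted to a single label.
        return type(self)(path=self.path, labels=[arg])


dataset = LabelledDataset(
    'example',
    ['Modality', 'BodyPartExamined', 'StudyDescription', 'SeriesDescription'],
)
by_name = dataset['SeriesDescription']
by_index = dataset[3]
print(by_name.labels, by_index.labels)
```

Returning a new object instead of mutating `self` leaves the original multi-label dataset usable after a single label has been selected.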


In [6]:

# Initialize classifier
net = gem.Classifier(models.resnet18(pretrained=True), gemset.classes('SeriesDescription'), enable_cuda=True)

net.train(gemset, epochs=5, verbosity=1)


ValueError: Specify what label should be classified. This dataset containts the labels ['Modality', 'BodyPartExamined', 'StudyDescription', 'SeriesDescription']. E.g. tryagain dataset['Modality'] or dataset[0]

In [12]:


# If implemented correctly this should work
net.train(gemset['SeriesDescription'], epochs=5, verbosity=1)

# This does the same
net.train(gemset[3], epochs=5, verbosity=1)


NotImplementedError: 

### Correct dataset
<p style="color:green"> Right now this function is in gemicai/data_inpection.py and does nothing yet. Can you implement it and make it as user-friendly as possible? I presume mostly radiologists will be using this function. </p>

In [None]:
gem.correct_dataset(net, gemset)