# Gemicai tutorial 3: Gemicai Dataset
This tutorial is about Gemicai datasets, or gemsets for short.

In [23]:
import gemicai as gem
import torch
import torchvision.models as models
import os

## 3.1 Create gemset
To train models with the Gemicai library, we first have to create a dataset from DICOM files. Luckily, Gemicai provides functionality that handles most of the work for us.

In [17]:
# Path to a folder containing files in DICOM format
dicom_data = 'examples/dicom/CT'

# Path to a folder where the processed data sets will be stored
gemset_destination = 'examples/gemset/CT'

# Let's find out how many files we have to process
gem.utils.dir_info(dicom_data)

# Specify the relevant DICOM attributes we want to extract into the gemset
dicom_attributes = ['Modality', 'BodyPartExamined', 'StudyDescription', 'SeriesDescription']

# Time to create the Gemicai dataset, aka gemset
# For more information about the dicom_to_gemset options, please refer to the documentation
gem.dicom_to_gemset(dicom_data, gemset_destination, dicom_attributes, objects_per_file=25)

# Let's check whether we have actually output anything
gem.utils.dir_info(gemset_destination)

| Extension   |   Files | Size    |
|-------------+---------+---------|
| .dcm.gz     |      49 | 14.3 MB | 

| Extension   |   Files | Size   |
|-------------+---------+--------|
| .gemset     |       2 | 1.9 MB | 
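
The two `.gemset` files are consistent with `objects_per_file=25`: 49 objects at up to 25 per file need two files. The arithmetic can be sketched as follows, assuming objects are packed in order until each file is full. The helper names `files_needed` and `chunk_sizes` are hypothetical, and this is not Gemicai's actual implementation.

```python
import math


def files_needed(num_objects, objects_per_file):
    """Files required if each holds at most objects_per_file objects (assumed packing rule)."""
    return math.ceil(num_objects / objects_per_file)


def chunk_sizes(num_objects, objects_per_file):
    """Sizes of the consecutive chunks under the same assumption."""
    full, rest = divmod(num_objects, objects_per_file)
    return [objects_per_file] * full + ([rest] if rest else [])


print(files_needed(49, 25))  # 2
print(chunk_sizes(49, 25))   # [25, 24]
```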



Now all that's left is to create an iterator object, which is used to access the data in a controlled manner.

In [19]:
# Instantiate a gemset iterator. get_dicomo_dataset accepts either a path to a valid .gemset file or a
# folder containing an arbitrary number of gemsets. The 'labels' parameter specifies which fields, besides
# the tensor, are returned by a next(gemset) call. Any next call should be preceded by an init call,
# which initializes the iterator's internal state.
gemset = gem.DicomoDataset.get_dicomo_dataset(gemset_destination, labels=['StudyDescription'])

# Let's print all of the values that the label 'StudyDescription' has in the provided dataset
gemset.summarize('StudyDescription')



| Class (StudyDescription)   |   Frequency |
|----------------------------+-------------|
| CT urografie               |          49 |

Total number of training images: 49 
Total number of classes: 1



With this we have created an iterator object that can be safely passed to Gemicai's Classifier class for
model training or evaluation.

## 3.2 Tweaking gemset
To get the most out of your neural network, data selection is very important. That is why Gemicai's iterators
give you extensive options to pick, split, or even modify the dataset.

Let's assume that one of the images in the previously generated gemset has a wrong Modality assigned to it. Let's change it!

In [20]:
# All of the Modality classes should be set to 'CT'
gemset.summarize('Modality')

# Let's modify the third object in the dataset (indices start at 0) to have 'Modality' set to 'MR'
gemset.modify(2, {'Modality': 'MR'})

# Now there should be two possible values
gemset.summarize('Modality')

| Class (Modality)   |   Frequency |
|--------------------+-------------|
| CT                 |          49 |

Total number of training images: 49 
Total number of classes: 1

| Class (Modality)   |   Frequency |
|--------------------+-------------|
| CT                 |          48 |
| MR                 |           1 |

Total number of training images: 49 
Total number of classes: 2



As one of the images has a wrong label, we would like to omit it from our dataset.

In [21]:
# Create a subset without the 'MR' modality
gemset_new = gemset.subset({'Modality': 'CT'})

# gemset_new should contain only 'CT' modalities, let's check whether this is true
gemset_new.summarize('Modality')

| Class (Modality)   |   Frequency |
|--------------------+-------------|
| CT                 |          48 |

Total number of training images: 48 
Total number of classes: 1
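
Conceptually, a subset call keeps only the objects whose fields match the given constraints. Below is a minimal sketch of that idea on plain dictionaries. It is not Gemicai's implementation: the `matches` and `subset` helpers and the stand-in records are hypothetical, and the rule that newer constraints override older ones on conflict is an assumption.

```python
def matches(record, constraints):
    """True when every constrained field of the record has the required value."""
    return all(record.get(field) == value for field, value in constraints.items())


def subset(records, base_constraints, new_constraints):
    """Lazily yield records that satisfy both constraint sets.

    Assumption: constraints are merged, and new ones win on conflict.
    """
    merged = {**base_constraints, **new_constraints}
    return (r for r in records if matches(r, merged))


# Hypothetical stand-in data: 49 records, the third one mislabelled as 'MR'
records = [{'Modality': 'CT', 'StudyDescription': 'CT urografie'} for _ in range(49)]
records[2]['Modality'] = 'MR'

ct_only = list(subset(records, {}, {'Modality': 'CT'}))
print(len(ct_only))  # 48
print(len(records))  # 49, the original data is untouched
```

Because the filter is a generator over the original records, nothing is copied or written to disk until the result is explicitly materialized.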



It is important to remember that the subset method does not create a new .gemset. It simply returns a new iterator that uses the original data constraints (if any were passed) merged with the ones given in the subset call. To actually save the data, we use the save method.

In [31]:
# Location where we will save our new data set
location = 'temp/main'

# Let's create that folder
os.makedirs(location)

# Save the gemset to the specified folder
gemset_new.save(location)

# Let's check whether we have actually saved anything
gem.utils.dir_info(location)

| Extension   |   Files | Size   |
|-------------+---------+--------|
| .gemset     |       2 | 1.9 MB | 



Now let's imagine that we would like to split the resulting dataset further into training and evaluation datasets. This can be done using the split method.

In [32]:
# Evaluation dataset location
evaluation = "temp/eval"

# Train data set location
train = "temp/train" 

# Let's create those folders
os.makedirs(evaluation)
os.makedirs(train)

# Let's create an iterator for the dataset located in 'temp/main'
temp_set = gem.DicomoDataset.get_dicomo_dataset(location, labels=['StudyDescription'])

# Time to split it into training and evaluation sets using a ratio of 8:2, so the training set will contain ~80% of
# temp_set's data, whereas the evaluation set will get the remaining ~20%.
temp_set.split(sets={train: 0.8, evaluation: 0.2})

# Let's check whether our previous assumption is correct
train_dataset = gem.DicomoDataset.get_dicomo_dataset(train, labels=['StudyDescription'])
eval_dataset = gem.DicomoDataset.get_dicomo_dataset(evaluation, labels=['StudyDescription'])

# Summarize the training set first, then the evaluation set
train_dataset.summarize('Modality')
eval_dataset.summarize('Modality')


| Class (Modality)   |   Frequency |
|--------------------+-------------|
| CT                 |          38 |

Total number of training images: 38 
Total number of classes: 1

| Class (Modality)   |   Frequency |
|--------------------+-------------|
| CT                 |          10 |

Total number of training images: 10 
Total number of classes: 1
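
The counts above (38 and 10 out of 48) are consistent with an 8:2 ratio, since 48 × 0.8 = 38.4 cannot be realized exactly. Below is a minimal sketch of one rounding rule that reproduces these counts: every set except the last takes the truncated share, and the last set takes whatever remains. This rule and the `split_counts` helper are assumptions for illustration; Gemicai's split method may assign objects differently.

```python
def split_counts(total, ratios):
    """Turn fractional ratios into integer counts that sum to total.

    Assumed rule: each set but the last gets int(total * ratio);
    the last set receives the remainder.
    """
    names = list(ratios)
    counts = {}
    assigned = 0
    for name in names[:-1]:
        counts[name] = int(total * ratios[name])
        assigned += counts[name]
    counts[names[-1]] = total - assigned
    return counts


print(split_counts(48, {'train': 0.8, 'eval': 0.2}))  # {'train': 38, 'eval': 10}
```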



We are slowly accumulating too many datasets, so it's time to remove them using the erase method.

In [33]:
# Folder should contain a gemicai dataset
gem.utils.dir_info(location)

# Time to remove all of the datasets we have created
temp_set.erase()
eval_dataset.erase()
train_dataset.erase()

# It should be empty now
gem.utils.dir_info(location)

# Remove the empty folders
os.rmdir(location)
os.rmdir(evaluation)
os.rmdir(train)
os.rmdir('temp')

| Extension   |   Files | Size   |
|-------------+---------+--------|
| .gemset     |       2 | 1.9 MB | 

| Extension   | Files   | Size   |
|-------------+---------+--------| 
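
The order of operations in the cleanup cell matters because `os.rmdir` only removes empty directories, so the datasets have to be erased first. The standard-library sketch below illustrates this with a scratch directory and a placeholder file; the file name `data.gemset` is hypothetical and no Gemicai calls are involved.

```python
import os
import tempfile

# Scratch directory with one placeholder file standing in for a dataset
base = tempfile.mkdtemp()
folder = os.path.join(base, 'main')
os.makedirs(folder)
placeholder = os.path.join(folder, 'data.gemset')  # hypothetical file name
with open(placeholder, 'w') as f:
    f.write('placeholder')

# os.rmdir refuses to remove a directory that still contains files
try:
    os.rmdir(folder)
    removed_while_full = True
except OSError:
    removed_while_full = False

# After deleting the contents, the same calls succeed
os.remove(placeholder)
os.rmdir(folder)
os.rmdir(base)

print(removed_while_full, os.path.exists(base))  # False False
```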

