# Training a Custom Classifier

## Background

This tutorial shows developers how to adapt the dcm-classifier package for development with new DICOM data.

## Setup

If you have not already done so, clone the GitHub repository by running the following in a terminal.

In [None]:
!git clone git@github.com:BRAINSia/dcm-classifier.git

Next, install the developer requirements, which provide the packages needed for development.

In [None]:
!pip install -r ../requirements_dev.txt

In [1]:
# install the dcm-classifier package for tutorial purposes
!pip install dcm-classifier==0.6.0rc9



## Data Curation 

### Field Sheet Creation

The first step is to create a DICOM field sheet at the volume level. The code below generates a pandas DataFrame containing all DICOM tags from each file's header. The `generate_dicom_dataframe` function can be called directly from Python or from the command line. Here we generate only a small DataFrame to display the function's output.

In [2]:
import pandas as pd
from pathlib import Path

from create_dicom_fields_sheet import generate_dicom_dataframe

current_directory: Path = Path.cwd()
root_directory = current_directory.parent


# make the DICOM field sheet from the anonymized test data
dicom_field_sheet: pd.DataFrame = generate_dicom_dataframe(session_dirs=[root_directory.as_posix() + "/tests/testing_data/anonymized_testing_data/anonymized_data"],
                                                           output_file="",
                                                           save_to_excel=False) 
# show the field sheet shape
print(f"shape: {dicom_field_sheet.shape}")

# print only the first 4 rows of a few selected fields, because the full sheet is large
print(dicom_field_sheet.head(4)[["SeriesNumber", "Image Type", "RepetitionTime", "FlipAngle"]].to_markdown())

The directory: /home/mbrzus/programming/xyz/tests/testing_data/anonymized_testing_data/anonymized_data contains 16 DICOM sub-volumes
shape: (16, 112)
|    |   SeriesNumber | Image Type                                     |   RepetitionTime |   FlipAngle |
|---:|---------------:|:-----------------------------------------------|-----------------:|------------:|
|  0 |             11 | ['ORIGINAL', 'PRIMARY', 'M', 'NORM', 'DIS2D']  |          4000    |         150 |
|  1 |             13 | ['ORIGINAL', 'PRIMARY', 'M', 'NORM', 'DIS2D']  |           446    |         150 |
|  2 |              1 | ['ORIGINAL', 'PRIMARY', 'M', 'NORM', 'DIS2D']  |             4.52 |           8 |
|  3 |              4 | ['DERIVED', 'PRIMARY', 'MPR', 'NORM', 'DIS2D'] |             4.52 |           8 |


#### To extract the information efficiently, run the full script in a terminal. The full field sheet will be saved as an Excel file.

In [None]:
!python3 create_dicom_fields_sheet.py --dicom_path ../tests/testing_data/anonymized_testing_data/anonymized_data --out ./tutorial_field_sheet.xlsx

**Note:** The `create_dicom_fields_sheet.py` script can also be automated using the `run_all_dicom_data_sheets.sh` script. The paths in the shell script will need to be changed to match your data.

### Field Sheet Combination

If you are working with multiple field sheets from different datasets, the `combine_excel_spreadsheets.py` script combines them into a single field sheet.

In [5]:
import pandas as pd
from utilities import combine_all_excel_files

# create combined dataframe
all_column_names: pd.DataFrame = combine_all_excel_files([dicom_field_sheet])

# As we are only using 1 field sheet, this output will be the same as the last
print(all_column_names.shape)

# The file name of the first image in the first volume
print(dicom_field_sheet["FileName"][0])

# The image type for the first two volumes
print(dicom_field_sheet["ImageType"].head(2).to_markdown())

(16, 112)
/home/mbrzus/programming/xyz/tests/testing_data/anonymized_testing_data/anonymized_data/11/DICOM/1.3.12.2.1107.5.1.4.3024295249861856527476734919304407350-11-21-1svccb.dcm
|    | ImageType                                     |
|---:|:----------------------------------------------|
|  0 | ['ORIGINAL', 'PRIMARY', 'M', 'NORM', 'DIS2D'] |
|  1 | ['ORIGINAL', 'PRIMARY', 'M', 'NORM', 'DIS2D'] |
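
With several datasets, combining amounts to stacking the per-dataset field sheets row-wise. The sketch below uses plain pandas (`pd.concat`) on two small hypothetical sheets. It only illustrates the idea and is not the implementation of `combine_all_excel_files`; the column values are made up.

```python
import pandas as pd

# Two hypothetical field sheets from different datasets (values are illustrative)
sheet_a = pd.DataFrame({"SeriesNumber": [1, 2], "RepetitionTime": [4000.0, 446.0]})
sheet_b = pd.DataFrame({"SeriesNumber": [3], "FlipAngle": [8.0]})

# Stack the sheets row-wise; a column missing from one sheet is filled with NaN
combined = pd.concat([sheet_a, sheet_b], ignore_index=True, sort=False)

print(combined.shape)  # (3, 3)
```

Fields that exist in only one dataset show up as NaN in the other rows, so it is worth checking for sparse columns before choosing features.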


### Data Pruning

In a DICOM study, one imaging session may contain multiple files that represent similar data at different times during the scan. This is the case for diffusion data because of the abundance of echo times recorded at scanning. It can inflate the number of DICOM files that share the same values in many fields. The `remove_duplicate_rows.py` script removes these duplicate rows. Developers can also adjust the features used to identify duplicates.

In [8]:
from utilities import remove_rows_with_duplicate_values

# create slimmed dataframe
slimmed_data_frame: pd.DataFrame = remove_rows_with_duplicate_values(input_frame=all_column_names,
                                                                     save_to_excel=False)

# show the original dataframe size
print("Original size: ", all_column_names.shape)
print(all_column_names["SeriesNumber"].to_markdown())

# show the slimmed dataframe, which has 3 rows removed; one of them belongs to series number 5, which has 2 sub-volumes
print("\nReduced size: ", slimmed_data_frame.shape)
print(slimmed_data_frame["SeriesNumber"].to_markdown())

Original size:  (16, 112)
|    |   SeriesNumber |
|---:|---------------:|
|  0 |             11 |
|  1 |             13 |
|  2 |              1 |
|  3 |              4 |
|  4 |              7 |
|  5 |             15 |
|  6 |              5 |
|  7 |              5 |
|  8 |              6 |
|  9 |              9 |
| 10 |              3 |
| 11 |              2 |
| 12 |             10 |
| 13 |             12 |
| 14 |             14 |
| 15 |              8 |

Reduced size:  (13, 112)
|    |   SeriesNumber |
|---:|---------------:|
|  0 |             11 |
|  1 |             13 |
|  2 |              1 |
|  3 |              4 |
|  4 |              7 |
|  5 |             15 |
|  6 |              5 |
|  8 |              6 |
|  9 |              9 |
| 12 |             10 |
| 13 |             12 |
| 14 |             14 |
| 15 |              8 |
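
Conceptually, pruning keeps one row per unique combination of the chosen features. The following is a minimal sketch with `DataFrame.drop_duplicates`, assuming a hypothetical feature subset and made-up values; the columns that `remove_rows_with_duplicate_values` actually compares may differ.

```python
import pandas as pd

# Hypothetical field sheet: series 5 appears twice with identical acquisition fields
frame = pd.DataFrame(
    {
        "SeriesNumber": [11, 5, 5, 6],
        "EchoTime": [90.0, 2.5, 2.5, 2.5],
        "RepetitionTime": [4000.0, 4.52, 4.52, 8.0],
    }
)

# Assumed feature subset used to decide whether two rows are duplicates
features = ["SeriesNumber", "EchoTime", "RepetitionTime"]

# Keep the first occurrence; the original index is preserved, as in the output above
slimmed = frame.drop_duplicates(subset=features, keep="first")

print(slimmed.index.tolist())  # [0, 1, 3]
```

Narrowing `features` makes the pruning more aggressive, because rows then only need to agree on fewer fields to count as duplicates.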


### Feature Creation

Feature creation is an important step that lets developers choose the features used in the model. In the `create_training_sheet.py` script, the `parse_column_headers` function lets developers select the features they believe will be useful as model inputs.


#### Header Dictionary
A header dictionary is a spreadsheet listing the fields taken from the DICOM images or created by the `generate_dicom_dataframe` function. For each field, you select whether to keep it in or remove it from the training file by setting its action. The available actions are "drop", "keep", "one_hot_encoding_from_array", and "one_hot_encoding_from_str_col". Columns holding arrays of strings should be one-hot encoded, as should columns holding only string values. To edit the header dictionary, modify the `training_config.py` file.


In [9]:
from training_config import *

# show the first 10 rows of the header dataframe
print(header_df.head(10).to_markdown())

|    | header_name                    | action                        |
|---:|:-------------------------------|:------------------------------|
|  0 | EchoTime                       | drop                          |
|  1 | FlipAngle                      | drop                          |
|  2 | PixelBandwidth                 | drop                          |
|  3 | PixelSpacing                   | drop                          |
|  4 | Image Type                     | one_hot_encoding_from_array   |
|  5 | Manufacturer                   | one_hot_encoding_from_str_col |
|  6 | Diffusion b-value              | drop                          |
|  7 | Diffusion Gradient Orientation | drop                          |
|  8 | Diffusionb-value               | keep                          |
|  9 | Diffusionb-valueMax            | keep                          |
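
To make the two one-hot encoding actions concrete, here is a small pandas sketch on hypothetical data. It shows one way to encode an array-valued column and a string column; it is an illustration under these assumptions, not the code used by `parse_column_headers`.

```python
import pandas as pd

# Hypothetical rows: an array-valued column and a plain string column
frame = pd.DataFrame(
    {
        "Image Type": [["ORIGINAL", "PRIMARY"], ["DERIVED", "PRIMARY"]],
        "Manufacturer": ["SIEMENS", "GE"],
    }
)

# Array column: explode to one element per row, encode, then collapse back per original row
exploded = frame["Image Type"].explode()
array_dummies = (
    pd.get_dummies(exploded, prefix="Image Type", dtype=int).groupby(level=0).max()
)

# String column: one indicator column per distinct value
str_dummies = pd.get_dummies(frame["Manufacturer"], prefix="Manufacturer", dtype=int)

encoded = pd.concat([array_dummies, str_dummies], axis=1)
print(sorted(encoded.columns))
```

Each distinct string becomes its own 0/1 column, so the result has names such as `Image Type_ORIGINAL` and `Manufacturer_GE`.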


#### Running the Script

The `create_training_sheet.py` script requires the header dataframe. It parses the header dictionary and creates a training file containing the selected features from the dataframe.

In [12]:
from create_training_sheet import parse_column_headers
from training_config import *

# create the training file
training_data_frame: pd.DataFrame = parse_column_headers(header_dataframe=header_df, 
                                                         input_file=slimmed_data_frame,
                                                         save_to_excel=False)

# show training file
print(training_data_frame.shape)

# for the first row, show the columns at positions 1 through 4
print(training_data_frame.iloc[0:1, 1:5].to_markdown())

# for the first row, show the columns at positions 25 through 29
print(training_data_frame.iloc[0:1, 25:30].to_markdown())


KeyError: Diffusionb-valueMax
(13, 47)
|    |   Image Type_ORIGINAL |   Image Type_PRIMARY |   Image Type_M |   Image Type_NORM |
|---:|----------------------:|---------------------:|---------------:|------------------:|
|  0 |                     1 |                    1 |              1 |                 1 |
|    |   Pixel Bandwidth |   Repetition Time |      SAR |   Scanning Sequence_UnknownScanningSequence |   Sequence Variant_UnknownSequenceVariant |
|---:|------------------:|------------------:|---------:|--------------------------------------------:|------------------------------------------:|
|  0 |               190 |              4000 | 0.499263 |                                           1 |                                         1 |


## Labeling the Data

**! Developers are responsible for labeling their new data !**

Labeling can be done on the original data sheet or with another sheet. 

If labeling is done in another sheet, the files can be merged using the `merge_labels_and_training_data` method in the utilities file. The method will merge the label file and the training file on the *FileName* header. 
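
A merge on *FileName* can be sketched with plain pandas as follows. The file names, the `label` column, and the label values are hypothetical, and the join type used by `merge_labels_and_training_data` is an assumption here.

```python
import pandas as pd

# Hypothetical training sheet and separately maintained label sheet
training = pd.DataFrame(
    {"FileName": ["a.dcm", "b.dcm", "c.dcm"], "RepetitionTime": [4000.0, 446.0, 4.52]}
)
labels = pd.DataFrame({"FileName": ["a.dcm", "b.dcm"], "label": ["t2w", "t1w"]})

# Inner join: rows without a label are dropped; validate guards against duplicate file names
merged = training.merge(labels, on="FileName", how="inner", validate="one_to_one")

print(merged.shape)  # (2, 3)
```

With an inner join, any unlabeled volume silently disappears from the training data, so comparing the row counts before and after the merge is a cheap sanity check.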

## Training the Model

Once the data is prepared, the next step is to train the model using the `modality_classifier_training.py` script. The script provides functions for training the model, running inference, k-fold validation, and model tuning.
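
As an illustration of k-fold validation, the sketch below uses scikit-learn with a random forest on synthetic data. The estimator, its hyperparameters, and the data are all assumptions made for this example; `modality_classifier_training.py` may be organised differently.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

rng = np.random.default_rng(0)

# Synthetic stand-in for the training sheet: 100 rows, 6 numeric features
X = rng.normal(size=(100, 6))
# Synthetic binary label that depends on the first feature
y = (X[:, 0] > 0).astype(int)

# Assumed model and fold configuration, chosen only for illustration
model = RandomForestClassifier(n_estimators=50, random_state=0)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# One accuracy score per fold
scores = cross_val_score(model, X, y, cv=cv)
print(f"mean accuracy over {len(scores)} folds: {scores.mean():.2f}")
```

Stratified folds keep the class proportions similar in every split, which matters when some classes are rare in the labeled data.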