![brain_baner](http://www.mf-data-science.fr/images/projects/brain_baner.jpg)

<h1 style="color:#0b0a2d; font-size:24px; text-transform: uppercase; font-weight:bold">Context</h1>

The goal of this competition, initiated by the **Radiological Society of North America *(RSNA)*** in partnership with the **Medical Image Computing and Computer Assisted Intervention Society *(the MICCAI Society)***, is to predict the methylation status of the **MGMT promoter**, an important genetic biomarker for the treatment of brain tumors.

These predictions will be based on a database of **MRI *(magnetic resonance imaging)*** scans of several hundred patients.

<h1 style="color:#0b0a2d; font-size:24px; text-transform: uppercase; font-weight:bold">Data</h1>

Each independent case has a dedicated folder identified by a five-digit number. Each of these “case” folders contains four sub-folders, one for each of the structural multi-parametric MRI (mpMRI) scans, in DICOM format. The exact mpMRI scans included are:

- Fluid Attenuated Inversion Recovery (FLAIR)
- T1-weighted pre-contrast (T1w)
- T1-weighted post-contrast (T1Gd)
- T2-weighted (T2)

| ![brain_baner](http://www.mf-data-science.fr/images/projects/brain_tumor_types.png) | 
|:--:| 
| *Examples of the four MR sequence types included in this work* |

<h1 style="color:#0b0a2d; font-size:24px; text-transform: uppercase; font-weight:bold">Acknowledgement</h1>

This notebook is adapted from the work of *Ammar Alhaj Ali*:
- [🧠Brain Tumor 3D [Training]](https://www.kaggle.com/ammarnassanalhajali/brain-tumor-3d-training)
- [🧠Brain Tumor 3D [Inference]](https://www.kaggle.com/ammarnassanalhajali/brain-tumor-3d-inference)

In the first part, we add **a complete image preprocessing step** (crop, equalization, denoising, ...) to these notebooks in order to try to improve the results and **reduce the size of the images**.

# <span style="color:#0b0a2d; font-size:24px; text-transform: uppercase; font-weight:bold" id="section_1">Exploratory data analysis (EDA)</span>

First, we load the useful Python libraries:

In [None]:
import os
import sys 
import json
import glob
import random
import collections
import time
import re
import math
import numpy as np
import pandas as pd
import cv2

import matplotlib.pyplot as plt
from matplotlib import animation, rc
import seaborn as sns
import pydicom
from pydicom.pixel_data_handlers.util import apply_voi_lut

from random import shuffle
from sklearn import model_selection as sk_model_selection

from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.callbacks import ModelCheckpoint, EarlyStopping, ReduceLROnPlateau
from tensorflow.keras.metrics import AUC

import tensorflow as tf

In [None]:
# Global params for animations
rc('animation', html='jshtml')

and configure the data path:

In [None]:
data_directory = '../input/rsna-miccai-brain-tumor-radiogenomic-classification'
pytorch3dpath = "../input/efficientnetpyttorch3d/EfficientNet-PyTorch-3D"

## <span style="color:#3c99dc; font-size:18px; text-transform: uppercase; font-weight:bold" id="section_1_1">Loading data</span>

Let's load the **dataframes with the training data and the testing data**. We add a column containing the patient record ID as a zero-padded, 5-character string, which matches the names of the case folders.

In [None]:
train_df = pd.read_csv(data_directory+"/train_labels.csv")
train_df['BraTS21ID5'] = [format(x, '05d') for x in train_df.BraTS21ID]
train_df.head(3)

In [None]:
test = pd.read_csv(
    data_directory+'/sample_submission.csv')

test['BraTS21ID5'] = [format(x, '05d') for x in test.BraTS21ID]
test.head(3)
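
As a quick illustration of the zero-padding used for `BraTS21ID5`, the sketch below converts a few integer IDs into five-character folder names. The IDs in `sample_ids` are made up for the example, and `to_folder_id` is a hypothetical helper that is not part of the notebook.

```python
def to_folder_id(brats_id: int) -> str:
    """Return the ID as a zero-padded, 5-character string (e.g. 46 -> '00046')."""
    return format(brats_id, "05d")

# Made-up IDs, for illustration only
sample_ids = [0, 46, 176, 1010]
folder_ids = [to_folder_id(i) for i in sample_ids]
print(folder_ids)

# str.zfill gives the same result for non-negative integers
same_as_zfill = folder_ids == [str(i).zfill(5) for i in sample_ids]
print(same_as_zfill)
```

Both forms appear in this notebook: `format(x, '05d')` when building the dataframe column and `str(...).zfill(5)` in the visualization helper.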

All of the subjects in this dataset appear to have a brain tumor. `MGMT_value = 0` refers to patients without MGMT promoter methylation, and `MGMT_value = 1` to patients with it.

For the competition, the submission file must give **the predicted probability of methylation in the `MGMT_value`** column. `BraTS21ID` is the patient's identifier. The train file has the same structure as the submission file, except that in the train file **`MGMT_value` is a binary label (0 or 1) and not a probability**.

Let's look at the distribution of the values of this variable in the train set :

In [None]:
sns.set_style("whitegrid")
fig = plt.figure(figsize=(8,6))
# Countplot with Seaborn
ax = sns.countplot(data=train_df,
                   x="MGMT_value")
# Annotating bars
for p in ax.patches:
    ax.annotate(
        format(p.get_height(), '.0f'), 
               (p.get_x() + p.get_width() / 2., p.get_height()),
        ha = 'center', va = 'center', 
        xytext = (0, 10), 
        textcoords = 'offset points')

sns.despine(left=True, bottom=True)
plt.title("MGMT value distribution in train labels\n",
          fontsize=18, color="#0b0a2d")
plt.show()
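
To complement the plot, the class balance can also be expressed as proportions. This is a minimal sketch on a made-up list of labels (not the competition data), using only the standard library; with the real dataframe, `train_df["MGMT_value"].value_counts(normalize=True)` should give the equivalent result.

```python
from collections import Counter

# Made-up binary labels standing in for the MGMT_value column
labels = [1, 0, 1, 1, 0, 1, 0, 1]

counts = Counter(labels)
# Share of each class in the sample, ordered by class value
proportions = {cls: n / len(labels) for cls, n in sorted(counts.items())}
print(proportions)
```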

## <span style="color:#3c99dc; font-size:18px; text-transform: uppercase; font-weight:bold" id="section_1_2">MRI data</span>

Train data contains one folder per patient. For each patient, four sub-folders are available *(FLAIR, T1w, T1wCE and T2w)*, each containing the image sequence of the corresponding MRI type.

![data_structure](http://www.mf-data-science.fr/images/projects/data_structure.jpg)

We are going to take a look at what an MRI image looks like :

In [None]:
def load_dicom(path):
    # dcmread replaces the older read_file alias, which recent pydicom versions no longer provide
    dicom = pydicom.dcmread(path)
    data = dicom.pixel_array
    data = data - np.min(data)
    if np.max(data) != 0:
        data = data / np.max(data)
    data = (data * 255).astype(np.uint8)
    return data

def visualize_sample(
    brats21id, 
    slice_i,
    mgmt_value,
    types=("FLAIR", "T1w", "T1wCE", "T2w")):
    
    plt.figure(figsize=(20, 6))
    patient_path = os.path.join(
        data_directory+"/train/", 
        str(brats21id).zfill(5),
    )
    for i, t in enumerate(types, 1):
        t_paths = sorted(
            glob.glob(os.path.join(patient_path, t, "*")), 
            key=lambda x: int(x[:-4].split("-")[-1]),
        )
        data = load_dicom(t_paths[int(len(t_paths) * slice_i)])
        plt.subplot(1, 4, i)
        plt.imshow(data, cmap="gray")
        plt.title(f"{t}", fontsize=10)
        plt.axis("off")
    plt.show()
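
The rescaling in `load_dicom` is a min-max normalization to the 0-255 range. The sketch below applies the same steps to a small synthetic array (the values are invented, and `minmax_to_uint8` is a hypothetical name) so that the effect is easy to check by hand.

```python
import numpy as np

def minmax_to_uint8(data: np.ndarray) -> np.ndarray:
    """Shift to a zero minimum, scale to [0, 1], then map to 0..255."""
    data = data - np.min(data)
    if np.max(data) != 0:          # avoid dividing by zero on constant images
        data = data / np.max(data)
    return (data * 255).astype(np.uint8)

# Synthetic 2x2 "image" with invented intensities
raw = np.array([[100, 600], [1100, 2100]])
scaled = minmax_to_uint8(raw)
print(scaled)

# A constant image has no contrast and maps to all zeros
flat = minmax_to_uint8(np.full((2, 2), 7))
print(flat)
```

Note that the conversion to `uint8` truncates the fractional part, so intermediate intensities are rounded down.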

### With MGMT = 0

In [None]:
list0=[315,176,153,164]
for i in list0:
    _brats21id = train_df.iloc[i]["BraTS21ID"]
    _mgmt_value = train_df.iloc[i]["MGMT_value"]
    visualize_sample(brats21id=_brats21id, mgmt_value=_mgmt_value, slice_i=0.55)

### With MGMT = 1

In [None]:
list1=[184,315,155,228]
for i in list1:
    _brats21id = train_df.iloc[i]["BraTS21ID"]
    _mgmt_value = train_df.iloc[i]["MGMT_value"]
    visualize_sample(brats21id=_brats21id, mgmt_value=_mgmt_value, slice_i=0.55)

To better understand the representation of these MRI images, we can also **create an animation to visualize the sequence of images** of a certain category for a given patient.

In [None]:
def create_animation(ims):
    fig = plt.figure(figsize=(10, 10))
    plt.axis('off')
    im = plt.imshow(ims[0], cmap="gray")

    def animate_func(i):
        im.set_array(ims[i])
        #return [im]
    return animation.FuncAnimation(fig, animate_func, frames = len(ims), interval = 1000//4)

def load_dicom_line(path):
    t_paths = sorted(
        glob.glob(os.path.join(path, "*")), 
        key=lambda x: int(x[:-4].split("-")[-1]),
    )
    images = []
    for filename in t_paths:
        data = load_dicom(filename)
        if data.max() == 0:
            continue
        images.append(data)
        
    return images
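
Both `visualize_sample` and `load_dicom_line` sort the files by the number at the end of the file name rather than alphabetically. The sketch below, on made-up file names following the `Image-<n>.dcm` pattern seen in this notebook, shows why this matters; `slice_number` is a hypothetical name for the sort key.

```python
# Made-up file names following the Image-<n>.dcm pattern
names = ["Image-10.dcm", "Image-2.dcm", "Image-1.dcm", "Image-25.dcm"]

def slice_number(path: str) -> int:
    """Extract the slice number: strip '.dcm', keep what follows the last '-'."""
    return int(path[:-4].split("-")[-1])

alphabetical = sorted(names)
numerical = sorted(names, key=slice_number)

print(alphabetical)  # 'Image-10.dcm' comes before 'Image-2.dcm'
print(numerical)     # slices in numeric order
```

Without the numeric key, the slices of a volume would be stacked out of order as soon as a scan contains ten or more images.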

In [None]:
images = load_dicom_line(data_directory+"/train/00176/FLAIR")
anm_FLAIR0=create_animation(images)
images = load_dicom_line(data_directory+"/train/00176/T1w")
anm_T1W0=create_animation(images)


images = load_dicom_line(data_directory+"/train/00184/FLAIR")
anm_FLAIR1=create_animation(images)
images = load_dicom_line(data_directory+"/train/00184/T1w")
anm_T1W1=create_animation(images)

### With MGMT = 0
### *FLAIR type :*

In [None]:
anm_FLAIR0

### *T1w type :*

In [None]:
anm_T1W0

### With MGMT = 1
### *FLAIR type :*

In [None]:
anm_FLAIR1

### *T1w type :*

In [None]:
anm_T1W1

### Distribution of images in MRI types :
We now count the DCM files to check whether their number is the same for each MRI type and for each patient. To do so, we add the computed counts to a copy of the training dataframe:

In [None]:
mri_types = ['FLAIR','T1w','T1wCE','T2w']

train_dataset = train_df.copy()
for scan in mri_types:
    train_dataset[scan + "_count"] = [
        len(os.listdir(data_directory + "/train/"
                       + str(p) 
                       + "/" + scan))
        for p in train_dataset.BraTS21ID5]

In [None]:
fig = plt.figure(figsize = (25,40))
for i, scan in enumerate(mri_types):
    ax = plt.subplot(4,1,i+1)
    plt.xticks(rotation=70)
    sns.countplot(x=train_dataset[scan + "_count"], ax=ax)
    ax.set_xlabel("Number of DCM files in scan folder")
    ax.set_ylabel("Count of patients")
    ax.set_title("Distribution of number of DCM files in {} scans".format(scan),
             fontsize=18, color="#0b0a2d")
plt.show()

Note that a few file counts are over-represented for each scan type. On the other hand, the counts span a wide range. This may be due, for example, to the use of different MRI machines.

However, if we only keep the patients whose counts equal the most frequent value for every scan type, the amount of data available may be too low for a complex machine learning algorithm.

In [None]:
# mode() returns a Series; [0] selects the most frequent count
train_dataset[(train_dataset["FLAIR_count"] == train_dataset["FLAIR_count"].mode()[0])
              & (train_dataset["T1w_count"] == train_dataset["T1w_count"].mode()[0])
              & (train_dataset["T1wCE_count"] == train_dataset["T1wCE_count"].mode()[0])
              & (train_dataset["T2w_count"] == train_dataset["T2w_count"].mode()[0])]
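
The filter above keeps the patients whose file count equals the most frequent value (the mode) for every scan type. Here is a minimal sketch of the same idea on invented counts, using only the standard library; the patient IDs, counts and the `mode_of` helper are hypothetical.

```python
from collections import Counter

# Invented per-patient file counts for two scan types
counts = {
    "00001": {"FLAIR": 400, "T1w": 33},
    "00002": {"FLAIR": 400, "T1w": 33},
    "00003": {"FLAIR": 129, "T1w": 33},
    "00004": {"FLAIR": 400, "T1w": 31},
}

def mode_of(scan: str) -> int:
    """Most frequent file count for one scan type."""
    return Counter(c[scan] for c in counts.values()).most_common(1)[0][0]

modes = {scan: mode_of(scan) for scan in ("FLAIR", "T1w")}
# Keep only the patients matching the mode for every scan type
kept = [pid for pid, c in counts.items()
        if all(c[scan] == modes[scan] for scan in modes)]
print(modes, kept)
```

Because the conditions are combined with a logical AND, a patient is dropped as soon as a single scan type deviates from its mode, which explains how quickly the selection shrinks.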

Only 151 of the 585 patients have the most frequent number of DCM images for all four scan types. We therefore keep the entire dataset for the moment. **Let's have a look at the test dataset**:

In [None]:
test_dataset = test.copy()
for scan in mri_types:
    test_dataset[scan + "_count"] = [
        len(os.listdir(data_directory + "/test/"
                       + str(p)
                       + "/" + scan))
        for p in test_dataset.BraTS21ID5]

In [None]:
fig = plt.figure(figsize = (25,40))
for i, scan in enumerate(mri_types):
    ax = plt.subplot(4,1,i+1)
    plt.xticks(rotation=70)
    sns.countplot(x=test_dataset[scan + "_count"], ax=ax)
    ax.set_xlabel("Number of DCM files in scan folder")
    ax.set_ylabel("Count of patients")
    ax.set_title("Distribution of number of DCM files in TEST {} scans".format(scan),
             fontsize=18, color="#0b0a2d")
plt.show()

# <span style="color:#0b0a2d; font-size:24px; text-transform: uppercase; font-weight:bold" id="section_2">3D CNN from scratch on a single MRI type</span>

In this section, a convolutional neural network will be trained on a single type of MRI scan. We will take a **sequence of 64 consecutive images of 80x80 pixels**, which allows the scans to be treated like videos.

In [None]:
# Fix MRI type to FLAIR
mri_types_id = 0 # 0,1,2,3

# Initial parameters
IMAGE_SIZE = 80
NUM_IMAGES = 64
BATCH_SIZE = 4

num_folds = 5
Selected_fold = 1 #1,2,3,4,5 

In [None]:
# Preprocessing params
# Scale for image crop
SCALE = .8
# Tile size for CLAHE equalizer
TILE_SIZE = (8,8)
# H param for denoising filter
H_PARAM = 10

In [None]:
train_df["Fold"]="train"
train_df.head(5)

In [None]:
test.head(3)

## <span style="color:#3c99dc; font-size:18px; text-transform: uppercase; font-weight:bold" id="section_2_1">Images preprocessing</span>

In [None]:
def mri_preprocessing(img, scale=SCALE, img_size=IMAGE_SIZE, tile_size=TILE_SIZE, h_param=H_PARAM):
    # Crop image
    center_x, center_y = img.shape[1] / 2, img.shape[0] / 2
    width_scaled, height_scaled = img.shape[1] * scale, img.shape[0] * scale
    left_x, right_x = center_x - width_scaled / 2, center_x + width_scaled / 2
    top_y, bottom_y = center_y - height_scaled / 2, center_y + height_scaled / 2
    img = img[int(top_y):int(bottom_y), int(left_x):int(right_x)]
    img = cv2.resize(img, (img_size, img_size))
    
    # Global histogram equalization followed by CLAHE
    img = np.uint8(cv2.normalize(img, None, 0, 255, cv2.NORM_MINMAX))
    img = cv2.equalizeHist(img)
    clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=tile_size)
    img = clahe.apply(np.uint8(img))
    
    # Non-local mean filter denoising
    img = cv2.fastNlMeansDenoising(
        src=img,
        dst=None,
        h=h_param,
        templateWindowSize=7,
        searchWindowSize=21)
    
    return img
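
The crop at the start of `mri_preprocessing` keeps a centered window covering `scale` times each side of the image. The sketch below reproduces only the index arithmetic, assuming a 512x512 slice for the example (the actual slice size depends on the scan); `center_crop_bounds` is a hypothetical helper.

```python
def center_crop_bounds(height: int, width: int, scale: float):
    """Return (top, bottom, left, right) indices of the centered crop."""
    center_x, center_y = width / 2, height / 2
    half_w, half_h = width * scale / 2, height * scale / 2
    # int() truncates toward zero, as in mri_preprocessing
    return (int(center_y - half_h), int(center_y + half_h),
            int(center_x - half_w), int(center_x + half_w))

top, bottom, left, right = center_crop_bounds(512, 512, 0.8)
print(top, bottom, left, right)      # crop window
print(bottom - top, right - left)    # crop size before resizing to IMAGE_SIZE
```

With `scale = 0.8`, roughly 10% of the image is removed on each border, which mostly discards background around the skull before the resize.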

In [None]:
sample_img_path = ''.join([data_directory, '/train/00176/FLAIR/Image-25.dcm'])
sample_img = pydicom.dcmread(sample_img_path)
sample_img = sample_img.pixel_array

fig = plt.figure(figsize=(12,6))
ax = plt.subplot(1,2,1)
ax.imshow(sample_img, cmap="gray")
ax.set_title("Original image")
ax1 = plt.subplot(1,2,2)
img_preproc = mri_preprocessing(sample_img)
ax1.imshow(img_preproc, cmap="gray")
ax1.set_title("Preprocessed image")
plt.show()

## <span style="color:#3c99dc; font-size:18px; text-transform: uppercase; font-weight:bold" id="section_2_1">Functions to load images</span>
We use two functions: one to load a single DICOM image, and a second to load a sequence of `num_imgs` images centered on the middle image of the scan. These functions are then used in a custom image generator.

Here we modify the original notebook by changing the shape of the resulting array. To match what the Keras 3D convolution layer expects, the shape must be (batch size, number of frames, width, height, number of channels), i.e. (4, 64, 80, 80, 1).
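
To make the expected shape concrete, here is a small NumPy sketch that builds a dummy batch with the parameters used in this notebook (4 volumes of 64 frames of 80x80 pixels). The arrays are filled with zeros; only the shapes matter.

```python
import numpy as np

BATCH, FRAMES, SIZE = 4, 64, 80

# One volume: a stack of FRAMES grayscale images of SIZE x SIZE pixels
volume = np.stack([np.zeros((SIZE, SIZE)) for _ in range(FRAMES)])
# Add a trailing channel axis (a single channel, since the slices are grayscale)
volume = np.expand_dims(volume, -1)

# One batch: a stack of BATCH volumes along a new leading axis
batch = np.stack([volume] * BATCH, axis=0)
print(volume.shape, batch.shape)
```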


In [None]:
def load_dicom_image(path, img_size=IMAGE_SIZE, voi_lut=True, rotate=0, preproc=True):
    dicom = pydicom.dcmread(path)
    # Optionally apply the VOI LUT / windowing stored in the DICOM header
    if voi_lut:
        data = apply_voi_lut(dicom.pixel_array, dicom)
    else:
        data = dicom.pixel_array
        
    # rotate: 0 = no rotation, 1 to 3 = index into rot_choices
    if rotate > 0:
        rot_choices = [0, cv2.ROTATE_90_CLOCKWISE, cv2.ROTATE_90_COUNTERCLOCKWISE, cv2.ROTATE_180]
        data = cv2.rotate(data, rot_choices[rotate])
        
    if preproc:
        # Crop, resize, equalize and denoise
        data = mri_preprocessing(data, img_size=img_size)
    else:
        data = cv2.resize(data, (img_size, img_size))
        
    return data


def load_dicom_images_3d(scan_id, num_imgs=NUM_IMAGES, img_size=IMAGE_SIZE, mri_type=mri_types[mri_types_id], split="train", rotate=0):

    files = sorted(glob.glob(f"{data_directory}/{split}/{scan_id}/{mri_type}/*.dcm"), 
               key=lambda var:[int(x) if x.isdigit() else x for x in re.findall(r'[^0-9]|[0-9]+', var)])

    middle = len(files)//2
    num_imgs2 = num_imgs//2
    p1 = max(0, middle - num_imgs2)
    p2 = min(len(files), middle + num_imgs2)
    img3d = np.stack([load_dicom_image(f, img_size=img_size, rotate=rotate) for f in files[p1:p2]]) 
    if img3d.shape[0] < num_imgs:
        n_zero = np.zeros((num_imgs - img3d.shape[0], img_size, img_size))
        img3d = np.concatenate((img3d,  n_zero), axis = 0)
        
    if np.min(img3d) < np.max(img3d):
        img3d = img3d - np.min(img3d)
        img3d = img3d / np.max(img3d)
            
    return np.expand_dims(img3d,-1)
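
The slice selection in `load_dicom_images_3d` keeps up to `num_imgs` slices around the middle of the scan and pads shorter scans with all-zero frames. The sketch below isolates that index logic; the scan lengths are invented and `central_window` is a hypothetical helper.

```python
def central_window(n_files: int, num_imgs: int = 64):
    """Return (start, stop, n_padding) for the slices kept around the middle."""
    middle = n_files // 2
    half = num_imgs // 2
    start = max(0, middle - half)
    stop = min(n_files, middle + half)
    n_padding = num_imgs - (stop - start)   # zero frames appended at the end
    return start, stop, n_padding

long_scan = central_window(400)   # 64 central slices, no padding
short_scan = central_window(40)   # all slices kept, then zero-padded
print(long_scan, short_scan)
```

Since the padding is appended after the real slices, a short scan is not centered in the resulting volume: its frames occupy the beginning and the zeros the end.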

We can test these functions on a sample sequence from the train set:

In [None]:
a = load_dicom_images_3d("00046")
image = a[0]
print(a.shape)
print(np.min(a), np.max(a), np.mean(a), np.median(a))
print("Dimension of the MRI slice is:", image.shape)
plt.imshow(image, cmap="gray")

## <span style="color:#3c99dc; font-size:18px; text-transform: uppercase; font-weight:bold" id="section_2_2">Define Folds</span>

In [None]:
from sklearn.model_selection import KFold,StratifiedKFold
sfolder = StratifiedKFold(n_splits=5,random_state=13,shuffle=True)
X = train_df[['BraTS21ID']]
y = train_df[['MGMT_value']]

fold_no = 1
for train, valid in sfolder.split(X,y):
    if fold_no==Selected_fold:
        train_df.loc[valid, "Fold"] = "valid"
    fold_no += 1
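
Stratification means that each fold keeps roughly the same class proportions as the full dataset. The sketch below is a simplified pure-Python illustration of that idea on made-up labels; it is not scikit-learn's implementation, and `stratified_folds` is a hypothetical helper.

```python
import random
from collections import Counter

def stratified_folds(labels, n_splits=5, seed=13):
    """Assign a fold index to each sample, class by class, in round-robin order."""
    rng = random.Random(seed)
    folds = [None] * len(labels)
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)                      # shuffle within the class
        for pos, i in enumerate(idx):
            folds[i] = pos % n_splits         # spread the class over all folds
    return folds

# Made-up labels: 40 negatives and 60 positives
labels = [0] * 40 + [1] * 60
folds = stratified_folds(labels)
per_fold = {f: Counter(y for y, g in zip(labels, folds) if g == f)
            for f in range(5)}
print(per_fold)
```

Each fold ends up with the same 40/60 split as the whole sample, which keeps the validation AUC comparable from one fold to another.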

In [None]:
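df_train=train_df[train_df.Fold=="train"]
# Drop the last validation row, presumably so that the validation set size
# is a multiple of BATCH_SIZE (the model input is built with a fixed batch size)
df_valid=train_df[train_df.Fold=="valid"].iloc[:-1,:]
print("df_train=",len(df_train),"-- df_valid=",len(df_valid))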

## <span style="color:#3c99dc; font-size:18px; text-transform: uppercase; font-weight:bold" id="section_2_2">Keras custom Data Generator</span>
Thanks to the `Sequence` class of the Keras library, we can create a custom image generator. This avoids building NumPy arrays or tensors containing all the sequences, which would quickly exhaust the memory.

In [None]:
from tensorflow.keras.utils import Sequence
class Dataset(Sequence):
    def __init__(self,df,is_train=True,batch_size=BATCH_SIZE,shuffle=True):
        self.idx = df["BraTS21ID"].values
        self.paths = df["BraTS21ID5"].values
        self.y =  df["MGMT_value"].values
        self.is_train = is_train
        self.batch_size = batch_size
        self.shuffle = shuffle
    def __len__(self):
        return math.ceil(len(self.idx)/self.batch_size)
   
    def __getitem__(self,ids):
        id_path= self.paths[ids]
        batch_paths = self.paths[ids * self.batch_size:(ids + 1) * self.batch_size]
        
        if self.y is not None:
            batch_y = self.y[ids * self.batch_size: (ids + 1) * self.batch_size]
        
        if self.is_train:
            list_x =  [load_dicom_images_3d(x,split="train") for x in batch_paths]
            batch_X = np.stack(list_x, axis=0)
            return batch_X,batch_y
        else:
            list_x =  load_dicom_images_3d(id_path,split="test")#str(scan_id).zfill(5)
            batch_X = np.stack(list_x)
            return batch_X
    
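    def on_epoch_end(self):
        if self.shuffle and self.is_train:
            # Shuffle ids, folder names and labels with the same permutation,
            # so that each label stays aligned with the volume loaded in __getitem__
            order = list(range(len(self.idx)))
            shuffle(order)
            self.idx, self.paths, self.y = self.idx[order], self.paths[order], self.y[order]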
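
The batching logic of the generator can be illustrated without Keras. The class below is a minimal stand-in (a hypothetical `BatchIndexer`, not the notebook's `Dataset`) that only shows how `__len__` and `__getitem__` cut a list of items into batches.

```python
import math

class BatchIndexer:
    """Minimal stand-in for the batching logic of a Keras Sequence."""

    def __init__(self, items, batch_size):
        self.items = list(items)
        self.batch_size = batch_size

    def __len__(self):
        # Number of batches, counting a possibly smaller last batch
        return math.ceil(len(self.items) / self.batch_size)

    def __getitem__(self, i):
        # Slice out the i-th batch; slicing never goes past the end of the list
        return self.items[i * self.batch_size:(i + 1) * self.batch_size]

batches = BatchIndexer(range(10), batch_size=4)
all_batches = [batches[i] for i in range(len(batches))]
print(len(batches), all_batches)
```

The last batch is smaller whenever the number of items is not a multiple of the batch size, which matters when a model is built with a fixed batch size.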

In [None]:
train_dataset = Dataset(df_train,batch_size=BATCH_SIZE)
valid_dataset = Dataset(df_valid,batch_size=BATCH_SIZE)

Once the generators are created, we can display an image to check:

In [None]:
for i in range(1):
    images, label = train_dataset[i]
    print("Dimension of the MRI batch is:", images.shape)
    print("label=",label)
    plt.imshow(images[0,32,:,:,:], cmap="gray")
    plt.show()

## <span style="color:#3c99dc; font-size:18px; text-transform: uppercase; font-weight:bold" id="section_2_2">Define CNN Model</span>

In [None]:
def get_model(width=IMAGE_SIZE, height=IMAGE_SIZE, depth=NUM_IMAGES):
    """Build a 3D convolutional neural network model."""

    inputs = keras.Input(shape=(depth, width, height, 1), batch_size=BATCH_SIZE)
     
    x = layers.Conv3D(filters=16, kernel_size=3, activation="relu", padding="same")(inputs)
    x = layers.MaxPool3D(pool_size=2)(x)
    x = layers.BatchNormalization()(x)
    
    x = layers.Conv3D(filters=32, kernel_size=3, activation="relu", padding="same")(x)
    x = layers.MaxPool3D(pool_size=2)(x)
    x = layers.BatchNormalization()(x)
    
    x = layers.Conv3D(filters=64, kernel_size=3, activation="relu", padding="same")(x)
    x = layers.MaxPool3D(pool_size=2)(x)
    x = layers.BatchNormalization()(x)
    x = layers.Dropout(0.1)(x)
    
    x = layers.Conv3D(filters=128, kernel_size=3, activation="relu", padding="same")(x)
    x = layers.MaxPool3D(pool_size=2)(x)
    x = layers.BatchNormalization()(x)
    x = layers.Dropout(0.1)(x)

    x = layers.Conv3D(filters=256, kernel_size=3, activation="relu", padding="same")(x)
    x = layers.MaxPool3D(pool_size=2)(x)
    x = layers.BatchNormalization()(x)
    x = layers.Dropout(0.2)(x)

    x = layers.Conv3D(filters=512, kernel_size=3, activation="relu", padding="same")(x)
    x = layers.MaxPool3D(pool_size=2)(x)
    x = layers.BatchNormalization()(x)
    x = layers.Dropout(0.2)(x)

    x = layers.GlobalAveragePooling3D()(x)
    x = layers.Dense(units=512, activation="relu")(x)

    outputs = layers.Dense(units=1, activation="sigmoid")(x)

    # Define the model.
    model = keras.Model(inputs, outputs, name="3dcnn")

    return model

In [None]:
# Build model.
model = get_model(width=IMAGE_SIZE, height=IMAGE_SIZE, depth=NUM_IMAGES)
model.summary()
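
Each `MaxPool3D(pool_size=2)` layer halves the three spatial dimensions (rounding down, with the default `padding="valid"`). A quick arithmetic check, independent of Keras, shows that six pooling layers reduce a 64x80x80 volume to a single position per channel before the global average pooling; `pooled_shape` is a hypothetical helper.

```python
def pooled_shape(shape, n_pools, pool_size=2):
    """Spatial shapes after each of n_pools pooling layers with 'valid' padding."""
    history = [tuple(shape)]
    for _ in range(n_pools):
        shape = tuple(d // pool_size for d in shape)
        history.append(shape)
    return history

history = pooled_shape((64, 80, 80), n_pools=6)
for step, s in enumerate(history):
    print(step, s)
```

This also shows that the current input size leaves no room for a seventh pooling layer.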

## <span style="color:#3c99dc; font-size:18px; text-transform: uppercase; font-weight:bold" id="section_2_3">Training of the CNN</span>

In [None]:
# Compile model.
initial_learning_rate = 0.0001
lr_schedule = keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate, decay_steps=100000, decay_rate=0.96, staircase=True
)
model.compile(
    loss="binary_crossentropy",
    optimizer=keras.optimizers.Adam(learning_rate=lr_schedule),
    metrics=[AUC(name='auc'),"acc"],
)
# Define callbacks.
model_save = ModelCheckpoint(f'Brain_3d_cls_{mri_types[mri_types_id]}_Fold_{Selected_fold}.h5', 
                             save_best_only = True, 
                             monitor = 'val_auc', 
                             mode = 'max', verbose = 1)
early_stop = EarlyStopping(monitor = 'val_auc', 
                           patience = 5, mode = 'max', verbose = 1,
                           restore_best_weights = True)
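
With `staircase=True`, `ExponentialDecay` multiplies the learning rate by `decay_rate` once every `decay_steps` optimizer steps. The plain-Python sketch below reproduces that formula to show what it implies here; the figure of 117 steps per epoch is an assumption (468 training cases with a batch size of 4), and `staircase_lr` is a hypothetical helper.

```python
def staircase_lr(step, initial_lr=1e-4, decay_steps=100_000, decay_rate=0.96):
    """initial_lr * decay_rate ** floor(step / decay_steps)."""
    return initial_lr * decay_rate ** (step // decay_steps)

steps_per_epoch = 117            # assumption: 468 training cases / batch size 4
last_step = 50 * steps_per_epoch

lr_start = staircase_lr(0)
lr_end = staircase_lr(last_step)
lr_first_decay = staircase_lr(100_000)
print(lr_start, lr_end, lr_first_decay)
```

Under these assumptions the schedule never reaches its first decay within 50 epochs, so training effectively uses a constant learning rate of 0.0001.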

In [None]:
# Train the model, doing validation at the end of each epoch
epochs = 50
model.fit(
    train_dataset,
    validation_data=valid_dataset,
    epochs=epochs,
    shuffle=True,
    verbose=1,
    callbacks = [model_save, early_stop],
)

## <span style="color:#3c99dc; font-size:18px; text-transform: uppercase; font-weight:bold" id="section_2_4">Visualizing model performance</span>

In [None]:
fig, ax = plt.subplots(1, 3, figsize=(20, 7))
ax = ax.ravel()

for i, metric in enumerate(["acc","auc","loss"]):
    ax[i].plot(model.history.history[metric])
    ax[i].plot(model.history.history["val_" + metric])
    ax[i].set_title("Model {}".format(metric))
    ax[i].set_xlabel("epochs")
    ax[i].set_ylabel(metric)
    ax[i].legend(["train", "val"])

## <span style="color:#3c99dc; font-size:18px; text-transform: uppercase; font-weight:bold" id="section_2_4">Make predictions with trained CNN Model</span>

In [None]:
test_dataset = Dataset(test,is_train=False,batch_size=1)


for i in range(1):
    image = test_dataset[i]
    print("Dimension of the MRI volume is:", image.shape)
    plt.imshow(image[32,:,:,:], cmap="gray")
    plt.show()

In [None]:
preds = model.predict(test_dataset)
preds = preds.reshape(-1)

In [None]:
preds

## <span style="color:#3c99dc; font-size:18px; text-transform: uppercase; font-weight:bold" id="section_2_5">Submission of result predictions</span>

In [None]:
submission = pd.DataFrame({'BraTS21ID':test['BraTS21ID'],'MGMT_value':preds})
submission

In [None]:
submission.to_csv('submission.csv',index=False)

In [None]:
plt.figure(figsize=(5, 5))
plt.hist(submission["MGMT_value"]);

# References

1. https://keras.io/examples/vision/3D_image_classification/
1. https://www.kaggle.com/rluethy/efficientnet3d-with-one-mri-type
