<a href="https://colab.research.google.com/github/mobassir94/RSNA-Intracranial-Hemorrhage-Detection/blob/master/rsna_efficientnetb2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

The DICOM format is so cool, but I prefer normal images :)

At 156 GB compressed, the dataset is very hard to work with on the hardware that most of us mortals have access to.
This notebook shows how to scale down all the images and build a new dataset that is easier to deal with.
Even with the best computing resources, I don't think the original image size is necessary to get good accuracy.

If you need bigger images or want to store them in another format, you only need to change a couple of lines in the next section (Configuration).

Some code taken from:
* https://www.kaggle.com/omission/eda-view-dicom-images-with-correct-windowing
* https://www.kaggle.com/c/rsna-intracranial-hemorrhage-detection/discussion/109649#latest-631701

# Configuration

In [0]:
# Desired output size.
RESIZED_WIDTH, RESIZED_HEIGHT = 128, 128

OUTPUT_FORMAT = "png"

OUTPUT_DIR = "output"
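
As a rough sketch of how these constants might be used, the hypothetical helper below (`output_path_for` is not part of the original notebook) maps a DICOM path to the path where its resized copy would be saved. The two constants are repeated so the snippet runs on its own.

```python
import os

# Repeated from the configuration cell so the sketch is self-contained.
OUTPUT_FORMAT = "png"
OUTPUT_DIR = "output"

def output_path_for(dcm_path):
    """Hypothetical helper: where the resized copy of `dcm_path` would go."""
    # Strip the directory and the .dcm extension to get the image ID.
    image_id = os.path.splitext(os.path.basename(dcm_path))[0]
    return os.path.join(OUTPUT_DIR, f"{image_id}.{OUTPUT_FORMAT}")

example = output_path_for("/tmp/kaggle-data/stage_1_test_images/ID_000012eaf.dcm")
```

Changing `OUTPUT_FORMAT` to another extension changes only the file name here; the actual encoding would depend on whichever image library writes the file.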

# Installations

In [0]:
%%capture

# Uninstall and reinstall kaggle because I'm getting an error: the kaggle script
# runs on Python 2 instead of Python 3 and fails when downloading kernel
# outputs.
# It's also needed to download the data for this competition.
!pip uninstall -y kaggle
!pip install kaggle

In [0]:
%%capture

# Install this library for reading the *.dcm images of this competition.
!pip install pydicom

In [0]:
%%capture

# Install fuse-zip to mount zip files so we can access their contents without unzipping them.
# This is needed because of the limited disk space in Google Colab.
!apt-get install -y fuse-zip

# Imports

In [0]:
import glob
import os

import joblib

import numpy as np

import PIL

import pydicom

import tqdm

# Setup

In [0]:
# Set environment variables for using the Kaggle API.
os.environ["KAGGLE_USERNAME"] = "my name"
os.environ["KAGGLE_KEY"] = "mykey"

# Get the data

In [0]:
# 30-45min in Google Colab.
raw_data_dir = "input/raw"
!kaggle competitions download -c rsna-intracranial-hemorrhage-detection -p {raw_data_dir}

Downloading rsna-intracranial-hemorrhage-detection.zip to input/raw
100% 156G/156G [15:51<00:00, 180MB/s]
100% 156G/156G [15:51<00:00, 176MB/s]


# Mount ZIP with fuse-zip

In [0]:
%%time
# Around 10 min in Google Colab.

input_dir = "/tmp/kaggle-data"
!mkdir {input_dir}
!fuse-zip input/raw/rsna-intracranial-hemorrhage-detection.zip {input_dir}

CPU times: user 2.52 s, sys: 307 ms, total: 2.83 s
Wall time: 9min 15s


In [0]:
# Check that everything is working.
!ls {input_dir}

stage_1_sample_submission.csv  stage_1_train.csv
stage_1_test_images	       stage_1_train_images


# Get image paths

In [0]:
train_dir = "stage_1_train_images/"
train_paths = glob.glob(f"{input_dir}/{train_dir}/*.dcm")
test_dir = "stage_1_test_images/"
test_paths = glob.glob(f"{input_dir}/{test_dir}/*.dcm")
len(train_paths), len(test_paths)

(674258, 78545)

In [0]:

import numpy as np
import pandas as pd
import pydicom
import os
import collections
import sys
import glob
import random
import cv2
import tensorflow as tf
import multiprocessing

from math import ceil, floor
from copy import deepcopy
from tqdm import tqdm
from imgaug import augmenters as iaa

import keras
import keras.backend as K
from keras.callbacks import Callback, ModelCheckpoint
from keras.layers import Dense, Flatten, Dropout
from keras.models import Model, load_model
from keras.utils import Sequence
from keras.losses import binary_crossentropy
from keras.optimizers import Adam
from google.colab import files

# Install Modules from internet
!pip install efficientnet
!pip install iterative-stratification

# Import Custom Modules
import efficientnet.keras as efn 
from iterstrat.ml_stratifiers import MultilabelStratifiedShuffleSplit

Using TensorFlow backend.


Collecting efficientnet
  Downloading https://files.pythonhosted.org/packages/97/82/f3ae07316f0461417dc54affab6e86ab188a5a22f33176d35271628b96e0/efficientnet-1.0.0-py3-none-any.whl
Installing collected packages: efficientnet
Successfully installed efficientnet-1.0.0
Collecting iterative-stratification
  Downloading https://files.pythonhosted.org/packages/9d/79/9ba64c8c07b07b8b45d80725b2ebd7b7884701c1da34f70d4749f7b45f9a/iterative_stratification-0.1.6-py3-none-any.whl
Installing collected packages: iterative-stratification
Successfully installed iterative-stratification-0.1.6


In [0]:
# Seed
SEED = 12345
np.random.seed(SEED)
#tf.set_random_seed(SEED)

# Constants
TEST_SIZE = 0.15
HEIGHT = 256
WIDTH = 256
CHANNELS = 3
TRAIN_BATCH_SIZE = 32
VALID_BATCH_SIZE = 64
SHAPE = (HEIGHT, WIDTH, CHANNELS)

# Folders
#DATA_DIR = '/kaggle/input/rsna-intracranial-hemorrhage-detection/'
TEST_IMAGES_DIR = test_dir
TRAIN_IMAGES_DIR = train_dir

In [0]:
def correct_dcm(dcm):
    x = dcm.pixel_array + 1000
    px_mode = 4096
    x[x>=px_mode] = x[x>=px_mode] - px_mode
    dcm.PixelData = x.tobytes()
    dcm.RescaleIntercept = -1000

def window_image(dcm, window_center, window_width):    
    if (dcm.BitsStored == 12) and (dcm.PixelRepresentation == 0) and (int(dcm.RescaleIntercept) > -100):
        correct_dcm(dcm)
    img = dcm.pixel_array * dcm.RescaleSlope + dcm.RescaleIntercept
    
    # Resize; cv2.resize expects the target size as (width, height).
    img = cv2.resize(img, (SHAPE[1], SHAPE[0]), interpolation = cv2.INTER_LINEAR)
   
    img_min = window_center - window_width // 2
    img_max = window_center + window_width // 2
    img = np.clip(img, img_min, img_max)
    return img

def bsb_window(dcm):
    brain_img = window_image(dcm, 40, 80)
    subdural_img = window_image(dcm, 80, 200)
    soft_img = window_image(dcm, 40, 380)
    
    brain_img = (brain_img - 0) / 80
    subdural_img = (subdural_img - (-20)) / 200
    soft_img = (soft_img - (-150)) / 380
    bsb_img = np.array([brain_img, subdural_img, soft_img]).transpose(1,2,0)
    return bsb_img

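def _read(path, shape):
    dcm = pydicom.dcmread(path)
    try:
        img = bsb_window(dcm)
    except Exception:
        # Fall back to an all-zero image if the pixel data cannot be
        # decoded or windowed (for example, a corrupt DICOM file).
        img = np.zeros(shape)
    return img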
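
The windowing and scaling done by `window_image` and `bsb_window` can be tried on a synthetic array. This is a minimal sketch assuming the input already holds Hounsfield units; `window_and_scale` is a name made up for the illustration and folds the clip and the division into one step.

```python
import numpy as np

def window_and_scale(hu, center, width):
    """Clip to [center - width//2, center + width//2], then scale to [0, 1]."""
    lo = center - width // 2
    hi = center + width // 2
    clipped = np.clip(hu, lo, hi)
    return (clipped - lo) / (hi - lo)

# Synthetic Hounsfield values: air, water, soft tissue, dense tissue, metal.
hu = np.array([-1000.0, 0.0, 40.0, 80.0, 3000.0])

brain = window_and_scale(hu, 40, 80)       # window [0, 80]
subdural = window_and_scale(hu, 80, 200)   # window [-20, 180]
soft = window_and_scale(hu, 40, 380)       # window [-150, 230]
```

The lower bounds 0, -20 and -150 are the same offsets that `bsb_window` subtracts before dividing by 80, 200 and 380.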

In [0]:
print(os.listdir('./input/raw'))

['rsna-intracranial-hemorrhage-detection.zip']


In [0]:
filepath = "/content/drive/My Drive/"

In [0]:
# Image Augmentation
sometimes = lambda aug: iaa.Sometimes(0.25, aug)
augmentation = iaa.Sequential([ iaa.Fliplr(0.25),
                                iaa.Flipud(0.10),
                                sometimes(iaa.Crop(px=(0, 25), keep_size = True, sample_independently = False))   
                            ], random_order = True)       
        
# Generators
class TrainDataGenerator(keras.utils.Sequence):
    def __init__(self, dataset, labels, batch_size = 16, img_size = SHAPE, img_dir = TRAIN_IMAGES_DIR, augment = False, *args, **kwargs):
        self.dataset = dataset
        self.ids = dataset.index
        self.labels = labels
        self.batch_size = batch_size
        self.img_size = img_size
        self.img_dir = input_dir+"/"+img_dir
        self.augment = augment
        self.on_epoch_end()

    def __len__(self):
        return int(ceil(len(self.ids) / self.batch_size))

    def __getitem__(self, index):
        indices = self.indices[index*self.batch_size:(index+1)*self.batch_size]
        X, Y = self.__data_generation(indices)
        return X, Y

    def augmentor(self, image):
        augment_img = augmentation        
        image_aug = augment_img.augment_image(image)
        return image_aug

    def on_epoch_end(self):
        self.indices = np.arange(len(self.ids))
        np.random.shuffle(self.indices)

    def __data_generation(self, indices):
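        # Size the arrays by the actual number of samples so that the last
        # (possibly shorter) batch carries no uninitialized rows.
        X = np.empty((len(indices), *self.img_size))
        Y = np.empty((len(indices), 6), dtype=np.float32)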
        
        for i, index in enumerate(indices):
            ID = self.ids[index]
            image = _read(self.img_dir+ID+".dcm", self.img_size)
            if self.augment:
                X[i,] = self.augmentor(image)
            else:
                X[i,] = image
            Y[i,] = self.labels.iloc[index].values        
        return X, Y
    
class TestDataGenerator(keras.utils.Sequence):
    def __init__(self, dataset, labels, batch_size = 16, img_size = SHAPE, img_dir = TEST_IMAGES_DIR, *args, **kwargs):
        self.dataset = dataset
        self.ids = dataset.index
        self.labels = labels
        self.batch_size = batch_size
        self.img_size = img_size
        self.img_dir = input_dir+"/"+img_dir
        self.on_epoch_end()

    def __len__(self):
        return int(ceil(len(self.ids) / self.batch_size))

    def __getitem__(self, index):
        indices = self.indices[index*self.batch_size:(index+1)*self.batch_size]
        X = self.__data_generation(indices)
        return X

    def on_epoch_end(self):
        self.indices = np.arange(len(self.ids))
    
    def __data_generation(self, indices):
        # Size the array by the actual number of samples in this batch.
        X = np.empty((len(indices), *self.img_size))
        
        for i, index in enumerate(indices):
            ID = self.ids[index]
            image = _read(self.img_dir+ID+".dcm", self.img_size)
            X[i,] = image              
        return X
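
The batching arithmetic shared by both generators (the formula in `__len__` and the slice in `__getitem__`) can be checked in isolation. This is a minimal sketch in plain Python; `batch_indices` is a name made up for the illustration.

```python
from math import ceil

def batch_indices(n_items, batch_size):
    """Return the list of indices that each batch would cover."""
    n_batches = int(ceil(n_items / batch_size))   # same formula as __len__
    indices = list(range(n_items))
    # Same slice as __getitem__: [index*batch_size, (index+1)*batch_size)
    return [indices[b * batch_size:(b + 1) * batch_size] for b in range(n_batches)]

batches = batch_indices(10, 4)
```

With 10 items and a batch size of 4, the third batch holds only 2 items, which is why an array sized by `batch_size` ends up with unused rows in the last batch.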

In [0]:
def read_testset(filename = input_dir + "/stage_1_sample_submission.csv"):
    df = pd.read_csv(filename)
    df["Image"] = df["ID"].str.slice(stop=12)
    df["Diagnosis"] = df["ID"].str.slice(start=13)
    df = df.loc[:, ["Label", "Diagnosis", "Image"]]
    df = df.set_index(['Image', 'Diagnosis']).unstack(level=-1)
    return df

def read_trainset(filename = input_dir + "/stage_1_train.csv"):
    df = pd.read_csv(filename)
    df["Image"] = df["ID"].str.slice(stop=12)
    df["Diagnosis"] = df["ID"].str.slice(start=13)
    duplicates_to_remove = [
        1598538, 1598539, 1598540, 1598541, 1598542, 1598543,
        312468,  312469,  312470,  312471,  312472,  312473,
        2708700, 2708701, 2708702, 2708703, 2708704, 2708705,
        3032994, 3032995, 3032996, 3032997, 3032998, 3032999
    ]
    df = df.drop(index = duplicates_to_remove)
    df = df.reset_index(drop = True)    
    df = df.loc[:, ["Label", "Diagnosis", "Image"]]
    df = df.set_index(['Image', 'Diagnosis']).unstack(level=-1)
    return df

# Read Train and Test Datasets
test_df = read_testset()
train_df = read_trainset()
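
To see what the reshaping in `read_trainset` and `read_testset` does, here is the same sequence of pandas calls on a tiny hand-made table. The IDs follow the competition's `ID_<image>_<diagnosis>` pattern, but the labels are invented for the example.

```python
import pandas as pd

# Long format: one row per (image, diagnosis) pair, as in the CSV files.
rows = pd.DataFrame({
    "ID": ["ID_000012eaf_any", "ID_000012eaf_epidural",
           "ID_0000ca2f6_any", "ID_0000ca2f6_epidural"],
    "Label": [0, 0, 1, 1],   # made-up labels
})

# The first 12 characters are the image ID; the diagnosis starts at position 13.
rows["Image"] = rows["ID"].str.slice(stop=12)
rows["Diagnosis"] = rows["ID"].str.slice(start=13)

# Wide format: one row per image, one column per diagnosis.
wide = (rows.loc[:, ["Label", "Diagnosis", "Image"]]
            .set_index(["Image", "Diagnosis"])
            .unstack(level=-1))
```

The result has a two-level column index, which is why the notebook later selects a diagnosis with `train_df.Label['epidural']`.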

In [0]:
# Oversampling
epidural_df = train_df[train_df.Label['epidural'] == 1]
train_oversample_df = pd.concat([train_df, epidural_df])
train_df = train_oversample_df

# Summary
print('Train Shape: {}'.format(train_df.shape))
print('Test Shape: {}'.format(test_df.shape))

Train Shape: (677019, 6)
Test Shape: (78545, 6)
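
The oversampling step simply appends a second copy of every positive row. A toy sketch of the same `pd.concat` pattern, with invented labels:

```python
import pandas as pd

# Toy label table: one positive out of four rows (values are made up).
labels = pd.DataFrame({"epidural": [0, 1, 0, 0]}, index=["a", "b", "c", "d"])

# Select the positive rows and append them once more.
positives = labels[labels["epidural"] == 1]
oversampled = pd.concat([labels, positives])

share_before = labels["epidural"].mean()       # 1 of 4 rows
share_after = oversampled["epidural"].mean()   # 2 of 5 rows
```

Because the notebook oversamples before the train/validation split, copies of the same epidural image may land on both sides of the split.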


In [0]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [0]:
def predictions(test_df, model):    
    test_preds = model.predict_generator(TestDataGenerator(test_df, None, 5, SHAPE, TEST_IMAGES_DIR), verbose = 1)
    # Keep exactly one prediction row per test image.
    return test_preds[:len(test_df)]

def ModelCheckpointFull(model_name):
    return ModelCheckpoint(filepath + model_name, 
                            monitor = 'val_loss', 
                            verbose = 1, 
                            save_best_only = False, 
                            save_weights_only = True, 
                            mode = 'min', 
                            period = 1)

# Create Model
def create_model():
    #K.clear_session()
    
    base_model =  efn.EfficientNetB2(weights = 'imagenet', include_top = False, pooling = 'avg', input_shape = SHAPE)
    #base_model.load_weights(filepath+'model.h5')
    x = base_model.output
    x = Dropout(0.10)(x)
    y_pred = Dense(6, activation = 'sigmoid')(x)

    return Model(inputs = base_model.input, outputs = y_pred)

In [0]:
# Submission Placeholder
submission_predictions = []

# Multilabel stratified shuffle split; only the first of the 10 splits is used.
msss = MultilabelStratifiedShuffleSplit(n_splits = 10, test_size = TEST_SIZE, random_state = SEED)
X = train_df.index
Y = train_df.Label.values

# Get train and validation indices
msss_splits = next(msss.split(X, Y))
train_idx = msss_splits[0]
valid_idx = msss_splits[1]

In [0]:
# Loop through Folds of Multi Label Stratified Split
#for epoch, msss_splits in zip(range(0, 9), msss.split(X, Y)): 
#    # Get train and test index
#    train_idx = msss_splits[0]
#    valid_idx = msss_splits[1]
LR = 0.00051
for epoch in range(0, 7):
    print('=========== EPOCH {}'.format(epoch))

    # Shuffle Train data
    np.random.shuffle(train_idx)
    print(train_idx[:5])    
    print(valid_idx[:5])

    # Create Data Generators for Train and Valid
    data_generator_train = TrainDataGenerator(train_df.iloc[train_idx], 
                                                train_df.iloc[train_idx], 
                                                TRAIN_BATCH_SIZE, 
                                                SHAPE,
                                                augment = True)
    data_generator_val = TrainDataGenerator(train_df.iloc[valid_idx], 
                                            train_df.iloc[valid_idx], 
                                            VALID_BATCH_SIZE, 
                                            SHAPE,
                                            augment = False)

    # Create Model
    model = create_model()
    
    # Full Training Model: make every layer below the output layer trainable
    for base_layer in model.layers[:-1]:
        base_layer.trainable = True
    TRAIN_STEPS = int(len(data_generator_train) / 6)

    # Resume from the latest checkpoint on Drive, if one exists.
    # Loading once is enough; it does not need to run for every layer.
    if os.path.exists(filepath + "model.h5"):
        model.load_weights(filepath + "model.h5")

    model.compile(optimizer = Adam(LR), 
                  loss = 'binary_crossentropy',
                  metrics = ['acc'])
    
    # Train Model
    model.fit_generator(generator = data_generator_train,
                        validation_data = data_generator_val,
                        steps_per_epoch = TRAIN_STEPS,
                        epochs = 1,
                        callbacks = [ModelCheckpointFull('model.h5')],
                        verbose = 1)
    
    # From the second epoch on (epoch >= 1), predict on the test set after each epoch
    if epoch >= 1:
        preds = predictions(test_df, model)
        submission_predictions.append(preds)

[125803 546040 567989 379531 599105]
[ 0 17 27 30 44]
Epoch 1/1

Epoch 00001: saving model to /content/drive/My Drive/model.h5
[672115 591959 279174 579813 223156]
[ 0 17 27 30 44]
Epoch 1/1

Epoch 00001: saving model to /content/drive/My Drive/model.h5
[664782 378233 424470 674642  50041]
[ 0 17 27 30 44]
Epoch 1/1

Epoch 00001: saving model to /content/drive/My Drive/model.h5
[470367 389402  56407 448635 350856]
[ 0 17 27 30 44]
Epoch 1/1

Epoch 00001: saving model to /content/drive/My Drive/model.h5
[202010 460334 372411 591705 233296]
[ 0 17 27 30 44]
Epoch 1/1

In [0]:
test_df.iloc[:, :] = np.average(submission_predictions, axis = 0, weights = [2**i for i in range(len(submission_predictions))])
test_df = test_df.stack().reset_index()
test_df.insert(loc = 0, column = 'ID', value = test_df['Image'].astype(str) + "_" + test_df['Diagnosis'])
test_df = test_df.drop(["Image", "Diagnosis"], axis=1)
test_df.to_csv('submission.csv', index = False)
print(test_df.head(12))

                               ID     Label
0                ID_000012eaf_any  0.047340
1           ID_000012eaf_epidural  0.001187
2   ID_000012eaf_intraparenchymal  0.010457
3   ID_000012eaf_intraventricular  0.002544
4       ID_000012eaf_subarachnoid  0.008325
5           ID_000012eaf_subdural  0.030030
6                ID_0000ca2f6_any  0.006918
7           ID_0000ca2f6_epidural  0.000246
8   ID_0000ca2f6_intraparenchymal  0.001239
9   ID_0000ca2f6_intraventricular  0.001386
10      ID_0000ca2f6_subarachnoid  0.001098
11          ID_0000ca2f6_subdural  0.002256
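
The weights `2**i` in the averaging step make each later epoch count twice as much as the one before it. A small sketch with made-up predictions (the numbers are illustrative, not model output):

```python
import numpy as np

# Two hypothetical epochs of predictions for one image and two labels.
epoch_preds = [np.array([[0.0, 1.0]]), np.array([[0.3, 0.4]])]

# Same weighting scheme as the notebook: 1 for the first entry, 2 for the next.
weights = [2**i for i in range(len(epoch_preds))]

# Weighted mean across epochs: (1 * first + 2 * second) / 3
blended = np.average(epoch_preds, axis=0, weights=weights)
```

Here the blended prediction lands closer to the second epoch's values, since that epoch carries two thirds of the total weight.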


In [0]:

!cp submission.csv drive/My\ Drive/