# Goal

**In this competition you will predict the genetic subtype of glioblastoma from MRI (magnetic resonance imaging) scans: the model is trained and tested to detect the presence of MGMT promoter methylation.**  
MGMT promoter methylation is a genetic biomarker of glioblastoma that is needed when planning brain tumor treatment.  
**This competition is not about predicting whether a brain tumor is present!**

The model should output the probability that MGMT promoter methylation is present.  

## Ref

[Explanation in Japanese](https://www.kaggle.com/chumajin/brain-tumor-eda-for-starter-version)   
[Yaroslav Isaienkov](https://www.kaggle.com/ihelon/brain-tumor-eda-with-animations-and-modeling) This notebook is largely based on his notebook.  
[Ayush Thakur](https://www.kaggle.com/ayuraj/brain-tumor-eda-and-interactive-viz-with-w-b)


# 🔍EDA

### Library

In [None]:
import os
import json
import glob
import random
import collections

import numpy as np
import pandas as pd
import cv2
import matplotlib.pyplot as plt
import seaborn as sns
from tqdm import tqdm


import pydicom

In [None]:
#import wandb
#wandb.login()

### Data Visualization

In [None]:
train_df = pd.read_csv("../input/rsna-miccai-brain-tumor-radiogenomic-classification/train_labels.csv")
train_df

The dataset consists of 585 patients, each identified by a unique ID, `BraTS21ID`.  
* Class 0 = MGMT promoter methylation is absent.  
* Class 1 = MGMT promoter methylation is present.

In [None]:
plt.figure(figsize=(5,5))
sns.countplot(data=train_df, x="MGMT_value")

Look at a single DICOM (.dcm) file

In [None]:
dataset = pydicom.filereader.dcmread("../input/rsna-miccai-brain-tumor-radiogenomic-classification/train/00000/FLAIR/Image-110.dcm")
img = dataset.pixel_array # Dataset.pixel_array returns a numpy.ndarray

fig, ax = plt.subplots()
ax.imshow(img, cmap="gray")
ax.set_axis_off()
print('Shape of data: ', img.shape)
plt.show()

#### Look at the number of files (.dcm)

In [None]:
filenames = glob.glob('../input/rsna-miccai-brain-tumor-radiogenomic-classification/train/*/*/*')
print(f'Total number of files: {len(filenames)}')

#### Look at the distribution of files per scan type

In [None]:
label_dict = {
    'FLAIR': [],
    'T1w': [],
    'T1wCE': [],
    'T2w': []
}

for filename in tqdm(filenames):
    scan = filename.split('/')[-2]
    if scan=="FLAIR":
        label_dict["FLAIR"].append(filename)
    elif scan=="T1w":
        label_dict["T1w"].append(filename)
    elif scan=="T1wCE":
        label_dict["T1wCE"].append(filename)
    elif scan=="T2w":
        label_dict["T2w"].append(filename)
        
print('Size of FLAIR scan: {}, T1w scan: {}, T1wCE scan: {}, T2W scan: {}'.format(len(label_dict["FLAIR"]),
                                                                                  len(label_dict["T1w"]),
                                                                                  len(label_dict["T1wCE"]),
                                                                                  len(label_dict["T2w"])))
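
The same per-scan-type count can be written more compactly with `collections.Counter`, which is already imported above. This is only a sketch: the paths below are made-up examples that follow the `train/<patient>/<scan type>/<file>` layout, not real files from the dataset.

```python
import collections

# Hypothetical paths that mimic the competition folder layout
paths = [
    "train/00000/FLAIR/Image-1.dcm",
    "train/00000/FLAIR/Image-2.dcm",
    "train/00000/T1w/Image-1.dcm",
    "train/00002/T2w/Image-7.dcm",
]

# The scan type is the second-to-last path component
counts = collections.Counter(p.split("/")[-2] for p in paths)

for scan_type in ("FLAIR", "T1w", "T1wCE", "T2w"):
    print(scan_type, counts[scan_type])  # a missing key counts as 0
```

With the real `filenames` list, `collections.Counter(f.split('/')[-2] for f in filenames)` should give the same numbers as the loop above.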

In [None]:
CONFIG = {'IMG_SIZE': 224, 
          'NUM_FRAMES': 14,
          'competition': 'rsna-miccai-brain', 
          '_wandb_kernel': 'ayut'}

In [None]:
import wandb  # needed here because the wandb cell near the top is commented out

run = wandb.init(project='brain-tumor-viz', config=CONFIG)
# Take the counts from label_dict (computed above) rather than hard-coding them
data = [[scan, len(paths)] for scan, paths in label_dict.items()]
table = wandb.Table(data=data, columns = ["Scan Type", "Num Files"])
wandb.log({"my_bar_chart_id" : wandb.plot.bar(table, "Scan Type", "Num Files", title="Scan Types vs Number of Dicom files")})

https://wandb.ai/hondykaito/brain-tumor-viz?workspace=user-hondykaito

![img](https://i.ibb.co/QNjHyQd/W-B-Chart-7-16-2021-3-49-04-AM.png)

Check the size of the label table

In [None]:
train_df.shape

#### Types of MRI scan  
Each case consists of four structural multi-parametric MRI (mpMRI) scans.  
* Fluid Attenuated Inversion Recovery (FLAIR)  
    - FLAIR can be roughly thought of as a T2 image in which water is also dark, which makes lesions easier to find.  
* T1-weighted pre-contrast (T1w) 
* T1-weighted post-contrast (T1Gd, stored in the `T1wCE` folder)
    - T1Gd is T1 imaging with a contrast agent; it is the sequence that best shows the location, size, and shape of the mass.  
* T2-weighted (T2, stored in the `T2w` folder)
    - Water appears white and lesions appear white, so it is suitable for lesion evaluation.  

### Visualize dcm images

### Ref
[Preprocessing DICOM](https://www.kaggle.com/c/rsna-miccai-brain-tumor-radiogenomic-classification/discussion/253000)

In [None]:
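# DICOM to PNG dataset (128 GB -> 5.2 GB)
def load_dicom(path):
    """Read one DICOM slice and min-max scale it to uint8 (0-255)."""
    dicom = pydicom.dcmread(path)  # dcmread replaces the deprecated read_file
    data = dicom.pixel_array
    data = data - np.min(data)  # shift so the minimum becomes 0
    if np.max(data) != 0:  # skip the division for blank slices
        data = data / np.max(data)
    data = (data * 255).astype(np.uint8)
    return data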

def visualize_sample(brats21id, slice_i, mgmt_value, types=("FLAIR", "T1w", "T1wCE", "T2w")):
    plt.figure(figsize=(16, 5))
    patient_path = os.path.join("../input/rsna-miccai-brain-tumor-radiogenomic-classification/train/", str(brats21id).zfill(5))
    for i, t in enumerate(types, 1):
        t_paths = sorted(glob.glob(os.path.join(patient_path, t, "*")),
                        key=lambda x: int(x[:-4].split("-")[-1]))
        data = load_dicom(t_paths[int(len(t_paths) * slice_i)])  # slice_i is a fraction of the series depth (0.5 = middle slice)
        plt.subplot(1, 4, i)
        plt.imshow(data, cmap="gray")
        plt.title(f"{t}", fontsize=16)
        plt.axis("off")
        
    plt.suptitle(f"MGMT_value: {mgmt_value}", fontsize=18)
    plt.show()
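
The min-max scaling that `load_dicom` is meant to perform can be checked on a small synthetic array, without reading any DICOM file. The helper name `min_max_to_uint8` and the sample values are assumptions made for this sketch.

```python
import numpy as np

def min_max_to_uint8(pixels):
    """Scale an array to the 0-255 range (hypothetical helper mirroring load_dicom)."""
    pixels = pixels.astype(np.float64)
    pixels = pixels - pixels.min()   # shift so the minimum becomes 0
    peak = pixels.max()
    if peak != 0:                    # blank (constant) slices stay all zero
        pixels = pixels / peak       # scale so the maximum becomes 1
    return (pixels * 255).astype(np.uint8)

# Synthetic 2x2 "slice" with arbitrary raw intensities
raw = np.array([[100, 300], [500, 900]], dtype=np.int16)
scaled = min_max_to_uint8(raw)
print(scaled.dtype, scaled.min(), scaled.max())
```

Because each slice is scaled by its own minimum and maximum, intensities are not comparable between slices after this step.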

In [None]:
for i in random.sample(range(train_df.shape[0]), 10):
    _brats21id = train_df.iloc[i]["BraTS21ID"]
    _mgmt_value = train_df.iloc[i]["MGMT_value"]
    visualize_sample(brats21id = _brats21id, mgmt_value=_mgmt_value, slice_i=0.5)

### Observations
* Sometimes no tumor can be seen in the images.
    * Is the tumor really missing, or is it just not visible because only a single 2D slice of the 3D volume is shown?
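
One way to check whether the tumor simply lies outside the middle slice is to look at several depths instead of only `slice_i=0.5`. The sketch below only computes the slice indices; `fraction_to_index` is a hypothetical helper and the series length of 120 is an assumed value.

```python
def fraction_to_index(n_slices, fraction):
    """Map a relative depth in [0, 1] to a valid slice index (hypothetical helper)."""
    index = int(n_slices * fraction)
    # Clamp so that fraction=1.0 does not run past the last slice
    return min(max(index, 0), n_slices - 1)

n_slices = 120                       # assumed number of slices in one series
depths = [0.3, 0.4, 0.5, 0.6, 0.7]   # relative depths to inspect
indices = [fraction_to_index(n_slices, f) for f in depths]
print(indices)
```

Each depth could then be passed as `slice_i` to `visualize_sample` to view the same patient at different positions.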

# 🎩Model

### Library

In [None]:
package_path =  "../input/efficientnet-pytorch/EfficientNet-PyTorch/EfficientNet-PyTorch-master/"

In [None]:
import sys
import time
sys.path.append(package_path)

import torch
from torch import nn
from torch.utils import data as torch_data
from sklearn import model_selection as sk_model_selection
from torch.nn import functional as torch_functional
import efficientnet_pytorch

# Optional cross-validation splitters
#from sklearn.model_selection import StratifiedKFold
#from sklearn.model_selection import KFold

In [None]:
# Keep results reproducible
def set_seed(seed):
    random.seed(seed)
    os.environ["PYTHONHASHSEED"] = str(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    if torch.cuda.is_available():
        torch.cuda.manual_seed_all(seed)
        torch.backends.cudnn.deterministic = True # use deterministic cuDNN algorithms
    
set_seed(42)

In [None]:
df = pd.read_csv("../input/rsna-miccai-brain-tumor-radiogenomic-classification/train_labels.csv")

df_train, df_valid = sk_model_selection.train_test_split(
    df, 
    test_size=0.2, 
    random_state=42, 
    stratify=df["MGMT_value"],
)

In [None]:
class DataRetriever(torch_data.Dataset):
    def __init__(self, paths, targets):
        self.paths = paths
        self.targets = targets
        
    def __len__(self):
        return len(self.paths)
    
    # arrange later 
    def __getitem__(self, index):
        _id = self.paths[index]
        patient_path = f"../input/rsna-miccai-brain-tumor-radiogenomic-classification/train/{str(_id).zfill(5)}/"
        channels =  []
        for t in ("FLAIR", "T1w", "T1wCE"):# "T2w"
            t_paths = sorted(glob.glob(os.path.join(patient_path, t, "*")),
                        key=lambda x: int(x[:-4].split("-")[-1]))
            
            # start, end = int(len(t_paths) * 0.475), int(len(t_paths) * 0.525)
            x = len(t_paths)
            if x < 10:
                r = range(x)
            else:
                d = x // 10
                r = range(d, x - d, d)
            channel = []
            # for i in range(start, end + 1):
            for i in r:
                # Resize every slice to 256x256 and rescale pixels from 0-255 to 0-1
                channel.append(cv2.resize(load_dicom(t_paths[i]), (256, 256)) / 255)
                
            channel = np.mean(channel, axis=0) # average over the sampled slices -> one 256x256 image
            channels.append(channel)
            
        
        y = torch.tensor(self.targets[index], dtype=torch.float)
        
        # np.stack avoids the slow conversion of a list of arrays into a tensor
        return {"X": torch.tensor(np.stack(channels)).float(), "y": y}    
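
The slice-sampling rule inside `__getitem__` can be isolated to see which slices are averaged. This is a sketch of the same index logic only; `sample_slice_indices` is a name introduced here, not part of the notebook.

```python
def sample_slice_indices(n_slices):
    """Return the slice indices that the retriever averages (hypothetical helper)."""
    if n_slices < 10:
        # Very short series: use every slice
        return list(range(n_slices))
    step = n_slices // 10
    # Skip roughly the first and last tenth, then take every `step`-th slice
    return list(range(step, n_slices - step, step))

print(sample_slice_indices(7))
print(sample_slice_indices(100))
```

For long series this keeps roughly eight to eleven evenly spaced slices from the interior of the volume, which are then averaged into one image per scan type.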

In [None]:
train_data_retriever = DataRetriever(
    df_train["BraTS21ID"].values,
    df_train["MGMT_value"].values
)

valid_data_retriever = DataRetriever(
    df_valid["BraTS21ID"].values,
    df_valid["MGMT_value"].values
)

In [None]:
plt.figure(figsize=(16, 6))
for i in range(3):
    plt.subplot(1, 3, i + 1)
    plt.imshow(train_data_retriever[13]["X"].numpy()[i], cmap="gray")

* The margin is black (pixel = 0) and the brain is centered in the image.

In [None]:
class Model(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = efficientnet_pytorch.EfficientNet.from_name("efficientnet-b0")
        n_features = self.net._fc.in_features
        self.net._fc = nn.Linear(in_features=n_features, out_features=1, bias=True)
    
    def forward(self, x):
        out = self.net(x)
        return out

In [None]:
class LossMeter:
    def __init__(self):
        self.avg = 0
        self.n = 0
    
    def update(self, val):
        self.n += 1
        # incremental update
        self.avg = val / self.n + (self.n - 1) / self.n * self.avg
        
class AccMeter:
    def __init__(self):
        self.avg = 0
        self.n = 0
    
    def update(self, y_true, y_pred):
        y_true = y_true.cpu().numpy().astype(int)
        y_pred = y_pred.cpu().numpy() >= 0
        last_n = self.n
        self.n += len(y_true)
        true_count = np.sum(y_true == y_pred)
        # incremental update
        self.avg = true_count / self.n + last_n / self.n * self.avg
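
The incremental update in `LossMeter` rewrites the mean of n values as `val / n + (n - 1) / n * previous_avg`, so no list of past losses has to be stored. The sketch below repeats that update in a standalone class (`RunningMean`, a name chosen for this example) and compares it with the ordinary mean on made-up loss values.

```python
class RunningMean:
    """Standalone copy of the LossMeter update rule, for illustration only."""

    def __init__(self):
        self.avg = 0.0
        self.n = 0

    def update(self, val):
        self.n += 1
        # new_avg = val / n + (n - 1) / n * old_avg
        self.avg = val / self.n + (self.n - 1) / self.n * self.avg

losses = [0.9, 0.7, 0.65, 0.5]   # made-up per-step losses
meter = RunningMean()
for loss in losses:
    meter.update(loss)

direct_mean = sum(losses) / len(losses)
print(meter.avg, direct_mean)
```

`AccMeter` uses the same idea, but weights each update by the batch size instead of counting every batch as one sample.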

In [None]:
# Trainer

class Trainer:
    # do train
    def __init__(self, model, device, optimizer, criterion, loss_meter, score_meter):
        self.model = model
        self.device = device
        self.optimizer = optimizer
        self.criterion = criterion
        self.loss_meter = loss_meter
        self.score_meter = score_meter
        
        self.best_valid_score = -np.inf
        self.n_patience = 0 # epochs without improvement, used for early stopping
        
        self.messages = {
            "epoch": "[{} epoch {}] loss: {:.5f}, score: {:.5f}, time: {} s",
            "checkpoint": "The score improved from {:.5f} to {:.5f}. Save model to '{}'",
            "patience": "\nValid score didn't improve last {} epochs."
        }
        
    def fit(self, epochs, train_loader, valid_loader, save_path, patience):
        for n_epoch in range(1, epochs + 1):
            self.info_message("EPOCH: {}", n_epoch)
            
            train_loss, train_score, train_time = self.train_epoch(train_loader)
            valid_loss, valid_score, valid_time = self.valid_epoch(valid_loader)
            
            self.info_message(
                self.messages["epoch"], "Train", n_epoch, train_loss, train_score, train_time
            )
            self.info_message(
                self.messages["epoch"], "Valid", n_epoch, valid_loss, valid_score, valid_time
            )
            
            # Save a checkpoint only when the validation score improves
            if self.best_valid_score < valid_score:
                self.info_message(
                    self.messages["checkpoint"], self.best_valid_score, valid_score, save_path
                )
                self.best_valid_score = valid_score
                self.save_model(n_epoch, save_path)
                self.n_patience = 0
            else:
                self.n_patience += 1
            
            if self.n_patience >= patience:
                self.info_message(self.messages["patience"], patience)
                break
            
                
    def train_epoch(self, train_loader):
        self.model.train()
        t = time.time()
        train_loss = self.loss_meter()
        train_score = self.score_meter()
        
        for step, batch in enumerate(train_loader, 1):
            X = batch["X"].to(self.device)
            targets = batch["y"].to(self.device)
            self.optimizer.zero_grad()
            outputs = self.model(X).squeeze(1)
            
            loss = self.criterion(outputs, targets)
            loss.backward()
            
            train_loss.update(loss.detach().item())
            train_score.update(targets, outputs.detach())
            
            self.optimizer.step() # update weight
            
            _loss, _score = train_loss.avg, train_score.avg
            message = 'Train Step {}/{}, train_loss: {:.5f}, train_score: {:.5f}'
            self.info_message(message, step, len(train_loader), _loss, _score, end="\r")
                   
        return train_loss.avg, train_score.avg, int(time.time() - t)
            
    def valid_epoch(self, valid_loader):
        self.model.eval()
        t = time.time()
        valid_loss = self.loss_meter()
        valid_score = self.score_meter()
        
        for step, batch in enumerate(valid_loader, 1):
            with torch.no_grad():
                X = batch["X"].to(self.device)
                targets = batch["y"].to(self.device)

                outputs = self.model(X).squeeze(1)
                loss = self.criterion(outputs, targets)

                valid_loss.update(loss.detach().item())
                valid_score.update(targets, outputs)
                
            _loss, _score = valid_loss.avg, valid_score.avg
            message = 'Valid Step {}/{}, valid_loss: {:.5f}, valid_score: {:.5f}' 
            self.info_message(message, step, len(valid_loader), _loss, _score, end="\r")
            
        return valid_loss.avg, valid_score.avg, int(time.time() - t)
        
    def save_model(self, n_epoch, save_path):
        torch.save(
        {
            "model_state_dict" : self.model.state_dict(),
            "optimizer_state_dict" : self.optimizer.state_dict(),
            "best_valid_score" : self.best_valid_score,
            "n_epoch" : n_epoch,
        },
        save_path
        )
    @staticmethod        
    def info_message(message, *args, end="\n"):
        print(message.format(*args), end=end)

In [None]:
# Training
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Change later
train_data_retriever = DataRetriever(
    df_train["BraTS21ID"].values,
    df_train["MGMT_value"].values
)

valid_data_retriever = DataRetriever(
    df_valid["BraTS21ID"].values, 
    df_valid["MGMT_value"].values
)
    
train_loader = torch_data.DataLoader(train_data_retriever, batch_size=32, shuffle=True, num_workers=8)
valid_loader = torch_data.DataLoader(valid_data_retriever, batch_size=32, shuffle=False, num_workers=8)
    
model = Model()
model.to(device)
    
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
criterion = torch_functional.binary_cross_entropy_with_logits
    
trainer = Trainer(model, device, optimizer, criterion, LossMeter, AccMeter)
    
# fit() returns nothing; arguments: 10 epochs, checkpoint path, patience of 100 epochs
trainer.fit(10, train_loader, valid_loader, "best-model-0.pth", 100)

In [None]:
models = []
for i in range(1):
    model = Model()
    model.to(device)
    
    checkpoint = torch.load(f"best-model-{i}.pth", map_location=device)  # also loads on CPU-only machines
    model.load_state_dict(checkpoint["model_state_dict"])
    model.eval()
    
    models.append(model)

### prepare test data

In [None]:
class DataRetriever(torch_data.Dataset):
    def __init__(self, paths):
        self.paths = paths
          
    def __len__(self):
        return len(self.paths)
    
    def __getitem__(self, index):
        _id = self.paths[index]
        patient_path = f"../input/rsna-miccai-brain-tumor-radiogenomic-classification/test/{str(_id).zfill(5)}/"
        channels = []
        for t in ("FLAIR", "T1w", "T1wCE"): # "T2w"
            t_paths = sorted(
                glob.glob(os.path.join(patient_path, t, "*")), 
                key=lambda x: int(x[:-4].split("-")[-1]),
            )
            # start, end = int(len(t_paths) * 0.475), int(len(t_paths) * 0.525)
            x = len(t_paths)
            if x < 10:
                r = range(x)
            else:
                d = x // 10
                r = range(d, x - d, d)
                
            channel = []
            # for i in range(start, end + 1):
            for i in r:
                channel.append(cv2.resize(load_dicom(t_paths[i]), (256, 256)) / 255)
            channel = np.mean(channel, axis=0)
            channels.append(channel)
        
        # np.stack avoids the slow conversion of a list of arrays into a tensor
        return {"X": torch.tensor(np.stack(channels)).float(), "id": _id}
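
Both `DataRetriever` classes sort the slice files with the key `int(x[:-4].split("-")[-1])`. A plain string sort would place `Image-10.dcm` before `Image-2.dcm`, so the key extracts the slice number instead. A small sketch with made-up file names:

```python
def slice_number(path):
    """Extract the integer N from a path ending in 'Image-N.dcm'."""
    return int(path[:-4].split("-")[-1])  # drop '.dcm', keep the part after the last '-'

names = ["Image-10.dcm", "Image-2.dcm", "Image-1.dcm"]  # hypothetical file names

lexicographic = sorted(names)
numeric = sorted(names, key=slice_number)

print(lexicographic)
print(numeric)
```

This relies on every file following the `Image-N.dcm` naming pattern; any other name would raise a `ValueError` in `int()`.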

In [None]:
submission = pd.read_csv("../input/rsna-miccai-brain-tumor-radiogenomic-classification/sample_submission.csv")

test_data_retriever = DataRetriever(
    submission["BraTS21ID"].values,
)

test_loader = torch_data.DataLoader(
    test_data_retriever,
    batch_size = 4,
    shuffle = False,
    num_workers = 8,
)

### Inference(test)

In [None]:
y_pred = []
ids = []

for e, batch in enumerate(test_loader):
    print(f"{e}/ {len(test_loader)}", end="\r")
    with torch.no_grad():
        tmp_pred = np.zeros((batch["X"].shape[0], ))
        for model in models:
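            # squeeze(1) keeps the batch dimension even when the last batch holds one sample
            tmp_res = torch.sigmoid(model(batch["X"].to(device))).squeeze(1).cpu().numpy()
            tmp_pred += tmp_res
        y_pred.extend(tmp_pred / len(models))  # average the probabilities over all models
        ids.extend(batch["id"].numpy().tolist())  # the default collate function turns the IDs into a tensor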
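
The inference loop turns each model's raw output (a logit) into a probability with a sigmoid and combines the models. A minimal sketch of that step in plain Python, assuming the ensemble prediction is the mean of the per-model probabilities; the logit values are made up.

```python
import math

def sigmoid(z):
    """Map a logit to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def ensemble_probability(logits_per_model):
    """Average the sigmoid outputs of several models for one sample."""
    probs = [sigmoid(z) for z in logits_per_model]
    return sum(probs) / len(probs)

single = ensemble_probability([0.0])          # one model, logit 0
mixed = ensemble_probability([2.0, -2.0])     # two models that disagree symmetrically
print(single, mixed)
```

With a single model, as in this notebook (`range(1)`), the average is simply that model's sigmoid output.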

### submission

In [None]:
# Build the submission file from the predicted probabilities
submission = pd.DataFrame({"BraTS21ID": ids, "MGMT_value": y_pred})
submission.to_csv("submission.csv", index=False)

In [None]:
plt.figure(figsize=(5, 5))
plt.hist(submission["MGMT_value"]);

In [None]:
submission