<a href="https://colab.research.google.com/github/Lelan30/Open-Projects/blob/main/Melanoma_Pytorch.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **About this Competition**
**What should I expect the data format to be?**
The images are provided in DICOM format, a widely used medical imaging format that stores both the image and its metadata. DICOM files can be read with commonly available libraries such as *pydicom*.

Images are also provided in JPEG and TFRecord format (in the *jpeg* and *tfrecords* directories, respectively). Images in TFRecord format have been resized to a uniform 1024x1024.

Metadata is also provided outside of the DICOM format, in CSV files. See the Columns section for a description.

**What am I predicting?**

You are predicting a binary target for each image. Your model should predict the probability (floating point) between 0.0 and 1.0 that the lesion in the image is malignant (the target). In the training data, train.csv, the value 0 denotes benign, and 1 indicates malignant.

**Files**
*   train.csv - the training set
*   test.csv - the test set
*   sample_submission.csv - a sample submission file in the correct format

**Columns**
*   image_name - unique identifier, points to filename of related DICOM image
*   patient_id - unique patient identifier
*   sex - the sex of the patient (when unknown, will be blank)
*   age_approx - approximate patient age at time of imaging
*   anatom_site_general_challenge - location of imaged site
*   diagnosis - detailed diagnosis information (train only)
*   benign_malignant - indicator of malignancy of imaged lesion
*   target - binarized version of the target variable
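
As a small illustration of the last two columns, the sketch below rebuilds a 0/1 target from `benign_malignant` labels. The toy DataFrame and the rule that `target` is 1 exactly when the label is `malignant` are illustrative assumptions, not values read from the competition files.

```python
import pandas as pd

# Toy rows standing in for train.csv (values are made up for illustration)
toy = pd.DataFrame({
    'image_name': ['img_a', 'img_b', 'img_c'],
    'benign_malignant': ['benign', 'malignant', 'benign'],
})

# Assumption: target is 1 for 'malignant' and 0 for 'benign'
toy['target'] = (toy['benign_malignant'] == 'malignant').astype(int)
print(toy)
```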

In [23]:
from google.colab import drive
drive.mount('/content/gdrive')

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).


In [24]:
!pip install efficientnet_pytorch torchtoolbox

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [25]:
import torch
import torchvision
import torch.nn.functional as F
import torch.nn as nn
import torchtoolbox.transform as transforms
from torch.utils.data import Dataset, DataLoader, Subset
from torch.optim.lr_scheduler import ReduceLROnPlateau
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import StratifiedKFold, GroupKFold, KFold
import pandas as pd
import numpy as np
import gc
import os
import cv2
import time
import datetime
import warnings
import random
import matplotlib.pyplot as plt
import seaborn as sns
from efficientnet_pytorch import EfficientNet
%matplotlib inline

In [26]:
warnings.simplefilter('ignore')

# Seed every pseudo-random number generator for reproducibility
def seed_everything(seed):
  random.seed(seed)
  os.environ['PYTHONHASHSEED'] = str(seed)
  np.random.seed(seed)
  torch.manual_seed(seed)
  torch.cuda.manual_seed(seed)
  torch.backends.cudnn.deterministic = True
  # benchmark mode auto-tunes cuDNN kernels and can vary between runs, so disable it
  torch.backends.cudnn.benchmark = False

seed_everything(47)
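
A quick way to see what seeding buys us is to draw numbers twice with the same seed. The sketch below covers only the `random` and `numpy` generators; `draw_numbers` is a hypothetical helper, and the torch and cuDNN settings are left out so the snippet runs without a GPU.

```python
import random
import numpy as np

def draw_numbers(seed):
    # Re-seed both generators, then draw a few values from each
    random.seed(seed)
    np.random.seed(seed)
    return [random.random() for _ in range(3)], np.random.rand(3)

py_a, np_a = draw_numbers(47)
py_b, np_b = draw_numbers(47)
py_c, np_c = draw_numbers(48)

# Same seed should reproduce the draws; a different seed should not
print(py_a == py_b, np.array_equal(np_a, np_b))
print(py_a == py_c)
```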

In [27]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

In [40]:
class MelanomaDataset(Dataset):
  def __init__(self, df: pd.DataFrame, imfolder: str, train: bool = True,
               transforms = None, meta_features = None):
    '''
    Class initiation args:
    df (pd.DataFrame): DataFrame with data description
    imfolder (str): folder with images
    train (bool): flag of whether a training dataset is being initialized or testing one
    transforms: image transformation method to be applied
    meta_features (list): list of features with meta information, such as sex and age
    '''
    self.df = df
    self.imfolder = imfolder
    self.train = train
    self.transforms = transforms
    self.meta_features = meta_features

  def __getitem__(self, index):
    im_path = os.path.join(self.imfolder, 
                           self.df.iloc[index]['image_name'] + '.jpg')
    x = cv2.imread(im_path)
    meta = np.array(self.df.iloc[index][self.meta_features].values,
                    dtype=np.float32)
    if self.transforms:
      x = self.transforms(x)
    if self.train:
      y = self.df.iloc[index]['target']
      return (x, meta), y
    else:
      return (x, meta)

  def __len__(self):
      return len(self.df)

class Net(nn.Module):
  def __init__(self, arch, n_meta_features: int):
    super(Net, self).__init__()
    self.arch = arch
    # Replace the backbone's classification head with a 500-feature layer
    if 'ResNet' in str(arch.__class__):
      self.arch.fc = nn.Linear(in_features=512,
                               out_features=500,
                               bias=True)
    if 'EfficientNet' in str(arch.__class__):
      self.arch._fc = nn.Linear(in_features=1280,
                                out_features=500,
                                bias=True)
    # Small MLP for the tabular metadata; its output has 250 features
    self.meta = nn.Sequential(nn.Linear(n_meta_features, 500),
                              nn.BatchNorm1d(500),
                              nn.ReLU(),
                              nn.Dropout(p=0.2),
                              nn.Linear(500, 250),
                              nn.BatchNorm1d(250),
                              nn.ReLU(),
                              nn.Dropout(p=0.2))
    self.output = nn.Linear(500 + 250, 1)

  def forward(self, inputs):
    '''
    No sigmoid in forward because we use BCEWithLogitsLoss,
    which applies the sigmoid itself when computing the loss.
    '''
    x, meta = inputs
    cnn_features = self.arch(x)
    # The metadata branch must receive the metadata tensor, not the image
    meta_features = self.meta(meta)
    features = torch.cat((cnn_features, meta_features), dim=1)
    output = self.output(features)
    return output
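
The docstring in `forward` says the sigmoid is folded into the loss. The NumPy sketch below checks that idea on toy numbers: binary cross-entropy on sigmoid outputs matches a numerically stable formula applied directly to logits. This is a re-derivation for illustration, not PyTorch's own code, and the function names are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bce_on_probabilities(p, y):
    # Plain binary cross-entropy applied to probabilities
    return np.mean(-(y * np.log(p) + (1 - y) * np.log(1 - p)))

def bce_with_logits(z, y):
    # Numerically stable form that works on raw logits
    return np.mean(np.maximum(z, 0) - z * y + np.log1p(np.exp(-np.abs(z))))

logits = np.array([-2.0, 0.5, 3.0])   # made-up model outputs
labels = np.array([0.0, 1.0, 1.0])    # made-up targets

loss_a = bce_on_probabilities(sigmoid(logits), labels)
loss_b = bce_with_logits(logits, labels)
print(loss_a, loss_b)
```

Working on logits avoids taking `log` of a probability that has been rounded to exactly 0 or 1, which is why the model returns raw scores.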

In [29]:
class AdvancedHairAugumentation:
  '''
  Impose image of a hair to the target image

  Args:
    hairs (int): max number of hairs to impose
    hairs_folder (str): path to the folder with hairs images
  '''

  def __init__(self, hairs: int = 5, hairs_folder: str = ""):
    self.hairs = hairs
    self.hairs_folder = hairs_folder

  def __call__(self, img):
    '''
    Args:
      img (np.ndarray): OpenCV image (H x W x C) to draw hairs on.

    Returns:
      np.ndarray: Image with drawn hairs.
    '''
    n_hairs = random.randint(0, self.hairs)

    if not n_hairs:
      return img
    # target image height and width
    height, width, _ = img.shape
    hair_images = [im for im in os.listdir(self.hairs_folder) if 'png' in im]

    for _ in range(n_hairs):
      hair = cv2.imread(os.path.join(self.hairs_folder, random.choice(hair_images)))
      hair = cv2.flip(hair, random.choice([-1, 0, 1]))
      hair = cv2.rotate(hair, random.choice([0, 1, 2]))
      # hair image height and width
      h_height, h_width, _ = hair.shape
      # top-left corner of the region of interest (assumes the hair image fits inside img)
      roi_ho = random.randint(0, height - h_height)
      roi_wo = random.randint(0, width - h_width)
      roi = img[roi_ho:roi_ho + h_height, roi_wo:roi_wo + h_width]
      # creating a mask and inverse mask
      img2gray = cv2.cvtColor(hair, cv2.COLOR_BGR2GRAY)
      ret, mask = cv2.threshold(img2gray, 10, 255, cv2.THRESH_BINARY)
      mask_inv = cv2.bitwise_not(mask)
      # now black-out the area of hair in ROI (background)
      img_bg = cv2.bitwise_and(roi, roi, mask=mask_inv)
      # take only region of hair from image
      hair_fg = cv2.bitwise_and(hair, hair, mask=mask)
      # put hair in ROI and modify the target img
      dst = cv2.add(img_bg, hair_fg)
      img[roi_ho:roi_ho + h_height, roi_wo:roi_wo + h_width] = dst

    # Return only after all hairs have been drawn
    return img

  def __repr__(self):
    return f'{self.__class__.__name__}(hairs={self.hairs}, hairs_folder="{self.hairs_folder}")'
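
The masking logic in `__call__` can be hard to picture, so here is a NumPy-only sketch of the same idea on a toy grayscale patch. `np.where` stands in for the `cv2.bitwise_and` / `cv2.add` calls, and the pixel values are made up.

```python
import numpy as np

# Toy 4x4 grayscale "skin" patch and a "hair" overlay with one dark-gray stroke
roi = np.full((4, 4), 200, dtype=np.uint8)
hair = np.zeros((4, 4), dtype=np.uint8)
hair[1, :] = 60  # the stroke; everything else is black background

# Pixels brighter than the threshold belong to the hair
mask = hair > 10

# Keep the patch where there is no hair, take the overlay where there is
blended = np.where(mask, hair, roi)
print(blended)
```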


In [30]:
class DrawHair:
  '''
  Draw a random number of pseudo hairs

  Args:
    hairs (int): max number of hairs to draw
    width (tuple): possible width of hair in pixels
  '''
  def __init__(self, hairs:int = 4, width:tuple = (1,2)):
    self.hairs = hairs
    self.width = width

  def __call__(self, img):
    '''
    Args:
      img (np.ndarray): OpenCV image (H x W x C) to draw hairs on

    Returns:
      np.ndarray: Image with drawn hairs.
    '''
    if not self.hairs:
      return img

    # NumPy image shape is (height, width, channels); cv2 points are (x, y)
    height, width, _ = img.shape

    for _ in range(random.randint(0, self.hairs)):
      # The origin point of the line will always be in the top half of the image
      origin = (random.randint(0, width), random.randint(0, height // 2))
      # The end of the line
      end = (random.randint(0, width), random.randint(0, height))
      # color of the hair is black
      color = (0, 0, 0)
      cv2.line(img, origin, end, color,
               random.randint(self.width[0], self.width[1]))
      
    return img

  def __repr__(self):
    return f'{self.__class__.__name__}(hairs={self.hairs}, width={self.width})'

In [31]:
class Microscope:
  '''
  Cutting out the edges around the center circle of the image.
  Imitating a picture, taken through the microscope.

  Args:
    p (float): probability of applying an augmentation
  '''

  def __init__(self, p: float = 0.5):
    self.p = p

  def __call__(self, img):
    '''
    Args:
      img (np.ndarray): OpenCV image (H x W x C) to apply transformation to

    Returns:
      np.ndarray: Image with transformation.
    '''
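    if random.random() < self.p:
      circle = cv2.circle((np.ones(img.shape) * 255).astype(np.uint8), # Image placeholder
                          (img.shape[1]//2, img.shape[0]//2), # Center point (x, y) of circle
                          min(img.shape[:2])//2, # Radius; assumed value, missing in the original call
                          (0, 0, 0), -1) # Color and thickness (-1 fills the circle)
      # uint8 arithmetic wraps around: 0 - 255 -> 1 inside the circle, 255 - 255 -> 0 outside
      mask = circle - 255
      img = np.multiply(img, mask)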

    return img

  def __repr__(self):
    return f'{self.__class__.__name__}(p={self.p})'
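
For readers who want to see the microscope effect without OpenCV, the sketch below builds a centered circular mask with NumPy and blacks out everything outside it. The helper `circular_mask`, the 8x8 image size, and the radius are illustrative assumptions.

```python
import numpy as np

def circular_mask(height, width, radius):
    # Boolean mask that is True inside a centered circle
    rows, cols = np.ogrid[:height, :width]
    cy, cx = height // 2, width // 2
    return (rows - cy) ** 2 + (cols - cx) ** 2 <= radius ** 2

img = np.full((8, 8, 3), 255, dtype=np.uint8)   # toy all-white image
mask = circular_mask(8, 8, radius=3)

# Multiply by a 0/1 mask so pixels outside the circle become black
vignetted = img * mask[:, :, None].astype(np.uint8)
print(vignetted[:, :, 0])
```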

In [32]:
train_transform = transforms.Compose([
    AdvancedHairAugumentation(hairs_folder='/kaggle/input/melanoma_hairs'),
    transforms.RandomResizedCrop(size=256, scale=(0.8, 1.0)),
    transforms.RandomHorizontalFlip(),
    transforms.RandomVerticalFlip(),
    Microscope(p=0.5),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
])
test_transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
])
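
To make the `Normalize` step concrete, here is a NumPy sketch of per-channel standardization with the same mean and standard deviation values. It assumes a channels-first image already scaled to [0, 1], which is what `ToTensor` is expected to produce; the toy image is made up.

```python
import numpy as np

mean = np.array([0.485, 0.456, 0.406], dtype=np.float32)
std = np.array([0.229, 0.224, 0.225], dtype=np.float32)

# Toy image in channels-first layout (C, H, W) with values in [0, 1]
image = np.full((3, 2, 2), 0.5, dtype=np.float32)

# Broadcast the per-channel statistics over height and width
normalized = (image - mean[:, None, None]) / std[:, None, None]
print(normalized[:, 0, 0])
```

Train and test pipelines need identical statistics, otherwise the model sees differently scaled inputs at inference time.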

In [33]:
arch = EfficientNet.from_pretrained('efficientnet-b1')

Loaded pretrained weights for efficientnet-b1


In [34]:
train_df = pd.read_csv('train.csv') 
test_df = pd.read_csv('test.csv')

In [35]:
# One-Hot Encoding of anatom_site_general_challenge feature
concat = pd.concat([train_df['anatom_site_general_challenge'], 
                    test_df['anatom_site_general_challenge']], 
                   ignore_index=True)
'''
pd.get_dummies:
Each variable is converted in as many 0/1 variables as there are 
different values. Columns in the output are each named after a value; 
if the input is a DataFrame, the name of the original variable is 
prepended to the value.
'''
dummies = pd.get_dummies(concat, dummy_na=True, dtype=np.uint8, prefix='site')
train_df = pd.concat([train_df, dummies.iloc[:train_df.shape[0]]], axis=1)
# The test rows are the ones after the first len(train_df) rows of the combined frame
test_df = pd.concat([test_df, dummies.iloc[train_df.shape[0]:]
                     .reset_index(drop=True)], axis=1)

# Sex features
train_df['sex'] = train_df['sex'].map({'male':1, 'female':0})
test_df['sex'] = test_df['sex'].map({'male':1, 'female':0})
train_df['sex'] = train_df['sex'].fillna(-1)
test_df['sex'] = test_df['sex'].fillna(-1)

# Age features
train_df['age_approx'] /= train_df['age_approx'].max()
test_df['age_approx'] /= test_df['age_approx'].max()
train_df['age_approx'] = train_df['age_approx'].fillna(0)
test_df['age_approx'] = test_df['age_approx'].fillna(0)

train_df['patient_id'] = train_df['patient_id'].fillna(0)
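
The one-hot step encodes train and test together so both frames end up with the same dummy columns, then splits the result back by row position. A minimal sketch on toy frames (the column name `site` and its values are made up):

```python
import numpy as np
import pandas as pd

toy_train = pd.DataFrame({'site': ['torso', 'head/neck', np.nan]})
toy_test = pd.DataFrame({'site': ['torso', 'palms/soles']})

# Encode the combined column so both frames share the same dummy columns
combined = pd.concat([toy_train['site'], toy_test['site']], ignore_index=True)
dummies = pd.get_dummies(combined, dummy_na=True, dtype=np.uint8, prefix='site')

n_train = len(toy_train)
train_part = dummies.iloc[:n_train]                        # first n_train rows
test_part = dummies.iloc[n_train:].reset_index(drop=True)  # remaining rows

toy_train = pd.concat([toy_train, train_part], axis=1)
toy_test = pd.concat([toy_test, test_part], axis=1)
print(list(test_part.columns))
```

Note that `palms/soles` never occurs in the toy training frame, yet it still gets a column there because the encoding was fitted on the combined data.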

In [36]:
meta_features = ['sex', 'age_approx'] + [col for col in train_df.columns if 'site_' in col]
meta_features.remove('anatom_site_general_challenge')

In [41]:
test = MelanomaDataset(df=test_df,
                       imfolder='/kaggle/input/melanoma-external-malignant-256/test/test/', 
                       train=False,
                       transforms=train_transform,  # For TTA
                       meta_features=meta_features)

In [42]:
skf = GroupKFold(n_splits=5)
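
A group-aware splitter keeps all images of one patient on the same side of each split. The sketch below checks that property with `GroupKFold` on made-up patient IDs; the data is purely illustrative.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Toy data: 10 images belonging to 5 hypothetical patients
patient_ids = np.array(['p1', 'p1', 'p2', 'p2', 'p3', 'p3', 'p4', 'p4', 'p5', 'p5'])
X = np.zeros(len(patient_ids))

overlaps = []
for train_idx, val_idx in GroupKFold(n_splits=5).split(X, groups=patient_ids):
    # Count patients that appear in both the train and validation parts
    shared = set(patient_ids[train_idx]) & set(patient_ids[val_idx])
    overlaps.append(len(shared))

print(overlaps)
```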

In [None]:
# Number of epochs to run
epochs = 12
# Early stopping patience - for how many epochs with no improvements to wait
es_patience = 3
# Test Time Augmentation rounds
TTA = 3
# Out-Of-Folds predictions
oof = np.zeros((len(train_df), 1))
# Predictions for test set
preds = torch.zeros((len(test), 1), dtype=torch.float32, device=device)

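# Note: this KFold replaces the GroupKFold created above and ignores `groups`,
# so images of one patient can fall into both the train and validation folds
skf = KFold(n_splits=5, shuffle=True, random_state=47)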
for fold, (train_idx, val_idx) in enumerate(skf.split(X=np.zeros(len(train_df)),
                                                      y=train_df['target'],
                                                      groups=train_df['patient_id']
                                                      .tolist()), 1):
  print('=' * 20, 'Fold', fold, '=' * 20)
  # Path and filename to save model to
  model_path = f'model_{fold}.pth'
  # Best validation score within this fold
  best_val = 0
  # Current patience counter
  patience = es_patience
  arch = EfficientNet.from_pretrained('efficientnet-b1')
  model = Net(arch=arch, n_meta_features=len(meta_features))
  model = model.to(device)

  optim = torch.optim.Adam(model.parameters(), lr=0.001)
  scheduler = ReduceLROnPlateau(optimizer=optim, 
                                mode='max', 
                                patience=1,
                                verbose=True,
                                factor=0.2)
  criterion = nn.BCEWithLogitsLoss()

  train = MelanomaDataset(df=train_df.iloc[train_idx].reset_index(drop=True),
                          imfolder='/kaggle/input/melanoma-external-malignant-256/train/train/',
                          train=True,
                          transforms=train_transform,  # augmentations apply to training data only
                          meta_features=meta_features)
  val = MelanomaDataset(df=train_df.iloc[val_idx].reset_index(drop=True),
                        imfolder='/kaggle/input/melanoma-external-malignant-256/train/train/',
                        train=True,
                        transforms=test_transform,
                        meta_features=meta_features)
  train_loader = DataLoader(dataset=train, 
                            batch_size=64,
                            shuffle=True,
                            num_workers=2)
  val_loader = DataLoader(dataset=val,
                          batch_size=16,
                          shuffle=False,
                          num_workers=2)
  test_loader = DataLoader(dataset=test, 
                           batch_size=16, 
                           shuffle=False, 
                           num_workers=2)
  for epoch in range(epochs):
    start_time = time.time()
    correct = 0
    epoch_loss = 0
    model.train()

    for x, y in train_loader:
      x[0] = torch.tensor(x[0], device=device, dtype=torch.float32)
      x[1] = torch.tensor(x[1], device=device, dtype=torch.float32)
      y = torch.tensor(y, device=device, dtype=torch.float32)
      optim.zero_grad()
      z = model(x)
      loss = criterion(z, y.unsqueeze(1))
      # Backpropagate before the optimizer step; without this no gradients exist
      loss.backward()
      optim.step()
      # round off sigmoid to obtain predictions
      pred = torch.round(torch.sigmoid(z))
      # tracking number of correctly predicted samples
      correct += (pred.cpu() == y.cpu().unsqueeze(1)).sum().item()
      epoch_loss += loss.item()
    train_acc = correct / len(train_idx)
    # switch model to evaluation mode
    model.eval()
    val_preds = torch.zeros((len(val_idx), 1), 
                            dtype=torch.float32, 
                            device=device)
    # Do not calculate gradient since we are only predicting
    with torch.no_grad():
      # predicting on validation set
      for j, (x_val, y_val) in enumerate(val_loader):
        x_val[0] = torch.tensor(x_val[0], device=device, dtype=torch.float32)
        x_val[1] = torch.tensor(x_val[1], device=device, dtype=torch.float32)
        y_val = torch.tensor(y_val, device=device, dtype=torch.float32)
        z_val = model(x_val)
        val_pred = torch.sigmoid(z_val)
        val_preds[j*val_loader.batch_size:j*val_loader.batch_size + x_val[0].shape[0]] = val_pred
    # Metrics are computed once per epoch, after the whole validation set is predicted
    val_acc = accuracy_score(train_df.iloc[val_idx]['target'].values,
                             torch.round(val_preds.cpu()))
    val_roc = roc_auc_score(train_df.iloc[val_idx]['target'].values,
                            val_preds.cpu())

    print('Epoch {:03}: | Loss: {:.3f} | Train acc: {:.3f} | Val acc: {:.3f} | Val roc_auc: {:.3f} | Training time: {}'.format(
        epoch + 1,
        epoch_loss,
        train_acc,
        val_acc,
        val_roc,
        str(datetime.timedelta(seconds=time.time() - start_time))[:7]))

    scheduler.step(val_roc)

    if val_roc >= best_val:
      best_val = val_roc
      # Resetting patience since we have a new best validation ROC AUC
      patience = es_patience
      # Saving current best model
      torch.save(model, model_path)
    else:
      patience -= 1
      if patience == 0:
        print('Early stopping. Best Val roc_auc: {:.3f}'.format(best_val))
        break
  # Loading best model of this fold
  model = torch.load(model_path)
  # Switch model to evaluation mode
  model.eval()
  val_preds = torch.zeros((len(val_idx), 1), 
                          dtype=torch.float32, 
                          device=device)
  with torch.no_grad():
    # Predicting on validation set once again to obtain data for OOF
    for j, (x_val, y_val) in enumerate(val_loader):
      x_val[0] = torch.tensor(x_val[0], device=device, dtype=torch.float32)
      x_val[1] = torch.tensor(x_val[1], device=device, dtype=torch.float32)
      y_val = torch.tensor(y_val, device=device, dtype=torch.float32)
      z_val = model(x_val)
      val_pred = torch.sigmoid(z_val)
      val_preds[j*val_loader.batch_size:j*val_loader.batch_size + x_val[0].shape[0]] = val_pred
    # Store this fold's validation predictions once all batches are done
    oof[val_idx] = val_preds.cpu().numpy()

    # Predicting on test set
    tta_preds = torch.zeros((len(test), 1),
                            dtype=torch.float32,
                            device=device)
    for _ in range(TTA):
      for i, x_test in enumerate(test_loader):
        x_test[0] = torch.tensor(x_test[0], device=device, dtype=torch.float32)
        x_test[1] = torch.tensor(x_test[1], device=device, dtype=torch.float32)
        z_test = model(x_test)
        z_test = torch.sigmoid(z_test)
        tta_preds[i*test_loader.batch_size:i*test_loader.batch_size + x_test[0].shape[0]] += z_test
    # Average over TTA rounds once per fold
    preds += tta_preds / TTA

# Average over folds after the last fold has finished
preds /= skf.n_splits
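
The early-stopping rule used in the loop (reset the counter on improvement, otherwise count down and stop at zero) is easier to follow in isolation. A minimal pure-Python sketch; `run_with_early_stopping` is a hypothetical helper and the scores are made up:

```python
def run_with_early_stopping(scores, es_patience):
    # Walk through per-epoch validation scores; return (best score, epochs actually run)
    best_val = 0.0
    patience = es_patience
    epochs_run = 0
    for score in scores:
        epochs_run += 1
        if score >= best_val:
            best_val = score
            patience = es_patience   # improvement: reset the counter
        else:
            patience -= 1            # no improvement: use up one unit of patience
            if patience == 0:
                break
    return best_val, epochs_run

# Made-up validation ROC AUC values, one per epoch
toy_scores = [0.80, 0.85, 0.84, 0.83, 0.82, 0.90]
best, n_epochs = run_with_early_stopping(toy_scores, es_patience=3)
print(best, n_epochs)
```

In this toy run the sixth epoch would have set a new best score, but training stops before reaching it. That is the usual trade-off of a small patience value.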




In [None]:
# Saving OOF predictions so stacking would be easier
pd.Series(oof.reshape(-1,)).to_csv('oof.csv', index=False)
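
Out-of-fold (OOF) predictions give every training row a prediction from a model that never saw it during training. The NumPy sketch below shows how the `oof` array is filled fold by fold; the fold assignment and the constant "predictions" are stand-ins for a real model.

```python
import numpy as np

n_samples, n_folds = 6, 3
oof = np.zeros(n_samples)
fold_of_sample = np.arange(n_samples) % n_folds   # toy fold assignment

for fold in range(n_folds):
    val_idx = np.where(fold_of_sample == fold)[0]
    # Stand-in for "model trained without this fold predicts on it"
    oof[val_idx] = 0.1 * (fold + 1)

print(oof)
```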