<h1 style = "font-family: garamond; font-size: 40px; font-style: normal; letter-spacing: 1px; background-color: #f6f5f5; color :#6666ff; border-radius: 200px 200px; text-align:center">Swin Transformer: Hierarchical Vision Transformer using Shifted Windows</h1>

![Swin](http://raw.githubusercontent.com/microsoft/Swin-Transformer/master/figures/teaser.png)
<p style = "font-family: garamond; font-size: 20px; font-style: normal; border-radius: 10px 10px; text-align:center">This paper presents a new vision Transformer, called Swin Transformer, that capably serves as a general-purpose backbone for computer vision. Challenges in adapting Transformer from language to vision arise from differences between the two domains, such as large variations in the scale of visual entities and the high resolution of pixels in images compared to words in text. To address these differences, we propose a hierarchical Transformer whose representation is computed with shifted windows. The shifted windowing scheme brings greater efficiency by limiting self-attention computation to non-overlapping local windows while also allowing for cross-window connection. This hierarchical architecture has the flexibility to model at various scales and has linear computational complexity with respect to image size. These qualities of Swin Transformer make it compatible with a broad range of vision tasks, including image classification.</p>
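
<p style = "font-family: garamond; font-size: 20px; font-style: normal; border-radius: 10px 10px; text-align:center">To make the "linear complexity" claim concrete, here is a rough sketch based on the cost estimates given in the Swin Transformer paper: global self-attention over an h × w grid of patches costs about 4hwC² + 2(hw)²C, while window attention with M × M windows costs about 4hwC² + 2M²hwC. The function names and the example sizes (a 56 × 56 patch grid, C = 96, M = 7) are illustrative choices and are not part of this notebook's pipeline.</p>

```python
def global_attention_cost(h, w, channels):
    """Approximate cost of global self-attention; quadratic in the patch count h * w."""
    return 4 * h * w * channels ** 2 + 2 * (h * w) ** 2 * channels


def window_attention_cost(h, w, channels, window):
    """Approximate cost of window attention; linear in h * w for a fixed window size."""
    return 4 * h * w * channels ** 2 + 2 * window ** 2 * h * w * channels


# Illustrative sizes: quadrupling the number of patches (56x56 -> 112x112).
small_global = global_attention_cost(56, 56, 96)
large_global = global_attention_cost(112, 112, 96)
small_window = window_attention_cost(56, 56, 96, 7)
large_window = window_attention_cost(112, 112, 96, 7)

print(f"global attention grows by {large_global / small_global:.1f}x")
print(f"window attention grows by {large_window / small_window:.1f}x")
```

<p style = "font-family: garamond; font-size: 20px; font-style: normal; border-radius: 10px 10px; text-align:center">With a fixed window size, every term of the window cost scales with the number of patches, which is why the cost grows in step with image size instead of with its square.</p>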



<p p style = "font-family: garamond; font-size:40px; font-style: normal;background-color: #f6f5f5; color :#6666ff; border-radius: 10px 10px; text-align:center">What are we discussing today? </p>
 <p p style = "font-family: garamond; font-size:40px; font-style: normal;background-color: #f6f5f5; color :#006699; border-radius: 10px 10px; text-align:center">  
 CutMix <br>
 AMP + Gradient Scaling <br>
 Weighted Random Sampler <br>
 HuggingFace Accelerate <br>
 Weights and Biases
 </p>
 


<p p style = "font-family: garamond; font-size:40px; font-style: normal;background-color: #f6f5f5; color :#ff0066; border-radius: 10px 10px; text-align:center">Upvote the kernel if you find it insightful!</p>

<p p style = "font-family: garamond; font-size:30px; font-style: normal;background-color: #f6f5f5; color :#6666ff; border-radius: 10px 10px; text-align:center">TIMM Pytorch Models</p>

<p style = "font-family: garamond; font-size: 20px; font-style: normal; border-radius: 10px 10px; text-align:center">PyTorch Image Models (timm) is a collection of image models, layers, utilities, optimizers, schedulers, data-loaders / augmentations, and reference training / validation scripts that aims to pull together a wide variety of SOTA models with the ability to reproduce ImageNet training results.
Using timm, we will create the Swin Transformer model for our problem statement. We will be using the pretrained swin_small_patch4_window7_224 model. </p>

In [None]:
import sys
sys.path.append('../input/timm-pytorch-image-models/pytorch-image-models-master')

<p p style = "font-family: garamond; font-size:30px; font-style: normal;background-color: #f6f5f5; color :#6666ff; border-radius: 10px 10px; text-align:center">TOOLS</p>

<p style = "font-family: garamond; font-size: 30px; font-style: normal; border-radius: 10px 10px; text-align:center">Accelerate by HuggingFace 🤗 </p><br> <p style = "font-family: garamond; font-size: 20px; font-style: normal; border-radius: 10px 10px; text-align:center">Accelerate provides an easy API to make your scripts run with mixed precision and on any kind of distributed setting (multi-GPU, TPU, etc.) while still letting you write your own training loop. The same code can then run seamlessly on your local machine for debugging or in your training environment. <br> In 5 lines of code, we can run our scripts on any distributed setting! </p>

<p style = "font-family: garamond; font-size: 30px; font-style: normal; border-radius: 10px 10px; text-align:center">Weights & Biases</p><br> <p style = "font-family: garamond; font-size: 20px; font-style: normal; border-radius: 10px 10px; text-align:center">Weights & Biases (wandb) is a developer tool that helps companies turn deep learning research projects into deployed software by helping teams track their models, visualize model performance, and easily automate training and improving models.<br> We will use it to log hyperparameters and output metrics from our runs, then visualize and compare results and quickly share findings with colleagues.  </p>

In [None]:
!pip install -q accelerate
!pip install wandb --upgrade

<p p style = "font-family: garamond; font-size:30px; font-style: normal;background-color: #f6f5f5; color :#6666ff; border-radius: 10px 10px; text-align:center">Import Libraries</p>

In [None]:
# Warning
import warnings
import sklearn.exceptions
warnings.filterwarnings('ignore', category=DeprecationWarning)
warnings.filterwarnings('ignore', category=FutureWarning)
warnings.filterwarnings("ignore", category=sklearn.exceptions.UndefinedMetricWarning)

# Python
from tqdm import tqdm
from collections import defaultdict
import pandas as pd
import numpy as np
import os
import random
import glob
pd.set_option('display.max_columns', None)

# Visualizations
from PIL import Image
import matplotlib.pyplot as plt
import seaborn as sns
import cv2
%matplotlib inline
sns.set(style="whitegrid")

# Image Augmentations
import albumentations
from albumentations.pytorch.transforms import ToTensorV2


# Utils
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
from sklearn.metrics import roc_auc_score

# Pytorch for Deep Learning
import torch
import torchvision
import timm
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader, WeightedRandomSampler
from torch.optim.lr_scheduler import CosineAnnealingLR
from torch.cuda import amp


# GPU 
from accelerate import Accelerator
accelerator = Accelerator()

# Weights and Biases Tool
import wandb

<p p style = "font-family: garamond; font-size:30px; font-style: normal;background-color: #f6f5f5; color :#6666ff; border-radius: 10px 10px; text-align:center">Define Configurations/Parameters</p>

In [None]:
params = {
    'seed': 42,
    'model': 'swin_small_patch4_window7_224',
    'size' : 224,
    'inp_channels': 1,
    'device': accelerator.device,
    'lr': 1e-4,
    'weight_decay': 1e-6,
    'batch_size': 32,
    'num_workers' : 0,
    'epochs': 5,
    'out_features': 1,
    'name': 'CosineAnnealingLR',
    'T_max': 10,
    'min_lr': 1e-6,
    'num_tta':1
}

<p p style = "font-family: garamond; font-size:30px; font-style: normal;background-color: #f6f5f5; color :#6666ff; border-radius: 10px 10px; text-align:center">Define Seed for Reproducibility</p>

In [None]:
def seed_everything(seed=42):
    os.environ['PYTHONHASHSEED'] = str(seed)
    np.random.seed(seed)
    random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)
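    torch.backends.cudnn.deterministic = True
    # benchmark mode lets cuDNN auto-tune its algorithm choice, which can vary
    # between runs; keep it off when reproducibility is the goal.
    torch.backends.cudnn.benchmark = False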
    
seed_everything(params['seed'])

<p p style = "font-family: garamond; font-size:30px; font-style: normal;background-color: #f6f5f5; color :#6666ff; border-radius: 10px 10px; text-align:center">Minimal EDA</p>

In [None]:
train_dir = ('../input/seti-breakthrough-listen/train')
test_dir = ('../input/seti-breakthrough-listen/test')
train_df = pd.read_csv('../input/seti-breakthrough-listen/train_labels.csv')
test_df = pd.read_csv('../input/seti-breakthrough-listen/sample_submission.csv')

In [None]:
def return_filpath(name, folder=train_dir):
    path = os.path.join(folder, name[0], f'{name}.npy')
    return path

In [None]:
train_df['image_path'] = train_df['id'].apply(lambda x: return_filpath(x))
test_df['image_path'] = test_df['id'].apply(lambda x: return_filpath(x, folder=test_dir))
train_df.head()

In [None]:
fig, ax = plt.subplots(figsize=(12, 6))  # plt.subplots returns a (figure, axes) pair
sns.set_style("whitegrid")
sns.countplot(x='target', data=train_df);
plt.ylabel("No. of Observations", size=20);
plt.xlabel("Target", size=20);

<p style = "font-family: garamond; font-size: 20px; font-style: normal; border-radius: 10px 10px; text-align:center">The dataset is very imbalanced and we will see later how we use a sampler to handle it. </p>

<p p style = "font-family: garamond; font-size:30px; font-style: normal;background-color: #f6f5f5; color :#6666ff; border-radius: 10px 10px; text-align:center">Image Augmentation</p>

In [None]:
def get_train_transforms():
    return albumentations.Compose(
        [
            albumentations.Resize(params['size'],params['size']),
            albumentations.HorizontalFlip(p=0.5),
            albumentations.VerticalFlip(p=0.5),
            albumentations.Rotate(limit=180, p=0.7),
            albumentations.RandomBrightness(limit=0.6, p=0.5),
            albumentations.Cutout(
                num_holes=10, max_h_size=12, max_w_size=12,
                fill_value=0, always_apply=False, p=0.5
            ),
            albumentations.ShiftScaleRotate(
                shift_limit=0.25, scale_limit=0.1, rotate_limit=0
            ),
            ToTensorV2(p=1.0),
        ]
    )

def get_valid_transforms():
    return albumentations.Compose(
        [
            albumentations.Resize(params['size'],params['size']),
            ToTensorV2(p=1.0)
        ]
    )

def get_test_transforms():
    return albumentations.Compose(
        [
            albumentations.Resize(params['size'],params['size']),
            ToTensorV2(p=1.0)
        ]
    )

<p p style = "font-family: garamond; font-size:30px; font-style: normal;background-color: #f6f5f5; color :#6666ff; border-radius: 10px 10px; text-align:center">Custom Dataset</p>

In [None]:
class SETIDataset(Dataset):
    def __init__(self, images_filepaths, targets, transform=None):
        self.images_filepaths = images_filepaths
        self.targets = targets
        self.transform = transform

    def __len__(self):
        return len(self.images_filepaths)

    def __getitem__(self, idx):
        image_filepath = self.images_filepaths[idx]
        image = np.load(image_filepath).astype(np.float32)
        image = np.vstack(image).transpose((1, 0))
            
        if self.transform is not None:
            image = self.transform(image=image)["image"]
        else:
            image = image[np.newaxis,:,:]
            image = torch.from_numpy(image).float()
        
        label = torch.tensor(self.targets[idx]).float()
        return image, label

<p p style = "font-family: garamond; font-size:30px; font-style: normal;background-color: #f6f5f5; color :#6666ff; border-radius: 10px 10px; text-align:center">Train and Validation Data</p>

In [None]:
(X_train, X_valid, y_train, y_valid) = train_test_split(train_df['image_path'],
                                                        train_df['target'],
                                                        test_size=0.2,
                                                        stratify=train_df['target'],
                                                        shuffle=True,
                                                        random_state=params['seed'])

In [None]:
train_dataset = SETIDataset(
    images_filepaths=X_train.values,
    targets=y_train.values,
    transform=get_train_transforms()
)

valid_dataset = SETIDataset(
    images_filepaths=X_valid.values,
    targets=y_valid.values,
    transform=get_valid_transforms()
)

<p p style = "font-family: garamond; font-size:30px; font-style: normal;background-color: #f6f5f5; color :#6666ff; border-radius: 10px 10px; text-align:center">CutMix</p>

<p style = "font-family: garamond; font-size: 25px; font-style: normal; border-radius: 10px 10px; text-align:center">CutMix is an image data augmentation strategy. Instead of simply removing pixels as in Cutout, we replace the removed regions with a patch from another image. The ground truth labels are also mixed proportionally to the number of pixels of combined images. The added patches further enhance localization ability by requiring the model to identify the object from a partial view.</p>

![](https://miro.medium.com/max/4176/1*IR3uTsclxKdzKIXDlTiVgg.png)
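
<p style = "font-family: garamond; font-size: 25px; font-style: normal; border-radius: 10px 10px; text-align:center">As a minimal sketch of the idea on a single pair of images (not the batched function used for training in this notebook), the example below pastes a square patch from one toy image into another and mixes the labels by the pasted area. The 8 × 8 image size, the all-zeros / all-ones images and the variable names are assumptions made purely for illustration.</p>

```python
import numpy as np

rng = np.random.default_rng(0)

# Two toy 8x8 "images": image_a is all zeros (label 0), image_b is all ones (label 1).
image_a, label_a = np.zeros((8, 8)), 0.0
image_b, label_b = np.ones((8, 8)), 1.0

# Draw the mixing ratio, then size the patch so its area is roughly (1 - lam).
lam = rng.beta(1.0, 1.0)
cut = int(8 * np.sqrt(1.0 - lam))
top = int(rng.integers(0, 8 - cut + 1))
left = int(rng.integers(0, 8 - cut + 1))

# Paste the patch from image_b into a copy of image_a.
mixed = image_a.copy()
mixed[top:top + cut, left:left + cut] = image_b[top:top + cut, left:left + cut]

# Recompute lam from the pasted area so the label matches the pixels exactly.
lam = 1.0 - (cut * cut) / mixed.size
mixed_label = lam * label_a + (1.0 - lam) * label_b

print(f"patch {cut}x{cut}, mixed label = {mixed_label:.3f}")
```

<p style = "font-family: garamond; font-size: 25px; font-style: normal; border-radius: 10px 10px; text-align:center">Because binary cross-entropy is linear in the target, mixing the two losses with weights lam and 1 - lam gives the same value as computing the loss once against the mixed label.</p>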

In [None]:
def rand_bbox(W, H, lam):
    cut_rat = np.sqrt(1. - lam)
    # np.int was removed from recent NumPy releases; the built-in int is equivalent here.
    cut_w = int(W * cut_rat)
    cut_h = int(H * cut_rat)

    cx = np.random.randint(cut_w // 2, W - cut_w // 2)
    cy = np.random.randint(cut_h // 2, H - cut_h // 2)

    bbx1 = cx - cut_w // 2
    bby1 = cy - cut_h // 2
    bbx2 = cx + cut_w // 2
    bby2 = cy + cut_h // 2

    return bbx1, bby1, bbx2, bby2

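def cutmix(x, y, alpha=1.0):
    # x is a batch shaped (N, C, H, W); y holds the matching labels.
    if alpha > 0:
        lam = np.random.beta(alpha, alpha)
    else:
        lam = 1

    batch_size = x.size()[0]
    index = torch.randperm(batch_size).to(params['device'])

    # Cut over the two spatial axes (dims 2 and 3), not the channel axis (dim 1).
    bbx1, bby1, bbx2, bby2 = rand_bbox(x.size()[2], x.size()[3], lam)
    x[:, :, bbx1:bbx2, bby1:bby2] = x[index, :, bbx1:bbx2, bby1:bby2]
    # Recompute lam from the pasted area so the label mix matches the pixels.
    lam = 1 - (bbx2 - bbx1) * (bby2 - bby1) / (x.size()[2] * x.size()[3])
    y_a, y_b = y, y[index]
    return x, y_a, y_b, lam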

def cutmix_criterion(criterion, pred, y_a, y_b, lam):
    return lam * criterion(pred, y_a) + (1 - lam) * criterion(pred, y_b)

<p p style = "font-family: garamond; font-size:30px; font-style: normal;background-color: #f6f5f5; color :#6666ff; border-radius: 10px 10px; text-align:center">Custom Class for Monitoring Loss and ROC</p>

In [None]:
class MetricMonitor:
    def __init__(self, float_precision=3):
        self.float_precision = float_precision
        self.reset()

    def reset(self):
        self.metrics = defaultdict(lambda: {"val": 0, "count": 0, "avg": 0})

    def update(self, metric_name, val):
        metric = self.metrics[metric_name]

        metric["val"] += val
        metric["count"] += 1
        metric["avg"] = metric["val"] / metric["count"]

    def __str__(self):
        return " | ".join(
            [
                "{metric_name}: {avg:.{float_precision}f}".format(
                    metric_name=metric_name, avg=metric["avg"],
                    float_precision=self.float_precision
                )
                for (metric_name, metric) in self.metrics.items()
            ]
        )
    
def use_roc_score(output, target):
    try:
        y_pred = torch.sigmoid(output).cpu()
        y_pred = y_pred.detach().numpy()
        target = target.cpu()

        return roc_auc_score(target, y_pred)
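    except ValueError:
        # roc_auc_score can raise ValueError when a batch contains only one
        # class; fall back to the chance-level score in that case.
        return 0.5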

<p p style = "font-family: garamond; font-size:30px; font-style: normal;background-color: #f6f5f5; color :#6666ff; border-radius: 10px 10px; text-align:center">Weighted Random Sampler</p>

<p style = "font-family: garamond; font-size: 25px; font-style: normal; border-radius: 10px 10px; text-align:center">Samples elements from [0 ,.., len(weights)-1] with given probabilities (weights).</p>
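
<p style = "font-family: garamond; font-size: 25px; font-style: normal; border-radius: 10px 10px; text-align:center">To see why inverse-frequency weights rebalance the classes, here is a small standard-library sketch of the same idea. The 90/10 label split and the use of random.choices in place of the PyTorch sampler are assumptions made for illustration only.</p>

```python
import random
from collections import Counter

# Toy labels: 90 negatives and 10 positives (a hypothetical 9:1 imbalance).
labels = [0] * 90 + [1] * 10

# Inverse-frequency class weights, then one weight per sample.
counts = Counter(labels)
class_weights = {cls: len(labels) / n for cls, n in counts.items()}
sample_weights = [class_weights[y] for y in labels]

# Draw indices with replacement, in proportion to the per-sample weights.
rng = random.Random(0)
drawn = rng.choices(range(len(labels)), weights=sample_weights, k=10_000)
positive_share = sum(labels[i] for i in drawn) / len(drawn)

print(f"share of positives in the resampled stream: {positive_share:.2f}")
```

<p style = "font-family: garamond; font-size: 25px; font-style: normal; border-radius: 10px 10px; text-align:center">Each class ends up with the same total weight, so the two classes are drawn about equally often even though one is nine times rarer in the data.</p>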

In [None]:
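# sort_index() orders the counts by class label (0, 1), so class_counts[label]
# is correct even if the class frequencies change.
class_counts = y_train.value_counts().sort_index().to_list()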
num_samples = sum(class_counts)
labels = y_train.to_list()

class_weights = [num_samples/class_counts[i] for i in range(len(class_counts))]
weights = [class_weights[labels[i]] for i in range(int(num_samples))]
sampler = WeightedRandomSampler(torch.DoubleTensor(weights), int(num_samples))

In [None]:
train_loader = DataLoader(
    train_dataset, batch_size=params['batch_size'], sampler = sampler,
    num_workers=params['num_workers'], pin_memory=True)

val_loader = DataLoader(
    valid_dataset, batch_size=params['batch_size'], shuffle=False,
    num_workers=params['num_workers'], pin_memory=True)

<p p style = "font-family: garamond; font-size:30px; font-style: normal;background-color: #f6f5f5; color :#6666ff; border-radius: 10px 10px; text-align:center">Swin Transformer</p>

In [None]:
class SwinNet(nn.Module):
    def __init__(self, model_name=params['model'], out_features=params['out_features'],
                 inp_channels=params['inp_channels'], pretrained=True):
        super().__init__()
        self.model = timm.create_model(model_name, pretrained=pretrained,
                                       in_chans=inp_channels)
        n_features = self.model.head.in_features
        self.model.head = nn.Linear(n_features, out_features, bias=True)    
    
    def forward(self, x):
        x = self.model(x)
        return x

<p p style = "font-family: garamond; font-size:30px; font-style: normal;background-color: #f6f5f5; color :#6666ff; border-radius: 10px 10px; text-align:center">Define Loss Function, Optimizer and Scheduler</p>

In [None]:
model = SwinNet()
model = model.to(params['device'])
criterion = nn.BCEWithLogitsLoss().to(params['device'])
optimizer = torch.optim.Adam(model.parameters(), lr=params['lr'],
                             weight_decay=params['weight_decay'],
                             amsgrad=False)

scheduler = CosineAnnealingLR(optimizer,
                              T_max=params['T_max'],
                              eta_min=params['min_lr'],
                              last_epoch=-1)

<p p style = "font-family: garamond; font-size:30px; font-style: normal;background-color: #f6f5f5; color :#6666ff; border-radius: 10px 10px; text-align:center">Mixed Precision Training</p>

<p style = "font-family: garamond; font-size: 25px; font-style: normal; border-radius: 10px 10px; text-align:center">📍 Automatic Mixed Precision <br><br> AMP provides convenience methods for mixed precision, where some operations use the torch.float32 (float) datatype and other operations use torch.float16 (half). </p>

<p style = "font-family: garamond; font-size: 25px; font-style: normal; border-radius: 10px 10px; text-align:center">📍 Autocasting <br><br>Instances of autocast serve as context managers or decorators that allow regions of your script to run in mixed precision. autocast should wrap only the forward pass(es) of your network, including the loss computation(s). Backward passes under autocast are not recommended.</p>

<p style = "font-family: garamond; font-size: 25px; font-style: normal; border-radius: 10px 10px; text-align:center">📍 Gradient Scaling <br><br> If the forward pass for a particular op has float16 inputs, the backward pass for that op will produce float16 gradients. Gradient values with small magnitudes may not be representable in float16. These values will flush to zero (“underflow”), so the update for the corresponding parameters will be lost. <br>
    To prevent underflow, “gradient scaling” multiplies the network’s loss(es) by a scale factor and invokes a backward pass on the scaled loss(es). Gradients flowing backward through the network are then scaled by the same factor. In other words, gradient values have a larger magnitude, so they don’t flush to zero.</p>
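
<p style = "font-family: garamond; font-size: 25px; font-style: normal; border-radius: 10px 10px; text-align:center">The underflow problem and the effect of scaling can be reproduced with plain NumPy. This is only a numerical illustration; the gradient value 1e-8 and the scale factor 65536 are assumed example numbers, and no part of the training loop is involved.</p>

```python
import numpy as np

# A small gradient value that float32 can hold but float16 cannot.
grad = np.float32(1e-8)
scale = np.float32(65536.0)  # illustrative scale factor (2**16)

# Without scaling, the cast to float16 flushes the value to zero.
unscaled_half = np.float16(grad)

# With scaling, the value lands in float16's representable range ...
scaled_half = np.float16(grad * scale)
# ... and is divided by the same factor in float32 before the optimizer step.
recovered = np.float32(scaled_half) / scale

print(unscaled_half, scaled_half, recovered)
```

<p style = "font-family: garamond; font-size: 25px; font-style: normal; border-radius: 10px 10px; text-align:center">The unscaled value is lost entirely, while the scaled value survives the trip through float16 and comes back close to the original after unscaling.</p>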


In [None]:
def train(train_loader, model, criterion, optimizer, epoch, params):
    metric_monitor = MetricMonitor()
    model.train()
    stream = tqdm(train_loader)
    scaler = amp.GradScaler()
       
    for i, (images, target) in enumerate(stream, start=1):

        images = images.to(params['device'])
        target = target.to(params['device']).float().view(-1, 1)
        images, targets_a, targets_b, lam = cutmix(images, target.view(-1, 1))
        
        with amp.autocast(enabled=True):
            output = model(images)
            loss = cutmix_criterion(criterion, output, targets_a, targets_b, lam)
            
        accelerator.backward(scaler.scale(loss))
        scaler.step(optimizer)
        scaler.update()
        optimizer.zero_grad()
        
        roc_score = use_roc_score(output, target)
        metric_monitor.update('Loss', loss.item())
        metric_monitor.update('ROC', roc_score)
        wandb.log({"Train Epoch":epoch,"Train loss": loss.item(), "Train ROC":roc_score})
        

        stream.set_description(
            "Epoch: {epoch}. Train.      {metric_monitor}".format(
                epoch=epoch,
                metric_monitor=metric_monitor)
        )

In [None]:
def validate(val_loader, model, criterion, epoch, params):
    metric_monitor = MetricMonitor()
    model.eval()
    stream = tqdm(val_loader)
    final_targets = []
    final_outputs = []
    with torch.no_grad():
        for i, (images, target) in enumerate(stream, start=1):
            images = images.to(params['device'], non_blocking=True)
            target = target.to(params['device'], non_blocking=True).float().view(-1, 1)
            output = model(images)
            loss = criterion(output, target)
            roc_score = use_roc_score(output, target)
            metric_monitor.update('Loss', loss.item())
            metric_monitor.update('ROC', roc_score)
            wandb.log({"Valid Epoch": epoch, "Valid loss": loss.item(), "Valid ROC":roc_score})
            stream.set_description(
                "Epoch: {epoch}. Validation. {metric_monitor}".format(
                    epoch=epoch,
                    metric_monitor=metric_monitor)
            )
            
            targets = target.detach().cpu().numpy().tolist()
            outputs = output.detach().cpu().numpy().tolist()
            
            final_targets.extend(targets)
            final_outputs.extend(outputs)
    return final_outputs, final_targets

<p p style = "font-family: garamond; font-size:30px; font-style: normal;background-color: #f6f5f5; color :#6666ff; border-radius: 10px 10px; text-align:center">Running Train and Evaluation and Monitoring on Weights and Biases</p>

In [None]:
best_roc = -np.inf
best_epoch = -np.inf
best_model_name = None

# Initialise W&B once, before the loop, so all epochs are logged to a single run.
run = wandb.init(project='Seti-Swin',
                 config=params,
                 job_type='train',
                 name='Swin Transformer')

for epoch in range(1, params['epochs'] + 1):

    train(train_loader, model, criterion, optimizer, epoch, params)
    predictions, valid_targets = validate(val_loader, model, criterion, epoch, params)
    roc_auc = round(roc_auc_score(valid_targets, predictions), 3)
    torch.save(model.state_dict(),f"{params['model']}_{epoch}_epoch_{roc_auc}_roc_auc.pth")

    # Remember the checkpoint with the best validation ROC AUC so far.
    if roc_auc > best_roc:
        best_roc = roc_auc
        best_epoch = epoch
        best_model_name = f"{params['model']}_{epoch}_epoch_{roc_auc}_roc_auc.pth"

    scheduler.step()

# Close the W&B run once training is done.
run.finish()
    

In [None]:
print(f'The best ROC: {best_roc} was achieved on epoch: {best_epoch}.')
print(f'The Best saved model is: {best_model_name}') 

<p p style = "font-family: garamond; font-size:30px; font-style: normal;background-color: #f6f5f5; color :#6666ff; border-radius: 10px 10px; text-align:center">Validation Results</p>

<p style = "font-family: garamond; font-size: 25px; font-style: normal; border-radius: 10px 10px; text-align:center">We are able to achieve a validation ROC AUC of 0.962 in just 5 epochs!<br><br> Weights & Biases provides an easy-to-use interface and tools for keeping track of our evaluation metrics, such as training and validation loss and ROC AUC, along with other resources like GPU usage.</p>

<p p style = "font-family: garamond; font-size:30px; font-style: normal;background-color: #f6f5f5; color :#6666ff; border-radius: 10px 10px; text-align:center">Test Time Augmentation</p>

<p style = "font-family: garamond; font-size: 25px; font-style: normal; border-radius: 10px 10px; text-align:center">Similar to what data augmentation does to the training set, the purpose of Test Time Augmentation (TTA) is to perform random modifications to the test images. Thus, instead of showing the regular, “clean” images to the trained model only once, we show it the augmented images several times. We then average the predictions for each corresponding image and take that as our final guess.<br>
    The reason it works is that, by averaging our predictions on randomly modified images, we are also averaging the errors. The error can be big for a single prediction, leading to a wrong answer, but when averaged, the correct answer tends to stand out.<br><br> Here I'll be using a TTA of 1 as I don't have enough GPU hours left, but feel free to experiment and evaluate the performance with a different number of TTA passes.</p>

![](https://preview.ibb.co/kH61v0/pipeline.png)
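
<p style = "font-family: garamond; font-size: 25px; font-style: normal; border-radius: 10px 10px; text-align:center">The averaging step can be sketched in a few lines of NumPy. Everything here is a stand-in: toy_model is a hypothetical scorer used in place of the trained network, and the flips are example augmentations. Note that the test transforms defined earlier only resize the image, so each extra pass would need an augmentation that actually changes the input (such as a flip) for the average to differ from a single prediction.</p>

```python
import numpy as np


def toy_model(batch):
    """Hypothetical stand-in for a trained network: one probability per image."""
    weights = np.linspace(-1.0, 1.0, batch.shape[-1])
    logits = (batch.mean(axis=1) * weights).sum(axis=1)
    return 1.0 / (1.0 + np.exp(-logits))


def predict_with_tta(batch, augmentations):
    """Run the model once per augmented view and average the predictions."""
    preds = [toy_model(aug(batch)) for aug in augmentations]
    return np.mean(preds, axis=0)


rng = np.random.default_rng(0)
images = rng.normal(size=(4, 8, 8))  # four toy 8x8 images

augmentations = [
    lambda b: b,               # original view
    lambda b: b[:, :, ::-1],   # horizontal flip
    lambda b: b[:, ::-1, :],   # vertical flip
]

tta_preds = predict_with_tta(images, augmentations)
plain_preds = predict_with_tta(images, [lambda b: b])

print("single pass:", np.round(plain_preds, 3))
print("TTA average:", np.round(tta_preds, 3))
```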

In [None]:
model = SwinNet()
model.load_state_dict(torch.load(best_model_name))
model = model.to(params['device'])

In [None]:
model.eval()
predicted_labels = None
for i in range(params['num_tta']):
    test_dataset = SETIDataset(
        images_filepaths = test_df['image_path'].values,
        targets = test_df['target'].values,
        transform = get_test_transforms()
    )
    test_loader = DataLoader(
        test_dataset, batch_size=params['batch_size'],
        shuffle=False, num_workers=params['num_workers'],
        pin_memory=True
    )
    
    temp_preds = None
    with torch.no_grad():
        for (images, target) in tqdm(test_loader):
            images = images.to(params['device'], non_blocking=True)
            output = model(images)
            predictions = torch.sigmoid(output).cpu().numpy()
            if temp_preds is None:
                temp_preds = predictions
            else:
                temp_preds = np.vstack((temp_preds, predictions))
    
    if predicted_labels is None:
        predicted_labels = temp_preds
    else:
        predicted_labels += temp_preds
        
predicted_labels /= params['num_tta']

In [None]:
torch.save(model.state_dict(), f"{params['model']}_{best_epoch}epochs_weights.pth")

<p p style = "font-family: garamond; font-size:30px; font-style: normal;background-color: #f6f5f5; color :#6666ff; border-radius: 10px 10px; text-align:center">Submission File</p>

In [None]:
sub_df = pd.DataFrame()
sub_df['id'] = test_df['id']
# predicted_labels has shape (N, 1); flatten it to one value per row.
sub_df['target'] = predicted_labels.ravel()

In [None]:
sub_df.head()

In [None]:
sub_df.to_csv('submission.csv', index=False)

<p p style = "font-family: garamond; font-size:30px; font-style: normal;background-color: #f6f5f5; color :#6666ff; border-radius: 10px 10px; text-align:center">That's it for today, folks!<br> A huge shoutout to <a href = 'https://www.kaggle.com/manabendrarout/nfnet-pytorch-starter-lb-0-95'>manabendrarout </a>for his excellent PyTorch starter kernel.<br>Go give it some love!</p>

