# PyTorch with SWA

## **Content**
1. [Introduction](#1)
1. [About SWA (Stochastic Weight Averaging)](#2)
1. [Usage of PyTorch module](#3)
1. [Application](#4)


<a id="1"></a> <br>
# <div class="alert alert-block alert-success">Introduction</div>

PyTorch 1.6 now includes Stochastic Weight Averaging (SWA).

Until now, using SWA meant turning on internet access and installing torchcontrib. Recent competitions often require internet to be off, which made SWA awkward to use.

As you can see from my CV score, I'm not yet proficient with SWA. I am publishing this notebook to encourage more people to try SWA and to discuss how to use it effectively.

<a id="2"></a> <br>
# <div class="alert alert-block alert-info">About SWA (Stochastic Weight Averaging)</div>

Simply put, SWA is an ensemble in weight space: it averages the weights visited while training continues with a cyclical (or constant) learning rate.

The solution found by ordinary training often sits somewhat away from the solution that generalizes best, and SWA can often find one that generalizes better.

The following figure is from article [1]:

<img src="https://github.com/tasotasoso/kaggle_media/blob/main/MoA/Averaging%20Weights%20Leads_to_Wider_Optima_and_Better_Generalization_fig.png?raw=true" width="1000">

The right panel shows that the training loss is lowest when SWA is not used. The left and middle panels, however, show that the test error is lower when SWA is used.

In simplified form, the weight update algorithm is:

$$
w_{SWA} = \frac{w_{SWA}*n_{models} + w}{n_{models} + 1} 
$$

Here, $w_{SWA}$ is the averaged weight, $w$ is the current weight of the model being trained, and $n_{models}$ is the number of models averaged so far.
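The update rule above is a running mean, which can be checked with a minimal NumPy sketch. The helper `swa_update` and the three weight snapshots are made up for illustration; they are not part of PyTorch or of this notebook's pipeline.

```python
import numpy as np

def swa_update(w_swa, w, n_models):
    """Fold one more weight vector into the running average."""
    return (w_swa * n_models + w) / (n_models + 1)

# Three hypothetical weight snapshots collected during training.
snapshots = [np.array([1.0, 4.0]), np.array([3.0, 0.0]), np.array([5.0, 2.0])]

w_swa = np.zeros_like(snapshots[0])
for n_models, w in enumerate(snapshots):
    # n_models counts how many snapshots are already in the average.
    w_swa = swa_update(w_swa, w, n_models)

print(w_swa)                       # running average after three updates
print(np.mean(snapshots, axis=0))  # plain arithmetic mean, for comparison
```

Applying the rule once per snapshot gives the same result as averaging all snapshots at once, so only the current average and a counter need to be kept in memory.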

For implementation, article [1] gives a more specific description:

<img src="https://github.com/tasotasoso/kaggle_media/blob/main/MoA/Averaging%20Weights%20Leads_to_Wider_Optima_and_Better_Generalization_argorism.png?raw=true" width="350">

## Reference

Good documents for learning SWA. In particular, [1] is the original SWA article and the source of the figures above.

[1] https://arxiv.org/abs/1803.05407

[2] https://github.com/timgaripov/swa

[3] https://towardsdatascience.com/stochastic-weight-averaging-a-new-way-to-get-state-of-the-art-results-in-deep-learning-c639ccf36a

[4] https://pytorch.org/blog/pytorch-1.6-now-includes-stochastic-weight-averaging/

[5] https://github.com/pytorch/pytorch/blob/master/torch/optim/swa_utils.py

<a id="3"></a> <br>
# <div class="alert alert-block alert-info">Usage of PyTorch module</div>

With torch.optim.swa_utils in PyTorch 1.6, we can use SWA without any extra packages.

PyTorch 1.6 is now available in Kaggle notebooks.

In [None]:
import torch
torch.__version__

In [None]:
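# Example from the official blog post: https://pytorch.org/blog/pytorch-1.6-now-includes-stochastic-weight-averaging/
# Wrapped in `if False:` because loader, optimizer, model, loss_fn and test_input are placeholders.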
if False:
    from torch.optim.swa_utils import AveragedModel, SWALR
    from torch.optim.lr_scheduler import CosineAnnealingLR

    loader, optimizer, model, loss_fn = ...
    swa_model = AveragedModel(model)
    scheduler = CosineAnnealingLR(optimizer, T_max=100)
    swa_start = 5
    swa_scheduler = SWALR(optimizer, swa_lr=0.05)

    for epoch in range(100):
        for input, target in loader:
            optimizer.zero_grad()
            loss_fn(model(input), target).backward()
            optimizer.step()
        if epoch > swa_start:
            swa_model.update_parameters(model)
            swa_scheduler.step()
        else:
            scheduler.step()

    # Update bn statistics for the swa_model at the end
    torch.optim.swa_utils.update_bn(loader, swa_model)
    # Use swa_model to make predictions on test data 
    preds = swa_model(test_input)

Please see the official documentation and the GitHub implementation for details; here is a summary of the usage.

- Passing our model to AveragedModel gives an AveragedModel instance, which accumulates the running average of the weights.

- update_parameters(model) updates the parameters of the averaged model with the current weights of the model.

- SWALR schedules the learning rate to anneal to a fixed value and then keeps it constant.

- update_bn is a utility function that recomputes the batch normalization statistics of the SWA model at the end of training.
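The four pieces summarized above can be tried end to end on toy data. This is only a minimal sketch: the data, the `make_model` helper, the learning rates and the epoch counts are made up for illustration and are not the settings used later in this notebook.

```python
import torch
import torch.nn as nn
from torch.optim.swa_utils import AveragedModel, SWALR, update_bn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# Toy regression data (hypothetical, for illustration only).
x = torch.randn(64, 4)
y = x.sum(dim=1, keepdim=True)
loader = DataLoader(TensorDataset(x, y), batch_size=16, shuffle=True)

def make_model():
    # Small network with a BatchNorm layer so that update_bn has work to do.
    return nn.Sequential(nn.BatchNorm1d(4), nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 1))

model = make_model()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
swa_model = AveragedModel(model)   # keeps the running average of the weights
swa_scheduler = SWALR(optimizer, swa_lr=0.05, anneal_epochs=2)
loss_fn = nn.MSELoss()
swa_start = 3

for epoch in range(8):
    for xb, yb in loader:
        optimizer.zero_grad()
        loss_fn(model(xb), yb).backward()
        optimizer.step()
    if epoch >= swa_start:
        swa_model.update_parameters(model)  # fold the current weights into the average
        swa_scheduler.step()                # anneal towards swa_lr, then hold it

# Recompute the BatchNorm statistics for the averaged weights.
update_bn(loader, swa_model)

# The averaged network lives in `swa_model.module`, so its state dict
# loads into a plain model of the same architecture.
plain = make_model()
plain.load_state_dict(swa_model.module.state_dict())

print(int(swa_model.n_averaged), optimizer.param_groups[0]["lr"])
```

Note that predictions should come from `swa_model` (or from a model loaded with `swa_model.module.state_dict()`); the original `model` only holds the last training weights.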

<a id="4"></a> <br>
# <div class="alert alert-block alert-success">Application</div>

I'll use torch.optim.swa_utils and see how well it works.

For the training and inference pipeline, I refer to this great notebook:

https://www.kaggle.com/namanj27/new-baseline-pytorch-moa

torch.optim.swa_utils doesn't work with the original notebook as is, so I changed it where necessary.

## Preparation

In [None]:
!ls ../input/iterative-stratification

In [None]:
import sys
sys.path.append('../input/iterative-stratification/iterative-stratification-master')
from iterstrat.ml_stratifiers import MultilabelStratifiedKFold

In [None]:
import numpy as np
import random
import pandas as pd
import matplotlib.pyplot as plt
import os
import copy
import seaborn as sns

from sklearn import preprocessing
from sklearn.metrics import log_loss
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.optim.swa_utils import AveragedModel, SWALR
from torch.optim.lr_scheduler import CosineAnnealingLR

import warnings
warnings.filterwarnings('ignore')

In [None]:
os.listdir('../input/lish-moa')

In [None]:
train_features = pd.read_csv('../input/lish-moa/train_features.csv')
train_targets_scored = pd.read_csv('../input/lish-moa/train_targets_scored.csv')
train_targets_nonscored = pd.read_csv('../input/lish-moa/train_targets_nonscored.csv')

test_features = pd.read_csv('../input/lish-moa/test_features.csv')
sample_submission = pd.read_csv('../input/lish-moa/sample_submission.csv')

In [None]:
GENES = [col for col in train_features.columns if col.startswith('g-')]
CELLS = [col for col in train_features.columns if col.startswith('c-')]

In [None]:
def seed_everything(seed=42):
    random.seed(seed)
    os.environ['PYTHONHASHSEED'] = str(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)
    torch.backends.cudnn.deterministic = True
    
seed_everything(seed=42)

In [None]:
train_targets_scored.sum()[1:].sort_values()

In [None]:
train_features['cp_type'].unique()

## PCA features + Existing features

In [None]:
# GENES
n_comp = 50

data = pd.concat([pd.DataFrame(train_features[GENES]), pd.DataFrame(test_features[GENES])])
data2 = (PCA(n_components=n_comp, random_state=42).fit_transform(data[GENES]))
train2 = data2[:train_features.shape[0]]; test2 = data2[-test_features.shape[0]:]

train2 = pd.DataFrame(train2, columns=[f'pca_G-{i}' for i in range(n_comp)])
test2 = pd.DataFrame(test2, columns=[f'pca_G-{i}' for i in range(n_comp)])

# drop_cols = [f'c-{i}' for i in range(n_comp,len(GENES))]
train_features = pd.concat((train_features, train2), axis=1)
test_features = pd.concat((test_features, test2), axis=1)

In [None]:
#CELLS
n_comp = 15

data = pd.concat([pd.DataFrame(train_features[CELLS]), pd.DataFrame(test_features[CELLS])])
data2 = (PCA(n_components=n_comp, random_state=42).fit_transform(data[CELLS]))
train2 = data2[:train_features.shape[0]]; test2 = data2[-test_features.shape[0]:]

train2 = pd.DataFrame(train2, columns=[f'pca_C-{i}' for i in range(n_comp)])
test2 = pd.DataFrame(test2, columns=[f'pca_C-{i}' for i in range(n_comp)])

# drop_cols = [f'c-{i}' for i in range(n_comp,len(CELLS))]
train_features = pd.concat((train_features, train2), axis=1)
test_features = pd.concat((test_features, test2), axis=1)

## Feature selection using variance threshold

In [None]:
from sklearn.feature_selection import VarianceThreshold


var_thresh = VarianceThreshold(threshold=0.5)
# DataFrame.append was removed in pandas 2.0; pd.concat stacks the rows the same way.
data = pd.concat([train_features, test_features])
data_transformed = var_thresh.fit_transform(data.iloc[:, 4:])

train_features_transformed = data_transformed[ : train_features.shape[0]]
test_features_transformed = data_transformed[-test_features.shape[0] : ]


train_features = pd.DataFrame(train_features[['sig_id','cp_type','cp_time','cp_dose']].values.reshape(-1, 4),\
                              columns=['sig_id','cp_type','cp_time','cp_dose'])

train_features = pd.concat([train_features, pd.DataFrame(train_features_transformed)], axis=1)


test_features = pd.DataFrame(test_features[['sig_id','cp_type','cp_time','cp_dose']].values.reshape(-1, 4),\
                             columns=['sig_id','cp_type','cp_time','cp_dose'])

test_features = pd.concat([test_features, pd.DataFrame(test_features_transformed)], axis=1)

train_features
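To make the selection step easier to follow, here is a minimal sketch of what `VarianceThreshold(threshold=0.5)` does on a made-up matrix. The toy values are assumptions for illustration; only the threshold matches the cell above.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Hypothetical toy matrix: column 1 is nearly constant, columns 0 and 2 vary.
X = np.array([
    [0.0, 1.00, 10.0],
    [2.0, 1.01, 12.0],
    [4.0, 0.99, 14.0],
    [6.0, 1.00, 16.0],
])

selector = VarianceThreshold(threshold=0.5)
X_kept = selector.fit_transform(X)  # drops columns whose variance is <= 0.5

print(selector.get_support())  # boolean mask of the columns that were kept
print(X_kept.shape)
```

Because the output is a plain array, the original column names are lost, which is why the cell above rebuilds the DataFrame with integer column names.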


In [None]:
train = train_features.merge(train_targets_scored, on='sig_id')
train = train[train['cp_type']!='ctl_vehicle'].reset_index(drop=True)
test = test_features[test_features['cp_type']!='ctl_vehicle'].reset_index(drop=True)
target = train[train_targets_scored.columns]

In [None]:
train = train.drop('cp_type', axis=1)
test = test.drop('cp_type', axis=1)

In [None]:
train

## CV folds

In [None]:
folds = train.copy()

mskf = MultilabelStratifiedKFold(n_splits=5)

for f, (t_idx, v_idx) in enumerate(mskf.split(X=train, y=target)):
    folds.loc[v_idx, 'kfold'] = int(f)

folds['kfold'] = folds['kfold'].astype(int)
folds

In [None]:
print(train.shape)
print(folds.shape)
print(test.shape)
print(target.shape)
print(sample_submission.shape)

## Dataset Classes

### <div class="alert alert-block alert-warning">↓Note the implementation of __getitem__. To use update_bn, the dataset should return a tensor or a list of tensors.</div>

In [None]:
class MoADataset:
    def __init__(self, features, targets):
        self.features = features
        self.targets = targets
        
    def __len__(self):
        return (self.features.shape[0])
    
    def __getitem__(self, idx):
        x = torch.tensor(self.features[idx, :], dtype=torch.float)
        y = torch.tensor(self.targets[idx, :], dtype=torch.float)
        return x, y
    
class TestDataset:
    def __init__(self, features):
        self.features = features
        
    def __len__(self):
        return (self.features.shape[0])
    
    def __getitem__(self, idx):
        x = torch.tensor(self.features[idx, :], dtype=torch.float)
        return x
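The reason for the note above is that `update_bn` passes each batch straight to the model, using only the first element when the batch is a list or tuple. A minimal sketch, assuming a hypothetical `PairDataset` that mirrors `MoADataset` and random toy features:

```python
import numpy as np
import torch
import torch.nn as nn
from torch.optim.swa_utils import AveragedModel, update_bn
from torch.utils.data import DataLoader

class PairDataset:
    """Hypothetical stand-in for MoADataset: returns an (x, y) tuple of tensors."""
    def __init__(self, features, targets):
        self.features, self.targets = features, targets

    def __len__(self):
        return self.features.shape[0]

    def __getitem__(self, idx):
        x = torch.tensor(self.features[idx, :], dtype=torch.float)
        y = torch.tensor(self.targets[idx, :], dtype=torch.float)
        return x, y

rng = np.random.default_rng(0)
features = rng.normal(loc=5.0, size=(32, 3))  # feature means are far from 0
targets = rng.integers(0, 2, size=(32, 2))
loader = DataLoader(PairDataset(features, targets), batch_size=8)

net = nn.Sequential(nn.BatchNorm1d(3), nn.Linear(3, 2))
swa_model = AveragedModel(net)

# update_bn feeds only the first element of each (x, y) batch to the model.
update_bn(loader, swa_model)
running_mean = swa_model.module[0].running_mean
print(running_mean)  # close to the per-feature mean of `features`
```

A dataset that returns a dict would not work here, because `update_bn` would pass the whole dict to the model.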
    

## Model

In [None]:
class Model(nn.Module):
    def __init__(self, num_features, num_targets, hidden_size):
        super(Model, self).__init__()
        self.batch_norm1 = nn.BatchNorm1d(num_features)
        self.dropout1 = nn.Dropout(0.2)
        self.dense1 = nn.Linear(num_features, hidden_size)
        
        self.batch_norm2 = nn.BatchNorm1d(hidden_size)
        self.dropout2 = nn.Dropout(0.5)
        self.dense2 = nn.Linear(hidden_size, hidden_size)
        
        self.batch_norm3 = nn.BatchNorm1d(hidden_size)
        self.dropout3 = nn.Dropout(0.5)
        self.dense3 = nn.Linear(hidden_size, num_targets)
        
    def forward(self, x):
        x = self.batch_norm1(x)
        x = self.dropout1(x)
        x = F.relu(self.dense1(x))
        
        x = self.batch_norm2(x)
        x = self.dropout2(x)
        x = F.relu(self.dense2(x))
        
        x = self.batch_norm3(x)
        x = self.dropout3(x)
        x = self.dense3(x)
        
        return x

## Preprocessing steps

In [None]:
def process_data(data):
    
    data = pd.get_dummies(data, columns=['cp_time','cp_dose'])
#     data.loc[:, 'cp_time'] = data.loc[:, 'cp_time'].map({24: 0, 48: 1, 72: 2})
#     data.loc[:, 'cp_dose'] = data.loc[:, 'cp_dose'].map({'D1': 0, 'D2': 1})

# --------------------- Normalize ---------------------
#     for col in GENES:
#         data[col] = (data[col]-np.mean(data[col])) / (np.std(data[col]))
    
#     for col in CELLS:
#         data[col] = (data[col]-np.mean(data[col])) / (np.std(data[col]))
    
#--------------------- Removing Skewness ---------------------
#     for col in GENES + CELLS:
#         if(abs(data[col].skew()) > 0.75):
            
#             if(data[col].skew() < 0): # neg-skewness
#                 data[col] = data[col].max() - data[col] + 1
#                 data[col] = np.sqrt(data[col])
            
#             else:
#                 data[col] = np.sqrt(data[col])
    
    return data

In [None]:
target_cols = target.drop('sig_id', axis=1).columns.values.tolist()

In [None]:
feature_cols = [c for c in process_data(folds).columns if c not in target_cols]
feature_cols = [c for c in feature_cols if c not in ['kfold','sig_id']]
len(feature_cols)

In [None]:
# HyperParameters

DEVICE = ('cuda' if torch.cuda.is_available() else 'cpu')
EPOCHS = 45
SWA_START_EPOCH = 30
BATCH_SIZE = 128
LEARNING_RATE = 1e-3
WEIGHT_DECAY = 1e-5
NFOLDS = 5
EARLY_STOPPING_STEPS = 10
EARLY_STOP = False

num_features=len(feature_cols)
num_targets=len(target_cols)
hidden_size=1024


## Single fold training

In [None]:
def train_fn(swa_model, model, optimizer, scheduler, swa_scheduler, epoch, loss_fn, dataloader, device):
    model.train()
    final_loss = 0
    
    for data in dataloader:
        optimizer.zero_grad()
        data_x, data_y = data
        inputs, targets = data_x.to(device), data_y.to(device)
        outputs = model(inputs)
        loss = loss_fn(outputs, targets)
        loss.backward()
        optimizer.step()
        
        if epoch > SWA_START_EPOCH:
            swa_model.update_parameters(model)
            swa_scheduler.step()
        else:
            scheduler.step()
        
        final_loss += loss.item()
        
    final_loss /= len(dataloader)
    
    return final_loss


def valid_fn(model, loss_fn, dataloader, device):
    model.eval()
    final_loss = 0
    valid_preds = []
    
    for data in dataloader:
        data_x, data_y = data
        inputs, targets = data_x.to(device), data_y.to(device)
        # Validation needs no gradients, so skip building the autograd graph.
        with torch.no_grad():
            outputs = model(inputs)
            loss = loss_fn(outputs, targets)
        
        final_loss += loss.item()
        valid_preds.append(outputs.sigmoid().detach().cpu().numpy())
        
    final_loss /= len(dataloader)
    valid_preds = np.concatenate(valid_preds)
    
    return final_loss, valid_preds

def inference_fn(model, dataloader, device):
    model.eval()
    preds = []
    
    for data in dataloader:
        inputs = data.to(device)

        with torch.no_grad():
            outputs = model(inputs)
        
        preds.append(outputs.sigmoid().detach().cpu().numpy())
        
    preds = np.concatenate(preds)
    
    return preds

In [None]:
def run_training(fold, seed):
    
    seed_everything(seed)
    
    train = process_data(folds)
    test_ = process_data(test)
    
    trn_idx = train[train['kfold'] != fold].index
    val_idx = train[train['kfold'] == fold].index
    
    train_df = train[train['kfold'] != fold].reset_index(drop=True)
    valid_df = train[train['kfold'] == fold].reset_index(drop=True)
    
    x_train, y_train  = train_df[feature_cols].values, train_df[target_cols].values
    x_valid, y_valid =  valid_df[feature_cols].values, valid_df[target_cols].values
    
    train_dataset = MoADataset(x_train, y_train)
    valid_dataset = MoADataset(x_valid, y_valid)
    trainloader = torch.utils.data.DataLoader(train_dataset, batch_size=BATCH_SIZE, shuffle=True)
    validloader = torch.utils.data.DataLoader(valid_dataset, batch_size=BATCH_SIZE, shuffle=False)
    
    model = Model(
        num_features=num_features,
        num_targets=num_targets,
        hidden_size=hidden_size,
    )
    swa_model = AveragedModel(model)
    
    model = model.to(DEVICE)
    
    optimizer = torch.optim.Adam(model.parameters(), lr=LEARNING_RATE, weight_decay=WEIGHT_DECAY)
    scheduler = optim.lr_scheduler.OneCycleLR(optimizer=optimizer, pct_start=0.1, 
                                              div_factor=1e3, max_lr=1e-2, epochs=EPOCHS, steps_per_epoch=len(trainloader))
    swa_scheduler = SWALR(optimizer, swa_lr=LEARNING_RATE, anneal_strategy="cos", anneal_epochs=10)

    
    loss_fn = nn.BCEWithLogitsLoss()
    
    early_stopping_steps = EARLY_STOPPING_STEPS
    early_step = 0
    
    oof = np.zeros((len(train), target.iloc[:, 1:].shape[1]))
    best_loss = np.inf
    
    for epoch in range(EPOCHS):
        
        train_loss = train_fn(swa_model, model, optimizer,scheduler, swa_scheduler, epoch, loss_fn, trainloader, DEVICE)
        print(f"FOLD: {fold}, EPOCH: {epoch}, train_loss: {train_loss}")
        valid_loss, valid_preds = valid_fn(model, loss_fn, validloader, DEVICE)
        print(f"FOLD: {fold}, EPOCH: {epoch}, valid_loss: {valid_loss}")
        
        if(EARLY_STOP == True):
            
            early_step += 1
            if (early_step >= early_stopping_steps):
                oof[val_idx] = valid_preds
                torch.optim.swa_utils.update_bn(trainloader, swa_model)
                torch.save(model.state_dict(), f"FOLD{fold}_.pth")
                break
                
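    # The averaged model was created on the CPU, so move it to the training device
    # and recompute its batch-norm statistics before using it.
    swa_model = swa_model.to(DEVICE)
    torch.optim.swa_utils.update_bn(trainloader, swa_model, device=DEVICE)
    # Score and save the averaged weights; saving `model` would discard the SWA result.
    valid_loss, valid_preds = valid_fn(swa_model, loss_fn, validloader, DEVICE)
    oof[val_idx] = valid_preds
    # `swa_model.module` has the same parameter names as `Model`, so it loads below.
    torch.save(swa_model.module.state_dict(), f"FOLD{fold}_.pth")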
            
    #--------------------- PREDICTION---------------------
    x_test = test_[feature_cols].values
    testdataset = TestDataset(x_test)
    testloader = torch.utils.data.DataLoader(testdataset, batch_size=BATCH_SIZE, shuffle=False)
    
    model = Model(
        num_features=num_features,
        num_targets=num_targets,
        hidden_size=hidden_size,
    )
    
    model.load_state_dict(torch.load(f"FOLD{fold}_.pth"))
    model.to(DEVICE)
    
    predictions = np.zeros((len(test_), target.iloc[:, 1:].shape[1]))
    predictions = inference_fn(model, testloader, DEVICE)
    
    return oof, predictions

In [None]:
def run_k_fold(NFOLDS, seed):
    oof = np.zeros((len(train), len(target_cols)))
    predictions = np.zeros((len(test), len(target_cols)))
    
    for fold in range(NFOLDS):
        oof_, pred_ = run_training(fold, seed)
        
        predictions += pred_ / NFOLDS
        oof += oof_
        
    return oof, predictions

In [None]:
SEED = 777
oof = np.zeros((len(train), len(target_cols)))
predictions = np.zeros((len(test), len(target_cols)))
    
oof_, predictions_ = run_k_fold(NFOLDS, SEED)

oof  += oof_
predictions += predictions_

train[target_cols] = oof
test[target_cols] = predictions

In [None]:
valid_results = train_targets_scored.drop(columns=target_cols).merge(train[['sig_id']+target_cols], on='sig_id', how='left').fillna(0)

In [None]:
y_true = train_targets_scored[target_cols].values
y_pred = valid_results[target_cols].values

In [None]:
score = 0
for i in range(len(target_cols)):
    score_ = log_loss(y_true[:, i], y_pred[:, i])
    # Divide by the number of target columns; target.shape[1] also counts 'sig_id'.
    score += score_ / len(target_cols)
    
print("CV log_loss: ", score)
    

The CV score of the referenced notebook is 0.014650792660668535. There are some differences in implementation, but this notebook seems to perform a little too poorly. If you know a good improvement, please share!

In [None]:
sub = sample_submission.drop(columns=target_cols).merge(test[['sig_id']+target_cols], on='sig_id', how='left').fillna(0)
sub.to_csv('submission.csv', index=False)

In [None]:
sub.shape