<br>
<h1 style = "font-size:60px; font-family:Garamond ; font-weight : normal; background-color: #f6f5f5 ; color : #fe346e; text-align: center; border-radius: 100px 100px;">TPS-December DAE Starter</h1>
<br>

![](https://storage.googleapis.com/kaggle-competitions/kaggle/28007/logos/header.png?t=2021-06-30-01-10-51)

<span style="color: #000508; font-family: Segoe UI; font-size: 1.5em; font-weight: 300;">In this Notebook we will: </span>
* <span style="color: #000508; font-family: Segoe UI; font-size: 1.2em; font-weight: 300;">Train a simple Denoising Autoencoder<br></span>
* <span style="color: #000508; font-family: Segoe UI; font-size: 1.2em; font-weight: 300;">Add Swap noise for injecting noise in the data<br></span>
* <span style="color: #000508; font-family: Segoe UI; font-size: 1.2em; font-weight: 300;">Use MSE loss for continuous features<br></span>
* <span style="color: #000508; font-family: Segoe UI; font-size: 1.2em; font-weight: 300;">Use BCE loss for categorical features<br></span>

# <span><h1 style = "font-family: garamond; font-size: 40px; font-style: normal; letter-spcaing: 3px; background-color: #f6f5f5; color :#fe346e; border-radius: 100px 100px; text-align:center">Pipeline</h1></span>

![](https://i.imgur.com/yVWGsOJ.png)

# <span><h1 style = "font-family: garamond; font-size: 40px; font-style: normal; letter-spcaing: 3px; background-color: #f6f5f5; color :#fe346e; border-radius: 100px 100px; text-align:center">Install Required Libraries</h1></span>

In [None]:
!pip install --upgrade wandb

# <span><h1 style = "font-family: garamond; font-size: 40px; font-style: normal; letter-spcaing: 3px; background-color: #f6f5f5; color :#fe346e; border-radius: 100px 100px; text-align:center">Import Required Libraries 📚</h1></span>

In [None]:
import os
import gc
import time
import numpy as np
import pandas as pd

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.optim import lr_scheduler
from torch.utils.data import DataLoader, Dataset
from torch.cuda import amp

from sklearn.preprocessing import QuantileTransformer

from tqdm import tqdm
from collections import defaultdict

<img src="https://i.imgur.com/gb6B4ig.png" width="400" alt="Weights & Biases" />

<span style="color: #000508; font-family: Segoe UI; font-size: 1.2em; font-weight: 300;"> Weights & Biases (W&B) is a set of machine learning tools that helps you build better models faster. <strong>Kaggle competitions require fast-paced model development and evaluation</strong>. There are a lot of components: exploring the training data, training different models, combining trained models in different combinations (ensembling), and so on.</span>

> <span style="color: #000508; font-family: Segoe UI; font-size: 1.2em; font-weight: 300;">⏳ Lots of components = Lots of places to go wrong = Lots of time spent debugging</span>

<span style="color: #000508; font-family: Segoe UI; font-size: 1.2em; font-weight: 300;">W&B can be useful for Kaggle competitions with its lightweight and interoperable tools:</span>

* <span style="color: #000508; font-family: Segoe UI; font-size: 1.2em; font-weight: 300;">Quickly track experiments,<br></span>
* <span style="color: #000508; font-family: Segoe UI; font-size: 1.2em; font-weight: 300;">Version and iterate on datasets, <br></span>
* <span style="color: #000508; font-family: Segoe UI; font-size: 1.2em; font-weight: 300;">Evaluate model performance,<br></span>
* <span style="color: #000508; font-family: Segoe UI; font-size: 1.2em; font-weight: 300;">Reproduce models,<br></span>
* <span style="color: #000508; font-family: Segoe UI; font-size: 1.2em; font-weight: 300;">Visualize results and spot regressions,<br></span>
* <span style="color: #000508; font-family: Segoe UI; font-size: 1.2em; font-weight: 300;">Share findings with colleagues.</span>

<span style="color: #000508; font-family: Segoe UI; font-size: 1.2em; font-weight: 300;">To learn more about Weights and Biases check out this <strong><a href="https://www.kaggle.com/ayuraj/experiment-tracking-with-weights-and-biases">kernel</a></strong>.</span>

In [None]:
import wandb

try:
    from kaggle_secrets import UserSecretsClient
    user_secrets = UserSecretsClient()
    api_key = user_secrets.get_secret("wandb_api")
    wandb.login(key=api_key)
    anony = None
except Exception:  # not on Kaggle, or the secret is missing: fall back to anonymous logging
    anony = "must"
    print('If you want to use your W&B account, go to Add-ons -> Secrets and provide your W&B access token. Use the Label name as wandb_api. \nGet your W&B access token from here: https://wandb.ai/authorize')

# <h1 style = "font-family: garamond; font-size: 40px; font-style: normal; letter-spcaing: 3px; background-color: #f6f5f5; color :#fe346e; border-radius: 100px 100px; text-align:center">Read the Data 📖</h1>

In [None]:
df_train = pd.read_csv("../input/tabular-playground-series-dec-2021/train.csv")
df_train.head()

In [None]:
df_test = pd.read_csv("../input/tabular-playground-series-dec-2021/test.csv")
df_test.head()

In [None]:
df = pd.concat([df_train, df_test], axis=0)
df.shape

In [None]:
del df_train, df_test

In [None]:
feature_cols = [col for col in df.columns if col not in ['Id', 'Cover_Type']]
target_cols = ['Cover_Type']

cat_cols = [col for col in feature_cols if df[col].nunique() < 10]
cont_cols = [col for col in feature_cols if df[col].nunique() >= 10]
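
<span style="color: #000508; font-family: Segoe UI; font-size: 1.2em; font-weight: 300;">To make the cardinality rule above concrete, here is a minimal sketch that applies the same `nunique() < 10` threshold to a small made-up frame. The frame `toy` and its column names are hypothetical and are not part of the competition data.</span>

```python
import pandas as pd

# Hypothetical toy frame: one binary flag and one wide-ranging numeric column
toy = pd.DataFrame({
    "flag": [0, 1] * 10,
    "elevation": list(range(20)),
})

toy_features = list(toy.columns)
# Fewer than 10 distinct values -> treated as categorical, otherwise continuous
toy_cat = [c for c in toy_features if toy[c].nunique() < 10]
toy_cont = [c for c in toy_features if toy[c].nunique() >= 10]
print(toy_cat, toy_cont)
```

<span style="color: #000508; font-family: Segoe UI; font-size: 1.2em; font-weight: 300;">Note that the threshold is a heuristic: a low-cardinality integer column would land in the categorical group even if it is not a 0/1 indicator.</span>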

# <span><h1 style = "font-family: garamond; font-size: 40px; font-style: normal; letter-spcaing: 3px; background-color: #f6f5f5; color :#fe346e; border-radius: 100px 100px; text-align:center">Training Configuration ⚙️</h1></span>

In [None]:
CONFIG = {
    "seed": 42,
    "epochs": 10,
    "train_batch_size": 1024,
    "learning_rate": 1e-3,
    "T_max": 2000,
    "min_lr": 1e-5,
    "cat_weight": 1./3,
    "cont_weight": 2./3,
    "device": torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
}

# <span><h1 style = "font-family: garamond; font-size: 40px; font-style: normal; letter-spcaing: 3px; background-color: #f6f5f5; color :#fe346e; border-radius: 100px 100px; text-align:center">Set Seed for Reproducibility</h1></span>

In [None]:
def set_seed(seed = 42):
    '''Sets the seed of the entire notebook so results are the same every time we run.
    This is for REPRODUCIBILITY.'''
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)
    # When running on the CuDNN backend, two further options must be set
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
    # Set a fixed value for the hash seed
    os.environ['PYTHONHASHSEED'] = str(seed)
    
set_seed(CONFIG["seed"])
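
<span style="color: #000508; font-family: Segoe UI; font-size: 1.2em; font-weight: 300;">A quick way to check that seeding behaves as expected is to draw random numbers twice with the same seed and compare them. The helper `seeded_draw` below is a hypothetical name used only for this sketch, and it covers only the NumPy and PyTorch CPU generators.</span>

```python
import numpy as np
import torch

def seeded_draw(seed):
    # Re-seed both generators, then draw one sample from each
    np.random.seed(seed)
    torch.manual_seed(seed)
    return np.random.rand(3), torch.rand(3)

np_a, torch_a = seeded_draw(42)
np_b, torch_b = seeded_draw(42)

same_np = np.array_equal(np_a, np_b)
same_torch = torch.equal(torch_a, torch_b)
print(same_np, same_torch)
```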

<span style="color: #000508; font-family: Segoe UI; font-size: 1.5em; font-weight: 300;">Normalize Continuous Features using Quantile Transformer.</span>
<span style="color: #000508; font-family: Segoe UI; font-size: 1.5em; font-weight: 300;">Check the official documentation <a href="https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.QuantileTransformer.html">here</a>.</span>

In [None]:
qt = QuantileTransformer(output_distribution='normal')
df[cont_cols] = qt.fit_transform(df[cont_cols])
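
<span style="color: #000508; font-family: Segoe UI; font-size: 1.2em; font-weight: 300;">The sketch below shows what the transform does on a single skewed feature: the values are mapped to an approximately standard normal shape while their rank order is kept. The synthetic data, the `n_quantiles=100` setting and the variable names are assumptions made for illustration.</span>

```python
import numpy as np
from sklearn.preprocessing import QuantileTransformer

rng = np.random.default_rng(0)
skewed = rng.exponential(scale=2.0, size=(1000, 1))  # hypothetical skewed feature

qt_demo = QuantileTransformer(n_quantiles=100, output_distribution="normal",
                              random_state=0)
gaussianized = qt_demo.fit_transform(skewed)

# The mapping is monotonic: sorting by the raw values also sorts the outputs
sorted_out = gaussianized[np.argsort(skewed[:, 0]), 0]
is_monotonic = bool(np.all(np.diff(sorted_out) >= 0))
print(float(np.median(gaussianized)), is_monotonic)
```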

# <span><h1 style = "font-family: garamond; font-size: 40px; font-style: normal; letter-spcaing: 3px; background-color: #f6f5f5; color :#fe346e; border-radius: 100px 100px; text-align:center">Dataset Class</h1></span>

In [None]:
class TPSDecDataset(Dataset):
    def __init__(self, df):
        self.df = df
        self.cat_features = df[cat_cols].values
        self.cont_features = df[cont_cols].values
        
    def __len__(self):
        return len(self.df)
    
    def __getitem__(self, index):
        X_cat = self.cat_features[index]
        X_cont = self.cont_features[index]
        
        return X_cat, X_cont

# <span><h1 style = "font-family: garamond; font-size: 40px; font-style: normal; letter-spcaing: 3px; background-color: #f6f5f5; color :#fe346e; border-radius: 100px 100px; text-align:center">Create Model</h1></span>

In [None]:
class DenoisingAutoEncoder(nn.Module):
    def __init__(self):
        super(DenoisingAutoEncoder, self).__init__()
        self.encoder = nn.Sequential(
            nn.Linear(len(cat_cols) + len(cont_cols), 100),
            nn.BatchNorm1d(100),
            nn.ReLU(),
            nn.Linear(100, 200)
        )
        self.decoder = nn.Sequential(
            nn.Linear(200, 100),
            nn.BatchNorm1d(100),
            nn.ReLU(),
        )
        self.decoder_cat_head = nn.Linear(100, len(cat_cols))
        self.decoder_cont_head = nn.Linear(100, len(cont_cols))
        
    def extract(self, x):
        features = self.encoder(x)
        return features
        
    def forward(self, x):
        features = self.encoder(x)
        output = self.decoder(F.relu(features))
        cat_output = self.decoder_cat_head(output)
        cont_output = self.decoder_cont_head(output)
        
        return cat_output, cont_output
    
model = DenoisingAutoEncoder()
model.to(CONFIG['device']);
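
<span style="color: #000508; font-family: Segoe UI; font-size: 1.2em; font-weight: 300;">To see the shapes flowing through this architecture, here is a self-contained sketch of the same layout with made-up feature counts. `TinyDAE`, `N_CAT` and `N_CONT` are hypothetical names for this example. The 200-dimensional output of `extract` is the representation a downstream model would typically consume.</span>

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

N_CAT, N_CONT = 4, 6  # hypothetical feature counts

class TinyDAE(nn.Module):
    def __init__(self, n_cat, n_cont):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(n_cat + n_cont, 100), nn.BatchNorm1d(100), nn.ReLU(),
            nn.Linear(100, 200),
        )
        self.decoder = nn.Sequential(
            nn.Linear(200, 100), nn.BatchNorm1d(100), nn.ReLU(),
        )
        # One reconstruction head per feature group
        self.cat_head = nn.Linear(100, n_cat)
        self.cont_head = nn.Linear(100, n_cont)

    def extract(self, x):
        return self.encoder(x)

    def forward(self, x):
        hidden = self.decoder(F.relu(self.encoder(x)))
        return self.cat_head(hidden), self.cont_head(hidden)

dae = TinyDAE(N_CAT, N_CONT).eval()  # eval() so BatchNorm uses running statistics
with torch.no_grad():
    x = torch.randn(8, N_CAT + N_CONT)
    cat_out, cont_out = dae(x)
    feats = dae.extract(x)
print(cat_out.shape, cont_out.shape, feats.shape)
```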

# <span><h1 style = "font-family: garamond; font-size: 40px; font-style: normal; letter-spcaing: 3px; background-color: #f6f5f5; color :#fe346e; border-radius: 100px 100px; text-align:center">Loss Function</h1></span>

In [None]:
def cat_criterion(cat_outputs, cat_targets):
    return nn.BCEWithLogitsLoss()(cat_outputs, cat_targets)

def cont_criterion(cont_outputs, cont_targets):
    return nn.MSELoss()(cont_outputs.view(-1), cont_targets.view(-1))
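
<span style="color: #000508; font-family: Segoe UI; font-size: 1.2em; font-weight: 300;">The two criteria are later combined as a weighted sum. The sketch below computes that sum on random tensors with the 1/3 and 2/3 weights from the configuration, and checks that `BCEWithLogitsLoss` matches a sigmoid followed by `BCELoss`. The tensors are made up, and the example assumes the categorical targets are 0/1 indicators, which BCE requires.</span>

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
cat_logits = torch.randn(5, 3)                      # raw scores, no sigmoid applied
cat_targets = torch.randint(0, 2, (5, 3)).float()   # 0/1 indicator targets
cont_preds = torch.randn(5, 2)
cont_targets = torch.randn(5, 2)

cat_loss = nn.BCEWithLogitsLoss()(cat_logits, cat_targets)
cont_loss = nn.MSELoss()(cont_preds, cont_targets)
total = (1. / 3) * cat_loss + (2. / 3) * cont_loss

# BCEWithLogitsLoss folds the sigmoid into the loss for numerical stability
manual_bce = nn.BCELoss()(torch.sigmoid(cat_logits), cat_targets)
print(cat_loss.item(), cont_loss.item(), total.item())
```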

# <span><h1 style = "font-family: garamond; font-size: 40px; font-style: normal; letter-spcaing: 3px; background-color: #f6f5f5; color :#fe346e; border-radius: 100px 100px; text-align:center">Swap Noise</h1></span>

<span style="color: #000508; font-family: Segoe UI; font-size: 1.5em; font-weight: 300;">Swap Noise: each cell is replaced, with probability equal to the noise ratio, by the value of the same column taken from a randomly chosen row.</span>

<blockquote> “15% Swap Noise is a good start value.” <br> 
    - <strong>Michael Jahrer</strong>, Porto Seguro Safe Driver Competition Winner
 </blockquote>   

<span style="color: #000508; font-family: Segoe UI; font-size: 1.5em; font-weight: 300;">Implementation borrowed from <a href="https://www.kaggle.com/c/tabular-playground-series-jan-2021/discussion/216070">here</a></span>

In [None]:
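def add_swap_noise(X, ratio=.15, return_mask=False):
    # Mark each cell for corruption independently with probability `ratio`
    obfuscation_mask = torch.bernoulli(ratio * torch.ones(X.shape)).to(X.device)
    # Marked cells take the value of the same column from a randomly permuted row
    obfuscated_X = torch.where(obfuscation_mask == 1, X[torch.randperm(X.shape[0])], X)
    
    if return_mask:
        return obfuscated_X, obfuscation_mask
    
    return obfuscated_X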

In [None]:
def test_swap_noise():
    X_rand = torch.randn(6, 8)
    print("Original Array")
    print(X_rand)
    
    X_noise = add_swap_noise(X_rand)
    print("Array after noise")
    print(X_noise)
    
test_swap_noise()
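
<span style="color: #000508; font-family: Segoe UI; font-size: 1.2em; font-weight: 300;">Printing random arrays makes it hard to see that values only move within a column. The sketch below uses a matrix in which every column has its own value range, so a cross-column leak would be obvious. `swap_noise_demo` is a hypothetical stand-alone copy of the same idea, written so that this example runs on its own.</span>

```python
import torch

def swap_noise_demo(X, ratio=0.15):
    # Marked cells copy the same column from a row of a shuffled copy of X
    mask = torch.bernoulli(torch.full(X.shape, ratio))
    shuffled = X[torch.randperm(X.shape[0])]
    return torch.where(mask == 1, shuffled, X), mask

torch.manual_seed(0)
# Column j only holds values in [100*j, 100*j + 50)
rows = torch.arange(50, dtype=torch.float).unsqueeze(1)
X = rows + 100 * torch.arange(4, dtype=torch.float)
X_noisy, mask = swap_noise_demo(X)

# Every noisy value must already exist in the same column of the original
stays_in_column = all(
    set(X_noisy[:, j].tolist()) <= set(X[:, j].tolist())
    for j in range(X.shape[1])
)
changed = (X_noisy != X)
print(stays_in_column, changed.float().mean().item())
```

<span style="color: #000508; font-family: Segoe UI; font-size: 1.2em; font-weight: 300;">The fraction of cells that actually change can be slightly below the noise ratio, because a marked cell may receive a value equal to its own.</span>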

# <span><h1 style = "font-family: garamond; font-size: 40px; font-style: normal; letter-spcaing: 3px; background-color: #f6f5f5; color :#fe346e; border-radius: 100px 100px; text-align:center">Training Function</h1></span>

In [None]:
def train_one_epoch(model, optimizer, scheduler, dataloader, device, epoch):
    model.train()
    scaler = amp.GradScaler()
    
    dataset_size = 0
    running_loss = 0.0
    
    bar = tqdm(enumerate(dataloader), total=len(dataloader))
    for step, (X_cat, X_cont) in bar:         
        X_cat = X_cat.to(device, dtype=torch.float)
        X_cont = X_cont.to(device, dtype=torch.float)
        
        batch_size = X_cat.size(0)
        
        # add_swap_noise keeps the device and dtype of its input
        X_cat_noise = add_swap_noise(X_cat)
        X_cont_noise = add_swap_noise(X_cont)
        
        with amp.autocast(enabled=True):
            X_noise = torch.cat([X_cat_noise, X_cont_noise], dim=1)
            cat_outputs, cont_outputs = model(X_noise)
            cat_loss = cat_criterion(cat_outputs, X_cat)
            cont_loss = cont_criterion(cont_outputs, X_cont)
            loss = CONFIG['cat_weight']*cat_loss + CONFIG['cont_weight']*cont_loss

        # Log all three losses in one call so they share the same W&B step
        wandb.log({"Categorical Loss": cat_loss.item(),
                   "Continuous Loss": cont_loss.item(),
                   "Total Loss": loss.item()})
        
        scaler.scale(loss).backward()
        
        scaler.step(optimizer)
        scaler.update()

        # zero the parameter gradients
        optimizer.zero_grad()
        
        if scheduler is not None:
            scheduler.step()
                
        running_loss += (loss.item() * batch_size)
        dataset_size += batch_size
        
        epoch_loss = running_loss / dataset_size
        
        bar.set_postfix(Epoch=epoch, Train_Loss=epoch_loss,
                        LR=optimizer.param_groups[0]['lr'])
    gc.collect()
    
    return epoch_loss

# <span><h1 style = "font-family: garamond; font-size: 40px; font-style: normal; letter-spcaing: 3px; background-color: #f6f5f5; color :#fe346e; border-radius: 100px 100px; text-align:center">Run Training</h1></span>

In [None]:
def run_training(model, optimizer, scheduler, device, num_epochs):
    # To automatically log gradients
    wandb.watch(model, log_freq=100)
    
    if torch.cuda.is_available():
        print("[INFO] Using GPU: {}\n".format(torch.cuda.get_device_name()))
    
    start = time.time()
    
    for epoch in range(1, num_epochs + 1):
        train_epoch_loss = train_one_epoch(model, optimizer, scheduler,
                                           train_loader, device, epoch)
        print()
            
    end = time.time()
    time_elapsed = end - start
    print('Training complete in {:.0f}h {:.0f}m {:.0f}s'.format(
        time_elapsed // 3600, (time_elapsed % 3600) // 60, (time_elapsed % 3600) % 60))
    
    torch.save(model.state_dict(), 'model.bin')
    
    return model

<span style="color: #000508; font-family: Segoe UI; font-size: 1.5em; font-weight: 300;">Prepare Dataloader</span>

In [None]:
train_dataset = TPSDecDataset(df)
train_loader = DataLoader(train_dataset, batch_size=CONFIG['train_batch_size'], 
                          num_workers=2, shuffle=True, pin_memory=True)

<span style="color: #000508; font-family: Segoe UI; font-size: 1.5em; font-weight: 300;">Define Optimizer & Scheduler</span>

In [None]:
optimizer = optim.Adam(model.parameters(), lr=CONFIG['learning_rate'])
scheduler = lr_scheduler.CosineAnnealingLR(optimizer, T_max=CONFIG['T_max'], eta_min=CONFIG['min_lr'])
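
<span style="color: #000508; font-family: Segoe UI; font-size: 1.2em; font-weight: 300;">Because the scheduler is stepped once per batch, `T_max` counts batches rather than epochs. The sketch below traces the learning rate for a dummy parameter with the same settings. It suggests that the rate reaches `eta_min` after `T_max` steps and climbs back towards the initial value if stepping continues. The dummy parameter and the 4000-step horizon are assumptions for illustration.</span>

```python
import torch
from torch import nn, optim
from torch.optim import lr_scheduler

param = nn.Parameter(torch.zeros(1))  # dummy parameter, never trained
opt = optim.Adam([param], lr=1e-3)
sched = lr_scheduler.CosineAnnealingLR(opt, T_max=2000, eta_min=1e-5)

lrs = []
for step in range(4000):
    lrs.append(opt.param_groups[0]["lr"])
    opt.step()    # no gradients; called only to keep the optimizer/scheduler order
    sched.step()
lrs.append(opt.param_groups[0]["lr"])

print(lrs[0], lrs[1000], lrs[2000], lrs[4000])
```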

<span style="color: #000508; font-family: Segoe UI; font-size: 1.5em; font-weight: 300;">Start Training</span>

In [None]:
run = wandb.init(project='TPS-Dec', 
                 config=CONFIG,
                 job_type='Train',
                 anonymous=anony)  # None if the W&B login above succeeded, "must" otherwise

In [None]:
model = run_training(model, optimizer, scheduler, 
                     device=CONFIG['device'], 
                     num_epochs=CONFIG['epochs'])

In [None]:
run.finish()

# <span><h1 style = "font-family: garamond; font-size: 40px; font-style: normal; letter-spcaing: 3px; background-color: #f6f5f5; color :#fe346e; border-radius: 100px 100px; text-align:center">Visualization</h1></span>

<span style="color: #000508; font-family: Segoe UI; font-size: 1.5em; font-weight: 300;"><a href="https://wandb.ai/dchanda/TPS-Dec/runs/9iw31x0f">View the Complete Dashboard Here ⮕</a></span>

<span style="color: #000508; font-family: Segoe UI; font-size: 1.5em; font-weight: 300;">Loss Curves</span>

![](https://i.imgur.com/nkNjiwP.jpg)

<span style="color: #000508; font-family: Segoe UI; font-size: 1.5em; font-weight: 300;">Model Gradients</span>
![](https://i.imgur.com/JvHoqdg.jpg)

![Upvote!](https://img.shields.io/badge/Upvote-If%20you%20like%20my%20work-07b3c8?style=for-the-badge&logo=kaggle)