www.kaggle.com/hiramcho/moa-tabnet-with-pca-rank-gauss  
v1: preprocessing w/ the following:  
- no ctl_vehicle
- Rank Gauss Process
- PCA
- OneHot

<div class = "alert alert-block alert-info">
    <h1><font color = "red">DISCLAIMER</font></h1>
    <p>The following notebook it's highly based on the works <a href = "https://www.kaggle.com/optimo/tabnetregressor-2-0-train-infer">TabNetRegressor 2.0 [TRAIN + INFER]</a>, <a href = "https://www.kaggle.com/liuhdme/moa-competition/data">MOA competition</a> and <a href = "https://www.kaggle.com/kushal1506/moa-pytorch-0-01859-rankgauss-pca-nn/data?select=train_targets_scored.csv">
MoA | Pytorch | 0.01859 | RankGauss | PCA | NN</a>, please check it out. I have to add that i don't make this notebook for "upvotes" but feedback.</p>
</div>

# <font color = "seagreen">Preambule</font>

I made this notebook to share some experiments (see the sections "Experiments") which could help to someone who don't want to wast their daily "submissions", but more importantly, to get feedback about what i could change to achive a better CV. Moreover, the easiness of TabNet to overfit the data it's disturbing. In the section "Conclusion" i share my opinion about the fine-tuning process of TabNet.

## <font color = "green">Installing Libraries</font>

In [1]:
# TabNet
!pip install /kaggle/input/packages-for-lishmoa/pytorch_tabnet-2.0.0-py3-none-any.whl
# Iterative Stratification
!pip install /kaggle/input/packages-for-lishmoa/iterative_stratification-0.1.6-py3-none-any.whl

Processing /kaggle/input/packages-for-lishmoa/pytorch_tabnet-2.0.0-py3-none-any.whl
Installing collected packages: pytorch-tabnet
Successfully installed pytorch-tabnet-2.0.0
Processing /kaggle/input/packages-for-lishmoa/iterative_stratification-0.1.6-py3-none-any.whl
Installing collected packages: iterative-stratification
Successfully installed iterative-stratification-0.1.6


## <font color = "green">Loading Libraries</font>

In [2]:
### General ###
import os
import sys
import copy
import gc
import tqdm
import pickle
import random
import warnings
warnings.filterwarnings("ignore")
sys.path.append("../input/packages-for-lishmoa/")
os.environ["CUDA_LAUNCH_BLOCKING"] = '1'

### Data Wrangling ###
import numpy as np
import pandas as pd
from scipy import stats
from gauss_rank_scaler import GaussRankScaler

### Data Visualization ###
import seaborn as sns
import matplotlib.pyplot as plt
plt.style.use("fivethirtyeight")

### Machine Learning ###
from sklearn.decomposition import PCA
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import roc_auc_score, log_loss
from sklearn.preprocessing import QuantileTransformer
from sklearn.feature_selection import VarianceThreshold
from iterstrat.ml_stratifiers import MultilabelStratifiedKFold

### Deep Learning ###
import torch
from torch import nn
import torch.optim as optim
from torch.nn import functional as F
from torch.nn.modules.loss import _WeightedLoss
from torch.utils.data import DataLoader, Dataset
from torch.optim.lr_scheduler import ReduceLROnPlateau
# Tabnet 
from pytorch_tabnet.metrics import Metric
from pytorch_tabnet.tab_model import TabNetRegressor

### Make prettier the prints ###
from colorama import Fore
c_ = Fore.CYAN
m_ = Fore.MAGENTA
r_ = Fore.RED
b_ = Fore.BLUE
y_ = Fore.YELLOW
g_ = Fore.GREEN

## <font color = "green">Reproducibility</font>

In [3]:
seed = 42

def set_seed(seed):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    os.environ["PYTHONHASHSEED"] = str(seed)
    
    if torch.cuda.is_available():
        torch.cuda.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)
        torch.backends.cudnn.deterministic = True
        torch.backends.cudnn.benchmark = False
set_seed(seed)

## <font color = "green">Configuration</font>

In [4]:
# Parameters
data_path = "../input/lish-moa/"
no_ctl = False
no_nonscored = True
scale = "rankgauss"
variance_threshould = 0.7
decompo = "PCA"
ncompo_genes = 80
ncompo_cells = 10
encoding = "dummy"

## <font color = "green">Loading the Data</font>

In [5]:
train = pd.read_csv(data_path + "train_features.csv")
#train.drop(columns = ["sig_id"], inplace = True)

targets = pd.read_csv(data_path + "train_targets_scored.csv")
#train_targets_scored.drop(columns = ["sig_id"], inplace = True)

if not no_nonscored:
    train_targets_nonscored = pd.read_csv(data_path + "train_targets_nonscored.csv")
    train = train.merge(train_targets_nonscored, on='sig_id')

test = pd.read_csv(data_path + "test_features.csv")
#test.drop(columns = ["sig_id"], inplace = True)

submission = pd.read_csv(data_path + "sample_submission.csv")

# <font color = "seagreen">Preprocessing and Feature Engineering</font>

In [6]:
if no_ctl:
    # cp_type == ctl_vehicle
    print(b_, "not_ctl")
    train = train[train["cp_type"] != "ctl_vehicle"]
    test = test[test["cp_type"] != "ctl_vehicle"]
    targets = targets.iloc[train.index]
    train.reset_index(drop = True, inplace = True)
    test.reset_index(drop = True, inplace = True)
    targets.reset_index(drop = True, inplace = True)

## <font color = "green">Distributions Before Rank Gauss and PCA</font>

In [7]:
def distributions(num, graphs, items, features, gorc):
    """
    Plot the distributions of gene expression or cell viability data
    """
    for i in range(0, num - 1, 7):
        if i >= 3:
            break
        idxs = list(np.array([0, 1, 2, 3, 4, 5, 6]) + i)
    
        fig, axs = plt.subplots(1, 7, sharey = True)
        for k, item in enumerate(idxs):
            if item >= items:
                break
            graph = sns.distplot(train[features].values[:, item], ax = axs[k])
            graph.set_title(f"{gorc}-{item}")
            graphs.append(graph)

In [8]:
GENES = [col for col in train.columns if col.startswith("g-")]
CELLS = [col for col in train.columns if col.startswith("c-")]

### <font color = "green">Distributions of the Train Set</font>

In [9]:
# gnum = train[GENES].shape[1]
# graphs = []

# distributions(gnum, graphs, 771, GENES, "g")

In [10]:
# cnum = train[CELLS].shape[1]
# graphs = []

# distributions(cnum, graphs, 100, CELLS, "c")

### <font color = "green">Distributions of the Test Set</font>

In [11]:
# gnum = test[GENES].shape[1]
# graphs = []

# distributions(gnum, graphs, 771, GENES, "g")

In [12]:
# cnum = test[CELLS].shape[1]
# graphs = []

# distributions(cnum, graphs, 100, CELLS, "c")

## <font color = "green">Rank Gauss Process</font>

In [13]:
data_all = pd.concat([train, test], ignore_index = True)
cols_numeric = [feat for feat in list(data_all.columns) if feat not in ["sig_id", "cp_type", "cp_time", "cp_dose"]]
mask = (data_all[cols_numeric].var() >= variance_threshould).values
tmp = data_all[cols_numeric].loc[:, mask]
data_all = pd.concat([data_all[["sig_id", "cp_type", "cp_time", "cp_dose"]], tmp], axis = 1)
cols_numeric = [feat for feat in list(data_all.columns) if feat not in ["sig_id", "cp_type", "cp_time", "cp_dose"]]

In [14]:
def scale_minmax(col):
    return (col - col.min()) / (col.max() - col.min())

def scale_norm(col):
    return (col - col.mean()) / col.std()

if scale == "boxcox":
    print(b_, "boxcox")
    data_all[cols_numeric] = data_all[cols_numeric].apply(scale_minmax, axis = 0)
    trans = []
    for feat in cols_numeric:
        trans_var, lambda_var = stats.boxcox(data_all[feat].dropna() + 1)
        trans.append(scale_minmax(trans_var))
    data_all[cols_numeric] = np.asarray(trans).T
    
elif scale == "norm":
    print(b_, "norm")
    data_all[cols_numeric] = data_all[cols_numeric].apply(scale_norm, axis = 0)
    
elif scale == "minmax":
    print(b_, "minmax")
    data_all[cols_numeric] = data_all[cols_numeric].apply(scale_minmax, axis = 0)
    
elif scale == "rankgauss":
    ### Rank Gauss ###
    print(b_, "Rank Gauss")
    scaler = GaussRankScaler()
    data_all[cols_numeric] = scaler.fit_transform(data_all[cols_numeric])
    
else:
    pass

[34m Rank Gauss


## <font color = "green">Principal Component Analysis</font>

In [15]:
# PCA
if decompo == "PCA":
    print(b_, "PCA")
    GENES = [col for col in data_all.columns if col.startswith("g-")]
    CELLS = [col for col in data_all.columns if col.startswith("c-")]
    
    pca_genes = PCA(n_components = ncompo_genes,
                    random_state = seed).fit_transform(data_all[GENES])
    pca_cells = PCA(n_components = ncompo_cells,
                    random_state = seed).fit_transform(data_all[CELLS])
    
    pca_genes = pd.DataFrame(pca_genes, columns = [f"pca_g-{i}" for i in range(ncompo_genes)])
    pca_cells = pd.DataFrame(pca_cells, columns = [f"pca_c-{i}" for i in range(ncompo_cells)])
    data_all = pd.concat([data_all, pca_genes, pca_cells], axis = 1)
else:
    pass

[34m PCA


## <font color = "green">One Hot</font>

In [16]:
# Encoding
if encoding == "lb":
    print(b_, "Label Encoding")
    for feat in ["cp_time", "cp_dose"]:
        data_all[feat] = LabelEncoder().fit_transform(data_all[feat])
elif encoding == "dummy":
    print(b_, "One-Hot")
    data_all = pd.get_dummies(data_all, columns = ["cp_time", "cp_dose"])

[34m One-Hot


In [17]:
GENES = [col for col in data_all.columns if col.startswith("g-")]
CELLS = [col for col in data_all.columns if col.startswith("c-")]

for stats in tqdm.tqdm(["sum", "mean", "std", "kurt", "skew"]):
    data_all["g_" + stats] = getattr(data_all[GENES], stats)(axis = 1)
    data_all["c_" + stats] = getattr(data_all[CELLS], stats)(axis = 1)    
    data_all["gc_" + stats] = getattr(data_all[GENES + CELLS], stats)(axis = 1)

100%|██████████| 5/5 [00:05<00:00,  1.11s/it]


## <font color = "green">Distributions After Rank Gauss and PCA</font>

In [18]:
def distributions(num, graphs, items, features, gorc):
    """
    Plot the distributions of gene expression or cell viability data
    """
    for i in range(0, num - 1, 7):
        if i >= 3:
            break
        idxs = list(np.array([0, 1, 2, 3, 4, 5, 6]) + i)
    
        fig, axs = plt.subplots(1, 7, sharey = True)
        for k, item in enumerate(idxs):
            if item >= items:
                break
            graph = sns.distplot(data_all[features].values[:, item], ax = axs[k])
            graph.set_title(f"{gorc}-{item}")
            graphs.append(graph)

### <font color = "green">Distributions of "data_all"</font>

In [19]:
# gnum = data_all[GENES].shape[1]
# graphs = []

# distributions(gnum, graphs, 771, GENES, "g")

In [20]:
# cnum = data_all[CELLS].shape[1]
# graphs = []

# distributions(cnum, graphs, 100, CELLS, "c")

We can confirme that the shapes of data got close to the normal distribution.

In [21]:
with open("data_all.pickle", "wb") as f:
    pickle.dump(data_all, f)

In [22]:
with open("data_all.pickle", "rb") as f:
    data_all = pickle.load(f)

In [23]:
# train_df and test_df
features_to_drop = ["sig_id", "cp_type"]
data_all.drop(features_to_drop, axis = 1, inplace = True)
try:
    targets.drop("sig_id", axis = 1, inplace = True)
except:
    pass
train_df = data_all[: train.shape[0]]
train_df.reset_index(drop = True, inplace = True)
# The following line it's a bad practice in my opinion, targets on train set
#train_df = pd.concat([train_df, targets], axis = 1)
test_df = data_all[train_df.shape[0]: ]
test_df.reset_index(drop = True, inplace = True)

In [24]:
print(f"{b_}train_df.shape: {r_}{train_df.shape}")
print(f"{b_}test_df.shape: {r_}{test_df.shape}")

[34mtrain_df.shape: [31m(23814, 950)
[34mtest_df.shape: [31m(3982, 950)


In [25]:
X_test = test_df.values
print(f"{b_}X_test.shape: {r_}{X_test.shape}")

[34mX_test.shape: [31m(3982, 950)


# <font color = "seagreen">Experiments</font>

I just want to point that the [original work](https://www.kaggle.com/optimo/tabnetregressor-2-0-train-infer) achive a CV of 0.015532370835690834 and a LB score of 0.01864. Some of the experiments that i made with their changes:


- CV: 0.01543560538566987, LB: 0.01858, best LB that i could achive, changes
    - `n_a` = 32 instead of 24
    - `n_d` = 32 instead of 24
- CV: 0.015282077428722094, LB: 0.01862, best CV that i could achive, changes (Version 5):
    - `n_a` = 32 instead of 24
    - `n_d` = 32 instead of 24
    - `virtual_batch_size` = 32, instead of 128
    - `seed` = 42 instead of 0
- CV: 0.015330138325308062, LB: 01864, the same LB that the original but better CV, changes:
    - `n_a` = 32 instead of 24
    - `n_d` = 32 instead of 24
    - `virtual_batch_size` = 64, instead of 128
    - `batch_size` = 512, instead of 1024
- CV: 0.015361751699863063, LB: 0.01863, better LB and CV than the original, changes:
    - `n_a` = 32 instead of 24
    - `n_d` = 32 instead of 24
    - `virtual_batch_size` = 64, instead of 128
- CV: 0.015529925324634975, LB: 0.01865, changes:
    - `n_a` = 48 instead of 24
    - `n_d` = 48 instead of 24
- CV: 0.015528553520924939, LB: 0.01868, changes:
    - `n_a` = 12 instead of 24
    - `n_d` = 12 instead of 24
- CV: 0.015870202970324317, LB: 0.01876, worst CV and LB score, changes:
    - `n_a` = 12 instead of 24
    - `n_d` = 12 instead of 24
    - `batch_size` = 2048, instead of 1024
    
    
As you can see if `batch_size` < 1024 and > 1024 give worst results. Something similar happens with `n_a` and `n_d`, if their values are lower or higher than 32 the results are worst.


## <font color = "green">Versions</font>

- **Version 5**: I added the `seed` parameter to the TabNet model.
- **Version 6**: I changed the `virtual_batch_size` to 24
    - CV: 0.01532900616425282, LB: 0.01862, changes:
        - `n_a` = 32 instead of 24
        - `n_d` = 32 instead of 24
        - `virtual_batch_size` = 24, instead of 128
        - `seed` = 42 instead of 0
- **Version 7**: PCA, Rank Gauss

# <font color = "seagreen">Modeling</font>

## <font color = "green">Model Parameters</font>

In [26]:
MAX_EPOCH = 200
# n_d and n_a are different from the original work, 32 instead of 24
# This is the first change in the code from the original
tabnet_params = dict(
    n_d = 32,
    n_a = 32,
    n_steps = 1,
    gamma = 1.3,
    lambda_sparse = 0,
    optimizer_fn = optim.Adam,
    optimizer_params = dict(lr = 2e-2, weight_decay = 1e-5),
    mask_type = "entmax",
    scheduler_params = dict(
        mode = "min", patience = 5, min_lr = 1e-5, factor = 0.9),
    scheduler_fn = ReduceLROnPlateau,
    seed = seed,
    verbose = 10
)

## <font color = "green">Custom Metric</font>

In [27]:
class LogitsLogLoss(Metric):
    """
    LogLoss with sigmoid applied
    """

    def __init__(self):
        self._name = "logits_ll"
        self._maximize = False

    def __call__(self, y_true, y_pred):
        """
        Compute LogLoss of predictions.

        Parameters
        ----------
        y_true: np.ndarray
            Target matrix or vector
        y_score: np.ndarray
            Score matrix or vector

        Returns
        -------
            float
            LogLoss of predictions vs targets.
        """
        logits = 1 / (1 + np.exp(-y_pred))
        aux = (1 - y_true) * np.log(1 - logits + 1e-15) + y_true * np.log(logits + 1e-15)
        return np.mean(-aux)

# <font color = "seagreen">Training</font>

In [28]:
scores_auc_all = []
test_cv_preds = []

NB_SPLITS = 5 # 7
mskf = MultilabelStratifiedKFold(n_splits = NB_SPLITS, random_state = 0, shuffle = True)

oof_preds = []
oof_targets = []
scores = []
scores_auc = []

# <font color = "seagreen">Setting New CV</font>

In [29]:
SEED = 42
feats = pd.read_csv("/kaggle/input/lish-moa/train_features.csv")
feats = feats[feats["cp_type"] != "ctl_vehicle"]
scored = pd.read_csv("/kaggle/input/lish-moa/train_targets_scored.csv")
scored = scored.iloc[feats.index]
feats.reset_index(drop=True, inplace=True)
scored.reset_index(drop=True, inplace=True)
drug = pd.read_csv("/kaggle/input/lish-moa/train_drug.csv")
tgts = scored.columns[1:]
scored = scored.merge(drug, on="sig_id", how="left")

# LOCATE DRUGS
vc = scored.drug_id.value_counts()
vc1 = vc.loc[vc <= 18].index.sort_values()
vc2 = vc.loc[vc > 18].index.sort_values()

# STRATIFY DRUGS 18X OR LESS
dct1 = {}
dct2 = {}
skf = MultilabelStratifiedKFold(n_splits=NB_SPLITS, shuffle=True, random_state=SEED)
tmp = scored.groupby("drug_id")[tgts].mean().loc[vc1]
for fold, (idxT, idxV) in enumerate(skf.split(tmp, tmp[tgts])):
    dd = {k: fold for k in tmp.index[idxV].values}
    dct1.update(dd)

# STRATIFY DRUGS MORE THAN 18X
skf = MultilabelStratifiedKFold(n_splits=NB_SPLITS, shuffle=True, random_state=SEED)
tmp = scored.loc[scored.drug_id.isin(vc2)].reset_index(drop=True)
for fold, (idxT, idxV) in enumerate(skf.split(tmp, tmp[tgts])):
    dd = {k: fold for k in tmp.sig_id[idxV].values}
    dct2.update(dd)

# ASSIGN FOLDS
scored["kfold"] = scored.drug_id.map(dct1)
scored.loc[scored.kfold.isna(), "kfold"] = scored.loc[scored.kfold.isna(), "sig_id"].map(
    dct2
)
kfold_idx = scored.kfold.astype("int8")
train_indices, val_indices = [], []
for fold_nb in range(NB_SPLITS):
    train_indices.append(kfold_idx[kfold_idx!=fold_nb].index.values)
    val_indices.append(kfold_idx[kfold_idx==fold_nb].index.values)
del feats, scored, drug, tgts, vc, vc1, vc2, dct1, dct2, skf, tmp
gc.collect()

24

In [30]:
for fold_nb, (train_idx, val_idx) in enumerate(zip(train_indices, val_indices)):
# for fold_nb, (train_idx, val_idx) in enumerate(mskf.split(train_df, targets)):
    print(b_,"FOLDS: ", r_, fold_nb + 1)
    print(g_, '*' * 60, c_)
    
    X_train, y_train = train_df.values[train_idx, :], targets.values[train_idx, :]
    X_val, y_val = train_df.values[val_idx, :], targets.values[val_idx, :]
    ### Model ###
    model = TabNetRegressor(**tabnet_params)
        
    ### Fit ###
    # Another change to the original code
    # virtual_batch_size of 32 instead of 128
    model.fit(
        X_train = X_train,
        y_train = y_train,
        eval_set = [(X_val, y_val)],
        eval_name = ["val"],
        eval_metric = ["logits_ll"],
        max_epochs = MAX_EPOCH,
        patience = 20,
        batch_size = 1024, 
        virtual_batch_size = 32,
        num_workers = 1,
        drop_last = False,
        # To use binary cross entropy because this is not a regression problem
        loss_fn = F.binary_cross_entropy_with_logits
    )
    print(y_, '-' * 60)
    
    ### Predict on validation ###
    preds_val = model.predict(X_val)
    # Apply sigmoid to the predictions
    preds = 1 / (1 + np.exp(-preds_val))
    score = np.min(model.history["val_logits_ll"])
    
    ### Save OOF for CV ###
    oof_preds.append(preds_val)
    oof_targets.append(y_val)
    scores.append(score)
    
    ### Predict on test ###
    preds_test = model.predict(X_test)
    test_cv_preds.append(1 / (1 + np.exp(-preds_test)))

oof_preds_all = np.concatenate(oof_preds)
oof_targets_all = np.concatenate(oof_targets)
test_preds_all = np.stack(test_cv_preds)

[34m FOLDS:  [31m 1
[32m ************************************************************ [36m
Device used : cpu
epoch 0  | loss: 0.33033 | val_logits_ll: 0.03176 |  0:00:14s
epoch 10 | loss: 0.01793 | val_logits_ll: 0.01784 |  0:02:23s
epoch 20 | loss: 0.01664 | val_logits_ll: 0.01916 |  0:04:31s
epoch 30 | loss: 0.01633 | val_logits_ll: 0.01818 |  0:06:38s
epoch 40 | loss: 0.01595 | val_logits_ll: 0.01619 |  0:08:44s
epoch 50 | loss: 0.01572 | val_logits_ll: 0.01617 |  0:10:52s
epoch 60 | loss: 0.01552 | val_logits_ll: 0.0161  |  0:12:59s
epoch 70 | loss: 0.01545 | val_logits_ll: 0.01589 |  0:15:07s
epoch 80 | loss: 0.01549 | val_logits_ll: 0.01619 |  0:17:17s
epoch 90 | loss: 0.01515 | val_logits_ll: 0.01599 |  0:19:26s
epoch 100| loss: 0.01517 | val_logits_ll: 0.01592 |  0:21:43s
epoch 110| loss: 0.01494 | val_logits_ll: 0.01586 |  0:24:00s
epoch 120| loss: 0.01459 | val_logits_ll: 0.01582 |  0:26:18s
epoch 130| loss: 0.01438 | val_logits_ll: 0.0159  |  0:28:38s
epoch 140| loss: 0.

In [31]:
train_idx

array([    1,     3,     4, ..., 21944, 21946, 21947])

In [32]:
aucs = []
for task_id in range(oof_preds_all.shape[1]):
    aucs.append(roc_auc_score(y_true = oof_targets_all[:, task_id],
                              y_score = oof_preds_all[:, task_id]
                             ))
print(f"{b_}Overall AUC: {r_}{np.mean(aucs)}")
print(f"{b_}Average CV: {r_}{np.mean(scores)}")

[34mOverall AUC: [31m0.7416287868286611
[34mAverage CV: [31m0.01565994216745433


# <font color = "seagreen">Conclusion (NOT AVAILABLE UNTIL I SEE THE LB Score)</font> 

# <font color = "seagreen">Submission</font>

In [33]:
all_feat = [col for col in submission.columns if col not in ["sig_id"]]
# To obtain the same lenght of test_preds_all and submission
test = pd.read_csv(data_path + "test_features.csv")
sig_id = test[test["cp_type"] != "ctl_vehicle"].sig_id.reset_index(drop = True)
tmp = pd.DataFrame(test_preds_all.mean(axis = 0), columns = all_feat)
tmp["sig_id"] = sig_id

submission = pd.merge(test[["sig_id"]], tmp, on = "sig_id", how = "left")
submission.fillna(0, inplace = True)

#submission[all_feat] = tmp.mean(axis = 0)

# Set control to 0
# submission.loc[test["cp_type"] == 0, submission.columns[1:]] = 0
submission.to_csv("submission.csv", index = None)
submission.head()

Unnamed: 0,sig_id,5-alpha_reductase_inhibitor,11-beta-hsd1_inhibitor,acat_inhibitor,acetylcholine_receptor_agonist,acetylcholine_receptor_antagonist,acetylcholinesterase_inhibitor,adenosine_receptor_agonist,adenosine_receptor_antagonist,adenylyl_cyclase_activator,...,tropomyosin_receptor_kinase_inhibitor,trpv_agonist,trpv_antagonist,tubulin_inhibitor,tyrosine_kinase_inhibitor,ubiquitin_specific_protease_inhibitor,vegfr_inhibitor,vitamin_b,vitamin_d_receptor_agonist,wnt_inhibitor
0,id_0004d9e33,0.001348,0.000967,0.001553,0.017923,0.02279,0.00445,0.00351,0.004154,0.000618,...,0.000557,0.000722,0.002785,0.001234,0.000759,0.000495,0.000691,0.001276,0.000892,0.001457
1,id_001897cda,0.000707,0.000868,0.001794,0.002519,0.002281,0.002867,0.001466,0.016255,0.000921,...,0.000695,0.001626,0.002601,0.000275,0.009304,0.000662,0.01279,0.001118,0.01968,0.003499
2,id_002429b5b,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,id_00276f245,0.000437,0.000307,0.001566,0.01902,0.009177,0.001288,0.002507,0.001168,0.000373,...,0.000611,0.000778,0.001405,0.001131,0.001436,0.000256,0.000586,0.0016,0.00024,0.001546
4,id_0027f1083,0.000872,0.000702,0.001626,0.014854,0.013165,0.002897,0.003051,0.002544,0.000303,...,0.000632,0.001758,0.002065,0.005347,0.002986,0.00047,0.000651,0.001799,0.000536,0.001902


In [34]:
print(f"{b_}submission.shape: {r_}{submission.shape}")

[34msubmission.shape: [31m(3982, 207)
