### Overview
The goal of this competition is to identify individual whales in images. Although several whales are well represented, most whales appear only once or in just a few pictures. In particular, the train dataset includes 25k images and 5k unique whale ids. In addition, ~10k of the images carry the 'new_whale' label, i.e. they show whales that are not assigned to any known id. Public kernels suggest that the classical approach to classification problems, a softmax prediction over all classes, works quite well for this particular problem. However, strong class imbalance, labels represented by just a few images, and the 'new_whale' label all hurt this approach. In addition, from the point of view of using such a model in production, this approach doesn't sound right: extending the model to identify new whales not represented in the train dataset would require retraining it with a larger softmax layer. Meanwhile, the task of this competition can be reconsidered as checking similarities, which suggests that a one-shot learning algorithm is applicable. This approach is less susceptible to the data imbalance in this competition, can naturally handle the 'new_whale' class, and is scalable as a production model (new classes can be added without retraining the model).

There are several public kernels that use a similarity-based approach. First of all, there is an amazing [kernel posted by Martin Piotte](https://www.kaggle.com/martinpiotte/whale-recognition-model-with-score-0-78563), which discusses the Siamese Neural Network architecture in detail. A [fork of this kernel](https://www.kaggle.com/seesee/siamese-pretrained-0-822/notebook) reports a 0.822 public LB score after training for 400 epochs. There is also a quite interesting [public kernel](https://www.kaggle.com/ashishpatel26/triplet-loss-network-for-humpback-whale-prediction) discussing the Triplet Neural Network architecture, which is supposed to outperform the Siamese architecture (check links in [this discussion](https://www.kaggle.com/c/humpback-whale-identification/discussion/76012)). Since both positive and negative examples are provided, the gradients appear to be more stable, and the network does not just try to get away from a negative example or get close to a positive one but arranges the prediction to fulfil both.

In this kernel I provide an example of a network inspired by the Triplet architecture that is capable of reaching **~0.60 public LB score after training for only 11 epochs**. Training for more epochs is supposed to improve the prediction even further, and hopefully it will take much less than 400 epochs to reach a 0.8+ public LB score (I'll post an update after I check it). The main trick of this kernel is **using a multiple loss instead of a triplet one**. If the forward pass is completed for all images in a batch, why shouldn't I compare all of them when calculating the loss function? Why should I limit myself to just several triplets? I have designed the loss function so that it performs an all vs. all comparison within each batch. In other words, for a batch of size 16, instead of comparing 16 triplets or 32 pairs, the network processes 2256 pairs of images at the same time (16 triplets contain 48 images, which gives 48·47 = 2256 ordered pairs). If training is done on multiple GPUs, the number of compared pairs could be boosted even further since it is proportional to bs^2. Such a huge number of processed pairs further stabilizes the gradients in comparison with triplet loss and allows a more effective mapping of the input into the embedding space, since not only pairs or triplets but the entire picture is seen at the same time. This approach also gives quite good results even with random pairs, without selecting hard pairs for training (as done in [this kernel](https://www.kaggle.com/martinpiotte/whale-recognition-model-with-score-0-78563)). However, combining those two approaches may further boost the convergence of the network, especially at the later stage of training.
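
To make the pair count above concrete, here is a minimal sketch in NumPy. The helper name `count_pairs` and the toy labels are made up for illustration and are not part of the kernel; the only assumption is that every image in a batch of triplets is compared with every other image.

```python
import numpy as np

def count_pairs(bs, per_item=3):
    """Hypothetical helper: ordered image pairs in one batch of triplets."""
    n = bs * per_item        # images in the batch (anchor, positive, negative)
    return n, n * n - n      # every ordered pair except an image with itself

n_images, n_pairs = count_pairs(16)   # 48 images, 2256 ordered pairs

# Toy batch of 6 images: the label matrix tells which pairs are positive.
labels = np.array([0, 0, 1, 2, 2, 3])
same = labels[:, None] == labels[None, :]
off_diag = ~np.eye(len(labels), dtype=bool)
n_pos = int((same & off_diag).sum())  # ordered pairs sharing a label
n_neg = int((~same).sum())            # ordered pairs with different labels
```

Even in this tiny batch the negatives dominate, which is why a batch is likely to contain a few hard negatives without explicit mining.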

Another novel thing I use in this kernel is **training on rectangular images instead of square ones**. After extracting bounding boxes (thanks to [this fork](https://www.kaggle.com/suicaokhoailang/generating-whale-bounding-boxes) and to Martin Piotte for posting the original kernel), the aspect ratio of the crops with whale tails is approximately 3:1. In most public kernels that use bounding boxes, the produced crops are simply squeezed into square images. In this kernel I use 576x192 crops generated from the bounding boxes without stretching.

This kernel is written with fast.ai 0.7 since the newer version of fast.ai doesn't work well on Kaggle: using more than one core for data loading leads to a [bus error](https://www.kaggle.com/product-feedback/72606) "DataLoader worker (pid 137) is killed by signal: Bus error". As a result, when I tried to write a similar kernel with fast.ai 1.0, it turned out to be much slower: more than 1 hour per epoch vs. 20-30 min with this kernel if ResNet34 is used. People interested in fast.ai 1.0 could check an example of a Siamese network [here](https://www.kaggle.com/raghavab1992/siamese-with-fast-ai). Also, fast.ai 0.7 is not really designed for building Siamese and Triplet networks, so some parts of the code are a bit far from the standard usage of the library.

**Highlights: Multiple (all vs. all) loss, training on rectangular images**

In [5]:
#!pip install fastai==0.7.0 --no-deps
#!pip install torch==0.4.1 torchvision==0.2.1
#!pip install imgaug

In [1]:
from fastai.conv_learner import *
from fastai.dataset import *

import pandas as pd
import numpy as np
import os
import warnings
from collections import Counter
import cv2
from sklearn.model_selection import train_test_split
from sklearn.neighbors import NearestNeighbors
import matplotlib.pyplot as plt
import random
import math
import imgaug as ia
from imgaug import augmenters as iaa


In [None]:
PATH = Path('./data/')
TRAIN = PATH/'train/'
TEST = PATH/'test/'
LABELS = PATH/'train.csv'
BOXES = PATH/'bounding_boxes.csv'
MODLE_INIT = PATH/'pytorch-pretrained-models/'

n_embedding = 256
bs = 16
ratio = 3
sz0 = 192
sz = (ratio*sz0,sz0)
nw = 2

### Data
The Loader class creates 576x192 crops based on the bounding boxes without stretching the image. In addition, data augmentation based on the [imgaug library](https://github.com/aleju/imgaug) is applied. This library is quite interesting in the context of the competition since it supports hue and saturation augmentations as well as conversion to grayscale.

In [None]:
def open_image(fn):
    flags = cv2.IMREAD_UNCHANGED+cv2.IMREAD_ANYDEPTH+cv2.IMREAD_ANYCOLOR
    if not os.path.exists(fn):
        raise OSError('No such file or directory: {}'.format(fn))
    elif os.path.isdir(fn):
        raise OSError('Is a directory: {}'.format(fn))
    else:
        try:
            im = cv2.imread(str(fn), flags)
            if im is None: raise OSError(f'File not recognized by opencv: {fn}')
            return cv2.cvtColor(im, cv2.COLOR_BGR2RGB)
        except Exception as e:
            raise OSError('Error handling image at: {}'.format(fn)) from e

class Loader():
    def __init__(self,path,tfms_g=None, tfms_px=None):
        #tfms_g - geometric augmentation (distortion, small rotation, zoom)
        #tfms_px - pixel augmentation and flip
        self.boxes = pd.read_csv(BOXES).set_index('Image')
        self.path = path
        self.tfms_g = iaa.Sequential(tfms_g,random_order=False) \
                        if tfms_g is not None else None
        self.tfms_px = iaa.Sequential(tfms_px,random_order=False) \
                        if tfms_px is not None else None
    def __call__(self, fname):
        fname = os.path.basename(fname)
        x0,y0,x1,y1 = tuple(self.boxes.loc[fname,['x0','y0','x1','y1']].tolist())
        img = open_image(os.path.join(self.path,fname))
        if self.tfms_g is not None: img = self.tfms_g.augment_image(img)
        l1,l0,_ = img.shape
        b0,b1 = x1-x0 + 50, y1-y0 + 20 #add extra padding
        b0n,b1n = (b0, b0/ratio) if b0**2/ratio > b1**2*ratio else (b1*ratio, b1)
        if b0n > l0: b0n,b1n = l0,b1n*l0/b0n
        if b1n > l1: b0n,b1n = b0n*l1/b1n,l1
        x0n = (x0 + x1 - b0n)/2
        x1n = (x0 + x1 + b0n)/2
        y0n = (y0 + y1 - b1n)/2
        y1n = (y0 + y1 + b1n)/2
        x0n,x1n,y0n,y1n = int(x0n),int(x1n),int(y0n),int(y1n)
        if(x0n < 0): x0n,x1n = 0,x1n-x0n
        elif(x1n > l0): x0n,x1n = x0n+l0-x1n,l0
        if(y0n < 0): y0n,y1n = 0,y1n-y0n
        elif(y1n > l1): y0n,y1n = y0n+l1-y1n,l1
        img = cv2.resize(img[y0n:y1n,x0n:x1n,:], sz)
        if self.tfms_px is not None: img = self.tfms_px.augment_image(img)
        return img.astype(np.float64)/255
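
The crop geometry used by Loader can be illustrated in isolation. The function below is a simplified standalone sketch, not the kernel's code: the name `crop_window` and its arguments are hypothetical, and it shrinks an oversized window in a single step rather than side by side. It pads the bounding box, extends the shorter side until the window is 3:1, and shifts the window back inside the image.

```python
def crop_window(box, img_w, img_h, ratio=3, pad=(50, 20)):
    """Hypothetical helper: ratio:1 window around a box, as (x0, y0, x1, y1)."""
    x0, y0, x1, y1 = box
    bw, bh = x1 - x0 + pad[0], y1 - y0 + pad[1]   # padded box size
    # extend the shorter side so that width / height == ratio
    if bw / bh > ratio:
        w, h = bw, bw / ratio
    else:
        w, h = bh * ratio, bh
    # shrink (keeping the ratio) if the window does not fit into the image
    scale = min(1.0, img_w / w, img_h / h)
    w, h = w * scale, h * scale
    cx, cy = (x0 + x1) / 2, (y0 + y1) / 2         # centre of the original box
    nx0, nx1 = int(cx - w / 2), int(cx + w / 2)
    ny0, ny1 = int(cy - h / 2), int(cy + h / 2)
    # shift the window back inside the image borders
    if nx0 < 0: nx0, nx1 = 0, nx1 - nx0
    elif nx1 > img_w: nx0, nx1 = nx0 - (nx1 - img_w), img_w
    if ny0 < 0: ny0, ny1 = 0, ny1 - ny0
    elif ny1 > img_h: ny0, ny1 = ny0 - (ny1 - img_h), img_h
    return nx0, ny0, nx1, ny1

centered = crop_window((100, 100, 700, 250), img_w=1050, img_h=600)
at_corner = crop_window((0, 0, 300, 100), img_w=1000, img_h=500)
```

Because the window already has the target aspect ratio, the final resize to 576x192 changes the scale but does not distort the tail.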

The Dataset class below generates triplets of images: the original image, a different image with the same label, and an image with a different label (**including new_whale images**). I do not select triplets based on the performance of the network (to focus only on the ones that confuse the network the most) since this is compensated by the multiple loss function (all vs. all comparison). Out of hundreds of negative examples in a batch, it is quite likely that several are tough ones. However, a more careful selection of triplets could slightly improve the convergence of the network (though it requires additional computational time).
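
The random triplet selection described above can be sketched with plain Python. The function `sample_triplet`, the file names, and the labels are all hypothetical; the sketch assumes, as in the Dataset class, that every 'new_whale' image gets its own unique label, so it can only ever serve as a negative.

```python
import random

def sample_triplet(anchor, label_of, rng=random):
    """Hypothetical helper: random positive and negative for an anchor image."""
    label = label_of[anchor]
    positives = [f for f, l in label_of.items() if l == label and f != anchor]
    negatives = [f for f, l in label_of.items() if l != label]
    return anchor, rng.choice(positives), rng.choice(negatives)

# made-up file names; 'new_whale_0' stands for a unique id given to a new_whale image
label_of = {'a1.jpg': 'w_1', 'a2.jpg': 'w_1', 'a3.jpg': 'w_1',
            'b1.jpg': 'w_2', 'b2.jpg': 'w_2', 'c1.jpg': 'new_whale_0'}
random.seed(0)
anchor, pos, neg = sample_triplet('a1.jpg', label_of)
```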

In [None]:
class pdFilesDataset(FilesDataset):
    def __init__(self, data, path, transform):
        df = data.copy()
        counts = Counter(df.Id.values)
        df['c'] = df['Id'].apply(lambda x: counts[x])
        #in the production runs df.c>1 should be used
        fnames = df[(df.c>2) & (df.Id != 'new_whale')].Image.tolist()
        df['label'] = df.Id
        df.loc[df.c == 1,'label'] = 'new_whale'
        df = df.sort_values(by=['c'])
        df.label = pd.factorize(df.label)[0]
        l1 = 1 + df.label.max()
        l2 = len(df[df.label==0])
        df.loc[df.label==0,'label'] = range(l1, l1+l2) #assign unique ids
        self.labels = df.copy().set_index('Image')
        self.names = df.copy().set_index('label')
        if path == TRAIN:
            #data augmentation: 8 degree rotation, 10% stretch, shear
            tfms_g = [iaa.Affine(rotate=(-8, 8),mode='reflect',
                scale={"x": (0.9, 1.1), "y": (0.9, 1.1)}, shear=(-8,8))]
            #data augmentation: horizontal flip, hue and saturation augmentation,
            #gray scale, blur
            tfms_px = [iaa.Fliplr(0.5), iaa.AddToHueAndSaturation((-20, 20)),
                iaa.Grayscale(alpha=(0.0, 1.0)),iaa.GaussianBlur((0, 1.0))]
            self.loader = Loader(path,tfms_g,tfms_px)
        else: self.loader = Loader(path)
        super().__init__(fnames, transform, path)
    
    def get_x(self, i):
        label = self.labels.loc[self.fnames[i],'label']
        #random selection of a positive example
        for j in range(10): #sometimes loc call fails
            try:
                names = self.names.loc[label].Image
                break
            except Exception: pass #retry if the loc call failed
        name_p = names if isinstance(names,str) else \
            random.choice(sorted(set(names) - set([self.fnames[i]])))
        #random selection of a negative example
        for j in range(10): #sometimes loc call fails
            try:
                names = self.names.loc[self.names.index!=label].Image
                break
            except Exception: pass #retry if the loc call failed
        name_n = names if isinstance(names,str) else names.sample(1).values[0]
        imgs = [self.loader(os.path.join(self.path,self.fnames[i])),
                self.loader(os.path.join(self.path,name_p)),
                self.loader(os.path.join(self.path,name_n)),
                label,label,self.labels.loc[name_n,'label']]
        return imgs
    
    def get_y(self, i):
        return 0
    
    def get(self, tfm, x, y):
        if tfm is None:
            return (*x,0)
        else:
            x1, y1 = tfm(x[0],x[3])
            x2, y2 = tfm(x[1],x[4])
            x3, y3 = tfm(x[2],x[5])
            #combine all images into one tensor
            x = np.stack((x1,x2,x3),0)
            return x,(y1,y2,y3)
        
    def get_names(self,label):
        names = []
        for j in range(10):
            try:
                names = self.names.loc[label].Image
                break
            except Exception: pass #retry if the loc call failed
        return names
        
    @property
    def is_multi(self): return True
    @property
    def is_reg(self):return True
    
    def get_c(self): return n_embedding
    def get_n(self): return len(self.fnames)
    
#class for loading individual images when the embedding is computed
class FilesDataset_single(FilesDataset):
    def __init__(self, data, path, transform):
        self.loader = Loader(path)
        fnames = os.listdir(path)
        super().__init__(fnames, transform, path)
        
    def get_x(self, i):
        return self.loader(os.path.join(self.path,self.fnames[i]))
                           
    def get_y(self, i):
        return 0
        
    @property
    def is_multi(self): return True
    @property
    def is_reg(self):return True
    
    def get_c(self): return n_embedding
    def get_n(self): return len(self.fnames)

In [None]:
def get_data(sz,bs):
    tfms = tfms_from_model(resnet34, sz, crop_type=CropType.NO)
    tfms[0].tfms = [tfms[0].tfms[2],tfms[0].tfms[3]]
    tfms[1].tfms = [tfms[1].tfms[2],tfms[1].tfms[3]]
    df = pd.read_csv(LABELS)
    trn_df, val_df = train_test_split(df,test_size=0.2, random_state=42)
    ds = ImageData.get_ds(pdFilesDataset, (trn_df,TRAIN), (val_df,TRAIN), tfms)
    md = ImageData(PATH, ds, bs, num_workers=nw, classes=None)
    return md

The image below shows an example of the triplets of augmented rectangular 576x192 images used for training. To be honest, some of those triplets are quite hard, and I don't think that I could match the performance of the trained model myself (~99% accuracy in identifying the 2 similar images in a triplet).

In [None]:
md = get_data(sz,bs)

x,y = next(iter(md.trn_dl))
print(x.shape, y[0].shape)

def display_imgs(x):
    columns = 3
    rows = min(bs,16)
    fig=plt.figure(figsize=(columns*8, rows*3))
    for i in range(rows):
        for j in range(columns):
            idx = j+i*columns
            fig.add_subplot(rows, columns, idx+1)
            plt.axis('off')
            plt.imshow((x[j][i,:,:,:]*255).astype(int))
    plt.show()
    
display_imgs((md.trn_ds.denorm(x[:,0,:,:,:]),md.trn_ds.denorm(x[:,1,:,:,:]),md.trn_ds.denorm(x[:,2,:,:,:])))

### Model
In this kernel I use ResNeXt50 instead of ResNet34 since it gave slightly better performance after training within the kernel time limit: 0.600 vs 0.588. The convolutional part is taken from the original ResNeXt50 model pretrained on ImageNet, while adaptive pooling allows using images of any size and aspect ratio. On top of that, 2 fully connected layers are added to convert the output of the convolutional part into the embedding space. Converting the input images into embeddings allows quite efficient and robust inference. Also, the multiple loss can be applied quite easily.

Instead of using the Euclidean distance in the embedding space, the similarity between images can be calculated with a small multi-layer network as in [this kernel](https://www.kaggle.com/martinpiotte/whale-recognition-model-with-score-0-78563). This approach could boost the score; however, it would require some modification of the model to allow all vs. all comparison during training. In particular, a copy of the head part of the network with shared weights must be created for each compared pair.

In [None]:
def resnext50(pretrained=True):
    model = resnext_50_32x4d()
    name = 'resnext_50_32x4d.pth'
    if pretrained:
        path = os.path.join(MODLE_INIT,name)
        load_model(model, path)
    return model

class TripletResneXt50(nn.Module):
    def __init__(self, pre=True, emb_sz=64, ps=0.5):
        super().__init__()
        encoder = resnext50(pretrained=pre)
        self.cnn = nn.Sequential(encoder[0],encoder[1],nn.ReLU(),encoder[3],
                        encoder[4],encoder[5],encoder[6],encoder[7])
        self.head = nn.Sequential(AdaptiveConcatPool2d(), Flatten(), nn.Dropout(ps),
                        nn.Linear(4096, 512), nn.ReLU(), nn.BatchNorm1d(512),
                        nn.Dropout(ps), nn.Linear(512, emb_sz))
        
    def forward(self,x):
        x1,x2,x3 = x[:,0,:,:,:],x[:,1,:,:,:],x[:,2,:,:,:]
        x1 = self.head(self.cnn(x1))
        x2 = self.head(self.cnn(x2))
        x3 = self.head(self.cnn(x3))
        return torch.cat((x1.unsqueeze_(-1),x2.unsqueeze_(-1),x3.unsqueeze_(-1)),dim=-1)
    
    def get_embedding(self, x):
        return self.head(self.cnn(x))
    
class ResNeXt50Model():
    def __init__(self,pre=True,name='TripletResneXt50',**kwargs):
        self.model = to_gpu(TripletResneXt50(pre=pre,**kwargs))
        self.name = name

    def get_layer_groups(self, precompute):
        m = self.model.module if isinstance(self.model,FP16) else self.model
        if precompute:
            return [m.head]
        c = children(m.cnn)
        return list(split_by_idxs(c,[5])) + [m.head]
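
The head above starts with a linear layer of width 4096. A plausible reading, assuming that AdaptiveConcatPool2d concatenates global max pooling and global average pooling and that the last ResNeXt50 block outputs 2048 channels, is 2 x 2048 = 4096. The NumPy sketch below (the function `concat_pool` is hypothetical) shows why such pooling makes the head independent of the input size and aspect ratio.

```python
import numpy as np

def concat_pool(features):
    """Hypothetical stand-in: (batch, channels, H, W) -> (batch, 2 * channels)."""
    mx = features.max(axis=(2, 3))     # global max pooling
    av = features.mean(axis=(2, 3))    # global average pooling
    return np.concatenate([mx, av], axis=1)

rng = np.random.default_rng(0)
# assumed total stride of 32: a 576x192 crop gives an 18x6 feature map
wide = rng.normal(size=(2, 2048, 6, 18))
# a 224x224 image would give a 7x7 feature map
square = rng.normal(size=(2, 2048, 7, 7))
out_wide, out_square = concat_pool(wide), concat_pool(square)
```

Both outputs have the same width, so the same fully connected layers can follow the convolutional part for rectangular and square inputs alike.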

### Loss function
In my tests I compared several loss functions and found that contrastive loss works best in the current setup. A distance-based logistic loss gives similar performance when the model is trained with single precision, but worse results when it is trained with half precision.

In [None]:
def Contrastive_loss(preds, target, size_average=True, m=10.0):
    #matrix of all vs all comparisons
    t = torch.cat(target)
    sz = t.shape[0]
    t1 = t.unsqueeze(1).expand((sz,sz))
    t2 = t1.transpose(0,1)
    y = t1==t2
    
    pred = torch.cat((preds[:,:,0], preds[:,:,1], preds[:,:,2]))
    half = isinstance(pred,torch.cuda.HalfTensor)
    if half : pred = pred.float()
    pred1 = pred.unsqueeze(1).expand((sz,sz,-1))
    pred2 = pred1.transpose(0,1)
    d = (pred1 - pred2).pow(2).sum(dim=-1)
    loss_p = d[y==1]
    loss_n = F.relu(m - torch.sqrt(d[y==0]))**2
    loss = torch.cat((loss_p,loss_n),0)
    loss = loss.mean() if size_average else loss.sum()
    if half : pred = pred.half()
    return loss

def DB_acc(preds, target):
    v, p, n = preds[:,:,0], preds[:,:,1], preds[:,:,2]
    dp = (p - v).pow(2).sum(dim=1)
    dn = (n - v).pow(2).sum(dim=1)
    return (dp < dn).float().mean()
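
To check the arithmetic of the all vs. all contrastive loss on numbers that can be verified by hand, here is a NumPy re-statement of the same formula. The function name `contrastive_all_vs_all` and the toy embeddings are made up for this sketch. As in the loss above, same-label pairs contribute the squared distance (including the zero diagonal), and different-label pairs contribute max(m - distance, 0)^2.

```python
import numpy as np

def contrastive_all_vs_all(emb, labels, m=10.0):
    """Hypothetical NumPy version of the all vs. all contrastive loss."""
    diff = emb[:, None, :] - emb[None, :, :]
    d = (diff ** 2).sum(-1)                        # squared distances, (n, n)
    same = labels[:, None] == labels[None, :]      # True where labels match
    loss_p = d[same]                               # pull matching pairs together
    loss_n = np.maximum(m - np.sqrt(d[~same]), 0.0) ** 2   # push others beyond m
    return np.concatenate([loss_p, loss_n]).mean()

# two images of whale 7 at distance 5, one image of whale 9 far away
emb = np.array([[0.0, 0.0], [3.0, 4.0], [30.0, 40.0]])
loss_far = contrastive_all_vs_all(emb, np.array([7, 7, 9]))

# two different whales at distance 6, i.e. inside the margin m = 10
loss_close = contrastive_all_vs_all(np.array([[0.0, 0.0], [0.0, 6.0]]),
                                    np.array([1, 2]))
```

In the first case only the positive pair contributes (25 twice, averaged over 9 entries). In the second case the negative pair is penalized with (10 - 6)^2 = 16 twice, averaged over 4 entries.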

### Training

In [None]:
learner = ConvLearner(md,ResNeXt50Model(ps=0.0,emb_sz=n_embedding))
learner.opt_fn = optim.Adam
learner.clip = 1.0 #gradient clipping
learner.crit = Contrastive_loss
learner.metrics = [DB_acc]
learner #click "output" to see details of the model

I begin by finding the optimal learning rate. The following function runs training with different learning rates and records the loss. An increase of the loss indicates the onset of divergence. The optimal lr lies in the vicinity of the minimum of the curve but before the onset of divergence. Based on the following plot, for the current setup the divergence starts at ~5e-3, and the recommended learning rate is ~5e-4.

In [None]:
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    learner.lr_find()
learner.sched.plot()

First, I train only the fully connected part of the model while keeping the rest frozen. This avoids corrupting the pretrained weights at the initial stage of training, when the head layers are still randomly initialized, so the power of transfer learning is fully utilized when the training is continued.

In [None]:
lr = 5e-4
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    learner.fit(lr,1)

Next, I unfreeze all weights and train the entire model. One trick that I use is applying different learning rates to different parts of the model: the learning rate in the fully connected part is still lr, the last two blocks of ResNeXt are trained with lr/5, and the first layers are trained with lr/25. Since low-level detectors do not vary much from one image data set to another, the first layers do not require substantial retraining compared to the parts of the model working with high-level features. Another trick is learning rate annealing. A periodic learning rate increase followed by a slow decrease drives the system out of steep minima (when lr is high) towards broader ones (which are explored when lr decreases), which enhances the ability of the model to generalize and reduces overfitting. The length of the cycles gradually increases during training. Using half precision doubles the maximum batch size, which allows more pairs to be compared in each batch.

In [None]:
learner.unfreeze()
lrs=np.array([lr/25,lr/5,lr])
learner.half() #half precision

In [None]:
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    learner.fit(lrs/4,4,cycle_len=1,use_clr=(10,20))
    learner.fit(lrs/8,2,cycle_len=2,use_clr=(10,20))
    learner.fit(lrs/16,1,cycle_len=2,use_clr=(10,20)) 
learner.save('model')
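
The combination of per-group learning rates and the cyclical schedule can be sketched as follows. This is an illustration only: the function `triangular_lr` is hypothetical, and reading `use_clr=(10,20)` as "start at peak/10, reach the peak after 1/20 of the cycle, then decay linearly" is my assumption about fast.ai 0.7, not something stated in this kernel.

```python
import numpy as np

def triangular_lr(peak_lr, n_iter, div=10, cut_div=20):
    """Hypothetical schedule: linear rise from peak_lr/div, then linear decay."""
    cut = max(1, n_iter // cut_div)          # iterations spent on the rise
    it = np.arange(n_iter)
    pct = np.where(it <= cut, it / cut, 1 - (it - cut) / (n_iter - cut))
    return peak_lr * (1 + pct * (div - 1)) / div

lr = 5e-4
lrs = np.array([lr / 25, lr / 5, lr])        # first layers, last blocks, head
# one row per layer group, one column per iteration of a 200-iteration cycle
schedule = np.stack([triangular_lr(g, 200) for g in lrs / 4])
```

Each layer group follows the same shape scaled by its own peak, so the ratio 1 : 5 : 25 between the groups is preserved at every iteration.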

### Embedding
The following code converts images into embedding vectors which are used later to generate predictions based on the nearest neighbor analysis.

In [None]:
def extract_embedding(model,path):
    tfms = tfms_from_model(resnet34, sz, crop_type=CropType.NO)
    tfms[0].tfms = [tfms[0].tfms[2],tfms[0].tfms[3]]
    tfms[1].tfms = [tfms[1].tfms[2],tfms[1].tfms[3]]
    ds = ImageData.get_ds(FilesDataset_single, (None,TRAIN), (None,TRAIN),
         tfms, test=(None,path))
    md = ImageData(PATH, ds, 3*bs, num_workers=nw, classes=None)
    model.eval()
    with torch.no_grad():
        preds = torch.zeros((len(md.test_dl.dataset), n_embedding))
        start=0
        for i, (x, y) in enumerate(md.test_dl, start=0):
            size = x.shape[0]
            if isinstance(model,FP16):
                preds[start:start+size,:] = model.module.get_embedding(x.half())
            else:
                preds[start:start+size,:] = model.get_embedding(x)
            start+= size
        return preds, [os.path.basename(name) for name in md.test_dl.dataset.fnames]

In [None]:
emb, names = extract_embedding(learner.model,TRAIN)
df = pd.DataFrame({'files':names,'emb':emb.tolist()})
df.emb = df.emb.map(lambda emb: ' '.join(list([str(i) for i in emb])))
df.to_csv('train_emb.csv', header=True, index=False)

In [None]:
emb, names = extract_embedding(learner.model,TEST)
df = pd.DataFrame({'files':names,'emb':emb.tolist()})
df.emb = df.emb.map(lambda emb: ' '.join(list([str(i) for i in emb])))
df.to_csv('test_emb.csv', header=True, index=False)

### Validation

In [None]:
data = pd.read_csv(LABELS).set_index('Image')
trn_emb = pd.read_csv(os.path.join('../working/','train_emb.csv'))
trn_emb['emb'] = [[float(i) for i in s.split()] for s in trn_emb['emb']]
trn_emb.set_index('files',inplace=True)
train_df = data.join(trn_emb)
train_df = train_df.reset_index()
train_preds = np.array(train_df.emb.tolist())
#the split should be the same as one used for training.
trn_df, val_df = train_test_split(train_df,test_size=0.2, random_state=42)
trn_preds = np.array(trn_df.emb.tolist())
val_preds = np.array(val_df.emb.tolist())
trn_df = trn_df.reset_index()
val_df = val_df.reset_index()
train_preds.shape

Find the 16 nearest train neighbors in embedding space for each validation image. I use 16 instead of 5 because several neighbors can share the same label; the following code then selects the 5 nearest neighbors with different labels. The "new_whale" label is inserted into the predictions at a distance dcut. In this case, if fewer than 5 distinct labels are found at a distance shorter than dcut, the image is considered to be different from the others, and "new_whale" is ranked ahead of the more distant labels.

In [None]:
neigh = NearestNeighbors(n_neighbors=16)
neigh.fit(trn_preds)
distances_trn, neighbors_trn = neigh.kneighbors(val_preds)

In [None]:
def get_nlabels_trn(idx:int,trn_df,test_df,dcut):
    l0 = test_df.loc[idx].Id
    nbs = dict()
    for i in range(0,16):
        nb = neighbors_trn[idx,i]
        l, d = trn_df.loc[nb].Id, distances_trn[idx,i]
        if d > dcut and 'new_whale' not in nbs: nbs['new_whale'] = dcut
        if l not in nbs: nbs[l] = d
        if len(nbs) >= 5: break
    nbs_sorted = sorted(nbs.items(), key=lambda kv: kv[1])
    score = 0.0
    for i in range(min(len(nbs_sorted),5)):
        if nbs_sorted[i][0] == l0:
            score = 1.0/(i + 1.0)
            break
    return l0, nbs_sorted, score

In [None]:
scores = []
dcut = 3.75
for idx in val_df.index:
    _,_,s = get_nlabels_trn(idx,trn_df,val_df,dcut)
    scores.append(s)
print(np.array(scores).mean())
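
The selection rule and the score used above can be traced on a toy example. The functions `top5_labels` and `reciprocal_rank`, as well as the labels and distances, are hypothetical; the logic mirrors get_nlabels_trn under the assumption that neighbors arrive sorted by increasing distance.

```python
def top5_labels(neighbor_labels, neighbor_dists, dcut):
    """Hypothetical helper: up to 5 distinct labels, 'new_whale' placed at dcut."""
    best = {}
    for label, d in zip(neighbor_labels, neighbor_dists):
        if d > dcut and 'new_whale' not in best:
            best['new_whale'] = dcut       # insert once the cutoff is passed
        if label not in best:
            best[label] = d                # keep the closest hit per label
        if len(best) >= 5:
            break
    ranked = sorted(best.items(), key=lambda kv: kv[1])
    return [label for label, _ in ranked[:5]]

def reciprocal_rank(true_label, preds):
    """Score of one image: 1/rank of the true label, 0 if it is missing."""
    return 1.0 / (preds.index(true_label) + 1) if true_label in preds else 0.0

labels = ['w_a', 'w_a', 'w_b', 'w_c', 'w_b', 'w_d', 'w_e']
dists = [1.0, 1.5, 2.0, 4.0, 4.2, 4.5, 5.0]
preds = top5_labels(labels, dists, dcut=3.75)
score_new = reciprocal_rank('new_whale', preds)
score_miss = reciprocal_rank('w_z', preds)
```

With one true label per image, averaging this reciprocal rank over all validation images gives the same value as the MAP@5 metric.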

### Submission

In [None]:
test_emb = pd.read_csv(os.path.join('../working/','test_emb.csv'))
test_emb['emb'] = [[float(i) for i in s.split()] for s in test_emb['emb']]
test_emb.set_index('files',inplace=True)
test_df = test_emb.reset_index()
test_preds = np.array(test_df.emb.tolist())
test_df.head()

In [None]:
neigh = NearestNeighbors(n_neighbors=16)
neigh.fit(train_preds)
distances_test, neighbors_test = neigh.kneighbors(test_preds)

In [None]:
pred = []
for idx, row in test_df.iterrows():
    nbs = dict()
    for i in range(0,16):
        nb = neighbors_test[idx,i]
        l, d = train_df.loc[nb].Id, distances_test[idx,i]
        if d > dcut and 'new_whale' not in nbs: nbs['new_whale'] = dcut
        if l not in nbs: nbs[l] = d
        if len(nbs) >= 5: break
    nbs_sorted = sorted(nbs.items(), key=lambda kv: kv[1])
    p = ' '.join([lb[0] for lb in nbs_sorted[:5]]) #at most 5 labels per image
    pred.append({'Image':row.files,'Id':p})
pd.DataFrame(pred).to_csv('submission.csv',index=False)