As reported by Recursion [in this post](https://www.kaggle.com/c/recursion-cellular-image-classification/discussion/102905), there is a special structure in the data which simplifies predictions significantly.

The assignment of sirnas to plates is not completely random in this competition. In this kernel I first show this on the train data, and then apply the leak to [the pretrained Keras model](https://www.kaggle.com/chandyalex/recursion-cellular-keras-densenet) (kudos to [Alex](https://www.kaggle.com/chandyalex)), which scores 0.113 on the LB, to get a score of 0.207. The same model with 2 sites used for inference gets an LB score of 0.231 (the original model uses only one site, but I couldn't resist adding the second).

In [3]:
import numpy as np
import pandas as pd
import os

from tqdm import tqdm
import PIL
#import cv2
from PIL import Image, ImageOps

#from keras.models import Sequential, load_model
#from keras.layers import (Activation, Dropout, Flatten, Dense, Input, Conv2D, GlobalAveragePooling2D)
#from keras.applications.densenet import DenseNet121
#import keras
#from keras.models import Model

SIZE = 224
NUM_CLASSES = 1108

In [4]:
train_csv = pd.read_csv("train.csv")
test_csv = pd.read_csv("test.csv")

In [7]:
train_csv.head()

          id_code experiment  plate well  sirna
0  HEPG2-01_1_B03   HEPG2-01      1  B03    513
1  HEPG2-01_1_B04   HEPG2-01      1  B04    840
2  HEPG2-01_1_B05   HEPG2-01      1  B05   1020
3  HEPG2-01_1_B06   HEPG2-01      1  B06    254
4  HEPG2-01_1_B07   HEPG2-01      1  B07    144


In [8]:
test_csv.head()

          id_code experiment  plate well
0  HEPG2-08_1_B03   HEPG2-08      1  B03
1  HEPG2-08_1_B04   HEPG2-08      1  B04
2  HEPG2-08_1_B05   HEPG2-08      1  B05
3  HEPG2-08_1_B06   HEPG2-08      1  B06
4  HEPG2-08_1_B07   HEPG2-08      1  B07


In [5]:
sub = pd.read_csv("sub/metriclearn_efficientnet_b3_e080CM112_190805_cossim.csv.gz")

In [6]:
sub.head()

          id_code  sirna
0  HEPG2-08_1_B03    381
1  HEPG2-08_1_B04    309
2  HEPG2-08_1_B05    277
3  HEPG2-08_1_B06    640
4  HEPG2-08_1_B07    585


# Train data

Look at the plate assignments of the first 10 sirnas across the train set (one row per experiment, one column per sirna). Two sirnas that are on the same plate in the first experiment stay on the same plate in all experiments. Moreover, there are only 3 unique rows.

In [15]:
train_csv.sirna.max()

1107

In [13]:
train_csv.plate.values

array([1, 1, 1, ..., 4, 4, 4])

In [10]:
# exp x 10 sirna
np.stack([train_csv.plate.values[train_csv.sirna == i] for i in range(10)]).transpose()

array([[4, 1, 2, 1, 3, 1, 1, 2, 1, 4],
       [4, 1, 2, 1, 3, 1, 1, 2, 1, 4],
       [4, 1, 2, 1, 3, 1, 1, 2, 1, 4],
       [4, 1, 2, 1, 3, 1, 1, 2, 1, 4],
       [2, 3, 4, 3, 1, 3, 3, 4, 3, 2],
       [4, 1, 2, 1, 3, 1, 1, 2, 1, 4],
       [4, 1, 2, 1, 3, 1, 1, 2, 1, 4],
       [4, 1, 2, 1, 3, 1, 1, 2, 1, 4],
       [4, 1, 2, 1, 3, 1, 1, 2, 1, 4],
       [4, 1, 2, 1, 3, 1, 1, 2, 1, 4],
       [4, 1, 2, 1, 3, 1, 1, 2, 1, 4],
       [4, 1, 2, 1, 3, 1, 1, 2, 1, 4],
       [4, 1, 2, 1, 3, 1, 1, 2, 1, 4],
       [2, 3, 4, 3, 1, 3, 3, 4, 3, 2],
       [4, 1, 2, 1, 3, 1, 1, 2, 1, 4],
       [3, 4, 1, 4, 2, 4, 4, 1, 4, 3],
       [3, 4, 1, 4, 2, 4, 4, 1, 4, 3],
       [2, 3, 4, 3, 1, 3, 3, 4, 3, 2],
       [4, 1, 2, 1, 3, 1, 1, 2, 1, 4],
       [2, 3, 4, 3, 1, 3, 3, 4, 3, 2],
       [2, 3, 4, 3, 1, 3, 3, 4, 3, 2],
       [3, 4, 1, 4, 2, 4, 4, 1, 4, 3],
       [4, 1, 2, 1, 3, 1, 1, 2, 1, 4],
       [4, 1, 2, 1, 3, 1, 1, 2, 1, 4],
       [4, 1, 2, 1, 3, 1, 1, 2, 1, 4],
       [3, 4, 1, 4, 2, 4,

In [59]:
np.stack([train_csv.plate.values[train_csv.sirna == i] for i in range(10)]).transpose().shape

(33, 10)

The observation above is easy to verify on the whole train data: the sirnas fall into 4 groups of 277 that always stay together.
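One way to run that check is sketched below. It labels every sirna with its plate in a reference experiment and then tests that, in every other experiment, all sirnas carrying the same label still share a single plate. The function name `groups_stick_together` and the toy frame are illustrative assumptions, not part of the original kernel; only the column names `experiment`, `sirna` and `plate` come from `train_csv`.

```python
import pandas as pd

def groups_stick_together(df, ref_experiment):
    """Return True if sirnas sharing a plate in ref_experiment share a plate everywhere."""
    ref = df[df["experiment"] == ref_experiment]
    # Group label of a sirna = its plate in the reference experiment.
    group_of_sirna = ref.set_index("sirna")["plate"]
    labelled = df.assign(group=df["sirna"].map(group_of_sirna))
    # Sirnas absent from the reference experiment have no label and are skipped.
    labelled = labelled.dropna(subset=["group"])
    plates_per_group = labelled.groupby(["experiment", "group"])["plate"].nunique()
    return bool((plates_per_group == 1).all())

# Toy data (hypothetical): two experiments, two groups of two sirnas each.
toy = pd.DataFrame({
    "experiment": ["E1"] * 4 + ["E2"] * 4,
    "sirna": [0, 1, 2, 3] * 2,
    "plate": [1, 1, 2, 2, 2, 2, 1, 1],
})
# Same data, but sirna 3 leaves its group in the second experiment.
broken = toy.copy()
broken.loc[7, "plate"] = 2

print(groups_stick_together(toy, "E1"))
print(groups_stick_together(broken, "E1"))
```

On the real data the call would be `groups_stick_together(train_csv, "HEPG2-01")`, and the group sizes can be read from `train_csv[train_csv.experiment == "HEPG2-01"].plate.value_counts()`.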

But is there any order in how the groups are assigned to plates? In general, there are `4*3*2=24` possible ways of assigning 4 groups to 4 plates. But in the train data only 3 of them occur, appearing 22, 7 and 4 times respectively.

In [134]:
#train_csv.loc[train_csv.sirna==0,'plate']

In [24]:
# the counts follow the same 22/7/4 pattern for every sirna; only the plate labels depend on its group
train_csv.loc[train_csv.sirna==0,'plate'].value_counts()

4    22
2     7
3     4
Name: plate, dtype: int64

In [30]:
# same pattern for another sirna (one count is lower here, so this sirna is presumably missing from one experiment)
train_csv.loc[train_csv.sirna==888,'plate'].value_counts()

3    22
1     6
2     4
Name: plate, dtype: int64

In [32]:
# same pattern again, with different plate labels
train_csv.loc[train_csv.sirna==222,'plate'].value_counts()

1    22
3     7
4     4
Name: plate, dtype: int64

Later we will see that the 4th combination, missing from the train data, does in fact appear in the test data. My conclusion is that Recursion only rotated the plates, which is why there are just 4 combinations.
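The counting argument can be made concrete with a small sketch. It assumes, as hypothesized above, that the 4 assignments are cyclic rotations of the plate order; the variable names are illustrative only. Under that assumption every group visits each plate exactly once, so the 4 plate numbers a group sees always sum to `1+2+3+4=10`, which is what makes it possible to infer the unseen 4th plate from the 3 observed ones.

```python
from itertools import permutations

plates = (1, 2, 3, 4)

# All ways of assigning 4 groups to 4 plates.
all_assignments = list(permutations(plates))

# The 4 cyclic rotations: rotation k puts group g on plate rotations[k][g].
rotations = [plates[k:] + plates[:k] for k in range(4)]

# Plates seen by each group across the 4 rotations, and their sums.
plates_seen = [[rot[g] for rot in rotations] for g in range(4)]
sums = [sum(seen) for seen in plates_seen]

print(len(all_assignments), len(rotations))
print(sums)
```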

Let's calculate which plate each sirna belongs to in each of the 4 assignments:

In [33]:
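plate_groups = np.zeros((1108,4), int)
for sirna in range(1108):
    # plates of this sirna, ordered from the most to the least frequent assignment
    grp = train_csv.loc[train_csv.sirna==sirna,:].plate.value_counts().index.values
    assert len(grp) == 3
    plate_groups[sirna,0:3] = grp
    # plates 1..4 sum to 10, so the plate of the unseen 4th assignment is the remainder
    plate_groups[sirna,3] = 10 - grp.sum()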
    
plate_groups[:10,:]

array([[4, 2, 3, 1],
       [1, 3, 4, 2],
       [2, 4, 1, 3],
       [1, 3, 4, 2],
       [3, 1, 2, 4],
       [1, 3, 4, 2],
       [1, 3, 4, 2],
       [2, 4, 1, 3],
       [1, 3, 4, 2],
       [4, 2, 3, 1]])

In [57]:
plate_groups.shape

(1108, 4)

In [135]:
plate_groups[:,0:3]

array([[4, 2, 3],
       [1, 3, 4],
       [2, 4, 1],
       ...,
       [3, 1, 2],
       [1, 3, 4],
       [4, 2, 3]])

In [None]:
# sirna x group?

# Test data

Now let's check whether the same behavior shows up in the test data. I use the predictions of the kernel mentioned above to calculate the average probability of each assignment for every experiment.
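The scoring idea can be sketched on a toy example before running it on the real data. For each candidate assignment, the score is the average probability mass that the predictions place on sirnas which that assignment puts on the well's own plate. The helper `assignment_scores` and the toy arrays are hypothetical, and the sketch uses NumPy broadcasting rather than the `np.repeat` masks of the cells below.

```python
import numpy as np

def assignment_scores(probs, plates, plate_groups):
    """probs: (n_wells, n_sirnas), plates: (n_wells,), plate_groups: (n_sirnas, n_assignments)."""
    probs = np.asarray(probs, dtype=float)
    plates = np.asarray(plates)
    # compatible[w, s, j] is True when assignment j puts sirna s on the plate of well w.
    compatible = plate_groups[np.newaxis, :, :] == plates[:, np.newaxis, np.newaxis]
    # Average, over wells, of the probability mass on compatible sirnas.
    return (probs[:, :, np.newaxis] * compatible).sum(axis=(0, 1)) / len(probs)

# Toy setup: 4 sirnas, 2 plates, 2 candidate assignments (columns).
toy_plate_groups = np.array([[1, 2],
                             [1, 2],
                             [2, 1],
                             [2, 1]])
toy_plates = np.array([1, 1, 2, 2])   # plate of each of the 4 wells
toy_probs = np.eye(4)                 # one-hot predictions: well i -> sirna i

scores = assignment_scores(toy_probs, toy_plates, toy_plate_groups)
print(scores)
```

With hard (one-hot) predictions, as in this toy case, the score is simply the fraction of wells whose predicted sirna is consistent with the assignment.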

In [130]:
plate_groups.shape

(1108, 4)

In [132]:
plate_groups

array([[4, 2, 3, 1],
       [1, 3, 4, 2],
       [2, 4, 1, 3],
       ...,
       [3, 1, 2, 4],
       [1, 3, 4, 2],
       [4, 2, 3, 1]])

In [46]:
for j in range(4):
    print(plate_groups[np.newaxis, :, j])

[[4 1 2 ... 3 1 4]]
[[2 3 4 ... 1 3 2]]
[[3 4 1 ... 2 4 3]]
[[1 2 3 ... 4 2 1]]


In [72]:
all_train_exp, len(all_train_exp)

(array(['HEPG2-01', 'HEPG2-02', 'HEPG2-03', 'HEPG2-04', 'HEPG2-05',
        'HEPG2-06', 'HEPG2-07', 'HUVEC-01', 'HUVEC-02', 'HUVEC-03',
        'HUVEC-04', 'HUVEC-05', 'HUVEC-06', 'HUVEC-07', 'HUVEC-08',
        'HUVEC-09', 'HUVEC-10', 'HUVEC-11', 'HUVEC-12', 'HUVEC-13',
        'HUVEC-14', 'HUVEC-15', 'HUVEC-16', 'RPE-01', 'RPE-02', 'RPE-03',
        'RPE-04', 'RPE-05', 'RPE-06', 'RPE-07', 'U2OS-01', 'U2OS-02',
        'U2OS-03'], dtype=object), 33)

In [123]:
all_train_exp = train_csv.experiment.unique()

group_plate_probs_train = np.zeros((len(all_train_exp),4))
for idx in range(len(all_train_exp)):
    #preds = sub.loc[test_csv.experiment == all_test_exp[idx],'sirna'].values
    preds = train_csv.loc[train_csv.experiment == all_train_exp[idx],'sirna'].values
    pp_mult = np.zeros((len(preds),1108))
    pp_mult[range(len(preds)),preds] = 1
    
    #sub_test = test_csv.loc[test_csv.experiment == all_test_exp[idx],:]
    sub_train = train_csv.loc[train_csv.experiment == all_train_exp[idx],:]
    assert len(pp_mult) == len(sub_train)
    
    for j in range(4):
        mask = np.repeat(plate_groups[np.newaxis, :, j], len(pp_mult), axis=0) == \
               np.repeat(sub_train.plate.values[:, np.newaxis], 1108, axis=1)
        
        group_plate_probs_train[idx,j] = np.array(pp_mult)[mask].sum()/len(pp_mult)

In [126]:
group_plate_probs_train.shape

(33, 4)

In [125]:
group_plate_probs_train

array([[1., 0., 0., 0.],
       [1., 0., 0., 0.],
       [1., 0., 0., 0.],
       [1., 0., 0., 0.],
       [0., 1., 0., 0.],
       [1., 0., 0., 0.],
       [1., 0., 0., 0.],
       [1., 0., 0., 0.],
       [1., 0., 0., 0.],
       [1., 0., 0., 0.],
       [1., 0., 0., 0.],
       [1., 0., 0., 0.],
       [1., 0., 0., 0.],
       [0., 1., 0., 0.],
       [1., 0., 0., 0.],
       [0., 0., 1., 0.],
       [0., 0., 1., 0.],
       [0., 1., 0., 0.],
       [1., 0., 0., 0.],
       [0., 1., 0., 0.],
       [0., 1., 0., 0.],
       [0., 0., 1., 0.],
       [1., 0., 0., 0.],
       [1., 0., 0., 0.],
       [1., 0., 0., 0.],
       [0., 0., 1., 0.],
       [1., 0., 0., 0.],
       [1., 0., 0., 0.],
       [0., 1., 0., 0.],
       [1., 0., 0., 0.],
       [0., 1., 0., 0.],
       [1., 0., 0., 0.],
       [1., 0., 0., 0.]])

In [138]:
df_train = pd.DataFrame(group_plate_probs_train, index = all_train_exp)

In [139]:
df_train

            0    1    2    3
HEPG2-01  1.0  0.0  0.0  0.0
HEPG2-02  1.0  0.0  0.0  0.0
HEPG2-03  1.0  0.0  0.0  0.0
HEPG2-04  1.0  0.0  0.0  0.0
HEPG2-05  0.0  1.0  0.0  0.0
HEPG2-06  1.0  0.0  0.0  0.0
HEPG2-07  1.0  0.0  0.0  0.0
HUVEC-01  1.0  0.0  0.0  0.0
HUVEC-02  1.0  0.0  0.0  0.0
HUVEC-03  1.0  0.0  0.0  0.0


In [149]:
df_train.idxmax(axis=1).to_csv('train_plate_groups.csv')



In [121]:
all_test_exp = test_csv.experiment.unique()

group_plate_probs = np.zeros((len(all_test_exp),4))
for idx in range(len(all_test_exp)):
    preds = sub.loc[test_csv.experiment == all_test_exp[idx],'sirna'].values
    pp_mult = np.zeros((len(preds),1108))
    pp_mult[range(len(preds)),preds] = 1
    
    sub_test = test_csv.loc[test_csv.experiment == all_test_exp[idx],:]
    assert len(pp_mult) == len(sub_test)
    #print(len(pp_mult))
    #print(len(sub_test))
    
    for j in range(4):
        mask = np.repeat(plate_groups[np.newaxis, :, j], len(pp_mult), axis=0) == \
               np.repeat(sub_test.plate.values[:, np.newaxis], 1108, axis=1)
        
        group_plate_probs[idx,j] = np.array(pp_mult)[mask].sum()/len(pp_mult)

In [127]:
group_plate_probs.shape

(18, 4)

In [128]:
group_plate_probs

array([[0.18247516, 0.16260163, 0.18789521, 0.467028  ],
       [0.23555957, 0.35649819, 0.22924188, 0.17870036],
       [0.54241877, 0.13447653, 0.15433213, 0.16877256],
       [0.55786618, 0.14466546, 0.16003617, 0.13743219],
       [0.71750903, 0.06137184, 0.13628159, 0.08483755],
       [0.63414634, 0.08130081, 0.08581752, 0.19873532],
       [0.07581227, 0.08935018, 0.72472924, 0.1101083 ],
       [0.04151625, 0.04061372, 0.89079422, 0.02707581],
       [0.15613718, 0.1398917 , 0.16155235, 0.54241877],
       [0.80234657, 0.07581227, 0.06949458, 0.05234657],
       [0.72975432, 0.08462238, 0.10828025, 0.07734304],
       [0.09909091, 0.09181818, 0.10818182, 0.70090909],
       [0.0866426 , 0.56407942, 0.12635379, 0.22292419],
       [0.58446251, 0.16169828, 0.1129178 , 0.14092141],
       [0.64711191, 0.13718412, 0.08844765, 0.12725632],
       [0.60511883, 0.07952468, 0.1809872 , 0.13436929],
       [0.19404332, 0.25451264, 0.2933213 , 0.25812274],
       [0.22333637, 0.26526892,

Here are the average probabilities of each of the 4 assignments for every test experiment. For most experiments there is a clear favorite.

In [141]:
df_test = pd.DataFrame(group_plate_probs, index=all_test_exp)

In [148]:
df_test.idxmax(axis=1).to_csv('test_plate_groups.csv')



In [6]:
# ORIGINAL VALUES!!!
#pd.DataFrame(group_plate_probs, index = all_test_exp)

                 0         1         2         3
HEPG2-08  0.226739  0.257453  0.190605  0.325203
HEPG2-09  0.219314  0.331227  0.246390  0.203069
HEPG2-10  0.378159  0.232852  0.189531  0.199458
HEPG2-11  0.371609  0.238698  0.194394  0.195298
HUVEC-17  0.397112  0.145307  0.212996  0.244585
HUVEC-18  0.379404  0.203252  0.205962  0.211382
HUVEC-19  0.184116  0.175993  0.471119  0.168773
HUVEC-20  0.123646  0.106498  0.635379  0.134477
HUVEC-21  0.185018  0.178700  0.148014  0.488267
HUVEC-22  0.507220  0.149819  0.194043  0.148917


Let's select the most probable assignment for every test experiment. One could object that some of these selections are uncertain because the probabilities are too close. But our much better models give the same assignments, so even this relatively simple model is able to assign the experiments correctly.
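To see which selections are the uncertain ones, the gap between the best and the second-best score can be computed per experiment. The sketch below does this for three rows copied from the `group_plate_probs` output above (rows 0, 7 and 16); the `0.05` threshold is an arbitrary illustrative choice, not something used in the original kernel.

```python
import numpy as np

# Three rows copied from the group_plate_probs output (rows 0, 7 and 16).
rows = np.array([
    [0.18247516, 0.16260163, 0.18789521, 0.467028],
    [0.04151625, 0.04061372, 0.89079422, 0.02707581],
    [0.19404332, 0.25451264, 0.2933213, 0.25812274],
])

best = rows.argmax(axis=1)

# Margin = best score minus runner-up score.
sorted_scores = np.sort(rows, axis=1)
margin = sorted_scores[:, -1] - sorted_scores[:, -2]

# Hypothetical threshold for flagging a selection as uncertain.
uncertain = margin < 0.05

for b, m, u in zip(best, margin, uncertain):
    print(b, round(float(m), 4), "uncertain" if u else "ok")
```

Experiments flagged this way are the ones worth cross-checking against a stronger model.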

In [137]:
exp_to_group = group_plate_probs.argmax(1)
print(exp_to_group)

[3 1 0 0 0 0 2 2 3 0 0 3 1 0 0 0 2 3]


In [7]:
# ORIGINAL VALUES!!!
#exp_to_group = group_plate_probs.argmax(1)
#print(exp_to_group)

[3 1 0 0 0 0 2 2 3 0 0 3 1 0 0 0 2 3]


# Running predictions with the existing DenseNet121 model

In the code below we load the model, make predictions to get the full probability matrix, and then, for every well, set to zero the probabilities of all sirnas that the previously selected assignment places on one of the other 3 plates.

In [8]:
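# these imports are commented out in the first cell but are required here
from keras.applications.densenet import DenseNet121
from keras.layers import Dense, GlobalAveragePooling2D, Input
from keras.models import Model

def create_model(input_shape,n_out):
    input_tensor = Input(shape=input_shape)
    # DenseNet121 backbone without its classification head; weights are loaded later
    base_model = DenseNet121(include_top=False,
                   weights=None,
                   input_tensor=input_tensor)
    x = GlobalAveragePooling2D()(base_model.output)
    x = Dense(1024, activation='relu')(x)

    final_output = Dense(n_out, activation='softmax', name='final_output')(x)
    model = Model(input_tensor, final_output)
    
    return model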

In [9]:
model = create_model(input_shape=(SIZE,SIZE,3),n_out=NUM_CLASSES)

In [10]:
model.load_weights('../input/recursion-cellular-keras-densenet/Densenet121.h5')

In [11]:
import cv2  # commented out in the first cell, but required for cv2.imread

predicted = []
for i, name in tqdm(enumerate(test_csv['id_code'])):
    # site 1 of the well
    path1 = os.path.join('../input/recursion-cellular-image-classification-224-jpg/test/test/', name+'_s1.jpeg')
    image1 = cv2.imread(path1)
    score_predict1 = model.predict((image1[np.newaxis])/255)
    
    # site 2 of the well
    path2 = os.path.join('../input/recursion-cellular-image-classification-224-jpg/test/test/', name+'_s2.jpeg')
    image2 = cv2.imread(path2)
    score_predict2 = model.predict((image2[np.newaxis])/255)
    
    # average the class probabilities over both sites
    predicted.append(0.5*(score_predict1 + score_predict2))
    #predicted.append(score_predict1)

19897it [13:19, 24.88it/s]


In [12]:
predicted = np.stack(predicted).squeeze()

This function zeroes out the probabilities of the 75% of sirnas that, according to the selected assignment, are not on the well's plate.
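The effect of the masking is easiest to see on a toy case. In the sketch below, `mask_incompatible` is a hypothetical stand-alone version of the idea for a single assignment, with 4 sirnas on 2 plates instead of 1108 sirnas on 4 plates. It works on a copy, whereas the notebook's function modifies its input in place.

```python
import numpy as np

def mask_incompatible(probs, plates, sirna_plate):
    """Zero the probability of every sirna that is not on the well's plate."""
    probs = np.array(probs, dtype=float)   # work on a copy
    # incompatible[w, s] is True when sirna s is not on the plate of well w.
    incompatible = sirna_plate[np.newaxis, :] != np.asarray(plates)[:, np.newaxis]
    probs[incompatible] = 0.0
    return probs

# Toy setup: under the chosen assignment, sirnas 0 and 1 are on plate 1, sirnas 2 and 3 on plate 2.
sirna_plate = np.array([1, 1, 2, 2])
probs = np.array([[0.1, 0.2, 0.4, 0.3],    # well on plate 1
                  [0.5, 0.1, 0.1, 0.3]])   # well on plate 2
plates = np.array([1, 2])

masked = mask_incompatible(probs, plates, sirna_plate)
before = probs.argmax(axis=1)
after = masked.argmax(axis=1)
print(before, after)
```

In both toy wells the unconstrained top prediction sits on the wrong plate, so the masking changes the predicted sirna to the best candidate on the correct plate.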

In [13]:
def select_plate_group(pp_mult, idx):
    # wells of the idx-th test experiment, in the same order as the rows of pp_mult
    sub_test = test_csv.loc[test_csv.experiment == all_test_exp[idx],:]
    assert len(pp_mult) == len(sub_test)
    # mask[w, s] is True when the selected assignment puts sirna s on a plate other than that of well w
    mask = plate_groups[np.newaxis, :, exp_to_group[idx]] != \
           sub_test.plate.values[:, np.newaxis]
    pp_mult[mask] = 0  # note: modifies pp_mult in place
    return pp_mult

In [14]:
for idx in range(len(all_test_exp)):
    #print('Experiment', idx)
    indices = (test_csv.experiment == all_test_exp[idx])
    
    preds = predicted[indices,:].copy()
    
    preds = select_plate_group(preds, idx)
    sub.loc[indices,'sirna'] = preds.argmax(1)

In [15]:
(sub.sirna == pd.read_csv("../input/recursion-cellular-keras-densenet/submission.csv").sirna).mean()

0.3572900437251847

In [16]:
sub.to_csv('../working/submission.csv', index=False, columns=['id_code','sirna'])

That is all! I hope this demonstration puts everyone on a level playing field on this issue. Assigning 277 sirnas to 277 wells is still a hard problem to crack.