This notebook is just a (very) small improvement over the most common baseline.

It loads a subset of the training images and resizes each one to `im_size` x `im_size` pixels (16 x 16 in the code below), so every image becomes a 16 x 16 x 3 array.

Then, it trains a small convolutional network on those arrays and uses it to classify the images in the test set.

Unfortunately, due to Kernel limitations, only a few test images are classified.

In [12]:
import numpy as np
import pandas as pd
import io
import bson
import cv2
import matplotlib.pyplot as plt
from tqdm import tqdm_notebook
import concurrent.futures
from multiprocessing import cpu_count

In [13]:
num_images = 200000
im_size = 16
num_cpus = cpu_count()

In [14]:
from ipywidgets import IntProgress

In [15]:
!jupyter nbextension enable --py --sys-prefix widgetsnbextension

/bin/sh: jupyter: command not found


In [16]:
def imread(buf):
    return cv2.imdecode(np.frombuffer(buf, np.uint8), cv2.IMREAD_ANYCOLOR)

def img2feat(im):
    x = cv2.resize(im, (im_size, im_size), interpolation=cv2.INTER_AREA)
    return np.float32(x) / 255

X = np.empty((num_images, im_size, im_size, 3), dtype=np.float32)
y = []

def load_image(pic, target, bar):
    picture = imread(pic)
    x = img2feat(picture)
    bar.update()
    
    return x, target

bar = tqdm_notebook(total=num_images)
with open('/datadrive/Cdiscount/train.bson', 'rb') as f, \
        concurrent.futures.ThreadPoolExecutor(num_cpus) as executor:

    data = bson.decode_file_iter(f)
    delayed_load = []

    i = 0
    try:
        for c, d in enumerate(data):
            target = d['category_id']
            for e, pic in enumerate(d['imgs']):
                delayed_load.append(executor.submit(load_image, pic['picture'], target, bar))
                
                i = i + 1

                if i >= num_images:
                    # The exception is used only to break out of both loops at once
                    raise IndexError()

    except IndexError:
        pass
    
    for i, future in enumerate(concurrent.futures.as_completed(delayed_load)):
        x, target = future.result()
        
        X[i] = x
        y.append(target)


100%| 200000/200000 [01:51<00:00, 3138.39it/s]

In [17]:
X.shape, len(y)

((200000, 16, 16, 3), 200000)
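
The loading loop above stops after `num_images` pictures by raising an exception inside two nested loops. One possible alternative, sketched here on toy data, is to flatten the nested iteration into a generator and cap it with `itertools.islice`. The `iter_pictures` helper and the `toy_docs` records are hypothetical stand-ins for the decoded BSON documents, which are assumed to carry a `category_id` and an `imgs` list as in the cell above.

```python
from itertools import islice

def iter_pictures(docs):
    """Yield (picture_bytes, category_id) for every image of every product."""
    for doc in docs:
        for img in doc['imgs']:
            yield img['picture'], doc['category_id']

# Toy stand-in for bson.decode_file_iter(f): three products, five images in total
toy_docs = [
    {'category_id': 10, 'imgs': [{'picture': b'a'}, {'picture': b'b'}]},
    {'category_id': 20, 'imgs': [{'picture': b'c'}]},
    {'category_id': 10, 'imgs': [{'picture': b'd'}, {'picture': b'e'}]},
]

max_images = 4  # plays the role of num_images
selected = list(islice(iter_pictures(toy_docs), max_images))
print(selected)
```

Because the generator is lazy, iteration stops as soon as `max_images` items have been taken, so the rest of the file would not need to be decoded.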

In [18]:
y = pd.Series(y)

num_classes = 500  # This will reduce the max accuracy to about 0.75

# Find the `num_classes - 1` most frequent classes
# (there will be an additional 'other' class)
valid_targets = set(y.value_counts().index[:num_classes-1].tolist())
valid_y = y.isin(valid_targets)

# Set other classes to -1
y[~valid_y] = -1

max_acc = valid_y.mean()
print(max_acc)

0.843575


Note that the max accuracy reported above is higher than the ~0.75 reported [here](https://www.kaggle.com/bguberfain/naive-statistics) because only a subset of the training set is used here.
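
The ceiling comes from the fact that any image whose true class is outside the kept set can never be predicted correctly, so the best achievable accuracy is the fraction of images whose class was kept. A toy illustration with made-up labels follows; `toy_y` and `toy_num_classes` are hypothetical names, not part of the notebook.

```python
import pandas as pd

# Toy labels: class 1 appears 5 times, class 2 three times, classes 3 and 4 once each
toy_y = pd.Series([1, 1, 1, 1, 1, 2, 2, 2, 3, 4])

toy_num_classes = 3  # two real classes plus one 'other' bucket
kept = set(toy_y.value_counts().index[:toy_num_classes - 1].tolist())
is_kept = toy_y.isin(kept)

# Fraction of samples whose class survives the cut: the accuracy ceiling
ceiling = is_kept.mean()
print(kept, ceiling)
```

Here 8 of the 10 samples belong to the two kept classes, so no classifier restricted to those classes can score above 0.8 on this toy set.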

In [19]:
# Encode the category ids as consecutive integer codes;
# rev_labels maps each code back to its original id
y, rev_labels = pd.factorize(y)
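
As a small illustration of what `pd.factorize` returns, the sketch below encodes a handful of made-up ids and decodes them again. The values in `toy_labels` are invented for the example, with -1 standing in for the 'other' class.

```python
import pandas as pd

# Made-up category ids; -1 plays the role of the 'other' class
toy_labels = pd.Series([101, -1, 202, 101, -1])

codes, uniques = pd.factorize(toy_labels)
print(codes)    # integer codes, numbered in order of first appearance
print(uniques)  # lookup table from code back to the original id

# Decoding: index the lookup table with the codes
decoded = uniques[codes]
print(list(decoded))
```

This is the same lookup the test loop relies on when it converts an argmax index back to a category id through `rev_labels`.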

In [20]:
# Define a simple CNN (training happens in the next cell)
from keras.layers import Conv2D, MaxPooling2D, Dense, Flatten
from keras.models import Sequential
from keras.optimizers import Adam

model = Sequential()
model.add(Conv2D(16, 3, activation='relu', padding='same', input_shape=X.shape[1:]))
model.add(MaxPooling2D(2))
model.add(Conv2D(16, 3, activation='relu', padding='same'))
model.add(MaxPooling2D(2))
model.add(Conv2D(32, 3, activation='relu', padding='same'))
model.add(MaxPooling2D(2))
model.add(Flatten())
model.add(Dense(num_classes, activation='relu'))
model.add(Dense(num_classes, activation='softmax'))


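# NOTE: `opt` is never passed to compile(). The string 'adam' below selects
# Adam with its default settings, so lr=0.01 has no effect on training.
opt = Adam(lr=0.01)

model.compile('adam', 'sparse_categorical_crossentropy', metrics=['accuracy'])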

model.summary()

Using TensorFlow backend.


_________________________________________________________________
Layer (type)                 Output Shape              Param #   
conv2d_1 (Conv2D)            (None, 16, 16, 16)        448       
_________________________________________________________________
max_pooling2d_1 (MaxPooling2 (None, 8, 8, 16)          0         
_________________________________________________________________
conv2d_2 (Conv2D)            (None, 8, 8, 16)          2320      
_________________________________________________________________
max_pooling2d_2 (MaxPooling2 (None, 4, 4, 16)          0         
_________________________________________________________________
conv2d_3 (Conv2D)            (None, 4, 4, 32)          4640      
_________________________________________________________________
max_pooling2d_3 (MaxPooling2 (None, 2, 2, 32)          0         
_________________________________________________________________
flatten_1 (Flatten)          (None, 128)               0         
__________

In [21]:
from datetime import datetime
start_time = datetime.now()
model.fit(X, y, validation_split=0.1, epochs=2)
end_time = datetime.now()
print("start_time is ", start_time, "end_time is ", end_time)

Train on 180000 samples, validate on 20000 samples
Epoch 1/2
Epoch 2/2
start_time is  2017-10-06 17:33:40.951346 end_time is  2017-10-06 17:36:13.543381


In [22]:
model.save_weights('model.h5')  # You can download these weights and run the whole test set locally

Now we evaluate the test set using the previously trained model.

In [23]:
submission = pd.read_csv('./sample_submission.csv', index_col='_id')

most_frequent_guess = 1000018296
submission['category_id'] = most_frequent_guess # Most frequent guess

In [25]:
start_time = datetime.now()
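num_images_test = 8000000  # Upper limit on test products to classify (lower it when time is short)

# load_image() and the prediction loop below each call bar.update() once per image,
# hence the factor of 2 in the total
bar = tqdm_notebook(total=num_images_test * 2)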
with open('/datadrive/Cdiscount/test.bson', 'rb') as f, \
         concurrent.futures.ThreadPoolExecutor(num_cpus) as executor:

    data = bson.decode_file_iter(f)

    future_load = []
    
    for i,d in enumerate(data):
        if i >= num_images_test:
            break
        future_load.append(executor.submit(load_image, d['imgs'][0]['picture'], d['_id'], bar))

    print("Starting future processing")
    for future in concurrent.futures.as_completed(future_load):
        x, _id = future.result()
        
        y_cat = rev_labels[np.argmax(model.predict(x[None])[0])]
        if y_cat == -1:
            y_cat = most_frequent_guess

        bar.update()
        submission.loc[_id, 'category_id'] = y_cat
print('Finished')
end_time = datetime.now()
print("start_time is ", start_time, "end_time is ", end_time)


Starting future processing
Finished
start_time is  2017-10-06 18:34:15.340121 end_time is  2017-10-06 19:56:02.358361
 22%| 3536339/16000000 [1:21:58<8:03:58, 429.21it/s]

In [None]:
submission.to_csv('new_submission.csv.gz', compression='gzip')
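
The test loop above calls `model.predict` once per image, so per-call overhead is likely to account for much of the run time. One common alternative is to collect images into fixed-size batches and predict each batch in a single call. The sketch below shows only the batching logic with NumPy: `fake_predict` is a hypothetical stand-in for `model.predict`, and `predict_in_batches` and `batch_size` are assumed names, not something taken from the notebook.

```python
import numpy as np

def fake_predict(batch):
    """Hypothetical stand-in for model.predict: one score row per image."""
    # The score for class k is the mean of channel k, giving shape (n, 3)
    return batch.mean(axis=(1, 2))

def predict_in_batches(images, predict_fn, batch_size=256):
    """Run predict_fn over images in chunks and return the argmax per image."""
    preds = []
    for start in range(0, len(images), batch_size):
        chunk = images[start:start + batch_size]
        scores = predict_fn(chunk)
        preds.append(np.argmax(scores, axis=1))
    return np.concatenate(preds)

# Toy images: image i has channel (i % 3) set to 1, all other channels 0
toy_images = np.zeros((10, 16, 16, 3), dtype=np.float32)
for i in range(10):
    toy_images[i, :, :, i % 3] = 1.0

batched = predict_in_batches(toy_images, fake_predict, batch_size=4)
print(batched)
```

With a real model, the returned indices would still need to be mapped through `rev_labels`, and the -1 'other' class replaced by `most_frequent_guess`, as in the loop above.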