# Kaggle - Cdiscount 

This notebook contains a generator class for Keras called BSONIterator that can read directly from the BSON data. You can use it in combination with ImageDataGenerator for doing data augmentation.
Source: https://www.kaggle.com/humananalog/keras-generator-for-reading-directly-from-bson


In [1]:
import os, sys, math, io
import numpy as np
import pandas as pd
import multiprocessing as mp
import bson
import struct
import tensorflow as tf

import keras
from keras.preprocessing.image import load_img, img_to_array

from collections import defaultdict
from tqdm import tqdm

Using TensorFlow backend.


In [2]:
keras.__version__,tf.__version__

('2.0.8', '1.2.1')

We consider two configurations:
    - mode = 'test'   : the submission file is merged with the training data, so the model is usually tested on a small sample
    - mode = 'submit' : the submission file is ready to send to Kaggle
    
In practice there is no need to alter the submission file, since `model.evaluate` can score the model on held-out data.

In [3]:
mode = 'test'

In [4]:
data_dir = 'C:/Users/Juju/Documents/Cdiscount/'
train_bson_path = os.path.join(data_dir, "train.bson")
num_train_products = 7069896
test_bson_path = os.path.join(data_dir, "test.bson")
num_test_products = 1768182

## 1 - Create a random train/validation split

We split on products, not on individual images. Since some of the categories only have a few products, we do the split separately for each category. This creates two new tables, one for the training images and one for the validation images. There is a row for every single image, so a product with more than one image occurs more than once in its table. We can also shrink the dataset, either by randomly dropping a percentage of the products in each category (`drop_percentage`) or by keeping only the top categories (`num_top_categories`).
(Note: if `drop_percentage` > 0 or `num_top_categories` > 0, the progress bar doesn't go all the way, because its total is the full number of products while only the kept products advance it.)

In [5]:
def make_val_set(df, split_percentage=0.2, drop_percentage=0.,num_top_categories=0):
    # Find the product_ids for each category.
    category_dict = defaultdict(list)
    
    # Add a filter before building the category_dict
    if num_top_categories > 0:
        print("We filter the top ",num_top_categories," categories...")
        top_categories_df = pd.read_csv(data_dir+"top_categories.csv")
        top_categories = top_categories_df["category_id"][:num_top_categories].values
        #print(top_categories)
    
        for ir in tqdm(df.itertuples()):
            if ir[4] in top_categories:
                category_dict[ir[4]].append(ir[0])
    else:
        for ir in tqdm(df.itertuples()):
            category_dict[ir[4]].append(ir[0])
        
    train_list = []
    val_list = []
    with tqdm(total=len(df)) as pbar:
        for category_id, product_ids in category_dict.items():
            category_idx = cat2idx[category_id]

            # Randomly remove products to make the dataset smaller.
            keep_size = int(len(product_ids) * (1. - drop_percentage))
            if keep_size < len(product_ids):
                product_ids = np.random.choice(product_ids, keep_size, replace=False)

            # Randomly choose the products that become part of the validation set.
            val_size = int(len(product_ids) * split_percentage)
            if val_size > 0:
                # A set makes the `product_id in val_ids` test below a constant-time lookup.
                val_ids = set(np.random.choice(product_ids, val_size, replace=False))
            else:
                val_ids = set()

            # Create a new row for each image.
            for product_id in product_ids:
                row = [product_id, category_idx]
                for img_idx in range(df.loc[product_id, "num_imgs"]):
                    if product_id in val_ids:
                        val_list.append(row + [img_idx])
                    else:
                        train_list.append(row + [img_idx])
                pbar.update()
                
    columns = ["product_id", "category_idx", "img_idx"]
    train_df = pd.DataFrame(train_list, columns=columns)
    val_df = pd.DataFrame(val_list, columns=columns)   
    return train_df, val_df
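
The property that matters in `make_val_set` is that the split is made on products, category by category, so no product can end up in both sets. The following is a minimal standard-library sketch of that idea on toy data; the function name `split_products_by_category` and the toy dictionary are illustrative assumptions, not part of the notebook.

```python
import random
from collections import defaultdict

def split_products_by_category(product_to_category, split_fraction=0.2, seed=0):
    """Return (train_ids, val_ids), choosing validation products per category."""
    rng = random.Random(seed)
    by_category = defaultdict(list)
    for product_id, category_id in product_to_category.items():
        by_category[category_id].append(product_id)

    train_ids, val_ids = set(), set()
    for category_id, product_ids in by_category.items():
        # Small categories may contribute no validation products at all.
        val_size = int(len(product_ids) * split_fraction)
        chosen = set(rng.sample(product_ids, val_size))
        val_ids |= chosen
        train_ids |= set(product_ids) - chosen
    return train_ids, val_ids

# Toy data: 10 products in category "a", 3 products in category "b".
toy = {pid: "a" for pid in range(10)}
toy.update({pid: "b" for pid in range(10, 13)})
train_ids, val_ids = split_products_by_category(toy)
print(len(train_ids), len(val_ids))  # 11 2
```

Category "b" is too small to give up a validation product at a 20% split, which is the same behaviour as the `val_size > 0` branch above.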

In [7]:
def make_category_tables():
    cat2idx = {}
    idx2cat = {}
    for ir in categories_df.itertuples():
        category_id = ir[0]
        category_idx = ir[1]  # instead of ir[4], because we changed categories_df
        cat2idx[category_id] = category_idx
        idx2cat[category_idx] = category_id
    return cat2idx, idx2cat

categories_df = pd.read_csv(data_dir+"categories.csv", index_col=0)
print(categories_df.head())
cat2idx, idx2cat = make_category_tables()

             category_idx
category_id              
1000021794              0
1000012764              1
1000012776              2
1000012768              3
1000012755              4


In [12]:
train_offsets_df = pd.read_csv(data_dir+"train_offsets.csv", index_col=0)
train_images_df, val_images_df = make_val_set(train_offsets_df, split_percentage=0.2,
                                              drop_percentage=0.95,num_top_categories=1000)


We filter the top  1000  categories...


7069896it [00:33, 210092.42it/s]
  4%|█▎                            | 303259/7069896 [00:10<03:58, 28331.81it/s]


In [8]:
train_images_df.head()

   product_id  category_idx  img_idx
0    17848689           619        0
1     6240512           619        0
2     8800088           619        0
3     4374649           619        0
4     5694737           619        0


In [9]:
val_images_df.head()

   product_id  category_idx  img_idx
0     4451510           619        0
1     4451510           619        1
2     4451510           619        2
3     4451510           619        3
4    18556624           619        0


In [13]:
print("Number of training images:", len(train_images_df))
print("Number of validation images:", len(val_images_df))
print("Total images:", len(train_images_df) + len(val_images_df))

Number of training images: 420654
Number of validation images: 104257
Total images: 524911


Check the number of categories in each set:

In [14]:
len(train_images_df["category_idx"].unique()), len(val_images_df["category_idx"].unique())

(1000, 1000)

Save the lookup tables as CSV so that we don't need to repeat the above procedure again.

In [15]:
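# Save next to the other data files, where the loading cell in part 2 expects them.
train_images_df.to_csv(data_dir+"train_images.csv")
val_images_df.to_csv(data_dir+"val_images.csv")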

## 2 - The generator

First load the lookup tables from the CSV files (you don't need to do this if you just did all the steps from part 1).

In [16]:
train_offsets_df = pd.read_csv(data_dir+"train_offsets.csv", index_col=0)
train_images_df = pd.read_csv(data_dir+"train_images.csv", index_col=0)
val_images_df = pd.read_csv(data_dir+"val_images.csv", index_col=0)


The Keras generator is implemented by the `BSONIterator` class. It creates batches of images (and their one-hot encoded labels) directly from the BSON file. It can be used with multiple workers.

**Note:** For fastest results, put the train.bson and test.bson files on a fast drive (SSD).

See also the code in: https://github.com/fchollet/keras/blob/master/keras/preprocessing/image.py
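
Reading "directly from the BSON file" relies on the offsets table: each product is fetched with one `seek` to its offset and one `read` of its length. A BSON document begins with a little-endian int32 holding its total size, which is what makes such a table cheap to build. The sketch below imitates that framing with plain length-prefixed byte strings in an in-memory file; the records are not real BSON, and the helpers `make_record`, `build_offsets` and `read_record` are hypothetical names used only for illustration.

```python
import io
import struct

def make_record(payload: bytes) -> bytes:
    # Mimic BSON framing: a little-endian int32 giving the total record size
    # (including these 4 bytes), followed by the payload.
    return struct.pack("<i", len(payload) + 4) + payload

def build_offsets(f):
    """Scan the file once and return a list of (offset, length) pairs."""
    offsets = []
    f.seek(0)
    while True:
        start = f.tell()
        header = f.read(4)
        if len(header) < 4:
            break  # end of file
        (length,) = struct.unpack("<i", header)
        offsets.append((start, length))
        f.seek(start + length)  # skip the body, go to the next record
    return offsets

def read_record(f, offset, length):
    """Random access: jump to one record and return its payload."""
    f.seek(offset)
    return f.read(length)[4:]

payloads = [b"first product", b"second", b"third product here"]
fake_file = io.BytesIO(b"".join(make_record(p) for p in payloads))
offsets = build_offsets(fake_file)
print(offsets)                              # [(0, 17), (17, 10), (27, 22)]
print(read_record(fake_file, *offsets[2]))  # b'third product here'
```

In the notebook, the same two steps appear inside `BSONIterator` as `self.file.seek(offset_row["offset"])` followed by `self.file.read(offset_row["length"])`.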

In [17]:
from keras.preprocessing.image import Iterator
from keras.preprocessing.image import ImageDataGenerator
from keras import backend as K

class BSONIterator(Iterator):
    def __init__(self, bson_file, images_df, offsets_df, num_class,
                 image_data_generator, lock, target_size=(180, 180), 
                 with_labels=True, batch_size=32, shuffle=False, seed=None):

        self.file = bson_file
        self.images_df = images_df
        self.offsets_df = offsets_df
        self.with_labels = with_labels
        self.samples = len(images_df)
        self.num_class = num_class
        self.image_data_generator = image_data_generator
        self.target_size = tuple(target_size)
        self.image_shape = self.target_size + (3,)

        print("Found %d images belonging to %d classes." % (self.samples, self.num_class))

        super(BSONIterator, self).__init__(self.samples, batch_size, shuffle, seed)
        self.lock = lock

    def _get_batches_of_transformed_samples(self, index_array):
        batch_x = np.zeros((len(index_array),) + self.image_shape, dtype=K.floatx())
        if self.with_labels:
            batch_y = np.zeros((len(batch_x), self.num_class), dtype=K.floatx())


        for i, j in enumerate(index_array):
            # Protect file and dataframe access with a lock.
            with self.lock:
                image_row = self.images_df.iloc[j]
                product_id = image_row["product_id"]
                offset_row = self.offsets_df.loc[product_id]

                # Read this product's data from the BSON file.
                self.file.seek(offset_row["offset"])
                item_data = self.file.read(offset_row["length"])

            # Grab the image from the product.
            item = bson.BSON.decode(item_data)
            img_idx = image_row["img_idx"]
            bson_img = item["imgs"][img_idx]["picture"]

            # Load the image.
            img = load_img(io.BytesIO(bson_img), target_size=self.target_size)

            # Preprocess the image.
            x = img_to_array(img)
            x = self.image_data_generator.random_transform(x)
            x = self.image_data_generator.standardize(x)

            # Add the image and the label to the batch (one-hot encoded).
            batch_x[i] = x
            if self.with_labels:
                batch_y[i, image_row["category_idx"]] = 1

        if self.with_labels:
            return batch_x, batch_y
        else:
            return batch_x
        
    def next(self):
        with self.lock:
            index_array = next(self.index_generator)[0]
        return self._get_batches_of_transformed_samples(index_array)

In [18]:
train_bson_file = open(train_bson_path, "rb")

Because the training and validation generators read from the same BSON file, they need to use the same lock to protect it.

In [19]:
import threading
lock = threading.Lock()
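
The reason for sharing one lock is that `seek` followed by `read` on a shared file handle is not atomic: another thread could move the file position between the two calls. Below is a small sketch of the pattern, assuming an in-memory file of fixed-size records as a stand-in for the BSON file; `RECORD_SIZE`, `NUM_RECORDS` and the helper names are illustrative only.

```python
import io
import threading

RECORD_SIZE = 8
NUM_RECORDS = 50
# Stand-in for the BSON file: record k is the byte k repeated RECORD_SIZE times.
shared_file = io.BytesIO(
    b"".join(bytes([k]) * RECORD_SIZE for k in range(NUM_RECORDS)))
lock = threading.Lock()
errors = []

def read_record(k):
    # seek() and read() must happen as one step, otherwise another thread
    # could move the file position in between.
    with lock:
        shared_file.seek(k * RECORD_SIZE)
        data = shared_file.read(RECORD_SIZE)
    return data

def worker(start):
    # Each worker reads every fourth record and checks its content.
    for k in range(start, NUM_RECORDS, 4):
        if read_record(k) != bytes([k]) * RECORD_SIZE:
            errors.append(k)

threads = [threading.Thread(target=worker, args=(s,)) for s in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print("corrupted reads:", len(errors))
```

As in `BSONIterator`, only the file access sits inside the lock; decoding and preprocessing can run outside it so that workers do not block each other longer than necessary.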

Create a generator for training and a generator for validation.

In [20]:
# Width of the one-hot label vector and of the softmax layer. 5270 is the total
# number of categories, although only 1000 remain after the filtering in part 1.
num_classes = 5270
num_train_images = len(train_images_df)
num_val_images = len(val_images_df)
batch_size = 256 #128

# Tip: use ImageDataGenerator for data augmentation and preprocessing.
train_datagen = ImageDataGenerator()
train_gen = BSONIterator(train_bson_file, train_images_df, train_offsets_df, 
                         num_classes, train_datagen, lock,
                         batch_size=batch_size, shuffle=True)

val_datagen = ImageDataGenerator()
val_gen = BSONIterator(train_bson_file, val_images_df, train_offsets_df,
                       num_classes, val_datagen, lock,
                       batch_size=batch_size, shuffle=True)

Found 420654 images belonging to 5270 classes.
Found 104257 images belonging to 5270 classes.
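
The generators report 5270 classes even though only 1000 categories remain after filtering, because the one-hot vector is indexed by the original `category_idx`. One possible way to shrink the output layer to the effective number of classes would be to remap the surviving indices to a compact range. This is a sketch under that assumption and is not something the notebook does; `build_label_remap` is a hypothetical helper and the label list is toy data.

```python
def build_label_remap(category_indices):
    """Map each distinct original index to a compact index 0..K-1, and back."""
    distinct = sorted(set(category_indices))
    old2new = {old: new for new, old in enumerate(distinct)}
    new2old = {new: old for old, new in old2new.items()}
    return old2new, new2old

# Toy labels standing in for train_images_df["category_idx"].
labels = [619, 619, 12, 4071, 12]
old2new, new2old = build_label_remap(labels)
compact = [old2new[c] for c in labels]
print(compact)       # [1, 1, 0, 2, 0]
print(len(old2new))  # 3, the effective number of classes
```

With such a remap, a prediction would have to be translated back through `new2old` before looking up the category id in `idx2cat`.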


## 3 - Training

Create a very simple Keras model and train it, to test that the generators work.

### 3.1 Define the model
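The model is a small convolutional network: three 3x3 convolution layers (32, 64 and 128 filters), each followed by max pooling, then global average pooling and a softmax layer with `num_classes` outputs.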

In [21]:
from keras.models import Sequential
from keras.layers import Dropout, Flatten, Dense
from keras.layers.convolutional import Conv2D
from keras.layers.pooling import MaxPooling2D, GlobalAveragePooling2D

model = Sequential()
model.add(Conv2D(32, 3, padding="same", activation="relu", input_shape=(180, 180, 3)))
model.add(MaxPooling2D())
model.add(Conv2D(64, 3, padding="same", activation="relu"))
model.add(MaxPooling2D())
model.add(Conv2D(128, 3, padding="same", activation="relu"))
model.add(MaxPooling2D())
model.add(GlobalAveragePooling2D())
model.add(Dense(num_classes, activation="softmax"))
model.compile(optimizer="adam",loss="categorical_crossentropy",metrics=["accuracy"])

model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
conv2d_1 (Conv2D)            (None, 180, 180, 32)      896       
_________________________________________________________________
max_pooling2d_1 (MaxPooling2 (None, 90, 90, 32)        0         
_________________________________________________________________
conv2d_2 (Conv2D)            (None, 90, 90, 64)        18496     
_________________________________________________________________
max_pooling2d_2 (MaxPooling2 (None, 45, 45, 64)        0         
_________________________________________________________________
conv2d_3 (Conv2D)            (None, 45, 45, 128)       73856     
_________________________________________________________________
max_pooling2d_3 (MaxPooling2 (None, 22, 22, 128)       0         
_________________________________________________________________
global_average_pooling2d_1 ( (None, 128)               0         
__________

### 3.2 Train the model
One full pass of the training data through the network (forward and backward) is called an __epoch__.<br>
Each step processes one batch, so the number of steps per epoch is:
$\left\lfloor \frac{num\_images}{batch\_size} \right\rfloor$
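
As a quick check of this formula, the image counts printed in part 1 and the batch size of 256 can be plugged in directly. The sketch below uses integer division, as the training cell does, and also shows how many images the rounding leaves out of each pass.

```python
num_train_images = 420654
num_val_images = 104257
batch_size = 256

steps_per_epoch = num_train_images // batch_size
validation_steps = num_val_images // batch_size
print(steps_per_epoch, validation_steps)  # 1643 407

# Images left out of each pass by the floor division.
print(num_train_images % batch_size, num_val_images % batch_size)  # 46 65
```

The value 1643 matches the step total shown in the training progress bar.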

In [None]:
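from keras.callbacks import ModelCheckpoint

# Save the model weights after each epoch in which the validation loss improved.
# (The original cell used an undefined `path_dir`; `data_dir` is assumed here.)
checkpointer = ModelCheckpoint(filepath=data_dir+'weights.hdf5', verbose=1, save_best_only=True)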

model.fit_generator(train_gen,
                    steps_per_epoch=num_train_images // batch_size,
                    epochs=3,
                    validation_data=val_gen,
                    validation_steps=num_val_images // batch_size,
                    #workers=8,
                    verbose=0,
                    callbacks=[checkpointer])

Epoch 1/3
 194/1643 [==>...........................] - ETA: 112592s - loss: 15.4136 - acc: 0.0000e+0 [intermediate progress updates omitted] - ETA: 46979s - loss: 8.4457 - acc: 0.008

In [None]:
# Save the model
model.save(data_dir+'cnn_modelv1.h5')

In [None]:
# To evaluate on the validation set:
from keras.models import load_model
model = load_model(data_dir+'cnn_modelv1.h5')
model.evaluate_generator(val_gen, steps=num_val_images // batch_size, workers=8)

## 4 - Test or Submit

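Note that this is quite slow: the loop below calls `model.predict` once per product, so each batch holds only that product's images.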

In [None]:
submission_df = pd.read_csv(data_dir + "sample_submission.csv")
submission_df.head()

In [None]:
from keras import backend as K
from keras.preprocessing.image import ImageDataGenerator

test_datagen = ImageDataGenerator()
data = bson.decode_file_iter(open(test_bson_path, "rb"))

with tqdm(total=num_test_products) as pbar:
    for c, d in enumerate(data):
        product_id = d["_id"]  
        num_imgs = len(d["imgs"])

        batch_x = np.zeros((num_imgs, 180, 180, 3), dtype=K.floatx())

        for i in range(num_imgs):
            bson_img = d["imgs"][i]["picture"]

            # Load and preprocess the image.
            img = load_img(io.BytesIO(bson_img), target_size=(180, 180))
            x = img_to_array(img)
            x = test_datagen.random_transform(x)
            x = test_datagen.standardize(x)

            # Add the image to the batch.
            batch_x[i] = x

        prediction = model.predict(batch_x, batch_size=num_imgs)
        avg_pred = prediction.mean(axis=0)
        cat_idx = np.argmax(avg_pred)
        #print("avg",cat_idx)
        
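        # Assign with a single indexer; chained indexing (iloc[c][...]) may write to a copy.
        submission_df.iloc[c, submission_df.columns.get_loc("category_id")] = idx2cat[cat_idx]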
        pbar.update()
        
submission_df.to_csv("my_submission.csv.gz", compression="gzip", index=False)        
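
The loop above predicts one category per product by averaging the class probabilities over all of that product's images and taking the argmax of the average. Here is a minimal pure-Python sketch of that step on made-up probabilities; `predict_product` is a hypothetical helper, and the notebook itself does this with `prediction.mean(axis=0)` and `np.argmax`.

```python
def predict_product(per_image_probs):
    """Average per-image class probabilities; return (best_class, averaged)."""
    num_imgs = len(per_image_probs)
    num_classes = len(per_image_probs[0])
    avg = [sum(p[c] for p in per_image_probs) / num_imgs
           for c in range(num_classes)]
    best = max(range(num_classes), key=avg.__getitem__)
    return best, avg

# Three images of one product, three classes. The first image alone would
# vote for class 0, but the average favours class 1.
probs = [
    [0.6, 0.3, 0.1],
    [0.1, 0.7, 0.2],
    [0.2, 0.6, 0.2],
]
best, avg = predict_product(probs)
print(best)  # 1
```

Averaging lets confident images outweigh an ambiguous one, which a simple majority vote over per-image argmaxes would not do.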