# Overview
The notebook is adapted from one made for the [Quick, Draw Dataset](https://www.kaggle.com/google/tinyquickdraw). It would be interesting to see how much a transfer learning approach that uses that data as a starting point could help.

## This Notebook
The notebook preprocesses the stroke data from the QuickDraw Competition and trains a stroke-based LSTM. The outcome variable (y) is always the category. The model 'preprocesses' the stroke data a bit using 1D convolutions, then uses two stacked LSTMs followed by two dense layers to make the classification. The model can be thought of as 'reading' the drawing stroke by stroke.

## Fun Models

After the classification models, we try to build a few models to understand what the LSTM actually does. Here we experiment step by step to see how the prediction changes with each step.
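
One way to run such a step-by-step experiment is to show the model only the first k points of a drawing, zero out the rest, and record the predicted class as k grows. The sketch below assumes a generic `predict_fn` (a hypothetical callable, which in this notebook could wrap `stroke_read_model.predict`) and uses a toy stand-in so that it runs on its own. The helper name `predictions_by_prefix` is illustrative and not part of the original notebook.

```python
import numpy as np

def predictions_by_prefix(predict_fn, drawing, step=1):
    """Return (k, predicted class index) for growing prefixes of a drawing.

    drawing: (N, 3) array of x, y, stroke flag, zero-padded at the end.
    predict_fn: hypothetical callable mapping a (1, N, 3) batch to class
    probabilities of shape (1, n_classes).
    """
    n_valid = int((drawing[:, 2] > 0).sum())  # rows that are not padding
    out = []
    for k in range(step, n_valid + 1, step):
        prefix = np.zeros_like(drawing)
        prefix[:k] = drawing[:k]  # keep the first k points, zero the rest
        probs = predict_fn(prefix[np.newaxis])
        out.append((k, int(np.argmax(probs[0]))))
    return out

# toy stand-in for a trained model: the class is the number of strokes
# started so far, capped at 2
def toy_predict(batch):
    n_strokes = int((batch[0, :, 2] == 2).sum())
    probs = np.zeros((1, 3))
    probs[0, min(n_strokes, 2)] = 1.0
    return probs

# two strokes of two points each, followed by one padding row
toy_drawing = np.array([[0, 0, 2], [1, 1, 1], [5, 5, 2], [6, 6, 1], [0, 0, 0]])
prefix_preds = predictions_by_prefix(toy_predict, toy_drawing)
print(prefix_preds)
```

With a real model, plotting the predicted class (or its probability) against k shows at which point of the drawing the model settles on its answer.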

### Next Steps
The next steps could be
- use more data to train
- include the country code (different countries draw different things in different ways)
- more complex models

### Model Parameters
Here we keep track of the relevant parameters for the data preprocessing, model construction and training.

In [72]:
batch_size = 4096
STROKE_COUNT = 196
TRAIN_SAMPLES = 1750
VALID_SAMPLES = 300
TEST_SAMPLES = 300
NUM_CLASSES = 150
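
As a quick sanity check on these settings, the following sketch works out how many drawings each split contains and how many batches one epoch needs. It only redoes arithmetic on the constants above and assumes that at least `NUM_CLASSES` category files are found; the derived variable names are illustrative.

```python
import math

batch_size = 4096
TRAIN_SAMPLES, VALID_SAMPLES, TEST_SAMPLES = 1750, 300, 300
NUM_CLASSES = 150

# samples are drawn per class, so each split holds samples * classes drawings
n_train = TRAIN_SAMPLES * NUM_CLASSES
n_valid = VALID_SAMPLES * NUM_CLASSES
n_test = TEST_SAMPLES * NUM_CLASSES

# the last batch of an epoch is smaller than batch_size
batches_per_epoch = math.ceil(n_train / batch_size)
print(n_train, n_valid, n_test, batches_per_epoch)
```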

In [73]:
%matplotlib inline
import os
import numpy as np
np.random.seed(69)
import matplotlib.pyplot as plt
from keras.utils.np_utils import to_categorical
from keras.preprocessing.sequence import pad_sequences
from sklearn.preprocessing import LabelEncoder
import pandas as pd
from keras.metrics import top_k_categorical_accuracy
def top_3_accuracy(x,y): return top_k_categorical_accuracy(x,y, 3)
from keras.callbacks import ModelCheckpoint, LearningRateScheduler, EarlyStopping, ReduceLROnPlateau
from glob import glob
import gc
gc.enable()
def get_available_gpus():
    from tensorflow.python.client import device_lib
    local_device_protos = device_lib.list_local_devices()
    return [x.name for x in local_device_protos if x.device_type == 'GPU']
base_dir = os.path.join('..', 'input')
test_path = os.path.join(base_dir, 'test_simplified.csv')

In [74]:
from ast import literal_eval
ALL_TRAIN_PATHS = glob(os.path.join(base_dir, 'train_simplified', '*.csv'))
COL_NAMES = ['countrycode', 'drawing', 'key_id', 'recognized', 'timestamp', 'word']

def _stack_it(raw_strokes):
    """preprocess the string and make 
    a standard Nx3 stroke vector"""
    stroke_vec = literal_eval(raw_strokes) # string->list
    # unwrap the list
    in_strokes = [(xi,yi,i)  
     for i,(x,y) in enumerate(stroke_vec) 
     for xi,yi in zip(x,y)]
    c_strokes = np.stack(in_strokes)
    # replace stroke id with 1 for continue, 2 for new
    c_strokes[:,2] = [1]+np.diff(c_strokes[:,2]).tolist()
    c_strokes[:,2] += 1 # since 0 is no stroke
    # pad the strokes with zeros
    x = pad_sequences(c_strokes.swapaxes(0, 1), 
                         maxlen=STROKE_COUNT, 
                         padding='post').swapaxes(0, 1)
    return x

def read_batch(samples=5, 
               start_row=0,
               max_rows = 1000):
    """
    load and process the csv files
    this function is horribly inefficient but simple
    """
    out_df_list = []
    for c_path in ALL_TRAIN_PATHS[:NUM_CLASSES]:
        c_df = pd.read_csv(c_path, nrows=max_rows, skiprows=start_row)
        c_df.columns=COL_NAMES
        out_df_list += [c_df.sample(samples)[['drawing', 'word']]]
    full_df = pd.concat(out_df_list)
    full_df['drawing'] = full_df['drawing'].\
        map(_stack_it)
    
    return full_df
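
To make the encoding in `_stack_it` concrete, here is a small NumPy-only sketch of the same idea applied to a toy drawing with two strokes. It is an illustration, not the notebook's code: `stack_strokes` is a hypothetical name, it takes an already parsed list instead of a string, and it pads by hand instead of calling `pad_sequences`.

```python
import numpy as np

def stack_strokes(stroke_vec, max_len):
    """Flatten a list of (xs, ys) strokes into a zero-padded (max_len, 3) array."""
    points = [(xi, yi, i)
              for i, (x, y) in enumerate(stroke_vec)
              for xi, yi in zip(x, y)]
    arr = np.array(points)
    # 2 = point starts a new stroke, 1 = point continues the current stroke
    arr[:, 2] = np.concatenate([[1], np.diff(arr[:, 2])]) + 1
    # keep the last max_len points, like the default truncating='pre'
    # of pad_sequences
    arr = arr[-max_len:]
    out = np.zeros((max_len, 3), dtype=arr.dtype)
    out[:len(arr)] = arr  # remaining zero rows mean "no stroke"
    return out

# two strokes: a diagonal line (2 points) and a horizontal line (3 points)
toy = [([0, 10], [0, 10]), ([20, 30, 40], [5, 5, 5])]
encoded = stack_strokes(toy, max_len=6)
print(encoded)
```

Each row is `(x, y, flag)`. The third column is what lets the model tell where one stroke ends and the next begins, and the all-zero rows are padding.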

# Reading and Parsing
Since there is too much data (23GB) to read in at once, we take only a portion of it for training, validation and hold-out testing. This should give us an idea of how well the model works, but leaves lots of room for improvement later.

In [75]:
train_args = dict(samples=TRAIN_SAMPLES, 
                  start_row=0, 
                  max_rows=int(TRAIN_SAMPLES*1.5))
valid_args = dict(samples=VALID_SAMPLES, 
                  start_row=train_args['max_rows']+1, 
                  max_rows=VALID_SAMPLES+25)
test_args = dict(samples=TEST_SAMPLES, 
                 start_row=valid_args['max_rows']+train_args['max_rows']+1, 
                 max_rows=TEST_SAMPLES+25)
train_df = read_batch(**train_args)
valid_df = read_batch(**valid_args)
test_df = read_batch(**test_args)
word_encoder = LabelEncoder()
word_encoder.fit(train_df['word'])
print('words', len(word_encoder.classes_), '=>', ', '.join([x for x in word_encoder.classes_]))

words 150 => The Great Wall of China, alarm clock, angel, animal migration, ant, anvil, apple, asparagus, axe, banana, barn, baseball, bat, bathtub, beach, bear, bicycle, birthday cake, blackberry, blueberry, book, boomerang, bowtie, bread, bridge, broccoli, bus, bush, butterfly, cake, calendar, campfire, canoe, cat, ceiling fan, cell phone, chandelier, church, circle, clarinet, computer, cookie, crown, cup, diving board, dog, dolphin, donut, door, dresser, drill, elbow, elephant, eyeglasses, face, fan, fireplace, flip flops, foot, garden, golf club, grapes, grass, hamburger, hammer, headphones, helicopter, hockey puck, hospital, hot air balloon, hot dog, hourglass, house, house plant, hurricane, ice cream, jacket, jail, key, lantern, map, marker, megaphone, mermaid, monkey, mosquito, motorbike, mountain, mushroom, nose, ocean, octopus, onion, paper clip, parrot, passport, peanut, pineapple, pliers, police car, pool, popsicle, potato, purse, rake, river, roller coaster, sandwich, sea t

# Stroke-based Classification
Here we use the stroke information to train a model and see if the strokes give us a better idea of what the shape could be. 

In [76]:
def get_Xy(in_df):
    X = np.stack(in_df['drawing'], 0)
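    # fix the number of columns so every split gets the same one-hot width
    y = to_categorical(word_encoder.transform(in_df['word'].values),
                       num_classes=len(word_encoder.classes_))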
    return X, y
train_X, train_y = get_Xy(train_df)
valid_X, valid_y = get_Xy(valid_df)
test_X, test_y = get_Xy(test_df)
print(train_X.shape)

(262500, 196, 3)
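
The label side of `get_Xy` maps each word to an integer index and then to a one-hot row. A minimal NumPy-only sketch of that mapping, using made-up words, could look like this. Here `np.unique` stands in for `LabelEncoder` (both sort the class names) and `np.eye` stands in for `to_categorical`.

```python
import numpy as np

words = np.array(['cat', 'apple', 'cat', 'dog'])

# sorted class names and the integer index of each word
classes, idx = np.unique(words, return_inverse=True)

# one row per sample, one column per class, a single 1 in each row
one_hot = np.eye(len(classes))[idx]
print(classes, idx, one_hot.shape)
```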


In [77]:
# fig, m_axs = plt.subplots(3,3, figsize = (16, 16))
# rand_idxs = np.random.choice(range(train_X.shape[0]), size = 9)
# for c_id, c_ax in zip(rand_idxs, m_axs.flatten()):
#     test_arr = train_X[c_id]
#     test_arr = test_arr[test_arr[:,2]>0, :] # only keep valid points
#     lab_idx = np.cumsum(test_arr[:,2]-1)
#     for i in np.unique(lab_idx):
#         c_ax.plot(test_arr[lab_idx==i,0], 
#                 np.max(test_arr[:,1])-test_arr[lab_idx==i,1], '.-')
#     c_ax.axis('off')
#     c_ax.set_title(word_encoder.classes_[np.argmax(train_y[c_id])])

# LSTM to Parse Strokes
The model suggested in the tutorial is

![Suggested Model](https://www.tensorflow.org/versions/master/images/quickdraw_model.png)

In [81]:
from keras.models import Sequential
from keras.layers import BatchNormalization, Conv1D, LSTM, Dense, Dropout, MaxPool1D
from keras import optimizers


In [82]:
if len(get_available_gpus())>0:
    # https://twitter.com/fchollet/status/918170264608817152?lang=en
    from keras.layers import CuDNNLSTM as LSTM # this one is about 3x faster on GPU instances
stroke_read_model = Sequential()
stroke_read_model.add(BatchNormalization(input_shape = (None,)+train_X.shape[2:]))
# filter count and length are taken from the script https://github.com/tensorflow/models/blob/master/tutorials/rnn/quickdraw/train_model.py
stroke_read_model.add(Conv1D(128, (5,), padding='same', activation='relu'))
stroke_read_model.add(Dropout(0.15))
stroke_read_model.add(MaxPool1D(pool_size=3, strides=2))
stroke_read_model.add(Conv1D(256, (3,), padding='same', activation='relu'))
stroke_read_model.add(Dropout(0.15))
stroke_read_model.add(Conv1D(256, (3,), padding='same', activation='relu'))
stroke_read_model.add(Dropout(0.15))
stroke_read_model.add(Conv1D(256, (3,), padding='same', activation='relu'))
stroke_read_model.add(Dropout(0.15))
stroke_read_model.add(MaxPool1D(pool_size=3, strides=2))
stroke_read_model.add(Dropout(0.2))
stroke_read_model.add(LSTM(128, return_sequences = True))
stroke_read_model.add(Dropout(0.3))
stroke_read_model.add(LSTM(128, return_sequences = False))
stroke_read_model.add(Dropout(0.3))
stroke_read_model.add(Dense(512))
stroke_read_model.add(Dropout(0.4))
stroke_read_model.add(Dense(len(word_encoder.classes_), activation = 'softmax'))
adam = optimizers.Adam(lr=0.0001)
stroke_read_model.compile(optimizer = adam, 
                          loss = 'categorical_crossentropy', 
                          metrics = ['categorical_accuracy', top_3_accuracy])
stroke_read_model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
batch_normalization_11 (Batc (None, None, 3)           12        
_________________________________________________________________
conv1d_30 (Conv1D)           (None, None, 128)         2048      
_________________________________________________________________
dropout_57 (Dropout)         (None, None, 128)         0         
_________________________________________________________________
max_pooling1d_9 (MaxPooling1 (None, None, 128)         0         
_________________________________________________________________
conv1d_31 (Conv1D)           (None, None, 256)         98560     
_________________________________________________________________
dropout_58 (Dropout)         (None, None, 256)         0         
_________________________________________________________________
conv1d_32 (Conv1D)           (None, None, 256)         196864    
__________
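
The parameter counts shown in the summary can be reproduced by hand, which is a useful check that the layers are wired as intended. The sketch below assumes the standard formulas (a `Conv1D` layer has `kernel * in_channels * filters` weights plus one bias per filter, and `BatchNormalization` keeps four values per channel) and also works out how many time steps reach the LSTM for a full-length input of 196 points. The helper names are illustrative.

```python
def conv1d_params(in_channels, filters, kernel):
    # one weight per (kernel position, input channel, filter) plus a bias per filter
    return kernel * in_channels * filters + filters

def pool_out_len(n, pool_size=3, strides=2):
    # output length of MaxPool1D with 'valid' padding, the Keras default
    return (n - pool_size) // strides + 1

param_counts = {
    'batch_normalization': 4 * 3,        # gamma, beta, moving mean, moving variance
    'conv1d_first': conv1d_params(3, 128, 5),
    'conv1d_second': conv1d_params(128, 256, 3),
    'conv1d_third': conv1d_params(256, 256, 3),
}

# padding='same' convolutions keep the length, only the two pooling layers shrink it
lstm_steps = pool_out_len(pool_out_len(196))
print(param_counts, lstm_steps)
```

The four counts match the 12, 2048, 98560 and 196864 listed above, and the two pooling layers reduce 196 input points to 48 time steps for the LSTMs.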

In [83]:
weight_path="{}_weights.best.hdf5".format('stroke_lstm_model')

checkpoint = ModelCheckpoint(weight_path, monitor='val_loss', verbose=1, 
                             save_best_only=True, mode='min', save_weights_only = True)

reduceLROnPlat = ReduceLROnPlateau(monitor='val_loss', factor=0.6, patience=3, 
                                   verbose=1, mode='auto', cooldown=3, min_lr=0.000005)
early = EarlyStopping(monitor="val_loss", 
                      mode="min", 
                      patience=10) # probably needs to be more patient, but kaggle time is limited
callbacks_list = [checkpoint, early, reduceLROnPlat]
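
To get a feel for how far `ReduceLROnPlateau` can lower the learning rate with these settings, the sketch below lists the successive rates if every plateau triggers a reduction. It assumes the starting rate of 0.0001 from the optimizer above and the callback's rule of multiplying by `factor` and clipping at `min_lr`. The function name `plateau_schedule` is illustrative, and patience and cooldown are ignored.

```python
def plateau_schedule(lr, factor=0.6, min_lr=5e-6):
    """Successive learning rates if every plateau triggers a reduction."""
    rates = [lr]
    while rates[-1] > min_lr:
        # multiply by the factor, but never go below min_lr
        rates.append(max(rates[-1] * factor, min_lr))
    return rates

rates = plateau_schedule(1e-4)
for step, rate in enumerate(rates):
    print(step, rate)
```

Under these assumptions the rate reaches its floor after six reductions, so later plateaus no longer change it and only `EarlyStopping` can end the run.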

In [84]:
# from keras.callbacks import Callback
# class OutputClearNEpoch(Callback):
#     def on_epoch_end(self, epoch, logs={}):
#         current = logs.get(self.monitor)
#         if epoch % 5 == 0:
#             clear_output()

In [70]:
from IPython.display import clear_output
stroke_read_model.fit(train_X, train_y,
                      validation_data = (valid_X, valid_y), 
                      batch_size = batch_size,
                      epochs = 150,
                      callbacks = callbacks_list)
#clear_output()

Train on 150000 samples, validate on 30000 samples
Epoch 1/100

Epoch 00001: val_loss improved from inf to 3.59772, saving model to stroke_lstm_model_weights.best.hdf5
Epoch 2/100

Epoch 00002: val_loss improved from 3.59772 to 3.36770, saving model to stroke_lstm_model_weights.best.hdf5
Epoch 3/100

Epoch 00003: val_loss improved from 3.36770 to 3.07866, saving model to stroke_lstm_model_weights.best.hdf5
Epoch 4/100

Epoch 00004: val_loss improved from 3.07866 to 2.65586, saving model to stroke_lstm_model_weights.best.hdf5
Epoch 5/100

Epoch 00005: val_loss improved from 2.65586 to 2.33522, saving model to stroke_lstm_model_weights.best.hdf5
Epoch 6/100

Epoch 00006: val_loss improved from 2.33522 to 2.02689, saving model to stroke_lstm_model_weights.best.hdf5
Epoch 7/100

Epoch 00007: val_loss improved from 2.02689 to 1.84932, saving model to stroke_lstm_model_weights.best.hdf5
Epoch 8/100

Epoch 00008: val_loss improved from 1.84932 to 1.54464, saving model to stroke_lstm_model_wei


Epoch 00050: val_loss improved from 0.63791 to 0.63439, saving model to stroke_lstm_model_weights.best.hdf5
Epoch 51/100

Epoch 00051: val_loss did not improve from 0.63439
Epoch 52/100

Epoch 00052: val_loss did not improve from 0.63439
Epoch 53/100

Epoch 00053: val_loss did not improve from 0.63439

Epoch 00053: ReduceLROnPlateau reducing learning rate to 0.0003600000170990825.
Epoch 54/100

Epoch 00054: val_loss improved from 0.63439 to 0.62472, saving model to stroke_lstm_model_weights.best.hdf5
Epoch 55/100

Epoch 00055: val_loss did not improve from 0.62472
Epoch 56/100

Epoch 00056: val_loss did not improve from 0.62472
Epoch 57/100

Epoch 00057: val_loss did not improve from 0.62472
Epoch 58/100

Epoch 00058: val_loss did not improve from 0.62472

Epoch 00058: ReduceLROnPlateau reducing learning rate to 0.00021600000327453016.
Epoch 59/100

Epoch 00059: val_loss did not improve from 0.62472
Epoch 60/100

Epoch 00060: val_loss did not improve from 0.62472
Epoch 61/100

Epoch 0

<keras.callbacks.History at 0x7fb2a9db4d68>

In [71]:
stroke_read_model.load_weights(weight_path)
lstm_results = stroke_read_model.evaluate(test_X, test_y, batch_size = batch_size)
print('Accuracy: %2.1f%%, Top 3 Accuracy %2.1f%%' % (100*lstm_results[1], 100*lstm_results[2]))

Accuracy: 83.6%, Top 3 Accuracy 93.3%
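
For reference, the two reported numbers differ only in how many of the highest-scoring classes count as a hit. A minimal NumPy sketch of top-k accuracy on made-up probabilities is shown below. `top_k_accuracy` is an illustrative helper, not the Keras metric itself, and it assumes one-hot labels as produced by `get_Xy`.

```python
import numpy as np

def top_k_accuracy(y_true, y_prob, k=3):
    """Fraction of rows whose true class is among the k highest-scoring classes."""
    true_idx = np.argmax(y_true, axis=1)
    top_k = np.argsort(y_prob, axis=1)[:, -k:]  # indices of the k largest scores
    return float(np.mean([t in row for t, row in zip(true_idx, top_k)]))

# four samples, four classes; the true class of sample i is class i
y_true = np.eye(4)[[0, 1, 2, 3]]
y_prob = np.array([[0.7, 0.1, 0.1, 0.1],
                   [0.4, 0.3, 0.2, 0.1],
                   [0.4, 0.3, 0.2, 0.1],
                   [0.1, 0.2, 0.3, 0.4]])

top1 = top_k_accuracy(y_true, y_prob, k=1)
top2 = top_k_accuracy(y_true, y_prob, k=2)
top3 = top_k_accuracy(y_true, y_prob, k=3)
print(top1, top2, top3)
```

Top-k accuracy can never be lower than top-1 accuracy, which is why the top 3 figure above is the larger of the two.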
