<a href="https://colab.research.google.com/github/ICRAR/PHYS5511/blob/master/2019/week06/keras_cnn_week06_02.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

A more scalable solution to the Galaxy Zoo competition, based on [the basic solution](https://github.com/ICRAR/PHYS5511/blob/master/2019/week06/keras_cnn_week06.ipynb).

# Machine setup
Make sure the runtime type is set to "GPU".

# Mount G-drive filesystem

In [0]:
from google.colab import drive
drive.mount('/content/drive')

# Solution overview
Before getting started, please provide an overview of your thought process for tackling the problem.

In [0]:
import sys
from zipfile import ZipFile
import numpy as np
import os.path as osp
import pandas as pd
from sklearn.model_selection import train_test_split
from skimage.transform import resize
from tqdm import tqdm
import matplotlib.pyplot as plt
import cv2
from keras.preprocessing.image import ImageDataGenerator
%matplotlib inline



# Basic setup

See [the winning solution's paper](https://arxiv.org/abs/1503.07077) for details about cropping.

In [0]:
# please modify this root_path
root_path = '/content/drive/My Drive/PHYS5512/data/galaxy_zoo'
ORIG_SHAPE = (424, 424)
CROP_SIZE = (256, 256)
IMG_SHAPE = (64, 64)
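
The three shapes above imply a two-step reduction: cut the central `CROP_SIZE` window out of each `ORIG_SHAPE` image, then shrink it to `IMG_SHAPE`. The sketch below illustrates this with NumPy only. The helper names `center_crop` and `block_downsample` are hypothetical (they are not part of this notebook), and block averaging stands in here for `skimage.transform.resize`, which the notebook imports; it only works when the crop size is an integer multiple of the target size.

```python
import numpy as np

ORIG_SHAPE = (424, 424)
CROP_SIZE = (256, 256)
IMG_SHAPE = (64, 64)

def center_crop(img, crop_size=CROP_SIZE):
    """Return the central crop_size window of an (H, W, C) image (hypothetical helper)."""
    h, w = img.shape[:2]
    ch, cw = crop_size
    y1 = (h - ch) // 2  # top offset of the window
    x1 = (w - cw) // 2  # left offset of the window
    return img[y1:y1 + ch, x1:x1 + cw]

def block_downsample(img, out_shape=IMG_SHAPE):
    """Shrink an (H, W, C) image by averaging non-overlapping blocks.

    Assumes H and W are integer multiples of out_shape (256 / 64 = 4 here).
    """
    h, w, c = img.shape
    oh, ow = out_shape
    fh, fw = h // oh, w // ow
    return img.reshape(oh, fh, ow, fw, c).mean(axis=(1, 3))

# Random pixels stand in for a real galaxy image.
dummy = np.random.rand(ORIG_SHAPE[0], ORIG_SHAPE[1], 3)
small = block_downsample(center_crop(dummy))
print(small.shape)  # (64, 64, 3)
```

Cropping first discards the mostly empty sky around the galaxy, so the downsampling step spends its pixels on the object itself.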

# Load catalogues

In [0]:

training_solution_file = osp.join(root_path, 'training_solutions_rev1.csv')
df = pd.read_csv(training_solution_file)

df_train, df_test = train_test_split(df, test_size=.2)
df_train.shape, df_test.shape

In [0]:
estimated_origin_size = ORIG_SHAPE[0] * ORIG_SHAPE[1] * 3 * df_train.shape[0]
estimated_crop_size = CROP_SIZE[0] * CROP_SIZE[1] * 3 * df_train.shape[0]
reshaped_size = IMG_SHAPE[0] * IMG_SHAPE[1] * 3 * df_train.shape[0]
print('Original training size:\t %.2f GBytes' \
      % (estimated_origin_size / 1024 ** 3))
print('Cropped training size:\t %.2f GBytes' \
      % (estimated_crop_size / 1024 ** 3))
print('Reshaped training size:\t %.2f GBytes' \
      % (reshaped_size / 1024 ** 3))

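# Scalable Image Loader

First we define a zip image generator (iterator) that reads images directly from the zip archive. Every `next()` call produces one batch of images and their labels, so the full training set never has to be held in memory.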

In [0]:
def zip_image_generator(dataframe, batch_size, aug=None):
  # Endlessly yields (images, labels) batches read from the training zip.
  # NOTE: `aug` is accepted but not used yet.
  index_counter = 0
  filename = osp.join(root_path, 'images_training_rev1.zip')
  df = dataframe.values
  nb_imgs = df.shape[0]
  with ZipFile(filename) as archive:
    while True:
      x_batch = []
      y_batch = []
      while (len(x_batch) < batch_size):
        ind = index_counter % nb_imgs  # wrap around at the end of the catalogue
        gid = str(int(df[ind][0]))  # first column holds the GalaxyID
        # close each archive member as soon as the image has been decoded
        with archive.open('images_training_rev1/{0}.jpg'.format(gid)) as fn:
          x = plt.imread(fn) / 255.0  # scale pixel values to [0, 1]
        h, w, c = x.shape
        x = np.reshape(x, [1, h, w, c])
        y = df[ind][1:]  # remaining columns are the 37 target values
        x_batch.append(x)
        y_batch.append(y)
        index_counter += 1
      yield (np.vstack(x_batch), np.vstack(y_batch))
      

In [0]:
zip_gen = zip_image_generator(df_train, 2)

In [0]:
imgs, labels = next(zip_gen)
print(imgs.shape, labels.shape)
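
The `aug` argument of `zip_image_generator` is accepted but never applied. One way it could be used is sketched below, assuming that the labels do not depend on image orientation, so an image can be rotated or mirrored without changing its labels. The function `augment_batch` is hypothetical and works on a plain NumPy batch of square images; it is not the `ImageDataGenerator` imported earlier.

```python
import numpy as np

def augment_batch(x_batch, rng):
    """Randomly rotate (by multiples of 90 degrees) and mirror each image.

    x_batch has shape (N, H, W, C) with H == W, so rotations keep the shape.
    Hypothetical helper, not part of the original notebook.
    """
    out = np.empty_like(x_batch)
    for i, img in enumerate(x_batch):
        k = int(rng.integers(0, 4))          # number of quarter turns
        img = np.rot90(img, k=k, axes=(0, 1))
        if rng.random() < 0.5:               # mirror left-right half of the time
            img = img[:, ::-1]
        out[i] = img
    return out

rng = np.random.default_rng(0)
batch = np.random.rand(2, 424, 424, 3)       # stand-in for a generator batch
augmented = augment_batch(batch, rng)
print(augmented.shape)
```

Inside the generator, such a function would be called on `np.vstack(x_batch)` just before the `yield`, leaving the label batch untouched.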

# Build the basic model

Please make sure you explain your model with reasonably detailed rationales. For example:
1. Why use "root mean squared error" as the metric?
2. What is *binary_crossentropy*, and why can't we use the *categorical_crossentropy* that we used in [Tutorial 04](https://colab.research.google.com/github/ICRAR/PHYS5511/blob/master/2019/week04/Keras_FC_network_classifier.ipynb)?
3. What happens if we use *mean_squared_error* as the loss?
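
As a starting point for these questions, the sketch below computes the two quantities in plain NumPy on made-up numbers. It assumes the targets are soft values in [0, 1] whose rows do not have to sum to one; the three-column arrays are invented for illustration (the real task has 37 outputs), and the NumPy functions only mirror the Keras metric and loss rather than replace them.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error over all entries, as in the Keras metric below."""
    return np.sqrt(np.mean((y_pred - y_true) ** 2))

def binary_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean element-wise binary cross-entropy; each output is scored on its own."""
    y_pred = np.clip(y_pred, eps, 1 - eps)  # avoid log(0)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

# Made-up soft labels: two galaxies, three outputs, rows do NOT sum to one.
y_true = np.array([[0.8, 0.2, 0.5], [0.1, 0.6, 0.3]])
y_pred = np.array([[0.7, 0.3, 0.4], [0.2, 0.5, 0.4]])

print('RMSE:', rmse(y_true, y_pred))  # every entry is off by 0.1, so about 0.1
print('BCE :', binary_crossentropy(y_true, y_pred))
print('BCE of a perfect prediction:', binary_crossentropy(y_true, y_true))
print('Row sums of the targets:', y_true.sum(axis=1))
```

Two things are worth noticing when running it: with soft labels the binary cross-entropy of a perfect prediction is not zero, and the row sums show why a softmax output, which forces each row to sum to one, would not fit targets of this kind.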

In [0]:
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D
from keras.layers import Activation, Dropout, Flatten, Dense, BatchNormalization, GlobalMaxPooling2D
from keras import backend as K

def root_mean_squared_error(y_true, y_pred):
  return K.sqrt(K.mean(K.square(y_pred - y_true))) 

model = Sequential()
model.add(Conv2D(512, (3, 3), input_shape=(ORIG_SHAPE[0], ORIG_SHAPE[1], 3)))
model.add(Conv2D(256, (3, 3)))
#model.add(BatchNormalization())
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))

model.add(Conv2D(256, (3, 3)))
model.add(Conv2D(128, (3, 3)))
#model.add(BatchNormalization())
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))

model.add(Conv2D(128, (3, 3)))
model.add(Conv2D(128, (3, 3)))
#model.add(BatchNormalization())
model.add(Activation('relu'))
model.add(GlobalMaxPooling2D())


model.add(Dropout(0.25))
model.add(Dense(128))
model.add(Activation('relu'))
model.add(Dropout(0.25))
model.add(Dense(128))
model.add(Activation('relu'))
model.add(Dropout(0.25))
model.add(Dense(128))
model.add(Activation('relu'))
model.add(Dropout(0.25))
model.add(Dense(37))
model.add(Activation('sigmoid'))

# why can't we
# model.add(Activation('softmax'))

model.compile(loss='binary_crossentropy', optimizer='adamax', metrics=[root_mean_squared_error])
# why can't we 
#model.compile(loss='categorical_crossentropy', optimizer='adamax', metrics=[root_mean_squared_error])
#model.compile(loss='mean_squared_error', optimizer='adamax', metrics=[root_mean_squared_error])
model.summary()

# Training

In [0]:
batch_size = 4
small_train_set = 10000 # please remove this "small" setup
small_val_set = 1000
nb_epochs = 5
train_gen = zip_image_generator(df_train, batch_size)
val_gen = zip_image_generator(df_test, batch_size)
history = model.fit_generator(train_gen, steps_per_epoch=small_train_set // batch_size,
                             validation_data=val_gen, validation_steps=small_val_set // batch_size,
                             epochs=nb_epochs)

# Plot training curves

In [0]:
from matplotlib.ticker import MaxNLocator
fig = plt.figure(figsize=(10, 6))
ax = fig.gca()
ax.xaxis.set_major_locator(MaxNLocator(integer=True))
histories = history.history.items()
xvals = np.arange(1, nb_epochs + 1)
for k, v in histories:
    plt.plot(xvals, v, label=k if 'val_' in k else 'train_%s' % k)

plt.legend(loc='best', fontsize=14)
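plt.suptitle('Training curves', fontsize=16)
# the history holds the binary cross-entropy loss and the RMSE metric, not MSE
plt.ylabel('Loss (binary cross-entropy) / RMSE', fontsize=14)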
plt.xlabel('Epoch', fontsize=14)

# Plot more learning curves?

1. Can you plot the learning curve and analyse the trade-off between variance and bias, as shown on Slide 27 of [Week 03](https://lms.uwa.edu.au/bbcswebdav/pid-133298-dt-announcement-rid-21639517_1/xid-21639517_1)?

(Tip: increase the variable *small_train_set*.)

2. If time permits, you could also try to locate the "optimal capacity" shown on Slide 26 of [Week 03](https://lms.uwa.edu.au/bbcswebdav/pid-133298-dt-announcement-rid-21639517_1/xid-21639517_1).

(Tip: remove or add CNN / FC layers in the model.)

# Test Prediction Submission (change me to get it working)

In [0]:
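import os

# NOTE: get_image is not defined in this notebook and must be supplied.
# Judging by the call below, it should load one image, crop a CROP_SIZE window
# starting at (x1, y1) and resize it to `shape`.
# The model above was built for ORIG_SHAPE inputs, so the image size produced
# here has to be made consistent with the model before this cell can work.
def test_image_generator(ids, shape=IMG_SHAPE):
    x1 = (ORIG_SHAPE[0] - CROP_SIZE[0]) // 2
    y1 = (ORIG_SHAPE[1] - CROP_SIZE[1]) // 2
    x_batch = []
    for i in ids:
        x = get_image('../input/44352/images_test_rev1/'+i, x1, y1, shape=shape, crop_size=CROP_SIZE)
        x_batch.append(x)
    x_batch = np.array(x_batch)
    return x_batch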

val_files = os.listdir('../input/44352/images_test_rev1/')
val_predictions = []
N_val = len(val_files)
for i in tqdm(np.arange(0, N_val, batch_size)):
    if i+batch_size > N_val:
        upper = N_val
    else:
        upper = i+batch_size
    X = test_image_generator(val_files[i:upper])
    y_pred = model.predict(X)
    val_predictions.append(y_pred)
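# Stack the per-batch predictions directly: the last batch may be smaller than
# the others, so converting the list to a single array first is not reliable.
Y_pred = np.vstack(val_predictions)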
ids = np.array([v.split('.')[0] for v in val_files]).reshape(len(val_files),1)
submission_df = pd.DataFrame(np.hstack((ids, Y_pred)), columns=df.columns)
submission_df = submission_df.sort_values(by=['GalaxyID'])
submission_df.to_csv('sample_submission.csv', index=False)
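
One detail of the cell above deserves care: `np.hstack((ids, Y_pred))` joins string IDs with float predictions, which turns every value into a string, so `sort_values` orders the IDs alphabetically rather than numerically. The sketch below shows one possible alternative that keeps the IDs as integers. The file names, predictions, and the two class column names are made up for illustration; in the notebook the column names would come from `df.columns`.

```python
import numpy as np
import pandas as pd

# Illustrative stand-ins for df.columns, val_files and Y_pred.
columns = ['GalaxyID', 'Class1.1', 'Class1.2']
file_names = ['100008.jpg', '99999.jpg']
Y_pred = np.array([[0.3, 0.7],
                   [0.9, 0.1]])

# Parse the numeric ID from each file name so it sorts as a number.
ids = [int(name.split('.')[0]) for name in file_names]

# Build the frame from the float predictions, then add the integer ID column.
submission_df = pd.DataFrame(Y_pred, columns=columns[1:])
submission_df.insert(0, 'GalaxyID', ids)
submission_df = submission_df.sort_values(by='GalaxyID')
print(submission_df)
```

Sorted as strings, `'100008'` would come before `'99999'`; sorted as integers, 99999 comes first.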

# Lessons learnt

1. What are the most important things you learned from this assignment?
2. Are there any limitations of the current solution?