## Columbia University
### ECBM E4040 Neural Networks and Deep Learning. Fall 2021.

# ECBM E4040 - Assignment 2 - Task 5: Kaggle Open-ended Competition

Kaggle is a platform for predictive modelling and analytics competitions in which companies and researchers post data and statisticians and data miners compete to produce the best models for predicting and describing the data.

If you don't have a Kaggle account, feel free to join at [www.kaggle.com](https://www.kaggle.com). To let the CAs do the grading more conveniently, please __use Lionmail to join Kaggle__ and __use UNI as your username__.

The competition is located here: https://www.kaggle.com/c/ecbm4040-assignment-2-task-5/overview.

You can find a detailed description of this in-class competition on the website above. Please read it carefully and follow the instructions.

<span style="color:red">__TODO__:</span>
1. Train a custom model for the bottle dataset classification problem. You are free to use any methods taught in class or found on the Internet (ALWAYS cite the source).
General training methods include:
    * Dropout
    * Batch normalization
    * Early stopping
    * l1-norm & l2-norm penalization
    
2. You are given the test set to generate your predictions (70% public + 30% private, but you don't know which samples are public or private). You should achieve an accuracy of at least 70% on the public test set. Two points will be deducted for each 1% below the 70% accuracy threshold (e.g., 65% accuracy incurs a 10-point deduction). The accuracy will be shown on the public leaderboard once you submit your prediction .csv file. The private leaderboard will be released after the competition. The final ranking is based on the private leaderboard result, not the public leaderboard.
3. Report and save your work:
    * Report your results on Kaggle, for comparison with other students' best results (you can do this several times).
    * Save your best model through Github Classroom at the same time that you submit the homework files to Courseworks. See the instructions below.
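
The deduction rule in item 2 can be sketched as a small function. This is only an illustration: the name `accuracy_deduction` is hypothetical, and it assumes the deduction scales linearly with the shortfall and is neither rounded nor capped, since the text only gives the 65% case.

```python
def accuracy_deduction(public_accuracy_pct, threshold_pct=70.0, points_per_pct=2.0):
    """Points deducted for falling short of the public-leaderboard threshold.

    Assumption: linear in the shortfall, no rounding, no cap. The assignment
    text only states the 65% -> 10 points example.
    """
    shortfall = max(0.0, threshold_pct - public_accuracy_pct)
    return points_per_pct * shortfall


for acc in (65.0, 70.0, 82.5):
    print(f"{acc:.1f}% -> {accuracy_deduction(acc):.1f} points deducted")
```

Anything at or above the threshold yields a deduction of zero under this reading.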

__Hint__: You can start from what you implemented in task 4. VGG16, another classic classification model, is also easy to implement. You may use pretrained networks and transfer learning.
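
Of the general training methods listed above, L1-norm and L2-norm penalization are the easiest to show numerically: a term proportional to the sum of absolute (L1) or squared (L2) weights is added to the data loss. The sketch below uses NumPy with made-up weights, a made-up data loss, and a hypothetical coefficient `lam`; it illustrates the arithmetic only and is not tied to any particular model.

```python
import numpy as np


def l1_penalty(weights, lam):
    """L1 penalty: lam times the sum of absolute weight values."""
    return lam * np.sum(np.abs(weights))


def l2_penalty(weights, lam):
    """L2 penalty: lam times the sum of squared weight values."""
    return lam * np.sum(np.square(weights))


# Made-up weights and data loss, for illustration only
w = np.array([[0.5, -1.0], [2.0, 0.0]])
data_loss = 0.30

total_l1 = data_loss + l1_penalty(w, lam=0.01)  # 0.30 + 0.01 * 3.5
total_l2 = data_loss + l2_penalty(w, lam=0.01)  # 0.30 + 0.01 * 5.25
print(total_l1, total_l2)
```

The L2 term grows quadratically with large weights such as the 2.0 entry, while the L1 term treats every unit of magnitude equally.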

## HW Submission Details:
There are three components to reporting the results of this task: 

**(A) Submission (up to 20 submissions each day) of the .csv prediction file through the Kaggle platform.** You should start doing this __VERY early__, so that you can compare your work with that of other students as you make progress with model optimization.

**(B) Editing and submitting the content of this Jupyter notebook through Courseworks:**
(i) The code for your CNN model and for the training function. The code should be stored in __./ecbm4040/neuralnets/kaggle.py__;
(ii) Print out your training process and accuracy __within this notebook__;

**(C) Submitting your best CNN model through Github Classroom repo.**

**Description of (C):** 
For this task, we will continue to use Github Classroom to save your model for submission.

<span style="color:red">__Submission content:__</span>
(i) In your Assignment 2 submission folder, create a subfolder called __KaggleModel__. Upload your best model with all the data output (for example, __MODEL.data-00000-of-00001, MODEL.meta, MODEL.index__) into the folder. 
(ii) Remember to delete any intermediate results: **we only want your best model. Do not upload any data files**. The instructors will rerun the uploaded best model and verify it against the score that you reported on Kaggle.



In [None]:
%load_ext autoreload
%autoreload 2

In [26]:
#import zipfile as zf
#files = zf.ZipFile('/home/ecbm4040/e4040-2021fall-assign2-hl3515/data/ecbm4040-assignment-2-task-5.zip', 'r')
#files.extractall('/home/ecbm4040/e4040-2021fall-assign2-hl3515/data/Kaggle')
#files.close()

## Load Data

In [1]:
#Generate dataset
import os
import pandas as pd
import numpy as np
from PIL import Image

#Load Training images and labels
path = os.getcwd()
train_directory ="/home/ecbm4040/e4040-2021fall-assign2-hl3515/data/Kaggle/kaggle_train_128/train_128" #TODO: Enter path for train128 folder (hint: use os.getcwd())
image_list=[]
label_list=[]
for sub_dir in os.listdir(train_directory):
    print("Reading folder {}".format(sub_dir))
    sub_dir_name=os.path.join(train_directory,sub_dir)
    for file in os.listdir(sub_dir_name):
        filename = os.fsdecode(file)
        if filename.endswith(".jpg") or filename.endswith(".png"):
            image_list.append(np.array(Image.open(os.path.join(sub_dir_name,file))))
            label_list.append(int(sub_dir))
X_train=np.array(image_list)
y_train=np.array(label_list)

#Load Test images
test_directory = "/home/ecbm4040/e4040-2021fall-assign2-hl3515/data/Kaggle/kaggle_test_128/test_128"#TODO: Enter path for test128 folder (hint: use os.getcwd())
test_rows=[]
print("Reading Test Images")
for file in os.listdir(test_directory):
    filename = os.fsdecode(file)
    if filename.endswith(".jpg") or filename.endswith(".png"):
        test_rows.append({
            'Id': filename,
            'X': np.array(Image.open(os.path.join(test_directory,file)))
        })
# Build the frame once from the collected rows (DataFrame.append was removed in pandas 2.0)
test_df = pd.DataFrame(test_rows, columns=['Id', 'X'])
        
test_df['s'] = [int(x.split('.')[0]) for x in test_df['Id']]
test_df = test_df.sort_values(by=['s'])
test_df = test_df.drop(columns=['s'])
X_test = np.stack(test_df['X'])


print('Training data shape: ', X_train.shape)
print('Training labels shape: ', y_train.shape)
print('Test data shape: ', X_test.shape)

Reading folder 1
Reading folder 3
Reading folder 0
Reading folder 2
Reading folder 4
Reading Test Images
Training data shape:  (15000, 128, 128, 3)
Training labels shape:  (15000,)
Test data shape:  (3500, 128, 128, 3)


## Build and Train Your Model Here

In [3]:
#from utils.neuralnets.cnn.my_Kaggle_trainer import MyKaggle_trainer

In [15]:
print('data shape: ', X_train.shape)
print('labels shape: ', y_train.shape)

data shape:  (15000, 128, 128, 3)
labels shape:  (15000,)


In [2]:
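# Shuffle before splitting: the images were loaded one class folder at a time,
# so an unshuffled tail of 1000 samples would come from a single class.
perm = np.random.default_rng(0).permutation(len(X_train))
num_train=14000
X_trainSet = X_train[perm[:num_train]]
y_trainSet = y_train[perm[:num_train]]
X_valSet = X_train[perm[num_train:]]
y_valSet = y_train[perm[num_train:]]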
print('Training data shape: ', X_trainSet.shape)
print('Training labels shape: ', y_trainSet.shape)
print('Validation data shape: ', X_valSet.shape)
print('Validation labels shape: ', y_valSet.shape)
print('Test data shape: ', X_test.shape)

Training data shape:  (14000, 128, 128, 3)
Training labels shape:  (14000,)
Validation data shape:  (1000, 128, 128, 3)
Validation labels shape:  (1000,)
Test data shape:  (3500, 128, 128, 3)
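
Because the loading loop reads one class folder at a time, the labels in `y_train` arrive grouped by class, so it is worth checking how the classes are distributed across a train/validation split. A minimal sketch on toy labels follows; the helper `class_counts` and the toy sizes are illustrative and not part of the assignment code.

```python
import numpy as np


def class_counts(labels, num_classes):
    """Number of samples per class, including classes with zero samples."""
    return np.bincount(labels, minlength=num_classes)


# Toy labels grouped by class, mimicking folder-by-folder loading
y = np.repeat(np.arange(5), 6)   # 30 labels: six 0s, six 1s, ..., six 4s
num_train = 24

# Unshuffled: the validation tail contains only the last class
print("unshuffled val counts:", class_counts(y[num_train:], 5))

# Shuffled with a fixed seed: the tail is drawn from all samples
perm = np.random.default_rng(0).permutation(len(y))
y_shuffled = y[perm]
print("shuffled val counts:  ", class_counts(y_shuffled[num_train:], 5))
```

Running the same check on `y_valSet` shows whether the validation set covers all five classes.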


In [4]:
print("Number of classes: {}".format(len(set(y_train))))

Number of classes: 5


In [None]:
#My_Kaggle = MyKaggle_trainer(X_train, y_train, X_test, epochs=10, batch_size=256, lr=0.001)
#MyKaggle_trainer.run(My_Kaggle)

In [5]:
# YOUR CODE HERE
import tensorflow as tf
from tensorflow.keras.preprocessing.image import ImageDataGenerator
from tensorflow.keras.callbacks import ReduceLROnPlateau
from tensorflow.keras.callbacks import EarlyStopping

In [7]:
from utils.neuralnets.cnn.my_Kaggle import *
model = create_model(lr = 1e-3)
#model.summary()

In [8]:
# Training process
b_size = 64 
num_epoch = 25 
# Stop early if val_loss stalls, and cut the learning rate by 10x when it plateaus.
early_stop = EarlyStopping(monitor='val_loss', min_delta=0.0001, patience=15, verbose=1, mode='auto')
reduce_lr = ReduceLROnPlateau(monitor='val_loss', min_delta=0.0004, patience=2, factor=0.1, min_lr=1e-6, mode='auto', verbose=1)
# Apply the input preprocessing (no random augmentation is configured here)
train_datagen=ImageDataGenerator(preprocessing_function = preprocess_input) 
# Batch generators; validation data goes through the same preprocessing as training data
train_generator = train_datagen.flow(X_trainSet, y_trainSet, batch_size = b_size)
val_generator = train_datagen.flow(X_valSet, y_valSet, batch_size = b_size, shuffle = False)
step_size_train = train_generator.n//train_generator.batch_size
# Train and record the history (model.fit accepts generators; fit_generator is deprecated)
history = model.fit(train_generator,
                    steps_per_epoch = step_size_train,
                    epochs = num_epoch,
                    validation_data=val_generator,
                    callbacks=[reduce_lr, early_stop]
                   )



Epoch 1/25
Epoch 2/25
Epoch 3/25
Epoch 4/25
Epoch 5/25

Epoch 00005: ReduceLROnPlateau reducing learning rate to 0.00010000000474974513.
Epoch 6/25
Epoch 7/25

Epoch 00007: ReduceLROnPlateau reducing learning rate to 1.0000000474974514e-05.
Epoch 8/25
Epoch 9/25

Epoch 00009: ReduceLROnPlateau reducing learning rate to 1.0000000656873453e-06.
Epoch 10/25
Epoch 11/25

Epoch 00011: ReduceLROnPlateau reducing learning rate to 1e-06.
Epoch 12/25
Epoch 13/25
Epoch 14/25
Epoch 15/25
Epoch 16/25
Epoch 17/25
Epoch 18/25
Epoch 00018: early stopping
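
The log above shows the learning rate dropping tenfold at epochs 5 and 7, which is what `patience=2` and `factor=0.1` produce once `val_loss` stops improving after epoch 3. The following is a simplified pure-Python sketch of that plateau rule, not Keras's actual implementation; the function name `simulate_plateau_schedule` and the validation losses are made up for illustration.

```python
def simulate_plateau_schedule(val_losses, lr, factor, patience, min_delta, min_lr):
    """Return the learning rate in effect after each epoch (simplified rule).

    An epoch counts as an improvement only if the loss beats the best value
    so far by more than min_delta. After `patience` epochs without
    improvement, the rate is multiplied by `factor`, but never below min_lr.
    """
    best = float("inf")
    wait = 0
    schedule = []
    for loss in val_losses:
        if loss < best - min_delta:
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                lr = max(lr * factor, min_lr)
                wait = 0
        schedule.append(lr)
    return schedule


# Made-up losses: best value at epoch 3, no improvement afterwards
losses = [1.0, 0.8, 0.7, 0.71, 0.72, 0.73, 0.74]
schedule = simulate_plateau_schedule(
    losses, lr=1e-3, factor=0.1, patience=2, min_delta=0.0004, min_lr=1e-6
)
for epoch, rate in enumerate(schedule, start=1):
    print(f"epoch {epoch}: lr = {rate:.0e}")
```

With these inputs the rate stays at 1e-3 through epoch 4, drops to 1e-4 at epoch 5, and to 1e-5 at epoch 7.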


## Save your best model

In [21]:
# YOUR CODE HERE
model.save('./model/KaggleModel')

INFO:tensorflow:Assets written to: ./model/KaggleModel/assets


## Generate .csv file for Kaggle

The following code snippet can be used to generate your prediction .csv file.

NOTE: If your Kaggle results indicate random performance, the indices of your .csv predictions are likely misaligned.
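
One common source of such misalignment is sorting file names as strings instead of by their numeric stem, which is why the loading cell sorts on `int(x.split('.')[0])`. A small sketch with made-up file names (the helper `numeric_sort` is hypothetical):

```python
def numeric_sort(filenames):
    """Sort names like '10.png' by the integer before the first dot."""
    return sorted(filenames, key=lambda name: int(name.split('.')[0]))


names = ['10.png', '2.png', '0.png', '1.png']

print(sorted(names))        # string order puts '10.png' before '2.png'
print(numeric_sort(names))  # numeric order: 0, 1, 2, 10
```

If the predictions follow one order and the `Id` column follows the other, most rows end up paired with the wrong image.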

In [9]:
from tqdm import tqdm

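# Test-time averaging over tta_steps prediction passes.
# Note: test_datagen applies no random transforms, so every pass sees the same
# inputs; add augmentation (e.g. horizontal_flip=True) for the averaging to matter.
tta_steps = 10
bs=25
predictions = []
test_datagen = ImageDataGenerator(preprocessing_function = preprocess_input)
for i in tqdm(range(tta_steps)):
    # steps must be an integer; ceil covers a final partial batch
    preds = model.predict(test_datagen.flow(X_test, batch_size=bs, shuffle=False), steps = int(np.ceil(len(X_test)/bs)))
    predictions.append(preds)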

100%|██████████| 10/10 [00:27<00:00,  2.76s/it]


In [10]:
print(X_test.shape)
print(len(predictions))
print(len(predictions[0]))
print(len(predictions[0][0]))
print(predictions[0][0][0])
# Average the class probabilities over all passes, then pick the most likely class
pred = np.mean(predictions,axis=0)
final_pred = np.argmax(pred,axis=1)
print(final_pred.shape)
print(final_pred)

(3500, 128, 128, 3)
10
3500
5
4.687297e-06
(3500,)
[1 1 2 ... 3 3 4]
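
The averaging step above reduces a stack of shape (passes, images, classes) to one label per image. A toy NumPy sketch with two made-up passes over three images and two classes shows the shapes involved; all numbers are invented for illustration.

```python
import numpy as np

# Two made-up prediction passes: 3 images, 2 classes each
pass_a = np.array([[0.90, 0.10], [0.40, 0.60], [0.55, 0.45]])
pass_b = np.array([[0.70, 0.30], [0.20, 0.80], [0.35, 0.65]])

stacked = np.stack([pass_a, pass_b])   # shape (2, 3, 2): passes, images, classes
mean_probs = stacked.mean(axis=0)      # shape (3, 2): average over passes
labels = mean_probs.argmax(axis=1)     # shape (3,): one class index per image

print(stacked.shape, mean_probs.shape, labels.shape)
print(labels)
```

For the third image the two passes disagree (class 0 versus class 1), and the averaged probabilities decide in favor of class 1.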


In [11]:
import csv
# newline='' lets the csv module control line endings (avoids blank rows on Windows)
with open('predicted.csv','w', newline='') as csvfile:
    fieldnames = ['Id','label']
    writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
    writer.writeheader()
    for index,l in enumerate(final_pred):
        filename = str(index) + '.png'
        label = str(l)
        writer.writerow({'Id': filename, 'label': label})
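
Before uploading, it can help to read the file back and confirm the header, the row count, and that no `Id` repeats. The checker below is a hedged sketch: `check_submission` is a hypothetical helper, the `Id`/`label` header is taken from the writer above, and it works on CSV text so it runs without the real file.

```python
import csv
import io


def check_submission(csv_text, expected_rows):
    """Return a list of problems found in a submission; empty means none found."""
    reader = csv.DictReader(io.StringIO(csv_text))
    problems = []
    if reader.fieldnames != ['Id', 'label']:
        problems.append(f"unexpected header: {reader.fieldnames}")
    rows = list(reader)
    if len(rows) != expected_rows:
        problems.append(f"expected {expected_rows} rows, found {len(rows)}")
    ids = [row.get('Id') for row in rows]
    if len(set(ids)) != len(ids):
        problems.append("duplicate Ids")
    return problems


good = "Id,label\n0.png,1\n1.png,3\n2.png,0\n"
bad = "Id,label\n0.png,1\n0.png,3\n"

print(check_submission(good, expected_rows=3))
print(check_submission(bad, expected_rows=3))
```

For the real file, pass `open('predicted.csv').read()` and `expected_rows=3500`, the test set size reported earlier in this notebook.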