# Model for Nature Conservancy Fisheries Kaggle Competition

#### Dependencies

In [1]:
import fish_data as fd
import numpy as np
import tensorflow as tf
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
%matplotlib inline
import os
import pandas as pd
import json

#### Helper functions

In [2]:
help(fd)

Help on module fish_data:

NAME
    fish_data

DESCRIPTION
    fish_data module contains the helper functions for the model build of the
    Nature Conservancy Fisheries Kaggle Competition.
    
    Dependencies:
        * numpy as np
        * os
        * scipy.ndimage as ndimage
        * scipy.misc as misc
        * scipy.special as special
        * matplotlib.pyplot as plt
        * tensorflow as tf

FUNCTIONS
    count_nodes(x, y, kernel, stride, conv_depth, pad='SAME')
        Calculates the number of total nodes present in the next layer of a
        convolution OR max_pooling event.
    
    decode_image(image_name, size, num_channels=3, mean_channel_vals=[155.0, 155.0, 155.0], mutate=False, crop='random', crop_size=224)
        Converts a dequeued image read from filename to a single tensor array,
        with modifications:
            * smallest dimension resized to standard height and width supplied in size param
            * each channel centered to mean near zero.  Dev

#### Generate a list of filenames

In [3]:
fish_filenames = fd.generate_filenames_list('data/train/', subfolders = True)
print("There are {} filenames in the master set list".format(len(fish_filenames)))
test_filenames = fd.generate_filenames_list('data/test_stg1/', subfolders = False)
print("There are {} filenames in the test set list".format(len(test_filenames)))

There are 3777 filenames in the master set list
There are 1000 filenames in the test set list
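
The source of `fd.generate_filenames_list` is not shown here, so the following is only a minimal sketch of how such a listing could be built with the standard library. The name `list_image_files`, the extension filter, and the sorted return order are all assumptions for illustration, not the module's actual behavior.

```python
import os

def list_image_files(root, subfolders=True, extensions=('.jpg', '.jpeg', '.png')):
    """Hypothetical stand-in for fd.generate_filenames_list (assumed behavior)."""
    found = []
    if subfolders:
        # Walk the whole tree under root, e.g. one folder per fish class.
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in filenames:
                if name.lower().endswith(extensions):
                    found.append(os.path.join(dirpath, name))
    else:
        # Only look at files sitting directly in root.
        for name in os.listdir(root):
            path = os.path.join(root, name)
            if os.path.isfile(path) and name.lower().endswith(extensions):
                found.append(path)
    # Sorting makes the listing reproducible across runs and platforms.
    return sorted(found)
```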


#### Retrieve Dictionary of image dimensions

In [4]:
with open('dimensions_dict.json') as f:
    dim_dict = json.load(f)
    
print("Training/Valid set filename dimensions downloaded correctly: {}".format(
        dim_dict.get(fish_filenames[0]) == [720, 1280, 3]))
print("Test set filename dimensions downloaded correctly: {}".format(
        dim_dict.get(test_filenames[0]) == [720, 1280, 3]))

Training/Valid set filename dimensions downloaded correctly: True
Test set filename dimensions downloaded correctly: True


#### Generate the labels for the master set list

In [5]:
fish_label_arr = fd.make_labels(fish_filenames, 'train/', '/img')
fish_label_arr.shape
print("One label per row entry: {}".format(all(np.sum(fish_label_arr, 1) == 1) ))

One label per row entry: True
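
For reference, here is a rough sketch of how one-hot labels could be derived from the file paths. It assumes the class name is the text between `'train/'` and `'/img'` (mirroring the arguments passed to `fd.make_labels`) and that columns follow the alphabetical class order used in the competition. The function name `one_hot_from_paths` and the example paths are hypothetical, and the real `make_labels` may work differently.

```python
import numpy as np

# Assumed column order: the eight competition classes, alphabetically.
CLASSES = ['ALB', 'BET', 'DOL', 'LAG', 'NoF', 'OTHER', 'SHARK', 'YFT']

def one_hot_from_paths(paths, start='train/', end='/img', classes=CLASSES):
    """Hypothetical sketch: one row per path, a single 1.0 in the class column."""
    labels = np.zeros((len(paths), len(classes)), dtype=np.float32)
    for row, path in enumerate(paths):
        # Take the text between the start and end delimiters as the class name.
        name = path.split(start, 1)[1].split(end, 1)[0]
        labels[row, classes.index(name)] = 1.0
    return labels

# Illustrative paths only; they are not taken from the real file list.
example_paths = ['data/train/ALB/img_00003.jpg', 'data/train/YFT/img_00004.jpg']
example_labels = one_hot_from_paths(example_paths)
print(example_labels)
```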


#### Shuffle and split the master set list into training and validation sets

In [6]:
valid_size = 300
files_train, files_val, y_train, y_val = train_test_split(fish_filenames, fish_label_arr, test_size = valid_size)
print("Validation set size: {}".format(y_val.shape[0]))
print("Training set size: {}".format(y_train.shape[0]))

Validation set size: 300
Training set size: 3477
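
Because the classes are unbalanced, a plain random split can leave the rarest fish under-represented in the 300-image validation set. One option, sketched below on toy data, is the `stratify` argument of `train_test_split`, which keeps class proportions roughly equal in both splits. The toy arrays and the `random_state` value are assumptions for illustration; with the real data the argument would be something like `stratify=np.argmax(fish_label_arr, 1)`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-ins for fish_filenames and fish_label_arr: three unbalanced classes.
toy_classes = np.repeat([0, 1, 2], [140, 40, 20])
toy_labels = np.eye(3)[toy_classes]          # one-hot rows
toy_files = ['img_{:05d}.jpg'.format(i) for i in range(len(toy_classes))]

# stratify expects one class id per sample, so pass the class ids, not the one-hot rows.
f_train, f_val, l_train, l_val = train_test_split(
    toy_files, toy_labels, test_size=40,
    stratify=toy_classes, random_state=0)

print("Overall class fractions:   ", toy_labels.mean(axis=0))
print("Validation class fractions:", l_val.mean(axis=0))
```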


#### Generate a files_train list that represents each class of fish equally

In [None]:
"""Need to refactor generate_balanced_filenames to work from this list, not from scratch."""
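
Until that refactor happens, one way to balance an existing training list is to oversample the minority classes with replacement. The sketch below is not `generate_balanced_filenames`. The function name `balance_by_oversampling` and the toy inputs are hypothetical, and it assumes one-hot labels like those in `y_train`.

```python
import numpy as np

def balance_by_oversampling(files, labels, seed=0):
    """Hypothetical sketch: repeat minority-class files until every class present
    has as many entries as the largest class, then shuffle."""
    rng = np.random.RandomState(seed)
    class_ids = np.argmax(labels, axis=1)
    present = np.unique(class_ids)
    target = max(np.sum(class_ids == c) for c in present)
    chosen = []
    for c in present:
        idx = np.flatnonzero(class_ids == c)
        # Draw the shortfall for this class at random, with replacement.
        extra = rng.choice(idx, size=target - len(idx), replace=True)
        chosen.extend(idx)
        chosen.extend(extra)
    chosen = rng.permutation(chosen)
    return [files[i] for i in chosen], labels[chosen]

# Toy example: class 0 has four files, class 1 has one, class 2 has two.
toy_files = ['a0', 'a1', 'a2', 'a3', 'b0', 'c0', 'c1']
toy_labels = np.eye(3)[[0, 0, 0, 0, 1, 2, 2]]
bal_files, bal_labels = balance_by_oversampling(toy_files, toy_labels)
print(bal_labels.sum(axis=0))
```

Oversampling only repeats images, so it is usually paired with random crops or other mutations to keep the duplicates from being identical.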

In [7]:
# Look up the stored [height, width, channels] entry for each training file.
train_dims_list = [dim_dict.get(f) for f in files_train]

## Graph and Session Runs

#### Graph parameters

In [8]:
%run -i 'PARAMETERS.py'

Dimensions for each entry: 224x224x3 = 150528
Dimensions after first convolution step (with max pool): 27x27x96 = 69984
Dimensions after second convolution step (with max pool): 13x13x256 = 43264
Dimensions after third convolution step: 13x13x384 = 64896
Dimensions after fourth convolution step: 13x13x384 = 64896
Dimensions after fifth convolution step (with max pool): 6x6x256 = 9216
Dimensions after first connected layer: 4096
Dimensions after second connected layer: 2048
Final dimensions for classification: 8
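
The spatial sizes printed above can be reproduced with TensorFlow's padding rules: `'SAME'` gives `ceil(n / stride)` and `'VALID'` gives `ceil((n - kernel + 1) / stride)`. `PARAMETERS.py` is not shown, so the kernel sizes, strides, and padding choices below are AlexNet-style assumptions that happen to match the printed dimensions. The helper `output_side` is hypothetical and is not `fd.count_nodes`.

```python
import math

def output_side(n, kernel, stride, pad='SAME'):
    """Output height/width of a conv or pool layer under TensorFlow's padding rules."""
    if pad == 'SAME':
        return math.ceil(n / stride)
    if pad == 'VALID':
        return math.ceil((n - kernel + 1) / stride)
    raise ValueError("pad must be 'SAME' or 'VALID'")

# (name, kernel, stride, padding, depth): assumed values, not read from PARAMETERS.py.
layers = [
    ('conv1', 11, 4, 'SAME', 96),
    ('pool1', 3, 2, 'VALID', 96),
    ('conv2', 5, 1, 'SAME', 256),
    ('pool2', 3, 2, 'VALID', 256),
    ('conv3', 3, 1, 'SAME', 384),
    ('conv4', 3, 1, 'SAME', 384),
    ('conv5', 3, 1, 'SAME', 256),
    ('pool5', 3, 2, 'VALID', 256),
]

side = 224
sizes = {}
for name, kernel, stride, pad, depth in layers:
    side = output_side(side, kernel, stride, pad)
    sizes[name] = (side, side * side * depth)
    print("{}: {}x{}x{} = {}".format(name, side, side, depth, side * side * depth))
```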


#### Session parameters

In [9]:
version_ID = 'v2.0.0.0'

In [10]:
%run -i 'GRAPH.py'

In [11]:
%run -i 'SESSION.py'

Initialized!


To view your tensorboard dashboard summary, run the following on the command line:
tensorboard --logdir='/Users/ccthomps/Documents/Python Files/Kaggle Competitions/Nature Conservancy Fisheries/TB_logs/v2.0.0.0'

Batch number: 1
     Training_mean_cross_entropy: 2.04618763923645
     Valid_mean_cross_entropy: 1.7731006145477295
[[  7.11068392e-01  -3.98975164e-02   5.84711209e-02  -6.16743118e-02
   -1.84562922e-01  -9.92975235e-02  -8.37745816e-02   5.19011635e-04]
 [  7.13502586e-01  -4.00349945e-02   5.78910969e-02  -6.08483031e-02
   -1.84259400e-01  -1.00430951e-01  -8.43645558e-02   2.80632125e-03]]
Batch number: 5
     Training_mean_cross_entropy: 1.412601351737976
     Valid_mean_cross_entropy: 1.766634464263916
[[ 1.52372253 -0.05306045  0.19850506 -0.32566044 -0.37626979 -0.47847724
  -0.39142087  0.14303966]
 [ 1.52938974 -0.05440523  0.19821955 -0.3273499  -0.3767544  -0.48167905
  -0.39503968  0.14581628]]
Batch number: 9
     Training_mean_cross_entropy: 1.6

#### Notes during run 
I've fetched the logits for two consecutive validation images at each validation measurement.  I don't know which labels they describe.  However, the classes rank in the same order of probability on every single occurrence, and that order follows the number of images with each label in the unbalanced dataset.  Thus it appears the model is currently learning the frequency of each label instead of image characteristics.  -Batch #871
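
A quick way to check this kind of observation is to turn the logged logits into probabilities and look at the class ranking. The sketch below uses the two logit rows printed at batch 5. The `softmax` helper is a plain NumPy version written for illustration, not the op used in `GRAPH.py`.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    shifted = logits - np.max(logits, axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=-1, keepdims=True)

# The two validation logit rows printed at batch 5.
batch5_logits = np.array([
    [1.52372253, -0.05306045, 0.19850506, -0.32566044,
     -0.37626979, -0.47847724, -0.39142087, 0.14303966],
    [1.52938974, -0.05440523, 0.19821955, -0.3273499,
     -0.3767544, -0.48167905, -0.39503968, 0.14581628],
])

probs = softmax(batch5_logits)
# Class indices from most to least probable, one row per image.
ranking = np.argsort(-probs, axis=1)
print(ranking)
```

Both rows give the same ranking. Assuming the label columns are in the same order as the logits, it can be compared against the frequency ranking from the labels, e.g. `np.argsort(-fish_label_arr.sum(axis=0))`.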

Also, 1.641 is the benchmark loss obtained by using the fish frequencies as logits.  The validation loss is currently oscillating around 1.59, which is consistent with the model having learned the frequency pattern.  -Batch #1131
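
A related baseline is the cross entropy of a model that ignores the image and always predicts the class frequencies as probabilities. The sketch below computes it from a one-hot label array. It uses toy labels, and because it treats the frequencies as probabilities rather than as logits, it is a different calculation from the 1.641 figure above and will not necessarily reproduce it. The function name `prior_cross_entropy` is hypothetical.

```python
import numpy as np

def prior_cross_entropy(labels):
    """Mean cross entropy of always predicting the empirical class frequencies."""
    freqs = labels.mean(axis=0)
    # Guard against log(0) for classes that never appear.
    freqs = np.clip(freqs, 1e-12, 1.0)
    return float(-np.mean(np.sum(labels * np.log(freqs), axis=1)))

# Toy one-hot labels with class frequencies 0.5, 0.25, 0.125, 0.125.
toy_labels = np.eye(4)[[0, 0, 0, 0, 1, 1, 2, 3]]
baseline = prior_cross_entropy(toy_labels)
uniform = float(np.log(toy_labels.shape[1]))  # loss of a uniform guess

print("Frequency baseline: {:.4f}".format(baseline))
print("Uniform baseline:   {:.4f}".format(uniform))
```

With the real data, `prior_cross_entropy(y_train)` would give the loss to beat. For eight classes a uniform guess scores ln 8, about 2.08, which is close to the first-batch training loss logged above.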

In [14]:
print(test_df)

             image       ALB       BET       DOL       LAG       NoF  \
0    img_00005.jpg  0.235612  0.111002  0.122401  0.108757  0.096044   
1    img_00007.jpg  0.235103  0.111067  0.122490  0.108643  0.096109   
2    img_00009.jpg  0.234921  0.111079  0.122453  0.108771  0.096135   
3    img_00018.jpg  0.235209  0.111033  0.122501  0.108716  0.096000   
4    img_00027.jpg  0.235097  0.110977  0.122257  0.108925  0.096194   
5    img_00030.jpg  0.235050  0.111066  0.122464  0.108785  0.096054   
6    img_00040.jpg  0.235306  0.110932  0.122251  0.108913  0.096125   
7    img_00046.jpg  0.235177  0.111069  0.122382  0.108975  0.095965   
8    img_00053.jpg  0.235194  0.111046  0.122451  0.108766  0.096097   
9    img_00071.jpg  0.235394  0.111004  0.122447  0.108642  0.095979   
10   img_00075.jpg  0.235185  0.111008  0.122548  0.108691  0.096025   
11   img_00102.jpg  0.235269  0.110964  0.122283  0.108837  0.096229   
12   img_00103.jpg  0.235290  0.111030  0.122401  0.108726  0.09