<h1> <a href="https://www.kaggle.com/c/the-nature-conservancy-fisheries-monitoring/">The Nature Conservancy Fisheries Monitoring</a> </h1>

<h2> NemoX </h2>


<i> Raphaël Meudec, Guillaume Richard, Antoine SIMOULIN</i>

<h2> Introduction </h2>

<p style="text-align: justify"> Nearly half of the world depends on seafood as its main source of protein. In the Western and Central Pacific, where 60% of the world’s tuna is caught, illegal, unreported, and unregulated fishing practices are threatening marine ecosystems, global seafood supplies and local livelihoods. The Nature Conservancy is working with local, regional and global partners to preserve this fishery for the future. </p>

<p style="text-align: justify"> Currently, the Conservancy is looking to the future by using cameras to dramatically scale the monitoring of fishing activities to fill critical science and compliance monitoring data gaps. Although these electronic monitoring systems work well and are ready for wider deployment, the amount of raw data produced is cumbersome and expensive to process manually. </p>

<p style="text-align: justify"> The Conservancy is inviting the Kaggle community to develop algorithms to automatically detect and classify species of tunas, sharks and more that fishing boats catch, which will accelerate the video review process. Faster review and more reliable data will enable countries to reallocate human capital to management and enforcement activities which will have a positive impact on conservation and our planet. </p>

<p style="text-align: justify"> Machine learning has the ability to transform what we know about our oceans and how we manage them. You can be part of the solution. </p>



<h2> Requirements </h2>

* numpy=1.11.3
* matplotlib=1.5.3
* scikit-learn=0.18.1
* pandas=0.18.1
* seaborn=0.7.1
* opencv=3.1.0
* keras=1.2.1
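
The pinned versions above can be compared with the local environment before running the notebook. The snippet below is only a sketch: `parse_pin` and `check_pins` are hypothetical helper names, and it assumes a Python recent enough to ship `importlib.metadata` (3.8+), which is newer than the Python 3.5 used for this notebook. OpenCV is left out of the list because its distribution name may differ between package managers.

```python
from importlib import metadata

# Pins copied from the list above (name=version), OpenCV omitted on purpose.
PINS = [
    "numpy=1.11.3",
    "matplotlib=1.5.3",
    "scikit-learn=0.18.1",
    "pandas=0.18.1",
    "seaborn=0.7.1",
]


def parse_pin(pin):
    """Split 'name=version' into a (name, version) tuple."""
    name, _, version = pin.partition("=")
    return name.strip(), version.strip()


def check_pins(pins):
    """Map each package name to (wanted_version, installed_version_or_None)."""
    report = {}
    for pin in pins:
        name, wanted = parse_pin(pin)
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None  # package is not installed at all
        report[name] = (wanted, installed)
    return report


if __name__ == "__main__":
    for name, (wanted, installed) in check_pins(PINS).items():
        status = "ok" if wanted == installed else "differs"
        print("{:<14} wanted {:<8} installed {} ({})".format(name, wanted, installed, status))
```

A mismatch does not necessarily break the notebook, but several imports used here (for example `sklearn.cross_validation`) only exist in library versions close to the pinned ones.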

In [17]:
import numpy as np
np.random.seed(1989)
import matplotlib as mlt
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import datetime
import glob
from scipy.misc import imread, imresize  # the only two scipy.misc helpers used below
import skimage.measure as sm
# import progressbar
import multiprocessing
import random
from subprocess import check_output
import os
import time
import cv2
import warnings
warnings.filterwarnings("ignore")
%matplotlib inline
new_style = {'grid': False}
plt.rc('axes', **new_style)

# sklearn
import sklearn as skt
from sklearn import cluster
from sklearn.cross_validation import KFold
from sklearn.metrics import log_loss

# keras
from keras.models import Sequential
from keras.layers.core import Dense, Dropout, Flatten
from keras.layers.convolutional import Convolution2D, MaxPooling2D, ZeroPadding2D, AveragePooling2D
from keras.optimizers import SGD, Adagrad
from keras.callbacks import EarlyStopping
from keras.utils import np_utils
from keras.constraints import maxnorm
from keras import __version__ as keras_version

In [18]:
os.sys.path

['',
 '/Users/Antoine/anaconda3/lib/python35.zip',
 '/Users/Antoine/anaconda3/lib/python3.5',
 '/Users/Antoine/anaconda3/lib/python3.5/plat-darwin',
 '/Users/Antoine/anaconda3/lib/python3.5/lib-dynload',
 '/Users/Antoine/.local/lib/python3.5/site-packages',
 '/Users/Antoine/.local/lib/python3.5/site-packages/xgboost-0.6-py3.5.egg',
 '/Users/Antoine/anaconda3/lib/python3.5/site-packages',
 '/Users/Antoine/anaconda3/lib/python3.5/site-packages/Sphinx-1.4.6-py3.5.egg',
 '/Users/Antoine/anaconda3/lib/python3.5/site-packages/aeosa',
 '/Users/Antoine/anaconda3/lib/python3.5/site-packages/setuptools-27.2.0-py3.5.egg',
 '/Users/Antoine/anaconda3/lib/python3.5/site-packages/IPython/extensions',
 '/Users/Antoine/.ipython']

<h2> Loading data </h2>

In [1]:
# Input data files are available in the "../input/" directory.

# A list (not a set) keeps the class order fixed from run to run
cat = ['ALB', 'BET', 'DOL', 'LAG', 'NoF', 'OTHER', 'SHARK', 'YFT']
# Function to show 4 images
def show_four(imgs, title):
    #select_imgs = [np.random.choice(imgs) for _ in range(4)]
    select_imgs = [imgs[np.random.choice(len(imgs))] for _ in range(4)]
    _, ax = plt.subplots(1, 4, sharex='col', sharey='row', figsize=(20, 3))
    plt.suptitle(title, size=20)
    for i, img in enumerate(select_imgs):
        ax[i].imshow(img)

# Function to show 8 images
def show_eight(imgs, title):
    select_imgs = [imgs[np.random.choice(len(imgs))] for _ in range(8)]
    _, ax = plt.subplots(2, 4, sharex='col', sharey='row', figsize=(20, 6))
    plt.suptitle(title, size=20)
    for i, img in enumerate(select_imgs):
        ax[i // 4, i % 4].imshow(img)

In [29]:
if False:
    nb=30

    # Data loading
    train_files={}
    train={}
    for c in cat:
        train_files[c] = sorted(glob.glob('./train/'+c+'/*.jpg'), key=lambda x: random.random())[:nb]
        train[c] = np.array([imread(img) for img in train_files[c]])
        print('Length of train %s: %i'%(c,len(train[c])))
        show_four(train[c],c)
        plt.show()

In [22]:
# Read an image and resize it to 64x64 (uses scipy.misc, despite the cv2 in the name)
def get_im_cv2(path):
    img = imread(path)
    resized = imresize(img, (64, 64))
    return resized

#Load train data
def load_train():
    X_train = []
    X_train_id = []
    y_train = []
    start_time = time.time()

    print('Read train images')
    folders = ['ALB', 'BET', 'DOL', 'LAG', 'NoF', 'OTHER', 'SHARK', 'YFT']
    for fld in folders:
        index = folders.index(fld)
        print('Load folder {} (Index: {})'.format(fld, index))
        path = os.path.join('./', 'train', fld, '*.jpg')
        files = glob.glob(path)
        for fl in files:
            flbase = os.path.basename(fl)
            img = get_im_cv2(fl)
            X_train.append(img)
            X_train_id.append(flbase)
            y_train.append(index)

    print('Read train data time: {} seconds'.format(round(time.time() - start_time, 2)))
    return X_train, y_train, X_train_id


#Load test data
def load_test():
    path = os.path.join('./', 'test_stg1', '*.jpg')
    files = sorted(glob.glob(path))

    X_test = []
    X_test_id = []
    for fl in files:
        flbase = os.path.basename(fl)
        img = get_im_cv2(fl)
        X_test.append(img)
        X_test_id.append(flbase)

    return X_test, X_test_id

In [24]:
#Creates the submission file
def create_submission(predictions, test_id, info):
    result1 = pd.DataFrame(predictions, columns=['ALB', 'BET', 'DOL', 'LAG', 'NoF', 'OTHER', 'SHARK', 'YFT'])
    result1.loc[:, 'image'] = pd.Series(test_id, index=result1.index)
    now = datetime.datetime.now()
    sub_file = 'submission_' + info + '_' + str(now.strftime("%Y-%m-%d-%H-%M")) + '.csv'
    result1.to_csv(sub_file, index=False)

In [25]:
def read_and_normalize_train_data():
    train_data, train_target, train_id = load_train()

    print('Convert to numpy...')
    train_data = np.array(train_data, dtype=np.uint8)
    train_target = np.array(train_target, dtype=np.uint8)

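    print('Reshape...')
    print(train_data)  # debug output: dumps the raw pixel array
    # Move channels first: (N, H, W, C) -> (N, C, H, W), matching dim_ordering='th'
    train_data = train_data.transpose((0, 3, 1, 2))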

    print('Convert to float...')
    train_data = train_data.astype('float32')
    train_data = train_data / 255
    train_target = np_utils.to_categorical(train_target, 8)

    print('Train shape:', train_data.shape)
    print(train_data.shape[0], 'train samples')
    return train_data, train_target, train_id

def read_and_normalize_test_data():
    start_time = time.time()
    test_data, test_id = load_test()

    test_data = np.array(test_data, dtype=np.uint8)
    test_data = test_data.transpose((0, 3, 1, 2))

    test_data = test_data.astype('float32')
    test_data = test_data / 255

    print('Test shape:', test_data.shape)
    print(test_data.shape[0], 'test samples')
    print('Read and process test data time: {} seconds'.format(round(time.time() - start_time, 2)))
    return test_data, test_id
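
The two functions above perform the same preprocessing: stack the images, move the colour channel to the front, and scale pixel values to [0, 1]; the training labels are additionally one-hot encoded. The sketch below reproduces those steps on random data using NumPy only, so the resulting shapes can be inspected without loading the dataset. `to_channels_first` and `one_hot` are hypothetical names used for illustration, and `one_hot` stands in for `np_utils.to_categorical`.

```python
import numpy as np


def to_channels_first(images):
    """(N, H, W, C) uint8 images -> (N, C, H, W) float32 scaled to [0, 1]."""
    batch = np.asarray(images, dtype=np.uint8)
    return batch.transpose((0, 3, 1, 2)).astype("float32") / 255


def one_hot(labels, num_classes):
    """Integer class indices -> (N, num_classes) one-hot float32 matrix."""
    labels = np.asarray(labels, dtype=np.int64)
    encoded = np.zeros((labels.size, num_classes), dtype="float32")
    encoded[np.arange(labels.size), labels] = 1.0
    return encoded


# Random stand-ins for five resized 64x64 RGB images and their class indices.
fake_images = np.random.randint(0, 256, size=(5, 64, 64, 3), dtype=np.uint8)
fake_labels = [0, 7, 3, 3, 1]

X = to_channels_first(fake_images)
Y = one_hot(fake_labels, 8)
print(X.shape, X.dtype)
print(Y.shape)
```

The channels-first layout is what the model expects, since every layer in `create_model` is declared with `dim_ordering='th'` and `input_shape=(3, 64, 64)`.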

In [26]:
def dict_to_list(d):
    ret = []
    for i in d.items():
        ret.append(i[1])
    return ret


def merge_several_folds_mean(data, nfolds):
    a = np.array(data[0])
    for i in range(1, nfolds):
        a += np.array(data[i])
    a /= nfolds
    return a.tolist()


def create_model():
    model = Sequential()
    model.add(ZeroPadding2D((1, 1), input_shape=(3, 64, 64), dim_ordering='th'))
    model.add(Convolution2D(8, 3, 3, activation='relu', dim_ordering='th', init='he_uniform'))
    model.add(Dropout(0.2))
    model.add(MaxPooling2D(pool_size=(2, 2), strides=(2, 2), dim_ordering='th'))
    model.add(ZeroPadding2D((1, 1), dim_ordering='th'))
    model.add(Convolution2D(16, 3, 3, activation='relu', dim_ordering='th', init='he_uniform'))
    model.add(MaxPooling2D(pool_size=(2, 2), strides=(2, 2), dim_ordering='th'))
    model.add(Dropout(0.2))
    
    model.add(Flatten())
    model.add(Dense(96, activation='relu',init='he_uniform'))
    model.add(Dropout(0.4))
    model.add(Dense(24, activation='relu',init='he_uniform'))
    model.add(Dropout(0.2))
    model.add(Dense(8, activation='softmax'))

    sgd = SGD(lr=1e-2, decay=1e-4, momentum=0.88, nesterov=False)
    model.compile(optimizer=sgd, loss='categorical_crossentropy')

    return model


def get_validation_predictions(train_data, predictions_valid):
    pv = []
    for i in range(len(train_data)):
        pv.append(predictions_valid[i])
    return pv
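
`merge_several_folds_mean` averages the per-fold prediction matrices element by element. A minimal NumPy sketch of the same idea follows, with two made-up folds over two samples and two classes; `mean_over_folds` is a hypothetical name, not part of the notebook.

```python
import numpy as np


def mean_over_folds(fold_predictions):
    """Average a list of (n_samples, n_classes) prediction matrices."""
    stacked = np.stack([np.asarray(p, dtype="float64") for p in fold_predictions])
    return stacked.mean(axis=0)


# Made-up class probabilities from two folds, for illustration only.
fold_a = [[0.8, 0.2], [0.1, 0.9]]
fold_b = [[0.6, 0.4], [0.3, 0.7]]

averaged = mean_over_folds([fold_a, fold_b])
print(averaged)
```

Because each row of every fold sums to 1, the rows of the average also sum to 1, so the result can be written to the submission file without renormalising.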

In [30]:
def run_cross_validation_create_models(nfolds=10):
    # Training hyperparameters
    batch_size = 32
    nb_epoch = 8
    random_state = 51
    first_rl = 96  # size of the first dense layer; only used to label the submission file

    train_data, train_target, train_id = read_and_normalize_train_data()

    yfull_train = dict()
    kf = KFold(len(train_id), n_folds=nfolds, shuffle=True, random_state=random_state)
    num_fold = 0
    sum_score = 0
    models = []
    for train_index, test_index in kf:
        model = create_model()
        X_train = train_data[train_index]
        Y_train = train_target[train_index]
        X_valid = train_data[test_index]
        Y_valid = train_target[test_index]

        num_fold += 1
        print('Start KFold number {} from {}'.format(num_fold, nfolds))
        print('Split train: ', len(X_train), len(Y_train))
        print('Split valid: ', len(X_valid), len(Y_valid))

        callbacks = [
            EarlyStopping(monitor='val_loss', patience=3, verbose=0),
        ]
        model.fit(X_train, Y_train, batch_size=batch_size, nb_epoch=nb_epoch,
              shuffle=True, verbose=2, validation_data=(X_valid, Y_valid),
              callbacks=callbacks)

        predictions_valid = model.predict(X_valid.astype('float32'), batch_size=batch_size, verbose=2)
        score = log_loss(Y_valid, predictions_valid)
        print('Score log_loss: ', score)
        sum_score += score*len(test_index)

        # Store valid predictions
        for i in range(len(test_index)):
            yfull_train[test_index[i]] = predictions_valid[i]

        models.append(model)

    score = sum_score/len(train_data)
    print("Log_loss train independent avg: ", score)

    info_string = '_' + str(np.round(score,3)) + '_flds_' + str(nfolds) + '_eps_' + str(nb_epoch) + '_fl_' + str(first_rl)
    return info_string, models


def run_cross_validation_process_test(info_string, models):
    batch_size = 24
    num_fold = 0
    yfull_test = []
    nfolds = len(models)

    # Load the test set once; it is identical for every fold
    test_data, test_id = read_and_normalize_test_data()

    for i in range(nfolds):
        model = models[i]
        num_fold += 1
        print('Start KFold number {} from {}'.format(num_fold, nfolds))
        test_prediction = model.predict(test_data, batch_size=batch_size, verbose=2)
        yfull_test.append(test_prediction)

    test_res = merge_several_folds_mean(yfull_test, nfolds)
    info_string = 'loss_' + info_string \
                + '_folds_' + str(nfolds)
    create_submission(test_res, test_id, info_string)


if __name__ == '__main__':
    print('Keras version: {}'.format(keras_version))
    num_folds = 3
    info_string, models = run_cross_validation_create_models(num_folds)
    run_cross_validation_process_test(info_string, models)


Keras version: 1.2.1
Read train images
Load folder ALB (Index: 0)
Load folder BET (Index: 1)
Load folder DOL (Index: 2)
Load folder LAG (Index: 3)
Load folder NoF (Index: 4)
Load folder OTHER (Index: 5)
Load folder SHARK (Index: 6)
Load folder YFT (Index: 7)
Read train data time: 196.64 seconds
Convert to numpy...
Reshape...
[[[[205 205 204]
   [204 204 203]
   [208 209 208]
   ..., 
   [156 132 120]
   [129 113 108]
   [104  97  94]]

  [[156 161 158]
   [140 145 148]
   [159 162 164]
   ..., 
   [195 165 145]
   [187 160 144]
   [144 130 124]]

  [[ 99  80  83]
   [ 81  83  93]
   [101 106 110]
   ..., 
   [197 167 146]
   [208 179 157]
   [206 182 164]]

  ..., 
  [[ 85  96 102]
   [ 91 102 115]
   [119 133 147]
   ..., 
   [ 60  74  89]
   [ 60  88 113]
   [ 86 123 152]]

  [[ 77  88  96]
   [ 86  98 112]
   [119 134 148]
   ..., 
   [ 60  71  82]
   [ 69  87 105]
   [ 51  79 120]]

  [[ 64  74  86]
   [ 82  95 111]
   [116 132 146]
   ..., 
   [ 47  62  74]
   [ 65  78  91]
   [ 5
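
The cross-validation cell above scores each fold with the multiclass log loss and then averages the fold scores weighted by validation-fold size (`sum_score += score*len(test_index)`, divided by the number of training samples). The sketch below illustrates that bookkeeping with NumPy only. It is not the notebook's pipeline: the CNN is replaced by a constant predictor that outputs a uniform distribution over the 8 classes, and `kfold_indices`, `multiclass_log_loss` and `weighted_cv_score` are hypothetical helpers. In later scikit-learn releases the splitter lives in `sklearn.model_selection.KFold` rather than `sklearn.cross_validation`.

```python
import numpy as np


def kfold_indices(n_samples, n_folds, seed):
    """Yield (train_idx, valid_idx) pairs from a shuffled K-fold split."""
    order = np.random.RandomState(seed).permutation(n_samples)
    folds = np.array_split(order, n_folds)
    for k in range(n_folds):
        train = np.concatenate([folds[j] for j in range(n_folds) if j != k])
        yield train, folds[k]


def multiclass_log_loss(y_true, y_prob, eps=1e-15):
    """Mean negative log-likelihood of one-hot targets under y_prob."""
    y_prob = np.clip(y_prob, eps, 1 - eps)
    y_prob = y_prob / y_prob.sum(axis=1, keepdims=True)
    return float(-np.mean(np.sum(y_true * np.log(y_prob), axis=1)))


def weighted_cv_score(fold_scores, fold_sizes):
    """Average fold scores, weighting each by its validation-fold size."""
    scores = np.asarray(fold_scores, dtype="float64")
    sizes = np.asarray(fold_sizes, dtype="float64")
    return float(np.sum(scores * sizes) / np.sum(sizes))


n_samples, n_classes = 10, 8
labels = np.arange(n_samples) % n_classes      # made-up class indices
targets = np.eye(n_classes)[labels]            # one-hot targets

scores, sizes = [], []
for train_idx, valid_idx in kfold_indices(n_samples, n_folds=3, seed=51):
    # Stand-in for model.predict: uniform probabilities for every sample.
    preds = np.full((len(valid_idx), n_classes), 1.0 / n_classes)
    scores.append(multiclass_log_loss(targets[valid_idx], preds))
    sizes.append(len(valid_idx))

cv_score = weighted_cv_score(scores, sizes)
print("Weighted CV log loss:", cv_score)
```

A uniform guess over 8 classes gives a log loss of ln(8), roughly 2.08, which is a useful baseline: a trained model should score clearly below it.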

In [9]:
cv2.xfeatures2d.SIFT_create()

<xfeatures2d_SIFT 0x102122510>