

# Statoil/C-CORE Iceberg Classifier

![cover](https://drive.google.com/uc?export=view&id=1QdSEwYcw0NCXiUflW45ehWiULtD38Tpy)

## Motivation
Drifting icebergs present threats to navigation and activities in areas such as offshore of the East Coast of Canada.

Currently, many institutions and companies use aerial reconnaissance and shore-based support to monitor environmental conditions and assess risks from icebergs. However, in remote areas with particularly harsh weather, these methods are not feasible, and the only viable monitoring option is via satellite.

## Goal
Statoil, an international energy company operating worldwide, has worked closely with companies like C-CORE. C-CORE have been using satellite data for over 30 years and have built a computer vision based surveillance system. To keep operations safe and efficient, Statoil is interested in getting a fresh new perspective on how to use machine learning to more accurately detect and discriminate against threatening icebergs as early as possible.

In this project we are going to build an algorithm that automatically identifies if a remotely sensed target is a ship or iceberg. Improvements made will help drive the costs down for maintaining safe working conditions.

# Data Explortion

Dataset found in: `train.json`, `test.json`  
The data (`train.json`, `test.json`) is presented in json format.  
The training data has 1603 data points, where the testing 8424  data has data points.

## Data fields


The files consist of a list of images, and for each image, you can find the following fields:

- **id** - the id of the image  
- **band_1, band_2** - the flattened image data. Each band has 75x75 pixel values in the list, so the list has 5625 elements. Note that these values are not the normal non-negative integers in image files since they have physical meanings - these are float numbers with unit being dB. Band 1 and Band 2 are signals characterized by radar backscatter produced from different polarizations at a particular incidence angle. The polarizations correspond to HH (transmit/receive horizontally) and HV (transmit horizontally and receive vertically). More background on the satellite imagery can be found here.  
- **inc_angle** - the incidence angle of which the image was taken. Note that this field has missing data marked as "na", and those images with "na" incidence angles are all in the training data to prevent leakage.  
- **is_iceberg** - the target variable, set to 1 if it is an iceberg, and 0 if it is a ship. This field only exists in train.json.


# Getting Started

In [None]:
import numpy as np
import pandas as pd
import cv2                
import matplotlib.pyplot as plt
import itertools

from sklearn.model_selection  import train_test_split
from sklearn.model_selection import StratifiedKFold, StratifiedShuffleSplit
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss, accuracy_score, roc_curve, auc
from sklearn.metrics import classification_report, confusion_matrix

from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, GlobalAveragePooling2D, GlobalMaxPooling2D
from keras.layers import Dropout, Flatten, Dense
from keras.callbacks import ModelCheckpoint
from keras.optimizers import Adam, SGD

%matplotlib inline


In [None]:
# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list the files in the input directory

import os
print(os.listdir("../input"))
# Any results you write to the current directory are saved as output.

## Loading Data

In [None]:
# classes
classes = ['Ship', 'Iceberg']
classes_dict = {0:'Ship', 1:'Iceberg'}

In [None]:
train = pd.read_json("../input/statoil-iceberg-classifier-challenge/train.json")
print("Training data shape:", train.shape)
train.head()

In [None]:
test = pd.read_json("../input/statoil-iceberg-classifier-challenge/test.json")
print("Testing data shape:", test.shape)
test.head()

## Data Distribution (training data)

In [None]:
dt = pd.value_counts(train['is_iceberg'], ascending=True)
print(dt)
# convert class id to class name
dt.index = dt.index.map(classes_dict)
dt.plot.bar(title="Number of instances per Category")


## Viewing Sample Data

In [None]:
# get random samples
samlples_num = 3
iceberg_samples = train[train.is_iceberg==1].sample(n=samlples_num)
ships_samples = train[train.is_iceberg==0].sample(n=samlples_num)

In [None]:
def plot_bands(imgs):
    # Plot band_1
    fig = plt.figure(1,figsize=(15,15))
    for i in range(samlples_num):
        ax = fig.add_subplot(1,samlples_num,i+1)
        img_band1 = np.reshape(np.array(imgs.iloc[i,0]),(75,75))
        categ = 'iceberg' if imgs['is_iceberg'].iloc[0]==1 else 'ship'
        plt.title('band_1: ' + categ)
        ax.imshow(img_band1)

    # Plot band_2
    plt.show()
    fig = plt.figure(1,figsize=(15,15))
    for i in range(samlples_num):
        ax2 = fig.add_subplot(2,samlples_num,i+1)
        img_band2 = np.reshape(np.array(imgs.iloc[i,1]),(75,75))
        plt.title('band_2: ' + categ)
        ax2.imshow(img_band2)

    plt.show()


In [None]:
plot_bands(iceberg_samples)


In [None]:
plot_bands(ships_samples)

As we can see, in some examples it's hard to tell if it's a ship or an iceberg.

# Data Preprocessing

## Handling missing values

In [None]:
missing_values = (train['inc_angle'] == 'na').sum()
percentage = missing_values*100/len(train)
print("Number of missing values in 'inc_angle':", missing_values)
print("Percentage: {:.2}%:".format(percentage))

As there were 133 missing data in `inc_angle` out of 1604 entris (that is 8.3% of data) where the other columns had no missing values, we will exclude this field.

In [None]:
train.drop(['inc_angle'], axis=1, inplace=True)
# view train data
train.head()

## Combining Data: Concatenating bands

In [None]:
# Training data
train_band_1 = np.array([np.array(band).astype(np.float32).reshape(75, 75) for band in train["band_1"]])
train_band_2 = np.array([np.array(band).astype(np.float32).reshape(75, 75) for band in train["band_2"]])

train_features = np.concatenate([train_band_1[:, :, :, np.newaxis],
                                 train_band_2[:, :, :, np.newaxis]], axis=-1)
train_target = np.array(train["is_iceberg"])

print("Features shape:", train_features.shape)
print("Target shape:", train_target.shape)

In [None]:
# Testing data
test_band_1 = np.array([np.array(band).astype(np.float32).reshape(75, 75) for band in test["band_1"]])
test_band_2 = np.array([np.array(band).astype(np.float32).reshape(75, 75) for band in test["band_2"]])

test_features = np.concatenate([test_band_1[:, :, :, np.newaxis],
                                test_band_2[:, :, :, np.newaxis]], axis=-1)
print("Features shape:", test_features.shape)

# Dimensionality Reduction With PCA

In [None]:
scaler = StandardScaler()
images_scaled = scaler.fit_transform([i.flatten() for i in train_features])

pca = PCA(n_components=50)
pca_result = pca.fit_transform(images_scaled)

In [None]:
reduced_X_train, reduced_X_test, reduced_y_train, reduced_y_test = train_test_split(pca_result,
                                                                train_target, 
                                                                test_size=0.25, 
                                                                random_state=42)


 # Using (RandomForest, K-NN, Logistic Regression)

## Using Random Forest Classifier

In [None]:
forest = RandomForestClassifier(n_estimators=50)
forest = forest.fit(reduced_X_train, reduced_y_train)

## Making Predctions
test_predictions = forest.predict(reduced_X_test)
precision = accuracy_score(test_predictions, reduced_y_test) * 100
print("Accuracy with Random Forest: {0:.4f}".format(precision))

## Using K-NN

In [None]:
knn = KNeighborsClassifier(n_neighbors=20)
knn = knn.fit(reduced_X_train, reduced_y_train)

## Making Predctions
test_predictions = knn.predict(reduced_X_test)
precision = accuracy_score(test_predictions, reduced_y_test) * 100
print("Accuracy with K-NN: {0:.4f}".format(precision))

## Using Logistic Regression

In [None]:
lr = LogisticRegression(random_state=20, solver='lbfgs')
lr = lr.fit(reduced_X_train, reduced_y_train)

## Making Predctions
test_predictions = lr.predict(reduced_X_test)
precision = accuracy_score(test_predictions, reduced_y_test) * 100
print("Accuracy with Logistic Regression: {0:.4f}".format(precision))

ِAs we can see, the previous methods don't result more than 79% accuracy.

## Split Data ( training data)

In [None]:
X_train, X_test, y_train, y_test = train_test_split(train_features,
                                                    train_target,
                                                    test_size=0.25,
                                                    random_state=7)

print("Total size of dataset:", len(train_features))
print("Size of training set:", len(X_train))
print("Size of testing set:", len(X_test))


# Benchmark Model

In [None]:
# define parameters
input_shape = X_train[0].shape

def get_basic_model(input_shape=(75, 75, 2)):
    # Model Archeticture
    basic_model = Sequential()
    # Input layer
    basic_model.add(Conv2D(32, 3, activation="relu", input_shape=input_shape))
    basic_model.add(Conv2D(64, 3, activation="relu"))
    basic_model.add(GlobalAveragePooling2D())
    basic_model.add(Dropout(0.3))
    # output layer
    basic_model.add(Dense(1, activation="sigmoid"))
    return basic_model

basic_model = get_basic_model()
# print model summary
print('X_train shape:', X_train.shape)
print(X_train.shape[0], 'training samples')
basic_model.summary()


## Compile model

In [None]:
# compiling model with binary_crossentropy loss function
basic_model.compile("adam", "binary_crossentropy", metrics=["accuracy"])

## Calculate the Classification Accuracy on the Test Set (Before Training)

In [None]:
# evaluate test accuracy
def print_accuracy(model, test_features=X_test, test_target=y_test):
    score = model.evaluate(test_features, test_target, verbose=0)
    accuracy = 100*score[1]
    # print test accuracy
    print('Test accuracy: %.4f%%' % accuracy)
    print('Test loss: {:0.4}'.format(score[0]))
    return accuracy


In [None]:
print_accuracy(basic_model)

## Model Training (model 1)

In [None]:
# train the model
def train_with_kfold(model, checkpoint_path,epochs=50, K=4,batch_size=None):
    history = None
    folds = list(StratifiedKFold(n_splits=K, shuffle=True, random_state=7).split(X_train, y_train))
    for i, (train_index, test_index) in enumerate(folds):
        print('\nFOLD:',i+1)
        # saving each fold's results (weights) as its own checkpoint 
        checkpointer = ModelCheckpoint(filepath= str(i+1)+ "_" + checkpoint_path,
                               verbose=1, save_best_only=True)
        # getting data folds
        X_train_fold, X_test_fold = X_train[train_index], X_train[test_index]
        y_train_fold, y_test_fold = y_train[train_index], y_train[test_index]
        h = model.fit(X_train_fold, y_train_fold,
                                   epochs=epochs,
                                   validation_data=(X_test_fold, y_test_fold),
                                   callbacks=[checkpointer],
                                   shuffle=True,
                                   batch_size=batch_size)
        # concatenating model histories over subsequent folds
        if history is None:
            history = h
        else:
            history.history['acc'].extend(h.history['acc'])
            history.history['val_acc'].extend(h.history['val_acc'])
            history.history['loss'].extend(h.history['loss'])
            history.history['val_loss'].extend(h.history['val_loss'])
    
    return history


In [None]:
# train model
history_1 = train_with_kfold(basic_model, 'basic.model.best.hdf5', epochs=150, K=2)

## Model Evaluation (model 1)

### Calculate the Classification Accuracy on the Test Set

In [None]:
best =-1
print("Fold 1")
basic_model.load_weights('1_basic.model.best.hdf5')
acc = print_accuracy(basic_model)
if acc > best:
    best =1
print("\nFold 2")
basic_model.load_weights('2_basic.model.best.hdf5')
acc = print_accuracy(basic_model)
if acc > best:
    best =2

In [None]:
# load best weights
basic_model.load_weights(str(best) + '_basic.model.best.hdf5')

### Confusion Matrix

In [None]:
def plot_confusion_matrix(cm, classes,
                          normalize=False,
                          title='Confusion matrix',
                          cmap=plt.cm.Blues):
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
        #print("Normalized confusion matrix")
    else:
        pass#print('Confusion matrix, without normalization')

    #print(cm)

    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)

    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j], fmt),
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")

    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')



In [None]:
# get predictions
y_pred = basic_model.predict(X_test)
func = lambda x: 1 if x >= 0.5 else 0
y_pred_classes = np.array(list(map(func, y_pred)))
# Compute confusion matrix
cnf_matrix = confusion_matrix(y_test, y_pred_classes)
np.set_printoptions(precision=2)

# Plot non-normalized confusion matrix
plt.figure()
plot_confusion_matrix(cnf_matrix, classes=classes,
                      title='Confusion matrix, without normalization')



### Classification Report

In [None]:
print(classification_report(y_test, y_pred_classes, target_names=classes))

### Model History

In [None]:
def plot_history(history):
    # list all data in history
    #print(history.history.keys())
    # summarize history for accuracy
    plt.plot(history.history['acc'])
    plt.plot(history.history['val_acc'])
    plt.title('model accuracy')
    plt.ylabel('accuracy')
    plt.xlabel('epoch')
    plt.legend(['train', 'test'], loc='upper left')
    plt.show()
    # summarize history for loss
    plt.plot(history.history['loss'])
    plt.plot(history.history['val_loss'])
    plt.title('model loss')
    plt.ylabel('loss')
    plt.xlabel('epoch')
    plt.legend(['train', 'test'], loc='upper left')
    plt.show()


In [None]:
plot_history(history_1)

# Refining Basic Model

In [None]:
refined_model = get_basic_model()
refined_model.compile("adam", "binary_crossentropy", metrics=["accuracy"])
# train model
history_1_V2 = train_with_kfold(refined_model, 'refined.model.best.hdf5', epochs=250, K=4)


## Model Evaluation

In [None]:
best = -1
print("Fold 1")
refined_model.load_weights('1_refined.model.best.hdf5')
acc = print_accuracy(refined_model)
if acc > best:
    best = 1
print("\nFold 2")
refined_model.load_weights('2_refined.model.best.hdf5')
acc = print_accuracy(refined_model)
if acc > best:
    best =2
print("\nFold 3")
refined_model.load_weights('3_refined.model.best.hdf5')
acc = print_accuracy(refined_model)
if acc > best:
    best =3
print("\nFold 4")
refined_model.load_weights('4_refined.model.best.hdf5')
acc = print_accuracy(refined_model)

if acc > best:
    best =4

In [None]:
# load best weights
refined_model.load_weights(str(best) + '_refined.model.best.hdf5')

In [None]:
# get predictions
y_pred = refined_model.predict(X_test)
func = lambda x: 1 if x >= 0.5 else 0
y_pred_classes = np.array(list(map(func, y_pred)))
# Compute confusion matrix
cnf_matrix = confusion_matrix(y_test, y_pred_classes)
np.set_printoptions(precision=2)

# Plot non-normalized confusion matrix
plt.figure()
plot_confusion_matrix(cnf_matrix, classes=classes,
                      title='Confusion matrix, without normalization')



In [None]:
print(classification_report(y_test, y_pred_classes, target_names=classes))

In [None]:
plot_history(history_1_V2)

# Improved Model

In [None]:
def get_improved_model(input_shape=(75,75,2)):
    # create the model and define the architecture.
    improved_model = Sequential()
    #
    improved_model.add(Conv2D(filters=64, kernel_size=(3,3), padding='same',
                             activation='relu', input_shape=input_shape))
    improved_model.add(MaxPooling2D(pool_size=2))
    
    improved_model.add(Conv2D(filters=128, kernel_size=2, padding='same', activation='relu'))
    improved_model.add(MaxPooling2D(pool_size=2))
    
    improved_model.add(Conv2D(filters=256, kernel_size=2, padding='same', activation='relu'))
    improved_model.add(MaxPooling2D(pool_size=2))
    
    improved_model.add(Conv2D(filters=512, kernel_size=2, padding='same', activation='relu'))
    improved_model.add(MaxPooling2D(pool_size=2))
    
    improved_model.add(Dropout(0.3))
    improved_model.add(Flatten())
    improved_model.add(Dropout(0.5))
    # output layer
    improved_model.add(Dense(1, activation='sigmoid'))
    return improved_model

improved_model = get_improved_model()

# print model summary
print('X_train shape:', X_train.shape)
print(X_train.shape[0], 'training samples')
improved_model.summary()

## Compile model

In [None]:
# compiling model with binary_crossentropy loss function
optimizer = Adam(lr=1e-4)
improved_model.compile(optimizer=optimizer, loss="binary_crossentropy", metrics=["accuracy"])

## Calculate the Classification Accuracy on the Test Set (Before Training)

In [None]:
# evaluate test accuracy
print_accuracy(improved_model)

## Model Training (model 2)

In [None]:
# train the model
history_2 = train_with_kfold(improved_model, 'improved.model.best.hdf5', epochs=50, K=3)

## Model Evaluation (model 2)

In [None]:
best = -1
print("Fold 1")
improved_model.load_weights('1_improved.model.best.hdf5')
acc = print_accuracy(improved_model)
if acc > best:
    best = 1
print("\nFold 2")
improved_model.load_weights('2_improved.model.best.hdf5')
acc = print_accuracy(improved_model)
if acc > best:
    best = 2
print("\nFold 3")
improved_model.load_weights('3_improved.model.best.hdf5')
acc = print_accuracy(improved_model)
if acc > best:
    best = 3

In [None]:
# load best weights
improved_model.load_weights( str(best) + '_improved.model.best.hdf5')

In [None]:
# get predictions
y_pred = improved_model.predict(X_test)
func = lambda x: 1 if x >= 0.5 else 0
y_pred_classes = np.array(list(map(func, y_pred)))
# Compute confusion matrix
cnf_matrix = confusion_matrix(y_test, y_pred_classes)
np.set_printoptions(precision=2)

# Plot non-normalized confusion matrix
plt.figure()
plot_confusion_matrix(cnf_matrix, classes=classes,
                      title='Confusion matrix, without normalization')



In [None]:
print(classification_report(y_test, y_pred_classes, target_names=classes))

In [None]:
#
plot_history(history_2)

## Best  aquired weights for this model

In [None]:
# load best acquired models
improved_model.load_weights('../input/statoil-model-weights/pretrained.best.hdf5')
# evaluate test accuracy
print_accuracy(improved_model)

In [None]:
# get predictions
y_pred = improved_model.predict(X_test)
func = lambda x: 1 if x >= 0.5 else 0
y_pred_classes = np.array(list(map(func, y_pred)))
# Compute confusion matrix
cnf_matrix = confusion_matrix(y_test, y_pred_classes)
np.set_printoptions(precision=2)

# Plot non-normalized confusion matrix
plt.figure()
plot_confusion_matrix(cnf_matrix, classes=classes,
                      title='Confusion matrix, without normalization')



In [None]:
print(classification_report(y_test, y_pred_classes, target_names=classes))

## Make A Submission File (model 2)

In [None]:
# get predictions of testing data
prediction = improved_model.predict(test_features, verbose=1)

submission= pd.DataFrame({'id': test["id"], 'is_iceberg': prediction.flatten()})
submission.to_csv("../working/improved_model_submission.csv", index=False)


# Transfer Learning + Data Augmentation

## Reshaping data: adding a third channel to images

In [None]:

train_band_3 =(train_band_1+train_band_2)/2
mod_train_features = np.concatenate([train_band_1[:, :, :, np.newaxis]
                          , train_band_2[:, :, :, np.newaxis]
                         , train_band_3[:, :, :, np.newaxis]], axis=-1)

print("Reshaped features:", mod_train_features.shape)

In [None]:
test_band_3 =(test_band_1+test_band_2)/2
mod_test_features = np.concatenate([test_band_1[:, :, :, np.newaxis],
                            test_band_2[:, :, :, np.newaxis],
                            test_band_3[:, :, :, np.newaxis]], 
                            axis=-1)

print("Reshaped features:", mod_test_features.shape)

## Split Data

In [None]:
mod_X_train, mod_X_test, mod_y_train, mod_y_test = train_test_split(mod_train_features,
                                                    train_target,
                                                    test_size=0.25,
                                                    random_state=7)

print("Total size of dataset:", len(mod_train_features))
print("Size of training set:", len(mod_X_train))
print("Size of testing set:", len(mod_X_test))


## Importing VGG16 model

In [None]:
from keras.applications.vgg16 import VGG16, preprocess_input, decode_predictions

# VGG16 model
VGG16_model = VGG16(weights='imagenet', include_top=False, input_shape=mod_train_features.shape[1:])
#
print("Number of Layers:", len(VGG16_model.layers))
VGG16_model.summary()


## Modified VGG16 model (model 3)

In [None]:
from keras.layers import concatenate
from keras.models import Model

def get_modified_VGG16():
    # Create new modified model from VGG16
    model = VGG16_model.get_layer('block5_pool').output
    model = GlobalMaxPooling2D()(model)
    model = Dropout(0.5)(model)
    predictions = Dense(1, activation='sigmoid')(model)
    model = Model(input=[VGG16_model.input], output=predictions)

    return model


modified_VGG16 = get_modified_VGG16()
print ("Model Layers: ", len(modified_VGG16.layers))
modified_VGG16.summary()

## Compile model

In [None]:
from keras.optimizers import Adam
learing_rate = 1e-4
#decay = 1e-6
adam_opt = Adam(lr=learing_rate)
# compiling model with binary_crossentropy loss function
modified_VGG16.compile(optimizer=adam_opt, loss="binary_crossentropy", metrics=["accuracy"])

## Calculate the Classification Accuracy on the Test Set (Before Training)

In [None]:
# evaluate test accuracy
print_accuracy(modified_VGG16, test_features=mod_X_test,test_target=mod_y_test)

## Data Augmentaion

In [None]:
from keras.preprocessing import image

# create data generator
datagen_train = image.ImageDataGenerator(
    rotation_range=10,
    width_shift_range=0.1,
    height_shift_range=0.1,
    horizontal_flip=True,
    vertical_flip=True)


datagen_train.fit(mod_X_train, augment=True)


## Model Training (model 3)

In [None]:
# train the model
K = 3
epochs = 40
history = None
folds = list(StratifiedKFold(n_splits=K, shuffle=True, random_state=7).split(mod_X_train, mod_y_train))
for i, (train_index, test_index) in enumerate(folds):
    print('\nFOLD:',i+1)
    checkpointer = ModelCheckpoint(filepath= str(i+1) + "_" + 'transln.model.best.hdf5',
                                   verbose=1,
                                   save_best_only=True)
    X_train_fold, X_test_fold = mod_X_train[train_index], mod_X_train[test_index]
    y_train_fold, y_test_fold = mod_y_train[train_index], mod_y_train[test_index]
    batch_size = 32
    train_generator = datagen_train.flow(
        X_train_fold,
        y_train_fold,
        batch_size=batch_size)
    
    h = modified_VGG16.fit_generator(train_generator,
                    steps_per_epoch=X_train_fold.shape[0] // batch_size,
                    epochs=epochs, verbose=1, callbacks=[checkpointer],
                    validation_data=(X_test_fold, y_test_fold),
                    shuffle=True)
    if history is None:
        history = h
    else:
        history.history['acc'].extend(h.history['acc'])
        history.history['val_acc'].extend(h.history['val_acc'])
        history.history['loss'].extend(h.history['loss'])
        history.history['val_loss'].extend(h.history['val_loss'])

#
history_3 = history

## Model Evaluation (model 3)

In [None]:
# evaluate test accuracy
best = -1
print("Fold 1")
modified_VGG16.load_weights('1_transln.model.best.hdf5')
acc = print_accuracy(modified_VGG16, test_features=mod_X_test,test_target=mod_y_test)
if acc > best:
    best = 1
print("\nFold 2")
modified_VGG16.load_weights('2_transln.model.best.hdf5')
acc = print_accuracy(modified_VGG16, test_features=mod_X_test,test_target=mod_y_test)
if acc > best
    best = 2
print("\nFold 3")
modified_VGG16.load_weights('3_transln.model.best.hdf5')
acc = print_accuracy(modified_VGG16, test_features=mod_X_test,test_target=mod_y_test)
if acc > best:
    best = 3

In [None]:
# load best weights
modified_VGG16.load_weights(str(best) + '_transln.model.best.hdf5')

In [None]:
# get predictions
y_pred = modified_VGG16.predict(mod_X_test)
func = lambda x: 1 if x >= 0.5 else 0
y_pred_classes = np.array(list(map(func, y_pred)))
# Compute confusion matrix
cnf_matrix = confusion_matrix(mod_y_test, y_pred_classes)
np.set_printoptions(precision=2)

# Plot non-normalized confusion matrix
plt.figure()
plot_confusion_matrix(cnf_matrix, classes=classes,
                      title='Confusion matrix, without normalization')



In [None]:
print(classification_report(mod_y_test, y_pred_classes, target_names=classes))

In [None]:
plot_history(history_3)

## Make A Submission (model 3)

In [None]:
# make predictions of testing data
prediction = modified_VGG16.predict(mod_test_features, verbose=1)

submission= pd.DataFrame({'id': test["id"], 'is_iceberg': prediction.flatten()})
submission.to_csv("../working/tl_submission.csv", index=False)


# Semi-Supervised approach: Pseudo Labelling

In [None]:
# set the best model
best_model = improved_model
print_accuracy(best_model)

In [None]:
portion_size =int(1.5 * len(X_train)) # (that's 21.4% of the testing set)
test_features_portion = test_features[:portion_size,:,:,:]

# get labels of test data portion
y_pred = best_model.predict(test_features_portion)
func = lambda x: 1 if x >= 0.5 else 0
test_target_portion = np.array(list(map(func, y_pred)))

# setting new data
new_features = np.concatenate((X_train,test_features_portion),axis=0)
new_target = np.concatenate((y_train,test_target_portion),axis=0)

print("% of test portion size: {:0.4}%".format(100*portion_size/len(test_features)))
print("Shape of Features", new_features.shape)
print("Shape of Target", new_target.shape)

In [None]:
# splitting data into training and testing sets
new_X_train, new_X_test, new_y_train, new_y_test = train_test_split(new_features,
                                                                    new_target,
                                                                    test_size=0.25,
                                                                    random_state=5)
print("Total size of dataset:", len(new_features))
print("Size of training set:", len(new_X_train))
print("Size of testing set:", len(new_X_test))

## Model Training (Model 4)

In [None]:
# recompile the model with a new learning rate
opt = Adam(lr=1e-4)
best_model.compile(optimizer=opt, loss="binary_crossentropy", metrics=["accuracy"])

In [None]:
K = 3
epochs = 50
# train the model
X_train = new_X_train
y_train = new_y_train

history_4 = train_with_kfold(best_model, 'pseudo.labeled.model.best.hdf5', epochs=epochs, K=K)


## Model Evaluation (Model 4)

In [None]:
# evaluate test accuracy
print("Fold 1")
best_model.load_weights('1_pseudo.labeled.model.best.hdf5')
print('For the new test set:')
print_accuracy(best_model, test_features=new_X_test,test_target=new_y_test)
print('\nFor the old test set:')
print_accuracy(best_model, test_features=X_test,test_target=y_test)
print("----------------------\n")
print("Fold 2")
best_model.load_weights('2_pseudo.labeled.model.best.hdf5')
print('For the new test set:')
print_accuracy(best_model, test_features=new_X_test,test_target=new_y_test)
print('\nFor the old test set:')
print_accuracy(best_model, test_features=X_test,test_target=y_test)
print("----------------------\n")
print("Fold 3")
best_model.load_weights('3_pseudo.labeled.model.best.hdf5')
print('For the new test set:')
print_accuracy(best_model, test_features=new_X_test,test_target=new_y_test)
print('\nFor the old test set:')
print_accuracy(best_model, test_features=X_test,test_target=y_test)
print("----------------------\n")

In [None]:
# load best weights
best_model.load_weights('2_pseudo.labeled.model.best.hdf5')

In [None]:
# get predictions
y_pred = best_model.predict(new_X_test)
func = lambda x: 1 if x >= 0.5 else 0
y_pred_classes_1 = np.array(list(map(func, y_pred)))
# Compute confusion matrix
cnf_matrix = confusion_matrix(new_y_test, y_pred_classes_1)
np.set_printoptions(precision=2)

# Plot non-normalized confusion matrix
plt.figure()
plot_confusion_matrix(cnf_matrix, classes=classes,
                      title='Confusion matrix (for new test set)')


In [None]:
# get predictions
y_pred = best_model.predict(X_test)
func = lambda x: 1 if x >= 0.5 else 0
y_pred_classes_2 = np.array(list(map(func, y_pred)))
# Compute confusion matrix
cnf_matrix = confusion_matrix(y_test, y_pred_classes_2)
np.set_printoptions(precision=2)

# Plot non-normalized confusion matrix
plt.figure()
plot_confusion_matrix(cnf_matrix, classes=classes,
                      title='Confusion matrix (for old test set)')


In [None]:
# classification report on new test set
print(classification_report(new_y_test, y_pred_classes_1, target_names=classes))

In [None]:
# classification report on old test set
print(classification_report(y_test, y_pred_classes_2, target_names=classes))

In [None]:
plot_history(history_4)

## Make A Sumbission File  (model 4)

In [None]:
# make predictions of testing data
prediction = best_model.predict(test_features, verbose=1)

submission= pd.DataFrame({'id': test["id"], 'is_iceberg': prediction.flatten()})
submission.to_csv("../working/pseduolabel_submission.csv", index=False)


With pseduo-labeling, and by retraining the best model on the increased training data we got a better model with 92.02% accuaracy and 0.32 loss.

## AUC of ROC 

In [None]:
y_pred = best_model.predict(X_test)
# Compute roc
fpr, tpr, thresholds = roc_curve(y_test, y_pred)
AUC = keras = auc(fpr, tpr)

In [None]:
# plot
plt.figure(1)
plt.plot([0, 1], [0, 1], 'k--')
plt.plot(fpr, tpr, label='Final model (area = {:.3f})'.format(AUC))
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.title('ROC curve')
plt.legend(loc='best')
plt.show()


# Conclusion

In the following table, we compare between the benchmark model and the final model:

|Model	| Accuracy	|Loss	|F1-score (ships)|	F1-score (icebergs)|
|-------|---------------|--------|------------------|------------------------|
|Benchmark Model	|83%	|0.35	|0.88 |	0.88|
|Final Model	|92.02%|0.32	|0.92	| 0.92|


When we compare the ROC curves of both models, we find the following:

In [None]:
y_pred2 = basic_model.predict(X_test)
# Compute roc
fpr2, tpr2, thresholds2 = roc_curve(y_test, y_pred2)
AUC2 = keras = auc(fpr2, tpr2)

# plot
plt.figure(1)
plt.plot([0, 1], [0, 1], 'k--')
plt.plot(fpr, tpr, label='Final model (area = {:.3f})'.format(AUC))
plt.plot(fpr2, tpr2, label='Benchmak (area = {:.3f})'.format(AUC2))
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.title('ROC curve: Final Model vs Benchmark model')
plt.legend(loc='best')
plt.show()

# Zoom in view of the upper left corner.
plt.figure(2)
plt.xlim(-0.05, 0.3)
plt.ylim(0.4, 1)
plt.plot([0, 1], [0, 1], 'k--')
plt.plot(fpr, tpr, label='Final model (area = {:.3f})'.format(AUC))
plt.plot(fpr2, tpr2, label='Benchmak (area = {:.3f})'.format(AUC2))
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.title('ROC curve (zoomed in at top left)')
plt.legend(loc='best')
plt.show()

Attribution: https://hackernoon.com/simple-guide-on-how-to-generate-roc-plot-for-keras-classifier-2ecc6c73115a

The final model did a good job in the classification problem with about 92% accuracy, so we want to explore how it performs with different sample images:

In [None]:
samlples_num = 8
# train (from 'train.json': the whole thing, containing both the training and testing splits used for training the model
iceberg_samples = train[train.is_iceberg==1].sample(n=samlples_num)
ships_samples = train[train.is_iceberg==0].sample(n=samlples_num)

In [None]:
from mpl_toolkits.axes_grid1 import ImageGrid

def prepocess_image(img):
    # preprocess images
    label = img.iloc[3]
    band_1 = np.array(img.iloc[0]).astype(np.float32).reshape(75, 75)
    band_2 = np.array(img.iloc[1]).astype(np.float32).reshape(75, 75)
    img = np.concatenate([band_1[:, :, np.newaxis],
                                 band_2[:, :, np.newaxis]], axis=-1)
    img = np.array([img])
    return img, label

def predict_icberg(img):
    predection = best_model.predict(img)
    prob = predection[0]
    predected_label = 1 if predection >= 0.5 else 0
    return predected_label, prob


def show_predictions(images):
    size = len(images)
    fig = plt.figure(1, figsize=(16, 16))
    grid = ImageGrid(fig, 111, nrows_ncols=(2, size//2), axes_pad=0.05)
    for i, (_,img) in enumerate(images.iterrows()):
        img, label = prepocess_image(img)
        predected_label, prob = predict_icberg(img)
        
        color = 'g' if label==predected_label else 'r' 
        ax = grid[i]
        ax.imshow(img[0,:,:,0])
        ax.text(5, 12, 'Predection: %s (%.2f)' % (predected_label, prob),
                color='w', backgroundcolor=color)
        ax.text(3, 5, 'True Label: %s' % label, color='w', backgroundcolor='k')
        ax.axis('off')
    plt.show()
        

In [None]:
# Show predictions on sample images of icebergs
print("Predictions on sample images of icebergs")
show_predictions(iceberg_samples)

In [None]:
# Show predictions on sample images of ships
print("Predictions on sample images of ships")
show_predictions(ships_samples)

# Summary

* By dimensionality reduction and using classifiers like Random Forest, K-NN and Logistic Regression we didn't get good results and the accuracy didn't exceed 79%.
* Using CNNs has led to better results where a simple CNN benchmark model outperformed the previous methods with ~83% accuracy and up to 90% accuracy after tuning, 0.26 loss and f1-scores of 0.90 and 091 for ship and iceberg classes, respectively.
* By increasing the complexity of the CNN model and introducing more layers we got a better model with ~91.8% accuracy, 0.34 loss, and higher f1-scores of 0.92 and 0.91 for ship and iceberg classes, respectively.
* By making use of transfer learning with VGG16 together with data augmentation, we got 90.5% accuracy, 0.247 loss, and f1-score of 0.91.
* With pseudo-labeling, and by retraining the best obtained model on the increased training data we got a better model with:
    * 95.48% accuracy and 0.1214 loss on the test data (after pseudo-labeling: 752 samples), and f1-scores of 0.95 and 0.96 for ship and iceberg classes, respectively. 
    * 92.02% accuracy and 0.32 loss on the test data (before pseudo-labeling: 401 samples), and f1-score of 0.92 for both ship and iceberg classes.
