# [Happywhale - Whale and Dolphin Identification](https://www.kaggle.com/c/happy-whale-and-dolphin)

## Notebook Contents
1. [Introduction](#introduction)
2. [Submission Format](#submission-format)
3. [Evaluation Metric Explained](#evaluation-metric-explained)
4. [Loading Dataset](#loading-dataset)
5. [Data Cleaning](#data-cleaning)
6. [Dataset Visualization](#visualization)<br/>
     6.1 [Visualize Train and Test Images](#visualization)<br/>
     6.2 [Visualize Class Distribution](#class-distribution-analysis)<br/>
     6.3 [Observations](#observation-regarding-class-distribution)<br/>
7. [Getting Image Resolutions](#image-resolutions)
8. [Color Analysis](#color-analysis)<br/>
    8.1 [Check Gray Scale Images](#color-analysis)<br/>
    8.2 [Visualize Mean Intensity for RGB Channels](#get-mean-intensity-for-each-channel-RGB)<br/>
    8.3 [Observations](#observation-regarding-color-distribution)<br/>
9. [Data Augmentation](#data-augmentation)
10. [Preprocessing Dataset](#preprocessing)

<br>

<a id="introduction"></a>
# Introduction
This training data contains thousands of images of whales and dolphins. Individual whales and dolphins have been identified by researchers and given an `Id`. The challenge is to predict the `Id` of images in the test set by unique—but often subtle—characteristics of their natural markings. The best submissions will suggest photo-`Id` solutions that are fast and accurate.

<br>


# Importing Libraries

In [None]:
!pip install pycaret

In [None]:
import os

import pandas as pd
import numpy as np
import tensorflow as tf
import matplotlib.pyplot as plt
import seaborn as sns
import pickle
import shutil

from keras import layers
from keras.models import Sequential
from keras.preprocessing import image
from keras.layers import Input, Dense, Activation, Dropout
from keras.layers import Flatten, BatchNormalization, Conv2D
from keras.layers import MaxPooling2D, AveragePooling2D
from keras.applications.imagenet_utils import preprocess_input

from PIL import Image
from tqdm import tqdm
import random as rnd
import cv2

!pip install livelossplot
from livelossplot import PlotLossesKeras

%matplotlib inline

In [None]:
import numpy as np
import os
import PIL
import PIL.Image
import tensorflow as tf
import tensorflow_datasets as tfds

<a id="submission-format"></a>
# Submission Format

### We need to predict up to 5 labels for each image
For each image in the test set, we can predict up to 5 `individual_id` labels. Some individuals in the test set do not appear in the training data; these should be predicted as `new_individual`. The file should contain a header and have the following format:

```
image,predictions 
000188a72f2562.jpg,37c7aba965a5 114207cab555 a6e325d8e924 19fbb960f07d new_individual 
000ba09273d6f3.jpg,37c7aba965a5 114207cab555 a6e325d8e924 19fbb960f07d new_individual 
...
```
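
As a rough sketch of the format above, the helper below joins a ranked list of ids into one `predictions` string. The function name `format_predictions` and the de-duplication step are assumptions made for illustration; they are not part of the competition tooling.

```python
def format_predictions(ranked_ids, max_labels=5):
    """Join ranked ids into a space-separated string of at most `max_labels` labels."""
    seen = []
    for label in ranked_ids:
        # Skip repeats so the available slots are not spent on the same id twice
        if label not in seen:
            seen.append(label)
    return " ".join(seen[:max_labels])


row = format_predictions(
    ["37c7aba965a5", "114207cab555", "37c7aba965a5", "new_individual"]
)
# One submission line: image name, a comma, then the space-separated labels
print(f"000188a72f2562.jpg,{row}")
```

Keeping the labels in ranked order matters, because the first label is treated as the most confident guess.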

<br>

<a id="loading-dataset"></a>
# Loading Dataset
We'll use [Pandas](https://pandas.pydata.org/pandas-docs/stable/) to load the dataset into memory.

In [None]:
train_df = pd.read_csv('../input/happy-whale-and-dolphin/train.csv')
train_df['path'] = '../input/happy-whale-and-dolphin/train_images/' + train_df['image']

pred_df = pd.read_csv('../input/happy-whale-and-dolphin/sample_submission.csv')
pred_df['path'] = '../input/happy-whale-and-dolphin/test_images/' + pred_df['image']

#### Two CSV files
* train.csv - contains the image name, species and individual_id
* sample_submission.csv - contains the image name and a dummy label for each image in the test folder

#### Two image folders
* train_images - 51033 images of different types of whales and dolphins. Their labels are provided in train.csv.
* test_images - 27956 images of different types of whales and dolphins. We need to predict their labels.

In [None]:
train_df.head(10)

In [None]:
print('Train samples count: ', len(train_df))
train_df.columns

In [None]:
print('Species Count: ',len(train_df['species'].value_counts()))
train_df['species'].value_counts()

<a id="data-cleaning"></a>
# Data Cleaning
### Fixing Duplicate Labels
* `bottlenose_dolpin` -> `bottlenose_dolphin`
* `kiler_whale` -> `killer_whale`
* `beluga` -> `beluga_whale`

### Changing Label due to extreme similarities
* `globis` & `pilot_whale` -> `short_finned_pilot_whale`

In [None]:
print('Before fixing duplicate labels : ')
print("Number of unique species : ", train_df['species'].nunique())

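# Assign the result back: replace(inplace=True) on a selected column
# may not update train_df in recent pandas versions
train_df['species'] = train_df['species'].replace({
    'bottlenose_dolpin' : 'bottlenose_dolphin',
    'kiler_whale' : 'killer_whale',
    'beluga' : 'beluga_whale',
    'globis' : 'short_finned_pilot_whale',
    'pilot_whale' : 'short_finned_pilot_whale'
})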

print('\nAfter fixing duplicate labels : ')
print("Number of unique species : ", train_df['species'].nunique())


train_df['class'] = train_df['species'].apply(lambda x: x.split('_')[-1])
train_df.head()

### Checking for missing data
Let's check whether there are any missing values in our dataset.

In [None]:
train_df.isna().sum()

### Checking for missing images
Now let's see whether any image is missing by comparing the number of files on disk with the number of rows in `train_df`.

In [None]:
len(os.listdir('../input/happy-whale-and-dolphin/train_images')),len(train_df)
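
Equal counts do not show that the two lists contain the same names. A minimal sketch of a stricter check uses a set difference in both directions; the helper name `find_mismatches` and the toy file names are hypothetical.

```python
def find_mismatches(csv_names, dir_names):
    """Return (listed in the CSV but not on disk, on disk but not in the CSV)."""
    csv_set, dir_set = set(csv_names), set(dir_names)
    return sorted(csv_set - dir_set), sorted(dir_set - csv_set)


# Toy file names; in the notebook these would be train_df['image'] and
# os.listdir('../input/happy-whale-and-dolphin/train_images')
missing, unlabeled = find_mismatches(
    ["a.jpg", "b.jpg", "c.jpg"], ["b.jpg", "c.jpg", "d.jpg"]
)
print(missing, unlabeled)
```

Two empty lists would mean every row has a file and every file has a row.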

<a id="visualization"></a>
# Visualization
### Getting all unique species

In [None]:
# Getting the photos of all unique species
plt.figure(figsize = (15,12))
for idx,i in enumerate(train_df.species.unique()):
    plt.subplot(4,7,idx+1)
    df = train_df[train_df['species'] ==i].reset_index(drop = True)
    # randint includes both endpoints, so the last valid row is len(df) - 1
    image_path = df.loc[rnd.randint(0, len(df) - 1), 'path']
    img = Image.open(image_path)
    img = img.resize((224,224))
    plt.imshow(img)
    plt.axis('off')
    plt.title(i)
plt.tight_layout()
plt.show()

In [None]:
# Function to plot whales 
def plot_species(df,species_name):
    plt.figure(figsize = (12,12))
    species_df = df[df['species'] ==species_name].reset_index(drop = True)
    plt.suptitle(species_name)
    for idx,i in enumerate(np.random.choice(species_df['path'],5)):
        plt.subplot(8,8,idx+1)
        image_path = i
        img = Image.open(image_path)
        img = img.resize((224,224))
        plt.imshow(img)
        plt.axis('off')
    plt.tight_layout()
    plt.show()

### More images from each species

In [None]:
for species in train_df['species'].unique():
    #print('\n\n')
    plot_species(train_df , species)

### Let's see some images by individual_id

We have to predict the individual_id from an image, so let's see what each individual looks like.

In [None]:
def plot_individual(df,individual_id):
    plt.figure(figsize = (12,12))
    species_df = df[df['individual_id'] == individual_id].reset_index(drop = True)
    plt.suptitle(individual_id)
    for idx,i in enumerate(np.random.choice(species_df['path'],24)):
        plt.subplot(8,8,idx+1)
        image_path = i
        img = Image.open(image_path)
        img = img.resize((224,224))
        plt.imshow(img)
        plt.axis('off')
    plt.tight_layout()
    plt.show()

#### Top 5 most frequent individuals

In [None]:
top_5_ids = train_df.individual_id.value_counts().head(5)
for i in top_5_ids.index:
    #print('\n\n')
    plot_individual(train_df , i)

#### 5 least frequent individuals

We will get duplicate images because many individuals have only one training image and `np.random.choice` samples with replacement.

In [None]:
last_5_ids = train_df.individual_id.value_counts().tail(5)
for i in last_5_ids.index:
    #print('\n\n')
    plot_individual(train_df , i)
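
The repeated images come from sampling with replacement. One way to avoid them, sketched here with the standard library, is to sample without replacement and cap the sample size at the number of available images. The helper name `sample_unique` and the toy paths are assumptions for illustration.

```python
import random


def sample_unique(paths, n, seed=None):
    """Pick up to `n` distinct paths; returns fewer when not enough exist."""
    rng = random.Random(seed)
    paths = list(paths)
    # random.sample never repeats an element, so cap n at the list length
    return rng.sample(paths, min(n, len(paths)))


picked_one = sample_unique(["only_image.jpg"], 24)
picked_many = sample_unique([f"img_{i}.jpg" for i in range(100)], 24, seed=0)
print(len(picked_one), len(picked_many))
```

With this approach an individual that has a single image produces a single subplot instead of 24 copies.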

### Seeing the distribution of individuals

In [None]:
train_df.individual_id.value_counts().describe()

### Let's see some test images

In [None]:
t_df = pd.read_csv('../input/happy-whale-and-dolphin/sample_submission.csv')
t_df['path'] = '../input/happy-whale-and-dolphin/test_images/' + t_df['image']

def plot_testimages(df):
    plt.figure(figsize = (12,12))
    plt.suptitle('Test Images')
    for idx,i in enumerate(np.random.choice(df['path'],48)):
        plt.subplot(8,8,idx+1)
        image_path = i
        img = Image.open(image_path)
        img = img.resize((224,224))
        plt.imshow(img)
        plt.axis('off')
    plt.tight_layout()
    plt.show()

plot_testimages(t_df)
del t_df

### Observations regarding handpicked images

1. There are some abnormal images in both the train and test datasets.
2. Some training images contain people, boats, birds, penguins, etc.
3. Many training images are cropped, but some are not.
4. The uncropped images must be taken care of.
5. Some images were taken underwater.

<a id="class-distribution-analysis"></a>
# Class Distribution Analysis

In [None]:
sns.countplot(x='class',data=train_df)
plt.title('Distribution of classes')
plt.show()

#### Percentage of images of whale and dolphin in the dataset

In [None]:
plt.figure(figsize=(5,5))
class_cnt = train_df.groupby(['class']).size().reset_index(name = 'counts')
plt.pie(class_cnt['counts'], labels=class_cnt['class'],colors=['deepskyblue','royalblue'], autopct='%1.1f%%')
plt.legend(loc='upper left')
plt.show()

#### Number of training images of each species

In [None]:
plt.figure(figsize=(8,8))
sns.countplot(data=train_df, y = 'species',  palette='mako', dodge=False)
plt.show()

#### Number of training images of each species of whale and dolphin

In [None]:
fig,ax = plt.subplots(1,2,figsize=(10,6))

whales = train_df[train_df['class']=='whale']
dolphins = train_df[train_df['class']!='whale']

sns.countplot(y="species", data=whales, order=whales.iloc[0:]["species"].value_counts().index, ax=ax[0], color = "#0077b6")
ax[0].set_title('Most frequent whales')
ax[0].set_ylabel(None)
    
sns.countplot(y="species", data=dolphins,order=dolphins.iloc[0:]["species"].value_counts().index, ax=ax[1], color = "#90e0ef")
ax[1].set_title('Most frequent dolphins')
ax[1].set_ylabel(None)

plt.tight_layout()
plt.show()

#### Number of training images of top 10 individuals

In [None]:
plt.figure(figsize=(12,4))
top_ten_ids = train_df.individual_id.value_counts().head(10)
top_ten_ids = pd.DataFrame({'individual_id':top_ten_ids.index, 'frequency':top_ten_ids.values})

plt.bar(top_ten_ids['individual_id'],top_ten_ids['frequency'],width = 0.8,color='c',zorder=4)
plt.xticks(rotation=90)
plt.ylabel("frequency")
plt.xlabel("Individual Ids")
plt.title("Top 10 individual IDs by frequency")
plt.grid(visible = True, color ='grey',linestyle ='-', linewidth = 0.9,alpha = 0.2, zorder=0)
plt.show()

#### Plot the value count graph of each individual

In [None]:
train_df['individual_id'].value_counts().plot()
plt.xticks(rotation=90)
plt.show()

#### Density estimate of the log image count per individual

In [None]:
np.log(train_df['individual_id'].value_counts()).plot.kde()

### Image count of individuals

In [None]:
# Number of training images available for the individual in each row
train_df['count'] = train_df.groupby('individual_id')['individual_id'].transform('count')
train_df.head()

<a id="observation-regarding-class-distribution"></a>
## Observation Regarding Class Distribution
There is a huge imbalance in the data. Many classes have only one or a few samples:

1. The total number of individuals is 15587.
2. 9258 individuals have just one image.
3. The individual with the most images has 400 of them.
4. Image distribution:
  1. almost 40% of the images come from individuals with 4 or fewer images.
  1. almost 23% come from individuals with 5-20 images.
  1. the remaining 37% come from individuals with more than 20 images.
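
A minimal sketch of how such bucket shares could be computed with pandas is shown below. The helper name `image_share_by_bucket` and the toy ids are hypothetical; on the real data the input would be `train_df['individual_id']`.

```python
import pandas as pd


def image_share_by_bucket(individual_ids):
    """Share of images (in %) from individuals with <=4, 5-20 and >20 images."""
    ids = pd.Series(individual_ids)
    # For every image, the number of images its individual has in total
    per_row_count = ids.map(ids.value_counts())
    buckets = pd.cut(
        per_row_count,
        bins=[0, 4, 20, float("inf")],
        labels=["<=4", "5-20", ">20"],
    )
    return (buckets.value_counts(normalize=True, sort=False) * 100).round(1)


# Toy data: 40 images spread over 5 individuals
toy_ids = ["a"] * 25 + ["b"] * 10 + ["c"] * 3 + ["d", "e"]
shares = image_share_by_bucket(toy_ids)
print(shares)
```

The shares are computed per image rather than per individual, which matches how the percentages in the list are phrased.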

<a id="image-resolutions"></a>
# Image Resolutions

In [None]:
'''widths, heights = [], []

for path in tqdm(train_df["path"]):
    width, height = Image.open(path).size
    widths.append(width)
    heights.append(height)
    
train_df["width"] = widths
train_df["height"] = heights
train_df["dimension"] = train_df["width"] * train_df["height"]
train_df_save = train_df.copy()'''

### Let's see some small images

In [None]:
'''train_df.sort_values('width').head(84)'''

<a id="preprocessing"></a>
# Preprocessing
### Encoding Labels

In [None]:
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.compose import ColumnTransformer

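# Select by column name instead of position (column 3 is 'path', column 2 is 'individual_id')
X = train_df['path'].values
y = train_df['individual_id'].values

# Map each individual_id string to an integer, then one-hot encode the integers
label_encoder = LabelEncoder()
y = label_encoder.fit_transform(y)
# Note: scikit-learn 1.2 renamed this argument to sparse_output
onehot_encoder = OneHotEncoder(sparse=False)
y = y.reshape(len(y), 1)
y = onehot_encoder.fit_transform(y)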

In [None]:
y.shape
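
With 51033 training images and 15587 individuals (the counts quoted in this notebook), a dense one-hot label matrix is large. The back-of-the-envelope estimate below assumes 8-byte float64 values, which is the default dtype of `OneHotEncoder`; the helper name `dense_one_hot_bytes` is made up for this sketch.

```python
def dense_one_hot_bytes(n_samples, n_classes, bytes_per_value=8):
    """Size in bytes of a dense (n_samples, n_classes) one-hot matrix."""
    return n_samples * n_classes * bytes_per_value


# Counts quoted elsewhere in this notebook
n_images, n_individuals = 51033, 15587
size_bytes = dense_one_hot_bytes(n_images, n_individuals)  # float64 values
size_gib = size_bytes / 2**30
print(f"{size_gib:.1f} GiB")
```

If that does not fit in memory, one commonly used alternative is to keep the integer labels from `LabelEncoder` and train with the `sparse_categorical_crossentropy` loss. That is a suggestion only, not what this notebook does.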

## Modeling

In [None]:
train_jpg_path = "../input/happy-whale-and-dolphin/train_images"
test_jpg_path = "../input/happy-whale-and-dolphin/test_images"
train_images_list = os.listdir('../input/happy-whale-and-dolphin/train_images')

In [None]:
def Loading_Images(data, m, dataset):
    print("Loading images")
    X_train = np.zeros((m, 32, 32, 3))
    count = 0
    for fig in tqdm(data['image']):
        # target_size is (height, width); the channel count is not part of it
        img = image.load_img("../input/happy-whale-and-dolphin/"+dataset+"/"+fig, target_size=(32, 32))
        x = image.img_to_array(img)
        x = preprocess_input(x)
        X_train[count] = x
        count += 1
    return X_train

In [None]:
def prepare_labels(y):
    values = np.array(y)
    label_encoder = LabelEncoder()
    integer_encoded = label_encoder.fit_transform(values)
    onehot_encoder = OneHotEncoder(sparse=False)
    integer_encoded = integer_encoded.reshape(len(integer_encoded), 1)
    onehot_encoded = onehot_encoder.fit_transform(integer_encoded)
    y = onehot_encoded
    return y, label_encoder

In [None]:
X = Loading_Images(train_df, train_df.shape[0], "train_images")
X /= 255

In [None]:
import gc
y, label_encoder = prepare_labels(train_df['individual_id'])
print(X.shape)
print(y.shape)
gc.collect()

In [None]:
y.shape

In [None]:
from tensorflow.keras.applications import EfficientNetB0
from tensorflow.keras.layers import GlobalAveragePooling2D, Dropout, Dense
from tensorflow.keras.preprocessing.image import ImageDataGenerator
import tensorflow as tf
from keras.models import Model


base_model = EfficientNetB0(input_shape=(32,32,3), weights=None, include_top=False)

layer = base_model.output
#layer = GlobalAveragePooling2D()(layer)#extra
#layer = Dropout(0.5)(layer)#extra
layer = Dense(1024, activation='relu')(layer)
#layer = Dense(512, activation='relu')(layer)#extra
layer = Flatten()(layer)
predictions = Dense(y.shape[1], activation='softmax')(layer)
model = Model(inputs=base_model.input, outputs=predictions)

model.compile(loss='categorical_crossentropy', optimizer="adam", metrics=['accuracy'])
model.summary()

In [None]:
train_datagen = ImageDataGenerator(horizontal_flip=True,
                                   vertical_flip=True,
                                   validation_split=0.20,
                                   )

#train_datagen.fit(X)

In [None]:
#history = model.fit(train_datagen.flow(X,y,batch_size=128,subset='training'),validation_data=train_datagen.flow(X,y,batch_size=128,subset='validation'),epochs=180)
history = model.fit(X, y, epochs = 200, batch_size=128, verbose=1)

In [None]:
def cnn_model():
    model = Sequential()
    model.add(Conv2D(32, (6, 6), strides = (1, 1), input_shape = (32, 32, 3)))
    model.add(BatchNormalization(axis = 3))
    model.add(Activation('relu'))
    model.add(MaxPooling2D((2, 2)))
      
    model.add(Conv2D(64, (3, 3), strides = (1,1)))
    model.add(Activation('relu'))
    model.add(AveragePooling2D((3, 3)))

    model.add(Flatten())
    model.add(Dense(512, activation="relu"))
    model.add(Dropout(0.85))

    model.add(Dense(y.shape[1], activation='softmax'))

    model.compile(loss='categorical_crossentropy', optimizer="adam", metrics=['accuracy'])
    
    return(model)

In [None]:
Cnn_model = cnn_model()

In [None]:
plt.figure(figsize=(15,5))
plt.plot(history.history['accuracy'])
plt.title('Model accuracy')
plt.ylabel('Accuracy')
plt.xlabel('Epoch')
plt.show()

In [None]:
plt.figure(figsize=(15,5))
plt.plot(history.history['loss'])
plt.title('Model loss')
plt.ylabel('loss')
plt.xlabel('Epoch')
plt.show()

In [None]:
test = os.listdir("../input/happy-whale-and-dolphin/test_images")
print(len(test))

In [None]:
col = ['image']
test_df = pd.DataFrame(test, columns=col)
test_df['predictions'] = ''

In [None]:
batch_size=5000
batch_start = 0
batch_end = batch_size
L = len(test_df)

while batch_start < L:
    limit = min(batch_end, L)
    test_df_batch = test_df.iloc[batch_start:limit]
    print(type(test_df_batch))
    X = Loading_Images(test_df_batch, test_df_batch.shape[0], "test_images")
    X /= 255
    predictions = model.predict(np.array(X), verbose=1)
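    for i, pred in enumerate(predictions):
        # Indices of the 5 highest-probability classes, best first
        top5 = pred.argsort()[-5:][::-1]
        labels = label_encoder.inverse_transform(top5)
        # Confident ids go before 'new_individual', the remaining ids after it
        confident = [label for label, k in zip(labels, top5) if pred[k] > 0.5]
        others = [label for label, k in zip(labels, top5) if pred[k] <= 0.5]
        # At most 5 labels are allowed per image, so truncate the ranked list
        ranked = (confident + ['new_individual'] + others)[:5]
        test_df.loc[batch_start + i, 'predictions'] = ' '.join(ranked)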
    batch_start += batch_size   
    batch_end += batch_size
    del X
    del test_df_batch
    del predictions
    gc.collect()
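
The loop above places ids predicted with probability above 0.5 before `new_individual` and the remaining ids after it. The same rule can be sketched as a small standalone function that is easy to test in isolation. The name `rank_with_new_individual`, the toy labels, and the cut to 5 labels (based on the submission limit) are assumptions.

```python
def rank_with_new_individual(labels, probs, threshold=0.5, max_labels=5):
    """Order labels by probability and slot 'new_individual' at the threshold."""
    ranked = sorted(zip(labels, probs), key=lambda pair: pair[1], reverse=True)
    confident = [label for label, p in ranked if p > threshold]
    others = [label for label, p in ranked if p <= threshold]
    # Known ids above the threshold outrank 'new_individual'; the rest follow it
    return (confident + ["new_individual"] + others)[:max_labels]


sure = rank_with_new_individual(["id_a", "id_b", "id_c"], [0.7, 0.2, 0.1])
unsure = rank_with_new_individual(["id_a", "id_b", "id_c"], [0.3, 0.2, 0.1])
print(sure)
print(unsure)
```

The threshold of 0.5 is the value used in the loop; it is a tunable choice rather than a fixed rule.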

In [None]:
test_df.to_csv('submission.csv',index=False)
test_df.head()

In [None]:
test_df.to_csv('submission_whale_and_dolphin.csv', index = False)

# References
I used these awesome kernels for the whole EDA. Do check them out if you have time.

In [None]:
##https://www.kaggle.com/andradaolteanu/whales-dolphins-effnet-embedding-cos-distance
##https://www.kaggle.com/lextoumbourou/happy-whale-dolphin-q-a-style-eda
##https://www.kaggle.com/rednivrug/eda-for-whale-with-bounding-boxes/notebook
##https://www.kaggle.com/pestipeti/explanation-of-map5-scoring-metric