# [Happywhale - Whale and Dolphin Identification](https://www.kaggle.com/c/happy-whale-and-dolphin)
> Identify whales 🐳 and dolphins 🐬 by unique characteristic

<h4> Hi There</h4>
<p>In this notebook I analyze and explore the dataset, fix some of its problems, and implement some image augmentation techniques. I will continue to update it. I hope you find this notebook useful. If you do, please support it with an upvote.</p>

<img src="https://i.imgur.com/VWojFpo.png">

## Notebook Contents
1. [Introduction](#introduction)
2. [Submission Format](#submission-format)
3. [Evaluation Metric Explained](#evaluation-metric-explained)
4. [Loading Dataset](#loading-dataset)
5. [Data Cleaning](#data-cleaning)
6. [Dataset Visualization](#visualization)<br/>
     6.1 [Visualize Train and Test Images](#visualization)<br/>
     6.2 [Visualize Class Distribution](#class-distribution-analysis)<br/>
     6.3 [Observations](#observation-regarding-class-distribution)<br/>
7. [Getting Image Resolutions](#image-resolutions)
8. [Color Analysis](#color-analysis)<br/>
    8.1 [Check Gray Scale Images](#color-analysis)<br/>
    8.2 [Visualize Mean Intensity for RGB Channels](#get-mean-intensity-for-each-channel-RGB)<br/>
    8.3 [Observations](#observation-regarding-color-distribution)<br/>
9. [Data Augmentation](#data-augmentation)
10. [Preprocessing Dataset](#preprocessing)

<br>

<a id="introduction"></a>
# Introduction
This training data contains thousands of images of whales and dolphins. Individual whales and dolphins have been identified by researchers and given an `Id`. The challenge is to predict the `Id` of images in the test set by unique—but often subtle—characteristics of their natural markings. The best submissions will suggest photo-`Id` solutions that are fast and accurate.

<br>

### If you find this notebook useful,  <font color='red'>please support with an upvote</font> 🙏

# Importing Libraries

In [None]:
import os

import pandas as pd
import numpy as np
import tensorflow as tf
import matplotlib.pyplot as plt
import seaborn as sns
import pickle

from keras import layers
from keras.models import Sequential
from keras.preprocessing import image
from keras.layers import Input, Dense, Activation, Dropout
from keras.layers import Flatten, BatchNormalization, Conv2D
from keras.layers import MaxPooling2D, AveragePooling2D
from keras.applications.imagenet_utils import preprocess_input

from PIL import Image
from tqdm import tqdm
import random as rnd
import cv2

!pip install livelossplot
from livelossplot import PlotLossesKeras

%matplotlib inline

<a id="submission-format"></a>
# Submission Format

### We can predict up to 5 labels for each image.
For each image in the test set, we can predict up to 5 individual_id labels. There are individuals in the test set that are not seen in the training data; these should be predicted as new_individual. The file should contain a header and have the following format:

```
image,predictions 
000188a72f2562.jpg,37c7aba965a5 114207cab555 a6e325d8e924 19fbb960f07d new_individual 
000ba09273d6f3.jpg,37c7aba965a5 114207cab555 a6e325d8e924 19fbb960f07d new_individual 
...
```
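
As a minimal sketch of how one row's `predictions` field could be assembled, the helper below keeps at most five labels in ranked order and joins them with spaces. The name `format_predictions` is hypothetical, and dropping repeated labels is an assumption made here, not a stated competition rule.

```python
def format_predictions(ranked_ids, max_labels=5):
    """Join up to `max_labels` unique labels, best guess first, with spaces."""
    seen = []
    for label in ranked_ids:
        if label not in seen:          # assumption: a repeated label is dropped
            seen.append(label)
        if len(seen) == max_labels:
            break
    return " ".join(seen)

row = format_predictions(["37c7aba965a5", "114207cab555", "37c7aba965a5", "new_individual"])
print(row)
```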

<br>

<a id="evaluation-metric-explained"></a>
# Evaluation Metric Explained

The evaluation metric in the competition's description is Mean Average Precision @ 5 (MAP@5):
$$MAP@5 = {1 \over U} \sum_{u=1}^{U} \sum_{k=1}^{\min(n,5)} P(k) \times rel(k)$$

where `U` is the number of images, `P(k)` is the precision at cutoff `k`, `rel(k)` is an indicator function equal to 1 if the item at rank `k` is a relevant (correct) label and 0 otherwise, and `n` is the number of predictions per image.

> the calculation would stop after the first occurrence of the correct whale, so `P(1) = 1`. So, a prediction that is `correct` `incorrect` `incorrect` `incorrect` `incorrect` also scores `1`.

So we don't have to sum up to 5, only up to the first correct answer. In this competition there is only one correct (`TP`) answer per image, so the possible precision scores per image are either `0` or `P(k)=1/k`.

| true  | predicted   | k  | Image score |
|:-:|:-:|:-:|:-:|
| [x]  | [x, ?, ?, ?, ?]   | 1  | 1.0  |
| [x]  | [?, x, ?, ?, ?]   | 2  | 0 + 1/2 = 0.5 |
| [x]  | [?, ?, x, ?, ?]   | 3  | 0/1 + 0/2 + 1/3  = 0.33 |
| [x]  | [?, ?, ?, x, ?]   | 4  | 0/1 + 0/2 + 0/3 + 1/4  = 0.25 |
| [x]  | [?, ?, ?, ?, x]   | 5  | 0/1 + 0/2 + 0/3 + 0/4 + 1/5  = 0.2 |
| [x]  | [?, ?, ?, ?, ?]   | 5  | 0/1 + 0/2 + 0/3 + 0/4 + 0/5  = 0.0 |

where `x` is the correct prediction and `?` is an incorrect one.

### The final score is simply the average over the scores of the images.
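
The scoring rule described above can be sketched in a few lines of plain Python. The function names are hypothetical, and the sketch assumes exactly one correct label per image, as stated above.

```python
def average_precision_at_5(true_id, predicted_ids):
    """Return 1/k for the first correct label at rank k (k <= 5), else 0.0."""
    for rank, predicted in enumerate(predicted_ids[:5], start=1):
        if predicted == true_id:
            return 1.0 / rank
    return 0.0

def map_at_5(true_ids, predictions):
    """Average the per-image scores over all images."""
    scores = [average_precision_at_5(t, p) for t, p in zip(true_ids, predictions)]
    return sum(scores) / len(scores)

truth = ["x", "x", "x"]
preds = [
    ["x", "a", "b", "c", "d"],   # correct at rank 1 -> 1.0
    ["a", "x", "b", "c", "d"],   # correct at rank 2 -> 0.5
    ["a", "b", "c", "d", "e"],   # never correct     -> 0.0
]
print(map_at_5(truth, preds))
```

The three toy rows correspond to the first, second and last rows of the table.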

<br>

<a id="loading-dataset"></a>
# Loading Dataset
We'll use [Pandas](https://pandas.pydata.org/pandas-docs/stable/) to load the dataset into memory.

In [None]:
train_df = pd.read_csv('../input/happy-whale-and-dolphin/train.csv')
train_df['path'] = '../input/happy-whale-and-dolphin/train_images/' + train_df['image']

pred_df = pd.read_csv('../input/happy-whale-and-dolphin/sample_submission.csv')
pred_df['path'] = '../input/happy-whale-and-dolphin/test_images/' + pred_df['image']

#### The dataset has two CSV files
* train.csv - contains the image name, species and individual_id
* sample_submission.csv - contains the image name and a dummy label for each image in the test folder

#### And two folders containing the images
* train_images - 51033 images of different species of whales and dolphins. Their labels are provided in the train.csv file
* test_images - 27956 images of different species of whales and dolphins. We need to predict their labels

In [None]:
train_df.head(10)

In [None]:
print('Train samples count: ', len(train_df))
train_df.columns

In [None]:
print('Species Count: ',len(train_df['species'].value_counts()))
train_df['species'].value_counts()

<a id="data-cleaning"></a>
# Data Cleaning
### Fixing Duplicate Labels
* `bottlenose_dolpin` -> `bottlenose_dolphin`
* `kiler_whale` -> `killer_whale`
* `beluga` -> `beluga_whale`

### Changing Label due to extreme similarities
* `globis` & `pilot_whale` -> `short_finned_pilot_whale`

In [None]:
print('Before fixing duplicate labels : ')
print("Number of unique species : ", train_df['species'].nunique())

# Assign the result back instead of using inplace=True on a column selection
train_df['species'] = train_df['species'].replace({
    'bottlenose_dolpin' : 'bottlenose_dolphin',
    'kiler_whale' : 'killer_whale',
    'beluga' : 'beluga_whale',
    'globis' : 'short_finned_pilot_whale',
    'pilot_whale' : 'short_finned_pilot_whale'
})

print('\nAfter fixing duplicate labels : ')
print("Number of unique species : ", train_df['species'].nunique())


train_df['class'] = train_df['species'].apply(lambda x: x.split('_')[-1])
train_df.head()

### Checking missing data
Let's check whether there are any missing values in our dataset.

In [None]:
train_df.isna().sum()

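### Check for missing images
Now let's see whether any image is missing by counting the files in the training folder and comparing that with the number of rows in train.csv.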

In [None]:
len(os.listdir('../input/happy-whale-and-dolphin/train_images'))
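
Counting files only shows that the totals agree. A slightly stricter check, sketched here with a hypothetical helper `find_missing_images` and toy file names, compares the names themselves:

```python
def find_missing_images(expected_names, present_names):
    """Return the expected file names that are absent from the directory listing."""
    present = set(present_names)
    return [name for name in expected_names if name not in present]

# Toy stand-ins for train_df['image'] and os.listdir(<train image folder>)
expected = ["a.jpg", "b.jpg", "c.jpg"]
present = ["c.jpg", "a.jpg"]
missing = find_missing_images(expected, present)
print(missing)
```

In this notebook the call would presumably be `find_missing_images(train_df['image'], os.listdir('../input/happy-whale-and-dolphin/train_images'))`. An empty list means every row in train.csv has a file.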

<a id="visualization"></a>
# Visualization
### Looking at some random beauties  <a class="anchor" id="third-bullet"></a>
It's a great deal of fun to explore the data and play around with *matplotlib*

Note: `rnd.randint(a, b)` includes both endpoints, so the row index must be drawn from `0` to `len(df) - 1`. An index outside that range makes `df.loc` raise a `KeyError`.

In [None]:
plt.figure(figsize = (15,12))
for idx,i in enumerate(train_df.species.unique()):
    plt.subplot(4,7,idx+1)
    df = train_df[train_df['species'] ==i].reset_index(drop = True)
    # randint is inclusive on both ends, so the upper bound is len(df) - 1
    image_path = df.loc[rnd.randint(0, len(df) - 1),'path']
    img = Image.open(image_path)
    img = img.resize((224,224))
    plt.imshow(img)
    plt.axis('off')
    plt.title(i)
plt.tight_layout()
plt.show()

In [None]:
def plot_species(df,species_name):
    plt.figure(figsize = (12,12))
    species_df = df[df['species'] ==species_name].reset_index(drop = True)
    plt.suptitle(species_name)
    for idx,i in enumerate(np.random.choice(species_df['path'],32)):
        plt.subplot(8,8,idx+1)
        image_path = i
        img = Image.open(image_path)
        img = img.resize((224,224))
        plt.imshow(img)
        plt.axis('off')
    plt.tight_layout()
    plt.show()

### Plotting more images from each species

In [None]:
for species in train_df['species'].unique():
    #print('\n\n')
    plot_species(train_df , species)

### Let's see some images by individual_id

We have to predict the individual_id from an image, so let's see what each individual looks like.

In [None]:
def plot_individual(df,individual_id):
    plt.figure(figsize = (12,12))
    species_df = df[df['individual_id'] ==individual_id].reset_index(drop = True)
    plt.suptitle(individual_id)
    for idx,i in enumerate(np.random.choice(species_df['path'],24)):
        plt.subplot(8,8,idx+1)
        image_path = i
        img = Image.open(image_path)
        img = img.resize((224,224))
        plt.imshow(img)
        plt.axis('off')
    plt.tight_layout()
    plt.show()

#### Top 5 most frequent individuals

In [None]:
top_5_ids = train_df.individual_id.value_counts().head(5)
for i in top_5_ids.index:
    #print('\n\n')
    plot_individual(train_df , i)

#### 5 least frequent individuals

We will get duplicate images because many individuals have only one training image and `np.random.choice` samples with replacement.

In [None]:
last_5_ids = train_df.individual_id.value_counts().tail(5)
for i in last_5_ids.index:
    #print('\n\n')
    plot_individual(train_df , i)
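
If the repeated images are unwanted, one option is to sample without replacement and cap the sample at the number of images available. This is a sketch under that assumption, not what the plotting functions above do. The helper name `sample_paths` is hypothetical.

```python
import random

def sample_paths(paths, k, seed=None):
    """Pick up to k distinct paths; returns all of them when fewer than k exist."""
    paths = list(paths)
    rng = random.Random(seed)
    return rng.sample(paths, min(k, len(paths)))

# An individual with a single image yields that one image, not 24 copies
picked = sample_paths(["only_image.jpg"], 24)
print(picked)
```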

### Let's see some test images

In [None]:
t_df = pd.read_csv('../input/happy-whale-and-dolphin/sample_submission.csv')
t_df['path'] = '../input/happy-whale-and-dolphin/test_images/' + t_df['image']

def plot_testimages(df):
    plt.figure(figsize = (12,12))
    plt.suptitle('Test Images')
    for idx,i in enumerate(np.random.choice(df['path'],48)):
        plt.subplot(8,8,idx+1)
        image_path = i
        img = Image.open(image_path)
        img = img.resize((224,224))
        plt.imshow(img)
        plt.axis('off')
    plt.tight_layout()
    plt.show()

plot_testimages(t_df)
del t_df

### Some hand picked Training Images
I manually looked through the dataset and found some weird train and test images. We must process those images before feeding them to our model in order to improve model performance. I have also collected some images from <a href="https://www.kaggle.com/andradaolteanu">@andradaolteanu</a>'s notebook.

In [None]:
def plot_sometrainimages(image_ids, rows, cols):
    images = []
    
    plt.figure(figsize = (12,36))
    for idx,i in enumerate(image_ids):
        plt.subplot(rows,cols,idx+1)
        image_path = '../input/happy-whale-and-dolphin/train_images/' + i
        img = Image.open(image_path)
        img = img.resize((224,224))
        plt.title(i.split('.')[0])
        images.append(i)
        plt.imshow(img)
        plt.axis('off')
    plt.tight_layout()
    plt.show()
    
some_train_images = [
'ba870b9e693201.jpg','7b0fb782fd9288.jpg','b642b895fc7138.jpg','d5a39578f273f3.jpg','6461d14e8348dc.jpg','e8f26d80ec48a2.jpg',
'db9097bd9d8cc1.jpg','5109dfc3b3e104.jpg','13fd25f2e6f344.jpg','7b8a44f8851f07.jpg','5227f5db439364.jpg','a50cd3d93457b1.jpg', 
'381a58bed2fbc1.jpg','1392531bb34c9a.jpg','9867c0f54ff77d.jpg','d4d8ac80cb3a4b.jpg','d3ed54248b6681.jpg','7617259953a75c.jpg',
'186f85fd38a70a.jpg','93e0f73bb86f70.jpg','21b3ad2437152f.jpg','67f0cd56a4a2c5.jpg','3a8c7e5429df52.jpg','e63956125e34b1.jpg',
'd20de23fe239b2.jpg','79be96e5d8674c.jpg','2a6a8d51f1cf49.jpg','5d6ca9ba43d567.jpg','0c03feec795ed0.jpg','061761cee5d501.jpg',
'cc666d76135ec0.jpg','70cad7ea5587d6.jpg','2bff7fb335a178.jpg','f76609dff2c3c6.jpg','090d7f9228a6bc.jpg','58fda080bd639d.jpg',
'1922f6641653d4.jpg','48196cd0f04a9b.jpg','78d7b40e183021.jpg','e7f43942481868.jpg','bb3fd5ca8db073.jpg','13a25d81619913.jpg',
'35d677992a4f2e.jpg','0246806606bc80.jpg','b4fd6577002028.jpg','c0a5ad1aece888.jpg','67ad9cb0769536.jpg','5e1f489ea57e10.jpg',
'63acaed950eab8.jpg','014f6d1c690aff.jpg','097fb940db8b2c.jpg','4edf5f49a062eb.jpg','9a236360f50155.jpg','4fdcfcd6660edd.jpg',
'0b1bd9850ad8a3.jpg','3cfa63a3bbebb7.jpg','31f99c519f55c9.jpg','898559919d173c.jpg','ce7695de81f1fe.jpg','911dd92c244d93.jpg',
'bec33fe0de0384.jpg','5bb81354d55397.jpg','8f8335a84b89fb.jpg','8022b7a61a1f39.jpg','931854aa0b59b6.jpg','07d4d07aa31141.jpg',
'53eca7a79fbf3d.jpg','3d045073aa762d.jpg','cd5fe465c60cb9.jpg','77908aab4bb24b.jpg','abf6f48044116d.jpg','4f43555e842ade.jpg',
'3c15e996c183aa.jpg','726582ee59e000.jpg','9cadc38ade64ac.jpg','402d2e6df3ca51.jpg','05b9a41635a275.jpg','fd7e858f16fb3f.jpg',
'f461e90a1a0909.jpg','5f220b77007ecd.jpg','269a6830d8b3c4.jpg'
]


In [None]:
plot_sometrainimages(some_train_images, 14, 6)

### Some hand picked Test Images

In [None]:
def plot_sometestimages(image_ids, rows, cols):
    images = []
    
    plt.figure(figsize = (12,6))
    for idx,i in enumerate(image_ids):
        plt.subplot(rows,cols,idx+1)
        image_path = '../input/happy-whale-and-dolphin/test_images/' + i
        img = Image.open(image_path)
        img = img.resize((224,224))
        plt.title(i.split('.')[0])
        images.append(i)
        plt.imshow(img)
        plt.axis('off')
    plt.tight_layout()
    plt.show()

In [None]:
some_test_images = [
'5bf1396d350169.jpg','5e4a1ef591f291.jpg','6caa20cf5526cb.jpg','9b0b44b19ba412.jpg','43f1e346be1ddd.jpg','67e5fb9a6110b0.jpg',
'e4acbbdc2feb58.jpg','fbc03b809b4fb0.jpg','844631d5b8c2f0.jpg','65def7ff6151f6.jpg','efbebaa3d40a48.jpg','2ba1913c52463b.jpg',
]
plot_sometestimages(some_test_images, 2, 6)

### Observations regarding handpicked images

1. There are some abnormal images in both the train and test datasets.
2. Some training images contain people, boats, birds, penguins, etc.
3. Many training images are cropped, but some are not.
4. The uncropped images must be taken care of.
5. Some images were taken underwater.

<a id="class-distribution-analysis"></a>
# Class Distribution Analysis
In this section we analyze the number of training samples in each class. This gives us a better understanding of the dataset and the information we need to preprocess it before the training phase.

In [None]:
plot = sns.countplot(x = train_df['class'], color = '#2596be')
sns.despine()
plot.set_title('Class Distribution\n', font = 'serif', x = 0.1, y=1, fontsize = 16);
plot.set_ylabel("Count", x = 0.02, font = 'serif', fontsize = 12)
plot.set_xlabel("Class", fontsize = 12, font = 'serif')

for p in plot.patches:
    plot.annotate(format(p.get_height(), '.0f'), (p.get_x() + p.get_width() / 2, p.get_height()), 
       ha = 'center', va = 'center', xytext = (0, -20),font = 'serif', textcoords = 'offset points', size = 15)

#### Percentage of images of whale and dolphin in the dataset

In [None]:
plt.figure(figsize=(5,5))
class_cnt = train_df.groupby(['class']).size().reset_index(name = 'counts')
colors = sns.color_palette('Paired')[0:9]
plt.pie(class_cnt['counts'], labels=class_cnt['class'], colors=colors, autopct='%1.1f%%')
plt.legend(loc='upper left')
plt.show()

#### Number of training images of each species

In [None]:
plt.figure(figsize=(8,8))
sns.countplot(data=train_df, y = 'species',  palette='crest', dodge=False)
plt.show()

#### Number of training images of each species of whale and dolphin

In [None]:
fig,ax = plt.subplots(1,2,figsize=(10,5))

whales = train_df[train_df['class']=='whale']
dolphins = train_df[train_df['class']!='whale']

sns.countplot(y="species", data=whales, order=whales.iloc[0:]["species"].value_counts().index, ax=ax[0], color = "#0077b6")
ax[0].set_title('Most frequent whales')
ax[0].set_ylabel(None)
    
sns.countplot(y="species", data=dolphins,order=dolphins.iloc[0:]["species"].value_counts().index, ax=ax[1], color = "#90e0ef")
ax[1].set_title('Most frequent dolphins')
ax[1].set_ylabel(None)

plt.tight_layout()
plt.show()

#### Number of training images of top 10 individuals

In [None]:
plt.figure(figsize=(12,4))
top_ten_ids = train_df.individual_id.value_counts().head(10)
top_ten_ids = pd.DataFrame({'individual_id':top_ten_ids.index, 'frequency':top_ten_ids.values})

plt.bar(top_ten_ids['individual_id'],top_ten_ids['frequency'],width = 0.8,color='c',zorder=4)
plt.xticks(rotation=90)
plt.ylabel("frequency")
plt.xlabel("Individual Ids")
plt.title("Top 10 Individual Ids used by frequency")
plt.grid(visible = True, color ='grey',linestyle ='-', linewidth = 0.9,alpha = 0.2, zorder=0)
plt.show()

#### Plot the value count graph of each individual

In [None]:
train_df['individual_id'].value_counts().plot()
plt.xticks(rotation=90)
plt.show()

#### Density estimate of the log image count per individual

In [None]:
np.log(train_df['individual_id'].value_counts()).plot.kde()

#### Density estimate of the log image count per individual, by whale and dolphin

In [None]:
plt.figure(figsize = (20, 10))
sns.kdeplot(np.log(train_df.loc[train_df['class'] == 'whale']['individual_id'].value_counts()))
sns.kdeplot(np.log(train_df.loc[train_df['class'] == 'dolphin']['individual_id'].value_counts()))
plt.legend(labels = ['whale', 'dolphin'])
plt.show()

#### Number of unique individuals in the dataset

In [None]:
len(train_df.individual_id.unique())

### Image count of individuals

In [None]:
train_df['count'] = train_df.groupby('individual_id')['individual_id'].transform('count')
train_df.head()

#### Individuals with only one training image

In [None]:
train_df[train_df['count']==1]

#### Share of images from individuals with fewer than 5 images

In [None]:
tmp = train_df[train_df['count']<=4]
len(tmp)/len(train_df)

#### Share of images from individuals with 5 to 20 images

In [None]:
count = 0
for i in train_df['count']:
    if(i > 4 and i <= 20):
        count += 1
print(count/len(train_df))

<a id="observation-regarding-class-distribution"></a>
## Observation Regarding Class Distribution
There is a huge imbalance in the data. Many classes have only one or a few samples:

1. The total number of individuals is 15587.
2. 9258 individuals have just one image.
3. The individual with the most images has 400 of them.
4. Image distribution:
   1. almost 40% come from individuals with 4 or fewer images.
   2. almost 23% come from individuals with 5-20 images.
   3. the remaining 37% come from individuals with more than 20 images.
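
The three shares in the last item can be computed in one pass. The sketch below uses only the standard library and toy ids. The helper name `image_share_by_bucket` is hypothetical, and the bucket edges are the ones used in this section.

```python
from collections import Counter

def image_share_by_bucket(individual_ids):
    """Fraction of images whose individual has 1-4, 5-20, or more than 20 images."""
    per_individual = Counter(individual_ids)
    shares = {"1-4": 0, "5-20": 0, ">20": 0}
    for n_images in per_individual.values():
        if n_images <= 4:
            shares["1-4"] += n_images
        elif n_images <= 20:
            shares["5-20"] += n_images
        else:
            shares[">20"] += n_images
    total = len(individual_ids)
    return {bucket: count / total for bucket, count in shares.items()}

# Toy data: individual "a" has 2 images, "b" has 6, "c" has 32
toy_ids = ["a"] * 2 + ["b"] * 6 + ["c"] * 32
shares = image_share_by_bucket(toy_ids)
print(shares)
```

On the real data the input would be something like `train_df['individual_id'].tolist()`.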

<a id="image-resolutions"></a>
# Image Resolutions

In [None]:
widths, heights = [], []

for path in tqdm(train_df["path"]):
    width, height = Image.open(path).size
    widths.append(width)
    heights.append(height)
    
train_df["width"] = widths
train_df["height"] = heights
train_df["dimension"] = train_df["width"] * train_df["height"]

### Let's see some small images

In [None]:
train_df.sort_values('width').head(84)

<a id="color-analysis"></a>
# Color Analysis
We need to do some color analysis to get an idea of the augmentation techniques needed for this problem.

In [None]:
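def is_grey_scale(givenImage):
    # An image is greyscale only if R == G == B for every pixel
    w,h = givenImage.size
    for i in range(w):
        for j in range(h):
            r,g,b = givenImage.getpixel((i,j))
            # `r != g != b` would miss pixels such as (10, 10, 20)
            if not (r == g == b): return False
    return True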
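
Looping over pixels in Python is slow on large photos. A vectorized variant of the same check is sketched below with NumPy. The name `is_grey_scale_array` is hypothetical, and the input is assumed to be an RGB image or an `(H, W, 3)` array.

```python
import numpy as np

def is_grey_scale_array(rgb):
    """True when every pixel of an (H, W, 3) array has R == G == B."""
    rgb = np.asarray(rgb)
    return bool(np.all(rgb[..., 0] == rgb[..., 1]) and np.all(rgb[..., 1] == rgb[..., 2]))

grey = np.full((4, 4, 3), 128, dtype=np.uint8)
tinted = grey.copy()
tinted[0, 0, 2] = 200  # R == G but B differs, so this image is not greyscale
print(is_grey_scale_array(grey), is_grey_scale_array(tinted))
```

A PIL image converted with `.convert('RGB')` can be passed directly, since `np.asarray` turns it into an array of that shape.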

### Check color scale of Train images

In [None]:
sampleFrac = 0.1
#get our sampled images
isGreyList = []
for imageName in train_df['path'].sample(frac=sampleFrac):
    val = Image.open(imageName).convert('RGB')
    isGreyList.append(is_grey_scale(val))
print(np.sum(isGreyList) / len(isGreyList))
del isGreyList

### Check color scale of Test images

In [None]:
sampleFrac = 0.1
#get our sampled images
isGreyList_test = []
for imageName in pred_df['path'].sample(frac=sampleFrac):
    val = Image.open(imageName).convert('RGB')
    isGreyList_test.append(is_grey_scale(val))
print(np.sum(isGreyList_test) / len(isGreyList_test))
del isGreyList_test

### Get mean intensity for each channel RGB <a name="get-mean-intensity-for-each-channel-RGB"></a>

In [None]:
def get_rgb_mean(row):
    # Mean intensity of each channel; OpenCV loads BGR, so convert to RGB first
    img = cv2.imread(row['path'])
    img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
    return np.mean(img[:,:,0]), np.mean(img[:,:,1]), np.mean(img[:,:,2])

tqdm.pandas()
train_df['R'], train_df['G'], train_df['B'] = zip(*train_df.progress_apply(get_rgb_mean, axis=1))

In [None]:
def show_color_dist(df, count):
    fig, axr = plt.subplots(count,2,figsize=(15,15))
    for idx, i in enumerate(np.random.choice(df['path'], count)):
        img = cv2.imread(i)
        img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
        axr[idx,0].imshow(img)
        axr[idx,0].axis('off')
        axr[idx,1].set_title('R={:.0f}, G={:.0f}, B={:.0f} '.format(np.mean(img[:,:,0]), np.mean(img[:,:,1]), np.mean(img[:,:,2]))) 
        x, y = np.histogram(img[:,:,0], bins=255)
        axr[idx,1].bar(y[:-1], x, label='R', alpha=0.8, color='red')
        x, y = np.histogram(img[:,:,1], bins=255)
        axr[idx,1].bar(y[:-1], x, label='G', alpha=0.8, color='green')
        x, y = np.histogram(img[:,:,2], bins=255)
        axr[idx,1].bar(y[:-1], x, label='B', alpha=0.8, color='blue')
        axr[idx,1].legend()
        axr[idx,1].axis('off')

### Red images and their color distribution
Since we are picking random images, some images may appear multiple times.

In [None]:
df = train_df[((train_df['B']*1.05) < train_df['R']) & ((train_df['G']*1.05) < train_df['R'])]
show_color_dist(df, 8)

### Blue images and their color distribution

In [None]:
df = train_df[(train_df['B'] > 1.3*train_df['R']) & (train_df['B'] > 1.3*train_df['G'])]
show_color_dist(df, 8)

### Green images and their color distribution

In [None]:
df = train_df[(train_df['G'] > 1.05*train_df['R']) & (train_df['G'] > 1.05*train_df['B'])]
show_color_dist(df, 8)

<a id="observation-regarding-color-distribution"></a>
### Observation Regarding Color Distribution
1. We see that around 3% of the images in the training set are greyscale, compared with about 1% in the test set.
2. Some whales have yellow spots and some images are reddish, which can happen when photos are taken at sunset.
3. This suggests that we need image transformations that are largely agnostic to the RGB spectrum (e.g. bump up the number of greyscale images in the smaller classes).
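
One way to act on the last point is a random greyscale transform, sketched here. It is an illustration only and not part of this notebook's pipeline. The function names and the probability `p=0.2` are assumptions. The channel weights are the common ITU-R BT.601 luma weights.

```python
import numpy as np

def to_grey_rgb(img):
    """Collapse an (H, W, 3) uint8 image to luminance, kept as 3 equal channels."""
    weights = np.array([0.299, 0.587, 0.114])  # ITU-R BT.601 luma weights
    luma = np.clip(np.rint(img.astype(np.float64) @ weights), 0, 255).astype(np.uint8)
    return np.stack([luma, luma, luma], axis=-1)

def random_greyscale(img, p=0.2, rng=None):
    """Return a greyscale copy with probability p, otherwise the image unchanged."""
    rng = rng or np.random.default_rng()
    return to_grey_rgb(img) if rng.random() < p else img

img = np.random.default_rng(0).integers(0, 256, size=(8, 8, 3), dtype=np.uint8)
grey = random_greyscale(img, p=1.0)  # p=1.0 forces the conversion for the demo
print(grey.shape)
```

Keeping three identical channels means the output still has the shape an RGB model expects.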


<a id="data-augmentation"></a>
# Data Augmentation

Data augmentation helps prevent the model from overfitting by showing it the same image multiple times with slight modifications. We will use Keras's built-in `ImageDataGenerator` to augment our training images.

In [None]:
from keras.preprocessing.image import ImageDataGenerator
from numpy import expand_dims

In [None]:
def plot_augimages(paths, datagen):
    plt.figure(figsize = (14,28))
    plt.suptitle('Augmented Images')
    
    midx = 0
    for path in paths:
        # Force 3 channels so greyscale files also give a (1, 224, 224, 3) batch
        data = Image.open(path).convert('RGB')
        data = data.resize((224,224))
        samples = expand_dims(data, 0)
        it = datagen.flow(samples, batch_size=1)
    
        # Show Original Image
        plt.subplot(10,5, midx+1)
        plt.imshow(data)
        plt.axis('off')
    
        # Show Augmented Images
        for _ in range(4):
            midx += 1
            plt.subplot(10,5, midx+1)
            
            batch = next(it)
            image = batch[0].astype('uint8')
            plt.imshow(image)
            plt.axis('off')
        midx += 1
    
    plt.tight_layout()
    plt.show()

    
datagen = ImageDataGenerator(
    rotation_range=20,
    zoom_range=0.10,
    brightness_range=[0.6,1.4],
    channel_shift_range=0.7,
    width_shift_range=0.15,
    height_shift_range=0.15,
    shear_range=0.15,
    horizontal_flip=True,
    fill_mode='nearest'
) 
plot_augimages(np.random.choice(train_df['path'],10), datagen)

<a id="preprocessing"></a>
# Preprocessing
### Encoding Labels

In [None]:
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.compose import ColumnTransformer

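# Select columns by name rather than position, so added columns cannot shift them
X = train_df['path'].values
y = train_df['individual_id'].values

label_encoder = LabelEncoder()
y = label_encoder.fit_transform(y)
# Note: scikit-learn 1.2 renamed `sparse` to `sparse_output`; newer releases need the new name
onehot_encoder = OneHotEncoder(sparse=False)
y = y.reshape(len(y), 1)
y = onehot_encoder.fit_transform(y)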
In [None]:
y.shape
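
A dense one-hot matrix with one column per individual is large. The rough estimate below uses the image and individual counts quoted earlier in this notebook and assumes 8-byte floats, which I believe is the encoder's default dtype. It also sketches plain integer encoding as a lighter alternative. `encode_labels` is a hypothetical stand-in for what `LabelEncoder` already produces.

```python
n_images = 51033        # training images, as stated earlier in the notebook
n_individuals = 15587   # unique individual_id values, as stated earlier
bytes_per_value = 8     # assumption: float64 output

dense_gb = n_images * n_individuals * bytes_per_value / 1e9
print(f"dense one-hot matrix: about {dense_gb:.2f} GB")

def encode_labels(labels):
    """Map each distinct label to an integer index in sorted order."""
    classes = sorted(set(labels))
    index = {label: i for i, label in enumerate(classes)}
    return [index[label] for label in labels], classes

codes, classes = encode_labels(["b1", "a7", "b1", "c3"])
print(codes, classes)
```

Integer codes need one value per image and can be used with a loss that accepts class indices, if the chosen model supports one.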


# References
I used these awesome kernels for the EDA. Do check them out if you have time.

In [None]:
##https://www.kaggle.com/andradaolteanu/whales-dolphins-effnet-embedding-cos-distance
##https://www.kaggle.com/lextoumbourou/happy-whale-dolphin-q-a-style-eda
##https://www.kaggle.com/rednivrug/eda-for-whale-with-bounding-boxes/notebook
##https://www.kaggle.com/pestipeti/explanation-of-map5-scoring-metric