## <p style="background-color:#D9EDF7; font-family:newtimeroman; color:#31708F; font-size:120%; text-align:center; border: 2px; border-style:solid; border-color:#31708F; border-radius: 24px 0;">Import of Libraries</p>
<a id="Title"></a>

In [None]:
!pip install gapminder -q
!pip install openpyxl -q

In [None]:
class color:
    """
    Sets colors for printouts.
    """
    GREEN = '\033[92m'
    YELLOW = '\033[93m'
    RED = '\033[91m'
    BOLD = '\033[1m'
    UNDERLINE = '\033[4m'
    END = '\033[0m'


# Printouts color scheme.
g_, y_, r_ = color.GREEN, color.YELLOW, color.RED
bd_, un_, end_ = color.BOLD, color.UNDERLINE, color.END
yb_, gb_ = bd_+y_, bd_+g_
gbu_ = gb_+un_
cmap_ = ['#007427', '#B27D12']

In [None]:
import os
import numpy as np
import pandas as pd
from glob import glob

import matplotlib.pyplot as plt
import seaborn as sns
from gapminder import gapminder
from datetime import date
from tqdm.notebook import tqdm
from IPython.display import HTML
from os.path import getsize
import string
from PIL import Image
import cv2
from copy import deepcopy

import warnings
warnings.filterwarnings('ignore')

print(f'{yb_}\n[INFO] Libraries set up has been completed.{end_}')

## <p style="background-color:#D9EDF7; font-family:newtimeroman; color:#31708F; font-size:120%; text-align:center; border: 2px; border-style:solid; border-color:#31708F; border-radius: 24px 0;">Table of Contents</p>
1. [Introduction](#Introduction)
2. [Dataset overview](#Dataset)
3. [Data visualization and initial preprocessing](#Visualization)
4. [Exploratory Data Analysis](#Exploration)
5. [Bonus Feature Engineering](#FeatureEngineering)
6. [Appendix](#Appendix)

<a id='Introduction'></a>
# <p style="background-color:#D9EDF7; font-family:newtimeroman; color:#31708F; font-size:120%; text-align:center; border: 2px; border-style:solid; border-color:#31708F; border-radius: 24px 0;">1. Introduction</p>

The aim of this competition is to build a machine learning model to identify individual marine mammals (whales). For each image presented, the model should output the whale's individual_id or, if the image corresponds to a new whale (not present in the database), identify it as a new_individual.

Submissions are evaluated according to the Mean Average Precision @ 5 (MAP@5):

$$MAP@5 = \frac{1}{U}\sum\limits_{u=1}^U\sum\limits_{k=1}^{\min(n,5)}{P(k)} \times {rel(k)}$$

where ***U*** is the number of images, ***P(k)*** is the precision at cutoff ***k***, ***n*** is the number of predictions per image, and ***rel(k)*** is an indicator function equaling 1 if the item at rank ***k*** is a relevant (correct) label, zero otherwise.

Once a correct label has been scored for an *observation*, that label is no longer considered relevant for that observation, and additional predictions of that label are skipped in the calculation. For example, if the correct label is `A` for an observation, the predictions `[A, B, C, D, E]`, `[A, A, A, A, A]` and `[A, B, A, C, A]` all score an average precision of `1.0`.
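As a rough illustration of the metric, here is a minimal sketch of MAP@5 for the case where every image has exactly one correct label. Under that assumption the inner sum reduces to `1 / rank` of the first correct prediction. The function names `average_precision_at_5` and `map_at_5` are hypothetical helpers, not part of the competition code.

```python
def average_precision_at_5(true_label, predictions):
    """Scores one image: 1 / rank of the first correct label in the top 5."""
    for rank, label in enumerate(predictions[:5], start=1):
        if label == true_label:
            # Only the first hit counts; later repeats are skipped.
            return 1.0 / rank
    return 0.0


def map_at_5(true_labels, predictions):
    """Averages the per-image scores over all images."""
    scores = [
        average_precision_at_5(t, p)
        for t, p in zip(true_labels, predictions)
    ]
    return sum(scores) / len(scores)


# Toy labels, made up for the example.
score = map_at_5(['A', 'B'], [['A', 'C'], ['C', 'B']])
```

In this toy case the first image is correct at rank 1 and the second at rank 2, so the two per-image scores are 1.0 and 0.5.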

<a href="#Title" role="button" aria-pressed="true" >Back to the beginning 🔙</a>

<a id='Dataset'></a>
# <p style="background-color:#D9EDF7; font-family:newtimeroman; color:#31708F; font-size:120%; text-align:center; border: 2px; border-style:solid; border-color:#31708F; border-radius: 24px 0;">2. Dataset overview</p>

The dataset provided with this challenge consists of the following files:
- `train.csv` containing 3 columns:
  - `image`: image identifier, mapping to `.jpg` image files located in the `train_images` subfolder.
  - `species`: [this indicates the basic unit of classification and a taxonomic rank of an organism in biology](https://en.wikipedia.org/wiki/Species).
  - `individual_id`: unique whale identifier (what your model will be trained to predict given the image).


- `sample_submission.csv`: example submission file showing the two-column format for submissions:
  - The first column is an `image`.
  - The second column, `predictions`, is a space-separated list of up to 5 entries: your top predictions for the identity of the whale shown in the image.
  - Submission format example: "37c7aba965a5 114207cab555 a6e325d8e924 19fbb960f07d".
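A small sketch of how a ranked list of candidate ids might be turned into a `predictions` string. The helper `format_predictions` is hypothetical. It removes repeated ids (repeats earn nothing under MAP@5) and keeps at most five entries.

```python
def format_predictions(candidates, max_len=5):
    """Joins ranked candidate ids into a space-separated string."""
    unique = []
    for candidate in candidates:
        # Keeps the first occurrence only, preserving the ranking order.
        if candidate not in unique:
            unique.append(candidate)
    return ' '.join(unique[:max_len])


# Toy ids, made up for the example.
row = format_predictions(['id_a', 'id_b', 'id_a', 'new_individual'])
```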
  
To get a sense of the data, here are a few examples from the training dataset:

<p align="left">
  <img src="https://drive.google.com/uc?export=view&id=1byyYzli8ckaqTSylg9YIyGrakdTSuRZW"/>
</p>

You may already see some of the potential challenges inherent in this dataset:
- **Whale positioning**. Pictures can be taken from different distances and angles, and the fin position can vary.
- **Image dimensions**. Image dimensions and aspect ratios vary, e.g. the "top" image is in a wide format, whereas the other two are close to a 4:3 format.
- **Image quality**. The images range from blurry to sharp.

Looking through other images, you'll find other sources of diversity:
- Differences in backgrounds and lighting conditions.
- Extraneous objects in the image such as annotations, flora in the background.
- Whales can have a varying portion of the body out of the water.

Here is a collage of some raw, unreshaped image samples from the training dataset:

<p align="center">
  <img src="https://drive.google.com/uc?export=view&id=1N7ooaeaMnyS98YQZ7IS_oYmRl6LwPIhN" width="800"/>
</p>

## <p style="background-color:#D9EDF7; font-family:newtimeroman; color:#31708F; font-size:120%; text-align:center; border: 2px; border-style:solid; border-color:#31708F; border-radius: 24px 0;">Data Loading</p>

In [None]:
train = pd.read_csv('../input/happy-whale-and-dolphin/train.csv')
sub = pd.read_csv('../input/happy-whale-and-dolphin/sample_submission.csv')
train.head(3)

## <p style="background-color:#D9EDF7; font-family:newtimeroman; color:#31708F; font-size:120%; text-align:center; border: 2px; border-style:solid; border-color:#31708F; border-radius: 24px 0;">Quick Sanity Check</p>

In [None]:
p_trn = r'../input/happy-whale-and-dolphin/train_images'
p_tst = r'../input/happy-whale-and-dolphin/test_images'
train_ims = os.listdir(p_trn)
test_ims = os.listdir(p_tst)

# Takes the part after the last dot as the file extension.
splt_trn = [i.rsplit('.', 1)[-1] for i in train_ims]
splt_tst = [i.rsplit('.', 1)[-1] for i in test_ims]

print(f'\n{yb_}[+] Train_images ext.: {gb_}{set(splt_trn)}{end_}.')
print(f'{yb_}[+] Test_images ext.: {gb_}{set(splt_tst)}{end_}.\n')

print(f'{yb_}[+] Number of train images: {gbu_}{len(splt_trn)}{end_}.')
print(f'{yb_}[+] Number of test images: {gbu_}{len(splt_tst)}{end_}.\n')

# Counts images present in the folder but not listed in train.csv.
out_of_folder = len(set(train_ims) - set(train.image))
inter = len(set(train_ims).intersection(test_ims))

print(f'{yb_}[+] Number of images not in train.csv: {gbu_}{out_of_folder}{end_}.')
print(f'{yb_}[+] Images in train/test intersection: {gbu_}{inter}{end_}.')

<a href="#Title" role="button" aria-pressed="true" >Back to the beginning 🔙</a>

<a name="Visualization"></a>

# <p style="background-color:#D9EDF7; font-family:newtimeroman; color:#31708F; font-size:120%; text-align:center; border: 2px; border-style:solid; border-color:#31708F; border-radius: 24px 0;">3. Data visualization and initial preprocessing</p>

## <p style="background-color:#D9EDF7; font-family:newtimeroman; color:#31708F; font-size:120%; text-align:center; border: 2px; border-style:solid; border-color:#31708F; border-radius: 24px 0;">Data Cleaning</p>

**Fixing Duplicate Labels:**

 - `bottlenose_dolpin` => `bottlenose_dolphin`.
 - `kiler_whale` => `killer_whale`.
 - `beluga` => `beluga_whale`.

**Merging labels due to extreme similarity:**
    
 - `globis` and `pilot_whale` => `short_finned_pilot_whale`.
 
**References:**
    
 - [Discussion: "Fix all known species column problems"](https://www.kaggle.com/c/happy-whale-and-dolphin/discussion/305574) by @kwentar.

 - [Notebook: "Happywhale -⚡️ EDA + Augmentation + CNN 🔥🔥"](https://www.kaggle.com/sahamed/happywhale-eda-augmentation-cnn) by @ahamed.

 - [Discussion: "'short_finned' vs 'long_finned' vs 'pilot_whale'"](https://www.kaggle.com/c/happy-whale-and-dolphin/discussion/305909) by @andradaolteanu.

In [None]:
before = train.species.unique()

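# Assigns the result back instead of relying on a chained
# in-place replace, which newer pandas versions may not apply.
train['species'] = train['species'].replace({
    "globis": "short_finned_pilot_whale",
    "pilot_whale": "short_finned_pilot_whale",
    "kiler_whale": "killer_whale",
    "bottlenose_dolpin": "bottlenose_dolphin",
    "beluga": "beluga_whale",
})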

after = train.species.unique()

print(
    f'\n{yb_}[+] Before fixing duplicate labels,'
    f' unique species: {gbu_}{before.shape[0]}{end_}.'
)
print(
    f'{yb_}[+] After fixing duplicate labels,'
    f' unique species: {gbu_}{after.shape[0]}{end_}.'
)

The dataset originally contained 30 distinct species labels. Following the recommendations from the discussions and kernels above, we merged the duplicate and near-identical labels, bringing the total number to 26.

<a href="#Title" role="button" aria-pressed="true" >Back to the beginning 🔙</a>

## <p style="background-color:#D9EDF7; font-family:newtimeroman; color:#31708F; font-size:120%; text-align:center; border: 2px; border-style:solid; border-color:#31708F; border-radius: 24px 0;">Data Enriching: Adding Species Summary</p>

Several notebooks describe the species in detail or show each species separately. What has not been done yet is a detailed summary table of the species. I have compiled the most important information below.

**A few references:**
    
 - [Notebook: "🐋 and 🐬 Identification: EDA + Augmentation"](https://www.kaggle.com/ruchi798/and-identification-eda-augmentation) by @ruchi798.

 - [Notebook: "What about species?"](https://www.kaggle.com/kwentar/what-about-species)  by @kwentar.

In [None]:
p = '../input/whales-summary/Whales Summary.xlsx'
whales_summary = pd.read_excel(p, engine='openpyxl')
whales_summary.head(5)

In [None]:
order = [
    'species', 'species_latin', 'size_to_human_img',
    'body_length', 'body_weight','conserv_status', 
    'concerv_dscr', 'lifespan', 'size_to_human',
    'img', 'wiki_link'
]

whales_summary = whales_summary[order]

I have rendered the species summary table below. Click on an image in the table to expand it. You may find it useful to read about the table rendering technique in the reference below.

**Reference:** 
 - [Article: "Rendering Images inside a Pandas DataFrame"](https://towardsdatascience.com/rendering-images-inside-a-pandas-dataframe-3631a4883f60) by Tanu N Prabhu.

In [None]:
def path_to_image_html(path):
    return '<img src="'+ path + '" width="90" >'

def make_clickable(link):
    return '<a href="%s" target="_blank">%s</a>' % (link, link)

frmts = dict(
    size_to_human_img=path_to_image_html, 
    img=path_to_image_html, 
    wiki_link=make_clickable
)

HTML(whales_summary.to_html(escape=False, formatters=frmts))

**Fields and units of measurement:**

 - `body_length` is given in meters.
 - `body_weight` is given in metric tons.
 - `lifespan` is given in years.
 - `size_to_human` is the ratio of the upper bound of `body_length` to the average human height of 1.7526 meters.
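The `size_to_human` definition can be written as a one-line calculation. This is a sketch of the ratio as described above. The function name and the example length are made up for illustration and are not taken from the summary table.

```python
AVG_HUMAN_HEIGHT_M = 1.7526  # average human height used for the ratio


def size_to_human(body_length_upper_m):
    """Returns how many average human heights fit into the body length."""
    return body_length_upper_m / AVG_HUMAN_HEIGHT_M


# Hypothetical upper bound of 17.526 m, chosen to give a round ratio.
ratio = size_to_human(17.526)
```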


<a href="#Title" role="button" aria-pressed="true" >Back to the beginning 🔙</a>

## <p style="background-color:#D9EDF7; font-family:newtimeroman; color:#31708F; font-size:120%; text-align:center; border: 2px; border-style:solid; border-color:#31708F; border-radius: 24px 0;">Data Enriching: Adding Image Size, Subset, Size Bins, Height, Width and Aspect Ratio</p>

In [None]:
def size_get(df, im_dir):
    """
    Gets a size in MB per each image in a list.
    Creates 'image_size' col in the source DataFrame.
    :param df: pd.DataFrame with col 'image' (base names)
    :param im_dir: str
    :return: pd.DataFrame enriched
    """
    
    f = lambda x: os.path.join(im_dir, x)
    im_pathes = df.image.apply(f)
    im_sizes = [getsize(p)/1e6 for p in tqdm(im_pathes)]
    df['image_size'] = im_sizes
    
    return df


def sizebin_get(df):
    """
    Enriches DataFrame with size_bin col, e.g. '6>x=>5' MB.
    :param df: pd.DataFrame with col 'image_size' (MB)
    """
    
    sz = df.image_size
    
    df.loc[(sz >= 5) & (sz < 6), ['size_bin']] = '6>x=>5'  
    df.loc[(sz >= 4) & (sz < 5), ['size_bin']] = '5>x=>4'  
    df.loc[(sz >= 3) & (sz < 4), ['size_bin']] = '4>x=>3'  
    df.loc[(sz >= 2) & (sz < 3), ['size_bin']] = '3>x=>2'  
    df.loc[(sz >= 1) & (sz < 2), ['size_bin']] = '2>x=>1'  
    df.loc[sz < 1, ['size_bin']] = 'x<1' 
    
    return df
    

def h_w_get(df, root='data'):
    """
    Gets height and width for each image in a list.
    Creates 'height' and 'width' cols in the source DataFrame.
    :param df: pd.DataFrame with cols 'image' (base names) and 'subset'
    :param root: str, folder holding the '{subset}_images' dirs
    :return: pd.DataFrame enriched
    """

    widths, heights = [], []
    pairs = zip(df.image, df.subset)

    # Walks the rows once instead of masking the frame per image.
    for base, subset in tqdm(pairs, total=df.shape[0]):

        path = os.path.join(root, f'{subset}_images', base)

        # PIL reports size as (width, height).
        with Image.open(path) as im:
            width, height = im.size

        widths.append(width)
        heights.append(height)

    df['width'] = widths
    df['height'] = heights

    return df


# Constructs a dummy train and test DataFrame.
# Demonstrates how the size_get function works.
test_dummy = pd.DataFrame({'image': test_ims})
train_dummy = deepcopy(size_get(train, p_trn))
test_dummy = size_get(test_dummy, p_tst)
train.head(3)

To save time and keep the notebook fast, I have commented out the data enriching functions that compute the size bins, height and width of the images. Instead, I am going to read in an already prepared table.
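For reference, the size-bin assignment could also be expressed with `pd.cut`. This is an alternative sketch, not the function used to build the prepared table. The helper name `sizebin_cut` is hypothetical, and the labels are copied from `sizebin_get`.

```python
import pandas as pd


def sizebin_cut(image_size_mb):
    """Maps sizes in MB to the same labels that sizebin_get uses."""
    edges = [0, 1, 2, 3, 4, 5, 6]
    labels = ['x<1', '2>x=>1', '3>x=>2', '4>x=>3', '5>x=>4', '6>x=>5']
    # right=False makes each bin closed on the left: [1, 2), [2, 3), ...
    return pd.cut(image_size_mb, bins=edges, labels=labels, right=False)


# Toy sizes in MB, made up for the example.
bins = sizebin_cut(pd.Series([0.2, 1.0, 5.9]))
```

As in `sizebin_get`, sizes of 6 MB or more fall outside the bins and are left as missing values.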

In [None]:
# Creates a dummy table with cols 'image', 'image_size'.
cols = ['image', 'image_size']
df_images = pd.concat([train_dummy[cols], test_dummy[cols]])
df_images['subset'] = np.ones_like(df_images.shape[0])

# df_images = sizebin_get(df_images)
# df_images = h_w_get(df_images)

df_images = pd.read_csv('../input/image-sizes/image_sizes.csv')
df_images['aspect_ratio'] = df_images['width']/df_images['height']
df_images.head(3)

<a href="#Title" role="button" aria-pressed="true" >Back to the beginning 🔙</a>

## <p style="background-color:#D9EDF7; font-family:newtimeroman; color:#31708F; font-size:120%; text-align:center; border: 2px; border-style:solid; border-color:#31708F; border-radius: 24px 0;">Data Enriching: Final Train Table Enriched</p>

In [None]:
train.drop(columns=['image_size'], inplace=True)
mask_subset = (df_images.subset == 'train')
df_images_tr = df_images[mask_subset]
train_enrch = train.merge(df_images_tr, how='left', on='image')
train_enrch = train_enrch.merge(whales_summary, how='left', on='species')

In [None]:
train_enrch

This is the final view of the dataset with the extra metadata. I am going to use it during the data exploration phase.

<a href="#Title" role="button" aria-pressed="true" >Back to the beginning 🔙</a>

<a name="Exploration"></a>

# <p style="background-color:#D9EDF7; font-family:newtimeroman; color:#31708F; font-size:120%; text-align:center; border: 2px; border-style:solid; border-color:#31708F; border-radius: 24px 0;">4. Exploratory Data Analysis</p>

Observations from the visualization and preprocessing part:

* The image dataset consists of two major subsets:
 - `train images`: **51033** image(s).
 - `test images`: **27956** image(s).


* The train dataset is relatively big, with an imbalanced number of images per `species` and `individual_id`.


---
* Let's question our data:

 1. What are the `species` and `individual_id` statistics?
 2. What is the memory usage per image, species and major subset?
 3. Are there outliers based on memory usage?
 4. Are there outliers based on aspect ratio?
 5. Are there blurred or distorted images?

## <p style="background-color:#D9EDF7; font-family:newtimeroman; color:#31708F; font-size:120%; text-align:center; border: 2px; border-style:solid; border-color:#31708F; border-radius: 24px 0;">4.1 What are the species and individual_id statistics?</p>

In [None]:
species_order = train['species'].value_counts().index
species_count = train.species.value_counts()

plt.figure(figsize=(10, 6))
s = sns.countplot(
    data=train, y='species', palette='crest',
    order=species_order
)

s.spines['top'].set_visible(False)
s.spines['right'].set_visible(False)

for i, v in enumerate(species_count):
    s.text(v, i+0.2, str(v), color='black', 
           fontweight='bold', fontsize=12)
    
plt.show()

* Observation:
 - The species are extremely imbalanced.
* Assumptions:
 - Focal loss might be helpful.
 - Data oversampling or downsampling might be helpful.
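To make the focal loss idea concrete, here is a minimal NumPy sketch of the multi-class focal loss, `-(1 - p_t)^gamma * log(p_t)`, without the optional class-weighting term. It illustrates the formula only and is not a training-ready implementation. The function name and the toy probabilities are assumptions made for the example.

```python
import numpy as np


def focal_loss(probs, targets, gamma=2.0):
    """Mean focal loss for predicted class probabilities.

    probs: array of shape (n_samples, n_classes), rows sum to 1.
    targets: integer class index per sample.
    """
    probs = np.asarray(probs, dtype=float)
    targets = np.asarray(targets)
    # Probability assigned to the true class of each sample.
    p_t = probs[np.arange(len(targets)), targets]
    p_t = np.clip(p_t, 1e-12, 1.0)
    # Well-classified samples (p_t close to 1) are down-weighted.
    return float(np.mean(-((1.0 - p_t) ** gamma) * np.log(p_t)))


toy_probs = [[0.9, 0.1], [0.2, 0.8]]
toy_targets = [0, 1]
loss_focal = focal_loss(toy_probs, toy_targets, gamma=2.0)
loss_ce = focal_loss(toy_probs, toy_targets, gamma=0.0)  # plain cross-entropy
```

With `gamma=0` the expression reduces to the usual cross-entropy. Larger `gamma` values shrink the contribution of easy samples, which is why the loss is often suggested for imbalanced classes.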

Let's take a look at `individual_id`. I have used a "dolphin vs whale" mapping introduced in the notebook below.

**References:**
    
 - [Notebook: "🐋 and 🐬 Identification: EDA + Augmentation"](https://www.kaggle.com/ruchi798/and-identification-eda-augmentation) by @ruchi798.

In [None]:
f_ = lambda x: 'dolphin' if 'dolphin' in x else 'whale'
train_enrch[''] = train_enrch.species.map(f_)
fig, axes  = plt.subplots(figsize=(8, 4))
fig.suptitle('Whales vs Dolphins', size=20, weight='bold', font='Serif')

explode = (0.05, 0.05)
labels = list(train_enrch[''].value_counts().index)
sizes = train_enrch[''].value_counts().values
text_props = {'fontsize': 12, 'weight': 'bold', 'font': 'Serif'}

axes.pie(
    sizes, explode=explode, startangle=60, 
    labels=labels, autopct='%1.0f%%', pctdistance=0.7, 
    colors=cmap_, textprops=text_props)

axes.add_artist(plt.Circle((0,0),0.4,fc='white'))

plt.show()

In [None]:
mask_dolphin = (train_enrch[''] == 'dolphin')
mask_whale = (train_enrch[''] == 'whale')
train_iid_d = train_enrch[mask_dolphin].individual_id.value_counts()[:100]
train_iid_w = train_enrch[mask_whale].individual_id.value_counts()[:100]

df_stack = [train_iid_d, train_iid_w]
df_names = ['Train images per dolphin', 'Train images per whale']

def set_xlabel(df, df_name):
    mean = round(np.mean(df), 2)
    median = int(np.median(df))
    f_str = f'\n{df_name} mean: {mean}, median {median}.'
    return f_str

fig, axes = plt.subplots(ncols=2, figsize=(12, 5))
fig.suptitle(
    'Whales vs Dolphins KDE TOP(100)', size=20, 
    weight='bold', font='Serif'
)

for i, (df, df_name) in enumerate(zip(df_stack, df_names)):

    xlabel = set_xlabel(df, df_name)
    sns.histplot(df, color='#007427', stat='density', ax=axes[i])
    sns.kdeplot(df, color="r", ls='--', lw=2, ax=axes[i])
    axes[i].set_xlabel(xlabel)
    axes[i].set_xlim([0, 200])
    
    
plt.tight_layout()
plt.show()

In [None]:
fig, axes = plt.subplots(ncols=2, figsize=(20, 21))
fig.suptitle(
    'Whales vs Dolphins INDIVIDUAL_ID COUNT TOP(100)', size=20, 
    weight='bold', font='Serif'
)

for i, (df, df_name) in enumerate(zip(df_stack, df_names)):

    sns.barplot(x=df, y=df.index, ax=axes[i],
              palette='crest', orient='horizontal')

    title = set_xlabel(df, df_name)
    axes[i].set_title(title)
    for j, v in enumerate(df):
        axes[i].text(v+1, j+0.3, str(v), 
                     color='black', fontweight='bold')
        
plt.tight_layout()
plt.show()

* Observations and clarification:
 - The plots cover the top 100 individuals by image count, separately for dolphins and whales.
 - `individual_ids` are imbalanced for both whale and dolphin species.
 - `individual_ids` for whales are distributed differently in comparison to dolphins.
 - `individual_ids` for whales are overrepresented in comparison to dolphins.
 
* Assumption:
 - It might be helpful to drill into the acquired metadata to find the patterns behind these differences.

<a href="#Title" role="button" aria-pressed="true" >Back to the beginning 🔙</a>

## <p style="background-color:#D9EDF7; font-family:newtimeroman; color:#31708F; font-size:120%; text-align:center; border: 2px; border-style:solid; border-color:#31708F; border-radius: 24px 0;">4.2 What is the memory usage per images, species and major subsets?</p>

In [None]:
a = train_dummy.describe().T
b = test_dummy.describe().T

new_index = {0: 'train', 1: 'test'}
float_fmt = {'count': '{:.0f}'}
size_stats = pd.concat([a, b], ignore_index=True).rename(new_index)
size_stats.style.background_gradient(cmap='Greens_r').format(float_fmt)

In [None]:
def size_scatter(df, c=None, verbose=True):
    """
    Draws a scatter plot of image sizes.
    Prints out image count for a specific size bin.
    :param df: pd.DataFrame with col image_size.
    :param c: str (plot color optional)
    :param verbose: bool (printouts)
    :return: None
    """
    sns.scatterplot(x=df.index, y=df.image_size, color=c)
    plt.title('Sizes of image files in MB.')
    plt.show()
    print('\n')
    
    if verbose:
        for size in range(6, -1, -1):
            mask1 = (df.image_size >= size)
            mask2 = (df.image_size < size + 1)
            total = df[mask1 & mask2].shape[0]
            print(
              f'{yb_}[+] Images {gb_}{size+1}{yb_}'
              f' > size >= {gb_}{size} {yb_}MB: {gbu_}{total}{end_}.'
            )
    
    
size_scatter(train_dummy, c=cmap_[0])
size_scatter(test_dummy, c=cmap_[1])

In [None]:
mask3 = df_images.image.isin(train_dummy.image)
mask4 = df_images.image.isin(test_dummy.image)
df_images.loc[mask3, ['subset']] = 'train'
df_images.loc[mask4, ['subset']] = 'test'

plt.figure(figsize=(10, 6))
sns.scatterplot(
    data=df_images, x=df_images.index, y='image_size', 
    hue='subset', palette=cmap_
)
plt.title('Sizes of image files in MB by subset.')
plt.legend(bbox_to_anchor=(1.02, 1), loc='upper left', borderaxespad=0)
plt.show()

* Observation:

 It looks like the `train` and `test` image sizes share the same distribution. In this respect the images are homogeneous: `train` and `test` have almost identical mean, std and min values and differ slightly in the max.
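One possible way to put a number on "same distribution" is the two-sample Kolmogorov-Smirnov statistic, i.e. the largest gap between the two empirical CDFs. The sketch below is a NumPy-only illustration with made-up sizes. It is not part of the original analysis, and `ks_statistic` is a hypothetical helper.

```python
import numpy as np


def ks_statistic(a, b):
    """Largest absolute gap between the empirical CDFs of a and b."""
    a = np.sort(np.asarray(a, dtype=float))
    b = np.sort(np.asarray(b, dtype=float))
    # Evaluates both CDFs at every observed value.
    grid = np.concatenate([a, b])
    cdf_a = np.searchsorted(a, grid, side='right') / a.size
    cdf_b = np.searchsorted(b, grid, side='right') / b.size
    return float(np.max(np.abs(cdf_a - cdf_b)))


# Toy file sizes in MB, made up for the example.
same = ks_statistic([0.1, 0.4, 0.9], [0.1, 0.4, 0.9])
disjoint = ks_statistic([0.1, 0.2], [3.0, 4.0])
```

A value near 0 means the two samples have very similar distributions, while a value near 1 means they barely overlap. On the real data the inputs would be the `image_size` columns of `train_dummy` and `test_dummy`.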

<a href="#Title" role="button" aria-pressed="true" >Back to the beginning 🔙</a>

## <p style="background-color:#D9EDF7; font-family:newtimeroman; color:#31708F; font-size:120%; text-align:center; border: 2px; border-style:solid; border-color:#31708F; border-radius: 24px 0;">4.2*Bonus image Width x Height plotting</p>

In [None]:
fig, axes = plt.subplots(ncols=1, figsize=(12, 9))
sns.scatterplot(data=train_enrch, x='width', y='height', hue='species')
axes.set_title(f'Image Height x Width by species.')

plt.tight_layout()
plt.show()

In [None]:
fig, axes = plt.subplots(ncols=1, figsize=(12, 6))
sns.scatterplot(data=train_enrch, x='width', y='height', hue='', palette=cmap_)
axes.set_title(f'Image Height x Width Dolphin vs Whale.')

plt.tight_layout()
plt.show()

In [None]:
fig, axes = plt.subplots(ncols=2, figsize=(14, 6))
for i, col in enumerate(['subset', 'size_bin']):
    sns.scatterplot(data=df_images, x='width', 
                  y='height', hue=col, ax=axes[i])
    axes[i].set_title(f'Image Height x Width by {col}.')

plt.tight_layout()
plt.show()

# Constructs a list of subsets for
# masking data by a subset and creates plots.
_subsets = ['train', 'test']

plt.figure(figsize=(14, 12))
for i, subset in enumerate(_subsets):
    ax = plt.subplot(2, 2, i + 1)
    mask6 = df_images.subset == subset
    sns.scatterplot(data=df_images[mask6], x='width', 
                  y='height', hue='size_bin', ax=ax)

    # Basic stats:
    mean_h = df_images[mask6]['height'].mean()
    mean_w = df_images[mask6]['width'].mean()
    median_h = df_images[mask6]['height'].median()
    median_w = df_images[mask6]['width'].median()

    ax.set_title(
      f'Image Height x Width by subset: {subset}.\n'
      f'H x W mean: ({mean_h:.2f}, {mean_w:.2f}).\n'
      f'H x W median: ({median_h:.2f}, {median_w:.2f}).'
    )

plt.tight_layout()
plt.show()

* Observations and clarification:
 - The point clouds lean to the right, toward greater width than height, i.e. aspect ratio > 1.
 - There seems to be a correlation between wide-format images and whales.
 - Dolphin images tend to be less wide than whale images.
 - Most of the images are less than 1 MB in size.
* Assumptions:
 - It might be helpful to look at the pictures of different bins and outliers based on the image size.
 - It might be helpful to drill into the acquired metadata to find the patterns behind these differences.

<a href="#Title" role="button" aria-pressed="true" >Back to the beginning 🔙</a>

## <p style="background-color:#D9EDF7; font-family:newtimeroman; color:#31708F; font-size:120%; text-align:center; border: 2px; border-style:solid; border-color:#31708F; border-radius: 24px 0;">4.3 Are there outliers based on memory usage?</p>

In [None]:
plt.figure(figsize=(14, 5))
for i in range(5, 0, -1):

    # Keeps only the images inside the bin named in the title.
    mask5 = (df_images.image_size >= i) & (df_images.image_size < i + 1)
    a = df_images[mask5].subset.value_counts()

    plt.subplot(3, 2, 5-i+1)
    p = sns.barplot(x=a, y=a.index, orient='horizontal', palette=cmap_)
    for j, v in enumerate(a):
        p.text(v+0.2, j+0.1, str(v), 
                color='black', fontweight='bold')

    plt.title(f'Images by subsets: {i+1} > size >= {i} MB.')
    p.spines['top'].set_visible(False)
    p.spines['right'].set_visible(False)
    
plt.tight_layout()
plt.show()

* Observations:
 - There are not many extreme size cases. 
 - Most of the cases are in the train dataset.
* Assumption:
 - It might be helpful to look at the outliers based on the image size. 

In [None]:
def plot_images(df_images, sample_size, field='image_size', field_threshold=None):
    """
    Plots a grid of images based on a field.
    :param df_images: pd.DataFrame (col1: image, col2: field, col3: subset)
    :param sample_size: int
    :param field: str, column used for filtering and in the titles
    :param field_threshold: None or tuple(lower_bound, upper_bound), exclusive
    :return: None
    """

    if field_threshold:
        mask_lower = df_images[field] > field_threshold[0]
        mask_upper = df_images[field] < field_threshold[1]
        outliers = df_images[mask_lower & mask_upper]
        # Caps the sample at the number of matching rows so that
        # .sample() does not raise when the range holds few images.
        samples = outliers.sample(min(sample_size, len(outliers)))
    else: 
        samples = df_images.sample(sample_size)

    plt_size = int(np.sqrt(sample_size))
    if plt_size**2 < sample_size:
        plt_size = plt_size + 1

    figsize = (plt_size*6, plt_size*6)
    fig = plt.figure(figsize=figsize)
    fig.suptitle(
        f'Images by {field}, threshold: {field_threshold}', 
        size=20, weight='bold', font='Serif'
    )
    
    _zip = zip(
    samples.image, 
    samples[field], 
    samples.subset
    )
    
    dir_ = '../input/happy-whale-and-dolphin/'
    for idx, (image_id, size, subset) in enumerate(_zip):
        plt.subplot(plt_size, plt_size, idx + 1)
        image = cv2.imread(os.path.join(f'{dir_}{subset}_images', image_id))
        image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
        plt.imshow(image)
        
        if field == 'image_size':
            title = f'Image: {image_id}, {field}: {size:.2f} MB,\n subset: {subset}.'
        else:
            title = f'Image: {image_id}, {field}: {size:.2f},\n subset: {subset}.'
        plt.title(title, fontdict={'fontweight': 'bold'})
        plt.axis("off")
        
    plt.tight_layout()
    plt.show()

In [None]:
plot_images(df_images, 9, field_threshold=(3, 6))

In [None]:
plot_images(df_images, 9, field_threshold=(1, 2))

In [None]:
plot_images(df_images, 9, field_threshold=(0, 0.003))

<a href="#Title" role="button" aria-pressed="true" >Back to the beginning 🔙</a>

## <p style="background-color:#D9EDF7; font-family:newtimeroman; color:#31708F; font-size:120%; text-align:center; border: 2px; border-style:solid; border-color:#31708F; border-radius: 24px 0;">4.4 Are there outliers based on aspect ratio?</p>

In [None]:
plt.figure(figsize=(10, 6))

# Vertical line for the square images.
plt.vlines(
    x=1, ymin=0, ymax=df_images.shape[0], colors='red', 
    ls=':', lw=4, label='aspect_ratio: 1'
)

# Vertical line for extremely wide images.
plt.vlines(
    x=10, ymin=0, ymax=df_images.shape[0], colors='red', 
    ls='--', lw=3, label='aspect_ratio: 10')

sns.scatterplot(
    data=df_images, x=df_images['aspect_ratio'], y=df_images.index, 
    hue='subset', palette=cmap_
)

plt.legend(
    bbox_to_anchor=(1.02, 1), 
    loc='upper left', 
    borderaxespad=0
)

# Matches the y-range to the number of rows so no image is cut off.
plt.ylim(0, df_images.shape[0])
plt.xticks(list(range(15)))
plt.title('Aspect ratio of image files by subset.')
plt.show()

In [None]:
c = df_images[df_images.subset=='train'].describe()
d = df_images[df_images.subset=='test'].describe()

# Rearranges and renames the header.
cols = [
    'image_size_x', 'image_size_y', 
    'width_x', 'width_y', 
    'height_x', 'height_y',
    'aspect_ratio_x', 'aspect_ratio_y'
]

new_cols = dict()
for i in cols:
    if i[-1] == 'x':
        value = f'{i[:-1]}train'
        new_cols[i] = value
    else:
        value = f'{i[:-1]}test'
        new_cols[i] = value

mrgd = c.merge(d, how='left', left_index=True, right_index=True)[cols]
mrgd.rename(columns=new_cols).style\
.background_gradient(cmap='YlGn', subset=['image_size_train','image_size_test'])\
.background_gradient(cmap='Greens_r', subset=['width_train','width_test'])\
.background_gradient(cmap='YlGn', subset=['height_train','height_test']).format('{:.3f}')\
.background_gradient(cmap='YlGn', subset=['aspect_ratio_train', 'aspect_ratio_test']).format('{:.3f}')

* Observations:
 - Basic stats for the `train` and `test` images are almost indistinguishable.
 - There are extreme cases of aspect ratio in the dataset (aspect ratio > 10).
 - The entire dataset tends toward wide format images.
* Assumption:
 - It might be helpful to look at the outliers based on the aspect ratio.
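A minimal sketch of how the extreme cases could be flagged in a table with `width` and `height` columns. The threshold of 10 follows the observation above. The helper name `flag_aspect_outliers`, the `is_extreme` column and the toy rows are assumptions made for the example.

```python
import pandas as pd


def flag_aspect_outliers(df, upper=10.0):
    """Adds 'aspect_ratio' and a boolean 'is_extreme' column."""
    ratio = df['width'] / df['height']
    # assign() returns a new frame and leaves the input untouched.
    return df.assign(aspect_ratio=ratio, is_extreme=ratio > upper)


# Toy image dimensions, made up for the example.
toy = pd.DataFrame({
    'image': ['wide.jpg', 'regular.jpg'],
    'width': [3000, 400],
    'height': [200, 300],
})
flagged = flag_aspect_outliers(toy)
```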

In [None]:
plot_images(df_images, 6, field='aspect_ratio', field_threshold=(13, 15))

In [None]:
plot_images(df_images, 9, field='aspect_ratio', field_threshold=(0.4, 0.6))

* Observations:
 - Looking at the extreme cases of aspect ratio, it is easy to notice that wide pictures are common for the big whales.
 - Most of the dolphin pictures seem to be taken on edge devices. It seems reasonable to assume that encounters with huge whales are rare, so those pictures are taken panoramically from a distance, most likely with a professional camera.
* Assumption:
 - If we can confirm the correlation between the size of the whale and the aspect ratio, it might be used during postprocessing.

<a href="#Title" role="button" aria-pressed="true" >Back to the beginning 🔙</a>

## <p style="background-color:#D9EDF7; font-family:newtimeroman; color:#31708F; font-size:120%; text-align:center; border: 2px; border-style:solid; border-color:#31708F; border-radius: 24px 0;">4.4*Bonus Aspect Ratio and Whale Size correlation</p>

In [None]:
train_enrch

In [None]:
# Builds two aggregated views of aspect ratio and image size.
group1 = ['conserv_status', 'species', 'size_to_human']
agg_f = ['count', 'min','max', 'mean', 'std']
data1 = train_enrch.groupby(by=group1)['aspect_ratio']\
                .agg(agg_f).sort_values(by='mean', ascending=False)

data2 = train_enrch.groupby(by=group1)['image_size']\
                .agg(agg_f).sort_values(by='mean', ascending=False)

data2_styler = data2.style.background_gradient(cmap='Greens').format('{:.3f}')

In [None]:
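# Restricts the correlation to numeric columns; newer pandas
# versions raise on non-numeric columns otherwise.
corr_ = train_enrch.select_dtypes('number').corr().abs()

fig, axes = plt.subplots(figsize=(8, 5))
# Hides the upper triangle (including the diagonal).
mask1 = np.zeros_like(corr_, dtype=bool)
mask1[np.triu_indices_from(mask1)] = True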
sns.heatmap(
    corr_, mask=mask1, 
    linewidths=.5, cmap='Greens'
)

plt.show()

In [None]:
x = data1.index.get_level_values("species")
y = data1['mean']
hue = data1.index.get_level_values("conserv_status")
size = data1.index.get_level_values("size_to_human")

fig = plt.figure(figsize=(8, 6))
sns.scatterplot(
    data=data1, x=y, y=x, size=size, 
    hue=hue, alpha=0.5, sizes=(20, 800),
)

plt.legend(
    bbox_to_anchor=(1.02, 1), 
    loc='upper left', 
    borderaxespad=0
)

fig.suptitle(
        f'Species aspect_ratio mean vs size_to_human', 
        size=14, weight='bold', font='Serif'
    )

plt.grid()
plt.show()

* Observations:
 - Now we can clearly see that the `aspect ratio` of the blue whale images stands out.
 - The `aspect ratio` increases as the `size_to_human` value increases.
 - Dolphin `aspect_ratio` tends toward a value of 1.5.

* P.S. This is an actionable insight: we can use this information during the postprocessing stage.
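One way to quantify the "increases with" claim is a rank (Spearman) correlation between `size_to_human` and the mean aspect ratio per species. The sketch below computes it as the Pearson correlation of the ranks. The helper name and the toy numbers are made up for illustration and are not results from the dataset.

```python
import pandas as pd


def spearman(x, y):
    """Rank correlation: Pearson correlation of the ranked values."""
    return pd.Series(x).rank().corr(pd.Series(y).rank())


# Hypothetical per-species values, for illustration only.
toy_size_to_human = [1.5, 2.5, 9.0, 17.0]
toy_mean_aspect = [1.4, 1.6, 2.1, 3.5]
rho = spearman(toy_size_to_human, toy_mean_aspect)
```

A value close to 1 indicates a monotonically increasing relationship. On the real data the inputs would come from the `size_to_human` index level and the `mean` column of `data1`.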

<a href="#Title" role="button" aria-pressed="true" >Back to the beginning 🔙</a>

## <p style="background-color:#D9EDF7; font-family:newtimeroman; color:#31708F; font-size:120%; text-align:center; border: 2px; border-style:solid; border-color:#31708F; border-radius: 24px 0;">4.5 Are there blurred or distorted images?</p>

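One popular technique for identifying blurred images is to apply the Laplacian kernel and compute the variance of the response. Sharp images contain many strong edges and therefore give a high variance, while blurred images give a low one. We then need to choose a variance threshold.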

Let's see how it works:

In [None]:
path_f = lambda x: os.path.join(p_trn, x)
im_pathes = train.image.map(path_f).tolist()

In [None]:
im1 = im_pathes[2]
im2 = im_pathes[3]
im3 = im_pathes[4]

def variance_of_laplacian(image):
    # compute the Laplacian of the image and then return the focus
    # measure, which is simply the variance of the Laplacian
    return cv2.Laplacian(image, cv2.CV_64F).var()

for path in [im1, im2, im3]:
    plt.figure(figsize=(10,10))
    image = cv2.imread(path)
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    laplacian = variance_of_laplacian(gray)
    text = 'Variance of Laplacian'

    # Draw the score onto the image (putText modifies it in place).
    cv2.putText(image, "{}: {:.2f}".format(text, laplacian), (25, 200),
                cv2.FONT_HERSHEY_SIMPLEX, 4, (255, 0, 0), 20)

    plt.axis("off")
    # OpenCV loads images as BGR; convert to RGB for matplotlib.
    plt.imshow(cv2.cvtColor(image, cv2.COLOR_BGR2RGB))
    plt.show()

In these samples, the blurred images have a variance of Laplacian below 20. So 20 looks like a reasonable first threshold for flagging blurred images, although it is based on only a few examples.
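
To see why a low variance signals blur, here is a NumPy-only sketch of the same focus measure on synthetic data. It is not how `cv2.Laplacian` is implemented: the hypothetical helpers `variance_of_laplacian_np` and `box_blur` work on interior pixels only, so the numbers will differ slightly from OpenCV's at the borders.

```python
import numpy as np


def variance_of_laplacian_np(gray):
    """Variance of the 4-neighbour Laplacian over interior pixels (NumPy only)."""
    g = gray.astype(np.float64)
    lap = (g[:-2, 1:-1] + g[2:, 1:-1] + g[1:-1, :-2] + g[1:-1, 2:]
           - 4.0 * g[1:-1, 1:-1])
    return lap.var()


def box_blur(gray):
    """3x3 mean filter over interior pixels, used here only to simulate blur."""
    g = gray.astype(np.float64)
    h, w = g.shape
    acc = sum(g[i:h - 2 + i, j:w - 2 + j] for i in range(3) for j in range(3))
    return acc / 9.0


rng = np.random.default_rng(0)
# Synthetic high-frequency texture standing in for a sharp image.
sharp = rng.integers(0, 256, size=(64, 64)).astype(np.float64)
blurred = box_blur(sharp)

var_sharp = variance_of_laplacian_np(sharp)
var_blurred = variance_of_laplacian_np(blurred)
print(f'sharp: {var_sharp:.1f}, blurred: {var_blurred:.1f}')
```

Blurring averages out the local intensity changes that the Laplacian responds to, so the blurred copy scores much lower than the sharp one.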

Below is a snippet that computes the variance of Laplacian for each image in the dataset and saves the results to a CSV file.

In [None]:
# def variance_of_laplacian(image):
#     # compute the Laplacian of the image and then return the focus
#     # measure, which is simply the variance of the Laplacian
#     return cv2.Laplacian(image, cv2.CV_64F).var()

# def get_laplasians(img_pathes):
#     var_lap = []
#     for image in tqdm(img_pathes, total=len(img_pathes)):
#         im = cv2.imread(image)
#         gray = cv2.cvtColor(im, cv2.COLOR_BGR2GRAY)
#         var = variance_of_laplacian(gray)
#         var_lap.append(var)
        
#     df_lap = pd.DataFrame({
#         'image_id': img_pathes, 
#         'var_lap': var_lap
#     })
    
#     return df_lap


# df_lap = get_laplasians(im_pathes)
# df_lap.to_csv('variance_of_laplacian.csv')
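
Once the scores are stored, the tentative threshold of 20 could be applied to flag candidate blurred images. This is a sketch only: `df_lap_demo`, its values, and `BLUR_THRESHOLD` are hypothetical stand-ins for the contents of `variance_of_laplacian.csv`, and the threshold would need tuning on more than a few samples.

```python
import pandas as pd

BLUR_THRESHOLD = 20.0  # tentative value from the samples above; an assumption to tune

# Hypothetical scores standing in for the contents of variance_of_laplacian.csv.
df_lap_demo = pd.DataFrame({
    'image_id': ['a.jpg', 'b.jpg', 'c.jpg', 'd.jpg'],
    'var_lap': [12.4, 250.1, 19.9, 20.0],
})

# Flag images whose focus measure falls strictly below the threshold.
df_lap_demo['is_blurred'] = df_lap_demo['var_lap'] < BLUR_THRESHOLD
blurred_share = df_lap_demo['is_blurred'].mean()

print(df_lap_demo)
print(f'Share flagged as blurred: {blurred_share:.2%}')
```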

<a name="Exploration"></a>

# <p style="background-color:#D9EDF7; font-family:newtimeroman; color:#31708F; font-size:120%; text-align:center; border: 2px; border-style:solid; border-color:#31708F; border-radius: 24px 0;">5.**Bonus Feature Engineering</p>

The image names appear to be anonymized according to some pattern.
If we look at the digits and letters they contain, we might notice that some characters are more prevalent than others and that some never appear at all.

In [None]:
# Count digits and lowercase letters in each file name.
train['sum_digits'] = train.image.str.count(r'[0-9]')
# Subtract 3, presumably to exclude the letters of the file extension.
train['sum_char'] = train.image.str.count(r'[a-z]') - 3

# Per-digit counts.
for i in range(10):
    train[f'sum_{i}'] = train.image.str.count(str(i))

# Per-letter counts (the letters of the file extension are still included).
alphabet = string.ascii_lowercase
for c in alphabet:
    train[f'sum_{c}'] = train.image.str.count(c)

train.head()

In [None]:
# Sum only the count columns; summing string columns would concatenate them.
train.filter(like='sum_').sum(axis=0).head(15)

* Assumption:
 - We might be able to decipher the image names and use them to probe the results.
 - P.S. Further analysis is still to be conducted...
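
A quick way to check which characters never appear is to count them across all names at once. The following is a sketch under stated assumptions: the file names are made up for illustration (the real ones come from `train.image`), and the extension is stripped so that its letters are not counted.

```python
import string
from collections import Counter

# Made-up file names for illustration; the real ones come from train.image.
names = ['0a1b2c3d4e5f60.jpg', '11aa22bb33cc44.jpg', 'deadbeef001122.jpg']

# Drop the extension so that its letters are not counted.
stems = [n.rsplit('.', 1)[0] for n in names]
counts = Counter(''.join(stems))

# Characters from 0-9 and a-z that do not occur in any stem.
candidates = string.digits + string.ascii_lowercase
never_seen = [ch for ch in candidates if ch not in counts]

print(counts.most_common(5))
print('Never seen:', ''.join(never_seen))
```

If the real names turned out to use only a restricted alphabet, that would hint at how they were generated, which is the kind of pattern the assumption above refers to.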

<a href="#Title" role="button" aria-pressed="true" >Back to the beginning 🔙</a>

<a id='Appendix'></a>
# <p style="background-color:#D9EDF7; font-family:newtimeroman; color:#31708F; font-size:120%; text-align:center; border: 2px; border-style:solid; border-color:#31708F; border-radius: 24px 0px;">Appendix</p>

`aspect_ratio` and `image_size` grouped stats.

In [None]:
data1.style.background_gradient(cmap='Greens').format('{:.3f}')

In [None]:
data2.style.background_gradient(cmap='Greens').format('{:.3f}')

<a id=''></a>
# <p style="background-color:#D9EDF7; font-family:newtimeroman; color:#31708F; font-size:120%; text-align:center; border: 2px; border-style:solid; border-color:#31708F; border-radius: 24px 0;">Any suggestions to improve this notebook will be greatly appreciated. P.S. If I have forgotten to reference someone's work, please do not hesitate to leave a comment. Any questions, suggestions or complaints are most welcome. Upvotes keep me motivated... Thank you.</p>
