<img src="https://i.imgur.com/qDrCF9S.png">

<center><h1>🐳 Part I. Data Understanding 🐬</h1></center>

> **Competition Goal:** Identify and group all images that contain the same individual through time.

### ⬇ Libraries

In [None]:
!pip install imagesize

In [None]:
# Libraries
import os
import sys
import wandb
import time
import random
from tqdm import tqdm
import warnings
import cv2
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib as mpl
import matplotlib.patches as patches
import matplotlib.pyplot as plt
from IPython.display import display_html
import imagesize
from sklearn.model_selection import StratifiedKFold

# Environment check
warnings.filterwarnings("ignore")
os.environ["WANDB_SILENT"] = "true"
CONFIG = {'competition': 'happywhale', '_wandb_kernel': 'aot'}

# Custom colors
class clr:
    S = '\033[1m' + '\033[96m'
    E = '\033[0m'
    
my_colors = ["#21295C", "#1F4E78", "#1C7293", "#73ABAF", "#C9E4CA", "#87BBA2", "#618E83", "#3B6064"]
print(clr.S+"Notebook Color Scheme:"+clr.E)
sns.palplot(sns.color_palette(my_colors))

### 🐝 W&B Fork & Run

In order to run this notebook you will need to input your own **secret API key** within the `! wandb login $secret_value_0` line. 

🐝**How do you get your own API key?**

Super simple! Go to **https://wandb.ai/site** -> Login -> Click on your profile in the top right corner -> Settings -> Scroll down to API keys -> copy your very own key (for more info check [this amazing notebook for ML Experiment Tracking on Kaggle](https://www.kaggle.com/ayuraj/experiment-tracking-with-weights-and-biases)).

<center><img src="https://i.imgur.com/fFccmoS.png" width=500></center>

In [None]:
# 🐝 Secrets
from kaggle_secrets import UserSecretsClient
user_secrets = UserSecretsClient()
secret_value_0 = user_secrets.get_secret("wandb")

! wandb login $secret_value_0

### ⬇ Helper Functions

In [None]:
def show_values_on_bars(axs, h_v="v", space=0.4):
    '''Plots the value at the end of each bar in a seaborn barplot.
    axs: the ax (or array of axes) of the plot
    h_v: whether the barplot is vertical ("v") or horizontal ("h")'''
    
    def _show_on_single_plot(ax):
        if h_v == "v":
            for p in ax.patches:
                _x = p.get_x() + p.get_width() / 2
                _y = p.get_y() + p.get_height()
                value = int(p.get_height())
                ax.text(_x, _y, format(value, ','), ha="center") 
        elif h_v == "h":
            for p in ax.patches:
                _x = p.get_x() + p.get_width() + float(space)
                _y = p.get_y() + p.get_height()
                value = int(p.get_width())
                ax.text(_x, _y, format(value, ','), ha="left")

    if isinstance(axs, np.ndarray):
        for idx, ax in np.ndenumerate(axs):
            _show_on_single_plot(ax)
    else:
        _show_on_single_plot(axs)


# === 🐝 W&B ===
def save_dataset_artifact(run_name, artifact_name, path):
    '''Saves dataset to W&B Artifactory.
    run_name: name of the experiment
    artifact_name: under what name should the dataset be stored
    path: path to the dataset'''
    
    run = wandb.init(project='happywhale', 
                     name=run_name, 
                     config=CONFIG)
    artifact = wandb.Artifact(name=artifact_name, 
                              type='dataset')
    artifact.add_file(path)

    wandb.log_artifact(artifact)
    wandb.finish()
    print("Artifact has been saved successfully.")
    
    
def create_wandb_plot(x_data=None, y_data=None, x_name=None, y_name=None, title=None, log=None, plot="line"):
    '''Create and save lineplot/barplot in W&B Environment.
    x_data & y_data: Pandas Series containing x & y data
    x_name & y_name: strings containing axis names
    title: title of the graph
    log: string containing name of log'''
    
    data = [[label, val] for (label, val) in zip(x_data, y_data)]
    table = wandb.Table(data=data, columns = [x_name, y_name])
    
    if plot == "line":
        wandb.log({log : wandb.plot.line(table, x_name, y_name, title=title)})
    elif plot == "bar":
        wandb.log({log : wandb.plot.bar(table, x_name, y_name, title=title)})
    elif plot == "scatter":
        wandb.log({log : wandb.plot.scatter(table, x_name, y_name, title=title)})
        
        
def create_wandb_hist(x_data=None, x_name=None, title=None, log=None):
    '''Create and save histogram in W&B Environment.
    x_data: Pandas Series containing x values
    x_name: strings containing axis name
    title: title of the graph
    log: string containing name of log'''
    
    data = [[x] for x in x_data]
    table = wandb.Table(data=data, columns=[x_name])
    wandb.log({log : wandb.plot.histogram(table, x_name, title=title)})

# 1. Metadata Cleaning

🐬 **Species column adjustment:**
* `bottlenose_dolpin` -> `bottlenose_dolphin`
* `kiler_whale` -> `killer_whale`
* `beluga` -> `beluga_whale`
* `globis` & `pilot_whale` -> `short_finned_pilot_whale` (due to extreme similarities [according to this discussion](https://www.kaggle.com/c/happy-whale-and-dolphin/discussion/305909))
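
The corrections above can also be written as a single dictionary, which keeps each typo next to its fix. This is a minimal sketch on a toy `Series`: the sample values and the names `SPECIES_FIXES` and `toy_species` are made up for illustration, and only the mapping itself comes from the list above.

```python
import pandas as pd

# Mapping taken from the list above: raw label -> cleaned label
SPECIES_FIXES = {
    "bottlenose_dolpin": "bottlenose_dolphin",
    "kiler_whale": "killer_whale",
    "beluga": "beluga_whale",
    "globis": "short_finned_pilot_whale",
    "pilot_whale": "short_finned_pilot_whale",
}

# Toy data standing in for train["species"] (hypothetical sample values)
toy_species = pd.Series(
    ["kiler_whale", "beluga", "globis", "bottlenose_dolpin", "blue_whale"]
)

# Labels that are not in the mapping (e.g. "blue_whale") are left untouched
cleaned = toy_species.replace(SPECIES_FIXES)

# The last token of the species name gives the coarse class ("whale" / "dolphin")
classes = cleaned.str.split("_").str[-1]

print(cleaned.tolist())
print(classes.tolist())
```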

In [None]:
# Import the training data
train = pd.read_csv("../input/happy-whale-and-dolphin/train.csv")

# Adjust typos in "species" column
train["species"] = train["species"].replace(["bottlenose_dolpin", "kiler_whale",
                                             "beluga", 
                                             "globis", "pilot_whale"],
                                            ["bottlenose_dolphin", "killer_whale",
                                             "beluga_whale", 
                                             "short_finned_pilot_whale", "short_finned_pilot_whale"])
# Create a "class" column
train["class"] = train["species"].apply(lambda x: x.split("_")[-1])

# Create path to train images
TRAIN_PATH = "../input/happy-whale-and-dolphin/train_images/"
train["path"] = TRAIN_PATH + train["image"]


# --- Inspect ---
print(clr.S+"--- TEST ---"+clr.E)
print(clr.S+"Total Number of Samples:"+clr.E, len(os.listdir("../input/happy-whale-and-dolphin/test_images")), "\n")
print(clr.S+"--- TRAIN ---"+clr.E)
print(clr.S+"Number of Missing Values:"+clr.E, train.isna().sum().sum())
print(clr.S+"Data Shape:"+clr.E, train.shape)
train.head()

# 2. Individual Analysis

### 🐬 Things to be noted:
* There is one individual with 400 appearances, which is an extreme outlier - a `minke_whale`
* ~95% of individuals have fewer than 10 appearances within the dataset
* ~60% of the IDs are "new appearances" - meaning that the ID appears only once
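
To make these percentages concrete, here is a small sketch of how such shares can be computed. It uses a made-up list of IDs (`toy_ids`) rather than the real `train["individual_id"]` column, and it assumes the shares are taken over individuals, not over images.

```python
from collections import Counter

# Toy individual_id column: one frequent individual and a few rare ones (made-up IDs)
toy_ids = ["a"] * 12 + ["b"] * 3 + ["c", "d", "e"]

# individual_id -> number of images in which it appears
appearances = Counter(toy_ids)
n_individuals = len(appearances)

# Share of individuals with fewer than 10 images
share_under_10 = sum(count < 10 for count in appearances.values()) / n_individuals

# Share of individuals that appear exactly once
share_singletons = sum(count == 1 for count in appearances.values()) / n_individuals

print(f"Individuals with fewer than 10 images: {share_under_10:.0%}")
print(f"Individuals seen only once: {share_singletons:.0%}")
```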

In [None]:
# 🐝 W&B Experiment
run = wandb.init(project='happywhale', name='IndividualAnalysis', config=CONFIG)

In [None]:
all_indivs = train["individual_id"].value_counts().reset_index()
individuals = train["individual_id"].value_counts().reset_index().head(1000)
print(clr.S+"Total Unique IDs:"+clr.E, all_indivs.shape[0])
print(clr.S+"Max. Number of Apparitions:"+clr.E, all_indivs["individual_id"].max())
print(clr.S+"Min. Number of Apparitions:"+clr.E, all_indivs["individual_id"].min())

# Plots
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(20, 10))
fig.suptitle('- Individuals Analysis -', size = 26, color = my_colors[7], weight='bold')
axs = [ax1, ax2]

sns.histplot(data=individuals, x="individual_id", ax=ax1, color=my_colors[1])
ax1.set_title("Distribution of appearances per individual - top 1000 in descending order",
             size = 15, color = my_colors[6], weight='bold')
ax1.set_xlabel("Individual Numbers", size = 13, color = my_colors[6], weight='bold')
ax1.set_ylabel("Count", size = 13, color = my_colors[6], weight='bold')
ax1.arrow(x=270, y=90, dx=0, dy=-80, 
          width = 0.05, head_width = 8, head_length=7, color=my_colors[6])
ax1.text(140, 100, '~95% have less than 10 apparitions', 
         size=13, color=my_colors[6], weight='bold')

sns.barplot(data=individuals.head(30), x="individual_id", y="index", ax=ax2, palette="Blues_r")
show_values_on_bars(ax2, h_v="h", space=0.4)
ax2.set_title("Top 30 IDs with most appearances",
             size = 15, color = my_colors[6], weight='bold')
ax2.set_ylabel("Individual ID", size = 13, color = my_colors[6], weight='bold')
ax2.set_xlabel("Frequency", size = 13, color = my_colors[6], weight='bold')
ax2.set_xticks([])
ax2.axhspan(-0.5, 10.5, color=my_colors[4], alpha=0.3)
ax2.text(200, 5, 'Individuals with high', 
         size=13, color=my_colors[6], weight='bold')
ax2.text(200, 6, 'number of apparitions', 
         size=13, color=my_colors[6], weight='bold')
ax2.yaxis.set_tick_params(labelsize=12)

sns.despine(left=True, bottom=True)
plt.subplots_adjust(left=None, bottom=None, right=None, top=0.86, wspace=0.3, hspace=None);

In [None]:
# 🐝 Log plots into W&B Dashboard
create_wandb_plot(x_data=individuals.head(30)["index"], 
                  y_data=individuals.head(30).individual_id, 
                  x_name="Individual ID", y_name="Frequency", 
                  title="-Top 30 IDs with most appearances-", 
                  log="frames", plot="bar")

create_wandb_hist(x_data=individuals["individual_id"], 
                  x_name="Individual Numbers", 
                  title="Distribution of Appearances per Individual", 
                  log="hist1")

wandb.finish()

# 3. Species analysis

Below I have mapped an example for each species that appears within the dataset, in order to better understand the differences and similarities in appearance between the individuals.

<img src="https://i.imgur.com/uodWEel.png">

<img src="https://i.imgur.com/nvAGvgH.png">

### 🐬 Things to be noted:
* Counting unique individuals, ~70% are whales and ~30% are dolphins
* The most common species by unique count are the `dusky dolphin`, the `humpback whale` and the `blue whale`. These species have many **distinct individuals** within them.
* The most common species by simple count (**counting every image, regardless of which individual it shows**) are the `bottlenose dolphin`, the `beluga whale` and the `humpback whale`. There are 2 reasons for this:
    * many of the *top 20 individuals* with the most appearances are bottlenose dolphins and humpback whales
    * for the beluga whale there are 800 unique individuals, but on average each appears fewer than 10 times within the dataset.
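
The difference between the two counts is easy to see on a toy table. The sketch below uses invented rows (`toy`) in which one bottlenose dolphin is photographed many times, so the species leads the simple count while the dusky dolphin leads the unique count.

```python
import pandas as pd

# Toy stand-in for the training metadata (made-up IDs)
toy = pd.DataFrame({
    "species": ["bottlenose_dolphin"] * 5 + ["dusky_dolphin"] * 3,
    "individual_id": ["x1", "x1", "x1", "x1", "x2", "d1", "d2", "d3"],
})

# Simple count: one unit per image
image_count = toy["species"].value_counts()

# Unique count: one unit per distinct individual
unique_count = toy.groupby("species")["individual_id"].nunique()

print(image_count.to_dict())
print(unique_count.to_dict())
```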

In [None]:
# 🐝 W&B Experiment
run = wandb.init(project='happywhale', name='SpeciesAnalysis', config=CONFIG)

In [None]:
species_simple = train["species"].value_counts().reset_index()
species = train.groupby(by=["individual_id", "species"]).count()\
                .reset_index()["species"].value_counts().reset_index()
classes = train.groupby(by=["individual_id", "class"]).count()\
                .reset_index()["class"].value_counts().reset_index()
print(clr.S+"Individuals that are WHALES:"+clr.E, round(classes.iloc[0][1]/classes["class"].sum()*100),"%")
print(clr.S+"Individuals that are DOLPHINS:"+clr.E, round(classes.iloc[1][1]/classes["class"].sum()*100),"%")

# Plots
fig = plt.figure(figsize=(20, 15))

ax1= fig.add_subplot(1,2,1)
ax2= fig.add_subplot(2,2,2)
ax3= fig.add_subplot(2,2,4)
fig.suptitle('- Species Analysis -', size = 26, color = my_colors[7], weight='bold')

sns.barplot(data=classes, y="class", x="index", ax=ax1, palette=[my_colors[3], my_colors[4]])
show_values_on_bars(ax1, h_v="v", space=0.4)
ax1.set_title("Class Frequency",
             size = 15, color = my_colors[6], weight='bold')
ax1.set_ylabel("Frequency", size = 13, color = my_colors[6], weight='bold')
ax1.set_xlabel("Class", size = 13, color = my_colors[6], weight='bold')

sns.barplot(data=species_simple, x="species", y="index", ax=ax2, palette="PuBuGn_r",alpha=0.65)
show_values_on_bars(ax2, h_v="h", space=0.4)
ax2.axhspan(-0.5, 3.5, color=my_colors[4], alpha=0.2)
ax2.set_title("Simple Count on Species",
             size = 15, color = my_colors[6], weight='bold')
ax2.set_ylabel("Simple Species", size = 13, color = my_colors[6], weight='bold')
ax2.set_xlabel("", size = 13, color = my_colors[6], weight='bold')
ax2.set_xticks([])
ax2.yaxis.set_tick_params(labelsize=12)

sns.barplot(data=species, x="species", y="index", ax=ax3, palette="PuBuGn_r")
show_values_on_bars(ax3, h_v="h", space=0.4)
ax3.axhspan(-0.5, 3.5, color=my_colors[4], alpha=0.2)
ax3.set_title("vs Unique Count on Species",
             size = 15, color = my_colors[6], weight='bold')
ax3.set_ylabel("Unique Species", size = 13, color = my_colors[6], weight='bold')
ax3.set_xlabel("", size = 13, color = my_colors[6], weight='bold')
ax3.set_xticks([])
ax3.yaxis.set_tick_params(labelsize=12)

sns.despine(left=True, bottom=True)
plt.subplots_adjust(left=None, bottom=None, right=None, top=0.86, wspace=0.4, hspace=None);

In [None]:
# 🐝 Log plots into W&B Dashboard
create_wandb_plot(x_data=classes["index"], 
                  y_data=classes["class"], 
                  x_name="Class", y_name="Count", 
                  title="-Class Frequency-", 
                  log="class", plot="bar")

create_wandb_plot(x_data=species["index"], 
                  y_data=species["species"], 
                  x_name="Species", y_name="Count", 
                  title="-Unique Count on Species-", 
                  log="species", plot="bar")

Below you can see the 2 graphs that explain the shift between **simple count on species vs unique count on species**:

In [None]:
top_species = train[train["individual_id"].isin(individuals.head(20)["index"].tolist())]["species"]\
                                        .value_counts().reset_index()
beluga = train[train["species"]=="beluga_whale"]["individual_id"].value_counts().reset_index()
beluga_perc = round(beluga[beluga["individual_id"]<=10].shape[0]/beluga.shape[0]*100)

# Plots
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(20, 7))
fig.suptitle('- Miscellaneous -', size = 26, color = my_colors[7], weight='bold')
axs = [ax1, ax2]

sns.histplot(data=beluga, x="individual_id", ax=ax1, color=my_colors[1])
ax1.set_title("Distribution of appearances on BELUGA Species",
             size = 15, color = my_colors[6], weight='bold')
ax1.set_xlabel("Appearances per Beluga Individual", size = 13, color = my_colors[6], weight='bold')
ax1.set_ylabel("Count", size = 13, color = my_colors[6], weight='bold')
ax1.arrow(x=8, y=200, dx=0, dy=-100, 
          width = 0.05, head_width = 2, head_length=10, color=my_colors[6])
ax1.text(8, 210, f'{beluga_perc}% of individuals have at most 10 appearances', 
         size=13, color=my_colors[6], weight='bold')
ax1.axvspan(-0.5, 10.5, color=my_colors[4], alpha=0.3)

sns.barplot(data=top_species, x="species", y="index", ax=ax2, palette="Blues_r")
show_values_on_bars(ax2, h_v="h", space=0.4)
ax2.set_title("Species of the Top 20 IDs",
             size = 15, color = my_colors[6], weight='bold')
ax2.set_ylabel("Species", size = 13, color = my_colors[6], weight='bold')
ax2.set_xlabel("Frequency", size = 13, color = my_colors[6], weight='bold')
ax2.yaxis.set_tick_params(labelsize=13)

sns.despine(left=True, bottom=True)
plt.subplots_adjust(left=None, bottom=None, right=None, top=0.86, wspace=0.4, hspace=None);

In [None]:
# 🐝 Log plots into W&B Dashboard
create_wandb_hist(x_data=beluga["individual_id"], 
                  x_name="Beluga Individual ID", 
                  title="Distribution of Appearances on Beluga Species", 
                  log="hist2")

create_wandb_plot(x_data=top_species["index"], 
                  y_data=top_species["species"], 
                  x_name="Species", y_name="Count", 
                  title="-Top 20 IDs and their Species-", 
                  log="sp2", plot="bar")

wandb.finish()

# 4. Images Analysis

## I. Specimen Explore

In [None]:
# 🐝 W&B Experiment
run = wandb.init(project='happywhale', name='ImageAnalysis', config=CONFIG)

In [None]:
def show_image_species(species_name, sample_size):
    '''
    Shows a sample of n random images from a certain species. Logs the images to W&B as well.
    species_name: string containing the desired species name
    sample_size: number of random images to be printed on a row
    '''
    # Get image info
    data = train[train["species"]==species_name].sample(sample_size, random_state=24)
    image_nr = data["image"].to_list()
    image_path = data["path"].to_list()

    # Plot
    fig, axs = plt.subplots(1, sample_size, figsize=(23, 4))
    axs = axs.flatten()
    wandb_images = []

    for k, path in enumerate(image_path):
        axs[k].set_title(f"{k+1}. {species_name}-{image_nr[k]}", 
                         fontsize = 13, color = my_colors[7], weight='bold')

        img = plt.imread(path)
        wandb_images.append(wandb.Image(img))
        axs[k].imshow(img)
        axs[k].axis("off")

    plt.tight_layout()
    plt.show()

    # 🐝 Log Image to W&B
    wandb.log({f"{species_name}": wandb_images})

### 🐬 Things to be noted:
* **image_size:** image width and height vary widely from one picture to another
* **night view:** not all the pictures were taken during the day - some of them were taken at night (see `beluga_whale` pictures 1 to 3)
* **multiple individuals:** there are some pictures with 2 or more subjects in them (see `cuviers_beaked_whale` picture 2 and `frasiers_dolphin` picture 2)
* **landscape:** in some of the images the subject is very close, while in others the landscape is much more predominant (see `minke_whale` pictures 3 and 4), which could pose some issues in identifying subtle characteristics of the individual
* **additional noise:** there are some images that have digital markings on them that could pollute the algorithm (see `blue_whale` picture 1)

In [None]:
for species_name in train["species"].unique().tolist():
    # Custom function to print images & log into 🐝W&B
    show_image_species(species_name, sample_size=4)

## II. Same Individuals

Let's also look at a few examples that contain the same individual.

### 🐬 Things to be noted:
* **could there be duplicated images?** - in the first example there are 2 different images that look exactly the same ... at first glance. On a closer look, you can see some *very subtle* differences between the water waves. The pictures were taken mere moments apart.
* **increased noise** - as we have seen above as well, there are many cases where the same individual appears against very different backgrounds, from different angles (sometimes from the front, sometimes from the side) or showing different shapes (in some cases we see a tail, in others a fin).
* **lighting** - lighting is another pretty noisy aspect and it should be dealt with during the image augmentation phase.

In [None]:
def show_same_individual(example_id, sample_size):
    '''
    Shows a sample of n random images from a certain individual. Logs the images to W&B as well.
    example_id: string containing individual id to be displayed
    sample_size: number of random images to be printed on a row
    '''
    
    data = train[train["individual_id"]==example_id].sample(sample_size, random_state=24)
    image_nr = data["image"].to_list()
    image_path = data["path"].to_list()

    # Plot
    fig, axs = plt.subplots(2, round(sample_size/2), figsize=(23, 6))
    fig.suptitle(f'- Individual {example_id} -', size = 20, color = my_colors[7], weight='bold')
    axs = axs.flatten()
    wandb_images = []

    for k, path in enumerate(image_path):
        axs[k].set_title(f"Img. {image_nr[k]}", 
                         fontsize = 13, color = my_colors[7], weight='bold')

        img = plt.imread(path)
        wandb_images.append(wandb.Image(img))
        axs[k].imshow(img)
        axs[k].axis("off")

    plt.tight_layout()
    plt.show()

    # 🐝 Log Image to W&B
    wandb.log({f"{example_id}": wandb_images})

In [None]:
example_id = all_indivs.iloc[1, :][0]
show_same_individual(example_id, sample_size=8)

In [None]:
example_id = all_indivs.iloc[5, :][0]
show_same_individual(example_id, sample_size=8)

## III. Image sizes

> 🐬 **Note**: I am using the `imagesize` library to retrieve the `width` and `height` of the images without overloading the memory of the notebook. And it's faster :)
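
Reading only the size can be cheap because image formats store their dimensions in the first bytes of the file, so the pixels never need to be decoded. The sketch below illustrates the idea on a hand-built PNG header. It is only an illustration and not a description of how `imagesize` works internally; the helper `png_size_from_header` is hypothetical, and the competition images are JPEGs, whose headers are laid out differently.

```python
import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"

def png_size_from_header(header: bytes):
    """Return (width, height) from the first 24 bytes of a PNG file."""
    # Layout: 8-byte signature, 4-byte chunk length, 4-byte chunk type "IHDR",
    # then width and height as big-endian unsigned 32-bit integers.
    if len(header) < 24 or header[:8] != PNG_SIGNATURE or header[12:16] != b"IHDR":
        raise ValueError("not a PNG header")
    width, height = struct.unpack(">II", header[16:24])
    return width, height

# Hand-built header for an imaginary 3000 x 2000 image (no pixel data at all)
fake_header = (
    PNG_SIGNATURE
    + struct.pack(">I", 13)          # IHDR payload length
    + b"IHDR"
    + struct.pack(">II", 3000, 2000)
)

size = png_size_from_header(fake_header)
print(size)
```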

In [None]:
# Save image size to a new column within the training dataset
widths, heights = [], []

for path in tqdm(train["path"]):
    width, height = imagesize.get(path)
    widths.append(width)
    heights.append(height)
    
train["width"] = widths
train["height"] = heights
train["dimension"] = train["width"] * train["height"]

### 🐬 Things to be noted:
* distributions are quite elongated, with varying values for both width and height.
* there are some images with very low values for either `width` or `height` (less than 100 pixels)
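
A quick way to list such images is to filter on both columns at once. This is a sketch on a toy table (`toy_sizes`, with invented file names and sizes); the 100-pixel threshold comes from the note above, and the same filter could be applied to `train` once its `width` and `height` columns exist.

```python
import pandas as pd

MIN_SIDE = 100  # threshold in pixels, taken from the note above

# Toy stand-in for train[["image", "width", "height"]] (made-up values)
toy_sizes = pd.DataFrame({
    "image": ["a.jpg", "b.jpg", "c.jpg"],
    "width": [3000, 80, 1200],
    "height": [2000, 600, 90],
})

# Keep rows where either side is below the threshold
too_small = toy_sizes[
    (toy_sizes["width"] < MIN_SIDE) | (toy_sizes["height"] < MIN_SIDE)
]

print(too_small["image"].tolist())
```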

In [None]:
data_w = train[["species", "width", "class"]]
data_h = train[["species", "height", "class"]]

print(clr.S+"WIDTH - Min Value:"+clr.E, data_w["width"].min(), "pixels")
print(clr.S+"WIDTH - Max Value:"+clr.E, data_w["width"].max(), "pixels", "\n")
print(clr.S+"HEIGHT - Min Value:"+clr.E, data_h["height"].min(), "pixels")
print(clr.S+"HEIGHT - Max Value:"+clr.E, data_h["height"].max(), "pixels")

# Plots
fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(20, 19))
fig.suptitle('- Image Size distribution on Species -', size = 26, color = my_colors[7], weight='bold')
axs = [ax1, ax2]

v1 = sns.violinplot(data=data_w, x="species", y="width", hue="class", 
               palette=[my_colors[1], my_colors[3]], ax=ax1)
ax1.set_title("Width", y=0.97,
             size = 15, color = my_colors[6], weight='bold')
ax1.set_xlabel("")
ax1.set_ylabel("Width", size = 13, color = my_colors[6], weight='bold')
ax1.set_xticklabels(ax1.get_xticklabels(), rotation=45, ha='right')


v2 = sns.violinplot(data=data_h, x="species", y="height", hue="class", 
               palette=[my_colors[6], my_colors[4]], ax=ax2)
ax2.set_title("Height", y=0.9,
             size = 15, color = my_colors[6], weight='bold')
ax2.set_ylabel("Height", size = 13, color = my_colors[6], weight='bold')
ax2.set_xlabel("")
ax2.yaxis.set_tick_params(labelsize=13)
ax2.set_xticklabels(ax2.get_xticklabels(), rotation=45, ha='right')

sns.despine(left=True, bottom=True)
plt.subplots_adjust(left=None, bottom=None, right=None, top=0.93, wspace=None, hspace=None);

🐬 So I went ahead to see which species might have concerningly "low resolution" images:
* **frasiers dolphin**: its images have the lowest dimensions in the entire dataset
* **pygmy killer whale** and **dusky dolphin**: have low resolutions too, with the majority of images being smaller
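
One way to back this up with numbers rather than by reading the violin plots is to rank species by their median pixel count. The sketch below uses a toy table with invented sizes; `dimension` is defined as `width * height`, as in the notebook.

```python
import pandas as pd

# Toy stand-in for train[["species", "width", "height"]] (made-up sizes)
toy = pd.DataFrame({
    "species": ["frasiers_dolphin", "frasiers_dolphin", "blue_whale", "blue_whale"],
    "width": [400, 500, 3000, 3600],
    "height": [300, 300, 2000, 2400],
})
toy["dimension"] = toy["width"] * toy["height"]

# Median pixel count per species, smallest first
median_dimension = toy.groupby("species")["dimension"].median().sort_values()

print(median_dimension)
```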

In [None]:
data_d = train[["species", "dimension", "class"]]

# Plots
fig, (ax1) = plt.subplots(1, 1, figsize=(20, 5))
fig.suptitle('- Image Dimension distribution on Species -', size = 26, color = my_colors[7], weight='bold')

sns.violinplot(data=data_d, x="species", y="dimension", hue="class", 
               palette=[my_colors[1], my_colors[3]], ax=ax1)
# ax1.set_title("Dimension", y=0.97,
#              size = 15, color = my_colors[6], weight='bold')
ax1.set_xlabel("")
ax1.set_ylabel("Dimension", size = 13, color = my_colors[6], weight='bold')
ax1.set_xticklabels(ax1.get_xticklabels(), rotation=45, ha='right')

sns.despine(left=True, bottom=True)
plt.subplots_adjust(left=None, bottom=None, right=None, top=0.93, wspace=None, hspace=None);

## IV. Other weird image examples

While browsing through the training/test images *(yes, I have looked at 70k+ images)*, I have found a few examples that I will share here that are ... complicated, to say the least.

I found myself laughing nervously, which I guess is better than crying. How could we preprocess these images? 😅🤧!

### 🐬 Things to be noted:
* **penguins!** - there are penguins in the data! I find that absolutely adorable.
* **people** - there are also people (tourists and scientists) within the images (not as adorable as the penguins tho).
* **afar objects** - there are many images where the subject is so far in the distance that you can barely see it.
* **very similar photos** - there are images that you would think are identical copies - in fact, they are not. But they are pictures taken moments apart, so the differences between them are extremely subtle. This could mess up the CV score.
* **cannot see the subject** - there are a few examples where I myself cannot see the subject. At all. Maybe I need better glasses.
* **water and ice** - initially I thought I would see only water, however, for the individuals that live mostly in arctic waters, many pictures contain ice
* **don't put your finger over the camera** - inevitably, there are some pictures where a human finger is partially blocking the image 😅
* **beautiful subject** - I have also added a few images where I believe the shots are absolutely gorgeous.

... Did I already mention the penguins? 🐧🐧🐧

In [None]:
def plot_weird_images(ids, cols, rows, figsize, title, root_path):
    
    image_path = [root_path+i+".jpg" for i in ids]

    # Plot
    fig, axs = plt.subplots(cols, rows, figsize=figsize)
    fig.suptitle(title, size = 20, color = my_colors[7], weight='bold')
    axs = axs.flatten()

    for k, path in enumerate(image_path):
        axs[k].set_title(f"Img. {ids[k]}", 
                         fontsize = 13, color = my_colors[7], weight='bold')

        img = plt.imread(path)
        axs[k].imshow(img)
        axs[k].axis("off")

    plt.tight_layout()
    plt.show()
    
    
# Hand picked IDs
train_ids = ["0c42057255dbd6", "0c5821546292dd", "02f0606c99c41e", "2b3441bc1f27ce", "2c1f13d5f6d09d",
             "2c54be7b88181a", "3c15e996c183aa", "3cfa63a3bbebb7", "3dace1d4074b97", "3dd2c145275816",
             "5b850348ea63f8", "5c0c29e4993000", "5c90d285552ea2", "5e334d096b864e", "6e73f4c12d7d54",
             "7ad3a277f55107", "9a236360f50155", "9f94de1a3c768b", "12a7b25090e1b9", "13a25d81619913",
             "16d6fb560bd7bc", "35d677992a4f2e", "083a0fee112e3c", "090d7f9228a6bc", "292ceb0ffef4e8",
             "7940e462f75dd9", "24326cc55fe303", "ae680fc65c1ba0", "bb875ffcb8d064", "cd5fe465c60cb9",
             "d4d8ac80cb3a4b", "d5b42024509635", "dae4589b0f8dc5", "f7942e041d9963", "fc55d004bdc2da"]
plot_weird_images(ids=train_ids, cols=5, rows=7, 
                  figsize=(23, 14), title="- [Weird] Train Samples -",
                  root_path="../input/happy-whale-and-dolphin/train_images/")
print("\n")

test_ids = ["5bf1396d350169", "5e4a1ef591f291", "6caa20cf5526cb", "8c660e44867f8a", "9b0b44b19ba412",
            "43f1e346be1ddd", "67e5fb9a6110b0", "d5794a831a1b23", "db6e6c2b29ba40", "e4acbbdc2feb58"]
plot_weird_images(ids=test_ids, cols=2, rows=5, 
                  figsize=(23, 6), title="- [Weird] Test Samples -",
                  root_path="../input/happy-whale-and-dolphin/test_images/")

In the end, we might be better off just erasing some of the outlier images.

For example, the image below (image: `cd5fe465c60cb9.jpg`) is categorized as `gray_whale`, but there is no subject within it. 

Moreover, the subject's `individual_id` is `fc0f7c162cc0`, and if we take a look it has 72 more appearances, so there are plenty of examples to choose from.

<img src="https://i.imgur.com/8iFjmuw.png" width=400>

In [None]:
print(clr.S+"Image characteristics:"+clr.E)
train[train["image"]=="cd5fe465c60cb9.jpg"]

In [None]:
print(clr.S+"Total number of apparitions:"+clr.E, len(train[train["individual_id"]=="fc0f7c162cc0"]))
show_same_individual("fc0f7c162cc0", sample_size=8)

In [None]:
wandb.finish()

# 5. Preprocess

## I. .csv preprocess
🐬 `target` -> column that incorporates for each observation all images that contain the same individual.
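
As a small illustration of what the `target` column ends up holding, here is a sketch on a toy table with made-up image codes. It aggregates with `list` instead of `'unique'` purely so the printed result is a plain Python list; the idea is the same.

```python
import pandas as pd

# Toy stand-in for train[["image_code", "individual_id"]] (made-up values)
toy = pd.DataFrame({
    "image_code": ["img1", "img2", "img3"],
    "individual_id": ["w1", "w1", "w2"],
})

# individual_id -> all image codes that show this individual
images_per_individual = (
    toy.groupby("individual_id")["image_code"].agg(list).to_dict()
)

# Every row receives the full list for its individual (including itself)
toy["target"] = toy["individual_id"].map(images_per_individual)

print(toy["target"].tolist())
```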

In [None]:
# Create a unique id column based on image name
train["image_code"] = train["image"].apply(lambda x: x.split(".")[0])

# Create a 'target' column
tmp = train.groupby('individual_id')['image_code'].agg('unique').to_dict()
train['target'] = train['individual_id'].map(tmp)

# Map the individual id to a unique key (integer, not string)
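# Keys follow frequency order: the most frequent individual gets key 0.
# Built from the value_counts index so it does not depend on how a given
# pandas version names the columns after reset_index().
individual_mapping = pd.DataFrame({"individual_id": train["individual_id"].value_counts().index})
individual_mapping["individual_key"] = np.arange(start=0, stop=len(individual_mapping), step=1)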

train = pd.merge(train, individual_mapping, on="individual_id")

# Add Validation Fold
### based on individual key group
skf = StratifiedKFold(n_splits=5)
skf_splits = skf.split(X=train.drop(columns="individual_key"), y=train["individual_key"])

for fold, (train_index, valid_index) in enumerate(skf_splits):
    # np.int was removed from recent NumPy releases; the fold index is already a plain int
    train.loc[valid_index, "kfold"] = fold
        
train["kfold"] = train["kfold"].astype(int)
        
# The adjusted training data
train.head(3)

In [None]:
TEST_PATH = "../input/happy-whale-and-dolphin/test_images"

test = pd.DataFrame({"image" : os.listdir(TEST_PATH)})
test["path"] = TEST_PATH + "/" + test["image"]
test["image_code"] = test["image"].apply(lambda x: x.split(".")[0])


widths, heights = [], []

for path in tqdm(test["path"]):
    width, height = imagesize.get(path)
    widths.append(width)
    heights.append(height)
    
test["width"] = widths
test["height"] = heights
test["dimension"] = test["width"] * test["height"]

test.head(3)

In [None]:
# train.to_csv("train.csv", index=False)
# test.to_csv("test.csv", index=False)

# 🐝 Save datasets to W&B
save_dataset_artifact(run_name="TrainArtifact", artifact_name="train",
                      path="../input/happywhale-2022/train.csv")

save_dataset_artifact(run_name="TestArtifact", artifact_name="test",
                      path="../input/happywhale-2022/test.csv")

<img src="https://i.imgur.com/qZHp7Kb.png">

<center><h1>🐳 Part II. Image Similarities - Sisters or Twins 🐬</h1></center>

How does our brain figure out so easily that a synonym of *car* could be *vehicle*? How does our brain know how to group related words together, like *bear, mountain, hike*? We even have games such as "spot the intruder", where we ask kids as young as kindergarten age to **recognize which word or image is the intruder** in a pool of words, like *table, chair, bed, kitchen, lion*.

We can do this with faces too. There is much controversy over some actors that look VERY similar to one another - like they are twins. However, a glance at a few images makes our brain immediately recognize which is which. Have you ever had twin friends? At the beginning you cannot tell them apart, but after a while you might even wonder how you could have confused them in the first place.

<center><img src="https://i.imgur.com/GwsrRBK.png" width=750></center>

🐳 Can we do this with whales and dolphins too?

### ⬇️ Other Useful Libraries

In [None]:
# Helpful Imports
!pip install -q efficientnet_pytorch

import albumentations
import torch
from torch.utils.data import DataLoader, Dataset
import torch.nn as nn
import torch.nn.functional as F
from efficientnet_pytorch import EfficientNet
from numpy import dot, sqrt
from scipy import spatial

from transformers import *

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(clr.S+'Device available now:'+clr.E, device)

# 6. Get Image Embeddings

### 🐳 Why do we need these?

Simply put, the **embeddings** are the output of the last layer of a neural network, right before the *classification* part. Usually, we take the pixels of an image and pass them through a neural network of some kind, with multiple layers and neurons per layer. The last layer contains all the *good juices* from the image.

Think of it like a zip file. You take an image, compress all that good information into a zip file, and if you reopen the zip file, the image is the same. We want that compressed vector, the **embedding** (to understand more about this you can read **[this amazing article by Chris Deotte - image below from there too](https://www.kaggle.com/c/shopee-product-matching/discussion/226279)**).

<center><img src="https://raw.githubusercontent.com/cdeotte/Kaggle_Images/main/Mar-2021/arcface.png" width=600></center>

**How do we get these?**

We get these by training a neural network, such as a CNN. **ArcFace is a great Loss Function**, in order to force the network to [make similar class embeddings to be close and dissimilar embeddings to be far from each other](https://www.kaggle.com/c/shopee-product-matching/discussion/226279).

For now, I will use an off-the-shelf backbone, an **EffNet B7** pretrained on ImageNet, just to explore how the image embeddings associate with one another.

In [None]:
# ---- PARAMETERS ----
STATE = 24
KEYS = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
IMG_SIZE = 256
BATCH_SIZE = 16
# --------------------

In [None]:
# Select only a sample from the training data
df = pd.read_csv("../input/happywhale-2022/train.csv")
df = df[df["individual_key"].isin(KEYS)].reset_index(drop=True)
df["path"] = "../input/happy-whale-and-dolphin/train_images/" + df["image"]

df.head()

### I. The Dataset

In [None]:
def get_transforms(img_size=256):
    '''Function to apply albumentations to the image.
    Keeping it simple for now - Just a resizing and normalization.'''
    
    return  albumentations.Compose([
                albumentations.Resize(img_size, img_size),
                albumentations.Normalize()
            ])

class HappyWhaleDataset(Dataset):
    def __init__(self, csv, transforms=get_transforms(img_size=256)):

        self.csv = csv
        self.transform = transforms

    def __len__(self):
        return self.csv.shape[0]

    def __getitem__(self, index):
        row = self.csv.iloc[index]
                
        # OpenCV loads images as BGR; convert to RGB
        image = cv2.imread(row.path)
        image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
        
        transformed_img = self.transform(image=image)
        transformed_img = transformed_img['image'].astype(np.float32)
        image = transformed_img.transpose(2, 0, 1)
        
        target = torch.tensor(row.individual_key)

        return torch.tensor(image), target
    

# Get the data loader
dataset = HappyWhaleDataset(df, transforms=get_transforms(img_size=IMG_SIZE))
loader = DataLoader(dataset, batch_size=BATCH_SIZE)
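
For intuition, the preprocessing above boils down to a few array operations. The sketch below reproduces in plain NumPy what `albumentations.Normalize()` does with its default settings (ImageNet mean and standard deviation, pixel values scaled by 255), followed by the same channel transpose as in `__getitem__`. The helper name `normalize_and_transpose` and the fake image are made up for illustration.

```python
import numpy as np

# Default statistics used by albumentations.Normalize() (ImageNet)
IMAGENET_MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
IMAGENET_STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)

def normalize_and_transpose(image_rgb_uint8):
    """Scale to [0, 1], standardize per channel, then go from HWC to CHW."""
    scaled = image_rgb_uint8.astype(np.float32) / 255.0
    normalized = (scaled - IMAGENET_MEAN) / IMAGENET_STD
    return normalized.transpose(2, 0, 1)

fake_image = np.full((256, 256, 3), 128, dtype=np.uint8)  # flat gray image
chw = normalize_and_transpose(fake_image)
print(chw.shape, chw.dtype)
```

PyTorch models expect the channel axis first, which is why the dataset transposes from `(height, width, channels)` to `(channels, height, width)`.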

### II. The EffNet Model

In [None]:
class BackboneModel(nn.Module):
    def __init__(self):
        super(BackboneModel, self).__init__()
        # Retrieve pretrained weights
        self.backbone = EfficientNet.from_pretrained('efficientnet-b7')
        
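    def forward(self, img):
        # Full forward pass: with the default head this returns the
        # 1000-dim ImageNet class scores, which are used here as the embedding
        img = self.backbone(img)
        return img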
    
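# Initiate the model and switch to inference mode
# (disables dropout and uses the stored batch-norm statistics)
model = BackboneModel().to(device)
model.eval()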

### III. Retrieving the Embeddings

> 🐳 **Note**: the example below is a very simple one, where the model is NOT trained on the new whale/dolphin images. Hence, we would need to create a training pipeline to fine-tune the embeddings.

**Training Pipeline for Embeddings and Clustering**: my 2nd notebook [🐳Whales&Dolphins: EffNet Train & RAPIDS Clusters](https://www.kaggle.com/andradaolteanu/whales-dolphins-effnet-train-rapids-clusters)

In [None]:
# Retrieve all embeddings for each image
all_embeddings = []
all_targets = []

with torch.no_grad():
    for img, target in tqdm(loader): 
        img = img.to(device)
        img_embedding = model(img)
        img_embedding = img_embedding.detach().cpu().numpy()
        all_embeddings.append(img_embedding)
        all_targets.append(target.numpy())

In [None]:
# Concatenate batches together
image_embeddings = np.concatenate(all_embeddings)
image_targets = np.concatenate(all_targets)

print(clr.S+"Shape of the embeddings:"+clr.E, image_embeddings[0].shape)

# Save embeddings and corresponding image
np.save('effnet_image_embeddings.npy', image_embeddings)
np.save('effnet_image_targets.npy', image_targets)

In [None]:
# 🐝 Save baseline (no train) embeddings to W&B
save_dataset_artifact(run_name="BaseEmbeddings", artifact_name="effnet_embeds_notrain",
                      path="../input/happywhale-2022/effnet_image_embeddings.npy")

# 7. Cosine Distance

> 🐳 **Note**: My main inspiration was this article on [face recognition with ArcFace](https://learnopencv.com/face-recognition-with-arcface/) - absolutely amazing, highly recommend.

### I. The Cosine Distance Function

In [None]:
def get_cosine_similarity(embeddings):
    '''Compute the pairwise cosine similarity (1 - cosine distance)
    between every pair of embedding vectors. Returns an (n, n) matrix.'''
    similarity_matrix = []
    
    for embed1 in embeddings:
        similarity_row = []
        for embed2 in embeddings:
            similarity_row.append(1 - spatial.distance.cosine(embed1, embed2))
        similarity_matrix.append(similarity_row)
    
    return np.array(similarity_matrix, dtype="float32")
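
The double loop is easy to read but slows down quickly as the number of images grows. A vectorized alternative, sketched below under the assumption that no embedding is an all-zero vector, normalizes each row to unit length and takes a single matrix product. The name `cosine_similarity_matrix` is made up for this example.

```python
import numpy as np

def cosine_similarity_matrix(embeddings):
    """Pairwise cosine similarity between the rows of `embeddings`."""
    embeddings = np.asarray(embeddings, dtype=np.float32)
    # Assumes no row is all zeros (otherwise this divides by zero)
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / norms
    return unit @ unit.T

toy = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
sim = cosine_similarity_matrix(toy)
print(np.round(sim, 3))
```

The result is the same kind of `(n, n)` matrix: ones on the diagonal, and entry `[i, j]` is the cosine of the angle between embeddings `i` and `j`.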

### II. Select images from same individual

In [None]:
# Select a few examples from the same individual
examples = df[df["individual_key"]==1].sample(5, random_state=24)
example_index = examples.index.tolist()
example_paths = examples["path"].tolist()
example_embeds = image_embeddings[example_index]

# Compute similarity matrix
cos_matrix = get_cosine_similarity(example_embeds)

mask = np.zeros_like(cos_matrix)
mask[np.triu_indices_from(mask)] = True
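
A quick way to see what this mask does is to build it for a small toy matrix. `np.triu_indices_from` returns the indices of the upper triangle including the diagonal, so those cells are hidden in the heatmap and only the entries strictly below the diagonal stay visible. The 4x4 size below is arbitrary.

```python
import numpy as np

toy = np.zeros((4, 4))
mask = np.zeros_like(toy)
mask[np.triu_indices_from(mask)] = True  # upper triangle + diagonal

print(mask)
visible_cells = int((mask == 0).sum())  # cells the heatmap will still draw
print(visible_cells)
```

Since the similarity matrix is symmetric and its diagonal is always 1, hiding these cells removes only redundant information.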

### III. Compute cos Distance Matrix

> 🐳 **Note**: These will improve substantially once the embeddings are fitted on the competition dataset.

[Source](https://www.geeksforgeeks.org/how-to-create-different-subplot-sizes-in-matplotlib/) of how I learned to do this chart :).

Because the embeddings are not trained, the similarity between the images is extremely low. In [my notebook here](https://www.kaggle.com/andradaolteanu/whales-dolphins-effnet-train-rapids-clusters) I train the model and then group the images by clusters, so the similarity matrix looks different (and improved :) ).
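
One way to put a number on "improved" is to compare the average similarity between images of the same individual with the average similarity between images of different individuals. The sketch below does this on synthetic embeddings (three made-up individuals with five images each), not on the real whale data; the helper name `mean_intra_inter_similarity` is an assumption for this example.

```python
import numpy as np

def mean_intra_inter_similarity(similarity, targets):
    """Mean similarity within the same individual vs. across individuals."""
    targets = np.asarray(targets)
    same = targets[:, None] == targets[None, :]
    off_diag = ~np.eye(len(targets), dtype=bool)  # skip self-similarity
    intra = similarity[same & off_diag].mean()
    inter = similarity[~same].mean()
    return intra, inter

# Synthetic data: each individual is a random center plus small noise
rng = np.random.default_rng(24)
centers = rng.normal(size=(3, 16))
targets = np.repeat(np.arange(3), 5)
embeddings = centers[targets] + 0.1 * rng.normal(size=(15, 16))

unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
similarity = unit @ unit.T
intra, inter = mean_intra_inter_similarity(similarity, targets)
print(f"intra={intra:.3f}  inter={inter:.3f}")
```

Well-trained embeddings should show a clear gap between the two numbers; with an untrained backbone the gap is expected to be much smaller.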

In [None]:
# Plots
fig = plt.figure(figsize=(12, 12))
ax1 = plt.subplot2grid(shape=(6, 6), loc=(5, 1), colspan=1)
ax2 = plt.subplot2grid(shape=(6, 6), loc=(5, 2), colspan=1)
ax3 = plt.subplot2grid(shape=(6, 6), loc=(5, 3), colspan=1)
ax4 = plt.subplot2grid(shape=(6, 6), loc=(5, 4), colspan=1)
ax5 = plt.subplot2grid(shape=(6, 6), loc=(5, 5), colspan=1)
h_axes = [ax1, ax2, ax3, ax4, ax5]

ax6 = plt.subplot2grid(shape=(6, 6), loc=(0, 0), colspan=1)
ax7 = plt.subplot2grid(shape=(6, 6), loc=(1, 0), colspan=1)
ax8 = plt.subplot2grid(shape=(6, 6), loc=(2, 0), colspan=1)
ax9 = plt.subplot2grid(shape=(6, 6), loc=(3, 0), colspan=1)
ax10 = plt.subplot2grid(shape=(6, 6), loc=(4, 0), colspan=1)
v_axes = [ax6, ax7, ax8, ax9, ax10]

ax11 = plt.subplot2grid(shape=(6, 6), loc=(0, 1), colspan=5, rowspan=5)

fig.suptitle('- Cosine Similarity -', size = 21, color = my_colors[7], weight='bold')
for k, ax in enumerate(h_axes):
    ax.imshow(plt.imread(example_paths[k]))
    ax.set_axis_off()
    
for k, ax in enumerate(v_axes):
    ax.imshow(plt.imread(example_paths[k]))
    ax.set_axis_off()
    
sns.heatmap(cos_matrix, ax=ax11, fmt=".5",
            cbar=False, annot=True, linewidths=0.5, mask=mask, square=True, cmap="winter_r")

plt.tight_layout()
plt.show();

<center><img src="https://i.imgur.com/0cx4xXI.png"></center>

### 🐝 W&B Dashboard

> My [W&B Dashboard](https://wandb.ai/andrada/happywhale?workspace=user-andrada).

<center><img src="https://i.imgur.com/XUClL9w.png" width=900></center>

<center><img src="https://i.imgur.com/knxTRkO.png"></center>

### My Specs

* 🖥 Z8 G4 Workstation
* 💾 2 CPUs & 96GB Memory
* 🎮 NVIDIA Quadro RTX 8000
* 💻 Zbook Studio G7 on the go