### Disclaimer: WORK IN PROGRESS

#### Check back soon for more! This aims to be the ultimate starter notebook for beginners in this competition. :)

### 😃 Author: Roshan Ram 
### 👀 LinkedIn: https://linkedin.com/in/roshanr11 

# 🔥 Herbarium 🌱 2022: Complete Starter NB 🔥

# Sources:
- Modeled notebook UI/Markdown structure off of https://www.kaggle.com/ruchi798/and-identification-eda-augmentation by @ruchi798
- Adapted starter EDA from https://www.kaggle.com/vad13irt/herbarium-2022-fast-exploratory-data-analysis by @vad13irt
- https://www.kaggle.com/jirkaborovec/herbarium-eda
- https://www.kaggle.com/venkatkumar001/herbarium-22-fgvc9-baseline
- https://www.kaggle.com/drcapa/herbarium-2022-starter
- https://www.kaggle.com/semack/herbaploration
- https://www.kaggle.com/odins0n/json-pandas-herbarium-2022
- https://www.kaggle.com/hamzaghanmi/welcome-herbarium-2022

# TL;DR (Jump Right In!) 🏃‍♂️ ⏩  

Herbarium 2022: Flora of North America is part of a New York Botanical Garden project, funded by the National Science Foundation, to build tools for identifying novel plant species around the world. The dataset aims to represent all known vascular plant taxa in North America, using images gathered from 60 botanical institutions worldwide. For each image `Id`, predict the corresponding label (`category_id`) in the `Predicted` column.
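
As a rough illustration of that output format, the sketch below writes a two-column CSV with the `Id` and `Predicted` headers named above. The ids and predicted values are made up for the example and say nothing about the real test set.

```python
import csv
import io

# Hypothetical (image id, predicted category_id) pairs, for illustration only.
rows = [("0", 1), ("1", 1), ("2", 42)]

# Write to an in-memory buffer so the sketch has no side effects on disk.
buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(["Id", "Predicted"])
writer.writerows(rows)

submission_text = buffer.getvalue()
print(submission_text)
```

In the notebook itself, the `create_submission` helper defined further down does the same job with pandas.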

# Background 🔎 

## Herbarium 2022: FGVC9

Herbarium 2022: Flora of North America is part of a [New York Botanical Garden](https://www.nybg.org/) project, funded by the [National Science Foundation](https://www.nsf.gov/awardsearch/showAward?AWD_ID=2054684&HistoricalAwards=false), to build tools for identifying novel plant species around the world. The dataset aims to represent all known vascular plant taxa in North America, using images gathered from 60 botanical institutions worldwide.

The Herbarium 2022: Flora of North America dataset comprises 1.05 M images of 15,500 vascular plants, which constitute more than 90% of the taxa documented in North America. We used the comprehensive checklist of the Vascular Plants of the Americas (VPA) produced by the Missouri Botanical Garden and aligned the taxonomic names to [The World Checklist of Vascular Plants (WCVP)](https://www.nature.com/articles/s41597-021-00997-6) from the [Royal Botanic Gardens, Kew](https://www.kew.org/). Our dataset is constrained to include only vascular land plants (lycophytes, ferns, gymnosperms, and flowering plants).

<!-- Source: https://www.kaggle.com/c/herbarium-2022-fgvc9/overview -->

<!-- The Herbarium 2021: Half-Earth Challenge is to identify vascular plant specimens provided by the [New York Botanical Garden (NY)](https://www.nybg.org/), [Bishop Museum (BPBM)](https://www.bishopmuseum.org/), [Naturalis Biodiversity Center (NL)](https://www.naturalis.nl/en), [Queensland Herbarium (BRI)](https://www.qld.gov.au/environment/plants-animals/plants/herbarium), and [Auckland War Memorial Museum (AK)](https://www.aucklandmuseum.com/).

*The Herbarium 2021: Half-Earth Challenge* dataset includes more than 2.5M images representing nearly 65,000 species from the Americas and Oceania that have been aligned to a standardized plant list ([LCVP v1.0.2](https://www.nature.com/articles/s41597-020-00702-z)). -->

## What is a herbarium?

A herbarium is a collection of preserved plants that are stored, catalogued, and arranged systematically for study by professional taxonomists (scientists who name and identify plants), botanists, and amateurs.

Creating a herbarium specimen involves pressing and drying plants between sheets of paper, a practice that has changed very little since it began some 500 years ago. Thanks to this simple technique, most of the characteristics of living plants remain visible on the dried plant. The few that do not (e.g. flower colour, scent, height of a tree, vegetation type) are written on the collection label by the collector. Most importantly, the label should tell us where and when the specimen was collected.

### A working reference collection

A herbarium acts like a plant library or vast catalogue, with each of our three million specimens providing unique information: where it was found, when it flowered, what it looks like, and its DNA, which remains intact for many years. DNA is now routinely extracted from herbarium specimens. The most important specimens are called 'types'. The type specimen, chosen by the author of the species name, becomes the physical reference for the new species.

This unique working reference collection brings species from all over the world together into one place to be discovered, described and compared. The work is disseminated through the writing of Floras (a description of all the plants in a country or region), monographs (a description of plants or fungi within a group, such as a family) and scientific papers. This fundamental research provides an essential baseline for other plant-based research and helps inform conservation practices.

<!-- Source: https://www.kaggle.com/shaunthesheep/fgvc7-herbarium-2020-data-viz -->

[Read more about this here!](https://www.rbge.org.uk/science-and-conservation/herbarium/)

<!-- # Conscious planet movement? -->

# Terminology 📝

If you're as new as I am to all this plant stuff, chances are that you are just as clueless as I am about what any of the terms used in the Description and Data mean.   

Let's figure it out together.

Terms:
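
As a rough orientation, the metadata describes each label with a few nested taxonomic ranks: a family contains genera, a genus contains species, and the scientific name combines the genus, the species, and the naming authors. The sketch below uses an illustrative record with the same field names the helper functions later read (`scientificName`, `family`, `genus`, `species`, `authors`). The `category_id` is hypothetical and the record is not copied from the dataset.

```python
# Illustrative category record. The field names mirror those read by the
# helper functions in this notebook; the category_id is hypothetical.
category = {
    "category_id": 0,
    "scientificName": "Quercus alba L.",
    "family": "Fagaceae",
    "genus": "Quercus",
    "species": "alba",
    "authors": "L.",
}


def describe_taxon(record):
    """Return a one-line summary going from the broadest rank to the narrowest."""
    binomial = f"{record['genus']} {record['species']}"
    return (
        f"family {record['family']} > genus {record['genus']} "
        f"> species {binomial} (named by {record['authors']})"
    )


summary = describe_taxon(category)
print(summary)
```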

In [None]:
!pip install -q addict

# Import libraries 📚

In [None]:
import pandas as pd
import numpy as np
from addict import Dict
import json
import matplotlib.pyplot as plt
import seaborn as sns
import os
import cv2
import random

# Important Helper Functions 😉

In [None]:
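def load_image(path, color_conversion=cv2.COLOR_BGR2RGB):
    # Fail early with a clear error if the file is missing.
    if not os.path.exists(path):
        raise FileNotFoundError(f"Path '{path}' doesn't exist.")

    # cv2.imread returns None (instead of raising) when it cannot decode a file.
    image = cv2.imread(path)
    if image is None:
        raise ValueError(f"Could not read an image from '{path}'.")

    # OpenCV loads images as BGR; convert to RGB for matplotlib.
    return cv2.cvtColor(image, color_conversion)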


def get_institutions_data_frame(institutions):
    all_ids, all_codes = [], []
    
    for institution in institutions:
        all_ids.append(institution["institution_id"])
        all_codes.append(institution["collectionCode"])
        
    data_frame = pd.DataFrame({
        "institution_id": all_ids,
        "institution": all_codes,
    })
    
    return data_frame


def get_genera_data_frame(genera):
    all_ids, all_genuses = [], []
    for genus in genera:
        all_ids.append(genus["genus_id"])
        all_genuses.append(genus["genus"])
        
    data_frame = pd.DataFrame({
        "genus_id": all_ids,
        "genus": all_genuses
    })
    
    return data_frame


def get_categories_data_frame(categories):
    all_ids, all_names, all_families, all_genuses, all_species, all_authors = [], [], [], [], [], []
    
    for category in categories:
        all_ids.append(category["category_id"])
        all_names.append(category["scientificName"])
        all_families.append(category["family"])
        all_genuses.append(category["genus"])
        all_species.append(category["species"])
        all_authors.append(category["authors"])
        
    data_frame = pd.DataFrame({
        "category_id": all_ids,
        "category": all_names,
        "family": all_families,
        "genus": all_genuses,
        "species": all_species,
        "authors": all_authors
    })
    
    return data_frame


def get_annotations_data_frame(annotations):
    all_genuses, all_institutions, all_categories, all_images = [], [], [], []
    
    for annotation in annotations:
        all_genuses.append(annotation["genus_id"])
        all_institutions.append(annotation["institution_id"])
        all_categories.append(annotation["category_id"])
        all_images.append(annotation["image_id"])
        
        
    data_frame = pd.DataFrame({
        "genus": all_genuses,
        "institution_id": all_institutions,
        "category_id": all_categories,
        "image_id": all_images,
    })
    
    
    return data_frame


def get_images_data_frame(images):
    all_ids, all_pathes = [], []
    for image in images:
        all_ids.append(image["image_id"])
        all_pathes.append(image["file_name"])
        
    data_frame = pd.DataFrame({
        "image_id": all_ids,
        "image_path": all_pathes,
    })

    return data_frame
    
    
def get_meta_data_frame(meta):
    annotations = get_annotations_data_frame(meta["annotations"])
    annotations = annotations.drop("genus", axis=1)
    images = get_images_data_frame(meta["images"])
    categories = get_categories_data_frame(meta["categories"])
    genera = get_genera_data_frame(meta["genera"])
    institutions = get_institutions_data_frame(meta["institutions"])
    
    data_frame = annotations.merge(images, on="image_id")
    data_frame = data_frame.merge(categories, on="category_id")
    data_frame = data_frame.merge(institutions, on="institution_id")
    
    data_frame = data_frame.drop(["image_id", "category_id", "institution_id"], axis=1)
    
    return data_frame


def create_submission(ids, predictions, path="submission.csv"):
    submission = pd.DataFrame({
        "Id": ids,
        "Predicted": predictions,
    })
    
    submission.to_csv(path, index=False)
    return submission

def read_json(path):
    with open(path, "r", encoding="utf-8") as file:
        data = json.load(file)
    return data


def plot_category_images(data_frame, category, title="Title", rows=1, columns=5, background_color="#fff"):
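    # Filter first so an unknown category fails before an empty figure is created.
    data_frame = data_frame[data_frame["category"] == category]
    if data_frame.empty:
        raise ValueError(f"No images found for category '{category}'.")

    fig = plt.figure(figsize=(columns*3, rows*5))
    fig.set_facecolor(background_color)
    images = rows * columns
    genuses = ", ".join(data_frame["genus"].unique())
    families = ", ".join(data_frame["family"].unique())
    species = ", ".join(data_frame["species"].unique())
    # The description is identical for every subplot, so build it once.
    description = f"Families: {families}\nGenera: {genuses}\nSpecies: {species}"
    
    for i in range(images):
        # Images are sampled with replacement, so small categories may repeat.
        index = random.randint(0, len(data_frame)-1)
        sample = data_frame.iloc[index]
        image_path = sample["image_path"]
        image = load_image(image_path)
        filename = image_path.split("/")[-1]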
        
        ax = fig.add_subplot(rows, columns, i+1)
        ax.set_facecolor(background_color)
        ax.imshow(image)
        ax.set_xlabel(filename, color="#000", fontsize=14, y=1)
        ax.xaxis.set_tick_params(labelsize=0, size=0)
        ax.yaxis.set_visible(False)
        hide_spines(ax)
    
    fig.suptitle(title, fontsize=25, fontweight="bold", fontfamily="serif", horizontalalignment="left", y=1.1, x=0.01)
    fig.text(s=description, x=0.01, y=0.97, fontsize=17, fontfamily="serif", horizontalalignment="left")
    fig.tight_layout(h_pad=5, w_pad=2)
    fig.show()
    
def hide_spines(ax, spines=("top", "right", "left", "bottom")):
    for spine in spines:
        ax.spines[spine].set_visible(False)
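
To make the join logic in `get_meta_data_frame` more concrete, here is a minimal sketch on a tiny, made-up metadata dictionary. It assumes each list holds flat records with the keys used above, and it builds the frames with `pd.DataFrame(records)` directly instead of the loop-based helpers. All values are hypothetical.

```python
import pandas as pd

# Tiny made-up metadata with the same top-level keys and field names the
# helpers above read; the values are hypothetical.
toy_meta = {
    "annotations": [
        {"genus_id": 1, "institution_id": 0, "category_id": 10, "image_id": "a"},
        {"genus_id": 1, "institution_id": 1, "category_id": 10, "image_id": "b"},
        {"genus_id": 2, "institution_id": 0, "category_id": 20, "image_id": "c"},
    ],
    "images": [
        {"image_id": "a", "file_name": "000/a.jpg"},
        {"image_id": "b", "file_name": "000/b.jpg"},
        {"image_id": "c", "file_name": "001/c.jpg"},
    ],
    "categories": [
        {"category_id": 10, "scientificName": "Genus1 species1", "family": "Family1"},
        {"category_id": 20, "scientificName": "Genus2 species2", "family": "Family2"},
    ],
}

annotations = pd.DataFrame(toy_meta["annotations"])
images = pd.DataFrame(toy_meta["images"])
categories = pd.DataFrame(toy_meta["categories"])

# validate= makes pandas raise if a key is unexpectedly duplicated, so a bad
# join fails loudly instead of silently multiplying rows.
toy = (
    annotations
    .merge(images, on="image_id", how="left", validate="one_to_one")
    .merge(categories, on="category_id", how="left", validate="many_to_one")
)

print(toy[["image_id", "file_name", "scientificName", "family"]])
```

Using `how="left"` keeps every annotation even when a lookup is missing, which makes gaps visible as `NaN` instead of dropping rows.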

In [None]:
pathes = Dict({
    "train_meta": "../input/herbarium-2022-fgvc9/train_metadata.json",
    "train_images": "../input/herbarium-2022-fgvc9/train_images",
    "test_meta": "../input/herbarium-2022-fgvc9/test_metadata.json",
    "test_images": "../input/herbarium-2022-fgvc9/test_images",
    "sample_submission": "../input/herbarium-2022-fgvc9/sample_submission.csv",
})

# Read in Data 📖

In [None]:
train_meta = read_json(pathes.train_meta)
train = get_meta_data_frame(train_meta)
train["image_path"] = train["image_path"].apply(lambda _: os.path.join(pathes.train_images, _))


test_meta = read_json(pathes.test_meta)
test = get_images_data_frame(test_meta)
test["image_path"] = test["image_path"].apply(lambda _: os.path.join(pathes.test_images, _))


sample_submission = pd.read_csv(pathes.sample_submission)
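
Before plotting, it can be worth checking that the joined image paths actually point at files. The sketch below is one possible way to do that. `find_missing_files` is a hypothetical helper, not part of the notebook, and the demo runs on a temporary directory so it does not depend on the competition data.

```python
import os
import random
import tempfile


def find_missing_files(paths, sample_size=100, seed=0):
    """Return the sampled paths that do not exist on disk."""
    paths = list(paths)
    rng = random.Random(seed)
    # Checking a random sample keeps this fast on very large path lists.
    sample = paths if len(paths) <= sample_size else rng.sample(paths, sample_size)
    return [p for p in sample if not os.path.exists(p)]


# Demo on a temporary directory so the snippet runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    present = os.path.join(tmp, "present.jpg")
    open(present, "wb").close()  # create an empty placeholder file
    absent = os.path.join(tmp, "absent.jpg")
    missing = find_missing_files([present, absent])

print(missing)
```

In this notebook, a call such as `find_missing_files(train["image_path"])` would be the natural use, assuming the `train` frame built above.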

# Gentle EDA 📈

In [None]:
fig = plt.figure(figsize=(20, 35))
fig.set_facecolor("#fff")
ax = fig.add_subplot()
ax.set_facecolor("#fff")
ax.grid(color="lightgrey", alpha=0.7, axis="both", zorder=0)
sns.countplot(y="institution", data=train, ax=ax, zorder=2)
ax.xaxis.set_tick_params(labelsize=12, size=0, pad=5)
ax.yaxis.set_tick_params(labelsize=12, size=0, pad=5)
ax.set_ylabel("Institution", fontsize=15, fontweight="normal", labelpad=10)
ax.set_xlabel("Count", fontsize=15, fontweight="normal", labelpad=10)
hide_spines(ax)

ax.set_title("Institution Distribution", fontsize=25, fontweight="bold", fontfamily="serif", loc="left", y=1.02)
fig.show()

In [None]:
k = 50
topk_authors = train["authors"].value_counts()[:k]

fig = plt.figure(figsize=(10, 25))
fig.set_facecolor("#fff")
ax = fig.add_subplot()
ax.set_facecolor("#fff")
ax.grid(color="lightgrey", alpha=0.7, axis="both", zorder=0)
sns.barplot(x=topk_authors.values, y=topk_authors.index, ax=ax, zorder=2)
ax.xaxis.set_tick_params(labelsize=12, size=0, pad=5)
ax.yaxis.set_tick_params(labelsize=12, size=0, pad=5)
ax.set_ylabel("Author", fontsize=15, fontweight="normal", labelpad=10)
ax.set_xlabel("Author's works", fontsize=15, fontweight="normal", labelpad=10)
hide_spines(ax)

ax.set_title(f"Top {k} Authors Distribution", fontsize=25, fontweight="bold", fontfamily="serif", loc="left", y=1.02)
fig.show()
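
The top-k cells in this section repeat the same plotting steps, so one option is to factor them into a small helper. The sketch below is only an illustration: `plot_top_k` is a hypothetical name, it uses plain matplotlib instead of seaborn to stay short, and it runs on a made-up series.

```python
import matplotlib.pyplot as plt
import pandas as pd


def plot_top_k(series, k=50, label="Value", title=None):
    """Draw a horizontal bar chart of the k most frequent values in `series`."""
    counts = series.value_counts().head(k)
    fig, ax = plt.subplots(figsize=(10, max(2, 0.5 * len(counts))))
    # Reverse the order so the most frequent value ends up at the top.
    names = [str(name) for name in counts.index[::-1]]
    ax.barh(names, counts.values[::-1], zorder=2)
    ax.grid(color="lightgrey", alpha=0.7, zorder=0)
    ax.set_xlabel("Count")
    ax.set_ylabel(label)
    ax.set_title(title or f"Top {len(counts)} {label} values", loc="left")
    return fig, counts


# Made-up data, for illustration only.
toy = pd.Series(["a", "b", "a", "c", "a", "b"])
fig, counts = plot_top_k(toy, k=2, label="Author")
print(counts)
```

With a helper like this, each chart could become a single call, for example `plot_top_k(train["authors"], k=50, label="Author")`.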

In [None]:
k = 50
topk_species = train["species"].value_counts().head(k)

fig = plt.figure(figsize=(10, 25))
fig.set_facecolor("#fff")
ax = fig.add_subplot()
ax.set_facecolor("#fff")
ax.grid(color="lightgrey", alpha=0.7, axis="both", zorder=0)
sns.barplot(x=topk_species.values, y=topk_species.index, ax=ax, zorder=2)
ax.xaxis.set_tick_params(labelsize=12, size=0, pad=5)
ax.yaxis.set_tick_params(labelsize=12, size=0, pad=5)
ax.set_ylabel("Species", fontsize=15, fontweight="normal", labelpad=10)
ax.set_xlabel("Count", fontsize=15, fontweight="normal", labelpad=10)
hide_spines(ax)

ax.set_title(f"Top {k} Species Distribution", fontsize=25, fontweight="bold", fontfamily="serif", loc="left", y=1.02)
fig.show()

In [None]:
k = 50
topk_genera = train["genus"].value_counts().head(k)

fig = plt.figure(figsize=(10, 25))
fig.set_facecolor("#fff")
ax = fig.add_subplot()
ax.set_facecolor("#fff")
ax.grid(color="lightgrey", alpha=0.7, axis="both", zorder=0)
sns.barplot(x=topk_genera.values, y=topk_genera.index, ax=ax, zorder=2)
ax.xaxis.set_tick_params(labelsize=12, size=0, pad=5)
ax.yaxis.set_tick_params(labelsize=12, size=0, pad=5)
ax.set_ylabel("Genus", fontsize=15, fontweight="normal", labelpad=10)
ax.set_xlabel("Count", fontsize=15, fontweight="normal", labelpad=10)
hide_spines(ax)

ax.set_title(f"Top {k} Genera Distribution", fontsize=25, fontweight="bold", fontfamily="serif", loc="left", y=1.02)
fig.show()

In [None]:
fig = plt.figure(figsize=(20, 100))
fig.set_facecolor("#fff")
ax = fig.add_subplot()
ax.set_facecolor("#fff")
ax.grid(color="lightgrey", alpha=0.7, axis="both", zorder=0)
sns.countplot(y="family", data=train, ax=ax, zorder=2)
ax.xaxis.set_tick_params(labelsize=12, size=0, pad=5)
ax.yaxis.set_tick_params(labelsize=12, size=0, pad=5)
ax.set_ylabel("Family", fontsize=15, fontweight="normal", labelpad=10)
ax.set_xlabel("Count", fontsize=15, fontweight="normal", labelpad=10)
hide_spines(ax)

ax.set_title("Family Distribution", fontsize=25, fontweight="bold", fontfamily="serif", loc="left", y=1.02)
fig.show()

In [None]:
k = 50
topk_categories = train["category"].value_counts().head(k)

fig = plt.figure(figsize=(10, 25))
fig.set_facecolor("#fff")
ax = fig.add_subplot()
ax.set_facecolor("#fff")
ax.grid(color="lightgrey", alpha=0.7, axis="both", zorder=0)
sns.barplot(x=topk_categories.values, y=topk_categories.index, ax=ax, zorder=2)
ax.xaxis.set_tick_params(labelsize=12, size=0, pad=5)
ax.yaxis.set_tick_params(labelsize=12, size=0, pad=5)
ax.set_ylabel("Category", fontsize=15, fontweight="normal", labelpad=10)
ax.set_xlabel("Count", fontsize=15, fontweight="normal", labelpad=10)
hide_spines(ax)

ax.set_title(f"Top {k} Categories Distribution", fontsize=25, fontweight="bold", fontfamily="serif", loc="left", y=1.02)
fig.show()

In [None]:
k = 50
bottomk_categories = train["category"].value_counts().tail(k)

fig = plt.figure(figsize=(10, 25))
fig.set_facecolor("#fff")
ax = fig.add_subplot()
ax.set_facecolor("#fff")
ax.grid(color="lightgrey", alpha=0.7, axis="both", zorder=0)
sns.barplot(x=bottomk_categories.values, y=bottomk_categories.index, ax=ax, zorder=2)
ax.xaxis.set_tick_params(labelsize=12, size=0, pad=5)
ax.yaxis.set_tick_params(labelsize=12, size=0, pad=5)
ax.set_ylabel("Category", fontsize=15, fontweight="normal", labelpad=10)
ax.set_xlabel("Count", fontsize=15, fontweight="normal", labelpad=10)
hide_spines(ax)

ax.set_title(f"Bottom {k} Categories Distribution", fontsize=25, fontweight="bold", fontfamily="serif", loc="left", y=1.02)
fig.show()
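
Top and bottom charts like these are a common way to eyeball class imbalance, and a few summary numbers can complement them. The sketch below runs on a made-up label column, so the printed figures say nothing about the real dataset. `imbalance_summary` is a hypothetical helper.

```python
import pandas as pd


def imbalance_summary(labels):
    """Summarise how unevenly samples are spread across classes."""
    counts = pd.Series(labels).value_counts()
    return {
        "classes": int(counts.size),
        "min": int(counts.min()),
        "median": float(counts.median()),
        "max": int(counts.max()),
        "max_to_min_ratio": float(counts.max() / counts.min()),
    }


# Made-up labels: 8 samples of "a", 4 of "b", 2 of "c".
toy_labels = ["a"] * 8 + ["b"] * 4 + ["c"] * 2
summary = imbalance_summary(toy_labels)
print(summary)
```

On the real data, `imbalance_summary(train["category"])` would give the same statistics per category, assuming the `train` frame built earlier.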

In [None]:
categories = train["category"].unique()
for category in categories[::2500]:
    plot_category_images(train, category=category, rows=2, columns=5, title=category, background_color="#fff")
