The Herbarium 2021: Half-Earth Challenge is to identify vascular plant specimens provided by the New York Botanical Garden (NY), Bishop Museum (BPBM), Naturalis Biodiversity Center (NL), Queensland Herbarium (BRI), and Auckland War Memorial Museum (AK).

The Herbarium 2021: Half-Earth Challenge dataset includes more than 2.5M images representing nearly 65,000 species from the Americas and Oceania that have been aligned to a standardized plant list (LCVP v1.0.2).

## Disclaimer
This kernel is heavily inspired from [@yasufuminakama](https://www.kaggle.com/yasufuminakama)'s kernel from last year's competition [Herbarium 2020 PyTorch Resnet18 [inference]](https://www.kaggle.com/yasufuminakama/herbarium-2020-pytorch-resnet18-inference)

<a id = "basic"></a>
# Packages 📦 and Basic Setup

In the following **hidden** code cell, we:

* Import the required libraries (Main ones being torch, torchvision and sklearn)
* Print the device configuration
* Set Random Seed 🌱 to ensure reproducibility
* Create a Logger 📃 for Event Logging

In [1]:
# Import Statements
import os # To set Random Seed for Reproducibility
import cv2 # For Image 🌌 Processing
import json # For Reading in the JSON file
import torch # The Main Machine Learning Framework
import random # To set Random Seed for Reproducibility
import logging # For Event Logging
import sklearn # For LabelEncoder and Metrics
import torchvision # For creating a pretrained model
import numpy as np # For Numerical Processing
import pandas as pd # For creating DataFrames 
import albumentations # For Image Augmentations
from tqdm import tqdm # For Creating ProgressBar
from sklearn import preprocessing # For the 🏷 Label Encoder
from albumentations.pytorch import ToTensorV2 # For Converting to torch.Tensor
from sklearn.model_selection import StratifiedKFold # For Cross Validation


# Device Configuration
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
device

# Setting RandomSeed🌱 for Reproducibility 
def seed_torch(seed:int =42):
    random.seed(seed)
    os.environ['PYTHONHASHSEED'] = str(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)
    torch.backends.cudnn.deterministic = True

seed_torch()

# Creating a logger 📃
def init_logger(log_file:str ='training.log'):
    
    # Specify the format 
    formatter = logging.Formatter('%(levelname)s:%(name)s:%(message)s')
    
    # Create a StreamHandler Instance
    stream_handler = logging.StreamHandler()
    stream_handler.setLevel(logging.DEBUG)
    stream_handler.setFormatter(formatter)
    
    # Create a FileHandler Instance
    file_handler = logging.FileHandler(log_file)
    file_handler.setFormatter(formatter)
    
    # Create a logging.Logger Instance
    logger = logging.getLogger('Herbarium')
    logger.setLevel(logging.DEBUG)
    logger.addHandler(stream_handler)
    logger.addHandler(file_handler)
    
    return logger

LOGGER = init_logger()
LOGGER.info("Logger Initialized")

INFO:Herbarium:Logger Initialized


In [2]:
# Basic Parameters for the Model
N_CLASSES = 64500
HEIGHT = 128
WIDTH = 128
batch_size = 512
n_epochs = 1
lr = 4e-4

In [3]:
%%time
with open('../input/herbarium-2021-fgvc8/train/metadata.json', "r", encoding="ISO-8859-1") as file:
    train = json.load(file)

train_img = pd.DataFrame(train['images'])
train_ann = pd.DataFrame(train['annotations']).drop(columns='image_id')
train_df = train_img.merge(train_ann, on='id')
LOGGER.info("Train DataFrame Created: ✅")
train_df.head()

INFO:Herbarium:Train DataFrame Created: ✅


CPU times: user 16.1 s, sys: 1.83 s, total: 17.9 s
Wall time: 21.7 s


Unnamed: 0,file_name,height,id,license,width,category_id,institution_id
0,images/604/92/1814367.jpg,1000,1814367,0,678,60492,0
1,images/108/24/1308257.jpg,1000,1308257,0,666,10824,0
2,images/330/76/1270453.jpg,1000,1270453,0,739,33076,3
3,images/247/99/1123834.jpg,1000,1123834,0,672,24799,0
4,images/170/18/1042410.jpg,1000,1042410,0,675,17018,0


In [4]:
%%time
with open('../input/herbarium-2021-fgvc8/test/metadata.json', "r", encoding="ISO-8859-1") as file:
    test = json.load(file)

test_df = pd.DataFrame(test['images'])
LOGGER.info("Test DataFrame Created: ✅")
test_df.head()

INFO:Herbarium:Test DataFrame Created: ✅


CPU times: user 1.19 s, sys: 95.6 ms, total: 1.29 s
Wall time: 1.52 s


Unnamed: 0,file_name,height,id,license,width
0,images/000/0.jpg,1000,0,0,680
1,images/000/1.jpg,1000,1,0,681
2,images/000/2.jpg,1000,2,0,676
3,images/000/3.jpg,1000,3,0,666
4,images/000/4.jpg,1000,4,0,676


In [5]:
sample_submission = pd.read_csv('../input/herbarium-2021-fgvc8/sample_submission.csv')
sample_submission.head()

Unnamed: 0,Id,Predicted
0,0,0
1,1,0
2,2,0
3,3,0
4,4,0


<a id = "label"></a>

# 🏷 Label Encoder

We use the `LabelEncoder` from sklearn.preprocessing in order to encode target labels with value between `0` and `n_classes-1`.

In [6]:
# Create a Instance of LabelEncoder
le = preprocessing.LabelEncoder()
LOGGER.info("LabelEncoder Instance created ✅")

# Fits the label encoder instance
LOGGER.info("Fitting the LabelEncoder Instance")
le.fit(train_df['category_id'])

# To Transform labels to normalized encoding
LOGGER.info("Converting Labels to Normalized Encoding")
train_df['category_id_le'] = le.transform(train_df['category_id'])
class_map = dict(sorted(train_df[['category_id_le', 'category_id']].values.tolist()))

INFO:Herbarium:LabelEncoder Instance created ✅
INFO:Herbarium:Fitting the LabelEncoder Instance
INFO:Herbarium:Converting Labels to Normalized Encoding


<a id = "data"></a>
# 💿 Dataset and DataLoader 

The following code cell aims to convert the Herbarium dataset into a torch `torch.utils.data.Dataset` object.

All Dataset objects in pytorch represent a map from keys to data samples. We create a subclass which overwrites the **getitem()** and **len()** to method (for it to work well with the `torch.utils.data.DataLoader`).

In the **getitem()** method, we use df[].values[] to get the file_nameand then use cv2 to read the image. If the transform bool is set to True, we apply the transforms.

Each element of our dataset returns:

* Image

In [7]:
class TestDataset(torch.utils.data.Dataset):
    """
    Custom Dataset Class
    """
    def __init__(self, df, transform=None):
        self.df = df
        self.transform = transform
        
    def __len__(self) -> int:
        return len(self.df)

    def __getitem__(self, idx):
        file_name = self.df['file_name'].values[idx]
        file_path = f'../input/herbarium-2021-fgvc8/test/{file_name}'
        image = cv2.imread(file_path)
        image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
        
        if self.transform:
            augmented = self.transform(image=image)
            image = augmented['image']
        
        return image

## Image Augmentation:🌆 -> 🌇

Applying Standard Image Augmentation Techniques, such as `Resize`, `Normalize` and conversion to `torch.Tensor`

In [8]:
def get_transforms():

    return albumentations.Compose([
        albumentations.Resize(HEIGHT, WIDTH),
        albumentations.Normalize(
            mean=[0.485, 0.456, 0.406],
            std=[0.229, 0.224, 0.225],
        ),
        ToTensorV2(),
    ])

In [9]:
# Create Test Dataset
test_dataset = TestDataset(test_df, transform=get_transforms())
LOGGER.info("Test Dataset Object Created ✅")

# Create Test DataLoader
LOGGER.info("Creating Test DataLoader")
test_loader = torch.utils.data.DataLoader(test_dataset, batch_size=batch_size, shuffle=False)
LOGGER.info("Test DataLoader created")

INFO:Herbarium:Test Dataset Object Created ✅
INFO:Herbarium:Creating Test DataLoader
INFO:Herbarium:Test DataLoader created


# The Model 👷‍♀️

---

### Transfer Learning

The main aim of transfer learning (TL) is to implement a model quickly i.e. instead of creating a DNN (dense neural network) from scratch, the model will transfer the features it has learned from the different dataset that has performed the same task. This transaction is also known as knowledge transfer.

### Resnet18

A residual network, or ResNet for short, is a DNN that helps to build deeper neural networks by utilizing skip connections or shortcuts to jump over some layers. This helps solve the problem of **vanishing gradients**.

There are different versions of ResNet, including ResNet-18, ResNet-34, ResNet-50, and so on. The numbers denote layers, although the architecture is the same.

![](https://i.imgur.com/XwcnU5x.png)

In the end, we just add a Adaptive Pooling Layer and a Fully Connected Layer with output dimensions equal to the number of classes and load the weights from the Training Kernel

In [10]:
%%capture
# Creating a instance of a Resnet18 pretrained Model
model = torchvision.models.resnet18(pretrained=True)

# Add a Adaptive Average Pooling Layer
model.avgpool = torch.nn.AdaptiveAvgPool2d(1)

# Add a Fully connected Layer with N_CLASSES as the output dimension
model.fc = torch.nn.Linear(model.fc.in_features, N_CLASSES)

LOGGER.info("Model Created ✅")

weights_path = '../input/herbarium-2021-pytorch-weights/fold0_best_score.pth'

LOGGER.info("Loading Weights")
model.load_state_dict(torch.load(weights_path))
LOGGER.info("Weights Loaded ✅")

INFO:Herbarium:Model Created ✅
INFO:Herbarium:Loading Weights
INFO:Herbarium:Weights Loaded ✅


<a id = "infer"></a>
# Inference 💪🏻

looping over our data iterator, and feed the inputs to the network

In [11]:
model.to(device) 
    
preds = np.zeros((len(test_dataset)))

for i, images in tqdm(enumerate(test_loader)):
            
    images = images.to(device)
            
    with torch.no_grad():
        y_preds = model(images)
            
    preds[i * batch_size: (i+1) * batch_size] = y_preds.argmax(1).to('cpu').numpy()

475it [1:13:38,  9.30s/it]


# Submit

In [12]:
test_df['preds'] = preds.astype(int)
submission = sample_submission.merge(test_df.rename(columns={'id': 'Id'})[['Id', 'preds']], on='Id').drop(columns='Predicted')
submission['Predicted'] = submission['preds'].map(class_map)
submission = submission.drop(columns='preds')
submission.to_csv('submission.csv', index=False)
submission.head()

Unnamed: 0,Id,Predicted
0,0,45246
1,1,47386
2,2,35602
3,3,38896
4,4,4731
