# **<span style='color:#A80808'>🎯 Goal</span>**

Identify plant species of the Americas from herbarium specimens.

# **<span style='color:#A80808'>🔑 Metric</span>**
Submissions are evaluated using the macro F1 score. For each image Id, you should predict the corresponding image label (category_id) in the Predicted column. 
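
To make the metric concrete, here is a minimal pure-Python sketch of macro F1: the F1 score is computed per class and then averaged with equal weight, so a rare species counts as much as a common one. The function name `macro_f1` and the toy labels are illustrative only. Averaging over every label that appears in either the truth or the predictions is an assumption; the source does not say which label set the competition scorer uses.

```python
from collections import defaultdict


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all labels seen in y_true or y_pred."""
    tp = defaultdict(int)  # true positives per class
    fp = defaultdict(int)  # false positives per class
    fn = defaultdict(int)  # false negatives per class
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1
            fn[truth] += 1

    labels = set(y_true) | set(y_pred)  # assumption: average over the union
    scores = []
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy example (made-up labels): class 0 is common, classes 1 and 2 are rare.
y_true = [0, 0, 0, 0, 1, 2]
y_pred = [0, 0, 0, 0, 2, 2]
score = macro_f1(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(f"accuracy={accuracy:.3f}, macro F1={score:.3f}")
```

In this toy case five of six predictions are correct, yet the macro F1 is much lower because the single missed image wipes out the whole score for class 1.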

# **<span style='color:#A80808'>💾 Data</span>**

This dataset uses the COCO dataset format with additional annotation fields. In addition to the species category labels, we also provide supercategory information.

The training set metadata (train_metadata.json) and test set metadata (test_metadata.json) are JSON files in the format below. Naturally, the test set metadata file omits the "annotations", "categories", and "regions" elements.

```
{
  "annotations" : [annotation],
  "categories" : [category],
  "genera" : [genus],
  "images" : [image],
  "distances" : [distance],
  "licenses" : [license],
  "institutions" : [institution]
}

annotation {
  "image_id" : int,
  "category_id" : int,
  "genus_id" : int,
  "institution_id" : int
}

image {
  "image_id" : int,
  "file_name" : str,
  "license" : int
}

category {
  "category_id" : int,
  "scientificName" : str,
  # We also provide a super-category for each species.
  "authors" : str,  # corresponds to the 'authors' field in the wcvp
  "family" : str,   # corresponds to the 'family' field in the wcvp
  "genus" : str,    # corresponds to the 'genus' field in the wcvp
  "species" : str   # corresponds to the 'species' field in the wcvp
}

genera {
  "genus_id" : int,
  "genus" : str
}

distance {
  # We provide the pairwise evolutionary distance between categories (genus_id0 < genus_id1).
  "genus_id0" : int,
  "genus_id1" : int,
  "distance" : float
}

institution {
  "institution_id" : int,
  "collectionCode" : str
}

license {
  "id" : int,
  "name" : str,
  "url" : str
}
```
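
Because each pair of genera is stored only once (with genus_id0 < genus_id1), looking up a distance in either order needs the pair to be normalized first. The sketch below shows one way to do that with a plain dictionary. The helper names and the toy records are hypothetical, and treating the distance from a genus to itself as 0.0 is an assumption. The key names are parameters because the schema above says `genus_id0`/`genus_id1`, while the Distances cell later in this notebook reads `genus_id_x`/`genus_id_y`.

```python
def build_distance_lookup(distance_records, key0="genus_id0", key1="genus_id1"):
    """Map each (smaller_id, larger_id) pair to its stored distance."""
    lookup = {}
    for record in distance_records:
        a, b = record[key0], record[key1]
        lookup[(min(a, b), max(a, b))] = record["distance"]
    return lookup


def genus_distance(lookup, genus_a, genus_b):
    """Return the distance between two genera regardless of argument order."""
    if genus_a == genus_b:
        return 0.0  # assumption: a genus is at distance 0 from itself
    return lookup[(min(genus_a, genus_b), max(genus_a, genus_b))]


# Toy records with made-up values, shaped like the "distances" entries.
toy_distances = [
    {"genus_id0": 0, "genus_id1": 1, "distance": 0.25},
    {"genus_id0": 0, "genus_id1": 2, "distance": 0.75},
    {"genus_id0": 1, "genus_id1": 2, "distance": 0.5},
]
lookup = build_distance_lookup(toy_distances)
print(genus_distance(lookup, 2, 0))  # same pair as (0, 2)
```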

The training set images are organized in subfolders h22-train/images/subfolder1/subfolder2/image_id.jpg, where subfolder1 and subfolder2 come from the first three and the last two digits of the image_id, respectively. Each image_id combines the category_id with a unique number that distinguishes images within the same plant taxon.

The test set images are organized in subfolders test/images/subfolder/image_id.jpg, where subfolder is the integer division of the image_id by 1000. For example, a test image with an image_id of 8005 can be found at h22-test/images/008/test-008005.jpg.
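
The test-image rule can be sketched as a small helper. The function name `build_test_image_path` is hypothetical, and the zero-padding widths (three digits for the subfolder, six for the file name) and the `test-` prefix are inferred from the single example path, so they are assumptions rather than a documented rule.

```python
def build_test_image_path(image_id, root="h22-test/images"):
    """Build the relative path of a test image from its integer image_id."""
    subfolder = image_id // 1000  # integer division by 1000
    # Padding widths are inferred from the example ".../008/test-008005.jpg".
    return f"{root}/{subfolder:03d}/test-{image_id:06d}.jpg"


example_path = build_test_image_path(8005)
print(example_path)
```

If the metadata already provides a `file_name` for each image, that field is the safer source for the path; the helper is mainly useful as a sanity check.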

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import plotly.express as px
import json

## **<span style='color:#A80808'>Train metadata</span>**

In [None]:
with open("../input/herbarium-2022-fgvc9/train_metadata.json") as json_file:
    train_metadata = json.load(json_file)
    
print('The keys in train_metadata.json:')
for key in train_metadata.keys(): 
    print(key)
    print(len(train_metadata[key]), 'rows')
    print(train_metadata[key][0])

## **<span style='color:#A80808'>Annotations</span>**

In [None]:
annotations = pd.DataFrame()
annotations['image_id'] = [annotation["image_id"] for annotation in train_metadata["annotations"]]
annotations['category_id'] = [annotation["category_id"] for annotation in train_metadata["annotations"]]
annotations['genus_id'] = [annotation["genus_id"] for annotation in train_metadata["annotations"]]
annotations['institution_id'] = [annotation["institution_id"] for annotation in train_metadata["annotations"]]
annotations.head()

In [None]:
fig = px.histogram(annotations['category_id'],  marginal="violin", nbins = 100, template="plotly_white", color_discrete_sequence=['red'])
fig.show()
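
The histogram above groups many category_id values into each bin, so it does not show how many images an individual species has. Since macro F1 weights every species equally, per-species counts are worth a look. The sketch below counts images per category with `collections.Counter` on made-up records shaped like the annotation entries; on the real data, `annotations['category_id'].value_counts()` gives the same information in pandas.

```python
from collections import Counter

# Toy annotation records (made-up values) with the same keys as the metadata.
toy_annotations = [
    {"image_id": 0, "category_id": 0},
    {"image_id": 1, "category_id": 0},
    {"image_id": 2, "category_id": 0},
    {"image_id": 3, "category_id": 0},
    {"image_id": 4, "category_id": 1},
    {"image_id": 5, "category_id": 2},
    {"image_id": 6, "category_id": 2},
]

# Number of images per category.
counts = Counter(record["category_id"] for record in toy_annotations)

# Categories with exactly one image are the hardest to learn and to validate.
singletons = sorted(cat for cat, n in counts.items() if n == 1)

print("images per category:", counts.most_common())
print("categories with a single image:", singletons)
```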

In [None]:
fig = px.histogram(annotations['genus_id'],  marginal="violin", nbins = 100, template="plotly_white", color_discrete_sequence=['blue'])
fig.show()

In [None]:
fig = px.histogram(annotations['institution_id'],  marginal="violin", nbins = 100, template="plotly_white", color_discrete_sequence=['green'])
fig.show()

## **<span style='color:#A80808'>Images</span>**

In [None]:
images = pd.DataFrame()
images['image_id'] = [image["image_id"] for image in train_metadata["images"]]
images['file_name'] = [f'../input/herbarium-2022-fgvc9/train_images/{image["file_name"]}' for image in train_metadata["images"]]
images['license'] = [image["license"] for image in train_metadata["images"]]

print(f'Shape of images: {images.shape}')
images.head()

In [None]:
# There is only one license
images.license.unique()

## **<span style='color:#A80808'>Categories</span>**

In [None]:
categories = pd.DataFrame()
categories['category_id'] = [category["category_id"] for category in train_metadata["categories"]]
categories['scientificName'] = [category["scientificName"] for category in train_metadata["categories"]]
categories['family'] = [category["family"] for category in train_metadata["categories"]]
categories['genus'] = [category["genus"] for category in train_metadata["categories"]]
categories['species'] = [category["species"] for category in train_metadata["categories"]]
categories['authors'] = [category["authors"] for category in train_metadata["categories"]]

print(f'Shape of categories: {categories.shape}')
categories.head()

In [None]:
fig = px.histogram(categories['family'],  marginal="violin", nbins = 100, template="plotly_white", color_discrete_sequence=['red'])
fig.show()

In [None]:
fig = px.histogram(categories['genus'],  marginal="violin", nbins = 100, template="plotly_white", color_discrete_sequence=['blue'])
fig.show()

In [None]:
fig = px.histogram(categories['species'],  marginal="violin", nbins = 100, template="plotly_white", color_discrete_sequence=['green'])
fig.show()

In [None]:
fig = px.histogram(categories['authors'],  marginal="violin", nbins = 100, template="plotly_white", color_discrete_sequence=['orange'])
fig.show()

## **<span style='color:#A80808'>Genera</span>**

In [None]:
genera = pd.DataFrame()
genera['genus_id'] = [genus["genus_id"] for genus in train_metadata["genera"]]
genera['genus'] = [genus["genus"] for genus in train_metadata["genera"]]

print(f'Shape of genera: {genera.shape}')
genera.head()

## **<span style='color:#A80808'>Institutions</span>**

In [None]:
institutions = pd.DataFrame()
institutions['institution_id'] = [institution["institution_id"] for institution in train_metadata["institutions"]]
institutions['collectionCode'] = [institution["collectionCode"] for institution in train_metadata["institutions"]]

print(f'Shape of institutions: {institutions.shape}')
institutions.head()

In [None]:
print(f'There are {institutions.collectionCode.nunique()} unique collection codes:\n{institutions.collectionCode.unique()}')

## **<span style='color:#A80808'>Distances</span>**

In [None]:
distances = pd.DataFrame()
distances['genus_id_x'] = [distance["genus_id_x"] for distance in train_metadata["distances"]]
distances['genus_id_y'] = [distance["genus_id_y"] for distance in train_metadata["distances"]]
distances['distance'] = [distance["distance"] for distance in train_metadata["distances"]]

print(f'Shape of distances: {distances.shape}')
distances.head()

In [None]:
fig = px.histogram(distances['distance'],  marginal="violin", nbins = 100, template="plotly_white", color_discrete_sequence=['blue'])
fig.show()

# **<span style='color:#A80808'>🏆 Submission</span>**

In [None]:
submission = pd.read_csv('../input/herbarium-2022-fgvc9/sample_submission.csv')
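# Random baseline: draw an independent label for every row (without size=, one value is broadcast to the whole column)
submission['Predicted'] = np.random.randint(0, 6932, size=len(submission))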
submission.to_csv('submission.csv', index=False)
submission.head()