# 1. Introduction

## 1.a Data Description

In this competition, you are asked to take test images and recognize which landmarks (if any) are depicted in them. The training set is available in the train/ folder, with corresponding landmark labels in train.csv. The test set images are in the test/ folder. Each image has a unique id. Because there are so many images, each one is placed within three nested subfolders named after the first three characters of its id (e.g. image abcdef.jpg is placed in a/b/c/abcdef.jpg).
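
The folder layout described above can be sketched as a small helper. This is only an illustration: the function name `image_path` and its `ext` parameter are hypothetical and not part of the competition data or any library.

```python
def image_path(root: str, image_id: str, ext: str = ".jpg") -> str:
    """Build the nested path for an image id, e.g. root/a/b/c/abcdef.jpg."""
    if len(image_id) < 3:
        raise ValueError("image id must have at least three characters")
    # The first three characters of the id name the three subfolders
    a, b, c = image_id[:3]
    return f"{root}/{a}/{b}/{c}/{image_id}{ext}"


print(image_path("../input/landmark-recognition-2020/train", "abcdef"))
```

The same pattern applies to the test/ folder by changing `root`.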

## 1.b Evaluation Metrics

Submissions are evaluated using Global Average Precision (GAP) at k, where k=1. This metric is also known as micro Average Precision (μAP), as per [1,2]. It works as follows:

For each test image, you will predict one landmark label and a corresponding confidence score. The evaluation treats each prediction as an individual data point in a long list of predictions (sorted in descending order by confidence scores), and computes the Average Precision based on this list.

If a submission has $N$ predictions (label/confidence pairs) sorted in descending order by their confidence scores, then the Global Average Precision is computed as:

$$GAP = \frac{1}{M}\sum_{i=1}^{N} P(i)\,rel(i)$$

where:

* N is the total number of predictions returned by the system, across all queries
* M is the total number of queries with at least one landmark from the training set visible in it (note that some queries may not depict landmarks)
* P(i) is the precision at rank i
* rel(i) denotes the relevance of prediction i: it’s 1 if the i-th prediction is correct, and 0 otherwise

For each id in the test set, you can predict at most one landmark and its corresponding confidence score. Some images contain no landmarks. You may decide not to predict any result for a given query, by submitting an empty prediction. 
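
To make the metric concrete, here is a minimal sketch of GAP at k=1 as described above. It is not the official scoring code. The function name `global_average_precision`, the dictionary-based input format, and the choice to return 0.0 when no query depicts a landmark are all assumptions made for illustration.

```python
from typing import Dict, Hashable, Optional, Tuple

# (label, confidence), or None for an empty prediction
Prediction = Optional[Tuple[Hashable, float]]


def global_average_precision(
    predictions: Dict[str, Prediction],
    ground_truth: Dict[str, Optional[Hashable]],
) -> float:
    # M: number of queries that really depict a landmark
    m = sum(1 for label in ground_truth.values() if label is not None)
    if m == 0:
        return 0.0  # assumption: define GAP as 0 when there is nothing to find

    # Drop empty predictions, then sort the rest by confidence, highest first
    ranked = sorted(
        ((image_id, pred[0], pred[1])
         for image_id, pred in predictions.items() if pred is not None),
        key=lambda item: item[2],
        reverse=True,
    )

    correct = 0
    total = 0.0
    for rank, (image_id, label, _conf) in enumerate(ranked, start=1):
        truth = ground_truth.get(image_id)
        if truth is not None and truth == label:  # rel(i) = 1
            correct += 1
            total += correct / rank  # P(i): precision at rank i
    return total / m


# Toy example with made-up ids and labels
preds = {"a": (1, 0.9), "b": (2, 0.8), "c": (3, 0.7), "d": None}
truth = {"a": 1, "b": 5, "c": 3, "d": None}
print(round(global_average_precision(preds, truth), 4))  # 0.5556
```

In the toy example, the predictions at ranks 1 and 3 are correct, so the sum is 1/1 + 2/3, divided by M = 3 queries that contain a landmark.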

## Importing Libraries

In [None]:
import pandas as pd
import numpy as np

import plotly.express as px

import glob

import matplotlib.pyplot as plt
import seaborn as sns

import cv2

## Reading files

In [None]:
df_train= pd.read_csv("../input/landmark-recognition-2020/train.csv")
df_submission = pd.read_csv("../input/landmark-recognition-2020/sample_submission.csv")
TRAIN_PATH = "../input/landmark-recognition-2020/train"
TEST_PATH = "../input/landmark-recognition-2020/test"

## Looking at the dataset

In [None]:
train_list = glob.glob('../input/landmark-recognition-2020/train/*/*/*/*')
test_list = glob.glob('../input/landmark-recognition-2020/test/*/*/*/*')

print( 'Images in Train Folder:', len(train_list))
print( 'Images in Test Folder:', len(test_list))

In [None]:
df_train.head()

In [None]:
df_train.shape[0], df_train.id.nunique()

In [None]:
df_train.landmark_id.nunique()

In [None]:
df_train.isna().sum()

In [None]:
df_image_counts = df_train.groupby("landmark_id").agg(images = ("id","nunique")).reset_index()
df_image_counts.head()

In [None]:
px.box(df_image_counts, x= "images",width=1000, height=300)

What we have:
* We have ~1.6M images in the train data with ~81k landmarks (classes)
* Every landmark has at least 2 images, and one landmark has an exceptionally high 6272 images. The upper fence of the boxplot is 42, meaning most landmarks have 42 or fewer images (though it would be interesting to see which landmark has 6272 images 😛)
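
For readers unfamiliar with how a boxplot's upper threshold is derived, the sketch below applies the common rule of Q3 + 1.5 × IQR, with the whisker drawn at the largest value that does not exceed that fence. The counts are made up, the helper name `upper_fence_and_whisker` is hypothetical, and Plotly's own quartile method may give slightly different numbers, so this will not necessarily reproduce the 42 read off the plot.

```python
import statistics


def upper_fence_and_whisker(values):
    """Return (upper fence, upper whisker) using the 1.5 * IQR rule."""
    # 'inclusive' interpolates linearly between data points
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    fence = q3 + 1.5 * (q3 - q1)
    # The whisker ends at the largest observation that is still inside the fence
    whisker = max(v for v in values if v <= fence)
    return fence, whisker


# Made-up images-per-landmark counts with one extreme class
counts = [2, 3, 4, 5, 6, 8, 10, 12, 15, 20, 50, 6272]
fence, whisker = upper_fence_and_whisker(counts)
outliers = [v for v in counts if v > fence]
print(fence, whisker, outliers)  # 33.5 20 [50, 6272]
```

Values above the fence are drawn as individual outlier points, which is how a single very large class can appear far to the right of the box.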

In [None]:
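def plot_images(image_list, rows, cols, title):
    """Plot the training images with the given ids on a rows x cols grid."""
    fig, ax = plt.subplots(rows, cols, figsize=(25, 5 * rows))
    ax = ax.flatten()
    for i, image_id in enumerate(image_list):
        # Images are stored under the first three characters of their id
        image = cv2.imread(TRAIN_PATH + '/{}/{}/{}/{}.jpg'.format(image_id[0], image_id[1], image_id[2], image_id))
        # OpenCV loads images as BGR; convert to RGB for matplotlib
        image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
        ax[i].imshow(image)
        ax[i].set_axis_off()
        ax[i].set_title(image_id)
    plt.suptitle(title)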

In [None]:
landmark_id = df_image_counts[df_image_counts.images == 6272]["landmark_id"].values[0]
image_ids = df_train.loc[df_train.landmark_id == landmark_id, "id"].values[:10]
plot_images(image_ids, 2, 5, "Images of Landmark - 138982 (10 out of 6k images of this landmark)")

* This seems to be the default class, meaning that all images that do not contain a landmark or belong to any other class are put into this class

In [None]:
landmark_id = df_image_counts[df_image_counts.images == 2231]["landmark_id"].values[0]
image_ids = df_train.loc[df_train.landmark_id == landmark_id, "id"].values[:10]
plot_images(image_ids, 2, 5, "Images of Landmark with second highest number of images (10 out of 2 k images of this landmark)")

In [None]:
landmark_id = df_image_counts[df_image_counts.images == 10]["landmark_id"].values[0]
image_ids = df_train.loc[df_train.landmark_id == landmark_id, "id"].values[:10]
plot_images(image_ids, 2, 5, "Images of Landmark:" + str(landmark_id))
