# Acquiring labels - ILSVRC validation - ImageNet
## ILSVRC2010 val

## First things first
* class_names_ImageNet.txt = Text file listing the categories AlexNet can predict
* ILSVRC2010_validation_ground_truth.txt = From the devkit; one line per ILSVRC2010\_val_*.JPEG image, in order, each line containing the Category_ID of that image
* meta.mat = From the devkit; contains Category_ID, glossary and other info
## About
* Why the ImageNet dataset?
* What is the label of each ILSVRC2010\_val_*.JPEG?



"ImageNet is an image database organized according to the WordNet hierarchy (currently only the nouns), in which each node of the hierarchy is depicted by hundreds and thousands of images." 

Accessible at [Explore ImageNet](http://image-net.org/explore)

Each ILSVRC dataset comes with a devkit. In my case, I will be using the 2010 validation set, whose devkit includes the file "meta.mat". It can be opened with MATLAB.
I copied the contents of `synsets` into a .txt file and converted it to a tab-separated CSV file, "MatlabData.csv".
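
The manual copy step could probably be scripted. The following is a minimal sketch, not the method used here: it assumes SciPy is installed and that the `synsets` records expose fields named `ILSVRC2010_ID` and `words` (both field names are assumptions about meta.mat and should be checked against the devkit readme). The helper names `synsets_to_rows`, `load_synsets` and `write_rows` are hypothetical. Stand-in records are used so the sketch runs without the real file.

```python
import csv
from types import SimpleNamespace  # only used for the stand-in records below


def synsets_to_rows(synsets):
    """Turn synset records into (Category_ID, category names) pairs."""
    # ILSVRC2010_ID and words are ASSUMED field names of the synsets struct.
    return [(int(s.ILSVRC2010_ID), str(s.words)) for s in synsets]


def load_synsets(mat_path):
    """Load the synsets struct array from meta.mat (requires SciPy)."""
    from scipy.io import loadmat  # imported lazily; only needed for real data
    mat = loadmat(mat_path, squeeze_me=True, struct_as_record=False)
    return mat["synsets"]


def write_rows(rows, csv_path):
    """Write the pairs as a tab-separated file with a header line."""
    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle, delimiter="\t")
        writer.writerow(["Index_Id", "Kategorija"])
        writer.writerows(rows)


# Stand-in records mimicking two synsets, so the sketch runs without meta.mat
demo = [
    SimpleNamespace(ILSVRC2010_ID=1,
                    words="french fries, french-fried potatoes, fries, chips"),
    SimpleNamespace(ILSVRC2010_ID=2, words="mashed potato"),
]
rows = synsets_to_rows(demo)
print(rows)
```

With the real devkit, `write_rows(synsets_to_rows(load_synsets("meta.mat")), "MatLab/MatlabData.csv")` would be the equivalent call, under the same assumptions.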

In [3]:
# Working on *.csv files with pandas 
import pandas as pd 
# Opening the file containing the categories from ILSVRC 2010
df1 = pd.read_csv("MatLab/MatlabData.csv", sep="\t", index_col=0) # Categories

print(df1.shape)
# Copy the index (Category_ID) into a regular column so it can be used as a merge key
df1['ID_Kategorija'] = df1.index
# Show df
df1.head(10)

(1676, 1)


Index_Id,Kategorija,ID_Kategorija
1,"french fries, french-fried potatoes, fries, chips",1
2,mashed potato,2
3,"black olive, ripe olive",3
4,face powder,4
5,"crab apple, crabapple",5
6,Granny Smith,6
7,strawberry,7
8,blueberry,8
9,cranberry,9
10,currant,10


In [4]:
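# Open validation set ground truth
# The first label belongs to picture ILSVRC2010_val_00000001.JPEG, its label is Category_ID 78
# read_csv takes the first line as the header, so the file is expected to start with an "ID_Kategorija" line;
# for a file without a header, pass header=None, names=["ID_Kategorija"] instead
df2 = pd.read_csv("ILSVRC2010_validation_ground_truth.txt", sep=" ")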
# Show df2
df2.head(10)

Unnamed: 0,ID_Kategorija
0,78
1,854
2,435
3,541
4,973
5,657
6,168
7,879
8,390
9,300
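
Because the ground truth file is ordered by image number, the file name for each label can also be derived from the line position alone. The sketch below illustrates this under the assumption that every validation file follows the eight-digit, zero-padded pattern `ILSVRC2010_val_NNNNNNNN.JPEG`. The helper `val_filename` is hypothetical, and the three labels are stand-ins taken from the first rows of the output above.

```python
def val_filename(line_number):
    """File name of the image whose label sits at the given 1-based position."""
    # ASSUMPTION: eight-digit zero padding, as in ILSVRC2010_val_00000001.JPEG
    return f"ILSVRC2010_val_{line_number:08d}.JPEG"


# Stand-in for the first three labels of the ground truth file
labels = [78, 854, 435]

# Pair each label with the file name implied by its position
pairs = [(val_filename(i), label) for i, label in enumerate(labels, start=1)]
for name, label in pairs:
    print(name, label)
```

This avoids depending on the order in which the file system returns the images.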


In [5]:
# Acquire filenames from folder containing ILSVRC2010_val*.JPEG
from pathlib import Path
import glob
# List of file names, from which we will create a dataframe
imenaDatotek = []

# glob does not guarantee any order, so sort to match the line order of the ground truth file
images = sorted(glob.glob("C:/Users/Rok/Downloads/ImageNet/2010/val/*.JPEG"))

for image in images:
    # Only the file name is needed, so there is no need to open the image
    ime = Path(image).name
    imenaDatotek.append({
        'ImeDatoteke' : ime
    })

# Creating new df
df3 = pd.DataFrame(data=imenaDatotek)
# Concatenating df2 and df3 side by side (rows are paired by position). We now have a dataframe containing *filename* and its *Category_ID*
df = pd.concat([df2, df3], axis=1)
df.head(10)

Unnamed: 0,ID_Kategorija,ImeDatoteke
0,78,ILSVRC2010_val_00000001.JPEG
1,854,ILSVRC2010_val_00000002.JPEG
2,435,ILSVRC2010_val_00000003.JPEG
3,541,ILSVRC2010_val_00000004.JPEG
4,973,ILSVRC2010_val_00000005.JPEG
5,657,ILSVRC2010_val_00000006.JPEG
6,168,ILSVRC2010_val_00000007.JPEG
7,879,ILSVRC2010_val_00000008.JPEG
8,390,ILSVRC2010_val_00000009.JPEG
9,300,ILSVRC2010_val_00000010.JPEG
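
Since the concatenation pairs rows purely by position, a quick alignment check before trusting the result may be worthwhile. This is a minimal sketch on stand-in data (the frames `labels` and `files` are hypothetical, not the notebook's `df2` and `df3`): it checks that both frames have the same length and that the number embedded in each file name equals its 1-based row position.

```python
import pandas as pd

# Stand-in frames mirroring the structure of the label and file name frames
labels = pd.DataFrame({"ID_Kategorija": [78, 854, 435]})
files = pd.DataFrame({"ImeDatoteke": [
    "ILSVRC2010_val_00000001.JPEG",
    "ILSVRC2010_val_00000002.JPEG",
    "ILSVRC2010_val_00000003.JPEG",
]})

# Both frames must have one row per image
same_length = len(labels) == len(files)

# The number embedded in each file name should equal its 1-based row position
numbers = (
    files["ImeDatoteke"]
    .str.extract(r"_(\d+)\.JPEG$", expand=False)
    .astype(int)
)
aligned = bool((numbers == pd.Series(range(1, len(files) + 1))).all())

print(same_length, aligned)
```

If either value is `False`, the labels and file names are not in the same order and the concatenated frame would assign wrong labels.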


In [6]:
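# Joining on *Category_ID* to attach the category name to every validation image. how='right' keeps all 1676 categories of ILSVRC2010, although the validation set uses only 1000, so unused categories get NaN as filename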
df4 = df.merge(df1, how='right', on='ID_Kategorija')

In [7]:
# Show df4
df4

Unnamed: 0,ID_Kategorija,ImeDatoteke,Kategorija
0,78,ILSVRC2010_val_00000001.JPEG,"seashore, coast, seacoast, sea-coast"
1,78,ILSVRC2010_val_00001183.JPEG,"seashore, coast, seacoast, sea-coast"
2,78,ILSVRC2010_val_00001826.JPEG,"seashore, coast, seacoast, sea-coast"
3,78,ILSVRC2010_val_00002958.JPEG,"seashore, coast, seacoast, sea-coast"
4,78,ILSVRC2010_val_00004643.JPEG,"seashore, coast, seacoast, sea-coast"
...,...,...,...
50671,1672,,"business, concern, business concern, business ..."
50672,1673,,"carrier, common carrier"
50673,1674,,line
50674,1675,,"railway, railroad, railroad line, railway line..."
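
The rows with an empty filename at the end of this output come from the right join: categories that no validation image uses are kept anyway. The toy example below (hypothetical file names and a three-category table, not the real data) sketches the difference between `how="right"` and `how="inner"`.

```python
import pandas as pd

# Toy stand-ins: three images, three categories, one category without images
images = pd.DataFrame({
    "ID_Kategorija": [78, 78, 854],
    "ImeDatoteke": ["a.JPEG", "b.JPEG", "c.JPEG"],  # hypothetical file names
})
categories = pd.DataFrame({
    "ID_Kategorija": [78, 854, 1672],
    "Kategorija": ["seashore", "bookshop", "business"],
})

# Right join: every category survives, unused ones get NaN as file name
right = images.merge(categories, how="right", on="ID_Kategorija")
# Inner join: only categories that actually have images survive
inner = images.merge(categories, how="inner", on="ID_Kategorija")

print(len(right), int(right["ImeDatoteke"].isna().sum()))
print(len(inner), int(inner["ImeDatoteke"].isna().sum()))
```

Whether to keep the unused categories is a design choice; an inner join would leave only rows that have a filename.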


In [8]:
# Change column position
df4 = df4[['ImeDatoteke','Kategorija','ID_Kategorija']] 
# Sort by filename
df4 = df4.sort_values(by=['ImeDatoteke'])
# Write *filename*, *Category* and *Category_ID* to a *.csv file
df4.to_csv("OutputCSV/Ime,Kategorija,ID.csv", sep=";", decimal=".", index=False)
# Show df4
df4.head()

Unnamed: 0,ImeDatoteke,Kategorija,ID_Kategorija
0,ILSVRC2010_val_00000001.JPEG,"seashore, coast, seacoast, sea-coast",78
50,ILSVRC2010_val_00000002.JPEG,"bookshop, bookstore, bookstall",854
100,ILSVRC2010_val_00000003.JPEG,"kit fox, Vulpes macrotis",435
150,ILSVRC2010_val_00000004.JPEG,"scale, weighing machine",541
200,ILSVRC2010_val_00000005.JPEG,"chain mail, ring mail, mail, chain armor, chai...",973


In [9]:
# Open categories AlexNet can predict
df5 = pd.read_csv("class_names_ImageNet.txt", sep=";", index_col=0) # Categories
df5.head()

"tench, Tinca tinca"
"goldfish, Carassius auratus"
"great white shark, white shark, man-eater, man-eating shark, Carcharodon carcharias"
"tiger shark, Galeocerdo cuvieri"
"hammerhead, hammerhead shark"


In [12]:
# Merge df4 and df5 - dataframes with categories
df = df5.merge(df4, how='outer', on='Kategorija', indicator=True)
# Drop the filename column
df = df.drop(columns=['ImeDatoteke'])
# Keep only rows where df4 and df5 both contain the same category - _merge = both
df = df.loc[df['_merge'] == 'both']
# Drop all duplicates - different pictures belong to the same category
df = df.drop_duplicates(subset='Kategorija', keep="first")
# Drop all other columns but Category
df = df.drop(columns=['_merge','ID_Kategorija'])
# Save to *.csv - "both" means the category can be predicted by AlexNet and is included in the ILSVRC2010 validation set
df.to_csv("OutputCSV/BothKategorije.csv", sep=";", decimal=".", index=False)
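
To make the role of `indicator=True` concrete, here is a small sketch on toy data (the two frames are stand-ins, not the real category lists). The outer merge adds a `_merge` column that marks each row as `left_only`, `right_only` or `both`; keeping the `both` rows and dropping duplicates leaves the categories present in both lists.

```python
import pandas as pd

# Toy stand-ins for the predictable categories and the validation categories
predictable = pd.DataFrame({"Kategorija": ["tench, Tinca tinca", "strawberry"]})
in_validation = pd.DataFrame({"Kategorija": ["strawberry", "strawberry", "currant"]})

# Outer merge with an indicator column describing where each row came from
merged = predictable.merge(
    in_validation, how="outer", on="Kategorija", indicator=True
)

# Keep categories found in both frames, one row per category
both = (
    merged.loc[merged["_merge"] == "both", ["Kategorija"]]
    .drop_duplicates()
    .reset_index(drop=True)
)
print(both["Kategorija"].tolist())
```

Note that the match is an exact string comparison, so categories whose names are spelled differently in the two sources would not be counted as shared.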