Jaro-Winkler-Distance considers the length of strings, the number of matching characters and the number of transpositions, also we use a prefix scale that gives us more accurate answers when strings have a common prefix 

In [2]:
import logging

import jellyfish
import pandas as pd
import numpy as np

logger = logging.getLogger(__name__)

In [2]:
df_dbpedia = pd.read_csv("./files/dbpedia_classes.csv")
df_imageNet = pd.read_csv("./files/ImageNet.csv")
df_imageNet['Class Name'] = df_imageNet['Class Name'].str.replace(", ",",")
df_imageNet['Class Name'] = df_imageNet['Class Name'].str.split(",")

In [3]:
def get_jaro_winkler_distance():
    max_ratio_list = []
    max_label_list = []
    for class_list in df_imageNet['Class Name']:
        max_val = 0
        max_label = ''
        for item in class_list:
            for label in df_dbpedia['label']:
                if max_val< jellyfish.jaro_winkler_similarity(item, label):
                    max_val = jellyfish.jaro_winkler_similarity(item, label)
                    max_label = label
        max_ratio_list.append(max_val)
        max_label_list.append(max_label)
        
    df_imageNet['jaro_winkler_distance'] = max_ratio_list
    df_imageNet['dbpedia_class'] = max_label_list
    df_imageNet.rename(columns = {"Class Name":"imageNet_class"}, inplace=True)
            
get_jaro_winkler_distance()

In [4]:
df_imageNet


Unnamed: 0,Class ID,imageNet_class,jaro_winkler_distance,dbpedia_class
0,0,"[tench, Tinca tinca]",0.790000,tenure
1,1,"[goldfish, Carassius auratus]",0.693182,golf course
2,2,"[great white shark, white shark, man-eater, ma...",0.844444,man
3,3,"[tiger shark, Galeocerdo cuvieri]",0.762756,figure skater
4,4,"[hammerhead, hammerhead shark]",0.777778,camera
...,...,...,...,...
995,995,[earthstar],0.853333,earthquake
996,996,"[hen-of-the-woods, hen of the woods, Polyporus...",0.715278,island
997,997,[bolete],0.800000,bone
998,998,"[ear, spike, capitulum]",0.916667,year


Lets look at the number of classes with a distance score equal to 100

In [5]:
result =  df_imageNet[df_imageNet['jaro_winkler_distance' ]== 1.0]
result


Unnamed: 0,Class ID,imageNet_class,jaro_winkler_distance,dbpedia_class
134,134,"[crane, bird]",1.0,bird
242,242,[boxer],1.0,boxer
286,286,"[cougar, puma, catamount, mountain lion, paint...",1.0,painter
323,323,"[monarch, monarch butterfly, milkweed butterfl...",1.0,monarch
408,408,"[amphibian, amphibious vehicle]",1.0,amphibian
437,437,"[beacon, lighthouse, beacon light, pharos]",1.0,lighthouse
483,483,[castle],1.0,castle
487,487,"[cellular telephone, cellular phone, cellphone...",1.0,mobile phone
497,497,"[church, church building]",1.0,church
525,525,"[dam, dike, dyke]",1.0,dam


In [6]:
print(result.shape)

(24, 4)


Lets look at the number of classes with a distance score higher than 0.9

In [7]:
result =  df_imageNet[df_imageNet['jaro_winkler_distance' ] > 0.9]
result

Unnamed: 0,Class ID,imageNet_class,jaro_winkler_distance,dbpedia_class
22,22,"[bald eagle, American eagle, Haliaeetus leucoc...",0.931429,American Leader
64,64,[green mamba],0.905455,green alga
132,132,"[American egret, great white heron, Egretta al...",0.914762,American Leader
134,134,"[crane, bird]",1.000000,bird
231,231,[collie],0.909524,college
...,...,...,...,...
979,979,"[valley, vale]",1.000000,valley
980,980,[volcano],1.000000,volcano
981,981,"[ballplayer, baseball player]",1.000000,baseball player
989,989,"[hip, rose hip, rosehip]",0.916667,ship


Lets look at the number of classes with a distance score higher than 0.8

In [8]:
result =  df_imageNet[df_imageNet['jaro_winkler_distance' ] > 0.8]
result

Unnamed: 0,Class ID,imageNet_class,jaro_winkler_distance,dbpedia_class
2,2,"[great white shark, white shark, man-eater, ma...",0.844444,man
5,5,"[electric ray, crampfish, numbfish, torpedo]",0.841905,electrical substation
7,7,[cock],0.833333,lock
10,10,"[brambling, Fringilla montifringilla]",0.896296,brain
11,11,"[goldfinch, Carduelis carduelis]",0.835133,Cardinal direction
...,...,...,...,...
988,988,[acorn],0.893333,actor
989,989,"[hip, rose hip, rosehip]",0.916667,ship
990,990,"[buckeye, horse chestnut, conker]",0.894444,conifer
995,995,[earthstar],0.853333,earthquake


Lets look at the number of classes with a distance score higher than 0.7

In [9]:
result =  df_imageNet[df_imageNet['jaro_winkler_distance' ] > 0.7]
result

Unnamed: 0,Class ID,imageNet_class,jaro_winkler_distance,dbpedia_class
0,0,"[tench, Tinca tinca]",0.790000,tenure
2,2,"[great white shark, white shark, man-eater, ma...",0.844444,man
3,3,"[tiger shark, Galeocerdo cuvieri]",0.762756,figure skater
4,4,"[hammerhead, hammerhead shark]",0.777778,camera
5,5,"[electric ray, crampfish, numbfish, torpedo]",0.841905,electrical substation
...,...,...,...,...
995,995,[earthstar],0.853333,earthquake
996,996,"[hen-of-the-woods, hen of the woods, Polyporus...",0.715278,island
997,997,[bolete],0.800000,bone
998,998,"[ear, spike, capitulum]",0.916667,year


Conclusion: We could use jaro_winkler distance but we need to set the treshold very high because with > 80 we already have 910 out of all classes.
A problem here is that it only compares strings not semantics (for example Class ID: 999 has a 0,76 distance but is actually not the same),
also there is no significant change between choosing jaro and jaro_winkler



Idea: Right now we take the highest value out of the image_net classes array, but lets change it so we take the average of the values as distance for each class and assign the class with the highest value to the dbpedia class

In [12]:
df_imageNet


Unnamed: 0,Class ID,imageNet_class,jaro_winkler_distance,dbpedia_class
0,0,"[tench, Tinca tinca]",0.790000,tenure
1,1,"[goldfish, Carassius auratus]",0.693182,golf course
2,2,"[great white shark, white shark, man-eater, ma...",0.844444,man
3,3,"[tiger shark, Galeocerdo cuvieri]",0.762756,figure skater
4,4,"[hammerhead, hammerhead shark]",0.777778,camera
...,...,...,...,...
995,995,[earthstar],0.853333,earthquake
996,996,"[hen-of-the-woods, hen of the woods, Polyporus...",0.715278,island
997,997,[bolete],0.800000,bone
998,998,"[ear, spike, capitulum]",0.916667,year


In [4]:
img_net = pd.read_csv("./files/ImageNet.csv")
img_net_list = img_net["Class Name"].tolist()

dbpedia = pd.read_csv("./files/dbpedia_classes.csv")
db_list = dbpedia["label"].tolist()

In [8]:
dataframe_jarow = pd.DataFrame(index =db_list ,columns =img_net_list)

In [9]:
for i in range(0,len(img_net_list)):
    for a in range(0,len(db_list)):
        if len(img_net_list[i].split(",")) > 1:
            max_val = 0.0
            for el in img_net_list[i].split(","):
                el_val = jellyfish.jaro_winkler_similarity(img_net_list[i], db_list[a])
                if el_val > max_val:
                    max_val = el_val
            dataframe_jarow.iat[a,i] = max_val
        else:
            dataframe_jarow.iat[a,i] = jellyfish.jaro_winkler_similarity(img_net_list[i], db_list[a])

In [10]:
dataframe_jarow

Unnamed: 0,"tench, Tinca tinca","goldfish, Carassius auratus","great white shark, white shark, man-eater, man-eating shark, Carcharodon caharias',","tiger shark, Galeocerdo cuvieri","hammerhead, hammerhead shark","electric ray, crampfish, numbfish, torpedo",stingray,cock,hen,"ostrich, Struthio camelus",...,"buckeye, horse chestnut, conker",coral fungus,agaric,gyromitra,"stinkhorn, carrion fungus",earthstar,"hen-of-the-woods, hen of the woods, Polyporus frondosus, Grifola frondosa",bolete,"ear, spike, capitulum","toilet tissue, toilet paper, bathroom tissue"
company,0.420635,0.453263,0.377128,0.391705,0.285714,0.555556,0.490079,0.595238,0.0,0.288571,...,0.397337,0.634921,0.436508,0.502646,0.49381,0.417989,0.378778,0.436508,0.293651,0.498918
activity,0.490741,0.384259,0.49508,0.37948,0.386905,0.464286,0.583333,0.458333,0.0,0.475,...,0.37948,0.305556,0.527778,0.490741,0.47,0.569444,0.472032,0.430556,0.448413,0.44697
name,0.37037,0.429012,0.599398,0.354839,0.619048,0.496032,0.458333,0.0,0.0,0.0,...,0.427419,0.444444,0.472222,0.453704,0.526667,0.407407,0.342466,0.472222,0.365079,0.424242
person,0.481481,0.425926,0.488286,0.598566,0.468254,0.584127,0.513889,0.0,0.5,0.428889,...,0.515233,0.472222,0.444444,0.518519,0.525556,0.611111,0.490487,0.444444,0.452381,0.411616
actor,0.337037,0.459259,0.43427,0.476344,0.490476,0.465079,0.55,0.483333,0.0,0.57,...,0.565591,0.544444,0.577778,0.374074,0.486667,0.533333,0.309132,0.411111,0.498413,0.481818
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Wind motor,0.457407,0.495062,0.446319,0.486738,0.347619,0.406349,0.558333,0.0,0.477778,0.446667,...,0.426344,0.288889,0.422222,0.618519,0.517143,0.474074,0.503392,0.422222,0.446825,0.467677
Windmill,0.287037,0.466049,0.432731,0.438172,0.27381,0.365079,0.5,0.0,0.486111,0.443333,...,0.0,0.402778,0.430556,0.490741,0.498333,0.0,0.472032,0.0,0.390873,0.44697
wine,0.527778,0.429012,0.516064,0.521505,0.428571,0.349206,0.583333,0.0,0.527778,0.43,...,0.427419,0.0,0.0,0.0,0.526667,0.453704,0.485921,0.472222,0.365079,0.515152
woman,0.337037,0.491358,0.532731,0.410753,0.32381,0.315873,0.383333,0.483333,0.0,0.413333,...,0.410753,0.616667,0.455556,0.540741,0.462222,0.437037,0.435921,0.455556,0.415873,0.481818


In [12]:
a = dataframe_jarow
a = dataframe_jarow.replace("Not Present", np.nan)
a =a.replace(0.0, np.nan)

#Now we remove all columns where we have only NaN values
nan = a.dropna(axis=1, how="all")
highest_values_nan_max = nan.apply(lambda s: s.abs().nlargest(3).index.tolist(), axis=0)
highest_values_nan_max

Unnamed: 0,"tench, Tinca tinca","goldfish, Carassius auratus","great white shark, white shark, man-eater, man-eating shark, Carcharodon caharias',","tiger shark, Galeocerdo cuvieri","hammerhead, hammerhead shark","electric ray, crampfish, numbfish, torpedo",stingray,cock,hen,"ostrich, Struthio camelus",...,"buckeye, horse chestnut, conker",coral fungus,agaric,gyromitra,"stinkhorn, carrion fungus",earthstar,"hen-of-the-woods, hen of the woods, Polyporus frondosus, Grifola frondosa",bolete,"ear, spike, capitulum","toilet tissue, toilet paper, bathroom tissue"
0,Concentration camp,fish,referee,ice hockey league,camera,lymph,stream,lock,chef,anatomical structure,...,beer,organ,aircraft,writer,saint,earthquake,cheese,bone,year in spaceflight,national collegiate athletic association team ...
1,train carriage,Biological database,attack,prehistorical period,area,electrical substation,star,coach,vein,British royalty,...,hockey club,colour,article,aristocrat,snooker world ranking,artist,chess player,letter,ski area,letter
2,engine,classical music artist,rest area,historical period,Caterer,year in spaceflight,single,ice hockey player,fern,politician spouse,...,cheese,coal pit,award,motor race,Historical region,restaurant,hotel,baronet,president,national collegiate athletic association athlete


In [14]:
from pathlib import Path  

filepath = Path('../evaluation/files/jarow_winkler_top3.csv')  

filepath.parent.mkdir(parents=True, exist_ok=True)  

highest_values_nan_max.to_csv(filepath, index=False)  