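The Jaro-Winkler distance takes into account the lengths of the strings, the number of matching characters, and the number of transpositions. It also applies a prefix scale that raises the score when the strings share a common prefix. The jellyfish function used below returns the similarity form of the measure, so 1.0 means the strings are identical and 0.0 means they have nothing in common.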
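To make the ingredients above concrete, here is a minimal pure-Python sketch of the measure. It is not the jellyfish implementation and may differ from it in edge cases; the prefix scale of 0.1 and the maximum prefix length of 4 are the commonly used defaults and are assumptions here.

```python
def jaro_similarity(s1: str, s2: str) -> float:
    """Jaro similarity in [0, 1]; 1.0 means identical strings."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Two equal characters only count as a match if they are at most
    # `window` positions apart.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Matched characters that appear in a different order are transpositions;
    # each swapped pair counts once.
    seq1 = [c for c, m in zip(s1, matched1) if m]
    seq2 = [c for c, m in zip(s2, matched2) if m]
    transpositions = sum(a != b for a, b in zip(seq1, seq2)) // 2
    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3


def jaro_winkler_similarity(s1: str, s2: str,
                            prefix_scale: float = 0.1,
                            max_prefix: int = 4) -> float:
    """Jaro similarity boosted by the length of the common prefix."""
    jaro = jaro_similarity(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == max_prefix:
            break
        prefix += 1
    return jaro + prefix * prefix_scale * (1 - jaro)


print(round(jaro_winkler_similarity("tench", "tenure"), 6))
```

For "tench" and "tenure" the plain Jaro score is 0.7 (three matching characters, no transpositions) and the shared prefix "ten" lifts it to 0.79, which is the value shown for Class ID 0 in the results below.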

In [1]:
import logging

import jellyfish
import pandas as pd
import numpy as np

logger = logging.getLogger(__name__)

In [2]:
df_dbpedia = pd.read_csv("./files/dbpedia_classes.csv")
df_imageNet = pd.read_csv("./files/ImageNet.csv")
df_imageNet['Class Name'] = df_imageNet['Class Name'].str.replace(", ",",")
df_imageNet['Class Name'] = df_imageNet['Class Name'].str.split(",")

In [3]:
def get_jaro_winkler_distance():
    """For each ImageNet class, find the DBpedia label most similar to any of its synonyms."""
    max_ratio_list = []
    max_label_list = []
    for class_list in df_imageNet['Class Name']:
        max_val = 0
        max_label = ''
        for item in class_list:
            for label in df_dbpedia['label']:
                # Compute the score once per pair instead of twice.
                score = jellyfish.jaro_winkler_similarity(item, label)
                if score > max_val:
                    max_val = score
                    max_label = label
        max_ratio_list.append(max_val)
        max_label_list.append(max_label)

    # Despite the column name, the values are similarities (1.0 = identical strings).
    df_imageNet['jaro_winkler_distance'] = max_ratio_list
    df_imageNet['dbpedia_class'] = max_label_list
    df_imageNet.rename(columns={"Class Name": "imageNet_class"}, inplace=True)

get_jaro_winkler_distance()

In [4]:
df_imageNet


Unnamed: 0,Class ID,imageNet_class,jaro_winkler_distance,dbpedia_class
0,0,"[tench, Tinca tinca]",0.790000,tenure
1,1,"[goldfish, Carassius auratus]",0.693182,golf course
2,2,"[great white shark, white shark, man-eater, ma...",0.844444,man
3,3,"[tiger shark, Galeocerdo cuvieri]",0.762756,figure skater
4,4,"[hammerhead, hammerhead shark]",0.777778,camera
...,...,...,...,...
995,995,[earthstar],0.853333,earthquake
996,996,"[hen-of-the-woods, hen of the woods, Polyporus...",0.715278,island
997,997,[bolete],0.800000,bone
998,998,"[ear, spike, capitulum]",0.916667,year


Let's look at the number of classes with a distance score equal to 1.0

In [5]:
result =  df_imageNet[df_imageNet['jaro_winkler_distance' ]== 1.0]
result


Unnamed: 0,Class ID,imageNet_class,jaro_winkler_distance,dbpedia_class
134,134,"[crane, bird]",1.0,bird
242,242,[boxer],1.0,boxer
286,286,"[cougar, puma, catamount, mountain lion, paint...",1.0,painter
323,323,"[monarch, monarch butterfly, milkweed butterfl...",1.0,monarch
408,408,"[amphibian, amphibious vehicle]",1.0,amphibian
437,437,"[beacon, lighthouse, beacon light, pharos]",1.0,lighthouse
483,483,[castle],1.0,castle
487,487,"[cellular telephone, cellular phone, cellphone...",1.0,mobile phone
497,497,"[church, church building]",1.0,church
525,525,"[dam, dike, dyke]",1.0,dam


In [6]:
print(result.shape)

(24, 4)


Let's look at the number of classes with a distance score higher than 0.9

In [7]:
result =  df_imageNet[df_imageNet['jaro_winkler_distance' ] > 0.9]
result

Unnamed: 0,Class ID,imageNet_class,jaro_winkler_distance,dbpedia_class
22,22,"[bald eagle, American eagle, Haliaeetus leucoc...",0.931429,American Leader
64,64,[green mamba],0.905455,green alga
132,132,"[American egret, great white heron, Egretta al...",0.914762,American Leader
134,134,"[crane, bird]",1.000000,bird
231,231,[collie],0.909524,college
...,...,...,...,...
979,979,"[valley, vale]",1.000000,valley
980,980,[volcano],1.000000,volcano
981,981,"[ballplayer, baseball player]",1.000000,baseball player
989,989,"[hip, rose hip, rosehip]",0.916667,ship


Let's look at the number of classes with a distance score higher than 0.8

In [8]:
result =  df_imageNet[df_imageNet['jaro_winkler_distance' ] > 0.8]
result

Unnamed: 0,Class ID,imageNet_class,jaro_winkler_distance,dbpedia_class
2,2,"[great white shark, white shark, man-eater, ma...",0.844444,man
5,5,"[electric ray, crampfish, numbfish, torpedo]",0.841905,electrical substation
7,7,[cock],0.833333,lock
10,10,"[brambling, Fringilla montifringilla]",0.896296,brain
11,11,"[goldfinch, Carduelis carduelis]",0.835133,Cardinal direction
...,...,...,...,...
988,988,[acorn],0.893333,actor
989,989,"[hip, rose hip, rosehip]",0.916667,ship
990,990,"[buckeye, horse chestnut, conker]",0.894444,conifer
995,995,[earthstar],0.853333,earthquake


Let's look at the number of classes with a distance score higher than 0.7

In [9]:
result =  df_imageNet[df_imageNet['jaro_winkler_distance' ] > 0.7]
result

Unnamed: 0,Class ID,imageNet_class,jaro_winkler_distance,dbpedia_class
0,0,"[tench, Tinca tinca]",0.790000,tenure
2,2,"[great white shark, white shark, man-eater, ma...",0.844444,man
3,3,"[tiger shark, Galeocerdo cuvieri]",0.762756,figure skater
4,4,"[hammerhead, hammerhead shark]",0.777778,camera
5,5,"[electric ray, crampfish, numbfish, torpedo]",0.841905,electrical substation
...,...,...,...,...
995,995,[earthstar],0.853333,earthquake
996,996,"[hen-of-the-woods, hen of the woods, Polyporus...",0.715278,island
997,997,[bolete],0.800000,bone
998,998,"[ear, spike, capitulum]",0.916667,year


Conclusion: We could use the Jaro-Winkler distance, but we would need to set the threshold very high, because with > 0.8 we already have 910 out of all classes.
One problem is that it only compares strings, not semantics (for example, Class ID 999 has a score of 0.76, but the matched label is actually not the same thing).
Also, there is no significant change between choosing Jaro and Jaro-Winkler.
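To compare thresholds side by side rather than one cell at a time, the counts can be collected in one pass. The sketch below is only an illustration: it uses the nine scores visible in the first output table above, not the full column, and `count_above` is a hypothetical helper name.

```python
# The nine scores visible in the first output table (not the full data set).
scores = [0.790000, 0.693182, 0.844444, 0.762756, 0.777778,
          0.853333, 0.715278, 0.800000, 0.916667]


def count_above(scores, thresholds):
    """Return {threshold: number of scores strictly greater than it}."""
    return {t: sum(score > t for score in scores) for t in thresholds}


counts = count_above(scores, [0.7, 0.8, 0.9])
for threshold, count in counts.items():
    print(f"> {threshold}: {count} of {len(scores)}")
```

Even on this small sample most rows clear 0.7 although none of the visible matches is semantically correct, which is consistent with the need for a very high threshold.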



Idea: Right now we take the highest value over all names in the ImageNet class array. Let's change this so that, for each DBpedia class, we take the average of the values over the array as the distance, and then assign the DBpedia class with the highest average.
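A minimal sketch of this idea, under a few assumptions: `best_label_by_mean` is a hypothetical helper, the similarity function is passed in as a parameter, and the default is a stand-in based on `difflib.SequenceMatcher` so that the example runs without extra packages. In the notebook, `jellyfish.jaro_winkler_similarity` would be passed instead.

```python
from difflib import SequenceMatcher
from statistics import mean


def stand_in_similarity(a: str, b: str) -> float:
    """Stand-in string similarity in [0, 1]; NOT Jaro-Winkler."""
    return SequenceMatcher(None, a, b).ratio()


def best_label_by_mean(synonyms, labels, similarity=stand_in_similarity):
    """Pick the label whose mean similarity over all synonyms is highest.

    `synonyms` must be non-empty, otherwise statistics.mean raises an error.
    """
    best_label, best_score = "", 0.0
    for label in labels:
        # Average over every name of the ImageNet class instead of the maximum.
        score = mean(similarity(s, label) for s in synonyms)
        if score > best_score:
            best_label, best_score = label, score
    return best_label, best_score


# Toy input: one ImageNet class array and two candidate labels.
label, score = best_label_by_mean(["hip", "rose hip", "rosehip"], ["ship", "rose"])
print(label, round(score, 3))
```

One consequence of averaging is that a single exact synonym no longer yields a score of 1.0: a class such as [crane, bird] would score below 1.0 against "bird", because "crane" pulls the mean down.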