### Matching based on Jaccard distance
Jaccard distance measures how dissimilar two sets are: the lower the distance, the more similar the sets. In this notebook each string is treated as the set of its characters.
It is defined through the Jaccard similarity index, which is the number of elements in both sets divided by the number of elements in either set:
J(X,Y) = |X∩Y| / |X∪Y|

Then we can calculate the Jaccard Distance as follows:
 D(X,Y) = 1 – J(X,Y)
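
As a quick illustration of the two formulas, the sketch below computes them in plain Python on character sets, which is how the strings are compared in the cells that follow. The helper names `jaccard_similarity` and `jaccard_distance` are illustrative and are not part of the original notebook, which relies on `nltk.jaccard_distance`.

```python
def jaccard_similarity(x: set, y: set) -> float:
    """J(X,Y) = |X ∩ Y| / |X ∪ Y| (illustrative helper)."""
    union = x | y
    if not union:
        # Assumption: two empty sets are treated as identical.
        return 1.0
    return len(x & y) / len(union)


def jaccard_distance(x: set, y: set) -> float:
    """D(X,Y) = 1 - J(X,Y)."""
    return 1.0 - jaccard_similarity(x, y)


# Strings are compared as sets of characters, so order and repetition are ignored.
print(jaccard_distance(set("night"), set("thing")))  # same characters -> 0.0
# {h, e, n} and {n, a, m, e} share 2 of 5 distinct characters -> 1 - 2/5
print(jaccard_distance(set("hen"), set("name")))
```

The second value corresponds to the `name` / `hen` cell of the distance matrix shown later in this section.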


In [5]:
import pandas as pd
import nltk

In [6]:
df_dbpedia = pd.read_csv("./files/dbpedia_classes.csv")
df_imageNet = pd.read_csv("./files/ImageNet.csv")
df_imageNet['Class Name'] = df_imageNet['Class Name'].str.replace(", ",",")
df_imageNet['Class Name'] = df_imageNet['Class Name'].str.split(",")
df_imageNet.rename(columns = {"Class Name":"imageNet_class"}, inplace=True)

In [13]:
def get_min_jaccard_distance():
    """For each ImageNet class, find the DBpedia label with the smallest Jaccard distance."""
    min_jaccard_list = []
    min_label_list = []
    for class_list in df_imageNet['imageNet_class']:
        min_val = 1
        min_label = ''
        # Compare every name of the class with every DBpedia label.
        for item in class_list:
            for label in df_dbpedia['label']:
                # Each string is reduced to its set of characters.
                distance = nltk.jaccard_distance(set(item), set(label))
                if distance < min_val:
                    min_val = distance
                    min_label = label
        min_jaccard_list.append(min_val)
        min_label_list.append(min_label)

    df_imageNet['min_jaccard_distance'] = min_jaccard_list
    df_imageNet['dbpedia_class'] = min_label_list

get_min_jaccard_distance()

In [15]:
df_imageNet.sort_values(ascending = True, by= "min_jaccard_distance")[:50]

Unnamed: 0,Class ID,imageNet_class,min_jaccard_distance,dbpedia_class
751,751,"[racer, race car, racing car]",0.0,race
687,687,"[organ, pipe organ]",0.0,organ
497,497,"[church, church building]",0.0,church
134,134,"[crane, bird]",0.0,bird
478,478,[carton],0.0,cartoon
553,553,"[file, file cabinet, filing cabinet]",0.0,file
807,807,"[solar dish, solar collector, solar furnace]",0.0,roller coaster
668,668,[mosque],0.0,mosque
519,519,[crate],0.0,crater
355,355,[llama],0.0,mammal


### Evaluation
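At a glance, the Jaccard-distance-based matching pairs any two strings that use the same set of characters, regardless of order or repetition: "carton" is matched to "cartoon", "crate" to "crater", and "llama" to "mammal", each with distance 0.0. This makes it less accurate than fuzzy matching.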
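
The effect can be reproduced on pairs taken from the result table above. The sketch below is plain Python rather than the notebook's code. `difflib.SequenceMatcher` is used only as a stand-in for an order-aware similarity score, which is an assumption, since the fuzzy matching step is not shown in this section.

```python
from difflib import SequenceMatcher


def char_jaccard_distance(s: str, t: str) -> float:
    """Jaccard distance between the character sets of two non-empty strings."""
    a, b = set(s), set(t)
    return 1.0 - len(a & b) / len(a | b)


# Pairs that the result table reports with distance 0.0.
pairs = [("carton", "cartoon"), ("crate", "crater"), ("llama", "mammal")]
for left, right in pairs:
    d = char_jaccard_distance(left, right)
    # An order-aware ratio stays below 1.0 because the strings differ.
    ratio = SequenceMatcher(None, left, right).ratio()
    print(f"{left!r} vs {right!r}: jaccard distance={d:.1f}, sequence ratio={ratio:.2f}")
```

Each pair has identical character sets, so the set-based distance cannot tell the strings apart, while a sequence-based score still can.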

In [7]:
import nltk

In [8]:
img_net = pd.read_csv("./files/ImageNet.csv")
img_net_list = img_net["Class Name"].tolist()

dbpedia = pd.read_csv("./files/dbpedia_classes.csv")
db_list = dbpedia["label"].tolist()

In [6]:
dataframe_jaccard = pd.DataFrame(index =db_list ,columns =img_net_list)

In [16]:
for i in range(0,len(img_net_list)):
    for a in range(0,len(db_list)):
        if len(img_net_list[i].split(",")) > 1:
            # Several comma-separated names: keep the smallest distance over all of them.
            min_val = 1.0
            for el in img_net_list[i].split(","):
                el_val = nltk.jaccard_distance(set(el), set(db_list[a]))
                if el_val < min_val:
                    min_val = el_val
            dataframe_jaccard.iat[a,i] = min_val
        else:
            # Single name: compare the character sets directly.
            first = set(img_net_list[i])
            second = set(db_list[a])
            dataframe_jaccard.iat[a,i] = nltk.jaccard_distance(first,second)
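
The nested loop above splits each class name on commas and keeps the smallest distance over the resulting names. A minimal sketch of that step as a reusable function follows. The name `min_distance_over_names` is hypothetical, and unlike the cell above it strips the whitespace left after each comma and skips empty fragments, so its values can differ slightly from the matrix shown below.

```python
from typing import Callable


def char_jaccard_distance(s: str, t: str) -> float:
    """Jaccard distance between the character sets of two non-empty strings."""
    a, b = set(s), set(t)
    return 1.0 - len(a & b) / len(a | b)


def min_distance_over_names(
    class_name: str,
    label: str,
    distance: Callable[[str, str], float] = char_jaccard_distance,
) -> float:
    """Smallest distance between `label` and any comma-separated name."""
    # "crane, bird" -> ["crane", "bird"]; strip() and the empty-fragment
    # filter are additions that the original loop does not perform.
    names = [name.strip() for name in class_name.split(",") if name.strip()]
    return min(distance(name, label) for name in names)


print(min_distance_over_names("crane, bird", "bird"))  # exact match -> 0.0
print(min_distance_over_names("stingray", "name"))
```

Passing the distance function as a parameter would let the same helper serve other string metrics as well.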

In [17]:
dataframe_jaccard

Unnamed: 0,"tench, Tinca tinca","goldfish, Carassius auratus","great white shark, white shark, man-eater, man-eating shark, Carcharodon caharias',","tiger shark, Galeocerdo cuvieri","hammerhead, hammerhead shark","electric ray, crampfish, numbfish, torpedo",stingray,cock,hen,"ostrich, Struthio camelus",...,"buckeye, horse chestnut, conker",coral fungus,agaric,gyromitra,"stinkhorn, carrion fungus",earthstar,"hen-of-the-woods, hen of the woods, Polyporus frondosus, Grifola frondosa",bolete,"ear, spike, capitulum","toilet tissue, toilet paper, bathroom tissue"
company,0.727273,0.928571,0.733333,0.8125,0.818182,0.692308,0.75,0.75,0.888889,0.764706,...,0.727273,0.714286,0.8,0.636364,0.714286,0.916667,0.733333,0.909091,0.666667,0.769231
activity,0.555556,0.727273,0.769231,0.714286,0.909091,0.5,0.6,0.875,1.0,0.7,...,0.8,0.866667,0.625,0.6,0.785714,0.8,0.866667,0.9,0.636364,0.75
name,0.714286,0.909091,0.5,0.833333,0.571429,0.818182,0.8,1.0,0.6,0.8,...,0.777778,0.846154,0.875,0.8,0.846154,0.75,0.833333,0.875,0.6,0.769231
person,0.777778,0.833333,0.714286,0.769231,0.75,0.555556,0.727273,0.875,0.714286,0.7,...,0.545455,0.692308,0.9,0.833333,0.6,0.666667,0.615385,0.777778,0.666667,0.636364
actor,0.666667,0.7,0.692308,0.692308,0.777778,0.6,0.7,0.666667,1.0,0.5,...,0.636364,0.666667,0.571429,0.555556,0.666667,0.625,0.769231,0.75,0.666667,0.6
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Wind motor,0.666667,0.692308,0.583333,0.6875,0.714286,0.545455,0.692308,0.909091,0.909091,0.647059,...,0.642857,0.75,0.833333,0.583333,0.583333,0.846154,0.571429,0.833333,0.714286,0.6
Windmill,0.818182,0.727273,0.8,0.8,0.8,0.75,0.833333,1.0,0.875,0.823529,...,0.916667,0.866667,0.9,0.833333,0.833333,1.0,0.692308,0.9,0.75,0.833333
wine,0.714286,0.909091,0.727273,0.833333,0.888889,0.818182,0.8,1.0,0.6,0.875,...,0.777778,0.928571,0.875,0.909091,0.8,0.888889,0.727273,0.875,0.75,0.8
woman,0.8,0.916667,0.7,0.866667,0.777778,0.833333,0.818182,0.857143,0.857143,0.8125,...,0.8,0.769231,0.888889,0.7,0.769231,0.9,0.75,0.888889,0.833333,0.785714


In [18]:
nan = dataframe_jaccard.astype(float)
# For each ImageNet class (column), keep the three DBpedia labels with the smallest distance.
highest_values_nan_max = nan.apply(lambda s: s.abs().nsmallest(3).index.tolist(), axis=0)
highest_values_nan_max

Unnamed: 0,"tench, Tinca tinca","goldfish, Carassius auratus","great white shark, white shark, man-eater, man-eating shark, Carcharodon caharias',","tiger shark, Galeocerdo cuvieri","hammerhead, hammerhead shark","electric ray, crampfish, numbfish, torpedo",stingray,cock,hen,"ostrich, Struthio camelus",...,"buckeye, horse chestnut, conker",coral fungus,agaric,gyromitra,"stinkhorn, carrion fungus",earthstar,"hen-of-the-woods, hen of the woods, Polyporus frondosus, Grifola frondosa",bolete,"ear, spike, capitulum","toilet tissue, toilet paper, bathroom tissue"
0,train carriage,guitarist,animanga character,figure skater,drama,protected area,organisation,lock,gene,anatomical structure,...,soccer tournoment,soccer league season,glacier,agglomeration,historian,theatre,radio station,hotel,area,tennis tournament
1,insect,musical artist,ski area,clerical order,speed skater,time period,guitarist,book,shrine,comics character,...,naruto character,soccer league,vicar,mayor,formula one racing,stream,fashion designer,noble,classical music composition,comics character
2,scientist,classical music artist,figure skater,periodical literature,ski area,article,gymnast,rocket,name,architectural structure,...,country estate,golf course,aircraft,organisation,soccer league season,skater,Controlled designation of origin wine,automobile,year,historic place


In [19]:
from pathlib import Path  

filepath = Path('../evaluation/files/jaccard_top3.csv')  

filepath.parent.mkdir(parents=True, exist_ok=True)  

highest_values_nan_max.to_csv(filepath, index=False)  

### Levenshtein

In [9]:
dataframe_lev = pd.DataFrame(index =db_list ,columns =img_net_list)

In [10]:
from nltk.metrics import edit_distance

In [15]:
for i in range(0,len(img_net_list)):
    for a in range(0,len(db_list)):
        if len(img_net_list[i].split(",")) > 1:
            # Several comma-separated names: keep the smallest edit distance.
            # The starting value must exceed any possible distance; a start of 1.0
            # could only ever be beaten by an exact match (distance 0).
            min_val = 10000000.0
            for el in img_net_list[i].split(","):
                el_val = edit_distance(el, db_list[a], substitution_cost=1, transpositions=False)
                if el_val < min_val:
                    min_val = el_val
            dataframe_lev.iat[a,i] = min_val
        else:
            # Single name: compare the two strings directly.
            dataframe_lev.iat[a,i] = edit_distance(img_net_list[i], db_list[a], substitution_cost=1, transpositions=False)
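
For reference, with `substitution_cost=1` and `transpositions=False` the edit distance is the classic Levenshtein distance: the minimum number of single-character insertions, deletions, and substitutions needed to turn one string into the other. The sketch below is a plain-Python dynamic-programming version for illustration only, not NLTK's implementation; the function name `levenshtein` is an assumption.

```python
def levenshtein(s: str, t: str) -> int:
    """Minimum number of insertions, deletions and substitutions from s to t."""
    # previous[j] holds the distance between the processed prefix of s and t[:j].
    previous = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        current = [i]  # turning s[:i] into "" takes i deletions
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            current.append(min(
                previous[j] + 1,         # delete cs
                current[j - 1] + 1,      # insert ct
                previous[j - 1] + cost,  # substitute, or keep a match
            ))
        previous = current
    return previous[-1]


print(levenshtein("carton", "cartoon"))  # 1
print(levenshtein("hen", "wine"))        # 3
```

Unlike the character-set Jaccard distance, this gives 1 rather than 0 for "carton" and "cartoon", because order and repetition matter.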

In [14]:
dataframe_lev

Unnamed: 0,"tench, Tinca tinca","goldfish, Carassius auratus","great white shark, white shark, man-eater, man-eating shark, Carcharodon caharias',","tiger shark, Galeocerdo cuvieri","hammerhead, hammerhead shark","electric ray, crampfish, numbfish, torpedo",stingray,cock,hen,"ostrich, Struthio camelus",...,"buckeye, horse chestnut, conker",coral fungus,agaric,gyromitra,"stinkhorn, carrion fungus",earthstar,"hen-of-the-woods, hen of the woods, Polyporus frondosus, Grifola frondosa",bolete,"ear, spike, capitulum","toilet tissue, toilet paper, bathroom tissue"
company,1.0,1.0,1.0,1.0,1.0,1.0,7,5,6,1.0,...,1.0,9,7,7,1.0,9,1.0,6,1.0,1.0
activity,1.0,1.0,1.0,1.0,1.0,1.0,6,7,8,1.0,...,1.0,11,6,7,1.0,7,1.0,7,1.0,1.0
name,1.0,1.0,1.0,1.0,1.0,1.0,7,4,4,1.0,...,1.0,11,5,8,1.0,8,1.0,5,1.0,1.0
person,1.0,1.0,1.0,1.0,1.0,1.0,8,6,4,1.0,...,1.0,10,6,8,1.0,7,1.0,6,1.0,1.0
actor,1.0,1.0,1.0,1.0,1.0,1.0,7,4,5,1.0,...,1.0,11,5,7,1.0,6,1.0,6,1.0,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Wind motor,1.0,1.0,1.0,1.0,1.0,1.0,9,9,9,1.0,...,1.0,11,10,8,1.0,8,1.0,9,1.0,1.0
Windmill,1.0,1.0,1.0,1.0,1.0,1.0,7,8,7,1.0,...,1.0,12,7,7,1.0,9,1.0,8,1.0,1.0
wine,1.0,1.0,1.0,1.0,1.0,1.0,6,4,3,1.0,...,1.0,11,6,8,1.0,9,1.0,5,1.0,1.0
woman,1.0,1.0,1.0,1.0,1.0,1.0,7,4,4,1.0,...,1.0,9,6,7,1.0,8,1.0,5,1.0,1.0


In [12]:
nan = dataframe_lev.astype(float)
# For each ImageNet class (column), keep the three DBpedia labels with the smallest distance.
highest_values_nan_max = nan.apply(lambda s: s.abs().nsmallest(3).index.tolist(), axis=0)
highest_values_nan_max

Unnamed: 0,"tench, Tinca tinca","goldfish, Carassius auratus","great white shark, white shark, man-eater, man-eating shark, Carcharodon caharias',","tiger shark, Galeocerdo cuvieri","hammerhead, hammerhead shark","electric ray, crampfish, numbfish, torpedo",stingray,cock,hen,"ostrich, Struthio camelus",...,"buckeye, horse chestnut, conker",coral fungus,agaric,gyromitra,"stinkhorn, carrion fungus",earthstar,"hen-of-the-woods, hen of the woods, Polyporus frondosus, Grifola frondosa",bolete,"ear, spike, capitulum","toilet tissue, toilet paper, bathroom tissue"
0,company,company,company,company,company,company,mineral,lock,chef,company,...,company,fungus,award,Pyramid,company,artist,company,bone,company,company
1,activity,activity,activity,activity,activity,activity,winery,book,sea,activity,...,activity,road tunnel,cleric,comic,activity,earthquake,activity,college,activity,activity
2,name,name,name,name,name,name,single,coach,vein,name,...,name,coal pit,comic,criminal,name,star,name,poet,name,name
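
The columns above that list `company`, `activity`, `name` are consistent with a top-3 selection over tied values: in the distance table, those columns hold 1.0 in every displayed row, and pandas' `nsmallest` keeps the first occurrences by default (`keep='first'`), so the first three index labels win. The sketch below mimics that behaviour with a stable sort in plain Python. The helper name `top_k_labels` is hypothetical, and the inputs are only the first five rows of the distance table above, not the full data.

```python
def top_k_labels(distances: dict, k: int = 3) -> list:
    """Labels of the k smallest distances; ties keep their original order."""
    # sorted() is stable, so equal distances stay in insertion order,
    # which mirrors nsmallest(..., keep='first').
    ranked = sorted(distances.items(), key=lambda item: item[1])
    return [label for label, _ in ranked[:k]]


# First five rows of the "tench, Tinca tinca" column: every value is 1.0.
tied = {"company": 1.0, "activity": 1.0, "name": 1.0, "person": 1.0, "actor": 1.0}
# First five rows of the "hen" column, where the distances differ.
hen = {"company": 6, "activity": 8, "name": 4, "person": 4, "actor": 5}

print(top_k_labels(tied))  # ['company', 'activity', 'name']
print(top_k_labels(hen))   # ['name', 'person', 'actor']
```

When a whole column is tied, the ranking carries no information about the class, so such columns are better treated as unmatched than as real top-3 results.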
