## Fuzzy String Matching
Fuzzy String Matching, also known as Approximate String Matching, is the process of finding strings that approximately match a pattern. 

In Python, FuzzyWuzzy is a library that uses Levenshtein distance to measure the difference between two strings. Its `fuzz.ratio` function returns a similarity score from 0 (nothing in common) to 100 (identical).
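
To make the distance itself concrete, here is a minimal pure-Python sketch of Levenshtein distance using the classic dynamic-programming recurrence. The function name `levenshtein` is chosen for illustration only. This is not FuzzyWuzzy's own implementation, just the textbook definition of the metric.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    or substitutions needed to turn a into b."""
    # previous[j] holds the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty string
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]

print(levenshtein("kitten", "sitting"))  # 3
```

A smaller distance means more similar strings; library scores such as `fuzz.ratio` normalise this idea to a 0 to 100 scale.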

In [1]:
from fuzzywuzzy import fuzz
from fuzzywuzzy import process
import pandas as pd

### Data Loading and Preprocessing

In [2]:
df_dbpedia = pd.read_csv("/Users/twei/workplace/ISE-workplace/ISE-Linking-Entities-from-Images-to-Knowledge-Graphs/src/data/dbpedia_classes.csv")
df_imageNet = pd.read_csv("/Users/twei/workplace/ISE-workplace/ISE-Linking-Entities-from-Images-to-Knowledge-Graphs/src/data/ImageNet.csv")
df_imageNet['Class Name'] = df_imageNet['Class Name'].str.replace(", ",",")
df_imageNet['Class Name'] = df_imageNet['Class Name'].str.split(",")

In [3]:
df_imageNet.iloc[1]['Class Name'] 

['goldfish', 'Carassius auratus']
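
The two pandas calls above can be illustrated on a single string in plain Python. This is only a sketch: the helper name `split_class_names` is hypothetical, and the raw input string is assumed from the output shown above rather than read from the CSV file.

```python
def split_class_names(raw):
    """Split a comma-separated synonym string into a list of names."""
    # Same two steps as the pandas cell: drop the space after each comma, then split.
    return raw.replace(", ", ",").split(",")

# Assumed raw value of the 'Class Name' column for the goldfish row.
print(split_class_names("goldfish, Carassius auratus"))
# ['goldfish', 'Carassius auratus']
```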

### String Distance Calculation - Fuzzy Mapping
As we can see, each row holds a list of strings. The initial prototype compares every string in an ImageNet class list with every DBpedia class label and keeps the pair with the highest similarity ratio, because the strings in one list are synonyms that may be spelled quite differently.

In [4]:
def get_max_fuzzy_ratio():
    """For each ImageNet class, find the DBpedia label with the highest
    fuzz.ratio over all of the class's synonyms."""
    max_ratio_list = []
    max_label_list = []
    for class_list in df_imageNet['Class Name']:
        max_val = 0
        max_label = ''
        for item in class_list:
            for label in df_dbpedia['label']:
                # Compute the ratio once per (synonym, label) pair.
                ratio = fuzz.ratio(item, label)
                if ratio > max_val:
                    max_val = ratio
                    max_label = label
        max_ratio_list.append(max_val)
        max_label_list.append(max_label)

    # Store the best score and label next to each ImageNet class.
    df_imageNet['max_similarity_ratio'] = max_ratio_list
    df_imageNet['dbpedia_class'] = max_label_list
    df_imageNet.rename(columns={"Class Name": "imageNet_class"}, inplace=True)

get_max_fuzzy_ratio()
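
The same best-match search can be tried on a toy example without the CSV files. This sketch makes two assumptions: `similarity` is a stand-in built on the standard library's `difflib.SequenceMatcher` rather than `fuzz.ratio` itself (so scores may differ from FuzzyWuzzy's in general), and the label list is a hypothetical two-item subset of the DBpedia classes.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Stand-in for fuzz.ratio: a 0-100 similarity score from the standard library."""
    return int(round(100 * SequenceMatcher(None, a, b).ratio()))

def best_match(synonyms, labels):
    """Return the (label, score) pair with the highest score
    over all synonym/label combinations."""
    best_label, best_score = "", 0
    for synonym in synonyms:
        for label in labels:
            score = similarity(synonym, label)
            if score > best_score:
                best_label, best_score = label, score
    return best_label, best_score

labels = ["fish", "camera"]  # hypothetical subset of the DBpedia labels
print(best_match(["goldfish", "Carassius auratus"], labels))  # ('fish', 67)
```

Because only the single best pair is kept, one synonym with a high character overlap is enough to decide the mapping for the whole class.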


In [5]:
df_imageNet

| Unnamed: 0 | Class ID | imageNet_class | max_similarity_ratio | dbpedia_class |
|---|---|---|---|---|
| 0 | 0 | [tench, Tinca tinca] | 60 | beach |
| 1 | 1 | [goldfish, Carassius auratus] | 67 | fish |
| 2 | 2 | [great white shark, white shark, man-eater, ma... | 67 | monastery |
| 3 | 3 | [tiger shark, Galeocerdo cuvieri] | 58 | figure skater |
| 4 | 4 | [hammerhead, hammerhead shark] | 62 | camera |
| ... | ... | ... | ... | ... |
| 995 | 995 | [earthstar] | 67 | artist |
| 996 | 996 | [hen-of-the-woods, hen of the woods, Polyporus... | 50 | unit of work |
| 997 | 997 | [bolete] | 67 | letter |
| 998 | 998 | [ear, spike, capitulum] | 86 | year |


In [74]:
df_imageNet.sort_values(ascending = False, by= "max_similarity_ratio")

| Unnamed: 0 | Class ID | imageNet_class | max_similarity_ratio | dbpedia_class |
|---|---|---|---|---|
| 487 | 487 | [cellular telephone, cellular phone, cellphone... | 100 | mobile phone |
| 408 | 408 | [amphibian, amphibious vehicle] | 100 | amphibian |
| 762 | 762 | [restaurant, eating house, eating place, eatery] | 100 | restaurant |
| 497 | 497 | [church, church building] | 100 | church |
| 437 | 437 | [beacon, lighthouse, beacon light, pharos] | 100 | lighthouse |
| ... | ... | ... | ... | ... |
| 203 | 203 | [West Highland white terrier] | 45 | old territory |
| 165 | 165 | [black-and-tan coonhound] | 45 | chemical compound |
| 500 | 500 | [cliff dwelling] | 45 | building |
| 222 | 222 | [kuvasz] | 40 | guitarist |


In [75]:
# to_csv() returns None when given a path, so write each file without assigning the result.
ratio = df_imageNet['max_similarity_ratio']
df_imageNet[ratio < 100].to_csv("matching_under_100.csv")
# Export the non-exact matches in bands: [90, 100), [80, 90), [70, 80).
df_imageNet[(ratio >= 90) & (ratio < 100)].to_csv("matching_over_90.csv")
df_imageNet[(ratio >= 80) & (ratio < 90)].to_csv("matching_over_80.csv")
df_imageNet[(ratio >= 70) & (ratio < 80)].to_csv("matching_over_70.csv")

### Evaluation
Without ground truth, we cannot evaluate the mapping quantitatively. From an overview of the results, however, we can see that:
- String matching finds mappings that rest on character-level similarity. For example, "[electric ray, crampfish, numbfish, torpedo]" is mapped to the class fish.
- For ImageNet classes that are semantically similar to a DBpedia class but spelled differently, string matching does not work well. For example, hen is mapped to chef. To solve this problem, we may need an NLP model to do higher-level mapping.
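
The second point can be reproduced with a small sketch. As assumptions, `similarity` is a standard-library stand-in for `fuzz.ratio` based on `difflib.SequenceMatcher`, and "bird" is a hypothetical label picked only because it is semantically close to "hen"; it is not taken from the DBpedia file.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Stand-in for fuzz.ratio: a 0-100 character-level similarity score."""
    return int(round(100 * SequenceMatcher(None, a, b).ratio()))

# "hen" and "chef" share the substring "he", so they score well
# even though the meanings are unrelated.
print(similarity("hen", "chef"))  # 57

# "hen" and "bird" share no characters, so the related label scores zero.
print(similarity("hen", "bird"))  # 0
```

A character-level score carries no information about meaning, which is why a semantic model would be needed for this kind of pair.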