## Fuzzy String Matching
Fuzzy string matching, also known as approximate string matching, is the process of finding strings that approximately match a pattern.

In Python, FuzzyWuzzy is a library that uses Levenshtein distance to calculate the differences between sequences.
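
To make the notion of Levenshtein distance concrete, here is a minimal pure-Python sketch of the classic dynamic-programming formulation. The function name `levenshtein` is chosen for illustration only and is not part of FuzzyWuzzy, which may compute its scores differently internally.

```python
def levenshtein(a: str, b: str) -> int:
    """Return the minimum number of single-character edits turning a into b."""
    # previous[j] is the distance between the prefix of a seen so far and b[:j].
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]  # turning a[:i] into the empty string takes i deletions
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or keep) the character
            ))
        previous = current
    return previous[-1]


print(levenshtein("cock", "lock"))       # 1
print(levenshtein("kitten", "sitting"))  # 3
```

A smaller distance means the two strings are closer; similarity ratios such as the ones used below turn this idea into a score between 0 and 100.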

In [1]:
from fuzzywuzzy import fuzz
from fuzzywuzzy import process
import pandas as pd

### Data Loading and Preprocessing

In [6]:
df_dbpedia = pd.read_csv("/Users/twei/workplace/ISE-workplace/ISE-Linking-Entities-from-Images-to-Knowledge-Graphs/src/data/dbpedia_classes.csv")
df_imageNet = pd.read_csv("/Users/twei/workplace/ISE-workplace/ISE-Linking-Entities-from-Images-to-Knowledge-Graphs/src/data/ImageNet.csv")
df_imageNet['Class Name'] = df_imageNet['Class Name'].str.split(",")
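
One detail worth checking: splitting on `","` keeps any whitespace that follows the comma. The sketch below assumes a raw `Class Name` value looks like `"tench, Tinca tinca"` (the CSV itself is not shown here, so this shape is an assumption) and shows how stripping each item would remove the leading space.

```python
raw = "tench, Tinca tinca"  # assumed shape of one raw 'Class Name' value

# A plain split keeps the space after the comma.
print(raw.split(","))  # ['tench', ' Tinca tinca']

# Stripping each item yields clean synonyms for string comparison.
names = [name.strip() for name in raw.split(",")]
print(names)  # ['tench', 'Tinca tinca']
```

A leading space counts as a character in a similarity ratio, so it can slightly lower the score of an otherwise good match.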

In [7]:
df_imageNet

Unnamed: 0,Class ID,Class Name
0,0,"[tench, Tinca tinca]"
1,1,"[goldfish, Carassius auratus]"
2,2,"[great white shark, white shark, man-eater, ..."
3,3,"[tiger shark, Galeocerdo cuvieri]"
4,4,"[hammerhead, hammerhead shark]"
...,...,...
995,995,[earthstar]
996,996,"[hen-of-the-woods, hen of the woods, Polypor..."
997,997,[bolete]
998,998,"[ear, spike, capitulum]"


In [8]:
df_dbpedia

Unnamed: 0,class,label
0,http://dbpedia.org/ontology/Company,company
1,http://dbpedia.org/ontology/Activity,activity
2,http://dbpedia.org/ontology/Name,name
3,http://dbpedia.org/ontology/Person,person
4,http://dbpedia.org/ontology/Actor,actor
...,...,...
778,http://dbpedia.org/ontology/WindMotor,Wind motor
779,http://dbpedia.org/ontology/Windmill,Windmill
780,http://dbpedia.org/ontology/Wine,wine
781,http://dbpedia.org/ontology/Woman,woman


### String Distance Calculation
As we can see, each row of `df_imageNet` holds a list of strings. The initial prototype compares every string in an ImageNet class list with every DBpedia label and keeps the highest similarity ratio found. We take the maximum over the whole list because its strings are synonyms that may be spelled quite differently.

In [14]:
def get_max_fuzzy_ratio():
    """Add the best-matching DBpedia label and its ratio to each ImageNet class."""
    max_ratio_list = []
    max_label_list = []
    for class_list in df_imageNet['Class Name']:
        max_val = 0
        max_label = ''
        # Compare every synonym of the ImageNet class with every DBpedia label.
        for item in class_list:
            for label in df_dbpedia['label']:
                ratio = fuzz.ratio(item, label)  # compute the score once per pair
                if ratio > max_val:
                    max_val = ratio
                    max_label = label
        max_ratio_list.append(max_val)
        max_label_list.append(max_label)

    df_imageNet['max_similarity_ratio'] = max_ratio_list
    df_imageNet['dbpedia_class'] = max_label_list
    df_imageNet.rename(columns={"Class Name": "imageNet_class"}, inplace=True)

get_max_fuzzy_ratio()
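
The search above can also be tried on toy data without the CSV files. The following is a minimal sketch, not the notebook's implementation: it assumes that `difflib.SequenceMatcher` from the standard library is an acceptable stand-in for `fuzz.ratio`, and `similarity`, `best_label`, and `toy_labels` are hypothetical names introduced for illustration.

```python
from difflib import SequenceMatcher
from itertools import product


def similarity(a, b):
    # Assumed stand-in for fuzz.ratio: SequenceMatcher ratio scaled to 0-100.
    return round(100 * SequenceMatcher(None, a, b).ratio())


def best_label(synonyms, labels):
    # Score every (synonym, label) pair and keep the highest-scoring one.
    # max() returns the first maximum, like a strict ">" comparison in a loop.
    synonym, label = max(product(synonyms, labels),
                         key=lambda pair: similarity(*pair))
    return label, similarity(synonym, label)


toy_labels = ["company", "fish", "person"]  # hypothetical subset of labels
print(best_label(["goldfish", "Carassius auratus"], toy_labels))  # ('fish', 67)
```

The nested loop and the `max` over `product` do the same amount of work; the compact form only makes the "best pair wins" idea explicit.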


In [15]:
df_imageNet[:30]

Unnamed: 0,Class ID,imageNet_class,max_similarity_ratio,dbpedia_class
0,0,"[tench, Tinca tinca]",60,beach
1,1,"[goldfish, Carassius auratus]",67,fish
2,2,"[great white shark, white shark, man-eater, ...",63,monastery
3,3,"[tiger shark, Galeocerdo cuvieri]",58,figure skater
4,4,"[hammerhead, hammerhead shark]",62,camera
5,5,"[electric ray, crampfish, numbfish, torpedo]",62,fish
6,6,[stingray],57,stream
7,7,[cock],75,lock
8,8,[hen],57,chef
9,9,"[ostrich, Struthio camelus]",67,district


### Evaluation
Without ground truth, we cannot perform a proper quantitative evaluation. From an overview of the results, however, we can see that:
- String matching finds class mappings that are based on spelling similarity. For example, "[electric ray, crampfish, numbfish, torpedo]" is mapped to the class fish.
- For ImageNet classes that are semantically similar to a DBpedia class but spelled differently, string matching does not work well. For example, hen is mapped to chef. To solve this problem, we may need an NLP model to perform a higher-level, semantic mapping.
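
The hen-to-chef mapping can be traced by hand. The sketch below assumes the score behaves like `difflib.SequenceMatcher`'s ratio scaled to 0-100, that is `100 * 2 * M / T`, where `M` is the number of matched characters and `T` is the combined length of both strings. `similarity` is a hypothetical helper, and "bird" is a hypothetical label used only for illustration; it is not confirmed to appear in `df_dbpedia`.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    # Assumed stand-in for fuzz.ratio: 100 * 2 * M / T, rounded to an integer.
    return round(100 * SequenceMatcher(None, a, b).ratio())


# "hen" and "chef" share the block "he": M = 2, T = 3 + 4, so 100 * 4 / 7 ≈ 57.
print(similarity("hen", "chef"))  # 57

# "hen" and "bird" share no characters at all, so the score is 0.
print(similarity("hen", "bird"))  # 0
```

The score depends only on shared characters, so a label that is close in meaning but different in spelling can never win against an unrelated label that happens to share a few letters.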