# Introduction

This notebook uses Magellan, an Entity Matching (EM) tool developed at the University of Wisconsin-Madison, to match restaurants from Yelp (Dataset A) and TripAdvisor (Dataset B). Each dataset contains information on more than 3,000 restaurants in Los Angeles.

First, import *py_entitymatching* and other libraries.

In [49]:
import py_entitymatching as em
import pandas as pd
import warnings
warnings.simplefilter('ignore')

# Read input tables

Load the CSV files as dataframes and set the key attribute of each dataframe.

In [50]:
A = em.read_csv_metadata('../data/A.csv', key='id')
B = em.read_csv_metadata('../data/B.csv', key='id')

In [51]:
print('Number of tuples in A: ' + str(len(A)))
print('Number of tuples in B: ' + str(len(B)))
print('Number of tuples in A X B: ' + str(len(A) * len(B)))

Number of tuples in A: 3188
Number of tuples in B: 3120
Number of tuples in A X B: 9946560


In [52]:
A.head(4)

Unnamed: 0,id,name,category_1,category_2,address,city,zipcode,phone,price,rating,...,hours_tue_open,hours_tue_close,hours_wed_open,hours_wed_close,hours_thu_open,hours_thu_close,hours_fri_open,hours_fir_close,hours_sat_open,hours_sat_close
0,a1,Hae Jang Chon Korean BBQ Restaurant,korean,barbeque,3821 W 6th St,Los Angeles,90020.0,2133899000.0,2.0,4.0,...,1100.0,200.0,1100.0,200.0,1100.0,200.0,1100.0,200.0,1100.0,200.0
1,a2,Kang Ho-dong Baekjeong,barbeque,korean,3465 W 6th St,Los Angeles,90020.0,2133850000.0,2.0,4.5,...,1130.0,130.0,1130.0,130.0,1130.0,130.0,1130.0,130.0,1130.0,130.0
2,a3,Road To Seoul,korean,barbeque,1230 S Western Ave,Los Angeles,90006.0,3237319000.0,2.0,4.0,...,1100.0,2400.0,1100.0,2400.0,1100.0,100.0,1100.0,100.0,1100.0,2300.0
3,a4,Langer's,delis,sandwiches,704 S Alvarado St,Los Angeles,90057.0,2134838000.0,2.0,4.5,...,800.0,1600.0,800.0,1600.0,800.0,1600.0,800.0,1600.0,,


In [53]:
B.head(4)

Unnamed: 0,id,name,category_1,category_2,address,city,zipcode,phone,price,rating,...,hours_tue_open,hours_tue_close,hours_wed_open,hours_wed_close,hours_thu_open,hours_thu_close,hours_fri_open,hours_fri_close,hours_sat_open,hours_sat_close
0,b1,Providence,Seafood,Vegetarian Friendly,5955 Melrose Ave,Los Angeles,90038.0,3238378000.0,4.0,4.5,...,1800.0,2200.0,1800.0,2200.0,1800.0,2200.0,1800.0,2200.0,1730.0,2200.0
1,b2,Raffaello Ristorante,Italian,Vegetarian Friendly,400 S Pacific Ave,Los Angeles,90731.0,3105141000.0,2.5,4.5,...,1100.0,1400.0,1100.0,1400.0,1100.0,1400.0,1100.0,1400.0,1500.0,2145.0
2,b3,Brent's Delicatessen & Restaurant,American,Delicatessen,19565 Parthenia St,Los Angeles,91324.0,8188866000.0,2.5,4.5,...,600.0,2100.0,600.0,2100.0,600.0,2100.0,600.0,2100.0,600.0,2100.0
3,b4,Tocaya Organica,Mexican,Latin,1715 Pacific Avenue,Los Angeles,,4247449000.0,1.0,4.5,...,,,,,,,,,,


Since both tables are small, downsampling is not performed.

# Block tables to get candidate set

## First attempt

Blocker 1: Matching restaurants should be in the same city

Since all the restaurants are in the Los Angeles area, blocker 1 still results in a large number of candidate pairs.

In [54]:
# Create attribute equivalence blocker
ab1 = em.AttrEquivalenceBlocker()

# Block using city attribute
C1 = ab1.block_tables(A, B, 'city', 'city',
                     l_output_attrs = ['name', 'address', 'city', 'zipcode', 'phone'],
                     r_output_attrs = ['name', 'address', 'city', 'zipcode', 'phone'])

len(C1)

3435309

Blocker 2: Matching restaurants should be within the same zipcode

This step reduces the previous candidate set by roughly a factor of 12.

In [55]:
# Create attribute equivalence blocker
ab2 = em.AttrEquivalenceBlocker()

# Block using zipcode attribute
C2 = ab2.block_candset(C1, 'zipcode', 'zipcode', allow_missing = True, 
                      show_progress = False)

len(C2)

279897
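To make the blocking rule concrete, here is a minimal pure-Python sketch of attribute-equivalence blocking. It is not Magellan's implementation; the function name `attr_equiv_pairs` and the toy records are made up for illustration. A pair survives when both values are equal or, mirroring `allow_missing = True`, when either value is missing.

```python
from itertools import product

def attr_equiv_pairs(left, right, attr, allow_missing=True):
    """Return (left_id, right_id) pairs whose `attr` values are equal.

    Hypothetical helper, not part of py_entitymatching. With
    allow_missing=True a pair is also kept when either value is None.
    """
    kept = []
    for l, r in product(left, right):
        lv, rv = l.get(attr), r.get(attr)
        if lv is None or rv is None:
            if allow_missing:
                kept.append((l["id"], r["id"]))
        elif lv == rv:
            kept.append((l["id"], r["id"]))
    return kept

# Toy records (made up), not rows from A.csv or B.csv
left = [{"id": "a1", "zipcode": "90020"}, {"id": "a2", "zipcode": "90006"}]
right = [{"id": "b1", "zipcode": "90020"}, {"id": "b2", "zipcode": None}]

pairs = attr_equiv_pairs(left, right, "zipcode")
strict_pairs = attr_equiv_pairs(left, right, "zipcode", allow_missing=False)
print(pairs)
print(strict_pairs)
```

Keeping pairs with a missing value trades a larger candidate set for a lower risk of dropping true matches whose zipcode was never recorded.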

Blocker 3: Matching restaurants should use the same phone number

This blocker reduces the candidate set by roughly a factor of 5.

In [56]:
# Create attribute equivalence blocker
ab3 = em.AttrEquivalenceBlocker()

# Block using phone number
C3 = ab3.block_candset(C2, 'phone', 'phone', allow_missing = True,
                      show_progress = False)

len(C3)

51576

Blocker 4: Matching restaurants should share some common words in the street address

In this step, the stop words need to be updated to include words that are common in addresses, such as "Street", "St", "Avenue", "Ave", "Boulevard", "Blvd", "Drive", "Dr", "Road", "Rd", as well as the direction abbreviations "S", "N", "E", and "W".

This blocker reduces the candidate set by roughly a factor of 13 (from 51576 to 3885 pairs).


In [57]:
# Create overlap blocker
op1 = em.OverlapBlocker()

# Update stopwords
addr_stopwords = ['street','st','avenue','ave','boulevard','blvd',
                     'drive','dr','road','rd','s','n','e','w']
op1.stop_words.extend(addr_stopwords)

op1.stop_words

['a',
 'an',
 'and',
 'are',
 'as',
 'at',
 'be',
 'by',
 'for',
 'from',
 'has',
 'he',
 'in',
 'is',
 'it',
 'its',
 'on',
 'that',
 'the',
 'to',
 'was',
 'were',
 'will',
 'with',
 'street',
 'st',
 'avenue',
 'ave',
 'boulevard',
 'blvd',
 'drive',
 'dr',
 'road',
 'rd',
 's',
 'n',
 'e',
 'w']

In [58]:
# Block using address
C4 = op1.block_candset(C3, 'address', 'address', word_level = True, overlap_size = 1,
                      rem_stop_words = True, show_progress = False, allow_missing = True)

len(C4)

3885
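The word-level overlap rule can be sketched in a few lines of plain Python. This is an illustration only, not the `OverlapBlocker` code: the helper names are hypothetical, lowercasing is this sketch's own choice, and the library's default stop words ("a", "an", "the", ...) are left out for brevity. The addresses come from the table previews in this notebook.

```python
# Address stop words added in the cell above
STOP_WORDS = {"street", "st", "avenue", "ave", "boulevard", "blvd",
              "drive", "dr", "road", "rd", "s", "n", "e", "w"}

def address_tokens(address):
    """Lowercase, split on whitespace, and drop stop words."""
    return {tok for tok in address.lower().split() if tok not in STOP_WORDS}

def shares_address_word(left_addr, right_addr, overlap_size=1):
    """True when the two addresses share at least `overlap_size` tokens."""
    common = address_tokens(left_addr) & address_tokens(right_addr)
    return len(common) >= overlap_size

same_place = shares_address_word("3465 W 6th St", "3465 W 6th St Ste 20")
same_street_only = shares_address_word("3821 W 6th St", "3465 W 6th St")
different = shares_address_word("704 S Alvarado St", "3500 W 8th St")
print(same_place, same_street_only, different)
```

With `overlap_size = 1`, two restaurants on the same street pass the rule even when the house numbers differ, so this blocker is deliberately permissive.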

Blocker 5: Matching restaurants should share common tokens in the name

This step reduces the candidate set by roughly a factor of 4 (from 3885 to 956 pairs).

In [59]:
# Create overlap blocker
op2 = em.OverlapBlocker()

# Block using name
C5 = op2.block_candset(C4, 'name', 'name', word_level = False, q_val = 3,
                      overlap_size = 2, show_progress = False, allow_missing = True)

len(C5)

956
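Setting `word_level = False` with `q_val = 3` compares names by their 3-character substrings (3-grams) instead of whole words, which tolerates small spelling differences. The sketch below is a simplified stand-in: the helper names are hypothetical, and it neither pads the string nor removes punctuation, which the library's tokenizer may do.

```python
def qgrams(text, q=3):
    """Set of overlapping q-character substrings, lowercased, no padding."""
    s = text.lower()
    return {s[i:i + q] for i in range(len(s) - q + 1)}

def qgram_overlap(left, right, q=3):
    """Number of distinct q-grams the two strings share."""
    return len(qgrams(left, q) & qgrams(right, q))

# Names taken from the table previews in this notebook
near_duplicate = qgram_overlap("Kang Ho-dong Baekjeong", "Kang Hodong Baekjeong")
unrelated = qgram_overlap("Langer's", "Beer Belly")
print(near_duplicate, unrelated)
```

A pair is kept when the shared count reaches `overlap_size` (2 for names here), so "Kang Ho-dong Baekjeong" and "Kang Hodong Baekjeong" pass easily while the unrelated pair shares no 3-gram at all.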

In [60]:
C5.head(10)

Unnamed: 0,_id,ltable_id,rtable_id,ltable_name,ltable_address,ltable_city,ltable_zipcode,ltable_phone,rtable_name,rtable_address,rtable_city,rtable_zipcode,rtable_phone
133,133,a1,b162,Hae Jang Chon Korean BBQ Restaurant,3821 W 6th St,Los Angeles,90020.0,2133899000.0,Hae Jang Chon Korean BBQ Restaurant,3821 W 6th St,Los Angeles,90020.0,2133899000.0
2736,2736,a2,b268,Kang Ho-dong Baekjeong,3465 W 6th St,Los Angeles,90020.0,2133850000.0,Kang Hodong Baekjeong,3465 W 6th St,Los Angeles,90020.0,2133850000.0
4121,4121,a2,b1987,Kang Ho-dong Baekjeong,3465 W 6th St,Los Angeles,90020.0,2133850000.0,KangHoDong Baek Jeong,3465 W 6th St Ste 20,Los Angeles,90020.0,2133850000.0
5613,5613,a3,b710,Road To Seoul,1230 S Western Ave,Los Angeles,90006.0,3237319000.0,Road to Seoul,1230 S Western Ave,Los Angeles,90006.0,3237319000.0
7584,7584,a4,b7,Langer's,704 S Alvarado St,Los Angeles,90057.0,2134838000.0,Langer's,704 S Alvarado St,Los Angeles,90057.0,2134838000.0
10415,10415,a5,b392,EMC Seafood & Raw Bar,3500 W 6th St,Los Angeles,90020.0,2133520000.0,EMC Seafood and Raw Bar,3500 W 6th St,Los Angeles,90020.0,2133520000.0
12803,12803,a6,b223,Soowon Galbi KBBQ Restaurant,856 S Vermont Ave,Los Angeles,90005.0,2133659000.0,Soowon Galbi,856 S Vermont Ave,Los Angeles,90005.0,2133659000.0
15462,15462,a7,b385,Beer Belly,532 S Western Ave,Los Angeles,90020.0,2133872000.0,Beer Belly,532 S Western Ave,Los Angeles,90020.0,2133872000.0
17979,17979,a8,b372,BCD Tofu House,3575 Wilshire Blvd,Los Angeles,90010.0,2133827000.0,BCD Tofu House,3575 Wilshire Blvd,Los Angeles,90010.0,2133827000.0
20954,20954,a9,b939,Slurpin' Ramen Bar,3500 W 8th St,Los Angeles,90005.0,2133889000.0,Slurpin' Ramen Bar,3500 W 8th St,Los Angeles,90005.0,2133889000.0
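As a quick sanity check on the blocking pipeline, the reduction achieved by each step can be computed directly from the candidate-set sizes printed above. The step labels are informal names chosen for this sketch.

```python
# Candidate-set sizes reported by the cells above
sizes = [
    ("city equivalence", 3435309),
    ("zipcode equivalence", 279897),
    ("phone equivalence", 51576),
    ("address word overlap", 3885),
    ("name 3-gram overlap", 956),
]

factors = {}
for (_, before), (step, after) in zip(sizes, sizes[1:]):
    factors[step] = before / after
    print(f"{step}: {before} -> {after} (x{factors[step]:.1f} smaller)")
```

The address overlap step gives the largest single reduction, and the name overlap step the smallest.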


### Debug Blocker

Check that the blockers did not drop any potential matches.

Based on the debugging results, the revised blocker drops the equivalence blocker on city, since the city attribute has a different meaning in the two tables.

The final blocker applies attribute equivalence on zipcode, then requires word overlap in the address, an overlap of at least two 3-grams in the name of the restaurant, and an overlap of at least four 3-grams in the phone number. The final candidate set (C) includes 1745 pairs.

In [61]:
# Debug blocker output
dbg = em.debug_blocker(C5, A, B, output_size = 200)

dbg.head()

Unnamed: 0,_id,ltable_id,rtable_id,ltable_name,ltable_category_1,ltable_category_2,ltable_address,ltable_city,rtable_name,rtable_category_1,rtable_category_2,rtable_address,rtable_city
0,0,a629,b2329,Andre's Italian Restaurant,italian,pizza,6332 W 3rd St,Los Angeles,All'Angolo,Italian,Pizza,4050 W 3rd St,Los Angeles
1,1,a2209,b1456,Tacone,american (new),,330 S Hope St,Los Angeles,California Pizza Kitchen,American,Pizza,330 S Hope St,Los Angeles
2,2,a73,b1471,Grand Central Market,food court,,317 S Broadway,Los Angeles,Golden Road at Grand Central,,,317 S Broadway,Los Angeles
3,3,a2284,b1649,Yoshinoya,japanese,fast food,3021 S Figueroa St,Los Angeles,Chick-fil-A,Fast Food,American,3758 S Figueroa St,Los Angeles
4,4,a1810,b780,Pitfire Artisan Pizza,pizza,sandwiches,5211 Lankershim Blvd,North Hollywood,Pitfire Artisan Pizza,Pizza,Italian,5211 Lankershim Blvd,Los Angeles


The debugging results point to two issues in the original blocker.

1) The city attribute in table A has a different meaning from the one in table B. In table A (Yelp), the city attribute holds either the city or a specific neighborhood (e.g. Hollywood) for some restaurants. Therefore, the attribute equivalence blocker on city needs to be removed.

2) Equivalence on phone number could be too aggressive. A single restaurant can have two different phone numbers, and typos are more likely in the phone number attribute. Therefore, the blocker on phone number needs to be changed to an overlap blocker.

## Final Blocker

In [62]:
# Block using zipcode attribute equivalence
ab3 = em.AttrEquivalenceBlocker()
C6 = ab3.block_tables(A, B, 'zipcode', 'zipcode', allow_missing = True,
                     l_output_attrs = ['name', 'address', 'city', 'zipcode', 'phone'],
                     r_output_attrs = ['name', 'address', 'city', 'zipcode', 'phone'])

len(C6)

1038989

In [63]:
# Block using address
op3 = em.OverlapBlocker()
op3.stop_words.extend(addr_stopwords)
C7 = op3.block_candset(C6, 'address', 'address', word_level = True, overlap_size = 1,
                      rem_stop_words = True, show_progress = False, allow_missing = False)

len(C7)

23390

In [64]:
# Block using name
op4 = em.OverlapBlocker()
C8 = op4.block_candset(C7, 'name', 'name', word_level = False, q_val = 3,
                      overlap_size = 2, show_progress = False, allow_missing = True)

len(C8)

2803

In [65]:
# Block using phone number
op5 = em.OverlapBlocker()
C = op5.block_candset(C8, 'phone', 'phone', word_level = False, q_val = 3,
                      overlap_size = 4, allow_missing = True, show_progress = False)


In [66]:
len(C)

1745

In [67]:
C.head()

Unnamed: 0,_id,ltable_id,rtable_id,ltable_name,ltable_address,ltable_city,ltable_zipcode,ltable_phone,rtable_name,rtable_address,rtable_city,rtable_zipcode,rtable_phone
1,1,a1,b162,Hae Jang Chon Korean BBQ Restaurant,3821 W 6th St,Los Angeles,90020.0,2133899000.0,Hae Jang Chon Korean BBQ Restaurant,3821 W 6th St,Los Angeles,90020.0,2133899000.0
2,2,a1,b268,Hae Jang Chon Korean BBQ Restaurant,3821 W 6th St,Los Angeles,90020.0,2133899000.0,Kang Hodong Baekjeong,3465 W 6th St,Los Angeles,90020.0,2133850000.0
10,10,a1,b1987,Hae Jang Chon Korean BBQ Restaurant,3821 W 6th St,Los Angeles,90020.0,2133899000.0,KangHoDong Baek Jeong,3465 W 6th St Ste 20,Los Angeles,90020.0,2133850000.0
23,23,a2,b162,Kang Ho-dong Baekjeong,3465 W 6th St,Los Angeles,90020.0,2133850000.0,Hae Jang Chon Korean BBQ Restaurant,3821 W 6th St,Los Angeles,90020.0,2133899000.0
24,24,a2,b268,Kang Ho-dong Baekjeong,3465 W 6th St,Los Angeles,90020.0,2133850000.0,Kang Hodong Baekjeong,3465 W 6th St,Los Angeles,90020.0,2133850000.0


In [68]:
# Debug blocker output
dbg = em.debug_blocker(C, A, B, output_size = 200)

dbg.head(50)

C.to_csv('../data/C.csv', index = False)

# Matching tuple pairs in the candidate set

## Sampling and labeling the candidate set

First, randomly sample 500 tuple pairs for labeling.

In [69]:
# sample candidate set
S = em.sample_table(C, 500)

# save the sample table
S.to_csv('../data/S.csv', index = False)

In [70]:
# Load the labeled data
G = em.read_csv_metadata('../data/S_labeled.csv', key='_id', 
                         ltable=A, rtable=B, fk_ltable='ltable_id', fk_rtable='rtable_id')

print("Table G with length %d" % len(G))
G.head()

Table G with length 500


Unnamed: 0,_id,ltable_id,rtable_id,ltable_name,ltable_address,ltable_city,ltable_zipcode,ltable_phone,rtable_name,rtable_address,rtable_city,rtable_zipcode,rtable_phone,labe
0,1,a1,b162,Hae Jang Chon Korean BBQ Restaurant,3821 W 6th St,Los Angeles,90020,2133899000.0,Hae Jang Chon Korean BBQ Restaurant,3821 W 6th St,Los Angeles,90020.0,2133899000.0,1
1,23,a2,b162,Kang Ho-dong Baekjeong,3465 W 6th St,Los Angeles,90020,2133850000.0,Hae Jang Chon Korean BBQ Restaurant,3821 W 6th St,Los Angeles,90020.0,2133899000.0,0
2,24,a2,b268,Kang Ho-dong Baekjeong,3465 W 6th St,Los Angeles,90020,2133850000.0,Kang Hodong Baekjeong,3465 W 6th St,Los Angeles,90020.0,2133850000.0,1
3,32,a2,b1987,Kang Ho-dong Baekjeong,3465 W 6th St,Los Angeles,90020,2133850000.0,KangHoDong Baek Jeong,3465 W 6th St Ste 20,Los Angeles,90020.0,2133850000.0,1
4,177,a37,b162,Hangari Bajirak Kalgooksoo,3470 W 6th St,Los Angeles,90020,2133882000.0,Hae Jang Chon Korean BBQ Restaurant,3821 W 6th St,Los Angeles,90020.0,2133899000.0,0


In [71]:
# Split G into I and J
IJ = em.split_train_test(G, train_proportion=0.5, random_state=0)
I = IJ['train']
J = IJ['test']
# Store I and J in csv format
em.to_csv_metadata(I, '../data/I.csv')
em.to_csv_metadata(J, '../data/J.csv')
print('length of I: %d' % len(I))
print('length of J: %d' % len(J))

length of I: 250
length of J: 250
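The idea behind a reproducible 50/50 split is to shuffle with a fixed seed and cut the list in two. The sketch below shows this in plain Python; `split_half` is a hypothetical helper, and the library's own shuffling will generally produce a different assignment of pairs to I and J.

```python
import random

def split_half(rows, train_proportion=0.5, seed=0):
    """Shuffle a copy of `rows` with a fixed seed and cut it in two."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_proportion)
    return shuffled[:cut], shuffled[cut:]

labeled_ids = list(range(500))  # stand-in for the 500 labeled pairs in G
train, test = split_half(labeled_ids)
print(len(train), len(test))
```

Fixing the seed means that rerunning the notebook yields the same development and test sets, so results stay comparable across runs.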


## Creating a set of learning-based matchers

We created the following matchers: (1) decision tree, (2) random forest, (3) naive Bayes, (4) SVM, (5) logistic regression, and (6) linear regression.

In [72]:
# Create a set of ML-matchers
dt = em.DTMatcher(name='DecisionTree', random_state=0)
svm = em.SVMMatcher(name='SVM', random_state=0)
rf = em.RFMatcher(name='RF', random_state=0)
lg = em.LogRegMatcher(name='LogReg', random_state=0)
ln = em.LinRegMatcher(name='LinReg')
nb = em.NBMatcher(name='NaiveBayes')

## Creating features

Next, we selected the attributes name, category_1, category_2, address, city, zipcode, and phone from the input tables, and generated features automatically with py_entitymatching's built-in function.

In [73]:
# Generate a set of features
F = em.get_features_for_matching(A.iloc[:, 1:8], B.iloc[:, 1:8], validate_inferred_attr_types=False)

# Show the generated features
print("Magellan generated the following %s features" % len(F))
F.feature_name

Magellan generated the following 48 features


0                     name_name_jac_qgm_3_qgm_3
1                 name_name_cos_dlm_dc0_dlm_dc0
2                 name_name_jac_dlm_dc0_dlm_dc0
3                                 name_name_mel
4                            name_name_lev_dist
5                             name_name_lev_sim
6                                 name_name_nmw
7                                  name_name_sw
8         category_1_category_1_jac_qgm_3_qgm_3
9     category_1_category_1_cos_dlm_dc0_dlm_dc0
10    category_1_category_1_jac_dlm_dc0_dlm_dc0
11                    category_1_category_1_mel
12               category_1_category_1_lev_dist
13                category_1_category_1_lev_sim
14                    category_1_category_1_nmw
15                     category_1_category_1_sw
16        category_2_category_2_jac_qgm_3_qgm_3
17    category_2_category_2_cos_dlm_dc0_dlm_dc0
18    category_2_category_2_jac_dlm_dc0_dlm_dc0
19                    category_2_category_2_mel
20               category_2_category_2_l
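The feature names encode the attribute pair followed by the similarity measure, for example `lev_dist` and `lev_sim` for Levenshtein distance and similarity, and `jac_qgm_3_qgm_3` for Jaccard similarity over 3-grams. The sketch below reimplements two of these measures in plain Python to show what they compute. It is an approximation for illustration: the function names are local to this sketch, and the 3-gram tokenization here uses no padding, so values may differ from the library's.

```python
def levenshtein(a, b):
    """Edit distance via the classic two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def lev_sim(a, b):
    """1 - distance / max length, so identical strings score 1.0."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest

def jaccard_3gram(a, b):
    """Jaccard similarity of the two strings' 3-gram sets (no padding)."""
    ga = {a[i:i + 3] for i in range(len(a) - 2)}
    gb = {b[i:i + 3] for i in range(len(b) - 2)}
    return len(ga & gb) / len(ga | gb) if ga | gb else 1.0

print(levenshtein("Road To Seoul", "Road to Seoul"))
print(lev_sim("Road To Seoul", "Road to Seoul"))
```

The normalization in `lev_sim` is consistent with the feature vectors shown later in this notebook, where a `name_name_lev_dist` of 18.0 is paired with a `name_name_lev_sim` of 0.217391, that is, 1 - 18/23.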

## Extracting feature vectors

We then extracted feature vectors for the development set (I) using the generated features.

In [74]:
# Convert the I into a set of feature vectors using F
H = em.extract_feature_vecs(I, 
                            feature_table=F, 
                            attrs_after='labe',
                            show_progress=False)
print('The table of feature vectors:')
H.head()

The table of feature vectors:


Unnamed: 0,_id,ltable_id,rtable_id,name_name_jac_qgm_3_qgm_3,name_name_cos_dlm_dc0_dlm_dc0,name_name_jac_dlm_dc0_dlm_dc0,name_name_mel,name_name_lev_dist,name_name_lev_sim,name_name_nmw,...,city_city_sw,zipcode_zipcode_exm,zipcode_zipcode_anm,zipcode_zipcode_lev_dist,zipcode_zipcode_lev_sim,phone_phone_exm,phone_phone_anm,phone_phone_lev_dist,phone_phone_lev_sim,labe
476,842418,a1853,b1255,0.125,0.353553,0.2,0.723671,18.0,0.217391,-9.0,...,1.0,,,,,,,,,0
162,36618,a1400,b2392,1.0,1.0,1.0,1.0,0.0,1.0,9.0,...,11.0,1.0,1.0,0.0,1.0,1.0,1.0,0.0,1.0,1
34,3380,a2092,b1024,0.157895,0.333333,0.2,0.642839,13.0,0.409091,5.0,...,11.0,1.0,1.0,0.0,1.0,0.0,0.999814,5.0,0.583333,0
44,5763,a53,b42,1.0,1.0,1.0,1.0,0.0,1.0,5.0,...,11.0,1.0,1.0,0.0,1.0,1.0,1.0,0.0,1.0,1
97,14717,a1050,b599,0.090909,0.235702,0.125,0.410256,22.0,0.153846,-4.0,...,11.0,1.0,1.0,0.0,1.0,0.0,1.0,5.0,0.583333,0


In [75]:
# Check if the feature vectors contain missing values
# (any() over a DataFrame iterates column names, so test the values instead)
pd.isnull(H).values.any()

True

We observed that the extracted feature vectors contain missing values. Thus, we have to impute the missing values so that the learning-based matchers can fit the model correctly. Here, we impute each missing value in a column with the mean of the values in that column.

In [76]:
# Impute feature vectors with the mean of the column values.
H = em.impute_table(H, 
                exclude_attrs=['_id', 'ltable_id', 'rtable_id', 'labe'],
                strategy='mean')
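Mean imputation on a single column can be sketched as follows. This is a plain-Python illustration with a hypothetical helper, not the code behind `em.impute_table`; the sample values are the first five `phone_phone_lev_sim` entries shown in `H.head()` above.

```python
import math

def impute_mean(column):
    """Replace NaN entries with the mean of the non-missing entries."""
    present = [v for v in column if not math.isnan(v)]
    if not present:
        return list(column)  # nothing to average; leave the column as is
    mean = sum(present) / len(present)
    return [mean if math.isnan(v) else v for v in column]

nan = float("nan")
phone_lev_sim = [nan, 1.0, 0.583333, 1.0, 0.583333]
imputed = impute_mean(phone_lev_sim)
print(imputed)
```

Only the missing entry changes; the observed values are left untouched, so the column mean is preserved.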

## Selecting the best matcher using cross-validation

Next, we compared the matchers using 5-fold cross-validation. The call below passes 'f1' as `metric_to_select_matcher`; we then compared the matchers on the average precision reported in `cv_stats`.

In [77]:
# Select the best ML matcher using CV
result = em.select_matcher([dt, rf, svm, ln, lg, nb], table=H, 
        exclude_attrs=['_id', 'ltable_id', 'rtable_id', 'labe'],
        k=5,
        target_attr='labe', metric_to_select_matcher='f1', random_state=0)
result['cv_stats']

Unnamed: 0,Matcher,Average precision,Average recall,Average f1
0,DecisionTree,0.946289,0.962761,0.953599
1,RF,0.969858,0.979259,0.974203
2,SVM,0.895076,0.942678,0.91544
3,LinReg,0.977,0.971852,0.974205
4,LogReg,0.960386,0.949428,0.953793
5,NaiveBayes,0.985185,0.951852,0.966864
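Which matcher counts as "best" depends on the column of `cv_stats` used for ranking. The short sketch below ranks the matchers by each metric, using the averages copied from the table above; the dictionary layout and the `best_by` helper are choices made for this illustration.

```python
# Cross-validation averages copied from the cv_stats table above
cv_stats = {
    "DecisionTree": {"precision": 0.946289, "recall": 0.962761, "f1": 0.953599},
    "RF":           {"precision": 0.969858, "recall": 0.979259, "f1": 0.974203},
    "SVM":          {"precision": 0.895076, "recall": 0.942678, "f1": 0.91544},
    "LinReg":       {"precision": 0.977,    "recall": 0.971852, "f1": 0.974205},
    "LogReg":       {"precision": 0.960386, "recall": 0.949428, "f1": 0.953793},
    "NaiveBayes":   {"precision": 0.985185, "recall": 0.951852, "f1": 0.966864},
}

def best_by(metric):
    """Name of the matcher with the highest average for `metric`."""
    return max(cv_stats, key=lambda name: cv_stats[name][metric])

for metric in ("precision", "recall", "f1"):
    print(metric, "->", best_by(metric))
```

Naive Bayes leads on precision, random forest on recall, and linear regression on F1, although its F1 margin over random forest is very small (0.974205 versus 0.974203).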


## Debug matcher (Naive Bayes)

In this step, we chose the matcher with the highest average precision, Naive Bayes, and tested an extra feature based on price and rating.

In [78]:
# Split H into P and Q
PQ = em.split_train_test(H, train_proportion=0.5, random_state=0)
P = PQ['train']
Q = PQ['test']

# Create a feature on the value of (price + rating), then compute Levenshtein similarity
sim = em.get_sim_funs_for_matching()
tok = em.get_tokenizers_for_matching()
feature_string = """lev_sim(wspace(float(ltuple['price']) + float(ltuple['rating'])), 
                            wspace(float(rtuple['price']) + float(rtuple['rating'])))"""
feature = em.get_feature_fn(feature_string, sim, tok)

# Add feature to F
em.add_feature(F, 'lev_ws_price+rating', feature)

True

In [79]:
# Convert P into feature vectors using the updated F, then impute missing values
Pf = em.extract_feature_vecs(P, feature_table=F, attrs_after='labe',show_progress=False)
Pf = em.impute_table(Pf, 
                exclude_attrs=['_id', 'ltable_id', 'rtable_id', 'labe'],
                strategy='mean')

# Train using feature vectors from P
matcher = nb
name = 'NaiveBayes'
matcher.fit(table=Pf, 
       exclude_attrs=['_id', 'ltable_id', 'rtable_id', 'labe'], 
       target_attr='labe')

# Convert Q into a set of feature vectors using F
Qf = em.extract_feature_vecs(Q, feature_table=F, attrs_after=['labe'])
Qf = em.impute_table(Qf, 
                exclude_attrs=['_id', 'ltable_id', 'rtable_id', 'labe'],
                strategy='mean')

# Predict
predictions = matcher.predict(table=Qf, exclude_attrs=['_id', 'ltable_id', 'rtable_id', 'labe'], 
              append=True, target_attr='predicted', inplace=False)

# Evaluate the predictions
eval_result = em.eval_matches(predictions, 'labe', 'predicted')
print(name)
em.print_eval_summary(eval_result)
eval_summary = em.eval_matches(predictions, 'labe', 'predicted')

0% [##############################] 100% | ETA: 00:00:00
Total time elapsed: 00:00:00


NaiveBayes
Precision : 100.0% (58/58)
Recall : 89.23% (58/65)
F1 : 94.31%
False positives : 0 (out of 58 positive predictions)
False negatives : 7 (out of 67 negative predictions)
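The percentages in the summary follow directly from the confusion-matrix counts. As a worked check, the sketch below recomputes them from the numbers printed above; `prf` is a helper defined only for this example.

```python
def prf(true_pos, false_pos, false_neg):
    """Precision, recall, and F1 from confusion-matrix counts."""
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Counts from the summary above: 58 correct matches, 0 false positives,
# 7 false negatives
precision, recall, f1 = prf(58, 0, 7)
print(f"Precision: {precision:.2%}  Recall: {recall:.2%}  F1: {f1:.2%}")
```

Because there are no false positives, precision is perfect, and the F1 score is pulled down only by the 7 matches that the matcher missed.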


## Selecting the best matcher again with the new feature set

We observed that the extra feature could improve the precision of the Naive Bayes matcher, which reached 100% precision in the debugging step above. We then tested the new feature set on all six matchers with cross-validation. Naive Bayes still has the highest average precision.

In [None]:
# Convert I into feature vectors using updated F
H = em.extract_feature_vecs(I, 
                            feature_table=F, 
                            attrs_after='labe',
                            show_progress=False)
H = em.impute_table(H, 
                exclude_attrs=['_id', 'ltable_id', 'rtable_id', 'labe'],
                strategy='mean')

# Select the best matcher again using CV
result = em.select_matcher([dt, rf, svm, ln, lg, nb], table=H, 
        exclude_attrs=['_id', 'ltable_id', 'rtable_id', 'labe'],
        k=5,
        target_attr='labe', metric_to_select_matcher='f1', random_state=0)
result['cv_stats']

Unnamed: 0,Matcher,Average precision,Average recall,Average f1
0,DecisionTree,0.960696,0.962761,0.961258
1,RF,0.976623,0.963502,0.969706
2,SVM,0.893651,0.933587,0.910848
3,LinReg,0.977,0.971852,0.974205
4,LogReg,0.960386,0.949428,0.953793
5,NaiveBayes,0.985185,0.951852,0.966864


## Test the matchers on testing set (J)

Next, we trained the six matchers on the new feature vectors generated from set I after the debugging step, and tested them on set J.
The following is the result of each matcher:


### Decision Tree

In [None]:
# decision tree
matcher = dt
name = 'DecisionTree'

matcher.fit(table=H, 
       exclude_attrs=['_id', 'ltable_id', 'rtable_id', 'labe'], 
       target_attr='labe')

Ht = em.extract_feature_vecs(J, 
                            feature_table=F, 
                            attrs_after='labe',
                            show_progress=False)
Ht = em.impute_table(Ht, 
                exclude_attrs=['_id', 'ltable_id', 'rtable_id', 'labe'],
                strategy='mean')

# Predict
predictions = matcher.predict(table=Ht, exclude_attrs=['_id', 'ltable_id', 'rtable_id', 'labe'], 
              append=True, target_attr='predicted', inplace=False)

# Evaluate the predictions
eval_result = em.eval_matches(predictions, 'labe', 'predicted')
print(name)
em.print_eval_summary(eval_result)
eval_summary = em.eval_matches(predictions, 'labe', 'predicted')

### SVM

In [None]:
# SVM
matcher = svm
name = 'SVM'

matcher.fit(table=H, 
       exclude_attrs=['_id', 'ltable_id', 'rtable_id', 'labe'], 
       target_attr='labe')

# Predict
predictions = matcher.predict(table=Ht, exclude_attrs=['_id', 'ltable_id', 'rtable_id', 'labe'], 
              append=True, target_attr='predicted', inplace=False)

# Evaluate the predictions
eval_result = em.eval_matches(predictions, 'labe', 'predicted')
print(name)
em.print_eval_summary(eval_result)
eval_summary = em.eval_matches(predictions, 'labe', 'predicted')

### Random Forest

In [None]:
# Random Forest
matcher = rf
name = 'RandomForest'

matcher.fit(table=H, 
       exclude_attrs=['_id', 'ltable_id', 'rtable_id', 'labe'], 
       target_attr='labe')

# Predict
predictions = matcher.predict(table=Ht, exclude_attrs=['_id', 'ltable_id', 'rtable_id', 'labe'], 
              append=True, target_attr='predicted', inplace=False)

# Evaluate the predictions
eval_result = em.eval_matches(predictions, 'labe', 'predicted')
print(name)
em.print_eval_summary(eval_result)
eval_summary = em.eval_matches(predictions, 'labe', 'predicted')

### Naive Bayes

In [None]:
matcher = nb
name = 'NaiveBayes'

matcher.fit(table=H, 
       exclude_attrs=['_id', 'ltable_id', 'rtable_id', 'labe'], 
       target_attr='labe')

# Predict
predictions = matcher.predict(table=Ht, exclude_attrs=['_id', 'ltable_id', 'rtable_id', 'labe'], 
              append=True, target_attr='predicted', inplace=False)

# Evaluate the predictions
eval_result = em.eval_matches(predictions, 'labe', 'predicted')
print(name)
em.print_eval_summary(eval_result)
eval_summary = em.eval_matches(predictions, 'labe', 'predicted')

### Logistic Regression

In [None]:
matcher = lg
name = 'LogReg'

matcher.fit(table=H, 
       exclude_attrs=['_id', 'ltable_id', 'rtable_id', 'labe'], 
       target_attr='labe')

# Predict
predictions = matcher.predict(table=Ht, exclude_attrs=['_id', 'ltable_id', 'rtable_id', 'labe'], 
              append=True, target_attr='predicted', inplace=False)

# Evaluate the predictions
eval_result = em.eval_matches(predictions, 'labe', 'predicted')
print(name)
em.print_eval_summary(eval_result)
eval_summary = em.eval_matches(predictions, 'labe', 'predicted')

### Linear Regression

In [None]:
matcher = ln
name = 'LinReg'

matcher.fit(table=H, 
       exclude_attrs=['_id', 'ltable_id', 'rtable_id', 'labe'], 
       target_attr='labe')

# Predict
predictions = matcher.predict(table=Ht, exclude_attrs=['_id', 'ltable_id', 'rtable_id', 'labe'], 
              append=True, target_attr='predicted', inplace=False)

# Evaluate the predictions
eval_result = em.eval_matches(predictions, 'labe', 'predicted')
print(name)
em.print_eval_summary(eval_result)
eval_summary = em.eval_matches(predictions, 'labe', 'predicted')

## Test Results

From the results above, we observed that <b>Naive Bayes</b> (precision 97.6%) and <b>Linear Regression</b> (precision 97.6%) have the best precision, while the <b>Linear Regression</b> matcher has the better recall and F1.