# Debug gold candidate selection
In which we determine why candidate selection is misfiring for the gold toponyms.

In [1]:
import data_helpers
from data_helpers import extract_topo_annotations, load_combined_geo_data, load_filtered_lexicon, organize_topo_annotations, get_edit_dist, load_municipality_data
import codecs
import pandas as pd

## Load data

In [2]:
topo_annotate_file = '../../data/facebook-maria/all_group_sample_statuses_annotated_topo.txt'
topo_annotation_lines = [l.strip() for l in codecs.open(topo_annotate_file, 'r', encoding='utf-8')]
topo_annotations = extract_topo_annotations(topo_annotation_lines, include_coords=True)

In [54]:
geo_data = load_combined_geo_data()
geo_data = geo_data[geo_data.loc[:, 'name'].apply(lambda x: type(x) is str and x != '')]
geo_data = geo_data[geo_data.loc[:, 'lat'].apply(lambda x: type(x) is float and not pd.np.isnan(x))]
# load the filtered lexicon
geo_lexicon = load_filtered_lexicon()
topo_data_df = organize_topo_annotations(topo_annotations)

## Get oracle results

In [12]:
topo_list = topo_data_df.loc[:, 'topo']
gold_labels = topo_data_df.loc[:, 'id'].values.tolist()
gold_locs = topo_data_df.loc[:, ['lat','lon']].values.tolist()

In [5]:
print(gold_labels)

[5184195284, 22196127, 5184329391, 5184329391, 5184329405, 4567576, 5184324841, 22142874, 4564408, 540517197, 540517203, 4304654246, 22255034, 22255043, 4566901, 5185023442, 22248310, 4568235, 371917346, 4568505, 5184366268]


First test: are any gold ID numbers out of vocabulary (OOV), i.e. missing from the combined geo data?

In [6]:
all_ids = geo_data.loc[:, 'id'].values.tolist()
for l in gold_labels:
    if(l not in all_ids):
        print('error with id %d'%(l))

OK - we find that nothing is out of vocabulary. So why is that error happening in our hard-coded tests?

Debugging - we've determined that the culprit is the municipality data.

In [7]:
municipality_id_lookup = load_municipality_data()
print(municipality_id_lookup.head())

        id   municipality
0  4049880        Bayamón
1  4049900  Trujillo Alto
2  4049922          Ceiba
3  4050022          Ponce
4  4050076          Ponce


In [10]:
for l in gold_labels:
    if(l not in municipality_id_lookup.loc[:, 'id'].values.tolist()):
        print('error with id %d'%(l))

That's been fixed. But why are the gold candidates getting ranked so low in terms of edit distance?

## Edit distance

Compute the edit distance between each mention and every toponym in the lexicon, sort by distance, and check where the gold label lands in the ranking.
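
The ranking check described above can be sketched without `data_helpers`. This is a minimal illustration, assuming a plain character-level Levenshtein distance (the actual `get_edit_dist` implementation is not shown in this notebook) and a hypothetical toy lexicon of `(id, name)` pairs; `levenshtein` and `gold_rank` are illustrative names, not functions from the project.

```python
def levenshtein(a, b):
    """Character-level edit distance between strings a and b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def gold_rank(mention, lexicon, gold_id):
    """Zero-based rank of gold_id after sorting lexicon entries by distance.

    lexicon is a list of (id, name) pairs; ties keep the lexicon order
    because Python's sort is stable.
    """
    scored = sorted(lexicon, key=lambda entry: levenshtein(mention, entry[1]))
    return [entry_id for entry_id, _ in scored].index(gold_id)


# Hypothetical toy lexicon, not the real gazetteer.
toy_lexicon = [(1, 'ingenio'), (2, 'quebrada culebra'),
               (3, 'culebrinas'), (4, 'calle 7')]
print(gold_rank('ingenio', toy_lexicon, 1))
print(gold_rank('culebra', toy_lexicon, 2))
```

In this toy case the exact match ranks first, while `quebrada culebra` ranks last for the mention `culebra` because the nine extra characters all count as edits. That is the same failure pattern as the partial-name mentions in the output below.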

In [39]:
import pandas as pd
geo_lexicon = geo_data.loc[:, 'name']
for i, r in topo_data_df.iterrows():
    topo_name = r.loc['topo']
    topo_id = r.loc['id']
    edit_dists = get_edit_dist(topo_name, geo_lexicon, word_level=False)
    edit_dist_df = pd.DataFrame([geo_data.loc[:, 'id'].values, edit_dists.index.tolist(), edit_dists.values]).transpose()
    edit_dist_df.columns = ['id', 'name', 'edit_dist']
    edit_dist_df.sort_values('edit_dist', inplace=True, ascending=True)
    gold_rank = pd.np.where(edit_dist_df.loc[:, 'id'] == topo_id)[0][0]
    gold_name = edit_dist_df[edit_dist_df.loc[:, 'id'] == topo_id].loc[:, 'name'].values[0]
    print('topo name %s: gold topo %s %d ranked %d'%(topo_name, gold_name, topo_id, gold_rank))

topo name ingenio: gold topo Ingenio 5184195284 ranked 1
topo name carretera 104: gold topo Carretera 104 22196127 ranked 0
topo name Tres Caminos: gold topo Tres Caminos 5184329391 ranked 0
topo name Tres Camino: gold topo Tres Caminos 5184329391 ranked 1
topo name la gallera: gold topo La Gallera 5184329405 ranked 0
topo name aeropuerto Rafael Hernández: gold topo Rafael Hernandez Airport 4567576 ranked 1797
topo name La Sabana: gold topo La Sabana 5184324841 ranked 0
topo name Calle Jade: gold topo Calle Jade 22142874 ranked 3
topo name escuela Don Manolo: gold topo Escuela Don Manolo 4564408 ranked 0
topo name Palmas Bajas: gold topo Quebrada Palmas Bajas 540517197 ranked 18173
topo name culebra: gold topo Quebrada Culebra 540517203 ranked 30330
topo name Puerto Hermina: gold topo Ruinas de Puerto Herminia 4304654246 ranked 7962
topo name Calle 6: gold topo Calle 6 22255034 ranked 498
topo name 7: gold topo Calle 7 22255043 ranked 1631
topo name Pozuelo: gold topo Punta Pozuelo 456

In [40]:
geo_lexicon = geo_data.loc[:, 'name_lower_no_diacritic']
for i, r in topo_data_df.iterrows():
    topo_name = r.loc['topo'].lower()
    topo_id = r.loc['id']
    edit_dists = get_edit_dist(topo_name, geo_lexicon, word_level=False)
    edit_dist_df = pd.DataFrame([geo_data.loc[:, 'id'].values, edit_dists.index.tolist(), edit_dists.values]).transpose()
    edit_dist_df.columns = ['id', 'name', 'edit_dist']
    edit_dist_df.sort_values('edit_dist', inplace=True, ascending=True)
    gold_rank = pd.np.where(edit_dist_df.loc[:, 'id'] == topo_id)[0][0]
    gold_name = edit_dist_df[edit_dist_df.loc[:, 'id'] == topo_id].loc[:, 'name'].values[0]
    print('topo name %s: gold topo %s %d ranked %d'%(topo_name, gold_name, topo_id, gold_rank))

topo name ingenio: gold topo ingenio 5184195284 ranked 0
topo name carretera 104: gold topo carretera 104 22196127 ranked 0
topo name tres caminos: gold topo tres caminos 5184329391 ranked 0
topo name tres camino: gold topo tres caminos 5184329391 ranked 1
topo name la gallera: gold topo la gallera 5184329405 ranked 0
topo name aeropuerto rafael hernández: gold topo rafael hernandez airport 4567576 ranked 6377
topo name la sabana: gold topo la sabana 5184324841 ranked 0
topo name calle jade: gold topo calle jade 22142874 ranked 15
topo name escuela don manolo: gold topo escuela don manolo 4564408 ranked 0
topo name palmas bajas: gold topo quebrada palmas bajas 540517197 ranked 19974
topo name culebra: gold topo quebrada culebra 540517203 ranked 30954
topo name puerto hermina: gold topo ruinas de puerto herminia 4304654246 ranked 7087
topo name calle 6: gold topo calle 6 22255034 ranked 608
topo name 7: gold topo calle 7 22255043 ranked 1403
topo name pozuelo: gold topo punta pozuelo 45

We've determined that the lowercased, accent-stripped names do slightly better in some cases (`calle 4`, `punatilla`) and worse in others (`calle 6`, `aeropuerto rafael hernández`).

What about normalizing edit distance for length?
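
As a sketch of what normalizing means here, assume we divide the raw distance $d(x, y)$ by the length of the shorter string, which is what the character-level branch of `norm_edit_dist` in the next cell does:

$$d_{norm}(x, y) = \frac{d(x, y)}{\min(|x|, |y|)}$$

For example, `ingenio` and `pinerio` are 3 character edits apart and both have 7 characters, so $d_{norm} = 3/7 \approx 0.429$. Because the denominator is the shorter length, the normalized distance can exceed 1: `culebra` (7 characters) against `quebrada culebra` (16 characters) needs 9 insertions, giving $d_{norm} = 9/7 \approx 1.286$.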

In [35]:
def norm_edit_dist(x, lexicon, word_level=False):
    """Edit distance divided by the length of the shorter string (in words or characters)."""
    edit_dists = get_edit_dist(x, lexicon, word_level=word_level)
    if(word_level):
        min_lens = [min(len(x.split(' ')), len(y.split(' '))) for y in lexicon]
    else:
        min_lens = [min(len(x), len(y)) for y in lexicon]
    # a list comprehension behaves the same in Python 2 and 3, unlike map()
    return edit_dists / pd.np.array(min_lens, dtype=float)

In [38]:
geo_lexicon = geo_data.loc[:, 'name_lower_no_diacritic']
for i, r in topo_data_df.iterrows():
    topo_name = r.loc['topo'].lower()
    topo_id = r.loc['id']
    edit_dists = norm_edit_dist(topo_name, geo_lexicon, word_level=False)
    edit_dist_df = pd.DataFrame([geo_data.loc[:, 'id'].values, edit_dists.index.tolist(), edit_dists.values]).transpose()
    edit_dist_df.columns = ['id', 'name', 'edit_dist']
    edit_dist_df.sort_values('edit_dist', inplace=True, ascending=True)
    gold_rank = pd.np.where(edit_dist_df.loc[:, 'id'] == topo_id)[0][0]
    gold_name = edit_dist_df[edit_dist_df.loc[:, 'id'] == topo_id].loc[:, 'name'].values[0]
    print('topo name %s: gold topo %s %d ranked %d'%(topo_name, gold_name, topo_id, gold_rank))

topo name ingenio: gold topo ingenio 5184195284 ranked 1
topo name carretera 104: gold topo carretera 104 22196127 ranked 0
topo name tres caminos: gold topo tres caminos 5184329391 ranked 1
topo name tres camino: gold topo tres caminos 5184329391 ranked 1
topo name la gallera: gold topo la gallera 5184329405 ranked 0
topo name aeropuerto rafael hernández: gold topo rafael hernandez airport 4567576 ranked 2719
topo name la sabana: gold topo la sabana 5184324841 ranked 0
topo name calle jade: gold topo calle jade 22142874 ranked 13
topo name escuela don manolo: gold topo escuela don manolo 4564408 ranked 0
topo name palmas bajas: gold topo quebrada palmas bajas 540517197 ranked 5237
topo name culebra: gold topo quebrada culebra 540517203 ranked 32777
topo name puerto hermina: gold topo ruinas de puerto herminia 4304654246 ranked 3027
topo name calle 6: gold topo calle 6 22255034 ranked 452
topo name 7: gold topo calle 7 22255043 ranked 1403
topo name pozuelo: gold topo punta pozuelo 456

Whoa! With a few exceptions (`calle 4`), using normalized edit distance helps a lot.

In [42]:
geo_lexicon = geo_data.loc[:, 'name_lower_no_diacritic']
for i, r in topo_data_df.iterrows():
    topo_name = r.loc['topo'].lower()
    topo_id = r.loc['id']
    edit_dists = norm_edit_dist(topo_name, geo_lexicon, word_level=True)
    edit_dist_df = pd.DataFrame([geo_data.loc[:, 'id'].values, edit_dists.index.tolist(), edit_dists.values]).transpose()
    edit_dist_df.columns = ['id', 'name', 'edit_dist']
    edit_dist_df.sort_values('edit_dist', inplace=True, ascending=True)
    gold_rank = pd.np.where(edit_dist_df.loc[:, 'id'] == topo_id)[0][0]
    gold_name = edit_dist_df[edit_dist_df.loc[:, 'id'] == topo_id].loc[:, 'name'].values[0]
    print('topo name %s: gold topo %s %d ranked %d'%(topo_name, gold_name, topo_id, gold_rank))

topo name ingenio: gold topo ingenio 5184195284 ranked 1
topo name carretera 104: gold topo carretera 104 22196127 ranked 0
topo name tres caminos: gold topo tres caminos 5184329391 ranked 1
topo name tres camino: gold topo tres caminos 5184329391 ranked 5
topo name la gallera: gold topo la gallera 5184329405 ranked 0
topo name aeropuerto rafael hernández: gold topo rafael hernandez airport 4567576 ranked 15739
topo name la sabana: gold topo la sabana 5184324841 ranked 0
topo name calle jade: gold topo calle jade 22142874 ranked 17
topo name escuela don manolo: gold topo escuela don manolo 4564408 ranked 0
topo name palmas bajas: gold topo quebrada palmas bajas 540517197 ranked 1
topo name culebra: gold topo quebrada culebra 540517203 ranked 2528
topo name puerto hermina: gold topo ruinas de puerto herminia 4304654246 ranked 54303
topo name calle 6: gold topo calle 6 22255034 ranked 346
topo name 7: gold topo calle 7 22255043 ranked 4125
topo name pozuelo: gold topo punta pozuelo 45669

In [41]:
geo_lexicon = geo_data.loc[:, 'name']
for i, r in topo_data_df.iterrows():
    topo_name = r.loc['topo'].lower()
    topo_id = r.loc['id']
    edit_dists = norm_edit_dist(topo_name, geo_lexicon, word_level=False)
    edit_dist_df = pd.DataFrame([geo_data.loc[:, 'id'].values, edit_dists.index.tolist(), edit_dists.values]).transpose()
    edit_dist_df.columns = ['id', 'name', 'edit_dist']
    edit_dist_df.sort_values('edit_dist', inplace=True, ascending=True)
    gold_rank = pd.np.where(edit_dist_df.loc[:, 'id'] == topo_id)[0][0]
    gold_name = edit_dist_df[edit_dist_df.loc[:, 'id'] == topo_id].loc[:, 'name'].values[0]
    print('topo name %s: gold topo %s %d ranked %d'%(topo_name, gold_name, topo_id, gold_rank))

topo name ingenio: gold topo Ingenio 5184195284 ranked 1
topo name carretera 104: gold topo Carretera 104 22196127 ranked 0
topo name tres caminos: gold topo Tres Caminos 5184329391 ranked 0
topo name tres camino: gold topo Tres Caminos 5184329391 ranked 1
topo name la gallera: gold topo La Gallera 5184329405 ranked 0
topo name aeropuerto rafael hernández: gold topo Rafael Hernandez Airport 4567576 ranked 5525
topo name la sabana: gold topo La Sabana 5184324841 ranked 0
topo name calle jade: gold topo Calle Jade 22142874 ranked 10
topo name escuela don manolo: gold topo Escuela Don Manolo 4564408 ranked 0
topo name palmas bajas: gold topo Quebrada Palmas Bajas 540517197 ranked 14304
topo name culebra: gold topo Quebrada Culebra 540517203 ranked 31074
topo name puerto hermina: gold topo Ruinas de Puerto Herminia 4304654246 ranked 15275
topo name calle 6: gold topo Calle 6 22255034 ranked 523
topo name 7: gold topo Calle 7 22255043 ranked 1631
topo name pozuelo: gold topo Punta Pozuelo 4

Let's use the following settings:

- lower+stripped name
- character-level edit distance

In [55]:
import data_helpers
reload(data_helpers)
from data_helpers import geo_lookup
test_mention = 'ingenio'
test_id = 5184195284
top_k = 100
geo_lexicon = geo_data.loc[:, 'name_lower_no_diacritic']
test_candidates = geo_lookup(test_mention, geo_data, geo_lexicon, word_char=False, k=top_k)
print(test_candidates.head())
test_rank = pd.np.where(test_candidates.loc[:, 'id'] == test_id)[0][0]
print('gold topo has rank %d'%(test_rank))

        name          id        lat        lon      dist
7    ingenio     4565551  18.442170 -66.226000         0
8    ingenio  5184195284  18.083412 -65.872931         0
159  pinerio  4638639190  18.069854 -67.172194  0.428571
56   ansonia   351408612  18.451985 -66.062295  0.571429
54    pinero   302349873  18.410700 -66.055386  0.666667
gold topo has rank 1


OK great. That is now working less poorly than before.

Let's use this setup in the `geo_lookup` function.

## Sieve function
Instead of building a one-stop shop for string similarity, let's build a sieve to catch possible variants:

1. Get exact matches.
2. Try common substitutions ("bo" => "barrio", "aeropuerto" => "airport").
3. Compute edit distance and keep all candidates within a cutoff distance.

The results could be messy! We might get 100+ instances of `Calle 4` even if we want to cap at $k=100$. But it makes the most sense for this application.

In [134]:
from data_helpers import get_norm_edit_dist
import re
SUB_PAIRS = [('aeropuerto', 'airport'), 
             ('bo.', 'barrio'),
             ('c/', 'calle'), 
             ('carr', 'carretera')]
# get uppercase versions - only if we're not using lowercased names
# upper_first = lambda x: x[0].upper() + x[1:]
# SUB_PAIRS += map(lambda x: (upper_first(x[0]), upper_first(x[1])), SUB_PAIRS)
def get_all_subs(x, sub_pairs):
    x_subs = set()
    for s1, s2 in sub_pairs:
        x_sub = ' '.join(map(lambda y: y.replace(s1, s2), x.split(' ')))
        if(x_sub != x):
            x_subs.add(x_sub)
    return list(x_subs)
def sieve(topo_mention, geo_data, name_col='name', edit_dist_cutoff=1., verbose=False):
    """
    Run sieve for toponym candidate generation.
    
    Parameters:
    -----------
    topo_mention : str
    geo_data : pandas.DataFrame
    name_col : str
    edit_dist_cutoff : float
    Maximum allowed (normalized) edit distance for extra candidates.
    verbose : bool
    
    Returns:
    --------
    candidates : list
    """
    candidates = []
    # exact matches
    matches = geo_data[geo_data.loc[:, name_col] == topo_mention].loc[:, 'id'].values.tolist()
    candidates += matches
    # common substitutes
    topo_mention_subs = get_all_subs(topo_mention, SUB_PAIRS)
    topo_mention_sub_ids = geo_data[geo_data.loc[:, name_col].isin(topo_mention_subs)].loc[:, 'id'].values.tolist()
    candidates += topo_mention_sub_ids
    if(len(candidates) == 0):
        valid_edit_dist_candidates = geo_data[~ geo_data.loc[:, 'id'].isin(candidates)].loc[:, name_col].unique()
        edit_dists = get_norm_edit_dist(topo_mention, valid_edit_dist_candidates).sort_values(inplace=False, ascending=True)
        edit_dist_df = pd.DataFrame([edit_dists.index, edit_dists.values], index=['name', 'dist']).transpose()
        edit_dist_df = pd.merge(edit_dist_df, geo_data.loc[:, [name_col, 'id']], left_on='name', right_on=name_col)
        if(verbose):
            print('preliminary edit dist:\n%s'%(edit_dist_df.head(20)))
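        # keep candidates within the cutoff; use a new name so the cutoff parameter is not overwritten
        within_cutoff = edit_dist_df[edit_dist_df.loc[:, 'dist'] <= edit_dist_cutoff]
        # sort into a new frame rather than in place to avoid pandas' SettingWithCopyWarning
        within_cutoff = within_cutoff.sort_values(['dist', 'name'], ascending=True)
        edit_candidates = within_cutoff.loc[:, 'id'].values.tolist()
        candidates += edit_candidates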
    return candidates
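
One caveat with `str.replace` in `get_all_subs`: it also rewrites substrings, so `carretera 104` becomes `carreteraetera 104` under the `('carr', 'carretera')` rule. A possible alternative, sketched here under the assumption that abbreviations always appear as standalone tokens, is to expand whole tokens only. `get_token_subs` is a hypothetical name, not part of `data_helpers`.

```python
SUB_PAIRS = [('aeropuerto', 'airport'),
             ('bo.', 'barrio'),
             ('c/', 'calle'),
             ('carr', 'carretera')]


def get_token_subs(mention, sub_pairs):
    """Return variants of mention in which a whole token equal to an
    abbreviation is replaced by its expansion (assumed behavior)."""
    tokens = mention.split(' ')
    variants = set()
    for short, full in sub_pairs:
        swapped = ' '.join(full if tok == short else tok for tok in tokens)
        if swapped != mention:
            variants.add(swapped)
    return sorted(variants)


print(get_token_subs('carr 104', SUB_PAIRS))       # ['carretera 104']
print(get_token_subs('carretera 104', SUB_PAIRS))  # []
```

The trade-off is that fused forms such as `c/4` are no longer expanded, so this only helps if mentions are tokenized on spaces first.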

In [103]:
gold_name = 'la puntilla'
topo_mention = 'punatilla'
topo_id = 5184366268
sieve_candidates = pd.np.array(sieve(topo_mention, geo_data, name_col='name_lower_no_diacritic', edit_dist_cutoff=1.))
print(len(sieve_candidates))
topo_rank = pd.np.where(sieve_candidates == topo_id)[0][0]
print('gold topo at rank %s'%(topo_rank))
print('edit dist %.3f'%(get_norm_edit_dist(gold_name, [topo_mention])[0]))

7179
gold topo at rank 13
edit dist 0.444


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy


OK! The sieve works decently well. Let's tune the cutoff $\alpha$ with a score that rewards a high reciprocal rank for the gold label and penalizes irrelevant information (large candidate set size), summed over the candidate sets $C_{t}$ for each toponym $t$.

$$\mathrm{score}(\alpha, C) = \sum_{t \in T} \mathrm{MRR}(t, C_{t}) \cdot \frac{1}{|C_{t}|}$$
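
To make the score concrete, here is a small sketch that reads $MRR(t, C_{t})$ as the reciprocal rank of the gold ID within one candidate list, and treats empty candidate sets as contributing 0 (an assumption, since the formula is undefined there). The function names and the toy IDs are hypothetical.

```python
def reciprocal_rank(gold_id, candidates):
    """1 / (1-based rank of gold_id), or 0.0 if gold_id is absent."""
    if gold_id not in candidates:
        return 0.0
    return 1.0 / (candidates.index(gold_id) + 1)


def sieve_score(gold_ids, candidate_lists):
    """Sum over toponyms of reciprocal rank times 1 / |C_t|."""
    total = 0.0
    for gold_id, candidates in zip(gold_ids, candidate_lists):
        if candidates:  # skip empty sets to avoid dividing by zero
            total += reciprocal_rank(gold_id, candidates) / len(candidates)
    return total


# Hypothetical candidate lists for three toponyms.
gold_ids = [10, 20, 30]
candidate_lists = [[10], [21, 20, 22, 23], [31, 32]]
print(sieve_score(gold_ids, candidate_lists))  # 1.0 + 0.5 * 0.25 + 0.0 = 1.125
```

A single exact hit scores 1, a gold label in second place among four candidates scores 0.125, and a miss scores 0, so the score favors short candidate lists with the gold label near the top.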

In [108]:
from __future__ import division
def mean_reciprocal_rank(topo_id, topo_candidates):
    # assign MRR=0 to OOV topo ID
    if(topo_id not in topo_candidates):
        MRR = 0.
    else:
        topo_rank = pd.np.where(pd.np.array(topo_candidates) == topo_id)[0][0]+1
        MRR = 1 / topo_rank
    return MRR

In [138]:
from unidecode import unidecode
cutoff_vals = pd.np.arange(0.1, 2.1, 0.1)
MRR_sums = {}
candidate_size_sums = {}
for edit_dist_cutoff in cutoff_vals:
    MRR_sum = 0.
    candidate_size_sum = 0.
    MRR_vals = []
    candidate_sizes = []
    zero_sum = 0.
    for i, r in topo_data_df.iterrows():
        topo_mention = r.loc['topo']
        topo_mention_clean = unidecode(topo_mention.lower())
        topo_id = r.loc['id']
        sieve_candidates = pd.np.array(sieve(topo_mention_clean, geo_data, name_col='name_lower_no_diacritic', edit_dist_cutoff=edit_dist_cutoff))
        MRR = mean_reciprocal_rank(topo_id, sieve_candidates)
        if(MRR == 0.):
            zero_sum += 1
        else:
            MRR_vals.append(MRR)
        candidate_size = len(sieve_candidates)
        MRR_sum += MRR
        if(candidate_size > 0):
            candidate_size_sum += candidate_size ** -1
            candidate_sizes.append(candidate_size)
    MRR_sums[edit_dist_cutoff] = MRR_sum
    candidate_size_sums[edit_dist_cutoff] = candidate_size_sum
    MRR_mean = pd.np.mean(MRR_vals)
    candidate_size_mean = pd.np.mean(candidate_sizes)
    print('edit dist cutoff %.3f got mean MRR=%.3f, mean candidate set size = %.3f, zeros = %d'%
          (edit_dist_cutoff, MRR_mean, candidate_size_mean, zero_sum))

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy


edit dist cutoff 0.100 got mean MRR=0.622, mean candidate set size = 98.438, zeros = 8
edit dist cutoff 0.200 got mean MRR=0.622, mean candidate set size = 98.438, zeros = 8
edit dist cutoff 0.300 got mean MRR=0.622, mean candidate set size = 93.000, zeros = 8
edit dist cutoff 0.400 got mean MRR=0.622, mean candidate set size = 84.105, zeros = 8
edit dist cutoff 0.500 got mean MRR=0.583, mean candidate set size = 85.048, zeros = 7
edit dist cutoff 0.600 got mean MRR=0.583, mean candidate set size = 113.429, zeros = 7
edit dist cutoff 0.700 got mean MRR=0.583, mean candidate set size = 241.333, zeros = 7
edit dist cutoff 0.800 got mean MRR=0.510, mean candidate set size = 814.714, zeros = 5
edit dist cutoff 0.900 got mean MRR=0.480, mean candidate set size = 2591.619, zeros = 4
edit dist cutoff 1.000 got mean MRR=0.480, mean candidate set size = 6021.571, zeros = 4
edit dist cutoff 1.100 got mean MRR=0.480, mean candidate set size = 6954.238, zeros = 4
edit dist cutoff 1.200 got mean MR

Where are all these zeros coming from?

In [136]:
edit_dist_cutoff = 0.3
for i, r in topo_data_df.iterrows():
    topo_mention = r.loc['topo']
    topo_mention_clean = unidecode(topo_mention.lower())
    topo_id = r.loc['id']
    sieve_candidates = pd.np.array(sieve(topo_mention_clean, geo_data, name_col='name_lower_no_diacritic', edit_dist_cutoff=edit_dist_cutoff, verbose=True))
    MRR = mean_reciprocal_rank(topo_id, sieve_candidates)
    if(MRR == 0.):
        print('topo mention %s not represented in %d candidates'%(topo_mention, len(sieve_candidates)))

topo mention Tres Camino not represented in 1 candidates
preliminary edit dist:
                                 name      dist  \
0            escuela rafael hernandez  0.291667   
1            escuela rafael hernandez  0.291667   
2            escuela rafael hernandez  0.291667   
3            escuela rafael hernandez  0.291667   
4            escuela rafael hernandez  0.291667   
5            escuela rafael hernandez  0.291667   
6   escuela superior rafael hernandez  0.407407   
7              calle rafael hernandez  0.409091   
8              calle rafael hernandez  0.409091   
9              calle rafael hernandez  0.409091   
10             calle rafael hernandez  0.409091   
11             calle rafael hernandez  0.409091   
12             calle rafael hernandez  0.409091   
13             calle rafael hernandez  0.409091   
14             calle rafael hernandez  0.409091   
15             calle rafael hernandez  0.409091   
16             calle rafael hernandez  0.409091   
17

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy


preliminary edit dist:
              name      dist name_lower_no_diacritic          id
0    palmas barrio  0.333333           palmas barrio     7268722
1    palmas barrio  0.333333           palmas barrio     7268723
2    palmas barrio  0.333333           palmas barrio     7268724
3    palmas barrio  0.333333           palmas barrio     7268725
4    palmas reales  0.333333           palmas reales  3799505935
5      palma sabal  0.363636             palma sabal   528231851
6      palma sabal  0.363636             palma sabal   528231852
7    palmar barrio  0.416667           palmar barrio     7268716
8   palomas barrio  0.416667          palomas barrio     7268729
9     palmas drive  0.416667            palmas drive    22172738
10    palmas drive  0.416667            palmas drive   373721836
11     calle lajas  0.454545             calle lajas    22129578
12     calle lajas  0.454545             calle lajas    22136506
13     calle lajas  0.454545             calle lajas   333892438
14

Seems like the mean MRR stays at its maximum (0.622) for cutoffs up to 0.4 while the mean candidate set size stays under 100, so let's use 0.3 as our cutoff.

Here's a better way to handle the edit distance step: start with an initial cutoff ($k=0.3$) and keep expanding it until we have a decent number of candidates ($c=50$).
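
A minimal sketch of that expansion loop, assuming the normalized distances have already been computed and are passed in as `(id, distance)` pairs. The step size of 0.1, the upper bound of 2.0, and the name `expand_cutoff` are illustrative assumptions, not settings taken from this notebook.

```python
def expand_cutoff(scored_candidates, start=0.3, step=0.1,
                  min_candidates=50, max_cutoff=2.0):
    """Grow the cutoff until enough candidates pass or max_cutoff is reached.

    scored_candidates: list of (id, normalized_distance) pairs.
    Returns (final_cutoff, ids sorted by distance, ties broken by id).
    """
    cutoff = start
    while True:
        kept = sorted((dist, cand_id)
                      for cand_id, dist in scored_candidates
                      if dist <= cutoff)
        if len(kept) >= min_candidates or cutoff >= max_cutoff:
            return cutoff, [cand_id for _, cand_id in kept]
        # round to keep repeated additions from drifting in floating point
        cutoff = round(cutoff + step, 10)


# Hypothetical distances for four candidates.
scored = [(1, 0.25), (2, 0.45), (3, 0.5), (4, 0.9)]
print(expand_cutoff(scored, min_candidates=3))  # (0.5, [1, 2, 3])
```

The upper bound matters: without it, a mention with few plausible matches would keep widening the cutoff until the whole lexicon is returned.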

In [None]:
def sieve_expand():
    raise NotImplementedError

In [None]:
# same test but with precision/recall
for edit_dist_cutoff in cutoff_vals:
    # reset the accumulators for each cutoff so totals do not carry over
    MRR_sum = 0.
    candidate_size_sum = 0.
    zero_sum = 0.
    for i, r in topo_data_df.iterrows():
        topo_mention = r.loc['topo']
        # normalize the mention to match the lowercased, accent-stripped name column
        topo_mention_clean = unidecode(topo_mention.lower())
        topo_id = r.loc['id']
        sieve_candidates = pd.np.array(sieve(topo_mention_clean, geo_data, name_col='name_lower_no_diacritic', edit_dist_cutoff=edit_dist_cutoff))
        MRR = mean_reciprocal_rank(topo_id, sieve_candidates)
        if(MRR == 0.):
            zero_sum += 1
        candidate_size = len(sieve_candidates)
        MRR_sum += MRR
        if(candidate_size > 0):
            candidate_size_sum += candidate_size ** -1
    MRR_sums[edit_dist_cutoff] = MRR_sum
    candidate_size_sums[edit_dist_cutoff] = candidate_size_sum
    print('edit dist cutoff %.3f got MRR=%.3f, candidate size = %.3f, zeros = %d'%
          (edit_dist_cutoff, MRR_sum, candidate_size_sum, zero_sum))

TODO: break candidate ranking ties based on geographic distance to group centroid.
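
One way this tie-breaking could look, as a sketch only: sort by edit distance first and by great-circle (haversine) distance to the centroid second. The centroid coordinates below are made up for illustration; the two candidates are the tied `ingenio` rows from the `geo_lookup` test above. `haversine_km` and `rank_with_geo_ties` are hypothetical names.

```python
import math


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two points given in degrees."""
    radius_km = 6371.0  # mean Earth radius
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * radius_km * math.asin(math.sqrt(a))


def rank_with_geo_ties(candidates, centroid):
    """Sort by edit distance, breaking ties by distance to the centroid.

    candidates: list of dicts with 'id', 'dist', 'lat', 'lon'.
    centroid: (lat, lon) tuple.
    """
    return sorted(
        candidates,
        key=lambda c: (c['dist'],
                       haversine_km(c['lat'], c['lon'], centroid[0], centroid[1])))


candidates = [
    {'id': 4565551, 'dist': 0.0, 'lat': 18.442170, 'lon': -66.226000},
    {'id': 5184195284, 'dist': 0.0, 'lat': 18.083412, 'lon': -65.872931},
]
group_centroid = (18.05, -65.90)  # hypothetical centroid, for illustration only
ranked = rank_with_geo_ties(candidates, group_centroid)
print([c['id'] for c in ranked])
```

With this made-up centroid the second `ingenio` entry, which is the gold ID in the test above, moves to the top of the tie.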