1989 guessing of namedplaces #2809

Merged
merged 14 commits into from
Mar 30, 2015
43 changes: 43 additions & 0 deletions services/importer/lib/importer/content_guesser.rb
@@ -41,6 +41,14 @@ def country_column
nil
end

def namedplace_column
  return nil if not enabled?
  columns.each do |column|
    return column[:column_name] if is_namedplace_column? column
  end
  nil
end

def ip_column
return nil if not enabled?
columns.each do |column|
@@ -73,10 +81,30 @@ def is_country_column?(column)
end
end

def is_namedplace_column?(column)
  return false unless is_text_type? column
  entropy = metric_entropy(column, country_name_normalizer) # TODO: optimize
  if entropy < minimum_entropy
    false
  else
    proportion = namedplace_proportion(column)
    if proportion < threshold
      false
    else
      log_namedplace_guessing_match_metrics(proportion)
      true
    end
  end
end
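
A minimal standalone sketch of the two-stage gate in `is_namedplace_column?`: a cheap entropy check first, then the more expensive match proportion. The definition of `metric_entropy` is not shown in this diff, so the Shannon-entropy-over-log(sample size) formula below is an assumption, as are the helper names `normalized_entropy` and `looks_like_namedplace?` and the default values for `min_entropy` and `threshold`.

```ruby
# Assumed definition: Shannon entropy of the value distribution divided by
# log(sample size), so 0.0 means "all values equal" and 1.0 "all distinct".
def normalized_entropy(values)
  return 0.0 if values.size < 2
  counts = Hash.new(0)
  values.each { |v| counts[v] += 1 }
  total = values.size.to_f
  shannon = counts.values.reduce(0.0) do |sum, count|
    prob = count / total
    sum - prob * Math.log(prob)
  end
  shannon / Math.log(values.size)
end

# Hypothetical gate mirroring the control flow of the diff: reject
# low-entropy columns before looking at the geocoding match proportion.
# `matches` stands in for the count returned by the geocoder.
# The default thresholds are illustrative, not the importer's values.
def looks_like_namedplace?(values, matches, min_entropy: 0.9, threshold: 0.8)
  return false if values.empty?
  return false if normalized_entropy(values) < min_entropy
  (matches.to_f / values.size) >= threshold
end
```

Checking entropy first means a column that repeats a handful of values is rejected without ever reaching the geocoder query.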

def log_country_guessing_match_metrics(proportion)
@importer_stats.gauge('country_proportion', proportion)
end

def log_namedplace_guessing_match_metrics(proportion)
@importer_stats.gauge('namedplace_proportion', proportion)
end

def log_ip_guessing_match_metrics(proportion)
@importer_stats.gauge('ip_proportion', proportion)
end
@@ -137,6 +165,21 @@ def country_proportion(column)
country_proportion
end

def namedplace_proportion(column)
  column_name_sym = column[:column_name].to_sym
  matches = count_namedplaces(sample, column_name_sym)
  country_proportion = matches.to_f / sample.count
  log "namedplace_proportion(#{column[:column_name]}) = #{country_proportion}"
  country_proportion
end

def count_namedplaces(sample, column_name_sym)
  sql_array = sample.map{|row| "'" + row[column_name_sym] + "'"}.join(',')
  query = "WITH geo_function as (SELECT (geocode_namedplace(Array[#{sql_array}])).*) select count(success) FROM geo_function where success = TRUE"
Contributor Author

Replying to @javisantana: this is the critical part. It will be called through the SQL API once per column, with an array of up to sample size.

Given the nature of the sampling and of the dataset itself, I don't expect a good cache hit rate for these queries, nor for the geocodings on the full datasets.

If you're OK with it, then so am I, and I'm more than happy to release it :)

Contributor

OK, so why don't we add a metric there? Also, if you feel the call is going to be really expensive, we have the option of a different geocode_namedplace that takes not just a single array but one array per column. We're doing this only for text columns, right?

Contributor Author

+1 to adding a metric and checking whether performance is a problem.

Yes, we're only querying text columns.

About sending a single query with all the text columns of the sample: do you think it would be more efficient in the general case?

Contributor

It would be more efficient because you only need to open one connection (at the very least).
Let's add that metric and see how it looks.

  ret = geocoder_sql_api.fetch(query)
  ret.first['count']
end
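
One thing worth noting about `count_namedplaces`: the sample values are wrapped in single quotes as-is, so a value such as `L'Aquila` would break the query, and a `nil` cell would raise on string concatenation. A minimal sketch of a safer way to build the same query follows. The helper names `quote_sql_literal` and `namedplace_count_query` are hypothetical, and the quoting assumes PostgreSQL with `standard_conforming_strings` on, where doubling the quote is sufficient.

```ruby
# Hypothetical helper: quote a value as a SQL string literal by doubling
# embedded single quotes. Assumes standard_conforming_strings = on, so
# backslashes need no special treatment.
def quote_sql_literal(value)
  "'" + value.to_s.gsub("'", "''") + "'"
end

# Builds the same counting query as the diff, but skips nil cells and
# escapes each value. An empty list would still produce `Array[]`, which
# the caller would need to handle separately.
def namedplace_count_query(values)
  literals = values.compact.map { |v| quote_sql_literal(v) }.join(',')
  "WITH geo_function AS (SELECT (geocode_namedplace(Array[#{literals}])).*) " \
    "SELECT count(success) FROM geo_function WHERE success = TRUE"
end
```

For example, `namedplace_count_query(["Paris", nil, "L'Aquila"])` passes `Array['Paris','L''Aquila']` to `geocode_namedplace`.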

def log(msg)
@job.log msg if @job
end
31 changes: 31 additions & 0 deletions services/importer/lib/importer/georeferencer.rb
@@ -47,6 +47,7 @@ def run
create_the_geom_from_latlon ||
create_the_geom_from_ip_guessing ||
create_the_geom_from_country_guessing ||
create_the_geom_from_namedplaces_guessing ||
create_the_geom_in(table_name)

enable_autovacuum
@@ -151,6 +152,31 @@ def create_the_geom_from_country_guessing
return false
end

def create_the_geom_from_namedplaces_guessing
  return false if not @content_guesser.enabled?
  job.log 'Trying namedplaces guessing...'
  begin
    namedplace_column_name = nil
    @importer_stats.timing('guessing') do
      @tracker.call('guessing')
      namedplace_column_name = @content_guesser.namedplace_column
      @tracker.call('importing')
    end
    if namedplace_column_name
      job.log "Found namedplace column: #{namedplace_column_name}"
      create_the_geom_in table_name
      return geocode_namedplaces namedplace_column_name
    end
  rescue Exception => ex
    message = "create_the_geom_from_namedplaces_guessing failed: #{ex.message}"
    Rollbar.report_message(message,
                           'warning',
                           {user_id: @job.logger.user_id, backtrace: ex.backtrace})
    job.log "WARNING: #{message}"
  end
  return false
end
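
The `||` chain in `run` depends on each `create_the_geom_from_*` method returning `false` both when nothing matches and when guessing raises, so the next strategy still gets a chance. A minimal sketch of that pattern, with hypothetical names (`try_strategy`, `georeference`) and stand-in blocks in place of the real guessers. It rescues `StandardError` rather than `Exception` as the diff does, so that signals and `exit` are not swallowed; that substitution is a deliberate difference, not the merged behavior.

```ruby
# Hypothetical helper: runs one strategy block, logs the outcome, and
# converts any StandardError into a plain `false` so the chain continues.
def try_strategy(name, log)
  result = yield
  matched = result ? true : false
  log << "#{name}: #{matched ? 'matched' : 'no match'}"
  matched
rescue StandardError => ex
  log << "WARNING: #{name} failed: #{ex.message}"
  false
end

# Strategies are tried in order; `||` stops at the first one returning true.
# The blocks below are stand-ins for the real guessing methods.
def georeference(log)
  try_strategy('country guessing', log) { false } ||
    try_strategy('namedplaces guessing', log) { raise 'SQL API timeout' } ||
    try_strategy('plain the_geom column', log) { true }
end
```

Calling `georeference([])` with these stand-ins falls through the first two strategies and succeeds on the last one.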

def create_the_geom_from_ip_guessing
return false if not @content_guesser.enabled?
job.log 'Trying ip guessing...'
@@ -180,6 +206,11 @@ def geocode_countries country_column_name
geocode(country_column_name, 'polygon', 'admin0')
end

def geocode_namedplaces namedplace_column_name
  job.log "Geocoding namedplaces..."
  geocode(namedplace_column_name, 'point', 'namedplace')
end

def geocode_ips ip_column_name
job.log "Geocoding ips..."
geocode(ip_column_name, 'point', 'ipaddress')
@@ -27,7 +27,7 @@ def copy_results_to_table_query

def country
country = @internal_geocoder.countries
- country == %Q{'world'} ? 'null' : country
+ (country == %Q{'world'} || country.blank?) ? 'null' : country
end

def dest_table