1989 guessing of namedplaces #2809
Conversation
Frontend tests were OK 👍 (details)
Why "this does not seem to scale well"? What do you mean?
```ruby
def count_namedplaces(sample, column_name_sym)
  sql_array = sample.map{|row| "'" + row[column_name_sym] + "'"}.join(',')
  query = "WITH geo_function as (SELECT (geocode_namedplace(Array[#{sql_array}])).*) select count(success) FROM geo_function where success = TRUE"
```
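As a side note, here is a minimal runnable sketch of the query builder above, with one addition that is not in the PR: single quotes in the sampled values are doubled, so a value like L'Aquila doesn't break the generated statement (the helper name is hypothetical):

```ruby
# Sketch: build the counting query for one sampled column.
# `sample` is an array of row hashes, `column_name_sym` the column key.
# Doubling single quotes is an addition here; the PR code interpolates
# the raw values directly.
def build_namedplace_count_query(sample, column_name_sym)
  sql_array = sample.map { |row|
    "'" + row[column_name_sym].to_s.gsub("'", "''") + "'"
  }.join(',')
  "WITH geo_function AS (SELECT (geocode_namedplace(Array[#{sql_array}])).*) " \
  "SELECT count(success) FROM geo_function WHERE success = TRUE"
end

puts build_namedplace_count_query([{ city: "London" }, { city: "L'Aquila" }], :city)
```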
Replying to @javisantana: this is the critical part. This is going to be called through the SQL API once per column, with an array of up to sample size.
Due to the nature of the sampling and the dataset itself, I don't expect a good cache hit rate for these queries, nor for the geocodings on the full datasets.
If you're OK with it, then so am I, and I'd be more than happy to release it :)
OK, so why don't we add a metric there? Also, if you feel the call is going to be really expensive, we have the option of a different geocode_namedplace
that takes not just a single array but an array per column. We're doing this only for text columns, right?
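The batched variant suggested here could look like this sketch, assuming a hypothetical geocode_namedplace overload that accepts one array per column, so all text columns go through a single query and connection:

```ruby
# Sketch (assumed API): build one query covering several text columns,
# passing a separate Array literal per column to a hypothetical
# geocode_namedplace variant that accepts multiple arrays.
def build_multi_column_query(sample, column_names)
  arrays = column_names.map do |col|
    values = sample.map { |row| "'" + row[col].to_s.gsub("'", "''") + "'" }.join(',')
    "Array[#{values}]"
  end
  "WITH geo_function AS (SELECT (geocode_namedplace(#{arrays.join(', ')})).*) " \
  "SELECT count(success) FROM geo_function WHERE success = TRUE"
end

puts build_multi_column_query([{ city: "Berlin", region: "Texas" }], [:city, :region])
```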
+1 to adding a metric and checking whether performance is actually a problem.
Yes, we're only querying text columns.
About sending one single query with all the sample's text columns: do you think it would be more efficient in the general case?
It would be more efficient because you only need to open one connection (at least that).
Let's add that metric and see how it looks.
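One way the metric could look (a sketch only, not the PR's implementation; the metric name and `puts` sink are placeholders for whatever instrumentation CartoDB actually uses): time the guessing query and log the elapsed milliseconds so real-world cost can be evaluated before optimizing.

```ruby
# Sketch: time the guessing query so its real-world cost can be measured.
# The log line format is a placeholder, not CartoDB's actual metrics API.
def with_guessing_metric(column_name)
  started = Time.now
  result = yield
  elapsed_ms = ((Time.now - started) * 1000).round
  puts "namedplace_guessing column=#{column_name} time_ms=#{elapsed_ms}"
  result
end

# Usage: wrap the SQL API call (stubbed here with a constant).
count = with_guessing_metric(:city) { 42 }
```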
```diff
@@ -199,12 +233,13 @@ def geocode(formatter, geometry_type, kind)
       geometry_type: geometry_type,
       kind: kind,
       max_rows: nil,
-      country_column: nil
+      country_column: country_column_name,
+      countries: "'#{country}'"
```
Note to self: this should be nil if `country` is nil.
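That fix could be as small as this sketch (the helper name is hypothetical): only build the quoted countries literal when a country was actually guessed, and pass nil through otherwise.

```ruby
# Sketch of the self-note: quote the country only when one is present;
# otherwise keep the parameter nil so no bogus "''" literal reaches SQL.
def countries_param(country)
  country.nil? ? nil : "'#{country}'"
end
```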
👍 Nice tests!
This adds the namedplaces guessing, taking countries into account.
@Kartones this is pretty much ready for a "final" review, can you please take another look?