Regionprovider Region ID fetching optimisation #1202

Merged
merged 9 commits into from Jan 5, 2016


3 participants

@stevage
Contributor
stevage commented Dec 22, 2015

This PR includes two separate optimisations:

  1. Maintaining an index of pre-computed codes for fast matching
  2. Allowing the use of server-side JSON instead of fetching IDs from WFS. The WFS method is still available as a fallback.
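
As a rough illustration of the two optimisations above (a sketch under stated assumptions, not the code in this PR): the loader prefers a pre-generated JSON file when one is configured and otherwise falls back to WFS, and an index maps each code to its position. `loadRegionIds`, `buildIdIndex`, `loadJson`, `loadFromWfs`, `layerName` and `regionProp` are hypothetical names; `regionIdsFile` is the property name this PR ends up with.

```javascript
'use strict';

// Hypothetical sketch: prefer a pre-generated JSON list of region IDs and
// fall back to the (slower) WFS request when no file is configured.
// loadJson and loadFromWfs are injected stand-ins, not TerriaJS functions.
function loadRegionIds(provider, loadJson, loadFromWfs) {
    if (provider.regionIdsFile) {
        return loadJson(provider.regionIdsFile).then(function(json) {
            return json.values;
        });
    }
    return loadFromWfs(provider.layerName, provider.regionProp);
}

// Build a lookup from region code to its position in the list, so that
// matching a code is a single property access rather than a linear scan.
function buildIdIndex(ids) {
    var index = {};
    ids.forEach(function(id, i) {
        index[id] = i;
    });
    return index;
}
```

Injecting the two loaders keeps the sketch runnable without a server; the real RegionProvider may well structure this differently.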
stevage added some commits Dec 22, 2015
@stevage stevage Use precomputed JSON lists of feature IDs to speed up region mapping. faed327
@stevage stevage Refactor RegionProvider region ID fetching so WFS and JSON methods play nicely.
b423452
@stevage stevage added a commit to TerriaJS/nationalmap that referenced this pull request Dec 22, 2015
@stevage stevage Use pre-generated lists of region IDs, supported by TerriaJS/terriajs… f602eac
stevage and others added some commits Dec 22, 2015
@stevage stevage Include SA1 test file, with all 55,000 SA1s.
456fbca
@kring kring Merge remote-tracking branch 'origin/master' into regionprovider-fidList
0f96784
@RacingTadpole RacingTadpole commented on the diff Dec 23, 2015
test/Models/RegionProviderSpec.js
@@ -26,9 +25,8 @@ beforeEach(function() {
describe('RegionProvider', function() {
it('parses WFS xml correctly', function(done) {
- loadText('test/csv/mini_ced.xml').then(function(xml) {
- ced.loadRegionsFromXML(xml);
- expect(ced.regions.length).toEqual(6);
+ ced.loadRegionIDs().then(function(json) {
@RacingTadpole
RacingTadpole Dec 23, 2015 Member

This isn't really a comment on your commit, but on the existing code - can you set it up so the unit tests don't go off to a server at all, e.g. by using a spy? (See ArcGisMapServerCatalogItemSpec for an example.)

@RacingTadpole
RacingTadpole Dec 23, 2015 Member

btw @kring's doing some cool work on this right now using sinon.fakeServer.create(), to solve the problem of failing unit tests on IE9 (which cannot load data from the region mapping server without a proxy).

@stevage
stevage Dec 23, 2015 Contributor

Yeah, good idea.
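
One way to keep a spec off the network, sketched here without any test framework: pass the loader in, and let the test supply a stub that returns canned data. `loadRegionIDsWith`, `makeStubLoader` and the file path are hypothetical; a Jasmine spy or a sinon fake server, as mentioned above, would achieve the same thing by intercepting the real loader.

```javascript
'use strict';

// Hypothetical sketch: the function under test receives its loader as an
// argument, so a test can substitute a stub and never touch a real server.
function loadRegionIDsWith(loadJson, path) {
    return loadJson(path).then(function(json) {
        return json.values;
    });
}

// A hand-rolled stub that records how it was called and returns canned data.
function makeStubLoader(cannedJson) {
    var stub = function(path) {
        stub.calls.push(path);
        return Promise.resolve(cannedJson);
    };
    stub.calls = [];
    return stub;
}

// Made-up data: six placeholder region IDs.
var stubLoader = makeStubLoader({ values: ['a', 'b', 'c', 'd', 'e', 'f'] });
var pendingCount = loadRegionIDsWith(stubLoader, 'test/data/hypothetical.json').then(function(ids) {
    // In a Jasmine spec this is where expect(...) and done() would go.
    return ids.length;
});
```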

@RacingTadpole RacingTadpole and 1 other commented on an outdated diff Dec 23, 2015
lib/Map/RegionProvider.js
@@ -116,6 +117,17 @@ var RegionProvider = function(regionType, properties) {
* @type {Object}
*/
this._idIndex = {};
+
+ /**
+ * The URL of a pre-generated JSON file containing just a long list of IDs for a given
+ * layer attribute, in the order of ascending feature IDs (fids). If defined, it will
+ * be used in preference to requesting those attributes from the WFS server.
+ * @type {String}
+ */
+ this.regionIdList = properties.regionIdList;
+
+
+
@RacingTadpole
RacingTadpole Dec 23, 2015 Member

I like the way Cesium uses jsdoc to specify which options are allowed, eg. https://github.com/TerriaJS/cesium/blob/master/Source/Core/CylinderGeometry.js#L53 - would it be worth doing this for properties here?

@stevage
stevage Dec 23, 2015 Contributor

OIC, documenting the properties as constructor parameters rather than on the individual statements that set them. I like it - but I think the style I followed is the one we mostly use. Thoughts, @kring ?
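
For comparison, a minimal sketch of the constructor-level style being suggested, using standard JSDoc `@param` tags for nested option properties. `DocumentedRegionProvider` is a hypothetical, cut-down constructor, not the real RegionProvider, and the description text is adapted from the diff above.

```javascript
'use strict';

/**
 * Hypothetical, cut-down constructor showing options documented in one place.
 *
 * @param {String} regionType A name for this region type.
 * @param {Object} [properties] Optional settings.
 * @param {String} [properties.regionIdList] The URL of a pre-generated JSON file of IDs.
 *        If defined, it is used in preference to requesting the IDs from the WFS server.
 */
var DocumentedRegionProvider = function(regionType, properties) {
    properties = properties || {};
    this.regionType = regionType;
    this.regionIdList = properties.regionIdList;
};
```

The trade-off is the one raised above: the documentation sits with the constructor signature rather than next to each assignment.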

@RacingTadpole RacingTadpole and 1 other commented on an outdated diff Dec 23, 2015
lib/Map/RegionProvider.js
@@ -116,6 +117,17 @@ var RegionProvider = function(regionType, properties) {
* @type {Object}
*/
this._idIndex = {};
+
+ /**
+ * The URL of a pre-generated JSON file containing just a long list of IDs for a given
+ * layer attribute, in the order of ascending feature IDs (fids). If defined, it will
+ * be used in preference to requesting those attributes from the WFS server.
+ * @type {String}
+ */
+ this.regionIdList = properties.regionIdList;
@RacingTadpole
RacingTadpole Dec 23, 2015 Member

A nit-picky suggestion - it's potentially confusing to have something whose name ends in List actually be a URL (ie a String) - could you change this to this.regionIdListUrl or something similar?

@RacingTadpole
RacingTadpole Dec 23, 2015 Member

Actually on closer examination, am I right that this URL returns a json object eg.

{"layer": "region_map:FID_SA1_2011...", "property": "SA1_MAIN11", "values": [10101, ...]}

? I'd have guessed something named regionIdsListUrl would return an array, not an object - or perhaps an object with a property called regionIdsList.
So, what would you say to renaming values to ids in the json, and renaming this to this.regionIdsUrl?
Of course, I realise I'm probably being too pedantic. :-)

@stevage
stevage Dec 23, 2015 Contributor

Interesting points. I had thought that the JSON spec required the outermost value to be an object, not an array, but it looks like it doesn't. A bare array is still sort of frowned on in API design though - I prefer passing an object that has a tiny bit of context at least.

Personally I think it's ok to describe an object which consists of two strings and an array of thousands of values as "a list". It really is just a list with a tiny bit of metadata. Otherwise you'd have to call virtually everything an object - even a javascript Array is an object, right? :)

In any case, I'm not really proposing that the format of this JSON file be a public interface - it's a private format that is used in this one place, and by the script that generates it.

> So, what would you say to renaming values to ids

This is really tricky. Depending on the exact context, you can call these things "values", "ids", "codes", "attributes" etc etc. I'm a bit uncomfortable calling them "ids" because for certain layers and attributes, they're English text strings like "Baw Baw (S)", which don't really have the character of "IDs" to me. At the time they're generated by the script, they're (agnostically) the value of a particular attribute on a particular layer. Later on, our code treats them as IDs, possibly too optimistically (as with ambiguous names like "Camperdown (C)").

> and renaming this to this.regionIdsUrl

Technically, it's not a URL, because it can just be a relative path, data/regionids/foo.json. I think regionIdsFile would be more accurate.

(Ha, I think I out-nitpicked your nitpick :))

@stevage
stevage Dec 23, 2015 Contributor

Changed it to regionIdsFile
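
To make the shape under discussion concrete, here is a small sketch that reads a file of this form: an object with a little context (`layer`, `property`) wrapped around a long `values` array. `readRegionIds` is a hypothetical helper, and the sample data and layer name are placeholders.

```javascript
'use strict';

// Hypothetical sketch of reading the private file format discussed above.
// It rejects anything that is not an object carrying a "values" array.
function readRegionIds(json) {
    if (!json || !Array.isArray(json.values)) {
        throw new Error('Expected an object with a "values" array');
    }
    return json.values;
}

// Made-up sample in the same shape; "example:layer" is a placeholder.
var sampleJson = JSON.parse(
    '{"layer": "example:layer", "property": "SA1_MAIN11", "values": [10101, 10102, 10103]}'
);
var sampleIds = readRegionIds(sampleJson);
```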

@RacingTadpole RacingTadpole and 1 other commented on an outdated diff Dec 23, 2015
lib/Map/RegionProvider.js
// if this column might be ambiguous then fetch the disambiguating values for each column as well (usually State)
if (this.disambigProp) {
- url = baseuri.setQuery('valueReference', this.disambigProp).toString();
- promises.push(loadText(url).then(function(xml) {
- that.loadRegionsFromXML(xml, that.disambigProp, "disambigServerReplacements");
- }));
+ var dp = this.loadRegionsFromWfs(this.disambigProp);
@RacingTadpole
RacingTadpole Dec 23, 2015 Member

Wondering about this line - as you say, loading from wfs is the slower option - is there a way to tell if the faster json option is available for the disambiguation variable? I imagine it's hard because this RegionProvider would need to know things about other regions.

@stevage
stevage Dec 23, 2015 Contributor

Heh, yeah. I was kind of over it by this point :) So, to be explicit, the current situation for disambiguation matching on LGA+State is:

  1. Load LGA name column, using JSON file
  2. Load LGA state column, without using JSON file
  3. Go to town.

I kind of dismissed it as an edge case, but I think you're right. There are enough LGAs to make this worthwhile.

@stevage
stevage Dec 23, 2015 Contributor

Oh yeah, now I remember - as you pointed out, "RegionProvider would need to know things about other regions". In a previous iteration of this RegionProvider thing, each RP actually did contain a reference to a whole separate RP, but that turned out to be a pretty flaky concept. A disambiguation field is just an additional field of the same boundary layer, so many of the other attributes don't apply.

Fixed by adding regionDisambigIdList field, and cleaning up the code some more.

@RacingTadpole RacingTadpole and 1 other commented on an outdated diff Dec 23, 2015
lib/Map/RegionProvider.js
}
// store a lookup by attribute, for performance.
- this._idIndex[prop] = i;
+ that._idIndex[id] = i;
@RacingTadpole
RacingTadpole Dec 23, 2015 Member

processRegionIds is potentially called twice, once for the regions and sometimes also for the disambiguation column (eg. if two different States can have regions with the same name, the disambiguation variable is the State). Since both cases are written into the same object that._idIndex, is there ever a risk that the disambiguation id might be the same as the region id, and overwrite it here?

@stevage
stevage Dec 23, 2015 Contributor

Interesting point! As the code stands, this is never an issue, because _idIndex isn't even consulted when disambiguation is in question (see findRegionIndex). By definition, this kind of a hash table isn't much use when you have ambiguous values.

But you've prompted me to finish the job of the index, so it's used for disambiguation, too. (The disambiguation property itself doesn't get an index, because it wouldn't be useful.)
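
One possible way an index could serve ambiguous values too, offered only as a sketch and not as the approach taken in this PR: keep every matching position for each ID, then narrow the candidates with the disambiguation value. `buildDisambigIndex`, `lookupRegion`, the `state` field and the sample data are all hypothetical.

```javascript
'use strict';

// Hypothetical sketch: index region IDs so that an ambiguous ID keeps every
// matching position instead of overwriting the earlier one.
function buildDisambigIndex(regions) {
    var index = {};
    regions.forEach(function(region, i) {
        if (!Object.prototype.hasOwnProperty.call(index, region.id)) {
            index[region.id] = [];
        }
        index[region.id].push(i);
    });
    return index;
}

// Return the position of the matching region, or -1 if there is none.
// With several candidates and no disambiguation value, the first one wins.
function lookupRegion(regions, index, id, disambigValue) {
    var candidates = Object.prototype.hasOwnProperty.call(index, id) ? index[id] : [];
    if (candidates.length <= 1 || disambigValue === undefined) {
        return candidates.length > 0 ? candidates[0] : -1;
    }
    for (var k = 0; k < candidates.length; k++) {
        if (regions[candidates[k]].state === disambigValue) {
            return candidates[k];
        }
    }
    return -1;
}

// Made-up data: the same name appearing under two placeholder states.
var sampleRegions = [
    { id: 'Baw Baw (S)', state: 'State A' },
    { id: 'Camperdown (C)', state: 'State A' },
    { id: 'Camperdown (C)', state: 'State B' }
];
var sampleIndex = buildDisambigIndex(sampleRegions);
```

Only the region ID gets an index here; the disambiguation value is just compared against the few candidates, which matches the remark above that indexing it separately would not be useful.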

@RacingTadpole
Member

OK, that's all I can think of! Over to you again :-)

stevage added some commits Dec 23, 2015
@stevage stevage Use index for disambiguation as well, simplifying the matching process somewhat.
45d9f1b
@stevage stevage Support JSON file for regionDisambigIdList as well, for consistency and simplicity.
44bbacb
@stevage stevage Use regionmapping.json and associated JSON files from NationalMap, for more realistic testing.
c2d73d5
@stevage stevage Rename regionIdList to regionIdsFile
0c80c36
@RacingTadpole
Member

Looks good! Let me know when you're ready for me to take another look.

@stevage
Contributor
stevage commented Dec 24, 2015

Oh, yep, I'm ready - think I addressed everything.

@kring kring Merge remote-tracking branch 'origin/master' into regionprovider-fidList
0fdbfe9
@kring kring merged commit fa03cf3 into master Jan 5, 2016

2 of 3 checks passed

continuous-integration/travis-ci/push The Travis CI build is in progress
continuous-integration/travis-ci/pr The Travis CI build passed
licence/cla Contributor License Agreement is signed.
@kring kring deleted the regionprovider-fidList branch Jan 5, 2016