Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add fallback code for unsupported locales #168

Merged
merged 25 commits into from
Oct 5, 2017
Merged

Add fallback code for unsupported locales #168

merged 25 commits into from
Oct 5, 2017

Conversation

bsudekum
Copy link

@bsudekum bsudekum commented Oct 3, 2017

This PR adds support for trying to find locales which we currently do not support. This gist of it is, that if a developer provides es-MX, we should give them instructions provides by es.

This also moves to use en when a language is not found.

/cc @mcwhittemore @1ec5

Copy link
Contributor

@mcwhittemore mcwhittemore left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we change findAppropriateLanguage to findTranslation and have this return the translation object rather than a string which is a key in that index?

Doing this would let us do something like _.merge({}, englishTranslation, foundTranslation) and remove all the if not in the lang, look in english checks.

We could expose the language code selected on this object as well so that someone can use this function to know if their language code is supported by OSRM text instructions.

modifier: 'left'
},
name: 'Way Name'
}), 'Gire a la izquierda en Way Name');
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

turn left instructions are identical between es and es-ES. Can we find an instruction that differs between the two locales?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, maybe it’d be better to unit test getAvailableLanguage() directly instead of (or in addition to) testing its integration as part of compile().

index.js Outdated
@@ -210,6 +211,30 @@ module.exports = function(version, _options) {
}

return output;
},
findAppropriateLanguage: function(language) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this method need to be exported? Also, how about calling it getAvailableLanguage() or getBestMatchingLanguage()?

index.js Outdated
if (languages.supportedCodes.indexOf(language) === -1) throw new Error('language code ' + language + ' not loaded');
compile: function(originalLanguage, step, options) {
if (!originalLanguage) throw new Error('No language code provided');
var language = this.findAppropriateLanguage(originalLanguage);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Several other exported methods take a language parameter as well. We should be consistent; otherwise, a client using the tokenize() method, for example, will end up with a mismash.

index.js Outdated
@@ -210,6 +211,30 @@ module.exports = function(version, _options) {
}

return output;
},
findAppropriateLanguage: function(language) {
if (languages.supportedCodes.indexOf(language) > -1) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

languages.instructions is an object with language codes as the keys. I think it’d be a bit easier to work with that object than the supported codes array.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's keep the list of languages we're using here consistent. An array of locales is little more accurate than a dictionary with keys IMO.

index.js Outdated
// no country code needed.
return filteredSupportedLanguages.sort(function(a, b) {
return a.length - b.length;
})[0];
Copy link
Member

@1ec5 1ec5 Oct 4, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should look for matches in this order (prior art), at least until we can pull in a more sophisticated library that specializes in language fallback logic:

  1. Same full locale code (e.g., lng-Scpt-CC)
  2. Same language code and script code (lng-Scpt)
  3. Same language code and country code (lng-CC)
  4. Same language code (lng)
  5. Same language code and any script code (lng-Scpx)
  6. Same language code and any country code (lng-CX)
  7. English (en)

It’ll help if we use languages.instructions instead of languages.supportedCodes.

/cc @jcsg

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that sounds good. I was originally worried that condition 5 might mean that we'd serve Serbian in Latin rather than Cyrillic, but that probably would be better than just serving English for that region, especially if the voice instructions still use Serbian.

Thanks for looping me in, @1ec5 .

@1ec5
Copy link
Member

1ec5 commented Oct 4, 2017

Can we change findAppropriateLanguage to findTranslation and have this return the translation object rather than a string which is a key in that index?

We could simplify this PR considerably by removing all the calls to getBestMatchingLanguage() and making the client responsible for calling getBestMatchingLanguage() before passing a locale code into any other method like compile() or tokenize(). To enforce that, this library would throw an error if the passed-in locale code isn’t available.

index.js Outdated

// Same language code (lng)
} else if (supportedLanguageCode.indexOf(languageCode) > -1) {
return languages.supportedCodes[supportedLanguageCode.indexOf(languageCode)];
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This entire expression is equivalent to just languageCode.

index.js Outdated
return languageCode + '-' + countryCode;

// Same language code (lng)
} else if (supportedLanguageCode.indexOf(languageCode) > -1) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is guaranteed to be equivalent to if (language.instructions[languageCode]).

index.js Outdated
countryCode = splitLocale[2];
}

var supportedLanguageCode = languages.supportedCodes.map(function(locale) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is an array, so the name should be plural.

index.js Outdated
return languages.supportedCodes[supportedLanguageCode.indexOf(languageCode)];

// Same language code and any script code (lng-Scpx)
} else if (supportedLanguageCode.indexOf(languageCode) > -1 && scriptCode) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we support zh-Hans but not zh, and language is zh-Hant, then we won’t go into this if statement because zh-Hant isn’t one of the supported languages. Instead, we need to look for an item in supportedLanguageCodes that contains the language code zh and also contains a script code.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in 167f7c2

});

t.test('getBestMatchingLanguage', function(t) {
t.assert(compiler('v5').getBestMatchingLanguage('zh-Hant'), 'zh-Hans');
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let’s make sure all the cases above get proper unit tests too, especially es-MXes. It’ll also be important to make sure we match the language codes case-insensitively, since that’s what the Directions API does.

index.js Outdated
var scriptCode = false;
var countryCode = false;

if (splitLocale.lenth === 1) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typo: length.

index.js Outdated
var countryCode = false;

if (splitLocale.lenth === 1) {
languageCode = splitLocale[0];
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might as well set languageCode unconditionally at this point. If language is something bogus like c++, then we’ll end up falling back to English anyways.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm going to keep the splitting code and the return code separated.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There’s no else case here. If language is the empty string, then we wind up looking up silly values like false-false in instructions. I continue to think we should pull this line out of the if and set it unconditionally.

index.js Outdated
} else if (splitLocale.length === 3) {
languageCode = splitLocale[0];
scriptCode = splitLocale[1];
countryCode = splitLocale[2];
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here’s a useful reference on the syntax of a BCP47 language tag. We’re currently handling all the parts that a client is likely to pass in.

index.js Outdated
// Same language code and any script code (lng-Scpx) and the found language contains a script
} else if (supportedLanguageCodes.indexOf(languageCode) > -1 && scriptCode &&
languages.supportedCodes[supportedLanguageCodes.indexOf(languageCode)].split('-').length > 1 &&
languages.supportedCodes[supportedLanguageCodes.indexOf(languageCode)].split('-')[1].length === 4) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This would be easier if, right off the bat, we map each item in supportedCodes to an object with the keys language, script, and region.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It'd only be used really for this case. I'm fine with keeping this.

Copy link
Member

@1ec5 1ec5 Oct 5, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we do the preprocessing suggested above, then this becomes:

else if (availableLanguages.find(function (language) {
  return language.language === languageCode && language.script;
}))

and some of the duplication in the next else if block goes away.

We can do that preprocessing inside languages.js to avoid any meaningful performance penalty.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed it around to this in 9a19e1e, not sure if it's any more clear.

index.js Outdated
})) {
return languages.supportedCodes[supportedLanguageCodes.indexOf(languageCode)];
// Same language code and any country code (lng-CX)
} else if (supportedLanguageCodes.indexOf(languageCode) > -1 && countryCode) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the passed-in language is pt-PT and we support pt-BR (but not pt), this line checks whether pt-PT is supported (it isn’t) and pt-PT has a country code (it does). Consequently, we fail to go in here. Let’s add a test for this case.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

index.js Outdated
});

var availableLanguages = languages.supportedCodes.map(function(language) {
return that.divideLanguageCodes(language);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

languages.js currently exports a supportedCodes containing strings; it should also export supportedLanguages (or availableLanguages) that maps to these objects. That way this code only has to run once, no matter how many times getBestMatchingLanguage() is called.

index.js Outdated
} else if (availableLanguages.find(function (language) {
return language.languageCode === languageCode && language.scriptCode;
})) {
return languages.supportedCodes[supportedLanguageCodes.indexOf(languageCode)];
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The object returned by find() should have a locale property that contains the full string, so that we can just return the locale property here instead of searching through supportedCodes.

index.js Outdated
},
divideLanguageCodes: function(language) {
// If the language is not found, try a little harder
var splitLocale = language.toLowerCase().split('-');
Copy link
Member

@1ec5 1ec5 Oct 5, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To me, it’d be cleaner to use a regular expression for this job:

// Two-letter language code, then optional four-letter script code, then optional two-letter region code
var match = language.match(/(\w\w)(?:-(\w\w\w\w))?(?:-(\w\w))?/i);
// Normalize case
var locale = [];
if (match[1]) {
  match[1] = match[1].toLowerCase();
  locale.push(match[1]);
}
if (match[2]) {
  match[2] = match[2][0].toUpperCase() + match[2].substring(1).toLowerCase();
  locale.push(match[2]);
}
if (match[3]) {
  match[3] = match[3].toUpperCase();
  locale.push(match[3]);
}
return {
  locale: locale.join("-"),
  language: match[1],
  script: match[2],
  region: match[3]
};

Looks like a lot, but most of it is to normalize the case, which is important for avoiding having to index into supportedLanguages all the time.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is failing for me. I think both styles of parsing out the codes is fine.

Copy link
Member

@1ec5 1ec5 Oct 5, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If it’s failing, it’s probably because the rest of the code hasn’t been updated to account for the normalized case being used by ☝️.

t.assert(compiler('v5').getBestMatchingLanguage('es-MX'), 'es');
t.assert(compiler('v5').getBestMatchingLanguage('es-es'), 'es-es');
t.end();
});
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let’s add some tests of divideLanguageCodes() or whatever we call it once we move it to languages.js.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

index.js Outdated
})) {
return languages.parsedSupportedCodes.find(function (language) {
return language.languageCode === languageCode && language.scriptCode;
}).locale;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of repeating the find() call, store the result of the first find. It’d be a lot easier to store things in local variables if you replace all the else ifs with ifs – after all, we’re returning early in each case.

t.deepEqual(languages.parseLanguageIntoCodes('zh'), { countryCode: false, languageCode: 'zh', locale: 'zh', scriptCode: false });
t.deepEqual(languages.parseLanguageIntoCodes('es-MX'), { countryCode: 'mx', languageCode: 'es', locale: 'es-MX', scriptCode: false });
t.deepEqual(languages.parseLanguageIntoCodes('es-ES'), { countryCode: 'es', languageCode: 'es', locale: 'es-ES', scriptCode: false });
t.deepEqual(languages.parseLanguageIntoCodes('pt-PT'), { countryCode: 'pt', languageCode: 'pt', locale: 'pt-PT', scriptCode: false });
Copy link
Member

@1ec5 1ec5 Oct 5, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Directions API currently allows a language parameter to be specified in any case, such as pt-pt or PT-Pt. If the Directions API is to use getBestMatchingLanguage(), we should test that this method matches case insensitively, so pt-pt should also produce pt-BR.

languages.js Outdated
locale: language,
languageCode: languageCode,
scriptCode: scriptCode,
countryCode: countryCode
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It’s a bit redundant to say “code” in each of these keys.

Bobby Sudekum added 2 commits October 5, 2017 15:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants