bug caused by unidecode's bug #1986

Closed
xuduo35 opened this Issue Jan 19, 2014 · 5 comments

Contributor

xuduo35 commented Jan 19, 2014

I posted a new article with the title "第一篇". The article link becomes 'http://127.0.0.1:2368/Di%20[/?]%20Pian%20/', and a 404 error occurs. After some checking, I think it is caused by the unidecode module. Test code (in index.js):


var unidecode = require('unidecode');
console.log("unidecode(第一篇) = " + unidecode("第一篇") + "\n");

Result:


E:\node_js\Ghost>npm start

> ghost@0.4.0 start E:\node_js\Ghost
> node index

unidecode(第一篇) = Di [?] Pian

It seems unidecode cannot transliterate the Chinese character "一" correctly.

Member

halfdan commented Jan 23, 2014

Hi @xuduo35, this is indeed an issue with unidecode.

The character is U+4E00, which has no transliteration defined in the unidecode data files: https://github.com/FGRibreau/node-unidecode/blob/master/data/x4e.js

I looked up the translation of what you wrote on Google Translate and it transliterated the text to: Dì yī piān.

Would you agree that "yi" could serve as transliteration for the 一 character?
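For readers unfamiliar with the data files: the name x4e.js suggests that each file covers one block of 256 characters, selected by the high byte of the code point, with the low byte as the index inside the file. The sketch below only illustrates that arithmetic. It is not unidecode's actual source; `locateEntry` is a hypothetical helper, and the two-digit zero padding of the file name is an assumption.

```javascript
// Hypothetical helper: maps a character to the data file and array index
// where its transliteration would live, assuming one file per 256-character block.
function locateEntry(ch) {
    var codePoint = ch.charCodeAt(0); // Basic Multilingual Plane only
    var high = codePoint >> 8;        // selects the file, e.g. 0x4e
    var low = codePoint & 0xff;       // index inside that file
    var hex = high.toString(16);
    if (hex.length < 2) {
        hex = '0' + hex;              // assumed padding, e.g. "x00.js"
    }
    return { file: 'x' + hex + '.js', index: low };
}

console.log(locateEntry('一')); // U+4E00 maps to the first entry of x4e.js
```

Under this assumption, 一 is the very first entry of x4e.js, which is why a single missing value there breaks a common character.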

voronoipotato commented Jan 23, 2014

一 is "yi"; you can double-check by typing the pinyin on a Chinese keyboard.

halfdan added a commit to halfdan/node-unidecode that referenced this issue Jan 23, 2014

Add transliteration for 一 (U4E00)
Added missing transliteration as suggested in:
TryGhost/Ghost#1986

halfdan referenced this issue in FGRibreau/node-unidecode Jan 23, 2014

Merged

Add transliteration for 一 (U4E00) #4

Owner

ErisDS commented Jan 23, 2014

一 is "yi"; I can just about remember this from my Mandarin lessons.

Surely we're going to run into this problem for all the un-transliterated characters in https://github.com/halfdan/node-unidecode/blob/d905ec9f27b597ffeb446ff2dfdc75200eeeccba/data/x4e.js?

Perhaps @xuduo35 or @wangsai, or one of our other Chinese-speaking contributors, could look at transliterating the missing characters? What's the easiest way to get a list of the characters which are currently transliterated as [?]?

Contributor

xuduo35 commented Jan 24, 2014

I checked "第一篇" with the Perl module Text::Unidecode; it also outputs 'Di [?] Pian'.
The character "一" is a special case, because it can be pronounced. But there are other Chinese characters that even I don't know how to pronounce. I think it's okay for unidecode to output [?] for characters of that kind.
But this results in a bad URL.
So I think we can simply replace [?] with another character after decoding, such as '-' or '_'. Getting the URL right is more important.
@ErisDS What is your opinion?

Just add one line to replace [?] with '-'.

diff --git a/core/server/models/base.js b/core/server/models/base.js
index e03a164..f03d9d9 100644
--- a/core/server/models/base.js
+++ b/core/server/models/base.js
@@ -226,6 +226,7 @@ ghostBookshelf.Model = ghostBookshelf.Model.extend({
         slug = slug.charAt(slug.length - 1) === '-' ? slug.substr(0, slug.lengt
         // Remove non ascii characters
         slug = unidecode(slug);
+        slug = slug.replace(/\[\?\]/, "-");
         // Check the filtered slug doesn't match any of the reserved keywords
         slug = /^(ghost|ghost\-admin|admin|wp\-admin|wp\-login|dashboard|logout
             .test(slug) ? slug + '-post' : slug;
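
One caveat when applying a replacement like this: in JavaScript, `String.prototype.replace` with a regular expression that lacks the `g` flag replaces only the first match, so a title containing several untransliterated characters would still keep some `[?]` placeholders. The sketch below is a minimal illustration of a more thorough cleanup, not Ghost's actual code; `cleanTransliteratedSlug` is a hypothetical helper, and the character class is the reserved set from RFC 3986.

```javascript
// Without the g flag only the first placeholder is replaced.
console.log('a [?] b [?]'.replace(/\[\?\]/, '-')); // "a - b [?]"

// Hypothetical helper: cleans a string that has already been passed through unidecode.
function cleanTransliteratedSlug(text) {
    return text
        // Replace every "[?]" placeholder, not just the first one.
        .replace(/\[\?\]/g, '-')
        // Turn each run of whitespace into a single hyphen.
        .replace(/\s+/g, '-')
        // Drop URL reserved characters (RFC 3986 gen-delims and sub-delims).
        .replace(/[:\/?#\[\]@!$&'()*+,;=]/g, '')
        // Collapse repeated hyphens, then trim hyphens from both ends.
        .replace(/-{2,}/g, '-')
        .replace(/^-+|-+$/g, '')
        .toLowerCase();
}

console.log(cleanTransliteratedSlug('Di [?] Pian ')); // "di-pian"
```

Note that a title made up entirely of untransliterated characters reduces to an empty string here, so a caller would still need some fallback slug for that case.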
Contributor

xuduo35 commented Jan 24, 2014

I mean that some Unicode characters are not complete words. They are only used to construct other characters in our language (I think the same situation exists in Japanese or Korean). They cannot be pronounced, so there is no transliteration for them. You can check these with the command 'find node_modules/unidecode/data|xargs grep '[?]''; there are too many. I don't think there is any way to fill them all in.
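
As an alternative to grepping the data files, the same list could be produced from JavaScript by running each character of a range through the transliterator and keeping those whose output contains `[?]`. This is a minimal sketch under stated assumptions: `findUntransliterated` and `fakeTransliterate` are hypothetical names, and the stand-in function only mimics the 'Di [?] Pian' behaviour reported above so that the example runs without the real module.

```javascript
// Hypothetical helper: `transliterate` is any string-to-string function
// with the same shape as unidecode.
function findUntransliterated(transliterate, start, end) {
    var missing = [];
    for (var cp = start; cp <= end; cp++) {
        var ch = String.fromCharCode(cp); // Basic Multilingual Plane only
        if (transliterate(ch).indexOf('[?]') !== -1) {
            missing.push('U+' + cp.toString(16).toUpperCase() + ' ' + ch);
        }
    }
    return missing;
}

// Stand-in transliterator (an assumption for the demo): only U+4E00 is "missing".
function fakeTransliterate(ch) {
    return ch === '\u4e00' ? '[?] ' : 'x ';
}

console.log(findUntransliterated(fakeTransliterate, 0x4e00, 0x4e02));
```

With the real module installed, passing `require('unidecode')` as the first argument and a range such as 0x4e00 to 0x4eff should cover the block that x4e.js appears to hold.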

xuduo35 added a commit to xuduo35/Ghost that referenced this issue Jan 25, 2014

bug caused by unidecode's bug
issue #1986
remove URL reserved chars after unidecode, because unidecode will produce
some URL reserved chars.

xuduo35 pushed a commit to xuduo35/Ghost that referenced this issue Jan 26, 2014

bug caused by unidecode's bug
close #1986
- remove URL reserved chars after unidecode, because unidecode will produce some URL reserved chars.

ErisDS closed this in 1d1caad Jan 28, 2014

morficus pushed a commit to morficus/Ghost that referenced this issue Sep 4, 2014

bug caused by unidecode's bug
close #1986
- remove URL reserved chars after unidecode, because unidecode will produce some URL reserved chars.