bug caused by unidecode's bug #1986

Closed
xuduo35 opened this Issue · 5 comments

4 participants

@xuduo35

I posted a new article with the title "第一篇". The article link becomes 'http://127.0.0.1:2368/Di%20[/?]%20Pian%20/', and a 404 error occurs. After some checking, I think it is caused by the unidecode module. Test code (in index.js):


var unidecode = require('unidecode');
console.log("unidecode(第一篇) = " + unidecode("第一篇") + "\n");

Result:


E:\node_js\Ghost>npm start

> ghost@0.4.0 start E:\node_js\Ghost
> node index

unidecode(第一篇) = Di [?] Pian

It seems unidecode cannot transliterate the Chinese character '一' correctly.

@halfdan
Collaborator

Hi @xuduo35, this is indeed an issue with unidecode.

The character is U+4E00, which is undefined in the unidecode data files: https://github.com/FGRibreau/node-unidecode/blob/master/data/x4e.js

I looked up the translation of what you wrote on Google Translate and it transliterated the text to: Dì yī piān.

Would you agree that "yi" could serve as the transliteration for the 一 character?
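To make the lookup above concrete, here is a minimal sketch of how a character maps to a data file and an index. It assumes the layout suggested by the file name data/x4e.js: one file per high byte of the code point, holding 256 entries indexed by the low byte. The helper name locateCodePoint is hypothetical and is not part of unidecode.

```javascript
// Hypothetical helper, not part of unidecode.
// Assumption: data/x<hi>.js holds 256 entries, indexed by the low byte.
function locateCodePoint(ch) {
    var code = ch.charCodeAt(0);                 // BMP characters only
    var high = code >> 8;                        // selects the data file
    var low = code & 0xff;                       // index inside that file
    var hex = high.toString(16);
    if (hex.length < 2) {
        hex = '0' + hex;
    }
    var padded = ('0000' + code.toString(16).toUpperCase()).slice(-4);
    return {
        codePoint: 'U+' + padded,
        file: 'data/x' + hex + '.js',
        index: low
    };
}

var location = locateCodePoint('\u4e00');        // the character 一
console.log(location.codePoint, location.file, location.index);
```

Under that assumption, 一 is the very first entry of data/x4e.js, which is the entry reported as undefined.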

@voronoipotato

一 is yi; you can double-check by typing the pinyin on a Chinese keyboard.

@halfdan halfdan referenced this issue from a commit in halfdan/node-unidecode
@halfdan halfdan Add transliteration for 一 (U4E00)
Added missing transliteration as suggested in:
TryGhost/Ghost#1986
d905ec9
@halfdan halfdan referenced this issue in FGRibreau/node-unidecode
Open

Add transliteration for 一 (U4E00) #4

@ErisDS
Owner

一 is yi, can just about remember this from my Mandarin lessons

Surely we're going to run into this problem for all the un-transliterated characters in https://github.com/halfdan/node-unidecode/blob/d905ec9f27b597ffeb446ff2dfdc75200eeeccba/data/x4e.js?

Perhaps @xuduo35 or @wangsai, or one of our other Chinese-speaking contributors, could look at transliterating the missing characters? What's the easiest way to get a list of the characters which are currently transliterated as [?]?
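One possible way to build such a list is to run every code point in a range through the transliteration function and collect those whose output contains [?]. This is only a sketch: findUntransliterated and fakeTransliterate are hypothetical names, and the stub stands in for the real module so the snippet runs on its own.

```javascript
// Hypothetical helper: collect characters whose transliteration contains "[?]".
// `transliterate` is any function taking a string and returning a string.
function findUntransliterated(transliterate, start, end) {
    var missing = [];
    for (var code = start; code <= end; code += 1) {
        var ch = String.fromCharCode(code);      // BMP range assumed
        if (transliterate(ch).indexOf('[?]') !== -1) {
            missing.push({ codePoint: code, char: ch });
        }
    }
    return missing;
}

// Stand-in for unidecode, for illustration only: treats U+4E00 as missing.
function fakeTransliterate(ch) {
    return ch === '\u4e00' ? '[?] ' : 'X ';
}

var result = findUntransliterated(fakeTransliterate, 0x4e00, 0x4e02);
console.log(result.length + ' missing character(s)');

// With the real module this would presumably be called as:
// findUntransliterated(require('unidecode'), 0x4e00, 0x4eff);
```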

@xuduo35

I checked "第一篇" with the Perl module Text::Unidecode; it also outputs 'Di [?] Pian'.
The character "一" is a special case, because it can be pronounced. But there are other Chinese characters that even I don't know how to pronounce. I think it's okay for unidecode to use [?] for characters of that type.
But this produces a bad URL.
So I think we can just replace [?] with another character after decoding, like '-' or '_'. Getting the URL right is more important.
@ErisDS What is your opinion?

Just add one line to replace [?] with '-'.

diff --git a/core/server/models/base.js b/core/server/models/base.js
index e03a164..f03d9d9 100644
--- a/core/server/models/base.js
+++ b/core/server/models/base.js
@@ -226,6 +226,7 @@ ghostBookshelf.Model = ghostBookshelf.Model.extend({
         slug = slug.charAt(slug.length - 1) === '-' ? slug.substr(0, slug.lengt

         // Remove non ascii characters
         slug = unidecode(slug);
+        slug = slug.replace(/\[\?\]/, "-");
         // Check the filtered slug doesn't match any of the reserved keywords
         slug = /^(ghost|ghost\-admin|admin|wp\-admin|wp\-login|dashboard|logout

             .test(slug) ? slug + '-post' : slug;
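One caveat about the added line: a regular expression without the g flag replaces only the first match, so 'x[?]y[?]z' would become 'x-y[?]z'. The sketch below shows one way to handle every placeholder and tidy the result. cleanDecodedSlug is a hypothetical helper, not the code that was eventually committed to Ghost.

```javascript
// Hypothetical helper: tidy a string that has already been through unidecode.
function cleanDecodedSlug(decoded) {
    return decoded
        .replace(/\[\?\]/g, '-')    // every "[?]" placeholder, not only the first
        .replace(/\s+/g, '-')       // whitespace is not wanted in a URL path
        .replace(/-+/g, '-')        // collapse runs of dashes
        .replace(/^-|-$/g, '');     // trim a leading or trailing dash
}

console.log(cleanDecodedSlug('Di [?] Pian '));
```

For the title in this issue, the input 'Di [?] Pian ' comes out as 'Di-Pian'.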
@xuduo35

I mean, there are some Unicode characters that are not complete words. They are only used to construct other characters in our language (I think the same situation exists in Japanese and Korean). They cannot be pronounced, so there is no transliteration for them. Check these with the command "find node_modules/unidecode/data | xargs grep '[?]'"; there are too many. I don't think there is any way to fill them all in.

@xuduo35 xuduo35 referenced this issue from a commit in xuduo35/Ghost
@xuduo35 xuduo35 bug caused by unidecode's bug
issue #1986
remove URL reserved chars after unidecode, because unidecode will produce
some URL reserved chars.
e8e10e2
@xuduo35 xuduo35 referenced this issue from a commit in xuduo35/Ghost
unknown bug caused by unidecode's bug
close #1986
- remove URL reserved chars after unidecode, because unidecode will produce some URL reserved chars.
21d45f7
@ErisDS ErisDS closed this issue from a commit
@xuduo35 xuduo35 bug caused by unidecode's bug
close #1986
- remove URL reserved chars after unidecode, because unidecode will produce some URL reserved chars.
1d1caad
@ErisDS ErisDS closed this in 1d1caad