.WordCount not accurate for Japanese pages #1266

Closed
RickCogley opened this issue Jul 11, 2015 · 10 comments

@RickCogley
Contributor

Hello - when I look at .WordCount results against an English page and a Japanese page, the count is correct for English of course, but incorrect for Japanese, returning a very small number.

I assume it is counting words using spaces as the delimiter, but Japanese and other CJK languages are inherently space-less. This also causes trouble for search engines, as an aside.

I am wondering if .WordCount could detect the language and count characters for CJK languages, instead of returning an incorrect number.

Best regards,
Rick

bep added the Bug label Jul 11, 2015
@bep
Member

bep commented Jul 11, 2015

It can hardly detect the language, but it should be able to use the languageCode. What is the correct way to count words in Japanese?

@RickCogley
Contributor Author

Hi @bep, it's not straightforward since there are so many combinations, and no spaces. One would need a really large dictionary file of all possible combinations of kanji characters, and then run that against text to try to guess the number of words. You could not even guess what a "word" was, since sometimes what looks like a four-character combination is actually two two-character combinations.

I think it is better to simply count the double-byte characters and report that count.

Edit: note also that single-byte characters, English as well as Japanese, can be interspersed among the double-byte Japanese characters.

@RickCogley
Contributor Author

An interesting aside: today I was working on SEO or social partials, and discovered that the .languageCode var is used for the RSS feed, which in turn uses en-US for English, but ja for Japanese (not ja-JP).

That makes it awkward to reuse as a "locale" value, because:

  • Facebook og uses an underscore, like en_US, ja_JP
  • Schema.org uses a hyphen, like en-US, ja-JP
  • (Twitter card uses no locale)

What I ended up doing was to settle on the hyphenated version for locale in the site and page params, then use the replace function, like {{ replace . "-" "_" }}, to produce the underscore version for Facebook og as needed.
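
For reference, the same hyphen-to-underscore conversion can be sketched in plain Go. This mirrors what the template call above does on a locale string; the function name `toOpenGraphLocale` is hypothetical and not part of Hugo.

```go
package main

import (
	"fmt"
	"strings"
)

// toOpenGraphLocale converts a hyphenated locale such as "ja-JP"
// into the underscore form ("ja_JP") mentioned above for Facebook og.
// The name is hypothetical, chosen for this example only.
func toOpenGraphLocale(locale string) string {
	return strings.ReplaceAll(locale, "-", "_")
}

func main() {
	for _, l := range []string{"en-US", "ja-JP"} {
		fmt.Println(l, "->", toOpenGraphLocale(l))
	}
	// Prints:
	// en-US -> en_US
	// ja-JP -> ja_JP
}
```

Storing the hyphenated form and deriving the underscore form keeps a single source of truth for the locale value.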

@bep
Member

bep commented Jul 11, 2015

@RickCogley one part of me just loves having a language guy like you on the team coming up with problems like these, the other part ...

@bep
Member

bep commented Jul 11, 2015

Reading what you say, I guess what we need here is to skip the discussion about what a word is -- and export a new method on page: RuneCount.

https://golang.org/pkg/unicode/utf8/#RuneCount
http://blog.golang.org/strings

@RickCogley
Contributor Author

@bep, hehe, "sorry". :-)
RuneCount, yes! That would work, because we can check the locale and then show either WordCount or RuneCount as appropriate.
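
The locale check could look roughly like this in Go. It is only a sketch: the helper names `usesRuneCount` and `lengthFor` are hypothetical, and the set of language codes treated as space-less ("ja", "zh") is an assumption made for the example.

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

// usesRuneCount reports whether text in the given language should be
// measured in characters rather than space-delimited words.
// The list of codes is an assumption for this sketch.
func usesRuneCount(languageCode string) bool {
	// Keep only the primary subtag, so "ja-JP" and "ja" behave the same.
	primary := strings.ToLower(strings.SplitN(languageCode, "-", 2)[0])
	return primary == "ja" || primary == "zh"
}

// lengthFor returns a count plus the unit it is expressed in.
func lengthFor(languageCode, text string) (int, string) {
	if usesRuneCount(languageCode) {
		return utf8.RuneCountInString(text), "characters"
	}
	return len(strings.Fields(text)), "words"
}

func main() {
	n, unit := lengthFor("ja", "素早い茶色の狐")
	fmt.Println(n, unit) // 7 characters

	n, unit = lengthFor("en-US", "The quick brown fox")
	fmt.Println(n, unit) // 4 words
}
```

Comparing only the primary subtag sidesteps the ja versus ja-JP inconsistency noted earlier in the thread.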

@bep
Member

bep commented Jul 12, 2015

@RickCogley just to check: In my head it wouldn't make sense to include whitespace in that count, right?

bep closed this as completed in 77c60a3 Jul 12, 2015
bep added a commit that referenced this issue Jul 12, 2015
Do not create it unless used.

See #1266
@RickCogley
Contributor Author

@bep, no, I don't think we need whitespace counted.

Japanese can use a normal ASCII space, and there is also a double-byte space. Sometimes we use them in names:

田中 太郎
田中　太郎

The first has a single-byte space and the second a double-byte space between the last and first names.

But this usage is pretty rare.

@bep
Member

bep commented Jul 12, 2015

Whitespace also includes newlines, tabs, etc., so I think counting it would give a skewed count for small texts with lots of paragraphs. I will keep it as implemented.
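
A rune count that skips whitespace can be sketched as follows. This is not taken from the referenced commit: it assumes "whitespace" means whatever Go's `unicode.IsSpace` reports, which covers the ASCII space, tabs, newlines, and the double-byte (ideographic) space U+3000. The function name `runeCountNoSpace` is hypothetical.

```go
package main

import (
	"fmt"
	"unicode"
)

// runeCountNoSpace counts the runes in s, ignoring anything that
// unicode.IsSpace classifies as whitespace. Which characters Hugo
// itself treats as whitespace is not stated in this thread.
func runeCountNoSpace(s string) int {
	n := 0
	for _, r := range s {
		if !unicode.IsSpace(r) {
			n++
		}
	}
	return n
}

func main() {
	fmt.Println(runeCountNoSpace("田中 太郎"))       // ASCII space: 4
	fmt.Println(runeCountNoSpace("田中\u3000太郎"))  // double-byte space: 4
	fmt.Println(runeCountNoSpace("田中\n\n\t太郎")) // newlines and a tab: 4
}
```

All three inputs give the same count, so paragraph breaks and either kind of space in a name do not skew the result.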

tychoish pushed a commit to tychoish/hugo that referenced this issue Aug 13, 2017
Do not create it unless used.

See gohugoio#1266
bep added a commit that referenced this issue Jun 8, 2021
e47b6c33a Hugo Modules plural typo (#1266)
@github-actions

This issue has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

github-actions bot locked as resolved and limited conversation to collaborators Apr 14, 2022