This repository has been archived by the owner on Feb 25, 2023. It is now read-only.

Support for frequency lists with readings #382

Closed
Geniusssmit opened this issue Mar 1, 2020 · 15 comments · Fixed by #450

Comments

@Geniusssmit

Is it possible to import a frequency list similar to this:

[["方","ホウ",1],
["方","カタ",2],
["明日","アス",3],
["明日","アシタ",4]]

That way, different readings of the same word would show different frequency information.
If that's not possible for now, consider this my "feature request".

@toasted-nutbread
Collaborator

It currently isn't, but support can be added. There is a workaround for this issue for supporting pitch accents (#61), and a similar approach could be taken for frequency lists.

However, is there a data source available which has this? A new dictionary would have to be created and imported using such data.

@Geniusssmit
Author

Absolutely! I have very good frequency lists and just wanted to import them. I can also create them myself.
For example, here is a list created by analyzing all Japanese Netflix subtitles: https://mega.nz/#!gTgTDb7b!a1DGu0gk1d1BqAPNY7XQ2nrRvrBUVV1Ql6hH01aKOAA

or that one

frequency.txt

@siikamiika
Collaborator

@Geniusssmit Out of curiosity, when was that list created? I think Netflix CJK subtitles are currently served as images and require OCR to get the text out of them. I'd be very interested in a way to still download them as text.

@Geniusssmit
Author

> @Geniusssmit Out of curiosity, when was that list created? I think Netflix CJK subtitles are currently served as images and require OCR to get the text out of them. I'd be very interested in a way to still download them as text.

February 13, 2019. You can download all Japanese Netflix subtitles from here: https://mega.nz/#F!hgBW0QID!IsLg3YRSdfBJkjkJtzbDJQ

@Geniusssmit
Author

They are in SRT format, not OCRed.

@siikamiika
Collaborator

@Geniusssmit Thanks! I think I'll find some use for those.

I assume that the archive was created by some third party before Netflix switched to image-based subtitles, as the Kodi plugin introduced in this video doesn't work anymore: https://www.youtube.com/watch?v=i2SudOnkiuc. The fork (?) linked in the video has been removed from GitHub, but the author seems to be working on a project that can be used to OCR newer Netflix subtitles: https://github.com/Zarxrax/png2srt.

@Geniusssmit
Author

That's cool. I'm actually interested in how pitch accent and readings would be implemented.

@toasted-nutbread
Collaborator

It's in progress, but the main difference is that there is more metadata used to specify the reading for each term/expression. The feature is nearing completion, so after that's done, adding this should be simple.

@toasted-nutbread
Collaborator

See: #385

@toasted-nutbread
Collaborator

I have created an initial version setting up what you requested based on the pitch accent structure in my frequency-improvements branch, specifically commit toasted-nutbread@20de591.

For the example in your opening post, the dictionary data would look like this:

[
  ["方", "freq", {"reading": "ほう", "frequency": 1}],
  ["方", "freq", {"reading": "かた", "frequency": 2}],
  ["明日", "freq", {"reading": "あす", "frequency": 3}],
  ["明日", "freq", {"reading": "あした", "frequency": 4}]
]

Note that the reading is expected to be in hiragana rather than katakana, except when the source term is partially or fully katakana. For example:

[
  ["アイゴ属", "freq", {"reading": "アイゴぞく", "frequency": 1}],
  ["あいご属", "freq", {"reading": "あいごぞく", "frequency": 2}] // not a real word
]
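
A minimal sketch of how a katakana-reading list like the one in the opening post might be turned into this format. This is not Yomichan code: the helper names (`normalize_reading`, `build_entries`) are hypothetical, the katakana-to-hiragana step is a plain Unicode offset rather than a library call, and the handling of katakana terms is a simple heuristic based on the rule described above. The long vowel mark ー is left unchanged.

```python
import json

# Katakana ァ..ヶ (U+30A1..U+30F6) sit exactly 0x60 above hiragana ぁ..ゖ.
KATAKANA_TO_HIRAGANA = {cp: cp - 0x60 for cp in range(0x30A1, 0x30F7)}


def to_hiragana(text):
    """Convert katakana characters to hiragana; everything else is kept."""
    return text.translate(KATAKANA_TO_HIRAGANA)


def is_katakana(ch):
    return 0x30A1 <= ord(ch) <= 0x30F6 or ch == "ー"


def katakana_runs(term):
    """Yield each maximal run of katakana characters found in the term."""
    run = ""
    for ch in term:
        if is_katakana(ch):
            run += ch
        else:
            if run:
                yield run
            run = ""
    if run:
        yield run


def normalize_reading(term, katakana_reading):
    """Hiragana reading, except where the term itself is written in katakana.

    Heuristic: convert the whole reading to hiragana, then restore each
    katakana run of the term at the first place it matches in the reading.
    """
    reading = to_hiragana(katakana_reading)
    pos = 0
    for run in katakana_runs(term):
        idx = reading.find(to_hiragana(run), pos)
        if idx == -1:
            continue  # no match: leave this part of the reading in hiragana
        reading = reading[:idx] + run + reading[idx + len(run):]
        pos = idx + len(run)
    return reading


def build_entries(triples):
    """Turn (term, katakana reading, frequency) triples into "freq" entries."""
    return [
        [term, "freq", {"reading": normalize_reading(term, reading), "frequency": freq}]
        for term, reading, freq in triples
    ]


source = [("方", "ホウ", 1), ("方", "カタ", 2), ("明日", "アス", 3), ("明日", "アシタ", 4)]
print(json.dumps(build_entries(source), ensure_ascii=False, indent=2))
```

With this heuristic, `normalize_reading("アイゴ属", "アイゴゾク")` gives `アイゴぞく`, matching the second example above. Terms that mix scripts in less regular ways may need manual checking.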

@Geniusssmit
Author

Cool! Is it possible to use katakana for readings? All tools made for creating frequency lists use katakana for the reading field.

@toasted-nutbread
Collaborator

> Cool! Is it possible to use katakana for readings?

I think the way that Yomichan dictionaries work is that they use hiragana for readings unless the expression contains katakana. So you would probably have to do a conversion for it to work as expected.

@toasted-nutbread
Collaborator

@Geniusssmit This feature is now available on the master branch if you want to use that for testing. The format used is as described in #382 (comment). Let us know if you encounter any issues creating your dictionary data by creating a new issue or reopening.

@Geniusssmit
Author

> So you would probably have to do a conversion for it to work as expected.

I tried many sites, but unfortunately none of them can handle text as large as a frequency list. How can I do that?

@toasted-nutbread
Collaborator

You would probably have to write or use a script to do it. Yomichan internally uses https://github.com/WaniKani/WanaKana, so you could use that.
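
As a rough illustration of the script approach, here is a sketch that converts a large list line by line, so the file size does not matter. It does not use WanaKana; it swaps in a plain Unicode offset conversion instead. The input layout (tab-separated `term`, `reading`, `frequency` per line) and the names `convert_line` and `convert_file` are assumptions for the example, since the actual layout of the list is not shown in this thread. Only the reading column is converted, and the long vowel mark ー is left unchanged.

```python
# Katakana ァ..ヶ (U+30A1..U+30F6) sit exactly 0x60 above hiragana ぁ..ゖ.
KATAKANA_TO_HIRAGANA = {cp: cp - 0x60 for cp in range(0x30A1, 0x30F7)}


def convert_line(line):
    """Convert the reading column (assumed to be the second field) to hiragana."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) >= 2:
        fields[1] = fields[1].translate(KATAKANA_TO_HIRAGANA)
    return "\t".join(fields) + "\n"


def convert_file(src_path, dst_path):
    """Stream the source file so even very large lists use little memory."""
    with open(src_path, encoding="utf-8") as src, \
         open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            dst.write(convert_line(line))


# Example call (hypothetical file names):
# convert_file("frequency_katakana.tsv", "frequency_hiragana.tsv")
```

Entries whose term contains katakana would still need the katakana part of the reading kept as-is, per the note on readings earlier in this thread.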
