Support for frequency lists with readings #382
Comments
It currently isn't, but support can be added. There is a workaround for this issue for supporting pitch accents (#61), and a similar approach could be taken for frequency lists. However, is there a data source available which has this? A new dictionary would have to be created and imported using such data. |
Absolutely! I have very good frequency lists and just want to import them. I can create them myself. or that one |
@Geniusssmit Out of curiosity, when was that list created? I think Netflix CJK subtitles are currently served as images and require OCR to get the text out of them. I'd be very interested in a way to still download them as text. |
February 13, 2019. You can download all the Japanese Netflix subtitles from here: https://mega.nz/#F!hgBW0QID!IsLg3YRSdfBJkjkJtzbDJQ |
srt format, not OCRed |
@Geniusssmit Thanks! I think I'll find some use for those. I assume the archive was created by a third party before Netflix switched to image-based subtitles, since the Kodi plugin introduced in this video doesn't work anymore: https://www.youtube.com/watch?v=i2SudOnkiuc. The fork (?) linked in the video has been removed from GitHub, but the author seems to be working on a project that can be used to OCR newer Netflix subtitles: https://github.com/Zarxrax/png2srt. |
That's cool. I'm actually interested in how pitch accent and readings would be implemented. |
It's in progress, but the main difference is that there is more metadata used to specify the reading for each term/expression. The feature is nearing completion, so after that's done, adding this should be simple. |
See: #385 |
I have created an initial version setting up what you requested based on the pitch accent structure in my frequency-improvements branch, specifically commit toasted-nutbread@20de591. For the example in your opening post, the dictionary data would look like this:
[
    ["方", "freq", {"reading": "ほう", "frequency": 1}],
    ["方", "freq", {"reading": "かた", "frequency": 2}],
    ["明日", "freq", {"reading": "あす", "frequency": 3}],
    ["明日", "freq", {"reading": "あした", "frequency": 4}]
]
Note that the reading is expected to be in hiragana rather than katakana, except when the source term is partially or fully katakana. For example:
[
    ["アイゴ属", "freq", {"reading": "アイゴぞく", "frequency": 1}],
    ["あいご属", "freq", {"reading": "あいごぞく", "frequency": 2}] // not a real word
]
 |
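For anyone generating this data from an existing list, the row format described in the comment above could be produced along these lines. This is only a minimal sketch in Python: the function name `build_frequency_rows` and the input tuple layout are made up for illustration, and the name and location of the output file inside a dictionary archive are not covered here.

```python
import json


def build_frequency_rows(entries):
    """Turn (term, reading, frequency) tuples into rows shaped like
    [term, "freq", {"reading": ..., "frequency": ...}].

    Hypothetical helper; the tuple layout is an assumption.
    """
    rows = []
    for term, reading, frequency in entries:
        rows.append([term, "freq", {"reading": reading, "frequency": frequency}])
    return rows


# Same sample data as in the comment above.
entries = [
    ("方", "ほう", 1),
    ("方", "かた", 2),
    ("明日", "あす", 3),
    ("明日", "あした", 4),
]

rows = build_frequency_rows(entries)

# ensure_ascii=False keeps the Japanese text readable instead of \u escapes.
bank_json = json.dumps(rows, ensure_ascii=False, indent=2)
print(bank_json)
```

The readings passed in are assumed to already follow the hiragana/katakana rule stated above; no conversion happens in this sketch.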
Cool! Is it possible to use katakana for readings? All the tools made for creating frequency lists use katakana for the reading field. |
I think the way that Yomichan dictionaries work is that they use hiragana for readings unless the expression contains katakana. So you would probably have to do a conversion for it to work as expected. |
@Geniusssmit This feature is now available on the master branch if you want to use that for testing. The format used is as described in #382 (comment). Let us know if you encounter any issues creating your dictionary data by creating a new issue or reopening. |
I tried many sites, but unfortunately none of them can handle a text as big as a frequency list. How can I do the conversion? |
You would probably have to write or use a script to do it. Yomichan internally uses https://github.com/WaniKani/WanaKana, so you could use that. |
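As a rough idea of what such a script might look like, here is a small Python sketch. It does not use WanaKana; it is a stand-in that relies on plain code point arithmetic (katakana U+30A1 to U+30F6 sit exactly 0x60 above their hiragana counterparts). The function names are hypothetical, and the step that restores katakana for terms containing katakana is a simple heuristic that could misfire on unusual entries.

```python
import re

# Runs of katakana in the term, including the prolonged sound mark (U+30FC).
KATAKANA_RUN = re.compile(r"[\u30A1-\u30FA\u30FC]+")


def katakana_to_hiragana(text):
    """Shift katakana U+30A1..U+30F6 down by 0x60; leave everything else alone."""
    out = []
    for ch in text:
        code = ord(ch)
        if 0x30A1 <= code <= 0x30F6:
            out.append(chr(code - 0x60))
        else:
            out.append(ch)
    return "".join(out)


def normalize_reading(term, katakana_reading):
    """Convert a katakana reading to hiragana, keeping katakana for the
    parts of the reading that correspond to katakana in the term.

    Heuristic (an assumption, not Yomichan's own logic): each katakana run
    in the term is put back in place of its first hiragana match.
    """
    reading = katakana_to_hiragana(katakana_reading)
    for run in KATAKANA_RUN.findall(term):
        reading = reading.replace(katakana_to_hiragana(run), run, 1)
    return reading


print(normalize_reading("明日", "アシタ"))        # あした
print(normalize_reading("アイゴ属", "アイゴゾク"))  # アイゴぞく
```

Because this reads the list line by line in a local script, file size is not a concern the way it is with web-based converters.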
Is it possible to import a frequency list similar to:
So that different readings of compound words would show different frequency information.
If it's not possible for now, that's my "feature request".