Support for frequency lists with readings #382

Geniusssmit · 2020-03-01T13:23:03Z

Is it possible to import a frequency list similar to this:

[["方","ホウ",1],
["方","カタ",2],
["明日","アス",3],
["明日","アシタ",4]]

That way, different readings of the same word would show different frequency information.
If that isn't possible for now, consider this my feature request.

toasted-nutbread · 2020-03-01T15:11:07Z

It currently isn't, but support can be added. There is a workaround for this issue for supporting pitch accents (#61), and a similar approach could be taken for frequency lists.

However, is there a data source available which has this? A new dictionary would have to be created and imported using such data.

Geniusssmit · 2020-03-01T17:26:44Z

Absolutely! I have very good frequency lists and just want to import them. I can also create them myself.
For example, here is a list created by analyzing all Japanese Netflix subtitles: https://mega.nz/#!gTgTDb7b!a1DGu0gk1d1BqAPNY7XQ2nrRvrBUVV1Ql6hH01aKOAA

or this one:

frequency.txt

siikamiika · 2020-03-01T17:32:07Z

@Geniusssmit Out of curiosity, when was that list created? I think Netflix CJK subtitles are currently served as images and require OCR to get the text out of them. I'd be very interested in a way to still download them as text.

Geniusssmit · 2020-03-01T17:43:08Z

> @Geniusssmit Out of curiosity, when was that list created? I think Netflix CJK subtitles are currently served as images and require OCR to get the text out of them. I'd be very interested in a way to still download them as text.

February 13, 2019. You can download all Netflix Japanese subtitles from here: https://mega.nz/#F!hgBW0QID!IsLg3YRSdfBJkjkJtzbDJQ

Geniusssmit · 2020-03-01T17:44:18Z

They are in SRT format, not OCRed.

siikamiika · 2020-03-01T18:19:23Z

@Geniusssmit Thanks! I think I'll find some use for those.

I assume that the archive was created by some third party before Netflix switched to image-based subtitles, as the Kodi plugin introduced in this video doesn't work anymore: https://www.youtube.com/watch?v=i2SudOnkiuc The fork (?) linked in the video has been removed from GitHub, but the author seems to be working on a project that can be used to OCR newer Netflix subtitles: https://github.com/Zarxrax/png2srt

Geniusssmit · 2020-03-01T18:21:10Z

That's cool. I'm actually interested in how pitch accent and readings would be implemented.

toasted-nutbread · 2020-03-01T18:32:47Z

It's in progress, but the main difference is that there is more metadata used to specify the reading for each term/expression. The feature is nearing completion, so after that's done, adding this should be simple.

toasted-nutbread · 2020-03-01T21:15:07Z

See: #385

toasted-nutbread · 2020-03-02T01:17:29Z

I have created an initial version setting up what you requested based on the pitch accent structure in my frequency-improvements branch, specifically commit toasted-nutbread@20de591.

For the example in your opening post, the dictionary data would look like this:

[
  ["方", "freq", {"reading": "ほう", "frequency": 1}],
  ["方", "freq", {"reading": "かた", "frequency": 2}],
  ["明日", "freq", {"reading": "あす", "frequency": 3}],
  ["明日", "freq", {"reading": "あした", "frequency": 4}]
]

Note that the reading is expected to be in hiragana rather than katakana, except when the source term is partially or fully katakana. For example:

[
  ["アイゴ属", "freq", {"reading": "アイゴぞく", "frequency": 1}],
  ["あいご属", "freq", {"reading": "あいごぞく", "frequency": 2}] // not a real word
]
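As a rough sketch of how a list in the opening post's `[term, reading, frequency]` layout could be reshaped into the structure above, something like the following Python might work. The function name `to_freq_entries` is made up for this illustration and is not part of Yomichan, and the sketch assumes the readings have already been written in the kana form described above.

```python
import json


def to_freq_entries(rows):
    """Reshape [term, reading, frequency] rows into
    [term, "freq", {"reading": ..., "frequency": ...}] entries.

    Hypothetical helper for illustration; readings are passed through as-is.
    """
    return [
        [term, "freq", {"reading": reading, "frequency": frequency}]
        for term, reading, frequency in rows
    ]


# Sample rows mirroring the example from the opening post,
# with readings already written in hiragana.
rows = [
    ["方", "ほう", 1],
    ["方", "かた", 2],
    ["明日", "あす", 3],
    ["明日", "あした", 4],
]

entries = to_freq_entries(rows)

# ensure_ascii=False keeps the Japanese text readable in the output.
print(json.dumps(entries, ensure_ascii=False, indent=2))
```

The printed JSON has the same shape as the first example in this comment.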

Geniusssmit · 2020-03-02T06:41:27Z

Cool! Is it possible to use katakana for readings? All the tools made for creating frequency lists use katakana for the reading field.

toasted-nutbread · 2020-03-08T15:52:56Z

> Cool! Is it possible to use katakana for readings?

I think the way that Yomichan dictionaries work is that they use hiragana for readings unless the expression contains katakana. So you would probably have to do a conversion for it to work as expected.

toasted-nutbread · 2020-04-18T18:20:43Z

@Geniusssmit This feature is now available on the master branch if you want to use that for testing. The format used is as described in #382 (comment). Let us know if you encounter any issues creating your dictionary data by creating a new issue or reopening.

Geniusssmit · 2020-06-29T08:16:35Z

> So you would probably have to do a conversion for it to work as expected.

I tried many sites, but unfortunately none of them can handle text as large as a frequency list. How can I do the conversion?

toasted-nutbread · 2020-06-29T23:01:52Z

You would probably have to write or use a script to do it. Yomichan internally uses https://github.com/WaniKani/WanaKana, so you could use that.
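If pulling in WanaKana is not convenient, a minimal standalone sketch in Python is also possible. It relies on the fact that the katakana letters U+30A1 to U+30F6 sit exactly 0x60 code points above their hiragana counterparts. The function names here are hypothetical, and the output may differ from what WanaKana produces (for example, the long vowel mark ー is left untouched). Terms that contain katakana themselves are left unchanged, since their readings are expected to keep katakana in the matching positions and would need separate handling.

```python
# Katakana letters U+30A1..U+30F6 map onto hiragana U+3041..U+3096
# by subtracting a fixed offset of 0x60.
KATAKANA_START = 0x30A1
KATAKANA_END = 0x30F6
KANA_OFFSET = 0x60


def katakana_to_hiragana(text):
    """Shift every katakana letter to hiragana; leave other characters alone."""
    return "".join(
        chr(ord(ch) - KANA_OFFSET)
        if KATAKANA_START <= ord(ch) <= KATAKANA_END
        else ch
        for ch in text
    )


def contains_katakana(text):
    """Return True if any character is a katakana letter."""
    return any(KATAKANA_START <= ord(ch) <= KATAKANA_END for ch in text)


def normalize_reading(term, reading):
    """Convert the reading to hiragana unless the term itself has katakana.

    Readings for katakana-containing terms are returned unchanged here,
    so they can be reviewed or handled separately.
    """
    if contains_katakana(term):
        return reading
    return katakana_to_hiragana(reading)
```

For example, `normalize_reading("明日", "アシタ")` gives `"あした"`. Because the conversion is a single pass over each string, the size of the frequency list should not be a problem.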

toasted-nutbread added this to To do in Database improvements Mar 1, 2020

toasted-nutbread mentioned this issue Mar 8, 2020

Technical documentation #403

Closed

toasted-nutbread mentioned this issue Apr 12, 2020

Add support for filtering frequency metadata based on readings #450

Merged

toasted-nutbread moved this from To do to In progress in Database improvements Apr 12, 2020

toasted-nutbread closed this as completed in #450 Apr 18, 2020

Database improvements automation moved this from In progress to Done Apr 18, 2020

toasted-nutbread mentioned this issue Apr 18, 2020

Changelog #376

Closed

Thermospore mentioned this issue Sep 21, 2020

Importing a frequency list with readings #855

Closed
