The exceptions collection does not include full list #14

Open
Mifrill opened this issue Sep 23, 2021 · 2 comments

Comments

@Mifrill
Contributor

Mifrill commented Sep 23, 2021

The assignment w, s = line.split(/\s+/) keeps only the first two fields of each line, even when the line contains three:

      open_file(exc) do |io|
        io.each_line do |line|
          w, s = line.split(/\s+/)
          @exceptions[pos][w] ||= []
          @exceptions[pos][w] << s
        end
      end

For example:

zamindaris zamindari zemindari

      open_file(exc) do |io|
        io.each_line do |line|
          w, s = line.split(/\s+/)
          if line =~ /zamin/
            puts line
            puts w
            puts s
          end
          @exceptions[pos][w] ||= []
          @exceptions[pos][w] << s
        end
      end

# => 
# zamindaris zamindari zemindari
# zamindaris
# zamindari

The third field, zemindari, is never stored. Is this a bug?
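
For illustration, one way to keep every form on a line is a splat in the destructuring assignment. This is only a minimal sketch: parse_exception_line is a hypothetical helper, not something that exists in the gem, and it assumes each line is whitespace-separated with the surface form first.

```ruby
# Hypothetical helper (not part of the gem): split one line of an
# exceptions file into the surface form and ALL of its base forms.
def parse_exception_line(line)
  # The splat collects every field after the first, not just the second.
  w, *forms = line.split(/\s+/)
  [w, forms]
end

w, forms = parse_exception_line("zamindaris zamindari zemindari\n")
puts w            # surface form
puts forms.inspect # every base form found on the line
```

With this shape, the loop body would append with concat(forms) instead of << s, so the third field is no longer dropped.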

@yohasebe
Owner

Yes, there can be two or more tokenized forms corresponding to a single surface form. In fact, the default dictionary file does contain a few entries having such alternatives. However, I thought it would be more convenient to get the result as a single tokenized form rather than an array. This decision may not sound very thoughtful--especially if you have your own dict files that have many entries with multiple tokenized forms. Still, I would like to keep it this way so as not to break the existing code of many users. Thanks anyway!
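
As a rough sketch of how both goals might coexist, the table could store every form while the default lookup still returns a single one, with the array available only on request. Everything here is an assumption made for illustration: ExceptionTable, add_line, lookup, and the all: keyword are hypothetical names, and the gem's real lookup path may work differently.

```ruby
# Hypothetical sketch; ExceptionTable and its methods are not part of the gem.
class ExceptionTable
  def initialize
    # One hash of surface form => base forms per part of speech.
    @exceptions = Hash.new { |h, pos| h[pos] = {} }
  end

  # Store every base form found on the line, not only the first.
  def add_line(pos, line)
    w, *forms = line.split(/\s+/)
    return if w.nil? || w.empty? # skip blank lines or lines with leading whitespace
    (@exceptions[pos][w] ||= []).concat(forms)
  end

  # By default return a single form, as existing callers would expect;
  # pass all: true to opt in to the full array.
  def lookup(pos, word, all: false)
    forms = @exceptions[pos][word]
    return nil if forms.nil?
    all ? forms.dup : forms.first
  end
end

table = ExceptionTable.new
table.add_line(:noun, "zamindaris zamindari zemindari\n")
puts table.lookup(:noun, "zamindaris")                    # single form
puts table.lookup(:noun, "zamindaris", all: true).inspect # all forms
```

Because the default return value stays a single string, existing callers would be unaffected under these assumptions.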

@Mifrill
Contributor Author

Mifrill commented Oct 14, 2021

> I would like to keep it this way so as not to break the existing code of many users.

Well, if this behavior is not quite what users would expect and we can improve it, maybe we could consider a release that includes this kind of "breaking" change?
Something like "break the existing code of many users" is exactly what the gem version should guard against.
