Different output formats possible via constructor arguments? #99

Closed
jzohrab opened this issue Oct 1, 2023 · 1 comment

jzohrab commented Oct 1, 2023

Hello, thank you very much for your work on this project. I'm using MeCab for a language-learning program, and would like to use this library if possible.

The mecab binary accepts arguments that change its output format. For example:

$ mecab -F %m\\t%t\\t%h\\n -U %m\\t%t\\t%h\\n -E EOP\\t3\\t7\\n
太郎はこの本を女性に渡した。
太郎	2	44
は	6	16
この	6	68
本	2	38
を	6	13
女性	2	38
に	6	13
渡し	2	31
た	6	25
。	3	7
EOP	3	7

Is there a way to get the same output with this Python library? I tried a few obvious things, e.g.

import MeCab
t = MeCab.Tagger('-F %m\\t%t\\t%h\\n -U %m\\t%t\\t%h\\n -E EOP\\t3\\t7\\n -r ./mecabrc_dummy.txt -d ./.venv/lib/python3.11/site-packages/unidic_lite/dicdir')   # also tried single \ instead of \\
sentence = "太郎はこの本を女性に渡した。"
print(t.parse(sentence))

but this still outputs the same as the default Tagger output:

$ python main.py 
太郎	タロー	タロウ	タロウ	名詞-固有名詞-人名-名			1
は	ワ	ハ	は	助詞-係助詞			
この	コノ	コノ	此の	連体詞			0
...
渡し	ワタシ	ワタス	渡す	動詞-一般	五段-サ行	連用形-一般	0
た	タ	タ	た	助動詞	助動詞-タ	終止形-一般	
。			。	補助記号-句点			
EOS

I edited unidic_lite/dicdir/dicrc:

output-format-type = custom

; output custom - new three-column output
node-format-custom = %m\t%t\t%h\n
unk-format-custom  = %m\t%t\t%h\n
bos-format-custom  =
eos-format-custom  = EOP\t3\t7\n

With that, the output was more or less what I expected (the third column is different, but that doesn't matter):

$ python main.py 
太郎	2	1
は	6	1
この	6	1
本	2	1
...
た	6	1
。	3	1
EOP	3	7
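
The same dicrc edit can be scripted rather than made by hand. The following is a minimal sketch, not part of the original report: `apply_settings` and `CUSTOM_FORMAT` are hypothetical names, and the function only rewrites text, so how MeCab treats the resulting file is assumed to match the hand-edited version above.

```python
import re

# The settings from the hand-edited dicrc above. Raw strings keep the
# backslashes literal, which is how they appear in the config file.
CUSTOM_FORMAT = {
    "output-format-type": "custom",
    "node-format-custom": r"%m\t%t\t%h\n",
    "unk-format-custom": r"%m\t%t\t%h\n",
    "bos-format-custom": "",
    "eos-format-custom": r"EOP\t3\t7\n",
}


def apply_settings(dicrc_text: str, settings: dict) -> str:
    """Return dicrc_text with each key in settings set to its value.

    The first existing "key = value" line for a key is rewritten in place;
    keys that are missing are appended at the end. Comment lines (starting
    with ';') and unrelated settings are left untouched.
    """
    remaining = dict(settings)
    out = []
    for line in dicrc_text.splitlines():
        match = re.match(r"\s*([\w-]+)\s*=", line)
        key = match.group(1) if match else None
        if key in remaining:
            out.append(f"{key} = {remaining.pop(key)}".rstrip())
        else:
            out.append(line)
    # Append any settings that had no existing line in the file.
    out.extend(f"{key} = {value}".rstrip() for key, value in remaining.items())
    return "\n".join(out) + "\n"
```

Applying this to a copy of the dictionary directory and pointing `-d` at the copy would avoid changing files inside site-packages; that workflow is a suggestion and was not tested in this thread.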

I also tried unidic instead of unidic_lite:

t = MeCab.Tagger('-r ./mecabrc_dummy.txt -d ./.venv/lib/python3.11/site-packages/unidic/dicdir -F %m\\t%t\\t%h\\n -U %m\\t%t\\t%h\\n -E EOP\\t3\\t7\\n')

and got the default unidic output:

太郎	名詞,固有名詞,人名,名,,,タロウ,タロウ,太郎,タロー,太郎,タロー,固,"","","","","","",名,タロウ,タロウ,タロウ,タロウ,"1","","",6252931250790912,22748
は	助詞,係助詞,,,,,ハ,は,は,ワ,は,ワ,和,"","","","","","",係助,ハ,ハ,ハ,ハ,"","動詞%F2@0,名詞%F1,形容詞%F2@-1","",8059703733133824,29321
この	連体詞,,,,,,コノ,此の,この,コノ,この,コノ,和,"","","","","","",相,コノ,コノ,コノ,コノ,"0","","",3547308012741120,12905
...
。	補助記号,句点,,,,,,。,。,,。,,記号,"","","","","","",補助,,,,,"","","",6880571302400,25
EOS

Thank you again!

polm (Collaborator) commented Oct 2, 2023

This is not possible due to a long-standing issue in MeCab that causes the UniDic config file to take precedence over the format arguments. Your command-line version only works because your config (presumably IPAdic) doesn't specify a default format. I submitted a PR to fix this six years ago but never received any response.

taku910/mecab#38

However, instead of using MeCab's rather arcane format syntax, I suggest using fugashi's structured Node objects to build the formatted output yourself; it should be much easier.

https://github.com/polm/fugashi
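
A minimal sketch of that approach, assuming fugashi and a UniDic dictionary such as unidic-lite are installed. It relies on fugashi nodes exposing `surface`, `char_type` and `posid`, which should correspond to MeCab's `%m`, `%t` and `%h`. The helper names `format_nodes` and `parse_three_columns` are made up for illustration, and the POS ids depend on the dictionary, so the third column will likely differ from the IPAdic values in the original example.

```python
from typing import Iterable


def format_nodes(nodes: Iterable, eos: str = "EOP\t3\t7") -> str:
    """Render nodes as surface<TAB>char_type<TAB>posid lines plus an end marker.

    Each node only needs .surface, .char_type and .posid attributes, so this
    works with fugashi nodes or with any stand-in object that has them.
    """
    lines = [f"{n.surface}\t{n.char_type}\t{n.posid}" for n in nodes]
    lines.append(eos)
    return "\n".join(lines) + "\n"


def parse_three_columns(sentence: str) -> str:
    """Tokenize a sentence with fugashi and format it as three columns."""
    # Imported here so format_nodes stays usable without fugashi installed.
    from fugashi import Tagger

    tagger = Tagger()
    # Calling the tagger returns a list of node objects for the sentence.
    return format_nodes(tagger(sentence))
```

Calling `print(parse_three_columns("太郎はこの本を女性に渡した。"), end="")` should then print one line per token followed by the `EOP` line, without touching dicrc or passing `-F`/`-U`/`-E` to the tagger.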

polm closed this as completed Oct 9, 2023