forked from dmcc/bllip-parser
Handling quotes #58
Comments
I think this is happening because parse_tagged() needs pretokenized text
and BLLIP's tokenizer replaces quotes with their two-backtick and
two-single-quote variants (this is how they're encoded in PTB format).
We could make parse_tagged call tokenize() on its input, but I thought it
would be safer for users to call it first to make sure they knew what their
sentence would look like after tokenization.
…On Wed, Jul 12, 2017 at 12:16 AM, jofatmofn ***@***.***> wrote:
Given the text
John said, "Welcome to the heaven".
rrp.simple_parse gives
(S1 (S (NP (NNP John)) (VP (VBD said) (, ,) (`` ``) (INTJ (UH Welcome) (PP (TO to) (NP (DT the) (NN heaven)))) ('' '')) (. .)))
If I use rrp.parse_tagged with the following tokens and postags
tokens=[u'John', u'said', u',', u'"', u'Welcome', u'to', u'the', u'heaven', u'"', u'.']
postags={0: u'NNP', 1: u'VBD', 2: u',', 3: u'``', 4: u'UH', 5: u'TO', 6: u'DT', 7: u'NN', 8: u"''", 9: u'.'}
it returns an empty list.
Workaround: In tokens, if I change the opening double quote to two
backticks and the closing double quote to two apostrophes, as
tokens=[u'John', u'said', u',', u'``', u'Welcome', u'to', u'the', u'heaven', u"''", u'.']
then it works.
|
Sure. Thanks. Could you please direct me to any reference (document or code) that describes such replacements? I need to use tokens and POS tags from another parser, and I can apply these replacements before calling BLLIP. |
I think this more or less covers it:
ftp://ftp.cis.upenn.edu/pub/treebank/public_html/tokenization.html
There's no strict standard and each parser may interpret some edge cases
slightly differently, but the main things to note for using
rrp.parse_tagged are how quotes, apostrophes, and brackets are handled.
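For tokens that come from another tokenizer, a small normalization pass before calling rrp.parse_tagged might look like the sketch below. The helper name to_ptb_tokens is hypothetical and not part of bllipparser. It assumes that straight double quotes alternate between opening and closing, and it only covers quotes and brackets (mapped to the usual PTB escapes such as -LRB- and -RRB-); apostrophes and contractions are not handled, and BLLIP's own tokenizer may treat some edge cases differently.

```python
# Hypothetical helper (not part of bllipparser): rewrite quote and bracket
# tokens into their Penn Treebank forms before passing them to parse_tagged.
PTB_BRACKETS = {
    '(': '-LRB-', ')': '-RRB-',
    '[': '-LSB-', ']': '-RSB-',
    '{': '-LCB-', '}': '-RCB-',
}


def to_ptb_tokens(tokens):
    """Return a new token list with PTB-style quotes and brackets.

    Assumption: straight double quotes alternate open/close within the
    sentence, so the first becomes `` and the next becomes ''.
    """
    out = []
    inside_quote = False
    for tok in tokens:
        if tok == '"':
            out.append("''" if inside_quote else '``')
            inside_quote = not inside_quote
        elif tok in PTB_BRACKETS:
            out.append(PTB_BRACKETS[tok])
        else:
            out.append(tok)
    return out


tokens = ['John', 'said', ',', '"', 'Welcome', 'to', 'the', 'heaven', '"', '.']
ptb_tokens = to_ptb_tokens(tokens)
print(ptb_tokens)
```

Because the token count does not change, the index-keyed postags dictionary from the original report can be reused as-is with the normalized tokens.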
|
Thanks. |