
Add support for other NLP preprocessors (CoreNLP, nltk) #1418

Open
bhancock8 opened this issue Aug 17, 2019 · 3 comments
Labels
feature request · help wanted · no-stale (auto-stale bot skips this issue)

Comments

@bhancock8
Member

spaCy is great as a preprocessor for NLP labeling functions, but there are other libraries that users may want to use.

Ideally, we'd like to have wrappers for other packages as well, such as Stanford CoreNLP (https://stanfordnlp.github.io/stanfordnlp/) and NLTK (https://www.nltk.org/). We can pattern-match on the SpacyPreprocessor, then ultimately give the nlp_labeling_function decorator a keyword argument where the user can specify which preprocessor to use.
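To make the proposal concrete, here is a minimal sketch of what such a keyword argument could look like: a registry of preprocessor wrappers plus a `preprocessor=` keyword on the labeling-function decorator. All names below are illustrative, not Snorkel's actual API, and the real wrappers (spaCy, NLTK, CoreNLP) are stubbed with `str.split` so the sketch stays self-contained.

```python
from types import SimpleNamespace

def stub_spacy(x):
    x.doc = x.text.split()      # a real wrapper would attach a spacy.Doc
    return x

def stub_nltk(x):
    x.tokens = x.text.split()   # a real wrapper would use nltk.word_tokenize
    return x

# Hypothetical registry mapping a name to a preprocessor wrapper.
PREPROCESSORS = {"spacy": stub_spacy, "nltk": stub_nltk}

def nlp_labeling_function(preprocessor="spacy"):
    """Hypothetical decorator: run the chosen preprocessor before the LF."""
    proc = PREPROCESSORS[preprocessor]
    def decorator(f):
        def lf(x):
            return f(proc(x))
        return lf
    return decorator

@nlp_labeling_function(preprocessor="nltk")
def long_sentence(x):
    # Fires (returns 1) when the NLTK-style wrapper found more than 3 tokens.
    return 1 if len(x.tokens) > 3 else 0
```

The key design point is that switching the `preprocessor=` argument only works if the labeling function reads fields that the chosen wrapper actually produces, which is exactly the tradeoff discussed below.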

@cyrilou242

cyrilou242 commented Oct 1, 2019

Hello, I may have a need for this in the future. I may get some time to contribute.

I am not sure what you mean by 'pattern matching' on the SpacyPreprocessor: do we want to rebuild a spacy.Doc-like object to ensure some compatibility at the tf-definition level?

For instance, when defining a transformation function like this for spaCy:

from snorkel.preprocess.nlp import SpacyPreprocessor
from snorkel.augmentation import transformation_function

spacy_proc = SpacyPreprocessor(text_field="text", doc_field="doc")

@transformation_function(pre=[spacy_proc])
def swap_adjectives(x):
    adjective_idxs = [i for i, token in enumerate(x.doc) if token.pos_ == "ADJ"]
    ...  # modify adjectives
    return x

Let's say we want to replace spacy_proc with an nltk_proc or a stanfordnlp_proc.

  • option 1: token.pos_ exists for all NLP preprocessors; changing the preprocessor in the transformation above does not break the function
  • option 2: each processor builds its own objects, and we assume the user knows what each processor produces; changing the preprocessor in the transformation above breaks the function, because .pos_ belongs to spaCy

In any case, NLTK does not have a pipeline concept (correct me if I'm wrong), so one would have to be specified.
StanfordNLP, on the contrary, does have a pipeline, and it is different from spaCy's.
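To illustrate what option 2 means in practice, here is a self-contained sketch (all names hypothetical, taggers stubbed with toy heuristics rather than real spaCy/NLTK calls): each wrapper attaches its own output field, so a function written for one wrapper cannot be reused with the other.

```python
from types import SimpleNamespace

def spacy_style_proc(x):
    # A real wrapper would set x.doc to a spacy.Doc; stubbed token objects here.
    x.doc = [SimpleNamespace(text=t, pos_="ADJ" if t.endswith("y") else "NOUN")
             for t in x.text.split()]
    return x

def nltk_style_proc(x):
    # A real wrapper would use: x.pos_tags = nltk.pos_tag(nltk.word_tokenize(x.text))
    x.pos_tags = [(t, "JJ" if t.endswith("y") else "NN") for t in x.text.split()]
    return x

# A tf written for the spaCy-style wrapper reads .doc and token.pos_ ...
def adjective_idxs_spacy(x):
    return [i for i, tok in enumerate(x.doc) if tok.pos_ == "ADJ"]

# ... while one written for the NLTK-style wrapper reads .pos_tags and Penn tags.
def adjective_idxs_nltk(x):
    return [i for i, (tok, tag) in enumerate(x.pos_tags) if tag == "JJ"]
```

Under option 1, the two wrappers would instead have to emit one agreed-upon interface (e.g. both producing tokens with a .pos_ attribute), which is what makes processors swappable but requires maintaining adapters.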

Maybe you have already thought about the design of this.
I think going with option 2 would be better, to avoid having to maintain 'adapters' to the spaCy Doc format, but switching between NLP preprocessors will be difficult. I guess you would go for option 2, but I am a bit biased toward option 1, because I'm building a repo of text transformation functions, and being able to switch from one processor to another without breaking my tfs would be cool.

Let me know if there are any other important points to consider.

EDITED: I mistakenly switched option 1 and option 2 at the end of the last paragraph.

@bhancock8
Member Author

Hi @cyrilou242, thanks for your post! I think you've outlined the tradeoffs well. Each preprocessor produces potentially different fields, and even fields with the same high-level "type" (such as NER tags) can differ between processors in tag cardinality and tag definitions. I therefore think option 2 is the safer choice: in your function, you use the field names specific to the preprocessor you chose, and we assume you've done your homework to understand what each field means.
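One concrete instance of this mismatch: spaCy's token.pos_ uses the Universal POS tag set (a single "ADJ" tag), while NLTK's default pos_tag uses Penn Treebank tags ("JJ", "JJR", "JJS"), so even "is this token an adjective?" is answered differently per processor. A small sketch (the tagset knob is hypothetical, not a library API):

```python
# Universal POS (spaCy token.pos_) vs. Penn Treebank (NLTK default pos_tag):
# the same linguistic category has different cardinality and surface tags.
SPACY_ADJ_TAGS = {"ADJ"}
PENN_ADJ_TAGS = {"JJ", "JJR", "JJS"}

def is_adjective(tag, tagset="spacy"):
    # tagset is an illustrative parameter for this sketch only
    return tag in (SPACY_ADJ_TAGS if tagset == "spacy" else PENN_ADJ_TAGS)
```

Under option 2, this mapping lives in the user's function, written against the tag set of whichever preprocessor they chose.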

If you're able to find the time to give this a shot, feel free to post intermediate thoughts here along the way so we can talk over any additional design considerations like the one you brought up; it's much easier to have these conversations early!

@cyrilou242

cyrilou242 commented Oct 2, 2019

Thanks for the reply, I agree with option 2.
I'll get back to this in a few weeks; I need to dig a bit more into StanfordNLP 0.2.0, which is quite new.

@vincentschen vincentschen added the no-stale Auto-stale bot skips this issue label Nov 18, 2019