Add extensions for use cases from DKPro Core and cTAKES to the CAS interface #83

Open
zesch opened this issue Nov 4, 2019 · 37 comments

@zesch
Member

zesch commented Nov 4, 2019

It would be nice to be able to initialize a CAS with a certain type system, e.g.

from somewhere import DKProCoreTypeSystem
from cassis import Cas

cas = Cas(DKProCoreTypeSystem())
@jcklie
Collaborator

jcklie commented Nov 4, 2019

Do you have a typesystem xml which I can use for that?

@zesch
Member Author

zesch commented Nov 4, 2019

You mean for testing purposes?

I think it would be the job of the other library to provide the initializer. Although for DKPro, Cassis could also provide it directly :)

@reckart
Member

reckart commented Nov 4, 2019

The idea with the initializer is a bit more than just load type system X.

The idea is that the initializer patches the CAS instance with additional methods, e.g.

cas.get_tokens()
cas.get_tokens_as_text()
cas.get_sentences()
cas.get_sentences_as_text()
cas.get_pos_tags()
cas.get_named_entities()
...

... and that we could e.g. have one for DKPro Core and another one for, say, cTAKES, and both would patch the CAS with the same convenience methods but internally resort to different select statements.

The initializer would work like a visitor, e.g.

Cas(DKProCoreTypeSystem()) triggers a call to DKProCoreTypeSystem.apply(cas).
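
A minimal sketch of this visitor idea, not actual cassis code: the `Cas` class below is a small stand-in, the `initializer` parameter and the `add` helper are hypothetical, and the cTAKES type name is an assumption used only for illustration. Each initializer's `apply(cas)` binds the same convenience method but selects a different type internally.

```python
import types


class Cas:
    """Stand-in for cassis.Cas with just enough behaviour for this sketch."""

    def __init__(self, initializer=None):
        self._annotations = []  # list of (type name, covered text)
        if initializer is not None:
            # Visitor step: the initializer patches this instance.
            initializer.apply(self)

    def add(self, type_name, text):
        self._annotations.append((type_name, text))

    def select(self, type_name):
        return [text for name, text in self._annotations if name == type_name]


class _TokenInitializer:
    """Base for initializers that differ only in the token type they select."""

    TOKEN = None

    def apply(self, cas):
        token_type = self.TOKEN

        def get_tokens_as_text(cas_self):
            return cas_self.select(token_type)

        # Bind the function to this one instance; the Cas class stays generic.
        cas.get_tokens_as_text = types.MethodType(get_tokens_as_text, cas)


class DKProCoreTypeSystem(_TokenInitializer):
    TOKEN = "de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Token"


class CtakesTypeSystem(_TokenInitializer):
    # Assumed cTAKES token type name, used for illustration only.
    TOKEN = "org.apache.ctakes.typesystem.type.syntax.BaseToken"


dkpro_cas = Cas(DKProCoreTypeSystem())
dkpro_cas.add(DKProCoreTypeSystem.TOKEN, "Hello")

ctakes_cas = Cas(CtakesTypeSystem())
ctakes_cas.add(CtakesTypeSystem.TOKEN, "World")

print(dkpro_cas.get_tokens_as_text(), ctakes_cas.get_tokens_as_text())
```

Because the methods are attached at run time, static tools cannot see them, which is the auto-complete concern raised later in this thread.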

@jcklie such a thing works with Python, right?

@jcklie
Collaborator

jcklie commented Nov 4, 2019

We can provide these methods, though I am not sure about the implementation. My question was: Is there some official DKPro type system XML which I can use, or can you provide me with some Java code to generate it, so that it stays in sync with DKPro?

@reckart
Member

reckart commented Nov 4, 2019

The "best" solution for this would probably be to use DKPro Meta :)

@reckart
Member

reckart commented Nov 4, 2019

Well, basically what you do is create a Maven Project which has a dependency on all dkpro-core-api-** modules and then call

        TypeSystemDescription dkproCoreTS = TypeSystemDescriptionFactory
                .createTypeSystemDescription();
        try (FileOutputStream out = new FileOutputStream("target/dkpro-core-aggregated-ts.xml")) {
            dkproCoreTS.toXML(out);
        }

@jcklie jcklie changed the title Shortcut for initializing CAS with a type system Add DKPro support for typesystem and often used use cases Nov 4, 2019
@jcklie jcklie self-assigned this Nov 4, 2019
@jcklie jcklie added this to the 0.2.3 milestone Nov 4, 2019
@jcklie
Collaborator

jcklie commented Nov 4, 2019

I would implement it by extending Cas; the constructor then loads the DKPro type system. Simple and not so magic.

@reckart
Member

reckart commented Nov 4, 2019

But then we'd end up having to import CAS from different libraries...

@jcklie
Collaborator

jcklie commented Nov 4, 2019

I would add it to cassis, so it is from cassis import DKProCas. DKPro to me is an important enough part of the UIMA world to add it to cassis itself.

@reckart
Member

reckart commented Nov 4, 2019

For the moment, I don't feel very comfortable with this. I don't like the idea of the CAS becoming something new just because it contains certain types. The idea of the CAS is that it is a generic data structure. If we subclass it for a particular framework, I feel it goes against this idea.

Actually, the strategy you have shown me OTR for the Pandas accessors looked nice. It makes very clear that there is one generic data structure and there are separately different ways of accessing it.
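
A rough sketch of what a pandas-style accessor could look like for a CAS. This is an assumption-laden illustration: `register_cas_accessor` is a hypothetical counterpart of pandas' `register_dataframe_accessor`, and `Cas` is a small stand-in rather than the cassis class.

```python
TOKEN = "de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Token"


class Cas:
    """Stand-in for cassis.Cas; it stays one generic data structure."""

    def __init__(self):
        self._annotations = []  # list of (type name, covered text)

    def add(self, type_name, text):
        self._annotations.append((type_name, text))

    def select(self, type_name):
        return [text for name, text in self._annotations if name == type_name]


def register_cas_accessor(name):
    """Hypothetical decorator: expose an accessor class as Cas.<name>."""

    def decorator(accessor_cls):
        # A read-only property that wraps the CAS in the accessor on access.
        setattr(Cas, name, property(lambda cas: accessor_cls(cas)))
        return accessor_cls

    return decorator


@register_cas_accessor("dkpro")
class DKProAccessor:
    def __init__(self, cas):
        self._cas = cas

    def get_tokens_as_text(self):
        return self._cas.select(TOKEN)


cas = Cas()
cas.add(TOKEN, "Hello")
print(cas.dkpro.get_tokens_as_text())
```

Unlike pandas, this sketch creates a fresh accessor object on every access instead of caching it; that keeps the example short and is not meant as a design recommendation.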

@jcklie
Collaborator

jcklie commented Nov 4, 2019

Is it ok if we implement this in cassis or should it be part of pydkpro?

@reckart
Member

reckart commented Nov 4, 2019

I understand that IDEs may not support auto-complete for such extensions. But I wonder if IDEs like PyCharm really only do static code analysis or also consider whether a method has actually been called somewhere before. E.g. if I call method x.foo() once and later I type y.f... (where y is of the same type as x), then it would be reasonable to offer foo() in the auto complete (without documentation at least) - I wonder if there are hints one can provide to the IDEs to fine-tune the autocomplete, e.g. for scenarios like the extension methods suggested here.

@jcklie
Collaborator

jcklie commented Nov 4, 2019

PyCharm offers some auto-completion based on what was called before (the typing is limited then), and there are stub files where you can maybe add more information: https://mypy.readthedocs.io/en/latest/stubs.html . But it does not know that there is an extension, as it is added at run time (except when I just add it as a field to cassis and throw an error if it is not compatible).

@reckart
Member

reckart commented Nov 4, 2019

The idea of involving cassis came to me because I thought we should/could pass the type system "strategy" to the constructor, i.e. cassis would somehow have to understand the strategy and react to it. If we use a completely different mechanism which does not require cassis to be aware of the mechanism, it could be done elsewhere.

A compromise between subtyping and adding dynamically might be a generic type (if such a thing is possible?), e.g.

cas = CAS[DKPro_Core]()
cas.access      # must return an instance of the generic type, e.g. DKPro_Core
cas.access.XXX  # IDE could theoretically know which methods the generic type provides
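
One way the generic-type compromise might be spelled with Python's `typing.Generic` is sketched below. It is only an illustration under assumptions: `Cas` is a stand-in, and the accessor class is passed to the constructor (rather than written as `CAS[DKPro_Core]()`) so that the instance knows at run time which accessor to build.

```python
from typing import Generic, List, Tuple, Type, TypeVar

TOKEN = "de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Token"

A = TypeVar("A")


class Cas(Generic[A]):
    """Stand-in CAS that is parameterised by its accessor type."""

    def __init__(self, accessor_cls: Type[A]) -> None:
        self._accessor_cls = accessor_cls
        self._annotations: List[Tuple[str, str]] = []

    def add(self, type_name: str, text: str) -> None:
        self._annotations.append((type_name, text))

    def select(self, type_name: str) -> List[str]:
        return [text for name, text in self._annotations if name == type_name]

    @property
    def access(self) -> A:
        # The declared return type is the type parameter, so static tools
        # can offer the accessor's methods after "cas.access.".
        return self._accessor_cls(self)


class DKProCore:
    def __init__(self, cas: "Cas") -> None:
        self._cas = cas

    def get_tokens_as_text(self) -> List[str]:
        return self._cas.select(TOKEN)


cas = Cas(DKProCore)  # a type checker can infer Cas[DKProCore] here
cas.add(TOKEN, "Hello")
print(cas.access.get_tokens_as_text())
```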

@jcklie
Collaborator

jcklie commented Nov 4, 2019

I think we need features from Python 3.8 for that and even then I am unsure. So what we have now is:

  1. Use the pandas extension style and have no type hints; let pydkpro implement this. Other people can add nice CAS extensions.
  2. Hardcode DKPro, cTAKES and more as CAS extensions in cassis so that we have type support. Throw an error if the Cas does not conform when using these.
  3. Why not both?
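
The "throw an error if the Cas does not conform" part of option 2 could look roughly like the following sketch. All names here are hypothetical: `Cas` is a stand-in that only records which type names it knows, and `IncompatibleCasError` is invented for the example.

```python
TOKEN = "de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Token"
NAMED_ENTITY = "de.tudarmstadt.ukp.dkpro.core.api.ner.type.NamedEntity"


class IncompatibleCasError(Exception):
    """Raised when a CAS lacks the types an extension relies on."""


class Cas:
    """Stand-in that only tracks the type names of its type system."""

    def __init__(self, type_names=()):
        self.type_names = set(type_names)


class DKProAccessor:
    REQUIRED_TYPES = (TOKEN, NAMED_ENTITY)

    def __init__(self, cas):
        # Fail early instead of returning empty results from select later.
        missing = sorted(t for t in self.REQUIRED_TYPES if t not in cas.type_names)
        if missing:
            raise IncompatibleCasError("CAS lacks types: " + ", ".join(missing))
        self._cas = cas


DKProAccessor(Cas([TOKEN, NAMED_ENTITY]))  # conforming CAS, no error

try:
    DKProAccessor(Cas([TOKEN]))
except IncompatibleCasError as error:
    print(error)
```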

@jcklie jcklie changed the title Add DKPro support for typesystem and often used use cases Add extensions for use cases from DKPro and ctakes to the cas interface Nov 4, 2019
@jcklie
Collaborator

jcklie commented Nov 4, 2019

I think this issue contains two things, the DKPro type system and extension. I will track the type system stuff in #9.

@jcklie
Collaborator

jcklie commented Nov 5, 2019

I wrote a quick and dirty script to convert a type system XML to Python classes for the DKPro Core type system. One gets type hints for the wrapped CAS and for the accessor, and does not need to redefine all CAS methods:

image

image

The code basically is

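from typing import Iterator, Union

from cassis import Cas

# Token, NamedEntity and load_dkpro_core_typesystem come from the generated code (not shown here).


class DKProAccessor: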

    def __init__(self, cas: Cas):
        self._cas = cas

    def __getattr__(self, name: str):
        """ If the method is not found on the accessor, then we just delegate to the cas. """
        return getattr(self._cas, name)

    def get_tokens(self) -> Iterator[Token]:
        return self._cas.select("de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Token")

    def get_named_entities(self) -> Iterator[NamedEntity]:
        return self._cas.select("de.tudarmstadt.ukp.dkpro.core.api.ner.type.NamedEntity")


def build_dkpro_cas() -> Union[Cas, DKProAccessor]:
    cas = Cas(typesystem=load_dkpro_core_typesystem())
    dkpro = DKProAccessor(cas)
    return dkpro

I can write a decorator for __init__ and __getattr__ so that these are added automatically to extensions.
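
Such a decorator might look like the sketch below. The name `cas_extension` is hypothetical, and `Cas` is a stand-in with a single `sofa_string` attribute so that the example runs without cassis.

```python
TOKEN = "de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Token"


def cas_extension(cls):
    """Hypothetical decorator: add the wrapping __init__ and the delegating __getattr__."""

    def __init__(self, cas):
        self._cas = cas

    def __getattr__(self, name):
        # __getattr__ only runs when normal lookup fails. Refuse to delegate
        # "_cas" itself, otherwise a missing _cas would recurse forever.
        if name == "_cas":
            raise AttributeError(name)
        return getattr(self._cas, name)

    cls.__init__ = __init__
    cls.__getattr__ = __getattr__
    return cls


class Cas:
    """Stand-in for cassis.Cas."""

    def __init__(self, sofa_string=""):
        self.sofa_string = sofa_string
        self._annotations = []  # list of (type name, covered text)

    def add(self, type_name, text):
        self._annotations.append((type_name, text))

    def select(self, type_name):
        return [text for name, text in self._annotations if name == type_name]


@cas_extension
class DKProAccessor:
    def get_tokens_as_text(self):
        return self._cas.select(TOKEN)


cas = Cas("Hello")
cas.add(TOKEN, "Hello")
dkpro = DKProAccessor(cas)
print(dkpro.get_tokens_as_text(), dkpro.sofa_string)
```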

@jcklie
Collaborator

jcklie commented Nov 5, 2019

I do not know whether I want to keep the type hints for the extension methods, but I like this way of defining extensions.

@reckart
Member

reckart commented Nov 6, 2019

So the IDE dynamically evaluates the DKProAccessor to discover the fields?

@jcklie
Collaborator

jcklie commented Nov 6, 2019

We tell the IDE that build_dkpro_cas can either return a cas or an accessor.

@reckart
Member

reckart commented Nov 6, 2019

How does the IDE know that e.g. Token has the field form? I don't see anything in your code that would do that?

@jcklie
Collaborator

jcklie commented Nov 6, 2019

I generate Python code for the type descriptions from the XML. If you have a fixed type system, then you can do that and check the generated Python code into your source control. I will push the code for that later; this issue should maybe focus on the extension only.

@reckart
Member

reckart commented Nov 6, 2019

Generating classes from the type system description - so a "pycasgen" - an equivalent of the "jcasgen" we have in Java which generates Java classes from the type system. Why not? :)

I think such a "pycasgen" script could be part of cassis and projects like DKPro Core or cTAKES could pre-generate the classes and push them to pypi as separate packages. WDYT?
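
To make the "pycasgen" idea concrete, here is a toy sketch of such a generator. It is not the script mentioned earlier in this thread: `generate_classes`, the primitive mapping and the `example.Token` type are assumptions for illustration, and only the element names of the UIMA type system descriptor format are relied on.

```python
import xml.etree.ElementTree as ET

NS = {"u": "http://uima.apache.org/resourceSpecifier"}

# Map a few UIMA primitive range types to Python hints; the rest become Any.
PRIMITIVES = {
    "uima.cas.String": "str",
    "uima.cas.Integer": "int",
    "uima.cas.Float": "float",
    "uima.cas.Boolean": "bool",
}


def generate_classes(typesystem_xml):
    """Return Python source with one annotated class per type description."""
    root = ET.fromstring(typesystem_xml.strip())
    lines = ["from typing import Any"]
    for type_desc in root.findall("u:types/u:typeDescription", NS):
        full_name = type_desc.findtext("u:name", namespaces=NS)
        class_name = full_name.rsplit(".", 1)[-1]
        lines += ["", "", f"class {class_name}:", f"    TYPE_NAME = {full_name!r}"]
        for feature in type_desc.findall("u:features/u:featureDescription", NS):
            name = feature.findtext("u:name", namespaces=NS)
            range_type = feature.findtext("u:rangeTypeName", namespaces=NS)
            lines.append(f"    {name}: {PRIMITIVES.get(range_type, 'Any')}")
    return "\n".join(lines) + "\n"


# Toy type system with a single made-up type, for illustration only.
EXAMPLE_XML = """
<typeSystemDescription xmlns="http://uima.apache.org/resourceSpecifier">
  <types>
    <typeDescription>
      <name>example.Token</name>
      <supertypeName>uima.tcas.Annotation</supertypeName>
      <features>
        <featureDescription>
          <name>form</name>
          <rangeTypeName>uima.cas.String</rangeTypeName>
        </featureDescription>
      </features>
    </typeDescription>
  </types>
</typeSystemDescription>
"""

print(generate_classes(EXAMPLE_XML))
```

The generated source is plain Python that could be checked into source control and published as its own package; supertypes and name clashes between packages are ignored in this sketch.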

@jcklie
Collaborator

jcklie commented Nov 6, 2019

We can do that. My question right now is where to put the extensions. I would like to have them in cassis itself, as they are related to CAS/XMI stuff. Also, I sometimes need them for my own code and don't want to install pydkpro just for the extensions and types.

@reckart
Member

reckart commented Nov 6, 2019

If by extensions you mean e.g. the generated types - I think these should be released separately and with the same version numbers as the corresponding DKPro Core / cTAKES / etc versions. They do not follow the same release cycle as cassis.

@jcklie
Collaborator

jcklie commented Nov 6, 2019

I mean the dkpro/ctakes accessor and util functions that were requested.

@reckart
Member

reckart commented Nov 6, 2019

@zesch @aggarwalpiush WDYT? Type-system-specific accessors and Python classes generated from type systems should probably be kept together and have a release cycle matching that of the type system they mirror. Should we have them as a separate project under DKPro right away (which I think would be nice, since we could already make use of them in INCEpTION)? Or have them with your pipelining code later?

@zesch
Member Author

zesch commented Nov 6, 2019

Not sure I really understand the implications. Whatever works best on your side.

@jcklie
Collaborator

jcklie commented Nov 8, 2019

I would create a new repository and Python package dkpro-typeshed where we add the extension methods and generated types to get a nice API. This would then only depend on cassis. pydkpro can then use it to make its API nicer. We use a separate package in order to track the DKPro version and its respective new/different types.

@zesch
Member Author

zesch commented Nov 8, 2019

Sounds good

@reckart
Member

reckart commented Nov 8, 2019

We have various DKPro projects and they all have different release cycles. I think the type system is generated for a particular version of a particular project. Thus having a single repo where all generated types are located doesn't seem sensible to me. We would always have to release all types at the same time and it would be impossible for users to choose a version combination they would care for. I think having a type companion repo for each DKPro project would make sense, e.g. dkpro-core-python-api and dkpro-keyphrases-python-api etc.

@jcklie
Collaborator

jcklie commented Nov 8, 2019

This sounds like a lot of work and a maintenance nightmare. Right now it also works without generated types (it is type-unsafe in the same way that the raw Java CAS interface has no safety and no type information). So I would then just add the accessor, which returns the right FeatureStructure objects but gives no IDE support, i.e. changing

def get_tokens(self) -> Iterator[Token]:
    return self._cas.select("de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Token")

to

def get_tokens(self) -> Iterator[FeatureStructure]:
    return self._cas.select("de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Token")

as a first step.

@reckart
Member

reckart commented Nov 8, 2019

This sounds like a lot of work and a maintenance nightmare

What's a maintenance nightmare?

@jcklie
Collaborator

jcklie commented Nov 8, 2019

Having a repo for each would mean setting up many repositories and PyPI packages. I would rather not do that right now.

@reckart
Member

reckart commented Nov 8, 2019

We only need to set up one for DKPro Core. I even thought about putting the generated Python classes directly into the "dkpro-core" repository along with all the Java stuff. But considering that the Python stuff is still "young", we might care to refine/release it more often than the Java stuff, so it might have a faster release cycle (e.g. "2.0.0, then 2.0.0.1 because we fix a bug in the code generator, then 2.0.0.2 because we fix another bug, etc.").

@reckart reckart changed the title Add extensions for use cases from DKPro and ctakes to the cas interface Add extensions for use cases from DKPro Core and cTAKES to the CAS interface Nov 8, 2019
@reckart
Member

reckart commented Nov 9, 2019

I have added a repo here and you should all have proper access to it: https://github.com/dkpro/dkpro-core-python-api

We can still rename it / move around things later if we decide to change anything. For now, we'll only create types for DKPro Core anyway.

@jcklie
Collaborator

jcklie commented Nov 22, 2019

I will come back to this after the ACL deadline.

@jcklie jcklie removed this from the 0.2.3 milestone Nov 23, 2019
@jcklie jcklie added this to the Backlog milestone Sep 29, 2021

3 participants