Ambiguity & Word Cluster Classes: POS Tagging #30

Open
dulocian opened this issue Jul 24, 2017 · 2 comments

@dulocian

Hi,

I would like to know whether it is possible to train my own ambiguity class and word cluster models for POS tagging South African languages.

The only options available are the currently included models:

  • en-ambiguity-classes-simplified.xz
  • en-ambiguity-classes-simplified-lowercase.xz
  • en-brown-clusters-simplified-lowercase.xz
  • en-brown-clusters-twit-lowercase.xz

If it is possible, how should I go about creating them?

Regards

@jdchoi77
Member

Brown clusters are relatively easy: you can use any available tool to generate them, then use the following script to convert the output into the NLP4J format:

https://github.com/emorynlp/nlp4j/blob/master/cli/src/main/java/edu/emory/mathcs/nlp/bin/BrownClusterExtract.java

The ambiguity classes are a hashmap, where the key is a word and the value is the list of possible POS tags for that word. You can save this as a serialized Java object as well and compress it to the xz format.
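
To make the hashmap idea concrete, here is a minimal sketch that builds a word-to-tags map from tagged tokens and serializes it as a Java object. The class name, the method names, and the choice of `Map<String, List<String>>` as the concrete type are assumptions for illustration only; check the exact type against the NLP4J code that loads the lexica. The xz step is not shown, since it needs an external compressor (for example an XZ stream library wrapped around the output stream).

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

// Hypothetical helper, not part of NLP4J.
public class AmbiguityClassSketch {
    /** Collects, for every word form, the sorted list of POS tags it was seen with.
     *  Each token is a two-element array: {wordForm, posTag}. */
    static Map<String, List<String>> build(List<String[]> taggedTokens) {
        Map<String, TreeSet<String>> seen = new HashMap<>();
        for (String[] token : taggedTokens) {
            seen.computeIfAbsent(token[0], k -> new TreeSet<>()).add(token[1]);
        }
        // Copy into plain serializable collections (HashMap of ArrayList).
        Map<String, List<String>> classes = new HashMap<>();
        seen.forEach((word, tags) -> classes.put(word, new ArrayList<>(tags)));
        return classes;
    }

    /** Serializes the map as a Java object; compressing these bytes to xz is a separate step. */
    static byte[] serialize(Map<String, List<String>> classes) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(classes);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // Placeholder tokens and tags; replace with tokens read from a tagged corpus.
        List<String[]> tokens = new ArrayList<>();
        tokens.add(new String[] {"w1", "V"});
        tokens.add(new String[] {"w1", "N"});
        tokens.add(new String[] {"w2", "N"});
        Map<String, List<String>> classes = build(tokens);
        System.out.println(classes.get("w1"));
        System.out.println(serialize(classes).length > 0);
    }
}
```

In practice you would probably also want to apply the same normalization to the keys (for example lowercasing) that the corresponding feature field applies at lookup time, and perhaps drop tags that occur only rarely for a word; both are design choices, not requirements stated here.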

Please let me know if this makes sense. Thanks.

best,

Jinho

@dulocian
Author

Hi Jinho,

I made use of Percy Liang's C++ implementation of the Brown hierarchical word clustering algorithm.

Once the clusters are created using Liang's implementation, I convert them using the script specified in your response. The converted files are then placed in nlp4j-english-1.1.2.jar, in the lexica directory alongside the other cluster and ambiguity class files.
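
For reference, the intermediate data can be sketched as follows. This assumes the clustering tool's `paths` output has one tab-separated line per word (bit string, word, count) and reads it into a word-to-cluster map. The class and method names are hypothetical, and this is not the BrownClusterExtract conversion itself, only an illustration of the mapping that the conversion starts from.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of NLP4J or the clustering tool.
public class BrownPathsSketch {
    /** Parses lines of the assumed form "bitstring<TAB>word<TAB>count" into word -> bitstring. */
    static Map<String, String> parse(List<String> lines) {
        Map<String, String> clusters = new HashMap<>();
        for (String line : lines) {
            String[] fields = line.split("\t");
            if (fields.length < 2) continue; // skip blank or malformed lines
            clusters.put(fields[1], fields[0]);
        }
        return clusters;
    }

    public static void main(String[] args) {
        // Placeholder lines; real input would be read from the paths file.
        List<String> lines = Arrays.asList("0010\tw1\t12", "0011\tw2\t7");
        System.out.println(parse(lines).get("w1"));
    }
}
```

Inspecting a map like this is a quick way to see whether the word forms in the cluster file are cased and normalized the same way as the field named in the configuration.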

In the config-decode-pos.xml and the config-train-pos.xml files, the following lexica field is adapted:
<word_clusters field="word_form_lowercase">edu/emory/mathcs/nlp/lexica/SA-lang-clusters.xz</word_clusters>

No errors arise when training with these settings; however, the accuracy of the POS tagger model is unchanged compared with a control model trained without the cluster file. I am not sure what the cause could be.
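
One thing that can be ruled out independently of NLP4J is whether the path given in the configuration actually resolves on the classpath. The following is a minimal sketch using only the standard class loader; the class and method names are hypothetical, and whether NLP4J reports or silently ignores a missing lexicon is not something this check establishes.

```java
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper, not part of NLP4J.
public class LexiconPathCheck {
    /** Returns true if the named resource can be opened through the class loader. */
    static boolean onClasspath(String resource) {
        try (InputStream in = LexiconPathCheck.class.getClassLoader().getResourceAsStream(resource)) {
            return in != null;
        } catch (IOException e) {
            return false; // closing the stream failed; treat as not usable
        }
    }

    public static void main(String[] args) {
        // Path copied from the word_clusters element in the configuration.
        System.out.println(onClasspath("edu/emory/mathcs/nlp/lexica/SA-lang-clusters.xz"));
    }
}
```

This has to be run with the modified jar on the classpath for the result to mean anything.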

I have also tried all of the possible word cluster fields, including:

  • word_form
  • word_form_lowercase
  • word_form_undigitalized
  • word_form_simplified
  • word_form_simplified_lowercase
  • word_shape
  • word_shape_lowercase
  • orthographic
  • orthographic_lowercase

Please assist.

Regards,
J.
