How to create features with duplicate keys ? #23

Closed
binhnq94 opened this issue Nov 28, 2017 · 8 comments

@binhnq94

I see in the CRFsuite documentation (http://www.chokkan.org/software/crfsuite/manual.html) that feature keys can be duplicated:

B-NP    w[1..4]=a:2 w[1..4]=man w[1..4]=eats
B-NP    w[1..4]=a w[1..4]=a w[1..4]=man w[1..4]=eats
B-NP    w[1..4]=a:2.0 w[1..4]=man:1.0 w[1..4]=eats:1.0

How can I create features with duplicate keys when using sklearn-crfsuite?

@kmike
Contributor

kmike commented Feb 27, 2018

There are no duplicate features here. In the first row, for example, w[1..4]=a and w[1..4]=man are feature names; w[1..4]=a has a value of 2 (the :2 suffix), and w[1..4]=man has the default value of 1. The = sign is just a convention for building readable feature names.

@binhnq94
Author

The CRFsuite documentation gives this BNF notation for the data format:

<line>           ::= <item> | <eos>
<item>           ::= <label> ('\t' <attribute>)+ <br>
<eos>            ::= <br>
<label>          ::= <string>
<attribute>      ::= <name> | <name> ':' <scaling>
<name>           ::= (<letter> | "\:" | "\\")+
<scaling>        ::= <numeric>
<br>             ::= '\n'

This means that 2 is a scaling factor, or weight.
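
To make the grammar concrete, here is a minimal Python sketch that reads one item line according to the BNF above. It is only an illustration of the format, not CRFsuite's own parser; the function names `parse_attribute` and `parse_item` are made up for this example, and the default scaling of 1.0 follows the explanation given earlier in the thread.

```python
def parse_attribute(field):
    """Split one attribute field into (name, scaling), following the BNF."""
    name_chars = []
    i = 0
    while i < len(field):
        ch = field[i]
        if ch == "\\" and i + 1 < len(field):
            # "\:" and "\\" are escapes, so the next character belongs to the name
            name_chars.append(field[i + 1])
            i += 2
        elif ch == ":":
            # the first unescaped colon separates the name from the scaling
            return "".join(name_chars), float(field[i + 1:])
        else:
            name_chars.append(ch)
            i += 1
    # no scaling given: assume the default value of 1.0
    return "".join(name_chars), 1.0


def parse_item(line):
    """Split a tab-separated item line into (label, [(name, scaling), ...])."""
    label, *fields = line.rstrip("\n").split("\t")
    return label, [parse_attribute(f) for f in fields]


label, attributes = parse_item("B-NP\tw[1..4]=a:2\tw[1..4]=man\tw[1..4]=eats")
print(label, attributes)
```

Here `w[1..4]=a`, `w[1..4]=man` and `w[1..4]=eats` come out as three distinct names; only the part after an unescaped colon is treated as a number.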

[Image: a sample of data for CRFsuite]

@kmike
Contributor

kmike commented Feb 27, 2018

Yes, 2 is the weight; that is what I wrote as well. Could you please provide an example of a duplicate key? It seems I don't understand your question.

@binhnq94
Author

binhnq94 commented Feb 27, 2018

Sample sequence: a man eats some food
When we configure features for a sequence labeling problem, there are several types of features:

  • type 1: w[-3], w[-2], w[-1], w[0], w[1], w[2], w[3] form a window around the current word. Features for "man":
{
   w[-1]: a,
   w[0]: man, 
   w[1]: eats, 
   w[2]: some,
   w[3]: food
}
  • type 2: w[-3]|w[-2]|w[-1], w[1]|w[2]|w[3] are conjunctive features:
{
   w[-3]|w[-2]|w[-1]: a,
   w[1]|w[2]|w[3]: eats|some|food
}
  • type 3: w[-3...-1], w[1...3] are disjunctive features:
{
   w[-3...-1]: a,
   w[1...3]: eats,
   w[1...3]: some,
   w[1...3]: food
}

Your package can't support type 3.
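
The obstacle with type 3, when features are written as a {key: value} dict, can be seen in plain Python: a dict keeps only one value per key, so the repeated w[1...3] entries collapse into one. A small illustration (the variable names are made up for this example):

```python
# The three disjunctive entries from type 3, written as (key, value) pairs
pairs = [("w[1...3]", "eats"), ("w[1...3]", "some"), ("w[1...3]", "food")]

# Building a dict from them keeps only the last value for the repeated key
as_dict = dict(pairs)
print(as_dict)
```

Only the last pair survives, which is why the key and the word have to be combined into a single feature name if all three are to be kept.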

@kmike
Contributor

kmike commented Feb 27, 2018

See data format examples at http://python-crfsuite.readthedocs.io/en/latest/pycrfsuite.html#pycrfsuite.ItemSequence - the format you're using is just a shortcut:

{"string_key": "string_value", ...} dict; that's the same as {"string_key=string_value": 1.0, ...}

All features have float values, both in crfsuite and in python-crfsuite; {"w[-1]": "a"} is just a shortcut for {"w[-1]=a": 1.0}, and in the crfsuite example the = sign is just a convention. 1.0 is the assumed value; in crfsuite you override it by putting a number after the :, and in python-crfsuite you can override it by using another feature format ({"string_key": float_value}).

So if you want such disjunctive features, pass {"w[1...3]=eats": 1.0} instead of {"w[1...3]": "eats"}. If you don't need weights, you can also use a list feature format.
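
The equivalences described above can be sketched in a few lines of Python. This is not python-crfsuite's actual code, only an illustration of the shortcut rules mentioned in this comment; the helper name `normalize` is hypothetical, and it covers only the three formats discussed here (string values, float values, and a list of names).

```python
def normalize(features):
    """Expand the shortcut formats into a {feature_name: float} mapping."""
    if isinstance(features, dict):
        out = {}
        for key, value in features.items():
            if isinstance(value, str):
                # {"key": "value"} is shorthand for {"key=value": 1.0}
                out[f"{key}={value}"] = 1.0
            else:
                # {"key": number} keeps the name and uses the number as the value
                out[key] = float(value)
        return out
    # a list of ready-made feature names: each one gets the value 1.0
    return {name: 1.0 for name in features}


print(normalize({"w[-1]": "a"}))
print(normalize({"w[1...3]=eats": 1.0}))
print(normalize(["w[1...3]=eats", "w[1...3]=some"]))
```

Under these rules, the dict with a string value and the dict with the hand-built name map to the same feature.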

@binhnq94
Author

OK. I think that if I pass {"w[1...3]=eats": 1.0}, it will be like passing w[1...3]=eats=1.0:1.0.

"If you don't need weights, you can also use a list feature format."

-> How do I use the list feature format with sklearn-crfsuite?

@kmike
Contributor

kmike commented Feb 27, 2018

{"w[1...3]=eats": 1.0} is exactly the same as {"w[1...3]": "eats"}.

If all of your features have value 1.0, and you build feature names manually, you can pass a list: ["w[1...3]=eats", "w[1...3]=some", ...] instead of a dict {"w[1...3]": "eats", "w[1...3]": "some", ...}; they are meant to express exactly the same features.
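
As a rough sketch of this advice, the disjunctive features for one position could be built as a list of manually constructed names. The function name `disjunctive_features` and the `window` parameter are made up for this example; the w[-3...-1] / w[1...3] naming simply follows the convention used earlier in this thread.

```python
def disjunctive_features(tokens, i, window=3):
    """Return feature names for the current word and the words around it."""
    feats = [f"w[0]={tokens[i]}"]
    left = tokens[max(0, i - window):i]        # up to `window` words before position i
    right = tokens[i + 1:i + 1 + window]       # up to `window` words after position i
    # every neighbouring word gets its own feature name, so nothing collides
    feats += [f"w[-{window}...-1]={w}" for w in left]
    feats += [f"w[1...{window}]={w}" for w in right]
    return feats


tokens = "a man eats some food".split()
features = disjunctive_features(tokens, 1)   # features for "man"
print(features)
```

Following the comment above, each position's list (or the equivalent {name: 1.0} dict) would then serve as that word's features in the sequences given to sklearn-crfsuite.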

Please check data formats here: http://python-crfsuite.readthedocs.io/en/latest/pycrfsuite.html#pycrfsuite.ItemSequence

kmike closed this as completed Feb 27, 2018

@binhnq94
Author

Understood. Thank you.
