Self-interactions are not included when the feature has value 1 #698

Closed
stephen-hoover opened this issue Jun 17, 2015 · 18 comments

@stephen-hoover

When using quadratic or cubic interaction terms, self-interactions are not created from features which have a value of 1. Example:

echo '1 |f a b' | vw shows 3 features (correct: "a", "b", and the constant).
echo '1 |f a:2 b:2' | vw -q ff shows 6 features, which is also correct: "a", "b", constant, "a * a", "b * b", and "a * b".
echo '1 |f a b' | vw -q ff shows 4 features. If I add "--invert_hash", the model output is

Constant:116060:0.158008
f^a:14355:0.158008
f^a*f^b:161933:0.158008
f^b:146788:0.158008

which is missing the "a * a" and "b * b" terms.

This behavior also happens at test time. If I extract model coefficients from an invert hash model file and compare predictions generated from those to predictions generated directly from VW, the predictions match everywhere except where a feature has a value of exactly 1.

I found this error in VW version 7.10.2, commit hash dc5532f, running on OS X 10.10.3. I know that the error was not present at commit hash bb68807, but I haven't done a bisection to locate exactly where it was introduced.

@stephen-hoover
Author

I finished a git-bisect. It appears that the problem was introduced in either commit ee1fc3c or 72ceea8 from Mon May 4 2015 (neither compiles for me, so I can't say which one).

@trufanov-nok
Contributor

This is expected behaviour. I'll write an article about the new feature generation algorithm in VW's wiki today. In short, it generates simple combinations of features for interactions within the same namespace instead of permutations (as it did before).
That means that for -q ff and 2 features in f you'll get only a*b instead of a*a, a*b, b*a, b*b.
And for 3 features in f you'll get only a*b, a*c, b*c.

The reason behind that is that your model won't benefit from a b*a feature if it already includes a*b. Although they'll be in different places in memory, they'll have the same value after training. It looks like such "duplicate" features also don't help with hash collision resolution.
Also, features like a*a or a*a*a are useless if your model includes feature a: they will all have the same value after training. The only exception is when feature a has a weight != 1.0. In that case the feature values will be different, so we don't filter out a*a-like interactions for a:2.0.
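Roughly, the rule reads like the following minimal Python sketch (an illustration of the behaviour described above, not VW's actual code; it assumes a single namespace and quadratic interactions):

```python
from itertools import combinations_with_replacement

def quadratic_same_namespace(features):
    # features: list of (name, value) pairs from one namespace.
    interactions = []
    for (n1, v1), (n2, v2) in combinations_with_replacement(features, 2):
        if n1 == n2 and v1 == 1.0:
            continue  # a*a adds nothing when a has value exactly 1
        interactions.append((n1 + "*" + n2, v1 * v2))
    return interactions

print(quadratic_same_namespace([("a", 1.0), ("b", 1.0)]))  # only a*b
print(quadratic_same_namespace([("a", 2.0), ("b", 2.0)]))  # a*a, a*b, b*b
```

Under --permutations the loop would instead enumerate all ordered pairs (e.g. itertools.product(features, repeat=2)).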

These new rules of feature interaction generation are supposed to decrease the number of generated features by filtering out features that won't improve the model. Training should converge faster and results may be slightly better. So this approach is enabled by default.

You can still switch back to the old feature generation engine by specifying the --permutations flag on VW's command line.

@djstrong
Contributor

Why "It looks like such "duplicate" features also don't help with hash collisions resolution."?

@trufanov-nok
Contributor

Wiki: https://github.com/JohnLangford/vowpal_wabbit/wiki/Feature-interactions

Why "It looks like such "duplicate" features also don't help with hash collisions resolution."?

I believe that the model gains such "unnecessary" features so fast (as the number of features in a self-interacting namespace grows) that very soon they become a source of collisions. For example, -q aa --cubic aaa with 30 features in a will produce ~28k features per example, and only ~4.5k in simple combinations mode.
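The arithmetic behind those numbers, as a quick sketch (illustrative only):

```python
from math import comb

n = 30                                    # features in namespace a

permutations = n**2 + n**3                # all ordered pairs and triples
combinations = comb(n, 2) + comb(n, 3)    # unordered, no self-interactions

print(permutations)   # 27900  (~28k)
print(combinations)   # 4495   (~4.5k)
```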

I think I've also seen a paper that proves you can't solve collision problems by using 2 different hashes in the hashing trick while storing the hashes in the same address space. At least not for non-toy datasets. Our situation is very similar to using 2 hashes for the same feature in the hashing trick.

@djstrong
Contributor

Thanks for the answer. I'm referring to my problem: text classification. In a bigram model my data set generates 800M features, I can use only 28 bits (~300M), and I have a couple of special non-text features. So I can try to replicate these features using the above-mentioned method.
A better solution is to hash the features myself and ensure that words do not collide with the important features. It would be great if vw could use separate memory for a defined namespace.

@trufanov-nok
Contributor

Well, it would be better to find out beforehand how badly collisions affect your model's predictive power.
When I was competing on Kaggle I played with -b and found out that my model's predictive power didn't grow much after -b 24. My feature set wasn't big and could fit into RAM, so I assumed that the collision effect wasn't big enough and that fighting collisions wasn't worth the effort.
But in your case it may mean that the collision effect is so big that switching between -b 27 and -b 28 is like dancing around -b 10 and -b 11 for me.

Do you want to make only the "couple of special non-text features" 100% collision-protected, or all interactions with them too? If the former, and you think this will help you, and you know some C++, then it's not difficult to do. You just need to replace the "special non-text features" with hashes in some interval [0,100] in your dataset ("By default VW hashes string features and does not hash integer features."). Then plug a small C++ check into the VW code that makes sure other features never get a hash from this interval. Perhaps there is an easier way, but I prefer adjusting the VW code for my purposes instead of playing with scripts etc. I can point you to the right piece of VW code to do that.
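The idea in a hypothetical sketch (not VW code; the helper name and the 28-bit table size are just assumptions for illustration): special features are written as integers 0..100, which VW stores without hashing, and every hashed text feature is remapped so it can never land in that interval.

```python
def hash_text_feature(name, num_bits=28, reserved=101):
    # Hash into the non-reserved part of the table, then shift past the
    # reserved indices 0..100 kept for the special features.
    table_size = 1 << num_bits
    return hash(name) % (table_size - reserved) + reserved

print(hash_text_feature("some_word"))  # always >= 101
```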

And let's continue this in another thread so as not to spam this one, as the discussion is off-topic.

@stephen-hoover
Author

@trufanov-nok , it's great that the interaction terms no longer include both "a * b" and "b * a". I was happy to see that change in your recent refactor. But not including "a * a" is a problem. If the feature "a" always has value 1, it's true that "a*a" is redundant with "a". The problem comes if "a" can sometimes take the value 1 and sometimes take values other than 1.

Consider building a model on 1 feature with quadratic interaction terms, and say that you end up with a model that looks like y = 0.5 * x + 1.5 * x^2. If x = 1.001, VW outputs a prediction of 2.0035. If x = 0.999, VW outputs a prediction of 1.9965. However, if x = 1.0, VW outputs a prediction of 0.5. I don't think that these cases would be uncommon. You'd run into it all the time if your features are counts, for example. I found it because I'm using the iris dataset for testing; it has continuous features, but the continuous value is sometimes exactly 1.
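A minimal sketch of that discontinuity, assuming the trained weights above and the current rule that drops x*x when x is exactly 1 (illustrative Python, not VW itself):

```python
WEIGHTS = {"x": 0.5, "x*x": 1.5}

def predict(x):
    features = {"x": x}
    if x != 1.0:                 # self-interaction is generated only when x != 1
        features["x*x"] = x * x
    return sum(WEIGHTS[name] * value for name, value in features.items())

print(predict(1.001))  # ~2.0035
print(predict(0.999))  # ~1.9965
print(predict(1.0))    # 0.5  <- the discontinuity
```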

I think that a discontinuity like this in a linear model is undesirable, and it's certainly very counterintuitive. This discontinuity also exists in the cost function when training, which I find equally undesirable. I'd prefer to not have the duplicate "a * b" and "b * a" terms, but if that means the self-interaction terms disappear when they're equal to 1, I can add on the "--permutations" flag.

@JohnLangford
Member

This seems like a convincing argument to me---continuity is certainly desirable. Typically, the a*a terms are dominated by a*b terms, so there is little computational need to eliminate them. And w.r.t. accuracy, discontinuity could bite hard.

-John


@trufanov-nok
Contributor

Well, I'm convinced too.

The problem is that if we allow a*a feature generation regardless of weight and enable this mode by default, then it opens the door for generation of a*a*x features for --cubic interactions etc. And this in turn significantly decreases the feature-count reduction of simple combinations mode vs permutations mode.

I can think of 2 quick fixes:

  1. Let's keep everything as is, but replace the "with weight != 1.0" rule with "with explicitly set weight". This means VW will generate a*a for all features with weights != 1.0, plus for a:1.0 or a:1, but not for a bare a (see the sketch after this list). This will help for features which are counts. @stephen-hoover, will it solve all your cases? I doubt it.

Thus my second proposal is:

  2. Allow a*a generation regardless of weight by default, but add 2 additional command line options:
    --no_self_interactions to suppress a*a feature generation entirely. With this mode I'm trying to address datasets with only categorical features. You'll get only simple combinations of features with it, which means the minimal number of features.
    --no_self_interactions_for_default_weighted_features to switch back to the current implementation with the modification described in proposal 1. I'll document this behaviour and the pitfalls that Stephen mentioned. With this mode I'm trying to address datasets which have only a few feature columns with weights, while the others are weightless categorical features.
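The explicit-weight rule from proposal 1, as a hypothetical sketch (not VW's parser; parse_token is an invented helper for illustration):

```python
def parse_token(token):
    # "a:1" or "a:2.5" carries an explicit weight; a bare "a" does not.
    if ":" in token:
        name, value = token.split(":", 1)
        return name, float(value), True   # explicit weight -> generate a*a
    return token, 1.0, False              # implicit weight of 1 -> no a*a

for token in ("a", "a:1", "a:2.5"):
    print(token, "-> generate a*a:", parse_token(token)[2])
```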

My idea: let's let the experienced user decide what kind of dataset he has, so he'll be able to reduce the number of generated features even more with one of these two flags.

What do you guys think?

@stephen-hoover
Author

I actually like your first option, only generating self-interactions for terms with explicit weights. As long as this is well-documented, it should let people distinguish between features which could take on any weight and features which only take on "present" and "not present" values. It does lead to a gotcha if a user mixes explicit and implicit weights, but I do think that it would handle all of my use cases.

I think that another good option is your proposal 2b), to have a command line flag which turns on the behavior in your first proposal. That way a new user wouldn't get tripped up by something they didn't expect, but expert users could still benefit from not generating so many unnecessary interaction terms.

@trufanov-nok
Contributor

Maybe sticking with option 1 will be enough, @JohnLangford?

@martinpopel
Contributor

I like option 1, it seems intuitive to me:
feature:1 is an integer or real-valued feature which has value 1.
feature is a boolean feature (with a true value).

I agree with the idea of making reasonable defaults for beginners, while allowing experts to tune it. On the other hand, the number of command line options is also a burden for beginners (vw -h | wc -l currently prints 235 and the number is still growing). So I would suggest not adding extra options (--no_self_interactions_for_default_weighted_features) which are not necessary.

@JohnLangford
Member

Option 1 seems nonviable because the only thing available at the moment of the feature expansion is:
(a) the first character of the namespace.
(b) the feature index.
(c) the feature value.

Our baseline approach should probably be to allow self-interaction features by default. If we can find a use case where eliminating this is definitely useful we can add a flag. Martin's concern about excessive options is real though, so I'm inclined to not add a flag unless the flag has a known use case. Most features occur in namespaces which have more than one feature, and hence these self-interaction terms are always dominated by the cross terms computationally and representationally. Given this, I expect the benefits of removing self-interaction terms to be negligible both computationally and representationally.
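A quick count illustrating the "dominated by cross terms" point, assuming quadratic interactions of a single namespace with n features (a rough sketch, not a benchmark):

```python
from math import comb

# n self-interaction terms versus n*(n-1)/2 cross terms per example.
for n in (5, 10, 30, 100):
    print(n, "self terms:", n, "cross terms:", comb(n, 2))
```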

-John


@trufanov-nok
Contributor

Option 1 seems nonviable because the only thing available at the moment

But what if I add a boolean flag to the example class which will be set if the weight was initialized here (https://github.com/JohnLangford/vowpal_wabbit/blob/master/vowpalwabbit/parse_example.cc#L54)?

@JohnLangford
Member

There is more than one feature in an example, so that seems inadequate.

-John


@trufanov-nok
Contributor

Sorry, I meant a bool or char in the feature struct.
Or perhaps, better, a v_array in the example class to store flags, so it could be reused without new memory allocation.
Such a flag is only needed for single features (not for features generated by interaction), so the array wouldn't be big.

Nope, this won't work either. I'll take some time and think about this.

@JohnLangford
Member

It does not seem worth it.

This would imply that every feature takes up significantly more space (if we want aligned access) and is more complex. A corner case does not seem to justify the implied complexity, as it will slow the common case down.

-John


@trufanov-nok
Contributor

OK, I've done the required fix and tested VW with -q ff --cubic fff and 27 features in f.
It produces 18279 features with --permutations, 3006 features with simple combinations, and 3654 features with the proposed approach. I'll submit a PR for it tomorrow.
