Bug: kNN computes on k-first among whatever neighbors it finds. Instead of fixed k-neighbours #131
This is not really a bug; it's just the way k-NN is defined. Using a neighborhood that is actually far away is a well-known drawback of neighborhood methods. I'm not sure why you're mentioning k-means here?
Storing the neighbors during training requires extra work during the prediction step: we can only consider neighbors that have rated the target item, so we need to filter out those who did not. Also, storing a U x I matrix may or may not be more memory efficient than storing a U x U or I x I matrix, depending on the number of users and items. In the current implementation, finding the k-nearest neighbors is fairly efficient because the sorting is done only over the other users that have rated the target item (in the case of user-user similarities). So we're sorting a fairly short list each time instead of sorting a huge list once and for all. If you want to implement the method you suggest, I'd be happy to see the results! Nicolas
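For illustration, the current selection strategy might be sketched like this (my own simplified Python, not Surprise's actual code; `sim` and `ratings_by_item` are assumed data structures):

```python
import heapq

def predict_method_a(u, i, sim, ratings_by_item, k):
    """Current strategy: pick the k nearest neighbors of u AMONG the
    users who rated item i, then take their similarity-weighted mean.

    sim[u][v]          -- precomputed user-user similarity
    ratings_by_item[i] -- {user: rating} for item i
    """
    # Only users who rated i are candidates, so we sort a short list.
    candidates = [(sim[u][v], r)
                  for v, r in ratings_by_item[i].items() if v != u]
    top_k = heapq.nlargest(k, candidates)
    num = sum(s * r for s, r in top_k if s > 0)
    den = sum(s for s, _ in top_k if s > 0)
    if den == 0:
        raise ValueError("prediction impossible: no usable neighbors")
    return num / den
```

Note how the sort happens per prediction over the short candidate list, which is the efficiency point made above.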
As far as I understand the kNN idea, it is not defined this way. Rather, we should always use the same fixed neighbours and recommend only those items that were rated by some of these fixed neighbours. Thus, we can't always compute a prediction for any given item, which is a well-known drawback of kNN. Finally, there is a common-sense argument: if we choose the closest k neighbours among those who have rated a given item, the top-N recommendation list will fail.
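The fixed-neighborhood variant being argued for could be sketched like this (illustrative Python under my own naming, not the library's code):

```python
import heapq

def predict_method_b(u, i, sim, ratings_by_item, k):
    """Proposed strategy: fix the k nearest neighbors of u over ALL
    users first, then keep only those among them who rated item i."""
    nearest = heapq.nlargest(k, sim[u].items(), key=lambda vs: vs[1])
    usable = [(s, ratings_by_item[i][v]) for v, s in nearest
              if v in ratings_by_item[i]]
    num = sum(s * r for s, r in usable if s > 0)
    den = sum(s for s, _ in usable if s > 0)
    if den == 0:
        # The known drawback: none of the fixed neighbors rated i,
        # so no prediction can be made for this item.
        raise ValueError("prediction impossible for this item")
    return num / den
```

The neighborhood is computed once per user; items rated by none of those k users simply get no prediction.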
A major issue with your approach is that, since it's fairly unlikely that the k-nearest neighbors of a user have rated the target item, the prediction is most of the time impossible, and in the rare cases where there is one, it relies on only a few ratings (much fewer than k). So it's not reliable either. Obviously these are just conjectures of mine; it should be empirically verified whether that holds. I think we can agree that both approaches have pros and cons. The one that is currently implemented is, as far as I know, the most commonly used. See e.g. the Recommender Systems Handbook by Ricci et al. or the recommender systems textbook by Aggarwal (I find the latter to be much more useful).
Nicolas, what I found in the book you suggested seems to support the vision of kNN which I described.
Well, TBH these statements are true for both strategies. I agree though that it's not so clear which strategy the author is referring to. E.g. p. 36, eq. 2.4:
I always thought that this means we should first filter the users who haven't rated i, and then see which ones are the closest. But thanks to your remarks I understand now that it's not so clear. It could mean either your definition, or the one I've been using. The same goes for the handbook:
It's not clear whether Ni(u) is a subset of N(u) (your understanding) or not (mine). In any case, if you want to implement your solution, please go for it!
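Written out, the ambiguity is about which set the sum ranges over (my transcription of the generic k-NN prediction rule, not a verbatim quote of either book):

```latex
\hat{r}_{ui} =
  \frac{\sum_{v \in N_i^k(u)} \operatorname{sim}(u, v)\, r_{vi}}
       {\sum_{v \in N_i^k(u)} \operatorname{sim}(u, v)}
% Reading A (current implementation):
%   N_i^k(u) = the k users most similar to u among those who rated i
% Reading B (proposed):
%   N_i^k(u) = N^k(u) \cap U_i, where N^k(u) is the set of k nearest
%   neighbors of u overall and U_i is the set of users who rated i
```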
You mean I may substitute the current logic of .predict() for all kNN algorithms? Refactoring plan: aim for the least changes to the current code, with optional access to the current similarity matrix.
No, I think both options should be available, and the current one (even if it may not be the most natural for some) should be the default, to preserve backward compatibility. This could be set via a parameter.
As I said earlier I'm not sure this would improve performance. If it does improve performance in some cases (whether it's computation time or memory use), then I'd be happy to support it. But it should be done in a separate PR as this is for the most part unrelated to the neighbors selection.
Good idea
I'm not sure what you mean?
Ok if that's useful, but in a different PR (see remark above)
There is no prediction() method, did you mean
YES :)
There is already a guide for this in the FAQ if I'm not mistaken. Good luck in any case, and thank you for addressing this :)!
Quick update: I checked the MyMediaLite and LibRec versions of kNN. If I understand their code correctly, they both compute the set of nearest neighbors and only then filter out those not having rated the target item, which supports your version :)
So, since other libraries, as you have checked, implement it differently from Surprise, I propose to change the default behaviour of the current kNN implementation, rather than introduce the new behaviour as just an option.
Like I said, I prefer to keep the current version as the default one for backward compatibility reasons. If we change the default algorithm, then some people will find that their predictions have changed when upgrading to the new version, even if they haven't changed their code.
I believe that being consistent with the other (standard) way of kNN implementation is more important than being consistent with previous versions. I don't see this change in predictions as a bad thing in itself, if it provides better top-N recommendations, and I feel strongly about the need to change the behaviour. And if you are concerned about informing users, you may, for instance, just bump the major library version (2.x), as one should usually expect some change in behaviour in a new major version (if we add top-N prediction to the API, such a version change will be even more justifiable, as it signals the opportunity to use a built-in method rather than implementing it through some adapter over the library).
I understand your point. Changing the API and default behaviour in a next major release might be the solution. But for now, let's just keep the current strategy as the default one: we can change the default later.
I consider leaving the default behaviour as is to be bug proliferation, as it means that all new users will get the wrong default behaviour, and old users, by default, will also get deteriorated top-N predictions. So, I won't agree to implement it this way (leaving the current behaviour as the default).
Once again, the current implementation is neither wrong nor correct. It's not a bug. It's just the way it is, and it does exactly what it claims it does. It's a variant of the neighborhood approach, just like the one you propose is another variant (and there are many, many variants). They both have pros and cons, and we may prefer one or the other depending on our goals. If that's of any comfort to you, Crab proposes both strategies for selecting the neighbors.
That's still to be (empirically) proven. The version that ends up being the default will be the one that performs best (criteria being accuracy and top-N metrics, computation time, and memory usage). If the new method ends up being the default one, this will happen in a new major release (note that this is still hypothetical). Also, we won't release version 2.0 for this change alone. Version 2.0 will be out once I finish working on implicit feedback (I can't give any timeline right now). In the meantime, nothing prevents us from releasing 1.0.6 with your changes, keeping the current neighborhood as the default one. That would give us the opportunity to add a warning in the doc indicating that the default will probably change in future releases.
You are right - both strategies to define k-neighbours are used.
I propose to implement such default behaviour without any params, to avoid cluttering the API, i.e. leave the current behaviour as is for user_based=False, and change it for user_based=True. Have a look at the Args docstring for kNN we would have to add otherwise; I would prefer to avoid it.
I didn't know that, thanks for the info. Would you have any reference?
If what you're saying above is true, then changing the default depending on user_based could make sense. But Surprise is about letting users have complete control over their experiments, so we need to make everything explicit and tunable. So we should be able to choose whatever neighborhood we want. I don't mind long docstrings as long as they're clear and useful.
As I have just got it, they pick any k only because they have another variable to confine the neighborhood in which to look for this k. So, this can't be used even for i-i without adding an additional "model size" param. This was in the video lectures of my Coursera course, and in the original paper describing item-item CF. So, it seems that users should either compute similarity in a "fixed k" approach, or we should add a constraint on the model size (a fixed number of neighbors among which to look for the k closest). Would you agree to add such a param right away? It is logical, as it will allow to effectively use both behaviors for looking up neighbors, in both user-based and item-based approaches.
Although, I may add this param in a different PR; it won't result in double work when changing the code.
Sorry Олег, I'm not sure I completely understand everything, but I don't see how this supports your claim that the different strategies are preferred depending on whether we're computing similarities between users or items?
In short: for item-item recommendations, not the same k neighbours are used, but the k closest items among those that were already rated by the user.
Ok but it is simply specific to the proposed algorithm in this paper, not a general recommended practice, right? Anyway, let's keep things simple.
These are the definitions for user-user similarities. The item-item case is analogous. Nicolas
I would rather keep the parameter as I have specified above (fix_k_closest).
That being said, just because this is the paper introducing item-item doesn't mean we should implement it: it's more than 15 years old and practices evolve a lot. |
fix_k_closest is clear, together with a good param description. Previously you agreed on the example of the description.
No, it's not. The number of neighbors is not fixed in either of the methods.
No, I did not. I just said that I didn't mind long docstrings as long as they are clear. I'm sorry, but the one you proposed was not. Can you please expand a bit on the model size / neighborhood size? I checked the paper briefly but did not completely understand what it means, and I don't have time to read it carefully for now. A quick TLDR would be helpful.
k neighbors are absolutely fixed in the method that I advocate. The idea of a real-world implementation is to decrease memory consumption of the trained model and make predictions faster, by precomputing as many factors as possible at the model training stage. So, only the k closest neighbors and their respective similarities need to be kept as part of a user-based model (and that is why I advocate the "fix_k_closest", or "fixed_k", param name). For item-based models:
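The memory argument can be sketched like this (hypothetical helper under my own naming, not part of Surprise's API): after fitting, the full U x U similarity matrix is replaced by each user's k best neighbors.

```python
import heapq

def compact_model(sim, k):
    """Keep, for every user, only the k most similar neighbors and
    their similarities; the rest of the matrix can then be freed."""
    return {
        u: dict(heapq.nlargest(k, row.items(), key=lambda vs: vs[1]))
        for u, row in sim.items()
    }
```

The compacted structure is all that is needed at prediction time under the fixed-neighborhood strategy.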
Nicolas, the fixed-k approach also makes it possible to compute top-N predictions much more efficiently, by looking only at items that a minimum number of the fixed neighbours have jointly rated. So, I propose to introduce a get_top_n method for the kNN algorithms, available when the new option is set to True. It will have drastically better performance than the approach you propose in the FAQ (predicting for all items).
I understand what you mean: they're fixed in the sense that, for a given user, we will always look at the same set of neighbors, which is not the case in the current implementation, where we first filter out users that have not rated the target item. However, with the method you propose, the actual set of neighbors that we use for a prediction is not fixed, because the subset of neighbors that have rated the target item is never the same. Let N(u) be the k nearest neighbors of u, regardless of their rating for the target item. With the new method, we use a subset of N(u) to make a prediction: only the users in N(u) that have rated i can be used. For a different item j, this subset of neighbors will be different. This is why I think the neighborhood is not really fixed either. Thanks for the TLDR. Here is what I understand, please let me know if I got it right or wrong (I consider here for simplicity that we have a user-user similarity):
Is that correct, or am I missing something? Note that issue #3 was about similarity computation using too much memory. This will not be fixed by the proposed changes: the similarity matrix has to be computed anyway, even if it's ultimately freed. I'm not sure I completely understand what you mean about top-N, but honestly I'd rather not focus on that right now, as it's fairly unrelated to the current matter. Nicolas
You got it right with model size, and wrong with fixed k:
To conclude, we need a "fix_k_closest" param to implement the classical u-u approach, and a "model_size" param to implement the more efficient i-i implementation. Regarding top-N prediction: you proposed to do it by predicting a score for each item and then sorting out the highest predictions. But in user-based kNN we actually won't be able to predict a rating for a large share of items. Computationally cheaper would be to first select only the subset of items jointly rated by a minimum number of the fixed k neighbours, and then make predictions only for those items. So, for user-user CF you don't need to try to predict all ratings, and the approach will be more computationally efficient.
See my PR; it would probably be easier to get an idea of the user-based fixed-k approach from the code.
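The candidate-restriction idea could look like this (hypothetical helper; `min_support` is my own name for the "minimum number of fixed neighbours" threshold, not an existing parameter):

```python
from collections import Counter

def topn_candidates(neighbors, ratings_by_user, min_support=1):
    """Return only the items rated by at least `min_support` of the
    fixed neighbors; every other item is skipped when building top-N."""
    counts = Counter(i for v in neighbors for i in ratings_by_user[v])
    return {i for i, c in counts.items() if c >= min_support}
```

Only the items in this candidate set would then be scored and ranked, instead of the whole catalogue.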
You seem to imply once again that we have to change the default. Please don't make me repeat myself for the fifth time. Also, please don't differentiate between item-item and user-user. Recommended practices may differ depending on the kind of similarity, but that's really not the question we're trying to address right now. I used user-user as an example in my previous post, that's all, for the reasons that I clearly explained before. I'm sorry man, but communication is extremely difficult and I'm starting to lose patience here.
I didn't imply it; I just tried to explain that for the user-based and item-based cases, library users usually should pick different values of the new option. If you look at my PR, you will see that the change of behaviour happens only for a non-default value of the given option.
Ok, but please let's entirely ignore this question for now. I do not want to consider which approach is more suitable when we're using u-u or i-i. For now I just want to understand how the general method works, regardless of i-i or u-u. So, I'll rewrite what I understood; please let me know if it's correct. I use u-u here but only for illustration purposes. The i-i case is strictly symmetric.
Is that correct? You told me I misunderstood the use of fixed_k, but you explained it by differentiating between u-u and i-i, so I don't know what I got wrong. Now, if what I understood is correct, then we can simply implement this directly. That would require:
Sorry no, I don't want to read some code until we completely agree on what the changes will be.
Do you at least understand and acknowledge my point when I say that they're also not fixed in some sense? I feel like you've been ignoring this so far.
You understand model size absolutely correctly. But, nevertheless, I see a couple of reasons why you might want to keep these two params separately:
So, imho, it would be easier for users to have separate params "fix_k" and "model_size", where the second one should be used only for fixed_k=False. We can introduce both params while keeping the current behaviour as default. Later (in the next major version) we could change those defaults for user_based=True. The fixed_k case also allows a transparent and efficient implementation for both estimate and get_top_n.
I am not sure what the purpose of this task that you foresee is. But it can be omitted, as I see it.
Thanks for the feedback. Just to be clear, I did not propose to only use the latter. That being said, I completely agree with your first point about user simplicity. So how about this: we introduce two new parameters:
So the 2 most common strategies are easily available, and only the k parameter needs to be tuned. If someone wants to make fancier experiments, she can do that by setting
I'd rather propose for
From the user perspective, such combinations are a better match with research papers. Finally, I am not sure about the proposed extended naming. Proposed docstring (revised version of my previous proposal):
Yes, it is. You even acknowledged it.
No. Neighbors are not fixed in either of the two strategies. You clearly have been ignoring my (numerous) comments so far. Listen Олег, I'm trying to consider your opinions with care and attention. I really value Surprise users' feedback. But I feel like you're really not returning the favor here, in this thread or in the others you have opened. We're almost 40 comments down, and the discussion hasn't really moved forward since the beginning. We are still stuck on details that should not be blocking us. I really don't want to pull the "I'm in charge so things should be the way I think" card. But I'm tired of explaining the same things over and over.
Nicolas, I'm sorry it looks like that. Regarding this particular issue: I am trying hard to come up with as simple an explanation of the API params as possible, as I see a self-evident API as a great value for any lib. That is the reason I am so stubborn about the param naming. (Although I am also confused about the long conversation we are having here, I see it as highly productive, as we have clarified for ourselves the classical approaches to u-u and i-i kNN implementations, which is certainly worth some time. A clear API is also worth some effort, and may be tricky to design, as it is a subjective thing.)
Sorry for the late reply, I've been quite busy. I completely agree that the neighborhood techniques proposed in Surprise could be improved, and your contributions are very relevant. But I'm still convinced that going for the 'fix*' option name is really misleading, and I fail to understand why you still want to push it after all my efforts to convince you. I'll try to re-write below what I previously tried to explain. Let's call method A the method that is currently implemented, and method B the one you propose. The only difference between the two methods is how they define the neighborhood. In what follows I will exemplify using a user-user similarity, but the item-item case is strictly symmetric; the distinction between user-user and item-item is completely irrelevant here. Let
In both methods, the weighted average for the rating prediction is computed using the users in the selected neighborhood. Regarding implementation, the steps described above may not be the optimal ones. For example, for method B it makes sense, as you suggested, to store the neighborhoods once and for all. Please tell me if I'm missing something or if this is not clear. I'm completely OK with going for something else than 'extended' and 'restricted'. But if what I've written above is correct, then any notion of 'fixed' or 'non-fixed' should be avoided.
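The two neighborhood definitions can be put side by side in a sketch (my own naming and assumed data structures; u-u case, the i-i case is symmetric):

```python
import heapq

def neighborhood_a(u, i, sim, raters_of, k):
    """Method A: the k nearest neighbors of u AMONG the users who rated i."""
    return set(heapq.nlargest(k, (v for v in raters_of[i] if v != u),
                              key=lambda v: sim[u][v]))

def neighborhood_b(u, i, sim, raters_of, k):
    """Method B: the k nearest neighbors of u overall, intersected with
    the raters of i, so the usable set may hold fewer than k users."""
    nearest = heapq.nlargest(k, sim[u], key=lambda v: sim[u][v])
    return set(nearest) & raters_of[i]
```

Neither set is "fixed" across items: A re-selects the candidates per item, and B's intersection shrinks differently per item.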
You put the difference between the two methods absolutely correctly.
Yes, you got it right: that was my idea of "fixed neighbors". Just not all of them are actually used, while their potential number is fixed. It may be something like this (third alternative):
If you don't like this alternative, please provide your version of the docstring, and let's use it.
This was an informative conversation with good arguments. What is the end decision?
I guess the consensus is that the 'new' proposed neighborhood method is definitely worth implementing. If it were to be implemented, my proposal is in #131 (comment). The arguments in favor of other options haven't convinced me so far.
The current approach in KNNBasic finds the closest k neighbours among whichever users are available and computes the prediction on them. Although it helps to get at least some kind of prediction for each item, it actually compromises the basic idea of comparability of ratings for different items:
With the current implementation, one item can get a higher prediction than another, and we might recommend it, while its prediction might have been derived from some remote neighbours, and is therefore unreliable.
So, for K-means we need to find the k nearest neighbours first, and then consider only their ratings to compute predictions.
BTW, it makes sense to store the neighbours within the fitted model, in order to avoid repeating the costly lookup for the closest neighbours. And to make the fitted model more compact, we can get rid of the rest of the similarity matrix after fitting the model, as we don't actually need it for predictions.
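A toy example of the concern (made-up numbers): with the current strategy, an item rated only by very dissimilar users can still receive a confident-looking top prediction.

```python
# u's only available "neighbors" for item j are all remote from u.
sims    = {'v1': 0.05, 'v2': 0.02}   # near-zero similarities to u
ratings = {'v1': 5.0,  'v2': 5.0}    # both happen to rate j highly

# Similarity-weighted average over whoever rated j:
pred = (sum(sims[v] * ratings[v] for v in sims)
        / sum(sims.values()))        # close to 5.0, the maximum score
```

The prediction is near the top of the scale even though it rests entirely on near-zero similarities, which is the comparability problem described above.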