
Bug: kNN computes predictions on the k closest among whatever neighbors it finds, instead of a fixed set of k neighbours #131

Open
ODemidenko opened this issue Jan 26, 2018 · 43 comments


@ODemidenko
Contributor

ODemidenko commented Jan 26, 2018

The current approach in KNNBasic finds whatever the closest k neighbours happen to be and computes the prediction from them. Although this helps to get at least some kind of prediction for each item, it actually compromises the basic idea that predictions for different items should be comparable:
With the current implementation, one item can get a higher prediction than another, and we might recommend it, while its prediction might have been derived from some remote neighbours, and is therefore unreliable.
So, for K-means - we need to find the k nearest neighbours first, and then consider only their ratings to compute predictions.

BTW, it makes sense to store the neighbours within the fitted model, in order to avoid repeating the costly lookup for the closest neighbours. And to make the fitted model more compact, we can get rid of the rest of the similarity matrix after fitting the model, as we don't actually need it for predictions.

@NicolasHug
Owner

NicolasHug commented Jan 26, 2018

With the current implementation, one item can get a higher prediction than another, and we might recommend it, while its prediction might have been derived from some remote neighbours, and is therefore unreliable

This is not really a bug, it's just the way k-NN is defined. Using a neighborhood that's actually far-away is a well-known drawback of the neighborhood methods.
Also note that Surprise does exactly what it advertises: it is clear from the docs that only the neighbors having rated the target item are used (for user-user similarities).

I'm not sure why you're mentioning k-means here?

BTW, it makes sense to store the neighbours within the fitted model, in order to avoid repeating the costly lookup for the closest neighbours. And to make the fitted model more compact, we can get rid of the rest of the similarity matrix after fitting the model, as we don't actually need it for predictions.

Storing the neighbors during training requires extra work during the prediction step: we can only consider neighbors that have rated the target item so we'll need to filter out those who did not. Also, storing a U x I matrix may or may not be more memory efficient than storing a UxU or IxI matrix, depending on the number of users and items.

In the current implementation, finding the k-nearest neighbors is fairly efficient because the sorting is done only over the other users that have rated the target item (in the case of user-user similarities). So we're sorting a fairly short list each time instead of sorting a huge list once and for all.
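A rough sketch of this per-prediction lookup (toy code, not Surprise's actual implementation: `sim` as a dict-of-dicts of user-user similarities and `ir[i]` as a list of `(user, rating)` pairs are simplified stand-ins for its internal structures):

```python
import heapq

def estimate_extended(u, i, sim, ir, k, default=0.0):
    """Current ("extended") strategy: the candidates are only the users
    who rated item i, so each prediction sorts a short list."""
    candidates = [(sim[u].get(v, 0.0), r) for v, r in ir[i] if v != u]
    k_neighbors = heapq.nlargest(k, candidates, key=lambda t: t[0])
    # Weighted average over the (at most k) retained neighbors
    num = sum(s * r for s, r in k_neighbors if s > 0)
    den = sum(s for s, _ in k_neighbors if s > 0)
    return num / den if den > 0 else default
```

Since `ir[i]` is typically much shorter than the full user list, each `nlargest` call stays cheap.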

If you want to implement the method you suggest I'd be happy to see the results!

Nicolas

@ODemidenko
Contributor Author

ODemidenko commented Jan 27, 2018

This is not really a bug, it's just the way k-NN is defined. Using a neighbourhood that's actually far-away is a well-known drawback of the neighbourhood methods.

As far as I understand the kNN idea, it is not defined this way. Rather, we should always use the same fixed neighbours and recommend only those items that were rated by some of these fixed neighbours. Thus, the method can't always compute a prediction for any given item, which is a well-known drawback of kNN.
I formed my vision from:
"Recommender Systems", by Dietmar Jannach; Markus Zanker; Alexander Felfernig, Published by Cambridge University Press, 2010. (not sure whether it has a direct formula, but it becomes clear from the text)
And a course: https://www.coursera.org/learn/collaborative-filtering, where, besides the lectures covering the topic, a kNN example calculation is given in Excel within a second-week exercise, with examples of the correct results. I decided to reproduce these computations with the surprise lib instead of Excel, and this is how I found that surprise returns incorrect results (they don't match the examples from the exercise).

Finally, there is a common-sense argument. Consider an example: if we choose the closest k neighbours among those who have rated some item, the top-N recommendation list will fail:

  • Imagine item A with the highest marks given by fairly remote k neighbours (having 0.000001 correlation, as the library currently only checks that sim > 0). Its weighted mean mark will be 5.
  • Item B is rated slightly worse by our k nearest neighbours (having correlation = 1), with a weighted mean mark of 4.95.
    Within "top-N" recommendations sorted by rating, this approach would place item A, recommended by neighbours with an average similarity of 0.000001, above item B. Doesn't this seem strange?
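The arithmetic behind this example can be checked in a few lines (hypothetical numbers, just illustrating the weighted-mean formula):

```python
def weighted_mean(pairs):
    # pairs: (similarity, rating) tuples for the neighbors actually used
    num = sum(s * r for s, r in pairs)
    den = sum(s for s, _ in pairs)
    return num / den

# Item A: top marks, but only from nearly-uncorrelated neighbors
item_a = weighted_mean([(0.000001, 5.0), (0.000001, 5.0)])
# Item B: slightly lower marks from perfectly correlated neighbors
item_b = weighted_mean([(1.0, 5.0), (1.0, 4.9)])
# item_a (5.0) outranks item_b (4.95) despite its unreliable support
```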

@NicolasHug
Owner

A major issue with your approach is that, as it's fairly unlikely that the k nearest neighbors of a user have rated the target item, prediction is most of the time impossible, and in the rare cases where it is possible, it relies on only a few ratings (much fewer than k). So it's not reliable either. Obviously these are just conjectures of mine; it should be verified empirically whether this holds.

I think we can agree that both approaches have pros and cons.

The one that is currently implemented is, as far as I know, the most commonly used. See e.g. The recommender system handbook by Ricci et al., or The recommender system textbook by Aggarwal (I find the latter to be much more useful).

@ODemidenko
Contributor Author

Nicolas, what I found in the book you suggested seems to support the vision of kNN which I described:
The recommender system textbook by Aggarwal, p.9 :

" In general, the k most similar users to Bob can be used to make rating predictions
for Bob. "
...
For example, it might be difficult to find sufficiently similar users to Bob, who have rated
Gladiator. In such cases, it is difficult to robustly predict Bob’s rating of Gladiator. In
other words, such methods might lack full coverage of rating predictions. Nevertheless,
the lack of coverage is often not an issue, when only the top-k items are required.

@NicolasHug
Owner

Well TBH these statements are true for both strategies.

I agree though that it's not so clear which strategy the author is referring to. E.g. p.36 eq. 2.4:

Let P u(j) be the set of k closest users to target user u, who have specified ratings for item j.

I always thought that this means we should first filter the users who haven't rated i, and then see which ones are the closest. But thanks to your remarks I understand now that it's not so clear. It could mean either your definition, or the one I've been using.

The same goes for the handbook:

The k-nearest-neighbors (k-NN) of u, denoted by N (u),
are the k users v with the highest similarity w_uv to u. However, only the users who
have rated item i can be used in the prediction of r_ui , and we instead consider the
k users most similar to u that have rated i. We write this set of neighbors as Ni(u).

It's not clear whether Ni(u) is a subset of N(u) (your understanding) or not (mine).

In any case, if you want to implement your solution please go for it!

@ODemidenko
Contributor Author

ODemidenko commented Jan 27, 2018

You mean I may substitute the current logic of .predict() for all kNN algorithms?
Do you confirm that I can also adjust the .fit() logic in order to store only the top-k neighbours and their respective similarities (instead of keeping the full sim matrix)?
I believe it makes sense to keep the full similarity matrix computation, but make it optional, only for those who want to double-check the similarity matrix computation (e.g. I compared it to the one obtained from pandas).

Refactoring plan

Aim - least changes to the current code. Optional access to the current sim matrix.

  • Move "Compute similarity" from AlgoBase into SymmetricAlg (as it belongs to kNN only)
  • Insert compute_kNN in the end of every fit method
  • make optional .sim matrix (by default - kill it after kNN is computed)
  • change .prediction() for all kNN
  • add some tests
  • Optionally - add top_n_recommendation(uid)

@NicolasHug
Owner

NicolasHug commented Jan 27, 2018

you mean I may substitute current logic for .predict() for all kNN algorithms?

No, I think both options should be available, and the current one (even if it may not be the most natural for some) should be the default, to preserve backward compatibility. This could be set via a parameter neighbors_set whose values could be either extended (default) or conservative (the new version). This is just a suggestion of course, but the possible values should be as self-explanatory as possible. Also, predict() should not be changed: only fit() and estimate().

Do you confirm that I can also adjust .fit() logic in order to store only top-k neighbours and their respectful similarity (instead of keeping full sim matrix)?

As I said earlier I'm not sure this would improve performance. If it does improve performance in some cases (whether it's computation time or memory use), then I'd be happy to support it. But it should be done in a separate PR as this is for the most part unrelated to the neighbors selection.

Move "Compute similarity" from AlgoBase into SymmetricAlg (as it belongs to kNN only)

Good idea

Insert compute_kNN in the end of every fit method

I'm not sure what you mean?

make optional .sim matrix (by default - kill it after kNN is computed)

Ok if that's useful, but in a different PR (see remark above)

change .prediction() for all kNN

There is no prediction() method, did you mean estimate()?

add some tests

YES :)

Optionally - add top_n_recommendation(uid)

There is already a guide for this in the FAQ if I'm not mistaken.

Good luck in any case, and thank you for addressing this :) !
If you haven't already, please check out the contribution guidelines. If they're not clear enough, let me know and I'll try to improve them.

@NicolasHug
Owner

Quick update: I checked mymedialite and librec versions of kNN. If I understand their code correctly, they both compute the set of nearest neighbors and only then filter out those not having rated the target item, which supports your version :)

@ODemidenko
Contributor Author

So, since other libraries, as you have checked, implement it this way - I propose changing the default behaviour of the current kNN implementation, rather than introducing the new behaviour as just an option.
The old estimate() behaviour, I believe, should be abolished as inconsistent with other implementations, while an optional full sim matrix might still be available after fitting the kNN, in order to make it easier to verify the correctness of the similarity computations.

@NicolasHug
Owner

Like I said, I prefer to keep the current version as the default one for backward compatibility reasons. If we change the default algorithm, then some people will find that their predictions have changed when upgrading to the new version, even though they haven't changed their code.

@ODemidenko
Contributor Author

I believe that being consistent with the other (standard) way of implementing kNN is more important than being consistent with previous versions. I don't see this change in predictions as a bad thing in itself, if it provides better top-N recommendations.
And I am concerned that if we keep the current behaviour as the default, new users will get these inconsistent predictions, as in many cases people just stay with the default options. Besides, I see value in keeping the .sim property for analysis, but I don't see any value in keeping the opportunity to get current-style predictions at all, as they potentially provide wrong top-N recommendations.

I feel strongly about the need to change the behaviour. And if you are concerned about letting users know, you could, for instance, just bump the major library version (2.x), as one should usually expect some change in behaviour in a new major version (if we add top-N prediction to the API, such a version change will be even more justifiable, as it signals the opportunity to use a built-in method rather than implementing it through some adapter over the library).

@NicolasHug
Owner

NicolasHug commented Jan 28, 2018

I understand your point. Changing the API and default behaviour in the next major release might be the solution. But for now, let's just keep the current strategy as the default one: we can change the default later.

@ODemidenko
Contributor Author

I consider leaving the default behaviour as is to be bug proliferation, as it means that all new users will get the wrong default behaviour, and old users, by default, will also get deteriorated top-N predictions. So, I won't agree to implement it this way (leaving the current behaviour as the default).
I also want to find a way to address your concern about informing users about the change of behaviour.
So, let's find some way to achieve both goals simultaneously: either immediately releasing a new major version, or some other way (like updating the library description to mention the change of behaviour, or adding some additional note to the prediction results).

@NicolasHug
Owner

NicolasHug commented Jan 28, 2018

I consider leaving default behaviour as is - a bug proliferation. [...] all new users will have wrong default behaviour

Once again, the current implementation is neither wrong nor correct. It's not a bug. It's just the way it is, and it does exactly what it claims it does. It's a variant of the neighborhood approach, just like the one you propose is another variant (and there are many, many variants). They both have pros and cons, and we may prefer one or the other depending on our goals. If that's of any comfort to you, Crab proposes both strategies for selecting the neighbors.

old users, by default, also will get deteriorate top-N predictions

That's still to be (empirically) proven. The version that ends up being the default will be the one that performs best (the criteria being accuracy and top-N metrics, computation time, and memory usage).

If the new method ends up being the default one, this will happen on a new major release (note that this is still hypothetical).

Also, we won't release version 2.0 for this change alone. Version 2.0 will be out once I finish working on implicit feedback (I can't give any timeline right now). In the meantime, nothing prevents us from releasing 1.0.6 with your changes, keeping the current neighborhood as the default one. That would give us the opportunity to add a warning in the doc indicating that the default will probably change in future releases.

@ODemidenko
Contributor Author

You are right - both strategies for defining the k neighbours are used.
But they are commonly associated with the method:

  • non-fixed k (as you have implemented it) is used for item-item kNN.
  • fixed k is used for user-based kNN.

I propose implementing this as the default behaviour, without any params, to avoid cluttering the API, i.e. leaving the current behaviour as is for user_based=False, and changing it for user_based=True.

Have a look at the Args docstring for kNN that we would otherwise have to add. I would prefer to avoid it:

fix_k_neighbors(boolean): Defines whether k neighbours used to
            predict similarity are fixed, i.e. same neighbours are used for any
            item, or k neighbours are obtained for each item as the closest
            neighbours among those who actually rated this item (for user_based 
            similarity) or among all similar items (for item-item similarity).
            Default = False
            - For item-item kNN - usual strategy is to obtain k
            closest items, among rated by current user (use default)
            - For user_based kNN - to obtain some kind of prediction for 
            a greater number of items use non-fixed neighbours (default). 
            To obtain consistent ranking among top-N predictions - use fixed k.


@NicolasHug
Owner

But they are commonly associated with the method [...]

I didn't know that, thanks for the info. Would you have any reference?

I propose to implement such default behaviour, without any params, to avoid cluttering API, i.e. leave current behaviour as is for user_based=False, and change it for user_based=True.

If what you're saying above is true, then changing the default depending on user_based is indeed a good option (defaults should only change in the next major release though; I will not change my mind on this :) ). I still want to provide users with the possibility to change it, so there should still be a parameter.

Surprise is about letting users have complete control over their experiments, so we need to make everything explicit and tunable. So we should be able to choose whatever neighborhood we want. I don't mind long docstrings as long as they're clear and useful.

@ODemidenko
Contributor Author

As I have just realized - they pick any k only because they have another variable that confines the neighborhood in which to look for this k. So, this can't be used even for i-i without adding an additional "model size" param.

This was in the video lectures of my Coursera course:
The item-item model is often truncated to only keep M neighbors per item. The scorer uses at most k neighbors to compute each prediction. M must be significantly larger than k, because the user we’re predicting for may not have rated all M of the neighbor items; we need enough potential neighbors to be able to find k of them among the user’s ratings.
So, here "M" is the same as "k" in the current surprise implementation

Original paper, describing item-item CF:
Badrul Sarwar, George Karypis, Joseph Konstan, and John Riedl. 2001. Item-based collaborative filtering recommendation algorithms. In Proceedings of the 10th international conference on World Wide
web (WWW '01). ACM, New York, NY, USA, 285-295.
See 4.4:
A model size of l means that we only considered l best similarity values for model building and later on used k of them for the prediction generation process, where k < l.

So, it seems that users should either compute similarity with a "fixed k" approach, or we should add a constraint on the model size (a fixed number of neighbors among which to look for the k closest). Would you agree to add such a param right away? It is logical, as it will allow both neighbor-lookup behaviors to be used effectively, in both user-based and item-based approaches.

@ODemidenko
Contributor Author

Although, I could add this param in a separate PR. It won't result in double work when changing the code.

@NicolasHug
Owner

Sorry Олег, I'm not sure I completely understand everything, but I don't see how this supports your claim that different strategies are preferred depending on whether we're computing similarities between users or items?

@ODemidenko
Contributor Author

ODemidenko commented Jan 29, 2018

I don't see how this supports your claim that the different strategies are preferred depending on whether we're computing similarities between users or items?

In short: for item-item recommendations, it is not the same k neighbours that are used, but the k closest items among those that were already rated by the user.

@NicolasHug
Owner

Ok but it is simply specific to the proposed algorithm in this paper, not a general recommended practice, right?

Anyway, let's keep things simple.
The best course of action for now is to add a neighborhood parameter to all k-NN algorithms, which can be either extended (default, the current strategy) or conservative. This parameter indicates how the neighborhood is defined (u and i being the target user and item):

  • extended: the set of neighbors is defined as the set of k closest users to u, among those that have rated i.
  • conservative: the set of neighbors is defined as the set of k closest users to u, regardless of the existence of their rating for i.

These are the definitions for user-user similarities. The item-item case is analogous.

If the conservative approach is used, then indeed the similarity matrix could be thrown away once all neighbors are stored. We could allow this option via another parameter, but this should be done in another PR: all things in good time.
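For illustration, the two strategies could be sketched roughly like this (a toy helper, not the actual Surprise API; `sim` as a dict-of-dicts and `raters_of_i` as a set are simplified stand-ins for the real data structures):

```python
import heapq

def select_neighbors(u, i, sim, raters_of_i, k, mode="extended"):
    others = [v for v in sim[u] if v != u]
    if mode == "extended":
        # k closest among those who rated i (current behaviour)
        pool = [v for v in others if v in raters_of_i]
        return heapq.nlargest(k, pool, key=lambda v: sim[u][v])
    # "conservative": k closest overall, then drop non-raters of i
    closest = heapq.nlargest(k, others, key=lambda v: sim[u][v])
    return [v for v in closest if v in raters_of_i]
```

Note that with "conservative", the returned list can be shorter than k, or even empty, when the nearest neighbors did not rate i.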

Nicolas

@ODemidenko
Contributor Author

I would rather keep the parameter as I specified above (fix_k_closest).
The idea of a neighborhood parameter is tricky, because the item-item paper I linked above uses another parameter to confine the neighborhood ("model size"). And I believe we should pay attention to this paper, as it is the one that originally introduced the notion of item-item kNN.

@NicolasHug
Owner

fix_k_closest is misleading and does not reflect any of the two behaviours.

neighborhood is versatile enough: we can allow different possible values later if we want to add more neighborhood selection strategies, such as the one of the paper for example.

That being said, just because this is the paper introducing item-item doesn't mean we should implement it: it's more than 15 years old and practices evolve a lot.

@ODemidenko
Contributor Author

fix_k_closest is clear, together with a good param description. Previously you agreed on the example of the description.
The bad thing about the "neighborhood" name is that it complicates the later introduction of another "model size"/"neighborhood size" param, which is the recommended contemporary approach to item-item recommendations, according to the 2016 Coursera course.

@NicolasHug
Owner

NicolasHug commented Jan 30, 2018

fix_k_closest - is clear, together with a good param description

No, it's not. The number of neighbors is not fixed in either of the methods.

Previously you agreed on the example of the description.

No, I did not. I just said that I didn't mind long docstrings as long as they are clear. I'm sorry, but the one you proposed was not.

Can you please expand a bit on the model size / neighborhood size? I checked the paper briefly but did not completely understand what it means, and I don't have time to read it carefully right now. A quick TLDR would be helpful.

@ODemidenko
Contributor Author

ODemidenko commented Jan 30, 2018

k neighbors are absolutely fixed in the method that I advocate.

The idea behind real-world implementations is to decrease the memory consumption of the trained model and make predictions faster, by precomputing as many factors as possible at the model training stage.

So, only the k closest neighbors and their respective similarities need to be kept as part of a user-based model (and that is why I advocate the "fix_k_closest", or "fixed_k", param name).

For item-based models:
The k neighbors are not fixed, as in your current implementation. Probably the reason is that this method is recommended only for situations where there are many more users than items, and users have rated only a small subset of items, while every item was rated by many users. In this situation, it is proposed to look for the k neighbors of a given item among the items rated by the current user.
The optimization approach is to store only a small subset of neighbors for every item (e.g. ~200 items, for k ~ 30), where this "model size" >> k. Thus, the eventual model is still relatively compact (it is approximately n/model_size times smaller than the full similarity matrix, where n is the number of items). This approach allows the method to be used by companies such as Amazon, as it drastically decreases the model size. (Btw, issue #3 reported this very problem - the current model's memory footprint is too big, because it keeps excessive similarities.)
So, for item-item recommendations we may keep the current approach of looking up any neighbors, but in order to make it feasible for production purposes, we should confine the model size by keeping only a subset of the sim matrix, defined by a "model size"/"neighborhood size" param.

ODemidenko added a commit to ODemidenko/Surprise that referenced this issue Jan 30, 2018
@ODemidenko
Contributor Author

Nicolas, the fixed-k approach makes it possible to compute top-N predictions much more efficiently, by looking only at the items that at least a minimum number of the fixed neighbors have jointly rated. So, I propose introducing a get_top_n method for the kNN algorithms, available when the new option is set to True. It will have drastically better performance than the approach you propose in the FAQ (computing predictions for all items).

@NicolasHug
Owner

k neighbors are absolutely fixed in the method that I advocate.

I understand what you mean: they're fixed in the sense that, for a given user, we will always look at the same set of neighbors, which is not the case in the current implementation, where we first filter out the users that have not rated the target item.

However, with the method you propose, the actual set of neighbors used for a prediction is not fixed, because the subset of neighbors that have rated the target item is never the same. Let N(u) be the k nearest neighbors of u, regardless of their ratings for the target item. With the new method, we use a subset of N(u) to make a prediction: only the users in N(u) that have rated i can be used. For a different item j, this subset of neighbors will be different.

This is why I think fix_k_closest or fix_k_neighbors or fixed_k is misleading.


Thanks for the TLDR.

Here is what I understand; please let me know if I got it right or wrong (for simplicity, I consider here that we have a user-user similarity):

  • The model size (l) is the number of stored neighbors for each user. Let us denote N(u) this set for each user. |N(u)| = l for all users.
  • To make a prediction for user u, item i: we look at k < l users in N(u) that have rated i. Obviously we may find less than k users in N(u) that have rated the target item: this number is k' and is the actual number of neighbors that are used for the prediction.
  • This view is a generalization of both the current implementation and the proposed change:
    • the current technique is equivalent to having l = |U| (where U is the set of all users)
    • the proposed change is equivalent to having l = k << |U|.

Is that correct, or am I missing something?
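If I got it right, the generalization could be sketched as follows (toy code with simplified dict-based structures, not Surprise's internals; `fit_truncated` and `usable_neighbors` are hypothetical names):

```python
import heapq

def fit_truncated(sim, l):
    # Keep only the l most similar neighbors per user ("model size")
    return {u: dict(heapq.nlargest(l, row.items(), key=lambda t: t[1]))
            for u, row in sim.items()}

def usable_neighbors(u, i, model, raters_of_i, k):
    # Among the stored l neighbors, use the (at most) k that rated i;
    # the actual count k' may well be smaller than k.
    cands = [(s, v) for v, s in model[u].items() if v in raters_of_i]
    return heapq.nlargest(k, cands)
```

Setting l to the total number of users would recover the current behaviour, while l = k would recover the proposed fixed-neighbors variant.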


Note that issue #3 was about similarity computation using too much memory. This will not be fixed by the proposed changes: the similarity matrix has to be computed anyway, even if it's ultimately freed.


I'm not sure I completely understand what you mean about top-N but honestly I'd rather not focus on that right now as it's fairly unrelated to the current matter.

Nicolas

@ODemidenko
Contributor Author

You got it right with model size, and wrong with fixed k:

  • The fixed-k approach is generally used only for u-u CF. And for it, the model size parameter (l) doesn't make sense at all - it's enough to store only the k fixed nearest neighbors. For any item, the prediction is made as a weighted average over those fixed k neighbors, but at least a minimum number of them should have rated it. So, here the "fixed_k" param is applicable, and it correctly conveys the point.
  • For item-item CF, the approach for finding neighbors remains the same as currently implemented, i.e. k neighbor items are used. But to compress the model, the approach is to store only l neighbors.

To conclude - we need a "fix_k_closest" param to implement the classical u-u approach, and a "model_size" param to implement a more efficient i-i approach.

Regarding top-N prediction - you proposed doing it by predicting a score for each item and then sorting out the highest predictions. But in user-based kNN, we actually won't be able to predict ratings for a large share of the items. It would be computationally cheaper to first select only the subset of items jointly rated by at least a minimum number of the fixed k neighbors, and then make predictions only for those items. So, for user-user CF you don't need to try to predict all ratings, and the approach will be more computationally efficient.
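The candidate-selection shortcut could look roughly like this (a hypothetical helper with simplified structures, not code from the PR):

```python
from collections import Counter

def candidate_items(fixed_neighbors, rated_by, min_support):
    """Only items rated by at least min_support of the fixed neighbors
    can be scored; rated_by[v] is the set of items user v has rated."""
    counts = Counter(i for v in fixed_neighbors for i in rated_by[v])
    return {i for i, c in counts.items() if c >= min_support}
```

Top-N would then only score this (usually small) candidate set instead of the whole item catalogue.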

@ODemidenko
Contributor Author

See my PR; it will probably be easier to get an idea of the user-based fixed-k approach from the code.

@NicolasHug
Owner

NicolasHug commented Jan 30, 2018

You seem to imply once again that we have to change the default. Please don't make me repeat myself for the fifth time.

Also, please don't differentiate between item-item and user-user. Recommended practices may differ depending on the kind of similarity, but that's really not the question we're trying to address right now. I used user-user as an example in my previous post, that's all.

For the reason that I clearly explained before, fix_k_closest is misleading. That won't happen.

I'm sorry man but communication is extremely difficult and I'm starting to lose patience here.

@ODemidenko
Contributor Author

ODemidenko commented Jan 31, 2018

You seem to imply once again that we have to change the default

I didn't imply it; I just tried to explain that for the user-based and item-based cases, library users should usually pick different values for the new option. If you look at my PR, you will see that the change of behaviour happens only for the non-default value of the given option.
If you are dissatisfied with the "fix_k..." name - suggest your own; just keep in mind that we might later need to add another "model size" (~'neighborhood to consider') option, and the names shouldn't be misleading.
(To me, "fix_..." seems like a good name, as the actual set of neighbors, in the classical user-based approach, should be strictly fixed. And that is the optional behavior I propose to introduce.)

@NicolasHug
Owner

just tried to explain that for user_based and item-based cases library users usually should pick different values of the new option.

Ok, but please let's entirely ignore this question for now. I do not want to consider which approach is more suitable when we're using u-u or i-i. For now, I just want to understand how the general method works, regardless of i-i or u-u.

So, I'll rewrite what I understood; please let me know if it's correct. I use u-u here, but only for illustration purposes. The i-i case is strictly symmetric.

  • The model size (l) is the number of stored neighbors for each user. Let us denote N(u) this set for each user. |N(u)| = l for all users.
  • To make a prediction for user u, item i: we look at k < l users in N(u) that have rated i. Obviously we may find less than k users in N(u) that have rated the target item: this number is k' and is the actual number of neighbors that are used for the prediction.
  • This view is a generalization of both the current implementation and the proposed change:
    • the current technique is equivalent to having l = |U| (where U is the set of all users)
    • the proposed change is equivalent to having l = k << |U|.

Is that correct? You told me I misunderstood the use of fixed_k, but you explained it by differentiating between u-u and i-i, so I don't know what I got wrong.

Now, if what I understood is correct, then we can simply implement this directly. That would require:

  • The addition of the model_size parameter, which by default would be equal to the number of users if u-u (or items if i-i), so that the current implementation stays the default one.
  • Some thorough documentation. The combined use of model_size and k is not really trivial, so we should carefully explain what they are and what the typical use-cases are (basically, what their values should be for this or that neighborhood strategy).
  • Some design work. The first issue I foresee is that we'll need a very fast way to know whether a given user has rated a given item. Right now, this lookup can't be done very efficiently (see the ur and ir fields of the Trainset class).

If you see my PR

Sorry no, I don't want to read some code until we completely agree on what the changes will be.


for me it seems that "fix_..." is a good name, as the actual set of members, for classical user_based approach, should be strictly fixed

Do you at least understand and acknowledge my point when I say that they're also not fixed in some sense? I feel like you've been ignoring this so far.

@ODemidenko
Contributor Author

You understand model size absolutely correctly.
Moreover, you came up with a generalization, which I haven't noticed, that setting model size=k will be effectively the same as using "fixed k" approach I was speaking about.
So, technically, if you introduce only the "model size" param, this will cover all the possible use cases.

But, nevertheless, I see a couple of reasons why you might want to keep these two params separately:

  1. Combining them is highly confusing. None of the sources on user-based CF I've seen (and I've read many) have any notion of "model size". I believe they shouldn't, because if you only consider the k closest neighbors, as they usually do, then the notion of "model size" is redundant: it always coincides with k.
    So, imagine how confusing it might be for users to grasp that instead of simply putting {user_based=True, k=30} they suddenly have an unexpected "model size" param.
  2. One of the most common use cases of recommendation systems in general is providing top-n recommendations. Its computational cost is much lower if you know that the neighbors are fixed: you first collect the items to estimate, as only those rated by at least some minimum number of neighbors, and then estimate them and find the top n.
    With model_size > k and a floating k, you can't take such a shortcut and need more complicated logic, at least because you would always search for the k nearest neighbors within N(u) and look up their similarity, a step you can skip entirely when k = model_size.

So, imho, it would be easier for users to have the separate params "fix_k" and "model_size", where the second one is only used when fixed_k=False. We can introduce both params while keeping the current behaviour as default. Later (in the next major version) we could change those defaults for user_based=True. The fixed_k case also allows a transparent and efficient implementation for both estimate and get_top_n.
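The top-n shortcut mentioned in point 2 could be sketched like this (illustrative names only, not Surprise API; candidate items are those rated by at least `min_support` of the user's fixed neighbors):

```python
from collections import Counter

def candidate_items(u, fixed_neighbors, user_ratings, min_support=1):
    """Items rated by at least `min_support` of u's fixed neighbors."""
    counts = Counter()
    for v in fixed_neighbors[u]:
        for item in user_ratings[v]:
            counts[item] += 1
    return [item for item, c in counts.items() if c >= min_support]

fixed_neighbors = {'u1': ['u2', 'u3']}
user_ratings = {'u2': {'i1': 4, 'i2': 5}, 'u3': {'i2': 3}}
print(sorted(candidate_items('u1', fixed_neighbors, user_ratings, min_support=2)))
# → ['i2']  (only i2 was rated by both of u1's neighbors)
```

Only these candidates would then be estimated and ranked, instead of scoring the whole catalogue.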

@ODemidenko
Contributor Author

Some design. The first issue I foresee is that we'll need to know very fast if a given user has rated a given item. Right now, this lookup can't be done very efficiently (see ur and ir fields of the Trainset class).

I am not sure what is the purpose of this task that you foresee. But it can be omitted, as I see it:

  • for fixed-k cases it is not an issue, as we always use the whole set of kNN for prediction
  • for non-fixed k and model_size < N(i). Here is how I see it (for the sake of example, I use the item-item case):
    We find the intersection of the user's vector (rated items) and N(i) of a given item. This is done by checking whether each user_rated_item is in set_of_N(i)_items; thus we get a user_rated_item and its rating. Then we only need to find the i-i similarity with the estimated item. This can be performed as a key lookup within N(i). For this purpose, N(i) should be implemented as a dict of dicts (we can easily compress the .sim matrix into such a form by keeping the top model_size records from each row).
    So, no need to change the Trainset.
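The compression of the similarity matrix into a dict of dicts could look roughly like this (a sketch; `sim` stands for a dense similarity matrix like Surprise's `.sim`, the rest is made up):

```python
import heapq

def compress_sim(sim, model_size):
    """Keep only the `model_size` most similar neighbors per row of a dense
    similarity matrix, as a dict of dicts."""
    compressed = {}
    for x, row in enumerate(sim):
        top = heapq.nlargest(model_size,
                             ((s, y) for y, s in enumerate(row) if y != x))
        compressed[x] = {y: s for s, y in top}
    return compressed

sim = [[1.0, 0.9, 0.2],
       [0.9, 1.0, 0.5],
       [0.2, 0.5, 1.0]]
print(compress_sim(sim, model_size=1))
# → {0: {1: 0.9}, 1: {0: 0.9}, 2: {1: 0.5}}
```

Similarity lookup then becomes a plain dict access, and everything outside the top model_size entries per row is dropped from the fitted model.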

@NicolasHug
Owner

NicolasHug commented Feb 2, 2018

Thanks for the feedback.

Just to be clear, I did not propose to only use the model_size parameter. We should definitely have both k and model_size.

That being said, I completely agree with your first point about user simplicity. So how about this:

We introduce two new parameters: neighborhood, and model_size. k is obviously still a parameter.

  • neighborhood can take values extended (current version), restricted (proposed changes), and custom. When it's set to either extended or restricted, only the k parameter is used and model_size is automatically set accordingly (model_size = |U| or |I|, or model_size = k).
  • model_size is None and ignored by default and we only use it if neighborhood is set to custom. In this case, the user can have greater control.

So the 2 most common strategies are easily available and only the k parameter needs to be tuned. If someone wants to make fancier experiments, she can do that by setting neighborhood to custom.
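For concreteness, the parameter resolution could look roughly like this (a sketch using the names proposed above; none of this is existing Surprise code):

```python
def resolve_model_size(neighborhood, k, model_size, n_x):
    """Resolve the effective model size. n_x is |U| for user-user
    similarities, or |I| for item-item."""
    if neighborhood == 'extended':
        return n_x               # current behaviour: all users/items kept
    if neighborhood == 'restricted':
        return k                 # proposed change: model_size = k
    if neighborhood == 'custom':
        if model_size is None or model_size < k:
            raise ValueError('custom neighborhood requires model_size >= k')
        return model_size
    raise ValueError('unknown neighborhood: {}'.format(neighborhood))

print(resolve_model_size('extended', 40, None, 943))    # → 943
print(resolve_model_size('restricted', 40, None, 943))  # → 40
print(resolve_model_size('custom', 40, 100, 943))       # → 100
```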

@ODemidenko
Contributor Author

I'd rather propose for neighborhood to have only two value: extended and restricted:

  • for extended (current, default) have model_size (default=None, which means |U| or |I|)- this will give current behaviour. Optionally, model size can be specified.
  • for restricted - model_size is not applicable at all.

From a user perspective: such a combination is a better match with the research papers, as extended logically goes with a flexible model size, while for restricted the notion of model size is absent (it is redundant).

Finally, I am not sure whether the extended/restricted options are good names:
extended/restricted, imho, convey a less clear notion than the "fixed" and "non-fixed" names. Actually, I don't like the "non-fixed" name either. But 'fixed_neighbors' seems way more obvious to me than "restricted neighborhood".
A good test for a param name: let's try to imagine its docstring.
You would have to write that either "always the same neighbors are used" or "always the same (fixed) neighbors are used" ("always the same" carries the notion of "fixed" here). I can't imagine how you could use the word "restricted" (or its synonyms) to describe the notion of a "restricted neighborhood" param.

Proposed docstring (revised version of my previous proposal):

        fix_neighbors(boolean): Defines whether neighbors used to make
            predictions are fixed, i.e. same fixed neighbors are used for any
            prediction, or neighbors are obtained as the closest among those
            who actually rated this item (for user_based=True)
            (for item_based - among items rated by the same user).
            Default = False
            - For item-item kNN - usual strategy is to obtain k
              closest items, among all items rated by current user
              (fix_neighbors=False)
            - For user_based kNN - the usual strategy to obtain a consistent
              ranking among top-N predictions is fix_neighbors=True.
        model_size (int, default=None): Defines maximum number of neighbors to
            store within the trained model and consider for predictions.
            This param allows decreased model size with trade-off of some
            prediction quality. By default the whole model is stored.
            model_size adjustment is available only for fix_neighbors=False, and
            defines number of neighbors among whom k nearest will be selected.
            Allowed values for model_size > k.
            In [Sarwar, 2001] setting model size at 3% of all possible items
            allowed 98.3% accuracy of the full model.

@NicolasHug
Owner

for restricted - model_size is not applicable at all

Yes, it is. You even acknowledged it.

you will have to write that either "always same neighbors are used" or "always same (fixed) neighbors are used"

No. Neighbors are not fixed in either of the two strategies.

You clearly have been ignoring my (numerous) comments so far.

Listen Олег, I'm trying to consider your opinions with care and attention. I really value Surprise's users' feedback. But I feel like you're really not returning the favor here, in this thread or in the others you have opened. We're almost 40 comments down, and the discussion hasn't really moved forward since the beginning. We are still stuck on details that should not be blocking us. I really don't want to pull the "I'm in charge so things should be the way I think" card. But I'm tired of explaining the same things over and over.
Please consider changing the way you approach your contributions.

@ODemidenko
Contributor Author

Nicolas, I'm sorry it looks like that.
Let me explain myself and my approach to contributing to the lib:
I highly appreciate the clear way your lib is designed, great code readability and documentation quality. I compared it to "crab" library, as a previous attempt to design a universal lib, and Surprise looks even better in comparison.
Meanwhile, I currently see Surprise as fairly basic in its kNN methods (and lacking top-N recommendations), as it doesn't support some classical similarity measures and implementation heuristics. I am trying to contribute there, while keeping the architecture as simple and decoupled as it is now, and maintaining your exacting coding standards.

Regarding this particular issue: I am trying hard to come up with as simple an explanation of the API params as possible, as I see a self-evident API as of great value for any lib. That is the reason I am so obdurate about keeping model size as a param for the extended case, and about having a good name for this restricted/extended pair.
But I am not stubborn and can implement it with the 'restricted' name instead of 'fixed'. It's just that I can't come up with a simple explanation for your choice of names. This is the only reason I propose "fix" as a better name (imho). If you provide a clear docstring for your choice of names, I can implement all the logic (I actually implemented and tested it long ago, to pass my course).

(Although I am also puzzled by how long this conversation has become, I see it as highly productive, as we have clarified for ourselves the classical approaches to U-U and I-I kNN implementations, which is certainly worth some time. A clear API is also worth some effort, and may be tricky to design, as it is a subjective thing.)

@NicolasHug
Owner

Sorry for the late reply, I've been quite busy.

I completely agree that the neighborhood techniques proposed in Surprise could be improved, and your contributions are very relevant.

But I'm still convinced that going for the 'fix*' option is really misleading, and I fail to understand why you still want to push it, after all my efforts to convince you.

I'll try to re-write below what I previously tried to explain.


Let's call method A the method that is currently implemented, and method B the one you propose. The only difference between the two methods is how they define the neighborhood. In what follows I will exemplify using a user-user similarity, but the item-item case is strictly symmetric. The distinction between user-user and item-item is completely irrelevant here.

Let N_k(u, i) be "the set of k-nearest neighbors of u that have rated the target item i". This set is defined in different ways according to method A and method B.

  • Method A: to compute N_k(u, i), we first find U_i, the set of users that have rated i. Then, we sort this set U_i according to a similarity measure with u. We end up with a sorted set U_i'. The first element in U_i' is the user v in U_i for which sim(u, v) is maximal. Here, N_k(u, i) is defined as the k first users v of U_i'.

  • Method B: to compute N_k(u, i), we consider the set of all users U, regardless of their ratings for i. We then sort all these users according to a similarity measure with u. We end up with a sorted set U'. The first element in U' is the user v in U for which sim(u, v) is maximal. Here, N_k(u, i) is defined as the k first users v of U' that have rated i.

In both methods, the weighted average for the rating prediction is computed using the users in N_k(u, i). In both methods, this set N_k(u, i) changes when either u or i changes (it may not change sometimes, but that's just a matter of chance).

Regarding implementation, the steps described above may not be the optimal ones. For example, for method B it makes sense, as you suggested, to store N_k(u): the set of k nearest neighbors of u, regardless of their ratings for i (this is basically the first k users of U'). This makes the computation of N_k(u, i) very easy and fast. I think the reason you believe the neighborhood is fixed in method B is that you are thinking of this set N_k(u), which indeed is always fixed. But this is not the neighborhood we use in method B! We use N_k(u, i), which is never the same, because the users in N_k(u) haven't all rated the same items.
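A tiny self-contained illustration of the two definitions (toy data and made-up helper names, not Surprise internals):

```python
def method_a(u, i, k, all_users, rated_by, sim):
    """Sort only the raters of i, then take the k most similar."""
    raters = [v for v in all_users if v != u and i in rated_by[v]]
    raters.sort(key=lambda v: sim[u][v], reverse=True)
    return raters[:k]

def method_b(u, i, k, all_users, rated_by, sim, l):
    """Take the l globally most similar users N_l(u), then keep the first
    k of them that have rated i."""
    others = sorted((v for v in all_users if v != u),
                    key=lambda v: sim[u][v], reverse=True)[:l]
    return [v for v in others if i in rated_by[v]][:k]

sim = {'u1': {'u2': 0.9, 'u3': 0.8, 'u4': 0.1}}
rated_by = {'u2': set(), 'u3': {'i1'}, 'u4': {'i1'}}
users = ['u1', 'u2', 'u3', 'u4']
print(method_a('u1', 'i1', 2, users, rated_by, sim))       # → ['u3', 'u4']
print(method_b('u1', 'i1', 2, users, rated_by, sim, l=2))  # → ['u3']
```

In the toy run, method A reaches out to the distant u4, while method B only ever draws from the fixed pool {u2, u3} and ends up with fewer than k neighbors, which is exactly the trade-off discussed in this thread.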


Please tell me if I'm missing something or if this is not clear.

I'm completely OK with going for something else than 'extended' and 'restricted'. But if what I've written above is correct, then any notion of 'fixed' or 'non-fixed' should be avoided.

@ODemidenko
Contributor Author

You put the difference between two methods absolutely correctly.

I think that the reason you think the neighborhood is fixed i method B is because you think of this set N_k(u), which indeed, is always fixed.

Yes, you got it right - that was my idea of "fixed neighbors". It's just that not all of them are actually used, while the pool they are drawn from is fixed.
I see why my proposed name might be misleading. But 'extended' and 'restricted' neighborhood seems too vague to me.
So, the challenge is to find unambiguous and clear names.
I don't have an ideal solution here. If you can give a clear docstring for your name proposal - let's use it.

It may be something like this (third alternative):

neighborhood (str): Defines how neighbors are selected
    Options: 'universally nearest neighbors' / 'item-specific neighbors'
    - 'universally nearest neighbors' - neighbors are selected from all users/items as having the highest 
    similarity. Same neighbors are used for any item prediction. 
    This is usual approach for `user_based`=True recommendations
    - 'item-specific neighbors' - nearest neighbors are selected among those who rated the item of 
    interest 
    (for 'user_based' = True. For user_based=False - the nearest items rated by the user are selected). 
    Therefore the choice of neighbors depends on the rated item. 
    This is the usual approach for `user_based`=False recommendations

    In both cases, actual number of neighbors used to predict the rating may vary, as only neighbors with 
    similarity>0 and those that have rating for an item of interest are used.

If you don't like this alternative - please, provide your version of docstring, and let's use it.

@fisherslo

This was an informative conversation with good arguments.

What is the end decision?

@NicolasHug
Owner

I guess the consensus is that the 'new' proposed neighborhood method is definitely worth implementing.

If it were to be implemented, my proposal is in #131 (comment).

The arguments in favor of other options haven't convinced me so far.
