-
Notifications
You must be signed in to change notification settings - Fork 37
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implement fit_partial()
for ImplicitALSWrapperModel
and LightFMWrapperModel
#179
Conversation
fit_partial()
for ImplicitALSWrapperModel
and LightFMWrapperModel
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks a lot for such a great feature!
I added some comments, if you don't agree with them, let's discuss. I'm not 100% sure I understand the use case, so may be some of the comments are wrong. But if you agree with them, then it's fine.
rectools/dataset/dataset.py
Outdated
@@ -245,3 +258,72 @@ def get_raw_interactions(self, include_weight: bool = True, include_datetime: bo | |||
pd.DataFrame | |||
""" | |||
return self.interactions.to_external(self.user_id_map, self.item_id_map, include_weight, include_datetime) | |||
|
|||
def construct_new_datasets( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably construct_new_dataset
would be more reasonable since only one dataset is returned
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd call it build_new_dataset_on_id_map
or smth like it. But we need to think about the conception first.
Right now it's not working properly in the conceptions that we use. We have a rule that all hot users (users that have interactions) always go before warm users (users that have only features). And we use this assumption in some of our impertant internal logic.
New method can break this rule because new users with interactions will have ids after warm users from previous dataset state. So we really can't make this a public method in the library.
Options are:
- Rewrite logic. We need to keep only hot users ids from previous state. Then add new hot users ids. Then add new warm user ids. I suppose we are dropping all previous warm users because we are totally changing the actual data. (All the same is for items) In this case I'd call it or
rebuild_with_new_data
orconstruct_with_new_data
orreplace_data
- Just don't make it a public method. Move it to the tests if we need it there. But then again write a correct public method for required logic. Also tests will be needed for it anyway
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for pointing it out. I wasn't aware of the warm users' assumption. And yeah, I expected this to be a controversial change. And that's why I haven't written the test yet 😅
Originally, I expected for RecTools' Dataset
to do similar to LightFM's Dataset.fit_partial()
, but I found that RecTools Dataset
is frozen.
I'm okay to make it private, or I can rebuild IdMap
by concatenating new and old users/items something like:
old_hot_user_id_map = IdMap.from_dict({e: i for e,i in zip(ds.user_id_map.convert_to_external(ds.get_hot_users()), ds.get_hot_users())})
old_hot_item_id_map = ...
new_user_id = old_hot_user_id_map.add_ids(...
I'm not confident if it's okay to drop the previous warm user yet though,.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, I attempt to drop old warm ids, but I face UnknownIdError
.
RecTools/rectools/dataset/features.py
Line 121 in f73d054
raise UnknownIdError("All ids in `df` must be present in `id_map`") |
Maybe, storing hot/warm marks like hot_ids
in IdMap can be an option like np.array([1, 0, 1])
where 1
represents hot and 0
represents warm id. Then, we can fetch hot or warm users/items explicitly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The overall conception of ids in rectools Dataset
are the following:
- All user ids are row indexes in both user-interactions matrix and user features matrix and vector model embeddings matrix
- Because we don't wont empty rows in user-interactions matrix we can't have warm user ids before hot ones. Empty rows are bad because some models will add them to loss calculation which is conceptually wrong.
- So we always need to have all hot user ids -> then all warm user ids. They have the same order in user features matrix
- That's why we don't need
hot_ids
flag. But we always need to make sure we keep the assumption of hot/warm ids order
As for keeping warm user ids in the dataset it is very simple. If those users don't have features in user features matrix in the final dataset then they need to be dropped from user_id_map. Cause this is the only application of those warm ids - to get features for them (during recommend
method)
So if you are dropping feature values for some users - you should drop their ids from IdMap.
Models do not store any embeds for warm users so nothing will break.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As for the error, one possible workaround is to keep all (old) hot user features inside the dataset.
For sparse features we can drop the values for memory efficiency.
For dense features don' really see what we can do using current implementations. Need to think about it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, since we are not dropping "old hot" user embeddings from the models, why should we drop their features from dataset? For recommend
method of incrementally trained model we still need all features in the dataset for all users that ever were hot.
So I think that no-hot-any-more users that are not present in new user_features_df should keep all their values in user_features
matrix.
But all users (both warm and hot) that are present in new user_features_df
should have all their values updated (real service scenario).
Now about corner cases. I think that for incremental training we need to require exactly the same number of features, feature names and feature types as were present in the original dataset. At least in this PR. Or we gonna dive too deep. So all this checks for correct data in features are needed. And cat_user_features
and make_dense_user_features
should not be passed to rebuild_with_new_data
.
Items follow same logic as users.
Wow. That's a tough story :) What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, let me revive my dropping code and see what it happens on CI. I'll share it later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed to drop old warm ids e8929c5
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@blondered I think I completed implementing the "Rewrite logic" option, but let me know if you find anything else.
@@ -11,6 +11,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0 | |||
### Added | |||
- `Debias` mechanism for classification, ranking and auc metrics. New parameter `is_debiased` to `calc_from_confusion_df`, `calc_per_user_from_confusion_df` methods of classification metrics, `calc_from_fitted`, `calc_per_user_from_fitted` methods of auc and rankning (`MAP`) metrics, `calc_from_merged`, `calc_per_user_from_merged` methods of ranking (`NDCG`, `MRR`) metrics. ([#152](https://github.com/MobileTeleSystems/RecTools/pull/152)) | |||
- `nbformat >= 4.2.0` dependency to `[visuals]` extra ([#169](https://github.com/MobileTeleSystems/RecTools/pull/169)) | |||
- Implement `fit_partial()` for `ImplicitALSWrapperModel` and `LightFMWrapperModel` ([#179](https://github.com/MobileTeleSystems/RecTools/pull/179)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm very excited about the new feature. Thank you for the PR!
Let's discuss what we're trying to achieve with fit_partial
.
In implicit ALS has partial_fit_users
and partial_fit_items
methods that allow to update weights only for specific parts of users/items and also support feeding new users and items to already fitted model.
LightFM model has different logic. It has fit_partial
method. From docs: "Unlike fit, repeated calls to this method will cause training to resume from the current model state". This method does not support adding new users/items. And it is not meant to update only parts of embeddings.
What exactly do we want to achieve with fit_partial
in rectools? I would say that first step would be to follow LightFM logic. We are missing the ability to resume training, out fit
always refits models from the beginning, just like LightFM fit
.
As for partial_fit_users
and partial_fit_items
- these are very model specific ALS methods. I wouldn't call them fit_partial
. I would move them to a separate issue and PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point.
My motivation for adding fit_partial
is to enable incremental training. For example, weekly batches for full training and daily incremental training to adopt new and existing user transactions.
I don't seriously care about preserving old users/items. Rather, shortening training time would be important.
Actually, LightFM has a known workaround for adding a new user/item. lyst/lightfm#347 (comment)
I included that workaround into the patch. ea33c05
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great feature for LigtFM, thank you!
So there are two major usage cases for fit_partial
:
- Incremental training and support for adding new items and features. This is executed when dataset is changing during calls.
- Epochal training. This is executed when dataset is staying the same during calls. It just resumes training from the previous state like
Lightfm.fit_partial
Right now in LightFM you implemented support for both cases but in ALS only for one of them (incremental). Could you please add support for ALS epochal training (with NotImplementedError
if features are present)? Otherwise users would be very confused for different behaviour of the same method. And could you please add an optional epochs
argument to fit_partial
of ALS?
We would also need tests for both models. With fitting model for 10 epochs with fit
and 10 times fitting models with fit_partial
for 1 epoch. And finals embeds should be equal.
How do you feel about the whole story?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good. Let me add tests for re-training with the same dataset.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added the tests 117c5c7
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@blondered I think 117c5c7 shows how it works for the use case 2.
Please let me know if you'd like to add anything else.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! Sorry for the delay. It's summer and it was a vacation time :)
- Add know user test case LightFM - Add _resize() function to resize the user and item embeddings - Use deepcopy only if not model is not fitted - Allow to pass epochs to fit_partial() - Handle sample_weight for K-OS WARP loss ImplicitALS - Raise NotFittedError if fit_partial() is called before fit()
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## main #179 +/- ##
===========================================
Coverage 100.00% 100.00%
===========================================
Files 45 56 +11
Lines 2242 3520 +1278
===========================================
+ Hits 2242 3520 +1278 ☔ View full report in Codecov by Sentry. |
Based on discussion, we decided to drop the old warm id in previous interactions. This is by RecTools design, and we should follow it. MobileTeleSystems#179 (comment)
@blondered @feldlime I think all comments are addressed. Please have a look and let me know if you need anything. |
rectools/dataset/dataset.py
Outdated
@@ -245,3 +258,72 @@ def get_raw_interactions(self, include_weight: bool = True, include_datetime: bo | |||
pd.DataFrame | |||
""" | |||
return self.interactions.to_external(self.user_id_map, self.item_id_map, include_weight, include_datetime) | |||
|
|||
def construct_new_datasets( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As for the error, one possible workaround is to keep all (old) hot user features inside the dataset.
For sparse features we can drop the values for memory efficiency.
For dense features don' really see what we can do using current implementations. Need to think about it.
rectools/dataset/dataset.py
Outdated
@@ -245,3 +258,72 @@ def get_raw_interactions(self, include_weight: bool = True, include_datetime: bo | |||
pd.DataFrame | |||
""" | |||
return self.interactions.to_external(self.user_id_map, self.item_id_map, include_weight, include_datetime) | |||
|
|||
def construct_new_datasets( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, since we are not dropping "old hot" user embeddings from the models, why should we drop their features from dataset? For recommend
method of incrementally trained model we still need all features in the dataset for all users that ever were hot.
So I think that no-hot-any-more users that are not present in new user_features_df should keep all their values in user_features
matrix.
But all users (both warm and hot) that are present in new user_features_df
should have all their values updated (real service scenario).
Now about corner cases. I think that for incremental training we need to require exactly the same number of features, feature names and feature types as were present in the original dataset. At least in this PR. Or we gonna dive too deep. So all this checks for correct data in features are needed. And cat_user_features
and make_dense_user_features
should not be passed to rebuild_with_new_data
.
Items follow same logic as users.
Wow. That's a tough story :) What do you think?
@@ -309,3 +330,78 @@ def _handle_features( | |||
item_features=item_features_new, | |||
) | |||
return filtered_dataset | |||
|
|||
def rebuild_with_new_data( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A lot of things could go wrong with features because we don't do any checks. And user will not know about wrong behaviour.
Let's add on_unconsistent_features
argument with options "raise" and "ignore". If "raise" is specified, then after features creation there needs to be a check that old dataset feature names are exactly the same as new feature names (and they are also in the same order). If something is not right then ValueError
is raised
@@ -90,6 +90,26 @@ def _fit(self, dataset: Dataset) -> None: # type: ignore | |||
self.verbose, | |||
) | |||
|
|||
def _fit_partial(self, dataset: Dataset) -> None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We've discussed it a lot. Let's not do ALS support for fit_partial
for now.
Main reasons are:
- ALS wrapper has different behaviour then LightFM wrapper in current implementation. ALS cannot do
fit_partial
on model that has not yet been fitted. We can't leave it as it is. - Features support should be here for ALS. So we're gonna have to rewrite some of our code for features support in
fit
to add support forfit_partial
and also forepochs
parameter. And it's a long story. It should be in another PR.
actual, | ||
) | ||
|
||
def test_fit_partial_with_same_dataset(self, dataset: Dataset) -> None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please make this a test where factors are being compared. Same recommendations on small toy data are not really a signal that fit_partial
and fit
are doing exactly the same logic. Please fit
model for 10 epochs. Then fit_partial
10 times for one epoch on another copy of model. And then please compare that factors are the same
@@ -11,6 +11,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0 | |||
### Added | |||
- `Debias` mechanism for classification, ranking and auc metrics. New parameter `is_debiased` to `calc_from_confusion_df`, `calc_per_user_from_confusion_df` methods of classification metrics, `calc_from_fitted`, `calc_per_user_from_fitted` methods of auc and rankning (`MAP`) metrics, `calc_from_merged`, `calc_per_user_from_merged` methods of ranking (`NDCG`, `MRR`) metrics. ([#152](https://github.com/MobileTeleSystems/RecTools/pull/152)) | |||
- `nbformat >= 4.2.0` dependency to `[visuals]` extra ([#169](https://github.com/MobileTeleSystems/RecTools/pull/169)) | |||
- Implement `fit_partial()` for `ImplicitALSWrapperModel` and `LightFMWrapperModel` ([#179](https://github.com/MobileTeleSystems/RecTools/pull/179)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! Sorry for the delay. It's summer and it was a vacation time :)
Thanks for dedicating your time and effort to my PR. Your support is invaluable and greatly appreciated. I decided to close this PR because
My assumption for user/item features was:
In our case, we mainly handle interactions only with Implicit. I'm glad to know some code in this PR works without features for ALS. I know I could be better off joining your Telegram channel but installing another chat tool is not what I want. |
Thank you so much for your code and you time! We do provide users with very simple interface for all of the models and datasets, but it requires us to do a lot of work under the hood. The story with features support in ALS is huge on its own. And support for incremental dataset there is also huge. If you would like to contribute in the future and you don't want to communicate in telegram, we can make discussions in github issues, no problem. |
Description
Introduce fit_partial for ImplicitALSWrapperModel and LightFMWrapperModel.
Expected usage:
Due to the complexity of explicit user/item feature fitting of ALS, I don't support them for fit_partial to ALS with features for now.
Close #176
Type of change
How Has This Been Tested?
Before submitting a PR, please check yourself against the following list. It would save us quite a lot of time.