Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement fit_partial() for ImplicitALSWrapperModel and LightFMWrapperModel #179

Closed
wants to merge 12 commits into from

Conversation

chezou
Copy link
Contributor

@chezou chezou commented Aug 13, 2024

Description

Introduce fit_partial for ImplicitALSWrapperModel and LightFMWrapperModel.

Expected usage:

dataset = ...

new_interactions = ...

new_dataset = dataset.rebuild_with_new_data(interactions_df=new_interactions)
model.fit_partial(new_dataset)

Due to the complexity of explicit user/item feature fitting of ALS, I don't support them for fit_partial to ALS with features for now.

Close #176

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Optimization

How Has This Been Tested?

Before submitting a PR, please check yourself against the following list. It would save us quite a lot of time.

  • Have you read the contribution guide?
    • Yes
  • Have you updated the relevant docstrings? We're using Numpy format, please double-check yourself
    • Yes
  • Does your change require any new tests?
    • Yes
  • Have you updated the changelog file?
    • Yes

@chezou chezou changed the title Implement fit_partial Implement fit_partial() for ImplicitALSWrapperModel and LightFMWrapperModel Aug 13, 2024
Copy link
Collaborator

@feldlime feldlime left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot for such a great feature!

I added some comments, if you don't agree with them, let's discuss. I'm not 100% sure I understand the use case, so may be some of the comments are wrong. But if you agree with them, then it's fine.

rectools/dataset/dataset.py Outdated Show resolved Hide resolved
@@ -245,3 +258,72 @@ def get_raw_interactions(self, include_weight: bool = True, include_datetime: bo
pd.DataFrame
"""
return self.interactions.to_external(self.user_id_map, self.item_id_map, include_weight, include_datetime)

def construct_new_datasets(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably construct_new_dataset would be more reasonable since only one dataset is returned

Copy link
Collaborator

@blondered blondered Aug 14, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd call it build_new_dataset_on_id_map or smth like it. But we need to think about the conception first.

Right now it's not working properly in the conceptions that we use. We have a rule that all hot users (users that have interactions) always go before warm users (users that have only features). And we use this assumption in some of our impertant internal logic.

New method can break this rule because new users with interactions will have ids after warm users from previous dataset state. So we really can't make this a public method in the library.

Options are:

  • Rewrite logic. We need to keep only hot users ids from previous state. Then add new hot users ids. Then add new warm user ids. I suppose we are dropping all previous warm users because we are totally changing the actual data. (All the same is for items) In this case I'd call it or rebuild_with_new_data or construct_with_new_data or replace_data
  • Just don't make it a public method. Move it to the tests if we need it there. But then again write a correct public method for required logic. Also tests will be needed for it anyway

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for pointing it out. I wasn't aware of the warm users' assumption. And yeah, I expected this to be a controversial change. And that's why I haven't written the test yet 😅

Originally, I expected for RecTools' Dataset to do similar to LightFM's Dataset.fit_partial(), but I found that RecTools Dataset is frozen.

I'm okay to make it private, or I can rebuild IdMap by concatenating new and old users/items something like:

old_hot_user_id_map = IdMap.from_dict({e: i for e,i in  zip(ds.user_id_map.convert_to_external(ds.get_hot_users()), ds.get_hot_users())})
old_hot_item_id_map = ...

new_user_id = old_hot_user_id_map.add_ids(...

I'm not confident if it's okay to drop the previous warm user yet though,.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, I attempt to drop old warm ids, but I face UnknownIdError.

raise UnknownIdError("All ids in `df` must be present in `id_map`")

Maybe, storing hot/warm marks like hot_ids in IdMap can be an option like np.array([1, 0, 1]) where 1 represents hot and 0 represents warm id. Then, we can fetch hot or warm users/items explicitly.

Copy link
Collaborator

@blondered blondered Aug 15, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The overall conception of ids in rectools Dataset are the following:

  1. All user ids are row indexes in both user-interactions matrix and user features matrix and vector model embeddings matrix
  2. Because we don't wont empty rows in user-interactions matrix we can't have warm user ids before hot ones. Empty rows are bad because some models will add them to loss calculation which is conceptually wrong.
  3. So we always need to have all hot user ids -> then all warm user ids. They have the same order in user features matrix
  4. That's why we don't need hot_ids flag. But we always need to make sure we keep the assumption of hot/warm ids order

As for keeping warm user ids in the dataset it is very simple. If those users don't have features in user features matrix in the final dataset then they need to be dropped from user_id_map. Cause this is the only application of those warm ids - to get features for them (during recommend method)
So if you are dropping feature values for some users - you should drop their ids from IdMap.
Models do not store any embeds for warm users so nothing will break.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As for the error, one possible workaround is to keep all (old) hot user features inside the dataset.
For sparse features we can drop the values for memory efficiency.
For dense features don' really see what we can do using current implementations. Need to think about it.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, since we are not dropping "old hot" user embeddings from the models, why should we drop their features from dataset? For recommend method of incrementally trained model we still need all features in the dataset for all users that ever were hot.
So I think that no-hot-any-more users that are not present in new user_features_df should keep all their values in user_features matrix.
But all users (both warm and hot) that are present in new user_features_df should have all their values updated (real service scenario).

Now about corner cases. I think that for incremental training we need to require exactly the same number of features, feature names and feature types as were present in the original dataset. At least in this PR. Or we gonna dive too deep. So all this checks for correct data in features are needed. And cat_user_features and make_dense_user_features should not be passed to rebuild_with_new_data.

Items follow same logic as users.

Wow. That's a tough story :) What do you think?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, let me revive my dropping code and see what it happens on CI. I'll share it later.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed to drop old warm ids e8929c5

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@blondered I think I completed implementing the "Rewrite logic" option, but let me know if you find anything else.

rectools/models/lightfm.py Outdated Show resolved Hide resolved
rectools/models/lightfm.py Outdated Show resolved Hide resolved
rectools/models/lightfm.py Show resolved Hide resolved
rectools/models/implicit_als.py Outdated Show resolved Hide resolved
tests/models/test_implicit_als.py Outdated Show resolved Hide resolved
@@ -11,6 +11,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
### Added
- `Debias` mechanism for classification, ranking and auc metrics. New parameter `is_debiased` to `calc_from_confusion_df`, `calc_per_user_from_confusion_df` methods of classification metrics, `calc_from_fitted`, `calc_per_user_from_fitted` methods of auc and rankning (`MAP`) metrics, `calc_from_merged`, `calc_per_user_from_merged` methods of ranking (`NDCG`, `MRR`) metrics. ([#152](https://github.com/MobileTeleSystems/RecTools/pull/152))
- `nbformat >= 4.2.0` dependency to `[visuals]` extra ([#169](https://github.com/MobileTeleSystems/RecTools/pull/169))
- Implement `fit_partial()` for `ImplicitALSWrapperModel` and `LightFMWrapperModel` ([#179](https://github.com/MobileTeleSystems/RecTools/pull/179))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm very excited about the new feature. Thank you for the PR!

Let's discuss what we're trying to achieve with fit_partial.
In implicit ALS has partial_fit_users and partial_fit_items methods that allow to update weights only for specific parts of users/items and also support feeding new users and items to already fitted model.
LightFM model has different logic. It has fit_partial method. From docs: "Unlike fit, repeated calls to this method will cause training to resume from the current model state". This method does not support adding new users/items. And it is not meant to update only parts of embeddings.

What exactly do we want to achieve with fit_partial in rectools? I would say that first step would be to follow LightFM logic. We are missing the ability to resume training, out fit always refits models from the beginning, just like LightFM fit.

As for partial_fit_users and partial_fit_items - these are very model specific ALS methods. I wouldn't call them fit_partial. I would move them to a separate issue and PR.

@chezou @feldlime what do you think?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point.

My motivation for adding fit_partial is to enable incremental training. For example, weekly batches for full training and daily incremental training to adopt new and existing user transactions.

I don't seriously care about preserving old users/items. Rather, shortening training time would be important.

Actually, LightFM has a known workaround for adding a new user/item. lyst/lightfm#347 (comment)
I included that workaround into the patch. ea33c05

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great feature for LigtFM, thank you!

So there are two major usage cases for fit_partial:

  1. Incremental training and support for adding new items and features. This is executed when dataset is changing during calls.
  2. Epochal training. This is executed when dataset is staying the same during calls. It just resumes training from the previous state like Lightfm.fit_partial

Right now in LightFM you implemented support for both cases but in ALS only for one of them (incremental). Could you please add support for ALS epochal training (with NotImplementedError if features are present)? Otherwise users would be very confused for different behaviour of the same method. And could you please add an optional epochs argument to fit_partial of ALS?

We would also need tests for both models. With fitting model for 10 epochs with fit and 10 times fitting models with fit_partial for 1 epoch. And finals embeds should be equal.

How do you feel about the whole story?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good. Let me add tests for re-training with the same dataset.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added the tests 117c5c7

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@blondered I think 117c5c7 shows how it works for the use case 2.

Please let me know if you'd like to add anything else.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! Sorry for the delay. It's summer and it was a vacation time :)

- Add know user test case

LightFM
- Add _resize() function to resize the user and item embeddings
- Use deepcopy only if not model is not fitted
- Allow to pass epochs to fit_partial()
- Handle sample_weight for K-OS WARP loss

ImplicitALS
- Raise NotFittedError if fit_partial() is called before fit()
Copy link

codecov bot commented Aug 14, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 100.00%. Comparing base (9b3992e) to head (dd84f10).
Report is 77 commits behind head on main.

Additional details and impacted files
@@             Coverage Diff             @@
##              main      #179     +/-   ##
===========================================
  Coverage   100.00%   100.00%             
===========================================
  Files           45        56     +11     
  Lines         2242      3520   +1278     
===========================================
+ Hits          2242      3520   +1278     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@chezou
Copy link
Contributor Author

chezou commented Aug 30, 2024

@blondered @feldlime I think all comments are addressed. Please have a look and let me know if you need anything.

@@ -245,3 +258,72 @@ def get_raw_interactions(self, include_weight: bool = True, include_datetime: bo
pd.DataFrame
"""
return self.interactions.to_external(self.user_id_map, self.item_id_map, include_weight, include_datetime)

def construct_new_datasets(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As for the error, one possible workaround is to keep all (old) hot user features inside the dataset.
For sparse features we can drop the values for memory efficiency.
For dense features don' really see what we can do using current implementations. Need to think about it.

@@ -245,3 +258,72 @@ def get_raw_interactions(self, include_weight: bool = True, include_datetime: bo
pd.DataFrame
"""
return self.interactions.to_external(self.user_id_map, self.item_id_map, include_weight, include_datetime)

def construct_new_datasets(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, since we are not dropping "old hot" user embeddings from the models, why should we drop their features from dataset? For recommend method of incrementally trained model we still need all features in the dataset for all users that ever were hot.
So I think that no-hot-any-more users that are not present in new user_features_df should keep all their values in user_features matrix.
But all users (both warm and hot) that are present in new user_features_df should have all their values updated (real service scenario).

Now about corner cases. I think that for incremental training we need to require exactly the same number of features, feature names and feature types as were present in the original dataset. At least in this PR. Or we gonna dive too deep. So all this checks for correct data in features are needed. And cat_user_features and make_dense_user_features should not be passed to rebuild_with_new_data.

Items follow same logic as users.

Wow. That's a tough story :) What do you think?

@@ -309,3 +330,78 @@ def _handle_features(
item_features=item_features_new,
)
return filtered_dataset

def rebuild_with_new_data(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A lot of things could go wrong with features because we don't do any checks. And user will not know about wrong behaviour.
Let's add on_unconsistent_features argument with options "raise" and "ignore". If "raise" is specified, then after features creation there needs to be a check that old dataset feature names are exactly the same as new feature names (and they are also in the same order). If something is not right then ValueError is raised

@@ -90,6 +90,26 @@ def _fit(self, dataset: Dataset) -> None: # type: ignore
self.verbose,
)

def _fit_partial(self, dataset: Dataset) -> None:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We've discussed it a lot. Let's not do ALS support for fit_partial for now.
Main reasons are:

  1. ALS wrapper has different behaviour then LightFM wrapper in current implementation. ALS cannot do fit_partial on model that has not yet been fitted. We can't leave it as it is.
  2. Features support should be here for ALS. So we're gonna have to rewrite some of our code for features support in fit to add support for fit_partial and also for epochs parameter. And it's a long story. It should be in another PR.

actual,
)

def test_fit_partial_with_same_dataset(self, dataset: Dataset) -> None:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please make this a test where factors are being compared. Same recommendations on small toy data are not really a signal that fit_partial and fit are doing exactly the same logic. Please fit model for 10 epochs. Then fit_partial 10 times for one epoch on another copy of model. And then please compare that factors are the same

@@ -11,6 +11,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
### Added
- `Debias` mechanism for classification, ranking and auc metrics. New parameter `is_debiased` to `calc_from_confusion_df`, `calc_per_user_from_confusion_df` methods of classification metrics, `calc_from_fitted`, `calc_per_user_from_fitted` methods of auc and rankning (`MAP`) metrics, `calc_from_merged`, `calc_per_user_from_merged` methods of ranking (`NDCG`, `MRR`) metrics. ([#152](https://github.com/MobileTeleSystems/RecTools/pull/152))
- `nbformat >= 4.2.0` dependency to `[visuals]` extra ([#169](https://github.com/MobileTeleSystems/RecTools/pull/169))
- Implement `fit_partial()` for `ImplicitALSWrapperModel` and `LightFMWrapperModel` ([#179](https://github.com/MobileTeleSystems/RecTools/pull/179))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! Sorry for the delay. It's summer and it was a vacation time :)

@chezou
Copy link
Contributor Author

chezou commented Sep 3, 2024

Thanks for dedicating your time and effort to my PR. Your support is invaluable and greatly appreciated.

I decided to close this PR because

  • There can still be substantial unwritten assumptions about user/item features that are unknown to me as an outsider. And it seems the gap is becoming a huge blocker to understand the intention of comments.
  • Implicit ALS's user/item features handling and Dataset structure may need drastic changes to adopt warm feature handling.

My assumption for user/item features was:

  • There should be user/item tables along with the interaction table.
  • Those tables' columns are additive during the incremental training phase. If a column is removed, we can do full training.
  • Looking at the Lightfm's model resize workaround, it looks like features are additive, not dropped.

In our case, we mainly handle interactions only with Implicit. I'm glad to know some code in this PR works without features for ALS.

I know I could be better off joining your Telegram channel but installing another chat tool is not what I want.

@chezou chezou closed this Sep 3, 2024
@blondered
Copy link
Collaborator

Thank you so much for your code and you time!

We do provide users with very simple interface for all of the models and datasets, but it requires us to do a lot of work under the hood. The story with features support in ALS is huge on its own. And support for incremental dataset there is also huge.
Another problem was with the mapping of features in the old dataset and in the new one. Our methods were not written to support incremental features so they could completely break the place of a specific feature in the dataset features table and in the model feature embeddings table. User would not know about this, but model performance would get hurt a lot. Because old embeddings would be mapped to completely different features from new dataset.

If you would like to contribute in the future and you don't want to communicate in telegram, we can make discussions in github issues, no problem.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

fit_partial function for ImplicitALSWrapperModel and LightFMWrapperModel
3 participants