Online Modeling - Topic Representation Loss and Mapping Confusion #946
Thank you for your kind words and the extensive description!
What is happening here is that:

```python
# We perform a partial fit on our documents.
model.partial_fit(docs)
predictions = model.topics_

# After loading the model, we can assign topics as follows:
predictions = model.transform(unseen_docs)
```

You are using the …
For my use case, we can assume that I have an infinite stream of documents coming in, so I thought the best approach would be to classify them and fit the model at the same time. In that sense, there are no "unseen docs". With other variations, this approach has been working quite well for me. I am using `find_topics` to satisfy search queries, not to classify. I was more specifically wondering how topics change from one batch to the next: in successive runs of `partial_fit` with new documents, how do the previously created topics change? I have observed that sometimes they stay the same, and sometimes they change to be very different. In other words, can I expect documents classified into topic 1 in `partial_fit` run 1 to be similar to documents classified into topic 1 in `partial_fit` run 10?
Ah, thank you! I will look more closely at OnlineCountVectorizer. Do you see any reason that I shouldn't set the decay to 0?
No, that actually goes against the "online" part of online machine learning. The idea here is that as you train on successive batches, the model learns more as it receives more information. As such, what it learned in batch 1 might no longer be relevant by batch 10. Moreover, what it learned in batch 1 might actually be incorrect, since the model only had limited data at that point; by batch 10 it knows the correct representation.
With online machine learning, the idea is often that you want to classify the most current information, and information from years ago might be less relevant. Having some sort of decay factor helps put the emphasis on current data.
@MaartenGr The options confuse me a bit. MiniBatchKMeans is used for clustering; for online learning, shouldn't one use river? I was under the assumption that MiniBatchKMeans wouldn't work well for continuous learning. Am I missing something here?
@vantubbe Although MiniBatchKMeans is a clustering algorithm, it takes in batches of data, which allows for online learning of the clusters. River also provides clustering algorithms, but ones optimized for online use cases.
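For illustration, here is a minimal sketch of how MiniBatchKMeans consumes batches via `partial_fit`; the data, dimensions, and parameter values here are hypothetical:

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

kmeans = MiniBatchKMeans(n_clusters=5, random_state=42)

# Feed the model one batch at a time; each call updates the
# existing centroids instead of refitting from scratch.
for _ in range(10):
    batch = np.random.rand(1_000, 384)  # e.g. a batch of embeddings
    kmeans.partial_fit(batch)

# The fitted centroids can then be used to label new data.
labels = kmeans.predict(np.random.rand(100, 384))
```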
That makes sense, thank you for the clarification! Could you please clarify one additional thing about the decay parameter? Going by your documentation on the OnlineCountVectorizer, if decay is set to 0.01, the frequencies in the bag-of-words matrix are decreased by only 1% at each iteration. However, your comment above seems to imply that it could decay too quickly. I would assume that 0.01 implies a fairly slow decay, although I also assume the frequency at which one iterates over training batches needs to be considered too. Thank you again for the help!
The impact of decay depends not only on its value but also on the number of batches you run. For example, if you have a decay of 10% and run only 2 batches, then the impact will not be that big. However, if you have a decay of 0.01% and a million batches, then that quickly adds up!
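A quick back-of-the-envelope sketch of that compounding effect (the numbers are illustrative):

```python
# After n batches, a term's frequency is scaled by (1 - decay) ** n.
decay = 0.0001  # i.e. 0.01% per batch

for n in (2, 1_000, 1_000_000):
    print(n, (1 - decay) ** n)

# 2         -> ~0.9998  (barely noticeable)
# 1_000     -> ~0.905   (about 10% of the weight gone)
# 1_000_000 -> ~3.7e-44 (the original frequencies are effectively erased)
```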
I started experimenting with the river approach you gave as an example in the docs. This isn't necessarily relevant to the original issue, but I figured it may be helpful to point out a slightly improved custom River class (at least for my use case). I found the quality to be high with DBSTREAM; however, the processing time gets a bit out of control for large datasets. `river.cluster.DBSTREAM` calls `self._recluster` on every call to `predict_one`. I found success circumventing this by implementing the `partial_fit` function a bit differently than the example in the docs.
… The speed-up is huge. I think we can close this issue. However, if you can see any problems with the way I have done this, let me know! … Looks like this is being actively discussed on the river repo: online-ml/river#1086
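The original snippet is elided above, but a minimal sketch of the idea described, reclustering once per batch rather than once per `predict_one` call, could look like the following. The monkey-patching trick and all names here are illustrative, and `_recluster` is a private river method that may change between versions:

```python
from river import cluster, stream

class River:
    """Sketch of a BERTopic-compatible wrapper around a river clusterer."""

    def __init__(self, model):
        self.model = model

    def partial_fit(self, umap_embeddings):
        # Learn every point in the batch first; no prediction yet.
        for x, _ in stream.iter_array(umap_embeddings):
            self.model.learn_one(x)

        # Recluster once for the whole batch, then temporarily disable
        # reclustering so predict_one does not repeat it per sample.
        self.model._recluster()
        original_recluster = self.model._recluster
        self.model._recluster = lambda: None
        try:
            self.labels_ = [
                self.model.predict_one(x)
                for x, _ in stream.iter_array(umap_embeddings)
            ]
        finally:
            self.model._recluster = original_recluster
        return self
```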
@emarsc Great feedback, and thank you for including the above code. When using River, would you mind sharing which dimensionality reduction you chose to use in your BERTopic model? I originally used UMAP, but that does not support incremental learning and I assume is not a great fit for use with River. When I switched to …
@vantubbe I am using incremental PCA along with river's DBSTREAM. I have been unable to find a more suitable dimensionality reduction algorithm; incremental UMAP might be possible and is being discussed, but doesn't seem to be implemented yet. For my data, with 5 components, the default DBSTREAM configuration was unable to find any meaningful clusters (if any clusters at all). I changed the DBSTREAM clustering_threshold parameter from 1 (the default) to 0.5 and I am now getting meaningful clusters. I have also increased the components to anywhere from 10 to 25, and this seems to help as well. In combination this seems satisfactory for me, but I have not fully validated it yet. I am going to look a bit further into the DBSTREAM algorithm and parameters, and I will let you know if I find more success! I do not recall where, but I think I read something from @MaartenGr suggesting you could train UMAP on a sufficiently large dataset and still use it with online clustering if you don't expect your data to change much; it all probably depends on your use case.
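For reference, a configuration along these lines might look as follows; the parameter values are just picked from the ranges mentioned above, and the `River` class is the wrapper sketched earlier:

```python
from sklearn.decomposition import IncrementalPCA
from river import cluster
from bertopic import BERTopic
from bertopic.vectorizers import OnlineCountVectorizer

topic_model = BERTopic(
    umap_model=IncrementalPCA(n_components=15),  # 10-25 discussed above
    hdbscan_model=River(cluster.DBSTREAM(clustering_threshold=0.5)),
    vectorizer_model=OnlineCountVectorizer(),
)
```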
@emarsc Good deal, thanks for the details! I was also looking for incremental UMAP and found AlignedUMAP, an implementation of UMAP intended for temporal/online learning. Unfortunately, I have no idea how to use it; it requires a …
I did the same! I had to set … Thanks again, excited to hear about future breakthroughs!
@emarsc I was reviewing this issue since I have a similar use case: I am looking to initially model a dataset of ~10k to 100k docs and then, using the River package, update the saved topic model twice per month with another ~10-20k docs. I have been testing the river functionality for this and have tried to implement your river speed-up (#946 (comment)), but regardless of the custom River class I use, the clustering never finishes. I have tested it with ~2k docs and it appears to work, although once I expand to ~30k docs, the modeling doesn't complete even after 18+ hours. Typically this would take ~30 to 90 minutes when doing normal incremental modeling (90 minutes when using closer to 100k docs). Could there be something I am missing here for the speed-up, or is there something else that you have found that helps? I appreciate any suggestions or new ideas! As a side note, I am using GPU acceleration to improve speed. For reference, here is my current River class & modeling code:
…
@mdcox It might be worthwhile to check where the model slows down. You have several lines of code in your …
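A simple harness for that kind of check could look like this; the helper function and labels are hypothetical, just to show the idea of timing each stage of `partial_fit` separately:

```python
from time import perf_counter

def timed(label, fn, *args, **kwargs):
    """Run fn, print how long it took, and pass its result through."""
    start = perf_counter()
    result = fn(*args, **kwargs)
    print(f"{label}: {perf_counter() - start:.2f}s")
    return result

# e.g. wrap each stage of a custom partial_fit:
#   timed("learn loop", learn_all, embeddings)
#   timed("recluster", model._recluster)
#   timed("predict loop", predict_all, embeddings)
```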
@MaartenGr Good note (back to basics!), thank you for the suggestion! I have added tqdm to the code to track the timing of each part. The estimated tqdm completion time jumps around a lot; I included two screenshots below as an example, where one second it is ~1.5 hours and then it jumps to ~5 hours. It also appears to be slowly getting longer, i.e., the time estimates are generally increasing (by 10% progress the minimum estimate no longer dips below ~3 hours, whereas before it was ~1 hour). The top end of the estimate also grows to ~10 hours, whereas before it was ~5 hours. This is all in the first loop of the …
@MaartenGr Have you seen slowdowns in the …
@mdcox I am not entirely sure, but it might be related to online-ml/river#1086. From my side, I do not think there is much to do aside from using a different clustering algorithm. Since it seems to be algorithm-specific, it might be worthwhile to ask the maintainer of the River package for help.
Sounds good! Thank you for the help!
I am having a couple of issues with online topic modeling. I have read all of the relevant documentation (I think), but I am still unsure if what I am experiencing is a bug, or if what I am trying to do is simply not supported. Seeking some clarification.
Use case
My use case involves continuously training a model on new data (always running, millions of documents a day).
The model is initialized as:
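(A sketch consistent with the components discussed in this thread; the exact parameter values here are assumptions.)

```python
from sklearn.cluster import MiniBatchKMeans
from sklearn.decomposition import IncrementalPCA
from bertopic import BERTopic
from bertopic.vectorizers import OnlineCountVectorizer

model = BERTopic(
    umap_model=IncrementalPCA(n_components=5),
    hdbscan_model=MiniBatchKMeans(n_clusters=50, random_state=0),
    vectorizer_model=OnlineCountVectorizer(stop_words="english", decay=0.01),
)
```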
Fitting:
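(This matches the snippet quoted in the first reply above.)

```python
# We perform a partial fit on our documents.
model.partial_fit(docs)
predictions = model.topics_

# After loading the model, we can assign topics as follows:
predictions = model.transform(unseen_docs)
```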
Issue
I have two main issues / points of confusion: the topic representations seem to get lost between runs, and I am confused about how topics are mapped from one run to the next.
Here is an example:
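A minimal sketch of the behavior in question, with hypothetical batches (`batch_1`, `batch_2`) and made-up topic words standing in for the original example:

```python
model.partial_fit(batch_1)
print(model.get_topic(1))
# e.g. [('gpu', 0.04), ('cuda', 0.03), ('nvidia', 0.02), ...]

model.partial_fit(batch_2)
print(model.get_topic(1))
# topic 1 may now have a completely different representation,
# e.g. [('recipe', 0.05), ('oven', 0.04), ('baking', 0.03), ...]
```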
I have tried to read the code and documentation on topic_mapper_. It seems this is how the shift in topics is supposed to be mapped, but I am not having much luck figuring out how to leverage that data structure; the changed topics don't seem to map to anything.
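For what it's worth, the mapper can be inspected roughly like this; `get_mappings` is used internally by BERTopic's TopicMapper, so its exact behavior may differ between versions:

```python
# Inspect how topic IDs have been remapped across partial_fit runs.
mappings = model.topic_mapper_.get_mappings()
print(mappings)  # e.g. {0: 3, 1: 1, 2: 0, ...}, original ID -> current ID
```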
...
Thank you for taking the time. Clarification on whether or not what I am experiencing is expected would be greatly appreciated.
Also, this package rocks. Great work.