
Issues with Zero-shot Topic Modeling regarding outliers and future operations #1967

Open
ianrandman opened this issue May 3, 2024 · 3 comments



ianrandman commented May 3, 2024

I am using zero-shot topic modeling and have run into some issues (a subset of which have already been brought up in past issues), plus I have some concerns about the correctness of the resulting topic model. I have two strategies in mind to resolve these, one of which I have already mostly written as a patch to the process for my project.

My flow is roughly:

  1. topic_model = BERTopic(...)
  2. topic_model.fit(...)
  3. topic_model.reduce_topics(...)
  4. topic_model.reduce_outliers(...)
  5. topic_model.update_topics(...)

The Problems

Let's consider the case that some zero-shot topics have matched with some of the documents. The next step is to continue with clustering and normal topic modeling for the remaining documents in self.

In _combine_zeroshot_topics, a dummy model for the zero-shot topics is made and fitted to generate topic representations. This is merged with self (which currently represents the topic model for the unassigned documents that went on to be clustered) using BERTopic.merge_models to create merged_model. Some cleanup is done before attributes in self are replaced with those in merged_model.

The final topics are 0 through (number of matching zero-shot topics - 1), followed by (number of matching zero-shot topics) through (number of matching zero-shot topics + number of clusters - 1). The outlier topic is now treated as a normal topic with an ID equal to the number of matching zero-shot topics.

Outliers

The first issue is the loss of the outlier topic. merged_model._outliers is 0, which makes its way into self._outliers.
This attribute is now wrong, and it will no longer be used correctly in future operations as an offset to exclude the outlier topic.
The -1 topic is now gone, so we have lost any meaningful reference to the existence of outliers. Future operations look at topic -1 to identify outlier documents. This issue is outlined in #1957.
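To make the offset problem concrete, here is a hedged, simplified sketch (not actual BERTopic code) of how an _outliers-style offset is typically used to skip the outlier topic; with the offset incorrectly set to 0 after the merge, nothing gets excluded:

```python
# Simplified illustration: topic IDs are sorted so -1 (if present) comes
# first, and the `outliers` offset is used to slice it away.
def non_outlier_topics(topic_sizes: dict, outliers: int) -> list:
    """Return topic IDs excluding the outlier topic, using the offset."""
    return sorted(topic_sizes)[outliers:]

# With outliers present, offset 1 correctly drops topic -1 ...
assert non_outlier_topics({-1: 10, 0: 5, 1: 3}, outliers=1) == [0, 1]

# ... but after the merge, _outliers == 0 while the former outlier topic
# survives as a regular positive ID, so it is no longer excluded.
assert non_outlier_topics({0: 5, 1: 3, 2: 10}, outliers=0) == [0, 1, 2]
```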

If we call reduce_outliers on a model without outliers, there is no validation. I have not looked into other strategies, but for the c-TF-IDF strategy, the function fails on bow_doc = self.vectorizer_model.transform(outlier_docs) because outlier_docs is empty. A subset of the reduce_outliers issue is described in #1771.

self.c_tf_idf_

Any future operation that references the self.c_tf_idf_ matrix no longer works, because self.c_tf_idf_ is None, a carry-over from merged_model. In my workflow, the self.c_tf_idf_ matrix will be set correctly if I call reduce_topics, because its final step is to extract the topic representations, which includes fitting self.vectorizer_model and self.ctfidf_model and setting self.c_tf_idf_.

Configuration-related attributes

We have now seen that outliers and the c-TF-IDF matrix do not persist through the merge. Any other configuration-related attributes that may be important later are also not persisted. In my case, the relevant ones are self.zeroshot_topic_list, self.vectorizer_model, and self.ctfidf_model, but there are many more. These get reset to their defaults. This especially becomes a problem if representations are ever updated again (which they are, a couple more times in my workflow), because important properties like reduce_frequent_words=True in the ClassTfidfTransformer are no longer set.

This is a direct consequence of

self.__dict__.clear()
self.__dict__.update(merged_model.__dict__)

in _combine_zeroshot_topics.

Correctness of representations

The representations in merged_model, which are copied over into self, are effectively a concatenation of the representations from zeroshot_model and self (i.e. the clustered model). These representations are calculated in part with the c_tf_idf_ matrix, which is produced in part from the size of the vocabulary. The problem is that the vocabularies in zeroshot_model and self are only subsets of the real vocabulary, which includes all documents. As a user, I expect my zero-shot topics to be used at the very start as a sort of filtering step but mostly not to have any effects later on (except the zero-shot topic labels persisting). This implies I expect representations to be calculated the way they would normally be, which is to use all documents.

While the representations calculated in each model separately are not necessarily wrong, I would argue it would be more correct for the representations to be calculated considering all documents. That is, I could call self._extract_topics(...) after fitting, and the resulting representations (barring the special handling of zero-shot topic labels) should not change.
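A minimal, self-contained illustration of the vocabulary-subset argument (this is not BERTopic code; the documents are made up): any statistic computed per sub-model, such as a c-TF-IDF matrix, is shaped by a vocabulary that covers only part of the corpus.

```python
# Each sub-model builds its vocabulary only from the documents it sees,
# so neither vocabulary matches the vocabulary over all documents.
def vocabulary(docs):
    return {word for doc in docs for word in doc.lower().split()}

zeroshot_docs = ["solar panels and wind power", "renewable energy grids"]
clustered_docs = ["deep learning models", "neural network training"]

vocab_zeroshot = vocabulary(zeroshot_docs)
vocab_clustered = vocabulary(clustered_docs)
vocab_all = vocabulary(zeroshot_docs + clustered_docs)

# Each sub-model only ever sees a strict subset of the full vocabulary.
assert vocab_zeroshot < vocab_all and vocab_clustered < vocab_all
# In this toy example the two vocabularies happen to be disjoint.
assert len(vocab_all) == len(vocab_zeroshot) + len(vocab_clustered)
```

Since the c-TF-IDF weights depend on the vocabulary (and on document frequencies across topics), the per-sub-model representations cannot equal what _extract_topics would compute over all documents at once.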

topic_embeddings_

For this one, I am not 100% sure yet whether it is an issue. I would expect that upon merging zeroshot_model and self, merged_model.topic_embeddings_ should essentially just be a concatenation of the two sets of topic embeddings (plus maybe some quirks relating to where that outlier topic goes). In my debugging, it appears that the topic embeddings from zeroshot_model.topic_embeddings_ do not change, but the ones for the topics from clustering (from self.topic_embeddings_) do change, and the cosine similarity of those with the ones in the new merged_model is not even high. Perhaps I am missing something in this analysis, or perhaps BERTopic.merge_models truly is not preserving topic embeddings, which I believe is wrong, as topic embeddings are only a function of the documents in each topic (and their embeddings) in this case. Maybe it's related to the mappings parameter in _create_topic_vectors? Still debugging this one...

Proposed Solutions

reduce_outliers

I discussed the lack of validation for the existence of outliers, which leads to unclear errors. This function should start with a validation check that looks for -1 in self.topics_ (or really any attribute that has a mapping from topic IDs); a more explicit solution is to check the value of self._outliers.
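A sketch of what such a guard could look like (the function name and message are illustrative, not the actual BERTopic signature):

```python
# Hypothetical fail-fast check for reduce_outliers: raise a clear error
# instead of failing deep inside vectorizer_model.transform([]).
def check_outliers_exist(topics: list) -> None:
    if -1 not in topics:
        raise ValueError(
            "No outlier documents (topic -1) found; "
            "reduce_outliers requires a model with outliers."
        )

check_outliers_exist([-1, 0, 1, 0])  # passes silently

try:
    check_outliers_exist([0, 1, 0])
except ValueError as e:
    print("validation caught:", e)
```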

This private attribute can fall out of sync with reality, so I propose deriving it from one of the other attributes. This can look like

@property
def _outliers(self) -> int:
    # Alternative, less efficient solution (scans every document's topic):
    # return int(-1 in self.topics_)
    # Other mapping attributes like self.topic_sizes_ could be used instead.
    return int(-1 in self.topic_labels_)  # O(1) dict-key lookup

One of the specific reasons I propose this is that in my flow, I calculate new topic IDs for all of my documents using topic_model.reduce_outliers(...). This does not change the model at all, so I set topic_model.topics_ = topic_model.reduce_outliers(...). I will later call topic_model.update_topics(...) to update the representations (which also sets the self._outliers attribute), but if I check whether topic_model still has outliers between these operations, topic_model._outliers is no longer a reliable source of information.
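A toy model (not BERTopic itself) showing the derived-property idea in isolation; here the property reads from the label mapping, so it reflects the current state instead of a stale cached integer:

```python
# ToyModel stands in for a topic model; only the attributes needed for
# the sketch are included.
class ToyModel:
    def __init__(self, topics):
        self.topics_ = topics
        self.topic_labels_ = {t: f"topic {t}" for t in set(topics)}

    @property
    def _outliers(self) -> int:
        # Derived on demand from the label mapping; O(1) dict-key lookup.
        return int(-1 in self.topic_labels_)

model = ToyModel([-1, 0, 1, -1])
assert model._outliers == 1

# Reassign topics as reduce_outliers would, rebuilding the label mapping;
# the property stays in sync without anyone remembering to update it.
model.topics_ = [0, 0, 1, 1]
model.topic_labels_ = {t: f"topic {t}" for t in set(model.topics_)}
assert model._outliers == 0
```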

_combine_zeroshot_topics

This is where the bulk of my fixes through patching have been for the previously described problems.

Persisting of attributes from self

I described why I believe configuration-related attributes should not get cleared out. Obviously, (the majority of) attributes that were affected by fitting should be overridden in self by those in merged_model. To be more specific,

# Public attributes
self.topics_ = None
self.probabilities_ = None
self.topic_sizes_ = None
self.topic_mapper_ = None
self.topic_representations_ = None
self.topic_embeddings_ = None
self.topic_labels_ = None
self.custom_labels_ = None
self.c_tf_idf_ = None
self.representative_images_ = None
self.representative_docs_ = {}
self.topic_aspects_ = {}

are the majority of attributes that should get overridden. I have also added umap_model and hdbscan_model as attributes that should come from merged_model, but I am less sure about the implications of these.

I copy all attributes that are not these into a dictionary. The end of the function is changed to

# Update the class internally
self.__dict__.clear()
self.__dict__.update(merged_model.__dict__)
####### Patched Changes #######
self.__dict__.update(properties_to_keep)
###############################

to keep those configuration-related properties. The vectorizer_model and ctfidf_model are kept for their configurations, even though they were only fitted to the clustered documents. This could be problematic. The result of this patch includes self._outliers being set properly.
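The patched dictionary dance can be shown schematically with plain dicts (attribute names and values here are illustrative stand-ins, not real BERTopic state): fitted state comes from the merged model, while configuration is restored from the original self.

```python
# Attributes produced by fitting, which SHOULD come from the merged model.
fitted_attrs = {"topics_", "topic_sizes_", "c_tf_idf_"}

self_dict = {
    "topics_": [0, 1], "topic_sizes_": {0: 1, 1: 1}, "c_tf_idf_": None,
    "vectorizer_model": "CountVectorizer(config)",  # configuration to keep
    "ctfidf_model": "ClassTfidfTransformer(reduce_frequent_words=True)",
}
merged_dict = {
    "topics_": [0, 1, 2], "topic_sizes_": {0: 1, 1: 1, 2: 1},
    "c_tf_idf_": "matrix",
    "vectorizer_model": "CountVectorizer()",        # defaults: would clobber
    "ctfidf_model": "ClassTfidfTransformer()",
}

# Save everything that is NOT fitted state, then replay the clear/update
# from _combine_zeroshot_topics, then restore the configuration on top.
properties_to_keep = {k: v for k, v in self_dict.items() if k not in fitted_attrs}
self_dict.clear()
self_dict.update(merged_dict)
self_dict.update(properties_to_keep)

assert self_dict["topics_"] == [0, 1, 2]  # fitted state from the merge
assert "reduce_frequent_words=True" in self_dict["ctfidf_model"]  # config kept
```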

Missing Outlier Topic

I have described that the resulting model has the zero-shot topics starting at ID 0, followed by the clustered topics, including the outlier topic. Right after the merge happens, the outlier topic is actually in the correct spot with ID -1, but a gap exists after the zero-shot topics where it will be moved to. I let this and the related operations on the representation-related attributes run.

If self._outliers is set to 1, I manually move the outlier topic back to ID -1. This involves three operations. For each of

[
    merged_model.topic_labels_,
    merged_model.topic_sizes_,
    merged_model.representative_docs_,
    merged_model.topic_representations_
]

a new mapping from -1 to the info of the outlier topic (which sits at an ID equal to the number of matching zero-shot topics) is made. The mappings at higher IDs (all of the non-outlier topics that came from clustering) are shifted down by 1. After the shift, the last mapping (at ID number of matching zero-shot topics + number of clusters - 1) is deleted.

The same concept is implemented for merged_model.topics_ (topic assignment for each document) but through the use of numpy operations before it is converted back to a list.
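The dictionary remapping described above can be sketched as a small helper (names are my own; n is the number of matched zero-shot topics):

```python
# Move the outlier topic from ID n back to -1 and shift the clustered
# topics at IDs > n down by one, leaving zero-shot topics untouched.
def restore_outlier_topic(mapping: dict, n: int) -> dict:
    remapped = {}
    for topic_id, value in mapping.items():
        if topic_id == n:
            remapped[-1] = value            # outlier back to -1
        elif topic_id > n:
            remapped[topic_id - 1] = value  # shift clustered topics down
        else:
            remapped[topic_id] = value      # zero-shot topics unchanged
    return remapped

# 2 zero-shot topics (0, 1), outlier at 2, clustered topics at 3 and 4:
sizes = {0: 10, 1: 8, 2: 99, 3: 7, 4: 5}
assert restore_outlier_topic(sizes, n=2) == {0: 10, 1: 8, -1: 99, 2: 7, 3: 5}
```

Applied to each of the four mappings listed above, this also makes the "delete the last ID" step implicit, since no key ends up beyond the shifted range.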

For an implementation that would be appropriate to merge into this repo, I think BERTopic.merge_models should be changed to get this correct from the start.

Missing c_tf_idf_ Matrix and Lack of Correctness

Both of these issues can be handled in the same operation, which can be summarized by merged_model._extract_topics(all_docs, all_embeddings).

We want to update the representations in the merged model, which fits the vectorizer_model and ctfidf_model and sets the c_tf_idf_ matrix, but this time with all documents. The _combine_zeroshot_topics function only accepts the assigned documents (assigned_documents), assigned embeddings (referred to in this function as embeddings), and remaining documents (documents). To get all embeddings, I patch the fit_transform function to pass in the embeddings of the clustered documents. In _combine_zeroshot_topics, embeddings is renamed to assigned_embeddings to keep the naming consistent with _zeroshot_topic_modeling, and embeddings now refers to the embeddings of the clustered documents (documents).

Before extracting topics, I set

merged_model.ctfidf_model = self.ctfidf_model
merged_model.vectorizer_model = self.vectorizer_model

so we use the right configuration for those.

I also make sure afterwards to call merged_model.topic_labels_.update(zeroshot_topic_labels) so the labels do not get overridden by the topic words. I have set zeroshot_topic_labels as a mapping from topic ID to label only for the zero-shot topics.

I have not tested it yet, but I should also call _save_representative_docs to update the representative docs, as these are a function of the c_tf_idf_ matrix and the vectorizer_model and ctfidf_model models, which are different now (by using all docs) than when the representative docs were originally calculated (when the matrix and models were built from vocabulary subsets in zeroshot_model and self, i.e. the clustered model).

fit_transform

So the changes to _combine_zeroshot_topics are lengthy, potentially incomplete, and computationally inefficient (representation model(s) are run again, and the representations from zeroshot_model and self before the merge are essentially thrown out). A simpler solution that I have not written or detailed extensively is to avoid this merging of models in the first place.

The idea starts with separating the assigned documents and assigned embeddings from the documents and embeddings to be clustered -- same as the current implementation. The next steps are dimensionality reduction and clustering -- no changes yet.

Now, instead of performing representation-related operations on just the topics that came from clustering, we can merge the topics from clustering with the topics from zero-shot topic modeling. Some care needs to be taken to ensure the outlier topic stays at -1, but I do not think it matters much whether the zero-shot topics come next or get tacked onto the end (in fact, this alternate ordering is proposed in #1771). The representations will be calculated using all topics and all documents. There would be no need for _combine_zeroshot_topics, but it could perhaps be repurposed for combining the zero-shot topics with the topics from clusters after the clustering step. The vectorizer_model and ctfidf_model will be fitted to all documents, and self.c_tf_idf_ will include the total vocabulary.
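A sketch of the topic-ID bookkeeping in that single-model flow (function and variable names are my own assumptions, not BERTopic internals): zero-shot assignments keep IDs 0..n-1, cluster labels are offset to follow them, and the outlier label -1 is left in place, so representations can then be extracted once over all documents.

```python
# Combine zero-shot assignments with cluster labels into one list of
# per-document topic IDs, keeping -1 reserved for outliers.
def combine_assignments(zeroshot_topics, cluster_labels, n_zeroshot):
    """Zero-shot topics keep IDs 0..n_zeroshot-1; cluster labels are
    offset to follow them; the outlier label -1 stays at -1."""
    shifted = [c if c == -1 else c + n_zeroshot for c in cluster_labels]
    return list(zeroshot_topics) + shifted

# 2 matched zero-shot topics; clustering produced topics 0, 1 and outliers:
all_topics = combine_assignments([0, 1, 1], [-1, 0, 1, 0], n_zeroshot=2)
assert all_topics == [0, 1, 1, -1, 2, 3, 2]
```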

Now, operations that reference these (reduce_topics and reduce_outliers in my case) will (mostly) work, making no distinction between zero-shot topics and topics from clustering. There will still need to be some extra care whenever topic representations are updated (like in reduce_topics) to not override the zero-shot topic labels.

A couple more bits of extra care I can think of need to be taken:

  • When fitting, self.nr_topics may be set. I believe that from the perspective of the user, this should apply to all topics, not just the ones from clustering, as is currently the case. There should be validation that self.nr_topics exceeds the number of zero-shot topics that matched. If this is not the case, the user should be prompted to either raise self.nr_topics or raise zeroshot_min_similarity.
  • The returned probabilities and the related self._map_probabilities(probabilities, original_topics=True) need to have the right shape. I have not dealt with these, but I fear they could already have problems in the current implementation when zero-shot topic modeling is used, because the probabilities appear to cover only the clustered documents, not all documents.
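The nr_topics validation from the first bullet could look roughly like this (illustrative only, not current BERTopic behavior):

```python
# Hypothetical guard: the requested number of topics must exceed the
# number of matched zero-shot topics, or the reduction is unsatisfiable.
def validate_nr_topics(nr_topics, n_matched_zeroshot: int) -> None:
    if nr_topics is not None and nr_topics <= n_matched_zeroshot:
        raise ValueError(
            f"nr_topics={nr_topics} must exceed the {n_matched_zeroshot} "
            "matched zero-shot topics; raise nr_topics or raise "
            "zeroshot_min_similarity so fewer zero-shot topics match."
        )

validate_nr_topics(10, 4)  # fine

try:
    validate_nr_topics(3, 4)
except ValueError as e:
    print(e)
```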

Final Discussion

  • Are there reasons for the quirks I have described in the problems section that are not explained in the referenced issue and PR?
  • Do you see flaws in my description of my current patches?
  • Does my alternative implementation of zero-shot topic modeling make sense, or are there specific reasons for why it is implemented as the merging of two BERTopic topic models?
  • Anything else relevant to know or discuss?
@MaartenGr (Owner)

@ianrandman Thanks for sharing the issues extensively and a bunch of great solutions! I haven't read through them all but wanted to let you know that this is on my radar. Hopefully somewhere next week, I'll find a moment to spend time reading this properly (as it definitely deserves proper attention!).

What I can share right now is that it is my intention to release a hotfix soon (v0.16.2) when some things on my side slow down a bit before I delve into any major new updates, like these.


torqw commented May 16, 2024

Being able to use topics_over_time with zero-shot would indeed be a very welcome addition.

@MaartenGr (Owner)


Outliers
The first issue is the loss of the outlier topic. merged_model._outliers is 0, which makes its way into self._outliers.
This attribute is now wrong, and it will no longer be used correctly in future operations as an offset to exclude the outlier topic.
The -1 topic is now gone, so we have lost any meaningful reference to the existence of outliers. Future operations look at topic -1 to identify outlier documents. This issue is outlined in #1957.
If we call reduce_outliers on a model without outliers, there is no validation. I have not looked into other strategies, but for the c-TF-IDF strategy, the function fails on bow_doc = self.vectorizer_model.transform(outlier_docs) because outlier_docs is empty. A subset of the reduce_outliers issue is described in #1771.

Indeed, the outlier should have made its way into the merged model, which caused some issues. The pull request (and v0.16.2) should fix that, but as we will discuss below, there may be better solutions.

Correctness of representations
The representations in merged_model, which are copied over into self, are effectively a concatenation of the representations from zeroshot_model and self (i.e. the clustered model). These representations are calculated in part with the c_tf_idf_ matrix, which is produced in part from the size of the vocabulary. The problem is that the vocabularies in zeroshot_model and self are only subsets of the real vocabulary, which includes all documents. As a user, I expect my zero-shot topics to be used at the very start as a sort of filtering step but mostly not to have any effects later on (except the zero-shot topic labels persisting). This implies I expect representations to be calculated the way they would normally be, which is to use all documents.
While the representations calculated in each model separately are not necessarily wrong, I would argue it would be more correct for the representations to be calculated considering all documents. That is, I could call self._extract_topics(...) after fitting, and the resulting representations (barring the special handling of zero-shot topic labels) should not change.

I am not sure whether I agree that the documents of the zero-shot topics should be included in the general c-TF-IDF process, as that would require a third manual model that needs the topics of both the zero-shot and the regular model in order to correctly generate the representations. Having said that, let's go over the proposed solutions and see what's what:

I discussed the lack of validation for the existence of outliers, which leads to unclear errors. This function should start with a validation check that looks for -1 in self.topics_ (or really any attribute that has a mapping from topic IDs); a more explicit solution is to check the value of self._outliers.

Although this should be fixed in the new release, having such a check could be worthwhile in case this happens more often. I would want to prevent having too much code, though, since the repo is already getting quite big for a sole maintainer (me).

are the majority of attributes that should get overridden. I have also added umap_model and hdbscan_model as attributes that should come from merged_model, but I am less sure about the implications of these.

Those should not be added, since that would make .transform impossible. HDBSCAN wasn't trained on the zero-shot topics, and as such, using the cosine similarity between topic and document embeddings is preferred.

For an implementation that I think would be more appropriate to be merged into this repo, I think the BERTopic.merge_models should be changed to get this correct from the start.

I believe this should be fixed with the new release, foregoing any issues with respect to the ordering.

Missing c_tf_idf_ Matrix and Lack of Correctness

This is rather tricky, since we want to prevent calculating too much on the one hand and having too much manual work on the other hand. The easiest solution is to introduce a third model that uses the topics generated from the first two models to easily generate the c-TF-IDF representations, topic embeddings, etc. without the need for more manual work. However, that would mean that all representation models are run twice, which is definitely not preferred. An option would be to find the zero-shot topics, calculate the clusters, and then create a single manual BERTopic model using the zero-shot topics and clusters. That way, there is very minimal work needed and representations only need to run once.

So the changes to _combine_zeroshot_topics are lengthy, potentially incomplete, and computationally inefficient (representation model(s) are run again, and the representations from zeroshot_model and self before the merge are essentially thrown out).

With zero-shot topics, representations should be run only once, not twice, since the merging procedure does nothing more than simply merge the models without recalculating the representations.

Now, instead of performing representation-related operations on just the topics that came from clustering, we can merge the topics from clustering with the topics from zero-shot topic modeling. Some care needs to be taken to ensure the outlier topic stays at -1, but I do not think it matters much whether the zero-shot topics come next or get tacked onto the end (in fact, this alternate ordering is proposed in #1771). The representations will be calculated using all topics and all documents. There would be no need for _combine_zeroshot_topics, but it could perhaps be repurposed for combining the zero-shot topics with the topics from clusters after the clustering step. The vectorizer_model and ctfidf_model will be fitted to all documents, and self.c_tf_idf_ will include the total vocabulary.

Ah! Going in chronological order with answering seems to have its downsides, as, if I'm not mistaken, that's the same solution I presented above: calculate zero-shot topics and clusters without the representations and then use both in a manual BERTopic model, thereby simplifying the process while still allowing for a single calculation of the representations whilst having c-TF-IDF representations of the topics. Right?

Are there reasons for the quirks I have described in the problems section that are not explained in the referenced issue and PR?

I think I answered most of this, although with this many changes I would almost suggest a call to make sure we are all on the same page. But that depends on how you see your role in this.

Do you see flaws in my description of my current patches?

I think nothing except the few mentions above. Mostly, I think it's important to minimize any complexity going forward, since that has been the main source of issues. Wherever we can simplify things, that would be preferred.

Does my alternative implementation of zero-shot topic modeling make sense, or are there specific reasons for why it is implemented as the merging of two BERTopic topic models?

At the moment, I can't think of any reason not to do this. At the time, it seemed like an elegant solution to the approach, but this is indeed more efficient.

Anything else relevant to know or discuss?

The code base is already fairly large, and although I do not mind adding to it, it needs to be done from a "we are going to simplify things" perspective. Your suggestions are great. I understand the bigger picture you are sketching and might still need to deep-dive into things, but as it stands now, I think mainly the procedure of replacing .merge_models with a manual BERTopic model would be the way to go.

What do you think?
