
Add GNN message passing and feature aggregation layer for sbert embeddings #1065

Open
danielpatrickhug opened this issue Mar 3, 2023 · 5 comments

Comments


danielpatrickhug commented Mar 3, 2023

Summary:

Message passing and feature aggregation are effective techniques for improving the quality of topic clusters in a graph-based topic modeling system. Message passing propagates information along the edges of a graph using matrix exponents, letting nodes share information and capturing the relationships between them. Feature aggregation then combines the propagated information from each node's neighborhood into richer node embeddings. Together, these techniques enable more accurate modeling of topic clusters, surface hidden themes that may not be apparent in the raw data, and produce topic clusters that are more coherent, meaningful, and useful for downstream analysis.
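As a rough sketch of the propagation step (a hypothetical `propagate` helper, assuming `A` is a document similarity/adjacency matrix with self-loops and `X` the sbert embeddings):

```python
import numpy as np

def propagate(A: np.ndarray, X: np.ndarray, k: int = 2) -> np.ndarray:
    """k rounds of message passing: each document's embedding is
    repeatedly mixed with its neighbors' embeddings along the graph edges."""
    A_hat = A / A.sum(axis=1, keepdims=True)  # row-normalize; self-loops keep row sums > 0
    H = X.copy()
    for _ in range(k):
        H = A_hat @ H  # one hop of information exchange between connected documents
    return H
```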

Implementation

I've been using message passing and feature aggregation to improve representations for instruction data and then using BERTopic to visualize the data. I've noticed general improvements in topic representation for instructions.
Examples: https://github.com/danielpatrickhug/Sentence_Kernels
Open source implementation: LAION-AI/Open-Assistant#1650

Another potential issue: large topic clusters summarized by ChatGPT have brittle topic labels. I would like to add a new summarizer that summarizes the topic summaries of different samples of representative docs. I have code for this as well (in Sentence_Kernels).

I can make a PR; please let me know if you have suggestions! :)

@danielpatrickhug (Author)

2-hop message passing and aggregation features for MathQA: https://docs.google.com/spreadsheets/d/1dr1Ah8rcxb3i-7ipph0-aHYmgMakBWbuViD8qaN1acI/edit?usp=sharing

@MaartenGr (Owner)

Apologies for the late response! Thank you for sharing this. If I am not mistaken, this would be an embedding-agnostic way of capturing the relationships between documents before clustering them, right? Do you have a minimal example I can try out? For example, with 20 NewsGroups and without ChatGPT labels? It is easier to validate the output if we evaluate it on a barebones example.

@danielpatrickhug (Author)

Correct. The kernel (similarity matrix) A of the embedding set (cosine similarity as the kernel function) is first thresholded: connections are set to 0 or 1 given a distance threshold. I use SVD to look at the connected components when setting the threshold, but you can also use an activation function. Then the kernel is exponentiated (A^2) and the features are aggregated (k-hop message passing) to capture higher-order relationships between the documents, creating node embeddings. I've seen research using this down at the molecule and protein level, in robotic path-planning strategy prediction, and in neuroscience too.
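A rough sketch of that pipeline in code (hypothetical `two_hop_features` helper; only numpy and scikit-learn's `cosine_similarity` assumed, with the SVD-based cutoff selection omitted):

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def two_hop_features(embeddings: np.ndarray, threshold: float = 0.5) -> np.ndarray:
    # Kernel A: pairwise cosine similarity over the sbert embeddings
    A = cosine_similarity(embeddings)
    # Threshold into a 0/1 adjacency matrix (choosing the cutoff via the
    # SVD / connected-components heuristic is left out here)
    A = np.where(A >= threshold, 1.0, 0.0)
    # Exponentiate: A @ A counts 2-hop paths between documents
    A2 = A @ A
    # Row-normalize so aggregation averages over the reachable neighborhood
    A2 = A2 / np.maximum(A2.sum(axis=1, keepdims=True), 1e-12)
    # Aggregate: each document's new embedding mixes in its 2-hop neighbors
    return A2 @ embeddings
```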

I have a solid evaluation script using the Open Assistant data, but the dataset hasn't been released yet... I don't have a 20 NewsGroups example yet, but I could whip one up.

I wrote a repo that iterates through the extracted ASTs of a repository to generate a contextual graph (code summaries, QA chains, and soon document retrieval) using different chatbot system prompts. I then merge the embedding sets of the code and the generated content into a single kernel and message-pass between them. In general, the code embeddings and generated-content embeddings get aggregated, creating rich node/edge features. I then pass the aggregated features into the topic model pipeline. https://github.com/danielpatrickhug/GitModel
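A minimal sketch of that merge step (hypothetical `merge_and_aggregate` helper, assuming both embedding sets come from the same sentence encoder so their similarities are comparable):

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def merge_and_aggregate(code_emb: np.ndarray, gen_emb: np.ndarray,
                        threshold: float = 0.5) -> np.ndarray:
    """Join code and generated-content embeddings into one graph and
    message-pass across it so each node absorbs features from both sources."""
    merged = np.vstack([code_emb, gen_emb])  # single node set over both sources
    A = cosine_similarity(merged)            # joint kernel over the merged set
    A = np.where(A >= threshold, 1.0, 0.0)   # threshold into an adjacency matrix
    A = A / np.maximum(A.sum(axis=1, keepdims=True), 1e-12)
    return A @ merged                        # one round of cross-source aggregation
```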

I use my repo to topic model its own identity topic tree, like a python eating itself, lol. To me, the representation is more coherent with message passing than without. Would you be interested in seeing how the representation changes using BERTopic?

@danielpatrickhug (Author)

I also incorporate a recurrent aspect where the resulting tree is passed back into the system prompts and the pipeline is run again. Here is another example of this kind of bootstrapping from the recent Alpaca model: https://github.com/yizhongw/self-instruct/blob/main/self_instruct/bootstrap_instructions.py

Also, this group found that graph-aware BERTs (trained with a message-passing GNN layer) improve link prediction performance by roughly 30%: https://www.youtube.com/watch?v=HRC4hZKiUWU&t=3038s

@MaartenGr (Owner)

> I use my repo to topic model its own identity topic tree, like a python eating itself, lol. To me, the representation is more coherent with message passing than without. Would you be interested in seeing how the representation changes using BERTopic?

Yes, it would be great to see how it changes the representation using a minimal example. The issue with a ChatGPT-like representation on top of it is that the resulting model becomes difficult to evaluate, both in its stability and in its interpretation through keywords taken directly from the source. If you could create an example with 20 NewsGroups and the default pipeline, it would be easier to see what your proposal is doing in isolation.
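For reference, such a barebones comparison might look like the following (a sketch assuming the hypothetical `two_hop_features` step from earlier in the thread, sentence-transformers' `all-MiniLM-L6-v2`, and BERTopic's support for precomputed embeddings):

```python
from sklearn.datasets import fetch_20newsgroups
from sentence_transformers import SentenceTransformer
from bertopic import BERTopic

docs = fetch_20newsgroups(subset="all",
                          remove=("headers", "footers", "quotes")).data

# Plain sbert embeddings as the baseline
model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(docs, show_progress_bar=True)

# Hypothetical: the 2-hop thresholded-kernel aggregation sketched earlier
smoothed = two_hop_features(embeddings)

# Default BERTopic pipeline, once per embedding variant, for comparison
topics_raw, _ = BERTopic().fit_transform(docs, embeddings)
topics_agg, _ = BERTopic().fit_transform(docs, smoothed)
```

Comparing `topics_raw` against `topics_agg` under the otherwise-default pipeline would isolate what the aggregation step contributes.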
