
Add GNN message passing and feature aggregation layer for sbert embeddings #1065

Open
danielpatrickhug opened this issue Mar 3, 2023 · 5 comments

Comments


danielpatrickhug commented Mar 3, 2023

Summary:

Message passing and feature aggregation are effective techniques for improving the quality of topic clusters in a graph-based topic modeling system. Message passing propagates information along the edges of a graph using matrix exponents, letting nodes share information and capturing the relationships between them. Feature aggregation then combines the propagated information from each node's neighborhood into richer node embeddings. Together, these techniques enable more accurate modeling of topic clusters, surface hidden themes that may not be apparent in the raw data, and produce topic clusters that are more coherent, meaningful, and useful for downstream analysis.
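As a rough sketch of the propagation step (a hypothetical `propagate` helper, assuming `A` is a document similarity/adjacency matrix with self-loops and `X` the sbert embeddings):

```python
import numpy as np

def propagate(A: np.ndarray, X: np.ndarray, k: int = 2) -> np.ndarray:
    """k rounds of message passing: each document's embedding is
    repeatedly mixed with its neighbors' embeddings along the graph edges."""
    A_hat = A / A.sum(axis=1, keepdims=True)  # row-normalize; self-loops keep row sums > 0
    H = X.copy()
    for _ in range(k):
        H = A_hat @ H  # one hop of information exchange between connected documents
    return H
```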

Implementation

I've been using message passing and feature aggregation to improve representations for instruction data and then using BERTopic to visualize the data. I've noticed general improvements in topic representation for instructions.
Examples: https://github.com/danielpatrickhug/Sentence_Kernels
Open source implementation: LAION-AI/Open-Assistant#1650

Another potential issue: large topic clusters summarized by ChatGPT have brittle topic labels. I would like to add a new summarizer that summarizes the topic summaries of different samples of representative docs. I have code for this as well (in Sentence_Kernels).

I can make a PR; please let me know if you have suggestions! :)

@danielpatrickhug (Author)

2-hop message passing and aggregation features for MathQA: https://docs.google.com/spreadsheets/d/1dr1Ah8rcxb3i-7ipph0-aHYmgMakBWbuViD8qaN1acI/edit?usp=sharing

@MaartenGr (Owner)

Apologies for the late response! Thank you for sharing this. If I am not mistaken, this would be an embedding-agnostic way of capturing the relationships between documents before clustering them, right? Do you have a minimal example I can try out? For example, with 20 NewsGroups and without ChatGPT labels? It is easier to validate the output if we evaluate it on a barebones example.

@danielpatrickhug (Author)

Correct. The kernel (similarity matrix) A of the embedding set (cosine similarity as the kernel function) is first thresholded: connections are set to 0 or 1 given a distance threshold. I use SVD to look at the connected components when setting the threshold, but you can also use an activation function. Then the kernel is exponentiated (A^2) and the features are aggregated (k-hop message passing) to capture higher-order relationships between the documents, creating node embeddings. I've seen research using this down at the molecule and protein level, in robotic path-planning strategy prediction, and in neuroscience too.
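A rough sketch of that pipeline in code (hypothetical `two_hop_features` helper; only numpy and scikit-learn's `cosine_similarity` assumed, with the SVD-based cutoff selection omitted):

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def two_hop_features(embeddings: np.ndarray, threshold: float = 0.5) -> np.ndarray:
    # Kernel A: pairwise cosine similarity over the sbert embeddings
    A = cosine_similarity(embeddings)
    # Threshold into a 0/1 adjacency matrix (choosing the cutoff via the
    # SVD / connected-components heuristic is left out here)
    A = np.where(A >= threshold, 1.0, 0.0)
    # Exponentiate: A @ A counts 2-hop paths between documents
    A2 = A @ A
    # Row-normalize so aggregation averages over the reachable neighborhood
    A2 = A2 / np.maximum(A2.sum(axis=1, keepdims=True), 1e-12)
    # Aggregate: each document's new embedding mixes in its 2-hop neighbors
    return A2 @ embeddings
```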

I have a solid evaluation script using the Open Assistant data, but the dataset hasn't been released yet... I don't have a 20 NewsGroups example yet, but I could whip one up.

I wrote a repo that iterates through the extracted ASTs of a repository to generate a contextual graph (code summaries, QA chains, and soon document retrieval) using different chatbot system prompts. I then merge the embedding sets of the code and the generated content into a single kernel and message-pass between them. In general, the code embeddings and generated-content embeddings get aggregated, creating rich node/edge features. I then pass the aggregated features into the topic model pipeline. https://github.com/danielpatrickhug/GitModel
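A minimal sketch of that merge step (hypothetical `merge_and_aggregate` helper, assuming both embedding sets come from the same sentence encoder so their similarities are comparable):

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def merge_and_aggregate(code_emb: np.ndarray, gen_emb: np.ndarray,
                        threshold: float = 0.5) -> np.ndarray:
    """Join code and generated-content embeddings into one graph and
    message-pass across it so each node absorbs features from both sources."""
    merged = np.vstack([code_emb, gen_emb])  # single node set over both sources
    A = cosine_similarity(merged)            # joint kernel over the merged set
    A = np.where(A >= threshold, 1.0, 0.0)   # threshold into an adjacency matrix
    A = A / np.maximum(A.sum(axis=1, keepdims=True), 1e-12)
    return A @ merged                        # one round of cross-source aggregation
```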

I use my repo to topic model its own identity topic tree, like a python eating itself, lol. To me, the representation is more coherent with message passing than without. Would you be interested in seeing how the representation changes using BERTopic?

@danielpatrickhug (Author)

I also incorporate a recurrent aspect where the resulting tree is passed back into the system prompts and the pipeline is run again. Here is another example of this kind of bootstrapping from the recent Alpaca model: https://github.com/yizhongw/self-instruct/blob/main/self_instruct/bootstrap_instructions.py

Also, this group found that graph-aware BERTs (trained with a message-passing GNN layer) improve link prediction performance by roughly 30%: https://www.youtube.com/watch?v=HRC4hZKiUWU&t=3038s

@MaartenGr (Owner)

> I use my repo to topic model its own identity topic tree, like a python eating itself, lol. To me, the representation is more coherent with message passing than without. Would you be interested in seeing how the representation changes using BERTopic?

Yes, it would be great to see how it changes the representation using a minimal example. The issue with a ChatGPT-like representation on top of it is that the resulting model becomes difficult to evaluate, both in its stability and in its interpretation through keywords taken directly from the source. If you could create an example with 20 NewsGroups and the default pipeline, it would be easier to see what your proposal is doing in isolation.
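For reference, such a barebones comparison might look like the following (a sketch assuming the hypothetical `two_hop_features` step from earlier in the thread, sentence-transformers' `all-MiniLM-L6-v2`, and BERTopic's support for precomputed embeddings):

```python
from sklearn.datasets import fetch_20newsgroups
from sentence_transformers import SentenceTransformer
from bertopic import BERTopic

docs = fetch_20newsgroups(subset="all",
                          remove=("headers", "footers", "quotes")).data

# Plain sbert embeddings as the baseline
model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(docs, show_progress_bar=True)

# Hypothetical: the 2-hop thresholded-kernel aggregation sketched earlier
smoothed = two_hop_features(embeddings)

# Default BERTopic pipeline, once per embedding variant, for comparison
topics_raw, _ = BERTopic().fit_transform(docs, embeddings)
topics_agg, _ = BERTopic().fit_transform(docs, smoothed)
```

Comparing `topics_raw` against `topics_agg` under the otherwise-default pipeline would isolate what the aggregation step contributes.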
