Message Tree Topic Modeling Pipeline #1650

danielpatrickhug · 2023-02-16T22:46:16Z

This PR adds topic modeling and k-hop message passing, including a much faster sparse implementation and sentence-transformer embedding aggregation.

Using message passing with the k-hop adjacency matrix to aggregate the embedding features into cluster features, as a GCN does, seems to result in much better topic clusters.
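A minimal sketch of the aggregation idea, in plain Python so it runs without dependencies. It assumes the k-hop matrix is the sum of the first k powers of the adjacency matrix and that aggregation is a row-normalised weighted average; the PR's actual (sparse) implementation may define both differently. The function names `matmul`, `k_hop_adjacency`, and `aggregate` are illustrative, not taken from the PR.

```python
def matmul(a, b):
    """Multiply two dense matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def k_hop_adjacency(adj, k):
    """Sum A^1 + ... + A^k, so entry (i, j) counts walks of length <= k from i to j."""
    power = adj
    total = [row[:] for row in adj]
    for _ in range(k - 1):
        power = matmul(power, adj)
        total = [[t + p for t, p in zip(t_row, p_row)] for t_row, p_row in zip(total, power)]
    return total


def aggregate(adj_k, features):
    """Row-normalise the k-hop matrix and take a weighted average of neighbour embeddings."""
    dim = len(features[0])
    out = []
    for weights in adj_k:
        norm = sum(weights) or 1.0  # guard against isolated nodes without self-loops
        out.append([sum(w * f[d] for w, f in zip(weights, features)) / norm for d in range(dim)])
    return out


# Toy graph: messages 0-1-2 form a chain, message 3 only has a self-loop.
adjacency = [
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 1],
]
embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [5.0, 5.0]]

adj_2 = k_hop_adjacency(adjacency, k=2)
cluster_features = aggregate(adj_2, embeddings)
```

With k=2, message 0 picks up a contribution from message 2 even though they are not directly connected, while the isolated message 3 keeps its own embedding unchanged.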

I also added loading tools for the exported message trees and a new util requirements.txt, and refactored cosine_similarity in similarity_functions.py to compute the cosine similarity kernel instead. The cos_sim and embed_data functions were ported from one of @kenhktsui's filter/cluster notebooks (Scalable Agglomerative Clustering.ipynb).
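For readers unfamiliar with the term, a cosine similarity kernel is the full pairwise similarity matrix rather than a single score. The sketch below is a dependency-free illustration, not the code in similarity_functions.py; `cosine_similarity_kernel` and `threshold_adjacency` are hypothetical names, and the `>=` comparison against the threshold is an assumption (0.65 is the `--threshold` default that appears in the review thread further down).

```python
import math


def cosine_similarity_kernel(vectors):
    """Return the full pairwise cosine-similarity matrix for a list of vectors."""
    norms = [math.sqrt(sum(x * x for x in v)) or 1.0 for v in vectors]
    unit = [[x / n for x in v] for v, n in zip(vectors, norms)]
    return [[sum(a * b for a, b in zip(u, w)) for w in unit] for u in unit]


def threshold_adjacency(kernel, threshold):
    """Keep an edge (1) where similarity meets the threshold, drop it (0) otherwise."""
    return [[1 if s >= threshold else 0 for s in row] for row in kernel]


vectors = [[1.0, 0.0], [0.8, 0.6], [0.0, 1.0]]
kernel = cosine_similarity_kernel(vectors)
adjacency = threshold_adjacency(kernel, threshold=0.65)
```

Here the first two vectors have similarity 0.8 and stay connected, while the second and third have similarity 0.6 and fall below the cut-off.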

BERTopic: https://maartengr.github.io/BERTopic/

Still a WIP, but I tested it locally and it works, and I wanted to get feedback. There are a few more cleanup tasks: typing, docstrings, moving globals like ADJACENCY_THRESHOLD and MODEL_NAME to config, allowing more customization of the topic_model, etc.

Please let me know if you notice any errors or have any suggestions. :)


andreaskoepf · 2023-02-17T09:17:20Z

backend/oasst_backend/utils/exported_tree_loading.py

+def load_jsonl(filepaths):
+    data = []
+    for filepath in filepaths:
+        with open(filepath, "r") as f:


not sure if you saw it, we have fully typed pydantic classes for loading message trees, e.g. see use in import.py:

Open-Assistant/backend/import.py, lines 115 to 118 in e963ca3:

    dict_tree = json.loads(line)

    # validate data
    tree: ExportMessageTree = pydantic.parse_obj_as(ExportMessageTree, dict_tree)

did not see that. looking through it now :)
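A rough sketch of the validate-while-loading pattern discussed in this thread, using only the standard library. In the repository the validation step would be the `pydantic.parse_obj_as(ExportMessageTree, dict_tree)` call quoted above; the `validate_tree` helper and the two required field names below are hypothetical stand-ins so the example runs without pydantic.

```python
import json
from pathlib import Path

# Hypothetical required fields; the real schema lives in the pydantic ExportMessageTree class.
REQUIRED_FIELDS = {"message_tree_id", "prompt"}


def validate_tree(obj):
    """Stand-in for pydantic validation: reject trees that lack required fields."""
    missing = REQUIRED_FIELDS - set(obj)
    if missing:
        raise ValueError(f"tree is missing fields: {sorted(missing)}")
    return obj


def load_jsonl(filepaths):
    """Read each non-empty line of every file as one JSON object and validate it."""
    trees = []
    for filepath in filepaths:
        with Path(filepath).open("r", encoding="utf-8") as f:
            for line in f:
                if line.strip():
                    trees.append(validate_tree(json.loads(line)))
    return trees
```

Validating at load time means a malformed export fails immediately with a clear error instead of surfacing later inside the embedding or clustering steps.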

andreaskoepf · 2023-02-17T09:20:17Z

backend/oasst_backend/utils/similarity_functions.py

+    return embeddings
+
+
+def cos_sim(a: Tensor, b: Tensor):


Is the built-in torch function not suitable here? https://pytorch.org/docs/stable/generated/torch.nn.functional.cosine_similarity.html

Added the built-in torch cosine similarity. I left the old one too, in case anyone is interested in the internals. lmk if I should remove it.


andreaskoepf · 2023-02-17T09:24:35Z

backend/oasst_backend/utils/message_tree_topic_modeling.py

+parser.add_argument("--k", type=int, default=2)
+parser.add_argument("--threshold", type=float, default=0.65)
+parser.add_argument("--exported_tree_path", nargs="+", help="<Required> Set flag", required=True)
+# Use like python message_tree_topic_modeling.py --exported_tree_path 2023-02-06_oasst_prod.jsonl 2023-02-07_oasst_prod.jsonl


This comment seems helpful. Could you please move it into a docstring for __main__? Also, maybe put the argument parsing in a function that is called by __main__ instead of having it always executed in global scope?

moved them to a function and call it in main
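A minimal sketch of the structure requested in this thread, assuming the three arguments shown in the diff above. The function names `parse_args` and `main`, the help strings, and the description text are illustrative and may not match the merged code.

```python
import argparse


def parse_args(argv=None):
    """Parse command-line options.

    Example:
        python message_tree_topic_modeling.py --exported_tree_path 2023-02-06_oasst_prod.jsonl 2023-02-07_oasst_prod.jsonl
    """
    parser = argparse.ArgumentParser(description="Topic modeling over exported message trees")
    parser.add_argument("--k", type=int, default=2, help="number of hops (assumed meaning)")
    parser.add_argument("--threshold", type=float, default=0.65, help="adjacency threshold (assumed meaning)")
    parser.add_argument("--exported_tree_path", nargs="+", required=True, help="one or more exported .jsonl files")
    # argv=None makes argparse fall back to sys.argv[1:], so the script still works unchanged.
    return parser.parse_args(argv)


def main(argv=None):
    """Entry point: nothing is parsed at import time, only when main() is called."""
    args = parse_args(argv)
    return args


# In the script itself, the entry point would be guarded as usual:
# if __name__ == "__main__":
#     main()
```

Accepting an optional `argv` list keeps the parser testable, for example `parse_args(["--exported_tree_path", "a.jsonl"])`, and importing the module no longer triggers argument parsing as a side effect.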


danielpatrickhug added 2 commits February 16, 2023 17:31

initial commit to add topic modeling.

4f34d7c

run pre-commit

6693e90

danielpatrickhug added backend ml data labels Feb 16, 2023

danielpatrickhug requested review from yk and andreaskoepf as code owners February 16, 2023 22:46

danielpatrickhug added 2 commits February 16, 2023 17:52

re run pre-commit

5b57d57

removed indexing in load_data function

d311a84

olliestanley reviewed Feb 16, 2023

backend/oasst_backend/utils/message_tree_topic_modeling.py (2 outdated threads, resolved)

olliestanley reviewed Feb 16, 2023

backend/oasst_backend/utils/exported_tree_loading.py (1 outdated thread, resolved)

danielpatrickhug added 3 commits February 16, 2023 18:20

spelling fixes.

4bbc2a5

added a couple clean ups and added argparse to topic model

268d601

removed unused import and fixed arg comment

fe7ed53

LAION-AI deleted a comment from github-actions bot Feb 17, 2023

danielpatrickhug added 3 commits February 16, 2023 19:19

small comment correction

a2e2f1b

Fixed defaultdict usage in load data

b5b3473

defaultdict cleanup

06dfe72

andreaskoepf reviewed Feb 17, 2023


danielpatrickhug added 7 commits February 18, 2023 10:54

add built-in torch cosine similarity function and guassian kernel func

353767e

added types and removed sent indexing

0d7abfc

moved doc string

d2297d7

ran pre-commit

409ec88

Merge remote-tracking branch 'upstream/main' into sbert_topic_modeling_pipeline

aa36f83

moved argument parsing to a function called by __main__

ff27b51

fixed flake8 formattting issue and re ran pre-commit

708ebb7

LAION-AI deleted a comment from github-actions bot Feb 18, 2023

danielpatrickhug assigned danielpatrickhug and unassigned danielpatrickhug Feb 18, 2023

andreaskoepf mentioned this pull request Feb 19, 2023

A few more backend unit tests (trivial stuff) #1726

Open

andreaskoepf approved these changes Feb 19, 2023


andreaskoepf merged commit 87e02e2 into LAION-AI:main Feb 19, 2023

danielpatrickhug mentioned this pull request Mar 3, 2023

Add GNN message passing and feature aggregation layer for sbert embeddings MaartenGr/BERTopic#1065

Open
