Message Tree Topic Modeling Pipeline #1650
def load_jsonl(filepaths):
    data = []
    for filepath in filepaths:
        with open(filepath, "r") as f:
Not sure if you saw it, but we have fully typed pydantic classes for loading message trees; see, for example, how they are used in import.py:
Open-Assistant/backend/import.py
Lines 115 to 118 in e963ca3
dict_tree = json.loads(line)
# validate data
tree: ExportMessageTree = pydantic.parse_obj_as(ExportMessageTree, dict_tree)
I did not see that. Looking through it now :)
    return embeddings


def cos_sim(a: Tensor, b: Tensor):
Is the built-in torch function not suitable here? https://pytorch.org/docs/stable/generated/torch.nn.functional.cosine_similarity.html
Added the built-in torch cosine similarity. I left the old one in too, in case anyone is interested in the internals. Let me know if I should remove it.
parser.add_argument("--k", type=int, default=2)
parser.add_argument("--threshold", type=float, default=0.65)
parser.add_argument("--exported_tree_path", nargs="+", help="<Required> Set flag", required=True)
# Use like: python message_tree_topic_modeling.py --exported_tree_path 2023-02-06_oasst_prod.jsonl 2023-02-07_oasst_prod.jsonl
This comment seems helpful. Could you please move it into a docstring for __main__? Also, maybe put the argument parsing in a function that is called by __main__ instead of having it always executed in global scope?
Moved them to a function and call it in main.
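For readers following the thread, here is a minimal sketch of the layout the review asks for. The function names `parse_args` and `main` are hypothetical and may differ from what the PR actually uses; the three flags and the usage line are copied from the diff above.

```python
import argparse


def parse_args(argv=None):
    """Parse the command-line options.

    Example:
        python message_tree_topic_modeling.py --exported_tree_path 2023-02-06_oasst_prod.jsonl 2023-02-07_oasst_prod.jsonl
    """
    parser = argparse.ArgumentParser(description="Topic modeling over exported message trees")
    parser.add_argument("--k", type=int, default=2)
    parser.add_argument("--threshold", type=float, default=0.65)
    parser.add_argument("--exported_tree_path", nargs="+", help="<Required> Set flag", required=True)
    # argv=None makes argparse fall back to sys.argv[1:]
    return parser.parse_args(argv)


def main(argv=None):
    # Parsing happens only when main() is called, not when the module is imported.
    args = parse_args(argv)
    print(f"k={args.k} threshold={args.threshold} files={args.exported_tree_path}")
    return args


if __name__ == "__main__":
    # An explicit argv is passed so this sketch runs standalone;
    # a real script would simply call main().
    main(["--exported_tree_path", "2023-02-06_oasst_prod.jsonl"])
```

Accepting an optional `argv` keeps the parser testable without touching `sys.argv`.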
This PR adds topic modeling and k-hop message passing, including a much faster sparse implementation and sentence-transformer embedding aggregation.
Using message passing with the k-hop adjacency matrix to aggregate the embedding features into cluster features, as a GCN does, seems to result in much better topic clusters.
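To illustrate the idea, here is a rough, dependency-free sketch of k-hop aggregation. It is not the PR's implementation, which uses sparse tensors. The helper names, the added self-loops, and the mean aggregation are all assumptions made for this example.

```python
def threshold_adjacency(sim, threshold):
    """Binary adjacency: connect i and j when their similarity meets the threshold.

    Self-loops are added (an assumption) so every node keeps its own features.
    """
    n = len(sim)
    return [[1 if i == j or sim[i][j] >= threshold else 0 for j in range(n)] for i in range(n)]


def k_hop_adjacency(adj, k):
    """Mark every pair of nodes connected by a path of at most k edges."""
    n = len(adj)
    reach = [row[:] for row in adj]
    for _ in range(k - 1):
        # Boolean matrix product: extend every known path by one more edge.
        reach = [
            [1 if any(reach[i][m] and adj[m][j] for m in range(n)) else 0 for j in range(n)]
            for i in range(n)
        ]
    return reach


def aggregate(features, reach):
    """Replace each node's features by the mean over its k-hop neighbourhood."""
    dim = len(features[0])
    out = []
    for row in reach:
        neighbours = [features[j] for j, connected in enumerate(row) if connected]
        out.append([sum(v[d] for v in neighbours) / len(neighbours) for d in range(dim)])
    return out


# Toy data: nodes 0-1-2 form a chain, node 3 is isolated.
sim = [
    [1.0, 0.9, 0.1, 0.0],
    [0.9, 1.0, 0.8, 0.0],
    [0.1, 0.8, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
]
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [5.0, 5.0]]
reach = k_hop_adjacency(threshold_adjacency(sim, 0.65), k=2)
smoothed = aggregate(features, reach)
```

With `k=2`, nodes 0 and 2 become neighbours through node 1, so all three chain nodes end up with the same averaged features, while node 3 is left unchanged.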
I also added loading tools for the exported message trees and a new util requirements.txt, and refactored cosine_similarity in similarity_functions.py to compute the cosine similarity kernel instead. The cos_sim and embed_data functions were ported over from one of @kenhktsui's filter/cluster notebooks (Scalable Agglomerative Clustering.ipynb).
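As a point of reference, a cosine similarity kernel is the matrix of similarities between every row of one set of embeddings and every row of another. The sketch below uses plain Python lists to stay dependency-free; the PR's version works on torch tensors, and the function name and `eps` guard here are assumptions for illustration.

```python
import math


def cosine_similarity_kernel(a, b, eps=1e-8):
    """Return the len(a) x len(b) matrix of cosine similarities between rows."""

    def unit(vector):
        # Scale to unit length; eps guards against division by zero.
        norm = math.sqrt(sum(x * x for x in vector))
        return [x / max(norm, eps) for x in vector]

    a_unit = [unit(v) for v in a]
    b_unit = [unit(v) for v in b]
    # The dot product of unit vectors is their cosine similarity.
    return [[sum(x * y for x, y in zip(u, w)) for w in b_unit] for u in a_unit]


kernel = cosine_similarity_kernel([[1.0, 0.0], [0.0, 2.0]], [[3.0, 0.0], [1.0, 1.0]])
```

Each entry `kernel[i][j]` compares row `i` of the first argument with row `j` of the second, so the result can be thresholded directly to build an adjacency matrix.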
BERTopic: https://maartengr.github.io/BERTopic/
This is still a WIP, but I tested it locally and it works, and I wanted to get feedback. A few cleanup tasks remain: typing, docstrings, moving globals such as ADJACENCY_THRESHOLD and MODEL_NAME to config, allowing more customization of the topic_model, etc.
Please let me know if you notice any errors or have any suggestions. :)