Message Tree Topic Modeling Pipeline #1650

Merged

Collaborator

@danielpatrickhug danielpatrickhug commented Feb 16, 2023

This PR adds topic modeling and k-hop message passing, including a much faster sparse implementation and aggregation of sentence-transformer embeddings.

Using message passing with the k-hop adjacency matrix to aggregate the embedding features into cluster features, as a GCN does, seems to produce much better topic clusters.
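For readers unfamiliar with the idea, here is a minimal, dependency-free sketch of k-hop aggregation on toy data. It is not the code from this PR. The function names (`k_hop_adjacency`, `aggregate_features`), the reading of "k-hop" as "reachable in at most k steps, self-loops included", and the plain averaging are all assumptions made for illustration; the PR itself uses a sparse implementation.

```python
# Illustrative sketch only; names and the exact k-hop definition are assumptions.

def matmul(a, b):
    """Multiply two dense matrices given as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def k_hop_adjacency(adj, k):
    """Return a 0/1 matrix marking nodes reachable in at most k steps."""
    n = len(adj)
    # Add self-loops so every node keeps its own features (as in a GCN).
    hop = [[1 if i == j or adj[i][j] else 0 for j in range(n)] for i in range(n)]
    reach = [row[:] for row in hop]
    for _ in range(k - 1):
        product = matmul(reach, hop)
        reach = [[1 if v else 0 for v in row] for row in product]
    return reach


def aggregate_features(adj_k, features):
    """Average the feature vectors of each node's k-hop neighbourhood."""
    out = []
    for row in adj_k:
        neighbours = [features[j] for j, flag in enumerate(row) if flag]
        out.append([sum(col) / len(neighbours) for col in zip(*neighbours)])
    return out


# Toy path graph 0 - 1 - 2 - 3 with one-dimensional "embeddings".
adjacency = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
embeddings = [[0.0], [1.0], [2.0], [3.0]]

two_hop = k_hop_adjacency(adjacency, k=2)
smoothed = aggregate_features(two_hop, embeddings)
```

With `k=2`, the end nodes average over three nodes and the middle nodes over all four, so neighbouring messages end up with closer features than they started with. That smoothing is the effect the clustering step would benefit from.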

I also added loading tools for the exported message trees and a new util requirements.txt, and refactored cosine_similarity in similarity_functions.py to compute the cosine similarity kernel instead. The cos_sim and embed_data functions were ported from one of @kenhktsui's filter/cluster notebooks (Scalable Agglomerative Clustering.ipynb).
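As a rough illustration of what a cosine similarity kernel computes, the sketch below normalizes every row to unit length and then takes all pairwise dot products. It uses plain Python lists so it runs without dependencies. `cosine_similarity_kernel` is a hypothetical name, and this is not the code in similarity_functions.py.

```python
import math


def cosine_similarity_kernel(a, b):
    """Return the matrix K with K[i][j] = cos(a[i], b[j])."""

    def normalize(rows):
        normalized = []
        for row in rows:
            norm = math.sqrt(sum(x * x for x in row))
            # Leave all-zero rows as zeros instead of dividing by zero.
            normalized.append([x / norm if norm else 0.0 for x in row])
        return normalized

    a_unit, b_unit = normalize(a), normalize(b)
    return [[sum(x * y for x, y in zip(u, v)) for v in b_unit] for u in a_unit]


vectors = [[1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]
kernel = cosine_similarity_kernel(vectors, vectors)
```

Passing the same list twice gives a symmetric matrix with ones on the diagonal. Thresholding such a matrix is one plausible way to obtain an adjacency matrix, though how the PR does it is not shown here.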

BERTopic: https://maartengr.github.io/BERTopic/

This is still a WIP, but I tested it locally and it works, and I wanted to get feedback. A few cleanup tasks remain: typing, docstrings, moving globals like ADJACENCY_THRESHOLD and MODEL_NAME to config, allowing more customization of the topic_model, etc.

Please let me know if you notice any errors or have any suggestions. :)

@LAION-AI LAION-AI deleted a comment from github-actions bot Feb 17, 2023
def load_jsonl(filepaths):
    data = []
    for filepath in filepaths:
        with open(filepath, "r") as f:
Collaborator

not sure if you saw it, we have fully typed pydantic classes for loading message trees, e.g. see use in import.py:

dict_tree = json.loads(line)
# validate data
tree: ExportMessageTree = pydantic.parse_obj_as(ExportMessageTree, dict_tree)
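A dependency-free sketch of the same load-then-validate pattern follows. The `Tree` dataclass and its two fields are hypothetical stand-ins for the real ExportMessageTree pydantic model, whose fields are not shown in this thread, and a dataclass only checks the set of keys, not the value types that pydantic would validate. Only the shape of the loop is the point.

```python
import json
import tempfile
from dataclasses import dataclass
from pathlib import Path


@dataclass
class Tree:
    """Hypothetical stand-in for ExportMessageTree (fields are assumed)."""

    message_tree_id: str
    prompt: dict


def load_trees(filepaths):
    """Parse one JSON object per line and validate it into a Tree."""
    trees = []
    for filepath in filepaths:
        with open(filepath, "r", encoding="utf-8") as f:
            for line_number, line in enumerate(f, start=1):
                if not line.strip():
                    continue  # skip blank lines
                try:
                    # Missing or unexpected keys raise TypeError here.
                    trees.append(Tree(**json.loads(line)))
                except (TypeError, json.JSONDecodeError) as exc:
                    raise ValueError(f"{filepath}:{line_number}: invalid tree") from exc
    return trees


# Demo on a throwaway file with two records.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "sample.jsonl"
    path.write_text(
        '{"message_tree_id": "t1", "prompt": {"text": "hello"}}\n'
        '{"message_tree_id": "t2", "prompt": {"text": "world"}}\n',
        encoding="utf-8",
    )
    loaded = load_trees([path])
```

Failing on the first bad line with its file name and line number makes a corrupt export easy to locate. Swapping `Tree(**...)` for the pydantic call above would add type validation.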

Collaborator Author

did not see that. looking through it now :)

    return embeddings


def cos_sim(a: Tensor, b: Tensor):
Collaborator

Collaborator Author

Added the built-in torch cosine similarity. I left the old one too, in case anyone is interested in the internals. lmk if I should remove it.

backend/oasst_backend/utils/similarity_functions.py (outdated)
parser.add_argument("--k", type=int, default=2)
parser.add_argument("--threshold", type=float, default=0.65)
parser.add_argument("--exported_tree_path", nargs="+", help="<Required> Set flag", required=True)
# Use like python message_tree_topic_modeling.py --exported_tree_path 2023-02-06_oasst_prod.jsonl 2023-02-07_oasst_prod.jsonl
Collaborator

This comment seems helpful. Could you please move it into a docstring for __main__? Also, maybe put the argument parsing in a function that is called by __main__ instead of always executing it in global scope?

Collaborator Author

moved them to a function and call it in main
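For reference, the pattern discussed in this thread might look roughly like the sketch below. The three flags and their defaults come from the diff above; the function names `parse_args` and `main`, the help strings, and the parser description are assumptions, and the actual code in the PR may differ.

```python
import argparse


def parse_args(argv=None):
    """Parse the options; argv=None makes argparse read sys.argv[1:]."""
    # Example invocation, taken from the comment in the diff:
    #   python message_tree_topic_modeling.py --exported_tree_path 2023-02-06_oasst_prod.jsonl 2023-02-07_oasst_prod.jsonl
    parser = argparse.ArgumentParser(description="Topic modeling over exported message trees.")
    parser.add_argument("--k", type=int, default=2, help="number of hops (assumed meaning)")
    parser.add_argument("--threshold", type=float, default=0.65, help="adjacency threshold (assumed meaning)")
    parser.add_argument(
        "--exported_tree_path",
        nargs="+",
        required=True,
        help="one or more exported .jsonl files",
    )
    return parser.parse_args(argv)


def main(argv=None):
    args = parse_args(argv)
    # The real pipeline would run here; this sketch only returns the options.
    return args


# In the script this call would sit under: if __name__ == "__main__": main()
args = main(["--exported_tree_path", "2023-02-06_oasst_prod.jsonl", "2023-02-07_oasst_prod.jsonl"])
```

Accepting an explicit `argv` keeps parsing out of global scope, so importing the module has no side effects and the parser can be exercised in tests without touching `sys.argv`.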

@LAION-AI LAION-AI deleted a comment from github-actions bot Feb 18, 2023
3 participants