Incoming reports clustering by similarity#86

Merged
ksy36 merged 5 commits into main from incoming_clustering
Feb 26, 2026

Collaborator

@ksy36 ksy36 commented Feb 4, 2026

Changes in this PR:
import_reports_from_bigquery now only saves reports to the db, without setting cluster_id or bucket_id
triage_new_reports command picks up reports that don't have a bucket_id and attempts to cluster and bucket them (runs every hour at the moment)

I think once we import live reports the frequency of triaging can be increased.
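A minimal sketch of the split described above, using plain Python objects rather than the project's actual Django models. The `Report` dataclass, `import_reports`, the `assign` callback, and `toy_assign` are all hypothetical names for illustration; only `cluster_id`, `bucket_id`, and the command name `triage_new_reports` come from the PR description.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass
class Report:
    # Hypothetical stand-in for the real report model.
    id: int
    text: str
    cluster_id: Optional[int] = None
    bucket_id: Optional[int] = None


def import_reports(rows: Iterable[tuple[int, str]]) -> list[Report]:
    """Import step: store reports only, leaving cluster_id and bucket_id unset."""
    return [Report(id=row_id, text=text) for row_id, text in rows]


def triage_new_reports(
    reports: list[Report],
    assign: Callable[[Report], tuple[Optional[int], Optional[int]]],
) -> int:
    """Triage step: take reports without a bucket_id and try to place them.

    `assign` is a placeholder for the real clustering and bucketing logic.
    It returns (cluster_id, bucket_id); either value may be None.
    Returns the number of reports that ended up with a bucket.
    """
    pending = [r for r in reports if r.bucket_id is None]
    triaged = 0
    for report in pending:
        report.cluster_id, report.bucket_id = assign(report)
        if report.bucket_id is not None:
            triaged += 1
    return triaged


def toy_assign(report: Report) -> tuple[Optional[int], Optional[int]]:
    # Toy rule for demonstration only: bucket anything mentioning "login".
    if "login" in report.text:
        return (1, 10)
    return (None, None)
```

Because reports that could not be bucketed keep `bucket_id` unset, they are simply picked up again on the next scheduled run.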

@ksy36 ksy36 force-pushed the incoming_clustering branch 2 times, most recently from 1ce7f27 to c515179 Compare February 10, 2026 20:43
@ksy36 ksy36 marked this pull request as ready for review February 10, 2026 20:47
@ksy36 ksy36 requested a review from jgraham February 10, 2026 20:47
Collaborator Author

ksy36 commented Feb 10, 2026

I need to find a way to run the cluster_reports command once to cluster existing reports on production before these changes are deployed :)

@ksy36 ksy36 force-pushed the incoming_clustering branch from c515179 to ddfcd1a Compare February 11, 2026 15:30
Comment thread server/reportmanager/models.py Outdated
@ksy36 ksy36 requested a review from denschub February 11, 2026 16:28
@ksy36 ksy36 marked this pull request as draft February 12, 2026 15:33
@ksy36 ksy36 marked this pull request as ready for review February 12, 2026 20:24
Collaborator Author

ksy36 commented Feb 12, 2026

OK, I've added a page at /reportmanager/clustering/ where we can trigger full clustering. It also displays all runs, both full and incremental (triaging new reports). We can probably display more info in the table later if needed.

Screenshot 2026-02-12 at 3 23 56 PM

Collaborator

@jgraham jgraham left a comment

I mainly reviewed the backend changes so far; for the frontend I wonder if we can get away without having this feature in the first instance.

Comment thread server/reportmanager/models.py Outdated
Comment thread server/reportmanager/management/commands/triage_new_reports.py
Comment thread server/reportmanager/clustering/ClusterBucketManager.py
Comment thread server/reportmanager/clustering/ClusterBucketManager.py Outdated
Comment thread server/reportmanager/clustering/ClusterBucketManager.py Outdated
Comment thread server/reportmanager/management/commands/import_reports_from_bigquery.py Outdated
Comment thread server/reportmanager/management/commands/triage_new_reports.py Outdated
Comment thread server/reportmanager/management/commands/triage_new_reports.py Outdated
Comment thread server/reportmanager/management/commands/triage_new_reports.py Outdated
Comment thread server/frontend/src/components/Clustering.vue Outdated
@ksy36 ksy36 force-pushed the incoming_clustering branch 2 times, most recently from e0f083e to 3ed40e0 Compare February 20, 2026 22:03
Comment thread server/reportmanager/clustering/ClusterBucketManager.py
Comment thread server/reportmanager/clustering/ClusterBucketManager.py
Comment thread server/reportmanager/clustering/SBERTClusterer.py Outdated
Comment thread server/reportmanager/cron.py Outdated
Comment thread server/frontend/src/components/Clustering.vue Outdated
Comment thread server/frontend/src/components/Clustering.vue Outdated
Comment thread server/frontend/src/components/Clustering.vue
)

    def handle(self, *args: object, **options: object) -> None:
        status = ClusteringJob.get_clustering_status()
Collaborator

How important is this lock if we can't run stuff from the UI? I guess it still makes sense, but it equally makes sense that we shouldn't run multiple import jobs in parallel, for example. In that case, maybe we should have a more generic locking mechanism rather than one specific to this job type? I guess we're also using it to track progress, but again that seems to have a lot in common with how we'd want to track the progress of other jobs, e.g. import. Not a change for this PR.

Collaborator Author

Yeah, triage_new_reports runs every hour, so it's unlikely (but possible) that it will run at the same time as a full reclustering (which I'm planning to run directly with a command). Also, on its first run triage_new_reports checks whether there are any successful runs, i.e. whether an initial clustering exists, and returns early if not. I guess it could also trigger the initial clustering once (if it doesn't exist) so that I don't have to run the command manually.
But yes, I agree about a generic locking mechanism.
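As a rough illustration of the generic locking idea discussed in this thread, here is a minimal sketch that uses a uniqueness constraint so only one job of a given name can hold the lock at a time. Everything here is an assumption for illustration: the `job_lock` table, `init_locks`, the `job_lock` context manager, and `JobAlreadyRunning` are hypothetical and are not part of this PR or of `ClusteringJob`. It uses the standard-library `sqlite3` module so that it runs on its own; a real version would presumably live in the project's database layer.

```python
import sqlite3
from contextlib import contextmanager


class JobAlreadyRunning(Exception):
    """Raised when a job with the same lock name is already in progress."""


def init_locks(conn: sqlite3.Connection) -> None:
    # One row per running job; the primary key enforces mutual exclusion.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS job_lock ("
        "job_type TEXT PRIMARY KEY, "
        "started_at TEXT DEFAULT CURRENT_TIMESTAMP)"
    )
    conn.commit()


@contextmanager
def job_lock(conn: sqlite3.Connection, job_type: str):
    """Hold the lock for `job_type` for the duration of the with-block."""
    try:
        conn.execute("INSERT INTO job_lock (job_type) VALUES (?)", (job_type,))
        conn.commit()
    except sqlite3.IntegrityError:
        # A row for this job type already exists, so another run holds the lock.
        conn.rollback()
        raise JobAlreadyRunning(job_type) from None
    try:
        yield
    finally:
        # Release the lock even if the job raised an exception.
        conn.execute("DELETE FROM job_lock WHERE job_type = ?", (job_type,))
        conn.commit()
```

Under this scheme, full reclustering and hourly triage would share one lock name so they exclude each other, while an import job would use its own name. The sketch does not handle stale locks left behind by a crashed process, and it does not track progress; both would need extra fields or a timeout.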

Comment thread server/reportmanager/models.py
@ksy36 ksy36 force-pushed the incoming_clustering branch from 7c219b9 to 884f346 Compare February 26, 2026 04:08
Collaborator

@jgraham jgraham left a comment

I think we should land this (once the lint is fixed!) so we can start to see the results and experiment with it.

@ksy36 ksy36 force-pushed the incoming_clustering branch from 884f346 to 99e3155 Compare February 26, 2026 14:17
@ksy36 ksy36 force-pushed the incoming_clustering branch from 99e3155 to 7bb6c35 Compare February 26, 2026 19:10
@ksy36 ksy36 merged commit 2bd2d25 into main Feb 26, 2026
8 checks passed
2 participants