Similarity Model

Why use the similarity model?

Repositories often contain many similar lines of code.
At the same time, no matter how robust a classifier is, some snippets will always end up misclassified.
The task of manually labeling snippets can thus never be completely eliminated.
However, those misclassified snippets often share similarities with one another.
When a snippet is manually updated to a new state, e.g., from 'leak' to 'false_positive', automatically detecting similar snippets in the repository and updating their state accordingly can save the user a lot of time.
This is precisely the purpose of the similarity model.
There is no need to perform the same update on each similar snippet anymore: with the similarity model, it can be done with a single click!

How are snippet embeddings computed?

Each snippet is fed to a preprocessing layer that tokenizes its input, i.e., splits the snippet into sub-words, and returns the id of each of those tokens.
These ids are fed to a pretrained encoder, the Small BERT model, which computes a 128-element embedding - i.e., vector - for each token.
To obtain a single, condensed, 128-element embedding for a whole snippet, the mean over its tokens' embeddings is computed.
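
The following is a minimal sketch of that pipeline, assuming the TensorFlow Hub releases of the Small BERT encoder (hidden size 128) and its matching preprocessing model; the exact model handles, pooling, and padding handling used by the actual implementation may differ.

```python
# Minimal sketch, assuming the TensorFlow Hub Small BERT (H-128) encoder and
# its matching preprocessing model; the real implementation may differ.
import numpy as np
import tensorflow as tf
import tensorflow_hub as hub

preprocessor = hub.KerasLayer(
    "https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3")
encoder = hub.KerasLayer(
    "https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-2_H-128_A-2/2")

def compute_snippet_embedding(snippet: str) -> np.ndarray:
    """Return a single 128-element embedding for the whole snippet."""
    # Tokenize the snippet into sub-words and map them to token ids.
    inputs = preprocessor(tf.constant([snippet]))
    # The encoder returns one 128-element vector per token ('sequence_output').
    token_embeddings = encoder(inputs)["sequence_output"][0].numpy()
    # Average only over real tokens, ignoring padding positions.
    mask = inputs["input_mask"][0].numpy().astype(bool)
    return token_embeddings[mask].mean(axis=0)
```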

Computing similarity between snippets

The BERT model assigns similar embeddings to tokens that share semantic similarity and are usually found in similar contexts.
It is therefore easy to see that similar snippets - i.e., snippets that contain similar tokens - will have similar embeddings.
The degree of similarity between two embeddings is captured by the cosine similarity, whose values for these embeddings range from 0 to 1 - the higher the value, the more similar the embeddings.
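
For illustration, the cosine similarity of two snippet embeddings can be computed directly with NumPy; the helper name below is hypothetical.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Example: embeddings of two nearly identical snippets yield a value close to 1.
# sim = cosine_similarity(compute_snippet_embedding(snippet_1),
#                         compute_snippet_embedding(snippet_2))
```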

Scanning with similarity enabled

With the similarity model enabled, each time a scan is performed two additional steps take place:

  • Embeddings are computed for all discoveries, using the method described above.
  • Embeddings are added to the database in the embeddings table, along with their corresponding snippets, discovery ids, and the repository url (see the sketch after this list).
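
As a rough illustration of the second step, the sketch below assumes a SQLite backend and the column layout described above; the real client API, schema, and serialization format may differ.

```python
# Hedged sketch of storing embeddings, assuming a SQLite backend and
# discoveries given as dicts with 'id' and 'snippet' keys (placeholders).
import json
import sqlite3

def store_embeddings(db_path, repo_url, discoveries, embeddings):
    """Persist one row per discovery in the embeddings table."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS embeddings ("
        "discovery_id INTEGER PRIMARY KEY, "
        "snippet TEXT, "
        "embedding TEXT, "  # the 128 floats, serialized as JSON
        "repo_url TEXT)")
    rows = [(d["id"], d["snippet"], json.dumps([float(x) for x in emb]), repo_url)
            for d, emb in zip(discoveries, embeddings)]
    conn.executemany(
        "INSERT OR REPLACE INTO embeddings VALUES (?, ?, ?, ?)", rows)
    conn.commit()
    conn.close()
```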

When a discovery's state is updated, e.g., from 'leak' to 'false_positive', by clicking the "Mark as" button in the UI and leaving the "Update similar discoveries" checkbox checked, the cosine similarity between this target discovery and each discovery not already in the target state is computed.
Every discovery whose similarity to the target discovery is above a certain threshold has its state updated to the target state; all other discoveries are left unchanged.
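
Reusing the cosine_similarity helper sketched above, the update step could look roughly like the following; the discovery structure and the threshold value are placeholders for illustration, not the actual data model or default.

```python
def update_similar_discoveries(discoveries, target, new_state, threshold=0.9):
    """Propagate new_state from the target discovery to similar discoveries.

    Assumes each discovery is a dict with 'state' and 'embedding' keys and
    that `threshold` is a placeholder value, not the real default.
    """
    updated = []
    for disc in discoveries:
        if disc["state"] == new_state:
            continue  # already in the target state, nothing to do
        sim = cosine_similarity(target["embedding"], disc["embedding"])
        if sim >= threshold:
            disc["state"] = new_state
            updated.append(disc)
    return updated
```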