
Machine Learning Models


Since regex scanners tend to produce many false positive discoveries, machine learning models can be used to reduce the number of discoveries that must be manually analysed. In particular, the models automatically classify discoveries as false_positive (i.e., spam).

Each model has an implementation in the credentialdigger/models folder. Any binaries the models need are downloaded automatically on the fly: starting from Credential Digger v4.4 there is no need to download the model binaries manually anymore 🎉.
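As a minimal sketch of how the models are used in practice (the database path and repository URL below are placeholders), their names can be passed to the scan method of a client, so that discoveries classified as false positives are filtered out before manual review:

```python
# Minimal sketch: scan a repository with the ML models enabled.
# The database path and repository URL are placeholders.
from credentialdigger import SqliteClient

client = SqliteClient(path='/path/to/data.db')

# Pass the model names to the scanner: discoveries classified as
# false_positive by the models won't need manual review.
new_discoveries = client.scan(repo_url='https://github.com/user/repo',
                              models=['PathModel', 'PasswordModel'],
                              debug=True)
```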

Supported Models

If you want to propose a new model to reduce false positive discoveries, please contact us (or open an issue in the project).

Path Model

The Path Model leverages regular expressions to match the typical files that contain fake credentials.

After a pre-processing phase, the file path of a discovery is matched against a regular expression to guess whether the credentials it contains are real or not. Indeed, according to our observations, documentation (e.g., README and .md files in general), tutorials, tests, virtual environments, and dependencies pushed to the repository (e.g., node_modules) don't contain real secrets used in production.
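To illustrate the idea, a check of this kind boils down to a case-insensitive regular expression over the file path. The pattern and helper name below are a simplified, hypothetical example, not the ones shipped with Credential Digger:

```python
import re

# Illustrative only: a simplified pattern in the spirit of the Path Model,
# not the actual regular expression used by Credential Digger.
FAKE_CREDENTIAL_PATHS = re.compile(
    r'(^|/)(test|tests|example|examples|doc|docs|node_modules|venv)(/|$)'
    r'|readme|\.md$',
    re.IGNORECASE)

def likely_false_positive(file_path: str) -> bool:
    """Return True if the path looks like one that contains fake credentials."""
    return bool(FAKE_CREDENTIAL_PATHS.search(file_path))

print(likely_false_positive('node_modules/lib/config.js'))  # True
print(likely_false_positive('src/server/settings.py'))      # False
```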

Up to v4.3 we used an ML approach based on fasttext, but we shifted to regular expressions in v4.4, since they proved to perform better without any loss of precision. Please visit the OLD machine learning models page for further information regarding the old Path Model.

Password Model

The approach of the OLD Snippet Model was revolutionized in v4.4 in favour of a more efficient strategy. Indeed, fasttext and the double-model strategy (i.e., Snippet Extractor and Snippet Classifier) have been deprecated and replaced by a single, open source Password Model.

The new Password Model is based on NLP and provides higher precision than the old Snippet Model, but it only works with passwords and is slower.

Similarity Model

The similarity feature can be enabled before running a scan in order to reduce the manual workload of assessing the discoveries. If this feature is enabled, similarity scores are computed among the snippets of a repository after a scan. This way, every time the user changes the state of a discovery through the UI (e.g., marking a snippet as a False Positive) or by calling the update_similar_snippets method in the library, all the discoveries with similar snippets are classified accordingly.
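A minimal sketch of this workflow with the library is shown below. The database path, repository URL, and snippet are placeholders, the similarity flag name assumes the current scan API, and the update_similar_snippets argument names are assumptions for illustration (check the client's signature in your version):

```python
# Minimal sketch: enable similarity at scan time, then propagate a state change.
from credentialdigger import SqliteClient

client = SqliteClient(path='/path/to/data.db')

# Compute similarity scores among the snippets of the repo after the scan
# (assuming the flag is named `similarity`).
client.scan(repo_url='https://github.com/user/repo',
            models=['PathModel', 'PasswordModel'],
            similarity=True)

# When a snippet is marked as a false positive, update similar snippets too.
# NOTE: the argument names below are assumptions for illustration.
client.update_similar_snippets(target_snippet='password = "changeme"',
                               state='false_positive',
                               repo_url='https://github.com/user/repo')
```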