Access to model definitions and training/validation data? #108

Open
nlykkei opened this issue May 29, 2021 · 2 comments

Comments


nlykkei commented May 29, 2021

Would it be possible to get access to model definitions and training/validation data for the models used in SAP/credential-digger?

I'm interested in seeing how these models were trained, and in possibly contributing to their future development.

Currently, it seems that only the trained models are available for download.

Contributor

SlimTrabelsi commented Jun 1, 2021

Hi @nlykkei,

Thank you for your interest in the project.
I'll start with a clarification regarding the training/validation data. We have currently trained two types of models: one based on real data that we keep internal (for privacy reasons), and a second, open-source one that is trained on synthetically generated data. If you are interested, we can give you more details on how this data is generated or how to train a model on your own data (some details are already available in our publication here).
If you are interested in contributing to the project, or if you want to deploy it in your professional environment, let's have a call with the team and discuss this in detail. You can reach me directly at the e-mail address you will find in the publication ;).
Best regards
Slim

Author

nlykkei commented Jun 8, 2021

Hi @SlimTrabelsi

Thanks for your reply.

> If you are interested we can give you more details on how this data is generated or how to train your own data (already some details are available in our publication here).

I'd be very grateful if you could provide more details than those already given in the publication.

Personally, I've been working on a similar problem, but it has been very difficult to move beyond a strict set of regular expressions (a blacklist) to using ML for the cases that are hard to express as regular expressions without introducing too many false positives (e.g. social security numbers: \d{8}[-: ]?\d{4}).
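
To make the false-positive problem concrete, here is a minimal Python sketch that runs the pattern quoted above over a few lines. The sample strings and the helper name `find_candidates` are made up for illustration; they are not taken from credential-digger or from any real data set.

```python
import re

# The pattern quoted above: 8 digits, an optional separator, then 4 digits.
SSN_LIKE = re.compile(r"\d{8}[-: ]?\d{4}")

# Hypothetical sample lines, invented for this illustration only.
SAMPLES = [
    "ssn: 19850412-1234",        # the kind of value the pattern is meant to catch
    "order_id=198504121234",     # a 12-digit order number
    "timestamp 20210529 1200",   # a date followed by a time
]


def find_candidates(line):
    """Return every substring of `line` that the pattern matches."""
    return [m.group(0) for m in SSN_LIKE.finditer(line)]


if __name__ == "__main__":
    for line in SAMPLES:
        print(f"{line!r} -> {find_candidates(line)}")
```

All three sample lines produce a match, although only the first one looks like an identifier. The pattern alone cannot tell them apart, which is the gap a classifier would have to fill.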

In my experience, it was only possible to identify sensitive data when there was enough context around it (e.g. think of a URL such as https://user:pass@example.com/foo/bar).
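
As a rough illustration of what "context" buys in the URL case, the sketch below uses the structure of the URL itself to pick out the credential part. It relies only on Python's standard `urllib.parse.urlsplit`; the regular expression `URL_RE` and the function name `embedded_credentials` are assumptions made for this example, not part of credential-digger.

```python
import re
from urllib.parse import urlsplit

# Simplified URL matcher (an assumption for this sketch, not a full RFC 3986 parser).
URL_RE = re.compile(r"[a-zA-Z][a-zA-Z0-9+.-]*://[^\s'\"<>]+")


def embedded_credentials(text):
    """Return (host, username, password) for each URL in `text` that carries both."""
    found = []
    for match in URL_RE.finditer(text):
        try:
            parts = urlsplit(match.group(0))
        except ValueError:
            # urlsplit rejects some malformed URLs (e.g. unbalanced IPv6 brackets).
            continue
        if parts.username and parts.password:
            found.append((parts.hostname, parts.username, parts.password))
    return found


if __name__ == "__main__":
    print(embedded_credentials("git clone https://user:pass@example.com/foo/bar"))
    print(embedded_credentials("docs at https://example.com/foo/bar"))
```

Here the string `pass` is flagged only because of where it sits in the URL, not because of what it looks like. The same string in free text would give a pattern-only approach nothing to work with.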

My experience with ML is limited to introductory university courses and DeepLearning.AI certifications. Would you say that this skill level is inadequate for developing this kind of system?

Best regards
Nicolas
