ML refactoring (#551)
* test ipmarkup

* test category2rules

* fix

* fix

* test val pos

* touch

* Update .github/workflows/benchmark.yml

* Update .github/workflows/benchmark.yml

* Update .github/workflows/benchmark.yml

* value sanitize in url improvement

* style

* fix ; split in url

* fix dub 2024-05-07T20:06:19+03:00

* fix4test

* Update .github/workflows/benchmark.yml

* dynamical bBM report

* fix google multi pattern

* trigger

* fix keyword pattern for \t

* Update tests/samples/auth_n.template

touch a file to rerun bm

* \ for url pattern

* touch

* touch

* cat md5

* sha256sum md5

* psd

* entropy fix

* BM scores fix

* BM upd

* BM scors

* BM scor:

* cache upd

* [skip actions] [valpos] 2024-05-13T16:13:18+03:00

* Rollback custom ref

* accesskey

* BM upd for re-launch

* BM upd for re-launch 2

* BM scors upd

* use prescan report for BM

* BM upd

* doc FP fix

* BM scores fix

* BM scores

* BM scores upd

* BM scores upd

* Slack Token upd

* BM csores upd

* BM csores upd2

* upd4test

* BM scor up with 0 in rate

* 20240519-1108 batch 2048 42 epc

* [skip actions] [ml] 2024-05-19T12:31:17+03:00

* 21th epoch

* 21th epoch. weights

* test4

* [skip actions] [ml] 2024-05-20T18:01:16+03:00

* [skip actions] [ml] 2024-05-20T18:04:19+03:00

* [skip actions] [ml] 2024-05-20T18:19:46+03:00

* jwt uses ml

* [skip actions] [ml] 2024-05-20T20:10:48+03:00

* ml test

* sigmoid

* fix validator

* 0.33

* testfix

* Update benchmark.yml

* testfix

* rollback BM wf

* upd

* BM scores fix

* workaround for CI step

* requests==2.32.0

* mypyfix

* reform

* test commit

* upd sample which gives shivering probability in macos

* features size preprocess to calculate the dimension automatically

* ml_validator creates in runtime instead import

* reform

* use local metrics

* [skip actions] [ml] 2024-05-27T13:31:20+03:00

* [skip actions] [ml] 2024-05-27T13:33:58+03:00

* rollback

* rollback test

* save plot fix

* BM scores fix

* Update experiment/src/prepare_data.py

Co-authored-by: ShinHyung Choi <sh519.choi@samsung.com>

* BM scores fix after CredData PR

---------

Co-authored-by: ShinHyung Choi <sh519.choi@samsung.com>
babenek and csh519 committed May 28, 2024
1 parent 5ad4e1d commit d16a3a5
Showing 25 changed files with 1,590 additions and 2,304 deletions.
2 changes: 1 addition & 1 deletion .github/workflows/check.yml
@@ -58,7 +58,7 @@ jobs:
- name: Check ml_model.onnx integrity
if: ${{ always() && steps.code_checkout.conclusion == 'success' }}
run: |
md5sum --binary credsweeper/ml_model/ml_model.onnx | grep 57ec152f6aa740456c742ecd5e7d9ef5
md5sum --binary credsweeper/ml_model/ml_model.onnx | grep 8f277b2f4a67a9911a9a860f1b5c0489
# # # Python setup
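
The integrity step above pins `ml_model.onnx` to an MD5 digest, so a retrained model must be accompanied by a new digest in the workflow. A minimal Python sketch of the same check follows. The helpers `file_md5` and `md5_matches` are hypothetical and not part of CredSweeper; the path and the second digest are copied from the hunk, on the assumption that the second line is the new value.

```python
import hashlib
from pathlib import Path


def file_md5(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with path.open("rb") as stream:
        for chunk in iter(lambda: stream.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def md5_matches(path: Path, expected: str) -> bool:
    """Compare the file digest with an expected hex string, ignoring case."""
    return file_md5(path) == expected.lower()


if __name__ == "__main__":
    # Path and digest taken from the workflow hunk (assumed to be the new value).
    model = Path("credsweeper/ml_model/ml_model.onnx")
    expected = "8f277b2f4a67a9911a9a860f1b5c0489"
    if model.exists():
        print("OK" if md5_matches(model, expected) else "MISMATCH")
```

MD5 serves here only to detect an unintended change to the binary; it is not a defence against deliberate tampering.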

40 changes: 20 additions & 20 deletions cicd/benchmark.txt
@@ -1,4 +1,4 @@
DATA: 16998279 interested lines. MARKUP: 63222 items
DATA: 16998279 interested lines. MARKUP: 63226 items
FileType FileNumber ValidLines Positives Negatives Templates
--------------- ------------ ------------ ----------- ----------- -----------
194 28318 64 430 87
@@ -83,8 +83,8 @@ FileType FileNumber ValidLines Positives Negatives Templat
.java 621 134132 311 1348 169
.jenkinsfile 1 58 1 7
.jinja2 1 64 2
.js 658 536388 494 2628 338
.json 860 13670750 817 10952 139
.js 658 536388 494 2630 338
.json 860 13670750 817 10953 139
.jsp 13 3202 1 42
.jsx 7 857 19
.jwt 6 8 7
@@ -123,7 +123,7 @@ FileType FileNumber ValidLines Positives Negatives Templat
.mqh 1 1023 2
.msg 1 26644 1 1
.mysql 1 36 2
.ndjson 2 5006 49 324
.ndjson 2 5006 49 325
.nix 4 211 12
.nolint 1 2 1
.odd 1 1281 57
@@ -223,25 +223,25 @@ FileType FileNumber ValidLines Positives Negatives Templat
.yml 418 36162 437 920 374
.zsh 6 872 12
.zsh-theme 1 97 1
TOTAL: 10335 16998279 8097 60877 5159
credsweeper result_cnt : 7519, lost_cnt : 0, true_cnt : 6817, false_cnt : 702
TOTAL: 10335 16998279 8097 60881 5159
credsweeper result_cnt : 7394, lost_cnt : 0, true_cnt : 6795, false_cnt : 599
Rules Positives Negatives Templates Reported TP FP TN FN FPR FNR ACC PRC RCL F1
------------------------------ ----------- ----------- ----------- ---------- ---- ---- ----- ---- -------- -------- -------- -------- -------- --------
API 117 3104 184 112 103 9 3279 14 0.002737 0.119658 0.993245 0.919643 0.880342 0.899563
API 117 3104 184 105 101 4 3284 16 0.001217 0.136752 0.994126 0.961905 0.863248 0.909910
AWS Client ID 163 13 0 154 154 0 13 9 0.000000 0.055215 0.948864 1.000000 0.944785 0.971609
AWS Multi 71 12 0 83 71 11 1 0 0.916667 0.000000 0.867470 0.865854 1.000000 0.928105
AWS S3 Bucket 61 25 0 87 61 24 1 0 0.960000 0.000000 0.720930 0.717647 1.000000 0.835616
Atlassian Old PAT token 27 211 3 10 3 7 207 24 0.032710 0.888889 0.871369 0.300000 0.111111 0.162162
Auth 318 2750 87 308 269 39 2798 49 0.013747 0.154088 0.972108 0.873377 0.845912 0.859425
Auth 318 2750 87 293 267 26 2811 51 0.009165 0.160377 0.975594 0.911263 0.839623 0.873977
Azure Access Token 19 0 0 0 0 0 19 1.000000 0.000000 0.000000
BASE64 Private Key 7 2 0 7 7 0 2 0 0.000000 0.000000 1.000000 1.000000 1.000000 1.000000
BASE64 encoded PEM Private Key 7 0 0 5 5 0 0 2 0.285714 0.714286 1.000000 0.714286 0.833333
Bitbucket Client ID 147 1833 3 41 27 14 1822 120 0.007625 0.816327 0.932426 0.658537 0.183673 0.287234
Bitbucket Client Secret 239 535 0 44 33 11 524 206 0.020561 0.861925 0.719638 0.750000 0.138075 0.233216
Certificate 22 456 1 20 15 5 452 7 0.010941 0.318182 0.974948 0.750000 0.681818 0.714286
Credential 31 130 74 29 29 0 204 2 0.000000 0.064516 0.991489 1.000000 0.935484 0.966667
Certificate 22 456 1 17 16 1 456 6 0.002188 0.272727 0.985386 0.941176 0.727273 0.820513
Credential 31 130 74 31 28 3 201 3 0.014706 0.096774 0.974468 0.903226 0.903226 0.903226
Docker Swarm Token 2 0 0 2 2 0 0 0 0.000000 1.000000 1.000000 1.000000 1.000000
Dropbox App secret 62 112 0 45 37 7 105 25 0.062500 0.403226 0.816092 0.840909 0.596774 0.698113
Dropbox App secret 62 114 0 45 37 7 107 25 0.061404 0.403226 0.818182 0.840909 0.596774 0.698113
Facebook Access Token 0 1 0 0 0 1 0 0.000000 1.000000
Firebase Domain 6 1 0 7 6 1 0 0 1.000000 0.000000 0.857143 0.857143 1.000000 0.923077
Github Old Token 1 0 0 1 1 0 0 0 0.000000 1.000000 1.000000 1.000000 1.000000
@@ -253,18 +253,18 @@ Google OAuth Access Token 3 0 0
Grafana Provisioned API Key 22 1 0 1 1 0 1 21 0.000000 0.954545 0.086957 1.000000 0.045455 0.086957
IPv4 691 365 0 1004 691 302 63 0 0.827397 0.000000 0.714015 0.695871 1.000000 0.820665
IPv6 33 135 0 33 33 0 135 0 0.000000 0.000000 1.000000 1.000000 1.000000 1.000000
JSON Web Token 284 10 2 280 272 8 4 12 0.666667 0.042254 0.932432 0.971429 0.957746 0.964539
JSON Web Token 284 11 2 280 272 8 5 12 0.615385 0.042254 0.932660 0.971429 0.957746 0.964539
Jira / Confluence PAT token 0 4 0 0 0 4 0 0.000000 1.000000
Jira 2FA 7 6 0 3 3 0 6 4 0.000000 0.571429 0.692308 1.000000 0.428571 0.600000
Key 427 7871 462 452 389 61 8272 38 0.007320 0.088993 0.988699 0.864444 0.911007 0.887115
Nonce 43 89 0 60 32 28 61 11 0.314607 0.255814 0.704545 0.533333 0.744186 0.621359
Key 427 7871 462 415 391 23 8310 36 0.002760 0.084309 0.993265 0.944444 0.915691 0.929845
Nonce 43 89 0 42 36 6 83 7 0.067416 0.162791 0.901515 0.857143 0.837209 0.847059
PEM Private Key 1019 1483 0 1023 1019 4 1479 0 0.002697 0.000000 0.998401 0.996090 1.000000 0.998041
Password 1902 7425 2675 1647 1554 93 10007 348 0.009208 0.182965 0.963256 0.943534 0.817035 0.875740
Salt 42 72 2 42 38 4 70 4 0.054054 0.095238 0.931034 0.904762 0.904762 0.904762
Secret 1353 29656 873 1264 1235 29 30500 118 0.000950 0.087214 0.995389 0.977057 0.912786 0.943829
Password 1902 7425 2675 1636 1543 93 10007 359 0.009208 0.188749 0.962340 0.943154 0.811251 0.872244
Salt 42 72 2 38 38 0 74 4 0.000000 0.095238 0.965517 1.000000 0.904762 0.950000
Secret 1353 29656 873 1239 1229 10 30519 124 0.000328 0.091648 0.995797 0.991929 0.908352 0.948302
Seed 1 6 0 0 0 6 1 0.000000 1.000000 0.857143 0.000000
Slack Token 4 1 0 4 4 0 1 0 0.000000 0.000000 1.000000 1.000000 1.000000 1.000000
Token 553 3975 448 517 489 28 4395 64 0.006331 0.115732 0.981511 0.945841 0.884268 0.914019
Token 553 3976 448 499 476 23 4401 77 0.005199 0.139241 0.979908 0.953908 0.860759 0.904943
Twilio API Key 0 5 2 0 0 7 0 0.000000 1.000000
URL Credentials 167 117 254 143 143 0 371 24 0.000000 0.143713 0.955390 1.000000 0.856287 0.922581
8097 60877 5159 7538 6817 702 60175 1280 0.011531 0.158083 0.971265 0.906637 0.841917 0.873079
URL Credentials 167 117 254 153 149 4 367 18 0.010782 0.107784 0.959108 0.973856 0.892216 0.931250
8097 60881 5159 7412 6795 599 60282 1302 0.009839 0.160800 0.972440 0.918988 0.839200 0.877284
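
The rate columns in the table above (FPR, FNR, ACC, PRC, RCL, F1) follow from the four confusion counts. A small sketch that reproduces them is given below. The function names are hypothetical and this is not the benchmark's own code; returning `None` for an undefined ratio is an assumption made here to mirror the blank cells in the table.

```python
from typing import Dict, Optional


def safe_div(numerator: float, denominator: float) -> Optional[float]:
    """Divide, returning None when the ratio is undefined."""
    return numerator / denominator if denominator else None


def confusion_metrics(tp: int, fp: int, tn: int, fn: int) -> Dict[str, Optional[float]]:
    """Derive the table's rate columns from TP, FP, TN and FN."""
    precision = safe_div(tp, tp + fp)
    recall = safe_div(tp, tp + fn)
    f1 = None
    if precision is not None and recall is not None and precision + recall > 0:
        f1 = 2 * precision * recall / (precision + recall)
    return {
        "FPR": safe_div(fp, fp + tn),
        "FNR": safe_div(fn, fn + tp),
        "ACC": safe_div(tp + tn, tp + fp + tn + fn),
        "PRC": precision,
        "RCL": recall,
        "F1": f1,
    }


if __name__ == "__main__":
    # Counts from the updated "Salt" row: TP=38, FP=0, TN=74, FN=4.
    for name, value in confusion_metrics(38, 0, 74, 4).items():
        print(name, None if value is None else f"{value:.6f}")
```

With the updated Salt counts this gives FPR 0.000000, FNR 0.095238, ACC 0.965517, PRC 1.000000, RCL 0.904762 and F1 0.950000, which agrees with that row.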
35 changes: 18 additions & 17 deletions credsweeper/app.py
@@ -4,7 +4,7 @@
import signal
import sys
from pathlib import Path
from typing import Any, List, Optional, Union, Dict, Sequence
from typing import Any, List, Optional, Union, Dict, Sequence, Tuple

import pandas as pd

@@ -13,7 +13,7 @@

from credsweeper.common.constants import KeyValidationOption, Severity, ThresholdPreset
from credsweeper.config import Config
from credsweeper.credentials import Candidate, CredentialManager
from credsweeper.credentials import Candidate, CredentialManager, CandidateKey
from credsweeper.deep_scanner.deep_scanner import DeepScanner
from credsweeper.file_handler.diff_content_provider import DiffContentProvider
from credsweeper.file_handler.file_path_extractor import FilePathExtractor
@@ -336,32 +336,33 @@ def post_processing(self) -> None:
"""Machine learning validation for received credential candidates."""
if self._use_ml_validation():
logger.info(f"Grouping {len(self.credential_manager.candidates)} candidates")
new_cred_list = []
new_cred_list: List[Candidate] = []
cred_groups = self.credential_manager.group_credentials()
ml_cred_groups = []
ml_cred_groups: List[Tuple[CandidateKey, List[Candidate]]] = []
for group_key, group_candidates in cred_groups.items():
# Analyze with ML if all candidates in group require ML
# Analyze with ML if any candidate in group require ML
for candidate in group_candidates:
if not candidate.use_ml:
if candidate.use_ml:
ml_cred_groups.append((group_key, group_candidates))
break
else:
ml_cred_groups.append((group_key.value, group_candidates))
continue
# If at least one of credentials in the group do not require ML - automatically report to user
for candidate in group_candidates:
candidate.ml_validation = KeyValidationOption.NOT_AVAILABLE
new_cred_list += group_candidates
# all candidates do not require ML
new_cred_list.extend(group_candidates)

# prevent extra ml_validator creation if ml_cred_groups is empty
if ml_cred_groups:
logger.info(f"Run ML Validation for {len(ml_cred_groups)} groups")
is_cred, probability = self.ml_validator.validate_groups(ml_cred_groups, self.ml_batch_size)
for i, (_, group_candidates) in enumerate(ml_cred_groups):
if is_cred[i]:
for candidate in group_candidates:
candidate.ml_validation = KeyValidationOption.VALIDATED_KEY
candidate.ml_probability = probability[i]
new_cred_list += group_candidates
for candidate in group_candidates:
if candidate.use_ml:
if is_cred[i]:
candidate.ml_validation = KeyValidationOption.VALIDATED_KEY
candidate.ml_probability = probability[i]
new_cred_list.append(candidate)
else:
candidate.ml_validation = KeyValidationOption.NOT_AVAILABLE
new_cred_list.append(candidate)
else:
logger.info("Skipping ML validation due not applicable")
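
The hunk above changes the grouping rule: a group now goes to the ML validator if any of its candidates requires ML, and is reported directly only when none does. The partition step can be sketched as follows. `FakeCandidate` and `split_groups` are hypothetical stand-ins and not CredSweeper classes; only the `for`/`else` shape is taken from the diff.

```python
from dataclasses import dataclass
from typing import Dict, Hashable, List, Tuple


@dataclass
class FakeCandidate:
    """Hypothetical stand-in for credsweeper's Candidate."""
    rule_name: str
    use_ml: bool


def split_groups(
    groups: Dict[Hashable, List[FakeCandidate]]
) -> Tuple[List[Tuple[Hashable, List[FakeCandidate]]], List[FakeCandidate]]:
    """Separate groups that need ML from groups that can be reported as-is."""
    ml_groups: List[Tuple[Hashable, List[FakeCandidate]]] = []
    reported: List[FakeCandidate] = []
    for key, members in groups.items():
        for candidate in members:
            if candidate.use_ml:
                # One ML-requiring candidate is enough to send the whole group to ML.
                ml_groups.append((key, members))
                break
        else:
            # The loop ended without break: no candidate in the group requires ML.
            reported.extend(members)
    return ml_groups, reported


if __name__ == "__main__":
    groups = {
        "a": [FakeCandidate("Password", True), FakeCandidate("PEM Private Key", False)],
        "b": [FakeCandidate("PEM Private Key", False)],
    }
    ml_groups, reported = split_groups(groups)
    print([key for key, _ in ml_groups], [c.rule_name for c in reported])
```

The inner loop is equivalent to `any(c.use_ml for c in members)`; the `for`/`else` form is kept here only because it matches the structure in the diff.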

9 changes: 7 additions & 2 deletions credsweeper/credentials/candidate_key.py
@@ -12,8 +12,10 @@ class CandidateKey:
def __init__(self, line_data: LineData):
self.path: str = line_data.path
self.line_num: int = line_data.line_num
self.value: str = line_data.value
self.key: Tuple[str, int, str] = (self.path, self.line_num, self.value)
self.value_start: int = line_data.value_start
self.value_end: int = line_data.value_end
self.key: Tuple[str, int, int, int] = (self.path, self.line_num, self.value_start, self.value_end)
self.__line = line_data.line

def __hash__(self):
return hash(self.key)
@@ -23,3 +25,6 @@ def __eq__(self, other):

def __ne__(self, other):
return not (self == other)

def __repr__(self) -> str:
return f"{self.key}:{self.__line}"
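
With this change the grouping key is built from the value's position (`value_start`, `value_end`) rather than its text, so two detections group together only when they cover the same span of the same line. A simplified sketch of that behaviour follows. `SpanKey` is a hypothetical reduction of `CandidateKey`, and its `__eq__` is an assumption, since the original body is not shown in the hunk.

```python
from typing import Dict, List, Tuple


class SpanKey:
    """Hypothetical reduction of CandidateKey: identity is (path, line, start, end)."""

    def __init__(self, path: str, line_num: int, value_start: int, value_end: int) -> None:
        self.key: Tuple[str, int, int, int] = (path, line_num, value_start, value_end)

    def __hash__(self) -> int:
        return hash(self.key)

    def __eq__(self, other: object) -> bool:
        # Assumed comparison: two keys are equal when their tuples are equal.
        return isinstance(other, SpanKey) and self.key == other.key

    def __repr__(self) -> str:
        return f"SpanKey{self.key}"


if __name__ == "__main__":
    groups: Dict[SpanKey, List[str]] = {}
    # Two rules matching the same span land in one group.
    groups.setdefault(SpanKey("conf.py", 7, 11, 27), []).append("Password")
    groups.setdefault(SpanKey("conf.py", 7, 11, 27), []).append("Secret")
    # The same line with a different span forms its own group.
    groups.setdefault(SpanKey("conf.py", 7, 40, 56), []).append("Token")
    print(groups)
```

One consequence is that a value repeated twice on the same line produces two keys, where a text-based key would have produced one.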
18 changes: 9 additions & 9 deletions credsweeper/ml_model/features.py
@@ -146,7 +146,7 @@ class PossibleComment(Feature):
r"""Feature is true if candidate line starts with #,\*,/\*? (Possible comment)."""

def extract(self, candidate: Candidate) -> bool:
for i in ["#", "*", "/*"]:
for i in ["#", "*", "/*", "//"]:
if candidate.line_data_list[0].line.startswith(i):
return True
return False
@@ -260,13 +260,13 @@ class FileExtension(Feature):

def __init__(self, extensions: List[str]) -> None:
super().__init__()
self.extensions = extensions
self.label_binarizer = LabelBinarizer()
self.label_binarizer.fit(extensions)

def __call__(self, candidates: List[Candidate]) -> csr_matrix:
enc = LabelBinarizer()
enc.fit(self.extensions)
extensions = [candidate.line_data_list[0].file_type for candidate in candidates]
return enc.transform(extensions)
result = self.label_binarizer.transform(extensions)
return result

def extract(self, candidate: Candidate) -> Any:
raise NotImplementedError
@@ -282,13 +282,13 @@ class RuleName(Feature):

def __init__(self, rule_names: List[str]) -> None:
super().__init__()
self.rule_names = rule_names
self.label_binarizer = LabelBinarizer()
self.label_binarizer.fit(rule_names)

def __call__(self, candidates: List[Candidate]) -> csr_matrix:
enc = LabelBinarizer()
enc.fit(self.rule_names)
rule_names = [candidate.rule_name for candidate in candidates]
return enc.transform(rule_names)
result = self.label_binarizer.transform(rule_names)
return result

def extract(self, candidate: Candidate) -> Any:
raise NotImplementedError
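
Both `FileExtension` and `RuleName` now fit their encoder once in `__init__` and only call `transform` per batch, instead of refitting on every call. The pattern can be sketched without scikit-learn as below. `OneHotOnce` is a hypothetical pure-Python stand-in, not `LabelBinarizer`; its sorted column order and all-zero rows for unknown labels are choices made for this sketch.

```python
from typing import Dict, List, Sequence


class OneHotOnce:
    """Hypothetical stand-in for an encoder that is fitted once and reused."""

    def __init__(self, labels: Sequence[str]) -> None:
        # "Fit" once: fix the column order for the lifetime of the object.
        self.index: Dict[str, int] = {label: i for i, label in enumerate(sorted(set(labels)))}

    def transform(self, values: Sequence[str]) -> List[List[int]]:
        """Encode each value as a one-hot row using the fixed column order."""
        width = len(self.index)
        rows: List[List[int]] = []
        for value in values:
            row = [0] * width
            position = self.index.get(value)
            if position is not None:  # unknown labels stay an all-zero row in this sketch
                row[position] = 1
            rows.append(row)
        return rows


if __name__ == "__main__":
    encoder = OneHotOnce([".py", ".js", ".yml"])
    # Every batch reuses the same column order; nothing is refitted here.
    print(encoder.transform([".py", ".txt"]))
    print(encoder.transform([".yml"]))
```

Fitting in the constructor removes repeated work per batch and guarantees that every batch is encoded with the same column layout.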
Binary file modified credsweeper/ml_model/ml_model.onnx
Binary file not shown.