ML refactoring #551

Merged
merged 104 commits into from
May 28, 2024
7a97b80
test ipmarkup
babenek May 6, 2024
7bdff47
test category2rules
babenek May 6, 2024
a4e4bd6
fix
babenek May 6, 2024
4dbec97
fix
babenek May 6, 2024
f044f5c
test val pos
babenek May 6, 2024
10df847
touch
babenek May 6, 2024
f97982e
Update .github/workflows/benchmark.yml
babenek May 7, 2024
b742c4d
Update .github/workflows/benchmark.yml
babenek May 7, 2024
a010350
Update .github/workflows/benchmark.yml
babenek May 7, 2024
b430560
value sanitize in url improvement
babenek May 7, 2024
376cdad
Merge remote-tracking branch 'origin/valpos' into valpos
babenek May 7, 2024
d91065a
style
babenek May 7, 2024
ea61214
fix ; split in url
babenek May 7, 2024
dbdd613
fix dub 2024-05-07T20:06:19+03:00
babenek May 7, 2024
b9fc64a
fix4test
babenek May 7, 2024
c089c06
Update .github/workflows/benchmark.yml
babenek May 8, 2024
7c354f4
dynamical bBM report
babenek May 8, 2024
5408ef2
Merge remote-tracking branch 'origin/valpos' into valpos
babenek May 8, 2024
87c70ae
fix google multi pattern
babenek May 8, 2024
fc60c7a
trigger
babenek May 8, 2024
40aa00a
fix keyword pattern for \t
babenek May 9, 2024
f2dcce1
Update tests/samples/auth_n.template
babenek May 9, 2024
24503c5
\ for url pattern
babenek May 9, 2024
67622cc
Merge remote-tracking branch 'origin/valpos' into valpos
babenek May 9, 2024
e460da4
touch
babenek May 9, 2024
91b4487
touch
babenek May 9, 2024
0452ff1
cat md5
babenek May 9, 2024
09eae5d
sha256sum md5
babenek May 9, 2024
6a8ef55
psd
babenek May 9, 2024
12c9abd
entropy fix
babenek May 10, 2024
32399d4
BM scores fix
babenek May 10, 2024
63163e3
BM upd
babenek May 10, 2024
40affde
Merge branch 'main' into category2rules
babenek May 12, 2024
8eda750
BM scors
babenek May 12, 2024
39d14e4
BM scor:
babenek May 13, 2024
fdc2e05
cache upd
babenek May 13, 2024
f59b4e4
[skip actions] [valpos] 2024-05-13T16:13:18+03:00
babenek May 13, 2024
f4864ac
Rollback custom ref
babenek May 13, 2024
594503e
accesskey
babenek May 13, 2024
c3b1ad7
Merge branch 'valpos' into category2rules
babenek May 13, 2024
1684f33
BM upd for re-launch
babenek May 13, 2024
71efc8a
BM upd for re-launch 2
babenek May 13, 2024
9688b93
Merge branch 'valpos' into category2rules
babenek May 13, 2024
9dabcf5
BM scors upd
babenek May 13, 2024
2b38fcb
use prescan report for BM
babenek May 14, 2024
d91ae5b
BM upd
babenek May 14, 2024
6cd2f77
doc FP fix
babenek May 14, 2024
5e14f3c
BM scores fix
babenek May 14, 2024
7374643
Merge branch 'main' into valpos
babenek May 14, 2024
d0e8f51
Merge branch 'valpos' into category2rules
babenek May 14, 2024
00ccd97
BM scores
babenek May 14, 2024
41d3253
Merge branch 'main' into valpos
babenek May 15, 2024
cc7b68a
Merge branch 'valpos' into category2rules
babenek May 15, 2024
0966627
BM scores upd
babenek May 15, 2024
2cb3084
BM scores upd
babenek May 15, 2024
e2a01d9
Slack Token upd
babenek May 15, 2024
986b65e
BM csores upd
babenek May 16, 2024
537f4d7
BM csores upd2
babenek May 16, 2024
2b56364
upd4test
babenek May 16, 2024
a7feb97
BM scor up with 0 in rate
babenek May 16, 2024
850973e
20240519-1108 batch 2048 42 epc
babenek May 18, 2024
eba9da1
[skip actions] [ml] 2024-05-19T12:31:17+03:00
babenek May 19, 2024
879d0be
21th epoch
babenek May 19, 2024
d9af5fc
21th epoch. weights
babenek May 19, 2024
53c23df
test4
babenek May 19, 2024
bd6d845
Merge branch 'main' into ml
babenek May 20, 2024
932d970
[skip actions] [ml] 2024-05-20T18:01:16+03:00
babenek May 20, 2024
2a4ec2a
[skip actions] [ml] 2024-05-20T18:04:19+03:00
babenek May 20, 2024
46a3d27
[skip actions] [ml] 2024-05-20T18:19:46+03:00
babenek May 20, 2024
520b536
jwt uses ml
babenek May 20, 2024
afc2840
[skip actions] [ml] 2024-05-20T20:10:48+03:00
babenek May 20, 2024
aadb077
ml test
babenek May 20, 2024
f554c2b
sigmoid
babenek May 20, 2024
878688f
fix validator
babenek May 20, 2024
9025c9b
0.33
babenek May 20, 2024
0695548
testfix
babenek May 20, 2024
2af3ce1
Update benchmark.yml
babenek May 20, 2024
ab6da6b
testfix
babenek May 21, 2024
f18cd03
rollback BM wf
babenek May 21, 2024
5eab9d5
upd
babenek May 21, 2024
13278cd
BM scores fix
babenek May 21, 2024
69c875a
workaround for CI step
babenek May 21, 2024
dc6df04
Merge remote-tracking branch 'origin/ml' into ml
babenek May 21, 2024
4f57aa4
requests==2.32.0
babenek May 21, 2024
c1265c3
mypyfix
babenek May 21, 2024
12c12ff
reform
babenek May 21, 2024
91c83e4
test commit
babenek May 21, 2024
b9f568a
upd sample which gives shivering probability in macos
babenek May 21, 2024
b33824a
Merge branch 'main' into ml
babenek May 21, 2024
f9fe875
features size preprocess to calculate the dimension automatically
babenek May 21, 2024
831cb01
Merge remote-tracking branch 'origin/ml' into ml
babenek May 21, 2024
d7d3886
ml_validator creates in runtime instead import
babenek May 21, 2024
117dfae
reform
babenek May 21, 2024
d4bdb0a
use local metrics
babenek May 22, 2024
138b9e2
[skip actions] [ml] 2024-05-27T13:31:20+03:00
babenek May 27, 2024
0a6c44e
[skip actions] [ml] 2024-05-27T13:33:58+03:00
babenek May 27, 2024
6f67b3b
rollback
babenek May 27, 2024
edee47e
Merge remote-tracking branch 'upstream/main' into ml
babenek May 27, 2024
99f455a
rollback test
babenek May 27, 2024
f2d7f32
save plot fix
babenek May 27, 2024
1ee99b5
BM scores fix
babenek May 27, 2024
ed0a9f4
Update experiment/src/prepare_data.py
babenek May 28, 2024
69f4930
BM scores fix after CredData PR
babenek May 28, 2024
aea2dfd
Merge remote-tracking branch 'origin/ml' into ml
babenek May 28, 2024
2 changes: 1 addition & 1 deletion .github/workflows/check.yml
@@ -58,7 +58,7 @@ jobs:
     - name: Check ml_model.onnx integrity
       if: ${{ always() && steps.code_checkout.conclusion == 'success' }}
       run: |
-        md5sum --binary credsweeper/ml_model/ml_model.onnx | grep 57ec152f6aa740456c742ecd5e7d9ef5
+        md5sum --binary credsweeper/ml_model/ml_model.onnx | grep 8f277b2f4a67a9911a9a860f1b5c0489

     # # # Python setup

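The CI step above guards the model file with an MD5 checksum, so replacing `ml_model.onnx` requires updating the expected hash in the same change. A minimal Python sketch of an equivalent check follows; the helper names `md5_of` and `model_is_intact` are illustrative assumptions and are not part of CredSweeper.

```python
import hashlib
from pathlib import Path


def md5_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def model_is_intact(path: Path, expected_md5: str) -> bool:
    """Compare the file digest with the expected value, ignoring hex case."""
    return md5_of(path) == expected_md5.lower()
```

Under these assumptions, the workflow step would correspond to `model_is_intact(Path("credsweeper/ml_model/ml_model.onnx"), "8f277b2f4a67a9911a9a860f1b5c0489")`.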
40 changes: 20 additions & 20 deletions cicd/benchmark.txt
@@ -1,4 +1,4 @@
-DATA: 16998279 interested lines. MARKUP: 63222 items
+DATA: 16998279 interested lines. MARKUP: 63226 items
 FileType FileNumber ValidLines Positives Negatives Templates
 --------------- ------------ ------------ ----------- ----------- -----------
  194 28318 64 430 87
@@ -83,8 +83,8 @@ FileType FileNumber ValidLines Positives Negatives Templat
 .java 621 134132 311 1348 169
 .jenkinsfile 1 58 1 7
 .jinja2 1 64 2
-.js 658 536388 494 2628 338
-.json 860 13670750 817 10952 139
+.js 658 536388 494 2630 338
+.json 860 13670750 817 10953 139
 .jsp 13 3202 1 42
 .jsx 7 857 19
 .jwt 6 8 7
@@ -123,7 +123,7 @@ FileType FileNumber ValidLines Positives Negatives Templat
 .mqh 1 1023 2
 .msg 1 26644 1 1
 .mysql 1 36 2
-.ndjson 2 5006 49 324
+.ndjson 2 5006 49 325
 .nix 4 211 12
 .nolint 1 2 1
 .odd 1 1281 57
@@ -223,25 +223,25 @@ FileType FileNumber ValidLines Positives Negatives Templat
 .yml 418 36162 437 920 374
 .zsh 6 872 12
 .zsh-theme 1 97 1
-TOTAL: 10335 16998279 8097 60877 5159
-credsweeper result_cnt : 7519, lost_cnt : 0, true_cnt : 6817, false_cnt : 702
+TOTAL: 10335 16998279 8097 60881 5159
+credsweeper result_cnt : 7394, lost_cnt : 0, true_cnt : 6795, false_cnt : 599
 Rules Positives Negatives Templates Reported TP FP TN FN FPR FNR ACC PRC RCL F1
 ------------------------------ ----------- ----------- ----------- ---------- ---- ---- ----- ---- -------- -------- -------- -------- -------- --------
-API 117 3104 184 112 103 9 3279 14 0.002737 0.119658 0.993245 0.919643 0.880342 0.899563
+API 117 3104 184 105 101 4 3284 16 0.001217 0.136752 0.994126 0.961905 0.863248 0.909910
 AWS Client ID 163 13 0 154 154 0 13 9 0.000000 0.055215 0.948864 1.000000 0.944785 0.971609
 AWS Multi 71 12 0 83 71 11 1 0 0.916667 0.000000 0.867470 0.865854 1.000000 0.928105
 AWS S3 Bucket 61 25 0 87 61 24 1 0 0.960000 0.000000 0.720930 0.717647 1.000000 0.835616
 Atlassian Old PAT token 27 211 3 10 3 7 207 24 0.032710 0.888889 0.871369 0.300000 0.111111 0.162162
-Auth 318 2750 87 308 269 39 2798 49 0.013747 0.154088 0.972108 0.873377 0.845912 0.859425
+Auth 318 2750 87 293 267 26 2811 51 0.009165 0.160377 0.975594 0.911263 0.839623 0.873977
 Azure Access Token 19 0 0 0 0 0 19 1.000000 0.000000 0.000000
 BASE64 Private Key 7 2 0 7 7 0 2 0 0.000000 0.000000 1.000000 1.000000 1.000000 1.000000
 BASE64 encoded PEM Private Key 7 0 0 5 5 0 0 2 0.285714 0.714286 1.000000 0.714286 0.833333
 Bitbucket Client ID 147 1833 3 41 27 14 1822 120 0.007625 0.816327 0.932426 0.658537 0.183673 0.287234
 Bitbucket Client Secret 239 535 0 44 33 11 524 206 0.020561 0.861925 0.719638 0.750000 0.138075 0.233216
-Certificate 22 456 1 20 15 5 452 7 0.010941 0.318182 0.974948 0.750000 0.681818 0.714286
-Credential 31 130 74 29 29 0 204 2 0.000000 0.064516 0.991489 1.000000 0.935484 0.966667
+Certificate 22 456 1 17 16 1 456 6 0.002188 0.272727 0.985386 0.941176 0.727273 0.820513
+Credential 31 130 74 31 28 3 201 3 0.014706 0.096774 0.974468 0.903226 0.903226 0.903226
 Docker Swarm Token 2 0 0 2 2 0 0 0 0.000000 1.000000 1.000000 1.000000 1.000000
-Dropbox App secret 62 112 0 45 37 7 105 25 0.062500 0.403226 0.816092 0.840909 0.596774 0.698113
+Dropbox App secret 62 114 0 45 37 7 107 25 0.061404 0.403226 0.818182 0.840909 0.596774 0.698113
 Facebook Access Token 0 1 0 0 0 1 0 0.000000 1.000000
 Firebase Domain 6 1 0 7 6 1 0 0 1.000000 0.000000 0.857143 0.857143 1.000000 0.923077
 Github Old Token 1 0 0 1 1 0 0 0 0.000000 1.000000 1.000000 1.000000 1.000000
@@ -253,18 +253,18 @@ Google OAuth Access Token 3 0 0
 Grafana Provisioned API Key 22 1 0 1 1 0 1 21 0.000000 0.954545 0.086957 1.000000 0.045455 0.086957
 IPv4 691 365 0 1004 691 302 63 0 0.827397 0.000000 0.714015 0.695871 1.000000 0.820665
 IPv6 33 135 0 33 33 0 135 0 0.000000 0.000000 1.000000 1.000000 1.000000 1.000000
-JSON Web Token 284 10 2 280 272 8 4 12 0.666667 0.042254 0.932432 0.971429 0.957746 0.964539
+JSON Web Token 284 11 2 280 272 8 5 12 0.615385 0.042254 0.932660 0.971429 0.957746 0.964539
 Jira / Confluence PAT token 0 4 0 0 0 4 0 0.000000 1.000000
 Jira 2FA 7 6 0 3 3 0 6 4 0.000000 0.571429 0.692308 1.000000 0.428571 0.600000
-Key 427 7871 462 452 389 61 8272 38 0.007320 0.088993 0.988699 0.864444 0.911007 0.887115
-Nonce 43 89 0 60 32 28 61 11 0.314607 0.255814 0.704545 0.533333 0.744186 0.621359
+Key 427 7871 462 415 391 23 8310 36 0.002760 0.084309 0.993265 0.944444 0.915691 0.929845
+Nonce 43 89 0 42 36 6 83 7 0.067416 0.162791 0.901515 0.857143 0.837209 0.847059
 PEM Private Key 1019 1483 0 1023 1019 4 1479 0 0.002697 0.000000 0.998401 0.996090 1.000000 0.998041
-Password 1902 7425 2675 1647 1554 93 10007 348 0.009208 0.182965 0.963256 0.943534 0.817035 0.875740
-Salt 42 72 2 42 38 4 70 4 0.054054 0.095238 0.931034 0.904762 0.904762 0.904762
-Secret 1353 29656 873 1264 1235 29 30500 118 0.000950 0.087214 0.995389 0.977057 0.912786 0.943829
+Password 1902 7425 2675 1636 1543 93 10007 359 0.009208 0.188749 0.962340 0.943154 0.811251 0.872244
+Salt 42 72 2 38 38 0 74 4 0.000000 0.095238 0.965517 1.000000 0.904762 0.950000
+Secret 1353 29656 873 1239 1229 10 30519 124 0.000328 0.091648 0.995797 0.991929 0.908352 0.948302
 Seed 1 6 0 0 0 6 1 0.000000 1.000000 0.857143 0.000000
 Slack Token 4 1 0 4 4 0 1 0 0.000000 0.000000 1.000000 1.000000 1.000000 1.000000
-Token 553 3975 448 517 489 28 4395 64 0.006331 0.115732 0.981511 0.945841 0.884268 0.914019
+Token 553 3976 448 499 476 23 4401 77 0.005199 0.139241 0.979908 0.953908 0.860759 0.904943
 Twilio API Key 0 5 2 0 0 7 0 0.000000 1.000000
-URL Credentials 167 117 254 143 143 0 371 24 0.000000 0.143713 0.955390 1.000000 0.856287 0.922581
- 8097 60877 5159 7538 6817 702 60175 1280 0.011531 0.158083 0.971265 0.906637 0.841917 0.873079
+URL Credentials 167 117 254 153 149 4 367 18 0.010782 0.107784 0.959108 0.973856 0.892216 0.931250
+ 8097 60881 5159 7412 6795 599 60282 1302 0.009839 0.160800 0.972440 0.918988 0.839200 0.877284
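The rate columns in the benchmark table appear to follow the standard confusion-matrix definitions. The sketch below is an assumption, not the benchmark's own code, but it reproduces the new totals row (TP=6795, FP=599, TN=60282, FN=1302) to six decimals. The function names `ratio` and `metrics` are hypothetical, and returning `None` for a zero denominator is a guess at why some cells in the table are blank.

```python
from typing import Dict, Optional


def ratio(num: int, den: int) -> Optional[float]:
    """Return num / den, or None when the denominator is zero (assumed blank cell)."""
    return num / den if den else None


def metrics(tp: int, fp: int, tn: int, fn: int) -> Dict[str, Optional[float]]:
    """Standard rates derived from confusion-matrix counts."""
    return {
        "FPR": ratio(fp, fp + tn),                  # false positive rate
        "FNR": ratio(fn, fn + tp),                  # false negative rate
        "ACC": ratio(tp + tn, tp + fp + tn + fn),   # accuracy
        "PRC": ratio(tp, tp + fp),                  # precision
        "RCL": ratio(tp, tp + fn),                  # recall
        "F1": ratio(2 * tp, 2 * tp + fp + fn),      # harmonic mean of PRC and RCL
    }


totals = metrics(tp=6795, fp=599, tn=60282, fn=1302)
for name, value in totals.items():
    print(f"{name}: {value:.6f}")
```

Read this way, the totals row says that precision rose from 0.906637 to 0.918988 while recall slipped from 0.841917 to 0.839200, which nets out to a higher F1 (0.873079 to 0.877284).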
35 changes: 18 additions & 17 deletions credsweeper/app.py
@@ -4,7 +4,7 @@
 import signal
 import sys
 from pathlib import Path
-from typing import Any, List, Optional, Union, Dict, Sequence
+from typing import Any, List, Optional, Union, Dict, Sequence, Tuple

 import pandas as pd

@@ -13,7 +13,7 @@

 from credsweeper.common.constants import KeyValidationOption, Severity, ThresholdPreset
 from credsweeper.config import Config
-from credsweeper.credentials import Candidate, CredentialManager
+from credsweeper.credentials import Candidate, CredentialManager, CandidateKey
 from credsweeper.deep_scanner.deep_scanner import DeepScanner
 from credsweeper.file_handler.diff_content_provider import DiffContentProvider
 from credsweeper.file_handler.file_path_extractor import FilePathExtractor
@@ -336,32 +336,33 @@ def post_processing(self) -> None:
         """Machine learning validation for received credential candidates."""
         if self._use_ml_validation():
             logger.info(f"Grouping {len(self.credential_manager.candidates)} candidates")
-            new_cred_list = []
+            new_cred_list: List[Candidate] = []
             cred_groups = self.credential_manager.group_credentials()
-            ml_cred_groups = []
+            ml_cred_groups: List[Tuple[CandidateKey, List[Candidate]]] = []
             for group_key, group_candidates in cred_groups.items():
-                # Analyze with ML if all candidates in group require ML
+                # Analyze with ML if any candidate in group require ML
                 for candidate in group_candidates:
-                    if not candidate.use_ml:
+                    if candidate.use_ml:
+                        ml_cred_groups.append((group_key, group_candidates))
                         break
                 else:
-                    ml_cred_groups.append((group_key.value, group_candidates))
-                    continue
-                # If at least one of credentials in the group do not require ML - automatically report to user
-                for candidate in group_candidates:
-                    candidate.ml_validation = KeyValidationOption.NOT_AVAILABLE
-                new_cred_list += group_candidates
+                    # all candidates do not require ML
+                    new_cred_list.extend(group_candidates)

             # prevent extra ml_validator creation if ml_cred_groups is empty
             if ml_cred_groups:
                 logger.info(f"Run ML Validation for {len(ml_cred_groups)} groups")
                 is_cred, probability = self.ml_validator.validate_groups(ml_cred_groups, self.ml_batch_size)
                 for i, (_, group_candidates) in enumerate(ml_cred_groups):
-                    if is_cred[i]:
-                        for candidate in group_candidates:
-                            candidate.ml_validation = KeyValidationOption.VALIDATED_KEY
-                            candidate.ml_probability = probability[i]
-                        new_cred_list += group_candidates
+                    for candidate in group_candidates:
+                        if candidate.use_ml:
+                            if is_cred[i]:
+                                candidate.ml_validation = KeyValidationOption.VALIDATED_KEY
+                                candidate.ml_probability = probability[i]
+                                new_cred_list.append(candidate)
+                        else:
+                            candidate.ml_validation = KeyValidationOption.NOT_AVAILABLE
+                            new_cred_list.append(candidate)
             else:
                 logger.info("Skipping ML validation due not applicable")

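The grouping change in `post_processing` hinges on Python's `for ... else`: the `else` branch runs only when the loop finishes without `break`. The following is a simplified sketch of the new routing rule, assuming a hypothetical `Cand` stand-in for `Candidate` and a hypothetical `split_groups` helper. It omits the model call and the per-candidate validation flags.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Cand:
    """Hypothetical stand-in for credsweeper's Candidate."""
    name: str
    use_ml: bool


def split_groups(groups: Dict[str, List[Cand]]) -> Tuple[List[Tuple[str, List[Cand]]], List[Cand]]:
    """Route a group to ML if ANY member needs it; otherwise pass it through."""
    ml_groups: List[Tuple[str, List[Cand]]] = []
    passthrough: List[Cand] = []
    for key, members in groups.items():
        for cand in members:
            if cand.use_ml:
                ml_groups.append((key, members))
                break
        else:
            # reached only when the inner loop never hit `break`,
            # i.e. no member of the group requires ML
            passthrough.extend(members)
    return ml_groups, passthrough
```

Before this PR the condition was inverted: a group went to ML only when all of its members required it, so a single non-ML candidate caused the whole group to be reported without validation.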
9 changes: 7 additions & 2 deletions credsweeper/credentials/candidate_key.py
@@ -12,8 +12,10 @@ class CandidateKey:
     def __init__(self, line_data: LineData):
         self.path: str = line_data.path
         self.line_num: int = line_data.line_num
-        self.value: str = line_data.value
-        self.key: Tuple[str, int, str] = (self.path, self.line_num, self.value)
+        self.value_start: int = line_data.value_start
+        self.value_end: int = line_data.value_end
+        self.key: Tuple[str, int, int, int] = (self.path, self.line_num, self.value_start, self.value_end)
+        self.__line = line_data.line

     def __hash__(self):
         return hash(self.key)
@@ -23,3 +25,6 @@ def __eq__(self, other):

     def __ne__(self, other):
         return not (self == other)
+
+    def __repr__(self) -> str:
+        return f"{self.key}:{self.__line}"
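One observable consequence of keying on the value span instead of the value text: two identical values at different offsets on the same line no longer share a key. A small sketch under stated assumptions follows. `Line` is a hypothetical stand-in for `LineData`, and `value_key` and `position_key` are illustrative helpers that mirror the old and new tuples.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Line:
    """Hypothetical stand-in for credsweeper's LineData."""
    path: str
    line_num: int
    line: str
    value_start: int
    value_end: int


def value_key(d: Line) -> Tuple[str, int, str]:
    """Old-style key: path, line number, value text."""
    return (d.path, d.line_num, d.line[d.value_start:d.value_end])


def position_key(d: Line) -> Tuple[str, int, int, int]:
    """New-style key: path, line number, value span."""
    return (d.path, d.line_num, d.value_start, d.value_end)


text = 'a = "xx"; b = "xx"'
first = Line("cfg.py", 1, text, 5, 7)     # the first "xx"
second = Line("cfg.py", 1, text, 15, 17)  # the second "xx"
```

With these inputs the old-style keys are equal while the new-style keys differ, so the two occurrences would land in separate groups. Candidates from different rules that match exactly the same span still share one key.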
18 changes: 9 additions & 9 deletions credsweeper/ml_model/features.py
@@ -146,7 +146,7 @@ class PossibleComment(Feature):
     r"""Feature is true if candidate line starts with #,\*,/\*? (Possible comment)."""

     def extract(self, candidate: Candidate) -> bool:
-        for i in ["#", "*", "/*"]:
+        for i in ["#", "*", "/*", "//"]:
             if candidate.line_data_list[0].line.startswith(i):
                 return True
         return False
@@ -260,13 +260,13 @@ class FileExtension(Feature):

     def __init__(self, extensions: List[str]) -> None:
         super().__init__()
-        self.extensions = extensions
+        self.label_binarizer = LabelBinarizer()
+        self.label_binarizer.fit(extensions)

     def __call__(self, candidates: List[Candidate]) -> csr_matrix:
-        enc = LabelBinarizer()
-        enc.fit(self.extensions)
         extensions = [candidate.line_data_list[0].file_type for candidate in candidates]
-        return enc.transform(extensions)
+        result = self.label_binarizer.transform(extensions)
+        return result

     def extract(self, candidate: Candidate) -> Any:
         raise NotImplementedError
@@ -282,13 +282,13 @@ class RuleName(Feature):

     def __init__(self, rule_names: List[str]) -> None:
         super().__init__()
-        self.rule_names = rule_names
+        self.label_binarizer = LabelBinarizer()
+        self.label_binarizer.fit(rule_names)

     def __call__(self, candidates: List[Candidate]) -> csr_matrix:
-        enc = LabelBinarizer()
-        enc.fit(self.rule_names)
         rule_names = [candidate.rule_name for candidate in candidates]
-        return enc.transform(rule_names)
+        result = self.label_binarizer.transform(rule_names)
+        return result

     def extract(self, candidate: Candidate) -> Any:
         raise NotImplementedError
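The `FileExtension` and `RuleName` changes move encoder fitting from `__call__` into `__init__`, so the label set is processed once per feature object instead of once per batch. The sketch below illustrates that pattern with a deliberately simple pure-Python one-hot encoder. `OneHot` and `FileExtensionFeature` are hypothetical names; the encoder is a stand-in for illustration and not scikit-learn's `LabelBinarizer`.

```python
from typing import Dict, List


class OneHot:
    """Hypothetical minimal one-hot encoder (a stand-in, not sklearn's LabelBinarizer)."""

    def __init__(self, labels: List[str]) -> None:
        # "fit" once: the column order is fixed at construction time
        self.index: Dict[str, int] = {label: i for i, label in enumerate(sorted(set(labels)))}

    def transform(self, values: List[str]) -> List[List[int]]:
        rows = []
        for value in values:
            row = [0] * len(self.index)
            if value in self.index:  # labels not seen at fit time stay all-zero
                row[self.index[value]] = 1
            rows.append(row)
        return rows


class FileExtensionFeature:
    """Illustrative feature that reuses one fitted encoder across calls."""

    def __init__(self, extensions: List[str]) -> None:
        self.encoder = OneHot(extensions)  # built once, reused on every call

    def __call__(self, file_types: List[str]) -> List[List[int]]:
        return self.encoder.transform(file_types)
```

Usage: `FileExtensionFeature([".py", ".json", ".yml"])([".py", ".txt"])` yields one row per input, with the unknown `.txt` encoded as all zeros.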
Binary file modified credsweeper/ml_model/ml_model.onnx
Binary file not shown.