ML refactoring (#551)

* test ipmarkup
* test category2rules
* fix
* fix
* test val pos
* touch
* Update .github/workflows/benchmark.yml
* Update .github/workflows/benchmark.yml
* Update .github/workflows/benchmark.yml
* value sanitize in url improvement
* style
* fix ; split in url
* fix dub 2024-05-07T20:06:19+03:00
* fix4test
* Update .github/workflows/benchmark.yml
* dynamical bBM report
* fix google multi pattern
* trigger
* fix keyword pattern for \t
* Update tests/samples/auth_n.template
  touch a file to rerun bm
* \ for url pattern
* touch
* touch
* cat md5
* sha256sum md5
* psd
* entropy fix
* BM scores fix
* BM upd
* BM scors
* BM scor:
* cache upd
* [skip actions] [valpos] 2024-05-13T16:13:18+03:00
* Rollback custom ref
* accesskey
* BM upd for re-launch
* BM upd for re-launch 2
* BM scors upd
* use prescan report for BM
* BM upd
* doc FP fix
* BM scores fix
* BM scores
* BM scores upd
* BM scores upd
* Slack Token upd
* BM csores upd
* BM csores upd2
* upd4test
* BM scor up with 0 in rate
* 20240519-1108 batch 2048 42 epc
* [skip actions] [ml] 2024-05-19T12:31:17+03:00
* 21th epoch
* 21th epoch. weights
* test4
* [skip actions] [ml] 2024-05-20T18:01:16+03:00
* [skip actions] [ml] 2024-05-20T18:04:19+03:00
* [skip actions] [ml] 2024-05-20T18:19:46+03:00
* jwt uses ml
* [skip actions] [ml] 2024-05-20T20:10:48+03:00
* ml test
* sigmoid
* fix validator
* 0.33
* testfix
* Update benchmark.yml
* testfix
* rollback BM wf
* upd
* BM scores fix
* workaround for CI step
* requests==2.32.0
* mypyfix
* reform
* test commit
* upd sample which gives shivering probability in macos
* features size preprocess to calculate the dimension automatically
* ml_validator creates in runtime instead import
* reform
* use local metrics
* [skip actions] [ml] 2024-05-27T13:31:20+03:00
* [skip actions] [ml] 2024-05-27T13:33:58+03:00
* rollback
* rollback test
* save plot fix
* BM scores fix
* Update experiment/src/prepare_data.py
  Co-authored-by: ShinHyung Choi <sh519.choi@samsung.com>
* BM scores fix after CredData PR

---------

Co-authored-by: ShinHyung Choi <sh519.choi@samsung.com>
Samsung · May 28, 2024 · d16a3a5
1 parent 5ad4e1d
commit d16a3a5
Showing 25 changed files with 1,590 additions and 2,304 deletions.
diff --git a/.github/workflows/check.yml b/.github/workflows/check.yml
@@ -58,7 +58,7 @@ jobs:
     - name: Check ml_model.onnx integrity
       if: ${{ always() && steps.code_checkout.conclusion == 'success' }}
       run: |
-        md5sum --binary credsweeper/ml_model/ml_model.onnx | grep 57ec152f6aa740456c742ecd5e7d9ef5
+        md5sum --binary credsweeper/ml_model/ml_model.onnx | grep 8f277b2f4a67a9911a9a860f1b5c0489
 
     # # # Python setup
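
The workflow step above pins `ml_model.onnx` to a known MD5 digest with `md5sum | grep`. For a local check, a rough Python equivalent might look like the sketch below. `file_md5` and `matches_pinned_digest` are hypothetical helper names, not part of CredSweeper.

```python
import hashlib
from pathlib import Path


def file_md5(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with path.open("rb") as stream:
        # iter() with a sentinel stops at the first empty read (end of file)
        for chunk in iter(lambda: stream.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_pinned_digest(path: Path, expected_hex: str) -> bool:
    """Compare the file digest with a pinned value (hypothetical helper)."""
    return file_md5(path) == expected_hex.lower()
```

Calling `matches_pinned_digest(Path("credsweeper/ml_model/ml_model.onnx"), "8f277b2f4a67a9911a9a860f1b5c0489")` would mirror the updated check, assuming the script runs from the repository root.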
 

diff --git a/cicd/benchmark.txt b/cicd/benchmark.txt
@@ -1,4 +1,4 @@
-DATA: 16998279 interested lines. MARKUP: 63222 items
+DATA: 16998279 interested lines. MARKUP: 63226 items
 FileType           FileNumber    ValidLines    Positives    Negatives    Templates
 ---------------  ------------  ------------  -----------  -----------  -----------
                           194         28318           64          430           87
@@ -83,8 +83,8 @@ FileType           FileNumber    ValidLines    Positives    Negatives    Templat
 .java                     621        134132          311         1348          169
 .jenkinsfile                1            58            1            7
 .jinja2                     1            64                         2
-.js                       658        536388          494         2628          338
-.json                     860      13670750          817        10952          139
+.js                       658        536388          494         2630          338
+.json                     860      13670750          817        10953          139
 .jsp                       13          3202            1           42
 .jsx                        7           857                        19
 .jwt                        6             8            7
@@ -123,7 +123,7 @@ FileType           FileNumber    ValidLines    Positives    Negatives    Templat
 .mqh                        1          1023                         2
 .msg                        1         26644            1            1
 .mysql                      1            36                                      2
-.ndjson                     2          5006           49          324
+.ndjson                     2          5006           49          325
 .nix                        4           211                        12
 .nolint                     1             2                         1
 .odd                        1          1281                        57
@@ -223,25 +223,25 @@ FileType           FileNumber    ValidLines    Positives    Negatives    Templat
 .yml                      418         36162          437          920          374
 .zsh                        6           872                        12
 .zsh-theme                  1            97                         1
-TOTAL:                  10335      16998279         8097        60877         5159
-credsweeper result_cnt : 7519, lost_cnt : 0, true_cnt : 6817, false_cnt : 702
+TOTAL:                  10335      16998279         8097        60881         5159
+credsweeper result_cnt : 7394, lost_cnt : 0, true_cnt : 6795, false_cnt : 599
 Rules                             Positives    Negatives    Templates    Reported    TP    FP     TN    FN       FPR       FNR       ACC       PRC       RCL        F1
 ------------------------------  -----------  -----------  -----------  ----------  ----  ----  -----  ----  --------  --------  --------  --------  --------  --------
-API                                     117         3104          184         112   103     9   3279    14  0.002737  0.119658  0.993245  0.919643  0.880342  0.899563
+API                                     117         3104          184         105   101     4   3284    16  0.001217  0.136752  0.994126  0.961905  0.863248  0.909910
 AWS Client ID                           163           13            0         154   154     0     13     9  0.000000  0.055215  0.948864  1.000000  0.944785  0.971609
 AWS Multi                                71           12            0          83    71    11      1     0  0.916667  0.000000  0.867470  0.865854  1.000000  0.928105
 AWS S3 Bucket                            61           25            0          87    61    24      1     0  0.960000  0.000000  0.720930  0.717647  1.000000  0.835616
 Atlassian Old PAT token                  27          211            3          10     3     7    207    24  0.032710  0.888889  0.871369  0.300000  0.111111  0.162162
-Auth                                    318         2750           87         308   269    39   2798    49  0.013747  0.154088  0.972108  0.873377  0.845912  0.859425
+Auth                                    318         2750           87         293   267    26   2811    51  0.009165  0.160377  0.975594  0.911263  0.839623  0.873977
 Azure Access Token                       19            0            0                 0     0      0    19            1.000000  0.000000            0.000000
 BASE64 Private Key                        7            2            0           7     7     0      2     0  0.000000  0.000000  1.000000  1.000000  1.000000  1.000000
 BASE64 encoded PEM Private Key            7            0            0           5     5     0      0     2            0.285714  0.714286  1.000000  0.714286  0.833333
 Bitbucket Client ID                     147         1833            3          41    27    14   1822   120  0.007625  0.816327  0.932426  0.658537  0.183673  0.287234
 Bitbucket Client Secret                 239          535            0          44    33    11    524   206  0.020561  0.861925  0.719638  0.750000  0.138075  0.233216
-Certificate                              22          456            1          20    15     5    452     7  0.010941  0.318182  0.974948  0.750000  0.681818  0.714286
-Credential                               31          130           74          29    29     0    204     2  0.000000  0.064516  0.991489  1.000000  0.935484  0.966667
+Certificate                              22          456            1          17    16     1    456     6  0.002188  0.272727  0.985386  0.941176  0.727273  0.820513
+Credential                               31          130           74          31    28     3    201     3  0.014706  0.096774  0.974468  0.903226  0.903226  0.903226
 Docker Swarm Token                        2            0            0           2     2     0      0     0            0.000000  1.000000  1.000000  1.000000  1.000000
-Dropbox App secret                       62          112            0          45    37     7    105    25  0.062500  0.403226  0.816092  0.840909  0.596774  0.698113
+Dropbox App secret                       62          114            0          45    37     7    107    25  0.061404  0.403226  0.818182  0.840909  0.596774  0.698113
 Facebook Access Token                     0            1            0                 0     0      1     0  0.000000            1.000000
 Firebase Domain                           6            1            0           7     6     1      0     0  1.000000  0.000000  0.857143  0.857143  1.000000  0.923077
 Github Old Token                          1            0            0           1     1     0      0     0            0.000000  1.000000  1.000000  1.000000  1.000000
@@ -253,18 +253,18 @@ Google OAuth Access Token                 3            0            0
 Grafana Provisioned API Key              22            1            0           1     1     0      1    21  0.000000  0.954545  0.086957  1.000000  0.045455  0.086957
 IPv4                                    691          365            0        1004   691   302     63     0  0.827397  0.000000  0.714015  0.695871  1.000000  0.820665
 IPv6                                     33          135            0          33    33     0    135     0  0.000000  0.000000  1.000000  1.000000  1.000000  1.000000
-JSON Web Token                          284           10            2         280   272     8      4    12  0.666667  0.042254  0.932432  0.971429  0.957746  0.964539
+JSON Web Token                          284           11            2         280   272     8      5    12  0.615385  0.042254  0.932660  0.971429  0.957746  0.964539
 Jira / Confluence PAT token               0            4            0                 0     0      4     0  0.000000            1.000000
 Jira 2FA                                  7            6            0           3     3     0      6     4  0.000000  0.571429  0.692308  1.000000  0.428571  0.600000
-Key                                     427         7871          462         452   389    61   8272    38  0.007320  0.088993  0.988699  0.864444  0.911007  0.887115
-Nonce                                    43           89            0          60    32    28     61    11  0.314607  0.255814  0.704545  0.533333  0.744186  0.621359
+Key                                     427         7871          462         415   391    23   8310    36  0.002760  0.084309  0.993265  0.944444  0.915691  0.929845
+Nonce                                    43           89            0          42    36     6     83     7  0.067416  0.162791  0.901515  0.857143  0.837209  0.847059
 PEM Private Key                        1019         1483            0        1023  1019     4   1479     0  0.002697  0.000000  0.998401  0.996090  1.000000  0.998041
-Password                               1902         7425         2675        1647  1554    93  10007   348  0.009208  0.182965  0.963256  0.943534  0.817035  0.875740
-Salt                                     42           72            2          42    38     4     70     4  0.054054  0.095238  0.931034  0.904762  0.904762  0.904762
-Secret                                 1353        29656          873        1264  1235    29  30500   118  0.000950  0.087214  0.995389  0.977057  0.912786  0.943829
+Password                               1902         7425         2675        1636  1543    93  10007   359  0.009208  0.188749  0.962340  0.943154  0.811251  0.872244
+Salt                                     42           72            2          38    38     0     74     4  0.000000  0.095238  0.965517  1.000000  0.904762  0.950000
+Secret                                 1353        29656          873        1239  1229    10  30519   124  0.000328  0.091648  0.995797  0.991929  0.908352  0.948302
 Seed                                      1            6            0                 0     0      6     1  0.000000  1.000000  0.857143            0.000000
 Slack Token                               4            1            0           4     4     0      1     0  0.000000  0.000000  1.000000  1.000000  1.000000  1.000000
-Token                                   553         3975          448         517   489    28   4395    64  0.006331  0.115732  0.981511  0.945841  0.884268  0.914019
+Token                                   553         3976          448         499   476    23   4401    77  0.005199  0.139241  0.979908  0.953908  0.860759  0.904943
 Twilio API Key                            0            5            2                 0     0      7     0  0.000000            1.000000
-URL Credentials                         167          117          254         143   143     0    371    24  0.000000  0.143713  0.955390  1.000000  0.856287  0.922581
-                                       8097        60877         5159        7538  6817   702  60175  1280  0.011531  0.158083  0.971265  0.906637  0.841917  0.873079
+URL Credentials                         167          117          254         153   149     4    367    18  0.010782  0.107784  0.959108  0.973856  0.892216  0.931250
+                                       8097        60881         5159        7412  6795   599  60282  1302  0.009839  0.160800  0.972440  0.918988  0.839200  0.877284
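
The rate columns in the table above are consistent with the standard confusion-matrix definitions. The sketch below applies those definitions; it is not taken from the benchmark tool, and treating an empty table cell as an undefined ratio (zero denominator) is an assumption. `confusion_metrics` is a hypothetical name.

```python
from typing import Dict, Optional


def _ratio(numerator: float, denominator: float) -> Optional[float]:
    """Return numerator / denominator, or None when the ratio is undefined."""
    return numerator / denominator if denominator else None


def confusion_metrics(tp: int, fp: int, tn: int, fn: int) -> Dict[str, Optional[float]]:
    """Compute FPR, FNR, ACC, PRC, RCL and F1 from confusion-matrix counts."""
    precision = _ratio(tp, tp + fp)
    recall = _ratio(tp, tp + fn)
    f1 = None
    # F1 is left undefined when precision or recall is undefined
    if precision is not None and recall is not None:
        f1 = _ratio(2 * precision * recall, precision + recall)
    return {
        "FPR": _ratio(fp, fp + tn),
        "FNR": _ratio(fn, fn + tp),
        "ACC": _ratio(tp + tn, tp + fp + tn + fn),
        "PRC": precision,
        "RCL": recall,
        "F1": f1,
    }
```

With the counts from the updated `API` row (TP=101, FP=4, TN=3284, FN=16), these formulas agree with the published row to six decimal places.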
diff --git a/credsweeper/app.py b/credsweeper/app.py
@@ -4,7 +4,7 @@
 import signal
 import sys
 from pathlib import Path
-from typing import Any, List, Optional, Union, Dict, Sequence
+from typing import Any, List, Optional, Union, Dict, Sequence, Tuple
 
 import pandas as pd
 
@@ -13,7 +13,7 @@
 
 from credsweeper.common.constants import KeyValidationOption, Severity, ThresholdPreset
 from credsweeper.config import Config
-from credsweeper.credentials import Candidate, CredentialManager
+from credsweeper.credentials import Candidate, CredentialManager, CandidateKey
 from credsweeper.deep_scanner.deep_scanner import DeepScanner
 from credsweeper.file_handler.diff_content_provider import DiffContentProvider
 from credsweeper.file_handler.file_path_extractor import FilePathExtractor
@@ -336,32 +336,33 @@ def post_processing(self) -> None:
         """Machine learning validation for received credential candidates."""
         if self._use_ml_validation():
             logger.info(f"Grouping {len(self.credential_manager.candidates)} candidates")
-            new_cred_list = []
+            new_cred_list: List[Candidate] = []
             cred_groups = self.credential_manager.group_credentials()
-            ml_cred_groups = []
+            ml_cred_groups: List[Tuple[CandidateKey, List[Candidate]]] = []
             for group_key, group_candidates in cred_groups.items():
-                # Analyze with ML if all candidates in group require ML
+                # Analyze with ML if any candidate in the group requires ML
                 for candidate in group_candidates:
-                    if not candidate.use_ml:
+                    if candidate.use_ml:
+                        ml_cred_groups.append((group_key, group_candidates))
                         break
                 else:
-                    ml_cred_groups.append((group_key.value, group_candidates))
-                    continue
-                # If at least one of credentials in the group do not require ML - automatically report to user
-                for candidate in group_candidates:
-                    candidate.ml_validation = KeyValidationOption.NOT_AVAILABLE
-                new_cred_list += group_candidates
+                    # no candidate in the group requires ML
+                    new_cred_list.extend(group_candidates)
 
             # prevent extra ml_validator creation if ml_cred_groups is empty
             if ml_cred_groups:
                 logger.info(f"Run ML Validation for {len(ml_cred_groups)} groups")
                 is_cred, probability = self.ml_validator.validate_groups(ml_cred_groups, self.ml_batch_size)
                 for i, (_, group_candidates) in enumerate(ml_cred_groups):
-                    if is_cred[i]:
-                        for candidate in group_candidates:
-                            candidate.ml_validation = KeyValidationOption.VALIDATED_KEY
-                            candidate.ml_probability = probability[i]
-                        new_cred_list += group_candidates
+                    for candidate in group_candidates:
+                        if candidate.use_ml:
+                            if is_cred[i]:
+                                candidate.ml_validation = KeyValidationOption.VALIDATED_KEY
+                                candidate.ml_probability = probability[i]
+                                new_cred_list.append(candidate)
+                        else:
+                            candidate.ml_validation = KeyValidationOption.NOT_AVAILABLE
+                            new_cred_list.append(candidate)
             else:
                 logger.info("Skipping ML validation due not applicable")
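
The new control flow in `post_processing` can be summarised with a small standalone sketch. `Cand`, `split_and_filter`, and the `verdicts` mapping are hypothetical stand-ins for `Candidate`, the method body, and the `is_cred` result of the validator; only the branching mirrors the diff.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Cand:
    """Hypothetical stand-in for credsweeper's Candidate."""
    name: str
    use_ml: bool


def split_and_filter(groups: Dict[str, List[Cand]], verdicts: Dict[str, bool]) -> List[Cand]:
    kept: List[Cand] = []
    ml_groups: List[Tuple[str, List[Cand]]] = []
    for key, candidates in groups.items():
        if any(c.use_ml for c in candidates):
            # at least one candidate needs ML: the whole group is validated
            ml_groups.append((key, candidates))
        else:
            # no candidate needs ML: report the group as is
            kept.extend(candidates)
    for key, candidates in ml_groups:
        for c in candidates:
            # non-ML candidates are always kept; ML candidates only on a positive verdict
            if not c.use_ml or verdicts[key]:
                kept.append(c)
    return kept
```

Compared with the previous behaviour, a single non-ML candidate no longer exempts the ML candidates in its group from validation.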
 

diff --git a/credsweeper/credentials/candidate_key.py b/credsweeper/credentials/candidate_key.py
@@ -12,8 +12,10 @@ class CandidateKey:
     def __init__(self, line_data: LineData):
         self.path: str = line_data.path
         self.line_num: int = line_data.line_num
-        self.value: str = line_data.value
-        self.key: Tuple[str, int, str] = (self.path, self.line_num, self.value)
+        self.value_start: int = line_data.value_start
+        self.value_end: int = line_data.value_end
+        self.key: Tuple[str, int, int, int] = (self.path, self.line_num, self.value_start, self.value_end)
+        self.__line = line_data.line
 
     def __hash__(self):
         return hash(self.key)
@@ -23,3 +25,6 @@ def __eq__(self, other):
 
     def __ne__(self, other):
         return not (self == other)
+
+    def __repr__(self) -> str:
+        return f"{self.key}:{self.__line}"
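
Keying on the value position instead of the value text changes which candidates fall into the same group. The sketch below illustrates the idea with a hypothetical `PositionKey` class; the `isinstance` check in `__eq__` is an assumption, since the original `__eq__` body is not shown in this hunk.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


class PositionKey:
    """Hypothetical key built from path, line number and value span."""

    def __init__(self, path: str, line_num: int, value_start: int, value_end: int):
        self.key: Tuple[str, int, int, int] = (path, line_num, value_start, value_end)

    def __hash__(self) -> int:
        return hash(self.key)

    def __eq__(self, other: object) -> bool:
        return isinstance(other, PositionKey) and self.key == other.key


def group_by_position(matches: List[Tuple[str, PositionKey]]) -> Dict[PositionKey, List[str]]:
    """Group rule names by the span they matched."""
    groups: Dict[PositionKey, List[str]] = defaultdict(list)
    for rule_name, key in matches:
        groups[key].append(rule_name)
    return dict(groups)
```

Under this scheme two rules that match the same span share a group, while the same value text repeated at another offset on the line forms a separate group.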
diff --git a/credsweeper/ml_model/features.py b/credsweeper/ml_model/features.py
@@ -146,7 +146,7 @@ class PossibleComment(Feature):
     r"""Feature is true if candidate line starts with #,\*,/\*? (Possible comment)."""
 
     def extract(self, candidate: Candidate) -> bool:
-        for i in ["#", "*", "/*"]:
+        for i in ["#", "*", "/*", "//"]:
             if candidate.line_data_list[0].line.startswith(i):
                 return True
         return False
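
For reference, the same prefix test can be written without a loop, because `str.startswith` accepts a tuple of prefixes. This is only an alternative sketch operating on a plain string; `looks_like_comment` and `COMMENT_PREFIXES` are hypothetical names. As in the diff, the line is tested as given, so an indented comment would not match unless the caller strips it first.

```python
COMMENT_PREFIXES = ("#", "*", "/*", "//")


def looks_like_comment(line: str) -> bool:
    """Return True if the line begins with one of the comment prefixes."""
    return line.startswith(COMMENT_PREFIXES)
```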
@@ -260,13 +260,13 @@ class FileExtension(Feature):
 
     def __init__(self, extensions: List[str]) -> None:
         super().__init__()
-        self.extensions = extensions
+        self.label_binarizer = LabelBinarizer()
+        self.label_binarizer.fit(extensions)
 
     def __call__(self, candidates: List[Candidate]) -> csr_matrix:
-        enc = LabelBinarizer()
-        enc.fit(self.extensions)
         extensions = [candidate.line_data_list[0].file_type for candidate in candidates]
-        return enc.transform(extensions)
+        result = self.label_binarizer.transform(extensions)
+        return result
 
     def extract(self, candidate: Candidate) -> Any:
         raise NotImplementedError
@@ -282,13 +282,13 @@ class RuleName(Feature):
 
     def __init__(self, rule_names: List[str]) -> None:
         super().__init__()
-        self.rule_names = rule_names
+        self.label_binarizer = LabelBinarizer()
+        self.label_binarizer.fit(rule_names)
 
     def __call__(self, candidates: List[Candidate]) -> csr_matrix:
-        enc = LabelBinarizer()
-        enc.fit(self.rule_names)
         rule_names = [candidate.rule_name for candidate in candidates]
-        return enc.transform(rule_names)
+        result = self.label_binarizer.transform(rule_names)
+        return result
 
     def extract(self, candidate: Candidate) -> Any:
         raise NotImplementedError
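
The change in `FileExtension` and `RuleName` moves the encoder fit from every call into the constructor. The dependency-free sketch below shows that pattern with a minimal one-hot encoder; `OneHotStandIn` is a hypothetical stand-in and is not scikit-learn's `LabelBinarizer`. In particular, encoding an unknown label as an all-zero row is a choice made for this sketch.

```python
from typing import Dict, List, Sequence


class OneHotStandIn:
    """Minimal one-hot encoder used only for illustration."""

    def fit(self, labels: Sequence[str]) -> "OneHotStandIn":
        self.classes_: List[str] = sorted(set(labels))
        self._index: Dict[str, int] = {label: i for i, label in enumerate(self.classes_)}
        return self

    def transform(self, labels: Sequence[str]) -> List[List[int]]:
        rows = []
        for label in labels:
            row = [0] * len(self.classes_)
            if label in self._index:
                row[self._index[label]] = 1
            rows.append(row)
        return rows


class ExtensionFeature:
    """Hypothetical feature that fits its encoder once, at construction."""

    def __init__(self, extensions: Sequence[str]) -> None:
        self.encoder = OneHotStandIn().fit(extensions)  # fitted a single time

    def __call__(self, file_types: Sequence[str]) -> List[List[int]]:
        # each call only transforms; no refit per batch
        return self.encoder.transform(file_types)
```

Fitting once avoids repeating the same work for every batch of candidates and keeps the column order fixed for the lifetime of the feature object.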
diff --git a/credsweeper/ml_model/ml_model.onnx b/credsweeper/ml_model/ml_model.onnx