
ML refactoring #551

Merged: 104 commits into Samsung:main from the ml branch on May 28, 2024
Conversation

@babenek (Contributor) commented on May 3, 2024

Description

Please include a summary of the change and which issue is fixed.

  • ML refactoring
  • Add separate LSTM layers for the line and variable inputs (see the sketch below)
  • Retrain the model

[Attachment: ml_model (20240520_225355)]
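
For orientation, here is a minimal sketch of what "add line, variable lstm layers" could look like in Keras: separate bidirectional LSTM branches for the line, variable, and value inputs, concatenated with an auxiliary feature vector before a dense head. All layer sizes, input lengths, and tensor names are illustrative assumptions, not the actual CredSweeper configuration.

# Hypothetical sketch only: sizes, shapes, and names are assumptions.
from tensorflow.keras.layers import Bidirectional, Concatenate, Dense, Input, LSTM
from tensorflow.keras.models import Model

MAX_LEN = 160       # assumed character-sequence length
CHARSET_SIZE = 96   # assumed one-hot alphabet size
FEATURE_DIM = 64    # assumed number of hand-crafted features

line_in = Input(shape=(MAX_LEN, CHARSET_SIZE), name="line_input")
variable_in = Input(shape=(MAX_LEN, CHARSET_SIZE), name="variable_input")
value_in = Input(shape=(MAX_LEN, CHARSET_SIZE), name="value_input")
feature_in = Input(shape=(FEATURE_DIM,), name="feature_input")

# One bidirectional LSTM branch per text input, as the description suggests
line_lstm = Bidirectional(LSTM(32))(line_in)
variable_lstm = Bidirectional(LSTM(32))(variable_in)
value_lstm = Bidirectional(LSTM(32))(value_in)

# Merge the recurrent branches with the auxiliary features and classify
merged = Concatenate()([line_lstm, variable_lstm, value_lstm, feature_in])
hidden = Dense(64, activation="relu")(merged)
output = Dense(1, activation="sigmoid", name="probability")(hidden)

model = Model(inputs=[line_in, variable_in, value_in, feature_in], outputs=output)
model.compile(optimizer="adam", loss="binary_crossentropy")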

How has this been tested?

Please describe the tests that you ran to verify your changes.

  • UnitTest
  • Benchmark

@babenek babenek marked this pull request as ready for review May 21, 2024 13:23
@babenek babenek requested a review from a team as a code owner May 21, 2024 13:23
Comment on lines 166 to 199
# line_index_set.add((index[0], index[1]))
# rules_set.update(line_data["RuleName"])
# if not reference_line_data:
# reference_line_data = copy.deepcopy(line_data)

# remove to reduce memory usage
# line_data.pop("line_num")
# line_data.pop("path")
# line_data.pop("value_end")

values.append(line_data)
# values = list(detected_data.values())

# for _, i in meta_data.items():
# if i["Used"] is True:
# continue
# elif i["GroundTruth"] == 'T' \
# and any(x in rules_set for x in i["Category"].split(':')) \
# and (i["FilePath"], i["LineStart"]) in line_index_set \
# and 0 <= i["ValueStart"] < i["ValueEnd"]:
# print(f"NOT FOUND:{i}")
# markup_data = {
# "line": None, # read
# "line_num": i["LineStart"], # not used
# "path": i["FilePath"],
# "value": None,
# "value_start": i["ValueStart"], # remove
# "value_end": i["ValueEnd"], # remove
# "variable": None, # ???
# 'RuleName': (x for x in i["Category"].split(':') if x in line_index_set),
# 'GroundTruth': 'T',
# 'ext': Util.get_extension(i["FilePath"]),
# 'type': i["FilePath"].split('/')[-2]
# }
# assert markup_data.keys() == reference_line_data.keys(), reference_line_data.keys()
Contributor
It seems some old code was left behind. Please remove it if you don't need it.

@@ -798,7 +798,6 @@ def test_param_n(self) -> None:
def test_param_p(self) -> None:
# internal parametrized tests for quick debug
items = [ #
("prod.py", b"secret_api_key='Ah\\tga%$FiQ@Ei8'", "secret_api_key", "Ah\\tga%$FiQ@Ei8"), #
Contributor
It seems this PR causes some FPs...
Why were those cases removed?

Contributor (Author)
The \t sign did not appear in the training set, so the sequence gets a low ML probability. I'll update the sample instead of removing it.
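
For illustration only, one hypothetical way to update that parametrized item is to keep the same rule hit while using a value without the escaped tab; the replacement string below is an assumption, not the actual change made in this PR.

# Hypothetical update of the parametrized sample: same (path, line, variable, value)
# layout as in the diff above, with a value that avoids the unseen "\t" escape.
items = [  #
    ("prod.py", b"secret_api_key='AhXga%$FiQ@Ei8'", "secret_api_key", "AhXga%$FiQ@Ei8"),  #
]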

@babenek babenek requested a review from csh519 May 27, 2024 12:42
@csh519 (Contributor) left a comment:
Thank you for refactoring the ML model.
I have one suggestion about a typo; please check below.

experiment/src/prepare_data.py (outdated, resolved)
@babenek babenek requested a review from csh519 May 28, 2024 10:11
@csh519 (Contributor) left a comment:
The ML model has been refactored.

LGTM 👍

@babenek babenek merged commit d16a3a5 into Samsung:main May 28, 2024
27 of 28 checks passed
@babenek babenek deleted the ml branch May 28, 2024 10:30