add EOS token to modulartokenizer #44

floccinauc · 2023-05-24T13:10:15Z

No description provided.

mosheraboh

Just saw this relatively old PR.
Is it still relevant?
I left inline comments anyway.

mosheraboh · 2023-06-06T07:58:00Z

fusedrug/data/tokenizer/modulartokenizer/test_multi_tokenizer_creation.py

@@ -98,7 +98,7 @@ def get_training_corpus(dataset: List) -> Generator:
    print("Fin")


-@hydra.main(config_path="./configs", config_name="tokenizer_config", version_base=None)
+@hydra.main(config_path="./configs", config_name="tokenizer_config_personal", version_base=None)


mosheraboh · 2023-06-06T07:59:49Z

fusedrug/data/tokenizer/modulartokenizer/create_new_t5_tokenizer_with_special_tokens.py

@@ -0,0 +1,136 @@
+import hydra


What is this file? Is it a test, or a script that creates a modular tokenizer?
Why is titan relevant here?

mosheraboh · 2023-06-06T08:01:22Z

fusedrug/data/tokenizer/modulartokenizer/configs/tokenizer_add_special_tokens_config.yaml

@@ -0,0 +1,46 @@
+paths:


Any chance we can always use lowercase in variable and file names (including AA and SMILES)?

mosheraboh · 2023-06-06T08:02:22Z

fusedrug/data/tokenizer/modulartokenizer/configs/tokenizer_add_special_tokens_config.yaml

+data:  
+  tokenizer:
+    # modular_tokenizers_out_path: "${paths.tokenizers_path}/modular_wordlevelAA_BPESMILES/"
+    overall_max_len: 70


Why do you need to specify the max_len here? We should specify it per task.
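
A minimal sketch of what a per-task length limit could look like, for illustration only. It assumes the limit is applied after encoding, at the point where each task consumes the tokens. The helper `fit_to_max_len`, the `PAD_ID` value, and the task names are all hypothetical and are not part of the modular tokenizer API.

```python
from typing import Dict, List

PAD_ID = 0  # hypothetical padding token id, not taken from the PR


def fit_to_max_len(token_ids: List[int], max_len: int, pad_id: int = PAD_ID) -> List[int]:
    """Truncate or right-pad token_ids so the result has exactly max_len entries."""
    truncated = token_ids[:max_len]
    return truncated + [pad_id] * (max_len - len(truncated))


# Hypothetical per-task limits, kept with each task instead of in the tokenizer config.
TASK_MAX_LEN: Dict[str, int] = {"binding_affinity": 8, "solubility": 4}

encoded = [11, 12, 13, 14, 15, 16]  # stand-in for the ids a tokenizer would return
per_task = {task: fit_to_max_len(encoded, n) for task, n in TASK_MAX_LEN.items()}
```

With this layout the tokenizer config would not need an `overall_max_len` entry, and each task would pick the length that suits its inputs.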

mosheraboh · 2023-06-06T08:03:13Z

fusedrug/data/tokenizer/modulartokenizer/configs/tokenizer_add_special_tokens_config.yaml

+        # max_len: 100 # [Optional] max number of tokens to be used by all instances of this tokenizer. If None or undefined, no limit is set.
+        start_delimiter: "<start_AA>"
+        end_delimiter: "<end_AA>"
+      - name: AA


What is the difference between AA and AA nonspecial?

VADIM RATNER VADIMRA@il.ibm.com and others added 3 commits May 18, 2023 09:45

from_file for modular tokenizer

a5e842d

from_file for modular tokenizer

5a93fc8

add EOS token to modulartokenizer

9ad130b

floccinauc requested a review from mosheraboh May 24, 2023 13:10

floccinauc self-assigned this May 24, 2023

Merge branch 'main' into mod_tok_from_file

ff909be

mosheraboh requested changes Jun 6, 2023
