How to generate our own pretrain dataset? #18
Comments
I have finished a script that generates the pretraining dataset. It runs, but I am not sure it is exactly right. Could you please help me check it? Thanks a lot. 1. python command/pretrain/prepare_json.py in:data-raw/bin out:data-raw/funcbytes command/finetune/prepare_finetune_single.py is shown below.
Hi @RobinHan24, thanks for posting your scripts. Since you are pretraining, byte1-4 need to contain real execution traces, not the dummy traces used only in finetuning. Your generated data may be in the correct format and still not contain actual traces. This corresponds to your 2nd step, "python command/finetune/prepare_finetune_trace.py", which only prepares the dataset with dummy traces (where byte1-4 are mostly dummy values). To generate actual traces, you need an emulator that really executes the code you collected in funcbytes (you may want to look at
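To make the difference between dummy and real values concrete, here is a minimal sketch of how one recorded trace could be turned into the four byte1-4 sequences. It rests on assumptions not confirmed in this thread: that each trace value is a 32-bit integer split big-endian into four two-digit hex tokens, and that positions without a value get a placeholder token. The names `DUMMY`, `value_to_byte_tokens`, and `trace_to_byte_fields` are hypothetical and are not taken from the repository's scripts.

```python
import struct

# Hypothetical placeholder for "no trace value at this position";
# the token the project actually uses is not shown in this thread.
DUMMY = "##"


def value_to_byte_tokens(value):
    """Split a 32-bit value into four two-digit hex tokens (big-endian).

    Returns four placeholder tokens when no value was recorded.
    """
    if value is None:
        return [DUMMY] * 4
    packed = struct.pack(">I", value & 0xFFFFFFFF)
    return [f"{b:02x}" for b in packed]


def trace_to_byte_fields(trace):
    """Turn a list of per-token values into parallel byte1..byte4 strings."""
    columns = [value_to_byte_tokens(v) for v in trace]
    return {
        f"byte{i + 1}": " ".join(col[i] for col in columns)
        for i in range(4)
    }


# Example: two values recorded by an emulator, one position with no value.
fields = trace_to_byte_fields([0x0804A010, None, 0x2A])
for name, seq in fields.items():
    print(name, seq)
```

With dummy traces, nearly every position would hold the placeholder. A pretraining set would instead carry values recorded while emulating the function bytes.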
@peikexin9 Thanks again. I'm curious how you collect enough vulnerability data to uncover previously undiscovered vulnerabilities in firmware images. Could you please share your experience or methods? Thank you very much.
As mentioned in the README, I ran the script preprocess_pretrain_10k.py to generate the data in data-bin/pretrain_10k. But how can I generate my own data like that in data-src/pretrain_10k? Thanks a lot.