PS: 1st project for the Shanghai Jiao Tong University AI application development competition
🚧 Currently under development 🚧
Malpro is currently in active development and not usable yet. For now, only 50 fixed malware families are supported.
Many products classify malware from the raw PE (.exe) file, using machine learning and/or extracted PE features. Malpro instead classifies the assembly file that IDA generates from the raw PE file.
Malpro can currently classify 85 malware families (in the beta versions), and we plan to support 147 in the future.
All supported families are listed in malware_families_list.json
During our research, we noticed many projects that classify malware with CNNs (or MalConv, random forests, etc.).
We then found a competition held by Microsoft (https://www.kaggle.com/competitions/malware-classification) that uses .asm files to classify 9 malware families.
Could the same approach cover more families? After studying the blogs, articles and interviews published by the winners of that competition, we decided that classifying from assembly files was a good idea.
Features in an assembly file are easier for a machine to learn than features in a raw PE file.
We tried to prove this as rigorously as possible, but for lack of time and CPU (I'm dead serious) we can only claim this much: given 6000s and one CPU, a random forest trained on assembly files performs better than one trained on PE files.
If you are interested, see the blog.
The powerful features in assembly files are:
* opcode n-gram features
We recommend n=3 or n=4 for the opcode n-gram features when training the random forest model.
NOTE: You can customise 'n' for the opcode n-gram features when training models in main.py
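To make the feature concrete, here is a minimal sketch of opcode n-gram counting. It is not the extraction code in main.py; the function name `opcode_ngrams` and the sample opcode sequence are made up for illustration.

```python
from collections import Counter


def opcode_ngrams(opcodes, n=3):
    """Count every run of n consecutive opcodes in the sequence."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # Zipping n shifted views of the list yields each sliding window as a tuple.
    windows = zip(*(opcodes[i:] for i in range(n)))
    return Counter(windows)


# Hypothetical opcode sequence, as it might come from one disassembled function.
sample = ["push", "mov", "sub", "push", "mov", "sub", "call"]
counts = opcode_ngrams(sample, n=3)
print(counts.most_common(2))
```

Each distinct n-gram becomes one feature column, so a larger n captures longer instruction patterns but grows the feature space quickly. That trade-off is probably why small values such as 3 or 4 work well here.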
Test machine: Ultra 7 CPU, 32GB RAM
Dataset: kaggle
n=3
dataset size: 2480 malware samples (train:test=9:1)
malware families: 9 types
accuracy: 0.9959677419354839
time spent extracting features: 36s (.asm image features) + 8min34s (opcode 3-gram features)
time spent training: less than 10s
Dataset: Vx underground dataset (42 malware families)
and Dataset from kaggle (8 malware families)
check the accuracy in ./model/model_accu.txt and ./model/n=3/model_accu.txt in 50-malware-branch
n=4
dataset size: 6921 malware samples (train:test=9:1)
malware families: 50 types
accuracy: 0.9415692821368948
n=3
dataset size: 6921 malware samples (train:test=9:1)
malware families: 50 types
accuracy: 0.9415692821368948
time spent in training: 12s
check the accuracy in ./model/model_accu.txt in 85-malware-branch
Dataset: Vx underground dataset (42 malware families)
and Dataset from kaggle (8 malware families)
and Dataset from BODMAS(35 malware families)
n=4
dataset size: 9352 malware samples (train:test=9:1)
malware families: 85 types
accuracy: 0.8728632478632479
n=3
dataset size: 9352 malware samples (train:test=9:1)
malware families: 85 types
accuracy: 0.8824786324786325
A model we trained ships with the project (model/model.pt), so you can classify malware without training your own.
pip install -r requirments.txt
TIP: we recommend creating a new Python 3.9 virtual environment for this project, because it depends on older versions of some libraries.
Install IDA Pro (7.x) from here
Change {ida_install_dir}/idc/analysis.idc
line 41:
from
gen_file(OFILE_ASM, fhandle, 0, BADADDR, 0); // create the assembler file
to
gen_file(OFILE_LST, fhandle, 0, BADADDR, 0); // create the assembler file
Change {ida_install_dir}/cfg/ida.cfg
line 399:
from
OPCODE_BYTES = 0 // display this many instruction/data bytes (0 to disable)
to
OPCODE_BYTES = 16 // display this many instruction/data bytes (0 to disable)
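If you would rather script these two edits than make them by hand, a sketch like the following could work. It is not part of Malpro: the helper names are hypothetical, it matches on the text of the lines instead of the line numbers (which may differ between IDA versions), and it saves a `.bak` copy before overwriting anything.

```python
import re
from pathlib import Path


def patch_analysis_idc(text):
    """Switch the gen_file call from OFILE_ASM to OFILE_LST."""
    return re.sub(r"gen_file\(\s*OFILE_ASM\b", "gen_file(OFILE_LST", text)


def patch_ida_cfg(text, nbytes=16):
    """Set OPCODE_BYTES to nbytes, leaving the rest of the line untouched."""
    return re.sub(
        r"^(\s*OPCODE_BYTES\s*=\s*)\d+", rf"\g<1>{nbytes}", text, flags=re.M
    )


def patch_file(path, patch):
    """Apply patch() to a file in place; return True if the file changed."""
    path = Path(path)
    # latin-1 maps every byte to one character, so the file round-trips exactly.
    original = path.read_bytes().decode("latin-1")
    patched = patch(original)
    if patched == original:
        return False
    path.with_name(path.name + ".bak").write_bytes(original.encode("latin-1"))
    path.write_bytes(patched.encode("latin-1"))
    return True


# The two lines quoted above, used here as a quick check of the helpers.
idc_line = "gen_file(OFILE_ASM, fhandle, 0, BADADDR, 0); // create the assembler file"
cfg_line = "OPCODE_BYTES = 0 // display this many instruction/data bytes (0 to disable)"
print(patch_analysis_idc(idc_line))
print(patch_ida_cfg(cfg_line))
```

To patch a real installation you would call, for example, `patch_file(ida_dir / "idc" / "analysis.idc", patch_analysis_idc)`, where `ida_dir` is your own `{ida_install_dir}`. Check the result by eye afterwards, since the file contents may not match this sketch exactly.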
python main.py
We uploaded one malware sample for each malware family (50 types). You can use main.py to examine the model.
Two models are uploaded, one using n=4 (default) and one using n=3.
Each model file is over 100MB (too large to upload), so we uploaded them as .zip files. Don't forget to unzip them.
In the release version, however, the models are not zipped.
Switch the model file to the n=3 one if you want.
Choose "Predict malware(.asm) directly(using ./model/model.pt)" in the menu and enter the location of one of the examples (or download raw PE files of those 9 types of malware and use IDA to generate assembly files as your own samples). You will then get a prediction result.
python server.py
It runs a web server on your host (port 7777) as the frontend of the project. Open it and upload your PE file, and you will see a result like this.
ONLY 85 MALWARE FAMILIES ARE SUPPORTED
Download the analysis details if you want.
Download an .asm dataset with labels
You can download the Kaggle dataset, which contains 9 malware families.
Or you can download another PE malware dataset, convert the samples to LST files in IDA, and label them.
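When you generate your own listings, it helps to check that the opcodes can actually be read back from them. The sketch below is a heuristic, not Malpro's parser. It assumes each instruction line looks like `segment:address`, then uppercase hex byte pairs, then a lowercase mnemonic, as in the Kaggle .asm files. The regex and function name are illustrative only.

```python
import re

# Assumed line shape: segment:address, uppercase hex byte pairs, lowercase mnemonic.
LISTING_LINE = re.compile(
    r"^[^\s:]+:[0-9A-Fa-f]{8,16}\s+"  # e.g. ".text:00401000"
    r"(?:[0-9A-F]{2}\+?\s+)+"         # instruction bytes, e.g. "8D 44 24 08"
    r"([a-z][a-z0-9]*)\b"             # mnemonic, e.g. "lea"
)


def mnemonics_from_listing(lines):
    """Return the mnemonic of every line that matches the assumed shape."""
    ops = []
    for line in lines:
        match = LISTING_LINE.match(line)
        if match:
            ops.append(match.group(1))
    return ops


# Hypothetical listing lines in the assumed format.
sample = [
    ".text:00401000 56                push    esi",
    ".text:00401001 8D 44 24 08       lea     eax, [esp+8]",
    ".text:00401005                   ; a comment-only line",
    ".data:00403000 41 42 43 00       aAbc db 'ABC',0",
]
print(mnemonics_from_listing(sample))
```

Lines without instruction bytes, and data lines whose label contains uppercase letters or underscores, are skipped. Directives such as `align` would still slip through, so a real extractor would likely filter against a list of known opcodes.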
Create the /train folder and TrainLabels.csv
Copy your dataset to /train and your label file to TrainLabels.csv (the same label file format as in the Kaggle challenge).
TIP: if your training set is a subset of your label file, rename the label file to trainLabels_all.csv and use utils/convert.py to fix it.
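For reference, the kind of fix described in the tip can be sketched as follows. This is not the code in utils/convert.py. It only assumes the Kaggle label layout (an `Id` column holding the file name without extension, plus a `Class` column), and the function name is hypothetical.

```python
import csv
from pathlib import Path


def filter_labels(all_labels_csv, train_dir, out_csv):
    """Copy only the label rows whose Id has a matching file in train_dir."""
    # File names without their extension, e.g. "abc.asm" -> "abc".
    present = {p.stem for p in Path(train_dir).iterdir() if p.is_file()}
    kept = 0
    with open(all_labels_csv, newline="") as src, \
            open(out_csv, "w", newline="") as dst:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
        writer.writeheader()
        for row in reader:
            if row["Id"] in present:
                writer.writerow(row)
                kept += 1
    return kept


# Example call, assuming the paths are relative to the project root:
# filter_labels("trainLabels_all.csv", "train", "TrainLabels.csv")
```

The function returns the number of rows it kept. Comparing that with the number of files in the training folder is a quick way to spot samples that have no label.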
Run main.py to train the model
python main.py
It will show a menu like this
You can train your model from this script
XiaoBin Peng(handsongPeng)
Please send bug reports and feature requests through the GitHub issue tracker. Malpro is still under development and is open to any constructive suggestions.
The Malpro project is released under the Apache 2.0 license.