Windows malware detection based on dynamic behavior (API call similarity) using a Multilayer Perceptron (MLP)
This project trains an MLP prediction model to detect Windows malware based on the similarity of API call counts.
We collected about 12,000 Windows executable files from different public sources by web scraping and by combining various datasets. About 79% of them are malware.
We then set up Cuckoo Sandbox locally to collect a dynamic behavior report for each file. After organizing the reports, we selected the count of API calls as the feature vector fed to the network.
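As a minimal sketch of how per-API call counts might be read from a single report: this assumes the report is a Cuckoo `report.json` whose `behavior.apistats` section maps each process ID to a dictionary of API names and call counts. That layout is an assumption about the Cuckoo version used here, and the helper names and sample data below are hypothetical.

```python
import json
from collections import Counter


def count_api_calls(report):
    """Sum API call counts over all processes in one parsed report.

    Assumes report["behavior"]["apistats"] has the shape
    {pid: {api_name: call_count}}; missing sections yield no counts.
    """
    totals = Counter()
    apistats = report.get("behavior", {}).get("apistats", {})
    for per_process in apistats.values():
        totals.update(per_process)  # adds the counts key by key
    return totals


def to_feature_vector(counts, api_names):
    """Order the counts by a fixed API list; APIs never called get 0."""
    return [counts.get(name, 0) for name in api_names]


# Hypothetical report fragment, standing in for json.load(open(path)).
sample_report = json.loads(
    '{"behavior": {"apistats": {'
    '"1204": {"NtCreateFile": 3, "RegOpenKeyExW": 1},'
    '"2210": {"NtCreateFile": 2}}}}'
)
sample_counts = count_api_calls(sample_report)
sample_vector = to_feature_vector(
    sample_counts, ["NtCreateFile", "NtWriteFile", "RegOpenKeyExW"]
)
```

In the real dataset the fixed API list would be the 311 APIs observed across all reports, so every file maps to a vector of the same length.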
Based on this decision, we created a CSV dataset and an SQLite database. The database has 3 tables:
- "APIs": list of the APIs seen across all of our reports (311 APIs, which is the length of the feature vector).
- "Reports": list of reports with their MD5 hash and VirusTotal rank (the "positive" column).
- "APIs_Reports": a many-to-many relationship between the two tables above, plus a "repetition" column that gives, for a given report and a given API, how many calls occurred.
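The three tables could look roughly like the sketch below. Only the table names and the `md5`, `positive`, and `repetition` columns come from the description above; the other column names (`id`, `name`, `api_id`, `report_id`), the constraints, and the sample rows are assumptions made for illustration.

```python
import sqlite3

# Assumed schema: only the table names and the md5, positive, and
# repetition columns are taken from the project description.
SCHEMA = """
CREATE TABLE APIs (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL UNIQUE
);
CREATE TABLE Reports (
    id       INTEGER PRIMARY KEY,
    md5      TEXT NOT NULL UNIQUE,
    positive INTEGER NOT NULL
);
CREATE TABLE APIs_Reports (
    api_id     INTEGER NOT NULL REFERENCES APIs(id),
    report_id  INTEGER NOT NULL REFERENCES Reports(id),
    repetition INTEGER NOT NULL,
    PRIMARY KEY (api_id, report_id)
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)

# Made-up rows: two APIs, one report, one recorded call count.
conn.executemany(
    "INSERT INTO APIs (id, name) VALUES (?, ?)",
    [(1, "NtCreateFile"), (2, "NtWriteFile")],
)
conn.execute(
    "INSERT INTO Reports (id, md5, positive) VALUES (?, ?, ?)",
    (1, "0" * 32, 12),
)
conn.execute(
    "INSERT INTO APIs_Reports (api_id, report_id, repetition) VALUES (?, ?, ?)",
    (1, 1, 5),
)


def feature_vector(conn, report_id):
    """Return one count per API, ordered by API id; 0 where no row exists."""
    rows = conn.execute(
        """
        SELECT COALESCE(ar.repetition, 0)
        FROM APIs AS a
        LEFT JOIN APIs_Reports AS ar
               ON ar.api_id = a.id AND ar.report_id = ?
        ORDER BY a.id
        """,
        (report_id,),
    ).fetchall()
    return [count for (count,) in rows]


example_vector = feature_vector(conn, 1)
```

The `LEFT JOIN` keeps every API in the result, so APIs that a report never called still occupy a slot in the vector with a count of 0.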
The CSV dataset was created from the above database. The "OUTPUT" column is the label, which indicates whether the given file is malware. Files with a VirusTotal rank of 10 or greater are labeled 1, and files with a VirusTotal rank of 0 are labeled 0.
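The labeling rule can be sketched as follows. The description does not say how files with a rank between 1 and 9 were handled, so this sketch returns `None` for them and skips those rows; that choice, along with the function names and sample rows, is an assumption.

```python
import csv
import io


def label_from_rank(positive):
    """Map a VirusTotal rank to the OUTPUT label.

    Returns 1 for rank >= 10, 0 for rank == 0, and None for ranks
    1..9, which the stated rule does not cover.
    """
    if positive >= 10:
        return 1
    if positive == 0:
        return 0
    return None


def write_dataset(rows, api_names, out):
    """Write (counts, rank) pairs as CSV rows; return how many were written."""
    writer = csv.writer(out)
    writer.writerow(list(api_names) + ["OUTPUT"])
    written = 0
    for counts, positive in rows:
        label = label_from_rank(positive)
        if label is None:
            continue  # assumption: rows without a label are left out
        writer.writerow(list(counts) + [label])
        written += 1
    return written


# Made-up rows: (API call counts, VirusTotal rank).
buffer = io.StringIO()
n_written = write_dataset(
    [([5, 0], 12), ([1, 2], 0), ([3, 3], 4)],
    ["NtCreateFile", "NtWriteFile"],
    buffer,
)
csv_lines = buffer.getvalue().splitlines()
```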
The model was evaluated with a confusion matrix on unseen test data (20% of the dataset).
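For reference, a hold-out split and a binary confusion matrix can be computed as in the sketch below. The labels and predictions are made up to show the layout and are not this project's results; the function names, the seed, and the shuffle-based split are assumptions.

```python
import random


def split_train_test(samples, test_fraction=0.2, seed=0):
    """Shuffle a copy of the samples and hold out a fraction for testing."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def confusion_matrix(y_true, y_pred):
    """Return [[TN, FP], [FN, TP]] for binary labels 0 and 1."""
    matrix = [[0, 0], [0, 0]]
    for truth, predicted in zip(y_true, y_pred):
        matrix[truth][predicted] += 1  # row = true label, column = prediction
    return matrix


# Made-up labels and predictions, only to illustrate the layout.
y_true = [1, 1, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1]
example_matrix = confusion_matrix(y_true, y_pred)

train, test = split_train_test(range(10))
```

Because about 79% of the files are malware, reading the false positive and false negative cells separately is more informative than accuracy alone.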
- Mohsen Ebadpour
Bachelor of Science in Computer Engineering from the University of Mohaghegh Ardabili (UMA)
This project was part of my BSc final project.
For the reports that Cuckoo Sandbox generated for the executable files (78~93 GB), please contact mohsenebadpour@outlook.com