In this repository, we provide the artefacts of our paper "Lessons Learnt on Reproducibility in Machine Learning Based Android Malware Detection", which has been accepted to be published in Empirical Software Engineering (EMSE).

Drebin:

To extract the features, use `GetApkData.py` script:

This script generates ".data" files that represent the features extracted from the APK files.

INPUTs are: the path to the APK files directory and, optionally, the number of CPUs to use. They are explained in detail in the script. OUTPUTs are the ".data" file that has the list of the extracted features and the manifest file.

Example:

python GetApkData.py -d my_dataset

To embed the features and perform the classification, use `classification_drebin.py` script:

This script embeds the features extracted with `GetApkData.py` in a vector space and performs the classification using the LinearSVC SVM classifier. Recall and accuracy are computed over 10 experiments, each using 66% of the dataset for training and 33% for testing, and the scores are averaged. The ROC curve is plotted using the classifier trained in the last of the 10 experiments.

INPUTs are:

- The path to the .data malware files directory, 
- The path to the .data goodware files directory, 
- The name of the txt file on which the results will be written, 
- The name of the pdf file on which the roc curve will be saved, 
- and optionally, the seed to fix for the experiments.

OUTPUTs are:

- the results text file,
- and roc curve pdf graph.

Example:

python classification_drebin.py -md my_dataset/malware/ -gd my_dataset/goodware/ -fs file_scores -roc file_roc
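
The embedding step described above can be pictured with the sketch below. It is not the code of `classification_drebin.py`: it assumes each app is already available as a set of feature strings (the exact layout of the ".data" files is not covered here), and the helper names `build_vocabulary` and `embed`, as well as the feature strings, are hypothetical.

```python
def build_vocabulary(apps):
    """Map every distinct feature string to a column index (sorted for determinism)."""
    features = sorted({feature for app in apps for feature in app})
    return {feature: index for index, feature in enumerate(features)}


def embed(app, vocabulary):
    """Return a binary vector: 1 if the app exhibits the feature, 0 otherwise."""
    vector = [0] * len(vocabulary)
    for feature in app:
        index = vocabulary.get(feature)
        if index is not None:  # features missing from the vocabulary are ignored
            vector[index] = 1
    return vector


# Hypothetical feature sets for two apps.
apps = [
    {"permission::SEND_SMS", "api_call::getDeviceId"},
    {"permission::INTERNET", "permission::SEND_SMS"},
]
vocabulary = build_vocabulary(apps)
vectors = [embed(app, vocabulary) for app in apps]
print(vectors)
```

Each app becomes one row of a binary matrix, which is the kind of input a linear classifier such as LinearSVC expects.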

MaMaDroid:

To extract the features, use `mamadroid.py` script:

This script is used to generate the call graphs.

INPUTs are:

- The path to the APK files directory, 
- and the path to the Android platform directory

OUTPUTs are:

- 4 directories created inside the provided APK files directory. They are: 
    - "graphs" (it contains the call graphs), 
    - "family", 
    - "package",
    - "class" (these three directories contain the abstraction of the call graphs to family, package, and class modes, respectively).

Example:

python mamadroid.py -f my_dataset -d android/
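
To give an idea of what abstracting a call to family, package, or class mode means, here is a minimal sketch. It is not taken from `mamadroid.py`: the lists `KNOWN_PACKAGES` and `KNOWN_FAMILIES` are deliberately tiny and hypothetical, the function name `abstract_call` is invented for illustration, and the "self-defined" label for developer code is an assumption borrowed from the MaMaDroid terminology.

```python
# Hypothetical, deliberately tiny lists used only for illustration.
KNOWN_PACKAGES = ["android.telephony", "java.lang", "java.io"]
KNOWN_FAMILIES = ["android", "java"]


def abstract_call(qualified_class, mode):
    """Abstract a fully qualified class name to class, package, or family mode."""
    if mode == "class":
        return qualified_class
    if mode == "package":
        # Prefer the longest known package that prefixes the class name.
        for package in sorted(KNOWN_PACKAGES, key=len, reverse=True):
            if qualified_class.startswith(package + "."):
                return package
        return "self-defined"
    if mode == "family":
        family = qualified_class.split(".", 1)[0]
        return family if family in KNOWN_FAMILIES else "self-defined"
    raise ValueError(f"unknown mode: {mode}")


for mode in ("class", "package", "family"):
    print(mode, abstract_call("java.lang.Throwable", mode))
```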

To embed the features, use `MaMaStat.py` script:

This script creates the features files for family and package modes.

INPUTs are:

- The names of the call graphs datasets in this format (database1:database2:database3). You need to move the call graphs generated with the `mamadroid.py` script to the "graphs" directory that is in the same directory as the `mamadroid.py` and `MaMaStat.py` scripts,
- Flag to write intermediate files or not.

INPUTs are explained in more detail in the script.

OUTPUTs are:

- The features files (one per indicated database) that are created as "name_of_the_database".csv files in the folders Features/Families and Features/Packages

Example:

python MaMaStat.py -d Trial1 -wf N
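
MaMaDroid, as described by its original authors, models the abstracted calls as a Markov chain and uses the transition probabilities as features. The sketch below illustrates that idea only; it is not the code of `MaMaStat.py`, and the function name `transition_features`, the state list, and the edges are hypothetical.

```python
from collections import Counter


def transition_features(edges, states):
    """Flatten caller->callee edges between abstracted states into a list of
    transition probabilities, row by row over `states`."""
    counts = Counter(edges)
    outgoing = Counter(caller for caller, _ in edges)
    features = []
    for source in states:
        total = outgoing[source]
        for target in states:
            # A state with no outgoing call gets an all-zero row.
            features.append(counts[(source, target)] / total if total else 0.0)
    return features


states = ["android", "java", "self-defined"]
edges = [
    ("self-defined", "android"),
    ("self-defined", "java"),
    ("self-defined", "android"),
    ("android", "java"),
]
print(transition_features(edges, states))
```

With three states, each app yields a vector of nine values; one such vector per app would correspond to one row of a features CSV file.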

To perform the classification, use `classification_mamadroid.py` script:

This script performs MaMaDroid's classification using a Random Forest classifier. Scores (Precision, Recall, F1-score) are calculated using 10-fold cross-validation with and without PCA for family and package modes.

INPUTs are:

- The path to the CSV features files of family mode, for drebin, 2013, 2014, 2015, 2016, oldbenign, and newbenign datasets. These files are generated by `MaMaStat.py` script in Features/Families,
- The path to the CSV features files of package mode, for drebin, 2013, 2014, 2015, 2016, oldbenign, and newbenign datasets. These files are generated by `MaMaStat.py` script in Features/Packages,
- The name of the txt file on which the results will be written, 
- and optionally, the seed to fix for the experiments.

OUTPUTs are:

- The results text file.

Example:

python classification_mamadroid.py -pf Features/Families -pp Features/Packages -fs file_scores
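
The evaluation protocol mentioned above (10-fold cross-validation, scored with Precision, Recall, and F1-score) can be sketched with the standard library alone. This is an illustration under assumptions, not the script's implementation: `k_fold_indices` and `precision_recall_f1` are hypothetical helpers, and the actual script may rely on a machine learning library for both steps.

```python
import random


def k_fold_indices(n_samples, k=10, seed=None):
    """Yield (train, test) index lists for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall, and F1-score for the `positive` class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


for train, test in k_fold_indices(25, k=10, seed=388652140):
    print(len(train), len(test))
```

Every sample appears in exactly one test fold, and the scores of the 10 folds would then be averaged.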

RevealDroid:

Notes:

- To build RevealDroid, please follow the instructions in https://bitbucket.org/joshuaga/revealdroid/src/master/

To download the apps and extract the features, use `load_apk_and_extract_features.py` script:

Download the CSV file from AndroZoo (https://androzoo.uni.lu/lists) and put it in your home directory (~/). It is used when you download apps from AndroZoo using their md5.

INPUTs are:

- The path to the txt file containing the hashes,
- The name of your dataset: (e.g., my_dataset),
- A valid AndroZoo APIKEY

OUTPUTs are:

- The script downloads the apps from AndroZoo and extracts the features that are stored in:
    - data/apiusage/my_dataset
    - data/native_external_calls/my_dataset
    - android-reflection-analysis/data/my_dataset

For malware detection, use `run_cv_md.py` script:

INPUTs are:

- The names of your malware datasets, separated by spaces. Note that the features of these datasets should be located in /data/apiusage/malware{i}, /data/native_external_calls/malware{i}, and ../android-reflection-analysis/data/malware{i}
- The names of your goodware datasets, separated by spaces. Note that the features of these datasets should be located in /data/apiusage/goodware{i}, /data/native_external_calls/goodware{i}, and ../android-reflection-analysis/data/goodware{i}
- The name to be used to save the roc curve figure

OUTPUTs are:

- The script runs 10-fold cross-validation and prints average precision, average recall, and average F1-score for both malware and goodware
- It also generates the PR curve

For family detection, use `run_cv_family.py` script:

INPUTs are:

- The path to the file that contains hashes and their corresponding families separated by space. This file is located in dataset/revealdroid for both genome and all the malware datasets used in the experiments
- The names of your malware datasets to consider, separated by spaces. Note that the features of these datasets should be located in /data/apiusage/malware{i}, /data/native_external_calls/malware{i}, and ../android-reflection-analysis/data/malware{i}. Also, for genome, you should use the "drebin" name, as this collection contains the genome apps.
- The name to be used to save the roc curve figure

OUTPUTs are:

- The script runs 10-fold cross-validation and prints average accuracy

DroidCat:

After the publication of our paper, DroidCat's author contacted us and we provided him with more details on the dataset mismatches discussed in our paper. We note that our reproduction attempt of DroidCat was performed with the latest version of DroidCat artefacts publicly available at the time (Repo: https://bitbucket.org/haipeng_cai/droidcat/, latest commit: d108ace0ddb7c56c8f4ebce02801bfee2a3c5d24, March 31st, 2019). Hence, the dataset mismatches in DroidCat artefacts we described in our paper may be fixed when you read this message (i.e., after August 4th, 2021).

Notes:

- Make sure that the droidcat repo is in your home directory ~/
- Install the tools and dependencies listed in https://bitbucket.org/haipeng_cai/droidfax/src/master/portable/README
- Make sure to install the android sdk manager in your home directory in ~/.android
- You can also use our helper script install.sh, but make sure to change the path of your java and the JAVA_HOME environment variable.

To download the apps, use `download_apps_androzoo.sh` script:

INPUTs are:

- A text file that contains the sha256. Note that the lists of hashes can be found in dataset/droidcat,
- The name of your dataset: (e.g., my_dataset),
- A valid AndroZoo APIKEY

OUTPUTs are:

- The apps are saved in ~/droidcat/droidcat/testbed/inputs/my_dataset

To run static analysis on the apps, use `instrument_apps.sh` script:

- You will need to create a key using keytool, then adapt the script signandalign.sh with the details of your key
- Put the key you have created in ~/droidcat/droidcat/scripts
- You will also have to create a file "droidcat_keytool_password" where you store your password

INPUTs are:

- The name of your dataset. Note that the apps should be stored in: ~/droidcat/droidcat/testbed/inputs/my_dataset

OUTPUTs are:

- The instrumented apps are saved in ~/droidcat/droidcat/testbed/cg.instrumented/my_dataset

To generate the call traces, use `run_monkey.sh` script:

INPUTs are:

- The name of your dataset. Note that the instrumented apps should be stored in: ~/droidcat/droidcat/testbed/cg.instrumented/my_dataset

OUTPUTs are:

- The traces are saved in ~/droidcat/droidcat/testbed/monkey_results/my_dataset

To compute generalfeatures (structure), use `extract_generalReport.sh` script:

INPUTs are:

- The name of your dataset. Note that the instrumented apps should be stored in: ~/droidcat/droidcat/testbed/cg.instrumented/my_dataset and the traces should be located in ~/droidcat/droidcat/testbed/monkey_results/my_dataset

OUTPUTs are:

- GeneralFeatures are saved in ~/droidcat/droidcat/testbed/allGeneralReports/my_dataset/gfeatures.txt

To compute iccfeatures, use `extract_iccReport.sh` script:

INPUTs are:

- The name of your dataset. Note that the instrumented apps should be stored in: ~/droidcat/droidcat/testbed/cg.instrumented/my_dataset and the traces should be located in ~/droidcat/droidcat/testbed/monkey_results/my_dataset

OUTPUTs are:

- ICCFeatures are saved in ~/droidcat/droidcat/testbed/allICCReports/my_dataset/iccfeatures.txt

To compute securityfeatures, use `extract_securityReport.sh` script:

INPUTs are:

- The name of your dataset. Note that the instrumented apps should be stored in: ~/droidcat/droidcat/testbed/cg.instrumented/my_dataset and the traces should be located in ~/droidcat/droidcat/testbed/monkey_results/my_dataset

OUTPUTs are:

- SecurityFeatures are saved in ~/droidcat/droidcat/testbed/allSecurityReports/my_dataset/securityfeatures.txt

To arrange the features:

- For each dataset, put the 3 feature files in the same folder.
- Put the dataset directories (created in the previous step) in a directory named "features"
- Put your "features" directory in ~/droidcat/droidcat/ML
- Based on the lists of hashes in dataset/DroidCat directory, you will end up with the following datasets: malware-2017-more, newzoo2011, vs2013, vs2014, vs2015, vs2016, zoo2010, zoo2011, zoo2012, zoo2017, zoobenign2010, zoobenign2011, zoobenign2012, zoobenign2013, zoobenign2014, zoobenign2015, zoobenign2016, zoobenign2017

Example:

~/droidcat/droidcat/ML/features/zoobenign2014 should contain gfeatures.txt, iccfeatures.txt, and securityfeatures.txt of the zoobenign2014 dataset
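
Before running the classification, the layout described above can be checked with a short script. The sketch below is not part of the provided artefacts; the helper name `missing_feature_files` is hypothetical, and it only verifies that each dataset directory holds the three expected files.

```python
from pathlib import Path

REQUIRED_FILES = ("gfeatures.txt", "iccfeatures.txt", "securityfeatures.txt")


def missing_feature_files(features_dir):
    """Return {dataset_name: [missing file names]} for incomplete datasets."""
    missing = {}
    for dataset in sorted(Path(features_dir).iterdir()):
        if not dataset.is_dir():
            continue
        absent = [name for name in REQUIRED_FILES if not (dataset / name).is_file()]
        if absent:
            missing[dataset.name] = absent
    return missing
```

Calling `missing_feature_files` on the directory that holds the dataset folders returns an empty dictionary when every dataset is complete.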

To retrieve the first-seen dates of the apps, use `splitAllByFirstSeenDate.sh` script:

- The new features are stored in features_droidcat_byfirstseen

Note that for reproducible experiments, you can uncomment the corresponding lines in the following files: common.py, configs.py, family_detection.py, featureLoader_wdate.py, malware_detection.py, plot_roc.py

For malware detection, use `malware_detection.py` script:

- This script prints for each of the 4 datasets (D1617, D1415, D1213, D0911): accuracy, recall, precision, and F1-score

For family classification, use `family_detection.py` script:

- Download the CSV file from AndroZoo (https://androzoo.uni.lu/lists) and put it in your home directory (~/)
- Change the APIKEY variable in ~/droidcat/droidcat/ML/configs.py with your AndroZoo APIKEY
- This script prints for each of the 4 datasets (D1617, D1415, D1213, D0911): accuracy, recall, precision, and F1-score

To generate the roc curves, use `plot_roc.py` script:

INPUTs are:

- type: det for malware detection and fam for family detection
- file: The name of the output roc curve file

OUTPUTs are:

- The roc curve file

MalScan:

Notes:

- You need to have the following python3 libraries installed: networkx, androguard, numpy, and sklearn

To download the apps, use `download_apps_androzoo_malscan.sh` script:

INPUTs are:

- A text file that contains the sha256. Note that the lists of hashes can be found in dataset/malscan,
- The year of your dataset,
- The type of your apps; malware or goodware
- A valid AndroZoo APIKEY,

OUTPUTs are:

- The apps are saved in apps/year/type directory

To generate the call graphs, use `CallGraphExtraction.py` script:

INPUTs are:

- The path to your dir of APK files,
- The path to the output files

OUTPUTs are:

- The call graphs are saved in the path of output files you have provided

Example:

python3 CallGraphExtraction.py -f apps/2011/malware -o callgraphs/2011/malware

To extract the features, use `FeatureExtraction.py` script:

INPUTs are:

- The path to your dir of call graphs. Note that your directory should contain both malware and goodware folders with their call graphs,
- The path to the output file
- The type of centrality: degree, katz, closeness, or harmonic

OUTPUTs are:

- The csv file of the chosen centrality is saved in the output file you have provided

Example:

python3 FeatureExtraction.py -d callgraphs/2011 -o features/2011 -c degree

The script generates the file features/2011/degree.csv
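
As a reminder of what a centrality value measures, the sketch below computes degree centrality (the number of neighbours of a node divided by n - 1) on a small call graph. It is a simplified illustration, not the code of `FeatureExtraction.py`: the graph is treated as undirected and simple, the node names are hypothetical, and the actual script may compute centralities differently, for example through networkx.

```python
def degree_centrality(edges):
    """Degree centrality of each node: number of neighbours / (n - 1)."""
    neighbours = {}
    for caller, callee in edges:
        neighbours.setdefault(caller, set())
        neighbours.setdefault(callee, set())
        if caller != callee:  # self-loops are ignored in this simplified version
            neighbours[caller].add(callee)
            neighbours[callee].add(caller)
    n = len(neighbours)
    if n <= 1:
        return {node: 0.0 for node in neighbours}
    return {node: len(adjacent) / (n - 1) for node, adjacent in neighbours.items()}


# Hypothetical call graph given as caller -> callee edges.
call_edges = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "D")]
centrality = degree_centrality(call_edges)
print(centrality)
```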

To perform the classification, use `Classification.py` script:

INPUTs are:

- The path to your dir of csv files generated in the previous step,
- The path to the output file
- The type of centrality: degree, closeness, harmonic, katz, average, or concatenate. Note that for degree, closeness, harmonic, and katz, you must have only the csv file of the chosen centrality. As for average and concatenate, you must have the csv files of degree, closeness, harmonic, and katz centralities.

OUTPUTs are:

- A csv file that contains F1, Precision, Recall, Accuracy, TPR, FPR, TNR, and FNR for KNN-1, KNN-3, and Random Forest classifiers. The file is saved in the path of the output file you have provided

Example:

python3 Classification.py -d features/2011 -o results/2011 -t degree

The script generates the file results/2011/degree_result.csv
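
The four rates reported in the results file follow directly from the confusion matrix. The sketch below shows the usual definitions; it is not the code of `Classification.py`, and the helper name `confusion_rates` and the sample labels are hypothetical.

```python
def confusion_rates(y_true, y_pred, positive=1):
    """Return TPR, FPR, TNR, and FNR, with `positive` as the malware label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    positives, negatives = tp + fn, fp + tn
    return {
        "TPR": tp / positives if positives else 0.0,
        "FNR": fn / positives if positives else 0.0,
        "FPR": fp / negatives if negatives else 0.0,
        "TNR": tn / negatives if negatives else 0.0,
    }


rates = confusion_rates([1, 1, 1, 0, 0], [1, 1, 0, 1, 0])
print(rates)
```

TPR and FNR sum to 1, as do FPR and TNR, so two of the four columns are enough to recover the others.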

Datasets:

Drebin:

Malware:
- They are provided by original authors.
- Drebin_Malware_APK_Done.txt: List of hashes of APKs that passed the features extraction.
- Drebin_Malware_APK_Errors.txt: List of hashes of APKs that failed in the features extraction.
Goodware:
- (They were collected from AndroZoo)
- Drebin_Goodware_Original_All.txt: List of hashes of Drebin original APKs.
- Drebin_Goodware_Original_Found.txt: List of hashes of Drebin original APKs that are available in AndroZoo.
- Drebin_Goodware_Original_NotFound.txt: List of hashes of Drebin original APKs that are not available in AndroZoo.
- Drebin_Goodware_Original_Failed.txt: List of hashes of Drebin original APKs that failed in the features extraction.
- Drebin_Goodware_CompleteWith.txt: List of hashes of APKs that are used to complete the dataset.
- Drebin_Goodware_Orig+Completed.txt: List of hashes of original APKs that passed the features extraction and of the APKs that are used to complete the goodware dataset.

MaMaDroid:

Malware:
- 2013, 2014, 2015, and 2016 are collected from VirusShare. They are also available in AndroZoo
- drebin is provided by Drebin original authors
- 2013_hashes_found.txt: List of hashes of 2013 dataset APKs that we were able to collect.
- 2013_hashes_done.txt: List of hashes of 2013 dataset APKs that passed the features extraction.
- 2013_hashes_failed.txt: List of hashes of 2013 dataset APKs that failed in the features extraction.
- 2014_hashes_found.txt: List of hashes of 2014 dataset APKs that we were able to collect.
- 2014_hashes_done.txt: List of hashes of 2014 dataset APKs that passed the features extraction.
- 2014_hashes_failed.txt: List of hashes of 2014 dataset APKs that failed in the features extraction.
- 2015_hashes_found.txt: List of hashes of 2015 dataset APKs that we were able to collect.
- 2015_hashes_notFound.txt: List of hashes of 2015 dataset APKs that we were not able to collect.
- 2015_hashes_done.txt: List of hashes of 2015 dataset APKs that passed the features extraction.
- 2015_hashes_failed.txt: List of hashes of 2015 dataset APKs that failed in the features extraction.
- 2016_hashes_found.txt: List of hashes of 2016 dataset APKs that we were able to collect.
- 2016_hashes_done.txt: List of hashes of 2016 dataset APKs that passed the features extraction.
- 2016_hashes_failed.txt: List of hashes of 2016 dataset APKs that failed in the features extraction.
- drebin_hashes_found.txt: List of hashes of drebin dataset APKs that we were able to collect.
- drebin_hashes_done.txt: List of hashes of drebin dataset APKs that passed the features extraction.
- drebin_hashes_failed.txt: List of hashes of drebin dataset APKs that failed in the features extraction.
Goodware:
- oldbenign dataset is collected from PlayDrone, and it is available in AndroZoo
- newbenign dataset is collected from AndroZoo.
- oldbenign_hashes_found.txt: List of hashes of oldbenign dataset APKs that we were able to collect.
- oldbenign_hashes_done.txt: List of hashes of oldbenign dataset APKs that passed the features extraction.
- oldbenign_hashes_failed.txt: List of hashes of oldbenign dataset APKs that failed in the features extraction.
- oldbenign_namesApp_found.txt: List of apps names of oldbenign dataset APKs that we were able to collect.
- oldbenign_namesApp_done.txt: List of apps names (as provided by PlayDrone) of oldbenign dataset APKs that passed the features extraction.
- oldbenign_namesApp_failed.txt: List of apps names of oldbenign dataset APKs that failed in the features extraction.
- newbenign_hashes_found.txt: List of hashes of newbenign dataset APKs that we were able to collect.
- newbenign_hashes_notFound.txt: List of hashes of newbenign dataset APKs that we were not able to collect.
- newbenign_hashes_done.txt: List of hashes of newbenign dataset APKs that passed the features extraction.
- newbenign_hashes_failed.txt: List of hashes of newbenign dataset APKs that failed in the features extraction.
- newbenign_UsedToComplete.txt: List of hashes of APKs that are used to complete the newbenign original dataset.

RevealDroid:

Malware:
- drebin_sha.txt: List of drebin apps
- drebin_sha_intersection_ok_all_features.txt: List of drebin apps after features extraction
- remaining_sha_found_all.txt: List of VirusTotal apps
- remain_sha_intersection_ok_all_features.txt: List of VirusTotal apps after features extraction
- virusshare_md5.txt: List of VirusShare apps
- virusshare_sha_md5_all_intersection_ok_all_features.txt: List of VirusShare apps after features extraction
Goodware:
- benign_androzoo.txt: List of benign apps
- benign_androzoo_intersection_ok_all_features.txt: List of benign apps after features extraction
Family labels:
- all_labels_malware.txt: List of family labels for drebin, virusshare, and virustotal apps
- genome_sha256_labels.txt: List of family labels for genome apps

DroidCat:

All these apps can be downloaded from AndroZoo

Malware:
- apks.malware-2017-more
- apks.zoo2017
- apks.vs2016
- apks.vs2015
- apks.vs2014
- apks.vs2013
- apks.zoo2012
- apks.newzoo2011
- apks.zoo2011
- apks.zoo2010
Goodware:
- sha256.benign2017
- apks.zoobenign2016
- apks.zoobenign2015
- apks.zoobenign2014
- apks.zoobenign2013
- apks.zoobenign2012
- apks.zoobenign2011
- apks.zoobenign2010

MalScan:

All these apps can be downloaded from AndroZoo

Malware:
- 2018_malware.txt
- 2017_malware.txt
- 2016_malware.txt
- 2015_malware.txt
- 2014_malware.txt
- 2013_malware.txt
- 2012_malware.txt
- 2011_malware.txt
Goodware:
- 2018_benign.txt
- 2017_benign.txt
- 2016_benign.txt
- 2015_benign.txt
- 2014_benign.txt
- 2013_benign.txt
- 2012_benign.txt
- 2011_benign.txt

Original Code:

Repositories:

Seed Values:

To try to achieve replicable results, the results presented in our paper were computed after setting the seed values to:

MaMaDroid:
- 388652140
Drebin:
- 388652140
RevealDroid:
- 123456789
DroidCat:
- 480509637
MalScan:
- 480509637
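
Fixing a seed makes every random step, such as shuffling a dataset before splitting it, repeat identically from one run to the next. The sketch below illustrates this with Python's standard `random` module only; the helper `seeded_shuffle` is hypothetical, and the scripts of each approach may seed other libraries instead.

```python
import random

# Seed values listed above, keyed by approach.
SEEDS = {
    "MaMaDroid": 388652140,
    "Drebin": 388652140,
    "RevealDroid": 123456789,
    "DroidCat": 480509637,
    "MalScan": 480509637,
}


def seeded_shuffle(items, seed):
    """Return a shuffled copy of `items` that depends only on `seed`."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    return shuffled


apps = [f"app_{i}" for i in range(10)]
first = seeded_shuffle(apps, SEEDS["Drebin"])
second = seeded_shuffle(apps, SEEDS["Drebin"])
print(first == second)
```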
