Trustworthy-Software/Reproduction-of-Android-Malware-detection-approaches

In this repository, we provide the artefacts of our paper "Lessons Learnt on Reproducibility in Machine Learning Based Android Malware Detection", which has been accepted to be published in Empirical Software Engineering (EMSE).

Drebin:

To extract the features, use GetApkData.py script:

This script generates ".data" files that represent the features extracted from the APK files.

INPUTs are: the path to the APK files directory and, optionally, the number of CPUs to use. They are explained in detail in the script. OUTPUTs are the ".data" file, which contains the list of the extracted features, and the manifest file.

Example:

python GetApkData.py -d my_dataset

To embed the features and perform the classification, use classification_drebin.py script:

This script embeds the features extracted with GetApkData.py in a vector space and performs the classification using the LinearSVC SVM classifier. Recall and accuracy are computed over 10 experiments, each using 66% of the data as the training set and 33% as the test set, and the scores are averaged. The ROC curve is plotted using the last classifier trained in the 10 experiments.

INPUTs are:

- The path to the .data malware files directory, 
- The path to the .data goodware files directory, 
- The name of the txt file on which the results will be written, 
- The name of the pdf file on which the roc curve will be saved, 
- and optionally, The seed to fix for the experiments. 

OUTPUTs are:

- the results text file,
- and roc curve pdf graph.

Example:

python classification_drebin.py -md my_dataset/malware/ -gd my_dataset/goodware/ -fs file_scores -roc file_roc
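
The evaluation protocol described above (10 random splits, roughly 66% training and 33% test, scores averaged) can be sketched as follows. This is a minimal standard-library illustration, not the code of classification_drebin.py: `repeated_holdout`, `threshold_classifier`, and the synthetic data are hypothetical names introduced here, and the toy classifier only stands in for LinearSVC.

```python
import random
from statistics import mean


def repeated_holdout(samples, labels, fit_predict, runs=10, test_fraction=0.33, seed=None):
    """Average accuracy and recall over `runs` random train/test splits."""
    rng = random.Random(seed)  # a fixed seed makes the splits repeatable
    indices = list(range(len(samples)))
    n_test = round(len(indices) * test_fraction)
    accuracies, recalls = [], []
    for _ in range(runs):
        rng.shuffle(indices)
        test_idx, train_idx = indices[:n_test], indices[n_test:]
        predictions = fit_predict(
            [samples[i] for i in train_idx],
            [labels[i] for i in train_idx],
            [samples[i] for i in test_idx],
        )
        truth = [labels[i] for i in test_idx]
        correct = sum(p == t for p, t in zip(predictions, truth))
        true_pos = sum(p == 1 and t == 1 for p, t in zip(predictions, truth))
        positives = sum(t == 1 for t in truth)
        accuracies.append(correct / len(truth))
        recalls.append(true_pos / positives if positives else 0.0)
    return mean(accuracies), mean(recalls)


def threshold_classifier(train_x, train_y, test_x):
    """Toy stand-in for a real classifier: cut at the midpoint of the class means."""
    pos = [x for x, y in zip(train_x, train_y) if y == 1]
    neg = [x for x, y in zip(train_x, train_y) if y == 0]
    cut = (mean(pos) + mean(neg)) / 2
    return [1 if x > cut else 0 for x in test_x]


# Synthetic one-dimensional data: label 1 plays the role of malware.
samples = [float(i) for i in range(30)] + [float(i) for i in range(100, 130)]
labels = [0] * 30 + [1] * 30
accuracy, recall = repeated_holdout(samples, labels, threshold_classifier, seed=388652140)
print(accuracy, recall)
```

Passing the classifier as a callback keeps the split-and-average logic independent of the model, so a real SVM could be substituted without touching the protocol.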

MaMaDroid:

To extract the features, use mamadroid.py script:

This script is used to generate the call graphs.

INPUTs are:

- The path to the APK files directory, 
- and the path to the Android platform directory

OUTPUTs are:

- 4 directories created inside the provided APK files directory:
    - "graphs" (contains the call graphs),
    - "family", "package", and "class" (contain the abstraction of the call graphs to family, package, and class modes respectively).

Example:

python mamadroid.py -f my_dataset -d android/

To embed the features, use MaMaStat.py script:

This script creates the features files for family and package modes.

INPUTs are:

- The names of the call graphs datasets in this format (database1:database2:database3).
          You need to move the call graphs generated with `mamadroid.py` script to the "graphs"
          directory that is in the same directory as `mamadroid.py` and `MaMaStat.py` scripts,
- Flag to write intermediate files or not.

INPUTs are explained in more detail in the script.

OUTPUTs are:

- The features files (one per indicated database) that are created as "name_of_the_database".csv files in the folders Features/Families and Features/Packages

Example:

python MaMaStat.py -d Trial1 -wf N
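
As an illustration of the naming convention above, the sketch below splits a `database1:database2:database3` argument and lists where the corresponding feature files are expected. The helper `expected_feature_files` is hypothetical and is not part of MaMaStat.py; only the colon-separated format and the Features/Families and Features/Packages folders come from this README.

```python
from pathlib import Path


def expected_feature_files(databases, root="Features"):
    """Map each database name to its (family, package) CSV paths."""
    names = [name for name in databases.split(":") if name]
    return {
        name: (
            Path(root) / "Families" / f"{name}.csv",
            Path(root) / "Packages" / f"{name}.csv",
        )
        for name in names
    }


for name, (family_csv, package_csv) in expected_feature_files("drebin:2013:oldbenign").items():
    print(name, family_csv, package_csv)
```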

To perform the classification, use classification_mamadroid.py script:

This script performs MaMaDroid's classification using a Random Forest classifier. Scores (Precision, Recall, F1-score) are calculated using 10-fold cross-validation, with and without PCA, for family and package modes.

INPUTs are:

- The path to the CSV features files of family mode, for drebin, 2013, 2014, 2015, 2016, 
         oldbenign, and newbenign datasets. These files are generated by `MaMaStat.py` script 
         in Features/Families,
- The path to the CSV features files of package mode, for drebin, 2013, 2014, 2015, 2016, 
         oldbenign, and newbenign datasets. These files are generated by `MaMaStat.py` script 
         in Features/Packages.
- The name of the txt file on which the results will be written, 
- and optionally, the seed to fix for the experiments. 

OUTPUTs are:

- The results text file.

Example:

python classification_mamadroid.py -pf Features/Families -pp Features/Packages -fs file_scores
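
For readers unfamiliar with 10-fold cross-validation, the following standard-library sketch shows how the folds are formed and how a per-fold score is averaged. It is an assumption-laden illustration: `k_fold_indices` and `cross_validate` are hypothetical helpers, and the actual script may split the data differently (for example with stratification).

```python
import random
from statistics import mean


def k_fold_indices(n_samples, k=10, seed=None):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of near-equal size.
    folds = [indices[i::k] for i in range(k)]
    for held_out, test_idx in enumerate(folds):
        train_idx = [idx for f, fold in enumerate(folds) if f != held_out for idx in fold]
        yield train_idx, test_idx


def cross_validate(n_samples, score_fold, k=10, seed=None):
    """Average the value returned by score_fold(train_idx, test_idx) over all folds."""
    return mean([score_fold(train, test) for train, test in k_fold_indices(n_samples, k, seed)])


fold_sizes = [len(test) for _, test in k_fold_indices(25, k=10, seed=388652140)]
print(fold_sizes)
```

Each sample is used for testing exactly once, and the classifier is retrained on the remaining nine folds every time.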

RevealDroid:

Notes:

  • To build RevealDroid, follow the instructions at https://bitbucket.org/joshuaga/revealdroid/src/master/

To download the apps and extract the features, use load_apk_and_extract_features.py script:

  • Download the CSV file from AndroZoo (https://androzoo.uni.lu/lists) and put it in your home directory (~/). It is used when you download apps from AndroZoo using their md5.

INPUTs are:

- The path to the txt file containing the hashes,
- The name of your dataset: (e.g., my_dataset),
- A valid AndroZoo APIKEY

OUTPUTs are:

- The script downloads the apps from AndroZoo and extracts the features that are stored in:
    - data/apiusage/my_dataset
    - data/native_external_calls/my_dataset
    - android-reflection-analysis/data/my_dataset
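
Before moving on to classification, it can help to confirm that all three feature directories listed above exist for a dataset. The check below is a hypothetical convenience sketch, not part of the repository; it only assumes the three relative paths given in this section and demonstrates itself on a temporary directory.

```python
import tempfile
from pathlib import Path

# Relative locations taken from the OUTPUTs list above.
FEATURE_ROOTS = (
    "data/apiusage",
    "data/native_external_calls",
    "android-reflection-analysis/data",
)


def missing_feature_dirs(dataset, base="."):
    """Return the expected feature directories that do not exist under `base`."""
    expected = [Path(base) / root / dataset for root in FEATURE_ROOTS]
    return [path for path in expected if not path.is_dir()]


# Demonstration: create only the first two directories, so one is reported missing.
with tempfile.TemporaryDirectory() as base:
    for root in FEATURE_ROOTS[:2]:
        (Path(base) / root / "my_dataset").mkdir(parents=True)
    missing = [p.relative_to(base).as_posix() for p in missing_feature_dirs("my_dataset", base)]
print(missing)
```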

For malware detection, use run_cv_md.py script:

INPUTs are:

- The name of your malware datasets separated by space. Note that the features of these datasets should be located in /data/apiusage/malware{i} /data/native_external_calls/malware{i} and ../android-reflection-analysis/data/malware{i}
- The name of your goodware datasets separated by space. Note that the features of these datasets should be located in /data/apiusage/goodware{i} /data/native_external_calls/goodware{i} and ../android-reflection-analysis/data/goodware{i}
- The name to be used to save the roc curve figure

OUTPUTs are:

- The script runs 10-fold cross-validation and prints average precision, average recall, and average F1-score for both malware and goodware
- It also generates the PR curve

For family detection, use run_cv_family.py script:

INPUTs are:

- The path to the file that contains hashes and their corresponding families separated by space. This file is located in dataset/revealdroid for both genome and all the malware datasets used in the experiments
- The name of your malware datasets to consider. They should be separated by space. Note that the features of these datasets should be located in /data/apiusage/malware{i} /data/native_external_calls/malware{i} and ../android-reflection-analysis/data/malware{i}. Also, for genome, you should use the "drebin" name, as this collection contains the genome apps.
- The name to be used to save the roc curve figure

OUTPUTs are:

- The script runs 10-fold cross-validation and prints average accuracy
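
The family labels file described above holds one hash and its family per line, separated by a space. A minimal parsing sketch, assuming exactly that layout, is shown below; `parse_family_labels` is a hypothetical helper and the hashes in the example are shortened placeholders, not entries from the real files.

```python
from collections import Counter


def parse_family_labels(lines):
    """Build a {hash: family} mapping from 'hash family' lines, skipping blanks."""
    labels = {}
    for line in lines:
        parts = line.split(maxsplit=1)
        if len(parts) == 2:
            labels[parts[0]] = parts[1].strip()
    return labels


example_lines = [
    "aaa111 FakeInstaller",
    "bbb222 DroidKungFu",
    "",
    "ccc333 FakeInstaller",
]
labels = parse_family_labels(example_lines)
family_sizes = Counter(labels.values())
print(family_sizes.most_common())
```

Counting samples per family this way is a quick sanity check before cross-validation, since very small families may not appear in every fold.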

DroidCat:

After the publication of our paper, DroidCat's author contacted us and we provided him with more details on the dataset mismatches discussed in our paper. We note that our reproduction attempt of DroidCat was performed with the latest version of DroidCat artefacts publicly available at the time (Repo: https://bitbucket.org/haipeng_cai/droidcat/, latest commit: d108ace0ddb7c56c8f4ebce02801bfee2a3c5d24, March 31st, 2019). Hence, the dataset mismatches in DroidCat artefacts that we described in our paper may have been fixed by the time you read this message (i.e., after August 4th, 2021).

Notes:

  • Make sure that droidcat repo is in your home directory ~/
  • Install the tools and dependencies listed in https://bitbucket.org/haipeng_cai/droidfax/src/master/portable/README
  • Make sure to install the android sdk manager in your home directory in ~/.android
  • You can also use our helper script install.sh, but make sure to change the path of your java and the JAVA_HOME environment variable.

To download the apps, use download_apps_androzoo.sh script:

INPUTs are:

- A text file that contains the sha256. Note that the lists of hashes can be found in dataset/droidcat,
- The name of your dataset: (e.g., my_dataset),
- A valid AndroZoo APIKEY

OUTPUTs are:

- The apps are saved in ~/droidcat/droidcat/testbed/inputs/my_dataset
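
For orientation, the sketch below shows how download URLs could be built from a list of sha256 hashes and an API key. It is not the content of download_apps_androzoo.sh: the endpoint and the `apikey`/`sha256` parameter names reflect AndroZoo's public API as understood here and should be treated as assumptions to verify against the AndroZoo documentation. `read_hashes`, `download_url`, and `MY_APIKEY` are hypothetical names, and no network request is made.

```python
import re
from urllib.parse import urlencode

# Assumed AndroZoo download endpoint; verify against the official documentation.
ANDROZOO_DOWNLOAD = "https://androzoo.uni.lu/api/download"
SHA256_PATTERN = re.compile(r"[0-9a-fA-F]{64}")


def read_hashes(lines):
    """Keep only the lines that look like a sha256 digest (64 hex characters)."""
    hashes = []
    for line in lines:
        candidate = line.strip()
        if SHA256_PATTERN.fullmatch(candidate):
            hashes.append(candidate)
    return hashes


def download_url(sha256, apikey):
    """Build the download URL for one app."""
    return ANDROZOO_DOWNLOAD + "?" + urlencode({"apikey": apikey, "sha256": sha256})


placeholder_hash = "0" * 64  # placeholder, not a real app hash
hashes = read_hashes([placeholder_hash + "\n", "not-a-hash\n", "\n"])
urls = [download_url(h, "MY_APIKEY") for h in hashes]
print(urls)
```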

To run static analysis on the apps, use instrument_apps.sh script:

  • You will need to create a key using keytool, then adapt the script signandalign.sh with the details of your key
  • Put the key you have created in ~/droidcat/droidcat/scripts
  • You will also have to create a file "droidcat_keytool_password" where you store your password

INPUTs are:

- The name of your dataset. Note that the apps should be stored in: ~/droidcat/droidcat/testbed/inputs/my_dataset

OUTPUTs are:

- The instrumented apps are saved in ~/droidcat/droidcat/testbed/cg.instrumented/my_dataset

To generate the call traces, use run_monkey.sh script:

INPUTs are:

- The name of your dataset. Note that the instrumented apps should be stored in: ~/droidcat/droidcat/testbed/cg.instrumented/my_dataset

OUTPUTs are:

- The traces are saved in ~/droidcat/droidcat/testbed/monkey_results/my_dataset

To compute generalfeatures (structure), use extract_generalReport.sh script:

INPUTs are:

- The name of your dataset. Note that the instrumented apps should be stored in: ~/droidcat/droidcat/testbed/cg.instrumented/my_dataset and the traces should be located in ~/droidcat/droidcat/testbed/monkey_results/my_dataset

OUTPUTs are:

- GeneralFeatures are saved in ~/droidcat/droidcat/testbed/allGeneralReports/my_dataset/gfeatures.txt

To compute iccfeatures, use extract_iccReport.sh script:

INPUTs are:

- The name of your dataset. Note that the instrumented apps should be stored in: ~/droidcat/droidcat/testbed/cg.instrumented/my_dataset and the traces should be located in ~/droidcat/droidcat/testbed/monkey_results/my_dataset

OUTPUTs are:

- ICCFeatures are saved in ~/droidcat/droidcat/testbed/allICCReports/my_dataset/iccfeatures.txt

To compute securityfeatures, use extract_securityReport.sh script:

INPUTs are:

- The name of your dataset. Note that the instrumented apps should be stored in: ~/droidcat/droidcat/testbed/cg.instrumented/my_dataset and the traces should be located in ~/droidcat/droidcat/testbed/monkey_results/my_dataset

OUTPUTs are:

- SecurityFeatures are saved in ~/droidcat/droidcat/testbed/allSecurityReports/my_dataset/securityfeatures.txt

To arrange the features:

  • For each dataset, put the 3 feature files in the same folder.
  • Put the datasets directories (created in the previous step) in a directory named "features"
  • Put your "features" directory in ~/droidcat/droidcat/ML
  • Based on the lists of hashes in dataset/DroidCat directory, you will end up with the following datasets: malware-2017-more, newzoo2011, vs2013, vs2014, vs2015, vs2016, zoo2010, zoo2011, zoo2012, zoo2017, zoobenign2010, zoobenign2011, zoobenign2012, zoobenign2013, zoobenign2014, zoobenign2015, zoobenign2016, zoobenign2017

Example:

~/droidcat/droidcat/ML/zoobenign2014 should contain gfeatures.txt, iccfeatures.txt, and securityfeatures.txt of zoobenign2014 dataset
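
The manual arrangement steps above can be sketched as a small copy routine. This is a hypothetical helper rather than a script shipped with the repository: it only assumes the three report locations given in the previous sections (allGeneralReports, allICCReports, allSecurityReports) and takes the testbed and target "features" directories as parameters. The demonstration runs on a temporary directory with placeholder files.

```python
import shutil
import tempfile
from pathlib import Path

# Report directory -> feature file name, as listed in the OUTPUTs above.
REPORT_FILES = {
    "allGeneralReports": "gfeatures.txt",
    "allICCReports": "iccfeatures.txt",
    "allSecurityReports": "securityfeatures.txt",
}


def arrange_features(testbed, features_dir, dataset):
    """Copy the three feature files of `dataset` into features_dir/dataset."""
    target = Path(features_dir) / dataset
    target.mkdir(parents=True, exist_ok=True)
    for report_dir, filename in REPORT_FILES.items():
        shutil.copy2(Path(testbed) / report_dir / dataset / filename, target / filename)
    return sorted(path.name for path in target.iterdir())


# Demonstration with placeholder files in a throwaway directory.
with tempfile.TemporaryDirectory() as root:
    testbed, features = Path(root) / "testbed", Path(root) / "features"
    for report_dir, filename in REPORT_FILES.items():
        source = testbed / report_dir / "zoobenign2014" / filename
        source.parent.mkdir(parents=True)
        source.write_text("placeholder\n")
    arranged = arrange_features(testbed, features, "zoobenign2014")
print(arranged)
```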

To retrieve the first-seen dates of the apps, use splitAllByFirstSeenDate.sh script:

  • The new features are stored in features_droidcat_byfirstseen

Note that for reproducible experiments, you can uncomment the corresponding lines in the following files: common.py, configs.py, family_detection.py, featureLoader_wdate.py, malware_detection.py, plot_roc.py

For malware detection, use malware_detection.py script:

  • This script prints, for each of the 4 datasets (D1617, D1415, D1213, D0911): accuracy, recall, precision, and F1-score

For family classification, use family_detection.py script:

  • Download the CSV file from AndroZoo (https://androzoo.uni.lu/lists) and put it in your home directory (~/)
  • Change the APIKEY variable in ~/droidcat/droidcat/ML/configs.py to your AndroZoo APIKEY
  • This script prints, for each of the 4 datasets (D1617, D1415, D1213, D0911): accuracy, recall, precision, and F1-score

To generate the roc curves, use plot_roc.py script:

INPUTs are:

- type: det for malware detection and fam for family detection
- file: The name of the output roc curve file

OUTPUTs are:

- The roc curve file

MalScan:

Notes:

  • You need to have the following python3 libraries installed: networkx, androguard, numpy, and sklearn

To download the apps, use download_apps_androzoo_malscan.sh script:

INPUTs are:

- A text file that contains the sha256. Note that the lists of hashes can be found in dataset/malscan,
- The year of your dataset,
- The type of your apps; malware or goodware
- A valid AndroZoo APIKEY,

OUTPUTs are:

- The apps are saved in apps/year/type directory

To generate the call graphs, use CallGraphExtraction.py script:

INPUTs are:

- The path to your dir of APK files,
- The path to the output files

OUTPUTs are:

- The call graphs are saved in the path of output files you have provided

Example:

python3 CallGraphExtraction.py -f apps/2011/malware -o callgraphs/2011/malware

To extract the features, use FeatureExtraction.py script:

INPUTs are:

- The path to your dir of call graphs. Note that your directory should contain both malware and goodware folders with their call graphs,
- The path to the output file
- The type of centrality: degree, katz, closeness, or harmonic

OUTPUTs are:

- The csv file of the chosen centrality is saved in the output file you have provided

Example:

python3 FeatureExtraction.py -d callgraphs/2011 -o features/2011 -c degree

The script generates the file features/2011/degree.csv
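
To make the "degree" option more concrete, the sketch below computes degree centrality on a tiny directed call graph using only the standard library. It is an illustration, not the code of FeatureExtraction.py: the normalisation (in-degree plus out-degree, divided by n - 1) is the one networkx documents for `degree_centrality`, while `degree_centrality` as written here and the method names in the example graph are hypothetical.

```python
from collections import defaultdict


def degree_centrality(edges):
    """Degree centrality of each node in a directed graph given as (caller, callee) pairs."""
    degree = defaultdict(int)
    for caller, callee in set(edges):  # ignore duplicate edges
        degree[caller] += 1  # outgoing call
        degree[callee] += 1  # incoming call
    n = len(degree)
    if n <= 1:
        return {node: 1.0 for node in degree}
    return {node: count / (n - 1) for node, count in degree.items()}


call_graph = [
    ("main", "helper"),
    ("main", "sendTextMessage"),
    ("helper", "sendTextMessage"),
    ("main", "log"),
]
centrality = degree_centrality(call_graph)
print(centrality)
```

Nodes that take part in no call do not appear in an edge list, so they are absent from the result; a graph library would report them with centrality 0.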

To perform the classification, use Classification.py script:

INPUTs are:

- The path to your dir of csv files generated in the previous step,
- The path to the output file
- The type of centrality: degree, closeness, harmonic, katz, average, or concatenate. Note that for degree, closeness, harmonic, and katz, you must have only the csv file of the chosen centrality. As for average and concatenate, you must have the csv files of degree, closeness, harmonic, and katz centralities.

OUTPUTs are:

- A csv file that contains F1, Precision, Recall, Accuracy, TPR, FPR, TNR, and FNR for the KNN-1, KNN-3, and Random Forest classifiers. The file is saved in the output path you have provided

Example:

python3 Classification.py -d features/2011 -o results/2011 -t degree

The script generates the file results/2011/degree_result.csv
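
The eight scores reported in the results file all derive from the four cells of a binary confusion matrix. The sketch below uses the standard definitions, with malware as the positive class; `binary_metrics` is a hypothetical helper and the counts in the example are made up for illustration, not results from the paper.

```python
def binary_metrics(tp, fp, tn, fn):
    """Standard binary classification scores from confusion-matrix counts."""

    def ratio(numerator, denominator):
        return numerator / denominator if denominator else 0.0

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)
    return {
        "F1": ratio(2 * precision * recall, precision + recall),
        "Precision": precision,
        "Recall": recall,
        "Accuracy": ratio(tp + tn, tp + fp + tn + fn),
        "TPR": recall,               # true positive rate equals recall
        "FPR": ratio(fp, fp + tn),   # goodware flagged as malware
        "TNR": ratio(tn, fp + tn),   # goodware correctly accepted
        "FNR": ratio(fn, tp + fn),   # malware missed
    }


scores = binary_metrics(tp=8, fp=2, tn=9, fn=1)
for name, value in scores.items():
    print(f"{name}: {value:.3f}")
```

Note that TPR + FNR and TNR + FPR each sum to 1, which is a quick consistency check on any reported row.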

Datasets:

Drebin:

  • Malware:
    • They are provided by original authors.
    • Drebin_Malware_APK_Done.txt: List of hashes of APKs that passed the features extraction.
    • Drebin_Malware_APK_Errors.txt: List of hashes of APKs that failed in the features extraction.
  • Goodware:
    • (They were collected from AndroZoo)
    • Drebin_Goodware_Original_All.txt: List of hashes of Drebin original APKs.
    • Drebin_Goodware_Original_Found.txt: List of hashes of Drebin original APKs that are available in AndroZoo.
    • Drebin_Goodware_Original_NotFound.txt: List of hashes of Drebin original APKs that are not available in AndroZoo.
    • Drebin_Goodware_Original_Failed.txt: List of hashes of Drebin original APKs that failed in the features extraction.
    • Drebin_Goodware_CompleteWith.txt: List of hashes of APKs that are used to complete the dataset.
    • Drebin_Goodware_Orig+Completed.txt: List of hashes of original APKs that passed the features extraction and of the APKs that are used to complete the goodware dataset.

MaMaDroid:

  • Malware:

    • 2013, 2014, 2015, and 2016 are collected from VirusShare. They are also available in AndroZoo
    • drebin is provided by Drebin original authors
    • 2013_hashes_found.txt: List of hashes of 2013 dataset APKs that we were able to collect.
    • 2013_hashes_done.txt: List of hashes of 2013 dataset APKs that passed the features extraction.
    • 2013_hashes_failed.txt: List of hashes of 2013 dataset APKs that failed in the features extraction.
    • 2014_hashes_found.txt: List of hashes of 2014 dataset APKs that we were able to collect.
    • 2014_hashes_done.txt: List of hashes of 2014 dataset APKs that passed the features extraction.
    • 2014_hashes_failed.txt: List of hashes of 2014 dataset APKs that failed in the features extraction.
    • 2015_hashes_found.txt: List of hashes of 2015 dataset APKs that we were able to collect.
    • 2015_hashes_notFound.txt: List of hashes of 2015 dataset APKs that we were not able to collect.
    • 2015_hashes_done.txt: List of hashes of 2015 dataset APKs that passed the features extraction.
    • 2015_hashes_failed.txt: List of hashes of 2015 dataset APKs that failed in the features extraction.
    • 2016_hashes_found.txt: List of hashes of 2016 dataset APKs that we were able to collect.
    • 2016_hashes_done.txt: List of hashes of 2016 dataset APKs that passed the features extraction.
    • 2016_hashes_failed.txt: List of hashes of 2016 dataset APKs that failed in the features extraction.
    • drebin_hashes_found.txt: List of hashes of drebin dataset APKs that we were able to collect.
    • drebin_hashes_done.txt: List of hashes of drebin dataset APKs that passed the features extraction.
    • drebin_hashes_failed.txt: List of hashes of drebin dataset APKs that failed in the features extraction.
  • Goodware:

    • oldbenign dataset is collected from PlayDrone, and it is available in AndroZoo
    • newbenign dataset is collected from AndroZoo.
    • oldbenign_hashes_found.txt: List of hashes of oldbenign dataset APKs that we were able to collect.
    • oldbenign_hashes_done.txt: List of hashes of oldbenign dataset APKs that passed the features extraction.
    • oldbenign_hashes_failed.txt: List of hashes of oldbenign dataset APKs that failed in the features extraction.
    • oldbenign_namesApp_found.txt: List of apps names of oldbenign dataset APKs that we were able to collect.
    • oldbenign_namesApp_done.txt: List of apps names (as provided by PlayDrone) of oldbenign dataset APKs that passed the features extraction.
    • oldbenign_namesApp_failed.txt: List of apps names of oldbenign dataset APKs that failed in the features extraction.
    • newbenign_hashes_found.txt: List of hashes of newbenign dataset APKs that we were able to collect.
    • newbenign_hashes_notFound.txt: List of hashes of newbenign dataset APKs that we were not able to collect.
    • newbenign_hashes_done.txt: List of hashes of newbenign dataset APKs that passed the features extraction.
    • newbenign_hashes_failed.txt: List of hashes of newbenign dataset APKs that failed in the features extraction.
    • newbenign_UsedToComplete.txt: List of hashes of APKs that are used to complete the newbenign original dataset.

RevealDroid:

  • Malware:

    • drebin_sha.txt: List of drebin apps
    • drebin_sha_intersection_ok_all_features.txt: List of drebin apps after features extraction
    • remaining_sha_found_all.txt: List of VirusTotal apps
    • remain_sha_intersection_ok_all_features.txt: List of VirusTotal apps after features extraction
    • virusshare_md5.txt: List of VirusShare apps
    • virusshare_sha_md5_all_intersection_ok_all_features.txt: List of VirusShare apps after features extraction
  • Goodware:

    • benign_androzoo.txt: List of benign apps
    • benign_androzoo_intersection_ok_all_features.txt: List of benign apps after features extraction
  • Family labels:

    • all_labels_malware.txt: List of family labels for drebin, virusshare, and virustotal apps
    • genome_sha256_labels.txt: List of family labels for genome apps

DroidCat:

All these apps can be downloaded from AndroZoo

  • Malware:

    • apks.malware-2017-more
    • apks.zoo2017
    • apks.vs2016
    • apks.vs2015
    • apks.vs2014
    • apks.vs2013
    • apks.zoo2012
    • apks.newzoo2011
    • apks.zoo2011
    • apks.zoo2010
  • Goodware:

    • sha256.benign2017
    • apks.zoobenign2016
    • apks.zoobenign2015
    • apks.zoobenign2014
    • apks.zoobenign2013
    • apks.zoobenign2012
    • apks.zoobenign2011
    • apks.zoobenign2010

MalScan:

All these apps can be downloaded from AndroZoo

  • Malware:

    • 2018_malware.txt
    • 2017_malware.txt
    • 2016_malware.txt
    • 2015_malware.txt
    • 2014_malware.txt
    • 2013_malware.txt
    • 2012_malware.txt
    • 2011_malware.txt
  • Goodware:

    • 2018_benign.txt
    • 2017_benign.txt
    • 2016_benign.txt
    • 2015_benign.txt
    • 2014_benign.txt
    • 2013_benign.txt
    • 2012_benign.txt
    • 2011_benign.txt

Original Code:

Seed Values

To try to achieve replicable results, the results presented in our paper were computed after setting the seed values to:

  • MaMaDroid:
    • 388652140
  • Drebin
    • 388652140
  • RevealDroid:
    • 123456789
  • DroidCat
    • 480509637
  • MalScan
    • 480509637
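
As a small illustration of why fixing the seed matters, the sketch below stores the values listed above and shows that a seeded shuffle produces the same order on every run. It uses Python's `random` module only; the scripts themselves may seed other libraries (for example numpy or scikit-learn), which is an assumption not verified here, and `seeded_shuffle` is a hypothetical helper.

```python
import random

# Seed values as listed above.
SEEDS = {
    "MaMaDroid": 388652140,
    "Drebin": 388652140,
    "RevealDroid": 123456789,
    "DroidCat": 480509637,
    "MalScan": 480509637,
}


def seeded_shuffle(items, approach):
    """Return a shuffled copy of `items` that is identical on every call for a given approach."""
    rng = random.Random(SEEDS[approach])
    shuffled = list(items)
    rng.shuffle(shuffled)
    return shuffled


apps = [f"app_{i}" for i in range(10)]
first_run = seeded_shuffle(apps, "Drebin")
second_run = seeded_shuffle(apps, "Drebin")
print(first_run == second_run)
```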
