Microsoft Malware Classification Challenge


This project was developed to fulfill a requirement of the Machine Learning course (EE 769) that I took in Spring 2018 at IIT Bombay.

The detailed report for the project is available at my blog here.

Directory Structure

The git repository contains the following directories:

  1. src - contains all the code files
  2. feature-dump - contains a separate pickle file for each feature type, holding the features extracted from each malware instance
  3. all-features - contains the pickle files holding all extracted features of each malware instance
  4. all-feature-train - contains the features of the training instances
  5. all-feature-test - contains the features of the test instances
  6. new-files - contains the files to be classified. To predict the class for a particular pair of .asm and .bytes files, place both files in this folder
  7. new-files-feature-dump - the pickle files of the extracted features are stored in this directory
  8. new-files-all-feature-dump - contains the pickle file for all features
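
Since a new sample is only usable when both of its files are present in new-files/, a small pre-check can save a failed run. The following is a minimal sketch and not part of the repository: the helper name find_input_pair is hypothetical, and it only assumes the fileName.asm / fileName.bytes naming described in this README.

```python
from pathlib import Path


def find_input_pair(file_name, folder="new-files"):
    """Return the (.asm, .bytes) paths for file_name inside folder.

    Hypothetical helper: raises FileNotFoundError if either file of the
    pair is missing, so the problem is reported before feature extraction.
    """
    base = Path(folder)
    asm_path = base / f"{file_name}.asm"
    bytes_path = base / f"{file_name}.bytes"

    # Collect the names of whichever files are absent.
    missing = [p.name for p in (asm_path, bytes_path) if not p.is_file()]
    if missing:
        raise FileNotFoundError(f"missing in {base}: {', '.join(missing)}")
    return asm_path, bytes_path
```

Calling find_input_pair("fileName") from the repository root would return both paths, or name the file that still needs to be copied into new-files/.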


Training is done by running the command python3 in the src/ directory from the terminal. After training, the trained models are stored as pickle files in the src/ folder under their respective names. The finalModel, an object of class SupervisedModels, is stored in the file 'finalModels.pkl', which contains the scalers, the features, and the underlying trained classifiers.
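
As a rough illustration of the storage step, the sketch below writes a model bundle to a pickle file and reads it back. The helper names save_model and load_model are hypothetical, and a plain dict stands in for the SupervisedModels object, whose definition lives in src/ and is not reproduced here.

```python
import pickle


def save_model(model, path="finalModels.pkl"):
    """Serialize the model bundle to path using pickle."""
    with open(path, "wb") as fh:
        pickle.dump(model, fh, protocol=pickle.HIGHEST_PROTOCOL)


def load_model(path="finalModels.pkl"):
    """Load a model bundle previously written by save_model.

    Only unpickle files you trust: unpickling can execute arbitrary code.
    """
    with open(path, "rb") as fh:
        return pickle.load(fh)


def make_example_bundle():
    """Build a stand-in bundle (assumed layout, not the real SupervisedModels)."""
    return {
        "scalers": {"example_scaler": {"mean": 0.0, "scale": 1.0}},
        "features": ["example_feature_a", "example_feature_b"],
        "classifiers": {"example_classifier": None},
    }
```

Loading the real finalModels.pkl would additionally require the SupervisedModels class to be importable, because pickle stores a reference to the class rather than its code.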


Testing can be done in two ways:

  1. Predicting the labels of the test dataset - run the command python3 0 in the src/ folder from the terminal. This prints the accuracy of the model on the test data
  2. Predicting the labels of a new file - run the command python3 1 fileName in the src/ folder from the terminal. The files fileName.asm and fileName.bytes are assumed to be in the folder new-files/. This prints the label predicted by each of the underlying classifiers
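
The two invocations above differ only in their arguments: 0 alone, or 1 followed by a file name. A minimal sketch of how such arguments could be interpreted is shown below. The function parse_test_args and the mode names it returns are hypothetical and are not taken from the repository's code.

```python
def parse_test_args(args):
    """Interpret the arguments that follow the script name.

    ["0"]            -> evaluate on the test dataset
    ["1", fileName]  -> classify new-files/fileName.asm and fileName.bytes

    Returns a (mode, file_name) tuple; file_name is None in test-set mode.
    """
    if args == ["0"]:
        return ("test-set", None)
    if len(args) == 2 and args[0] == "1":
        return ("new-file", args[1])
    raise ValueError("expected either '0' or '1 fileName'")
```

In a script, the list passed in would typically be sys.argv[1:], so that the script name itself is excluded.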

Both of the testing procedures load the finalModel from the file 'finalModels.pkl' and predict the labels on the corresponding data instances.
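
The two outputs described above, an accuracy for the test dataset and one label per classifier for a new file, can be sketched as follows. This is an illustration under assumptions rather than the repository's implementation: the function names are hypothetical, each classifier is assumed to expose a scikit-learn style predict method that takes a list of feature rows, and feature scaling is left out.

```python
class ConstantClassifier:
    """Stand-in classifier for the example: always predicts one label."""

    def __init__(self, label):
        self.label = label

    def predict(self, rows):
        # One prediction per input row, mirroring scikit-learn's convention.
        return [self.label for _ in rows]


def predict_with_each(classifiers, feature_row):
    """Return {classifier name: predicted label} for a single feature row."""
    return {
        name: clf.predict([feature_row])[0]
        for name, clf in classifiers.items()
    }


def accuracy(predicted, actual):
    """Fraction of predictions that match the true labels."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not actual:
        raise ValueError("cannot compute accuracy on an empty label list")
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)
```

For example, predict_with_each({"a": ConstantClassifier(3)}, [0.1, 0.2]) yields one entry per classifier, which matches the per-classifier output of the second testing mode.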