Reducing the Errors in Malware Prediction using Machine Learning Techniques

Microsoft Malware Detection

Software

Jupyter Notebook  

Packages

1) Pandas
2) Scipy
3) Scikit-Learn
4) Seaborn
5) NLTK
6) Numpy
7) Plotly
8) Matplotlib

Installation of Packages

  • Open cmd and type the following commands:
  pip3 install pandas
  pip3 install matplotlib
  pip3 install nltk
  pip3 install numpy
  pip3 install scipy
  pip3 install scikit-learn
  pip3 install seaborn
  pip3 install plotly

Concepts Used

  • Hyperparameter Tuning
  • K-Nearest Neighbours
  • Logistic Regression
  • Exploratory Data Analysis
  • t-Distributed Stochastic Neighbor Embedding (t-SNE)
  • Random Forest Classifier

Problem Overview

Microsoft has been very active in building anti-malware products over the years, and its anti-malware utilities run on over 150 million computers around the world. This generates tens of millions of daily data points to be analyzed as potential malware. To analyze and classify such large amounts of data effectively, we need to be able to group the samples and identify their respective malware families. The dataset provided by Microsoft contains 9 classes of malware.
Source: https://www.kaggle.com/c/malware-classification

The 9 Classes are as Follows:

  1. Ramnit
  2. Lollipop
  3. Kelihos_ver3
  4. Vundo
  5. Simda
  6. Tracur
  7. Kelihos_ver1
  8. Obfuscator.ACY
  9. Gatak

There are nine different classes of malware that we need to classify based on the given data points, which makes this a multi-class classification problem.

The performance of the entire model will be evaluated based on two performance metrics:

1) Multi class log-loss
2) Confusion matrix

Analysis

Step 1: Separating out the .asm and .byte files

  1. 150 GB of .asm files
  2. 50 GB of .byte files

Step 2: Training and Testing Dataset (in General)

Randomly Splitting the dataset into Training, Cross-Validation & Testing data

Let me now explain what exactly I mean by Train, Cross-Validation & Test:

  1. The Training Dataset is used to Train the Model
  2. The Validation Dataset is used to evaluate the candidate models, from which one is then chosen. The chosen model is then retrained on the Training Dataset.
  3. Finally, the trained model is evaluated with the Test Dataset.

    In Steps 1 and 2, we do not want to evaluate the candidate models only once. Instead, we prefer to evaluate each model multiple times with different datasets and take the average score for our decision in Step 3. If we have the luxury of vast amounts of data, which we do in our case, this can be done easily. Otherwise, we can use the K-fold trick to resample the same dataset multiple times and treat the resamples as different datasets. Each time we evaluate a model or a hyperparameter setting, the model has to be trained from scratch, without reusing the training results from previous attempts. We call this process Cross-Validation.
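To make the idea concrete, here is a minimal sketch of K-fold cross-validation with scikit-learn. The feature matrix and labels are synthetic stand-ins, not the actual malware data, and the candidate model is just an example:

  import numpy as np
  from sklearn.datasets import make_classification
  from sklearn.ensemble import RandomForestClassifier
  from sklearn.model_selection import StratifiedKFold, cross_val_score

  # Synthetic stand-in data; the real notebook uses the byte/asm features built later.
  X, y = make_classification(n_samples=500, n_classes=3, n_informative=6, random_state=42)

  candidate = RandomForestClassifier(n_estimators=100, random_state=42)

  # Each fold retrains the candidate from scratch on the remaining folds and
  # evaluates it on the held-out fold; the scores are then averaged.
  cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
  scores = cross_val_score(candidate, X, y, cv=cv, scoring="neg_log_loss")
  print("Mean cross-validation log loss:", -scores.mean())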

Step 3: Exploratory Data Analysis (Byte Files)

Now, what is actually meant by Exploratory Data Analysis?

Exploratory Data Analysis (EDA) is used by data scientists to analyze and investigate Data Sets and summarize their main characteristics, often employing data visualization methods. It helps determine how best to manipulate data sources to get the answers you need, making it easier for Data Scientists to discover Patterns, Spot Anomalies, Test a Hypothesis, or Check Assumptions.

EDA is primarily used to see what the data can reveal beyond the formal modeling or hypothesis-testing task, and it provides a better understanding of the dataset's variables and the relationships between them. It can also help determine whether the statistical techniques we are considering for data analysis are appropriate.

Step 3.1: Analysing the Distribution of Malware Classes in Whole Data Set

The bar graph above illustrates the classes to which these malware samples belong. It can be clearly observed that the dataset contains very few data points belonging to Class 5, which will act as a constraint when building the prediction models. Moreover, the data are not balanced across classes, making this an example of imbalanced classification.
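For reference, here is a minimal sketch of how such a class-distribution bar chart can be produced, assuming the Kaggle trainLabels.csv file with Id and Class columns:

  import pandas as pd
  import matplotlib.pyplot as plt
  import seaborn as sns

  labels = pd.read_csv("trainLabels.csv")            # assumed columns: Id, Class
  class_counts = labels["Class"].value_counts().sort_index()

  sns.barplot(x=class_counts.index, y=class_counts.values)
  plt.xlabel("Malware class (1-9)")
  plt.ylabel("Number of data points")
  plt.title("Distribution of malware classes in the training set")
  plt.show()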

Step 3.2: Feature Extraction (Byte Files)

Here is a brief definition of what is meant by Feature Extraction: the problem of selecting some subset of a learning algorithm's input variables upon which it should focus attention, while ignoring the rest. In other words, dimensionality reduction!

Mathematically speaking, given a set of features F = {f1, f2, …, fn}, the feature selection problem is to find a subset that "maximizes the learner's ability to classify patterns".

Step 3.2.1: File Size of Byte Files as a Feature

Now a typical .byte file will have code like this:

00401000 56 8D 44 24 08 50 8B F1 E8 1C 1B 00 00 C7 06 08

Here 00401000 represents the starting address of the line of code, followed by a sequence of two-digit hexadecimal values.

Before understanding how this code will be useful for our analysis, it is important to understand these two concepts namely:

1) Unigrams


To generate 1-grams or unigrams, we pass n=1 to the ngrams function of NLTK. But first, we split the sentence into tokens and then pass these tokens to the ngrams function. Unigrams are useful for building capabilities like autocorrect, sentence autocompletion, text summarization, speech recognition, etc.
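A minimal sketch of generating unigrams with NLTK's ngrams function, using one line of a byte file as the example sentence:

  from nltk import ngrams

  sentence = "56 8D 44 24 08 50"
  tokens = sentence.split()              # the hex pairs are already space-separated
  unigrams = list(ngrams(tokens, 1))     # n=1 gives unigrams
  print(unigrams)                        # [('56',), ('8D',), ('44',), ('24',), ('08',), ('50',)]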

2) Bag of Words

Bag of words model helps convert the text into numerical representation (numerical feature vectors) such that the same can be used to train models using machine learning algorithms.
Here are the key steps of fitting a bag-of-words model:

a) Create a vocabulary of indices for the words or tokens found in the entire set of documents. The vocabulary indices can be created in alphabetical order.
b) Construct the numerical feature vector for each document, representing how frequently each word appears in that document. The feature vector for each document will be sparse in nature, since the words in a single document represent only a small subset of all the words (the bag of words) present in the entire set of documents.
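A minimal sketch of fitting a bag-of-words model with scikit-learn's CountVectorizer on a few toy documents (not the actual malware data):

  from sklearn.feature_extraction.text import CountVectorizer

  docs = ["push esi lea eax",
          "push eax call sub_401000",
          "lea eax push eax"]

  vectorizer = CountVectorizer()            # builds the vocabulary indices
  counts = vectorizer.fit_transform(docs)   # sparse document-term matrix

  print(vectorizer.get_feature_names_out()) # use get_feature_names() on older scikit-learn
  print(counts.toarray())                   # one row of word counts per document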

Output:


The table above is a sample output that keeps count of each of the 256 possible byte patterns (00 to FF) occurring in the byte files. However, our output has 258 columns; the two additional columns are the serial number and the file ID.
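A minimal, hypothetical sketch of how such a table can be built: count the occurrences of each of the 256 hex values in every byte file and collect the counts into a DataFrame. The byteFiles folder name and the .bytes extension are assumptions, not the notebook's actual paths:

  import os
  from collections import Counter

  import pandas as pd

  HEX_VOCAB = [f"{i:02X}" for i in range(256)]   # "00" ... "FF"
  HEX_SET = set(HEX_VOCAB)

  def byte_unigram_counts(path):
      """Count how often each of the 256 hex values appears in one byte file."""
      counts = Counter()
      with open(path) as f:
          for line in f:
              tokens = line.split()[1:]                         # drop the address column
              counts.update(t for t in tokens if t in HEX_SET)  # ignores '??' entries
      return [counts[h] for h in HEX_VOCAB]

  rows = []
  for fname in sorted(os.listdir("byteFiles")):                 # hypothetical folder name
      if fname.endswith(".bytes"):
          file_id = fname[:-len(".bytes")]
          rows.append([file_id] + byte_unigram_counts(os.path.join("byteFiles", fname)))

  features = pd.DataFrame(rows, columns=["ID"] + HEX_VOCAB)
  print(features.head())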

Step 3.2.2: Multivariate Analysis (Byte Files)

Let's first understand what Multivariate Analysis even means.

Multivariate analysis deals with the statistical analysis of data collected on more than one dependent variable. Moreover, to be considered truly multivariate, all the variables must be random and interrelated in such a way that their different effects cannot meaningfully be interpreted separately.

The building block of multivariate analysis is the variate. It is defined as the weighted sum of the variables, where the weights are determined by the multivariate technique. The variate of n weighted variables (X1 to Xn) can be written as:

Variate = X1W1 + X2W2 + X3W3 + … + XnWn

where X1, X2, …, Xn are the observed variables and W1, W2, W3, …, Wn are the weights.

These variates capture the multivariate features of the analysis, thus in each technique, the variate acts as the focal point of the analysis. For example, in multiple regression, the variate is determined in such a manner that the correlation between the dependent variable and the independent variables is maximum.

Before moving on to our output, it's important to first briefly understand what t-SNE means!


In the demonstration above, the picture in the bottom-left corner represents the actual cluster structure in the higher-dimensional space. t-SNE tries to plot this same structure in a reduced dimension; in the case above it is plotted in a two-dimensional space.

So, t-SNE is a non-linear dimensionality reduction algorithm that finds patterns in the data based on the similarity of data points across their features. The similarity of two points is calculated as the conditional probability that a point A would choose point B as its neighbour. t-SNE then tries to minimize the difference between these conditional probabilities (or similarities) in the higher-dimensional and lower-dimensional spaces, so that the data points are faithfully represented in the lower-dimensional space.
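A minimal sketch of running t-SNE with scikit-learn and colouring the 2-D embedding by malware class. The feature matrix and labels here are random stand-ins for the 256 unigram-count columns and the class labels:

  import matplotlib.pyplot as plt
  import numpy as np
  from sklearn.manifold import TSNE

  rng = np.random.default_rng(42)
  X = rng.poisson(5.0, size=(500, 256)).astype(float)   # stand-in for the 256 unigram counts
  y = rng.integers(1, 10, size=500)                     # stand-in for classes 1-9

  embedding = TSNE(n_components=2, perplexity=30, random_state=42).fit_transform(X)

  plt.scatter(embedding[:, 0], embedding[:, 1], c=y, cmap="tab10", s=8)
  plt.colorbar(label="Malware class")
  plt.title("t-SNE of byte-file unigram features (stand-in data)")
  plt.show()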

Now, let's understand the output we get after running the t-SNE analysis:


From the visual above, it is quite clear that the malware byte code belonging to Classes 1, 2 and 3 is nicely clustered and hence will give good results when we run these classes through our machine learning models. This is because we have a large number of data points for these three classes. Some other classes, such as 8 and 9, also form small clusters, though in a scattered manner. However, for the rest of the classes the points are scattered throughout without much observable pattern.

Step 4: Train Test Split (Byte Files)

We split our byte files randomly into Train, Test & Cross-Validation sets. This step is a continuation of Step 2, where the split is now actually implemented before feeding the dataset into the machine learning models.
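A minimal sketch of such a split, assuming the 64%/16%/20% train/cross-validation/test ratio mentioned later in Step 7. X and y are random stand-ins for the byte-file features and labels:

  import numpy as np
  from sklearn.model_selection import train_test_split

  rng = np.random.default_rng(0)
  X = rng.poisson(5.0, size=(1000, 256))    # stand-in feature matrix
  y = rng.integers(1, 10, size=1000)        # stand-in class labels 1-9

  # First hold out 20% as the test set, then carve 20% of the remainder
  # (i.e. 16% of the full data) out as the cross-validation set.
  X_rest, X_test, y_rest, y_test = train_test_split(
      X, y, test_size=0.20, stratify=y, random_state=42)
  X_train, X_cv, y_train, y_cv = train_test_split(
      X_rest, y_rest, test_size=0.20, stratify=y_rest, random_state=42)

  print(X_train.shape, X_cv.shape, X_test.shape)   # roughly 64% / 16% / 20%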

Step 5: Machine Learning Models (Only on Byte Files)

A “Model” in Machine Learning is the output of a Machine Learning Algorithm run on data. A model represents what was learned by a machine learning algorithm. The model is the “thing” that is saved after running a machine learning algorithm on training data and represents the rules, numbers, and any other algorithm-specific data structures required to make predictions.

Before delving deep into each of the Models, let us first understand what Multiclass Log-Loss means:

Log loss is one of the most important classification metrics based on probabilities. It is hard to interpret raw log-loss values, but log loss is still a good metric for comparing models: for any given problem, a lower log-loss value means better predictions. Log loss is a slight twist on something called the likelihood function, so we will start by understanding the likelihood function. The likelihood function answers the question: "How likely did the model think the actually observed set of outcomes was?"

Let's have a look at the formula for multiclass log loss:

Log Loss = -(1/N) * Σ_i Σ_j [ y_ij * log(p_ij) ],  with i = 1…N and j = 1…M

where N is the number of samples or instances,
M is the number of possible labels,
y_ij is a binary indicator of whether or not label j is the correct classification for instance i, and
p_ij is the model's predicted probability of assigning label j to instance i.
A perfect classifier would have a log loss of precisely zero; less ideal classifiers have progressively larger values of log loss.
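A tiny worked example of this formula, computed both by hand and with scikit-learn's log_loss for comparison; the three instances and their probabilities are made up:

  import numpy as np
  from sklearn.metrics import log_loss

  y_true = [1, 2, 3]                       # true classes of three instances
  probs = np.array([[0.7, 0.2, 0.1],       # predicted P(class 1..3) for each instance
                    [0.1, 0.8, 0.1],
                    [0.2, 0.2, 0.6]])

  # By hand: average of -log(probability assigned to the correct class).
  manual = -np.mean(np.log([0.7, 0.8, 0.6]))
  print(manual)                                        # ≈ 0.364
  print(log_loss(y_true, probs, labels=[1, 2, 3]))     # same value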

Step 5.1: Random Model

A random model, in our case, assigns probability values to all the classes at random and gives us cross-validation, test, and misclassified-points values. Note that these values are generated randomly and are not based on any algorithm or pattern. Any future model we build must have a lower log loss than the random model: if it exceeds the random model's value, it is an indication that something is seriously wrong with the new model. So, in the newer models we will try to keep our log-loss values as close to 0 as possible, since a log loss of 0 indicates a perfect model.
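A minimal sketch of such a random baseline: assign each test point a random probability vector over the nine classes (normalised to sum to 1) and evaluate it with multiclass log loss. The labels here are stand-ins:

  import numpy as np
  from sklearn.metrics import log_loss

  rng = np.random.default_rng(42)
  n_test, n_classes = 2000, 9
  y_test = rng.integers(1, n_classes + 1, size=n_test)       # stand-in true labels

  random_probs = rng.random((n_test, n_classes))
  random_probs /= random_probs.sum(axis=1, keepdims=True)    # each row sums to 1

  predictions = random_probs.argmax(axis=1) + 1
  print("Random-model test log loss:",
        log_loss(y_test, random_probs, labels=list(range(1, 10))))
  print("Misclassified points (%):", 100 * np.mean(predictions != y_test))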

Before understanding the actual output, let us try to understand what is meant by the Confusion, Precision & Recall matrices.

Let us consider a 2 x 2 confusion matrix which predicts cancer in patients, to keep things simple:


Let us now understand a few basic terminologies associated with Confusion Matrices:

True Positive (TP) — Model correctly predicts the positive class (prediction and actual both are positive). In the above example, 10 people who have Cancer are predicted positively by the model.

True Negative (TN) — Model correctly predicts the negative class (prediction and actual both are negative). In the above example, 60 people who don’t have Cancer are predicted negatively by the model.

False Positive (FP) — Model gives the wrong prediction of the negative class (predicted-positive, actual-negative). In the above example, 22 people are predicted as positive of having Cancer, although they don’t have Cancer. FP is also called a TYPE I error.

False Negative (FN) — Model wrongly predicts the positive class (predicted-negative, actual-positive). In the above example, 8 people who have Cancer are predicted as negative. FN is also called a TYPE II error.

With the help of these four values, we can calculate the True Positive Rate (TPR), False Positive Rate (FPR), True Negative Rate (TNR), and False Negative Rate (FNR).


Precision:

Out of all the points predicted as positive, what percentage is truly positive: Precision = TP / (TP + FP). The precision value lies between 0 and 1.


Recall:

Out of all the actual positives, what percentage is predicted positive: Recall = TP / (TP + FN). It is the same as the TPR (true positive rate).
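A minimal sketch of the confusion matrix, precision and recall for the cancer example above, using made-up label vectors that reproduce the TP/TN/FP/FN counts discussed (10, 60, 22, 8):

  from sklearn.metrics import confusion_matrix, precision_score, recall_score

  # 1 = has cancer (positive), 0 = no cancer (negative)
  y_true = [1] * 10 + [0] * 60 + [0] * 22 + [1] * 8
  y_pred = [1] * 10 + [0] * 60 + [1] * 22 + [0] * 8

  print(confusion_matrix(y_true, y_pred))   # [[60 22]   rows: actual,
                                            #  [ 8 10]]  columns: predicted
  print("Precision:", precision_score(y_true, y_pred))   # TP / (TP + FP) = 10/32
  print("Recall:   ", recall_score(y_true, y_pred))      # TP / (TP + FN) = 10/18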



Output:


The output above is the precision matrix obtained from our random model.
Let us now look at how to read a precision matrix with one example. Consider row 3, column 1:

We read it as follows: "Of the points predicted to belong to Class 1, 27.9% actually belong to Class 3."
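A minimal sketch of how such a precision matrix can be derived: normalise each column of the multiclass confusion matrix, so that column j tells us what fraction of the points predicted as class j actually belongs to each class. The tiny example data is made up:

  import numpy as np
  from sklearn.metrics import confusion_matrix

  def precision_matrix(y_true, y_pred, labels):
      cm = confusion_matrix(y_true, y_pred, labels=labels).astype(float)
      return cm / cm.sum(axis=0, keepdims=True)   # divide each column by its total

  y_true = [1, 1, 2, 2, 3, 3, 3, 1]
  y_pred = [1, 2, 2, 2, 3, 1, 3, 1]
  print(np.round(precision_matrix(y_true, y_pred, labels=[1, 2, 3]), 3))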

The Above Model gives us the following Results:
Log loss on Cross Validation Data using Random Model 2.45615644965
Log loss on Test Data using Random Model 2.48503905509
Number of misclassified points 88.5004599816

From these results, we can at least conclude that our future models should not exceed these values, as that would mean they are even worse than a random model!

Step 5.2: K Nearest Neighbour Classification

The KNN algorithm assumes that similar things exist in close proximity. In other words, similar things are near to each other.
The KNN algorithm hinges on this assumption being true enough for the algorithm to be useful. KNN captures the idea of similarity (sometimes called distance, proximity, or closeness) with some simple mathematics: calculating the distance between points on a graph.

The K-NN working can be explained on the basis of the below algorithm:

  • Step-1: Select the number K of neighbours.
  • Step-2: Calculate the Euclidean distance from the new data point to the existing data points.
  • Step-3: Take the K nearest neighbours according to the calculated Euclidean distance.
  • Step-4: Among these K neighbours, count the number of data points in each category.
  • Step-5: Assign the new data point to the category for which the neighbour count is maximum.
  • Step-6: Our model is ready.
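Before looking at the output, here is a minimal sketch of a KNN classifier evaluated with multiclass log loss. The data is a random stand-in for the byte-file unigram features:

  import numpy as np
  from sklearn.metrics import log_loss
  from sklearn.model_selection import train_test_split
  from sklearn.neighbors import KNeighborsClassifier

  rng = np.random.default_rng(0)
  X = rng.poisson(5.0, size=(1000, 256)).astype(float)   # stand-in features
  y = rng.integers(1, 10, size=1000)                     # stand-in classes 1-9

  X_train, X_test, y_train, y_test = train_test_split(
      X, y, test_size=0.2, stratify=y, random_state=42)

  knn = KNeighborsClassifier(n_neighbors=5)
  knn.fit(X_train, y_train)
  probs = knn.predict_proba(X_test)                      # class probabilities for log loss
  print("Test log loss:", log_loss(y_test, probs, labels=knn.classes_))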

Output:


The plot above shows the log-loss values for different K values, and we choose the best K, which is 1 in our case, since K=1 gives the lowest log-loss value. This process of fine-tuning a parameter, in this case K, is also known as hyperparameter tuning (a minimal sketch of such a sweep is given at the end of this subsection). We get the following values for K=1:
(Here, alpha means the same as K in KNN)
For values of best alpha = 1 The train log loss is: 0.0782947669247
For values of best alpha = 1 The cross validation log loss is: 0.225386237304
For values of best alpha = 1 The test log loss is: 0.241508604195
Number of misclassified points 4.50781968721

Output:


The output above is the precision matrix obtained for K=1. From the matrix, we can at least conclude that most of the predictions are accurate, except for Class 5, where almost 25% of the points are misclassified into Class 1.
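As mentioned above, here is a minimal sketch of the hyperparameter sweep over K: try several values, compute the cross-validation log loss for each, and keep the best. Data splits are random stand-ins:

  import numpy as np
  from sklearn.metrics import log_loss
  from sklearn.model_selection import train_test_split
  from sklearn.neighbors import KNeighborsClassifier

  rng = np.random.default_rng(0)
  X = rng.poisson(5.0, size=(1000, 256)).astype(float)
  y = rng.integers(1, 10, size=1000)
  X_train, X_cv, y_train, y_cv = train_test_split(
      X, y, test_size=0.2, stratify=y, random_state=42)

  cv_losses = {}
  for k in [1, 3, 5, 11, 15, 21]:
      knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
      cv_losses[k] = log_loss(y_cv, knn.predict_proba(X_cv), labels=knn.classes_)

  best_k = min(cv_losses, key=cv_losses.get)   # K with the lowest CV log loss
  print("CV log loss per K:", cv_losses)
  print("Best K:", best_k)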

Step 5.3: Logistic Regression

In statistics, multinomial logistic regression is a classification method that generalizes logistic regression to multiclass problems, i.e. with more than two possible discrete outcomes. That is, it is a model that is used to predict the probabilities of the different possible outcomes of a categorically distributed dependent variable, given a set of independent variables (which may be real-valued, binary-valued, categorical-valued, etc.).
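A minimal sketch of multinomial logistic regression in scikit-learn, sweeping the inverse regularisation strength C; the data is again a random stand-in:

  import numpy as np
  from sklearn.linear_model import LogisticRegression
  from sklearn.metrics import log_loss
  from sklearn.model_selection import train_test_split

  rng = np.random.default_rng(0)
  X = rng.poisson(5.0, size=(1000, 256)).astype(float)
  y = rng.integers(1, 10, size=1000)
  X_train, X_cv, y_train, y_cv = train_test_split(
      X, y, test_size=0.2, stratify=y, random_state=42)

  for c in [0.01, 0.1, 1, 3, 10]:
      # With the default lbfgs solver and more than two classes, scikit-learn fits
      # a multinomial (softmax) model; C is the inverse regularisation strength.
      clf = LogisticRegression(C=c, max_iter=1000)
      clf.fit(X_train, y_train)
      loss = log_loss(y_cv, clf.predict_proba(X_cv), labels=clf.classes_)
      print(f"C = {c}: CV log loss = {loss:.4f}")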

Output:



From the first output, it is clear that c=3 gives the lowest log-loss values, which are as follows:
log loss for train data 0.498923428696
log loss for cv data 0.549929846589
log loss for test data 0.528347316704
Number of misclassified points 12.3275068997

Also, in the second output (the precision matrix) we can see that none of the byte files have been predicted to belong to Class 5. This happened because the number of byte files that actually belong to Class 5 is very small, and as a result the model makes errors when predicting that class.

Step 5.4: Random Forest Classifier

It is based on the concept of ensemble learning. (Ensemble learning is the process by which multiple models, such as classifiers or experts, are strategically generated and combined to solve a particular computational intelligence problem. Ensemble learning is primarily used to improve classification, prediction, function approximation, etc.)

Random Forest is a classifier that builds a number of decision trees on various subsets of the given dataset and averages their outputs to improve the predictive accuracy on that dataset. Instead of relying on one decision tree, the random forest takes the prediction from each tree and, based on the majority vote of those predictions, produces the final output.
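A minimal sketch of the Random Forest classifier, sweeping the number of trees (n_estimators) and evaluating with multiclass log loss; the data is a random stand-in:

  import numpy as np
  from sklearn.ensemble import RandomForestClassifier
  from sklearn.metrics import log_loss
  from sklearn.model_selection import train_test_split

  rng = np.random.default_rng(0)
  X = rng.poisson(5.0, size=(1000, 256)).astype(float)
  y = rng.integers(1, 10, size=1000)
  X_train, X_cv, y_train, y_cv = train_test_split(
      X, y, test_size=0.2, stratify=y, random_state=42)

  for n_trees in [100, 500, 1000]:
      rf = RandomForestClassifier(n_estimators=n_trees, n_jobs=-1, random_state=42)
      rf.fit(X_train, y_train)
      loss = log_loss(y_cv, rf.predict_proba(X_cv), labels=rf.classes_)
      print(f"{n_trees} trees: CV log loss = {loss:.4f}")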

Output:



From the first output, it is clear that 1000 decision trees give the lowest log-loss values, which are as follows:
For values of best alpha = 1000 The train log loss is: 0.0266476291801
For values of best alpha = 1000 The cross validation log loss is: 0.0879849524621
For values of best alpha = 1000 The test log loss is: 0.0858346961407
Number of misclassified points 2.02391904324

In the second output, the precision matrix, we can see that almost all of the .byte files have been predicted to belong to their correct classes, and of all the models above this one produced the least error!

Step 6: Multivariate Analysis (On .asm files)

Output:


From the above t-SNE plot, we can observe that only extremely small clusters are formed here and there, and there isn't much of an observable pattern. So, we cannot draw many conclusions from this plot alone.

Also, the above plot has been made from the 52 main features extracted from the .asm files.

Step 7: Train Test Split (.asm Files)

Randomly splitting the dataset into Training, Cross-Validation & Testing data, in our case in the ratio 64%, 16%, and 20% respectively.

Step 8: Machine Learning Models (Only on .asm Files)

Step 8.1: K-Nearest Neighbour Classification

Output:



In the first output, we can observe that a K value of around 3 gives the lowest log-loss values, which are as follows:
log loss for train data 0.0476773462198
log loss for cv data 0.0958800580948
log loss for test data 0.0894810720832
Number of misclassified points 2.02391904324

In the second output (the precision matrix), we observe that the predictions have good accuracy for more or less all the classes. Only for Class 5 are the predictions slightly off, with 20% of the points predicted to belong to Class 6.

Step 8.2: Logistic Regression

Output:



In the first output, for a value of c around 1000, we get the minimum log-loss values, which are as follows:
log loss for train data 0.396219394701
log loss for cv data 0.424423536526
log loss for test data 0.415685592517
Number of misclassified points 9.61361545538

In the second output (the precision matrix), we observe a lot of misclassifications made by this model, especially for Classes 5 and 7:
All Class 5 files -----> predicted to belong to Class 1!
Class 7 files ---------> predicted to belong to Classes 1, 4, and 8, apart from Class 7 itself!

Step 8.3: Random Forest Classifier

Output:



From the first output, we observe that around 3000 decision trees give the lowest log-loss values, which are as follows:
log loss for train data 0.0116517052676
log loss for cv data 0.0496706817633
log loss for test data 0.0571239496453
Number of misclassified points 1.14995400184

From the second output, we can conclude that the Random Forest classifier again successfully predicts most of the classes with very good accuracy, similar to the byte-files case.

Conclusion

  • The machine learning models KNN and Random Forest performed significantly better than Logistic Regression for both the byte files and the asm files.
  • To take this analysis a step further, ML models based on a combination of byte and asm features could be built to get a complete picture of the error results. Since we now know that KNN and Random Forest perform best in both cases, these two algorithms can be tried on the combined byte and asm features as well.
  • We can further improve (reduce) our log-loss values with XGBoost; a minimal sketch is given below. However, this is a very computationally demanding process (even with multiprocessing), and the computation can take days to complete, primarily because we are dealing with a humongous amount of data.
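A minimal, hedged sketch of how XGBoost could be plugged into the same pipeline; xgboost must be installed separately (pip3 install xgboost), the parameter values are illustrative rather than tuned, and the data is again a random stand-in rather than the actual byte/asm features:

  import numpy as np
  from sklearn.metrics import log_loss
  from sklearn.model_selection import train_test_split
  from xgboost import XGBClassifier

  rng = np.random.default_rng(0)
  X = rng.poisson(5.0, size=(1000, 256)).astype(float)
  y = rng.integers(0, 9, size=1000)          # XGBClassifier expects labels 0..8

  X_train, X_cv, y_train, y_cv = train_test_split(
      X, y, test_size=0.2, stratify=y, random_state=42)

  xgb = XGBClassifier(n_estimators=500, learning_rate=0.1, max_depth=6,
                      objective="multi:softprob", n_jobs=-1)
  xgb.fit(X_train, y_train)
  print("CV log loss:", log_loss(y_cv, xgb.predict_proba(X_cv)))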

