<a href="https://colab.research.google.com/github/kamkali/Malicious_Discovery/blob/master/DL_Malware_Notebook.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


# Welcome to the Jupyter Notebook with Google Colab! Here we will post our research on discovering malware using machine learning.

*authors: Jakub Burghardt, Kamil Kaliś, Michał Szczepaniak-Krupowski*

In [1]:
!pip install pefile
def sup():
  print("Hello World!")
sup()

Hello World!


In [0]:
def upload_files():
    """Function for uploading files to the project"""
    from google.colab import files
    uploaded = files.upload()
    for k, v in uploaded.items():
        open(k, 'wb').write(v)
    return list(uploaded.keys())

-----

# Part I: planning


## Main tasks:
- get acquaint with malware topic:
    - types of malware
    - frequency of appearance on different operating systems
    - detection methods: static and dynamic
    - trends
- familiarity with machine learning topic: 
    - selection of different algorithms
- most significant features pick
- dataset pick:
    - should be up to date (as current as possible)
    - should contain proper features selected earlier
- analysis of efficiency of chosen algorithms:
    - statistical comparison of accuracy, true positives/false negatives/etc.

## Assignment of the task:
### Michał Szczepaniak-Krupowski:
A deeper look at malware – types and families, comparison between static and dynamic methods.

The popularity of malware on different OS – why Windows users are the most threatened on the malicious software attack?

### Kamil Kaliś:
What is Machine Learning and why is it crucial in Cybersecurity – the need for accelerating the usage of ML in modern detecting malware.

Comparison of different supervised learning algorithms – pros and cons in malware detection.

### Jakub Burghardt:
Malware detection methods – more detailed reasoning about static detection methods.

A brief look at PE format (Portable Executable) – how can they be helpful in detecting malware with static methods?

-----

# Mid-term presentation
### link to presentation:
https://drive.google.com/open?id=1x1DcDueWrDM8XHEd-sKMWykt5G3wMW9X-8NbKFgRhmk

# Project implementation:

># *'Before'* section:

Code snippet to upload MalwareData.csv if not already in colab directory:

In [0]:
import os
if not os.path.exists('MalDiscoveryData'):
    !git clone https://github.com/kamkali/MalDiscoveryData
    !unzip MalDiscoveryData/MalwareData.csv.zip

Logger module to save outputs from functions:

In [0]:
import logging
import functools
import time

"""Logger module created by Kamil Kaliś"""


def logger_setup(logger_file="results.log"):
    """Sets up the logger.
    Usage:
        1. With wrapper @log_to_file(logger_file=<loggername.log>)
        2. Set logger within module as:
            2.1 log = get_logger(logger_file-<loggername.log>)
            2.2 use 'log' variable to use logger and write them to file"""

    logformat = "[%(asctime)s %(levelname)s] %(message)s"
    dateformat = "%d-%m-%y %H:%M:%S"
    logger = logging.getLogger(logger_file)
    formatter = logging.Formatter(logformat)
    formatter.datefmt = dateformat
    fh = logging.FileHandler(logger_file, mode="a")
    fh.setFormatter(formatter)
    sh = logging.StreamHandler()
    sh.setFormatter(formatter)
    logger.setLevel(logging.INFO)
    logger.addHandler(fh)
    logger.addHandler(sh)
    logger.propagate = False


def log_to_file(func=None, logger_file='results.log'):
    def log_exec_time(original_func):
        """Wrapper for logging execution time of marked function.
        Usage:
        Add adnotation above function: '@log_to_file(<logger_file.log>)'"""

        @functools.wraps(original_func)
        def wrapper(*args, **kwargs):
            log = logging.getLogger(logger_file)
            log.info(f"Running {original_func.__name__}...")
            start_time = time.time()
            result = original_func(*args, **kwargs)
            exec_time = time.time() - start_time
            log.info(f"Function {original_func.__name__} finished in {exec_time}s")
            return result

        return wrapper

    return log_exec_time if func is None else log_exec_time(func)


def get_logger(logger_file='results.log'):
    return logging.getLogger(logger_file)


--------

> # K-Nearest Neighbors:
*author: Kamil Kaliś*

One simple way to solve malware detection problems is to use the K-Nearest Neighbors algorithm. KNN model from the third-party library (*sklearn*) will be used.
We also need data to work on and in this case, it is collected in *MalwareData.csv* file. To process data, *pandas* library and sklearn modules are used. To visualize data and charts *plotly* library will be used.

As mentioned above, let's start with importing adequate libraries and modules:

In [0]:
import pandas as pd
import plotly.graph_objects as go
from plotly.offline import plot
from sklearn import neighbors, preprocessing
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.metrics import confusion_matrix

Then, read data to work on:

In [6]:
data = pd.read_csv('MalwareData.csv', sep='|')
print(data.shape[0], data.shape[1])
benign_files = data[data['legitimate'] == 1].count()
malware_files = data[data['legitimate'] == 0].count()
print(f"Clean files count is {benign_files[1]} and malware files count is {malware_files[1]}")

138047 57
Clean files count is 41323 and malware files count is 96724


MalwareData.csv contains 138 047 records as a whole, where 41 323 files are benign, clean files, and 96 724 files which are malicious. 
Data is reprezented with 56 features. Last 57'th feature represents a label *legitimate* classifies given sample:

'1' – for clean file, '0' – for malware file.

Knowing all this, we can start to process the data to more algorithm friendly form.

## Function: malware_data_transform

### **@description:** 
> Performs data transform as dropping insignificant columns and 'legitimate' column, which contains labels. It can normalize or standardize data.

### **@params:**

>*   optimize_data – enables data normalization or standardization: 
default is None
>*   csv_data – cvs file to read and transform:
default is MalwareData.csv
>*   csv_sep – separator used in csv file:
default is '|'
>*   enable figures – enables plotly charts to show:
default is False

### **@returns:**


> full_data.values, labels – tuple with values of features and labels for malicious and clean files








In [0]:
@log_to_file(logger_file='KNN_results.log')
def malware_data_transform(optimize_data=None, csv_data='MalwareData.csv', csv_sep='|', enable_figures=False):
    log = get_logger(logger_file='KNN_results.log')
    log.info("-Reading csv file")
    full_data = pd.read_csv(csv_data, sep=csv_sep)

    pd.set_option("display.max_columns", None)

    labels = full_data['legitimate'].values
    full_data: pd.DataFrame = full_data.drop(['Name', 'md5', 'legitimate'], axis=1)

    if optimize_data == 'normalize':
        log.info("--Data normalization processing...")
        full_data = pd.DataFrame(preprocessing.MinMaxScaler().fit_transform(full_data))
    elif optimize_data == 'standardize':
        log.info("--Data standardization processing...")
        full_data = pd.DataFrame(preprocessing.StandardScaler().fit_transform(full_data))

    if enable_figures:
        log.info("--Figures enabled")
        plot_bar_figures(full_data, optimize_data)

    return full_data.values, labels


When transforming data, it is good to see how it looks like. The next function provides a utility to show non-optimized data, normalized or standardized forms of data.

## Function: plot_bar_figures

### **@description:** 
> Function to plot bar charts with features values
### **@params:**


>*   optimize_data – enables data normalization or standardization: 
default is None
>*   full_data – transformed data

### **@returns:**
> Nothing








In [0]:
def plot_bar_figures(full_data, optimize_data):
    cols = full_data.keys()
    if optimize_data == 'normalize':
        cols = full_data.keys()
        full_data_normalized = pd.DataFrame(preprocessing.MinMaxScaler().fit_transform(full_data))
        trace_norm_mean: pd.DataFrame = full_data_normalized.mean()
        trace_norm_std: pd.DataFrame = full_data_normalized.std()

        # -------------------------------------------------
        layout = go.Layout(title='Normalized data mean')
        figure = go.Figure(go.Bar(y=trace_norm_mean,
                                  x=cols
                                  ), layout=layout)
        figure.show()
        # -------------------------------------------------
        layout = go.Layout(title='Normalized data standard deviation')
        figure = go.Figure(go.Bar(y=trace_norm_std,
                                  x=cols
                                  ), layout=layout)
        figure.show()

    elif optimize_data == 'standardize':
        cols = full_data.keys()

        full_data_standardized = pd.DataFrame(preprocessing.StandardScaler().fit_transform(full_data))
        trace_stand_mean = full_data_standardized.mean()
        trace_stand_std = full_data_standardized.std()

        # -------------------------------------------------
        layout = go.Layout(title='Standardized data mean')
        figure = go.Figure(go.Bar(y=trace_stand_mean,
                                  x=cols
                                  ), layout=layout)
        
        figure.show()
        # -------------------------------------------------
        layout = go.Layout(title='Standardized data standard deviation')
        figure = go.Figure(go.Bar(y=trace_stand_std,
                                  x=cols
                                  ), layout=layout)
        figure.show()
    else:
        trace_mean = full_data.mean()
        trace_std = full_data.std()

        # -------------------------------------------------
        layout = go.Layout(title='Data mean')
        figure = go.Figure(go.Bar(y=trace_mean,
                                  x=cols
                                  ), layout=layout)
        figure.show()
        # -------------------------------------------------
        layout = go.Layout(title='Data standard deviation')
        figure = go.Figure(go.Bar(y=trace_std,
                                  x=cols
                                  ), layout=layout)
        figure.show()

The classifier is built using *sklearn* library. To get the best results, it needs to be parameterized and analyzed with different combinations of parameters, like:
* Numbers of neighbors
* Number of features and proper algorithm to choose them
* Different metrics and weights in KNN algorithm
* Non-optimized or normalized / standardized data

## Function: knn_classifier

### **@description:** 
> Function to fit KNN classifier for different params and receive metrics, like accuracy percentage, false positives percentage and false positives percentage.
### **@params:**


>*   input data – data applied to teach algorithm
>*   labels – 0/1 label for each sample in input data
>*   n_neighbors - upper limit of neighbors to classify with (descending order):
default is 1
>*   run_for_features – tuple of starting number of features to select, step, upper limit: default is (3, 30, 20)

### **@returns:**
> Nothing








In [0]:
@log_to_file(logger_file='KNN_results.log')
def knn_classifier(input_data, labels, n_neighbors=1, run_for_features=(3, 30, 20)):
    log = get_logger(logger_file='KNN_results.log')
    for n in range(1, n_neighbors + 1):
        log.info(f"-Running KNN algorithm with n_neighbors={n}")
        weight = 'distance'
        classifier = neighbors.KNeighborsClassifier(n, weights=weight, metric='euclidean')
        for k in range(*run_for_features):
            log.info(f"--Finding best {k} features")
            best_features_data = SelectKBest(f_classif, k=k).fit_transform(input_data, labels)
            X_train, X_test, Y_train, Y_test = train_test_split(best_features_data, labels, test_size=0.3)

            log.info(f"---Starting fitting for weight={weight}")
            classifier.fit(X_train, Y_train)

            score = classifier.score(X_test, Y_test)

            """ Confusion matrix """
            log.info("-----Measuring confusion matrix...")
            result = classifier.predict(X_test)
            conf_matrix = confusion_matrix(Y_test, result)

            log.info(f"----KNN accuracy is: {score * 100}%")

            precision = conf_matrix[0][0] / (conf_matrix[0][0] + conf_matrix[0][1]) * 100
            log.info(f"------Precision in percent: {precision}%")

            recall = conf_matrix[0][0] / (conf_matrix[0][0] + conf_matrix[1][0]) * 100
            log.info(f"------Recall in percent: {recall}%")

            false_positives = conf_matrix[0][1] / sum(conf_matrix[0]) * 100
            log.info(f"------False positives in percent: {false_positives}%")

            false_negatives = conf_matrix[1][0] / sum(conf_matrix[1]) * 100
            log.info(f"------False negatives in percent: {false_negatives}%")


The output of functions is collected by the logger module. Output file with results is named *KNN_results.log*.

Below is the section, which evokes the above functions. It is *main* part of the program.

In [10]:
logger_setup(logger_file='KNN_results.log')
get_logger(logger_file='KNN_results.log').info("---------------Starting---------------")
get_logger(logger_file='KNN_results.log').info("--------------------------------------")

[22-12-19 13:49:03 INFO] ---------------Starting---------------
[22-12-19 13:49:03 INFO] --------------------------------------


In [11]:
input_data, labels = malware_data_transform(enable_figures=True)
knn_classifier(input_data, labels, 1, (3, 24, 20))

[22-12-19 13:49:03 INFO] Running malware_data_transform...
[22-12-19 13:49:03 INFO] -Reading csv file
[22-12-19 13:49:04 INFO] --Figures enabled


[22-12-19 13:49:05 INFO] Function malware_data_transform finished in 1.827791452407837s
[22-12-19 13:49:05 INFO] Running knn_classifier...
[22-12-19 13:49:05 INFO] -Running KNN algorithm with n_neighbors=1
[22-12-19 13:49:05 INFO] --Finding best 3 features
[22-12-19 13:49:05 INFO] ---Starting fitting for weight=distance
[22-12-19 13:49:11 INFO] -----Measuring confusion matrix...
[22-12-19 13:49:13 INFO] ----KNN accuracy is: 96.57853434745866%
[22-12-19 13:49:13 INFO] ------Precision in percent: 97.7166741510933%
[22-12-19 13:49:13 INFO] ------Recall in percent: 97.39705274755543%
[22-12-19 13:49:13 INFO] ------False positives in percent: 2.283325848906698%
[22-12-19 13:49:13 INFO] ------False negatives in percent: 6.064495427562972%
[22-12-19 13:49:13 INFO] --Finding best 23 features
[22-12-19 13:49:13 INFO] ---Starting fitting for weight=distance
[22-12-19 13:49:27 INFO] -----Measuring confusion matrix...
[22-12-19 13:49:37 INFO] ----KNN accuracy is: 97.61439092116383%
[22-12-19 13:49

When data is a raw input without optimization, charts show that the value of one feature dominates another one. The function executes quickly, because it may enter this dominant value and pretend based on its value. However, results, in this case, may not be the most accurate.

In [12]:
get_logger(logger_file='KNN_results.log').info("--------------------------------------")
input_data, labels = malware_data_transform(optimize_data='normalize', enable_figures=True)
knn_classifier(input_data, labels, 1, (3, 24, 20))

[22-12-19 13:49:37 INFO] --------------------------------------
[22-12-19 13:49:37 INFO] Running malware_data_transform...
[22-12-19 13:49:37 INFO] -Reading csv file
[22-12-19 13:49:38 INFO] --Data normalization processing...
[22-12-19 13:49:38 INFO] --Figures enabled


[22-12-19 13:49:39 INFO] Function malware_data_transform finished in 1.5302696228027344s
[22-12-19 13:49:39 INFO] Running knn_classifier...
[22-12-19 13:49:39 INFO] -Running KNN algorithm with n_neighbors=1
[22-12-19 13:49:39 INFO] --Finding best 3 features
[22-12-19 13:49:39 INFO] ---Starting fitting for weight=distance
[22-12-19 13:49:43 INFO] -----Measuring confusion matrix...
[22-12-19 13:49:44 INFO] ----KNN accuracy is: 96.40951346130629%
[22-12-19 13:49:44 INFO] ------Precision in percent: 97.47283729885848%
[22-12-19 13:49:44 INFO] ------Recall in percent: 97.41589636094979%
[22-12-19 13:49:44 INFO] ------False positives in percent: 2.527162701141521%
[22-12-19 13:49:44 INFO] ------False negatives in percent: 6.098451058308329%
[22-12-19 13:49:44 INFO] --Finding best 23 features
[22-12-19 13:49:45 INFO] ---Starting fitting for weight=distance
[22-12-19 13:50:35 INFO] -----Measuring confusion matrix...
[22-12-19 13:51:05 INFO] ----KNN accuracy is: 99.17179765785343%
[22-12-19 13:

After performing data normalization - a process of eliminating units of data, enabling to compare data more easily - mean value and std are looking better. It affects algorithm execution time, which doubles, but it performs better on data rescaled to values ​​between 0 and 1. The algorithm predicts classes more accurate and False positives/negatives results lower.
Precision and recall measurements for normalized data is above 99%, which is a satisfying result.

In [13]:
get_logger(logger_file='KNN_results.log').info("--------------------------------------")
input_data, labels = malware_data_transform(optimize_data='standardize', enable_figures=True)
knn_classifier(input_data, labels, 1, (3, 24, 20))

[22-12-19 13:51:05 INFO] --------------------------------------
[22-12-19 13:51:05 INFO] Running malware_data_transform...
[22-12-19 13:51:05 INFO] -Reading csv file
[22-12-19 13:51:06 INFO] --Data standardization processing...
[22-12-19 13:51:06 INFO] --Figures enabled


[22-12-19 13:51:07 INFO] Function malware_data_transform finished in 1.600907802581787s
[22-12-19 13:51:07 INFO] Running knn_classifier...
[22-12-19 13:51:07 INFO] -Running KNN algorithm with n_neighbors=1
[22-12-19 13:51:07 INFO] --Finding best 3 features
[22-12-19 13:51:07 INFO] ---Starting fitting for weight=distance
[22-12-19 13:51:11 INFO] -----Measuring confusion matrix...
[22-12-19 13:51:12 INFO] ----KNN accuracy is: 96.36363636363636%
[22-12-19 13:51:12 INFO] ------Precision in percent: 97.57256481768412%
[22-12-19 13:51:12 INFO] ------Recall in percent: 97.25110676413054%
[22-12-19 13:51:12 INFO] ------False positives in percent: 2.427435182315876%
[22-12-19 13:51:12 INFO] ------False negatives in percent: 6.474296799224054%
[22-12-19 13:51:12 INFO] --Finding best 23 features
[22-12-19 13:51:12 INFO] ---Starting fitting for weight=distance
[22-12-19 13:51:50 INFO] -----Measuring confusion matrix...
[22-12-19 13:52:17 INFO] ----KNN accuracy is: 99.09694555112883%
[22-12-19 13:5

After data standardization, there are similar results and conclusions as with data normalization. Both are giving promising results above 99%.

In [14]:
get_logger(logger_file='KNN_results.log').info("--------------------------------------")
get_logger(logger_file='KNN_results.log').info("---------------Stopping---------------")

[22-12-19 13:52:17 INFO] --------------------------------------
[22-12-19 13:52:17 INFO] ---------------Stopping---------------


Below is an output from functions run above.

In [15]:
import os
if os.path.exists('KNN_results.log'):
    !cat KNN_results.log

[22-12-19 13:40:10 INFO] ---------------Starting---------------
[22-12-19 13:40:10 INFO] --------------------------------------
[22-12-19 13:40:10 INFO] -Reading csv file
[22-12-19 13:40:11 INFO] --Figures enabled
[22-12-19 13:40:12 INFO] -Running KNN algorithm with n_neighbors=1
[22-12-19 13:40:12 INFO] --Finding best 3 features
[22-12-19 13:40:12 INFO] ---Starting fitting for weight=distance
[22-12-19 13:40:18 INFO] -----Measuring confusion matrix...
[22-12-19 13:40:20 INFO] ----KNN accuracy is: 96.43607388627309%
[22-12-19 13:40:20 INFO] ------Precision in percent: 97.59364750561022%
[22-12-19 13:40:20 INFO] ------Recall in percent: 97.3181395669088%
[22-12-19 13:40:20 INFO] ------False positives in percent: 2.4063524943897807%
[22-12-19 13:40:20 INFO] ------False negatives in percent: 6.257028112449799%
[22-12-19 13:40:20 INFO] --Finding best 23 features
[22-12-19 13:40:20 INFO] ---Starting fitting for weight=distance
[22-12-19 13:40:33 INFO] -----Measuring confusion matrix...
[22-