<a href="https://colab.research.google.com/github/kamkali/Malicious_Discovery/blob/master/DL_Malware_Notebook.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


# Welcome to the Jupyter Notebook with Google Colab! Here we will post our research on discovering malware using machine learning.

*authors: Jakub Burghardt, Kamil Kaliś, Michał Szczepaniak-Krupowski*

In [1]:
!pip install pefile
def sup():
  print("Hello World!")
sup()

Collecting pefile
  Downloading https://files.pythonhosted.org/packages/36/58/acf7f35859d541985f0a6ea3c34baaefbfaee23642cf11e85fe36453ae77/pefile-2019.4.18.tar.gz (62kB)
Building wheels for collected packages: pefile
  Building wheel for pefile (setup.py) ... done
  Created wheel for pefile: filename=pefile-2019.4.18-cp36-none-any.whl size=60823 sha256=a0d1fd16bc57dcee096dfde88af1be17db00ea13205ed1a5193bb4f1fd0d6660
  Stored in directory: /root/.cache/pip/wheels/1c/a1/95/4f33011a0c013c872fe6f0f364dc463a2588120

In [0]:
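def upload_files():
    """Upload files from the local machine and save them in the Colab working directory."""
    from google.colab import files  # only available inside Google Colab
    uploaded = files.upload()
    for name, content in uploaded.items():
        # Use a context manager so each file handle is closed after writing.
        with open(name, 'wb') as f:
            f.write(content)
    return list(uploaded.keys())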

-----

# Part I: planning


## Main tasks:
- get acquainted with the malware topic:
    - types of malware
    - frequency of appearance on different operating systems
    - detection methods: static and dynamic
    - trends
- get familiar with the machine learning topic:
    - selection of different algorithms
- pick the most significant features
- pick a dataset:
    - should be up to date (as current as possible)
    - should contain the features selected earlier
- analyze the efficiency of the chosen algorithms:
    - statistical comparison of accuracy, true positives, false negatives, etc.

## Assignment of the task:
### Michał Szczepaniak-Krupowski:
A deeper look at malware – types and families, comparison between static and dynamic methods.

The popularity of malware on different operating systems – why are Windows users the most exposed to malicious software attacks?

### Kamil Kaliś:
What is Machine Learning and why is it crucial in Cybersecurity – the need to accelerate the adoption of ML in modern malware detection.

Comparison of different supervised learning algorithms – pros and cons in malware detection.

### Jakub Burghardt:
Malware detection methods – more detailed reasoning about static detection methods.

A brief look at the PE (Portable Executable) format – how can it help in detecting malware with static methods?
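
To make the static-analysis idea concrete, here is a minimal sketch of reading a few header fields straight from the bytes of a PE file, without executing it. It is not part of the project code: the helper name `read_coff_header` is hypothetical, the sample bytes are synthetic, and only the DOS signature, the PE signature, and the 20-byte COFF header are parsed. A real project would more likely rely on the *pefile* library installed above.

```python
import struct


def read_coff_header(data: bytes) -> dict:
    """Return a few COFF header fields from the raw bytes of a PE file."""
    if data[:2] != b"MZ":
        raise ValueError("missing DOS 'MZ' signature")
    # e_lfanew: 4-byte little-endian offset of the PE signature, stored at 0x3C.
    (pe_offset,) = struct.unpack_from("<I", data, 0x3C)
    if data[pe_offset:pe_offset + 4] != b"PE\x00\x00":
        raise ValueError("missing PE signature")
    # The 20-byte COFF header follows the 4-byte PE signature.
    (machine, n_sections, timestamp, _symbol_ptr, _n_symbols,
     optional_size, characteristics) = struct.unpack_from("<HHIIIHH", data, pe_offset + 4)
    return {
        "Machine": machine,
        "NumberOfSections": n_sections,
        "TimeDateStamp": timestamp,
        "SizeOfOptionalHeader": optional_size,
        "Characteristics": characteristics,
    }


# Synthetic header for illustration only (not a real executable).
sample = (
    b"MZ" + b"\x00" * (0x3C - 2)          # DOS header up to e_lfanew
    + struct.pack("<I", 0x40)             # e_lfanew points right after the DOS header
    + b"PE\x00\x00"                       # PE signature
    + struct.pack("<HHIIIHH", 0x14C, 3, 0, 0, 0, 0xE0, 0x0102)
)
header = read_coff_header(sample)
print(header)
```

Values such as these can be collected for many files and used as numeric features for a classifier.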

-----

# Mid-term presentation
### link to presentation:
https://drive.google.com/open?id=1x1DcDueWrDM8XHEd-sKMWykt5G3wMW9X-8NbKFgRhmk

# Project implementation:

># *'Before'* section:

Code snippet to fetch MalwareData.csv if it is not already in the Colab directory:

In [3]:
import os
if not os.path.exists('MalDiscoveryData'):
    !git clone https://github.com/kamkali/MalDiscoveryData
    !unzip MalDiscoveryData/MalwareData.csv.zip

Cloning into 'MalDiscoveryData'...
remote: Enumerating objects: 8, done.
remote: Counting objects: 100% (8/8), done.
remote: Compressing objects: 100% (6/6), done.
remote: Total 8 (delta 0), reused 0 (delta 0), pack-reused 0
Unpacking objects: 100% (8/8), done.
Archive:  MalDiscoveryData/MalwareData.csv.zip
  inflating: MalwareData.csv         


Logger module to save outputs from functions:

In [0]:
import logging
import functools
import time

"""Logger module created by Kamil Kaliś"""


def logger_setup():
    """Sets up the logger.
    Usage:
        1. Decorate a function with @log_exec_time, or
        2. Use the logger directly within a module:
            2.1 log = get_logger('exec_time.log')
            2.2 call log.info(...) to write messages to the console and to exec_time.log"""

    logformat = "[%(asctime)s %(levelname)s] %(message)s"
    dateformat = "%d-%m-%y %H:%M:%S"
    logger = logging.getLogger("exec_time.log")
    if logger.handlers:
        # Already configured; adding the handlers again would duplicate every message.
        return
    formatter = logging.Formatter(logformat, datefmt=dateformat)
    fh = logging.FileHandler("exec_time.log", mode="a")
    fh.setFormatter(formatter)
    sh = logging.StreamHandler()
    sh.setFormatter(formatter)
    logger.setLevel(logging.INFO)
    logger.addHandler(fh)
    logger.addHandler(sh)
    # Do not pass records to the root logger, which would print them a second time.
    logger.propagate = False


def log_exec_time(func):
    """Decorator that logs the execution time of the wrapped function.
    Usage:
    Add the decorator above the function: '@log_exec_time'"""

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        # logger_setup()
        log = logging.getLogger('exec_time.log')
        log.info(f"Running {func.__name__}...")
        start_time = time.time()
        result = func(*args, **kwargs)
        exec_time = time.time() - start_time
        log.info(f"Function {func.__name__} finished in {exec_time}s")
        return result

    return wrapper


def get_logger(logger_name='exec_time.log'):
    return logging.getLogger(logger_name)
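
The timing-decorator pattern used above can be tried in isolation. The sketch below is a simplified stand-in, not the module itself: the logger name `demo_exec_time`, the decorator `timed`, and the function `add` are hypothetical, and messages go to an in-memory stream instead of `exec_time.log` so the result is easy to inspect.

```python
import functools
import io
import logging
import time

# In-memory stream so the logged lines can be inspected afterwards.
stream = io.StringIO()
demo_log = logging.getLogger("demo_exec_time")  # hypothetical logger name
demo_log.setLevel(logging.INFO)
demo_log.propagate = False
demo_log.handlers.clear()  # start clean if this cell is re-run
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter("[%(levelname)s] %(message)s"))
demo_log.addHandler(handler)


def timed(func):
    """Log when `func` starts and how long it took to finish."""
    @functools.wraps(func)  # keep the name and docstring of the wrapped function
    def wrapper(*args, **kwargs):
        demo_log.info(f"Running {func.__name__}...")
        start = time.perf_counter()
        result = func(*args, **kwargs)
        demo_log.info(f"Function {func.__name__} finished in {time.perf_counter() - start}s")
        return result
    return wrapper


@timed
def add(a, b):
    return a + b


total = add(2, 3)
lines = stream.getvalue().splitlines()
print(total)
print("\n".join(lines))
```

Because of `functools.wraps`, the decorated function still reports its own name, which is what makes the log messages readable.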


--------

> # K-Nearest Neighbors:
*author: Kamil Kaliś*

One simple way to approach malware detection is the K-Nearest Neighbors (KNN) algorithm, which classifies a sample according to the labels of the training samples closest to it. We use the KNN model from the third-party *sklearn* library.
The data to work on is collected in the *MalwareData.csv* file. It is processed with the *pandas* library and sklearn modules, and the *plotly* library is used to visualize data and charts.
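
As a rough illustration of the idea behind KNN, here is a minimal sketch in plain Python. It is not the sklearn implementation used in this notebook: the function `knn_predict` and the toy two-feature samples are made up for the example, and the vote is weighted by inverse Euclidean distance, which is one common weighting choice.

```python
import math
from collections import defaultdict


def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by a distance-weighted vote among its k nearest training points."""
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    votes = defaultdict(float)
    for distance, label in distances[:k]:
        if distance == 0:
            return label  # simplification: an exact match decides on its own
        votes[label] += 1.0 / distance  # closer neighbors count more
    return max(votes, key=votes.get)


# Toy data: label 1 = clean file, label 0 = malware (same convention as the dataset).
train_points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
train_labels = [1, 1, 1, 0, 0, 0]

near_clean = knn_predict(train_points, train_labels, (1.1, 1.0), k=3)
near_malware = knn_predict(train_points, train_labels, (4.9, 5.0), k=3)
print(near_clean, near_malware)
```

Since the prediction depends directly on distances between feature vectors, features with very different value ranges can dominate the result, which is one reason to consider scaling the data first.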

Let's start by importing the required libraries and modules:

In [0]:
import pandas as pd
import plotly.graph_objects as go
from plotly.offline import plot
from sklearn import neighbors, preprocessing
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.metrics import confusion_matrix

Then, read data to work on:

In [6]:
data = pd.read_csv('MalwareData.csv', sep='|')
print(data.shape[0], data.shape[1])
# Count rows per class directly from the label column (1 = clean, 0 = malware).
benign_files = (data['legitimate'] == 1).sum()
malware_files = (data['legitimate'] == 0).sum()
print(f"Clean files count is {benign_files} and malware files count is {malware_files}")

138047 57
Clean files count is 41323 and malware files count is 96724


MalwareData.csv contains 138 047 records in total: 41 323 benign (clean) files and 96 724 malicious files.
Each record has 57 columns. The first 56 describe the file (including the *Name* and *md5* identifiers, which are dropped during the transform), and the last one, *legitimate*, is the label that classifies the sample:

'1' – for clean file, '0' – for malware file.
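
The counts above also show that the two classes are not balanced: roughly 30% of the samples are clean and roughly 70% are malware. A short calculation makes this explicit and gives a baseline to compare classifiers against. The variable names are illustrative only.

```python
# Counts reported for MalwareData.csv in the output above.
total = 138047
benign = 41323
malware = 96724

benign_share = benign / total
malware_share = malware / total

# Accuracy of a trivial model that assigns every file to the larger class.
majority_baseline = max(benign_share, malware_share)

print(f"benign share:  {benign_share:.2%}")
print(f"malware share: {malware_share:.2%}")
print(f"majority-class baseline accuracy: {majority_baseline:.2%}")
```

Any classifier evaluated on this dataset should clearly beat that baseline, and accuracy alone can hide errors on the smaller class, which is why false positives and false negatives are measured as well.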

Knowing all this, we can start transforming the data into a more algorithm-friendly form.

## Function: malware_data_transform

### **@description:** 
> Transforms the data by dropping the insignificant columns and the 'legitimate' column, which contains the labels. It can optionally normalize or standardize the data.

### **@params:**

>*   optimize_data – data scaling to apply, either 'normalize' or 'standardize':
default is None
>*   csv_data – CSV file to read and transform:
default is MalwareData.csv
>*   csv_sep – separator used in the CSV file:
default is '|'
>*   enable_figures – enables plotly charts to show:
default is False

### **@returns:**


> full_data.values, labels – tuple with the feature values and the labels for malicious and clean files








In [0]:
@log_exec_time
def malware_data_transform(optimize_data=None, csv_data='MalwareData.csv', csv_sep='|', enable_figures=False):
    log = get_logger()
    log.info("-Reading csv file")
    full_data = pd.read_csv(csv_data, sep=csv_sep)

    pd.set_option("display.max_columns", None)

    # Separate the labels, then drop them together with the identifier columns.
    labels = full_data['legitimate'].values
    full_data: pd.DataFrame = full_data.drop(['Name', 'md5', 'legitimate'], axis=1)

    if optimize_data == 'normalize':
        log.info("--Data normalization processing...")
        # Rescale every feature to [0, 1]; keep the column names for the charts.
        full_data = pd.DataFrame(preprocessing.MinMaxScaler().fit_transform(full_data),
                                 columns=full_data.columns)
    elif optimize_data == 'standardize':
        log.info("--Data standardization processing...")
        # Shift every feature to zero mean and unit variance; keep the column names.
        full_data = pd.DataFrame(preprocessing.StandardScaler().fit_transform(full_data),
                                 columns=full_data.columns)

    if enable_figures:
        log.info("--Figures enabled")
        plot_bar_figures(full_data, optimize_data)

    return full_data.values, labels


When transforming data, it helps to see what it looks like. The next function plots the data in its non-optimized, normalized, or standardized form.
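
To show what the two scaling options do to a single feature, here is a minimal sketch in plain Python. The helper names `min_max_scale` and `standardize` and the sample values are made up for the example; the notebook itself uses sklearn's `MinMaxScaler` and `StandardScaler`.

```python
import statistics


def min_max_scale(values):
    """Normalization: map the smallest value to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    # Assumes the column is not constant, otherwise hi - lo would be zero.
    return [(v - lo) / (hi - lo) for v in values]


def standardize(values):
    """Standardization: shift to zero mean and scale to unit standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population standard deviation
    return [(v - mean) / std for v in values]


feature = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # illustrative values only
normalized = min_max_scale(feature)
standardized = standardize(feature)

print(normalized)
print(standardized)
```

Normalization bounds every feature to the same range, while standardization centers it and expresses each value in units of standard deviation, so the bar charts for the two forms look quite different from the raw data.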

## Function: plot_bar_figures

### **@description:** 
> Plots bar charts of the mean and standard deviation of each feature
### **@params:**


>*   optimize_data – scaling applied before plotting, either 'normalize' or 'standardize':
default is None
>*   full_data – transformed data

### **@returns:**
> Nothing








In [0]:
def plot_bar_figures(full_data, optimize_data):
    """Plot bar charts of the per-feature mean and standard deviation."""
    cols = full_data.keys()
    if optimize_data == 'normalize':
        plot_data = pd.DataFrame(preprocessing.MinMaxScaler().fit_transform(full_data))
        title_prefix = 'Normalized data'
    elif optimize_data == 'standardize':
        plot_data = pd.DataFrame(preprocessing.StandardScaler().fit_transform(full_data))
        title_prefix = 'Standardized data'
    else:
        plot_data = full_data
        title_prefix = 'Data'

    # One chart for the mean and one for the standard deviation of every feature.
    for values, suffix in ((plot_data.mean(), 'mean'),
                           (plot_data.std(), 'standard deviation')):
        layout = go.Layout(title=f'{title_prefix} {suffix}')
        figure = go.Figure(go.Bar(y=values, x=cols), layout=layout)
        figure.show()

In [0]:
@log_exec_time
def knn_classifier(input_data, labels, n_neighbors=1, run_for_features=(3, 30, 20)):
    log = get_logger()
    for n in range(1, n_neighbors + 1):
        log.info(f"-Running KNN algorithm with n_neighbors={n}")
        weight = 'distance'
        classifier = neighbors.KNeighborsClassifier(n, weights=weight, metric='euclidean')
        for k in range(*run_for_features):
            log.info(f"--Finding best {k} features")
            best_features_data = SelectKBest(f_classif, k=k).fit_transform(input_data, labels)
            X_train, X_test, Y_train, Y_test = train_test_split(best_features_data, labels, test_size=0.3)

            log.info(f"---Starting fitting for weight={weight}")
            classifier.fit(X_train, Y_train)

            score = classifier.score(X_test, Y_test)

            log.info(f"----KNN accuracy is: {score * 100}%")

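            # Confusion matrix: rows are true labels, columns are predicted labels,
            # both ordered as [0 (malware), 1 (clean)].
            log.info("-----Measuring confusion matrix...")
            result = classifier.predict(X_test)
            conf_matrix = confusion_matrix(Y_test, result)

            # "Positive" here means label 1 (clean file), so a false positive is
            # malware classified as clean and a false negative is a clean file
            # classified as malware.
            false_positives = conf_matrix[0][1] / sum(conf_matrix[0]) * 100
            false_negatives = conf_matrix[1][0] / sum(conf_matrix[1]) * 100
            log.info(f"------False positives in percent: {false_positives}%")
            log.info(f"------False negatives in percent: {false_negatives}%")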
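
The false positive and false negative percentages computed above can be checked by hand on a tiny example. The sketch below builds the 2x2 matrix in plain Python with the same layout sklearn's `confusion_matrix` uses (rows are true labels, columns are predictions). The helper `confusion_counts` and the label lists are made up for illustration.

```python
def confusion_counts(true_labels, predicted_labels):
    """Build a 2x2 matrix: rows are true labels, columns are predicted labels."""
    matrix = [[0, 0], [0, 0]]
    for truth, predicted in zip(true_labels, predicted_labels):
        matrix[truth][predicted] += 1
    return matrix


# Illustrative labels: 0 = malware, 1 = clean file.
true_labels = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
predicted = [0, 0, 0, 1, 1, 1, 1, 1, 0, 0]

matrix = confusion_counts(true_labels, predicted)

# Same formulas as in knn_classifier: each rate is relative to its true class.
false_positives = matrix[0][1] / sum(matrix[0]) * 100  # malware predicted as clean
false_negatives = matrix[1][0] / sum(matrix[1]) * 100  # clean predicted as malware

print(matrix)
print(f"False positives: {false_positives}%")
print(f"False negatives: {false_negatives}%")
```

With label 1 treated as the positive class, the false positive rate is the share of malware that slips through as clean, which is usually the more costly error for a malware detector.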


In [10]:
logger_setup()
get_logger().info("---------------Starting---------------")
get_logger().info("--------------------------------------")

[18-12-19 13:33:40 INFO] ---------------Starting---------------
[18-12-19 13:33:40 INFO] --------------------------------------


In [11]:
input_data, labels = malware_data_transform(enable_figures=True)
# knn_classifier(input_data, labels, 1, (3, 24, 20))

[18-12-19 13:33:40 INFO] Running malware_data_transform...
[18-12-19 13:33:40 INFO] -Reading csv file
[18-12-19 13:33:40 INFO] --Figures enabled


[18-12-19 13:33:42 INFO] Function malware_data_transform finished in 2.4588699340820312s


In [12]:
get_logger().info("--------------------------------------")
input_data, labels = malware_data_transform(optimize_data='normalize', enable_figures=True)
# knn_classifier(input_data, labels, 1, (3, 24, 20))

[18-12-19 13:33:42 INFO] --------------------------------------
[18-12-19 13:33:42 INFO] Running malware_data_transform...
[18-12-19 13:33:42 INFO] -Reading csv file
[18-12-19 13:33:43 INFO] --Data normalization processing...
[18-12-19 13:33:43 INFO] --Figures enabled


[18-12-19 13:33:44 INFO] Function malware_data_transform finished in 1.5148811340332031s


In [13]:
get_logger().info("--------------------------------------")
input_data, labels = malware_data_transform(optimize_data='standardize', enable_figures=True)
# knn_classifier(input_data, labels, 1, (3, 24, 20))

[18-12-19 13:33:44 INFO] --------------------------------------
[18-12-19 13:33:44 INFO] Running malware_data_transform...
[18-12-19 13:33:44 INFO] -Reading csv file
[18-12-19 13:33:44 INFO] --Data standardization processing...
[18-12-19 13:33:45 INFO] --Figures enabled


[18-12-19 13:33:45 INFO] Function malware_data_transform finished in 1.5526783466339111s


In [15]:
get_logger().info("--------------------------------------")
get_logger().info("---------------Stopping---------------")

[18-12-19 13:35:32 INFO] --------------------------------------
[18-12-19 13:35:32 INFO] ---------------Stopping---------------


In [16]:
import os
if os.path.exists('exec_time.log'):
    !cat exec_time.log

[18-12-19 13:33:40 INFO] ---------------Starting---------------
[18-12-19 13:33:40 INFO] --------------------------------------
[18-12-19 13:33:40 INFO] Running malware_data_transform...
[18-12-19 13:33:40 INFO] -Reading csv file
[18-12-19 13:33:40 INFO] --Figures enabled
[18-12-19 13:33:42 INFO] Function malware_data_transform finished in 2.4588699340820312s
[18-12-19 13:33:42 INFO] --------------------------------------
[18-12-19 13:33:42 INFO] Running malware_data_transform...
[18-12-19 13:33:42 INFO] -Reading csv file
[18-12-19 13:33:43 INFO] --Data normalization processing...
[18-12-19 13:33:43 INFO] --Figures enabled
[18-12-19 13:33:44 INFO] Function malware_data_transform finished in 1.5148811340332031s
[18-12-19 13:33:44 INFO] --------------------------------------
[18-12-19 13:33:44 INFO] Running malware_data_transform...
[18-12-19 13:33:44 INFO] -Reading csv file
[18-12-19 13:33:44 INFO] --Data standardization processing...
[18-12-19 13:33:45 INFO] --Figures enabled
[18-12-19