In [1]:
import pandas as pd
import numpy as np

In [2]:
# Get the dataset path and output the dataframe
malware_filepath = './datasets/Obfuscated/Obfuscated-MalMem2022.csv'
malware_data_raw = pd.read_csv(malware_filepath)

In [3]:
malware_data_raw

Unnamed: 0,Category,pslist.nproc,pslist.nppid,pslist.avg_threads,pslist.nprocs64bit,pslist.avg_handlers,dlllist.ndlls,dlllist.avg_dlls_per_proc,handles.nhandles,handles.avg_handles_per_proc,...,svcscan.kernel_drivers,svcscan.fs_drivers,svcscan.process_services,svcscan.shared_process_services,svcscan.interactive_process_services,svcscan.nactive,callbacks.ncallbacks,callbacks.nanonymous,callbacks.ngeneric,Class
0,Benign,45,17,10.555556,0,202.844444,1694,38.500000,9129,212.302326,...,221,26,24,116,0,121,87,0,8,Benign
1,Benign,47,19,11.531915,0,242.234043,2074,44.127660,11385,242.234043,...,222,26,24,118,0,122,87,0,8,Benign
2,Benign,40,14,14.725000,0,288.225000,1932,48.300000,11529,288.225000,...,222,26,27,118,0,120,88,0,8,Benign
3,Benign,32,13,13.500000,0,264.281250,1445,45.156250,8457,264.281250,...,222,26,27,118,0,120,88,0,8,Benign
4,Benign,42,16,11.452381,0,281.333333,2067,49.214286,11816,281.333333,...,222,26,24,118,0,124,87,0,8,Benign
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
58591,Ransomware-Shade-fa03be3078d1b9840f06745f160eb...,37,15,10.108108,0,215.486487,1453,39.270270,7973,215.486487,...,221,26,24,116,0,120,86,0,8,Malware
58592,Ransomware-Shade-f56687137caf9a67678cde91e4614...,37,14,9.945946,0,190.216216,1347,36.405405,7038,190.216216,...,221,26,24,116,0,116,88,0,8,Malware
58593,Ransomware-Shade-faddeea111a25da4d0888f3044ae9...,38,15,9.842105,0,210.026316,1448,38.105263,7982,215.729730,...,221,26,24,116,0,120,88,0,8,Malware
58594,Ransomware-Shade-f866c086af2e1d8ebaa6f2c863157...,37,15,10.243243,0,215.513513,1452,39.243243,7974,215.513513,...,221,26,24,116,0,120,87,0,8,Malware


### Observations
Looking at the dataset, we see that it has a label column (Class), which makes it suitable for a classification task. This makes our lives much easier, as we do not have to manually read and label each sample to determine its class. We also have a Category column, which appears to be a multi-class column that specifies what type of malware the sample is identified as. Unfortunately, we have to parse and clean this column because an extra string (including the hash of the malware) is appended after the category. Since this parsing seemed less straightforward to do in the Jupyter notebook, we used Microsoft Excel to perform it.
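
For reproducibility, the same cleanup could also be sketched in pandas. The snippet below is only a minimal sketch, not the procedure actually used here (that was Excel). It assumes the category is always the text before the first hyphen, and the sample rows and hash strings are hypothetical stand-ins for the raw data.

```python
import pandas as pd

# Hypothetical rows imitating the raw Category column (family names and hashes are made up)
sample = pd.DataFrame({
    "Category": [
        "Benign",
        "Ransomware-Shade-0123abcd",
        "Spyware-Example-4567ef01",
        "Trojan-Example-89ab2345",
    ]
})

# Assumption: the category is the text before the first hyphen.
# Values without a hyphen (such as "Benign") are left unchanged.
sample["Category"] = sample["Category"].str.split("-", n=1).str[0]

print(sample["Category"].tolist())
```

Doing this step in code would keep the whole cleaning pipeline inside the notebook, so the edited CSV could be regenerated from the raw file at any time.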

The result is the re-engineered MalMem dataset shown below.

In [4]:
new_malware_filepath = './datasets/Obfuscated/Obfuscated-MalMem2022_edited.csv'
malware_data = pd.read_csv(new_malware_filepath)

In [5]:
malware_data

Unnamed: 0,Category,pslist.nproc,pslist.nppid,pslist.avg_threads,pslist.nprocs64bit,pslist.avg_handlers,dlllist.ndlls,dlllist.avg_dlls_per_proc,handles.nhandles,handles.avg_handles_per_proc,...,svcscan.kernel_drivers,svcscan.fs_drivers,svcscan.process_services,svcscan.shared_process_services,svcscan.interactive_process_services,svcscan.nactive,callbacks.ncallbacks,callbacks.nanonymous,callbacks.ngeneric,Class
0,Benign,45,17,10.555556,0,202.844444,1694,38.500000,9129,212.302326,...,221,26,24,116,0,121,87,0,8,Benign
1,Benign,47,19,11.531915,0,242.234043,2074,44.127660,11385,242.234043,...,222,26,24,118,0,122,87,0,8,Benign
2,Benign,40,14,14.725000,0,288.225000,1932,48.300000,11529,288.225000,...,222,26,27,118,0,120,88,0,8,Benign
3,Benign,32,13,13.500000,0,264.281250,1445,45.156250,8457,264.281250,...,222,26,27,118,0,120,88,0,8,Benign
4,Benign,42,16,11.452381,0,281.333333,2067,49.214286,11816,281.333333,...,222,26,24,118,0,124,87,0,8,Benign
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
58591,Ransomware,37,15,10.108108,0,215.486487,1453,39.270270,7973,215.486487,...,221,26,24,116,0,120,86,0,8,Malware
58592,Ransomware,37,14,9.945946,0,190.216216,1347,36.405405,7038,190.216216,...,221,26,24,116,0,116,88,0,8,Malware
58593,Ransomware,38,15,9.842105,0,210.026316,1448,38.105263,7982,215.729730,...,221,26,24,116,0,120,88,0,8,Malware
58594,Ransomware,37,15,10.243243,0,215.513513,1452,39.243243,7974,215.513513,...,221,26,24,116,0,120,87,0,8,Malware


### Data Cleaning
Now that the Category column has been fixed offline in Microsoft Excel, we can perform data cleaning and feature engineering. To do this, we carefully inspect each feature to ensure that it carries information about the malware rather than, for example, an arbitrary numeric ID.

In [6]:
# print each feature from the dataset
malware_data.columns

Index(['Category', 'pslist.nproc', 'pslist.nppid', 'pslist.avg_threads',
       'pslist.nprocs64bit', 'pslist.avg_handlers', 'dlllist.ndlls',
       'dlllist.avg_dlls_per_proc', 'handles.nhandles',
       'handles.avg_handles_per_proc', 'handles.nport', 'handles.nfile',
       'handles.nevent', 'handles.ndesktop', 'handles.nkey', 'handles.nthread',
       'handles.ndirectory', 'handles.nsemaphore', 'handles.ntimer',
       'handles.nsection', 'handles.nmutant', 'ldrmodules.not_in_load',
       'ldrmodules.not_in_init', 'ldrmodules.not_in_mem',
       'ldrmodules.not_in_load_avg', 'ldrmodules.not_in_init_avg',
       'ldrmodules.not_in_mem_avg', 'malfind.ninjections',
       'malfind.commitCharge', 'malfind.protection',
       'malfind.uniqueInjections', 'psxview.not_in_pslist',
       'psxview.not_in_eprocess_pool', 'psxview.not_in_ethread_pool',
       'psxview.not_in_pspcid_list', 'psxview.not_in_csrss_handles',
       'psxview.not_in_session', 'psxview.not_in_deskthrd',
       'psxv

In [7]:
malware_data['Category'].isna().iloc[0]

False

In [8]:
# Check every column for missing values across all rows
is_na = False
for col in malware_data.columns:
    if malware_data[col].isna().any():
        print(f"{col}: {malware_data[col].isna().sum()} missing values")
        is_na = True

if not is_na:
    print("Dataset does not contain not-a-number data")
else:
    print("Dataset contains not-a-number data")



Dataset does not contain not-a-number data
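
A loop is not strictly needed for this check. As a minimal sketch on a small hypothetical frame (the values and the deliberately missing entry are made up for illustration), `isna().sum()` gives a per-column count of missing values and `isna().any().any()` collapses the result to a single flag.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing entry
toy = pd.DataFrame({
    "pslist.nproc": [45, 47, 40],
    "pslist.avg_threads": [10.5, np.nan, 14.7],
})

# Number of missing values in each column
missing_per_column = toy.isna().sum()

# True if any cell anywhere in the frame is missing
has_missing = bool(toy.isna().any().any())

print(missing_per_column[missing_per_column > 0])
print("Contains missing data:", has_missing)
```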


In [9]:
malware_data.dtypes

Category                                   object
pslist.nproc                                int64
pslist.nppid                                int64
pslist.avg_threads                        float64
pslist.nprocs64bit                          int64
pslist.avg_handlers                       float64
dlllist.ndlls                               int64
dlllist.avg_dlls_per_proc                 float64
handles.nhandles                            int64
handles.avg_handles_per_proc              float64
handles.nport                               int64
handles.nfile                               int64
handles.nevent                              int64
handles.ndesktop                            int64
handles.nkey                                int64
handles.nthread                             int64
handles.ndirectory                          int64
handles.nsemaphore                          int64
handles.ntimer                              int64
handles.nsection                            int64


In [10]:
# Print the value counts of each feature to spot features that carry little information
for i in malware_data.columns:
    print(malware_data[i].value_counts())

Category
Benign        29298
Spyware       10020
Ransomware     9791
Trojan         9487
Name: count, dtype: int64
pslist.nproc
41     10012
40      9226
42      7822
44      5777
43      5616
       ...  
106        1
122        1
132        1
161        1
96         1
Name: count, Length: 114, dtype: int64
pslist.nppid
12    16559
16    12242
15     8653
17     7898
13     4404
18     2827
14     2452
19     1294
20      537
8       446
11      334
21      229
9       203
22      184
10      133
23       55
24       29
25       27
37        9
39        8
38        8
26        8
40        6
62        5
27        5
28        4
60        4
36        3
52        3
54        3
66        2
49        2
55        2
56        2
48        2
72        1
61        1
57        1
34        1
42        1
50        1
35        1
53        1
43        1
33        1
44        1
63        1
51        1
41        1
Name: count, dtype: int64
pslist.avg_threads
10.000000    619
10.162162    502
10.135135 

In [None]:
# Potentially useless features
# [pslist.nprocs64bit, handles.nport, modules.nmodules, svcscan.interactive_process_services]

Based on the previous cell, we can see that a few features have very few unique values. The features callbacks.ngeneric, callbacks.nanonymous, modules.nmodules, and psxview.not_in_eprocess_pool each have no more than three unique values. Since this is only a handful of features and not an absurd amount, we can later add them back to the dataset and evaluate whether the ML models perform better with or without them. We will create a separate dataset called "Obfuscated-MalMem2022_reduced.csv" that has both the edited Category feature and these four features dropped.
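
Low-cardinality features can also be listed programmatically instead of reading through the value counts. The following is a minimal sketch on a hypothetical toy frame. The threshold of three unique values mirrors the observation above, and the helper name `low_cardinality_columns` is an assumption introduced for illustration.

```python
import pandas as pd

def low_cardinality_columns(df, max_unique=3):
    """Return the columns of df that have at most max_unique distinct values."""
    unique_counts = df.nunique()
    return unique_counts[unique_counts <= max_unique].index.tolist()

# Hypothetical toy data: one varied feature and two nearly constant ones
toy = pd.DataFrame({
    "pslist.nproc": [45, 47, 40, 32, 42],
    "callbacks.ngeneric": [8, 8, 8, 8, 8],
    "callbacks.nanonymous": [0, 0, 1, 0, 0],
})

print(low_cardinality_columns(toy))
```

On the real data, the Class label has only two values by design and would also be returned, so label columns need to be excluded before dropping anything.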

In [12]:
# drop() returns a new DataFrame, so malware_data itself is left unchanged
low_cardinality_features = ['callbacks.ngeneric', 'callbacks.nanonymous', 'modules.nmodules', 'psxview.not_in_eprocess_pool']
malware_data_reduced = malware_data.drop(columns=low_cardinality_features)
print(malware_data_reduced.columns)

Index(['Category', 'pslist.nproc', 'pslist.nppid', 'pslist.avg_threads',
       'pslist.nprocs64bit', 'pslist.avg_handlers', 'dlllist.ndlls',
       'dlllist.avg_dlls_per_proc', 'handles.nhandles',
       'handles.avg_handles_per_proc', 'handles.nport', 'handles.nfile',
       'handles.nevent', 'handles.ndesktop', 'handles.nkey', 'handles.nthread',
       'handles.ndirectory', 'handles.nsemaphore', 'handles.ntimer',
       'handles.nsection', 'handles.nmutant', 'ldrmodules.not_in_load',
       'ldrmodules.not_in_init', 'ldrmodules.not_in_mem',
       'ldrmodules.not_in_load_avg', 'ldrmodules.not_in_init_avg',
       'ldrmodules.not_in_mem_avg', 'malfind.ninjections',
       'malfind.commitCharge', 'malfind.protection',
       'malfind.uniqueInjections', 'psxview.not_in_pslist',
       'psxview.not_in_ethread_pool', 'psxview.not_in_pspcid_list',
       'psxview.not_in_csrss_handles', 'psxview.not_in_session',
       'psxview.not_in_deskthrd', 'psxview.not_in_pslist_false_avg',
       'p

In [14]:
# index=False avoids writing the row index as an extra unnamed column
malware_data_reduced.to_csv('./datasets/Obfuscated/Obfuscated-MalMem2022_reduced.csv', index=False)
malware_data_reduced

Unnamed: 0,Category,pslist.nproc,pslist.nppid,pslist.avg_threads,pslist.nprocs64bit,pslist.avg_handlers,dlllist.ndlls,dlllist.avg_dlls_per_proc,handles.nhandles,handles.avg_handles_per_proc,...,psxview.not_in_deskthrd_false_avg,svcscan.nservices,svcscan.kernel_drivers,svcscan.fs_drivers,svcscan.process_services,svcscan.shared_process_services,svcscan.interactive_process_services,svcscan.nactive,callbacks.ncallbacks,Class
0,Benign,45,17,10.555556,0,202.844444,1694,38.500000,9129,212.302326,...,0.191489,389,221,26,24,116,0,121,87,Benign
1,Benign,47,19,11.531915,0,242.234043,2074,44.127660,11385,242.234043,...,0.127660,392,222,26,24,118,0,122,87,Benign
2,Benign,40,14,14.725000,0,288.225000,1932,48.300000,11529,288.225000,...,0.125000,395,222,26,27,118,0,120,88,Benign
3,Benign,32,13,13.500000,0,264.281250,1445,45.156250,8457,264.281250,...,0.187500,395,222,26,27,118,0,120,88,Benign
4,Benign,42,16,11.452381,0,281.333333,2067,49.214286,11816,281.333333,...,0.217391,392,222,26,24,118,0,124,87,Benign
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
58591,Ransomware,37,15,10.108108,0,215.486487,1453,39.270270,7973,215.486487,...,0.184211,389,221,26,24,116,0,120,86,Malware
58592,Ransomware,37,14,9.945946,0,190.216216,1347,36.405405,7038,190.216216,...,0.162162,389,221,26,24,116,0,116,88,Malware
58593,Ransomware,38,15,9.842105,0,210.026316,1448,38.105263,7982,215.729730,...,0.225000,389,221,26,24,116,0,120,88,Malware
58594,Ransomware,37,15,10.243243,0,215.513513,1452,39.243243,7974,215.513513,...,0.162162,389,221,26,24,116,0,120,87,Malware


### End of Data Cleaning
We have examined and cleaned the dataset. We have also made two datasets from the original raw data: one in which the Category feature is modified to be, effectively, a multi-class feature, and a reduced one that additionally drops a few features that appear uninformative because of their low number of unique values. We may edit this notebook in the future if further findings indicate that more features should be removed. Otherwise, we will assume that the cleaning performed here is sufficient for training our models.