# Imphadel 
#### Decision Tree Model for Detecting Malicious Windows Executable Files based on Metadata

## Requirements

For the data collection process:
1. A macOS or Ubuntu device with 7-Zip installed, for collecting malware samples
2. A clean Windows device, for collecting clean samples

For the other processes:
1. Any system will work

## Introduction

### PE File

Portable Executable (PE) is the file format used by Microsoft Windows to store executable code, libraries, and other resources such as icons and bitmaps. A PE file consists of headers followed by several sections, typically including a text section, a data section, a resource section, and an import section. The headers contain metadata about the file, including the image size, the entry point, and the address of each section. The text section contains the executable code, while the data section contains initialized and uninitialized data used by the program.
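
To make the header layout concrete, here is a minimal sketch that reads the COFF file header fields with the standard `struct` module. The function name `parse_coff_header` and the synthetic byte string are illustrative assumptions; the notebook itself relies on `pefile` for this job.

```python
import struct

def parse_coff_header(data: bytes) -> dict:
    """Read the COFF file header fields from the start of a PE image."""
    if data[:2] != b"MZ":
        raise ValueError("missing MZ signature")
    # The 4-byte field at offset 0x3C holds the file offset of the PE signature
    (pe_offset,) = struct.unpack_from("<I", data, 0x3C)
    if data[pe_offset:pe_offset + 4] != b"PE\x00\x00":
        raise ValueError("missing PE signature")
    # The 20-byte COFF file header follows the signature
    machine, n_sections, timestamp, _, _, opt_size, characteristics = struct.unpack_from(
        "<HHIIIHH", data, pe_offset + 4
    )
    return {
        "machine": machine,
        "number_of_sections": n_sections,
        "timestamp": timestamp,
        "size_of_optional_header": opt_size,
        "characteristics": characteristics,
    }

# Synthetic header for illustration only (not a runnable executable)
dos_header = bytearray(0x40)
dos_header[:2] = b"MZ"
struct.pack_into("<I", dos_header, 0x3C, 0x40)
coff = struct.pack("<HHIIIHH", 0x014C, 5, 1468633330, 0, 0, 0xE0, 0x0102)
sample = bytes(dos_header) + b"PE\x00\x00" + coff

print(parse_coff_header(sample))
```

The machine value `0x014C` corresponds to `IMAGE_FILE_MACHINE_I386`, the value that appears in the "Name" column of the extracted metadata further down.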

The import section of the PE file is particularly relevant for malware detection. This section lists the functions that the executable code needs to call from external libraries or DLLs (Dynamic Link Libraries) to run correctly. Malware authors often use the import section to hide their malicious code by modifying the names or addresses of the functions they import. As a result, malware detection tools may struggle to identify these functions as malicious.

This is where machine learning models based on imported functions can be beneficial. These models take the DLLs and functions listed in the import section as features and learn which combinations of imports are indicative of malicious behavior. Note that the import section records only names (or ordinals), not the arguments passed to the functions at run time.

For example, a machine learning model may learn to weight functions that are commonly used by malware, such as those used to download or execute additional files, disable security software, or steal sensitive information. By analyzing which of these functions appear together in the import section, the model can identify potential threats and alert the user or take action to mitigate the risk.
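
As a rough illustration of the idea, the sketch below checks an import list against a small watchlist. The watchlist and its category labels are hypothetical and chosen only for this example; the function names are real Windows API functions, but importing them is not proof of malice, which is why the notebook trains a model instead of relying on a fixed list.

```python
# Hypothetical watchlist, for illustration only
SUSPICIOUS_IMPORTS = {
    "URLDownloadToFileW": "download",
    "WinExec": "execute",
    "VirtualAllocEx": "process injection",
    "WriteProcessMemory": "process injection",
    "CreateRemoteThread": "process injection",
}

def flag_imports(imported_functions):
    """Return the imported functions that appear on the watchlist."""
    return {
        name: SUSPICIOUS_IMPORTS[name]
        for name in imported_functions
        if name in SUSPICIOUS_IMPORTS
    }

imports = ["CreateFileW", "WriteProcessMemory", "CreateRemoteThread", "ExitProcess"]
print(flag_imports(imports))
```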

Overall, machine learning models based on imported functions can provide an additional layer of protection against malware by analyzing the metadata of PE files and identifying potential threats that may be missed by traditional signature-based detection techniques.

### Imphash vs Model

Both machine learning models based on imported functions and the imphash technique are used to analyze the import section of PE files for malware detection. However, there are some key differences between the two.

Imphash is a hash-based technique that calculates an identifier for each PE file from the functions it imports. It does this by hashing the names of the imported libraries and functions in the order in which they appear in the import table. Files with the same import table share the same imphash, so the value can be compared against a database of hashes from known malicious files to identify potential threats.
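
The following is a simplified sketch of an imphash-style computation: library and function names are lowercased, joined as `library.function` in import-table order, and hashed with MD5. It omits details such as resolving imports by ordinal, so it is not guaranteed to reproduce the values of a real implementation; with `pefile`, the `get_imphash()` method of a loaded `PE` object is the usual way to obtain the hash. The function name `imphash_like` is an assumption made for this example.

```python
import hashlib

def imphash_like(imports):
    """Hash (dll, function) pairs in the order given, imphash-style."""
    parts = []
    for dll, func in imports:
        lib = dll.lower()
        # Drop common library extensions before joining
        for ext in (".dll", ".ocx", ".sys"):
            if lib.endswith(ext):
                lib = lib[: -len(ext)]
                break
        parts.append(f"{lib}.{func.lower()}")
    return hashlib.md5(",".join(parts).encode()).hexdigest()

table = [("KERNEL32.dll", "CreateFileW"), ("KERNEL32.dll", "ExitProcess")]
print(imphash_like(table))
# Reordering the same imports gives a different hash
print(imphash_like(list(reversed(table))))
```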

One advantage of imphash is that it is a fast and efficient technique that can quickly analyze large volumes of PE files. However, there are some limitations to this approach. For example, imphash may not be effective against polymorphic malware that changes the imported function names or their order, or against malware that uses encryption or other obfuscation techniques to hide its import section. Malware writers can also import unnecessary functions purely to change the hash.
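
A small sketch of this weakness, under simplifying assumptions (the hash here covers function names only, and `import_hash` and `jaccard` are names made up for the example): adding a single unused import changes the hash completely, while the overlap between the two import sets stays high.

```python
import hashlib

def import_hash(functions):
    """MD5 over the ordered, lowercased function names."""
    return hashlib.md5(",".join(f.lower() for f in functions).encode()).hexdigest()

def jaccard(a, b):
    """Share of functions common to both import lists."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

original = ["CreateFileW", "WriteFile", "CloseHandle", "ExitProcess"]
padded = original + ["GetTickCount"]  # one extra import that is never called

print(import_hash(original) == import_hash(padded))  # False
print(jaccard(original, padded))                     # 0.8
```

A model that sees each import as its own feature is affected by one feature out of many, whereas an exact hash match fails outright.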

In contrast, machine learning models based on imported functions can be more effective at identifying these types of malware. These models analyze which functions are imported, rather than relying solely on a single hash value. This allows them to identify patterns that may be indicative of malicious activity, even if part of the import list has been altered.

Another advantage of machine learning models is that they can be trained on large datasets of both malicious and benign samples, allowing them to learn and adapt to new threats over time. This can make them more effective at identifying new and emerging threats, as well as improving their overall accuracy and reducing false positives.

Overall, both imphash and machine learning models based on imported functions have their strengths and weaknesses. While imphash is a useful and efficient technique, it may not be effective against all types of malware. Machine learning models, on the other hand, can be more effective at identifying polymorphic and obfuscated malware, and can improve over time with additional training data.






## Data Collection Process

### Collecting 1000 exe malware samples from Malware Bazaar using its API


This script downloads exe malware samples and stores them in the ./zipped folder.  
Please note that Malware Bazaar limits downloads to 1000 samples per week per IP address, so this script should only be run once.

In [None]:
import os
import requests

API_URL = 'https://mb-api.abuse.ch/api/v1/'

# Make sure the output folder exists before writing to it
os.makedirs('zipped', exist_ok=True)

# Request a list of exe samples
data = {
    'query': 'get_file_type',
    'file_type': 'exe',
    'limit': '1000'
}

list_response = requests.post(API_URL, data=data)
list_response_json = list_response.json()

# Download each listed sample by its SHA-256 hash
for sample_count, sample in enumerate(list_response_json['data'], start=1):
    data = {
        'query': 'get_file',
        'sha256_hash': sample['sha256_hash']
    }

    response = requests.post(API_URL, data=data)
    with open(f'zipped/sample{sample_count:03d}.zip', 'wb') as f:
        f.write(response.content)
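
The download loop writes whatever the server returns. As a hedged sketch, and assuming that a failed request yields an error body rather than an archive, the response could be checked for the ZIP signature before it is saved. The helper name `looks_like_zip` and the sample error body are assumptions for this example.

```python
import io
import zipfile

def looks_like_zip(content: bytes) -> bool:
    """ZIP archives with at least one entry start with the bytes PK\\x03\\x04."""
    return content[:4] == b"PK\x03\x04"

# Build a small in-memory archive to exercise the check
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("sample.txt", "placeholder")
archive_bytes = buffer.getvalue()

print(looks_like_zip(archive_bytes))                 # True
print(looks_like_zip(b'{"query_status": "error"}'))  # False
```

In the loop, a sample whose response fails this check could be skipped instead of written to disk.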

The downloaded samples are password-protected zip files, so the following script extracts the exe samples and stores them in the ./infected folder.

In [None]:
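import subprocess
import os
import shutil

# Replace with the path to your 7-Zip binary
SEVEN_ZIP_PATH = "/usr/local/bin/7zz"

# Make sure the output folder exists before moving files into it
os.makedirs("infected", exist_ok=True)

# A for loop moves on to the next archive even when extraction fails,
# whereas the earlier while loop would retry the same archive forever
for sample_count in range(1, 1001):
    # The archives are protected with the password "infected"
    subprocess.call([SEVEN_ZIP_PATH, "x", "-pinfected", f"zipped/sample{sample_count:03d}.zip"])
    # Move the extracted exe, if any, out of the working directory
    for file in os.listdir("./"):
        if file.endswith('.exe'):
            shutil.move(file, f"infected/sample{sample_count:03d}.exe")
            break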

### Collecting Clean Executable Samples


This script is meant to be run on an uninfected Windows System, not here!  
```
import os
import shutil

source_dir = "C:\\"

# Must match the folder name read by the metadata extraction step
dest_dir = ".\\safe"

os.makedirs(dest_dir, exist_ok=True)

for root, dirs, files in os.walk(source_dir):
    for file in files:
        if file.endswith(".exe"):
            src_file_path = os.path.join(root, file)
            dest_file_path = os.path.join(dest_dir, file)
            try:
                shutil.copy2(src_file_path, dest_file_path)
                print(f"Copied {src_file_path} to {dest_file_path}")
            except OSError as e:
                # Skip unreadable files and files that already sit in dest_dir
                print(f"Skipping {src_file_path}: {e}")
print("Done copying exe files")
```

This script collects all exe files on the system it is run on. Please note that it does not guarantee any sample count; it might need to be run on multiple devices to collect close to 1000 samples.  
Once it's done, transfer the 'safe' folder to the current directory of this notebook.
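
Because the script copies every file into one flat folder under its original name, two different executables with the same file name overwrite each other. One possible way around this, sketched here as an assumption and not as part of the original workflow, is to name each copy after the SHA-256 digest of its content. The helper name `copy_with_hash_name` is hypothetical.

```python
import hashlib
import os
import shutil
import tempfile

def copy_with_hash_name(src_path, dest_dir):
    """Copy src_path into dest_dir, named after the SHA-256 of its content."""
    with open(src_path, "rb") as f:
        digest = hashlib.sha256(f.read()).hexdigest()
    os.makedirs(dest_dir, exist_ok=True)
    dest_path = os.path.join(dest_dir, f"{digest}.exe")
    # Identical content maps to the same name, so duplicates are skipped
    if not os.path.exists(dest_path):
        shutil.copy2(src_path, dest_path)
    return dest_path

# Demo: two different files that share the name setup.exe
with tempfile.TemporaryDirectory() as tmp:
    paths = []
    for folder, content in (("a", b"first"), ("b", b"second")):
        os.makedirs(os.path.join(tmp, folder))
        path = os.path.join(tmp, folder, "setup.exe")
        with open(path, "wb") as f:
            f.write(content)
        paths.append(path)
    dest = os.path.join(tmp, "safe")
    for path in paths:
        copy_with_hash_name(path, dest)
    kept = len(os.listdir(dest))

print(kept)  # 2
```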

### Extracting the Metadata from the executable files

In [1]:
%pip install pefile pandas

Note: you may need to restart the kernel to use updated packages.


For Malicious Exes

In [2]:
import os
import pefile
import pandas as pd

# Define the metadata fields to extract
metadata = {"Name": [],
            "Timestamp": [],
            "Number of sections": [],
            "Section alignment": [],
            "File alignment": [],
            "Size of image": [],
            "Size of headers": [],
            "Subsystem": [],
            "DLL characteristics": [],
            "Imported DLLs": [],
            "Imported functions": [],
            "Exported symbols": []}

# Iterate over the files in the directory
for filename in os.listdir("./infected"):
    if filename.endswith(".exe"):
        filepath = os.path.join("./infected", filename)

        # Open the PE file
        pe = pefile.PE(filepath)

        # Extract the metadata fields
        metadata["Name"].append(pefile.MACHINE_TYPE.get(pe.FILE_HEADER.Machine, "Unknown"))
        metadata["Timestamp"].append(pe.FILE_HEADER.TimeDateStamp)
        metadata["Number of sections"].append(pe.FILE_HEADER.NumberOfSections)
        metadata["Section alignment"].append(pe.OPTIONAL_HEADER.SectionAlignment)
        metadata["File alignment"].append(pe.OPTIONAL_HEADER.FileAlignment)
        metadata["Size of image"].append(pe.OPTIONAL_HEADER.SizeOfImage)
        metadata["Size of headers"].append(pe.OPTIONAL_HEADER.SizeOfHeaders)
        metadata["Subsystem"].append(pefile.SUBSYSTEM_TYPE.get(pe.OPTIONAL_HEADER.Subsystem, "Unknown"))
        metadata["DLL characteristics"].append(",".join([pefile.DLL_CHARACTERISTICS[x] for x in [pe.OPTIONAL_HEADER.DllCharacteristics] if x in pefile.DLL_CHARACTERISTICS]))
        metadata["Imported DLLs"].append(",".join([entry.dll.decode() for entry in pe.DIRECTORY_ENTRY_IMPORT]) if hasattr(pe, 'DIRECTORY_ENTRY_IMPORT') else "")
        metadata["Imported functions"].append(",".join([f.name.decode() for entry in pe.DIRECTORY_ENTRY_IMPORT for f in entry.imports if f.name is not None]) if hasattr(pe, 'DIRECTORY_ENTRY_IMPORT') else "")
        metadata["Exported symbols"].append(",".join([exp.name.decode() for exp in pe.DIRECTORY_ENTRY_EXPORT.symbols if exp.name is not None]) if hasattr(pe, 'DIRECTORY_ENTRY_EXPORT') else "")
        
        # Close the PE file
        pe.close()

# Create the DataFrame
df_mal = pd.DataFrame(metadata)

df_mal.head()

| | Name | Timestamp | Number of sections | Section alignment | File alignment | Size of image | Size of headers | Subsystem | DLL characteristics | Imported DLLs | Imported functions | Exported symbols |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | IMAGE_FILE_MACHINE_I386 | 1468633330 | 5 | 4096 | 512 | 1007616 | 1024 | IMAGE_SUBSYSTEM_WINDOWS_GUI | | ADVAPI32.dll,KERNEL32.dll,GDI32.dll,USER32.dll... | GetTokenInformation,RegDeleteValueA,RegOpenKey... | |
| 1 | IMAGE_FILE_MACHINE_I386 | 3226397388 | 3 | 8192 | 512 | 770048 | 512 | IMAGE_SUBSYSTEM_WINDOWS_GUI | | mscoree.dll | _CorExeMain | |
| 2 | IMAGE_FILE_MACHINE_AMD64 | 3344059323 | 2 | 8192 | 512 | 2170880 | 512 | IMAGE_SUBSYSTEM_WINDOWS_GUI | | | | |
| 3 | IMAGE_FILE_MACHINE_I386 | 1468633330 | 5 | 4096 | 512 | 741376 | 1024 | IMAGE_SUBSYSTEM_WINDOWS_GUI | | ADVAPI32.dll,KERNEL32.dll,GDI32.dll,USER32.dll... | GetTokenInformation,RegDeleteValueA,RegOpenKey... | |
| 4 | IMAGE_FILE_MACHINE_I386 | 1468633330 | 5 | 4096 | 512 | 856064 | 1024 | IMAGE_SUBSYSTEM_WINDOWS_GUI | | ADVAPI32.dll,KERNEL32.dll,GDI32.dll,USER32.dll... | GetTokenInformation,RegDeleteValueA,RegOpenKey... | |


In [16]:
df_mal.shape

(998, 13)
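
In the output above, the "DLL characteristics" column is empty for every displayed row. One possible cause, offered here as an assumption, is that `DllCharacteristics` is a bit field in which several flags are usually set at once, so looking up the whole value as a single key rarely matches. The sketch below decodes the field flag by flag, using the flag values from the PE format; the function name `decode_dll_characteristics` is made up for the example.

```python
# DllCharacteristics flag values as defined by the PE format
DLL_CHARACTERISTICS_FLAGS = {
    0x0020: "IMAGE_DLLCHARACTERISTICS_HIGH_ENTROPY_VA",
    0x0040: "IMAGE_DLLCHARACTERISTICS_DYNAMIC_BASE",
    0x0080: "IMAGE_DLLCHARACTERISTICS_FORCE_INTEGRITY",
    0x0100: "IMAGE_DLLCHARACTERISTICS_NX_COMPAT",
    0x0200: "IMAGE_DLLCHARACTERISTICS_NO_ISOLATION",
    0x0400: "IMAGE_DLLCHARACTERISTICS_NO_SEH",
    0x0800: "IMAGE_DLLCHARACTERISTICS_NO_BIND",
    0x1000: "IMAGE_DLLCHARACTERISTICS_APPCONTAINER",
    0x2000: "IMAGE_DLLCHARACTERISTICS_WDM_DRIVER",
    0x4000: "IMAGE_DLLCHARACTERISTICS_GUARD_CF",
    0x8000: "IMAGE_DLLCHARACTERISTICS_TERMINAL_SERVER_AWARE",
}

def decode_dll_characteristics(value):
    """Return the names of all flags set in the bit field."""
    return [name for bit, name in sorted(DLL_CHARACTERISTICS_FLAGS.items()) if value & bit]

# 0x8160 combines four flags, so it is not itself a key of the table
print(0x8160 in DLL_CHARACTERISTICS_FLAGS)  # False
print(",".join(decode_dll_characteristics(0x8160)))
```

Changing the extraction this way would alter the features fed to the model, so the results reported later in this notebook would no longer apply as-is.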

For Clean Exes

In [3]:
# Clear metadata
metadata = {"Name": [],
            "Timestamp": [],
            "Number of sections": [],
            "Section alignment": [],
            "File alignment": [],
            "Size of image": [],
            "Size of headers": [],
            "Subsystem": [],
            "DLL characteristics": [],
            "Imported DLLs": [],
            "Imported functions": [],
            "Exported symbols": []}

for filename in os.listdir("./safe"):
    if filename.endswith(".exe"):
        filepath = os.path.join("./safe", filename)

        try:
            # Open the PE file
            pe = pefile.PE(filepath)

            # Extract the metadata fields
            metadata["Name"].append(pefile.MACHINE_TYPE.get(pe.FILE_HEADER.Machine, "Unknown"))
            metadata["Timestamp"].append(pe.FILE_HEADER.TimeDateStamp)
            metadata["Number of sections"].append(pe.FILE_HEADER.NumberOfSections)
            metadata["Section alignment"].append(pe.OPTIONAL_HEADER.SectionAlignment)
            metadata["File alignment"].append(pe.OPTIONAL_HEADER.FileAlignment)
            metadata["Size of image"].append(pe.OPTIONAL_HEADER.SizeOfImage)
            metadata["Size of headers"].append(pe.OPTIONAL_HEADER.SizeOfHeaders)
            metadata["Subsystem"].append(pefile.SUBSYSTEM_TYPE.get(pe.OPTIONAL_HEADER.Subsystem, "Unknown"))
            metadata["DLL characteristics"].append(",".join([pefile.DLL_CHARACTERISTICS[x] for x in [pe.OPTIONAL_HEADER.DllCharacteristics] if x in pefile.DLL_CHARACTERISTICS]))
            metadata["Imported DLLs"].append(",".join([entry.dll.decode() for entry in pe.DIRECTORY_ENTRY_IMPORT]) if hasattr(pe, 'DIRECTORY_ENTRY_IMPORT') else "")
            metadata["Imported functions"].append(",".join([f.name.decode() for entry in pe.DIRECTORY_ENTRY_IMPORT for f in entry.imports if f.name is not None]) if hasattr(pe, 'DIRECTORY_ENTRY_IMPORT') else "")
            metadata["Exported symbols"].append(",".join([exp.name.decode() for exp in pe.DIRECTORY_ENTRY_EXPORT.symbols if exp.name is not None]) if hasattr(pe, 'DIRECTORY_ENTRY_EXPORT') else "")
            
            # Close the PE file
            pe.close()
        
        except pefile.PEFormatError as e:
            print(f"Skipping {filepath}: {e}")
            continue
        

# Create the DataFrame
df_safe = pd.DataFrame(metadata)

df_safe.head()

Skipping ./safe/eqnedt32.exe: 'The file is empty'


| | Name | Timestamp | Number of sections | Section alignment | File alignment | Size of image | Size of headers | Subsystem | DLL characteristics | Imported DLLs | Imported functions | Exported symbols |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | IMAGE_FILE_MACHINE_AMD64 | 1676906954 | 6 | 4096 | 512 | 69632 | 1024 | IMAGE_SUBSYSTEM_WINDOWS_CUI | | KERNEL32.dll,ole32.dll,OLEAUT32.dll,MSVCP140.d... | GetLastError,InitializeCriticalSectionEx,Delet... | |
| 1 | IMAGE_FILE_MACHINE_I386 | 1538728210 | 5 | 4096 | 512 | 3407872 | 1024 | IMAGE_SUBSYSTEM_WINDOWS_GUI | | KERNEL32.dll,USER32.dll,GDI32.dll,MSIMG32.dll,... | GetStringTypeW,GetTimeZoneInformation,GetConso... | |
| 2 | IMAGE_FILE_MACHINE_AMD64 | 1768684448 | 6 | 4096 | 512 | 98304 | 1024 | IMAGE_SUBSYSTEM_WINDOWS_CUI | | ADVAPI32.dll,KERNEL32.dll,msvcrt.dll,ole32.dll... | RegQueryValueExW,InitiateSystemShutdownExW,Ope... | |
| 3 | IMAGE_FILE_MACHINE_AMD64 | 1599554788 | 2 | 8192 | 512 | 344064 | 512 | IMAGE_SUBSYSTEM_WINDOWS_CUI | | | | |
| 4 | IMAGE_FILE_MACHINE_I386 | 4278777568 | 5 | 4096 | 512 | 94208 | 1024 | IMAGE_SUBSYSTEM_WINDOWS_GUI | | KERNEL32.dll,USER32.dll,msvcrt.dll,ADVAPI32.dll | CreateFileMappingW,LoadResource,FindResourceEx... | |


In [17]:
df_safe.shape

(799, 13)

In this case, we have 998 usable malicious samples and 799 clean samples.

## Data Preprocessing

We need to downsample the malicious samples to match the sample size of the clean samples

In [26]:
df_mal = df_mal.head(799)
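
Taking the first 799 rows keeps whichever samples happen to come first in the folder listing. An alternative, sketched here with the standard library only, is to draw the subset at random with a fixed seed so that the selection is reproducible. The helper name `downsample` and the placeholder file names are assumptions; with pandas, `DataFrame.sample(n=..., random_state=...)` serves the same purpose.

```python
import random

def downsample(rows, n, seed=42):
    """Return n rows drawn at random without replacement, reproducibly."""
    if n > len(rows):
        raise ValueError("cannot draw more rows than are available")
    return random.Random(seed).sample(rows, n)

# Placeholder names standing in for the 998 usable malicious samples
malicious = [f"sample{i:03d}.exe" for i in range(1, 999)]
balanced = downsample(malicious, 799)

print(len(balanced))  # 799
```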

Now that we have loaded our raw data into two dataframes, df_mal and df_safe, we can start preprocessing by adding the y label to both dataframes and then combining them.

In [27]:
df_mal['is_mal'] = 1
df_safe['is_mal'] = 0
df_combined = pd.concat([df_safe, df_mal], ignore_index=True).sample(frac=1)
df_combined = df_combined.reset_index(drop=True)
df_combined.head(5)

| | Name | Timestamp | Number of sections | Section alignment | File alignment | Size of image | Size of headers | Subsystem | DLL characteristics | Imported DLLs | Imported functions | Exported symbols | is_mal |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | IMAGE_FILE_MACHINE_I386 | 1632607007 | 5 | 4096 | 512 | 245760 | 1024 | IMAGE_SUBSYSTEM_WINDOWS_GUI | | ADVAPI32.dll,SHELL32.dll,ole32.dll,COMCTL32.dl... | RegCreateKeyExW,RegEnumKeyW,RegQueryValueExW,R... | | 1 |
| 1 | IMAGE_FILE_MACHINE_I386 | 1576457453 | 5 | 4096 | 512 | 552960 | 1024 | IMAGE_SUBSYSTEM_WINDOWS_GUI | | KERNEL32.dll,USER32.dll,GDI32.dll,SHELL32.dll,... | ExitProcess,SetFileAttributesW,Sleep,GetTickCo... | | 0 |
| 2 | IMAGE_FILE_MACHINE_I386 | 1417973532 | 7 | 4096 | 512 | 933888 | 1024 | IMAGE_SUBSYSTEM_WINDOWS_CUI | | KERNEL32.dll,msvcrt.dll,USER32.dll | DeleteCriticalSection,EnterCriticalSection,Get... | | 0 |
| 3 | IMAGE_FILE_MACHINE_AMD64 | 1656570707 | 6 | 4096 | 512 | 7208960 | 1024 | IMAGE_SUBSYSTEM_WINDOWS_CUI | | AdCoreUnits-16.dll,adskassetapi_new-16.dll,adp... | ?GetGlobalName@UnitsUtilities@units@core@platf... | ??0CMADPInfoManager@CMADP@Platform@Autodesk@@A... | 0 |
| 4 | IMAGE_FILE_MACHINE_I386 | 1681870375 | 3 | 8192 | 512 | 737280 | 512 | IMAGE_SUBSYSTEM_WINDOWS_GUI | | mscoree.dll | _CorExeMain | | 1 |


### Splitting Data and Encoding

From the previous code block, we can see that the values in 'Imported DLLs', 'Imported functions', and 'Exported symbols' are stored as comma-separated strings. We need to split them into lists and encode them.

In [34]:
%pip install scikit-learn

Note: you may need to restart the kernel to use updated packages.


In [29]:
from sklearn.preprocessing import MultiLabelBinarizer

# Assuming your DataFrame is named df_combined

# Splitting the strings into lists
df_combined['Imported DLLs'] = df_combined['Imported DLLs'].apply(lambda x: x.split(',') if x else [])
df_combined['Imported functions'] = df_combined['Imported functions'].apply(lambda x: x.split(',') if x else [])
df_combined['Exported symbols'] = df_combined['Exported symbols'].apply(lambda x: x.split(',') if x else [])

# Encoding the columns
mlb = MultiLabelBinarizer()

encoded_imported_dlls = pd.DataFrame(mlb.fit_transform(df_combined['Imported DLLs']), columns=mlb.classes_, index=df_combined.index)
encoded_imported_functions = pd.DataFrame(mlb.fit_transform(df_combined['Imported functions']), columns=mlb.classes_, index=df_combined.index)
encoded_exported_symbols = pd.DataFrame(mlb.fit_transform(df_combined['Exported symbols']), columns=mlb.classes_, index=df_combined.index)

# Concatenating the encoded columns with the original dataframe
df_combined_encoded = pd.concat([df_combined, encoded_imported_dlls, encoded_imported_functions, encoded_exported_symbols], axis=1)

# Dropping the original columns
df_combined_encoded = df_combined_encoded.drop(['Imported DLLs', 'Imported functions', 'Exported symbols'], axis=1)
df_combined_encoded.head()

| | Name | Timestamp | Number of sections | Section alignment | File alignment | Size of image | Size of headers | Subsystem | DLL characteristics | is_mal | ... | wkhtmltopdf_set_finished_callback | wkhtmltopdf_set_global_setting | wkhtmltopdf_set_object_setting | wkhtmltopdf_set_phase_changed_callback | wkhtmltopdf_set_progress_changed_callback | wkhtmltopdf_set_warning_callback | wkhtmltopdf_version | zError | zlibCompileFlags | zlibVersion |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | IMAGE_FILE_MACHINE_I386 | 1632607007 | 5 | 4096 | 512 | 245760 | 1024 | IMAGE_SUBSYSTEM_WINDOWS_GUI | | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | IMAGE_FILE_MACHINE_I386 | 1576457453 | 5 | 4096 | 512 | 552960 | 1024 | IMAGE_SUBSYSTEM_WINDOWS_GUI | | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | IMAGE_FILE_MACHINE_I386 | 1417973532 | 7 | 4096 | 512 | 933888 | 1024 | IMAGE_SUBSYSTEM_WINDOWS_CUI | | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | IMAGE_FILE_MACHINE_AMD64 | 1656570707 | 6 | 4096 | 512 | 7208960 | 1024 | IMAGE_SUBSYSTEM_WINDOWS_CUI | | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | IMAGE_FILE_MACHINE_I386 | 1681870375 | 3 | 8192 | 512 | 737280 | 512 | IMAGE_SUBSYSTEM_WINDOWS_GUI | | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
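
To show what this encoding produces, here is a plain-Python sketch of the same multi-hot idea: every distinct label becomes a column, and each row holds 1 where the sample contains that label. It is a simplified stand-in for `MultiLabelBinarizer`, not its implementation, and the function name `multi_hot` is an assumption.

```python
def multi_hot(samples):
    """Encode lists of labels as 0/1 rows over the sorted set of all labels."""
    classes = sorted({label for labels in samples for label in labels})
    index = {label: i for i, label in enumerate(classes)}
    rows = []
    for labels in samples:
        row = [0] * len(classes)
        for label in labels:
            row[index[label]] = 1
        rows.append(row)
    return classes, rows

classes, rows = multi_hot([["KERNEL32.dll", "USER32.dll"], ["mscoree.dll"], []])
print(classes)  # ['KERNEL32.dll', 'USER32.dll', 'mscoree.dll']
print(rows)     # [[1, 1, 0], [0, 0, 1], [0, 0, 0]]
```

A sample with no imports, such as the third one, becomes a row of zeros.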


Now we have to encode the remaining columns that hold string values.

In [30]:
# One-hot encoding for 'Name', 'Subsystem', and 'DLL characteristics' columns
df_combined_encoded = pd.get_dummies(df_combined_encoded, columns=['Name', 'Subsystem', 'DLL characteristics'])

## Model

The following code trains the decision tree model and uses GridSearchCV to find the max_depth that maximizes recall.  
Please note that we use recall as the main metric because we are building a malware detection tool.  
The main purpose of malware detection is to prevent malware from running on the user's computer; hence, reducing false negatives (FN) is the priority.

In [31]:
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report, accuracy_score, confusion_matrix

# Preparing the dataset
X = df_combined_encoded.drop('is_mal', axis=1)
y = df_combined_encoded['is_mal']

# Splitting the dataset into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Creating the Decision Tree Classifier
clf = DecisionTreeClassifier(random_state=42)

# Defining the hyperparameter grid
param_grid = {'max_depth': range(1, 21)}

# Performing grid search with cross-validation
grid_search = GridSearchCV(clf, param_grid, cv=5, scoring='recall')
grid_search.fit(X_train, y_train)

# Getting the best hyperparameters
best_max_depth = grid_search.best_params_['max_depth']
print("Best max depth:", best_max_depth)

# GridSearchCV has already refit the best model on the full training set (refit=True by default)
best_clf = grid_search.best_estimator_

# Predicting the test set
y_pred = best_clf.predict(X_test)

# Evaluating the model
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred))
print("Classification Report:")
print(classification_report(y_test, y_pred))


Best max depth: 7
Accuracy: 0.9208333333333333
Confusion Matrix:
[[213  24]
 [ 14 229]]
Classification Report:
              precision    recall  f1-score   support

           0       0.94      0.90      0.92       237
           1       0.91      0.94      0.92       243

    accuracy                           0.92       480
   macro avg       0.92      0.92      0.92       480
weighted avg       0.92      0.92      0.92       480
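
The figures for the malicious class can be checked by hand from the confusion matrix above, where rows are true labels and columns are predicted labels. A short sketch of the arithmetic:

```python
# Values taken from the confusion matrix above
tn, fp = 213, 24   # true class 0 (clean)
fn, tp = 14, 229   # true class 1 (malicious)

recall = tp / (tp + fn)                      # 229 / 243
precision = tp / (tp + fp)                   # 229 / 253
accuracy = (tp + tn) / (tn + fp + fn + tp)   # 442 / 480

print(round(recall, 2))     # 0.94
print(round(precision, 2))  # 0.91
print(round(accuracy, 4))   # 0.9208
```

The 14 false negatives are the malicious samples classified as clean, which is the error that recall penalizes.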



#### Saving Model for later use

In [35]:
import pickle

# Save the trained model and the column names to a file
with open("decision_tree_model_and_columns.pkl", "wb") as file:
    pickle.dump((best_clf, X_train.columns), file)
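
The column names are saved alongside the model because a new sample has to be presented with exactly the same features, in the same order, as the training data. A minimal sketch of that alignment step, with hypothetical column names and a hypothetical helper `align_features`:

```python
def align_features(sample_features, columns):
    """Build a feature row in the saved column order; absent features become 0."""
    return [sample_features.get(column, 0) for column in columns]

# Hypothetical saved columns and a new sample, for illustration only
columns = ["Timestamp", "Number of sections", "KERNEL32.dll", "CreateFileW", "WinExec"]
new_sample = {
    "Timestamp": 1468633330,
    "Number of sections": 5,
    "KERNEL32.dll": 1,
    "CreateFileW": 1,
    "NeverSeenFunction": 1,  # not a training column, so it is dropped
}

print(align_features(new_sample, columns))  # [1468633330, 5, 1, 1, 0]
```

With pandas, `DataFrame.reindex(columns=..., fill_value=0)` performs the same alignment on a whole dataframe.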