In this section, we will extract features from the PE header for use in building a
classifier that separates malicious samples from benign ones. We will continue to use
the pefile Python module.

In [13]:
#Import pefile and modules for enumerating our samples:
import pefile
from os import listdir
from os.path import isfile, join
directories = ["assets/Benign_PE_Samples", "assets/Malicious_PE_Samples"]

In [14]:
#We define a function to collect the names of the sections of a file and preprocess
#them for readability and normalization:
def get_section_names(pe):
    """Gets a list of section names from a PE file."""
    list_of_section_names = []
    for sec in pe.sections:
        normalized_name = sec.Name.decode().replace("\x00", "").lower()
        list_of_section_names.append(normalized_name)
    return list_of_section_names
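
To see what this normalization does, the sketch below applies the same decode, strip, and lowercase steps to a few hand-written 8-byte names. The FakeSection stand-in, the normalize_section_name helper, and the sample bytes are assumptions made for illustration; with pefile, pe.sections supplies the real objects. The sketch also passes errors="ignore" to decode(), an optional hardening that is not part of the recipe, so that a name containing non-UTF-8 bytes does not raise.

```python
from collections import namedtuple

# Hypothetical stand-in for a pefile section; only the Name field is modeled.
FakeSection = namedtuple("FakeSection", ["Name"])

def normalize_section_name(raw_name):
    """Decode an 8-byte, NUL-padded section name and lowercase it."""
    # errors="ignore" drops undecodable bytes instead of raising.
    return raw_name.decode("utf-8", errors="ignore").replace("\x00", "").lower()

# Illustrative names: two common sections and one with a stray 0xff byte.
sections = [
    FakeSection(b".text\x00\x00\x00"),
    FakeSection(b".RSRC\x00\x00\x00"),
    FakeSection(b"UPX0\xff\x00\x00\x00"),
]
normalized = [normalize_section_name(s.Name) for s in sections]
print(normalized)
```

Without errors="ignore", the third name would raise a UnicodeDecodeError, and in the recipe's main loop the whole file would then be skipped by the except clause.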

In [15]:
#We define a convenience function to preprocess and standardize our imports:
def preprocess_imports(list_of_DLLs):
    """Normalize the naming of the imports of a PE file."""
    return [x.decode().split(".")[0].lower() for x in list_of_DLLs]
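
As a quick check of the normalization, the following sketch runs the same function on a few byte strings shaped like the DLL names pefile reports. The sample names are illustrative assumptions, not values taken from the dataset.

```python
def preprocess_imports(list_of_DLLs):
    """Normalize the naming of the imports of a PE file."""
    return [x.decode().split(".")[0].lower() for x in list_of_DLLs]

# Illustrative byte strings; the same DLL appears twice with different casing.
raw_dlls = [b"KERNEL32.dll", b"kernel32.DLL", b"MSVCRT.dll"]
normalized_dlls = preprocess_imports(raw_dlls)
print(normalized_dlls)
```

Both spellings of kernel32 collapse to the same token. Note that split(".")[0] keeps only the text before the first dot, so a DLL name that itself contains a dot would be shortened further than its extension.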

In [16]:
#We then define a function to collect the imports from a file using pefile:
def get_imports(pe):
    """Get a list of the imports of a PE file."""
    list_of_imports = []
    for entry in pe.DIRECTORY_ENTRY_IMPORT:
        list_of_imports.append(entry.dll)
    return preprocess_imports(list_of_imports)
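
One possible variation, sketched below, returns an empty list when the parsed object has no DIRECTORY_ENTRY_IMPORT attribute. This assumes that pefile leaves the attribute unset for files without an import table, which is worth verifying against your pefile version. The get_imports_or_empty name and the SimpleNamespace stand-ins are hypothetical and exist only so the example runs without sample files.

```python
from types import SimpleNamespace

def get_imports_or_empty(pe):
    """Return normalized DLL names, or [] when no import table is present."""
    # getattr with a default avoids an AttributeError on a missing attribute.
    entries = getattr(pe, "DIRECTORY_ENTRY_IMPORT", [])
    return [e.dll.decode().split(".")[0].lower() for e in entries]

# Hypothetical stand-ins for parsed PE objects.
with_imports = SimpleNamespace(
    DIRECTORY_ENTRY_IMPORT=[
        SimpleNamespace(dll=b"KERNEL32.dll"),
        SimpleNamespace(dll=b"msvcrt.dll"),
    ]
)
without_imports = SimpleNamespace()

result_with = get_imports_or_empty(with_imports)
result_without = get_imports_or_empty(without_imports)
print(result_with, result_without)
```

The design choice matters for the dataset: in the recipe as written, such a file would raise inside the loop and be skipped entirely, whereas this variant keeps it with an empty import list.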

Finally, we prepare to iterate through all of our files and create lists to store our
features:

In [20]:
imports_corpus = []
num_sections = []
section_names = []
for dataset_path in directories:
    samples = [f for f in listdir(dataset_path)
               if isfile(join(dataset_path, f))]
    for file in samples:
        file_path = dataset_path + "/" + file
        try:
            #In addition to collecting the preceding features, we also collect
            #the number of sections of a file:
            pe = pefile.PE(file_path)
            imports = get_imports(pe)
            n_sections = len(pe.sections)
            sec_names = get_section_names(pe)
            imports_corpus.append(imports)
            num_sections.append(n_sections)
            section_names.append(sec_names)
            print(imports_corpus)
        #In case a file's PE header cannot be parsed, we use a try-except clause:
        except Exception as e:
            print(e)
            print("Unable to obtain imports from " + file_path)

[['mscoree']]
[['mscoree'], ['kernel32', 'user32', 'gdi32', 'shell32', 'advapi32', 'comctl32', 'ole32', 'version']]
[['mscoree'], ['kernel32', 'user32', 'gdi32', 'shell32', 'advapi32', 'comctl32', 'ole32', 'version'], ['advapi32', 'comctl32', 'gdi32', 'kernel32', 'ole32', 'shell32', 'user32', 'version']]
[['mscoree'], ['kernel32', 'user32', 'gdi32', 'shell32', 'advapi32', 'comctl32', 'ole32', 'version'], ['advapi32', 'comctl32', 'gdi32', 'kernel32', 'ole32', 'shell32', 'user32', 'version'], ['kernel32', 'msvcrt', 'msvcrt']]
[['mscoree'], ['kernel32', 'user32', 'gdi32', 'shell32', 'advapi32', 'comctl32', 'ole32', 'version'], ['advapi32', 'comctl32', 'gdi32', 'kernel32', 'ole32', 'shell32', 'user32', 'version'], ['kernel32', 'msvcrt', 'msvcrt'], ['mscoree']]


How it works...
In Step 1, we import the pefile module along with the helpers needed to enumerate our
samples. We then define a convenience function that normalizes import names (Step 2).
This is needed because the same DLL is often imported under varying cases (upper/lower),
which would otherwise cause a single import to appear as several distinct imports.
After preprocessing the imports, we define another function to collect all the imports
of a file into a list. We also define a function that collects the section names of a file
and standardizes them (Step 3). Sections such as .text, .rsrc, and .reloc each contain a
distinct part of the file. We then enumerate the files in our folders and create empty
lists to hold the features we extract. The functions defined earlier collect the imports
(Step 4), the section names, and the number of sections of each file (Steps 5 and 6).
Lastly, a try-except clause handles the case in which a file's PE header cannot be parsed
(Step 7). This can happen for several reasons: the file may not actually be a PE file, or
its PE header may be intentionally or unintentionally malformed.
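
To tell the first failure reason apart from the second, one option is a cheap signature check before parsing. The sketch below uses only the standard library and is not part of the recipe. The looks_like_pe helper and the synthetic buffers are hypothetical; the check relies on the standard layout in which a PE file starts with the DOS magic MZ and stores the offset of the PE signature at 0x3C.

```python
import struct

def looks_like_pe(data):
    """Cheap structural check for the DOS and PE signatures."""
    if len(data) < 0x40 or data[:2] != b"MZ":
        return False
    # e_lfanew: 4-byte little-endian offset of the PE signature, stored at 0x3C.
    (pe_offset,) = struct.unpack_from("<I", data, 0x3C)
    # An out-of-range offset yields an empty slice, which compares unequal.
    return data[pe_offset:pe_offset + 4] == b"PE\x00\x00"

# Synthetic buffers for illustration only; real use would read a file's bytes.
minimal = bytearray(0x44)
minimal[:2] = b"MZ"
struct.pack_into("<I", minimal, 0x3C, 0x40)
minimal[0x40:0x44] = b"PE\x00\x00"

is_pe = looks_like_pe(bytes(minimal))
is_text = looks_like_pe(b"just a text file")
print(is_pe, is_text)
```

A file that fails this check is simply not a PE file. A file that passes it but still fails in pefile more likely has a malformed header. Passing the check does not guarantee that pefile can parse the file.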