Purpose of the Computer Vision algorithms used in our project:


    Input: Image

         --> Detector [Image] = {Array of keypoints}

         --> Descriptor [{Array of keypoints}] = {Feature matrix}

(Each keypoint is described by a number and type of features that depend on the algorithm used. SIFT, for example, stands for "Scale Invariant Feature Transform" and relies on features that are preserved under spatial transformations such as rotation, translation, and scaling.)

In [1]:
import os
import cv2
import numpy as np
import matplotlib.pyplot as plt
from PIL import Image
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
%matplotlib inline

In [8]:
# Define the paths to the three folders
folder_1_path = "dataset/byteplot/malicious/"
folder_2_path = "dataset/byteplot/benign/benign/"
folder_3_path = "dataset/byteplot/benign/benign_edu/"

# Read the images in each folder and store them in a list
images_1 = [cv2.imread(os.path.join(folder_1_path, image_file), cv2.IMREAD_GRAYSCALE) \
            for image_file in os.listdir(folder_1_path)]
images_2 = [cv2.imread(os.path.join(folder_2_path, image_file), cv2.IMREAD_GRAYSCALE) \
            for image_file in os.listdir(folder_2_path)]
images_3 = [cv2.imread(os.path.join(folder_3_path, image_file), cv2.IMREAD_GRAYSCALE) \
            for image_file in os.listdir(folder_3_path)]

# Label each image: 0 for folder_1 (malicious), 1 for folder_2 and folder_3 (benign)
labels_1 = [0 for _ in range(len(images_1))]
#labels_2 = [1 for _ in range(len(images_2))]
labels_2 = [1 for _ in range(len(images_2)+len(images_3))]

# Combine the images and labels into a single dataset
images = images_1 + images_2 + images_3
labels = labels_1 + labels_2

X = images
y = labels

In [9]:
# Split data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

Implementation of the SIFT/ORB descriptors. Starting from the file path, we generate the byteplot again, but this time we highlight the keypoints.

In [28]:
# Define detector object
detector = cv2.ORB_create(nfeatures = 32)

total_kps_train=[]
total_des_train=[]
total_kps_test=[]
total_des_test=[]

for image in X_train:
    query_kps, query_des = detector.detectAndCompute(image, None)
    total_kps_train.append(query_kps)
    total_des_train.append(query_des)

for image in X_test:
    query_kps, query_des = detector.detectAndCompute(image, None)
    total_kps_test.append(query_kps)
    total_des_test.append(query_des)

In [29]:
# Drop samples for which ORB returned no descriptors (None), keeping the
# descriptor lists and the label lists aligned. Building new lists avoids
# deleting from a list while iterating over it, which skips elements.
train_pairs = [(des, label) for des, label in zip(total_des_train, y_train) if des is not None]
total_des_train = [des for des, _ in train_pairs]
y_train = [label for _, label in train_pairs]

test_pairs = [(des, label) for des, label in zip(total_des_test, y_test) if des is not None]
total_des_test = [des for des, _ in test_pairs]
y_test = [label for _, label in test_pairs]

# Largest number of keypoints found in any image (train or test).
# The name max_kps avoids shadowing the built-in max().
max_kps = max(des.shape[0] for des in total_des_train + total_des_test)

In [30]:
# Zero-pad every descriptor matrix to (max_kps, 32) so that all samples
# have the same size (each ORB descriptor is 32 bytes wide)
total_des_train = [np.pad(des, ((0, max_kps - des.shape[0]), (0, 0)), mode="constant")
                   for des in total_des_train]
total_des_test = [np.pad(des, ((0, max_kps - des.shape[0]), (0, 0)), mode="constant")
                  for des in total_des_test]

In [31]:
# Flatten each 2D descriptor matrix into a 1D feature vector
total_des_train = [np.ravel(des) for des in total_des_train]
total_des_test = [np.ravel(des) for des in total_des_test]

The array passed to fit must be two-dimensional, with shape (n_samples, n_features): one row per sample and one column per feature. For example, with 1000 images and a (32, 32) descriptor matrix for each image, every matrix has to be flattened into a 1D vector of 1024 elements. The resulting (1000, 1024) array is then accepted by fit.

All rows must have the same length. In the example above, we cannot have 999 samples with 1024 values and one sample with 1023: in that case fit raises an error complaining about an inhomogeneous array shape.

Naturally, the number of samples and the number of labels must match.
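
The shape requirement can be checked on toy data with NumPy alone. The sketch below is illustrative only: the helper `to_feature_matrix` and the toy descriptor matrices are assumptions, and `n_rows` is assumed to be at least as large as the number of rows of every input matrix.

```python
import numpy as np

def to_feature_matrix(descriptor_list, n_rows, width=32):
    """Zero-pad each (k, width) matrix to (n_rows, width), flatten it,
    and stack the results into a (n_samples, n_rows * width) array."""
    rows = []
    for des in descriptor_list:
        padded = np.zeros((n_rows, width), dtype=des.dtype)
        padded[: des.shape[0]] = des   # copy the real descriptors, leave the rest at 0
        rows.append(padded.ravel())    # (n_rows, width) -> (n_rows * width,)
    return np.stack(rows)

# Three toy "images" with 2, 4 and 1 keypoints respectively
descriptor_list = [
    np.ones((2, 32), dtype=np.uint8),
    np.ones((4, 32), dtype=np.uint8),
    np.ones((1, 32), dtype=np.uint8),
]

X_demo = to_feature_matrix(descriptor_list, n_rows=4)
print(X_demo.shape)  # (3, 128)
```

An array built this way has one fixed-length row per sample, which is the layout that `fit` expects together with a label list of the same length.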


In [32]:
clf = RandomForestClassifier(n_estimators=200, random_state=42)
clf.fit(total_des_train, y_train)

score = clf.score(total_des_test, y_test)
print("Test accuracy:", score)

Test accuracy: 0.8570658036677454
