### Malware Meets Data Science: Exploration of Malware Using K-Nearest Neighbors (kNN)
#### This research study focuses on the exploration of the Malware Open-source Threat Intelligence Family (MOTIF) data set. This data set has 3,095 malware samples, with the viruses removed. 
#### To access the data set directly:
https://github.com/boozallen/MOTIF

#### Previously, XGBoost and convolutional neural networks (MalConv) were used to set accuracy benchmarks for this dataset. Now we want to explore kNN as an alternative. K-Nearest Neighbors (kNN) is a supervised machine learning algorithm. Given a chosen number of neighbors k, the class of a data point is predicted from the classes of the k training points closest to it.
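
#### To make the voting rule concrete, here is a minimal sketch of kNN prediction written with NumPy only. It is an illustration, not scikit-learn's implementation: the helper `knn_predict`, the Euclidean distance, and the toy arrays are all assumptions made for this example.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k):
    """Predict the label of x by majority vote among its k nearest training points."""
    # Euclidean distance from x to every training point
    dists = np.linalg.norm(X_train - x, axis=1)
    # Indices of the k closest training points
    nearest = np.argsort(dists)[:k]
    # Most common label among those neighbors
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]

# Toy data (made up for illustration): two small clusters labelled 0 and 1
X_toy = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                  [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

pred_near_origin = knn_predict(X_toy, y_toy, np.array([0.3, 0.3]), k=3)
pred_near_five = knn_predict(X_toy, y_toy, np.array([4.8, 5.0]), k=3)
print(pred_near_origin, pred_near_five)
```

#### With k=1 the prediction is simply the label of the single closest training sample, which is the setting that scores best in the experiments in this notebook.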

#### kNN comes with pros and cons. Cons include slower predictions as k increases, memory inefficiency (the full training set must be stored), and sensitivity to outliers. Pros include ease of application and interpretation, and a natural fit for classification tasks. Because malware samples group into families, we experiment to discover whether kNN is a reliable model for this data set.

#### Note: Each model will have its data written to a csv file for reference purposes. 

#### Import relevant packages

In [1]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

#### X holds the feature vectors of the malware samples in the MOTIF dataset, and y holds the corresponding malware family labels.
#### Note: The data is not malicious.

In [2]:
X = np.load("X.npy")
X.shape

(3095, 2381)

In [3]:
y = np.load("y.npy")
y.shape

(3095,)

#### Original Dataset
#### 80/20 Split with N=5,4,3,2,1

In [4]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [5]:
neigh = KNeighborsClassifier(n_neighbors=5)
neigh.fit(X_train, y_train)

In [6]:
#print(neigh.predict(X_test))
#print("-------------------------------------------")
#print(neigh.predict_proba(X_test))

y_pred_80_n5 = neigh.predict(X_test)

In [7]:
accuracy_score(y_test, y_pred_80_n5)
#accuracy_score(y_test, y_pred, normalize=False)

0.39418416801292405

In [8]:
df = pd.DataFrame(y_pred_80_n5)

In [9]:
#write to csv
df.to_csv('trainTest_80_n5', sep='\t', encoding='utf-8')

In [10]:
neigh = KNeighborsClassifier(n_neighbors=4)
neigh.fit(X_train, y_train)
y_pred_80_n4 = neigh.predict(X_test)

print(accuracy_score(y_test, y_pred_80_n4))

# Save the N=4 predictions (not the N=5 ones) to the N=4 file
df = pd.DataFrame(y_pred_80_n4)
df.to_csv('trainTest_80_n4', sep='\t', encoding='utf-8')

0.3990306946688207


In [11]:
neigh = KNeighborsClassifier(n_neighbors=3)
neigh.fit(X_train, y_train)
y_pred_80_n3 = neigh.predict(X_test)

print(accuracy_score(y_test, y_pred_80_n3))

df = pd.DataFrame(y_pred_80_n3)
df.to_csv('trainTest_80_n3', sep='\t', encoding='utf-8')

0.41195476575121165


In [12]:
neigh = KNeighborsClassifier(n_neighbors=2)
neigh.fit(X_train, y_train)
y_pred_80_n2 = neigh.predict(X_test)

print(accuracy_score(y_test, y_pred_80_n2))

df = pd.DataFrame(y_pred_80_n2)
df.to_csv('trainTest_80_n2', sep='\t', encoding='utf-8')

0.4378029079159935


In [13]:
neigh = KNeighborsClassifier(n_neighbors=1)
neigh.fit(X_train, y_train)
y_pred_80_n1 = neigh.predict(X_test)

print(accuracy_score(y_test, y_pred_80_n1))

df = pd.DataFrame(y_pred_80_n1)
df.to_csv('trainTest_80_n1', sep='\t', encoding='utf-8')

0.4991922455573506


#### 70/30 Split with N=5,4,3,2,1

In [14]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
neigh = KNeighborsClassifier(n_neighbors=5)
neigh.fit(X_train, y_train)
y_pred_70_n5 = neigh.predict(X_test)

df = pd.DataFrame(y_pred_70_n5)
#write to csv
df.to_csv('trainTest_70_n5', sep='\t', encoding='utf-8')

accuracy_score(y_test, y_pred_70_n5)


0.36921420882669537

In [15]:
#y_unique = np.unique(y)
#y_unique[1]
#y_unique

In [16]:
neigh = KNeighborsClassifier(n_neighbors=4)
neigh.fit(X_train, y_train)
y_pred_70_n4 = neigh.predict(X_test)

df = pd.DataFrame(y_pred_70_n4)
#write to csv
df.to_csv('trainTest_70_n4', sep='\t', encoding='utf-8')

accuracy_score(y_test, y_pred_70_n4)

0.3789020452099031

In [17]:
neigh = KNeighborsClassifier(n_neighbors=3)
neigh.fit(X_train, y_train)
y_pred_70_n3 = neigh.predict(X_test)

df = pd.DataFrame(y_pred_70_n3)
#write to csv
df.to_csv('trainTest_70_n3', sep='\t', encoding='utf-8')

accuracy_score(y_test, y_pred_70_n3)

0.3982777179763186

In [18]:
neigh = KNeighborsClassifier(n_neighbors=2)
neigh.fit(X_train, y_train)
y_pred_70_n2 = neigh.predict(X_test)

df = pd.DataFrame(y_pred_70_n2)
#write to csv
df.to_csv('trainTest_70_n2', sep='\t', encoding='utf-8')

accuracy_score(y_test, y_pred_70_n2)

0.43057050592034446

In [19]:
neigh = KNeighborsClassifier(n_neighbors=1)
neigh.fit(X_train, y_train)
y_pred_70_n1 = neigh.predict(X_test)

df = pd.DataFrame(y_pred_70_n1)
#write to csv
df.to_csv('trainTest_70_n1', sep='\t', encoding='utf-8')

accuracy_score(y_test, y_pred_70_n1)

0.48869752421959095

#### Modified dataset
#### 80/20 Split for N=5,4,3,2,1

#### Next we remove classes that have four or fewer data points. This is not cherry-picking: the working assumption is that families with so few samples cannot be learned reliably, so removing them should improve the accuracy of the model.

In [20]:
from collections import Counter

# Keep only families with more than four samples
label_counts = Counter(y)
keep_labels = {label for label, count in label_counts.items() if count > 4}
keep_idxs = [idx for idx, label in enumerate(y) if label in keep_labels]
X_keep, y_keep = X[keep_idxs], y[keep_idxs]

# Get number of remaining malware families
num_class = len(keep_labels)

# Re-map labels to be in range [0, num_class)
label_map = dict(zip(sorted(keep_labels), np.arange(num_class, dtype=np.float32)))
y_keep = np.array([label_map[label] for label in y_keep])
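
#### The filtering rule can be checked on a toy label array. This is a sketch under assumptions: the helper name `filter_rare_classes`, its `min_count` parameter, and the toy data are hypothetical, chosen so that `min_count=5` matches the "more than four samples" condition used in this notebook.

```python
import numpy as np
from collections import Counter

def filter_rare_classes(X, y, min_count=5):
    """Keep only rows whose label appears at least min_count times."""
    counts = Counter(y.tolist())
    # Boolean mask: True for rows whose class is frequent enough
    keep = np.array([counts[label] >= min_count for label in y.tolist()])
    return X[keep], y[keep]

# Toy labels (made up): class 0 has 6 samples, class 1 has 5,
# class 2 has 4, and class 3 has 1
y_toy = np.array([0] * 6 + [1] * 5 + [2] * 4 + [3] * 1)
X_toy = np.arange(len(y_toy) * 2, dtype=float).reshape(-1, 2)

X_kept, y_kept = filter_rare_classes(X_toy, y_toy)
kept_classes = sorted(set(y_kept.tolist()))
print(kept_classes, X_kept.shape)
```

#### Classes 2 and 3 are dropped because they have four or fewer samples, so only the 11 rows belonging to classes 0 and 1 remain.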

In [21]:
X_keep.shape

(2504, 2381)

In [22]:
y_keep.shape

(2504,)

In [23]:
X_train, X_test, y_train, y_test = train_test_split(X_keep, y_keep, test_size=0.2, random_state=42)
neigh = KNeighborsClassifier(n_neighbors=5)
neigh.fit(X_train, y_train)
y_pred_keep_80_n5 = neigh.predict(X_test)

df = pd.DataFrame(y_pred_keep_80_n5)

#write to csv
df.to_csv('trainTest_keep_80_n5', sep='\t', encoding='utf-8')

accuracy_score(y_test, y_pred_keep_80_n5)

0.5329341317365269

In [24]:
neigh = KNeighborsClassifier(n_neighbors=4)
neigh.fit(X_train, y_train)
y_pred_keep_80_n4 = neigh.predict(X_test)

df = pd.DataFrame(y_pred_keep_80_n4)

#write to csv
df.to_csv('trainTest_keep_80_n4', sep='\t', encoding='utf-8')

accuracy_score(y_test, y_pred_keep_80_n4)

0.5369261477045908

In [25]:
neigh = KNeighborsClassifier(n_neighbors=3)
neigh.fit(X_train, y_train)
y_pred_keep_80_n3 = neigh.predict(X_test)

df = pd.DataFrame(y_pred_keep_80_n3)

#write to csv
df.to_csv('trainTest_keep_80_n3', sep='\t', encoding='utf-8')

accuracy_score(y_test, y_pred_keep_80_n3)

0.5588822355289421

In [26]:
neigh = KNeighborsClassifier(n_neighbors=2)
neigh.fit(X_train, y_train)
y_pred_keep_80_n2 = neigh.predict(X_test)

df = pd.DataFrame(y_pred_keep_80_n2)

#write to csv
df.to_csv('trainTest_keep_80_n2', sep='\t', encoding='utf-8')

accuracy_score(y_test, y_pred_keep_80_n2)

0.5708582834331337

In [27]:
neigh = KNeighborsClassifier(n_neighbors=1)
neigh.fit(X_train, y_train)
y_pred_keep_80_n1 = neigh.predict(X_test)

df = pd.DataFrame(y_pred_keep_80_n1)

#write to csv
df.to_csv('trainTest_keep_80_n1', sep='\t', encoding='utf-8')

accuracy_score(y_test, y_pred_keep_80_n1)

0.6187624750499002

#### 70/30 Split with N=5,4,3,2,1

In [28]:
X_train, X_test, y_train, y_test = train_test_split(X_keep, y_keep, test_size=0.3, random_state=42)
neigh = KNeighborsClassifier(n_neighbors=5)
neigh.fit(X_train, y_train)
y_pred_keep_70_n5 = neigh.predict(X_test)

# Save the 70/30 predictions (not the 80/20 ones)
df = pd.DataFrame(y_pred_keep_70_n5)

#write to csv
df.to_csv('trainTest_keep_70_n5', sep='\t', encoding='utf-8')

accuracy_score(y_test, y_pred_keep_70_n5)

0.4787234042553192

In [29]:
X_train, X_test, y_train, y_test = train_test_split(X_keep, y_keep, test_size=0.3, random_state=42)
neigh = KNeighborsClassifier(n_neighbors=4)
neigh.fit(X_train, y_train)
y_pred_keep_70_n4 = neigh.predict(X_test)

# Save the 70/30 predictions (not the 80/20 ones)
df = pd.DataFrame(y_pred_keep_70_n4)

#write to csv
df.to_csv('trainTest_keep_70_n4', sep='\t', encoding='utf-8')

accuracy_score(y_test, y_pred_keep_70_n4)

0.4920212765957447

In [30]:
X_train, X_test, y_train, y_test = train_test_split(X_keep, y_keep, test_size=0.3, random_state=42)
neigh = KNeighborsClassifier(n_neighbors=3)
neigh.fit(X_train, y_train)
y_pred_keep_70_n3 = neigh.predict(X_test)

# Save the 70/30 predictions (not the 80/20 ones)
df = pd.DataFrame(y_pred_keep_70_n3)

#write to csv
df.to_csv('trainTest_keep_70_n3', sep='\t', encoding='utf-8')

accuracy_score(y_test, y_pred_keep_70_n3)

0.5172872340425532

In [31]:
X_train, X_test, y_train, y_test = train_test_split(X_keep, y_keep, test_size=0.3, random_state=42)
neigh = KNeighborsClassifier(n_neighbors=2)
neigh.fit(X_train, y_train)
y_pred_keep_70_n2 = neigh.predict(X_test)

# Save the 70/30 predictions (not the 80/20 ones)
df = pd.DataFrame(y_pred_keep_70_n2)

#write to csv
df.to_csv('trainTest_keep_70_n2', sep='\t', encoding='utf-8')

accuracy_score(y_test, y_pred_keep_70_n2)

0.5292553191489362

In [32]:
X_train, X_test, y_train, y_test = train_test_split(X_keep, y_keep, test_size=0.3, random_state=42)
neigh = KNeighborsClassifier(n_neighbors=1)
neigh.fit(X_train, y_train)
y_pred_keep_70_n1 = neigh.predict(X_test)

# Save the 70/30 predictions (not the 80/20 ones)
df = pd.DataFrame(y_pred_keep_70_n1)

#write to csv
df.to_csv('trainTest_keep_70_n1', sep='\t', encoding='utf-8')

accuracy_score(y_test, y_pred_keep_70_n1)

0.589095744680851

#### For both datasets, the accuracy score is maximized with an 80/20 split and N=1.
#### The highest accuracy score overall is .62, from the modified dataset with an 80/20 split and N=1.
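
#### The repeated fit/predict/score cells above could also be collapsed into one loop. The sketch below is one possible way to do that, not the code that produced the scores in this notebook: the helper `knn_accuracy_grid` is hypothetical, and it runs on random synthetic data rather than the MOTIF features, so its accuracies are meaningless beyond showing the shape of the result.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

def knn_accuracy_grid(X, y, test_sizes=(0.2, 0.3), ks=(1, 2, 3, 4, 5), seed=42):
    """Return a DataFrame of accuracy scores: one row per k, one column per split."""
    results = {}
    for test_size in test_sizes:
        X_train, X_test, y_train, y_test = train_test_split(
            X, y, test_size=test_size, random_state=seed)
        # Column label such as "80/20"
        col = f"{round((1 - test_size) * 100)}/{round(test_size * 100)}"
        scores = []
        for k in ks:
            model = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
            scores.append(accuracy_score(y_test, model.predict(X_test)))
        results[col] = scores
    return pd.DataFrame(results, index=[f"N={k}" for k in ks])

# Synthetic stand-in data (assumption): 200 samples, 10 features, 4 classes
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 10))
y_demo = rng.integers(0, 4, size=200)

grid = knn_accuracy_grid(X_demo, y_demo)
print(grid)
```

#### Calling the same helper on `X, y` and on `X_keep, y_keep` would give one accuracy table per dataset in the same layout as the tables built by hand.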


#### Manually create two DataFrames containing the accuracy scores (rounded to three decimal places).
#### Then, create a heatmap from each.

In [37]:
orig_acc = {
    "80/20": [.499, .438, .412, .399, .394],
    "70/30": [.489, .431, .398, .379, .369]
}

orig_acc_df = pd.DataFrame(orig_acc, index = ['N=1', "N=2", "N=3", "N=4", "N=5"])

orig_acc_df

| | 80/20 | 70/30 |
|---|---|---|
| N=1 | 0.499 | 0.489 |
| N=2 | 0.438 | 0.431 |
| N=3 | 0.412 | 0.398 |
| N=4 | 0.399 | 0.379 |
| N=5 | 0.394 | 0.369 |


In [41]:
mod_acc = {
    "80/20": [.619, .571, .559, .537, .533],
    "70/30": [.589, .529, .517, .492, .479]
}

mod_acc_df = pd.DataFrame(mod_acc, index = ['N=1', "N=2", "N=3", "N=4", "N=5"])

mod_acc_df

| | 80/20 | 70/30 |
|---|---|---|
| N=1 | 0.619 | 0.589 |
| N=2 | 0.571 | 0.529 |
| N=3 | 0.559 | 0.517 |
| N=4 | 0.537 | 0.492 |
| N=5 | 0.533 | 0.479 |


In [62]:
import plotly.express as px

fig = px.imshow(orig_acc_df, 
                labels=dict(x="Train/Test Split", y="Designated N", color="Accuracy Score"),
                text_auto=True, aspect="auto")
fig.update_layout(
    title='Accuracy Scores Using Original Data Set'
    )

fig.show()

In [63]:
fig = px.imshow(mod_acc_df, 
                labels=dict(x="Train/Test Split", y="Designated N", color="Accuracy Score"),
                text_auto=True, aspect="auto")
fig.update_layout(
    title='Accuracy Scores Using Modified Data Set'
    )

fig.show()

#### K=1 has the best accuracy for both the original and modified data sets. However, kNN is not the best model for the MOTIF data set. When kNN suits the data, accuracy usually holds steady or improves as K grows from 1, because the extra neighbors smooth out noise; here accuracy falls steadily as K increases. Other models perform better (see page 8 of the original white paper in the GitHub repository): our model performs better than MalConv2 (accuracy of .487), but worse than LightGBM (accuracy of .724).
