# HACKtheMACHINE 2021 | Track 2: Data Science, Detective Bot 

The following data dictionary describes the columns (fields) of the data set. More detail can be found in the `EMBER` documentation, in the `features.py` file at: https://github.com/elastic/ember/blob/master/ember/features.py 

| Field Name | Description | 
|------------|-------------|
| sha256 | The Secure Hash Algorithm (SHA) is a cryptographic hash function that acts like a signature or fingerprint for a data set. Even if one symbol is changed, the algorithm will produce a different hash value. The SHA256 algorithm generates a fixed-size 256-bit (32-byte) hash and is used to confirm that you acquired the same data as the original. For example, after downloading a file you can check that the data has not changed (due to network errors or malware injection) by comparing the hash of your file with the hash of the original.|
| histogram | Byte histogram (raw, non-normalized counts) over the entire binary file. The byte histogram contains 256 integer values that represent the counts of each byte value within the file. When generating model features, the byte histogram is normalized to a distribution, since file size is already represented as a feature in the general file information. | 
| byteentropy | 2D byte/entropy histogram based loosely on (Saxe and Berlin, 2015). See Section 2.1.1 in https://arxiv.org/pdf/1508.03096.pdf for more info. The byte-entropy histogram approximates the joint distribution p(H, X) of entropy H and byte value X by computing the scalar entropy H for a fixed-length window and pairing it with each byte occurrence within the window. This is repeated as the window slides across the input bytes. |
| strings | Contains simple statistics about printable strings: <ul><li>`numstrings`: number of strings <li>`avlength`: average length of strings <li>`printabledist`: histogram of the printable characters within those strings <li>`printables`: count of printable characters; like the other string statistics, it carries information distinct from the byte histogram because it is derived only from strings containing at least 5 consecutive printable characters <li>`entropy`: entropy of characters across all printable strings <li>`paths`: number of strings that begin with **C:** (case insensitive), which may indicate a path <li>`urls`: number of occurrences of **http://** or **https://** (case insensitive), which may indicate a URL <li>`registry`: number of occurrences of HKEY, which may indicate a registry key <li>`MZ`: number of occurrences of the short string MZ</ul> |
| general | Provides general file information. 0/1 indicates a binary flag. <ul><li>`size`: file size in bytes <li>`vsize`: virtual size <li>`has_debug`: 0/1 <li>`exports`: number of exported functions <li>`imports`: number of imported functions <li>`has_relocations`: 0/1 <li>`has_resources`: 0/1 <li>`has_signature`: 0/1 <li>`has_tls`: 0/1 <li>`symbols`: number of symbols</ul> |
| header | Provides header information on machine, architecture, OS, linker, and other details: <ul><li> `coff`: [ `timestamp`, `machine`, `characteristics` ] <li> `optional`: [`subsystem`, `dll_characteristics`, `magic`, `major_image_version`, `minor_linker_version`, `major_operating_system_version`, `minor_operating_system_version`, `major_subsystem_version`, `minor_subsystem_version`, `sizeof_code`, `sizeof_headers`, `sizeof_heap_commit`]</ul> |
| section | Information about section names, sizes, and entropy. Uses the hashing trick to summarize this section information into a feature vector. |
| imports | Information about imported libraries and functions from the import address table, for example `KERNEL32.dll`: [`GetTickCount`]. Note that the total number of imported functions is contained in GeneralFileInfo. |
| exports | Information about exported functions. Note that the total number of exported functions is contained in GeneralFileInfo.|
| datadirectories | Extracts size and virtual address of the first 15 data directories. |
| label / category | Class label indicating benign `0` or malicious `1`|
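
As a rough illustration of the first two fields, the sketch below computes a SHA-256 digest and a byte histogram using only the Python standard library. It is not EMBER's implementation; the helper names (`sha256_hex`, `byte_histogram`, `normalize`) and the sample bytes are made up for this example.

```python
import hashlib

def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of `data` as 64 hex characters (256 bits)."""
    return hashlib.sha256(data).hexdigest()

def byte_histogram(data: bytes) -> list:
    """Count how often each of the 256 possible byte values occurs."""
    counts = [0] * 256
    for b in data:
        counts[b] += 1
    return counts

def normalize(counts: list) -> list:
    """Turn raw counts into a distribution that sums to 1 (all zeros for empty input)."""
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0] * len(counts)

# Hypothetical inputs: eight made-up bytes (not a real PE file) and a copy
# whose last byte was changed.
sample = b"MZ\x90\x00\x03\x00\x00\x00"
tampered = b"MZ\x90\x00\x03\x00\x00\x01"

hist = byte_histogram(sample)   # raw counts, as in the `histogram` field
dist = normalize(hist)          # normalized form used as model features

print(sha256_hex(sample))
print(sha256_hex(sample) == sha256_hex(tampered))  # a one-byte change alters the digest
print(hist[0x00], dist[0x00])                      # count and share of zero bytes
```

Comparing digests is how a downloaded file can be checked against the original, and the normalized histogram removes the dependence on file size.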

In [1]:
import sys

In [2]:
sys.path

['/Users/rolandchin/Desktop/HackTheMachine/HtMFall2021-Corona/Challenge1',
 '/opt/anaconda3/lib/python38.zip',
 '/opt/anaconda3/lib/python3.8',
 '/opt/anaconda3/lib/python3.8/lib-dynload',
 '',
 '/opt/anaconda3/lib/python3.8/site-packages',
 '/opt/anaconda3/lib/python3.8/site-packages/IPython/extensions',
 '/Users/rolandchin/.ipython']

## Load Libraries

In [5]:
import numpy as np #data manipulation
import pandas as pd #data manipulation
import sklearn as sk #modeling & metrics
import seaborn as sns #visualizations
import scipy as stats #NOTE: alias for the top-level scipy package, not the scipy.stats submodule
from matplotlib import pyplot as plt #visualizations

#imputation, scaling, metrics
from sklearn import preprocessing
from sklearn import metrics
from sklearn.utils import resample
from sklearn.metrics import r2_score, classification_report, f1_score, accuracy_score, confusion_matrix
from sklearn.metrics import roc_curve, auc, roc_auc_score
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import StratifiedKFold
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import f_classif
from sklearn.linear_model import LogisticRegression

#outlier classification
from sklearn.ensemble import IsolationForest

import xgboost as xgb #xgb model
# import lightgbm as lgb #lgbm model
# # from lightgbm import *
# import re #fix error for lgbm
import hyperopt #hyperparameter tuning
from hyperopt import fmin, tpe, hp, STATUS_OK, Trials, space_eval

import shap #shap plot

import pickle 
 
import warnings
warnings.filterwarnings('ignore')

In [None]:
# !pip install git+https://github.com/elastic/ember.git

In [2]:
import Ember_Wrapper

## Requirements

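The requirements file is essentially this output pasted into a text file. Note that the package listed here as `sklearn` is installed from PyPI under the name `scikit-learn`.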

In [23]:
print('\n'.join(f'{m.__name__}=={m.__version__}' for m in globals().values() if getattr(m, '__version__', None)))

numpy==1.21.2
pandas==1.3.4
sklearn==1.0.1
seaborn==0.11.2
scipy==1.6.0
xgboost==1.5.0
hyperopt==0.2.5
shap==0.40.0


## pip install git+https://github.com/elastic/ember.git
## ember, numpy, pandas, shap, hyperopt, xgboost

## Load Data

In [None]:
# Flattened EMBER Feature set
# Easier to feed into ML models right away
# df1 = pd.read_excel("flatten_train.xlsx")

Use pickling so you don't have to re-read the Excel file every time.

In [None]:
# pd.to_pickle(df1, "./data.pkl")

In [None]:
df1 = pd.read_pickle("./data.pkl")

In [8]:
df2 = pd.read_excel("./Data/raw_train.xlsx")

Check how many entries are malware.

In [None]:
sum(df1['category'])

In [None]:
df1.shape

In [None]:
900/18000

Only 5% of entries are malware, so the classes are heavily imbalanced.
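
To make the imbalance concrete, here is a small worked calculation. It assumes that 18000 is the total number of rows and 900 the number of malware rows, as the `900/18000` cell suggests; the variable names are illustrative only.

```python
# Counts taken from the `900/18000` cell above (assumed: 18000 rows in total).
n_total = 18000
n_malware = 900
n_benign = n_total - n_malware

malware_fraction = n_malware / n_total    # share of the positive class
neg_pos_ratio = n_benign / n_malware      # benign rows per malware row
majority_baseline_acc = n_benign / n_total  # accuracy of always predicting "benign"

print(malware_fraction, neg_pos_ratio, majority_baseline_acc)
```

A classifier that always answers "benign" would already reach 95% accuracy, which is why F1 and ROC AUC are more informative here. The benign-to-malware ratio of 19 is the kind of value typically passed to XGBoost's `scale_pos_weight` when training on the imbalanced data directly.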

## EDA

### Visualizations go here:
### try plotting distributions of certain features to compare them between the malware and not malware tables

In [None]:
df1.head(5)

In [None]:
list(df1)

In [None]:
malware = df1[df1['category'] == 1]
not_malware = df1[df1['category'] == 0]

In [None]:
def dist_plotter(feature_name): #compare one feature's distribution across the two classes
    fig, ax = plt.subplots(2, 1, sharex=True)
    sns.violinplot(x=malware[feature_name], inner="quartile", ax=ax[0], color='r')
    ax[0].set_title(feature_name + ' (malware)')
    sns.violinplot(x=not_malware[feature_name], inner="quartile", ax=ax[1], color='b')
    ax[1].set_title(feature_name + ' (not malware)')
    fig.tight_layout()

In [None]:
dist_plotter('byteentropy_211')

## Undersampling

Since only 5% of entries are malware, we undersample the majority (not malware) class to get a 50/50 split between malware and not malware.

In [9]:
majority = df2[df2.category==0] # Majority class
minority = df2[df2.category==1] # Minority class

In [10]:
majority_undersampled = resample(majority, replace=False, n_samples=900) # Randomly selects 900 records from majority to match minority class size
# New downsampled dataset
df_undersampled = pd.concat([majority_undersampled, minority])  # Minority class + sample of 900 from majority
df_undersampled.category.value_counts()

0    900
1    900
Name: category, dtype: int64
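
For readers who want to see what random undersampling does without pandas or scikit-learn, the following is a minimal standard-library sketch. The function name `undersample`, the fixed seed, and the toy label list are assumptions made for illustration; the notebook itself uses `sklearn.utils.resample`.

```python
import random

def undersample(labels, seed=0):
    """Return sorted row indices with the majority class cut down to the minority size."""
    rng = random.Random(seed)  # fixed seed so the selection is repeatable
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    minority, majority = (pos, neg) if len(pos) <= len(neg) else (neg, pos)
    kept = rng.sample(majority, len(minority))  # sampling without replacement
    return sorted(minority + kept)

# Toy labels with the same 5% malware share as the data set (900 of 18000).
labels = [1] * 900 + [0] * 17100
idx = undersample(labels)
balanced = [labels[i] for i in idx]

print(len(idx), sum(balanced))  # number of rows kept, number of malware rows
```

One caveat: because undersampling happens before the train/test split, the test set is also 50/50, so its metrics may not reflect performance on data with the original 5% malware rate.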

In [7]:
Ember_Wrapper.create_vectorize_features(df_undersampled)
# Ember_Wrapper.create_vectorize_features(df2)



In [11]:
X = np.load('./X_data.npy')
y = np.load('./y_data.npy')
print('X.shape', X.shape)
print('%Malware:', sum(y) / len(y))

X.shape (1800, 2381)
%Malware: 0.5


In [12]:
full_X = X
full_y = y

In [None]:
# params = {'learning_rate': 0.66, 'max_depth': 1, 'n_estimators': 61}

In [13]:
X_train_f, X_test_f, y_train_f, y_test_f = train_test_split(full_X, full_y, test_size=0.2) #train test split 80/20

In [None]:
# #hyperparameter domain to search over
# hyperparam_space = {
#     'max_depth': hp.choice('max_depth', np.arange(1, 10, 1, dtype=int)), #larger values = overfitting
#     'n_estimators': hp.choice('n_estimators', np.arange(100, 500, 1, dtype=int)), #larger values = overfitting
#     'learning_rate': hp.quniform('learning_rate', 0, 1, 0.01), #aka eta = step size shrinkage to prevent overfitting
#     'gamma': hp.quniform('gamma', 0, 1, 0.05), #gamma: min loss reduction to partition leaf nodes (for overfitting)
# #     'min_child_weight': hp.quniform('min_child_weight', 1, 8, 0.5),
# #     'subsample': hp.quniform('subsample', 0.5, 1, 0.05),
# }

# def xgb_score(params): #function to train and test different hyperparams
#     model = xgb.XGBClassifier(**params, eval_metric='logloss')
#     model.fit(X_train_fs, y_train, early_stopping_rounds=20,
#              eval_set=[(X_train_fs, y_train), (X_test_fs, y_test)])
#     score = -cross_val_score(model, X_train_fs, y_train, cv=10, scoring='roc_auc').mean()
#     print(score)
#     return {'loss': score, 'status': STATUS_OK}
            
# def xgb_optimize(trials, space): #fmin is the main library function
#     best = fmin(xgb_score, space, algo=tpe.suggest, max_evals=5)
#     return best
            
# trials = Trials() #database that store completed hyperparameters and score
# best_xgb_params = xgb_optimize(trials, hyperparam_space) #calls fmin

In [None]:
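#NOTE: scale_pos_weight=19 is the benign/malware ratio of the original 95/5 data;
#X here is the undersampled 50/50 set, so this weight may over-emphasize the malware class
xgb_model_f= xgb.XGBClassifier(eval_metric='logloss', scale_pos_weight=19)
xgb_model_f.fit(X_train_f, y_train_f)
preds_f = xgb_model_f.predict_proba(X_test_f)[:, 1] #probability of the malware class
score_f = roc_auc_score(y_test_f, preds_f)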
print('auc_roc: ', score_f)

y_pred_f = xgb_model_f.predict(X_test_f)
f1_f=f1_score(y_test_f, y_pred_f)
print('F1 Score:', f1_f)
print(confusion_matrix(y_test_f, y_pred_f))
print(classification_report(y_test_f, y_pred_f))

In [9]:
df = df_undersampled #make it a new df

In [10]:
X = df.drop('category', axis=1) #X is everything but the 'category' col

In [11]:
y = df['category'] #y is just the target column

In [12]:
X = X.select_dtypes(include='number') #keep only numeric columns (public equivalent of _get_numeric_data)

In [13]:
print(X.shape, y.shape) #sanity check arrays

(1800, 0) (1800,)


## Feature Selection with ANOVA

Ideally, features would be selected after inspecting heatmaps, correlation plots, and distributions, but that step is skipped for now.

ANOVA-based selection is automatic, but it is more of a "black box".
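
To open the "black box" a little, the sketch below computes the one-way ANOVA F statistic for a single feature by hand. This is the textbook formula (between-group variance divided by within-group variance), written in plain Python; the function name `anova_f` and the toy values are made up, and scikit-learn's `f_classif` should be preferred in practice.

```python
def anova_f(groups):
    """One-way ANOVA F statistic for a list of groups of numbers."""
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Variation of the samples around their own group mean.
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    df_between = len(groups) - 1
    df_within = n_total - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)

# Hypothetical feature values for benign (first list) and malware (second list).
separating_feature = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]    # class means differ
uninformative_feature = [[1.0, 5.0, 3.0], [2.0, 4.0, 3.0]]  # class means are equal

print(anova_f(separating_feature))
print(anova_f(uninformative_feature))
```

A feature whose class means are far apart relative to the spread inside each class gets a large F value, and `SelectKBest` keeps the `k` features with the largest scores.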

In [14]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2) #train test split 80/20

In [None]:
# fs = SelectKBest(score_func=f_classif, k=600) #THIS K IS HOW MANY FEATURES YOU WANT
# fs.fit(X_train, y_train)
# mask = fs.get_support()
# new_features = X_train.columns[mask]

In [None]:
# X_train_fs = X_train[new_features]
# X_test_fs = X_test[new_features]

In [15]:
X_train_fs = X_train
X_test_fs = X_test

In [None]:
# for i in range(len(fs.scores_[:10])):
#     print('Feature %d: %f' % (i, fs.scores_[i]))
# # plot the scores
#     plt.bar([i for i in range(len(fs.scores_))], fs.scores_)
#     plt.show()

Now we can use X_train_fs and X_test_fs in place of X_train and X_test.

In [16]:
print(X_train_fs.shape, X_test_fs.shape)

(1440, 2381) (360, 2381)


In [17]:
print(y_train.shape, y_test.shape)

(1440,) (360,)


## Modeling

This function fits a specified model and outputs its F1 score, confusion matrix, and ROC curve. More graphs/plots can be added to it as well.

In [None]:
def modeler(model):
    model.fit(X_train_fs, y_train) #fit specified model
    y_pred = model.predict(X_test_fs) #predict on test set
    f1 = f1_score(y_test, y_pred) #get f1 score
    print('F1 Score:', f1)
    print(metrics.confusion_matrix(y_test, y_pred))
    
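    #this plots the ROC curve, play around with it
    #roc_curve expects (y_true, y_score); hard labels give only one interior point,
    #so pass predict_proba or decision_function scores for a smoother curve
    fpr, tpr, thresholds = roc_curve(y_test, y_pred)
    roc_auc = auc(fpr, tpr)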

    plt.figure()
    plt.plot(fpr, tpr, color='darkorange', label='ROC curve (area = %0.2f)' % roc_auc)
    plt.plot([0, 1], [0, 1], color='navy', linestyle='--')
    plt.xlim([0.0, 1.0])
    plt.ylim([0.0, 1.05])
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title('Receiver operating characteristic')
    plt.legend(loc="lower right")
    plt.show()

In [None]:
lr = LogisticRegression()
modeler(lr)

In [None]:
# lgbm = lgb.LGBMClassifier()
# modeler(lgbm)

In [None]:
xgbc = xgb.XGBClassifier()
modeler(xgbc)

In [None]:
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier()
modeler(rf)

In [None]:
from sklearn.svm import SVC
svm = SVC()
modeler(svm)

In [None]:
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier()
modeler(knn)

In [None]:
from sklearn import tree
dt = tree.DecisionTreeClassifier()
modeler(dt)

The best ones seem to be LGBM and XGB. An ideal ROC curve hugs the top left corner.
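
The `modeler` function passes hard class predictions to `roc_curve`. The sketch below, in plain Python, shows why continuous scores are usually more informative: it computes AUC as the share of (malware, benign) pairs that are ranked correctly, which is equivalent to the area under the ROC curve. The function name `roc_auc` and the toy scores are made up for illustration.

```python
def roc_auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

y_true = [0, 0, 1, 1]
probs = [0.1, 0.4, 0.45, 0.8]            # made-up predicted probabilities
hard = [int(p >= 0.5) for p in probs]    # the same predictions cut at 0.5

auc_probs = roc_auc(y_true, probs)  # every malware sample outranks every benign one
auc_hard = roc_auc(y_true, hard)    # thresholding creates ties and loses ranking detail

print(auc_probs, auc_hard)
```

With scores, the ranking is perfect even though one malware sample falls below the 0.5 cut; with hard labels that information is lost and the AUC drops.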

## Hyperparameter Optimization

Time to tune with Hyperopt.
Scaling or normalizing isn't needed for gradient-boosted decision trees, so that step is skipped.
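
The claim about scaling can be checked on a single tree split. The following is a toy sketch, not XGBoost's split finder: it searches for the threshold with the lowest weighted Gini impurity and shows that standardizing the feature leaves the resulting partition unchanged, because a split depends only on the ordering of the values. The function names and data are hypothetical.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)

def best_partition(xs, ys):
    """Return the set of row indices sent left by the lowest-impurity threshold split."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    best_score, best_left = None, None
    for cut in range(1, len(order)):
        if xs[order[cut - 1]] == xs[order[cut]]:
            continue  # cannot split between equal values
        left, right = order[:cut], order[cut:]
        score = (len(left) * gini([ys[i] for i in left])
                 + len(right) * gini([ys[i] for i in right])) / len(order)
        if best_score is None or score < best_score:
            best_score, best_left = score, frozenset(left)
    return best_left

xs = [3.0, 10.0, 250.0, 4000.0, 12.0, 7.0]   # made-up feature on a wide scale
ys = [0, 0, 1, 1, 1, 0]

# Standardize the feature (zero mean, unit variance).
mean = sum(xs) / len(xs)
std = (sum((x - mean) ** 2 for x in xs) / len(xs)) ** 0.5
scaled = [(x - mean) / std for x in xs]

same_split = best_partition(xs, ys) == best_partition(scaled, ys)
print(sorted(best_partition(xs, ys)), same_split)
```

Models such as logistic regression, SVM, and KNN do depend on feature scale, so this shortcut applies only to the tree-based models.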

# EVERYONE CAN TRY OPTIMIZING WITH DIFFERENT HYPERPARAMETERS LIKE IN THE COMMENTED OUT LINES, check the user docs for more parameters

In [None]:
# #hyperparameter domain to search over
# hyperparam_space = {
#     'num_leaves':       hp.choice('num_leaves', np.arange(30, 250, 1)),
#     'learning_rate':    hp.quniform('learning_rate', 0, 0.3, 0.01),
#     'max_depth':        hp.choice('max_depth', np.arange(2, 100, 1, dtype=int)),
#     'min_child_weight': hp.choice('min_child_weight', np.arange(1, 50, 1, dtype=int)),
#     'colsample_bytree': hp.uniform('colsample_bytree', 0.4, 1),
#     'subsample':        hp.uniform('subsample', 0.5, 1),
# }

The following code takes a couple of minutes to run:

In [None]:
# def score(params): #function to train and test different hyperparams
#     model = lgb.LGBMClassifier(**params)
#     model.fit(X_train_fs, y_train, early_stopping_rounds=20,
#              eval_set=[(X_train_fs, y_train), (X_test_fs, y_test)])
# #     y_pred = model.predict(X_test_fs)
# #     score = mean_squared_error(y_test, y_pred)
#     score = -cross_val_score(model, X_train_fs, y_train, cv=10, scoring='roc_auc').mean()
#     print(score)
#     return score
            
# def optimize(trials, space): #fmin is the main library function
#     best = fmin(score, space, algo=tpe.suggest, max_evals=10)
#     return best
            
# trials = Trials() #database that store completed hyperparameters and score
# best_params = optimize(trials, hyperparam_space) #calls fmin

# #finds best hyperparameters
# # space_eval(hyperparam_space, best_params)

In [None]:
# best_params

In [None]:
# lgbm_model = lgb.LGBMClassifier(**best_params)
# lgbm_model.fit(X_train_fs, y_train)
# preds = [pred[1] for pred in lgbm_model.predict_proba(X_test_fs)]
# score = roc_auc_score(y_test, preds, average='weighted')
# print('auc_roc score: ', score)

# y_pred = lgbm_model.predict(X_test_fs)
# f1 = f1_score(y_test, y_pred) #get f1 score
# print('F1 Score:', f1)
# print(confusion_matrix(y_test, y_pred))
# print(classification_report(y_test, y_pred))

In [None]:
# shap_values = shap.TreeExplainer(lgbm_model).shap_values(X_test_fs)
# shap.summary_plot(shap_values, X_test_fs)

In [None]:
# lgb.plot_importance(lgbm_model, max_num_features=20)
# plt.figure(figsize=(25, 12))
# plt.show()

In [20]:
#hyperparameter domain to search over
hyperparam_space = {
    'max_depth': hp.choice('max_depth', np.arange(1, 10, 1, dtype=int)), #larger values = overfitting
    'n_estimators': hp.choice('n_estimators', np.arange(100, 500, 1, dtype=int)), #larger values = overfitting
    'learning_rate': hp.quniform('learning_rate', 0, 1, 0.01), #aka eta = step size shrinkage to prevent overfitting
#     'gamma': hp.quniform('gamma', 0, 1, 0.05), #gamma: min loss reduction to partition leaf nodes (for overfitting)
#     'min_child_weight': hp.quniform('min_child_weight', 1, 8, 0.5),
#     'subsample': hp.quniform('subsample', 0.5, 1, 0.05),
}

In [21]:
def xgb_score(params): #function to train and test different hyperparams
    model = xgb.XGBClassifier(**params, eval_metric='logloss')
    model.fit(X_train_fs, y_train, early_stopping_rounds=20,
             eval_set=[(X_train_fs, y_train), (X_test_fs, y_test)])
    score = -cross_val_score(model, X_train_fs, y_train, cv=10, scoring='roc_auc').mean()
    print(score)
    return {'loss': score, 'status': STATUS_OK}
            
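def xgb_optimize(trials, space): #fmin is the main library function
    #NOTE: for hp.choice parameters fmin returns the index of the chosen option, not its value;
    #space_eval(space, best) converts the result back to actual hyperparameter values
    best = fmin(xgb_score, space, algo=tpe.suggest, max_evals=5, trials=trials)
    return best

trials = Trials() #database that stores completed hyperparameters and scores
best_xgb_params = xgb_optimize(trials, hyperparam_space) #calls fmin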

[0]	validation_0-logloss:0.32467	validation_1-logloss:0.48172                                                                              
[1]	validation_0-logloss:0.18556	validation_1-logloss:0.39888                                                                              
[2]	validation_0-logloss:0.11201	validation_1-logloss:0.33559                                                                              
[3]	validation_0-logloss:0.07417	validation_1-logloss:0.30651                                                                              
[4]	validation_0-logloss:0.05116	validation_1-logloss:0.30095                                                                              
[5]	validation_0-logloss:0.03927	validation_1-logloss:0.28558                                                                              
[6]	validation_0-logloss:0.03047	validation_1-logloss:0.27588                                                                              
...

[86]	validation_0-logloss:0.06649	validation_1-logloss:0.23440                                                                             
[87]	validation_0-logloss:0.06498	validation_1-logloss:0.23370                                                                             
[88]	validation_0-logloss:0.06355	validation_1-logloss:0.23304                                                                             
[89]	validation_0-logloss:0.06237	validation_1-logloss:0.23292                                                                             
[90]	validation_0-logloss:0.06132	validation_1-logloss:0.23234                                                                             
[91]	validation_0-logloss:0.06074	validation_1-logloss:0.23179                                                                             
[92]	validation_0-logloss:0.05998	validation_1-logloss:0.23209                                                                             
...

[38]	validation_0-logloss:0.24572	validation_1-logloss:0.34045                                                                             
[39]	validation_0-logloss:0.24105	validation_1-logloss:0.33753                                                                             
[40]	validation_0-logloss:0.23727	validation_1-logloss:0.33446                                                                             
[41]	validation_0-logloss:0.23301	validation_1-logloss:0.33098                                                                             
[42]	validation_0-logloss:0.22880	validation_1-logloss:0.32843                                                                             
[43]	validation_0-logloss:0.22457	validation_1-logloss:0.32540                                                                             
[44]	validation_0-logloss:0.22140	validation_1-logloss:0.32320                                                                             
...

[154]	validation_0-logloss:0.06829	validation_1-logloss:0.23471                                                                            
[155]	validation_0-logloss:0.06779	validation_1-logloss:0.23399                                                                            
[156]	validation_0-logloss:0.06716	validation_1-logloss:0.23359                                                                            
[157]	validation_0-logloss:0.06681	validation_1-logloss:0.23286                                                                            
[158]	validation_0-logloss:0.06612	validation_1-logloss:0.23291                                                                            
[159]	validation_0-logloss:0.06558	validation_1-logloss:0.23241                                                                            
[160]	validation_0-logloss:0.06534	validation_1-logloss:0.23185                                                                            
...

[48]	validation_0-logloss:0.16475	validation_1-logloss:0.26552                                                                             
[49]	validation_0-logloss:0.16278	validation_1-logloss:0.26505                                                                             
[50]	validation_0-logloss:0.16024	validation_1-logloss:0.26285                                                                             
[51]	validation_0-logloss:0.15772	validation_1-logloss:0.26204                                                                             
[52]	validation_0-logloss:0.15513	validation_1-logloss:0.26164                                                                             
[53]	validation_0-logloss:0.15356	validation_1-logloss:0.26100                                                                             
[54]	validation_0-logloss:0.15164	validation_1-logloss:0.25928                                                                             
...

[5]	validation_0-logloss:0.12913	validation_1-logloss:0.29678                                                                              
[6]	validation_0-logloss:0.10780	validation_1-logloss:0.29117                                                                              
[7]	validation_0-logloss:0.09771	validation_1-logloss:0.29754                                                                              
[8]	validation_0-logloss:0.07870	validation_1-logloss:0.29440                                                                              
[9]	validation_0-logloss:0.06839	validation_1-logloss:0.28947                                                                              
[10]	validation_0-logloss:0.05931	validation_1-logloss:0.29354                                                                             
[11]	validation_0-logloss:0.05024	validation_1-logloss:0.28750                                                                             
...

In [24]:
best_xgb_params

{'learning_rate': 0.21, 'max_depth': 1, 'n_estimators': 392}
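
One thing to keep in mind when reading this result: for parameters defined with `hp.choice`, Hyperopt's `fmin` returns the index of the selected option rather than the option itself. Hyperopt provides `space_eval(hyperparam_space, best_xgb_params)` to decode it. The sketch below mimics that decoding with plain Python ranges that mirror the `np.arange` calls in `hyperparam_space`; the variable names are illustrative.

```python
# Candidate lists mirroring the search space defined above:
# np.arange(1, 10, 1) and np.arange(100, 500, 1).
max_depth_choices = list(range(1, 10))
n_estimators_choices = list(range(100, 500))

# Result as returned by fmin (hp.choice entries are indices).
best = {'learning_rate': 0.21, 'max_depth': 1, 'n_estimators': 392}

decoded = {
    'learning_rate': best['learning_rate'],                       # quniform: already a value
    'max_depth': max_depth_choices[best['max_depth']],            # index 1 -> depth 2
    'n_estimators': n_estimators_choices[best['n_estimators']],   # index 392 -> 492 trees
}
print(decoded)
```

Under this reading, the configuration that scored best during the search was `max_depth=2` with `n_estimators=492`, which differs from what is passed to the model if the raw dictionary is used directly.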

In [25]:
# params = {'learning_rate': 0.66, 'max_depth': 1, 'n_estimators': 61}

In [26]:
xgb_model = xgb.XGBClassifier(**best_xgb_params, eval_metric='logloss')
xgb_model.fit(X_train_fs, y_train)
preds = [pred[1] for pred in xgb_model.predict_proba(X_test_fs)]
score = roc_auc_score(y_test, preds, average='weighted')
print('auc_roc: ', score)

y_pred = xgb_model.predict(X_test_fs)
f1=f1_score(y_test, y_pred)
print('F1 Score:', f1)
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))

auc_roc:  0.9685758513931888
F1 Score: 0.9067357512953369
[[149  21]
 [ 15 175]]
              precision    recall  f1-score   support

           0       0.91      0.88      0.89       170
           1       0.89      0.92      0.91       190

    accuracy                           0.90       360
   macro avg       0.90      0.90      0.90       360
weighted avg       0.90      0.90      0.90       360
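
As a sanity check, the reported metrics can be reproduced by hand from the confusion matrix above. The sketch relies on scikit-learn's convention that rows are true classes and columns are predicted classes, so the matrix reads `[[TN, FP], [FN, TP]]`.

```python
# Values copied from the confusion matrix printed above.
tn, fp = 149, 21
fn, tp = 15, 175

precision = tp / (tp + fp)   # share of predicted malware that really is malware
recall = tp / (tp + fn)      # share of actual malware that was caught
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.4f} accuracy={accuracy:.2f}")
```

The results agree with the class `1` row of the classification report and with the printed F1 score.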



In [27]:
xgb_model.save_model("xgb_model.txt")

In [6]:
saved_model = xgb.Booster()
saved_model.load_model("xgb_model.txt")

In [19]:
X_test_D = xgb.DMatrix(X_test_fs)

In [24]:
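threshold = 0.6622 #from find_threshold at the 1% FPR target; tuned on this same test set, so these scores are optimistic
y_pred = saved_model.predict(X_test_D) #Booster.predict returns malware probabilities here
y_label = y_pred > threshold
f1=f1_score(y_test, y_label)
print('F1 Score:', f1)
print(confusion_matrix(y_test, y_label))
print(classification_report(y_test, y_label))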

F1 Score: 0.9711286089238845
[[164   1]
 [ 10 185]]
              precision    recall  f1-score   support

           0       0.94      0.99      0.97       165
           1       0.99      0.95      0.97       195

    accuracy                           0.97       360
   macro avg       0.97      0.97      0.97       360
weighted avg       0.97      0.97      0.97       360



In [23]:
def get_fpr(y_true, y_pred):
    nbenign = (y_true == 0).sum()
    nfalse = (y_pred[y_true == 0] == 1).sum()
    return nfalse / float(nbenign)


def find_threshold(y_true, y_pred, fpr_target):
    thresh = 0.0
    fpr = get_fpr(y_true, y_pred > thresh)
    while fpr > fpr_target and thresh < 1.0:
        thresh += 0.0001
        fpr = get_fpr(y_true, y_pred > thresh)
    return thresh, fpr

# testdf = emberdf[emberdf["subset"] == "test"]
print("ROC AUC:", roc_auc_score(y_test, y_pred))
print()

threshold, fpr = find_threshold(y_test, y_pred, 0.01)
fnr = (y_pred[y_test == 1] < threshold).sum() / float((y_test == 1).sum())
print("Ember Model Performance at 1% FPR:")
print("Threshold: {:.4f}".format(threshold))
print("False Positive Rate: {:.3f}%".format(fpr * 100))
print("False Negative Rate: {:.3f}%".format(fnr * 100))
print("Detection Rate: {}%".format(100 - fnr * 100))
print()

threshold, fpr = find_threshold(y_test, y_pred, 0.001)
fnr = (y_pred[y_test == 1] < threshold).sum() / float((y_test == 1).sum())
print("Ember Model Performance at 0.1% FPR:")
print("Threshold: {:.4f}".format(threshold))
print("False Positive Rate: {:.3f}%".format(fpr * 100))
print("False Negative Rate: {:.3f}%".format(fnr * 100))
print("Detection Rate: {}%".format(100 - fnr * 100))

ROC AUC: 0.9970163170163171

Ember Model Performance at 1% FPR:
Threshold: 0.6622
False Positive Rate: 0.606%
False Negative Rate: 5.128%
Detection Rate: 94.87179487179488%

Ember Model Performance at 0.1% FPR:
Threshold: 0.7413
False Positive Rate: 0.000%
False Negative Rate: 6.667%
Detection Rate: 93.33333333333333%
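
The rates at the 1% target can be cross-checked against the confusion matrix printed earlier for the 0.6622 threshold (`[[164, 1], [10, 185]]`). This is plain arithmetic on those counts; the variable names are illustrative.

```python
# Counts from the confusion matrix at threshold 0.6622: [[TN, FP], [FN, TP]].
tn, fp = 164, 1
fn, tp = 10, 185

n_benign = tn + fp     # 165 benign test samples
n_malware = fn + tp    # 195 malware test samples

fpr = fp / n_benign            # false positive rate
fnr = fn / n_malware           # false negative rate
detection_rate = 1 - fnr
smallest_nonzero_fpr = 1 / n_benign  # FPR caused by a single false positive

print(f"FPR={fpr * 100:.3f}% FNR={fnr * 100:.3f}% detection={detection_rate * 100:.2f}%")
print(f"smallest non-zero FPR={smallest_nonzero_fpr * 100:.3f}%")
```

With only 165 benign test samples, a single false positive already costs about 0.6%, so a 0.1% target can only be met with zero false positives. That matches the 0.000% reported above, and it means the low-FPR estimates from a test set of this size are coarse.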


In [None]:
shap_values = shap.TreeExplainer(xgb_model).shap_values(X_test_fs)
shap.summary_plot(shap_values, X_test_fs)

In [None]:
fig, ax = plt.subplots(figsize=(25, 12)) #create the figure first so the size applies to the importance plot
xgb.plot_importance(xgb_model, ax=ax, importance_type='cover', max_num_features=20)
plt.show()

## Conclusions

## Visualizations should include the above ROC plots, SHAP plots, as well as the built in feature importance plots.

# scrap the below in final version

### Anomaly Detection Approach??

In [None]:
df1.shape #go back to original dataset

In [None]:
df1 = df1._get_numeric_data() #drop nonnumeric columns

#### Isolation Forest

Setting contamination rate to 5% as seen in original data, using 200 estimators.

In [None]:
iso = IsolationForest(n_estimators=200, max_samples='auto', contamination=0.05)
new_df = df1.drop(columns=['category'])
iso.fit(new_df)
new_df['anomaly_score'] = iso.predict(new_df)
new_df[new_df['anomaly_score'] == -1].shape

As expected, found 900 "outliers" (in our case malware).
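
It is worth noting that the count of roughly 900 follows from the `contamination` setting, not from the labels: Isolation Forest places its cutoff so that about that fraction of the training rows is flagged. The arithmetic sketch below assumes 18000 rows and 900 malware rows (from the `900/18000` cell) and shows how the F1 score then depends entirely on how many flagged rows are actually malware. The helper `f1_from_overlap` is made up for this example.

```python
n_rows = 18000          # assumed total number of rows
n_malware = 900         # assumed number of malware rows
contamination = 0.05

n_flagged = round(n_rows * contamination)  # rows the model is set up to flag

def f1_from_overlap(tp, n_flagged, n_malware):
    """F1 score when `tp` of the flagged rows are truly malware."""
    if tp == 0:
        return 0.0
    precision = tp / n_flagged
    recall = tp / n_malware
    return 2 * precision * recall / (precision + recall)

# If the flagged rows were picked at random, this many would be malware on average.
chance_overlap = n_flagged * n_malware / n_rows
f1_chance = f1_from_overlap(chance_overlap, n_flagged, n_malware)
f1_perfect = f1_from_overlap(n_malware, n_flagged, n_malware)

print(n_flagged, chance_overlap, f1_chance, f1_perfect)
```

So matching counts alone says nothing about quality: an outlier detector whose flags are unrelated to the malware label would still flag about 900 rows but score an F1 near 0.05.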

In [None]:
iso_df = pd.concat([new_df, df1['category']], axis=1) #join the anomaly prediction with og data
iso_df.head(10)

In [None]:
iso_df.groupby('anomaly_score').size() #sanity check

In [None]:
iso_df['anomaly'] = (iso_df['anomaly_score'] == -1).astype(int) #map -1 (outlier) to 1 and 1 (inlier) to 0, keeping an integer dtype

In [None]:
print(iso_df.groupby('anomaly').size(), iso_df.groupby('category').size())

In [None]:
iso_df.head(5)

Calculate f1 score between new anomaly column and given category column.

In [None]:
f1_score(list(iso_df['category']), list(iso_df['anomaly']))

### what happened???