<a href="https://colab.research.google.com/github/Paradoxxs/Paradoxxs.github.io/blob/main/Simple_PE_scanner.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [16]:
import pandas as pd
import numpy as np
import pickle
import sklearn.ensemble as ske
from sklearn.model_selection import train_test_split
from sklearn import tree, linear_model
from sklearn.feature_selection import SelectFromModel
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import confusion_matrix

Load data

In [17]:
data = pd.read_csv('/content/drive/MyDrive/Data science/Data/antivirus_demo_data.csv', sep='|')
X = data.drop(['Name', 'md5', 'legitimate'], axis=1).values
y = data['legitimate'].values

print('Researching important feature based on %i total features\n' % X.shape[1])

Researching important feature based on 54 total features



Feature selection using an extra-trees classifier

In [18]:
fsel = ske.ExtraTreesClassifier().fit(X, y)
model = SelectFromModel(fsel, prefit=True)
X_new = model.transform(X)
nb_features = X_new.shape[1]
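
As a rough illustration of what the selection step does: with its default threshold, SelectFromModel keeps a tree-based feature when its importance is at least the mean importance over all features. The sketch below checks this on synthetic data. The names X_demo, y_demo, forest and selector are placeholders, and make_classification only stands in for the PE feature table.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import SelectFromModel

# Synthetic stand-in for the PE feature matrix (an assumption, not the real data).
X_demo, y_demo = make_classification(
    n_samples=500, n_features=20, n_informative=5, random_state=0
)

forest = ExtraTreesClassifier(n_estimators=50, random_state=0).fit(X_demo, y_demo)
selector = SelectFromModel(forest, prefit=True)

# Default rule for tree ensembles: keep a feature when its importance is
# at least the mean importance across all features.
importances = forest.feature_importances_
mask_manual = importances >= importances.mean()
mask_selector = selector.get_support()

print("kept by SelectFromModel:", mask_selector.sum())
print("kept by manual threshold:", mask_manual.sum())

# Reduced matrix, equivalent to what transform() would return.
X_demo_new = X_demo[:, mask_selector]
```

Because ExtraTreesClassifier is randomized, the number of selected features can change between runs unless random_state is fixed.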

Split the data into training and test sets

In [19]:
X_train, X_test, y_train, y_test = train_test_split(X_new, y, test_size=0.2)
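
The split above is random and unseeded, so the scores further down will vary slightly from run to run. A possible variant, shown here on toy arrays, fixes random_state and passes stratify so that both parts keep the same class ratio. The seed value 42 and the arrays X_demo and y_demo are arbitrary choices for this sketch, not part of the original notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 700 samples of class 0 and 300 of class 1 (made-up numbers).
X_demo = np.arange(1000).reshape(-1, 1)
y_demo = np.array([0] * 700 + [1] * 300)

# random_state makes the split repeatable; stratify keeps the 70/30
# class ratio in both the training and the test part.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print("train positives:", y_tr.sum(), "of", len(y_tr))
print("test positives:", y_te.sum(), "of", len(y_te))
```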

In [20]:
features = []

print('%i features identified as important:' % nb_features)

13 features identified as important:


In [21]:
indices = np.argsort(fsel.feature_importances_)[::-1][:nb_features]
for f in range(nb_features):
    print("%d. feature %s (%f)" % (f + 1, data.columns[2+indices[f]], fsel.feature_importances_[indices[f]]))




1. feature DllCharacteristics (0.137176)
2. feature Characteristics (0.107942)
3. feature Machine (0.107144)
4. feature VersionInformationSize (0.060890)
5. feature ImageBase (0.060444)
6. feature SectionsMaxEntropy (0.057674)
7. feature MajorSubsystemVersion (0.057278)
8. feature Subsystem (0.054732)
9. feature ResourcesMaxEntropy (0.046115)
10. feature SizeOfOptionalHeader (0.038017)
11. feature ResourcesMinEntropy (0.037360)
12. feature MajorOperatingSystemVersion (0.027987)
13. feature SectionsMinEntropy (0.021433)


Collect the names of the selected features (in column order)

In [None]:
for f in sorted(np.argsort(fsel.feature_importances_)[::-1][:nb_features]):
    features.append(data.columns[2+f])
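
The `2+f` offset above relies on the CSV layout: Name and md5 come first and legitimate comes last. A sketch of an alternative that avoids the offset is to index the column names of the feature-only frame with the selector's boolean mask. The miniature table below (demo, feat_0 to feat_5) is hypothetical and only mimics that layout.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import SelectFromModel

rng = np.random.default_rng(0)

# Hypothetical miniature table: two identifier columns first, label last.
demo = pd.DataFrame(rng.normal(size=(300, 6)),
                    columns=[f"feat_{i}" for i in range(6)])
demo.insert(0, "md5", "x")
demo.insert(0, "Name", "y")
demo["legitimate"] = (demo["feat_0"] + demo["feat_3"] > 0).astype(int)

feature_frame = demo.drop(["Name", "md5", "legitimate"], axis=1)
forest = ExtraTreesClassifier(n_estimators=50, random_state=0)
forest.fit(feature_frame.values, demo["legitimate"].values)
selector = SelectFromModel(forest, prefit=True)

# Boolean mask over feature_frame's columns, already in column order.
names_by_mask = list(feature_frame.columns[selector.get_support()])

# The offset approach from the cell above, for comparison.
top = sorted(np.argsort(forest.feature_importances_)[::-1][:len(names_by_mask)])
names_by_offset = [demo.columns[2 + i] for i in top]

print(names_by_mask)
print(names_by_offset)
```

Both lists should agree as long as no two importances tie exactly at the threshold.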

Define the algorithms to compare

In [22]:

algorithms = {
    "DecisionTree": tree.DecisionTreeClassifier(max_depth=10),
    "RandomForest": ske.RandomForestClassifier(n_estimators=50),
    "GradientBoosting": ske.GradientBoostingClassifier(n_estimators=50),
    "AdaBoost": ske.AdaBoostClassifier(n_estimators=100),
    "GNB": GaussianNB()
}

Train each algorithm and score it on the test set

In [23]:
results = {}
print("\nNow testing algorithms")
for algo in algorithms:
    clf = algorithms[algo]
    clf.fit(X_train, y_train)
    score = clf.score(X_test, y_test)
    print("%s : %f %%" % (algo, score*100))
    results[algo] = score


Now testing algorithms
DecisionTree : 99.003984 %
RandomForest : 99.340819 %
GradientBoosting : 98.735965 %
AdaBoost : 98.551250 %
GNB : 69.949294 %
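
For classifiers, `score` returns the mean accuracy, that is, the fraction of test samples whose predicted label matches the true label. The sketch below checks this on synthetic data. The names X_demo, y_demo and clf_demo are placeholders, and the data comes from make_classification rather than the PE dataset.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption, not the PE features).
X_demo, y_demo = make_classification(n_samples=400, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

clf_demo = DecisionTreeClassifier(max_depth=10, random_state=0).fit(X_tr, y_tr)

# score() and a hand-computed accuracy should give the same number.
score = clf_demo.score(X_te, y_te)
manual = np.mean(clf_demo.predict(X_te) == y_te)
print("score: %f, manual accuracy: %f" % (score, manual))
```

Accuracy alone does not say which kind of error a model makes, which is why the last cell of this notebook looks at the confusion matrix.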


Print the algorithm with the highest score.

In [27]:
winner = max(results, key=results.get)
print(' %s with a %f %%' % (winner, results[winner]*100))

 RandomForest with a 99.340819 % 


Save the algorithm and the feature list for later predictions

In [30]:
with open('./features.pkl', 'wb') as f:
    pickle.dump(features, f)

Saving algorithm and feature list in classifier directory...
Saved
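
One way to store the trained classifier next to the feature list and load both back later is sketched below. The file name classifier.pkl, the temporary directory, and the toy model are assumptions made for this example. Pickle files should only be loaded from sources you trust.

```python
import os
import pickle
import tempfile

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Toy model and placeholder feature names standing in for the real ones.
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)
clf_demo = RandomForestClassifier(n_estimators=10, random_state=0).fit(X_demo, y_demo)
feature_names = [f"feat_{i}" for i in range(8)]

out_dir = tempfile.mkdtemp()
clf_path = os.path.join(out_dir, "classifier.pkl")  # hypothetical file name
feat_path = os.path.join(out_dir, "features.pkl")

# Save both objects; the with-blocks close the files even on error.
with open(clf_path, "wb") as fh:
    pickle.dump(clf_demo, fh)
with open(feat_path, "wb") as fh:
    pickle.dump(feature_names, fh)

# Load them back, as a later prediction script would.
with open(clf_path, "rb") as fh:
    clf_loaded = pickle.load(fh)
with open(feat_path, "rb") as fh:
    features_loaded = pickle.load(fh)

same = bool((clf_loaded.predict(X_demo) == clf_demo.predict(X_demo)).all())
print("round trip preserved predictions:", same)
```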


In [31]:
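# False positive and false negative rates of the winning classifier.
# Label 1 (legitimate) is the positive class here, so a false positive
# is a malicious file that was classified as legitimate.
clf = algorithms[winner]
res = clf.predict(X_test)
mt = confusion_matrix(y_test, res)  # rows: true label, columns: predicted label
print("False positive rate : %f %%" % (mt[0][1] / float(sum(mt[0])) * 100))
print("False negative rate : %f %%" % (mt[1][0] / float(sum(mt[1])) * 100))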

False positive rate : 0.476388 %
False negative rate : 1.084599 %
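
To make the indexing into the confusion matrix concrete, here is a small hand-checkable example with made-up labels. confusion_matrix puts true labels on the rows and predicted labels on the columns, so for binary labels 0 and 1 the flattened matrix reads tn, fp, fn, tp.

```python
from sklearn.metrics import confusion_matrix

# Made-up labels: 0 = malicious, 1 = legitimate (same coding as the dataset).
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 1, 1, 0, 0]

# Row = true label, column = predicted label.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

fpr = fp / (fp + tn)  # share of true 0s predicted as 1
fnr = fn / (fn + tp)  # share of true 1s predicted as 0

print("tn=%d fp=%d fn=%d tp=%d" % (tn, fp, fn, tp))
print("False positive rate : %f %%" % (fpr * 100))
print("False negative rate : %f %%" % (fnr * 100))
```

With this label coding, the false positive rate measures malware that slips through as legitimate, and the false negative rate measures legitimate files flagged as malicious.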
