<a href="https://colab.research.google.com/github/Liza-b-13/liza-repo/blob/main/notebooks_test/dection_PE_Headers.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### 1. **Machine learning malware detection using PE headers**

For the first tests, I will be using this dataset:

📁->Malware dataset : [link](https://github.com/PacktPublishing/Mastering-Machine-Learning-for-Penetration-Testing/blob/master/Chapter03/MalwareData.csv.gz)

Dataset Created by : 🧑‍💻Prateek Lalwani


-> **Legitimate Files** 41,323 Windows binaries

-> **Malware Files** 96,724 downloaded from the VirusShare website

-> 138,047 **in total**



---


Load and preview the dataset:

In [None]:
import pandas as pd
url = "https://raw.githubusercontent.com/PacktPublishing/Mastering-Machine-Learning-for-Penetration-Testing/master/Chapter03/MalwareData.csv.gz"

# Reads directly from the gzip file
MalwareDataset = pd.read_csv(url, compression='gzip', sep='|', low_memory=False)

# Check shape and preview
print(MalwareDataset.shape)
MalwareDataset.head()

(138047, 57)


Unnamed: 0,Name,md5,Machine,SizeOfOptionalHeader,Characteristics,MajorLinkerVersion,MinorLinkerVersion,SizeOfCode,SizeOfInitializedData,SizeOfUninitializedData,...,ResourcesNb,ResourcesMeanEntropy,ResourcesMinEntropy,ResourcesMaxEntropy,ResourcesMeanSize,ResourcesMinSize,ResourcesMaxSize,LoadConfigurationSize,VersionInformationSize,legitimate
0,memtest.exe,631ea355665f28d4707448e442fbf5b8,332,224,258,9,0,361984,115712,0,...,4,3.262823,2.568844,3.537939,8797.0,216,18032,0,16,1
1,ose.exe,9d10f99a6712e28f8acd5641e3a7ea6b,332,224,3330,9,0,130560,19968,0,...,2,4.250461,3.420744,5.080177,837.0,518,1156,72,18,1
2,setup.exe,4d92f518527353c0db88a70fddcfd390,332,224,3330,9,0,517120,621568,0,...,11,4.426324,2.846449,5.271813,31102.272727,104,270376,72,18,1
3,DW20.EXE,a41e524f8d45f0074fd07805ff0c9b12,332,224,258,9,0,585728,369152,0,...,10,4.364291,2.669314,6.40072,1457.0,90,4264,72,18,1
4,dwtrig20.exe,c87e561258f2f8650cef999bf643a731,332,224,258,9,0,294912,247296,0,...,2,4.3061,3.421598,5.190603,1074.5,849,1300,72,18,1


**Code Explanation:**

`pandas`: a Python library used for handling and analyzing data, especially tabular data.

`pd.read_csv()`: reads data from a CSV (or compressed .gz) file directly into a DataFrame.

* `compression='gzip'` → reads the compressed file without unzipping manually.

* `sep='|'` → specifies that columns are separated by | instead of commas.
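
As a rough illustration of what these two arguments mean, the sketch below builds a tiny gzip-compressed, pipe-separated file in memory and reads it back using only the standard library. The column names are borrowed from the dataset, but the rows are made up, and this is not how pandas implements `read_csv` internally.

```python
import csv
import gzip
import io

# Hypothetical two-row sample in the same layout as the dataset (values invented).
raw_text = "Name|Machine|legitimate\nsample_a.exe|332|1\nsample_b.exe|332|0\n"

# Compress it the way the .csv.gz file is stored.
compressed = gzip.compress(raw_text.encode("utf-8"))

# compression='gzip' corresponds to decompressing the byte stream first ...
with gzip.open(io.BytesIO(compressed), mode="rt", encoding="utf-8", newline="") as handle:
    # ... and sep='|' corresponds to splitting each line on the pipe character.
    rows = list(csv.DictReader(handle, delimiter="|"))

print(rows[0]["Name"], rows[1]["legitimate"])
```

Unlike pandas, this sketch leaves every value as a string; `pd.read_csv` also infers numeric column types.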


In [None]:
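# This split assumes the legitimate files are the first 41,323 rows of the dataset.
# The 'legitimate' column is dropped because the two variables now encode the class.
Legit = MalwareDataset[0:41323].drop(['legitimate'], axis=1)
Malware = MalwareDataset[41323:].drop(['legitimate'], axis=1)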

print("The shape of the legit dataset is: %s samples, %s features"%(Legit.shape[0], Legit.shape[1]))
print("The shape of the malware dataset is: %s samples, %s features"%(Malware.shape[0],Malware.shape[1]))

The shape of the legit dataset is: 41323 samples, 56 features
The shape of the malware dataset is: 96724 samples, 56 features


**Code Explanation:**

`Legit.shape` gives the dimensions of the DataFrame as a tuple:
→ `(number_of_rows, number_of_columns)`

`Legit.shape[0]` → number of rows (samples)

`Legit.shape[1]` → number of columns (features)



---



In [None]:
print(MalwareDataset.columns) #printing all the features from the dataset

Index(['Name', 'md5', 'Machine', 'SizeOfOptionalHeader', 'Characteristics',
       'MajorLinkerVersion', 'MinorLinkerVersion', 'SizeOfCode',
       'SizeOfInitializedData', 'SizeOfUninitializedData',
       'AddressOfEntryPoint', 'BaseOfCode', 'BaseOfData', 'ImageBase',
       'SectionAlignment', 'FileAlignment', 'MajorOperatingSystemVersion',
       'MinorOperatingSystemVersion', 'MajorImageVersion', 'MinorImageVersion',
       'MajorSubsystemVersion', 'MinorSubsystemVersion', 'SizeOfImage',
       'SizeOfHeaders', 'CheckSum', 'Subsystem', 'DllCharacteristics',
       'SizeOfStackReserve', 'SizeOfStackCommit', 'SizeOfHeapReserve',
       'SizeOfHeapCommit', 'LoaderFlags', 'NumberOfRvaAndSizes', 'SectionsNb',
       'SectionsMeanEntropy', 'SectionsMinEntropy', 'SectionsMaxEntropy',
       'SectionsMeanRawsize', 'SectionsMinRawsize', 'SectionMaxRawsize',
       'SectionsMeanVirtualsize', 'SectionsMinVirtualsize',
       'SectionMaxVirtualsize', 'ImportsNbDLL', 'ImportsNb',
       'Impor

In [None]:
pd.set_option('display.max_columns', None)  # show every column; by default pandas truncates wide tables
print(MalwareDataset.head(5))  # print the first 5 rows

           Name                               md5  Machine  \
0   memtest.exe  631ea355665f28d4707448e442fbf5b8      332   
1       ose.exe  9d10f99a6712e28f8acd5641e3a7ea6b      332   
2     setup.exe  4d92f518527353c0db88a70fddcfd390      332   
3      DW20.EXE  a41e524f8d45f0074fd07805ff0c9b12      332   
4  dwtrig20.exe  c87e561258f2f8650cef999bf643a731      332   

   SizeOfOptionalHeader  Characteristics  MajorLinkerVersion  \
0                   224              258                   9   
1                   224             3330                   9   
2                   224             3330                   9   
3                   224              258                   9   
4                   224              258                   9   

   MinorLinkerVersion  SizeOfCode  SizeOfInitializedData  \
0                   0      361984                 115712   
1                   0      130560                  19968   
2                   0      517120                 621568   
3 



---


🌿 **Decision Tree Classifier**

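A Decision Tree Classifier is a model that makes decisions by asking a series of yes/no questions about the data, like a flowchart. The cells below use an Extra Trees classifier, an ensemble of randomized decision trees, to rank the features by importance.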
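
To make the flowchart picture concrete, here is a hand-written toy tree. The function name, the two questions, and the thresholds are all invented for illustration; a real decision tree learns its questions from the training data. The column names come from the dataset.

```python
def toy_tree_predict(sample):
    """Classify one file with a hand-written two-question tree.

    The questions and thresholds are invented for illustration only.
    Returns 1 for legitimate and 0 for malware, matching the dataset's label.
    """
    if sample["SectionsMaxEntropy"] > 7.0:      # question 1: very high entropy?
        return 0                                # yes -> flag as malware
    if sample["VersionInformationSize"] == 0:   # question 2: no version info?
        return 0                                # yes -> flag as malware
    return 1                                    # otherwise -> legitimate


# Two hypothetical samples described by the features the toy tree looks at.
high_entropy_sample = {"SectionsMaxEntropy": 7.9, "VersionInformationSize": 0}
ordinary_sample = {"SectionsMaxEntropy": 5.1, "VersionInformationSize": 16}

print(toy_tree_predict(high_entropy_sample), toy_tree_predict(ordinary_sample))
```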

In [None]:
import sklearn
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import train_test_split, cross_val_score

In [None]:
data_in = MalwareDataset.drop(['Name','md5','legitimate'], axis=1).values  # drop the identifiers and the label; .values converts the DataFrame to a NumPy array
labels = MalwareDataset['legitimate'].values  # the 'legitimate' column is the label (1 = legitimate, 0 = malware)
extratrees = ExtraTreesClassifier().fit(data_in, labels)
select = SelectFromModel(extratrees, prefit=True)
data_in_new = select.transform(data_in)
print("shape of the old dataset %s , shape of the new dataset %s"  %(data_in.shape, data_in_new.shape))

shape of the old dataset (138047, 54) , shape of the new dataset (138047, 12)


**Explanation:**

Here we train an Extra Trees model to determine feature importance, and then use that information to create a new dataset (data_in_new) containing only the most relevant features for our malware detection task.
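
According to the scikit-learn documentation, when no `threshold` is given and the estimator exposes feature importances, `SelectFromModel` keeps the features whose importance is at least the mean importance. A minimal pure-Python sketch of that rule follows; the importance values are made up and only the column names come from the dataset.

```python
# Hypothetical importances for five features (they sum to 1.0, as tree importances do).
importances = {
    "DllCharacteristics": 0.40,
    "Characteristics": 0.30,
    "Machine": 0.15,
    "LoaderFlags": 0.09,
    "SizeOfHeapCommit": 0.06,
}

# Default rule for tree ensembles: keep features at or above the mean importance.
threshold = sum(importances.values()) / len(importances)
selected = [name for name, score in importances.items() if score >= threshold]

print(round(threshold, 2), selected)
```

With 54 input features the mean importance is 1/54 ≈ 0.0185, which is consistent with 12 features surviving the selection above.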

In [None]:
import numpy as np
features = data_in_new.shape[1]  # number of features kept by SelectFromModel
importances = extratrees.feature_importances_
indices = np.argsort(importances)[::-1]  # argsort sorts ascending; [::-1] reverses it so the most important feature comes first

for f in range(features):
    # data_in was built without 'Name' and 'md5', so shift by 2 to index MalwareDataset.columns
    print("%d"%(f+1), MalwareDataset.columns[2+indices[f]], importances[indices[f]])


1 DllCharacteristics 0.1571048387769568
2 Characteristics 0.122248456522263
3 Machine 0.08785697417912322
4 VersionInformationSize 0.07668675379350166
5 ImageBase 0.053664883756823094
6 SectionsMaxEntropy 0.04976013315641163
7 SizeOfOptionalHeader 0.048121932655247994
8 Subsystem 0.04524622154121746
9 MajorSubsystemVersion 0.043003872988941094
10 ResourcesMinEntropy 0.040423052229779224
11 ResourcesMaxEntropy 0.03767461564362369
12 SectionsMinEntropy 0.023682970674417395
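
The ranking step can be sketched without NumPy. The three importance values below are hypothetical; the idea is only to show how sorting the positions, rather than the values, lets us look up the matching feature names.

```python
# Hypothetical importances, in the same order as the feature columns.
feature_names = ["Machine", "SizeOfOptionalHeader", "Characteristics"]
importances = [0.2, 0.1, 0.7]

# Same idea as np.argsort(importances)[::-1]:
# positions of the values, ordered from largest to smallest.
indices = sorted(range(len(importances)), key=lambda i: importances[i], reverse=True)

for rank, idx in enumerate(indices, start=1):
    print(rank, feature_names[idx], importances[idx])
```

In the notebook cell, the `2+` offset does the name lookup against `MalwareDataset.columns`, which still contains the `Name` and `md5` columns that were dropped from `data_in`.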




---


🌳Training our model with a **random forest classifier**

A Random Forest Classifier is an ensemble of many decision trees combined together.
* It trains many decision trees (often hundreds).

* Each tree is trained on a random sample of the rows and considers a random subset of the features at each split.

* When predicting, each tree votes for a class.

* The majority vote becomes the final prediction.
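
The voting idea in the list above can be sketched in a few lines. The three "trees" here are hypothetical stand-in rules with invented thresholds, not trained models, and `forest_predict` is an illustrative helper rather than a scikit-learn function.

```python
from collections import Counter


def forest_predict(trees, sample):
    """Return the class that most of the trees vote for."""
    votes = [tree(sample) for tree in trees]
    return Counter(votes).most_common(1)[0][0]


# Three stand-in "trees": each is a rule returning 1 (legitimate) or 0 (malware).
trees = [
    lambda s: 0 if s["SectionsMaxEntropy"] > 7.0 else 1,
    lambda s: 0 if s["VersionInformationSize"] == 0 else 1,
    lambda s: 0 if s["ImportsNbDLL"] < 2 else 1,
]

# Hypothetical sample: two rules vote malware, one votes legitimate.
sample = {"SectionsMaxEntropy": 7.5, "VersionInformationSize": 0, "ImportsNbDLL": 6}
print(forest_predict(trees, sample))
```

Note that scikit-learn's `RandomForestClassifier` is documented as averaging the trees' predicted class probabilities rather than counting hard votes; the two usually agree, and the hard-vote version is just easier to read.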

It's better than a simple tree classifier because it's:

* More accurate: reduces overfitting (a single tree can memorize data).

* More stable: less sensitive to noise in the data.

* More robust: averages out errors from individual trees.

In [None]:
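# train_test_split returns (X_train, X_test, y_train, y_test), so despite their names
# Legit_Train / Legit_Test hold the features and Malware_Train / Malware_Test hold the labels.
Legit_Train, Legit_Test, Malware_Train, Malware_Test = train_test_split(data_in_new, labels, test_size=0.2)
classif = sklearn.ensemble.RandomForestClassifier(n_estimators=50)  # n_estimators = number of trees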