# Evasive PDF Samples

Based on https://www.kaggle.com/datasets/fouadtrad2/evasive-pdf-samples

This dataset is a collection of evasive PDF samples, each labeled as malicious (1) or benign (0). Because the samples are evasive by design, the dataset can be used to test the robustness of trained PDF malware classifiers against evasion attacks. It contains 500,000 generated evasive samples: 450,000 malicious and 50,000 benign PDFs.
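One way to quantify robustness against evasion is the share of malicious samples that a classifier labels as benign. The sketch below is a minimal illustration under that assumption; `evasion_rate` is a hypothetical helper and the labels are toy values, not data from this dataset.

```python
def evasion_rate(y_true, y_pred):
    """Fraction of malicious samples (label 1) predicted as benign (label 0)."""
    # Keep only the predictions made for truly malicious samples
    malicious_preds = [p for t, p in zip(y_true, y_pred) if t == 1]
    if not malicious_preds:
        raise ValueError("y_true contains no malicious samples")
    missed = sum(1 for p in malicious_preds if p == 0)
    return missed / len(malicious_preds)


# Toy labels for illustration only (1 = malicious, 0 = benign)
y_true = [1, 1, 1, 1, 0]
y_pred = [1, 0, 1, 0, 0]

rate = evasion_rate(y_true, y_pred)
print(f"Evasion rate: {rate:.2f}")
```

A lower value means fewer malicious PDFs slipped past the classifier. With real data, `y_pred` would come from whichever trained model is being evaluated.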

## Import the libraries

We start by importing the dependencies: numpy, pandas, random, and warnings.

In [16]:
import numpy as np
import pandas as pd
import random
import warnings

## Define Helper Functions 

These helper functions compute and print summary metrics about the dataset.

In [17]:
def calculateMeanPageSize(sample):
    # Iterating over a DataFrame yields column names, not rows,
    # so take the mean of the `pdfsize` column directly.
    mean_page_size = sample['pdfsize'].mean()
    return int(mean_page_size)

In [18]:
def write_dataset(sample):
    # One-row summary: number of samples and the mean of `pdfsize`
    data = {
        'nLines': [len(sample)],
        'MeanPages': [calculateMeanPageSize(sample)]
    }
    df = pd.DataFrame(data)

    # Print the DataFrame
    print(df)

## Importing datasets

In [19]:
sample = pd.read_csv('sample.csv')
print(f"We have {len(sample)} samples.")
sample.head()

We have 500000 samples.


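|   | pdfsize | pages | title characters | images | obj | endobj | stream | endstream | xref | trailer | ... | ObjStm | JS | OBS_JS | Javascript | OBS_Javascript | OpenAction | OBS_OpenAction | Acroform | OBS_Acroform | class |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 644.326 | 70 | 0 | 1 | 348 | 351 | 128 | 128 | 1 | 1 | ... | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 |
| 1 | 648.05 | 68 | 0 | 1 | 348 | 345 | 124 | 124 | 1 | 1 | ... | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 |
| 2 | 696.506 | 68 | 0 | 1 | 353 | 353 | 128 | 125 | 1 | 1 | ... | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 |
| 3 | 715.926 | 68 | 0 | 0 | 759 | 667 | 250 | 192 | 1 | 1 | ... | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 |
| 4 | 707.102 | 70 | 10 | 2 | 388 | 373 | 141 | 138 | 1 | 1 | ... | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 |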
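To confirm the malicious/benign split described in the introduction, the `class` column can be tallied with `value_counts`. The sketch below uses a small toy frame (`toy`, an assumption standing in for `sample`) so that it runs on its own.

```python
import pandas as pd

# Toy stand-in for `sample`; only the `class` column matters here.
# Nine malicious rows and one benign row mimic a 90/10 split.
toy = pd.DataFrame({"class": [1, 1, 1, 1, 1, 1, 1, 1, 1, 0]})

# Absolute counts per label
counts = toy["class"].value_counts()

# Relative share per label
share = toy["class"].value_counts(normalize=True)

print(counts)
print(share)
```

Running the same two calls on `sample["class"]` would show whether the loaded file matches the 450,000 / 50,000 split stated above.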


## Building the base model 

In [20]:
sample.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 500000 entries, 0 to 499999
Data columns (total 21 columns):
 #   Column            Non-Null Count   Dtype  
---  ------            --------------   -----  
 0   pdfsize           500000 non-null  float64
 1   pages             500000 non-null  int64  
 2   title characters  500000 non-null  int64  
 3   images            500000 non-null  int64  
 4   obj               500000 non-null  int64  
 5   endobj            500000 non-null  int64  
 6   stream            500000 non-null  int64  
 7   endstream         500000 non-null  int64  
 8   xref              500000 non-null  int64  
 9   trailer           500000 non-null  int64  
 10  startxref         500000 non-null  int64  
 11  ObjStm            500000 non-null  int64  
 12  JS                500000 non-null  int64  
 13  OBS_JS            500000 non-null  int64  
 14  Javascript        500000 non-null  int64  
 15  OBS_Javascript    500000 non-null  int64  
 16  OpenAction        50

In [21]:
sample.shape

(500000, 21)
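Since the `info()` output above is cut off partway through the column list, a compact alternative is to build a small per-column table of dtype and missing-value count. This is a sketch on a toy frame (`toy` is an assumption, with values borrowed from the first rows shown earlier), not the notebook's original code.

```python
import pandas as pd

# Toy stand-in for `sample` with a few of its columns
toy = pd.DataFrame({
    "pdfsize": [644.326, 648.05, 696.506],
    "pages": [70, 68, 68],
    "class": [1, 1, 1],
})

# Per-column dtype and number of missing values
dtypes = toy.dtypes
null_counts = toy.isna().sum()

overview = pd.DataFrame({"dtype": dtypes, "nulls": null_counts})
print(overview)
```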

In [22]:
sample.describe()

|   | pdfsize | pages | title characters | images | obj | endobj | stream | endstream | xref | trailer | ... | ObjStm | JS | OBS_JS | Javascript | OBS_Javascript | OpenAction | OBS_OpenAction | Acroform | OBS_Acroform | class |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 500000.0 | 500000.0 | 500000.0 | 500000.0 | 500000.0 | 500000.0 | 500000.0 | 500000.0 | 500000.0 | 500000.0 | ... | 500000.0 | 500000.0 | 500000.0 | 500000.0 | 500000.0 | 500000.0 | 500000.0 | 500000.0 | 500000.0 | 500000.0 |
| mean | 563.363772 | 55.101686 | 5.617004 | 1.041594 | 273.595072 | 273.47229 | 95.115512 | 95.3315 | 0.969714 | 1.001358 | ... | 0.008572 | 0.873134 | 0.0 | 0.795662 | 0.0 | 0.4366 | 0.0 | 0.887564 | 0.0 | 0.9 |
| std | 280.213763 | 30.233062 | 6.501397 | 0.734654 | 142.33328 | 142.734185 | 51.683914 | 52.094421 | 0.263349 | 0.244811 | ... | 0.198168 | 0.547981 | 0.0 | 0.416932 | 0.0 | 0.495965 | 0.0 | 0.519314 | 0.0 | 0.3 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 585.86425 | 67.0 | 0.0 | 1.0 | 266.0 | 266.0 | 85.0 | 87.0 | 1.0 | 1.0 | ... | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 |
| 50% | 657.841 | 68.0 | 4.0 | 1.0 | 346.0 | 345.0 | 123.0 | 122.0 | 1.0 | 1.0 | ... | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 |
| 75% | 708.50325 | 69.0 | 9.0 | 2.0 | 355.0 | 354.0 | 126.0 | 126.0 | 1.0 | 1.0 | ... | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 |
| max | 1761.042 | 287.0 | 267.0 | 18.0 | 760.0 | 760.0 | 254.0 | 254.0 | 3.0 | 3.0 | ... | 15.0 | 3.0 | 0.0 | 5.0 | 0.0 | 1.0 | 0.0 | 2.0 | 0.0 | 1.0 |


In [23]:
sample.describe(include=["int64"])

|   | pages | title characters | images | obj | endobj | stream | endstream | xref | trailer | startxref | ObjStm | JS | OBS_JS | Javascript | OBS_Javascript | OpenAction | OBS_OpenAction | Acroform | OBS_Acroform | class |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 500000.0 | 500000.0 | 500000.0 | 500000.0 | 500000.0 | 500000.0 | 500000.0 | 500000.0 | 500000.0 | 500000.0 | 500000.0 | 500000.0 | 500000.0 | 500000.0 | 500000.0 | 500000.0 | 500000.0 | 500000.0 | 500000.0 | 500000.0 |
| mean | 55.101686 | 5.617004 | 1.041594 | 273.595072 | 273.47229 | 95.115512 | 95.3315 | 0.969714 | 1.001358 | 0.997758 | 0.008572 | 0.873134 | 0.0 | 0.795662 | 0.0 | 0.4366 | 0.0 | 0.887564 | 0.0 | 0.9 |
| std | 30.233062 | 6.501397 | 0.734654 | 142.33328 | 142.734185 | 51.683914 | 52.094421 | 0.263349 | 0.244811 | 0.198658 | 0.198168 | 0.547981 | 0.0 | 0.416932 | 0.0 | 0.495965 | 0.0 | 0.519314 | 0.0 | 0.3 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 67.0 | 0.0 | 1.0 | 266.0 | 266.0 | 85.0 | 87.0 | 1.0 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 |
| 50% | 68.0 | 4.0 | 1.0 | 346.0 | 345.0 | 123.0 | 122.0 | 1.0 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 |
| 75% | 69.0 | 9.0 | 2.0 | 355.0 | 354.0 | 126.0 | 126.0 | 1.0 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 |
| max | 287.0 | 267.0 | 18.0 | 760.0 | 760.0 | 254.0 | 254.0 | 3.0 | 3.0 | 4.0 | 15.0 | 3.0 | 0.0 | 5.0 | 0.0 | 1.0 | 0.0 | 2.0 | 0.0 | 1.0 |
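Two patterns stand out in the statistics above. The `OBS_*` columns have a minimum and maximum of 0, so they carry no information. The `obj` and `endobj` means also differ slightly, so the two counts do not always match. The sketch below shows one way to check both on a toy frame (`toy` is an assumption; its `obj`/`endobj` values are borrowed from the first rows shown earlier).

```python
import pandas as pd

# Toy stand-in for `sample` with a constant column and an obj/endobj pair
toy = pd.DataFrame({
    "JS": [1, 1, 0, 2],
    "OBS_JS": [0, 0, 0, 0],
    "obj": [348, 348, 353, 759],
    "endobj": [351, 345, 353, 667],
})

# Columns with a single distinct value cannot help a classifier
constant_cols = [c for c in toy.columns if toy[c].nunique(dropna=False) <= 1]

# Rows where the number of `obj` and `endobj` keywords disagree
mismatched = int((toy["obj"] != toy["endobj"]).sum())

print("Constant columns:", constant_cols)
print("Rows with obj != endobj:", mismatched)
```

Whether to drop constant columns before training is a modeling decision; the check only identifies them.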
