# Evasive PDF Samples

Based on https://www.kaggle.com/datasets/fouadtrad2/evasive-pdf-samples

## 1. Introduction

### 1.1 Context

This dataset is a collection of evasive PDF samples, labeled as malicious (1) or benign (0). Because the samples are designed to evade detection, the dataset can be used to test the robustness of trained PDF malware classifiers against evasion attacks. It contains 500,000 generated evasive samples: 450,000 malicious and 50,000 benign PDFs.

### 1.2 Objective

The primary objective is to create a machine learning application that classifies PDF samples as malicious or benign using various supervised learning algorithms.

Planned machine learning models:

1. **Decision Tree** - Relatively fast to train and able to handle large datasets, which makes it suitable for this problem.

2. **K-Nearest Neighbors (KNN)** - Can handle non-linear data and does not make assumptions about the distribution of the data.

3. **Neural Networks** - Can be used for both classification and regression tasks, and can often achieve high levels of accuracy with appropriate training.

## 2. Importing Libraries

We start by importing the dependencies: NumPy, pandas, seaborn, Matplotlib, imbalanced-learn, and scikit-learn.

In [None]:
import copy
import numpy as np
import pandas as pd
import seaborn as sb
import matplotlib.pyplot as plt
from imblearn.under_sampling import RandomUnderSampler
from imblearn.over_sampling import SMOTE
from sklearn import metrics, tree
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.metrics import f1_score, precision_score, recall_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier


## 3. Importing the Dataset

### 3.1 Dataset Overview

In this section we load the dataset and take a first look at its contents.

In [None]:
sample = pd.read_csv('sample.csv')
print(f"There are {len(sample)} samples.")

In [None]:
print("The dataset has the following shape (rows, columns):")
sample.shape

In [None]:
print("First few rows of the dataset:")
sample.head()


In [None]:
print("Overview of the dataset:")
sample.info()

The output shows that the dataset has 20 feature attributes plus the `class` label. Every column is numerical (all `int64` except `pdfsize`, which is `float64`), and none of them contains null values.


### 3.2 Column Datatypes

In [None]:
print(sample.dtypes)


### 3.3 Data Description

1. **pdfsize** - The PDF size in Megabytes
2. **pages** - The number of pages
3. **title characters** - The number of characters in the title
4. **images** - The number of images
5. **obj** - The number of keywords /obj
6. **endobj** - The number of keywords /endobj
7. **stream** - The number of keywords /stream
8. **endstream** - The number of keywords /endstream
9. **xref** - The number of xref tables
10. **trailer** - The number of keywords /trailer
11. **startxref** - The number of keywords /startxref
12. **ObjStm** - The number of keywords /Objstm (Object Streams)
13. **JS** - The number of keywords /JS
14. **OBS_JS** - The number of keywords /JS (obfuscated)
15. **Javascript** - The number of keywords /Javascript
16. **OBS_Javascript** - The number of keywords /Javascript (obfuscated)
17. **OpenAction** - The number of keywords /OpenAction
18. **OBS_OpenAction** - The number of keywords /OpenAction (obfuscated)
19. **Acroform** - The number of keywords /Acroform
20. **OBS_Acroform** - The number of keywords /Acroform (obfuscated)
21. **class** - Benign (0) or malicious (1)

In [None]:
print(f"Summary statistics of the dataset:")
sample.describe()

- The **count** row shows the number of non-null entries in each column of the dataset.
- The **mean** row shows the average value of each column.
- The **std** row shows the standard deviation of each column, i.e. how much the values deviate from the mean.
- The **25%**, **50%** and **75%** rows show the first quartile, the median and the third quartile of each column.
- The **min** and **max** rows show the minimum and maximum values in each column.
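
To make the rows listed above concrete, here is a minimal sketch on a toy column. The values are hypothetical and not taken from the dataset; the point is only that the quartile rows of `describe()` are ordinary quantiles.

```python
import pandas as pd

# Toy column (hypothetical values, not taken from the dataset)
values = pd.Series([1, 2, 3, 4, 5], name="pages")

# Same summary that sample.describe() computes for every column
summary = values.describe()
print(summary)

# The 25%, 50% and 75% rows are plain quantiles of the column
q1, median, q3 = values.quantile([0.25, 0.5, 0.75])
print(q1, median, q3)
```

Comparing `summary["50%"]` with `median` shows that both report the same value.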

### 3.4 Target Variable Distribution

In [None]:
print(f"Target Variable Distribution:")
sample['class'].value_counts()

- The class = **1** represents the malicious files
- The class = **0** represents the benign files
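
As a rough illustration of what this distribution implies, the sketch below rebuilds a label column from the counts stated in the introduction (450,000 malicious, 50,000 benign) and derives the class proportions. The `majority_baseline` name is introduced here for illustration only.

```python
import pandas as pd

# Rebuild the label column from the counts stated in the introduction
labels = pd.Series([1] * 450_000 + [0] * 50_000, name="class")

# Relative frequency of each class
proportions = labels.value_counts(normalize=True)
print(proportions)

# Accuracy of a trivial classifier that always predicts the majority class
majority_baseline = proportions.max()
print(f"Majority-class baseline accuracy: {majority_baseline:.2f}")
```

Any accuracy reported later is best read against this baseline rather than against 50%.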


In [None]:
malicious_sample = sample[sample['class'] == 1]
benign_sample = sample[sample['class'] == 0]

In [None]:
malicious_count = len(malicious_sample)
benign_count = len(benign_sample)

data = [malicious_count, benign_count]  
colors = ("#eaac8b", "#6d597a") 

fig1, ax1 = plt.subplots()
ax1.pie(data, colors=colors, labels=['Malicious', 'Benign'], autopct='%1.1f%%')
ax1.axis('equal') 

plt.show()

### 3.5 Histograms

In [None]:
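# DataFrame.hist creates its own figure, so no separate plt.figure call is needed
axes = sample.hist(figsize=(16, 20), color='#003f5c')
for ax in axes.ravel():
    ax.set_xlabel("Value", fontsize=14)
    ax.set_ylabel("Number of Samples", fontsize=14)
plt.tight_layout()
plt.show()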

### 3.6 Malicious vs Benign

In [None]:
malicious_sample = sample[sample['class'] == 1]
benign_sample = sample[sample['class'] == 0]
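# Iterate over the feature columns only, so the row index always matches the subplot grid
feature_columns = [c for c in sample.columns if c != 'class']
num_features = len(feature_columns)

fig, axs = plt.subplots(num_features, 2, figsize=(16, num_features * 4))
for i, column in enumerate(feature_columns):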
    sb.histplot(malicious_sample[column], label='Malicious', color='#eaac8b', kde=True, ax=axs[i, 0], bins=30, alpha=0.6)
    sb.histplot(benign_sample[column], label='Benign', color='#6d597a', kde=True, ax=axs[i, 0], bins=30, alpha=0.6)
    axs[i, 0].set_title(f'{column} Distribution')
    axs[i, 0].set_xlabel(column)
    axs[i, 0].set_ylabel('Frequency')
    axs[i, 0].legend()
    sb.boxplot(x='class', y=column, data=sample, ax=axs[i, 1])
    axs[i, 1].set_title(f'{column} Boxplot')
    axs[i, 1].set_xlabel('Class')
    axs[i, 1].set_ylabel(column)

plt.tight_layout()
plt.show()


## 4. Data Preprocessing

### 4.1 Handling Missing Values

In [None]:
# Check for missing values
missing_values = sample.isnull().sum()
print("Missing values in each column:\n", missing_values)

# Drop rows with any missing values
sample_cleaned_rows = sample.dropna(axis=0)

# Display the shape of the DataFrame before and after dropping rows
print("Shape before dropping rows:", sample.shape)
print("Shape after dropping rows:", sample_cleaned_rows.shape)


### 4.2 Removing Redundant Features

The plots above also show that the 'obj' and 'endobj' columns, as well as the 'stream' and 'endstream' columns, have nearly identical distributions. This is expected, since every opening keyword is normally paired with its closing keyword. To avoid feeding redundant features to the models, we remove the 'endobj' and 'endstream' columns.

In [None]:
# Remove redundant columns
sample = sample.drop(columns=['endobj', 'endstream'])
print("Shape of DataFrame after removing columns:", sample.shape)
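
The redundancy claimed above could be quantified before dropping the columns. The following is a sketch on a toy frame with hypothetical values; on the real data the same two measures would be computed on `sample` before the `drop` call.

```python
import pandas as pd

# Toy frame (hypothetical values) in which opening and closing keyword counts almost always match
toy = pd.DataFrame({
    "obj":       [10, 25, 7, 40],
    "endobj":    [10, 25, 7, 40],
    "stream":    [3, 8, 2, 12],
    "endstream": [3, 8, 2, 11],
})

for opener, closer in [("obj", "endobj"), ("stream", "endstream")]:
    # Pearson correlation between the paired columns
    corr = toy[opener].corr(toy[closer])
    # Share of rows in which both counts are exactly equal
    identical = (toy[opener] == toy[closer]).mean()
    print(f"{opener}/{closer}: correlation={corr:.3f}, identical rows={identical:.0%}")
```

A correlation close to 1 together with a high share of identical rows would support treating the closing keyword as redundant.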

## 5. Exploratory Data Analysis (EDA)

### 5.1 Data Distribution Visualization

In [None]:
print(f"New summary statistics of the dataset:")
sample.describe()

The rows are read in the same way as in Section 3.3. The only difference is that the `endobj` and `endstream` columns no longer appear.

### 5.2 Pair Plot for Best Features

This section will focus on visualizing the relationship between the selected best features using a pair plot.

In [None]:
# Specify the fraction of data to sample
sample_fraction = 0.1

# Randomly sample a fraction of the dataset
sampled_data = sample.sample(frac=sample_fraction, random_state=42)

# Display the shape of the sampled dataset
print("Shape of Sampled Data:", sampled_data.shape)

best_features = ['obj', 'pdfsize', 'pages', 'stream', 'Javascript', 'Acroform', 'class']

# Pair plot for the sampled data
pair_plottable_data = sampled_data[best_features]
sb.pairplot(pair_plottable_data, hue='class')


The pair plot shows the pairwise relationships between the selected features (obj, pdfsize, pages, stream, Javascript, Acroform) and how their distributions differ by 'class', which helps reveal patterns or correlations in the data.

## 6. Correlation Analysis

The process of determining which features to remove to improve machine learning model performance involves a combination of exploratory data analysis, feature selection techniques, and domain knowledge.

Highly correlated features can sometimes be redundant. Using correlation heatmaps can help identify these.

### 6.1 Correlation Matrix

In [None]:
correlation_matrix = sample.corr()
plt.figure(figsize=(16, 12))
sb.heatmap(correlation_matrix, annot=True, fmt='.2f', cmap='coolwarm', vmin=-1, vmax=1)
plt.title('Correlation Matrix')
plt.show()

Observations from the Correlation Matrix:

- **Empty Columns**: Some rows and columns of the heatmap are blank. Their correlation is undefined rather than zero, which happens when a column has no variance. These columns do not contribute to the model and should be removed.

- **Low Correlation Values**: Some columns have a correlation as low as 0.01 with our target variable (class), so they carry little linear information about it. These columns can be excluded to reduce noise.
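
Blank cells in a correlation heatmap typically come from constant columns, whose correlation is undefined. The sketch below reproduces this on a toy frame with hypothetical values; only the column names are borrowed from the dataset.

```python
import pandas as pd

toy = pd.DataFrame({
    "obj":    [3, 8, 2, 12, 5],
    "OBS_JS": [0, 0, 0, 0, 0],   # constant column: zero variance
    "class":  [0, 1, 0, 1, 1],
})

corr = toy.corr()
print(corr)

# Every entry involving the constant column is NaN, which a heatmap renders as blank
print(corr["OBS_JS"].isna().all())
```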


### 6.2 Identifying and Removing Low-Variance Features

To better understand our dataset, we will examine the distribution of the columns that showed no correlation in the correlation matrix. These columns include OBS_JS, OBS_Javascript, OBS_OpenAction, and OBS_Acroform.

In [None]:
obs_columns = ['OBS_JS', 'OBS_Javascript', 'OBS_OpenAction', 'OBS_Acroform']
obs_colors = ['skyblue', 'salmon', 'green', 'purple']

fig, axs = plt.subplots(2, 2, figsize=(12, 10))

# Plot the distribution of each obfuscated-keyword column in its own panel
for ax, column, color in zip(axs.ravel(), obs_columns, obs_colors):
    sb.histplot(sample[column], bins=30, kde=True, ax=ax, color=color)
    ax.set_title(f'Distribution of {column}')
    ax.set_xlabel(column)
    ax.set_ylabel('Frequency')

# Adjust layout and show the plot
plt.tight_layout()
plt.show()


As we can see from the plots, these four columns have constant values (all values are the same), meaning they don't provide any useful information for our analysis or modeling process.

To improve the predictive power of our model and reduce noise, we will remove these columns from our dataset.

In [None]:
# Remove the specified columns
columns_to_remove = ['OBS_JS', 'OBS_Javascript', 'OBS_OpenAction', 'OBS_Acroform']
sample = sample.drop(columns=columns_to_remove)

# Check the shape of the DataFrame after removing the columns
print("Shape of DataFrame after removing columns:", sample.shape)
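
The constant columns were identified visually here. A programmatic check is sketched below, assuming a hypothetical helper named `constant_columns` and toy values; it is not part of the original notebook.

```python
import pandas as pd

def constant_columns(df):
    """Return the names of columns that hold a single distinct value."""
    return [col for col in df.columns if df[col].nunique(dropna=False) <= 1]

# Toy frame (hypothetical values) with two constant columns
toy = pd.DataFrame({
    "obj":          [3, 8, 2, 12],
    "OBS_JS":       [0, 0, 0, 0],
    "OBS_Acroform": [0, 0, 0, 0],
    "class":        [0, 1, 0, 1],
})

print(constant_columns(toy))
```

On the real data, the returned list could be passed directly to `sample.drop(columns=...)`.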


One more column should be removed because of its low correlation values: ObjStm. In the correlation matrix, its correlation with the other columns is very low, especially with our target column (class), where the value is 0.1.

In [None]:
# Remove the specified columns
columns_to_remove = ['ObjStm']
sample = sample.drop(columns=columns_to_remove)

# Check the shape of the DataFrame after removing the columns
print("Shape of DataFrame after removing columns:", sample.shape)


## 7. Addressing Class Imbalance

Our dataset is unbalanced: the two values of the binary target ('class' column) occur with very different frequencies.

There are 50,000 benign samples (class 0) and 450,000 malicious samples (class 1).

To balance our dataset, we will use **Synthetic Minority Over-sampling Technique (SMOTE)** for oversampling and **RandomUnderSampler** for undersampling.

### 7.1 Techniques to Handle Imbalanced Classes

In [None]:
print("Values of the target column ('class'):")
print(sample['class'].value_counts())

# Independent copies, so later steps cannot modify `sample` by accident
sample_undersample = sample.copy()
sample_original = sample.copy()
sample_best = sample[['obj', 'pdfsize', 'pages', 'stream', 'Javascript', 'Acroform', 'class']].copy()



#### 7.1.1 Undersampling

In [None]:
rus = RandomUnderSampler(random_state=42)

# Full feature set
X = sample.drop('class', axis=1)
y = sample['class']
X_res_undersampling, y_res_undersampling = rus.fit_resample(X, y)
print("Values of the target column ('class') after undersampling (all features):")
print(y_res_undersampling.value_counts())

# Six selected features
X_best = sample_best.drop('class', axis=1)
y_best = sample_best['class']
X_res_best_undersampling, y_res_best_undersampling = rus.fit_resample(X_best, y_best)
print("Values of the target column ('class') after undersampling (best features):")
print(y_res_best_undersampling.value_counts())

# Copy reserved for the undersampling experiments
X_under = sample_undersample.drop('class', axis=1)
y_under = sample_undersample['class']
X_res_sample_undersampling, y_res_sample_undersampling = rus.fit_resample(X_under, y_under)
print("Values of the target column ('class') after undersampling (copy):")
y_res_sample_undersampling.value_counts()

As the output shows, the undersampled dataset is balanced: it contains the same number of benign (0) and malicious (1) samples.
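
To show what random undersampling does conceptually, here is a pandas-only sketch. It is an illustration under stated assumptions, not the imbalanced-learn implementation: the `undersample` helper and the toy data are hypothetical.

```python
import pandas as pd

def undersample(df, target="class", random_state=42):
    """Randomly shrink every class to the size of the smallest one."""
    n_min = df[target].value_counts().min()
    return (
        df.groupby(target, group_keys=False)
          .sample(n=n_min, random_state=random_state)
          .reset_index(drop=True)
    )

# Toy data with the same 90/10 imbalance as the dataset (feature values are placeholders)
toy = pd.DataFrame({
    "obj": range(100),
    "class": [1] * 90 + [0] * 10,
})

balanced = undersample(toy)
print(balanced["class"].value_counts())
```

The trade-off is visible in the row count: balancing this way discards most of the majority class.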

#### 7.1.2 Oversampling

In [None]:
smote = SMOTE(random_state=42)
X_res_oversampling, y_res_oversampling = smote.fit_resample(X, y)
X_res_best_oversampling, y_res_best_oversampling = smote.fit_resample(X_best, y_best)
print("Values of the target column ('class') after oversampling:")
y_res_oversampling.value_counts()


As the output shows, the oversampled dataset is balanced: it contains the same number of benign (0) and malicious (1) samples.
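
SMOTE creates new minority samples by interpolating between an existing minority sample and one of its nearest minority neighbors. The sketch below illustrates that single step with NumPy. It is a simplified illustration, not the imbalanced-learn implementation; the function name, the parameter `k` and the toy points are assumptions.

```python
import numpy as np

def smote_like_sample(minority, k=3, rng=None):
    """Create one synthetic point between a random minority sample
    and one of its k nearest minority neighbors."""
    rng = np.random.default_rng() if rng is None else rng
    i = rng.integers(len(minority))
    # Euclidean distance from the chosen sample to every minority sample
    distances = np.linalg.norm(minority - minority[i], axis=1)
    # Skip index 0 of the ordering, which is the chosen sample itself
    neighbors = np.argsort(distances)[1:k + 1]
    j = rng.choice(neighbors)
    gap = rng.random()  # interpolation factor in [0, 1)
    return minority[i] + gap * (minority[j] - minority[i])

# Toy minority class: five distinct points with two features (hypothetical values)
minority = np.array([[1.0, 2.0], [2.0, 3.0], [3.0, 1.0], [4.0, 4.0], [5.0, 2.0]])
synthetic = smote_like_sample(minority, k=3, rng=np.random.default_rng(42))
print(synthetic)
```

Because the new point lies on a line segment between two real samples, count features such as `pages` or `obj` generally take fractional values in synthetic rows.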

## 8. Model Training and Evaluation

### 8.1 Splitting Data into Training and Testing Sets

In [None]:
# Define features (X) and target variable (y)

X_original = sample_original.drop('class', axis=1)
y_original = sample_original['class']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

X_train_under, X_test_under, y_train_under, y_test_under = train_test_split(X_under, y_under, test_size=0.2, random_state=100)

X_train_original, X_test_original, y_train_original, y_test_original = train_test_split(X_original, y_original, test_size=0.2, random_state=83)
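
With a 90/10 class ratio, a purely random split can shift the class proportions slightly between the training and test sets. One common option, shown here as a sketch on synthetic placeholder data, is the `stratify` argument of `train_test_split`, which the cell above does not use.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with the same 90/10 ratio as the dataset; features are random placeholders
rng = np.random.default_rng(0)
y_demo = np.array([1] * 900 + [0] * 100)
X_demo = rng.normal(size=(1000, 3))

# stratify=y_demo keeps the class ratio identical in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print("Malicious share in train:", y_tr.mean())
print("Malicious share in test: ", y_te.mean())
```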

### 8.2 Training Multiple Supervised Learning Models

In this section we train our models to predict whether a PDF file is benign or malicious.

In [None]:
# Input columns: the features still present in the dataset after preprocessing
input_cols = ['pdfsize', 'pages', 'title characters', 'images', 'obj', 'stream', 'xref', 'trailer', 'startxref', 'JS', 'Javascript', 'OpenAction', 'Acroform']

# Output columns: the two possible predictions of the model
output_cols = ['Benign', 'Malicious']

# Averages to calculate for Precision, Recall, and F1-score
averages = ['binary', 'micro', 'macro', 'weighted']

#### Test Model function 

In [None]:
def test_model(model, x_test, y_test):
    # Predict the target values for the test data
    y_pred = model.predict(x_test)

    # Calculate accuracy, precision, recall, F1 score and the confusion matrix
    accuracy = metrics.accuracy_score(y_test, y_pred)
    precision = metrics.precision_score(y_test, y_pred)
    recall = metrics.recall_score(y_test, y_pred)
    f1 = metrics.f1_score(y_test, y_pred)
    conf_matrix = metrics.confusion_matrix(y_test, y_pred)
    
    return accuracy, precision, recall, f1, conf_matrix

#### 8.2.1 Decision Tree

In [None]:
def train_classifier(model_data, classifier, testSize):
    # Split data for training and testing
    x = model_data.drop(columns=['class'])
    target = model_data['class']
    x_train, x_test, y_train, y_test = train_test_split(x, target, test_size = testSize, random_state = 42)

    # Fit the classifier to the training data
    classifier.fit(x_train, y_train)

    # Evaluate the classifier on the test data
    values = test_model(classifier, x_test, y_test)

    return values

In [None]:
def decision_tree_algorithm(dataset, test_size=0.2):
    # Model using Decision Tree algorithm
    decision_tree_classifier = tree.DecisionTreeClassifier(random_state = 0)

    (accuracy, precision, recall, f1, conf_matrix) = train_classifier(dataset, decision_tree_classifier, test_size)
    
    return accuracy, precision, recall, f1, conf_matrix, decision_tree_classifier

In [None]:
decision_tree_values = decision_tree_algorithm(sample)
(accuracy, precision, recall, f1, conf_matrix, decision_tree_classifier) = decision_tree_values

print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("F1 score:", f1)
disp = metrics.ConfusionMatrixDisplay(confusion_matrix = conf_matrix, display_labels = decision_tree_classifier.classes_)

# Plot the confusion matrix
fig, ax = plt.subplots(figsize=(8, 6))
disp.plot(ax=ax)
ax.set_title('Confusion Matrix')
plt.show()

Here’s a breakdown of the results from the confusion matrix:

- True Positive (TP): 89,902 cases were correctly predicted as malicious.
- True Negative (TN): 9,964 cases were correctly predicted as benign.
- False Positive (FP): 60 cases were incorrectly predicted as malicious when they were actually benign.
- False Negative (FN): 74 cases were incorrectly predicted as benign when they were actually malicious.
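
The four cells of a confusion matrix determine the usual metrics. The sketch below recomputes them from the TN, FP and FN counts listed above, assuming that the test split holds 100,000 samples (20% of 500,000), so that the TP count follows from the other three.

```python
# Counts taken from the confusion matrix breakdown above
tn, fp, fn = 9_964, 60, 74

# Assumption: the test split holds 20% of the 500,000 samples
n_test = 100_000
tp = n_test - tn - fp - fn

accuracy = (tp + tn) / n_test
precision = tp / (tp + fp)   # share of predicted-malicious PDFs that are malicious
recall = tp / (tp + fn)      # share of malicious PDFs that were detected
f1 = 2 * precision * recall / (precision + recall)

print(f"TP={tp}")
print(f"accuracy={accuracy:.5f} precision={precision:.5f} recall={recall:.5f} f1={f1:.5f}")
```

Under that assumption, the results agree with the metrics reported for this model to within rounding.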

In [None]:
decision_tree_values = decision_tree_algorithm(sample_best)
(accuracy, precision, recall, f1, conf_matrix, decision_tree_classifier) = decision_tree_values

print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("F1 score:", f1)
disp = metrics.ConfusionMatrixDisplay(confusion_matrix = conf_matrix, display_labels = decision_tree_classifier.classes_)

# Plot the confusion matrix
fig, ax = plt.subplots(figsize=(8, 6))
disp.plot(ax=ax)
ax.set_title('Confusion Matrix')
plt.show()

##### Key Observations

**High Accuracy:** With an accuracy of 0.99866, the model classifies nearly all samples correctly, both benign and malicious. Since 90% of the dataset is malicious, a classifier that always predicts "malicious" would already reach 0.90, so precision and recall are the more informative figures.

**High Precision and Recall:** A precision of 0.99933 for class 1 means that almost every PDF predicted as malicious is indeed malicious, and a recall of 0.99918 means that the model detects almost all malicious PDFs in the test set.

**F1 Score:** The F1 score, the harmonic mean of precision and recall, is also very high at 0.99925, confirming that the model does not trade one for the other.

#### 8.2.2 K-Nearest Neighbors

In [None]:
def knn_algorithm(dataset, test_size=0.2, n_neighbors=5):
    knn_classifier = KNeighborsClassifier(n_neighbors=n_neighbors)
    accuracy, precision, recall, f1, conf_matrix = train_classifier(dataset, knn_classifier, test_size)
   
    return accuracy, precision, recall, f1, conf_matrix, knn_classifier
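
KNN relies on distances, so features with large numeric ranges can dominate the result. The notebook imports `StandardScaler` but does not apply it. The sketch below, on synthetic placeholder data, shows one way scaling could be combined with KNN through a scikit-learn pipeline. It is an illustration under stated assumptions, not the configuration used for the reported results.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: one informative small-scale feature, one large-scale noise feature
rng = np.random.default_rng(0)
y_demo = rng.integers(0, 2, size=400)
informative = y_demo + rng.normal(scale=0.3, size=400)
noise = rng.normal(scale=1000.0, size=400)
X_demo = np.column_stack([informative, noise])

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# Same classifier with and without standardization of the inputs
plain_knn = KNeighborsClassifier(n_neighbors=5).fit(X_tr, y_tr)
scaled_knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5)).fit(X_tr, y_tr)

plain_score = plain_knn.score(X_te, y_te)
scaled_score = scaled_knn.score(X_te, y_te)
print("Accuracy without scaling:", plain_score)
print("Accuracy with scaling:   ", scaled_score)
```

Fitting the scaler inside the pipeline means it only ever sees the training split, which avoids leaking test statistics into the model.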

In [None]:
knn_values = knn_algorithm(sample, test_size=0.2, n_neighbors=5)
accuracy, precision, recall, f1, conf_matrix, knn_classifier = knn_values

print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("F1 score:", f1)

# Plot the confusion matrix
disp = metrics.ConfusionMatrixDisplay(confusion_matrix=conf_matrix, display_labels=knn_classifier.classes_)
fig, ax = plt.subplots(figsize=(8, 6))
disp.plot(ax=ax)
ax.set_title('Confusion Matrix')
plt.show()

In [None]:
knn_values_best = knn_algorithm(sample_best, test_size=0.2, n_neighbors=5)
accuracy_best, precision_best, recall_best, f1_best, conf_matrix_best, knn_classifier_best = knn_values_best

print("Accuracy:", accuracy_best)
print("Precision:", precision_best)
print("Recall:", recall_best)
print("F1 score:", f1_best)

# Plot the confusion matrix
disp = metrics.ConfusionMatrixDisplay(confusion_matrix=conf_matrix_best, display_labels=knn_classifier_best.classes_)
fig, ax = plt.subplots(figsize=(8, 6))
disp.plot(ax=ax)
ax.set_title('Confusion Matrix')
plt.show()

#### 8.2.3 Neural Networks

In [None]:
def neural_networks_algorithm(dataset, test_size=0.2):
    # Create the classifier object
    neural_networks_classifier = MLPClassifier(hidden_layer_sizes=(10, 10, 10), max_iter=200, alpha=0.0001, solver='adam', verbose=0, random_state=0, tol=0.000000001)

    (accuracy, precision, recall, f1, conf_matrix) = train_classifier(dataset, neural_networks_classifier, test_size)

    return accuracy, precision, recall, f1, conf_matrix, neural_networks_classifier



In [None]:
neural_networks_data_best = neural_networks_algorithm(sample_best)
(accuracy_best, precision_best, recall_best, f1_best, conf_matrix_best, neural_networks_classifier_best) = neural_networks_data_best


neural_networks_data = neural_networks_algorithm(sample)
(accuracy, precision, recall, f1, conf_matrix, neural_networks_classifier) = neural_networks_data


### 8.3 Evaluating Models

#### 8.3.1 Confusion Matrix

In [None]:
# Neural network results on sample_best (the six selected features)

print("Accuracy:", accuracy_best)
print("Precision:", precision_best)
print("Recall:", recall_best)
print("F1 score:", f1_best)

disp = metrics.ConfusionMatrixDisplay(confusion_matrix = conf_matrix_best, display_labels = neural_networks_classifier_best.classes_)

# Plot the confusion matrix
fig, ax = plt.subplots(figsize=(8, 6))
disp.plot(ax=ax)
ax.set_title('Confusion Matrix')
plt.show()

In [None]:
# Neural network results on sample (all remaining features)

print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("F1 score:", f1)

disp = metrics.ConfusionMatrixDisplay(confusion_matrix = conf_matrix, display_labels = neural_networks_classifier.classes_)

# Plot the confusion matrix
fig, ax = plt.subplots(figsize=(8, 6))
disp.plot(ax=ax)
ax.set_title('Confusion Matrix')
plt.show()

##### Key Observations

**Accuracy:** The neural network reaches 0.99871 on `sample` and 0.99614 on `sample_best`, so the full feature set is more accurate.

**Precision:** 0.99907 on `sample` vs. 0.99865 on `sample_best`. The full feature set yields a smaller share of false positives among the PDFs predicted as malicious.

**Recall:** 0.99950 on `sample` vs. 0.99705 on `sample_best`. The full feature set misses fewer malicious PDFs.

**F1 Score:** 0.99928 on `sample` vs. 0.99785 on `sample_best`, indicating a better balance between precision and recall with the full feature set.

Since `sample` outperforms `sample_best` on every metric, we can conclude that reducing the dataset to the six selected features discards some useful information.

#### 8.3.2 Precision

For the original sample:

As we can see in the images, the model with best precision is ?? and ... is the ... .

For the sample with the 6 best features:

As we can see in the images, the model with best precision is ?? and ... is the ... .

#### 8.3.3 Recall

For the original sample:

As we can see in the images, the model with best recall is ?? and ... is the ... .

For the sample with the 6 best features:

As we can see in the images, the model with best recall is ?? and ... is the ... .

#### 8.3.4 Accuracy

For the original sample:

As we can see in the images, the model with best accuracy is ?? and ... is the ... .

For the sample with the 6 best features:

As we can see in the images, the model with best accuracy is ?? and ... is the ... .



#### 8.3.5 F1 Score

For the original sample:

As we can see in the images, the model with best f1 score is ?? and ... is the ... .

For the sample with the 6 best features:

As we can see in the images, the model with best f1 score is ?? and ... is the ... .

### 8.4 Comparing Model Performance
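
A comparison of this kind could be assembled as a small table. The sketch below uses only the metrics already reported in the text above, rounded to five decimals. KNN and the decision tree on `sample_best` are left out because their figures are not stated in the text. The column names are chosen here for illustration.

```python
import pandas as pd

# Metrics copied from the results reported earlier in this notebook (rounded to five decimals)
results = pd.DataFrame(
    [
        ("Decision Tree", "sample", 0.99866, 0.99933, 0.99918, 0.99925),
        ("Neural Network", "sample", 0.99871, 0.99907, 0.99950, 0.99928),
        ("Neural Network", "sample_best", 0.99614, 0.99865, 0.99705, 0.99785),
    ],
    columns=["model", "dataset", "accuracy", "precision", "recall", "f1"],
)

# Rank the runs by F1 score, best first
comparison = results.sort_values("f1", ascending=False).reset_index(drop=True)
print(comparison.to_string(index=False))
```

In the notebook itself, the rows could instead be filled from the tuples returned by `decision_tree_algorithm`, `knn_algorithm` and `neural_networks_algorithm`.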