# Anomaly Detection Data Preparation

This document outlines the steps used to prepare the training, test, and validation network flow datasets for a machine learning anomaly detection model.

## 1. Imports and Data Loading
The script begins by importing necessary libraries, such as `numpy` and `pandas`, and then loads the training, test, and validation datasets from specified file paths using `pd.read_csv()`.

## 2. Data Cleaning
After loading the training dataset, the script prints the per-column NaN count and fills any missing values with 0, so that later arithmetic does not propagate NaN. A helper, `safe_float`, converts a value to `float` and returns `None` when the value cannot be parsed; it is passed to `pd.read_csv()` as the converter for the `duration` column.
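
The cleaning step can be sketched on a toy CSV as follows. The data is invented for illustration, and this `safe_float` also catches `TypeError`, which is an addition to the version used in the notebook.

```python
import io
import pandas as pd

def safe_float(value):
    """Convert value to float; return None when it cannot be converted."""
    try:
        return float(value)
    except (ValueError, TypeError):  # TypeError is an addition, covers None inputs
        return None

# Toy data invented for illustration (not the project's CSV).
numeric_csv = "duration,total_packets\n1.5,10\n,4\n2.0,\n"
raw = pd.read_csv(io.StringIO(numeric_csv))
nan_counts = raw.isna().sum()   # duration: 1, total_packets: 1
filled = raw.fillna(0)          # same strategy as the script: NaN -> 0

# Unparseable or empty 'duration' cells go through safe_float and come back missing.
messy_csv = "duration,total_packets\n1.5,10\nabc,4\n,7\n"
converted = pd.read_csv(io.StringIO(messy_csv), converters={"duration": safe_float})

print(nan_counts)
print(converted)
```

Filling with 0 is convenient but treats "missing" as "zero"; whether that is acceptable depends on the column.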

## 3. Data Type Specification
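The datasets are then reloaded with an explicit `dtype` mapping passed to `pd.read_csv()`. Declaring types up front avoids type inference on large files and keeps identifier-like columns such as ports as strings instead of letting pandas guess. The mapping only takes effect for keys that match the column names in the CSV header; pandas silently ignores keys that do not match.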
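
A minimal sketch of the effect of `dtype`, using an invented three-row CSV. The `int32` and `category` choices are assumptions made for illustration; the notebook itself uses `object`, `float`, and `int`.

```python
import io
import pandas as pd

# Toy data invented for illustration.
csv_text = (
    "source_port,total_packets,protocol\n"
    "80,12,tcp\n"
    "443,7,tcp\n"
    "53,3,udp\n"
)

# Without dtype, pandas infers source_port as an integer column.
inferred = pd.read_csv(io.StringIO(csv_text))

# With dtype, ports stay strings and the other columns use narrower types.
typed = pd.read_csv(
    io.StringIO(csv_text),
    dtype={"source_port": str, "total_packets": "int32", "protocol": "category"},
)

# A key that does not match any header name is ignored without an error.
ignored = pd.read_csv(io.StringIO(csv_text), dtype={"number_total_packets": "int32"})

print(inferred.dtypes)
print(typed.dtypes)
print(ignored.dtypes)
```

Comparing `inferred.memory_usage(deep=True)` with `typed.memory_usage(deep=True)` on the real files is one way to check whether a given mapping actually saves memory.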

## 4. Deep Copying DataFrames
Deep copies of the original dataframes are created for manipulation. This ensures that the original data remains intact while allowing for modifications in the copies.
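
The difference between an alias and a deep copy can be sketched as follows, with an invented two-row frame. `DataFrame.copy()` defaults to `deep=True` and is equivalent to `copy.deepcopy` for this purpose.

```python
import pandas as pd

original = pd.DataFrame({"duration": [1.0, 2.0]})

alias = original            # same object; edits through alias change original
working = original.copy()   # independent deep copy (deep=True is the default)

working.loc[0, "duration"] = 99.0

print(original)  # unchanged
print(working)   # first row modified
```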

## 5. Feature Engineering
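New features are derived from existing columns. A `flowID` string is built by joining the source IP, source port, direction, destination IP, and destination port with `/`, and the following rate metrics are computed:
- **Packets per Second (pps)**: `total_packets` divided by `duration`.
- **Bytes per Second (bps)**: computed one-way (`bytes_transferred_source_to_dest`) and two-way (`bytes_transferred_total`), each divided by `duration`.
- **Bytes per Packet (bpp)**: computed one-way and two-way by dividing the same byte counts by `total_packets`.

Zero denominators are replaced with `np.inf` before dividing, so flows with zero duration or zero packets get a value of 0 rather than `inf` or NaN.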
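
The rate metrics above can be sketched as a single helper. The function name `add_rate_features` and the sample rows are hypothetical; the column names and the zero-to-`inf` replacement follow the notebook.

```python
import numpy as np
import pandas as pd

def add_rate_features(df):
    """Return a copy of df with pps, bps_* and bpp_* columns added."""
    out = df.copy()
    # A zero denominator becomes inf, so the quotient is 0.0 instead of inf or NaN.
    duration = out["duration"].astype(float).replace({0: np.inf})
    packets = out["total_packets"].astype(float).replace({0: np.inf})
    out["pps"] = out["total_packets"] / duration
    out["bps_oneway"] = out["bytes_transferred_source_to_dest"] / duration
    out["bps_twoway"] = out["bytes_transferred_total"] / duration
    out["bpp_oneway"] = out["bytes_transferred_source_to_dest"] / packets
    out["bpp_twoway"] = out["bytes_transferred_total"] / packets
    return out

# Sample rows invented for illustration: a normal flow, a zero-duration flow,
# and a flow with no packets.
flows = pd.DataFrame({
    "duration": [2.0, 0.0, 4.0],
    "total_packets": [10, 5, 0],
    "bytes_transferred_total": [1000, 300, 0],
    "bytes_transferred_source_to_dest": [600, 300, 0],
})

features = add_rate_features(flows)
print(features[["pps", "bps_oneway", "bps_twoway", "bpp_oneway", "bpp_twoway"]])
```

Applying one helper to the training, test, and validation frames keeps the three feature sets consistent. Note that mapping zero-duration flows to a rate of 0 is a modelling choice, not a measured value.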

## 6. Merging Data
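The script checks whether the `label` column exists in the validation dataset. If it does, a binary `labelvalues` column is created: 1 when the label contains the string 'Botnet' (case-insensitive), 0 otherwise, including for missing labels. The script also joins the `source_type_service` column from the training data onto the validation data. This join is by row position rather than by a shared key, and because `merge` defaults to an inner join, the result keeps only the row positions present in both frames.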
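
A minimal sketch of both operations on invented data. The label strings and frame sizes are assumptions chosen to show the edge cases: a missing label, and a training frame shorter than the validation frame.

```python
import numpy as np
import pandas as pd

# Labels invented for illustration.
valid = pd.DataFrame({"label": ["Botnet-A", "background", None, "normal BOTNET-b"]})

# 1 when the label mentions 'botnet' in any case, 0 otherwise (missing -> 0).
valid["labelvalues"] = np.where(
    valid["label"].str.contains("Botnet", case=False, na=False), 1, 0
)

# A training frame with fewer rows than the validation frame.
train = pd.DataFrame({"source_type_service": [0.0, 0.0]})

# Positional join on the index: inner (default) drops unmatched rows,
# how="left" keeps every validation row and fills the gaps with NaN.
inner = valid.merge(train, left_index=True, right_index=True)
left = valid.merge(train, left_index=True, right_index=True, how="left")

print(valid)
print(len(inner), len(left))
```

If the two frames do not describe the same flows in the same order, a positional join pairs unrelated rows, so it is worth confirming the alignment before relying on the merged column.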

## 7. Saving Processed Data
Finally, the processed test, validation, and training datasets are written to CSV files in the working directory with `to_csv(..., index=False)`, so later steps of the machine learning pipeline can load them directly.
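
The effect of `index=False` can be sketched with an in-memory buffer instead of a file. The one-row frame is invented for illustration.

```python
import io
import pandas as pd

frame = pd.DataFrame({"flowID": ["a/1/->/b/2"], "pps": [5.0]})

# Written without the index: the CSV holds only the data columns.
buffer_without = io.StringIO()
frame.to_csv(buffer_without, index=False)

# Written with the default index: an extra unnamed column appears on reload.
buffer_with = io.StringIO()
frame.to_csv(buffer_with)

without_index = pd.read_csv(io.StringIO(buffer_without.getvalue()))
with_index = pd.read_csv(io.StringIO(buffer_with.getvalue()))

print(list(without_index.columns))
print(list(with_index.columns))
```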

This structured approach ensures that the data is clean, well-defined, and enriched with relevant features, setting a strong foundation for the anomaly detection model.


In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import copy
from sklearn.preprocessing import LabelEncoder

In [6]:
training_df = pd.read_csv(
    "/Users/sanatkumargupta/Desktop/ML_Anomaly_Detection/data/trainFeatGen.csv",
    encoding="utf8",
    header=0,
    dtype={
        'timestamp': object,
        'duration': float,
        'protocol': object,
        'source_ip': object,
        'source_port': object,
        'direction': object,
        'destination_ip': object,
        'destination_port': object,
        'state': object,
        # Keys below match the CSV header names; unknown keys are silently ignored.
        'source_type_of_service': object,
        'destination_type_of_service': object,
        'total_packets': int,
        'bytes_transferred_total': int,
        'bytes_transferred_source_to_dest': int
    },
    low_memory=False
)


In [7]:
training_df = pd.read_csv("/Users/sanatkumargupta/Desktop/ML_Anomaly_Detection/data/trainFeatGen.csv", encoding="utf8", header=0, low_memory=False)

# Check for NaN values
print(training_df.isna().sum())

# If there are NaN values, you can fill or drop them as necessary
training_df.fillna(0, inplace=True)  # Example: fill NaNs with 0

timestamp                           0
duration                            0
protocol                            0
source_ip                           0
source_port                         0
direction                           0
destination_ip                      0
destination_port                    0
state                               0
source_type_of_service              0
destination_type_of_service         0
total_packets                       0
bytes_transferred_total             0
bytes_transferred_source_to_dest    0
dtype: int64


In [8]:
def safe_float(value):
    try:
        return float(value)
    except ValueError:
        return None  # or some default value

training_df = pd.read_csv(
    "/Users/sanatkumargupta/Desktop/ML_Anomaly_Detection/data/trainFeatGen.csv",
    encoding="utf8",
    header=0,
    converters={'duration': safe_float},
    low_memory=False
)

In [13]:
# Shared dtype mapping. Keys must match the CSV header names exactly;
# pandas silently ignores dtype keys that are not in the file.
# int assumes the packet and byte columns contain no missing values.
DATA_DIR = "/Users/sanatkumargupta/Desktop/ML_Anomaly_Detection/data"
COLUMN_DTYPES = {
    'timestamp': object, 'duration': float, 'protocol': object,
    'source_ip': object, 'source_port': object, 'direction': object,
    'destination_ip': object, 'destination_port': object, 'state': object,
    'source_type_of_service': object, 'destination_type_of_service': object,
    'total_packets': int, 'bytes_transferred_total': int,
    'bytes_transferred_source_to_dest': int,
}

training_df = pd.read_csv(f"{DATA_DIR}/trainFeatGen.csv", encoding="utf8", header=0, dtype=COLUMN_DTYPES, low_memory=False)

test_df = pd.read_csv(f"{DATA_DIR}/testFeatGen.csv", encoding="utf8", header=0, dtype=COLUMN_DTYPES, low_memory=False)

# The validation file additionally carries the 'label' column.
valid_df = pd.read_csv(f"{DATA_DIR}/validationFeatGen.csv", encoding="utf8", header=0, dtype={**COLUMN_DTYPES, 'label': object}, low_memory=False)


In [15]:
traindata = copy.deepcopy(training_df)
testdata = copy.deepcopy(test_df)
validdata = copy.deepcopy(valid_df)

**Checking training data**

In [23]:
print(traindata.columns)

Index(['timestamp', 'duration', 'protocol', 'source_ip', 'source_port',
       'direction', 'destination_ip', 'destination_port', 'state',
       'source_type_of_service', 'destination_type_of_service',
       'total_packets', 'bytes_transferred_total',
       'bytes_transferred_source_to_dest', 'source_type_service'],
      dtype='object')


In [24]:
traindata['source_port'] = traindata['source_port'].fillna('None')
traindata['direction'] = traindata['direction'].str.strip()
traindata['destination_port'] = traindata['destination_port'].fillna('None')
traindata['state'] = traindata['state'].fillna('None')
traindata['source_type_service'] = traindata['source_type_of_service']

**Checking validation data**

In [27]:
validdata['source_port'] = validdata['source_port'].fillna('None')
validdata['direction'] = validdata['direction'].str.strip()
validdata['destination_port'] = validdata['destination_port'].fillna('None')
validdata['state'] = validdata['state'].fillna('None')
validdata['source_type_of_service'] = validdata['source_type_of_service'].fillna('None')
validdata['destination_type_of_service'] = validdata['destination_type_of_service'].fillna('None')

In [29]:
# Check if 'label' column exists
if 'label' in validdata.columns:
    truelabels = validdata[['label']].copy()
    truelabels['labelvalues'] = np.where(truelabels['label'].str.contains('Botnet', case=False, na=False), 1, 0)
    labelvalues = truelabels[['labelvalues']].copy()
    validationdata = validdata.join(labelvalues)
else:
    print("Column 'label' not found in validdata.")


Column 'label' not found in validdata.


In [34]:
print(validdata.columns)
print(traindata.columns)
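# Join train's 'source_type_service' onto validdata by row position.
# drop=True keeps the old index from being added as 'index_x'/'index_y' columns;
# the default inner join keeps only row positions present in both frames.
validdata = validdata.reset_index(drop=True).merge(traindata[['source_type_service']].reset_index(drop=True), left_index=True, right_index=True)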

Index(['timestamp', 'duration', 'protocol', 'source_ip', 'source_port',
       'direction', 'destination_ip', 'destination_port', 'state',
       'source_type_of_service', 'destination_type_of_service',
       'total_packets', 'bytes_transferred_total',
       'bytes_transferred_source_to_dest', 'label'],
      dtype='object')
Index(['timestamp', 'duration', 'protocol', 'source_ip', 'source_port',
       'direction', 'destination_ip', 'destination_port', 'state',
       'source_type_of_service', 'destination_type_of_service',
       'total_packets', 'bytes_transferred_total',
       'bytes_transferred_source_to_dest', 'source_type_service'],
      dtype='object')


In [35]:
truelabels = validdata[['label']].copy()
truelabels['labelvalues'] = np.where(truelabels['label'].str.contains('Botnet', case = False, na = False), 1, 0)
labelvalues = truelabels[['labelvalues']].copy()
validationdata = validdata.join(labelvalues)

In [37]:
testdata['flowID'] = testdata[['source_ip', 'source_port', 'direction', 'destination_ip', 'destination_port']].agg('/'.join, axis = 1)
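
The `flowID` construction can be sketched on two invented flows. Because `str.join` only accepts strings, this sketch casts the key columns with `astype(str)` first; that cast is an addition to the notebook's code and matters when a port column holds numbers or NaN.

```python
import pandas as pd

# Flows invented for illustration; 'None' mirrors the notebook's fill value for ports.
flows = pd.DataFrame({
    "source_ip": ["10.0.0.1", "10.0.0.2"],
    "source_port": ["1234", "None"],
    "direction": ["->", "<->"],
    "destination_ip": ["10.0.0.9", "10.0.0.9"],
    "destination_port": [80, 53],   # numeric on purpose, to show the cast
})

key_columns = ["source_ip", "source_port", "direction", "destination_ip", "destination_port"]

# Cast to str, then join each row's values with '/'.
flows["flowID"] = flows[key_columns].astype(str).agg("/".join, axis=1)

print(flows["flowID"].tolist())
```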

In [39]:
print(testdata.columns)

Index(['timestamp', 'duration', 'protocol', 'source_ip', 'source_port',
       'direction', 'destination_ip', 'destination_port', 'state',
       'source_type_of_service', 'destination_type_of_service',
       'total_packets', 'bytes_transferred_total',
       'bytes_transferred_source_to_dest', 'flowID'],
      dtype='object')


In [40]:
testdata.loc[:, 'pps'] = testdata.total_packets/testdata.duration.replace({0: np.inf})

In [43]:
testdata.loc[:, 'bps_oneway'] = testdata.bytes_transferred_source_to_dest/testdata.duration.replace({0: np.inf})

In [46]:
testdata.loc[:, 'bpp_oneway'] = testdata.bytes_transferred_source_to_dest/testdata.total_packets.replace({0: np.inf})

In [47]:
testdata.loc[:, 'bps_twoway'] = testdata.bytes_transferred_total/testdata.duration.replace({0: np.inf})

In [48]:
testdata.loc[:, 'bpp_twoway'] = testdata.bytes_transferred_total/testdata.total_packets.replace({0: np.inf})

In [49]:
testdata.to_csv('testFeatGen.csv', index = False)

In [52]:
validationdata['flowID'] = validationdata[['source_ip', 'source_port', 'direction', 'destination_ip', 'destination_port']].agg('/'.join, axis = 1)
validationdata.loc[:, 'pps'] = validationdata.total_packets/validationdata.duration.replace({0: np.inf})
validationdata.loc[:, 'bps_oneway'] = validationdata.bytes_transferred_source_to_dest/validationdata.duration.replace({0: np.inf})
validationdata.loc[:, 'bpp_oneway'] = validationdata.bytes_transferred_source_to_dest/validationdata.total_packets.replace({0: np.inf})
validationdata.loc[:, 'bps_twoway'] = validationdata.bytes_transferred_total/validationdata.duration.replace({0: np.inf})
validationdata.loc[:, 'bpp_twoway'] = validationdata.bytes_transferred_total/validationdata.total_packets.replace({0: np.inf})
validationdata.to_csv('validationFeatGen.csv', index = False)

In [53]:
traindata['flowID'] = traindata[['source_ip', 'source_port', 'direction', 'destination_ip', 'destination_port']].agg('/'.join, axis = 1)
traindata.loc[:, 'pps'] = traindata.total_packets/traindata.duration.replace({0: np.inf})
traindata.loc[:, 'bps_oneway'] = traindata.bytes_transferred_source_to_dest/traindata.duration.replace({0: np.inf})
traindata.loc[:, 'bpp_oneway'] = traindata.bytes_transferred_source_to_dest/traindata.total_packets.replace({0: np.inf})
traindata.loc[:, 'bps_twoway'] = traindata.bytes_transferred_total/traindata.duration.replace({0: np.inf})
traindata.loc[:, 'bpp_twoway'] = traindata.bytes_transferred_total/traindata.total_packets.replace({0: np.inf})
traindata.to_csv('TrainFeatGen.csv', index = False)