# Data Preprocessing for Anomaly Detection in UEBA

## Introduction
Data preprocessing is a crucial step in the machine learning pipeline, especially for anomaly detection tasks in User and Entity Behavior Analytics (UEBA). This process involves cleaning and transforming raw data into a format suitable for model training. Effective preprocessing can enhance model performance by reducing noise, handling missing values, and encoding categorical variables.

In this notebook, we will:
- Load the dataset
- Inspect the data for quality issues (data types, missing values, summary statistics)
- Handle missing values
- Encode categorical variables
- Normalize numerical features
- Save the preprocessed dataset for future use

Let's begin by loading the necessary libraries and the dataset.


In [22]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder
import joblib  # Used later to save the fitted encoders
import os


In [23]:
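# Load the dataset. The file starts with a UTF-8 byte-order mark (BOM);
# 'utf-8-sig' strips it so the first column is read as 'id' instead of 'ï»¿id'.
file_path = r'C:\Users\USER\UEBA_Project\anomaly_detection\data\raw\UNSW_NB15_training-set.csv'
df = pd.read_csv(file_path, encoding='utf-8-sig')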

# Display the first few rows of the dataset
df.head()


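|   | id | dur | proto | service | state | spkts | dpkts | sbytes | dbytes | rate | ... | ct_dst_sport_ltm | ct_dst_src_ltm | is_ftp_login | ct_ftp_cmd | ct_flw_http_mthd | ct_src_ltm | ct_srv_dst | is_sm_ips_ports | attack_cat | label |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1.1e-05 | udp | - | INT | 2 | 0 | 496 | 0 | 90909.0902 | ... | 1 | 2 | 0 | 0 | 0 | 1 | 2 | 0 | Normal | 0 |
| 1 | 2 | 8e-06 | udp | - | INT | 2 | 0 | 1762 | 0 | 125000.0003 | ... | 1 | 2 | 0 | 0 | 0 | 1 | 2 | 0 | Normal | 0 |
| 2 | 3 | 5e-06 | udp | - | INT | 2 | 0 | 1068 | 0 | 200000.0051 | ... | 1 | 3 | 0 | 0 | 0 | 1 | 3 | 0 | Normal | 0 |
| 3 | 4 | 6e-06 | udp | - | INT | 2 | 0 | 900 | 0 | 166666.6608 | ... | 1 | 3 | 0 | 0 | 0 | 2 | 3 | 0 | Normal | 0 |
| 4 | 5 | 1e-05 | udp | - | INT | 2 | 0 | 2126 | 0 | 100000.0025 | ... | 1 | 3 | 0 | 0 | 0 | 2 | 3 | 0 | Normal | 0 |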


## 1. Inspecting the Data
Before preprocessing, it's essential to understand the dataset's structure and identify any issues. We'll check the data types, look for missing values, and examine basic statistics.


In [24]:
# Check the data types and missing values
df.info()

# Summary statistics
df.describe()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 82332 entries, 0 to 82331
Data columns (total 45 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   id                 82332 non-null  int64  
 1   dur                82332 non-null  float64
 2   proto              82332 non-null  object 
 3   service            82332 non-null  object 
 4   state              82332 non-null  object 
 5   spkts              82332 non-null  int64  
 6   dpkts              82332 non-null  int64  
 7   sbytes             82332 non-null  int64  
 8   dbytes             82332 non-null  int64  
 9   rate               82332 non-null  float64
 10  sttl               82332 non-null  int64  
 11  dttl               82332 non-null  int64  
 12  sload              82332 non-null  float64
 13  dload              82332 non-null  float64
 14  sloss              82332 non-null  int64  
 15  dloss              82332 non-null  int64  
 16  sinpkt             823

|   | id | dur | spkts | dpkts | sbytes | dbytes | rate | sttl | dttl | sload | ... | ct_src_dport_ltm | ct_dst_sport_ltm | ct_dst_src_ltm | is_ftp_login | ct_ftp_cmd | ct_flw_http_mthd | ct_src_ltm | ct_srv_dst | is_sm_ips_ports | label |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 82332.0 | 82332.0 | 82332.0 | 82332.0 | 82332.0 | 82332.0 | 82332.0 | 82332.0 | 82332.0 | 82332.0 | ... | 82332.0 | 82332.0 | 82332.0 | 82332.0 | 82332.0 | 82332.0 | 82332.0 | 82332.0 | 82332.0 | 82332.0 |
| mean | 41166.5 | 1.006756 | 18.666472 | 17.545936 | 7993.908 | 13233.79 | 82410.89 | 180.967667 | 95.713003 | 64549020.0 | ... | 4.928898 | 3.663011 | 7.45636 | 0.008284 | 0.008381 | 0.129743 | 6.46836 | 9.164262 | 0.011126 | 0.5506 |
| std | 23767.345519 | 4.710444 | 133.916353 | 115.574086 | 171642.3 | 151471.5 | 148620.4 | 101.513358 | 116.667722 | 179861800.0 | ... | 8.389545 | 5.915386 | 11.415191 | 0.091171 | 0.092485 | 0.638683 | 8.543927 | 11.121413 | 0.104891 | 0.497436 |
| min | 1.0 | 0.0 | 1.0 | 0.0 | 24.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 1.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 |
| 25% | 20583.75 | 8e-06 | 2.0 | 0.0 | 114.0 | 0.0 | 28.60611 | 62.0 | 0.0 | 11202.47 | ... | 1.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 2.0 | 0.0 | 0.0 |
| 50% | 41166.5 | 0.014138 | 6.0 | 2.0 | 534.0 | 178.0 | 2650.177 | 254.0 | 29.0 | 577003.2 | ... | 1.0 | 1.0 | 3.0 | 0.0 | 0.0 | 0.0 | 3.0 | 5.0 | 0.0 | 1.0 |
| 75% | 61749.25 | 0.71936 | 12.0 | 10.0 | 1280.0 | 956.0 | 111111.1 | 254.0 | 252.0 | 65142860.0 | ... | 4.0 | 3.0 | 6.0 | 0.0 | 0.0 | 0.0 | 7.0 | 11.0 | 0.0 | 1.0 |
| max | 82332.0 | 59.999989 | 10646.0 | 11018.0 | 14355770.0 | 14657530.0 | 1000000.0 | 255.0 | 253.0 | 5268000000.0 | ... | 59.0 | 38.0 | 63.0 | 2.0 | 2.0 | 16.0 | 60.0 | 62.0 | 1.0 | 1.0 |
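
Beyond `info()` and `describe()`, two quick checks are often useful before choosing encoders and models: how many distinct values each text column has, and how balanced the label is. The `label` mean of 0.5506 in the summary statistics shows that about 55% of rows are labeled 1. The sketch below runs these checks on a small made-up frame; `toy` and its values are assumptions for illustration, not rows from the dataset.

```python
import pandas as pd

# Toy stand-in for the real frame: column names mirror the dataset, values are made up.
toy = pd.DataFrame({
    "proto": ["udp", "udp", "tcp", "tcp", "arp"],
    "service": ["-", "dns", "http", "-", "-"],
    "dur": [0.000011, 0.000008, 0.5, 1.2, 0.0],
    "label": [0, 0, 1, 1, 0],
})

# Distinct values per non-numeric column: high cardinality affects the encoding choice.
cardinality = toy.select_dtypes(exclude="number").nunique()

# Share of each class in the binary label.
class_share = toy["label"].value_counts(normalize=True)

# Columns holding a single value carry no information for a model.
constant_cols = [col for col in toy.columns if toy[col].nunique() == 1]

print(cardinality)
print(class_share)
print("Constant columns:", constant_cols)
```

On the real data, the same three lines can be applied to `df` instead of `toy`.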


## 2. Handling Missing Values
Missing values can degrade model performance, and many estimators refuse to train on them at all. We first count the missing values in each column, then decide whether to drop or impute them.


In [25]:
# Check for missing values
missing_values = df.isnull().sum()
print("Missing values in each column:")
print(missing_values[missing_values > 0])

# Dropping unnecessary columns (if any)
# df = df.drop(columns=['unnecessary_column'])  # Example


Missing values in each column:
Series([], dtype: int64)


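The check above finds no missing values in this dataset, so the cell below is a no-op here. It is kept as an example of mean imputation for numerical columns; dropping the affected rows or columns is the alternative.

In [None]:

# Example: fill missing values in numerical columns with the column mean.
# numeric_only=True skips the text columns (proto, service, state, attack_cat).
df.fillna(df.mean(numeric_only=True), inplace=True)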
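
Mean imputation can be skewed by heavy-tailed columns; in the summary statistics, for instance, `sbytes` has a median of 534 but a mean of about 7994. A minimal sketch of one common alternative, assuming median imputation for numerical columns and most-frequent imputation for the rest, is shown on a made-up frame (`toy` and its values are hypothetical):

```python
import numpy as np
import pandas as pd

# Made-up frame with one missing numeric value and one missing text value.
toy = pd.DataFrame({
    "dur": [0.1, np.nan, 0.3, 10.0],
    "service": ["dns", None, "dns", "http"],
})

numeric_cols = toy.select_dtypes(include="number").columns
other_cols = toy.columns.difference(numeric_cols)

imputed = toy.copy()

# The median is less sensitive than the mean to extreme values.
imputed[numeric_cols] = imputed[numeric_cols].fillna(imputed[numeric_cols].median())

# Non-numeric columns get their most frequent value.
for col in other_cols:
    imputed[col] = imputed[col].fillna(imputed[col].mode().iloc[0])

print(imputed)
```

Whether the median, the mean, or dropping rows is the better choice depends on how much data is missing and why, so treat this as a starting point.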


## 3. Encoding Categorical Variables
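Machine learning models typically require numerical input. We'll convert each categorical column to integer codes with scikit-learn's `LabelEncoder`, fitting one encoder per column and saving the fitted encoders so the same mapping can be reapplied later. Integer codes impose an arbitrary order on the categories; one-hot encoding avoids this at the cost of extra columns.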


In [26]:
categorical_cols = df.select_dtypes(include=['object']).columns.tolist()
print("Categorical columns to encode:", categorical_cols)

# Initialize a dictionary to store label encoders for features
label_encoders = {}

# Encode categorical features
for col in categorical_cols:
    if col != 'attack_cat':  # Skip the target variable
        label_encoder = LabelEncoder()  # Create a new encoder for each column
        df[col] = label_encoder.fit_transform(df[col])
        label_encoders[col] = label_encoder  # Store the encoder in the dictionary

# Encode the target variable
attack_cat_encoder = LabelEncoder()
df['attack_cat'] = attack_cat_encoder.fit_transform(df['attack_cat'])

# Save the fitted label encoders for future use
joblib.dump(label_encoders, 'label_encoders.pkl')  # Save feature encoders
joblib.dump(attack_cat_encoder, 'attack_cat_encoder.pkl')  # Save target variable encoder

# Check the first few rows of the encoded dataset
print(df.head())

Categorical columns to encode: ['proto', 'service', 'state', 'attack_cat']
   id       dur  proto  service  state  spkts  dpkts  sbytes  dbytes  \
0   1  0.000011    117        0      4      2      0     496       0   
1   2  0.000008    117        0      4      2      0    1762       0   
2   3  0.000005    117        0      4      2      0    1068       0   
3   4  0.000006    117        0      4      2      0     900       0   
4   5  0.000010    117        0      4      2      0    2126       0   

          rate  ...  ct_dst_sport_ltm  ct_dst_src_ltm  is_ftp_login  \
0   90909.0902  ...                 1               2             0   
1  125000.0003  ...                 1               2             0   
2  200000.0051  ...                 1               3             0   
3  166666.6608  ...                 1               3             0   
4  100000.0025  ...                 1               3             0   

   ct_ftp_cmd  ct_flw_http_mthd  ct_src_ltm  ct
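
One caveat when reusing saved encoders on new data: `LabelEncoder.transform` raises a `ValueError` for a category it did not see during fitting. A possible alternative, sketched here rather than taken from this notebook, is scikit-learn's `OrdinalEncoder` with `handle_unknown="use_encoded_value"`, which maps unseen categories to a sentinel value. The `train` and `test` frames and the sentinel `-1` are assumptions for illustration.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Made-up data: 'icmp' and 'CON' appear only in the test frame.
train = pd.DataFrame({"proto": ["tcp", "udp", "arp"], "state": ["FIN", "INT", "INT"]})
test = pd.DataFrame({"proto": ["udp", "icmp"], "state": ["INT", "CON"]})

# Unseen categories are encoded as -1 instead of raising an error.
encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)

# Fit on the training data only, then reuse the same mapping for the test data.
train_encoded = encoder.fit_transform(train)
test_encoded = encoder.transform(test)

print(encoder.categories_)
print(test_encoded)
```

The encoder handles all feature columns in one object, so a single `joblib.dump(encoder, ...)` call would replace the per-column dictionary.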

## 4. Normalizing Numerical Features
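Features in this dataset span very different ranges (for example, sload reaches about 5.27e9 while is_ftp_login never exceeds 2), so we rescale the numerical features to the [0, 1] range with min-max scaling. This step matters most for algorithms that are sensitive to feature scale, such as distance-based and gradient-based models.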


In [20]:
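# Normalize numerical features to the [0, 1] range
from sklearn.preprocessing import MinMaxScaler

# Leave the record identifier (first column) and the targets unscaled
exclude_cols = [df.columns[0], 'label', 'attack_cat']
numerical_cols = [col for col in df.select_dtypes(include=[np.number]).columns
                  if col not in exclude_cols]

scaler = MinMaxScaler()
df[numerical_cols] = scaler.fit_transform(df[numerical_cols])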
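
Min-max scaling maps each value to x' = (x - min) / (max - min), where min and max come from the data the scaler was fitted on. When a separate test set is processed later, the usual practice is to reuse the scaler fitted on the training set rather than refit it. A minimal sketch on made-up arrays (`train` and `test` are hypothetical):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up training and test features (two columns).
train = np.array([[0.0, 10.0], [5.0, 20.0], [10.0, 30.0]])
test = np.array([[2.5, 40.0]])

scaler = MinMaxScaler()

# Learn per-column min and max from the training data only.
train_scaled = scaler.fit_transform(train)

# Apply the same min and max to the test data; no refitting.
test_scaled = scaler.transform(test)

print(train_scaled)
print(test_scaled)
```

Test values outside the training range land outside [0, 1], as the second test column does here (1.5), which is expected with this approach.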


## 5. Saving the Preprocessed Dataset
After preprocessing, we will save the cleaned and transformed dataset to a new CSV file for future use in model training.


In [27]:
# Save the preprocessed dataset
preprocessed_file_path = r'C:\Users\USER\UEBA_Project\anomaly_detection\data\processed\preprocessed_train_data.csv'
os.makedirs(os.path.dirname(preprocessed_file_path), exist_ok=True)  # Create the output folder if needed
df.to_csv(preprocessed_file_path, index=False)

print("Preprocessed dataset saved successfully!")


Preprocessed dataset saved successfully!
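
A quick way to confirm that the saved file is usable is to read it back and compare it with the frame in memory. The sketch below does this round trip in a temporary directory with a made-up frame; the folder name, file name, and `toy` are assumptions, not the project's real paths.

```python
import os
import tempfile

import pandas as pd

# Made-up preprocessed frame.
toy = pd.DataFrame({"dur": [0.1, 0.2], "proto": [117, 5], "label": [0, 1]})

with tempfile.TemporaryDirectory() as tmp_dir:
    out_dir = os.path.join(tmp_dir, "processed")  # hypothetical output folder
    os.makedirs(out_dir, exist_ok=True)
    out_path = os.path.join(out_dir, "toy_preprocessed.csv")

    # index=False keeps the row index out of the file, so no 'Unnamed: 0' column on reload.
    toy.to_csv(out_path, index=False)
    reloaded = pd.read_csv(out_path)

same_columns = list(reloaded.columns) == list(toy.columns)
same_shape = reloaded.shape == toy.shape
print("Columns match:", same_columns, "| Shape matches:", same_shape)
```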
