#### Copyright 2025, Battelle Energy Alliance, LLC, ALL RIGHTS RESERVED

### Necessary Imports
The `sys.path.append` call is a workaround that makes sure Python can find your files. I don't recommend it, but it works if you aren't familiar with package configuration.

`Zeek` is the dataset class.
`ZeekCleaner` is a utility that automatically detects the type of your data and cleans it accordingly.

In [1]:
import sys
sys.path.append("/home/katoaa/internship2024/katoaa")

from data.datasets import Zeek
from data.cleaning import ZeekCleaner
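If you keep the `sys.path` workaround, a slightly safer variant resolves the path first and avoids appending it twice when the cell is re-run. This is only a sketch: `add_project_root` is a hypothetical helper, not part of this project, and the current directory stands in for the real project root.

```python
import sys
from pathlib import Path


def add_project_root(path):
    """Append a directory to sys.path once and return the resolved path string."""
    resolved = str(Path(path).expanduser().resolve())
    if resolved not in sys.path:
        sys.path.append(resolved)
    return resolved


# Hypothetical usage: "." stands in for the project root
root = add_project_root(".")
```

The more durable fix is usually to install the project as a package (for example with `pip install -e .`), which assumes the project ships packaging metadata.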

### Reading Data
Specify the path to your logs and pass it as `data_path` when creating your `Zeek` object.

To customize how the data is read, see the documentation for the `read` method and set its parameters as needed.

In [8]:
base_path = "/home/zeeklogs"
logs = Zeek(data_path=base_path)
logs.n_connections()

530173

### Filtering Data
We can use built-in functionality to narrow a large dataset down to just the observations we want.
For additional functionality, view the documentation by hovering over a method in VSCode or by opening datasets.py.

In [3]:
# Remove connections that never appear in any other log file
logs.remove_empty_connections()
logs.n_connections()

322621
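To make the filter concrete, here is a toy version of the idea on plain Python data. It assumes logs are linked by Zeek's `uid` field and that a connection counts as empty when its `uid` shows up in no log other than `conn`; the real `remove_empty_connections` may differ in detail. `remove_empty` and the sample records are hypothetical.

```python
def remove_empty(logs):
    """Keep only conn rows whose uid appears in at least one other log."""
    seen_elsewhere = {
        row["uid"]
        for name, rows in logs.items()
        if name != "conn"
        for row in rows
    }
    filtered = dict(logs)
    filtered["conn"] = [row for row in logs["conn"] if row["uid"] in seen_elsewhere]
    return filtered


toy_logs = {
    "conn": [{"uid": "C1"}, {"uid": "C2"}, {"uid": "C3"}],
    "http": [{"uid": "C1"}, {"uid": "C1"}],
    "dns": [{"uid": "C3"}],
}
kept = remove_empty(toy_logs)
print([row["uid"] for row in kept["conn"]])  # ['C1', 'C3']
```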

In [4]:
# Keep the first n connections, or n random ones if shuffle=True
logs.keep_n_connections(n=100_000, shuffle=True)
logs.n_connections()

100000
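The difference between the two modes can be sketched on a plain list of connection IDs. This is an illustration only: `keep_n` and its `seed` parameter are hypothetical and are not taken from the `Zeek` class.

```python
import random


def keep_n(uids, n, shuffle=False, seed=None):
    """Return the first n uids, or n randomly chosen uids when shuffle is True."""
    n = min(int(n), len(uids))  # tolerate floats such as 1e5 and oversized n
    if shuffle:
        return random.Random(seed).sample(uids, n)
    return list(uids[:n])


uids = [f"C{i}" for i in range(10)]
first_three = keep_n(uids, 3)
random_three = keep_n(uids, 3, shuffle=True, seed=0)
print(first_three)  # ['C0', 'C1', 'C2']
```

Fixing a seed, as the sketch does, keeps a shuffled subset reproducible between runs.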

In [None]:
# Remove connections by column value: here, anything originating from or sent to 8.8.8.8
logs.remove(col="id.orig_h", values=['8.8.8.8'])
logs.remove(col="id.resp_h", values=['8.8.8.8'])
logs.n_connections()
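Conceptually, each `remove` call drops every row whose value in the given column is in the supplied list. A minimal sketch on toy records, assuming that reading is correct; `remove_rows` is a hypothetical stand-in, not the project's implementation.

```python
def remove_rows(rows, col, values):
    """Drop every row whose value in `col` is one of `values`."""
    blocked = set(values)
    return [row for row in rows if row.get(col) not in blocked]


conn = [
    {"uid": "C1", "id.orig_h": "192.168.70.203", "id.resp_h": "8.8.8.8"},
    {"uid": "C2", "id.orig_h": "8.8.8.8", "id.resp_h": "192.168.70.130"},
    {"uid": "C3", "id.orig_h": "192.168.70.222", "id.resp_h": "192.168.70.130"},
]
# Apply the filter to both endpoints, as in the cell above
for col in ("id.orig_h", "id.resp_h"):
    conn = remove_rows(conn, col, ["8.8.8.8"])
print([row["uid"] for row in conn])  # ['C3']
```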

### Viewing Data

In [6]:
logs.sort()
logs.get('conn').head()

Unnamed: 0,ts,uid,id.orig_h,id.orig_p,id.resp_h,id.resp_p,proto,service,duration,orig_bytes,...,conn_state,local_orig,local_resp,missed_bytes,history,orig_pkts,orig_ip_bytes,resp_pkts,resp_ip_bytes,tunnel_parents
361168,2022-07-14 19:19:53.920833111,Czh3vq2mAUUjhH8rXf,192.168.70.203,53856,23.96.94.139,443,tcp,,NaT,,...,OTH,,,0,C,0,0,0,0,
425922,2022-07-14 19:20:04.698591948,CsAxj51ItCrQTCpcu,192.168.70.203,51734,13.88.31.235,443,tcp,,0 days 02:26:45.363254,0.0,...,OTH,,,0,CdCCa,0,0,234,30514,
362850,2022-07-14 19:20:12.146811008,C0rtsK1EyEKaFOVrBb,192.168.70.222,37754,192.168.70.130,88,tcp,,0 days 00:00:00.009936,1404.0,...,OTH,,,0,^cCD,1,1444,0,0,
361181,2022-07-14 19:20:12.183937073,Ci2nby33LNuetdR31a,192.168.70.222,43676,192.168.70.130,445,tcp,,NaT,,...,OTH,,,0,C,0,0,0,0,
361182,2022-07-14 19:20:12.354305983,CnGo9j2UnnqItVRKlh,192.168.70.222,37762,192.168.70.130,88,tcp,,NaT,,...,OTH,,,0,C,0,0,0,0,


In [7]:
logs.get('http').head()

Unnamed: 0,ts,uid,id.orig_h,id.orig_p,id.resp_h,id.resp_p,trans_depth,method,host,uri,...,tags,username,password,proxied,orig_fuids,orig_filenames,orig_mime_types,resp_fuids,resp_filenames,resp_mime_types
0,2022-07-27 15:53:42.778825998,CMnAy51RtcL6pWtfjh,192.168.70.232,48120,35.232.111.17,80,1,GET,connectivity-check.ubuntu.com,/,...,(empty),,,,,,,,,
1,2022-07-27 15:55:27.064935923,CTCA6k27HXksnP0iXf,192.168.70.160,61925,23.192.208.58,80,1,HEAD,officecdn.microsoft.com,/pr/492350f6-3a01-4f97-b9c0-c7c6ddf67d60/Offic...,...,(empty),,,,,,,,,
3,2022-07-27 15:55:28.858403921,CTCA6k27HXksnP0iXf,192.168.70.160,61925,23.192.208.58,80,2,HEAD,officecdn.microsoft.com,/pr/492350f6-3a01-4f97-b9c0-c7c6ddf67d60/Offic...,...,(empty),,,,,,,,,
5,2022-07-27 15:55:28.939876080,CTCA6k27HXksnP0iXf,192.168.70.160,61925,23.192.208.58,80,3,HEAD,officecdn.microsoft.com,/pr/492350f6-3a01-4f97-b9c0-c7c6ddf67d60/Offic...,...,(empty),,,,,,,,,
7,2022-07-27 15:55:29.007740974,CTCA6k27HXksnP0iXf,192.168.70.160,61925,23.192.208.58,80,4,GET,officecdn.microsoft.com,/pr/492350f6-3a01-4f97-b9c0-c7c6ddf67d60/Offic...,...,(empty),,,,,,,,,


### Getting Ready for a Model

This data comes from HPC. The attacker IPs passed as `test_ips` will all end up in the test set.

In [8]:
attacker_ips = ["192.168.70.137", "64.227.69.82", "104.248.193.232"]
train, test = logs.train_test_split(test_ips=attacker_ips, ratio=0.1, shuffle=True)

In [9]:
print(train.n_connections())
print(test.n_connections())

68419
7602
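One plausible reading of this split, sketched on toy records: every connection that touches a test IP is forced into the test set, and random remaining connections top it up until it holds `ratio` of all rows. The counts printed above (7,602 of 76,021 connections) are consistent with the ratio applying to the whole dataset, but that is an inference rather than documented behavior. `split_by_ip` and its `seed` parameter are hypothetical.

```python
import random


def split_by_ip(rows, test_ips, ratio=0.1, seed=None):
    """Split rows so every row touching a test IP lands in the test set."""
    test_ips = set(test_ips)
    forced, rest = [], []
    for row in rows:
        touches = row["id.orig_h"] in test_ips or row["id.resp_h"] in test_ips
        (forced if touches else rest).append(row)

    # Top up the test set with random remaining rows until it holds `ratio` of all rows
    target = int(len(rows) * ratio)
    extra = max(0, target - len(forced))
    rng = random.Random(seed)
    rng.shuffle(rest)
    return rest[extra:], forced + rest[:extra]


rows = [
    {"uid": f"C{i}", "id.orig_h": f"10.0.0.{i}", "id.resp_h": "10.0.1.1"}
    for i in range(20)
]
rows[3]["id.orig_h"] = "64.227.69.82"  # one attacker connection
train_rows, test_rows = split_by_ip(rows, ["64.227.69.82"], ratio=0.25, seed=0)
print(len(train_rows), len(test_rows))  # 15 5
```

Keeping attacker traffic out of the training set means the model is evaluated on hosts it has never seen.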


In [None]:
# Preprocess the data: fit the cleaner on the training set, then apply the same transforms to the test set.
# You can customize this extensively via configs/cleaner_assignments.py and data/log_cleaners.py
cleaner = ZeekCleaner()
train_processed = cleaner.fit_transform(train)
test_processed = cleaner.transform(test)

Notice that in this instance, some transforms were not learned from the training data. As the warning suggests,
this means the log file never appeared when `.fit` was called. Presumably, it is a log file
that is found only in the test set. You can handle this however you'd like.
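The situation behind that warning can be reproduced with a toy cleaner that learns one statistic per log. This is a sketch of the general fit/transform pattern, not of `ZeekCleaner` itself: `ToyCleaner`, its mean-centering transform, and its pass-through policy for unseen logs are all assumptions made for illustration.

```python
import warnings


class ToyCleaner:
    """Learns a per-log mean on fit and subtracts it on transform."""

    def __init__(self):
        self.means = {}

    def fit(self, logs):
        for name, values in logs.items():
            self.means[name] = sum(values) / len(values)
        return self

    def transform(self, logs):
        out = {}
        for name, values in logs.items():
            if name not in self.means:
                # Nothing was learned for this log, so pass it through unchanged
                warnings.warn(f"no transform learned for log '{name}'")
                out[name] = list(values)
                continue
            out[name] = [v - self.means[name] for v in values]
        return out

    def fit_transform(self, logs):
        return self.fit(logs).transform(logs)


toy_cleaner = ToyCleaner()
train_out = toy_cleaner.fit_transform({"conn": [1.0, 3.0]})
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    test_out = toy_cleaner.transform({"conn": [2.0], "ssh": [5.0]})
print(train_out, test_out, len(caught))
```

Here the `ssh` log exists only at transform time, so it triggers the warning. Other reasonable policies are to drop such logs or to raise an error.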

### PyTorch Ready

Now we can make our logs compatible with model input. There is a prebuilt way to do this, though for more specific
needs you may need to build your own dataloader using calls to `data.get(log)`.

In [11]:
test_loader = test_processed.to_torch_loader('conn', batch_size=1024, shuffle=False, num_workers=4)
for X in test_loader:
    print(X.shape)

torch.Size([1024, 47])
torch.Size([1024, 47])
torch.Size([1024, 47])
torch.Size([1024, 47])
torch.Size([1024, 47])
torch.Size([1024, 47])
torch.Size([1024, 47])
torch.Size([434, 47])
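The batch shapes above follow from the test set size (7,602 connections) and the batch size of 1,024, assuming the loader keeps the final partial batch, which is the PyTorch `DataLoader` default (`drop_last=False`). The arithmetic can be checked without PyTorch; `batch_sizes` is a hypothetical helper.

```python
import math


def batch_sizes(n_rows, batch_size):
    """Return the size of each batch when the last partial batch is kept."""
    n_batches = math.ceil(n_rows / batch_size)
    return [min(batch_size, n_rows - i * batch_size) for i in range(n_batches)]


sizes = batch_sizes(7602, 1024)
print(len(sizes), sizes[-1])  # 8 434
```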


### Saving and Reusing Work

In [12]:
train.save("/projects/data/example.pkl")
reloaded = Zeek.load("/projects/data/example.pkl")  # load is a static method, so call it on the class
train.n_connections() == reloaded.n_connections()

True
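The `.pkl` extension suggests that `save` and `load` use Python's `pickle` module, though that is an assumption about the implementation. A minimal round trip with `pickle` on a toy object looks like this; `save_obj` and `load_obj` are hypothetical helpers.

```python
import os
import pickle
import tempfile


def save_obj(obj, path):
    """Serialize obj to path with pickle."""
    with open(path, "wb") as fh:
        pickle.dump(obj, fh, protocol=pickle.HIGHEST_PROTOCOL)


def load_obj(path):
    """Load a pickled object; only use this on files you trust."""
    with open(path, "rb") as fh:
        return pickle.load(fh)


toy = {"conn": [{"uid": "C1"}, {"uid": "C2"}]}
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.pkl")
    save_obj(toy, path)
    restored = load_obj(path)
print(restored == toy)  # True
```

Unpickling can execute arbitrary code, so saved files of this kind should only be loaded from trusted sources.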