This application collects local port and network packet time series data, stores the data on the client, and uses scheduled cron jobs to detect anomalies in port occupancy while the host machine is active.
The network traffic collector scans for port occupancy and streams network packets on the `en0` interface. Jobs run every minute, each storing port usage and network packets in an in-process database. For each round, two threads are spun off the main thread, each calling its respective service to scan ports or stream network packets into the database.
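The per-round threading described above can be sketched roughly as follows. Note that `scan_ports`, `sniff_packets`, and the shared-results bookkeeping are hypothetical stand-ins for illustration, not the repository's actual service API:

```python
import threading

# Hypothetical stand-ins for the real port-scanning and packet-sniffing services.
def scan_ports(results):
    results["ports"] = [22, 80, 443]  # ports found occupied during this round

def sniff_packets(results):
    results["packets"] = [{"protocol": "TCP", "dest_port": 443}]  # captured packets

def run_round():
    """Spin one thread per service off the main thread and wait for both."""
    results = {}
    threads = [
        threading.Thread(target=scan_ports, args=(results,)),
        threading.Thread(target=sniff_packets, args=(results,)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()  # block until both services finish before storing the round
    return results

round_data = run_round()
```

In the real application each round's results would be written to the in-process database rather than returned.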
The anomaly detector runs every hour and predicts anomalies within batched time series data using a convolutional neural network (CNN), a recurrent neural network (RNN), and an autoencoder (AE). The CNN extracts local features from port usage. The RNN, a Long Short-Term Memory (LSTM) network, extracts temporal features from port usage. Finally, the AE compresses and decompresses the data in order to generalize features and detect anomalies in unlabeled data. For each batch, the `keras.engine.Sequential` model stored on the client predicts anomalies using the reconstruction threshold from the undercomplete autoencoder and then refits on the data after evaluation, preserving the temporality of the time series and maintaining the model weights.
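The reconstruction-threshold idea can be illustrated with a minimal, library-free sketch. The threshold value and the toy "reconstruction" function here are illustrative assumptions, not the trained autoencoder:

```python
def reconstruction_error(original, reconstructed):
    """Mean squared error between an input window and its reconstruction."""
    return sum((o - r) ** 2 for o, r in zip(original, reconstructed)) / len(original)

def flag_anomalies(batches, reconstruct, threshold):
    """Flag each batch whose reconstruction error exceeds the threshold."""
    return [reconstruction_error(b, reconstruct(b)) > threshold for b in batches]

# Toy reconstruction: an AE trained only on normal traffic reproduces the
# normal pattern closely and reconstructs anomalous input poorly.
normal = [0.1, 0.2, 0.1]
anomalous = [0.9, 0.9, 0.9]
reconstruct = lambda b: [0.1, 0.2, 0.1]  # the AE only learned the normal pattern
flags = flag_anomalies([normal, anomalous], reconstruct, threshold=0.05)
# flags -> [False, True]: only the anomalous batch exceeds the threshold
```

A batch that the undercomplete autoencoder cannot reconstruct well is, by this logic, labeled anomalous.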
The neural network model architecture uses the Keras high-level API to construct layers. The first layer is an input layer, with its shape defined by the port input space (1 x 65535), followed by a 1-dimensional convolutional layer. A 1-dimensional max pooling layer then compresses the local features into a lower-dimensional space. Dropout is added to further help the model generalize local features. An LSTM layer then allows the model to learn temporal features in the data. Lastly, two dense layers compress and decompress the data, implementing an undercomplete autoencoder, to further generalize and label anomalous behavior.
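Based on the description above, the layer stack might be sketched with the Keras Sequential API along these lines. Layer sizes (filter count, kernel size, dropout rate, latent dimension) are illustrative assumptions, not the repository's tuned values:

```python
from tensorflow import keras
from tensorflow.keras import layers

NUM_PORTS = 65535  # port input space (1 x 65535)

model = keras.Sequential([
    keras.Input(shape=(NUM_PORTS, 1)),                    # one channel of port occupancy
    layers.Conv1D(32, kernel_size=8, activation="relu"),  # extract local features
    layers.MaxPooling1D(pool_size=4),                     # compress to a lower-dimensional space
    layers.Dropout(0.2),                                  # help the model generalize
    layers.LSTM(64),                                      # learn temporal features
    layers.Dense(16, activation="relu"),                  # encoder: compress (undercomplete)
    layers.Dense(NUM_PORTS, activation="sigmoid"),        # decoder: reconstruct the input
])
model.compile(optimizer="adam", loss="mse")
```

The final two dense layers form the undercomplete autoencoder: the 16-unit bottleneck forces the model to learn a compressed representation, and the reconstruction loss (MSE) drives the anomaly threshold.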
Lastly, stale data is deleted from the database in order to maintain a small footprint on the client.
The database consists of four tables: rounds, ports, packets, and rounds_ports. This allows port status data to belong to a particular round for a client, supporting federation. Network packet data also belongs to a particular port and a particular round, associating packets with specific ports and supporting federation.
- Rounds ( id, start_time )
- Ports ( id, value )
- RoundsPorts ( id, round_id, port_id, timestamp )
- Packets ( id, timestamp, protocols, qry_name, resp_name, port_id, dest_port, payload, round_id )
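As a rough sketch, the four tables above can be created and joined through Python's built-in `sqlite3` module. Column types here are assumptions inferred from the field names; the repository's `init.sql` is the authoritative schema:

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # the real application uses ~/ports.db
conn.executescript("""
CREATE TABLE rounds (id INTEGER PRIMARY KEY, start_time TEXT);
CREATE TABLE ports (id INTEGER PRIMARY KEY, value INTEGER);
CREATE TABLE rounds_ports (
    id INTEGER PRIMARY KEY,
    round_id INTEGER REFERENCES rounds(id),
    port_id INTEGER REFERENCES ports(id),
    timestamp TEXT
);
CREATE TABLE packets (
    id INTEGER PRIMARY KEY,
    timestamp TEXT,
    protocols TEXT,
    qry_name TEXT,
    resp_name TEXT,
    port_id INTEGER REFERENCES ports(id),
    dest_port INTEGER,
    payload BLOB,
    round_id INTEGER REFERENCES rounds(id)
);
""")

# The join table lets one port observation belong to a particular round.
conn.execute("INSERT INTO rounds (start_time) VALUES ('2023-01-01T00:00:00')")
conn.execute("INSERT INTO ports (value) VALUES (443)")
conn.execute(
    "INSERT INTO rounds_ports (round_id, port_id, timestamp) "
    "VALUES (1, 1, '2023-01-01T00:00:30')"
)
row = conn.execute("""
    SELECT r.start_time, p.value FROM rounds_ports rp
    JOIN rounds r ON rp.round_id = r.id
    JOIN ports p ON rp.port_id = p.id
""").fetchone()
```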
Clone the repository, navigate to the directory of choice, and install the dependencies with:
pip3 install -r requirements.txt
To initialize the SQLite3 database, execute the following:
sqlite3 -init init.sql ~/ports.db ""
Unfortunately, SQLite does not support native variable syntax. Therefore, seeding the database is done at the application layer.
python3 ~/<path_to_repository>/app/db/reset.py
Setup a cronjob with:
crontab -e
and use the following syntax to schedule the jobs. Note that cron requires absolute paths to locate the files.
*/1 * * * * python3 app/port_collector.py -d ports.db -r 60 -p
0 * * * * python3 app/port_detector.py -d ports.db -r 60
0 0 * * * python3 app/port_cleanup.py -d ports.db
Ensure all files are executable (ex. `chmod +x app/port_collector.py`).
We must also give cron Full Disk Access: osxdaily.com/2020/04/27/fix-cron-permissions-macos-full-disk-access/
- The cron job will initialize the database at the root. See dependencies for accessing the database console.
- You will likely need to configure cron to use the correct installation of Python. You can swap out `python3` with the result of `which python3`.
- The logs from the cron execution can be saved to a log file with the following syntax: `python3 app/port_collector.py >> collector.log 2>&1`. The files `collector.log` and `detector.log` are ignored by Git for this purpose.
- Configuring the PATH environment variable may be required so that cron can use the latest version of Python. The Live Capture functionality from `pyshark` requires a recent version of Python and pip.
- The application will execute without Wireshark, but the Sniffer thread will not stream network packets to the database. To allow execution by cron, configure Wireshark by modifying the `dumpcap` path in `config.ini` for `pyshark` within your current Python dependencies (ex. `/usr/local/lib/python3.9/site-packages/pyshark`). The dumpcap path should point to your TShark installation (ex. `dumpcap_path = /Applications/Wireshark.app/Contents/MacOS/tshark`).
Build the Docker image with:
docker compose up --build --detach --force-recreate
We can then attach to the container with:
docker exec -it <container_id> /bin/bash
- Mapping the network from the host machine is not supported on Windows.
- Wireshark must be installed on the host machine. Only Linux allows commands from the Dockerfile to set up Wireshark on the host, so this will not be supported.
The Python library `pyshark` is used to sniff packets with LiveCapture. Captured packets include the TCP, DNS, QUIC, and UDP protocols on the `en0` interface.
The Python library `threading` manages concurrency within the application. Two threads are initialized to scan ports and sniff packets continuously.
The Python library `socket` is used to bind to local ports and determine port occupancy. `socket` is part of the Python standard library.
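A minimal sketch of how `socket` can determine whether a port is occupied: attempt to bind to it, and treat a bind failure as a sign that another process holds the port. This is an illustration of the technique, not the collector's exact logic:

```python
import socket

def port_in_use(port, host="127.0.0.1"):
    """Return True if binding to the port fails, i.e. something occupies it."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        try:
            s.bind((host, port))
            return False  # bind succeeded: the port was free
        except OSError:
            return True   # bind failed: the port is occupied

# Occupy an ephemeral port with a listening socket, then probe it.
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))  # port 0 asks the OS for any free port
listener.listen(1)
occupied_port = listener.getsockname()[1]
busy = port_in_use(occupied_port)        # True while the listener holds it
listener.close()
free_again = port_in_use(occupied_port)  # False once the listener is closed
```

Iterating this check over the 65535-port space yields the per-round occupancy snapshot stored in the database.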
SQLite3 is a lightweight, in-process database used to store data on the client. The `sqlite3` module is included in the Python standard library.
To access the SQLite console, simply connect to the database with:
sqlite3 ports.db
- `.tables` - List tables.
- `.mode list` - Set display mode.
The Python library `numpy` is used to preprocess the stored data into an acceptable data structure for the neural network model.

The Python library `tensorflow` is required to construct the neural network model.

The Python library `scikit-learn` allows us to split the data into training and testing datasets.

The Python library `matplotlib` is a plotting library that allows us to visualize our model and data.

The Python library `python-dotenv` loads environment variables necessary for development.

The `argparse` module from the Python standard library is used to supply values from the cron scheduler.
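The flags shown in the cron entries above (`-d`, `-r`, `-p`) might be parsed along these lines. The long option names, help strings, and defaults here are inferred for illustration, not copied from the repository:

```python
import argparse

def build_parser():
    parser = argparse.ArgumentParser(description="Collect port and packet data.")
    parser.add_argument("-d", "--database", default="ports.db",
                        help="path to the SQLite database file")
    parser.add_argument("-r", "--runtime", type=int, default=60,
                        help="seconds the job runs before exiting")
    parser.add_argument("-p", "--packets", action="store_true",
                        help="also stream network packets (requires TShark)")
    return parser

# Parse the same arguments the collector's cron entry supplies.
args = build_parser().parse_args(["-d", "ports.db", "-r", "60", "-p"])
```

Passing an explicit argument list to `parse_args` mirrors what cron supplies on the command line via `sys.argv`.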
Develop a server application to support federation and manage client participation in neural network training and evaluation, aggregating weights from the client models at time t + 1 using Stochastic Parallel Gradient Descent (SPGD).
Tune model hyperparameters to optimize our model performance.
Use and secure credentials for the client application to write to the database on the host machine.
Use the data to help predict anomalies!
One interesting project would be to train a GAN to fake port activity. This could be useful in testing load capacity in cloud environments.