This is the source code of our paper: ZEST: Attention-based Zero-Shot Learning for Unseen IoT Device Classification. https://arxiv.org/abs/2310.08036
On the UNSW website, IoT traffic data is available in two formats:
- CSV data: 2 weeks
- PCAP data: 2 months
Since we want lots of CSV data, we convert the large PCAP dataset to CSV. For this purpose, we use the bash script here.
- Download PCAP data from UNSW website, put data in
../../pcap_unsw
folder - Create the destination folder:
../../csv_from_pcap/
- In
unsw_pcap_to_csv.sh
, setpcapFiles
array to the files you want to convert - Run the script:
./unsw_pcap_to_csv.sh
Note: tshark
has to be in the path.
The converter does the following filtering and extraction:
- Only keep UDP and TCP packets
- ICMP failure packets with 2 protocols are ignored
- Fragmented an reassembled packets are ignored for the moment, they lack a port in PCAP
Conversion logic:
CSV file | PCAP file |
---|---|
Packet_ID | frame.number |
TIME | frame.time_epoch |
Size | frame.len |
MAC_source | eth.src |
MAC_destination | eth.dst |
IP_source | ip.src |
IP_destination | ip.dst |
Layer_4_protocol | ip.proto |
Port_source | tcp.srcport or udp.srcport |
Port_destination | tcp.dstport or udp.dstport |
Extract traffic features from raw packets.
- Input: raw packet data in CSV format
- Output: packet sequences per device in CSV format. For each packet, we extract the following features:
- Packet size
- Time since last packet
- Direction (in/out)
- Protocol (transport layer - UDP/TCP)
- Source port (-1 if random, 1 if 1...1024)
- Destination port (-1 if random, 1 if 1...1024)
- Source IP address (internal/external)
- Destination IP address (internal/external)
Run the jupyter notebook from_csv_to_sequence.ipynb
. In particular, the function create_features_from_csv()
generates the sequences.
Generate the dataset for window-based device classification.
To generate the dataset, run the jupyter notebook csv_to_npz_windows.ipynb
, the dataset generation happens in the function csv_sequences_to_npz_windows()
.
Make sure you have previously generated the device sequences.
The dataset is stored as *.npz
file: a compressed numpy array.
Inside the npz file, you can find:
x
: the windows that serve as input to a ML modely
: the labels that serve as ground truth
The dimensions of the numpy arrays are as follows:
x
: WINDOWS x WINDOW_LENGTH x FEATURES. This is a long list of windows, where each window consists of consecutive packets for one IoT device.y
: WINDOWS x 1 x CLASSES. This is a long list of labels, in one-hot encoding format.
Train a Bi-LSTM on the packet windows. Learn to classify windows into IoT classes.
- Make sure you have previously generated the window dataset
- The training is implemented in
train_from_npz.ipynb
- The model checkpoints can be stored in a configurable folder
- In Google Colab, make sure you activate a GPU and high RAM, using the "Runtime" -> "Change runtime type" menu
This is the main code for our ZEST framework
- The data is splited into seen devices and unseen devices.
- Train a supervised self-attention based model for only seen devices.
- Use the pretrained model to extract attributes for both seen and unseen devices.
- Train a CVAE model to learn the mapping between attributes and data space.
- Use the decoder of CVAE model to generate Pesudo data for unseen devices.
- Train a supervised classifier to do the classification.
There are 4 baselines, all of them are based on the Bi-LSTM model we mentioned above. But VAE-K use VAE to reduce the dimension They utilize different solutions to do the classification.
-It employs VAE to extract features and applys k-means to perform clustering subsequently.
-You can refer to: Zhang, S., Wang, Z., Yang, J., Bai, D., Li, F., Li, Z., ... & Liu, X. (2021, June). Unsupervised IoT Fingerprinting Method via Variational Auto-encoder and K-means. In ICC 2021-IEEE International Conference on Communications (pp. 1-6). IEEE.
-It ultilizes BiLSTM to extract features, and uses k-means to do clustering
-It ultilizes BiLSTM to extract features, and uses seeded k-means to do clustering
-It ultilizes BiLSTM to extract features, and uses seeded k-means to do clustering. Then use Random forest to enhance the performance.
-You can refer to : Thangavelu, V., Divakaran, D. M., Sairam, R., Bhunia, S. S., & Gurusamy, M. (2018). DEFT: A distributed IoT fingerprinting technique. IEEE Internet of Things Journal, 6(1), 940-952.