An in-depth approach to detecting significant real-time shifts in network performance that indicate network degradation. Building on the data generation process behind DANE and Viasat's network statistics, we build a classification system that determines whether there are substantial changes to packet loss rate and degree of latency. Please visit our webpage for a more comprehensive view of this project.
- Generate data using our modified fork of DANE. `make`, `docker.io`, and `docker-compose` are required on your machine to run modified_dane properly. A recursive flag is required to properly install modified_dane:

  ```
  git clone https://github.com/jenna-my/modified_dane --recursive
  ```
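  If the repository was cloned without the flag, the submodules can usually be pulled in afterwards; this is generic git behavior rather than a project-specific step:

  ```
  cd modified_dane
  git submodule update --init --recursive
  ```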
- Clone this branch of the repository:

  ```
  git clone https://github.com/LauraDiao/Anomaly_Detectives
  ```
- Place all raw DANE csv files within the `data/raw` directory of this repository. If the directory has not been created, run `run.py` once to generate all relevant directories.
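For reference, a minimal sketch of the kind of directory bootstrapping a first `run.py` call performs; only `data/raw` is named above, so the other paths are illustrative assumptions.

```python
import os

# data/raw is documented above; the remaining paths are illustrative assumptions.
for d in ["data/raw", "data/temp", "data/out"]:
    os.makedirs(d, exist_ok=True)  # no-op if the directory already exists
```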
All code can be executed through `run.py` according to the targets specified below; each target implements a core feature of the repository.

Example call: `python run.py data inference`
- `data`: generates features from seen and unseen data
- `eda`: generates the visualizations used in exploring which features to use for the model
- `train`: prints results of model performance tested on training ("seen") data with four models of varying architectures: decision tree, random forest, extra trees, and gradient boosting (see the sketch after this list)
- `inference`: (deprecated) prints results of model performance tested on testing ("unseen") data with the same models
- `clean`: removes files generated by targets in commonly used output directories
- `test`: verifies target functionality by running the targets `data`, `eda`, `train`, and `inference` with a subset of the original model training data
- `all`: runs all targets except `test`
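As a rough illustration of what the `train` target compares, here is a minimal sketch using scikit-learn's implementations of the four architectures. The feature file path, the `label` column, and the scoring are illustrative assumptions rather than the repository's actual pipeline; the `test_size` and `n_jobs` values come from the configuration documented below.

```python
import pandas as pd
from sklearn.ensemble import (ExtraTreesClassifier, GradientBoostingClassifier,
                              RandomForestClassifier)
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Hypothetical feature file and label column; the real pipeline lives in run.py.
df = pd.read_csv("data/temp/combined_t_latency.csv")
X, y = df.drop(columns=["label"]), df["label"]
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.005)

models = {
    "decision tree": DecisionTreeClassifier(),
    "random forest": RandomForestClassifier(n_jobs=-1),
    "extra trees": ExtraTreesClassifier(n_jobs=-1),
    "gradient boosting": GradientBoostingClassifier(),
}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    print(name, model.score(X_val, y_val))  # accuracy on the held-out split
```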
Our modified version of DANE creates csv files with a naming scheme in the following format:

```
datevalue_latency-loss-deterministic-laterlatency-laterloss-iperf.csv
```

e.g. `20220117T015822_200-100-true-200-10000-iperf.csv`

This format is crucial for the model to train on the proper labels.
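A minimal sketch of how such a filename could be split back into its label fields, assuming the hyphen-delimited layout above; the field names mirror the naming scheme rather than the repository's actual parsing code.

```python
def parse_dane_filename(name: str) -> dict:
    # e.g. "20220117T015822_200-100-true-200-10000-iperf.csv"
    datevalue, rest = name.split("_", 1)
    latency, loss, deterministic, later_latency, later_loss, _tool = (
        rest.removesuffix(".csv").split("-")
    )
    return {
        "datevalue": datevalue,
        "latency": int(latency),
        "loss": int(loss),
        "deterministic": deterministic == "true",
        "later_latency": int(later_latency),
        "later_loss": int(later_loss),
    }

print(parse_dane_filename("20220117T015822_200-100-true-200-10000-iperf.csv"))
```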
- `lst`: [1, 2] - list of runs to compare side by side, made by `plottogether()` inside of `eda.py`
- `filen1`: "combined_subset_latency.csv" - subset of the processed data used to make the EDA
- `filen2`: "combined_t_latency.csv" - features generated from processed data
- `filen3`: "combined_all_latency.csv" - all processed data
- `n_jobs`: -1 - number of cores used for model training
- `train_window`: 20 - number of seconds the model aggregates over for its training window size
- `pca_components`: 4 - number of components for PCA; we determined 4 was optimal for our model
- `test_size`: 0.005 - model validation set size (train/test split)
- `threshold`: -0.15 - threshold for loss anomaly detection
- `emplosswindow`: 25 - rolling-window aggregation of empirical loss, set at 25 seconds
- `pct_change_window`: 2 - how many seconds the anomaly detection system looks back when determining change (see the sketch after this list)
- `verbose`: "True" - whether terminal output should be verbose, for debugging purposes