# Hidden in Time, Revealed in Frequency: Spectral Features and Multiresolution Analysis for Encrypted Internet Traffic Classification

## Authors
- Nathan Dillbary
- Amit Dvir
- Chen Hajaj
- Ran Dubin
- Roi Yozevitch

## Summary
This Jupyter notebook accompanies the paper "Hidden in Time, Revealed in Frequency: Spectral Features and Multiresolution Analysis for Encrypted Internet Traffic Classification," presented at the IEEE CCNC 2024 conference. The paper addresses the challenge of classifying encrypted internet traffic and introduces two novel methods: STFT-TC (Short-Time Fourier Transform-Based Traffic Classifier) and DWT-TC (Discrete Wavelet Transform-Based Traffic Classifier). Both methods apply time-frequency analysis to traffic time series to improve classification accuracy and insight over existing approaches.
The notebook contains the STFT and DWT feature-extraction functions related to the paper.



---

## License
MIT License

Copyright (c) 2024 Nathan Dillbary, Amit Dvir, Chen Hajaj, Ran Dubin, Roi Yozevitch

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

In [None]:
import numpy as np
import pandas as pd
from typing import List, Optional
import scipy.signal
import librosa
import pywt

In [None]:
def extract_stft_features(data: List[int], features_to_extract: Optional[List[str]] = None) -> np.ndarray:
    """
    This function performs Short-Time Fourier Transform (STFT) on the given time-series data and extracts a comprehensive
    set of spectral and statistical features from the magnitude spectrogram.
    STFT converts the time-series into a 2D time-frequency domain representation, allowing for the analysis of
    frequency components' evolution over time.

    Args:
        data (List[int]): The input data, a time-series of integers, typically representing signals like audio or network traffic.
        features_to_extract (Optional[List[str]]): A list of specific feature names to extract from the spectrogram.
          If None or left unspecified, a comprehensive default set including mean, standard deviation, spectral centroid,
          spectral bandwidth, spectral contrast, spectral flatness, spectral rolloff, chroma STFT, and MFCC is used.

    Returns:
        np.ndarray: A 1D numpy array of concatenated features, resulting in a single feature vector representing the
        original input data's spectral characteristics.

    Detailed Feature Descriptions:
    - Mean & Standard Deviation: The average magnitude of each frequency bin over time, and its variability.
    - Spectral Centroid & Bandwidth: Indicate the 'center of mass' of the power spectrum and the width of frequency components.
    - Spectral Contrast: Reflects the level of contrast between the most and least intense frequencies.
    - Spectral Flatness: Measures how noise-like a signal is, by comparing the geometric mean to the arithmetic mean of the power spectrum.
    - Spectral Rolloff: Represents the frequency below which a specified percentage of the total spectral energy lies.
    - Chroma STFT: Projects the spectrum onto 12 bins representing the 12 distinct semitones (or chroma) of the musical octave.
    - MFCC: Mel-frequency cepstral coefficients, providing a representation of the power spectrum's shape.

    Each feature captures a different characteristic of the signal's frequency content, making this function versatile for
    various signal processing and machine learning tasks in time-frequency analysis domains.
    """
    # Short-Time Fourier Transform: Hann window, 256-sample segments, 50% overlap
    f, t, Zxx = scipy.signal.stft(data, fs=1000, window='hann', nperseg=256, noverlap=128)
    mag_spectrogram = np.abs(Zxx)
    if not features_to_extract:
        features_to_extract = ('mean', 'std', 'spectral_centroid', 'spectral_bandwidth',
                               'spectral_contrast', 'spectral_flatness','spectral_rolloff', 'chroma_stft', 'mfcc')
    feature_map = {
        'mean': lambda: np.mean(mag_spectrogram, axis=1),
        'std': lambda: np.std(mag_spectrogram, axis=1),
        'spectral_centroid': lambda: librosa.feature.spectral_centroid(S=mag_spectrogram)[0],
        'spectral_bandwidth': lambda: librosa.feature.spectral_bandwidth(S=mag_spectrogram)[0],
        'spectral_contrast': lambda: librosa.feature.spectral_contrast(S=mag_spectrogram)[0],
        'spectral_flatness': lambda: librosa.feature.spectral_flatness(S=mag_spectrogram)[0],
        'spectral_rolloff': lambda: librosa.feature.spectral_rolloff(S=mag_spectrogram, sr=1000)[0],
        'chroma_stft': lambda: librosa.feature.chroma_stft(S=mag_spectrogram, sr=1000)[0],
        'mfcc': lambda: librosa.feature.mfcc(S=mag_spectrogram, sr=1000)[0]
    }
    computed_features = [feature_map[feat]() for feat in features_to_extract if feat in feature_map]
    feature_vec = np.concatenate(computed_features)
    return feature_vec
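
The STFT settings used above fix the size of the per-frequency features. The following is a minimal sketch, assuming a synthetic 100 Hz tone in place of real traffic data (the signal and the variable names are illustrative, not from the paper). It shows that `nperseg=256` yields 129 one-sided frequency bins, so the `mean` and `std` features contribute 129 values each.

```python
import numpy as np
import scipy.signal

fs = 1000  # sampling rate passed to scipy.signal.stft in extract_stft_features
time = np.arange(1024) / fs
signal = np.sin(2 * np.pi * 100 * time)  # synthetic 100 Hz tone, for illustration only

# Same STFT settings as extract_stft_features
f, frames, Zxx = scipy.signal.stft(signal, fs=fs, window='hann', nperseg=256, noverlap=128)
mag = np.abs(Zxx)

# Averaging over the time axis gives one value per frequency bin
per_bin_mean = mag.mean(axis=1)
peak_hz = f[np.argmax(per_bin_mean)]

print(mag.shape[0])        # number of frequency bins: nperseg // 2 + 1
print(per_bin_mean.shape)  # one mean value per frequency bin
print(peak_hz)             # bin centre closest to the 100 Hz tone
```

The bin spacing is `fs / nperseg`, about 3.9 Hz here, so the reported peak is the bin centre nearest to 100 Hz rather than exactly 100 Hz.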

In [None]:
def extract_wavelet_features(data: List[int], features_to_extract: Optional[List[str]] = None,
                             wavelet: str = 'coif6', level: int = 4) -> np.ndarray:
    """
    Extracts wavelet features from the given time-series data utilizing the Discrete Wavelet Transform (DWT) for multi-resolution analysis.
    DWT is a powerful tool for signal processing, allowing decomposition of the signal into various frequency bands with localized
    time information. This method is particularly adept for analyzing non-stationary signals found in real-world scenarios, such as
    network traffic where frequency components vary over time.

    Args:
        data (List[int]): The input data, representing a time-series of integers.
        features_to_extract (Optional[List[str]]): A list specifying which statistical features to compute from the wavelet coefficients.
          If None, a default set including mean, standard deviation, median, max, min, range, energy, crest factor, and shape factor is extracted.
        wavelet (str): Specifies the mother wavelet type to be used for the DWT. Default is 'coif6'.
        level (int): Number of decomposition levels. Default is 4. Each additional level splits the lowest-frequency band again,
          giving finer frequency resolution at low frequencies.

    Returns:
        np.ndarray: A 1D numpy array containing the concatenated features, forming a comprehensive feature vector for the input data.

    Detailed Feature Descriptions:
    - Mean, Standard Deviation, Median, Max, Min, Range: Provide basic statistical measures of the wavelet coefficients at each level of decomposition.
    - Energy: Sums up the squares of the coefficients, indicating dominant frequency components.
    - Crest Factor: Ratio of the peak absolute value to the root mean square of the coefficients, highlighting the presence of spikes or transients.
    - Shape Factor: Represents the signal's waveform shape by comparing the RMS value to the mean absolute value.

    The resulting feature vector captures both the localized frequency and time characteristics of the original signal, offering
    a nuanced understanding of its structure and behavior. This makes the function invaluable for tasks requiring detailed and
    sensitive signal characterization, such as fault diagnosis, anomaly detection, or pattern recognition in time-series data.
    """
    # Multi-level DWT: returns [cA_level, cD_level, ..., cD_1]
    coeffs = pywt.wavedec(data, wavelet, level=level)
    epsilon = 1e-10  # small constant to avoid division by zero

    feature_map = {
        'mean': lambda x: np.mean(x),
        'std': lambda x: np.std(x),
        'median': lambda x: np.median(x),
        'max': lambda x: np.max(x),
        'min': lambda x: np.min(x),
        'range': lambda x: np.max(x) - np.min(x),
        'energy': lambda x: np.sum(np.square(x)),
        'crest_factor': lambda x: np.max(np.abs(x)) / (np.sqrt(np.mean(np.square(x))) + epsilon),
        'shape_factor': lambda x: np.sqrt(np.mean(np.square(x))) / (np.mean(np.abs(x)) + epsilon),
    }

    # If no features are specified, use all features
    if not features_to_extract:
        features_to_extract = list(feature_map.keys())

    # Compute the requested features
    computed_features = []
    for coeff in coeffs:
        computed_features.extend([feature_map[feat](coeff) for feat in features_to_extract if feat in feature_map])

    # Assemble the feature vector
    feature_vec = np.array(computed_features)
    return feature_vec
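
To make the structure of the DWT feature vector concrete, here is a NumPy-only sketch. It swaps in a hand-written Haar transform (`haar_wavedec`, a hypothetical helper) for `pywt.wavedec` with `'coif6'`, so the coefficient values and band lengths will differ from what the function above produces. What carries over is the layout: a 4-level decomposition gives 5 coefficient arrays, and with the 9 default statistics that makes 45 features.

```python
import numpy as np

def haar_wavedec(signal, level):
    """Multi-level Haar DWT sketch (not pywt); returns [cA_n, cD_n, ..., cD_1]."""
    approx = np.asarray(signal, dtype=float)
    details = []
    for _ in range(level):
        if approx.size % 2:  # pad odd lengths by repeating the last sample
            approx = np.append(approx, approx[-1])
        even, odd = approx[0::2], approx[1::2]
        details.append((even - odd) / np.sqrt(2))  # high-frequency half
        approx = (even + odd) / np.sqrt(2)         # low-frequency half, split again next level
    return [approx] + details[::-1]

def crest_factor(x, eps=1e-10):
    # Peak absolute value over RMS, as in extract_wavelet_features
    return np.max(np.abs(x)) / (np.sqrt(np.mean(np.square(x))) + eps)

def shape_factor(x, eps=1e-10):
    # RMS over mean absolute value, as in extract_wavelet_features
    return np.sqrt(np.mean(np.square(x))) / (np.mean(np.abs(x)) + eps)

signal = np.zeros(64)
signal[40] = 8.0  # a single burst, for illustration only

coeffs = haar_wavedec(signal, level=4)
band_lengths = [len(c) for c in coeffs]
total_energy = sum(float(np.sum(np.square(c))) for c in coeffs)
n_features = 9 * len(coeffs)  # 9 default statistics per coefficient array

spike = np.array([0.0, 0.0, 0.0, 4.0])
print(band_lengths, total_energy, n_features)
print(crest_factor(spike), shape_factor(spike))
```

Because the Haar transform is orthonormal, the summed energy of all bands equals the energy of the input. A flat array has a crest factor and shape factor near 1, while the spiky array above scores near 2 on both.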

In [None]:
def apply_functions_for_server_direction(df, column, functions_with_args):
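    """
    Apply each given function to a DataFrame column for the server direction only.

    Each function is called as func(value, n=<first additional argument>, direction=1), and its
    result is stored in a new column named '<function name>_server'.

    :param df: DataFrame to modify (new columns are added in place).
    :param column: The column on which the functions will be applied.
    :param functions_with_args: A list of tuples. Each tuple contains a function followed by its
        additional arguments. Only the first additional argument is used; it is passed as `n`.
    :return: Modified DataFrame.
    """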
    for func, *args in functions_with_args:
        func_name = func.__name__

        df[f'{func_name}_server'] = df[column].apply(lambda x: func(x, n=args[0], direction=1))

    return df
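
A usage sketch for the calling convention may help. Everything about the data here is an assumption: the notebook does not show the column format, so packets are modelled as `(size, direction)` tuples with `direction == 1` meaning server, and `total_bytes` is a hypothetical feature function. The helper is repeated so that the example runs on its own.

```python
import pandas as pd

def apply_functions_for_server_direction(df, column, functions_with_args):
    # Same logic as the cell above, repeated to keep the example self-contained
    for func, *args in functions_with_args:
        func_name = func.__name__
        df[f'{func_name}_server'] = df[column].apply(lambda x: func(x, n=args[0], direction=1))
    return df

def total_bytes(packets, n, direction):
    """Hypothetical feature: total size of the first n packets in one direction."""
    sizes = [size for size, d in packets if d == direction]
    return sum(sizes[:n])

# Assumed layout: each row holds a list of (size, direction) tuples
flows = pd.DataFrame({
    'packets': [
        [(100, 1), (40, 0), (300, 1)],
        [(60, 0), (500, 1)],
    ]
})

flows = apply_functions_for_server_direction(flows, 'packets', [(total_bytes, 2)])
print(flows['total_bytes_server'].tolist())
```

Each tuple in `functions_with_args` needs at least one additional argument, since the helper reads `args[0]` unconditionally.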

In [None]:
def apply_feature_extraction(df: pd.DataFrame, columns: List[str]) -> pd.DataFrame:
    """
    Applies feature extraction functions to specified columns in the DataFrame. This function is designed to
    iterate through a list of columns and, for each, apply Short-Time Fourier Transform (STFT) and Discrete Wavelet
    Transform (DWT) feature extraction if the column exists in the DataFrame. The extracted features from each method
    are added as new columns to the DataFrame, prefixed with 'stft_' and 'dwt_' respectively.

    Args:
        df (pd.DataFrame): The DataFrame to which the feature extraction methods will be applied.
        columns (List[str]): A list of column names in the DataFrame for which the feature extraction will be applied.

    Returns:
        pd.DataFrame: The modified DataFrame with new columns for each of the specified columns' extracted features using STFT and DWT.

    Each of the specified columns will be checked for existence in the DataFrame, and if present, two new columns
    will be added corresponding to that column: one for the STFT features and one for the DWT features. The values
    in these new columns are lists of features extracted from the original column's data.
    """
    for col in columns:
        if col in df.columns:
            df[f'stft_{col}'] = df[col].apply(extract_stft_features).apply(list)
            df[f'dwt_{col}'] = df[col].apply(extract_wavelet_features).apply(list)
    return df
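
Since the new columns hold one list of features per row, a classifier usually needs them stacked into a 2D matrix first. Below is a minimal sketch, assuming every row produced a feature list of the same length (which holds when all input series have equal length). The column name `dwt_sizes`, the labels, and the values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Stand-in for the output of apply_feature_extraction; values are illustrative only
df = pd.DataFrame({
    'dwt_sizes': [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]],
    'label': ['video', 'chat'],
})

# Stack the per-row feature lists into an (n_samples, n_features) matrix
X = np.array(df['dwt_sizes'].tolist(), dtype=float)
y = df['label'].to_numpy()

print(X.shape)
```

If the input series differ in length, the STFT feature lists will differ in length too, and the rows would need padding or truncation before stacking.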