# DNS over HTTPS Experiments
This notebook serves to run all the experiments for our work on the CIRA-CIC-DoHBrw-2020 dataset. This notebook will train and validate 9 machine learning models and 2 deep learning models. Additionally, the experiments will determine how the performance of these models changes as we increase the size of the feature set.

In [1]:
# Access to the dataset saved on Google Drive
from google.colab import drive

# Graphing capabilities
import matplotlib.pyplot as plt

# Data management
import pandas as pd
import numpy as np

# For stratified 10-fold cross validation
from sklearn.model_selection import StratifiedKFold

# Scikit-Learn ML Models
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import AdaBoostClassifier
from sklearn.naive_bayes import GaussianNB

# Keras-TensorFlow DNN Model
from keras.models import Sequential
from keras.layers import BatchNormalization, Dense, Dropout
from keras.regularizers import l2

# Fast.ai DNN Model
from fastai.tabular import *

# Normalization
from keras.utils import normalize, to_categorical

print('Imports complete.')

Imports complete.


## Custom Objects and Functions
Below are the custom objects and functions created for these sorts of projects. We will begin by going through all of these and briefly describing what they are used for and what they do. Where appropriate, descriptions are also available through the `help` function.


### Metric Management
Below, we have two separate objects created to simplify in-code management of metrics. Previously, the code to manipulate metric data was incorporated within the model training functions and involved dictionaries within dictionaries. To resolve the readability and debugging issues that were arising, this OOP-based method should be easier to understand!  
Here, we will start by explaining some of the inner workings of the `Metric` object, then we will move on to explaining the `MetricsManager` object.

#### Metric Object
Each metric has an assigned name and fold value. Here, the `self.name` attribute refers to the name of the model the metric belongs to (I'm sorry if this is not intuitive; it felt more intuitive when I was writing the code than it does now that I'm writing the documentation!). Likewise, `self.fold_num` refers to the fold this specific metric was collected from. While this may not be critical, it's helpful to be able to distinguish between the values acquired across folds, and having the fold number readily available may help when expanding the code.  
Regardless of that stuff, **what does this object actually do?** Good question. The `Metric` object holds all the metrics gathered from a specific model in a given fold. For example, if I am on the fifth fold of running `RandomForestClassifier` and have obtained an accuracy of 97% through `RandomForestClassifier.score()`, then, assuming I've already created a `MetricsManager` object, all I have to do to save this information for later is:
```
# Example 1
metric = Metric('rf', fold=5)
metric.addValue('acc', 0.97)
metrics_manager.addMetric(metric)
```
While there is no standardization among the names of the models or measures (read: metrics), consistency across code is helpful. The code below will work, because the computer has no understanding of what "accuracy" means, but the output will be less than helpful since it treats `acc` as a separate measure from `Accuracy`, as seen in Example 2. Also note that the form in which the data is handed to the API affects the output: 97.50 is printed as a percentage while 0.95 is printed as a fraction. This is also demonstrated in Example 2 Output.  
```
# Example 2
metric0 = Metric('RandomForest', fold=1)
metric0.addValue('Accuracy', 97.50)
metrics_manager.addMetric(metric0)

metric1 = Metric('cart', fold=1)
metric1.addValue('acc', 0.95)
metrics_manager.addMetric(metric1)
```
and the output
```
# Example 2 Output
model     Accuracy   acc        
---------------------
RandomForest 97.50±0.00           
cart                  0.95±0.00
```
It ain't the smartest beast in the world, but it can clean up the code substantially. If you care about how clean the output is, or if you want to automatically generate a table for markdown, do it yourself.
Other than that, the `Metric` object has clearly defined getters and setters. If you need something, feel free to add it.
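
One way to avoid the split columns seen in Example 2 is to normalise every measure name before it reaches `addValue`. The sketch below assumes a small alias table is acceptable; `MEASURE_ALIASES` and `canonical_measure` are hypothetical names and are not part of this notebook.

```python
# Hypothetical alias table: maps the spellings that show up in practice to one key
MEASURE_ALIASES = {
    'accuracy': 'acc',
    'acc': 'acc',
    'time': 'time',
    'runtime': 'time',
}

def canonical_measure(name):
    """Return the canonical key for a measure name, or the cleaned name if unknown."""
    cleaned = name.strip().lower()
    return MEASURE_ALIASES.get(cleaned, cleaned)

print(canonical_measure('Accuracy'))  # acc
print(canonical_measure('acc'))       # acc
```

This only unifies the column names. The caller still has to pick one scale (fraction or percentage) for the values themselves.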

#### MetricsManager object
This object is *way* more interesting. Every `MetricsManager` object keeps an internal list of all the `Metric` objects it knows. You can see how to add a `Metric` object in both Example 1 and 2. The `getMetrics()` method returns the `Metric` objects the manager knows about that meet the given criteria. As an example, if we hand the manager a bunch of information and then only want some of it back (Example 3), we can do that!
```
# Example 3
mm = MetricsManager()

m = Metric('rf', fold=1)
m.addValue('acc', 0.97)
m.addValue('time', 0.39)
mm.addMetric(m)

m = Metric('rf', fold=2)
m.addValue('acc', 0.95)
m.addValue('time', 0.99)
mm.addMetric(m)

m = Metric('rf', fold=3)
m.addValue('acc', 0.93)
m.addValue('time', 0.98)
mm.addMetric(m)

m = Metric('dt', fold=1)
m.addValue('time', 0.75)
mm.addMetric(m)

m = Metric('xgboost', fold=1)
m.addValue('acc', 0.99)
m.addValue('time', 50)
mm.addMetric(m)

m = Metric('xgboost', fold=2)
m.addValue('acc', 0.99)
m.addValue('time', 519)
mm.addMetric(m)

mm.printMeasures(metrics=['time'])
```
```
# Example 3 Output 1
model     time       
--------------
rf         0.79±0.28
dt         0.75±0.00
xgboost  284.50±234.50
```
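
Each cell in this output is the mean and standard deviation of one measure across every `Metric` that shares a model name. A minimal sketch of that arithmetic for the `rf` row follows; it uses the standard library instead of NumPy so that it runs on its own.

```python
from statistics import mean, pstdev

# The three 'time' values recorded for 'rf' in Example 3
rf_times = [0.39, 0.99, 0.98]

# pstdev is the population standard deviation, which matches np.std's default (ddof=0)
summary = '{:.2f}\u00B1{:.2f}'.format(mean(rf_times), pstdev(rf_times))
print(summary)  # 0.79±0.28
```

Because the spread is a population standard deviation, a model with a single recorded fold (such as `dt`) always shows `±0.00`.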
The `MetricsManager.printMeasures()` method takes `model` and `metrics` arguments, each of which may be either a single string identifier (such as `'rf'` or `'DecisionTreeClassifier'`) or a list of identifiers (`['acc', 'time']`). If you request data that the manager cannot give you, however, that data is simply left out. Using the code from Example 3, we can see that if instead of `mm.printMeasures(metrics=['time'])` we say `mm.printMeasures(model=['rf', 'xgboost', 'dt'], metrics='acc')`, we get the following output instead; `dt` is missing because it never recorded an `acc` value.
```
# Example 3 Output 2
model     acc        
--------------
rf         0.95±0.02
xgboost    0.99±0.00
```
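
The filtering behind `getMetrics()` can be sketched with a list comprehension. The sketch uses plain `(name, values)` tuples in place of `Metric` objects so that it runs on its own; `select` is a hypothetical stand-in, not the notebook's method.

```python
# Stand-in records: (model name, {measure: value}); the notebook stores Metric objects instead
records = [
    ('rf', {'acc': 0.97, 'time': 0.39}),
    ('rf', {'acc': 0.95, 'time': 0.99}),
    ('dt', {'time': 0.75}),
    ('xgboost', {'acc': 0.99, 'time': 50}),
]

def select(records, models, measures):
    """Keep records whose name is in `models` and that hold every requested measure."""
    return [
        (name, {m: values[m] for m in measures})
        for name, values in records
        if name in models and all(m in values for m in measures)
    ]

selected = select(records, models=['rf', 'xgboost', 'dt'], measures=['acc'])
print(selected)
```

The `dt` record is dropped even though `'dt'` was requested, because it holds no `acc` value. That is the same reason `dt` is absent from Example 3 Output 2.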

In [2]:
# Objects used to help manage the metrics data
class Metric:
    def __init__(self, name, fold):
        self.name = name
        self.fold_num = fold
        self.values = {}

    def __str__(self):
        return str({self.name: self.values})

    def __repr__(self):
        return str({self.name: self.values})

    def addValue(self, m_type, value):
        if m_type is not None and value is not None:
            self.values[m_type] = value

    def getValue(self, m_type):
        if m_type in self.values:
            return self.values[m_type]

    def getName(self):
        return self.name

    def getMeasures(self):
        # Returns all the types of measurements (accuracy or time or whatever you have)
        return self.values.keys()

    def getValues(self):
        return self.values

    def containsType(self, m_type):
        # Checks whether the measurement type(s) (accuracy, for example) are stored here
        if isinstance(m_type, list):
            return all(m in self.values for m in m_type)
        if isinstance(m_type, str):
            return m_type in self.values
        return False

    def getMetricWithMeasure(self, m_type):
        # Return a metric with only the data requested; m_type may be a single name, a list of names, or 'all'
        # 'all' has to be handled here, otherwise it is looked up as a measurement name and raises a KeyError
        if m_type == 'all':
            return self
        if isinstance(m_type, str):
            m_type = [m_type]

        new_metric = Metric(self.name, fold=self.fold_num)
        for m in m_type:
            new_metric.addValue(m, self.values[m])

        return new_metric

class MetricsManager:
    def __init__(self):
        self.metrics_list = []
    
    def getMetrics(self, model_name='all', m_type='all'):
        # If they want everything, give them everything
        if model_name == 'all' and m_type == 'all':
            return self.metrics_list
        # If they want a list of models, the conditional in the lambda function changes a little bit
        elif type(model_name) == list:
            # This line is a blast! It searches through all of the metrics the manager knows about, and returns all the metrics that have both the name and metrics the user wants in a list
            return list(
                filter(
                    None, 
                    map( 
                        lambda m : m.getMetricWithMeasure(m_type) if (m.getName() in model_name) and (m.containsType(m_type) or m_type == 'all') else None, 
                        self.metrics_list
                    )
                )
            )
        # Return the data requested as per the terrible line below
        else:
            # This line is a blast! It searches through all of the metrics the manager knows about, and returns all the metrics that have both the name and metrics the user wants in a list
            return list(
                filter(
                    None, 
                    map( 
                        lambda m : m.getMetricWithMeasure(m_type) if (m.getName() == model_name or model_name == 'all') and (m.containsType(m_type) or m_type == 'all') else None, 
                        self.metrics_list
                    )
                )
            )

    def addMetric(self, metric):
        self.metrics_list.append(metric)

    def printMeasures(self, model='all', metrics='all'):
        # Acquire all of the metrics the user wants us to print first so there's no weird filtering going on later
        metrics = self.getMetrics(model_name=model, m_type=metrics)

        # Figure out all of the metrics that are going to be available and figure out their ordering
        #   If we are printing time and accuracy data, we want the columns to be consistent
        measurements = []
        for metric in metrics:
            metric_measures = metric.getMeasures()
            for measure in metric_measures:
                if measure not in measurements:
                    measurements.append(measure)

        # Formatting for the header, we need to print the model column name, then each of the values collected
        print('{:10}'.format('model'), end='')
        for measure in measurements:
            print('{:11}'.format(measure), end='')
        print('\n', end='')
        print('-------'*(len(measurements)+1))

        # Go through all of the models and print their data one line at a time
        printed_models = []
        for metric in metrics:
            metric_name = metric.getName()
            
            # If the model hasn't already been printed (this can happen if I have multiple folds for one classifier), then print the data
            if metric_name not in printed_models:
                print('{:9}'.format(metric_name), end='')

                # Get all of the values stored in this metric (it's in a dictionary)
                metric_values = metric.getValues()
                
                # Go through all of the measurement values in the order as determined above
                for measure in measurements:
                    if measure in metric_values:
                        # We need to go through all of the data and calculate the mean and std deviation from each fold
                        #  If there are no folds, then this won't cause any damage (Keep unique identifiers!)
                        vals = []
                        for m in metrics:
                            if m.getName() == metric_name:
                                vals.append(m.getValues()[measure])
                        # Print the calculated mean and standard deviation
                        print('{:6.2f}\u00B1{:.2f}'.format(np.mean(vals), np.std(vals)), end='')
                    # If there is no value to print, just skip over this element
                    else:
                        print(' '*11, end='')
                # Make note of the model we just printed. We don't want any repeats
                printed_models.append(metric_name)
                print('\n', end='')

In [3]:
# Metric manager tests and examples
mm = MetricsManager()

m = Metric('rf', fold=1)
m.addValue('acc', 0.97)
m.addValue('time', 0.39)
mm.addMetric(m)

m = Metric('rf', fold=2)
m.addValue('acc', 0.95)
m.addValue('time', 0.99)
mm.addMetric(m)

m = Metric('rf', fold=3)
m.addValue('acc', 0.93)
m.addValue('time', 0.98)
mm.addMetric(m)

m = Metric('dt', fold=1)
m.addValue('time', 0.75)
mm.addMetric(m)

m = Metric('xgboost', fold=1)
m.addValue('acc', 0.99)
m.addValue('time', 50)
mm.addMetric(m)

m = Metric('xgboost', fold=2)
m.addValue('acc', 0.99)
m.addValue('time', 519)
mm.addMetric(m)

mm.printMeasures(model=['rf', 'xgboost', 'dt'], metrics='acc')

model     acc        
--------------
rf         0.95±0.02
xgboost    0.99±0.00


In [4]:
metrics_manager = MetricsManager()

metric0 = Metric('RandomForest', fold=1)
metric0.addValue('Accuracy', 97.50)
metrics_manager.addMetric(metric0)

metric1 = Metric('cart', fold=1)
metric1.addValue('acc', 0.95)
metrics_manager.addMetric(metric1)

metrics_manager.printMeasures()

model     Accuracy   acc        
---------------------
RandomForest 97.50±0.00           
cart                  0.95±0.00


In [5]:
def train_and_eval_on(X, y, feature_set, metrics_manager):
    """
    train_and_eval_on function
        Description: This function will train all the models on the given feature set of the X (data) for predicting y (target) and add the acquired metrics 
          to the MetricsManager object from the user

        Args: 
            X => pd.DataFrame object containing the data
            y => pd.Series object containing the target classifications
            feature_set => list of features in X to use for training
            metrics_manager => MetricsManager object (custom)

        Returns:
            Nothing
        
        Keys used for the manager:
                        Random Forest => rf
                        Decision Tree => dt
                        k-Nearest Neighbors => knn
                        Support Vector Machine => svm
                        Logistic Regression => lr
                        Linear Discriminant Analysis => lda
                        AdaBoost => ab
                        Naive Bayes => nb
                        Keras-TensorFlow => keras
                        Fast.ai => fastai
    """

    # Select the given features within the data
    X = X[feature_set]

    print('Training with {} features'.format(len(X.columns)))

    # Create stratified, 10-fold cross validation object
    random_state = 0
    sss = StratifiedKFold(n_splits=10, shuffle=True, random_state=random_state)

    # Experiment with 10-fold cross validation; i is the 1-based fold number
    for i, (train_idx, test_idx) in enumerate(sss.split(X, y), start=1):

        print('fold num {}'.format(i))

        # Split the data into the training and testing sets
        X_train, X_test = X.iloc[train_idx], X.iloc[test_idx]
        y_train, y_test = y.iloc[train_idx], y.iloc[test_idx]

        # Random Forest Model
        rf = RandomForestClassifier(random_state=random_state)
        rf.fit(X_train, y_train)
        score = rf.score(X_test, y_test)

        m = Metric('rf', fold=i)
        m.addValue('acc', 100*score)
        metrics_manager.addMetric(m)

        # Decision Tree Model
        dt = DecisionTreeClassifier(random_state=random_state)
        dt.fit(X_train, y_train)
        score = dt.score(X_test, y_test)

        m = Metric('dt', fold=i)
        m.addValue('acc', 100*score)
        metrics_manager.addMetric(m)

        # k-Nearest Neighbors Model
        knn = KNeighborsClassifier()
        knn.fit(X_train, y_train)
        score = knn.score(X_test, y_test)

        m = Metric('knn', fold=i)
        m.addValue('acc', 100*score)
        metrics_manager.addMetric(m)

        # Support Vector Machine Model
        svm = SVC(random_state=random_state)
        svm.fit(X_train, y_train)
        score = svm.score(X_test, y_test)

        m = Metric('svm', fold=i)
        m.addValue('acc', 100*score)
        metrics_manager.addMetric(m)

        # Logistic Regression Model
        lr = LogisticRegression(random_state=random_state)
        lr.fit(X_train, y_train)
        score = lr.score(X_test, y_test)

        m = Metric('lr', fold=i)
        m.addValue('acc', 100*score)
        metrics_manager.addMetric(m)

        # Linear Discriminant Analysis Model
        lda = LinearDiscriminantAnalysis()
        lda.fit(X_train, y_train)
        score = lda.score(X_test, y_test)

        m = Metric('lda', fold=i)
        m.addValue('acc', 100*score)
        metrics_manager.addMetric(m)

        # AdaBoost Model
        ab = AdaBoostClassifier(random_state=random_state)
        ab.fit(X_train, y_train)
        score = ab.score(X_test, y_test)

        m = Metric('ab', fold=i)
        m.addValue('acc', 100*score)
        metrics_manager.addMetric(m)

        # Naive Bayes Model
        nb = GaussianNB()
        nb.fit(X_train, y_train)
        score = nb.score(X_test, y_test)

        m = Metric('nb', fold=i)
        m.addValue('acc', 100*score)
        metrics_manager.addMetric(m)

        # Keras-TensorFlow DNN Model
        dnn_keras = Sequential(layers=[
                                 Dense(128, kernel_regularizer=l2(0.001), activation='relu',input_shape=(len(X_train.columns),)),
                                 BatchNormalization(),
                                 Dense(64, activation='relu', kernel_regularizer=l2(0.001)),
                                 BatchNormalization(),
                                 Dense(y_train.nunique(), activation='softmax')
        ])
        dnn_keras.compile(
            optimizer='adam', 
            loss='categorical_crossentropy', 
            metrics=['accuracy'])
        dnn_keras.fit(X_train, pd.get_dummies(y_train), epochs=100, verbose=0, batch_size=512)
        _, score = dnn_keras.evaluate(X_test, pd.get_dummies(y_test), verbose=0)

        m = Metric('keras', fold=i)
        m.addValue('acc', 100*score)
        metrics_manager.addMetric(m)

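        # Fast.ai DNN Model
        # NOTE: this block reads the notebook-level globals df, path, and dep_var, and df must have the same row order as X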
        data_fold = (TabularList.from_df(df, path=path, cont_names=X_train.columns, procs=[Categorify, Normalize])
                     .split_by_idxs(train_idx, test_idx)
                     .label_from_df(cols=dep_var)
                     .databunch(num_workers=0))
        dnn_fastai = tabular_learner(data_fold, layers=[200, 100], metrics=accuracy)
        dnn_fastai.fit_one_cycle(cyc_len=10, callbacks=None)
        _, score = dnn_fastai.validate()

        m = Metric('fastai', fold=i)
        m.addValue('acc', 100*score)
        metrics_manager.addMetric(m)
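
The logistic regression above is fit on unscaled features, and flow statistics such as byte counts and skew values sit on very different scales; that is a common reason for scikit-learn's default solver to stop at its iteration limit. A hedged sketch of one remedy, standardising inside a pipeline and raising `max_iter`, is shown here on synthetic stand-in data. The `X_demo` and `y_demo` arrays and the `max_iter=1000` value are assumptions for illustration, not settings taken from these experiments.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)

# Synthetic stand-in data: two features on very different scales
X_demo = np.column_stack([rng.normal(0, 1, 200), rng.normal(0, 1e5, 200)])
y_demo = (X_demo[:, 0] + X_demo[:, 1] / 1e5 > 0).astype(int)

# The scaler is fit together with the model, so inside cross validation
# it only ever sees the training fold
scaled_lr = make_pipeline(
    StandardScaler(),
    LogisticRegression(random_state=0, max_iter=1000),
)
scaled_lr.fit(X_demo, y_demo)
demo_score = scaled_lr.score(X_demo, y_demo)
print('accuracy on the synthetic data: {:.2f}'.format(demo_score))
```

Keeping the scaler inside the pipeline means no statistics from the test fold leak into training.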

In [6]:
def show_graph(figure, feature_count, metrics_dict, exp_type=''):
  """
  show_graph function

    Description: This function takes the metrics dictionary provided and redraws the graph to show the most recent results

    Args:
      figure => matplotlib.pyplot.figure object (currently unused in the body)
      feature_count => number of features used in the latest run (currently unused in the body)
      metrics_dict => dictionary keyed by feature count, then by model name, holding a list of accuracies
      exp_type => string, 'multi' or 'binary', selecting the title of the graph

    Returns:
      nothing
  """
  # Reorganize the data so we have all of the random forest metrics with increasing features side by side
  reorganized_dictionary = {}

  for feature_vals in metrics_dict.keys():
    for key in metrics_dict[feature_vals].keys():
      # If a given model is not in the new dictionary, add it
      if key not in reorganized_dictionary:
        reorganized_dictionary[key] = {}

      # If there isn't a specific feature number in the model dictionary, add it
      if feature_vals not in reorganized_dictionary[key]:
        reorganized_dictionary[key][feature_vals] = []

      # If there is anything to the record, add it
      if len( metrics_dict[feature_vals][key] ) > 0:
        accuracies = metrics_dict[feature_vals][key]
        mean = np.mean(accuracies)
        std = np.std(accuracies)

        #print('Accuracies: {}'.format(accuracies))
        #print('Mean: {}'.format(mean))
        #print('Std: {}'.format(std))

        reorganized_dictionary[key][feature_vals].append( [mean, std] ) 

  #print('Models: {}'.format( list(reorganized_dictionary.keys()) ))

  for model in reorganized_dictionary.keys():
    # The x-axis will have the feature_count
    xs = []

    # The y-axis will have the accuracy for that feature_count value
    ys = []

    # The y-axis will also have the std for these accuracies since they are accumulated over 10 folds
    yerrs = []

    for x in reorganized_dictionary[model].keys():
      if len(reorganized_dictionary[model][x]) > 0:
        xs.append(x)
        ys.append(reorganized_dictionary[model][x][0][0])
        yerrs.append(reorganized_dictionary[model][x][0][1])
    #print('xs: {}'.format(xs))
    #print('ys: {}'.format(ys))
    if len(xs) > 0:
      plt.errorbar(x=xs, y=ys, yerr=yerrs, label=model)

  #print(reorganized_dictionary)
  if exp_type == 'multi':
    plt.title('Multi-class Classification Model Accuracies with Increasing Features')
  elif exp_type == 'binary':
    plt.title('Binary Classification Model Accuracies with Increasing Features')
  plt.ylabel('Accuracy')
  plt.xlabel('Number of Features')

  plt.xticks(xs[4::5])

  plt.legend()
  plt.show()
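
The function expects `metrics_dict` to map a feature count to a dictionary of per-model accuracy lists; that layout is inferred from the loops above rather than documented anywhere. A small sketch of the same reorganisation on made-up numbers, using the standard library in place of NumPy:

```python
from statistics import mean, pstdev

# Assumed layout: {feature_count: {model: [accuracy per fold, ...]}}; the values are made up
metrics_dict = {
    1: {'rf': [90.0, 92.0], 'dt': [85.0, 87.0]},
    2: {'rf': [94.0, 96.0], 'dt': []},
}

# Target layout: {model: {feature_count: (mean, std)}}
by_model = {}
for feature_count, per_model in metrics_dict.items():
    for model, accuracies in per_model.items():
        by_model.setdefault(model, {})
        if accuracies:  # skip models with nothing recorded at this feature count
            by_model[model][feature_count] = (mean(accuracies), pstdev(accuracies))

print(by_model)
```

Each inner dictionary then supplies one line of the plot: the keys are the x values, the means are the y values, and the standard deviations are the error bars.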


In [7]:
def get_data(path, layer=0, nans=False):
    """ get_data function
        Description: This function takes the given path and user-defined layer of the dataset, imports the data files, and returns the combined pandas DataFrame
        Arguments:
            path => string, path to the directory containing the l1-doh.csv, l1-nondoh.csv, etc. files.
            layer => int, the layer desired. This selects the files that are imported. Values can be 1 or 2. The default of 0 is not valid and raises an error, so a layer must always be given.
            nans => boolean, whether to keep rows containing NaN values. The default (False) removes every row with a NaN value.
        Returns:
            df => pandas.DataFrame, contains complete data
        Raises:
            AttributeError for incorrect layer number
            Any additional read errors are raised to the user
    """
    import pandas as pd

    if layer not in [1,2]:
        raise AttributeError('Must provide valid layer for dataset: layer equals 1 or 2')
    else:

        # Select the files that the user has chosen
        filenames = []
        if layer == 1:
            filenames.append('l1-doh.csv')
            filenames.append('l1-nondoh.csv')
        else:
            filenames.append('l2-benign.csv')
            filenames.append('l2-malicious.csv')

        # Read the files into dataframes
        df0 = pd.read_csv(path + '/' + filenames[0])
        df1 = pd.read_csv(path + '/' + filenames[1])

        df = pd.concat([df0, df1])

        # Remove any rows with Nan values
        if not nans:
            df.dropna(axis='index', inplace=True)

        return df
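
The layer-to-file selection can also be written as a lookup table, with `os.path.join` building the paths. This is only a sketch: `LAYER_FILES` and `layer_paths` are hypothetical names, and it raises `ValueError` where `get_data` raises `AttributeError`.

```python
import os

# Filenames per layer, as listed in get_data
LAYER_FILES = {
    1: ('l1-doh.csv', 'l1-nondoh.csv'),
    2: ('l2-benign.csv', 'l2-malicious.csv'),
}

def layer_paths(path, layer):
    """Return the full paths of the two CSV files for the given layer."""
    if layer not in LAYER_FILES:
        raise ValueError('layer must be 1 or 2, got {!r}'.format(layer))
    return [os.path.join(path, name) for name in LAYER_FILES[layer]]

print(layer_paths('data', 2))
```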

In [8]:
def balance_data(df, label_column):
    labels = df[label_column].unique()
    sample_length_list = []
    for i in range(len(labels)):
        samples = df.loc[ df[label_column] == labels[i] ]
        sample_length_list.append( len(samples) )
        #print('Number of {} samples: {}'.format(labels[i], len( samples )))

    random_state = 0
    smallest_count = min(sample_length_list)
    dfs = []
    for i in range(len(labels)):
        #dfs.append( df.loc[ df[label_column] == labels[i] ].sample(smallest_count, random_state=random_state) )

        # We are only sampling 40 purely for testing reasons to help speed up the dev process!
        # Uncomment the line above this to actually run the complete tests
        # random_state is passed so that the same rows are drawn on every run
        dfs.append( df.loc[ df[label_column] == labels[i] ].sample(40, random_state=random_state) )

    return pd.concat(dfs)
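
The undersampling idea can be checked on a toy frame. This is a sketch with made-up rows; `toy`, `smallest`, and `balanced` are illustrative names only.

```python
import pandas as pd

# Toy frame standing in for the dataset: 5 'NonDoH' rows and 2 'DoH' rows
toy = pd.DataFrame({
    'Duration': [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0],
    'Label': ['NonDoH'] * 5 + ['DoH'] * 2,
})

# Size of the rarest class; every class is cut down to this many rows
smallest = toy['Label'].value_counts().min()

# Draw `smallest` rows from every class; random_state makes the draw repeatable
parts = [
    toy[toy['Label'] == label].sample(n=smallest, random_state=0)
    for label in toy['Label'].unique()
]
balanced = pd.concat(parts)

counts = balanced['Label'].value_counts().to_dict()
print(counts)
```

Both classes end up with two rows each, the size of the smaller class.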

In [9]:
# Set up google drive access
drive.mount('/content/gdrive')

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).


## Layer 1 Experiments: DoH or nonDoH

In [10]:
path = '/content/gdrive/My Drive/doh_dataset/Total-CSVs'
df = get_data(path=path, layer=1)
df.head()

Unnamed: 0,SourceIP,DestinationIP,SourcePort,DestinationPort,TimeStamp,Duration,FlowBytesSent,FlowSentRate,FlowBytesReceived,FlowReceivedRate,PacketLengthVariance,PacketLengthStandardDeviation,PacketLengthMean,PacketLengthMedian,PacketLengthMode,PacketLengthSkewFromMedian,PacketLengthSkewFromMode,PacketLengthCoefficientofVariation,PacketTimeVariance,PacketTimeStandardDeviation,PacketTimeMean,PacketTimeMedian,PacketTimeMode,PacketTimeSkewFromMedian,PacketTimeSkewFromMode,PacketTimeCoefficientofVariation,ResponseTimeTimeVariance,ResponseTimeTimeStandardDeviation,ResponseTimeTimeMean,ResponseTimeTimeMedian,ResponseTimeTimeMode,ResponseTimeTimeSkewFromMedian,ResponseTimeTimeSkewFromMode,ResponseTimeTimeCoefficientofVariation,Label
0,192.168.20.191,176.103.130.131,50749,443,2020-01-14 15:49:11,95.08155,62311,655.342703,65358,687.388878,7474.676771,86.456213,135.673751,102.0,54,1.168467,0.944683,0.637236,670.585814,25.895672,45.065277,48.811292,1.49506,-0.433974,1.682529,0.574626,0.001053,0.032457,0.027624,0.026854,0.026822,0.071187,0.024715,1.174948,DoH
1,192.168.20.191,176.103.130.131,50749,443,2020-01-14 15:50:52,122.309318,93828,767.136973,101232,827.672018,10458.118598,102.264943,141.245474,114.0,54,0.799261,0.853132,0.724023,708.465878,26.617022,52.287903,48.830314,31.719656,0.389704,0.772748,0.509047,0.00117,0.0342,0.024387,0.021043,0.026981,0.293297,-0.075845,1.402382,DoH
2,192.168.20.191,176.103.130.131,50749,443,2020-01-14 15:52:55,120.958413,38784,320.639127,38236,316.108645,7300.293933,85.441758,133.715278,89.0,54,1.570027,0.932978,0.638983,1358.911235,36.863413,50.316114,39.770747,0.417528,0.858198,1.353607,0.732636,0.000785,0.028021,0.029238,0.026921,0.026855,0.248064,0.085061,0.958348,DoH
3,192.168.20.191,176.103.130.131,50749,443,2020-01-14 15:54:56,110.50108,61993,561.017141,69757,631.278898,8499.282518,92.191553,139.123548,114.0,54,0.817544,0.923333,0.66266,1118.135436,33.438532,51.693726,34.882495,13.280934,1.508251,1.148758,0.646859,0.000411,0.020274,0.019925,0.019268,0.026918,0.097199,-0.344926,1.017535,DoH
4,176.103.130.131,192.168.20.191,443,50749,2020-01-14 15:56:46,54.229891,83641,1542.341289,76804,1416.266907,8052.745751,89.737092,138.91342,114.0,114,0.83288,0.277627,0.645993,341.696613,18.485038,36.435619,49.822561,7.342519,-2.172613,1.573873,0.507334,0.079079,0.281209,0.02593,4.7e-05,2.1e-05,0.276133,0.092135,10.844829,DoH


In [11]:
bad_columns = ['SourceIP', 'DestinationIP', 'SourcePort', 'DestinationPort', 'TimeStamp']
df.drop(labels=bad_columns, axis='columns', inplace=True)

In [12]:
# The target classifications are in the 'Label' column, 
#  thus this is the dependent variable!
dep_var = 'Label'
df[dep_var].value_counts()

NonDoH    889809
DoH       269299
Name: Label, dtype: int64

In [13]:
# Balance the data out
df = balance_data(df, dep_var)

In [14]:
df[dep_var].value_counts()

NonDoH    40
DoH       40
Name: Label, dtype: int64

In [15]:
# Split up the data into the data (X) and classifications (y)
X = df.loc[:, df.columns != dep_var]
y = df[dep_var]

In [16]:
best_features_layer1 = ['Duration', 'ResponseTimeTimeSkewFromMedian', 'ResponseTimeTimeMode',
       'ResponseTimeTimeMedian', 'ResponseTimeTimeMean',
       'PacketTimeSkewFromMedian', 'PacketTimeMode', 'PacketTimeMedian',
       'PacketTimeMean', 'ResponseTimeTimeSkewFromMode', 'PacketTimeVariance',
       'PacketLengthCoefficientofVariation', 'PacketTimeStandardDeviation',
       'PacketLengthMode', 'PacketLengthMedian', 'PacketLengthMean',
       'FlowBytesSent', 'ResponseTimeTimeCoefficientofVariation',
       'PacketLengthStandardDeviation', 'PacketLengthVariance',
       'PacketTimeCoefficientofVariation', 'FlowReceivedRate',
       'ResponseTimeTimeStandardDeviation', 'PacketLengthSkewFromMode',
       'FlowBytesReceived', 'PacketLengthSkewFromMedian', 'FlowSentRate',
       'ResponseTimeTimeVariance', 'PacketTimeSkewFromMode']
print('These are the best 4 features for layer 1: {}'.format(best_features_layer1[:4]))
print('These are the worst 4 features for layer 1: {}'.format(best_features_layer1[-4:]))

These are the best 4 features for layer 1: ['Duration', 'ResponseTimeTimeSkewFromMedian', 'ResponseTimeTimeMode', 'ResponseTimeTimeMedian']
These are the worst 4 features for layer 1: ['PacketLengthSkewFromMedian', 'FlowSentRate', 'ResponseTimeTimeVariance', 'PacketTimeSkewFromMode']
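
Since the list is ordered from best to worst, growing the feature set can be expressed as taking ever longer prefixes of it. The sketch below works under that assumption; how the notebook actually steps through the feature-set sizes is not shown in this section.

```python
# First entries of the ranked list above; the full list has 29 names
ranked = [
    'Duration',
    'ResponseTimeTimeSkewFromMedian',
    'ResponseTimeTimeMode',
    'ResponseTimeTimeMedian',
]

# One candidate feature set per prefix: top-1, top-2, and so on
feature_sets = [ranked[:k] for k in range(1, len(ranked) + 1)]

for subset in feature_sets:
    # Print the size of the set and the feature that was added last
    print(len(subset), subset[-1])
```

Each entry of `feature_sets` could then be passed as the `feature_set` argument of `train_and_eval_on`.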


In [17]:
mm = MetricsManager()

train_and_eval_on(X=X, y=y, feature_set=best_features_layer1, metrics_manager=mm)

Training with 29 features
fold num 1


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


epoch,train_loss,valid_loss,accuracy,time
0,0.586454,0.69492,0.5,00:00
1,0.580707,0.68957,0.5,00:00
2,0.531031,0.680415,0.875,00:00
3,0.475715,0.669179,0.625,00:00
4,0.423041,0.653644,0.625,00:00
5,0.384319,0.643595,0.625,00:00
6,0.351636,0.634413,0.625,00:00
7,0.324862,0.626189,0.625,00:00
8,0.302895,0.617732,0.625,00:00
9,0.282508,0.610007,0.625,00:00


fold num 2


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


epoch,train_loss,valid_loss,accuracy,time
0,0.820875,0.575008,0.625,00:00
1,0.804633,0.566682,0.75,00:00
2,0.720911,0.552958,0.875,00:00
3,0.615335,0.541664,0.875,00:00
4,0.540659,0.535697,0.875,00:00
5,0.482755,0.525448,0.75,00:00
6,0.443664,0.504967,0.75,00:00
7,0.405678,0.49055,0.75,00:00
8,0.373337,0.4775,0.75,00:00
9,0.349449,0.46618,0.875,00:00


fold num 3


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


epoch,train_loss,valid_loss,accuracy,time
0,0.765419,0.700385,0.375,00:00
1,0.752629,0.680744,0.75,00:00
2,0.681751,0.642219,0.75,00:00
3,0.608368,0.598549,0.875,00:00
4,0.543536,0.560751,0.875,00:00
5,0.496919,0.530435,0.875,00:00
6,0.460474,0.503342,0.875,00:00
7,0.427805,0.480372,0.875,00:00
8,0.403187,0.46111,0.875,00:00
9,0.381008,0.447029,0.875,00:00


fold num 4


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


epoch,train_loss,valid_loss,accuracy,time
0,0.735889,0.704162,0.375,00:00
1,0.701706,0.683957,0.375,00:00
2,0.625626,0.640974,0.625,00:00
3,0.546125,0.593731,0.875,00:00
4,0.476257,0.556285,0.75,00:00
5,0.433208,0.533176,0.75,00:00
6,0.395045,0.518491,0.875,00:00
7,0.365774,0.508397,0.875,00:00
8,0.344276,0.500763,0.875,00:00
9,0.325577,0.49408,0.875,00:00


fold num 5


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression




epoch,train_loss,valid_loss,accuracy,time
0,0.886537,0.71156,0.25,00:00
1,0.873783,0.690165,0.5,00:00
2,0.788095,0.654848,0.75,00:00
3,0.681902,0.610475,0.875,00:00
4,0.605068,0.558097,0.875,00:00
5,0.539365,0.489833,1.0,00:00
6,0.49416,0.434186,1.0,00:00
7,0.458598,0.391661,1.0,00:00
8,0.428922,0.358499,1.0,00:00
9,0.408296,0.331377,1.0,00:00


fold num 6


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression




epoch,train_loss,valid_loss,accuracy,time
0,1.008191,0.700878,0.375,00:00
1,0.966739,0.690287,0.5,00:00
2,0.911775,0.666552,0.875,00:00
3,0.787911,0.636518,0.875,00:00
4,0.680737,0.605416,0.875,00:00
5,0.600387,0.580382,0.875,00:00
6,0.539256,0.560405,0.875,00:00
7,0.500802,0.546531,0.875,00:00
8,0.462507,0.536522,0.875,00:00
9,0.430212,0.528817,0.875,00:00


fold num 7


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression




epoch,train_loss,valid_loss,accuracy,time
0,0.845274,0.689571,0.5,00:00
1,0.818758,0.661387,0.625,00:00
2,0.759776,0.60938,1.0,00:00
3,0.658915,0.551555,0.875,00:00
4,0.585831,0.501302,0.875,00:00
5,0.538746,0.460169,0.875,00:00
6,0.494776,0.428511,0.875,00:00
7,0.460535,0.404621,0.875,00:00
8,0.429277,0.387612,0.875,00:00
9,0.405086,0.375743,0.875,00:00


fold num 8


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression




epoch,train_loss,valid_loss,accuracy,time
0,0.800919,0.660596,0.5,00:00
1,0.764018,0.634454,0.75,00:00
2,0.681831,0.583541,0.875,00:00
3,0.598476,0.530093,0.875,00:00
4,0.540559,0.483077,0.75,00:00
5,0.486774,0.445686,0.75,00:00
6,0.445597,0.414478,0.75,00:00
7,0.408562,0.388164,0.75,00:00
8,0.382097,0.367361,0.875,00:00
9,0.360537,0.351039,0.875,00:00


fold num 9


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression




epoch,train_loss,valid_loss,accuracy,time
0,1.09059,0.692755,0.625,00:00
1,1.056711,0.672676,0.625,00:00
2,0.93709,0.632657,0.75,00:00
3,0.800092,0.592068,0.875,00:00
4,0.700378,0.560399,0.875,00:00
5,0.620434,0.536668,0.875,00:00
6,0.558503,0.515276,0.875,00:00
7,0.511095,0.497619,0.875,00:00
8,0.475301,0.48367,0.875,00:00
9,0.442453,0.473212,0.875,00:00


fold num 10


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression




epoch,train_loss,valid_loss,accuracy,time
0,0.731904,1.554641,0.25,00:00
1,0.715179,2.918347,0.625,00:00
2,0.654629,5.824059,0.5,00:00
3,0.575111,9.875998,0.625,00:00
4,0.508116,13.667651,0.75,00:00
5,0.460425,17.287638,0.75,00:00
6,0.420152,21.05547,0.75,00:00
7,0.390085,23.878262,0.75,00:00
8,0.36189,26.253588,0.75,00:00
9,0.340437,28.223818,0.75,00:00
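
The `STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.` message printed in every fold above is the logistic regression solver stopping at its iteration cap, and the warning itself names two remedies: raise `max_iter` or scale the inputs. Below is a minimal sketch of both. It assumes synthetic data from `make_classification` in place of the DoH flows and an arbitrary `max_iter=1000`; neither is taken from this notebook's `train_and_eval_on`.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the flow features (assumption: 80 rows, 29 columns,
# mirroring the size of the balanced sample used in this notebook).
X_demo, y_demo = make_classification(
    n_samples=80, n_features=29, n_informative=10, random_state=0
)
# Exaggerate one feature's scale, the way byte counts dwarf rate features.
X_demo[:, 0] *= 1e5

# Standardize every feature, then fit logistic regression with a higher cap.
scaled_lr = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_lr.fit(X_demo, y_demo)

# n_iter_ reports how many solver iterations were actually needed.
n_iter_used = int(scaled_lr.named_steps['logisticregression'].n_iter_[0])
print('solver iterations used:', n_iter_used)
```

Wrapping the scaler in a pipeline means it is refit on the training portion of each fold, so no statistics from the held-out fold leak into training.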


In [18]:
mm.printMeasures()

model     acc        
--------------
rf        96.25±5.73
dt        96.25±5.73
knn       81.25±10.08
svm       58.75±12.56
lr        83.75±9.76
lda       75.00±9.68
ab        92.50±8.29
nb        83.75±13.75
keras     65.00±12.25
fastai    85.00±9.35
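
Each row above reads as mean accuracy ± spread over the 10 folds, in percent. With 80 balanced samples, each test fold holds 8 rows, so per-fold accuracy moves in steps of 12.5%. The sketch below shows one hypothetical set of fold accuracies (seven folds at 8/8, three at 7/8) that is consistent with the `rf` row when the spread is the population standard deviation. The actual per-fold values and the formula inside `MetricsManager` are not shown in this section, so both are assumptions.

```python
import numpy as np

# Hypothetical per-fold accuracies: 7 folds with 8/8 correct, 3 folds with 7/8.
fold_acc = np.array([1.0] * 7 + [0.875] * 3)

mean_pct = 100 * fold_acc.mean()
std_pct = 100 * fold_acc.std()  # ddof=0, i.e. population standard deviation

# Same layout as the table: left-aligned name, then mean±std.
row = '{:<10}{:.2f}±{:.2f}'.format('rf', mean_pct, std_pct)
print(row)
```

With only 8 test rows per fold, a single misclassification shifts a fold's accuracy by 12.5 points, which helps explain the wide ± values across all models.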


## Layer 2 Experiments: Benign-DoH or Malicious-DoH

In [19]:
path = '/content/gdrive/My Drive/doh_dataset/Total-CSVs'
df = get_data(path=path, layer=2)
df.head()

Unnamed: 0,SourceIP,DestinationIP,SourcePort,DestinationPort,TimeStamp,Duration,FlowBytesSent,FlowSentRate,FlowBytesReceived,FlowReceivedRate,PacketLengthVariance,PacketLengthStandardDeviation,PacketLengthMean,PacketLengthMedian,PacketLengthMode,PacketLengthSkewFromMedian,PacketLengthSkewFromMode,PacketLengthCoefficientofVariation,PacketTimeVariance,PacketTimeStandardDeviation,PacketTimeMean,PacketTimeMedian,PacketTimeMode,PacketTimeSkewFromMedian,PacketTimeSkewFromMode,PacketTimeCoefficientofVariation,ResponseTimeTimeVariance,ResponseTimeTimeStandardDeviation,ResponseTimeTimeMean,ResponseTimeTimeMedian,ResponseTimeTimeMode,ResponseTimeTimeSkewFromMedian,ResponseTimeTimeSkewFromMode,ResponseTimeTimeCoefficientofVariation,Label
0,192.168.20.191,176.103.130.131,50749,443,2020-01-14 15:49:11,95.08155,62311,655.342703,65358,687.388878,7474.676771,86.456213,135.673751,102.0,54,1.168467,0.944683,0.637236,670.585814,25.895672,45.065277,48.811292,1.49506,-0.433974,1.682529,0.574626,0.001053,0.032457,0.027624,0.026854,0.026822,0.071187,0.024715,1.174948,Benign
1,192.168.20.191,176.103.130.131,50749,443,2020-01-14 15:50:52,122.309318,93828,767.136973,101232,827.672018,10458.118598,102.264943,141.245474,114.0,54,0.799261,0.853132,0.724023,708.465878,26.617022,52.287903,48.830314,31.719656,0.389704,0.772748,0.509047,0.00117,0.0342,0.024387,0.021043,0.026981,0.293297,-0.075845,1.402382,Benign
2,192.168.20.191,176.103.130.131,50749,443,2020-01-14 15:52:55,120.958413,38784,320.639127,38236,316.108645,7300.293933,85.441758,133.715278,89.0,54,1.570027,0.932978,0.638983,1358.911235,36.863413,50.316114,39.770747,0.417528,0.858198,1.353607,0.732636,0.000785,0.028021,0.029238,0.026921,0.026855,0.248064,0.085061,0.958348,Benign
3,192.168.20.191,176.103.130.131,50749,443,2020-01-14 15:54:56,110.50108,61993,561.017141,69757,631.278898,8499.282518,92.191553,139.123548,114.0,54,0.817544,0.923333,0.66266,1118.135436,33.438532,51.693726,34.882495,13.280934,1.508251,1.148758,0.646859,0.000411,0.020274,0.019925,0.019268,0.026918,0.097199,-0.344926,1.017535,Benign
4,176.103.130.131,192.168.20.191,443,50749,2020-01-14 15:56:46,54.229891,83641,1542.341289,76804,1416.266907,8052.745751,89.737092,138.91342,114.0,114,0.83288,0.277627,0.645993,341.696613,18.485038,36.435619,49.822561,7.342519,-2.172613,1.573873,0.507334,0.079079,0.281209,0.02593,4.7e-05,2.1e-05,0.276133,0.092135,10.844829,Benign


In [20]:
bad_columns = ['SourceIP', 'DestinationIP', 'SourcePort', 'DestinationPort', 'TimeStamp']
df.drop(labels=bad_columns, axis='columns', inplace=True)

In [21]:
# The target classifications are in the 'Label' column,
#  so this is the dependent variable.
dep_var = 'Label'
df[dep_var].value_counts()

Malicious    249553
Benign        19746
Name: Label, dtype: int64

In [22]:
# Balance the classes so Benign and Malicious have equal counts
df = balance_data(df, dep_var)

In [23]:
df[dep_var].value_counts()

Benign       40
Malicious    40
Name: Label, dtype: int64
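
The counts above go from 249553/19746 to 40/40, which suggests that balancing here samples a fixed number of rows per class. The sketch below shows one common way to do that with random undersampling. `undersample_per_class` and its parameters are hypothetical names for illustration; this is not the notebook's `balance_data`, whose implementation is defined elsewhere.

```python
import pandas as pd

def undersample_per_class(frame, label_col, n_per_class, seed=0):
    """Return a frame with exactly n_per_class randomly chosen rows per label."""
    parts = [
        group.sample(n=n_per_class, random_state=seed)
        for _, group in frame.groupby(label_col)
    ]
    # Shuffle the concatenated parts so the classes are interleaved.
    return pd.concat(parts).sample(frac=1, random_state=seed).reset_index(drop=True)

# Toy imbalanced frame standing in for the DoH flows.
demo = pd.DataFrame({
    'Duration': range(300),
    'Label': ['Malicious'] * 250 + ['Benign'] * 50,
})
balanced = undersample_per_class(demo, 'Label', n_per_class=40)
print(balanced['Label'].value_counts())
```

Fixing `seed` keeps the sample reproducible between runs, which matters when only 80 of roughly 270,000 rows are kept.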

In [24]:
# Split the frame into features (X) and class labels (y)
X = df.loc[:, df.columns != dep_var]
y = df[dep_var]

In [25]:
best_features_layer2 = ['PacketLengthStandardDeviation', 'PacketLengthCoefficientofVariation',
       'FlowReceivedRate', 'PacketLengthMean', 'Duration',
       'PacketTimeSkewFromMedian', 'FlowSentRate', 'PacketLengthVariance',
       'PacketTimeMean', 'PacketTimeStandardDeviation',
       'ResponseTimeTimeMedian', 'PacketTimeMedian',
       'ResponseTimeTimeSkewFromMode', 'ResponseTimeTimeMean',
       'ResponseTimeTimeMode', 'PacketTimeCoefficientofVariation',
       'ResponseTimeTimeSkewFromMedian', 'PacketTimeMode', 'FlowBytesSent',
       'FlowBytesReceived', 'PacketLengthMode',
       'ResponseTimeTimeCoefficientofVariation', 'PacketLengthSkewFromMedian',
       'PacketTimeVariance', 'PacketLengthMedian', 'PacketTimeSkewFromMode',
       'ResponseTimeTimeStandardDeviation', 'ResponseTimeTimeVariance',
       'PacketLengthSkewFromMode']
print('These are the best 4 features for layer 2: {}'.format(best_features_layer2[:4]))
print('These are the worst 4 features for layer 2: {}'.format(best_features_layer2[-4:]))

These are the best 4 features for layer 2: ['PacketLengthStandardDeviation', 'PacketLengthCoefficientofVariation', 'FlowReceivedRate', 'PacketLengthMean']
These are the worst 4 features for layer 2: ['PacketTimeSkewFromMode', 'ResponseTimeTimeStandardDeviation', 'ResponseTimeTimeVariance', 'PacketLengthSkewFromMode']
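
Because the list is ordered from best to worst, slicing it (`best_features_layer2[:k]`) yields nested feature sets of growing size, which is what an experiment on feature-set size needs. The sketch below shows such a sweep under stated assumptions: the feature names `f0`..`f5`, the synthetic data, the values of `k`, and the random forest settings are all illustrative and not taken from this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Hypothetical ranking, best feature first.
ranked = ['f0', 'f1', 'f2', 'f3', 'f4', 'f5']

rng = np.random.RandomState(0)
labels = np.array(['Benign', 'Malicious'] * 40)
demo = pd.DataFrame(rng.normal(size=(80, len(ranked))), columns=ranked)
# Make the top-ranked feature informative so the sweep has signal to find.
demo['f0'] += np.where(labels == 'Malicious', 3.0, 0.0)

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
results = {}
for k in (2, 4, 6):
    subset = ranked[:k]  # the k best-ranked features
    model = RandomForestClassifier(n_estimators=50, random_state=0)
    scores = cross_val_score(model, demo[subset], labels, cv=cv)
    results[k] = scores.mean()
    print('top {} features: {:.2f}% mean accuracy'.format(k, 100 * scores.mean()))
```

Reusing the same `cv` splitter for every `k` keeps the folds identical, so differences between rows come from the feature set rather than from the split.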


In [26]:
mm = MetricsManager()

train_and_eval_on(X=X, y=y, feature_set=best_features_layer2, metrics_manager=mm)

Training with 29 features
fold num 1


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression




epoch,train_loss,valid_loss,accuracy,time
0,0.737612,0.675974,0.5,00:00
1,0.687723,0.662456,0.625,00:00
2,0.637703,0.638048,0.75,00:00
3,0.557583,0.604583,0.75,00:00
4,0.495572,0.572135,0.75,00:00
5,0.440726,0.536911,0.75,00:00
6,0.403011,0.506367,0.875,00:00
7,0.371606,0.479389,0.875,00:00
8,0.34386,0.457346,0.875,00:00
9,0.32225,0.438344,0.875,00:00


fold num 2


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression




epoch,train_loss,valid_loss,accuracy,time
0,0.847909,0.716261,0.125,00:00
1,0.839858,0.70419,0.25,00:00
2,0.766856,0.66972,0.625,00:00
3,0.665648,0.62569,0.875,00:00
4,0.583602,0.582032,0.875,00:00
5,0.523103,0.543118,1.0,00:00
6,0.47768,0.509218,1.0,00:00
7,0.437856,0.480057,1.0,00:00
8,0.406566,0.454399,1.0,00:00
9,0.378726,0.432347,1.0,00:00


fold num 3


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression




epoch,train_loss,valid_loss,accuracy,time
0,0.590634,0.648027,0.5,00:00
1,0.573107,0.64733,0.5,00:00
2,0.52767,0.672234,0.5,00:00
3,0.4631,0.704037,0.5,00:00
4,0.411604,0.727815,0.5,00:00
5,0.368722,0.736181,0.625,00:00
6,0.334228,0.745755,0.625,00:00
7,0.304314,0.755084,0.625,00:00
8,0.279468,0.7702,0.625,00:00
9,0.260007,0.783458,0.625,00:00


fold num 4


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression




epoch,train_loss,valid_loss,accuracy,time
0,0.752658,0.706581,0.5,00:00
1,0.741617,0.690353,0.5,00:00
2,0.657006,0.656469,0.5,00:00
3,0.563958,0.620703,0.75,00:00
4,0.493579,0.5868,0.75,00:00
5,0.439085,0.557701,0.75,00:00
6,0.393707,0.5315,0.75,00:00
7,0.359186,0.509287,0.75,00:00
8,0.327891,0.493083,0.75,00:00
9,0.303407,0.481899,0.75,00:00


fold num 5


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression




epoch,train_loss,valid_loss,accuracy,time
0,0.804119,0.697153,0.5,00:00
1,0.790285,0.674077,0.625,00:00
2,0.716402,0.629154,0.75,00:00
3,0.620098,0.579321,0.875,00:00
4,0.54365,0.535138,1.0,00:00
5,0.485686,0.497747,1.0,00:00
6,0.438311,0.467382,1.0,00:00
7,0.398583,0.441322,1.0,00:00
8,0.369424,0.41918,1.0,00:00
9,0.342391,0.402224,1.0,00:00


fold num 6


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression




epoch,train_loss,valid_loss,accuracy,time
0,0.752959,3.184497,0.5,00:00
1,0.743129,2.691381,0.75,00:00
2,0.676512,1.329074,0.75,00:00
3,0.58831,0.460545,0.875,00:00
4,0.517741,0.42565,0.875,00:00
5,0.462496,0.40422,0.875,00:00
6,0.420076,0.386454,0.875,00:00
7,0.381683,0.371209,0.875,00:00
8,0.35297,0.357442,0.875,00:00
9,0.329214,0.345491,0.875,00:00


fold num 7


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression




epoch,train_loss,valid_loss,accuracy,time
0,0.953188,0.690687,0.375,00:00
1,0.930186,0.673346,0.5,00:00
2,0.824864,0.63555,0.875,00:00
3,0.705569,0.59564,0.875,00:00
4,0.61947,0.560106,0.75,00:00
5,0.552441,0.532983,0.875,00:00
6,0.505176,0.512305,0.875,00:00
7,0.462008,0.495343,0.875,00:00
8,0.424171,0.481742,0.875,00:00
9,0.395186,0.470955,0.875,00:00


fold num 8


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression




epoch,train_loss,valid_loss,accuracy,time
0,0.685956,0.688266,0.5,00:00
1,0.655174,0.671575,0.625,00:00
2,0.587368,0.638689,0.75,00:00
3,0.509931,0.611456,0.75,00:00
4,0.452054,0.593212,0.75,00:00
5,0.39767,0.577433,0.75,00:00
6,0.355142,0.563061,0.875,00:00
7,0.321153,0.548312,0.875,00:00
8,0.294016,0.535285,0.875,00:00
9,0.269552,0.525381,0.875,00:00


fold num 9


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression




epoch,train_loss,valid_loss,accuracy,time
0,0.587231,0.633969,0.5,00:00
1,0.596934,0.608814,0.875,00:00
2,0.551342,0.57168,0.875,00:00
3,0.488635,0.526712,0.875,00:00
4,0.439118,0.487738,0.875,00:00
5,0.399527,0.448704,0.875,00:00
6,0.365224,0.418393,0.875,00:00
7,0.329432,0.392143,0.875,00:00
8,0.304603,0.372309,0.875,00:00
9,0.283798,0.357904,0.875,00:00


fold num 10


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression




epoch,train_loss,valid_loss,accuracy,time
0,0.681245,0.722606,0.5,00:00
1,0.646734,0.720947,0.5,00:00
2,0.578844,0.709337,0.625,00:00
3,0.51226,0.679567,0.625,00:00
4,0.456178,0.641594,0.625,00:00
5,0.410669,0.598661,0.625,00:00
6,0.374916,0.566063,0.625,00:00
7,0.342692,0.54325,0.625,00:00
8,0.32015,0.530267,0.75,00:00
9,0.298653,0.521244,0.75,00:00


In [27]:
mm.printMeasures()

model     acc        
--------------
rf        93.75±6.25
dt        92.50±8.29
knn       85.00±10.90
svm       81.25±11.52
lr        70.00±12.75
lda       87.50±12.50
ab        98.75±3.75
nb        73.75±6.73
keras     70.00±16.96
fastai    85.00±10.90
