# Replication of Experiments
This notebook attempts to replicate the experiments presented in Arash *et al.* using the ISCXTor2016 dataset provided by the Canadian Institute for Cybersecurity at the University of New Brunswick (CIC-UNB). The experiments in that work are split into Scenario-A and Scenario-B. Scenario-A classifies traffic samples as Tor or NonTor, while Scenario-B classifies the type of traffic (FTP, browsing, video and audio streaming, VoIP, chat, mail, and P2P) seen in Tor samples.

Let's begin with Scenario-A.

## Scenario-A
In these experiments, the team uses three machine learning algorithms: ZeroR, C4.5, and k-Nearest Neighbors. The experiments were originally run in Weka; here we attempt to replicate them using the `sklearn` Python library.

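The following `sklearn` models are used as the closest available counterparts (note that `DecisionTreeClassifier` implements an optimized version of CART rather than C4.5):  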
 - ZeroR &rarr; DummyClassifier
 - C4.5 &rarr; DecisionTreeClassifier
 - k-Nearest Neighbor &rarr; KNeighborsClassifier
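
As a minimal sketch of why `DummyClassifier` can stand in for ZeroR: with `strategy='most_frequent'` it always predicts the majority class, so its accuracy equals the share of that class in the evaluated data. The toy labels and the names `X_toy`, `y_toy`, and `zero_r` below are made up for illustration and are not part of the dataset.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Toy labels (hypothetical): 0 = NonTor, 1 = Tor
y_toy = np.array([0, 0, 0, 0, 1, 1])
# DummyClassifier ignores the features, so a placeholder column is enough
X_toy = np.zeros((len(y_toy), 1))

zero_r = DummyClassifier(strategy='most_frequent')
zero_r.fit(X_toy, y_toy)

predictions = zero_r.predict(X_toy)   # always the majority class
majority_share = np.mean(y_toy == 0)  # fraction of samples in the majority class
accuracy = zero_r.score(X_toy, y_toy)

print(predictions, majority_share, accuracy)
```

This is why the ZeroR baseline is best read as a measure of class imbalance in each file rather than as a learned model.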

In [1]:
# DataFrame handling
import pandas as pd

# Models
from sklearn.dummy import DummyClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier

# Split data with stratified cv
from sklearn.model_selection import StratifiedKFold

# Encoding of classifications
from sklearn.preprocessing import LabelEncoder

print('Imports complete.')

Imports complete.


In [2]:
# Set up a few constants to keep track of
random_state=1
path='../../tor_dataset/Scenario-A/'
dep_var = 'class'

### Preprocessing
We have to load the dataset and encode the class labels (`y`) as integers so we can hand them to the `sklearn` models.

In [3]:
def get_Xy(filename='', verbose=False):
    """
        Load a CSV file into a DataFrame, then separate the features from the class labels.
        
        args:
            filename => str, path to csv file to be loaded
            verbose => bool, print the data before and after encoding
            
        returns:
            (X, y) => feature DataFrame, integer-encoded class labels
    """
    df = pd.read_csv(filename)
    
    if verbose:
        print('Before encoding and splitting:')
        print(df.head())
    
    # Actual data
    X = df.loc[:, df.columns != dep_var]
    
    # Classifications
    encoder = LabelEncoder()
    y = encoder.fit_transform(df[dep_var])
    
    if verbose:
        print('Classification encoding:')
        for i in range(len(encoder.classes_)):
            print('\t{} => {}'.format(i, encoder.classes_[i]))
        
        print('After encoding and splitting:')
        print('X = ')
        print(X.head())
        print('\ny = ')
        print(y[:5])
    
    # X holds the data while y holds the classifications
    return X, y
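
The encoding step inside `get_Xy` can be illustrated on a tiny in-memory CSV. This is only a sketch: the column values and the label strings `Tor` and `NonTor` are hypothetical stand-ins, and the real files may spell the labels differently. `LabelEncoder` assigns integers in sorted label order, and `inverse_transform` recovers the original strings.

```python
import io

import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical three-row CSV standing in for one of the dataset files
csv_text = "duration,total_fiat,class\n10,1,Tor\n20,2,NonTor\n30,3,Tor\n"
toy_df = pd.read_csv(io.StringIO(csv_text))

# Features: every column except the label column
toy_X = toy_df.drop(columns='class')

# Labels: strings mapped to integers in sorted order
toy_encoder = LabelEncoder()
toy_y = toy_encoder.fit_transform(toy_df['class'])

print(list(toy_encoder.classes_))            # label order used for the encoding
print(toy_y)                                 # integer-encoded labels
print(toy_encoder.inverse_transform(toy_y))  # back to the original strings
```

Keeping the fitted encoder around (here `toy_encoder`) is what makes it possible to translate predictions back into readable class names later.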

In [4]:
# Demonstration of how the function operates...
X, y = get_Xy(path + 'TimeBasedFeatures-15s-TOR-NonTOR.csv', verbose=True)

Before encoding and splitting:
   duration  total_fiat  total_biat  min_fiat  min_biat       max_fiat  \
0   9368711          16           4   1564818   1549373  190205.285714   
1   7340238          18           4   1567554   1527893  165686.977273   
2   4644225          29          15   1270547   1079974  165865.178571   
3   4978735          19           8   2492050   2457286  239543.250000   
4  11838189          19          10   3094089   3093543  243766.500000   

        max_biat      mean_fiat      mean_biat  flowPktsPerSecond  ...  \
0  203290.456522  389822.391917  370323.719754          10.353612  ...   
1  186914.846154  317267.548742  304370.651301          11.580006  ...   
2  195302.130435  329473.126261  300492.588227          11.412022  ...   
3  276596.388889  612435.304238  628339.573544           8.034169  ...   
4  295954.725000  599721.781709  625632.703972           7.602514  ...   

     std_flowiat  min_active   mean_active  max_active    std_active  \
0  2676

In [5]:
# All of the data files
files=['TimeBasedFeatures-15s-TOR-NonTOR.csv', 
       'TimeBasedFeatures-30s-TOR-NonTOR.csv', 
       'TimeBasedFeatures-60s-TOR-NonTOR.csv', 
       'TimeBasedFeatures-120s-TOR-NonTOR.csv']

# Lists for accuracies collected from models
list_dummy = []
list_dt = []
list_knn = []

for file in files:
    print('Training for {}...'.format(file), end='')
    
    # Load in the data
    X, y = get_Xy(path + file)
    
    # Mean accuracies for each model
    mean_dummy = 0 # This is the worst kind of dummy
    mean_dt = 0
    mean_knn = 0
    
    # 10-fold Stratified Cross-Validation
    n_splits = 10
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=random_state)
    for train_idxs, test_idxs in skf.split(X, y):
        # Define the training and testing sets
        X_train, X_test = X.iloc[train_idxs], X.iloc[test_idxs]
        y_train, y_test = y[train_idxs], y[test_idxs]
        
        # Initialize the models
        dummy = DummyClassifier(strategy='most_frequent')
        dt = DecisionTreeClassifier(random_state=random_state)
        knn = KNeighborsClassifier()
        
        # Train the models
        dummy.fit(X_train, y_train)
        dt.fit(X_train, y_train)
        knn.fit(X_train, y_train)
        
        # Evaluate the models
        results_dummy = dummy.score(X_test, y_test)
        results_dt = dt.score(X_test, y_test)
        results_knn = knn.score(X_test, y_test)  
        
        # Add the results to the running mean
        mean_dummy += results_dummy / (n_splits * 1.0)
        mean_dt += results_dt / (n_splits * 1.0)
        mean_knn += results_knn / (n_splits * 1.0)
    
    # Push the mean results from all of the splits to the lists
    list_dummy.append(mean_dummy)
    list_dt.append(mean_dt)
    list_knn.append(mean_knn)
    
    print('done')
    
print('All trainings complete!')

Training for TimeBasedFeatures-15s-TOR-NonTOR.csv...done
Training for TimeBasedFeatures-30s-TOR-NonTOR.csv...done
Training for TimeBasedFeatures-60s-TOR-NonTOR.csv...done
Training for TimeBasedFeatures-120s-TOR-NonTOR.csv...done
All trainings complete!
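
The manual fold loop above computes a mean accuracy over ten stratified folds. As a hedged sketch, the same quantity can be obtained with `cross_val_score`, which fits a fresh clone of the estimator on each fold. The data here is synthetic (generated with `make_classification`), not the ISCXTor2016 files, and the `*_demo` names are illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption: any binary classification set works here)
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=1)

n_splits_demo = 10
skf_demo = StratifiedKFold(n_splits=n_splits_demo, shuffle=True, random_state=1)
dt_demo = DecisionTreeClassifier(random_state=1)

# Manual loop, mirroring the cell above
manual_mean = 0.0
for train_idxs, test_idxs in skf_demo.split(X_demo, y_demo):
    dt_demo.fit(X_demo[train_idxs], y_demo[train_idxs])
    manual_mean += dt_demo.score(X_demo[test_idxs], y_demo[test_idxs]) / n_splits_demo

# Compact version: default scoring for a classifier is accuracy
scores = cross_val_score(dt_demo, X_demo, y_demo, cv=skf_demo)

print(manual_mean, scores.mean())
```

Because the splitter has a fixed `random_state`, both approaches see the same folds, so the two means should agree up to floating-point rounding.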


### Results
Below are the final results in attempting to replicate Scenario-A's experiments.

In [6]:
# Output results
print('File\t\t\t\t\tDummy\tDecision Tree\tk-Nearest Neighbor')
print('-'*82)
for i in range(len(files)):
    print('{}\t{:.2f}%\t{:.2f}%\t\t{:.2f}%'.format(files[i], 100*list_dummy[i], 100*list_dt[i], 100*list_knn[i]))

File					Dummy	Decision Tree	k-Nearest Neighbor
----------------------------------------------------------------------------------
TimeBasedFeatures-15s-TOR-NonTOR.csv	84.99%	99.91%		99.88%
TimeBasedFeatures-30s-TOR-NonTOR.csv	89.22%	99.90%		99.93%
TimeBasedFeatures-60s-TOR-NonTOR.csv	94.44%	99.94%		99.91%
TimeBasedFeatures-120s-TOR-NonTOR.csv	95.82%	99.96%		99.92%


If we compare the data we see above with the reported metrics...  
![Results from Arash *et al.* for Scenario-A](../media/scenarioa_paper.png)  
...we see that our results closely match the team's. Although we collected only accuracy here, the reported recall and precision are extremely close to the accuracies above. In the case of ZeroR, our model's accuracy falls roughly midway between the reported ZeroR recall and precision.
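
Since the paper reports precision and recall while this notebook collects accuracy, a sketch of how all three could be gathered in one pass may help with the comparison. It uses `cross_validate` with several scorers on synthetic, imbalanced data. Weighted averaging is an assumption made here; the paper's figures may be averaged differently. The names `X_imb`, `y_imb`, and `cv_demo` are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.metrics import make_scorer, precision_score, recall_score
from sklearn.model_selection import StratifiedKFold, cross_validate

# Synthetic, imbalanced stand-in data (roughly 85% / 15%), not the real dataset
X_imb, y_imb = make_classification(n_samples=400, n_features=8,
                                   weights=[0.85, 0.15], random_state=1)

# zero_division=0 avoids warnings when a class is never predicted (as with ZeroR)
scoring = {
    'accuracy': 'accuracy',
    'precision': make_scorer(precision_score, average='weighted', zero_division=0),
    'recall': make_scorer(recall_score, average='weighted', zero_division=0),
}

cv_demo = StratifiedKFold(n_splits=10, shuffle=True, random_state=1)
results = cross_validate(DummyClassifier(strategy='most_frequent'),
                         X_imb, y_imb, cv=cv_demo, scoring=scoring)

for name in scoring:
    print(name, results['test_' + name].mean())
```

With weighted averaging, recall coincides with accuracy by construction, while the majority-class baseline's precision comes out lower because the never-predicted class contributes zero.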

## Scenario-B
In these experiments, the team uses three machine learning algorithms: Random Forest, C4.5, and k-Nearest Neighbors. The experiments were originally run in Weka; here we attempt to replicate them using the `sklearn` Python library.

The following `sklearn` models will be used:
 - Random Forest &rarr; RandomForestClassifier
 - C4.5 &rarr; DecisionTreeClassifier
 - k-Nearest Neighbors &rarr; KNeighborsClassifier
 
 *Much of the process here mirrors Scenario-A. If anything is unclear, please refer to that section first.*

In [7]:
# DataFrame handling
import pandas as pd

# Models
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier

# Split data with stratified cv
from sklearn.model_selection import StratifiedKFold

# Encoding of classifications
from sklearn.preprocessing import LabelEncoder

print('Imports complete.')

Imports complete.


In [8]:
# Set up a few constants to keep track of
random_state=1
path='../../tor_dataset/Scenario-B/'
dep_var = 'class'

In [9]:
# All of the data files
files=['TimeBasedFeatures-15s-Layer2.csv',
      'TimeBasedFeatures-30s-Layer2.csv',
      'TimeBasedFeatures-60s-Layer2.csv',
      'TimeBasedFeatures-120s-Layer2.csv']

# Lists for accuracies collected from models
list_rf = []
list_dt = []
list_knn = []

for file in files:
    print('Training for {}...'.format(file), end='')
    
    # Load in the data
    X, y = get_Xy(path + file)
    
    # Mean accuracies for each model
    mean_rf = 0
    mean_dt = 0
    mean_knn = 0
    
    # 10-fold Stratified Cross-Validation
    n_splits = 10
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=random_state)
    for train_idxs, test_idxs in skf.split(X, y):
        # Define the training and testing sets
        X_train, X_test = X.iloc[train_idxs], X.iloc[test_idxs]
        y_train, y_test = y[train_idxs], y[test_idxs]
        
        # Initialize the models
        rf = RandomForestClassifier(random_state=random_state)
        dt = DecisionTreeClassifier(random_state=random_state)
        knn = KNeighborsClassifier()
        
        # Train the models
        rf.fit(X_train, y_train)
        dt.fit(X_train, y_train)
        knn.fit(X_train, y_train)
        
        # Evaluate the models
        results_rf = rf.score(X_test, y_test)
        results_dt = dt.score(X_test, y_test)
        results_knn = knn.score(X_test, y_test)  
        
        # Add the results to the running mean
        mean_rf += results_rf / (n_splits * 1.0)
        mean_dt += results_dt / (n_splits * 1.0)
        mean_knn += results_knn / (n_splits * 1.0)
    
    # Push the mean results from all of the splits to the lists
    list_rf.append(mean_rf)
    list_dt.append(mean_dt)
    list_knn.append(mean_knn)
    
    print('done')
    
print('All trainings complete!')

Training for TimeBasedFeatures-15s-Layer2.csv...done
Training for TimeBasedFeatures-30s-Layer2.csv...done
Training for TimeBasedFeatures-60s-Layer2.csv...done
Training for TimeBasedFeatures-120s-Layer2.csv...done
All trainings complete!


### Results
Below are the final results in attempting to replicate Scenario-B's experiments.

In [10]:
# Output results
print('File\t\t\t\t\tRandomForest\tDecision Tree\tk-Nearest Neighbor')
print('-'*82)
for i in range(len(files)):
    print('{}\t{:.2f}%\t\t{:.2f}%\t\t{:.2f}%'.format(files[i], 100*list_rf[i], 100*list_dt[i], 100*list_knn[i]))

File					RandomForest	Decision Tree	k-Nearest Neighbor
----------------------------------------------------------------------------------
TimeBasedFeatures-15s-Layer2.csv	83.75%		78.87%		71.22%
TimeBasedFeatures-30s-Layer2.csv	81.48%		77.20%		67.28%
TimeBasedFeatures-60s-Layer2.csv	80.14%		75.97%		62.72%
TimeBasedFeatures-120s-Layer2.csv	78.61%		74.07%		62.99%


If we compare our results with those from the Arash *et al.* paper...  
![Results from Arash *et al.* for Scenario-B](../media/scenariob_paper.png)  
...we see that our results, once again, line up well with those of the previous work. Here, our `sklearn` models appear to marginally outperform the Weka models, although that is not the focus of this research.

Now that we've shown we can replicate the models presented in this work, we can move on to the main dish: deep learning!

## References
Arash Habibi Lashkari, Gerard Draper-Gil, Mohammad Saiful Islam Mamun, and Ali A. Ghorbani, "Characterization of Tor Traffic Using Time Based Features," in Proceedings of the 3rd International Conference on Information Systems Security and Privacy, SCITEPRESS, Porto, Portugal, 2017.