## Import Libraries

In [18]:
# preprocessing/data manipulation
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.model_selection import RandomizedSearchCV, cross_val_score, train_test_split
from sklearn.metrics import make_scorer, accuracy_score, classification_report
import pandas as pd

# classifiers
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
import xgboost as xgb

## Read CSVs

For this project, we evaluate the synthesized dataset "VAE_Synthetic" by comparing it against the original dataset "condensed_dataset".

We do this by first training various classification models on the synthetic dataset, and then evaluating the best-performing model on the original dataset.

In [19]:
test_data = pd.read_csv('/content/condensed_dataset.csv')

train_data = pd.read_csv('/content/VAE_Synthetic.csv')

## Splitting and Pre-processing Dataset

Here, we begin pre-processing the dataset.

We first check the DataFrame for any NA values. We then split the dataset 80/20 into training and testing sets, stratifying on the target variable, and apply a standard scaler. Missing values are imputed with the column median, which is less sensitive to outliers and skewed data than the mean.

In [20]:
train_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 500 entries, 0 to 499
Data columns (total 23 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   dt           500 non-null    int64  
 1   switch       500 non-null    int64  
 2   src          500 non-null    object 
 3   dst          500 non-null    object 
 4   pktcount     500 non-null    int64  
 5   bytecount    500 non-null    int64  
 6   dur          500 non-null    int64  
 7   dur_nsec     500 non-null    int64  
 8   tot_dur      500 non-null    float64
 9   flows        500 non-null    int64  
 10  packetins    500 non-null    int64  
 11  pktperflow   500 non-null    int64  
 12  byteperflow  500 non-null    int64  
 13  pktrate      500 non-null    int64  
 14  Pairflow     500 non-null    int64  
 15  Protocol     500 non-null    object 
 16  port_no      500 non-null    int64  
 17  tx_bytes     500 non-null    int64  
 18  rx_bytes     500 non-null    int64  
 19  tx_kbps 

In [21]:
# Check for NAs in entire DataFrame
print(train_data.isnull().values.any())

# Check for NAs in the columns
print(train_data.isnull().any())

# Check for NAs in the rows
print(train_data.isnull().any(axis=1))

# Check for null values in DataFrame
na_ct = train_data.isnull().values.flatten().sum()

# Count number of False values
non_na_ct = train_data.size - na_ct

print("Number of True Values (NAs):", na_ct)
print("Number of False values (Non-NAs):", non_na_ct)

False
dt             False
switch         False
src            False
dst            False
pktcount       False
bytecount      False
dur            False
dur_nsec       False
tot_dur        False
flows          False
packetins      False
pktperflow     False
byteperflow    False
pktrate        False
Pairflow       False
Protocol       False
port_no        False
tx_bytes       False
rx_bytes       False
tx_kbps        False
rx_kbps        False
tot_kbps       False
label          False
dtype: bool
0      False
1      False
2      False
3      False
4      False
       ...  
495    False
496    False
497    False
498    False
499    False
Length: 500, dtype: bool
Number of True Values (NAs): 0
Number of False values (Non-NAs): 11500
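
The checks above can be condensed into a per-column summary. The following is a minimal sketch on a small hypothetical DataFrame, `toy_df`, which stands in for `train_data` and deliberately contains two missing values; the column values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for train_data
toy_df = pd.DataFrame({
    "pktcount": [10, 20, np.nan, 40],
    "Protocol": ["TCP", None, "UDP", "ICMP"],
    "label": [0, 1, 0, 1],
})

# Missing values per column, and the overall total
na_per_column = toy_df.isna().sum()
total_na = int(na_per_column.sum())

# Rows that contain at least one missing value
rows_with_na = toy_df[toy_df.isna().any(axis=1)]

print(na_per_column)
print("Total NAs:", total_na, "of", toy_df.size, "cells")
```

A per-column count is usually easier to act on than the row-by-row boolean output, since it points directly at the columns that need imputation.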


In [22]:
# Create a mapping dictionary for all categorical columns
mapping_dict = {}

for col in train_data.columns:
    if train_data[col].dtype == 'object':
        mapping = {label: idx for idx, label in enumerate(np.unique(train_data[col]))}
        mapping_dict[col] = mapping

for col, mapping in mapping_dict.items():
    train_data[col] = train_data[col].map(mapping)

In [23]:
# Training Data
X = train_data.drop(columns=['label'])
y = train_data['label']

# Stratified train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
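
To illustrate what `stratify=y` does in the split above, here is a small sketch on hypothetical imbalanced labels (`X_toy` and `y_toy` are made up, not the project data). It checks that the class proportions are roughly the same in both splits.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 70 of class 0, 30 of class 1
y_toy = np.array([0] * 70 + [1] * 30)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# Share of class 1 in each split
train_share = y_tr.mean()
test_share = y_te.mean()
print(f"class 1 share: train={train_share:.2f}, test={test_share:.2f}")
```

Without stratification, a small test set can end up with a noticeably different class balance from the training set, which makes accuracy harder to interpret.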

In [24]:
# Adding standard scaling to X_train and X_test
sc = StandardScaler()

X_train_scaled = sc.fit_transform(X_train)
X_test_scaled = sc.transform(X_test)

In [25]:
# Handling Missing Data
imputer = SimpleImputer(strategy='median')
X_train_imputed = imputer.fit_transform(X_train_scaled)
X_test_imputed = imputer.transform(X_test_scaled)
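
The cells above scale first and impute second, and both transformers are fit on all of `X_train` before cross-validation. One common alternative, shown here only as a sketch and not as this notebook's method, is to put imputation, scaling, and the classifier in a single pipeline so that each step is refit on the training folds only. The data below (`X_toy`, `y_toy`) is hypothetical.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical data: two features, label depends on the first one
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 2))
y_toy = (X_toy[:, 0] > 0).astype(int)

# Knock out a few entries to simulate missing values
X_toy[rng.choice(200, size=10, replace=False), 1] = np.nan

# Impute, then scale, then classify; each step is fit on the training folds only
pipe = make_pipeline(
    SimpleImputer(strategy="median"),
    StandardScaler(),
    LogisticRegression(max_iter=2000),
)

scores = cross_val_score(pipe, X_toy, y_toy, cv=5, scoring="accuracy")
print("Mean accuracy:", scores.mean())
```

Since the project data contains no missing values, the imputation step has no effect here either way; the pipeline form mainly matters when the data does have gaps.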

## Employing Classification Methods on Training Dataset

After pre-processing is complete, we run the training dataset through each classification method.

The purpose of this project is to determine the quality of a model fitted on the highest-quality synthesized dataset from our previous project. To that end, six classification methods are evaluated to determine which one to apply to the final test dataset.

We first introduce each classifier and then run cross-validation to check the robustness of each model. All models use 5 folds; SVM is additionally tuned with a randomized hyperparameter search.

The per-fold accuracy scores are printed for comparison, and the mean accuracy across the 5 folds is used as the final metric to choose the best model.
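
The comparison procedure described above can be sketched as a single loop. This is a minimal illustration on hypothetical data generated with `make_classification`; the `candidates` and `summary` names are introduced here for illustration and only three of the six classifiers are included to keep it short.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Hypothetical stand-in for the pre-processed training data
X_toy, y_toy = make_classification(n_samples=300, n_features=10, random_state=42)

candidates = {
    "Logistic Regression": LogisticRegression(max_iter=2000),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "KNN": KNeighborsClassifier(n_neighbors=5),
}

# Mean and spread of 5-fold accuracy for each candidate
summary = {}
for name, model in candidates.items():
    fold_scores = cross_val_score(model, X_toy, y_toy, cv=5, scoring="accuracy")
    summary[name] = (fold_scores.mean(), fold_scores.std())
    print(f"{name}: mean={summary[name][0]:.3f}, std={summary[name][1]:.3f}")
```

Reporting the standard deviation alongside the mean helps judge whether a small difference between two models is larger than the fold-to-fold variation.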

## Logistic Regression

Logistic Regression scores suggest that this model could serve as the primary classifier for the test dataset. It performs almost as well as SVM, which may indicate that the decision boundary between classes is approximately linear. With a mean accuracy score of 0.97, it was one of the better methods overall.

In [26]:
# Introduce model
lr_clf = LogisticRegression(max_iter=2000, penalty='l2', C=1.0, random_state=42)

# Cross-validation
cv_lr = cross_val_score(lr_clf, X_train_imputed, y_train, cv=5, scoring='accuracy')

# Print Accuracy scores
print("Cross-validation Accuracy scores:", cv_lr)
print("Mean Accuracy score:", cv_lr.mean())

Cross-validation Accuracy scores: [1.     0.9375 0.975  0.9625 0.975 ]
Mean Accuracy score: 0.97


## SVM

**This classification method was chosen**, as it had the highest mean accuracy score among the tested classifiers (0.9725). This suggests that it may be the best fit for this specific dataset among the models tested, although its margin over Logistic Regression (0.97) is small.

In [27]:
# This classifier was chosen due to having the highest mean accuracy score overall

# Create a pipeline for SVM and Standard Scaling
pipe_svc = make_pipeline(
    StandardScaler(),
    SVC(random_state=42),
)

# Define parameter distributions for random search
param_range = [0.01, 1.0, 100.0]

param_grid = [{'svc__C': param_range,
              'svc__kernel': ['linear']},
             {'svc__C': param_range,
             'svc__gamma': param_range,
             'svc__kernel': ['rbf']}]

# Creating Randomized Search, setting estimators, parameters, and maximum CPU usage
rs = RandomizedSearchCV(estimator=pipe_svc, param_distributions=param_grid, scoring='accuracy', refit=True, n_iter=5, cv=5, random_state=1, n_jobs=-1)
rs.fit(X_train_imputed, y_train)

# Get the best parameters and best accuracy score
best_params = rs.best_params_
best_score = rs.best_score_

# Print Accuracy scores
print("Best Parameters:", best_params)
print("Best Accuracy Score:", best_score)

Best Parameters: {'svc__kernel': 'linear', 'svc__C': 1.0}
Best Accuracy Score: 0.9724999999999999
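
To put `n_iter=5` in context, the sketch below counts how many distinct settings the search space above contains. It reuses the same `param_grid` and relies on `ParameterGrid` from scikit-learn to enumerate the combinations; `all_candidates` and `n_candidates` are names introduced here for illustration.

```python
from sklearn.model_selection import ParameterGrid

param_range = [0.01, 1.0, 100.0]

param_grid = [
    {"svc__C": param_range, "svc__kernel": ["linear"]},
    {"svc__C": param_range, "svc__gamma": param_range, "svc__kernel": ["rbf"]},
]

# 3 linear settings + 3 * 3 RBF settings
all_candidates = list(ParameterGrid(param_grid))
n_candidates = len(all_candidates)
n_iter = 5

print("Total candidates:", n_candidates)
print(f"Fraction sampled with n_iter={n_iter}: {n_iter / n_candidates:.2f}")
```

Because the randomized search only evaluates a subset of these settings, the reported best parameters are the best among those sampled, not necessarily the best of the full grid. With a space this small, an exhaustive `GridSearchCV` would also be feasible.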


## Decision Tree

Decision Tree achieved the lowest accuracy score of all the methods, with a mean of approximately 0.94. This suggests that the dataset may be noisy or complex, and that a single decision tree may not be enough to capture such nuances. A decision tree's tendency to overfit may also contribute to its lower score.

In [28]:
# Introduce model
dt_clf = DecisionTreeClassifier(random_state=42)

# Cross-validation
cv_dt = cross_val_score(dt_clf, X_train_imputed, y_train, cv=5, scoring='accuracy')

# Print Accuracy scores
print("Cross-validation Accuracy Scores:", cv_dt)
print("Mean Accuracy score:", cv_dt.mean())

Cross-validation Accuracy Scores: [0.975  0.9125 0.95   0.95   0.9125]
Mean Accuracy score: 0.9399999999999998
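
The overfitting tendency mentioned above can be seen by comparing a tree's accuracy on its own training data with its cross-validated accuracy. This is a sketch on hypothetical noisy data (`X_toy`, `y_toy`), not on the project dataset.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Hypothetical noisy data: flip_y assigns random labels to about 10% of samples
X_toy, y_toy = make_classification(
    n_samples=400, n_features=10, flip_y=0.1, random_state=42
)

tree = DecisionTreeClassifier(random_state=42)

# Accuracy on the data the tree was fit on
train_accuracy = tree.fit(X_toy, y_toy).score(X_toy, y_toy)

# Accuracy on held-out folds
cv_accuracy = cross_val_score(tree, X_toy, y_toy, cv=5, scoring="accuracy").mean()

print(f"train={train_accuracy:.3f}, cross-validated={cv_accuracy:.3f}")
```

An unpruned tree memorizes the training set, including its noisy labels, so the gap between the two numbers gives a rough sense of how much it overfits. Limiting `max_depth` is one common way to narrow that gap.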


## K-Nearest Neighbors

KNN performed well, achieving a mean accuracy score of 0.96. This suggests that the class distribution within the dataset allows for effective classification based on proximity. However, KNN is slightly edged out on mean accuracy by every other model except Decision Tree.

In [29]:
# Introduce model
knn_clf = KNeighborsClassifier(n_neighbors=5, n_jobs=-1)

# Cross-validation
cv_knn = cross_val_score(knn_clf, X_train_imputed, y_train, cv=5, scoring='accuracy')

# Print Accuracy scores
print("Cross-validation Accuracy Scores:", cv_knn)
print("Mean Accuracy score:", cv_knn.mean())

Cross-validation Accuracy Scores: [0.975  0.9375 0.95   0.9625 0.975 ]
Mean Accuracy score: 0.96
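
To make "classification based on proximity" concrete, here is a minimal NumPy sketch of the majority vote that KNN performs. The helper `knn_predict` and the toy points are hypothetical and only illustrate the idea; `KNeighborsClassifier` is what the notebook actually uses.

```python
from collections import Counter

import numpy as np


def knn_predict(X_train, y_train, query, k=5):
    """Return the majority label among the k training points closest to query."""
    distances = np.linalg.norm(X_train - query, axis=1)
    nearest = np.argsort(distances)[:k]
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]


# Two hypothetical, well-separated clusters
X_toy = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                  [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

pred_near_origin = knn_predict(X_toy, y_toy, np.array([0.3, 0.3]), k=3)
pred_near_five = knn_predict(X_toy, y_toy, np.array([4.8, 5.0]), k=3)
print(pred_near_origin, pred_near_five)
```

Because the vote depends on Euclidean distance, features on larger scales would dominate without the standard scaling applied earlier.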


## Random Forest

While one decision tree alone may not be enough to produce an adequate accuracy score, combining multiple decision trees can improve predictions considerably. Random Forest yielded a mean accuracy of approximately 0.9675. Averaging over many trees reduces variance, which raises the overall accuracy compared to a single tree.

While it did not produce the highest score, this method is still worth considering for larger and more complex datasets.

In [30]:
# Introduce model, setting additional parameters for n number of trees and maximum CPU usage
rf_clf = RandomForestClassifier(n_estimators=50, n_jobs=-1, random_state=42)

# Cross-validation
cv_rf = cross_val_score(rf_clf, X_train_imputed, y_train, cv=5, scoring='accuracy')

# Print Accuracy scores
print("Cross-validation Accuracy Scores:", cv_rf)
print("Mean Accuracy score:", cv_rf.mean())

Cross-validation Accuracy Scores: [0.9875 0.95   0.9625 0.9625 0.975 ]
Mean Accuracy score: 0.9674999999999999


## XGBoost

Our last classification method, XGBoost, yielded a mean accuracy of approximately 0.965. While this result is adequate on its own, and XGBoost may perform better on larger and more complex datasets, SVM slightly edged out XGBoost's score.

In [31]:
# Introduce model
xgb_clf = xgb.XGBClassifier(random_state=42)

# Cross-validation
cv_xgb = cross_val_score(xgb_clf, X_train_imputed, y_train, cv=5, scoring='accuracy')

# Print Accuracy scores
print("Cross-validation Accuracy Scores:", cv_xgb)
print("Mean Accuracy score:", cv_xgb.mean())

Cross-validation Accuracy Scores: [0.9875 0.95   0.9625 0.9625 0.9625]
Mean Accuracy score: 0.9650000000000001


## Model Comparison At-a-Glance

Six models were tested. SVM performed the best, with Logistic Regression close behind.
The performance of each model is as follows:

| Model | Mean Accuracy Score |
| :--------: | :--------------------: |
| Logistic Regression | 0.97 |
| SVM | **0.9725** |
| Decision Tree | 0.94 |
| KNN | 0.96 |
| Random Forest | 0.9675 |
| XGBoost | 0.965 |

Based on these results, SVM achieved the highest mean accuracy, so we use SVM for our final model.
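
The ranking can be reproduced from the printed outputs. The sketch below recomputes the Logistic Regression mean from its fold scores and sorts the mean accuracies reported in this notebook; the variable names are introduced here for illustration.

```python
import numpy as np
import pandas as pd

# Fold scores for Logistic Regression, copied from its cell output
lr_folds = np.array([1.0, 0.9375, 0.975, 0.9625, 0.975])
lr_mean = lr_folds.mean()

# Mean accuracies as reported in this notebook
mean_scores = pd.Series({
    "Logistic Regression": 0.97,
    "SVM": 0.9725,
    "Decision Tree": 0.94,
    "KNN": 0.96,
    "Random Forest": 0.9675,
    "XGBoost": 0.965,
})

ranking = mean_scores.sort_values(ascending=False).round(4)
best_model = ranking.index[0]
gap_to_runner_up = ranking.iloc[0] - ranking.iloc[1]

print(ranking)
print("Best:", best_model, "| margin over runner-up:", round(gap_to_runner_up, 4))
```

The margin between the top two models is a quarter of a percentage point, which is smaller than the spread between folds of a single model, so the ranking among the top models should be read with some caution.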

## Preparing Classifier Method for Test Dataset

Here, we repeat the pre-processing code, this time for both the training and testing datasets, so that they can be evaluated by the SVM algorithm. The code remains mostly the same as earlier. The biggest difference is that the full synthetic dataset is now used for training and the original dataset for testing, instead of an 80/20 split of the synthetic data.

In [32]:
# Create a mapping dictionary for all categorical columns
mapping_dict = {}

# Creating mapping dictionary for training data
# (if the earlier encoding cell has already run, train_data has no object
# columns left, so these two loops leave train_data unchanged)
for col in train_data.columns:
    if train_data[col].dtype == 'object':
        mapping = {label: idx for idx, label in enumerate(np.unique(train_data[col]))}
        mapping_dict[col] = mapping

for col, mapping in mapping_dict.items():
    train_data[col] = train_data[col].map(mapping)

# Creating mapping dictionary for testing data
# Note: the test data is encoded with its own mapping, so its codes match the
# training codes only when both datasets contain the same categories per column
for col in test_data.columns:
    if test_data[col].dtype == 'object':
        mapping = {label: idx for idx, label in enumerate(np.unique(test_data[col]))}
        mapping_dict[col] = mapping

for col, mapping in mapping_dict.items():
    test_data[col] = test_data[col].map(mapping)

# Training Data
X_train = train_data.drop(columns=['label'])
y_train = train_data['label']

# Test Data
X_test = test_data.drop(columns=['label'])
y_test = test_data['label']

# Adding standard scaling to X_train and X_test
sc = StandardScaler()

X_train_scaled = sc.fit_transform(X_train)
X_test_scaled = sc.transform(X_test)

# Handling Missing Data
imputer = SimpleImputer(strategy='median')
X_train_imputed = imputer.fit_transform(X_train_scaled)
X_test_imputed = imputer.transform(X_test_scaled)
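
The cell above builds a separate category mapping for the test data. A possible alternative, shown here as a sketch under assumptions and not as this notebook's method, is to learn the mapping from the training frame only and reuse it for the test frame, so that the same category always receives the same code. The frames `train_toy` and `test_toy`, the `categorical_cols` list, and the choice of `-1` for unseen categories are all hypothetical.

```python
import pandas as pd

# Hypothetical frames standing in for the synthetic (train) and original (test) data
train_toy = pd.DataFrame({"Protocol": ["ICMP", "TCP", "UDP", "TCP"]})
test_toy = pd.DataFrame({"Protocol": ["UDP", "TCP", "GRE"]})

# Columns to encode (listed explicitly for this illustration)
categorical_cols = ["Protocol"]

# Learn one mapping per categorical column from the training frame only
mappings = {
    col: {cat: idx for idx, cat in enumerate(sorted(train_toy[col].unique()))}
    for col in categorical_cols
}

# Apply the same mapping to both frames; categories unseen in training become -1
for col, mapping in mappings.items():
    train_toy[col] = train_toy[col].map(mapping)
    test_toy[col] = test_toy[col].map(mapping).fillna(-1).astype(int)

print(mappings)
print(test_toy["Protocol"].tolist())
```

If both datasets contain exactly the same categories in each column, the two approaches produce identical codes; they differ only when a category is missing from one of the datasets.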

## Evaluating Testing Dataset and Generating CSV File

Once the training and testing datasets have been pre-processed, the testing dataset is evaluated with our chosen classifier, and the predictions are written to a CSV file.

In [33]:
# Fit the model to the training data

# Define parameter distributions for random search
param_range = [0.01, 1.0, 100.0]

param_grid = [{'svc__C': param_range,
              'svc__kernel': ['linear']},
             {'svc__C': param_range,
             'svc__gamma': param_range,
             'svc__kernel': ['rbf']}]

# Creating Randomized Search, setting estimators, parameters, and maximum CPU usage
rs = RandomizedSearchCV(estimator=pipe_svc, param_distributions=param_grid, scoring='accuracy', refit=True, n_iter=5, cv=5, random_state=1, n_jobs=-1)
rs.fit(X_train_imputed, y_train)

# Use the fitted model to predict labels for the test data
rs_pred = rs.predict(X_test_imputed)

# Check accuracy of the test data
accuracy = accuracy_score(y_test, rs_pred)
print(f"Accuracy: {accuracy}")

# Create a new DataFrame with the original test data and replace the 'label' column
result = test_data.copy()
result['label'] = rs_pred

# Save the result DataFrame to a CSV file
result.to_csv('result.csv', index=False)

Accuracy: 0.96
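
Overall accuracy does not show which class the errors fall in. The sketch below uses `confusion_matrix` and `classification_report` from scikit-learn on hypothetical labels (`y_true_toy`, `y_pred_toy`) to illustrate a per-class breakdown; the numbers are made up and are not this project's results.

```python
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Hypothetical true and predicted labels (not the project's results)
y_true_toy = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred_toy = [0, 0, 0, 1, 1, 1, 1, 0]

toy_accuracy = accuracy_score(y_true_toy, y_pred_toy)

# Rows are true classes, columns are predicted classes
toy_cm = confusion_matrix(y_true_toy, y_pred_toy)

print("Accuracy:", toy_accuracy)
print(toy_cm)
print(classification_report(y_true_toy, y_pred_toy))
```

The same calls could be applied to `y_test` and `rs_pred` to see precision and recall for each label in the original dataset.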


## Final Visualization

Once the CSV file has been generated, a final check and visualization of the DataFrame is conducted before submission.

In [34]:
submission = pd.read_csv('/content/result.csv')
print(submission)

value_counts = submission['label'].value_counts()
print(value_counts)

        dt  switch  src  dst  pktcount  bytecount  dur   dur_nsec  \
0    11425       1    0    1     45304   48294064  100  716000000   
1    11605       1    0    1    126395  134737070  280  734000000   
2    11425       1    1    1     90333   96294978  200  744000000   
3    11425       1    1    1     90333   96294978  200  744000000   
4    11425       1    1    1     90333   96294978  200  744000000   
..     ...     ...  ...  ...       ...        ...  ...        ...   
995   9966       1    0    0     59972   63930152  133  252000000   
996   9966       1    0    0     59972   63930152  133  252000000   
997   9966       1    0    0     59972   63930152  133  252000000   
998   9966       1    0    0     59972   63930152  133  252000000   
999   9936       1    0    0     46440   49505040  103  248000000   

          tot_dur  flows  ...  pktrate  Pairflow  Protocol  port_no  \
0    1.010000e+11      3  ...      451         0         0        3   
1    2.810000e+11      2  ...

## Conclusion

There are some key takeaways from this project.

The first is that a generally high mean accuracy score was maintained across all the models tested, and the SVM trained on the synthetic data reached an accuracy of 0.96 on the original dataset. This suggests that the dataset was generated in a way that maintains the underlying patterns of the original dataset. In other words, the synthetic dataset was a good proxy for the original data, at least for these models.

Another takeaway is that Logistic Regression and SVM performed similarly in terms of accuracy, and the best SVM used a linear kernel. This suggests that the classes are well separated in the feature space, approximately linearly so.

Finally, these results indicate that the synthetic data is of high quality and that it preserved the relationships and features present in the original dataset. Thus, we conclude that the synthetic dataset is useful for model testing and validation.