# Classification Without Unknown Data
This notebook presents the classification performance of the chosen models in predicting the BOA tuple of a record (sample), first on the full dataset and then after the unknown records are removed. The original dataset includes records whose BOA label is unknown. A label can be unknown for only one of the three components (Browser, Operating system, Application), but the whole tuple is then considered unknown.

# Import

In [1]:
import pandas as pd
from pycaret.classification import *
import time

# Settings

In [2]:
# set constants
target_label = 'tuple'
learning_model = ['rf','et','lightgbm','xgboost']
num_features = ['min_packet_size', 'min_fpkt', 'min_bpkt']
file_name = "new_all_features_"
path = target_label + "_dataset/"

In [3]:
# run the model prediction over the data set and measure the CPU time it takes
def timed_prediction(in_data,in_model):
    t = time.process_time()
    predicted = predict_model(in_model, data=in_data)
    elapsed_time = time.process_time() - t
    print("prediction took: " + str(elapsed_time))
    return predicted
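
`time.process_time()` reports CPU time of the current process and does not include time spent sleeping or waiting, so the figure printed by `timed_prediction` is not elapsed time. A minimal sketch of measuring both, assuming a hypothetical helper named `timed_call` (not part of the notebook or of PyCaret):

```python
import time

def timed_call(func, *args, **kwargs):
    """Run func and report wall-clock and CPU time (hypothetical helper)."""
    wall_start = time.perf_counter()   # elapsed (wall-clock) time
    cpu_start = time.process_time()    # CPU time of this process
    result = func(*args, **kwargs)
    cpu_elapsed = time.process_time() - cpu_start
    wall_elapsed = time.perf_counter() - wall_start
    print(f"wall-clock: {wall_elapsed:.3f}s, CPU: {cpu_elapsed:.3f}s")
    return result, wall_elapsed, cpu_elapsed

# Sleeping costs wall-clock time but almost no CPU time.
_, wall, cpu = timed_call(time.sleep, 0.2)
```

For a multi-threaded model the CPU figure can also be larger than the wall-clock figure, so it helps to state which of the two is being reported.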

In [4]:
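# compare the predicted labels with the true answers and count the mismatches
# note: the printed value count/number_of_rows is a fraction (0.02 means 2%),
# although the message labels it with '%'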
def compare_prediction_with_answers(in_predicted, in_answers):
    count=0
    index = in_predicted.index
    number_of_rows = len(index)
    errors_arr = []
    for i in range(0,number_of_rows):
        if str(int(in_answers[i])) != str(int(in_predicted.iloc[i]['Label'])):
            count=count+1
            cur_error = str(in_answers[i]) + "!=" + str(in_predicted.iloc[i]['Label'])
            errors_arr.append(cur_error)
#             print("error in line " + str(i) +
#                   " " + str(in_answers[i]) +
#                   "!=" + str(in_predicted.iloc[i]['Label']))
#     print("Errors: " + str(errors_arr))
    print("Number of error: " + str(count) + " from " +
          str(number_of_rows) + " test samples \nWhich is "
          + str(count/number_of_rows) + "% of error.")
    return count
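
The row-by-row loop above can also be written as a vectorized comparison. The sketch below is one possible variant, not the notebook's own code: `error_summary` is a hypothetical name, the toy labels are invented, and the rate is returned as a true percentage.

```python
import pandas as pd

def error_summary(predicted_labels, answers):
    """Return (error_count, error_percent) for two equally long label sequences."""
    # Compare by position, ignoring the index labels of either input.
    pred = pd.Series(predicted_labels).reset_index(drop=True).astype(int)
    true = pd.Series(answers).reset_index(drop=True).astype(int)
    errors = int((pred != true).sum())
    return errors, 100.0 * errors / len(true)

# Toy labels, invented for illustration only.
toy_predicted = pd.DataFrame({"Label": [1101, 2203, 1101, 3302]})
toy_answers = pd.Series([1101, 2203, 1102, 3302])
count, percent = error_summary(toy_predicted["Label"], toy_answers)
print(f"{count} errors, {percent:.2f}%")
```

Comparing by position avoids depending on the index labels of the answers, which matters once rows have been filtered out.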

In [5]:
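# function for checking the correctness of the model prediction over the data
# note: the answers are taken from the 'Label' column itself, so the predictions
# are compared with themselves and 0 errors are always reported; use e.g.
# in_predicted[target_label] to compare against the true labels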
def check_correction(in_predicted):
    in_answers=in_predicted['Label']
    return compare_prediction_with_answers(in_predicted, in_answers)

# Read Data

In [7]:
data = pd.read_csv(path+file_name+target_label+'_train.csv',
                      sep='\t',
                      skiprows=[1])
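
In `pd.read_csv`, a list passed to `skiprows` holds 0-indexed line numbers, so `skiprows=[1]` keeps the header (line 0) and drops the line right after it. What that line contains in the real files is not shown here. The sketch below assumes, purely for illustration, that it is a row of type names, and uses invented column names and values:

```python
import io
import pandas as pd

# Invented tab-separated content: a header, one line to skip, two data rows.
raw = "f1\tf2\ttuple\nfloat\tfloat\tint\n1.0\t2.0\t1101\n3.0\t4.0\t2203\n"

# skiprows=[1] drops only the second physical line of the input.
df = pd.read_csv(io.StringIO(raw), sep="\t", skiprows=[1])
print(df)
```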

# Setup Classifier and Comparing Learning Models

In [8]:
setup(data=data,
      target=target_label,
      numeric_features=num_features,
      silent=True)
model = compare_models(whitelist=learning_model)

| | Model | Accuracy | AUC | Recall | Prec. | F1 | Kappa | MCC | TT (Sec) |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Extreme Gradient Boosting | 0.977 | 0.0 | 0.8834 | 0.9759 | 0.9756 | 0.973 | 0.9731 | 12.9354 |
| 1 | Extra Trees Classifier | 0.976 | 0.0 | 0.8796 | 0.9757 | 0.9749 | 0.9719 | 0.9719 | 7.0813 |
| 2 | Random Forest Classifier | 0.9664 | 0.0 | 0.8444 | 0.9654 | 0.9643 | 0.9606 | 0.9607 | 4.45 |
| 3 | Light Gradient Boosting Machine | 0.9509 | 0.0 | 0.8263 | 0.9725 | 0.9587 | 0.943 | 0.9441 | 6.3204 |


# Prediction

In [9]:
predicted = timed_prediction(data,model)

prediction took: 25.296875


In [10]:
check_correction(predicted)

Number of error: 0 from 14442 test samples 
Which is 0.0% of error.


0

# Read Unseen Test

In [11]:
unseen_data = pd.read_csv(path+file_name+target_label+'_test.csv',
                      sep='\t',
                      skiprows=[1])

In [12]:
# saving the target column
answers = unseen_data[target_label]

In [13]:
# dropping the target column from the test set
unseen_data = unseen_data.drop(columns=[target_label])

# Predict on Unseen Test

In [14]:
predicted = predict_model(model, data=unseen_data)

In [15]:
compare_prediction_with_answers(predicted,answers)

Number of error: 139 from 6189 test samples 
Which is 0.022459201809662304% of error.


139

# Removing unknown tuple labels from data

### Original data

In [18]:
data.shape

(14442, 60)

### Removing unknown browser

In [19]:
data=data[((data.tuple/100)%10) < 5]

In [22]:
data.shape

(13094, 60)

### Removing unknown OS

In [23]:
data=data[(data.tuple%10) < 4]

In [26]:
data.shape

(13094, 60)

### Removing unknown application

In [27]:
data=data[(data.tuple/1000) < 18]

In [30]:
data.shape

(11799, 60)
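
The three filters above can be combined into a single boolean mask. The sketch below reuses the thresholds exactly as they appear in the cells. The helper name `drop_unknown` is hypothetical and the toy tuple values are invented. The digit layout of the tuple code is not documented in this notebook, so the masks are named only after the headings above.

```python
import pandas as pd

def drop_unknown(df, target="tuple"):
    """Keep only rows that pass all three 'known label' conditions."""
    t = df[target]
    known_browser = ((t / 100) % 10) < 5   # condition from the browser cell
    known_os = (t % 10) < 4                # condition from the OS cell
    known_app = (t / 1000) < 18            # condition from the application cell
    return df[known_browser & known_os & known_app].reset_index(drop=True)

# Invented tuple codes, each chosen to pass or fail one condition.
toy = pd.DataFrame({"tuple": [1101, 1501, 1104, 18101, 17403]})
kept = drop_unknown(toy)
print(kept)
```

Applying the same helper to the training and the test data keeps the two filters from drifting apart.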

# Setup Classifier and Comparing Learning Models

In [31]:
setup(data=data,
      target=target_label,
      numeric_features=num_features,
      silent=True)
model = compare_models(whitelist=learning_model)

Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC,TT (Sec)
0,Extreme Gradient Boosting,0.9897,0.0,0.9562,0.9896,0.9893,0.9871,0.9871,6.1864
1,Extra Trees Classifier,0.9896,0.0,0.9552,0.99,0.9894,0.9869,0.9869,4.3139
2,Random Forest Classifier,0.9845,0.0,0.9233,0.9841,0.9837,0.9805,0.9805,2.654
3,Light Gradient Boosting Machine,0.9824,0.0,0.8996,0.988,0.9845,0.9779,0.978,3.8779


# Removing unknown tuple labels from test

In [33]:
unseen_data=unseen_data[((unseen_data.tuple/100)%10) < 5]

In [34]:
unseen_data.shape

(5611, 60)

In [35]:
unseen_data=unseen_data[(unseen_data.tuple%10) < 4]

In [36]:
unseen_data.shape

(5611, 60)

In [37]:
unseen_data=unseen_data[(unseen_data.tuple/1000) < 18]

In [38]:
unseen_data.shape

(5056, 60)

In [39]:
unseen_data=unseen_data.reset_index(drop=True)
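
Resetting the index matters because `compare_prediction_with_answers` reads `in_answers[i]`, which on a Series with an integer index is a lookup by label, not by position. After filtering, the labels have gaps. A small sketch with invented values:

```python
import pandas as pd

answers = pd.Series([1101, 1501, 2203], name="tuple")
filtered = answers[answers != 1501]   # index labels are now 0 and 2
print(list(filtered.index))

try:
    filtered[1]                       # label 1 no longer exists
    lookup_failed = False
except KeyError:
    lookup_failed = True

# After the reset, labels run 0..n-1 again and match the row positions.
renumbered = filtered.reset_index(drop=True)
print(renumbered[1])
```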

# Deal with Answers Column

In [40]:
# saving the target column
answers = unseen_data[target_label]

In [41]:
# dropping the target column from the test set
unseen_data = unseen_data.drop(columns=[target_label])

# Predict on Unseen Test

In [42]:
predicted = predict_model(model, data=unseen_data)

In [43]:
compare_prediction_with_answers(predicted,answers)

Number of error: 56 from 5056 test samples 
Which is 0.011075949367088608% of error.


56

We can see that removing the records with unknown labels reduces the errors on the unseen test set from 139 of 6189 samples (about 2.25%) to 56 of 5056 samples (about 1.11%).
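
The values printed after "Which is" in the outputs above are fractions, although they are followed by a `%` sign. Converting the two unseen-test results to percentages is simple arithmetic. `error_percent` is a hypothetical helper, and the counts are taken from the outputs above:

```python
def error_percent(errors, total):
    """Error rate as a percentage of the total number of samples."""
    return 100.0 * errors / total

before = error_percent(139, 6189)   # unseen test with unknown records
after = error_percent(56, 5056)     # unseen test without unknown records
print(f"with unknown records:    {before:.2f}%")
print(f"without unknown records: {after:.2f}%")
```

This gives about 2.25% and 1.11%, so the error rate on the unseen test set is roughly halved.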