## Testing the ModelPreprocessor Class

In this notebook, we test the `ModelPreprocessor` class to ensure it works correctly with our dataset. The class preprocesses raw data and prepares it for model prediction. We will follow these steps:

1. **Load the Dataset**: We will start by loading the UNSW_NB15 training set from its parquet file.
2. **Sample Selection**: We will select 10 random examples from the training set to use as our test samples.
3. **Data Cleaning**: We will remove the target columns 'attack_cat' and 'label' from our test samples so that only input features remain.
4. **Save Sample Data**: The cleaned sample data will be saved to a parquet file for further processing.
5. **Initialize Preprocessor**: We will create an instance of the `ModelPreprocessor` class from the path to a pre-trained CatBoost model.
6. **Preprocess Data**: The saved sample file will be preprocessed by passing its path to the `preprocess` method of the `ModelPreprocessor` class.



In [1]:
import pandas as pd
import sys
sys.path.append('../../')
import src.data.UNSW_NB15_preprocessor.Preprocessor as prep

# Get test samples from UNSW_NB15_training-set.parquet
df = pd.read_parquet('../../data/UNSW_NB15_data/UNSW_NB15_training-set.parquet')

# Select 10 random examples from the training set
df_sample = df.sample(n=10, random_state=17)

# Remove 'attack_cat' and 'label' columns
df_sample = df_sample.drop(columns=['attack_cat', 'label'], errors='ignore')

# Save to parquet file
df_sample.to_parquet('10_samples.parquet', index=False)

In [2]:
# Display the samples
df_sample.head()

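| | dur | proto | service | state | spkts | dpkts | sbytes | dbytes | rate | sload | ... | smean | dmean | trans_depth | response_body_len | ct_src_dport_ltm | ct_dst_sport_ltm | is_ftp_login | ct_ftp_cmd | ct_flw_http_mthd | is_sm_ips_ports |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 109275 | 1e-05 | unas | - | INT | 2 | 0 | 200 | 0 | 100000.0 | 80000000.0 | ... | 100 | 0 | 0 | 0 | 3 | 3 | 0 | 0 | 0 | 0 |
| 160022 | 0.435792 | tcp | - | FIN | 10 | 6 | 2516 | 268 | 34.42009 | 41579.47 | ... | 252 | 45 | 0 | 0 | 2 | 1 | 0 | 0 | 0 | 0 |
| 28600 | 1.004934 | tcp | http | FIN | 12 | 18 | 1580 | 10168 | 28.857618 | 11535.09 | ... | 132 | 565 | 1 | 0 | 1 | 1 | 0 | 0 | 1 | 0 |
| 121557 | 58.899662 | ospf | - | REQ | 58 | 0 | 6264 | 0 | 0.967747 | 836.1339 | ... | 108 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 |
| 16055 | 0.02142 | tcp | smtp | FIN | 52 | 42 | 37268 | 3380 | 4341.736816 | 13651540.0 | ... | 717 | 80 | 0 | 0 | 2 | 1 | 0 | 0 | 0 | 0 |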


In [4]:
# Paths to the data and model files
data_path = "10_samples.parquet"
model_path = "../../models/UNSW_NB15_models/catboost_model_94.5_Recall.cbm"
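
Both paths are relative, so they resolve against the notebook's working directory. As a minimal sketch, a check like the following could confirm that the files exist before the preprocessor is constructed. The helper name `missing_files` is hypothetical and is not part of the `ModelPreprocessor` module.

```python
from pathlib import Path


def missing_files(*paths):
    """Return the subset of `paths` that do not point to an existing file."""
    return [str(p) for p in paths if not Path(p).is_file()]


# Same relative paths as in the cell above.
data_path = "10_samples.parquet"
model_path = "../../models/UNSW_NB15_models/catboost_model_94.5_Recall.cbm"

missing = missing_files(data_path, model_path)
if missing:
    print("Missing files:", missing)
```

An empty result means both files were found; otherwise the printed list shows which path needs fixing.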

In [5]:
# Class Test
preprocessor = prep.ModelPreprocessor(model_path)

In [6]:
# Preprocess the data
df_sample_preprocessed = preprocessor.preprocess(data_path)

In [7]:
# Display the preprocessed data
df_sample_preprocessed.head()

| | dur | proto | service | state | spkts | dpkts | sbytes | dbytes | rate | sload | ... | djit | tcprtt | smean | trans_depth | response_body_len | ct_src_dport_ltm | ct_dst_sport_ltm | is_ftp_login | ct_flw_http_mthd | is_sm_ips_ports |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1e-05 | unas | - | INT | 1.098612 | 0.0 | 200.0 | 0.0 | 100000.0 | 18.197536 | ... | 0.0 | 0.0 | 4.61512 | 0.0 | 0.0 | 3.0 | 3.0 | 0.0 | 0.0 | 0.0 |
| 1 | 0.435792 | tcp | - | FIN | 2.397895 | 1.94591 | 2516.0 | 268.0 | 34.42009 | 10.635386 | ... | 103.853882 | 0.110482 | 5.53339 | 0.0 | 0.0 | 2.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 2 | 1.004934 | tcp | http | FIN | 2.564949 | 2.944439 | 1580.0 | 10168.0 | 28.857618 | 9.353235 | ... | 0.0 | 0.000633 | 4.890349 | 1.0 | 0.0 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 |
| 3 | 58.899662 | ospf | - | REQ | 4.077538 | 0.0 | 6264.0 | 0.0 | 0.967747 | 6.729984 | ... | 0.0 | 0.0 | 4.691348 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 4 | 0.02142 | tcp | smtp | FIN | 3.970292 | 3.7612 | 37268.0 | 3380.0 | 4341.736816 | 16.429363 | ... | 0.891523 | 0.000671 | 6.576469 | 0.0 | 0.0 | 2.0 | 1.0 | 0.0 | 0.0 | 0.0 |
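
Comparing the two displays, the `spkts`, `dpkts`, and `smean` values in the preprocessed frame match log(1 + x) of the raw values, while columns such as `sbytes` are unchanged apart from the cast to float. The preprocessed header also no longer shows `dmean` and `ct_ftp_cmd`. The sketch below checks the log observation against the numbers displayed above, using only the standard library. That the preprocessor applies a `log1p` transform is an inference from this output, not something confirmed from the class source, and the helper name `looks_like_log1p` is hypothetical.

```python
import math

# Values copied from the five rows displayed in the raw and preprocessed outputs.
raw = {
    "spkts": [2, 10, 12, 58, 52],
    "dpkts": [0, 6, 18, 0, 42],
    "smean": [100, 252, 132, 108, 717],
    "sbytes": [200, 2516, 1580, 6264, 37268],
}
preprocessed = {
    "spkts": [1.098612, 2.397895, 2.564949, 4.077538, 3.970292],
    "dpkts": [0.0, 1.94591, 2.944439, 0.0, 3.7612],
    "smean": [4.61512, 5.53339, 4.890349, 4.691348, 6.576469],
    "sbytes": [200.0, 2516.0, 1580.0, 6264.0, 37268.0],
}


def looks_like_log1p(before, after, tol=1e-5):
    """Return True if every `after` value equals log(1 + before) within `tol`."""
    return all(
        math.isclose(math.log1p(b), a, abs_tol=tol)
        for b, a in zip(before, after)
    )


# Report, per column, whether the displayed values fit a log1p transform.
for column in raw:
    print(column, looks_like_log1p(raw[column], preprocessed[column]))
```

The tolerance is loose because the displayed values are rounded to about six decimal places. A check of this kind only covers the rows shown here, so it is a sanity test rather than a specification of what `preprocess` does.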
