## Loading Data

In [6]:
import pandas as pd
Raw_Data = pd.read_csv("ML-EdgeIIoT-dataset.csv", low_memory=False)  # low_memory=False avoids mixed-type inference within a column
Raw_Data.info()  # 63 features and 157800 entries

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 157800 entries, 0 to 157799
Data columns (total 63 columns):
 #   Column                     Non-Null Count   Dtype  
---  ------                     --------------   -----  
 0   frame.time                 157800 non-null  object 
 1   ip.src_host                157800 non-null  object 
 2   ip.dst_host                157800 non-null  object 
 3   arp.dst.proto_ipv4         157800 non-null  object 
 4   arp.opcode                 157800 non-null  float64
 5   arp.hw.size                157800 non-null  float64
 6   arp.src.proto_ipv4         157800 non-null  object 
 7   icmp.checksum              157800 non-null  float64
 8   icmp.seq_le                157800 non-null  float64
 9   icmp.transmit_timestamp    157800 non-null  float64
 10  icmp.unused                157800 non-null  float64
 11  http.file_data             157800 non-null  object 
 12  http.content_length        157800 non-null  float64
 13  http.request.uri.query     15

Checking for NaN values

In [7]:
# Prints the names of the columns that contain NaN; an empty list means there are none
print("Number of NaN Values :", Raw_Data.columns[Raw_Data.isna().any()].tolist())

Number of NaN Values : []
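
The check above lists the columns that contain NaN rather than counting the values. A minimal sketch of a per-column count, assuming a small made-up frame (`toy` and its column names are illustrative and not part of the dataset):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for Raw_Data; names and values are made up
toy = pd.DataFrame({
    "feature_a": [1.0, np.nan, 3.0],
    "feature_b": [0.0, 0.0, 0.0],
})

nan_per_column = toy.isna().sum()                          # NaN count for each column
columns_with_nan = toy.columns[toy.isna().any()].tolist()  # same expression as in the cell above
total_nan = int(nan_per_column.sum())                      # NaN count for the whole frame

print(nan_per_column)
print("Columns with NaN:", columns_with_nan)
print("Total NaN values:", total_nan)
```

On the real data both the list and the total would be expected to be empty and zero, matching the output above.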


## Trivial Feature Removal 

Removing the label features and saving them in pickle files. <br>
The DataFrame contains no NaN values, so the number of samples stays constant throughout the preprocessing.

In [8]:
Attack_type = Raw_Data.pop('Attack_type')  # removes the 'Attack_type' column from the DataFrame and keeps it as a Series
Attack_label = Raw_Data.pop('Attack_label')  # removes the 'Attack_label' column from the DataFrame and keeps it as a Series
Attack_type.to_pickle("Attack_type.pkl")
Attack_label.to_pickle("Attack_label.pkl")

Features that carry *timestamp*, *port*, *IP address*, or *payload* information can be removed (according to the Edge dataset).

In [9]:
Columns_to_drop = [0, 1, 2, 3, 6, 9, 11, 13, 16, 27, 31, 32, 34, 35, 51]  # positional indices of the columns marked red in Edge-pdf
Raw_Data = Raw_Data.drop(Raw_Data.columns[Columns_to_drop], axis=1)
Raw_Data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 157800 entries, 0 to 157799
Data columns (total 46 columns):
 #   Column                     Non-Null Count   Dtype  
---  ------                     --------------   -----  
 0   arp.opcode                 157800 non-null  float64
 1   arp.hw.size                157800 non-null  float64
 2   icmp.checksum              157800 non-null  float64
 3   icmp.seq_le                157800 non-null  float64
 4   icmp.unused                157800 non-null  float64
 5   http.content_length        157800 non-null  float64
 6   http.request.method        157800 non-null  object 
 7   http.referer               157800 non-null  object 
 8   http.request.version       157800 non-null  object 
 9   http.response              157800 non-null  float64
 10  http.tls_port              157800 non-null  float64
 11  tcp.ack                    157800 non-null  float64
 12  tcp.ack_raw                157800 non-null  float64
 13  tcp.checksum               15
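
Dropping by position depends on the column order of the CSV file. A sketch of the same step by column name, which may be easier to audit, is given here. It assumes a made-up frame `toy` that reuses only a few of the column names visible in the first `info()` output; the values are invented, and the full list of names behind the remaining indices is not reproduced here.

```python
import pandas as pd

# Toy frame with a few column names from the info() listing; values are made up
toy = pd.DataFrame({
    "frame.time": ["t0", "t1"],
    "ip.src_host": ["10.0.0.1", "10.0.0.2"],
    "arp.opcode": [0.0, 1.0],
    "http.file_data": ["0", "0"],
    "http.content_length": [0.0, 12.0],
})

# Names found at positions 0, 1 and 11 of the original listing
names_to_drop = ["frame.time", "ip.src_host", "http.file_data"]

# drop() raises a KeyError by default if one of the names is missing,
# which guards against silently dropping the wrong column
reduced = toy.drop(columns=names_to_drop)
print(reduced.columns.tolist())
```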

The features *mqtt.proto_len* and *mqtt.protoname* carry the same information (one as a number, the other as a string). <br>
The same holds for *mqtt.topic_len* and *mqtt.topic*.

In [10]:
Columns_to_drop = [39, 40]  # the string-valued features of each pair are removed
Raw_Data = Raw_Data.drop(Raw_Data.columns[Columns_to_drop], axis=1) 
Raw_Data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 157800 entries, 0 to 157799
Data columns (total 44 columns):
 #   Column                     Non-Null Count   Dtype  
---  ------                     --------------   -----  
 0   arp.opcode                 157800 non-null  float64
 1   arp.hw.size                157800 non-null  float64
 2   icmp.checksum              157800 non-null  float64
 3   icmp.seq_le                157800 non-null  float64
 4   icmp.unused                157800 non-null  float64
 5   http.content_length        157800 non-null  float64
 6   http.request.method        157800 non-null  object 
 7   http.referer               157800 non-null  object 
 8   http.request.version       157800 non-null  object 
 9   http.response              157800 non-null  float64
 10  http.tls_port              157800 non-null  float64
 11  tcp.ack                    157800 non-null  float64
 12  tcp.ack_raw                157800 non-null  float64
 13  tcp.checksum               15
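
Whether a string feature and its numeric counterpart really carry the same information can be checked before dropping one of them. The following is only a sketch on made-up data: the columns `name` and `name_len` are hypothetical stand-ins for a pair such as *mqtt.protoname* and *mqtt.proto_len*, and the values are invented.

```python
import pandas as pd

# Toy pair: a string feature and a numeric feature assumed to describe the same thing
toy = pd.DataFrame({
    "name": ["MQTT", "0", "MQTT", "0"],
    "name_len": [4.0, 0.0, 4.0, 0.0],
})

# Number of distinct numeric values seen for each string value, and vice versa
lens_per_name = toy.groupby("name")["name_len"].nunique()
names_per_len = toy.groupby("name_len")["name"].nunique()

# The pair is one-to-one if every value maps to exactly one partner in both directions
is_redundant = bool((lens_per_name == 1).all() and (names_per_len == 1).all())
print(is_redundant)
```

If the check fails in the second direction, the numeric column alone would lose information that the string column holds.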

Features whose only value is *0* should be removed: <br>
*icmp.unused*, *http.tls_port*, *dns.qry.type*, *dns.retransmit_request_in*, *mqtt.msg_decoded_as*, *mbtcp.len*, *mbtcp.trans_id*, *mbtcp.unit_id*. <br>
**Note:** these features should still be recorded for the future pipeline.

In [11]:
Columns_to_drop = [4, 10, 27, 30, 36, 41, 42, 43]  # marked pink in the file Edge-pdf
Raw_Data = Raw_Data.drop(Raw_Data.columns[Columns_to_drop], axis=1)  # sklearn's VarianceThreshold does the same thing
Raw_Data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 157800 entries, 0 to 157799
Data columns (total 36 columns):
 #   Column                  Non-Null Count   Dtype  
---  ------                  --------------   -----  
 0   arp.opcode              157800 non-null  float64
 1   arp.hw.size             157800 non-null  float64
 2   icmp.checksum           157800 non-null  float64
 3   icmp.seq_le             157800 non-null  float64
 4   http.content_length     157800 non-null  float64
 5   http.request.method     157800 non-null  object 
 6   http.referer            157800 non-null  object 
 7   http.request.version    157800 non-null  object 
 8   http.response           157800 non-null  float64
 9   tcp.ack                 157800 non-null  float64
 10  tcp.ack_raw             157800 non-null  float64
 11  tcp.checksum            157800 non-null  float64
 12  tcp.connection.fin      157800 non-null  float64
 13  tcp.connection.rst      157800 non-null  float64
 14  tcp.connection.syn  
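
The code comment mentions `VarianceThreshold` as an alternative to dropping the constant columns by hand. A minimal sketch of that route, assuming a made-up numeric frame (`toy`, `all_zero` and `varying` are illustrative names):

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# Toy frame: one constant column and one column that varies
toy = pd.DataFrame({
    "all_zero": [0.0, 0.0, 0.0, 0.0],
    "varying": [0.0, 1.0, 0.0, 2.0],
})

selector = VarianceThreshold(threshold=0.0)  # removes features with zero variance
selector.fit(toy)

kept_mask = selector.get_support()                   # True for columns that are kept
constant_columns = toy.columns[~kept_mask].tolist()  # names worth recording for a later pipeline
reduced = toy.loc[:, kept_mask]

print("Removed:", constant_columns)
print("Kept:", reduced.columns.tolist())
```

Keeping `constant_columns` around is one way to record which features were removed. Note that `VarianceThreshold` expects numeric input, so it would have to be applied to the numeric columns only.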

## Encoding

Some features have categorical values (*http.request.method*, *http.referer*, *http.request.version*, *dns.qry.name.len*, *mqtt.conack.flags*), <br>
so encoding is necessary. However, these features contain both '0' and '0.0'. Although both represent the same value, the columns have <br>
data type *object*, so an encoder would treat them as separate categories.
Hence the mapping.
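
The effect can be seen on a small made-up column. This sketch uses `Series.replace` as a vectorised alternative to a row-wise function; the variable names and values are illustrative only.

```python
import pandas as pd

# Made-up object column that mixes '0' and '0.0'
column = pd.Series(["0", "0.0", "GET", "0"], dtype="object")

categories_before = sorted(column.unique())   # '0' and '0.0' count as two categories

# replace() with a dict matches whole values, so 'GET' is left untouched
normalized = column.replace({"0": "0.0"})
categories_after = sorted(normalized.unique())

print(categories_before)
print(categories_after)
```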

In [12]:
# Maps the string '0' to '0.0' so that both spellings end up in one category.
# The helper is not called 'format' to avoid shadowing the Python built-in.
def normalize_zero(row_value):
    return '0.0' if row_value == '0' else row_value

for column in ['http.request.method', 'http.referer', 'http.request.version', 'dns.qry.name.len', 'mqtt.conack.flags']:
    Raw_Data[column] = Raw_Data[column].apply(normalize_zero)


In [13]:
Raw_Data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 157800 entries, 0 to 157799
Data columns (total 36 columns):
 #   Column                  Non-Null Count   Dtype  
---  ------                  --------------   -----  
 0   arp.opcode              157800 non-null  float64
 1   arp.hw.size             157800 non-null  float64
 2   icmp.checksum           157800 non-null  float64
 3   icmp.seq_le             157800 non-null  float64
 4   http.content_length     157800 non-null  float64
 5   http.request.method     157800 non-null  object 
 6   http.referer            157800 non-null  object 
 7   http.request.version    157800 non-null  object 
 8   http.response           157800 non-null  float64
 9   tcp.ack                 157800 non-null  float64
 10  tcp.ack_raw             157800 non-null  float64
 11  tcp.checksum            157800 non-null  float64
 12  tcp.connection.fin      157800 non-null  float64
 13  tcp.connection.rst      157800 non-null  float64
 14  tcp.connection.syn  

Saving the DataFrame to a pickle file for export.

In [14]:
Raw_Data.to_pickle("Raw_Data_after_extraction.pkl")

Common encoding types are: <br>
* *Label encoding* has the drawback that it can impose an ordinal relationship that a model may relate to the target; nevertheless, it was used by the authors of Edge.
* *One-hot encoding* has no such drawback, but the number of features can increase.
* *Target encoding* replaces each unique value of a categorical feature with the mean of the target values for that category.
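
The idea behind target encoding can be sketched by hand with a plain per-category mean. This is an illustration on made-up data (`toy`, `method` and `label` are hypothetical names), not what scikit-learn's `TargetEncoder` computes exactly, since that estimator additionally applies smoothing and cross fitting.

```python
import pandas as pd

# Toy data: one categorical feature and a binary target
toy = pd.DataFrame({
    "method": ["GET", "GET", "POST", "0.0", "0.0", "0.0"],
    "label": [1, 0, 1, 0, 0, 1],
})

# Mean of the target for each category
category_means = toy.groupby("method")["label"].mean()

# Replace each category by its mean target value
toy["method_encoded"] = toy["method"].map(category_means)

print(category_means)
print(toy)
```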

In [15]:
from sklearn.compose import make_column_transformer  # applies a transformer to selected columns
from sklearn.preprocessing import OneHotEncoder

# OneHotEncoder has a 'drop' parameter. Setting it to 'first' (or 'if_binary' for two-category features) drops one encoded column per feature,
# because the full set of encoded columns is otherwise highly correlated. If regularisation is going to be used, this is not needed.
transformer = make_column_transformer(
    (OneHotEncoder(), ['http.request.method', 'http.referer', 'http.request.version', 'dns.qry.name.len', 'mqtt.conack.flags']),
    remainder='passthrough',
    verbose_feature_names_out=False,
)
transformed = transformer.fit_transform(Raw_Data)
Data_ohe = pd.DataFrame(data=transformed, columns=transformer.get_feature_names_out())
Data_ohe.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 157800 entries, 0 to 157799
Data columns (total 55 columns):
 #   Column                                                                             Non-Null Count   Dtype  
---  ------                                                                             --------------   -----  
 0   http.request.method_0.0                                                            157800 non-null  float64
 1   http.request.method_GET                                                            157800 non-null  float64
 2   http.request.method_OPTIONS                                                        157800 non-null  float64
 3   http.request.method_POST                                                           157800 non-null  float64
 4   http.request.method_TRACE                                                          157800 non-null  float64
 5   http.referer_() { _; } >_[$($())] { echo 93e4r0-CVE-2014-6278: true; echo;echo; }  157800 non

Saving the one-hot encoded DataFrame to a pickle file for export.

In [16]:
Data_ohe.to_pickle("Data_ohe.pkl")

In [18]:
# Target encoding
from sklearn.preprocessing import TargetEncoder

# The categorical columns that are replaced by their target-encoded values
columns_to_replace = ['http.request.method', 'http.referer', 'http.request.version', 'dns.qry.name.len', 'mqtt.conack.flags']
Input = Raw_Data.loc[:, columns_to_replace]
Target = Attack_label
T_encoder = TargetEncoder(target_type="binary")
Input_encoded = T_encoder.fit_transform(Input, Target)  # fit_transform uses an internal cross fitting scheme
Data_tar = Raw_Data.copy()
Data_tar[columns_to_replace] = pd.DataFrame(data=Input_encoded, columns=T_encoder.get_feature_names_out())
Data_tar.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 157800 entries, 0 to 157799
Data columns (total 36 columns):
 #   Column                  Non-Null Count   Dtype  
---  ------                  --------------   -----  
 0   arp.opcode              157800 non-null  float64
 1   arp.hw.size             157800 non-null  float64
 2   icmp.checksum           157800 non-null  float64
 3   icmp.seq_le             157800 non-null  float64
 4   http.content_length     157800 non-null  float64
 5   http.request.method     157800 non-null  float64
 6   http.referer            157800 non-null  float64
 7   http.request.version    157800 non-null  float64
 8   http.response           157800 non-null  float64
 9   tcp.ack                 157800 non-null  float64
 10  tcp.ack_raw             157800 non-null  float64
 11  tcp.checksum            157800 non-null  float64
 12  tcp.connection.fin      157800 non-null  float64
 13  tcp.connection.rst      157800 non-null  float64
 14  tcp.connection.syn  

Saving the target-encoded DataFrame to a pickle file for export.

In [19]:
Data_tar.to_pickle("Data_tar_enc.pkl")
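
The exported files can be read back with `pd.read_pickle`. The following is a sketch only: `load_exports` is a hypothetical helper, and the demonstration writes small stand-in files to a temporary directory instead of using the real exports.

```python
import tempfile
from pathlib import Path

import pandas as pd


def load_exports(directory):
    """Load the one-hot encoded features and the binary labels from `directory`."""
    directory = Path(directory)
    features = pd.read_pickle(directory / "Data_ohe.pkl")
    labels = pd.read_pickle(directory / "Attack_label.pkl")
    # No rows are dropped during preprocessing, so both indices should match
    if not features.index.equals(labels.index):
        raise ValueError("features and labels are not aligned")
    return features, labels


# Demonstration with made-up stand-in files in a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    pd.DataFrame({"f": [0.0, 1.0]}).to_pickle(Path(tmp) / "Data_ohe.pkl")
    pd.Series([0, 1], name="Attack_label").to_pickle(Path(tmp) / "Attack_label.pkl")
    X, y = load_exports(tmp)

print(X.shape, y.shape)
```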

***

## Conclusion

In this file, the Edge dataset for shallow ML is loaded into a pandas DataFrame. First, the attack types and attack labels are stored as Series in pickle files.<br>
Then the features that are unwanted according to the Edge authors are removed, followed by the redundant features and finally the features with a constant value of 0.<br>
<br>
Several categorical features remain. They are encoded with two techniques, namely one-hot encoding and target encoding. **Even though the Edge authors used label<br>
encoding for their ML analysis, it is avoided here because it can impose an ordinal relationship on the categories. The impact of this will be seen in the model.**<br>
<br>
**It should also be noted that the 'drop' parameter of the one-hot encoder is not used, for the reason given in the Encoding section. The impact of this should be noted too.**<br>

"Attack_type.pkl" => Series with Attack types<br>
"Attack_label.pkl" => Series with Attack label<br>
"Raw_Data_after_extraction.pkl" => Data after all unwanted Feature Removal<br>
"Data_ohe.pkl" => One hot encoded Data<br>
"Data_tar_enc.pkl" => Target encoded Data<br>