# Data Field Descriptions

#### Frame-related
- `frame.time`: The arrival (capture) timestamp of the frame.

#### IP-related
- `ip.src_host`: The source IP address of a packet.
- `ip.dst_host`: The destination IP address of a packet.

#### ARP (Address Resolution Protocol) related
- `arp.dst.proto_ipv4`: The target IP address for an ARP request.
- `arp.opcode`: The operation type for an ARP request or response.
- `arp.hw.size`: The length of the hardware address in an ARP request or response.
- `arp.src.proto_ipv4`: The sender IP address for an ARP request.

#### ICMP (Internet Control Message Protocol) related
- `icmp.checksum`: The checksum for an ICMP packet.
- `icmp.seq_le`: The sequence number of an ICMP packet, read in little-endian byte order.
- `icmp.transmit_timestamp`: The timestamp for when an ICMP packet was transmitted.
- `icmp.unused`: Unused field in ICMP packets.

#### HTTP (Hypertext Transfer Protocol) related
- `http.file_data`: The data contained in the HTTP response body.
- `http.content_length`: The length of the HTTP response body.
- `http.request.uri.query`: The query string in the HTTP request URI.
- `http.request.method`: The HTTP request method.
- `http.referer`: The value of the HTTP `Referer` request header.
- `http.request.full_uri`: The full HTTP request URI.
- `http.request.version`: The HTTP request version.
- `http.response`: Whether the packet is an HTTP response.
- `http.tls_port`: The TLS port number used for the HTTP connection.

#### TCP (Transmission Control Protocol) related
- `tcp.ack`: The TCP acknowledgment number (relative to the start of the connection).
- `tcp.ack_raw`: The raw (absolute) TCP acknowledgment number as carried in the header.
- `tcp.checksum`: The TCP checksum.
- `tcp.connection.fin`: Whether the TCP connection is in the FIN state.
- `tcp.connection.rst`: Whether the TCP connection is in the RST state.
- `tcp.connection.syn`: Whether the TCP connection is in the SYN state.
- `tcp.connection.synack`: Whether the TCP connection is in the SYN-ACK state.
- `tcp.dstport`: The TCP destination port number.
- `tcp.flags`: The TCP flags field.
- `tcp.flags.ack`: Whether the TCP ACK flag is set.
- `tcp.len`: The length of the TCP segment payload, excluding the header.
- `tcp.options`: The TCP options field.
- `tcp.payload`: The TCP payload data.
- `tcp.seq`: The TCP sequence number.
- `tcp.srcport`: The TCP source port number.

#### UDP (User Datagram Protocol) related
- `udp.port`: The UDP port number.
- `udp.stream`: The UDP stream identifier.
- `udp.time_delta`: The time delta between the UDP packet and the previous UDP packet in the same stream.

#### DNS (Domain Name System) related
- `dns.qry.name`: The DNS query name.
- `dns.qry.name.len`: The length of the DNS query name.
- `dns.qry.qu`: The "QU" bit of a multicast DNS question, indicating that a unicast response is requested.
- `dns.qry.type`: The DNS query type.
- `dns.retransmission`: Whether the DNS query is a retransmission.
- `dns.retransmit_request`: Marks a DNS query that retransmits an earlier request.
- `dns.retransmit_request_in`: The frame number of the original request that a retransmitted DNS query repeats.

#### MQTT (Message Queuing Telemetry Transport) related
- `mqtt.conack.flags`: The MQTT CONNACK flags field.
- `mqtt.conflag.cleansess`: Whether the Clean Session flag is set in the MQTT CONNECT flags.
- `mqtt.conflags`: The MQTT CONNECT flags field.
- `mqtt.hdrflags`: The MQTT header flags field.
- `mqtt.len`: The length of the MQTT packet.
- `mqtt.msg_decoded_as`: The protocol that the MQTT message payload was decoded as.
- `mqtt.msg`: The MQTT message.
- `mqtt.msgtype`: The MQTT message type.
- `mqtt.proto_len`: The length of the MQTT protocol name.
- `mqtt.protoname`: The MQTT protocol name.
- `mqtt.topic`: The MQTT topic.
- `mqtt.topic_len`: The length of the MQTT topic.
- `mqtt.ver`: The MQTT version.

#### Modbus TCP related
- `mbtcp.len`: The length of the Modbus TCP packet.
- `mbtcp.trans_id`: The Modbus TCP transaction identifier.
- `mbtcp.unit_id`: The Modbus TCP unit identifier.

#### Attack-related
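- `Attack_label`: Binary label marking whether the record belongs to an attack (it is `1` in the MITM rows shown by `df.head()`).
- `Attack_type`: The attack category, for example `MITM`.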


# Problems in Each Column
- `frame.time`: object (should be float); no null values; dirty values
- `ip.src_host`: object; no null values; irregular values
- `ip.dst_host`: object; no null values; irregular values
- `arp.dst.proto_ipv4`: object; no null values; irregular values
- `arp.opcode`: float; no null values
- `arp.hw.size`: float; no null values
- `arp.src.proto_ipv4`: object; no null values; irregular values
- `icmp.checksum`: float; no null values
- `icmp.seq_le`: float; no null values
- `icmp.transmit_timestamp`: float; no null values
- `icmp.unused`: float; no null values
- `http.file_data`: object; no null values; irregular values
- `http.content_length`: float; no null values
- `http.request.uri.query`: object; no null values; irregular values
- `http.request.method`: object; no null values; irregular values
- `http.referer`: object (should be float); no null values; irregular values
- `http.request.full_uri`: object; no null values; irregular values
- `http.request.version`: object; no null values; irregular values
- `http.tls_port`: float (should be int); no null values; irregular values
- `tcp.ack`: float; no null values

In [174]:
df["tcp.ack"].sample(10)

38571          0.0
144281         0.0
61852          1.0
8607      571297.0
102881         0.0
153468         0.0
83151      30692.0
28209        317.0
150213         0.0
25190          1.0
Name: tcp.ack, dtype: float64

In [169]:
df["tcp.ack"].isna().sum()

0
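
The list above flags several object columns as holding irregular values even though they contain no nulls. One way to surface such values is to coerce a column with `pd.to_numeric(errors="coerce")` and look at what fails to parse. The following is a minimal sketch on toy data; the helper name `find_non_numeric` and the sample values are illustrative assumptions, not taken from the dataset.

```python
import pandas as pd

def find_non_numeric(series):
    """Return value counts of the entries that cannot be parsed as numbers."""
    # Compare on stripped text so padded numbers such as " 443 " still parse.
    as_text = series.astype(str).str.strip()
    parsed = pd.to_numeric(as_text, errors="coerce")
    # Entries that were present originally but turned into NaN are non-numeric.
    bad_mask = parsed.isna() & series.notna()
    return as_text[bad_mask].value_counts()

# Toy column mixing numeric strings with text (hypothetical values).
toy = pd.Series(["0.0", "80", " 443 ", "1883.0", "192.168.0.1", "GET"], name="toy_col")
bad_values = find_non_numeric(toy)
print(bad_values)
```

Running the same helper over each object column would show which ones are numeric in disguise and which genuinely hold text.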

In [2]:
import pandas as pd
import numpy as np
from ydata_profiling import ProfileReport
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)


## Import Dataset

In [3]:
def import_dataset(path):
    # Load the CSV at `path` into a DataFrame using pandas' default type inference.
    df = pd.read_csv(path)
    return df
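
When a wide CSV has columns that mix numbers and text, `pd.read_csv` with its default `low_memory=True` infers types chunk by chunk and may emit a `DtypeWarning` about mixed types. A possible variant, sketched below, reads the file in one pass and lets the caller pin selected columns to a type. The name `import_dataset_typed` and the in-memory CSV are assumptions made for illustration.

```python
import io
import pandas as pd

def import_dataset_typed(path_or_buffer, dtype=None):
    """Read a CSV in one pass, optionally pinning column dtypes."""
    # low_memory=False makes pandas infer each column's type from the whole file.
    # dtype lets the caller fix specific columns, e.g. keep addresses as text.
    return pd.read_csv(path_or_buffer, low_memory=False, dtype=dtype)

# Toy CSV with two of the dataset's column names (values are made up).
csv_text = "ip.src_host,tcp.ack\n192.168.0.152,0.0\n0,1.0\n"
toy_df = import_dataset_typed(io.StringIO(csv_text), dtype={"ip.src_host": str})
print(toy_df.dtypes)
```

Reading in one pass costs more memory, so it is a trade-off rather than a drop-in replacement for very large files.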

In [4]:
def get_data_types(df):
    df_types = pd.DataFrame(df.dtypes, columns=['data_type'])
    df_types.index.name = 'column_name'
    return df_types

In [5]:
def get_null_values(df):
    # Count missing values per column and return them as a two-column table.
    null_counts = df.isna().sum()
    df_null_values = pd.DataFrame({'columns': null_counts.index,
                                   'null_values': null_counts.values})
    return df_null_values
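
A null count of zero does not by itself mean every field carries data. The `df.head()` rows further down are filled almost entirely with `0.0`, which suggests (an assumption, not something the source confirms) that fields absent from a packet are written as zero rather than left empty. The sketch below measures the share of such zero-like values per column; the helper name `get_zero_share` and the set of placeholder strings are hypothetical.

```python
import pandas as pd

def get_zero_share(df, placeholders=("0", "0.0")):
    """Return, per column, the fraction of rows holding a zero-like value."""
    # Compare on text so numeric zeros and the strings "0"/"0.0" are treated alike.
    as_text = df.astype(str).apply(lambda col: col.str.strip())
    return as_text.isin(list(placeholders)).mean().rename("zero_share")

# Toy frame with two of the dataset's column names (values are made up).
toy_df = pd.DataFrame({
    "tcp.ack": [0.0, 1.0, 0.0, 317.0],
    "http.request.method": ["0.0", "GET", "0", "0.0"],
})
zero_share = get_zero_share(toy_df)
print(zero_share)
```

Columns with a share close to 1 may carry little information for most rows, which is worth checking before treating them as complete.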

## Separate Columns

In [6]:
def get_columns(df):
    # np.number already covers every integer and float dtype.
    numeric_cols = df.select_dtypes(include=np.number).columns.tolist()
    categorical_cols = df.select_dtypes(include=['category', 'object']).columns.tolist()
    print(f'Number of numeric columns are {len(numeric_cols)} and categorical columns are {len(categorical_cols)}')
    return numeric_cols, categorical_cols

In [8]:
path="ML-EdgeIIoT-dataset.csv"
df=import_dataset(path)


In [7]:
df.head()

Unnamed: 0,frame.time,ip.src_host,ip.dst_host,arp.dst.proto_ipv4,arp.opcode,arp.hw.size,arp.src.proto_ipv4,icmp.checksum,icmp.seq_le,icmp.transmit_timestamp,icmp.unused,http.file_data,http.content_length,http.request.uri.query,http.request.method,http.referer,http.request.full_uri,http.request.version,http.response,http.tls_port,tcp.ack,tcp.ack_raw,tcp.checksum,tcp.connection.fin,tcp.connection.rst,tcp.connection.syn,tcp.connection.synack,tcp.dstport,tcp.flags,tcp.flags.ack,tcp.len,tcp.options,tcp.payload,tcp.seq,tcp.srcport,udp.port,udp.stream,udp.time_delta,dns.qry.name,dns.qry.name.len,dns.qry.qu,dns.qry.type,dns.retransmission,dns.retransmit_request,dns.retransmit_request_in,mqtt.conack.flags,mqtt.conflag.cleansess,mqtt.conflags,mqtt.hdrflags,mqtt.len,mqtt.msg_decoded_as,mqtt.msg,mqtt.msgtype,mqtt.proto_len,mqtt.protoname,mqtt.topic,mqtt.topic_len,mqtt.ver,mbtcp.len,mbtcp.trans_id,mbtcp.unit_id,Attack_label,Attack_type
0,6.0,192.168.0.152,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1,MITM
1,6.0,192.168.0.101,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1,MITM
2,6.0,192.168.0.152,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1,MITM
3,6.0,192.168.0.101,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1,MITM
4,6.0,192.168.0.152,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1,MITM


In [8]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 157800 entries, 0 to 157799
Data columns (total 63 columns):
 #   Column                     Non-Null Count   Dtype  
---  ------                     --------------   -----  
 0   frame.time                 157800 non-null  object 
 1   ip.src_host                157800 non-null  object 
 2   ip.dst_host                157800 non-null  object 
 3   arp.dst.proto_ipv4         157800 non-null  object 
 4   arp.opcode                 157800 non-null  float64
 5   arp.hw.size                157800 non-null  float64
 6   arp.src.proto_ipv4         157800 non-null  object 
 7   icmp.checksum              157800 non-null  float64
 8   icmp.seq_le                157800 non-null  float64
 9   icmp.transmit_timestamp    157800 non-null  float64
 10  icmp.unused                157800 non-null  float64
 11  http.file_data             157800 non-null  object 
 12  http.content_length        157800 non-null  float64
 13  http.request.uri.query     15

In [9]:
numeric_cols,categorical_cols=get_columns(df)

Number of numeric columns are 43 and categorical columns are 20


In [16]:
profile = ProfileReport(df, title="Profiling Report")
profile.to_file('Data Analysis.html')

In [10]:
df_data_types=get_data_types(df)

In [11]:

null_values=get_null_values(df)
print(null_values)

                      columns  null_values
0                  frame.time            0
1                 ip.src_host            0
2                 ip.dst_host            0
3          arp.dst.proto_ipv4            0
4                  arp.opcode            0
5                 arp.hw.size            0
6          arp.src.proto_ipv4            0
7               icmp.checksum            0
8                 icmp.seq_le            0
9     icmp.transmit_timestamp            0
10                icmp.unused            0
11             http.file_data            0
12        http.content_length            0
13     http.request.uri.query            0
14        http.request.method            0
15               http.referer            0
16      http.request.full_uri            0
17       http.request.version            0
18              http.response            0
19              http.tls_port            0
20                    tcp.ack            0
21                tcp.ack_raw            0
22         