# Wrangle

The wrangle stage consists of two parts: Acquisition and Preparation.

In Acquisition, the goal is to create functions that download the RT-IoT2022 dataset. In Preparation, I will conduct light analysis on the data to develop questions to answer at a later stage.

## Acquisition
---

### Imports
I will start with the standard imports: pandas and NumPy.

In [18]:
# Standard imports
import pandas as pd
import numpy as np

### Download

The first thing I need to do is to download the data. The data comes from the [UC Irvine Machine Learning Repository](https://archive.ics.uci.edu/dataset/942/rt-iot2022), and I need to develop or find a way to download it.

The dataset page already documents how to import the data with Python, so I will use that approach.

In [20]:
# Install the necessary UC Irvine package
!pip install ucimlrepo



In [24]:
# Use the code provided to import the data:
from ucimlrepo import fetch_ucirepo 
  
# fetch dataset 
rt_iot2022 = fetch_ucirepo(id=942) 
  
# data (as pandas dataframes) 
X = rt_iot2022.data.features 
y = rt_iot2022.data.targets 
  
# metadata 
print(rt_iot2022.metadata) 
  
# variable information 
print(rt_iot2022.variables) 

{'uci_id': 942, 'name': 'RT-IoT2022 ', 'repository_url': 'https://archive.ics.uci.edu/dataset/942/rt-iot2022', 'data_url': 'https://archive.ics.uci.edu/static/public/942/data.csv', 'abstract': 'The RT-IoT2022, a proprietary dataset derived from a real-time IoT infrastructure, is introduced as a comprehensive resource integrating a diverse range of IoT devices and sophisticated network attack methodologies. This dataset encompasses both normal and adversarial network behaviours, providing a general representation of real-world scenarios.\nIncorporating data from IoT devices such as ThingSpeak-LED, Wipro-Bulb, and MQTT-Temp, as well as simulated attack scenarios involving Brute-Force SSH attacks, DDoS attacks using Hping and Slowloris, and Nmap patterns, RT-IoT2022 offers a detailed perspective on the complex nature of network traffic. The bidirectional attributes of network traffic are meticulously captured using the Zeek network monitoring tool and the Flowmeter plugin. Researchers can

### Examine the results
From these results, fetch_ucirepo appears to return a single object that holds the feature and target DataFrames, plus metadata and variable descriptions for the dataset.

I need to examine this in depth and determine what my next steps will be.
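One way to begin that examination is a quick structural summary of each DataFrame. The sketch below is only illustrative: `summarize_frame` is a hypothetical helper, and `toy_features` is a made-up three-row stand-in so the example runs without downloading anything. With the real data, the same helper could be given `rt_iot2022.data.features` or `rt_iot2022.data.targets`.

```python
import pandas as pd


def summarize_frame(frame: pd.DataFrame) -> dict:
    """Return basic structural facts about a DataFrame (hypothetical helper)."""
    return {
        "rows": frame.shape[0],
        "columns": frame.shape[1],
        # Total number of missing cells across the whole frame
        "missing_values": int(frame.isna().sum().sum()),
        # How many columns there are of each dtype
        "dtype_counts": frame.dtypes.astype(str).value_counts().to_dict(),
    }


# Made-up stand-in rows; only the column names come from the real dataset
toy_features = pd.DataFrame({
    "proto": ["tcp", "udp", "tcp"],
    "flow_duration": [0.0, 0.000227, None],
    "fwd_pkts_tot": [1, 2, 1],
})

summary = summarize_frame(toy_features)
print(summary)
```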

In [43]:
rt_iot2022.data.features.columns

Index(['id.orig_p', 'id.resp_p', 'proto', 'service', 'flow_duration',
       'fwd_pkts_tot', 'bwd_pkts_tot', 'fwd_data_pkts_tot',
       'bwd_data_pkts_tot', 'fwd_pkts_per_sec', 'bwd_pkts_per_sec',
       'flow_pkts_per_sec', 'down_up_ratio', 'fwd_header_size_tot',
       'fwd_header_size_min', 'fwd_header_size_max', 'bwd_header_size_tot',
       'bwd_header_size_min', 'bwd_header_size_max', 'flow_FIN_flag_count',
       'flow_SYN_flag_count', 'flow_RST_flag_count', 'fwd_PSH_flag_count',
       'bwd_PSH_flag_count', 'flow_ACK_flag_count', 'fwd_URG_flag_count',
       'bwd_URG_flag_count', 'flow_CWR_flag_count', 'flow_ECE_flag_count',
       'fwd_pkts_payload.min', 'fwd_pkts_payload.max', 'fwd_pkts_payload.tot',
       'fwd_pkts_payload.avg', 'fwd_pkts_payload.std', 'bwd_pkts_payload.min',
       'bwd_pkts_payload.max', 'bwd_pkts_payload.tot', 'bwd_pkts_payload.avg',
       'bwd_pkts_payload.std', 'flow_pkts_payload.min',
       'flow_pkts_payload.max', 'flow_pkts_payload.tot',
       '

In [45]:
rt_iot2022.data.targets.value_counts()

Attack_type               
DOS_SYN_Hping                 94659
Thing_Speak                    8108
ARP_poisioning                 7750
MQTT_Publish                   4146
NMAP_UDP_SCAN                  2590
NMAP_XMAS_TREE_SCAN            2010
NMAP_OS_DETECTION              2000
NMAP_TCP_scan                  1002
DDOS_Slowloris                  534
Wipro_bulb                      253
Metasploit_Brute_Force_SSH       37
NMAP_FIN_SCAN                    28
Name: count, dtype: int64
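The raw counts are easier to compare as proportions. The sketch below rebuilds the counts printed above as a Series (so it runs without the download) and divides by the total; on the real data, `rt_iot2022.data.targets.value_counts(normalize=True)` should give the same shares directly. The label spellings, including `ARP_poisioning`, are kept exactly as they appear in the dataset.

```python
import pandas as pd

# Counts copied from the value_counts() output above
counts = pd.Series({
    "DOS_SYN_Hping": 94659,
    "Thing_Speak": 8108,
    "ARP_poisioning": 7750,
    "MQTT_Publish": 4146,
    "NMAP_UDP_SCAN": 2590,
    "NMAP_XMAS_TREE_SCAN": 2010,
    "NMAP_OS_DETECTION": 2000,
    "NMAP_TCP_scan": 1002,
    "DDOS_Slowloris": 534,
    "Wipro_bulb": 253,
    "Metasploit_Brute_Force_SSH": 37,
    "NMAP_FIN_SCAN": 28,
}, name="count")

total = int(counts.sum())

# Share of all rows that belongs to each class
shares = (counts / total).round(4)

print(f"Total rows: {total}")
print(shares)
```

The largest class dominates the dataset while the smallest classes have only a few dozen rows, which is worth keeping in mind when forming questions in the Preparation stage.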

In [51]:
# Check that the target column doesn't exist in features provided
rt_iot2022.data.features.columns.isin(["Attack_type"]).sum()

0
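The same check can be generalized before joining the two frames. `pd.concat` with `axis=1` aligns rows on the index, so it also helps to confirm that both frames share the same index; otherwise the result would contain rows padded with NaN. The sketch below is a minimal illustration using made-up two-row frames, and `check_before_concat` is a hypothetical helper name, not part of ucimlrepo or pandas.

```python
import pandas as pd


def check_before_concat(features: pd.DataFrame, targets: pd.DataFrame) -> dict:
    """Report column-name overlap and index alignment (hypothetical helper)."""
    shared = sorted(set(features.columns) & set(targets.columns))
    return {
        "shared_columns": shared,
        "index_aligned": bool(features.index.equals(targets.index)),
    }


# Made-up stand-ins for rt_iot2022.data.features and rt_iot2022.data.targets
toy_features = pd.DataFrame({"proto": ["tcp", "udp"], "fwd_pkts_tot": [1, 2]})
toy_targets = pd.DataFrame({"Attack_type": ["DOS_SYN_Hping", "ARP_poisioning"]})

report = check_before_concat(toy_features, toy_targets)
print(report)

# Safe to join column-wise only when no names clash and the indexes match
if not report["shared_columns"] and report["index_aligned"]:
    combined = pd.concat([toy_features, toy_targets], axis=1)
    print(combined.shape)
```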

In [68]:
# Concatenate the two datasets
df = pd.concat([rt_iot2022.data.features,rt_iot2022.data.targets],axis=1)
df.sample(10)

| | id.orig_p | id.resp_p | proto | service | flow_duration | fwd_pkts_tot | bwd_pkts_tot | fwd_data_pkts_tot | bwd_data_pkts_tot | fwd_pkts_per_sec | ... | active.std | idle.min | idle.max | idle.tot | idle.avg | idle.std | fwd_init_window_size | bwd_init_window_size | fwd_last_window_size | Attack_type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 109177 | 2273 | 21 | tcp | - | 0.0 | 1 | 0 | 1 | 0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 64 | 0 | 64 | DOS_SYN_Hping |
| 39884 | 21936 | 21 | tcp | - | 2e-06 | 1 | 1 | 1 | 0 | 466033.8 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 64 | 0 | 64 | DOS_SYN_Hping |
| 15039 | 58868 | 53 | udp | dns | 0.000227 | 2 | 2 | 2 | 2 | 8811.563 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0 | 0 | 0 | ARP_poisioning |
| 67723 | 49849 | 21 | tcp | - | 0.0 | 1 | 0 | 1 | 0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 64 | 0 | 64 | DOS_SYN_Hping |
| 65424 | 47549 | 21 | tcp | - | 2e-06 | 1 | 1 | 1 | 0 | 524288.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 64 | 0 | 64 | DOS_SYN_Hping |
| 74133 | 55414 | 21 | tcp | - | 1e-06 | 1 | 1 | 1 | 0 | 1048576.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 64 | 0 | 64 | DOS_SYN_Hping |
| 115447 | 44682 | 21 | tcp | - | 0.0 | 1 | 0 | 1 | 0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 64 | 0 | 64 | DOS_SYN_Hping |
| 40791 | 22843 | 21 | tcp | - | 2e-06 | 1 | 1 | 1 | 0 | 524288.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 64 | 0 | 64 | DOS_SYN_Hping |
| 70568 | 52694 | 21 | tcp | - | 4e-06 | 1 | 1 | 1 | 0 | 246723.8 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 64 | 0 | 64 | DOS_SYN_Hping |
| 93414 | 32864 | 21 | tcp | - | 1e-06 | 1 | 1 | 1 | 0 | 1048576.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 64 | 0 | 64 | DOS_SYN_Hping |


In [75]:
def acquire_iot2022():
    """
    Download and import the RT-IoT2022 dataset. Takes no parameters. Assumes the ucimlrepo package has already been installed.
    
    Parameters:
    -----------
    - None
    
    Return:
    -------
    - Pandas DataFrame containing the features plus the Attack_type target column
    - The object returned by fetch_ucirepo, which also holds the metadata
    - None is returned instead if the import fails
    """
    try:
        from ucimlrepo import fetch_ucirepo
        
        # Fetch dataset
        rt_iot2022 = fetch_ucirepo(id=942)

        # Concatenate features and targets column-wise
        df = pd.concat([rt_iot2022.data.features, rt_iot2022.data.targets], axis=1)

        return df, rt_iot2022
    
    except Exception as e:
        # Catch Exception instead of a bare except so KeyboardInterrupt still propagates
        print(
            f"Import failed ({type(e).__name__}: {e}) due to 1 or more of the following reasons:"
            "\n\t - User is missing the UC Irvine Python package."
            "\n\t - Dataset is no longer available at the queried location."
            "\n\t - Some of the libraries in use have changed."
        )
        return None
    

## Preparation