<div style="
    padding: 15px 20px;
    margin: 10px 0;
    border-left: 8px solid #4F7942;
    background-color: rgba(255, 191, 0, 0.05);
    border-radius: 4px;
">

## Table of Contents

- [Imports](#imports)
- [Acquire](#acquire)
- [Prepare](#prepare)
- [Wrangle](#wrangle)
- [Miscellaneous](#miscellaneous)

</div>

#### ORIENTATION:
This file acquires the dataset in its entirety, along with a slimmed-down raw version for faster experimentation and development. The data is then normalized and cleaned to support data exploration and feature engineering later. Later notebooks use the normalized and cleaned data.

<div style="
    padding: 15px 20px;
    margin: 10px 0;
    border-left: 8px solid #4F7942;
    background-color: rgba(255, 191, 0, 0.05);
    border-radius: 4px;
">

## Imports

- [Back to Table of Contents](#table-of-contents)

</div>

In [1]:
import wrangle as w     # Helper functions written for this notebook to keep it tidy
import pandas as pd     # Needed to load datasets

<div style="
    padding: 15px 20px;
    margin: 10px 0;
    border-left: 8px solid #4F7942;
    background-color: rgba(255, 191, 0, 0.05);
    border-radius: 4px;
">

## Acquire

- [Back to Table of Contents](#table-of-contents)

</div>

In [5]:
# Check whether raw_full.csv exists locally; if not, download it from Kaggle
w.download_raw()

raw_full.csv not found, attempting to download the raw dataset...
Downloading to /Users/manupfool/.cache/kagglehub/datasets/aryashah2k/nfuqnidsv2-network-intrusion-detection-dataset/1.archive...


100%|██████████| 2.04G/2.04G [00:27<00:00, 79.3MB/s]

Extracting files...





Path to dataset files: /Users/manupfool/Desktop/University_of_Denver/inprogress_courses/COMP3703/final/raw_full.csv
Successfully downloaded the raw datafile as raw_full.csv!
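
The check-then-download behaviour logged above could be sketched roughly as follows. This is not the actual `w.download_raw` implementation, which is not shown in this notebook. The dataset handle is read off the cache path in the log; the function signature, the injectable `fetch` parameter, and the choice to copy the first CSV found are assumptions made for illustration.

```python
from pathlib import Path
import shutil

# Handle inferred from the cache path printed in the log above.
DATASET_HANDLE = "aryashah2k/nfuqnidsv2-network-intrusion-detection-dataset"


def download_raw(target="raw_full.csv", fetch=None):
    """Return the path to the raw CSV, downloading it only when it is missing."""
    target = Path(target)
    if target.exists():
        # Nothing to do: the raw file is already available locally.
        return target

    if fetch is None:
        # Imported lazily so the function can be exercised without kagglehub.
        import kagglehub
        fetch = kagglehub.dataset_download

    # The fetch callable is expected to return the directory holding the files.
    dataset_dir = Path(fetch(DATASET_HANDLE))
    csv_files = sorted(dataset_dir.rglob("*.csv"))
    if not csv_files:
        raise FileNotFoundError(f"No CSV file found under {dataset_dir}")

    # Assumption: the dataset ships a single CSV, so the first match is copied.
    shutil.copy(csv_files[0], target)
    return target
```

Passing a custom `fetch` callable makes it possible to try the logic without network access or Kaggle credentials.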


In [6]:
# Read raw_full.csv, then export it to raw_full.parquet (the Parquet file is smaller and faster for pandas to read)
df = pd.read_csv('raw_full.csv')
w.export_to_parquet(df)

Attempting to export Pandas Dataframe to raw_full.parquet...
Successfully exported Pandas Dataframe to raw_full.parquet!
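
For readers without access to `wrangle`, a minimal sketch of the export step is shown below. The signature and default filename are assumptions, since the real `w.export_to_parquet` is not shown here. `DataFrame.to_parquet` needs a Parquet engine such as `pyarrow` or `fastparquet` to be installed.

```python
from pathlib import Path

import pandas as pd


def export_to_parquet(df: pd.DataFrame, path="raw_full.parquet") -> Path:
    """Write the DataFrame to a Parquet file and return the file's path."""
    path = Path(path)
    # index=False skips the row index, which carries no information here.
    df.to_parquet(path, index=False)
    return path
```

The file can then be loaded again with `pd.read_parquet('raw_full.parquet')`.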


In [3]:
# Check whether raw_short.parquet exists locally; if not, create it from raw_full.parquet (useful when a smaller DataFrame is enough)
w.create_short()

raw_full.parquet exists, attempting to create reduced version as raw_short.parquet...
raw_full.parquet read into Pandas DataFrame, checking if Attack exists as a column...
Attack exists, creating reduced Pandas DataFrame...
Successfully reduced raw_full.parquet into raw_short.parquet at 10.0% the original size and stratified over Attack!
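
The log describes a 10% sample stratified over `Attack`. One way to get such a sample in pandas is a per-group sample, sketched below. The function name, the `seed` parameter, and the use of `groupby(...).sample` are assumptions, and `w.create_short` may work differently.

```python
import pandas as pd


def stratified_sample(df: pd.DataFrame, column="Attack", frac=0.10, seed=42) -> pd.DataFrame:
    """Sample the same fraction of rows from every class in `column`."""
    if column not in df.columns:
        raise KeyError(f"{column!r} is not a column of the DataFrame")
    # Sampling within each group keeps the class proportions of the full data.
    return df.groupby(column, group_keys=False).sample(frac=frac, random_state=seed)
```

Because each class is sampled at the same rate, rare attack types stay represented in roughly the same proportion as in the full dataset.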


<div style="
    padding: 15px 20px;
    margin: 10px 0;
    border-left: 8px solid #4F7942;
    background-color: rgba(255, 191, 0, 0.05);
    border-radius: 4px;
">

## Prepare

- [Back to Table of Contents](#table-of-contents)

</div>

In [None]:
# Diagnose the dataset for duplicates, hard/soft nulls, and nonsensical values
w.diag_missing_values(pdDataFrame=df)

What Is Being Diagnosed:
	Duplicate rows check
	Hard Nulls:	 pd.isna()
	Numeric Checks:	 np.isinf()
	IP Checks:	0.0.0.0
	Nonsense Vals:	-1
	Bad Strings:	['', ' ', '?', 'None', 'nan', 'NULL']
IPV4_SRC_ADDR detected:
	bad_ips: 55894
L4_SRC_PORT had 0 detected nulls!
IPV4_DST_ADDR detected:
	bad_ips: 55862
L4_DST_PORT had 0 detected nulls!
PROTOCOL had 0 detected nulls!
L7_PROTO had 0 detected nulls!
IN_BYTES had 0 detected nulls!
IN_PKTS had 0 detected nulls!
OUT_BYTES had 0 detected nulls!
OUT_PKTS had 0 detected nulls!
TCP_FLAGS had 0 detected nulls!
CLIENT_TCP_FLAGS had 0 detected nulls!
SERVER_TCP_FLAGS had 0 detected nulls!
FLOW_DURATION_MILLISECONDS had 0 detected nulls!
DURATION_IN had 0 detected nulls!
DURATION_OUT had 0 detected nulls!
MIN_TTL had 0 detected nulls!
MAX_TTL had 0 detected nulls!
LONGEST_FLO

Unnamed: 0,IPV4_SRC_ADDR,L4_SRC_PORT,IPV4_DST_ADDR,L4_DST_PORT,PROTOCOL,L7_PROTO,IN_BYTES,IN_PKTS,OUT_BYTES,OUT_PKTS,...,TCP_WIN_MAX_OUT,ICMP_TYPE,ICMP_IPV4_TYPE,DNS_QUERY_ID,DNS_QUERY_TYPE,DNS_TTL_ANSWER,FTP_COMMAND_RET_CODE,Label,Attack,Dataset
0,192.168.100.148,65389,192.168.100.7,80,6,7.0,420,3,0,0,...,0,35840,140,0,0,0,0.0,1,DoS,NF-BoT-IoT-v2
1,192.168.100.148,11154,192.168.100.5,80,6,7.0,280,2,40,1,...,0,0,0,0,0,0,0.0,1,DoS,NF-BoT-IoT-v2
2,192.168.1.31,42062,192.168.1.79,1041,6,0.0,44,1,40,1,...,0,0,0,0,0,0,0.0,0,Benign,NF-ToN-IoT-v2
3,192.168.1.34,46849,192.168.1.79,9110,6,0.0,44,1,40,1,...,0,0,0,0,0,0,0.0,0,Benign,NF-ToN-IoT-v2
4,192.168.1.30,50360,192.168.1.152,1084,6,0.0,44,1,40,1,...,0,0,0,0,0,0,0.0,0,Benign,NF-ToN-IoT-v2
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
75987971,192.168.1.30,58016,192.168.1.184,443,6,91.0,216,4,112,2,...,28960,0,0,0,0,0,0.0,1,DDoS,NF-ToN-IoT-v2
75987972,88.250.221.108,62082,172.31.64.108,445,6,0.0,585,5,344,4,...,8192,0,0,0,0,0,0.0,0,Benign,NF-CSE-CIC-IDS2018-v2
75987973,192.168.100.150,30365,192.168.100.3,80,17,188.0,56,2,0,0,...,0,0,0,0,0,0,0.0,1,DDoS,NF-BoT-IoT-v2
75987974,109.201.152.18,57501,172.31.64.38,3389,6,0.0,1476,8,1873,7,...,64000,0,0,0,0,0,0.0,0,Benign,NF-CSE-CIC-IDS2018-v2
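
The checks listed at the top of the diagnosis output can be approximated with plain pandas and NumPy. The sketch below is an illustration, not the code behind `w.diag_missing_values`. The function name and report layout are assumptions, and it expects columns with ordinary NumPy or string dtypes.

```python
import numpy as np
import pandas as pd

BAD_STRINGS = ["", " ", "?", "None", "nan", "NULL"]  # soft nulls listed in the log
BAD_IP = "0.0.0.0"                                   # placeholder address from the log
NONSENSE_VALUE = -1                                  # sentinel value from the log


def diagnose(df: pd.DataFrame) -> dict:
    """Count duplicate rows and, per column, the suspicious values found."""
    columns = {}
    for name in df.columns:
        col = df[name]
        counts = {"hard_nulls": int(col.isna().sum())}
        if pd.api.types.is_numeric_dtype(col):
            counts["inf"] = int(np.isinf(col).sum())
            counts["nonsense"] = int((col == NONSENSE_VALUE).sum())
        else:
            counts["bad_strings"] = int(col.isin(BAD_STRINGS).sum())
            counts["bad_ips"] = int((col == BAD_IP).sum())
        # Keep only the checks that found something, mirroring the log's style.
        columns[name] = {check: n for check, n in counts.items() if n}
    return {"duplicate_rows": int(df.duplicated().sum()), "columns": columns}
```

A column with an empty dictionary in the report corresponds to a "had 0 detected nulls" line in the output above.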


In [3]:
# Prepare the DataFrame for exploration by handling the results of the diagnosis above
# Note: the diagnosis found nothing except bad IPs (0.0.0.0). These are kept because they may correlate with malicious activity.
# This creates prepared_full.parquet
w.handle_missing_values(df)

Attempting to create the prepared dataset as prepared_full.parquet...
Successfully created the prepared dataset prepared_full.parquet


In [4]:
# Create prepared_short.parquet as well
short_df = pd.read_parquet('raw_short.parquet')
w.handle_missing_values(short_df, filename='prepared_short.parquet')

Attempting to create the prepared dataset as prepared_short.parquet...
Successfully created the prepared dataset prepared_short.parquet


In [5]:
# Clean up raw_full.csv and raw_short.csv
w.remove_csv_files()

Attempting to delete raw_full.csv...
Successfully removed raw_full.csv!
raw_short.csv not detected, moving to next file...


In [6]:
# Clean up the raw Parquet files as well, for consistency and to avoid keeping duplicate copies of the data
parquet_files = ['raw_full.parquet', 'raw_short.parquet']
w.remove_csv_files(filenames=parquet_files)

Attempting to delete raw_full.parquet...
Successfully removed raw_full.parquet!
Attempting to delete raw_short.parquet...
Successfully removed raw_short.parquet!
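
The cleanup behaviour seen in the two cells above (delete each listed file, skip the ones that are missing) can be sketched with `pathlib`. The function name and return value are assumptions. The default filenames are taken from the log output of `w.remove_csv_files()`.

```python
from pathlib import Path


def remove_files(filenames=("raw_full.csv", "raw_short.csv")):
    """Delete each file that exists and return the names that were removed."""
    removed = []
    for name in filenames:
        path = Path(name)
        if path.exists():
            path.unlink()
            removed.append(str(name))
        # Missing files are skipped silently so the cleanup can be re-run safely.
    return removed
```

Calling it with `['raw_full.parquet', 'raw_short.parquet']` would mirror the second cleanup cell.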


<div style="
    padding: 15px 20px;
    margin: 10px 0;
    border-left: 8px solid #4F7942;
    background-color: rgba(255, 191, 0, 0.05);
    border-radius: 4px;
">

## Wrangle

- [Back to Table of Contents](#table-of-contents)

</div>

In [2]:
# Run the acquire and prepare steps end to end, without the diagnosis
w.main()

raw_full.csv not found, attempting to download the raw dataset...
Downloading to /Users/manupfool/.cache/kagglehub/datasets/aryashah2k/nfuqnidsv2-network-intrusion-detection-dataset/1.archive...


100%|██████████| 2.04G/2.04G [00:28<00:00, 76.0MB/s]

Extracting files...





Path to dataset files: /Users/manupfool/Desktop/University_of_Denver/inprogress_courses/COMP3703/final/raw_full.csv
Successfully downloaded the raw datafile as raw_full.csv!
Attempting to export Pandas Dataframe to raw_full.parquet...
Successfully exported Pandas Dataframe to raw_full.parquet!
raw_full.parquet exists, attempting to create reduced version as raw_short.parquet...
raw_full.parquet read into Pandas DataFrame, checking if Attack exists as a column...
Attack exists, creating reduced Pandas DataFrame...
Successfully reduced raw_full.parquet into raw_short.parquet at 10.0% the original size and stratified over Attack!
Attempting to create the prepared dataset as prepared_full.parquet...
Successfully created the prepared dataset prepared_full.parquet
Attempting to create the prepared dataset as prepared_short.parquet...
Successfully created the prepared dataset prepared_short.parquet

<div style="
    padding: 15px 20px;
    margin: 10px 0;
    border-left: 8px solid #4F7942;
    background-color: rgba(255, 191, 0, 0.05);
    border-radius: 4px;
">

## Miscellaneous

- [Back to Table of Contents](#table-of-contents)

</div>

#### IMPORTANT NOTE
- The diagnosis found nothing in the dataset that needed to be altered: no duplicates and no hard or soft nulls. The bad IPs (0.0.0.0) were kept because they may correlate with malicious activity.
- As a result, prepared_full.parquet and prepared_short.parquet are currently unaltered copies of the raw data. The "prepared" name is merely a formality.