# Merge_CICIDS2017_Dataset
This notebook merges the separate CSV files of the original CICIDS2017 dataset into a single file. It is used to reproduce the code for the paper "[**MTH-IDS: A Multi-Tiered Hybrid Intrusion Detection System for Internet of Vehicles**](https://arxiv.org/pdf/2105.13289.pdf)", accepted in the IEEE Internet of Things Journal.
## Dataset Source
https://www.kaggle.com/datasets/cicdataset/cicids2017?resource=download
## Modifications
- Revise Column Names
- Remove and rename some labels to fit the dataset used in the research
## Reference
[**Intrusion_Detection_Using_CICIDS2017**](https://github.com/arif6008/Intrusion_Detection_Using_CICIDS2017) 

In [1]:
# Path to the directory where the separate dataset files are stored
# NOTE: replace every '\' in a Windows path with '/'
dir_path = 'C:/Users/GameMax/Desktop/OmarProject/Intrusion-Detection-System-Using-Machine-Learning/MachineLearningCVE'
output_name = 'CICIDS2017.csv'

In [2]:
import pandas as pd
import os

## Find Files

In [3]:
# Check the files under the directory
# Expected Output:
# D:/download/archive/MachineLearningCSV/MachineLearningCVE\Friday-WorkingHours-Afternoon-DDos.pcap_ISCX.csv
# D:/download/archive/MachineLearningCSV/MachineLearningCVE\Friday-WorkingHours-Afternoon-PortScan.pcap_ISCX.csv
# D:/download/archive/MachineLearningCSV/MachineLearningCVE\Friday-WorkingHours-Morning.pcap_ISCX.csv
# D:/download/archive/MachineLearningCSV/MachineLearningCVE\Monday-WorkingHours.pcap_ISCX.csv
# D:/download/archive/MachineLearningCSV/MachineLearningCVE\Thursday-WorkingHours-Afternoon-Infilteration.pcap_ISCX.csv
# D:/download/archive/MachineLearningCSV/MachineLearningCVE\Thursday-WorkingHours-Morning-WebAttacks.pcap_ISCX.csv
# D:/download/archive/MachineLearningCSV/MachineLearningCVE\Tuesday-WorkingHours.pcap_ISCX.csv
# D:/download/archive/MachineLearningCSV/MachineLearningCVE\Wednesday-workingHours.pcap_ISCX.csv
for dirname, _, filenames in os.walk(dir_path):
    for filename in filenames:
        print(os.path.join(dirname, filename))

C:/Users/GameMax/Desktop/OmarProject/Intrusion-Detection-System-Using-Machine-Learning/MachineLearningCVE\Friday-WorkingHours-Afternoon-DDos.pcap_ISCX.csv
C:/Users/GameMax/Desktop/OmarProject/Intrusion-Detection-System-Using-Machine-Learning/MachineLearningCVE\Friday-WorkingHours-Afternoon-PortScan.pcap_ISCX.csv
C:/Users/GameMax/Desktop/OmarProject/Intrusion-Detection-System-Using-Machine-Learning/MachineLearningCVE\Friday-WorkingHours-Morning.pcap_ISCX.csv
C:/Users/GameMax/Desktop/OmarProject/Intrusion-Detection-System-Using-Machine-Learning/MachineLearningCVE\Monday-WorkingHours.pcap_ISCX.csv
C:/Users/GameMax/Desktop/OmarProject/Intrusion-Detection-System-Using-Machine-Learning/MachineLearningCVE\Thursday-WorkingHours-Afternoon-Infilteration.pcap_ISCX.csv
C:/Users/GameMax/Desktop/OmarProject/Intrusion-Detection-System-Using-Machine-Learning/MachineLearningCVE\Thursday-WorkingHours-Morning-WebAttacks.pcap_ISCX.csv
C:/Users/GameMax/Desktop/OmarProject/Intrusion-Detection-System-Using-M

In [4]:
file_names=[]
for dirname, _, filenames in os.walk(dir_path):
    for filename in filenames:
        # print(os.path.join(dirname, filename))
        file_names.append(os.path.join(dirname, filename))
# Print file names (should be the same as the previous output)
file_names

['C:/Users/GameMax/Desktop/OmarProject/Intrusion-Detection-System-Using-Machine-Learning/MachineLearningCVE\\Friday-WorkingHours-Afternoon-DDos.pcap_ISCX.csv',
 'C:/Users/GameMax/Desktop/OmarProject/Intrusion-Detection-System-Using-Machine-Learning/MachineLearningCVE\\Friday-WorkingHours-Afternoon-PortScan.pcap_ISCX.csv',
 'C:/Users/GameMax/Desktop/OmarProject/Intrusion-Detection-System-Using-Machine-Learning/MachineLearningCVE\\Friday-WorkingHours-Morning.pcap_ISCX.csv',
 'C:/Users/GameMax/Desktop/OmarProject/Intrusion-Detection-System-Using-Machine-Learning/MachineLearningCVE\\Monday-WorkingHours.pcap_ISCX.csv',
 'C:/Users/GameMax/Desktop/OmarProject/Intrusion-Detection-System-Using-Machine-Learning/MachineLearningCVE\\Thursday-WorkingHours-Afternoon-Infilteration.pcap_ISCX.csv',
 'C:/Users/GameMax/Desktop/OmarProject/Intrusion-Detection-System-Using-Machine-Learning/MachineLearningCVE\\Thursday-WorkingHours-Morning-WebAttacks.pcap_ISCX.csv',
 'C:/Users/GameMax/Desktop/OmarProject/In
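
The directory walk above collects every file it finds, whatever its type. As a minimal sketch, assuming only `.csv` files should be merged and that a stable order is wanted, the listing could be filtered and sorted with `pathlib`. The helper name `list_csv_files` and the toy file names are hypothetical and not part of the original notebook.

```python
from pathlib import Path
import tempfile


def list_csv_files(directory):
    """Return the sorted paths of all .csv files under `directory` (recursive)."""
    return sorted(str(p) for p in Path(directory).rglob("*.csv"))


# Small demonstration on a throwaway directory (file names are made up).
with tempfile.TemporaryDirectory() as tmp:
    for name in ["b.csv", "a.csv", "notes.txt"]:
        (Path(tmp) / name).write_text("x\n1\n")
    found = [Path(p).name for p in list_csv_files(tmp)]

print(found)  # the .txt file is skipped and the names come back sorted
```

Sorting makes the merge order reproducible across machines, since `os.walk` does not guarantee any particular file order.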

## Merge the datasets

In [5]:
# Read the first dataset
df = pd.read_csv(file_names[0])

In [6]:
# Check dataset
df.head()

Unnamed: 0,Destination Port,Flow Duration,Total Fwd Packets,Total Backward Packets,Total Length of Fwd Packets,Total Length of Bwd Packets,Fwd Packet Length Max,Fwd Packet Length Min,Fwd Packet Length Mean,Fwd Packet Length Std,...,min_seg_size_forward,Active Mean,Active Std,Active Max,Active Min,Idle Mean,Idle Std,Idle Max,Idle Min,Label
0,54865,3,2,0,12,0,6,6,6.0,0.0,...,20,0.0,0.0,0,0,0.0,0.0,0,0,BENIGN
1,55054,109,1,1,6,6,6,6,6.0,0.0,...,20,0.0,0.0,0,0,0.0,0.0,0,0,BENIGN
2,55055,52,1,1,6,6,6,6,6.0,0.0,...,20,0.0,0.0,0,0,0.0,0.0,0,0,BENIGN
3,46236,34,1,1,6,6,6,6,6.0,0.0,...,20,0.0,0.0,0,0,0.0,0.0,0,0,BENIGN
4,54863,3,2,0,12,0,6,6,6.0,0.0,...,20,0.0,0.0,0,0,0.0,0.0,0,0,BENIGN


In [7]:
# Merge the remaining datasets
for i in range(1, len(file_names)):
    # Read the dataset
    temp = pd.read_csv(file_names[i])
    # Concatenate the two dataframes
    df = pd.concat([df, temp])
    # Release memory
    del temp
print(df.shape)

(2830743, 79)
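
An alternative to growing the dataframe one file at a time is to read all parts first and concatenate them once. The sketch below assumes every part has identical columns and checks this before merging. The helper `merge_csv_sources` and the two in-memory CSV parts are hypothetical stand-ins for the real files.

```python
import io

import pandas as pd


def merge_csv_sources(sources):
    """Read every CSV source and concatenate them into one dataframe."""
    frames = [pd.read_csv(src) for src in sources]
    expected = list(frames[0].columns)
    # Fail early if any part does not share the first part's columns.
    for position, frame in enumerate(frames[1:], start=1):
        if list(frame.columns) != expected:
            raise ValueError(f"source {position} has different columns")
    # ignore_index=True renumbers the rows 0..n-1 across all parts.
    return pd.concat(frames, ignore_index=True)


# Two tiny made-up parts standing in for the per-day CSV files.
part_a = io.StringIO("Flow Duration,Label\n3,BENIGN\n109,BENIGN\n")
part_b = io.StringIO("Flow Duration,Label\n52,DDoS\n")
merged = merge_csv_sources([part_a, part_b])
print(merged.shape)
```

Note that `ignore_index=True` is a design choice of this sketch: the loop in the notebook keeps each file's own row index, so its index values repeat across files.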


In [8]:
# Check result
df.head()

Unnamed: 0,Destination Port,Flow Duration,Total Fwd Packets,Total Backward Packets,Total Length of Fwd Packets,Total Length of Bwd Packets,Fwd Packet Length Max,Fwd Packet Length Min,Fwd Packet Length Mean,Fwd Packet Length Std,...,min_seg_size_forward,Active Mean,Active Std,Active Max,Active Min,Idle Mean,Idle Std,Idle Max,Idle Min,Label
0,54865,3,2,0,12,0,6,6,6.0,0.0,...,20,0.0,0.0,0,0,0.0,0.0,0,0,BENIGN
1,55054,109,1,1,6,6,6,6,6.0,0.0,...,20,0.0,0.0,0,0,0.0,0.0,0,0,BENIGN
2,55055,52,1,1,6,6,6,6,6.0,0.0,...,20,0.0,0.0,0,0,0.0,0.0,0,0,BENIGN
3,46236,34,1,1,6,6,6,6,6.0,0.0,...,20,0.0,0.0,0,0,0.0,0.0,0,0,BENIGN
4,54863,3,2,0,12,0,6,6,6.0,0.0,...,20,0.0,0.0,0,0,0.0,0.0,0,0,BENIGN


## Modification

In [9]:
# Revise column names: strip surrounding whitespace
# (slicing off the first character would truncate names that have no leading space)
df.columns = df.columns.str.strip()
df.columns

Index(['Destination Port', 'Flow Duration', 'Total Fwd Packets',
       'Total Backward Packets', 'Total Length of Fwd Packets',
       'Total Length of Bwd Packets', 'Fwd Packet Length Max',
       'Fwd Packet Length Min', 'Fwd Packet Length Mean',
       'Fwd Packet Length Std', 'Bwd Packet Length Max',
       'Bwd Packet Length Min', 'Bwd Packet Length Mean',
       'Bwd Packet Length Std', 'Flow Bytes/s', 'Flow Packets/s',
       'Flow IAT Mean', 'Flow IAT Std', 'Flow IAT Max', 'Flow IAT Min',
       'Fwd IAT Total', 'Fwd IAT Mean', 'Fwd IAT Std', 'Fwd IAT Max',
       'Fwd IAT Min', 'Bwd IAT Total', 'Bwd IAT Mean', 'Bwd IAT Std',
       'Bwd IAT Max', 'Bwd IAT Min', 'Fwd PSH Flags', 'Bwd PSH Flags',
       'Fwd URG Flags', 'Bwd URG Flags', 'Fwd Header Length',
       'Bwd Header Length', 'Fwd Packets/s', 'Bwd Packets/s',
       'Min Packet Length', 'Max Packet Length', 'Packet Length Mean',
       'Packet Length Std', 'Packet Length Variance', 'FIN Flag Count',
       'SYN Flag Count', 'R
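
Dropping the first character and stripping whitespace give different results when only some column names carry a leading space. The following sketch compares the two on a made-up subset of names (the toy frame is an assumption and does not reproduce all 79 columns) and checks that the cleaned names stay unique.

```python
import pandas as pd

# Made-up subset of column names: some carry a leading space, one does not.
raw_names = [" Destination Port", " Flow Duration",
             "Total Length of Fwd Packets", " Label"]
toy = pd.DataFrame(columns=raw_names)

# Dropping the first character damages names without a leading space.
sliced = [name[1:] for name in toy.columns]

# Stripping whitespace leaves such names intact.
stripped = toy.columns.str.strip()

print(sliced)
print(list(stripped))
print(stripped.is_unique)  # duplicate names would make column lookups ambiguous
```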

In [10]:
# Display labels
df.Label.value_counts()

BENIGN                        2273097
DoS Hulk                       231073
PortScan                       158930
DDoS                           128027
DoS GoldenEye                   10293
FTP-Patator                      7938
SSH-Patator                      5897
DoS slowloris                    5796
DoS Slowhttptest                 5499
Bot                              1966
Web Attack � Brute Force         1507
Web Attack � XSS                  652
Infiltration                       36
Web Attack � Sql Injection         21
Heartbleed                         11
Name: Label, dtype: int64

In [11]:
# Rename labels
name_dict = {
    'DoS Hulk': 'DoS',
    'DDoS': 'DoS',
    'DoS GoldenEye': 'DoS',
    'DoS slowloris': 'DoS',
    'DoS Slowhttptest': 'DoS',
    'Heartbleed': 'DoS',
    'Web Attack � Brute Force': 'WebAttack',
    'Web Attack � XSS': 'WebAttack',
    'Web Attack � Sql Injection': 'WebAttack',
    'FTP-Patator': 'BruteForce',
    'SSH-Patator': 'BruteForce',
}
# Map each listed label to its merged class; labels not in the dict are unchanged
df['Label'] = df['Label'].replace(name_dict)
# Display
df.Label.value_counts()

BENIGN          2273097
DoS              380699
PortScan         158930
BruteForce        13835
WebAttack          2180
Bot                1966
Infiltration         36
Name: Label, dtype: int64
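
The label merge can be illustrated on a handful of rows. This is a minimal sketch with a shortened, made-up mapping (`label_map`) and toy labels. It uses `Series.replace`, which leaves any label that is not a key of the dictionary untouched.

```python
import pandas as pd

# Shortened mapping for illustration only; the notebook's dict has more entries.
label_map = {
    "DoS Hulk": "DoS",
    "DDoS": "DoS",
    "FTP-Patator": "BruteForce",
    "SSH-Patator": "BruteForce",
}

labels = pd.Series(
    ["BENIGN", "DoS Hulk", "DDoS", "SSH-Patator", "PortScan"], name="Label"
)

# Labels found in the mapping are renamed; all others pass through unchanged.
renamed = labels.replace(label_map)
counts = renamed.value_counts()
print(counts)
```

Comparing `value_counts()` before and after the rename is a quick way to confirm that the merged class totals equal the sum of the original classes.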

## Output
By default, this notebook writes a new .csv file named CICIDS2017.csv to the same directory as this notebook.

In [12]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2830743 entries, 0 to 692702
Data columns (total 79 columns):
 #   Column                       Dtype  
---  ------                       -----  
 0   Destination Port             int64  
 1   Flow Duration                int64  
 2   Total Fwd Packets            int64  
 3   Total Backward Packets       int64  
 4   Total Length of Fwd Packets  int64  
 5   Total Length of Bwd Packets  int64  
 6   Fwd Packet Length Max        int64  
 7   Fwd Packet Length Min        int64  
 8   Fwd Packet Length Mean       float64
 9   Fwd Packet Length Std        float64
 10  Bwd Packet Length Max        int64  
 11  Bwd Packet Length Min        int64  
 12  Bwd Packet Length Mean       float64
 13  Bwd Packet Length Std        float64
 14  Flow Bytes/s                 float64
 15  Flow Packets/s               float64
 16  Flow IAT Mean                float64
 17  Flow IAT Std                 float64
 18  Flow IAT Max                 int64  
 19  F

In [13]:
df.to_csv(output_name, index = False, header=True)
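
After writing, it can be worth reading the file back to confirm that the shape and column names survived. The sketch below does this round trip on a tiny made-up frame in a temporary directory. The file name `toy.csv` and the data are assumptions, and the repeated index values mimic a frame built by concatenating several files.

```python
import os
import tempfile

import pandas as pd

# Toy frame with a repeated index, as produced by concatenating several files.
frame = pd.DataFrame(
    {"Flow Duration": [3, 109], "Label": ["BENIGN", "DoS"]}, index=[0, 0]
)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy.csv")
    # index=False keeps the repeated index out of the file.
    frame.to_csv(path, index=False, header=True)
    restored = pd.read_csv(path)

print(restored.shape)
print(list(restored.columns))
```

Because `index=False` omits the index column, the restored frame gets a fresh 0-based index and no extra unnamed column.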