# **1. Introduction**

# NetTrafficGuard EDA

*This notebook is designed to perform Exploratory Data Analysis (EDA) on network traffic datasets. The goal is to understand the structure of the data, identify patterns, and prepare the data for further analysis and modeling. We will follow a structured approach to EDA, including generating questions, applying visualization techniques, transforming and modeling data, and refining our analysis based on insights.*


# **2. Import Libraries**


In [13]:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os

# Set plotting aesthetics
sns.set(style="whitegrid")


# **3. Load Data**



In [14]:
# Define paths to datasets (raw strings avoid invalid escapes such as "\K")
cicids_path = r'E:\Hackatone Project\NetTrafficGuard\data\raw\CICIDS2017\Darknet.csv'
kddcup_path = r'E:\Hackatone Project\NetTrafficGuard\data\raw\KDDCup1999\kddcup.data_10_percent_corrected'
nsl_kdd_path = r'E:\Hackatone Project\NetTrafficGuard\data\raw\NSL-KDD\KDDTest+.csv'

# Load datasets
cicids_data = pd.read_csv(cicids_path)
kddcup_data = pd.read_csv(kddcup_path, header=None)
nsl_kdd_data = pd.read_csv(nsl_kdd_path)

# Display the first few rows of each dataset
cicids_data.head(), kddcup_data.head(), nsl_kdd_data.head()


(                                      Flow ID         Src IP  Src Port  \
 0     10.152.152.11-216.58.220.99-57158-443-6  10.152.152.11     57158   
 1     10.152.152.11-216.58.220.99-57159-443-6  10.152.152.11     57159   
 2     10.152.152.11-216.58.220.99-57160-443-6  10.152.152.11     57160   
 3    10.152.152.11-74.125.136.120-49134-443-6  10.152.152.11     49134   
 4  10.152.152.11-173.194.65.127-34697-19305-6  10.152.152.11     34697   
 
            Dst IP  Dst Port  Protocol               Timestamp  Flow Duration  \
 0   216.58.220.99       443         6  24/07/2015 04:09:48 PM            229   
 1   216.58.220.99       443         6  24/07/2015 04:09:48 PM            407   
 2   216.58.220.99       443         6  24/07/2015 04:09:48 PM            431   
 3  74.125.136.120       443         6  24/07/2015 04:09:48 PM            359   
 4  173.194.65.127     19305         6  24/07/2015 04:09:45 PM       10778451   
 
    Total Fwd Packet  Total Bwd packets  ...  Active Mean  A

# **4. Generate Questions**


Generating questions is a crucial step in exploratory data analysis (EDA). It involves formulating specific queries that guide the analysis and help uncover insights from the data. Below are some fundamental questions that guide our EDA for the NetTrafficGuard project:

### 4.1 What are the basic statistics and structure of each dataset?

Understanding the basic statistics and structure of each dataset provides a foundation for further analysis. This includes:

- **Descriptive Statistics:** Mean, median, mode, standard deviation, minimum, and maximum values for numerical features.
- **Data Types:** Identifying the type of each feature (e.g., numerical, categorical, date/time).
- **Shape and Size:** Number of rows and columns in the dataset.
- **Data Distribution:** Distribution of values in key columns to understand their range and skewness.

*Example Questions:*
- What is the distribution of numerical features in the dataset?
- How many features are there, and what types are they?
- Are there any unexpected data types or formats in the dataset?
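
The structure questions above can be answered with a few pandas calls. The following is a minimal sketch on a toy frame; `summarize_structure` and the `demo` values are hypothetical helpers for illustration, not part of the project code.

```python
import pandas as pd

def summarize_structure(df: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column: dtype, non-null count, and unique count."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "non_null": df.notna().sum(),
        "n_unique": df.nunique(),
    })

# Toy frame standing in for a loaded dataset (values and labels are made up)
demo = pd.DataFrame({
    "Protocol": [6, 6, 17, 6],
    "Flow Duration": [229, 407, 431, 359],
    "Label": ["benign", "attack", "benign", "benign"],
})

summary = summarize_structure(demo)
print(demo.shape)        # (rows, columns)
print(summary)           # per-column structure
print(demo.describe())   # descriptive statistics for numeric columns
```

The same call pattern should apply to `cicids_data`, `kddcup_data`, and `nsl_kdd_data` once they are loaded.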

### 4.2 Are there any missing values in the datasets?

Identifying missing values is essential to ensure the completeness of the dataset. This includes:

- **Missing Value Counts:** Number of missing values per column.
- **Patterns of Missing Data:** Are missing values randomly distributed, or do they follow a specific pattern?
- **Handling Strategies:** Methods to address missing values, such as imputation or removal.

*Example Questions:*
- Which columns have missing values, and what percentage of data is missing?
- Are there specific rows or columns with disproportionately high amounts of missing data?
- What strategies can be applied to handle missing data in different columns?
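
To answer the first of these questions, it helps to report missing values as a percentage rather than a raw count. A minimal sketch, assuming a hypothetical helper `missing_report` and a made-up `demo` frame:

```python
import numpy as np
import pandas as pd

def missing_report(df: pd.DataFrame) -> pd.DataFrame:
    """Count and percentage of missing values, for columns that have any."""
    counts = df.isna().sum()
    report = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(df) * 100).round(2),
    })
    # Keep only affected columns, worst first
    return report[report["missing"] > 0].sort_values("missing", ascending=False)

# Toy data: "a" is 50% missing, "c" is 25% missing, "b" is complete
demo = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.nan],
    "b": [1, 2, 3, 4],
    "c": [np.nan, 2.0, 3.0, 4.0],
})

report = missing_report(demo)
print(report)
```

A percentage makes it easier to decide between dropping a column and imputing it, since the right choice usually depends on how much of the column is affected.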

### 4.3 What are the key features in each dataset, and how do they correlate with each other?

Understanding the key features and their relationships helps to identify patterns and insights. This includes:

- **Feature Importance:** Determining which features are most relevant for the analysis or model.
- **Correlation Analysis:** Examining the correlation between numerical features to identify potential relationships.
- **Categorical Feature Analysis:** Analyzing distributions and relationships among categorical features.

*Example Questions:*
- Which features have the highest correlation with the target variable (if applicable)?
- Are there any multicollinearity issues among numerical features?
- How do categorical features impact numerical features, if at all?
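
One simple way to screen for multicollinearity is to list feature pairs whose absolute Pearson correlation exceeds a threshold. The sketch below is illustrative only; `high_correlation_pairs`, the 0.9 threshold, and the `demo` columns are assumptions.

```python
import numpy as np
import pandas as pd

def high_correlation_pairs(df: pd.DataFrame, threshold: float = 0.9) -> pd.Series:
    """Return numeric feature pairs with |Pearson r| above the threshold."""
    corr = df.select_dtypes(include=[np.number]).corr().abs()
    # Keep only the upper triangle so each pair is listed once
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack()
    return pairs[pairs > threshold].sort_values(ascending=False)

# Toy data: y is a linear function of x, z is unrelated
demo = pd.DataFrame({
    "x": [1, 2, 3, 4, 5],
    "y": [3, 5, 7, 9, 11],
    "z": [5, 1, 4, 2, 3],
})

pairs = high_correlation_pairs(demo)
print(pairs)
```

Pairs returned here are candidates for dropping one member or combining the two; the threshold is a judgment call rather than a fixed rule.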

### 4.4 Are there any patterns or anomalies in the data?

Identifying patterns and anomalies can reveal underlying trends or issues within the dataset. This includes:

- **Pattern Detection:** Recognizing regular patterns or trends within the data.
- **Anomaly Detection:** Identifying data points that deviate significantly from the norm, which could indicate errors or outliers.
- **Temporal Analysis:** For time-series data, analyzing trends over time.

*Example Questions:*
- Are there any seasonal or temporal patterns in the dataset?
- Are there outliers or anomalies in key features that need further investigation?
- How do patterns vary across different segments of the data?
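
For temporal analysis, the `Timestamp` strings first need to be parsed. The sketch below assumes the day-first, 12-hour format visible in the CICIDS2017 preview (for example `24/07/2015 04:09:48 PM`); the `demo` rows and the hourly bucketing are illustrative choices.

```python
import pandas as pd

# Toy rows using the timestamp format seen in the dataset preview
demo = pd.DataFrame({
    "Timestamp": [
        "24/07/2015 04:09:48 PM",
        "24/07/2015 04:09:45 PM",
        "24/07/2015 05:30:00 PM",
    ],
})

# Parse with an explicit format; unparseable values become NaT instead of raising
ts = pd.to_datetime(
    demo["Timestamp"], format="%d/%m/%Y %I:%M:%S %p", errors="coerce"
)

# Count flows per hour to expose bursts or quiet periods
flows_per_hour = ts.dt.floor("h").value_counts().sort_index()
print(flows_per_hour)
```

Passing an explicit `format` avoids ambiguous day/month guessing, which matters for dates such as `05/07/2015`.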

### 4.5 How do different features impact the target variable (if applicable)?

Understanding the relationship between features and the target variable is critical for building predictive models. This includes:

- **Feature-Target Analysis:** Assessing how different features influence the target variable.
- **Feature Importance:** Using statistical methods to determine the importance of each feature in predicting the target variable.
- **Interaction Effects:** Exploring interactions between features and their combined impact on the target variable.

*Example Questions:*
- Which features have the most significant impact on the target variable?
- Are there any interaction effects between features that influence the target variable?
- How can feature selection improve the performance of predictive models?
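
As a first, rough look at feature-target relationships, we can compare per-class means of each numeric feature. This is a crude heuristic rather than a formal importance measure; `class_mean_gap`, the `label` column name, and the `demo` values are hypothetical.

```python
import pandas as pd

def class_mean_gap(df: pd.DataFrame, target: str) -> pd.Series:
    """Rank numeric features by the spread of their per-class means,
    scaled by the feature's overall standard deviation."""
    numeric = df.select_dtypes(include="number").columns.drop(target, errors="ignore")
    group_means = df.groupby(target)[list(numeric)].mean()
    spread = group_means.max() - group_means.min()
    # Constant features have zero std and produce NaN or inf here
    return (spread / df[numeric].std()).sort_values(ascending=False)

# Toy data: "duration" separates the classes, "noise" does not
demo = pd.DataFrame({
    "label": ["benign"] * 3 + ["attack"] * 3,
    "duration": [1, 2, 3, 10, 11, 12],
    "noise": [5, 1, 3, 3, 1, 5],
})

gap = class_mean_gap(demo, "label")
print(gap)
```

Features near the top of this ranking are worth plotting per class; features near zero may still matter through interactions, which this measure does not capture.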

Answering these questions gives us a working understanding of each dataset and informs the subsequent analysis and model development.


# **5. Data Cleaning**
<hr>
<p>Data cleaning is a critical part of the data preprocessing pipeline, ensuring that our dataset is accurate, complete, and ready for analysis. Below, we perform a detailed data cleaning process on each dataset, addressing missing values, outliers, and other anomalies.</p>

## **5.1 CICIDS2017 Data Cleaning**

### **5.1.1 Load and Inspect Data**




In [15]:


# Display basic information and initial rows of the dataset
print("CICIDS2017 Data Overview:")
print(cicids_data.info())
print(cicids_data.head())


CICIDS2017 Data Overview:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 158616 entries, 0 to 158615
Data columns (total 85 columns):
 #   Column                      Non-Null Count   Dtype  
---  ------                      --------------   -----  
 0   Flow ID                     158616 non-null  object 
 1   Src IP                      158616 non-null  object 
 2   Src Port                    158616 non-null  int64  
 3   Dst IP                      158616 non-null  object 
 4   Dst Port                    158616 non-null  int64  
 5   Protocol                    158616 non-null  int64  
 6   Timestamp                   158616 non-null  object 
 7   Flow Duration               158616 non-null  int64  
 8   Total Fwd Packet            158616 non-null  int64  
 9   Total Bwd packets           158616 non-null  int64  
 10  Total Length of Fwd Packet  158616 non-null  int64  
 11  Total Length of Bwd Packet  158616 non-null  int64  
 12  Fwd Packet Length Max       158616 non-null  i

### **5.1.2 Detailed Missing Value Analysis**

In [16]:
# Check for missing values
print("\nMissing Values Analysis:")
missing_values_cicids = cicids_data.isnull().sum()
missing_values_summary = missing_values_cicids[missing_values_cicids > 0]

print("Columns with Missing Values:")
print(missing_values_summary)



Missing Values Analysis:
Columns with Missing Values:
Flow Bytes/s    48
dtype: int64


### **5.1.3 Advanced Outlier Detection**

In [17]:
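import numpy as np
from scipy import stats

# Detect outliers using the Z-score method: flag values more than
# 3 standard deviations from their column mean.
# Note: NaN or inf values in a column propagate into all of its z-scores
# (and can trigger NumPy runtime warnings), so such columns report no outliers here.
print("\nOutlier Detection:")
z_scores = np.abs(stats.zscore(cicids_data.select_dtypes(include=[np.number])))
outliers = (z_scores > 3).sum(axis=0)
print("Outliers detected in columns:")
print(outliers[outliers > 0])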



Outlier Detection:


  x = asanyarray(arr - arrmean)


Outliers detected in columns:
Total Fwd Packet                532
Total Bwd packets               472
Total Length of Fwd Packet      327
Total Length of Bwd Packet      259
Fwd Packet Length Max           136
Fwd Packet Length Min          1709
Fwd Packet Length Mean         2228
Fwd Packet Length Std          3050
Bwd Packet Length Max           278
Bwd Packet Length Min          7207
Bwd Packet Length Mean         4954
Bwd Packet Length Std          2543
Flow IAT Mean                  3154
Flow IAT Std                   3341
Flow IAT Max                   2956
Flow IAT Min                   1940
Fwd IAT Mean                   5123
Fwd IAT Std                    4019
Fwd IAT Max                    2991
Fwd IAT Min                    4536
Bwd IAT Mean                   4423
Bwd IAT Std                    4421
Bwd IAT Max                    3329
Bwd IAT Min                    3537
Fwd PSH Flags                 15084
Fwd Header Length               542
Bwd Header Length               46
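
The truncated warning in the output above is consistent with non-finite values reaching the mean and standard deviation computation, although that is an assumption about this data. A minimal pandas-only sketch of a Z-score count that treats `inf` as missing (the helper `zscore_outlier_counts` and the `demo` frame are hypothetical):

```python
import numpy as np
import pandas as pd

def zscore_outlier_counts(df: pd.DataFrame, threshold: float = 3.0) -> pd.Series:
    """Count values with |z| > threshold per numeric column, ignoring NaN and inf."""
    numeric = df.select_dtypes(include=[np.number])
    # Treat +/-inf as missing so they do not poison the mean and std
    numeric = numeric.replace([np.inf, -np.inf], np.nan)
    # Population std (ddof=0); constant columns get no z-score
    std = numeric.std(ddof=0).replace(0, np.nan)
    z = (numeric - numeric.mean()) / std
    counts = (z.abs() > threshold).sum()
    return counts[counts > 0]

# Toy data: one extreme value, one inf, and a constant column
demo = pd.DataFrame({
    "rate": [1.0] * 20 + [100.0, np.inf],
    "const": [7] * 22,
})

counts = zscore_outlier_counts(demo)
print(counts)
```

With this variant, a column containing a handful of `inf` values still gets outlier counts for its finite entries instead of silently reporting none.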

### **5.1.4 Handle Missing Values and Outliers**


In [18]:
# Handle missing values: drop columns with more than 50% missing values.
# dropna keeps a column only if it has at least `thresh` non-null entries,
# so columns with few gaps (e.g. Flow Bytes/s, 48 missing) are kept, NaNs included.
print("\nDropping Columns with > 50% Missing Values:")
threshold = int(len(cicids_data) * 0.5)
# .copy() gives an independent frame, so the assignments below are unambiguous
cicids_data_clean = cicids_data.dropna(thresh=threshold, axis=1).copy()

# Handle outliers: cap each numeric column at its 99th percentile (upper tail only)
print("\nHandling Outliers:")
for col in cicids_data_clean.select_dtypes(include=[np.number]).columns:
    cap_value = cicids_data_clean[col].quantile(0.99)
    cicids_data_clean[col] = cicids_data_clean[col].clip(upper=cap_value)

# Remove duplicates
print("\nRemoving Duplicate Rows:")
cicids_data_clean = cicids_data_clean.drop_duplicates()

# Save cleaned data
print("\nSaving Cleaned CICIDS2017 Data:")
cicids_data_clean.to_csv('E:\\Hackatone Project\\NetTrafficGuard\\data\\processed\\processedCICIDS2017_Cleaned.csv', index=False)



Dropping Columns with > 50% Missing Values:

Handling Outliers:

Removing Duplicate Rows:

Saving Cleaned CICIDS2017 Data:
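
The step above leaves the remaining `NaN` values in place and caps every numeric column, including ports and protocol numbers. One possible refinement is sketched below: impute with the median and skip identifier-like columns when capping. The helper `impute_and_cap`, the `ID_LIKE` list, and the choice of median imputation are assumptions, not the method used in this notebook.

```python
import numpy as np
import pandas as pd

# Assumption: these columns are identifiers, not continuous measurements
ID_LIKE = ["Src Port", "Dst Port", "Protocol"]

def impute_and_cap(df: pd.DataFrame, upper_q: float = 0.99, skip=ID_LIKE) -> pd.DataFrame:
    """Median-impute and upper-cap numeric columns, leaving `skip` columns untouched."""
    out = df.copy()
    numeric_cols = out.select_dtypes(include=[np.number]).columns.difference(skip)
    # Treat +/-inf as missing before computing medians and quantiles
    out[numeric_cols] = out[numeric_cols].replace([np.inf, -np.inf], np.nan)
    for col in numeric_cols:
        out[col] = out[col].fillna(out[col].median())
        out[col] = out[col].clip(upper=out[col].quantile(upper_q))
    return out

# Toy data with one NaN and one inf in the rate column
demo = pd.DataFrame({
    "Dst Port": [443, 443, 19305, 80],
    "Flow Bytes/s": [10.0, np.nan, np.inf, 30.0],
})

cleaned = impute_and_cap(demo)
print(cleaned)
```

Whether capping is appropriate at all depends on the goal: for intrusion detection, extreme values may be the signal of interest rather than noise.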


## **5.2 KDD Cup 1999 Data Cleaning**

### **5.2.1 Load and Inspect Data**

In [19]:


# Display basic information and initial rows of the dataset
print("KDD Cup 1999 Data Overview:")
print(kddcup_data.info())
print(kddcup_data.head())


KDD Cup 1999 Data Overview:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 494021 entries, 0 to 494020
Data columns (total 42 columns):
 #   Column  Non-Null Count   Dtype  
---  ------  --------------   -----  
 0   0       494021 non-null  int64  
 1   1       494021 non-null  object 
 2   2       494021 non-null  object 
 3   3       494021 non-null  object 
 4   4       494021 non-null  int64  
 5   5       494021 non-null  int64  
 6   6       494021 non-null  int64  
 7   7       494021 non-null  int64  
 8   8       494021 non-null  int64  
 9   9       494021 non-null  int64  
 10  10      494021 non-null  int64  
 11  11      494021 non-null  int64  
 12  12      494021 non-null  int64  
 13  13      494021 non-null  int64  
 14  14      494021 non-null  int64  
 15  15      494021 non-null  int64  
 16  16      494021 non-null  int64  
 17  17      494021 non-null  int64  
 18  18      494021 non-null  int64  
 19  19      494021 non-null  int64  
 20  20      494021 non-n
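
Because the file is loaded with `header=None`, the columns are numbered 0 to 41, which makes later output hard to read. The sketch below assigns the 41 feature names plus a `label` column as they are commonly listed for KDD Cup 1999; treat the list as an assumption and verify it against the `kddcup.names` file shipped with the dataset.

```python
import pandas as pd

# Assumed KDD Cup 1999 column names (41 features + label); verify before use
KDD_COLUMNS = [
    "duration", "protocol_type", "service", "flag", "src_bytes", "dst_bytes",
    "land", "wrong_fragment", "urgent", "hot", "num_failed_logins", "logged_in",
    "num_compromised", "root_shell", "su_attempted", "num_root",
    "num_file_creations", "num_shells", "num_access_files", "num_outbound_cmds",
    "is_host_login", "is_guest_login", "count", "srv_count", "serror_rate",
    "srv_serror_rate", "rerror_rate", "srv_rerror_rate", "same_srv_rate",
    "diff_srv_rate", "srv_diff_host_rate", "dst_host_count",
    "dst_host_srv_count", "dst_host_same_srv_rate", "dst_host_diff_srv_rate",
    "dst_host_same_src_port_rate", "dst_host_srv_diff_host_rate",
    "dst_host_serror_rate", "dst_host_srv_serror_rate", "dst_host_rerror_rate",
    "dst_host_srv_rerror_rate", "label",
]

# Placeholder single-row frame standing in for kddcup_data (42 columns)
demo = pd.DataFrame([[0] * len(KDD_COLUMNS)])
demo.columns = KDD_COLUMNS
print(demo.columns[:5].tolist())

# In the notebook, the equivalent step would be:
# kddcup_data.columns = KDD_COLUMNS
```

The positions of `protocol_type`, `service`, and `flag` (columns 1 to 3) match the three `object` columns shown in the `info()` output above.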

### **5.2.2 Detailed Missing Value Analysis**

In [20]:
# Check for missing values
print("\nMissing Values Analysis:")
missing_values_kddcup = kddcup_data.isnull().sum()
missing_values_summary = missing_values_kddcup[missing_values_kddcup > 0]

print("Columns with Missing Values:")
print(missing_values_summary)



Missing Values Analysis:
Columns with Missing Values:
Series([], dtype: int64)


### **5.2.3 Advanced Outlier Detection**

In [23]:
numeric_data = kddcup_data.select_dtypes(include=[np.number])

# Detect outliers using IQR method
print("\nOutlier Detection:")
Q1 = numeric_data.quantile(0.25)
Q3 = numeric_data.quantile(0.75)
IQR = Q3 - Q1
outliers = ((numeric_data < (Q1 - 1.5 * IQR)) | (numeric_data > (Q3 + 1.5 * IQR))).sum()

print("Outliers detected in columns:")
print(outliers[outliers > 0])


Outlier Detection:
Outliers detected in columns:
0      12350
4       4834
5      85763
6         22
7       1238
8          4
9       3192
10        63
11     73237
12      2224
13        55
14        12
15       585
16       265
17        51
18       454
21       685
24     89234
25     88335
26     29073
27     29701
28    111942
29    112000
30     34644
31     61192
34     11223
36     52132
37     94211
38     93076
39     35229
40     34216
dtype: int64
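
Some of the large counts above may come from columns whose IQR is zero (for example binary or mostly-zero features), where every value different from the majority is flagged. The sketch below reports the IQR next to the flagged count so such columns can be spotted; `iqr_outlier_report` and the `demo` columns are hypothetical.

```python
import numpy as np
import pandas as pd

def iqr_outlier_report(df: pd.DataFrame, k: float = 1.5) -> pd.DataFrame:
    """Per numeric column: IQR, number of flagged values, and a zero-IQR marker."""
    numeric = df.select_dtypes(include=[np.number])
    q1, q3 = numeric.quantile(0.25), numeric.quantile(0.75)
    iqr = q3 - q1
    flagged = ((numeric < q1 - k * iqr) | (numeric > q3 + k * iqr)).sum()
    return pd.DataFrame({"iqr": iqr, "flagged": flagged, "zero_iqr": iqr == 0})

# Toy data: a mostly-zero flag column and a column with one genuine extreme
demo = pd.DataFrame({
    "flag_like": [0] * 9 + [1],
    "bytes": [10, 12, 11, 13, 12, 11, 10, 12, 13, 500],
})

report = iqr_outlier_report(demo)
print(report)
```

For columns marked `zero_iqr`, the flagged values are better read as "minority values" than as outliers, and capping or removing them would likely discard real information.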


### **5.2.4 Handle Missing Values and Outliers**