

# **Week 5: Working with Cybersecurity Datasets**

## **Table of Contents**

1. [Introduction to Cybersecurity Datasets](#introduction-to-cybersecurity-datasets)
2. [NSL-KDD Dataset](#nsl-kdd-dataset)
3. [CICIDS Dataset](#cicids-dataset)
4. [Real-World Datasets](#real-world-datasets)
5. [Data Preprocessing Techniques](#data-preprocessing-techniques)
6. [Hands-on: Feature Extraction for Security Analysis](#hands-on-feature-extraction-for-security-analysis)

---

## **1. Introduction to Cybersecurity Datasets**

Data science and machine learning in **cybersecurity** rely heavily on datasets that contain network traffic, user behavior, and security logs. These datasets allow researchers and practitioners to detect threats, identify anomalies, and build Intrusion Detection Systems (IDS).

In this class, we will work with three sources of cybersecurity data:

* **NSL-KDD Dataset**
* **CICIDS Dataset**
* **Real-world cybersecurity datasets**

We'll also apply **data preprocessing techniques** to prepare the data for **feature extraction** and **analysis**.

---

## **2. NSL-KDD Dataset**

The **NSL-KDD** dataset is a refined version of the **KDD Cup 1999** dataset that removes its redundant and duplicate records. It is widely used for training machine learning models to detect network intrusions. Each record describes one network connection through features such as **duration**, **protocol type**, and **service**, together with a **label** (normal or a specific attack).

### Key Features:

* **Duration**: Connection duration in seconds.
* **Protocol Type**: The protocol used (e.g., TCP, UDP, ICMP).
* **Service**: The service or application involved (e.g., HTTP, FTP).
* **Label**: The classification label (normal or one of many attack types like **DoS**, **Probe**, **U2R**, **R2L**).
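
The raw labels name individual attacks (for example `neptune` or `satan`) rather than the four categories listed above. The sketch below shows one way to map them; the dictionary covers the attack names commonly found in the NSL-KDD training set, while the helper name `to_category` and the toy label series are assumptions made for illustration.

```python
import pandas as pd

# Mapping from raw NSL-KDD training labels to the four attack categories.
# Attack names that only occur in the test set are not listed here.
ATTACK_CATEGORY = {
    'back': 'DoS', 'land': 'DoS', 'neptune': 'DoS',
    'pod': 'DoS', 'smurf': 'DoS', 'teardrop': 'DoS',
    'ipsweep': 'Probe', 'nmap': 'Probe', 'portsweep': 'Probe', 'satan': 'Probe',
    'buffer_overflow': 'U2R', 'loadmodule': 'U2R', 'perl': 'U2R', 'rootkit': 'U2R',
    'ftp_write': 'R2L', 'guess_passwd': 'R2L', 'imap': 'R2L', 'multihop': 'R2L',
    'phf': 'R2L', 'spy': 'R2L', 'warezclient': 'R2L', 'warezmaster': 'R2L',
}


def to_category(label: str) -> str:
    """Return 'normal', one of the four categories, or 'unknown'."""
    if label == 'normal':
        return 'normal'
    return ATTACK_CATEGORY.get(label, 'unknown')


# Toy labels standing in for the real `label` column
labels = pd.Series(['normal', 'neptune', 'satan', 'guess_passwd', 'rootkit', 'normal'])
categories = labels.map(to_category)
print(categories.value_counts())
```

Labels that are missing from the dictionary fall back to `'unknown'`, which makes it easy to spot attack types that still need a category.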

### Loading the NSL-KDD Dataset

```python
import pandas as pd

# Load the NSL-KDD training set. The file has no header row.
# (The official distribution names it KDDTrain+.txt; adjust the path to your copy.)
df_nsl = pd.read_csv('KDDTrain+.csv', header=None)

# Assign column names: 41 features, the class label, and the difficulty score.
# Some redistributed copies omit the final 'difficulty' column; drop that name
# if your file has only 42 columns.
df_nsl.columns = ['duration', 'protocol_type', 'service', 'flag', 'src_bytes', 'dst_bytes', 
                  'land', 'wrong_fragment', 'urgent', 'hot', 'num_failed_logins', 'logged_in', 
                  'num_compromised', 'root_shell', 'su_attempted', 'num_root', 
                  'num_file_creations', 'num_shells', 'num_access_files', 'num_outbound_cmds', 
                  'is_host_login', 'is_guest_login', 
                  'count', 'srv_count', 'serror_rate', 'srv_serror_rate', 'rerror_rate', 
                  'srv_rerror_rate', 'same_srv_rate', 'diff_srv_rate', 'srv_diff_host_rate', 
                  'dst_host_count', 'dst_host_srv_count', 'dst_host_same_srv_rate', 
                  'dst_host_diff_srv_rate', 'dst_host_same_src_port_rate', 
                  'dst_host_srv_diff_host_rate', 'dst_host_serror_rate', 'dst_host_srv_serror_rate', 
                  'dst_host_rerror_rate', 'dst_host_srv_rerror_rate', 'label', 'difficulty']

# Check the first few rows
df_nsl.head()
```

### Example: Analyzing the NSL-KDD Dataset

We can preprocess this dataset by encoding the categorical features `protocol_type`, `service`, and `flag`.

```python
# One-hot encoding for categorical variables
df_nsl = pd.get_dummies(df_nsl, columns=['protocol_type', 'service', 'flag'])

# Check the first few rows after encoding
df_nsl.head()
```
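
One practical catch with one-hot encoding is that a separate test file may contain categories the training file lacks, and vice versa, so the two encoded frames end up with different columns. A minimal sketch of aligning them with `reindex` follows; the two small frames are made up for illustration and stand in for the real training and test sets.

```python
import pandas as pd

# Toy frames: 'icmp' appears only in the test data, 'udp' only in training
train = pd.DataFrame({'duration': [0, 2, 5], 'protocol_type': ['tcp', 'udp', 'tcp']})
test = pd.DataFrame({'duration': [1, 3], 'protocol_type': ['icmp', 'tcp']})

train_enc = pd.get_dummies(train, columns=['protocol_type'])
test_enc = pd.get_dummies(test, columns=['protocol_type'])

# Reindex the test frame to the training columns: categories unseen in training
# are dropped, and categories absent from the test data are filled with 0.
test_enc = test_enc.reindex(columns=train_enc.columns, fill_value=0)

print(list(train_enc.columns))
print(list(test_enc.columns))
```

After alignment both frames have identical columns in the same order, which is what a model trained on the first frame expects at prediction time.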

---

## **3. CICIDS Dataset**

The **CICIDS** datasets (for example CICIDS2017), published by the Canadian Institute for Cybersecurity (CIC), are another widely used benchmark for intrusion detection research. They contain labeled network flows covering both **benign** and **malicious** traffic.

### Key Features:

* **Flow-level features** (flow duration, packet and byte counts, packet inter-arrival times, etc.) that play a role similar to the connection features in NSL-KDD.
* Attack categories such as **DDoS**, **Botnet**, **Brute Force**, and **Port Scanning**.
* The dataset provides realistic network traffic, including modern cyber threats.

### Loading the CICIDS Dataset

```python
# Example of loading a CICIDS CSV file
# ('cicids_2021.csv' is a placeholder name; point it at your local copy)
df_cicids = pd.read_csv('cicids_2021.csv')

# Check the first few rows
df_cicids.head()
```
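
Before preprocessing, it helps to check how the labels are distributed, since benign flows usually dominate. The sketch below uses a small synthetic frame; the `Label` column name and the label strings are assumptions modelled on CICIDS2017 exports, and `df_demo` and `is_attack` are names chosen for this example.

```python
import pandas as pd

# Synthetic stand-in for a loaded CICIDS frame (values are made up)
df_demo = pd.DataFrame({
    'Flow Duration': [120, 95, 40, 3000, 15, 60, 88, 7],
    'Label': ['BENIGN', 'BENIGN', 'DDoS', 'BENIGN',
              'PortScan', 'BENIGN', 'BENIGN', 'DDoS'],
})

# Share of each class in the data
class_share = df_demo['Label'].value_counts(normalize=True)
print(class_share)

# Binary target: 0 = benign, 1 = any kind of attack
df_demo['is_attack'] = (df_demo['Label'] != 'BENIGN').astype(int)
print('attack share:', df_demo['is_attack'].mean())
```

A strongly skewed distribution here is an early signal that the imbalance techniques discussed later in these notes will be needed.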

### Preprocessing the CICIDS Dataset

You can handle missing values, normalize numeric features, and apply **one-hot encoding** for categorical features, similar to the NSL-KDD dataset.

```python
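import numpy as np
from sklearn.preprocessing import StandardScaler

# CICIDS CSV exports often carry stray whitespace in their column names
df_cicids.columns = df_cicids.columns.str.strip()

# Handle missing values; rate columns can also contain infinite values,
# which StandardScaler rejects, so treat them as missing first
df_cicids = df_cicids.replace([np.inf, -np.inf], np.nan).fillna(0)

# Normalize all numeric features
# (CICIDS has its own column names, so the NSL-KDD names cannot be reused here)
numeric_cols = df_cicids.select_dtypes(include='number').columns
scaler = StandardScaler()
df_cicids[numeric_cols] = scaler.fit_transform(df_cicids[numeric_cols])

# One-hot encode any non-numeric columns except the label
# (the label column is assumed to be named 'Label')
categorical_cols = [c for c in df_cicids.select_dtypes(exclude='number').columns if c != 'Label']
if categorical_cols:
    df_cicids = pd.get_dummies(df_cicids, columns=categorical_cols)

# Check the first few rows after preprocessing
df_cicids.head()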
```

---

## **4. Real-World Datasets**

In addition to **NSL-KDD** and **CICIDS**, real-world cybersecurity datasets, such as **network traffic logs** from enterprise systems or **publicly available datasets**, can be valuable for analyzing network behavior and detecting **anomalies**.

### Examples of Real-World Datasets:

* **Flow-based Data**: Data representing network traffic, including bytes transferred and flow duration.
* **System Logs**: Logs from firewalls, intrusion detection systems, or routers.
* **Threat Intelligence Data**: Information on known cyber threats or attack patterns.

Once parsed into tabular form, these datasets can often be processed with the same steps used for **NSL-KDD** or **CICIDS**.
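
Raw logs usually have to be turned into columns before any of the usual preprocessing applies. The following is a minimal sketch of that step, assuming a made-up firewall log format: the line layout, the field names in `LOG_PATTERN`, and the sample lines are all hypothetical, and real devices will need their own pattern.

```python
import re
import pandas as pd

# Hypothetical firewall log format, invented for this example:
# <timestamp> <ALLOW|DENY> <protocol> <src_ip>:<src_port> -> <dst_ip>:<dst_port> bytes=<n>
LOG_PATTERN = re.compile(
    r'(?P<timestamp>\S+) (?P<action>ALLOW|DENY) (?P<protocol>\w+) '
    r'(?P<src_ip>[\d.]+):(?P<src_port>\d+) -> '
    r'(?P<dst_ip>[\d.]+):(?P<dst_port>\d+) bytes=(?P<bytes>\d+)'
)

raw_lines = [
    '2024-01-15T10:32:01 DENY TCP 10.0.0.5:51234 -> 192.168.1.10:22 bytes=120',
    '2024-01-15T10:32:02 ALLOW UDP 10.0.0.7:5353 -> 192.168.1.1:53 bytes=64',
    'malformed line that does not match the pattern',
    '2024-01-15T10:32:05 DENY TCP 10.0.0.5:51240 -> 192.168.1.10:22 bytes=118',
]

# Keep only the lines that match; malformed lines are skipped
records = [m.groupdict() for line in raw_lines if (m := LOG_PATTERN.match(line))]
df_logs = pd.DataFrame(records)

# Convert the captured strings to proper types
df_logs['timestamp'] = pd.to_datetime(df_logs['timestamp'])
for col in ['src_port', 'dst_port', 'bytes']:
    df_logs[col] = df_logs[col].astype(int)

# A simple per-source feature: number of denied connections
denied_per_src = df_logs[df_logs['action'] == 'DENY'].groupby('src_ip').size()
print(denied_per_src)
```

From here the frame can go through the same encoding and scaling steps as the benchmark datasets.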

---

## **5. Data Preprocessing Techniques**

Data preprocessing is an essential part of the data science workflow, especially in **cybersecurity**. It ensures that the data is clean, relevant, and ready for machine learning models.

### Key Preprocessing Steps:

1. **Handling Missing Values**:

   * Use imputation (filling missing data with the mean, median, or mode) or remove rows with missing values.

   ```python
   df.fillna(df.mean(numeric_only=True), inplace=True)  # Impute numeric columns with their column mean
   ```

2. **Encoding Categorical Features**:

   * Convert categorical features (e.g., `protocol_type`, `service`, `flag`) into numerical values using **one-hot encoding**.

   ```python
   df = pd.get_dummies(df, columns=['protocol_type', 'service', 'flag'])
   ```

3. **Scaling Features**:

   * Rescale numerical features so they are comparable, using techniques like **Min-Max Scaling** (maps values to a fixed range such as 0 to 1) or **Standardization** (zero mean and unit variance).

   ```python
   from sklearn.preprocessing import StandardScaler
   scaler = StandardScaler()
   df[['src_bytes', 'dst_bytes', 'duration']] = scaler.fit_transform(df[['src_bytes', 'dst_bytes', 'duration']])
   ```

4. **Feature Selection**:

   * Select relevant features for the model, which could include metrics like **traffic duration**, **bytes transferred**, and **failed logins**.

   ```python
   features = df[['duration', 'src_bytes', 'dst_bytes', 'num_failed_logins']]
   ```

5. **Dealing with Imbalanced Data**:

   * In cybersecurity datasets, the data is often **imbalanced** (e.g., many normal instances and few attack instances). Techniques like **oversampling**, **undersampling**, or using **SMOTE** (Synthetic Minority Over-sampling Technique) can be applied to balance the dataset.

   ```python
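   from imblearn.over_sampling import SMOTE  # provided by the imbalanced-learn package
   smote = SMOTE(random_state=42)
   # Resample the training split only, so the test set keeps its real class balance
   X_res, y_res = smote.fit_resample(X_train, y_train)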
   ```
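
The first three steps can also be bundled into a single scikit-learn preprocessing object, so that imputation, scaling, and encoding are learned from the training data only and then reused on new data. This is a minimal sketch on toy data, not a prescribed setup: the column names follow NSL-KDD, while the values, the median strategy, and the variable names are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data; the values are made up
X_train = pd.DataFrame({
    'duration': [0.0, 2.0, np.nan, 10.0],
    'src_bytes': [100.0, 250.0, 0.0, np.nan],
    'protocol_type': ['tcp', 'udp', 'tcp', 'icmp'],
})
X_test = pd.DataFrame({
    'duration': [1.0, np.nan],
    'src_bytes': [50.0, 300.0],
    'protocol_type': ['tcp', 'gre'],  # 'gre' never appears in training
})

numeric_cols = ['duration', 'src_bytes']
categorical_cols = ['protocol_type']

preprocess = ColumnTransformer(
    [
        # Numeric columns: fill gaps with the median, then standardize
        ('num', Pipeline([
            ('impute', SimpleImputer(strategy='median')),
            ('scale', StandardScaler()),
        ]), numeric_cols),
        # Categorical columns: one-hot encode, ignoring unseen categories
        ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_cols),
    ],
    sparse_threshold=0,  # always return a dense NumPy array
)

# Fit on the training data only, then apply the same transformation to the test data
X_train_prepared = preprocess.fit_transform(X_train)
X_test_prepared = preprocess.transform(X_test)
print(X_train_prepared.shape, X_test_prepared.shape)
```

With `handle_unknown='ignore'`, a category that was never seen during fitting is encoded as all zeros instead of raising an error, which is useful when test traffic contains services or protocols absent from the training set.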

---

## **6. Hands-on: Feature Extraction for Security Analysis**

In this exercise, we will focus on **feature extraction** from the datasets to build a model that can classify network traffic as **normal** or **malicious**.

### Steps:

1. **Load the dataset** and preprocess the data (handle missing values, scale numeric features, encode categorical features).
2. **Select relevant features** for building the model.
3. **Extract additional features** like **flow duration**, **source IP traffic** patterns, and **request types**.
4. **Train a machine learning model** (e.g., Decision Tree, SVM, Neural Networks) on the dataset.
5. **Evaluate the model’s performance** using metrics like **accuracy**, **precision**, **recall**, and **F1-score**.
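
Step 3 can be sketched as follows. NSL-KDD does not include source IP addresses, so this example derives features from the byte counts and duration instead; the derived column names (`total_bytes`, `src_byte_ratio`, `bytes_per_second`, `log_src_bytes`) and the toy values are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Toy connection records; column names follow NSL-KDD, values are made up
df_demo = pd.DataFrame({
    'duration': [0, 2, 10, 0],
    'src_bytes': [181, 0, 5000, 42],
    'dst_bytes': [5450, 0, 100, 0],
})

# Total traffic volume of the connection
df_demo['total_bytes'] = df_demo['src_bytes'] + df_demo['dst_bytes']

# Fraction of bytes sent by the source; defined as 0 when nothing was sent at all
df_demo['src_byte_ratio'] = (
    df_demo['src_bytes'] / df_demo['total_bytes'].replace(0, np.nan)
).fillna(0.0)

# Bytes per second; adding 1 avoids division by zero for zero-duration connections
df_demo['bytes_per_second'] = df_demo['total_bytes'] / (df_demo['duration'] + 1)

# Log transform to compress the heavy right tail of byte counts
df_demo['log_src_bytes'] = np.log1p(df_demo['src_bytes'])

print(df_demo)
```

Derived columns like these can simply be appended to the feature list passed to the model.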

```python
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report

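# Feature extraction: select relevant features
# (df_nsl is the NSL-KDD frame loaded earlier; any frame with these columns works)
features = df_nsl[['duration', 'src_bytes', 'dst_bytes', 'num_failed_logins']]
# Target: collapse the individual attack names into normal vs. malicious traffic
target = (df_nsl['label'] != 'normal').map({False: 'normal', True: 'malicious'})

# Split the dataset, keeping the class proportions the same in both splits
X_train, X_test, y_train, y_test = train_test_split(
    features, target, test_size=0.3, random_state=42, stratify=target)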

# Train the model
model = DecisionTreeClassifier(random_state=42)
model.fit(X_train, y_train)

# Evaluate the model
y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred))
```
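
Because attack records are usually rare, accuracy alone can look good even when a model misses every attack. The small worked example below illustrates this with made-up predictions: 95 normal and 5 attack records, a trivial predictor that always answers "normal", and a hypothetical model that catches 4 of the 5 attacks at the cost of 3 false alarms.

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Ground truth: 95 normal records (0) followed by 5 attack records (1)
y_true = [0] * 95 + [1] * 5

# Predictor A: always says "normal"
y_always_normal = [0] * 100

# Predictor B: 3 false alarms among the normal records, 4 of 5 attacks caught
y_model = [0] * 92 + [1] * 3 + [1] * 4 + [0] * 1

results = {}
for name, y_pred in [('always_normal', y_always_normal), ('model', y_model)]:
    results[name] = {
        'accuracy': accuracy_score(y_true, y_pred),
        'precision': precision_score(y_true, y_pred, zero_division=0),
        'recall': recall_score(y_true, y_pred, zero_division=0),
        'f1': f1_score(y_true, y_pred, zero_division=0),
    }
    print(name, results[name])
```

The always-normal predictor reaches 95% accuracy with zero recall, while the second predictor is only one point more accurate yet detects 80% of the attacks. This is why precision, recall, and F1-score for the attack class deserve more attention than accuracy.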



