#### COMP3602: Data Analysis and Visualization with Python, Spring 2024

# Project Part 2

Source URL of Dataset: [https://www.kaggle.com/datasets/teamincribo/cyber-security-attacks](https://www.kaggle.com/datasets/teamincribo/cyber-security-attacks)

**Group Members:**

- Abdulaziz Saud Al Jabri (134563)

- Mazin Humood Al Dhuhli (134362)


---

Data preprocessing is
a crucial step in any data analysis or machine learning project. It
involves cleaning, transforming, and organizing raw data into a
format that is suitable for analysis. This step is essential because
the quality of the data used for analysis directly affects the
accuracy of the results obtained.

---

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [None]:
data = pd.read_csv('cybersecurity_attacks.csv')
data.head()

---
1. Fill in **missing values** in each feature (if any) using the mean value of that feature.

In [None]:
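# Code adapted from https://www.geeksforgeeks.org/pandas-filling-nan-in-categorical-data/

# Numeric features use the mean; categorical features have no mean, so they use the mode.
def fill_missing(col):
    if pd.api.types.is_numeric_dtype(col):
        return col.fillna(col.mean())
    return col.fillna(col.value_counts().index[0])

data = data.apply(fill_missing)  # keep the result so later steps use the cleaned data
data.head()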

---
2. Use **Box plots** to identify which features have
outliers and replace these values with the mean value. For
a supervised dataset, where the class labels are known,
this mean value should be computed using only the values
that belong to the same class. On the other hand, for
unsupervised datasets where class labels are not available,
this mean value should be computed using all the values in
the feature.


In [None]:
plt.figure(figsize=(10, 10))
data.boxplot()
plt.show()
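
The cell above only draws the box plots. The replacement step for the unsupervised case could be sketched as follows. This is a minimal illustration, not the project's implementation: the helper name `replace_outliers_with_mean`, the 1.5 × IQR whisker rule, and the toy values are all assumptions.

```python
import pandas as pd

def replace_outliers_with_mean(frame, whisker=1.5):
    """Return a copy of `frame` with box-plot outliers replaced by the feature mean."""
    cleaned = frame.copy()
    for col in cleaned.select_dtypes(include="number").columns:
        q1 = cleaned[col].quantile(0.25)
        q3 = cleaned[col].quantile(0.75)
        iqr = q3 - q1
        lower, upper = q1 - whisker * iqr, q3 + whisker * iqr
        is_outlier = (cleaned[col] < lower) | (cleaned[col] > upper)
        # Unsupervised case: the mean is taken over all values of the feature.
        cleaned[col] = cleaned[col].astype(float).mask(is_outlier, cleaned[col].mean())
    return cleaned

# Toy frame with hypothetical values; 500 lies far outside the whiskers.
toy = pd.DataFrame({"Packet Length": [10, 12, 11, 13, 500]})
toy_clean = replace_outliers_with_mean(toy)
print(toy_clean)
```

Because the mean here includes the outlier itself, the replacement value can still sit outside the whiskers. Computing the mean from the non-outlier values only is a common variant.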

---
3. Use **LabelEncoder** for the features which have
categorical data to convert it into numerical data.

In [None]:
from sklearn.preprocessing import LabelEncoder

# Columns holding categorical (non-numeric) data.
categorical_cols = [
    'Timestamp', 'Source IP Address', 'Destination IP Address', 'Protocol',
    'Packet Type', 'Traffic Type', 'Payload Data', 'Malware Indicators',
    'Alerts/Warnings', 'Attack Type', 'Attack Signature', 'Action Taken',
    'Severity Level', 'User Information', 'Network Segment', 'Geo-location Data',
    'Proxy Information', 'Firewall Logs', 'IDS/IPS Alerts', 'Log Source',
    'Device Information',
]

# Refit the encoder on each column so every feature gets its own integer codes.
labelencoder = LabelEncoder()
for col in categorical_cols:
    data[col] = labelencoder.fit_transform(data[col])


---
4. Apply **Min-Max Normalization** on each feature to
scale the values in the range [0, 1].

In [None]:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
data = pd.DataFrame(scaler.fit_transform(data), columns=data.columns)
data.head()
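
For reference, min-max normalization maps each value to `(x - min) / (max - min)` within its own column. The sketch below applies that formula by hand to a toy frame; the values are hypothetical, and only the column names come from the dataset.

```python
import pandas as pd

# Toy frame with hypothetical values.
toy = pd.DataFrame({
    "Source Port": [1000, 3000, 5000],
    "Anomaly Scores": [10.0, 20.0, 50.0],
})

# x' = (x - min) / (max - min), computed column by column.
col_min, col_max = toy.min(), toy.max()
scaled = (toy - col_min) / (col_max - col_min)
print(scaled)
```

A constant column would make `max - min` zero, so this hand-written version would need a guard for that case.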

---
5. Determine whether the dataset is in long (tidy) or wide
format. If it is already in long format, then **convert it** (by
selecting 2 or more variables) to **wide**, and vice versa.

In [None]:
# Wide -> long: keep 'Timestamp' as the identifier and stack every other column.
other_columns = data.columns.drop('Timestamp')
melted_data = data.melt(id_vars=['Timestamp'], value_vars=other_columns)
melted_data
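
The reverse direction (long back to wide) can be sketched with `pivot`. The example below uses a toy frame with hypothetical values so that the round trip is easy to inspect; the names `wide`, `long`, and `restored` are illustrative only.

```python
import pandas as pd

# Toy wide frame with hypothetical values; column names follow the dataset.
wide = pd.DataFrame({
    "Timestamp": [0.0, 0.5, 1.0],
    "Protocol": [0.0, 1.0, 0.5],
    "Packet Length": [0.2, 0.9, 0.4],
})

# Wide -> long: one row per (Timestamp, variable) pair.
long = wide.melt(id_vars=["Timestamp"], var_name="variable", value_name="value")

# Long -> wide: pivot the variable names back into columns.
# pivot() requires each (Timestamp, variable) pair to be unique.
restored = (
    long.pivot(index="Timestamp", columns="variable", values="value")
        .reset_index()
        .rename_axis(columns=None)
)
print(restored)
```

If timestamps repeat in the real data, `pivot` raises an error on the duplicate entries. Adding a unique row identifier before melting is one way around that.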

---
6. For an unsupervised dataset (if you selected one),
**implement a covariance-based method** to identify irrelevant
features without using any libraries to compute the
correlation coefficient. Utilize a heatmap to visualize the
correlations and provide a list of the identified irrelevant
features.
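
One possible shape for this step is sketched below. It computes the Pearson correlation from hand-written covariances (no `np.cov` or `np.corrcoef`) and draws a heatmap with matplotlib. The function name `pearson_matrix`, the toy data, the 0.9 threshold, and the rule of flagging the later feature of a highly correlated pair are all assumptions, since the task does not fix a selection criterion.

```python
import numpy as np
import matplotlib.pyplot as plt

def pearson_matrix(values):
    """Correlation matrix built from covariances, without np.cov or np.corrcoef."""
    values = np.asarray(values, dtype=float)
    n_rows, n_cols = values.shape
    means = values.sum(axis=0) / n_rows
    centered = values - means
    corr = np.zeros((n_cols, n_cols))
    for i in range(n_cols):
        for j in range(n_cols):
            cov_ij = (centered[:, i] * centered[:, j]).sum() / (n_rows - 1)
            std_i = np.sqrt((centered[:, i] ** 2).sum() / (n_rows - 1))
            std_j = np.sqrt((centered[:, j] ** 2).sum() / (n_rows - 1))
            # A constant feature has zero variance; report 0 instead of dividing by zero.
            corr[i, j] = cov_ij / (std_i * std_j) if std_i > 0 and std_j > 0 else 0.0
    return corr

# Toy data with hypothetical values: "b" is an exact multiple of "a".
names = ["a", "b", "c"]
toy = np.array([[1.0, 2.0, 5.0],
                [2.0, 4.0, 3.0],
                [3.0, 6.0, 6.0],
                [4.0, 8.0, 2.0]])
corr = pearson_matrix(toy)

# Assumed rule: flag the later feature of any pair with |r| above the threshold.
THRESHOLD = 0.9
flagged = sorted({names[j] for i in range(len(names))
                  for j in range(i + 1, len(names)) if abs(corr[i, j]) > THRESHOLD})
print("Flagged features:", flagged)

# Heatmap of the correlation matrix.
fig, ax = plt.subplots(figsize=(4, 4))
image = ax.imshow(corr, cmap="coolwarm", vmin=-1, vmax=1)
ax.set_xticks(range(len(names)))
ax.set_xticklabels(names)
ax.set_yticks(range(len(names)))
ax.set_yticklabels(names)
fig.colorbar(image, ax=ax, label="Correlation coefficient")
plt.show()
```

On the project data, the same function could be called on `data.values` with `list(data.columns)` as the names, after the encoding and normalization steps have run.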

---
7. For a supervised dataset (if you selected one),
**implement the ANOVA method** to identify irrelevant features
without using any libraries to compute the F-statistics.
Utilize a bar chart to visualize the computed F-statistics
and provide a list of the identified irrelevant features.

--> Our dataset is unsupervised, so step 7 is not implemented.