<a href="https://colab.research.google.com/github/Jinendra-Gambhir/Cybersecurity-Advanced-Anomaly-Detection-and-Web-Threat-Analysis/blob/main/Data_Cleaning_For_Cybersecurity_Project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Purpose of the Project

Dataset link: [Cybersecurity-Suspicious Web Threat Interactions](https://www.kaggle.com/datasets/jancsg/cybersecurity-suspicious-web-threat-interactions)

---

* Anomaly Detection: Developing models to detect unusual behaviors in web traffic.
* Classification: Training models to automatically classify traffic as normal or suspicious.
* Security Analysis: Conducting security analyses to understand the tactics, techniques, and procedures of attackers.

# **Data Analysis**
* Load the dataset into your analysis environment.
* Examine the structure of the dataset (number of rows and columns, data types, etc.).
* Check for missing values and handle them appropriately (imputation, deletion, etc.).
* Explore basic statistics and distributions of the features.

In [1]:
data_path = '/content/cybersecurity_attacks.csv'

In [2]:
# Import the necessary libraries
import pandas as pd
from google.colab import files  # Colab-only helper, used later to download the cleaned CSV

In [3]:
df = pd.read_csv(data_path)
df.shape

(40000, 25)

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 40000 entries, 0 to 39999
Data columns (total 25 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   Timestamp               40000 non-null  object 
 1   Source IP Address       40000 non-null  object 
 2   Destination IP Address  40000 non-null  object 
 3   Source Port             40000 non-null  int64  
 4   Destination Port        40000 non-null  int64  
 5   Protocol                40000 non-null  object 
 6   Packet Length           40000 non-null  int64  
 7   Packet Type             40000 non-null  object 
 8   Traffic Type            40000 non-null  object 
 9   Payload Data            40000 non-null  object 
 10  Malware Indicators      20000 non-null  object 
 11  Anomaly Scores          40000 non-null  float64
 13  Attack Type             40000 non-null  object 
 14  Attack Signature        40000 non-null  object 
 15  Action Taken            40000 non-null
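
The `df.info()` output lists `Timestamp` as `object`, meaning the values are stored as plain strings. A minimal sketch of converting such a column to a datetime type is shown here on a hypothetical two-row sample (the `sample` frame is an assumption for illustration, not part of the notebook); the format string assumes the `YYYY-MM-DD HH:MM:SS` layout seen in the first rows of the dataset.

```python
import pandas as pd

# Hypothetical sample that mirrors the Timestamp format shown by df.head().
sample = pd.DataFrame(
    {"Timestamp": ["2023-05-30 06:33:58", "2020-08-26 07:08:30"]}
)

# errors="coerce" turns unparseable strings into NaT instead of raising,
# so malformed rows can be counted afterwards with isna().
sample["Timestamp"] = pd.to_datetime(
    sample["Timestamp"], format="%Y-%m-%d %H:%M:%S", errors="coerce"
)

print(sample.dtypes)
print(sample["Timestamp"].dt.year.tolist())
```

Once the column is a datetime, accessors such as `.dt.hour` or `.dt.dayofweek` become available for time-based analysis.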

In [5]:
df.columns

Index(['Timestamp', 'Source IP Address', 'Destination IP Address',
       'Source Port', 'Destination Port', 'Protocol', 'Packet Length',
       'Packet Type', 'Traffic Type', 'Payload Data', 'Malware Indicators',
       'Action Taken', 'Severity Level', 'User Information',
       'Device Information', 'Network Segment', 'Geo-location Data',
       'Proxy Information', 'Firewall Logs', 'IDS/IPS Alerts', 'Log Source'],
      dtype='object')

In [6]:
df.head()

| | Timestamp | Source IP Address | Destination IP Address | Source Port | Destination Port | Protocol | Packet Length | Packet Type | Traffic Type | Payload Data | ... | Action Taken | Severity Level | User Information | Device Information | Network Segment | Geo-location Data | Proxy Information | Firewall Logs | IDS/IPS Alerts | Log Source |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2023-05-30 06:33:58 | 103.216.15.12 | 84.9.164.252 | 31225 | 17616 | ICMP | 503 | Data | HTTP | Qui natus odio asperiores nam. Optio nobis ius... | ... | Logged | Low | Reyansh Dugal | Mozilla/5.0 (compatible; MSIE 8.0; Windows NT ... | Segment A | Jamshedpur, Sikkim | 150.9.97.135 | Log Data | NaN | Server |
| 1 | 2020-08-26 07:08:30 | 78.199.217.198 | 66.191.137.154 | 17245 | 48166 | ICMP | 1174 | Data | HTTP | Aperiam quos modi officiis veritatis rem. Omni... | ... | Blocked | Low | Sumer Rana | Mozilla/5.0 (compatible; MSIE 8.0; Windows NT ... | Segment B | Bilaspur, Nagaland | NaN | Log Data | NaN | Firewall |
| 2 | 2022-11-13 08:23:25 | 63.79.210.48 | 198.219.82.17 | 16811 | 53600 | UDP | 306 | Control | HTTP | Perferendis sapiente vitae soluta. Hic delectu... | ... | Ignored | Low | Himmat Karpe | Mozilla/5.0 (compatible; MSIE 9.0; Windows NT ... | Segment C | Bokaro, Rajasthan | 114.133.48.179 | Log Data | Alert Data | Firewall |
| 3 | 2023-07-02 10:38:46 | 163.42.196.10 | 101.228.192.255 | 20018 | 32534 | UDP | 385 | Data | HTTP | Totam maxime beatae expedita explicabo porro l... | ... | Blocked | Medium | Fateh Kibe | Mozilla/5.0 (Macintosh; PPC Mac OS X 10_11_5; ... | Segment B | Jaunpur, Rajasthan | NaN | NaN | Alert Data | Firewall |
| 4 | 2023-07-16 13:11:07 | 71.166.185.76 | 189.243.174.238 | 6131 | 26646 | TCP | 1462 | Data | DNS | Odit nesciunt dolorem nisi iste iusto. Animi v... | ... | Blocked | Low | Dhanush Chad | Mozilla/5.0 (compatible; MSIE 5.0; Windows NT ... | Segment C | Anantapur, Tripura | 149.6.110.119 | NaN | Alert Data | Firewall |


In [7]:
# Check whether the dataset has any null values
df.isna().sum()

Timestamp                     0
Source IP Address             0
Destination IP Address        0
Source Port                   0
Destination Port              0
Protocol                      0
Packet Length                 0
Packet Type                   0
Traffic Type                  0
Payload Data                  0
Malware Indicators        20000
Anomaly Scores                0
Attack Type                   0
Attack Signature              0
Action Taken                  0
Severity Level                0
User Information              0
Device Information            0
Network Segment               0
Geo-location Data             0
Proxy Information         19851
Firewall Logs             19961
IDS/IPS Alerts            20050
Log Source                    0
dtype: int64
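
Raw counts are easier to judge as a share of all rows; for example, 20000 missing `Malware Indicators` values out of 40000 rows is exactly 50%. A minimal sketch of such a summary follows, using a hypothetical four-row frame (`toy` and its values are assumptions for illustration) rather than the real dataset.

```python
import pandas as pd

# Hypothetical frame: one complete column and one that is half empty.
toy = pd.DataFrame(
    {
        "Protocol": ["ICMP", "UDP", "TCP", "UDP"],
        "Proxy Information": ["150.9.97.135", None, "114.133.48.179", None],
    }
)

# isna().mean() gives the fraction of missing values per column,
# because True counts as 1 and False as 0.
summary = pd.DataFrame(
    {"missing": toy.isna().sum(), "share": toy.isna().mean()}
)

# Keep only the columns that actually have gaps, worst first.
summary = summary[summary["missing"] > 0].sort_values("share", ascending=False)
print(summary)
```

Applied to `df`, the same two calls would show at a glance which columns are roughly half empty and may need a different treatment than columns with a handful of gaps.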

In [8]:
df.describe()

| | Source Port | Destination Port | Packet Length | Anomaly Scores |
|---|---|---|---|---|
| count | 40000.0 | 40000.0 | 40000.0 | 40000.0 |
| mean | 32970.35645 | 33150.86865 | 781.452725 | 50.113473 |
| std | 18560.425604 | 18574.668842 | 416.044192 | 28.853598 |
| min | 1027.0 | 1024.0 | 64.0 | 0.0 |
| 25% | 16850.75 | 17094.75 | 420.0 | 25.15 |
| 50% | 32856.0 | 33004.5 | 782.0 | 50.345 |
| 75% | 48928.25 | 49287.0 | 1143.0 | 75.03 |
| max | 65530.0 | 65535.0 | 1500.0 | 100.0 |


# **Data Cleaning**


* Remove duplicate records if any.


In [9]:
# Remove duplicate records
df_cleaned = df.drop_duplicates()
print(f"Number of records after removing duplicates: {df_cleaned.shape[0]}")

Number of records after removing duplicates: 40000
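
The record count is unchanged at 40000, so no fully identical rows were present. A small sketch of counting duplicates explicitly before dropping them is given below; the three-row `toy` frame is a hypothetical example, not data from the notebook.

```python
import pandas as pd

# Hypothetical frame in which the first and last rows are identical.
toy = pd.DataFrame(
    {
        "Source Port": [31225, 17245, 31225],
        "Protocol": ["ICMP", "ICMP", "ICMP"],
    }
)

# duplicated() flags every repeat of an earlier row (first occurrence is kept).
n_duplicates = int(toy.duplicated().sum())

# drop_duplicates() returns a new frame; reset_index tidies the row labels.
deduped = toy.drop_duplicates().reset_index(drop=True)

print(f"Duplicate rows found: {n_duplicates}")
print(f"Rows after removal: {len(deduped)}")
```

Reporting the count separately makes it clear whether the drop did anything, instead of inferring it from the shape.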


* Handle missing values in the dataset.

In [15]:
# Count missing values per column before handling them
print("Missing values before handling:")
print(df_cleaned.isna().sum())

Missing values before handling:
Timestamp                     0
Source IP Address             0
Destination IP Address        0
Source Port                   0
Destination Port              0
Protocol                      0
Packet Length                 0
Packet Type                   0
Traffic Type                  0
Payload Data                  0
Malware Indicators        20000
Anomaly Scores                0
Attack Type                   0
Attack Signature              0
Action Taken                  0
Severity Level                0
User Information              0
Device Information            0
Network Segment               0
Geo-location Data             0
Proxy Information         19851
Firewall Logs             19961
IDS/IPS Alerts            20050
Log Source                    0
dtype: int64


In [16]:
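# Fill missing values in numerical columns with the column median.
# The result is assigned back to the column: calling fillna(..., inplace=True)
# on df_cleaned[col] is chained modification, which recent pandas versions
# warn about and which has no effect under Copy-on-Write.
numerical_columns = df_cleaned.select_dtypes(include=['float64', 'int64']).columns
for col in numerical_columns:
    df_cleaned[col] = df_cleaned[col].fillna(df_cleaned[col].median())

# Fill missing values in categorical columns with the most frequent value (mode)
categorical_columns = df_cleaned.select_dtypes(include=['object']).columns
for col in categorical_columns:
    df_cleaned[col] = df_cleaned[col].fillna(df_cleaned[col].mode()[0])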

# Check for remaining missing values
print("Missing values after handling:")
print(df_cleaned.isna().sum())

Missing values after handling:
Timestamp                 0
Source IP Address         0
Destination IP Address    0
Source Port               0
Destination Port          0
Protocol                  0
Packet Length             0
Packet Type               0
Traffic Type              0
Payload Data              0
Malware Indicators        0
Anomaly Scores            0
Attack Type               0
Attack Signature          0
Action Taken              0
Severity Level            0
User Information          0
Device Information        0
Network Segment           0
Geo-location Data         0
Proxy Information         0
Firewall Logs             0
IDS/IPS Alerts            0
Log Source                0
dtype: int64
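
Filling with the mode gives every empty cell the most common value, which can hide the fact that a value was never recorded when a column is about half empty. One possible alternative, shown purely as a sketch and not as the notebook's method, is to fill with an explicit placeholder and keep a flag column. The `toy` frame, its values, the `"Not Recorded"` label, and the flag column name are all hypothetical.

```python
import pandas as pd

# Hypothetical frame: a categorical column where half the values are missing.
toy = pd.DataFrame(
    {
        "Malware Indicators": ["IoC Detected", None, "IoC Detected", None],
        "Packet Length": [503, 1174, 306, 385],
    }
)

filled = toy.copy()

# Record which rows were originally empty before overwriting them.
filled["Malware Indicators Missing"] = filled["Malware Indicators"].isna()

# "Not Recorded" is an illustrative placeholder, not a value from the dataset.
filled["Malware Indicators"] = filled["Malware Indicators"].fillna("Not Recorded")

print(filled["Malware Indicators"].value_counts())
```

Whether a placeholder or the mode is more appropriate depends on what an empty cell means in the source system, which the dataset description would need to confirm.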


**Download** the cleaned data file for further processing.

In [11]:
# Save the cleaned data to a CSV file
cleaned_data_path = '/content/cleaned_Cybersecurity_data.csv'
df_cleaned.to_csv(cleaned_data_path, index=False)

# Download the CSV file
files.download(cleaned_data_path)
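
`files.download` exists only inside Google Colab, so the cell above fails in a local Jupyter session. A hedged sketch of a helper that writes the CSV everywhere and triggers the browser download only when `google.colab` can be imported is shown below; `save_cleaned`, the `toy` frame, and the output file name are hypothetical names introduced for illustration.

```python
import tempfile
from pathlib import Path

import pandas as pd


def save_cleaned(frame: pd.DataFrame, path) -> Path:
    """Write frame to CSV; start a browser download only when running in Colab."""
    path = Path(path)
    frame.to_csv(path, index=False)
    try:
        # This import succeeds only inside a Colab runtime.
        from google.colab import files
    except ImportError:
        return path  # outside Colab the file simply stays on disk
    files.download(str(path))
    return path


# Hypothetical two-row frame standing in for df_cleaned.
toy = pd.DataFrame({"Protocol": ["ICMP", "UDP"], "Packet Length": [503, 1174]})
out = save_cleaned(toy, Path(tempfile.gettempdir()) / "cleaned_sample.csv")

# Read the file back to confirm the round trip preserved the data.
round_trip = pd.read_csv(out)
print(round_trip)
```

In the notebook itself the call would be `save_cleaned(df_cleaned, cleaned_data_path)`, assuming both names are defined as in the cells above.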
