# Basic analysis to understand the dataset

### 1/ First step: load the data and look at it to understand what we are dealing with

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns


cyberdata=pd.read_csv("cybersecurity_attacks.csv")
cyberdata.head()

### 2/ Look for missing values

We have 5 columns, each with roughly 20,000 missing values out of 40,000 total rows.

We need to understand why the number of NaN values is so consistent across these columns.

In [7]:
# Count the NaN values per column and keep only the columns that have some
nan_value = cyberdata.isnull().sum()
print(nan_value[nan_value > 0])
print(f"Number of rows: {len(cyberdata)}")

Malware Indicators    20000
Proxy Information     19851
Firewall Logs         19961
IDS/IPS Alerts        20050
dtype: int64
Number of rows: 40000
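
The raw counts are easier to compare as a share of the rows. Here is a minimal sketch of that idea. The `toy` DataFrame and the helper `missing_summary` are hypothetical and only make the example runnable without the CSV; on the real data we would call `missing_summary(cyberdata)`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for cyberdata (hypothetical values, just to make the sketch runnable)
toy = pd.DataFrame({
    "Malware Indicators": ["IoC Detected", np.nan, "IoC Detected", np.nan],
    "Firewall Logs": ["Log Data", "Log Data", np.nan, np.nan],
    "Protocol": ["ICMP", "UDP", "TCP", "UDP"],
})


def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return the NaN count and percentage for every column that has NaN."""
    counts = df.isnull().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(df) * 100).round(2),
    })
    # Keep only the columns with at least one missing value
    return summary[summary["missing"] > 0]


toy_summary = missing_summary(toy)
print(toy_summary)
```

A percentage close to 50% in every affected column would support the observation that the NaN values are unusually consistent.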


In [18]:
nan_cols = cyberdata[["Malware Indicators", "Alerts/Warnings", "Proxy Information", "Firewall Logs", "IDS/IPS Alerts"]]
nan_cols

Unnamed: 0,Malware Indicators,Alerts/Warnings,Proxy Information,Firewall Logs,IDS/IPS Alerts
0,IoC Detected,,150.9.97.135,Log Data,
1,IoC Detected,,,Log Data,
2,IoC Detected,Alert Triggered,114.133.48.179,Log Data,Alert Data
3,,Alert Triggered,,,Alert Data
4,,Alert Triggered,149.6.110.119,,Alert Data
...,...,...,...,...,...
39995,IoC Detected,,,Log Data,Alert Data
39996,IoC Detected,,60.51.30.46,Log Data,
39997,IoC Detected,,,Log Data,Alert Data
39998,IoC Detected,Alert Triggered,137.76.130.8,Log Data,


In [19]:
nbvariable_nan = cyberdata[["Malware Indicators", "Alerts/Warnings", "Proxy Information", "Firewall Logs", "IDS/IPS Alerts"]].nunique()
print(nbvariable_nan)


Malware Indicators        1
Proxy Information     20148
Firewall Logs             1
IDS/IPS Alerts            1
dtype: int64


The columns "Malware Indicators", "Alerts/Warnings", "Firewall Logs", and "IDS/IPS Alerts" serve as indicators or alerts. In the output above, the indicator columns shown each have exactly one unique value, so a cell is either that value or NaN. If no issue is detected, there is nothing to log, which likely explains the NaN entries: these alerts are recorded only when an event occurs. "Proxy Information" is different, since it holds many distinct IP addresses.
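
If NaN really means "no event", one way to handle these columns is to turn them into 0/1 flags instead of treating the NaN as missing data. This is only a sketch under that assumption, and the `toy` DataFrame is hypothetical. On the real data the same two lines would apply to `cyberdata`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for cyberdata (hypothetical values)
toy = pd.DataFrame({
    "Malware Indicators": ["IoC Detected", np.nan, "IoC Detected"],
    "Alerts/Warnings": [np.nan, "Alert Triggered", np.nan],
})

indicator_cols = ["Malware Indicators", "Alerts/Warnings"]

# 1 where something was logged, 0 where the cell is NaN (assumed to mean "no event")
flags = toy[indicator_cols].notna().astype(int)
print(flags)
```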

### 3/ Understand the types of data we are dealing with

In [8]:
print(cyberdata.dtypes)

Timestamp                  object
Source IP Address          object
Destination IP Address     object
Source Port                 int64
Destination Port            int64
Protocol                   object
Packet Length               int64
Packet Type                object
Traffic Type               object
Payload Data               object
Malware Indicators         object
Anomaly Scores            float64
Attack Type                object
Attack Signature           object
Action Taken               object
Severity Level             object
User Information           object
Device Information         object
Network Segment            object
Geo-location Data          object
Proxy Information          object
Firewall Logs              object
IDS/IPS Alerts             object
Log Source                 object
dtype: object
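
"Timestamp" is stored as `object`, meaning plain strings. If we want to work with dates later, a possible sketch is to parse it with `pd.to_datetime`. The format string below is an assumption based on the values visible in the rows (`2023-05-30 06:33:58`), and the `toy` DataFrame is hypothetical.

```python
import pandas as pd

# Toy stand-in for the Timestamp column (values copied from the format seen in the data)
toy = pd.DataFrame({"Timestamp": ["2023-05-30 06:33:58", "2020-08-26 07:08:30"]})

# errors="coerce" turns any value that does not match the format into NaT
toy["Timestamp"] = pd.to_datetime(
    toy["Timestamp"], format="%Y-%m-%d %H:%M:%S", errors="coerce"
)
print(toy.dtypes)
```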


In [22]:
nbvariable_ds=cyberdata.nunique()
print(nbvariable_ds)

Timestamp                 39997
Source IP Address         40000
Destination IP Address    40000
Source Port               29761
Destination Port          29895
Protocol                      3
Packet Length              1437
Packet Type                   2
Traffic Type                  3
Payload Data              40000
Malware Indicators            1
Anomaly Scores             9826
Attack Type                   3
Attack Signature              2
Action Taken                  3
Severity Level                3
User Information          32389
Device Information        32104
Network Segment               3
Geo-location Data          8723
Proxy Information         20148
Firewall Logs                 1
IDS/IPS Alerts                1
Log Source                    2
dtype: int64
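
In the output above there is a clear gap between the columns with at most 3 unique values and the next one (Packet Length, 1437). Instead of typing the categorical columns by hand, we could pick them with a threshold on `nunique()`. This is a sketch. The cut-off `max_unique = 3` is an assumption read from this output, and the `toy` DataFrame is hypothetical.

```python
import numpy as np
import pandas as pd

# Toy stand-in for cyberdata (hypothetical values)
toy = pd.DataFrame({
    "Protocol": ["ICMP", "UDP", "TCP", "UDP"],
    "Firewall Logs": ["Log Data", np.nan, "Log Data", np.nan],
    "Source Port": [31225, 17245, 16811, 20018],
})

max_unique = 3  # hypothetical cut-off, chosen from the nunique() output

# nunique() ignores NaN by default, so "Firewall Logs" counts as 1 unique value
unique_counts = toy.nunique()
categorical_candidates = unique_counts[unique_counts <= max_unique].index.tolist()
print(categorical_candidates)
```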


In [32]:
categorical_data = cyberdata[["Alerts/Warnings", "Attack Type", "Attack Signature", "Action Taken", "Severity Level", "Packet Type", "Traffic Type", "Protocol",
                              "Firewall Logs", "IDS/IPS Alerts", "Log Source", "Network Segment", "Malware Indicators"]]
categorical_data

Unnamed: 0,Alerts/Warnings,Attack Type,Attack Signature,Action Taken,Severity Level,Packet Type,Traffic Type,Protocol,Firewall Logs,IDS/IPS Alerts,Log Source,Network Segment,Malware Indicators
0,,Malware,Known Pattern B,Logged,Low,Data,HTTP,ICMP,Log Data,,Server,Segment A,IoC Detected
1,,Malware,Known Pattern A,Blocked,Low,Data,HTTP,ICMP,Log Data,,Firewall,Segment B,IoC Detected
2,Alert Triggered,DDoS,Known Pattern B,Ignored,Low,Control,HTTP,UDP,Log Data,Alert Data,Firewall,Segment C,IoC Detected
3,Alert Triggered,Malware,Known Pattern B,Blocked,Medium,Data,HTTP,UDP,,Alert Data,Firewall,Segment B,
4,Alert Triggered,DDoS,Known Pattern B,Blocked,Low,Data,DNS,TCP,,Alert Data,Firewall,Segment C,
...,...,...,...,...,...,...,...,...,...,...,...,...,...
39995,,DDoS,Known Pattern A,Logged,Medium,Control,HTTP,UDP,Log Data,Alert Data,Firewall,Segment A,IoC Detected
39996,,DDoS,Known Pattern A,Logged,High,Control,HTTP,UDP,Log Data,,Firewall,Segment C,IoC Detected
39997,,DDoS,Known Pattern B,Blocked,Low,Data,DNS,UDP,Log Data,Alert Data,Server,Segment C,IoC Detected
39998,Alert Triggered,Malware,Known Pattern B,Ignored,Low,Data,FTP,UDP,Log Data,,Server,Segment B,IoC Detected


In [29]:
# Drop the categorical columns by name to keep the non-categorical ones
no_categorical = cyberdata.drop(columns=categorical_data.columns)
no_categorical

Unnamed: 0,Timestamp,Source IP Address,Destination IP Address,Source Port,Destination Port,Packet Length,Payload Data,Anomaly Scores,User Information,Device Information,Geo-location Data,Proxy Information
0,2023-05-30 06:33:58,103.216.15.12,84.9.164.252,31225,17616,503,Qui natus odio asperiores nam. Optio nobis ius...,28.67,Reyansh Dugal,Mozilla/5.0 (compatible; MSIE 8.0; Windows NT ...,"Jamshedpur, Sikkim",150.9.97.135
1,2020-08-26 07:08:30,78.199.217.198,66.191.137.154,17245,48166,1174,Aperiam quos modi officiis veritatis rem. Omni...,51.50,Sumer Rana,Mozilla/5.0 (compatible; MSIE 8.0; Windows NT ...,"Bilaspur, Nagaland",
2,2022-11-13 08:23:25,63.79.210.48,198.219.82.17,16811,53600,306,Perferendis sapiente vitae soluta. Hic delectu...,87.42,Himmat Karpe,Mozilla/5.0 (compatible; MSIE 9.0; Windows NT ...,"Bokaro, Rajasthan",114.133.48.179
3,2023-07-02 10:38:46,163.42.196.10,101.228.192.255,20018,32534,385,Totam maxime beatae expedita explicabo porro l...,15.79,Fateh Kibe,Mozilla/5.0 (Macintosh; PPC Mac OS X 10_11_5; ...,"Jaunpur, Rajasthan",
4,2023-07-16 13:11:07,71.166.185.76,189.243.174.238,6131,26646,1462,Odit nesciunt dolorem nisi iste iusto. Animi v...,0.52,Dhanush Chad,Mozilla/5.0 (compatible; MSIE 5.0; Windows NT ...,"Anantapur, Tripura",149.6.110.119
...,...,...,...,...,...,...,...,...,...,...,...,...
39995,2023-05-26 14:08:42,26.36.109.26,121.100.75.240,31005,6764,1428,Quibusdam ullam consequatur consequuntur accus...,39.28,Adira Madan,Mozilla/5.0 (iPad; CPU iPad OS 14_2_1 like Mac...,"Nashik, Manipur",
39996,2023-03-27 00:38:27,17.21.163.81,196.108.134.78,2553,28091,1184,Quaerat neque esse. Animi expedita natus commo...,27.25,Rati Dara,Mozilla/5.0 (Windows; U; Windows 98; Win 9x 4....,"Vadodara, Mizoram",60.51.30.46
39997,2022-03-31 01:45:49,162.35.217.57,98.107.0.15,22505,25152,1043,Enim at aspernatur illum. Saepe numquam eligen...,31.01,Samiha Joshi,Mozilla/5.0 (Windows; U; Windows NT 4.0) Apple...,"Mahbubnagar, Himachal Pradesh",
39998,2023-09-22 18:32:38,208.72.233.205,173.79.112.252,20013,2703,483,Officiis dolorem sed harum provident earum dis...,97.85,Rasha Chauhan,Mozilla/5.0 (X11; Linux i686) AppleWebKit/536....,"Rourkela, Arunachal Pradesh",137.76.130.8


We can see that we have 13 categorical columns, 4 of which contain NaN values, and 12 non-categorical columns, 1 of which contains NaN values.

Proxy Information has about 20,000 missing values (19,851) and is not categorical, so it should be dropped.
Payload Data and User Information do not carry useful information.
Timestamp, Source IP Address, Destination IP Address, Source Port, and Destination Port look suspicious, so we need to take a closer look at them.
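
The clean-up described above could look like the sketch below. The `toy` DataFrame is hypothetical. Checking the share of distinct values is only one possible way to start looking at the suspicious columns, not a conclusion about them.

```python
import numpy as np
import pandas as pd

# Toy stand-in for cyberdata (hypothetical values)
toy = pd.DataFrame({
    "Source IP Address": ["103.216.15.12", "78.199.217.198", "63.79.210.48"],
    "Proxy Information": ["150.9.97.135", np.nan, "114.133.48.179"],
    "Payload Data": ["text 1", "text 2", "text 3"],
    "User Information": ["user_1", "user_2", "user_3"],
    "Anomaly Scores": [28.67, 51.50, 87.42],
})

# Columns we decided to drop; errors="ignore" skips names that are not present
cols_to_drop = ["Proxy Information", "Payload Data", "User Information"]
reduced = toy.drop(columns=cols_to_drop, errors="ignore")

# Share of distinct values per remaining column: 1.0 means every row is unique
unique_ratio = reduced.nunique() / len(reduced)
print(unique_ratio)
```

A ratio of 1.0 for an address column means no address ever repeats, which is worth questioning in network traffic.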