## Cybersecurity Data Analysis With Pandas

##### This project analyzes cybersecurity incident data with Python and pandas to uncover trends and patterns in network status, source and destination IP addresses, timestamps, and threat levels. It covers data import and exploratory analysis to support proactive threat identification and informed security decision-making.

##### Importing Pandas

In [1]:
import pandas as pd


##### Loading the Dataset from GitHub

In [2]:
url = "https://raw.githubusercontent.com/ritaafrica/data/refs/heads/main/network_traffic_data.csv"
df = pd.read_csv(url)
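
As a hedged sketch, the same `pd.read_csv` call can also parse the `Timestamp` column into datetimes at load time. The inline CSV below is a small stand-in built from three rows shown in this notebook, so the snippet runs offline; the names `sample_csv` and `sample_df` are illustrative and not part of the original notebook. With the real file, `url` would take the place of the `StringIO` buffer.

```python
import io
import pandas as pd

# Stand-in sample: three rows copied from the output shown in this notebook.
sample_csv = io.StringIO(
    "Timestamp,Source_IP,Destination_IP,Protocol,Port,Bytes_Sent,Bytes_Received,Status,Threat_Level\n"
    "2025-03-19 13:04:10,10.0.0.15,192.168.1.20,TCP,,5411,8989,Blocked,Low\n"
    "2025-03-19 13:03:40,192.168.1.13,172.217.169.46,ICMP,443.0,4999,11808,Allowed,Medium\n"
    "2025-03-19 13:03:10,10.0.0.5,203.0.113.99,HTTP,443.0,6360,10852,Allowed,Medium\n"
)

# parse_dates converts the Timestamp strings to datetime64 values on load.
sample_df = pd.read_csv(sample_csv, parse_dates=["Timestamp"])

# Empty Port fields are read as NaN, which makes the column float64.
print(sample_df.dtypes)
```

Parsing dates up front keeps later time-based filtering or grouping from having to convert strings first.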

##### Basic Data Exploration

In [5]:
print(df.head(7))

             Timestamp     Source_IP  Destination_IP Protocol   Port  \
0  2025-03-19 13:04:10     10.0.0.15    192.168.1.20      TCP    NaN   
1  2025-03-19 13:03:40  192.168.1.13  172.217.169.46     ICMP  443.0   
2  2025-03-19 13:03:10      10.0.0.5    203.0.113.99     HTTP  443.0   
3  2025-03-19 13:02:40      10.0.0.9    192.168.1.20      TCP    NaN   
4  2025-03-19 13:02:10   192.168.1.4  172.217.169.46      FTP    NaN   
5  2025-03-19 13:01:40     10.0.0.43  172.217.169.46      DNS   53.0   
6  2025-03-19 13:01:10     10.0.0.26        10.0.0.5     ICMP   53.0   

   Bytes_Sent  Bytes_Received   Status Threat_Level  
0        5411            8989  Blocked          Low  
1        4999           11808  Allowed       Medium  
2        6360           10852  Allowed       Medium  
3        4011           14314  Blocked          Low  
4        5254            8718  Blocked       Medium  
5        6915           12981  Allowed          Low  
6        3431            2826  Allowed       

In [4]:
df.shape


(1000, 9)
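
The `head` output above shows `NaN` in the `Port` column, so a natural follow-up is to count missing values per column. The sketch below assumes a small stand-in frame (`sample_df`, built from the first five rows shown above) rather than the full 1000-row dataset.

```python
import numpy as np
import pandas as pd

# Stand-in sample: a few columns from the first five rows shown above.
sample_df = pd.DataFrame({
    "Protocol": ["TCP", "ICMP", "HTTP", "TCP", "FTP"],
    "Port": [np.nan, 443.0, 443.0, np.nan, np.nan],
    "Bytes_Sent": [5411, 4999, 6360, 4011, 5254],
    "Status": ["Blocked", "Allowed", "Allowed", "Blocked", "Blocked"],
})

# isna() marks missing cells; sum() counts them column by column.
missing_per_column = sample_df.isna().sum()
print(missing_per_column)

# info() summarizes dtypes and non-null counts in one view.
sample_df.info()
```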

##### Selecting Specific Columns for Display

In [15]:
network_summary = df[["Timestamp", "Source_IP", "Destination_IP", "Status"]]

In [16]:
# Display the first five rows
network_summary.head()

|  | Timestamp | Source_IP | Destination_IP | Status |
|---|---|---|---|---|
| 0 | 2025-03-19 13:04:10 | 10.0.0.15 | 192.168.1.20 | Blocked |
| 1 | 2025-03-19 13:03:40 | 192.168.1.13 | 172.217.169.46 | Allowed |
| 2 | 2025-03-19 13:03:10 | 10.0.0.5 | 203.0.113.99 | Allowed |
| 3 | 2025-03-19 13:02:40 | 10.0.0.9 | 192.168.1.20 | Blocked |
| 4 | 2025-03-19 13:02:10 | 192.168.1.4 | 172.217.169.46 | Blocked |


##### Filtering Specific Data

In [7]:
# filter only blocked traffic
blocked_traffic = df[df["Status"] == "Blocked"]

In [8]:
# Displaying the blocked_traffic
blocked_traffic

|  | Timestamp | Source_IP | Destination_IP | Protocol | Port | Bytes_Sent | Bytes_Received | Status | Threat_Level |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 2025-03-19 13:04:10 | 10.0.0.15 | 192.168.1.20 | TCP | NaN | 5411 | 8989 | Blocked | Low |
| 3 | 2025-03-19 13:02:40 | 10.0.0.9 | 192.168.1.20 | TCP | NaN | 4011 | 14314 | Blocked | Low |
| 4 | 2025-03-19 13:02:10 | 192.168.1.4 | 172.217.169.46 | FTP | NaN | 5254 | 8718 | Blocked | Medium |
| 9 | 2025-03-19 12:59:40 | 10.0.0.43 | 10.0.0.5 | ICMP | 3389.0 | 3305 | 6621 | Blocked | Low |
| 10 | 2025-03-19 12:59:10 | 10.0.0.33 | 203.0.113.99 | UDP | 3389.0 | 3700 | 11297 | Blocked | Medium |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 992 | 2025-03-19 04:48:10 | 10.0.0.11 | 203.0.113.99 | HTTP | NaN | 2839 | 2939 | Blocked | Medium |
| 993 | 2025-03-19 04:47:40 | 192.168.1.39 | 192.168.1.20 | ICMP | 22.0 | 4178 | 8307 | Blocked | Low |
| 995 | 2025-03-19 04:46:40 | 10.0.0.46 | 172.217.169.46 | DNS | 53.0 | 2290 | 6246 | Blocked | Low |
| 997 | 2025-03-19 04:45:40 | 10.0.0.3 | 192.168.1.20 | UDP | 21.0 | 6655 | 13170 | Blocked | Low |


In [9]:
# Selecting key details for analysis
blocked_summary = blocked_traffic[["Timestamp", "Source_IP", "Destination_IP", "Threat_Level"]]

In [10]:
# Displaying first few rows
blocked_summary.head()

|  | Timestamp | Source_IP | Destination_IP | Threat_Level |
|---|---|---|---|---|
| 0 | 2025-03-19 13:04:10 | 10.0.0.15 | 192.168.1.20 | Low |
| 3 | 2025-03-19 13:02:40 | 10.0.0.9 | 192.168.1.20 | Low |
| 4 | 2025-03-19 13:02:10 | 192.168.1.4 | 172.217.169.46 | Medium |
| 9 | 2025-03-19 12:59:40 | 10.0.0.43 | 10.0.0.5 | Low |
| 10 | 2025-03-19 12:59:10 | 10.0.0.33 | 203.0.113.99 | Medium |


In [12]:
# Keep only the blocked traffic whose threat level is High
blocked_threat_summary = blocked_summary[blocked_summary["Threat_Level"] == "High"]

In [14]:
blocked_threat_summary.shape

(60, 4)
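
The two-step filter above (status first, then threat level) can also be written as a single boolean mask combined with `&`. This is a minimal sketch: `blocked_with_level` is a hypothetical helper, and `sample_df` is a stand-in built from a few rows shown in this notebook. None of those sample rows has a `High` threat level, so the example filters on `Critical` instead.

```python
import pandas as pd

# Stand-in sample: status and threat level for four rows shown in this notebook.
sample_df = pd.DataFrame({
    "Source_IP": ["10.0.0.15", "192.168.1.13", "192.168.1.17", "10.0.0.47"],
    "Status": ["Blocked", "Allowed", "Blocked", "Allowed"],
    "Threat_Level": ["Low", "Medium", "Critical", "Critical"],
})

def blocked_with_level(frame, level):
    """Return rows that are blocked AND have the given threat level."""
    # Each comparison needs its own parentheses because & binds tighter than ==.
    mask = (frame["Status"] == "Blocked") & (frame["Threat_Level"] == level)
    return frame.loc[mask, ["Source_IP", "Threat_Level"]]

blocked_critical = blocked_with_level(sample_df, "Critical")
print(blocked_critical)
```

On the full dataset, `blocked_with_level(df, "High").shape[0]` should match the row count reported above, assuming the same column names.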

##### Filtering Suspicious Traffic

In [18]:
# Checking for critical level threats
High_risk_traffic = df[df["Threat_Level"] == "Critical"]

In [20]:
# Display the first few rows
High_risk_traffic.head()

|  | Timestamp | Source_IP | Destination_IP | Protocol | Port | Bytes_Sent | Bytes_Received | Status | Threat_Level |
|---|---|---|---|---|---|---|---|---|---|
| 59 | 2025-03-19 12:34:40 | 10.0.0.47 | 192.168.1.20 | ICMP | NaN | 5885 | 463 | Allowed | Critical |
| 96 | 2025-03-19 12:16:10 | 192.168.1.35 | 203.0.113.99 | FTP | 8080.0 | 9371 | 7189 | Allowed | Critical |
| 134 | 2025-03-19 11:57:10 | 192.168.1.17 | 172.217.169.46 | DNS | 22.0 | 6714 | 13124 | Blocked | Critical |
| 150 | 2025-03-19 11:49:10 | 192.168.1.42 | 10.0.0.5 | HTTP | 53.0 | 2702 | 634 | Allowed | Critical |
| 209 | 2025-03-19 11:19:40 | 10.0.0.17 | 203.0.113.99 | TCP | 3389.0 | 5085 | 10014 | Blocked | Critical |
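
Several of the critical rows above have a status of `Allowed`, which is the combination a security review would likely look at first. One way to see how threat level and status intersect is a cross-tabulation. The sketch below uses `pd.crosstab` on a stand-in frame built from rows shown in this notebook; the counts it prints describe that sample only, not the full dataset.

```python
import pandas as pd

# Stand-in sample: the five critical rows shown above plus two rows from df.head().
sample_df = pd.DataFrame({
    "Status": ["Allowed", "Allowed", "Blocked", "Allowed", "Blocked",
               "Blocked", "Allowed"],
    "Threat_Level": ["Critical", "Critical", "Critical", "Critical", "Critical",
                     "Low", "Medium"],
})

# Rows are threat levels, columns are statuses, cells are row counts.
counts = pd.crosstab(sample_df["Threat_Level"], sample_df["Status"])
print(counts)
```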


##### Filtering Traffic with High Data Transfer (where Bytes Sent is greater than 5000)

In [22]:
High_data_transfer = df[df["Bytes_Sent"] > 5000]

In [23]:
# Display the first few rows
High_data_transfer.head()

|  | Timestamp | Source_IP | Destination_IP | Protocol | Port | Bytes_Sent | Bytes_Received | Status | Threat_Level |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 2025-03-19 13:04:10 | 10.0.0.15 | 192.168.1.20 | TCP | NaN | 5411 | 8989 | Blocked | Low |
| 2 | 2025-03-19 13:03:10 | 10.0.0.5 | 203.0.113.99 | HTTP | 443.0 | 6360 | 10852 | Allowed | Medium |
| 4 | 2025-03-19 13:02:10 | 192.168.1.4 | 172.217.169.46 | FTP | NaN | 5254 | 8718 | Blocked | Medium |
| 5 | 2025-03-19 13:01:40 | 10.0.0.43 | 172.217.169.46 | DNS | 53.0 | 6915 | 12981 | Allowed | Low |
| 7 | 2025-03-19 13:00:40 | 192.168.1.36 | 192.168.1.20 | TCP | 21.0 | 5655 | 119 | Allowed | Medium |


In [26]:
# Show the number of such events
print(f"Number of high data transfers: {len(High_data_transfer)}")

Number of high data transfers: 518
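
The same count can be taken directly from the boolean mask, without building the filtered frame first, and the mask's mean gives the share of rows above the threshold. This is a sketch on a stand-in sample (the `Bytes_Sent` values from the first seven rows shown earlier); the names `is_high`, `high_count`, and `high_share` are illustrative.

```python
import pandas as pd

# Stand-in sample: Bytes_Sent for the first seven rows shown earlier.
sample_df = pd.DataFrame({
    "Bytes_Sent": [5411, 4999, 6360, 4011, 5254, 6915, 3431],
})

# True where the row exceeds the 5000-byte threshold used above.
is_high = sample_df["Bytes_Sent"] > 5000

# Summing booleans counts the True values; the mean is their fraction.
high_count = int(is_high.sum())
high_share = float(is_high.mean())

print(f"Number of high data transfers: {high_count}")
print(f"Share of all rows: {high_share:.1%}")
```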


##### Splitting the Dataset into X (Features) and Y (Target Variable)

In [27]:
# Select the features (X): every column except the target variable
X = df.drop(columns=["Threat_Level"])

# Selecting the target variable (Y)
Y = df["Threat_Level"]

In [29]:
# Display the first few rows of X and Y
print("Features (X):")
print(X.head())
print("\nTarget variable (Y):")
print(Y.head())


Features (X):
             Timestamp     Source_IP  Destination_IP Protocol   Port  \
0  2025-03-19 13:04:10     10.0.0.15    192.168.1.20      TCP    NaN   
1  2025-03-19 13:03:40  192.168.1.13  172.217.169.46     ICMP  443.0   
2  2025-03-19 13:03:10      10.0.0.5    203.0.113.99     HTTP  443.0   
3  2025-03-19 13:02:40      10.0.0.9    192.168.1.20      TCP    NaN   
4  2025-03-19 13:02:10   192.168.1.4  172.217.169.46      FTP    NaN   

   Bytes_Sent  Bytes_Received   Status  
0        5411            8989  Blocked  
1        4999           11808  Allowed  
2        6360           10852  Allowed  
3        4011           14314  Blocked  
4        5254            8718  Blocked  

Target variable (Y):
0       Low
1    Medium
2    Medium
3       Low
4    Medium
Name: Threat_Level, dtype: object
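
After a split like this, it can be useful to confirm that the target column is no longer in `X`, that `X` and `Y` still share an index, and how the target classes are distributed. The sketch below assumes a stand-in frame built from the five rows printed above; `X_sample`, `Y_sample`, `class_counts`, and `class_share` are illustrative names.

```python
import pandas as pd

# Stand-in sample: three columns from the five rows printed above.
sample_df = pd.DataFrame({
    "Protocol": ["TCP", "ICMP", "HTTP", "TCP", "FTP"],
    "Status": ["Blocked", "Allowed", "Allowed", "Blocked", "Blocked"],
    "Threat_Level": ["Low", "Medium", "Medium", "Low", "Medium"],
})

# Same split as above, applied to the sample.
X_sample = sample_df.drop(columns=["Threat_Level"])
Y_sample = sample_df["Threat_Level"]

# Absolute and relative frequency of each threat level in the target.
class_counts = Y_sample.value_counts()
class_share = Y_sample.value_counts(normalize=True)

print(class_counts)
print(class_share.round(2))
```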


##### Removing a Column


In [30]:
df = df.drop(columns=["Timestamp"])

In [31]:
# Display the first few rows to confirm the column was removed
df.head()

|  | Source_IP | Destination_IP | Protocol | Port | Bytes_Sent | Bytes_Received | Status | Threat_Level |
|---|---|---|---|---|---|---|---|---|
| 0 | 10.0.0.15 | 192.168.1.20 | TCP | NaN | 5411 | 8989 | Blocked | Low |
| 1 | 192.168.1.13 | 172.217.169.46 | ICMP | 443.0 | 4999 | 11808 | Allowed | Medium |
| 2 | 10.0.0.5 | 203.0.113.99 | HTTP | 443.0 | 6360 | 10852 | Allowed | Medium |
| 3 | 10.0.0.9 | 192.168.1.20 | TCP | NaN | 4011 | 14314 | Blocked | Low |
| 4 | 192.168.1.4 | 172.217.169.46 | FTP | NaN | 5254 | 8718 | Blocked | Medium |
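
One practical note on this step: `drop` returns a new DataFrame, and running the same cell a second time raises a `KeyError` because the column is already gone. The sketch below, on a stand-in two-row frame, shows both behaviors and uses `errors="ignore"` so the call is safe to repeat; the variable names are illustrative.

```python
import pandas as pd

# Stand-in sample: three columns from the first two rows of the dataset.
sample_df = pd.DataFrame({
    "Timestamp": ["2025-03-19 13:04:10", "2025-03-19 13:03:40"],
    "Source_IP": ["10.0.0.15", "192.168.1.13"],
    "Status": ["Blocked", "Allowed"],
})

# drop returns a new DataFrame; sample_df itself is unchanged.
without_time = sample_df.drop(columns=["Timestamp"])

# Dropping the same column again would raise KeyError by default;
# errors="ignore" makes the call a no-op when the column is already gone.
still_without_time = without_time.drop(columns=["Timestamp"], errors="ignore")

print(list(still_without_time.columns))
```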
