# **Cybersecurity Data Analysis** with **Pandas**

PROJECT DESCRIPTION:

This project focuses on leveraging Pandas to analyze real-world network traffic logs with the goal of identifying potential cybersecurity threats. Through data filtering, pattern recognition, and threat classification, the analysis provides actionable insights for incident response and proactive threat mitigation. The dataset was also structured for future integration with machine learning models. Designed from a DevOps perspective, this project emphasizes the importance of embedding security (DevSecOps) into automation workflows, reinforcing continuous monitoring and data-driven decision-making in secure infrastructure management.

**By: Musimbi Francis**

musimbifrancis5@gmail.com

9th April, 2025.

OBJECTIVE: As a DevOps student, this project aims to deepen my understanding and skills in using Pandas for automated network traffic analysis, enabling real-time threat detection and data preprocessing for integration into DevSecOps pipelines.

---------------------------------------------------------------------------------------------------------

### _STEP 1:_ Importing the Pandas Library

In [1]:
import pandas as pd

_Explanation:_

**Pandas** is a powerful Python data analysis library that is well suited to loading and reading data from various sources, including Excel, CSV, JSON, and SQL.

**Pandas** also makes it easy to split, merge, join, and organize datasets, which makes it an essential tool for cleaning, transforming, and manipulating data.
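
As a small illustration of the merge and split operations mentioned above, the sketch below joins two tiny hand-made tables and totals a column per group. The `owners` table, the `Owner` column, and all values are invented for this example and are not part of the project dataset.

```python
import pandas as pd

# Hypothetical sample data, made up for illustration only.
traffic = pd.DataFrame({
    "Source_IP": ["10.0.0.1", "10.0.0.2", "10.0.0.1"],
    "Bytes_Sent": [100, 250, 300],
})
owners = pd.DataFrame({
    "Source_IP": ["10.0.0.1", "10.0.0.2"],
    "Owner": ["web-server", "db-server"],
})

# Join the two tables on their shared column.
merged = traffic.merge(owners, on="Source_IP", how="left")

# Split the rows by owner and total the bytes sent by each one.
bytes_per_owner = merged.groupby("Owner")["Bytes_Sent"].sum()
print(bytes_per_owner)
```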

### _STEP 2:_ Loading the Dataset from GitHub for Analysis

In [2]:
# Loading the dataset from GitHub and storing it in a variable.
url = "https://raw.githubusercontent.com/ritaafrica/data/refs/heads/main/network_traffic_data.csv"
df = pd.read_csv(url)

In [3]:
# Display the first 5 rows.
df.head()

|  | Timestamp | Source_IP | Destination_IP | Protocol | Port | Bytes_Sent | Bytes_Received | Status | Threat_Level |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 2025-03-19 13:04:10 | 10.0.0.15 | 192.168.1.20 | TCP | NaN | 5411 | 8989 | Blocked | Low |
| 1 | 2025-03-19 13:03:40 | 192.168.1.13 | 172.217.169.46 | ICMP | 443.0 | 4999 | 11808 | Allowed | Medium |
| 2 | 2025-03-19 13:03:10 | 10.0.0.5 | 203.0.113.99 | HTTP | 443.0 | 6360 | 10852 | Allowed | Medium |
| 3 | 2025-03-19 13:02:40 | 10.0.0.9 | 192.168.1.20 | TCP | NaN | 4011 | 14314 | Blocked | Low |
| 4 | 2025-03-19 13:02:10 | 192.168.1.4 | 172.217.169.46 | FTP | NaN | 5254 | 8718 | Blocked | Medium |
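
The `Timestamp` column can be parsed as real datetimes while the file is being read. The sketch below shows one way to do that with the `parse_dates` argument of `pd.read_csv`. To keep it runnable offline, it reads three rows copied from the output above out of an in-memory string instead of the remote URL, on the assumption that the file has the same column layout.

```python
import io
import pandas as pd

# Three rows copied from the df.head() output above, used as a stand-in
# for the remote CSV so the example runs without network access.
csv_text = """Timestamp,Source_IP,Destination_IP,Protocol,Port,Bytes_Sent,Bytes_Received,Status,Threat_Level
2025-03-19 13:04:10,10.0.0.15,192.168.1.20,TCP,,5411,8989,Blocked,Low
2025-03-19 13:03:40,192.168.1.13,172.217.169.46,ICMP,443.0,4999,11808,Allowed,Medium
2025-03-19 13:03:10,10.0.0.5,203.0.113.99,HTTP,443.0,6360,10852,Allowed,Medium
"""

# parse_dates converts the Timestamp column to datetime64 while reading.
sample = pd.read_csv(io.StringIO(csv_text), parse_dates=["Timestamp"])

print(sample.dtypes)
```

With a datetime column, the rows can later be sorted or resampled by time without further conversion.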


### _STEP 3:_ Basic Data Exploration

In [4]:
# Displaying the Total Number of Rows and Columns of the Dataset.
print(f'Rows, Columns: \n{df.shape}')

Rows, Columns: 
(1000, 9)


In [5]:
# Displaying all column names in the Dataset.
print(f'Column Names: \n{df.columns}')

Column Names: 
Index(['Timestamp', 'Source_IP', 'Destination_IP', 'Protocol', 'Port',
       'Bytes_Sent', 'Bytes_Received', 'Status', 'Threat_Level'],
      dtype='object')


In [6]:
# Displaying the column data types, non-null counts, and memory usage.
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 9 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Timestamp       1000 non-null   object 
 1   Source_IP       1000 non-null   object 
 2   Destination_IP  1000 non-null   object 
 3   Protocol        1000 non-null   object 
 4   Port            874 non-null    float64
 5   Bytes_Sent      1000 non-null   int64  
 6   Bytes_Received  1000 non-null   int64  
 7   Status          1000 non-null   object 
 8   Threat_Level    1000 non-null   object 
dtypes: float64(1), int64(2), object(6)
memory usage: 70.4+ KB
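
The `info()` output shows that `Port` has only 874 non-null values and is stored as `float64`, which is how Pandas represents integers with gaps by default. The sketch below is one possible way to count the gaps and switch to the nullable `Int64` dtype. It uses the `Port` values from the first five rows shown earlier rather than the full dataset.

```python
import numpy as np
import pandas as pd

# Port values copied from the first five rows shown earlier.
sample = pd.DataFrame({"Port": [np.nan, 443.0, 443.0, np.nan, np.nan]})

# Count the missing values per column.
missing = sample.isna().sum()

# The nullable Int64 dtype keeps missing ports as <NA> while
# storing the known ports as integers instead of floats.
sample["Port"] = sample["Port"].astype("Int64")

print(missing)
print(sample["Port"])
```

Whether missing ports should be kept, filled, or dropped depends on the analysis, so this sketch leaves them in place.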


In [7]:
# Describing summary statistics of the Dataset
df.describe()

|  | Port | Bytes_Sent | Bytes_Received |
| --- | --- | --- | --- |
| count | 874.0 | 1000.0 | 1000.0 |
| mean | 1819.73913 | 5143.572 | 7562.659 |
| std | 2899.374632 | 2808.256143 | 4240.206295 |
| min | 21.0 | 106.0 | 102.0 |
| 25% | 22.0 | 2857.0 | 4025.5 |
| 50% | 80.0 | 5224.0 | 7584.5 |
| 75% | 3389.0 | 7487.75 | 11147.75 |
| max | 8080.0 | 9984.0 | 14977.0 |
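
By default, `describe()` summarizes only the numeric columns, so text columns such as `Status` and `Threat_Level` are left out. A possible complement is `value_counts()`, sketched below on the first seven rows shown in this notebook rather than on the full dataset.

```python
import pandas as pd

# Status and Threat_Level values copied from the first seven rows of the dataset.
sample = pd.DataFrame({
    "Status": ["Blocked", "Allowed", "Allowed", "Blocked", "Blocked", "Allowed", "Allowed"],
    "Threat_Level": ["Low", "Medium", "Medium", "Low", "Medium", "Low", "High"],
})

# Absolute number of rows per threat level.
threat_counts = sample["Threat_Level"].value_counts()

# Share of rows per status (fractions that sum to 1).
status_share = sample["Status"].value_counts(normalize=True)

print(threat_counts)
print(status_share)
```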


### _STEP 4:_ Selecting Columns for Display

In [8]:
# Selecting these Columns for Display.
selected_columns = df[["Source_IP", "Destination_IP", "Status", "Threat_Level"]]

# Displaying the first 10 Rows of the Selected columns.
selected_columns.head(10)

|  | Source_IP | Destination_IP | Status | Threat_Level |
| --- | --- | --- | --- | --- |
| 0 | 10.0.0.15 | 192.168.1.20 | Blocked | Low |
| 1 | 192.168.1.13 | 172.217.169.46 | Allowed | Medium |
| 2 | 10.0.0.5 | 203.0.113.99 | Allowed | Medium |
| 3 | 10.0.0.9 | 192.168.1.20 | Blocked | Low |
| 4 | 192.168.1.4 | 172.217.169.46 | Blocked | Medium |
| 5 | 10.0.0.43 | 172.217.169.46 | Allowed | Low |
| 6 | 10.0.0.26 | 10.0.0.5 | Allowed | High |
| 7 | 192.168.1.36 | 192.168.1.20 | Allowed | Medium |
| 8 | 192.168.1.26 | 192.168.1.20 | Allowed | Medium |
| 9 | 10.0.0.43 | 10.0.0.5 | Blocked | Low |


### _STEP 5:_ Exposing the Threat Level in the Dataset

In [9]:
# Storing and Selecting columns to Display the Threat Level on the Company's Network.
threat_level = df[["Source_IP", "Destination_IP", "Protocol", "Threat_Level"]]

# Displaying the First 7 Rows of the Threat level.
threat_level.head(7)

|  | Source_IP | Destination_IP | Protocol | Threat_Level |
| --- | --- | --- | --- | --- |
| 0 | 10.0.0.15 | 192.168.1.20 | TCP | Low |
| 1 | 192.168.1.13 | 172.217.169.46 | ICMP | Medium |
| 2 | 10.0.0.5 | 203.0.113.99 | HTTP | Medium |
| 3 | 10.0.0.9 | 192.168.1.20 | TCP | Low |
| 4 | 192.168.1.4 | 172.217.169.46 | FTP | Medium |
| 5 | 10.0.0.43 | 172.217.169.46 | DNS | Low |
| 6 | 10.0.0.26 | 10.0.0.5 | ICMP | High |
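
Listing rows shows individual threat levels, but a cross-tabulation can show how threat levels are distributed across protocols at a glance. The sketch below applies `pd.crosstab` to the seven rows shown above. Running it on the full `df` would be the natural next step.

```python
import pandas as pd

# Protocol and Threat_Level values copied from the seven rows shown above.
sample = pd.DataFrame({
    "Protocol": ["TCP", "ICMP", "HTTP", "TCP", "FTP", "DNS", "ICMP"],
    "Threat_Level": ["Low", "Medium", "Medium", "Low", "Medium", "Low", "High"],
})

# Count rows for every (Protocol, Threat_Level) combination.
threat_by_protocol = pd.crosstab(sample["Protocol"], sample["Threat_Level"])

print(threat_by_protocol)
```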


### _STEP 6:_ Filtering only the Blocked Traffic from the Dataset

In [10]:
# Filtering the Blocked Traffic and ignoring the case.
blocked_traffic = df[df["Status"].str.lower() == "blocked"]

# Creating a summary of Blocked Traffic.
blocked_summary = blocked_traffic[["Timestamp", "Source_IP", "Destination_IP", "Threat_Level", "Status"]]

# Displaying the first five Rows of the Blocked Traffic.
blocked_summary.head()

|  | Timestamp | Source_IP | Destination_IP | Threat_Level | Status |
| --- | --- | --- | --- | --- | --- |
| 0 | 2025-03-19 13:04:10 | 10.0.0.15 | 192.168.1.20 | Low | Blocked |
| 3 | 2025-03-19 13:02:40 | 10.0.0.9 | 192.168.1.20 | Low | Blocked |
| 4 | 2025-03-19 13:02:10 | 192.168.1.4 | 172.217.169.46 | Medium | Blocked |
| 9 | 2025-03-19 12:59:40 | 10.0.0.43 | 10.0.0.5 | Low | Blocked |
| 10 | 2025-03-19 12:59:10 | 10.0.0.33 | 203.0.113.99 | Medium | Blocked |


In [11]:
print(f'Total Number of Blocked Traffic: {len(blocked_summary)}')

Total Number of Blocked Traffic: 532
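
Beyond the total count, it may be useful to see which destinations attract the most blocked connections. The sketch below repeats the case-insensitive filter and then ranks destinations with `value_counts()`. It uses the first ten rows shown in Step 4 rather than the full dataset, so the counts are only illustrative.

```python
import pandas as pd

# Destination_IP and Status values copied from the first ten rows shown in Step 4.
sample = pd.DataFrame({
    "Destination_IP": [
        "192.168.1.20", "172.217.169.46", "203.0.113.99", "192.168.1.20",
        "172.217.169.46", "172.217.169.46", "10.0.0.5", "192.168.1.20",
        "192.168.1.20", "10.0.0.5",
    ],
    "Status": [
        "Blocked", "Allowed", "Allowed", "Blocked", "Blocked",
        "Allowed", "Allowed", "Allowed", "Allowed", "Blocked",
    ],
})

# Keep only blocked rows, ignoring case.
blocked = sample[sample["Status"].str.lower() == "blocked"]

# Rank destinations by number of blocked connections, most frequent first.
blocked_per_destination = blocked["Destination_IP"].value_counts()

print(blocked_per_destination)
```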


### _STEP 7:_ Filtering Suspicious Traffic from the Dataset

In [12]:
# Filtering only Critical-level traffic (treated here as suspicious), ignoring case.
suspicious_traffic = df[df["Threat_Level"].str.lower() == "critical"]

# Displaying the first 10 Rows of Suspicious Traffic.
suspicious_traffic.head(10)

|  | Timestamp | Source_IP | Destination_IP | Protocol | Port | Bytes_Sent | Bytes_Received | Status | Threat_Level |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 59 | 2025-03-19 12:34:40 | 10.0.0.47 | 192.168.1.20 | ICMP | NaN | 5885 | 463 | Allowed | Critical |
| 96 | 2025-03-19 12:16:10 | 192.168.1.35 | 203.0.113.99 | FTP | 8080.0 | 9371 | 7189 | Allowed | Critical |
| 134 | 2025-03-19 11:57:10 | 192.168.1.17 | 172.217.169.46 | DNS | 22.0 | 6714 | 13124 | Blocked | Critical |
| 150 | 2025-03-19 11:49:10 | 192.168.1.42 | 10.0.0.5 | HTTP | 53.0 | 2702 | 634 | Allowed | Critical |
| 209 | 2025-03-19 11:19:40 | 10.0.0.17 | 203.0.113.99 | TCP | 3389.0 | 5085 | 10014 | Blocked | Critical |
| 212 | 2025-03-19 11:18:10 | 192.168.1.23 | 8.8.8.8 | FTP | 21.0 | 7190 | 10232 | Blocked | Critical |
| 219 | 2025-03-19 11:14:40 | 10.0.0.30 | 192.168.1.20 | TCP | 22.0 | 2702 | 4498 | Allowed | Critical |
| 232 | 2025-03-19 11:08:10 | 10.0.0.3 | 172.217.169.46 | DNS | NaN | 2606 | 11416 | Blocked | Critical |
| 251 | 2025-03-19 10:58:40 | 192.168.1.48 | 203.0.113.99 | DNS | 53.0 | 7644 | 5920 | Allowed | Critical |
| 256 | 2025-03-19 10:56:10 | 10.0.0.31 | 203.0.113.99 | ICMP | 22.0 | 9167 | 8793 | Allowed | Critical |


In [13]:
print(f'Total Number of Suspicious Traffic: {len(suspicious_traffic)}')

Total Number of Suspicious Traffic: 47
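
The cell above matches only the `critical` label. If `High` traffic should also count as suspicious, one option is to match several labels at once with `isin`. The sketch below uses a made-up list of labels with mixed casing to show the idea. Treating `High` as suspicious is an assumption, not something the dataset defines.

```python
import pandas as pd

# Hypothetical labels with mixed casing, made up for illustration only.
sample = pd.DataFrame({
    "Threat_Level": ["Low", "Medium", "High", "Critical", "critical", "HIGH"],
})

# Normalize the case, then keep rows whose label is in the chosen set.
suspicious_levels = ["high", "critical"]
mask = sample["Threat_Level"].str.lower().isin(suspicious_levels)
high_risk = sample[mask]

print(high_risk)
```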


### _STEP 8:_ Filtering Traffic with High Data Transfer

In [14]:
# Filtering Traffic where Bytes Sent is greater than 5000.
high_data_transfer = df[df["Bytes_Sent"] > 5000]

# Displaying the first 5 Rows of Traffic with High Data Transfer Greater than 5000 Bytes.
high_data_transfer.head()

|  | Timestamp | Source_IP | Destination_IP | Protocol | Port | Bytes_Sent | Bytes_Received | Status | Threat_Level |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 2025-03-19 13:04:10 | 10.0.0.15 | 192.168.1.20 | TCP | NaN | 5411 | 8989 | Blocked | Low |
| 2 | 2025-03-19 13:03:10 | 10.0.0.5 | 203.0.113.99 | HTTP | 443.0 | 6360 | 10852 | Allowed | Medium |
| 4 | 2025-03-19 13:02:10 | 192.168.1.4 | 172.217.169.46 | FTP | NaN | 5254 | 8718 | Blocked | Medium |
| 5 | 2025-03-19 13:01:40 | 10.0.0.43 | 172.217.169.46 | DNS | 53.0 | 6915 | 12981 | Allowed | Low |
| 7 | 2025-03-19 13:00:40 | 192.168.1.36 | 192.168.1.20 | TCP | 21.0 | 5655 | 119 | Allowed | Medium |


In [15]:
# Displaying the total number of traffic with high data transfer greater than 5000 Bytes.
print(f'Total Number of Traffic with High Data Transfer Greater than 5000 Bytes: \n{len(high_data_transfer)}')

Total Number of Traffic with High Data Transfer Greater than 5000 Bytes: 
518
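
A fixed 5000-byte cutoff sits close to the median of `Bytes_Sent` (5224 in the summary statistics), which is why it matches about half of the rows. A possible alternative is a percentile-based threshold, sketched below with `quantile`. The sample values and the choice of the 90th percentile are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical byte counts, made up for illustration only.
sample = pd.DataFrame({
    "Bytes_Sent": [100, 200, 300, 400, 500, 600, 700, 800, 900, 10000],
})

# Use the 90th percentile of the data itself as the cutoff.
threshold = sample["Bytes_Sent"].quantile(0.90)

# Keep only the rows above that data-driven cutoff.
outliers = sample[sample["Bytes_Sent"] > threshold]

print(f"Threshold: {threshold}")
print(outliers)
```

On the real dataset the same two lines would run against `df["Bytes_Sent"]`, and the percentile can be tuned to how many rows an analyst can realistically review.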


### _STEP 9:_ Splitting the Dataset into X (Features) and y (Target Variable)

In [16]:
x = df.drop(columns=["Threat_Level"])  # Selecting all columns except Threat_Level as Features.

y = df["Threat_Level"] # Selecting Threat Level as the Target Variable.

In [17]:
# Displaying the first five Rows of the selected Features (All columns except Threat Level).
x.head()

|  | Timestamp | Source_IP | Destination_IP | Protocol | Port | Bytes_Sent | Bytes_Received | Status |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 2025-03-19 13:04:10 | 10.0.0.15 | 192.168.1.20 | TCP | NaN | 5411 | 8989 | Blocked |
| 1 | 2025-03-19 13:03:40 | 192.168.1.13 | 172.217.169.46 | ICMP | 443.0 | 4999 | 11808 | Allowed |
| 2 | 2025-03-19 13:03:10 | 10.0.0.5 | 203.0.113.99 | HTTP | 443.0 | 6360 | 10852 | Allowed |
| 3 | 2025-03-19 13:02:40 | 10.0.0.9 | 192.168.1.20 | TCP | NaN | 4011 | 14314 | Blocked |
| 4 | 2025-03-19 13:02:10 | 192.168.1.4 | 172.217.169.46 | FTP | NaN | 5254 | 8718 | Blocked |


In [18]:
# Displaying the first five Rows of the Target Variable (Threat Level).
y.head()

0       Low
1    Medium
2    Medium
3       Low
4    Medium
Name: Threat_Level, dtype: object
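
Most machine learning models expect numeric inputs, so text features such as `Protocol` and `Status` usually need encoding after the split. The sketch below shows one common approach using `pd.get_dummies` for the features and an ordered categorical for the target. It runs on the first five rows shown above, and the severity order Low < Medium < High < Critical is an assumption.

```python
import pandas as pd

# Values copied from the first five rows shown above.
sample = pd.DataFrame({
    "Protocol": ["TCP", "ICMP", "HTTP", "TCP", "FTP"],
    "Status": ["Blocked", "Allowed", "Allowed", "Blocked", "Blocked"],
    "Bytes_Sent": [5411, 4999, 6360, 4011, 5254],
    "Threat_Level": ["Low", "Medium", "Medium", "Low", "Medium"],
})

features = sample.drop(columns=["Threat_Level"])
target = sample["Threat_Level"]

# One-hot encode the text features; numeric columns pass through unchanged.
encoded = pd.get_dummies(features, columns=["Protocol", "Status"])

# Map the target to integer codes using an assumed severity order.
severity_order = ["Low", "Medium", "High", "Critical"]
target_codes = pd.Categorical(target, categories=severity_order, ordered=True).codes

print(encoded.columns.tolist())
print(target_codes)
```

IP addresses and timestamps are left out here because they typically need their own feature engineering rather than one-hot encoding.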

### _STEP 10:_ Removing a Column

In [26]:
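# Removing the Timestamp column from the features (x).
# Note: df is overwritten with the result, so it no longer contains Timestamp or Threat_Level.
df = x.drop(columns=["Timestamp"])

# Displaying the remaining columns to confirm that Timestamp was removed.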
df.head()

|  | Source_IP | Destination_IP | Protocol | Port | Bytes_Sent | Bytes_Received | Status |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 10.0.0.15 | 192.168.1.20 | TCP | NaN | 5411 | 8989 | Blocked |
| 1 | 192.168.1.13 | 172.217.169.46 | ICMP | 443.0 | 4999 | 11808 | Allowed |
| 2 | 10.0.0.5 | 203.0.113.99 | HTTP | 443.0 | 6360 | 10852 | Allowed |
| 3 | 10.0.0.9 | 192.168.1.20 | TCP | NaN | 4011 | 14314 | Blocked |
| 4 | 192.168.1.4 | 172.217.169.46 | FTP | NaN | 5254 | 8718 | Blocked |
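
`drop` returns a new DataFrame and leaves the original untouched, so the result can be stored under a separate name when the full dataset is still needed. In the cell above, assigning the result back to `df` means `df` no longer holds `Timestamp` or `Threat_Level`. The sketch below shows the non-destructive variant on three rows taken from earlier outputs. The name `features` is only illustrative.

```python
import pandas as pd

# Values copied from the first three rows of the dataset.
sample = pd.DataFrame({
    "Timestamp": ["2025-03-19 13:04:10", "2025-03-19 13:03:40", "2025-03-19 13:03:10"],
    "Bytes_Sent": [5411, 4999, 6360],
    "Threat_Level": ["Low", "Medium", "Medium"],
})

# drop returns a new DataFrame; sample itself keeps every column.
features = sample.drop(columns=["Timestamp", "Threat_Level"])

print(features.columns.tolist())
print(sample.columns.tolist())
```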


**CONCLUSION:**
This project provided practical experience in leveraging data analysis techniques for cybersecurity, reinforcing the critical role of automation and threat intelligence in DevOps and DevSecOps workflows.

**_Key Insights:_**

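1. Blocked and High-Risk Threat Identification: Filtering the traffic logs isolated 532 blocked connections and 47 Critical-level records out of 1,000, narrowing the data to the events that need a response first.

2. Anomaly Detection: Unusual data transfer volumes can signal potential security breaches and warrant continuous monitoring. Here, 518 records exceeded the 5,000-byte threshold, so a fixed cutoff this close to the median works only as a first-pass filter.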

3. Threat Classification: Categorizing threats by severity enhances prioritization and resource allocation during incident handling.

4. Data Readiness for ML: Structuring security logs into feature-rich datasets is essential for future integration with predictive models.
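
For an automated workflow, the filters from Steps 6 to 8 could be bundled into one reusable function. The sketch below is one possible shape, not the project's implementation: `flag_traffic`, its `bytes_threshold` parameter, and the `is_*` column names are hypothetical. The three sample rows are taken from outputs shown earlier.

```python
import pandas as pd


def flag_traffic(logs: pd.DataFrame, bytes_threshold: int = 5000) -> pd.DataFrame:
    """Return a copy of logs with one boolean column per check (hypothetical helper)."""
    flagged = logs.copy()
    flagged["is_blocked"] = flagged["Status"].str.lower() == "blocked"
    flagged["is_critical"] = flagged["Threat_Level"].str.lower() == "critical"
    flagged["is_high_transfer"] = flagged["Bytes_Sent"] > bytes_threshold
    return flagged


# Rows 0, 96, and 1 of the dataset, reduced to the columns the checks need.
sample = pd.DataFrame({
    "Status": ["Blocked", "Allowed", "Allowed"],
    "Threat_Level": ["Low", "Critical", "Medium"],
    "Bytes_Sent": [5411, 9371, 4999],
})

flagged = flag_traffic(sample)

# Number of rows raised by each check.
print(flagged[["is_blocked", "is_critical", "is_high_transfer"]].sum())
```

Keeping the flags as columns, instead of producing three separate filtered tables, makes it easy to combine them later, for example to find traffic that is both allowed and critical.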