# Cybersecurity Data Analysis

### A Project in Cybersecurity Data Analysis

##### The purpose of this project is to analyze network traffic data to identify potential security threats, unusual patterns, blocked or allowed connections, and improve cybersecurity monitoring.

By: Louise M. Persidis

louisepersidis20@gmail.com

(https://github.com/Louise-persidis)

#### Import Library

In [None]:
# Import the necessary library
import pandas as pd

Explanation: The import pandas as pd statement loads the Pandas library under the alias pd. Pandas provides the DataFrame tools used throughout this project to load the network traffic data, filter security events, and inspect traffic volumes for unusual activity.

#### Load dataset from GitHub

In [None]:
# Load the dataset from GitHub
df = pd.read_csv('https://raw.githubusercontent.com/ritaafrica/data/main/network_traffic_data.csv')

Explanation: This cell reads the network traffic dataset directly from its raw GitHub URL into a Pandas DataFrame named df, which every later step works on.
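The dataset output further down shows Timestamp stored as plain text (dtype object). As a minimal sketch, and not part of the original notebook, the column could be parsed as a datetime while reading by passing parse_dates to read_csv. The inline CSV below reuses the first two rows printed in this notebook so the example runs without network access; the names sample_csv and sample_df are illustrative.

```python
import io

import pandas as pd

# Two rows copied from the output printed in this notebook, used here
# in place of the remote file so the sketch runs offline.
sample_csv = io.StringIO(
    "Timestamp,Source_IP,Destination_IP,Protocol,Port,"
    "Bytes_Sent,Bytes_Received,Status,Threat_Level\n"
    "2025-03-19 13:04:10,10.0.0.15,192.168.1.20,TCP,,5411,8989,Blocked,Low\n"
    "2025-03-19 13:03:40,192.168.1.13,172.217.169.46,ICMP,443,4999,11808,Allowed,Medium\n"
)

# parse_dates converts the Timestamp column to datetime64 during loading.
sample_df = pd.read_csv(sample_csv, parse_dates=["Timestamp"])
print(sample_df.dtypes)
```

With a datetime column, time-based operations such as sorting by time or extracting the hour become available through the .dt accessor.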

In [9]:
# Display the first 5 rows
print(df.head(5))

             Timestamp     Source_IP  Destination_IP Protocol   Port  \
0  2025-03-19 13:04:10     10.0.0.15    192.168.1.20      TCP    NaN   
1  2025-03-19 13:03:40  192.168.1.13  172.217.169.46     ICMP  443.0   
2  2025-03-19 13:03:10      10.0.0.5    203.0.113.99     HTTP  443.0   
3  2025-03-19 13:02:40      10.0.0.9    192.168.1.20      TCP    NaN   
4  2025-03-19 13:02:10   192.168.1.4  172.217.169.46      FTP    NaN   

   Bytes_Sent  Bytes_Received   Status Threat_Level  
0        5411            8989  Blocked          Low  
1        4999           11808  Allowed       Medium  
2        6360           10852  Allowed       Medium  
3        4011           14314  Blocked          Low  
4        5254            8718  Blocked       Medium  


Explanation: This code displays the first 5 rows of the dataset to give an overview of its structure and content.

#### Basic Data Exploration

In [10]:
# Check the number of rows and columns
print(f"Dataset Shape: {df.shape}")

Dataset Shape: (1000, 9)


Explanation: This section displays the number of rows and columns in the dataset.

In [11]:
# Get column names
print("\nColumn Names: ")
print(df.columns)


Column Names: 
Index(['Timestamp', 'Source_IP', 'Destination_IP', 'Protocol', 'Port',
       'Bytes_Sent', 'Bytes_Received', 'Status', 'Threat_Level'],
      dtype='object')


Explanation: This section lists all column names, helping to understand the dataset structure.

In [12]:
# Display basic info about the dataset
print("\nDataset Info:")
print(df.info())


Dataset Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 9 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Timestamp       1000 non-null   object 
 1   Source_IP       1000 non-null   object 
 2   Destination_IP  1000 non-null   object 
 3   Protocol        1000 non-null   object 
 4   Port            874 non-null    float64
 5   Bytes_Sent      1000 non-null   int64  
 6   Bytes_Received  1000 non-null   int64  
 7   Status          1000 non-null   object 
 8   Threat_Level    1000 non-null   object 
dtypes: float64(1), int64(2), object(6)
memory usage: 70.4+ KB
None


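Explanation: This section shows each column's data type and non-null count, along with the DataFrame's memory usage. Here it reveals that Port has only 874 non-null values out of 1000 rows, so 126 rows are missing a port, and that Timestamp is stored as text (object) rather than as a datetime.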
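Missing values can also be counted directly instead of being read off the info() output. The following is a small sketch using isna(); the frame sample_df is an illustrative stand-in whose Port values mirror the first five rows shown earlier, not the full dataset.

```python
import numpy as np
import pandas as pd

# Illustrative stand-in: Protocol and Port from the first five rows shown above.
sample_df = pd.DataFrame({
    "Protocol": ["TCP", "ICMP", "HTTP", "TCP", "FTP"],
    "Port": [np.nan, 443.0, 443.0, np.nan, np.nan],
})

# Count missing values per column.
missing_per_column = sample_df.isna().sum()

# Fraction of rows where Port is missing.
missing_share = sample_df["Port"].isna().mean()

print(missing_per_column)
print(f"Share of rows with missing Port: {missing_share:.0%}")
```

On the full dataset the same calls would be applied to df instead of sample_df.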

In [13]:
# Show summary statistics
print("\nSummary Statistics:")
print(df.describe())


Summary Statistics:
              Port   Bytes_Sent  Bytes_Received
count   874.000000  1000.000000     1000.000000
mean   1819.739130  5143.572000     7562.659000
std    2899.374632  2808.256143     4240.206295
min      21.000000   106.000000      102.000000
25%      22.000000  2857.000000     4025.500000
50%      80.000000  5224.000000     7584.500000
75%    3389.000000  7487.750000    11147.750000
max    8080.000000  9984.000000    14977.000000


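Explanation: This section provides summary statistics (count, mean, standard deviation, min, quartiles, and max) for the numerical columns, giving insight into the data distribution. Port is included because it is stored as a number, but a port is an identifier rather than a quantity, so its mean and standard deviation are not meaningful. The text columns (Protocol, Status, Threat_Level) are not summarized by describe() by default.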
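For the categorical columns, frequency counts are usually more informative than describe(). A minimal sketch using value_counts() follows; sample_df is an illustrative frame built from the first five rows shown earlier.

```python
import pandas as pd

# Illustrative stand-in built from the first five rows shown above.
sample_df = pd.DataFrame({
    "Protocol": ["TCP", "ICMP", "HTTP", "TCP", "FTP"],
    "Status": ["Blocked", "Allowed", "Allowed", "Blocked", "Blocked"],
    "Threat_Level": ["Low", "Medium", "Medium", "Low", "Medium"],
})

# Absolute counts per status.
status_counts = sample_df["Status"].value_counts()

# Relative frequency of each threat level.
threat_share = sample_df["Threat_Level"].value_counts(normalize=True)

print(status_counts)
print(threat_share)
```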

#### Selecting Specific Columns for Display

In [15]:
# Select important columns
selected_columns = df[["Timestamp", "Source_IP", "Destination_IP", "Status"]]

# Display the first few rows
print(selected_columns.head())

             Timestamp     Source_IP  Destination_IP   Status
0  2025-03-19 13:04:10     10.0.0.15    192.168.1.20  Blocked
1  2025-03-19 13:03:40  192.168.1.13  172.217.169.46  Allowed
2  2025-03-19 13:03:10      10.0.0.5    203.0.113.99  Allowed
3  2025-03-19 13:02:40      10.0.0.9    192.168.1.20  Blocked
4  2025-03-19 13:02:10   192.168.1.4  172.217.169.46  Blocked


Explanation: This code extracts and displays key columns from the network traffic dataset for focused analysis.

#### Storing Selected Columns in a New Variable

In [17]:
# Store a single column as a Series
source_ips = df["Source_IP"]

# Store multiple columns in a DataFrame
network_activity = df[["Source_IP", "Destination_IP", "Protocol", "Threat_Level"]]

# Display the first few rows
print(network_activity.head())

      Source_IP  Destination_IP Protocol Threat_Level
0     10.0.0.15    192.168.1.20      TCP          Low
1  192.168.1.13  172.217.169.46     ICMP       Medium
2      10.0.0.5    203.0.113.99     HTTP       Medium
3      10.0.0.9    192.168.1.20      TCP          Low
4   192.168.1.4  172.217.169.46      FTP       Medium


Explanation: This code is used to extract specific data from the network traffic dataset by storing a single column as a Series and multiple columns as a DataFrame for further analysis.
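The difference between the two selection styles comes down to the brackets: a single column name returns a Series, while a list of names returns a DataFrame, even when the list has one entry. A short sketch with an illustrative two-row frame:

```python
import pandas as pd

# Illustrative two-row frame using values shown earlier in the notebook.
sample_df = pd.DataFrame({
    "Source_IP": ["10.0.0.15", "192.168.1.13"],
    "Protocol": ["TCP", "ICMP"],
})

as_series = sample_df["Source_IP"]    # single label -> Series (1-D)
as_frame = sample_df[["Source_IP"]]   # list of labels -> DataFrame (2-D)

print(type(as_series).__name__, as_series.shape)
print(type(as_frame).__name__, as_frame.shape)
```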

#### Filtering Data - Selecting & Storing Blocked Traffic

In [26]:
# Filter only blocked traffic
blocked_traffic = df[df["Status"] == "Blocked"]

# Select key details for analysis
blocked_summary = blocked_traffic[["Timestamp", "Source_IP", "Destination_IP", "Status", "Threat_Level"]]

# Display the first few rows
print(blocked_summary.head())

              Timestamp    Source_IP  Destination_IP   Status Threat_Level
0   2025-03-19 13:04:10    10.0.0.15    192.168.1.20  Blocked          Low
3   2025-03-19 13:02:40     10.0.0.9    192.168.1.20  Blocked          Low
4   2025-03-19 13:02:10  192.168.1.4  172.217.169.46  Blocked       Medium
9   2025-03-19 12:59:40    10.0.0.43        10.0.0.5  Blocked          Low
10  2025-03-19 12:59:10    10.0.0.33    203.0.113.99  Blocked       Medium


Explanation: This code filters the dataset down to blocked connections and keeps only the columns needed to review them. Isolating blocked traffic is a first step in security monitoring: it shows which sources and destinations the network is already rejecting, and it provides the basis for spotting repeat offenders or blocked connections with an elevated threat level.
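One possible follow-up, sketched here under assumptions, is to count how often each source IP appears among the blocked rows. The rows in sample_df are made up for illustration and do not come from the dataset.

```python
import pandas as pd

# Made-up rows for illustration only.
sample_df = pd.DataFrame({
    "Source_IP": ["10.0.0.15", "10.0.0.9", "10.0.0.15", "192.168.1.4", "10.0.0.15"],
    "Status": ["Blocked", "Blocked", "Allowed", "Blocked", "Blocked"],
})

# Keep only blocked connections, then count occurrences per source IP.
blocked = sample_df[sample_df["Status"] == "Blocked"]
blocked_per_source = blocked["Source_IP"].value_counts()

# value_counts() sorts by count, so the most frequently blocked source comes first.
print(blocked_per_source.head(3))
```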

#### Filtering Suspicious Traffic

In [25]:
# Filter high-risk (Critical) traffic
high_risk_traffic = df[df["Threat_Level"] == "Critical"]

# Display summary
print(high_risk_traffic.head())

               Timestamp     Source_IP  Destination_IP Protocol    Port  \
59   2025-03-19 12:34:40     10.0.0.47    192.168.1.20     ICMP     NaN   
96   2025-03-19 12:16:10  192.168.1.35    203.0.113.99      FTP  8080.0   
134  2025-03-19 11:57:10  192.168.1.17  172.217.169.46      DNS    22.0   
150  2025-03-19 11:49:10  192.168.1.42        10.0.0.5     HTTP    53.0   
209  2025-03-19 11:19:40     10.0.0.17    203.0.113.99      TCP  3389.0   

     Bytes_Sent  Bytes_Received   Status Threat_Level  
59         5885             463  Allowed     Critical  
96         9371            7189  Allowed     Critical  
134        6714           13124  Blocked     Critical  
150        2702             634  Allowed     Critical  
209        5085           10014  Blocked     Critical  


Explanation: This code filters the dataset to rows whose Threat_Level is Critical, the most severe category. It helps prioritize critical events for incident response and shows which protocols, ports, and hosts are involved. Note that several of the Critical rows in the output have Status Allowed, meaning the connection was not blocked.
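Two conditions can be combined in a single filter using & with each condition in parentheses. The sketch below isolates connections that are both Critical and Allowed; sample_df reuses the Source_IP and Status values of the five Critical rows printed above.

```python
import pandas as pd

# Source_IP and Status of the five Critical rows printed above.
sample_df = pd.DataFrame({
    "Source_IP": ["10.0.0.47", "192.168.1.35", "192.168.1.17", "192.168.1.42", "10.0.0.17"],
    "Status": ["Allowed", "Allowed", "Blocked", "Allowed", "Blocked"],
    "Threat_Level": ["Critical"] * 5,
})

# Each condition needs its own parentheses because & binds tighter than ==.
is_critical = sample_df["Threat_Level"] == "Critical"
is_allowed = sample_df["Status"] == "Allowed"
critical_allowed = sample_df[is_critical & is_allowed]

print(critical_allowed)
```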

#### Filtering Traffic with High Data Transfer

In [31]:
# Filter traffic where Bytes_Sent is greater than 5000
high_data_transfer = df[df["Bytes_Sent"] > 5000]

# Display the first few rows
print(high_data_transfer.head())

# Show the number of such events
print(f"Number of high-data transfers: {len(high_data_transfer)}")

             Timestamp     Source_IP  Destination_IP Protocol   Port  \
0  2025-03-19 13:04:10     10.0.0.15    192.168.1.20      TCP    NaN   
2  2025-03-19 13:03:10      10.0.0.5    203.0.113.99     HTTP  443.0   
4  2025-03-19 13:02:10   192.168.1.4  172.217.169.46      FTP    NaN   
5  2025-03-19 13:01:40     10.0.0.43  172.217.169.46      DNS   53.0   
7  2025-03-19 13:00:40  192.168.1.36    192.168.1.20      TCP   21.0   

   Bytes_Sent  Bytes_Received   Status Threat_Level  
0        5411            8989  Blocked          Low  
2        6360           10852  Allowed       Medium  
4        5254            8718  Blocked       Medium  
5        6915           12981  Allowed          Low  
7        5655             119  Allowed       Medium  
Number of high-data transfers: 518


Explanation: This code selects network events where more than 5000 bytes were sent and prints how many such events exist. Large outbound transfers can indicate data exfiltration or other suspicious activity. However, the 5000-byte threshold sits just below the median Bytes_Sent (5224), so the filter matches 518 of the 1000 rows; it is a coarse first cut rather than a detector of unusually high transfers.
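A data-driven threshold is one alternative to a fixed cutoff. The sketch below uses the 90th percentile of Bytes_Sent; the choice of 0.90 is an assumption for illustration, and the ten values in sample_df are Bytes_Sent figures from rows printed in this notebook.

```python
import pandas as pd

# Bytes_Sent values taken from rows printed in this notebook.
sample_df = pd.DataFrame({
    "Bytes_Sent": [5411, 4999, 6360, 4011, 5254, 6915, 5655, 9371, 2702, 5085],
})

# 0.90 is an illustrative choice, not a recommended setting.
threshold = sample_df["Bytes_Sent"].quantile(0.90)

# Keep only rows above the percentile-based threshold.
top_transfers = sample_df[sample_df["Bytes_Sent"] > threshold]

print(f"Threshold: {threshold:.1f}")
print(top_transfers)
```

A percentile threshold adapts to the data, so the number of flagged events stays roughly proportional to the dataset size.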

#### Splitting the Dataset into X (Features) and y (Target) Variable

In [32]:
# Select features (X) - Exclude the target variable
X = df.drop(columns=["Threat_Level"])

# Select the target variable (y)
y = df["Threat_Level"]

# Display the first few rows of X and y
print("Features (X):")
print(X.head())

print("\nTarget Variable (y):")
print(y.head())

Features (X):
             Timestamp     Source_IP  Destination_IP Protocol   Port  \
0  2025-03-19 13:04:10     10.0.0.15    192.168.1.20      TCP    NaN   
1  2025-03-19 13:03:40  192.168.1.13  172.217.169.46     ICMP  443.0   
2  2025-03-19 13:03:10      10.0.0.5    203.0.113.99     HTTP  443.0   
3  2025-03-19 13:02:40      10.0.0.9    192.168.1.20      TCP    NaN   
4  2025-03-19 13:02:10   192.168.1.4  172.217.169.46      FTP    NaN   

   Bytes_Sent  Bytes_Received   Status  
0        5411            8989  Blocked  
1        4999           11808  Allowed  
2        6360           10852  Allowed  
3        4011           14314  Blocked  
4        5254            8718  Blocked  

Target Variable (y):
0       Low
1    Medium
2    Medium
3       Low
4    Medium
Name: Threat_Level, dtype: object


Explanation: This section is responsible for preparing the dataset for machine learning or classification tasks by separating the features (inputs) from the target variable (output).
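Most machine learning libraries expect numeric inputs, while X still holds text columns such as Protocol and Status. One common way to handle this, sketched here as an assumption about a possible next step, is one-hot encoding with pd.get_dummies. The frame sample_X is an illustrative subset of the first four rows.

```python
import pandas as pd

# Illustrative subset of the feature columns from the first four rows.
sample_X = pd.DataFrame({
    "Protocol": ["TCP", "ICMP", "HTTP", "TCP"],
    "Status": ["Blocked", "Allowed", "Allowed", "Blocked"],
    "Bytes_Sent": [5411, 4999, 6360, 4011],
})

# Replace each listed text column with one indicator column per category.
X_encoded = pd.get_dummies(sample_X, columns=["Protocol", "Status"])

print(X_encoded.columns.tolist())
```

Columns with many distinct values, such as Source_IP, would produce a very wide frame under this approach and may need a different treatment.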

#### Removing a Column

In [33]:
# Remove the 'Timestamp' column
df = df.drop(columns=["Timestamp"])

# Display the first few rows to confirm
print(df.head())

      Source_IP  Destination_IP Protocol   Port  Bytes_Sent  Bytes_Received  \
0     10.0.0.15    192.168.1.20      TCP    NaN        5411            8989   
1  192.168.1.13  172.217.169.46     ICMP  443.0        4999           11808   
2      10.0.0.5    203.0.113.99     HTTP  443.0        6360           10852   
3      10.0.0.9    192.168.1.20      TCP    NaN        4011           14314   
4   192.168.1.4  172.217.169.46      FTP    NaN        5254            8718   

    Status Threat_Level  
0  Blocked          Low  
1  Allowed       Medium  
2  Allowed       Medium  
3  Blocked          Low  
4  Blocked       Medium  


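Explanation: This code removes the Timestamp column from df to simplify the data structure for further processing. Because the result is assigned back to df, the change is permanent for the rest of the notebook. X, which was created in the previous step, still contains Timestamp because it was built before the column was dropped.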
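Running the drop cell a second time raises a KeyError, since Timestamp no longer exists in df. A small sketch of a re-runnable variant uses the errors="ignore" option of drop; sample_df is a one-row illustrative frame.

```python
import pandas as pd

# One-row illustrative frame.
sample_df = pd.DataFrame({
    "Timestamp": ["2025-03-19 13:04:10"],
    "Source_IP": ["10.0.0.15"],
})

# errors="ignore" skips labels that are not present instead of raising.
sample_df = sample_df.drop(columns=["Timestamp"], errors="ignore")

# A second call is now a harmless no-op.
sample_df = sample_df.drop(columns=["Timestamp"], errors="ignore")

print(sample_df.columns.tolist())
```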