# Basic Network Traffic Statistics

### Load data

In [9]:
from utils import read_csv_files_in_directory

df = read_csv_files_in_directory("data", "cache", "network_analysis_data.pkl")

- **Compute the total number of network flows recorded**  
	Used `len(df)` to count the number of rows, as each row represents a network flow.

In [10]:
print("Total number of network flows:", len(df))

Total number of network flows: 2071657


- **Identify the top 5 most used protocols**  
	Applied `df['protocolName'].value_counts().head(5)` to get the most frequent protocols.

In [11]:
from utils import display_top_items

top_protocols = df['protocolName'].value_counts().head(5)
display_top_items("Top 5 most used protocols", top_protocols)

Top 5 most used protocols:
tcp_ip: 1644056
udp_ip: 419246
icmp_ip: 8211
igmp: 77
ip: 66
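
The notebook does not show the source of `utils.display_top_items`. Judging only from the printed output above, a helper of this kind could look roughly like the sketch below. The name `format_top_items` and its signature are hypothetical and not taken from the author's `utils` module.

```python
import pandas as pd


def format_top_items(title: str, counts: pd.Series) -> str:
    """Return a title line followed by one 'label: count' line per entry.

    Hypothetical stand-in for the notebook's display helper.
    """
    lines = [f"{title}:"]
    lines += [f"{label}: {count}" for label, count in counts.items()]
    return "\n".join(lines)


# Toy data, not the notebook's dataset.
toy_protocols = pd.Series(["tcp_ip", "udp_ip", "tcp_ip", "icmp_ip", "tcp_ip", "udp_ip"])
print(format_top_items("Top protocols (toy data)", toy_protocols.value_counts().head(5)))
```

Returning a string instead of printing directly keeps the formatting step easy to check on its own.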


- **Identify the top 10 most active source and destination IPs**  
	Used `.value_counts().head(10)` on `source` and `destination` columns separately.

In [12]:
top_sources = df['source'].value_counts().head(10)
display_top_items("Top 10 most active Source IPs", top_sources)

top_destinations = df['destination'].value_counts().head(10)
display_top_items("\nTop 10 most active Destination IPs", top_destinations)

Top 10 most active Source IPs:
192.168.5.122: 268267
192.168.2.107: 208379
192.168.4.118: 135374
192.168.1.101: 116292
192.168.4.121: 105454
192.168.1.105: 101359
192.168.2.109: 99183
192.168.3.116: 97241
192.168.2.110: 90658
192.168.3.115: 88915

Top 10 most active Destination IPs:
198.164.30.2: 232409
192.168.5.122: 199437
203.73.24.75: 193200
125.6.164.51: 106826
67.220.214.50: 49298
202.210.143.140: 36189
82.98.86.183: 25214
95.211.98.12: 25095
209.112.44.10: 21824
62.140.213.243: 20509


- **Compute the average packet size**  
	Calculated `total_bytes` by summing `totalSourceBytes` and `totalDestinationBytes`.  
	Calculated `total_packets` by summing `totalSourcePackets` and `totalDestinationPackets`.  
	Computed average packet size as `total_bytes / total_packets`.

In [13]:
total_bytes = df['totalSourceBytes'].sum() + df['totalDestinationBytes'].sum()
total_packets = df['totalSourcePackets'].sum(
) + df['totalDestinationPackets'].sum()
if total_packets > 0:
    avg_packet_size = total_bytes / total_packets
    print(f"\nAverage packet size: {avg_packet_size:.2f} bytes")
else:
    print("\nNo packets found to compute average.")


Average packet size: 736.92 bytes


- **Find the most common source-destination pair**  
	Concatenated `source` and `destination` into a single `src_dst_pair` label, then applied `.value_counts().head(1)` to get the top pair.

In [14]:
df['src_dst_pair'] = df['source'] + " → " + df['destination']
most_common_sd_pair = df['src_dst_pair'].value_counts().head(1)
display_top_items("Most common source-destination pair", most_common_sd_pair)

Most common source-destination pair:
192.168.5.122 → 198.164.30.2: 232409
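
The same question can also be answered by grouping on the two columns directly, which avoids adding a helper column to `df`. The sketch below uses a small toy frame, not the notebook's data; only the column names `source` and `destination` come from the notebook.

```python
import pandas as pd

# Toy flows, not the notebook's dataset.
toy = pd.DataFrame({
    "source": ["10.0.0.1", "10.0.0.1", "10.0.0.2", "10.0.0.1"],
    "destination": ["10.0.0.9", "10.0.0.9", "10.0.0.9", "10.0.0.8"],
})

# Count flows per (source, destination) pair and sort so the largest comes first.
pair_counts = (
    toy.groupby(["source", "destination"])
    .size()
    .sort_values(ascending=False)
)

top_src, top_dst = pair_counts.index[0]  # index entries are (source, destination) tuples
top_count = pair_counts.iloc[0]
print(f"{top_src} → {top_dst}: {top_count}")
```

On a frame the size of `df`, grouping on two columns also avoids building two million concatenated strings, though whether that matters in practice depends on the environment.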


- **Identify which IPs are consistently communicating over multiple time windows**  
	Converted `startDateTime` to hourly windows using `.dt.floor('h')`.  
	For each window, collected unique IPs from both `source` and `destination`.  
	Counted how many windows each IP appeared in.  
	Selected IPs that appeared in at least `int(0.8 * total_windows)` windows, i.e. roughly 80% of all windows (the `int()` truncation rounds the cutoff down).

In [16]:
import pandas as pd
from collections import Counter

df['startDateTime'] = pd.to_datetime(df['startDateTime'])
df['time_window'] = df['startDateTime'].dt.floor('h')
window_ip_sets = df.groupby('time_window')[['source', 'destination']].apply(
    lambda g: set(g['source']) | set(g['destination'])
)

ip_window_count = Counter()
for ip_set in window_ip_sets:
    for ip in ip_set:
        ip_window_count[ip] += 1
total_windows = len(window_ip_sets)

threshold = int(total_windows * 0.8)
consistent_ips = [ip for ip, count in ip_window_count.items()
                  if count >= threshold]

print(f"\nTotal time windows: {total_windows}")
print(f"Threshold: {threshold} windows (80%)")
print("Consistently communicating IPs:")
for ip in sorted(consistent_ips):
    print(f"{ip} (in {ip_window_count[ip]} windows)")


Total time windows: 146
Threshold: 116 windows (80%)
Consistently communicating IPs:
0.0.0.0 (in 142 windows)
131.202.240.218 (in 144 windows)
142.166.14.69 (in 124 windows)
142.166.14.70 (in 128 windows)
142.166.14.71 (in 137 windows)
142.166.14.72 (in 129 windows)
142.166.14.77 (in 128 windows)
142.166.14.78 (in 130 windows)
142.166.14.79 (in 135 windows)
142.166.14.80 (in 136 windows)
142.166.14.85 (in 136 windows)
142.166.14.86 (in 129 windows)
192.168.1.101 (in 145 windows)
192.168.1.102 (in 145 windows)
192.168.1.103 (in 145 windows)
192.168.1.104 (in 145 windows)
192.168.1.105 (in 145 windows)
192.168.1.255 (in 145 windows)
192.168.2.106 (in 145 windows)
192.168.2.107 (in 145 windows)
192.168.2.108 (in 145 windows)
192.168.2.109 (in 145 windows)
192.168.2.110 (in 145 windows)
192.168.2.111 (in 145 windows)
192.168.2.112 (in 145 windows)
192.168.2.113 (in 145 windows)
192.168.2.255 (in 145 windows)
192.168.3.114 (in 145 windows)
192.168.3.115 (in 145 windows)
192.168.3.116 (in 1
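
The per-window counting can also be sketched without an explicit loop, by reshaping the frame so that each row holds one (window, IP) pair. This is an alternative to the cell above, not the notebook's method, and it runs on toy data. It also uses `math.ceil` for the cutoff, which is an assumption about the intended meaning of "at least 80%": with 146 windows, `0.8 * 146 = 116.8`, which `int()` truncates to 116 and `math.ceil` rounds up to 117.

```python
import math

import pandas as pd

# Toy flows spanning three hourly windows; not the notebook's dataset.
toy = pd.DataFrame({
    "startDateTime": pd.to_datetime([
        "2010-06-13 10:05", "2010-06-13 10:40",
        "2010-06-13 11:10", "2010-06-13 12:20",
    ]),
    "source": ["10.0.0.1", "10.0.0.2", "10.0.0.1", "10.0.0.1"],
    "destination": ["10.0.0.9", "10.0.0.1", "10.0.0.9", "10.0.0.3"],
})
toy["time_window"] = toy["startDateTime"].dt.floor("h")

# One row per (window, IP), whether the IP appeared as source or destination.
long_form = toy.melt(
    id_vars="time_window",
    value_vars=["source", "destination"],
    value_name="ip",
)

# Count each IP at most once per window.
windows_per_ip = (
    long_form.drop_duplicates(["time_window", "ip"])["ip"].value_counts()
)

total_windows = toy["time_window"].nunique()
min_windows = math.ceil(0.8 * total_windows)  # assumption: round the cutoff up
consistent = sorted(windows_per_ip[windows_per_ip >= min_windows].index)
print(total_windows, min_windows, consistent)
```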

- **Detect irregular spikes in traffic volume over time**  
	Grouped data by hourly window and summed total traffic in MB (1 MB taken as 1,048,576 bytes).  
	Computed mean and standard deviation.  
	Flagged windows where volume exceeded `mean + 2 * std`.

In [17]:
df['startDateTime'] = pd.to_datetime(df['startDateTime'])
df['time_window'] = df['startDateTime'].dt.floor('h')
df['total_bytes'] = df['totalSourceBytes'] + \
    df['totalDestinationBytes']
df['total_mb'] = df['total_bytes'] / 1048576

traffic_by_window = df.groupby('time_window')['total_mb'].sum()
mean = traffic_by_window.mean()
std = traffic_by_window.std()
threshold = mean + 2 * std

spikes = traffic_by_window[traffic_by_window > threshold]

print(
    f"\nMean: {mean:.2f} MB, Std: {std:.2f} MB, Threshold: {threshold:.2f} MB")
print("Detected traffic spikes:")
for time, volume in spikes.items():
    print(f"{time} → {volume:.2f} MB")


Mean: 500.01 MB, Std: 732.33 MB, Threshold: 1964.68 MB
Detected traffic spikes:
2010-06-13 10:00:00 → 2247.59 MB
2010-06-15 16:00:00 → 7362.41 MB


- **Compute the variance of packet sizes and discuss potential reasons for high variance**  
	Computed average packet size per flow as `total_bytes / total_packets`.  
	Used `.var()` to calculate the variance.  
	High variance may be due to a mix of small control packets and large file transfers.
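
The notebook export shows no cell for this step, so the following is a minimal sketch of the described computation on toy data. The column names come from the earlier cells; the handling of flows with zero packets (dropping them) is an assumption, as the notebook does not say how they are treated.

```python
import pandas as pd

# Toy flows, not the notebook's dataset: two small control-like flows,
# one bulk transfer, and one empty flow.
toy = pd.DataFrame({
    "totalSourceBytes": [120, 64, 150000, 0],
    "totalDestinationBytes": [80, 0, 1350000, 0],
    "totalSourcePackets": [2, 1, 100, 0],
    "totalDestinationPackets": [2, 0, 900, 0],
})

flow_bytes = toy["totalSourceBytes"] + toy["totalDestinationBytes"]
flow_packets = toy["totalSourcePackets"] + toy["totalDestinationPackets"]

# Replace zero packet counts with NaN so those flows drop out
# instead of producing a division by zero (assumption: they are excluded).
avg_size = (flow_bytes / flow_packets.where(flow_packets > 0)).dropna()

# pandas' .var() is the sample variance (ddof=1) by default.
variance = avg_size.var()
print(f"Variance of per-flow average packet size: {variance:.2f} bytes^2")
```

In the toy frame the single bulk transfer (1500 bytes per packet) next to two small flows (50 and 64 bytes per packet) dominates the variance, which illustrates the mix of small control packets and large transfers mentioned above.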