# Objective: 

The aim is to analyze network traffic to detect any anomalies or suspicious activities.

## Data: 

CICIDS 2017 dataset, which includes a wide range of attacks and normal traffic. It's available here:

https://www.unb.ca/cic/datasets/ids-2017.html

## Procedure: 

Acquire from https://www.unb.ca/cic/datasets/ids-2017.html.

Preprocess

Exploratory data analysis

Classification machine learning algorithms:
- Logistic Regression
- Random Forest
- Gradient Boosting
- XGBoost


## Resources:

### Websites:

https://www.studytonight.com/network-programming-in-python/analyzing-network-traffic
https://plainenglish.io/blog/network-traffic-analysis-with-python-f95ed4e76c28

#### pcap files into DataFrames:

https://www.automox.com/blog/visualizing-network-data-using-python-part-1

https://www.automox.com/blog/visualizing-network-data-using-python-part-2

https://www.automox.com/blog/visualizing-network-data-using-python-part-3

#### Network Traffic Visualization (Geolocation):
https://medium.com/vinsloev-academy/python-cybersecurity-network-tracking-using-wireshark-and-google-maps-2adf3e497a93

#### Examples for malware traffic analysis:
https://www.malware-traffic-analysis.net/2021/index.html

#### Specific indicator of compromise:
https://cylab.be/blog/245/network-traffic-analysis-with-python-scapy-and-some-machine-learning

### YouTube:
https://www.youtube.com/watch?v=oA7QhYOhW_0
https://www.youtube.com/watch?v=xuNuy8n8u-Y

### LinkedIn Learning:
https://www.linkedin.com/learning/applied-ai-for-it-operations-aiops/network-traffic-analysis

### Books:
https://www.techtarget.com/searchnetworking/feature/Learn-how-to-master-network-traffic-analysis-with-Python

Research paper:
https://www.scitepress.org/papers/2018/66398/66398.pdf

### Current Tasks
- Practise reading pcap files.
- Figure out a way to capture pcap files from the notebook.
- Identify indicators of compromise.
- Parallel processing?

## Code:

In [1]:
#Imports
from scapy.all import *
from scapy.utils import PcapReader
import os, psutil, statistics, multiprocessing
import pandas as pd
import numpy as np
from prepare import *
from analize import *


In [2]:
#Preferences
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
pd.set_option('display.float_format', lambda x: f'{x:.3f}')

In [3]:
#Get traffic data in CSV format:
load_data = False
if load_data == True:
    mon0 = fixing_col_names(pd.read_csv("csv_files/Monday-WorkingHours.pcap_ISCX.csv"))
    tues0 = fixing_col_names(pd.read_csv("csv_files/Tuesday-WorkingHours.pcap_ISCX.csv"))
    wed0 = fixing_col_names(pd.read_csv("csv_files/Wednesday-WorkingHours.pcap_ISCX.csv"))
    thur0 = fixing_col_names(pd.read_csv("csv_files/Thursday-WorkingHours-Morning-WebAttacks.pcap_ISCX.csv"))
    thur1 = fixing_col_names(pd.read_csv("csv_files/Thursday-WorkingHours-Afternoon-Infilteration.pcap_ISCX.csv"))
    fri0 = fixing_col_names(pd.read_csv("csv_files/Friday-WorkingHours-Morning.pcap_ISCX.csv"))
    fri1 = fixing_col_names(pd.read_csv("csv_files/Friday-WorkingHours-Afternoon-DDos.pcap_ISCX.csv"))
    fri2 = fixing_col_names(pd.read_csv("csv_files/Friday-WorkingHours-Afternoon-PortScan.pcap_ISCX.csv"))

In [4]:
#Measuring size of pcap files:
for file in os.listdir("pcap_files"):
    print(f"Pcap file size: {os.path.getsize(f'pcap_files//{file}')/1_000_000_000:.3f} GB")

#System memory:
print(f"Available memory: {psutil.virtual_memory()[1]/1_000_000_000:.3f} GB")

Pcap file size: 8.839 GB
Pcap file size: 10.823 GB
Pcap file size: 0.047 GB
Pcap file size: 8.303 GB
Pcap file size: 11.048 GB
Pcap file size: 13.421 GB
Available memory: 23.910 GB


In [5]:
#`PcapReader()` iterates though the pcap entries. Creates generator object.
#`rdpcap()` loads all pcap entries in memory at once. Takes a very long time and can fill up memory.

In [6]:
#Setting global variables for data collection:
i = 0
dns_counts = {}
ip_counts = {}
seq_counts = {}

In [7]:
#Generator object for pcap file:
captured_packets = PcapReader("pcap_files/Friday-WorkingHours.pcap")

In [None]:
#Building functions
def get_ioc_counts(chunk):
    """
    Checks a chunk of packets for indicators of compromise.
    Returns a counts dictionary for each indicator (3 so far).
    """
    dns_counts = {}
    ip_counts = {}
    seq_counts = {}
    
    #Looking for indicators of compromise in pcap:
    for packet in captured_packets:
        if packet.haslayer(IP):
            #DNS replies that contain no answer (NXDOMAIN errors):
            if packet.haslayer(DNS) and (packet[DNS].qr == 1) and (packet[DNS].ancount == 0):
                dns = packet[IP].dst
                dns_counts[IP] = dns_counts.get(dns, 0) + 1
    
        #IP addresses that send a lot of packets:
            ip = packet[IP].src
            ip_counts[IP] = ip_counts.get(ip, 0) + 1
    
        #Repeated TCP sequence numbers:
        if packet.haslayer(TCP):
            seq = packet[TCP].seq
            seq_counts[seq] = seq_counts.get(seq, 0) + 1
    
    return dns_counts, ip_counts, seq_counts

In [None]:
#Building functions
def set_threshold(packet_counts, sigma_value = 3, default_threshold = 25, print_stats = False):
    """

    """
    #Setting the threshold at 99.7% to identify packets:
    if len(packet_counts) < 2:
        threshold = default_threshold
    else:
        mean = statistics.mean(packet_counts.values())
        stddev = statistics.stdev(packet_counts.values())
        threshold = mean + sigma_value * stdev
    if print_stats == True:
        print(f"The threshold is: {threshold}")
        
    return threshold

In [None]:
#Building functions
def eval_packets(threshold, packet_counts, print_stats = False):
    #Checking packets against thresholds:
    suspicious = []
    #DNS
    for sus_item, occurrences in dns_counts.items():
        if occurrences < dns_threshold:
            continue
        suspicious.append(sus_item)
    if print_stats == True:
        print(len(suspicious))
    return suspicious

In [None]:
#Looking for indicators of compromise in pcap:
for packet in captured_packets:
    if packet.haslayer(IP):
        #DNS replies that contain no answer (NXDOMAIN errors):
        if packet.haslayer(DNS) and (packet[DNS].qr == 1) and (packet[DNS].ancount == 0):
            dns = packet[IP].dst
            dns_counts[IP] = dns_counts.get(dns, 0) + 1

    #IP addresses that send a lot of packets:
        ip = packet[IP].src
        ip_counts[IP] = ip_counts.get(ip, 0) + 1

    #Repeated TCP sequence numbers:
    if packet.haslayer(TCP):
        seq = packet[TCP].seq
        seq_counts[seq] = seq_counts.get(seq, 0) + 1
    #Progress updates:
    i += 1
    if i % 1_000_000 == 0:
        print(f"{i} packets processed")

In [None]:
#Establishing a threshold at 99.7% or three standard deviations from the mean.
if len(dns_counts) < 2:
    dns_threshold = 25
else:
    dns_mean = statistics.mean(dns_counts.values())
    dns_stddev = statistics.stdev(dns_counts.values())
    dns_threshold = dns_mean + 3 * dns_stdev

print(f"The DNS threshold is: {dns_threshold}")

In [None]:
#Establish the threshold for IP addresses that send lots of packets:
if len(ip_counts) < 2:
    ip_threshold = 25
else:
    ip_mean = statistics.mean(ip_counts.values())
    ip_stddev = statistics.stdev(ip_counts.values())
    ip_threshold = ip_mean + 3 * ip_stdev

print(f"The IP threshold is: {ip_threshold}")

In [11]:
#Establish threshold for repeated TCP sequence numbers:
if len(seq_counts) < 2:
    seq_threshold = 25
else:
    seq_mean = statistics.mean(seq_counts.values())
    seq_stddev = statistics.stdev(seq_counts.values())
    seq_threshold = seq_mean + 4 * seq_stddev

print(f"The sequence threshold is: {seq_threshold:.3f}")

The sequence threshold is: 429.372


In [12]:
#Checking packets against thresholds:
dns_suspicious = []
ip_suspicious = []
seq_suspicious = []
for ip_addy, occurrences in dns_counts.items():
    if occurrences < dns_threshold:
        continue
    dns_suspicious.append(ip_addy)

for ip_addy, occurrences in ip_counts.items():
    if occurrences < ip_threshold:
        continue
    ip_suspicious.append(ip_addy)

for seq, occurrences in seq_counts.items():
    if occurrences < seq_threshold:
        continue
    seq_suspicious.append(seq)

In [13]:
print(len(dns_suspicious))
print("------------------------------")
print(len(ip_suspicious))
print("------------------------------")
print(len(seq_suspicious))

0
------------------------------
0
------------------------------
806


In [14]:
#Generators must be reinitialize to iterate through again:
captured_packets = PcapReader("pcap_files/Friday-WorkingHours.pcap")

In [15]:
sus_packets = []

In [16]:
j = 0
dns_suspicious = set(dns_suspicious)
ip_suspicious = set(ip_suspicious)
seq_suspicious = set(seq_suspicious)

for packet in captured_packets:
    #Add section to collect suspicious DNS packets
    #Add section to collect suspicious IP packets

    #Collect packets with suspicious TCP sequence numbers:
    if packet.haslayer(TCP) and (int(packet[TCP].seq) in seq_suspicious):
        sus_packets.append(packet)

    #Progress updates:
    j += 1
    if j % 1_000_000 == 0:
        print(f"{j} packets processed")

1000000 packets processed
2000000 packets processed
3000000 packets processed
4000000 packets processed
5000000 packets processed
6000000 packets processed
7000000 packets processed
8000000 packets processed
9000000 packets processed


In [17]:
sus_df = packets_to_dataframe(sus_packets)

In [18]:
sus_df.iloc[0]

Src_MAC     b8:ac:6f:1d:1f:6c
Dst_MAC     00:c1:b1:14:eb:31
Src_IP           192.168.10.9
Dst_IP        206.126.112.141
Src_Port                 1098
Dst_Port                  443
Name: 0, dtype: object

In [19]:
sus_df.Src_MAC.value_counts()

00:1e:4f:d4:ca:28    1291839
00:19:b9:0a:69:f1     158798
00:c1:b1:14:eb:31      83314
00:25:00:a8:c4:60      64121
b8:ac:6f:36:07:ee      23702
00:23:ae:9b:8a:bf      20009
b8:ac:6f:36:04:e3      18821
b8:ac:6f:1d:1f:6c      15053
b8:ac:6f:36:08:f5      13052
b8:ac:6f:36:0a:8b      11038
00:23:ae:9b:95:67       1784
00:23:ae:9b:ad:b3       1260
Name: Src_MAC, dtype: int64

In [20]:
sus_df.Dst_MAC.value_counts()

00:c1:b1:14:eb:31    1619477
00:19:b9:0a:69:f1      78756
00:1e:4f:d4:ca:28       1238
b8:ac:6f:36:08:f5       1208
b8:ac:6f:36:0a:8b        665
b8:ac:6f:36:07:ee        649
b8:ac:6f:1d:1f:6c        647
00:23:ae:9b:95:67         59
b8:ac:6f:36:04:e3         34
00:23:ae:9b:8a:bf         24
00:23:ae:9b:ad:b3         22
00:25:00:a8:c4:60         12
Name: Dst_MAC, dtype: int64

In [21]:
sus_df.Src_IP.value_counts()

192.168.10.15      1291839
192.168.10.50       158798
172.16.0.1           78756
192.168.10.25        64121
192.168.10.14        23702
192.168.10.16        20009
192.168.10.12        18821
192.168.10.9         15053
192.168.10.8         13052
192.168.10.5         11038
205.174.165.73        3908
192.168.10.17         1784
192.168.10.19         1260
200.61.221.226          38
174.37.200.231          36
198.105.254.11          32
198.105.244.11          29
52.2.143.27             26
200.144.6.216           24
52.8.72.161             19
52.52.96.7              18
210.140.236.12          17
210.140.236.11          14
52.209.203.172          14
52.18.141.29            14
162.255.119.249         14
116.213.76.136          13
210.140.236.10          13
121.40.44.206           13
66.235.121.175          12
117.25.130.180          12
176.53.21.90            12
185.4.227.115           12
185.189.12.242          12
125.6.190.6             12
54.208.60.220           12
218.248.240.47          12
1

In [22]:
sus_df.Dst_IP.value_counts()

13.107.4.50        1058466
8.253.104.126       181156
172.16.0.1          158798
192.168.10.50        78756
192.99.204.26        56290
162.125.4.6          31189
23.15.4.9            24437
91.189.88.162        16804
91.189.88.161        15473
23.15.4.17            7232
151.101.128.223       6015
209.58.138.15         5887
209.48.71.154         4147
52.84.26.193          3628
210.158.199.65        3346
178.239.182.230       3240
195.218.200.139       2899
52.219.0.28           2128
23.15.4.16            2039
192.0.77.2            1888
104.16.39.35          1558
205.185.208.154       1373
208.111.171.236       1321
104.16.181.67         1287
192.168.10.15         1238
23.203.70.225         1230
199.244.48.55         1230
206.126.112.141       1220
192.168.10.8          1208
52.84.145.177         1189
52.219.4.112          1116
202.247.51.62         1091
62.241.12.20          1058
47.89.68.225          1050
23.15.4.19            1033
23.15.4.10            1031
104.27.143.149        1002
5

In [23]:
#View traffic from csv files
#mon0.head()
#tues0.head()
#wed0.head()
#thur0.head()
#thur1.head()
#fri0.head()
#fri1.head()
#fri2.head()

In [24]:
#tues0[tues0["Label"] == "FTP-Patator"].describe().T

In [25]:
#mon0["Label"].value_counts()

In [26]:
#tues0["Label"].value_counts()

In [27]:
#wed0["Label"].value_counts()

In [28]:
#thur0["Label"].value_counts()

In [29]:
#thur1["Label"].value_counts()

In [30]:
#fri0["Label"].value_counts()

In [31]:
#fri1["Label"].value_counts()

In [32]:
#fri2["Label"].value_counts()