# Objective: 

The aim is to analyze network traffic to detect any anomalies or suspicious activities.

## Data: 

The data comes from the CICIDS 2017 dataset, which includes normal traffic and a wide range of attacks. It is available here:

https://www.unb.ca/cic/datasets/ids-2017.html

## Procedure: 

Acquire the data from https://www.unb.ca/cic/datasets/ids-2017.html.

Preprocess the data.

Perform exploratory data analysis.

Train machine learning classifiers:
- Logistic Regression
- Random Forest
- Gradient Boosting
- XGBoost


## Resources:

### Websites:

https://www.studytonight.com/network-programming-in-python/analyzing-network-traffic
https://plainenglish.io/blog/network-traffic-analysis-with-python-f95ed4e76c28

#### pcap files into DataFrames:

https://www.automox.com/blog/visualizing-network-data-using-python-part-1

https://www.automox.com/blog/visualizing-network-data-using-python-part-2

https://www.automox.com/blog/visualizing-network-data-using-python-part-3

#### Network Traffic Visualization (Geolocation):
https://medium.com/vinsloev-academy/python-cybersecurity-network-tracking-using-wireshark-and-google-maps-2adf3e497a93

#### Examples for malware traffic analysis:
https://www.malware-traffic-analysis.net/2021/index.html

#### Specific indicator of compromise:
https://cylab.be/blog/245/network-traffic-analysis-with-python-scapy-and-some-machine-learning

### YouTube:
https://www.youtube.com/watch?v=oA7QhYOhW_0
https://www.youtube.com/watch?v=xuNuy8n8u-Y

### LinkedIn Learning:
https://www.linkedin.com/learning/applied-ai-for-it-operations-aiops/network-traffic-analysis

### Books:
https://www.techtarget.com/searchnetworking/feature/Learn-how-to-master-network-traffic-analysis-with-Python

### Research paper:
https://www.scitepress.org/papers/2018/66398/66398.pdf

### Current Tasks
- Practise reading pcap files.
- Figure out a way to capture pcap files from the notebook.
- Identify indicators of compromise.
- Parallel processing?

## Code:

In [1]:
#Imports
from scapy.all import *
from scapy.utils import PcapReader
import os, psutil, statistics, multiprocessing
import pandas as pd
import numpy as np
from prepare import *
from analize import *


In [2]:
#Preferences
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
pd.set_option('display.float_format', lambda x: f'{x:.3f}')

In [3]:
#Get traffic data in CSV format:
load_data = False
if load_data:
    mon0 = fixing_col_names(pd.read_csv("csv_files/Monday-WorkingHours.pcap_ISCX.csv"))
    tues0 = fixing_col_names(pd.read_csv("csv_files/Tuesday-WorkingHours.pcap_ISCX.csv"))
    wed0 = fixing_col_names(pd.read_csv("csv_files/Wednesday-WorkingHours.pcap_ISCX.csv"))
    thur0 = fixing_col_names(pd.read_csv("csv_files/Thursday-WorkingHours-Morning-WebAttacks.pcap_ISCX.csv"))
    thur1 = fixing_col_names(pd.read_csv("csv_files/Thursday-WorkingHours-Afternoon-Infilteration.pcap_ISCX.csv"))
    fri0 = fixing_col_names(pd.read_csv("csv_files/Friday-WorkingHours-Morning.pcap_ISCX.csv"))
    fri1 = fixing_col_names(pd.read_csv("csv_files/Friday-WorkingHours-Afternoon-DDos.pcap_ISCX.csv"))
    fri2 = fixing_col_names(pd.read_csv("csv_files/Friday-WorkingHours-Afternoon-PortScan.pcap_ISCX.csv"))

In [4]:
#Measuring the size of each pcap file:
for file in os.listdir("pcap_files"):
    size_gb = os.path.getsize(os.path.join("pcap_files", file)) / 1_000_000_000
    print(f"Pcap file size: {size_gb:.3f} GB")

#Available system memory:
print(f"Available memory: {psutil.virtual_memory().available/1_000_000_000:.3f} GB")

Pcap file size: 8.839 GB
Pcap file size: 10.823 GB
Pcap file size: 0.047 GB
Pcap file size: 8.303 GB
Pcap file size: 11.048 GB
Pcap file size: 13.421 GB
Available memory: 26.744 GB


In [5]:
#`PcapReader()` iterates through the pcap entries one at a time. It returns an iterator, so packets are read lazily.
#`rdpcap()` loads all pcap entries into memory at once. For large files this takes a very long time and can fill up memory.

In [6]:
#Lazy packet iterator for the pcap file:
pcap_file = "pcap_files/test0_wifi_capture.pcap"
captured_packets = PcapReader(pcap_file)

In [7]:
#file_size is measured in bytes, while chunk_size is a number of packets, so the two are not directly comparable.
file_size = os.path.getsize(pcap_file)
chunk_size = 20_000_000
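Because `file_size` is in bytes and `chunk_size` is a packet count, one way to relate the two is through the average number of bytes per packet. The sketch below is only an illustration: `packets_per_chunk` is a hypothetical helper, and the packet count (67,036) is the figure this notebook prints for the test capture in a later cell. The average includes pcap record headers, so the result is a rough guide rather than an exact figure.

```python
def packets_per_chunk(file_size_bytes, packet_count, chunk_bytes):
    """Estimate how many packets fit in a chunk of roughly chunk_bytes bytes."""
    if packet_count <= 0 or file_size_bytes <= 0:
        raise ValueError("file size and packet count must be positive")
    # Integer form of chunk_bytes / (file_size_bytes / packet_count).
    return max(1, chunk_bytes * packet_count // file_size_bytes)

# Figures taken from this notebook's test capture.
estimate = packets_per_chunk(47_256_430, 67_036, 20_000_000)
print(f"About {estimate} packets per 20 MB chunk")
```

With a chunk size of 20,000,000 packets, the whole test capture fits in a single chunk, which is why only one chunk is produced later on.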

In [8]:
dns_counts, ip_counts, seq_counts = get_ioc_counts(captured_packets)

DNS counts: 1
IP counts: 1
SEQ counts: 12792


In [9]:
file_size

47256430

In [10]:
print(chunk_size)

20000000


In [11]:
pool = multiprocessing.Pool()

In [12]:
#Breaking data into chunks and applying function.
chunks = []
chunk = []
for packet in PcapReader(pcap_file):
    chunk.append(packet)
    if len(chunk) >= chunk_size:
        print(len(chunk))
        chunks.append(chunk)
        chunk = []
if chunk:
    chunks.append(chunk)
print(len(chunk))
print(len(chunks))
results = pool.map(get_ioc_counts, chunks)

67036
1
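
The cell above builds every chunk in a list before any work starts, so the whole capture is held in memory at once. A minimal sketch of an alternative, assuming a hypothetical helper named `iter_chunks`, yields one chunk at a time from any iterable. Plain integers stand in for packets here; a `PcapReader` could be passed in the same way.

```python
from itertools import islice

def iter_chunks(iterable, chunk_size):
    """Yield successive lists of up to chunk_size items from iterable."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    iterator = iter(iterable)
    while True:
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            return
        yield chunk

# Illustration with plain integers standing in for packets.
chunk_lengths = [len(chunk) for chunk in iter_chunks(range(10), 4)]
print(chunk_lengths)
```

The generator could be handed to `pool.imap()` in place of the prebuilt list. Whether that lowers peak memory in practice would need to be measured.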


In [13]:
pool.close()
pool.join()

In [15]:
"""
Each entry in `results` holds three items (dns_counts, ip_counts, seq_counts) for one chunk.
`pool.map()` returns one result per chunk, in input order, and does not merge them.
With a single chunk, results[0] covers the whole file.
With several chunks, the per-chunk dictionaries would need to be summed key by key, because the same key can appear in more than one chunk.
"""
dns_counts0 = results[0][0]
ip_counts0 = results[0][1]
seq_counts0 = results[0][2]
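
If the capture were split into more than one chunk, `results` would hold one set of counts per chunk. A minimal sketch of combining them follows, assuming each result is a `(dns_counts, ip_counts, seq_counts)` triple of plain dictionaries as above. `merge_counts` is a hypothetical helper, not part of the notebook's modules, and the example data is made up.

```python
from collections import Counter

def merge_counts(results):
    """Sum per-chunk (dns, ip, seq) count dictionaries key by key."""
    dns_total, ip_total, seq_total = Counter(), Counter(), Counter()
    for dns_counts, ip_counts, seq_counts in results:
        # Counter.update() adds to existing counts instead of overwriting them.
        dns_total.update(dns_counts)
        ip_total.update(ip_counts)
        seq_total.update(seq_counts)
    return dict(dns_total), dict(ip_total), dict(seq_total)

# Two made-up chunk results for illustration.
example_results = [
    ({"10.0.0.1": 2}, {"10.0.0.1": 5}, {1000: 1}),
    ({"10.0.0.1": 1}, {"10.0.0.2": 3}, {1000: 2, 2000: 1}),
]
merged_dns, merged_ip, merged_seq = merge_counts(example_results)
print(merged_dns, merged_ip, merged_seq)
```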

In [16]:
seq_counts0 == seq_counts

True

In [17]:
print(f"DNS counts: {len(dns_counts)}")
print(f"IP counts: {len(ip_counts)}")
print(f"SEQ counts: {len(seq_counts)}")
print(f"DNS counts0: {len(dns_counts0)}")
print(f"IP counts0: {len(ip_counts0)}")
print(f"SEQ counts0: {len(seq_counts0)}")

DNS counts: 1
IP counts: 1
SEQ counts: 12792
DNS counts0: 1
IP counts0: 1
SEQ counts0: 12792


In [None]:
dns_threshold = set_threshold(dns_counts)

In [None]:
ip_threshold = set_threshold(ip_counts)

In [None]:
seq_threshold = set_threshold(seq_counts, sigma_value=4)

In [None]:
sus_dns = eval_packets(dns_threshold, dns_counts)

In [None]:
sus_ip = eval_packets(ip_threshold, ip_counts)

In [None]:
sus_seq = eval_packets(seq_threshold, seq_counts)

In [None]:
#Looking for indicators of compromise in pcap:
i = 0
for packet in captured_packets:
    if packet.haslayer(IP):
        #DNS replies that contain no answer (such as NXDOMAIN errors):
        if packet.haslayer(DNS) and (packet[DNS].qr == 1) and (packet[DNS].ancount == 0):
            dns = packet[IP].dst
            dns_counts[dns] = dns_counts.get(dns, 0) + 1

        #IP addresses that send a lot of packets:
        ip = packet[IP].src
        ip_counts[ip] = ip_counts.get(ip, 0) + 1

    #Repeated TCP sequence numbers:
    if packet.haslayer(TCP):
        seq = packet[TCP].seq
        seq_counts[seq] = seq_counts.get(seq, 0) + 1

    #Progress updates:
    i += 1
    if i % 1_000_000 == 0:
        print(f"{i} packets processed")

In [None]:
#Set the threshold three standard deviations above the mean.
#If the counts were normally distributed, about 99.7% of them would fall within three standard deviations of the mean.
if len(dns_counts) < 2:
    dns_threshold = 25
else:
    dns_mean = statistics.mean(dns_counts.values())
    dns_stddev = statistics.stdev(dns_counts.values())
    dns_threshold = dns_mean + 3 * dns_stddev

print(f"The DNS threshold is: {dns_threshold}")
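
The 99.7% figure can be checked with the standard library. This is a small sketch that assumes a normal distribution; real per-address packet counts may be skewed, in which case the percentages are only a rough guide.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

# Share of values within three standard deviations of the mean.
within_three = standard_normal.cdf(3) - standard_normal.cdf(-3)

# Share of values above mean + 3 sigma, the side the threshold checks.
above_three = 1 - standard_normal.cdf(3)

print(f"Within 3 sigma: {within_three:.4f}")
print(f"Above mean + 3 sigma: {above_three:.5f}")
```

Since the thresholds here only flag counts above the mean, the relevant tail under this assumption is the upper one, which is roughly half of the remaining 0.3%.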

In [None]:
#Establish the threshold for IP addresses that send lots of packets:
if len(ip_counts) < 2:
    ip_threshold = 25
else:
    ip_mean = statistics.mean(ip_counts.values())
    ip_stddev = statistics.stdev(ip_counts.values())
    ip_threshold = ip_mean + 3 * ip_stddev

print(f"The IP threshold is: {ip_threshold}")

In [None]:
#Establish threshold for repeated TCP sequence numbers:
if len(seq_counts) < 2:
    seq_threshold = 25
else:
    seq_mean = statistics.mean(seq_counts.values())
    seq_stddev = statistics.stdev(seq_counts.values())
    seq_threshold = seq_mean + 4 * seq_stddev

print(f"The sequence threshold is: {seq_threshold:.3f}")
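
The three threshold cells above repeat the same calculation. A minimal sketch of a shared helper follows. It mirrors the shape of the `set_threshold(counts, sigma_value=...)` call used earlier, but the name `threshold_from_counts`, the `default` parameter, and the body are assumptions based on the inline cells, not the actual implementation in `analize`.

```python
import statistics

def threshold_from_counts(counts, sigma_value=3, default=25):
    """Return mean + sigma_value * sample standard deviation of the counts.

    Falls back to `default` when there are fewer than two counts, because
    statistics.stdev() needs at least two data points.
    """
    values = list(counts.values())
    if len(values) < 2:
        return default
    return statistics.mean(values) + sigma_value * statistics.stdev(values)

# Made-up counts for illustration.
example_counts = {"a": 2, "b": 4, "c": 6}
print(threshold_from_counts(example_counts))
print(threshold_from_counts(example_counts, sigma_value=4))
print(threshold_from_counts({"only": 10}))
```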

In [None]:
#Checking packets against thresholds:
dns_suspicious = []
ip_suspicious = []
seq_suspicious = []
for ip_addy, occurrences in dns_counts.items():
    if occurrences < dns_threshold:
        continue
    dns_suspicious.append(ip_addy)

for ip_addy, occurrences in ip_counts.items():
    if occurrences < ip_threshold:
        continue
    ip_suspicious.append(ip_addy)

for seq, occurrences in seq_counts.items():
    if occurrences < seq_threshold:
        continue
    seq_suspicious.append(seq)
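
The three loops above apply the same filter. A minimal sketch of a shared helper follows, assuming a hypothetical name `keys_at_or_above`; the notebook's own `eval_packets` may behave differently. It keeps every key whose count is at or above the threshold, which matches the `<` comparison used in the loops.

```python
def keys_at_or_above(threshold, counts):
    """Return the set of keys whose count is greater than or equal to threshold."""
    return {key for key, occurrences in counts.items() if occurrences >= threshold}

# Made-up counts for illustration.
example_counts = {"10.0.0.1": 40, "10.0.0.2": 3, "10.0.0.3": 25}
flagged = keys_at_or_above(25, example_counts)
print(sorted(flagged))
```

Returning a set keeps the later membership checks against each packet cheap.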

In [None]:
print(len(dns_suspicious))
print("------------------------------")
print(len(ip_suspicious))
print("------------------------------")
print(len(seq_suspicious))

In [None]:
#The reader must be reinitialized to iterate through the file again:
captured_packets = PcapReader("pcap_files/test0_wifi_capture.pcap")

In [None]:
sus_packets = []

In [None]:
j = 0
dns_suspicious = set(dns_suspicious)
ip_suspicious = set(ip_suspicious)
seq_suspicious = set(seq_suspicious)

for packet in captured_packets:
    #Add section to collect suspicious DNS packets
    #Add section to collect suspicious IP packets

    #Collect packets with suspicious TCP sequence numbers:
    if packet.haslayer(TCP) and (int(packet[TCP].seq) in seq_suspicious):
        sus_packets.append(packet)

    #Progress updates:
    j += 1
    if j % 1_000_000 == 0:
        print(f"{j} packets processed")

In [None]:
sus_df = packets_to_dataframe(sus_packets)

In [None]:
sus_df.iloc[0]

In [None]:
sus_df.Src_MAC.value_counts()

In [None]:
sus_df.Dst_MAC.value_counts()

In [None]:
sus_df.Src_IP.value_counts()

In [None]:
sus_df.Dst_IP.value_counts()

In [None]:
#View traffic from csv files
#mon0.head()
#tues0.head()
#wed0.head()
#thur0.head()
#thur1.head()
#fri0.head()
#fri1.head()
#fri2.head()

In [None]:
#tues0[tues0["Label"] == "FTP-Patator"].describe().T

In [None]:
#mon0["Label"].value_counts()

In [None]:
#tues0["Label"].value_counts()

In [None]:
#wed0["Label"].value_counts()

In [None]:
#thur0["Label"].value_counts()

In [None]:
#thur1["Label"].value_counts()

In [None]:
#fri0["Label"].value_counts()

In [None]:
#fri1["Label"].value_counts()

In [None]:
#fri2["Label"].value_counts()