In [None]:
from data_collection.parse_pcap import pcap_to_pandas
from utils import *
import pandas as pd
import numpy as np
from datetime import datetime, timezone
from sklearn.preprocessing import LabelEncoder
from math import log

# Probability

**Start downloading this file and place it in AI4ALL-IoT/example_pcaps: https://drive.google.com/file/d/1Lr1dleCbZcQWfHoW_u6Q2uZFte17Y2Z_/view?usp=sharing. You'll need it later!**

## Introduction

Today, we're going to explore probability. Probability is a powerful tool that lets us answer interesting questions about our data, and it serves as the foundation of a commonly used machine learning technique for classification. We'll also build a Naive Bayes classifier from scratch, so you'll get hands-on experience coding a machine learning classifier yourself!

Let's start with some simple probability examples on the board. Let's see how much you can recall from lecture!

Say I have a bucket with 10 blue balls and 20 red balls. If I choose a ball at random from the bucket, what is the probability that I choose a red ball? That is, we want to calculate:

$P($red ball$)\ =\ ??$

This is equal to the fraction of red balls over the total number of balls.

$P($red ball$)\ =\ \frac{\text{# of red balls}}{\text{# of total balls}}\ =\ \frac{20}{30}\ =\ \frac{2}{3}$

Similarly, the chance of picking a blue ball is:

$P($blue ball$)\ =\ \frac{\text{# of blue balls}}{\text{# of total balls}}\ =\ \frac{10}{30}\ =\ \frac{1}{3}$

Now, let's say we want to find the probability of picking a red ball out of the bucket, **AND THEN** picking a blue ball out of the bucket. When we want to find the probability of two events both occurring, we multiply their probabilities together. The resulting probability is:

$P(\text{red ball})*P(\text{blue ball}\ |\ \text{red ball missing})$

Here, we introduce the concept of conditional probability. $P(\text{blue ball}\ |\ \text{red ball missing})$ represents the probability that a blue ball is pulled from the bucket, **given** that a red ball has already been taken out.

Are these two events independent? Does pulling a red ball first affect the probability of pulling a blue ball next? If it had no effect, the overall probability would be equivalent to:

$P(\text{red ball})*P(\text{blue ball})$

But it's not! Once a red ball is removed, there are fewer balls left to choose from (29 instead of 30), which changes the probability of drawing a blue ball. The full probability is therefore calculated as:

$P(\text{red ball})*P(\text{blue ball}\ |\ \text{red ball missing})\ =\ \frac{20}{30}*\frac{10}{29}\ =\ \frac{20}{87}$
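
The arithmetic above can be checked with exact fractions. This is a small sketch using Python's standard `fractions` module; the variable names are chosen here for illustration only.

```python
from fractions import Fraction

red, blue = 20, 10
total = red + blue

# Probability of drawing a red ball first.
p_red = Fraction(red, total)

# Probability of drawing a blue ball once one red ball is gone (29 balls left).
p_blue_given_red_missing = Fraction(blue, total - 1)

# Red AND THEN blue, without putting the red ball back.
p_red_then_blue = p_red * p_blue_given_red_missing

# What we would get if the two draws were (wrongly) treated as independent.
p_if_independent = p_red * Fraction(blue, total)

print(p_red_then_blue)   # 20/87
print(p_if_independent)  # 2/9
```

The two results differ, which is another way of seeing that the draws are not independent.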

Now that you've had a chance to review, let's dive into the data.

## Exercises

### Probability of a TCP Packet

Let's compute the probability that a packet from our capture was a TCP packet:

$P(\text{TCP Packet})\ =\ \frac{\text{# of TCP packets}}{\text{# of total packets}}$

We'll start by loading some captured data into Python and selecting only the packets whose protocol is TCP. You'll need to fill in the blanks with the correct information. For tcp_packets, there are three options for each blank: "data", "protocol", or "TCP". Consult yesterday's lab if you need to!

In [None]:
data = ?? # call our helper "pcap_to_pandas" function, and pass in the argument "example_pcaps/tplink_switch.pcap"
tcp_packets = ??[??[??] == "??"] # packets with the protocol column equal to "TCP"

# len gives the number of packets in some data
num_tcp_packets = len(??) # number of TCP packets
num_total_packets = len(??) # number of total packets

tcp_probability = ?? / ?? # probability that a packet is a TCP packet

print(tcp_probability)

### Probability of a DNS Packet, Given Source Port or Dest Port is 53

Now, let's compute the probability that a packet from our capture was a DNS packet, given that at least one of its ports was 53. We define a DNS packet as a packet that has a DNS query **OR** a DNS response field. We are calculating:

$P($DNS Query $\cup$ DNS Response | Source Port == 53 $\cup$ Dst Port == 53$)$

The $\cup$ means "OR".

The probability can be calculated as:

$P(\text{DNS Query} \cup \text{DNS Response}\ |\ \text{Source Port == 53} \cup \text{Dst Port == 53})\ =\ \frac{\text{# of packets with a DNS query or DNS response field}}{\text{# of packets with a SRC port or DST port 53}}$

Because this is a conditional probability, we don't divide by the total number of packets. Instead, we divide only by the # of packets that satisfy the condition that the SRC or DST port is equal to 53.
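
To make the "divide by the condition" idea concrete, here is a minimal sketch on a made-up list of five packets. The packets and the `is_dns` flag values are invented for illustration and are not taken from the capture.

```python
# Toy packets (hypothetical values, not from tplink_switch.pcap).
packets = [
    {"port_src": 53, "port_dst": 4001, "is_dns": True},
    {"port_src": 4001, "port_dst": 53, "is_dns": True},
    {"port_src": 443, "port_dst": 5000, "is_dns": False},
    {"port_src": 5000, "port_dst": 443, "is_dns": False},
    {"port_src": 53, "port_dst": 4002, "is_dns": False},
]

# The condition: at least one port is 53.
on_port_53 = [p for p in packets if p["port_src"] == 53 or p["port_dst"] == 53]

# Among the packets that satisfy the condition, count the DNS ones.
dns_on_port_53 = [p for p in on_port_53 if p["is_dns"]]

# Conditional probability: divide by the size of the conditioned group.
p_dns_given_53 = len(dns_on_port_53) / len(on_port_53)

# Unconditional probability: divide by all packets.
p_dns = sum(p["is_dns"] for p in packets) / len(packets)

print(p_dns_given_53)  # 2 of the 3 port-53 packets are DNS
print(p_dns)           # 2 of all 5 packets are DNS
```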

You'll need to fill in the blanks with the correct information. For dns_queries and dns_responses, there are three options for each blank: "data", "dns_query", or "dns_resp". Consult yesterday's lab if you need to!

In [11]:
dns_queries = data[data["dns_query"].notnull()] # packets with a DNS query column that isn't None
dns_responses = data[data["dns_resp"].notnull()] # packets with a DNS response column that isn't None

src_port_53 = data[data["port_src"] == 53]
dst_port_53 = data[data["port_dst"] == 53]

num_dns_queries = len(dns_queries)
num_dns_responses = len(dns_responses)
num_dns_total = num_dns_queries + num_dns_responses

num_port_53 = len(src_port_53) + len(dst_port_53)

# Note: This is tricky! Consult the DNS columns of the data in this notebook and/or Wireshark if you are stuck.
dns_probability = num_dns_queries / num_port_53 # probability that a packet is a DNS packet, given that at least one port is 53

print(dns_probability) # Should be 1 (100%).

1.0


You should expect an answer of 100%. If you got over 100% instead, your probability is likely overcounting some packets!

Hint: Examine the "dns_query" and "dns_resp" columns of packets that contain a DNS query or response.
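
One way to see how overcounting happens: if some packets belong to both groups, adding the two group sizes counts those packets twice. The sketch below uses made-up packet indices (an assumption for illustration, not the real capture) and Python sets to compare the naive sum with the size of the union.

```python
# Hypothetical packet indices; packets 1 and 3 appear in both groups.
has_query = {0, 1, 2, 3}
has_response = {1, 3}

# Adding the sizes counts packets 1 and 3 twice.
naive_count = len(has_query) + len(has_response)

# The union counts every packet exactly once.
union_count = len(has_query | has_response)

# Inclusion-exclusion gives the same answer as the union.
inclusion_exclusion = len(has_query) + len(has_response) - len(has_query & has_response)

num_port_53 = 4  # hypothetical size of the conditioned group

print(naive_count / num_port_53)  # 1.5, over 100%
print(union_count / num_port_53)  # 1.0
```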

### Probability that a DNS Response is Longer than the Mean Packet Length

Now let's answer the following question about our packets: what is the probability that a given DNS response is longer than the average length of all packets?

$P($Length > Mean Length of **All** Packets | DNS Response$)$

In [None]:
mean_length = ??[??].mean() # the mean length of all packets
longer_than_mean = dns_responses[dns_responses["length"] > ??] # DNS responses with a length longer than mean_length

num_longer = len(longer_than_mean)
print(num_longer / num_dns_responses)

Exactly half! Let's open Wireshark and examine these packets.

## Additional Exercises

1. (Challenge!) Find the probability that a DNS request is immediately followed by a DNS response in the packet trace. This will give us an idea of how fast DNS responses are received, relative to other network traffic.

# Naive Bayes

Now we're going to use the Naive Bayes algorithm to predict which task a user is most likely doing given a particular packet. While there are existing Python functions for performing Naive Bayes classification, we already know everything we need to do it ourselves!

## Data Preprocessing

As in the previous lecture, we first need to label the data with the activity that was happening at the time each packet was received.

First, download the ross.pcap file at https://drive.google.com/file/d/1Lr1dleCbZcQWfHoW_u6Q2uZFte17Y2Z_/view?usp=sharing. Place it in the AI4ALL-IoT/example_pcaps folder.

In [14]:
# Load the data, may take a few minutes.
data = pcap_to_pandas("example_pcaps/ross.pcap")
labels = pd.read_csv('example_pcaps/ross_labels.txt', header=None, names=["time", "activity"])

Now, let's add a timestamp (remember, this is measured in seconds since the epoch) to each entry of the activity log, so that it can be compared with the packet times.

In [15]:
def convert_to_datetime(time):
    return datetime.strptime(time, '%Y-%m-%d %H:%M:%S.%f')
    
labels['datetime'] = labels['time'].apply(convert_to_datetime)

tzlocal = datetime.now().astimezone().tzinfo
labels['timestamp'] = labels['datetime'].apply(lambda dt: dt.replace(tzinfo=tzlocal).timestamp())
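
As a standalone illustration of the conversion, the sketch below turns one time string into seconds since the epoch. The helper name `to_epoch_seconds` is hypothetical, and it assumes UTC so that the result is the same on every machine; the cell above uses the local time zone instead.

```python
from datetime import datetime, timezone

def to_epoch_seconds(time_string, tz=timezone.utc):
    # Parse the string into a datetime with no time zone attached.
    naive = datetime.strptime(time_string, '%Y-%m-%d %H:%M:%S.%f')
    # Attach a time zone, then convert to seconds since 1970-01-01 00:00:00 UTC.
    return naive.replace(tzinfo=tz).timestamp()

# One day and half a second after the epoch.
print(to_epoch_seconds('1970-01-02 00:00:00.500000'))  # 86400.5
```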

Next, we're going to use the activity log to label the data set.

In [16]:
label_encoder = LabelEncoder()
labels['label'] = label_encoder.fit_transform(labels['activity'])

for index, row in labels.iterrows():
    data.loc[data['time'] >= row['timestamp'], 'label'] = row['label']
    
num_labels = max(labels['label']) # largest label index, so labels run from 0 to num_labels
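
The loop above relies on the activity log being sorted by time: each later activity overwrites the label of every packet that arrives after it starts. The same idea can be sketched without pandas using a binary search. The times and labels below are made up, and `label_for` is a hypothetical helper.

```python
from bisect import bisect_right

# Hypothetical activity log, sorted by start time (seconds since the epoch).
activity_starts = [100.0, 160.0, 220.0]
activity_labels = [0, 1, 2]

def label_for(packet_time):
    # Index of the most recent activity that started at or before packet_time.
    i = bisect_right(activity_starts, packet_time) - 1
    # Packets that arrive before the first activity have no label.
    return activity_labels[i] if i >= 0 else None

packet_times = [95.0, 100.0, 130.5, 160.0, 300.0]
packet_labels = [label_for(t) for t in packet_times]
print(packet_labels)  # [None, 0, 0, 1, 2]
```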

Finally, we're going to take roughly 20% of the data set at random and reserve it as test data.

In [20]:
msk = np.random.rand(len(data)) < 0.8
train = data[msk]
test = data[~msk]
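
Because the mask is random, the split is only approximately 80/20, and every row lands in exactly one of the two sets. Here is a minimal sketch of the same idea using only the standard library; the row numbers and the fixed seed are assumptions for illustration.

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable

rows = list(range(1000))  # stand-in for 1000 packets

# Each row is kept for training with probability 0.8.
mask = [random.random() < 0.8 for _ in rows]

train_rows = [r for r, keep in zip(rows, mask) if keep]
test_rows = [r for r, keep in zip(rows, mask) if not keep]

# The sizes are close to 800 and 200, but usually not exact.
print(len(train_rows), len(test_rows))
```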

## Classification

The simplest statistic we need to compute is the probability that each label occurs:

In [21]:
label_probs = np.zeros(num_labels + 1)

for i in range(num_labels + 1):
    label_probs[i] = len(train[train['label'] == i]) / len(train)
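
The same statistic can be sketched on a short list of made-up labels: count how often each label appears, then divide by the number of rows. The `train_labels` list is hypothetical.

```python
from collections import Counter

# Hypothetical labels for ten training packets.
train_labels = [0, 0, 1, 2, 2, 2, 1, 0, 2, 2]

# Count how many times each label occurs.
counts = Counter(train_labels)

# P(label) = (# of packets with that label) / (# of total packets)
priors = {label: n / len(train_labels) for label, n in counts.items()}

print(priors)
```

The probabilities add up to 1, since every packet has exactly one label.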

Next, we are going to go through the training set and tally up which values appear in different fields and how often they appear.

In [22]:
ip_table = {} # The table maps (label, value) tuples to the number of times they appear
ip_list = [] # The list stores every value we've seen (uniquely).

dns_table = {}
dns_list = []

port_table = {}
port_list = []

protocol_table = {}
protocol_list = []

def update_table(table, lst, label, value):
    if value is not None:
        table[(label, value)] = table.get((label, value), 0) + 1
    if value is not None and value not in lst:
        lst.append(value)

for index, row in train.iterrows():
    update_table(ip_table, ip_list, row['label'], row['ip_src'])
    update_table(ip_table, ip_list, row['label'], row['ip_dst'])
    update_table(port_table, port_list, row['label'], row['port_src'])
    update_table(port_table, port_list, row['label'], row['port_dst'])
    update_table(protocol_table, protocol_list, row['label'], row['protocol'])
    update_table(dns_table, dns_list, row['label'], row['dns_query'])
    update_table(dns_table, dns_list, row['label'], row['dns_resp'])

Now we use these tallies to compute the logarithm of the event probabilities. Typically, we prefer to work with log probabilities because many of these events have very small chances of happening. When multiplied together, the resulting joint probability often ends up inconveniently small for computers to work with. Taking logarithms does not change what we are trying to do conceptually, but it improves the numerical properties of the algorithm.
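
As a quick aside, the problem with multiplying many small probabilities can be seen directly. In this sketch, 100 made-up probabilities of 0.00001 each are multiplied together and, separately, their logarithms are added.

```python
from math import log

# 100 hypothetical events, each with probability 0.00001.
probs = [1e-5] * 100

# Multiplying them underflows: the true answer (1e-500) is too small for a float.
product = 1.0
for p in probs:
    product *= p

# Adding logarithms keeps the same information in a manageable number.
log_sum = sum(log(p) for p in probs)

print(product)  # 0.0
print(log_sum)  # about -1151.29
```

Comparing two log sums tells us which product would have been larger, even when both products underflow to zero.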

In [24]:
def compute_log_probs(table, lst, smoothing):
    log_probs = {}
    
    for l in range(num_labels + 1):
        total = sum([table.get((l, val), 0) for val in lst])
        
        for val in lst:
            if (l, val) in table:
                log_probs[(l, val)] = log(table[(l, val)] + smoothing) - log(total + smoothing * (len(lst) + 1))
        
        log_probs[(l, '<UNK>')] = log(smoothing) - log(total + smoothing * (len(lst) + 1))
    
    return log_probs

ip_log_prob = compute_log_probs(ip_table, ip_list, 1e-5)
dns_log_prob = compute_log_probs(dns_table, dns_list, 1e-5)
port_log_prob = compute_log_probs(port_table, port_list, 1e-5)
protocol_log_prob = compute_log_probs(protocol_table, protocol_list, 1e-5)
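
The smoothing term gives a small, nonzero probability to values that were never seen with a label, including the catch-all `'<UNK>'` entry. The sketch below applies the same formula to made-up counts for a single label and checks that the smoothed probabilities still add up to 1. The counts are hypothetical.

```python
from math import exp, log

# Hypothetical tallies for one label; "ARP" was never seen with this label.
counts = {"TCP": 8, "UDP": 2, "ARP": 0}
smoothing = 1e-5

vocab_size = len(counts)
total = sum(counts.values())

# One extra slot in the denominator is reserved for '<UNK>'.
denominator = total + smoothing * (vocab_size + 1)

log_probs = {v: log(c + smoothing) - log(denominator) for v, c in counts.items()}
log_probs["<UNK>"] = log(smoothing) - log(denominator)

# Converting back from logs, the probabilities sum to 1.
total_prob = sum(exp(lp) for lp in log_probs.values())
print(total_prob)
```

Note that a value with a count of zero ends up with the same log probability as `'<UNK>'`, which is why the function above can skip storing it.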

Finally, we are ready to create the classifier. When presented with a new row of data, we simply sum all the relevant log probabilities for each class and report the class with the highest log probability.

In [27]:
def get_log_prob(table, val, label):
    return table.get((label, val), table[(label, '<UNK>')])

def classify(row):
    best_label = -1
    best_label_score = float('-Inf')
    
    for l in range(num_labels + 1):
        score = log(label_probs[l])
        
        score = score + get_log_prob(ip_log_prob, row['ip_src'], l)
        score = score + get_log_prob(ip_log_prob, row['ip_dst'], l)
        
        if row['is_dns']:
            if row['dns_query'] is not None:
                score = score + get_log_prob(dns_log_prob, row['dns_query'], l)
            
            if row['dns_resp'] is not None:
                score = score + get_log_prob(dns_log_prob, row['dns_resp'], l)
        
        score = score + get_log_prob(port_log_prob, row['port_src'], l)
        score = score + get_log_prob(port_log_prob, row['port_dst'], l)
        
        score = score + get_log_prob(protocol_log_prob, row['protocol'], l)
        
        if score > best_label_score:
            best_label = l
            best_label_score = score
    
    return best_label

correct = 0

for index, row in test.iterrows():
    if classify(row) == row['label']:
        correct = correct + 1

print('Accuracy: {}'.format(correct / len(test)))

Accuracy: 0.5682810061628799
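
To see the scoring rule in isolation, here is a toy version with two made-up activities and a single feature. All probabilities and the name `classify_protocol` are hypothetical; the point is only that each label's score is its log prior plus the log probability of the observed value, and that the highest score wins.

```python
from math import log

# Hypothetical priors: P(label).
log_priors = {"browsing": log(0.6), "streaming": log(0.4)}

# Hypothetical likelihoods: P(protocol | label).
log_likelihoods = {
    ("browsing", "TCP"): log(0.9),
    ("browsing", "UDP"): log(0.1),
    ("streaming", "TCP"): log(0.3),
    ("streaming", "UDP"): log(0.7),
}

def classify_protocol(protocol):
    # Score each label: log prior + log likelihood of the observed value.
    scores = {
        label: log_priors[label] + log_likelihoods[(label, protocol)]
        for label in log_priors
    }
    # Report the label with the highest score.
    return max(scores, key=scores.get)

print(classify_protocol("TCP"))  # browsing
print(classify_protocol("UDP"))  # streaming
```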
