# Naïve Bayes Classification on Network Traffic


In [1]:
import pandas as pd
import numpy as np
from datetime import datetime, timezone
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Allow us to load helper modules from the ../lib directory
import sys
sys.path.append("../lib") 
from parse_pcap import pcap_to_pandas, send_rates

**Load the Packet Capture**

In [2]:
# Call our helper function "pcap_to_pandas" and pass in the path "../pcaps/tplink_switch.pcap"
pkts = pcap_to_pandas('../pcaps/tplink_switch.pcap') 

# only look at TCP and UDP packets
pkts = pkts[(pkts['protocol']=='TCP') | (pkts['protocol']=='UDP')]

**How many packets are there?**

In [3]:
# number of total packets
num_total_packets = len(pkts)
num_total_packets

174

### Prior Probability

In [4]:
# packets with the protocol column equal to "TCP"
tcp_packets = pkts.loc[pkts['protocol'] == 'TCP'] 
tcp_packets.head(5)

Unnamed: 0,datetime,dns_query,dns_resp,ip_dst,ip_dst_int,ip_src,ip_src_int,is_dns,length,mac_dst,mac_dst_int,mac_src,mac_src_int,port_dst,port_src,protocol,time,time_normed
18,2017-12-07 14:11:31,,,34.195.88.49,583227400.0,172.24.1.81,2887254000.0,False,74,b8:27:eb:2d:24:15,202481588839445,50:c7:bf:09:f3:4c,88818833814348,50443.0,55594.0,TCP,1512677491.7808,7.46535
19,2017-12-07 14:11:31,,,172.24.1.81,2887254000.0,34.195.88.49,583227400.0,False,74,50:c7:bf:09:f3:4c,88818833814348,b8:27:eb:2d:24:15,202481588839445,55594.0,50443.0,TCP,1512677491.790298,7.474848
20,2017-12-07 14:11:31,,,34.195.88.49,583227400.0,172.24.1.81,2887254000.0,False,66,b8:27:eb:2d:24:15,202481588839445,50:c7:bf:09:f3:4c,88818833814348,50443.0,55594.0,TCP,1512677491.79184,7.47639
23,2017-12-07 14:11:32,,,34.195.88.49,583227400.0,172.24.1.81,2887254000.0,False,281,b8:27:eb:2d:24:15,202481588839445,50:c7:bf:09:f3:4c,88818833814348,50443.0,55594.0,TCP,1512677492.208559,7.893109
24,2017-12-07 14:11:32,,,172.24.1.81,2887254000.0,34.195.88.49,583227400.0,False,66,50:c7:bf:09:f3:4c,88818833814348,b8:27:eb:2d:24:15,202481588839445,55594.0,50443.0,TCP,1512677492.216527,7.901077


**How many TCP packets are there?**

In [5]:
# number of TCP packets
num_tcp_packets = len(tcp_packets)

num_tcp_packets

142

**What is the prior probability of a TCP packet?**

In [6]:
# probability that a packet is a TCP packet
tcp_probability = num_tcp_packets / num_total_packets 

tcp_probability

0.8160919540229885

**What is the prior probability of a UDP packet?**

In [7]:
udp_packets = pkts.loc[pkts['protocol'] == 'UDP']
num_udp_packets = len(udp_packets) 

# probability that a packet is a UDP packet
udp_probability = num_udp_packets / num_total_packets 

udp_probability

0.1839080459770115
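
The two priors can also be tallied in a single pass. The sketch below is a minimal, standard-library illustration: the `protocols` list is a stand-in rebuilt from the counts reported above (142 TCP packets out of 174, which leaves 32 UDP), not the packet capture itself.

```python
from collections import Counter

# Stand-in for pkts['protocol'], rebuilt from the counts above
# (142 TCP out of 174 packets; the remaining 32 are UDP).
protocols = ["TCP"] * 142 + ["UDP"] * 32

counts = Counter(protocols)
total = sum(counts.values())

# Prior probability of each protocol = its share of all packets.
priors = {proto: n / total for proto, n in counts.items()}

print(priors["TCP"])
print(priors["UDP"])
print(sum(priors.values()))  # sanity check: the priors should sum to 1
```

With the real DataFrame, `pkts['protocol'].value_counts(normalize=True)` should give the same shares in one call.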

### Conditional Probability

#### DNS Packets Given Port 53

A Domain Name System (DNS) packet is:
* a DNS query **OR** 
* a DNS response. 

DNS traffic is typically sent or received on port 53. Let's compute the probability that a packet is a DNS packet, given that the source port or destination port is 53. 

We are calculating:

$P(\text{DNS query} \cup \text{DNS response} \mid \text{source port} = 53 \cup \text{destination port} = 53)$

The probability can be calculated as:

$P(\text{DNS query} \cup \text{DNS response} \mid \text{source port} = 53 \cup \text{destination port} = 53) = \frac{\text{number of port-53 packets with a DNS query or DNS response field}}{\text{number of packets with source or destination port 53}}$

Because this is a conditional probability, we do not divide by the total number of packets. We divide only by the number of packets that satisfy the condition that the source or destination port equals 53, and the numerator counts only packets that also satisfy that condition.
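
As a small illustration of dividing by the conditioning set rather than by all packets, here is a sketch on a hand-made list of packets. The helper `conditional_probability` and the toy records are assumptions made for this example; only the field names mirror the columns used in this notebook.

```python
def conditional_probability(packets, event, condition):
    """Estimate P(event | condition) by counting."""
    # Keep only the packets that satisfy the condition ...
    given = [p for p in packets if condition(p)]
    if not given:
        raise ValueError("no packet satisfies the condition")
    # ... and measure how many of those also satisfy the event.
    both = [p for p in given if event(p)]
    return len(both) / len(given)


# Toy records (hypothetical values), not taken from the capture.
toy_packets = [
    {"port_src": 32835, "port_dst": 53, "is_dns": True},
    {"port_src": 53, "port_dst": 32835, "is_dns": True},
    {"port_src": 55594, "port_dst": 50443, "is_dns": False},
    {"port_src": 50443, "port_dst": 55594, "is_dns": False},
]


def on_port_53(p):
    return p["port_src"] == 53 or p["port_dst"] == 53


def is_dns(p):
    return p["is_dns"]


# Conditional: 2 of the 2 port-53 packets are DNS.
p_dns_given_53 = conditional_probability(toy_packets, is_dns, on_port_53)
# Unconditional: 2 of all 4 packets are DNS.
p_dns = sum(is_dns(p) for p in toy_packets) / len(toy_packets)

print(p_dns_given_53)
print(p_dns)
```

The two printed values differ because the denominators differ: the first counts only port-53 packets, the second counts every packet.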


In [8]:
# packets with a DNS query column that isn't None
dns_queries = pkts.loc[pkts["dns_query"].notnull()] 
dns_queries = dns_queries.loc[dns_queries["port_dst"] == 53]
dns_queries.head(1)

Unnamed: 0,datetime,dns_query,dns_resp,ip_dst,ip_dst_int,ip_src,ip_src_int,is_dns,length,mac_dst,mac_dst_int,mac_src,mac_src_int,port_dst,port_src,protocol,time,time_normed
15,2017-12-07 14:11:31,b's1a.time.edu.cn.',,172.24.1.1,2887254000.0,172.24.1.81,2887254000.0,True,75,b8:27:eb:2d:24:15,202481588839445,50:c7:bf:09:f3:4c,88818833814348,53.0,32835.0,UDP,1512677491.532799,7.217349


In [9]:
# packets with a DNS response column that isn't None
dns_responses = pkts.loc[pkts["dns_resp"].notnull()] 
dns_responses = dns_responses.loc[dns_responses["port_src"] == 53]

dns_responses.head(1)

Unnamed: 0,datetime,dns_query,dns_resp,ip_dst,ip_dst_int,ip_src,ip_src_int,is_dns,length,mac_dst,mac_dst_int,mac_src,mac_src_int,port_dst,port_src,protocol,time,time_normed
17,2017-12-07 14:11:31,b'devs.tplinkcloud.com.',b'devs.tplinkcloud.com.',172.24.1.81,2887254000.0,172.24.1.1,2887254000.0,True,533,50:c7:bf:09:f3:4c,88818833814348,b8:27:eb:2d:24:15,202481588839445,43866.0,53.0,UDP,1512677491.775682,7.460232


We should expect one response for each query. Let's check that assumption.

In [10]:
num_dns_queries = len(dns_queries)
num_dns_responses = len(dns_responses)
num_dns_total = num_dns_queries + num_dns_responses

print(num_dns_queries)
print(num_dns_responses)

6
6


So, we have 6 queries and 6 responses.

Let's now see how many packets have either a source port of 53 or a destination port of 53.

In [11]:
port_53_pkts = pkts.loc[(pkts["port_src"] == 53) |
                            (pkts["port_dst"] == 53)]

port_53_pkts

Unnamed: 0,datetime,dns_query,dns_resp,ip_dst,ip_dst_int,ip_src,ip_src_int,is_dns,length,mac_dst,mac_dst_int,mac_src,mac_src_int,port_dst,port_src,protocol,time,time_normed
15,2017-12-07 14:11:31,b's1a.time.edu.cn.',,172.24.1.1,2887254000.0,172.24.1.81,2887254000.0,True,75,b8:27:eb:2d:24:15,202481588839445,50:c7:bf:09:f3:4c,88818833814348,53.0,32835.0,UDP,1512677491.532799,7.217349
16,2017-12-07 14:11:31,b'devs.tplinkcloud.com.',,172.24.1.1,2887254000.0,172.24.1.81,2887254000.0,True,80,b8:27:eb:2d:24:15,202481588839445,50:c7:bf:09:f3:4c,88818833814348,53.0,43866.0,UDP,1512677491.763646,7.448196
17,2017-12-07 14:11:31,b'devs.tplinkcloud.com.',b'devs.tplinkcloud.com.',172.24.1.81,2887254000.0,172.24.1.1,2887254000.0,True,533,50:c7:bf:09:f3:4c,88818833814348,b8:27:eb:2d:24:15,202481588839445,43866.0,53.0,UDP,1512677491.775682,7.460232
21,2017-12-07 14:11:31,b's1a.time.edu.cn.',b's1a.time.edu.cn.',172.24.1.81,2887254000.0,172.24.1.1,2887254000.0,True,121,50:c7:bf:09:f3:4c,88818833814348,b8:27:eb:2d:24:15,202481588839445,32835.0,53.0,UDP,1512677491.885528,7.570078
57,2017-12-07 14:11:47,b's1b.time.edu.cn.',,172.24.1.1,2887254000.0,172.24.1.81,2887254000.0,True,75,b8:27:eb:2d:24:15,202481588839445,50:c7:bf:09:f3:4c,88818833814348,53.0,39900.0,UDP,1512677507.922651,23.607201
58,2017-12-07 14:11:47,b's1b.time.edu.cn.',b's1b.time.edu.cn.',172.24.1.81,2887254000.0,172.24.1.1,2887254000.0,True,91,50:c7:bf:09:f3:4c,88818833814348,b8:27:eb:2d:24:15,202481588839445,39900.0,53.0,UDP,1512677507.952146,23.636696
69,2017-12-07 14:12:03,b'0.cn.pool.ntp.org.',,172.24.1.1,2887254000.0,172.24.1.81,2887254000.0,True,77,b8:27:eb:2d:24:15,202481588839445,50:c7:bf:09:f3:4c,88818833814348,53.0,55754.0,UDP,1512677523.990789,39.675339
70,2017-12-07 14:12:04,b'0.cn.pool.ntp.org.',b'0.cn.pool.ntp.org.',172.24.1.81,2887254000.0,172.24.1.1,2887254000.0,True,141,50:c7:bf:09:f3:4c,88818833814348,b8:27:eb:2d:24:15,202481588839445,55754.0,53.0,UDP,1512677524.028974,39.713524
118,2017-12-07 14:16:20,b'fr.pool.ntp.org.',,172.24.1.1,2887254000.0,172.24.1.81,2887254000.0,True,75,b8:27:eb:2d:24:15,202481588839445,50:c7:bf:09:f3:4c,88818833814348,53.0,34673.0,UDP,1512677780.09671,295.78126
119,2017-12-07 14:16:20,b'fr.pool.ntp.org.',b'fr.pool.ntp.org.',172.24.1.81,2887254000.0,172.24.1.1,2887254000.0,True,513,50:c7:bf:09:f3:4c,88818833814348,b8:27:eb:2d:24:15,202481588839445,34673.0,53.0,UDP,1512677780.105092,295.789642


So now let's compute the probability that a packet is a DNS packet, given that at least one port is 53.

<center><br>
$P(\mathrm{DNS} \mid \mathrm{port}=53)$
    </center>

In [12]:
# Of the port 53 packets, get the DNS packets.
dns_53_pkts = port_53_pkts[port_53_pkts['is_dns']==True]

# probability that a packet is a DNS packet, given that at least one port is 53
dns_probability = len(dns_53_pkts) / len(port_53_pkts)

print(dns_probability) 

1.0


You should expect an answer of 100%. If you got over 100% instead, your probability is likely overcounting some packets.
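
One way the overcount can arise: in the rows shown above, a DNS response carries both a `dns_query` and a `dns_resp` value, so adding "packets with a query field" to "packets with a response field" counts each response twice. The sketch below uses four hand-made records (hypothetical, but shaped like the rows above) to contrast the sum with the union.

```python
# Hand-made records: two queries and their two responses.
# Responses carry both fields, as in the rows printed above.
port_53 = [
    {"dns_query": "s1a.time.edu.cn.", "dns_resp": None},
    {"dns_query": "s1a.time.edu.cn.", "dns_resp": "s1a.time.edu.cn."},
    {"dns_query": "fr.pool.ntp.org.", "dns_resp": None},
    {"dns_query": "fr.pool.ntp.org.", "dns_resp": "fr.pool.ntp.org."},
]

with_query = sum(p["dns_query"] is not None for p in port_53)
with_resp = sum(p["dns_resp"] is not None for p in port_53)

# Adding the two counts double-counts every response.
overcounted = (with_query + with_resp) / len(port_53)

# Counting each packet once (query OR response) stays within [0, 1].
either = sum(
    p["dns_query"] is not None or p["dns_resp"] is not None for p in port_53
)
correct = either / len(port_53)

print(overcounted)  # 1.5
print(correct)      # 1.0
```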

### Probability that a DNS Response is Longer than the Mean Packet Length

What is the probability that a given DNS response has a length longer than the average length of all TCP and UDP packets?

$P(\text{length} > \text{mean length of all TCP and UDP packets} \mid \text{DNS response})$

In [13]:
dns = dns_53_pkts
dns_responses = dns[dns['dns_resp'].notna()]

# DNS responses with a length longer than the mean length of all TCP and UDP packets
mean_length = pkts['length'].mean()
long_resp = dns_responses[dns_responses['length'] > mean_length]

len(long_resp) / len(dns_responses)

0.5

### Temporal Relationships

Find the probability that a DNS request is immediately followed by a DNS response in the packet trace. 

This will give us an idea of how fast DNS responses are received, relative to other network traffic.

In [14]:
dns

Unnamed: 0,datetime,dns_query,dns_resp,ip_dst,ip_dst_int,ip_src,ip_src_int,is_dns,length,mac_dst,mac_dst_int,mac_src,mac_src_int,port_dst,port_src,protocol,time,time_normed
15,2017-12-07 14:11:31,b's1a.time.edu.cn.',,172.24.1.1,2887254000.0,172.24.1.81,2887254000.0,True,75,b8:27:eb:2d:24:15,202481588839445,50:c7:bf:09:f3:4c,88818833814348,53.0,32835.0,UDP,1512677491.532799,7.217349
16,2017-12-07 14:11:31,b'devs.tplinkcloud.com.',,172.24.1.1,2887254000.0,172.24.1.81,2887254000.0,True,80,b8:27:eb:2d:24:15,202481588839445,50:c7:bf:09:f3:4c,88818833814348,53.0,43866.0,UDP,1512677491.763646,7.448196
17,2017-12-07 14:11:31,b'devs.tplinkcloud.com.',b'devs.tplinkcloud.com.',172.24.1.81,2887254000.0,172.24.1.1,2887254000.0,True,533,50:c7:bf:09:f3:4c,88818833814348,b8:27:eb:2d:24:15,202481588839445,43866.0,53.0,UDP,1512677491.775682,7.460232
21,2017-12-07 14:11:31,b's1a.time.edu.cn.',b's1a.time.edu.cn.',172.24.1.81,2887254000.0,172.24.1.1,2887254000.0,True,121,50:c7:bf:09:f3:4c,88818833814348,b8:27:eb:2d:24:15,202481588839445,32835.0,53.0,UDP,1512677491.885528,7.570078
57,2017-12-07 14:11:47,b's1b.time.edu.cn.',,172.24.1.1,2887254000.0,172.24.1.81,2887254000.0,True,75,b8:27:eb:2d:24:15,202481588839445,50:c7:bf:09:f3:4c,88818833814348,53.0,39900.0,UDP,1512677507.922651,23.607201
58,2017-12-07 14:11:47,b's1b.time.edu.cn.',b's1b.time.edu.cn.',172.24.1.81,2887254000.0,172.24.1.1,2887254000.0,True,91,50:c7:bf:09:f3:4c,88818833814348,b8:27:eb:2d:24:15,202481588839445,39900.0,53.0,UDP,1512677507.952146,23.636696
69,2017-12-07 14:12:03,b'0.cn.pool.ntp.org.',,172.24.1.1,2887254000.0,172.24.1.81,2887254000.0,True,77,b8:27:eb:2d:24:15,202481588839445,50:c7:bf:09:f3:4c,88818833814348,53.0,55754.0,UDP,1512677523.990789,39.675339
70,2017-12-07 14:12:04,b'0.cn.pool.ntp.org.',b'0.cn.pool.ntp.org.',172.24.1.81,2887254000.0,172.24.1.1,2887254000.0,True,141,50:c7:bf:09:f3:4c,88818833814348,b8:27:eb:2d:24:15,202481588839445,55754.0,53.0,UDP,1512677524.028974,39.713524
118,2017-12-07 14:16:20,b'fr.pool.ntp.org.',,172.24.1.1,2887254000.0,172.24.1.81,2887254000.0,True,75,b8:27:eb:2d:24:15,202481588839445,50:c7:bf:09:f3:4c,88818833814348,53.0,34673.0,UDP,1512677780.09671,295.78126
119,2017-12-07 14:16:20,b'fr.pool.ntp.org.',b'fr.pool.ntp.org.',172.24.1.81,2887254000.0,172.24.1.1,2887254000.0,True,513,50:c7:bf:09:f3:4c,88818833814348,b8:27:eb:2d:24:15,202481588839445,34673.0,53.0,UDP,1512677780.105092,295.789642


In [15]:
df = dns.sort_values(by=['dns_query','datetime']).loc[:,['datetime','dns_query','dns_resp']]
# True for responses, False for queries (whose dns_resp field is empty)
df['dns_resp'] = df['dns_resp'].notna()
df

Unnamed: 0,datetime,dns_query,dns_resp
69,2017-12-07 14:12:03,b'0.cn.pool.ntp.org.',False
70,2017-12-07 14:12:04,b'0.cn.pool.ntp.org.',True
16,2017-12-07 14:11:31,b'devs.tplinkcloud.com.',False
17,2017-12-07 14:11:31,b'devs.tplinkcloud.com.',True
122,2017-12-07 14:16:20,b'devs.tplinkcloud.com.',False
123,2017-12-07 14:16:20,b'devs.tplinkcloud.com.',True
118,2017-12-07 14:16:20,b'fr.pool.ntp.org.',False
119,2017-12-07 14:16:20,b'fr.pool.ntp.org.',True
15,2017-12-07 14:11:31,b's1a.time.edu.cn.',False
21,2017-12-07 14:11:31,b's1a.time.edu.cn.',True


In [16]:
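immediate = 0
last_index = None
for (index, row) in df.iterrows():
    # If this is a DNS response, check whether the previous row in df
    # (normally its query) was the packet directly before it in the trace.
    if row['dns_resp'] and last_index is not None:
        lag = index - last_index
        if lag == 1:
            immediate = immediate + 1
    last_index = index

# Fraction of DNS responses that directly follow the preceding row
immediate / len(df[df['dns_resp']==True])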

0.8333333333333334

## Part 2: Naïve Bayes Classifier

Now we're going to use the naïve Bayes algorithm to predict which task a user is most likely doing, given a particular packet. While there are existing Python functions for performing a naïve Bayes classification, we already know everything we need to do it ourselves!

### Load the Packet Traces

We first need to label the data with what activity was happening at the time each packet is received.

First, download the activity.pcap file at https://drive.google.com/file/d/1Lr1dleCbZcQWfHoW_u6Q2uZFte17Y2Z_/view?usp=sharing. 

Place it in the `pcaps` folder, which is where the next cell loads it from.

In [17]:
# Load the data, may take a few minutes.
data = pcap_to_pandas("../pcaps/activity.pcap")
data.head(5)

Unnamed: 0,datetime,dns_query,dns_resp,ip_dst,ip_dst_int,ip_src,ip_src_int,is_dns,length,mac_dst,mac_dst_int,mac_src,mac_src_int,port_dst,port_src,protocol,time,time_normed
0,2018-07-30 14:51:40,,,255.255.255.255,4294967000.0,128.112.93.99,2154848000.0,False,184,ff:ff:ff:ff:ff:ff,281474976710655,0c:4d:e9:b0:8e:4b,13528772677195,17500.0,17500.0,UDP,1532980300.670566,0.0
1,2018-07-30 14:51:40,,,128.112.93.255,2154848000.0,128.112.93.99,2154848000.0,False,184,ff:ff:ff:ff:ff:ff,281474976710655,0c:4d:e9:b0:8e:4b,13528772677195,17500.0,17500.0,UDP,1532980300.670856,0.00029
2,2018-07-30 14:51:41,,,162.222.44.11,2732469000.0,128.112.92.150,2154847000.0,False,82,04:09:73:5f:c9:00,4438636873984,a8:60:b6:01:d0:a9,185133323899049,4282.0,56524.0,TCP,1532980301.370868,0.700302
3,2018-07-30 14:51:41,,,162.222.44.11,2732469000.0,128.112.92.150,2154847000.0,False,1514,04:09:73:5f:c9:00,4438636873984,a8:60:b6:01:d0:a9,185133323899049,4282.0,56524.0,TCP,1532980301.370965,0.700399
4,2018-07-30 14:51:41,,,162.222.44.11,2732469000.0,128.112.92.150,2154847000.0,False,1514,04:09:73:5f:c9:00,4438636873984,a8:60:b6:01:d0:a9,185133323899049,4282.0,56524.0,TCP,1532980301.370966,0.7004


In [18]:
pkts = data
pkts.shape

(189085, 18)

### Load Labels

Now we will load the labels associated with the traffic trace above, giving us activity labels associated with different parts of the timeseries.

In [19]:
labels = pd.read_csv('../pcaps/activity_labels.txt', header=None, names=["time", "activity"])
labels.head(5)

Unnamed: 0,time,activity
0,2018-07-30 14:51:41.327734,WEB
1,2018-07-30 14:54:12.815653,AUDIO
2,2018-07-30 14:56:09.083618,VIDEO
3,2018-07-30 14:58:24.929799,WEB
4,2018-07-30 14:58:33.808876,GAMING


### Prepare the Dataset

1. Add a UNIX timestamp (remember, this is measured in seconds since the epoch) to the data set.
2. Use a **label encoder** to assign an integer label for each activity in the dataset.

**Note:** If you receive an error here about missing timezone information you may need to (re)-install it on your machine. <br />
On Linux, this is done with `apt-get install --reinstall tzdata`.

/usr/lib/python3/dist-packages/dateutil/zoneinfo/__init__.py:34: UserWarning: I/O error(2): No such file or directory
  warnings.warn("I/O error({0}): {1}".format(e.errno, e.strerror))
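
For step 1, a standard-library sketch of the conversion is shown below. It attaches an explicit, fixed UTC offset to each label time instead of relying on the machine's local time zone. The helper name `to_unix_timestamp` is made up for this example, and the UTC-4 offset in the example call is an assumption; if the label times were recorded in a different zone, the offset needs to change accordingly.

```python
from datetime import datetime, timedelta, timezone


def to_unix_timestamp(time_string, utc_offset_hours):
    """Parse a label time and return seconds since the epoch."""
    naive = datetime.strptime(time_string, "%Y-%m-%d %H:%M:%S.%f")
    # Attach an explicit fixed offset so the result does not depend
    # on the time zone of the machine running the code.
    zone = timezone(timedelta(hours=utc_offset_hours))
    return naive.replace(tzinfo=zone).timestamp()


# First label time from the labels file, read as UTC-4 (assumed offset).
ts = to_unix_timestamp("2018-07-30 14:51:41.327734", -4)
print(ts)

# Round trip: the same instant expressed in UTC is four hours later.
as_utc = datetime.fromtimestamp(ts, tz=timezone.utc)
print(as_utc)
```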

In [20]:
from dateutil import tz

def convert_to_datetime(time):
    return datetime.strptime(time, '%Y-%m-%d %H:%M:%S.%f')

# Convert the label times to UNIX timestamps.
# Note: this uses the local time zone of the machine running the notebook,
# not a fixed GMT-0400 offset.
labels['datetime'] = labels['time'].apply(convert_to_datetime)
tzlocal = datetime.now().astimezone().tzinfo
labels['timestamp'] = labels['datetime'].apply(lambda dt: dt.replace(tzinfo=tzlocal).timestamp())

# Use the label encoder to add an integer label to each entry.
label_encoder = LabelEncoder()
labels['label'] = label_encoder.fit_transform(labels['activity'])
labels



Unnamed: 0,time,activity,datetime,timestamp,label
0,2018-07-30 14:51:41.327734,WEB,2018-07-30 14:51:41.327734,1532984000.0,4
1,2018-07-30 14:54:12.815653,AUDIO,2018-07-30 14:54:12.815653,1532984000.0,0
2,2018-07-30 14:56:09.083618,VIDEO,2018-07-30 14:56:09.083618,1532984000.0,3
3,2018-07-30 14:58:24.929799,WEB,2018-07-30 14:58:24.929799,1532984000.0,4
4,2018-07-30 14:58:33.808876,GAMING,2018-07-30 14:58:33.808876,1532984000.0,1
5,2018-07-30 15:00:20.571626,INACTIVE,2018-07-30 15:00:20.571626,1532984000.0,2


In [21]:
label_encoder.classes_

array([' AUDIO', ' GAMING', ' INACTIVE', ' VIDEO', ' WEB'], dtype=object)
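
The integer codes follow the sorted order of the distinct activity names, which is why AUDIO maps to 0 and WEB to 4. A standard-library sketch of that mapping follows; it illustrates the idea rather than scikit-learn's implementation, and it strips the leading space visible in `classes_` above as an optional cleanup.

```python
# Activity names as they appear in classes_ above (note the leading space),
# listed in the order of the six rows of the labels table.
activities = [" WEB", " AUDIO", " VIDEO", " WEB", " GAMING", " INACTIVE"]

# Optional cleanup: drop surrounding whitespace before encoding.
cleaned = [a.strip() for a in activities]

# Sorted distinct names define the classes; the position is the label.
classes = sorted(set(cleaned))
to_label = {name: i for i, name in enumerate(classes)}

encoded = [to_label[a] for a in cleaned]
print(classes)  # ['AUDIO', 'GAMING', 'INACTIVE', 'VIDEO', 'WEB']
print(encoded)  # [4, 0, 3, 4, 1, 2]
```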

In [22]:
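# Walk through the labels in chronological order. Each pass overwrites the
# label of every packet at or after that start time, so a packet ends up
# with the most recent activity that started before it.
for index, row in labels.sort_values(by=['time']).iterrows():
    pkts.loc[pkts['datetime'] >= row['time'], 'label'] = row['label']
    pkts.loc[pkts['datetime'] >= row['time'], 'activity'] = row['activity']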

pkts = pkts.loc[:,['datetime','length','ip_src','ip_dst','port_src','port_dst',
                   'protocol','dns_query','dns_resp','activity','label']]
pkts

Unnamed: 0,datetime,length,ip_src,ip_dst,port_src,port_dst,protocol,dns_query,dns_resp,activity,label
0,2018-07-30 14:51:40,184,128.112.93.99,255.255.255.255,17500.0,17500.0,UDP,,,,
1,2018-07-30 14:51:40,184,128.112.93.99,128.112.93.255,17500.0,17500.0,UDP,,,,
2,2018-07-30 14:51:41,82,128.112.92.150,162.222.44.11,56524.0,4282.0,TCP,,,,
3,2018-07-30 14:51:41,1514,128.112.92.150,162.222.44.11,56524.0,4282.0,TCP,,,,
4,2018-07-30 14:51:41,1514,128.112.92.150,162.222.44.11,56524.0,4282.0,TCP,,,,
...,...,...,...,...,...,...,...,...,...,...,...
189080,2018-07-30 15:01:39,66,162.222.44.11,128.112.92.150,4282.0,56524.0,TCP,,,INACTIVE,2.0
189081,2018-07-30 15:01:39,66,162.222.44.11,128.112.92.150,4282.0,56524.0,TCP,,,INACTIVE,2.0
189082,2018-07-30 15:01:39,66,162.222.44.11,128.112.92.150,4282.0,56524.0,TCP,,,INACTIVE,2.0
189083,2018-07-30 15:01:40,346,128.112.92.1,128.112.93.255,520.0,520.0,UDP,,,INACTIVE,2.0
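
The labeling rule used here gives each packet the activity whose start time is the latest one not after the packet's time. The same rule can be sketched with a binary search over the sorted start times. The function name `activity_at` is an assumption for this example; the start times are the six label times shown earlier.

```python
from bisect import bisect_right
from datetime import datetime

# Activity start times from the labels table, in chronological order.
starts = [
    (datetime(2018, 7, 30, 14, 51, 41, 327734), "WEB"),
    (datetime(2018, 7, 30, 14, 54, 12, 815653), "AUDIO"),
    (datetime(2018, 7, 30, 14, 56, 9, 83618), "VIDEO"),
    (datetime(2018, 7, 30, 14, 58, 24, 929799), "WEB"),
    (datetime(2018, 7, 30, 14, 58, 33, 808876), "GAMING"),
    (datetime(2018, 7, 30, 15, 0, 20, 571626), "INACTIVE"),
]
start_times = [t for t, _ in starts]


def activity_at(packet_time):
    """Return the activity in progress at packet_time, or None."""
    # Number of start times that are <= packet_time.
    i = bisect_right(start_times, packet_time)
    if i == 0:
        return None  # packet arrived before the first label
    return starts[i - 1][1]


print(activity_at(datetime(2018, 7, 30, 14, 51, 40)))  # None
print(activity_at(datetime(2018, 7, 30, 14, 55, 0)))   # AUDIO
print(activity_at(datetime(2018, 7, 30, 15, 1, 39)))   # INACTIVE
```

This also accounts for the empty `activity` and `label` cells in the first rows of the table above: those packets carry times earlier than the first label.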


## Create Training and Test Sets

Let's take 20% of the data set and reserve it as test data.

In [23]:
from sklearn.model_selection import train_test_split

X = pkts.loc[:,:'dns_resp']
y = pkts.loc[:,'label']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)
X_test.shape

(37817, 9)

## Naïve Bayes Classification by Hand

The simplest statistic we need to compute is the probability that each label occurs.

### Prior Probabilities for Each Activity

We first need to compute our prior probabilities, $p(y)$, for the target variable, which is an activity, for each possible value of $y$ (i.e., each activity).

In [24]:
total = pkts['label'].count()

# Number of packets per activity label (0 through 4), then each label's share.
act_counts = [pkts[pkts['label'] == i]['label'].count() for i in range(0,5)]
act_prior = [count / total for count in act_counts]
p_y = act_prior

p_y

[0.12586631960977263,
 0.17594091567998815,
 0.12117893533949148,
 0.27830153741971664,
 0.2987122919510311]

### Feature Likelihood for Each Class

Now we compute the feature likelihood for each class. A typical way to do this is with parameter estimation, by assuming a distribution for the features (e.g., Gaussian, multinomial). Here we'll start with something much simpler: we'll take the likelihood $P(x_1, x_2 \mid y)$ directly from the empirical frequencies in the dataset.

In other words, the likelihood of a given length and protocol, given an activity, is simply the number of observations of that (length, protocol) pair during the activity, divided by the total number of packets of that activity.

In [25]:
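likelihood = {}

# Count how often each (length, protocol, activity) tuple occurs in the training set.
for index, row in X_train.iterrows():
    a = y_train[index]
    if pd.isna(a):
        # Packets captured before the first label have no activity; skip them.
        continue
    a = int(a)  # labels are stored as floats, but list indices must be integers
    (l,p) = row.loc[['length','protocol']]
    likelihood[(l,p,a)] = likelihood.get((l,p,a), 0) + 1

# Divide each count by the number of packets with that activity.
# Note: act_counts was computed over the full data set, so these are scaled-down estimates.
for (l,p,a) in likelihood:
    likelihood[(l,p,a)] = likelihood[(l,p,a)] / act_counts[a]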

### Predictions

Given a length and a protocol, we can predict the activity. This is probably not a very good classifier, given the limited number of features we're using, but we'll demonstrate for the sake of example.

We will now maximize the likelihood of the class, given the observation.
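
Written out, the decision rule is the standard one for Bayes classification, with $x_1$ the packet length and $x_2$ the protocol as used in this section:

$$
\hat{y} = \arg\max_{y} P(y \mid x_1, x_2) = \arg\max_{y} \frac{P(x_1, x_2 \mid y)\, P(y)}{P(x_1, x_2)} = \arg\max_{y} P(x_1, x_2 \mid y)\, P(y)
$$

The denominator $P(x_1, x_2)$ does not depend on $y$, so it can be dropped when comparing activities. Only the product of the likelihood and the prior needs to be compared.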

Suppose we observe a TCP packet of length 87. Comparing the two products below shows whether audio (label 0) or web (label 4) is the more likely activity.

Here is an example, showing:

$p(\textrm{length=87, protocol=TCP} | \textrm{activity = AUDIO}) \cdot p(\textrm{activity = AUDIO})$ and <br />
$p(\textrm{length=87, protocol=TCP} | \textrm{activity = WEB}) \cdot p(\textrm{activity = WEB}) $

In [None]:
# Label 0 is AUDIO and label 4 is WEB (see label_encoder.classes_).
print(likelihood[(87,'TCP',0)] * p_y[0])
print(likelihood[(87,'TCP',4)] * p_y[4])

Therefore, given a packet with protocol TCP and length 87, the classifier would say that the packet is most likely an audio packet.
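
The two-way comparison can be extended to all five activities. The following sketch is one way to do that. The `predict` helper is an assumption of this example, the numbers in `toy_likelihood` and `toy_prior` are invented for illustration (they are not results from the capture), and unseen tuples are given a likelihood of zero, which is a simplification.

```python
def predict(length, protocol, likelihood, prior):
    """Return the label maximizing likelihood * prior, plus all scores."""
    scores = {}
    for label, p_label in enumerate(prior):
        # Tuples never seen in training get likelihood 0 (a simplification).
        scores[label] = likelihood.get((length, protocol, label), 0.0) * p_label
    best = max(scores, key=scores.get)
    return best, scores


# Invented numbers for illustration only.
toy_likelihood = {
    (87, "TCP", 0): 0.020,  # AUDIO
    (87, "TCP", 4): 0.005,  # WEB
}
# Order: AUDIO, GAMING, INACTIVE, VIDEO, WEB
toy_prior = [0.13, 0.18, 0.12, 0.28, 0.29]

label, scores = predict(87, "TCP", toy_likelihood, toy_prior)
print(label)   # 0, i.e. AUDIO under these invented numbers
print(scores)
```

If every score is zero, `max` returns the first label, so a real classifier would need some handling for tuples that never appeared in training.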

## Naïve Bayes with Scikit-Learn

Now we will perform the same computation with Python's sklearn library. We'll use the ComplementNB class, which is an adaptation of the multinomial Naïve Bayes classifier that deals better with imbalanced datasets. The technique is described in more detail in this paper.

This classifier uses packet length and destination port as features.

### Training the Classifier

In [None]:
# Import the Naïve Bayes Classifiers. 
# (We'll only use Complement for now.)

from sklearn.naive_bayes import ComplementNB
nb = ComplementNB() 

# Drop every row that contains a NaN in any column.
# Note: This makes the resulting dataset significantly smaller because many
# packets have NaN values in columns such as dns_query and dns_resp.
pkts = pkts.dropna()

X = pkts.loc[:,:'dns_resp']
y = pkts.loc[:,'label']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)
X_train

features = X_train.loc[:,['length','port_dst']].values
target = y_train.values

nb.fit(features, target)

nb.predict([[394,14756]])[0]

X

### Evaluating the Accuracy of the Classifier

This classifier needs a lot of work! Only 9% accuracy.

In [None]:
from sklearn.metrics import accuracy_score

test = X_test.loc[:,['length','port_dst']]
y_hat = nb.predict(test.values)

# Compare the predicted values (y_hat) to the true labels in the test set (y_test).
accuracy_score(y_test,y_hat)
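
For reference, `accuracy_score` reports the fraction of test samples whose predicted label equals the true label. A minimal standard-library sketch of that calculation follows; the `accuracy` helper and the labels are made up for illustration.

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction matches the truth."""
    if len(y_true) != len(y_pred):
        raise ValueError("inputs must have the same length")
    matches = sum(t == p for t, p in zip(y_true, y_pred))
    return matches / len(y_true)


# Made-up labels for illustration (0=AUDIO ... 4=WEB).
y_true = [4, 4, 0, 3, 1, 2, 4, 3, 0, 1]
y_pred = [4, 0, 0, 3, 3, 2, 4, 4, 0, 1]

score = accuracy(y_true, y_pred)
print(score)  # 7 of 10 match -> 0.7
```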

---