# Passive Network Measurement

Network operators look at different types of network traffic data to understand properties of their networks. Some network data can be collected directly from network devices (e.g., routers, switches) while they are forwarding live traffic. Because collecting this data does not affect network behavior, this kind of measurement is called "passive" (as opposed to "active" measurement).

In this assignment, you will analyze traffic volumes - a type of passive network measurement data.

This notebook has several parts. Each part contains sections marked with TODO that you need to complete. 

**Put your name and netID in the cell below:**

**Name:**

**NetId:**

## Background

### Traffic Measurement with IPFIX

Routers in most networks collect traffic measurements using the [IPFIX protocol](https://en.wikipedia.org/wiki/IP_Flow_Information_Export). [NetFlow](https://en.wikipedia.org/wiki/NetFlow), the Cisco-proprietary protocol on which IPFIX is based, is well known in the networking community because Cisco supplies routers for many large networks.

In this assignment, you'll analyze a trace of NetFlow records captured from a router that connects Princeton University's campus network to the Internet. You will perform the same kinds of analysis that a network operator would perform, asking questions about the most popular endpoints for the campus traffic, the most popular web applications, and so forth. (As you can imagine, when we start to think about security, the ability to analyze these baselines will come in handy!)

The flow records are in the file `netflow.csv`. To simplify the analysis, we have ensured that the IP addresses of the campus network start with `128.112` and have their 16 lower bits anonymized to protect the privacy of users. To further simplify your task, we have parsed these records into CSV (comma-separated values) format, with the names of the fields listed in the first row of the file. (In a real network, routers export IPFIX records in a binary format.)

### Functional data analysis with map() and reduce()

Several of the data analysis steps in this assignment use a "MapReduce" programming model. MapReduce originated in functional programming languages and involves using two functions (called `map()` and `reduce()`...surprise!) to apply functions to iterable data (linked lists, arrays, etc.).

##### map()

A general `map()` function has two arguments: another function (which itself takes one argument) and an iterable object. `map()` then applies (maps) the argument function to every item in the iterable object. See [the documentation](https://docs.python.org/3.6/library/functions.html#map) of Python's built-in `map()` function for more details. The following toy example uses `map()` to add 3 to every element of a list:

In [6]:
some_numbers = [1,2,3]
three_more = map(lambda x: x+3, some_numbers)
print(list(three_more))

[4, 5, 6]


`map()` is often used with an anonymous function (the `lambda` in the above example), but can be used just as easily with normal functions:

In [7]:
def add3(i):
    return i+3
 
some_numbers = [1,2,3]
three_more = map(add3, some_numbers)
print(list(three_more))

[4, 5, 6]


Note that real implementations of `map()` allow the mapped function to take more than one argument or include more information in a closure, but this won't be necessary for this assignment.
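As a brief illustration of the note above, the following sketch shows both variations with Python's built-in `map()`: a two-argument function mapped over two lists, and a closure that remembers an offset. The names `add`, `make_adder`, and `add10` are hypothetical and exist only for this example.

```python
def add(a, b):
    # A mapped function with two arguments: one item from each iterable
    return a + b

def make_adder(offset):
    # Returns a one-argument function that "closes over" offset
    def adder(i):
        return i + offset
    return adder

first = [1, 2, 3]
second = [10, 20, 30]

# map() accepts several iterables and passes one item from each per call
pairwise_sums = list(map(add, first, second))
print(pairwise_sums)   # [11, 22, 33]

# The closure carries the extra information (offset=10) into map()
add10 = make_adder(10)
shifted = list(map(add10, first))
print(shifted)         # [11, 12, 13]
```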

##### reduce()

A general `reduce()` function takes another function (which itself takes *two* values), an iterable object, and an optional initializer value. It applies the function of two arguments cumulatively to the items of the iterable object, from left to right, so as to reduce the sequence to a single value. This allows `reduce()` to compute summaries over all data in the iterable object. See [the documentation](https://docs.python.org/3.6/library/functools.html#functools.reduce) of the `reduce()` function in Python's `functools` module for more details. The following example uses `reduce()` to count the number of 4s in a list of integers:

In [9]:
from functools import reduce

def count_4s(count, i):
    # The order of the arguments matters. 
    #     The first argument is the accumulated value
    #     The second argument is the next value from the iterable
    if i == 4:
        return count + 1
    else:
        return count

some_numbers = [1,4,0,1,4]
num_fours = reduce(count_4s, some_numbers, 0) # 0 is the initializer value 
print(num_fours)

2


Again, a reduction function can include more information in a closure, but this won't be necessary for this assignment.

[MapReduce](https://en.wikipedia.org/wiki/MapReduce) is popular because it allows analysis tasks on large data sets to be easily parallelized. Although there are many open-source and proprietary MapReduce-style data processing libraries (typically with different ways of expressing iterable datasets and distributing tasks over many computers), they all involve `map()` and `reduce()` functions like the ones you will use in this assignment.
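To make the combination concrete, here is a minimal single-process sketch that chains the two steps: `map()` transforms each item independently and `reduce()` folds the results into one value. It is a toy on made-up data, not a distributed MapReduce framework. The variable names are chosen only for this example.

```python
from functools import reduce

words = ["passive", "network", "measurement"]

# Map step: transform every item independently (here, word -> length)
lengths = map(len, words)

# Reduce step: combine the mapped values into a single summary
total_length = reduce(lambda total, n: total + n, lengths, 0)
print(total_length)   # 25
```

Because each call in the map step depends only on its own item, a real framework could in principle run those calls on different machines before combining the results.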

### Parse IPFIX Data
The `netflow.csv` file contains some pre-processed NetFlow data. The data is "unsampled," i.e., it compiles flow statistics for every packet that traverses any interface on the border router. We used the `nfdump` tool to process the raw NetFlow data that the router collected. Each row of the `netflow.csv` file, except for the header on top, logs the following information for a flow:

```
Date first seen, Time first seen (m:s), Date last seen, Time last seen (m:s), Duration (s), Protocol, 
Src IP addr, Src port, Dst IP addr, Dst port, Packets, Bytes, Flags, Input interface, Output interface		

```

To analyze this data, we first need to read it into a Python data structure. The following code uses the built-in `csv` library to read `netflow.csv` into a list of dictionaries, one per flow record, keyed by the field names in the header row. The `csv` library documentation is [here](https://docs.python.org/3.6/library/csv.html).

In [12]:
import csv

with open('netflow.csv', 'r') as netflow_file:
    netflow_reader = csv.DictReader(netflow_file)
    netflow_data = list(netflow_reader)
    
print("Number of flow records: {}".format(len(netflow_data)))
print("Sample flow record: {}".format(netflow_data[0]))

Number of flow records: 105360
Sample flow record: OrderedDict([('Date first seen', '10/29/15'), ('Time first seen (m:s)', '04:48.9'), ('Date last seen', '10/29/15'), ('Time last seen (m:s)', '04:48.9'), ('Duration (s)', '0'), ('Protocol', 'ICMP'), ('Src IP addr', '172.16.241.1'), ('Src port', '0'), ('Dst IP addr', '128.112.213.189'), ('Dst port', '11'), ('Packets', '1'), ('Bytes', '94'), ('Flags', '.A....'), ('Input interface', '120'), ('Output interface', '0')])
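If the structure of these records is unclear, the following sketch may help. It runs `csv.DictReader` on a tiny in-memory CSV, so it does not touch `netflow.csv`. The three columns and their values are made up for illustration and are only a subset of the real header.

```python
import csv
import io

# A tiny, made-up CSV with the same header-then-rows layout
sample_csv = io.StringIO(
    "Src IP addr,Dst port,Bytes\n"
    "10.0.0.1,80,1500\n"
    "10.0.0.2,22,94\n"
)

sample_rows = list(csv.DictReader(sample_csv))
first_row = sample_rows[0]

print(len(sample_rows))            # one dict per data row: 2
print(list(first_row.keys()))      # keys come from the header row
print(first_row["Bytes"])          # 1500
print(type(first_row["Bytes"]))    # <class 'str'>: csv does not convert numbers
```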


### Analyze IPFIX Data

The following sections each focus on answering a specific question using the NetFlow data you have prepared. These questions are of interest to real network operators, and the answers might reveal some surprising facts about how the campus community uses the Internet.

#### What are the most popular IP addresses accessed by the users?

To answer this question, we have to decide how to measure IP address popularity. Total traffic volume across all flows seems like a reasonable option, but so does total number of flows to an IP address regardless of volume. Network operators actually use both metrics (among others), and we will use both here as well.

*Step 1: Determine popular IP addresses by number of flows*

Complete the following code to produce a Python dictionary `ips_by_flows` with counts of the total number of flows to each external (not 128.112.\*.\*) IP address in `netflow_data`. The keys of the dict should be IP addresses and the values should be integer flow counts.

First complete the `count_by_flows()` function, which should take an existing dict of the form described above and update it appropriately from `current_flow`.  If you are confused about datatypes, use print statements to inspect variables. 

You may want to create an additional helper function to test if an IP address starts with `128.112`.
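One way such a helper could look is sketched below, in two variants: a plain string-prefix test and a test using the standard `ipaddress` module. The function names are hypothetical, and the second variant assumes that the `128.112` prefix corresponds to the network `128.112.0.0/16`. The two test addresses are taken from the sample flow record shown earlier.

```python
import ipaddress

# Assumption: the campus prefix 128.112 is the network 128.112.0.0/16
CAMPUS_NETWORK = ipaddress.ip_network("128.112.0.0/16")

def is_campus_by_prefix(ip):
    # String test; the trailing dot keeps the match on whole octets
    return ip.startswith("128.112.")

def is_campus_by_network(ip):
    # Parses the address and tests membership in the campus network
    return ipaddress.ip_address(ip) in CAMPUS_NETWORK

for ip in ["128.112.213.189", "172.16.241.1"]:
    print(ip, is_campus_by_prefix(ip), is_campus_by_network(ip))
```

The string test is simpler and is enough for this data set. The `ipaddress` variant raises a `ValueError` on strings that are not valid addresses, which may or may not be what you want.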

You will then need to use the `reduce()` function to build a dictionary result.  As a hint, the initializer argument to `reduce()` should be `defaultdict(lambda: 0)`. The [defaultdict()](https://docs.python.org/3.6/library/collections.html#collections.defaultdict) function creates a dictionary with default values that are the output of the argument function (in this case, just 0). This allows you to increment the value of a particular key without first checking to see if the key is already in the dictionary (if you used `{}` to create the dict instead of `defaultdict()`, this would raise a KeyError).
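The sketch below shows the same pattern on a toy list that is unrelated to the flow data, so you can see how `reduce()` and `defaultdict()` fit together. The names `count_fruit` and `basket` are made up for this example. Note that the reducing function returns the dictionary, because `reduce()` passes the return value along as the accumulator for the next item.

```python
from collections import defaultdict
from functools import reduce

def count_fruit(counts, fruit):
    # No "if fruit in counts" check is needed: missing keys start at 0
    counts[fruit] += 1
    # Return the accumulator so reduce() can pass it to the next call
    return counts

basket = ["apple", "pear", "apple", "fig", "apple"]
fruit_counts = reduce(count_fruit, basket, defaultdict(lambda: 0))
print(dict(fruit_counts))   # {'apple': 3, 'pear': 1, 'fig': 1}

# With a plain dict, the same increment on a missing key fails
plain = {}
try:
    plain["apple"] += 1
except KeyError as err:
    print("KeyError for missing key:", err)
```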

The provided code will print and plot the most popular IPs. The `check_ips_by_flows()` function will compare the md5 hash of the top 15 most popular IPs in your answer against the md5 hash of the correct answer. It will print a message letting you know whether you are correct or need to keep debugging.

In [15]:
%matplotlib inline
from collections import defaultdict
from plotting import plot_flows
from testing import check_ips_by_flows

# TODO: complete count_by_flows function
def count_by_flows(counts, current_flow):
    # counts is the dict of results accumulated so far
    # current_flow is the flow record (a dict) being processed
    pass
      

# TODO: use reduce() function to apply count_by_flows to netflow_data and assign the result to ips_by_flows


# print the top 5 IP addresses by number of flows 
sorted_ips_by_flows = sorted(ips_by_flows.items(), reverse=True, key=lambda x: x[1])
print("Most popular IP addresses by number of flows: {}\n".format(sorted_ips_by_flows[0:5]))

# check the results
check_ips_by_flows(sorted_ips_by_flows[0:15])

# plot the results
plot_flows(sorted_ips_by_flows)

NameError: name 'ips_by_flows' is not defined

*Step 2: Determine popular IP addresses by total volume*

Complete the following code to produce a dict `ips_by_volume` with counts of the total number of bytes to each external (non-campus) IP address.  The keys of the dict should be IP addresses and the values should be integer byte counts. Remember that the values in `netflow_data` are strings. You will need to convert them to ints or floats to do arithmetic.
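The short sketch below shows why the conversion matters. It uses a made-up record with the same string-valued fields as `netflow_data`: adding the raw values concatenates text, while adding the converted values does arithmetic.

```python
# Made-up record; as in netflow_data, every value is a string
sample_flow = {"Bytes": "94", "Packets": "1"}

as_text = sample_flow["Bytes"] + sample_flow["Bytes"]
as_number = int(sample_flow["Bytes"]) + int(sample_flow["Bytes"])

print(as_text)     # 9494 (string concatenation)
print(as_number)   # 188  (integer addition)
```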

In [17]:
%matplotlib inline
from plotting import plot_volumes
from testing import check_ips_by_volume

# TODO: complete count_by_volume function
def count_by_volume(counts, current_flow):
    # counts is the dict of results accumulated so far
    # current_flow is the flow record (a dict) being processed
    pass

        
# TODO: use reduce() function to apply count_by_volume to netflow_data and assign the result to ips_by_volume


# print the top 5 IP addresses by volume
sorted_ips_by_volume = sorted(ips_by_volume.items(), reverse=True, key=lambda x: x[1])
print("Most popular IP addresses by volume: {}\n".format(sorted_ips_by_volume[0:5]))

# check the results
check_ips_by_volume(sorted_ips_by_volume[0:15])

# plot the results
plot_volumes(sorted_ips_by_volume)

NameError: name 'ips_by_volume' is not defined

#### What are the most popular applications (by protocol) among the network users?

What application protocols do you think are the most common on the network? Web traffic (HTTP & SSL)? Secure remote connection (SSH)? Email (SMTP, IMAP, POP3)? In practice, a network operator may want to identify popular applications to make provisioning plans or to change network configurations so that traffic from different applications is treated differently (e.g., routed on different links). This could prevent high-volume applications (e.g., video streaming) from degrading the performance of critical low-volume applications.

You can answer these questions by finding the most popular ports of traffic flows in the NetFlow data. Many application protocols use well-known fixed ports for their traffic. For example, HTTP traffic uses port 80, HTTPS (HTTP over SSL/TLS) traffic uses port 443, SSH traffic uses port 22, and SMTP traffic uses port 25.
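When you interpret your results later, a small lookup table can turn port numbers into application names. The sketch below contains only the four port/application pairs named above. The names `WELL_KNOWN_APPS` and `describe_port` are hypothetical, and you would extend the table yourself for other ports.

```python
# Only the port/application pairs named in the text above
WELL_KNOWN_APPS = {
    80: "HTTP",
    443: "HTTPS (HTTP over SSL/TLS)",
    22: "SSH",
    25: "SMTP",
}

def describe_port(port):
    # Ports are strings in netflow_data, so convert before the lookup
    return WELL_KNOWN_APPS.get(int(port), "unknown")

for port in ["80", "22", "6000"]:
    print(port, describe_port(port))
```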

Again we will use both number of flows and total traffic volume as metrics for port "popularity".

Complete the following code to create `ports_by_flows` and `ports_by_volume` dicts from the netflow data.  Use the same strategy as you did above to create `ips_by_flows` and `ips_by_volume`. 

Include all destination ports and source ports **lower than 1024** (ports lower than 1024 are "well-known" and easily mapped to applications). 

In [18]:
%matplotlib inline
from plotting import plot_ports
from testing import check_ports_by_flows, check_ports_by_volume

# TODO: create ports_by_flows and ports_by_volume dicts from netflow_data


# Print the most popular ports and check the results
sorted_ports_by_flows = sorted(ports_by_flows.items(), reverse=True, key=lambda x: x[1])
sorted_ports_by_volume = sorted(ports_by_volume.items(), reverse=True, key=lambda x: x[1])
print("Most popular ports by number of flows: {}".format(sorted_ports_by_flows[0:5]))
print("Most popular ports by volume: {}\n".format(sorted_ports_by_volume[0:5]))
check_ports_by_flows(sorted_ports_by_flows[0:15])
check_ports_by_volume(sorted_ports_by_volume[0:15])

# plot the results 
plot_ports(sorted_ports_by_flows, sorted_ports_by_volume)

NameError: name 'ports_by_flows' is not defined

### Questions

Answer the following questions about the results of the above analysis. 

#### Q1. 
What are the 5 most popular external (non-Princeton) IP addresses by number of flows and by traffic volume?

#### A1.
*TODO: Your answer here.*


#### Q2. 
Use the "whois" command from your Vagrant terminal to learn what you can about these IP addresses (e.g. `whois 169.54.233.126`).  Choose 2 addresses from your answer to Q1 and write up what you learned, as well as why you think they were among the most popular. 

#### A2. 
*TODO: Your answer here.*


#### Q3. 
What are the 5 most popular ports by number of flows and by traffic volume? What applications are associated with them? There is a Wikipedia page of fixed ports here: https://en.wikipedia.org/wiki/List_of_TCP_and_UDP_port_numbers. You can also search online to find fixed port/application mappings. Are you surprised by which applications are the most popular?

#### A3.
*TODO: Your answer here.*


#### Q4. 
Why do you think that telnet (port 23) makes up so much of the traffic through Princeton's network?

#### A4. 
*TODO: Your answer here.*


#### Q5. 
The provided NetFlow data was captured over a 5-minute period from approximately 6:05am to 6:10am. How much do you think the capture time affected the resulting most popular applications? What changes would you expect to see if the data had been captured during a different 5-minute window (of your choice)?

#### A5.
*TODO: Your answer here.*


## Submission

**Remember to "Save and Checkpoint" (from the "File" menu above) before you leave the notebook or close your tab.**