# HOMEWORK 2 (Data fetching) - Network Measurement and Data Analysis Lab

*Stefano Maxenti, 10526141, 970133*

## Data acquisition
<a id='data_acquisition'></a>
Of the 20 websites, I collect data from only 18 using curl. I do not fetch anything from washingtonpost.com or rt.com: the former rejects curl connections (probably by inspecting the user agent), while for the latter curl only reaches an anti-DDoS service page (probably a consequence of cyberwarfare) and is not redirected correctly, so it would not provide useful insight for the project.

To obtain traffic traces, I set up a very small Docker container ("hw2") based on Ubuntu on a VPS located in the Netherlands, where I installed **tcpdump** and **curl** and wrote a small script. The "-L" flag makes curl follow redirects, while "-4" forces IPv4.

Using a Docker container reduces the amount of noise traffic because no other applications are running; in addition, I force curl to use a specific range of local source ports (2000-2100) and capture only traffic on those ports.
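To make the port-based filtering concrete, here is a minimal Python sketch of the same rule applied to already-parsed packets. It assumes that tcpdump's `portrange` matches a packet when either its source or its destination port falls in the range; the function name `in_capture_range` and the sample packets are hypothetical and only serve as an illustration.

```python
# Local ports reserved for curl: 2000-2100 inclusive.
CURL_PORT_RANGE = range(2000, 2101)


def in_capture_range(src_port, dst_port, ports=CURL_PORT_RANGE):
    """Return True if either endpoint of the packet uses a port in the range."""
    return src_port in ports or dst_port in ports


# Hypothetical packets, using the column names exported later by tshark.
packets = [
    {"tcp.srcport": 2000, "tcp.dstport": 443},   # request sent by curl
    {"tcp.srcport": 443, "tcp.dstport": 2100},   # reply from the server
    {"tcp.srcport": 51234, "tcp.dstport": 443},  # unrelated traffic
]

kept = [p for p in packets if in_capture_range(p["tcp.srcport"], p["tcp.dstport"])]
print(len(kept), "of", len(packets), "packets kept")
```

Both directions of a curl connection are kept, because the reply carries the reserved port as its destination.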

In [14]:
import os

In [2]:
!cat scripts/fetching.sh

#!/bin/bash

list=( "https://www.indiatimes.com" "https://www.ndtv.com" "https://www.cnbc.com" "https://www.timesofindia.com" "https://www.express.co.uk" "https://www.news18.com" "https://www.nypost.com" "https://www.abc.net.au" "https://www.bbc.co.uk" "https://www.msn.com" "https://www.cnn.com" "https://www.news.google.com" "https://www.dailymail.co.uk" "https://www.nytimes.com" "https://www.theguardian.com" "https://www.foxnews.com" "https://www.finance.yahoo.com" "https://www.news.yahoo.com" )

. /etc/profile

for i in "${list[@]}"
do
	a=$(echo $i|cut -d "." -f2,3,4)
	echo $a
	/usr/sbin/tcpdump -i eth0 -w /root/$a-$(date +%Y-%m-%d_%H-%M-%S).pcap portrange 2000-2100 &
	sleep 2
	curl -L -4 $i --local-port 2000-2100
	#killall curl
	sleep 5
	pkill tcpdump
done
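
The label used in each capture file name comes from the `cut -d "." -f2,3,4` call in the script above. As a rough Python equivalent (the helper name `url_to_label` is hypothetical and not part of the original scripts), the same fields can be extracted like this:

```python
def url_to_label(url):
    """Keep dot-separated fields 2 to 4, like `cut -d "." -f2,3,4`."""
    return ".".join(url.split(".")[1:4])


for url in ("https://www.bbc.co.uk",
            "https://www.indiatimes.com",
            "https://www.news.google.com"):
    print(url, "->", url_to_label(url))
```

Since every URL in the list starts with `https://www.`, dropping the first field removes both the scheme and the `www` prefix.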


With **tshark**, I convert each pcap file to a CSV file.
These files are then imported inside the notebook and are available in the zip file.

In [1]:
!cat scripts/pcap_to_csv.sh

#!/bin/bash
PATH=/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin

# This bash script iterates over all pcap files in the same folder.
# For each one of them, outputs a CSV file with the same name using tshark
for i in *.pcap
do
	/usr/bin/echo "Processing " $i
	/usr/bin/tshark -r $i -T fields -e frame.number -e frame.time -e frame.len \
		-e ip.len -e ip.src -e ip.dst -e ip.proto \
		-e tcp.srcport -e tcp.dstport -e tcp.len -e tcp.option_kind \
			-E header=y -E separator=, -E quote=d > CSV/$i.csv
done
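
Given the two scripts above, each CSV should be named `<site>-<timestamp>.pcap.csv`. The sketch below shows one way to recover the site label and the capture time from such a name. It assumes the files were not renamed afterwards; `parse_capture_name` and the sample file name are hypothetical.

```python
import re
from datetime import datetime

# <site>-<YYYY-MM-DD_HH-MM-SS>.pcap.csv, as produced by the two scripts.
NAME_RE = re.compile(
    r"^(?P<site>.+)-(?P<ts>\d{4}-\d{2}-\d{2}_\d{2}-\d{2}-\d{2})\.pcap\.csv$"
)


def parse_capture_name(filename):
    """Split a capture file name into (site label, capture datetime)."""
    match = NAME_RE.match(filename)
    if match is None:
        raise ValueError(f"unexpected capture file name: {filename!r}")
    timestamp = datetime.strptime(match.group("ts"), "%Y-%m-%d_%H-%M-%S")
    return match.group("site"), timestamp


# Hypothetical file name, for illustration only.
site, captured_at = parse_capture_name("bbc.co.uk-2022-04-01_06-10-00.pcap.csv")
print(site, captured_at.isoformat())
```

The site label then serves as the class label, and the timestamp makes it possible to group captures by day.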


I then set up a crontab entry on the host machine, so that fetching happens every ten minutes from 06:00 to 23:50:

In [4]:
!cat scripts/crontab_entries

*/10 06-23 * * * docker exec -t hw2 /root/script.sh
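
As a quick sanity check on the expected dataset size, the number of runs per day implied by such a schedule can be computed as follows. This is a back-of-the-envelope sketch: `runs_per_day` is a hypothetical helper, and it assumes that every scheduled run completes and produces one capture per website.

```python
def runs_per_day(minute_step, first_hour, last_hour):
    """Count cron triggers for `*/minute_step first_hour-last_hour * * *`."""
    triggers_per_hour = len(range(0, 60, minute_step))
    active_hours = last_hour - first_hour + 1  # the hour range is inclusive
    return triggers_per_hour * active_hours


N_SITES = 18  # websites in the fetching list

curl_runs = runs_per_day(10, 6, 23)
print(curl_runs, "runs per day,", curl_runs * N_SITES, "captures per day at most")
```

With `*/10 06-23` this gives 108 runs and at most 1944 captures per day; a `*/30` step over the same hours would give 36 runs.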


Unfortunately, curl may not provide a faithful representation of real traffic, because it does not download images or run JavaScript. More details are provided in the conclusion sections of the Biflow and CUMUL approaches.

I therefore try another approach: after spinning up a Xubuntu virtual machine in VirtualBox, I collect traces generated by a real browser (Firefox).

The automation script is very similar to the previous one, except that it captures traffic on port 443 and runs every 30 minutes:

In [13]:
!cat scripts/fetching_firefox.sh
!echo ""
!cat scripts/crontab_entries_firefox

#!/bin/bash
PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin

list=( "https://www.indiatimes.com" "https://www.ndtv.com" "https://www.cnbc.com" "https://www.timesofindia.com" "https://www.express.co.uk" "https://www.news18.com" "https://www.nypost.com" "https://www.abc.net.au" "https://www.bbc.co.uk" "https://www.msn.com" "https://www.cnn.com" "https://www.news.google.com" "https://www.dailymail.co.uk" "https://www.nytimes.com" "https://www.theguardian.com" "https://www.foxnews.com" "https://www.finance.yahoo.com" "https://www.news.yahoo.com" )

. /etc/profile

for i in "${list[@]}"
do
	a=$(echo $i|cut -d "." -f2,3,4)
	echo $a
	/usr/bin/tcpdump -i enp0s3 -w /home/stefano/DATASET_HW2/$a-$(date +%Y-%m-%d_%H-%M-%S).pcap port 443 &
	sleep 2
	/usr/bin/firefox $i &
	sleep 20
	wmctrl -c "Firefox" -x "Navigator.Firefox"
	sleep 2
	pkill firefox
	pkill tcpdump
done


*/30 06-23 * * * export DISPLAY=:0 && /home/stefano/DATASET_HW2/script.sh


To reduce the size of the uploaded zip, I do not include the raw CSVs. They can be downloaded here:

In [None]:
#CSV CURL - train
!wget "https://polimi365-my.sharepoint.com/:u:/g/personal/10526141_polimi_it/ESh0NZOxC0dIpDwOWonGPDEB2kKhmdznvfuSADRS7_kdxA?download=1" -O "input/CSV_curl.zip"
!unzip "input/CSV_curl.zip" -d "input/"

In [None]:
#CSV FIREFOX - train
!wget "https://polimi365-my.sharepoint.com/:u:/g/personal/10526141_polimi_it/EWis4276qyJDqaTRBAAlhTcB0gA0k1HFDV25gnbM3syAWg?download=1" -O "input/CSV_firefox.zip"
!unzip "input/CSV_firefox.zip" -d "input/"

In [None]:
#CSV CURL - some days later
!wget "https://polimi365-my.sharepoint.com/:u:/g/personal/10526141_polimi_it/EcstXWzLgIlPokDeauhr8-0BHjgNxKav7jzx0oDqiA6f-Q?download=1" -O "TEST/curl.zip"
!unzip "TEST/curl.zip" -d "TEST/"

In [None]:
#CSV FIREFOX - some days later
!wget "https://polimi365-my.sharepoint.com/:u:/g/personal/10526141_polimi_it/Ed0xg5e-IN9PmcbDB7tkPWoBq24hpAwaD1eEuraBstraGw?download=1" -O "TEST/firefox.zip"
!unzip "TEST/firefox.zip" -d "TEST/"

In [6]:
n_curl = len(os.listdir('input/CSV_curl'))
n_firefox = len(os.listdir('input/CSV_firefox'))
print(f"Overall, I have {n_curl} Curl captures and {n_firefox} Firefox captures")

Overall, I have 8925 Curl captures and 2277 Firefox captures


To avoid data leakage, i.e. the test set influencing training in any way, I first split the data into training and test sets and then fit the normalization on the training data only.
The resulting scaler parameters (mean and variance) are then applied to the test set.
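
The order of operations described above can be sketched with the standard library alone. This is an illustration under stated assumptions, not the code used in the rest of the notebook: the helper names, the 80/20 split and the toy feature matrix are all hypothetical, and the population standard deviation is used for scaling.

```python
import random
from statistics import fmean, pstdev


def train_test_split(rows, test_fraction=0.2, seed=0):
    """Shuffle a copy of the rows, then split into (train, test)."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def fit_scaler(train):
    """Compute per-column mean and standard deviation on training rows only."""
    columns = list(zip(*train))
    means = [fmean(col) for col in columns]
    stds = [pstdev(col) or 1.0 for col in columns]  # guard constant columns
    return means, stds


def transform(rows, means, stds):
    """Standardize rows with previously fitted parameters."""
    return [[(x - m) / s for x, m, s in zip(row, means, stds)] for row in rows]


# Toy feature matrix (hypothetical values).
features = [[float(i), float(i * i)] for i in range(10)]

train, test = train_test_split(features)    # 1. split first
means, stds = fit_scaler(train)             # 2. fit on the training set only
train_scaled = transform(train, means, stds)
test_scaled = transform(test, means, stds)  # 3. reuse the training parameters
```

The test rows never contribute to `means` or `stds`, so nothing about the test distribution leaks into training.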

For final testing (1 day later, 3 days later and, for curl only, 7 days later), I increase the fetching interval to slightly reduce the number of samples.
Note that I started using Firefox some days after curl, so the dates differ between the two datasets.

In [15]:
print(f"1 day later:  I have {len(os.listdir('TEST/curl/1DAY'))} Curl captures")
print(f"3 days later: I have {len(os.listdir('TEST/curl/3DAYS'))} Curl captures")
print(f"7 days later: I have {len(os.listdir('TEST/curl/7DAYS'))} Curl captures")
print()
print(f"1 day later:  I have {len(os.listdir('TEST/firefox/1DAY'))} Firefox captures")
print(f"3 days later: I have {len(os.listdir('TEST/firefox/3DAYS'))} Firefox captures")

1 day later:  I have 540 Curl captures
3 days later: I have 505 Curl captures
7 days later: I have 468 Curl captures

1 day later:  I have 523 Firefox captures
3 days later: I have 522 Firefox captures
