## ELEC0088: Software for Network Services and Design Project Assignment

### Objective: Network Throughput Forecasting
In this project, we will build a network forecasting system that is able to forecast the data rate for a given IP address in the Internet. This could be of paramount benefit for many applications, services providers and ISPs. The system should be able to forecast the throughput of any IP address in the Internet.

This project comprised mainly two parts: <br>
1. Collecting as much data as possible of <IP address, throughput> pairs e.g. (128.34.5.2, 20304).
2. Train and test a deep learning model with the data and show how accurate the model is.

### Learning outcomes/goals of the project:
1. To be able to implement networking applications based on the socket interface
2. To be able to design and implement a deep learning model.

### Notes:
**Be particularly careful on the quality of the data. For example, bear in mind TCP slow start and explain clearly the amount of data retrieved for each sample (how many bytes did you get from each site). Put the data in the web page (called it data.csv, comments should have)**

## Part 1: Collecting Data
A complete huge dataset can be collected through:
1. Using the code in networking session to get IP addresses.
2. Using python libraries e.g BeautifulSoup, Selenium for Web Scraping
3. Using applications like wget in bash

In [1]:
# Import packages
from selenium import webdriver
from datetime import datetime
import time
import pandas as pd
import os
import io
import requests
import argparse
import socket
import itertools
from urllib.parse import urlparse
from multiprocessing import Pool, cpu_count

### Starting a Webdriver
The parameter can be passed into the option of the webdriver to launch browser in maximised or incognito mode

In [2]:
# Define the Chrome options to open the window in incognito mode
option = webdriver.ChromeOptions()
option.add_argument('--incognito')

# Find the ChromeDriver path
SNS_dir = os.path.abspath(os.curdir)
Scrap_dir = os.path.join(SNS_dir, 'Scraping')
DRIVER_PATH = os.path.join(Scrap_dir, 'chromedriver')

# Create an instance of Webdriver
wd = webdriver.Chrome(executable_path=DRIVER_PATH, options=option)

# Build the google query for images
url = 'https://www.google.com/imghp?safe=off&site=&tbm=isch&source=hp'
query = 'food'
country = '{UK}'

# The URL for filtering search engines (query, country)
search_img = '&q={q}&oq={q}&cr=country{c}&gs_1=img&&tbs=isz:lt,islt:4mp'
search_url = url + search_img

# Intialize the Chrome Webdriver to open the URL
wd.get(search_url.format(q=query, c=country))
wd.implicitly_wait(5)  # in seconds

### Using argument parser library to parse parameter into URL
argparse to parse query and country parameters into URL

In [3]:
# # Define argument parser to read in URL
# arg_parser = argparse.ArgumentParser(description='Search images using Google search')
# arg_parser.add_argument('query', metavar='query', type=str, help='Query for the URL')
# arg_parser.add_argument('country', metavar='country', type=str, help='Query for the URL')
# arg_parser.add_argument('--img_count', metavar='count', default=100, type=int, help='How many images to fetch')

# args = arg_parser.parse_args()
# query = args.query
# country = args.country

### Retrive all the images link and store in a text file
Define a function of fetch all image link from webpage

In [4]:
def fetch_img_urls(query, country, img_to_fetch, wd):
    def scroll_to_end(wd):
        wd.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(2)    
    
    # Build a search query for picture more than 4 Megapixels
    search_url = 'https://www.google.com/search?safe=off&site=&tbm=isch&source=hp&q={q}&oq={q}&cr=country{c}&gs_l=img&tbs=isz:lt,islt:8mp'
    wd.get(search_url.format(q=query, c=country))
    
    img_urls = set()
    img_count = 0
    results_start = 0
    
    while img_count < img_to_fetch:
        scroll_to_end(wd)
        
        # Get all the image thumbnail results
        thumbnail_results = wd.find_elements_by_css_selector("img.Q4LuWd")
        num_results = len(thumbnail_results)
        
        print(f"Found: {num_results} search results. Extracting links from {results_start}:{num_results}")
        
        for img in thumbnail_results[results_start:num_results]:
            # Clicking the resulting thumbnail to get real image 
            try:
                img.click()
                time.sleep(0.5)
            except Exception:
                continue

            # Extract the image urls and get rid of the encrypted gstatic links
            real_img = wd.find_elements_by_css_selector('img.n3VNCb')
            for x in real_img:
                if x.get_attribute('src') and 'http' in x.get_attribute('src') and not 'gstatic' in x.get_attribute('src'):
                    img_urls.add(x.get_attribute('src'))
    
            # Total number of images urls extracted
            img_count = len(img_urls)
            
            # Set the limit for retrieved image urls
            if img_count >= img_to_fetch:
                print(f"Image links found: {img_count} ... DONE!!!")
                break
        else:
            print("Looking for more image links ...")
            time.sleep(30)
            return
            show_more_results = wd.find_element_by_css_selector(".mye4qd")
            if show_more_results:
                wd.execute_script("document.querySelector('.mye4qd').click();")

        # move the result startpoint further down
        results_start = len(thumbnail_results)

    return img_urls

### Get IP address from image URLs
Using socket gethostname to obtain IP address

In [5]:
with webdriver.Chrome(executable_path=DRIVER_PATH, options=option) as wd:
    start_timer = datetime.now()
    example_urls = fetch_img_urls('food', 'UK', 10, wd)
    time_elapsed = datetime.now() - start_timer
    print("Time elapsed (hh:mm:ss.ms) {}".format(time_elapsed))
    
print(len(example_urls))
print(example_urls)

Found: 100 search results. Extracting links from 0:100
Image links found: 10 ... DONE!!!
Time elapsed (hh:mm:ss.ms) 0:00:29.842277
10
{'https://www.littlegreenduckie.com/wp-content/uploads/2018/01/My-shopping.jpg', 'https://hips.hearstapps.com/hmg-prod.s3.amazonaws.com/images/healthy-lidl-foods-trolley-1563127711.jpg', 'https://hips.hearstapps.com/hmg-prod.s3.amazonaws.com/images/img-20190503-111204-1557323837.jpg', 'https://hips.hearstapps.com/hmg-prod.s3.amazonaws.com/images/healthy-coop-food-large-flatlay-two-1563206717.jpg', 'https://www.boxfituk.com/blog/wp-content/uploads/2019/07/Fruit-and-Vegetables.jpg', 'https://www.cda.eu/wp-content/uploads/2020/01/Literal-Food-Translations-sml.png', 'https://hips.hearstapps.com/hmg-prod.s3.amazonaws.com/images/healthy-asda-foods-2-1547482772.jpg', 'https://s2.r29static.com/bin/entry/7ff/x,80/2252844/image.jpg', 'https://www.alchemylive.london/wp-content/uploads/2018/02/Foodstation.jpg', 'https://baxterstorey.com/wp-content/uploads/2019/09/Pa

### Get the IP address from URLs using socket interface
Define function prototype to get IP address from the example urls above and store in pandas dataframe

In [6]:
def get_ipaddr(img_urls):
    # Display the result in dataframe
    url_list = []
    ip_list = []
    
    # Break the URL into component and get the IP address of URL
    for urls in img_urls:
        website = urlparse(urls)
        ip_addr = socket.gethostbyname(website.netloc)
        print(f"URL: {website.netloc} and IP: {ip_addr}")
        url_list.append(website.netloc)
        ip_list.append(ip_addr)
    
    url_ip_dict = dict(zip(url_list, ip_list))
    dataframe = pd.DataFrame([[keys, values] for keys, values in url_ip_dict.items()]).rename(columns={0:'URL', 1:'IP address'})
    
    return dataframe, ip_list

df, ip_addr = get_ipaddr(example_urls)
df

URL: www.littlegreenduckie.com and IP: 185.151.30.163
URL: hips.hearstapps.com and IP: 199.232.56.155
URL: hips.hearstapps.com and IP: 199.232.56.155
URL: hips.hearstapps.com and IP: 199.232.56.155
URL: www.boxfituk.com and IP: 172.67.69.52
URL: www.cda.eu and IP: 172.67.131.51
URL: hips.hearstapps.com and IP: 199.232.56.155
URL: s2.r29static.com and IP: 199.232.57.179
URL: www.alchemylive.london and IP: 35.197.204.225
URL: baxterstorey.com and IP: 185.164.44.9


Unnamed: 0,URL,IP address
0,www.littlegreenduckie.com,185.151.30.163
1,hips.hearstapps.com,199.232.56.155
2,www.boxfituk.com,172.67.69.52
3,www.cda.eu,172.67.131.51
4,s2.r29static.com,199.232.57.179
5,www.alchemylive.london,35.197.204.225
6,baxterstorey.com,185.164.44.9


### Calculating throughput from each of succesfully parsed URLs.
Using GET request to retrive content and calculate the throughput. **Note that the TCP has a slow start, therefore small content can be ignored using by setting the slow start threshold (ssthresh is usually 65535 bytes)**

In [7]:
def calc_througput(ssthresh, img_urls):
    data = 0
    throughput = 0
    throughput_list = []
    
    # Calculating throughput and set timeout for very large files
    for urls in img_urls:
        try:
            start_time = time.time()
            img_content = len(requests.get(urls, stream=True).content)
            if img_content > ssthresh:
                data += img_content
        except Exception as e:
            print(f"ERROR - Could not download {urls} - {e}")
            data = 0

        end_time = time.time()
        throughput = data/(end_time - start_time)
        throughput_list.append(throughput)
    
    return throughput_list

throughput = calc_througput(65535, example_urls)
new_dict = dict(zip(ip_addr, throughput))
dataframe = pd.DataFrame([[keys, values] for keys, values in new_dict.items()]) \
            .rename(columns={0:'IP address', 1:'Throughput'})
dataframe

Unnamed: 0,IP address,Throughput
0,185.151.30.163,4645415.0
1,199.232.56.155,29965840.0
2,172.67.69.52,29738150.0
3,172.67.131.51,47653310.0
4,199.232.57.179,35782450.0
5,35.197.204.225,110992600.0
6,185.164.44.9,31898540.0


### Using Multithreading to fetch image by parsing different countries
Define function prototype to fetch images urls from different countries

In [8]:
country_dict = {'Asia': ['CN', 'IN', 'JP', 'KR', 'TW'],
                'Europe': ['GB', 'FR', 'IT', 'DE', 'RU'],
                'Africa': ['NG', 'EG', 'ET', 'TZ', 'ZA'],
                'Oceania': ['NZ', 'AU'],
                'South America': ['MX', 'CO', 'AR', 'BR', 'CL'],
                'North America': ['US', 'CA']}

country_tuple = [(keys, values) for keys, values in country_dict.items()]
print(country_tuple)


def get_webdriver():
    wd = getattr(threadLocal, 'webdriver', None)
    
    if wd is None:
        option = webdriver.ChromeOptions()
        option.add_argument('--incognito')
        wd = webdriver.Chrome(executable_path=DRIVER_PATH, options=option)
        setattr(threadLocal, 'webdriver', wd)    
    return wd

[('Asia', ['CN', 'IN', 'JP', 'KR', 'TW']), ('Europe', ['GB', 'FR', 'IT', 'DE', 'RU']), ('Africa', ['NG', 'EG', 'ET', 'TZ', 'ZA']), ('Oceania', ['NZ', 'AU']), ('South America', ['MX', 'CO', 'AR', 'BR', 'CL']), ('North America', ['US', 'CA'])]


In [9]:
check = [(values) for values in country_dict.values()]
merge = list(itertools.chain(*check))
print(merge)

['CN', 'IN', 'JP', 'KR', 'TW', 'GB', 'FR', 'IT', 'DE', 'RU', 'NG', 'EG', 'ET', 'TZ', 'ZA', 'NZ', 'AU', 'MX', 'CO', 'AR', 'BR', 'CL', 'US', 'CA']


## Splitting IP address of the URL into 32 bit processed vector
Splitting the IP address into 32 bit dimensional data because each IPV4 represent 32-bit where each number range from 0 to 255 and represent 8-bits. For example, IP = 203.132.63.117 will be IP-32bits = 11001011100001000011111101110101.

## Distance between two IP address
\begin{equation*}
distance(IP1, IP2) = (32 - (IP1 XOR IP2))\frac{1}{32}
\end{equation*}

The distance is 32 minus the number of leading zeroes in the bitwise **XOR** between two IP and normalize by factor of 32


In [10]:
import ipaddress

def convert2bin(ip_list):
    bin_ip = []
    
    for ip in ip_list:
        ip_vector = format(int(ipaddress.ip_address(ip)), '032b')
        bin_ip.append(ip_vector)
        
    return bin_ip

In [41]:
def pipeline(country):
    # Retrieve image urls
    with webdriver.Chrome(executable_path=DRIVER_PATH, options=option) as wd:
        start_timer = datetime.now()
        img_urls = fetch_img_urls(query, country, 20, wd)
        time_elapsed = datetime.now() - start_timer
        print("Time elapsed (hh:mm:ss.ms) {}".format(time_elapsed))
    
        # Get unique IP address of each images urls
        df, ip_addr = get_ipaddr(img_urls)
        
        # Create country list for mapping
        cr_list = [country] * len(img_urls)
        
        # Convert IP address to binary 32 bits
        ip_binary = convert2bin(ip_addr)

        # Calculating throughput
        throughput = calc_througput(65535, img_urls)
        dataframe = pd.DataFrame(list(zip(cr_list, ip_addr, throughput, ip_binary)), \
                                 columns=['Country', 'IP address', 'Throughput', 'Binary IP'])
#         dataframe = pd.DataFrame([[keys, values] for keys, values in dataset.items()]) \
#                     .rename(columns={0:'IP address', 1:'Throughput'})

        # Splitting binary 32 bits IP address into multidimensional columns for machine learning
        for i in range(32):
            dataframe['B'+str(i)] = dataframe['Binary IP'].str[i]
    
    return dataframe

In [42]:
checking = pipeline('CA')
checking

Found: 100 search results. Extracting links from 0:100
Image links found: 20 ... DONE!!!
Time elapsed (hh:mm:ss.ms) 0:01:15.287645
URL: cdn.fashionmagazine.com and IP: 13.35.193.66
URL: lh3.googleusercontent.com and IP: 216.58.210.193
URL: media-exp1.licdn.com and IP: 184.26.149.102
URL: www.theglobeandmail.com and IP: 88.221.135.10
URL: ramblingsofasuburbanmummy.files.wordpress.com and IP: 192.0.72.19
URL: lh3.googleusercontent.com and IP: 216.58.210.193
URL: assets.blog.foodnetwork.ca and IP: 143.204.169.25
URL: www.theglobeandmail.com and IP: 88.221.135.10
URL: kidspressmagazine.com and IP: 104.28.16.4
URL: cdn.audleytravel.com and IP: 151.101.18.133
URL: www.weightwatchers.com and IP: 151.101.18.99
URL: www.yummymummyclub.ca and IP: 104.24.112.89
URL: static.sscontent.com and IP: 104.16.169.89
URL: www.readersdigest.ca and IP: 104.18.23.250
URL: lh3.googleusercontent.com and IP: 216.58.210.193
URL: i.pinimg.com and IP: 199.232.56.84
URL: cdn.fashionmagazine.com and IP: 13.35.193.66

Unnamed: 0,Country,IP address,Throughput,Binary IP,B0,B1,B2,B3,B4,B5,...,B22,B23,B24,B25,B26,B27,B28,B29,B30,B31
0,CA,13.35.193.66,1841313.0,00001101001000111100000101000010,0,0,0,0,1,1,...,0,1,0,1,0,0,0,0,1,0
1,CA,216.58.210.193,8150619.0,11011000001110101101001011000001,1,1,0,1,1,0,...,1,0,1,1,0,0,0,0,0,1
2,CA,184.26.149.102,8156517.0,10111000000110101001010101100110,1,0,1,1,1,0,...,0,1,0,1,1,0,0,1,1,0
3,CA,88.221.135.10,14300770.0,01011000110111011000011100001010,0,1,0,1,1,0,...,1,1,0,0,0,0,1,0,1,0
4,CA,192.0.72.19,10403220.0,11000000000000000100100000010011,1,1,0,0,0,0,...,0,0,0,0,0,1,0,0,1,1
5,CA,216.58.210.193,68478450.0,11011000001110101101001011000001,1,1,0,1,1,0,...,1,0,1,1,0,0,0,0,0,1
6,CA,143.204.169.25,29521620.0,10001111110011001010100100011001,1,0,0,0,1,1,...,0,1,0,0,0,1,1,0,0,1
7,CA,88.221.135.10,30553700.0,01011000110111011000011100001010,0,1,0,1,1,0,...,1,1,0,0,0,0,1,0,1,0
8,CA,104.28.16.4,22621970.0,01101000000111000001000000000100,0,1,1,0,1,0,...,0,0,0,0,0,0,0,1,0,0
9,CA,151.101.18.133,65404460.0,10010111011001010001001010000101,1,0,0,1,0,1,...,1,0,1,0,0,0,0,1,0,1


In [43]:
# Save the dataframe into csv
checking.to_csv('testing23.csv', index=False)