# Networks: structure, evolution & processes
**Internet Analytics - Lab 2**

---

**Group:** W

**Names:**

* Olivier Cloux
* Thibault Urien
* Saskia Reiss

---

#### Instructions

*This is a template for part 2 of the lab. Clearly write your answers, comments and interpretations in Markdown cells. Don't forget that you can add $\LaTeX$ equations in these cells. Feel free to add or remove any cell.*

*Please properly comment your code. Code readability will be considered for grading. To avoid long cells of codes in the notebook, you can also embed long python functions and classes in a separate module. Don’t forget to hand in your module if that is the case. In multiple exercises, you are required to come up with your own method to solve various problems. Be creative and clearly motivate and explain your methods. Creativity and clarity will be considered for grading.*

---

## 2.2 Network sampling

#### Exercise 2.7: Random walk on the Facebook network

In [None]:
import requests
import random
import numpy as np
URL_TEMPLATE = 'http://iccluster118.iccluster.epfl.ch:{p}/v1.0/facebook?user={user_id}'

def json_format(uid):
    """Fetch the profile of user `uid` from the lab API and return the decoded JSON."""
    port = 5050
    url = URL_TEMPLATE.format(user_id=uid, p=port)
    response = requests.get(url)
    return response.json()

In [None]:
def crawler(maximum, TP_prob, seed):
    """Crawls Facebook from a seed node and collects the ages of the visited users.

    Keyword arguments:
    maximum -- number of nodes to crawl
    TP_prob -- probability to keep crawling (walk to a friend of the current node)
    seed -- UID of the first node

    With probability TP_prob, the crawler keeps walking from user to user.
    Otherwise, it teleports to a random node drawn from a buffer of nodes
    that were already seen (friends of previously visited nodes).

    Returns a list containing the ages of the visited nodes.
    """
    # quick check of parameters
    if maximum < 1 or not 0 <= TP_prob <= 1 or not isinstance(seed, str):
        raise ValueError("bad parameter: need maximum >= 1, 0 <= TP_prob <= 1 and a string seed")
    # local state of the crawl
    vis = {seed}  # visited nodes (the seed is the first one visited)
    age = []  # list of ages, to be returned
    buffer_max = 100  # soft cap on the size of the teleport buffer
    buffer = set()  # nodes we may teleport to
    user_id = seed  # UID of the current node
    i = 0
    while i < maximum:
        data = json_format(user_id)
        #retrieve list of friends not yet visited
        friends_list = set(data['friends']).difference(vis)
        
        age.append(data['age'])
        
        # Maintain a buffer of candidate nodes to teleport to;
        # each visited node contributes a few of its friends.
        if len(buffer) > buffer_max:  # buffer full: evict 10 random entries
            to_remove = random.sample(list(buffer), 10)
            buffer = buffer.difference(to_remove)
        # add at most 10 unvisited friends of the current node
        # (random.sample needs a sequence, so convert the set to a list)
        friends_pool = list(friends_list)
        some_random_friends = set(random.sample(friends_pool, min(len(friends_pool), 10)))
        buffer = buffer.union(some_random_friends)
        buffer.discard(user_id)  # ensure current uid not in buffer, to avoid duplicate visit
        
        # Actual crawling: keep walking with probability TP_prob if the
        # current node has unvisited friends, otherwise teleport.
        if random.random() < TP_prob and len(friends_list) != 0:  # continue crawling
            user_id = random.choice(list(friends_list))
        elif buffer:  # teleport to a random node in the buffer
            user_id = random.choice(list(buffer))
        else:  # nowhere left to go: stop early
            break
        vis.add(user_id)
        i += 1
    return age

In [None]:
N = 5000
seed = 'f30ff3966f16ed62f5165a229a19b319'
theta = 0.9  # probability to keep crawling (passed as TP_prob)
age = crawler(N, theta, seed)
print('The mean age is', np.mean(age))

#### Exercise 2.8

We obtain a mean age of about 20-22 (with small variations depending on the value of `theta`). Since the true mean age is 43, our estimate is off by roughly a factor of two.

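This can be explained by the *friendship paradox*. A random walk reaches a node through one of its friends, so it visits well-connected users far more often than poorly connected ones: on an undirected graph, a simple random walk visits each node, in the long run, in proportion to its number of friends. Young people have more friends, most of them young as well, so the younger population is more densely connected than the older one. Older people have fewer friends, and we can imagine that a majority of those friends are younger. Older users therefore act as near dead ends, with few and mostly young friends, so the crawler finds young nodes much more easily and the sample mean is biased downward.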
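To illustrate the degree bias described above, here is a minimal sketch on a toy graph. The graph, the node names and the helper `random_walk_visits` are hypothetical and are not part of the lab API. The sketch compares how often a simple random walk (no teleport) visits each node with the share of edges that node holds.

```python
import random
from collections import Counter

def random_walk_visits(adjacency, start, steps, rng):
    """Count how often a simple random walk visits each node."""
    visits = Counter()
    node = start
    for _ in range(steps):
        visits[node] += 1
        # move to a uniformly chosen neighbour of the current node
        node = rng.choice(adjacency[node])
    return visits

# Hypothetical undirected toy graph: a hub 'h' with four friends,
# plus one edge between 'a' and 'b'.
adjacency = {
    'h': ['a', 'b', 'c', 'd'],
    'a': ['h', 'b'],
    'b': ['h', 'a'],
    'c': ['h'],
    'd': ['h'],
}

rng = random.Random(0)  # fixed seed so the run is repeatable
steps = 100_000
visits = random_walk_visits(adjacency, 'h', steps, rng)

total_degree = sum(len(neighbours) for neighbours in adjacency.values())
for node, neighbours in adjacency.items():
    empirical = visits[node] / steps          # observed visit frequency
    expected = len(neighbours) / total_degree  # degree / sum of degrees
    print(node, round(empirical, 3), round(expected, 3))
```

The two printed columns should be close to each other: the hub is visited about four times as often as a node with a single friend, even though each node is only one member of the population.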

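One solution would be to weight the nodes according to their number of friends: each visited node contributes to the mean with a weight inversely proportional to its number of friends, which compensates for the over-representation of well-connected users. Another possibility would be to bias the selection itself: when choosing the next node, pick the ones with few friends with higher probability.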
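The weighting idea can be sketched as follows. This is only an illustration under assumptions: `reweighted_mean` is a hypothetical helper, the sample values are made up, and the crawler would have to record the number of friends of each visited node (for instance `len(data['friends'])`, assuming the API returns the full friend list). The correction is exact only for a plain random walk; with teleportation it is an approximation.

```python
def reweighted_mean(samples):
    """Estimate a population mean from random-walk samples.

    `samples` is an iterable of (value, degree) pairs. Each value is
    weighted by 1/degree to cancel the walk's tendency to visit a node
    in proportion to its number of friends.
    """
    numerator = 0.0
    denominator = 0.0
    for value, degree in samples:
        if degree <= 0:
            continue  # a node without friends carries no usable weight
        numerator += value / degree
        denominator += 1.0 / degree
    if denominator == 0:
        raise ValueError("no sample with a positive degree")
    return numerator / denominator

# Made-up samples for illustration: (age, number of friends)
samples = [(20, 100), (25, 50), (60, 5), (70, 2)]

naive = sum(age for age, _ in samples) / len(samples)
corrected = reweighted_mean(samples)
print('naive mean:', naive)
print('re-weighted mean:', corrected)
```

In this made-up example the poorly connected (older) users receive larger weights, so the re-weighted mean is higher than the naive one.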