# Networks: structure, evolution & processes
**Internet Analytics - Lab 2**

---

**Group:** K

**Names:**

* Mathieu Sauser
* Jérémy Chaverot
* Heikel Jebali
* Luca Mouchel
---

#### Instructions

*This is a template for part 2 of the lab. Clearly write your answers, comments and interpretations in Markdown cells. Don't forget that you can add $\LaTeX$ equations in these cells. Feel free to add or remove any cell.*

*Please properly comment your code. Code readability will be considered for grading. To avoid long cells of code in the notebook, you can also embed long Python functions and classes in a separate module. Don’t forget to hand in your module if that is the case. In multiple exercises, you are required to come up with your own method to solve various problems. Be creative and clearly motivate and explain your methods. Creativity and clarity will be considered for grading.*

---

## 2.2 Network sampling

#### Exercise 2.7: Random walk on the Facebook network

In [2]:
import requests
import random
import numpy as np

In [3]:
URL_TEMPLATE = 'http://iccluster050.iccluster.epfl.ch:5050/v1.0/facebook?user={user_id}'

def get_data(user_id):
    """Return the data of the user with the given user_id, parsed from JSON."""
    return requests.get(URL_TEMPLATE.format(user_id=user_id)).json()


def random_walk(s, N, log_progress=False):
    """
    Perform a random walk on the Facebook graph, starting from the source s
    and visiting N nodes.

    For each visited node we record the tuple (user_id, age, number_of_friends);
    the ages are then used to compute the average age.
    """
    attributes = []
    u = s
    for i in range(N):
        if log_progress and i % 100 == 0 and i != 0:
            ages = np.array(attributes)[:, 1].astype(np.int64)
            print(f'{i} iterations done, current age average: {round(np.mean(ages), 2)}')
        data = get_data(u)
        neighbors = data['friends']
        # Record the attributes of the current node u *before* moving on, so that
        # the user_id, age and degree in a tuple all belong to the same user.
        attributes.append((u, data['age'], len(neighbors)))
        # Move to a friend chosen uniformly at random.
        u = random.choice(neighbors)
    return attributes

# Each tuple has the form (user_id, age, number_of_friends), so slicing the array
# with [:, 1] extracts the age column.
ages = np.array(random_walk(s="a5771bce93e200c36f7cd9dfd0e5deaa", N=1000, log_progress=True))[:, 1].astype(np.int64)
print(f"At 1000 iterations, we find that the average age is: {round(np.mean(ages), 3)}")


100 iterations done, current age average: 22.58
200 iterations done, current age average: 24.04
300 iterations done, current age average: 23.33
400 iterations done, current age average: 22.37
500 iterations done, current age average: 22.05
600 iterations done, current age average: 22.67
700 iterations done, current age average: 25.73
800 iterations done, current age average: 26.04
900 iterations done, current age average: 25.91
At 1000 iterations, we find that the average age is: 25.599


Our estimate of the average age of a Facebook user is $25.6$ years, obtained by visiting $1000$ users.
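To check the walk logic without calling the API, one option is to run the same kind of loop on a small in-memory graph. The sketch below is only an illustration: `TOY_GRAPH`, `toy_get_data` and `random_walk_offline` are hypothetical names introduced here, and the ages and friendships are made up.

```python
import random

# Hypothetical toy graph (not the Facebook data): user_id -> (age, list of friends).
# Friendships are symmetric: a-b, a-c, a-d, b-c.
TOY_GRAPH = {
    "a": (20, ["b", "c", "d"]),
    "b": (30, ["a", "c"]),
    "c": (40, ["a", "b"]),
    "d": (50, ["a"]),
}


def toy_get_data(user_id):
    """Stand-in for get_data: returns a dict with the same 'age' and 'friends' keys."""
    age, friends = TOY_GRAPH[user_id]
    return {"age": age, "friends": friends}


def random_walk_offline(s, N, fetch=toy_get_data, rng=None):
    """Visit N nodes starting from s and record (user_id, age, degree) for each one."""
    rng = rng or random.Random()
    attributes = []
    u = s
    for _ in range(N):
        data = fetch(u)
        # Store the attributes of the node we are currently on ...
        attributes.append((u, data["age"], len(data["friends"])))
        # ... then step to a friend chosen uniformly at random.
        u = rng.choice(data["friends"])
    return attributes


walk = random_walk_offline("a", 10, rng=random.Random(0))
for user_id, age, degree in walk:
    print(user_id, age, degree)
```

Passing the fetch function as a parameter makes it possible to swap in the real `get_data` once the loop behaves as expected on the toy graph.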

#### Exercise 2.8

$1.$ Our estimate is far from the true average age: it is off by nearly 20 years.

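$2.$ The walk does not sample users uniformly. On an undirected graph, a random walk visits a user with probability proportional to their number of friends, so users with many friends are over-represented; since our estimate is too low, this suggests that highly connected users tend to be younger. In addition, people seem to be friends mostly with users of their own age, so the walk can stay among users of similar age for many steps.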

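$3.$ As seen in class, we use an unbiased estimator that weights each sample by the inverse of the user's degree. Because we know neither the total number of edges nor the total number of vertices, we use the following estimator, where $f(X_t)$ is the age of the user visited at step $t$ and $d_{X_t}$ is that user's number of friends: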

$$\hat{F} = \frac{\sum_t f(X_t)/d_{X_t}}{\sum_t 1/d_{X_t}}$$ 
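
A short sketch of why dividing by the degree helps, assuming the friendship graph $G = (V, E)$ is connected and undirected, so that the walk's stationary distribution is $\pi(v) = d_v / (2|E|)$. The symbols $V$, $E$ and $\pi$ are notation introduced for this sketch only:

$$\mathbb{E}_\pi\left[\frac{f(X)}{d_X}\right] = \sum_{v \in V} \frac{d_v}{2|E|} \cdot \frac{f(v)}{d_v} = \frac{1}{2|E|} \sum_{v \in V} f(v), \qquad \mathbb{E}_\pi\left[\frac{1}{d_X}\right] = \sum_{v \in V} \frac{d_v}{2|E|} \cdot \frac{1}{d_v} = \frac{|V|}{2|E|}$$

Taking the ratio, the unknown factor $2|E|$ cancels and we are left with $\frac{1}{|V|} \sum_{v \in V} f(v)$, the true average. For a long walk, the two sums in $\hat{F}$ (each divided by the number of steps) approach these two expectations, which is why neither $|E|$ nor $|V|$ needs to be known.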


In [14]:
ls = random_walk("a5771bce93e200c36f7cd9dfd0e5deaa", 1000, False)

Here we keep each user only once, collect the ages and degrees (the number of friends) of these users, and compute the unbiased estimator. The result is very close to the average age released by Facebook. Visiting a larger number of nodes would likely bring the estimate even closer to the true average age.

In [15]:
ls = np.array(ls)
def accurate_estimation():
    # Column 0 holds the user_id: np.unique on that column returns the index of the
    # first occurrence of each user, so each user is kept only once.
    _, indices = np.unique(ls[:, 0], return_index=True)
    new_ages = ls[indices, 1].astype(np.int64)
    new_degrees = ls[indices, 2].astype(np.int64)
    # Return the value of the unbiased estimator.
    return np.sum(new_ages / new_degrees) / np.sum(1 / new_degrees)


In [16]:
round(accurate_estimation(), 2)

41.54
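
As a sanity check of the re-weighting idea, it can help to try it on data where the true mean is known exactly. The sketch below uses a made-up graph (not the Facebook data) in which a few young users have many friends; all names in it are hypothetical. It keeps repeated visits in the sums, as in the formula for $\hat{F}$, and compares the plain mean with the re-weighted one.

```python
import random

# Hypothetical synthetic graph (not the Facebook data):
# 5 "hub" users aged 20, friends with each other and with every "leaf" user;
# 45 "leaf" users aged 60, each friends with the 5 hubs only.
hubs = [f"hub{i}" for i in range(5)]
leaves = [f"leaf{i}" for i in range(45)]
friends = {h: [x for x in hubs if x != h] + leaves for h in hubs}
friends.update({leaf: list(hubs) for leaf in leaves})
age = {**{h: 20 for h in hubs}, **{leaf: 60 for leaf in leaves}}

# On this toy graph the true average age can be computed directly.
true_mean = sum(age.values()) / len(age)

rng = random.Random(0)  # fixed seed so the run is repeatable
u = "leaf0"
samples = []  # (age, degree) of every visited node, repeats included
for _ in range(5000):
    samples.append((age[u], len(friends[u])))
    u = rng.choice(friends[u])

# Plain mean over the visited nodes: over-weights high-degree (young) hubs.
naive = sum(a for a, _ in samples) / len(samples)
# Re-weighted estimator: each sample is weighted by 1 / degree.
reweighted = sum(a / d for a, d in samples) / sum(1 / d for _, d in samples)

print(f"true mean:  {true_mean:.2f}")
print(f"naive:      {naive:.2f}")
print(f"reweighted: {reweighted:.2f}")
```

On such a graph the naive mean should land well below the true mean because the walk spends about half of its time on the young hubs, while the re-weighted value should stay close to it.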

Without the unbiased estimator, a naive approach would be to keep each user only once and simply take the mean of the ages. This estimate is still biased: the age we find is far from the average age released by Facebook.

In [17]:
def inaccurate_estimation():
    # Keep the first occurrence of each user_id (column 0), then average the ages.
    _, indices = np.unique(ls[:, 0], return_index=True)
    new_ages = ls[indices, 1].astype(np.int64)
    return np.mean(new_ages)

round(inaccurate_estimation(), 2)

21.92