
Part 1: Preparing data

The dataset you will be working with is available here: https://snap.stanford.edu/data/egonets-Facebook.html

Your first job is to:

1. Download the data
2. Unpack the data
3. Import the data as an undirected graph in networkx

Do all of this from your notebook in Python, so that the data preprocessing is automated and can be rerun from scratch.

Note: this could take a while, so if you feel adventurous you can use the multiprocessing library to speed things up.
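The download and unpack steps listed above can also be sketched with the standard library alone. This is a minimal sketch, not the required solution: the helper names download and unpack are made up for illustration, and the URL is assumed to be the combined edge list linked from the dataset page.

```python
import gzip
import shutil
import urllib.request
from pathlib import Path

# Assumed location of the combined edge list on the SNAP site.
URL = 'https://snap.stanford.edu/data/facebook_combined.txt.gz'


def download(url, dest):
    """Download url to dest unless dest already exists (hypothetical helper)."""
    dest = Path(dest)
    if not dest.exists():
        urllib.request.urlretrieve(url, str(dest))
    return dest


def unpack(gz_path, out_path):
    """Decompress a .gz file to out_path and return out_path (hypothetical helper)."""
    out_path = Path(out_path)
    with gzip.open(gz_path, 'rb') as f_in, open(out_path, 'wb') as f_out:
        shutil.copyfileobj(f_in, f_out)
    return out_path
```

Calling unpack(download(URL, 'facebook_combined.txt.gz'), 'facebook_combined.txt') would then produce the text file that networkx reads. Skipping the download when the archive is already on disk keeps repeated notebook runs fast.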

Hand-in:

    The code for downloading, unpacking and loading the dataset



In [2]:
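import wget  # third-party package, used here only for the download
import gzip
import shutil
import networkx as nx
import numpy as np

# 1. Download the compressed edge list into the working directory
wget.download('https://snap.stanford.edu/data/facebook_combined.txt.gz')
# 2. Unpack the archive into a plain text file
with gzip.open('facebook_combined.txt.gz', 'rb') as f_in:
    with open('facebook_combined.txt', 'wb') as f_out:
        shutil.copyfileobj(f_in, f_out)
# 3. Import the edge list; read_edgelist builds an undirected Graph by default
g = nx.read_edgelist('facebook_combined.txt')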

Part 2: Analyse the data

Now, let's take a look at the network you imported.

By node degree we mean the number of edges to and from a node. In an undirected network every edge counts in both directions, so in-degree == out-degree for each node. In a directed network the in-degree and out-degree of a node can differ.

By graph degree we mean the number of edges in the entire network.
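To make these definitions concrete, here is a small sketch that counts degrees by hand on a made-up four-node edge list (not the Facebook data). It also shows why the average degree of an undirected network equals 2 * E / N: each edge is counted once at both of its endpoints.

```python
from collections import Counter

# A tiny made-up edge list, for illustration only.
edges = [('a', 'b'), ('a', 'c'), ('b', 'c'), ('c', 'd')]

# Undirected view: every edge adds 1 to the degree of both endpoints.
degree = Counter()
for u, v in edges:
    degree[u] += 1
    degree[v] += 1

# Directed view: read each pair as an edge from u to v.
out_degree = Counter(u for u, _ in edges)
in_degree = Counter(v for _, v in edges)

# Each undirected edge is counted twice, so this equals 2 * E / N.
average_degree = sum(degree.values()) / len(degree)
```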

Hand-in code that displays:

1. The number of nodes in the network
2. The number of edges in the network
3. The average degree in the network
4. A visualisation of the network inside your notebook



In [3]:
import matplotlib.pyplot as plt

# Report on the imported graph g directly. nx.path_graph(g) would build a new
# path over the same nodes and therefore always report n - 1 edges.
print('Number of nodes: {}'.format(g.number_of_nodes()))
print('Number of edges: {}'.format(g.number_of_edges()))
# Degree of every node, in the order of g.nodes() (reused in Part 3)
avg_deg = np.array([g.degree(n) for n in g.nodes()])
print('Average degree: {}'.format(np.sum(avg_deg)/len(avg_deg)))
nx.draw(g, with_labels = True)
plt.savefig("graph.png")

Number of nodes: 4039
Number of edges: 88234
Average degree: 43.69101262688784



Part 3: Find the most popular people

We're naturally interested in who has the most friends, so we want to extract the top 10, that is, the 10 most connected people.

Hand-in:

1. Code that extracts and reports the 10 people with the most connections in the network



In [None]:
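# avg_deg (from Part 2) holds the degrees in the order of g.nodes(), so the
# last 10 indices of argsort, reversed, are the 10 most connected nodes.
nodes = list(g.nodes())
index_name = [nodes[v] for v in avg_deg.argsort()[-10:][::-1]]

for elm in index_name:
    print('The node: {}, has connections: {}'.format(elm, g.degree(elm)))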
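As an alternative to sorting the whole degree array, the top-k selection can be sketched with heapq.nlargest on a plain mapping from node to degree. The function name top_k_by_degree and the example degrees below are made up for illustration. With networkx, such a mapping could be obtained as dict(g.degree()).

```python
import heapq


def top_k_by_degree(degrees, k=10):
    """Return the k (node, degree) pairs with the highest degree, largest first."""
    return heapq.nlargest(k, degrees.items(), key=lambda pair: pair[1])


# Made-up degrees, for illustration only.
example = {'a': 2, 'b': 5, 'c': 3, 'd': 1}
top2 = top_k_by_degree(example, k=2)
```

For a fixed small k this avoids a full sort, although on a network of this size either approach finishes almost instantly.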