# Motivating Hypothetical: DataSciencester

Congratulations! You’ve just been hired to lead the data science efforts at DataSciencester, the social network for data scientists.

-------------------------------------------------------------------------

### Data Preprocessing

You are given a data dump that consists of a list of users, each represented by a dict that contains that user’s id (which is a number) and name (which, in one of the great cosmic coincidences, rhymes with the user’s id):

In [1]:
users = [
{ "id": 0, "name": "Hero" },
{ "id": 1, "name": "Dunn" },
{ "id": 2, "name": "Sue" },
{ "id": 3, "name": "Chi" },
{ "id": 4, "name": "Thor" },
{ "id": 5, "name": "Clive" },
{ "id": 6, "name": "Hicks" },
{ "id": 7, "name": "Devin" },
{ "id": 8, "name": "Kate" },
{ "id": 9, "name": "Klein" }
]

In [2]:
# the “friendship” data, represented as a list of pairs of IDs:

friendship_pairs = [(0, 1), (0, 2), (1, 2), (1, 3), (2, 3), (3, 4), (4, 5), (5, 6), (5, 7), (6, 8), (7, 8), (8, 9)]

Let’s create a dict where the keys are user ids and the values are lists of friend ids. (Looking things up in a dict is very fast.)

In [3]:
# Initialize the dict with an empty list for each user id:
friendships = {user["id"]: [] for user in users}

# And loop over the friendship pairs to populate it:
for i, j in friendship_pairs:
    friendships[i].append(j) # Add j as a friend of user i
    friendships[j].append(i) # Add i as a friend of user j

To find the average number of connections, we first compute the total number of connections by summing the lengths of all the friend lists, and then divide by the number of users:

In [4]:
def number_of_friends(user):
    """How many friends does _user_ have?"""
    user_id = user["id"]
    friend_ids = friendships[user_id]
    return len(friend_ids)

total_connections = sum(number_of_friends(user) for user in users) # 24

num_users = len(users) # length of the users list
avg_connections = total_connections / num_users # 24 / 10 == 2.4
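As a quick sanity check on these numbers: each friendship pair adds one entry to two friend lists, so the total should be exactly twice the number of pairs. The following is a minimal sketch that rebuilds `friendships` from the same pairs so it can run on its own; the name `degree_sum` is introduced here only for illustration.

```python
# Same friendship pairs as above; user ids run from 0 to 9.
friendship_pairs = [(0, 1), (0, 2), (1, 2), (1, 3), (2, 3), (3, 4),
                    (4, 5), (5, 6), (5, 7), (6, 8), (7, 8), (8, 9)]

# Rebuild the adjacency dict: user id -> list of friend ids.
friendships = {user_id: [] for user_id in range(10)}
for i, j in friendship_pairs:
    friendships[i].append(j)
    friendships[j].append(i)

# Every pair is counted once for each of its two endpoints.
degree_sum = sum(len(friend_ids) for friend_ids in friendships.values())

print(degree_sum)                     # 24
print(2 * len(friendship_pairs))      # 24
print(degree_sum / len(friendships))  # 2.4
```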

Next, we sort the users from most friends to fewest:

In [5]:
# Create a list of (user_id, number_of_friends) pairs.
num_friends_by_id = [(user["id"], number_of_friends(user)) for user in users]
num_friends_by_id.sort( # Sort the list
    key=lambda id_and_friends: id_and_friends[1], # by num_friends
    reverse=True) # largest to smallest

# Each pair is (user_id, num_friends):
# [(1, 3), (2, 3), (3, 3), (5, 3), (8, 3),
# (0, 2), (4, 2), (6, 2), (7, 2), (9, 1)]

What we’ve just computed is the network metric *degree centrality*. This has the virtue of being pretty easy to calculate, but it doesn’t always give the results you’d want or expect. For example, in the DataSciencester network Thor (id 4) has only two connections, while Dunn (id 1) has three. Yet when we look at the network, it intuitively seems like Thor should be more central.
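One way to make that intuition concrete is to ask what happens to the network when a user is removed. This is only an illustrative check, not a centrality measure defined in this section, and the helper `reachable_without` is a hypothetical name. The sketch does a breadth-first search from Hero (id 0) after deleting one user:

```python
from collections import deque

friendship_pairs = [(0, 1), (0, 2), (1, 2), (1, 3), (2, 3), (3, 4),
                    (4, 5), (5, 6), (5, 7), (6, 8), (7, 8), (8, 9)]

def reachable_without(removed_id, start_id, num_users=10):
    """Ids reachable from start_id once removed_id is deleted from the network."""
    # Build the adjacency dict, skipping every pair that touches removed_id.
    neighbors = {user_id: [] for user_id in range(num_users)}
    for i, j in friendship_pairs:
        if removed_id not in (i, j):
            neighbors[i].append(j)
            neighbors[j].append(i)

    # Standard breadth-first search from start_id.
    seen = {start_id}
    queue = deque([start_id])
    while queue:
        current = queue.popleft()
        for friend_id in neighbors[current]:
            if friend_id not in seen:
                seen.add(friend_id)
                queue.append(friend_id)
    return seen

print(sorted(reachable_without(removed_id=4, start_id=0)))  # [0, 1, 2, 3]
print(sorted(reachable_without(removed_id=1, start_id=0)))  # [0, 2, 3, 4, 5, 6, 7, 8, 9]
```

Removing Thor splits the network in two, while removing Dunn leaves everyone else connected, even though Dunn has more friends.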

------------------------------------------------------------------------------

### Data Scientists You May Know

You want to design a “Data Scientists You May Know” suggester.

As a first attempt, you write some code that iterates over a user’s friends and collects the friends’ friends:

In [6]:
def foaf_ids_bad(user):
    """foaf is short for "friend of a friend" """
    return [foaf_id
            for friend_id in friendships[user["id"]]
            for foaf_id in friendships[friend_id]]

In [9]:
users[0]

{'id': 0, 'name': 'Hero'}

In [10]:
print(friendships[0])

[1, 2]


In [11]:
print(friendships[1])

[0, 2, 3]
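Putting these lookups together shows why the first attempt is labeled "bad". The sketch below rebuilds `friendships` from the same pairs so it is self-contained, then calls `foaf_ids_bad` on Hero:

```python
friendship_pairs = [(0, 1), (0, 2), (1, 2), (1, 3), (2, 3), (3, 4),
                    (4, 5), (5, 6), (5, 7), (6, 8), (7, 8), (8, 9)]

friendships = {user_id: [] for user_id in range(10)}
for i, j in friendship_pairs:
    friendships[i].append(j)
    friendships[j].append(i)

def foaf_ids_bad(user):
    """foaf is short for "friend of a friend" """
    return [foaf_id
            for friend_id in friendships[user["id"]]
            for foaf_id in friendships[friend_id]]

hero = {"id": 0, "name": "Hero"}
print(foaf_ids_bad(hero))  # [0, 2, 3, 0, 1, 3]
```

The result includes user 0 twice, because Hero is a friend of each of his own friends. It includes users 1 and 2, although both are already Hero's friends. And it includes user 3 twice, because Chi is reachable through two different friends.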


Knowing that people are friends of friends in multiple ways seems like interesting information, so maybe instead we should produce a count of mutual friends. And we should probably exclude people already known to the user. The function below does both: it correctly tells _Chi (id 3)_ that she has two mutual friends with _Hero (id 0)_ but only one mutual friend with _Clive (id 5)_.

In [12]:
from collections import Counter # not loaded by default

def friends_of_friends(user):
    user_id = user["id"]
    return Counter(
        foaf_id
        for friend_id in friendships[user_id] # For each of my friends,
        for foaf_id in friendships[friend_id] # find their friends
        if foaf_id != user_id # who aren't me
        and foaf_id not in friendships[user_id] # and aren't my friends.
    )

print(friends_of_friends(users[3])) # Counter({0: 2, 5: 1})

Counter({0: 2, 5: 1})


As a data scientist, you know that you also might enjoy meeting users with similar interests. (This is a good example of the “substantive expertise” aspect of data science.) After asking around, you manage to get your hands on this data, as a list of pairs (user_id, interest):

In [13]:
interests = [
(0, "Hadoop"), (0, "Big Data"), (0, "HBase"), (0, "Java"),
(0, "Spark"), (0, "Storm"), (0, "Cassandra"),
(1, "NoSQL"), (1, "MongoDB"), (1, "Cassandra"), (1, "HBase"),
(1, "Postgres"), (2, "Python"), (2, "scikit-learn"), (2, "scipy"),
(2, "numpy"), (2, "statsmodels"), (2, "pandas"), (3, "R"), (3, "Python"),
(3, "statistics"), (3, "regression"), (3, "probability"),
(4, "machine learning"), (4, "regression"), (4, "decision trees"),
(4, "libsvm"), (5, "Python"), (5, "R"), (5, "Java"), (5, "C++"),
(5, "Haskell"), (5, "programming languages"), (6, "statistics"),
(6, "probability"), (6, "mathematics"), (6, "theory"),
(7, "machine learning"), (7, "scikit-learn"), (7, "Mahout"),
(7, "neural networks"), (8, "neural networks"), (8, "deep learning"),
(8, "Big Data"), (8, "artificial intelligence"), (9, "Hadoop"),
(9, "Java"), (9, "MapReduce"), (9, "Big Data") 
]

It’s easy to build a function that finds users with a certain interest:

In [14]:
def data_scientists_who_like(target_interest):
    """Find the ids of all users who like the target interest."""
    return [user_id
            for user_id, user_interest in interests
            if user_interest == target_interest]

This works, but it has to examine the whole list of interests for every search. If we have a lot of users and interests (or if we just want to do a lot of searches), we’re probably better off building an index from interests to users:

In [16]:
from collections import defaultdict

# Keys are interests, values are lists of user_ids with that interest
user_ids_by_interest = defaultdict(list)
for user_id, interest in interests:
    user_ids_by_interest[interest].append(user_id)

# And another from users to interests:
# Keys are user_ids, values are lists of interests for that user_id.
interests_by_user_id = defaultdict(list)
for user_id, interest in interests:
    interests_by_user_id[user_id].append(interest)
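To see the two indexes in action, here is a minimal sketch on a subset of the pairs above. The name `sample_interests` is introduced only for this example; for the interests it contains, it includes every pair from the full list, so the lookups shown give the same answers as the full data would.

```python
from collections import defaultdict

# A subset of the (user_id, interest) pairs from the full list above.
sample_interests = [
    (2, "Python"), (2, "pandas"),
    (3, "R"), (3, "Python"), (3, "statistics"),
    (5, "Python"), (5, "R"),
    (6, "statistics"),
]

user_ids_by_interest = defaultdict(list)
interests_by_user_id = defaultdict(list)
for user_id, interest in sample_interests:
    user_ids_by_interest[interest].append(user_id)
    interests_by_user_id[user_id].append(interest)

# A single dict lookup replaces a scan over the whole list.
print(user_ids_by_interest["Python"])  # [2, 3, 5]
print(interests_by_user_id[3])         # ['R', 'Python', 'statistics']

# The index agrees with the linear scan it replaces.
scan_result = [user_id
               for user_id, user_interest in sample_interests
               if user_interest == "Python"]
print(scan_result == user_ids_by_interest["Python"])  # True

# Indexing a defaultdict with a missing key inserts an empty list as a
# side effect; .get() looks the key up without modifying the dict.
print(user_ids_by_interest.get("COBOL", []))  # []
print("COBOL" in user_ids_by_interest)        # False
```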

In [17]:
def most_common_interests_with(user):
    """Count, for each other user, the interests they share with _user_."""
    return Counter(
        interested_user_id
        for interest in interests_by_user_id[user["id"]]
        for interested_user_id in user_ids_by_interest[interest]
        if interested_user_id != user["id"]
    )
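As a usage sketch, the function can be called on Hero (id 0). To keep the example short and self-contained, only the pairs involving one of Hero's interests are reproduced here, under the illustrative name `hero_related_interests`; that is enough to give the same counts for Hero as the full list.

```python
from collections import Counter, defaultdict

# Every pair from the full list whose interest is one of Hero's interests.
hero_related_interests = [
    (0, "Hadoop"), (0, "Big Data"), (0, "HBase"), (0, "Java"),
    (0, "Spark"), (0, "Storm"), (0, "Cassandra"),
    (1, "Cassandra"), (1, "HBase"), (5, "Java"), (8, "Big Data"),
    (9, "Hadoop"), (9, "Java"), (9, "Big Data"),
]

user_ids_by_interest = defaultdict(list)
interests_by_user_id = defaultdict(list)
for user_id, interest in hero_related_interests:
    user_ids_by_interest[interest].append(user_id)
    interests_by_user_id[user_id].append(interest)

def most_common_interests_with(user):
    return Counter(
        interested_user_id
        for interest in interests_by_user_id[user["id"]]
        for interested_user_id in user_ids_by_interest[interest]
        if interested_user_id != user["id"]
    )

shared = most_common_interests_with({"id": 0, "name": "Hero"})
print(shared)  # Counter({9: 3, 1: 2, 8: 1, 5: 1})
```

Klein (id 9) shares three interests with Hero (Hadoop, Big Data, and Java), Dunn (id 1) shares two, and Kate (id 8) and Clive (id 5) share one each.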