# Data Science From Scratch

Joel Grus  
1st Edition, Second Release, O'Reilly  
ISBN: 978-1-491-90142-7  
Notes by Joe McGrath  
[Github page with code for the book](https://github.com/joelgrus/data-science-from-scratch)  


## Contents

* [Preface](#Preface)
* [Chapter 1 - Introduction](#Chapter-1---Introduction)
    * [Simple Social Network](#Simple-Social-Network)
    * [Basic Recommender](#Basic-Recommender)

## Preface
[Back to contents](#Contents)

Data science is [generally held to be](http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram) the intersection of hacking skills, maths & statistics knowledge and substantive expertise. As you can't really put that expertise in a book, it's not covered so much here. Hacking skills are also a bit nebulous and very personal, so they'll come with time too.

*Doing* is the best way to learn most of this, so the book is more of a grounding / framework to build off. The book's designed for smaller-scale analyses, but references some larger ideas.

## Chapter 1 - Introduction
[Back to contents](#Contents)

There's so much data around these days that working out what to *do* with it is as difficult as acquiring it, if not more so. Data scientists run the gamut from statisticians to computer scientists, with a lot of variance in capabilities and skills.

> A data scientist is someone who extracts insights from messy data.

With data science, even seemingly innocuous data can be correlated and processed into the insights we want.

### Simple Social Network

As an example, data from a fictional social network:


In [10]:
users = [
    {'id' : 0, 'name' : 'Hero'},
    {'id' : 1, 'name' : 'Dunn'},
    {'id' : 2, 'name' : 'Sue'},
    {'id' : 3, 'name' : 'Chi'},
    {'id' : 4, 'name' : 'Thor'},
    {'id' : 5, 'name' : 'Clive'},
    {'id' : 6, 'name' : 'Hicks'},
    {'id' : 7, 'name' : 'Devin'},
    {'id' : 8, 'name' : 'Kate'},
    {'id' : 9, 'name' : 'Klein'},
]

friendships = [(0, 1), (0, 2), (1, 2), (1, 3), (2, 3), (3, 4),
               (4, 5), (5, 6), (5, 7), (6, 8), (7, 8), (8, 9),
              ]


From this, we can attach each user's friends to their entry in the original list of dictionaries:

In [11]:
for user in users:
    user['friends'] = []

for i, j in friendships:
    # Note this creates circular references: each friends list holds a
    # *reference* to the other user's dict, not a copy of it.
    users[i]['friends'].append(users[j])
    users[j]['friends'].append(users[i])
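
Because each `friends` list stores references to the same dict objects rather than copies, the structure contains cycles. A minimal sketch of what that means in practice, using a hypothetical three-user subset of the data above (the variable names `hero`, `dunn`, `same_object`, `back_to_hero` and `friend_names` are only for illustration):

```python
# Hypothetical three-user subset of the data above, for illustration only.
users = [
    {'id': 0, 'name': 'Hero'},
    {'id': 1, 'name': 'Dunn'},
    {'id': 2, 'name': 'Sue'},
]
friendships = [(0, 1), (0, 2), (1, 2)]

for user in users:
    user['friends'] = []

for i, j in friendships:
    users[i]['friends'].append(users[j])
    users[j]['friends'].append(users[i])

hero = users[0]
dunn = hero['friends'][0]

# The friend entry is the very same object as users[1], not a copy.
same_object = dunn is users[1]

# Following friends -> friends can lead straight back to where we started.
back_to_hero = any(friend is hero for friend in dunn['friends'])

# Because of the cycles, it is easier to display ids or names than whole dicts.
friend_names = [friend['name'] for friend in hero['friends']]

print(same_object, back_to_hero, friend_names)  # True True ['Dunn', 'Sue']
```

One practical consequence is that serialising the structure as-is (for example with `json.dumps(users)`) fails with a circular-reference error, so storing friend ids instead of dicts may be preferable if the data needs to be saved.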


Then it's fairly simple to calculate the average number of connections:

In [12]:
def number_of_friends(user):
    "how many friends does _user_ have?"
    return len(user['friends'])

total_connections = sum(number_of_friends(user) for user in users) # 24

avg_connections = total_connections / len(users)
print(avg_connections)

2.4
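
As a sanity check on the 24 and 2.4 above: every friendship adds one connection to each of its two endpoints, so the total should be twice the number of friendship pairs. A small sketch, assuming the same `friendships` list and ten users (`num_users` is a name introduced here):

```python
friendships = [(0, 1), (0, 2), (1, 2), (1, 3), (2, 3), (3, 4),
               (4, 5), (5, 6), (5, 7), (6, 8), (7, 8), (8, 9)]
num_users = 10  # assumption: ids 0 to 9, as in the users list above

# Each (i, j) pair counts once for user i and once for user j.
total_connections = 2 * len(friendships)
avg_connections = total_connections / num_users

print(total_connections, avg_connections)  # 24 2.4
```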


Then if we want to sort them by count of friends:

In [24]:
num_friends_by_id = [(user['id'], number_of_friends(user)) for user in users]
print(num_friends_by_id)

# Sort by friend count (the second element of each tuple), largest first
print(sorted(num_friends_by_id,
             key=lambda id_and_count: id_and_count[1],
             reverse=True))

[(0, 2), (1, 3), (2, 3), (3, 3), (4, 2), (5, 3), (6, 2), (7, 2), (8, 3), (9, 1)]
[(1, 3), (2, 3), (3, 3), (5, 3), (8, 3), (0, 2), (4, 2), (6, 2), (7, 2), (9, 1)]


This friend count is a network metric called *degree centrality*. It's simple to calculate, but it doesn't necessarily say much about the *structure* of the network. In this example Thor (id 4) is the only bridge between two well-connected groups of friends, but has a relatively low friend count.
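
The bridge claim can be checked directly. The sketch below is not from the book; it is an illustrative check, and `build_adjacency`, `reachable_from` and the `skip` parameter are hypothetical helpers. It rebuilds the network from the `friendships` list, then compares who user 0 can reach with and without Thor:

```python
from collections import defaultdict

friendships = [(0, 1), (0, 2), (1, 2), (1, 3), (2, 3), (3, 4),
               (4, 5), (5, 6), (5, 7), (6, 8), (7, 8), (8, 9)]

def build_adjacency(edges, skip=None):
    """Map each user id to the set of friend ids, leaving out user `skip`."""
    adjacency = defaultdict(set)
    for i, j in edges:
        if skip in (i, j):
            continue
        adjacency[i].add(j)
        adjacency[j].add(i)
    return adjacency

def reachable_from(start, adjacency):
    """Return every user id reachable from `start` by following friendships."""
    seen = {start}
    stack = [start]
    while stack:
        current = stack.pop()
        for neighbour in adjacency[current]:
            if neighbour not in seen:
                seen.add(neighbour)
                stack.append(neighbour)
    return seen

full = build_adjacency(friendships)
degree = {user_id: len(friends) for user_id, friends in full.items()}
without_thor = build_adjacency(friendships, skip=4)

print(degree[4])                                # 2
print(sorted(reachable_from(0, full)))          # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
print(sorted(reachable_from(0, without_thor)))  # [0, 1, 2, 3]
```

With only two friends, Thor ranks low on degree centrality, yet removing him cuts users 0 to 3 off from users 5 to 9.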

### Basic Recommender

Taking a friends-of-friends approach, we can make a 'people you might know' feature.

In [26]:
def friends_of_friend_ids_bad(user):
    # "foaf" is short for "friend of a friend"
    return [foaf["id"]
            for friend in user["friends"]  # for each of the user's friends...
            for foaf in friend["friends"]  # ...collect each of *their* friends
           ]

print(friends_of_friend_ids_bad(users[0]))

[0, 2, 3, 0, 1, 3]


This method has three problems:

* It returns the target user (as they're a friend of each of their friends).
* It returns people who are already the user's friends.
* It returns duplicates (people reachable through more than one friend).
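
To make the three problems concrete, the sketch below sorts the output for user 0 into categories. The values are copied from the example above, and the names `target_id`, `friend_ids`, `raw_foaf_ids`, `returned_self`, `returned_friends` and `genuine` are introduced here purely for illustration:

```python
# Values taken from the example above: user 0 is friends with users 1 and 2.
target_id = 0
friend_ids = {1, 2}
raw_foaf_ids = [0, 2, 3, 0, 1, 3]  # output of friends_of_friend_ids_bad(users[0])

# Problem 1: the target user appears in their own results.
returned_self = [i for i in raw_foaf_ids if i == target_id]

# Problem 2: existing friends appear in the results.
returned_friends = [i for i in raw_foaf_ids if i in friend_ids]

# Problem 3: what is left is a genuine friend of a friend, but listed twice.
genuine = [i for i in raw_foaf_ids if i != target_id and i not in friend_ids]

print(returned_self)     # [0, 0]
print(returned_friends)  # [2, 1]
print(genuine)           # [3, 3]
```

Only user 3 is a real suggestion here, and it appears twice because both of user 0's friends know them.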

So we create a few functions to clean up the data: