# Working with GitHub Data

This Jupyter Notebook provides a semi-gentle introduction to the GitHub API for researchers interested in studying sociotechnical systems (like GitHub). This tutorial is available on GitHub at: https://github.com/mcburton/sociotech-workshop-2016


**Expectations**

This tutorial assumes some familiarity with Python, APIs, and JSON. You don't necessarily need to be a programming wizard or unicorn, but it helps to have *seen* Python code before and be familiar with the HyperText Transfer Protocol (HTTP). If you are completely unfamiliar with APIs, I recommend reading Zapier's [An Introduction to APIs](https://zapier.com/learn/apis/), which is a gentle introduction to HTTP, data formats, authentication, and API design.

In sum, this tutorial will make more sense if you have:
* Familiarity with Python code and programming
* Conceptual understanding of HTTP and RESTful APIs

That said, I'll try to explain each part of this tutorial such that even someone with no experience in these areas can get something out of it.

## Categories of Being GitHub

Getting data from GitHub's API requires an understanding of the primitive objects within the sociotechnical world of GitHub. At the most basic level, GitHub divides its world into two aspects, *users* and *repositories*. While there are many other kinds of objects in the ontological universe of GitHub, these are the two main ones. *Repositories* are where the code and activities around code reside. Code, changes to code, issues, and documentation are stored in repositories.
*Users* are the "social layer" of GitHub. Users can own or contribute to repositories; they can also belong to *Organizations*, *Follow* other users, and have *Followers* themselves.

For the purposes of this tutorial, we are going to focus on the *Users* portion of GitHub and leave *Repositories* (and all their associated components) aside. 



### The Research Question Driving Data Collection

We'd like to build a social network graph of the members of the Twitter and Facebook organizations on GitHub. Specifically, we want to visualize not just the members of those organizations, but also the people who follow members of Twitter and Facebook. To do that, we need two pieces of data:

* The list of GitHub Users who are members of the Facebook and Twitter Organizations
* The list of followers of each member of the Facebook and Twitter Organizations

The tutorial below will demonstrate how to go about accessing this data from the API. But before we can go about collecting data, we need to get an access token.

## Create a GitHub Account & Access Token

In order to effectively access the GitHub API, you need to have a GitHub account. While it is possible to programmatically download data from GitHub without an account, the unauthenticated [rate limiting](https://developer.github.com/v3/#rate-limiting) makes this a chore. If you have a GitHub account you can use a personal access token without dealing with all the hassles of the [OAuth authorization workflow](https://developer.github.com/v3/oauth/).


If you don't have a GitHub account, [visit the sign up page and create one](https://github.com/join). **NOTE**, if you use your *.edu* email address you can [request a discount](https://education.github.com/discount_requests/new), which gives you a free [micro plan](https://github.com/pricing)(up to 5 private repositories).

Once you create your account or if you already had a GitHub account, you need to create a [personal access token](https://help.github.com/articles/creating-an-access-token-for-command-line-use/).
* Go to **Settings** and click **Personal Access Tokens** (if you are logged in you can just [click here](https://github.com/settings/tokens)). 
* Click **Generate new token** and give the token a short description like "sociotech workshop" or "collecting all the data n'at"
* In the "Select Scopes" section of the form **uncheck all of the boxes.** We are going to use the default scope (no scope), which will give us read-only access to all public information on GitHub. For more information [see the documentation on scopes](https://developer.github.com/v3/oauth/#scopes).
* Click the **Generate token** button to create a unique access token. This will take you to a new page with the token, ***COPY THIS AND SAVE IT SOMEWHERE BECAUSE GITHUB WON'T SHOW IT TO YOU AGAIN.*** I recommend pasting it in the python code cell below.

In [7]:
# execute this cell once you have pasted your token in so we can use it later in the notebook
token = "<paste your token here>"

Great! Now that you have an access token you can start downloading data from GitHub programmatically via the [GitHub API](https://developer.github.com/v3/).
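A minimal sketch of how the token might be put to work, under two assumptions. First, the token is sent in an `Authorization` header, which is how GitHub documents authenticating with a personal access token. Second, instead of pasting the token into the notebook, it can be read from an environment variable. The variable name `GITHUB_TOKEN` and the helper names `load_token` and `auth_headers` are illustrative and not part of the original tutorial.

```python
import os

def load_token(env_var="GITHUB_TOKEN", default=None):
    """Return the token stored in an environment variable, or `default` if unset.
    The variable name GITHUB_TOKEN is an assumption made for this example."""
    return os.environ.get(env_var, default)

def auth_headers(access_token):
    """Build the HTTP headers that identify us to the GitHub API."""
    return {"Authorization": "token {}".format(access_token)}

# fall back to a placeholder so the cell runs even when the variable is not set
example_headers = auth_headers(load_token(default="<paste your token here>"))
```

Headers built this way can be handed to the Requests library through its `headers` argument. Keeping the token out of the notebook also makes it harder to publish it by accident when sharing the file.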

## Accessing the GitHub API

The design of the GitHub API follows the [RESTful architectural style](https://en.wikipedia.org/wiki/Representational_state_transfer), which means a lot of things, but for our purposes it mainly means that we'll be spending our time learning the URLs that will get us the data we want. Because the API uses HTTP, we can actually just use a web browser to interact with the API and get data. The base URL for all of GitHub's API is https://api.github.com.

* Let's visit [https://api.github.com](https://api.github.com) and see what happens! Go ahead, click the link!

You should hopefully see something that looks like that barf below:
```
{
"current_user_url": "https://api.github.com/user",
"current_user_authorizations_html_url": "https://github.com/settings/connections/applications{/client_id}",
"authorizations_url": "https://api.github.com/authorizations",
"code_search_url": "https://api.github.com/search/code?q={query}{&page,per_page,sort,order}",
"emails_url": "https://api.github.com/user/emails",
"emojis_url": "https://api.github.com/emojis",
"events_url": "https://api.github.com/events",
"feeds_url": "https://api.github.com/feeds",
"followers_url": "https://api.github.com/user/followers",
"following_url": "https://api.github.com/user/following{/target}",
"gists_url": "https://api.github.com/gists{/gist_id}",
"hub_url": "https://api.github.com/hub",
"issue_search_url": "https://api.github.com/search/issues?q={query}{&page,per_page,sort,order}",
"issues_url": "https://api.github.com/issues",
"keys_url": "https://api.github.com/user/keys",
"notifications_url": "https://api.github.com/notifications",
"organization_repositories_url": "https://api.github.com/orgs/{org}/repos{?type,page,per_page,sort}",
"organization_url": "https://api.github.com/orgs/{org}",
"public_gists_url": "https://api.github.com/gists/public",
"rate_limit_url": "https://api.github.com/rate_limit",
"repository_url": "https://api.github.com/repos/{owner}/{repo}",
"repository_search_url": "https://api.github.com/search/repositories?q={query}{&page,per_page,sort,order}",
"current_user_repositories_url": "https://api.github.com/user/repos{?type,page,per_page,sort}",
"starred_url": "https://api.github.com/user/starred{/owner}{/repo}",
"starred_gists_url": "https://api.github.com/gists/starred",
"team_url": "https://api.github.com/teams",
"user_url": "https://api.github.com/users/{user}",
"user_organizations_url": "https://api.github.com/user/orgs",
"user_repositories_url": "https://api.github.com/users/{user}/repos{?type,page,per_page,sort}",
"user_search_url": "https://api.github.com/search/users?q={query}{&page,per_page,sort,order}"
}
```

What should happen is you will see a bunch of crap that is only kinda human-readable. What you are looking at is [JSON or JavaScript Object Notation](https://en.wikipedia.org/wiki/JSON).

There isn't actually that much information contained at this particular *endpoint*. Instead, what we are seeing are pointers to other API endpoints, other URLs, that have specific information. Accessing `https://api.github.com` provides some *self-documentation* for how to go about fetching other information from the API. That said, the output above is not very informative, and anyone using GitHub's API should read the [human oriented documentation](https://developer.github.com/v3/) for an explanation of the types of data available and how to access them.
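Many of the URLs in that output contain placeholders in curly braces, such as `{org}` or `{?type,page,per_page,sort}`. These appear to follow the URI Template convention (RFC 6570). The sketch below is a deliberately simplified way to turn such a template into a usable URL: it fills in plain `{name}` placeholders and drops the optional parts. The function name `fill_template` is made up for this example, and this is not a complete URI Template implementation.

```python
import re

def fill_template(template, **values):
    """Fill in the simple {name} placeholders of a GitHub URL template.
    Optional parts such as {/gist_id} or {?page,per_page} are dropped.
    This is a simplification, not a full URI Template implementation."""
    # remove optional expressions, which start with /, ? or & inside the braces
    without_optional = re.sub(r"\{[/?&][^}]*\}", "", template)
    # substitute the remaining {name} placeholders with the supplied values
    return re.sub(r"\{(\w+)\}",
                  lambda match: str(values[match.group(1)]),
                  without_optional)

org_url = fill_template("https://api.github.com/orgs/{org}", org="facebook")
repos_url = fill_template(
    "https://api.github.com/users/{user}/repos{?type,page,per_page,sort}",
    user="3lvis")
print(org_url)
print(repos_url)
```

A placeholder with no matching keyword argument raises a `KeyError`, which is a reasonable signal that the template needs a value we forgot to supply.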

### GitHub Zen

One of the more fun, and not very well documented, API endpoints GitHub offers is the "zen" service. Every time you visit this endpoint you get a little bit of "GitHub zen":
* [https://api.github.com/zen](https://api.github.com/zen)

The only appearance of GitHub zen in the documentation appears in the [ping event payload](https://developer.github.com/webhooks/#ping-event-payload) of the [webhooks](https://developer.github.com/webhooks/) system (for triggering events on Organizations or Repositories activity).

## Accessing the GitHub API with Python

While you can technically use a web browser to access and download data from the GitHub API, it is a lot easier to *programmatically* access data with code. I like to use Python for these purposes because it is very easy to collect, clean, and analyze data in the same environment. Not only is the design of the language good for this purpose, but people have written libraries to make this work even easier. Because we are accessing the GitHub API using HTTP, we are going to use the [Requests](http://python-requests.org) library to help make data collection even easier.

I am also providing two helper functions that will make data collection easy, one (`parseLinkHeader`) that parses an HTTP header to get a pagination link and a second function (`collect_things`) to automatically download multiple pages of things from the API.

In [8]:
# importing helper libraries
from requests import get as GET

# defining some helper functions
def parseLinkHeader(headers):
    """Parses the link headers, useful for pagination.
    Copied from https://github.com/PyGithub/PyGithub/blob/master/github/PaginatedList.py
    Returns a dictionary of the link-headers.
    """
    links = {}
    if "link" in headers:
        linkHeaders = headers["link"].split(", ")
        for linkHeader in linkHeaders:
            (url, rel) = linkHeader.split("; ")
            url = url[1:-1]
            rel = rel[5:-1]
            links[rel] = url
    return links

def collect_things(endpoint, access_token):
    """A helper function to download all the entities in a set of paginated API calls.
    This is meant to be used with GitHub endpoints that return lists of things.
    Returns a list of all the things from the specified endpoint."""
    # some boilerplate variables
    # send the token in the Authorization header; GitHub no longer accepts
    # tokens passed as an `access_token` query parameter
    headers = {"Authorization": "token {}".format(access_token)}
    parameters = {"per_page": 100}
    
    # make the call to github
    response = GET(endpoint, params=parameters, headers=headers)
    # stop with a clear error if the request failed (bad token, rate limit, ...)
    response.raise_for_status()
    
    # parse the link headers for easy pagination
    link_headers = parseLinkHeader(response.headers)
    # start saving the list of things
    things = response.json()
    
    # fetch additional pages of things
    while 'next' in link_headers:
        # the 'next' URL already carries the paging parameters
        response = GET(link_headers['next'], headers=headers)
        response.raise_for_status()
        link_headers = parseLinkHeader(response.headers)
        things = things + response.json()
    
    return things
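
To make the pagination step less mysterious, here is a small offline sketch of what a `Link` header looks like and what the parsing produces. The function below is a compact re-statement of the same string slicing used in `parseLinkHeader`, written to stand alone. The header value is invented for illustration (the organization name `example` and the page numbers are not real data).

```python
def parse_link_header(link_value):
    """Split a Link header value into a {rel: url} dictionary."""
    links = {}
    for part in link_value.split(", "):
        url, rel = part.split("; ")
        # rel looks like rel="next" and url looks like <https://...>
        links[rel[5:-1]] = url[1:-1]
    return links

# an illustrative header value; the organization name and page numbers are made up
example = ('<https://api.github.com/orgs/example/members?per_page=100&page=2>; rel="next", '
           '<https://api.github.com/orgs/example/members?per_page=100&page=4>; rel="last"')

links = parse_link_header(example)
print(links["next"])
print("next" in links, "prev" in links)
```

When the last page is reached, the header no longer contains a `next` entry, which is what ends the `while` loop in `collect_things`.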

## Accessing members of Twitter and Facebook

The first step in data collection is to download the lists of members of the [Twitter](https://github.com/twitter) and [Facebook](https://github.com/facebook) GitHub Organizations.

To do this we must access GitHub's endpoint for an [Organization's Members](https://developer.github.com/v3/orgs/members/). This endpoint looks like:

`GET /orgs/:org/members` where `:org` is the name of the organization of interest.

The documentation for this endpoint says it: 
> List all users who are members of an organization. If the authenticated user is also a member of this organization then both concealed and public members will be returned.

The code below uses the `collect_things` function to get a list of all members of the Facebook Organization.


In [9]:
organization = "Facebook" 

# build the endpoint out of the organization variable
base_url = "https://api.github.com"
endpoint = base_url+"/orgs/{}/members".format(organization) 

# download the member list using the collect_things function and store the results
# in the facebook_members variable
facebook_members = collect_things(endpoint, token)

# display the results
print("There are {} members in the {} Github Organization".format(len(facebook_members), organization))

There are 311 members in the Facebook Github Organization


Sweet! Now we can do the same thing for Twitter

In [10]:
organization = "Twitter" 

# build the endpoint out of the organization variable
base_url = "https://api.github.com"
endpoint = base_url+"/orgs/{}/members".format(organization) 

# download the member list using the collect_things function and store the results
# in the twitter_members variable
twitter_members = collect_things(endpoint, token)

# display the results
print("There are {} members in the {} Github Organization".format(len(twitter_members), organization))

There are 95 members in the Twitter Github Organization


Amazing. Now we have two variables that contain the members of the Twitter and Facebook Organizations. The printouts above tell us how many members are in each organization, but what exactly did we get back from Github in terms of raw data? Each of these variables is a list of things. Each thing is some information about each member of the organization. The [documentation](https://developer.github.com/v3/orgs/members/#response) for the endpoint has an example, but we should inspect *our* data.

In [11]:
# print the contents of the first element of the list
facebook_members[0]

{'avatar_url': 'https://avatars.githubusercontent.com/u/1088217?v=3',
 'events_url': 'https://api.github.com/users/3lvis/events{/privacy}',
 'followers_url': 'https://api.github.com/users/3lvis/followers',
 'following_url': 'https://api.github.com/users/3lvis/following{/other_user}',
 'gists_url': 'https://api.github.com/users/3lvis/gists{/gist_id}',
 'gravatar_id': '',
 'html_url': 'https://github.com/3lvis',
 'id': 1088217,
 'login': '3lvis',
 'organizations_url': 'https://api.github.com/users/3lvis/orgs',
 'received_events_url': 'https://api.github.com/users/3lvis/received_events',
 'repos_url': 'https://api.github.com/users/3lvis/repos',
 'site_admin': False,
 'starred_url': 'https://api.github.com/users/3lvis/starred{/owner}{/repo}',
 'subscriptions_url': 'https://api.github.com/users/3lvis/subscriptions',
 'type': 'User',
 'url': 'https://api.github.com/users/3lvis'}

Now we can start seeing some *traces* of information about the members of the Facebook organization. The output above shows the JSON representation of a specific member of the Facebook Organization. This JSON object has 17 fields, some with specific traces of information and others which are pointers to additional endpoints with more information about this GitHub User.

`id`, `login`, `site_admin`, and `type` are all specific bits of metadata about this particular GitHub User. The remaining fields are pointers to API endpoints to get information like this user's avatar, who they are following or their followers, or a list of their repositories. Because we are interested in finding followers, the `followers_url` endpoint is of particular interest. According to the [documentation](https://developer.github.com/v3/users/followers/#list-followers-of-a-user) this endpoint returns a list of a User's followers. That seems useful!
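
The distinction between metadata fields and pointer fields can be sketched in a few lines. The example assumes, as a rule of thumb rather than anything GitHub guarantees, that a field whose name ends in `url` is a pointer to another endpoint. The function name `split_fields` is invented, and the record is a shortened copy of the one shown above.

```python
def split_fields(record):
    """Separate a user record into plain metadata and pointers to other endpoints."""
    metadata, pointers = {}, {}
    for key, value in record.items():
        # rule of thumb for this example: field names ending in "url" are pointers
        if key.endswith("url"):
            pointers[key] = value
        else:
            metadata[key] = value
    return metadata, pointers

# a shortened copy of the member record shown above
member = {
    "id": 1088217,
    "login": "3lvis",
    "site_admin": False,
    "type": "User",
    "followers_url": "https://api.github.com/users/3lvis/followers",
    "repos_url": "https://api.github.com/users/3lvis/repos",
    "url": "https://api.github.com/users/3lvis",
}

metadata, pointers = split_fields(member)
print(sorted(metadata))
print(sorted(pointers))
```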

## Downloading Members' Followers

So now we know where to go to get the information we need, but we have a ton of Users in the Facebook and Twitter lists and it would be a huge pain to visit each URL manually. Fortunately, we can write a script that downloads each member's followers by looping over the member lists and accessing their `followers_url`.

The code below loops over each member of the Facebook organization, uses the `collect_things` function to download all of a User's followers, modifies the record by adding the name of the followee (the parent record), and saves the results in a variable.


In [12]:
# create a container for storing all the followers of facebook members
followers_of_facebook = [] 

# loop over the members of the facebook organization
for member in facebook_members:
    
    # get the endpoint URL for this member's followers
    followers_url = member['followers_url']
    # fetch all of this user's followers
    followers = collect_things(followers_url, token)
    
    # add the name of the followee to each follower
    for follower in followers:
        follower.update({'parent':member['login']})
    
    followers_of_facebook.extend(followers)
    
print("Found {} followers of facebook members.".format(len(followers_of_facebook)))

Found 48585 followers of facebook members.


Now we can do the same for Twitter. Note, you might be thinking: we seem to be re-using the same code over and over again, couldn't this be written as a set of re-usable functions? Yes, it can, but for the purposes of explanation and teaching I've made the code easier to read by humans rather than computers (or programmers).

In [13]:
# create a container for storing all the followers of twitter members
followers_of_twitter = [] 

# loop over the members of the twitter organization
for member in twitter_members:
    
    # get the endpoint URL for this member's followers
    followers_url = member['followers_url']
    # fetch all of this user's followers
    followers = collect_things(followers_url, token)
    
    # add the name of the followee to each follower
    for follower in followers:
        follower.update({'parent':member['login']})
    
    followers_of_twitter.extend(followers)
    
print("Found {} followers of twitter members.".format(len(followers_of_twitter)))

Found 33602 followers of twitter members.
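
For readers curious what the reusable version of the two loops above might look like, here is one possible sketch. It is not how this notebook was actually run. The function name `collect_followers` and its `fetch` argument are inventions for this example: `fetch` is any function that takes a URL and returns a list of user records, which lets the example run offline against a stand-in for the API with invented users.

```python
def collect_followers(members, fetch):
    """Download the followers of every member and tag each with its followee.
    `fetch` is any function that takes a URL and returns a list of user records."""
    all_followers = []
    for member in members:
        followers = fetch(member["followers_url"])
        for follower in followers:
            # remember whom this follower follows
            follower["parent"] = member["login"]
        all_followers.extend(followers)
    return all_followers

# a stand-in for the API so the example runs offline; the users are invented
fake_api = {
    "https://api.github.com/users/alice/followers": [{"login": "carol"}, {"login": "dave"}],
    "https://api.github.com/users/bob/followers": [{"login": "carol"}],
}

def fake_fetch(url):
    # return copies so that tagging followers does not modify fake_api
    return [dict(record) for record in fake_api[url]]

toy_members = [
    {"login": "alice", "followers_url": "https://api.github.com/users/alice/followers"},
    {"login": "bob", "followers_url": "https://api.github.com/users/bob/followers"},
]

toy_followers = collect_followers(toy_members, fake_fetch)
print(len(toy_followers))
```

With the helper defined earlier in this notebook, the real call would pass something like `lambda url: collect_things(url, token)` as `fetch`.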


In [14]:
# display the first element of the followers list
followers_of_twitter[0]

{'avatar_url': 'https://avatars.githubusercontent.com/u/1706363?v=3',
 'events_url': 'https://api.github.com/users/rehans71/events{/privacy}',
 'followers_url': 'https://api.github.com/users/rehans71/followers',
 'following_url': 'https://api.github.com/users/rehans71/following{/other_user}',
 'gists_url': 'https://api.github.com/users/rehans71/gists{/gist_id}',
 'gravatar_id': '',
 'html_url': 'https://github.com/rehans71',
 'id': 1706363,
 'login': 'rehans71',
 'organizations_url': 'https://api.github.com/users/rehans71/orgs',
 'parent': 'agargenta',
 'received_events_url': 'https://api.github.com/users/rehans71/received_events',
 'repos_url': 'https://api.github.com/users/rehans71/repos',
 'site_admin': False,
 'starred_url': 'https://api.github.com/users/rehans71/starred{/owner}{/repo}',
 'subscriptions_url': 'https://api.github.com/users/rehans71/subscriptions',
 'type': 'User',
 'url': 'https://api.github.com/users/rehans71'}

OK. Now we have a giant pile of trace data stored in four variables, `facebook_members`,`twitter_members`, `followers_of_facebook`, and `followers_of_twitter`. These variables contain the raw output from the GitHub API, but no data wrangling or shaping has been done. Considering how long it took to collect this data, it is a good idea to save it to disk so we can work with it at a later date. 

In [15]:
# import the json library for reading/writing JSON files from disk
import json

In [16]:
# write the python lists to disk as JSON formatted data files. 

with open("data/facebook_members.json", 'w') as f:
    json.dump(facebook_members, f)
    
with open("data/facebook_followers.json",'w') as f:
    json.dump(followers_of_facebook, f)

with open("data/twitter_members.json", 'w') as f:
    json.dump(twitter_members, f)

with open("data/twitter_followers.json", 'w') as f:
    json.dump(followers_of_twitter, f)

It is always a good idea to save the "rawest" trace data you have to disk before you begin slicing and dicing the data into other forms for analysis. This way you can slice and dice the data in multiple ways depending on the kind of analysis you want to perform. Also, because we are dealing with social media data from the web, there is a possibility the results will change over time. Facebook and Twitter are constantly adding (or removing) users from their GitHub organizations; what we have is a snapshot of their organizations' membership at a particular point in time (and we should probably be documenting these trace data's provenance in some metadata...).

It is just good practice to save your "raw" data to disk in a format as close as possible to how it is natively expressed by the data source, in GitHub's case, as JSON. From now on, when we want to work with our data we can just load these JSON files into memory rather than re-accessing the GitHub API. This is also a LOT faster.
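
On the provenance point, one lightweight option is to store a short note about where and when the data was collected alongside the records themselves. The sketch below is only one way this could be done. The function name `save_snapshot` and the field names `source`, `collected_at`, `count`, and `records` are choices made for this example, not an established format.

```python
import json
import os
from datetime import datetime, timezone

def save_snapshot(path, records, source):
    """Write records to a JSON file alongside a small provenance note."""
    # create the target directory (for example data/) if it does not exist yet
    directory = os.path.dirname(path)
    if directory:
        os.makedirs(directory, exist_ok=True)
    snapshot = {
        "source": source,                                        # where the data came from
        "collected_at": datetime.now(timezone.utc).isoformat(),  # when it was saved
        "count": len(records),
        "records": records,
    }
    with open(path, "w") as f:
        json.dump(snapshot, f)
    return snapshot
```

A call might look like `save_snapshot("data/facebook_members_snapshot.json", facebook_members, "https://api.github.com/orgs/Facebook/members")`. Note that files written this way keep the member list under the `records` key, so they are not interchangeable with the plain lists saved above.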

# Sociotechnical Data Preparation from the GitHub API

This notebook works through the steps of cleaning, wrangling, and preparing GitHub data from the `data/` directory that was downloaded in the previous section of the tutorial. This section of the tutorial will load a set of JSON files, extract specific information from the "raw" data structure, and save the extracted features to disk in a format suitable for network analysis (node and edges CSV files).



## Data Fitness

How we process our trace data is dependent upon the research question and mode of analysis. For the purposes of this tutorial, we are interested in building a social network graph of the members of the Twitter and Facebook GitHub organizations. To do this, we need to *get our data into shape*, hence we need to perform some *data fitness*.



In [18]:
# read the JSON data files into memory

with open("data/facebook_members.json", 'r') as f:
    facebook_members = json.load(f)
    
with open("data/facebook_followers.json",'r') as f:
    followers_of_facebook = json.load(f)

with open("data/twitter_members.json", 'r') as f:
    twitter_members = json.load(f)

with open("data/twitter_followers.json", 'r') as f:
    followers_of_twitter = json.load(f)

Now that we have loaded all the data back into memory, we should explore the data to better understand the shape it is in and how much we have.

In [19]:
# examine the first members of the facebook group
facebook_members[0]

{'avatar_url': 'https://avatars.githubusercontent.com/u/1088217?v=3',
 'events_url': 'https://api.github.com/users/3lvis/events{/privacy}',
 'followers_url': 'https://api.github.com/users/3lvis/followers',
 'following_url': 'https://api.github.com/users/3lvis/following{/other_user}',
 'gists_url': 'https://api.github.com/users/3lvis/gists{/gist_id}',
 'gravatar_id': '',
 'html_url': 'https://github.com/3lvis',
 'id': 1088217,
 'login': '3lvis',
 'organizations_url': 'https://api.github.com/users/3lvis/orgs',
 'received_events_url': 'https://api.github.com/users/3lvis/received_events',
 'repos_url': 'https://api.github.com/users/3lvis/repos',
 'site_admin': False,
 'starred_url': 'https://api.github.com/users/3lvis/starred{/owner}{/repo}',
 'subscriptions_url': 'https://api.github.com/users/3lvis/subscriptions',
 'type': 'User',
 'url': 'https://api.github.com/users/3lvis'}

In [20]:
# examine the first member of the followers of facebook
followers_of_facebook[0]

{'avatar_url': 'https://avatars.githubusercontent.com/u/21961?v=3',
 'events_url': 'https://api.github.com/users/arbales/events{/privacy}',
 'followers_url': 'https://api.github.com/users/arbales/followers',
 'following_url': 'https://api.github.com/users/arbales/following{/other_user}',
 'gists_url': 'https://api.github.com/users/arbales/gists{/gist_id}',
 'gravatar_id': '',
 'html_url': 'https://github.com/arbales',
 'id': 21961,
 'login': 'arbales',
 'organizations_url': 'https://api.github.com/users/arbales/orgs',
 'parent': '3lvis',
 'received_events_url': 'https://api.github.com/users/arbales/received_events',
 'repos_url': 'https://api.github.com/users/arbales/repos',
 'site_admin': False,
 'starred_url': 'https://api.github.com/users/arbales/starred{/owner}{/repo}',
 'subscriptions_url': 'https://api.github.com/users/arbales/subscriptions',
 'type': 'User',
 'url': 'https://api.github.com/users/arbales'}

Notice both of these objects have almost identical structures, but there is one additional attribute for the followers. During data collection we added a `parent` attribute to each follower because that information was not part of the GitHub provided data. We need this information to build the graph of following/follower relationships. 

We are fortunate: these data are very clean, so we don't need to spend much time fixing errors or formatting individual fields. Additionally, the structure of the data is consistent, so we don't need to have robust error handling. We are lucky that GitHub is a developer-friendly technology company, so their API and data structures are well designed (and open).

With clean and consistent data, we can jump right into exploratory data analysis. Let's see how much data we have.

In [21]:
# print out the length of the twitter and facebook member lists
print("There are {} members of the Facebook organization.".format(len(facebook_members)))
print("There are {} members of the Twitter organization.".format(len(twitter_members)))

# print out the length of the twitter and facebook follower lists
print("Facebook members have {} total followers.".format(len(followers_of_facebook)))
print("Twitter members have {} total followers.".format(len(followers_of_twitter)))

There are 311 members of the Facebook organization.
There are 95 members of the Twitter organization.
Facebook members have 48585 total followers.
Twitter members have 33602 total followers.


Because we just did a crude grab of members' followers, it is probably the case that many of them are duplicates (meaning some GitHub Users might follow more than one person in the Facebook or Twitter Organizations). One way to quickly count the number of unique followers is to use Python's built-in [set datatypes](https://docs.python.org/3.5/library/stdtypes.html#set-types-set-frozenset) to remove duplicates.

In [22]:
# create a python set of the login names of facebookers and twitterers
facebook_follower_names = set([follower['login'] for follower in followers_of_facebook])
twitter_follower_names = set([follower['login'] for follower in followers_of_twitter])

# print the number of unique followers through the length of the set
print("Facebook members have {} unique followers.".format(len(facebook_follower_names)))
print("Twitter members have {} unique followers.".format(len(twitter_follower_names)))

Facebook members have 28058 unique followers.
Twitter members have 24710 unique followers.


These numbers are considerably smaller than the total number of followers (28,058 unique versus 48,585 total for Facebook, and 24,710 versus 33,602 for Twitter), which means many people follow more than one member of the same organization. Now that we have computed the unique names as sets, we can easily calculate how many followers of Facebookers overlap with the followers of Twitterers.

In [23]:
len(facebook_follower_names & twitter_follower_names)

5074

This also raises the question: are there any members of the Facebook organization who are also members of the Twitter organization? (Remember, not all members of organizations work for the corporations.)

In [24]:
facebook_member_names = set([member['login'] for member in facebook_members])
twitter_member_names = set([member['login'] for member in twitter_members])

len(facebook_member_names & twitter_member_names)

0

Nope.

## Getting in Shape for Social Network Analysis

To visualize the social graph, we need to generate two CSV files, a *nodes.csv* file that contains information about each user and an *edges.csv* file that describes how they are related. Every user mentioned in the edges file needs to have a corresponding entry in the nodes file. The nodes file will contain information like the `login` id, and if they belong to either the twitter or facebook organization.

### Building *nodes.csv*

The `nodes.csv` file needs to have the following shape:
```
<gh_username>,<tw_member?>,<fb_member?>,<tw_follower?>,<fb_follower?>
```
Where `gh_username` is the GitHub login name, `tw_member?` and `fb_member?` are binary indicators of membership, and `tw_follower?` and `fb_follower?` are binary indicators of whether they follow Twitter or Facebook members. Note: There is also some code below that generates a `nodes-alt.csv` that expresses organizational membership as a multi-value attribute (`Facebook`, `Twitter`, `Neither`) instead of a series of binary attributes.
```
<gh_username>,<organization>
```

To build a list of nodes and their attributes we need to iterate over each of the four user lists, updating a master list as we go. However, because there are duplicates in the lists, we first need to create a master dictionary of usernames and their attributes. Currently, we are "storing" a user's attribute information in their membership in one of the four variables we've been exploring above.
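
As a rough illustration of the "master dictionary" idea, the sketch below builds a dictionary that maps each username to its four True/False attributes. The function name `build_node_table` and the toy usernames are invented for this example. The attribute names simply mirror the column names described above.

```python
def build_node_table(tw_members, fb_members, tw_followers, fb_followers):
    """Return {username: attributes} where each attribute is a True/False flag."""
    groups = {
        "tw_member?": set(tw_members),
        "fb_member?": set(fb_members),
        "tw_follower?": set(tw_followers),
        "fb_follower?": set(fb_followers),
    }
    # the union of all four groups gives every username exactly once
    all_names = set().union(*groups.values())
    return {name: {flag: (name in names) for flag, names in groups.items()}
            for name in all_names}

# invented usernames, just to show the shape of the result
table = build_node_table(["ann"], ["bo"], ["cy", "bo"], ["cy"])
print(table["bo"])
```

Here the user `bo` is a Facebook member who also follows a Twitter member, so two of the four flags come out `True`.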

In [25]:
# create a master list of all the names, this looks weird because of Python's set operators
all_the_names = set().union(facebook_member_names,twitter_member_names,facebook_follower_names,twitter_follower_names)

# open the nodes.csv file for writing
with open("data/nodes.csv", 'w') as f:
    
    # write the header line
    f.write("id,tw_member?,fb_member?,tw_follower?,fb_follower?\n")
    # loop over the master list of users
    for name in all_the_names:
        row = "{},{},{},{},{}\n".format(name,
                                       name in twitter_member_names,
                                       name in facebook_member_names,
                                       name in twitter_follower_names,
                                       name in facebook_follower_names)
        f.write(row)

In [26]:
# creating an alternative file that expresses organization membership as attribute values

# create a master list of all the names, this looks weird because of Python's set operators
all_the_names = set().union(facebook_member_names,twitter_member_names,facebook_follower_names,twitter_follower_names)

# open the nodes-alt.csv file for writing
with open("data/nodes-alt.csv", 'w') as f:
    
    # write the header line
    f.write("id,organization\n")
    # loop over the master list of users
    for name in all_the_names:
        if name in twitter_member_names:
            org = "Twitter"
        elif name in facebook_member_names:
            org = "Facebook"
        else:
            org = "Neither"
        
        row = "{},{}\n".format(name,org)
        f.write(row)

### Building *edges.csv*

Now we need to create an edges csv file that expresses the relationships between nodes in the graph. The `edges.csv` file has the following, very simple, shape:
```
source,target
``` 
Where `source` is the GitHub login name of the user who is following the `target`. The usernames should match the names we collected in the `nodes.csv` file above.

The information we need about followers and followees is contained in the `followers_of_facebook` and `followers_of_twitter` variables, specifically the `login` and `parent` attributes (remember, we added the `parent` attribute in the data collection stage to make building this edges files easier). 

In [27]:
# open the edges.csv file for writing
with open("data/edges.csv",'w') as f:
    
    # write the header row
    f.write("source,target\n")
    
    # loop over both lists of followers
    for user in followers_of_facebook + followers_of_twitter:
        row = "{},{}\n".format(user['login'],user['parent'])
        f.write(row)

Great! Now we are done re-shaping the data. Now you can download the data files and start performing network analysis!
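
As a first taste of that analysis, the edges file can be summarized with nothing but the standard library. The sketch below counts how many distinct followers each target has (its in-degree). The function name `count_followers` is invented, and the sample data is a tiny in-memory stand-in with made-up usernames rather than the real `data/edges.csv`.

```python
import csv
import io
from collections import Counter

def count_followers(edge_file):
    """Count how many distinct sources point at each target in an edges CSV."""
    reader = csv.DictReader(edge_file)
    # a set removes any duplicate rows before counting
    unique_edges = {(row["source"], row["target"]) for row in reader}
    return Counter(target for _, target in unique_edges)

# a tiny in-memory stand-in for data/edges.csv; the usernames are invented
sample = io.StringIO("source,target\ncy,ann\ndee,ann\ncy,bo\ncy,ann\n")

in_degree = count_followers(sample)
print(in_degree.most_common(1))
```

To run this on the real file, open it first, for example `with open("data/edges.csv") as f: in_degree = count_followers(f)`.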

## Downloading the Social Graph Files

You can download these newly created CSV files via the links below:

* [data/nodes.csv](data/nodes.csv)
* [data/edges.csv](data/edges.csv)

You can also download the "raw" JSON files at the links below (note, you should right-click "save link as"):
* [data/facebook_members.json](data/facebook_members.json)
* [data/twitter_members.json](data/twitter_members.json)
* [data/facebook_followers.json](data/facebook_followers.json)
* [data/twitter_followers.json](data/twitter_followers.json)

## Where to go next?


For scholars interested in this kind of data collection, I suggest checking out the following resources:
* [Mining the Social Web, 2nd Edition](http://miningthesocialweb.com/) - A book all about downloading data from APIs, web scraping, and analyzing that data in a database.
* [Python for Informatics](http://pythonlearn.com/) - a book, MOOC, and OER course materials for learning Python.
* [Python for Data Analysis](http://shop.oreilly.com/product/0636920023784.do) - another O'Reilly book focused on using Python for data-intensive analysis.

For GitHub Research, check out these resources:
* [PyGithub](https://github.com/PyGithub/PyGithub) - a Python library for accessing the GitHub API. We aren't using it because it obscures some of the sociotechnical dynamics we want to highlight and it has poor documentation.
* [GitHub Archive](https://www.githubarchive.org/) - An archive/public dataset of all public information from GitHub. The data lives inside Google's BigQuery ecosystem and can be analyzed inside Google's cloud machine.
