# Identifying How Scientific Papers Are Shared and Who Is Sharing Them on Twitter

Welcome to the data collection and analysis part of [_Identifying How Scientific Papers Are Shared and Who Is Sharing Them on Twitter_](https://docs.google.com/document/d/1mKA4p5m7ubqJuyTDbDOm9J0F4wYBUoW9t1zhmWEN__M/edit#heading=h.esjw8tv92vde). In this part of the course, you will be able to see, execute, and modify the Python code needed to collect user information from Twitter, including all of the follower relationships. This data will then be used to construct networks and to calculate network statistics for analysis. 

This webpage you are looking at is a [Jupyter Notebook](http://jupyter.org/). It allows for a combination of HTML (like what you are reading now) and code (like you see immediately below). The code can be executed by clicking on the cell (the text area with the code in it) and then hitting `Ctrl + Enter`. 

Try it below. 

In [None]:
# a) intro cell

byt = b')\x16\x01H\x05\x1b\x03\rT\r\x17\x0b\x13\x0c\x00\r\x0bN\x03\x16\x19\rO\x1e\t\r\x1c\x07\x01N\x13\x16\x10\rN'
bytes(c ^ b'python'[i % 6] for i, c in enumerate(byt)).decode()

You should see an "Out" label on the left, and then some words. 

Once you have done that, we are ready to get started. The first step, however, is to get ourselves programmatic access to Twitter. For this, you will need to have a Twitter account. It can be your personal Twitter account, since we are only planning on reading information from Twitter, and not posting any actual tweets (unless you want to). Once you're logged in to your Twitter account, go to [https://apps.twitter.com/](https://apps.twitter.com/). 

Once there, click on "Create New App", and fill in the form. The `name`, `description`, and `website` you provide do not matter. You can leave the `callback URL` blank. Agree to the terms of service and create the app. 

On the second page, choose the `Keys and Access Tokens` tab and then, near the bottom, `Create Access Tokens`. 

Now you have everything you need to programmatically interact with Twitter. The Consumer Key and Secret (near the top of the page) and the Access Token and Secret (on the bottom) essentially replace your username and password. 

Enter the four values below. Make sure you place the values between single quotes. For example: 

`consumer_key = 'youareawesomekeepitup'`

**run the cell below**

In [None]:
# b) setup cell 

# These are some Python modules we may need, so we are loading them here
import tweepy
import re
import json
import sqlite3
import numpy as np
import pandas as pd
import datetime, time, os, sys

# Add in your four values below. The access token usually has a hyphen in it
# so make sure you get it all when you copy paste
consumer_key = ''
consumer_secret = ''
access_token = ''
access_token_secret = ''

print('OK')

A note: This server is hosted by Digital Ocean and Juan Pablo Alperin is the only person that currently has access. It is essentially secure, but you should be aware that when you save your Jupyter notebook, you are saving those Twitter credentials on the server and everyone in the course could see your keys and tokens. There is little risk within this closed setting, but you should probably revoke the tokens after the course is done and make new ones. 

----

So far you just saved those keys into some variables, but we have not done anything with the Twitter API. The [Twitter API](https://dev.twitter.com/rest/public) is a defined set of commands you can give Twitter to read and write Twitter data. It is essentially a programmatic way of doing all the things you can usually do on Twitter: read tweets, see who follows who, look at timelines, etc. 

Fortunately, we do not need to worry about writing all the Python code to do all the nitty gritty. There is a Python module called [Tweepy](http://pythoncentral.io/introduction-to-tweepy-twitter-for-python/) that does most of it for us. 

**Run the cell below.** *In fact, every time you see a code cell, run it before proceeding. Subsequent code might rely on it.*

In [None]:
# c) authentication cell 

# Ask Tweepy to authenticate you to Twitter (log you in)
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)

# Set up access to the Twitter API using the authentication info and some options
api = tweepy.API(auth, wait_on_rate_limit=True, wait_on_rate_limit_notify=True)

# Fetch our own user once, instead of making two separate API calls
me = api.me()
print('You are now logged in as: %s (%s)' % (me.screen_name, me.name))

Great! Now you are logged in to Twitter through the API using Tweepy! We are ready to really get going. 

Let's try a few things so that you can get a sense of how the API works. You can see a full list of the methods Tweepy makes available by looking at [their documentation](http://docs.tweepy.org/en/v3.5.0/api.html). 

We are going to start with the **[get_status](http://docs.tweepy.org/en/v3.5.0/api.html#API.get_status)** method. 

**Run the cell below. Then try changing the Tweet ID (last part of every tweet URL)**

In [None]:
# d) fetch tweet cell

# Twitter calls tweets 'statuses'
status = api.get_status(870093876853227520)

# Print the time the tweet was created
print("Created at: %s" % status.created_at)

# Print the text of the tweet
print("Text: %s" % status.text)

# Print the author of the tweet
print("Author: %s" % status.author.screen_name)

# Loop through the hashtags of the tweet and print them (if there are any)
if 'hashtags' in status.entities: 
    print("hashtags: %s" % ', '.join([hashtag['text'] for hashtag in status.entities['hashtags']]))
        
# Loop through the urls of the tweet and print them (if there are any)
if 'urls' in status.entities: 
    print("urls: %s" % ', '.join([url['expanded_url'] for url in status.entities['urls']]))

We can also use the API to fetch information about specific users using the **[get_user](http://docs.tweepy.org/en/v3.5.0/api.html#API.get_user)** method. 

In [None]:
# e) fetch user cell

# We can use the API object to see who we are logged in as
# You can also change your_screen_name to any other screen name (remember to put it in 'single_quotes')
your_screen_name = api.me().screen_name

# Now we can use Tweepy to get information about your user
user = api.get_user(your_screen_name)

print(user.location)

print("You first joined Twitter on: %s" % user.created_at)
print("This is what your bio says: ")
print(user.description)
print()

Of course, we can also get information about who follows you with the **[followers](http://docs.tweepy.org/en/v3.5.0/api.html#API.followers)** method, and about who you follow using the **friends** method (which, for some reason, is not listed in the Tweepy documentation). 

There are also two other methods for accessing followers and friends that allow you to get many more users at a time. These are **[followers_ids](http://docs.tweepy.org/en/v3.5.0/api.html#API.followers_ids)** and **[friends_ids](http://docs.tweepy.org/en/v3.5.0/api.html#API.friends_ids)**. These latter two methods return a list of user_ids, while the first two return a list of user objects. 

That is, the first two give you all the information about the actual followers/friends (screen names, descriptions, etc.), while the latter two give you a list of the ids only. The first two are more detailed, but they return far fewer people per call (up to 200, and only 20 by default), while the latter two return up to 5,000 ids at a time. 

In [None]:
# f) fetch followers cell 

# Get a list of the most recent people who followed you (the API returns 20 by default)
followers = api.followers(user.id)
# print out the most recent one. The [0] says the first item on the list
print("The last person who followed you was: @%s" % followers[0].screen_name)

# Get a list of the most recent people you followed (also 20 by default)
friends = api.friends(user.id)
print("The last person you followed was: @%s" % friends[0].screen_name)

Below are the URLs of a tweet from each of you. Imagine these were tweets that all had the URL to the same academic article, which is what we essentially get from Altmetric. 

You can change, add, or remove tweets from this list, and the rest of the code will work. 

**Note: You probably don't want to add too many users, because it will slow down the data collection.**

In [None]:
# g) setup our sample cell 

sample = [
'https://twitter.com/juancommander/status/870093876853227520',
'https://twitter.com/mauvepg/status/909456056748617728',
'https://twitter.com/ScottSteedman2/status/1094281373999755264',
'https://twitter.com/learnpublishing/status/1108952174237630464', 
    
'https://twitter.com/MooreaCorrigan/status/1098472426369765376',
'https://twitter.com/loramouammer_/status/1024831021408116736', 
'https://twitter.com/astrpt/status/1112149646934044672', 
'https://twitter.com/wholeheartedeat/status/1102639436649648128',
'https://twitter.com/JazminWelch/status/1111150156982812672'
]

Your original source of data may be very different, but in the end, what you want is a list of tweet_ids. If you had a CSV or an Excel file with a list of IDs, you'd just need to write a few lines of additional Python code to read those into a list. 
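
As a minimal sketch of those "few lines", the example below reads IDs from a CSV using Python's built-in `csv` module. The column name `tweet_id` and the sample rows are assumptions made for illustration, and `io.StringIO` stands in for a real file so that the example runs on its own.

```python
import csv
import io

# Hypothetical CSV contents. With a real file you would use something like
# open('my_tweets.csv', newline='') in place of io.StringIO(csv_text).
csv_text = """tweet_id,note
870093876853227520,first example
909456056748617728,second example
"""

tweet_ids_from_csv = []
reader = csv.DictReader(io.StringIO(csv_text))
for row in reader:
    # Keep the IDs as strings so the long numbers are never rounded or reformatted
    tweet_ids_from_csv.append(row['tweet_id'].strip())

print(tweet_ids_from_csv)
```

Keeping the IDs as text matters most if the file has passed through a spreadsheet program, since tweet IDs are long enough that they can lose digits when treated as numbers.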

In [None]:
# h) extract tweet id from sample cell

# We are just going to extract the tweet ID (the numeric part at the end of the URL)
tweet_ids = [] # start an empty list
for tweet_url in sample:
    tweet_id = tweet_url.split('/')[-1] # split on the /, then grab the last part [-1]
    tweet_ids.append(tweet_id) # add tweet ID to our list

# This should now have a list of all the tweet_ids
print(tweet_ids)

Now we'll loop through that list of tweet IDs to fetch all of the tweet details. Tweepy will put each tweet into a `Status Object`. The code below fetches from the API, using the same `get_status` method we used above, and appends it to a list.

In [None]:
# i) fetch many tweets cell

statuses = [] # start an empty list of statuses
for tweet_id in tweet_ids:
    # The try/except block is just in case there is an error
    # instead of stopping on the error, it will spit it out and keep going
    # None of the above tweets should cause errors
    try: 
        statuses.append(api.get_status(tweet_id))
    except tweepy.TweepError as error: 
        print('Had a problem getting tweet_id = %s' % tweet_id)
        print(error)     

Note that the above could also have been done with a single line (and a single API call), using the [statuses_lookup](http://docs.tweepy.org/en/v3.5.0/api.html#API.statuses_lookup) endpoint, which allows querying up to 100 tweets at a time: 

`statuses = api.statuses_lookup(tweet_ids)` 
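
With more than 100 tweets, the list of IDs would need to be split into batches first. The sketch below shows one way that could look. The helper names `batches` and `lookup_in_batches` are made up for this example, and `fake_lookup` stands in for `api.statuses_lookup` so the code runs without Twitter credentials.

```python
def batches(items, size=100):
    """Yield successive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def lookup_in_batches(ids, lookup, size=100):
    """Call `lookup` once per batch and collect everything it returns."""
    results = []
    for batch in batches(ids, size):
        results.extend(lookup(batch))
    return results


# Stand-in for api.statuses_lookup (hypothetical, for illustration only)
def fake_lookup(batch):
    return ['status-%s' % tweet_id for tweet_id in batch]


many_ids = list(range(250))
found = lookup_in_batches(many_ids, fake_lookup)
print(len(found))                               # 250 results
print([len(b) for b in batches(many_ids)])      # batch sizes: [100, 100, 50]
```

With the `api` object from earlier, the call would presumably look like `lookup_in_batches(tweet_ids, api.statuses_lookup)`.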

In those statuses, we get all of the user objects. So we can just loop through and find out who the users are behind those tweets. We already had the screen_names in the URLs, but this will allow us to have the user_ids (the numeric identifier) for each user. While users can change their screen names over time, their numeric identifier stays the same, so fetching the tweet_id, and using that to get the user_id, is the most robust way of ensuring you have the right user. 

In [None]:
# j) get user from tweet cell 

users = {}  # Make an empty dictionary. It will have a mapping of user_id => User Object
# dictionaries let us have a collection of objects, like a list, but with an index so we can later
# retrieve any individual item. 
# We can then do something like users[12345] and it will return the User object for user with id 12345

# Loop through each of the statuses, and identify the tweet's author
for status in statuses: 
    user_id = status.author.id
    users[user_id] = status.author  # Make a mapping of user_id => User Object

print("These are all the user ids we have collected:")
print(users.keys())

print("\n\n")
print('Here is what the first user object looks like:')
first_user_id = list(users.keys())[0]
print(json.dumps(users[first_user_id]._json, indent=4))

Now the users dictionary has a mapping of user_id => user object for all users in our sample of tweets. For our network project, we are really only going to use the user_ids, but you can imagine scenarios where you want to use the rest of the user object, for example to extract information from the users' descriptions, their locations, etc. 

For example, let's see which user has tweeted the most, and how many users have set their location:

In [None]:
# k) analyze users cell

# Set the maximum to 0 to start, and the user to empty
max_statuses = 0
max_statuses_user = None

for user_id, user in users.items():
    # If the next user in the loop has more than the max we found
    # save that new maximum and the user that goes along with it
    if user.statuses_count > max_statuses:
        max_statuses = user.statuses_count
        max_statuses_user = user  
        
print('%s has the most tweets at: %s' % (max_statuses_user.screen_name, max_statuses))

has_location = 0
has_geo_enabled = 0
for user_id, user in users.items():
    if len(user.location) > 0:
        has_location += 1
    if user.geo_enabled:
        has_geo_enabled +=1
print('%i people (%.2f%%) have something in their location field.' % (has_location, 100.0*has_location/len(users)))
print('%i people (%.2f%%) have geolocation enabled.' % (has_geo_enabled, 100.0*has_geo_enabled/len(users)))

Now let's get to constructing the network! For this we need to get all of the followers for every person in our sample. We will use the [followers_ids](http://docs.tweepy.org/en/v3.5.0/api.html#API.followers_ids) endpoint, which lets us fetch up to 5,000 followers at a time. That means we need to do two API calls for someone with 8,000 followers, three for someone with 11,000, etc. 

The API allows us to do 15 calls to this endpoint every 15 minutes, so if someone has a lot of followers, or if you have too many people, there might be a need for a 15 minute break while this finishes.
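
The arithmetic above can be sketched in a few lines. This is only a rough estimate under the limits stated here (5,000 ids per call, 15 calls per 15 minutes). The function names and the follower counts in the last line are invented for the example.

```python
import math

IDS_PER_CALL = 5000      # followers_ids page size, as described above
CALLS_PER_WINDOW = 15    # calls allowed per window, as described above
WINDOW_MINUTES = 15


def calls_needed(follower_count):
    """Number of followers_ids calls for one user (always at least one)."""
    return max(1, math.ceil(follower_count / IDS_PER_CALL))


def minutes_waiting(follower_counts):
    """Rough lower bound on minutes spent waiting for rate limit resets."""
    total_calls = sum(calls_needed(n) for n in follower_counts)
    windows = math.ceil(total_calls / CALLS_PER_WINDOW)
    return max(0, windows - 1) * WINDOW_MINUTES


print(calls_needed(8000))     # 2
print(calls_needed(11000))    # 3
# 2 + 3 + 1 + 24 = 30 calls, which is two full windows, so one 15 minute wait
print(minutes_waiting([8000, 11000, 300, 120000]))
```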

This use of the API is a little different, because we need to "page" through the results if a user has more than 5,000 followers. The followers_ids endpoint will return 5,000 on each page, and we need to move through each page until we get them all. Fortunately, Tweepy handles all of this for us with something called a Cursor. 

In [None]:
# l) get all followers cell

followers = {} # Make an empty dict, where we'll put user_id => [list of follower ids]
for user_id, user in users.items():
    ids = []
    
    # The Cursor method of Tweepy is for things that can take multiple pages to complete
    # Combined with the .pages() at the end, they return something we can loop through
    # So then we add a 'for page', which will then loop us through all the pages of results
    for page in tweepy.Cursor(api.followers_ids, id=user_id).pages():
        ids.extend(page)  # Add the ids from this page to the list we are building
        print('Adding %s followers for %s' % (len(page), user.screen_name))

    followers[user_id] = ids # Save all of these followers
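
To make the idea of paging more concrete, here is a toy version of what a cursor does. It is not Tweepy's implementation. `fake_followers_ids` is a hypothetical stand-in for one API call, with a tiny page size so the loop is easy to follow. It mimics Twitter's convention that a cursor of -1 asks for the first page and a next cursor of 0 means there is nothing left. Real cursor values are opaque, whereas this toy simply uses list offsets.

```python
# Toy data standing in for one user's 12 followers, served 5 per page
ALL_FOLLOWER_IDS = list(range(1001, 1013))
PAGE_SIZE = 5


def fake_followers_ids(cursor):
    """Hypothetical stand-in for one API call.

    Returns (ids_on_this_page, next_cursor).
    """
    start = 0 if cursor == -1 else cursor
    end = start + PAGE_SIZE
    page = ALL_FOLLOWER_IDS[start:end]
    next_cursor = end if end < len(ALL_FOLLOWER_IDS) else 0
    return page, next_cursor


def fetch_all(fetch_page):
    """Keep requesting pages until the endpoint reports there are none left."""
    ids, cursor = [], -1
    while cursor != 0:
        page, cursor = fetch_page(cursor)
        ids.extend(page)
    return ids


collected = fetch_all(fake_followers_ids)
print(len(collected))   # 12, gathered over three pages of 5, 5, and 2 ids
```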

Now we are going to write the edgelist (the follower/friend information) so that we can then construct the network. 

**Be sure to set your participant number below, or you will not be able to save files.**

In [None]:
# m) set your working directory cell

# Put your number here. This will avoid you overwriting each other's work.
student_number = 'instructor'

datadir = 'data/%s' % student_number

In [None]:
# n) write out the edge list cell

with open('%s/participant_edgelist.csv' % datadir, 'w') as f:    
    # Loop through our dict of follower lists
    for user_id, list_of_followers in followers.items():
        # For each user, loop through each of their followers
        for follower in list_of_followers:
            # Write an edge from our user (one of you) to each of your followers
            # Direction indicates information can go from you to them
            f.write('%s\t%s\n' % (user_id, follower))

The file `participant_edgelist.csv` now has all of our network information. You can see a random sample of it with the following command, which uses the [Python Pandas](http://pandas.pydata.org/) module. Pandas is a data analysis library that is commonly used for data science work, but it is beyond the scope of this workshop. 

You can run the command multiple times for different samples. You'll notice that these are using the numeric user_ids, not screen_names, for the reasons outlined above. 

In [None]:
# o) look inside the edge list file

pd.read_csv('%s/participant_edgelist.csv' % datadir, sep='\t', header=None, names=['Source', 'Target']).sample(10)

Now that we have an edgelist, we can start using a network analysis library to do some network specific stuff. For that, we'll use the [Python iGraph](http://igraph.org/python/) module. There are several other network analysis toolkits for Python, namely [NetworkX](https://networkx.github.io/) and Stanford's [SNAP for Python](https://snap.stanford.edu/snappy/). All have their pros and cons. 

In my opinion, iGraph strikes a balance between being complete, fast, and intuitive. However, opinions on this may vary wildly. iGraph does have the downside of having pretty terrible documentation. Even the [tutorial](http://igraph.org/python/doc/tutorial/tutorial.html) is pretty incomplete. 

All of the following tasks could be completed with any of the libraries listed above. 

In [None]:
# p) import iGraph cell
# When installing this library, be sure to install python-igraph
import igraph as ig

We'll use the [Read_Ncol](http://igraph.org/python/doc/igraph.GraphBase-class.html#Read_Ncol) method to read in an edgelist, where we consider the two columns to be the "names" of the nodes. Because our edgelist is made up of user_ids, these will still be a bunch of numbers, but it will make it easier for us to keep track of who is who.

In [None]:
# q) read in your edgelist into iGraph cell

G = ig.Graph.Read_Ncol('%s/participant_edgelist.csv' % datadir, names=True, directed=True)
print("The graph of everyone now has %s nodes and %s edges" % (G.vcount(), G.ecount()))

Now we can make a subgraph that keeps only you (the people in our sample) and removes your followers. Note that this will keep all the edges between you, so the subgraph tells us who follows who in this group.
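
The idea can be sketched in plain Python, without any network library: an edge survives only if both of its endpoints are in the group we keep. The edge list and ids below are made up for illustration.

```python
# Hypothetical edge list: (source, target) pairs of user ids as strings
edges = [
    ('1', '2'),    # both users are in our group
    ('1', '900'),  # 900 is a follower from outside the group
    ('2', '3'),
    ('3', '901'),
    ('3', '1'),
]
keep = {'1', '2', '3'}   # the user ids we want to keep

# An edge survives only if both of its endpoints are kept
sub_edges = [(source, target) for source, target in edges
             if source in keep and target in keep]

print(sub_edges)   # [('1', '2'), ('2', '3'), ('3', '1')]
```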

We'll use the [subgraph](http://igraph.org/python/doc/igraph.Graph-class.html#subgraph) method, which expects us to pass in the list of user_ids we want to keep. 

In [None]:
# r) make a subgraph cell

# We need to convert the numerical IDs to strings, because it is what iGraph expects
user_ids = [str(user_id) for user_id in users.keys()]

subG = G.subgraph(user_ids)
print("This subgraph of just you now has %s nodes and %s edges" % (subG.vcount(), subG.ecount()))

In [None]:
# s) add attributes to nodes and look at them cell

# subG.vs has the list of vertices 
# we want to add the screen_name as an attribute
# We can find those in the users dictionary from before
for v_index in range(subG.vcount()):
    user_id = int(subG.vs[v_index]['name'])  # the 'name' field has the user_id, convert to int
    subG.vs[v_index]['screen_name'] = users[user_id].screen_name
    
# Let's check the first few
[v.attributes() for v in subG.vs[0:5]]

We can use iGraph to plot graphs, but it does not work great for larger graphs, and it is very hard to change the representation. This is why we turn to Gephi for actual visualizations, but I wanted you to get a sense of what this network looked like. 

In [None]:
# t) plot the graph cell

labels = ['@' + v['screen_name'] for v in subG.vs]
ig.plot(subG, vertex_label=labels)

The nice thing about using libraries is that we can calculate all kinds of network statistics from them.

In [None]:
# u) calculate graph stats cell

graph_stats = {}
# Some very basic stats
graph_stats['density'] = subG.density()
graph_stats['num_nodes'] = subG.vcount()
graph_stats['num_edges'] = subG.ecount()
graph_stats['diameter'] = subG.diameter()

# Stats about in and out degree. Note that indegree() and outdegree() each return a list
# with the degree of every node. So we then take the average using a Numpy method 
# (Numpy is a math library very commonly used in Python)
graph_stats['in_degree_mean'] = np.mean(subG.indegree())
graph_stats['out_degree_mean'] = np.mean(subG.outdegree())
graph_stats['degree_mean'] = np.mean(subG.degree())

# Divide the graph into its components.
# Specify we mean weakly connected, and then get each component as a subgraph
components = subG.components(mode=ig.WEAK).subgraphs()

# Sort these components in reverse order, using the number of nodes 
wccs = sorted(components, key=lambda g: g.vcount(), reverse=True)

# then run some basic statistics on the largest component (which can be found at wccs[0])
graph_stats['biggest_wcc_num_nodes'] = wccs[0].vcount()
graph_stats['biggest_wcc_num_nodes_p'] = wccs[0].vcount()*100.0/subG.vcount()
graph_stats['biggest_wcc_density'] = wccs[0].density()

# Run the infomap algorithm on the largest component, and get its modularity score
graph_stats['biggest_wcc_infomap_modularity'] = wccs[0].community_infomap().modularity

In [None]:
# v) print out stats cell

for k, v in graph_stats.items():
    # Whole numbers are printed as they are, everything else with two decimals
    if isinstance(v, int):
        print('%s: %s' % (k, v))
    else:
        print('%s: %.2f' % (k, v))
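
As a rough check on a few of the numbers above, the basic statistics can be worked out by hand from the node and edge counts. The sketch below assumes a directed graph with no self-loops, so the number of possible edges is n × (n − 1). The counts used (9 nodes, 18 edges) are invented for the example and are not results from our data.

```python
def directed_density(num_nodes, num_edges):
    """Share of the possible directed edges (no self-loops) that exist."""
    if num_nodes < 2:
        return 0.0   # convention chosen for this sketch
    return num_edges / (num_nodes * (num_nodes - 1))


num_nodes, num_edges = 9, 18   # made-up counts

density = directed_density(num_nodes, num_edges)

# Every edge adds one to someone's out-degree and one to someone's in-degree,
# so both means equal edges / nodes, and the total degree mean is twice that
mean_in_degree = num_edges / num_nodes
mean_total_degree = 2 * num_edges / num_nodes

print(density)             # 0.25
print(mean_in_degree)      # 2.0
print(mean_total_degree)   # 4.0
```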

This was a basic but complete example of how you can take a list of tweets, collect all of the tweet, user, and follower information from Twitter, and then create a network in Python for analysis. 

This code would work on a much larger sample of tweets, but it may take a long time to complete. There are places where it could be made more efficient and more robust (resistant to errors), but it was written like this for simplicity. 

When doing work like this with larger datasets, a few extra considerations need to be taken into account. These were beyond the scope of this course, but are listed here to give a sense of the scope of the work. 

We found that when working with older data, some tweets were no longer available (e.g. accounts may be private or deleted, and individual tweets may be deleted). The code needs to handle the empty returns, and you need to consider how to minimize the gaps in the data. For our diffusion network of scientific papers on Twitter, we tried to identify users by their screen_name when we could not fetch the tweet. This helped fill in the user information. However, for private accounts, we also needed a way to identify their follower information. In these cases, we used the follower networks of everyone else to fill the gaps (by getting the friends information, we get the other side of the relationship). 
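
The gap-filling idea in the last sentence can be sketched as follows: if we know who each reachable account follows (its friends), then anyone whose friends list contains the private account is one of that account's followers. All ids below are invented, and the approach only recovers followers among the accounts whose friends we were able to fetch.

```python
# Hypothetical data: for each account we could query, the ids it follows
friends = {
    101: [102, 555],
    102: [555],
    103: [101, 102],
}
private_id = 555   # an account whose follower list we could not fetch

# Anyone whose friends list contains private_id is a follower of that account
inferred_followers = sorted(
    user_id for user_id, friend_ids in friends.items()
    if private_id in friend_ids
)

print(inferred_followers)   # [101, 102]
```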

There are also many data management issues to deal with. This tutorial showed how to write some files, but when working with multiple networks and thousands of tweets, more needs to be done. For example, we saved all of our tweet, user, and follower information in a local database (using SQLite). This helped us avoid fetching the same tweet or user information multiple times, even if the user appeared in our sample multiple times. Saving all the data locally as it is collected also avoids having to do all of the steps at once, since you can come back later and re-load some data without having to wait for the Twitter API. 
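
A minimal sketch of that kind of local cache, using Python's built-in `sqlite3` module, might look like the following. The table layout and the helper `get_user_cached` are assumptions made for this example and not the schema we actually used. `fake_fetch` stands in for a real API call so the code runs on its own.

```python
import json
import sqlite3

# ':memory:' keeps this example self-contained; a file path would persist the data
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE IF NOT EXISTS users (user_id INTEGER PRIMARY KEY, data TEXT)')


def get_user_cached(conn, user_id, fetch):
    """Return the stored record for user_id, calling fetch only on a cache miss."""
    row = conn.execute('SELECT data FROM users WHERE user_id = ?', (user_id,)).fetchone()
    if row is not None:
        return json.loads(row[0])
    record = fetch(user_id)
    conn.execute('INSERT INTO users (user_id, data) VALUES (?, ?)',
                 (user_id, json.dumps(record)))
    conn.commit()
    return record


calls = []   # records how often the (pretend) API was actually hit


def fake_fetch(user_id):
    calls.append(user_id)
    return {'id': user_id, 'screen_name': 'user%s' % user_id}


first = get_user_cached(conn, 42, fake_fetch)
second = get_user_cached(conn, 42, fake_fetch)   # served from the database
print(len(calls))   # 1
```

With Tweepy, the `fetch` argument could presumably be something like `lambda user_id: api.get_user(user_id)._json`, which stores the same JSON that was printed for the first user object earlier.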