# Twitter Graph Data
## Scraping list of followers and their followers  

The goal of this notebook is to take a first step in learning how to scrape data from Twitter and visualize it as a ***graph***. The main steps are the following:
1. use tweepy to scrape Twitter for all of my followers and their followers;
2. create a pandas dataframe from all connections;
3. use NetworkX to create a network from these data and run some basic graph analytics.


In [1]:
import tweepy
import pandas as pd
import networkx as nx
import matplotlib.pyplot as plt
plt.style.use('ggplot')



In [5]:
consumer_key = 'xxx'
consumer_secret = 'xxx'
access_token = 'xxx'
access_token_secret = 'xxx'

In [6]:
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)

api = tweepy.API(auth=auth, wait_on_rate_limit=True)

You can get the **user ID** if you know their screen name.

In [None]:
# getting user_id given the screen name 
me = api.get_user(screen_name = 'DylanTartarini')
me.id

The following code creates a **list of followers** for every user in the list.

In [None]:
# create a list of followers for every user in the list
user_list = ["1013061841524871168"]
follower_list = []
for user in user_list:
    followers = []
    try:
        for page in tweepy.Cursor(api.get_follower_ids, user_id=user).pages():
            followers.extend(page)
            print(len(followers))
    except tweepy.TweepyException:
        print("error")
        continue
    follower_list.append(followers)

Put the list of followers into a DataFrame.

In [None]:
df = pd.DataFrame(columns=['source','target']) # empty df
df['target'] = follower_list[0] # set the list of followers as the target column
df['source'] = 1013061841524871168 # set my user ID as the source 
df.head()
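
The same source/target layout generalizes to several users at once. The following is a minimal sketch that needs no API access: the helper name `followers_to_edges` and the IDs are made up for illustration, and it only assumes followers are available as a dictionary mapping each user ID to a list of follower IDs.

```python
import pandas as pd

def followers_to_edges(followers_by_user):
    """Turn {user_id: [follower_id, ...]} into a source/target edge list."""
    rows = [
        {"source": user_id, "target": follower_id}
        for user_id, follower_ids in followers_by_user.items()
        for follower_id in follower_ids
    ]
    return pd.DataFrame(rows, columns=["source", "target"])

# Made-up IDs, for illustration only
toy_followers = {1: [10, 11, 12], 10: [11, 20]}
toy_edges = followers_to_edges(toy_followers)
print(toy_edges)
```

Each row is one "target follows source" link, which is the shape `nx.from_pandas_edgelist` expects.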

**Convert the DataFrame** into a graph (a network) using NetworkX

In [None]:
G = nx.from_pandas_edgelist(df, 'source', 'target') # turn df into graph
pos = nx.spring_layout(G) # specify layout for visual

Plot the network using matplotlib

In [None]:
f, ax = plt.subplots(figsize=(10, 10))
nodes = nx.draw_networkx_nodes(G, pos,alpha=0.8)
nodes.set_edgecolor('k')
nx.draw_networkx_labels(G, pos, font_size=8)
nx.draw_networkx_edges(G, pos, width=1.0, alpha=0.2)

What if we wanted to obtain the followers of these users (which are my own followers)? In order to do so, we should be able to loop through the list of all the users, get their followers and add those links to the original DataFrame. 

**WARNING**: this code might take a very long time to run, depending on the number of followers of the initial user, and on the number of followers each user has.
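
To get a rough idea of how long the wait could be, the sketch below counts the requests needed and the rate-limit pauses they imply. It assumes that the follower-ID endpoint returns up to 5000 IDs per page and allows 15 requests per 15-minute window; both numbers, and the helper name `estimate_wait`, are assumptions for illustration and should be checked against the limits of your own API access.

```python
import math

def estimate_wait(follower_counts, ids_per_page=5000, requests_per_window=15,
                  window_minutes=15, max_pages_per_user=None):
    """Return (total_requests, minimum_minutes_spent_waiting)."""
    total_requests = 0
    for count in follower_counts:
        # at least one request per user, even for users with no followers
        pages = max(1, math.ceil(count / ids_per_page))
        if max_pages_per_user is not None:
            pages = min(pages, max_pages_per_user)
        total_requests += pages
    # every window that is used up before the last request forces one pause
    waits = (total_requests - 1) // requests_per_window if total_requests else 0
    return total_requests, waits * window_minutes

# 17 users, capped at one page each (hypothetical follower counts)
requests_needed, minutes_waiting = estimate_wait([300] * 17, max_pages_per_user=1)
print(requests_needed, minutes_waiting)
```

Under these assumptions, 17 users at one page each need 17 requests and at least one 15-minute pause. The estimate ignores network latency and the separate calls to `get_user`, so it is a lower bound.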

In [None]:
# obtain all of the users following a specific user's followers
user_list = list(df['target']) # use the list of followers we extracted in the code above i.e. my 17 followers

for userID in user_list:
    print(userID)
    followers = []

    try:
        # fetch the user to read their followers_count
        user = api.get_user(user_id=userID)
        followers_count = user.followers_count

        for page in tweepy.Cursor(api.get_follower_ids, user_id=userID).pages():
            followers.extend(page)
            print(len(followers))
            if followers_count >= 5000: # only take the first page (up to 5000 followers)
                break
    except tweepy.TweepyException:
        print("error")
        continue

    temp = pd.DataFrame(columns=['source', 'target'])
    temp['target'] = followers
    temp['source'] = userID
    df = pd.concat([df, temp], ignore_index=True) # DataFrame.append was removed in pandas 2.0
    df.to_csv("networkOfFollowers.csv") # save progress to csv after every user

### Network Analytics  
Once data has been converted into a graph, we can run some basic network analytics

Now we can read the csv and turn the DataFrame into a graph using NetworkX.

In [2]:
df = pd.read_csv('networkOfFollowers.csv') # read into a df
G = nx.from_pandas_edgelist(df, 'source', 'target')

In [3]:
G.number_of_nodes() # find the total number of nodes in this graph

18463

Find out which are the most influential nodes in the network using measures of centrality. The simplest of those is ***Degree Centrality***, which is a function of the number of connections each node has.
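
As a small illustration on a toy graph (the node names are made up), the raw degree is the number of edges at each node, while NetworkX's `degree_centrality` divides that count by the number of other nodes, n - 1:

```python
import networkx as nx

# Toy graph: "a" is connected to everyone, "d" only to "a"
toy = nx.Graph()
toy.add_edges_from([("a", "b"), ("a", "c"), ("a", "d"), ("b", "c")])

raw_degree = dict(toy.degree())          # number of edges per node
centrality = nx.degree_centrality(toy)   # degree divided by (n - 1)

for node in sorted(raw_degree, key=raw_degree.get, reverse=True):
    print(node, raw_degree[node], round(centrality[node], 3))
```

Sorting by raw degree, as done in the next cell, gives the same ranking as sorting by the normalized value.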

In [4]:
G_sorted = pd.DataFrame(sorted(G.degree, key=lambda x: x[1], reverse=True))
G_sorted.columns = ['nconst','degree']
G_sorted.head()

Unnamed: 0,nconst,degree
0,748492600408145920,5001
1,129443586,5001
2,455250598,4951
3,971816547898490881,1281
4,1199331702927175680,738


The highest-degree node in the table above is 748492600408145920 (tied with 129443586), with a degree of 5001. This means that, in addition to the 5000 followers collected for it (the limit set earlier), this user has one more connection in the network.
To get the username of this account, we can query the API with its user ID:

In [7]:
u = api.get_user(user_id='748492600408145920')
u.screen_name

'advaneo'

The network we obtained can be very large in terms of nodes: analytics will take a long time to run, and visualizations will be hard to read.  
We will reduce the network to a more manageable number of nodes using the ***k_core*** function from NetworkX. It returns the largest subgraph in which every node has a degree of at least a given number, k.  

In [8]:
G_tmp = nx.k_core(G, 2) # keep the 2-core: every remaining node has degree of at least 2
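
A toy example (made-up node names) helps to see what `k_core` does. The pruning is repeated until no node falls below k, so it can remove more nodes than a one-off filter on the original degrees:

```python
import networkx as nx

# A triangle a-b-c with a tail a-d-e hanging off it
toy = nx.Graph([("a", "b"), ("b", "c"), ("c", "a"), ("a", "d"), ("d", "e")])

# One-off filter on the original degrees: "d" survives because it has degree 2
degree_filter = {node for node, degree in toy.degree() if degree >= 2}

# 2-core: "e" is dropped first, which leaves "d" with degree 1, so "d" goes too
core_nodes = set(nx.k_core(toy, k=2).nodes())

print(sorted(degree_filter))
print(sorted(core_nodes))
```

Only the triangle remains in the 2-core, because every node in it keeps at least two neighbors inside the subgraph.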

With a smaller graph we can easily perform some network analytics. A good starting point is to split the graph into groups using a ***community detection algorithm***.

In [9]:
from community import community_louvain # provided by the python-louvain package
partition = community_louvain.best_partition(G_tmp) # dict mapping node -> community id
# Turn partition into dataframe
partition1 = pd.DataFrame([partition]).T
partition1 = partition1.reset_index()
partition1.columns = ['names','group']
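
If the python-louvain package is not available, a similar partition can be sketched with NetworkX itself. This is an alternative to the cell above, not the method used in this notebook; it assumes NetworkX 2.8 or later, where `louvain_communities` is available, and it uses a toy graph of two cliques joined by a single edge.

```python
import networkx as nx
import pandas as pd
from networkx.algorithms.community import louvain_communities

# Toy graph: two 5-node cliques connected by the edge (4, 5)
toy = nx.Graph()
toy.add_edges_from(nx.complete_graph(range(0, 5)).edges())
toy.add_edges_from(nx.complete_graph(range(5, 10)).edges())
toy.add_edge(4, 5)

# Returns a list of node sets; the seed makes the run reproducible
communities = louvain_communities(toy, seed=42)

# Convert to the node -> group mapping used elsewhere in this notebook
toy_partition = {node: group
                 for group, members in enumerate(communities)
                 for node in members}
toy_groups = pd.DataFrame(sorted(toy_partition.items()), columns=["names", "group"])
print(toy_groups)
```

The resulting DataFrame has the same `names` and `group` columns as `partition1`, so it can be merged with the degree table in the same way.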

Run the Degree Centrality again now that the network is smaller

In [10]:
G_sorted = pd.DataFrame(sorted(G_tmp.degree, key=lambda x: x[1], reverse=True))
G_sorted.columns = ['names','degree']
G_sorted.head()

Unnamed: 0,names,degree
0,971816547898490881,197
1,455250598,163
2,129443586,148
3,968819223727411200,145
4,1199331702927175680,130


In [11]:
dc = G_sorted

In [12]:
combined = pd.merge(dc,partition1, how='left', left_on="names",right_on="names")
combined

Unnamed: 0,names,degree,group
0,971816547898490881,197,1
1,455250598,163,3
2,129443586,148,0
3,968819223727411200,145,3
4,1199331702927175680,130,1
...,...,...,...
407,114412418,2,1
408,2181113836,2,3
409,2875590639,2,1
410,45295601,2,0


We can now visualize the graph using the following code

In [None]:
import os

default_edge_color = 'gray'
default_node_color = '#407cc9'
enhanced_node_color = '#f5b042'
enhanced_edge_color = '#cc2f04'

In [None]:
def draw_graph(G, node_names={}, node_size=50, layout = None):
    pos_nodes = nx.spring_layout(G) if layout is None else layout(G)
    nx.draw(G, pos_nodes, with_labels=False, node_size=node_size, edge_color='gray')
  
    pos_attrs = {}
    for node, coords in pos_nodes.items():
        pos_attrs[node] = (coords[0], coords[1] + 0.08)
  
    nx.draw_networkx_labels(G, pos_attrs, labels=node_names, font_family='serif')
  
    plt.axis('off')
    axis = plt.gca()
    axis.set_xlim([1.2*x for x in axis.get_xlim()])
    axis.set_ylim([1.2*y for y in axis.get_ylim()])

In [None]:
draw_graph(G=G_tmp)

We conclude this brief introduction by exporting the node list and the edge list as csv files.

In [None]:
combined = combined.rename(columns={"names": "Id"}) 
edges = nx.to_pandas_edgelist(G_tmp)
nodes = combined['Id']
edges.to_csv("edges.csv")
combined.to_csv("nodes.csv")
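
One detail worth knowing when exporting: by default `to_csv` also writes the DataFrame index, which comes back as an `Unnamed: 0` column when the file is read again. The sketch below shows the difference on a toy edge list (made-up IDs), using in-memory buffers instead of files so that nothing is written to disk.

```python
import io
import pandas as pd

toy_edges = pd.DataFrame({"source": [1, 1], "target": [10, 11]})

# Default export: the index is written as an extra, unnamed first column
with_index = io.StringIO()
toy_edges.to_csv(with_index)
with_index.seek(0)
reloaded_with_index = pd.read_csv(with_index)

# index=False keeps only the real columns
without_index = io.StringIO()
toy_edges.to_csv(without_index, index=False)
without_index.seek(0)
reloaded_clean = pd.read_csv(without_index)

print(list(reloaded_with_index.columns))
print(list(reloaded_clean.columns))
```

Passing `index=False` is optional here, but it keeps the exported files limited to the node and edge attributes.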

### Backup

In [None]:
pos = nx.spring_layout(G_tmp)

f, ax = plt.subplots(figsize=(10, 10))
#cc = nx.betweenness_centrality(G2)
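# color each node by its community, in the same order NetworkX draws the nodes
node_groups = [partition[n] for n in G_tmp.nodes()]
nodes = nx.draw_networkx_nodes(G_tmp, pos,
                               cmap=plt.cm.Set1,
                               node_color=node_groups,
                               alpha=0.8)
nodes.set_edgecolor('k')
# draw labels and edges for G_tmp, the graph that pos was computed for
nx.draw_networkx_labels(G_tmp, pos, font_size=8)
nx.draw_networkx_edges(G_tmp, pos, width=1.0, alpha=0.2)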