### An attempt at getting empirical information spread data

In [109]:
import pandas as pd 
import networkx as nx
import matplotlib.pyplot as plt

The dataset contains hyperlinks from subreddit to subreddit (with timestamp) for a period of 2 years.

In [2]:
data = pd.read_csv('soc-redditHyperlinks-body.tsv',sep = "\t")

The idea is to treat "having a link to node $e$" as the information. Every subreddit that has a hyperlink to $e$ counts as having the information, that is, whatever is contained or discussed in subreddit $e$. Because we also have the timestamp of each hyperlink, we can follow how the information spreads over time.
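As a rough illustration of what "evolution of the information spread" could mean in practice, the sketch below counts how many subreddits have linked to $e$ by each point in time. It is not part of the notebook: the function name `adoption_curve`, the toy rows, and the timestamp format (sortable `YYYY-MM-DD HH:MM:SS` strings) are all assumptions made for the example.

```python
def adoption_curve(rows, info_node):
    """Return [(timestamp, number of informed subreddits so far), ...].

    `rows` is an iterable of (source, target, timestamp) tuples.
    Timestamps are assumed to be strings that sort chronologically.
    """
    first_link = {}
    for source, target, timestamp in rows:
        # Only hyperlinks pointing at the information node matter here
        if target != info_node or source == info_node:
            continue
        # Keep the earliest hyperlink from each source subreddit
        if source not in first_link or timestamp < first_link[source]:
            first_link[source] = timestamp
    # Each first hyperlink adds one newly informed subreddit
    return [(ts, count)
            for count, ts in enumerate(sorted(first_link.values()), start=1)]


# Toy data (hypothetical), with "e" playing the role of the information node
rows = [
    ("a", "e", "2014-01-03 10:00:00"),
    ("b", "e", "2014-02-11 09:30:00"),
    ("a", "e", "2014-03-01 12:00:00"),  # repeated link, ignored
    ("c", "b", "2014-03-05 08:00:00"),  # does not point at "e"
    ("c", "e", "2014-06-20 17:45:00"),
]
curve = adoption_curve(rows, "e")
```

With the real data, the rows would come from the source, target and timestamp columns of `data`, whatever the timestamp column is called in the file.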

The steps would be:

1. Find a node to use as node $e$ (i.e. the information). Options: (1) the one with the highest degree, or (2) one that was created in 2014 (when the dataset starts).
2. Create an attribute `information` for each node and give it the value $1$ if the node has a hyperlink to $e$. Also create an attribute `date` and give it the timestamp of the first hyperlink to $e$.
3. Delete node $e$ from the network, together with all edges pointing to it.
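The three steps can be sketched on plain tuples as follows. This is a minimal illustration under stated assumptions, not the notebook's implementation: `build_spread_network` is a hypothetical helper, the rows are toy data, and timestamps are assumed to be strings that sort chronologically.

```python
def build_spread_network(rows, info_node):
    """Apply steps 2 and 3 to (source, target, timestamp) rows.

    Returns (information, date, edges), where
      information: node -> 1 if it links to info_node, else 0
      date:        node -> timestamp of its first link to info_node
      edges:       set of (source, target) pairs not involving info_node
    """
    information, date, edges = {}, {}, set()
    for source, target, timestamp in rows:
        # Register every node except the information node itself
        for node in (source, target):
            if node != info_node:
                information.setdefault(node, 0)
        if target == info_node:
            if source != info_node:
                information[source] = 1
                # Step 2: remember the earliest hyperlink to info_node
                if source not in date or timestamp < date[source]:
                    date[source] = timestamp
        elif source != info_node:
            # Step 3: keep only edges that do not touch info_node
            edges.add((source, target))
    return information, date, edges


# Toy data (hypothetical); step 1 is assumed to have selected "e"
rows = [
    ("a", "e", "2014-05-01 00:00:00"),
    ("b", "a", "2014-05-02 00:00:00"),
    ("b", "e", "2014-05-03 00:00:00"),
    ("a", "e", "2014-04-01 00:00:00"),  # earlier link from "a"
    ("e", "c", "2014-05-04 00:00:00"),  # dropped together with "e"
    ("c", "d", "2014-05-05 00:00:00"),
]
information, date, edges = build_spread_network(rows, "e")
```

The two dictionaries have the shape expected by `nx.set_node_attributes`, so they could be attached to a graph built from `edges`.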

Trial with no timestamps. This will just output the final state of the network (unweighted links and whether each node has or doesn't have the information)

In [70]:
# AS A TRIAL WE START BY CHOOSING THE NODE WITH THE HIGHEST NUMBER OF MENTIONS
info_node = data.TARGET_SUBREDDIT.value_counts().idxmax()

In [107]:
# GETTING LIST OF LINKS
# Select the columns by name rather than by position
links_df = data[['SOURCE_SUBREDDIT', 'TARGET_SUBREDDIT']].drop_duplicates()
source = list(links_df['SOURCE_SUBREDDIT'])
target = list(links_df['TARGET_SUBREDDIT'])
links = list(zip(source, target))

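# GETTING INFORMATION ATTRIBUTE PER NODE
# Include targets as well as sources, so that subreddits that are only
# ever linked to also receive the attribute
nodes = list(set(data.SOURCE_SUBREDDIT) | set(data.TARGET_SUBREDDIT))
# A set keeps the membership test below fast
nodes_with_info = set(data[data.TARGET_SUBREDDIT == info_node].SOURCE_SUBREDDIT)

# 1 if the subreddit has at least one hyperlink to info_node, 0 otherwise
nodes_attr = {n: int(n in nodes_with_info) for n in nodes}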

[('leagueoflegends', 'teamredditteams'), ('theredlion', 'soccer'), ('inlandempire', 'bikela'), ('nfl', 'cfb'), ('playmygame', 'gamedev'), ('dogemarket', 'dogecoin'), ('locationbot', 'legaladvice'), ('indiefied', 'aww'), ('posthardcore', 'bestof2013'), ('posthardcore', 'corejerk')]


| | SOURCE_SUBREDDIT | TARGET_SUBREDDIT |
|---|---|---|
| 0 | leagueoflegends | teamredditteams |
| 1 | theredlion | soccer |
| 2 | inlandempire | bikela |
| 3 | nfl | cfb |
| 4 | playmygame | gamedev |


Attempt at plotting/studying the resulting empirical information spread network

In [110]:
G = nx.Graph()
G.add_nodes_from(nodes)
nx.set_node_attributes(G,nodes_attr,'info')
G.add_edges_from(links)
nx.draw_networkx(G)

# Set margins for the axes so that nodes aren't clipped
ax = plt.gca()
ax.margins(0.20)
plt.axis("off")
plt.show()

KeyboardInterrupt: 
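The draw call above had to be interrupted, presumably because the full network is too large to lay out in reasonable time. One possible workaround, which is an assumption and not something tested in this notebook, is to draw only the links among the $k$ best-connected subreddits. The helper name `top_k_subgraph` and the toy links below are hypothetical.

```python
from collections import Counter


def top_k_subgraph(links, k):
    """Keep only the links whose two endpoints are both among the k
    nodes with the highest degree (in-links plus out-links)."""
    degree = Counter()
    for source, target in links:
        degree[source] += 1
        degree[target] += 1
    keep = {node for node, _ in degree.most_common(k)}
    return [(s, t) for s, t in links if s in keep and t in keep]


# Toy links (hypothetical): degrees are a=3, b=3, c=3, d=2, f=1
toy_links = [("a", "b"), ("a", "c"), ("a", "d"),
             ("b", "c"), ("b", "d"), ("c", "f")]
small_links = top_k_subgraph(toy_links, 3)
```

The reduced list could then be passed to `G.add_edges_from` in place of `links` before calling `nx.draw_networkx`. Note that this keeps the hubs and discards the periphery, so it shows the core of the network rather than a representative sample.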

**Appendix**

Problems and things to keep in mind:

1. The graph is directed (subreddit $A$ linking to subreddit $B$ does not imply that $B$ links to $A$). Direction can be ignored to simplify the data.
2. Does a single hyperlink qualify as a link, or should we require at least $n$ hyperlinks before considering a link to exist?
3. The size of the network may make the model computationally very demanding.
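Points 1 and 2 can be explored with a small helper that counts hyperlinks per pair and keeps only pairs reaching a threshold, optionally ignoring direction. This is a sketch under assumptions: `threshold_links`, `min_count` and the toy pairs are hypothetical names and data. The counting has to happen on the raw rows, before any `drop_duplicates()`, since that call collapses repeated hyperlinks.

```python
from collections import Counter


def threshold_links(pairs, min_count, directed=True):
    """Return the set of links backed by at least `min_count` hyperlinks.

    With directed=False, (a, b) and (b, a) are counted as the same link.
    """
    counts = Counter()
    for source, target in pairs:
        if directed:
            key = (source, target)
        else:
            # Sort the endpoints so both directions share one key
            key = tuple(sorted((source, target)))
        counts[key] += 1
    return {pair for pair, count in counts.items() if count >= min_count}


# Toy hyperlinks (hypothetical), one tuple per hyperlink
pairs = [("a", "b"), ("a", "b"), ("b", "a"), ("c", "d")]

directed_links = threshold_links(pairs, min_count=2)
undirected_links = threshold_links(pairs, min_count=3, directed=False)
```

In this toy case the directed version needs two hyperlinks from `a` to `b`, while the undirected version pools all three hyperlinks between `a` and `b`, so the choice of direction handling interacts with the choice of $n$.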