<a href="https://colab.research.google.com/github/SidK8/Higgs-Twitter/blob/main/higgs_twitter_analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# IMPORT LIBRARIES

In [2]:
from datetime import datetime, timedelta
import math
import plotly.express as px
import matplotlib.pyplot as plt
import networkx as nx
import numpy as np
import pandas as pd
from scipy.stats import spearmanr
from scipy.stats import pearsonr
import random

In [3]:
from google.colab import files

# CUSTOM FUNCTIONS

In [4]:
"""
Return min-max-normalized columns for a given dataframe and the list of columns

"""
def min_max_normalization(df,column_list):
  """

    Args:
    -------------
    df - dataframe

    column_list - list of column names to be min-max-normalized in the dataframe

    Returns:
    --------
    a dataframe with normalized columns
    """
    
  df_norm = df.copy()
  for column in column_list:
      df_norm[column] = (df_norm[column] - df_norm[column].min()) / (df_norm[column].max() - df_norm[column].min())
        
  return df_norm
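
As a quick check of the formula used above, (x - min) / (max - min), the sketch below applies it to a plain list. The helper name `min_max` and the choice to map a constant column to 0.0 are assumptions made for illustration; with pandas, a constant column would instead come out as NaN because the denominator is zero.

```python
def min_max(values):
    """Scale a list of numbers to [0, 1]; hypothetical helper, not part of the notebook."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: a constant column carries no ranking information, so map it to 0.0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

scaled = min_max([2, 4, 10])
print(scaled)  # [0.0, 0.25, 1.0]
```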

In [5]:
"""
Return changed state after one step of independent cascade.
Here we assume that EVERY neighbor is affected.

"""
def independent_cascade(G,d,depth_state):
  """
    Args:
    -------------
    G - a networkX graph (for a directed graph, the neighbors are the out-degree neighbors)

    d - depth state from which we need to calculate the cascade effect to d+1

    depth_state - a dictionary with keys as users and value as the state

    Returns:
    --------
    a dictionary with updated state after the cascade
    """
  current_depth = [n for n in depth_state if depth_state[n]==d]
  for n in current_depth:
      for v in G.neighbors(n):
          if v not in depth_state:
            depth_state[v] = d+1
  return depth_state
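
To see what one step does, the sketch below replays the same logic on a tiny hand-made graph. It uses a plain dict of lists as a stand-in for `G.neighbors`, so `cascade_step` and the toy graph are illustrative assumptions rather than the notebook's own objects.

```python
def cascade_step(adjacency, d, depth_state):
    """One deterministic cascade step on a dict-of-lists graph (stand-in for G.neighbors)."""
    # Nodes that were reached exactly at depth d form the current frontier.
    frontier = [n for n, depth in depth_state.items() if depth == d]
    for n in frontier:
        for v in adjacency.get(n, []):
            if v not in depth_state:
                depth_state[v] = d + 1
    return depth_state

# Hypothetical toy graph: a -> b, a -> c, b -> d
adjacency = {"a": ["b", "c"], "b": ["d"]}
state = {"a": 0}
state = cascade_step(adjacency, 0, state)
state = cascade_step(adjacency, 1, state)
print(state)  # {'a': 0, 'b': 1, 'c': 1, 'd': 2}
```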

In [6]:
"""
Return the average depth of a cascade, given a dictionary that maps the seed to 0
and every other node to its depth d from the seed
"""

def average_depth(tree_dict):
  """
    Args:
    -------------
    tree_dict - a dictionary, with the root(seed) value at 0 and other nodes value at depth d from root

    Returns:
    --------
    numeric value of average depth
    """
  # every node contributes its own depth, including the nodes at the maximum depth
  # (range(max(...)) would stop one level short and leave the deepest nodes out)
  cumulative_depth = sum(tree_dict.values())

  return cumulative_depth/len(tree_dict)



In [7]:
"""
Return maximum spread given a seed state or a list of them.
"""
def max_spread(G,seed,depth_max):
  """
    Args:
    -------------
    G - a networkX graph

    seed - seed state/s as a list

    depth_max - maximum number of iterations to check the spread

    Returns:
    --------
    numeric value of maximum spread
    """
  d=0
  current_spread = {n:0 for n in seed}
  while d <= depth_max:
    counter_cs = len(current_spread)
    current_spread = independent_cascade(G,d,current_spread)
    d+= 1
    if len(current_spread) == counter_cs:
      break
      
  return len(current_spread)/len(G.nodes)

In [8]:
"""
Return maximum spread and average depth given a seed state.
"""
def max_spread_avg_depth(G,seed,depth_max):
  """
    Args:
    -------------
    G - a networkX graph

    seed - seed state as a list, this works for a single input state.

    depth_max - maximum number of iterations to check the spread

    Returns:
    --------
    a dictionary with keys max_spread_value whose value is the maximum spread and avg_depth whose value is the average depth
    """
  d=0
  current_spread = {n:0 for n in seed}
  while d <= depth_max:
    counter_cs = len(current_spread)
    current_spread = independent_cascade(G,d,current_spread)
    d+= 1
    if len(current_spread) == counter_cs:
      break
      
  return {'max_spread_value':len(current_spread)/len(G.nodes),
            'avg_depth': average_depth(current_spread)
            }


In [9]:
"""
Return changed state after one step of random independent cascade.
Here we assume that each neighbor is affected with probability rand.

"""
def independent_cascade_random(G,d,depth_state,rand):
  """
    Args:
    -------------
    G - a networkX graph (for a directed graph, the neighbors are the out-degree neighbors)

    d - depth state from which we need to calculate the cascade effect to d+1

    depth_state - a dictionary with keys as users and value as the state

    rand -  a parameter in (0,1], the probability that a neighbor node is affected

    Returns:
    --------
    a dictionary with updated state after the random independent cascade
    """
  
  current_depth = [n for n in depth_state if depth_state[n]==d]
  for n in current_depth:
    for v in G.neighbors(n):
        if v not in depth_state and np.random.uniform() <= rand:
          depth_state[v] = d+1
  return depth_state

In [10]:
"""
Return maximum spread and average depth given a seed state.
"""
def max_spread_avg_depth_random(G,seed,depth_max,rand):
  """
    Args:
    -------------
    G - a networkX graph (for a directed graph, the neighbors are the out-degree neighbors)

    seed - seed state as a list, this works for a single input state.

    depth_max - maximum number of iterations to check the spread

    rand -  a parameter in (0,1], the probability that a neighbor node is affected

    Returns:
    --------
    a dictionary with keys max_spread_value whose value is the maximum spread after random independent cascade
    and avg_depth whose value is the average depth

    """
  d=0
  current_spread = {n:0 for n in seed}
  while d <= depth_max:
    counter_cs = len(current_spread)
    current_spread = independent_cascade_random(G,d,current_spread,rand)
    d+= 1
    if len(current_spread) == counter_cs:
      break
      
  return {'max_spread_value':len(current_spread)/len(G.nodes),
            'avg_depth': average_depth(current_spread)
            }
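
Because the random cascade gives a different result on every call, a single run says little about a seed. One common way to handle this, sketched below, is to average the spread over many runs with a seeded generator. The function names, the dict-of-lists graph, and the default of 200 runs are assumptions for illustration and are not used elsewhere in this notebook.

```python
import random

def random_cascade_spread(adjacency, seed_nodes, prob, rng, max_depth=20):
    """Fraction of nodes reached by one random cascade run on a dict-of-lists graph."""
    depth_state = {n: 0 for n in seed_nodes}
    for d in range(max_depth + 1):
        frontier = [n for n, depth in depth_state.items() if depth == d]
        if not frontier:
            break
        for n in frontier:
            for v in adjacency.get(n, []):
                # Each unaffected neighbor is affected with probability `prob`.
                if v not in depth_state and rng.random() <= prob:
                    depth_state[v] = d + 1
    nodes = set(adjacency) | {v for targets in adjacency.values() for v in targets}
    return len(depth_state) / len(nodes)

def mean_spread(adjacency, seed_nodes, prob, runs=200, rng_seed=0):
    """Average spread over `runs` independent cascades, reproducible via rng_seed."""
    rng = random.Random(rng_seed)
    total = sum(random_cascade_spread(adjacency, seed_nodes, prob, rng) for _ in range(runs))
    return total / runs

# Hypothetical toy graph: a -> b, a -> c, b -> d
adjacency = {"a": ["b", "c"], "b": ["d"]}
print(mean_spread(adjacency, ["a"], 1.0))  # 1.0, every neighbor is always affected
print(mean_spread(adjacency, ["a"], 0.5))  # somewhere between 0.25 and 1.0
```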


# DATA SETS

In [11]:
followers_url = r'https://snap.stanford.edu/data/higgs-social_network.edgelist.gz'
retweet_url = r'https://snap.stanford.edu/data/higgs-retweet_network.edgelist.gz'
reply_url = r'https://snap.stanford.edu/data/higgs-reply_network.edgelist.gz'
mention_url = r'https://snap.stanford.edu/data/higgs-mention_network.edgelist.gz'
activity_url = r'https://snap.stanford.edu/data/higgs-activity_time.txt.gz'

In [12]:
df_activity = pd.read_csv(activity_url, compression='gzip', header=None, sep=' ')
df_followers = pd.read_csv(followers_url, compression='gzip', header=None, sep=' ')
df_retweet = pd.read_csv(retweet_url, compression='gzip', header=None, sep=' ')
df_reply = pd.read_csv(reply_url, compression='gzip', header=None, sep=' ')
df_mention = pd.read_csv(mention_url, compression='gzip', header=None, sep=' ')

In [13]:
df_followers.columns = ['user','follower']
df_activity.columns = ['user', 'target_user', 'timestamp', 'interaction_type']
df_retweet.columns = ['user','retweet_user', 'weight']
df_mention.columns = ['user','mention_user', 'weight']
df_reply.columns = ['user','reply_user', 'weight']

## LOADING DATA INTO NETWORKX GRAPHS

**We load them as directed graphs. In the follower graph, the direction is from the follower to the user. In the interaction graphs, the direction is from A to B if @A interacts with @B. The interaction can be a retweet, a mention or a reply.**


In [14]:
# follower graph
G_followers = nx.from_pandas_edgelist(df_followers,source='follower',
                                   target='user',
                                   create_using=nx.DiGraph())

In [15]:
# retweet graph
G_rt = nx.from_pandas_edgelist(df_retweet,source='user',
                                   target='retweet_user', edge_attr='weight',
                                   create_using=nx.DiGraph())

In [16]:
# mention graph
G_mt = nx.from_pandas_edgelist(df_mention,source='user',
                                   target='mention_user', edge_attr='weight',
                                   create_using=nx.DiGraph())

In [17]:
# reply graph
G_re = nx.from_pandas_edgelist(df_reply,source='user',
                                   target='reply_user', edge_attr='weight',
                                   create_using=nx.DiGraph())

# HOW THE RUMOUR SPREAD

In [46]:
# cleaning df_activity

df_activity['user'] = df_activity['user'].astype(str)
df_activity['target_user'] = df_activity['target_user'].astype(str)
df_activity['date'] = df_activity['timestamp'].apply(datetime.fromtimestamp)

In [47]:
"""
Return the cumulative fraction of nodes affected by the end of each time window.
"""
def spread_interaction_temporal(df_interaction,G,window_size):
  """
    Args:
    -------------
    df_interaction - a dataframe with `user`, `target_user`,`date` which captures the instant
                    of the interaction.

    G - a networkX graph

    window_size - length of each time window in minutes

    Returns:
    --------
    a dataframe with `time` (window start in minutes) and `changed_pc`
    (cumulative fraction of the nodes of G affected up to and including that window).

  """
  # every node that has appeared in an interaction so far
  affected = set()
  rows = []

  maximum_time = df_interaction['date'].max()
  minimum_time = df_interaction['date'].min()

  total_time = math.ceil((maximum_time - minimum_time).total_seconds() / 60)

  for window in range(0,total_time,window_size):
    window_start = minimum_time + timedelta(minutes=window)
    window_end = minimum_time + timedelta(minutes=window + window_size)
    df_window = df_interaction.query('date >= @window_start & date < @window_end')
    affected.update(df_window['user'].unique())
    affected.update(df_window['target_user'].unique())
    rows.append({'time': window, 'changed_pc': len(affected)/len(G)})

  # DataFrame.append was removed in pandas 2.0, so the frame is built once from the rows
  return pd.DataFrame(rows)
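
The function above filters the whole dataframe once per window, which is slow for one-minute windows. A possible alternative, sketched below in plain Python, records the first window in which each node appears and then takes a running total. The function name `cumulative_affected`, the tuple-based input, and the toy events are assumptions for illustration; this is not the code that produced the plots in this notebook.

```python
from datetime import datetime, timedelta

def cumulative_affected(interactions, n_nodes, window_minutes):
    """interactions: list of (user, target_user, datetime).
    Returns a list of (window_start_minute, cumulative_fraction_affected)."""
    start = min(t for _, _, t in interactions)
    first_seen = {}
    for user, target, t in interactions:
        window = int((t - start).total_seconds() // 60) // window_minutes
        for node in (user, target):
            # Keep the earliest window in which the node took part in an interaction.
            if node not in first_seen or window < first_seen[node]:
                first_seen[node] = window
    new_per_window = [0] * (max(first_seen.values()) + 1)
    for w in first_seen.values():
        new_per_window[w] += 1
    result, total = [], 0
    for w, new in enumerate(new_per_window):
        total += new
        result.append((w * window_minutes, total / n_nodes))
    return result

# Hypothetical events, one-minute windows, four nodes in total
t0 = datetime(2012, 7, 1, 0, 0)
events = [("1", "2", t0),
          ("3", "2", t0 + timedelta(minutes=1)),
          ("4", "1", t0 + timedelta(minutes=3))]
print(cumulative_affected(events, 4, 1))
# [(0, 0.5), (1, 0.75), (2, 0.75), (3, 1.0)]
```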

                                                                                                           

In [None]:
df_retweet_spread = spread_interaction_temporal(df_activity.query('interaction_type == "RT"'),G_rt,1)

In [49]:
fig_retweet_spread = px.line(df_retweet_spread, x='time', y='changed_pc')
fig_retweet_spread.show()

**The period from t = 4500 to t = 6000 minutes saw an explosion of activity. We will take a snapshot of this period to analyse it and identify its influential players.**

In [None]:
df_mention_spread = spread_interaction_temporal(df_activity.query('interaction_type == "MT"'),G_mt,1)

In [51]:
fig_mention_spread = px.line(df_mention_spread, x='time', y='changed_pc')
fig_mention_spread.show()

In [None]:
df_reply_spread = spread_interaction_temporal(df_activity.query('interaction_type == "RE"'),G_re,1)

In [53]:
fig_reply_spread = px.line(df_reply_spread, x='time', y='changed_pc')
fig_reply_spread.show()

**The three interaction types show very similar profiles, which allows us to focus on the retweet network in the further analysis.**

# STATIC NETWORK ANALYSIS

## CENTRALITY

**Which centrality measure better captures the spread, local, global or a combination?**

### Follower network

In [18]:
#eigenvector centrality 
eigenvector_centrality_followers = nx.eigenvector_centrality(G_followers)

In [19]:
#indegree centrality
indegree_centrality_followers = nx.in_degree_centrality(G_followers)

In [20]:
# a dataframe with centrality values
df_centrality_followers = pd.DataFrame([eigenvector_centrality_followers,indegree_centrality_followers]).transpose()
df_centrality_followers.columns = ['eigenvector','indegree']
df_centrality_followers['user'] = df_centrality_followers.index

### Interaction networks

In [21]:
eigenvector_centrality_rt = nx.eigenvector_centrality(G_rt)

In [22]:
indegree_centrality_rt = nx.in_degree_centrality(G_rt)

In [23]:
# a dataframe with centrality values

df_centrality_rt = pd.DataFrame([eigenvector_centrality_rt,indegree_centrality_rt]).transpose()
df_centrality_rt.columns = ['eigenvector','indegree']
df_centrality_rt['user'] = df_centrality_rt.index

In [24]:
#due to the skewness of the data, we take log

df_centrality_rt['log_eigenvector'] = df_centrality_rt['eigenvector'].apply(lambda x: math.log(x))

In [25]:
#we need to scale them to the range [0,1]

df_centrality_rt_norm = min_max_normalization(df_centrality_rt,['log_eigenvector','indegree'])

In [26]:
# we take the mean of the scaled log_eigenvector and indegree centralities

df_centrality_rt_norm['avg_centrality'] = df_centrality_rt_norm[['log_eigenvector','indegree']].mean(axis=1)
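
The three steps above (log transform, min-max scaling, row mean) can be traced on a few made-up values. The numbers and the helper name `min_max` below are assumptions for illustration, not values from the data set. Note that `math.log` raises a `ValueError` for an input of 0, so this only works when every eigenvector centrality is strictly positive.

```python
import math

def min_max(values):
    """Scale a list of numbers to [0, 1]; hypothetical helper."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

# Made-up centralities for three users
eigenvector = [0.4, 0.004, 0.00004]
indegree = [0.5, 0.1, 0.0]

# Log-transform the skewed eigenvector values, then scale both measures to [0, 1]
scaled_log = min_max([math.log(v) for v in eigenvector])
scaled_in = min_max(indegree)

# The combined score is the mean of the two scaled measures
avg_centrality = [(a + b) / 2 for a, b in zip(scaled_log, scaled_in)]
print([round(v, 3) for v in avg_centrality])  # [1.0, 0.35, 0.0]
```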

In [None]:
df_centrality_rt_norm.to_csv("centrality_rt_norm.csv")

files.download('centrality_rt_norm.csv') 

## WHICH CENTRALITY CAPTURES THE SPREAD ACCORDING TO OUR DEFINITION

In [36]:
# taking top 500 users with high eigenvector centrality, 500 with high indegree and 500 random ones.

top_eigen = df_centrality_rt.sort_values(['eigenvector'], ascending=False)['user'].tolist()[:500]
top_indegree = df_centrality_rt.sort_values(['indegree'], ascending=False)['user'].tolist()[:500]
random_ones = random.sample(df_centrality_rt[~(df_centrality_rt['user'].isin(top_eigen + top_indegree))]['user'].tolist(),500)
test_users = list(set(top_eigen + top_indegree + random_ones))
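
The random draw above changes on every run, so the test set is not repeatable. A minimal sketch of a reproducible draw follows; the helper name `disjoint_random_sample`, the seed value 42, and the toy user list are assumptions for illustration.

```python
import random

def disjoint_random_sample(all_users, excluded, k, rng_seed=42):
    """Draw k users that are not in `excluded`, reproducibly. Hypothetical helper."""
    excluded = set(excluded)
    pool = [u for u in all_users if u not in excluded]
    # A local generator keeps the global random state untouched.
    rng = random.Random(rng_seed)
    return rng.sample(pool, k)

users = list(range(1, 21))
top = [1, 2, 3]
first = disjoint_random_sample(users, top, 5)
second = disjoint_random_sample(users, top, 5)
print(first == second)  # True
```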

In [37]:
# .copy() so that the columns added below do not trigger SettingWithCopyWarning
df_test = df_centrality_rt_norm[df_centrality_rt_norm['user'].isin(test_users)].copy()

In [38]:
df_test.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1368 entries, 3393 to 131584
Data columns (total 5 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   eigenvector      1368 non-null   float64
 1   indegree         1368 non-null   float64
 2   user             1368 non-null   int64  
 3   log_eigenvector  1368 non-null   float64
 4   avg_centrality   1368 non-null   float64
dtypes: float64(4), int64(1)
memory usage: 64.1 KB


In [None]:
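# note we reverse the retweet graph to see the flow of information.
# G_rt.reverse() copies the whole graph, so we build it once instead of once per user
G_rt_reversed = G_rt.reverse()
df_test['dict'] = df_test['user'].apply(lambda x: max_spread_avg_depth(G_rt_reversed,[x],20))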

In [None]:
df_test['avg_depth'] = df_test['dict'].apply(lambda x: x['avg_depth'])
df_test['max_spread_value'] = df_test['dict'].apply(lambda x: x['max_spread_value'])

In [41]:
df_test.to_csv("test_centrality_spread.csv")

files.download('test_centrality_spread.csv') 

In [43]:
pearsonr(df_test['max_spread_value'], df_test['log_eigenvector'])

(0.9119017849315461, 0.0)

In [54]:
spearmanr(df_test['max_spread_value'], df_test['log_eigenvector'])

SpearmanrResult(correlation=0.8435435322042937, pvalue=0.0)

In [44]:
pearsonr(df_test['max_spread_value'], df_test['indegree'])

(0.15075655019944267, 2.107847860739717e-08)

In [55]:
spearmanr(df_test['max_spread_value'], df_test['indegree'])

SpearmanrResult(correlation=0.7315245316775402, pvalue=1.8123657785156e-229)

In [45]:
pearsonr(df_test['max_spread_value'], df_test['avg_centrality'])

(0.9060249940497889, 0.0)

In [56]:
spearmanr(df_test['max_spread_value'], df_test['avg_centrality'])

SpearmanrResult(correlation=0.8512211322342017, pvalue=0.0)

**Since we use the centrality metrics to rank users by influence, the Spearman (rank) correlation is the relevant one. By that measure, the combination of a local and a spectral measure captures our spread best (0.851), a slight improvement over the spectral measure alone (0.844).**
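
The gap between the two coefficients for indegree (Pearson 0.15, Spearman 0.73) is typical of a relationship that is monotone but far from linear. The sketch below reproduces the effect on made-up numbers with hand-written versions of both coefficients; the function names and data are assumptions for illustration, and `ranks` assumes there are no ties.

```python
def pearson(x, y):
    """Pearson correlation of two equally long lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

def ranks(values):
    """Ranks starting at 1; assumes no ties (illustration only)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        result[i] = rank
    return result

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

# Made-up data: spread grows with centrality, but one user dominates
centrality = [1, 2, 3, 4, 5]
spread = [1, 2, 4, 8, 1000]
print(spearman(centrality, spread))  # 1.0, the ordering agrees perfectly
print(pearson(centrality, spread))   # roughly 0.71, pulled down by the outlier
```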

## SPREAD

**Just looking at the maximum-spread doesn't fully capture the dynamics, we need average-depth too. Given similar values of maximum-spread, the average-depth captures the nature of this spread of information.**

**A smaller average-depth indicates that most of the maximum-spread is reached directly from the seed node, or from nodes only a few steps away from it.**
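
A small contrast makes the point concrete: a star and a chain with the same number of nodes have the same maximum-spread from the seed but very different average depth. The sketch below uses dict-of-lists graphs and helper names that are assumptions for illustration, with average depth taken as the mean of all node depths.

```python
def cascade_depths(adjacency, seed):
    """Depth of every node reachable from `seed` when every neighbor is affected."""
    depth_state = {seed: 0}
    frontier = [seed]
    while frontier:
        next_frontier = []
        for n in frontier:
            for v in adjacency.get(n, []):
                if v not in depth_state:
                    depth_state[v] = depth_state[n] + 1
                    next_frontier.append(v)
        frontier = next_frontier
    return depth_state

def mean_depth(depth_state):
    """Average depth over all reached nodes, the seed included."""
    return sum(depth_state.values()) / len(depth_state)

# Hypothetical graphs with five nodes each
star = {"s": ["a", "b", "c", "d"]}                          # broadcast: seed reaches everyone directly
chain = {"s": ["a"], "a": ["b"], "b": ["c"], "c": ["d"]}    # step-by-step: one hop at a time

star_depths = cascade_depths(star, "s")
chain_depths = cascade_depths(chain, "s")
print(len(star_depths), mean_depth(star_depths))    # 5 0.8
print(len(chain_depths), mean_depth(chain_depths))  # 5 2.0
```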

In [59]:
fig_spread_depth_eigen = px.scatter(df_test, x="log_eigenvector", y="max_spread_value", color="avg_depth", 
                              hover_data = ['user'])
fig_spread_depth_eigen.show()

**Hovering over the plot, we clearly see that in this sample data set the key influencers are in the upper half, especially users 88 and 14454.**

**There are users with high eigenvector centrality but very low spread. One interpretation is that they would be central for spread in an undirected network, but in the directed one they have little ability to propagate information.**

Note to self: Is pagerank better here?

In [60]:
fig_spread_depth_combined = px.scatter(df_test, x="avg_centrality", y="max_spread_value", color="avg_depth", 
                              hover_data = ['user'])
fig_spread_depth_combined.show()

# INTERESTING USERS

## STATIC NETWORK

In [72]:
# top retweet influencers according to average_centrality
top_rt_influencers = df_centrality_rt_norm.sort_values(['avg_centrality'], ascending=False)['user'].tolist()[:10]

Let us now consider the connection network, in particular followers and following.

In [75]:
df_indegree_connection = pd.DataFrame(list(G_followers.in_degree()), columns = ['user','followers'])
df_outdegree_connection = pd.DataFrame(list(G_followers.out_degree()), columns = ['user','following'])

In [76]:
df_connection = pd.merge(df_indegree_connection,df_outdegree_connection,how='left',left_on=['user'], right_on=['user'])

In [78]:
#calculate the followers to following ratio
df_connection['ratio_ff'] = df_connection['followers']/df_connection['following']

Let us look at the followers and following counts for the top 10 influential users in the retweet network.

In [81]:
df_top_10 = df_connection[df_connection['user'].isin(top_rt_influencers)]

In [82]:
df_top_10

|  | user | followers | following | ratio_ff |
|---|---|---|---|---|
| 87 | 88 | 128 | 45221 | 0.002831 |
| 348 | 349 | 86 | 12249 | 0.007021 |
| 518 | 519 | 475 | 12764 | 0.037214 |
| 676 | 677 | 50 | 39820 | 0.001256 |
| 1987 | 1988 | 717 | 27065 | 0.026492 |
| 3548 | 3549 | 86 | 27555 | 0.003121 |
| 3997 | 3998 | 4 | 2873 | 0.001392 |
| 5225 | 5226 | 924 | 12299 | 0.075128 |
| 11990 | 11991 | 176 | 1860 | 0.094624 |
| 14453 | 14454 | 30 | 145 | 0.206897 |


**User 14454 is interesting: they have very few followers and follow very few users, yet have a high maximum_spread in the retweet network.**

In [None]:
# edges point from follower to user, so the predecessors of 14454 are its followers
followers_14454 = list(G_followers.predecessors(14454))

In [None]:
# the successors of 14454 are the users that 14454 follows
following_14454 = list(G_followers.successors(14454))

In [None]:
#df_connection[~(df_connection['following']==0)].sort_values(['ratio_ff'],ascending=False).head(10)

## TEMPORAL SNAPSHOT

In this snapshot we treat the network as a static network.

In [61]:
# We consider only Retweets as the mode of interaction
df_rt_temporal = df_activity.query('interaction_type == "RT"')

We consider the window from 4500 to 6000 mins.

In [63]:
minimum_time = df_rt_temporal['date'].min()
window_start = minimum_time + timedelta(minutes=4500)
window_end = minimum_time + timedelta(minutes=6000)
df_window = df_rt_temporal.query('date >= @window_start & date < @window_end')  

In [64]:
#nodes affected before the considered window.
#df_affected_before = df_rt_temporal.query('date < @window_start')

#nodes_affected_before_window = list(set(list(df_affected_before['user'].unique()) + list(df_affected_before['target_user'].unique())))


In [65]:
# construct a networkx graph in the window of interest

G_window = nx.from_pandas_edgelist(df_window,source='user',
                                   target='target_user',
                                   create_using=nx.DiGraph())

In [67]:
G_window.number_of_nodes()

181256

In [68]:
eigenvector_centrality_rt_window = nx.eigenvector_centrality(G_window)
indegree_centrality_rt_window = nx.in_degree_centrality(G_window)
df_centrality_rt_window = pd.DataFrame([eigenvector_centrality_rt_window,indegree_centrality_rt_window]).transpose()
df_centrality_rt_window.columns = ['eigenvector','indegree']
df_centrality_rt_window['user'] = df_centrality_rt_window.index
df_centrality_rt_window['log_eigenvector'] = df_centrality_rt_window['eigenvector'].apply(lambda x: math.log(x))
#we need to scale them to the range [0,1]
df_centrality_rt_window_norm = min_max_normalization(df_centrality_rt_window,['log_eigenvector','indegree'])
# we take the mean of the scaled log_eigenvector and indegree centralities
df_centrality_rt_window_norm['avg_centrality'] = df_centrality_rt_window_norm[['log_eigenvector','indegree']].mean(axis=1)

In [71]:
df_centrality_rt_window_norm.sort_values(['avg_centrality','log_eigenvector'], ascending=[False,False]).head(20)

|  | eigenvector | indegree | user | log_eigenvector | avg_centrality |
|---|---|---|---|---|---|
| 88 | 0.401794 | 1.0 | 88 | 1.0 | 1.0 |
| 677 | 0.078075 | 0.386316 | 677 | 0.940201 | 0.663259 |
| 14454 | 0.00733 | 0.456834 | 14454 | 0.853849 | 0.655341 |
| 1988 | 0.109867 | 0.277876 | 1988 | 0.95267 | 0.615273 |
| 349 | 0.041348 | 0.201711 | 349 | 0.916999 | 0.559355 |
| 3549 | 0.054392 | 0.127884 | 3549 | 0.927008 | 0.527446 |
| 3571 | 0.021793 | 0.153139 | 3571 | 0.893622 | 0.52338 |
| 3998 | 0.336222 | 0.052687 | 3998 | 0.993497 | 0.523092 |
| 5226 | 0.099216 | 0.07778 | 5226 | 0.948948 | 0.513364 |
| 2342 | 0.082833 | 0.077295 | 2342 | 0.94236 | 0.509828 |


# Summary


The spread is modelled with the independent cascade model, where a given node can affect all its neighbours (out-degree neighbours in the directed graph). The affected nodes then go on to affect the unaffected ones in their neighbourhood, until no new unaffected node can be reached. We can also introduce randomness in which of the neighbours are affected, but for the current purpose we do not use it.

Just looking at the maximum-spread doesn't fully capture the dynamics; we need the average-depth too. Given similar values of maximum-spread, the average-depth captures the nature of this spread of information. A smaller average-depth indicates broadcasting ability, while a larger one indicates that the same spread is achieved over many steps (viral?).


We have various metrics to capture centrality: local measures like indegree centrality and global ones like eigenvector centrality. For the spread as defined here, the combination of the two is a slightly better metric than eigenvector centrality alone.

We also see that the top influencers in the retweet network have few followers compared to the number of users they follow.

One assumption is that the follower network does not depend on time.