In [1]:
import re
import pandas as pd 
import numpy as np 
import matplotlib.pyplot as plt 
import seaborn as sns
import string

import tweepy                  #Getting twitter data like tweets, followers, friends

import nltk
from textblob import TextBlob  #Sentiment analysis

import networkx as nx          #Drawing & analyzing network

import warnings 
warnings.filterwarnings("ignore", category=DeprecationWarning)

# Summary

This notebook contains the project I carried out for the course Analysis of Large Scale Social Networks (https://onderwijsaanbod.kuleuven.be/syllabi/e/H0T26AE.htm#activetab=begintermen_idp5387920).

## Main goal of this notebook
The goal is to analyze the network of Twitter users who tweeted regularly about #Brexit during April 2019.

## Activities carried out in this notebook
- Collect all original English-language tweets containing #Brexit that were posted in April 2019. Retweets themselves are not collected; only each tweet's retweet count is recorded. This is done using the <b><i>tweepy</i></b> package. <i>(Note: since I am using the free developer version of the Twitter API, only tweets from the past week can be collected. Covering the whole of April therefore required collecting data during every week of the month.)</i>
- Perform sentiment analysis on all these tweets using the <b><i>textblob</i></b> package. Only the sentiment polarity is considered here. Polarity ranges from -1 to +1: values between -1 and 0 are read as a negative feeling about #Brexit, and values between 0 and +1 as a positive feeling.
- Divide the data into four parts, each representing a week in April. The idea is to analyze the network of Twitter users who tweeted about #Brexit regularly (every week). To simplify the network analysis further, only users whose tweets were retweeted at least 10 times in total during each of the four weeks are kept. The aim is to focus on users whose opinion about #Brexit is deemed valuable on Twitter.
- With these criteria, 240 Twitter users, also called <i>nodes</i> or <i>vertices</i>, were found to have tweeted regularly in English about #Brexit during April 2019. The next step is to find the possible <i>edges</i> between the nodes, again using the tweepy package. There are three possible types of edge: i) the source node follows the destination node, ii) the destination node follows the source node, and iii) a mutual following relationship (the source and destination nodes follow each other).
- Construct the social network using the <b><i>networkx</i></b> package. For simplicity, the network is constructed using only the mutual-following edges.
- Compute properties of the social network, such as betweenness and centrality.
- Apply different community detection algorithms to the Twitter network and check the detected communities against the sentiments within each community. The idea is to evaluate the community detection algorithms by how much the sentiments of Twitter users about #Brexit vary.
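
To make the polarity convention in the list above concrete, here is a minimal sketch that turns a polarity score into a coarse label. The helper name `polarity_label` and the choice to treat a score of exactly 0 as neutral are assumptions made for illustration; the notebook itself only stores the numeric score.

```python
def polarity_label(polarity: float) -> str:
    """Map a polarity score in [-1, +1] to a coarse label (hypothetical helper)."""
    if not -1.0 <= polarity <= 1.0:
        raise ValueError("polarity must lie between -1 and +1")
    if polarity > 0:
        return "positive"
    if polarity < 0:
        return "negative"
    return "neutral"  # assumption: a score of exactly 0 is treated as neutral


for score in (-0.25, 0.0, 0.3):
    print(score, polarity_label(score))
```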


#### The usernames of Twitter users in the network are not displayed. They are mapped to ids such as user_0, user_1, ...

In [2]:
t_credentials = dict()
#These are the credentials obtained by setting up your twitter developer account
t_credentials['CONSUMER_KEY'] = '-----------------'  
t_credentials['CONSUMER_SECRET'] = '-----------------'
t_credentials['ACCESS_KEY'] = '-----------------'
t_credentials['ACCESS_SECRET'] = '-----------------'

#load Twitter API credentials
consumer_key = t_credentials['CONSUMER_KEY']
consumer_secret = t_credentials['CONSUMER_SECRET']
access_key = t_credentials['ACCESS_KEY']
access_secret = t_credentials['ACCESS_SECRET']

auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_key, access_secret)
api = tweepy.API(auth,wait_on_rate_limit=True)
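
One possible way to avoid typing the keys into the notebook is to read them from environment variables. The sketch below is only an illustration: the `TWITTER_` prefix and the `load_credentials` helper are assumptions, not something the original notebook uses.

```python
import os

# Hypothetical variable names; the keys mirror the t_credentials dictionary above.
REQUIRED_KEYS = ("CONSUMER_KEY", "CONSUMER_SECRET", "ACCESS_KEY", "ACCESS_SECRET")


def load_credentials(env=os.environ, prefix="TWITTER_"):
    """Return the four credentials as a dict, failing early if any is missing."""
    missing = [prefix + key for key in REQUIRED_KEYS if not env.get(prefix + key)]
    if missing:
        raise KeyError("missing environment variables: " + ", ".join(missing))
    return {key: env[prefix + key] for key in REQUIRED_KEYS}
```

The returned dictionary has the same keys as `t_credentials`, so it could be used in its place when building the `tweepy.OAuthHandler`.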

In [3]:
#Tweet data is collected using this cell. 
#Since the data is already collected and stored in brexit_tweets_april.csv, no need to run this cell.
#Just read the data from brexit_tweets_april.csv file which is done below.
'''

hash_tags = '#brexit -filter:retweets'
tweet_cols = ['created_at','id','screen_name','location','followers_count','friends_count','retweeted','retweet_count',
              'text','tags','mentions']
tweet_df = pd.DataFrame(columns = tweet_cols)
for tweet in tweepy.Cursor(api.search,q=hash_tags, result_type='recent', # Example Values: mixed, recent, popular
                           lang="en",tweet_mode='extended',until='2019-04-30',wait_on_rate_limit=True).items(400000):
    tags=[]
    for i in range(len(tweet.entities['hashtags'])):
        tags.append('#'+tweet.entities['hashtags'][i]['text'].lower())
    mentions = []
    for i in range(len(tweet.entities['user_mentions'])):
        mentions.append('@'+tweet.entities['user_mentions'][i]['screen_name'])
    df = pd.DataFrame([[tweet.created_at,tweet.id,tweet.user.screen_name,tweet.user.location,tweet.user.followers_count,
                        tweet.user.friends_count,tweet.retweeted,tweet.retweet_count,tweet.full_text,tags,mentions]],columns = tweet_cols)
    tweet_df = tweet_df.append(df)
    tweet_df_rows = tweet_df.shape[0]
    if(tweet_df_rows%100==0):
        print(str(tweet_df_rows)+'---'+str(df['created_at']))
tweet_df.reset_index(drop=True,inplace=True)
tweet_df = tweet_df.sort_values(['created_at'],ascending=[False])
print(tweet_df.shape)
print(tweet_df['created_at'].min())
print(tweet_df['created_at'].max())
print(tweet_df['retweet_count'].max())
tweet_df.head()
'''
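
The two inner loops in the collection cell build the hashtag and mention lists from each tweet's `entities` field. As a rough sketch, the same extraction can be written with list comprehensions. The helper name `extract_tags_and_mentions` and the sample dictionary are made up for illustration; the sample only imitates the fields the cell reads.

```python
def extract_tags_and_mentions(entities):
    """Return (hashtags, mentions) from a tweet's `entities` dictionary."""
    tags = ['#' + tag['text'].lower() for tag in entities.get('hashtags', [])]
    mentions = ['@' + user['screen_name'] for user in entities.get('user_mentions', [])]
    return tags, mentions


# Hand-written sample with the same shape as the fields used above (not real API output)
sample_entities = {
    'hashtags': [{'text': 'Brexit'}, {'text': 'PeoplesVote'}],
    'user_mentions': [{'screen_name': 'LordCFalconer'}],
}
tags, mentions = extract_tags_and_mentions(sample_entities)
print(tags)      # ['#brexit', '#peoplesvote']
print(mentions)  # ['@LordCFalconer']
```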


In [4]:
tweet_df = pd.read_csv('Outputs/brexit_tweets_april.csv',lineterminator='\n')
tweet_df['mentions'] = tweet_df['mentions\r'].str.strip()
tweet_df['dummy_count'] = 1
tweet_df = tweet_df.drop(['Unnamed: 0','id','location','retweeted','mentions\r'],axis='columns')

#Map the twitter user names into ids
tweeters = sorted(list(set(list(tweet_df['screen_name']))))
tweeters_id = [('user_'+str(i)) for i in range(len(tweeters))]
mapping_dict = {tweeters[i]:tweeters_id[i] for i in range(len(tweeters))}
tweet_df['screen_name'] = tweet_df['screen_name'].map(mapping_dict)

print(tweet_df.shape)
print(tweet_df['created_at'].min())
print(tweet_df['created_at'].max())
tweet_df.head()

(324352, 9)
2019-04-01 00:00:00
2019-04-29 23:59:59


Unnamed: 0,created_at,screen_name,followers_count,friends_count,retweet_count,text,tags,mentions,dummy_count
0,2019-04-29 23:59:59,user_61373,749,1327,0,WOW - Another Brexit extension - time now unti...,"['#brexit', '#brexitclock', '#clock', '#eu', '...",[],1
1,2019-04-29 23:58:19,user_56917,281,687,0,".@santanderuk I want to report CEO fraud, howe...",['#brexit'],['@santanderuk'],1
2,2019-04-29 23:56:55,user_72791,381,1895,0,Voting for #brexit. It is very obvious. https:...,['#brexit'],[],1
3,2019-04-29 23:56:53,user_9686,21,118,0,@LordCFalconer 1. First ref result couldn't ev...,"['#brexit', '#peoplesvote']",['@LordCFalconer'],1
4,2019-04-29 23:56:53,user_42521,19,130,0,"#MAGA #BREXIT #GOP \r\n\r\nMake a difference, ...","['#maga', '#brexit', '#gop']",[],1
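
The anonymisation step in the cell above sorts the distinct screen names and numbers them in that order. A compact sketch of the same idea with `enumerate` follows; `build_user_mapping` and the sample names are hypothetical.

```python
def build_user_mapping(screen_names):
    """Map each distinct screen name to an id such as user_0, in sorted order."""
    unique_names = sorted(set(screen_names))
    return {name: f"user_{i}" for i, name in enumerate(unique_names)}


names = ["carol", "alice", "bob", "alice"]  # made-up screen names
mapping = build_user_mapping(names)
print(mapping)                     # {'alice': 'user_0', 'bob': 'user_1', 'carol': 'user_2'}
print([mapping[n] for n in names])  # ['user_2', 'user_0', 'user_1', 'user_0']
```

Because the ids follow the sorted order of the names present in the data, the same mapping can be reproduced later only from the same set of screen names.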


In [5]:
len(set(tweet_df['screen_name']))

99229

99,229 Twitter users posted original English-language tweets about #Brexit during April 2019.

### Functions that calculate the sentiment of tweets

In [6]:
retweet_threshold = 10  

def clean_tweet(tweet):
    # Remove @mentions, non-alphanumeric characters and URLs, then collapse whitespace
    return ' '.join(re.sub(r"(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)", " ", tweet).split())

def fill_sentiments(tweet_df):
    tweet_df['tweet_sentiment_polarity'] = 0.0
    tweet_df['tweet_sentiment_subjectivity'] = 0.0
    for index,row in tweet_df.iterrows():
        # Note: the raw tweet text is scored as-is; clean_tweet() above is not applied here
        cleaned_tweet = row['text']
        s_analysis = TextBlob(cleaned_tweet)
        tweet_df.at[index,'tweet_sentiment_polarity'] = s_analysis.sentiment.polarity
        tweet_df.at[index,'tweet_sentiment_subjectivity'] = s_analysis.sentiment.subjectivity
    return tweet_df
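
To see what the regular expression in `clean_tweet` does, the sketch below applies it to a made-up tweet. The function is repeated here, with the pattern written as a raw string, only so that the example runs on its own.

```python
import re


def clean_tweet(tweet):
    # Same pattern as in the notebook: drop @mentions, non-alphanumeric
    # characters and URLs, then collapse runs of whitespace
    return ' '.join(re.sub(r"(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)", " ", tweet).split())


sample = "@LordCFalconer 1. First ref result https://t.co/abc #Brexit"  # made-up tweet
print(clean_tweet(sample))  # 1 First ref result Brexit
```

The hashtag text survives because only the `#` character is stripped, while the mention and the URL are removed entirely.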

### Week 1 tweets

In [7]:
tweet_df_week1 = tweet_df[tweet_df['created_at']<='2019-04-06 23:59:59'].copy()
tweet_df_week1.reset_index(drop=True,inplace=True)
tweet_df_week1 = fill_sentiments(tweet_df_week1)
print(tweet_df_week1.shape)
print(tweet_df_week1['created_at'].min())
print(tweet_df_week1['created_at'].max())
tweet_df_week1.head()

(135507, 11)
2019-04-01 00:00:00
2019-04-06 23:59:54


Unnamed: 0,created_at,screen_name,followers_count,friends_count,retweet_count,text,tags,mentions,dummy_count,tweet_sentiment_polarity,tweet_sentiment_subjectivity
0,2019-04-06 23:59:54,user_61373,755,1323,0,"If BREXIT is April 12th we have: 5 days, 21 h...","['#brexit', '#brexitclock', '#clock', '#eu', '...",[],1,0.0,0.0
1,2019-04-06 23:59:52,user_78016,3264,4850,0,@Two_Penneth Demagogues plying their dismal tr...,['#brexit'],['@Two_Penneth'],1,-0.066667,0.5
2,2019-04-06 23:59:49,user_18730,8472,9450,0,What a police state looks like - \r\r\r\n🇫🇷#Fr...,"['#france', '#macronmustgo', '#nobolshevism', ...",[],1,0.0,0.0
3,2019-04-06 23:59:42,user_30713,142,200,0,I don't know how many words Trump made but the...,['#brexit'],[],1,-0.055556,0.522222
4,2019-04-06 23:59:25,user_89867,2,47,0,@LauraEmilyBush My statement as the self-procl...,"['#germany', '#brexit']",['@LauraEmilyBush'],1,-0.05,0.4


In [8]:
retweets_df_week1 = tweet_df_week1[['screen_name','retweet_count']].groupby('screen_name').sum()
followers_df_week1 = tweet_df_week1[['screen_name','followers_count']].groupby('screen_name').max() #Max of the week
friends_df_week1 = tweet_df_week1[['screen_name','friends_count']].groupby('screen_name').max()  #Max of the week
sentiment_polarity_df_week1 = tweet_df_week1[['screen_name','tweet_sentiment_polarity']].groupby('screen_name').sum()
sentiment_subjectivity_df_week1 = tweet_df_week1[['screen_name','tweet_sentiment_subjectivity']].groupby('screen_name').sum()
tweet_count_df_week1 = tweet_df_week1[['screen_name','dummy_count']].groupby('screen_name').sum()

week1_stats_df = pd.concat([tweet_count_df_week1,retweets_df_week1,followers_df_week1,friends_df_week1,
                            sentiment_polarity_df_week1,sentiment_subjectivity_df_week1],axis='columns')
week1_stats_df['screen_name'] = week1_stats_df.index
week1_stats_df = week1_stats_df[['screen_name','dummy_count','retweet_count','followers_count','friends_count',
                                 'tweet_sentiment_polarity','tweet_sentiment_subjectivity']]
week1_stats_df = week1_stats_df.rename(columns={"retweet_count":"total_retweet_count","followers_count":"max_followers_count",
                                                "friends_count":"max_friends_count","dummy_count":"total_tweet_count",
                                                "tweet_sentiment_polarity":"agg_sentiment_polarity",
                                                "tweet_sentiment_subjectivity":"agg_sentiment_subjectivity"})
week1_stats_df = week1_stats_df[week1_stats_df['total_retweet_count']>=retweet_threshold]
week1_stats_df['agg_sentiment_polarity'] = week1_stats_df['agg_sentiment_polarity']/week1_stats_df['total_tweet_count']
week1_stats_df['agg_sentiment_subjectivity'] = week1_stats_df['agg_sentiment_subjectivity']/week1_stats_df['total_tweet_count']
week1_stats_df.reset_index(drop=True,inplace=True)
print(week1_stats_df['agg_sentiment_polarity'].max())
print(week1_stats_df['agg_sentiment_polarity'].min())
print(week1_stats_df.shape)
week1_stats_df.tail()

1.0
-0.71
(2700, 7)


Unnamed: 0,screen_name,total_tweet_count,total_retweet_count,max_followers_count,max_friends_count,agg_sentiment_polarity,agg_sentiment_subjectivity
2695,user_98820,1,12,522,254,0.0,0.0
2696,user_98925,24,249,9042,8052,-0.061127,0.44652
2697,user_99092,16,32,13232,204,-0.082292,0.380729
2698,user_9917,4,19,33314,752,0.108036,0.391369
2699,user_9961,17,122,3636723,485,0.033783,0.244404
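
The statistics cell above sums retweets and polarity per user, keeps users at or above `retweet_threshold`, and divides the summed polarity by the tweet count to get a mean. The following plain-Python sketch spells out that logic on a few invented rows; it is an illustration of the computation, not a replacement for the pandas code, and `weekly_user_stats` is a hypothetical name.

```python
from collections import defaultdict


def weekly_user_stats(rows, retweet_threshold=10):
    """rows: iterable of (screen_name, retweet_count, polarity), one entry per tweet."""
    totals = defaultdict(lambda: {"tweets": 0, "retweets": 0, "polarity_sum": 0.0})
    for screen_name, retweet_count, polarity in rows:
        entry = totals[screen_name]
        entry["tweets"] += 1
        entry["retweets"] += retweet_count
        entry["polarity_sum"] += polarity

    stats = {}
    for name, entry in totals.items():
        # Keep only users whose summed retweets reach the threshold
        if entry["retweets"] >= retweet_threshold:
            stats[name] = {
                "total_tweet_count": entry["tweets"],
                "total_retweet_count": entry["retweets"],
                # Mean polarity: summed polarity divided by number of tweets
                "agg_sentiment_polarity": entry["polarity_sum"] / entry["tweets"],
            }
    return stats


rows = [("user_a", 8, 0.5), ("user_a", 4, -0.25), ("user_b", 3, 0.2)]  # invented data
stats = weekly_user_stats(rows)
print(stats)
# {'user_a': {'total_tweet_count': 2, 'total_retweet_count': 12, 'agg_sentiment_polarity': 0.125}}
```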


### Week 2 tweets

In [9]:
tweet_df_week2 = tweet_df[(tweet_df['created_at']>'2019-04-06 23:59:59') & (tweet_df['created_at']<='2019-04-13 23:59:59')].copy()
tweet_df_week2.reset_index(drop=True,inplace=True)
tweet_df_week2 = fill_sentiments(tweet_df_week2)
print(tweet_df_week2.shape)
print(tweet_df_week2['created_at'].min())
print(tweet_df_week2['created_at'].max())
tweet_df_week2.head()

(100633, 11)
2019-04-07 00:00:03
2019-04-13 23:59:28


Unnamed: 0,created_at,screen_name,followers_count,friends_count,retweet_count,text,tags,mentions,dummy_count,tweet_sentiment_polarity,tweet_sentiment_subjectivity
0,2019-04-13 23:59:28,user_36737,28,165,0,@Nigel_Farage @brexitparty_uk @Nigel_Farage So...,"['#greatergood', '#brexit']","['@Nigel_Farage', '@brexitparty_uk', '@Nigel_F...",1,0.101786,0.392857
1,2019-04-13 23:58:37,user_4214,31,27,0,@SKinnock You are just like your dad - a panty...,"['#labour', '#brexit']",['@SKinnock'],1,-0.25,0.0
2,2019-04-13 23:58:17,user_36737,28,165,0,@Nigel_Farage Sorry to hear you won't be on th...,"['#greatergood', '#brexit']",['@Nigel_Farage'],1,0.101786,0.392857
3,2019-04-13 23:57:28,user_56063,402,1738,0,"Lets keep this fresh folks, please retweet \r\...","['#brexit', '#brextension']",[],1,0.3,0.5
4,2019-04-13 23:57:02,user_31716,435,857,0,Petition: Halt #Brexit For A Public Inquiry ht...,['#brexit'],[],1,0.0,0.066667


In [10]:
retweets_df_week2 = tweet_df_week2[['screen_name','retweet_count']].groupby('screen_name').sum()
followers_df_week2 = tweet_df_week2[['screen_name','followers_count']].groupby('screen_name').max() #Max of the week
friends_df_week2 = tweet_df_week2[['screen_name','friends_count']].groupby('screen_name').max()  #Max of the week
sentiment_polarity_df_week2 = tweet_df_week2[['screen_name','tweet_sentiment_polarity']].groupby('screen_name').sum()
sentiment_subjectivity_df_week2 = tweet_df_week2[['screen_name','tweet_sentiment_subjectivity']].groupby('screen_name').sum()
tweet_count_df_week2 = tweet_df_week2[['screen_name','dummy_count']].groupby('screen_name').sum()

week2_stats_df = pd.concat([tweet_count_df_week2,retweets_df_week2,followers_df_week2,friends_df_week2,
                            sentiment_polarity_df_week2,sentiment_subjectivity_df_week2],axis='columns')
week2_stats_df['screen_name'] = week2_stats_df.index
week2_stats_df = week2_stats_df[['screen_name','dummy_count','retweet_count','followers_count','friends_count',
                                 'tweet_sentiment_polarity','tweet_sentiment_subjectivity']]
week2_stats_df = week2_stats_df.rename(columns={"retweet_count":"total_retweet_count","followers_count":"max_followers_count",
                                                "friends_count":"max_friends_count","dummy_count":"total_tweet_count",
                                                "tweet_sentiment_polarity":"agg_sentiment_polarity",
                                                "tweet_sentiment_subjectivity":"agg_sentiment_subjectivity"})
week2_stats_df = week2_stats_df[week2_stats_df['total_retweet_count']>=retweet_threshold]
week2_stats_df['agg_sentiment_polarity'] = week2_stats_df['agg_sentiment_polarity']/week2_stats_df['total_tweet_count']
week2_stats_df['agg_sentiment_subjectivity'] = week2_stats_df['agg_sentiment_subjectivity']/week2_stats_df['total_tweet_count']
week2_stats_df.reset_index(drop=True,inplace=True)
print(week2_stats_df['agg_sentiment_polarity'].max())
print(week2_stats_df['agg_sentiment_polarity'].min())
print(week2_stats_df.shape)
week2_stats_df.tail()

1.0
-1.0
(2097, 7)


Unnamed: 0,screen_name,total_tweet_count,total_retweet_count,max_followers_count,max_friends_count,agg_sentiment_polarity,agg_sentiment_subjectivity
2092,user_9960,12,14,71,184,0.066111,0.378889
2093,user_9961,5,42,3657779,487,-0.012667,0.165333
2094,user_9964,3,14,744963,203,0.166667,0.166667
2095,user_9966,1,11,12634,918,0.0,0.0
2096,user_9970,3,10,3,17,-0.041667,0.291667


### Week 3 tweets

In [11]:
tweet_df_week3 = tweet_df[(tweet_df['created_at']>'2019-04-13 23:59:59') & (tweet_df['created_at']<='2019-04-20 23:59:59')].copy()
tweet_df_week3.reset_index(drop=True,inplace=True)
tweet_df_week3 = fill_sentiments(tweet_df_week3)
print(tweet_df_week3.shape)
print(tweet_df_week3['created_at'].min())
print(tweet_df_week3['created_at'].max())
tweet_df_week3.head()

(42048, 11)
2019-04-14 00:00:00
2019-04-20 23:59:53


Unnamed: 0,created_at,screen_name,followers_count,friends_count,retweet_count,text,tags,mentions,dummy_count,tweet_sentiment_polarity,tweet_sentiment_subjectivity
0,2019-04-20 23:59:53,user_61373,751,1329,0,WOW - Another Brexit extension - time now unti...,"['#brexit', '#brexitclock', '#clock', '#eu', '...",[],1,0.1,1.0
1,2019-04-20 23:59:16,user_42372,1522,3686,1,I don't think he's feeling it...\r\r\r\n\r\r\r...,"['#ico', '#ethereum', '#crypto', '#crowdfundin...",[],1,-0.025,0.7
2,2019-04-20 23:58:39,user_3678,23602,22764,20,#Remain voting 4 Dummies (Like me) \r\r\r\n1....,"['#remain', '#brexit']",[],1,0.0,0.0
3,2019-04-20 23:58:24,user_7535,90,195,7,#brexitparty #brexit #BrexitBetrayal #northeas...,"['#brexitparty', '#brexit', '#brexitbetrayal',...",[],1,0.0,0.0
4,2019-04-20 23:58:10,user_12992,308,691,0,@highwaysagency @theresa_may Fiddlesticks to #...,['#brexit'],"['@highwaysagency', '@theresa_may']",1,0.0,0.0


In [12]:
retweets_df_week3 = tweet_df_week3[['screen_name','retweet_count']].groupby('screen_name').sum()
followers_df_week3 = tweet_df_week3[['screen_name','followers_count']].groupby('screen_name').max() #Max of the week
friends_df_week3 = tweet_df_week3[['screen_name','friends_count']].groupby('screen_name').max()  #Max of the week
sentiment_polarity_df_week3 = tweet_df_week3[['screen_name','tweet_sentiment_polarity']].groupby('screen_name').sum()
sentiment_subjectivity_df_week3 = tweet_df_week3[['screen_name','tweet_sentiment_subjectivity']].groupby('screen_name').sum()
tweet_count_df_week3 = tweet_df_week3[['screen_name','dummy_count']].groupby('screen_name').sum()

week3_stats_df = pd.concat([tweet_count_df_week3,retweets_df_week3,followers_df_week3,friends_df_week3,
                            sentiment_polarity_df_week3,sentiment_subjectivity_df_week3],axis='columns')
week3_stats_df['screen_name'] = week3_stats_df.index
week3_stats_df = week3_stats_df[['screen_name','dummy_count','retweet_count','followers_count','friends_count',
                                 'tweet_sentiment_polarity','tweet_sentiment_subjectivity']]
week3_stats_df = week3_stats_df.rename(columns={"retweet_count":"total_retweet_count","followers_count":"max_followers_count",
                                                "friends_count":"max_friends_count","dummy_count":"total_tweet_count",
                                                "tweet_sentiment_polarity":"agg_sentiment_polarity",
                                                "tweet_sentiment_subjectivity":"agg_sentiment_subjectivity"})
week3_stats_df = week3_stats_df[week3_stats_df['total_retweet_count']>=retweet_threshold]
week3_stats_df['agg_sentiment_polarity'] = week3_stats_df['agg_sentiment_polarity']/week3_stats_df['total_tweet_count']
week3_stats_df['agg_sentiment_subjectivity'] = week3_stats_df['agg_sentiment_subjectivity']/week3_stats_df['total_tweet_count']
week3_stats_df.reset_index(drop=True,inplace=True)
print(week3_stats_df['agg_sentiment_polarity'].max())
print(week3_stats_df['agg_sentiment_polarity'].min())
print(week3_stats_df.shape)
week3_stats_df.tail()

1.0
-1.0
(1037, 7)


Unnamed: 0,screen_name,total_tweet_count,total_retweet_count,max_followers_count,max_friends_count,agg_sentiment_polarity,agg_sentiment_subjectivity
1032,user_99088,1,159,15723,197,0.01875,0.05625
1033,user_99092,19,74,13264,205,0.171053,0.493421
1034,user_9919,3,15,108,316,-0.025,0.419444
1035,user_9944,63,26,7,0,-0.005518,0.040665
1036,user_9961,6,109,3674083,487,-0.095859,0.215934


### Week 4 tweets

In [13]:
#Note: week 3 ends at 2019-04-20 23:59:59, so this cutoff leaves out tweets from 21-23 April
tweet_df_week4 = tweet_df[tweet_df['created_at']>'2019-04-23 23:59:59'].copy()
tweet_df_week4.reset_index(drop=True,inplace=True)
tweet_df_week4 = fill_sentiments(tweet_df_week4)
print(tweet_df_week4.shape)
print(tweet_df_week4['created_at'].min())
print(tweet_df_week4['created_at'].max())
tweet_df_week4.head()

(32613, 11)
2019-04-24 00:00:10
2019-04-29 23:59:59


Unnamed: 0,created_at,screen_name,followers_count,friends_count,retweet_count,text,tags,mentions,dummy_count,tweet_sentiment_polarity,tweet_sentiment_subjectivity
0,2019-04-29 23:59:59,user_61373,749,1327,0,WOW - Another Brexit extension - time now unti...,"['#brexit', '#brexitclock', '#clock', '#eu', '...",[],1,0.1,1.0
1,2019-04-29 23:58:19,user_56917,281,687,0,".@santanderuk I want to report CEO fraud, howe...",['#brexit'],['@santanderuk'],1,0.1,0.275
2,2019-04-29 23:56:55,user_72791,381,1895,0,Voting for #brexit. It is very obvious. https:...,['#brexit'],[],1,0.0,0.65
3,2019-04-29 23:56:53,user_9686,21,118,0,@LordCFalconer 1. First ref result couldn't ev...,"['#brexit', '#peoplesvote']",['@LordCFalconer'],1,0.083333,0.377778
4,2019-04-29 23:56:53,user_42521,19,130,0,"#MAGA #BREXIT #GOP \r\n\r\nMake a difference, ...","['#maga', '#brexit', '#gop']",[],1,0.0,0.0
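
The four week cells use fixed timestamp cutoffs. The sketch below gathers those cutoffs in one place and assigns a week number to a timestamp, which also makes visible that timestamps from 21 to 23 April fall into none of the four windows. The explicit start of week 1 and end of week 4 are assumptions added for the sketch (the notebook leaves those two sides open), and `week_of` is a hypothetical helper.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"

# Inclusive (start, end) bounds mirroring the comparisons in the week cells.
# The start of week 1 and the end of week 4 are assumptions for this sketch.
WEEK_BOUNDS = {
    1: ("2019-04-01 00:00:00", "2019-04-06 23:59:59"),
    2: ("2019-04-07 00:00:00", "2019-04-13 23:59:59"),
    3: ("2019-04-14 00:00:00", "2019-04-20 23:59:59"),
    4: ("2019-04-24 00:00:00", "2019-04-30 23:59:59"),
}


def week_of(timestamp):
    """Return the week number (1-4) for a timestamp string, or None if it is in no window."""
    moment = datetime.strptime(timestamp, FMT)
    for week, (start, end) in WEEK_BOUNDS.items():
        if datetime.strptime(start, FMT) <= moment <= datetime.strptime(end, FMT):
            return week
    return None


print(week_of("2019-04-07 00:00:03"))  # 2
print(week_of("2019-04-22 12:00:00"))  # None
print(week_of("2019-04-24 00:00:10"))  # 4
```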


In [14]:
retweets_df_week4 = tweet_df_week4[['screen_name','retweet_count']].groupby('screen_name').sum()
followers_df_week4 = tweet_df_week4[['screen_name','followers_count']].groupby('screen_name').max() #Max of the week
friends_df_week4 = tweet_df_week4[['screen_name','friends_count']].groupby('screen_name').max()  #Max of the week
sentiment_polarity_df_week4 = tweet_df_week4[['screen_name','tweet_sentiment_polarity']].groupby('screen_name').sum()
sentiment_subjectivity_df_week4 = tweet_df_week4[['screen_name','tweet_sentiment_subjectivity']].groupby('screen_name').sum()
tweet_count_df_week4 = tweet_df_week4[['screen_name','dummy_count']].groupby('screen_name').sum()

week4_stats_df = pd.concat([tweet_count_df_week4,retweets_df_week4,followers_df_week4,friends_df_week4,
                            sentiment_polarity_df_week4,sentiment_subjectivity_df_week4],axis='columns')
week4_stats_df['screen_name'] = week4_stats_df.index
week4_stats_df = week4_stats_df[['screen_name','dummy_count','retweet_count','followers_count','friends_count',
                                 'tweet_sentiment_polarity','tweet_sentiment_subjectivity']]
week4_stats_df = week4_stats_df.rename(columns={"retweet_count":"total_retweet_count","followers_count":"max_followers_count",
                                                "friends_count":"max_friends_count","dummy_count":"total_tweet_count",
                                                "tweet_sentiment_polarity":"agg_sentiment_polarity",
                                                "tweet_sentiment_subjectivity":"agg_sentiment_subjectivity"})
week4_stats_df = week4_stats_df[week4_stats_df['total_retweet_count']>=retweet_threshold]
week4_stats_df['agg_sentiment_polarity'] = week4_stats_df['agg_sentiment_polarity']/week4_stats_df['total_tweet_count']
week4_stats_df['agg_sentiment_subjectivity'] = week4_stats_df['agg_sentiment_subjectivity']/week4_stats_df['total_tweet_count']
week4_stats_df.reset_index(drop=True,inplace=True)
print(week4_stats_df['agg_sentiment_polarity'].max())
print(week4_stats_df['agg_sentiment_polarity'].min())
print(week4_stats_df.shape)
week4_stats_df.tail()

1.0
-0.8
(915, 7)


Unnamed: 0,screen_name,total_tweet_count,total_retweet_count,max_followers_count,max_friends_count,agg_sentiment_polarity,agg_sentiment_subjectivity
910,user_98925,3,32,9040,8129,0.166667,0.296296
911,user_98978,1,578,95838,492,0.0,0.0
912,user_9944,35,13,7,0,-0.012517,0.040272
913,user_9954,24,19,11520,2835,0.020731,0.346759
914,user_9961,2,13,3711821,487,0.075,0.45


### Check how many people tweeted about #Brexit in each week of April

In [15]:
#The retweet_threshold is not taken into account here
week1_tweeters = set(tweet_df_week1['screen_name'])
print("Number of #brexit (en) tweeters in week 1 is: " + str(len(week1_tweeters)))
week2_tweeters = set(tweet_df_week2['screen_name'])
print("Number of #brexit (en) tweeters in week 2 is: " + str(len(week2_tweeters)))
week3_tweeters = set(tweet_df_week3['screen_name'])
print("Number of #brexit (en) tweeters in week 3 is: " + str(len(week3_tweeters)))
week4_tweeters = set(tweet_df_week4['screen_name'])
print("Number of #brexit (en) tweeters in week 4 is: " + str(len(week4_tweeters)))

all_week_tweeters = week1_tweeters & week2_tweeters & week3_tweeters & week4_tweeters
print("Number of common #brexit (en) tweeters for all weeks is: " + str(len(all_week_tweeters)))

Number of #brexit (en) tweeters in week 1 is: 57271
Number of #brexit (en) tweeters in week 2 is: 43266
Number of #brexit (en) tweeters in week 3 is: 19234
Number of #brexit (en) tweeters in week 4 is: 14442
Number of common #brexit (en) tweeters for all weeks is: 3427
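
The users who tweeted in every week are found by intersecting the four weekly sets. With the sets held in a list, the same intersection can be written for any number of weeks, as in this small sketch with invented user ids.

```python
# Invented weekly sets of user ids, one set per week
weekly_tweeters = [
    {"user_1", "user_2", "user_3"},
    {"user_2", "user_3", "user_4"},
    {"user_2", "user_3"},
    {"user_3", "user_5"},
]

# Users present in every weekly set
regular_tweeters = set.intersection(*weekly_tweeters)
print(regular_tweeters)  # {'user_3'}
```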


### Keep only the most popular Twitter users, whose aggregate retweets for the week >= retweet_threshold

In [16]:
week1_popular_tweeters = set(week1_stats_df['screen_name'])
print("Number of #brexit (en + retweet_threshold) tweeters in week 1 is: " + str(len(week1_popular_tweeters)))
week2_popular_tweeters = set(week2_stats_df['screen_name'])
print("Number of #brexit (en + retweet_threshold) tweeters in week 2 is: " + str(len(week2_popular_tweeters)))
week3_popular_tweeters = set(week3_stats_df['screen_name'])
print("Number of #brexit (en + retweet_threshold) tweeters in week 3 is: " + str(len(week3_popular_tweeters)))
week4_popular_tweeters = set(week4_stats_df['screen_name'])
print("Number of #brexit (en + retweet_threshold) tweeters in week 4 is: " + str(len(week4_popular_tweeters)))

all_week_popular_tweeters = week1_popular_tweeters & week2_popular_tweeters & week3_popular_tweeters & week4_popular_tweeters
print("Number of common #brexit (en + retweet_threshold) tweeters for all weeks is: " + str(len(all_week_popular_tweeters)))

Number of #brexit (en + retweet_threshold) tweeters in week 1 is: 2700
Number of #brexit (en + retweet_threshold) tweeters in week 2 is: 2097
Number of #brexit (en + retweet_threshold) tweeters in week 3 is: 1037
Number of #brexit (en + retweet_threshold) tweeters in week 4 is: 915
Number of common #brexit (en + retweet_threshold) tweeters for all weeks is: 240


### Finding edges in the network

In [17]:
nodes = list(all_week_popular_tweeters)
def create_pairings(source):
    #Build every unordered pair of distinct nodes exactly once
    result = []
    for p1 in range(len(source)):
        for p2 in range(p1+1,len(source)):
            result.append([source[p1],source[p2]])
    return result

pairings = create_pairings(nodes)
print("%d pairings" % len(pairings))

28680 pairings
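
The pair count can be checked by hand: 240 nodes give 240 × 239 / 2 = 28680 unordered pairs. As an alternative sketch, the standard library's `itertools.combinations` produces the same pairs; the node names below are placeholders.

```python
from itertools import combinations


def create_pairings(source):
    """Return every unordered pair of distinct items as a two-element list."""
    return [list(pair) for pair in combinations(source, 2)]


nodes = [f"user_{i}" for i in range(240)]  # placeholder node names
pairings = create_pairings(nodes)
n = len(nodes)
print(len(pairings))      # 28680
print(n * (n - 1) // 2)   # 28680
```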


In [18]:
screen_name_cols = ['source_screen_name','destination_screen_name']
network_df = pd.DataFrame(pairings, columns = screen_name_cols)
network_df['has_mutual_following'] = False #Initialize it to false then compute the follower friend mutual relations
network_df['source_follow_dest'] = False
network_df['dest_follow_source'] = False  #source is a friend of dest
print(network_df.shape)
network_df.head()

(28680, 5)


Unnamed: 0,source_screen_name,destination_screen_name,has_mutual_following,source_follow_dest,dest_follow_source
0,user_77516,user_68504,False,False,False
1,user_77516,user_82690,False,False,False
2,user_77516,user_69243,False,False,False
3,user_77516,user_51280,False,False,False
4,user_77516,user_86664,False,False,False


#### Get the mutual following info among the list of popular Twitter users

In [19]:
#This is slow: it takes about 1 hour to look up the connections for every 750 pairs of nodes
'''
for index,row in network_df.iterrows():
    ff_rel = api.show_friendship(source_screen_name=row['source_screen_name'], target_screen_name=row['destination_screen_name'])
    network_df.at[index,'has_mutual_following'] = (ff_rel[0].followed_by == True and ff_rel[0].following == True)
    network_df.at[index,'source_follow_dest'] = (ff_rel[0].following == True)
    network_df.at[index,'dest_follow_source'] = (ff_rel[0].followed_by == True)

print(network_df.shape)
network_df.head()
network_df.to_csv('Outputs/mutual_folling_info_retweet_thresh_'+str(retweet_threshold)+'.csv')
'''
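
At the rate quoted in the comment above (about 750 pairs per hour), looking up all 28680 pairs takes roughly a day and a half. The sketch below does that arithmetic and shows one way to split the pairs into batches so that partial results could be saved between batches. The batch size and the helper names are assumptions for illustration.

```python
def estimated_hours(n_pairs, pairs_per_hour=750):
    """Rough running time, using the rate quoted in the notebook comment."""
    return n_pairs / pairs_per_hour


def batches(items, size):
    """Yield consecutive slices of `items` with at most `size` elements each."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


pair_indices = list(range(28680))  # stand-in for the rows of network_df
print(round(estimated_hours(len(pair_indices)), 2))  # 38.24
print(len(list(batches(pair_indices, 750))))         # 39
```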

In [20]:
network_df = pd.read_csv('Outputs/mutual_folling_info_retweet_thresh_'+str(retweet_threshold)+'.csv') #load from saved
network_df = network_df.drop(columns='Unnamed: 0')
network_df['source_screen_name'] = network_df['source_screen_name'].map(mapping_dict)
network_df['destination_screen_name'] = network_df['destination_screen_name'].map(mapping_dict)
print(network_df.shape)
network_df.head()

(28680, 5)


Unnamed: 0,source_screen_name,destination_screen_name,has_mutual_following,source_follow_dest,dest_follow_source
0,user_54269,user_21251,False,False,False
1,user_54269,user_32545,False,False,False
2,user_54269,user_89002,False,False,False
3,user_54269,user_78819,False,False,False
4,user_54269,user_43012,False,True,False


In [21]:
network_df['has_mutual_following'].sum()

2349

Number of edges (mutual following) in the network is 2349
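
With 240 nodes and 2349 mutual-following edges, two simple summary figures follow directly: the density of the undirected graph, 2E / (n(n-1)), and the average degree, 2E / n. The short sketch below computes both; the function names are chosen for illustration.

```python
def undirected_density(n_nodes, n_edges):
    """Share of all possible undirected pairs that are connected."""
    return 2 * n_edges / (n_nodes * (n_nodes - 1))


def average_degree(n_nodes, n_edges):
    """Mean number of edges per node in an undirected graph."""
    return 2 * n_edges / n_nodes


print(round(undirected_density(240, 2349), 4))  # 0.0819
print(average_degree(240, 2349))                # 19.575
```

So about 8% of the 28680 possible pairs follow each other, and a node has on average just under 20 mutual connections.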

In [22]:
network_df['source_follow_dest'].sum() + network_df['dest_follow_source'].sum()

7264
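
The total of 7264 counts directed follow links, and every mutual pair contributes two of them because both flags are true for such a pair. A small sketch of the resulting breakdown, using only the two totals printed above:

```python
directed_links = 7264   # source_follow_dest + dest_follow_source
mutual_pairs = 2349     # has_mutual_following

# Each mutual pair accounts for two directed links
reciprocated_links = 2 * mutual_pairs
one_way_links = directed_links - reciprocated_links
reciprocity = reciprocated_links / directed_links

print(one_way_links)          # 2566
print(round(reciprocity, 3))  # 0.647
```

In other words, roughly two thirds of the follow links among the 240 users are reciprocated.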

In [23]:
network_stats_df_week1 = week1_stats_df[week1_stats_df['screen_name'].isin(all_week_popular_tweeters)].copy()
network_stats_df_week1.reset_index(drop=True,inplace=True)

network_stats_df_week2 = week2_stats_df[week2_stats_df['screen_name'].isin(all_week_popular_tweeters)].copy()
network_stats_df_week2.reset_index(drop=True,inplace=True)

network_stats_df_week3 = week3_stats_df[week3_stats_df['screen_name'].isin(all_week_popular_tweeters)].copy()
network_stats_df_week3.reset_index(drop=True,inplace=True)

network_stats_df_week4 = week4_stats_df[week4_stats_df['screen_name'].isin(all_week_popular_tweeters)].copy()
network_stats_df_week4.reset_index(drop=True,inplace=True)

network_stats_df_week1.to_csv('Outputs/week_1_network_retweet_thresh_'+str(retweet_threshold)+'.csv')
network_stats_df_week2.to_csv('Outputs/week_2_network_retweet_thresh_'+str(retweet_threshold)+'.csv')
network_stats_df_week3.to_csv('Outputs/week_3_network_retweet_thresh_'+str(retweet_threshold)+'.csv')
network_stats_df_week4.to_csv('Outputs/week_4_network_retweet_thresh_'+str(retweet_threshold)+'.csv')

print(network_stats_df_week1.shape)
network_stats_df_week1.head()

(240, 7)


Unnamed: 0,screen_name,total_tweet_count,total_retweet_count,max_followers_count,max_friends_count,agg_sentiment_polarity,agg_sentiment_subjectivity
0,user_10155,9,3511,46494,778,0.154563,0.330952
1,user_1024,18,247,3118,1360,0.004818,0.383896
2,user_11029,49,50,9782,10720,0.007205,0.252049
3,user_1106,13,34,1282,1435,0.055707,0.366349
4,user_11431,20,3602,7478,7254,0.08122,0.457911


### Network for week 1

In [24]:
#Plot the mutual relation using networkX library
nodes = list(network_stats_df_week1['screen_name'])
pd.DataFrame(nodes,columns=['Twitter_users']).to_csv('nodes_list_retweet_thresh_'+str(retweet_threshold)+'.csv')
size_of_nodes = list(network_stats_df_week1['total_retweet_count'])
color_of_nodes = list(network_stats_df_week1['agg_sentiment_polarity']*2) #Multiply by 2 to see more contrast in colors
mutual_follow_lol = network_df[['source_screen_name','destination_screen_name']][network_df['has_mutual_following']==True].values.tolist()
pd.DataFrame(mutual_follow_lol,columns=['Twitter_user_1','Twitter_user_2']).to_csv('Outputs/edges_list_retweet_thresh_'+str(retweet_threshold)+'.csv')
mutual_follow_edges = []
for mfe in mutual_follow_lol:
    mutual_follow_edges.append((mfe[0],mfe[1]))

G = nx.Graph()
G.add_nodes_from(nodes)
G.add_edges_from(mutual_follow_edges)
'''
plt.figure(figsize=(75,75))
pos = nx.spring_layout(G, k=1.5*1/np.sqrt(len(G.nodes())), iterations=30, seed=10) #seed makes the layout reproducible

nx.draw(G,pos=pos,with_labels = True,font_size=50,node_size=size_of_nodes,node_color=color_of_nodes,cmap=plt.get_cmap('YlOrBr'))
plt.savefig("network_graph_week1_retweet_thresh_"+str(retweet_threshold)+".jpeg") #save as jpeg
plt.show() #display
'''
#A more intense node color means a more positive sentiment towards #Brexit

Network of popular Twitter users who tweeted about #Brexit in the 1st week of April
- Node size indicates the number of retweets
- Edges indicate a mutual-following relation
- Node color indicates sentiment polarity
<img src="Images/network_graph_week1_retweet_thresh_10.jpeg">

### Network for week 2

In [25]:
#Plot the mutual relation using networkX library
nodes = list(network_stats_df_week2['screen_name'])
size_of_nodes = list(network_stats_df_week2['total_retweet_count'])
color_of_nodes = list(network_stats_df_week2['agg_sentiment_polarity']*2) #Multiply by 2 to see more contrast in colors
mutual_follow_lol = network_df[['source_screen_name','destination_screen_name']][network_df['has_mutual_following']==True].values.tolist()
mutual_follow_edges = []
for mfe in mutual_follow_lol:
    mutual_follow_edges.append((mfe[0],mfe[1]))

G = nx.Graph()
G.add_nodes_from(nodes)
G.add_edges_from(mutual_follow_edges)
'''
plt.figure(figsize=(75,75))
pos = nx.spring_layout(G, k=1.5*1/np.sqrt(len(G.nodes())), iterations=20, seed=10) #seed makes the layout reproducible
nx.draw(G,pos=pos,with_labels = True,font_size=50,node_size=size_of_nodes,node_color=color_of_nodes,
        cmap=plt.get_cmap('YlOrBr'))
plt.savefig("network_graph_week2_retweet_thresh_"+str(retweet_threshold)+".jpeg") #save as jpeg
plt.show() #display
'''
#A more intense node color means a more positive sentiment towards #Brexit

Network of popular Twitter users who tweeted about #Brexit in the 2nd week of April
- Node size indicates the number of retweets
- Edges indicate a mutual-following relation
- Node color indicates sentiment polarity
<img src="Images/network_graph_week2_retweet_thresh_10.jpeg">

### Network for week 3

In [26]:
#Plot the mutual relation using networkX library
nodes = list(network_stats_df_week3['screen_name'])
size_of_nodes = list(network_stats_df_week3['total_retweet_count'])
color_of_nodes = list(network_stats_df_week3['agg_sentiment_polarity']*2) #Multiply by 2 to see more contrast in colors
mutual_follow_lol = network_df[['source_screen_name','destination_screen_name']][network_df['has_mutual_following']==True].values.tolist()
mutual_follow_edges = []
for mfe in mutual_follow_lol:
    mutual_follow_edges.append((mfe[0],mfe[1]))

G = nx.Graph()
G.add_nodes_from(nodes)
G.add_edges_from(mutual_follow_edges)
plt.figure(figsize=(75,75))
pos = nx.spring_layout(G, k=1.5*1/np.sqrt(len(G.nodes())), iterations=25, seed=10) #seed makes the layout reproducible
'''
nx.draw(G,pos=pos,with_labels = True,font_size=50,node_size=size_of_nodes,node_color=color_of_nodes,
        cmap=plt.get_cmap('YlOrBr'))
plt.savefig("network_graph_week3_retweet_thresh_"+str(retweet_threshold)+".jpeg") #save as jpeg
plt.show() #display
'''
#A more intense node color means a more positive sentiment towards #Brexit

<Figure size 5400x5400 with 0 Axes>

Network of popular Twitter users who tweeted about #Brexit in the 3rd week of April
- Node size indicates the number of retweets
- Edges indicate a mutual-following relation
- Node color indicates sentiment polarity
<img src="Images/network_graph_week3_retweet_thresh_10.jpeg">

### Network for week 4

In [27]:
#Plot the mutual relation using networkX library
nodes = list(network_stats_df_week4['screen_name'])
size_of_nodes = list(network_stats_df_week4['total_retweet_count'])
color_of_nodes = list(network_stats_df_week4['agg_sentiment_polarity']*2) #Multiply by 2 to see more contrast in colors
mutual_follow_lol = network_df[['source_screen_name','destination_screen_name']][network_df['has_mutual_following']==True].values.tolist()
mutual_follow_edges = []
for mfe in mutual_follow_lol:
    mutual_follow_edges.append((mfe[0],mfe[1]))

G = nx.Graph()
G.add_nodes_from(nodes)
G.add_edges_from(mutual_follow_edges)
'''
plt.figure(figsize=(75,75))
pos = nx.spring_layout(G, k=1.5*1/np.sqrt(len(G.nodes())), iterations=25, seed=10) #seed makes the layout reproducible
nx.draw(G,pos=pos,with_labels = True,font_size=50,node_size=size_of_nodes,node_color=color_of_nodes,
        cmap=plt.get_cmap('YlOrBr'))
plt.savefig("network_graph_week4_retweet_thresh_"+str(retweet_threshold)+".jpeg") #save as jpeg
plt.show() #display
'''
#A more intense node color means a more positive sentiment towards #Brexit

Network of popular Twitter users who tweeted about #Brexit in the 4th week of April
- Node size indicates the number of retweets
- Edges indicate a mutual-following relation
- Node color indicates sentiment polarity
<img src="Images/network_graph_week4_retweet_thresh_10.jpeg">

# Graph Algorithms (Undirected)

## Connectivity

##### ALL NODE CONNECTIVITY
Compute node connectivity between all pairs of nodes. (This call is slow, so it is left commented out.)

In [28]:
#network_all_node_pair_connectivity = approx.all_pairs_node_connectivity(G)

##### LOCAL NODE CONNECTIVITY
Given a source node and a target node, compute the local node connectivity between them. The approximation used here returns a lower bound on the exact value. <br>
Local node connectivity for two non-adjacent nodes s and t is the minimum number of nodes that must be removed (along with their incident edges) to disconnect them.

In [29]:
from networkx.algorithms import approximation as approx
network_local_node_connectivity = approx.local_node_connectivity(G,'user_35532','user_17872')
network_local_node_connectivity

1

##### NODE CONNECTIVITY
Returns node connectivity for a graph or digraph G. <br>
Node connectivity is equal to the minimum number of nodes that must be removed to disconnect G or render it trivial. If source and target nodes are provided, this function returns the local node connectivity: the minimum number of nodes that must be removed to break all paths from source to target in G.

In [30]:
network_node_connectivity = approx.node_connectivity(G)
network_node_connectivity

0
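
A node connectivity of 0 means that G is already disconnected: no node has to be removed to separate some pair of users. The sketch below shows the idea on a made-up toy graph, using a plain breadth-first search rather than any NetworkX internals; `connected_components` and `toy` are illustrative names.

```python
from collections import deque

def connected_components(adjacency):
    """Return the connected components of an undirected graph as a list of sets."""
    seen, components = set(), []
    for start in adjacency:
        if start in seen:
            continue
        component, queue = {start}, deque([start])
        while queue:                       # breadth-first search from `start`
            node = queue.popleft()
            for neighbour in adjacency[node]:
                if neighbour not in component:
                    component.add(neighbour)
                    queue.append(neighbour)
        seen |= component
        components.append(component)
    return components

# Hypothetical toy graph: a triangle plus one isolated user.
toy = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b"}, "d": set()}
parts = connected_components(toy)
print(len(parts))   # 2 components, so the node connectivity of the toy graph is 0
```

On the real graph, `nx.number_connected_components(G)` should report the number of components directly.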

## Clustering

##### BIPARTITE CLUSTERING
Compute a bipartite clustering coefficient for nodes.

In [31]:
#nx.algorithms.bipartite.clustering(G) #The graph is not bipartite, so this call raises an error

##### CLUSTER TRIANGLES
Finds the number of triangles that include a node as one vertex.

In [32]:
network_clustering_triangles = nx.triangles(G)
sorted(network_clustering_triangles.items(), key=lambda x: x[1],reverse=True)[:10]

[('user_92741', 1391),
 ('user_84416', 1283),
 ('user_44225', 1258),
 ('user_3678', 1186),
 ('user_42890', 1161),
 ('user_57298', 1160),
 ('user_20654', 1152),
 ('user_17859', 1138),
 ('user_15822', 1136),
 ('user_48474', 1098)]

##### CLUSTER TRANSITIVITY
Compute graph transitivity, the fraction of all possible triangles present in G.<br>
Possible triangles are identified by the number of “triads” (two edges with a shared vertex).

In [33]:
network_clustering_transitivity = nx.transitivity(G)
network_clustering_transitivity

0.5360104570681637
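
To make the definition concrete, here is a minimal pure-Python sketch that counts triads and closed triads on a made-up four-node graph (a triangle with one extra user attached). It illustrates the formula only and is not how NetworkX implements it; all names are illustrative.

```python
from itertools import combinations

def transitivity(adjacency):
    """3 * (number of triangles) / (number of triads) for an undirected graph."""
    triads = 0          # two edges sharing a node, counted at the shared node
    closed_triads = 0   # triads whose two end nodes are also connected
    for neighbours in adjacency.values():
        for u, v in combinations(sorted(neighbours), 2):
            triads += 1
            if v in adjacency[u]:
                closed_triads += 1
    # Every triangle closes three triads, so closed_triads == 3 * triangles.
    return closed_triads / triads if triads else 0.0

# Hypothetical toy graph: triangle a-b-c with a pendant user d attached to c.
toy = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"}, "d": {"c"}}
toy_transitivity = transitivity(toy)
print(toy_transitivity)   # 0.6: one triangle, five triads
```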

##### SQUARE CLUSTERING
Compute the squares clustering coefficient for nodes. <br>
For each node return the fraction of possible squares that exist at the node

In [34]:
network_square_clustering = nx.square_clustering(G)
sorted(network_square_clustering.items(), key=lambda x: x[1],reverse=True)[:10]

[('user_94349', 0.28558008454673556),
 ('user_74305', 0.1870406943156933),
 ('user_80677', 0.15698987890836796),
 ('user_47010', 0.15256036873675907),
 ('user_87390', 0.15028773793714031),
 ('user_67536', 0.14956017514102007),
 ('user_62278', 0.14862690556055524),
 ('user_48786', 0.14755314522524365),
 ('user_58611', 0.1469646154425896),
 ('user_66427', 0.1452530772894783)]

##### CLUSTERING
Compute the clustering coefficient for nodes. <br>
For unweighted graphs, the clustering of a node u is the fraction of possible triangles through that node that exist.

In [35]:
network_clustering = nx.clustering(G)
sorted(network_clustering.items(), key=lambda x: x[1],reverse=True)[:10]

[('user_43189', 1.0),
 ('user_65597', 1.0),
 ('user_94349', 0.9722222222222222),
 ('user_7443', 0.9047619047619048),
 ('user_54257', 0.8333333333333334),
 ('user_58611', 0.8333333333333334),
 ('user_87390', 0.8225806451612904),
 ('user_74305', 0.8068783068783069),
 ('user_38508', 0.8),
 ('user_5725', 0.8)]

##### AVERAGE CLUSTERING
Compute the average clustering coefficient of G, i.e. the mean of the clustering coefficients of all nodes.

In [36]:
network_average_clustering = nx.average_clustering(G)
network_average_clustering

0.41115590293803694
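
The node-level clustering coefficient and its average can be reproduced by hand on a small graph. The sketch below uses a made-up toy graph and illustrative names; it shows why a user with only a couple of well-connected neighbours can reach a coefficient of 1.0.

```python
from itertools import combinations

def local_clustering(adjacency, node):
    """Fraction of pairs of neighbours of `node` that are themselves connected."""
    neighbours = adjacency[node]
    k = len(neighbours)
    if k < 2:
        return 0.0   # fewer than two neighbours, so no triangle is possible
    links = sum(1 for u, v in combinations(neighbours, 2) if v in adjacency[u])
    return 2 * links / (k * (k - 1))

# Hypothetical toy graph: triangle a-b-c with a pendant user d attached to c.
toy = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"}, "d": {"c"}}
coefficients = {node: local_clustering(toy, node) for node in toy}
average = sum(coefficients.values()) / len(coefficients)

print(coefficients["a"])   # 1.0: both neighbours of a follow each other
print(coefficients["c"])   # one of the three neighbour pairs of c is connected
print(average)             # mean over all four nodes
```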

##### GENERALIZED DEGREE
Compute the generalized degree for nodes. <br>
For each node, the generalized degree shows how many edges of given triangle multiplicity the node is connected to. The triangle multiplicity of an edge is the number of triangles an edge participates in.

In [37]:
network_generalized_degree = nx.generalized_degree(G)
print(network_generalized_degree['user_35532'])
print(network_generalized_degree['user_61373'])

Counter({0: 1})
Counter({4: 4, 5: 2, 8: 1, 3: 1, 6: 1, 2: 1, 7: 1})
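
The output format is easier to read with a small example. In this sketch (toy graph and function name are made up for illustration), the triangle multiplicity of an edge is the number of neighbours its two endpoints have in common.

```python
from collections import Counter

def generalized_degree(adjacency, node):
    """Count the edges at `node` by the number of triangles each edge is part of."""
    return Counter(
        len(adjacency[node] & adjacency[neighbour])   # common neighbours = triangles
        for neighbour in adjacency[node]
    )

# Hypothetical toy graph: triangle a-b-c with a pendant user d attached to c.
toy = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"}, "d": {"c"}}
print(generalized_degree(toy, "c"))   # two edges in one triangle, one edge in none
print(generalized_degree(toy, "d"))   # Counter({0: 1}): a single edge in no triangle
```

Read this way, `Counter({0: 1})` for user_35532 above describes a user with a single edge that is not part of any triangle.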


## Centrality

##### DEGREE CENTRALITY
Compute the degree centrality for nodes. <br>
The degree centrality for a node v is the fraction of nodes it is connected to.

In [38]:
network_degree_centrality = nx.degree_centrality(G)
sorted(network_degree_centrality.items(), key=lambda x: x[1],reverse=True)[:10]

[('user_92741', 0.34728033472803344),
 ('user_44225', 0.3305439330543933),
 ('user_84416', 0.3263598326359832),
 ('user_3678', 0.305439330543933),
 ('user_20654', 0.29707112970711297),
 ('user_61379', 0.2928870292887029),
 ('user_42890', 0.2845188284518828),
 ('user_15822', 0.28033472803347276),
 ('user_17859', 0.2719665271966527),
 ('user_57298', 0.2719665271966527)]
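
Degree centrality is the degree divided by n-1, so the scores above can be turned back into plain degrees. A small worked example, assuming the graph has the 240 nodes of `network_stats_df_week1`:

```python
n_nodes = 240   # assumption: one node per row of network_stats_df_week1

# Two scores copied from the output above.
top_two = {"user_92741": 0.34728033472803344, "user_44225": 0.3305439330543933}

# degree = centrality * (n - 1)
degrees = {user: round(score * (n_nodes - 1)) for user, score in top_two.items()}
print(degrees)   # {'user_92741': 83, 'user_44225': 79}
```

So the best-connected user has a mutual-following relation with 83 of the other 239 users.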

##### EIGENVECTOR CENTRALITY
Compute the eigenvector centrality for the graph G. <br>
Eigenvector centrality computes the centrality for a node based on the centrality of its neighbors.

In [39]:
network_eigenvector_centrality = nx.eigenvector_centrality(G)
sorted(network_eigenvector_centrality.items(), key=lambda x: x[1],reverse=True)[:10]

[('user_92741', 0.17348576620042236),
 ('user_84416', 0.166450250620681),
 ('user_44225', 0.16406086008344295),
 ('user_3678', 0.16029369536136398),
 ('user_57298', 0.15739072469917165),
 ('user_20654', 0.15705114078841137),
 ('user_42890', 0.15685190782988093),
 ('user_17859', 0.1559405548902361),
 ('user_15822', 0.15565096385877436),
 ('user_48474', 0.15339907605386296)]

##### CLOSENESS CENTRALITY
Compute closeness centrality for nodes. <br>
Closeness centrality of a node u is the reciprocal of the average shortest path distance to u over all n-1 reachable nodes.

In [40]:
network_closeness_centrality = nx.closeness_centrality(G)
sorted(network_closeness_centrality.items(), key=lambda x: x[1],reverse=True)[:10]

[('user_44225', 0.5070859765150493),
 ('user_61379', 0.5008716875871687),
 ('user_52937', 0.47970809515390817),
 ('user_20654', 0.4763534930898948),
 ('user_3678', 0.4697831000817583),
 ('user_92741', 0.4697831000817583),
 ('user_42890', 0.46656540761544496),
 ('user_15822', 0.46444465576264743),
 ('user_95542', 0.46444465576264743),
 ('user_13335', 0.46339149327792484)]
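
Since G is disconnected, the treatment of unreachable nodes matters. The sketch below computes closeness with a breadth-first search and then scales it by the fraction of nodes that can be reached (the Wasserman-Faust correction, which I assume matches the NetworkX default). The toy graph and names are made up.

```python
from collections import deque

def closeness(adjacency, source):
    """Closeness of `source`, scaled by the fraction of nodes it can reach."""
    distance = {source: 0}
    queue = deque([source])
    while queue:                            # breadth-first search
        node = queue.popleft()
        for neighbour in adjacency[node]:
            if neighbour not in distance:
                distance[neighbour] = distance[node] + 1
                queue.append(neighbour)
    reachable = len(distance) - 1           # reachable nodes other than source
    total = sum(distance.values())
    if total == 0:
        return 0.0                          # isolated node
    score = reachable / total               # reciprocal of the average distance
    return score * reachable / (len(adjacency) - 1)

# Hypothetical toy graph: path a-b-c plus an isolated user d.
toy = {"a": {"b"}, "b": {"a", "c"}, "c": {"b"}, "d": set()}
print(closeness(toy, "b"))   # the middle of the path scores highest
print(closeness(toy, "a"))
print(closeness(toy, "d"))   # 0.0
```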

##### BETWEENNESS CENTRALITY
Compute the shortest-path betweenness centrality for nodes. <br>
Betweenness centrality of a node v is the sum of the fraction of all-pairs shortest paths that pass through v.

In [41]:
network_betweenness_centrality = nx.betweenness_centrality(G)
sorted(network_betweenness_centrality.items(), key=lambda x: x[1],reverse=True)[:10]

[('user_61379', 0.10730944737937745),
 ('user_44225', 0.08452071997138032),
 ('user_43280', 0.05153190002908908),
 ('user_52937', 0.04812279584162896),
 ('user_41762', 0.03997594854822955),
 ('user_89002', 0.03989287766002407),
 ('user_44816', 0.0385338188447167),
 ('user_68318', 0.03211931475974364),
 ('user_36827', 0.03201759162831476),
 ('user_84416', 0.031925313476213676)]
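
The definition can be checked with a brute-force sketch on a tiny graph. This is not the faster algorithm that libraries use; it simply counts, for every pair of users, the share of shortest paths that pass through each other node. All names and the toy graph are illustrative.

```python
from collections import deque
from itertools import combinations

def bfs_counts(adjacency, source):
    """Shortest-path distances and number of shortest paths from `source`."""
    dist, sigma = {source: 0}, {source: 1}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbour in adjacency[node]:
            if neighbour not in dist:
                dist[neighbour] = dist[node] + 1
                sigma[neighbour] = 0
                queue.append(neighbour)
            if dist[neighbour] == dist[node] + 1:
                sigma[neighbour] += sigma[node]
    return dist, sigma

def betweenness(adjacency):
    """Normalised shortest-path betweenness of every node (brute force)."""
    info = {v: bfs_counts(adjacency, v) for v in adjacency}
    scores = dict.fromkeys(adjacency, 0.0)
    for s, t in combinations(adjacency, 2):
        dist_s, sigma_s = info[s]
        if t not in dist_s:
            continue                      # s and t are in different components
        for v in adjacency:
            if v in (s, t) or v not in dist_s:
                continue
            dist_v, sigma_v = info[v]
            if dist_s[v] + dist_v[t] == dist_s[t]:   # v lies on a shortest path
                scores[v] += sigma_s[v] * sigma_v[t] / sigma_s[t]
    n = len(adjacency)
    scale = 2 / ((n - 1) * (n - 2)) if n > 2 else 1.0   # number of other pairs
    return {v: score * scale for v, score in scores.items()}

# Hypothetical toy graph: a star with centre c.
toy = {"c": {"a", "b", "d"}, "a": {"c"}, "b": {"c"}, "d": {"c"}}
result = betweenness(toy)
print(result["c"], result["a"])   # the centre lies on every shortest path
```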

##### EDGE BETWEENNESS CENTRALITY
Compute betweenness centrality for edges. <br>
Betweenness centrality of an edge e is the sum of the fraction of all-pairs shortest paths that pass through e.

In [42]:
network_edge_betweenness_centrality = nx.edge_betweenness_centrality(G)
sorted(network_edge_betweenness_centrality.items(), key=lambda x: x[1],reverse=True)[:10]

[(('user_13730', 'user_68318'), 0.03205716734157897),
 (('user_12973', 'user_13730'), 0.030404463040446306),
 (('user_44225', 'user_68318'), 0.02813210119291191),
 (('user_12973', 'user_17872'), 0.015341701534170154),
 (('user_12507', 'user_16803'), 0.013132408566267479),
 (('user_34599', 'user_60456'), 0.012463734563953666),
 (('user_44225', 'user_89002'), 0.011991079682978226),
 (('user_17596', 'user_44225'), 0.011544631530768816),
 (('user_36827', 'user_61379'), 0.01064916168404423),
 (('user_44816', 'user_87299'), 0.010565839052837308)]

#### Centrality Plot
This plot shows where the nodes with the highest centrality scores are located in the network.

In [43]:
df_centrality = network_stats_df_week1.copy()
df_centrality.head()
df_centrality['color'] = 'gold'
df_centrality['total_retweet_count'] = 500
df_centrality.at[224,'total_retweet_count'] = 10000  #(user_92741)
df_centrality.at[149,'total_retweet_count'] = 10000  #(user_61379)
df_centrality.at[102,'total_retweet_count'] = 10000  #(user_44225)

df_centrality.at[224,'color'] = 'red'  #Degree Centrality & Eigenvector centrality (user_92741)
df_centrality.at[149,'color'] = 'lime'  #Betweenness Centrality (user_61379)
df_centrality.at[102,'color'] = 'magenta'  #Closeness Centrality (user_44225)

color_of_nodes = list(df_centrality['color']) #One named color per node
nodes = list(df_centrality['screen_name'])
size_of_nodes = list(df_centrality['total_retweet_count'])
mutual_follow_lol = network_df[['source_screen_name','destination_screen_name']][network_df['has_mutual_following']==True].values.tolist()
mutual_follow_edges = []
for mfe in mutual_follow_lol:
    mutual_follow_edges.append((mfe[0],mfe[1]))

G = nx.Graph()
G.add_nodes_from(nodes)
G.add_edges_from(mutual_follow_edges)
'''
plt.figure(figsize=(75,75))
pos = nx.spring_layout(G, k=3*1/np.sqrt(len(G.nodes())), iterations=25, seed=10) #seed makes the layout reproducible
nx.draw(G,pos=pos,with_labels = True,font_size=50,node_size=size_of_nodes,node_color=color_of_nodes,
        cmap=plt.get_cmap('YlOrBr'))
plt.savefig("centrality.jpeg") #save as jpeg
plt.show() #display
'''
#Node colors mark the top-ranked node of each centrality measure; all other nodes are gold

Centrality Plot
<img src="Images/centrality.jpeg">
In the above graph: <br>
- The red node is the Twitter user with the highest degree centrality and eigenvector centrality (user_92741).
- The green node is the Twitter user with the highest betweenness centrality (user_61379).
- The magenta node is the Twitter user with the highest closeness centrality (user_44225).  <br>

Analysis of centrality measures can help in <i>identifying the main influencers in the network, tracing the source of fake news, detecting fraud</i>, etc.

## Communicability

Return communicability between all pairs of nodes in G. <br>
The communicability between two nodes u and v is a weighted sum over all walks that start at u and end at v, in which a walk of length k is weighted by 1/k!, so shorter walks count more.

In [44]:
network_communicatability = nx.communicability(G)
#network_communicatability['user_0'] #Shows the communicability of user_0 with all other nodes
sorted(network_communicatability['user_44225'].items(), key=lambda x: x[1],reverse=True)[:10] #Top 10 nodes

[('user_92741', 1.9266768407210386e+18),
 ('user_84416', 1.8485433726343764e+18),
 ('user_44225', 1.8219986486205778e+18),
 ('user_3678', 1.7801676649838587e+18),
 ('user_57298', 1.7479288234139569e+18),
 ('user_20654', 1.7441566860271252e+18),
 ('user_42890', 1.7419457973843866e+18),
 ('user_17859', 1.7318258482502144e+18),
 ('user_15822', 1.728608771612648e+18),
 ('user_48474', 1.7035979898308593e+18)]
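
Communicability corresponds to the matrix exponential of the adjacency matrix, exp(A) = sum over k of A^k / k!, where entry (u, v) of A^k counts the walks of length k from u to v. The sketch below approximates this series in plain Python on a made-up two-node graph; the function name and the number of terms are arbitrary choices for illustration.

```python
import math

def communicability(adj_matrix, terms=30):
    """Approximate exp(A) with the truncated series sum_k A^k / k!."""
    n = len(adj_matrix)
    power = [[float(i == j) for j in range(n)] for i in range(n)]   # A^0
    result = [row[:] for row in power]
    for k in range(1, terms):
        # power becomes A^k; entry (i, j) counts the walks of length k from i to j
        power = [[sum(power[i][m] * adj_matrix[m][j] for m in range(n))
                  for j in range(n)] for i in range(n)]
        for i in range(n):
            for j in range(n):
                result[i][j] += power[i][j] / math.factorial(k)
    return result

# Hypothetical toy graph: two users joined by a single edge.
A = [[0, 1], [1, 0]]
C = communicability(A)
print(C[0][1])   # close to sinh(1): only walks of odd length join the two users
print(C[0][0])   # close to cosh(1): only walks of even length return to the start
```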

##### COMMUNICABILITY BETWEENNESS CENTRALITY
- communicability() - Communicability between pairs of nodes in G.
- communicability_betweenness_centrality() - Communicability betweenness centrality for each node in G.

In [45]:
network_communicatability_bw_centrality = nx.communicability_betweenness_centrality(G)



## Link Analysis

##### PAGERANK
PageRank analysis of graph structure. <br>
PageRank computes a ranking of the nodes in the graph G based on the structure of the incoming links. It was originally designed as an algorithm to rank web pages. Since G is undirected here, every edge counts as a link in both directions.

In [46]:
network_pagerank = nx.pagerank(G)
sorted(network_pagerank.items(), key=lambda x: x[1],reverse=True)[:10] 

[('user_61379', 0.014508408683615682),
 ('user_44225', 0.013261792034877443),
 ('user_92741', 0.013083332893889622),
 ('user_84416', 0.012904765954242633),
 ('user_43280', 0.01186284582726005),
 ('user_20654', 0.011628817381389286),
 ('user_3678', 0.01143823851463272),
 ('user_42890', 0.01114367442676755),
 ('user_15822', 0.010453559304481873),
 ('user_89002', 0.010304839764382142)]
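
The ranking can be understood as the result of repeatedly passing each node's score on to its neighbours. Below is a minimal power-iteration sketch for an undirected graph; it uses a damping factor of 0.85 (the NetworkX default for `alpha`) but is otherwise a simplified illustration with made-up names, not the library's implementation.

```python
def pagerank(adjacency, damping=0.85, iterations=100):
    """Power iteration for PageRank on an undirected graph given as adjacency sets."""
    n = len(adjacency)
    rank = dict.fromkeys(adjacency, 1.0 / n)
    for _ in range(iterations):
        # Score held by users without neighbours is spread evenly over all users.
        dangling = sum(rank[v] for v in adjacency if not adjacency[v])
        new_rank = {}
        for v in adjacency:
            # Each neighbour u shares its score equally among its own neighbours.
            incoming = sum(rank[u] / len(adjacency[u]) for u in adjacency[v])
            new_rank[v] = (1 - damping) / n + damping * (incoming + dangling / n)
        rank = new_rank
    return rank

# Hypothetical toy graph: a star with centre c.
toy = {"c": {"a", "b", "d"}, "a": {"c"}, "b": {"c"}, "d": {"c"}}
ranks = pagerank(toy)
print(max(ranks, key=ranks.get))   # c
print(sum(ranks.values()))         # the scores sum to 1 (up to rounding)
```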

##### HITS
Return HITS hubs and authorities values for nodes. <br>
The HITS algorithm computes two numbers for a node. Authorities estimate the node value based on incoming links. Hubs estimate the node value based on outgoing links. In an undirected graph such as G, incoming and outgoing links are the same, so the two scores coincide.

In [47]:
network_hubs,network_authorities = nx.hits(G)

## Trees

##### MINIMUM SPANNING TREE
Returns a minimum spanning tree or forest on an undirected graph G.

In [48]:
network_min_spanning_tree = nx.minimum_spanning_tree(G)
sorted(network_min_spanning_tree.edges(data=True))[:10]

[('user_10155', 'user_50014', {}),
 ('user_10155', 'user_6055', {}),
 ('user_10155', 'user_86539', {}),
 ('user_1024', 'user_43237', {}),
 ('user_1024', 'user_53291', {}),
 ('user_1024', 'user_58600', {}),
 ('user_1024', 'user_60456', {}),
 ('user_1024', 'user_61379', {}),
 ('user_1024', 'user_67550', {}),
 ('user_1024', 'user_7058', {})]

##### MAXIMUM SPANNING TREE
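Returns a maximum spanning tree or forest on an undirected graph G. Because the edges of G carry no weights, every spanning forest has the same total weight, which is why the output below matches the minimum spanning tree.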

In [49]:
network_max_spanning_tree = nx.maximum_spanning_tree(G)
sorted(network_max_spanning_tree.edges(data=True))[:10]

[('user_10155', 'user_50014', {}),
 ('user_10155', 'user_6055', {}),
 ('user_10155', 'user_86539', {}),
 ('user_1024', 'user_43237', {}),
 ('user_1024', 'user_53291', {}),
 ('user_1024', 'user_58600', {}),
 ('user_1024', 'user_60456', {}),
 ('user_1024', 'user_61379', {}),
 ('user_1024', 'user_67550', {}),
 ('user_1024', 'user_7058', {})]

##### MINIMUM SPANNING EDGES
Generate edges in a minimum spanning forest of an undirected weighted graph. <br>
A minimum spanning tree is a subgraph of the graph (a tree) with the minimum sum of edge weights. A spanning forest is a union of the spanning trees for each connected component of the graph.

In [50]:
network_min_spanning_edges = nx.minimum_spanning_edges(G)
sorted(list(network_min_spanning_edges))[:10]

[('user_10155', 'user_50014', {}),
 ('user_10155', 'user_6055', {}),
 ('user_10155', 'user_86539', {}),
 ('user_1024', 'user_43237', {}),
 ('user_1024', 'user_53291', {}),
 ('user_1024', 'user_58600', {}),
 ('user_1024', 'user_60456', {}),
 ('user_1024', 'user_61379', {}),
 ('user_1024', 'user_67550', {}),
 ('user_1024', 'user_7058', {})]

##### MAXIMUM SPANNING EDGES
Generate edges in a maximum spanning forest of an undirected weighted graph. <br>
A maximum spanning tree is a subgraph of the graph (a tree) with the maximum possible sum of edge weights. A spanning forest is a union of the spanning trees for each connected component of the graph.

In [51]:
network_max_spanning_edges = nx.maximum_spanning_edges(G)
sorted(list(network_max_spanning_edges))[:10]

[('user_10155', 'user_50014', {}),
 ('user_10155', 'user_6055', {}),
 ('user_10155', 'user_86539', {}),
 ('user_1024', 'user_43237', {}),
 ('user_1024', 'user_53291', {}),
 ('user_1024', 'user_58600', {}),
 ('user_1024', 'user_60456', {}),
 ('user_1024', 'user_61379', {}),
 ('user_1024', 'user_67550', {}),
 ('user_1024', 'user_7058', {})]
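
All four outputs above are identical because G has no edge weights, so any spanning forest is both minimum and maximum. A minimal Kruskal-style sketch with a union-find structure shows how such a forest is built; it is not NetworkX's implementation, and the toy data and names are made up.

```python
def spanning_forest(nodes, edges):
    """Kruskal's algorithm; `edges` holds (u, v, weight) tuples."""
    parent = {node: node for node in nodes}

    def find(node):
        """Return the representative of the tree that contains `node`."""
        while parent[node] != node:
            parent[node] = parent[parent[node]]   # path halving
            node = parent[node]
        return node

    forest = []
    # Ascending weights give a minimum forest; reverse=True would give a maximum one.
    for u, v, weight in sorted(edges, key=lambda edge: edge[2]):
        root_u, root_v = find(u), find(v)
        if root_u != root_v:          # the edge joins two separate trees
            parent[root_u] = root_v
            forest.append((u, v, weight))
    return forest

# Hypothetical toy graph: a triangle a-b-c and a separate pair d-e, all weights 1.
nodes = ["a", "b", "c", "d", "e"]
edges = [("a", "b", 1), ("b", "c", 1), ("a", "c", 1), ("d", "e", 1)]
forest = spanning_forest(nodes, edges)
print(len(forest))   # 3 edges = 5 nodes - 2 connected components
```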

## Vitality

##### CLOSENESS VITALITY
Returns the closeness vitality for nodes in the graph. <br>
The closeness vitality of a node is the change in the sum of distances between all node pairs when excluding that node.

In [52]:
network_vitality = nx.closeness_vitality(G) #Requires a connected graph, else returns nan

## Wiener index

Returns the Wiener index of the given graph. <br>
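The Wiener index of a graph is the sum of the shortest-path distances between each pair of reachable nodes. For pairs of nodes in undirected graphs, only one orientation of the pair is counted. For a graph that is not connected, such as G (its node connectivity is 0), NetworkX returns infinity.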

In [53]:
nx.wiener_index(G)

inf
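
The `inf` above is what one would expect for a disconnected graph, where at least one pair of users has no path between them. The sketch below reproduces both cases on made-up toy graphs with a breadth-first search; the function is an illustration, not the library code.

```python
import math
from collections import deque

def wiener_index(adjacency):
    """Sum of shortest-path distances over all unordered pairs of nodes."""
    total = 0
    for source in adjacency:
        distance = {source: 0}
        queue = deque([source])
        while queue:                              # breadth-first search
            node = queue.popleft()
            for neighbour in adjacency[node]:
                if neighbour not in distance:
                    distance[neighbour] = distance[node] + 1
                    queue.append(neighbour)
        if len(distance) < len(adjacency):
            return math.inf                       # some pair is unreachable
        total += sum(distance.values())
    return total // 2                             # every pair was counted twice

# Hypothetical toy graphs: a path a-b-c, and the same path plus an isolated user d.
path = {"a": {"b"}, "b": {"a", "c"}, "c": {"b"}}
disconnected = {"a": {"b"}, "b": {"a", "c"}, "c": {"b"}, "d": set()}
print(wiener_index(path))           # 4 = 1 + 1 + 2
print(wiener_index(disconnected))   # inf
```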

# Community Detection

#### KERNIGHAN–LIN (BISECTION) COMMUNITY
Partition a graph into two blocks using the Kernighan–Lin algorithm. <br>
This algorithm partitions a network into two sets by iteratively swapping pairs of nodes to reduce the edge cut between the two sets.

In [54]:
network_community_kl = nx.community.kernighan_lin.kernighan_lin_bisection(G) 
#It may give better results if we give the weights of edges
len(network_community_kl)

2
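
The quantity that Kernighan–Lin tries to reduce is the edge cut, i.e. the number of edges running between the two blocks. The sketch below only evaluates the cut before and after one swap on a made-up toy graph; it does not implement the search for good swaps, and the names are illustrative.

```python
def cut_size(edges, block_a):
    """Number of edges with exactly one endpoint in block_a."""
    return sum((u in block_a) != (v in block_a) for u, v in edges)

# Hypothetical toy graph: triangles a-b-c and d-e-f joined by the bridge c-d.
edges = [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d"),
         ("d", "e"), ("e", "f"), ("d", "f")]

before_swap = {"a", "b", "d"}   # the other block is {c, e, f}
after_swap = {"a", "b", "c"}    # same bisection after swapping c and d

print(cut_size(edges, before_swap))   # 5
print(cut_size(edges, after_swap))    # 1, only the bridge is cut
```

For the real result, `nx.cut_size(G, network_community_kl[0], network_community_kl[1])` should give the cut between the two blocks found above.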

#### GREEDY MODULARITY COMMUNITY
Find communities in graph using Clauset-Newman-Moore greedy modularity maximization. <br>
This method currently supports the Graph class and does not consider edge weights. Greedy modularity maximization begins with each node in its own community and joins the pair of communities that most increases modularity until no such pair exists.

In [55]:
network_community_modularity = nx.community.greedy_modularity_communities(G)
len(network_community_modularity)

23
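
Modularity compares the share of edges inside each community with the share expected if edges were placed at random with the same degrees. The sketch below evaluates it for two candidate partitions of a made-up toy graph; it shows the score that the greedy merging tries to increase, not the merging procedure itself.

```python
def modularity(edges, communities):
    """Modularity of a partition of an unweighted, undirected graph."""
    m = len(edges)
    degree = {}
    for u, v in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1
    score = 0.0
    for community in communities:
        internal = sum(1 for u, v in edges if u in community and v in community)
        total_degree = sum(degree.get(node, 0) for node in community)
        # share of edges inside the community minus the share expected at random
        score += internal / m - (total_degree / (2 * m)) ** 2
    return score

# Hypothetical toy graph: triangles a-b-c and d-e-f joined by the bridge c-d.
edges = [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d"),
         ("d", "e"), ("e", "f"), ("d", "f")]

split = modularity(edges, [{"a", "b", "c"}, {"d", "e", "f"}])
merged = modularity(edges, [{"a", "b", "c", "d", "e", "f"}])
print(split)    # 5/14, about 0.357
print(merged)   # 0.0, a single community never scores above zero
```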

#### K-CLIQUE COMMUNITY DETECTION
Find k-clique communities in graph using the percolation method. <br>
A k-clique is a clique with k nodes, and two k-cliques are adjacent if they share k-1 nodes. A k-clique community is the union of all k-cliques that can be reached from each other through a series of adjacent k-cliques. <br>
This algorithm is mainly used to detect overlapping communities in a network. We usually consider maximal cliques for this algorithm. In this network, the best results were found for k=5.

In [56]:
network_community_k_clique = nx.community.k_clique_communities(G,5) #Set k = 5
network_community_k_clique = list(network_community_k_clique)
len(network_community_k_clique)

3
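
A small example makes the overlap visible. In the made-up toy graph below, two triangles share an edge (k-1 = 2 nodes) and therefore merge into one community, while a third triangle shares only one node and stays separate. The call is the same NetworkX function as above, here with k=3.

```python
import networkx as nx
from networkx.algorithms.community import k_clique_communities

# Hypothetical toy graph with three triangles: a-b-c, b-c-d and d-e-f.
toy = nx.Graph()
toy.add_edges_from([
    ("a", "b"), ("a", "c"), ("b", "c"),   # triangle a-b-c
    ("b", "d"), ("c", "d"),               # triangle b-c-d shares the edge b-c
    ("d", "e"), ("d", "f"), ("e", "f"),   # triangle d-e-f shares only node d
])

communities = [set(c) for c in k_clique_communities(toy, 3)]
print(sorted(sorted(c) for c in communities))
```

User d belongs to both communities, which is the kind of overlap the other methods in this section cannot express.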

#### LABEL PROPAGATION COMMUNITY
Generates community sets determined by label propagation. <br>
The algorithm works by propagating labels throughout the network and forming communities based on this process of label propagation. The idea is that, within a cluster, all nodes connected to each other will eventually converge to the same label. The intuition behind the algorithm is that a single label can quickly become dominant in a densely connected group of nodes but will have trouble crossing a sparsely connected region. The algorithm stops when every node has a label that is shared by the largest number of its neighbours.

In [57]:
network_community_lpa = nx.community.label_propagation.label_propagation_communities(G)
network_community_lpa = list(network_community_lpa)[::-1]
len(network_community_lpa)

22
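
The propagation step can be sketched in a few lines. The version below is a simplified, deterministic illustration: nodes are visited in sorted order and ties are broken by taking the largest label, which is an arbitrary choice of mine to keep the example reproducible (NetworkX handles update order and ties differently).

```python
from collections import Counter

def label_propagation(adjacency, max_rounds=100):
    """Simplified label propagation; returns the communities as a list of sets."""
    labels = {node: node for node in adjacency}    # every node starts on its own
    for _ in range(max_rounds):
        changed = False
        for node in sorted(adjacency):
            if not adjacency[node]:
                continue                           # isolated nodes keep their label
            counts = Counter(labels[nb] for nb in adjacency[node])
            best = max(counts.values())
            candidates = {label for label, c in counts.items() if c == best}
            if labels[node] in candidates:
                continue                           # already holds a most common label
            labels[node] = max(candidates)         # arbitrary deterministic tie-break
            changed = True
        if not changed:
            break                                  # stopping condition reached
    groups = {}
    for node, label in labels.items():
        groups.setdefault(label, set()).add(node)
    return list(groups.values())

# Hypothetical toy graph: triangles a-b-c and d-e-f joined by the bridge c-d.
toy = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"},
       "d": {"c", "e", "f"}, "e": {"d", "f"}, "f": {"d", "e"}}
found = label_propagation(toy)
print(sorted(sorted(group) for group in found))   # the two triangles
```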

#### GIRVAN–NEWMAN COMMUNITY
Finds communities in a graph using the Girvan–Newman method. <br>
The Girvan–Newman algorithm detects communities by progressively removing edges from the original graph. The algorithm removes the “most valuable” edge, traditionally the edge with the highest betweenness centrality, at each step. As the graph breaks down into pieces, the tightly knit community structure is exposed and the result can be depicted as a dendrogram.

In [58]:
network_community_gn = nx.community.centrality.girvan_newman(G)
network_community_gn = list(tuple(set(c) for c in next(network_community_gn)))
len(network_community_gn)

19
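
On a small graph the first split is easy to follow. In the made-up toy graph below, the bridge between the two triangles lies on every shortest path between them, so it has the highest edge betweenness and is removed first.

```python
import networkx as nx
from networkx.algorithms.community import girvan_newman

# Hypothetical toy graph: triangles a-b-c and d-e-f joined by the bridge c-d.
toy = nx.Graph([("a", "b"), ("b", "c"), ("a", "c"), ("c", "d"),
                ("d", "e"), ("e", "f"), ("d", "f")])

# The generator yields one partition per level of the dendrogram;
# next() returns only the first split, as in the cell above.
first_split = next(girvan_newman(toy))
print(sorted(sorted(c) for c in first_split))
```

Calling `next()` again on the same generator would return the following, finer level of the dendrogram.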

#### LOUVAIN METHOD
A very popular heuristic algorithm for community detection that greedily maximizes the modularity of the partition. The algorithm first assigns each node to its own community. It then goes through the nodes one by one and evaluates the modularity gain from removing the node from its own community and placing it in a neighbouring community. This is repeated for as long as the modularity score keeps increasing. When clusters cannot be improved further by moving individual nodes, the Louvain algorithm aggregates the network, so that each cluster in the original network becomes a node in the aggregated network, and then starts moving individual nodes from one cluster to another in the aggregated network. By repeating the node movement and aggregation, the Louvain algorithm is able to find high-quality clusters in a short time.

The method has three known weaknesses. Since the Louvain algorithm keeps moving nodes from one cluster to another, at some point it may move a crucial node to a different cluster, thereby breaking the connectivity of the original cluster, and it cannot repair this shattered connectivity afterwards. It also only reaches a local maximum of modularity that depends on the order in which the nodes are visited, so the final communities can differ from run to run. Finally, it has trouble detecting small communities in large networks.

In [59]:
#!pip install python-louvain
import community
partition = community.best_partition(G)
network_community_louvain = []
for i in range(len(set(partition.values()))):
    community_members = []
    for key, value in partition.items():
        if value == i:
            community_members.append(key)
    network_community_louvain.append(set(community_members))
    
len(network_community_louvain)

23
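
The nested loop in the cell above scans the whole partition once per community. As an alternative sketch, the same grouping can be done in a single pass with a `defaultdict`; the helper name and the sample partition are made up, and the input is assumed to be a `{node: community_id}` dictionary like the one returned by `best_partition`.

```python
from collections import defaultdict

def partition_to_communities(partition):
    """Group a {node: community_id} mapping into a list of node sets."""
    members = defaultdict(set)
    for node, community_id in partition.items():
        members[community_id].add(node)
    # Return the communities ordered by their id.
    return [members[community_id] for community_id in sorted(members)]

# Hypothetical partition of four users into three communities.
partition = {"user_a": 0, "user_b": 1, "user_c": 0, "user_d": 2}
communities = partition_to_communities(partition)
print(communities)
```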

#### LEIDEN METHOD
The Leiden algorithm fixes the problem of shattered connectivity in the Louvain algorithm. It is able to split clusters instead of only merging them, as is done by the Louvain algorithm. By splitting clusters in a specific way, the Leiden algorithm guarantees that clusters are well-connected. Moreover, once the algorithm has converged, it is impossible to improve the quality of the clusters by moving one or more nodes from one cluster to another. This is a strong property: it states that the clusters found are not too far from optimal. Finally, rather than continuously checking for all nodes in a network whether they can be moved to a different cluster, as is done in the Louvain algorithm, the Leiden algorithm performs this check only for so-called unstable nodes. As a result, the Leiden algorithm not only finds higher-quality clusters than the Louvain algorithm, it also does so in much less time.

In [60]:
#I wasn't able to install this on my local laptop, but I was able to run it on Google Colab and save the results to a dataframe
#!pip install leidenalg  
#import leidenalg
#!pip install igraph
#import igraph as ig
#nx.write_graphml(G,'graph.graphml')
#Gix = ig.read('graph.graphml',format="graphml")
#network_community_leiden = leidenalg.find_partition(Gix, leidenalg.ModularityVertexPartition);
#network_community_leiden = list(network_community_leiden)
leiden_df = pd.read_csv('leiden_df.csv')[['screen_name','leiden_ID']] #Saved dataframe after running it on Google Colab
leiden_df['screen_name'] = leiden_df['screen_name'].map(mapping_dict)

group_ids = list(set(leiden_df['leiden_ID']))
network_community_leiden = []
for ids in group_ids:
    network_community_leiden.append((set(leiden_df['screen_name'][leiden_df['leiden_ID']==ids])))
    
len(network_community_leiden)

23

### Visualize the communities detected by various community detection algorithms

In [61]:
community_df = network_stats_df_week1.copy()
community_df['kernighanLin_ID'] = '' #Bisection (two blocks)
community_df['kernighanLin_size'] = 500
community_df['GModularity_ID'] = ''
community_df['GModularity_size'] = 500
community_df['kClique_ID'] = ''
community_df['kClique_size'] = 500
community_df['labelProp_ID'] = ''
community_df['labelProp_size'] = 500
community_df['girvanNew_ID'] = ''
community_df['girvanNew_size'] = 500
community_df['louvain_ID'] = ''
community_df['louvain_size'] = 500
community_df['leiden_ID'] = ''
community_df['leiden_size'] = 500
for index,row in community_df.iterrows():
    for h in range(len(network_community_kl)):
        if row['screen_name'] in network_community_kl[h]:
            community_df.at[index,'kernighanLin_ID'] = h
            community_df.at[index,'kernighanLin_size'] = 2500
    for i in range(len(network_community_modularity)):
        if len(network_community_modularity[i]) >= 3:  #Only consider communities having at least 3 members
            if row['screen_name'] in network_community_modularity[i]:
                community_df.at[index,'GModularity_ID'] = i
                community_df.at[index,'GModularity_size'] = 2500
    for j in range(len(network_community_k_clique)):
        if row['screen_name'] in network_community_k_clique[j]:
            community_df.at[index,'kClique_ID'] = j
            community_df.at[index,'kClique_size'] = 2500
    for k in range(len(network_community_lpa)):
        if len(network_community_lpa[k]) >= 3:  #Only consider communities having at least 3 members
            if row['screen_name'] in network_community_lpa[k]:
                community_df.at[index,'labelProp_ID'] = k
                community_df.at[index,'labelProp_size'] = 2500
    for l in range(len(network_community_gn)):
        if len(network_community_gn[l]) >= 3:  #Only consider communities having at least 3 members
            if row['screen_name'] in network_community_gn[l]:
                community_df.at[index,'girvanNew_ID'] = l
                community_df.at[index,'girvanNew_size'] = 2500
    for m in range(len(network_community_louvain)):
        if len(network_community_louvain[m]) >= 3:  #Only consider communities having at least 3 members
            if row['screen_name'] in network_community_louvain[m]:
                community_df.at[index,'louvain_ID'] = m
                community_df.at[index,'louvain_size'] = 2500
    for n in range(len(network_community_leiden)):
        if len(network_community_leiden[n]) >= 3:  #Only consider communities having at least 3 members
            if row['screen_name'] in network_community_leiden[n]:
                community_df.at[index,'leiden_ID'] = n
                community_df.at[index,'leiden_size'] = 2500
print(community_df.shape)
community_df.head()

(240, 21)


Unnamed: 0,screen_name,total_tweet_count,total_retweet_count,max_followers_count,max_friends_count,agg_sentiment_polarity,agg_sentiment_subjectivity,kernighanLin_ID,kernighanLin_size,GModularity_ID,...,kClique_ID,kClique_size,labelProp_ID,labelProp_size,girvanNew_ID,girvanNew_size,louvain_ID,louvain_size,leiden_ID,leiden_size
0,user_10155,9,3511,46494,778,0.154563,0.330952,0,2500,0,...,,500,4,2500,0,2500,0,2500,2,2500
1,user_1024,18,247,3118,1360,0.004818,0.383896,0,2500,1,...,1.0,2500,5,2500,0,2500,1,2500,0,2500
2,user_11029,49,50,9782,10720,0.007205,0.252049,1,2500,0,...,0.0,2500,4,2500,0,2500,2,2500,1,2500
3,user_1106,13,34,1282,1435,0.055707,0.366349,0,2500,1,...,1.0,2500,5,2500,0,2500,1,2500,0,2500
4,user_11431,20,3602,7478,7254,0.08122,0.457911,1,2500,0,...,0.0,2500,4,2500,0,2500,2,2500,1,2500
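
The nested loops in the cell above test every user against every community of every algorithm. A minimal sketch of a lookup-based alternative follows, assuming each community is a Python set of screen names, as in `network_community_louvain`. The helper name `community_lookup` and the toy data are hypothetical and not part of this notebook. Unassigned users get `NaN` here instead of the empty string used above.

```python
import pandas as pd

def community_lookup(communities, min_size=1):
    """Map each member to the index of its community, skipping small communities."""
    lookup = {}
    for idx, members in enumerate(communities):
        if len(members) >= min_size:
            for name in members:
                # If a user appears in several communities, the last one wins,
                # which matches the behaviour of the loop-based cell.
                lookup[name] = idx
    return lookup

# Toy data (hypothetical users and communities)
toy_communities = [{'user_a', 'user_b', 'user_c'}, {'user_d'}]
toy_df = pd.DataFrame({'screen_name': ['user_a', 'user_b', 'user_c', 'user_d']})

# Only consider communities having at least 3 members
lookup = community_lookup(toy_communities, min_size=3)
toy_df['louvain_ID'] = toy_df['screen_name'].map(lookup)  # NaN when no community qualifies
toy_df['louvain_size'] = toy_df['louvain_ID'].notna().map({True: 2500, False: 500})
print(toy_df)
```

The same two `map` calls could be repeated per algorithm, with `min_size=1` for the algorithms that are not filtered by community size.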


In [62]:
color_list = ['blue','red','green','brown','orange','crimson','cyan','pink','darkslategray','darkgreen','olive']
community_df['kernighanLin_ID'] = community_df['kernighanLin_ID'].astype(str)
community_df['kClique_ID'] = community_df['kClique_ID'].astype(str)
community_df['GModularity_ID'] = community_df['GModularity_ID'].astype(str)
community_df['labelProp_ID'] = community_df['labelProp_ID'].astype(str)
community_df['girvanNew_ID'] = community_df['girvanNew_ID'].astype(str)
community_df['louvain_ID'] = community_df['louvain_ID'].astype(str)
community_df['leiden_ID'] = community_df['leiden_ID'].astype(str)
for c in range(len(color_list)):
    community_df['kernighanLin_ID'] = community_df['kernighanLin_ID'].replace(str(c),color_list[c])
    community_df['kClique_ID'] = community_df['kClique_ID'].replace(str(c),color_list[c])
    community_df['GModularity_ID'] = community_df['GModularity_ID'].replace(str(c),color_list[c])
    community_df['labelProp_ID'] = community_df['labelProp_ID'].replace(str(c),color_list[c])
    community_df['girvanNew_ID'] = community_df['girvanNew_ID'].replace(str(c),color_list[c])
    community_df['louvain_ID'] = community_df['louvain_ID'].replace(str(c),color_list[c])
    community_df['leiden_ID'] = community_df['leiden_ID'].replace(str(c),color_list[c])
community_df['kernighanLin_ID'] = community_df['kernighanLin_ID'].replace('','gold')
community_df['kClique_ID'] = community_df['kClique_ID'].replace('','gold')
community_df['GModularity_ID'] = community_df['GModularity_ID'].replace('','gold')
community_df['labelProp_ID'] = community_df['labelProp_ID'].replace('','gold')
community_df['girvanNew_ID'] = community_df['girvanNew_ID'].replace('','gold')
community_df['louvain_ID'] = community_df['louvain_ID'].replace('','gold')
community_df['leiden_ID'] = community_df['leiden_ID'].replace('','gold')
community_df.head()

Unnamed: 0,screen_name,total_tweet_count,total_retweet_count,max_followers_count,max_friends_count,agg_sentiment_polarity,agg_sentiment_subjectivity,kernighanLin_ID,kernighanLin_size,GModularity_ID,...,kClique_ID,kClique_size,labelProp_ID,labelProp_size,girvanNew_ID,girvanNew_size,louvain_ID,louvain_size,leiden_ID,leiden_size
0,user_10155,9,3511,46494,778,0.154563,0.330952,blue,2500,blue,...,gold,500,orange,2500,blue,2500,blue,2500,green,2500
1,user_1024,18,247,3118,1360,0.004818,0.383896,blue,2500,red,...,red,2500,crimson,2500,blue,2500,red,2500,blue,2500
2,user_11029,49,50,9782,10720,0.007205,0.252049,red,2500,blue,...,blue,2500,orange,2500,blue,2500,green,2500,red,2500
3,user_1106,13,34,1282,1435,0.055707,0.366349,blue,2500,red,...,red,2500,crimson,2500,blue,2500,red,2500,blue,2500
4,user_11431,20,3602,7478,7254,0.08122,0.457911,red,2500,blue,...,blue,2500,orange,2500,blue,2500,green,2500,red,2500


In [63]:
nodes = list(community_df['screen_name'])
size_of_nodes = list(community_df['kernighanLin_size'])
color_of_nodes = list(community_df['kernighanLin_ID']) #Color name of each node's community
mutual_follow_lol = network_df[['source_screen_name','destination_screen_name']][network_df['has_mutual_following']==True].values.tolist()
mutual_follow_edges = []
for mfe in mutual_follow_lol:
    mutual_follow_edges.append((mfe[0],mfe[1]))

G = nx.Graph()
G.add_nodes_from(nodes)
G.add_edges_from(mutual_follow_edges)
'''
plt.figure(figsize=(75,75))
pos = nx.spring_layout(G, k=2*1/np.sqrt(len(G.nodes())), iterations=30)
nx.draw(G,pos=pos,with_labels = True,font_size=50,node_size=size_of_nodes,node_color=color_of_nodes,seed=1)
plt.savefig("kernighan_lin_community_"+str(retweet_threshold)+".jpeg")
plt.show() #display
'''

'\nplt.figure(figsize=(75,75))\npos = nx.spring_layout(G, k=2*1/np.sqrt(len(G.nodes())), iterations=30)\nnx.draw(G,pos=pos,with_labels = True,font_size=50,node_size=size_of_nodes,node_color=color_of_nodes,seed=1)\nplt.savefig("kernighan_lin_community_"+str(retweet_threshold)+".jpeg")\nplt.show() #display\n'

Kernighan-Lin Community
<img src="Images/kernighan_lin_community_10.jpeg">

In [64]:
nodes = list(community_df['screen_name'])
size_of_nodes = list(community_df['kClique_size'])
color_of_nodes = list(community_df['kClique_ID']) #Color name of each node's community
mutual_follow_lol = network_df[['source_screen_name','destination_screen_name']][network_df['has_mutual_following']==True].values.tolist()
mutual_follow_edges = []
for mfe in mutual_follow_lol:
    mutual_follow_edges.append((mfe[0],mfe[1]))

G = nx.Graph()
G.add_nodes_from(nodes)
G.add_edges_from(mutual_follow_edges)
'''
plt.figure(figsize=(75,75))
pos = nx.spring_layout(G, k=2*1/np.sqrt(len(G.nodes())), iterations=30)
nx.draw(G,pos=pos,with_labels = True,font_size=50,node_size=size_of_nodes,node_color=color_of_nodes,seed=1)
plt.savefig("k_clique_community_"+str(retweet_threshold)+".jpeg")
plt.show() #display
'''

'\nplt.figure(figsize=(75,75))\npos = nx.spring_layout(G, k=2*1/np.sqrt(len(G.nodes())), iterations=30)\nnx.draw(G,pos=pos,with_labels = True,font_size=50,node_size=size_of_nodes,node_color=color_of_nodes,seed=1)\nplt.savefig("k_clique_community_"+str(retweet_threshold)+".jpeg")\nplt.show() #display\n'

K-Clique Community
<img src='Images/k_clique_community_10.jpeg'>

In [65]:
nodes = list(community_df['screen_name'])
size_of_nodes = list(community_df['GModularity_size'])
color_of_nodes = list(community_df['GModularity_ID']) #Color name of each node's community
mutual_follow_lol = network_df[['source_screen_name','destination_screen_name']][network_df['has_mutual_following']==True].values.tolist()
mutual_follow_edges = []
for mfe in mutual_follow_lol:
    mutual_follow_edges.append((mfe[0],mfe[1]))

G = nx.Graph()
G.add_nodes_from(nodes)
G.add_edges_from(mutual_follow_edges)
'''
plt.figure(figsize=(75,75))
pos = nx.spring_layout(G, k=1.5*1/np.sqrt(len(G.nodes())), iterations=25)
nx.draw(G,pos=pos,with_labels = True,font_size=50,node_size=size_of_nodes,node_color=color_of_nodes,seed=10)
plt.savefig("greedy_modularity_community_"+str(retweet_threshold)+".jpeg")
plt.show() #display
'''

'\nplt.figure(figsize=(75,75))\npos = nx.spring_layout(G, k=1.5*1/np.sqrt(len(G.nodes())), iterations=25)\nnx.draw(G,pos=pos,with_labels = True,font_size=50,node_size=size_of_nodes,node_color=color_of_nodes,seed=10)\nplt.savefig("greedy_modularity_community_"+str(retweet_threshold)+".jpeg")\nplt.show() #display\n'

Greedy Modularity Community
<img src='Images/greedy_modularity_community_10.jpeg'>

In [66]:
nodes = list(community_df['screen_name'])
size_of_nodes = list(community_df['labelProp_size'])
color_of_nodes = list(community_df['labelProp_ID']) #Color name of each node's community
mutual_follow_lol = network_df[['source_screen_name','destination_screen_name']][network_df['has_mutual_following']==True].values.tolist()
mutual_follow_edges = []
for mfe in mutual_follow_lol:
    mutual_follow_edges.append((mfe[0],mfe[1]))

G = nx.Graph()
G.add_nodes_from(nodes)
G.add_edges_from(mutual_follow_edges)
'''
plt.figure(figsize=(75,75))
pos = nx.spring_layout(G, k=1.5*1/np.sqrt(len(G.nodes())), iterations=25)
nx.draw(G,pos=pos,with_labels = True,font_size=50,node_size=size_of_nodes,node_color=color_of_nodes,seed=10)
plt.savefig("label_propagation_community_"+str(retweet_threshold)+".jpeg")
plt.show() #display
'''

'\nplt.figure(figsize=(75,75))\npos = nx.spring_layout(G, k=1.5*1/np.sqrt(len(G.nodes())), iterations=25)\nnx.draw(G,pos=pos,with_labels = True,font_size=50,node_size=size_of_nodes,node_color=color_of_nodes,seed=10)\nplt.savefig("label_propagation_community_"+str(retweet_threshold)+".jpeg")\nplt.show() #display\n'

Label Propagation Community
<img src='Images/label_propagation_community_10.jpeg'>

In [67]:
nodes = list(community_df['screen_name'])
size_of_nodes = list(community_df['girvanNew_size'])
color_of_nodes = list(community_df['girvanNew_ID']) #Color name of each node's community
mutual_follow_lol = network_df[['source_screen_name','destination_screen_name']][network_df['has_mutual_following']==True].values.tolist()
mutual_follow_edges = []
for mfe in mutual_follow_lol:
    mutual_follow_edges.append((mfe[0],mfe[1]))

G = nx.Graph()
G.add_nodes_from(nodes)
G.add_edges_from(mutual_follow_edges)
'''
plt.figure(figsize=(75,75))
pos = nx.spring_layout(G, k=1.5*1/np.sqrt(len(G.nodes())), iterations=25)
nx.draw(G,pos=pos,with_labels = True,font_size=50,node_size=size_of_nodes,node_color=color_of_nodes,seed=10)
plt.savefig("girvan_newman_community_"+str(retweet_threshold)+".jpeg")
plt.show() #display
'''

'\nplt.figure(figsize=(75,75))\npos = nx.spring_layout(G, k=1.5*1/np.sqrt(len(G.nodes())), iterations=25)\nnx.draw(G,pos=pos,with_labels = True,font_size=50,node_size=size_of_nodes,node_color=color_of_nodes,seed=10)\nplt.savefig("girvan_newman_community_"+str(retweet_threshold)+".jpeg")\nplt.show() #display\n'

Girvan-Newman Community
<img src='Images/girvan_newman_community_10.jpeg'>

In [68]:
nodes = list(community_df['screen_name'])
size_of_nodes = list(community_df['louvain_size'])
color_of_nodes = list(community_df['louvain_ID']) #Color name of each node's community
mutual_follow_lol = network_df[['source_screen_name','destination_screen_name']][network_df['has_mutual_following']==True].values.tolist()
mutual_follow_edges = []
for mfe in mutual_follow_lol:
    mutual_follow_edges.append((mfe[0],mfe[1]))

G = nx.Graph()
G.add_nodes_from(nodes)
G.add_edges_from(mutual_follow_edges)
'''
plt.figure(figsize=(75,75))
pos = nx.spring_layout(G, k=1.5*1/np.sqrt(len(G.nodes())), iterations=25)
nx.draw(G,pos=pos,with_labels = True,font_size=50,node_size=size_of_nodes,node_color=color_of_nodes,seed=10)
plt.savefig("louvain_method_community_"+str(retweet_threshold)+".jpeg")
plt.show() #display
'''

'\nplt.figure(figsize=(75,75))\npos = nx.spring_layout(G, k=1.5*1/np.sqrt(len(G.nodes())), iterations=25)\nnx.draw(G,pos=pos,with_labels = True,font_size=50,node_size=size_of_nodes,node_color=color_of_nodes,seed=10)\nplt.savefig("louvain_method_community_"+str(retweet_threshold)+".jpeg")\nplt.show() #display\n'

Louvain Method
<img src='Images/louvain_method_community_10.jpeg'>

In [69]:
nodes = list(community_df['screen_name'])
size_of_nodes = list(community_df['leiden_size'])
color_of_nodes = list(community_df['leiden_ID']) #Color name of each node's community
mutual_follow_lol = network_df[['source_screen_name','destination_screen_name']][network_df['has_mutual_following']==True].values.tolist()
mutual_follow_edges = []
for mfe in mutual_follow_lol:
    mutual_follow_edges.append((mfe[0],mfe[1]))

G = nx.Graph()
G.add_nodes_from(nodes)
G.add_edges_from(mutual_follow_edges)
'''
plt.figure(figsize=(75,75))
pos = nx.spring_layout(G, k=1.5*1/np.sqrt(len(G.nodes())), iterations=25)
nx.draw(G,pos=pos,with_labels = True,font_size=50,node_size=size_of_nodes,node_color=color_of_nodes,seed=10)
plt.savefig("leiden_method_community_"+str(retweet_threshold)+".jpeg")
plt.show() #display
'''

'\nplt.figure(figsize=(75,75))\npos = nx.spring_layout(G, k=1.5*1/np.sqrt(len(G.nodes())), iterations=25)\nnx.draw(G,pos=pos,with_labels = True,font_size=50,node_size=size_of_nodes,node_color=color_of_nodes,seed=10)\nplt.savefig("leiden_method_community_"+str(retweet_threshold)+".jpeg")\nplt.show() #display\n'

Leiden Method
<img src='Images/leiden_method_community_10.jpeg'>
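
The seven plotting cells above differ only in the size column, the color column and the output file name. A possible helper that captures the shared steps is sketched below. `build_graph` and `draw_communities` are hypothetical names, and the toy edge list stands in for `mutual_follow_edges`. The sketch passes `seed` to `nx.spring_layout`, which is the call that uses it to make the layout reproducible; `seed` is not among the documented keywords of `nx.draw`.

```python
import networkx as nx
import numpy as np

def build_graph(nodes, edges):
    """Undirected graph that keeps every node, including those without edges."""
    G = nx.Graph()
    G.add_nodes_from(nodes)
    G.add_edges_from(edges)
    return G

def draw_communities(G, node_sizes, node_colors, out_file,
                     k_factor=1.5, iterations=25, seed=10):
    """Draw G with a seeded spring layout and save the figure to out_file."""
    import matplotlib.pyplot as plt  # imported here so the layout code runs without a plotting backend
    pos = nx.spring_layout(G, k=k_factor / np.sqrt(G.number_of_nodes()),
                           iterations=iterations, seed=seed)
    plt.figure(figsize=(75, 75))
    nx.draw(G, pos=pos, with_labels=True, font_size=50,
            node_size=node_sizes, node_color=node_colors)
    plt.savefig(out_file)
    plt.close()

# Toy data (hypothetical users and mutual-follow edges)
toy_nodes = ['user_a', 'user_b', 'user_c', 'user_d']
toy_edges = [('user_a', 'user_b'), ('user_b', 'user_c')]
G_toy = build_graph(toy_nodes, toy_edges)

# The same seed gives the same node positions on repeated calls
pos_1 = nx.spring_layout(G_toy, seed=10)
pos_2 = nx.spring_layout(G_toy, seed=10)
same_layout = all(np.allclose(pos_1[n], pos_2[n]) for n in G_toy.nodes())
print(same_layout)
```

Because the nodes are added in the order of the `nodes` list, the size and color lists taken from `community_df` line up with the nodes as long as they come from the same rows in the same order.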

## Evaluation of Community algorithms based on Sentiments about Brexit

In [70]:
#Display the aggregate sentiment of the twitter users for each week
all_week_sentiment_df = pd.concat((network_stats_df_week1[['screen_name','agg_sentiment_polarity']].rename(columns={'agg_sentiment_polarity':'week1_sentiment'}),
                          network_stats_df_week2[['screen_name','agg_sentiment_polarity']].rename(columns={'agg_sentiment_polarity':'week2_sentiment','screen_name':'screen_name_2'}),
                          network_stats_df_week3[['screen_name','agg_sentiment_polarity']].rename(columns={'agg_sentiment_polarity':'week3_sentiment','screen_name':'screen_name_3'}),
                          network_stats_df_week4[['screen_name','agg_sentiment_polarity']].rename(columns={'agg_sentiment_polarity':'week4_sentiment','screen_name':'screen_name_4'})
                          ),axis=1)
all_week_sentiment_df = all_week_sentiment_df.drop(['screen_name_2','screen_name_3','screen_name_4'],axis=1)
print(all_week_sentiment_df.shape)
all_week_sentiment_df.head()

(240, 5)


Unnamed: 0,screen_name,week1_sentiment,week2_sentiment,week3_sentiment,week4_sentiment
0,user_10155,0.154563,0.056186,0.16,-0.09375
1,user_1024,0.004818,-0.041788,0.231871,0.093469
2,user_11029,0.007205,-0.038345,0.018403,-0.022658
3,user_1106,0.055707,0.018275,0.05121,0.029384
4,user_11431,0.08122,0.164103,0.102105,0.034583
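
The cell above places the four weekly frames side by side with `pd.concat(..., axis=1)`, which lines rows up by index and therefore relies on every weekly frame listing the users in the same order. A sketch of an alternative that aligns on `screen_name` explicitly is shown below. The toy frames and their values are hypothetical stand-ins for `network_stats_df_week1` to `network_stats_df_week4`.

```python
from functools import reduce
import pandas as pd

# Toy weekly frames (hypothetical values); week 2 lists the users in a different order
week1 = pd.DataFrame({'screen_name': ['user_a', 'user_b'],
                      'agg_sentiment_polarity': [0.10, -0.20]})
week2 = pd.DataFrame({'screen_name': ['user_b', 'user_a'],
                      'agg_sentiment_polarity': [0.30, 0.05]})

# Keep only the two needed columns and give each week its own sentiment column name
weekly = [
    frame[['screen_name', 'agg_sentiment_polarity']]
        .rename(columns={'agg_sentiment_polarity': 'week' + str(n) + '_sentiment'})
    for n, frame in enumerate([week1, week2], start=1)
]

# Inner-join on screen_name so that each row holds the sentiments of one user
all_weeks = reduce(lambda left, right: pd.merge(left, right, on='screen_name', how='inner'),
                   weekly)
print(all_weeks)
```

With this approach the extra `screen_name_2`, `screen_name_3` and `screen_name_4` columns never appear, so no `drop` step is needed.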


In [71]:
node_df = pd.DataFrame()
node_df['screen_name'] = list(network_stats_df_week1['screen_name'])
nw_community_df = pd.merge(tweet_df[['screen_name','retweet_count']].copy(),node_df,on='screen_name',how='inner')
nw_community_df = nw_community_df.groupby('screen_name').sum()
nw_community_df.reset_index(inplace=True)
nw_community_df['retweet_count'] = nw_community_df['retweet_count'].astype(float)
nw_community_df['kernighanLin_ID'] = '' #Bipartition
nw_community_df['GModularity_ID'] = ''
nw_community_df['kClique_ID'] = ''
nw_community_df['labelProp_ID'] = ''
nw_community_df['girvanNew_ID'] = ''
nw_community_df['louvain_ID'] = ''
for index,row in nw_community_df.iterrows():
    for h in range(len(network_community_kl)):
        if row['screen_name'] in network_community_kl[h]:
            nw_community_df.at[index,'kernighanLin_ID'] = h
    for i in range(len(network_community_modularity)):
        if row['screen_name'] in network_community_modularity[i]:
            nw_community_df.at[index,'GModularity_ID'] = i
    for j in range(len(network_community_k_clique)):
        if row['screen_name'] in network_community_k_clique[j]:
            nw_community_df.at[index,'kClique_ID'] = j
    for k in range(len(network_community_lpa)):
        if row['screen_name'] in network_community_lpa[k]:
            nw_community_df.at[index,'labelProp_ID'] = k
    for l in range(len(network_community_gn)):
        if row['screen_name'] in network_community_gn[l]:
            nw_community_df.at[index,'girvanNew_ID'] = l
    for m in range(len(network_community_louvain)):
        if row['screen_name'] in network_community_louvain[m]:
            nw_community_df.at[index,'louvain_ID'] = m

leiden_df = pd.read_csv('leiden_df.csv')[['screen_name','leiden_ID']]
leiden_df['screen_name'] = leiden_df['screen_name'].map(mapping_dict)
nw_community_df = pd.merge(nw_community_df,leiden_df,on='screen_name',how='inner')
nw_community_df = nw_community_df.drop(['retweet_count'],axis=1)

community_validation_df = pd.merge(nw_community_df,all_week_sentiment_df,on='screen_name',how='inner')
print(community_validation_df.shape)
community_validation_df.head()

(240, 12)


Unnamed: 0,screen_name,kernighanLin_ID,GModularity_ID,kClique_ID,labelProp_ID,girvanNew_ID,louvain_ID,leiden_ID,week1_sentiment,week2_sentiment,week3_sentiment,week4_sentiment
0,user_10155,0,0,,4,0,0,2,0.154563,0.056186,0.16,-0.09375
1,user_1024,0,1,1.0,5,0,1,0,0.004818,-0.041788,0.231871,0.093469
2,user_11029,1,0,0.0,4,0,2,1,0.007205,-0.038345,0.018403,-0.022658
3,user_1106,0,1,1.0,5,0,1,0,0.055707,0.018275,0.05121,0.029384
4,user_11431,1,0,0.0,4,0,2,1,0.08122,0.164103,0.102105,0.034583


In [72]:
from sklearn import preprocessing
scaler = preprocessing.MinMaxScaler()
community_validation_df[['week1_sentiment','week2_sentiment','week3_sentiment','week4_sentiment']] = scaler.fit_transform(community_validation_df[['week1_sentiment','week2_sentiment','week3_sentiment','week4_sentiment']])
community_validation_df.head()

Unnamed: 0,screen_name,kernighanLin_ID,GModularity_ID,kClique_ID,labelProp_ID,girvanNew_ID,louvain_ID,leiden_ID,week1_sentiment,week2_sentiment,week3_sentiment,week4_sentiment
0,user_10155,0,0,,4,0,0,2,0.587082,0.525253,0.620118,0.29457
1,user_1024,0,1,1.0,5,0,1,0,0.408083,0.37364,0.696668,0.524757
2,user_11029,1,0,0.0,4,0,2,1,0.410936,0.378969,0.469305,0.381978
3,user_1106,0,1,1.0,5,0,1,0,0.468913,0.466587,0.504247,0.445964
4,user_11431,1,0,0.0,4,0,2,1,0.499411,0.692251,0.558455,0.452357


In [73]:
comm_cols = ['mean_sentiment_CV']
community_sent_cv_df = pd.DataFrame(columns = comm_cols)
community_alg = {'Kernighan-Lin':'kernighanLin_ID','Greedy Modularity':'GModularity_ID','k-Clique':'kClique_ID',
                 'Label-Propagation':'labelProp_ID','Girvan-Newman':'girvanNew_ID','Louvain Method':'louvain_ID',
                 'Leiden Method':'leiden_ID'}
for key,value in community_alg.items():
    df = community_validation_df[[value,'week1_sentiment','week2_sentiment','week3_sentiment','week4_sentiment']].copy()
    grouped = df.groupby(value)
    cv = grouped.std()/grouped.mean()  #Coefficient of variation per community and week
    mean_cv = cv.mean().mean()         #Average over the communities, then over the four weeks
    df = pd.DataFrame([mean_cv],columns=comm_cols,index=[key])
    community_sent_cv_df = pd.concat([community_sent_cv_df,df])  #Add this algorithm's row
community_sent_cv_df

Unnamed: 0,mean_sentiment_CV
Kernighan-Lin,0.259036
Greedy Modularity,0.211898
k-Clique,0.254298
Label-Propagation,0.182775
Girvan-Newman,0.18461
Louvain Method,0.22488
Leiden Method,0.224992


The above table shows the mean coefficient of variation (CV, the standard deviation divided by the mean) of the sentiment about #Brexit within the communities detected by each algorithm, averaged over the communities and over the four weeks. A low CV means that the members of a community expressed similar sentiment, so the algorithm with the lowest mean CV is the one whose communities group like-minded users most closely. <br>
For this social network, the <b><i>Label-Propagation</i></b> community detection algorithm has the lowest mean CV (0.183), closely followed by Girvan-Newman (0.185), so its communities are the ones whose members share the most similar sentiment about #Brexit.
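
As a small worked illustration of the coefficient of variation used above, the sketch below computes it for two toy communities. The column name `community_ID` and all values are hypothetical and only chosen to make the arithmetic easy to follow.

```python
import pandas as pd

# Toy data (hypothetical): two communities with the same mean sentiment but different spread
toy = pd.DataFrame({
    'community_ID':    [0, 0, 0, 1, 1, 1],
    'week1_sentiment': [0.4, 0.5, 0.6, 0.2, 0.5, 0.8],
})

grouped = toy.groupby('community_ID')['week1_sentiment']
cv_per_community = grouped.std() / grouped.mean()  # sample standard deviation (ddof=1) over mean
mean_cv = cv_per_community.mean()                  # average CV over the communities
print(cv_per_community)
print(mean_cv)
```

Both toy communities have a mean of 0.5, but the standard deviation is 0.1 in community 0 and 0.3 in community 1, so their CVs are about 0.2 and 0.6 and the mean CV is about 0.4. The tighter community 0 is the one with like-minded members in the sense used above.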

In [74]:
#Same as above, but show the results week-wise
comm_cols = ['mean_sentiment_CV_week1','mean_sentiment_CV_week2','mean_sentiment_CV_week3','mean_sentiment_CV_week4']
community_sent_cv_df = pd.DataFrame(columns = comm_cols)
community_alg = {'Kernighan-Lin':'kernighanLin_ID','Greedy Modularity':'GModularity_ID','k-Clique':'kClique_ID',
                 'Label-Propagation':'labelProp_ID','Girvan-Newman':'girvanNew_ID','Louvain Method':'louvain_ID',
                 'Leiden Method':'leiden_ID'}
for key,value in community_alg.items():
    df = community_validation_df[[value,'week1_sentiment','week2_sentiment','week3_sentiment','week4_sentiment']].copy()
    grouped = df.groupby(value)
    mean_cv = (grouped.std()/grouped.mean()).mean()  #Mean CV over the communities, one value per week
    df = pd.DataFrame([mean_cv.values],
                      columns=comm_cols,index=[key])
    community_sent_cv_df = pd.concat([community_sent_cv_df,df])  #Add this algorithm's row
community_sent_cv_df

Unnamed: 0,mean_sentiment_CV_week1,mean_sentiment_CV_week2,mean_sentiment_CV_week3,mean_sentiment_CV_week4
Kernighan-Lin,0.225892,0.253841,0.254636,0.301774
Greedy Modularity,0.195532,0.20986,0.237113,0.205085
k-Clique,0.228701,0.248205,0.242319,0.297966
Label-Propagation,0.214525,0.200492,0.176703,0.139379
Girvan-Newman,0.201543,0.197264,0.198285,0.14135
Louvain Method,0.201771,0.226227,0.24214,0.229384
Leiden Method,0.202053,0.226074,0.242723,0.22912
