<img src="https://github.com/Minyall/sc207_materials/blob/master/images/gephi_network.png?raw=true" align="right" width="300">


# SC207 - Session 8
# Social Network Analysis with Gephi
## Reshaping your Data into a Network


### Imports

Today we will just need...
- Pandas to import and reshape our Twitter data

- <img src="https://github.com/Minyall/sc207_materials/blob/master/images/gephi-logo-2010-transparent.png?raw=true" align="left" width="75">....to visualise and explore our data.


In [1]:
import pandas as pd

## Converting Twitter Data

- We're going to make a retweet network.
- In this network every node will represent a different user.
- An edge between user `a` and user `b` will represent one user retweeting the other.
- Edges will be given a `weight` that counts how many times user `a` retweeted user `b`.
- We will make our network `directional`, meaning that we will record separately
  -     how many times `a` retweeted -> `b`
  -     and how many times `b` retweeted -> `a`
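
To make the idea of a directed, weighted edge concrete, here is a minimal sketch using made-up retweet events (the user names `a`, `b` and `c` and the list `retweets` are purely illustrative, not taken from the dataset). Each ordered pair is counted separately, so `a -> b` and `b -> a` are different edges.

```python
from collections import Counter

# Hypothetical retweet events as (retweeter, original_author) pairs
retweets = [
    ("a", "b"),
    ("a", "b"),
    ("b", "a"),
    ("c", "a"),
]

# Each distinct ordered pair becomes one directed edge; its count is the weight
edge_weights = Counter(retweets)

for (source, target), weight in sorted(edge_weights.items()):
    print(f"{source} -> {target} (weight {weight})")
```

Later in the session the same counting is done with Pandas rather than `Counter`, but the result has the same shape: one row per ordered pair plus a weight.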

In [2]:
filename = 'example_twitter_data.pkl'
tweets = pd.read_pickle(filename)

In [3]:
tweets.head()

Unnamed: 0,created_at,id,id_str,full_text,truncated,display_text_range,entities,metadata,source,in_reply_to_status_id,...,retweet_count,favorite_count,favorited,retweeted,lang,possibly_sensitive,extended_entities,quoted_status_id,quoted_status_id_str,quoted_status
0,Sat Nov 21 13:12:34 +0000 2020,1330137025148817408,1330137025148817408,RT @Keir_Starmer: In the interest of transpare...,False,"[0, 140]","{'hashtags': [], 'symbols': [], 'user_mentions...","{'iso_language_code': 'en', 'result_type': 're...","<a href=""https://mobile.twitter.com"" rel=""nofo...",,...,946,0,False,False,en,,,,,
1,Sat Nov 21 13:12:33 +0000 2020,1330137020711235586,1330137020711235586,RT @mrjamesob: Priti Patel bullies people who ...,False,"[0, 128]","{'hashtags': [], 'symbols': [], 'user_mentions...","{'iso_language_code': 'en', 'result_type': 're...","<a href=""http://twitter.com/download/iphone"" r...",,...,2519,0,False,False,en,,,,,
2,Sat Nov 21 13:12:32 +0000 2020,1330137019679432707,1330137019679432707,@BorisJohnson\nAnother Dominic Cummings Saga!!...,False,"[0, 117]","{'hashtags': [], 'symbols': [], 'user_mentions...","{'iso_language_code': 'en', 'result_type': 're...","<a href=""http://twitter.com/download/android"" ...",,...,0,0,False,False,en,False,,,,
3,Sat Nov 21 13:12:32 +0000 2020,1330137016881860610,1330137016881860610,Rather than all these pious and sycophantic tw...,False,"[0, 268]","{'hashtags': [], 'symbols': [], 'user_mentions...","{'iso_language_code': 'en', 'result_type': 're...","<a href=""https://about.twitter.com/products/tw...",,...,0,0,False,False,en,,,,,
4,Sat Nov 21 13:12:31 +0000 2020,1330137014180737025,1330137014180737025,RT @MarinaHyde: My bit about Priti Patel’s bul...,False,"[0, 129]","{'hashtags': [], 'symbols': [], 'user_mentions...","{'iso_language_code': 'en', 'result_type': 're...","<a href=""http://twitter.com/download/iphone"" r...",,...,1153,0,False,False,en,False,,,,


#### Our Unpacking Functions

In [4]:
def flatten_nested_dicts(df):
    dicts = df.to_dict(orient='records')
    flattened = pd.json_normalize(dicts)
    return flattened
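
To see what this unpacking does, the sketch below runs `pd.json_normalize` on two made-up tweet records (the names `alice`, `bob` and `carol` are hypothetical). Nested dictionaries become columns whose names join the keys with a dot, which is where names like `user.screen_name` come from.

```python
import pandas as pd

# Two made-up tweet records: the first is a retweet, the second is not
records = [
    {
        "id": 1,
        "user": {"screen_name": "alice", "followers_count": 10},
        "retweeted_status": {"user": {"screen_name": "bob"}},
    },
    {
        "id": 2,
        "user": {"screen_name": "carol", "followers_count": 25},
    },
]

flat = pd.json_normalize(records)
print(sorted(flat.columns))
print(flat[["user.screen_name", "retweeted_status.user.screen_name"]])
```

The second record has no retweeted tweet, so its `retweeted_status.user.screen_name` value comes out as missing.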

### Nodes
- User screen names as the id and Label.
- User followers count and statuses count as the node attributes

### Edges 
-  `user.screen_name` and the `retweeted_status.user.screen_name` as the two ends of our edges.

In [5]:
# First let's grab just the portion of data we need

tweet_network_data = tweets[['id', 'user', 'retweeted_status']]
tweet_network_data.head()

Unnamed: 0,id,user,retweeted_status
0,1330137025148817408,"{'id': 1027183102886662144, 'id_str': '1027183...",{'created_at': 'Fri Nov 20 12:14:39 +0000 2020...
1,1330137020711235586,"{'id': 337419307, 'id_str': '337419307', 'name...",{'created_at': 'Sat Nov 21 09:24:45 +0000 2020...
2,1330137019679432707,"{'id': 890878937487876096, 'id_str': '89087893...",
3,1330137016881860610,"{'id': 488028839, 'id_str': '488028839', 'name...",
4,1330137014180737025,"{'id': 244243442, 'id_str': '244243442', 'name...",{'created_at': 'Sat Nov 21 07:03:01 +0000 2020...


In [6]:
# and then we unpack...may take a few seconds...

tweet_network_data = flatten_nested_dicts(tweet_network_data)
tweet_network_data.head()

Unnamed: 0,id,user.id,user.id_str,user.name,user.screen_name,user.location,user.description,user.url,user.entities.description.urls,user.protected,...,retweeted_status.quoted_status.place.id,retweeted_status.quoted_status.place.url,retweeted_status.quoted_status.place.place_type,retweeted_status.quoted_status.place.name,retweeted_status.quoted_status.place.full_name,retweeted_status.quoted_status.place.country_code,retweeted_status.quoted_status.place.country,retweeted_status.quoted_status.place.contained_within,retweeted_status.quoted_status.place.bounding_box.type,retweeted_status.quoted_status.place.bounding_box.coordinates
0,1330137025148817408,1027183102886662144,1027183102886662144,Nanna #BackTo60 🕷 #50sWomen #ExcludedUK,nanna39076633,,#IStandWithJoanne\n \n\n\n\...,,[],False,...,,,,,,,,,,
1,1330137020711235586,337419307,337419307,Glenny Rodgers,HenryForthwith,Not telling,"I’m so sorry, it seemed like a good idea at th...",,[],False,...,,,,,,,,,,
2,1330137019679432707,890878937487876096,890878937487876096,F355,F35514,,,,[],False,...,,,,,,,,,,
3,1330137016881860610,488028839,488028839,Mukul Chawla QC,MChawlaQC,"London, England",Partner in the Investigations team at Bryan Ca...,,[],False,...,,,,,,,,,,
4,1330137014180737025,244243442,244243442,ᖇIᑕᕼᗩᖇᗪ ᕮᒪᒪIOTT,R1chardEll10tt,The North East,I do stuff at work but don’t ask me what it is...,,[],False,...,,,,,,,,,,


Next we create our edge list, which represents who retweets whom.

`user.screen_name` is the user that initiated the retweet, whilst `retweeted_status.user.screen_name` is the original author of the tweet being retweeted.

We can think of this edge like so...


(`user.screen_name`) -[RETWEETED]-> (`retweeted_status.user.screen_name`)

In [7]:
edges = tweet_network_data[['user.screen_name', 'retweeted_status.user.screen_name']]
edges.head()

Unnamed: 0,user.screen_name,retweeted_status.user.screen_name
0,nanna39076633,Keir_Starmer
1,HenryForthwith,mrjamesob
2,F35514,
3,MChawlaQC,
4,R1chardEll10tt,MarinaHyde


Some of these tweets will not be retweets, and so will have a `NaN` value in the `retweeted_status.user.screen_name` column. We can check with `edges.info()`.

In [8]:
#check the info
edges.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 22989 entries, 0 to 22988
Data columns (total 2 columns):
 #   Column                             Non-Null Count  Dtype 
---  ------                             --------------  ----- 
 0   user.screen_name                   22989 non-null  object
 1   retweeted_status.user.screen_name  19556 non-null  object
dtypes: object(2)
memory usage: 359.3+ KB


Let's drop any rows that don't have a value under `retweeted_status.user.screen_name`

In [9]:
# we could be specific with a filter but we can also use .dropna as a shortcut

edges = edges.dropna()
edges.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 19556 entries, 0 to 22988
Data columns (total 2 columns):
 #   Column                             Non-Null Count  Dtype 
---  ------                             --------------  ----- 
 0   user.screen_name                   19556 non-null  object
 1   retweeted_status.user.screen_name  19556 non-null  object
dtypes: object(2)
memory usage: 458.3+ KB
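
The "specific filter" mentioned in the comment could look like the sketch below, shown on a small made-up DataFrame (`toy_edges` and its user names are hypothetical). Both approaches keep only the rows where the retweeted author is present.

```python
import pandas as pd

toy_edges = pd.DataFrame({
    "user.screen_name": ["alice", "carol", "dave"],
    "retweeted_status.user.screen_name": ["bob", None, "alice"],
})

target_col = "retweeted_status.user.screen_name"

# Explicit filter: keep rows where the retweeted author is present
filtered = toy_edges[toy_edges[target_col].notna()]

# Shortcut: drop rows with a missing value in that column
dropped = toy_edges.dropna(subset=[target_col])

print(filtered.equals(dropped))
print(list(dropped.index))
```

Either way the surviving rows keep their original index labels, which is why the index can skip numbers after rows are dropped.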


In [10]:
# Rename the columns so that it is clear which is source and which is target
renaming = {'user.screen_name': 'source', 'retweeted_status.user.screen_name': 'target'}

edges = edges.rename(columns=renaming)

In [11]:
edges.head()

Unnamed: 0,source,target
0,nanna39076633,Keir_Starmer
1,HenryForthwith,mrjamesob
4,R1chardEll10tt,MarinaHyde
5,NNour50695694,LBC
6,Astraia18,meenalsworld


We also said we would keep just one edge between each pair of users and give that edge a `weight` score recording how many times the retweeting happened. We can do this quickly using Pandas `.groupby`.

In [12]:
# first we give every edge a weight of 1
edges['weight'] = 1
# Add weight
edges

Unnamed: 0,source,target,weight
0,nanna39076633,Keir_Starmer,1
1,HenryForthwith,mrjamesob,1
4,R1chardEll10tt,MarinaHyde,1
5,NNour50695694,LBC,1
6,Astraia18,meenalsworld,1
...,...,...,...
22983,brewdog1950,Bbmorg,1
22984,hausofrushdi,coaimpaul,1
22985,raywilson50,BorisJohnson_MP,1
22986,OLDMART01,GetBrexit_Done,1


In [15]:
# Then we group by both the source and the target columns and sum together the weights

edges.groupby(['source', 'target'], as_index=False).sum().sort_values('weight', ascending=False)
# Build the groupby to examine the weighting

Unnamed: 0,source,target,weight
6113,MarySchmoller,MarieAnnUK,6
1333,BrexitBuster,BrexitBuster,5
1414,BurtsBikeBits,StrongerStabler,5
18189,terriesinglet14,StrongerStabler,5
5845,M_Haynes01,mrjamesob,4
...,...,...,...
6427,MikeeeeV,OxfordDiplomat,1
6426,Mikecryer3,cue_bono,1
6424,Mikecryer3,Patricia344130,1
6423,Mikecryer3,GrahamJ18821678,1
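
As a cross-check on the "weight of 1, then sum" approach, the sketch below compares it with counting the rows in each group directly via `.size()`. The toy edge list is made up for illustration; this is one possible alternative, not necessarily how the session intends it to be done.

```python
import pandas as pd

toy_edges = pd.DataFrame({
    "source": ["a", "a", "b", "c"],
    "target": ["b", "b", "a", "a"],
})

# Approach used in this session: add a weight of 1 per row, then sum
summed = (
    toy_edges.assign(weight=1)
    .groupby(["source", "target"], as_index=False)
    .sum()
)

# Alternative: count the rows in each (source, target) group directly
counted = (
    toy_edges.groupby(["source", "target"])
    .size()
    .reset_index(name="weight")
)

print(counted)
```

Both versions produce one row per directed pair, with `a -> b` receiving a weight of 2 in this toy data.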


In [17]:
# Looks good, let's finalise that by overwriting our edges variable - no need to sort it
edges = edges.groupby(['source', 'target'], as_index=False).sum().reset_index(drop=True)
# Set the edge type
edges['edge_type'] = 'retweeted'
edges

In [18]:
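# reset_index(drop=True) above does not create an 'index' column,
# so only drop it if it is present (a plain `del` would raise a KeyError)
edges = edges.drop(columns=['index'], errors='ignore')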

In [19]:
edges.head()

Unnamed: 0,source,target,weight,edge_type
0,0606Green,MarcherLord1,1,retweeted
1,0606Green,SocialM85897394,1,retweeted
2,0ctavia,mrjamesob,1,retweeted
3,100glitterstars,RoddyQC,1,retweeted
4,101Cognitive,MattChorley,1,retweeted


# Nodes
This dataframe will be a list of unique nodes and we will assign some attributes to the nodes that we can use in Gephi later on.

First we take both the source and target columns, and append one to the other to make a long list of every user in the edges.

We drop duplicates, as users may appear multiple times, and then convert the resulting Series to a DataFrame using `.to_frame`, naming its single column `id`.

In [20]:
# first we take our source and target column, and stack them on top of each other
# This will create a list of every user in our edge list

# pd.concat works across pandas versions (Series.append was removed in pandas 2.0)
node_names = pd.concat([edges['source'], edges['target']])
node_names

0              0606Green
1              0606Green
2                0ctavia
3        100glitterstars
4           101Cognitive
              ...       
19039    alfonslopeztena
19040          SueSuezep
19041        I_am_Dan___
19042      Bill_Esterson
19043       redhistorian
Length: 38088, dtype: object

In [22]:
# We have duplicates because our edge list relies on duplicating names to properly represent
# how one user may have formed edges between multiple other nodes.

# A node list should be a list of unique nodes and their attributes, 
# so we will drop the duplicates and turn the Series into a DataFrame
unique_nodes = node_names.drop_duplicates().to_frame(name='id').reset_index(drop=True)

unique_nodes

Unnamed: 0,id
0,0606Green
1,0ctavia
2,100glitterstars
3,101Cognitive
4,10dam
...,...
12883,shane_r
12884,UKfollowgain
12885,jonnybid
12886,paulmotty
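
The node-building steps can be followed end to end on a tiny made-up edge list (`toy_edges` and its user names are hypothetical). The sketch uses `pd.concat` for the stacking step, which works across pandas versions.

```python
import pandas as pd

toy_edges = pd.DataFrame({
    "source": ["a", "a", "b"],
    "target": ["b", "c", "a"],
})

# Stack source and target into one long Series of user names
all_names = pd.concat([toy_edges["source"], toy_edges["target"]])

# Keep one row per user and turn the Series into a one-column DataFrame
toy_nodes = (
    all_names.drop_duplicates()
    .to_frame(name="id")
    .reset_index(drop=True)
)

print(len(all_names))
print(toy_nodes)
```

The stacked Series is always twice as long as the edge list, which matches the real data above: 19044 edges give 38088 names before duplicates are dropped.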


Gephi will use the `id` column to match nodes in the node list to the nodes mentioned in the edge list. Finally we provide a `Label` column. It holds the same values as the `id` column, and it is what Gephi displays when node labels are switched on.

In [23]:

# Set the label
unique_nodes['Label'] = unique_nodes['id']
unique_nodes.head()

Unnamed: 0,id,Label
0,0606Green,0606Green
1,0ctavia,0ctavia
2,100glitterstars,100glitterstars
3,101Cognitive,101Cognitive
4,10dam,10dam


We want to ensure each user node has its `user.statuses_count` and `user.followers_count` associated. We will need to get these from our original dataframe.

In [24]:
attribute_columns = ['user.screen_name','user.followers_count','user.statuses_count']
user_data = tweet_network_data[attribute_columns]
user_data.head()

Unnamed: 0,user.screen_name,user.followers_count,user.statuses_count
0,nanna39076633,751,35166
1,HenryForthwith,420,21602
2,F35514,10,1715
3,MChawlaQC,9265,540
4,R1chardEll10tt,114,477


Currently `user_data` is essentially a list of tweets showing just the username, along with the status count and follower count of the user at the point they tweeted. This means that a user may occur more than once in the list, with different values.

The solution is to ask Pandas to find all the tweets in the dataset for each user, and then choose the highest values it can find in those tweets for each user. We do this with `.groupby` and `.max` to aggregate the data.

In [26]:
user_data = user_data.groupby('user.screen_name').max().reset_index()    #group by screen name and get the max values
user_data 

Unnamed: 0,user.screen_name,user.followers_count,user.statuses_count
0,001Gunner,24,1127
1,0606Green,815,181116
2,0_ayanna,97,9508
3,0ctavia,2115,210311
4,100glitterstars,205,46651
...,...,...,...
14200,zippydazipster,2474,323594
14201,zithertilldawn,457,11477
14202,zodiacbanana,222,2542
14203,zort70,1030,48651
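
One detail worth seeing on a small example: `.max()` is applied to each column independently, so the two values kept for a user may come from different tweets. The DataFrame below is made up for illustration.

```python
import pandas as pd

toy_user_data = pd.DataFrame({
    "user.screen_name": ["alice", "alice", "bob"],
    "user.followers_count": [10, 12, 7],
    "user.statuses_count": [100, 99, 5],
})

# One row per user, keeping the highest value seen in each column
per_user = toy_user_data.groupby("user.screen_name").max().reset_index()
print(per_user)
```

Here `alice` ends up with 12 followers (from her second row) and 100 statuses (from her first row).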


For each user in our list of nodes, we now want to find the corresponding data in our `user_data` variable and include it in our `unique_nodes` list.

We can do this with a "left `.merge`" which matches the two dataframes on a specified column and then copies the data from the "right" dataframe to the corresponding rows in the "left" dataframe.

In [27]:
# unique_nodes is on the left, user_data is on the right

nodes = unique_nodes.merge(user_data, left_on='id', right_on='user.screen_name', how='left')      #merge the two dataframes

In [28]:
nodes

Unnamed: 0,id,Label,user.screen_name,user.followers_count,user.statuses_count
0,0606Green,0606Green,0606Green,815.0,181116.0
1,0ctavia,0ctavia,0ctavia,2115.0,210311.0
2,100glitterstars,100glitterstars,100glitterstars,205.0,46651.0
3,101Cognitive,101Cognitive,101Cognitive,207.0,7334.0
4,10dam,10dam,10dam,50.0,12432.0
...,...,...,...,...,...
12883,shane_r,shane_r,shane_r,2063.0,10916.0
12884,UKfollowgain,UKfollowgain,UKfollowgain,10023.0,58686.0
12885,jonnybid,jonnybid,,,
12886,paulmotty,paulmotty,paulmotty,9652.0,36952.0


In [29]:
# no need for the extra user.screen_name column
nodes = nodes.drop(columns=['user.screen_name'])    #drop the screen name column
nodes

Unnamed: 0,id,Label,user.followers_count,user.statuses_count
0,0606Green,0606Green,815.0,181116.0
1,0ctavia,0ctavia,2115.0,210311.0
2,100glitterstars,100glitterstars,205.0,46651.0
3,101Cognitive,101Cognitive,207.0,7334.0
4,10dam,10dam,50.0,12432.0
...,...,...,...,...
12883,shane_r,shane_r,2063.0,10916.0
12884,UKfollowgain,UKfollowgain,10023.0,58686.0
12885,jonnybid,jonnybid,,
12886,paulmotty,paulmotty,9652.0,36952.0
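
Some nodes, such as `jonnybid` above, have `NaN` for both attributes. This is what a left merge produces when a node has no matching row in `user_data`, for example a user who was retweeted but has no tweets of their own in the dataset. The sketch below, on made-up data, uses the `indicator` option of `.merge` to list such nodes; the names `toy_nodes`, `toy_user_data` and `zed` are hypothetical.

```python
import pandas as pd

toy_nodes = pd.DataFrame({"id": ["alice", "bob", "zed"]})
toy_user_data = pd.DataFrame({
    "user.screen_name": ["alice", "bob"],
    "user.followers_count": [12, 7],
})

merged = toy_nodes.merge(
    toy_user_data,
    left_on="id",
    right_on="user.screen_name",
    how="left",
    indicator=True,  # adds a `_merge` column saying where each row matched
)

# Nodes that are in the node list but have no row in the user data
unmatched = merged.loc[merged["_merge"] == "left_only", "id"].tolist()
print(unmatched)
print(merged["user.followers_count"].dtype)
```

The missing values are also why the counts appear as floats (`815.0`) after the merge: an integer column cannot hold `NaN`, so Pandas converts it to a float column.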


In [30]:
edges

Unnamed: 0,source,target,weight,edge_type
0,0606Green,MarcherLord1,1,retweeted
1,0606Green,SocialM85897394,1,retweeted
2,0ctavia,mrjamesob,1,retweeted
3,100glitterstars,RoddyQC,1,retweeted
4,101Cognitive,MattChorley,1,retweeted
...,...,...,...,...
19039,zippydazipster,alfonslopeztena,1,retweeted
19040,zodiacbanana,SueSuezep,1,retweeted
19041,zort70,I_am_Dan___,1,retweeted
19042,zuluzim909,Bill_Esterson,1,retweeted


In [31]:
nodes.to_csv('retweet_node_list.csv',index=False)
edges.to_csv('retweet_edge_list.csv',index=False)
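
Before moving to Gephi, it can be worth confirming that every user named in the edge list also has a row in the node list. The sketch below shows one way to check this on made-up data, writing to an in-memory buffer rather than a file; the variable names are illustrative assumptions.

```python
import io
import pandas as pd

toy_nodes = pd.DataFrame({"id": ["a", "b", "c"], "Label": ["a", "b", "c"]})
toy_edges = pd.DataFrame({
    "source": ["a", "b"],
    "target": ["b", "c"],
    "weight": [2, 1],
})

# Write to an in-memory buffer instead of disk, then read it back
buffer = io.StringIO()
toy_edges.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

# Every user named in an edge should have a row in the node list
edge_users = set(reloaded["source"]) | set(reloaded["target"])
missing = edge_users - set(toy_nodes["id"])
print(missing)
```

An empty set means the two tables are consistent. Passing `index=False`, as in the cell above, keeps the DataFrame index out of the CSV so that only the named columns are written.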

Now we go to...

<img src="https://github.com/Minyall/sc207_materials/blob/master/images/gephi-logo-2010-transparent.png?raw=true" align="left" width="150">