<img src="https://github.com/Minyall/sc207_materials/blob/master/images/gephi_network.png?raw=true" align="right" width="300">


# SC207 - Session 8
# Social Network Analysis with Gephi
## Reshaping your Data into a Network


### Imports

Today we will just need...
- Pandas to import and reshape our Twitter data

- <img src="https://github.com/Minyall/sc207_materials/blob/master/images/gephi-logo-2010-transparent.png?raw=true" align="left" width="75"> Gephi to visualise and explore our data.


In [1]:
import pandas as pd

In [2]:
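def flatten_nested_dicts(df):
    # Turn each row into a dict, then let json_normalize expand any nested
    # dicts into their own columns (named parent.child).
    # Note: the original index is not kept; the result is numbered from 0.
    dicts = df.to_dict(orient='records')
    flattened = pd.json_normalize(dicts)
    return flattened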

## Converting Twitter Data

- We're going to make a retweet network.
- In this network every node will represent a different user.
- An edge between user `a` and user `b` will represent one user retweeting the other.
- Edges will be given a `weight` that counts how many times user `a` retweeted user `b`.
- We will make our network `directed`, meaning that we will record separately:
  - how many times `a` retweeted `b`
  - and how many times `b` retweeted `a`
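
As a minimal sketch of what "directed" and "weighted" mean here, the toy example below counts ordered pairs of users. The user names and retweet events are made up for illustration and are not taken from our dataset.

```python
from collections import Counter

# Hypothetical retweet events, each written as (retweeter, author).
retweets = [("a", "b"), ("a", "b"), ("b", "a"), ("c", "a")]

# Counting ordered pairs keeps direction:
# ("a", "b") and ("b", "a") are treated as two different edges.
weights = Counter(retweets)

for (retweeter, author), weight in sorted(weights.items()):
    print(f"{retweeter} retweeted {author}: weight {weight}")
```

Because the pair is ordered, `a` retweeting `b` twice and `b` retweeting `a` once end up as two separate edges with their own weights.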

#### Stage 1
We currently have data laid out like this...

| author_id | referenced_tweet_id |
| --------- | ------------------- |
| USER A    | TWEET 1             |
| USER B    | TWEET 2             |
| USER C    | TWEET 3             |

#### Stage 2
However we want our data to look like this....

| author_id | referenced_tweet_id | referenced_tweet's author |
| --------- | ------------------- | ------------------------- |
| USER A    | TWEET 1             | USER B                    |
| USER B    | TWEET 2             | USER A                    |
| USER C    | TWEET 3             | USER C                    |


#### Stage 3
Or actually more accurately, just like this...

| retweeter | author |
|-----------|--------|
| USER A    | USER B |
| USER B    | USER A |
| USER C    | USER C |

#### Stage 4
We create a weight column that represents the number of times the retweeter retweeted the author.

| retweeter | author | weight |
|-----------|--------|--------|
| USER A    | USER B | 1      |
| USER B    | USER A | 5      |
| USER C    | USER C | 12     |



## Stage 1

In [3]:
filename = 'mogg_tweets.json'

tweets = pd.read_json(filename)
tweets

Unnamed: 0,author_id,context_annotations,conversation_id,created_at,edit_history_tweet_ids,entities,id,lang,public_metrics,referenced_tweets,source,text,withheld,user_created_at,user_id,user_name,user_public_metrics,user_username,user_withheld
0,937606688034709504,,1603756552598478849,2022-12-16 14:18:53,[1603756552598478849],"{'annotations': [{'start': 17, 'end': 31, 'pro...",1603756552598478849,en,"{'retweet_count': 1124, 'reply_count': 0, 'lik...","[{'data': {'type': 'retweeted', 'id': '1603677...",Twitter for iPhone,RT @BladeoftheS: Jacob Rees-Mogg has avoided m...,,2017-12-04 08:57:09,937606688034709504,Beth Sawyer ebs28@fediscience.org,"{'followers_count': 235, 'following_count': 19...",eb_sawyer,
1,1203336448218456064,"[{'domain': {'id': '10', 'name': 'Person', 'de...",1603756550807592961,2022-12-16 14:18:52,[1603756550807592961],"{'annotations': [{'start': 12, 'end': 22, 'pro...",1603756550807592961,en,"{'retweet_count': 287, 'reply_count': 0, 'like...","[{'data': {'type': 'retweeted', 'id': '1603527...",Twitter for iPhone,RT @Zero_4: Fiona Bruce abbreviating Jacob Ree...,,2019-12-07 15:32:37,1203336448218456064,Art and revolution,"{'followers_count': 91, 'following_count': 112...",Artandrevoluti2,
2,1304562578438451203,,1603756550014656512,2022-12-16 14:18:52,[1603756550014656512],"{'annotations': [{'start': 82, 'end': 90, 'pro...",1603756550014656512,en,"{'retweet_count': 7, 'reply_count': 0, 'like_c...","[{'data': {'type': 'retweeted', 'id': '1603665...",Twitter for iPhone,RT @torysleazeUK: 🏴‍☠️ Just how long do we hav...,,2020-09-11 23:29:16,1304562578438451203,Corporal of the Parish,"{'followers_count': 317, 'following_count': 60...",corporal_of,
3,914786917895561216,"[{'domain': {'id': '46', 'name': 'Business Tax...",1603756547900940288,2022-12-16 14:18:52,[1603756547900940288],"{'mentions': [{'start': 3, 'end': 19, 'usernam...",1603756547900940288,en,"{'retweet_count': 2053, 'reply_count': 0, 'lik...","[{'data': {'type': 'retweeted', 'id': '1603506...",Twitter Web App,"RT @implausibleblog: ""You're taking something ...",,2017-10-02 09:39:31,914786917895561216,Baxter,"{'followers_count': 587, 'following_count': 15...",tuq010,
4,279176310,"[{'domain': {'id': '3', 'name': 'TV Shows', 'd...",1603756545455562752,2022-12-16 14:18:51,[1603756545455562752],"{'annotations': [{'start': 108, 'end': 116, 'p...",1603756545455562752,en,"{'retweet_count': 5, 'reply_count': 0, 'like_c...","[{'data': {'type': 'retweeted', 'id': '1603724...",Twitter for iPad,RT @nickreeves9876: @bbcquestiontime The loss ...,,2011-04-08 18:55:55,279176310,Theresa Travis,"{'followers_count': 6907, 'following_count': 2...",s9tmt,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
21039,19685424,"[{'domain': {'id': '3', 'name': 'TV Shows', 'd...",1603532815030784008,2022-12-15 23:29:49,[1603532815030784008],"{'annotations': [{'start': 1, 'end': 5, 'proba...",1603532815030784008,en,"{'retweet_count': 2, 'reply_count': 0, 'like_c...",,Twitter for Android,#bbcqt Rees Mogg incredibly patronising. If yo...,,2009-01-29 00:02:42,19685424,Karl Wardlaw,"{'followers_count': 979, 'following_count': 41...",KarlosW,
21040,701806221892968449,"[{'domain': {'id': '3', 'name': 'TV Shows', 'd...",1603529272370642946,2022-12-15 23:15:45,[1603529272370642946],"{'annotations': [{'start': 0, 'end': 8, 'proba...",1603529272370642946,en,"{'retweet_count': 4, 'reply_count': 0, 'like_c...",,Twitter for Android,Rees-Mogg and Hitchens - two mediocre right-wi...,,2016-02-22 16:30:17,701806221892968449,Stu,"{'followers_count': 497, 'following_count': 53...",indstu7,
21041,105889848,"[{'domain': {'id': '3', 'name': 'TV Shows', 'd...",1603532700748582912,2022-12-15 23:29:22,[1603532700748582912],"{'urls': [{'start': 144, 'end': 167, 'url': 'h...",1603532700748582912,en,"{'retweet_count': 2, 'reply_count': 0, 'like_c...",,Twitter for iPhone,Never forget - @Jacob_Rees_Mogg … is an absolu...,,2010-01-17 20:56:59,105889848,BeCo,"{'followers_count': 1121, 'following_count': 2...",BeCo74,
21042,474887617,"[{'domain': {'id': '3', 'name': 'TV Shows', 'd...",1603532059812794368,2022-12-15 23:26:49,[1603532059812794368],"{'annotations': [{'start': 65, 'end': 66, 'pro...",1603532059812794368,en,"{'retweet_count': 3, 'reply_count': 0, 'like_c...",,Twitter Web App,How dare @Jacob_Rees_Mogg throw accusations of...,,2012-01-26 13:09:17,474887617,PM in Espana #RejoinEU #GENow,"{'followers_count': 333, 'following_count': 99...",PennyM51,


In [4]:
subset = tweets[['author_id','referenced_tweets']].dropna()
subset

Unnamed: 0,author_id,referenced_tweets
0,937606688034709504,"[{'data': {'type': 'retweeted', 'id': '1603677..."
1,1203336448218456064,"[{'data': {'type': 'retweeted', 'id': '1603527..."
2,1304562578438451203,"[{'data': {'type': 'retweeted', 'id': '1603665..."
3,914786917895561216,"[{'data': {'type': 'retweeted', 'id': '1603506..."
4,279176310,"[{'data': {'type': 'retweeted', 'id': '1603724..."
...,...,...
21032,1033836699263098880,"[{'data': {'type': 'retweeted', 'id': '1603527..."
21033,1284574966671974400,"[{'data': {'type': 'retweeted', 'id': '1603506..."
21034,81820563,"[{'data': {'type': 'retweeted', 'id': '1603529..."
21037,571392805,"[{'data': {'type': 'quoted', 'id': '1603293300..."


In [5]:
# First we deal with everything being in lists by using .explode() to give each list item its own row

edge_data = subset.explode('referenced_tweets').copy()
edge_data

Unnamed: 0,author_id,referenced_tweets
0,937606688034709504,"{'data': {'type': 'retweeted', 'id': '16036779..."
1,1203336448218456064,"{'data': {'type': 'retweeted', 'id': '16035279..."
2,1304562578438451203,"{'data': {'type': 'retweeted', 'id': '16036657..."
3,914786917895561216,"{'data': {'type': 'retweeted', 'id': '16035061..."
4,279176310,"{'data': {'type': 'retweeted', 'id': '16037246..."
...,...,...
21032,1033836699263098880,"{'data': {'type': 'retweeted', 'id': '16035279..."
21033,1284574966671974400,"{'data': {'type': 'retweeted', 'id': '16035061..."
21034,81820563,"{'data': {'type': 'retweeted', 'id': '16035299..."
21037,571392805,"{'data': {'type': 'quoted', 'id': '16032933008..."
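
To see what `.explode` does on a smaller scale, here is a sketch using a made-up two-row dataframe (the ids and reference types are invented for illustration). The second tweet references two other tweets, so it becomes two rows.

```python
import pandas as pd

# Hypothetical data: the second tweet references two other tweets.
toy = pd.DataFrame({
    "author_id": [1, 2],
    "referenced_tweets": [
        [{"type": "retweeted", "id": "10"}],
        [{"type": "quoted", "id": "11"}, {"type": "replied_to", "id": "12"}],
    ],
})

# Each item in a list gets its own row; the other columns are repeated.
exploded = toy.explode("referenced_tweets")
print(exploded)
```

Note that the exploded rows keep the index label of the row they came from, so index labels can repeat after this step.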


In [6]:
# Next we unpack that series of dictionaries into their own columns
edge_data = flatten_nested_dicts(edge_data)
edge_data

Unnamed: 0,author_id,referenced_tweets.data.type,referenced_tweets.data.id,referenced_tweets.id,referenced_tweets.type
0,937606688034709504,retweeted,1603677916554039297,1603677916554039297,retweeted
1,1203336448218456064,retweeted,1603527974606741507,1603527974606741507,retweeted
2,1304562578438451203,retweeted,1603665763826798593,1603665763826798593,retweeted
3,914786917895561216,retweeted,1603506174925611020,1603506174925611020,retweeted
4,279176310,retweeted,1603724614051438593,1603724614051438593,retweeted
...,...,...,...,...,...
20639,1033836699263098880,retweeted,1603527974606741507,1603527974606741507,retweeted
20640,1284574966671974400,retweeted,1603506174925611020,1603506174925611020,retweeted
20641,81820563,retweeted,1603529996466884609,1603529996466884609,retweeted
20642,571392805,quoted,1603293300802392064,1603293300802392064,quoted
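
The flattening step can be sketched on toy data as follows. The values are made up; only `pd.json_normalize` and `to_dict(orient='records')` come from the notebook. The sketch also shows why the row labels in the output above restart from 0: converting to a list of records discards the original index.

```python
import pandas as pd

# Hypothetical frame with a gap in the index, as we might have after dropna().
toy = pd.DataFrame(
    {
        "author_id": [1, 2],
        "referenced_tweets": [
            {"type": "retweeted", "id": "10"},
            {"type": "quoted", "id": "11"},
        ],
    },
    index=[0, 5],
)

# Same two steps as flatten_nested_dicts.
records = toy.to_dict(orient="records")
flat = pd.json_normalize(records)

print(flat.columns.tolist())
print(flat.index.tolist())
```

Nested keys become column names joined with a dot, for example `referenced_tweets.type`.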


In [7]:
# Select just three columns: the original author id, the tweet id of the referenced tweet, and the type of the referenced tweet.
# Then keep only the rows where the reference is a retweet.

edge_data = edge_data[['author_id','referenced_tweets.id', 'referenced_tweets.type']]
edge_data = edge_data[edge_data['referenced_tweets.type'] == 'retweeted']
edge_data

Unnamed: 0,author_id,referenced_tweets.id,referenced_tweets.type
0,937606688034709504,1603677916554039297,retweeted
1,1203336448218456064,1603527974606741507,retweeted
2,1304562578438451203,1603665763826798593,retweeted
3,914786917895561216,1603506174925611020,retweeted
4,279176310,1603724614051438593,retweeted
...,...,...,...
20637,718166364658208769,1603502488581156868,retweeted
20638,2580674796,1603527815672041472,retweeted
20639,1033836699263098880,1603527974606741507,retweeted
20640,1284574966671974400,1603506174925611020,retweeted


So to recap, we have three columns...
- `author_id`: the id of the user that retweeted somebody.
- `referenced_tweets.id`: the id of the tweet that they retweeted.
- `referenced_tweets.type`: the way in which the user 'referenced' a tweet. After filtering, all rows are retweets.

## Stage 2
Our hope is that all the referenced tweets are also in our dataset somewhere else, and so have associated user information. We'll create a new dataframe of tweet ids and associated user ids and then use merge to match them up.

In [8]:
# Here we create a list of all the tweets we collected and associated author id.
user_data = tweets[['id','author_id']]
user_data

Unnamed: 0,id,author_id
0,1603756552598478849,937606688034709504
1,1603756550807592961,1203336448218456064
2,1603756550014656512,1304562578438451203
3,1603756547900940288,914786917895561216
4,1603756545455562752,279176310
...,...,...
21039,1603532815030784008,19685424
21040,1603529272370642946,701806221892968449
21041,1603532700748582912,105889848
21042,1603532059812794368,474887617


In [9]:
edge_data = edge_data.merge(user_data, how='left',left_on='referenced_tweets.id', right_on='id')
edge_data

Unnamed: 0,author_id_x,referenced_tweets.id,referenced_tweets.type,id,author_id_y
0,937606688034709504,1603677916554039297,retweeted,1603677916554039297,1455903807389458436
1,1203336448218456064,1603527974606741507,retweeted,1603527974606741507,19674092
2,1304562578438451203,1603665763826798593,retweeted,1603665763826798593,1382431445063495688
3,914786917895561216,1603506174925611020,retweeted,1603506174925611020,199452338
4,279176310,1603724614051438593,retweeted,1603724614051438593,746371177174679552
...,...,...,...,...,...
19995,718166364658208769,1603502488581156868,retweeted,1603502488581156868,45970216
19996,2580674796,1603527815672041472,retweeted,1603527815672041472,1126511673794129920
19997,1033836699263098880,1603527974606741507,retweeted,1603527974606741507,19674092
19998,1284574966671974400,1603506174925611020,retweeted,1603506174925611020,199452338
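
One way to see how many referenced tweets were actually found in the dataset is the `indicator` option of `merge`. The sketch below uses made-up ids, and the dataframe names `edges` and `tweets_lookup` are placeholders for illustration rather than variables from this notebook.

```python
import pandas as pd

# Hypothetical edge rows: tweet 99 is not in our collected tweets.
edges = pd.DataFrame({
    "author_id": [1, 2, 3],
    "referenced_tweets.id": [10, 11, 99],
})

# Hypothetical lookup of collected tweets and who wrote them.
tweets_lookup = pd.DataFrame({"id": [10, 11], "author_id": [2, 1]})

merged = edges.merge(
    tweets_lookup,
    how="left",
    left_on="referenced_tweets.id",
    right_on="id",
    indicator=True,  # adds a _merge column: 'both' or 'left_only'
)

print(merged["_merge"].value_counts())
```

Because both dataframes have an `author_id` column, pandas adds the `_x` and `_y` suffixes seen in the output above: `_x` comes from the left dataframe (the retweeter) and `_y` from the right (the original author).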


## Stage 3

In [10]:
# Here we check if we're missing any author info for any reason. If we were we'd just use .dropna() to drop any rows with missing info.
edge_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 20000 entries, 0 to 19999
Data columns (total 5 columns):
 #   Column                  Non-Null Count  Dtype 
---  ------                  --------------  ----- 
 0   author_id_x             20000 non-null  int64 
 1   referenced_tweets.id    20000 non-null  int64 
 2   referenced_tweets.type  20000 non-null  object
 3   id                      20000 non-null  int64 
 4   author_id_y             20000 non-null  int64 
dtypes: int64(4), object(1)
memory usage: 937.5+ KB
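
In our data nothing is missing, but here is a sketch of what the missing case might look like, using toy data with invented column names. One caveat worth knowing: a missing value turns an integer column into floats, and Twitter ids are too large for a float to store exactly, so an inner merge may be a safer way to drop unmatched rows than a left merge followed by `.dropna()`.

```python
import pandas as pd

# Hypothetical data: tweet 99 has no known author.
edges = pd.DataFrame({"retweeter": [1, 2, 3], "tweet_id": [10, 11, 99]})
lookup = pd.DataFrame({"tweet_id": [10, 11], "author": [2, 1]})

# A left merge keeps the unmatched row and fills the author with NaN,
# which forces the whole column to become float64.
left = edges.merge(lookup, how="left", on="tweet_id")
n_missing = left["author"].isna().sum()
cleaned = left.dropna(subset=["author"])

# An inner merge drops unmatched rows directly and keeps integer ids intact.
inner = edges.merge(lookup, how="inner", on="tweet_id")

print(n_missing, left["author"].dtype, inner["author"].dtype)

# Why floats are risky for ids: this id (taken from the output above)
# does not survive a round trip through float.
big_id = 1455903807389458436
print(int(float(big_id)) == big_id)
```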


In [11]:
# Here we will rename the columns to be more descriptive, and drop any columns we don't need.

new_cols = {'author_id_x':'retweeter', 'author_id_y':'author'}
edge_data = edge_data.rename(columns=new_cols).drop(columns=['referenced_tweets.id','id','referenced_tweets.type'])
edge_data

Unnamed: 0,retweeter,author
0,937606688034709504,1455903807389458436
1,1203336448218456064,19674092
2,1304562578438451203,1382431445063495688
3,914786917895561216,199452338
4,279176310,746371177174679552
...,...,...
19995,718166364658208769,45970216
19996,2580674796,1126511673794129920
19997,1033836699263098880,19674092
19998,1284574966671974400,199452338


## Stage 4

At the moment we have one row per instance of retweeting between a pair of users. However, one user may have retweeted another several times, in which case there are duplicate rows. Rather than discard this information, we'll capture it by adding weights to our edges that count how many times the retweeter retweeted the author.

In [12]:
# First we give every edge a weight of 1, because each row represents 1 instance of retweeting

edge_data['weight'] = 1
edge_data

Unnamed: 0,retweeter,author,weight
0,937606688034709504,1455903807389458436,1
1,1203336448218456064,19674092,1
2,1304562578438451203,1382431445063495688,1
3,914786917895561216,199452338,1
4,279176310,746371177174679552,1
...,...,...,...
19995,718166364658208769,45970216,1
19996,2580674796,1126511673794129920,1
19997,1033836699263098880,19674092,1
19998,1284574966671974400,199452338,1


Now we use `groupby` to gather together rows that have the same combination of `retweeter` and `author`, and add their weight values together.

In [13]:
edge_data = edge_data.groupby(['retweeter','author'],as_index=False).sum()
edge_data.sort_values('weight',ascending=False)

Unnamed: 0,retweeter,author,weight
17851,1451964806383210499,1453753755321708547,10
16072,1240935329462444033,1453753755321708547,9
18690,1528285938161532928,1539378523,7
10690,2830030004,1539378523,4
12488,748628668507885568,748628668507885568,4
...,...,...,...
6544,477914145,864030687120281602,1
6543,477914145,2846809907,1
6542,477914145,199452338,1
6541,477914145,58436094,1


Now our `edge_data` has one row per pair, and a weight indicating how many times that pair appeared in the data. Finally we just need to relabel the columns so that Gephi understands them.

In network analysis when we talk about edges we refer to a `source` and a `target`. If you imagine an edge as an arrow the `source` is where the arrow starts and the `target` is where it points to.

In our case the direction matters: just because USER A retweeted USER B doesn't mean that USER B retweeted them back. The relationship is not necessarily mutual, unlike, say, a friendship, where we wouldn't necessarily consider there to be a direction to the connection. Which is the source and which is the target though? Well, either, depending on how you define what the edge represents.

We could say the edge means `RETWEETED`, so it would be...

```
(SOURCE: retweeter) -[RETWEETED]-> (TARGET: author)
```

But equally we could say the edge means `RETWEETED_BY`, meaning the positions would be reversed.

```
(SOURCE: author) -[RETWEETED_BY]-> (TARGET: retweeter)
```

In this case, it doesn't matter what we choose, so long as we remember what the direction of our edge represents. We'll go with `RETWEETED`.
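
The two options differ only in which column is labelled as the source. A small sketch with made-up ids shows both labellings side by side:

```python
import pandas as pd

# Hypothetical weighted pairs.
edges = pd.DataFrame({
    "retweeter": [1, 2],
    "author": [2, 3],
    "weight": [4, 1],
})

# Option 1: the edge means RETWEETED, so the arrow starts at the retweeter.
retweeted = edges.rename(columns={"retweeter": "Source", "author": "Target"})

# Option 2: the edge means RETWEETED_BY, so the arrow starts at the author.
retweeted_by = edges.rename(columns={"author": "Source", "retweeter": "Target"})
retweeted_by = retweeted_by[["Source", "Target", "weight"]]

print(retweeted)
print(retweeted_by)
```

The weights are identical in both versions; only the direction of the arrows flips.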


In [14]:
gephi_edge_labels = {'retweeter':'Source','author':'Target'}
edge_data = edge_data.rename(columns=gephi_edge_labels)
edge_data

Unnamed: 0,Source,Target,weight
0,17773,20909048,1
1,66873,1299017696416272385,1
2,73963,1472707884693733380,1
3,79553,19674092,1
4,79553,1652177576,1
...,...,...,...
19411,1603309652233670656,40442193,1
19412,1603398425810243585,1278401163373707267,1
19413,1603414321324761089,199452338,1
19414,1603414574589427714,19674092,1


# Nodes
This dataframe will be a list of unique nodes and we will assign some attributes to the nodes that we can use in Gephi later on.
First we grab the relevant columns from our original dataset. It may have duplicates as each row represents a tweet, and it may have users that didn't end up in our edge table but we'll deal with that soon.

In [15]:
node_data = tweets[['user_id','user_name','user_public_metrics']]
node_data

Unnamed: 0,user_id,user_name,user_public_metrics
0,937606688034709504,Beth Sawyer ebs28@fediscience.org,"{'followers_count': 235, 'following_count': 19..."
1,1203336448218456064,Art and revolution,"{'followers_count': 91, 'following_count': 112..."
2,1304562578438451203,Corporal of the Parish,"{'followers_count': 317, 'following_count': 60..."
3,914786917895561216,Baxter,"{'followers_count': 587, 'following_count': 15..."
4,279176310,Theresa Travis,"{'followers_count': 6907, 'following_count': 2..."
...,...,...,...
21039,19685424,Karl Wardlaw,"{'followers_count': 979, 'following_count': 41..."
21040,701806221892968449,Stu,"{'followers_count': 497, 'following_count': 53..."
21041,105889848,BeCo,"{'followers_count': 1121, 'following_count': 2..."
21042,474887617,PM in Espana #RejoinEU #GENow,"{'followers_count': 333, 'following_count': 99..."


In [16]:
# First we drop any duplicates because we simply need one row per user
node_data = node_data.drop_duplicates('user_id')
node_data

Unnamed: 0,user_id,user_name,user_public_metrics
0,937606688034709504,Beth Sawyer ebs28@fediscience.org,"{'followers_count': 235, 'following_count': 19..."
1,1203336448218456064,Art and revolution,"{'followers_count': 91, 'following_count': 112..."
2,1304562578438451203,Corporal of the Parish,"{'followers_count': 317, 'following_count': 60..."
3,914786917895561216,Baxter,"{'followers_count': 587, 'following_count': 15..."
4,279176310,Theresa Travis,"{'followers_count': 6907, 'following_count': 2..."
...,...,...,...
21036,112853509,☮️ Matt Hill 🇪🇺 🚴 🏳️‍🌈,"{'followers_count': 1187, 'following_count': 9..."
21038,499083263,Kwame Gyamfi,"{'followers_count': 229, 'following_count': 47..."
21039,19685424,Karl Wardlaw,"{'followers_count': 979, 'following_count': 41..."
21040,701806221892968449,Stu,"{'followers_count': 497, 'following_count': 53..."


In [17]:
# Next we create a list of all users that are actually in our edge list

nodes_in_network = pd.concat([edge_data['Source'], edge_data['Target']], axis=0).drop_duplicates()
nodes_in_network

0                      17773
1                      66873
2                      73963
3                      79553
7                     465973
                ...         
19239    1595461088141008896
19272             1944262404
19274              460819776
19398              315264228
19408              402593408
Length: 10996, dtype: int64

In [18]:
node_data = node_data[node_data['user_id'].isin(nodes_in_network)]
node_data

Unnamed: 0,user_id,user_name,user_public_metrics
0,937606688034709504,Beth Sawyer ebs28@fediscience.org,"{'followers_count': 235, 'following_count': 19..."
1,1203336448218456064,Art and revolution,"{'followers_count': 91, 'following_count': 112..."
2,1304562578438451203,Corporal of the Parish,"{'followers_count': 317, 'following_count': 60..."
3,914786917895561216,Baxter,"{'followers_count': 587, 'following_count': 15..."
4,279176310,Theresa Travis,"{'followers_count': 6907, 'following_count': 2..."
...,...,...,...
21036,112853509,☮️ Matt Hill 🇪🇺 🚴 🏳️‍🌈,"{'followers_count': 1187, 'following_count': 9..."
21038,499083263,Kwame Gyamfi,"{'followers_count': 229, 'following_count': 47..."
21039,19685424,Karl Wardlaw,"{'followers_count': 979, 'following_count': 41..."
21040,701806221892968449,Stu,"{'followers_count': 497, 'following_count': 53..."
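
The filter above removes users who are not in the network. It can also be worth checking the other direction: does every id in the edge list have a row of node attributes? The following is a sketch on toy data, with invented ids and names.

```python
import pandas as pd

# Hypothetical edge list and node table.
edges = pd.DataFrame({"Source": [1, 2, 3], "Target": [2, 1, 4]})
nodes = pd.DataFrame({
    "user_id": [1, 2, 3, 5],
    "user_name": ["a", "b", "c", "e"],
})

# Every unique id that appears at either end of an edge.
ids_in_edges = pd.concat([edges["Source"], edges["Target"]]).drop_duplicates()

# Keep only users in the network (user 5 is dropped).
kept = nodes[nodes["user_id"].isin(ids_in_edges)]

# Ids in the network with no attribute row (user 4 here).
missing = ids_in_edges[~ids_in_edges.isin(nodes["user_id"])]

print(kept["user_id"].tolist())
print(missing.tolist())
```

If `missing` is empty, every node in the edge list has attributes to go with it.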


In [19]:
# Now let's expand out our user metrics

node_data = flatten_nested_dicts(node_data)
node_data

Unnamed: 0,user_id,user_name,user_public_metrics.followers_count,user_public_metrics.following_count,user_public_metrics.tweet_count,user_public_metrics.listed_count
0,937606688034709504,Beth Sawyer ebs28@fediscience.org,235,192,1241,2
1,1203336448218456064,Art and revolution,91,112,6914,0
2,1304562578438451203,Corporal of the Parish,317,605,82014,0
3,914786917895561216,Baxter,587,1562,2010,2
4,279176310,Theresa Travis,6907,2060,1756575,999
...,...,...,...,...,...,...
10991,112853509,☮️ Matt Hill 🇪🇺 🚴 🏳️‍🌈,1187,995,22545,19
10992,499083263,Kwame Gyamfi,229,473,7490,0
10993,19685424,Karl Wardlaw,979,4194,13838,13
10994,701806221892968449,Stu,497,539,9580,5


In [20]:
# Finally we need to relabel our columns for Gephi

gephi_node_labels = {'user_id':'ID','user_name':'Label',
                     'user_public_metrics.followers_count':'followers_count',
                     'user_public_metrics.following_count':'following_count',
                     'user_public_metrics.tweet_count':'tweet_count',
                     'user_public_metrics.listed_count':'listed_count'}

node_data = node_data.rename(columns=gephi_node_labels)
node_data


Unnamed: 0,ID,Label,followers_count,following_count,tweet_count,listed_count
0,937606688034709504,Beth Sawyer ebs28@fediscience.org,235,192,1241,2
1,1203336448218456064,Art and revolution,91,112,6914,0
2,1304562578438451203,Corporal of the Parish,317,605,82014,0
3,914786917895561216,Baxter,587,1562,2010,2
4,279176310,Theresa Travis,6907,2060,1756575,999
...,...,...,...,...,...,...
10991,112853509,☮️ Matt Hill 🇪🇺 🚴 🏳️‍🌈,1187,995,22545,19
10992,499083263,Kwame Gyamfi,229,473,7490,0
10993,19685424,Karl Wardlaw,979,4194,13838,13
10994,701806221892968449,Stu,497,539,9580,5


In [21]:
node_data.to_csv('mogg_node_list.csv',index=False)
edge_data.to_csv('mogg_edge_list.csv',index=False)
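
The `index=False` argument matters here. As a sketch, the toy example below writes a small edge list to an in-memory buffer instead of a file and reads it back, with and without the index. The helper `round_trip` is a hypothetical function written only for this illustration.

```python
import io
import pandas as pd

# Hypothetical edge list.
edges = pd.DataFrame({"Source": [1, 2], "Target": [2, 1], "weight": [3, 1]})


def round_trip(df, **to_csv_kwargs):
    """Write df as CSV to an in-memory buffer and read it straight back."""
    buffer = io.StringIO()
    df.to_csv(buffer, **to_csv_kwargs)
    buffer.seek(0)
    return pd.read_csv(buffer)


without_index = round_trip(edges, index=False)
with_index = round_trip(edges)

print(without_index.columns.tolist())
print(with_index.columns.tolist())
```

Without `index=False`, the row labels are written as an extra unnamed column, which is not part of our edge or node data.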