<img src="https://github.com/Minyall/sc207_materials/blob/master/images/gephi_network.png?raw=true" align="right" width="300">


# SC207 - Session 8
# Social Network Analysis with Gephi
## Reshaping your Data into a Network


### Imports

Today we will just need...
- Pandas to import and reshape our Twitter data

- <img src="https://github.com/Minyall/sc207_materials/blob/master/images/gephi-logo-2010-transparent.png?raw=true" align="left" width="75">....to visualise and explore our data.


In [2]:
import pandas as pd

## Converting Twitter Data

- We're going to make a retweet network.
- In this network every node will represent a different user.
- An edge between user `a` and user `b` will represent one user retweeting the other.
- Edges will be given a `weight` that counts how many times user `a` retweeted user `b`.
- We will make our network `directional`, meaning that we will record separately
  -     how many times `a` retweeted -> `b`
  -     and how many times `b` retweeted -> `a`

In [38]:
filename = 'example_twitter_data_2021.pkl'
# filename = 'small_example_twitter_data_2021.pkl'
tweets = pd.read_pickle(filename)

#### Our Unpacking Functions

In [None]:
def flatten_nested_dicts(df):
    dicts = df.to_dict(orient='records')
    flattened = pd.json_normalize(dicts)
    return flattened
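
To see what this function does before running it on the full dataset, here is a minimal sketch on a toy dataframe. The toy rows and the names `toy` and `flat` are made up for illustration and are not part of the Twitter data.

```python
import pandas as pd

# Toy stand-in for the tweet data: each cell in 'user' holds a nested dict
toy = pd.DataFrame({
    'id': [1, 2],
    'user': [
        {'screen_name': 'alice', 'followers_count': 10},
        {'screen_name': 'bob', 'followers_count': 25},
    ],
})

def flatten_nested_dicts(df):
    # Turn each row into a plain dict, then let json_normalize expand nested keys
    dicts = df.to_dict(orient='records')
    flattened = pd.json_normalize(dicts)
    return flattened

flat = flatten_nested_dicts(toy)
# Nested keys become dotted column names such as 'user.screen_name'
print(flat.columns.tolist())
```

Each nested key ends up as its own column, named with the parent column, a dot, and the key. This is where names like `user.screen_name` come from.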

### Nodes
- User screen names as the id and Label.
- User followers count and statuses count as the node attributes

### Edges 
-  `user.screen_name` and the `retweeted_status.user.screen_name` as the two ends of our edges.

In [40]:
# First let's grab just the portion of data we need

tweet_network_data = tweets[['id', 'user', 'retweeted_status']]
tweet_network_data.head()

Unnamed: 0,id,user,retweeted_status
0,1462741045532364804,"{'id': 2815044799, 'id_str': '2815044799', 'na...",{'created_at': 'Mon Nov 22 09:44:47 +0000 2021...
1,1462741045393997840,"{'id': 707840609701199872, 'id_str': '70784060...",{'created_at': 'Mon Nov 22 06:52:46 +0000 2021...
2,1462741044525768707,"{'id': 351180352, 'id_str': '351180352', 'name...",{'created_at': 'Mon Nov 22 07:34:40 +0000 2021...
3,1462741040037871623,"{'id': 1351924646019538945, 'id_str': '1351924...",{'created_at': 'Mon Nov 22 09:44:47 +0000 2021...
4,1462741038444032000,"{'id': 242341151, 'id_str': '242341151', 'name...",{'created_at': 'Mon Nov 22 11:10:08 +0000 2021...


In [41]:
# and then we unpack...may take a few seconds...

tweet_network_data = flatten_nested_dicts(tweet_network_data)
tweet_network_data.head()

Unnamed: 0,id,user.id,user.id_str,user.name,user.screen_name,user.location,user.description,user.url,user.entities.url.urls,user.entities.description.urls,...,retweeted_status.quoted_status.place.url,retweeted_status.quoted_status.place.place_type,retweeted_status.quoted_status.place.name,retweeted_status.quoted_status.place.full_name,retweeted_status.quoted_status.place.country_code,retweeted_status.quoted_status.place.country,retweeted_status.quoted_status.place.contained_within,retweeted_status.quoted_status.place.bounding_box.type,retweeted_status.quoted_status.place.bounding_box.coordinates,retweeted_status.quoted_status.scopes.followers
0,1462741045532364804,2815044799,2815044799,Graham Wright,gwright110,London,A London based composer / producer / guitarist...,http://t.co/vPZhf6bMXi,"[{'url': 'http://t.co/vPZhf6bMXi', 'expanded_u...",[],...,,,,,,,,,,
1,1462741045393997840,707840609701199872,707840609701199872,Katy Milner 💙,shoesnvelcro,"City of London, England",,,,[],...,,,,,,,,,,
2,1462741044525768707,351180352,351180352,John O,joliv1202,west of london,middle aged mid life crisis?...what crisis? an...,,,[],...,,,,,,,,,,
3,1462741040037871623,1351924646019538945,1351924646019538945,BookshopWoman,BookshopW,"Todmorden, England",Secondhand bookshop owner and artist. Socialis...,https://t.co/24sPaZZvZD,"[{'url': 'https://t.co/24sPaZZvZD', 'expanded_...",[],...,,,,,,,,,,
4,1462741038444032000,242341151,242341151,terry morgan,viewvalley,"cefn-hengoed, wales",,,,[],...,,,,,,,,,,


Next we create our edge list, which represents who retweets whom.

`user.screen_name` is the user that initiated the retweet, whilst `retweeted_status.user.screen_name` is the original author of the tweet being retweeted.

We can think of this edge like so...


(`user.screen_name`) -[RETWEETED]-> (`retweeted_status.user.screen_name`)

In [42]:
edges = tweet_network_data[['user.screen_name', 'retweeted_status.user.screen_name']]
edges.head()

Unnamed: 0,user.screen_name,retweeted_status.user.screen_name
0,gwright110,JujuliaGrace
1,shoesnvelcro,JujuliaGrace
2,joliv1202,maryeffrancis
3,BookshopW,JujuliaGrace
4,viewvalley,7_StarGirlx


Some of these tweets will not be Retweets, and so will have a `NaN` value in the `retweeted_status.user.screen_name` column. We can check with `edges.info()`

In [43]:
# check the info
edges.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50860 entries, 0 to 50859
Data columns (total 2 columns):
 #   Column                             Non-Null Count  Dtype 
---  ------                             --------------  ----- 
 0   user.screen_name                   50860 non-null  object
 1   retweeted_status.user.screen_name  43472 non-null  object
dtypes: object(2)
memory usage: 794.8+ KB


Let's drop any rows that don't have a value under `retweeted_status.user.screen_name`

In [44]:
# we could be specific with a filter but we can also use .dropna as a shortcut

edges = edges.dropna()
edges.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 43472 entries, 0 to 50858
Data columns (total 2 columns):
 #   Column                             Non-Null Count  Dtype 
---  ------                             --------------  ----- 
 0   user.screen_name                   43472 non-null  object
 1   retweeted_status.user.screen_name  43472 non-null  object
dtypes: object(2)
memory usage: 1018.9+ KB
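
The comment above mentions two routes, an explicit filter or `.dropna`. As a rough sketch on made-up data (the three toy rows are hypothetical), both give the same result:

```python
import numpy as np
import pandas as pd

# Toy edge list: the second row is not a retweet, so its target is NaN
toy_edges = pd.DataFrame({
    'user.screen_name': ['alice', 'bob', 'carol'],
    'retweeted_status.user.screen_name': ['bob', np.nan, 'alice'],
})

target_col = 'retweeted_status.user.screen_name'

# Option 1: an explicit boolean filter keeping rows where the target is present
filtered = toy_edges[toy_edges[target_col].notna()]

# Option 2: dropna, restricted to the one column we care about
dropped = toy_edges.dropna(subset=[target_col])

# Both approaches keep the same rows
print(filtered.equals(dropped))
```

Neither approach renumbers the index, so the remaining rows keep their original labels. That is why the `.info()` output above reports fewer entries but still shows an index running up to 50858.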


In [45]:
# Rename the columns so that it is clear which is source and which is target
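edges = edges.rename(columns={'user.screen_name': 'source',
                              'retweeted_status.user.screen_name': 'target'})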

In [46]:
edges.head()

Unnamed: 0,source,target
0,gwright110,JujuliaGrace
1,shoesnvelcro,JujuliaGrace
2,joliv1202,maryeffrancis
3,BookshopW,JujuliaGrace
4,viewvalley,7_StarGirlx


We also said we wanted just one edge for each source and target pair, with a `weight` score on the edge indicating how many times that retweeting had happened. We can do this quickly using the Pandas `.groupby` method.

In [47]:
# first we give every edge a weight of 1
edges['weight'] = 1
edges

Unnamed: 0,source,target,weight
0,gwright110,JujuliaGrace,1
1,shoesnvelcro,JujuliaGrace,1
2,joliv1202,maryeffrancis,1
3,BookshopW,JujuliaGrace,1
4,viewvalley,7_StarGirlx,1
...,...,...,...
50852,PeterScott2,JujuliaGrace,1
50853,PennyJo95728044,obbitz,1
50854,piyakhanna,JujuliaGrace,1
50855,ThatsWhatUGets,WCountryBylines,1


In [48]:
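# Then we group by both the source and the target columns and sum together the weights.
# Sorting by weight lets us examine the heaviest edges first.
edges.groupby(['source', 'target'], as_index=False)['weight'].sum().sort_values('weight', ascending=False)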

Unnamed: 0,source,target,weight
8848,KarenJu35962200,KarenJu35962200,46
5904,GammonBaitNews,JujuliaGrace,19
18290,angelastreet10,JujuliaGrace,19
12662,PHughes74470229,JujuliaGrace,18
27210,nillie,JujuliaGrace,18
...,...,...,...
11716,MrPangolian,JujuliaGrace,1
11715,MrNishKumar,BellRibeiroAddy,1
11714,MrNegroMilitant,JujuliaGrace,1
11712,MrMarkEThomas,drppalazzolo,1


In [49]:
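# Looks good, let's finalise that by overwriting our edges variable - no need to sort it
edges = edges.groupby(['source', 'target'], as_index=False)['weight'].sum()
# Set the edge type
edges['edge_type'] = 'retweeted'
edges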

Unnamed: 0,source,target,weight,edge_type
0,007dmax,JJustine75,1,retweeted
1,007dmax,JujuliaGrace,6,retweeted
2,007dmax,KarenJu35962200,1,retweeted
3,007dmax,Marshajane,1,retweeted
4,007dmax,NurseSayNO,2,retweeted
...,...,...,...,...
32087,zoradi63,JujuliaGrace,3,retweeted
32088,zoradi63,elaine_patten,1,retweeted
32089,zoyadin27,JujuliaGrace,1,retweeted
32090,zshnr,JujuliaGrace,1,retweeted
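
The weighting steps above can be sketched on a tiny, made-up edge list where the result is easy to check by eye. The user names here are hypothetical.

```python
import pandas as pd

# Toy edge list: alice retweets bob twice, bob retweets alice once,
# and alice retweets carol once
toy_edges = pd.DataFrame({
    'source': ['alice', 'alice', 'bob', 'alice'],
    'target': ['bob', 'bob', 'alice', 'carol'],
})

# Every individual retweet starts with a weight of 1
toy_edges['weight'] = 1

# Collapse repeated (source, target) pairs into one row, summing the weights
weighted = toy_edges.groupby(['source', 'target'], as_index=False)['weight'].sum()
print(weighted)
```

Because the grouping uses both columns, `alice -> bob` and `bob -> alice` stay as separate rows. That is what keeps the network directional.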


# Nodes
This dataframe will be a list of unique nodes and we will assign some attributes to the nodes that we can use in Gephi later on.

First we take both the source and target columns, and append one to the other to make a long list of every user in the edges.

We drop duplicates, as users may appear multiple times, and then convert the resulting Series to a dataframe using `.to_frame`, naming its single column `id`.

In [68]:
# first we take our source and target column, and stack them on top of each other
# This will create a list of every user in our edge list

node_names = pd.concat([edges['source'], edges['target']])
node_names

0              007dmax
1              007dmax
2              007dmax
3              007dmax
4              007dmax
             ...      
32087     JujuliaGrace
32088    elaine_patten
32089     JujuliaGrace
32090     JujuliaGrace
32091     JujuliaGrace
Length: 64184, dtype: object

In [69]:
# We have duplicates because our edge list relies on duplicating names to properly represent
# how one user may have formed edges between multiple other nodes.

# A node list should be a list of unique nodes and their attributes, 
# so we will drop the duplicates and turn the Series into a DataFrame
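# resetting the index gives the unique nodes a clean 0..n-1 numbering
unique_nodes = node_names.drop_duplicates().to_frame(name='id').reset_index(drop=True)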

unique_nodes

Unnamed: 0,id
0,007dmax
1,0121_pedro
2,01Leaford
3,01ajr
4,01garlicmonster
...,...
19541,hoxtonspanish
19542,ifonlyalabama
19543,SageSussex
19544,amirshaikh1
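
As a minimal sketch of the stack and de-duplicate steps, here is the same idea on a made-up three-row edge list. `pd.concat` is used here to stack the two columns; the toy names are hypothetical.

```python
import pandas as pd

toy_edges = pd.DataFrame({
    'source': ['alice', 'alice', 'bob'],
    'target': ['bob', 'carol', 'alice'],
})

# Stack the source column on top of the target column
node_names = pd.concat([toy_edges['source'], toy_edges['target']])

# Keep the first occurrence of each name, then turn the Series into
# a one-column dataframe with a fresh 0..n-1 index
unique_nodes = (
    node_names.drop_duplicates()
    .to_frame(name='id')
    .reset_index(drop=True)
)
print(unique_nodes)
```

Note that `carol` only ever appears as a target, yet she still ends up in the node list. This is why both columns are stacked rather than using `source` alone.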


Gephi will use the `id` column to match nodes in the node list to the nodes mentioned in the edge list. Finally we provide a `Label` column, which is the same as the `id` column. Gephi displays this column's values when node labels are switched on.

In [70]:

# Set the label
unique_nodes['Label'] = unique_nodes['id']

unique_nodes.head()

Unnamed: 0,id,Label
0,007dmax,007dmax
1,0121_pedro,0121_pedro
2,01Leaford,01Leaford
3,01ajr,01ajr
4,01garlicmonster,01garlicmonster


We want to ensure each user node has its `user.statuses_count` and `user.followers_count` associated. We will need to get these from our original dataframe.

In [71]:
attribute_columns = ['user.screen_name','user.followers_count','user.statuses_count']
user_data = tweet_network_data[attribute_columns]
user_data.head()

Unnamed: 0,user.screen_name,user.followers_count,user.statuses_count
0,gwright110,361,1995
1,shoesnvelcro,272,631
2,joliv1202,202,8621
3,BookshopW,391,4594
4,viewvalley,417,121576


Currently `user_data` is essentially a list of tweets showing just the username, plus the status count and follower count of the user at the point they tweeted. This means that a user may occur more than once in the list, with different values.

The solution is to ask Pandas to find all the tweets in the dataset for each user, and then choose the highest values it can find in those tweets for each user. We do this with `.groupby` and `.max` to aggregate the data.

In [72]:
# group by screen name and get the max values
user_data = user_data.groupby('user.screen_name', as_index=False).max()
user_data

Unnamed: 0,user.screen_name,user.followers_count,user.statuses_count
0,007dmax,245,118469
1,0121_pedro,727,35060
2,01Leaford,373,9318
3,01ajr,1273,40350
4,01garlicmonster,1991,11700
...,...,...,...
20200,zoombini,181,7997
20201,zoradi63,447,18538
20202,zoyadin27,59,4829
20203,zshnr,341,10584
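
The aggregation above can be illustrated with a small, made-up table in which one user appears twice with slightly different counts. The names and values are hypothetical.

```python
import pandas as pd

# alice tweeted twice; her counts changed slightly between the two tweets
toy_user_data = pd.DataFrame({
    'user.screen_name': ['alice', 'bob', 'alice'],
    'user.followers_count': [10, 25, 12],
    'user.statuses_count': [100, 40, 103],
})

# One row per user, keeping the largest value seen in each column
per_user = toy_user_data.groupby('user.screen_name', as_index=False).max()
print(per_user)
```

The maximum is taken for each column independently, so the two values kept for a user need not come from the same tweet.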


For our list of nodes, we now want to find the corresponding data in our `user_data` variable, for each user and include it in our `unique_nodes` list.

We can do this with a "left `.merge`" which matches the two dataframes on a specified column and then copies the data from the "right" dataframe to the corresponding rows in the "left" dataframe.

In [73]:
# unique_nodes is on the left, user_data is on the right

nodes = unique_nodes.merge(user_data, how='left', left_on='id', right_on='user.screen_name')

In [74]:
nodes

Unnamed: 0,id,Label,user.screen_name,user.followers_count,user.statuses_count
0,007dmax,007dmax,007dmax,245.0,118469.0
1,0121_pedro,0121_pedro,0121_pedro,727.0,35060.0
2,01Leaford,01Leaford,01Leaford,373.0,9318.0
3,01ajr,01ajr,01ajr,1273.0,40350.0
4,01garlicmonster,01garlicmonster,01garlicmonster,1991.0,11700.0
...,...,...,...,...,...
19541,hoxtonspanish,hoxtonspanish,hoxtonspanish,159.0,3687.0
19542,ifonlyalabama,ifonlyalabama,ifonlyalabama,2325.0,16547.0
19543,SageSussex,SageSussex,SageSussex,42.0,977.0
19544,amirshaikh1,amirshaikh1,amirshaikh1,4.0,38.0


In [75]:
# no need for the extra user.screen_name column
nodes = nodes.drop(columns=['user.screen_name'])
nodes

Unnamed: 0,id,Label,user.followers_count,user.statuses_count
0,007dmax,007dmax,245.0,118469.0
1,0121_pedro,0121_pedro,727.0,35060.0
2,01Leaford,01Leaford,373.0,9318.0
3,01ajr,01ajr,1273.0,40350.0
4,01garlicmonster,01garlicmonster,1991.0,11700.0
...,...,...,...,...
19541,hoxtonspanish,hoxtonspanish,159.0,3687.0
19542,ifonlyalabama,ifonlyalabama,2325.0,16547.0
19543,SageSussex,SageSussex,42.0,977.0
19544,amirshaikh1,amirshaikh1,4.0,38.0
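
To make the behaviour of the left merge concrete, here is a small sketch on made-up data in which one node has no matching row on the right. All names and values are hypothetical.

```python
import pandas as pd

toy_nodes = pd.DataFrame({'id': ['alice', 'bob', 'carol']})

# carol never tweeted in this toy dataset, so she has no attribute row
toy_user_data = pd.DataFrame({
    'user.screen_name': ['alice', 'bob'],
    'user.followers_count': [12, 25],
})

# Keep every row of the left dataframe and attach matches from the right
merged = toy_nodes.merge(
    toy_user_data, how='left', left_on='id', right_on='user.screen_name'
)
merged = merged.drop(columns=['user.screen_name'])
print(merged)
```

A left merge never drops a node: where no match exists, the attribute is filled with `NaN`. Pandas then stores the column as floats. This is probably why the counts in the output above display as `245.0` rather than `245`, assuming some nodes in the real data lack a match.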


In [76]:
edges

Unnamed: 0,source,target,weight,edge_type
0,007dmax,JJustine75,1,retweeted
1,007dmax,JujuliaGrace,6,retweeted
2,007dmax,KarenJu35962200,1,retweeted
3,007dmax,Marshajane,1,retweeted
4,007dmax,NurseSayNO,2,retweeted
...,...,...,...,...
32087,zoradi63,JujuliaGrace,3,retweeted
32088,zoradi63,elaine_patten,1,retweeted
32089,zoyadin27,JujuliaGrace,1,retweeted
32090,zshnr,JujuliaGrace,1,retweeted


In [77]:
nodes.to_csv('retweet_node_list.csv',index=False)
edges.to_csv('retweet_edge_list.csv',index=False)
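
Since the node `id` column has to match the names used in the edge list, a quick consistency check before opening the files can save some confusion later. The helper below is a hypothetical sketch and not part of the session materials. It is shown on toy data, but the same call should work on the real `nodes` and `edges` dataframes.

```python
import pandas as pd

def edge_endpoints_missing_from_nodes(nodes, edges):
    """Return the set of edge endpoints that have no row in the node list."""
    endpoints = set(edges['source']) | set(edges['target'])
    return endpoints - set(nodes['id'])

# Toy example: carol is used in an edge but is absent from the node list
toy_nodes = pd.DataFrame({'id': ['alice', 'bob']})
toy_edges = pd.DataFrame({
    'source': ['alice', 'bob'],
    'target': ['bob', 'carol'],
})

missing = edge_endpoints_missing_from_nodes(toy_nodes, toy_edges)
print(missing)
```

An empty set means every `source` and `target` has a corresponding node row.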

Now we go to...

<img src="https://github.com/Minyall/sc207_materials/blob/master/images/gephi-logo-2010-transparent.png?raw=true" align="left" width="150">