<img src="https://github.com/Minyall/sc207_materials/blob/master/images/gephi_network.png?raw=true" align="right" width="300">


# SC207 - Session 8
# Social Network Analysis with Gephi
## Reshaping your Data into a Network


### Imports

Today we will just need...
- Pandas to import and reshape our Twitter data

- <img src="https://github.com/Minyall/sc207_materials/blob/master/images/gephi-logo-2010-transparent.png?raw=true" align="left" width="75">....to visualise and explore our data.


In [1]:
import pandas as pd

## Converting Twitter Data

- We're going to make a retweet network.
- In this network, every node represents a different user.
- An edge between user `a` and user `b` represents one user retweeting the other.
- Edges are given a `weight` that counts how many times user `a` retweeted user `b`.
- We will make our network `directed`, meaning that we record separately
  - how many times `a` retweeted `b` (an edge `a` -> `b`)
  - and how many times `b` retweeted `a` (an edge `b` -> `a`)
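
To make direction and weight concrete, here is a minimal sketch that uses plain Python rather than Pandas. The usernames `a`, `b` and `c` and the `toy_retweets` list are invented for illustration and are not part of the course dataset.

```python
from collections import Counter

# Hypothetical toy data: one (retweeter, original_author) pair per retweet.
toy_retweets = [('a', 'b'), ('a', 'b'), ('b', 'a'), ('c', 'a'), ('a', 'b')]

# Counting ordered pairs gives one weighted, directed edge per pair.
# ('a', 'b') and ('b', 'a') are different keys, so each direction
# gets its own weight.
edge_weights = Counter(toy_retweets)

for (source, target), weight in edge_weights.items():
    print(f'{source} -> {target}: weight {weight}')
```

Because the pairs are ordered, swapping the two names produces a different edge, which is exactly what makes the network directed.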

In [2]:
df = pd.read_pickle('example_twitter_data_unpacked.pkl')

### Nodes
- User screen names as the id and Label.
- User followers count and statuses count as the node attributes

### Edges 
-  `user.screen_name` and the `retweeted_status.user.screen_name` as the two ends of our edges.

In [4]:
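# Keep only the two columns that form the ends of each edge
edges = df[['user.screen_name', 'retweeted_status.user.screen_name']]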
edges.head()

Unnamed: 0,user.screen_name,retweeted_status.user.screen_name
0,nanna39076633,Keir_Starmer
1,HenryForthwith,mrjamesob
2,F35514,
3,MChawlaQC,
4,R1chardEll10tt,MarinaHyde


We took `df`, in which each row is a tweet, and kept just two columns:
- `user.screen_name`, the user that tweeted the status we collected
- `retweeted_status.user.screen_name`, the user that tweeted the original status update that was retweeted by `user.screen_name`

Some of these tweets are not retweets, so they have a `NaN` value in the `retweeted_status.user.screen_name` column. We can check how many with `edges.info()`, which reports the non-null count for each column.
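
As a rough illustration of what that check is telling us, the sketch below counts missing and present values directly. The `toy_edges` frame and its usernames are made up for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: two retweets and two original tweets.
toy_edges = pd.DataFrame({
    'user.screen_name': ['u1', 'u2', 'u3', 'u4'],
    'retweeted_status.user.screen_name': ['u9', np.nan, 'u9', np.nan],
})

retweeted = toy_edges['retweeted_status.user.screen_name']

# .isna() marks missing values as True; summing counts the True values.
n_not_retweets = retweeted.isna().sum()
n_retweets = retweeted.notna().sum()
print(n_retweets, 'retweets and', n_not_retweets, 'other tweets')
```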

In [5]:
edges.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 24214 entries, 0 to 24213
Data columns (total 2 columns):
 #   Column                             Non-Null Count  Dtype 
---  ------                             --------------  ----- 
 0   user.screen_name                   24214 non-null  object
 1   retweeted_status.user.screen_name  19556 non-null  object
dtypes: object(2)
memory usage: 378.5+ KB


Let's drop any rows that don't have a value under `retweeted_status.user.screen_name`
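
There is more than one way to do this in Pandas. The sketch below, on an invented `toy_edges` frame, shows a boolean filter alongside `.dropna` with a `subset`, and checks that both keep the same rows.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: the middle row is not a retweet.
toy_edges = pd.DataFrame({
    'user.screen_name': ['u1', 'u2', 'u3'],
    'retweeted_status.user.screen_name': ['u9', np.nan, 'u8'],
})

# Option 1: build a True/False filter and use it to select rows.
toy_filter = toy_edges['retweeted_status.user.screen_name'].notna()
kept_by_filter = toy_edges[toy_filter]

# Option 2: ask Pandas to drop rows with a missing value in that column.
kept_by_dropna = toy_edges.dropna(subset=['retweeted_status.user.screen_name'])

print(kept_by_filter.equals(kept_by_dropna))
```

Either way the original row labels are kept, which is why the index can have gaps after filtering.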

In [6]:
edge_filter = edges['retweeted_status.user.screen_name'].notna()
edges = edges[edge_filter]

In [7]:
# Rename the columns so that it is clear which is source and which is target
new_names = {'user.screen_name': 'source',
             'retweeted_status.user.screen_name': 'target'}
edges = edges.rename(columns=new_names)

In [8]:
edges.head()

Unnamed: 0,source,target
0,nanna39076633,Keir_Starmer
1,HenryForthwith,mrjamesob
4,R1chardEll10tt,MarinaHyde
5,NNour50695694,LBC
6,Astraia18,meenalsworld


We also said that we would keep just one edge between each pair of users, and give that edge a `weight` recording how many times the retweeting happened. We can do this quickly using the Pandas `.groupby` method.
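
Before running this on the real data, it can help to see the idea on a few rows. This is a small sketch with an invented `toy_edges` frame; the column names match the ones used in this notebook.

```python
import pandas as pd

# Hypothetical toy data: 'a' retweets 'b' twice, 'b' retweets 'a' once.
toy_edges = pd.DataFrame({
    'source': ['a', 'a', 'b', 'a'],
    'target': ['b', 'b', 'a', 'c'],
})

# Every row starts with a weight of 1...
toy_edges['weight'] = 1

# ...and summing within each (source, target) group collapses repeated
# rows into a single edge whose weight is the number of repeats.
toy_weighted = toy_edges.groupby(['source', 'target'], as_index=False)['weight'].sum()
print(toy_weighted)
```

Using `as_index=False` keeps `source` and `target` as ordinary columns rather than moving them into the index, which is convenient when the table is later written out as an edge list.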

In [9]:
# First we give every edge a weight of 1
edges['weight'] = 1
edges

Unnamed: 0,source,target,weight
0,nanna39076633,Keir_Starmer,1
1,HenryForthwith,mrjamesob,1
4,R1chardEll10tt,MarinaHyde,1
5,NNour50695694,LBC,1
6,Astraia18,meenalsworld,1
...,...,...,...
22983,brewdog1950,Bbmorg,1
22984,hausofrushdi,coaimpaul,1
22985,raywilson50,BorisJohnson_MP,1
22986,OLDMART01,GetBrexit_Done,1


In [10]:
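# Then we group by both the source and the target columns and sum together the weights
edges.groupby(['source', 'target'], as_index=False)['weight'].sum().sort_values('weight', ascending=False)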



Unnamed: 0,source,target,weight
6113,MarySchmoller,MarieAnnUK,6
1333,BrexitBuster,BrexitBuster,5
1414,BurtsBikeBits,StrongerStabler,5
18189,terriesinglet14,StrongerStabler,5
5845,M_Haynes01,mrjamesob,4
...,...,...,...
6427,MikeeeeV,OxfordDiplomat,1
6426,Mikecryer3,cue_bono,1
6424,Mikecryer3,Patricia344130,1
6423,Mikecryer3,GrahamJ18821678,1


In [11]:
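# Looks good, let's finalise that by overwriting our edges variable; no need to sort it
edges = edges.groupby(['source', 'target'], as_index=False)['weight'].sum()
edges['edge_type'] = 'retweeted'  # every edge in this network is a retweet
edges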


Unnamed: 0,source,target,weight,edge_type
0,0606Green,MarcherLord1,1,retweeted
1,0606Green,SocialM85897394,1,retweeted
2,0ctavia,mrjamesob,1,retweeted
3,100glitterstars,RoddyQC,1,retweeted
4,101Cognitive,MattChorley,1,retweeted
...,...,...,...,...
19039,zippydazipster,alfonslopeztena,1,retweeted
19040,zodiacbanana,SueSuezep,1,retweeted
19041,zort70,I_am_Dan___,1,retweeted
19042,zuluzim909,Bill_Esterson,1,retweeted


## Nodes
This dataframe will be a list of unique nodes and we will assign some attributes to the nodes that we can use in Gephi later on.

First we take both the source and target columns and stack one on top of the other to make a long list of every user in the edges.

Users may appear many times in that list, so we drop duplicates. The result is a single column (a Series), so we convert it to a dataframe using `.to_frame`, naming its one column `id`.

Gephi uses the `id` column to match nodes in the node list to the nodes mentioned in the edge list. Finally we add a `Label` column, which is a copy of the `id` column. Gephi displays the `Label` value when node labels are switched on.
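
The steps above can be sketched on a tiny edge list. The `toy_edges` frame is invented, and `pd.concat` is used here for the stacking step, which is one of several ways to combine two columns.

```python
import pandas as pd

# Hypothetical toy edge list.
toy_edges = pd.DataFrame({
    'source': ['a', 'a', 'b'],
    'target': ['b', 'c', 'a'],
})

# Stack the two columns into one long Series of usernames.
all_users = pd.concat([toy_edges['source'], toy_edges['target']])

# Keep each user once, then turn the Series into a one-column dataframe.
toy_nodes = all_users.drop_duplicates().to_frame(name='id')

# Gephi's display label is simply a copy of the id.
toy_nodes['Label'] = toy_nodes['id']
print(toy_nodes)
```

The row labels are carried over from the edge list, so they are not consecutive. That is harmless here because Gephi matches on the `id` column, not on the dataframe index.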

In [18]:
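# Stack source and target, keep each user once, and name the column 'id'
nodes = pd.concat([edges['source'], edges['target']]).drop_duplicates().to_frame(name='id')
nodes['Label'] = nodes['id']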

nodes.head()

Unnamed: 0,id,Label
0,0606Green,0606Green
2,0ctavia,0ctavia
3,100glitterstars,100glitterstars
4,101Cognitive,101Cognitive
5,10dam,10dam


We want to ensure each user node has its `user.statuses_count` and `user.followers_count` associated. We will need to get these from our original dataframe.

In [19]:
subset = ['user.screen_name','user.followers_count','user.statuses_count']
user_data = df[subset]
user_data.head()

Unnamed: 0,user.screen_name,user.followers_count,user.statuses_count
0,nanna39076633,751,35166
1,HenryForthwith,420,21602
2,F35514,10,1715
3,MChawlaQC,9265,540
4,R1chardEll10tt,114,477


Currently `user_data` is essentially a list of tweets showing just the username, plus the status count and follower count of the user at the point they tweeted. This means that a user may occur more than once in the list, with different values.

The solution is to ask Pandas to find all the tweets in the dataset for each user, and then choose the highest value it can find in those tweets for each column. We do this with `.groupby` and `.max` to aggregate the data.
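
A small sketch of that aggregation, using an invented `toy_user_data` frame in which user `u1` appears twice with slightly different counts:

```python
import pandas as pd

# Hypothetical toy data: 'u1' tweeted twice, with different counts each time.
toy_user_data = pd.DataFrame({
    'user.screen_name': ['u1', 'u2', 'u1'],
    'user.followers_count': [100, 50, 105],
    'user.statuses_count': [2000, 10, 1990],
})

# One row per user, holding the largest value seen in each column.
toy_max = toy_user_data.groupby('user.screen_name', as_index=False).max()
print(toy_max)
```

Note that `.max` works column by column, so the two values kept for a user can come from different tweets. In this toy data `u1` ends up with 105 followers (from the later row) and 2000 statuses (from the earlier row).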

In [20]:
user_data = user_data.groupby('user.screen_name', as_index=False).max()
user_data

Unnamed: 0,user.screen_name,user.followers_count,user.statuses_count
0,001Gunner,24,1127
1,007Pseudonym,74,4924
2,0606Green,815,181116
3,0_ayanna,97,9508
4,0ctavia,2115,210311
...,...,...,...
14799,zithertilldawn,457,11477
14800,zodiacbanana,222,2542
14801,zort70,1030,48651
14802,zosteb,3141,66898


For each user in our `nodes` list, we now want to find the corresponding row in `user_data` and attach its values to that node.

We can do this with a "left `.merge`", which keeps every row of `nodes` and adds the matching `user_data` columns where a match exists.
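
Here is a minimal sketch of a left merge on invented data. In `toy_nodes`, user `c` has no row in `toy_user_data`, standing in for the assumed case of a user who was retweeted but has no tweets of their own in the dataset.

```python
import pandas as pd

# Hypothetical toy node list and user attributes.
toy_nodes = pd.DataFrame({'id': ['a', 'b', 'c'], 'Label': ['a', 'b', 'c']})
toy_user_data = pd.DataFrame({
    'user.screen_name': ['a', 'b'],
    'user.followers_count': [10, 20],
    'user.statuses_count': [100, 200],
})

# how='left' keeps every node; the key columns have different names,
# so we name each side explicitly.
toy_merged = toy_nodes.merge(
    toy_user_data,
    how='left',
    left_on='id',
    right_on='user.screen_name',
)
print(toy_merged)
```

Every node survives the merge, and nodes without a match simply get `NaN` in the attribute columns. The merge also brings along the `user.screen_name` key column, which duplicates `id` for matched rows.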

In [21]:
nodes = nodes.merge(user_data, how='left', left_on='id', right_on='user.screen_name')
nodes.head()

In [23]:
# no need for the extra user.screen_name column
nodes = nodes.drop(columns='user.screen_name')
nodes.head()

In [25]:
edges.head()

Unnamed: 0,source,target,weight,edge_type
0,0606Green,MarcherLord1,1,retweeted
1,0606Green,SocialM85897394,1,retweeted
2,0ctavia,mrjamesob,1,retweeted
3,100glitterstars,RoddyQC,1,retweeted
4,101Cognitive,MattChorley,1,retweeted


In [103]:
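# Save both tables as CSV files, without the dataframe index, ready to import into Gephi
nodes.to_csv('retweet_node_list.csv', index=False)
edges.to_csv('retweet_edge_list.csv', index=False)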
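
When saving, it is worth deciding whether the dataframe index should be written out. The sketch below uses an invented one-row `toy_edges` frame and returns the CSV as text instead of writing a file, so the two header lines can be compared.

```python
import pandas as pd

# Hypothetical toy edge list with a single edge.
toy_edges = pd.DataFrame({'source': ['a'], 'target': ['b'], 'weight': [3]})

# With no file path, .to_csv returns the CSV as a string.
with_index = toy_edges.to_csv()
without_index = toy_edges.to_csv(index=False)

print(with_index.splitlines()[0])     # header includes an unnamed index column
print(without_index.splitlines()[0])  # header has only the real columns
```

Passing `index=False` avoids an extra unnamed first column in the file, so the saved table contains only the columns we built.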