<img src="https://github.com/Minyall/sc207_materials/blob/master/images/gephi_network.png?raw=true" align="right" width="300">


# SC207 - Session 8
# Social Network Analysis with Gephi
## Reshaping your Data into a Network


### Imports

Today we will just need...
- Pandas to import and reshape our Twitter data

- <img src="https://github.com/Minyall/sc207_materials/blob/master/images/gephi-logo-2010-transparent.png?raw=true" align="left" width="75">....to visualise and explore our data.


In [None]:
import pandas as pd

In [None]:
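def flatten_nested_dicts(df):
    """Expand columns of dictionaries into one column per key (named 'column.key')."""
    # json_normalize works on a list of row dictionaries, not on a DataFrame
    dicts = df.to_dict(orient='records')
    flattened = pd.json_normalize(dicts)
    return flattened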
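
To see what this helper does, here is a minimal sketch on a made-up two-row DataFrame. The column names and values are illustrative assumptions, not taken from the Twitter data. Each dictionary key should end up in its own column, named after the original column and the key.

```python
import pandas as pd

def flatten_nested_dicts(df):
    dicts = df.to_dict(orient='records')
    return pd.json_normalize(dicts)

# Toy data: one column holds a dictionary per row (values are made up)
toy = pd.DataFrame({
    'id': [1, 2],
    'metrics': [{'likes': 3, 'replies': 0}, {'likes': 7, 'replies': 2}],
})

flat = flatten_nested_dicts(toy)
print(flat.columns.tolist())
```

Note that the result gets a fresh 0, 1, 2... index, whatever index the input had.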

## Converting Twitter Data

- We're going to make a retweet network.
- In this network every node represents a different user.
- An edge between user `a` and user `b` represents one user retweeting the other.
- Edges are given a `weight` that counts how many times user `a` retweeted user `b`.
- We will make our network `directional`, meaning that we record separately
  - how many times `a` retweeted `b` (`a` -> `b`)
  - and how many times `b` retweeted `a` (`b` -> `a`)

First we need to find the IDs of any referenced tweets and, if they're not already in our dataset, get the full tweets plus their user data.

In [None]:
filename = 'new_trhr.json'

tweets = pd.read_json(filename)
tweets

In [None]:
subset = tweets[['id','author_id','referenced_tweets']].dropna()
subset

In [None]:
# First, .explode gives each item in the referenced_tweets lists its own row

edge_data = subset.explode('referenced_tweets').copy()
edge_data
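
As a rough illustration of what `.explode` does here, the sketch below uses a made-up DataFrame (the IDs and rows are assumptions for demonstration only). A row whose list holds two referenced tweets becomes two rows, with the other columns repeated.

```python
import pandas as pd

# Toy data: each row holds a list of referenced tweets (made-up IDs)
toy = pd.DataFrame({
    'id': [10, 11],
    'referenced_tweets': [
        [{'type': 'retweeted', 'id': 1}],
        [{'type': 'quoted', 'id': 2}, {'type': 'replied_to', 'id': 3}],
    ],
})

# One row per list item; 'id' is repeated for tweet 11
exploded = toy.explode('referenced_tweets')
print(exploded)
```

The original index labels are repeated as well, which is harmless here because the flattening step that follows builds a new index.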

In [None]:
# Next we unpack that series of dictionaries into their own columns
edge_data = flatten_nested_dicts(edge_data)
edge_data

In [None]:
new_cols = {'id':'source_id',
            'author_id':'source_user_id',
            'referenced_tweets.type':'type',
            'referenced_tweets.id':'target_id'}
edge_data = edge_data.rename(columns=new_cols)
edge_data = edge_data[['source_id','source_user_id','type','target_id']]
edge_data.info()

We almost have everything we need, except the ID of the user who wrote each target tweet. We should have these as part of our larger dataset, because we retained both original and referenced tweets with their associated user data.


In [None]:
edge_data

In [None]:
# Keep only retweets (other reference types, such as quotes and replies, are dropped)
edge_data = edge_data[edge_data['type'] == 'retweeted']

In [None]:
user_data = tweets[['id','author_id']]
user_data

In [None]:
edge_data = edge_data.merge(user_data, how='left',left_on='target_id', right_on='id')
edge_data.info()
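
Because this is a left merge, an edge whose target tweet is not in `tweets` is kept but ends up with a missing `author_id`. The sketch below, on made-up toy tables, shows one possible way to count such rows using the `indicator=True` option of `merge`. The names `toy_edges` and `toy_tweets` are assumptions for illustration.

```python
import pandas as pd

# Toy edge rows and tweet lookup (all IDs are made up)
toy_edges = pd.DataFrame({'source_user_id': [100, 101, 102],
                          'target_id': [1, 2, 99]})
toy_tweets = pd.DataFrame({'id': [1, 2], 'author_id': [200, 201]})

merged = toy_edges.merge(toy_tweets, how='left',
                         left_on='target_id', right_on='id',
                         indicator=True)

# Rows marked 'left_only' had no matching tweet, so author_id is missing
unmatched = merged[merged['_merge'] == 'left_only']
print(len(unmatched), 'of', len(merged), 'edges have no target user')
```

Rows like these matter later on: `groupby` leaves out rows with a missing key by default, so edges without a target user would quietly drop out of the network. If the real data shows many unmatched rows, it may be worth checking that the two ID columns hold the same type of value.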

In [None]:
new_cols = {'author_id':'target_user_id'}
edge_data = edge_data.rename(columns=new_cols).drop(columns=['target_id','id','source_id'])
edge_data

At the moment we have one row per instance of retweeting between a pair of users. However, one user may have retweeted another several times, so there can be duplicate rows. Rather than discard this information, we'll capture it by adding weights to our edges that count how many times the source user retweeted the target user.

In [None]:
# First we give every edge a weight of 1, because each row represents 1 instance of retweeting

edge_data['weight'] = 1
edge_data

Now we use groupby to gather together rows that have the same combination of source, target and type, and add the weight values together.

In [None]:
edge_data = edge_data.groupby(['source_user_id','target_user_id','type'],as_index=False).sum()
edge_data.sort_values('weight',ascending=False)
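
The weighting step can be sketched on a tiny made-up edge table (the user IDs are illustrative assumptions). User 1 retweets user 2 twice and user 2 retweets user 1 once, so we expect two rows, one per direction.

```python
import pandas as pd

# Toy edges: user 1 retweets user 2 twice, user 2 retweets user 1 once
toy = pd.DataFrame({
    'source_user_id': [1, 1, 2],
    'target_user_id': [2, 2, 1],
    'type': ['retweeted'] * 3,
})
toy['weight'] = 1

# Duplicate source/target/type rows collapse into one row with summed weight
weighted = toy.groupby(['source_user_id', 'target_user_id', 'type'],
                       as_index=False).sum()
print(weighted)
```

Because the source and target columns are grouped separately, the edge 1 -> 2 and the edge 2 -> 1 stay as distinct rows, which is what makes the network directional.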

Now our `edge_data` has one row per `source` `target` pair, and a weight indicating how many times that pair appeared in the data. Finally we just need to relabel the columns so that Gephi understands them.

In [None]:
gephi_edge_labels = {'source_user_id':'Source','target_user_id':'Target','type':'Type','weight':'Weight'}
edge_data = edge_data.rename(columns=gephi_edge_labels)
edge_data

### Nodes
- `source_user_id` and `target_user_id` as the node ids.
- `source_username` and `target_username` as the node labels
- User followers count and statuses count as the node attributes

### Edges
-  `source_user_id` and `target_user_id` as the two ends of our edges.

# Nodes
This dataframe will be a list of unique nodes and we will assign some attributes to the nodes that we can use in Gephi later on.
First we grab the relevant columns from our original dataset. It may have duplicates, as each row represents a tweet, and it may include users that didn't end up in our edge table, but we'll deal with that soon.

In [None]:
node_data = tweets[['user_id','user_name','user_public_metrics']]
node_data

In [None]:
# First we drop any duplicates because we simply need one row per user
node_data = node_data.drop_duplicates('user_id')
node_data

In [None]:
# Next we create a list of all users that are actually in our edge list

nodes_in_network = pd.concat([edge_data['Source'], edge_data['Target']], axis=0).drop_duplicates()
nodes_in_network

In [None]:
node_data = node_data[node_data['user_id'].isin(nodes_in_network)]
node_data

In [None]:
# Now let's expand out our user metrics

node_data = flatten_nested_dicts(node_data)
node_data

In [None]:
# Finally we need to relabel our columns for Gephi

gephi_node_labels = {'user_id':'ID','user_name':'Label',
                     'user_public_metrics.followers_count':'followers_count',
                     'user_public_metrics.following_count':'following_count',
                     'user_public_metrics.tweet_count':'tweet_count',
                     'user_public_metrics.listed_count':'listed_count'}

node_data = node_data.rename(columns=gephi_node_labels)
node_data


In [None]:
node_data.to_csv('retweet_node_list.csv',index=False)
edge_data.to_csv('retweet_edge_list.csv',index=False)
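
Before moving to Gephi, it can be worth checking that every user in the edge list also has a row in the node list. The sketch below uses a hypothetical helper, `missing_node_ids`, and made-up toy tables. It is not part of the workflow above, just one possible sanity check.

```python
import pandas as pd

def missing_node_ids(nodes, edges):
    """Return edge endpoint IDs that have no row in the node table."""
    endpoints = pd.concat([edges['Source'], edges['Target']]).drop_duplicates()
    return endpoints[~endpoints.isin(nodes['ID'])].tolist()

# Toy tables using the Gephi column names from above (IDs are made up)
toy_nodes = pd.DataFrame({'ID': [1, 2], 'Label': ['a', 'b']})
toy_edges = pd.DataFrame({'Source': [1, 2], 'Target': [2, 3], 'Weight': [1, 4]})

print(missing_node_ids(toy_nodes, toy_edges))
```

With the real data this would be called as `missing_node_ids(node_data, edge_data)`. Any ID it returns appears in an edge but would have no label or attributes from the node list.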

Now we go to...

<img src="https://github.com/Minyall/sc207_materials/blob/master/images/gephi-logo-2010-transparent.png?raw=true" align="left" width="150">

### Some Notes
#### Filter sets

To examine individual communities

- Giant component
    - Inter Edges (modularity Class)
        - Degree range 2

To examine the overall structure
- Giant component
    - Degree range 2


#### Measures
- Weighted in-degree to show influence
- PageRank centrality to show those who may have the ear of an influencer.
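
As a cross-check on what Gephi reports, weighted in-degree can also be approximated directly in pandas. In this retweet network edges point from the retweeting user to the retweeted user, so a user's weighted in-degree is the total number of times they were retweeted. The sketch below uses a made-up edge list in the same column layout as the exported file.

```python
import pandas as pd

# Toy edge list in the Gephi format used above (IDs and weights are made up)
toy_edges = pd.DataFrame({
    'Source': [1, 2, 3],
    'Target': [3, 3, 1],
    'Weight': [2, 5, 1],
})

# Weighted in-degree: total weight of edges pointing at each Target
weighted_in_degree = (toy_edges.groupby('Target')['Weight']
                      .sum()
                      .sort_values(ascending=False))
print(weighted_in_degree)
```

Users who never appear as a `Target` have a weighted in-degree of zero and are simply absent from this result.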
