<img src="https://github.com/Minyall/sc207_materials/blob/master/images/gephi_network.png?raw=true" align="right" width="300">


# SC207 - Session 8
# Social Network Analysis with Gephi
## Reshaping your Data into a Network


### Imports

Today we will just need...
- Pandas to import and reshape our Twitter data

- <img src="https://github.com/Minyall/sc207_materials/blob/master/images/gephi-logo-2010-transparent.png?raw=true" align="left" width="75">....to visualise and explore our data.


In [None]:
import pandas as pd

In [None]:
def flatten_nested_dicts(df):
    dicts = df.to_dict(orient='records')
    flattened = pd.json_normalize(dicts)
    return flattened

## Converting Twitter Data

- We're going to make a retweet network.
- In this network every node will represent a different user.
- An edge between user `a` and user `b` will represent one user retweeting the other.
- Edges will be given a `weight` that counts how many times user `a` retweeted user `b`.
- We will make our network `directional`, meaning that we will record separately
  - how many times `a` retweeted `b`
  - and how many times `b` retweeted `a`.

#### Stage 1
We currently have data laid out like this...

| author_id | referenced_tweet_id |
| --------- | ------------------- |
| USER A    | TWEET 1             |
| USER B    | TWEET 2             |
| USER C    | TWEET 3             |

#### Stage 2
However we want our data to look like this....

| author_id | referenced_tweet_id | referenced_tweet's author |
| --------- | ------------------- | ------------------------- |
| USER A    | TWEET 1             | USER B                    |
| USER B    | TWEET 2             | USER A                    |
| USER C    | TWEET 3             | USER C                    |


#### Stage 3
Or actually more accurately, just like this...

| retweeter | author |
|-----------|--------|
| USER A    | USER B |
| USER B    | USER A |
| USER C    | USER C |

#### Stage 4
We create a weight column that represents the number of times the retweeter retweeted the author.

| retweeter | author | weight |
|-----------|--------|--------|
| USER A    | USER B | 1      |
| USER B    | USER A | 5      |
| USER C    | USER C | 12     |
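
The four stages above can be sketched end to end on a tiny made-up dataset. This is only an illustration: the `toy_tweets` table, its ids and its `referenced_tweet_id` column are assumptions for the example, and the real data needs the extra unpacking steps shown later in this notebook.

```python
import pandas as pd

# Made-up tweets (hypothetical ids): three originals and three retweets
toy_tweets = pd.DataFrame({
    'id': ['t1', 't2', 't3', 't4', 't5', 't6'],
    'author_id': ['A', 'B', 'B', 'A', 'A', 'B'],
    'referenced_tweet_id': [None, None, None, 't2', 't3', 't1'],
})

# Stage 1: keep only the rows that reference another tweet
retweets = toy_tweets[['author_id', 'referenced_tweet_id']].dropna()

# Stage 2: look up who wrote each referenced tweet
lookup = toy_tweets[['id', 'author_id']]
merged = retweets.merge(lookup, how='left',
                        left_on='referenced_tweet_id', right_on='id')

# Stage 3: keep just the pair of users, with clearer names
pairs = merged.rename(columns={'author_id_x': 'retweeter',
                               'author_id_y': 'author'})[['retweeter', 'author']]

# Stage 4: one row per pair, with a weight counting the retweets
weighted = (pairs.assign(weight=1)
                 .groupby(['retweeter', 'author'], as_index=False)['weight']
                 .sum())
print(weighted)
```

Here user A retweeted user B twice and user B retweeted user A once, so the result has two rows, one per direction.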



## Stage 1

In [None]:
filename = 'QT.json'

tweets = pd.read_json(filename)
tweets

In [None]:
subset = tweets[['author_id','referenced_tweets']].dropna()
subset

In [None]:
# first we deal with everything being in lists using .explode!!

edge_data = subset.explode('referenced_tweets').copy()
edge_data

In [None]:
# Next we unpack that series of dictionaries into their own columns
edge_data = flatten_nested_dicts(edge_data)
edge_data

In [None]:
# and select just three columns: the original author id, the tweet id of the referenced tweet, and the type of the referenced tweet.
# Then we keep only the rows where the referenced tweet was a retweet.

edge_data = edge_data[['author_id','referenced_tweets.id', 'referenced_tweets.type']]
edge_data = edge_data[edge_data['referenced_tweets.type'] == 'retweeted']
edge_data

So to recap, we have three columns...
- `author_id`: The id of the user that retweeted somebody.
- `referenced_tweets.id`: The id of the tweet that they retweeted.
- `referenced_tweets.type`: The way in which the user 'referenced' a tweet. In our case, all retweets.
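
To see what `.explode` and the flattening step do to the nested `referenced_tweets` column, here is a minimal sketch on made-up data. The two toy rows and their tweet ids are assumptions for illustration; the column names mirror the ones used in the cells above.

```python
import pandas as pd

# Made-up rows: each user has a *list* of dictionaries describing referenced tweets
toy = pd.DataFrame({
    'author_id': ['A', 'B'],
    'referenced_tweets': [
        [{'type': 'retweeted', 'id': 't9'}],
        [{'type': 'quoted', 'id': 't7'}, {'type': 'replied_to', 'id': 't8'}],
    ],
})

# .explode gives every list item its own row (user B now has two rows)
exploded = toy.explode('referenced_tweets')

# json_normalize turns each dictionary's keys into dotted column names
flat = pd.json_normalize(exploded.to_dict(orient='records'))

# Keep only the rows where the reference is a retweet
retweets_only = flat[flat['referenced_tweets.type'] == 'retweeted']
print(retweets_only)
```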

## Stage 2
Our hope is that all the referenced tweets are also in our dataset somewhere else, and so have associated user information. We'll create a new dataframe of tweet ids and associated user ids and then use merge to match them up.
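
One optional way to check that hope is the `indicator=True` argument of `merge`, which adds a `_merge` column saying whether each row found a match. The sketch below uses made-up tables (`toy_edges`, `toy_tweets` and their ids are assumptions for illustration), not our real data.

```python
import pandas as pd

# Made-up edge rows: the last one references a tweet we never collected
toy_edges = pd.DataFrame({
    'author_id': ['A', 'B', 'C'],
    'referenced_tweets.id': ['t1', 't2', 't404'],
})
# Made-up lookup of collected tweets and their authors
toy_tweets = pd.DataFrame({
    'id': ['t1', 't2', 't3'],
    'author_id': ['B', 'A', 'C'],
})

merged = toy_edges.merge(toy_tweets, how='left',
                         left_on='referenced_tweets.id', right_on='id',
                         indicator=True)

# 'both' means the referenced tweet was found, 'left_only' means it was not
print(merged['_merge'].value_counts())
unmatched = merged[merged['_merge'] == 'left_only']
print(unmatched)
```

If the real merge raises an error about key types, it is worth checking that the two id columns have the same dtype (for example, both strings) by looking at `edge_data.dtypes` and `user_data.dtypes`.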

In [None]:
# Here we create a list of all the tweets we collected and associated author id.
user_data = tweets[['id','author_id']]
user_data

In [None]:
edge_data = edge_data.merge(user_data, how='left',left_on='referenced_tweets.id', right_on='id')
edge_data

## Stage 3

In [None]:
# Here we check if we're missing any author info for any reason. If we were, we could use .dropna() to drop any rows with missing info.
edge_data.info()

In [None]:
# Here we will rename the columns to be more descriptive, and drop any columns we don't need.

new_cols = {'author_id_x':'retweeter', 'author_id_y':'author'}
edge_data = edge_data.rename(columns=new_cols).drop(columns=['referenced_tweets.id','id','referenced_tweets.type'])
edge_data

## Stage 4

At the moment we have one row per instance of retweeting between a pair of users. However it may be the case that one user retweeted another a number of times, and so there are duplicate rows. Rather than discard this information, we'll capture this by adding weights to our edges that count how many times the source user retweeted the target user.
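
As a small illustration of the idea on made-up pairs (the `toy_pairs` table is an assumption for the example), counting duplicate rows can also be done in one step with `.size()`. This is an alternative shortcut to the "add a weight of 1, then sum" approach used in this notebook, and should give the same counts.

```python
import pandas as pd

# Made-up retweet pairs: A retweeted B twice, B retweeted A once, A retweeted C once
toy_pairs = pd.DataFrame({
    'retweeter': ['A', 'A', 'B', 'A'],
    'author':    ['B', 'B', 'A', 'C'],
})

# .size() counts the rows in each (retweeter, author) group;
# reset_index(name='weight') turns the counts into a regular column
weights = (toy_pairs.groupby(['retweeter', 'author'])
                    .size()
                    .reset_index(name='weight'))
print(weights)
```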

In [None]:
# First we give every edge a weight of 1, because each row represents 1 instance of retweeting

edge_data['weight'] = 1
edge_data

Now we use groupby to gather together rows that have the same combination of `retweeter` and `author`, and add the weight values together.

In [None]:
edge_data = edge_data.groupby(['retweeter','author'],as_index=False).sum()
edge_data.sort_values('weight',ascending=False)

Now our `edge_data` has one row per pair, and a weight indicating how many times that pair appeared in the data. Finally we just need to relabel the columns so that Gephi understands them.

In network analysis when we talk about edges we refer to a `source` and a `target`. If you imagine an edge as an arrow the `source` is where the arrow starts and the `target` is where it points to.

In our case the direction matters: just because USER A retweeted USER B doesn't mean that USER B retweeted back. The relationship is not necessarily mutual, unlike, say, a friendship, where we wouldn't necessarily consider there to be a direction to the connection. Which is the source and which is the target though? Well, either, depending on how you define what the edge represents.

We could say the edge means `RETWEETED`, so it would be...

```
(SOURCE: retweeter) -[RETWEETED]-> (TARGET: author)
```

But equally we could say the edge means `RETWEETED_BY`, meaning the positions would be reversed.

```
(SOURCE: author) -[RETWEETED_BY]-> (TARGET: retweeter)
```

In this case, it doesn't matter what we choose, so long as we remember what the direction of our edge represents. We'll go with `RETWEETED`.
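
In pandas terms the two options differ only in which column gets which label. A minimal sketch on a single made-up edge (the `toy` table is an assumption for illustration):

```python
import pandas as pd

# One made-up edge: user A retweeted user B three times
toy = pd.DataFrame({'retweeter': ['A'], 'author': ['B'], 'weight': [3]})

# RETWEETED: the arrow starts at the retweeter and points at the author
retweeted = toy.rename(columns={'retweeter': 'Source', 'author': 'Target'})

# RETWEETED_BY: the same edge with the arrow reversed
retweeted_by = toy.rename(columns={'author': 'Source', 'retweeter': 'Target'})

print(retweeted)
print(retweeted_by)
```

With `RETWEETED`, users who are retweeted a lot collect many incoming edges, which is worth keeping in mind when reading in-degree and out-degree later on.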


In [None]:
gephi_edge_labels = {'retweeter':'Source','author':'Target'}
edge_data = edge_data.rename(columns=gephi_edge_labels)
edge_data

In [None]:
edge_data.to_csv('edges_QT.csv', index=False)

### Nodes
- The user ids that appear in `Source` and `Target` as the node ids.
- User names as the node labels.
- Follower and tweet counts (among other public metrics) as the node attributes.

### Edges
- `Source` and `Target` as the two ends of our edges.

# Nodes
This dataframe will be a list of unique nodes and we will assign some attributes to the nodes that we can use in Gephi later on.
First we grab the relevant columns from our original dataset. It may have duplicates as each row represents a tweet, and it may have users that didn't end up in our edge table but we'll deal with that soon.
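
The two clean-up steps described here, removing duplicate users and dropping users that never appear in an edge, can be sketched on made-up data as follows. The `toy_users` and `toy_edges` tables are assumptions for illustration only.

```python
import pandas as pd

# Made-up user rows: 'A' appears twice (two tweets), 'D' is in no edge
toy_users = pd.DataFrame({
    'user_id':   ['A', 'A', 'B', 'D'],
    'user_name': ['alice', 'alice', 'bob', 'dana'],
})
toy_edges = pd.DataFrame({'Source': ['A'], 'Target': ['B'], 'weight': [2]})

# One row per user
unique_users = toy_users.drop_duplicates('user_id')

# Every id that appears at either end of an edge
in_network = pd.concat([toy_edges['Source'], toy_edges['Target']]).drop_duplicates()

# Keep only users that are actually part of the network
nodes = unique_users[unique_users['user_id'].isin(in_network)]
print(nodes)
```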

In [None]:
node_data = tweets[['user_id','user_name','user_public_metrics']]
node_data

In [None]:
# First we drop any duplicates because we simply need one row per user
node_data = node_data.drop_duplicates('user_id')
node_data

In [None]:
# Next we create a list of all users that are actually in our edge list

nodes_in_network = pd.concat([edge_data['Source'], edge_data['Target']], axis=0).drop_duplicates()
nodes_in_network

In [None]:
node_data = node_data[node_data['user_id'].isin(nodes_in_network)]
node_data

In [None]:
# Now let's expand out our user metrics

node_data = flatten_nested_dicts(node_data)
node_data

In [None]:
# Finally we need to relabel our columns for Gephi

gephi_node_labels = {'user_id':'ID','user_name':'Label',
                     'user_public_metrics.followers_count':'followers_count',
                     'user_public_metrics.following_count':'following_count',
                     'user_public_metrics.tweet_count':'tweet_count',
                     'user_public_metrics.listed_count':'listed_count'}

node_data = node_data.rename(columns=gephi_node_labels)
node_data


In [None]:
node_data.to_csv('retweet_node_list.csv',index=False)
edge_data.to_csv('retweet_edge_list.csv',index=False)
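
Before opening the files in Gephi, it can be worth checking that every id used in the edge list also has a row in the node list. The helper below is a hypothetical sketch (the name `missing_node_ids` is not part of pandas or Gephi), shown on made-up tables that use the same `ID`, `Source` and `Target` column names as above.

```python
import pandas as pd

def missing_node_ids(nodes, edges):
    """Return the Source/Target ids in `edges` that have no row in `nodes`."""
    edge_ids = pd.concat([edges['Source'], edges['Target']]).drop_duplicates()
    return edge_ids[~edge_ids.isin(nodes['ID'])].tolist()

# Made-up tables: user 'C' appears in an edge but not in the node list
toy_nodes = pd.DataFrame({'ID': ['A', 'B'], 'Label': ['alice', 'bob']})
toy_edges = pd.DataFrame({'Source': ['A', 'C'], 'Target': ['B', 'A'],
                          'weight': [2, 1]})

print(missing_node_ids(toy_nodes, toy_edges))
```

On the real data the call would be `missing_node_ids(node_data, edge_data)`. An empty list means every node in the network has a label and attributes; any ids it returns are users we have edges for but no user information.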