<img src="https://github.com/Minyall/sc207_materials/blob/master/images/gephi_network.png?raw=true" align="right" width="300">


# SC207 - Session 8
# Social Network Analysis with Gephi
## Reshaping your Data into a Network


### Imports

Today we will just need...
- Pandas to import and reshape our Twitter data

- <img src="https://github.com/Minyall/sc207_materials/blob/master/images/gephi-logo-2010-transparent.png?raw=true" align="left" width="75">....to visualise and explore our data.


In [None]:
import pandas as pd

## Converting Twitter Data

- We're going to make a retweet network.
- In this network every node will represent a different user.
- An edge between user `a` and user `b` will represent one user retweeting the other.
- Edges will be given a `weight` that counts how many times user `a` retweeted user `b`.
- We will make our network `directional`, meaning that we will record separately
  - how many times `a` retweeted `b` (`a` -> `b`)
  - and how many times `b` retweeted `a` (`b` -> `a`)
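
As a rough illustration of the structure we are aiming for, here is a toy directed, weighted edge list for two hypothetical users `a` and `b`. The data is made up, and the `source`, `target` and `weight` column names simply anticipate the ones used further down this notebook.

```python
import pandas as pd

# Toy data: hypothetical users 'a' and 'b', not taken from the real dataset.
toy_edges = pd.DataFrame([
    {'source': 'a', 'target': 'b', 'weight': 2},  # a retweeted b twice
    {'source': 'b', 'target': 'a', 'weight': 1},  # b retweeted a once
])

# Because the network is directional, (a, b) and (b, a) are separate rows.
a_to_b = toy_edges.query("source == 'a' and target == 'b'")['weight'].iloc[0]
b_to_a = toy_edges.query("source == 'b' and target == 'a'")['weight'].iloc[0]
print(a_to_b, b_to_a)
```

In an undirected network these two rows would collapse into one edge, and we would lose the information about who retweeted whom.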

In [None]:
filename = 'example_twitter_data.pkl'

tweets = pd.read_pickle(filename)

In [None]:
tweets

#### Our Unpacking Function

In [None]:
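def flatten_nested_dicts(df):
    """Expand columns of nested dicts into flat, dot-separated columns."""
    # Convert each row to a dict, then let json_normalize flatten the nesting
    dicts = df.to_dict(orient='records')
    flattened = pd.json_normalize(dicts)
    return flattened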
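
To see what this function does, here is a minimal sketch on a made-up two-row dataframe. The toy `user` dictionaries are an assumption standing in for the much larger user objects in the real Twitter data.

```python
import pandas as pd

def flatten_nested_dicts(df):
    # Same approach as above: rows -> dicts -> flattened columns
    dicts = df.to_dict(orient='records')
    return pd.json_normalize(dicts)

# Toy data: each 'user' cell holds a nested dict, as in the Twitter data.
toy = pd.DataFrame({
    'id': [1, 2],
    'user': [
        {'screen_name': 'alice', 'followers_count': 10},
        {'screen_name': 'bob', 'followers_count': 5},
    ],
})

flat = flatten_nested_dicts(toy)
print(sorted(flat.columns))
```

Each key inside the nested dict becomes its own column, named with the parent column as a prefix, which is where names like `user.screen_name` come from.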

### Nodes
- User screen names as the id and Label.
- User followers count and statuses count as the node attributes

### Edges 
-  `user.screen_name` and the `retweeted_status.user.screen_name` as the two ends of our edges.

In [None]:
# First let's grab just the portion of data we need

tweet_network_data = tweets[['id','user', 'retweeted_status']]
tweet_network_data.head()

In [None]:
# and then we unpack...may take a few seconds...

tweet_network_data = flatten_nested_dicts(tweet_network_data)
tweet_network_data.head()

Next we create our edge list, which represents who retweets whom.

`user.screen_name` is the user that initiated the retweet, whilst `retweeted_status.user.screen_name` is the original author of the tweet being retweeted.

We can think of this edge like so...


(`user.screen_name`) -[RETWEETED]-> (`retweeted_status.user.screen_name`)

In [None]:
edges = tweet_network_data[['user.screen_name', 'retweeted_status.user.screen_name']]
edges.head()

Some of these tweets will not be retweets, and so will have a `NaN` value in the `retweeted_status.user.screen_name` column. We can check with `edges.info()`.

In [None]:
edges.info()

Let's drop any rows that don't have a value under `retweeted_status.user.screen_name`

In [None]:
# we could be specific with a filter but we can also use .dropna as a shortcut

edges = edges.dropna()
edges.info()
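
For comparison, the "specific filter" mentioned in the comment above could look something like the sketch below. It uses made-up rows, and keeps only those where the retweeted user's name is present.

```python
import pandas as pd

# Toy data: 'bob' posted an original tweet, so there is no retweeted user.
toy_edges = pd.DataFrame({
    'user.screen_name': ['alice', 'bob', 'carol'],
    'retweeted_status.user.screen_name': ['bob', None, 'alice'],
})

target_col = 'retweeted_status.user.screen_name'

# Explicit filter: keep rows where the target column is not missing
filtered = toy_edges[toy_edges[target_col].notna()]

# Shortcut restricted to the same column
dropped = toy_edges.dropna(subset=[target_col])

print(filtered.equals(dropped))
```

Note that a bare `.dropna()` drops a row if *any* column is missing. That gives the same result here because only the target column can be empty, but `subset=` makes the intent explicit.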

In [None]:
# Rename the columns so that it is clear which is source and which is target
edges = edges.rename(columns={'user.screen_name': 'source', 'retweeted_status.user.screen_name': 'target'})

In [None]:
edges.head()

We also said we would keep just one edge for each (source, target) pair, and give that edge a `weight` recording how many times the retweeting happened. We can do this quickly using Pandas `.groupby`.

In [None]:
# first we give every edge a weight of 1

edges['weight'] = 1
edges

In [None]:
# Then we group by both the source and the target columns and sum together the weights

edges.groupby(['source','target'], as_index=False).sum().sort_values('weight',ascending=False)


In [None]:
# Looks good, let's finalise that by overwriting our edges variable; no need to sort it
edges = edges.groupby(['source','target'], as_index=False).sum().reset_index(drop=True)
edges['edge_type'] = 'retweeted'
edges
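
As an aside, the "weight of 1, then sum" pattern is one way to count rows per group. A possible alternative, sketched here on toy data, is to let `.groupby(...).size()` do the counting directly. Both should give the same weights.

```python
import pandas as pd

# Toy data: one row per retweet event (hypothetical users)
toy = pd.DataFrame({
    'source': ['a', 'a', 'b', 'a'],
    'target': ['b', 'b', 'a', 'c'],
})

# Approach used above: add a weight of 1 to every row, then sum per pair
by_sum = toy.assign(weight=1).groupby(['source', 'target'], as_index=False).sum()

# Alternative: count the rows in each group, then rename the count column
by_size = (
    toy.groupby(['source', 'target'], as_index=False)
       .size()
       .rename(columns={'size': 'weight'})
)

print(by_sum['weight'].tolist() == by_size['weight'].tolist())
```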

Finally, you often get a lot of 'noise': many users appear in the dataset only because they retweeted once. This can be useful in some cases, but often the noise hides the underlying structure of relations.
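
One way to reduce this noise before exporting, if you wanted to do it in Pandas, is to keep only edges at or above a minimum weight. This is only a sketch on toy data. The `min_weight` threshold is a hypothetical value you would need to choose for your own dataset, and the same kind of filtering can be done later inside Gephi instead.

```python
import pandas as pd

# Toy edge list with weights (hypothetical users)
toy_edges = pd.DataFrame({
    'source': ['a', 'b', 'c'],
    'target': ['b', 'a', 'a'],
    'weight': [3, 1, 2],
})

min_weight = 2  # hypothetical threshold, not a recommended value

# Keep only edges that occurred at least min_weight times
strong_edges = toy_edges[toy_edges['weight'] >= min_weight].reset_index(drop=True)
print(strong_edges)
```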

## Nodes
This dataframe will be a list of unique nodes and we will assign some attributes to the nodes that we can use in Gephi later on.

First we take both the source and target columns, and stack one on top of the other to make a long list of every user in the edges.

We drop duplicates, as users may appear multiple times, and then convert the resulting Series to a dataframe using `.to_frame`, naming its single column `id`.

In [None]:
# first we take our source and target column, and stack them on top of each other
# This will create a list of every user in our edge list

node_names = pd.concat([edges['source'], edges['target']])
node_names

In [None]:
# We have duplicates because our edge list repeats a name every time
# that user forms an edge with another node.

# A node list should be a list of unique nodes and their attributes, 
# so we will drop the duplicates and turn the Series into a DataFrame
unique_nodes = node_names.drop_duplicates().to_frame(name='id').reset_index(drop=True)

unique_nodes

Gephi will use the `id` column to match nodes in the node list to the nodes mentioned in the edge list. Finally we add a `Label` column, identical to the `id` column, because Gephi displays the `Label` column when node labels are switched on.

In [None]:
unique_nodes['Label'] = unique_nodes['id']
unique_nodes.head()

We want to ensure each user node has its `user.statuses_count` and `user.followers_count` associated. We will need to get these from our original dataframe.

In [None]:
attribute_columns = ['user.screen_name','user.followers_count','user.statuses_count']
user_data = tweet_network_data[attribute_columns]
user_data.head()

Currently `user_data` is essentially a list of tweets showing just the username, and then the status count and follower count of the user at the point they tweeted. This means that a user may occur more than once in the list, with different values.

The solution is to ask Pandas to find all the tweets in the dataset for each user, and then choose the highest values it can find in those tweets for each user. We do this with `.groupby` and `.max` to aggregate the data.

In [None]:
user_data = user_data.groupby('user.screen_name').max().reset_index()
user_data
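
Here is a small made-up example of the same aggregation, assuming a user called `alice` who appears in two tweets with slightly different counts.

```python
import pandas as pd

# Toy data: 'alice' tweeted twice, and her counts grew in between
toy_user_data = pd.DataFrame({
    'user.screen_name': ['alice', 'alice', 'bob'],
    'user.followers_count': [10, 12, 5],
    'user.statuses_count': [100, 101, 7],
})

# One row per user, keeping the highest value seen in each column
toy_max = toy_user_data.groupby('user.screen_name').max().reset_index()
print(toy_max)
```

Note that `.max` works column by column, so the follower count and the status count for a user may in principle come from different tweets.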

For each user in our `unique_nodes` list, we now want to find the corresponding row in `user_data` and attach its values to that node.

We can do this with a "left `.merge`" which matches the two dataframes on a specified column and then copies the data from the "right" dataframe to the corresponding rows in the "left" dataframe.

In [None]:
# unique_nodes is on the left, user_data is on the right

nodes = unique_nodes.merge(user_data, left_on='id', right_on='user.screen_name', how='left')

In [None]:
nodes

In [None]:
# no need for the extra user.screen_name column
nodes = nodes.drop(columns=['user.screen_name'])
nodes
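
A left merge keeps every row from the left dataframe even when there is no match on the right, filling the missing attributes with `NaN`. In our case this may happen for users who were retweeted but never tweeted themselves within the dataset, since `user_data` is built only from the `user.*` columns. The sketch below uses toy data and the `indicator=True` option of `.merge` to list the unmatched nodes.

```python
import pandas as pd

# Toy data: 'carol' is a node but has no row in the user data
toy_nodes = pd.DataFrame({'id': ['alice', 'bob', 'carol']})
toy_user_data = pd.DataFrame({
    'user.screen_name': ['alice', 'bob'],
    'user.followers_count': [12, 5],
})

# indicator=True adds a '_merge' column recording where each row was found
merged = toy_nodes.merge(
    toy_user_data, left_on='id', right_on='user.screen_name',
    how='left', indicator=True,
)

# Rows marked 'left_only' had no match in the right-hand dataframe
unmatched = merged.loc[merged['_merge'] == 'left_only', 'id'].tolist()
print(unmatched)
```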

In [None]:
edges

In [None]:
nodes.to_csv('retweet_node_list.csv',index=False)
edges.to_csv('retweet_edge_list.csv',index=False)
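
Before moving to Gephi, it can be worth checking that every user named in the edge list also has a row in the node list. This is a minimal sketch on toy data; with the real data you would run the same check on `edges` and `nodes`.

```python
import pandas as pd

# Toy node and edge lists (hypothetical users)
toy_nodes = pd.DataFrame({'id': ['alice', 'bob'], 'Label': ['alice', 'bob']})
toy_edges = pd.DataFrame({'source': ['alice'], 'target': ['bob'], 'weight': [2]})

# Every name that appears at either end of an edge
endpoints = set(toy_edges['source']) | set(toy_edges['target'])

# Names used in edges that have no matching node row
missing = endpoints - set(toy_nodes['id'])
print(missing)  # an empty set means every endpoint has a node row
```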

Now we go to...

<img src="https://github.com/Minyall/sc207_materials/blob/master/images/gephi-logo-2010-transparent.png?raw=true" align="left" width="150">

### Some Notes
#### Filter sets

To examine individual communities

- Giant component
    - Inter Edges (modularity Class)
        - Degree range 2

To examine the overall structure
- Giant component
    - Degree range 2


#### Measures
- Weighted in-degree to show influence
- PageRank centrality to show those who may have the ear of an influencer.
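
Gephi calculates these measures for you, but as a rough cross-check, weighted in-degree can also be sketched in Pandas: it is the sum of the weights on all edges pointing at a node. The example below uses toy data, not the session's dataset.

```python
import pandas as pd

# Toy edge list: 'b' is retweeted by two different users
toy_edges = pd.DataFrame({
    'source': ['a', 'c', 'b'],
    'target': ['b', 'b', 'a'],
    'weight': [2, 1, 1],
})

# Weighted in-degree: total weight of incoming edges for each target
weighted_in_degree = (
    toy_edges.groupby('target')['weight']
             .sum()
             .sort_values(ascending=False)
)
print(weighted_in_degree)
```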
