<img src="https://github.com/Minyall/sc207_materials/blob/master/images/gephi_network.png?raw=true" align="right" width="300">


# SC207 - Session 5
# Social Network Analysis - NetworkX and Gephi



- Social Network Analysis allows us to explore how different subjects relate to one another in complex ways.
- We often think of network analysis in relation to social networks: how different individuals are connected to one another, how this forms larger communities, and how it influences the movement of information.
- Network analysis can be used for a much wider variety of things: essentially anything that can be understood as existing in relation to something else.

# Tools
- Today we will be using two tools:
1. NetworkX - a Python library for shaping data into network form. NetworkX also offers a number of network analysis techniques, but we will stick to data structuring so that we can export our data into...
2. Gephi - a network visualisation and analysis tool with a click-button interface.

### Imports
Today we will just need...
- Pandas to import and reshape our Twitter data
- NetworkX to do the final reshaping and export it to a file readable by Gephi.

In [None]:
# Imports here

# 1. Testing on Toy Data
- Often it can be useful to test out techniques on simpler datasets just to see how the process works before we try with something larger.
- We're going to use a simple network that shows which Marvel characters appear in which of the Marvel movies.
- Data originally compiled by the [Social Media Research Foundation](https://www.smrfoundation.org/nodexl/teaching-with-nodexl/teaching-resources/) from IMDB in 2018.

In [None]:
# Load the 'marvel_movie_data.csv' here as df

df = 
df.info()

In [None]:
df.head()

To build this simple data into a network is relatively easy, and we can get NetworkX to do most of the heavy lifting.
- We're going to make an undirected graph, that means there is just an edge between a movie and a character, there is no direction involved.
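
As a rough sketch of what this looks like, here is a tiny made-up edge list turned into an undirected graph. The rows are hypothetical, the column names `movie` and `character` are assumed to match the CSV, and the `toy_` names are only there so nothing clashes with the variables used in the exercises.

```python
import pandas as pd
import networkx as nx

# Hypothetical toy edge list: one row per (movie, character) appearance.
toy_df = pd.DataFrame({
    "movie": ["Iron Man 2", "Iron Man 2", "The Avengers"],
    "character": ["Tony Stark", "Natasha Romanoff", "Tony Stark"],
})

# nx.Graph is undirected, so each row becomes one edge with no direction.
toy_G = nx.from_pandas_edgelist(
    toy_df, source="movie", target="character", create_using=nx.Graph
)

print(toy_G.number_of_nodes())
print(toy_G.number_of_edges())
# In an undirected graph the order of the two ends does not matter.
print(toy_G.has_edge("Tony Stark", "Iron Man 2"))
```

Even though `from_pandas_edgelist` asks for a `source` and a `target` column, an undirected graph simply treats them as the two ends of the edge.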

### Creating the Graph with an Edge List

In [None]:
marvel_G = 

In [None]:
# check number of edges

In [None]:
# check number of nodes

In [None]:
# examine nodes directly

In [None]:
# examine edges directly

It may be useful to also have a little bit of information about the nodes. Normally a network graph, if represented as a spreadsheet, would be split into two items: the edge list, which we already have, and the node list, which provides information about the attributes of our nodes.

Let's keep this simple by giving our nodes one attribute, `type`, which will be either 'movie' or 'character'.
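
One possible way to build such a node list, sketched on hypothetical toy data (the `movie` and `character` column names are assumed to match the real file):

```python
import pandas as pd

# Hypothetical toy edge list, not the real Marvel data.
toy_df = pd.DataFrame({
    "movie": ["Iron Man 2", "Iron Man 2", "The Avengers"],
    "character": ["Tony Stark", "Natasha Romanoff", "Tony Stark"],
})

# A single column is a Series; .to_frame() turns it back into a DataFrame.
toy_movies = toy_df["movie"].to_frame()

# Keep one row per movie and renumber the rows from 0.
toy_movies = toy_movies.drop_duplicates(subset="movie").reset_index(drop=True)

# Assigning a single value fills the new column with it on every row.
toy_movies["type"] = "movie"

print(toy_movies)
```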

### Create a Nodes List with Attributes

In [None]:
# We can make a series into a new dataframe by using the .to_frame() method

movie_nodes = 
# drop duplicates on the 'movie' column and reset index
movie_nodes =

# create a new 'type' column with the value 'movie'
movie_nodes.head()

In [None]:
# Same again for the characters 

char_nodes = 

# drop duplicates on the 'character' column and reset index
char_nodes = 

# create a new 'type' column with the value 'character'

char_nodes.head()

In [None]:
# next we will stack these two dataframes on top of one another using pd.concat
# BUT first we need to rename the movie column, and the character column to a common name so that the two dataframes line up properly.
movie_nodes = 
char_nodes = 

In [None]:
movie_nodes.head()

In [None]:
char_nodes.head()

### pd.concat 
- 'Concatenates' two Series, or two dataframes together into one. 
- Ideally they will share the same index or column names.
- Think of it as two spreadsheets lined up against each other either top to bottom, or left to right.

#### `axis='columns'`


<img src="https://github.com/Minyall/sc207_materials/blob/master/images/pandas%20column%20concat.png?raw=true" width=500>


#### `axis='index'`


<img src="https://github.com/Minyall/sc207_materials/blob/master/images/pandas%20row%20concat.png?raw=true" width=500>


*Images from http://www.datasciencemadesimple.com*
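
To see the difference between the two axes in code, here is a small sketch with two hypothetical one-row dataframes. The shared column name `node` is just an illustrative choice.

```python
import pandas as pd

# Two hypothetical frames that share the same column names.
top = pd.DataFrame({"node": ["Iron Man 2"], "type": ["movie"]})
bottom = pd.DataFrame({"node": ["Tony Stark"], "type": ["character"]})

# axis='index' stacks the rows; matching column names line up.
stacked = pd.concat([top, bottom], axis="index").reset_index(drop=True)

# axis='columns' places the frames side by side, aligned on the row index.
side_by_side = pd.concat([top, bottom], axis="columns")

print(stacked.shape)
print(side_by_side.shape)
```

Without `reset_index(drop=True)` the stacked frame would keep the original row labels, so both rows would be labelled `0`.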


In [None]:
# let's concatenate them together top to bottom  (so on the 'index') and call it 'nodes'. make sure you reset the index after

nodes =
nodes['type'].value_counts()

## Adding Node Attributes

To attach attributes to nodes in NetworkX you need a dictionary where each key is a node id (in our case the movie or character name) and each value is another dictionary, in which the key is the name of the attribute and the value is the attribute value.

We want an attribute dictionary for each node like this...

`{'type':'movie'}`


Which is then embedded in another dictionary where the keys are our node names...

```
{'Iron Man 2':{'type': 'movie'},
 'Dr. Strange':{'type': 'character'},
 'Spiderman':{'type': 'movie'},
 ... }
```

It is worth knowing that your attribute dictionary can hold as many attributes as you like.

Attribute dictionary with 2 attributes: `{'type':'movie', 'release date': 2010}`

The final dictionary keyed to the node name: `{'Iron Man 2':{'type': 'movie', 'release date': 2010}}`

#### Looks complicated?
PANDAS TO THE RESCUE!

As long as we have set up our node dataframe where each attribute is its own column we can use this chain of methods...
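
Here is a minimal sketch of that chain on a hypothetical two-row node table. The column name `node` is an assumption; use whatever common name you gave the column.

```python
import pandas as pd

# Hypothetical node table: one row per node, one column per attribute.
toy_nodes = pd.DataFrame({
    "node": ["Iron Man 2", "Tony Stark"],
    "type": ["movie", "character"],
})

# 1. Index by node name, 2. transpose so each node becomes a column,
# 3. convert to a dictionary of {column: {row label: value}}.
toy_attributes = toy_nodes.set_index("node").T.to_dict()

print(toy_attributes)
```

Calling `.to_dict(orient="index")` straight after `set_index` should give the same nested dictionary without the transpose step.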

In [None]:
# first we set our index to be the value that represents our node in the NetworkX graph


In [None]:
# then we transpose the data so columns become rows and rows become columns


In [None]:
#.. and finally we convert to a dictionary



In [None]:
node_attributes = 

In [None]:
# finally we can use this dictionary to attach the right attribute values to the right nodes using nx.set_node_attributes



In [None]:
# now examine the nodes with data=True



### Exporting to Gephi
Finally we export the NetworkX graph to a Gephi-compatible .gexf file.
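
As a rough illustration, the sketch below attaches attributes to a tiny hypothetical graph, writes it to a temporary .gexf file, and reads it back to check that the attributes survived. The graph, the file name `toy_graph.gexf`, and the use of a temporary folder are all just for the example.

```python
import os
import tempfile
import networkx as nx

# Hypothetical two-node graph with one attribute per node.
toy_G = nx.Graph()
toy_G.add_edge("Iron Man 2", "Tony Stark")
nx.set_node_attributes(toy_G, {
    "Iron Man 2": {"type": "movie"},
    "Tony Stark": {"type": "character"},
})

# Write to a temporary folder so the sketch leaves no file behind.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_graph.gexf")
    nx.write_gexf(toy_G, path)
    # Reading the file back is a quick check that the export worked.
    round_trip = nx.read_gexf(path)

print(round_trip.nodes(data=True))
```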

In [None]:
# export using nx.write_gexf  to filename 'marvel_graph.gexf'



<img src="https://github.com/Minyall/sc207_materials/blob/master/images/gephi-logo-2010-transparent.png?raw=true" align="left" width="300">

Let's move over to Gephi to explore our Graph

# 2. Real Deal: Twitter Data

- We're going to make a retweet network.
- In this network every node will represent a different user.
- An edge between user `a` and user `b` will represent one user retweeting the other.
- Edges will be given a `weight` that counts how many times user `a` retweeted user `b`.
- We will make our network `directional`, meaning that we will record separately
  - how many times `a` is retweeted by `b`
  - and how many times `b` is retweeted by `a`
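
To make the idea of separate directions concrete, here is a small sketch with two hypothetical users and made-up counts. In a `DiGraph` the edge from `a` to `b` and the edge from `b` to `a` are stored independently, each with its own `weight`.

```python
import networkx as nx

toy_D = nx.DiGraph()

# Hypothetical counts, stored under the 'weight' edge attribute.
toy_D.add_edge("a", "b", weight=3)
toy_D.add_edge("b", "a", weight=1)

# Two edges, not one: direction matters in a DiGraph.
print(toy_D.number_of_edges())
print(toy_D["a"]["b"]["weight"], toy_D["b"]["a"]["weight"])
```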

In [None]:
# Load in 'large_brexit_tweets.pkl' using the pd.read_pickle method

df = 
len(df)

Again we will create two dataframes, one for edges and one for node attributes.

### Edges 
So we need to identify the columns that represent the two ends of our edges...
 - If we imagine that in our network an arrow goes from the original tweet (the 'retweeted status' as Twitter calls it) ---> to the new status (the retweet we collected)...
 - Look at the columns using `df.info()`
 - We probably want to have the `user.screen_name` and the `retweeted_status.user.screen_name` as the two ends of our edges

In [None]:
# examine the df info


In [None]:
# subset the dataframe so you are just looking at the two columns we'll use for our edges. Use .copy() so it is an independent object from our original data.

edges = 
edges.head()

So we took our df, which was a list of Tweets, and now are just focusing on...
- `user.screen_name` the user that tweeted the status we collected
- `retweeted_status.user.screen_name` the user that tweeted the original status update that was retweeted by `user.screen_name`

Some of these tweets will not be Retweets, and so will have a `NaN` value in the `retweeted_status.user.screen_name` column. We can check with `edges.info()`

In [None]:
# check the .info() to get an overview of the data we have
edges.info()

Let's drop any rows that don't have a value under `retweeted_status.user.screen_name`
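
A minimal sketch of the idea on hypothetical rows, where one tweet is not a retweet and so has a missing value:

```python
import numpy as np
import pandas as pd

# Hypothetical tweets: bob's tweet is original, so it has no retweeted user.
toy_edges = pd.DataFrame({
    "user.screen_name": ["alice", "bob", "carol"],
    "retweeted_status.user.screen_name": ["bob", np.nan, "alice"],
})

# Only look for missing values in the one column we care about.
toy_edges = (
    toy_edges
    .dropna(subset=["retweeted_status.user.screen_name"])
    .reset_index(drop=True)
)

print(len(toy_edges))
```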

In [None]:
# drop any rows that are empty in the column 'retweeted_status.user.screen_name'. Ensure you reset the index
edges = 
edges.info()

To make things simpler, let's rename our columns to reflect their roles in the graph: one will be the `source`, the other the `target`.
How you conceptualise the direction of an edge is up to you, and should make sense in terms of the area you are looking at.

We will make it so that...
- `source` == `retweeted_status.user.screen_name`
- `target` == `user.screen_name`

... which frames it as the `retweeted_status.user.screen_name` -[influencing]-> the `user.screen_name` who then retweets that information.

You could alternatively frame it as the `user.screen_name` -[retweeted the]-> `retweeted_status.user.screen_name` which would then require us to reverse our source and target assignments to make it go in the opposite direction.
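
The first framing could be sketched like this on hypothetical rows. Swapping the two values in the dictionary would give the alternative direction.

```python
import pandas as pd

# Hypothetical retweets: alice retweeted bob, carol retweeted alice.
toy_edges = pd.DataFrame({
    "user.screen_name": ["alice", "carol"],
    "retweeted_status.user.screen_name": ["bob", "alice"],
})

# The retweeted user is the source, the retweeting user is the target.
toy_edges = toy_edges.rename(columns={
    "retweeted_status.user.screen_name": "source",
    "user.screen_name": "target",
})

print(toy_edges)
```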



In [None]:
# Whilst conceptualising all this is tricky, actually making the change is simply a matter of renaming

edges = 

In [None]:
edges.head()

We also said we were going to make sure we had just one edge between each pair, but assign the edge a 'weight' score that indicates how many times that retweeting had happened. We can do this quickly using Pandas `groupby`.
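
Here is a small sketch of the counting trick on a hypothetical edge list where one pair appears twice:

```python
import pandas as pd

# Hypothetical edge list: alice retweeted bob twice, carol retweeted alice once.
toy_edges = pd.DataFrame({
    "source": ["bob", "bob", "alice"],
    "target": ["alice", "alice", "carol"],
})

# Every row starts out as one retweet.
toy_edges["weight"] = 1

# Rows with the same (source, target) pair collapse into one, weights added up.
toy_weighted = (
    toy_edges.groupby(["source", "target"])["weight"].sum().reset_index()
)

print(toy_weighted)
```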

In [None]:
# first we give every edge a weight of 1 by creating a new 'weight' column and giving every row the value of 1


In [None]:
# Then we group by both the source and the target columns and sum together the weights. Make sure you reset the index too
# run the groupby without assigning it to anything to check the result

edges.

In [None]:
# Looks good, let's finalise that by overwriting our edges variable; no need to sort it
edges = 

# and create an edge_type column with the value 'retweet'


edges['edge_type'] = 'retweet'

### Nodes
This dataframe will be a list of unique nodes and we will assign some attributes to the nodes that we can use in Gephi later on.
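
One way to get that list of unique nodes, sketched on a hypothetical edge list:

```python
import pandas as pd

# Hypothetical edge list; alice appears as both a source and a target.
toy_edges = pd.DataFrame({
    "source": ["bob", "alice"],
    "target": ["alice", "carol"],
})

# Stack the two columns into one long Series of names.
all_names = pd.concat(
    [toy_edges["source"], toy_edges["target"]], axis="index"
).reset_index(drop=True)

# .unique() returns an array, so wrap it in a new dataframe.
toy_nodes = pd.DataFrame({"node_name": all_names.unique()})

print(toy_nodes)
```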

In [None]:
# create your nodes list by concatenating the source and target columns on top of one another and reset the index

nodes = 

In [None]:
nodes

In [None]:
# use the .unique method on your nodes list to quickly filter out repeat values. 
# Wrap the result in a new dataframe with the column name 'node_name' as it will come out as a list


nodes= 

In [None]:
nodes

As an example let's use the lifetime statuses count and the follower count of each user as attributes that we can use in Gephi later.
That means we're going to need...

`'user.screen_name', 'user.statuses_count','user.followers_count'`

In [None]:
user_data = df[['user.screen_name', 'user.statuses_count','user.followers_count']].copy()
user_data

Currently user_data is essentially a list of tweets showing just the username, and then the status count and follower count of the user at the point they tweeted. This means that a user may occur more than once in the list, with different values. 

The solution is to ask Pandas to give us the highest value it can find for each column, for each user. We use groupby to achieve this.
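
A minimal sketch on hypothetical values, where one user appears twice with different counts:

```python
import pandas as pd

# Hypothetical tweets: alice tweeted twice and her counts changed in between.
toy_user_data = pd.DataFrame({
    "user.screen_name": ["alice", "alice", "bob"],
    "user.statuses_count": [100, 105, 20],
    "user.followers_count": [50, 48, 7],
})

# One row per user, keeping the largest value found in each column.
toy_user_data = toy_user_data.groupby("user.screen_name").max().reset_index()

print(toy_user_data)
```

Note that the maximum is taken for each column separately, so the two values kept for a user may come from different tweets.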

In [None]:
# group user_data by the 'user.screen_name' then retrieve the max value and reset the index

user_data =
user_data

We now have our list of unique nodes and a dataframe of attributes for all the users in our dataset. We can merge these attributes onto our nodes list so that the correct values line up with the correct screen_names
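
The merge could look something like the sketch below, using hypothetical data. A left merge keeps every row of the nodes list, so a node with no matching row in the user data (here `carol`) ends up with missing attribute values rather than being dropped.

```python
import pandas as pd

# Hypothetical node list and attribute table; carol has no attributes.
toy_nodes = pd.DataFrame({"node_name": ["bob", "alice", "carol"]})
toy_user_data = pd.DataFrame({
    "user.screen_name": ["alice", "bob"],
    "user.statuses_count": [105, 20],
})

# how='left' prioritises the left-hand frame, i.e. the nodes list.
toy_merged = pd.merge(
    toy_nodes, toy_user_data,
    how="left", left_on="node_name", right_on="user.screen_name",
)

# The right-hand key column duplicates 'node_name', so it can go.
toy_merged = toy_merged.drop(columns="user.screen_name")

print(toy_merged)
```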

In [None]:
# merge nodes with user data, merging on 'node_name' from the nodes list,
# and 'user.screen_name' from the user_data, and we should merge prioritising the nodes dataframe
nodes = 

In [None]:
nodes

In [None]:
# we can lose 'user.screen_name'. 
# Pandas thinks it is a unique column because of the different name but we know it is the same as 'node_name'

# Do this inplace for fun

nodes.drop(

In [None]:
nodes

In [None]:

# create the attribute dictionary using our string of methods we showed earlier.
# Set the index on the node key, transpose, and transform into a dictionary.
attribute_dict = 
attribute_dict

In [None]:

# Using two lines of code we will create our graph and load the node attributes

# create the graph from the edges edge list using a DiGraph - Directional Graph
G = 

# set the node attributes of our new graph using our attribute_dict
nx.set_node_attributes(G, values=attribute_dict)

In [None]:
# Check the nodes

G.nodes(data=True)

In [None]:
# write the graph to a gexf file named 'large_brexit.gexf'

nx.write_gexf(G, 'large_brexit.gexf')