## Structure
- Text to entities
- Loading into networkx
- Standard metrics and measures
- Communities
- Visualising with Gephi


# From Text to Network
- Last session we looked at how TV shows and films could be transformed into relational data by thinking about them differently. 
- By representing them as characters co-occurring in scenes, we were able to create a representation that showed us who the most central and critical characters were and, in some cases, to demarcate storylines through the clustering of those characters.
- If you knew the franchise or series, the results would not have been a revelation, and this is a good thing.
- Even if you'd never heard of 'Friends', if you were shown that network you would know who the six main characters were. You would perhaps know who the most prominent side characters were, and so forth. 
- Even if you knew the series well, it may show you patterns you didn't necessarily realise by consuming it directly.

Today we're going to build a dataset for a different series, the news. We'll identify the main characters, the side characters, the storylines and perhaps see things we wouldn't see otherwise.

In [None]:
# Define the nlp model

# Load in our dataset


In [None]:
# Take a sample of the first 100 articles - .copy() means any changes we make are completely separate from our original dataset
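As a minimal sketch of the sampling step, the cell below uses a made-up stand-in dataset (the `articles` DataFrame and its `clean_text` column are assumptions for illustration, not the real data) to show why `.copy()` matters.

```python
import pandas as pd

# Hypothetical stand-in for the real dataset of news articles
articles = pd.DataFrame({"clean_text": [f"Article {i} text" for i in range(250)]})

# Take the first 100 rows; .copy() gives us an independent DataFrame
sample = articles.head(100).copy()

# Editing the sample leaves the original dataset untouched
sample.loc[0, "clean_text"] = "edited"
print(articles.loc[0, "clean_text"])
```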


## Creating our 'Scenes'
In our TV/film data, the material had been subdivided into scenes so we could then ask "how often were these characters 'together'". In our data we have articles, which are analogous to 'episodes', and then paragraphs, which are analogous to scenes: subdivisions of the whole. If we think about writing, if two names are mentioned in a single paragraph it is probably because something relates them. Before we hunt for names, let's subdivide our data into paragraphs.

In [None]:
# We're going to make a column of paragraphs by splitting the cleaned text wherever there is a newline character \n

# We can see what the first row of this column looks like as an example


In [None]:
# Now we're going to make it so that every row is a single paragraph. The index will keep track of which article each paragraph relates to.

# and then we'll take just the paragraphs column, give it a new index and keep the old one, calling it 'article_idx'
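One way the two cells above could look, sketched on a tiny made-up dataset. The `clean_text` column name and the example index values are assumptions; `paragraph` and `article_idx` follow the names used in the comments.

```python
import pandas as pd

# Hypothetical two-article sample; the index identifies the article
sample = pd.DataFrame(
    {"clean_text": ["Para one.\nPara two.", "Only paragraph."]},
    index=[10, 11],
)

# Split each article wherever there is a newline character
sample["paragraph"] = sample["clean_text"].str.split("\n")

# One row per paragraph; the index still points at the source article
exploded = sample.explode("paragraph")

# Keep just the paragraph column and move the old index into 'article_idx'
paragraph_data = (
    exploded[["paragraph"]]
    .rename_axis("article_idx")
    .reset_index()
)
print(paragraph_data)
```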


## Getting our 'characters'
We are now going to check each paragraph in each story for the presence of person and organisation names. We refer to these as 'entities' and the process of finding them as... 


### Named Entity Recognition
Named Entity Recognition (NER) is a technique used in natural language processing to identify specific types of information in text, such as names, organisations, places and dates. In the good old days, entities were identified by matching against great big lists of notable names, places etc. NER was developed primarily to help computers answer questions more directly, for things like search and building knowledge bases. 

Today, libraries like spaCy use trained statistical models (some of them transformer based) to identify entities from the text itself, based on the context and the linguistic relationships between the words used.


In [None]:
# An example of different types of entities in a piece of text.

# Source: https://www.bbc.co.uk/news/articles/cdx2rk10ep0o


In [None]:
# To extract the entities in a spaCy processed document we access the .ents attribute


In [None]:
# and each entity has its own attributes, .text and .label_
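A minimal sketch of `.ents`, `.text` and `.label_`, assuming spaCy v3 is installed. So that it runs without downloading a trained model, it swaps in a blank English pipeline with a rule-based `entity_ruler` and a few hand-written patterns; in the session a trained pipeline would find the entities instead. The sentence and patterns are made up.

```python
import spacy

# Blank pipeline + rule-based entity ruler (a stand-in for a trained model)
nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns([
    {"label": "PERSON", "pattern": "Ada Lovelace"},
    {"label": "ORG", "pattern": "Royal Society"},
    {"label": "GPE", "pattern": "London"},
])

doc = nlp("Ada Lovelace wrote to the Royal Society from London.")

# doc.ents holds the entity spans; each has .text and .label_
ents = [(ent.text, ent.label_) for ent in doc.ents]
print(ents)
```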


In [None]:
# Let's apply this to our paragraphs

# we define the entity types we want to keep - 
# (this could be a list but generally if you don't intend to change that list it is clearer to use a tuple)

# Our sample is only 100 items big but if you want to do larger amounts it's good to manage the pipe batch size.
#  Refer back to session 2 for more detail.

# Our destination list. Each item will be a list of entities in a paragraph

# For every paragraph in our paragraph_data['paragraph'] column, processed in a spacy pipe using 1 process, handling 150 items at a time

# assert checks if a statement is true and throws an error if it is False
# Here we assert that our destination list is the same length as our number of paragraphs

# We then take that list and turn it into a column in our dataset
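The steps in this cell might be sketched as follows. As before, a blank pipeline with hand-written `entity_ruler` patterns stands in for a trained model, and the paragraphs are invented. Removing repeated names within a paragraph is my own assumption; it keeps later co-occurrence counts at one per paragraph.

```python
import pandas as pd
import spacy

# Stand-in pipeline: rule-based patterns instead of a trained model
nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns([
    {"label": "PERSON", "pattern": "Ada Lovelace"},
    {"label": "ORG", "pattern": "Royal Society"},
    {"label": "GPE", "pattern": "London"},
])

paragraph_data = pd.DataFrame({"paragraph": [
    "Ada Lovelace visited London.",
    "The Royal Society met Ada Lovelace.",
    "Nothing to see here.",
]})

# Entity types to keep (a tuple, as we don't intend to change it)
keep_labels = ("PERSON", "ORG")

# Destination list: one list of entity names per paragraph
entities_per_paragraph = []
for doc in nlp.pipe(paragraph_data["paragraph"], n_process=1, batch_size=150):
    names = [ent.text for ent in doc.ents if ent.label_ in keep_labels]
    # Assumption: drop repeats within a paragraph, preserving order
    entities_per_paragraph.append(list(dict.fromkeys(names)))

# One result per paragraph, otherwise something went wrong
assert len(entities_per_paragraph) == len(paragraph_data)

paragraph_data["entities"] = entities_per_paragraph
```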


In [None]:
# Whilst it seemed a lot, without the comments it is quite a concise process.


## From Entities to Network
We now have our list of paragraphs (scenes) and entities (characters). Finally we're going to make a network.
- A Node: Represents an entity in the paragraphs.
- An Edge: Represents a co-occurrence of two entities.

Node Attributes
- `n_paragraphs`: Number of paragraphs the entity occurs in.
- `n_articles`: Number of articles the entity occurs in.

Edge Attributes
- `'weight'`: Number of times the two entities co-occur in a paragraph

In [None]:
# We transform our data to be an entity per row. Again the index tracks which entities were originally together.


In [None]:
# Node attributes first
## If we group the above by entities then we can count the number of rows containing that entity, that's n_paragraphs.
# We can also count the number of unique article_idx numbers, that's n_articles

# Finally we transform this into a dictionary, ensuring that the key is the entity name (orient='index')
# We'll use this dictionary of attributes later.
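A sketch of the entity-per-row transformation and the node attribute dictionary, using a small invented `paragraph_data`. It assumes each paragraph's entity list contains a name at most once, so counting rows is the same as counting paragraphs.

```python
import pandas as pd

# Hypothetical data: three paragraphs across two articles
paragraph_data = pd.DataFrame({
    "article_idx": [0, 0, 1],
    "entities": [["Ann", "Bob"], ["Ann"], ["Ann", "Cat"]],
})

# One entity per row; paragraphs with no entities become NaN and are dropped
entity_rows = paragraph_data.explode("entities").dropna(subset=["entities"])

# Rows per entity = n_paragraphs, unique article ids per entity = n_articles
node_attrs = (
    entity_rows.groupby("entities")
    .agg(
        n_paragraphs=("article_idx", "size"),
        n_articles=("article_idx", "nunique"),
    )
    .to_dict(orient="index")
)
print(node_attrs)
```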


In [None]:
# Next we're going to create an adjacency matrix. 
# This shows the structure of the network and takes care of the edge weight attribute

# First we turn our column of individual entities into a dummy matrix
# Each row represents an entity - for each row every column will be 0 apart from the column matching that entity's name which will be 1.


In [None]:
# To make it one row per paragraph, we group by the index (level=0) and add the rows together.
# Now each row shows 1 in the column if that entity is in the paragraph, otherwise 0.


In [None]:
# Magic time
# This is called matrix multiplication. It is a key part of graph theory.
# All we need to know is that it is able to calculate how many times each of the items in our columns co-occur.

# An adjacency matrix has all the information needed to make a weighted network
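The three cells above can be sketched end to end on toy data. The series below is a made-up example in which the index is the paragraph number; it again assumes a name appears at most once per paragraph.

```python
import pandas as pd

# Hypothetical entity-per-row data: the index says which paragraph each came from
entity_rows = pd.Series(
    ["Ann", "Bob", "Ann", "Ann", "Cat"],
    index=[0, 0, 1, 2, 2],
    name="entities",
)

# Dummy matrix: one row per entity mention, a 1 in that entity's column
dummies = pd.get_dummies(entity_rows, dtype=int)

# One row per paragraph: group by the index and add the rows together
occurrence = dummies.groupby(level=0).sum()

# Matrix multiplication: entities x paragraphs times paragraphs x entities
# Each cell counts the paragraphs in which the row and column entity co-occur
adjacency = occurrence.T.dot(occurrence)
print(adjacency)
```

Note the diagonal: an entity 'co-occurs' with itself in every paragraph it appears in, which is where the self-loops come from later.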


In [None]:
# Finally we can make our network using networkx

# We can check if the network's attributes look right using...
# G.nodes(data=True)
# G.edges(data=True)
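A minimal sketch of building the graph, assuming an adjacency DataFrame and an attribute dictionary shaped like those described above (the values here are invented).

```python
import networkx as nx
import pandas as pd

# Hypothetical adjacency matrix and node attributes
names = ["Ann", "Bob", "Cat"]
adjacency = pd.DataFrame(
    [[3, 1, 1], [1, 1, 0], [1, 0, 1]], index=names, columns=names
)
node_attrs = {
    "Ann": {"n_paragraphs": 3, "n_articles": 2},
    "Bob": {"n_paragraphs": 1, "n_articles": 1},
    "Cat": {"n_paragraphs": 1, "n_articles": 1},
}

# Non-zero cells become edges, with the cell value stored as 'weight'
G = nx.from_pandas_adjacency(adjacency)

# Attach the per-entity attributes to the matching nodes
nx.set_node_attributes(G, node_attrs)

print(G.nodes(data=True))
print(G.edges(data=True))
```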

In [None]:
# Let's export and take a look in Gephi

# For now, just ignore these lines. They are filtering the graph, but we'll explain them later.

# This line exports the graph to a gexf file that is compatible with gephi
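The export step on its own might look like this. The graph and the filename `news_network.gexf` are placeholders for illustration.

```python
import networkx as nx

# Hypothetical tiny graph with one weighted edge and a node attribute
G = nx.Graph()
G.add_edge("Ann", "Bob", weight=3)
nx.set_node_attributes(
    G, {"Ann": {"n_paragraphs": 4}, "Bob": {"n_paragraphs": 3}}
)

# Write a .gexf file that Gephi can open (placeholder filename)
nx.write_gexf(G, "news_network.gexf")
```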


# Exploring our Network

<img src="https://github.com/Minyall/sc207_290_public/blob/main/images/gephi_settings.png?raw=true" align="right" height=200>

When we first open our graph in Gephi, we'll get a mess of nodes and edges. The filters shown in the image are just one way of reducing the noise in the graph. In Gephi, filters are nested and should be read bottom to top, right to left. So reading in that order the filters shown in the picture...
1. Retains only edges that have **edge weights** between 2 and 39. 39 will just be the largest weight we have so essentially it says, minimum weight is 2.
2. Then that remaining graph is passed to the next filter, which says each node must have a minimum **degree** of 2, i.e. a node must be connected to at least two other nodes. 
3. Then *that* graph is passed to the final filter, **Giant Component**. Components are those clusters of connected nodes. We have one big one in the middle, and lots of little ones floating around it. The Giant component is the biggest one, so essentially it only retains nodes and edges that are part of that large cluster in the middle.

Gephi's filter UI is a little unintuitive, but it makes sense once you understand that every filter will have an impact on the next filter applied. Removing edges whose weight is too low might disconnect nodes, reducing their degree, which in turn may disconnect them from the giant component.


#### Should I always filter like this?
No. It very much depends on your data, and what you want to know. With the attributes we gave our data we could also filter by number of paragraphs a person appeared in, or number of articles. Each measurement filters people in different ways.

- Degree - Number of connections - Number of unique entities a node co-occurred with.
- Edge weight - Number of times two entities co-occurred.
- n_paragraphs / n_articles - Number of paragraphs an entity appeared in or number of articles. Someone may occur many times in just one article.
- Giant component - Filters out anything not connected to the largest mass, but what if you have two relatively equal disconnected masses?

Each choice in your filtering, as well as the order you do the filtering, has an impact on any subsequent analysis (like community detection) and the final outcome. It is important to understand what each filter is doing in relation to *your data* and what it means for *your question*.

## Filtering with NetworkX
Whilst you can filter with Gephi, you may find it useful to filter with networkx for various reasons. 
- Primarily it may be that having explored your data in Gephi, you want to lock in place a certain filtered version of your data. 
    - Solidifying your filtering into clear steps in your code makes it clear exactly what filtering took place, in what order, and ensures it is consistent if you need to run analysis again.
- You may want to use NetworkX to get exact figures on various metrics, and not want to shift over to Gephi to retrieve them.
- You may want to integrate the outputs of network analysis into a larger project - (more on this next week).


In [None]:
# We'll rebuild the graph again just so we're clear on what we're starting with.


The two main filtering methods are `.remove_nodes_from` and `.remove_edges_from`. The first takes a list of nodes to remove, the second a list of edges. To filter the graph in some way, what differs is how you identify the nodes and/or edges to remove.

In [None]:
# Self loops - edges that connect nodes to themselves.
## The way we make our graph via an adjacency matrix will always produce self-loop edges.
## However, they should be removed for our analysis as it makes no sense to consider that Person A appears in the same paragraph as Person A.

# We identify the edges to remove using this networkx function

# and then remove them from our graph

# this is just so we can see the changes
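A small sketch of self-loop removal on an invented graph. `nx.selfloop_edges` returns an iterator, so it is wrapped in `list()` before removal to avoid changing the graph while it is still being read.

```python
import networkx as nx

# Hypothetical graph with two self-loops and one real edge
G = nx.Graph()
G.add_weighted_edges_from([("Ann", "Ann", 3), ("Ann", "Bob", 1), ("Bob", "Bob", 1)])
print("before:", G.number_of_edges())

# Identify the self-loop edges, then remove them
loops = list(nx.selfloop_edges(G))
G.remove_edges_from(loops)

print("after:", G.number_of_edges())
```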


In [None]:
# Removing based on edge weight
## For this one we need to build our list of edges by checking each edge to see if its weight attribute meets our criteria.

# and we remove the edges

# Note that removing edges will not remove nodes, even if they now have zero edges.
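As a sketch on toy data, edge filtering by weight could look like this. The threshold name `min_weight` and the example edges are assumptions.

```python
import networkx as nx

# Hypothetical weighted graph
G = nx.Graph()
G.add_weighted_edges_from([("Ann", "Bob", 5), ("Bob", "Cat", 1), ("Cat", "Dan", 2)])

min_weight = 2  # illustrative threshold

# Collect every edge whose weight falls below the threshold
low_weight = [(u, v) for u, v, w in G.edges(data="weight") if w < min_weight]
G.remove_edges_from(low_weight)

# All four nodes are still present, even though one edge has gone
print(G.number_of_nodes(), G.number_of_edges())
```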

In [None]:
# Removing nodes by attribute
## Like above, we can check the attributes of nodes and remove those we don't want
### For example, filtering by n_paragraphs

#  we call the appropriate method

# Note that removing nodes will remove edges if they lose either their source or target node.
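A comparable sketch for node attributes, again on made-up data with an illustrative threshold `min_paragraphs`.

```python
import networkx as nx

# Hypothetical graph where each node records how many paragraphs it appears in
G = nx.Graph()
G.add_nodes_from([
    ("Ann", {"n_paragraphs": 5}),
    ("Bob", {"n_paragraphs": 1}),
    ("Cat", {"n_paragraphs": 3}),
])
G.add_edges_from([("Ann", "Bob"), ("Ann", "Cat")])

min_paragraphs = 2  # illustrative threshold

# Collect nodes that appear in too few paragraphs, then remove them
rare = [n for n, count in G.nodes(data="n_paragraphs") if count < min_paragraphs]
G.remove_nodes_from(rare)

# Bob has gone, and so has the Ann-Bob edge
print(list(G.nodes), list(G.edges))
```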


In [None]:
# Remove nodes by degree
## Uses the graph's .degree view 

# Note we lose nodes AND edges: removing a low-degree node
# also removes any edge that was attached to it.
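A sketch of the degree filter on an invented graph. This is a single pass: degrees are measured once, before anything is removed, so a node that drops below the threshold as a result of the removal stays in.

```python
import networkx as nx

# Hypothetical graph: a triangle with one extra node hanging off it
G = nx.Graph()
G.add_edges_from([("Ann", "Bob"), ("Bob", "Cat"), ("Cat", "Ann"), ("Cat", "Dan")])

min_degree = 2  # illustrative threshold

# G.degree() yields (node, degree) pairs
low_degree = [n for n, d in G.degree() if d < min_degree]
G.remove_nodes_from(low_degree)

print(list(G.nodes), G.number_of_edges())
```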


In [None]:
# Components
## Firstly we can check how many components there are in a graph.


In [None]:
# Components
## The nx.connected_components function will produce a set of the nodes in each separate component

## We have to use list() to force it to actually produce something so we can examine it, as it is a generator (it only yields items one at a time when asked).

## Examining one of the components we see the nodes that form it


In [None]:
# Components
## To get the biggest one use the built in max function
### key= tells it what to use as its criteria for determining biggest, here it is the number of nodes in each component (len).

## In this case, rather than remove nodes, we say we want a graph that contains nodes
## from our largest component, and we want all the edges between them.
## For this we use .subgraph
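The three component cells can be sketched together on a toy graph. The `.copy()` at the end is my addition: `.subgraph` returns a read-only view, and copying it gives an independent graph we can keep editing.

```python
import networkx as nx

# Hypothetical graph with three components of different sizes
G = nx.Graph()
G.add_edges_from([("Ann", "Bob"), ("Bob", "Cat"), ("Dan", "Eve")])
G.add_node("Fay")

# How many components are there?
n_components = nx.number_connected_components(G)

# connected_components is a generator of node sets; list() materialises it
components = list(nx.connected_components(G))

# The biggest component, judged by number of nodes
largest = max(components, key=len)

# Keep those nodes and all the edges between them
giant = G.subgraph(largest).copy()
print(n_components, sorted(giant.nodes))
```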


## Summary: Applying our Gephi Filters in NetworkX
To recap, let's start again with a new graph, and apply the filters from our Gephi example.

<img src="https://github.com/Minyall/sc207_290_public/blob/main/images/gephi_settings.png?raw=true" height=200>




In [None]:
# Starting again, again!

# Remove self-loops - We do this here because it should always be standard for this kind of data.


In [None]:
# Remove low weight edges

# Remove low degree nodes (this will also clean up any nodes that were disconnected by the edge filter above)

# Get the giant component
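To lock the whole recipe in place, the steps could be wrapped in one function. This is only a sketch: `apply_gephi_filters` and its parameter names are hypothetical, and the defaults mirror the Gephi example (minimum weight 2, minimum degree 2).

```python
import networkx as nx

def apply_gephi_filters(graph, min_weight=2, min_degree=2):
    """Self-loops -> edge weight -> degree -> giant component, in that order."""
    G = graph.copy()  # leave the original graph untouched
    G.remove_edges_from(list(nx.selfloop_edges(G)))
    low_weight = [
        (u, v) for u, v, w in G.edges(data="weight", default=1) if w < min_weight
    ]
    G.remove_edges_from(low_weight)
    G.remove_nodes_from([n for n, d in G.degree() if d < min_degree])
    if G.number_of_nodes() == 0:
        return G
    largest = max(nx.connected_components(G), key=len)
    return G.subgraph(largest).copy()

# Hypothetical example graph
raw = nx.Graph()
raw.add_weighted_edges_from([
    ("A", "A", 6), ("A", "B", 3), ("B", "C", 2),
    ("C", "A", 4), ("C", "D", 1), ("D", "E", 5),
])
filtered = apply_gephi_filters(raw)
print(sorted(filtered.nodes))
```

Here the weak C-D edge is removed first, which leaves D and E with degree 1, so the degree filter then drops them: the order of the steps shapes the result.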


## Community Detection
Whilst we can do community detection in Gephi, we can also do it in NetworkX. Why? 
- Because we may want those community assignments available to us here in Python without stepping over to Gephi.
- We may want to test different settings and see their effects.
- It helps reinforce the understanding that a Network is not just about the visual 'picture' but that it ultimately is a representation of relational data that we can interrogate.
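As one possible sketch, the cell below uses NetworkX's Louvain implementation, `louvain_communities` (this assumes a reasonably recent NetworkX, 2.8 or later; other community functions could be swapped in). The graph is invented, and the seed is fixed only so repeated runs give the same answer.

```python
import networkx as nx
from networkx.algorithms.community import louvain_communities

# Hypothetical graph: two tightly knit groups joined by one weak tie
G = nx.Graph()
G.add_weighted_edges_from([
    ("Ann", "Bob", 5), ("Bob", "Cat", 5), ("Cat", "Ann", 5),
    ("Dan", "Eve", 5), ("Eve", "Fay", 5), ("Fay", "Dan", 5),
    ("Cat", "Dan", 1),
])

# Returns a list of sets, one set of nodes per community
communities = louvain_communities(G, weight="weight", seed=42)

# Turn that into {node: community number} and store it on the nodes
membership = {node: i for i, nodes in enumerate(communities) for node in nodes}
nx.set_node_attributes(G, membership, "community")

print(communities)
```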

## Graph Metrics
A key benefit of representing your data as a network is the different kinds of measures that you can derive from that structure.

We covered the metrics themselves in prior material, but here is how to use them in NetworkX. All the metrics work in the same way.

In [None]:
# Get a dictionary of scores for the chosen metric


In [None]:
# create a new column for the metric and use map to match the scores to the right rows
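A minimal sketch of the pattern, using degree centrality on a toy graph. The `node_table` DataFrame and its `entity` column are hypothetical names; other metric functions such as `nx.betweenness_centrality` return the same kind of dictionary.

```python
import networkx as nx
import pandas as pd

# Hypothetical star-shaped graph: Ann is connected to everyone else
G = nx.Graph()
G.add_edges_from([("Ann", "Bob"), ("Ann", "Cat"), ("Ann", "Dan")])

# Each metric function returns a dictionary of {node: score}
scores = nx.degree_centrality(G)

# A table with one row per node, then map the scores onto the right rows
node_table = pd.DataFrame({"entity": list(G.nodes)})
node_table["degree_centrality"] = node_table["entity"].map(scores)
print(node_table)
```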


A simple model for this is below; just switch the metric name to the one you want.

In [None]:
# What if you want to know the scores for nodes within a specific community instead? 
# For all the figures within a specific community, which of them is the most important?
# The metrics calculated above consider the whole network, what if we want just a subset of the network?
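One way to approach this, sketched on a made-up graph with a hand-written `community` attribute: take the subgraph for one community and calculate the metric on that alone.

```python
import networkx as nx

# Hypothetical chain of five people with community labels already assigned
G = nx.Graph()
G.add_edges_from([("Ann", "Bob"), ("Bob", "Cat"), ("Cat", "Dan"), ("Dan", "Eve")])
membership = {"Ann": 0, "Bob": 0, "Cat": 0, "Dan": 1, "Eve": 1}
nx.set_node_attributes(G, membership, "community")

target = 0  # the community we are interested in

# Nodes in that community, and the subgraph containing only them
members = [n for n, c in G.nodes(data="community") if c == target]
sub = G.subgraph(members)

# The metric now only 'sees' the community, not the whole network
sub_scores = nx.betweenness_centrality(sub)
top = max(sub_scores, key=sub_scores.get)
print(top, sub_scores)
```

Across the whole chain Cat sits in the middle, but within community 0 it is Bob who stands between the others.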


# Egos!
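An ego network is one node (the 'ego'), its direct neighbours, and the edges between them. A minimal sketch with `nx.ego_graph` on an invented graph:

```python
import networkx as nx

# Hypothetical graph
G = nx.Graph()
G.add_edges_from([
    ("Ann", "Bob"), ("Ann", "Cat"), ("Bob", "Cat"),
    ("Cat", "Dan"), ("Dan", "Eve"),
])

# Ann's ego network: Ann, everyone one step away, and the edges among them
ego = nx.ego_graph(G, "Ann", radius=1)
print(sorted(ego.nodes), ego.number_of_edges())
```

Raising `radius` widens the neighbourhood to nodes two or more steps away.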

### Notes on Metrics
#### Degree Centrality
We know degree is the number of edges a node has. However, *Degree Centrality* is a slightly adjusted version of this that makes it easier to know if the degree is high or low. Degree centrality is simply a node's degree divided by the total number of possible connections it could have, i.e. the number of other nodes in the network (not including itself).
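To make the definition concrete, this sketch calculates degree / (n - 1) by hand on a toy graph and compares it with `nx.degree_centrality`.

```python
import networkx as nx

# Hypothetical graph of four people
G = nx.Graph()
G.add_edges_from([("Ann", "Bob"), ("Ann", "Cat"), ("Bob", "Cat"), ("Cat", "Dan")])

n = G.number_of_nodes()

# By hand: degree divided by the number of other nodes
manual = {node: degree / (n - 1) for node, degree in G.degree()}

# The built-in version
builtin = nx.degree_centrality(G)

for node in G.nodes:
    print(node, round(manual[node], 3), round(builtin[node], 3))
```

Cat is connected to all three other nodes, so scores 3 / 3; Dan has one connection out of a possible three.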

#### Betweenness Centrality
Indicates the extent to which a node lies on the shortest paths between other nodes. In our data this could indicate people central to the issue overall across the different stories, but it could also indicate people that connect a side issue to a larger whole. 

The unweighted version just considers whether there is an edge; the weighted version factors in how often co-occurrences happen.
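One thing to watch in NetworkX: `betweenness_centrality` uses the `weight` argument to find weighted shortest paths, so it reads the values as distances. A large co-occurrence count would therefore be treated as 'far apart'. The sketch below shows one common workaround, which is my assumption rather than something from the session: store `1 / weight` in a separate, hypothetically named `distance` attribute and pass that instead.

```python
import networkx as nx

# Hypothetical triangle: Bob co-occurs heavily with both Ann and Cat,
# while Ann and Cat only co-occur once
G = nx.Graph()
G.add_weighted_edges_from([("Ann", "Bob", 10), ("Bob", "Cat", 10), ("Ann", "Cat", 1)])

# Frequent co-occurrence = short distance
for u, v, w in G.edges(data="weight"):
    G[u][v]["distance"] = 1 / w

unweighted = nx.betweenness_centrality(G)
weighted = nx.betweenness_centrality(G, weight="distance")

print(unweighted)
print(weighted)
```

Unweighted, nobody stands between anyone in a triangle. Using the distances, the shortest route from Ann to Cat runs through Bob, so Bob picks up the betweenness.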
#### Eigenvector Centrality and PageRank
Indicates the importance of a node based on the importance of the nodes it is connected to. For us this could indicate our core people, but also individuals that may be important by virtue of their closeness to important people.
#### Closeness Centrality
How close is a node to the rest of the network? On average, how many steps would it take to get from a node to any other node in the network? A high closeness centrality indicates that a node is closer to all other nodes. It is often used to indicate access to other nodes, or information flow in a network. How easy would it be for you to be introduced to any other person in the university?

In our graph this indicates how connected a person is across a range of topics. If someone could easily 'step' from their position to any other person in the network in a small number of steps, they are likely an important figure in the issue.