## Structure
- Text to entities
- Loading into networkx
- Standard metrics and measures
- Communities
- Visualising with Gephi


# From Text to Network
- Last session we looked at how tv-shows and films could be transformed into relational data by thinking about them differently. 
- By representing them as characters co-occurring in scenes we were able to create a representation that showed us who the most central and critical characters were and, in some cases, demarcate storylines through the clustering of those characters.
- If you knew the franchise or series, the results would not have been a revelation, and this is a good thing.
- Even if you'd never heard of 'Friends', if you were shown that network you would know who the six main characters were. You would perhaps know who the most prominent side characters were, and so forth.
- Even if you knew the series well, it may show you patterns you didn't necessarily realise by consuming it directly.

Today we're going to build a dataset for a different series, the news. We'll identify the main characters, the side characters, the storylines and perhaps see things we wouldn't see otherwise.

In [1]:

import spacy
import pandas as pd

# Define the nlp model
nlp = spacy.load('en_core_web_sm')

# Load in our dataset
articles = pd.read_parquet('farright_dataset_cleaned.parquet')
articles.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 744 entries, 0 to 743
Data columns (total 17 columns):
 #   Column              Non-Null Count  Dtype              
---  ------              --------------  -----              
 0   id                  744 non-null    object             
 1   type                744 non-null    object             
 2   sectionId           744 non-null    object             
 3   sectionName         744 non-null    object             
 4   webPublicationDate  744 non-null    datetime64[ns, UTC]
 5   webTitle            744 non-null    object             
 6   webUrl              744 non-null    object             
 7   apiUrl              744 non-null    object             
 8   tags                744 non-null    object             
 9   isHosted            744 non-null    bool               
 10  pillarId            744 non-null    object             
 11  pillarName          744 non-null    object             
 12  byline              744 non-null    

In [2]:
# Take a sample of the first 100 articles - .copy() means any changes we make are completely separate from our original dataset
sample_articles = articles.head(100).copy()
sample_articles.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 17 columns):
 #   Column              Non-Null Count  Dtype              
---  ------              --------------  -----              
 0   id                  100 non-null    object             
 1   type                100 non-null    object             
 2   sectionId           100 non-null    object             
 3   sectionName         100 non-null    object             
 4   webPublicationDate  100 non-null    datetime64[ns, UTC]
 5   webTitle            100 non-null    object             
 6   webUrl              100 non-null    object             
 7   apiUrl              100 non-null    object             
 8   tags                100 non-null    object             
 9   isHosted            100 non-null    bool               
 10  pillarId            100 non-null    object             
 11  pillarName          100 non-null    object             
 12  byline              100 non-null    o

## Creating our 'Scenes'
In our TV/film data, the material had been subdivided into scenes, so we could ask "how often were these characters 'together'?". In our data we have articles, which are analogous to 'episodes', and paragraphs, which are analogous to scenes: subdivisions of the whole. In most writing, if two names are mentioned in a single paragraph it is probably because something relates them. Before we hunt for names, let's subdivide our data into paragraphs.

In [3]:
# We're going to make a column of paragraphs by splitting the cleaned text wherever there is a newline character \n
sample_articles['paragraph'] = sample_articles['cleaned_text'].str.split('\n')

# We can see what the first row of this column looks like as an example
sample_articles.loc[0,'paragraph']


["Senior Labour MPs and the UK's largest anti-fascist charity have called on Keir Starmer to mount a more heartfelt defence of diversity and anti-racism. They say they fear that Labour is not yet putting its “heart and soul” into the battle against Nigel Farage and the far right.",
 "Hope Not Hate's chief executive has written a letter to Starmer in the lead up to a planned far-right demonstration in London on Saturday, demanding the prime minister speak up more against hate and racism.",
 'In the letter, Nick Lowles said: “Hate breeds when those in power are silent. I implore you and other ministers to speak out urgently in defence of our migrant communities and our multicultural society more generally. It is imperative we all push back against the onslaught of racism we currently face because silence will only encourage our opponents more.”',
 "The challenge to No 10 speaks to many MPs' fear that the party vacated the political playing field over the summer months as Farage, asylum h

In [4]:
# Now we're going to make it so that every row is a single paragraph. The index will keep track of which article each paragraph relates to.
paragraph_data = sample_articles.explode('paragraph')

# and then we'll take just the paragraph column, give it a new index and keep the old one, calling it 'article_idx'
paragraph_data = paragraph_data[['paragraph']].reset_index(drop=False, names='article_idx')
paragraph_data

Unnamed: 0,article_idx,paragraph
0,0,Senior Labour MPs and the UK's largest anti-fa...
1,0,Hope Not Hate's chief executive has written a ...
2,0,"In the letter, Nick Lowles said: “Hate breeds ..."
3,0,The challenge to No 10 speaks to many MPs' fea...
4,0,MPs who have called for greater clarity in the...
...,...,...
2258,99,"At protests, Bolsonaristas wield placards clai..."
2259,99,"Sebastião Coelho, a retired Bolsonarista judge..."
2260,99,Officials in Donald Trump's administration hav...
2261,99,Those who have tracked Moraes's career doubt s...
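
The split, explode and reset steps above can be easier to follow on a tiny example. The sketch below uses two invented "articles" (the text and the `toy_` names are made up for illustration) and assumes a pandas version recent enough to support `reset_index(names=...)`, as the cell above does.

```python
import pandas as pd

# Two made-up "articles"; the text is invented purely for illustration.
toy_articles = pd.DataFrame({
    "cleaned_text": [
        "Alice met Bob.\nBob left early.",
        "Carol spoke alone.",
    ]
})

# Split each article into a list of paragraphs on the newline character.
toy_articles["paragraph"] = toy_articles["cleaned_text"].str.split("\n")

# One row per paragraph; the original article index is repeated for each paragraph.
toy_paragraphs = toy_articles.explode("paragraph")

# Keep only the paragraph column and turn the old index into 'article_idx'.
toy_paragraphs = toy_paragraphs[["paragraph"]].reset_index(names="article_idx")

print(toy_paragraphs)
```

The first article contributes two rows that both carry `article_idx` 0, which is how we can later trace any paragraph back to its article.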


## Getting our 'characters'
We are now going to check each paragraph in each story for the presence of person and organisation names. We refer to these as 'entities' and the process of finding them as... 


### Named Entity Recognition
A technique used in natural language processing to identify specific types of information in text, such as names, organisations, places, dates, etc. In the good old days, entities were identified by matching against great big lists of notable names, places etc. It was developed primarily to improve the ability of computers to answer questions more directly for things like search and building knowledge bases. 

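Today, statistical models such as the ones that ship with spaCy identify entities from the text itself, using the context and the linguistic relationships between the words. Note that the small model we loaded above, `en_core_web_sm`, is not transformer based; spaCy's transformer pipeline is `en_core_web_trf`.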
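
To get a feel for the older list-matching approach and why it falls short, here is a minimal sketch in plain Python. The `GAZETTEER` dictionary and the `lookup_entities` function are hypothetical names made up for this illustration, not part of spaCy.

```python
# A tiny hand-made "gazetteer"; a real one would hold many thousands of names.
GAZETTEER = {
    "Wes Streeting": "PERSON",
    "Donald Trump": "PERSON",
    "National Autistic Society": "ORG",
}

def lookup_entities(text, gazetteer):
    """Return (name, label) pairs for every gazetteer name found in the text."""
    return [(name, label) for name, label in gazetteer.items() if name in text]

sentence = "Wes Streeting and Mel Merritt of the National Autistic Society responded."
found = lookup_entities(sentence, GAZETTEER)
print(found)
```

Mel Merritt is missed because the name is not on the list. A model that reads the surrounding context has a chance of recognising names it has never seen before.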


In [5]:
# An example of different types of entities in a piece of text.

spacy_processed_text = nlp("""UK Health Secretary Wes Streeting said: "I trust doctors over President Trump, frankly, on this."
Mel Merritt, head of policy and campaigns at the National Autistic Society said: "This is dangerous, it's anti-science and it's irresponsible.
"President Donald Trump is peddling the worst myths of recent decades. Such dangerous pseudo-science is putting pregnant women and children at risk and devaluing autistic people.""")

# Source: https://www.bbc.co.uk/news/articles/cdx2rk10ep0o


spacy.displacy.render(spacy_processed_text, style='ent', jupyter=True)

In [6]:
# To extract the entities in a spacy processed document we access the .ents attribute

spacy_processed_text.ents

(UK,
 Health,
 Wes Streeting,
 Trump,
 Mel Merritt,
 the National Autistic Society,
 Donald Trump,
 recent decades)

In [7]:
# and each entity has its own attributes, .text and .label_
[(entity.text, entity.label_) for entity in spacy_processed_text.ents]

[('UK', 'GPE'),
 ('Health', 'ORG'),
 ('Wes Streeting', 'PERSON'),
 ('Trump', 'PERSON'),
 ('Mel Merritt', 'PERSON'),
 ('the National Autistic Society', 'ORG'),
 ('Donald Trump', 'PERSON'),
 ('recent decades', 'DATE')]

In [8]:
# Let's apply this to our paragraphs

# we define the entity types we want to keep - 
# (this could be a list but generally if you don't intend to change that list it is clearer to use a tuple)
KEEP_ENTS = ('PERSON','ORG')

# Our sample is only 100 articles but if you want to process larger amounts it's good to manage the pipe batch size.
#  Refer back to session 2 for more detail.
BATCH_SIZE = 150

# Our destination list. Each item will be a list of entities in a paragraph
paragraph_entity_lists = []

# For every paragraph in our paragraph_data['paragraph'] column, processed in a spacy pipe using 1 process, handling 150 items at a time
for para in nlp.pipe(paragraph_data['paragraph'], n_process=1, batch_size=BATCH_SIZE):

    # We create a list of all the paragraph's entities
    entities = [e for e in para.ents]

    # We filter that list to only contain entities we want
    entities = [e for e in entities if e.label_ in KEEP_ENTS]

    # We filter again to make sure the first letter of the entity is uppercase
    entities = [e.text for e in entities if e.text[0].isupper()]

    # Finally we make sure that there aren't any duplicates in our list (for example if a name was mentioned twice)
    # A set is like a list but every item must be unique. Passing a set to list() turns it back into a list again.
    entities = list(set(entities))

    # And send our list of entities to our destination list
    paragraph_entity_lists.append(entities)


# assert checks if a statement is true and throws an error if it is False
# Here we assert that our destination list is the same length as our number of paragraphs
assert len(paragraph_entity_lists) == len(paragraph_data)

# We then take that list and turn it into a column in our dataset
paragraph_data['entities'] = paragraph_entity_lists
paragraph_data



Unnamed: 0,article_idx,paragraph,entities
0,0,Senior Labour MPs and the UK's largest anti-fa...,"[Labour, Nigel Farage, Keir Starmer]"
1,0,Hope Not Hate's chief executive has written a ...,[Starmer]
2,0,"In the letter, Nick Lowles said: “Hate breeds ...",[Nick Lowles]
3,0,The challenge to No 10 speaks to many MPs' fea...,"[Labour, Lucy Powell, Bridget Phillipson]"
4,0,MPs who have called for greater clarity in the...,[]
...,...,...,...
2258,99,"At protests, Bolsonaristas wield placards clai...","[Francisco Antônio, Bolsonaristas, Belial]"
2259,99,"Sebastião Coelho, a retired Bolsonarista judge...","[Sebastião Coelho, Moraes, Alexandre de Moraes]"
2260,99,Officials in Donald Trump's administration hav...,"[Moraes, Donald Trump's, Jair Bolsonaro]"
2261,99,Those who have tracked Moraes's career doubt s...,"[Moraes, Bolsonaro]"


In [9]:
# Whilst it seemed a lot, without the comments it is quite a concise process.

KEEP_ENTS = ('PERSON','ORG')
BATCH_SIZE = 150

paragraph_entity_lists = []

for para in nlp.pipe(paragraph_data['paragraph'], n_process=1, batch_size=BATCH_SIZE): 
    entities = [e for e in para.ents]
    entities = [e for e in entities if e.label_ in KEEP_ENTS]
    entities = [e.text for e in entities if e.text[0].isupper()]
    entities = list(set(entities))
    paragraph_entity_lists.append(entities)

assert len(paragraph_entity_lists) == len(paragraph_data)
paragraph_data['entities'] = paragraph_entity_lists
paragraph_data

Unnamed: 0,article_idx,paragraph,entities
0,0,Senior Labour MPs and the UK's largest anti-fa...,"[Labour, Nigel Farage, Keir Starmer]"
1,0,Hope Not Hate's chief executive has written a ...,[Starmer]
2,0,"In the letter, Nick Lowles said: “Hate breeds ...",[Nick Lowles]
3,0,The challenge to No 10 speaks to many MPs' fea...,"[Labour, Lucy Powell, Bridget Phillipson]"
4,0,MPs who have called for greater clarity in the...,[]
...,...,...,...
2258,99,"At protests, Bolsonaristas wield placards clai...","[Francisco Antônio, Bolsonaristas, Belial]"
2259,99,"Sebastião Coelho, a retired Bolsonarista judge...","[Sebastião Coelho, Moraes, Alexandre de Moraes]"
2260,99,Officials in Donald Trump's administration hav...,"[Moraes, Donald Trump's, Jair Bolsonaro]"
2261,99,Those who have tracked Moraes's career doubt s...,"[Moraes, Bolsonaro]"
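
The output above contains entities such as "Donald Trump's", which would become a separate node from "Donald Trump". One possible clean-up, sketched here as an assumption rather than a step this notebook applies, is to strip a trailing possessive before de-duplicating. `strip_possessive` is a hypothetical helper written for this example.

```python
def strip_possessive(name):
    """Remove a trailing 's or ’s (straight or curly apostrophe) from a name."""
    for suffix in ("'s", "’s"):
        if name.endswith(suffix):
            return name[: -len(suffix)]
    return name

raw_entities = ["Donald Trump's", "Jair Bolsonaro", "Moraes", "Donald Trump"]

# Normalise each name, then de-duplicate with a set as before.
cleaned = sorted({strip_possessive(name) for name in raw_entities})
print(cleaned)
```

This does nothing about surname-only mentions ("Trump" versus "Donald Trump"), which need a different strategy.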


## From Entities to Network
We now have our list of paragraphs (scenes) and entities (characters). Finally we're going to make a network.
- A Node: Represents an entity in the paragraphs.
- An Edge: Represents a co-occurrence of two entities.

Node Attributes
- `n_paragraphs`: Number of paragraphs the entity occurs in.
- `n_articles`: Number of articles the entity occurs in.

Edge Attributes
- `'weight'`: Number of times the two entities co-occur in a paragraph
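
As a rough illustration of where these attributes live in NetworkX, the sketch below builds a three-node graph by hand. The names and numbers are invented; the real graph further down is built from an adjacency matrix instead.

```python
import networkx as nx

toy = nx.Graph()

# An edge with a 'weight' attribute: Alice and Bob share two paragraphs.
toy.add_edge("Alice", "Bob", weight=2)
toy.add_edge("Bob", "Carol", weight=1)

# Node attributes are supplied as {node: {attribute_name: value}}.
nx.set_node_attributes(toy, {
    "Alice": {"n_paragraphs": 2, "n_articles": 1},
    "Bob": {"n_paragraphs": 3, "n_articles": 2},
    "Carol": {"n_paragraphs": 1, "n_articles": 1},
})

print(toy.nodes["Bob"])
print(toy.edges["Alice", "Bob"])
```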

In [62]:
# We transform our data to be an entity per row. Again the index tracks which entities were originally together.
row_per_entity = paragraph_data.explode('entities')
row_per_entity

Unnamed: 0,article_idx,paragraph,entities
0,0,Senior Labour MPs and the UK's largest anti-fa...,Labour
0,0,Senior Labour MPs and the UK's largest anti-fa...,Nigel Farage
0,0,Senior Labour MPs and the UK's largest anti-fa...,Keir Starmer
1,0,Hope Not Hate's chief executive has written a ...,Starmer
2,0,"In the letter, Nick Lowles said: “Hate breeds ...",Nick Lowles
...,...,...,...
2260,99,Officials in Donald Trump's administration hav...,Donald Trump's
2260,99,Officials in Donald Trump's administration hav...,Jair Bolsonaro
2261,99,Those who have tracked Moraes's career doubt s...,Moraes
2261,99,Those who have tracked Moraes's career doubt s...,Bolsonaro


In [None]:
# Node attributes first
## If we group the above by entities then we can count the number of rows containing that entity, that's n_paragraphs.
# We can also count the number of unique article_idx numbers, that's n_articles

node_attributes = row_per_entity.groupby('entities').agg(
    n_paragraphs=('article_idx', 'count'),
    n_articles=('article_idx','nunique')
    )

# Finally we transform this into a dictionary, ensuring that the key is the entity name (orient='index')
# We'll use this dictionary of attributes later.
node_attr_dict = node_attributes.to_dict(orient='index')
node_attr_dict

In [68]:
# Next we're going to create an adjacency matrix. 
# This shows the structure of the network and takes care of the edge weight attribute

# First we turn our column of individual entities into a dummy matrix
# Each row represents an entity - for each row every column will be 0 apart from the column matching that entity's name which will be 1.
dummies = pd.get_dummies(row_per_entity['entities'], dtype=int)
dummies

Unnamed: 0,ABC,AFP,AI,AI Mode,AI Overviews,ANU,APR,AU Breaking News,AUF,Aamna,...,YouGov,YouTube,YouTubers,Younge,Yucca Nani,Yvette Cooper,Zack Polanski,Zakhar Volkov,Zarah Sultana,Zohran Mamdani
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2260,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2260,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2261,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2261,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [69]:
# To make it one row per paragraph, we group by the index (level=0) and add the rows together.
# Now each row shows 1 in the column if that entity is in the paragraph, otherwise 0.
dummies_per_para = dummies.groupby(level=0).sum()
dummies_per_para

Unnamed: 0,ABC,AFP,AI,AI Mode,AI Overviews,ANU,APR,AU Breaking News,AUF,Aamna,...,YouGov,YouTube,YouTubers,Younge,Yucca Nani,Yvette Cooper,Zack Polanski,Zakhar Volkov,Zarah Sultana,Zohran Mamdani
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2258,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2259,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2260,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2261,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [70]:
# Magic time
# This is called matrix multiplication. It is a key part of graph theory.
# All we need to know is that it is able to calculate how many times each of the items in our columns co-occur.

# An adjacency matrix has all the information needed to make a weighted network

adjacency_matrix = dummies_per_para.T.dot(dummies_per_para)
adjacency_matrix

Unnamed: 0,ABC,AFP,AI,AI Mode,AI Overviews,ANU,APR,AU Breaking News,AUF,Aamna,...,YouGov,YouTube,YouTubers,Younge,Yucca Nani,Yvette Cooper,Zack Polanski,Zakhar Volkov,Zarah Sultana,Zohran Mamdani
ABC,3,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
AFP,0,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
AI,0,0,2,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
AI Mode,0,0,0,1,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
AI Overviews,0,0,0,1,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Yvette Cooper,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,3,0,0,0,0
Zack Polanski,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
Zakhar Volkov,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
Zarah Sultana,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
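
To see what the matrix multiplication is doing, it can help to run it on a matrix small enough to check by hand. This sketch uses three invented paragraphs and three invented entities.

```python
import pandas as pd

# Rows are paragraphs, columns are entities; 1 means "this entity appears here".
toy_dummies = pd.DataFrame(
    {"Alice": [1, 1, 0], "Bob": [1, 1, 1], "Carol": [0, 0, 1]},
    index=["para_0", "para_1", "para_2"],
)

# Transpose-times-itself: cell (A, B) counts the paragraphs containing both A and B.
toy_adjacency = toy_dummies.T.dot(toy_dummies)
print(toy_adjacency)
```

Alice and Bob share two paragraphs, so their cell is 2. Alice and Carol never share one, so theirs is 0. The diagonal holds each entity's own paragraph count, which is why a graph built from this matrix starts out with self-loops.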


In [82]:
# Finally we can make our network using networkx

import networkx as nx

G = nx.from_pandas_adjacency(adjacency_matrix)
nx.set_node_attributes(G, node_attr_dict)

print(G.number_of_nodes(), G.number_of_edges())

# We can check if the network's attributes look right using...
# G.nodes(data=True)
# G.edges(data=True)

1353 3847


In [83]:

# Let's export and take a look in Gephi

# For now, just ignore these lines. They are filtering the graph, but we'll explain them later.
G.remove_edges_from(nx.selfloop_edges(G))
disconnected_nodes = [node for node, score in nx.degree(G) if score == 0]
G.remove_nodes_from(disconnected_nodes)

# This line exports the graph to a gexf file that is compatible with gephi
nx.write_gexf(G,'sample_farright.gexf')


# Exploring our Network

<img src="https://github.com/Minyall/sc207_290_public/blob/main/images/gephi_settings.png?raw=true" align="right" height=200>

When we first open our graph in Gephi, we'll get a mess of nodes and edges. The filters shown in the image are just one way of reducing the noise in the graph. In Gephi, filters are nested and should be read bottom to top, right to left. So reading in that order, the filters shown in the picture...
1. Retain only edges that have **edge weights** between 2 and 39. 39 is just the largest weight we have, so essentially it says the minimum weight is 2.
2. Then that remaining graph is passed to the next filter, which says each node must have a minimum **degree** of 2, i.e. a node must be connected to at least two other nodes. 
3. Then *that* graph is passed to the final filter, **Giant Component**. Components are those clusters of connected nodes. We have one big one in the middle, and lots of little ones floating around it. The giant component is the biggest one, so essentially it only retains nodes and edges that are part of that large cluster in the middle.

Gephi's filter UI is a little unintuitive, but it makes sense once you understand that every filter will have an impact on the next filter applied. Removing edges whose weight is too low might disconnect nodes, reducing their degree, and may disconnect them from the giant component.


#### Should I always filter like this?
No. It very much depends on your data, and what you want to know. With the attributes we gave our data we could also filter by number of paragraphs a person appeared in, or number of articles. Each measurement filters people in different ways.

- Degree - Number of connections - Number of unique entities a node co-occurred with.
- Edge weight - Number of times two entities co-occurred.
- n_paragraphs / n_articles - Number of paragraphs an entity appeared in or number of articles. Someone may occur many times in just one article.
- Giant component - Filters out any node not connected to the largest mass, but what if you have two relatively equal disconnected masses?

Each choice in your filtering, as well as the order you do the filtering, has an impact on any subsequent analysis (like community detection) and the final outcome. It is important to understand what each filter is doing in relation to *your data* and what it means for *your question*.
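
A small sketch of why the order matters, using an invented four-node graph. The helper names (`build_toy`, `drop_light_edges`, `drop_low_degree`) are hypothetical and exist only for this example. Applying the same two filters in opposite orders leaves different graphs.

```python
import networkx as nx

def build_toy():
    """Triangle A-B-C with one light edge, plus D hanging off A."""
    g = nx.Graph()
    g.add_weighted_edges_from([("A", "B", 2), ("B", "C", 2), ("C", "A", 1), ("A", "D", 2)])
    return g

def drop_light_edges(g, min_weight=2):
    # Build the full list first, then remove, so we never edit while iterating.
    g.remove_edges_from([(u, v) for u, v, w in g.edges(data="weight") if w < min_weight])

def drop_low_degree(g, min_degree=2):
    g.remove_nodes_from([n for n, d in g.degree if d < min_degree])

# Order 1: edge weight first, then degree.
weight_first = build_toy()
drop_light_edges(weight_first)
drop_low_degree(weight_first)

# Order 2: degree first, then edge weight.
degree_first = build_toy()
drop_low_degree(degree_first)
drop_light_edges(degree_first)

print(sorted(weight_first.nodes), weight_first.number_of_edges())
print(sorted(degree_first.nodes), degree_first.number_of_edges())
```

With weight first, removing the light C-A edge drops C to degree 1, so C is removed along with D. With degree first, C still has degree 2 when the degree filter runs, so it survives.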

## Filtering with NetworkX
Whilst you can filter with Gephi, you may find it useful to filter with networkx for various reasons. 
- Primarily it may be that having explored your data in Gephi, you want to lock in place a certain filtered version of your data. 
    - Solidifying your filtering into clear steps in your code makes it clear exactly what filtering took place, in what order, and ensures it is consistent if you need to run analysis again.
- You may want to use NetworkX to get exact figures on various metrics, and not want to shift over to Gephi to retrieve them.
- You may want to integrate the outputs of network analysis into a larger project - (more on this next week).


In [84]:

# We'll rebuild the graph again just so we're clear on what we're starting with.
G = nx.from_pandas_adjacency(adjacency_matrix)
nx.set_node_attributes(G, node_attr_dict)
print(G.number_of_nodes(), G.number_of_edges())

1353 3847


The two main filtering methods are `.remove_nodes_from` and `.remove_edges_from`. Each takes either a list of nodes to remove, or a list of edges. To filter the graph in some way, what differs is how you identify the nodes and/or edges to remove.

In [85]:
# Self loops - edges that connect nodes to themselves.
## The way we make our graph via an adjacency matrix will always produce self-loop edges.
## However, they should be removed for our analysis as it makes no sense to consider that Person A appears in the same paragraph as Person A.

# We identify the edges to remove using this networkx function
remove_edges = nx.selfloop_edges(G)

# and then remove them from our graph
G.remove_edges_from(remove_edges)

# this is just so we can see the changes
print(G.number_of_nodes(), G.number_of_edges())

1353 2494


In [86]:
# Removing based on edge weight
## For this one we need to build our list of edges by checking each edge to see if its weight attribute meets our criteria.

remove_edges = [(source, target) for source, target, attr_dict in G.edges(data=True) if attr_dict['weight'] < 2]

# and we remove the edges
G.remove_edges_from(remove_edges)

print(G.number_of_nodes(), G.number_of_edges())

# Note that removing edges will not remove nodes, even if they now have zero edges.

1353 259
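
If you do want to clear out the nodes left with no edges after an edge filter, one option is `nx.isolates`, which yields every node of degree zero. A minimal sketch on an invented three-node graph:

```python
import networkx as nx

toy = nx.Graph()
toy.add_weighted_edges_from([("A", "B", 3), ("B", "C", 1)])

# Drop the light edge; node C stays in the graph with no connections.
toy.remove_edges_from([(u, v) for u, v, w in toy.edges(data="weight") if w < 2])
nodes_after_edge_filter = toy.number_of_nodes()

# Collect the isolated nodes into a list before removing them.
isolated = list(nx.isolates(toy))
toy.remove_nodes_from(isolated)

print(nodes_after_edge_filter, isolated, toy.number_of_nodes())
```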


In [87]:
# Removing nodes by attribute
## Like above, we can check the attributes of nodes and remove those we don't want
### For example, filtering by n_paragraphs

remove_nodes = [node for node, attr_dict in G.nodes(data=True) if attr_dict['n_paragraphs'] < 2]

#  we call the appropriate method
G.remove_nodes_from(remove_nodes)

print(G.number_of_nodes(), G.number_of_edges())
# Note that removing nodes will remove edges if they lose either their source or target node.


386 259


In [88]:
# Remove nodes by degree
## Uses the .degree view, which yields (node, degree) pairs

remove_nodes = [node for node, degree in G.degree if degree < 2]

G.remove_nodes_from(remove_nodes)

print(G.number_of_nodes(), G.number_of_edges())
# Note we lose nodes AND edges: removing a node also removes every edge attached to it,
# so a node with only one connection takes that edge with it.


97 173


In [89]:
# Components
## Firstly we can check how many components there are in a graph.
nx.number_connected_components(G)

8

In [90]:
# Components
## The nx.connected_components function will produce a set of the nodes in each separate component

## We have to use list() to force it to actually produce something so we can examine it, as it is a generator - look it up.
all_components = list(nx.connected_components(G))

## Examining one of the components we see the nodes that form it
all_components[4]


{'Elizabeth Tsurkov', 'Hezbollah', "Kata'ib Hezbollah"}

In [91]:
# Components
## To get the biggest one use the built in max function
### key= tells it what to use as its criteria for determining biggest, here it is the number of nodes in each component.
largest_component = max(nx.connected_components(G), key=len) # This is now a set of nodes we want to KEEP

## In this case, rather than remove nodes, we say we want a graph that contains nodes
## from our largest component, and we want all the edges between them.
## For this we use .subgraph

G = G.subgraph(largest_component)
print(G.number_of_nodes(), G.number_of_edges())

83 162
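
One thing to be aware of: in recent NetworkX versions `.subgraph` returns a read-only view onto the original graph rather than an independent graph. If you plan to keep filtering afterwards, adding `.copy()` gives you something you can modify. A small sketch on an invented graph:

```python
import networkx as nx

toy = nx.Graph([("A", "B"), ("B", "C"), ("X", "Y")])
largest = max(nx.connected_components(toy), key=len)

# A subgraph view is frozen, so trying to edit it raises an error.
view = toy.subgraph(largest)
try:
    view.remove_node("A")
    view_is_editable = True
except nx.NetworkXError:
    view_is_editable = False

# A copy is an independent graph that can be changed freely.
editable = toy.subgraph(largest).copy()
editable.remove_node("A")

print(view_is_editable, sorted(editable.nodes), "A" in toy)
```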


## Summary: Applying our Gephi Filters in NetworkX
To recap, let's start again with a new graph, and apply the filters from our Gephi example.

<img src="https://github.com/Minyall/sc207_290_public/blob/main/images/gephi_settings.png?raw=true" height=200>




In [92]:
# Starting again, again!
G = nx.from_pandas_adjacency(adjacency_matrix)
nx.set_node_attributes(G, node_attr_dict)

# Remove self-loops - We do this here because it should always be standard for this kind of data.
G.remove_edges_from(nx.selfloop_edges(G))

print(G.number_of_nodes(), G.number_of_edges())

1353 2494


In [93]:
# Remove low weight edges
G.remove_edges_from([(source,target) for source, target, attr_dict in G.edges(data=True) if attr_dict['weight'] < 2])

# Remove low degree nodes (this will also clean up any nodes that were disconnected by the edge filter above)
G.remove_nodes_from([node for node, degree in G.degree if degree < 2])

# Get the giant component
largest_component = max(nx.connected_components(G), key=len)
G = G.subgraph(largest_component)

print(G.number_of_nodes(), G.number_of_edges())

83 162
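
If you expect to re-run this filtering with different thresholds, one option is to wrap the steps in a function. The sketch below is one possible arrangement, not the only one: `filter_graph` and its parameter names are hypothetical, and it assumes edges without a `weight` attribute should count as weight 1.

```python
import networkx as nx

def filter_graph(graph, min_weight=2, min_degree=2):
    """Return a filtered copy of the graph, leaving the input untouched."""
    g = graph.copy()
    # 1. Self-loops never make sense for co-occurrence data.
    g.remove_edges_from(list(nx.selfloop_edges(g)))
    # 2. Low weight edges (edges with no weight are treated as weight 1).
    g.remove_edges_from(
        [(u, v) for u, v, w in g.edges(data="weight", default=1) if w < min_weight]
    )
    # 3. Low degree nodes.
    g.remove_nodes_from([n for n, d in g.degree if d < min_degree])
    if g.number_of_nodes() == 0:
        return g
    # 4. Giant component, copied so the result can be modified later.
    largest = max(nx.connected_components(g), key=len)
    return g.subgraph(largest).copy()

toy = nx.Graph()
toy.add_weighted_edges_from([
    ("A", "B", 2), ("B", "C", 2), ("C", "A", 2),  # a solid triangle
    ("A", "D", 1),                                # a light edge
    ("A", "A", 5),                                # a self-loop
    ("E", "F", 2),                                # a separate pair
])

filtered = filter_graph(toy)
print(sorted(filtered.nodes), filtered.number_of_edges())
```

Keeping the thresholds as arguments records exactly which settings produced a given graph.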


## Community Detection
Whilst we can do community detection in Gephi, we can also do it in NetworkX. Why? 
- Because we may want those community assignments available to us here in Python without stepping over to Gephi.
- We may want to test different settings and see their effects.
- It helps reinforce the understanding that a Network is not just about the visual 'picture' but that it ultimately is a representation of relational data that we can interrogate.

In [95]:
import community

In [99]:
community_assignments = community.best_partition(G, random_state=42)
community_assignments = pd.DataFrame.from_dict(community_assignments, orient='index', columns=['community']).reset_index(names='entity')
community_assignments

Unnamed: 0,entity,community
0,Adm Almir,0
1,Alckmin,1
2,Alexandre de Moraes,1
3,Bayrou,2
4,Bolsonaro,1
...,...,...
78,Tor,10
79,Trump,6
80,Turning Point USA,6
81,Utah Valley University,6


In [100]:
community_assignments = community_assignments.merge(node_attributes, how='left', left_on='entity', right_on='entities')
community_assignments

Unnamed: 0,entity,community,n_paragraphs,n_articles
0,Adm Almir,0,3,3
1,Alckmin,1,2,2
2,Alexandre de Moraes,1,9,8
3,Bayrou,2,16,5
4,Bolsonaro,1,78,8
...,...,...,...,...
78,Tor,10,2,2
79,Trump,6,117,34
80,Turning Point USA,6,18,13
81,Utah Valley University,6,12,12


In [124]:
import plotly.express as px

sort_by = 'n_paragraphs'
# Named community_id rather than community so we don't overwrite the community module imported earlier
community_id = 1
top_n = 10

grouped = community_assignments.groupby('community')
to_plot = grouped.get_group(community_id).sort_values(sort_by, ascending=False).head(top_n)

fig = px.bar(data_frame=to_plot, y='entity', x=sort_by, title=f'Top {top_n} Entities in Community {community_id} by {sort_by}')
fig.update_yaxes(categoryorder='total ascending')


In [151]:
n_para_per_entity

entities
Kirk                175
Trump               124
Bolsonaro            90
EU                   77
Guardian             61
                   ... 
La Niña               1
El Niño               1
BOM                   1
Sam Levin             1
Sebastião Coelho      1
Name: count, Length: 1266, dtype: int64

In [148]:
membership = community_assignments.groupby('community').agg(members=('entity', list))
membership.loc[5,'members']

['Benjamin Netanyahu',
 'Bezalel Smotrich',
 'Charlie',
 'Charlie Kirk',
 'Congress',
 'Cop30',
 'Dartmouth University',
 'David Hogg',
 'Deseret News',
 'Donald Trump',
 'Donald Trump Jr.',
 'EU',
 'Emma Pitts',
 'Erika',
 'FBI',
 'Fox News',
 'García Pérez',
 'Gavin Newsom',
 'Hamas',
 'Hasan Piker',
 'Hogg',
 'House',
 'Itamar Ben-Gvir',
 'JD Vance',
 'Joe Biden',
 'Kirk',
 'Kristi Noem',
 'MSNBC',
 'Maga',
 'Matthew Dowd',
 'Mike Johnson',
 'Mike Lee',
 'New York Times',
 'Piker',
 'Putin',
 'Rush Limbaugh',
 'Steve Bannon',
 'Trump',
 'Turnberry',
 'Turning Point USA',
 'Twitch',
 'UN',
 'UVU',
 'Utah Valley University',
 'Vladimir Putin',
 'YouTube',
 'Zohran Mamdani']

In [5]:
edges = pd.DataFrame(edges)
dums = pd.get_dummies(edges.set_index('source')['target'], dtype=int)
dums = dums.groupby(level=0).sum()
matrix = dums.T.dot(dums)
matrix

Unnamed: 0,A&E-ORG,AARP-ORG,ABC Melbourne-ORG,ABC News-ORG,ABC Radio Melbourne-ORG,ABC Radio National-ORG,ABC-ORG,AC-ORG,ACA-ORG,ACCC-PER,...,Ábrego-ORG,Ábrego-PER,Álvaro Junior-PER,Álvaro López-PER,Ángel Martín-PER,Ángel Víctor Torres-PER,Çiğdem Toprak-PER,Édouard Philippe-PER,Éric Lombard-PER,Ó hAnnaidh-ORG
A&E-ORG,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
AARP-ORG,0,1,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
ABC Melbourne-ORG,0,0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
ABC News-ORG,0,0,0,6,0,0,3,0,0,0,...,0,0,0,0,0,0,0,0,0,0
ABC Radio Melbourne-ORG,0,0,0,0,2,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Ángel Víctor Torres-PER,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0
Çiğdem Toprak-PER,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
Édouard Philippe-PER,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
Éric Lombard-PER,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0


In [66]:
import networkx as nx

G = nx.from_pandas_adjacency(matrix)
print(G.number_of_nodes())
print(G.number_of_edges())

7924
177901


In [82]:

sub_G = nx.edge_subgraph(G, edges=[(source, target) for source, target, attr in G.edges(data=True) if attr['weight'] > 5])

In [83]:

print(sub_G.number_of_nodes())
print(sub_G.number_of_edges())

261
1194


In [84]:
import plotly.express as px
to_plot = pd.DataFrame()
to_plot['edge_weights'] = [attr['weight'] for source,target,attr in sub_G.edges(data=True)]
to_plot['edge_weights'].describe()

count    1194.000000
mean       13.130653
std        16.643433
min         6.000000
25%         7.000000
50%         9.000000
75%        13.000000
max       251.000000
Name: edge_weights, dtype: float64

In [93]:
pd.Series(nx.degree_centrality(sub_G)).sort_values(ascending=False)

Donald Trump-PER         0.446154
Guardian-ORG             0.373077
Trump-ORG                0.373077
Trump-PER                0.303846
Labour-ORG               0.226923
                           ...   
Margaret Thatcher-PER    0.007692
Southport-ORG            0.007692
X.-PER                   0.007692
Mark Zuckerberg-PER      0.007692
Islam-ORG                0.007692
Length: 261, dtype: float64
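
Degree centrality in NetworkX is a node's degree divided by the largest degree it could possibly have, which is the number of nodes minus one. That matches the figures in this notebook: the degree listing further down gives 'Donald Trump-PER' a degree of 116 in a graph of 261 nodes, and 116 / 260 is roughly 0.446154. A quick check of the formula on a small star-shaped graph:

```python
import networkx as nx

# Node 0 sits in the centre, connected to nodes 1 to 4.
star = nx.star_graph(4)
n = star.number_of_nodes()

centrality = nx.degree_centrality(star)

# The same figure worked out by hand: degree / (number of nodes - 1).
manual = {node: degree / (n - 1) for node, degree in star.degree}

print(centrality)
print(manual)
```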

In [95]:
pd.Series(nx.pagerank(sub_G)).sort_values(ascending=False)

Donald Trump-PER      0.051177
Guardian-ORG          0.042386
Trump-ORG             0.040943
Trump-PER             0.032746
Labour-ORG            0.022327
                        ...   
Steve Bannon-PER      0.001207
Carlos Alcaraz-PER    0.001206
Fed-ORG               0.001202
Turnberry-PER         0.001161
Oscar Piastri-PER     0.001153
Length: 261, dtype: float64

In [100]:
for component in nx.connected_components(G):
    print(len(component))

7819
10
18
2
20
8
8
21
2
7
3
4
2


In [None]:
sorted(sub_G.degree, reverse=True, key=lambda x: x[1])

[('Donald Trump-PER', 116),
 ('Trump-ORG', 97),
 ('Guardian-ORG', 97),
 ('Trump-PER', 79),
 ('Labour-ORG', 59),
 ('Keir Starmer-PER', 49),
 ('Nigel Farage-PER', 43),
 ('Reform-ORG', 32),
 ('Vladimir Putin-PER', 32),
 ('AU Breaking News-ORG', 31),
 ('Starmer-PER', 31),
 ('UN-ORG', 31),
 ('Benjamin Netanyahu-PER', 29),
 ('Hamas-ORG', 27),
 ('Netanyahu-PER', 26),
 ('Times-ORG', 23),
 ('SecureDrop-ORG', 23),
 ('EU-ORG', 22),
 ('Tor-ORG', 22),
 ('Anthony Albanese-PER', 21),
 ('Upside-ORG', 21),
 ('Mirror-ORG', 21),
 ('Yvette Cooper-PER', 19),
 ('Emmanuel Macron-PER', 19),
 ('Guardian Australia-ORG', 19),
 ('Farage-ORG', 18),
 ('Telegraph-ORG', 18),
 ('Starmer-ORG', 18),
 ('Putin-PER', 18),
 ('Labour-PER', 17),
 ('Express-ORG', 17),
 ('Angela Rayner-PER', 17),
 ('David Lammy-PER', 16),
 ('Rachel Reeves-PER', 15),
 ('JD Vance-PER', 15),
 ('Steve Witkoff-PER', 14),
 ('Palestine Action-ORG', 14),
 ('Joe Biden-PER', 14),
 ('Mail-ORG', 14),
 ('Kyiv-PER', 13),
 ('Macron-PER', 13),
 ('Maga-PER', 13

In [24]:
len(people[0])
len(table.loc[0,'person_entities'])

21

In [26]:
table.loc[0,'person_entities']

['Nigel Farage',
 'Michael Gove',
 'Jacob Rees-Mogg',
 'David Cameron',
 'Zia Yusuf',
 'David Starkey',
 'Donald Trump',
 'Sean Hannity',
 'Stani Kulechov',
 'Lord Christopher Monckton',
 'Linden Kemkaran',
 'Gawain Towler',
 'Jo Cox',
 'Anand Menon',
 'Ben Habib',
 'Howard Cox',
 'Rupert Lowe',
 'James McMurdock',
 'Lee Anderson',
 'Richard Tice',
 'Sarah Pochin']

In [25]:
people[0]

['Nigel Farage’s',
 'Michael Gove',
 'Jacob Rees-Mogg',
 'Gove',
 'David Cameron’s',
 'Zia Yusuf',
 'Rees-Mogg',
 'David Starkey',
 'Farage',
 'Donald Trump’s',
 'Sean Hannity',
 'Trump',
 'Labour',
 'Heathrow',
 'Stani Kulechov',
 'Aave Labs',
 'Christopher Monckton',
 'Hannah White',
 'Linden Kemkaran',
 'Gawain Towler',
 'Towler',
 'Towler',
 'Jo Cox',
 'Anand Menon',
 'Ben Habib',
 'Howard Cox',
 'MPs Rupert Lowe',
 'James McMurdock',
 'Lee Anderson',
 'Richard Tice',
 'Sarah Pochin']