# Creating a feature matrix from a networkx graph

In this notebook we will look at a few ways to quickly create a feature matrix from a networkx graph.

In [1]:
!pip install networkx==1.11

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting networkx==1.11
  Downloading networkx-1.11-py2.py3-none-any.whl (1.3 MB)
Installing collected packages: networkx
  Attempting uninstall: networkx
    Found existing installation: networkx 2.6.3
    Uninstalling networkx-2.6.3:
      Successfully uninstalled networkx-2.6.3
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
scikit-image 0.18.3 requires networkx>=2.0, but you have networkx 1.11 which is incompatible.
Successfully installed networkx-1.11


In [2]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [3]:
import networkx as nx
import pandas as pd

G = nx.read_gpickle('/content/drive/MyDrive/major_us_cities')

## Node based features

In [4]:
G.nodes(data=True)

[('El Paso, TX', {'population': 674433, 'location': (-106, 31)}),
 ('Long Beach, CA', {'population': 469428, 'location': (-118, 33)}),
 ('Dallas, TX', {'population': 1257676, 'location': (-96, 32)}),
 ('Oakland, CA', {'population': 406253, 'location': (-122, 37)}),
 ('Albuquerque, NM', {'population': 556495, 'location': (-106, 35)}),
 ('Baltimore, MD', {'population': 622104, 'location': (-76, 39)}),
 ('Raleigh, NC', {'population': 431746, 'location': (-78, 35)}),
 ('Mesa, AZ', {'population': 457587, 'location': (-111, 33)}),
 ('Arlington, TX', {'population': 379577, 'location': (-97, 32)}),
 ('Sacramento, CA', {'population': 479686, 'location': (-121, 38)}),
 ('Wichita, KS', {'population': 386552, 'location': (-97, 37)}),
 ('Tucson, AZ', {'population': 526116, 'location': (-110, 32)}),
 ('Cleveland, OH', {'population': 390113, 'location': (-81, 41)}),
 ('Louisville/Jefferson County, KY',
  {'population': 609893, 'location': (-85, 38)}),
 ('San Jose, CA', {'population': 998537, 'locatio

In [5]:
# Initialize the dataframe, using the nodes as the index
df = pd.DataFrame(index=G.nodes())
df.head()

"El Paso, TX"
"Long Beach, CA"
"Dallas, TX"
"Oakland, CA"
"Albuquerque, NM"


### Extracting attributes

Using `nx.get_node_attributes`, it's easy to extract the node attributes in the graph into DataFrame columns.

In [6]:
df['location'] = pd.Series(nx.get_node_attributes(G, 'location'))
df['population'] = pd.Series(nx.get_node_attributes(G, 'population'))
df.head()

| | location | population |
|---|---|---|
| El Paso, TX | (-106, 31) | 674433 |
| Long Beach, CA | (-118, 33) | 469428 |
| Dallas, TX | (-96, 32) | 1257676 |
| Oakland, CA | (-122, 37) | 406253 |
| Albuquerque, NM | (-106, 35) | 556495 |

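To make the mechanics concrete: `nx.get_node_attributes` returns a plain dict keyed by node, and wrapping it in `pd.Series` lets pandas align the values with the DataFrame index by label. Here is a minimal sketch on a made-up three-node graph. The node names `A`, `B`, `C` and their values are invented for illustration, and reading the `location` tuple as (longitude, latitude) is an assumption, not something the data states.

```python
import networkx as nx
import pandas as pd

# Toy graph with invented nodes and attribute values (not the cities data).
toy = nx.Graph()
toy.add_node('A', population=100, location=(-100, 30))
toy.add_node('B', population=200, location=(-101, 31))
toy.add_node('C', population=300, location=(-102, 32))

# A plain dict mapping node -> attribute value.
pop = nx.get_node_attributes(toy, 'population')

# list(...) keeps this working whether nodes() returns a list or a view.
node_df = pd.DataFrame(index=list(toy.nodes()))
# Assignment aligns on the index labels, not on position.
node_df['population'] = pd.Series(pop)

# Optionally split the coordinate tuple into two numeric columns
# (assumed order: longitude first, latitude second).
loc = nx.get_node_attributes(toy, 'location')
node_df['lon'] = pd.Series({n: xy[0] for n, xy in loc.items()})
node_df['lat'] = pd.Series({n: xy[1] for n, xy in loc.items()})

print(node_df)
```

Splitting the tuple is optional, but most estimators expect numeric columns rather than tuple-valued ones.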

### Creating node based features

Most of the networkx functions that compute per-node values return a dictionary keyed by node, which can be added to our dataframe just as easily.

In [7]:
df['clustering'] = pd.Series(nx.clustering(G))
df['degree'] = pd.Series(G.degree())
df.head()

| | location | population | clustering | degree |
|---|---|---|---|---|
| El Paso, TX | (-106, 31) | 674433 | 0.7 | 5 |
| Long Beach, CA | (-118, 33) | 469428 | 0.745455 | 11 |
| Dallas, TX | (-96, 32) | 1257676 | 0.763636 | 11 |
| Oakland, CA | (-122, 37) | 406253 | 1.0 | 8 |
| Albuquerque, NM | (-106, 35) | 556495 | 0.52381 | 7 |

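One caveat about the cell above: it relies on `G.degree()` returning a dict, which is the networkx 1.x behaviour installed in this notebook. From networkx 2.0 onward it returns a view of (node, degree) pairs instead, so wrapping the call in `dict(...)` is one way to keep the same pattern working on either version. A small sketch on an invented toy graph (a triangle plus one pendant node):

```python
import networkx as nx
import pandas as pd

# Invented toy graph: a triangle A-B-C plus a pendant node D attached to C.
toy = nx.Graph()
toy.add_edges_from([('A', 'B'), ('B', 'C'), ('A', 'C'), ('C', 'D')])

feat = pd.DataFrame(index=list(toy.nodes()))

# nx.clustering returns a dict keyed by node.
feat['clustering'] = pd.Series(nx.clustering(toy))

# dict(...) leaves a 1.x dict unchanged and converts the degree view on 2.x+.
feat['degree'] = pd.Series(dict(toy.degree()))

print(feat)
```

In this toy graph, `A` and `B` sit in a closed triangle, so their clustering coefficient is 1, while `D` has a single neighbor and gets 0.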

## Edge based features

In [8]:
G.edges(data=True)[:10]

[('El Paso, TX', 'Albuquerque, NM', {'weight': 367.88584356108345}),
 ('El Paso, TX', 'Mesa, AZ', {'weight': 536.256659972679}),
 ('El Paso, TX', 'Tucson, AZ', {'weight': 425.41386739988224}),
 ('El Paso, TX', 'Phoenix, AZ', {'weight': 558.7835703774161}),
 ('El Paso, TX', 'Colorado Springs, CO', {'weight': 797.7517116740046}),
 ('Long Beach, CA', 'Oakland, CA', {'weight': 579.5829987228403}),
 ('Long Beach, CA', 'Mesa, AZ', {'weight': 590.156204210031}),
 ('Long Beach, CA', 'Sacramento, CA', {'weight': 611.0649790490104}),
 ('Long Beach, CA', 'Tucson, AZ', {'weight': 698.6566667728368}),
 ('Long Beach, CA', 'San Jose, CA', {'weight': 518.2330606219175})]

In [9]:
# Initialize the dataframe, using the edges as the index
df = pd.DataFrame(index=G.edges())
df.head()

"(El Paso, TX, Albuquerque, NM)"
"(El Paso, TX, Mesa, AZ)"
"(El Paso, TX, Tucson, AZ)"
"(El Paso, TX, Phoenix, AZ)"
"(El Paso, TX, Colorado Springs, CO)"


### Extracting attributes

Using `nx.get_edge_attributes`, it's easy to extract the edge attributes in the graph into DataFrame columns.

In [10]:
df['weight'] = pd.Series(nx.get_edge_attributes(G, 'weight'))
df.head()

| | weight |
|---|---|
| (El Paso, TX, Albuquerque, NM) | 367.885844 |
| (El Paso, TX, Mesa, AZ) | 536.25666 |
| (El Paso, TX, Tucson, AZ) | 425.413867 |
| (El Paso, TX, Phoenix, AZ) | 558.78357 |
| (El Paso, TX, Colorado Springs, CO) | 797.751712 |

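The edge case works the same way as the node case, except that `nx.get_edge_attributes` returns a dict keyed by `(u, v)` tuples, which pandas turns into a two-level index. The sketch below uses an invented two-edge graph and builds the DataFrame directly from the Series instead of assigning into a pre-built frame; that is a simplification for illustration, not the approach used in the cell above.

```python
import networkx as nx
import pandas as pd

# Invented toy graph with a 'weight' attribute on each edge.
toy = nx.Graph()
toy.add_edge('A', 'B', weight=1.5)
toy.add_edge('B', 'C', weight=2.0)

# Dict keyed by (u, v) tuples.
weights = nx.get_edge_attributes(toy, 'weight')

# The tuple keys become a two-level (MultiIndex) row index.
edge_df = pd.DataFrame({'weight': pd.Series(weights)})

print(edge_df)
# A single edge can then be looked up by its (u, v) pair.
print(edge_df.loc[('A', 'B'), 'weight'])
```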

### Creating edge based features

Many of the networkx functions related to edges return nested data structures. We can extract the relevant data using a list comprehension.

In [11]:
df['preferential attachment'] = [i[2] for i in nx.preferential_attachment(G, df.index)]
df.head()

| | weight | preferential attachment |
|---|---|---|
| (El Paso, TX, Albuquerque, NM) | 367.885844 | 35 |
| (El Paso, TX, Mesa, AZ) | 536.25666 | 40 |
| (El Paso, TX, Tucson, AZ) | 425.413867 | 40 |
| (El Paso, TX, Phoenix, AZ) | 558.78357 | 45 |
| (El Paso, TX, Colorado Springs, CO) | 797.751712 | 30 |

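As a sanity check on these numbers: the preferential attachment score of a pair is the product of the two node degrees. That matches the first row, since El Paso has degree 5 and Albuquerque has degree 7 in the node table, giving 35. The sketch below shows the same thing on an invented toy graph, unpacking each `(u, v, score)` triple by name instead of indexing with `i[2]`.

```python
import networkx as nx

# Invented toy graph: a triangle A-B-C plus a pendant node D attached to C.
toy = nx.Graph()
toy.add_edges_from([('A', 'B'), ('B', 'C'), ('A', 'C'), ('C', 'D')])

pairs = list(toy.edges())

# Each yielded item is a (u, v, score) triple; tuple unpacking names the parts.
pa = {(u, v): score for u, v, score in nx.preferential_attachment(toy, pairs)}

# The score is deg(u) * deg(v); dict(...) works on networkx 1.x and 2.x+.
deg = dict(toy.degree())
matches = all(pa[(u, v)] == deg[u] * deg[v] for u, v in pairs)

print(pa)
print(matches)
```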

In the case where the function expects two nodes to be passed in, we can map a lambda function over the index.

In [12]:
df['Common Neighbors'] = df.index.map(lambda city: len(list(nx.common_neighbors(G, city[0], city[1]))))
df.head()

| | weight | preferential attachment | Common Neighbors |
|---|---|---|---|
| (El Paso, TX, Albuquerque, NM) | 367.885844 | 35 | 4 |
| (El Paso, TX, Mesa, AZ) | 536.25666 | 40 | 3 |
| (El Paso, TX, Tucson, AZ) | 425.413867 | 40 | 3 |
| (El Paso, TX, Phoenix, AZ) | 558.78357 | 45 | 3 |
| (El Paso, TX, Colorado Springs, CO) | 797.751712 | 30 | 1 |

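An equivalent way to write the same step is a list comprehension that unpacks each `(u, v)` pair from the index, which some readers may find easier to follow than the lambda. This is a sketch on an invented toy graph; the column name `common_neighbors` and the index level names `u` and `v` are illustrative choices, not part of the notebook above.

```python
import networkx as nx
import pandas as pd

# Invented toy graph: a triangle A-B-C plus a pendant node D attached to C.
toy = nx.Graph()
toy.add_edges_from([('A', 'B'), ('B', 'C'), ('A', 'C'), ('C', 'D')])

edges = list(toy.edges())
cn_df = pd.DataFrame(index=pd.MultiIndex.from_tuples(edges, names=['u', 'v']))

# Iterating the index yields (u, v) tuples, which unpack into two arguments.
# list(...) handles both the generator and set return types across versions.
cn_df['common_neighbors'] = [
    len(list(nx.common_neighbors(toy, u, v))) for u, v in cn_df.index
]

print(cn_df)
```

Each edge of the triangle shares exactly one neighbor (the third corner), while the pendant edge shares none.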