# NetworkX - Easy Graph Analytics

NetworkX is the most popular library for graph analytics available in Python, or quite possibly any language. To illustrate this, NetworkX was downloaded more than 71 million times in September of 2024 alone, which is roughly 71 times more than the next most popular graph analytics library! [*](https://en.wikipedia.org/wiki/NetworkX) NetworkX has earned this popularity through its very easy-to-use API, the wealth of documentation and examples available, the large (and friendly) community behind it, and its easy installation, which requires nothing more than Python.

However, NetworkX users are familiar with the tradeoff that comes with those benefits. The pure-Python implementation often results in poor performance when graph data starts to reach larger scales, limiting the usefulness of the library for many real-world problems.

# Accelerated NetworkX - Easy (and fast!) Graph Analytics

To address the performance problem, NetworkX 3.0 introduced a mechanism to dispatch algorithm calls to alternate implementations. The NetworkX Python API remains the same, but NetworkX can use faster algorithm implementations provided by one or more backends. This approach means users don't have to give up NetworkX, or even change their code, to take advantage of GPU performance.

# Let's Get the Environment Set Up
This notebook will demonstrate NetworkX both with and without GPU acceleration provided by the `nx-cugraph` backend.

`nx-cugraph` is available as a package installable using `pip`, `conda`, and [from source](https://github.com/rapidsai/nx-cugraph). Before importing `networkx`, let's install `nx-cugraph` so that NetworkX can register it as an available backend when needed. We'll use `pip` to install.

NOTES:
* `nx-cugraph` requires a compatible NVIDIA GPU, NVIDIA CUDA and associated drivers, and a supported OS. Details about these and other installation prerequisites can be seen [here](https://docs.rapids.ai/install#system-req).
* The `nx-cugraph` package is currently hosted by NVIDIA and therefore the `--extra-index-url` option must be used.
* `nx-cugraph` is supported on specific 11.x and 12.x CUDA versions, and the major version number must be known in order to install the correct build (this is determined automatically when using `conda`).

To find the CUDA major version on your system, run the following command:

In [1]:
!nvcc --version

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Wed_Nov_22_10:17:15_PST_2023
Cuda compilation tools, release 12.3, V12.3.107
Build cuda_12.3.r12.3/compiler.33567101_0


From the above output we can see we're using CUDA 12.x so we'll be installing `nx-cugraph-cu12`. If we were using CUDA 11.x, the package name would be `nx-cugraph-cu11`. We'll also be adding `https://pypi.nvidia.com` as an `--extra-index-url`:

In [2]:
!pip install nx-cugraph-cu12 --extra-index-url=https://pypi.nvidia.com

Looking in indexes: https://pypi.org/simple, https://pypi.nvidia.com
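
The package choice described above can be sketched as a small helper. This is only an illustration: the function names `cuda_major_from_nvcc` and `nx_cugraph_package` are hypothetical, and the sketch assumes the `nvcc --version` output contains a `release X.Y` string like the one shown earlier.

```python
import re


def cuda_major_from_nvcc(nvcc_output: str) -> int:
    """Extract the CUDA major version from `nvcc --version` text (hypothetical helper)."""
    match = re.search(r"release (\d+)\.(\d+)", nvcc_output)
    if match is None:
        raise ValueError("no 'release X.Y' string found in nvcc output")
    return int(match.group(1))


def nx_cugraph_package(cuda_major: int) -> str:
    """Map a CUDA major version to the pip package name used in this notebook."""
    # Only the 11.x and 12.x package names are mentioned in this notebook.
    if cuda_major not in (11, 12):
        raise ValueError(f"no nx-cugraph package named here for CUDA {cuda_major}.x")
    return f"nx-cugraph-cu{cuda_major}"


# One line copied from the nvcc output shown earlier in this notebook.
sample = "Cuda compilation tools, release 12.3, V12.3.107"
package = nx_cugraph_package(cuda_major_from_nvcc(sample))
print(package)  # nx-cugraph-cu12
```

In a notebook, the text passed to the helper would come from running `nvcc --version` yourself; the sample string here just avoids depending on a local CUDA install.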


Of course, we'll also be using `networkx`, which is already provided in the Colab environment. This notebook will be using features added in version 3.3, so we'll import it here to verify we have a compatible version.

In [3]:
import networkx as nx
nx.__version__

'3.3'
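
If you would rather check the version programmatically than read it off the output, a minimal sketch follows. The helper `version_at_least` is hypothetical (it is not part of NetworkX) and compares only the major and minor components, so pre-release suffixes are ignored.

```python
import re


def version_at_least(version: str, minimum: tuple) -> bool:
    """Return True if the major.minor part of `version` is >= `minimum`."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return (int(match.group(1)), int(match.group(2))) >= minimum


# Comparing integer tuples avoids the string-comparison trap where "3.10" < "3.3".
print(version_at_least("3.3", (3, 3)))    # True
print(version_at_least("3.2.1", (3, 3)))  # False
print(version_at_least("3.10", (3, 3)))   # True
```

In the notebook you would pass `nx.__version__` as the first argument.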

In [4]:
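# Try the cugraph backend first for every algorithm it supports.
nx.config.backend_priority = ["cugraph"]  # NETWORKX_BACKEND_PRIORITY=cugraph
# Cache the converted GPU graph so later algorithm calls can reuse it.
nx.config.cache_converted_graphs = True   # NETWORKX_CACHE_CONVERTED_GRAPHS=True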
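
The trailing comments in the cell above name the equivalent environment variables. A minimal sketch of that alternative follows. It assumes the variables have to be set before `networkx` is first imported, so that they are picked up as the initial configuration; treat that ordering as an assumption to verify against the NetworkX documentation for your version.

```python
import os

# Assumption: NetworkX reads these variables when it is first imported,
# so they are set here before any `import networkx` statement runs.
os.environ["NETWORKX_BACKEND_PRIORITY"] = "cugraph"
os.environ["NETWORKX_CACHE_CONVERTED_GRAPHS"] = "True"

# import networkx as nx  # import only after the variables are set
```

Setting the variables in the shell that launches Jupyter should have the same effect and keeps the notebook code unchanged.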

In [5]:
import warnings
# Silence the notice emitted each time the cached cugraph graph is reused.
warnings.filterwarnings("ignore", message="Using cached graph for 'cugraph' backend")

# Working with Big Data

We will use the following two files. The commented-out commands below show where to download them and how to unpack them.

In [6]:
# !wget -P ../data/ https://s3.amazonaws.com/data.patentsview.org/download/g_us_patent_citation.tsv.zip
# !unzip -d ../data/ ../data/g_us_patent_citation.tsv.zip 
# !wget -P ../data/ https://s3.amazonaws.com/data.patentsview.org/download/g_patent.tsv.zip
# !unzip -d ../data/ ../data/g_patent.tsv.zip

We will do all dataframe work with `cudf.pandas`, so dataframe operations are GPU-accelerated where possible and run very fast.

In [7]:
%load_ext cudf.pandas
import pandas as pd

This loads the US patent citation data. It contains one edge for each relationship in which a patent cites another patent.

In [8]:
# load the citation graph
citation_df = pd.read_csv("../data/g_us_patent_citation.tsv",
                sep='\t',
                header=0,
                usecols=[0,2],
                names=["source", "target"],
                dtype={"source":str,"target":str},
)

Because the dataframe uses pandas accelerated with cuDF, accessing it is fast.

In [9]:
len(citation_df)

142183260

This will take a few minutes. NetworkX builds the 142-million-edge graph on the CPU. This is a necessary one-time overhead: the graph is later converted into a GPU-resident cuGraph graph that is reused by each algorithm we call, which accelerates those algorithms dramatically.

In [10]:
%%time
G = nx.from_pandas_edgelist(citation_df)

CPU times: user 4min 44s, sys: 6.68 s, total: 4min 50s
Wall time: 4min 47s


This also runs on the CPU, so it takes longer than counting the rows of the dataframe above.

In [11]:
G.number_of_edges()

141943194
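
The edge count is slightly lower than the row count of the dataframe. One likely reason is that `nx.from_pandas_edgelist` builds an undirected `nx.Graph` by default, so repeated rows and reciprocal citations collapse into a single edge. The toy rows below are made up for illustration and use plain Python sets rather than NetworkX.

```python
# Hypothetical citation rows: (source, target).
rows = [
    ("A", "B"),
    ("A", "B"),  # exact duplicate of the first row
    ("B", "A"),  # reciprocal citation, the same edge once direction is ignored
    ("A", "C"),
    ("C", "D"),
]

# An undirected simple graph stores each unordered pair at most once.
undirected_edges = {frozenset(pair) for pair in rows}

print(len(rows), len(undirected_edges))  # 5 3
```

If direction matters for your analysis, passing `create_using=nx.DiGraph` to `nx.from_pandas_edgelist` keeps reciprocal citations as separate edges.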

The first algorithm call includes the overhead of building the GPU-resident cuGraph graph.
Subsequent algorithms will be faster.

In [12]:
pr_results = nx.pagerank(G, backend="cugraph")

Running the same algorithm again, this time with the converted graph cached, takes only seconds.

In [13]:
nx.pagerank(G, backend="cugraph")

{'10000000': 7.313559505163084e-08,
 '5093563': 5.672378921474714e-07,
 '5751830': 3.943318643203387e-07,
 '10000001': 7.840154545278512e-08,
 '7804268': 6.738226001864459e-08,
 '9022767': 5.503292658992164e-08,
 '9090016': 1.1375108300625524e-07,
 '9108352': 1.056953182369159e-07,
 '9296144': 4.626659602349458e-08,
 '9566732': 7.183564998684646e-08,
 '10000002': 3.592771622762587e-08,
 '4617207': 1.911614734930973e-07,
 '5094793': 4.5163486148169046e-07,
 '8124241': 8.391583842043129e-08,
 '10000003': 9.281761764877843e-08,
 '4342799': 1.6568962571734793e-07,
 '6071370': 1.4243191054291903e-07,
 '8147232': 2.1866481833761183e-07,
 '9352506': 4.4439270773473865e-08,
 '10000004': 2.2580611087692385e-08,
 '5632133': 1.2990069501171677e-07,
 '6726363': 1.1521717177287244e-07,
 '10000005': 5.3599538575265816e-08,
 '1919649': 4.035542809905488e-08,
 '2759217': 1.2518069368049342e-07,
 '3287765': 5.425677063483607e-08,
 '4674972': 1.8888564546127942e-07,
 '5254296': 1.3346862887565e-07,
 '58
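
For intuition about the scores above, here is a minimal pure-Python power-iteration sketch of PageRank on a toy directed graph. It is not the cuGraph or NetworkX implementation. The function, its parameters, and the four-node graph are illustrative assumptions; the damping factor 0.85 matches the NetworkX default.

```python
def pagerank(out_links, damping=0.85, tol=1e-9, max_iter=200):
    """Toy PageRank by power iteration over a dict {node: [cited nodes]}."""
    nodes = sorted(set(out_links) | {t for ts in out_links.values() for t in ts})
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(max_iter):
        # Rank held by nodes with no out-links is spread evenly over all nodes.
        dangling = sum(rank[node] for node in nodes if not out_links.get(node))
        new_rank = {node: (1.0 - damping) / n + damping * dangling / n for node in nodes}
        # Every node passes an equal share of its rank to each node it cites.
        for node, targets in out_links.items():
            for target in targets:
                new_rank[target] += damping * rank[node] / len(targets)
        delta = sum(abs(new_rank[node] - rank[node]) for node in nodes)
        rank = new_rank
        if delta < tol:
            break
    return rank


# Hypothetical citations: P1 and P2 cite P3, and P3 cites P4.
citations = {"P1": ["P3"], "P2": ["P3"], "P3": ["P4"], "P4": []}
scores = pagerank(citations)
best = max(scores, key=scores.get)
print(best, round(sum(scores.values()), 6))
```

The scores sum to 1, which is why individual values in a graph with millions of patents are on the order of 1e-7.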

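This call sorts the patents by PageRank score in descending order and selects one near the top. Note that the slice `[2:3]` picks the entry at index 2, that is, the patent with the third-highest score; `[0:1]` would pick the single highest-scoring patent.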

In [14]:
mip = sorted(pr_results.items(), key=lambda x: x[1], reverse=True)[2:3][0]
most_important_patent = mip[0]
mip[0]

'D732697'
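
Sorting every score works, but when only the top few entries are needed, `heapq.nlargest` from the standard library avoids a full sort. The sketch below uses a made-up score dictionary in place of `pr_results`.

```python
import heapq

# Hypothetical PageRank-style scores keyed by patent ID.
scores = {"P1": 0.10, "P2": 0.40, "P3": 0.25, "P4": 0.05, "P5": 0.20}

# Keep only the three highest-scoring (id, score) pairs, in descending order.
top3 = heapq.nlargest(3, scores.items(), key=lambda item: item[1])

print(top3[0][0])  # P2, the highest score
print(top3[2][0])  # P5, the third-highest score (index 2)
```

Applied to the notebook's data, `heapq.nlargest(3, pr_results.items(), key=lambda item: item[1])[2][0]` should select the same patent as the sorted slice, provided there are no tied scores among the top entries.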

In [15]:
# Partition the graph into communities (a list of node sets) using Louvain on the GPU.
clusters = nx.community.louvain_communities(G, seed=1, backend="cugraph")

In [16]:
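# Find the community that contains the selected patent.
for cluster in clusters:
    if most_important_patent in cluster:
        save_cluster = cluster
        break  # communities are disjoint, so the first match is the only one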
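
The same lookup can be written as a small function that also handles the case where no community contains the node. This is a sketch on toy data; `community_of` is a hypothetical helper, not a NetworkX function.

```python
def community_of(node, communities):
    """Return the first community (a set) containing `node`, or None if absent."""
    return next((c for c in communities if node in c), None)


# Hypothetical communities, shaped like the list of sets returned by Louvain.
communities = [{"P1", "P2"}, {"P3", "P4", "P5"}, {"P6"}]

found = community_of("P4", communities)
missing = community_of("P9", communities)
print(sorted(found), missing)  # ['P3', 'P4', 'P5'] None
```

Returning `None` makes a missing node explicit instead of leaving a variable undefined, as the plain loop would.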

In [17]:
save_cluster

{'10162442',
 '9854654',
 'D697666',
 '7371603',
 '8850743',
 '11956554',
 '11476626',
 '5040995',
 '9852967',
 '4747029',
 '7192692',
 '9711490',
 '9735877',
 '4275993',
 '11142012',
 '8120256',
 '6780355',
 '11317749',
 '8648980',
 'D977179',
 '9660428',
 '2149363',
 '4992700',
 '6679433',
 '10381527',
 '8596826',
 '7248176',
 '10648632',
 '774582',
 '8246201',
 'D251328',
 '10120208',
 '6301064',
 '8708525',
 '1695600',
 '7364331',
 '4804886',
 '5358437',
 '5073847',
 '9869435',
 'D62666',
 '11835185',
 '10727731',
 'D140303',
 '8330950',
 'D928377',
 '11575980',
 '1686663',
 '8071995',
 '1869221',
 'D877401',
 'D351115',
 '3342986',
 '10451237',
 '8344602',
 '5744916',
 '6739740',
 'D468053',
 '8803441',
 'D222362',
 '4338653',
 'D766838',
 '10962193',
 '4016827',
 'D633231',
 '8227997',
 '10645558',
 '1588222',
 '5016956',
 'D683066',
 '6734630',
 '7425798',
 '7868903',
 '10030852',
 '8207554',
 '4703574',
 '5485359',
 '10734543',
 '2118264',
 'D992767',
 '11519587',
 '7466084',
 

This loads the enrichment data. In this case it contains the title of each patent. The [PatentsView site](https://patentsview.org/download/data-download-tables) contains many other files that could be loaded and merged in the same way.

In [18]:
# load the patent titles
title_df = pd.read_csv("../data/g_patent.tsv",
                sep='\t',
                header=0,
                usecols=[0,3],
                names=["patent_id", "patent_title"],
                dtype={"patent_id":str,"patent_title":str},
)
len(title_df)


8890049

This code merges the IDs in the cluster containing the selected patent (`most_important_patent`) with the patent titles to create a dataframe enriched with those titles, and shows the first 20 rows of that dataframe.

In [19]:
cluster_df = pd.DataFrame(save_cluster, columns=['patent_id'])
enriched_df = title_df.merge(cluster_df, on='patent_id', how='inner')
enriched_df.iloc[0:20]

| | patent_id | patent_title |
|---|---|---|
| 0 | 10000015 | Methods for making optical components, optical... |
| 1 | 10000075 | Multilayer imaging with a high-gloss clear ink... |
| 2 | 10000150 | Lighting circuit and vehicle lamp |
| 3 | 10000151 | Motor vehicle lighting device |
| 4 | 10000160 | Vehicle article carrier with integrated camera... |
| 5 | 10000289 | Temperature control gasper apparatus |
| 6 | 10000426 | Marking coating |
| 7 | 10000695 | Method of manufacturing fluoride phosphor, whi... |
| 8 | 10000696 | Light-emitting ceramic, light-emitting element... |
| 9 | 10000697 | Magnesium alumosilicate-based phosphor |
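
An inner merge keeps only the IDs that appear in both inputs. The plain-Python sketch below mirrors that behavior on toy data; the IDs and titles are made up for illustration and do not come from the patent files.

```python
# Hypothetical (patent_id, patent_title) rows standing in for title_df.
title_rows = [
    ("P1", "Lighting circuit"),
    ("P2", "Marking coating"),
    ("P3", "Vehicle lamp"),
]

# Hypothetical cluster membership standing in for save_cluster.
cluster_ids = {"P1", "P3", "P9"}

# Keep a title row only if its ID is in the cluster (inner-join semantics).
enriched = [(pid, title) for pid, title in title_rows if pid in cluster_ids]

print(enriched)  # [('P1', 'Lighting circuit'), ('P3', 'Vehicle lamp')]
```

Here `P2` is dropped because it is not in the cluster, and `P9` is dropped because it has no title row, which is the same filtering `how='inner'` applies in the merge above.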


---
U.S. Patent and Trademark Office. “Data Download Tables.” PatentsView. Accessed [10/06/2024]. 

https://patentsview.org/download/data-download-tables.

Data used is licensed under the Creative Commons Attribution 4.0 International license.

You may obtain a copy of the license at https://creativecommons.org/licenses/by/4.0/

___
Copyright (c) 2024, NVIDIA CORPORATION.

Licensed under the Apache License, Version 2.0 (the "License");  you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.
___