# Chapter 8: Graphs with cuGraph

<img src="images/chapter-08/cugraph_logo_2.png" style="width:600px;"/>

cuGraph is part of the RAPIDS AI suite and provides graph analytics algorithms optimized for NVIDIA GPUs. It supports several graph data structures and algorithms, enabling fast processing of large-scale graph data.

Because it works on cuDF DataFrames, data passes seamlessly between ETL tasks in cuDF, graph analytics in cuGraph, and machine learning tasks in cuML.


## Key Benefits
- Performance: Accelerate graph computations by running them on the GPU.
- Scalability: Process large datasets that are infeasible for CPU-only solutions.
- Integration: Easily combine with other RAPIDS libraries.

## Getting Started

### Prerequisites
- CUDA-capable GPU: Ensure your system has a compatible NVIDIA GPU.
- Software: Install the RAPIDS AI libraries, including cuGraph.

### Installation via Conda

``` 
conda create -n rapids-24.10 -c rapidsai -c conda-forge -c nvidia  \
    cudf=24.10 cugraph=24.10 python=3.12 'cuda-version>=12.0,<=12.5' 
```

### Installation via Pip

```
pip install \
    --extra-index-url=https://pypi.nvidia.com \
    cudf-cu12==24.10.* cugraph-cu12==24.10.* 
```


### Installation via Docker

``` 
docker run --gpus all --pull always --rm -it \
    --shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 \
    nvcr.io/nvidia/rapidsai/base:24.10-cuda12.5-py3.12
```

In [None]:
!pip install \
    --extra-index-url=https://pypi.nvidia.com \
    cudf-cu12==24.10.* cugraph-cu12==24.10.* 


### Verify Installation

Run the following command in Python:

```
import cugraph
print(cugraph.__version__)
```
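
If the import fails, or raises a low-level error such as an `undefined symbol` from a shared library, one possible cause is that the installed RAPIDS packages come from different releases. The sketch below uses only the standard library to list which versions are installed without importing them. The names in `PACKAGES` are assumptions based on the pip command above; conda installs use names without the `-cu12` suffix. `installed_versions` is a hypothetical helper, not part of RAPIDS.

```python
from importlib import metadata

# Distribution names assumed from the pip install command above;
# conda installs use names without the -cu12 suffix.
PACKAGES = ["cudf-cu12", "cugraph-cu12", "pylibcugraph-cu12"]

def installed_versions(names):
    """Return {distribution name: version string, or None if not installed}."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions

for name, version in installed_versions(PACKAGES).items():
    print(f"{name}: {version or 'not installed'}")
```

All listed packages should report the same release (for example `24.10.*`).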

## Core Features


1. Graph Creation
   Create graphs from various formats (edge lists, adjacency matrices).

2. Algorithms
   Key algorithms include:
    - PageRank
    - Connected Components
    - Shortest Path
    - Community Detection

3. Visualization
   Integrate with visualization libraries for graph representation.

## Hands-On Examples

### Example 1: Creating a Graph

Create a simple graph from an edge list.

An edge list is a simple way to represent a graph. It consists of pairs of nodes, where each pair indicates a connection (or edge) between two nodes.

In our example, we'll create a small graph represented by the following edge list:

- Node 0 connects to Node 1
- Node 0 connects to Node 2
- Node 1 connects to Node 2
- Node 2 connects back to Node 0
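
To make the edge-list idea concrete before moving to the GPU, here is a plain-Python sketch of the same four pairs. It is only an illustration of the data structure, not how cuGraph stores graphs internally. It also shows why direction matters: `cugraph.Graph()` is undirected by default, so the pair (2, 0) describes the same connection as (0, 2).

```python
# The same edge list as (source, destination) pairs
edges = [(0, 1), (0, 2), (1, 2), (2, 0)]

# Undirected view: (2, 0) and (0, 2) describe the same connection
undirected = {frozenset(pair) for pair in edges}

# Directed view: every ordered pair is its own edge
directed = set(edges)

# Adjacency sets for the undirected graph
adjacency = {}
for u, v in edges:
    adjacency.setdefault(u, set()).add(v)
    adjacency.setdefault(v, set()).add(u)

print(len(undirected), "undirected edges")
print(len(directed), "directed edges")
print(adjacency)
```

The four rows yield three undirected edges but four directed ones, which is worth keeping in mind when comparing edge counts later.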

In [12]:
import cudf
import cugraph

# Create a sample edge list
edge_list = cudf.DataFrame({
    'src': [0, 0, 1, 2],
    'dst': [1, 2, 2, 0]
})

# Create the graph
G = cugraph.Graph()
G.from_cudf_edgelist(edge_list, source='src', destination='dst')


In [13]:
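import cudf
import cugraph

# cuGraph builds graphs from edge lists rather than one edge at a time,
# so list the weighted edges (source, destination, weight) in a DataFrame
weighted_edges = cudf.DataFrame({
    'src': [0, 1, 2, 0],
    'dst': [1, 2, 0, 2],
    'weight': [1.0, 1.0, 1.0, 1.0]
})

G = cugraph.Graph()
G.from_cudf_edgelist(weighted_edges, source='src', destination='dst', edge_attr='weight')

print("Number of nodes:", G.number_of_vertices())
print("Number of edges:", G.number_of_edges())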



In [14]:
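import cugraph
from cugraph.datasets import karate

# Download Zachary's karate club edge list as a cuDF DataFrame
gdf = karate.get_edgelist(download=True)
gdf.head()

# Build a directed graph from the 'src' and 'dst' columns
G = cugraph.Graph(directed=True)
G.from_cudf_edgelist(gdf, source='src', destination='dst', renumber=False, store_transposed=True)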


As seen above, the value at each index in `src` is the source node that connects to the value at the same index in `dst`.

Next, we create an instance of cuGraph's `Graph` class. This object holds the graph structure.

To load the edge list into the graph object, we call the `from_cudf_edgelist` method, specifying which DataFrame columns hold the source and destination nodes.


### Optional: Visualizing the Graph

- Install the required libraries, NetworkX and Matplotlib:
```
pip install networkx matplotlib
```

- Convert the graph to NetworkX format:

In [8]:
import networkx as nx
import matplotlib.pyplot as plt

# Create an empty directed NetworkX graph
nx_graph = nx.DiGraph()

# Copy the edges from the cuDF edge list (gdf) into NetworkX
for u, v in zip(gdf['src'].to_arrow().to_pylist(), gdf['dst'].to_arrow().to_pylist()):
    nx_graph.add_edge(u, v)

# Visualize the graph

plt.figure(figsize=(8, 6))
pos = nx.spring_layout(nx_graph)  # Positioning algorithm
nx.draw(nx_graph, pos, with_labels=True, node_color='lightblue', node_size=1000, font_size=15, font_weight='bold', arrows=True)
plt.title("Graph Visualization using NetworkX")
plt.show()


### Challenge 1: Modify Graph 

Now that you can visualize the graph, try modifying the edge list to create a more complex graph and visualize it again. How does the layout change with different structures?

### Example 2: Running the PageRank Algorithm

Using the graph we created, let’s run the PageRank algorithm to determine the importance of each node.

In [9]:
# Perform PageRank on the weighted graph
pagerank_result = cugraph.pagerank(G)

# Display the PageRank values
print(pagerank_result)




In [None]:
G = karate.get_graph(download=True)

In [None]:
# Call cugraph.pagerank to get the pagerank scores
gdf_page = cugraph.pagerank(G)
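
To see what a PageRank score represents, here is a textbook power-iteration sketch in plain Python. It is an illustration of the idea only, not cuGraph's implementation. The function name `pagerank`, its parameters, and the commonly used damping factor of 0.85 are choices made for this example.

```python
def pagerank(edges, damping=0.85, tol=1e-10, max_iter=100):
    """Power-iteration PageRank for a directed graph given as (src, dst) pairs."""
    nodes = sorted({n for edge in edges for n in edge})
    out_links = {n: [] for n in nodes}
    for src, dst in edges:
        out_links[src].append(dst)

    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(max_iter):
        # Rank held by nodes with no out-links is spread evenly over all nodes
        dangling = sum(rank[n] for n in nodes if not out_links[n])
        base = (1.0 - damping + damping * dangling) / len(nodes)
        new_rank = {n: base for n in nodes}
        # Each node passes an equal share of its rank along every out-link
        for src in nodes:
            for dst in out_links[src]:
                new_rank[dst] += damping * rank[src] / len(out_links[src])
        converged = sum(abs(new_rank[n] - rank[n]) for n in nodes) < tol
        rank = new_rank
        if converged:
            break
    return rank

# Undirected edges from Example 1, written in both directions
edges = [(0, 1), (1, 0), (0, 2), (2, 0), (1, 2), (2, 1)]
scores = pagerank(edges)
print(scores)
```

In the triangle from Example 1 every node has the same connections, so each ends up with a score of one third. Less symmetric graphs, such as the karate club, give uneven scores.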

## NetworkX x cuGraph

Let's start by installing nx-cugraph, the cuGraph backend that accelerates NetworkX with zero code changes:

In [None]:
!pip install nx-cugraph-cu12 --extra-index-url=https://pypi.nvidia.com

We begin with the default NetworkX backend, which runs on the CPU:

In [15]:
#%env NX_CUGRAPH_AUTOCONFIG=True

import networkx as nx
print(f"using networkx version {nx.__version__}")

#nx.config.warnings_to_ignore.add("cache")

using networkx version 3.3


In [31]:
G = nx.gnm_random_graph(5000, 40000)

import time 
start_time = time.time()
pr = nx.pagerank(G)
elapsed_time = time.time() - start_time
print(elapsed_time)

0.3166830539703369


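Next, let's make cuGraph the default backend instead of the non-accelerated one by setting `NX_CUGRAPH_AUTOCONFIG`. The variable is read when `networkx` is first imported, so if `networkx` has already been imported in your session, you may need to restart the kernel and set the variable before the import: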

In [32]:
%env NX_CUGRAPH_AUTOCONFIG=True

import networkx as nx
print(f"using networkx version {nx.__version__}")

#nx.config.warnings_to_ignore.add("cache")


import time 
start_time = time.time()
pr = nx.pagerank(G)
elapsed_time = time.time() - start_time
print(elapsed_time)

env: NX_CUGRAPH_AUTOCONFIG=True
using networkx version 3.3
0.11525797843933105
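
A single `time.time()` measurement can be noisy, and the first call on a GPU backend may include one-time setup costs. The sketch below is one way to get a steadier number: run the call several times and keep the fastest. `best_of` is a hypothetical helper written for this example, and the workload is a stand-in so the snippet runs without a GPU.

```python
import time

def best_of(func, repeats=5):
    """Call func() `repeats` times; return (fastest wall time in seconds, last result)."""
    best = float("inf")
    result = None
    for _ in range(repeats):
        start = time.perf_counter()
        result = func()
        best = min(best, time.perf_counter() - start)
    return best, result

# Stand-in workload; in the notebook this would be: lambda: nx.pagerank(G)
elapsed, total = best_of(lambda: sum(i * i for i in range(100_000)))
print(f"best of 5 runs: {elapsed:.6f} s")
```

Timing both backends with the same helper makes the comparison less sensitive to a single slow run.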


Now that we've configured our cuGraph setup with NetworkX, let's start experimenting with its functionalities using a real-world example!


## 🎬 cuGraph for Movie Recommendations 

### Getting Started 

In a saturated market where viewers are often overwhelmed by choice, we want users to receive tailored suggestions that surface hidden gems and encourage discovery, improving satisfaction and engagement. cuGraph is a good fit for movie recommendations because graph algorithms such as PageRank can rank movies based on users' past ratings.

The MovieLens dataset is a rich collection of movie ratings and user preferences featuring millions of ratings from a diverse user base, capturing insights into how individuals interact with thousands of films. This dataset not only includes user-generated ratings but also metadata about the movies, such as genres, titles, and release years, making it a comprehensive resource for building and testing recommendation algorithms.

Let's begin by loading in the dataset! 

#### Dataset for User Ratings

This dataset contains 100000 ratings by 943 users on 1682 distinct movies, where each user has rated at least 20 movies.

In [19]:
import pandas as pd 

columns = ['user_id', 'item_id', 'rating', 'timestamp']
df = pd.read_csv('u.data', sep='\t', names=columns)
df.head()

| | user_id | item_id | rating | timestamp |
|---|---|---|---|---|
| 0 | 196 | 242 | 3 | 881250949 |
| 1 | 186 | 302 | 3 | 891717742 |
| 2 | 22 | 377 | 1 | 878887116 |
| 3 | 244 | 51 | 2 | 880606923 |
| 4 | 166 | 346 | 1 | 886397596 |
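
The counts quoted above also tell us how sparse the data is, which matters later because each rating becomes one edge of the graph. The arithmetic below uses only the three numbers from the dataset description.

```python
n_users, n_movies, n_ratings = 943, 1682, 100_000

possible_pairs = n_users * n_movies      # every user rating every movie
density = n_ratings / possible_pairs     # fraction of pairs actually rated
avg_per_user = n_ratings / n_users
avg_per_movie = n_ratings / n_movies

print(f"possible user-movie pairs: {possible_pairs:,}")
print(f"density: {density:.1%}")
print(f"average ratings per user:  {avg_per_user:.1f}")
print(f"average ratings per movie: {avg_per_movie:.1f}")
```

Only about 6.3% of the possible user-movie pairs carry a rating, so an edge list is a far more compact representation than a full user-by-movie matrix.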


#### Dataset for Movie Information 

The movie dataset includes each movie's ID, title, release dates, genres, and more. We are particularly interested in using it to map movie IDs to titles once we have our recommendations.

In [20]:
item_cols = ['movie_id','movie_title','release_date', 'video_release_date',
              'MDb_URL', 'unknown','Action','Adventure','Animation',
              'Childrens','Comedy','Crime', 'Documentary', 'Drama', 'Fantasy',
              'Film-Noir', 'Horror', 'Musical', 'Mystery','Romance','Sci-Fi',
              'Thriller','War', 'Western' ]

item_df = pd.read_csv('u.item', encoding= 'ISO-8859-1', sep = '|', names = item_cols)

In [21]:
item_df.head()

| | movie_id | movie_title | release_date | video_release_date | MDb_URL | unknown | Action | Adventure | Animation | Childrens | ... | Fantasy | Film-Noir | Horror | Musical | Mystery | Romance | Sci-Fi | Thriller | War | Western |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Toy Story (1995) | 01-Jan-1995 | | http://us.imdb.com/M/title-exact?Toy%20Story%2... | 0 | 0 | 0 | 1 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 2 | GoldenEye (1995) | 01-Jan-1995 | | http://us.imdb.com/M/title-exact?GoldenEye%20(... | 0 | 1 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 2 | 3 | Four Rooms (1995) | 01-Jan-1995 | | http://us.imdb.com/M/title-exact?Four%20Rooms%... | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 3 | 4 | Get Shorty (1995) | 01-Jan-1995 | | http://us.imdb.com/M/title-exact?Get%20Shorty%... | 0 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 5 | Copycat (1995) | 01-Jan-1995 | | http://us.imdb.com/M/title-exact?Copycat%20(1995) | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |


In [22]:
!pip install chardet



In [23]:
import chardet

# Read the file in binary mode and detect encoding
with open('u.item', 'rb') as f:
    result = chardet.detect(f.read())
    print(result)

{'encoding': 'ISO-8859-1', 'confidence': 0.73, 'language': ''}
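
The detected encoding is why `u.item` is read with `encoding='ISO-8859-1'`: pandas assumes UTF-8 by default, and some ISO-8859-1 bytes are not valid UTF-8. The sketch below shows the failure mode on a single string. The title is made up for illustration and is not taken from `u.item`.

```python
# An illustrative title with an accented character, stored as ISO-8859-1 bytes
raw = "Café (1995)".encode("iso-8859-1")

# Decoding with the matching encoding recovers the text
print(raw.decode("iso-8859-1"))

# Decoding as UTF-8 fails: byte 0xE9 must be followed by continuation bytes
try:
    raw.decode("utf-8")
except UnicodeDecodeError as err:
    print("UTF-8 decode failed:", err.reason)
```

Note that chardet reports a confidence of 0.73 here, so the detection is a strong hint rather than a guarantee.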


#### Dataset for User Demographic Information

To get some background information on the movie viewers, we load in a user dataset that details the user's age, gender, occupation, and zip code for more context: 

In [24]:
user_columns = ['user_id', 'age', 'gender', 'occupation', 'zip_code']
user_df = pd.read_csv('u.user', sep='|', names= user_columns)
user_df.head()

| | user_id | age | gender | occupation | zip_code |
|---|---|---|---|---|---|
| 0 | 1 | 24 | M | technician | 85711 |
| 1 | 2 | 53 | F | other | 94043 |
| 2 | 3 | 23 | M | writer | 32067 |
| 3 | 4 | 24 | M | technician | 43537 |
| 4 | 5 | 33 | F | other | 15213 |


### 🕸️ Constructing our Graph 

Let's construct a bipartite graph using `nx.Graph()`, where the individual users are nodes in one partition and the unique movies are nodes in the other. We then add an edge from each user to every movie they rated, storing the rating as an edge attribute.

In [25]:

C = nx.Graph()

# Suffix the IDs so that user and movie nodes never collide (user 1 vs. movie 1)
df['user_id'] = df['user_id'].astype(str) + '_user'
df['item_id'] = df['item_id'].astype(str) + '_item'

user_ids = df['user_id'].unique()
item_ids = df['item_id'].unique()
print(f"Number of unique users : {len(user_ids)}")
print(f"Number of unique movies : {len(item_ids)}")

# One partition per node type
C.add_nodes_from(user_ids, bipartite=0)
C.add_nodes_from(item_ids, bipartite=1)

# One edge per rating, with the rating stored as an edge attribute
edges = [(u, i, {'rating': r}) for u, i, r in zip(df['user_id'], df['item_id'], df['rating'])]
C.add_edges_from(edges)

print(f"Number of nodes: {C.number_of_nodes()}")
print(f"Number of edges: {C.number_of_edges()}")


Number of unique users : 943
Number of unique movies : 1682
Number of nodes: 2625
Number of edges: 100000


The output above tells us that:
- 943 unique viewers rated 1682 unique movies.
- Each of the 943 viewers is a node in partition `bipartite = 0` of the graph.
- Each of the 1682 movies is a node in partition `bipartite = 1`.
- Each rating is an edge, so the 100000 ratings across all users and movies give 100000 edges.
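
The `_user` and `_item` suffixes are what keep the two partitions apart, since user IDs and movie IDs both start at 1. The plain-Python sketch below shows the collision on three toy ratings. The data is made up for illustration.

```python
# Toy ratings: (user_id, item_id, rating). User 1 and movie 1 share an ID.
ratings = [(1, 1, 5), (1, 2, 3), (2, 1, 4)]

# Without suffixes, user 1 and movie 1 collapse into a single node
raw_nodes = {u for u, _, _ in ratings} | {i for _, i, _ in ratings}

# With suffixes, the two partitions stay disjoint
users = {f"{u}_user" for u, _, _ in ratings}
items = {f"{i}_item" for _, i, _ in ratings}
edges = {(f"{u}_user", f"{i}_item"): r for u, i, r in ratings}

print(len(raw_nodes), "nodes without suffixes")
print(len(users | items), "nodes with suffixes")
print(edges)
```

Without the suffixes the toy graph has two nodes instead of four, and the rating from user 1 on movie 1 would become a self-loop.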

### 🕸️ Running the PageRank Algorithm

In the realm of movie recommendations, leveraging algorithms like PageRank can significantly enhance user experience. PageRank, originally developed for ranking web pages, analyzes the relationships between movies based on user interactions, creating a network of preferences. By prioritizing films that are not only popular but also connected through user ratings and viewing habits, PageRank can provide more nuanced and relevant suggestions.

What movies are the most popular? 

Let's compute the PageRank scores with `nx.pagerank()`. 


In [34]:
pagerank_scores = nx.pagerank(C)
pagerank_df = cudf.DataFrame({'node_id': list(pagerank_scores.keys()), 'score': list(pagerank_scores.values())})
pagerank_df.tail()

| | node_id | score |
|---|---|---|
| 2620 | 1674_item | 6.1e-05 |
| 2621 | 1640_item | 6.4e-05 |
| 2622 | 1637_item | 6.4e-05 |
| 2623 | 1630_item | 6.4e-05 |
| 2624 | 1641_item | 6.4e-05 |


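As you can see, each node is paired with its score. Because the edges store ratings under the `rating` attribute, while `nx.pagerank()` reads edge weights from an attribute named `weight` by default, these scores are driven by which users rated a movie rather than by the rating values. Let's keep only the movie nodes and sort them from highest to lowest score. The 10 highest-scoring movies are: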

In [46]:
item_scores_df = pagerank_df[pagerank_df['node_id'].str.endswith('item')]
sorted_scores_df = item_scores_df.sort_values(by='score', ascending=False).head(10)
sorted_scores_df

| | node_id | score |
|---|---|---|
| 1300 | 50_item | 0.002495 |
| 1100 | 258_item | 0.00239 |
| 1232 | 286_item | 0.00233 |
| 1003 | 288_item | 0.002273 |
| 1038 | 294_item | 0.002267 |
| 992 | 100_item | 0.002194 |
| 995 | 181_item | 0.002159 |
| 1595 | 300_item | 0.002088 |
| 967 | 1_item | 0.00193 |
| 1346 | 121_item | 0.00181 |


Now let's look up their movie titles using the `item_df` we loaded earlier.

In [48]:
top_10_movies = []
for _, row in sorted_scores_df.to_pandas().iterrows():
    movie_id = int(row['node_id'].split('_')[0])
    movie_title = item_df[item_df['movie_id'] == movie_id].iloc[0]['movie_title']
    top_10_movies.append(movie_title)

top_10_movies

['Star Wars (1977)',
 'Contact (1997)',
 'English Patient, The (1996)',
 'Scream (1996)',
 'Liar Liar (1997)',
 'Fargo (1996)',
 'Return of the Jedi (1983)',
 'Air Force One (1997)',
 'Toy Story (1995)',
 'Independence Day (ID4) (1996)']
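
The loop above filters `item_df` once per movie. For longer lists, one option is to build a lookup table once and reuse it. The sketch below shows the idea with a small hand-written dictionary standing in for `item_df` (its three entries are taken from the outputs above). `node_to_title` is a hypothetical helper name.

```python
# Toy stand-ins for item_df and the sorted PageRank node IDs.
# With the real data, the dictionary could be built with:
#   dict(zip(item_df['movie_id'], item_df['movie_title']))
titles_by_id = {1: "Toy Story (1995)", 50: "Star Wars (1977)", 100: "Fargo (1996)"}
top_node_ids = ["50_item", "100_item", "1_item"]

def node_to_title(node_id, titles):
    """Strip the '_item' suffix and look up the title by numeric movie ID."""
    movie_id = int(node_id.split("_")[0])
    return titles[movie_id]

top_titles = [node_to_title(node_id, titles_by_id) for node_id in top_node_ids]
print(top_titles)
```

Each lookup is then a dictionary access instead of a scan over the whole DataFrame.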

## Conclusion 

In this tutorial, we explored cuGraph, a GPU-accelerated library for graph analytics. We set up the environment, imported the necessary libraries, and built graph structures from edge lists and bundled datasets.

We applied PageRank to a random graph and to the MovieLens ratings, and compared NetworkX's default CPU backend with the cuGraph backend. In our run, PageRank on a random graph with 5000 nodes and 40000 edges took about 0.32 s on the CPU and about 0.12 s with cuGraph.

As you continue your journey with cuGraph, consider exploring additional algorithms and functionalities, as well as integrating graph analytics into larger data processing pipelines. The potential applications are vast, ranging from social network analysis to recommendation systems and beyond.

We hope this tutorial has equipped you with the foundational knowledge and skills to effectively utilize cuGraph in your own projects. Happy graphing!