# Graph Data Analysis

## Notebook 1

This notebook introduces graph algorithms as a tool for exploratory data analysis.

## Connect to TigerGraph Database

The code block below connects to a TigerGraph database. Replace the placeholder hostname and secret with your own credentials so that the connection succeeds.

In [None]:
import warnings
warnings.filterwarnings('ignore')

import pyTigerGraph as tg

hostName = "YOUR_HOSTNAME_HERE"
gsqlSecret = "YOUR_SECRET_HERE"
graphname = "KDD_2022_NFT"

# Reuse the variables defined above instead of repeating the graph name.
conn = tg.TigerGraphConnection(host=hostName, graphname=graphname, gsqlSecret=gsqlSecret)
conn.getToken(gsqlSecret)

## PyTigerGraph Graph Data Science Featurizer

The code block below instantiates a `featurizer`, which allows developers to easily run graph algorithms on their database, directly from Python.

In [None]:
featurizer = conn.gds.featurizer()

## Centrality Algorithms

**From Wikipedia:** _In graph theory and network analysis, indicators of centrality assign numbers or rankings to nodes within a graph corresponding to their network position. Applications include identifying the most influential person(s) in a social network, key infrastructure nodes in the Internet or urban networks, super-spreaders of disease, and brain networks. Centrality concepts were first developed in social network analysis, and many of the terms used to measure centrality reflect their sociological origin._

In this demo, we use a centrality measure to identify the most influential users in the NFT marketplace.

In [None]:
featurizer.listAlgorithms("Centrality")

## Installing and Running PageRank

To measure the centrality of users in the NFT transaction network, we use PageRank. PageRank is defined recursively: a vertex's score depends on the scores of the vertices that point to it, so more influential users contribute more to the influence of the users they are connected to.
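
To make the recursion concrete, here is a minimal pure-Python sketch of PageRank computed by power iteration on a toy graph. It is an illustration only and not the `tg_pagerank` implementation. The function name `pagerank_sketch`, the toy edges, the damping factor of 0.85, and the convention that rank flows along the direction of each edge are all assumptions made for this example.

```python
def pagerank_sketch(edges, damping=0.85, max_iter=200, tol=1e-10):
    """Power-iteration PageRank on a directed edge list (illustrative only)."""
    nodes = sorted({v for edge in edges for v in edge})
    out_links = {v: [] for v in nodes}
    for src, dst in edges:
        out_links[src].append(dst)

    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(max_iter):
        # Rank held by vertices with no outgoing edges is spread evenly.
        dangling = sum(rank[v] for v in nodes if not out_links[v])
        new_rank = {v: (1.0 - damping) / n + damping * dangling / n for v in nodes}
        # Each vertex passes its current rank, split equally, to its out-neighbors.
        for src in nodes:
            for dst in out_links[src]:
                new_rank[dst] += damping * rank[src] / len(out_links[src])
        converged = sum(abs(new_rank[v] - rank[v]) for v in nodes) < tol
        rank = new_rank
        if converged:
            break
    return rank


# Hypothetical toy graph: an edge (a, b) is read as "a sold to b".
toy_edges = [("alice", "carol"), ("bob", "carol"), ("carol", "dave"), ("dave", "carol")]
toy_ranks = pagerank_sketch(toy_edges)
print(sorted(toy_ranks, key=toy_ranks.get, reverse=True))
```

In this toy graph, `dave` receives rank only from `carol`, yet scores well above `alice` and `bob` because `carol` herself is highly ranked. That is the recursive effect described above.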

In [None]:
featurizer.installAlgorithm("tg_pagerank")

In [None]:
params = {"v_type": "NFT_User", "e_type": "USER_SOLD_TO", "result_attr": "pagerank"}

In [None]:
featurizer.runAlgorithm("tg_pagerank", params=params, feat_name="pagerank", schema_name=["NFT_User"], global_schema=False)

## Creating Our Own Feature

**HANDS-ON CODE:** Check the `query_answers` directory if you are not participating in the live tutorial.

Here, we define our own feature using a GSQL query and use the **featurizer** to install it.

In [None]:
%%writefile ./average_selling_price.gsql



In [None]:
featurizer.installAlgorithm("average_selling_price", query_path="./average_selling_price.gsql")

In [None]:
params = {
    "result_attr": "avg_sell_price"
}

In [None]:
featurizer.runAlgorithm("average_selling_price", params=params, feat_name="avg_sell_price", feat_type="FLOAT", custom_query=True, schema_name=["NFT_User"])

## PageRank vs. Average Selling Price

Let's compare each user's PageRank score to that user's average selling price.

In [None]:
df = conn.getVertexDataFrame("NFT_User", where="avg_sell_price > 0", limit=100_000)
pr_sell = df[["pagerank", "avg_sell_price"]]

In [None]:
pr_sell.plot.scatter(x="pagerank", y="avg_sell_price", logx=True, logy=True)

### Remove Outliers

In [None]:
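import numpy as np
from scipy import stats

# Keep only rows where every column lies within 3 standard deviations of its mean.
inliers = pr_sell[(np.abs(stats.zscore(pr_sell)) < 3).all(axis=1)]
inliers.plot.scatter(x="pagerank", y="avg_sell_price", logx=True, logy=True)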
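
The filter above keeps a row only if every column lies within three standard deviations of that column's mean. As a rough illustration of the same rule, here is a standard-library sketch on made-up numbers. The helper name `within_z` and the data are assumptions for this example. `scipy.stats.zscore` uses the population standard deviation by default, which `statistics.pstdev` mirrors.

```python
from statistics import fmean, pstdev


def within_z(rows, threshold=3.0):
    """Keep rows whose every column has |z-score| below the threshold.

    Assumes no column is constant (a zero standard deviation would divide by zero).
    """
    columns = list(zip(*rows))
    means = [fmean(col) for col in columns]
    stds = [pstdev(col) for col in columns]
    return [
        row
        for row in rows
        if all(abs(x - m) / s < threshold for x, m, s in zip(row, means, stds))
    ]


# Made-up (pagerank, avg_sell_price) pairs: 19 ordinary rows plus one extreme price.
toy_rows = [(1.0 + 0.01 * i, 10.0 + i) for i in range(19)] + [(1.1, 10_000.0)]
kept = within_z(toy_rows)
print(len(toy_rows), "rows before,", len(kept), "rows after")
```

Note that a single extreme value inflates the standard deviation it is measured against, so this rule can miss outliers in small or very heavy-tailed samples.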

## Community Detection Algorithms

**From Wikipedia:** _In the study of complex networks, a network is said to have community structure if the nodes of the network can be easily grouped into (potentially overlapping) sets of nodes such that each set of nodes is densely connected internally. In the particular case of non-overlapping community finding, this implies that the network divides naturally into groups of nodes with dense connections internally and sparser connections between groups. But overlapping communities are also allowed. The more general definition is based on the principle that pairs of nodes are more likely to be connected if they are both members of the same community(ies), and less likely to be connected if they do not share communities. A related but different problem is community search, where the goal is to find a community that a certain vertex belongs to._

We want to include community features as inputs to our machine learning algorithms. Users in a small community might be more likely to sell at a lower price than users in a larger community.

In [None]:
featurizer.listAlgorithms("Community")

## Installing and Running K-Core

A k-core of a graph is a maximal connected subgraph in which every vertex is connected to at least k vertices in the subgraph. To obtain the k-core of a graph, the algorithm first deletes the vertices whose outdegree is less than k. It then updates the outdegree of the neighbors of the deleted vertices, and if that causes a vertex’s outdegree to fall below k, it will also delete that vertex. The algorithm repeats this operation until every vertex left in the subgraph has an outdegree of at least k.

Our algorithm takes a range of values for k and returns the set of the vertices that constitute the k-core with the highest possible value of k within the range. It is an implementation of Algorithm 2 in [Scalable K-Core Decomposition for Static Graphs Using a Dynamic Graph Data Structure, Tripathy et al., IEEE Big Data 2018.](https://ieeexplore.ieee.org/document/8622056)
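
The peeling procedure described above can be sketched in a few lines of pure Python. This is an illustration only and not the `tg_kcore` query. The function names `k_core_sketch` and `max_k_core_sketch` and the toy graph are assumptions. For simplicity the sketch treats the graph as undirected and uses plain degree rather than outdegree, and it returns all surviving vertices without splitting them into connected components.

```python
def k_core_sketch(adjacency, k):
    """Return the vertices left after repeatedly deleting vertices of degree < k."""
    degree = {v: len(neighbors) for v, neighbors in adjacency.items()}
    alive = set(adjacency)
    queue = [v for v in alive if degree[v] < k]
    while queue:
        v = queue.pop()
        if v not in alive:
            continue  # already deleted via an earlier queue entry
        alive.remove(v)
        # Deleting v lowers the degree of each surviving neighbor.
        for u in adjacency[v]:
            if u in alive:
                degree[u] -= 1
                if degree[u] < k:
                    queue.append(u)
    return alive


def max_k_core_sketch(adjacency, k_min, k_max):
    """Return (k, vertices) for the largest k in [k_min, k_max] with a non-empty core."""
    for k in range(k_max, k_min - 1, -1):
        core = k_core_sketch(adjacency, k)
        if core:
            return k, core
    return None, set()


# Hypothetical undirected toy graph: a triangle a-b-c with a pendant vertex d on c.
toy_graph = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
}
best_k, best_core = max_k_core_sketch(toy_graph, 1, 5)
print(best_k, sorted(best_core))
```

Searching from `k_max` downward is a simple choice for a sketch. It recomputes the peeling for each k rather than reusing work between values.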


In [None]:
featurizer.installAlgorithm("tg_kcore")

In [None]:
params = {
    "v_type": "NFT_User",
    "e_type": "USER_SOLD_TO",
    "print_accum": False,
    "result_attr": "k_core"
}

featurizer.runAlgorithm("tg_kcore", params=params, feat_name="k_core", schema_name=["NFT_User"])

## K-Core Size vs. Average Selling Price

We are now going to compare the size of a user's k-core to their average selling price.

In [None]:
df = conn.getVertexDataFrame("NFT_User", where="avg_sell_price > 0", limit=100_000)
df.head()

In [None]:
df["k_core"].value_counts()

In [None]:
len(df["k_core"].unique())

In [None]:
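# Take an explicit copy so a new column can be added later without a SettingWithCopyWarning.
kcore_pr_sell = df[["pagerank", "avg_sell_price", "k_core"]].copy()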

kcore_pr_sell.groupby(["k_core"]).mean()

In [None]:
size_of_core = df["k_core"].value_counts()

In [None]:
# Look up each user's k_core value in the value_counts Series to get the core's size.
kcore_pr_sell["size_of_core"] = kcore_pr_sell["k_core"].map(size_of_core)

In [None]:
kcore_pr_sell.head()

In [None]:
kcore_pr_sell[["avg_sell_price", "size_of_core"]].corr()

In [None]:
kcore_pr_sell.plot.scatter(x="size_of_core", y="avg_sell_price", logx=True, logy=True)

In [None]:
kcore_pr_sell[(np.abs(stats.zscore(kcore_pr_sell)) < 3).all(axis=1)].plot.scatter(x="size_of_core", y="avg_sell_price", logx=True, logy=True)

## Size of K-Core vs. PageRank

We want features to be uncorrelated, so let's check whether the size of a user's k-core correlates with that user's PageRank.
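
One way to put a number on this is a correlation coefficient computed on log-transformed values, matching the log-log axes used in the plots. The sketch below uses only the standard library. The helper name `pearson` and the toy values are assumptions, not data from the graph.

```python
from math import log10, sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Made-up feature values; both must be positive for the log transform.
toy_core_size = [10, 100, 1_000, 10_000]
toy_pagerank = [0.2, 0.9, 0.4, 0.6]
r = pearson([log10(v) for v in toy_core_size], [log10(v) for v in toy_pagerank])
print(round(r, 3))
```

A value near 0 suggests the two features carry largely different information, while a value near 1 or -1 suggests one is mostly redundant given the other. On the notebook's DataFrame, the `.corr()` call used earlier serves the same purpose.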

In [None]:
kcore_pr_sell.plot.scatter(x="size_of_core", y="pagerank", logx=True, logy=True)

In [None]:
kcore_pr_sell[(np.abs(stats.zscore(kcore_pr_sell)) < 3).all(axis=1)].plot.scatter(x="size_of_core", y="pagerank", logx=True, logy=True)