# Feature Engineering

This notebook demonstrates how to use `pyTigerGraph` for feature engineering and other common data processing tasks on graphs stored in `TigerGraph`.

## Connection to Database

The `TigerGraphConnection` class represents a connection to the TigerGraph database. Under the hood, it stores the necessary information to communicate with the database. It is able to perform quite a few database tasks. Please see its [documentation](https://docs.tigergraph.com/pytigergraph/current/intro/) for details.

To connect to your database, modify the `config.json` file accompanying this notebook. Set the value of `getToken` based on whether token authentication is enabled for your database. Token authentication is always enabled for tgcloud databases.
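The cells in this notebook read four keys from the configuration: `host`, `username`, `password`, and `getToken`. The sketch below builds such a configuration in memory and reports any missing key. All values are placeholders, and `check_config` is a hypothetical helper for illustration, not part of pyTigerGraph.

```python
import json

# Keys that the cells in this notebook read from config.json.
REQUIRED_KEYS = ("host", "username", "password", "getToken")


def check_config(config):
    """Return the required keys missing from config (hypothetical helper)."""
    return [key for key in REQUIRED_KEYS if key not in config]


# Placeholder values for illustration only; replace them with your own settings.
example_config = json.loads("""
{
    "host": "https://your-instance.example.com",
    "username": "your_username",
    "password": "your_password",
    "getToken": false
}
""")

missing = check_config(example_config)
print("Missing keys:", missing)
```

Running a check like this before opening the connection can turn a confusing `KeyError` into a clear message about which setting is absent.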

In [1]:
from pyTigerGraph import TigerGraphConnection
import json

# Read in DB configs
with open('../config.json', "r") as config_file:
    config = json.load(config_file)
    
conn = TigerGraphConnection(
    host=config["host"],
    username=config["username"],
    password=config["password"]
)

### Ingest Data

In [2]:
from pyTigerGraph.datasets import Datasets

dataset = Datasets("Cora")


In [3]:
conn.ingestDataset(dataset, getToken=config["getToken"])

---- Checking database ----
A graph with name Cora already exists in the database. Skip ingestion.


### Visualize Schema

In [4]:
from pyTigerGraph.visualization import drawSchema

drawSchema(conn.getSchema(force=True))


### Basic Statistics

In [5]:
# Check graph schema and other information.
print(conn.gsql("ls"))

---- Global vertices, edges, and all graphs
Vertex Types:
Edge Types:

Graphs:
- Graph Cora(Paper:v, Cite:e)
- Graph imdb(Movie:v, Actor:v, Director:v, actor_movie:e, director_movie:e, movie_actor:e, movie_director:e)
Jobs:


JSON API version: v2
Syntax version: v2



In [6]:
# Number of vertices for every vertex type
conn.getVertexCount('*')

{'Paper': 2708}

In [7]:
# Number of vertices of a specific type
conn.getVertexCount("Paper")

2708

In [8]:
# Number of edges for every type
conn.getEdgeCount()

{'Cite': 10556}

In [9]:
# Number of edges of a specific type
conn.getEdgeCount("Cite")

10556

## Feature Engineering

The `featurizer` in pyTigerGraph includes a number of graph algorithms for feature engineering tasks. This notebook demonstrates a few key functions. For examples of each algorithm, please check out the algos directory.

The key functions are:
1. `listAlgorithms()`: Given a category of algorithms (e.g. Centrality), it prints the algorithms available in that category; otherwise, it prints a summary of the available algorithms by category.
2. `installAlgorithm()`: Takes the name of an algorithm and installs it if it is not already installed.
3. `runAlgorithm()`: Takes the algorithm name, the schema type (vertex or edge; vertex by default), the attribute name (if the result needs to be stored as an attribute in the database), and a list of schema type names (the vertex or edge types on which the attribute is saved; all of them by default).

In [10]:
f = conn.gds.featurizer()

In [11]:
f.listAlgorithms()

Available algorithms per category:
- Centrality: 10 algorithms
- Classification: 3 algorithms
- Community: 6 algorithms
- Embeddings: 1 algorithms
- Path: 3 algorithms
- Topological Link Prediction: 6 algorithms
- Similarity: 3 algorithms
Call listAlgorithms() with the category name to see the list of algorithms


In [12]:
f.listAlgorithms("Centrality")

Available algorithms for Centrality:
  pagerank:
    weighted:
      01. name: tg_pagerank_wt
    unweighted:
      02. name: tg_pagerank
  article_rank:
    03. name: tg_article_rank
  betweenness:
    04. name: tg_betweenness_cent
  closeness:
    approximate:
      05. name: tg_closeness_cent_approx
    exact:
      06. name: tg_closeness_cent
  degree:
    unweighted:
      07. name: tg_degree_cent
    weighted:
      08. name: tg_weighted_degree_cent
  eigenvector:
    09. name: tg_eigenvector_cent
  harmonic:
    10. name: tg_harmonic_cent
Call runAlgorithm() with the algorithm name to execute it


### Built-in graph algorithms

Below we will show how to run the built-in PageRank algorithm. See this [doc](https://docs.tigergraph.com/graph-ml/current/centrality-algorithms/pagerank) for a quick introduction to the algorithm.
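For intuition about what the parameters control, here is a minimal sketch of textbook PageRank by power iteration on a tiny made-up graph. It is not TigerGraph's implementation. The argument names `damping`, `max_iter`, and `max_change` mirror the query parameters used in this notebook, but reading `max_change` as "stop once the largest per-node score change falls below it" is an assumption. The sketch also normalizes scores to sum to 1, so its values are not directly comparable to the scores returned by `tg_pagerank`.

```python
def pagerank(edges, damping=0.85, max_iter=25, max_change=0.001):
    """Textbook PageRank by power iteration on a directed edge list."""
    nodes = sorted({v for edge in edges for v in edge})
    out_links = {v: [] for v in nodes}
    for src, dst in edges:
        out_links[src].append(dst)

    n = len(nodes)
    scores = {v: 1.0 / n for v in nodes}
    for _ in range(max_iter):
        # Every node receives the teleport share (1 - damping) / n.
        new_scores = {v: (1.0 - damping) / n for v in nodes}
        for src in nodes:
            # A node without outgoing edges spreads its score over all nodes.
            targets = out_links[src] or nodes
            share = damping * scores[src] / len(targets)
            for dst in targets:
                new_scores[dst] += share
        change = max(abs(new_scores[v] - scores[v]) for v in nodes)
        scores = new_scores
        if change < max_change:
            break
    return scores


# Tiny hypothetical citation graph: each pair is (citing paper, cited paper).
toy_edges = [("a", "c"), ("b", "c"), ("c", "d"), ("d", "c")]
toy_scores = pagerank(toy_edges)
for paper, score in sorted(toy_scores.items(), key=lambda kv: -kv[1]):
    print(paper, round(score, 4))
```

In this toy graph, paper `c` is cited by three others and ends up with the highest score, while `a` and `b`, which nobody cites, keep only the teleport share.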

In [13]:
# Run the algorithm with parameters
params = {
    'v_type': 'Paper', 
    'e_type': 'Cite', 
    'max_change': 0.001, 
    'max_iter': 25, 
    'damping': 0.85,
    'top_k': 10, 
    'print_accum': True, 
    'result_attr': '', 
    'file_path': '', 
    'display_edges': False}

f.runAlgorithm(
    'tg_pagerank', 
    params=params,
    timeout=2147480, 
    sizeLimit=2000000
)

Installing and optimizing the queries, it might take a minute...
Queries installed successfully


[{'@@top_scores_heap': [{'Vertex_ID': '1358', 'score': 33.06401},
   {'Vertex_ID': '1701', 'score': 16.8922},
   {'Vertex_ID': '1986', 'score': 14.46646},
   {'Vertex_ID': '306', 'score': 13.72521},
   {'Vertex_ID': '1810', 'score': 9.81972},
   {'Vertex_ID': '2034', 'score': 8.61616},
   {'Vertex_ID': '1623', 'score': 7.57608},
   {'Vertex_ID': '88', 'score': 7.24722},
   {'Vertex_ID': '598', 'score': 7.13392},
   {'Vertex_ID': '1013', 'score': 6.85707}]}]

### User Defined Algorithm

The featurizer can also be used to install and run user-defined queries. The query needs to be saved in a local file. Below is a toy example of running a user-defined query.

In [14]:
user_defined_query1 = \
'''CREATE QUERY user_defined_query1() FOR GRAPH Cora { 
  PRINT "user_defined_query1 works!"; 
}'''

with open("./user_defined_query1.gsql", "w") as outfile:
    outfile.write(user_defined_query1)

In [15]:
f.installAlgorithm(query_name="user_defined_query1", 
    query_path="./user_defined_query1.gsql" )

Installing and optimizing the queries, it might take a minute...
Queries installed successfully


'user_defined_query1'

In [16]:
f.runAlgorithm(query_name="user_defined_query1", custom_query=True)

[{'"user_defined_query1 works!"': 'user_defined_query1 works!'}]

## Data Split

For machine learning tasks, it is common to partition the data into train/validation/test subsets. `pyTigerGraph` provides functions to split either vertices or edges randomly.

### Random Vertex Split

The `VertexSplitter` splits vertices into at most 3 parts randomly. The split results are stored in the provided boolean vertex attributes. Each attribute indicates which part a vertex belongs to. For example, if you want to split the vertices into 80% train, 10% validation and 10% test, you can pass `train_mask=0.8, val_mask=0.1, test_mask=0.1` as arguments to the splitter. The 3 attributes `train_mask`, `val_mask`, `test_mask` have to exist in the graph. About 80% of vertices will be set to `train_mask=True`, 10% to `val_mask=True`, and 10% to `test_mask=True` at random. There will be no overlap between the parts.
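To make the idea concrete, here is a minimal pure-Python sketch of a random split into non-overlapping boolean masks. It is not pyTigerGraph's implementation: it assumes one uniform random draw per item, which is one plausible way to obtain disjoint parts whose sizes only approximately match the requested ratios. The helper `random_split` and its `seed` argument are hypothetical. The same logic applies to edge ids.

```python
import random


def random_split(ids, seed=None, **ratios):
    """Assign each id to at most one named part using one random draw per id."""
    if sum(ratios.values()) > 1.0 + 1e-9:
        raise ValueError("split ratios must sum to at most 1")
    rng = random.Random(seed)
    masks = {}
    for item in ids:
        draw = rng.random()  # uniform in [0, 1)
        row = dict.fromkeys(ratios, False)
        upper = 0.0
        for name, ratio in ratios.items():
            upper += ratio
            if draw < upper:
                # The draw fell into this part's interval; stop here so
                # the item cannot be assigned to a second part.
                row[name] = True
                break
        masks[item] = row
    return masks


# 2708 matches the number of Paper vertices in Cora.
masks = random_split(range(2708), seed=0,
                     train_mask=0.8, val_mask=0.1, test_mask=0.1)
for name in ("train_mask", "val_mask", "test_mask"):
    fraction = sum(row[name] for row in masks.values()) / len(masks)
    print(name, round(fraction, 3))
```

Because each item is drawn independently, the observed fractions fluctuate around the requested ratios rather than matching them exactly.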

In [17]:
# Initialize the splitter
split = conn.gds.vertexSplitter(train_mask=0.8, val_mask=0.1, test_mask=0.1)

Installing and optimizing queries. It might take a minute if this is the first time you use this loader.
Query installation finished.


In [18]:
# Execute a split
split.run()

Splitting vertices...
Vertex split finished successfully.


Now the split is done. Load all vertices and check if the split is correct. 

In [19]:
vertices = conn.getVertexDataFrame("Paper", select="train_mask,val_mask,test_mask")

In [20]:
for attr in ["train_mask", "val_mask", "test_mask"]:
    print("Fraction of vertices with {}=True: {}".format(
        attr, vertices[attr].sum()/len(vertices)))

Fraction of vertices with train_mask=True: 0.8068685376661743
Fraction of vertices with val_mask=True: 0.09748892171344166
Fraction of vertices with test_mask=True: 0.09564254062038405
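The fractions above show the size of each part but not whether the parts overlap. A minimal sketch of such a check follows. It works on plain dictionaries, the rows are made up, and `count_memberships` is a hypothetical helper. With the DataFrame loaded above, `vertices.to_dict("records")` should yield rows of a compatible shape.

```python
from collections import Counter

MASKS = ("train_mask", "val_mask", "test_mask")


def count_memberships(rows, masks=MASKS):
    """Count how many rows belong to 0, 1, 2, ... parts (hypothetical helper)."""
    return Counter(sum(bool(row[name]) for name in masks) for row in rows)


# Made-up rows standing in for vertices loaded from the database.
sample_rows = [
    {"train_mask": True, "val_mask": False, "test_mask": False},
    {"train_mask": False, "val_mask": True, "test_mask": False},
    {"train_mask": False, "val_mask": False, "test_mask": True},
    {"train_mask": True, "val_mask": False, "test_mask": False},
]

membership = count_memberships(sample_rows)
# No overlap means that no row belongs to more than one part.
has_overlap = any(parts > 1 for parts in membership)
print(dict(membership), "overlap:", has_overlap)
```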


It is also possible to split only vertices of certain types, which is useful for heterogeneous graphs. Although Cora is a homogeneous graph, the example below shows how to specify vertex types in general.

In [21]:
# v_types takes a list of vertex types
split = conn.gds.vertexSplitter(
    v_types=["Paper"], 
    train_mask=0.8, val_mask=0.1, test_mask=0.1
)
split.run()

Splitting vertices...
Vertex split finished successfully.


### Random Edge Split

The `EdgeSplitter` splits edges into at most 3 parts randomly. The split results are stored in the provided boolean edge attributes. Each attribute indicates which part an edge belongs to. For example, if you want to split the edges into 80% train and 20% validation, you can pass `is_train=0.8, is_val=0.2` as arguments to the splitter. The 2 attributes `is_train`, `is_val` have to exist in the graph. About 80% of edges will be set to `is_train=True` and 20% to `is_val=True` at random. There will be no overlap between the parts.

In [22]:
# Initialize the splitter
splitter = conn.gds.edgeSplitter(is_train=0.8, is_val=0.2)

Installing and optimizing queries. It might take a minute if this is the first time you use this loader.
Query installation finished.


In [23]:
# Execute the split
splitter.run()

Splitting edges...
Edge split finished successfully.


Now the split is done. Load all edges and check if the split is correct.

In [24]:
edges = conn.getEdgesByType("Cite", fmt="df")

In [25]:
for attr in ["is_train", "is_val"]:
    print("Fraction of edges with {}=True: {}".format(
        attr, edges[attr].sum()/len(edges)))

Fraction of edges with is_train=True: 0.8014399393709739
Fraction of edges with is_val=True: 0.19856006062902615


It is also possible to split only edges of certain types, which is useful for heterogeneous graphs. Although Cora is a homogeneous graph, the example below shows how to specify edge types in general.

In [26]:
# e_types takes a list of edge types
split = conn.gds.edgeSplitter(
    e_types=["Cite"], 
    is_train=0.8, is_val=0.2
)
split.run()

Splitting edges...
Edge split finished successfully.
