# Feature Engineering

This notebook demonstrates how to use `pyTigerGraph` for feature engineering and other common data processing tasks on graphs stored in `TigerGraph`.

## Connection to Database

The `TigerGraphConnection` class represents a connection to the TigerGraph database. It stores the information needed to communicate with the database and exposes methods for many common database tasks. See its [documentation](https://docs.tigergraph.com/pytigergraph/current/intro/) for details.

To connect to your database, modify the `config.json` file accompanying this notebook. Set the value of `getToken` based on whether token authentication is enabled for your database. Token authentication is always enabled for tgcloud databases.
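
The cells in this notebook read four keys from `config.json`: `host`, `username`, `password`, and `getToken`. As a minimal sketch, the snippet below checks a configuration for those keys before you try to connect. The placeholder values and the helper name `missing_keys` are assumptions made for illustration; they are not part of pyTigerGraph.

```python
import json

# Keys that the cells in this notebook read from config.json.
REQUIRED_KEYS = ("host", "username", "password", "getToken")

# Placeholder values for illustration only; substitute your own settings.
sample_config_text = """
{
    "host": "http://localhost",
    "username": "your_username",
    "password": "your_password",
    "getToken": false
}
"""


def missing_keys(config):
    """Return the required keys that are absent from the config dict."""
    return [key for key in REQUIRED_KEYS if key not in config]


config_example = json.loads(sample_config_text)
print("Missing keys:", missing_keys(config_example))
```

An empty list means every key used later in the notebook is present.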

In [1]:
from pyTigerGraph import TigerGraphConnection
import json

# Read in DB configs
with open('../config.json', "r") as config_file:
    config = json.load(config_file)
    
conn = TigerGraphConnection(
    host=config["host"],
    username=config["username"],
    password=config["password"]
)

### Ingest Data

In [2]:
from pyTigerGraph.datasets import Datasets

dataset = Datasets("Cora")

Downloading:   0%|          | 0/166537 [00:00<?, ?it/s]

In [3]:
conn.ingestDataset(dataset, getToken=config["getToken"])

---- Checking database ----
A graph with name Cora already exists in the database. Please drop it first before ingesting.


### Visualize Schema

In [4]:
from pyTigerGraph.visualization import drawSchema

drawSchema(conn.getSchema(force=True))

CytoscapeWidget(cytoscape_layout={'name': 'circle', 'animate': True, 'padding': 1}, cytoscape_style=[{'selecto…

### Basic Statistics

In [None]:
# Check graph schema and other information.
print(conn.gsql("ls"))

In [None]:
# Number of vertices for every vertex type
conn.getVertexCount('*')

In [None]:
# Number of vertices of a specific type
conn.getVertexCount("Paper")

In [None]:
# Number of edges for every type
conn.getEdgeCount()

In [None]:
# Number of edges of a specific type
conn.getEdgeCount("Cite")

## Feature Engineering

The `featurizer` in pyTigerGraph includes many graph algorithms for feature engineering. This notebook demonstrates a few of its key functions. For examples of each algorithm, see the algos directory.

The key functions are:
1. `listAlgorithms()`: Given a category of algorithms (e.g. Centrality), prints the algorithms available in that category; otherwise prints all available algorithms.
2. `installAlgorithm()`: Takes the name of an algorithm and installs it if it is not already installed.
3. `runAlgorithm()`: Takes the algorithm name, the schema type (vertex or edge; vertex by default), an attribute name (if the result should be stored as an attribute in the database), and a list of schema type names (the vertex or edge types on which the attribute should be saved; all of them by default).
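
The install and run steps are usually called back to back, so they can be wrapped in one small helper. This is only a sketch: `install_and_run` is a hypothetical convenience function, not part of pyTigerGraph, and it assumes `featurizer` is the object returned by `conn.gds.featurizer()`, called the same way as in the cells of this notebook.

```python
def install_and_run(featurizer, query_name, params=None):
    """Install an algorithm if needed, then run it and return the result.

    `featurizer` is assumed to be the object returned by
    conn.gds.featurizer(); `params` is passed through unchanged.
    """
    # installAlgorithm skips the installation when the query already exists.
    featurizer.installAlgorithm(query_name)
    return featurizer.runAlgorithm(query_name, params=params)
```

With a live connection, a call would look like `install_and_run(f, "tg_pagerank", params)`, where `f` and `params` are defined as in the cells of this notebook.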

In [None]:
f = conn.gds.featurizer()

In [None]:
f.listAlgorithms()

In [None]:
f.listAlgorithms("Centrality")

### Built-in graph algorithms

Below we will show how to run the built-in PageRank algorithm. See this [doc](https://docs.tigergraph.com/graph-ml/current/centrality-algorithms/pagerank) for a quick introduction to the algorithm.

In [None]:
# Install the algorithm by name
f.installAlgorithm("tg_pagerank")

In [None]:
# Run the algorithm with parameters
params = {'v_type': 'Paper', 'e_type': 'Cite', 'max_change': 0.001, 'maximum_iteration': 25, 'damping': 0.85,
          'top_k': 10, 'print_results': True, 'result_attribute': '', 'file_path': '', 'display_edges': False}

f.runAlgorithm(
    'tg_pagerank', 
    params=params,
    timeout=2147480, 
    sizeLimit=2000000
)
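
To give a feel for what `damping`, `max_change`, and `maximum_iteration` control, here is a minimal pure-Python sketch of iterative PageRank on a toy graph. It is an illustration under simplifying assumptions (every vertex appears as a key in the input, and score held by vertices without outgoing edges is simply dropped), not the GSQL implementation of `tg_pagerank`. The function name `pagerank` and the toy graph are made up for this example.

```python
def pagerank(out_edges, damping=0.85, max_change=0.001, maximum_iteration=25):
    """Iterate PageRank scores until they stabilize or the iteration cap is hit.

    out_edges maps each vertex to the list of vertices it links to.
    Every target must also appear as a key of out_edges.
    """
    vertices = list(out_edges)
    scores = {v: 1.0 for v in vertices}
    for _ in range(maximum_iteration):
        # Each vertex passes an equal share of its score along every outgoing edge.
        received = {v: 0.0 for v in vertices}
        for source, targets in out_edges.items():
            if targets:
                share = scores[source] / len(targets)
                for target in targets:
                    received[target] += share
        new_scores = {
            v: (1.0 - damping) + damping * received[v] for v in vertices
        }
        largest_change = max(abs(new_scores[v] - scores[v]) for v in vertices)
        scores = new_scores
        # Stop early once no score moved by more than max_change.
        if largest_change <= max_change:
            break
    return scores


# Toy citation graph: A cites B and C, B cites C, C cites A.
toy_graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
toy_scores = pagerank(toy_graph)
for vertex, score in sorted(toy_scores.items(), key=lambda item: -item[1]):
    print(vertex, round(score, 3))
```

A smaller `max_change` or a larger `maximum_iteration` trades run time for scores that are closer to the converged values.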


### User Defined Algorithm

The featurizer can also be used to install and run user-defined queries. The query needs to be saved in a local file. Below is a toy example of running a user-defined query.

In [None]:
user_defined_query1 = '''CREATE QUERY user_defined_query1() FOR GRAPH Cora { 
  PRINT "user_defined_query1 works!"; 
}'''

with open("./user_defined_query1.gsql", "w") as outfile:
    outfile.write(user_defined_query1)

In [None]:
f.installAlgorithm(query_name="user_defined_query1", query_path="./user_defined_query1.gsql" )

In [None]:
f.runAlgorithm(query_name="user_defined_query1", custom_query=True)

## Data Split

For machine learning tasks, it is common to partition the data into train/validation/test subsets. `pyTigerGraph` provides functions to split either vertices or edges randomly.

### Random Vertex Split

The `VertexSplitter` splits vertices into at most 3 parts randomly. The split results are stored in the provided boolean vertex attributes, each of which indicates whether a vertex belongs to the corresponding part. For example, to split the vertices into 80% train, 10% validation and 10% test, pass `train_mask=0.8, val_mask=0.1, test_mask=0.1` to the splitter. The 3 attributes `train_mask`, `val_mask`, and `test_mask` have to exist in the graph. At random, 80% of vertices will be set to `train_mask=True`, 10% to `val_mask=True`, and 10% to `test_mask=True`. There will be no overlap between the parts.
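
The idea of a random, non-overlapping split can be sketched in plain Python. The snippet below is one possible way to do it (a single random draw per vertex), shown only to illustrate the concept; it is not necessarily how the splitter works inside the database, and `random_split` is a hypothetical name. Because each vertex is drawn independently, the resulting fractions are approximate.

```python
import random


def random_split(vertex_ids, fractions, seed=None):
    """Assign each vertex to at most one part, chosen at random.

    fractions maps an attribute name to the probability of landing in
    that part, e.g. {"train_mask": 0.8, "val_mask": 0.1, "test_mask": 0.1}.
    Returns {vertex_id: {attribute_name: bool}}.
    """
    if sum(fractions.values()) > 1.0 + 1e-9:
        raise ValueError("fractions must sum to at most 1")
    rng = random.Random(seed)
    masks = {}
    for vertex_id in vertex_ids:
        draw = rng.random()
        flags = {name: False for name in fractions}
        upper = 0.0
        # Walk the cumulative ranges; the first range containing the draw wins.
        for name, fraction in fractions.items():
            upper += fraction
            if draw < upper:
                flags[name] = True
                break
        masks[vertex_id] = flags
    return masks


example_masks = random_split(
    range(1000),
    {"train_mask": 0.8, "val_mask": 0.1, "test_mask": 0.1},
    seed=42,
)
for name in ("train_mask", "val_mask", "test_mask"):
    share = sum(flags[name] for flags in example_masks.values()) / len(example_masks)
    print(name, share)
```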

In [None]:
# Initialize the splitter
split = conn.gds.vertexSplitter(train_mask=0.8, val_mask=0.1, test_mask=0.1)

In [None]:
# Execute a split
split.run()

Now the split is done. Load all vertices and check if the split is correct. 

In [None]:
vertices = conn.getVertexDataFrame("Paper", select="train_mask,val_mask,test_mask")

In [None]:
for attr in ["train_mask", "val_mask", "test_mask"]:
    print("Fraction of vertices with {}=True: {}".format(
        attr, vertices[attr].sum()/len(vertices)))

It is also possible to split only vertices of certain types, which is useful for heterogeneous graphs. Although Cora is a homogeneous graph, the example below shows how to specify vertex types in general.

In [None]:
# v_types takes a list of vertex types
split = conn.gds.vertexSplitter(
    v_types=["Paper"], 
    train_mask=0.8, val_mask=0.1, test_mask=0.1
)
split.run()

### Random Edge Split

The `EdgeSplitter` splits edges into at most 3 parts randomly. The split results are stored in the provided boolean edge attributes, each of which indicates whether an edge belongs to the corresponding part. For example, to split the edges into 80% train and 20% validation, pass `is_train=0.8, is_val=0.2` to the splitter. The 2 attributes `is_train` and `is_val` have to exist in the graph. At random, 80% of edges will be set to `is_train=True` and 20% to `is_val=True`. There will be no overlap between the parts.

In [None]:
# Initialize the splitter
splitter = conn.gds.edgeSplitter(is_train=0.8, is_val=0.2)

In [None]:
# Execute the split
splitter.run()

Now the split is done. Load all edges and check if the split is correct.

In [None]:
edges = conn.getEdgesByType("Cite", fmt="df")

In [None]:
for attr in ["is_train", "is_val"]:
    print("Fraction of edges with {}=True: {}".format(
        attr, edges[attr].sum()/len(edges)))
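
The fractions alone do not show that the parts are disjoint. The sketch below also counts records flagged by more than one attribute. It works on plain dictionaries so it runs without a database; `summarize_split` and the toy records are assumptions made for this example.

```python
def summarize_split(records, attributes):
    """Return the fraction of records flagged by each attribute,
    plus the number of records flagged by more than one attribute."""
    total = len(records)
    if total == 0:
        raise ValueError("records must not be empty")
    fractions = {
        attr: sum(1 for rec in records if rec[attr]) / total
        for attr in attributes
    }
    overlapping = sum(
        1
        for rec in records
        if sum(bool(rec[attr]) for attr in attributes) > 1
    )
    return fractions, overlapping


# Toy edge records standing in for rows loaded from the graph.
toy_edges = [
    {"is_train": True, "is_val": False},
    {"is_train": True, "is_val": False},
    {"is_train": True, "is_val": False},
    {"is_train": False, "is_val": True},
]
split_fractions, overlap_count = summarize_split(toy_edges, ["is_train", "is_val"])
print(split_fractions, overlap_count)
```

For a pandas DataFrame such as `edges`, `edges.to_dict("records")` produces a list in this shape. An overlap count of zero confirms that no edge landed in two parts.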

It is also possible to split only edges of certain types, which is useful for heterogeneous graphs. Although Cora is a homogeneous graph, the example below shows how to specify edge types in general.

In [None]:
# e_types takes a list of edge types
split = conn.gds.edgeSplitter(
    e_types=["Cite"], 
    is_train=0.8, is_val=0.2
)
split.run()