# Feature Engineering

This notebook demonstrates how to use `pyTigerGraph` for common data processing tasks on graphs stored in `TigerGraph`.

## Connection to Database

The `TigerGraphConnection` class represents a connection to the TigerGraph database. It stores the information needed to communicate with the database, such as the host address, graph name, and credentials. Please see its documentation for details.

In [1]:
from pyTigerGraph import TigerGraphConnection

In [2]:
conn = TigerGraphConnection(
    host="http://127.0.0.1", # Change the address to your database server's
    graphname="Cora",
    username="tigergraph",
    password="tigergraph",
    useCert=False
)

Initializing the graph. It might take a minute if this is the first time you run it.


In [3]:
# Graph schema and other information.
print(conn.gsql("ls"))

Using graph 'Cora'
---- Graph Cora
Vertex Types: 
  - VERTEX Paper(PRIMARY_ID id INT, x LIST<INT>, y INT, train_mask BOOL, val_mask BOOL, test_mask BOOL, tmp_id INT, tmp_id2 INT, tmp_id3 INT, train BOOL, val BOOL, test BOOL, bool_attr BOOL DEFAULT "false") WITH STATS="OUTDEGREE_BY_EDGETYPE", PRIMARY_ID_AS_ATTRIBUTE="true"
Edge Types: 
  - DIRECTED EDGE Cite(FROM Paper, TO Paper)

Graphs: 
  - Graph Cora(Paper:v, Cite:e)
Jobs: 
Queries: 
  - export_edge(string output_path) (installed v2)
  - export_vertex_train_val_test(string output_path) (installed v2)
  - export_vertex_x_y_train_mask_val_mask_test_mask(string output_path) (installed v2)
  - get_vertex_number(string v_type, string filter_by) (installed v2)
  - shuffle_vertices(string tmp_id) (installed v2)
  - tg_neighbor_sampler_x(set<vertex> input_vertices, string vertex_filename, string edge_filename, int batch_id, int num_batches, int num_neighbors, int num_hops, string filter_by, string tmp_id) (installed v2)
  - tg_neighbor_samp

In [4]:
# Number of vertices for every vertex type
conn.getVertexCount('*')

2708

In [5]:
# Number of vertices of a specific type
conn.getVertexCount("Paper")

2708

In [6]:
# Number of edges for every type
conn.getEdgeCount()

10556

In [7]:
# Number of edges of a specific type
conn.getEdgeCount("Cite")

10556

## Feature Engineering

We are adding graph algorithms to the workbench to support feature engineering tasks. We are currently experimenting with PageRank, and a preview is given below. The interface might change when the full feature engineering functionality is released.

### PageRank
PageRank is the algorithm that originally powered Google's search engine, where it ranked influential webpages higher than less influential ones. A page's influence was measured by its PageRank score, which is based on the importance of the pages that link to it. More generally, PageRank scores each vertex in a graph according to how influential the vertices with edges pointing to it are, so the most influential vertices receive the highest scores.

The documentation of the PageRank query is found here: https://docs.tigergraph.com/graph-algorithm-library/centrality/pagerank.
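To make the idea concrete, here is a minimal pure-Python sketch of the textbook power-iteration form of PageRank. It is not the TigerGraph query itself: the function name `pagerank`, the edge-list input, the normalization of scores to sum to 1, and the uniform redistribution of score from vertices without outgoing edges are all assumptions made for illustration. The parameter names `damping`, `max_change`, and `max_iter` mirror those used later in this notebook, but their exact semantics in the TigerGraph implementation may differ.

```python
def pagerank(edges, num_vertices, damping=0.85, max_change=0.001, max_iter=25):
    """Textbook PageRank by power iteration (illustrative sketch only).

    edges: list of (source, target) pairs with vertex ids in [0, num_vertices).
    Returns a list of scores that sums to 1.
    """
    # Count outgoing edges so each vertex can split its score evenly.
    out_degree = [0] * num_vertices
    for src, _dst in edges:
        out_degree[src] += 1

    # Start from a uniform distribution.
    scores = [1.0 / num_vertices] * num_vertices

    for _ in range(max_iter):
        # Assumption: score held by vertices with no outgoing edges
        # is spread uniformly over all vertices.
        dangling = sum(s for s, d in zip(scores, out_degree) if d == 0)
        base = (1.0 - damping) / num_vertices + damping * dangling / num_vertices

        new_scores = [base] * num_vertices
        for src, dst in edges:
            # Each vertex passes an equal share of its score along every out-edge.
            new_scores[dst] += damping * scores[src] / out_degree[src]

        # Stop early once no score moved by max_change or more.
        diff = max(abs(n - o) for n, o in zip(new_scores, scores))
        scores = new_scores
        if diff < max_change:
            break

    return scores


# Toy citation graph: papers 1, 2, and 3 all cite paper 0.
edges = [(1, 0), (2, 0), (3, 0)]
scores = pagerank(edges, num_vertices=4)
print(scores)
```

In this toy graph, paper 0 is the only one that receives citations, so it ends up with the highest score, while the three citing papers share the same lower score.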

In [15]:
from tgml.featurizer import PageRank

In [16]:
feat = PageRank(
    tgraph,
    result_attr="pagerank",  # Name of the attribute that will store the results.
    # Timeout in milliseconds for running the algorithm. It does not apply to
    # initialization steps such as query installation and schema changes.
    timeout=300000,
)

In [17]:
feat.run(max_change=0.001, max_iter=25, damping=0.85)

'PageRank scores saved to attribute pagerank'
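
Once the scores are stored, they can be read back like any other vertex attribute. The helper below is a hedged sketch, not part of the workbench: the function name `top_vertices_by_attr` is hypothetical, and it assumes that `conn.getVertices` from `pyTigerGraph` returns a list of dictionaries with `v_id` and `attributes` keys (its default Python output format) and that the `pagerank` attribute is visible through the same connection.

```python
def top_vertices_by_attr(conn, vertex_type, attr, k=5):
    """Return the k vertices with the largest value of `attr` (hypothetical helper).

    conn: an object exposing pyTigerGraph's getVertices(vertexType, select=...).
    Assumes each returned vertex looks like
    {"v_id": ..., "v_type": ..., "attributes": {attr: value}}.
    """
    # Fetch only the attribute of interest for every vertex of this type.
    vertices = conn.getVertices(vertex_type, select=attr)
    # Sort client-side, highest value first.
    ranked = sorted(vertices, key=lambda v: v["attributes"][attr], reverse=True)
    return [(v["v_id"], v["attributes"][attr]) for v in ranked[:k]]


# Example usage with the connection created earlier (requires a running database):
# top_vertices_by_attr(conn, "Paper", "pagerank", k=5)
```

Sorting on the client keeps the sketch simple; for large graphs it may be preferable to let the database sort and limit the results instead.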