# Featurizer #
This notebook demonstrates how to use `pyTigerGraph` for common data processing and feature engineering tasks on graphs stored in `TigerGraph`.

## Connection to Database ##
The `TigerGraphConnection` class represents a connection to the TigerGraph database. Under the hood, it stores the necessary information to communicate with the database. Please see its [documentation](https://docs.tigergraph.com/pytigergraph/current/intro/) for details.

In [1]:
from pyTigerGraph import TigerGraphConnection

In [2]:
conn = TigerGraphConnection(
    host="http://127.0.0.1", # Change the address to your database server's
    graphname="Cora",
    username="tigergraph",
    password="tigergraph",
    useCert=False
)

In [3]:
# Graph schema and other information.
print(conn.gsql("ls"))

---- Graph Cora
Vertex Types:
- VERTEX Paper(PRIMARY_ID id INT, x LIST<INT>, y INT, train_mask BOOL, val_mask BOOL, test_mask BOOL, fastrp_embedding LIST<DOUBLE>) WITH STATS="OUTDEGREE_BY_EDGETYPE", PRIMARY_ID_AS_ATTRIBUTE="true"
Edge Types:
- DIRECTED EDGE Cite(FROM Paper, TO Paper)

Graphs:
- Graph Cora(Paper:v, Cite:e)
Jobs:
Queries:






In [4]:
# Number of vertices for every vertex type
conn.getVertexCount('*')

{'Paper': 2708}

In [5]:
# Number of vertices of a specific type
conn.getVertexCount("Paper")

2708

In [6]:
# Number of edges for every type
conn.getEdgeCount()

{'Cite': 10556}

In [7]:
# Number of edges of a specific type
conn.getEdgeCount("Cite")

10556

## Feature Engineering ##
The ML Workbench includes a number of graph algorithms for feature engineering. The key functions are:

1. `listAlgorithms()`: Given a category of algorithms (e.g. Centrality), prints the algorithms available in that category; otherwise prints all available algorithms.
2. `installAlgorithm()`: Takes the name of an algorithm and installs it if it is not already installed.
3. `runAlgorithm()`: Takes the algorithm name, the schema type (vertex or edge; vertex by default), an attribute name (if the result should be stored as an attribute in the database), and a list of schema type names (the vertex or edge types on which to save the attribute; all of them by default).
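
The install and run steps compose naturally into one helper. The sketch below is only an illustration: `install_and_run` and `_StubFeaturizer` are hypothetical names introduced here, and the stub exists solely so the snippet runs without a database. With a live connection you would pass the object returned by `conn.gds.featurizer()` instead of the stub.

```python
from typing import Any, Dict, Optional


def install_and_run(featurizer: Any, name: str, params: Dict[str, Any],
                    feat_name: Optional[str] = None) -> Any:
    """Install an algorithm, then run it with the given parameters."""
    featurizer.installAlgorithm(name)
    if feat_name is None:
        return featurizer.runAlgorithm(name, params=params)
    return featurizer.runAlgorithm(name, params=params, feat_name=feat_name)


class _StubFeaturizer:
    """Hypothetical stand-in that records calls; NOT part of pyTigerGraph."""

    def __init__(self) -> None:
        self.calls = []

    def installAlgorithm(self, query_name):
        self.calls.append(("install", query_name))
        return query_name

    def runAlgorithm(self, query_name, params=None, feat_name=None):
        self.calls.append(("run", query_name, feat_name))
        return []


stub = _StubFeaturizer()
result = install_and_run(stub, "tg_pagerank",
                         {"v_type": "Paper", "e_type": "Cite"},
                         feat_name="pagerank")
```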

In [8]:
f = conn.gds.featurizer()

In [9]:
f.listAlgorithms()

The list of the categories for available algorithms in the GDS (https://github.com/tigergraph/gsql-graph-algorithms):
Centrality: 
 pagerank: 
  global: 
   weigthed: 
    https://github.com/tigergraph/gsql-graph-algorithms/blob/master/algorithms/Centrality/pagerank/global/weighted/tg_pagerank_wt.gsql. 
   unweighted: 
    https://github.com/tigergraph/gsql-graph-algorithms/blob/master/algorithms/Centrality/pagerank/global/unweighted/tg_pagerank.gsql. 
 article_rank: 
  https://github.com/tigergraph/gsql-graph-algorithms/blob/master/algorithms/Centrality/article_rank/tg_article_rank.gsql. 
 Betweenness: 
  https://github.com/tigergraph/gsql-graph-algorithms/blob/master/algorithms/Centrality/betweenness/tg_betweenness_cent.gsql. 
 closeness: 
  approximate: 
   https://github.com/tigergraph/gsql-graph-algorithms/blob/master/algorithms/Centrality/closeness/approximate/tg_closeness_cent_approx.gsql. 
  exact: 
   https://github.com/tigergraph/gsql-graph-algorithms/blob/master/algorithms/C

## Examples of running graph algorithms from GDS library ##
In the following, one example of each class of algorithms is provided. Some algorithms generate a feature per vertex or edge; others compute a single number or summary statistic about the graph. For example, the common neighbors algorithm calculates the number of common neighbors between two vertices.

## Get PageRank as a feature ##
PageRank is available in the GDS library as `tg_pagerank`, under the Centrality class of algorithms ([source](https://github.com/tigergraph/gsql-graph-algorithms/blob/master/algorithms/Centrality/pagerank/global/unweighted/tg_pagerank.gsql)).
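
The query in this section is called with `damping`, `max_iter`, and `max_change` parameters. To show what such parameters typically control, here is a minimal pure-Python sketch of one common unnormalized PageRank iteration on a toy graph. It is a textbook-style illustration, not the GSQL implementation, and the function name `pagerank` and the toy data are assumptions made for this example.

```python
from typing import Dict, List


def pagerank(out_edges: Dict[str, List[str]], damping: float = 0.85,
             max_iter: int = 25, max_change: float = 0.001) -> Dict[str, float]:
    """Iterate score(v) = (1 - damping) + damping * sum(score(u) / outdeg(u)).

    Every edge target must also appear as a key of out_edges.
    """
    scores = {v: 1.0 for v in out_edges}
    for _ in range(max_iter):
        received = {v: 0.0 for v in out_edges}
        for u, targets in out_edges.items():
            for t in targets:
                received[t] += scores[u] / len(targets)
        new_scores = {v: (1.0 - damping) + damping * received[v]
                      for v in out_edges}
        biggest_change = max(abs(new_scores[v] - scores[v]) for v in out_edges)
        scores = new_scores
        if biggest_change <= max_change:
            break  # converged before reaching max_iter
    return scores


# Toy citation graph: A and B both cite C; C cites nothing.
toy = {"A": ["C"], "B": ["C"], "C": []}
scores = pagerank(toy)
```

In this toy graph the cited paper `C` ends up with a higher score than `A` and `B`, which receive no citations.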

In [10]:
f.installAlgorithm("tg_pagerank")

Installing and optimizing the queries, it might take a minute


'tg_pagerank'

In [11]:
params = {'v_type':'Paper','e_type':'Cite','max_change':0.001, 'max_iter': 25, 'damping': 0.85, 
         'top_k': 10, 'print_accum': True, 'result_attr':'','file_path':'','display_edges': False}

f.runAlgorithm('tg_pagerank',params=params,feat_name="pagerank",timeout=2147480,sizeLimit = 2000000)

Global schema change succeeded.


[{'@@top_scores_heap': [{'Vertex_ID': '1358', 'score': 33.06401},
   {'Vertex_ID': '1701', 'score': 16.8922},
   {'Vertex_ID': '1986', 'score': 14.46646},
   {'Vertex_ID': '306', 'score': 13.72521},
   {'Vertex_ID': '1810', 'score': 9.81973},
   {'Vertex_ID': '2034', 'score': 8.61615},
   {'Vertex_ID': '1623', 'score': 7.57608},
   {'Vertex_ID': '88', 'score': 7.24722},
   {'Vertex_ID': '598', 'score': 7.13392},
   {'Vertex_ID': '1013', 'score': 6.85707}]}]
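
The returned structure is a list containing one dictionary keyed by `@@top_scores_heap`. A small sketch of flattening it into a plain mapping follows; `top_scores_to_dict` is a hypothetical helper name, and the sample holds only the first three rows copied from the output above.

```python
from typing import Any, Dict, List


def top_scores_to_dict(result: List[Dict[str, Any]],
                       key: str = "@@top_scores_heap") -> Dict[str, float]:
    """Map Vertex_ID -> score for a result shaped like the output above."""
    return {row["Vertex_ID"]: row["score"] for row in result[0][key]}


# First three rows copied from the output above.
sample = [{"@@top_scores_heap": [
    {"Vertex_ID": "1358", "score": 33.06401},
    {"Vertex_ID": "1701", "score": 16.8922},
    {"Vertex_ID": "1986", "score": 14.46646},
]}]

top = top_scores_to_dict(sample)
best_vertex = max(top, key=top.get)  # vertex with the highest score
```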

## Run Maximal Independent Set ##
The Maximal Independent Set algorithm is available in the GDS library as `tg_maximal_indep_set`, under the Classification class of algorithms ([source](https://github.com/tigergraph/gsql-graph-algorithms/blob/master/algorithms/Classification/maximal_independent_set/deterministic/tg_maximal_indep_set.gsql)).

In [12]:
f.installAlgorithm("tg_maximal_indep_set")

Installing and optimizing the queries, it might take a minute


'tg_maximal_indep_set'

In [13]:
params = {'v_type': 'Paper', 'e_type': 'Cite','max_iter': 100,'print_accum': False,'file_path':''}

f.runAlgorithm('tg_maximal_indep_set',params=params)

[]
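
To make the term concrete: an independent set contains no two adjacent vertices, and it is maximal when no further vertex can be added without breaking that property. The sketch below is a simple greedy construction on a toy undirected graph. It is not the GSQL query's method, and both function names are hypothetical.

```python
from typing import Dict, Set


def greedy_maximal_independent_set(neighbors: Dict[str, Set[str]]) -> Set[str]:
    """Visit vertices in sorted order; keep one only if no neighbor was kept."""
    chosen: Set[str] = set()
    for v in sorted(neighbors):
        if not (neighbors[v] & chosen):
            chosen.add(v)
    return chosen


def is_maximal_independent(neighbors: Dict[str, Set[str]],
                           chosen: Set[str]) -> bool:
    """Check independence (no chosen pair adjacent) and maximality."""
    independent = all(not (neighbors[v] & chosen) for v in chosen)
    maximal = all(v in chosen or bool(neighbors[v] & chosen)
                  for v in neighbors)
    return independent and maximal


# Toy path graph a - b - c - d, stored as symmetric adjacency sets.
toy = {"a": {"b"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"c"}}
mis = greedy_maximal_independent_set(toy)
```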

## Get Louvain as a feature ##
The Louvain algorithm is available in the GDS library as `tg_louvain`, under the Community class of algorithms ([source](https://github.com/tigergraph/gsql-graph-algorithms/blob/master/algorithms/Community/louvain/tg_louvain.gsql)).

In [14]:
f.installAlgorithm(query_name='tg_louvain')

Installing and optimizing the queries, it might take a minute


'tg_louvain'

In [15]:
params = {'v_type': 'Paper', 'e_type':['Cite','reverse_Cite'],'wt_attr':"",'max_iter':10,'result_attr':"cid",'file_path' :"",'print_info':True}

f.runAlgorithm('tg_louvain',params,feat_name="cid")

Global schema change succeeded.


[{'AllVertexCount': 2708},
 {'InitChangeCount': 0},
 {'VertexFollowedToCommunity': 371},
 {'VertexFollowedToVertex': 114},
 {'VertexAssignedToItself': 0},
 {'FinalCommunityCount': 2280}]
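
The output is a list of single-key dictionaries, which is awkward to query directly. A short sketch of merging it into one dictionary and deriving the average community size follows; `merge_stats` is a hypothetical helper name, and the data is copied from the output above.

```python
from typing import Any, Dict, List


def merge_stats(result: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Merge a list of single-key dictionaries into one dictionary."""
    merged: Dict[str, Any] = {}
    for entry in result:
        merged.update(entry)
    return merged


# Copied from the output above.
louvain_output = [
    {"AllVertexCount": 2708},
    {"InitChangeCount": 0},
    {"VertexFollowedToCommunity": 371},
    {"VertexFollowedToVertex": 114},
    {"VertexAssignedToItself": 0},
    {"FinalCommunityCount": 2280},
]

stats = merge_stats(louvain_output)
# Average number of vertices per community, roughly 1.19 for this run.
avg_size = stats["AllVertexCount"] / stats["FinalCommunityCount"]
```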

## Get fastRP as a feature ##
The fastRP algorithm is available in the GDS library as `tg_fastRP`, under the GraphML embedding algorithms ([source](https://github.com/tigergraph/gsql-graph-algorithms/blob/master/algorithms/GraphML/Embeddings/FastRP/tg_fastRP.gsql)).

In [16]:
f.installAlgorithm("tg_fastRP")

Installing and optimizing the queries, it might take a minute


'tg_fastRP'

In [17]:
params = {'v_type': 'Paper', 'e_type': ['Cite','reverse_Cite'], 'weights': '1,1,2', 'beta': -0.85, 'k': 3, 'reduced_dim': 128, 
          'sampling_constant': 1, 'random_seed': 42, 'print_accum': False,'result_attr':"",'file_path' :""}
f.runAlgorithm('tg_fastRP',params,feat_name ="fastrp_embedding")

[]

## Run Breadth-First Search Algorithm from a single source node ##
The Breadth-First Search algorithm is available in the GDS library as `tg_bfs`, under the Path class of algorithms ([source](https://github.com/tigergraph/gsql-graph-algorithms/blob/master/algorithms/Path/bfs/tg_bfs.gsql)).

In [18]:
f.installAlgorithm(query_name='tg_bfs')

Installing and optimizing the queries, it might take a minute


'tg_bfs'

In [19]:
params = {'v_type': 'Paper', 'e_type':['Cite','reverse_Cite'],'max_hops':10,"v_start":("2180","Paper"),
          'print_accum':False,'result_attr':"",'file_path' :"",'display_edges':False}

f.runAlgorithm('tg_bfs',params,feat_name="bfs")

Global schema change succeeded.


[]
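
For readers unfamiliar with hop-limited breadth-first search, the following pure-Python sketch computes hop distances from a start vertex on a toy graph. It only illustrates the general idea behind the `v_start` and `max_hops` parameters and is not the GSQL query itself; `bfs_hops` is a hypothetical name.

```python
from collections import deque
from typing import Dict, List


def bfs_hops(neighbors: Dict[str, List[str]], start: str,
             max_hops: int) -> Dict[str, int]:
    """Return the hop distance from start for vertices within max_hops."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        v = queue.popleft()
        if dist[v] == max_hops:
            continue  # do not expand beyond the hop limit
        for n in neighbors.get(v, []):
            if n not in dist:
                dist[n] = dist[v] + 1
                queue.append(n)
    return dist


# Toy path graph a - b - c - d, traversable in both directions.
toy = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
hops = bfs_hops(toy, "a", max_hops=2)
```

With `max_hops=2`, vertex `d` is three hops away from `a` and is therefore not reached.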

## Calculate the number of common neighbors between two vertices ##
The common neighbors algorithm is available in the GDS library as `tg_common_neighbors`, under the Topological Link Prediction class of algorithms ([source](https://github.com/tigergraph/gsql-graph-algorithms/blob/master/algorithms/Topological%20Link%20Prediction/common_neighbors/tg_common_neighbors.gsql)).


In [20]:
f.installAlgorithm(query_name='tg_common_neighbors')  

Installing and optimizing the queries, it might take a minute


'tg_common_neighbors'

In [21]:
params={"a":("2180","Paper"),"b":("431","Paper"),"e_type":"Cite","print_res":True}

f.runAlgorithm('tg_common_neighbors',params)

[{'closeness': 0}]
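
The quantity being computed is the size of the intersection of the two vertices' neighbor sets. A minimal sketch on a toy graph, assuming neighbors are stored as Python sets, is shown below; `common_neighbors` is a hypothetical helper, not a pyTigerGraph function.

```python
from typing import Dict, Set


def common_neighbors(neighbors: Dict[str, Set[str]], a: str, b: str) -> int:
    """Count the vertices adjacent to both a and b."""
    return len(neighbors[a] & neighbors[b])


# Toy graph: a and b share the neighbor c; d and e share none.
toy = {
    "a": {"c", "d"},
    "b": {"c", "e"},
    "c": {"a", "b"},
    "d": {"a"},
    "e": {"b"},
}

shared_ab = common_neighbors(toy, "a", "b")
shared_de = common_neighbors(toy, "d", "e")
```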