# Data Ingestion

This notebook demonstrates how to load three example datasets into your TigerGraph database.

The Cora dataset contains 2708 machine learning papers and 10556 citation links between them. Each publication in the dataset is described by a 0/1-valued word vector indicating the absence/presence of the corresponding word from a dictionary. The dictionary consists of 1433 unique words. Each paper is classified into one of seven classes based on its topic.

The IMDB dataset contains three types of vertices (4278 movies, 5257 actors, and 2081 directors) and four types of edges (12828 actor-to-movie edges, 12828 movie-to-actor edges, 4278 director-to-movie edges, and 4278 movie-to-director edges). Each vertex is described by a 0/1-valued word vector indicating the absence/presence of the corresponding keywords. For movies, the keywords are extracted from their plots; for actors and directors, the keywords are extracted from the plots of the movies they participated in. Each movie is classified into one of three classes (action, comedy, or drama) according to its genre. The goal is to predict the class of each movie in the graph.
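Both Cora and IMDB describe vertices with 0/1-valued word vectors. The sketch below illustrates what such a vector looks like for a tiny, made-up dictionary. The function name `binary_word_vector`, the five-word dictionary, and the whitespace tokenization are assumptions for illustration only; the notebook does not describe the preprocessing that produced the published datasets.

```python
def binary_word_vector(text, dictionary):
    """Return a 0/1 list marking which dictionary words occur in `text`."""
    # Naive tokenization (assumption): lowercase and split on whitespace.
    present = set(text.lower().split())
    return [1 if word in present else 0 for word in dictionary]


# Hypothetical 5-word dictionary; the Cora dictionary has 1433 words.
dictionary = ["graph", "neural", "network", "bayesian", "kernel"]
vector = binary_word_vector("A neural network for graph data", dictionary)
print(vector)
```

The vector has one entry per dictionary word, so its length is fixed by the dictionary rather than by the length of the text.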

The hetero dataset contains a small synthetic heterogeneous graph intended for illustration purposes.

We use `requests` to download the data. Modify the `TigerGraphConnection` arguments below to connect to your own database instance; see the [connection documentation](https://docs.tigergraph.com/pytigergraph/current/getting-started/connection) for details.

In [3]:
import pyTigerGraph as tg


# Set tgCloud to False if you are not using TigerGraph Cloud
conn = tg.TigerGraphConnection(host="https://f2e46e9727494dc4aa8b83c96df804d5.i.tgcloud.io", tgCloud=True)

In [4]:
conn.gsql("LS")

'---- Global vertices, edges, and all graphs\nVertex Types:\nEdge Types:\n\nGraphs:\nJobs:\n\n\nJSON API version: v2\nSyntax version: v2\n'

# Cora Dataset

In [10]:
import requests

In [7]:
print(conn.gsql(open("./data/cora/gsql/schema.gsql", "r").read()))

Successfully created vertex types: [Paper].
Successfully created edge types: [Cite].
The graph Cora is created.


In [8]:
print(conn.gsql(open("./data/cora/gsql/load.gsql", "r").read()))

Using graph 'Cora'
Successfully created loading jobs: [load_cora_data].


In [11]:
with open("./data/cora/nodes.csv", "w") as f:
    data = requests.get("https://tigergraph-public-data.s3.us-west-1.amazonaws.com/Cora/nodes.csv").text
    f.write(data)

with open("./data/cora/edges.csv", "w") as f:
    data = requests.get("https://tigergraph-public-data.s3.us-west-1.amazonaws.com/Cora/edges.csv").text
    f.write(data)
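The download cell above writes whatever the server returns, so a failed request could leave an empty or error-page CSV on disk. A minimal sketch of a more defensive variant follows. The helper name `download_text` and its `get` parameter are hypothetical (the parameter exists only so the function can be exercised without network access); `raise_for_status()` and the `timeout` argument are standard `requests` features.

```python
from pathlib import Path


def download_text(url, dest, get=None):
    """Download `url` and write the response body to `dest` as text."""
    if get is None:
        # Imported lazily so the helper can be tested with a stand-in `get`.
        import requests
        get = requests.get
    response = get(url, timeout=60)
    # Raise an exception on HTTP errors instead of saving an error page.
    response.raise_for_status()
    dest = Path(dest)
    # Create the target directory if it does not exist yet.
    dest.parent.mkdir(parents=True, exist_ok=True)
    dest.write_text(response.text, encoding="utf-8")
    return dest


# Example call (requires network access):
# download_text(
#     "https://tigergraph-public-data.s3.us-west-1.amazonaws.com/Cora/nodes.csv",
#     "./data/cora/nodes.csv",
# )
```

Because the file is written only after the request succeeds, a failed download does not overwrite an existing file.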

In [12]:
conn.graphname = "Cora"

In [13]:
# Only needed if your graph requires authentication

conn.getToken(conn.createSecret())

('7s1vkj05qa5tt12d7vnh3s2q7hibpgv5', 1659124401, '2022-07-29 19:53:21')
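The tuple returned above appears to hold the token, its expiration as a Unix timestamp, and the same expiration as a formatted string. The sketch below converts the timestamp into a timezone-aware `datetime`, assuming the second element is in seconds and the string is in UTC, which is consistent with the values shown. The helper name `token_expiry` is hypothetical.

```python
from datetime import datetime, timezone


def token_expiry(token_info):
    """Return the expiration of a getToken() result as an aware UTC datetime."""
    # Assumed layout: (token, expiration in epoch seconds, formatted expiration).
    _token, expires_epoch, _expires_text = token_info
    return datetime.fromtimestamp(expires_epoch, tz=timezone.utc)


# Tuple shaped like the output above, with the token replaced by a placeholder.
example = ("<token>", 1659124401, "2022-07-29 19:53:21")
expiry = token_expiry(example)
print(expiry.strftime("%Y-%m-%d %H:%M:%S"))
```

Comparing `expiry` with `datetime.now(timezone.utc)` is one way to decide whether a new token needs to be requested.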

In [14]:
conn.runLoadingJobWithFile("./data/cora/nodes.csv", "node_csv", "load_cora_data")

[{'sourceFileName': 'Online_POST',
  'statistics': {'validLine': 2709,
   'rejectLine': 0,
   'failedConditionLine': 0,
   'notEnoughToken': 0,
   'invalidJson': 0,
   'oversizeToken': 0,
   'vertex': [{'typeName': 'Paper',
     'validObject': 2708,
     'noIdFound': 0,
     'invalidAttribute': 1,
     'invalidAttributeLines': ['1:id'],
     'invalidAttributeLinesData': ['id,x,y,train,valid,test\n'],
     'invalidVertexType': 0,
     'invalidPrimaryId': 1,
     'invalidSecondaryId': 0,
     'incorrectFixedBinaryLength': 0}],
   'edge': [],
   'deleteVertex': [],
   'deleteEdge': []}}]
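In the output above, `validLine` is 2709 while `validObject` is 2708: the one line reported under `invalidAttributeLinesData` is the CSV header row. A small sketch for pulling the per-type counts out of a result like this is shown below. The helper name `summarize_load` is hypothetical, and the sample dictionary is a trimmed copy of the output above.

```python
def summarize_load(result):
    """Map each vertex/edge type in a loading result to its object counts."""
    summary = {}
    for entry in result:
        stats = entry["statistics"]
        for kind in ("vertex", "edge"):
            for obj in stats.get(kind, []):
                summary[obj["typeName"]] = {
                    "kind": kind,
                    "loaded": obj["validObject"],
                    # Edge entries in the outputs above have no such key.
                    "rejected_lines": obj.get("invalidAttributeLinesData", []),
                }
    return summary


# Trimmed copy of the Cora node-loading output shown above.
cora_nodes_result = [{
    "sourceFileName": "Online_POST",
    "statistics": {
        "validLine": 2709,
        "vertex": [{
            "typeName": "Paper",
            "validObject": 2708,
            "invalidAttribute": 1,
            "invalidAttributeLinesData": ["id,x,y,train,valid,test\n"],
        }],
        "edge": [],
    },
}]

summary = summarize_load(cora_nodes_result)
print(summary["Paper"]["loaded"], summary["Paper"]["rejected_lines"])
```

Checking `loaded` against the expected vertex count (2708 papers for Cora) gives a quick sanity check after each loading step.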

In [15]:
conn.runLoadingJobWithFile("./data/cora/edges.csv", "edge_csv", "load_cora_data")

[{'sourceFileName': 'Online_POST',
  'statistics': {'validLine': 10557,
   'rejectLine': 0,
   'failedConditionLine': 0,
   'notEnoughToken': 0,
   'invalidJson': 0,
   'oversizeToken': 0,
   'vertex': [],
   'edge': [{'typeName': 'Cite',
     'validObject': 10557,
     'noIdFound': 0,
     'invalidAttribute': 0,
     'invalidVertexType': 0,
     'invalidPrimaryId': 2,
     'invalidSecondaryId': 0,
     'incorrectFixedBinaryLength': 0}],
   'deleteVertex': [],
   'deleteEdge': []}}]

# IMDB Dataset

In [17]:
print(conn.gsql(open("./data/imdb/gsql/schema.gsql", "r").read()))

Successfully created vertex types: [movie].
Successfully created vertex types: [actor].
Successfully created vertex types: [director].
Successfully created edge types: [actor_movie].
Successfully created edge types: [director_movie].
Successfully created edge types: [movie_actor].
Successfully created edge types: [movie_director].
The graph imdb is created.


In [20]:
print(conn.gsql(open("./data/imdb/gsql/load.gsql", "r").read()))

Successfully created loading jobs: [load_imdb_data].


In [19]:
urls = ["https://tigergraph-public-data.s3.us-west-1.amazonaws.com/imdb/actor_movie.csv",
"https://tigergraph-public-data.s3.us-west-1.amazonaws.com/imdb/actor.csv",
"https://tigergraph-public-data.s3.us-west-1.amazonaws.com/imdb/director_movie.csv",
"https://tigergraph-public-data.s3.us-west-1.amazonaws.com/imdb/director.csv",
"https://tigergraph-public-data.s3.us-west-1.amazonaws.com/imdb/movie_actor.csv",
"https://tigergraph-public-data.s3.us-west-1.amazonaws.com/imdb/movie_director.csv",
"https://tigergraph-public-data.s3.us-west-1.amazonaws.com/imdb/movie.csv"]

for url in urls:
    with open("./data/imdb/"+url.split("/")[-1], "w") as f:
        data = requests.get(url).text
        f.write(data)

In [22]:
conn.graphname = "imdb"

In [25]:
# Only needed if your graph requires authentication

conn.getToken(conn.createSecret())

('97q2pbgap386ccqseokqrbt16ahi6jqs', 1659126891, '2022-07-29 20:34:51')

In [26]:
conn.runLoadingJobWithFile("./data/imdb/director.csv", "director_csv", "load_imdb_data")

[{'sourceFileName': 'Online_POST',
  'statistics': {'validLine': 2082,
   'rejectLine': 0,
   'failedConditionLine': 0,
   'notEnoughToken': 0,
   'invalidJson': 0,
   'oversizeToken': 0,
   'vertex': [{'typeName': 'director',
     'validObject': 2081,
     'noIdFound': 0,
     'invalidAttribute': 1,
     'invalidAttributeLines': ['1:id'],
     'invalidAttributeLinesData': ['x,id\n'],
     'invalidVertexType': 0,
     'invalidPrimaryId': 1,
     'invalidSecondaryId': 0,
     'incorrectFixedBinaryLength': 0}],
   'edge': [],
   'deleteVertex': [],
   'deleteEdge': []}}]

In [27]:
conn.runLoadingJobWithFile("./data/imdb/actor.csv", "actor_csv", "load_imdb_data")

[{'sourceFileName': 'Online_POST',
  'statistics': {'validLine': 5258,
   'rejectLine': 0,
   'failedConditionLine': 0,
   'notEnoughToken': 0,
   'invalidJson': 0,
   'oversizeToken': 0,
   'vertex': [{'typeName': 'actor',
     'validObject': 5257,
     'noIdFound': 0,
     'invalidAttribute': 1,
     'invalidAttributeLines': ['1:id'],
     'invalidAttributeLinesData': ['x,id\n'],
     'invalidVertexType': 0,
     'invalidPrimaryId': 1,
     'invalidSecondaryId': 0,
     'incorrectFixedBinaryLength': 0}],
   'edge': [],
   'deleteVertex': [],
   'deleteEdge': []}}]

In [28]:
conn.runLoadingJobWithFile("./data/imdb/movie.csv", "movie_csv", "load_imdb_data")

[{'sourceFileName': 'Online_POST',
  'statistics': {'validLine': 4279,
   'rejectLine': 0,
   'failedConditionLine': 0,
   'notEnoughToken': 0,
   'invalidJson': 0,
   'oversizeToken': 0,
   'vertex': [{'typeName': 'movie',
     'validObject': 4278,
     'noIdFound': 0,
     'invalidAttribute': 1,
     'invalidAttributeLines': ['1:id'],
     'invalidAttributeLinesData': ['x,id,y,train_mask,val_mask,test_mask\n'],
     'invalidVertexType': 0,
     'invalidPrimaryId': 1,
     'invalidSecondaryId': 0,
     'incorrectFixedBinaryLength': 0}],
   'edge': [],
   'deleteVertex': [],
   'deleteEdge': []}}]

In [29]:
conn.runLoadingJobWithFile("./data/imdb/actor_movie.csv", "actor_movie_csv", "load_imdb_data")

[{'sourceFileName': 'Online_POST',
  'statistics': {'validLine': 12829,
   'rejectLine': 0,
   'failedConditionLine': 0,
   'notEnoughToken': 0,
   'invalidJson': 0,
   'oversizeToken': 0,
   'vertex': [],
   'edge': [{'typeName': 'actor_movie',
     'validObject': 12829,
     'noIdFound': 0,
     'invalidAttribute': 0,
     'invalidVertexType': 0,
     'invalidPrimaryId': 2,
     'invalidSecondaryId': 0,
     'incorrectFixedBinaryLength': 0}],
   'deleteVertex': [],
   'deleteEdge': []}}]

In [30]:
conn.runLoadingJobWithFile("./data/imdb/director_movie.csv", "director_movie_csv", "load_imdb_data")

[{'sourceFileName': 'Online_POST',
  'statistics': {'validLine': 4279,
   'rejectLine': 0,
   'failedConditionLine': 0,
   'notEnoughToken': 0,
   'invalidJson': 0,
   'oversizeToken': 0,
   'vertex': [],
   'edge': [{'typeName': 'director_movie',
     'validObject': 4279,
     'noIdFound': 0,
     'invalidAttribute': 0,
     'invalidVertexType': 0,
     'invalidPrimaryId': 2,
     'invalidSecondaryId': 0,
     'incorrectFixedBinaryLength': 0}],
   'deleteVertex': [],
   'deleteEdge': []}}]

In [31]:
conn.runLoadingJobWithFile("./data/imdb/movie_actor.csv", "movie_actor_csv", "load_imdb_data")

[{'sourceFileName': 'Online_POST',
  'statistics': {'validLine': 12829,
   'rejectLine': 0,
   'failedConditionLine': 0,
   'notEnoughToken': 0,
   'invalidJson': 0,
   'oversizeToken': 0,
   'vertex': [],
   'edge': [{'typeName': 'movie_actor',
     'validObject': 12829,
     'noIdFound': 0,
     'invalidAttribute': 0,
     'invalidVertexType': 0,
     'invalidPrimaryId': 2,
     'invalidSecondaryId': 0,
     'incorrectFixedBinaryLength': 0}],
   'deleteVertex': [],
   'deleteEdge': []}}]

In [33]:
conn.runLoadingJobWithFile("./data/imdb/movie_director.csv", "movie_director_csv", "load_imdb_data")

[{'sourceFileName': 'Online_POST',
  'statistics': {'validLine': 4279,
   'rejectLine': 0,
   'failedConditionLine': 0,
   'notEnoughToken': 0,
   'invalidJson': 0,
   'oversizeToken': 0,
   'vertex': [],
   'edge': [{'typeName': 'movie_director',
     'validObject': 4279,
     'noIdFound': 0,
     'invalidAttribute': 0,
     'invalidVertexType': 0,
     'invalidPrimaryId': 2,
     'invalidSecondaryId': 0,
     'incorrectFixedBinaryLength': 0}],
   'deleteVertex': [],
   'deleteEdge': []}}]

# Heterogeneous Graph Example

In [35]:
conn.gsql(open("./data/heterogenous/gsql/schema.gsql", "r").read())

'Successfully created vertex types: [v0].\nSuccessfully created vertex types: [v1].\nSuccessfully created vertex types: [v2].\nSuccessfully created edge types: [v0v0].\nSuccessfully created edge types: [v1v1].\nSuccessfully created edge types: [v1v2].\nSuccessfully created edge types: [v2v0].\nSuccessfully created edge types: [v2v1].\nSuccessfully created edge types: [v2v2].\nThe graph hetero is created.'

In [36]:
conn.graphname = "hetero"
conn.getToken(conn.createSecret())

('jk2v7h5g41tho7p58j54h24cdjumh023', 1659127973, '2022-07-29 20:52:53')

In [38]:
urls = ["https://tigergraph-public-data.s3.us-west-1.amazonaws.com/fake-hetero/v0.csv",
"https://tigergraph-public-data.s3.us-west-1.amazonaws.com/fake-hetero/v1.csv",
"https://tigergraph-public-data.s3.us-west-1.amazonaws.com/fake-hetero/v2.csv",
"https://tigergraph-public-data.s3.us-west-1.amazonaws.com/fake-hetero/v0v0.csv",
"https://tigergraph-public-data.s3.us-west-1.amazonaws.com/fake-hetero/v1v1.csv",
"https://tigergraph-public-data.s3.us-west-1.amazonaws.com/fake-hetero/v1v2.csv",
"https://tigergraph-public-data.s3.us-west-1.amazonaws.com/fake-hetero/v2v0.csv",
"https://tigergraph-public-data.s3.us-west-1.amazonaws.com/fake-hetero/v2v1.csv",
"https://tigergraph-public-data.s3.us-west-1.amazonaws.com/fake-hetero/v2v2.csv"]

for url in urls:
    with open("./data/heterogenous/"+url.split("/")[-1], "w") as f:
        data = requests.get(url).text
        f.write(data)

In [39]:
conn.gsql(open("./data/heterogenous/gsql/load.gsql", "r").read())

"Using graph 'hetero'\nSuccessfully created loading jobs: [load_hetero_data]."

In [40]:
conn.runLoadingJobWithFile("./data/heterogenous/v0.csv", "v0_csv", "load_hetero_data")
conn.runLoadingJobWithFile("./data/heterogenous/v1.csv", "v1_csv", "load_hetero_data")
conn.runLoadingJobWithFile("./data/heterogenous/v2.csv", "v2_csv", "load_hetero_data")
conn.runLoadingJobWithFile("./data/heterogenous/v0v0.csv", "v0v0_csv", "load_hetero_data")
conn.runLoadingJobWithFile("./data/heterogenous/v1v1.csv", "v1v1_csv", "load_hetero_data")
conn.runLoadingJobWithFile("./data/heterogenous/v1v2.csv", "v1v2_csv", "load_hetero_data")
conn.runLoadingJobWithFile("./data/heterogenous/v2v0.csv", "v2v0_csv", "load_hetero_data")
conn.runLoadingJobWithFile("./data/heterogenous/v2v1.csv", "v2v1_csv", "load_hetero_data")
conn.runLoadingJobWithFile("./data/heterogenous/v2v2.csv", "v2v2_csv", "load_hetero_data")

[{'sourceFileName': 'Online_POST',
  'statistics': {'validLine': 967,
   'rejectLine': 0,
   'failedConditionLine': 0,
   'notEnoughToken': 0,
   'invalidJson': 0,
   'oversizeToken': 0,
   'vertex': [],
   'edge': [{'typeName': 'v2v2',
     'validObject': 967,
     'noIdFound': 0,
     'invalidAttribute': 0,
     'invalidVertexType': 0,
     'invalidPrimaryId': 2,
     'invalidSecondaryId': 0,
     'incorrectFixedBinaryLength': 0}],
   'deleteVertex': [],
   'deleteEdge': []}}]
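Because the nine loading calls share one cell, the notebook displays only the result of the last call (`v2v2`). The sketch below runs the same calls in a loop and keeps every result. It assumes each file tag is the file name without its extension plus `_csv`, which matches the tags used throughout this notebook; the helper names `file_tag` and `run_loading_jobs` are hypothetical.

```python
from pathlib import Path


def file_tag(csv_path):
    """Derive the loading-job file tag, e.g. 'v0.csv' -> 'v0_csv'."""
    return Path(csv_path).stem + "_csv"


def run_loading_jobs(conn, directory, file_names, job_name):
    """Run one loading job per file and return the results keyed by file tag."""
    results = {}
    for name in file_names:
        path = str(Path(directory) / name)
        tag = file_tag(path)
        # Same three positional arguments as the calls in the cell above.
        results[tag] = conn.runLoadingJobWithFile(path, tag, job_name)
    return results


# Vertex files first, then edge files, in the same order as the cell above.
hetero_files = ["v0.csv", "v1.csv", "v2.csv", "v0v0.csv", "v1v1.csv",
                "v1v2.csv", "v2v0.csv", "v2v1.csv", "v2v2.csv"]

# Example call (requires a live connection):
# results = run_loading_jobs(conn, "./data/heterogenous", hetero_files, "load_hetero_data")
```

Each entry of the returned dictionary can then be inspected individually, for example `results["v0_csv"]`.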