# Data Ingestion

This notebook demonstrates how to load three example datasets into your TigerGraph database. The data files are included in the GitHub [repo](https://github.com/TigerGraph-DevLabs/mlworkbench-docs/tree/main/tutorials/basics/data). To run this notebook, you will need to clone the repo or download those files.

The **Cora** dataset contains 2708 machine learning papers and 10556 citation links between them. Each paper is described by a 0/1-valued word vector indicating the absence or presence of the corresponding word from a dictionary of 1433 unique words. Each paper is classified into one of seven classes based on its topic.
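To make the feature encoding concrete, the sketch below builds a 0/1-valued word vector for a toy example. The five-word dictionary and the sample text are made up for illustration; the real Cora dictionary has 1433 words, and the dataset ships with the vectors already computed.

```python
# Toy illustration of a 0/1-valued (binary bag-of-words) vector.
# The dictionary and text below are hypothetical, not taken from Cora.
dictionary = ["graph", "neural", "network", "kernel", "bayesian"]

def binary_word_vector(text, dictionary):
    """Return 1 for each dictionary word present in `text`, else 0."""
    words = set(text.lower().split())
    return [1 if word in words else 0 for word in dictionary]

vector = binary_word_vector("A neural network on a graph", dictionary)
print(vector)  # [1, 1, 1, 0, 0]
```

The vector always has one entry per dictionary word, so every paper gets a feature vector of the same length regardless of how long its text is.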

The **IMDB** dataset contains three types of vertices (4278 movies, 5257 actors, and 2081 directors) and four types of edges (12828 actor-to-movie edges, 12828 movie-to-actor edges, 4278 director-to-movie edges, and 4278 movie-to-director edges). Each vertex is described by a 0/1-valued word vector indicating the absence or presence of the corresponding keywords. For movies, the keywords are extracted from their plots; for actors and directors, the keywords are extracted from the plots of the movies they participated in. Each movie is classified into one of three classes (action, comedy, or drama) according to its genre. The goal is to predict the class of each movie in the graph.
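As a quick sanity reference, the totals implied by the counts above can be worked out as follows. The dictionary keys are only labels chosen to mirror the CSV file names used later in this notebook; they are not necessarily the type names in the graph schema.

```python
# Counts as stated in the IMDB dataset description.
vertex_counts = {"movie": 4278, "actor": 5257, "director": 2081}
edge_counts = {
    "actor_movie": 12828,
    "movie_actor": 12828,
    "director_movie": 4278,
    "movie_director": 4278,
}

total_vertices = sum(vertex_counts.values())
total_edges = sum(edge_counts.values())
print(total_vertices, total_edges)  # 11616 34212
```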

The **hetero** dataset contains a small synthetic heterogeneous graph, intended for illustration purposes.

**NOTE**: The data ingestion procedure differs slightly between TigerGraph on-prem databases and TGCloud databases created after 7/5/2022. Please run the section that corresponds to your database; do NOT run both.

### TigerGraph On-Prem (and TGCloud databases created before 7/5/2022)

Your `TigerGraphConnection` object will need to be modified in order to connect to your own database instance. Check the documentation [here](https://docs.tigergraph.com/pytigergraph/current/getting-started/connection) for more details.

In [None]:
from pyTigerGraph import TigerGraphConnection

conn = TigerGraphConnection(
    host="http://127.0.0.1", # Change the address to your database's
    username="tigergraph", # Change to your username
    password="tigergraph" # Change to your password
)
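If you would rather not hard-code credentials in the notebook, one option is to read them from environment variables. The sketch below is only an illustration: the variable names `TG_HOST`, `TG_USERNAME`, and `TG_PASSWORD` and the helper `connection_settings` are hypothetical, and the fallback values are the defaults used in this tutorial.

```python
import os

def connection_settings(env=os.environ):
    """Collect connection settings, falling back to the tutorial defaults.

    TG_HOST, TG_USERNAME and TG_PASSWORD are hypothetical variable names.
    """
    return {
        "host": env.get("TG_HOST", "http://127.0.0.1"),
        "username": env.get("TG_USERNAME", "tigergraph"),
        "password": env.get("TG_PASSWORD", "tigergraph"),
    }

settings = connection_settings()
# The dictionary can then be unpacked into the constructor, e.g.
# conn = TigerGraphConnection(**settings)
```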

In [None]:
# Check metadata in the database to make sure the connection works
print(conn.gsql("LS"))

#### Cora Dataset

In [None]:
print(conn.gsql("CREATE GRAPH Cora()"))

conn.graphname = "Cora"
# Create and run schema change job
print(conn.gsql(open("./data/cora/gsql/schema.gsql", "r").read()))

# Create loading job
print(conn.gsql(open("./data/cora/gsql/load.gsql", "r").read()))
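The cells in this notebook read each GSQL script with a bare `open(...).read()`, which leaves closing the file to the interpreter. A small helper can make this more explicit. This is a sketch only; `run_gsql_file` is a hypothetical name and not part of pyTigerGraph.

```python
from pathlib import Path

def run_gsql_file(conn, path):
    """Read a GSQL script from `path` and run it through `conn.gsql`."""
    # read_text opens the file, reads it, and closes it again.
    script = Path(path).read_text()
    return conn.gsql(script)

# Usage, assuming `conn` is the TigerGraphConnection created earlier:
# print(run_gsql_file(conn, "./data/cora/gsql/schema.gsql"))
# print(run_gsql_file(conn, "./data/cora/gsql/load.gsql"))
```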

In [None]:
# COMMENT OUT THE LINE BELOW if you are NOT using a graph that requires token authentication or you will get an error
conn.getToken(conn.createSecret())

In [None]:
# Load data
conn.runLoadingJobWithFile("./data/cora/nodes.csv", "node_csv", "load_cora_data")
conn.runLoadingJobWithFile("./data/cora/edges.csv", "edge_csv", "load_cora_data")

#### IMDB Dataset

In [None]:
print(conn.gsql("CREATE GRAPH imdb()"))

conn.graphname = "imdb"
# Create graph schema
print(conn.gsql(open("./data/imdb/gsql/schema.gsql", "r").read()))
# Create loading job
print(conn.gsql(open("./data/imdb/gsql/load.gsql", "r").read()))

In [None]:
# COMMENT OUT THE LINE BELOW if you are NOT using a graph that requires token authentication
conn.getToken(conn.createSecret())

In [None]:
# Load data
conn.runLoadingJobWithFile("./data/imdb/director.csv", "director_csv", "load_imdb_data")
conn.runLoadingJobWithFile("./data/imdb/actor.csv", "actor_csv", "load_imdb_data")
conn.runLoadingJobWithFile("./data/imdb/movie.csv", "movie_csv", "load_imdb_data")
conn.runLoadingJobWithFile("./data/imdb/actor_movie.csv", "actor_movie_csv", "load_imdb_data")
conn.runLoadingJobWithFile("./data/imdb/director_movie.csv", "director_movie_csv", "load_imdb_data")
conn.runLoadingJobWithFile("./data/imdb/movie_actor.csv", "movie_actor_csv", "load_imdb_data")
conn.runLoadingJobWithFile("./data/imdb/movie_director.csv", "movie_director_csv", "load_imdb_data")
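Because each IMDB file tag is simply the file name with a `_csv` suffix, the seven calls above could also be written as a loop. The helper below is a sketch under that naming assumption; `load_files` is a hypothetical name, and the pattern does not fit the Cora files, whose tags (`node_csv`, `edge_csv`) differ from the file names.

```python
from pathlib import Path

def load_files(conn, data_dir, job_name, names):
    """Run `job_name` once per CSV file, using `<name>_csv` as the file tag.

    Checks that every file exists before starting, so a missing download
    fails fast instead of halfway through the load.
    """
    paths = {name: Path(data_dir) / f"{name}.csv" for name in names}
    missing = [str(p) for p in paths.values() if not p.is_file()]
    if missing:
        raise FileNotFoundError(f"Missing data files: {missing}")
    return [
        conn.runLoadingJobWithFile(str(path), f"{name}_csv", job_name)
        for name, path in paths.items()
    ]

imdb_files = ["director", "actor", "movie", "actor_movie",
              "director_movie", "movie_actor", "movie_director"]
# Usage, assuming `conn` is connected to the imdb graph:
# load_files(conn, "./data/imdb", "load_imdb_data", imdb_files)
```

The files are loaded in list order, which here matches the order of the explicit calls (vertex files first, then edge files).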

#### Heterogeneous Graph Example

In [None]:
print(conn.gsql("CREATE GRAPH hetero()"))

conn.graphname="hetero"
# Create graph schema
print(conn.gsql(open("./data/heterogenous/gsql/schema.gsql", "r").read()))
# Create loading job
print(conn.gsql(open("./data/heterogenous/gsql/load.gsql", "r").read()))

In [None]:
# COMMENT OUT THE LINE BELOW if you are NOT using a graph that requires token authentication
conn.getToken(conn.createSecret())

In [None]:
# Load data
conn.runLoadingJobWithFile("./data/heterogenous/v0.csv", "v0_csv", "load_hetero_data")
conn.runLoadingJobWithFile("./data/heterogenous/v1.csv", "v1_csv", "load_hetero_data")
conn.runLoadingJobWithFile("./data/heterogenous/v2.csv", "v2_csv", "load_hetero_data")
conn.runLoadingJobWithFile("./data/heterogenous/v0v0.csv", "v0v0_csv", "load_hetero_data")
conn.runLoadingJobWithFile("./data/heterogenous/v1v1.csv", "v1v1_csv", "load_hetero_data")
conn.runLoadingJobWithFile("./data/heterogenous/v1v2.csv", "v1v2_csv", "load_hetero_data")
conn.runLoadingJobWithFile("./data/heterogenous/v2v0.csv", "v2v0_csv", "load_hetero_data")
conn.runLoadingJobWithFile("./data/heterogenous/v2v1.csv", "v2v1_csv", "load_hetero_data")
conn.runLoadingJobWithFile("./data/heterogenous/v2v2.csv", "v2v2_csv", "load_hetero_data")

### TGCloud Databases Created after 7/5/2022

#### Cora Dataset

You **must** have already created the Cora graph and a secret for it.

In [None]:
from pyTigerGraph import TigerGraphConnection

gsqlSecret="YOUR_GSQL_SECRET_HERE" # Change to your secret for the graph

conn = TigerGraphConnection(
    host="https://mydb.i.tgcloud.io", # Change the address to your database's
    graphname="Cora",
    gsqlSecret=gsqlSecret,
)

conn.getToken(gsqlSecret)

In [None]:
# Check metadata in the database to make sure the connection works
print(conn.gsql("LS"))

In [None]:
# Create and run schema change job
print(conn.gsql(open("./data/cora/gsql/schema.gsql", "r").read()))

# Create loading job
print(conn.gsql(open("./data/cora/gsql/load.gsql", "r").read()))

In [None]:
# Load data
conn.runLoadingJobWithFile("./data/cora/nodes.csv", "node_csv", "load_cora_data")
conn.runLoadingJobWithFile("./data/cora/edges.csv", "edge_csv", "load_cora_data")
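After loading, it can be useful to confirm that the number of vertices in the graph matches the 2708 papers mentioned in the dataset description. The sketch below assumes that `conn.getVertexCount("*")` returns a dictionary of per-type counts (or a single integer); `total_count` and `check_vertex_total` are hypothetical helper names. Counts reported by the database may lag briefly behind a load that has just finished.

```python
def total_count(counts):
    """Sum per-type counts; accept a plain integer as well."""
    if isinstance(counts, dict):
        return sum(counts.values())
    return int(counts)

def check_vertex_total(conn, expected):
    """Compare the number of loaded vertices with `expected`."""
    actual = total_count(conn.getVertexCount("*"))
    return actual == expected, actual

# Usage, assuming `conn` points at the Cora graph:
# ok, actual = check_vertex_total(conn, 2708)
# print("vertices loaded:", actual, "(matches)" if ok else "(unexpected)")
```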

#### IMDB Dataset

You **must** have already created the imdb graph and a secret for it.

In [None]:
from pyTigerGraph import TigerGraphConnection

gsqlSecret="YOUR_GSQL_SECRET_HERE" # Change to your secret for the graph

conn = TigerGraphConnection(
    host="https://mydb.i.tgcloud.io", # Change the address to your database's
    graphname="imdb",
    gsqlSecret=gsqlSecret,
)

conn.getToken(gsqlSecret)

In [None]:
# Create graph schema
print(conn.gsql(open("./data/imdb/gsql/schema.gsql", "r").read()))
# Create loading job
print(conn.gsql(open("./data/imdb/gsql/load.gsql", "r").read()))

In [None]:
# Load data
conn.runLoadingJobWithFile("./data/imdb/director.csv", "director_csv", "load_imdb_data")
conn.runLoadingJobWithFile("./data/imdb/actor.csv", "actor_csv", "load_imdb_data")
conn.runLoadingJobWithFile("./data/imdb/movie.csv", "movie_csv", "load_imdb_data")
conn.runLoadingJobWithFile("./data/imdb/actor_movie.csv", "actor_movie_csv", "load_imdb_data")
conn.runLoadingJobWithFile("./data/imdb/director_movie.csv", "director_movie_csv", "load_imdb_data")
conn.runLoadingJobWithFile("./data/imdb/movie_actor.csv", "movie_actor_csv", "load_imdb_data")
conn.runLoadingJobWithFile("./data/imdb/movie_director.csv", "movie_director_csv", "load_imdb_data")

#### Heterogeneous Graph Example

You **must** have already created the hetero graph and a secret for it.

In [None]:
from pyTigerGraph import TigerGraphConnection

gsqlSecret="YOUR_GSQL_SECRET_HERE" # Change to your secret for the graph

conn = TigerGraphConnection(
    host="https://mydb.i.tgcloud.io", # Change the address to your database's
    graphname="hetero",
    gsqlSecret=gsqlSecret,
)

conn.getToken(gsqlSecret)

In [None]:
# Create graph schema
print(conn.gsql(open("./data/heterogenous/gsql/schema.gsql", "r").read()))
# Create loading job
print(conn.gsql(open("./data/heterogenous/gsql/load.gsql", "r").read()))

In [None]:
# Load data
conn.runLoadingJobWithFile("./data/heterogenous/v0.csv", "v0_csv", "load_hetero_data")
conn.runLoadingJobWithFile("./data/heterogenous/v1.csv", "v1_csv", "load_hetero_data")
conn.runLoadingJobWithFile("./data/heterogenous/v2.csv", "v2_csv", "load_hetero_data")
conn.runLoadingJobWithFile("./data/heterogenous/v0v0.csv", "v0v0_csv", "load_hetero_data")
conn.runLoadingJobWithFile("./data/heterogenous/v1v1.csv", "v1v1_csv", "load_hetero_data")
conn.runLoadingJobWithFile("./data/heterogenous/v1v2.csv", "v1v2_csv", "load_hetero_data")
conn.runLoadingJobWithFile("./data/heterogenous/v2v0.csv", "v2v0_csv", "load_hetero_data")
conn.runLoadingJobWithFile("./data/heterogenous/v2v1.csv", "v2v1_csv", "load_hetero_data")
conn.runLoadingJobWithFile("./data/heterogenous/v2v2.csv", "v2v2_csv", "load_hetero_data")