# Datasets

This notebook demonstrates how to load two example datasets into your TigerGraph database. It uses [pyTigerGraph](https://github.com/tigergraph/pyTigerGraph) to download the datasets and ingest them into your database. These datasets are used throughout the remaining notebooks in the basics directory.

The **Cora** dataset contains 2708 machine learning papers and 10556 citation links between them. Each paper is described by a 0/1-valued word vector indicating the absence or presence of each word in a dictionary of 1433 unique words. Each paper is classified into one of seven classes based on its topic.
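To make the 0/1-valued word vector concrete, here is a minimal sketch of how such a vector could be built. The five-word dictionary, the sample paper, and the helper `to_binary_vector` are hypothetical and only illustrate the encoding; they are not part of pyTigerGraph or the actual Cora preprocessing.

```python
# Toy dictionary for illustration only; the real Cora dictionary has 1433 words.
dictionary = ["graph", "neural", "network", "bayesian", "kernel"]


def to_binary_vector(words, dictionary):
    """Return a 0/1 list with one entry per dictionary word.

    An entry is 1 if the dictionary word occurs in `words`, else 0.
    Repeated occurrences do not change the value.
    """
    present = set(words)
    return [1 if term in present else 0 for term in dictionary]


# Hypothetical paper: "network" appears twice, "training" is not in the dictionary.
paper_words = ["neural", "network", "training", "network"]
vector = to_binary_vector(paper_words, dictionary)
print(vector)
```

The vector always has the same length as the dictionary, so every paper gets a feature vector of identical size regardless of how long the paper is.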

The **IMDB** dataset contains 3 types of vertices: 4278 movies, 5257 actors, and 2081 directors; and 4 types of edges: 12828 actor-to-movie edges, 12828 movie-to-actor edges, 4278 director-to-movie edges, and 4278 movie-to-director edges. Each vertex is described by a 0/1-valued word vector indicating the absence or presence of the corresponding keywords. For movies, the keywords are extracted from their plots; for actors and directors, the keywords are extracted from the plots of the movies they participated in. Each movie is classified into one of three classes (action, comedy, or drama) according to its genre. The goal is to predict the class of each movie in the graph.
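As a quick sanity check on the sizes quoted above, the following sketch totals the vertex and edge counts. The numbers come from the description; the edge type names are the ones that appear in the schema creation log later in this notebook, and the dictionary layout itself is just an illustration.

```python
# Counts taken from the dataset description above.
vertex_counts = {"movie": 4278, "actor": 5257, "director": 2081}
edge_counts = {
    "actor_movie": 12828,
    "movie_actor": 12828,
    "director_movie": 4278,
    "movie_director": 4278,
}

total_vertices = sum(vertex_counts.values())
total_edges = sum(edge_counts.values())

print(f"vertices: {total_vertices}")
print(f"edges:    {total_edges}")
```

These totals can be compared against the vertex and edge counts in your database after ingestion.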

To connect to your database, modify the `config.json` file accompanying this notebook. Set the value of `getToken` based on whether token authentication is enabled for your database. Token authentication is always enabled for tgcloud databases.
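The cells below read four keys from the config: `host`, `username`, `password`, and `getToken`. The following is a minimal sketch of a config with that shape plus a small check before connecting. The values are placeholders, `check_config` is a hypothetical helper, and the sketch assumes `getToken` is a JSON boolean; your actual `config.json` may contain additional keys.

```python
import json

# Keys read by the cells in this notebook.
REQUIRED_KEYS = ("host", "username", "password", "getToken")


def check_config(config):
    """Raise if a required key is missing or getToken is not a boolean."""
    missing = [key for key in REQUIRED_KEYS if key not in config]
    if missing:
        raise KeyError(f"config is missing keys: {missing}")
    if not isinstance(config["getToken"], bool):
        # Assumption: getToken is stored as true/false in the JSON file.
        raise TypeError("getToken should be true or false")
    return config


# Placeholder values; replace them with the details of your own database.
example_text = """
{
    "host": "http://127.0.0.1",
    "username": "tigergraph",
    "password": "tigergraph",
    "getToken": false
}
"""

config = check_config(json.loads(example_text))
print(sorted(config))
```

Validating the file up front tends to give a clearer error than a failed connection attempt later on.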

## Cora Dataset

### Download dataset

In [1]:
from pyTigerGraph.datasets import Datasets

dataset = Datasets("Cora")

### Create connection

In [4]:
from pyTigerGraph import TigerGraphConnection
import json

# Read in DB configs
with open('../config.json', "r") as config_file:
    config = json.load(config_file)

conn1 = TigerGraphConnection(
    host=config["host"],
    username=config["username"],
    password=config["password"],
)

### Ingest data

In [5]:
conn1.ingestDataset(dataset, getToken=config["getToken"])

---- Checking database ----
---- Creating graph ----
The graph Cora is created.
---- Creating schema ----
Using graph 'Cora'
Successfully created schema change jobs: [cora_schema].
Kick off schema change job cora_schema
Doing schema change on graph 'Cora' (current version: 0)
Trying to add local vertex 'Paper' to the graph 'Cora'.
Trying to add local edge 'Cite' to the graph 'Cora'.

Graph Cora updated to new version 1
The job cora_schema completes in 0.928 seconds!
---- Creating loading job ----
Using graph 'Cora'
Successfully created loading jobs: [load_cora_data].
---- Ingesting data ----
[[{'sourceFileName': 'Online_POST', 'statistics': {'validLine': 2708, 'rejectLine': 0, 'failedConditionLine': 0, 'notEnoughToken': 0, 'invalidJson': 0, 'oversizeToken': 0, 'vertex': [{'typeName': 'Paper', 'validObject': 2708, 'noIdFound': 0, 'invalidAttribute': 0, 'invalidVertexType': 0, 'invalidPrimaryId': 0, 'invalidSecondaryId': 0, 'incorrectFixedBinaryLength': 0}], 'edge': [], 'deleteVertex': [

### Visualize schema

In [None]:
from pyTigerGraph.visualization import drawSchema

drawSchema(conn1.getSchema(force=True))

## IMDB Dataset

### Download dataset

In [1]:
from pyTigerGraph.datasets import Datasets

dataset = Datasets("imdb")

### Create connection

In [2]:
from pyTigerGraph import TigerGraphConnection
import json

# Read in DB configs
with open('../config.json', "r") as config_file:
    config = json.load(config_file)

conn2 = TigerGraphConnection(
    host=config["host"],
    username=config["username"],
    password=config["password"],
)

### Ingest data

In [3]:
conn2.ingestDataset(dataset, getToken=config["getToken"])

---- Checking database ----
---- Creating graph ----
The graph imdb is created.
---- Creating schema ----
Using graph 'imdb'
Successfully created schema change jobs: [imdb_schema].
Kick off schema change job imdb_schema
Doing schema change on graph 'imdb' (current version: 0)
Trying to add local vertex 'movie' to the graph 'imdb'.
Trying to add local vertex 'actor' to the graph 'imdb'.
Trying to add local vertex 'director' to the graph 'imdb'.
Trying to add local edge 'actor_movie' to the graph 'imdb'.
Trying to add local edge 'director_movie' to the graph 'imdb'.
Trying to add local edge 'movie_actor' to the graph 'imdb'.
Trying to add local edge 'movie_director' to the graph 'imdb'.

Graph imdb updated to new version 1
The job imdb_schema completes in 1.249 seconds!
---- Creating loading job ----
Using graph 'imdb'
Successfully created loading jobs: [load_imdb_data].
---- Ingesting data ----
[[{'sourceFileName': 'Online_POST', 'statistics': {'validLine': 4278, 'rejectLine': 0, 'faile

### Visualize schema

In [4]:
from pyTigerGraph.visualization import drawSchema

drawSchema(conn2.getSchema(force=True))
