# Graph Building

The graph is built with two helper classes, one for creating the nodes and one for creating the relationships between them. Because each type of node and relationship depends on extracting the right data from the datasets, there is a dedicated method for building each type.

The **py2neo** library is used to build the graph by running Cypher queries directly. With this setup, the notebook can connect to either a local Neo4j instance or a production Neo4j database just by changing the host, port, user and password.
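As a rough sketch of how the connection settings could be swapped without editing the rest of the notebook, the helper below reads them from environment variables and falls back to local defaults. The variable names `NEO4J_HOST`, `NEO4J_PORT`, `NEO4J_USER` and `NEO4J_PASSWORD` are the ones used in the connection cell of this notebook; the function name `neo4j_settings` is hypothetical and only serves the illustration.

```python
import os


def neo4j_settings(env=None):
    """Return (bolt_uri, auth) built from environment variables.

    Hypothetical helper: any variable that is unset falls back to the
    local defaults used in this notebook.
    """
    env = os.environ if env is None else env
    host = env.get("NEO4J_HOST", "localhost")
    port = int(env.get("NEO4J_PORT", 7687))
    user = env.get("NEO4J_USER", "neo4j")
    password = env.get("NEO4J_PASSWORD", "test")
    return f"bolt://{host}:{port}", (user, password)


# An empty mapping stands in for an environment with no variables set,
# so the local defaults apply.
uri, auth = neo4j_settings({})
print(uri)
```

The returned pair can then be passed to py2neo as `Graph(uri, auth=auth)`, which is the same call shape used in the connection cell.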

In [1]:
# import libraries and load env vars
from py2neo import Graph
import pandas as pd
#import dotenv
#import os
#dotenv.load_dotenv()

In [2]:
# Import the graph building helper classes
from NodeBuilder import NodeBuilder
from RelBuilder import RelBuilder

In [3]:
# Load the question datasets
binary_questions = pd.read_json(
    "data/enriched-binary-questions.json",
    orient="records",
    convert_dates=False, # This is necessary, otherwise Pandas messes up date conversion.
)
continuous_questions = pd.read_json(
    "data/enriched-continuous-questions.json",
    orient="records",
    convert_dates=False, # This is necessary, otherwise Pandas messes up date conversion.
)


In [4]:
# Read the prediction datasets
binary_predictions = pd.read_json(
    "data/predictions-binary-hackathon.json",
    orient="records",
    convert_dates=False, # This is necessary, otherwise Pandas messes up date conversion.
)
continuous_predictions = pd.read_json(
    "data/predictions-continuous-hackathon-v2.json",
    orient="records",
    convert_dates=False, # This is necessary, otherwise Pandas messes up date conversion.
)

# Keep only the first 100 rows of each prediction dataset
binary_predictions = binary_predictions.iloc[:100]
continuous_predictions = continuous_predictions.iloc[:100]

In [5]:
# Variables to access the Neo4j database

# Using env vars
# neo4j_host = os.environ.get('NEO4J_HOST')
# neo4j_port = os.environ.get('NEO4J_PORT')
# neo4j_user = os.environ.get('NEO4J_USER')
# neo4j_password = os.environ.get('NEO4J_PASSWORD')

neo4j_host = "localhost"
neo4j_port = 7687
neo4j_user = "neo4j"
neo4j_password = "test"

In [6]:
# Create the graph instance (the building helpers are created further below)
graph = Graph(f"bolt://{neo4j_host}:{neo4j_port}",
              auth=(neo4j_user, neo4j_password))

### Node creation

To create the nodes, there is a method that extracts each type of node from the datasets and makes sure no duplicate nodes are created. It is a good idea to count how many nodes are going to be created before executing the query.
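The deduplication and pre-count described above might look like the following minimal sketch. It is not the `NodeBuilder` implementation, which is not shown in this notebook. The helper name `unique_node_records` and the `user_id` column are assumptions made for the example, and the toy rows stand in for the real prediction data.

```python
def unique_node_records(records, key):
    """Keep the first record seen for each distinct, non-missing value of `key`."""
    seen = {}
    for record in records:
        value = record.get(key)
        if value is not None and value not in seen:
            seen[value] = record
    return list(seen.values())


# Toy prediction rows; `user_id` is an assumed column name, not taken from the datasets.
predictions = [
    {"user_id": 1, "question_id": 10},
    {"user_id": 2, "question_id": 10},
    {"user_id": 1, "question_id": 11},
]

users = unique_node_records(predictions, key="user_id")
print(f"{len(users)} User nodes would be created")
```

With a pandas DataFrame, the rows could be obtained through `df.to_dict("records")` before calling the helper.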

After creating each type of node, you can check in the Neo4j Browser how many nodes were created. The queries below use the User nodes as an example.

To get all the User nodes:
```
MATCH (n:User) RETURN n
```

To count the User nodes:
```
MATCH (n:User) RETURN count(n) AS count
```
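The same count can also be run from the notebook instead of the Neo4j Browser. The sketch below assumes `graph` is the py2neo `Graph` created earlier and relies on its `evaluate` method, which returns the first value of the first record. The helper names `count_query` and `count_nodes` are hypothetical.

```python
import re

# Conservative pattern for a node label that is safe to interpolate.
_LABEL = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def count_query(label):
    """Build the Cypher count query for a node label.

    Labels cannot be passed as Cypher parameters, so the name is validated
    before being interpolated into the query text.
    """
    if not _LABEL.fullmatch(label):
        raise ValueError(f"invalid label: {label!r}")
    return f"MATCH (n:{label}) RETURN count(n) AS count"


def count_nodes(graph, label):
    """Run the count query; `graph` is expected to be a py2neo Graph."""
    return graph.evaluate(count_query(label))


print(count_query("User"))
```

With a live database, `count_nodes(graph, "User")` could be compared against the number of nodes expected from the datasets.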

In [None]:
# For creating nodes
node_builder = NodeBuilder(graph, binary_questions, binary_predictions, continuous_questions, continuous_predictions)

In [None]:
# Create the binary question nodes
node_builder.create_question_nodes(question_type='binary')

In [None]:
# Create the continuous question nodes
node_builder.create_question_nodes(question_type='continuous')

In [None]:
# Create the category nodes
node_builder.create_category_nodes()

In [None]:
# Create the topic nodes
node_builder.create_topic_nodes()

In [None]:
# Create user nodes
node_builder.create_user_nodes()

### Relationship creation
The nodes must already exist before the relationships between them can be created. There is a method for every kind of relationship. It is a good idea to know how many relationships are going to be created before executing the query, because in this use case relationships are far more numerous than nodes. Each relationship has a unique name, which makes it easy to identify which relationship is being queried or used.
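One way such a relationship step could be written is sketched below. It is not the `RelBuilder` implementation, which is not shown here. The `Category` and `Topic` labels, the `id` property, the row keys and the helper names are all assumptions for the example. Because the query uses `MATCH` for both endpoints, a row whose nodes do not exist produces no relationship, which is why the nodes have to be created first.

```python
def relationship_query(start_label, rel_type, end_label):
    """Cypher that links already-existing nodes, one relationship per input row."""
    return (
        "UNWIND $rows AS row "
        f"MATCH (a:{start_label} {{id: row.start_id}}) "
        f"MATCH (b:{end_label} {{id: row.end_id}}) "
        f"MERGE (a)-[:{rel_type}]->(b)"
    )


def batches(rows, size):
    """Yield consecutive slices of `rows` holding at most `size` items."""
    for start in range(0, len(rows), size):
        yield rows[start:start + size]


# Toy (category, topic) pairs; duplicates are dropped before counting.
pairs = [("c1", "t1"), ("c1", "t2"), ("c1", "t1")]
rows = [{"start_id": s, "end_id": e} for s, e in sorted(set(pairs))]
print(f"{len(rows)} CONTAINS relationships would be created")

query = relationship_query("Category", "CONTAINS", "Topic")
for chunk in batches(rows, size=1000):
    pass  # with a live database: graph.run(query, rows=chunk)
```

Sending the rows in batches keeps each transaction small, which helps when relationships greatly outnumber nodes.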

To check the created relationships, the following queries can be run in the Neo4j Browser. This example uses the CONTAINS relationship.

To get all the CONTAINS relationships:
```
MATCH ()-[r:CONTAINS]->()
RETURN r
```

To count how many CONTAINS relationships there are:
```
MATCH ()-[r:CONTAINS]->()
RETURN count(r) AS count
```

In [7]:
# For creating relationships
rel_builder = RelBuilder(graph, binary_questions, binary_predictions, continuous_questions, continuous_predictions)

In [None]:
# Create the category-topic relationships
# Relationship name: CONTAINS
rel_builder.create_category_topic_relations()

In [None]:
# Create the question-topic relationships
# Relationship name: HAS
rel_builder.create_question_topic_relations()

In [None]:
# Create the question-category relationships
# Relationship name: BELONGS TO
rel_builder.create_question_category_relations()

In [8]:
# Create the user-question relationships
# Relationship name: CONTAINS
rel_builder.create_user_question_relations()

Created 7 relationships
