## Prerequisites and installation instructions

To run this notebook, install BlueGraph using:

```
pip install bluegraph
```

# Introduction to PGFrames and semantic encoding


In [None]:
import random

import numpy as np
import pandas as pd

from nltk.corpus import words

In [None]:
from bluegraph import PandasPGFrame
from bluegraph.preprocess import ScikitLearnPGEncoder

__NB:__ If an nltk error occurs, run the following code (the 'words' corpus needs to be downloaded for semantic encoding of text properties):

```
import nltk
nltk.download('words')
```

## Example 1: small property graph

Initialize a `PandasPGFrame` from a list of nodes and a list of edges.

In [None]:
nodes = ["Alice", "Bob", "Eric", "John", "Anna", "Laura", "Matt"]

sources = [
    "Alice", "Alice", "Bob", "Bob", "Bob", "Eric", "Anna", "Anna", "Matt"
]
targets = [
    "Bob", "Eric", "Eric", "John", "Anna", "Anna", "Laura", "John", "John"
]
edges = list(zip(sources, targets))

frame = PandasPGFrame(nodes=nodes, edges=edges)

Get nodes and edges as lists.

In [None]:
frame.nodes()

In [None]:
frame.edges()

Add properties to nodes and edges. Here, all the properties have type `numeric`. The other available types are `category` and `text`.

In [None]:
age = [25, 9, 70, 42, 26, 35, 36]
frame.add_node_properties(
    {
        "@id": nodes,
        "age": age
    }, prop_type="numeric")

height = [180, 122, 173, 194, 172, 156, 177]
frame.add_node_properties(
    {
        "@id": nodes,
        "height": height
    }, prop_type="numeric")

weight = [75, 43, 68, 82, 70, 59, 81]
frame.add_node_properties(
    {
        "@id": nodes,
        "weight": weight
    }, prop_type="numeric")


weights = [1.0, 2.2, 0.3, 4.1, 1.5, 21.0, 1.0, 2.5, 7.5]
edge_weight = pd.DataFrame({
    "@source_id": sources,
    "@target_id": targets,
    "distance": weights
})
frame.add_edge_properties(edge_weight, prop_type="numeric")
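As a rough mental model, each property can be thought of as a pandas column aligned on the `@id` index (for nodes) or on the `@source_id`/`@target_id` pair (for edges). The sketch below uses plain pandas only: it does not call BlueGraph and is not a description of its internals. The variable names `node_table` and `edge_table` are illustrative.

```python
import pandas as pd

nodes = ["Alice", "Bob", "Eric"]

# Node properties: one row per node, indexed by "@id".
node_table = pd.DataFrame(
    {"@id": nodes, "age": [25, 9, 70], "height": [180, 122, 173]}
).set_index("@id")

# Edge properties: one row per edge, indexed by the (source, target) pair.
edge_table = pd.DataFrame({
    "@source_id": ["Alice", "Alice", "Bob"],
    "@target_id": ["Bob", "Eric", "Eric"],
    "distance": [1.0, 2.2, 0.3],
}).set_index(["@source_id", "@target_id"])

print(node_table.loc["Bob", "age"])                   # 9
print(edge_table.loc[("Alice", "Eric"), "distance"])  # 2.2
```

Keeping identifiers in the index rather than in ordinary columns makes lookups by node or by edge direct, which is convenient when properties are added one column at a time.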

Get nodes and edges as dataframes.

In [None]:
frame.nodes(raw_frame=True).sample(5)

In [None]:
frame.edges(raw_frame=True).sample(5)

## Example 2: Random graph with a given density

In this example, we generate a small random graph with a specified density, i.e. the ratio of realized edges to all possible edges between distinct pairs of nodes.
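To make the definition concrete, the following minimal sketch (plain Python, independent of BlueGraph) counts the possible unordered node pairs and measures the realized density of a randomly sampled edge list. It assumes an undirected graph without self-loops, so each pair is considered once. The names `possible_pairs`, `sample_edges`, and `realized_density` are illustrative and not part of BlueGraph.

```python
import itertools
import random


def possible_pairs(n_nodes):
    """Number of distinct unordered node pairs: n * (n - 1) / 2."""
    return n_nodes * (n_nodes - 1) // 2


def sample_edges(n_nodes, density, rng):
    """Keep each unordered pair (s, t), s < t, with probability `density`."""
    return [
        (s, t)
        for s, t in itertools.combinations(range(n_nodes), 2)
        if rng.random() < density
    ]


def realized_density(n_nodes, edges):
    """Fraction of possible pairs that are actually connected."""
    return len(edges) / possible_pairs(n_nodes)


rng = random.Random(0)  # fixed seed so the sketch is reproducible
toy_edges = sample_edges(70, 0.1, rng)
print(possible_pairs(70))               # 2415
print(realized_density(70, toy_edges))  # close to 0.1, depends on the seed
```

Because every pair is sampled independently, the realized density only approximates the requested value; the approximation tightens as the number of nodes grows.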

### Create a PandasPGFrame

In [None]:
N = 70  # number of nodes
density = 0.1  # density value

In [None]:
# Helper functions for graph generation

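def generate_targets(nodes, s, density=0.2):
    """Sample edges from `s` to every node `t` with `s < t`."""
    edges = []
    for t in nodes:
        if s < t:
            # Keep the edge (s, t) with probability `density`
            edge = np.random.choice([0, 1], p=[1 - density, density])
            if edge:
                edges.append([s, t])
    return edges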


def random_pgframe(n_nodes, density):
    nodes = list(range(n_nodes))

    edges = sum(
        map(lambda x: generate_targets(nodes, x, density), nodes), [])
    edges = pd.DataFrame(
        edges, columns=["@source_id", "@target_id"])
    edges_df = edges.set_index(["@source_id", "@target_id"])
    frame = PandasPGFrame(nodes=nodes, edges=edges_df.index)
    return frame

In [None]:
graph_frame = random_pgframe(N, density)

Get nodes and edges as dataframes.

In [None]:
graph_frame.nodes(raw_frame=True).sample(5)

In [None]:
graph_frame.edges(raw_frame=True).sample(5)

### Add node and edge types

Here we generate random types for nodes and edges.

In [None]:
types = ["Apple", "Orange", "Carrot"]
node_types = {
    n: np.random.choice(types, p=[0.5, 0.4, 0.1])
    for n in range(N)
}

In [None]:
graph_frame.add_node_types(node_types)

In [None]:
graph_frame.nodes(raw_frame=True).sample(5)

In [None]:
types = ["isFriend", "isEnemy"]
edge_types = {
    e: np.random.choice(types, p=[0.8, 0.2])
    for e in graph_frame.edges()
}

In [None]:
graph_frame.add_edge_types(edge_types)

In [None]:
graph_frame.edges(raw_frame=True).sample(5)

### Add node and edge properties

We add node properties of different data types (`numeric`, `category`, `text`) with randomly generated values.

In [None]:
weight = pd.DataFrame(
    [
        (n, np.random.normal(loc=35, scale=5))
        for n in graph_frame.nodes()
    ], 
    columns=["@id", "weight"]
)

In [None]:
graph_frame.add_node_properties(weight, prop_type="numeric")

In [None]:
color_values = ["red", "green", "blue"]

In [None]:
# Use a separate name for the list so that this cell can be re-run safely
colors = pd.DataFrame(
    [
        (n, np.random.choice(color_values))
        for n in graph_frame.nodes()
    ], 
    columns=["@id", "color"]
)

In [None]:
graph_frame.add_node_properties(colors, prop_type="category")

In [None]:
desc = pd.DataFrame(
    [
        (n, ' '.join(random.sample(words.words(), 20)))
        for n in graph_frame.nodes()
    ], 
    columns=["@id", "desc"]
)

In [None]:
graph_frame.add_node_properties(desc, prop_type="text")

In [None]:
graph_frame.nodes(raw_frame=True).sample(5)

In [None]:
graph_frame._node_prop_types

We add edge properties of different data types (`numeric`, `category`, `text`) with randomly generated values.

In [None]:
years = pd.DataFrame(
    [
        (s, t, np.random.randint(0, 20))
        for s, t in graph_frame.edges()
    ], 
    columns=["@source_id", "@target_id", "n_years"]
)

In [None]:
graph_frame.add_edge_properties(years, prop_type="numeric")

In [None]:
shapes = ["dashed", "dotted", "solid"]
shapes = pd.DataFrame(
    [
        (s, t, np.random.choice(shapes))
        for s, t in graph_frame.edges()
    ], 
    columns=["@source_id", "@target_id", "shapes"]
)

In [None]:
graph_frame.add_edge_properties(shapes, prop_type="category")

In [None]:
desc = pd.DataFrame(
    [
        (s, t, ' '.join(random.sample(words.words(), 20)))
        for s, t in graph_frame.edges()
    ], 
    columns=["@source_id", "@target_id", "desc"]
)

In [None]:
graph_frame.add_edge_properties(desc, prop_type="text")

In [None]:
graph_frame.edges(raw_frame=True).sample(5)

In [None]:
graph_frame._edge_prop_types

### Perform semantic encoding of properties

BlueGraph can convert node/edge properties of different data types into numerical vectors.

Create an encoder object for homogeneous encoding: the properties of all nodes (or edges) are encoded with feature vectors of the same length, independently of their type.
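To give an intuition for what such an encoding produces, here is a minimal sketch that standardizes a numeric property, one-hot encodes a categorical one, and concatenates the results so that every node gets a vector of the same length. It uses only NumPy on toy data and leaves out text encoding; it is an illustration under these simplifying assumptions, not BlueGraph's implementation, and the helper names `standardize` and `one_hot` are hypothetical.

```python
import numpy as np


def standardize(values):
    """Z-score a 1-D array: subtract the mean, divide by the std."""
    values = np.asarray(values, dtype=float)
    std = values.std()
    # Guard against constant columns, which would divide by zero.
    return (values - values.mean()) / (std if std > 0 else 1.0)


def one_hot(labels):
    """Return a (n_samples, n_categories) 0/1 matrix and the category order."""
    categories = sorted(set(labels))
    index = {c: i for i, c in enumerate(categories)}
    encoded = np.zeros((len(labels), len(categories)))
    for row, label in enumerate(labels):
        encoded[row, index[label]] = 1.0
    return encoded, categories


weight = [30.0, 35.0, 40.0, 35.0]
color = ["red", "green", "blue", "red"]

color_matrix, color_order = one_hot(color)
# One numeric column followed by one column per color category.
features = np.column_stack([standardize(weight), color_matrix])

print(color_order)     # ['blue', 'green', 'red']
print(features.shape)  # (4, 4)
```

Each row of `features` is the feature vector of one node: the first entry is the standardized weight, the remaining entries mark the node's color.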

In [None]:
hom_encoder = ScikitLearnPGEncoder(
    node_properties=["weight", "color", "desc"],
    edge_properties=["n_years", "shapes", "desc"],
    edge_features=True,
    heterogeneous=False,
    encode_types=True,
    drop_types=True,
    text_encoding="tfidf",
    standardize_numeric=True)

In [None]:
transformed_frame = hom_encoder.fit_transform(graph_frame)

In [None]:
transformed_frame.nodes(raw_frame=True).sample(5)

We can inspect the encoding models that BlueGraph created for the different node and edge properties.

In [None]:
hom_encoder._node_encoders

In [None]:
transformed_frame.edges(raw_frame=True).sample(5)

In [None]:
hom_encoder._edge_encoders

### Convert PGFrames to JSON

In [None]:
json_repr = graph_frame.to_json()

In [None]:
json_repr["nodes"][:2]

In [None]:
json_repr["edges"][:2]

Create a new `PandasPGFrame` from the generated representation.

In [None]:
new_frame = PandasPGFrame.from_json(json_repr)

In [None]:
new_frame.nodes(raw_frame=True).sample(5)
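If the representation needs to be stored on disk between sessions, one option is the standard `json` module. The sketch below round-trips a toy dictionary through a file; it assumes the dictionary contains only JSON-serializable values. Only the top-level `"nodes"` and `"edges"` keys come from this notebook; the record layout inside `toy_repr` is hypothetical.

```python
import json
import os
import tempfile

# Toy stand-in for the dictionary returned by `to_json()`; the record
# layout is hypothetical, only the "nodes"/"edges" keys are taken from
# the notebook.
toy_repr = {
    "nodes": [{"@id": 0, "weight": 35.2}, {"@id": 1, "weight": 31.7}],
    "edges": [{"@source_id": 0, "@target_id": 1, "n_years": 4}],
}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "graph.json")
    # Write the representation to disk...
    with open(path, "w") as f:
        json.dump(toy_repr, f)
    # ...and read it back.
    with open(path) as f:
        restored = json.load(f)

print(restored == toy_repr)  # True
```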