# GrowSmart Graph ML Proof of Concept

## Setup and imports

In [1]:
%load_ext graph_notebook.magics

The graph_notebook.magics extension is already loaded. To reload it, use:
  %reload_ext graph_notebook.magics


In [2]:
%graph_notebook_host growsmart-neptune.cluster-custom-cgogeml0cuty.eu-west-3.neptune.amazonaws.com

set host to growsmart-neptune.cluster-custom-cgogeml0cuty.eu-west-3.neptune.amazonaws.com


In [3]:
import os
import io
os.environ['DGLBACKEND'] = 'pytorch'
import pandas as pd
import dgl
import numpy as np
from neo4j import GraphDatabase
from neo4j.graph import Node, Relationship
from torch import tensor, stack
from torch.nn import Module
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.preprocessing import MinMaxScaler
from pandas.api.types import is_numeric_dtype
import torch.nn.functional as F
from dgl.nn import GraphConv
import torch
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
from dgl.nn import DeepWalk
from dgl.nn import DeepWalk
from torch.optim import SparseAdam
from torch.utils.data import DataLoader

In [4]:
uri = "bolt://growsmart-neptune.cluster-custom-cgogeml0cuty.eu-west-3.neptune.amazonaws.com:8182"
driver = GraphDatabase.driver(uri, auth=("username", "password"), encrypted=True)

In [5]:
def run_query(query, parameters=None):
    with driver.session() as session:
        result = session.run(query, parameters)
        return [record for record in result]

## Let's do a Proof of Concept

https://julsimon.medium.com/a-primer-on-graph-neural-networks-with-amazon-neptune-and-the-deep-graph-library-5ce64984a276  

The advantage of creating a Graph Neural Network and exploit it for regression or classification is that the Graph stores the underlying connections between the nodes. This makes them more powerful than classic DM or ML approaches where we don't have this advantage.  
In the case of GrowSmart, we can exploit the graph in many ways depending on the end user:
- For internal usage: Analysts within the company can learn trends, patterns, and evaluate the status of the plants and gardens.
- For customer usage: The model can serve directly the customer's application and tell them if their plants/gardens are healthy or not based on the previous data and on the weather forecasts.

We can stem many model ideas from these perspectives:
- For the users:
    - Are my plants humid enough?
    - Tomorrow it's going to be sunny, what is the expected ph level of my tomates?
- For the business:
    - It has been rainy in Barcelona this week, to which users should we issue a warning for their plants?
    
    
For this Proof of Concept, let's apply Node Regression to predict humidity based on sensor data. To do this we will use DGL (Deep Graph Library) and Pytorch to train a simple CGN (Convolutional Graph Network).

In [6]:
%%oc --store-to labels

MATCH (n)
WITH labels(n)[0] AS lbl
RETURN DISTINCT lbl

Tab(children=(Output(layout=Layout(max_height='600px', max_width='940px', overflow='scroll')), Output(layout=L…

In [7]:
%%oc --store-to labels_edge
MATCH (n)-[e]->(m)
WITH type(e) as lbl
RETURN DISTINCT lbl

Tab(children=(Output(layout=Layout(max_height='600px', max_width='940px', overflow='scroll')), Output(layout=L…

In [8]:
run_query("MATCH (n) RETURN n LIMIT 1")

[<Record n=<Node id=16777998 labels=frozenset({'Plant'}) properties={'iot_plant_id': '580b5c4d-933a-4b01-9b16-6ca2f81a3a1a', 'status': 1.0}>>]

In [9]:
def get_nodess_v2():
    def get_nodes():
        query = f"""
            MATCH (p:Plant)<-[m:MEASURES]-(s:Sensor)-[r:REGISTERS]->(e:Event)-[c:CONTAINS]->(sd)
            WITH p, m, s, r, e, c, sd
            WHERE sd['value'] IS NOT NULL AND labels(sd)[0] <> 'iot_light' AND labels(sd)[0] <> 'iot_motion'
            RETURN ID(p) AS plant_id, p['status'] AS plant_status, ID(e) AS event_id, sd['value'] AS value, labels(sd)[0] AS value_label
        """
        result = run_query(query)
        data = []
        label = None
        for record in result:
            row = dict(record)
            data.append(row)
        df = pd.DataFrame(data)
        stream = io.StringIO()
        df.to_csv(stream, index=False)
        stream.seek(0)
        df = pd.read_csv(stream)
        stream.close()
        return df
    node_dfs = get_nodes()
    node_dfs = node_dfs.reset_index(drop=True)
    grouped = node_dfs.groupby("value_label")
    groups = []
    scaler = MinMaxScaler()
    node_dfs['value'] = grouped['value'].transform(lambda x: scaler.fit_transform(x.values.reshape(-1, 1)).flatten())
    one_hot = OneHotEncoder(sparse=False)
    encoded = one_hot.fit_transform(node_dfs['value_label'].values.reshape(-1, 1))
    one_hot_df = pd.DataFrame(encoded, columns=one_hot.get_feature_names_out(['value_label']))
    node_dfs = pd.concat([node_dfs.drop('value_label', axis=1), one_hot_df], axis=1)
    cols = node_dfs.columns
    return node_dfs


nodes_df2 = get_nodess_v2()
node_indices = {id: i for i, id in enumerate(nodes_df2['plant_id'])}
nodes_df2

Unnamed: 0,plant_id,plant_status,event_id,value,value_label_iot_co,value_label_iot_humidity,value_label_iot_lpg,value_label_iot_rainfall,value_label_iot_smoke,value_label_iot_soil_humidity,value_label_iot_soil_nitrogen,value_label_iot_soil_ph,value_label_iot_soil_phosporous,value_label_iot_soil_potassium,value_label_iot_soil_temp,value_label_iot_temp
0,PLANT_2d04335e-db7d-4f83-a21c-104be59d1c85,0.0,EVENT_5ac10a11-ebe1-5f3a-a080-18ab29e1351b,0.539419,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
1,PLANT_3f6c4278-9698-4e08-b386-459fb50008b8,0.0,EVENT_5ac10a11-ebe1-5f3a-a080-18ab29e1351b,0.539419,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
2,PLANT_e8f48e96-1fd0-4435-b02a-d5e14f39196f,0.0,EVENT_5ac10a11-ebe1-5f3a-a080-18ab29e1351b,0.539419,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
3,PLANT_5a508647-c0e4-421f-8296-f0ff1ed02ea4,0.0,EVENT_5ac10a11-ebe1-5f3a-a080-18ab29e1351b,0.539419,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
4,PLANT_2d04335e-db7d-4f83-a21c-104be59d1c85,0.0,EVENT_5ac10a11-ebe1-5f3a-a080-18ab29e1351b,0.000000,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6163,PLANT_b76442c6-0eba-4951-bce7-c8e5702afdc1,1.0,EVENT_7ea74261-1d64-5a97-8322-f2ba24ed041d,0.815286,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6164,PLANT_b874295f-5878-497c-99ff-789be0da6bde,1.0,EVENT_7ea74261-1d64-5a97-8322-f2ba24ed041d,0.815286,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6165,PLANT_5b4530a0-6220-444d-80c8-23bb33ff63b8,1.0,EVENT_7ea74261-1d64-5a97-8322-f2ba24ed041d,0.815286,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6166,PLANT_ef8ca5c1-c5c6-4d26-adc0-263a49696043,1.0,EVENT_7ea74261-1d64-5a97-8322-f2ba24ed041d,0.000000,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [10]:
def get_edges(label):
    query = f"MATCH(n)-[e:{label}]->(m) RETURN type(e) AS lbl, id(e) AS idedge, id(n) AS start, id(m) AS end"
    result = run_query(query)
    data = []
    for record in result:
        id_ = record['idedge']
        start_node = record['start']
        end_node = record['end']
        label = record['lbl']
        row = {'id': id_, 'start_node': start_node, 'end_node': end_node, 'lbl': label}
        data.append(row)
    df = pd.DataFrame(data)
    return df

edge_labels = [lbl['lbl'] for lbl in labels_edge['results']]
edge_dfs = [get_edges(lbl) for lbl in edge_labels]
edges_df = pd.concat(edge_dfs)
edges_df = edges_df.reset_index(drop=True)

edges_df['source_idx'] = edges_df['start_node'].map(node_indices)
edges_df['target_idx'] = edges_df['end_node'].map(node_indices)

encoder = LabelEncoder()
encoded_labels_edges = encoder.fit_transform(edges_df['lbl'].unique())
encoded_labels_edges_map = {lbl: enclbl for lbl, enclbl in zip(edges_df['lbl'].unique(), encoded_labels_edges)}
edges_df['label_encoded'] = edges_df['lbl'].map(encoded_labels_edges_map)
edges_df.head()

Unnamed: 0,id,start_node,end_node,lbl,source_idx,target_idx,label_encoded
0,E_aad8eff4-cb94-5bda-9625-9f45249cb12b,SENSOR_18:24:as:kf:24:00,EVENT_4807bf3f-4d09-56bb-8966-10ec12f60329,REGISTERS,,,9
1,E_fdc98e75-005d-5866-a19d-19827977ce7f,SENSOR_18:24:as:kf:24:00,EVENT_cb653148-b27d-53ee-a29f-9245391f5665,REGISTERS,,,9
2,E_f5d874b6-126c-5a13-ab23-f96fee0205a5,SENSOR_o0:4e:ve:rt:1l:l1,EVENT_a744afb3-9172-5b43-8f1d-1ae77ab003df,REGISTERS,,,9
3,E_fa1fd744-e5a6-5544-b90c-73901ad29b09,SENSOR_o0:4e:ve:rt:1l:l1,EVENT_9d471536-8060-51c0-83e7-7abea80cb1b8,REGISTERS,,,9
4,E_c6f2744a-bf8b-5df2-b4dd-b15ecf566d67,SENSOR_o0:4e:ve:rt:1l:l1,EVENT_907471cb-db77-5928-beb2-f52cceb5bd4e,REGISTERS,,,9


### Defining the graph

In [11]:
g = dgl.graph((edges_df['source_idx'], edges_df['target_idx']))
for v in nodes_df2.columns:
    if 'value' in v:
        print(v)
        g.ndata[v] = tensor(nodes_df2[v])
g.ndata['plant_status'] = tensor(nodes_df2['plant_status'])
g = dgl.add_self_loop(g)

#g = dgl.graph((edges_df['source_idx'], edges_df['target_idx']))
#for valcol in nodes_df.columns:
#    if 'encoded' in valcol or nodes_df[valcol].dtype == np.float64:
#        g.ndata[valcol] = tensor(nodes_df[valcol])
#g.edata['encoded_edge_labels'] = tensor(edges_df['label_encoded'])
#g = dgl.add_self_loop(g)

value
value_label_iot_co
value_label_iot_humidity
value_label_iot_lpg
value_label_iot_rainfall
value_label_iot_smoke
value_label_iot_soil_humidity
value_label_iot_soil_nitrogen
value_label_iot_soil_ph
value_label_iot_soil_phosporous
value_label_iot_soil_potassium
value_label_iot_soil_temp
value_label_iot_temp


### Training the Network

In [12]:
class GCNModel(Module):
    def __init__(self, in_feats, h_feats):
        super(GCNModel, self).__init__()
        self.conv1 = GraphConv(in_feats, h_feats)
        self.conv2 = GraphConv(h_feats, 1)

    def forward(self, g, in_feat):
        h =  self.conv1(g, in_feat)
        h = F.relu(h)
        h = self.conv2(g, h)
        return h.squeeze()

In [13]:
feature_cols = [valcol for valcol in g.ndata if valcol != 'label_encoded']
features = stack([g.ndata[feat] for feat in g.ndata], dim=1)
model_labels = g.ndata['plant_status'].float()
num_classes = len(model_labels.unique())
train_mask, test_mask = train_test_split(range(len(features)), test_size=0.2, random_state=42)
train_mask = tensor(train_mask).numpy()
test_mask = tensor(test_mask).numpy()
input_features = features.shape[1]
h_feats = 7
num_epochs = 25
learning_rate = 0.01

#feature_cols = [valcol for valcol in g.ndata if valcol != 'label_encoded']
#features = stack([g.ndata[valcol] for valcol in g.ndata if valcol != 'label_encoded'], dim=1)
#model_labels = g.ndata['status_Plant'].long()
#num_classes = len(model_labels.unique())
#train_mask, test_mask = train_test_split(range(len(features[0])), test_size=0.2, random_state=42)
#train_mask = tensor(train_mask).numpy()
#test_mask = tensor(test_mask).numpy()
#input_features = features.shape[1]
#h_feats = 200
#num_epochs = 5
#learning_rate = 1

In [14]:
model = GCNModel(input_features, h_feats).double()

In [15]:
def train(g, model, features=features, labels=model_labels, 
          train_mask=train_mask, test_mask=test_mask, lr=learning_rate, num_epochs=num_epochs):
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = torch.nn.BCELoss()
    
    best_val_acc = 0
    best_test_acc = 0
    for e in range(num_epochs):
        model.train()
        # Forward
        logits = model(g, features)
        
        # Compute loss
        probs = torch.sigmoid(logits)
        loss = criterion(probs[train_mask], labels[train_mask].double())

        
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        # Compute accuracy on training/validation/test
        train_pred = (probs[train_mask] > 0.5).float()
        test_pred = (probs[test_mask] > 0.5).float()
        train_acc = (train_pred == labels[train_mask]).float().mean().item()
        test_acc = (test_pred == labels[test_mask]).float().mean().item()
        
        print(f"Epoch {e}, Loss {loss}, Train acc {train_acc}, Test acc. {test_acc}")
    return model

model = train(g, model)
torch.save(model.state_dict(), 'model.pth')

#fig = plt.figure()
#ax1 = fig.add_subplot()
#ax1.scatter(x=[i for i in range(len(losses))], c='b',y=losses, label='test', marker='s')
#plt.show()

Epoch 0, Loss 0.5727231340458435, Train acc 0.5999189019203186, Test acc. 0.5769854187965393
Epoch 1, Loss 0.5603673704581451, Train acc 0.5999189019203186, Test acc. 0.5769854187965393
Epoch 2, Loss 0.5480939872069605, Train acc 0.5999189019203186, Test acc. 0.5769854187965393
Epoch 3, Loss 0.5358935709293946, Train acc 0.6015403270721436, Test acc. 0.5777958035469055
Epoch 4, Loss 0.5238732046815369, Train acc 0.6126874685287476, Test acc. 0.5956239700317383
Epoch 5, Loss 0.5119984897249034, Train acc 0.6187677383422852, Test acc. 0.5988654494285583
Epoch 6, Loss 0.5002890047299575, Train acc 0.6361978054046631, Test acc. 0.614262580871582
Epoch 7, Loss 0.4889789498257926, Train acc 0.6840291619300842, Test acc. 0.664505660533905
Epoch 8, Loss 0.4778362605721039, Train acc 0.7081475257873535, Test acc. 0.6815235018730164
Epoch 9, Loss 0.46686082185675915, Train acc 0.7130117416381836, Test acc. 0.6863857507705688
Epoch 10, Loss 0.4561242953567556, Train acc 0.7383461594581604, Test a

This model can be used to use classification to predict if a Plant is Healthy or not, based on sensor data.

### Embeddings

We can also use an algorithm like DeepWalks to learn embeddings.

In [None]:
model = DeepWalk(g)
dataloader = DataLoader(torch.arange(g.num_nodes()), batch_size=128, shuffle=True, collate_fn=model.sample)
optimizer = SparseAdam(model.parameters(), lr=0.01)
num_epochs = 5

for epoch in range(num_epochs):
    for batch_walk in dataloader:
        loss = model(batch_walk)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

In [None]:
embeddings = model.node_embed.weight.detach()
embeddings

In [42]:
%%oc
MATCH (p:Plant)
RETURN ID(p), p.status

Tab(children=(Output(layout=Layout(max_height='600px', max_width='940px', overflow='scroll')), Output(layout=L…

In [100]:
%%oc
MATCH (p:Plant)<-[m:MEASURES]-(s:Sensor)-[r:REGISTERS]->(e:Event)-[c:CONTAINS]->(sd)
WITH p, m, s, r, e, c, sd
WHERE sd['value'] IS NOT NULL AND labels(sd)[0] <> 'iot_light' AND labels(sd)[0] <> 'iot_motion'
RETURN sd['value'] AS value, labels(sd)[0] AS value_label

Tab(children=(Output(layout=Layout(max_height='600px', max_width='940px', overflow='scroll')), Output(layout=L…