# Federated Learning - SMS spam prediction with a GRU model

**NOTE**: At the time of running this notebook, we were running the grid components in background mode.

**NOTE**: Components:

* Grid Gateway (http://localhost:5000)
* Grid Node Bob (http://localhost:3000)
* Grid Node Alice (http://localhost:3001)

 
To **start the gateway**:
* ```cd gateway```
* ```python gateway.py --start_local_db --port=5000```


To **start one grid node**:

* ```cd app/websocket/```

* ```python websocket_app.py --start_local_db --id=alice --port=3001 --gateway_url=http://localhost:5000```
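
Before running the cells below, it can help to confirm that each component is actually accepting connections. The following is a minimal sketch using only the standard library; the helper names `is_listening` and `report` are hypothetical, and the gateway address is assumed to follow the `--port=5000` start command above.

```python
import socket
from urllib.parse import urlparse

# Addresses assumed from the component list and start commands above.
COMPONENTS = {
    "gateway": "http://localhost:5000",
    "bob": "http://localhost:3000",
    "alice": "http://localhost:3001",
}


def is_listening(url, timeout=1.0):
    """Return True if a TCP connection to the URL's host and port succeeds."""
    parsed = urlparse(url)
    try:
        with socket.create_connection((parsed.hostname, parsed.port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timeout, or name resolution failure.
        return False


def report(components=COMPONENTS):
    """Map each component name to whether its port accepts connections."""
    return {name: is_listening(url) for name, url in components.items()}


if __name__ == "__main__":
    for name, up in report().items():
        print(f"{name}: {'up' if up else 'not reachable'}")
```

This only checks that something is listening on the port; it does not verify that the process behind it is a grid component.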

This notebook was made based on [Federated SMS Spam prediction](https://github.com/OpenMined/PySyft/tree/master/examples/tutorials/advanced/Federated%20SMS%20Spam%20prediction).

Authors:
* André Macedo Farias: GitHub: [@andrelmfarias](https://github.com/andrelmfarias) | Twitter: [@andrelmfarias](https://twitter.com/andrelmfarias)
* George Muraru: GitHub: [@gmuraru](https://github.com/gmuraru) | Twitter: [@georgemuraru](https://twitter.com/georgemuraru) | Facebook: [@George Cristian Muraru](https://www.facebook.com/georgecmuraru)

## Useful imports

In [1]:
import numpy as np

import torch

import syft as sy
from syft.workers.node_client import NodeClient

<h2>Setup config</h2>
Initialize the PySyft hook and connect to the grid nodes.

In [2]:
hook = sy.TorchHook(torch)

# Connect directly to grid nodes
nodes = ["ws://localhost:3000/",
         "ws://localhost:3001/"]

compute_nodes = []
for node in nodes:
    compute_nodes.append(NodeClient(hook, node))

# Load Dataset

## 1) Download (if not present) and preprocess dataset

In [3]:
import os
import urllib.request
import pathlib
from zipfile import ZipFile

URL = "https://archive.ics.uci.edu/ml/machine-learning-databases/00228/smsspamcollection.zip"
DATASET_NAME = "smsspamcollection"

def dataset_exists():
    return os.path.isfile('./data/inputs.npy') and \
    os.path.isfile('./data/labels.npy')
    
if not dataset_exists():
    # The preprocessed dataset is missing, so download the raw archive from the URL where it is hosted
    print('Downloading the dataset with urllib to the ./data directory...')
    pathlib.Path("data").mkdir(exist_ok=True)
    urllib.request.urlretrieve(URL, './data/data.zip')
    print("The dataset was successfully downloaded")
    print("Unzipping the dataset...")
    with ZipFile('./data/data.zip', 'r') as zipObj:
        # Extract all the contents of the zip file into ./data
        zipObj.extractall("./data")
    print("Dataset successfully unzipped")
    
    from preprocess import preprocess_spam

    preprocess_spam()
else:
    print("Not downloading the dataset because it was already downloaded")
    


Not downloading the dataset because it was already downloaded


## 2) Loading data

As we are mostly interested in the usage of PySyft and Federated Learning, I will skip the text-preprocessing part of the project. If you are interested in how I preprocessed the raw dataset, you can take a look at the script [preprocess.py](https://github.com/OpenMined/PyGrid/tree/master/examples/data/SMS-spam/preprocess.py).

Each data point of the `inputs.npy` dataset corresponds to an array of 30 tokens obtained from one message (padded on the left or truncated on the right).
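
To make the fixed-length layout concrete, here is a minimal sketch of left-padding and right-truncation in plain Python. It is not the code from `preprocess.py`: the helper name `pad_or_truncate` is hypothetical, and using `0` as the padding id is an assumption.

```python
SEQ_LEN = 30  # sequence length stated in the text
PAD_ID = 0    # assumption: 0 is the padding token id


def pad_or_truncate(tokens, seq_len=SEQ_LEN, pad_id=PAD_ID):
    """Left-pad short sequences and drop tokens beyond seq_len on the right."""
    tokens = list(tokens)[:seq_len]               # truncate on the right
    padding = [pad_id] * (seq_len - len(tokens))  # empty if already full length
    return padding + tokens                       # pad on the left


short = pad_or_truncate([7, 8, 9], seq_len=5)
long = pad_or_truncate(range(1, 9), seq_len=5)
print(short)  # [0, 0, 7, 8, 9]
print(long)   # [1, 2, 3, 4, 5]
```

Padding on the left keeps the real tokens at the end of the sequence, which is a common choice for recurrent models such as a GRU since the final hidden state is then computed from actual message content.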

The `labels.npy` dataset has the following unique values: `1` for `spam` and `0` for `non-spam`.

In [4]:
inputs = np.load('./data/inputs.npy')
labels = np.load('./data/labels.npy')

In [5]:
# Split inputs and labels into one chunk per node (each result is a tuple of tensors).
# Both splits use the same chunk size so that inputs and labels stay aligned.
chunk_size = len(inputs) // len(compute_nodes)
datasets_spam = torch.split(torch.tensor(inputs), chunk_size, dim=0)
labels_spam = torch.split(torch.tensor(labels), chunk_size, dim=0)
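
The split above works cleanly because the dataset size is divisible by the number of nodes. When `torch.split` is given an integer chunk size and the length is not divisible by it, the last chunk is smaller, so there can be one more chunk than there are nodes, and the loops that follow would only use the first `len(compute_nodes)` chunks. The sketch below mimics the resulting chunk lengths in plain Python; `split_sizes` is a hypothetical helper, not part of PyTorch.

```python
def split_sizes(n_samples, n_nodes):
    """Chunk lengths obtained when n_samples rows are cut into pieces of
    size n_samples // n_nodes, with a smaller final piece for any remainder."""
    chunk = n_samples // n_nodes
    if chunk == 0:
        raise ValueError("fewer samples than nodes")
    sizes = [chunk] * (n_samples // chunk)
    if n_samples % chunk:
        sizes.append(n_samples % chunk)  # leftover rows form an extra chunk
    return sizes


print(split_sizes(5572, 2))  # [2786, 2786]
print(split_sizes(5573, 2))  # [2786, 2786, 1]
print(split_sizes(10, 3))    # [3, 3, 3, 1]
```

With 5572 samples and two nodes, each node receives 2786 rows, which matches the tensor shapes printed when the data is sent. For sizes that do not divide evenly, the leftover rows would need to be dropped or assigned to a node explicitly.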

<h2>3) Tagging tensors</h2>
The code below adds tags (of your choice) and a short description to each tensor before it is sent to a grid node. The tags matter because the gateway needs them to retrieve this data later.

In [6]:
tag_img = []
tag_label = []


for i in range(len(compute_nodes)):
    tag_img.append(datasets_spam[i].tag("#X", "#spam", "#dataset").describe("The input datapoints to the SPAM dataset."))
    tag_label.append(labels_spam[i].tag("#Y", "#spam", "#dataset").describe("The input labels to the SPAM dataset."))
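
To illustrate why tags matter for later retrieval, here is a toy model of tag-based search in plain Python. It is only a sketch of the idea and does not use or reproduce the PySyft or PyGrid API; `ToyNode` and `search_all` are hypothetical names, and the stored strings stand in for tensors.

```python
class ToyNode:
    """Hypothetical stand-in for a grid node: stores objects with their tags."""

    def __init__(self, name):
        self.name = name
        self._store = []  # list of (tags, obj) pairs

    def add(self, obj, *tags):
        self._store.append((frozenset(tags), obj))

    def search(self, *query):
        """Return objects whose tag set contains every queried tag."""
        wanted = set(query)
        return [obj for tags, obj in self._store if wanted <= tags]


def search_all(nodes, *query):
    """Query every node, keeping only nodes with at least one match."""
    hits = {node.name: node.search(*query) for node in nodes}
    return {name: objs for name, objs in hits.items() if objs}


bob, alice = ToyNode("Bob"), ToyNode("Alice")
bob.add("bob_inputs", "#X", "#spam", "#dataset")
bob.add("bob_labels", "#Y", "#spam", "#dataset")
alice.add("alice_inputs", "#X", "#spam", "#dataset")

print(search_all([bob, alice], "#X", "#spam"))
# {'Bob': ['bob_inputs'], 'Alice': ['alice_inputs']}
print(search_all([bob, alice], "#Y"))
# {'Bob': ['bob_labels']}
```

Using one tag per role (`#X`, `#Y`) plus a shared dataset tag (`#spam`) lets a query select inputs and labels separately while still scoping the search to a single dataset.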

<h2> 4) Sending our tensors to grid nodes</h2>

In [7]:
# NOTE: Sending tensors within a loop has shown strange behavior,
# e.g. tag_img[i].send(compute_nodes[i]).
# When resolved, this should be updated.

for i in range(len(compute_nodes)):
    shared_x = tag_img[i].send(compute_nodes[i], garbage_collect_data=False)
    shared_y = tag_label[i].send(compute_nodes[i], garbage_collect_data=False)
    print("X tensor pointers: ", shared_x, shared_y)


X tensor pointers:  (Wrapper)>[PointerTensor | me:4887477248 -> Bob:58176119719]
	Tags: #dataset #X #spam 
	Shape: torch.Size([2786, 30])
	Description: The input datapoints to the SPAM dataset.... (Wrapper)>[PointerTensor | me:7448093939 -> Bob:34347751563]
	Tags: #dataset #spam #Y 
	Shape: torch.Size([2786])
	Description: The input labels to the SPAM dataset....
X tensor pointers:  (Wrapper)>[PointerTensor | me:15289568769 -> Alice:16627579692]
	Tags: #dataset #X #spam 
	Shape: torch.Size([2786, 30])
	Description: The input datapoints to the SPAM dataset.... (Wrapper)>[PointerTensor | me:42923736144 -> Alice:89030703951]
	Tags: #dataset #spam #Y 
	Shape: torch.Size([2786])
	Description: The input labels to the SPAM dataset....


## Disconnect Nodes

In [8]:
for node in compute_nodes:
    node.close()