# Load NumPy data in SecretFlow

The following code is for demonstration only. For system security reasons, it is **NOT for production**; please **DO NOT** use it directly in production.

This tutorial demonstrates how to load NumPy data in a multi-party secure environment using SecretFlow.  
SecretFlow supports multiple formats, including `.npy` and `.npz`, and its interface is designed to be compatible with `numpy`.

## Environment Configuration

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import secretflow as sf

# Check the version of your SecretFlow
print('The version of SecretFlow: {}'.format(sf.__version__))

# In case you have a running secretflow runtime already.
sf.shutdown()
sf.init(['alice', 'bob', 'charlie'], address="local", log_to_driver=True)
alice, bob, charlie = sf.PYU('alice'), sf.PYU('bob'), sf.PYU('charlie')

The version of SecretFlow: 1.4.0.dev20240105


2024-01-10 04:11:58,731	INFO worker.py:1538 -- Started a local Ray instance.


## Interface Introduction

In SecretFlow, we provide `secretflow.data.ndarray.load`, an interface similar to `numpy.load`, to load `ndarray` data from multiple parties and convert it into a federated representation.

Using `secretflow.data.ndarray.load`, you can read NumPy files from multiple parties and create a `FedNdarray` object.

Interface documentation: [secretflow.data.ndarray.load](https://www.secretflow.org.cn/docs/secretflow/en/source/secretflow.data.html#secretflow.data.ndarray.load)
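
Because the interface mirrors `numpy.load`, it can help to recall what plain NumPy returns for each file type. The following minimal sketch uses only NumPy and temporary files (the file names are illustrative): a `.npy` file loads as a single `ndarray`, whereas a `.npz` archive loads as a dict-like object keyed by the names passed to `np.savez`. The same distinction shows up later in this tutorial between a single `FedNdarray` and a dictionary of `FedNdarray` objects.

```python
import os
import tempfile

import numpy as np

tmp_dir = tempfile.mkdtemp()
npy_path = os.path.join(tmp_dir, "demo.npy")  # illustrative file names
npz_path = os.path.join(tmp_dir, "demo.npz")

x = np.arange(6).reshape(2, 3)
y = np.array([0, 1])

np.save(npy_path, x)          # one array per .npy file
np.savez(npz_path, x=x, y=y)  # several named arrays per .npz archive

# A .npy file comes back as a plain ndarray.
loaded_npy = np.load(npy_path)
print(type(loaded_npy).__name__, loaded_npy.shape)

# A .npz archive comes back as a dict-like NpzFile; copy the arrays out
# before the archive is closed.
with np.load(npz_path) as loaded_npz:
    npz_keys = sorted(loaded_npz.files)
    npz_arrays = {key: loaded_npz[key] for key in npz_keys}
print(npz_keys)
```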


## Data Download and Splitting

In [3]:
%%capture
%%!
wget https://secretflow-data.oss-accelerate.aliyuncs.com/datasets/mnist/mnist.npz
pip install opencv-python

In [4]:
import numpy as np

all_data = np.load("./mnist.npz")

Split the training and test sets between Alice and Bob.

In [5]:
# The standard MNIST split has 60,000 training and 10,000 test samples,
# so each party gets half of each set.
alice_train_x = all_data["x_train"][:30000]
alice_test_x = all_data["x_test"][:5000]
alice_train_y = all_data["y_train"][:30000]
alice_test_y = all_data["y_test"][:5000]

bob_train_x = all_data["x_train"][30000:]
bob_test_x = all_data["x_test"][5000:]
bob_train_y = all_data["y_train"][30000:]
bob_test_y = all_data["y_test"][5000:]
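
NumPy slicing does not raise an error when a bound exceeds the array length; it silently returns fewer (or zero) rows. A quick size check after splitting catches an unbalanced or empty partition. The sketch below uses a small synthetic array in place of MNIST, and `split_in_two` is a hypothetical helper, not part of SecretFlow or NumPy.

```python
import numpy as np


def split_in_two(arr, n_first):
    """Hypothetical helper: split along axis 0 and refuse empty partitions."""
    first, second = arr[:n_first], arr[n_first:]
    if len(first) == 0 or len(second) == 0:
        raise ValueError(
            f"empty partition: sizes {len(first)} and {len(second)} "
            f"for split index {n_first} on {len(arr)} rows"
        )
    return first, second


data = np.arange(10)  # synthetic stand-in for a 10-row dataset

# Out-of-range slice bounds are clipped silently by NumPy.
clipped_head, clipped_tail = data[:30], data[30:]
print(len(clipped_head), len(clipped_tail))

# A split index inside the array yields two non-empty partitions.
left, right = split_in_two(data, 5)
print(left.shape, right.shape)
```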

Save each party's data as a separate `.npz` file.

In [6]:
np.savez(
    "./alice_mnist.npz",
    train_x=alice_train_x,
    test_x=alice_test_x,
    train_y=alice_train_y,
    test_y=alice_test_y,
)
np.savez(
    "./bob_mnist.npz",
    train_x=bob_train_x,
    test_x=bob_test_x,
    train_y=bob_train_y,
    test_y=bob_test_y,
)

Save `train_x` from Alice and Bob in `.npy` format as well, for convenient reading later.

In [7]:
np.save("./alice_mnist_train_x.npy", alice_train_x)
np.save("./bob_mnist_train_x.npy", bob_train_x)

## Loading npz files

In [8]:
alice_path = "./alice_mnist.npz"
bob_path = "./bob_mnist.npz"

In [9]:
from secretflow.data.ndarray import load
from secretflow.data.split import train_test_split

In [10]:
fed_npz = load({alice: alice_path, bob: bob_path}, allow_pickle=True)

In [11]:
fed_npz

{'train_x': FedNdarray(partitions={PYURuntime(alice): <secretflow.device.device.pyu.PYUObject object at 0x7fc6d05d0c40>, PYURuntime(bob): <secretflow.device.device.pyu.PYUObject object at 0x7fc7280f94f0>}, partition_way=<PartitionWay.VERTICAL: 'vertical'>),
 'test_x': FedNdarray(partitions={PYURuntime(alice): <secretflow.device.device.pyu.PYUObject object at 0x7fc7280fc310>, PYURuntime(bob): <secretflow.device.device.pyu.PYUObject object at 0x7fc7280fc7f0>}, partition_way=<PartitionWay.VERTICAL: 'vertical'>),
 'train_y': FedNdarray(partitions={PYURuntime(alice): <secretflow.device.device.pyu.PYUObject object at 0x7fc7280fc700>, PYURuntime(bob): <secretflow.device.device.pyu.PYUObject object at 0x7fc7280fc4f0>}, partition_way=<PartitionWay.VERTICAL: 'vertical'>),
 'test_y': FedNdarray(partitions={PYURuntime(alice): <secretflow.device.device.pyu.PYUObject object at 0x7fc7280fcb20>, PYURuntime(bob): <secretflow.device.device.pyu.PYUObject object at 0x7fc7280fcc40>}, partition_way=<Partiti

For `.npz` files, the result (referred to here as FedNpz) is a dictionary in which each value is a `FedNdarray`.

In [12]:
type(fed_npz["train_x"])

secretflow.data.ndarray.ndarray.FedNdarray

## Loading npy files

Loading `.npy` files is straightforward: call the `load` interface directly, and the result is a standard `FedNdarray` object.

In [13]:
alice_path = "./alice_mnist_train_x.npy"
bob_path = "./bob_mnist_train_x.npy"

In [14]:
fed_ndarray = load({alice: alice_path, bob: bob_path}, allow_pickle=True)

In [15]:
type(fed_ndarray)

secretflow.data.ndarray.ndarray.FedNdarray

## How can I convert my existing data into a FedNdarray and read it?

How can other types of data be converted into `FedNdarray` data?  
For example, if we have an image dataset or a speech dataset, how can we pass it to a federated model using `FedNdarray`?  
Let's take the Flower classification dataset as an example.

In [16]:
import tempfile
import tensorflow as tf

_temp_dir = tempfile.mkdtemp()
path_to_flower_dataset = tf.keras.utils.get_file(
    "flower_photos",
    "https://storage.googleapis.com/download.tensorflow.org/example_images/flower_photos.tgz",
    untar=True,
    cache_dir=_temp_dir,
)

2024-01-10 04:16:01.913420: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2024-01-10 04:16:03.357931: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory
2024-01-10 04:16:03.358139: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory


Downloading data from https://storage.googleapis.com/download.tensorflow.org/example_images/flower_photos.tgz


After downloading and extracting the dataset, the root directory of the dataset is "flower_photos".

In [19]:
import os, glob
import numpy as np
import cv2  # Must be installed manually: pip install opencv-python

root = path_to_flower_dataset
classes = ['daisy', 'dandelion', 'roses', 'sunflowers', 'tulips']
img_paths = []  # Paths of all images
labels = []  # Class index of each image (0, 1, 2, 3, 4)
for i, label in enumerate(classes):
    cls_img_paths = glob.glob(os.path.join(root, label, "*.jpg"))
    img_paths.extend(cls_img_paths)
    labels.extend([i] * len(cls_img_paths))

# Convert each image to a NumPy array
img_numpys = []
labels = np.array(labels)
for img_path in img_paths:
    img_numpy = cv2.imread(img_path)  # Pixels are returned in BGR channel order
    img_numpy = cv2.resize(img_numpy, (240, 240))
    img_numpy = np.reshape(img_numpy, (1, 240, 240, 3))
    # With a PyTorch backend, move the channel axis first: (N, H, W, C) -> (N, C, H, W)
    # img_numpy = np.transpose(img_numpy, (0,3,1,2))
    img_numpys.append(img_numpy)

images = np.concatenate(img_numpys, axis=0)
print(images.shape)
print(labels.shape)

# Distribute images and labels to two nodes, allocating 50% of the data to each node.
per = 0.5
alice_images = images[: int(per * images.shape[0]), :, :, :]
alice_label = labels[: int(per * images.shape[0])]
bob_images = images[int(per * images.shape[0]) :, :, :, :]
bob_label = labels[int(per * images.shape[0]) :]
print(
    f"alice images shape = {alice_images.shape}, alice labels shape = {alice_label.shape}"
)
print(f"bob images shape = {bob_images.shape}, bob labels shape = {bob_label.shape}")

# Save the data as npz files separately, and then send them to the two machines.
np.savez("flower_alice.npz", image=alice_images, label=alice_label)
np.savez("flower_bob.npz", image=bob_images, label=bob_label)

(3670, 240, 240, 3)
(3670,)
alice images shape = (1835, 240, 240, 3), alice labels shape = (1835,)
bob images shape = (1835, 240, 240, 3), bob labels shape = (1835,)
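
Note that the loop above appends images class by class, so `images` and `labels` are ordered by class, and a plain 50% slice gives each party a different class mix. If a more even mix is wanted, one option is to apply the same random permutation to both arrays before slicing. The following is a minimal sketch on synthetic data; the seed and array sizes are chosen only for illustration.

```python
import numpy as np

# Synthetic stand-ins: 10 tiny "images" whose labels are sorted by class.
images = np.arange(10 * 2 * 2 * 3).reshape(10, 2, 2, 3)
labels = np.repeat(np.arange(5), 2)

rng = np.random.default_rng(seed=0)  # fixed seed, assumed for reproducibility
perm = rng.permutation(len(labels))

# Index both arrays with the same permutation so image/label pairs stay aligned.
shuffled_images = images[perm]
shuffled_labels = labels[perm]

half = len(labels) // 2
alice_images, bob_images = shuffled_images[:half], shuffled_images[half:]
alice_label, bob_label = shuffled_labels[:half], shuffled_labels[half:]
print(alice_images.shape, bob_images.shape)
```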


Once you have the required `.npz` files, use the `load` function described earlier to read them into `FedNdarray` format, then feed them into the model to begin training.

In [20]:
fed_flower_npz = load(
    {alice: "./flower_alice.npz", bob: "./flower_bob.npz"}, allow_pickle=True
)

In [21]:
fed_flower_npz

{'image': FedNdarray(partitions={PYURuntime(alice): <secretflow.device.device.pyu.PYUObject object at 0x7fc5cce476a0>, PYURuntime(bob): <secretflow.device.device.pyu.PYUObject object at 0x7fc5ccf69370>}, partition_way=<PartitionWay.VERTICAL: 'vertical'>),
 'label': FedNdarray(partitions={PYURuntime(alice): <secretflow.device.device.pyu.PYUObject object at 0x7fc5ccf19490>, PYURuntime(bob): <secretflow.device.device.pyu.PYUObject object at 0x7fc5ccf199d0>}, partition_way=<PartitionWay.VERTICAL: 'vertical'>)}

In [22]:
fed_image = fed_flower_npz["image"]

In [23]:
fed_image.partition_shape()

{PYURuntime(alice): (1835, 240, 240, 3), PYURuntime(bob): (1835, 240, 240, 3)}
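
As a rough mental model for the output above, the per-party shapes can be pictured with a plain dictionary that maps each party to the array it holds locally. The sketch below is a NumPy-only stand-in with made-up party keys and array sizes; it is not how SecretFlow stores or protects the data, and it does not use any SecretFlow API.

```python
import numpy as np

# Plain-NumPy stand-in: party names map to the arrays each party holds locally.
# This is only a mental model, not SecretFlow's actual representation.
partitions = {
    "alice": np.zeros((3, 4, 4, 3), dtype=np.uint8),
    "bob": np.zeros((2, 4, 4, 3), dtype=np.uint8),
}

# Analogue of the per-party shape report shown above.
shapes = {party: arr.shape for party, arr in partitions.items()}
print(shapes)

# Total sample count when the parties hold different rows of the same dataset.
total_rows = sum(shape[0] for shape in shapes.values())
print(total_rows)
```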

## Tips

Before testing with the SecretFlow federated framework, we recommend first testing the data, once converted to `ndarray`, with a single-machine training engine to verify that the data format matches the model. This makes troubleshooting more efficient.  
*Note: When using image datasets, pay attention to the dimension ordering.*
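
As a minimal sketch of such a single-machine check, the snippet below writes a small synthetic `.npz` file in the same layout as `flower_alice.npz` (keys `image` and `label`), reloads it with plain NumPy, and verifies shapes and label range before any federated step. It also shows the channel-last to channel-first transpose that the earlier code comment mentions for PyTorch. The file name, array sizes, and the `check_npz` helper are assumptions made for illustration.

```python
import os
import tempfile

import numpy as np


def check_npz(path, num_classes):
    """Hypothetical helper: basic consistency checks on an image/label archive."""
    with np.load(path) as data:
        image, label = data["image"], data["label"]
    assert image.ndim == 4, "expected (N, H, W, C) images"
    assert image.shape[0] == label.shape[0], "image/label count mismatch"
    assert label.min() >= 0 and label.max() < num_classes, "label out of range"
    return image, label


# Synthetic data in the same layout as the flower archives above.
path = os.path.join(tempfile.mkdtemp(), "flower_demo.npz")
np.savez(
    path,
    image=np.zeros((4, 8, 8, 3), dtype=np.uint8),
    label=np.array([0, 1, 2, 4]),
)

image, label = check_npz(path, num_classes=5)
print(image.shape, image.dtype)

# Channel-last (N, H, W, C) to channel-first (N, C, H, W).
image_nchw = np.transpose(image, (0, 3, 1, 2))
print(image_nchw.shape)
```

Running the same checks on each party's file before calling `load` keeps format problems separate from federation problems.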