# Data Preprocessing

The following code is for demonstration only. It is **NOT for production** because of system security concerns; please **DO NOT** use it directly in production.

It is recommended to use [jupyter](https://jupyter.org/) to run this tutorial.

Secretflow provides a variety of preprocessing tools to process data.

## Preparation

Initialize secretflow and create two parties, alice and bob.

> 💡 Before using preprocessing, you may need to understand secretflow's [DataFrame](./DataFrame.ipynb).

In [1]:
import secretflow as sf

# In case you have a running secretflow runtime already.
sf.shutdown()

sf.init(['alice', 'bob'])
alice = sf.PYU('alice')
bob = sf.PYU('bob')


## Data Preparation

Here we use [iris](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_iris.html) as example data.

In [2]:
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris(as_frame=True)
data = pd.concat([iris.data, iris.target], axis=1)

# To demonstrate missing-value handling later,
# set two values in the sepal width column to None.
data.iloc[1, 1] = None
data.iloc[100, 1] = None

# Map the integer target back to its species name.
data['target'] = data['target'].map({0: 'setosa', 1: 'versicolor', 2: 'virginica'})
data

| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | target |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 1 | 4.9 | NaN | 1.4 | 0.2 | setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |
| ... | ... | ... | ... | ... | ... |
| 145 | 6.7 | 3.0 | 5.2 | 2.3 | virginica |
| 146 | 6.3 | 2.5 | 5.0 | 1.9 | virginica |
| 147 | 6.5 | 3.0 | 5.2 | 2.0 | virginica |
| 148 | 6.2 | 3.4 | 5.4 | 2.3 | virginica |


Create a horizontal DataFrame.

In [3]:
# Horizontal partitioning.
h_alice, h_bob = data.iloc[:70, :], data.iloc[70:, :]

# Save to temporary files.
import tempfile

_, h_alice_path = tempfile.mkstemp()
_, h_bob_path = tempfile.mkstemp()
h_alice.to_csv(h_alice_path, index=False)
h_bob.to_csv(h_bob_path, index=False)
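
One detail worth knowing about the cell above: `tempfile.mkstemp()` returns an already-open file descriptor together with the path, and discarding it with `_` leaves that descriptor open. The sketch below shows one way to close it explicitly. It uses only the standard library and a hand-written two-line CSV as a stand-in for `to_csv`; the file content and variable names are illustrative.

```python
import os
import tempfile

# mkstemp() returns (file_descriptor, path); the descriptor is already open.
fd, path = tempfile.mkstemp(suffix=".csv")
os.close(fd)  # Close it so that only the path needs to be carried around.

# Write a tiny CSV to the path (a stand-in for DataFrame.to_csv).
with open(path, "w") as f:
    f.write("a,b\n1,2\n")

# Read it back to confirm the file is usable through its path alone.
with open(path) as f:
    content = f.read()

os.remove(path)  # Remove the temporary file when it is no longer needed.
```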

In [4]:
from secretflow.data.horizontal import read_csv as h_read_csv
from secretflow.security.aggregation import PlainAggregator
from secretflow.security.compare import PlainComparator

hdf = h_read_csv({alice: h_alice_path, bob: h_bob_path}, 
                 aggregator=PlainAggregator(alice), 
                 comparator=PlainComparator(alice))

Create a vertical DataFrame.

In [5]:
# Vertical partitioning.
v_alice, v_bob = data.iloc[:, :2], data.iloc[:, 2:]

# Save to temporary files.
_, v_alice_path = tempfile.mkstemp()
_, v_bob_path = tempfile.mkstemp()
v_alice.to_csv(v_alice_path, index=False)
v_bob.to_csv(v_bob_path, index=False)

In [6]:
from secretflow.data.vertical import read_csv as v_read_csv

vdf = v_read_csv({alice: v_alice_path, bob: v_bob_path})
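
To make the difference between the two layouts concrete, here is a minimal plain-Python sketch that does not use secretflow or pandas. The toy table, the `select` helper, and the variable names are all hypothetical: horizontal partitioning splits the rows and keeps every column, while vertical partitioning splits the columns and keeps every row.

```python
# A toy table: each row is a dict with the same three columns.
rows = [
    {"id": 0, "x": 5.1, "y": "setosa"},
    {"id": 1, "x": 4.9, "y": "setosa"},
    {"id": 2, "x": 6.3, "y": "virginica"},
    {"id": 3, "x": 5.8, "y": "virginica"},
]

# Horizontal partitioning: split by rows; each part keeps all columns.
h_first, h_second = rows[:2], rows[2:]


def select(table, columns):
    """Keep only the given columns of every row (hypothetical helper)."""
    return [{c: r[c] for c in columns} for r in table]


# Vertical partitioning: split by columns; each part keeps all rows.
v_first = select(rows, ["id", "x"])
v_second = select(rows, ["y"])
```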

## Preprocessing

Secretflow provides missing-value filling, normalization, one-hot encoding, label encoding, and other functions similar to those in sklearn's preprocessing module.

### Missing value filling

DataFrame provides the fillna method, which fills missing values in the same way as pandas.

In [7]:
# Before filling, the sepal width (cm) is missing in two positions.
vdf.count()['sepal width (cm)']

148

In [8]:
# Fill sepal width (cm) with 10.
vdf.fillna(value={'sepal width (cm)': 10}).count()['sepal width (cm)']

150

In [9]:
# Before filling, the sepal width (cm) is missing in two positions.
hdf.count()['sepal width (cm)']

148

In [10]:
# Fill sepal width (cm) with 10.
hdf.fillna(value={'sepal width (cm)': 10}).count()['sepal width (cm)']

150
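
The jump from 148 to 150 follows from what the two operations compute: `count` reports the number of non-missing entries, and filling replaces each missing entry with the given constant. A plain-Python sketch of that logic on a single column, using `None` for missing values and hypothetical helper names (this is not secretflow's implementation):

```python
def count_non_missing(column):
    """Number of entries that are not None (the analogue of count())."""
    return sum(value is not None for value in column)


def fill_missing(column, fill_value):
    """Return a new list with every None replaced by fill_value."""
    return [fill_value if value is None else value for value in column]


# Five made-up values, two of them missing.
sepal_width = [3.5, None, 3.2, 3.1, None]
filled = fill_missing(sepal_width, 10)

print(count_non_missing(sepal_width))
print(count_non_missing(filled))
```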

### Min-max normalization

Secretflow provides MinMaxScaler for normalization, and the input and output of MinMaxScaler are both DataFrames.

In [11]:
from secretflow.preprocessing import MinMaxScaler

scaler = MinMaxScaler()

In [12]:
# Normalize the sepal length column of the horizontal DataFrame.
# Equivalent to:
# scaler.fit(hdf['sepal length (cm)'])
# scaled_sepal_len_h = scaler.transform(hdf['sepal length (cm)'])
scaled_sepal_len_h = scaler.fit_transform(hdf['sepal length (cm)'])

print('Min: ', scaled_sepal_len_h.min()[0])
print('Max: ', scaled_sepal_len_h.max()[0])

Min:  0.0
Max:  1.0


In [13]:
# The above operations can also be applied to vertical DataFrame.
scaled_sepal_len_v = scaler.fit_transform(vdf['sepal length (cm)'])

print('Min: ', scaled_sepal_len_v.min()[0])
print('Max: ', scaled_sepal_len_v.max()[0])

Min:  0.0
Max:  1.0
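
The outputs of 0.0 and 1.0 follow directly from the min-max formula, x' = (x - min) / (max - min), with its default target range of [0, 1]. The sketch below applies that formula to one column in plain Python. The function name and sample values are made up, and mapping a constant column to 0.0 is a choice made for this sketch rather than something taken from secretflow.

```python
def min_max_scale(values):
    """Rescale values linearly so the smallest maps to 0.0 and the largest to 1.0."""
    lo, hi = min(values), max(values)
    span = hi - lo
    if span == 0:
        # A constant column has no spread; map everything to 0.0 (sketch's choice).
        return [0.0 for _ in values]
    return [(v - lo) / span for v in values]


sepal_length = [5.1, 4.9, 4.7, 7.9, 4.3]
scaled = min_max_scale(sepal_length)

print('Min: ', min(scaled))
print('Max: ', max(scaled))
```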


### OneHot encoding

Secretflow provides OneHotEncoder for OneHot encoding, and the input and output of OneHotEncoder are both DataFrames.

In [14]:
from secretflow.preprocessing import OneHotEncoder

onehot_encoder = OneHotEncoder()

In [15]:
# Do onehot encoding on the target column of the horizontal DataFrame.
# Equivalent to:
# onehot_encoder.fit(hdf['target'])
# onehot_encoder.transform(hdf['target'])
onehot_target_h = onehot_encoder.fit_transform(hdf['target'])

print('Columns: ', onehot_target_h.columns)
print('Min: \n', onehot_target_h.min())
print('Max: \n', onehot_target_h.max())

Columns:  Index(['target_setosa', 'target_versicolor', 'target_virginica'], dtype='object')
Min: 
 target_setosa        0.0
target_versicolor    0.0
target_virginica     0.0
dtype: float64
Max: 
 target_setosa        1.0
target_versicolor    1.0
target_virginica     1.0
dtype: float64


In [16]:
# The above operations can also be applied to vertical DataFrame.
onehot_target_v = onehot_encoder.fit_transform(vdf['target'])

print('Columns: ', onehot_target_v.columns)
print('Min: \n', onehot_target_v.min())
print('Max: \n', onehot_target_v.max())

Columns:  Index(['target_setosa', 'target_versicolor', 'target_virginica'], dtype='object')
Min: 
 target_setosa        0.0
target_versicolor    0.0
target_virginica     0.0
dtype: float64
Max: 
 target_setosa        1.0
target_versicolor    1.0
target_virginica     1.0
dtype: float64
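
As a rough illustration of what one-hot encoding produces, the plain-Python sketch below turns one categorical column into one 0/1 column per category. The `<column>_<category>` naming mirrors the column names printed above. Sorting the categories and the `one_hot` helper itself are assumptions of this sketch, not secretflow's implementation.

```python
def one_hot(column_name, values):
    """Return (column names, rows) with one 0.0/1.0 column per distinct category."""
    categories = sorted(set(values))
    columns = [f"{column_name}_{c}" for c in categories]
    rows = [[1.0 if v == c else 0.0 for c in categories] for v in values]
    return columns, rows


columns, rows = one_hot("target", ["setosa", "virginica", "versicolor", "setosa"])

print(columns)
for row in rows:
    print(row)
```

Each row contains exactly one 1.0, which is why every encoded column has a minimum of 0.0 and a maximum of 1.0.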


### Label encoding

Secretflow provides LabelEncoder for label encoding, and the input and output of LabelEncoder are both DataFrames.

In [17]:
from secretflow.preprocessing import LabelEncoder

label_encoder = LabelEncoder()

In [18]:
# Do label encoding on the target column of the horizontal DataFrame.
# Equivalent to:
# label_encoder.fit(hdf['target'])
# label_encoder.transform(hdf['target'])
encoded_label_h = label_encoder.fit_transform(hdf['target'])

print('Columns: ', encoded_label_h.columns)
print('Min: \n', encoded_label_h.min())
print('Max: \n', encoded_label_h.max())

Columns:  Index(['target'], dtype='object')
Min: 
 target    0.0
dtype: float64
Max: 
 target    2.0
dtype: float64


In [19]:
# The above operations can also be applied to vertical DataFrame.
encoded_label_v = label_encoder.fit_transform(vdf['target'])

print('Columns: ', encoded_label_v.columns)
print('Min: \n', encoded_label_v.min())
print('Max: \n', encoded_label_v.max())

Columns:  Index(['target'], dtype='object')
Min: 
 target    0
dtype: int64
Max: 
 target    2
dtype: int64
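
Label encoding replaces each distinct label with an integer from 0 to n-1, which is why three species give a minimum of 0 and a maximum of 2. A minimal plain-Python sketch follows. It assigns codes in sorted label order, as sklearn's LabelEncoder does; whether secretflow uses the same order is an assumption here, and `label_encode` is a hypothetical helper.

```python
def label_encode(values):
    """Return (sorted distinct labels, integer code for each value)."""
    classes = sorted(set(values))
    index = {label: i for i, label in enumerate(classes)}
    return classes, [index[v] for v in values]


classes, codes = label_encode(["virginica", "setosa", "versicolor", "setosa"])

print(classes)
print(codes)
```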


## Ending

In [20]:
# Clean up temporary files

import os

try:
    os.remove(h_alice_path)
    os.remove(h_bob_path)
except OSError:
    pass

try:
    os.remove(v_alice_path)
    os.remove(v_bob_path)
except OSError:
    pass
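
As an alternative to removing each file by hand, the standard library's `tempfile.TemporaryDirectory` deletes the directory and everything in it when its `with` block exits. The sketch below is an option rather than what this tutorial does; the file names and contents are placeholders. All work that reads the files would have to happen inside the `with` block.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    alice_path = os.path.join(tmp_dir, "alice.csv")
    bob_path = os.path.join(tmp_dir, "bob.csv")

    # Write placeholder CSV files (a stand-in for DataFrame.to_csv).
    for path in (alice_path, bob_path):
        with open(path, "w") as f:
            f.write("a,b\n1,2\n")

    # Both files exist while the block is active.
    existed_inside = os.path.exists(alice_path) and os.path.exists(bob_path)

# Leaving the block removed the directory and both files.
removed_after = not os.path.exists(tmp_dir)
```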