OpenHuFu is an open-sourced data federation system that supports collaborative queries over multiple databases with security guarantees.

OpenHuFu: An Open-Sourced Data Federation System

Data isolation has become an obstacle to scaling up query processing over big data, since sharing raw data among data owners is often infeasible due to security concerns. A promising solution is to perform secure queries and analytics over a federation of multiple data owners, leveraging techniques such as secure multi-party computation (SMC) and differential privacy, as evidenced by recent work on data federation and federated learning.

OpenHuFu is an open-sourced system for efficient and secure query processing on a data federation. It gives researchers the flexibility to quickly implement their own algorithms for processing federated queries with SMC techniques such as secret sharing, garbled circuits, and oblivious transfer. This makes it possible to run experimental evaluations quickly and measure the performance of the designed algorithms on benchmark datasets.
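
To make the secret-sharing idea concrete, here is a minimal sketch of a federated SUM computed with additive secret sharing. It is an illustration only, not OpenHuFu's implementation: the function names `share` and `secure_sum`, the modulus, and the in-process simulation of the owners are all assumptions made for this example.

```python
import secrets

# Assumed modulus for this sketch; all arithmetic is done modulo this value.
MODULUS = 2**61 - 1


def share(value, n_parties, modulus=MODULUS):
    """Split value into n_parties additive shares that sum to value mod modulus."""
    shares = [secrets.randbelow(modulus) for _ in range(n_parties - 1)]
    last = (value - sum(shares)) % modulus
    return shares + [last]


def secure_sum(private_values, modulus=MODULUS):
    """Simulate a federated SUM: no owner's raw value is ever combined directly."""
    n = len(private_values)
    # Owner i splits its value into n shares and sends share j to owner j.
    distributed = [share(v, n, modulus) for v in private_values]
    # Owner j adds up the shares it received and publishes only that partial sum.
    partial_sums = [
        sum(distributed[i][j] for i in range(n)) % modulus for j in range(n)
    ]
    # The published partial sums add up to the true total.
    return sum(partial_sums) % modulus


owner_values = [120, 45, 300]    # one private value per data owner
print(secure_sum(owner_values))  # 465
```

Each individual share is a uniformly random number, so a single share reveals nothing about the value it came from; only the combination of all partial sums yields the total.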

Compile OpenHuFu from Source Code

Prerequisites:

  • Linux or macOS
  • Java 11
  • Maven (version 3.5.2 or later)
  • C++ compiler (to generate TPC-H data)
  • Python 3 (to generate spatial data)
  • Git & Git LFS (Git Large File Storage)

Build OpenHuFu

Run the following commands:

  1. Clone the OpenHuFu repository:
git clone https://github.com/BUAA-BDA/OpenHuFu.git
  2. Download the large files from Git LFS (Large File Storage):
cd OpenHuFu
git lfs install --skip-smudge
git lfs pull
  3. Build (from the repository root):
bash scripts/build/package.sh

OpenHuFu is now installed in the release directory.

Notes

If you use a Mac with an Apple Silicon (ARM) chip, add the following inside the <settings> element of your Maven settings file (settings.xml):

<profiles>
    <profile>
      <id>apple-silicon</id>
      <properties>
        <os.detected.classifier>osx-x86_64</os.detected.classifier>
      </properties>
    </profile>
</profiles>
<activeProfiles>
    <activeProfile>apple-silicon</activeProfile>
</activeProfiles>

Data Generation

Relational data: TPC-H

How to use it:

bash scripts/test/extract_tpc_h.sh

cd "dataset/TPC-H V3.0.1/dbgen"
cp makefile.suite makefile
make

# return to the repository root
cd ../../..
# x is the number of databases; y is the size of each database in MB
bash scripts/test/generateData.sh x y

Spatial data

Spatial sample data: dataset/newyork-taxi-sample.data

How to use it:

Configuration in scripts/test/genSyntheticData.py:

class constForSyn():
    caseN = 10
    Range = 10**7
    dim = 2
    mu = 0.5 * Range
    sigma = 0.10 * Range
    pointFiles = ["uni", "nor", "exp", "skew"]
    numList = [5*10**3, 10**4, 5*10**4, 10**5, 5*10**5, 10**6, 5*10**6]
    alpha = 2

def exp0():
    # desPath is the target folder of generated data
    desPath = "dataset/SynData"

Generate spatial data:

pip3 install numpy
python3 dataset/genSyntheticData.py
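
As a rough illustration of what the configuration values mean, the sketch below draws points from three of the listed distributions using only the Python standard library. It is not the project's genSyntheticData.py: the function `gen_points`, the clamping to the domain, and the choice of `mu` as the mean of the exponential distribution are assumptions, and the "skew" distribution is left out because its definition (and its use of `alpha`) is not given here.

```python
import random

# Values mirror the configuration shown above.
RANGE = 10**7
DIM = 2
MU = 0.5 * RANGE
SIGMA = 0.10 * RANGE


def clamp(x, lo=0.0, hi=float(RANGE)):
    """Keep a coordinate inside the [0, RANGE] domain (an assumption of this sketch)."""
    return max(lo, min(hi, x))


def gen_points(kind, n, rng):
    """Return n points with DIM coordinates each, drawn from the named distribution."""
    if kind == "uni":
        draw = lambda: rng.uniform(0, RANGE)
    elif kind == "nor":
        draw = lambda: clamp(rng.gauss(MU, SIGMA))
    elif kind == "exp":
        # Assumed parameterization: exponential distribution with mean MU.
        draw = lambda: clamp(rng.expovariate(1.0 / MU))
    else:
        raise ValueError(f"distribution not covered by this sketch: {kind}")
    return [tuple(draw() for _ in range(DIM)) for _ in range(n)]


rng = random.Random(42)  # fixed seed so repeated runs produce the same points
for kind in ["uni", "nor", "exp"]:
    points = gen_points(kind, 5, rng)
    print(kind, len(points))
```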

Configuration File

OwnerSide

UserSide

ABY4j

A Java wrapper for ABY, built with SWIG.

Installation

  1. Make sure ABY has not already been installed with make install. (If it has, remove the installed files, especially those in /usr/local/lib/cmake.)
  2. Initialize the git submodule:
git submodule update --init
  3. Check the toolchain versions: cmake >= 3.16, gcc >= 8.0, Java JDK >= 11, SWIG 3.0, GMP, Boost >= 1.66.0

    (After installing a new version, you may need to manually update the dynamic link library (e.g. libstdc++.so.6); otherwise the old version will still be loaded.)

  4. Set OPENHUFU_ROOT:

export OPENHUFU_ROOT={OpenHuFu root path}
# e.g.
# export OPENHUFU_ROOT=~/dev/release
  5. Run package.sh; the packaged result is installed in ${OPENHUFU_ROOT}/lib:
# On first installation, or after modifying the C++ code, add 'all' to rebuild the .so library.
# After running the script, manually move the .so and .a files in swig/build/lib into the Java library path, e.g. /usr/lib/jni.
./package.sh

(At run time, pass -Djava.library.path=${OPENHUFU_ROOT}/lib to the JVM so that it can find the library.)

Project structure

  • swig: the SWIG interface for ABY. It uses .i files to wrap the C++ functions; the generated Java code is placed in src/java/com/hufudb/openhufu/mpc/aby/*. Please do not add these generated files to git.
  • src: the ProtocolExecutor interface for ABY (Aby.java, AbyFactory.java), which wraps the SWIG interface so that it can interact with OpenHuFu.

Development procedure

  1. Develop your algorithms
  • Aggregate:
  class extends com.hufudb.openhufu.owner.implementor.aggregate.OwnerAggregateFunction
  /** 
   *  The class must contain a constructor with the parameters:
   *  (OpenHuFuPlan.Expression agg, Rpc rpc, ExecutorService threadPool, OpenHuFuPlan.TaskInfo taskInfo)
   */ 
  • Join:
  2. Set the algorithm for the query (example in owner.yaml):
openhufu:
    implementor:
      aggregate:
        sum: com.hufudb.openhufu.owner.implementor.aggregate.sum.SecretSharingSum
        count: null
        max: null
        min: null
        avg: null
      join: com.hufudb.openhufu.owner.implementor.join.HashJoin
  3. Run the benchmarks
bash scripts/test/benchmark.sh
  4. Evaluate the communication cost

Before running benchmarks on OpenHuFu, you can follow these instructions to evaluate the communication cost of a query:

  • Monitoring the port
# run the shell script as root
# 8888 is the port number 
sudo bash scripts/test/network_mmonitor/start.sh 8888
  • Calculating the communication cost
# run the shell script as root
sudo bash scripts/test/network_mmonitor/monitor.sh

Data Query Language

  1. Plan
  2. Function Call

Supported Query Types

  • Filter
  • Projection
  • Join: equi-join, theta join
  • Cross products
  • Aggregate (incl. group-by)
  • Limited window aggregates
  • Distinct
  • Sort
  • Limit
  • Common table expressions
  • Spatial Queries:
    • range query
    • range counting
    • kNN query
    • distance join
    • kNN join
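
For readers unfamiliar with the spatial query types, the sketch below spells out their plaintext semantics over the union of two owners' points. It only illustrates what each query returns; it does not show how OpenHuFu evaluates them securely. The function names, the circular (center plus radius) range, and the sample points are assumptions made for this example.

```python
import math


def range_query(points, center, radius):
    """Points within the given distance of center."""
    return [p for p in points if math.dist(p, center) <= radius]


def range_count(points, center, radius):
    """Number of points within the given distance of center."""
    return len(range_query(points, center, radius))


def knn_query(points, center, k):
    """The k points closest to center."""
    return sorted(points, key=lambda p: math.dist(p, center))[:k]


def distance_join(left, right, radius):
    """All pairs (a, b) with a from left, b from right, at most radius apart."""
    return [(a, b) for a in left for b in right if math.dist(a, b) <= radius]


def knn_join(left, right, k):
    """For each point in left, its k nearest neighbours in right."""
    return [(a, knn_query(right, a, k)) for a in left]


# Hypothetical data held by two separate owners.
owner_a = [(0, 0), (3, 4)]
owner_b = [(1, 1), (10, 10)]
federation = owner_a + owner_b  # the logical union a federated query ranges over

print(range_count(federation, (0, 0), 6))  # 3
print(knn_query(federation, (0, 0), 2))    # [(0, 0), (1, 1)]
print(distance_join(owner_a, owner_b, 5))
```

In a federation the union is never materialized in one place; the point of the secure protocols is to obtain the same answers without any owner revealing its raw points.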

Evaluation Metrics

  • Communication Cost
  • Running Time
    • Total Query Time
    • Local Query Time
    • Encryption Time
    • Decryption Time
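
One simple way to collect a running-time breakdown like the one above is to accumulate wall-clock time per named phase. The sketch below is a generic illustration, not OpenHuFu's instrumentation: the `PhaseTimer` class, the phase names, and the placeholder workloads are all hypothetical.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class PhaseTimer:
    """Accumulates elapsed wall-clock seconds per named phase."""

    def __init__(self):
        self.seconds = defaultdict(float)

    @contextmanager
    def phase(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Add to the running total so a phase can be entered repeatedly.
            self.seconds[name] += time.perf_counter() - start


timer = PhaseTimer()
with timer.phase("total_query"):
    with timer.phase("local_query"):
        rows = [x * x for x in range(10_000)]   # placeholder for a local scan
    with timer.phase("encryption"):
        masked = [(r + 17) % 1009 for r in rows]  # placeholder for encryption work

for name, elapsed in timer.seconds.items():
    print(f"{name}: {elapsed:.6f} s")
```

Because the phases are nested, the total query time includes the local query and encryption times, which matches how the metrics are grouped in the list above.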

Related Work

If you find OpenHuFu helpful in your research, please consider citing our paper, listed below together with its BibTeX entry:

  1. Hu-Fu: Efficient and Secure Spatial Queries over Data Federation. Yongxin Tong, Xuchen Pan, Yuxiang Zeng, Yexuan Shi, Chunbo Xue, Zimu Zhou, Xiaofei Zhang, Lei Chen, Yi Xu, Ke Xu, Weifeng Lv. Proc. VLDB Endow. 15(6): 1159-1172 (2022). [paper] [slides] [bibtex]

Other helpful related work from our group is listed below:

  1. Efficient Approximate Range Aggregation Over Large-Scale Spatial Data Federation. Yexuan Shi, Yongxin Tong, Yuxiang Zeng, Zimu Zhou, Bolin Ding, Lei Chen. IEEE Trans. Knowl. Data Eng. 35(1): 418-430 (2023). [paper] [bibtex]

  2. Hu-Fu: A Data Federation System for Secure Spatial Queries. Xuchen Pan, Yongxin Tong, Chunbo Xue, Zimu Zhou, Junping Du, Yuxiang Zeng, Yexuan Shi, Xiaofei Zhang, Lei Chen, Yi Xu, Ke Xu, Weifeng Lv. Proc. VLDB Endow. 15(12): 3582-3585 (2022). [paper] [bibtex]

  3. Data Source Selection in Federated Learning: A Submodular Optimization Approach. Ruisheng Zhang, Yansheng Wang, Zimu Zhou, Ziyao Ren, Yongxin Tong, Ke Xu. DASFAA 2022. [paper] [bibtex]

  4. Fed-LTD: Towards Cross-Platform Ride Hailing via Federated Learning to Dispatch. Yansheng Wang, Yongxin Tong, Zimu Zhou, Ziyao Ren, Yi Xu, Guobin Wu, Weifeng Lv. KDD 2022. [paper] [bibtex]

  5. Efficient and Secure Skyline Queries over Vertical Data Federation. Yuanyuan Zhang, Yexuan Shi, Zimu Zhou, Chunbo Xue, Yi Xu, Ke Xu, Junping Du. IEEE Trans. Knowl. Data Eng. (2022). [paper] [bibtex]

  6. Federated Topic Discovery: A Semantic Consistent Approach. Yexuan Shi, Yongxin Tong, Zhiyang Su, Di Jiang, Zimu Zhou, Wenbin Zhang. IEEE Intell. Syst. 36(5): 96-103 (2021). [paper] [bibtex]

  7. Industrial Federated Topic Modeling. Di Jiang, Yongxin Tong, Yuanfeng Song, Xueyang Wu, Weiwei Zhao, Jinhua Peng, Rongzhong Lian, Qian Xu, Qiang Yang. ACM Trans. Intell. Syst. Technol. 12(1): 2:1-2:22 (2021). [paper] [bibtex]

  8. A GDPR-compliant Ecosystem for Speech Recognition with Transfer, Federated, and Evolutionary Learning. Di Jiang, Conghui Tan, Jinhua Peng, Chaotao Chen, Xueyang Wu, Weiwei Zhao, Yuanfeng Song, Yongxin Tong, Chang Liu, Qian Xu, Qiang Yang, Li Deng. ACM Trans. Intell. Syst. Technol. 12(3): 30:1-30:19 (2021). [paper] [bibtex]

  9. An Efficient Approach for Cross-Silo Federated Learning to Rank. Yansheng Wang, Yongxin Tong, Dingyuan Shi, Ke Xu. ICDE 2021. [paper] [slides] [bibtex]

  10. Federated Learning in the Lens of Crowdsourcing. Yongxin Tong, Yansheng Wang, Dingyuan Shi. IEEE Data Eng. Bull. 43(3): 26-36 (2020). [paper] [bibtex]

  11. Federated Latent Dirichlet Allocation: A Local Differential Privacy Based Framework. Yansheng Wang, Yongxin Tong, Dingyuan Shi. AAAI 2020. [paper] [bibtex]

  12. Federated Acoustic Model Optimization for Automatic Speech Recognition. Conghui Tan, Di Jiang, Huaxiao Mo, Jinhua Peng, Yongxin Tong, Weiwei Zhao, Chaotao Chen, Rongzhong Lian, Yuanfeng Song, Qian Xu. DASFAA 2020. [paper] [bibtex]

  13. Efficient and Fair Data Valuation for Horizontal Federated Learning. Shuyue Wei, Yongxin Tong, Zimu Zhou, Tianshu Song. Federated Learning 2020. [paper] [bibtex]

  14. Profit Allocation for Federated Learning. Tianshu Song, Yongxin Tong, Shuyue Wei. IEEE BigData 2019. [paper] [slides] [bibtex]

  15. Federated Machine Learning: Concept and Applications. Qiang Yang, Yang Liu, Tianjian Chen, Yongxin Tong. ACM Trans. Intell. Syst. Technol. 10(2): 12:1-12:19 (2019). [paper] [bibtex]
