
2021Tencent Rhino-bird Open-source Training Program—Angel Zeng Shang #98

Open
earlytobed opened this issue Aug 4, 2021 · 3 comments

@earlytobed

First Assignment

I am honored to have been selected for the Angel project and to begin the hands-on open-source phase. Studying the architecture and design principles of the Angel distributed machine learning platform together with the mentors and fellow students is a rare opportunity. Below are my practical notes from this program. Since my experience is limited, errors and shortcomings are inevitable; corrections from expert readers are welcome.

Setting Up the Angel Environment

This project reproduces a paper on top of Angel-ML/PyTorch-On-Angel, so before anything else we need to deploy a working environment.

https://github.com/Angel-ML/PyTorch-On-Angel/blob/master/docs/img/pytorch_on_angel_framework.png?raw=true

PyTorch on Angel's architecture

PyTorch-On-Angel consists of three main modules:

  1. Python Client: generates the ScriptModule
  2. Angel PS: the parameter server, responsible for distributed model storage, synchronization, and coordination of the computation
  3. Spark: the Spark Driver and Spark Executors load the ScriptModule, handle data processing, and work with the parameter server to train the model and run predictions

Sorting out the dependencies:

  • Generating the ScriptModule from Python code requires a Python environment with the torch package
  • The C++ backend requires libtorch_angel
  • Angel PS and the Spark Driver/Executors require Spark
  • The project recommends running Spark on YARN, so Hadoop is needed as well

All of the following is based on Ubuntu 20.04 LTS. Since this is my personal machine, the environment is not entirely clean, so I cannot guarantee there are no other issues.

PyTorch-On-Angel

The first step, of course:

git clone https://github.com/Angel-ML/PyTorch-On-Angel.git --depth 1

The project documentation describes how to build it. For convenience, I prepared mirror source files in advance and put them under ./addon:

Debian 9 sources.list

deb http://mirrors.cloud.tencent.com/debian stretch main contrib non-free
deb http://mirrors.cloud.tencent.com/debian stretch-updates main contrib non-free
#deb http://mirrors.cloud.tencent.com/debian stretch-backports main contrib non-free
#deb http://mirrors.cloud.tencent.com/debian stretch-proposed-updates main contrib non-free
deb-src http://mirrors.cloud.tencent.com/debian stretch main contrib non-free
deb-src http://mirrors.cloud.tencent.com/debian stretch-updates main contrib non-free
#deb-src http://mirrors.cloud.tencent.com/debian stretch-backports main contrib non-free
#deb-src http://mirrors.cloud.tencent.com/debian stretch-proposed-updates main contrib non-free

maven settings.xml

<?xml version="1.0" encoding="UTF-8"?>
<settings xmlns="http://maven.apache.org/SETTINGS/1.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/SETTINGS/1.0.0 http://maven.apache.org/xsd/settings-1.0.0.xsd">
  <mirrors>
    <mirror>
      <id>nexus-tencentyun</id>
      <mirrorOf>*</mirrorOf>
      <name>Nexus tencentyun</name>
      <url>http://mirrors.cloud.tencent.com/nexus/repository/maven-public/</url>
    </mirror>
  </mirrors>
</settings>

The modified Dockerfile:

########################################################################################################################
#                                                       DEV                                                            #
########################################################################################################################
FROM maven:3.6.1-jdk-8 as DEV

##########################
#  install dependencies  #
##########################
COPY ./addon/sources.list /etc/apt/sources.list
RUN apt-get update \
    && apt-get install -y --no-install-recommends \
    curl=7.52.1-5+deb9u9 \
    g++=4:6.3.0-4 \
    make=4.1-9.1 \
    unzip=6.0-21+deb9u1 \
    python3 \
    python3-pip \
    python3-setuptools \
    python3-wheel \
    && rm -rf /var/lib/apt/lists/*

#####################
#  Install PyTorch  #
#####################
RUN python3 -m pip install --no-cache-dir -i https://mirrors.cloud.tencent.com/pypi/simple \
    https://files.pythonhosted.org/packages/24/33/ccfe4e16bfa1f2ca10e22bca05b313cba31800f9597f5f282020cd6ba45e/torch-1.3.1-cp35-cp35m-manylinux1_x86_64.whl \
    https://files.pythonhosted.org/packages/1c/f6/e927f7db4f422af037ca3f80b3391e6224ee3ee86473ea05028b2b026f82/torchvision-0.4.0-cp35-cp35m-manylinux1_x86_64.whl

#######################
#  install new cmake  #
#######################
RUN curl -fsSL --insecure -o /tmp/cmake.tar.gz https://cmake.org/files/v3.13/cmake-3.13.4.tar.gz \
    && tar -xzf /tmp/cmake.tar.gz -C /tmp \
    && rm -rf /tmp/cmake.tar.gz  \
    && mv /tmp/cmake-* /tmp/cmake \
    && cd /tmp/cmake \
    && ./bootstrap \
    && make -j8 \
    && make install \
    && rm -rf /tmp/cmake

#######################
#  download libtorch  #
#######################
WORKDIR /opt
RUN curl -fsSL --insecure -o libtorch.zip https://download.pytorch.org/libtorch/cpu/libtorch-cxx11-abi-shared-with-deps-1.3.1%2Bcpu.zip \
    && unzip -q libtorch.zip \
    && rm libtorch.zip

ENV TORCH_HOME=/opt/libtorch

########################################################################################################################
#                                                     JAVA BUILDER                                                     #
########################################################################################################################
FROM DEV as JAVA_BUILDER

COPY ./addon/settings.xml /usr/share/maven/conf/

WORKDIR /app

COPY ./java/pom.xml /app

RUN mvn -e -B dependency:resolve dependency:resolve-plugins

COPY ./java /app

RUN mvn -e -B -Dmaven.test.skip=true package

########################################################################################################################
#                                                     CPP BUILDER                                                      #
########################################################################################################################
FROM DEV as CPP_BUILDER

RUN apt-get update  \
    && apt-get install -y --no-install-recommends \
    zip=3.0-11+b1 \
    && rm -rf /var/lib/apt/lists/*

WORKDIR /app

COPY ./cpp ./

RUN ./build.sh \
    && cp ./out/*.so "$TORCH_HOME"/lib \
    && cp /usr/lib/x86_64-linux-gnu/libstdc++.so.6 "$TORCH_HOME"/lib \
    && ln -s "$TORCH_HOME"/lib torch-lib \
    && zip -qr /torch.zip torch-lib

########################################################################################################################
#                                                       Artifacts                                                      #
########################################################################################################################
FROM alpine:3.10 as ARTIFACTS

WORKDIR /dist
COPY --from=CPP_BUILDER /torch.zip ./
COPY --from=JAVA_BUILDER /app/target/*.jar ./

VOLUME /output

CMD [ "/bin/sh", "-c", "cp ./* /output" ]

Modify cpp/CMakeLists.txt:

set(TORCH_HOME $ENV{TORCH_HOME})

Run build.sh and wait a while:

./build.sh

If the downloads are slow, you can also fetch the required files into addon ahead of time and modify the corresponding parts of the Dockerfile:

cd addon && wget https://cmake.org/files/v3.13/cmake-3.13.4.tar.gz \
    https://download.pytorch.org/libtorch/cpu/libtorch-cxx11-abi-shared-with-deps-1.3.1%2Bcpu.zip \
    https://files.pythonhosted.org/packages/24/33/ccfe4e16bfa1f2ca10e22bca05b313cba31800f9597f5f282020cd6ba45e/torch-1.3.1-cp35-cp35m-manylinux1_x86_64.whl \
    https://files.pythonhosted.org/packages/1c/f6/e927f7db4f422af037ca3f80b3391e6224ee3ee86473ea05028b2b026f82/torchvision-0.4.0-cp35-cp35m-manylinux1_x86_64.whl

In gen_pt_model.sh, change python to python3:

docker run -it --rm -v $(pwd)/${MODEL_PATH}:/model.py -v $(pwd)/dist:/output -w /output ${IMAGE_NAME} python3 /model.py ${@:2}

The files we need now appear under ./dist:

deepfm.pt  pytorch-on-angel-0.2.0.jar  pytorch-on-angel-0.2.0-jar-with-dependencies.jar  torch.zip
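It is easy to overlook one of these outputs, so a small check can verify the listing above. A sketch (the expected names are taken from the output shown here):

```python
from pathlib import Path

# Build outputs the Docker build should leave in ./dist (per the listing above)
EXPECTED = [
    "deepfm.pt",
    "pytorch-on-angel-0.2.0.jar",
    "pytorch-on-angel-0.2.0-jar-with-dependencies.jar",
    "torch.zip",
]

def missing_artifacts(dist_dir="dist"):
    """Return the expected build outputs that are absent from dist_dir."""
    d = Path(dist_dir)
    present = {p.name for p in d.iterdir()} if d.is_dir() else set()
    return [name for name in EXPECTED if name not in present]
```

An empty return value means the build produced everything we need.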

That completes the first step.

Hadoop

wget https://archive.apache.org/dist/hadoop/common/hadoop-2.7.0/hadoop-2.7.0.tar.gz

Seeing so many useless files, my tidiness compulsion demands they be deleted:

find . -name "*.cmd" | xargs rm

Edit the configuration files:

hadoop-env.sh

export JAVA_HOME="adjust to your environment"

core-site.xml

<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://master:9000</value>
  </property>
</configuration>

That completes the HDFS setup; now format the namenode:

hdfs namenode -format

Start it and check that it works. Startup requires SSH access to master and the workers; SSH setup is omitted here.

./start-dfs.sh
jps
# 105141 DataNode
# 104964 NameNode
# 105385 SecondaryNameNode
# If all of these are present, everything is fine; otherwise check the logs
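The same jps check recurs for YARN and Spark below, so it can be scripted. A sketch that parses jps output and reports which daemons are missing (the daemon names are exactly what jps prints):

```python
def missing_daemons(jps_output, expected):
    """Parse raw `jps` output ("<pid> <name>" per line) and return
    which of the expected daemon names are not running."""
    running = set()
    for line in jps_output.splitlines():
        parts = line.split()
        if len(parts) >= 2:
            running.add(parts[1])
    return [name for name in expected if name not in running]
```

For example, `missing_daemons(output, ["NameNode", "DataNode", "SecondaryNameNode"])` returns an empty list when HDFS is healthy.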

In mapred-site.xml, set the execution framework to yarn:

<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>

yarn-site.xml holds YARN's resource configuration. The default is 8 GB, which may not be enough to run Angel; adjust it to your machine's specs:

<configuration>
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>master</value>
  </property>
  <property>
    <name>yarn.nodemanager.resource.cpu-vcores</name>
    <value>12</value>
  </property>
  <property>
    <name>yarn.scheduler.minimum-allocation-vcores</name>
    <value>1</value>
  </property>
  <property>
    <name>yarn.scheduler.maximum-allocation-vcores</name>
    <value>12</value>
  </property>
  <property>
    <name>yarn.nodemanager.resource.memory-mb</name>
    <value>30720</value>
  </property>
  <property>
    <name>yarn.scheduler.minimum-allocation-mb</name>
    <value>1</value>
  </property>
  <property>
    <name>yarn.scheduler.maximum-allocation-mb</name>
    <value>30720</value>
  </property>
</configuration>
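As a quick sanity check on these numbers: by default, Spark on YARN requests each container with an extra memory overhead of max(384 MB, 10% of the executor memory), and the request must fit under yarn.scheduler.maximum-allocation-mb. A sketch for the 5 GB containers used in the submit script later:

```python
def yarn_container_mb(executor_mb, overhead_frac=0.10, min_overhead_mb=384):
    """Memory YARN must grant per Spark container: the requested heap
    plus Spark's default off-heap overhead (max of 384 MB and 10%)."""
    return executor_mb + max(min_overhead_mb, int(executor_mb * overhead_frac))

# A 5 GB executor needs 5120 + 512 = 5632 MB, well under the 30720 MB cap:
assert yarn_container_mb(5120) <= 30720
```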

Start it and check that it works:

./start-yarn.sh
jps
# 107761 ResourceManager
# 108141 NodeManager
# If both are present, everything is fine; otherwise check the logs

Spark

wget https://archive.apache.org/dist/spark/spark-2.3.0/spark-2.3.0-bin-hadoop2.7.tgz

With Hadoop configured, setting up Spark is straightforward: Spark on YARN reads the Hadoop configuration directly, so we only need to set:

spark-env.sh

export HADOOP_CONF_DIR="adjust to your environment"

Start it and check that it works:

./start-all.sh
jps
# 2273766 Worker
# 2273463 Master
# If both are present, everything is fine; otherwise check the logs

Angel

Mind the JDK version, otherwise you will hit errors later:

sudo apt install openjdk-8-jdk -y
sudo apt install maven -y

Build and install protobuf 2.5.0 following its README.txt; remember to run ldconfig at the end:

wget https://github.com/protocolbuffers/protobuf/releases/download/v2.5.0/protobuf-2.5.0.tar.gz

Build Angel following its instructions:

wget https://github.com/Angel-ML/angel/archive/refs/tags/Release-2.4.0.tar.gz

After the build finishes, unpack the release and configure it:

spark-on-angel-env.sh

export SPARK_HOME="adjust to your environment"
export ANGEL_HOME="adjust to your environment"
export ANGEL_HDFS_HOME="adjust to your environment"
export ANGEL_VERSION=2.4.0

# work around version issues in some jars
angel_ps_external_jar=fastutil-7.1.0.jar,htrace-core-2.05.jar,sizeof-0.3.0.jar,kryo-shaded-4.0.0.jar,minlog-1.3.0.jar,memory-0.8.1.jar,commons-pool-1.6.jar,netty-all-4.1.18.Final.jar,hll-1.6.0.jar
sona_external_jar=fastutil-7.1.0.jar,htrace-core-2.05.jar,sizeof-0.3.0.jar,kryo-shaded-4.0.0.jar,minlog-1.3.0.jar,memory-0.8.1.jar,commons-pool-1.6.jar,netty-all-4.1.18.Final.jar,hll-1.6.0.jar,json4s-jackson_2.11-3.2.11.jar,json4s-ast_2.11-3.2.11.jar,json4s-core_2.11-3.2.11.jar

Create a directory and upload the required files to HDFS for later use:

hdfs dfs -mkdir /angel
hdfs dfs -put ./angel/data/census/census_148d_train.libsvm /angel
hdfs dfs -put ./angel/lib /angel

Put the four files generated earlier in a suitable location:

torch.zip pytorch-on-angel-0.2.0.jar pytorch-on-angel-0.2.0-jar-with-dependencies.jar deepfm.pt

Adjust the spark-submit parameters below to your situation.

Because --archives torch.zip#torch never worked for me, and searching turned up nothing, I unpacked torch.zip and chose to upload the libraries with --files instead:

#!/bin/bash
JAVA_LIBRARY_PATH="adjust to your environment"
source ./angel/bin/spark-on-angel-env.sh
input="adjust to your environment"
output="adjust to your environment"
torchlib=torch-lib/libpthreadpool.a,torch-lib/libcpuinfo_internals.a,torch-lib/libCaffe2_perfkernels_avx2.a,torch-lib/libgmock.a,torch-lib/libprotoc.a,torch-lib/libnnpack.a,torch-lib/libgtest.a,torch-lib/libpytorch_qnnpack.a,torch-lib/libcaffe2_detectron_ops.so,torch-lib/libCaffe2_perfkernels_avx512.a,torch-lib/libgomp-753e6e92.so.1,torch-lib/libgloo.a,torch-lib/libonnx.a,torch-lib/libtorch_angel.so,torch-lib/libbenchmark_main.a,torch-lib/libcaffe2_protos.a,torch-lib/libgtest_main.a,torch-lib/libprotobuf-lite.a,torch-lib/libasmjit.a,torch-lib/libCaffe2_perfkernels_avx.a,torch-lib/libonnx_proto.a,torch-lib/libfoxi_loader.a,torch-lib/libfbgemm.a,torch-lib/libc10.so,torch-lib/libclog.a,torch-lib/libbenchmark.a,torch-lib/libgmock_main.a,torch-lib/libnnpack_reference_layers.a,torch-lib/libcaffe2_module_test_dynamic.so,torch-lib/libqnnpack.a,torch-lib/libprotobuf.a,torch-lib/libc10d.a,torch-lib/libtorch.so,torch-lib/libcpuinfo.a,torch-lib/libstdc++.so.6,torch-lib/libmkldnn.a

spark-submit \
    --master yarn \
    --deploy-mode cluster \
    --conf spark.ps.instances=1 \
    --conf spark.ps.cores=1 \
    --conf spark.ps.jars=$SONA_ANGEL_JARS \
    --conf spark.ps.memory=5g \
    --conf spark.ps.log.level=INFO \
    --conf spark.driver.extraJavaOptions=-Djava.library.path=$JAVA_LIBRARY_PATH:. \
    --conf spark.executor.extraJavaOptions=-Djava.library.path=$JAVA_LIBRARY_PATH:. \
    --conf spark.executor.extraLibraryPath=. \
    --conf spark.driver.extraLibraryPath=. \
    --conf spark.executorEnv.OMP_NUM_THREADS=2 \
    --conf spark.executorEnv.MKL_NUM_THREADS=2 \
    --name "deepfm for torch on angel" \
    --jars $SONA_SPARK_JARS \
    --files deepfm.pt,$torchlib \
    --driver-memory 5g \
    --num-executors 1 \
    --executor-cores 1 \
    --executor-memory 5g \
    --class com.tencent.angel.pytorch.examples.supervised.RecommendationExample pytorch-on-angel-0.2.0.jar \
    trainInput:$input batchSize:128 torchModelPath:deepfm.pt \
    stepSize:0.001 numEpoch:10 testRatio:0.1 \
    angelModelOutputPath:$output
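The long torch-lib list passed to --files is tedious to maintain by hand; it can also be generated from the unpacked directory. A sketch (assuming the libraries live under ./torch-lib as above):

```python
from pathlib import Path

def _is_lib(name):
    """Shared objects, versioned shared objects, and static archives."""
    return name.endswith((".a", ".so")) or ".so." in name

def files_arg(lib_dir="torch-lib"):
    """Build the comma-separated --files value from the library files
    under lib_dir, keeping the directory prefix on each entry."""
    names = sorted(p.name for p in Path(lib_dir).iterdir() if _is_lib(p.name))
    return ",".join(f"{lib_dir}/{n}" for n in names)
```

Printing `files_arg()` reproduces a list in the same form as the torchlib variable above.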

Harvest your success at http://master:8088/cluster/apps !

success

@earlytobed earlytobed changed the title 基于PyTorch On Angel实现MMoE多任务学习算法模型 2021Tencent Rhino-bird Open-source Training Program—Angel Aug 4, 2021
@earlytobed earlytobed changed the title 2021Tencent Rhino-bird Open-source Training Program—Angel 2021Tencent Rhino-bird Open-source Training Program—Angel Zeng Shang Aug 16, 2021
@earlytobed
Author

#106 add mmoe; support multilabel libsvm, multilabelauc

  1. Add a Python implementation of MMoE;
  2. Let SampleParser support multilabel libsvm;
  3. Support multi_forward_out;
  4. Backport the 0.3.0 multilabelauc implementation to the current version.

The code above has been tested: MMoE trains and runs correctly in both single-task and multi-task settings, and the example model deepfm is unaffected and still runs correctly.
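For context on item 4, a multilabel AUC can be computed as the per-label binary AUC averaged over labels (a macro average). This is a generic sketch of the metric, not the project's actual implementation:

```python
def auc(labels, scores):
    """Binary AUC via the rank-sum formula: the probability that a
    random positive is scored above a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        return float("nan")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def multilabel_auc(label_rows, score_rows):
    """Macro-averaged AUC: label_rows[i][k] is label k of sample i."""
    n_labels = len(label_rows[0])
    per_label = [
        auc([row[k] for row in label_rows], [row[k] for row in score_rows])
        for k in range(n_labels)
    ]
    return sum(per_label) / n_labels
```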

@jinqinn

jinqinn commented Feb 22, 2022

@earlytobed The build fails, and so does running:

  1. add_subdirectory(pytorch_scatter-2.0.5) in CMakeLists.txt fails
  2. At runtime, GLIBC_2.23 cannot be found
  3. The path of the class com.tencent.angel.pytorch.examples.supervised.RecommendationExample has changed, but README.md was not updated

@earlytobed
Author

earlytobed commented Feb 25, 2022

@earlytobed The build fails, and so does running:

  1. add_subdirectory(pytorch_scatter-2.0.5) in CMakeLists.txt fails
  2. At runtime, GLIBC_2.23 cannot be found
  3. The path of the class com.tencent.angel.pytorch.examples.supervised.RecommendationExample has changed, but README.md was not updated

Hi,

  1. pytorch_scatter-2.0.5 needs to be downloaded from https://github.com/rusty1s/pytorch_scatter/releases/tag/2.0.5 and extracted into the corresponding directory
  2. My example here is based on branch-0.2.0; it works on version 0.2.0
