This repository has been archived by the owner on Jan 24, 2024. It is now read-only.

Add vgg16 aws dist test #27

Closed
wants to merge 78 commits into from
Changes from 13 commits
b9c56e1
remove cifar30 shuffle (#19)
Superjomn Apr 13, 2018
b3f7661
evalute [494c262a26a1ff29143491fa60fd6ba546d3bebf]
Apr 16, 2018
e4a20c5
add model vgg16
kolinwei Apr 23, 2018
b510dd0
reset resnet30 train_duration kpi history
Superjomn Apr 23, 2018
f5de753
evalute [504e60a881fd7e72d744e256d90eaec4f52e5c7b]
Superjomn Apr 24, 2018
9145e56
add model seq2seq
kolinwei Apr 24, 2018
b6478aa
add lstm
kolinwei Apr 24, 2018
64d90a6
evalute [44fa823841549f0405f6f55aa8e51560fc0200ce]
Superjomn Apr 24, 2018
1b5a58d
add model image_classification
kolinwei Apr 24, 2018
b2c9714
change image_classification default cudaid
kolinwei Apr 24, 2018
64d0cf1
add object_detection
kolinwei Apr 24, 2018
a478b55
change gpu schedule time
kolinwei Apr 25, 2018
c0b7261
add model ocr_recognition
kolinwei Apr 25, 2018
3e10b30
change ocr model
kolinwei Apr 25, 2018
659ecd8
add transformer
kolinwei Apr 25, 2018
161321d
change diff ratio
kolinwei Apr 25, 2018
4ed9b44
Update continuous_evaluation.py
kolinwei Apr 26, 2018
1a9ed9e
Update flowers_64_gpu_memory_factor.txt
kolinwei Apr 26, 2018
5c1ff88
evalute [c02ba51de015cdfde510543a8cdacf66900f5ee9]
Superjomn Apr 26, 2018
b1305af
Update train_cost_factor.txt
kolinwei Apr 26, 2018
1bd0d96
change model gen gpu memory function
kolinwei Apr 26, 2018
23adafe
run.sh add FLAGS_fraction_of_gpu_memory_to_use=0.9
kolinwei Apr 26, 2018
981f225
change image_classification batch_size
kolinwei Apr 26, 2018
b86895e
evalute [6d934560c75f920ebb618cf71810a07c9dca8e8d]
Superjomn Apr 26, 2018
fe0a80e
change baseline
kolinwei Apr 26, 2018
f35aefb
change image_classification passnum
kolinwei Apr 26, 2018
f207856
evalute [c816121d11f7aed2939c5b859423883ce8bab050]
Superjomn Apr 26, 2018
ee4abc2
update ratio diff
kolinwei Apr 27, 2018
19d8124
change ocr_recognition/ctc_train.py
kolinwei Apr 27, 2018
5938e7e
disable model ocr_recognition
kolinwei Apr 27, 2018
a94a042
Merge branch 'master' into fast
kolinwei Apr 27, 2018
6e4072f
Merge pull request #1 from Superjomn/fast
kolinwei Apr 27, 2018
c6941e3
evalute [01da25845e2c0a45d5ab6ece400c980c199d4412]
Superjomn Apr 27, 2018
fd5ba68
add three NLP model to ce
kolinwei Apr 27, 2018
c1dc3c4
Merge branch 'fast' of https://github.com/Superjomn/paddle-ce-latest-…
kolinwei Apr 27, 2018
24792a1
Merge pull request #2 from Superjomn/fast
panyx0718 Apr 27, 2018
e72f46c
evalute [6e0b47b38c653a383ac2e7d16453536205e15f2d]
Superjomn Apr 27, 2018
50f18e0
update text_classification diff ratio
kolinwei Apr 28, 2018
6c52807
Merge pull request #3 from Superjomn/kolinwei-patch-1
kolinwei Apr 28, 2018
6e8eef4
evalute [a338c7d82a21fcce22af3e03fe6d7c33fe34d9e8]
Superjomn Apr 28, 2018
81253b9
evalute [c93a624b32b9d07298a04fd480686296a6d1229d]
Superjomn Apr 28, 2018
913eb61
add vgg16_aws_dist
putcn May 15, 2018
4e6525c
update run.xsh
putcn May 16, 2018
6803d39
update format and ag
putcn May 21, 2018
772013c
format update
putcn May 21, 2018
657b1f5
format update
putcn May 22, 2018
35277e1
add source dir existence check and more log
putcn May 24, 2018
1646670
switch to regualar bash script
putcn May 24, 2018
78c58a5
moving ce_runner to here
putcn May 24, 2018
f601567
adding base kpi
putcn May 24, 2018
fc313ee
update runner path
putcn May 24, 2018
4e700ae
Update ce_runner.py
guochaorong May 25, 2018
a1acf8a
find paddle path by current bash file path
putcn May 25, 2018
01507d2
Merge branch 'add-vgg16-aws-dist' of https://github.com/putcn/paddle-…
putcn May 25, 2018
4d0db6c
update paddle path
putcn May 26, 2018
c31d604
force start from current folder
putcn May 26, 2018
6b8c122
update all to paddle master (#28)
Superjomn May 28, 2018
0e2ba06
add multi card for text_classification
May 29, 2018
2d97d55
Update continuous_evaluation.py
guochaorong May 29, 2018
2f701dc
Merge pull request #30 from PaddlePaddle/text_classification
guochaorong May 29, 2018
28563ce
Merge pull request #31 from PaddlePaddle/guochaorong-patch-1
guochaorong May 29, 2018
d94b7c1
add cluster spec support
putcn May 30, 2018
b2a7afe
fixed log_processer; more logs; removed docker login
putcn May 30, 2018
ada36a8
move testing py to this repo; added chunk exec;
putcn May 30, 2018
b93e4c8
update cluster spec due to aws limit
putcn May 31, 2018
72785e5
Merge branch 'master' of https://github.com/PaddlePaddle/paddle-ce-la…
putcn May 31, 2018
34ca85e
add __init__ and tracking_kpis for CE
putcn May 31, 2018
6895384
Update model.py
guochaorong Jun 1, 2018
814d93f
Merge pull request #32 from PaddlePaddle/guochaorong-patch-2-1
guochaorong Jun 1, 2018
db4b971
Merge branch 'master' of https://github.com/PaddlePaddle/paddle-ce-la…
putcn Jun 1, 2018
b792339
switch to fluid_benchmark; add multi gpu support
putcn Jun 1, 2018
6c4fc0a
change model to resnet; update trainer count limit
putcn Jun 1, 2018
15be627
add base speed exception handling; switch to mnist
putcn Jun 1, 2018
7dd4b14
change test to vgg; update acc log handling
putcn Jun 2, 2018
38b066c
add cache back
putcn Jun 2, 2018
a918d78
update speedup formula; update training config
putcn Jun 2, 2018
813409a
make continous_eva python 3 complied
putcn Jun 2, 2018
0b07930
remove some kpi; add history data; remove unused model;
putcn Jun 2, 2018
222 changes: 222 additions & 0 deletions vgg16_aws_dist/ce_runner.py
@@ -0,0 +1,222 @@
import argparse
import logging
import sys, os
import numpy as np
import threading
import copy
from aws_runner.client.train_command import TrainCommand

# for ce env ONLY

sys.path.append(os.environ['ceroot'])
from kpi import LessWorseKpi

from aws_runner.client.abclient import Abclient

def str2bool(v):
    if v.lower() in ('yes', 'true', 't', 'y', '1'):
        return True
    elif v.lower() in ('no', 'false', 'f', 'n', '0'):
        return False
    else:
        raise argparse.ArgumentTypeError('Boolean value expected.')

def print_arguments():
    print('----------- Configuration Arguments -----------')
    for arg, value in sorted(vars(args).items()):
        print('%s: %s' % (arg, value))

parser = argparse.ArgumentParser(description=__doc__)

parser.add_argument(
    '--key_name', type=str, default="", help="required, key pair name")
parser.add_argument(
    '--security_group_id',
    type=str,
    default="",
    help="required, the security group id associated with your VPC")

parser.add_argument(
    '--vpc_id',
    type=str,
    default="",
    help="The VPC in which you wish to run test")
parser.add_argument(
    '--subnet_id',
    type=str,
    default="",
    help="The Subnet_id in which you wish to run test")

parser.add_argument(
    '--pserver_instance_type',
    type=str,
    default="c5.2xlarge",
    help="your pserver instance type, c5.2xlarge by default")
parser.add_argument(
    '--trainer_instance_type',
    type=str,
    default="p2.8xlarge",
    help="your trainer instance type, p2.8xlarge by default")

parser.add_argument(
    '--task_name',
    type=str,
    default="",
    help="the name you want to identify your job")

parser.add_argument(
    '--pserver_image_id',
    type=str,
    default="ami-da2c1cbf",
    help="ami id for system image, default one has nvidia-docker ready, \
    use ami-1ae93962 for us-east-2")

parser.add_argument(
    '--pserver_command',
    type=str,
    default="",
    help="pserver start command, format example: python,vgg.py,batch_size:128,is_local:yes"
)

parser.add_argument(
    '--trainer_image_id',
    type=str,
    default="ami-da2c1cbf",
    help="ami id for system image, default one has nvidia-docker ready, \
    use ami-1ae93962 for us-west-2")

parser.add_argument(
    '--trainer_command',
    type=str,
    default="",
    help="trainer start command, format example: python,vgg.py,batch_size:128,is_local:yes"
)

parser.add_argument(
    '--availability_zone',
    type=str,
    default="us-east-2a",
    help="aws zone id to place ec2 instances")

parser.add_argument(
    '--trainer_count', type=int, default=1, help="Trainer count")

parser.add_argument(
    '--pserver_count', type=int, default=1, help="Pserver count")

parser.add_argument(
    '--action', type=str, default="create", help="create|cleanup|status")

parser.add_argument('--pem_path', type=str, help="private key file")

parser.add_argument(
    '--pserver_port', type=str, default="5436", help="pserver port")

parser.add_argument(
    '--docker_image', type=str, default="busybox", help="training docker image")

parser.add_argument(
    '--master_server_port', type=int, default=5436, help="master server port")

parser.add_argument(
    '--master_server_public_ip', type=str, help="master server public ip")

parser.add_argument(
    '--master_docker_image',
    type=str,
    default="putcn/paddle_aws_master:latest",
    help="master docker image id")

parser.add_argument(
    '--no_clean_up',
    type=str2bool,
    default=False,
    help="whether to skip clean up after training")

parser.add_argument(
    '--online_mode',
    type=str2bool,
    default=False,
    help="whether the client actively stays online")

args = parser.parse_args()
logging.basicConfig(level=logging.INFO, format='%(asctime)s %(message)s')

train_speed_kpi = LessWorseKpi('train_speed', 0.01)
kpis_to_track = {}
Contributor

What is this dict reserved for? In the current CE framework, everything related to KPI metrics lives in continuous_evaluation.py under the model directory. The model code itself needs only two changes: add a KPI record and persist it to a file. For reference, see: https://github.com/PaddlePaddle/paddle-ce-latest-kpis/wiki/CE-%E6%A8%A1%E5%9E%8B

Contributor Author

This dict was originally meant for adding KPI items dynamically, based on the values the test outputs. I later saw that tracked items have to be hard-coded in continuous_evaluation.py, so I will remove this dict.


def save_to_kpi(name, val):
    val = float(val)
    if name in kpis_to_track:
        kpi_to_track = kpis_to_track[name]
    else:
        kpi_to_track = LessWorseKpi(name, 0.01)
    kpi_to_track.add_record(np.array(val, dtype='float32'))
Contributor

@guochaorong, May 26, 2018

As above: the record is not persisted to a file, so it will not be evaluated. See the evaluate function of the corresponding class in kpi.py.
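
The point about persisting can be illustrated with a minimal sketch. `SketchKpi` below is a hypothetical stand-in written for this example, not the real class from kpi.py, and the output file name and JSON layout are assumptions as well. It only shows why a record that is added but never written out leaves nothing on disk for a later evaluation step to read.

```python
import json
import os
import tempfile


class SketchKpi(object):
    """Hypothetical stand-in for a KPI tracker; NOT the real kpi.py API."""

    def __init__(self, name, out_dir):
        self.name = name
        # Assumed file naming, modelled on latest_kpis/speedup_rate_factor.txt.
        self.out_file = os.path.join(out_dir, name + "_factor.txt")
        self.records = []

    def add_record(self, value):
        # Keeps the value in memory only; nothing is written yet.
        self.records.append(float(value))

    def persist(self):
        # Writes one JSON list per record so a separate process can read it.
        with open(self.out_file, "w") as f:
            for record in self.records:
                f.write(json.dumps([record]) + "\n")


out_dir = tempfile.mkdtemp()
kpi = SketchKpi("speedup_rate", out_dir)
kpi.add_record(1.8)
exists_before_persist = os.path.exists(kpi.out_file)

kpi.persist()
with open(kpi.out_file) as f:
    persisted = [json.loads(line) for line in f]
```

In this sketch the file does not exist until `persist()` runs, which is the gap the comment describes in `save_to_kpi`.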


class DataCollector(object):
Collaborator

Please run pre-commit run -a to automatically format the code style.

Contributor

@guochaorong, May 26, 2018

Will class DataCollector, train_without_pserver, and train_with_pserver be used by every multi-machine training model? If so, they could go into a common lib. That also looks like something the models code refactoring will need to do.

Contributor Author

My understanding is that it depends on the specific requirements. train_with/out_pserver are used to measure the speedup ratio, but not every test needs a speedup ratio. These two functions are also fairly simple, so there is not much in them to make common. class DataCollector can be split out on its own. Let's see what needs to be reused when we do resnet50 next week.
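
One way DataCollector could be split out is sketched below. It mirrors the parsing logic in this diff, but the `metric_name` parameter and the sample log lines are assumptions added for illustration, and it uses a plain mean instead of numpy to stay dependency-free.

```python
class MetricCollector(object):
    """Collects one named metric from '**metrics_data: k=v,k=v' log lines."""

    PREFIX = "**metrics_data: "

    def __init__(self, metric_name):
        # Hypothetical parameter; the PR hard-codes "train_speed".
        self.metric_name = metric_name
        self.store = []

    def log_processor(self, msg):
        # Ignore every line that does not carry the metrics prefix.
        if not msg.startswith(self.PREFIX):
            return
        for pair in msg[len(self.PREFIX):].split(","):
            key, sep, value = pair.partition("=")
            if sep and key.strip() == self.metric_name:
                self.store.append(float(value))

    def avg(self):
        # Raises ZeroDivisionError if no sample was collected.
        return sum(self.store) / len(self.store)


collector = MetricCollector("train_speed")
for line in [
    "**metrics_data: train_speed=100.0,train_acc=0.5\n",
    "some unrelated log line\n",
    "**metrics_data: train_speed=120.0,train_acc=0.6\n",
]:
    collector.log_processor(line)
```

With the metric name passed in, the same class could serve tests that track something other than training speed.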

    def __init__(self):
        self.store = []
        self.metric_data_identifier = "**metrics_data: "
    def log_processor(self, msg):
        if (msg.startswith(self.metric_data_identifier)):
            str_msg = msg.replace(self.metric_data_identifier, "")
            metrics_raw = str_msg.split(",")
            for metric in metrics_raw:
                metric_data = metric.split("=")
                if metric_data[0].strip() == "train_speed":
                    self.save(metric_data[1])
    def save(self, val):
        self.store.append(float(val))
    def avg(self):
        return np.average(self.store)

solo_data_collector = DataCollector()
def train_without_pserver(args, lock):
    def log_handler(source, id):
        for line in iter(source.readline, ""):
            logging.info("without pserver:")
            logging.info(line)
            solo_data_collector.log_processor(line)

    args.pserver_count = 0
    args.trainer_count = 1
    trainer_command = TrainCommand(args.trainer_command)
    trainer_command.update({"local":"yes"})
    args.trainer_command = trainer_command.unparse()
    logging.info(args)
    abclient = Abclient(args, log_handler, lock)
    abclient.create()

cluster_data_collector = DataCollector()
def train_with_pserver(args, lock):
    def log_handler(source, id):
        for line in iter(source.readline, ""):
            logging.info("with pserver:")
            logging.info(line)
            cluster_data_collector.log_processor(line)

    logging.info(args)
    abclient = Abclient(args, log_handler, lock)
    abclient.create()

if __name__ == "__main__":
    print_arguments()
    if args.action == "create":
        lock = threading.Lock()
        thread_no_pserver = threading.Thread(
            target=train_without_pserver,
            args=(copy.copy(args), lock,)
        )
        thread_with_pserver = threading.Thread(
            target=train_with_pserver,
            args=(copy.copy(args), lock, )
        )
        thread_no_pserver.start()
        thread_with_pserver.start()
        thread_no_pserver.join()
        thread_with_pserver.join()

        speedup_rate = cluster_data_collector.avg()/solo_data_collector.avg()
        logging.info("speed up rate is "+ str(speedup_rate))

        save_to_kpi("speedup_rate", speedup_rate.item())
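
The control flow of the `create` action can be sketched without AWS: two worker threads fill separate result lists, the main thread joins both, and only then divides the cluster average by the single-trainer average. The `fake_run` function and the speed values are made up for illustration and stand in for `Abclient.create()` and real training logs. How the real client uses the shared lock is not shown in this diff, so guarding the appends with it is an assumption.

```python
import threading


def fake_run(samples, store, lock):
    # Stand-in for a training run; appends made-up speed samples.
    for sample in samples:
        with lock:
            store.append(sample)


def average(values):
    return sum(values) / float(len(values))


lock = threading.Lock()
solo_speeds, cluster_speeds = [], []

threads = [
    threading.Thread(target=fake_run, args=([100.0, 110.0], solo_speeds, lock)),
    threading.Thread(target=fake_run, args=([180.0, 200.0], cluster_speeds, lock)),
]
for t in threads:
    t.start()
for t in threads:
    t.join()  # both runs must finish before the ratio is meaningful

# Same formula as in this diff: cluster average over single-trainer average.
speedup_rate = average(cluster_speeds) / average(solo_speeds)
```

Joining both threads before the division matters: reading either list earlier would compute the ratio from partial data.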
10 changes: 10 additions & 0 deletions vgg16_aws_dist/continuous_evaluation.py
@@ -0,0 +1,10 @@
import os
import sys
sys.path.append(os.environ['ceroot'])
from kpi import LessWorseKpi

speedup_rate_kpi = LessWorseKpi('speedup_rate', 0.01)

tracking_kpis = [
    speedup_rate_kpi,
]
1 change: 1 addition & 0 deletions vgg16_aws_dist/latest_kpis/speedup_rate_factor.txt
@@ -0,0 +1 @@
[0.5]
63 changes: 63 additions & 0 deletions vgg16_aws_dist/run.xsh
@@ -0,0 +1,63 @@
#!/bin/bash


CURRENT_FILE_DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" && pwd )"
PADDLE_PATH=$CURRENT_FILE_DIR/../../../
paddle_build_path=$PADDLE_PATH/build
paddle_docker_hub_tag="paddlepaddlece/paddle:latest"
vgg16_test_dockerhub_tag="paddlepaddlece/vgg16_dist:latest"
training_command="local:no,batch_size:128,num_passes:1"

# clean up docker
docker system prune -f

# log in to docker hub
docker login -u "$DOCKER_HUB_USERNAME" -p "$DOCKER_HUB_PASSWORD"

# create paddle docker image
echo "going to build and push paddle production image"
docker build -t $paddle_docker_hub_tag $paddle_build_path
docker push $paddle_docker_hub_tag

# build test docker image
echo "going to prepare and build vgg16_dist_test"
if [ ! -d vgg16_dist_test ]; then
    echo "No vgg16_dist_test repo found, going to clone one"
    git clone https://github.com/putcn/vgg16_dist_test.git
fi
cd vgg16_dist_test
if [ -d ~/.cache/paddle/dataset/cifar ]; then
    echo "host cifar cache found, copying it to docker root"
    mkdir -p .cache/paddle/dataset/
    cp -r -f ~/.cache/paddle/dataset/cifar .cache/paddle/dataset/
fi
git pull
cd ..
echo "going to build vgg16_dist_test docker image and push it"
docker build -t $vgg16_test_dockerhub_tag ./vgg16_dist_test
docker push $vgg16_test_dockerhub_tag
docker logout

# fetch runner and install dependencies
echo "going to work with aws_runner"
if [ ! -d aws_runner ]; then
    echo "no aws_runner found, cloning one"
    git clone https://github.com/putcn/aws_runner.git
fi
cd aws_runner
git pull
cd ..
echo "going to install aws_runner dependencies"
pip install -r aws_runner/client/requirements.txt

echo "going to start testing"
# start aws testing
python ce_runner.py \
    --key_name aws_benchmark_us_east \
    --security_group_id sg-95539dff \
    --online_mode yes \
    --trainer_count 2 \
    --pserver_count 2 \
    --pserver_command "$training_command" \
    --trainer_command "$training_command" \
    --docker_image "$vgg16_test_dockerhub_tag"
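
The `--pserver_command` and `--trainer_command` values above are comma-separated `key:value` strings, and ce_runner.py overrides one key through `TrainCommand.update` before calling `unparse`. The helpers below are a hypothetical re-implementation for illustration only. They are not the aws_runner `TrainCommand` class, whose actual parsing rules are not shown in this diff.

```python
def parse_command(text):
    """Split 'k:v,k:v' into an ordered list of [key, value] pairs."""
    pairs = []
    for token in text.split(","):
        key, sep, value = token.partition(":")
        # Bare tokens such as 'python' carry no value.
        pairs.append([key, value if sep else None])
    return pairs


def update_command(pairs, overrides):
    """Overwrite existing keys in place and append keys that are new."""
    seen = set()
    for pair in pairs:
        if pair[0] in overrides:
            pair[1] = overrides[pair[0]]
            seen.add(pair[0])
    for key in sorted(set(overrides) - seen):
        pairs.append([key, overrides[key]])
    return pairs


def unparse_command(pairs):
    """Serialize the pairs back into the comma-separated form."""
    return ",".join(k if v is None else "%s:%s" % (k, v) for k, v in pairs)


pairs = parse_command("local:no,batch_size:128,num_passes:1")
solo_command = unparse_command(update_command(pairs, {"local": "yes"}))
```

Under these assumptions the single-trainer run would receive `local:yes,batch_size:128,num_passes:1`, while the cluster run keeps the original string.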