
A Simple Test of AdaBoostPL



1. Hardware

master × 1

CPU: Intel Xeon E5645 @ 2.40GHz × 2
MEM: 32GB DDR3 ECC
DISK: 500GB SATA

workers × 20

CPU: Intel Xeon E5620 @ 2.40GHz
MEM: 16GB DDR3 ECC
DISK: 500GB SATA

network

Gigabit Ethernet switches

2. Platform

OS: CentOS 6.3
MapReduce: Cloudera Hadoop CDH4

3. Data sets

We generated the synthetic data sets randomly with our own scripts.

| Data set | No. of instances | No. of attributes | Size on disk |
|----------|------------------|-------------------|--------------|
| d1       | 500,000          | 100               | 688MB        |
| d2       | 1,000,000        | 100               | 1.4GB        |
| d3       | 1,500,000        | 100               | 2.1GB        |
| d4       | 2,000,000        | 100               | 2.1GB        |
| d5       | 1,000,000        | 500               | 9.4GB        |
| d6       | 1,000,000        | 1000              | 14GB         |
| d7       | 10,000,000       | 100               | 14GB         |

4. Testing

All AdaBoost tests use 100 boosting iterations.

4.1 Automatic splitting

The simplest way to use Hadoop MapReduce is to let HDFS split the datasets automatically; in our tests the maximum split size is 64MB.

| Data set | No. of maps | Total time (s) |
|----------|-------------|----------------|
| d1       | 11          | 165            |
| d2       | 22          | 165            |
| d3       | 33          | 165            |
| d4       | 43          | 169            |
| d5       | 108         | 249            |
| d6       | 215         | 345            |
| d7       | 215         | 494            |
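
The map counts above follow from the 64MB cap: with FileInputFormat, each input file yields roughly ceil(file size / max split size) map tasks, e.g. ceil(688MB / 64MB) = 11 for d1. Below is a minimal sketch of setting the cap with the mapreduce API; the class and job names are illustrative, not from this test's actual code.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class AutoSplitConfig {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "adaboost-pl-auto"); // name is illustrative
        // Cap each input split at 64MB, matching the split size used in this test.
        FileInputFormat.setMaxInputSplitSize(job, 64L * 1024 * 1024);
        // Expected map tasks per file: roughly ceil(fileSize / 64MB),
        // e.g. ceil(688MB / 64MB) = 11 maps for d1, as in the table above.
    }
}
```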

4.2 Manual splitting

Next, we divide the d1, d2, d3, and d4 datasets into 10, 20, 30, and 40 splits (map tasks).
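
One way to force a given split count (a minimal sketch, not necessarily how this test was run; the helper and job name are illustrative) is to derive the maximum split size from the file size, since the old API's setNumMapTasks is only a hint to the framework:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class ManualSplitConfig {
    // Max split size that cuts a file of the given size into about k splits.
    static long splitSizeFor(long fileSizeBytes, int k) {
        return (fileSizeBytes + k - 1) / k; // ceil(fileSizeBytes / k)
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "adaboost-pl-manual"); // name is illustrative
        long d1Bytes = 688L * 1024 * 1024; // d1 size from section 3
        // 10 splits -> 10 map tasks, as in the first row of the d1 table below.
        FileInputFormat.setMaxInputSplitSize(job, splitSizeFor(d1Bytes, 10));
    }
}
```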

d1

| No. of maps | Total map time (s) | Per-map time (s) | No. of reduces | Reduce time (s) |
|-------------|--------------------|------------------|----------------|-----------------|
| 10          | 1718               | 172              | 1              | 19              |
| 20          | 1631               | 81               | 1              | 19              |
| 30          | 1514               | 50               | 1              | 14              |
| 40          | 1503               | 38               | 1              | 9               |

d2

| No. of maps | Total map time (s) | Per-map time (s) | No. of reduces | Reduce time (s) |
|-------------|--------------------|------------------|----------------|-----------------|
| 10          | 3781               | 378              | 1              | 14              |
| 20          | 3398               | 170              | 1              | 14              |
| 30          | 3181               | 106              | 1              | 14              |
| 40          | 3123               | 78               | 1              | 14              |

d3

| No. of maps | Total map time (s) | Per-map time (s) | No. of reduces | Reduce time (s) |
|-------------|--------------------|------------------|----------------|-----------------|
| 10          | 5675               | 568              | 1              | 19              |
| 20          | 5169               | 258              | 1              | 25              |
| 30          | 5091               | 170              | 1              | 14              |
| 40          | 4686               | 118              | 1              | 9               |

d4

| No. of maps | Total map time (s) | Per-map time (s) | No. of reduces | Reduce time (s) |
|-------------|--------------------|------------------|----------------|-----------------|
| 10          | timeout(*)         | n/a              | 1              | n/a             |
| 20          | 7532               | 753              | 1              | 19              |
| 30          | 6893               | 345              | 1              | 14              |
| 40          | 6743               | 169              | 1              | 14              |

(*) More than 20 minutes; needs to be fixed.

Single worker

For comparison, we run the whole d1, d2, d3, and d4 datasets on a single worker.

| Data set | Time (s) |
|----------|----------|
| d1       | 2133     |
| d2       | 5156     |
| d3       | 8457     |
| d4       | 11965    |

4.3 Speedup

Speedup = Ts / Tp

Ts: execution time on a single worker for the whole dataset

Tp: execution time on p workers for the whole dataset
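
As a worked example (a sketch assuming the total times of the automatic-split runs in 4.1 are the parallel wall-clock times Tp, with Ts taken from the single-worker table):

```java
public class SpeedupExample {
    static double speedup(double ts, double tp) {
        return ts / tp; // single-worker time over p-worker time
    }

    public static void main(String[] args) {
        // d1: Ts = 2133s (single-worker table), Tp = 165s (section 4.1)
        System.out.printf("d1 speedup: %.1f%n", speedup(2133, 165)); // ~12.9 on 20 workers
    }
}
```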

(Figure: speedup curves. The red points are speedups for the datasets split automatically by HDFS.)

4.4 Scaleup

Scaleup = Ts / Tp

Ts: execution time on a single worker for a single split

Tp: execution time on p workers for the whole dataset (p splits)

Ideal scaleup stays close to 1: as the dataset grows with the number of workers, each worker keeps a constant amount of work.
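
The computation mirrors the speedup example; since single-worker per-split times are not listed above, the inputs here are hypothetical:

```java
public class ScaleupExample {
    static double scaleup(double ts, double tp) {
        return ts / tp; // single-worker time for one split over p-worker time for p splits
    }

    public static void main(String[] args) {
        // Hypothetical timings, for illustration only.
        System.out.printf("scaleup: %.2f%n", scaleup(172, 191)); // ~0.90; ideal is 1.0
    }
}
```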

(Figure: scaleup curves.)