
A Simple Test of AdaBoostPL



1. Hardware

master × 1

CPU: Intel Xeon E5645 @ 2.40GHz × 2
MEM: 32GB DDR3 ECC
DISK: 500GB SATA

workers × 20

CPU: Intel Xeon E5620 @ 2.40GHz
MEM: 16GB DDR3 ECC
DISK: 500GB SATA

network

Gigabit Ethernet switches

2. Platform

OS: CentOS 6.3
MapReduce: Cloudera Hadoop CDH4

3. Data sets

We generated the synthetic data sets randomly with our own scripts.

| Data set | No. of instances | No. of attributes | Size on disk |
|----------|------------------|-------------------|--------------|
| d1       | 500,000          | 100               | 688MB        |
| d2       | 1,000,000        | 100               | 1.4GB        |
| d3       | 1,500,000        | 100               | 2.1GB        |
| d4       | 2,000,000        | 100               | 2.1GB        |
| d5       | 1,000,000        | 500               | 9.4GB        |
| d6       | 1,000,000        | 1000              | 14GB         |
| d7       | 10,000,000       | 100               | 14GB         |

4. Testing

All AdaBoost tests use 100 boosting iterations.

4.1 Automatic splitting

The simplest way to use Hadoop MapReduce is to let HDFS split the datasets automatically; in our tests the maximum split size is 64MB.

| Data set | No. of maps | Total time (s) |
|----------|-------------|----------------|
| d1       | 11          | 165            |
| d2       | 22          | 165            |
| d3       | 33          | 165            |
| d4       | 43          | 169            |
| d5       | 108         | 249            |
| d6       | 215         | 345            |
| d7       | 215         | 494            |
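
The map counts above follow from the 64MB cap: with FileInputFormat, each input file yields roughly ceil(file size / max split size) map tasks, e.g. ceil(688MB / 64MB) = 11 for d1. Below is a minimal sketch of setting the cap with the mapreduce API; the class and job names are illustrative, not from this test's actual code.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class AutoSplitConfig {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "adaboost-pl-auto"); // name is illustrative
        // Cap each input split at 64MB, matching the split size used in this test.
        FileInputFormat.setMaxInputSplitSize(job, 64L * 1024 * 1024);
        // Expected map tasks per file: roughly ceil(fileSize / 64MB),
        // e.g. ceil(688MB / 64MB) = 11 maps for d1, as in the table above.
    }
}
```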

4.2 Manual splitting

Next, we divide the d1, d2, d3, and d4 datasets into 10, 20, 30, and 40 splits (map tasks).
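
One way to force a given split count (a minimal sketch, not necessarily how this test was run; the helper and job name are illustrative) is to derive the maximum split size from the file size, since the old API's setNumMapTasks is only a hint to the framework:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class ManualSplitConfig {
    // Max split size that cuts a file of the given size into about k splits.
    static long splitSizeFor(long fileSizeBytes, int k) {
        return (fileSizeBytes + k - 1) / k; // ceil(fileSizeBytes / k)
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "adaboost-pl-manual"); // name is illustrative
        long d1Bytes = 688L * 1024 * 1024; // d1 size from section 3
        // 10 splits -> 10 map tasks, as in the first row of the d1 table below.
        FileInputFormat.setMaxInputSplitSize(job, splitSizeFor(d1Bytes, 10));
    }
}
```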

d1

| No. of maps | Total map time (s) | Per-map time (s) | No. of reduces | Reduce time (s) |
|-------------|--------------------|------------------|----------------|-----------------|
| 10          | 1718               | 172              | 1              | 19              |
| 20          | 1631               | 81               | 1              | 19              |
| 30          | 1514               | 50               | 1              | 14              |
| 40          | 1503               | 38               | 1              | 9               |

d2

| No. of maps | Total map time (s) | Per-map time (s) | No. of reduces | Reduce time (s) |
|-------------|--------------------|------------------|----------------|-----------------|
| 10          | 3781               | 378              | 1              | 14              |
| 20          | 3398               | 170              | 1              | 14              |
| 30          | 3181               | 106              | 1              | 14              |
| 40          | 3123               | 78               | 1              | 14              |

d3

| No. of maps | Total map time (s) | Per-map time (s) | No. of reduces | Reduce time (s) |
|-------------|--------------------|------------------|----------------|-----------------|
| 10          | 5675               | 568              | 1              | 19              |
| 20          | 5169               | 258              | 1              | 25              |
| 30          | 5091               | 170              | 1              | 14              |
| 40          | 4686               | 118              | 1              | 9               |

d4

| No. of maps | Total map time (s) | Per-map time (s) | No. of reduces | Reduce time (s) |
|-------------|--------------------|------------------|----------------|-----------------|
| 10          | timeout(*)         | n/a              | 1              | n/a             |
| 20          | 7532               | 753              | 1              | 19              |
| 30          | 6893               | 345              | 1              | 14              |
| 40          | 6743               | 169              | 1              | 14              |

(*) More than 20 minutes; needs to be fixed.

Single worker

For comparison, we run the whole d1, d2, d3, and d4 datasets on a single worker.

| Data set | Time (s) |
|----------|----------|
| d1       | 2133     |
| d2       | 5156     |
| d3       | 8457     |
| d4       | 11965    |

4.3 Speedup

Speedup = Ts / Tp

Ts: execution time on a single worker for the whole dataset

Tp: execution time on p workers for the whole dataset
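
As a worked example (a sketch assuming the total times of the automatic-split runs in 4.1 are the parallel wall-clock times Tp, with Ts taken from the single-worker table):

```java
public class SpeedupExample {
    static double speedup(double ts, double tp) {
        return ts / tp; // single-worker time over p-worker time
    }

    public static void main(String[] args) {
        // d1: Ts = 2133s (single-worker table), Tp = 165s (section 4.1)
        System.out.printf("d1 speedup: %.1f%n", speedup(2133, 165)); // ~12.9 on 20 workers
    }
}
```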

(Figure: speedup curves. The red points are speedups for the datasets split automatically by HDFS.)

4.4 Scaleup

Scaleup = Ts / Tp

Ts: execution time on a single worker for a single split

Tp: execution time on p workers for the whole dataset (p splits)

Ideal scaleup stays close to 1: as the dataset grows with the number of workers, each worker keeps a constant amount of work.
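
The computation mirrors the speedup example; since single-worker per-split times are not listed above, the inputs here are hypothetical:

```java
public class ScaleupExample {
    static double scaleup(double ts, double tp) {
        return ts / tp; // single-worker time for one split over p-worker time for p splits
    }

    public static void main(String[] args) {
        // Hypothetical timings, for illustration only.
        System.out.printf("scaleup: %.2f%n", scaleup(172, 191)); // ~0.90; ideal is 1.0
    }
}
```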

(Figure: scaleup curves.)