graphB timing

Datasets

| Dataset | # of Nodes | # of Edges | Link to Source |
| --- | --- | --- | --- |
| Wikipedia admin election data from SNAP | 7115 | 103689 | SNAP |
| Amazon Digital Music Reviews (5-Core) | 9109 | 64706 | UCSD |
| Amazon Instant Video Reviews (5-Core) | 6815 | 37126 | UCSD |
| Amazon Musical Instruments Reviews (5-Core) | 2329 | 10261 | UCSD |
| Highland Tribes Alliance Data | 16 | 116 | UCINET |

Runtime Comparison Between Apache Spark and Serial Processing on TXState's LEAP HPC

1000 spanning trees were generated for each experiment.
Spark/Slurm settings:

  >> num nodes: 4
  >> num processors: 28
  >> executor-cores: 4
  >> num-executors: 27
  >> executor-memory: 17g
  >> driver-memory: 17g
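
For reference, here is a minimal sketch (not graphB's actual launch script, which is not reproduced in this document) of how the settings above could be supplied to a PySpark session; the application name is illustrative, and the config keys are the standard Spark properties corresponding to the flags listed above.

```python
# Hedged sketch: one way to pass the Slurm/Spark settings above to PySpark.
from pyspark import SparkConf
from pyspark.sql import SparkSession

conf = (
    SparkConf()
    .setAppName("graphB-timing")             # illustrative name
    .set("spark.executor.cores", "4")        # executor-cores
    .set("spark.executor.instances", "27")   # num-executors
    .set("spark.executor.memory", "17g")     # executor-memory
    .set("spark.driver.memory", "17g")       # driver-memory
)

spark = SparkSession.builder.config(conf=conf).getOrCreate()
```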

Config Reasoning

  • To get good HDFS throughput, 4 cores were assigned per executor.
  • Number of cores available per node = 28.
  • Using 4 nodes, this gives 4 * 28 = 112 total available CPUs.
  • With 4 cores per executor, that yields 112 / 4 = 28 possible executors.
  • Leaving 1 executor for the Spark ApplicationManager gives 27 executors actually used.
  • Number of executors per node = 27 / 4 = 6 or 7.
  • Memory per executor = 128 GB / 7 = 18.28 GB.
  • Leaving 7% of memory for heap overhead: 18.28 - (0.07 * 18.28) ≈ 17 GB for executor memory and driver memory. This arithmetic is restated in the sketch after this list.
    Reasoning derived from this article
    LEAP specs here
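
A minimal restatement of that sizing arithmetic in code, using the node counts and memory values given above (28 cores and 128 GB per LEAP node):

```python
# Executor-sizing arithmetic from the reasoning above.
nodes = 4
cores_per_node = 28
cores_per_executor = 4            # chosen for good HDFS throughput
node_memory_gb = 128              # LEAP node memory, per the specs referenced above
heap_overhead_fraction = 0.07     # memory reserved for heap overhead

total_cores = nodes * cores_per_node                      # 112
possible_executors = total_cores // cores_per_executor    # 28
used_executors = possible_executors - 1                   # 27 (one for the ApplicationManager)
executors_per_node = used_executors / nodes               # 6.75 -> 6 or 7 per node
memory_per_executor_gb = node_memory_gb / 7               # ~18.28 GB on a 7-executor node
executor_memory_gb = memory_per_executor_gb * (1 - heap_overhead_fraction)  # ~17 GB

print(total_cores, used_executors, round(executor_memory_gb, 2))  # 112 27 17.01
```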

Total Runtime and Spark Speedup (times shown as days hh:mm:ss.ss)

| Dataset | Parallelization Type | Tree Type | Total Runtime | Total Preprocess Time | Total Process Time | Total Postprocess Time | Spark Speedup |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Wikipedia | Serial | Random | 1 days 05:24:49.02 | 0 days 00:01:20.46 | 1 days 04:42:54.81 | 0 days 00:40:33.75 | N/A |
| Wikipedia | Spark | Random | 0 days 01:45:33.56 | 0 days 00:01:31.85 | 0 days 01:39:22.02 | 0 days 00:04:39.69 | 16.72 |
| Wikipedia | Serial | Breadth | 1 days 01:10:31.99 | 0 days 00:01:20.57 | 1 days 00:29:46.08 | 0 days 00:39:25.35 | N/A |
| Wikipedia | Spark | Breadth | 0 days 01:36:20.45 | 0 days 00:01:19.11 | 0 days 01:30:23.54 | 0 days 00:04:37.80 | 15.68 |
| Amazon Digital Music | Serial | Breadth | 0 days 16:58:01.83 | 0 days 00:01:50.34 | 0 days 16:07:17.83 | 0 days 00:48:53.67 | N/A |
| Amazon Digital Music | Spark | Breadth | 0 days 01:03:15.30 | 0 days 00:01:53.04 | 0 days 00:57:59.48 | 0 days 00:03:22.77 | 16.09 |
| Amazon Instant Video | Serial | Breadth | 0 days 09:59:32.65 | 0 days 00:01:07.33 | 0 days 09:24:07.44 | 0 days 00:34:17.88 | N/A |
| Amazon Instant Video | Spark | Breadth | 0 days 01:01:44.11 | 0 days 00:01:02.97 | 0 days 00:56:29.30 | 0 days 00:04:11.84 | 9.71 |
| Amazon Musical Instruments | Serial | Breadth | 0 days 02:46:18.57 | 0 days 00:00:09.86 | 0 days 02:33:32.82 | 0 days 00:12:35.88 | N/A |
| Amazon Musical Instruments | Spark | Breadth | 0 days 00:10:02.44 | 0 days 00:00:10.01 | 0 days 00:08:53.69 | 0 days 00:00:58.74 | 16.56 |
| Highland Tribes | Serial | Breadth | 0 days 00:01:56.62 | 0 days 00:00:00.12 | 0 days 00:01:35.49 | 0 days 00:00:21.02 | N/A |
| Highland Tribes | Spark | Breadth | 0 days 00:00:26.86 | 0 days 00:00:00.07 | 0 days 00:00:13.11 | 0 days 00:00:13.68 | 4.34 |
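
The Spark Speedup column is the serial total runtime divided by the Spark total runtime for the same dataset and tree type. A minimal check, using pandas only to parse the duration strings shown in the table above:

```python
# Reproduce the Wikipedia/Random speedup entry from the table above.
import pandas as pd

serial_runtime = pd.Timedelta("1 days 05:24:49.02")  # Wikipedia, Random, Serial
spark_runtime = pd.Timedelta("0 days 01:45:33.56")   # Wikipedia, Random, Spark

speedup = serial_runtime.total_seconds() / spark_runtime.total_seconds()
print(round(speedup, 2))  # 16.72
```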

Summary

  • The average overall speedup using the Spark configuration above was 12.51. The only portion of the code that utilized Spark parallelization was the tree generation and balancing portion (process). This was done because it was the biggest bottleneck in overall time and could be parallelized simply; compare the serial Total Process Time against the Total Preprocess or Total Postprocess time for any of the datasets above. The large majority of the time was spent in the process step. A sketch of this parallelization pattern follows this list.
  • Improvements could be made by parallelizing preprocess and postprocess, though the gains would be less significant since they are already a small portion of the total runtime. Note, however, that as graph size scales, preprocess becomes a significant time factor and would need to be improved for graphs over 100K edges.
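
As a rough illustration of the pattern described in the first bullet, the process step can be distributed by parallelizing over tree indices. The function name below is a hypothetical stand-in for the tree generation and balancing code in process_general.py, and the partition count is illustrative.

```python
# Hedged sketch of distributing 1000 tree generate-and-balance jobs with Spark.
# generate_and_balance_tree is a placeholder, not graphB's real API.
from pyspark.sql import SparkSession

def generate_and_balance_tree(tree_index):
    # placeholder: build a spanning tree, balance it, write its h5 file,
    # and return a small summary for the driver
    return tree_index

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

# 27 executors * 4 cores = 108 task slots, per the config above
results = (
    sc.parallelize(range(1000), numSlices=108)
      .map(generate_and_balance_tree)
      .collect()
)
```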

File I/O Effects of graphB

  • graphB uses the h5 file format for all interim data storage between the preprocess, process, and postprocess steps. This provides convenient storage and later access to any data generated, but it also contributes to the long runtimes observed as dataset size scales. Every spanning tree that is generated is written to its own h5 file, so for 1000 trees, 1000 unique h5 files are created and a significant amount of data is written. During the postprocess step, in which status values are calculated, each of these h5 files must be accessed for its respective spanning tree info. A sketch of this I/O pattern follows this list.

  • The actual tree generation and balancing takes up a small portion of the total process time. The time effects can be seen in the detailed timing logs below; the time to do just the balancing, without reading or writing any h5 files, is recorded in the SUM TOTAL BALANCE TIME column.
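
A minimal sketch of that I/O pattern. The file names, dataset key, and per-tree payload are hypothetical and do not reflect graphB's actual h5 schema; the point is the one-file-per-tree write in process followed by re-reading every file in postprocess.

```python
# One h5 file per spanning tree during process, then every file re-read in postprocess.
import h5py
import numpy as np

NUM_TREES = 1000
NUM_VERTICES = 7115   # e.g. the Wikipedia graph size from the table above

# process step: each tree's results go to their own h5 file
for i in range(NUM_TREES):
    per_vertex = np.random.randint(0, 2, size=NUM_VERTICES)   # placeholder data
    with h5py.File(f"tree_{i}.h5", "w") as f:
        f.create_dataset("vertex_status", data=per_vertex)

# postprocess step: all 1000 files must be opened again to compute status values
totals = np.zeros(NUM_VERTICES)
for i in range(NUM_TREES):
    with h5py.File(f"tree_{i}.h5", "r") as f:
        totals += f["vertex_status"][:]

status = totals / NUM_TREES   # illustrative aggregate, not graphB's exact formula
```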

Total Balance Time as a percentage of Total Process Time (Serial)
>> Random: (21282.87 / 88186.08 ) * 100 = 24.13%  
>> Breadth: (34620.12 / 103374.81) * 100 = 33.49%  

Time in Seconds (Serial)

| Wiki Tree Type | GLOBAL TIME | TOTAL PREPROCESS TIME | TOTAL PROCESS TIME | TOTAL POSTPROCESS TIME | SUM TOTAL BALANCE TIME | AVG TOTAL BALANCE TIME | SUM BALANCE TIME | AVG BALANCE TIME | SUM TREE TIME | AVG TREE TIME | SUM COMPONENT LIST GEN TIME | AVG COMPONENT LIST GEN TIME | SUM COMPONENT STATS TIME | AVG COMPONENT STATS TIME | VERTEX DF TIME | CALC STATUS TIME | MATRIX CREATE TIME | SYM MATRIX CREATE TIME |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Random | 90632 | 81 | 88186 | 2365 | 21283 | 21 | 19734 | 20 | 1549 | 2 | 96 | 0 | 8 | 0 | 2362 | 41 | 0 | 4 |
| Breadth | 105889 | 80 | 103375 | 2434 | 34620 | 35 | 29023 | 29 | 5597 | 6 | 478 | 0 | 9 | 0 | 2430 | 40 | 0 | 5 |

Key for column names

  • GLOBAL TIME: The total runtime of graphB for the given config options.
  • TOTAL PREPROCESS TIME: The total runtime to execute everything in preprocess_general.py.
  • TOTAL PROCESS TIME: The total runtime to execute everything in process_general.py.
  • TOTAL POSTPROCESS TIME: The total runtime to execute everything in postprocess_general.py.
  • TOTAL BALANCE TIME (SUM/AVG): TREE TIME + BALANCE TIME, the total time to generate a tree and run the balancing algorithm.
  • BALANCE TIME (SUM/AVG): The time to run the balancing algorithm.
  • TREE TIME (SUM/AVG): The time to generate a spanning tree of the given type.
  • COMPONENT LIST GEN TIME (SUM/AVG): Time to generate the list of majority/minority members for a given spanning tree.
  • COMPONENT STATS TIME (SUM/AVG): Time to add the size of the majority/minority to the component list for a given spanning tree.
  • VERTEX DF TIME: Total time to create the vertex dataframe in postprocess (accesses all tree h5 files).
  • CALC STATUS TIME: Time to calculate status values (does not access h5 files; uses the vertex dataframe for data).
  • MATRIX CREATE TIME: Time to turn the input edges and users CSVs into a CSR matrix during preprocess.
  • SYM MATRIX CREATE TIME: Time to generate the symmetric matrix in preprocess and save it to h5.

Conclusion

  • Due to timing bottlenecks from reading and writing large amounts of data to h5 files, graphB hits a practical ceiling at graphs of around 100K edges. This was shown in the timing results above; for example, the Wikipedia serial total runtime was over an entire day. This bottleneck was mitigated by using Apache Spark to parallelize the many h5 file operations, which achieved an average speedup of 12.51 over serial processing.
  • To handle larger graphs, graphB would also need the preprocess portion of the code parallelized, as it consumes large amounts of time on graphs beyond the 100K edge mark. Those results are not shown here, but can be independently verified by running the code on larger datasets.