Skip to content

davidmoten/bigsort

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

bigsort


Sorts an arbitrarily large stream by serializing sorted chunks to temporary resources (user-definable) and merging those resources.

Status: pre-alpha

Note: This project did not get completed. Use https://github.com/davidmoten/big-sorter instead (with RxJava api).

Features

  • uses RxJava
  • Configurable concurrency

How to sort lines of text

Simple string lines sorting example:

import rx.Observable;
import rx.Scheduler;
import com.github.davidmoten.bigsort.BigSort;

Observable<String> source = Observable.just("c", "b", "a");

int maxToSortInMemoryPerThread = 3;
int maxTempFiles = 2;

// sort in parallel
Scheduler scheduler = Schedulers.computation();

Observable<String> sorted = 
        BigSort.sort(source, maxToSortInMemoryPerThread,
                     maxTempFiles, scheduler);

How to sort Serializable & Comparable objects

int n = 1000;
//make a descending list of integers
Observable<Integer> source = Observable.range(1,n).map(i -> n - i + 1);
int maxToSortInMemoryPerThread = 100;
int maxTempFiles = 5;
Scheduler scheduler = Schedulers.computation();
Observable<Integer> sorted = 
        BigSort.sort(source, maxToSortInMemoryPerThread,
                             maxTempFiles, scheduler); 

More generic example

A more generic example that sorts a stream of Integer, also using files but some other resource type (for example database tables or S3) could be used provided writer, reader, resourceFactory, resourceDisposer are supplied for that resource type.

Below is an extract from BigSortTest.java:

Comparator<Integer> comparator = createComparator();
Func2<Observable<Integer>, File, Observable<File>> writer = createWriter();
Func1<File, Observable<Integer>> reader = createReader();
Func0<File> resourceFactory = BigSort.createFileResourceFactory();
Action1<File> resourceDisposer = BigSort.createFileResourceDisposer();
// sort synchronously
Scheduler scheduler = Schedulers.immediate();

Observable<Integer> sorted = 
        BigSort.sort(source, comparator, writer, reader,
					resourceFactory, resourceDisposer, 
					maxToSortInMemoryPerThread, maxTempResources,
					scheduler);

Benchmarks

Time to sort a list of integers using ObjectOutputStream and ObjectInputStream for serialization (probably pretty slow!) is below. Max number to sort in memory per thread was 100K and max temporary files was 10 using an SSD and 6 cores on a Xeon ES-1650.

Number of integers Time (s)
109 413
108 19.4
107 0.7
106 0.27

About

Uses RxJava to sort an arbitrarily large stream by serializing to temporary files and merging

Resources

License

Stars

Watchers

Forks

Packages

No packages published

Languages