
batch evidence to an array, avoid JRuby enumerator
JRuby's Enumerator uses a native thread to service calls to #next,
which proves costly. Hundreds of threads are created (observed with
YourKit) when batch-creating evidence, due to the "each_slice(500)"
call on the enumerator.

This issue is logged in JRuby:
jruby/jruby#2577
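
For illustration only (not part of this commit), a minimal sketch of the
problematic shape: calling #each without a block returns an Enumerator,
which JRuby services with a native thread (see jruby/jruby#2577); the
persist call below is a hypothetical stand-in for the real Mongo/RDF
writes:

    evidence = BEL.evidence(io, type)

    # Pulling 500-item slices through the Enumerator is what spawned the
    # native threads observed in YourKit.
    evidence.each.lazy.each_slice(500) do |slice|
      persist(slice)  # hypothetical persistence call
    end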

The solution employed is to yield each evidence object directly to the
block and batch them into an array, 500 at a time. This should avoid
the OOM exception that was received:

java.lang.OutOfMemoryError: unable to create new native thread

Indeed, the thread count observed in YourKit was lower after this change.
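
A minimal sketch of the batching pattern described above (illustrative
only; save_batch is a hypothetical stand-in for the Mongo/RDF writes in
the diff below):

    EVIDENCE_BATCH = 500
    evidence_batch = []

    BEL.evidence(io, type).each do |ev|
      evidence_batch << ev

      if evidence_batch.size == EVIDENCE_BATCH
        save_batch(evidence_batch)  # hypothetical persistence call
        evidence_batch.clear
      end
    end

    # Flush the final partial batch.
    save_batch(evidence_batch) unless evidence_batch.empty?
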
Anthony Bargnesi committed Jan 12, 2016
1 parent fb58eee commit a515587
Showing 1 changed file with 32 additions and 17 deletions.
app/openbel/api/routes/datasets.rb: 49 changes (32 additions & 17 deletions)
@@ -16,14 +16,15 @@ class Datasets < Base
       include OpenBEL::Helpers
 
       DEFAULT_TYPE = 'application/hal+json'
 
       ACCEPTED_TYPES = {
         :bel => 'application/bel',
         :xml => 'application/xml',
         :xbel => 'application/xml',
         :json => 'application/json',
       }
 
+      EVIDENCE_BATCH = 500
+
       def initialize(app)
         super
@@ -233,33 +234,47 @@ def retrieve_dataset(uri)
         # Create dataset in RDF.
         @rr.insert_statements(void_dataset)
 
-        dataset = retrieve_dataset(void_dataset_uri)
+        dataset     = retrieve_dataset(void_dataset_uri)
+        dataset_id  = dataset[:identifier]
 
-        # Add slices of read evidence objects; save to Mongo and RDF.
-        BEL.evidence(io, type).each.lazy.each_slice(500) do |slice|
-          slice.map! do |ev|
-            # Standardize annotations from experiment_context.
-            @annotation_transform.transform_evidence!(ev, base_url)
+        # Add batches of read evidence objects; save to Mongo and RDF.
+        # TODO Add JRuby note regarding Enumerator threading.
+        evidence_batch = []
+        BEL.evidence(io, type).each do |ev|
+          # Standardize annotations from experiment_context.
+          @annotation_transform.transform_evidence!(ev, base_url)
+
+          ev.metadata[:dataset] = dataset_id
+          facets = map_evidence_facets(ev)
+          ev.bel_statement = ev.bel_statement.to_s
+          hash = ev.to_h
+          hash[:facets] = facets
+          # Create dataset field for efficient removal.
+          hash[:_dataset] = dataset_id
 
-            # Add filterable metadata field for dataset identifier.
-            ev.metadata[:dataset] = dataset[:identifier]
+          evidence_batch << hash
 
-            facets = map_evidence_facets(ev)
-            ev.bel_statement = ev.bel_statement.to_s
-            hash = ev.to_h
-            hash[:facets] = facets
+          if evidence_batch.size == EVIDENCE_BATCH
+            _ids = @api.create_evidence(evidence_batch)
 
-            # Create dataset field for efficient removal.
-            hash[:_dataset] = dataset[:identifier]
-            hash
+            dataset_parts = _ids.map { |object_id|
+              RDF::Statement.new(void_dataset_uri, RDF::DC.hasPart, object_id.to_s)
+            }
+            @rr.insert_statements(dataset_parts)
+
+            evidence_batch.clear
           end
+        end
 
-          _ids = @api.create_evidence(slice)
+        unless evidence_batch.empty?
+          _ids = @api.create_evidence(evidence_batch)
 
           dataset_parts = _ids.map { |object_id|
             RDF::Statement.new(void_dataset_uri, RDF::DC.hasPart, object_id.to_s)
           }
           @rr.insert_statements(dataset_parts)
+
+          evidence_batch.clear
         end
 
         status 201
